Setting Up A Subject List

Overview

For C-PAC to run an analysis, it must have at least one pipeline configuration file and one subject list. The creation of a pipeline configuration file is treated in the next section. This section will deal with how to create a C-PAC subject list, which specifies a set of subjects, where to find each subject’s associated image files, and (optionally) scan parameter information for use during Slice Timing Correction. There are two ways to set up such a list:

  • Using a text editor (useful for servers where using the C-PAC GUI is not possible or impractical)
  • Using the subject list builder interface in the C-PAC GUI

The first method requires you to create a data configuration file, which encodes settings that you would normally enter through the subject list builder GUI (such as a template file path used to grab T1 anatomical data; see Defining Anatomical and Functional File Path Templates below). Both methods will result in the creation of a list that can be re-used for future runs with different pipelines.

Using a Text Editor

Data configuration files are stored as YAML files, which are text files meant to encode data in a human-readable markup (see here for more details). Each of the parameters used by C-PAC to assemble your subject list can be specified as key-value pairs, so a data configuration YAML might have multiple lines of the form:

key : value

An example of such a YAML file can be found here. You can create your own configuration file using your favorite text editor and the table of keys and their potential values in the table below.

Key Description Potential Values
dataFormat Whether or not the dataset is organized according to BIDS or a custom format. A string- can be either “BIDS” or “Custom”.
bidsBaseDir The base directory for the BIDS dataset (if the data is organized according to BIDS). A path.
anatomicalTemplate A file path template for anatomical scans (see below for how to define templates). A path template.
functionalTemplate A file path template for functional scans (see below for how to define templates). A path template.
subjectList An optional list of subjects to include (if you wish to include only a subset of subjects whose scans match the templates). A value of ‘None’ means all subjects will be run. This list can also be read from a text file. A list of strings with subject IDs (e.g., [‘sub-1’,’sub-2’]), a path to a text file with subject IDs separated by new lines, or ‘None’.
exclusionSubjectList An optional list of subjects to exclude (if you wish to exclude a subset of subjects whose scans match the templates). A value of ‘None’ means all subjects will be run. This list can also be read from a text file. A list of strings with subject IDs (e.g., [‘sub-1’,’sub-2’]), a path to a text file with subject IDs separated by new lines, or ‘None’.
siteList An optional list of sites to include (if you wish to include only a subset of sites whose scans match the templates). A value of ‘None’ means all sites will be run. This list can also be read from a text file. A list of strings with site IDs (e.g., [‘Lexington’,’Albany’]), a path to a text file with site IDs separated by new lines, or ‘None’.
scanParametersCSV Path to a CSV specifying the slice time acquisition parameters for scans. If set to ‘None’, the NifTI headers will be used. Instructions for creating such a file can be found here. A path or ‘None’.
outputSubjectListLocation The directory where the subject list will be generated. A path.
subjectListName A name for the subject list. A list containing a string (e.g.,[‘ABIDE’]).
awsCredentialsFile A credential file for downloading data stored using Amazon’s Simple Storage Service. A path.

Generating Subject Lists with the Data Configuration File

When you are done setting up the data configuration file, navigate in a terminal to the directory where you would like to store your subject list. If you are not running slice timing correction, or all subjects within a site have the same acquisition order, type the following command:

cpac_setup.py /path/to/data_config

If you are running slice timing correction and subjects within a site have differing acquisition orders open iPython by typing ipython and run the following command from the iPython prompt:

import CPAC
CPAC.utils.extract_data_multiscan_params.run('/path/to/data_config.yml')

Either of these methods will produce the subject list in your current directory, as well as two other files needed to run group-level analyses:

  • CPAC_subject_list_<name>.yml - The subject list used when running preprocessing and individual-level analyses. Contains subject IDs and paths to their associated data files. Subject lists are also stored as YAMLs.
  • template_phenotypic.csv - A template phenotypic file for use with the group-level analysis builder.
  • subject_list_group_analysis.txt - The subject list used when running group-level analyses.

Note: An alternate version of the phenotype template and the group analysis subject list will be created that are pre-formatted for use in repeated measures or within-subject group-level analysis.

For INDI releases, we have available preconfigured data configuration files, to generate subject lists. You will need to make sure that the file templates have been modified to match the specific location on your system for these files.

Using the GUI

First, open the C-PAC GUI by entering the command cpac_gui in a terminal window. You will be presented with the main C-PAC window.

_images/main_gui.png

Now, to generate a subject list file, click on the New button next to the Subject Lists box. This will bring up the following dialog window, which will allow you to define the parameters of the data configuration YAML:

_images/subject_list_gui.png
  1. Data format - [BIDS, Custom]: Select if data is organized using the BIDS standard or a custom format (see Defining Anatomical and Functional File Path Templates below for more information on forming templates for custom data organization).
  2. BIDS Base Directory - [path]: The base directory of BIDS-organized data if you are using BIDS.
  3. Anatomical File Path Template - [text]: A file path template for anatomical scans (see below for how to define templates). Used when a custom data organization scheme is enabled.
  4. Functional File Path Template - [text]: A file path template for functional scans (see below for how to define templates). Used when a custom data organization scheme is enabled.
  5. Subjects to Include (Optional) - [text/path]: An optional comma-separated list of subjects to include (if you wish to include only a subset of subjects whose scans match the templates). A value of ‘None’ means all subjects will be run. This list can also be read from a text file whose path is given in this field.
  6. Subjects to Exclude (Optional) - [text/path]: An optional comma-separated list of subjects to exclude (if you wish to exclude only a subset of subjects whose scans match the templates). A value of ‘None’ means all subjects will be run. This list can also be read from a text file whose path is given in this field.
  7. Sites to Include (Optional) - [text/path]: An optional comma-separated list of sites to include (if you wish to include only a subset of sites whose scans match the templates). A value of ‘None’ means all sites will be run. This list can also be read from a text file whose path is given in this field.
  8. Scan Parameters File (Optional) - [path]: Path to a CSV specifying the slice time acquisition parameters for scans. If set to ‘None’, these parameters will either be defined by the NifTI headers or by an explicit slice order specified in the pipeline configuration builder. Instructions for creating such a file can be found here. Note that a scan parameters file will also override the settings used in the slice timing screen of the pipeline configuration builder.
  9. AWS Credentials File (Optional)- [path]: Required if downloading data from a non-public S3 bucket on Amazon Web Services instead of using local files.
  10. Output Directory - [path]: The directory where the subject list will be generated.
  11. Subject List Name - [text]: A name for the subject list.
  12. Multiscan Data [checkbox]: Check this box only if the scans have different slice timing information.

Defining Anatomical and Functional File Path Templates

C-PAC has been designed to process large, complex data sets, and supports processing of multiple scans per participant, multiple scan sessions, and multiple data acquisition sites with differing scan acquisition parameters. Because the file directory structures resulting from such data sets can be complex, it is necessary to define the location of the image files to be processed for each participant. If you are using the BIDS directory structure, the location of image files is defined via the BIDS specification, and C-PAC is able to automatically find all of the files needed to run the pipeline. Otherwise, you will need to manually define path templates using the conventions described below.

File path templates are defined by adding placeholder variables nested within curly brackets to the file path of each type of image file to be processed (anatomical and functional). The possible placeholders are site, participant and session. session is optional, while the other two placeholders are not. To illustrate, if the full paths to the image files for a hypothetical participant_1 were:

/home/data/site_1/participant_1/anat/mprage.nii.gz
/home/data/site_1/participant_1/func/rest.nii.gz

Then the anatomical and functional file path templates would would be:

Anatomical Template:  /home/data/{site}/{participant}/anat/mprage.nii.gz
Functional Template:  /home/data/{site}/{participant}/func/rest.nii.gz

It should be noted that C-PAC currently requires participant directories to be within a site level directory. If your data set does not contain scans from multiple acquisition sites, we recommend creating a dummy site_1 directory and placing participant files inside this directory. This peculiarity will be fixed in future versions of C-PAC.

In cases where file paths differ in more than just the site, session and subject directories, asterisks can be used. This is useful for data sets containing multiple sessions in a single scan date or other subject-specific information in file or folder names. You may use as many asterisks as necessary to to define your file path templates. For example, if the paths to functional image files for a subject were:

/home/data/site_1/subject_1/scan_1/session_1/rest_1.nii.gz
/home/data/site_1/subject_1/scan_1/session_2/rest_2.nii.gz
/home/data/site_1/subject_1/scan_2/session_1/rest_1.nii.gz
/home/data/site_1/subject_1/scan_2/session_2/rest_2.nii.gz

The file path template would be:

/home/data/{site}/{participant}/scan_*/{session}/rest_*.nii.gz

Example File Path Templates

Here are the file path templates used for the 1000 Functional Connectomes data release, as well as an illustration of the directory structure used for the release:

Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage_anonymized.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/func/rest.nii.gz
_images/fcon_structure.png

Another example is the file structure used by the ABIDE and ADHD-200 releases:

Anatomical Template:  /path/to/data/{site}/{participant}/{session}/anat_*/mprage.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/{session}/rest_*/rest.nii.gz
_images/abide_adhd_structure.png

A final example is the file structure used by the Enhanced Nathan Kline Institute-Rockland Sample:

Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/{session}/RfMRI_*/rest.nii.gz
_images/nki-rs_template.png

Users experiencing difficulties defining file path templates may want to re-organize their data to match one of the examples above. If you manually define a file path template and encounter an error when attempting to generate participant lists, please contact us and we will be happy to help.

Subject List YAML Fields

The subject list builder GUI or the command line utility will produce a YAML file containing all of the subjects and various properties associated with that subject, such as its ID, session number, the location of its resting-state/functional and anatomical scans. Before each subject definition there is a single line with a dash, which indicates that start of the property defintions. Subject properties are indented under this dash. To illustrate, see the sample subject definition below:

-
    subject_id: 'subj_1'
    unique_id: 'session_1'
    anat: '/data/subj_1/session_1/anat_1/mprage.nii.gz'
    rest:
      rest_1_rest: '/data/subj_1/session_1/rest_1/rest.nii.gz'
      rest_2_rest: '/data/subj_1/session_1/rest_2/rest.nii.gz'
    scan_parameters:
        tr: '2.5'
        acquisition: 'seq+z'
        reference: '24'
        first_tr: ''
        last_tr: ''

Note that more than one resting state scan is defined under the rest key (i.e., multiple sessions), and that individual scan parameters can be defined to override the settings used in the C-PAC pipeline configuration GUI. Be careful not to name the multiple sessions under the rest key using dashes or periods, as this input will not be parsed by C-PAC and may cause errors.