Setting Up A Data Configuration File (Participant/Subject List)

Overview

C-PAC’s Configuration YAML Files

C-PAC requires at least one pipeline configuration file and one data configuration file (also known as the participant list or subject list) in order to run an analysis. These configuration files use the YAML format, which stores settings as key: value pairs, much like a dictionary. This section covers how to set up the data configuration file.

The Data Configuration (Participant List)

The data configuration file is essentially a list of file paths to anatomical and functional MRI scans, keyed by their unique IDs and listed with any additional information as necessary. This file can be generated either through the GUI or from the terminal. Both methods are explained below.

Creating the Data Configuration File via the GUI

The C-PAC data configuration builder window accepts information about where to find your data, and allows you to customize what gets included in the final list. Once you’re done, it generates a data settings YAML file, which saves your preferences, so that you can easily edit and re-generate your data configuration YAML file later on.

_images/subject_list_gui.png
  1. Data format - [BIDS, Custom]: Whether or not the data is organized in accordance with the BIDS specification. More details below.
  2. BIDS Base Directory - [path]: The base directory of the BIDS-organized data, if you are using BIDS.
  3. Anatomical File Path Template - [text]: If the data is NOT in BIDS format, you can provide a file path template describing the anatomical scans here. More details below.
  4. Functional File Path Template - [text]: If the data is NOT in BIDS format, you can provide a file path template describing the functional scans here. More details below.
  5. Save Config Files Here - [path]: The directory where you want the data configuration builder to save both the data settings file (these configured options) and the data configuration file (the list of input data to be provided to C-PAC).
  6. Participant List Name - [text]: The name/label for your data configuration and data settings files.
  7. (Optional) AWS Credentials File - [path]: Required if downloading data from a non-public S3 bucket on Amazon Web Services (AWS). This usually takes the form of a CSV file.
  8. (Optional) Scan Parameters File - [path]: Path to a CSV file specifying the slice time acquisition parameters for scans. If set to ‘None’, these parameters will be defined either by the NIfTI headers or by an explicit slice order specified in the pipeline configuration builder. Instructions for creating this CSV file can be found here. Note: If your data is in BIDS format, the data configuration builder will read the scan parameters described in the data’s affiliated JSON file(s), if they exist, and a scan parameters CSV file is not required.
  9. (Optional) Field Map Phase File Path Template - [text]: If you are running field map-based distortion correction, AND your data is not in BIDS format, provide the file path template to your phase files here. If your data is in BIDS format, the data configuration builder will find these files automatically.
  10. (Optional) Field Map Magnitude File Path Template - [text]: If you are running field map-based distortion correction, AND your data is not in BIDS format, provide the file path template to your magnitude difference files here. If your data is in BIDS format, the data configuration builder will find these files automatically.
  11. (Optional) Include: Subjects - [text/path]: List the participant IDs to include, to have only those participants included in the list. Either enter it here (ex. “1001, 1002, 1007, ..”), or enter the file path of a text file containing each participant ID on its own line.
  12. (Optional) Exclude: Subjects - [text/path]: The same as above, except to exclude the participants you list here. Useful for when you only need a few dropped from the list of many.
  13. (Optional) Include: Sites - [text/path]: Which sites to include - can be a list or a text file, as described above.
  14. (Optional) Exclude: Sites - [text/path]: Which sites to exclude - can be a list or a text file, as described above.
  15. (Optional) Include: Sessions - [text/path]: Which sessions to include - can be a list or a text file, as described above.
  16. (Optional) Exclude: Sessions - [text/path]: Which sessions to exclude - can be a list or a text file, as described above.
  17. (Optional) Include: Series - [text/path]: Which series to include - can be a list or a text file, as described above.
  18. (Optional) Exclude: Series - [text/path]: Which series to exclude - can be a list or a text file, as described above.
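
The inclusion and exclusion fields above accept either a comma-separated list or the path to a text file with one ID per line. The following is a minimal sketch of how such a field could be interpreted. The helper names parse_id_list and apply_filters are hypothetical and are not part of C-PAC, and the order of operations (include first, then exclude) is an assumption made for illustration.

```python
import os


def parse_id_list(value):
    """Return a list of IDs from a comma-separated string or a text file path.

    Hypothetical helper for illustration only; not C-PAC's implementation.
    """
    if value is None or str(value).strip() in ("", "None"):
        return []
    value = str(value).strip()
    if os.path.isfile(value):
        # Text file: one ID per line, blank lines skipped.
        with open(value) as f:
            return [line.strip() for line in f if line.strip()]
    # Otherwise treat the value as a comma-separated list.
    return [item.strip() for item in value.split(",") if item.strip()]


def apply_filters(ids, include=None, exclude=None):
    """Keep only IDs in `include` (if given), then drop any IDs in `exclude`."""
    include_ids = parse_id_list(include)
    exclude_ids = set(parse_id_list(exclude))
    kept = [i for i in ids if not include_ids or i in include_ids]
    return [i for i in kept if i not in exclude_ids]


if __name__ == "__main__":
    all_ids = ["1001", "1002", "1003", "1007"]
    print(apply_filters(all_ids, include="1001, 1002, 1007", exclude="1002"))
```

The same logic would apply at each data level (participants, sites, sessions, series), since the fields share the same list-or-file convention.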

Continue below for some example use cases.

Creating the Data Configuration File from the terminal

You can configure the settings explained above in the data settings YAML file, then use the cpac_data_config_setup.py script to generate your data configuration file.

If you don’t already have a data settings YAML file, either get the template from our GitHub repo, or generate one by running:

cpac_data_config_setup.py --generate_template

Once your data settings file is ready, you can generate your data configuration file by running:

cpac_data_config_setup.py --data_settings_file /path/to/data_settings.yml

Continue below for some example use cases.

My Data is in BIDS Format

A full description of the BIDS data organization specification can be found at bids.neuroimaging.io.

This is the simplest option. As the data is in BIDS format, the C-PAC data configuration builder will know where to find all of the input files, the scan parameters (if available), site information, and field map files (if applicable). The inclusion and exclusion options for the different data levels (participant, site, etc.) work as usual.

Using the GUI

Select “BIDS” as your Data Format, and specify where to save the configuration files and the participant list name. Then, provide the BIDS Base Directory, which is the top-most directory level within which your BIDS-organized data set is stored. Click Generate Data Config in the bottom-right corner to generate the data configuration, and to also save this setup into a data settings file. If you only want to save the settings to generate the data configuration for later, click Save Preset.

Using the cpac_data_config_setup.py script

In the data settings file, populate these fields:

dataFormat:                  ['BIDS']
bidsBaseDir:                 /path/to/BIDS/directory
outputSubjectListLocation:   /save/configs/here
subjectListName:             data_config_name

You can also fill in the AWS credentials file field, and the inclusion and exclusion fields, as needed.

Once your data settings file is ready, generate your data configuration file by running:

cpac_data_config_setup.py --data_settings_file /path/to/data_settings.yml

My Data is stored in a custom layout

The C-PAC data configuration builder can handle a wide range of directory layouts, but it can only do so seamlessly if all of your data follows the same layout. If your input files are arranged in different ways, generate a separate data configuration file for each layout, and then manually append one to the end of the other in a text editor.
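
The manual merge can also be scripted. The sketch below assumes that each data configuration file is a plain YAML list whose entries start with a dash at the top level (as in the sample shown later in this section), so appending the text of one file to the other yields one longer list. The function name concatenate_data_configs and the file paths are hypothetical.

```python
def concatenate_data_configs(first_path, second_path, out_path):
    """Write the entries of two data configuration files into one file.

    Assumes both inputs are top-level YAML lists, so plain text
    concatenation produces a single valid list.
    """
    with open(first_path) as f:
        first = f.read()
    with open(second_path) as f:
        second = f.read()
    # Ensure the first file ends with a newline so entries do not run together.
    if first and not first.endswith("\n"):
        first += "\n"
    with open(out_path, "w") as f:
        f.write(first + second)


# Example call (paths are placeholders):
# concatenate_data_configs("layout_a.yml", "layout_b.yml", "combined.yml")
```

It is worth opening the combined file afterwards to confirm that the indentation of the two halves is consistent.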

Using the GUI

Set your Data Format to “Custom”, and leave the BIDS Base Directory blank.

In the Anatomical File Path Template field, enter the full file path to any of your anatomical/structural input files (including the file extension .nii/.nii.gz), and then replace the appropriate directory levels with these tags:

{participant}
{site}           (if applicable)
{session}        (if applicable)

Note: C-PAC does not currently support multiple anatomical series/scans/runs, but will do so in the next release.

Your template paths should look something like this, for the corresponding directory layouts:

Actual file:     /home/data/site-01/sub1003/session-A1/anat/mprage.nii.gz
Template path:   /home/data/{site}/{participant}/{session}/anat/mprage.nii.gz

Actual file:     /home/data/site-03/sub-1005_session-B1/anat/anat.nii
Template path:   /home/data/{site}/{participant}_{session}/anat/anat.nii

For the Functional File Path Template field, repeat the process with your functional input files. In addition, different levels of series/scans can be denoted with the following tag:

{series}

If you have field map files, and wish to perform field map-based distortion correction, the same process can be repeated for your field map phase and magnitude files, via the Field Map Phase File Path Template and Field Map Magnitude File Path Template fields in the GUI.

For the file path templates, only the {participant} tags are required. Defaults will be assigned to the other levels if they do not exist.
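
To make the relationship between a template and an actual file path concrete, here is a minimal sketch that turns a template into a regular expression and pulls the tag values out of a path. It illustrates the idea only and is not how C-PAC parses templates internally. The function names template_to_regex and extract_ids are hypothetical, and the sketch assumes that a tag value never contains a slash.

```python
import re

TAGS = ("site", "participant", "session", "series")


def template_to_regex(template):
    """Compile a file path template into a regex with one named group per tag."""
    pattern = ""
    seen = set()
    # Split the template while keeping the {tag} and * tokens.
    for token in re.split(r"(\{\w+\}|\*)", template):
        name = token[1:-1] if token.startswith("{") and token.endswith("}") else None
        if token == "*":
            pattern += "[^/]*"              # wildcard within one directory level
        elif name in TAGS:
            if name in seen:
                pattern += "(?P=%s)" % name  # a repeated tag must match the same text
            else:
                pattern += "(?P<%s>[^/]+?)" % name
                seen.add(name)
        else:
            pattern += re.escape(token)      # literal path text
    return re.compile("^" + pattern + "$")


def extract_ids(template, path):
    """Return a dict of tag values for `path`, or None if it does not match."""
    if "{participant}" not in template:
        raise ValueError("the {participant} tag is required")
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(extract_ids(
        "/home/data/{site}/{participant}_{session}/anat/anat.nii",
        "/home/data/site-03/sub-1005_session-B1/anat/anat.nii",
    ))
```

Tags that are absent from the template simply do not appear in the result; this sketch does not attempt to reproduce the default values that the builder assigns to missing levels.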

Then, fill in the fields for where you want the data configuration builder to save your files, and the name/label to give them.

Once your settings are complete, click Generate Data Config in the bottom-right corner to generate the data configuration, and to also save this setup into a data settings file. If you only want to save the settings to generate the data configuration for later, click Save Preset.

Using the cpac_data_config_setup.py script

Following the instructions for formatting your path templates given above, populate these fields in your data settings file:

dataFormat:                  ['Custom']
anatomicalTemplate:          /path/to/{site}/{participant}/{session}/anat/mprage.nii.gz
functionalTemplate:          /path/to/{site}/{participant}/{session}/func/{series}/bold.nii.gz
outputSubjectListLocation:   /save/configs/here
subjectListName:             data_config_name

You can also fill in the AWS credentials file field, and the inclusion and exclusion fields, as needed.

Once your data settings file is ready, generate your data configuration file by running:

cpac_data_config_setup.py --data_settings_file /path/to/data_settings.yml

Example File Path Templates

Here are the file path templates used for the 1000 Functional Connectomes data release, as well as an illustration of the directory structure used for the release:

Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage_anonymized.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/func/rest.nii.gz
_images/fcon_structure.png

Another example is the file structure used by the ABIDE and ADHD-200 releases:

Anatomical Template:  /path/to/data/{site}/{participant}/{session}/anat_*/mprage.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/{session}/rest_*/rest.nii.gz
_images/abide_adhd_structure.png

A final example is the file structure used by the Enhanced Nathan Kline Institute-Rockland Sample:

Anatomical Template:  /path/to/data/{site}/{participant}/anat/mprage.nii.gz
Functional Template:  /path/to/data/{site}/{participant}/{session}/RfMRI_*/rest.nii.gz
_images/nki-rs_template.png

Users experiencing difficulties defining file path templates may want to re-organize their data to match one of the examples above. If you manually define a file path template and encounter an error when attempting to generate participant lists, please contact us and we will be happy to help.
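
Before generating a participant list from a hand-written template, it can help to check that the template matches any files on disk at all. The sketch below does this by replacing every tag with a * wildcard and handing the result to Python's glob module. The helper names template_to_glob and find_matching_files are hypothetical, and this is a quick sanity check rather than a reproduction of how the data configuration builder searches for files.

```python
import glob
import re


def template_to_glob(template):
    """Replace every {tag} in a file path template with a * wildcard."""
    return re.sub(r"\{\w+\}", "*", template)


def find_matching_files(template):
    """Return the sorted list of existing files that fit the template."""
    return sorted(glob.glob(template_to_glob(template)))


if __name__ == "__main__":
    template = "/path/to/data/{site}/{participant}/{session}/anat_*/mprage.nii.gz"
    print(template_to_glob(template))
    for path in find_matching_files(template):
        print(path)
```

An empty result usually points to a typo in the fixed parts of the path or to a tag placed at the wrong directory level.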

Data Configuration File YAML Fields

The data configuration builder GUI or the cpac_data_config_setup.py command line utility will produce a YAML file containing all of the participants and the properties associated with each participant, such as the ID, the session number, and the locations of the resting-state/functional and anatomical scans. Before each participant definition there is a single line with a dash, which indicates the start of the property definitions. Participant properties are indented under this dash. To illustrate, see the sample participant definition below:

-
    subject_id: 'subj_1'
    unique_id: 'session_1'
    anat: '/data/subj_1/session_1/anat_1/mprage.nii.gz'
    func:
      rest_1: '/data/subj_1/session_1/rest_1/rest.nii.gz'
      rest_2: '/data/subj_1/session_1/rest_2/rest.nii.gz'
    scan_parameters:
        tr: '2.5'
        acquisition: 'seq+z'
        reference: '24'
        first_tr: ''
        last_tr: ''

Note that more than one functional scan is defined under the func key (i.e., multiple series), and that individual scan parameters can be defined to override the settings used in the C-PAC pipeline configuration GUI.
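
Once loaded with a YAML parser (for example, yaml.safe_load from PyYAML), a data configuration file like the sample above becomes a list of dictionaries, one per participant-session entry. The sketch below writes that structure out by hand so that it runs without any extra dependency, and shows one way to walk through the functional scans. The function name list_functional_scans is hypothetical.

```python
# The sample participant definition, expressed as the Python structure
# a YAML parser would be expected to return for it.
data_config = [
    {
        "subject_id": "subj_1",
        "unique_id": "session_1",
        "anat": "/data/subj_1/session_1/anat_1/mprage.nii.gz",
        "func": {
            "rest_1": "/data/subj_1/session_1/rest_1/rest.nii.gz",
            "rest_2": "/data/subj_1/session_1/rest_2/rest.nii.gz",
        },
        "scan_parameters": {
            "tr": "2.5",
            "acquisition": "seq+z",
            "reference": "24",
            "first_tr": "",
            "last_tr": "",
        },
    },
]


def list_functional_scans(config):
    """Yield (subject_id, unique_id, series_name, path) for every functional scan."""
    for entry in config:
        for series_name, path in entry.get("func", {}).items():
            yield entry["subject_id"], entry["unique_id"], series_name, path


if __name__ == "__main__":
    for scan in list_functional_scans(data_config):
        print(scan)
```

Each dash in the YAML file corresponds to one dictionary in the list, and each key under func corresponds to one series.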