patkit

Guidelines for Data Directory Structure

Please note that not all of the Python classes supporting the structure described below will necessarily be part of the 1.0 release of PATKIT.

It is possible to work with many different kinds of data directory structures when using PATKIT. The description below matches fairly closely the internal representation of data in PATKIT and how PATKIT will save data and metadata when saving and/or exporting results.

Following this kind of structure when first saving the data therefore makes data management easier: the original data and the files generated by PATKIT stay in the same place when they correspond to each other.

Recorded vs Patkit/saved data

Recorded data is not (currently) produced by Patkit. Instead, it originates from other programs and sources. Patkit treats it as immutable and will not overwrite or edit it in any way. If options for editing and saving, for example, sound are added in the future, Patkit will do its best to never overwrite the original file. Nevertheless, you should keep golden, backed-up copies of the data somewhere Patkit cannot reach, such as on external drives or in the cloud.

In contrast, Patkit data, or saved data, is written by Patkit and may be overwritten by it. To keep this distinction clear, Patkit data is not saved in the same part of the directory tree as the recorded data. This also makes it easier to set up analysis Scenarios that refer to a shared copy of the recorded data while each has its own Scenario-specific Patkit data files.
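As an illustration of keeping the two kinds of data apart, here is a minimal sketch of one possible approach: mirror the relative location of a recorded file under a separate root for saved data, and refuse to write anything inside the recorded tree. This is not PATKIT's actual implementation. The function names `mirrored_path` and `is_safe_to_write` and the example root directories are hypothetical and chosen only for this example.

```python
from pathlib import Path


def mirrored_path(recorded_file: Path, recorded_root: Path, saved_root: Path) -> Path:
    """Map a recorded file to the same relative location under saved_root.

    Hypothetical helper, not part of PATKIT. Raises ValueError if
    recorded_file is not located under recorded_root.
    """
    relative = recorded_file.relative_to(recorded_root)
    return saved_root / relative


def is_safe_to_write(target: Path, recorded_root: Path) -> bool:
    """Return True only if target lies outside the recorded data tree."""
    return not target.resolve().is_relative_to(recorded_root.resolve())


# Example with made-up directory names (requires Python 3.9 or later).
recorded = Path("/data/recorded")
scenario = Path("/data/scenario_a")
target = mirrored_path(recorded / "participant 1" / "session 1" / "file.wav",
                       recorded, scenario)
print(target, is_safe_to_write(target, recorded))
```

With a layout like this, several Scenarios could each use their own saved-data root while pointing at the same recorded root.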

Directories

The directory tree organisation follows the hierarchy of the Database classes. PATKIT uses one of two directory structures, depending on whether a Trial level is needed, that is, whether there is more than one datasource. Here is the structure for a single datasource (which may have produced more than one kind of data):

└── dataset
    └── participant
        └── session
            └── [file types]

With two or more datasources (systems with their own internal synchronisation), it might be tempting to use a structure like the one below:

└── dataset
    └── datasource
        └── participant
            └── session
                └── [file types]

However, since we want to cross-synchronise the datasources, it is better to keep files that correspond to each other close together in the directory tree, like this:

└── dataset
    └── participant
        └── session
            └── datasource
                └── [file types]

Specifically, the extra datasource level after session helps keep shared file types from clashing (most systems will have one or more wav files in the saved data). Below is a more detailed example in which the different file types have been sorted into subdirectories.

└── dataset
    ├── participant 1
    │   ├── session 1
    │   │   ├── AAA
    │   │   │   ├── wav # including TextGrids
    │   │   │   ├── ultrasound # including .param files  
    │   │   │   ├── prompts # .txt files
    │   │   │   └── video # .avi files
    │   │   └── EVA
    │   │       ├── wav # including TextGrids, if needed 
    │   │       │       # given they are already in AAA
    │   │       └── oral_airflow # .oaf files
    │   └── session 2
    │       └── [etc]
    └── participant 2
        └── [etc]

Some sources, like RASL, produce this sort of directory structure by default. Others, like AAA, by default put all saved files in the same directory, leaving out the final level in this example.
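To make the dataset / participant / session / datasource hierarchy concrete, the following sketch walks a directory tree laid out as in the detailed example and lists each datasource directory it finds. It only uses the standard library and does not rely on any PATKIT classes. The names `SessionSource` and `iter_datasources` are hypothetical, and the sketch assumes that every directory at each level belongs to the hierarchy.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class SessionSource(NamedTuple):
    """One datasource directory and its place in the hierarchy."""
    participant: str
    session: str
    datasource: str
    path: Path


def _subdirs(directory: Path) -> list:
    """Return the immediate subdirectories of directory, sorted by name."""
    return sorted(p for p in directory.iterdir() if p.is_dir())


def iter_datasources(dataset: Path) -> Iterator[SessionSource]:
    """Yield every dataset/participant/session/datasource directory."""
    for participant in _subdirs(dataset):
        for session in _subdirs(participant):
            for source in _subdirs(session):
                yield SessionSource(
                    participant.name, session.name, source.name, source
                )
```

For example, `for item in iter_datasources(Path("dataset")): print(item)` would print one entry each for `AAA` and `EVA` under `participant 1/session 1`. For a dataset where all files sit directly in the session directory, the innermost loop would have to be dropped.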

Separate directories for file types?

Another optional decision is whether individual data types are stored in subdirectories of a session. This really depends on how difficult it would be to browse a directory's contents if all files for a session were in it.

If files are divided into subdirectories by type, then it is still a good idea to keep .wav files and TextGrid files in the same directory. This makes it easier when working with them in Praat.
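Keeping the two file types together also makes simple consistency checks easy. As a rough sketch, the hypothetical helper below (not part of PATKIT) lists the .wav files in a directory that have no TextGrid with the same base name. It assumes TextGrid files use the `.TextGrid` extension and share the base name of their .wav file.

```python
from pathlib import Path


def wavs_missing_textgrid(directory: Path) -> list:
    """Return .wav files in directory that lack a same-named .TextGrid file."""
    textgrid_stems = {p.stem for p in directory.glob("*.TextGrid")}
    return sorted(
        wav for wav in directory.glob("*.wav") if wav.stem not in textgrid_stems
    )
```

Calling `wavs_missing_textgrid(Path("dataset/participant 1/session 1/AAA/wav"))` would return the recordings in that directory that still need a TextGrid.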

Things Not to Do

It can be tempting to put individual trials in their own subdirectories. However, grouping files by trial rather than by type mostly makes it more difficult to find a file when looking for one manually.
