Once files have been successfully aligned, you may want to export the aligned utterances as machine learning training samples.
This is what the export tool bin/export.sh is for.
The exporter takes either a single audio file (--audio <AUDIO>)
plus a corresponding .aligned file (--aligned <ALIGNED>) or a series
of such pairs from a .catalog file (--catalog <CATALOG>) as input.
All of the following computations will be done on the joined list of all aligned utterances of all input pairs.
With option --ignore-missing the exporter will not fail on missing file references in the catalog
and will instead just ignore the affected catalog entry.
The parameter --filter <EXPR> lets you specify a Python expression that has access
to all data fields of an aligned utterance (as can be seen in .aligned file entries).
This expression is applied to each aligned utterance; if it returns True,
the utterance is excluded from all following steps.
This is useful for excluding utterances that would not work as input for the planned
training or other kind of application.
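As a rough illustration of the filtering step, the following sketch evaluates an expression against each utterance and drops the ones for which it is true. It is not the exporter's actual implementation: the function apply_filter, the sample records and the field names wer and transcript are assumptions made for this example.

```python
# Hypothetical utterance records; the field names are assumptions for illustration.
utterances = [
    {"transcript": "hello world", "wer": 0.0},
    {"transcript": "hi", "wer": 0.5},
    {"transcript": "good morning everyone", "wer": 0.1},
]


def apply_filter(utterances, expr):
    """Return the utterances for which the filter expression is NOT truthy."""
    code = compile(expr, "<filter>", "eval")
    kept = []
    for utt in utterances:
        # The utterance's fields are exposed as plain names to the expression.
        # Only len() is made available here; builtins are otherwise disabled.
        excluded = eval(code, {"__builtins__": {}, "len": len}, dict(utt))
        if not excluded:
            kept.append(utt)
    return kept


# Exclude utterances with a high word error rate or a very short transcript.
kept = apply_filter(utterances, "wer > 0.2 or len(transcript) < 3")
```

Note the polarity: an expression that returns True removes the utterance, so the expression describes what to exclude, not what to keep.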
As with filtering, the parameter --criteria <EXPR> allows for specifying a Python
expression that has access to all data fields of an aligned utterance.
The expression is applied to each aligned utterance and its numerical return
value is assigned to that utterance as its quality.
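A minimal sketch of how such a quality value could be computed per utterance. The helper assign_quality and the field name wer are hypothetical and only serve the example; the real exporter may evaluate the expression differently.

```python
def assign_quality(utterances, expr):
    """Return copies of the utterances with an added numerical 'quality' field."""
    code = compile(expr, "<criteria>", "eval")
    scored = []
    for utt in utterances:
        # Evaluate the expression with the utterance's fields as variables.
        value = eval(code, {"__builtins__": {}}, dict(utt))
        scored.append(dict(utt, quality=float(value)))
    return scored


# Hypothetical records: a lower word error rate yields a higher quality.
scored = assign_quality([{"wer": 0.0}, {"wer": 0.5}, {"wer": 0.25}], "1.0 - wer")
```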
This step is to (optionally) exclude utterances that would otherwise bias the data (risk of overfitting).
For each --debias <META DATA TYPE> parameter the following procedure is applied:
- Take the meta data type (e.g. "name") and read its instances (e.g. "Alice" or "Bob") from each utterance and group all utterances accordingly (e.g. a group with 2 utterances of "Alice" and a group with 15 utterances of "Bob"...)
- Compute the standard deviation (sigma) of the instance-counts of the groups
- For each group: If the instance-count exceeds sigma times --debias-sigma-factor <FACTOR>:
  - Drop the number of exceeding utterances in order of their quality (lowest first)
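The procedure can be sketched as follows. This is a literal reading of the steps above, not the tool's actual code: the function debias, the record layout (a meta dictionary with one scalar value per type) and the use of the population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import pstdev


def debias(utterances, meta_type, sigma_factor):
    """Cap each group at sigma * sigma_factor utterances, dropping lowest quality first."""
    groups = defaultdict(list)
    for utt in utterances:
        groups[utt["meta"][meta_type]].append(utt)

    # Assumption: population standard deviation of the group sizes.
    sigma = pstdev([len(members) for members in groups.values()])
    cap = int(sigma * sigma_factor)
    # Caveat of this literal reading: with a single group sigma is 0, so the
    # cap is 0 too. The real tool may treat that case differently.

    kept = []
    for members in groups.values():
        if len(members) > cap:
            # Keep the 'cap' best utterances, i.e. drop the lowest-quality excess.
            members = sorted(members, key=lambda u: u["quality"], reverse=True)[:cap]
        kept.extend(members)
    return kept


# The example from the text: 2 utterances of "Alice" and 15 of "Bob".
utterances = (
    [{"meta": {"name": "Alice"}, "quality": q} for q in (0.9, 0.4)]
    + [{"meta": {"name": "Bob"}, "quality": q / 15} for q in range(15)]
)
balanced = debias(utterances, "name", sigma_factor=1.0)
```

With group sizes 2 and 15 the standard deviation is 6.5, so with a factor of 1.0 the "Bob" group is cut down to its 6 best utterances while "Alice" is left untouched.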
Training sets are often partitioned into several quality levels.
For each --partition <QUALITY:PARTITION> parameter (ordered descending by QUALITY):
If the utterance's quality value is greater than or equal to QUALITY, assign it to PARTITION.
Remaining utterances are assigned to partition other.
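The assignment rule amounts to a threshold lookup, sketched below. The function assign_partition, the partition names and the threshold values are hypothetical examples.

```python
def assign_partition(quality, partitions):
    """Return the first partition (by descending threshold) the quality reaches."""
    for threshold, name in sorted(partitions, key=lambda p: p[0], reverse=True):
        if quality >= threshold:
            return name
    # Utterances below every threshold end up in the catch-all partition.
    return "other"


# Corresponds to something like: --partition 90:good --partition 50:ok
partitions = [(50, "ok"), (90, "good")]
names = [assign_partition(q, partitions) for q in (95, 90, 60, 10)]
```

Because the thresholds are checked in descending order, an utterance always lands in the highest partition it qualifies for.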
Training sets (actually their partitions) are typically split into the sets train, dev
and test.
This can be automated through parameter --split, which will let the exporter split each
partition (or the entire set) accordingly.
Parameter --split-field allows for specifying a meta data type that should be considered
atomic (e.g. "speaker" would result in all utterances of a speaker
instance - like "Alice" - to end up in one sub-set only). This atomic behavior will also hold
true across partitions.
Option --split-drop-multiple allows for dropping all samples with multiple --split-field assignments - e.g. a
sample with more than one "speaker".
In contrast option --split-drop-unknown allows for dropping all samples with no --split-field assignment.
With option --assign-{train|dev|test} <VALUES> one can pre-assign values (of the comma-separated list)
to the specified set.
Option --split-seed <SEED> sets an integer random seed for the split operation.
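The atomic behavior of --split-field can be pictured with the sketch below: the entities (for example speakers) are shuffled with a seeded random generator and assigned to sets as a whole. The function split_atomic and the 10% shares for dev and test are assumptions for illustration; the source does not state the ratios the exporter uses.

```python
import random


def split_atomic(utterances, field, seed, dev_share=0.1, test_share=0.1):
    """Split utterances into train/dev/test so that no entity spans two sets."""
    entities = sorted({utt[field] for utt in utterances})
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(entities)

    n_dev = int(len(entities) * dev_share)
    n_test = int(len(entities) * test_share)
    subset_of = {}
    for index, entity in enumerate(entities):
        if index < n_dev:
            subset_of[entity] = "dev"
        elif index < n_dev + n_test:
            subset_of[entity] = "test"
        else:
            subset_of[entity] = "train"

    sets = {"train": [], "dev": [], "test": []}
    for utt in utterances:
        sets[subset_of[utt[field]]].append(utt)
    return sets


# Hypothetical data: 10 speakers with 3 utterances each.
utterances = [
    {"speaker": f"speaker-{s}", "index": i} for s in range(10) for i in range(3)
]
sets = split_atomic(utterances, "speaker", seed=42)
```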
For each partition/sub-set combination the following is done:
- Construction of a name (e.g. good-dev will represent the validation set of partition good).
- All samples are lazy-loaded and potentially re-sampled to match parameters:
  - --channels <N>: Number of audio channels - 1 for mono (default), 2 for stereo
  - --rate <RATE>: Sample rate - default: 16000
  - --width <WIDTH>: Sample width in bytes - default: 2 (16 bit)

  --workers <WORKERS> can be used to specify how many parallel processes should be used for loading and re-sampling.
  --tmp-dir <DIR> overrides system default temporary directory that is used for converting samples.
  --skip-damaged allows for just skipping export of samples that cannot be loaded.
- If option --target-dir <DIR> is provided, all output will be written to the provided target directory. This can be done in two different ways:
  - With the additional option --sdb each set will be written to a so called Sample-DB that can be used by DeepSpeech. It will be written as <name>.sdb into the target directory. SDB export can be controlled with the following additional options:
    - --sdb-bucket-size <SIZE>: SDB bucket size (using units like "1GB") for external sorting of the samples
    - --sdb-workers <WORKERS>: Number of parallel workers for preparing and compressing SDB entries
    - --sdb-buffered-samples <SAMPLES>: Number of samples per bucket buffer during last phase of external sorting
    - --sdb-audio-type <TYPE>: Internal audio type for storing SDB samples - wav or opus (default)
  - Without option --sdb all samples are written as WAV-files into sub-directory <name> of the target directory and a list of samples to a <name>.csv file next to it with columns wav_filename, wav_filesize, transcript.

  If not omitted through option --no-meta, a CSV file called <name>.meta is written to the target directory. For each written sample it provides the following columns: sample, split_entity, catalog_index, source_audio_file, aligned_file, alignment_index.

  Throughout this process option --force allows to overwrite any existing files.
- If instead option --target-tar <TAR-FILE> is provided, the same file structure as with --target-dir <DIR> is directly written to the specified tar-file. This output variant does not support writing SDBs.
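To make the layout of the non-SDB sample list more concrete, here is a small sketch that writes and reads a list with the three columns named above. The helper write_sample_list, the file names, sizes and transcripts are made up, and the exporter's exact CSV dialect is not specified in this text.

```python
import csv
import io

COLUMNS = ["wav_filename", "wav_filesize", "transcript"]


def write_sample_list(samples, stream):
    """Write sample records as CSV with the three columns of <name>.csv."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)


# Hypothetical entries of a set named "good-dev"; written to memory, not to disk.
buffer = io.StringIO()
write_sample_list(
    [
        {"wav_filename": "good-dev/sample-0.wav", "wav_filesize": 32044, "transcript": "hello world"},
        {"wav_filename": "good-dev/sample-1.wav", "wav_filesize": 48044, "transcript": "good morning"},
    ],
    buffer,
)

# Reading it back the way a training pipeline might; all values arrive as strings.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```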
Option --plan <PLAN> can be used to cache all computational steps before actual output writing.
The plan file is loaded if it already exists and is generated otherwise.
This allows for writing several output formats using the same sample set distribution and without having to load
alignment files and re-calculate quality metrics, de-biasing, partitioning or splitting.
Using --dry-run one can avoid any writing and get a preview of the set splits and so forth
(--dry-run-fast won't even load any sample).
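The load-if-existing, generate-otherwise behavior of --plan follows a common caching pattern, sketched below. The function load_or_build_plan, the JSON storage format and the shape of the plan are assumptions; the source does not describe the actual plan file format.

```python
import json
import os
import tempfile


def load_or_build_plan(path, build):
    """Load a cached plan from 'path', or build it and store it there first."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as plan_file:
            return json.load(plan_file)
    plan = build()
    with open(path, "w", encoding="utf-8") as plan_file:
        json.dump(plan, plan_file)
    return plan


calls = []


def build_plan():
    # Stand-in for the expensive steps: loading alignments, computing quality,
    # de-biasing, partitioning and splitting.
    calls.append(1)
    return {"good-dev": [0, 3], "good-train": [1, 2, 4]}


with tempfile.TemporaryDirectory() as tmp:
    plan_path = os.path.join(tmp, "export.plan")
    first = load_or_build_plan(plan_path, build_plan)   # builds and stores the plan
    second = load_or_build_plan(plan_path, build_plan)  # loads the stored plan
```

The second call returns the stored distribution without running the expensive steps again, which is what allows several output formats to share one sample set distribution.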