JSONL format
The JSONL format is the default in this repository; if you are training on your own data, it is recommended to convert it into this format. Note that the data preparation steps differ slightly depending on the model you have decided to train, so please refer to the model configuration page first.
The JSONL format stores a Lhotse CutSet; each cut must have a recording and a single supervision.
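As an illustration (not part of the repository scripts), a manifest meeting these requirements could be built with Lhotse roughly as follows; the audio path, IDs and transcript are placeholders:
from lhotse import CutSet, Recording, RecordingSet, SupervisionSegment, SupervisionSet

# Placeholder audio file and IDs
recording = Recording.from_file("audio/utt1.flac", recording_id="utt1")
supervision = SupervisionSegment(
    id="utt1",
    recording_id="utt1",
    start=0.0,
    duration=recording.duration,
    text="an example transcript",
)
# Each cut pairs one recording with a single supervision
cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_recordings([recording]),
    supervisions=SupervisionSet.from_segments([supervision]),
)
cuts.to_file("cuts.jsonl.gz")  # gzipped JSONL manifest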
Prepare LibriSpeech in JSONL format
This page takes LibriSpeech as it is distributed from the https://www.openslr.org website and prepares it into a JSONL manifest format.
Quick Start
To run the data preparation steps for LibriSpeech and the base model, run the following from the training/ directory:
# Download data to /datasets/LibriSpeech: requires 120GB of disk
./scripts/prepare_librispeech.sh
To run preprocessing for the testing or large configurations, instead run:
SPM_SIZE=1023 CONFIG_NAME=testing-1023sp ./scripts/prepare_librispeech.sh
SPM_SIZE=17407 CONFIG_NAME=large-17407sp ./scripts/prepare_librispeech.sh
If ~/datasets on the host is mounted to /datasets, the downloaded data will be accessible outside the container at ~/datasets/LibriSpeech.
Further detail: prepare_librispeech.sh
The script will:
- Download data
- Create JSON manifests for each subset of LibriSpeech
- Convert the manifests into end-pointed manifests
- Convert the JSON manifests into JSONL manifests
- Create a sentencepiece tokenizer from the train-960h subset
- Record log-mel stats for the train-960h subset
- Populate the missing fields of a YAML configuration template
- Generate an n-gram language model with KenLM from the train-960h subset
1. Data download
Having run the script, the following folders should exist inside the container:
/datasets/LibriSpeech/
    train-clean-100/
    train-clean-360/
    train-other-500/
    dev-clean/
    dev-other/
    test-clean/
    test-other/
2. JSON manifests
/datasets/LibriSpeech/
    librispeech-train-clean-100-flac.json
    librispeech-train-clean-360-flac.json
    librispeech-train-other-500-flac.json
    librispeech-train-clean-100-flac.eos.json
    librispeech-train-clean-360-flac.eos.json
    librispeech-train-other-500-flac.eos.json
    librispeech-dev-clean-flac.json
    librispeech-dev-other-flac.json
    librispeech-test-clean-flac.json
    librispeech-test-other-flac.json
3. Add EOS
/datasets/LibriSpeech/
    librispeech-train-clean-100-flac.eos.json
    librispeech-train-clean-360-flac.eos.json
    librispeech-train-other-500-flac.eos.json
4. JSONL manifests
/datasets/LibriSpeech/
    librispeech-train-clean-100-flac.cuts.jsonl.gz
    librispeech-train-clean-360-flac.cuts.jsonl.gz
    librispeech-train-other-500-flac.cuts.jsonl.gz
    librispeech-train-clean-100-flac.eos.jsonl.gz
    librispeech-train-clean-360-flac.eos.jsonl.gz
    librispeech-train-other-500-flac.eos.jsonl.gz
    librispeech-dev-clean-flac.cuts.jsonl.gz
    librispeech-dev-other-flac.cuts.jsonl.gz
    librispeech-test-clean-flac.cuts.jsonl.gz
    librispeech-test-other-flac.cuts.jsonl.gz
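To sanity-check one of these manifests you can load it with Lhotse; this is an illustrative snippet rather than part of the preparation script:
from lhotse import CutSet

cuts = CutSet.from_file("/datasets/LibriSpeech/librispeech-dev-clean-flac.cuts.jsonl.gz")
cuts.describe()  # prints cut counts and duration statistics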
5. Sentencepiece tokenizer
/datasets/sentencepieces/
    librispeech8703.model
    librispeech8703.vocab
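The tokenizer can be loaded with the sentencepiece library to see how it segments text; this snippet is illustrative only:
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/datasets/sentencepieces/librispeech8703.model")
print(sp.encode("an example transcript", out_type=str))  # list of subword pieces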
6. Log-mel stats
/datasets/stats/STATS_SUBDIR/
    melmeans.pt
    meln.pt
    melvars.pt
The STATS_SUBDIR will differ depending on the model since these stats are affected by the feature extraction window size. They are:
- testing: /datasets/stats/librispeech-winsz0.02
- {base,large}: /datasets/stats/librispeech-winsz0.025
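The stats files are PyTorch tensors and can be inspected with torch.load; an illustrative check for the base/large window size (the exact contents of each file are determined by the preparation script):
import torch

stats_dir = "/datasets/stats/librispeech-winsz0.025"
means = torch.load(f"{stats_dir}/melmeans.pt")
n = torch.load(f"{stats_dir}/meln.pt")
variances = torch.load(f"{stats_dir}/melvars.pt")
print(means.shape, n, variances.shape)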
7. _run.yaml config
This is created in the configs/ directory. Depending on the model you are training, you will have one of:
- testing: configs/testing-1023sp_run.yaml
- base: configs/base-8703sp_run.yaml
- large: configs/large-17407sp_run.yaml
_run indicates that this is a complete config, not just a template.
8. N-gram language model
/datasets/ngrams/librispeech8703/
    transcripts.txt
    ngram.arpa
    ngram.binary
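If the kenlm Python bindings are installed, the binary language model can be loaded and queried as a quick check (illustrative only):
import kenlm

lm = kenlm.Model("/datasets/ngrams/librispeech8703/ngram.binary")
print(lm.score("an example transcript", bos=True, eos=True))  # log10 probability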
Selecting transcripts
Each supervision must include a “text” field, which contains the transcript of the corresponding audio file.
If multiple transcripts exist for the same audio file, the model can be trained using a specific transcript by specifying it in the use_transcripts field of the YAML configuration:
use_transcripts:
to_use: ["<transcript_key_1>", "<transcript_key_2>", ...]
on_missing: "raise_error" # or "use_default"
These extra transcripts are stored as custom data inside each supervision.
The first transcript key that is found in the JSONL manifest will be used.
If no key is found in the JSONL manifest, the behaviour will depend on the value of the on_missing field:
- “raise_error”: an error will be raised.
- “use_default”: the default transcript will be used.
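As an illustration, an alternative transcript could be attached to each supervision's custom data with Lhotse before training. The key name transcript_key_1, the output filename, and the lowercasing transformation below are purely hypothetical:
from lhotse import CutSet

cuts = CutSet.from_file("librispeech-train-clean-100-flac.cuts.jsonl.gz")

def add_alt_transcript(cut):
    sup = cut.supervisions[0]
    sup.custom = dict(sup.custom or {})
    # Hypothetical alternative transcript stored under a custom key
    sup.custom["transcript_key_1"] = sup.text.lower()
    return cut

cuts.map(add_alt_transcript).to_file("librispeech-train-clean-100-flac.alt.jsonl.gz")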
Standardizing transcripts
Training on multiple datasets can negatively impact performance, as identical words may be transcribed differently across datasets - for example, “colour” (British English) vs “color” (American English). This inconsistency splits the model’s probability distribution across multiple spellings, reducing prediction confidence.
To address this, we standardize the training transcripts using Whisper's EnglishTextNormalizer. This process follows the same approach as WER standardization (see the WER calculation docs for more details).
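As a quick illustration of what the normalizer does (assuming the openai-whisper package is installed; the output shown is approximate):
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()
print(normalize("Colour, don't!"))
# roughly "color do not": lowercased, punctuation stripped,
# British spelling americanized, contraction expanded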
There are two methods to do this which are described below.
Recommended approach: Pre-standardizing manifests
The recommended approach is to standardize your manifests before training using the standardize_manifest.py script:
python caiman_asr_train/data/text/standardize_manifest.py \
--manifests /path/to/shar/manifest \
--output_dirs /path/to/save/output \
--preserve-case \
--preserve-punctuation
This creates standardized versions of each transcript field, stored as transcript-standardized. You can also choose to standardize transcripts other than the default text using --transcript-fields:
python caiman_asr_train/data/text/standardize_manifest.py \
--manifests /path/to/shar/manifest \
--transcript-fields field1-name field2-name
The standardized versions of these will be stored as field1-name-standardized, field2-name-standardized. If any of the provided
fields are not found, they will be ignored.
You can then use these standardized fields during training by configuring your model’s YAML:
use_transcripts:
to_use: ["transcript-standardized"]
on_missing: "raise_error" # or "use_default"
This approach offers several advantages:
- Faster training startup (standardization is done only once)
- Ability to standardize multiple transcript fields in the same manifest
Multiple manifests and fields can be standardized at once:
python caiman_asr_train/data/text/standardize_manifest.py \
--manifests /path/to/shar/manifest1 /path/to/shar/manifest2 \
--output_dirs /path/to/output/for/manifest1 /path/to/output/for/manifest2
This will standardize the text field in each of the JSONL manifests in manifest1 and manifest2, storing the result in a transcript-standardized custom field.
All arguments:
- --manifests [str] ...: Absolute path to dataset manifest files [required].
- --output_dirs [str] ...: Absolute path to save standardized manifest files [required].
- --preserve-case: Flag to preserve the original casing of the text [optional].
- --preserve-punctuation: Flag to preserve the original punctuation of the text [optional].
- --expand-contractions: Flag to enable expansion of contractions (e.g. "don't" -> "do not") [optional].
- --model-config [str]: Path to model config to extract user symbols to preserve, e.g. <EOS>.
- --num-workers [int]: Number of worker processes to use [optional] [default: CPU count].
- --overwrite: Flag to enable overwrite of existing output files [optional].
If using offline standardization of manifests, make sure to disable on-the-fly standardization by setting standardize_text to false in the model config (.yaml) file.
By default, standardization expands all contractions (e.g. "don't" -> "do not"). As a result, if you train a model on standardized transcripts, it will always predict the expanded form, e.g. it will output "do not" whether the speaker says "don't" or "do not".
If this behaviour is not desired,
you can disable contraction expansion during train-time standardization
by passing the --no-expand-contractions flag to the standardize_manifest.py script.
This will, however, worsen WERs by ~1-2% relative.
Alternative: On-the-fly standardization
Alternatively, the YAML config includes the entry standardize_text: true|false,
enabling on-the-fly standardization of training transcripts during training.
When using the SHAR data format, on-the-fly standardization should be fully overlapped with GPU compute, resulting in minimal impact on training speed.
Preserving casing and punctuation
By default, standardization lowercases and strips punctuation,
which is the desired behaviour for WER standardization.
However, this is undesirable when training models that predict
punctuation or casing (i.e. when normalize_transcripts is not set to lowercase).
Hence, both approaches support these standardization parameters:
- preserve_case: true|false - preserve the original casing of the text during standardization.
- preserve_punctuation: true|false - retain the original punctuation in the text during standardization.
If normalize_transcripts is set to lowercase, these arguments have no effect,
as the text is already being lowercased before standardization.
Whether you choose pre-standardization or on-the-fly standardization, the process generally improves WER by ensuring consistent text formatting across datasets.
Default settings:
normalize_transcripts: lowercase
standardize_text: true
preserve_case: false
preserve_punctuation: false
Next steps
Having run the data preparation steps, go to the training docs to start training.