JSON
format
The JSON
format is the default in this repository and if you are training on your own data it is recommended to manipulate it into this format. Note that the data preparation steps are slightly different given the model you have decided to train so please refer to the model configuration page first.
Page contents
Prepare LibriSpeech in JSON
format
This page takes LibriSpeech as it is distributed from the https://www.openslr.org website and prepares it into a JSON manifest format.
Quick Start
To run the data preparation steps for LibriSpeech and the base
model run the following from the training/
directory:
# Download data to /datasets/LibriSpeech: requires 120GB of disk
./scripts/prepare_librispeech.sh
To run preprocessing for the testing
or large
configurations, instead run:
SPM_SIZE=1023 CONFIG_NAME=testing-1023sp ./scripts/prepare_librispeech.sh
SPM_SIZE=17407 CONFIG_NAME=large-17407sp ./scripts/prepare_librispeech.sh
If ~/datasets
on the host is mounted to /datasets
, the downloaded data will be accessible outside the container at ~/datasets/LibriSpeech
.
Further detail: prepare_librispeech.sh
The script will:
- Download data
- Create
JSON
manifests for each subset of LibriSpeech - Convert the manifests into end-pointed manifests
- Create a sentencepiece tokenizer from the train-960h subset
- Record log-mel stats for the train-960h subset
- Populate the missing fields of a YAML configuration template
- Generate an n-gram language model with KenLM from the train-960h subset
1. Data download
Having run the script, the following folders should exist inside the container:
/datasets/LibriSpeech
train-clean-100/
train-clean-360/
train-other-500/
dev-clean/
dev-other/
test-clean/
test-other/
2. JSON manifests
/datasets/LibriSpeech/
librispeech-train-clean-100-flac.json
librispeech-train-clean-360-flac.json
librispeech-train-other-500-flac.json
librispeech-train-clean-100-flac.eos.json
librispeech-train-clean-360-flac.eos.json
librispeech-train-other-500-flac.eos.json
librispeech-dev-clean-flac.json
librispeech-dev-other-flac.json
librispeech-test-clean-flac.json
librispeech-test-other-flac.json
3. Sentencepiece tokenizer
/datasets/sentencepieces/
librispeech8703.model
librispeech8703.vocab
4. Log-mel stats
/datasets/stats/STATS_SUBDIR
:melmeans.pt
meln.pt
melvars.pt
The STATS_SUBDIR
will differ depending on the model since these stats are affected by the feature extraction window size. They are:
testing
:/datasets/stats/librispeech-winsz0.02
- {
base
,large
}:/datasets/stats/librispeech-winsz0.025
5. _run.yaml
config
In the configs/
directory. Depending on the model you are training you will have one of:
testing
:configs/testing-1023sp_run.yaml
base
:configs/base-8703sp_run.yaml
large
:configs/large-17407sp_run.yaml
_run
indicates that this is a complete config, not just a template.
6. N-gram language model
/datasets/ngrams/librispeech8703/
transcripts.txt
ngram.arpa
ngram.binary
To train an n-gram on a different dataset, see n-gram docs.
Prepare Other Datasets
Convert your dataset to the JSON
format
Options:
- Adapt the code in
caiman_asr_train/data/make_datasets/librispeech.py
. - If your dataset is in Hugging Face format, you can use the script described here
Generate artifacts needed for training
Suppose you have preprocessed CommonVoice, organized like this:
CommonVoice17.0
|-- common_voice_17.0_dev
|-- common_voice_17.0_dev.json
|-- common_voice_17.0_test
|-- common_voice_17.0_test.json
|-- common_voice_17.0_train
|-- common_voice_17.0_train.json
To generate the training artifacts, run the following:
DATASET_NAME_LOWER_CASE=commonvoice
MAX_DURATION_SECS=20.0
SPM_SIZE=8703
CONFIG_NAME=base-8703sp
DATA_DIR=/datasets/CommonVoice17.0
NGRAM_ORDER=4
TRAIN_MANIFESTS=/datasets/CommonVoice17.0/common_voice_17.0_train.json
./scripts/make_json_artifacts.sh $DATASET_NAME_LOWER_CASE $MAX_DURATION_SECS \
$SPM_SIZE $CONFIG_NAME $DATA_DIR $NGRAM_ORDER $TRAIN_MANIFESTS
where:
DATASET_NAME_LOWER_CASE
will determine the name of the generatedSENTENCEPIECE
andSTATS_SUBDIR
MAX_DURATION_SECS
is number of seconds above which audio clips will be discarded during trainingSPM_SIZE
is the size of the sentencepiece model—in this case, the base modelCONFIG_NAME
is the name of the template configuration file to readDATA_DIR
is the path to your datasetNGRAM_ORDER
is the order of the n-gram language model that can be used during beam searchTRAIN_MANIFESTS
can be a space-separated list
It is advised that you use all of your training data transcripts to build the sentencepiece tokenizer but it is ok to use a subset of the data to calculate the mel stats via the --n_utterances_only
flag to caiman_asr_train/data/generate_mel_stats.py
.
Before running make_json_artifacts.sh on your custom dataset, you may want to create an EOS version as explained here
Next steps
Having run the data preparation steps, go to the training docs to start training.