Data preparation

Having chosen which model configuration to train, you will need to complete the following preprocessing steps:

  1. Prepare your data in one of the supported training formats: JSON or WebDataset (a minimal JSON manifest is sketched after this list).
  2. Create a sentencepiece model from your training data (see the sketch after this list).
  3. Record your training data log-mel stats for input feature normalization.
  4. Populate a YAML configuration file with the missing fields.
  5. Generate an n-gram language model from your training data.
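
As an illustration of step 1, the snippet below writes a minimal JSON manifest in Python. The field names used here (transcript, files, fname, original_duration) are assumptions for illustration only; check the dataset documentation for the exact schema your training setup expects.

    import json

    # Hypothetical manifest layout: one entry per utterance, giving the audio
    # path, its duration in seconds, and the reference transcript. The exact
    # field names expected by the training code may differ.
    manifest = [
        {
            "transcript": "hello world",
            "files": [{"fname": "audio/utt0001.flac"}],
            "original_duration": 2.4,
        }
    ]

    with open("train_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)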

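For step 2, the sentencepiece Python package can train a tokenizer directly from a plain-text file containing one training transcript per line. This is a minimal sketch; the input file name, vocabulary size, and model type are assumptions, and the repository's own preprocessing scripts may wrap this step differently.

    import sentencepiece as spm

    # Train a sentencepiece model from normalized transcripts, one per line.
    # The file name, vocab size, and model type below are assumptions; use
    # the values required by your chosen model configuration.
    spm.SentencePieceTrainer.train(
        input="transcripts.txt",
        model_prefix="spm",        # writes spm.model and spm.vocab
        vocab_size=1024,
        character_coverage=1.0,
        model_type="unigram",
    )
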
Text normalization

Note

The examples assume a character set of size 28: a space, an apostrophe, and 26 lowercase letters. Transcripts will be normalized on the fly during training, as set in the YAML config templates via normalize_transcripts: lowercase. See Changing the character set for how to configure the character set and normalization. During validation, the predictions and reference transcripts will be standardized.
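
For illustration, the following Python sketch approximates what lowercase normalization to that 28-character set might do. It is an assumption for clarity, not the training code's actual implementation; in particular, real normalization typically verbalizes digits rather than dropping them.

    import re

    # Characters outside the 28-character set (space, apostrophe, a-z) are
    # replaced with spaces; repeated whitespace is then collapsed.
    OUTSIDE_CHARSET = re.compile(r"[^a-z' ]+")

    def normalize_transcript(text: str) -> str:
        text = text.lower()
        text = OUTSIDE_CHARSET.sub(" ", text)
        return " ".join(text.split())

    print(normalize_transcript("Hello, World!  It's time."))
    # -> "hello world it's time"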

Text standardization

Note

Training on multiple datasets can negatively affect WER because the same word may be transcribed with different conventions across datasets. It is possible to make the training transcripts consistent by setting standardize_text: true in the YAML config (this is the default). This applies the same standardization rules used in validation, as described in the WER Standardization section of the WER calculation docs, but in this case to the training transcripts.

See also