Data preparation

Having chosen which model configuration to train, you will need to complete the following preprocessing steps:

  1. Prepare your data in one of the supported training formats: JSON or WebDataset.
  2. Create a sentencepiece model from your training data.
  3. Record your training data log-mel stats for input feature normalization.
  4. Populate a YAML configuration file with the missing fields.
  5. Generate an n-gram language model from your training data.

Text normalization

Note

The examples assume a character set of size 28: a space, an apostrophe, and the 26 lowercase letters. Transcripts are normalized on the fly during training, as set by normalize_transcripts: lowercase in the YAML config templates. See Changing the character set for how to configure the character set and normalization. During validation, both the predictions and the reference transcripts are standardized.
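As a rough illustration of the 28-character set and lowercase normalization described in this note, the sketch below builds the set and checks a transcript against it. The function names are hypothetical and the code is not the training pipeline's implementation; it assumes that lowercase normalization only lowercases the text, which may be simpler than what the pipeline does.

```python
import string

# The 28-character set from the note: a space, an apostrophe, and a-z.
CHARSET = set(" '" + string.ascii_lowercase)


def normalize_lowercase(transcript: str) -> str:
    """Hypothetical stand-in for `normalize_transcripts: lowercase`.

    Assumption: normalization only lowercases the transcript.
    """
    return transcript.lower()


def out_of_charset(transcript: str) -> set:
    """Return the characters that the 28-character set cannot represent."""
    return set(transcript) - CHARSET


if __name__ == "__main__":
    clean = normalize_lowercase("Don't STOP")
    print(len(CHARSET))           # size of the character set
    print(repr(clean))            # lowercased transcript
    print(out_of_charset(clean))  # empty set: every character is covered

    # Digits and punctuation survive lowercasing and fall outside the set.
    print(sorted(out_of_charset(normalize_lowercase("Call 911!"))))
```

A check like this can help spot transcripts that need a larger character set or additional normalization before training.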

Text standardization

Note

Training on multiple datasets can hurt model quality simply because the same word is transcribed differently across the datasets. The YAML config contains the entry standardize_text: true|false, which lets you standardize the training transcripts on the fly. Standardization is applied only once, during the data preparation phase, to avoid slowing down the training steps. In principle, it is the same procedure as WER Standardization. See the WER calculation docs for more information on the text standardization procedure.
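To make the motivation concrete, here is a toy sketch of two datasets that transcribe the same utterance differently and a single standardization pass that reconciles them. The variant table and the standardize function are invented for illustration only; the actual rules are those described in the WER calculation docs, not the ones shown here.

```python
# Hypothetical spelling variants mapped to one canonical form.
# Assumption: these rules are illustrative, not the real standardizer's rules.
CANONICAL = {"ok": "okay", "colour": "color"}


def standardize(transcript: str) -> str:
    """Lowercase the transcript and map known variants to a canonical form."""
    words = transcript.lower().split()
    return " ".join(CANONICAL.get(word, word) for word in words)


# The same utterance as transcribed in two different datasets.
dataset_a = ["OK I like that colour"]
dataset_b = ["okay i like that color"]

# Standardize once, up front, rather than at every training step.
prepared = [standardize(t) for t in dataset_a + dataset_b]

if __name__ == "__main__":
    for line in prepared:
        print(line)
    print(prepared[0] == prepared[1])
```

After the pass, both datasets present the model with one consistent target for the same spoken words.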

See also