Data preparation

Having chosen which model configuration to train, you will need to complete the following preprocessing steps:

  1. Prepare your data in one of the supported training formats: JSONL or SHAR.
  2. Create a sentencepiece model from your training data.
  3. Record your training data log-mel stats for input feature normalization.
  4. Populate a YAML configuration file with the missing fields.
  5. Generate an n-gram language model from your training data.
  6. Optionally standardize transcripts.
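
As an illustration of step 3, the sketch below accumulates a per-bin mean and standard deviation over log-mel frames and applies them to a frame. It is a minimal pure-Python sketch, not the project's own stats script: the function names `logmel_stats` and `normalize`, the `eps` value, and the choice of population variance are all assumptions made for this example.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def logmel_stats(frames: Iterable[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return per-bin (means, stds) over all frames, using running sums.

    Hypothetical helper: each frame is one vector of log-mel values.
    """
    count = 0
    total: List[float] = []
    total_sq: List[float] = []
    for frame in frames:
        if count == 0:
            # Size the accumulators from the first frame.
            total = [0.0] * len(frame)
            total_sq = [0.0] * len(frame)
        for i, value in enumerate(frame):
            total[i] += value
            total_sq[i] += value * value
        count += 1
    if count == 0:
        raise ValueError("no frames supplied")
    means = [s / count for s in total]
    # Population variance; clamp at zero to absorb floating-point error.
    stds = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(total_sq, means)
    ]
    return means, stds


def normalize(frame: Sequence[float], means: Sequence[float],
              stds: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Shift and scale one frame with the recorded stats (eps is an assumed guard)."""
    return [(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)]


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 6.0]]  # two frames, two mel bins
    means, stds = logmel_stats(toy_frames)
    print(means, stds)
    print(normalize(toy_frames[0], means, stds))
```

Running sums keep memory use constant in the number of frames, which matters when the stats are gathered over a whole training set rather than a toy list.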

Text normalization

The examples assume a character set of size 57:

  • 26 lowercase letters,
  • 26 uppercase letters,
  • space,
  • full stop (period),
  • comma,
  • apostrophe,
  • question mark.
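
The character set listed above can be written out and checked in a few lines. This is only a sketch for sanity-checking transcripts; the names `CHARSET` and `out_of_charset` are hypothetical and do not come from the training code.

```python
import string

# 26 lowercase + 26 uppercase + space, full stop, comma, apostrophe, question mark.
CHARSET = (
    set(string.ascii_lowercase)
    | set(string.ascii_uppercase)
    | set(" .,'?")
)


def out_of_charset(text: str) -> list:
    """Return the sorted, de-duplicated characters of `text` not in CHARSET."""
    return sorted({ch for ch in text if ch not in CHARSET})


if __name__ == "__main__":
    print(len(CHARSET))
    print(out_of_charset("It's 5 pm?"))
```

A check like this makes it easy to see which transcripts contain digits or other symbols that would need normalization before they fit the character set.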

Transcripts are normalized on the fly during training. The YAML config templates enable this with the setting normalize_transcripts: digit_to_word. See Changing the character set for how to configure the character set and normalization.
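
To give a feel for what digit-to-word normalization does, the sketch below replaces runs of digits with spelled-out words so that the result fits the character set above. It is an illustrative assumption, not the behaviour of the digit_to_word option itself: the function names, the three-digit limit, and the digit-by-digit fallback for longer runs or leading zeros are all choices made for this example.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (no hyphens, to fit the charset)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + int_to_words(rest)


def digits_to_words(text: str) -> str:
    """Replace each run of ASCII digits in `text` with words."""
    def replace(match: re.Match) -> str:
        run = match.group()
        # Assumed fallback: long runs and leading zeros are read digit by digit.
        if len(run) > 3 or (len(run) > 1 and run[0] == "0"):
            return " ".join(ONES[int(d)] for d in run)
        return int_to_words(int(run))

    return re.sub(r"[0-9]+", replace, text)


if __name__ == "__main__":
    print(digits_to_words("I have 2 cats"))
    print(digits_to_words("Room 121?"))
```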

During validation, the predictions and reference transcripts will be standardized.

See also