Data preparation

Having chosen which model configuration to train, you will need to complete the following preprocessing steps:

Prepare your data in one of the supported training formats: JSON or WebDataset.
Create a SentencePiece model from your training data.
Record your training data log-mel stats for input feature normalization.
Populate a YAML configuration file with the missing fields.
Generate an n-gram language model from your training data.
Optionally, standardize transcripts.
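
To illustrate what recording log-mel stats for input feature normalization can involve, here is a minimal sketch in plain Python. It assumes the stats are a per-mel-bin mean and standard deviation accumulated over all training frames; this is an assumption for illustration, not a description of the CAIMAN-ASR tooling or its file format. The function names accumulate_stats and normalize_frame are hypothetical.

```python
import math


def accumulate_stats(frames, n_bins):
    """Return per-bin (mean, std) over an iterable of log-mel frames.

    Hypothetical helper: uses Welford's online algorithm so the whole
    dataset never has to be held in memory at once.
    """
    count = 0
    mean = [0.0] * n_bins
    m2 = [0.0] * n_bins  # running sum of squared deviations per bin
    for frame in frames:
        count += 1
        for i, x in enumerate(frame):
            delta = x - mean[i]
            mean[i] += delta / count
            m2[i] += delta * (x - mean[i])
    if count == 0:
        raise ValueError("no frames were provided")
    # Population standard deviation (divide by count, not count - 1).
    std = [math.sqrt(v / count) for v in m2]
    return mean, std


def normalize_frame(frame, mean, std, eps=1e-5):
    """Shift and scale one frame using the recorded stats (assumed scheme)."""
    return [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
```

Computing the stats once over the training set and reusing them at training and inference time keeps the input features on a consistent scale.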

Text normalization

Note

The examples assume a character set of size 28: a space, an apostrophe and the 26 lower-case letters. Transcripts are normalized on the fly during training, as set by normalize_transcripts: lowercase in the YAML config templates. See Changing the character set for how to configure the character set and normalization. During validation, both the predictions and the reference transcripts are standardized.
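
As a rough illustration of mapping a transcript onto the 28-character set described above, here is a minimal Python sketch. It is not the CAIMAN-ASR implementation: the function name normalize_transcript is hypothetical, and replacing out-of-set characters with a space is an assumption made for this example.

```python
import re

# The 28-character set from the note: 26 lower-case letters,
# an apostrophe and a space.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")


def normalize_transcript(text):
    """Lowercase a transcript and restrict it to the allowed characters.

    Assumption for this sketch: characters outside the set are replaced
    by a space, then runs of spaces are collapsed.
    """
    text = text.lower()
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    return re.sub(r" +", " ", text).strip()


print(normalize_transcript("Hello, World! Don't stop."))
```

Whatever rule is chosen, applying the same one to every transcript matters more than the exact rule, since the tokenizer and language model are built from the normalized text.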
