Data preparation
Having chosen which model configuration to train, you will need to complete the following preprocessing steps:
- Prepare your data in one of the supported training formats: JSON or WebDataset.
- Create a sentencepiece model from your training data (sketched below).
- Record the log-mel statistics of your training data for input feature normalization (sketched below).
- Populate a YAML configuration file with the missing fields.
- Generate an n-gram language model from your training data (sketched below).
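The sentencepiece model can be trained with the sentencepiece Python package. The sketch below is a minimal example, not the exact command used by this repository: the input path, model prefix, vocabulary size and model type are placeholders and should match the tokenizer settings in your YAML config.

```python
import sentencepiece as spm

# Train a sentencepiece model on the training transcripts.
# Paths, vocab_size and model_type are illustrative placeholders;
# align them with the tokenizer section of your YAML config.
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # one normalized transcript per line
    model_prefix="spm_model",        # writes spm_model.model and spm_model.vocab
    vocab_size=1024,
    model_type="unigram",
    character_coverage=1.0,          # keep the full 28-character set
)
```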
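The log-mel statistics step amounts to accumulating per-bin mean and standard deviation over the training set. The following is a rough sketch assuming 80-dimensional log-mel features computed with librosa; the feature parameters, file glob and output file name are assumptions and should match your YAML config and data layout.

```python
import glob

import librosa
import numpy as np

# Accumulate per-bin mean and variance of log-mel features over the training audio.
# n_mels, hop_length and the file glob are illustrative; match them to your config.
sums, sq_sums, n_frames = np.zeros(80), np.zeros(80), 0
for path in glob.glob("train_audio/*.wav"):
    audio, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80, hop_length=160)
    logmel = np.log(mel + 1e-10)          # shape: (n_mels, n_frames)
    sums += logmel.sum(axis=1)
    sq_sums += (logmel ** 2).sum(axis=1)
    n_frames += logmel.shape[1]

mean = sums / n_frames
std = np.sqrt(sq_sums / n_frames - mean ** 2)
np.savez("logmel_stats.npz", mean=mean, std=std)  # consumed for input normalization
```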
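For the n-gram language model, one common choice (an assumption here; this section does not mandate a specific toolkit) is KenLM's lmplz run over the prepared transcripts, for example via a small Python wrapper:

```python
import subprocess

# Build a 4-gram ARPA language model from the training transcripts with KenLM.
# The order, file names, and the assumption that `lmplz` is on PATH are placeholders.
subprocess.run(
    ["lmplz", "-o", "4", "--text", "train_transcripts.txt", "--arpa", "lm_4gram.arpa"],
    check=True,
)
```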
Text normalization
The examples assume a character set of size 28: a space, an apostrophe and 26 lower case letters.
Transcripts will be normalized on the fly during training, as set in the YAML config
templates via `normalize_transcripts: lowercase`.
See Changing the character set
for how to configure the character set and normalization.
During validation, the predictions and reference transcripts
will be standardized.
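As an illustration of the lowercase normalization described above (a sketch, not the exact code used by the training pipeline), normalization restricts a transcript to the 28-character set:

```python
import re

# Illustrative lowercase normalization to the 28-character set:
# space, apostrophe and the 26 lowercase letters. A real pipeline may also
# expand digits and abbreviations; here they are simply dropped.
def normalize_transcript(transcript: str) -> str:
    text = transcript.lower()
    text = re.sub(r"[^a-z' ]", " ", text)   # drop anything outside the character set
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Hello, World! It's here."))  # "hello world it's here"
```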
Text standardization
Training on multiple datasets can suffer simply because the same word is transcribed
differently across the datasets. The YAML config contains the entry
`standardize_text: true|false`, which enables standardization of the training transcripts.
Standardization happens only once, during the data preparation phase, to avoid
unnecessary slowdown between training steps. In principle, it is the same as WER
standardization; see the WER calculation docs for more information on the text
standardization procedure.
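To make the idea concrete, the sketch below shows a purely hypothetical standardizer that maps a handful of variant spellings to a canonical form; the actual rules are defined by the WER-standardization procedure referenced above, not by this example.

```python
import re

# Purely illustrative: a tiny canonicalization table standing in for the real
# WER-standardization rules, which live in the WER calculation docs/code.
CANONICAL = {"ok": "okay", "mr": "mister", "colour": "color"}

def standardize(transcript: str) -> str:
    text = transcript.lower()
    text = re.sub(r"[^a-z' ]", " ", text)            # keep the 28-character set
    words = [CANONICAL.get(w, w) for w in text.split()]
    return " ".join(words)

print(standardize("OK, Mr Jones chose the colour."))
# "okay mister jones chose the color"
```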