Data preparation
Having chosen which model configuration to train, you will need to complete the following preprocessing steps:
- Prepare your data in one of the supported training formats: JSON or WebDataset.
- Create a sentencepiece model from your training data.
- Record your training data log-mel stats for input feature normalization.
- Populate a YAML configuration file with the missing fields.
- Generate an n-gram language model from your training data.
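As a rough illustration of the log-mel stats step in the list above, the sketch below accumulates a per-bin mean and standard deviation over feature frames. It assumes the recorded stats are a mean and standard deviation per mel bin, which this page does not confirm. The function name `accumulate_stats` and the toy frames are hypothetical and are not part of the toolkit.

```python
import math


def accumulate_stats(frames, n_bins):
    """Return per-bin (means, stds) over an iterable of feature frames.

    Hypothetical helper: each frame is a sequence of n_bins log-mel values.
    """
    count = 0
    total = [0.0] * n_bins
    total_sq = [0.0] * n_bins
    for frame in frames:
        count += 1
        for i, value in enumerate(frame):
            total[i] += value
            total_sq[i] += value * value
    if count == 0:
        raise ValueError("no frames to accumulate")
    means = [s / count for s in total]
    # Population variance; clamp at zero to guard against rounding error.
    stds = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(total_sq, means)
    ]
    return means, stds


# Toy input: two frames with two mel bins each (made-up values).
toy_frames = [[1.0, 2.0], [3.0, 2.0]]
means, stds = accumulate_stats(toy_frames, n_bins=2)
print(means, stds)
```

Accumulating sums rather than storing every frame keeps memory use constant, which matters when the statistics are computed over a whole training set.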
Text normalization
The examples assume a character set of size 28: a space, an apostrophe and 26 lower case letters.
Transcripts are normalized on the fly during training, as set by normalize_transcripts: lowercase in the YAML config templates.
See Changing the character set for how to configure the character set and normalization.
During validation, the predictions and reference transcripts
will be standardized.
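The 28-character set described above can be sketched as follows. This is a minimal illustration and not the toolkit's implementation: the name `normalize_transcript` is hypothetical, and dropping out-of-set characters (rather than mapping or rejecting them) is an assumption.

```python
import string

# 26 lower case letters plus a space and an apostrophe: 28 characters.
CHARSET = set(string.ascii_lowercase + " '")


def normalize_transcript(text):
    """Lowercase the text and keep only characters in CHARSET.

    Hypothetical helper; removing unsupported characters is an assumption.
    """
    lowered = text.lower()
    kept = "".join(ch for ch in lowered if ch in CHARSET)
    # Collapse any runs of spaces left behind by removed characters.
    return " ".join(kept.split())


print(len(CHARSET))
print(normalize_transcript("Don't STOP, please!"))
```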
Text standardization
Training on multiple datasets can negatively affect WER because the same word
may be transcribed with different conventions across the datasets. You can
make the training transcripts consistent
by setting standardize_text: true in the YAML config (this is the default).
This applies the same standardization rules that are used in validation,
as described in the WER Standardization section of the WER calculation docs,
to the training transcripts.
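To see why inconsistent conventions matter, the toy sketch below computes a word error rate for two transcripts that differ only in spelling convention, before and after applying a small rule table. The entries in `HYPOTHETICAL_RULES` are made up for illustration; the actual rules are those described in the WER calculation docs, and the helper names here are assumptions.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up rules for illustration only; not the toolkit's rule set.
HYPOTHETICAL_RULES = {"okay": "ok", "alright": "all right"}


def standardize(text):
    """Apply the hypothetical rules word by word."""
    return " ".join(HYPOTHETICAL_RULES.get(word, word) for word in text.split())


reference = "okay that is alright"
hypothesis = "ok that is all right"
raw_wer = wer(reference, hypothesis)
standardized_wer = wer(standardize(reference), standardize(hypothesis))
print(raw_wer, standardized_wer)
```

In this toy case the two transcripts say the same thing, yet the raw comparison counts three word errors out of four reference words. Applying one convention to both sides removes them.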