Training times

Training times for v1.12.0 on an 8 x A100 (80GB) system are as follows:

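| Model | Train-time (days) | Throughput (utt/s) | Throughput (s/s) | No. of updates | grad_accumulation_batches | batch_split_factor |
|-------|-------------------|--------------------|------------------|----------------|---------------------------|--------------------|
| base  | 0.9               | 1400               | 23,200           | 100k           | 1                         | 8                  |
| large | 1.8               | 700                | 11,700           | 100k           | 1                         | 16                 |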

Training times for v1.12.0 on a 2 x RTX4090 (24GB) system are as follows:

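| Model | Train-time (days) | Throughput (utt/s) | Throughput (s/s) | No. of updates | grad_accumulation_batches | batch_split_factor |
|-------|-------------------|--------------------|------------------|----------------|---------------------------|--------------------|
| base  | 8.4*              | 150                | 2,500            | 100k           | 8                         | 8                  |
| large | 28*               | 45                 | 750              | 100k           | 16                        | 8                  |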

where:

  • Throughput (s/s) is the number of seconds of audio trained on per second (higher is better).
  • Throughput (utt/s) is the number of samples/utterances seen per second during training (higher is better). NOTE: This metric is deprecated and will be removed in a future update; it is provided here for comparison.
  • No. of updates is the number of optimizer steps at --global_batch_size=1024 that are required to train the models on the 13k hrs training dataset. You may need fewer steps when training with less data.
  • grad_accumulation_batches is the number of gradient accumulation steps performed on each GPU before taking an optimizer step.
  • batch_split_factor is the number of sub-batches that the PER_GPU_BATCH_SIZE is split into before these sub-batches are passed through the joint network and loss.
  • Times appended with a ‘*’ are estimates from throughput scaling and extrapolation.
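As a rough cross-check of the tables, training time can be estimated from the utterance throughput and the number of updates. The sketch below is only an illustration: the function name `estimate_train_days` is hypothetical, and it assumes that every update consumes exactly `--global_batch_size` utterances and that throughput stays constant for the whole run.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_train_days(num_updates: int, global_batch_size: int, utt_per_sec: float) -> float:
    """Estimate training time in days from utterance throughput.

    Assumes each optimizer step processes `global_batch_size` utterances
    and that `utt_per_sec` is constant throughout training.
    """
    if utt_per_sec <= 0:
        raise ValueError("utt_per_sec must be positive")
    total_utterances = num_updates * global_batch_size
    return total_utterances / utt_per_sec / SECONDS_PER_DAY


# Throughput (utt/s) values taken from the tables above.
configs = [
    ("base,  8 x A100", 1400),
    ("large, 8 x A100", 700),
    ("base,  2 x RTX4090", 150),
    ("large, 2 x RTX4090", 45),
]

for name, utt_per_sec in configs:
    days = estimate_train_days(num_updates=100_000, global_batch_size=1024, utt_per_sec=utt_per_sec)
    print(f"{name}: ~{days:.2f} days")
```

These estimates come out slightly below the tabulated train times (for example about 0.85 days against 0.9 for base on 8 x A100), presumably because the tabulated figures include time that is not captured by raw training throughput.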

For more details on these hyper-parameters, including how to set them, please refer to the batch size arguments documentation. For information on tuning DALI parameters, see the heterogeneous CPU page.
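To illustrate how the batch size settings in the tables might fit together, the sketch below derives a per-GPU batch size and a joint-network sub-batch size. It assumes the usual relationship global batch size = number of GPUs x grad_accumulation_batches x per-GPU batch size, which is not stated on this page; the helper names are hypothetical, and the batch size arguments documentation remains the authoritative source.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int, grad_accumulation_batches: int) -> int:
    """Per-GPU batch size, ASSUMING
    global_batch_size = num_gpus * grad_accumulation_batches * per_gpu_batch_size."""
    denominator = num_gpus * grad_accumulation_batches
    if global_batch_size % denominator != 0:
        raise ValueError("global_batch_size must divide evenly across GPUs and accumulation steps")
    return global_batch_size // denominator


def sub_batch_size(per_gpu_batch: int, batch_split_factor: int) -> int:
    """Size of each sub-batch passed through the joint network and loss."""
    if per_gpu_batch % batch_split_factor != 0:
        raise ValueError("per-GPU batch size must be divisible by batch_split_factor")
    return per_gpu_batch // batch_split_factor


# (description, num_gpus, grad_accumulation_batches, batch_split_factor) from the tables above.
rows = [
    ("base,  8 x A100", 8, 1, 8),
    ("large, 8 x A100", 8, 1, 16),
    ("base,  2 x RTX4090", 2, 8, 8),
    ("large, 2 x RTX4090", 2, 16, 8),
]

for name, num_gpus, grad_acc, split in rows:
    per_gpu = per_gpu_batch_size(1024, num_gpus, grad_acc)
    print(f"{name}: per-GPU batch {per_gpu}, joint/loss sub-batch {sub_batch_size(per_gpu, split)}")
```

Under this assumption, the smaller-memory RTX4090 system reaches the same global batch size of 1024 by using more accumulation steps with a smaller per-GPU batch.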