Training times
Training times for v1.12.0 on an 8 x A100 (80GB)
system are as follows:
Model | Train-time (days) | Throughput (utt/s) | Throughput (s/s) | No. of updates | grad_accumulation_batches | batch_split_factor |
---|---|---|---|---|---|---|
base | 0.9 | 1400 | 23,200 | 100k | 1 | 8 |
large | 1.8 | 700 | 11,700 | 100k | 1 | 16 |
Training times for v1.12.0 on a 2 x RTX4090 (24GB)
system are as follows:
Model | Train-time (days) | Throughput (utt/s) | Throughput (s/s) | No. of updates | grad_accumulation_batches | batch_split_factor |
---|---|---|---|---|---|---|
base | 8.4* | 150 | 2,500 | 100k | 8 | 8 |
large | 28* | 45 | 750 | 100k | 16 | 8 |
Training
where:
- Throughput (s/s) is the number of seconds of audio trained on per second (higher is better).
- Throughput (utt/s) is the number of samples/utterances seen per second during training (higher is better). NOTE: This metric is deprecated and will be removed in a future update, it is provided here for comparison.
- No. of updates is the number of optimiser steps at
--global_batch_size=1024
that are required to train the models on the 13k hrs training dataset. You may need fewer steps when training with less data grad_accumulation_batches
is the number of gradient accumulation steps performed on each GPU before taking an optimizer stepbatch_split_factor
is the number of sub-batches that thePER_GPU_BATCH_SIZE
is split into before these sub-batches are passed through the joint network and loss.- Times appended with a ‘*’ are estimates from throughput scaling and extrapolation.
For more details on these hyper-parameters, including how to set them, please refer to the batch size arguments documentation. For some information about tuning DALI parameters see the heterogeneous CPU page.