Batch size hyperparameters
If you are training on an 8 x A100 (80GB) machine, the recommended batch size
hyperparameters are given here. Otherwise, this page
gives guidance on how to select them.
Batch size in CAIMAN-ASR can be fixed or dynamic, depending on the --sampling_mode:
- "fixed": a fixed number of utterances per batch
- "duration": dynamic batches with approximately equal total duration, a.k.a. 'duration batches'
- "1D-bucketing": dynamic duration batches drawn from pre-defined duration buckets
- "1D-bucketing-dyn": dynamic duration batches drawn from dynamically created duration buckets
- "2D-bucketing": dynamic batches with approximately equal VRAM usage, drawn from pre-defined duration/token buckets
- "2D-bucketing-fixed": fixed batches drawn from pre-defined duration/token buckets
- "2D-bucketing-singlet": fixed-size batches drawn from a single duration/token bucket
Every n batches, one optimizer step is taken, where n is given by the
--grad_accumulation_batches argument. Each GPU processes the same number of
batches (though the number of elements per batch may vary between ranks), and
the gradients are reduced across all GPUs before the optimizer step. Hence, the
effective global batch size is:
global_batch_size = avg_per_gpu_batch_size * num_gpus * grad_accumulation_batches
This is the batch size seen by the model before taking an optimizer step.
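For illustration, this calculation can be written out directly; the per-GPU batch size, GPU count, and grad_accumulation_batches below are example values, not recommendations:

```python
# Effective global batch size for one optimizer step.
# All numbers are illustrative, not recommended settings.
avg_per_gpu_batch_size = 32    # average utterances per per-GPU batch
num_gpus = 8                   # e.g. an 8 x A100 machine
grad_accumulation_batches = 4  # batches accumulated per optimizer step

global_batch_size = avg_per_gpu_batch_size * num_gpus * grad_accumulation_batches
print(global_batch_size)  # 1024 utterances per optimizer step
```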
RNN-T models require a large global_batch_size in order to reach good WERs,
but the larger the value, the longer training takes. The recommended value is
>1024 utterances or >5 hours of audio.
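When using duration-based batches, the same recommendation can be checked in hours rather than utterances. The average batch duration below is an assumed figure for illustration only:

```python
# Check a configuration against the >5 hours-per-step recommendation.
# The average per-GPU batch duration is an assumed value.
avg_batch_duration_s = 360.0   # average seconds of audio per per-GPU batch
num_gpus = 8
grad_accumulation_batches = 8

global_batch_hours = (avg_batch_duration_s * num_gpus * grad_accumulation_batches) / 3600
print(global_batch_hours)  # 6.4 hours of audio per optimizer step
```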
The highest training throughput is achieved by using the highest
PER_GPU_BATCH_SIZE (and lowest grad_accumulation_batches) possible without
incurring an out-of-memory (OOM) error.
batch_split_factor
The joint network output is a 4-dimensional tensor that requires a large amount
of GPU VRAM. For the models in this repo, the maximum
PER_GPU_JOINT_BATCH_SIZE is much lower than the maximum PER_GPU_BATCH_SIZE
that can be run through the encoder and prediction networks without incurring
an OOM. As a result, when PER_GPU_JOINT_BATCH_SIZE = PER_GPU_BATCH_SIZE, the joint
network's memory requirement caps the batch size, and the GPU is underutilised
during the encoder and prediction forward and backward passes. This matters
because these networks constitute the majority of the training-time compute.
The batch_split_factor arg makes it possible to increase the
PER_GPU_BATCH_SIZE whilst keeping the PER_GPU_JOINT_BATCH_SIZE constant
where:
PER_GPU_JOINT_BATCH_SIZE = PER_GPU_BATCH_SIZE / batch_split_factor
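For example, with an assumed per-GPU batch size of 64 and a batch split factor of 4, each per-GPU batch is passed through the joint network in four smaller chunks:

```python
# Relationship between the encoder/prediction batch size and the joint batch size.
# Example numbers only; batch_split_factor is assumed to divide PER_GPU_BATCH_SIZE evenly.
PER_GPU_BATCH_SIZE = 64
batch_split_factor = 4

PER_GPU_JOINT_BATCH_SIZE = PER_GPU_BATCH_SIZE // batch_split_factor
print(PER_GPU_JOINT_BATCH_SIZE)  # 16 utterances per joint-network forward/backward pass
```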
Starting from the default --batch_split_factor=1, it is usually possible to
achieve higher throughput by increasing PER_GPU_BATCH_SIZE together with
batch_split_factor. You can then decrease grad_accumulation_batches to keep
your effective global batch size constant.
Changing batch_split_factor should not impact the WER.
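The rebalancing described above can be sketched with made-up numbers: doubling PER_GPU_BATCH_SIZE and batch_split_factor leaves the joint batch size unchanged, so grad_accumulation_batches can be halved to keep the global batch size constant:

```python
num_gpus = 8

# Before: default batch_split_factor of 1.
per_gpu_batch_size, batch_split_factor, grad_accumulation_batches = 32, 1, 4
print(per_gpu_batch_size * num_gpus * grad_accumulation_batches)  # global batch: 1024
print(per_gpu_batch_size // batch_split_factor)                   # joint batch:  32

# After: double the per-GPU batch and the split factor, halve the accumulation.
per_gpu_batch_size, batch_split_factor, grad_accumulation_batches = 64, 2, 2
print(per_gpu_batch_size * num_gpus * grad_accumulation_batches)  # global batch: 1024 (unchanged)
print(per_gpu_batch_size // batch_split_factor)                   # joint batch:  32 (unchanged)
```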
Summary
In your training command it is recommended to:

1. Set --grad_accumulation_batches=1 and --batch_split_factor=2.
2. Tune your sampler arguments (duration, bucketing, fixed batch size, etc.) to get the maximum GPU utilisation and throughput.
3. Try increasing --batch_split_factor and repeat step 2. If throughput increases, repeat; otherwise reset the value and continue.
4. Calculate your average batch duration.
5. Set --grad_accumulation_batches such that the total effective batch size matches your desired duration.

When testing these settings, it is recommended to use your full training dataset, since the utterance-length distribution is important.
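As a worked example of steps 4 and 5 (the average batch duration, GPU count, and 5-hour target below are assumptions, not measured or recommended values):

```python
# Steps 4-5: pick grad_accumulation_batches from the measured average per-GPU
# batch duration and the desired effective batch duration. Illustrative numbers only.
import math

avg_batch_duration_s = 420.0   # step 4: average seconds of audio per per-GPU batch
num_gpus = 8
target_hours_per_step = 5.0    # desired effective batch duration per optimizer step

grad_accumulation_batches = math.ceil(
    target_hours_per_step * 3600 / (avg_batch_duration_s * num_gpus)
)
print(grad_accumulation_batches)  # 6
print(avg_batch_duration_s * num_gpus * grad_accumulation_batches / 3600)  # 5.6 hours/step
```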