Batch size hyperparameters

If you are training on an 8 x A100 (80GB) machine, the recommended batch size hyperparameters are given here. Otherwise, this page gives guidance on how to select them.

Batch size in CAIMAN-ASR can be fixed or dynamic, depending on the --sampling_mode:

  • “fixed”: a fixed number of utterances per batch
  • “duration”: dynamic batches with approximately equal total duration, AKA ‘duration batches’
  • “1D-bucketing”: dynamic duration batches drawn from pre-defined duration buckets
  • “1D-bucketing-dyn”: dynamic duration batches drawn from dynamically created duration buckets
  • “2D-bucketing”: dynamic batches with approximately equal VRAM usage, drawn from pre-defined duration/token buckets
  • “2D-bucketing-fixed”: fixed batches drawn from pre-defined duration/token buckets
  • “2D-bucketing-singlet”: fixed-size batches drawn from a single duration/token bucket
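
To illustrate the difference between fixed and duration batching, here is a minimal sketch (not the CAIMAN-ASR sampler itself) of how utterances can be grouped so that each batch has approximately equal total audio duration; max_batch_duration_s is a hypothetical parameter:

```python
# Minimal sketch of duration-based batching, assuming a list of utterance
# durations in seconds. The real sampler also handles bucketing, shuffling
# and distributed training.
def duration_batches(utterance_durations_s, max_batch_duration_s=240.0):
    """Group utterance indices into batches of roughly equal total duration."""
    batch, batch_duration = [], 0.0
    for idx, dur in enumerate(utterance_durations_s):
        if batch and batch_duration + dur > max_batch_duration_s:
            yield batch
            batch, batch_duration = [], 0.0
        batch.append(idx)
        batch_duration += dur
    if batch:
        yield batch

# Batches contain a varying number of utterances but a similar total duration.
durations = [12.0, 3.5, 30.0, 7.2, 15.0, 60.0, 5.0, 90.0, 2.0]
for b in duration_batches(durations, max_batch_duration_s=100.0):
    print(b, sum(durations[i] for i in b))
```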

Every n batches, one optimizer step is taken, where n is given by the --grad_accumulation_batches argument. Each GPU processes the same number of batches (although the number of elements per batch may vary between ranks), and the gradients are reduced across all GPUs before the optimizer step. Hence, the effective global batch size is:

global_batch_size = avg_per_gpu_batch_size * num_gpus * grad_accumulation_batches

This is the batch size seen by the model before taking an optimizer step.
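
For example (illustrative numbers, not recommended settings):

```python
# Worked example of the formula above; numbers are illustrative only.
avg_per_gpu_batch_size = 32
num_gpus = 8
grad_accumulation_batches = 4

global_batch_size = avg_per_gpu_batch_size * num_gpus * grad_accumulation_batches
print(global_batch_size)  # 1024 utterances per optimizer step
```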

RNN-T models require a large global_batch_size to reach good WERs, but the larger the value, the longer training takes. The recommended value is >1024 utterances or >5 hours of audio per optimizer step.

The highest training throughput is achieved by using the highest PER_GPU_BATCH_SIZE (and lowest grad_accumulation_batches) possible without incurring an out-of-memory (OOM) error.
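
One way to find this limit empirically is to probe increasing batch sizes until a CUDA OOM is raised. A rough sketch, assuming a hypothetical run_training_step(batch_size) callable that performs one forward/backward pass (not part of this repo):

```python
import torch

def max_per_gpu_batch_size(run_training_step, candidates=(16, 32, 48, 64, 96, 128)):
    """Return the largest candidate batch size that completes one step without OOM.

    run_training_step is a hypothetical callable that runs a single
    forward/backward pass at the given batch size.
    """
    best = None
    for bs in candidates:
        try:
            run_training_step(bs)
            best = bs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            break
    return best
```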

batch_split_factor

The joint network output is a 4-dimensional tensor that requires a large amount of GPU VRAM. For the models in this repo, the maximum PER_GPU_JOINT_BATCH_SIZE is much lower than the maximum PER_GPU_BATCH_SIZE that can be run through the encoder and prediction networks without incurring an OOM. When PER_GPU_JOINT_BATCH_SIZE=PER_GPU_BATCH_SIZE, the joint network therefore caps the per-GPU batch size, and the GPU is underutilised during the encoder and prediction forward and backward passes. This matters because these networks constitute the majority of the training-time compute.
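
To see why, consider the joint output shape of (batch, encoder frames, target tokens + 1, vocab size). A back-of-the-envelope estimate, with assumed dimensions that are not the repo's defaults:

```python
# Rough estimate of joint-network logit memory; all dimensions are assumptions.
batch = 32          # PER_GPU_JOINT_BATCH_SIZE
frames = 500        # encoder output frames per utterance
tokens = 120        # target tokens per utterance
vocab = 8192        # vocabulary size
bytes_per_el = 2    # fp16/bf16

logits_gib = batch * frames * (tokens + 1) * vocab * bytes_per_el / 2**30
print(f"{logits_gib:.1f} GiB for the joint logits alone")  # ~29.5 GiB
```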

The batch_split_factor arg makes it possible to increase the PER_GPU_BATCH_SIZE whilst keeping the PER_GPU_JOINT_BATCH_SIZE constant, where:

PER_GPU_JOINT_BATCH_SIZE = PER_GPU_BATCH_SIZE / batch_split_factor

Starting from the default --batch_split_factor=1, it is usually possible to achieve higher throughput by increasing PER_GPU_BATCH_SIZE and batch_split_factor together. You can then decrease the number of grad accumulation batches to keep your effective global batch size constant.

Changing batch_split_factor should not impact the WER.
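
Conceptually, batch splitting runs the encoder and prediction networks once on the full per-GPU batch, then runs the joint network and transducer loss on batch_split_factor smaller slices. A simplified sketch of this idea, not the repo's actual implementation (joint_net, loss_fn and targets are hypothetical stand-ins):

```python
import torch

def joint_with_batch_split(enc_out, pred_out, joint_net, loss_fn, targets,
                           batch_split_factor):
    """Run the joint network and loss on slices that are batch_split_factor
    times smaller than the encoder/prediction batch."""
    # Detach so that each slice's backward pass stops at the joint inputs;
    # gradients w.r.t. the full encoder/prediction outputs accumulate here.
    enc_det = enc_out.detach().requires_grad_(True)
    pred_det = pred_out.detach().requires_grad_(True)

    total_loss = 0.0
    for enc_i, pred_i, tgt_i in zip(enc_det.chunk(batch_split_factor),
                                    pred_det.chunk(batch_split_factor),
                                    targets.chunk(batch_split_factor)):
        logits = joint_net(enc_i, pred_i)  # small (b/K, T, U+1, V) tensor
        loss = loss_fn(logits, tgt_i) / batch_split_factor
        loss.backward()                    # fills slices of enc_det.grad / pred_det.grad
        total_loss += loss.item()

    # Single backward pass through the encoder and prediction networks.
    torch.autograd.backward([enc_out, pred_out], [enc_det.grad, pred_det.grad])
    return total_loss
```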

Summary

In your training command it is recommended to:

  1. Set --grad_accumulation_batches=1, --batch_split_factor=2.
  2. Tune your sampler arguments (duration, bucketing, fixed batch size, etc.) to get the maximum GPU utilisation and throughput.
  3. Try increasing --batch_split_factor and repeating step 2. If throughput increases, repeat; otherwise reset to the previous value and continue.
  4. Calculate your average batch duration.
  5. Set --grad_accumulation_batches such that the total effective batch size reaches your desired duration (see the worked example at the end of this section).

When testing these settings, it is recommended to use your full training dataset, since the utterance-length distribution affects the batches that are produced.
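
As a worked example of steps 4 and 5, with made-up numbers:

```python
import math

# Illustrative numbers only: measure the average batch duration on your own data.
avg_batch_duration_s = 240.0      # average audio per per-GPU batch (step 4)
num_gpus = 8
target_global_batch_hours = 5.0   # aiming for >5 hours of audio per optimizer step

grad_accumulation_batches = math.ceil(
    target_global_batch_hours * 3600 / (avg_batch_duration_s * num_gpus)
)
print(grad_accumulation_batches)  # 10 -> a global batch of ~5.3 hours
```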