SHAR format
SHAR is a sharded, tar-based archive format developed as part of Lhotse. CAIMAN-ASR’s training is optimized for SHAR archives, which store and retrieve large datasets efficiently. SHAR also makes it cheap to randomize the data order and to de-duplicate data across data workers. To convert your data into the SHAR format, use:
python ./training/caiman_asr_train/lhotse/scripts/convert.py cuts2shar \
    /path/to/data.cuts.jsonl.gz \
    /path/to/out/dir/ \
    --compression flac
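After a conversion it can be useful to check that the output directory actually contains shards. The following is a minimal sketch using only the Python standard library. It assumes the files are named `<field>.<index>.<extension>` (for example `cuts.000000.jsonl.gz` and `recording.000000.tar`), which is how Lhotse usually lays out SHAR directories; the exact names written by convert.py may differ. The helper `count_shards` is hypothetical and not part of CAIMAN-ASR.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_shards(shar_dir):
    """Count files per field prefix (the part of the name before the first dot)."""
    counts = Counter()
    for path in Path(shar_dir).iterdir():
        if path.is_file():
            counts[path.name.split(".", 1)[0]] += 1
    return dict(counts)


# Demo on a throwaway directory; the file names are assumed, not produced by convert.py.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "cuts.000000.jsonl.gz",
        "cuts.000001.jsonl.gz",
        "recording.000000.tar",
        "recording.000001.tar",
    ):
        (Path(tmp) / name).touch()
    demo_counts = count_shards(tmp)

print(demo_counts)
```

On a real dataset, call `count_shards("/path/to/out/dir/")`. Under the naming assumption above, every field should report the same number of shards.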
These SHAR datasets can then be used in your train_dataset.yaml as follows:
format: shar
canary:
  alpha: 0.5
datasets:
  - /path/to/out/dir/:
  - /some/other/shar/dir/:
See the convert script’s --help for further options.
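One way to picture why sharding helps with randomization and de-duplication across data workers is shown below. This is an illustrative sketch only, not Lhotse's or CAIMAN-ASR's actual sampling code. It assumes every worker shuffles the shard list with the same seed and then takes every n-th shard, so no shard is read by two workers. The function `assign_shards` and the shard names are hypothetical.

```python
import random


def assign_shards(shard_ids, num_workers, seed):
    """Shuffle shards with a shared seed, then deal them out round-robin."""
    order = list(shard_ids)
    # Every worker uses the same seed, so all of them compute the same order.
    random.Random(seed).shuffle(order)
    # Worker w takes positions w, w + num_workers, w + 2 * num_workers, ...
    return [order[w::num_workers] for w in range(num_workers)]


# Hypothetical shard names, used only for illustration.
shards = [f"cuts.{i:06d}.jsonl.gz" for i in range(8)]
per_worker = assign_shards(shards, num_workers=3, seed=0)

for worker, owned in enumerate(per_worker):
    print(worker, owned)
```

Because shuffling happens at the shard level, changing the seed (for example once per epoch) reorders the data without rewriting any archive, and the round-robin split keeps the workers' subsets disjoint.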