Supported Dataset Formats
CAIMAN-ASR supports reading data from four formats:
| Format | Modes | Description | Docs |
|---|---|---|---|
| JSON | training + validation | All audio as wav or flac files in a single directory hierarchy, with transcripts in JSON file(s) referencing these audio files. | [link] |
| WebDataset | training + validation | Audio `<key>.{flac,wav}` files stored with associated `<key>.txt` transcripts in tar file shards. Format described here | [link] |
| Directories | validation | Audio (wav or flac) files and the respective text transcripts are in two separate directories. | [link] |
| Hugging Face | training (using provided conversion script) + validation | Hugging Face Hub datasets | [link] |
To train on your own proprietary dataset you will need to arrange for it to be in the WebDataset or JSON format.
A worked example of how to do this for the JSON format is provided in json_format.md.
The script hugging_face_to_json.py converts a Hugging Face dataset to the JSON format; see here for more details.
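
For the WebDataset format, the layout described in the table (an audio file and a `.txt` transcript sharing one key inside a tar shard) can be sketched with the Python standard library alone. This is a minimal illustration, not the CAIMAN-ASR tooling: the helper names `write_shard` and `list_keys` and the shard filename are hypothetical, and the sketch assumes each audio filename stem is unique and contains no extra dots. Any further requirements (transcript normalization, shard sizes, naming patterns) are not covered here and should be taken from the linked format documentation.

```python
import io
import tarfile
from pathlib import Path

AUDIO_SUFFIXES = {".flac", ".wav"}


def write_shard(pairs, shard_path):
    """Write (audio_path, transcript) pairs into a single tar shard.

    Each sample becomes two tar members that share a key:
    <key>.<flac|wav> and <key>.txt. The key is the audio file's stem
    (an assumption made for this sketch).
    """
    with tarfile.open(shard_path, "w") as tar:
        for audio_path, transcript in pairs:
            audio_path = Path(audio_path)
            suffix = audio_path.suffix.lower()
            if suffix not in AUDIO_SUFFIXES:
                raise ValueError(f"unsupported audio type: {audio_path}")
            key = audio_path.stem
            # Store the audio under a flat name so it sits next to its transcript.
            tar.add(str(audio_path), arcname=f"{key}{suffix}")
            # Write the transcript from memory as <key>.txt.
            text = transcript.encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.txt")
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))


def list_keys(shard_path):
    """Return the sorted keys that have both an audio and a .txt member."""
    audio, text = set(), set()
    with tarfile.open(shard_path, "r") as tar:
        for name in tar.getnames():
            member = Path(name)
            if member.suffix in AUDIO_SUFFIXES:
                audio.add(member.stem)
            elif member.suffix == ".txt":
                text.add(member.stem)
    return sorted(audio & text)
```

Calling `write_shard([("utt1.wav", "hello world")], "shard-000000.tar")` would produce a shard containing `utt1.wav` and `utt1.txt`; `list_keys` is a quick sanity check that every audio file has a matching transcript before the shard is used.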