Benchmarking CAIMAN_ASR on custom datasets
Data format in JSON
By default, CAIMAN_ASR evaluation runs on the LibriSpeech dev-clean dataset.
To run an evaluation on a custom dataset, generate a JSON manifest containing the transcripts and the paths to the audio files, in the following format:
[
  {
    "transcript": "BLA BLA BLA ...",
    "files": [
      {
        "channels": 1,
        "sample_rate": 16000.0,
        "bitdepth": 16,
        "bitrate": 155000.0,
        "duration": 11.21,
        "num_samples": 179360,
        "encoding": "WAV",
        "silent": false,
        "fname": "test-clean/5683/32879/5683-32879-0004.wav"
      }
    ],
    "original_duration": 11.21,
    "original_num_samples": 179360
  },
  ...
]
For more information, refer to the documentation here, specifically the section "Convert your dataset to the JSON format".
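As a rough illustration of the manifest structure, the following Python sketch builds entries by reading WAV headers with the standard-library `wave` module. It is not the conversion tool described in the referenced documentation, and the helper names `manifest_entry` and `write_manifest` are hypothetical. Two fields are guesses: `bitrate` is computed here as the uncompressed PCM rate (which does not match the `155000.0` in the sample entry above, so the real tooling evidently derives it differently), and `silent` is always set to `false`.

```python
import json
import wave
from pathlib import Path


def manifest_entry(dataset_root: Path, wav_path: Path, transcript: str) -> dict:
    """Build one manifest entry for a WAV file holding a single utterance."""
    with wave.open(str(wav_path), "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bitdepth = wav.getsampwidth() * 8  # bytes per sample -> bits
        num_samples = wav.getnframes()
    duration = num_samples / sample_rate
    return {
        "transcript": transcript,
        "files": [
            {
                "channels": channels,
                "sample_rate": float(sample_rate),
                "bitdepth": bitdepth,
                # Assumption: uncompressed PCM bitrate; the official tooling
                # may compute this field differently.
                "bitrate": float(sample_rate * bitdepth * channels),
                "duration": duration,
                "num_samples": num_samples,
                "encoding": "WAV",
                # Assumption: no silence detection is performed here.
                "silent": False,
                # Path stored relative to the directory holding the manifest.
                "fname": wav_path.relative_to(dataset_root).as_posix(),
            }
        ],
        "original_duration": duration,
        "original_num_samples": num_samples,
    }


def write_manifest(dataset_root: Path, items, manifest_name: str) -> Path:
    """Write a manifest for (relative_wav_path, transcript) pairs."""
    entries = [
        manifest_entry(dataset_root, dataset_root / rel_path, transcript)
        for rel_path, transcript in items
    ]
    out_path = dataset_root / manifest_name
    out_path.write_text(json.dumps(entries, indent=2))
    return out_path
```

Storing `fname` relative to the dataset root keeps the manifest valid when the whole directory is copied elsewhere.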
CTM file
To evaluate user-perceived latency, CAIMAN_ASR requires a CTM file, which contains the ground-truth timestamps of when the speaker finished each word. This can be generated according to the instructions here.
See the instructions for launching the Docker container here, then run the CTM generation command from the instructions above with the model config argument set to:
--model_config configs/testing-1023sp_run.yaml
Notes on the custom dataset format
- The audio files should be in WAV format.
- The audio files, the JSON manifest and the CTM file should be copied under $HOME/.cache/myrtle/benchmark/<custom_dataset_dir>/.
- The JSON manifest should be named <custom_dataset_name>-wav.json.
- The CTM file should be named <custom_dataset_name>.wav.ctm.
- Please make sure that the audio file paths inside the JSON manifest and the CTM file are relative to the directory where the JSON manifest and the CTM file are stored.
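The naming and path rules above can be checked before launching an evaluation. The sketch below is a minimal pre-flight check, not part of CAIMAN_ASR; `check_dataset_layout` is a hypothetical helper. It only inspects the JSON manifest and the presence of the CTM file. It does not parse the CTM contents, since the CTM layout is not described in this section.

```python
import json
from pathlib import Path


def check_dataset_layout(dataset_dir: Path, dataset_name: str) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    manifest = dataset_dir / f"{dataset_name}-wav.json"
    ctm = dataset_dir / f"{dataset_name}.wav.ctm"

    if not ctm.is_file():
        problems.append(f"missing CTM file: {ctm.name}")
    if not manifest.is_file():
        problems.append(f"missing JSON manifest: {manifest.name}")
        return problems  # nothing further to check without a manifest

    for entry in json.loads(manifest.read_text()):
        for file_info in entry["files"]:
            fname = Path(file_info["fname"])
            if fname.is_absolute():
                problems.append(f"path is not relative: {fname}")
            elif fname.suffix.lower() != ".wav":
                problems.append(f"not a WAV file: {fname}")
            elif not (dataset_dir / fname).is_file():
                # Paths are resolved against the manifest's own directory.
                problems.append(f"audio file not found: {fname}")
    return problems
```

It could be called with `Path.home() / ".cache/myrtle/benchmark" / "<custom_dataset_dir>"` as the first argument and `"<custom_dataset_name>"` as the second, substituting the actual directory and dataset names.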
Running the evaluation on custom data
Run the evaluation script according to the instructions in CAIMAN-ASR benchmark with the additional flags:
--data_dir <custom_dataset_dir> --dset <custom_dataset_name>