Word timestamps
In the following discussion, "timestamps" refers only to the timestamps of the finals.
The model emits a timestamp for each token; these can be grouped into word timestamps, which define the start/end time of each word.
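To make the grouping concrete, here is a minimal sketch. It assumes a SentencePiece-style tokenizer (where "▁" marks the start of a new word) and one emission timestamp per token; the token convention and the `Word` type are illustrative, not the model's actual output format.

```python
# Illustrative sketch only: assumes SentencePiece-style tokens where "▁" marks
# the start of a new word, and one emission timestamp (in seconds) per token.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # timestamp of the word's first token, in seconds
    end: float    # timestamp of the word's last token, in seconds

def group_tokens_into_words(tokens: list[str], timestamps: list[float]) -> list[Word]:
    """Group per-token timestamps into per-word start/end times."""
    words: list[Word] = []
    for token, ts in zip(tokens, timestamps):
        if token.startswith("▁") or not words:
            # First token of a new word: its timestamp is the word's start time.
            words.append(Word(text=token.lstrip("▁"), start=ts, end=ts))
        else:
            # Continuation token: append its text and push out the word's end time.
            words[-1].text += token
            words[-1].end = ts
    return words
```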
Average accumulated shift
The accuracy of the word timestamps is quantified with the average accumulated shift (AAS) metric (see https://arxiv.org/pdf/2301.12343). This is defined as:
$$ \text{AAS} = \frac{1}{N} \sum_{i=1}^N \frac{ \left| t_i^{\text{predStart}} - t_i^{\text{refStart}} \right| + \left| t_i^{\text{predEnd}} - t_i^{\text{refEnd}} \right| } {2} $$
Here N is the number of words. This can be understood as the average absolute difference (in time) between the reference start/end times and the model's predicted start/end times for each word.
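For reference, a minimal sketch of computing AAS from aligned word timings, assuming the predicted and reference words have already been matched one-to-one (the alignment step is not shown):

```python
# Sketch of the AAS formula above. `pred` and `ref` are aligned lists of
# (start, end) word times in seconds: pred[i] corresponds to ref[i].
def average_accumulated_shift(pred: list[tuple[float, float]],
                              ref: list[tuple[float, float]]) -> float:
    shifts = [
        (abs(p_start - r_start) + abs(p_end - r_end)) / 2
        for (p_start, p_end), (r_start, r_end) in zip(pred, ref)
    ]
    return sum(shifts) / len(shifts)
```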
Latency
The model's token-level timestamps are known to lag behind real time (see Emission Latency). To correct for this when estimating word timestamps, a latency offset should be subtracted from the token timestamp at the beginning and end of each word. This is supported via the following flags:
--latency_head_offset <value in seconds>
--latency_tail_offset <value in seconds>
In general, these offsets require model-, domain-, and decoder-specific calibration.
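The correction itself amounts to subtracting the calibrated offsets from each word's boundaries, roughly as sketched below; the function name and the tuple representation are illustrative, and the actual correction is applied internally by the flags above.

```python
# Sketch of the latency correction applied per word: subtract the calibrated
# head offset from each start time and the tail offset from each end time.
def apply_latency_offsets(words: list[tuple[float, float]],
                          head_offset: float,
                          tail_offset: float) -> list[tuple[float, float]]:
    return [(start - head_offset, end - tail_offset) for start, end in words]
```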
Measuring AAS
If the --calculate_emission_latency flag is passed to the Validation script, several AAS-related metrics are measured. These include:
"optimal_head_offset"
:
$$ \text{median} \left\lbrace t_i^{\text{predStart}} - t_i^{\text{refStart}} \mid i \in 1\ldots N \right\rbrace $$
"optimal_tail_offset"
:
$$ \text{median} \left\lbrace t_i^{\text{predEnd}} - t_i^{\text{refEnd}} \mid i \in 1\ldots N \right\rbrace $$
"raw_AAS"
: The AAS calculated without any latency correction"fixed_AAS"
: The AAS calculated with the head/tail offset supplied via the CLI flags"corrected_AAS"
: The AAS calculated using the computed optimal head/tail offset
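The relationship between these metrics can be sketched as follows, reusing the average_accumulated_shift helper from the earlier sketch; this is illustrative and not the validation script's actual implementation.

```python
# Sketch: derive the optimal offsets as medians of the per-word boundary shifts,
# then recompute AAS on the offset-corrected predictions ("corrected_AAS").
from statistics import median

def corrected_aas(pred: list[tuple[float, float]],
                  ref: list[tuple[float, float]]) -> tuple[float, float, float]:
    head_offset = median(p[0] - r[0] for p, r in zip(pred, ref))  # optimal_head_offset
    tail_offset = median(p[1] - r[1] for p, r in zip(pred, ref))  # optimal_tail_offset
    corrected_pred = [(s - head_offset, e - tail_offset) for s, e in pred]
    return head_offset, tail_offset, average_accumulated_shift(corrected_pred, ref)
```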