PER Calculation

PER Formula

The Punctuation Error Rate (PER) is calculated with the same formula as the WER, applied to punctuation marks instead of words. It is the number of insertions, deletions, and substitutions of punctuation marks divided by the number of punctuation marks in the reference text.

Currently, the default punctuation marks are `,`, `.`, and `?`. They can be changed by passing a different list with the `--punctuation_marks` argument.

The punctuation normalization rules are:

    `;` -> `,` semi-colon to comma
    `:` -> `.` colon to period
    `!` -> `.` exclamation mark to period
    `-` -> ` ` hyphen to space

These rules are applied to both the hypothesis and the reference before the PER is calculated. Normalization can be disabled by passing the `--punctuation_normalization=False` argument.
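The normalization rules above can be sketched as a simple character mapping. This is a minimal illustration, not the tool's actual implementation; the names `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical.

```python
# Mapping copied from the normalization rules listed above.
# PUNCTUATION_MAP and normalize_punctuation are illustrative names (assumptions).
PUNCTUATION_MAP = {
    ";": ",",  # semi-colon to comma
    ":": ".",  # colon to period
    "!": ".",  # exclamation mark to period
    "-": " ",  # hyphen to space
}


def normalize_punctuation(text: str) -> str:
    """Replace each mapped character with its normalized form."""
    return text.translate(str.maketrans(PUNCTUATION_MAP))


print(normalize_punctuation("Hi, dear! Nice to see you."))
# Hi, dear. Nice to see you.
```

With this mapping, a `!` in one text and a `.` in the other compare as equal once both texts have been normalized.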

For example:

Hypothesis: Hi  dear. Nice to see you, how are you
Reference:  Hi, dear! Nice to see you. How are you?

The punctuation error rate is calculated as:

$$ PER = \frac{S + D + I}{N} \times 100 = \frac{1 + 2 + 0}{4} \times 100 = 75\% $$

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N=4 is the number of punctuation marks in the reference text.

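In this example, the two deletions are the missing comma after "Hi" and the missing question mark at the end of the hypothesis, and the substitution is the comma used in place of the period after "Nice to see you". Note that the `.` in the hypothesis against the `!` in the reference is not counted as a substitution, because punctuation normalization maps `!` to `.`.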
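The worked example can be reproduced with a short sketch. It makes a simplifying assumption: only the sequences of punctuation marks are aligned (with a standard edit distance) and the words are ignored, which may differ from how the tool itself aligns the texts. All function names here are hypothetical.

```python
# Illustrative sketch only: aligns punctuation sequences and ignores words.
# Function and variable names are assumptions, not the tool's API.
NORMALIZATION = {";": ",", ":": ".", "!": ".", "-": " "}
DEFAULT_MARKS = (",", ".", "?")


def punctuation_sequence(text, marks=DEFAULT_MARKS, normalize=True):
    """Return the punctuation marks of `text` in order of appearance."""
    if normalize:
        text = text.translate(str.maketrans(NORMALIZATION))
    return [ch for ch in text if ch in marks]


def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: mark missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra mark in hypothesis
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def punctuation_error_rate(reference, hypothesis, normalize=True):
    """PER in percent: (S + D + I) / N * 100."""
    ref = punctuation_sequence(reference, normalize=normalize)
    hyp = punctuation_sequence(hypothesis, normalize=normalize)
    if not ref:
        raise ValueError("reference contains no punctuation marks")
    return edit_distance(ref, hyp) / len(ref) * 100


reference = "Hi, dear! Nice to see you. How are you?"
hypothesis = "Hi  dear. Nice to see you, how are you"
print(punctuation_error_rate(reference, hypothesis))  # 75.0
```

With normalization disabled, the `!` in the reference is not one of the default marks and is therefore dropped from the reference sequence, so the result changes. This illustrates why both texts are normalized before scoring.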