Performance of ISV-based ASV system on data with no attacks
Published: 2 years, 8 months ago
Voice activity detection is based on the modulation of the energy around 4 Hz. The features consist of 19 MFCCs plus energy, together with their first and second derivatives. The models use 256 Gaussian components.
The goal of an ASV system is to correctly verify the claimed identity of the user. During training, the system builds a speech model for each registered user. When the system is evaluated on the development set (for which the identity of each audio sample is known), the resulting scores fall into two groups: genuine scores, from claims of the correct identity, and zero-effort impostor scores, from users claiming the wrong identity. A threshold on the scores is then chosen so that the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal. This common rate is called the Equal Error Rate (err in the table below). The score value at which the two rates meet is the EER threshold (threshold in the table), since this is the operating point of the system that leads to the EER.
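The threshold selection described above can be sketched in a few lines of Python. This is only an illustration, not the code used for these experiments: the function names are hypothetical, the scores are made up, and it assumes that a higher score means a more likely genuine claim and that a claim is accepted when its score is greater than or equal to the threshold.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR), assuming scores >= threshold are accepted."""
    # FAR: fraction of impostor scores that are wrongly accepted.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    # FRR: fraction of genuine scores that are wrongly rejected.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine, impostor):
    """Return (threshold, EER) where |FAR - FRR| is smallest.

    Every observed score is tried as a candidate threshold. On finite
    data FAR and FRR may never be exactly equal, so the EER is reported
    as their average at the closest point.
    """
    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)

    candidates = sorted(set(genuine) | set(impostor))
    best = min(candidates, key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, (far + frr) / 2


# Made-up development-set scores, for illustration only.
dev_genuine = [2.1, 1.7, 0.9, 2.5, 0.2]
dev_impostor = [-1.3, 0.4, -0.2, -2.0, 1.1]

threshold, eer = eer_threshold(dev_genuine, dev_impostor)
print(f"EER threshold = {threshold}, EER = {eer:.2f}")
```

With these toy scores the two rates meet at a threshold of 0.9, where one of five impostor scores is accepted and one of five genuine scores is rejected. Real toolkits typically interpolate between candidate thresholds rather than scanning observed scores.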
Applying the EER threshold obtained from the development set to the scores of the test set yields another pair of FAR (far_test in the table) and FRR (frr_test in the table) values, which measure the system's performance in uncontrolled evaluation settings. In a perfectly consistent ASV system, the FAR and FRR values on the test set would equal those obtained on the development set. In practice they differ, so to summarize the performance of the system in one value, the Half Total Error Rate (hter in the table) is computed as the mean of the test-set FAR and FRR. The HTER is then used as the overall measure of ASV system performance.
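As a rough illustration of the test-set step, the sketch below applies a fixed threshold to test scores and averages the two error rates. The threshold value and the scores are invented, the function names are hypothetical, and the code again assumes that claims scoring at or above the threshold are accepted.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR), assuming scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def hter(genuine, impostor, threshold):
    """Return (FAR, FRR, HTER) for a threshold fixed in advance."""
    far, frr = far_frr(genuine, impostor, threshold)
    return far, frr, (far + frr) / 2


# Hypothetical threshold, standing in for the one tuned on the development set.
dev_threshold = 0.9

# Made-up test-set scores, for illustration only.
test_genuine = [1.8, 0.5, 2.2, 0.7]
test_impostor = [-0.7, 1.2, 0.1, -1.5]

far_test, frr_test, hter_value = hter(test_genuine, test_impostor, dev_threshold)
print(f"far_test = {far_test}, frr_test = {frr_test}, hter = {hter_value}")
```

Here the threshold is not re-tuned on the test data, so the two rates are free to diverge (0.25 and 0.5 in this toy case), and the HTER of 0.375 summarizes both in a single number.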