Advancing No-Reference Speech Intelligibility Prediction to the Two-Talker Condition
Abstract
Speech Intelligibility (SI) is a crucial metric for assessing speech perception and comprehension, typically defined as the proportion of words in a sentence that can be correctly recognized by the listener. Speech Intelligibility Prediction Algorithms (SIPAs) aim to replace costly and time-consuming listening tests by accurately estimating SI under various noise conditions. However, when the reference signal is unavailable, SI prediction (SIP) becomes significantly more challenging, particularly in severely noisy environments. This study addresses two key challenges in No-Reference (NR)-SIP: (i) improving prediction performance under conditions of extreme data scarcity and (ii) developing a robust NR-SIPA capable of operating effectively in environments contaminated by Competing Speaker (CS) noise.
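The word-level definition of SI given above can be illustrated with a minimal sketch. The function name `word_intelligibility` and the order-insensitive matching rule are assumptions made for illustration only; the scoring protocol of an actual listening test may differ (for example, by aligning words in order or scoring only keywords).

```python
from collections import Counter


def word_intelligibility(reference: str, response: str) -> float:
    """Fraction of reference words that also appear in the listener's response.

    Hypothetical helper: matching is case-insensitive and ignores word order,
    and each response word can be credited at most once.
    """
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference sentence is empty")
    remaining = Counter(response.lower().split())
    correct = 0
    for word in ref_words:
        if remaining[word] > 0:
            remaining[word] -= 1  # consume the matched response word
            correct += 1
    return correct / len(ref_words)


score = word_intelligibility("the cat sat on the mat", "the cat sat on a mat")
print(f"SI = {score:.3f}")
```

Here five of the six reference words are recovered, because the response contains "the" only once. An SIPA aims to estimate this kind of score without running the listening test.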
To mitigate the issue of limited training data, we introduce a set of NR-SIPAs that leverage a pre-trained Speech Foundation Model (SFM), specifically wav2vec 2.0. We adapt wav2vec 2.0 for automatic speech recognition under additive noise conditions using a parameter-efficient method, low-rank adaptation. We demonstrate NR-SIPAs designed with this approach using only a moderate amount of training data. The best designs perform on par with, or even better than, a state-of-the-art reference-based (RB)-SIPA across a variety of datasets comprising different degradation types.
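As a rough illustration of why low-rank adaptation is parameter-efficient, the sketch below wraps a single frozen linear projection with a trainable low-rank update. It is a generic NumPy sketch of the technique, not the configuration used in this work: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization are all assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # pre-trained, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, d_in); the low-rank path adds a correction to x @ W^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self) -> int:
        return self.A.size + self.B.size


frozen = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(frozen, rank=4)
print(layer.trainable_parameters(), "trainable vs", frozen.size, "frozen")
```

For this hypothetical 768 x 768 projection with rank 4, only 6,144 parameters are trained while 589,824 stay frozen. Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly.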
To tackle the challenge posed by CS noise, which is a prevalent yet insufficiently studied source of interference in real-world NR-SIP, we introduce a novel Target-Speaker (TS)-SIP framework. Within another SFM, WavLM, we incorporate a speaker-aware adaptation module that utilizes speaker embeddings to extract TS representations from the speech mixture. These representations are then integrated with speaker-independent features through a gating mechanism, which effectively enhances the model's generalization capability in noisy environments.
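One common way to realize such a gating mechanism is a learned sigmoid gate that interpolates between the two feature streams. The sketch below assumes that form; the abstract does not specify the exact gate, so the function `gated_fusion`, the parameter names `w_gate` and `b_gate`, and the feature dimension are illustrative only.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(h_ts: np.ndarray, h_si: np.ndarray,
                 w_gate: np.ndarray, b_gate: np.ndarray) -> np.ndarray:
    """Blend target-speaker (h_ts) and speaker-independent (h_si) features.

    h_ts, h_si: arrays of shape (frames, dim).
    w_gate: shape (2 * dim, dim); b_gate: shape (dim,). Both are assumed learnable.
    """
    # The gate is computed per frame and per feature dimension from both streams.
    gate = sigmoid(np.concatenate([h_ts, h_si], axis=-1) @ w_gate + b_gate)
    # gate -> 1 favors the target-speaker stream; gate -> 0 favors the other.
    return gate * h_ts + (1.0 - gate) * h_si


dim = 8
rng = np.random.default_rng(0)
h_ts = rng.normal(size=(5, dim))
h_si = rng.normal(size=(5, dim))
fused = gated_fusion(h_ts, h_si, np.zeros((2 * dim, dim)), np.zeros(dim))
print(fused.shape)
```

With all-zero gate parameters, the gate equals 0.5 everywhere and the output is the plain average of the two streams. Training would move the gate toward whichever stream is more informative.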
Crucially, to resolve the label ambiguity problem in target-speaker modeling, we adopt an energy-based assumption: the embedding of the weaker speaker, extracted from the speech mixture using the proposed speaker-separation (SS)-ECAPA-TDNN, is designated as the target. Experimental results demonstrate that our model is the only NR approach that can effectively mitigate the impact of CS noise. Moreover, it achieves competitive performance across datasets with diverse degradation types.
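The energy-based selection rule can be sketched as follows. The sketch assumes that a per-speaker signal estimate is available alongside each embedding so that energies can be compared; how the SS-ECAPA-TDNN actually determines which speaker is weaker is not described here, and the function name `select_target_embedding` is hypothetical.

```python
import numpy as np


def select_target_embedding(signals, embeddings):
    """Return the index and embedding of the lowest-energy (weaker) speaker.

    signals: one 1-D waveform estimate per speaker (an assumption of this sketch).
    embeddings: one speaker embedding per entry in `signals`, in the same order.
    """
    # Mean-square energy of each speaker's estimated signal.
    energies = [float(np.mean(np.square(s))) for s in signals]
    idx = int(np.argmin(energies))  # the weaker speaker is treated as the target
    return idx, embeddings[idx]


loud = np.ones(100)
quiet = 0.1 * np.ones(100)
idx, target = select_target_embedding(
    [loud, quiet],
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
)
print(idx, target)
```

Fixing the target by a rule like this removes the ambiguity about which of the two talkers an intelligibility label refers to.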

