Advancing No-Reference Speech Intelligibility Prediction to the Two-Talker Condition

Abstract

Speech Intelligibility (SI) is a crucial metric for assessing speech perception and comprehension, typically defined as the proportion of words in a sentence that can be correctly recognized by the listener. Speech Intelligibility Prediction Algorithms (SIPAs) aim to replace costly and time-consuming listening tests by accurately estimating SI under various noise conditions. However, when the reference signal is unavailable, SI prediction (SIP) becomes significantly more challenging, particularly in severely noisy environments. This study addresses two key challenges in No-Reference (NR)-SIP: (i) improving prediction performance under conditions of extreme data scarcity and (ii) developing a robust NR-SIPA capable of operating effectively in environments contaminated by Competing Speaker (CS) noise.
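As a minimal illustration of the SI definition above (the proportion of words in a sentence that the listener recognizes correctly), the sketch below scores a listener response against the presented sentence. The function name `intelligibility_score` and the order-independent word matching are assumptions made for illustration; real listening tests may use keyword lists or stricter scoring rules.

```python
from collections import Counter


def intelligibility_score(reference: str, response: str) -> float:
    """Fraction of reference words that also appear in the listener response.

    Assumption: matching is case-insensitive and ignores word order; each
    response word can be credited at most once.
    """
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference sentence must contain at least one word")

    remaining = Counter(response.lower().split())
    correct = 0
    for word in ref_words:
        if remaining[word] > 0:
            remaining[word] -= 1  # consume the matched response word
            correct += 1
    return correct / len(ref_words)


# The listener misheard the second "the" as "a": 5 of 6 words are correct.
score = intelligibility_score("the cat sat on the mat", "the cat sat on a mat")
```

An SIPA, in this framing, tries to predict such a score from the audio alone, without collecting listener responses.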

To mitigate the issue of limited training data, we introduce a set of NR-SIPAs that leverage a pre-trained Speech Foundation Model (SFM), specifically wav2vec 2.0. We adapt wav2vec 2.0 for automatic speech recognition under additive noise conditions using low-rank adaptation, a parameter-efficient fine-tuning method. We demonstrate NR-SIPAs designed with this approach and trained on a moderate amount of data. The best designs perform on par with, or even better than, a state-of-the-art reference-based (RB) SIPA across a variety of datasets comprising different degradation types.
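The general idea of low-rank adaptation can be sketched as follows: a frozen weight matrix W is augmented with a trainable low-rank product, giving an effective weight W + (alpha / r) · B · A. This is a generic, dependency-free sketch of that update, not the configuration used in this work; the matrix sizes, the rank, and the scaling factor `alpha` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (hypothetical sizes): d_out = d_in = 2, rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]
b_zero = [[0.0], [0.0]]  # B is commonly initialised to zero
b_trained = [[1.0], [0.0]]

start = lora_effective_weight(w, a, b_zero, alpha=2.0)     # identical to w
adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)  # [[3.0, 4.0], [0.0, 1.0]]
```

Only A and B are trained, so the number of trainable parameters per adapted matrix is r · (d_in + d_out) rather than d_in · d_out, which is what makes the approach attractive when training data are scarce.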

To tackle the challenge posed by CS noise, which is a prevalent yet insufficiently studied source of interference in real-world NR-SIP, we introduce a novel Target-Speaker (TS)-SIP framework. Within another SFM, WavLM, we incorporate a speaker-aware adaptation module that utilizes speaker embeddings to extract TS representations from the speech mixture. These representations are then integrated with speaker-independent features through a gating mechanism, which effectively enhances the model's generalization capability in noisy environments.
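One common way to combine two feature streams through a gate is an element-wise convex combination, fused = g · h_ts + (1 − g) · h_si, where g is a sigmoid of a learned function of the inputs. The sketch below illustrates that pattern only; the abstract does not specify the exact gate used in the model, so the per-dimension weights `w_ts`, `w_si`, and `bias` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ts, si, w_ts, w_si, bias):
    """Blend target-speaker (ts) and speaker-independent (si) features.

    Assumption: one scalar gate per feature dimension, computed from the
    two inputs in that dimension with hypothetical learned parameters.
    """
    fused = []
    for t, s, wt, ws, b in zip(ts, si, w_ts, w_si, bias):
        gate = sigmoid(wt * t + ws * s + b)
        fused.append(gate * t + (1.0 - gate) * s)
    return fused


ts_features = [2.0, 4.0]
si_features = [0.0, 0.0]
zeros = [0.0, 0.0]

# With all gate parameters at zero, the gate is 0.5 and the output is the average.
balanced = gated_fusion(ts_features, si_features, zeros, zeros, zeros)

# A large positive bias pushes the gate towards 1, favouring the target-speaker stream.
ts_dominant = gated_fusion(ts_features, si_features, zeros, zeros, [50.0, 50.0])
```

The gate lets the model lean on speaker-independent features when the target-speaker representation is unreliable, and on the target-speaker features otherwise.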

Crucially, to resolve the label ambiguity problem in target-speaker modeling, we adopt an energy-based assumption: the embedding of the weaker speaker, extracted from the speech mixture using the proposed speaker-separation (SS)-ECAPA-TDNN, is designated as the target. Experimental results demonstrate that our model is the only NR approach that can effectively mitigate the impact of CS noise. Moreover, it demonstrates competitive performance across datasets with diverse degradation types.
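The energy-based selection rule can be illustrated with a small sketch. It assumes, purely for illustration, that each candidate speaker comes with a per-speaker signal estimate whose mean-square energy can be compared, and that the embedding paired with the lower-energy estimate is taken as the target. How SS-ECAPA-TDNN actually produces and ranks the embeddings is not described in the abstract; `mean_energy` and `pick_target_embedding` are hypothetical helpers.

```python
def mean_energy(samples):
    """Mean-square energy of a sequence of waveform samples."""
    if not samples:
        raise ValueError("signal must contain at least one sample")
    return sum(x * x for x in samples) / len(samples)


def pick_target_embedding(candidates):
    """Return the embedding paired with the lowest-energy signal estimate.

    candidates: list of (signal_estimate, embedding) pairs, one per speaker.
    Assumption: the weaker (lower-energy) speaker is the target.
    """
    _, embedding = min(candidates, key=lambda pair: mean_energy(pair[0]))
    return embedding


# Toy two-talker example with made-up samples and embeddings.
louder_talker = ([0.8, -0.9, 0.7], [0.1, 0.9])
weaker_talker = ([0.1, -0.1, 0.05], [0.7, 0.2])

target = pick_target_embedding([louder_talker, weaker_talker])  # [0.7, 0.2]
```

Fixing the target by a rule of this kind removes the ambiguity about which of the two talkers the intelligibility label refers to.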

Keywords

Speech Intelligibility Prediction, Competing Speakers, Speech Foundation Models, Deep Learning, Speech-on-Speech Masking

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International