Machine-Learning Algorithm for Improved Speech Intelligibility in Noise
A monaural machine-learning algorithm that classifies time-frequency units in an unknown signal, yielding marked speech-intelligibility improvements in noisy conditions.
Wireless carriers receive daily complaints about poor speech recognition in background noise during calls and are constantly looking for methods of improvement, especially in light of recent forays into VoIP. The ability to discriminate between speech and noise in an audio signal is therefore central to improving customer satisfaction and product usability, but current attempts to do so have met with only mild success.
Some state-of-the-art methods employ microphone arrays in an effort to boost the desired speech arriving from one direction while attenuating noise originating from others. These arrays have produced modest improvements in speech-in-noise intelligibility, but their underlying calculations rely on assumptions that are often not met in natural environments. More promising methods combine deep learning with time-frequency masking (T-F masking) to improve speech intelligibility with just a single microphone. These techniques select and enhance the segments of the signal containing the most speech, while attenuating the rest. Existing T-F masking techniques, including Ideal Binary Masking (IBM) and Ideal Ratio Masking (IRM), can substantially improve intelligibility, but each has drawbacks that may hinder its implementation.
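The core T-F masking operation described above can be illustrated with a minimal Python sketch: decompose a mixture into time-frequency units with a short-time Fourier transform, scale each unit by a mask value, and resynthesize. The function name, sample rate, and frame length here are illustrative assumptions, not part of the OSU algorithm.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(mixture, mask, fs=16000, nperseg=512):
    """Scale each time-frequency unit of the mixture by its mask value,
    then resynthesize the enhanced waveform."""
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
    _, enhanced = istft(X * mask, fs=fs, nperseg=nperseg)
    return enhanced

# Demonstration: an all-pass mask (every unit kept) returns the input unchanged
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s, 440 Hz tone
_, _, X = stft(x, fs=fs, nperseg=512)
y = apply_tf_mask(x, np.ones(X.shape), fs=fs, nperseg=512)
```

A real enhancement system replaces the all-pass mask with per-unit gains chosen to suppress noise-dominated units, which is exactly what the IBM, IRM, and IQM differ on.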
Researchers at The Ohio State University, led by Dr. Eric Healy, have developed a novel form of T-F masking that combines the computational simplicity of the Ideal Binary Mask (IBM) with the superior sound quality of the Ideal Ratio Mask (IRM). The result is intelligibility equal to or superior to that of the IRM at computational loads only marginally larger than those of the IBM.
The IBM is a binary system that assigns a value of 0 or 1 to each time-frequency unit based on its signal-to-noise ratio (SNR). Units with a poor SNR are assigned a 0 and attenuated, resulting in an output signal containing only T-F units dominated by speech. In the IRM, units are again attenuated, but they can be assigned any value between 0 and 1, resulting in a smoother output. The Ideal Quantized Mask (IQM), developed by OSU's research team, combines the advantages of both methods. Instead of the IBM's two attenuation levels, the IQM classifies each T-F unit into any number of discrete categories. While this means the IQM could theoretically have an unlimited number of categories, with just eight attenuation levels the IQM achieves IRM-level intelligibility, far higher than that of the IBM, without the need for the IRM's regression calculations.
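The relationship between the three masks can be sketched in a few lines of numpy. This is a hedged illustration only: it assumes the common square-root form of the IRM and uniform quantization of its range; the team's exact threshold and quantization scheme may differ.

```python
import numpy as np

def ibm(snr_db, threshold_db=0.0):
    # Ideal Binary Mask: keep (1) units above the SNR threshold, drop (0) the rest
    return (snr_db > threshold_db).astype(float)

def irm(speech_pow, noise_pow):
    # Ideal Ratio Mask: a soft gain in [0, 1], here in its common square-root form
    return np.sqrt(speech_pow / (speech_pow + noise_pow))

def iqm(speech_pow, noise_pow, levels=8):
    # Ideal Quantized Mask (illustrative): snap the ratio mask to a fixed
    # number of discrete attenuation levels instead of a continuum
    m = irm(speech_pow, noise_pow)
    return np.round(m * (levels - 1)) / (levels - 1)

# Per-unit speech and noise powers for a tiny 2x2 example
s = np.array([[4.0, 1.0], [0.25, 9.0]])
n = np.ones_like(s)
b = ibm(10 * np.log10(s / n))   # two levels
q = iqm(s, n, levels=8)         # eight levels
```

With two levels the quantized mask reduces to the IBM; as the number of levels grows it approaches the IRM, which is why a handful of levels suffices to recover IRM-like quality.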
The algorithm is trained using deep-learning techniques to analyze and classify T-F units. Once trained, the algorithm estimates the IQM for a speech-plus-noise mixture. Importantly, the algorithm can be trained on any input signal, meaning it can be used to identify any desired marker in a noisy signal; this makes the IQM valuable for a wide range of applications, including voice communication, speech recognition, and noise cancellation.
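Because the IQM takes only a handful of discrete values, estimating it becomes a per-unit classification problem rather than a regression. A minimal sketch of the inference step, with random scores standing in for the output of a trained network (the array shapes and level count are assumptions, not the team's configuration):

```python
import numpy as np

levels = np.linspace(0.0, 1.0, 8)        # the 8 discrete attenuation gains
rng = np.random.default_rng(0)

# Stand-in for a trained network's class scores: one score per attenuation
# level for each of 257 frequency bins x 100 time frames
scores = rng.random((257, 100, 8))

# The estimated IQM picks, for each T-F unit, the gain of the top-scoring class
predicted_iqm = levels[scores.argmax(axis=-1)]
```

The resulting mask can then be applied to the mixture's T-F representation to produce the enhanced signal.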