Section II

Factors That Affect Intelligibility in Sound Systems


The goal of a speech reinforcement system is to deliver the speaking voice to listeners with sufficient clarity to be understood. Given the complexity of the speech signal, the task of providing high-quality speech reinforcement in real-world, less-than-ideal conditions is doubly complicated.

Here is a diagram of a simplified speech reinforcement system showing the main factors that affect intelligibility. As the diagram indicates, a number of acoustic, electromechanical and electronic factors need to be considered if intelligibility is to be maintained. In order to deal with all of these factors effectively, one must understand how each affects the speech signal.

Masking

The most common obstacle that speech system designers face is the intrusion of unwanted sounds that inevitably interfere with the speech signal. The effect is called “masking,” a general term that covers a very wide variety of situations.

Masking noise can come from acoustical sources such as ventilation equipment, traffic, crowds and, commonly, reverberation and echoes. It can also arise electronically from thermal noise, tape hiss or distortion products. If the sound system has unusually large peaks in its frequency response, the speech signal can even end up masking itself.

The relationship between the strength of the speech signal and that of the masking sound is called the signal-to-noise (S/N) ratio, expressed in decibels. Ideally, the S/N ratio is greater than 0 dB, indicating that the speech is louder than the noise. Just how much louder the speech needs to be in order to be understood varies with, among other things, the type and spectral content of the masking noise.
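As a rough illustration of the decibel arithmetic, the sketch below computes an S/N ratio from two sets of samples. The function names (`rms`, `snr_db`) and the test signals are hypothetical, chosen only for illustration, and the sketch assumes the speech and the noise can be measured separately.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(speech, noise):
    """S/N ratio in dB from RMS amplitudes: 20 * log10(speech_rms / noise_rms)."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical stand-in signals: one cycle of a sine at two amplitudes.
n = 1000
speech = [1.0 * math.sin(2 * math.pi * k / n) for k in range(n)]
noise = [0.25 * math.sin(2 * math.pi * k / n) for k in range(n)]

# Speech has four times the noise amplitude, so the ratio is about +12 dB.
print(round(snr_db(speech, noise), 1))
# Swapping the roles gives a negative value: the "noise" is now louder.
print(round(snr_db(noise, speech), 1))
```

A positive result means the speech is louder than the mask, and a negative result means the mask dominates.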

The most uniformly effective mask is broadband noise. Here is a chart showing word articulation versus S/N when the masking source is noise spanning 20 Hz to 4 kHz. Notice that the signal must be 12 dB louder than the broadband noise to achieve 80% word recognition.

Although narrow-band noise is less effective at masking speech than broadband noise, the degree of masking varies with frequency. Here is a chart showing word articulation versus S/N for two noise bands: 135 to 400 Hz (the fundamental frequency range of speech) and 1800 to 2500 Hz (the strongest consonant frequency range).

High-frequency noise masks only the consonants, and its effectiveness as a mask decreases as the noise gets louder. But low-frequency noise is a much more effective mask when the noise is louder than the speech signal, and at high sound pressure levels it masks both vowels and consonants. This is why the proximity effect of cardioid microphones can be so harmful to speech intelligibility: it causes the speech signal to mask itself. While cardioids are very useful for minimizing noise pickup at the source, they should always be used with a steep (12 dB/octave or greater) high-pass filter tuned to about 100 Hz (or higher, if the speaker’s voice range allows) so that proximity effect problems are minimized.
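To give a feel for what a 12 dB/octave high-pass at 100 Hz does, here is a minimal sketch of a second-order digital high-pass built from the widely published "Audio EQ Cookbook" biquad formulas. It is one possible realization, not a recommendation of a specific filter design. The 48 kHz sample rate, the Butterworth-style Q of about 0.707, and the function names are all assumptions made for illustration.

```python
import cmath
import math

def highpass_biquad(fc, fs, q=1 / math.sqrt(2)):
    """Coefficients (b, a) of a second-order high-pass, normalized so a[0] == 1."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 + cos_w0) / 2 / a0, -(1 + cos_w0) / a0, (1 + cos_w0) / 2 / a0]
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return b, a

def gain_db(b, a, f, fs):
    """Magnitude response in dB at frequency f."""
    z1 = cmath.exp(-1j * 2 * math.pi * f / fs)  # z^-1 evaluated on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20 * math.log10(abs(num / den))

FS = 48000  # assumed sample rate in Hz
b, a = highpass_biquad(100.0, FS)

# Print the gain at a few frequencies below, at, and above the cutoff.
for f in (25, 50, 100, 200, 1000):
    print(f, round(gain_db(b, a, f, FS), 1))
```

With these settings the response is about 3 dB down at 100 Hz and falls by roughly 12 dB for each octave below that, while the speech range above a few hundred hertz passes essentially unchanged. A steeper slope could be obtained by cascading two such sections.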

A human voice delivering a competing message, sometimes called a “distractor,” is also very good at masking speech, particularly at or below 0 dB S/N. In addition, the masking effect increases with the number of distractor voices. Here is a diagram comparing masking for one, two and three voices. Notice that, below 0 dB S/N, three voices become just as effective a source of masking as broadband noise. Above 0 dB S/N, however, intelligibility improves rapidly as the S/N increases. This illustrates the importance of having sufficient power in a paging system to overcome crowd noise.

The direction from which a masking sound arrives, relative to the direction of the speech signal, can affect the degree of masking. If the noise comes from the same place as the speech, the masking is greatest; it decreases as the separation between the noise source and the speech source increases, because this makes it easier for the brain to discriminate between them. The masking effect is lowest when the presentation is through headphones, with the speech in one ear and the mask in the other. (Unfortunately, we can’t take advantage of that feature in sound reinforcement.)

From this discussion, we can see why reverberation is so destructive to intelligibility, especially beyond critical distance. Being itself caused by the speech, reverb mimics the speech spectrum, but generally with greater low-frequency energy. Sufficiently long reverb and echoes, such as are encountered in cathedrals and large sports arenas, can actually function like multiple distractor voices. And by its nature, reverberant energy arrives from all angles, so it’s hard to separate from the speech using directional cues.

Frequency Response

One of the most obvious aspects of sound system performance that affect intelligibility is frequency response. Severely band-limited systems deliver speech poorly. For instance, telephones are generally limited to a bandwidth of roughly 3 kHz (about 300 Hz to 3.4 kHz), and this makes it hard to distinguish between “f” and “s” or “d” and “t” sounds.

High-quality speech systems need to cover the frequency range of about 80 Hz (for especially deep male voices) to about 10 kHz (for best reproduction of consonants, which are crucial to intelligibility). Response below 80 Hz must be eliminated to the extent possible: not only do these frequencies fall below the range of the speech signal, but also they will cause particularly destructive masking at high sound levels.

It’s important, also, for the system response to be reasonably flat throughout its range. The gradual high-frequency rolloff that many reinforcement professionals favor for music applications will tend to de-emphasize consonants, which are already as much as 27 dB less loud than vowels. Likewise, prominent peaks or dips in the response can cause either self-masking or loss of consonant articulation.

Finally, the coverage of the system must be consistent throughout the intended listener area, with minimal response cancellations or off-axis dropoff in the critical high frequencies. This requirement very often dictates either a distributed loudspeaker system or carefully aimed and delayed fill speakers. Using high-Q loudspeakers will help to elevate the S/N ratio between the speech and the reverberation levels.

Distortion

Early studies of intelligibility in communication systems suggest that clipping the peaks of the speech signal, and then amplifying it to restore its peak-to-peak amplitude, improves intelligibility. The trick works in very noisy situations because clipping generates partials that are harmonically related to the fundamental — and thus less likely to mask the speech — and because it both accentuates consonants and increases the sound power of the signal. As such, it has been helpful for band-limited communication systems that are used in very noisy environments, such as the deck of an aircraft carrier.
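The claim that clipping followed by re-amplification raises the sound power of the signal can be checked with a small sketch. A sine wave stands in for speech here, which is a strong simplification. The function names and the clipping threshold of 0.25 are hypothetical, chosen only for illustration.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def clip_and_restore(samples, threshold):
    """Hard-clip at +/-threshold, then scale back up to the original peak."""
    peak = max(abs(x) for x in samples)
    clipped = [max(-threshold, min(threshold, x)) for x in samples]
    gain = peak / max(abs(x) for x in clipped)
    return [gain * x for x in clipped]

# One cycle of a unit-amplitude sine as a stand-in test signal.
n = 1000
tone = [math.sin(2 * math.pi * k / n) for k in range(n)]
processed = clip_and_restore(tone, 0.25)

# Both signals have the same peak, but the clipped one has a higher RMS level.
print(round(rms(tone), 3))
print(round(rms(processed), 3))
```

Because the peaks are flattened and then restored to full scale, the processed signal spends more of its time near peak level, so its average power rises even though its peak-to-peak amplitude is unchanged.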

The fact is, however, that clipping the signal to improve intelligibility works only in cases where the signal-to-noise ratio is very poor. Here is a chart showing word articulation versus S/N for an infinitely clipped and an unclipped speech signal. Notice that the intelligibility score for the clipped signal levels out to around 50% at 0 dB S/N; above about +3 dB S/N, the unclipped signal scores better.

In real-life speech reinforcement systems, clipping should be avoided. Obviously, it will sound objectionable through a high-quality sound system. It also will increase the masking from any noise that is picked up by the microphone, since that noise will be clipped along with the speech.

Another type of distortion that is very destructive to intelligibility is intermodulation distortion. While it is easily controlled in the electronics of a sound system, significant IM can be generated when some types of loudspeakers (particularly two-way coaxials) are driven at high levels. IM produces sum and difference products that are not harmonically related to the fundamental frequency. As such, they have a much greater masking effect than the harmonic products of clipping.
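The difference between harmonic and intermodulation products can be seen by listing the frequencies involved. The sketch below takes a hypothetical pair of tones (200 Hz and 1550 Hz, picked only for illustration) and computes their low-order harmonics alongside the second- and third-order sum and difference products. The function names are likewise assumptions.

```python
def harmonics(f, max_order=3):
    """Integer multiples of f from the 2nd harmonic up to max_order."""
    return [n * f for n in range(2, max_order + 1)]

def im_products(f1, f2):
    """Second- and third-order intermodulation frequencies for two tones."""
    second = {f1 + f2, abs(f1 - f2)}
    third = {2 * f1 + f2, abs(2 * f1 - f2), 2 * f2 + f1, abs(2 * f2 - f1)}
    return sorted(second | third)

f1, f2 = 200, 1550  # hypothetical tone pair in Hz

print(harmonics(f1))   # [400, 600]
print(harmonics(f2))   # [3100, 4650]

products = im_products(f1, f2)
print(products)        # [1150, 1350, 1750, 1950, 2900, 3300]

# For this pair, no IM product is an integer multiple of either tone.
print(all(p % f1 != 0 and p % f2 != 0 for p in products))  # True
```

The harmonics fall on exact multiples of each tone, while the IM products for this pair land between them, which is consistent with their stronger masking effect.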

Time Response

Perhaps because it remains poorly understood and its effects are more subtle, phase response in communication systems has received scant attention. In fact, most published research about “phase” and intelligibility actually deals with the effects of relative polarity. It’s been shown, for instance, that when speech is presented with noise over headphones, intelligibility increases by about 25% if the speech signal in one ear is inverted relative to the other ear. But this result has no application in sound reinforcement, other than for in-ear stage monitors.

