| The goal of a speech reinforcement system is to deliver the speaking
voice to listeners with sufficient clarity to be understood. Given
the complexity of the speech signal, the task of providing high-quality
speech reinforcement in real-world, less-than-ideal conditions is
doubly complicated.
Here is a diagram of
a simplified speech reinforcement system showing the main factors
that affect intelligibility. As the diagram indicates, a number
of acoustic, electromechanical and electronic factors need to
be considered if intelligibility is to be maintained. In order
to deal with all of these factors effectively, one must understand
how each affects the speech signal.
Masking

The most common obstacle that speech system designers face
is the intrusion of unwanted sounds that inevitably interfere
with the speech signal. The effect is called masking, a
general term that covers a very wide variety of situations.
Masking noise can come from acoustical sources such as ventilation
equipment, traffic, crowds and, commonly, reverberation and echoes.
It can also arise electronically from thermal noise, tape hiss
or distortion products. If the sound system has unusually large
peaks in its frequency response, the speech signal can even end
up masking itself.
The relationship between the strength of the speech signal
and that of the masking sound is called the signal-to-noise
(S/N) ratio, expressed in decibels. Ideally, the S/N ratio is greater
than 0 dB, indicating that the speech is louder than the noise.
Just how much louder the speech needs to be in order to be understood
varies with, among other things, the type and spectral content
of the masking noise.
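As a rough illustration of how an S/N figure is obtained, the sketch below computes it from the RMS levels of two separately measured signals. The function names and the test tones (a 1 kHz tone standing in for speech, a quieter 120 Hz tone standing in for noise) are assumptions made for the example, not measurements from the charts discussed here.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(speech, noise):
    """S/N ratio in dB, given speech and noise measured separately."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical test signals, one second each at an 8 kHz sample rate.
RATE = 8000
speech = [1.00 * math.sin(2 * math.pi * 1000 * n / RATE) for n in range(RATE)]
noise = [0.25 * math.sin(2 * math.pi * 120 * n / RATE) for n in range(RATE)]

# A 4:1 amplitude ratio works out to roughly 12 dB.
print(round(snr_db(speech, noise), 1))
```

An amplitude ratio of 1:1 gives 0 dB, and every doubling of the speech amplitude relative to the noise adds about 6 dB.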
The most uniformly effective mask is broadband noise. Here is
a chart showing word articulation versus
S/N when the masking source is noise spanning 20 Hz to 4 kHz.
Notice that the signal must be 12 dB louder than the broadband
noise to achieve 80% word recognition.
Although narrow-band noise is less effective at masking speech
than broadband noise, the degree of masking varies with frequency. Here is
a chart showing word articulation versus S/N for two noise bands: 135
to 400 Hz (the fundamental frequency range of speech) and 1800
to 2500 Hz (the strongest consonant frequency range).
High-frequency noise masks only the consonants, and its effectiveness
as a mask decreases as the noise gets louder. But low-frequency
noise is a much more effective mask when the noise is louder
than the speech signal, and at high sound pressure levels it
masks both vowels and consonants. This is why the proximity effect
of cardioid microphones can be so harmful to speech intelligibility:
it causes the speech signal to mask itself. While cardioids are
very useful for minimizing noise pickup at the source, they should
always be used with a steep (12 dB/octave or greater) high-pass
filter tuned to about 100 Hz (or higher, if the speaker's voice
range allows) so that proximity effect problems are minimized.
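To make the 12 dB/octave figure concrete, here is a minimal sketch of a second-order digital high-pass filter at 100 Hz. It uses the widely published "audio EQ cookbook" biquad formulas with a Butterworth Q of about 0.707. The 48 kHz sample rate and the function names are assumptions for the example; a real system would normally use the filter built into the mixer or DSP.

```python
import cmath
import math

def highpass_biquad(cutoff_hz, sample_rate, q=1 / math.sqrt(2)):
    """Second-order high-pass coefficients (b0, b1, b2, a1, a2), with a0 = 1."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b0 = (1 + cos_w0) / 2 / a0
    b1 = -(1 + cos_w0) / a0
    b2 = b0
    a1 = -2 * cos_w0 / a0
    a2 = (1 - alpha) / a0
    return b0, b1, b2, a1, a2

def apply_biquad(coeffs, samples):
    """Run the samples through the filter (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def gain_db(coeffs, freq_hz, sample_rate):
    """Steady-state gain of the filter at one frequency, in dB."""
    b0, b1, b2, a1, a2 = coeffs
    z1 = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate)
    h = (b0 + b1 * z1 + b2 * z1 * z1) / (1 + a1 * z1 + a2 * z1 * z1)
    return 20 * math.log10(abs(h))

SAMPLE_RATE = 48000  # assumed sample rate
coeffs = highpass_biquad(100.0, SAMPLE_RATE)
for freq in (25, 50, 100, 1000):
    print(freq, "Hz:", round(gain_db(coeffs, freq, SAMPLE_RATE), 1), "dB")
```

With these settings the response is about 3 dB down at 100 Hz and falls by roughly 12 dB for each octave below that, while frequencies in the main speech range pass essentially unchanged.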
A human voice delivering a competing message, sometimes called
a distractor, is also very good at masking speech, particularly
at or below 0 dB S/N. In addition, the masking effect increases
with the number of distractor voices. Here is
a diagram comparing masking for one, two and three voices. Notice
that, below 0 dB S/N, three voices become just as effective a
source of masking as broadband noise. Above 0 dB S/N, however,
intelligibility improves rapidly as the S/N increases. This illustrates
the importance of having sufficient power in a paging system to
overcome crowd noise.
The direction from which a masking sound arrives, relative
to the direction of the speech signal, can affect the degree
of masking. If the noise comes from the same place, the masking
is greatest; it decreases as the distance between the noise and
the speech increases because this makes it easier for the brain
to discriminate between them. The masking effect is lowest when
the presentation is through headphones, with the speech in one
ear and the mask in the other. (Unfortunately, we can't
take advantage of that feature in sound reinforcement.)
From this discussion, we can see why reverberation is
so destructive of intelligibility, especially beyond critical
distance. Being itself caused by the speech, reverb mimics
the speech spectrum, but generally with greater low-frequency
energy. Sufficiently long reverb and echoes such as are
encountered in cathedrals and large sports arenas can
actually function like multiple distractor voices. And by its
nature, reverberant energy arrives from all angles, so it's
hard to separate from the speech using directional cues.
Frequency Response

One of the most obvious aspects of sound system performance
that affect intelligibility is frequency response. Severely band-limited
systems deliver speech poorly. For instance, telephones are generally
limited to a 2 kHz bandwidth, and this makes it hard to distinguish
between "f" and "s" or "d" and "t" sounds.
High-quality speech systems need to cover the frequency range
of about 80 Hz (for especially deep male voices) to about 10
kHz (for best reproduction of consonants, which are crucial to
intelligibility). Response below 80 Hz must be eliminated to
the extent possible: not only do these frequencies fall below
the range of the speech signal, but also they will cause particularly
destructive masking at high sound levels.
It's important, also, for the system response to be reasonably
flat throughout its range. The gradual high-frequency rolloff
that many reinforcement professionals favor for music applications
will tend to de-emphasize consonants, which are already as much
as 27 dB less loud than vowels. Likewise, prominent peaks or
dips in the response can cause either self-masking or loss of
consonant articulation.
Finally, the coverage of the system must be consistent throughout
the intended listener area, with minimal response cancellations
or off-axis dropoff in the critical high frequencies. This requirement
very often dictates either a distributed loudspeaker system or
carefully aimed and delayed fill speakers. Using high-Q loudspeakers
will help to elevate the S/N ratio between the speech and the
reverberation levels.
Distortion

Early studies of intelligibility in communication systems suggest
that clipping the peaks of the speech signal, and then amplifying
it to restore its peak-to-peak amplitude, improves intelligibility.
The trick works in very noisy situations because clipping generates
partials that are harmonically related to the fundamental and
thus less likely to mask the speech, and because it both
accentuates consonants and increases the sound power of the signal.
As such, it has been helpful for band-limited communication systems
that are used in very noisy environments, such as the deck of
an aircraft carrier.
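The power increase from clip-and-restore processing can be sketched in a few lines. The example below is only an illustration under assumed conditions: the test signal is a hypothetical 200 Hz tone with a slow 4 Hz level swing standing in for loud and quiet speech sounds, and the function names and the clip threshold are invented for the example.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(x) for x in samples)

def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB; lower means more power for the same peak."""
    return 20.0 * math.log10(peak(samples) / rms(samples))

def clip_and_restore(samples, threshold):
    """Hard-clip at +/-threshold, then scale back up to the original peak.

    Assumes the input is not silent.
    """
    original_peak = peak(samples)
    clipped = [max(-threshold, min(threshold, x)) for x in samples]
    gain = original_peak / peak(clipped)
    return [gain * x for x in clipped]

# Hypothetical speech-like test signal: one second at 8 kHz.
RATE = 8000
signal = [
    (0.55 + 0.45 * math.sin(2 * math.pi * 4 * n / RATE))
    * math.sin(2 * math.pi * 200 * n / RATE)
    for n in range(RATE)
]

# Clip heavily, at one tenth of the peak, as a stand-in for severe clipping.
processed = clip_and_restore(signal, 0.1 * peak(signal))

print("crest factor before:", round(crest_factor_db(signal), 1), "dB")
print("crest factor after: ", round(crest_factor_db(processed), 1), "dB")
```

The peak level is unchanged, but the quiet portions of the signal are raised close to the level of the loud ones, so the average power is higher for the same peak.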
The fact is, however, that clipping the signal to improve intelligibility
works only in cases where the signal-to-noise ratio is very poor. Here is
a chart showing word articulation versus S/N for an infinitely
clipped and an unclipped speech signal. Notice that the intelligibility
score for the clipped signal levels out to around 50% at 0 dB
S/N; above about +3 dB S/N, the unclipped signal scores better.
In real-life speech reinforcement systems, clipping should
be avoided. Obviously, it will sound objectionable through a
high-quality sound system. It also will increase the masking
from any noise that is picked up by the microphone, since that
noise will be clipped along with the speech.
Another type of distortion that is very destructive to intelligibility
is intermodulation distortion. While it is easily controlled
in the electronics of a sound system, significant IM can be generated
when some types of loudspeakers (particularly two-way coaxials)
are driven at high levels. IM produces sum and difference products
that are not harmonically related to the fundamental frequency.
As such, they have a much greater masking effect than the harmonic
products of clipping.
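The difference between the two kinds of distortion products is simple arithmetic, as the sketch below shows. The two tone frequencies (a 120 Hz voice fundamental and an unrelated 2500 Hz component passing through the same driver) are hypothetical values chosen for the example, and only second-order sum and difference products are considered.

```python
def harmonics(fundamental_hz, count):
    """The first `count` harmonics above the fundamental (2f, 3f, ...)."""
    return [fundamental_hz * k for k in range(2, count + 2)]

def second_order_im_products(f1_hz, f2_hz):
    """Difference and sum frequencies produced when two tones intermodulate."""
    return abs(f1_hz - f2_hz), f1_hz + f2_hz

def is_harmonic_of(freq_hz, fundamental_hz, tolerance=1e-6):
    """True if freq_hz is a whole-number multiple of fundamental_hz."""
    ratio = freq_hz / fundamental_hz
    nearest = round(ratio)
    return nearest >= 1 and abs(ratio - nearest) < tolerance

# Hypothetical example tones, in Hz.
FUNDAMENTAL = 120
OTHER_TONE = 2500

for freq in harmonics(FUNDAMENTAL, 3):
    print("harmonic product", freq, "Hz, harmonic:", is_harmonic_of(freq, FUNDAMENTAL))

for freq in second_order_im_products(FUNDAMENTAL, OTHER_TONE):
    print("IM product", freq, "Hz, harmonic:", is_harmonic_of(freq, FUNDAMENTAL))
```

The harmonic products land on multiples of 120 Hz, while the IM products at 2380 Hz and 2620 Hz do not. Note that this holds because the two tones are not themselves multiples of a common fundamental, which is the usual case when unrelated signal components share a driver.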
Time Response

Perhaps because it remains poorly understood and its effects
are more subtle, phase response in communication systems has
received scant attention. In fact, most published research about phase and
intelligibility actually deals with the effects of relative polarity.
It's been shown, for instance, that when speech is presented
with noise over headphones, intelligibility increases by about
25% if the speech signal in one ear is inverted relative to the
other ear. But this result has no application in sound reinforcement,
other than for in-ear stage monitors.
|