Speech Synthesis and Processing

Speech Synthesis and Processing

Vocoder Techniques

A channel vocoder is a device for compressing, or encoding, the data needed to represent a speech waveform, while still retaining the intelligibilty of the original waveform. The first channel vocoder was developed by Homer Dudley in 1936. It passed the speech signal through a bank of band-pass filters. These filters each covered a portion of the audio spectrum. The energy of each filter's output was then measured, or sampled, at regular time intervals and stored. This collection of filter energy samples then comprised the "coding" of the speech signal. This code could then be transmitted over a communication channel of lower bandwidth than would be neccessary for the raw speech signal. At the receiving end, the speech signal is reconstructed from this code by using the time sequence of filter energy samples to modulate the amplitude of a pulse signal being fed into a bank of filters similar to the ones used for the encoding. The result, while clearly not the original speech signal, is nonetheless intelligible, and one can, in most cases, understand what is being said.

The idea for the channel vocoder technique arises from the manner in which speech is generated in the human vocal tract. Simply put, the vocal chords produce a periodic, pulse-like, stream of air, which is then acoustically filtered by the elements of the vocal tract: the esophagus, tongue, lips, teeth, and the oral and nasal cavities. As one speaks different sounds, the shape and elastic properties of these elements are being constantly changed in response to neural signals arising from speech centres in the brain. This causes a time-varying filtering, or spectral variation, of the excitation arising from the vocal chords.

The channel vocoder, then, first analyzes the speech signal to estimate this time-varying spectral variation. To do this it uses the filters in the filter banks to determine how a particular frequency component of the speech signal is changing with time. On the output end, this analysis of the spectral variations are used to synthesize the speech signal by using another filter bank to apply these same spectral variations to an artificial periodic pulse like signal. The output filter bank acts as an artificial vocal track and the pulse signal acts as a set of artificial vocal chords.

Some sounds produced during speech do not arise from the vocal chords, but are produced by turbulent air flow near constrictions in the vocal tract such as may occur between the tongue and the teeth. For example, such sounds as "SSS", "K", "SSHH", "P", and so forth arise in this manner. These sounds would be poorly reconstructed using a pulse excitation source, and so most channel vocoders also have a noise signal that can be used as an excitation source as well. A "Voiced/Unvoiced" detector circuit is used to detect whether the speech signal is arising from vocal chord excitation (Voiced speech) or is arising from noise excitation (Unvoiced speech), and the appropriate excitation source is then selected at the output end.

Channel vocoders were originally developed for signal coding purposes, with an eye (ear?) towards reducing the amount of data that would be needed to be transmitted over communication channels. In fact, speech coding system development continues to this day to be a vigorous area of research and development. These systems have far outstripped the basic channel vocoder idea in complexity, coding efficiency, and intelligibilty, however. So why do we still care about channel vocoders? The reason is that channel vocoders (and the functionally equivalent, but computationally quite different, phase vocoder) have found application to music production. In the 1960's Siemens in Germany produced a vocoder which was used in some recordings. The BBC Radiophonic Workshop in England likewise pioneered the use of vocoders in recording and in radio and television. The vocoders used in these early musical efforts were very large and unsuited to general use. In the mid-70's a breakthrough of sorts came about when a number of companies, notably EMS (Electronic Music Studios) in England, produced relatively small and easy to use vocoders designed for use in musical applications. After that, the vocoder sound became a staple of the music and entertainment industry. Many extremely popular records (Kraftwerk!), TV shows (the Cylons of Battlestar-Galactica), and movies (Darth Vader in Star Wars) are identfied with vocoders. Although the introduction of these relatively small vocoding systems made it possible for the wide application of vocoders to music and film, they were still quite expensive for your average musician in the street. The EMS vocoder cost upwards of 6500 UK pounds! There are now, however, a number of inexpensive hardware vocoders available, as well as a few software emulators.

The Nord Modular is also capable of implementing very nice vocoders. The following patch, developed by Kees van de Maarel, shows how to make an 8-band channel vocoder from the Nord Modular static filter modules. The quality of this vocoder implementation is nowhere near that of systems like the EMS vocoder, due to the relatively small number of bands and the slow cutoff rate (12dB) of the filters used. But the sound is still quite nice, and the patch illustrates the idea. It has a number of features not found on many cheaper vocoders - sampling and holding of the spectral features, and control of the speed of the envelope detectors. By slowing down the envelope detectors one can obtain a "slewing" effect where phonemes appear to get blurred together. This can produce a smoother sound.

Vocoder made from bandpass filter modules. Includes features such as spectrum sample/hold and freeze (v.d.Maarel).

Of course, the Nord Modular also provides a self-contained vocoder module, so one does not have to go through all the trouble of constructing a vocoder patch from scratch, as in the patch shown earlier. The Vocoder module has a limited number of features, however, so there may still be some need to construct a custom vocoder patch. Some features can be added to the Vocoder module, however. For example, we show in the patch below how to add a Voiced/Unvoiced detection feature. In this patch the input signal is passed through two filters - a lowpass and a highpass, to estimate the frequency content of the input signal. If the lowpass filter output has an amplitude greater than the highpass filter output, then the input signal is mostly likely a pitched signal, such as a vowel sound. If, on the other hand, the highpass filter output has a higher amplitude then the input signal is most likely an unvoiced sound, such as a consonant. To do the decision as to the type of input signal we pass the filter outputs through envelope detectors (to estimate the amplitude of the filter outputs) and then compare the envelope detector output values. We use the output of the comparator module to switch, with a cross-fade module, between two signal sources - a mixed set of three square waves and a noise waveform.

Vocoder patch made from the Nord Modular vocoder module. Includes a voiced/unvoiced decision process to distinguish between vowels and plosives (Clark).

Note the use of a compressor in the above patch. This is a common technique used with vocoders, as it prevents the sound from becoming too "choppy" due to the relatively low amplitude of vocal sounds at the beginning and end of spoken words.

Patch that emulates female voices, with vibrato (Lindell).

Vocal Filter Techniques

One of the most loved of all of the Nord Modular's modules is the Vocal Filter. Its intended purpose is to synthesize vowel sounds, to aid in making the Nord Modular `talk'. Of course, it can be used for many other purposes as well, some of which we will describe in this chapter. Technical details of the Vocal Filter can be found in the Module Technology chapter.

Patch demonstrating the usefulness of the vocal filter in making voice-like sounds. This classic patch sounds likes the Ewoks from the Star Wars movie! (Crosley)

Singing patch (Vetter)

Speech Synthesis

One of the most popular uses of the vocal filter is to make the Nord Modular 'talk'. Speech for the most part consists of sequences of phonemes. Phonemes are the sonic building blocks of speech and are short segments of sound. There are two main forms of phonemes - vowel sounds and plosives. The vowel sounds are pitched and have a definite spectral structure. The vocal filter module is designed to produce the spectrum of vowel sounds. The plosives are untuned for the most part, and have a wide range of inharmonic frequencies. They are best synthesized with short bursts of filtered noise. The following patches give examples of implementing various utterances with the Nord Modular.

Talking patch which says 'Baby I Love You' (Lindell)

Another talking patch. This one says 'Clavia'. It demonstrates how to synthesise the rather difficult letter 'L' (Maarel)

Another talking patch. This one says 'West'. It demonstrates how to synthesise the rather difficult letter 'W' (Maarel).

Pitch Tracking

Musicians often want to control a synthesizer with another instrument - one that they have more skill in playing. For example, guitarists might want to control a synthesizer through playing their guitar, or a vocalist through means of their voice. The most important pieces of information that a performer would like to convey to the synthesizer are the pitch and amplitude of the sounds that are to be generated. The amplitude of the sound is usually measured through the use of an Envelope Generator. This is a circuit that estimates the RMS (root-mean-square) level of a signal over some time interval. The Nord Modular provides such a module. There is no module provided, however, for measuring the pitch of an input signal. Part of the reason for this is probably that it is very difficult to determine the pitch of a signal in an accurate and noise free manner. But there are some simple teachniques that can be applied with some success. We will look at a few of these techniques here.

The first approach we will look at involves synchonization of an oscillator to the input signal. This uses oscillators whose waveform cycle can be reset by an external signal. Review the section on oscillator sync for more details. The following patch demonstrates this approach. In the patch an oscillator is synced to the zero crossings of an input signal. If the input signal is a relatively pure tone, the rate of its zero crossings will be more or less the same as its fundamental frequency. Hence the synced oscillator will track the pitch of the input.

Another pitch follower patch. This uses an oscillator synced to the zero crossings of the input signal (Beogh)

Rather than directly synchronizing an oscillator to the zero crossings of an input signal, we can instead measure the zero crossing rate of the input signal and use this rate measure to adjust the pitch of a slave oscillator. The following patch demonstrates this approach. A lowpass filter removes most of the energy of the input signal above 262 Hz. This is done to emphasize the fundamental frequency component, so that the zero crossings occur at the same rate as the fundamental. For most male speakers (and many female), the pitch of the voiced speech signals is less than 250 Hz. At each zero crossing a pulse of fixed width (1.3msec) is generated. These pulses are then smoothed out, or averaged, with a smoothing filter. This produces a control value which is proportional to the zero crossing rate. The control value is then used to vary the pitch of an oscillator, through the grey MST oscillator inputs (which provide linear frequency control).

One of the most common uses for a pitch tracker is to tune the carrier signal for a vocoder. In this way the vocoder can follow the pitch of the incoming signal. This avoids a 'robotic' sound, and is especially useful when you want to shift the apparent pitch of a sound (e.g. to make a male speaker sound female). Of course, sometimes you WANT the robotic sound, so in that case you should leave out the pitch tracking! The following patch illustrates how to use a pitch tracker in conjunction with a vocoder.

A pitch follower feeding into a vocoder (v.d.Maarel)

The final pitch tracking patch that we will look at uses a phase-locked-loop to track the pitch of the input. The phase-locked-loop is a circuit that is used extensively in communication systems, and in high-speed computer chips. Phase-locked-loops work by measuring the difference in phase between two signals - an input signal, and the output of an oscillator. The phase difference signal is smoothed with a lowpass filter and then used to change the frequency of the oscillator. If the phase difference is negative (phase of the oscillator lower than that of the input) the oscillator frequency is raised slightly, increasing the phase of the oscillator. This will tend to make the phase difference less negative. Similarly if the phase difference is positive (phase of the oscillator waveform is higher than that of the input signal) then the frequency of the oscillator is lowered slightly. When the phase difference is exactly zero, the oscillator frequency is held constant. In the following patch the phase comparator is implemented with a multiplier. The input signal is heavily amplified and clipped with an overdrive module to make it binary-valued (like a square wave). This binarised input is multiplied by the output of the tracking oscillator and the product is smoothed by the loop filter. In this design the phase-locked-loop tries to make the phase difference between the binarised input and the oscillator waveform 90 degrees. When the two signals are 90 degrees out of phase their product waveform (the output of the multiplier) will be half the time positive and half the time negative. Therefore the output of the loop filter, which averages out the multiplier output, will be zero (save for some ripple). If the input frequency rises, the phase difference will rise, and the output of the multiplier will have a duty cycle of more than 1/2, and the loop filter output will therefore also rise, causing the tracking oscillator frequency to rise. Similarly, when the input frequency drops, the phase difference will drop slightly causing the loop filter output to drop, resulting in a decrease in the frequency of the tracking oscillator.

A phase-locked-loop pitch follower (Clark)

Pitch tracking can be used to control more than just the pitch of an oscillator. It can also be used to control filter cutoff, for example. Or even something non-frequency related, like the slope of an envelope generator attack (using a modulated enveloped generator module). The following two patches demonstrate such uses of a pitch track signal. The first uses the pitch track signal to control the cutoff frequency of a filter, the second to control the center frequency of a phaser module.

Using a pitch track signal to control filter cutoff frequency (Clark)

Using a pitch track signal to control the center frequency of a phaser (Clark)