Speech or Voice Recognition: History
What is Speech or Voice Recognition?
Speech or voice recognition is the process of recognizing and analyzing the elements of speech so that the message they carry can be converted into a meaningful form. More sophisticated applications of speech recognition also include the ability to respond to spoken words. The term “voice recognition” sometimes refers specifically to speech recognition systems that must be trained to understand an individual human voice. The term is also used more broadly for technology that can recognize speech without being trained for a single speaker, which matters when the system is deployed somewhere like a call center, where many people use it. Speech recognition also covers technology such as voice calling and voice commands on a phone. Like many complex inventions, it is not always easy to point to one inventor; depending on who recounts its history, speech recognition has different starting points and lines of evolution.
Who Invented Speech Recognition?
It is nearly universally agreed, though, that the technology began with Alexander Graham Bell’s inventions in the 1870s. By 1881, Bell, his cousin Chichester Bell, and Charles Tainter had invented a recording device based on their discovery of how to convert air pressure waves, or sound waves, into electrical impulses. They used a rotating cylinder with a wax coating on which a stylus cut up-and-down grooves in response to sound pressure. This invention led to the formation of the Volta Graphophone Co. in 1888. The company later became the Columbia Graphophone Co., which acquired the patent for a “Dictaphone” in 1907. Thomas Edison invented a similar device, and his “Ediphone” competed directly with the Bell product. These machines recorded dictated letters for a secretary to type, an important first step toward a machine that could automatically transcribe the human voice and toward unraveling the scientific mystery of speech.
Work by Fletcher and others at Bell Laboratories in the 1920s established the relationship between the distribution of a speech sound’s power across frequency and its sound characteristics as a person perceives them. Homer Dudley, also at Bell Laboratories, built on this idea in the 1930s to develop a speech synthesizer called the VODER, or Voice Operating Demonstrator. Demonstrated at the New York World’s Fair in 1939, the VODER was a vital milestone in the development of voice recognition.
In the 1950s, Bell Laboratories, under the direction of Davis, Biddulph, and Balashek, developed a speech recognition device that recognized numbers. This system for isolated number recognition for a single speaker used the formant frequencies, the resonances of the human vocal tract, measured during the vowel regions of each digit.
In the 1970s, this technology was developed further in the Hidden Markov Modeling (HMM) approach to speech recognition, invented by Lenny Baum of Princeton University and taken up by members of the ARPA Speech Understanding Research project. This work added a significant piece to the speech recognition field by recognizing that the objective is the understanding of speech rather than merely the recognition of words. Most speech recognition companies, including IBM, Philips, AT&T and Dragon Systems, eventually adopted this mathematical pattern-matching strategy.
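To give a flavor of this pattern-matching strategy, the core computation in an HMM recognizer is the forward algorithm, which scores how well a model explains a sequence of observed acoustic labels; a recognizer picks the word whose model scores highest. The sketch below is purely illustrative: the states, labels, and probabilities are invented for the example and are not drawn from any historical system.

```python
# Minimal sketch of the HMM forward algorithm underlying the
# pattern-matching approach described above. All states, symbols and
# probabilities are illustrative, not taken from any real recognizer.

def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) by summing over all state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states)
               * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

# Toy two-state model alternating between vowel-like and consonant-like
# sounds, each emitting a crude "low" or "high" frequency label.
states = ("vowel", "consonant")
start_p = {"vowel": 0.4, "consonant": 0.6}
trans_p = {"vowel": {"vowel": 0.6, "consonant": 0.4},
           "consonant": {"vowel": 0.7, "consonant": 0.3}}
emit_p = {"vowel": {"low": 0.7, "high": 0.3},
          "consonant": {"low": 0.2, "high": 0.8}}

p = forward(["high", "low", "low"], states, start_p, trans_p, emit_p)
```

In a full recognizer, one such model would be trained per word (or per phoneme), and recognition would amount to evaluating the forward probability of the observed audio under each model and choosing the best-scoring one.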
DARPA, the successor to ARPA and the agency behind the ARPANET, the network that predated the Internet, also worked on advancing speech recognition by establishing the Speech Understanding Research (SUR) program. The goal of this program was to develop a computer system that could understand speech. Lawrence Roberts, who initiated the program, launched the largest speech recognition project funded up to that time. SUR project groups were established at a number of locations, including MIT, CMU and SRI.
At the same time, two other initiatives were under way: IBM and AT&T Bell Laboratories were each taking their voice recognition research in different directions. IBM’s Fred Jelinek concentrated his team’s efforts on creating a voice-activated typewriter (VAT) that could turn a spoken sentence into typed text on paper. This system, called Tangora, was speaker-dependent: each user had to train the typewriter to recognize his or her individual voice. This kind of speech recognition was called transcription.
The early systems used the theory of acoustic-phonetics, which describes the basic sounds, or phonetic elements, of a language and attempts to explain how they are formed acoustically in speech. Early speech recognition systems, such as one from RCA Laboratories, could recognize 10 syllables spoken by a single person. A number of laboratories in Japan built speech recognition hardware, including one system, developed in the sixties, that recognized vowels. At this time, these systems could recognize only small vocabularies of isolated words (10-100 words). AT&T Bell Laboratories concentrated on research to provide automated telecommunication services, such as voice dialing of phones, to the public. Their challenge was to make these systems useful without the need to train each machine for an individual’s voice.
In the 1970s, systems grew to recognize larger vocabularies of about 100-1,000 words. In 1978, Texas Instruments used a speech chip in a popular toy called the Speak and Spell, a big step in bringing synthesized speech to consumer devices. Shortly afterward, Covox Co. introduced digital sound to the Commodore 64 and Atari 400/800 and, by the mid-eighties, to the IBM PC.
Throughout the following decades, various technologies emerged that moved speech recognition along. In the 1990s the Hidden Markov Model Toolkit (HTK) from Cambridge University became the most widely used software for automatic speech recognition research. While speech recognition technology is now widely used in all sectors, much research and development remains to be done in this field. The ultimate goal of many researchers is a machine that performs and responds like a human being. For now, the technology is used to provide telephone support, write medical reports, answer financial questions, and handle a variety of tasks requiring a human-to-machine interface, such as providing updated travel information, stock quotes and weather reports. The advances from the initial systems have been phenomenal; researchers say that most speech recognition software can perform with 90 to 95 percent accuracy.