In Star Trek IV: The Voyage Home, Scotty, back on present-day Earth, walks up to a Mac and starts talking. But the computer, to his frustration, doesn’t respond. Someone points out the keyboard, and Scotty looks at it curiously.
“A keyboard – how quaint,” he says, in much the same tone we would use when confronted by an old Smith Corona.
Some believe one day soon we will all look at keyboards the same way Scotty did. Others, however, are sceptical. Most now agree that voice recognition technology is pretty good when it comes to speaker-dependent applications – those in which the same user employs the same device – but speaker-independent applications still have a long way to go.
It all depends
One problem is that voice recognition technology requires a lot of storage and computing power. This means that while those working on personal computers have the capacity to create large voice profiles, those using cell phones or personal digital assistants (PDAs) can only use a limited vocabulary. But, paradoxically, it is in the mobile arena that people most want voice recognition.
“We’re moving towards a world where we want access to information anywhere, any time, any place – and any kind of information. You’re seeing non-desktop devices starting to proliferate. We will soon have more handheld devices than PCs, and there, voice is a more natural interface,” said IBM Corp.’s director of consumer voice systems Krishna Nathan in West Palm Beach, Fla.
And because customers will find using their voice easier than their phone’s number pad, they will be willing to put up with some inaccuracy.
“The performance of speaker-independent speech recognition will be less than human for a long time,” said Alexander Linden, a senior analyst at the Gartner Group in Frankfurt.
But he doesn’t think that this will stop people from using the technology.
“It’s a question of cost-benefits. And we think the benefits outweigh the costs,” he said.
Those using PDAs or calling into call centres may be willing to put up with some inaccuracy for the sake of convenience, for example. And in these environments, where the topic is narrow and the vocabulary limited, accuracy improves and companies can provide content over the phone. That means people will be able to buy airplane tickets over the phone using speech recognition, but they will not be able to randomly surf the Web with a handheld device.
But some wonder if the latter functionality is something we’ll ever really need. Graeme Hirst, a computer science professor at the University of Toronto, doubts the need for domain-free speaker-independent technology. “In most of life we simply don’t want things that are completely topic free,” Hirst said.
When topics are domain specific, the computer can distinguish between words like rain and reign, which sound the same but have different meanings, said Fred Popowich, an associate professor of computer science at Simon Fraser University.
“The way you recognize speech really has a lot more to do with what you’re saying – I mean the context of what we’re saying – as opposed to the actual words,” Nathan explained.
When people speak to a voice recognition system, their words are stored in a digital format as a series of ones and zeros, which correspond to the waveform. Because one letter can have many different pronunciations, scientists code for sounds rather than letters, Nathan said. And because we don’t live in a homogeneous society, speech recognition companies use thousands of voices to train their databases.
“When you collect more and more data, you fill the entire spectrum of possibilities,” Nathan said. But before speech recognition technology can be perfected, computers need to gain natural language understanding. This means that if there is any ambiguity on the acoustics side, the computer can still figure out what was said. Currently, computer scientists use trigrams to help the computer make educated guesses. The computer looks at the probability of three words occurring together. If it knows the word “the” isn’t likely to occur twice in a row, this will help it decide whether the speaker said “the” or “a.”
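The phoneme and trigram ideas can be reduced to a short sketch. The Python fragment below is a toy illustration only: the phoneme symbols, word pairs and probabilities are invented for this article, and real recognizers work over vastly larger models.

# A rough sketch of the two ideas described above, with invented numbers:
# words are modelled as phoneme sequences, so "rain" and "reign" sound
# identical, and a trigram model breaks the tie by asking which word is
# more probable given the two words that came before it.

# Pronunciation dictionary: different spellings, same sounds
PHONEMES = {
    "rain":  ("R", "EY", "N"),
    "reign": ("R", "EY", "N"),  # acoustically identical to "rain"
}

# P(word | two preceding words) - probabilities invented for illustration
TRIGRAMS = {
    ("during", "his"):   {"reign": 0.02, "rain": 0.0001},
    ("umbrella", "the"): {"rain": 0.03, "reign": 0.0001},
}

def disambiguate(prev_two, candidates):
    """Pick whichever sound-alike candidate the language model prefers."""
    probs = TRIGRAMS.get(prev_two, {})
    return max(candidates, key=lambda w: probs.get(w, 1e-6))

sound_alikes = [w for w in PHONEMES if PHONEMES[w] == PHONEMES["rain"]]
print(disambiguate(("during", "his"), sound_alikes))    # reign
print(disambiguate(("umbrella", "the"), sound_alikes))  # rain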
Testing, testing, 1, 2, 3…
While those using handheld devices may be willing to put up with some inaccuracy for ease of use, at Saint Mary’s University in Halifax the need for accuracy is critical. The university is conducting a speech-to-text pilot study.
As professors give lectures, their words are transcribed in real-time using IBM’s ViaVoice and displayed on a screen for the students to see.
During the pre-testing phase, the university ran into a lot of equipment problems. For instance, the equipment was optimized for a certain decibel range, and when faculty members wanted to emphasize a point, they would raise their voices. If someone exceeded the decibel range, the computer would shut down and have to be restarted. This was difficult for faculty members and unsettling for the class.
Inaccuracy was another problem. If the computer gets an important word wrong, it can change the meaning of what is said. This means that faculty members have to go back and edit their lectures before they can hand them out to students. The lecture can be made available on a disk, in hard copy, in an audio version or in Braille.
The study is designed primarily to increase access for students with disabilities. But Dr. David Leitch, the director of the Atlantic Centre of Research, Access and Support for Persons with Disabilities at Saint Mary’s, has found that most students are benefiting from the program, even though the application has not yet been perfected.
Students with disabilities no longer have to rely on sometimes-unreliable note takers. Others benefit as well, because they now have more complete notes and another means of taking in information – visually.
Did you say Jane Smith?
While companies can also use Saint Mary’s methods to transcribe presentations, for now they’re more focused on creating automated speech-enabled telephone directories and call centres.
At Arial Systems Corp., when customers call in, they almost always talk to an actual person, thanks to the company’s automated telephone directory system.
This system integrates voice technology from Lernout & Hauspie (L&H) with Arial’s total access network. Employees wear ID badges with transmitters in an office equipped with receivers, so the telephone directory system can always find employees. When callers phone in, an obviously computer-generated voice asks them who they are calling for and then uses accuracy parameters to judge where to route the call. If the computer is 95 per cent sure, it puts the call through; otherwise, it asks callers for confirmation.
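That thresholding logic fits in a few lines. In this sketch the 95 per cent cut-off comes from Arial’s description, but the function and the prompts are invented illustrations, not the company’s actual code.

# Confidence-threshold call routing, as described above. The cut-off comes
# from the article; everything else is an invented illustration.

CONNECT_THRESHOLD = 0.95

def route_call(name: str, confidence: float) -> str:
    if confidence >= CONNECT_THRESHOLD:
        # Confident enough: put the call straight through
        return f"Connecting you to {name}."
    # Otherwise ask the caller to confirm before routing
    return f"Did you say {name}? Please say yes or no."

print(route_call("Jane Smith", 0.97))  # connects immediately
print(route_call("Jane Smith", 0.80))  # asks for confirmation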
Although some employees are concerned about the Big Brother issue, the more pressing concern for most is the inability to hide behind voice mail, said Mike Wagener, vice-president of product development in Vernon Hills, Ill.
“The system is really geared to people who want to be hyper responsive to their customers,” Wagener said.
If someone is not available, the system checks a back-up tree to figure out how to route the call. The text-to-speech engine explains what’s happening to the caller.
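That back-up logic amounts to walking a small fall-back list until someone can take the call. A minimal sketch, with invented names and availability data:

# A toy back-up tree: if the requested person is unavailable, try each
# fall-back in order. Names and availability are invented for illustration.

BACKUP_TREE = {
    "Jane Smith": ["Bob Lee", "sales voice mail"],
    "Bob Lee": ["sales voice mail"],
}
AVAILABLE = {"Jane Smith": False, "Bob Lee": True}

def find_recipient(name: str) -> str:
    if AVAILABLE.get(name):
        return name
    for backup in BACKUP_TREE.get(name, []):
        if AVAILABLE.get(backup, True):  # voice mail is always reachable
            return backup
    return "operator"

print(find_recipient("Jane Smith"))  # routed to Bob Lee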
Montreal-based Breton Banville also has a speech recognition-enabled telephone directory. The company has 160 employees and gets about 100 calls an hour. At one point it needed two receptionists to handle the load, and both spent about 70 per cent of their time just answering the phone. Thanks to Locus Dialogue’s multilingual speech recognition system, Breton Banville now has one receptionist, who spends only 30 per cent of her time on the phone.
“Within three years, the system will have paid for itself. Probably within the first year,” said Michael Tomlinson, the MIS manager.
But unlike Arial, Breton Banville decided against using text-to-speech technology to respond to callers, instead having someone come in and record responses.
John Dalton, an associate analyst of site design and development at Forrester Research Inc. in Cambridge, Mass., understands why this choice was made.
“There are two halves to the recognition game. The industry was so keen over the past 15 years to get speech recognition down that the text-to-speech part of the puzzle has been neglected. Text-to-speech is awful. It sounds like a robot,” Dalton said.
As with speech recognition, text-to-speech works best when the domain is specific. This helps the computer figure out how to pronounce words like “record,” which is pronounced differently depending on whether it’s a noun or a verb. Although computers have become pretty good at pronouncing individual words, they still have difficulty with the melody of a sentence.
“We always assign a default melody to a sentence based on the content of that sentence, but not necessarily based on the meaning of that sentence,” said Tom Morse, the senior director of engineering at L&H in Burlington, Mass.
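The “record” example reduces to a toy lookup. In this sketch the respellings and the crude one-word part-of-speech rule are illustrative assumptions, nothing like L&H’s actual engine:

# Homograph disambiguation for text-to-speech: the same spelling gets a
# different pronunciation depending on part of speech. The respellings and
# the one-word context rule are invented for illustration.

PRONUNCIATIONS = {
    ("record", "noun"): "REH-kerd",   # "play a record"
    ("record", "verb"): "rih-KORD",   # "record a song"
}

def guess_pos(prev_word: str) -> str:
    # Crude heuristic: articles tend to precede nouns; "to" precedes verbs
    return "noun" if prev_word in ("a", "the") else "verb"

def pronounce(word: str, prev_word: str) -> str:
    return PRONUNCIATIONS.get((word, guess_pos(prev_word)), word)

print(pronounce("record", "a"))   # REH-kerd
print(pronounce("record", "to"))  # rih-KORD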
One of the most important applications for speech recognition technology, according to Linden, will probably be in call centres. Businesses will be able to cut down on the number of calls they have to handle in person and significantly reduce costs. More importantly, companies will be able to use products like Newton, Mass.-based Dragon Systems Inc.’s AudioMining to record calls and index them for keywords, so that they can mine the voice data, Linden said.
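The indexing side of that idea is essentially an inverted index over call transcripts. A minimal sketch of the general technique, with made-up transcripts; this is not Dragon’s product:

# Keyword indexing over recognized call transcripts: map each word to the
# calls that mention it, so the voice data can be searched and mined.
# The transcripts are invented examples.

from collections import defaultdict

transcripts = {
    "call-001": "I want to cancel my subscription",
    "call-002": "my invoice shows the wrong amount",
    "call-003": "please cancel the invoice and resend it",
}

index = defaultdict(set)
for call_id, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(call_id)

print(sorted(index["cancel"]))   # ['call-001', 'call-003']
print(sorted(index["invoice"]))  # ['call-002', 'call-003']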
But the keyboard lives on
Scotty may have looked at the keyboard in surprise, but whenever characters in the newer generations of Star Trek use a handheld device, they always type into it. The keyboard may not become extinct, but it will morph into new shapes and forms.
Even if domain-free speaker-independent speech recognition is achieved one day, IBM’s Nathan believes the keyboard will still stick around. People will need to protect their privacy, he said.
“The big challenge is going to be understanding the limitations of speech and understanding that it doesn’t replace everything. Keyboards will be around for a long time. I will not be sitting in my airplane seat talking to my computer, I don’t care how good the recognition is, I don’t care how good the usability is,” he said.
“I think people often lose sight of that and think we’re going to go to a completely keyboardless world and everything is going to be speech-enabled. There is a reason for the keyboard to exist. It may change dramatically, but because of privacy issues, because of noise issues, because of recognition issues, you probably don’t want air traffic control running on speech recognition.”