integrated computational environment currently in use:

`command line interface', such as `sh' or `bash';
virtual Lisp machine, Emacs;
graphic user interface, such as Gnome, KDE, or Enlightenment;
auditory interface, Emacspeak, which is truly different from the previous three types of interface even though it is built on Emacs.
The simplest way to use voice recognition is to enable a computer to respond to commands. This way, the computer program need only recognize a limited number of words, rather than the thousands that appear in regular speech.
Many new telephones already use voice command for calling, especially mobile or `cell' telephones that try to be, as best they can, `hands off' devices.
Telephones do not implement many different voice commands. (I do not count recognition of different people to call as `different commands' although it is important to recognize two dozen or so different names.)
There are two reasons that a telephone needs only a few commands. Firstly, a telephone is not a general purpose machine, but a specific task machine. Even when the notion of `calling' is extended to text messages (which, tellingly, are not called `electronic mail' or `email' messages) and to video communications, the telephone is perceived as a device for communications, not for anything else.
Although a modern telephone contains a full-fledged computer with a fair amount of memory, its visual interface and keyboard are limited. A telephone can, of course, provide a full auditory interface, but few think of that. The trend is towards larger video screens and video communications, rather than towards more voice synthesis.
A counter argument is that such an interface will be advertised even if only a few customers want an auditory interface that can read their email while driving a car. The reason is simple: the capability costs almost nothing to implement (its main cost is marketing) and it can generate revenue. The big question is whether a telephone company's marketing department thinks that perceived simplicity will sell better than perceived complexity; or whether the feature will be hidden in an `advanced' portion of the interface, so those users who seek simplicity can ignore it.
Secondly, a telephone cannot provide good feedback for more than a few commands. This, I think, is the argument against combining command recognition with an existing auditory interface, at least at the moment.
Command recognition software makes mistakes. A telephone requires that only a few spoken commands be recognized, such as `call'. The command `call' is followed by the different names of those who might be called.
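To make the point concrete, here is a minimal sketch of how a small vocabulary limits the damage that a misheard word can do. It is an illustration only, not how any particular telephone works: the command list, the contact names, and the function names are all hypothetical, and Python's standard `difflib' module stands in for a real recognizer's notion of `closest word'.

```python
import difflib

# Hypothetical vocabulary: a handful of commands and a few contact names.
COMMANDS = ["call", "redial", "answer", "hangup"]
CONTACTS = ["alice", "bob", "carol"]


def closest(word, vocabulary, cutoff=0.6):
    """Return the vocabulary entry most similar to `word`, or None."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_utterance(text):
    """Map a recognized utterance to (command, name); either part may be None."""
    words = text.split()
    if not words:
        return (None, None)
    command = closest(words[0], COMMANDS)
    name = None
    # Only the `call' command is followed by a name.
    if command == "call" and len(words) > 1:
        name = closest(words[1], CONTACTS)
    return (command, name)
```

With only four commands to choose from, a slightly garbled `cal bob' still resolves to `call' and `bob', while a word that resembles nothing in the list is rejected rather than guessed at. The same approach becomes less forgiving as the command list grows to several dozen entries, because more of them start to sound alike.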
On the other hand, even the simplest editor or typesetting program requires two dozen or more commands. (I just came up with a minimum list of 38 commands; in practice, I use many more.) Writing is difficult. Nine of those 38 commands are directly related to correcting mistakes.
I know that many people habitually write using fewer commands. Indeed, when I started, I used fewer. But the additional commands meant I could work more efficiently, more productively, and more easily. I did not notice learning them. If given the opportunity, and a need to write frequently, people will learn the two, three, or four dozen commands that help them most.
Think of yourself as driving to work. You can listen to your email with your telephone. That is easy to imagine. But suppose you want to respond to messages? How do you do it? You cannot without a keyboard, even with a `voice command recognition' system.
However, you can use a `voice command recognition' system with a wearable computer that includes a `chorded' keyboard. Many people already carry a wearable with such a keyboard.
Suppose you are walking to work, or taking a train. Then you can respond to messages. But can you comfortably use `voice command' only with auditory feedback? My sense is that it is harder. When you delete a previous word, you must either remember the words that go before it, or the computer must repeat them. It is, I think, easier simply to see them.
On the other hand, people who use existing auditory interfaces all the time tell me that they learn both to remember what they are writing and to listen to words spoken more quickly than humanly possible: they learn to listen at five or six hundred words per minute. So perhaps I am overly worried. Perhaps people will learn to use text to voice synthesis along with voice recognition that offers several dozen commands.
Command recognition is the simpler part of voice recognition. Existing voice recognition systems can `take dictation' rather well. I have seen very impressive demonstrations. But the amount of learning required for `dictating' is higher than many people are willing to tolerate.
The question is how soon we will see programs that comfortably and accurately recognize continuous speech, and translate it into text. The problem is accuracy. The program must correctly recognize more than 995 out of every 1000 words, even when they are spoken by someone who has a strange accent or a cold. Otherwise, the speaker will spend too much time making corrections.
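The arithmetic behind that threshold is simple, and a small sketch shows why a few points of accuracy matter so much. The 500 word message length and the 950 per 1000 comparison figure are my own assumptions, chosen only for illustration; the function name is hypothetical.

```python
def expected_errors(words, correct_per_thousand):
    """Expected number of misrecognized words in a text of `words` words,
    given how many words out of every 1000 are recognized correctly."""
    return words * (1000 - correct_per_thousand) / 1000


# A 500 word message (an assumed, typical length for a long email).
at_threshold = expected_errors(500, 995)   # 2.5 words to correct
at_95_percent = expected_errors(500, 950)  # 25.0 words to correct
```

At 995 out of 1000, the writer of a 500 word message expects to fix two or three words. At 950 out of 1000, which sounds nearly as good, the same message needs about twenty-five corrections, or one every sentence or two.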
Personally, I expect to see good speech recognition programs soon. Indeed, they may already exist and I do not know of them. But then, my first experience with voice recognition took place a generation ago and I have been expecting good voice recognition ever since. As far as I know, the best current source for speech recognition programs and research is at CMU.
Let's presume that good voice recognition programs become available. Then what?
Is the voice recognition limited to one language or not? If the latter, as I expect, translation will be simplified. Tourists will carry translation computers.
There are two problems with this kind of automated translation: one is the voice recognition and the other is the translation. As a practical matter, it makes sense for the program to translate in both directions: first from the first language to the second (for example, from English to Chinese), and then to translate the translated statement back from Chinese to English, and speak it.
I do not know what specifically might happen in an English to Chinese translation, but I can imagine a problematic sequence from a first language to a second and then back. For example, suppose the first speaker is a tourist. He might ask a native to `Please tell me the way to the city museum'. If he then hears it converted to `Please tell me the way to city hall', he learns enough to change his choice of words. This method still does not ensure that the native hears what the tourist intended, but it helps.
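The round trip check can be sketched in a few lines. This is a toy under stated assumptions, not a real translator: the two phrase tables are invented data, the second `language' is represented by opaque tokens rather than real words, and the forward table is deliberately lossy so that two different places collapse onto one token.

```python
# Hypothetical phrase tables standing in for a real translation program.
FORWARD = {
    "city museum": "T1",
    "city hall": "T1",       # lossy on purpose: same token as `city museum'
    "train station": "T2",
}
BACKWARD = {
    "T1": "city hall",
    "T2": "train station",
}


def round_trip(phrase):
    """Translate `phrase' forward and back again.

    Returns (translated, heard_back, matches), where `matches' tells the
    speaker whether the phrase survived the round trip unchanged."""
    translated = FORWARD[phrase]
    heard_back = BACKWARD[translated]
    return (translated, heard_back, heard_back == phrase)
```

Here `round_trip("city museum")' comes back as `city hall' and is flagged as a mismatch, which is the tourist's cue to rephrase. A phrase that does come back unchanged, such as `train station', is reassuring but not proof: as noted above, a matching round trip does not ensure that the native heard what the tourist intended.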
The talk can be written, too: a person wearing a computer can carry a flat display that shows both his words in the first language and the translated words in the second language. That way, neither person has to depend on memory to tell them what is said since sighted people can glance back at a previous sentence.
Accurate, comfortable, continuous voice recognition will, it goes without saying, quicken office work.
A side effect will be that more secrets will be spoken out loud. Companies will not like this. Hijackers want to learn which shipments are worth stealing and which are not. Corporate thieves want to learn about takeover bids ahead of time. Both will learn to bounce laser beams off windows to pick up speech, as spies do now to discover foreign government secrets.
It will be straightforward to adapt the current graphic or virtual Lisp machine interface to one that uses voice recognition both for commands and continuous speech. People want this. The only hindrance is the error rate for continuous voice recognition.
But with good continuous voice recognition, people will not have to use keyboards. They will not have to learn a second kind of keyboard, a chorded keyboard, to wear a computer. This means that office workers and most others need not be restricted to a display and keyboard on a desk or table. An office worker will be able to do his work while walking or (more dangerously to the rest of us) while driving.
Hence, I expect wearable computers to become more commonplace. Since wearables will be able to act as telephones, I expect them to replace telephones. Moreover, because of what I perceive to be difficulties with a pure auditory interface, I expect that wearable displays will become available and inexpensive.
These wearable displays will provide a high resolution screen that takes up no more space than the lens of a pair of eyeglasses. I doubt the display will be `fixed' on the device that the person wears. Instead I imagine that it will appear to be `fixed' on the external environment, on the view behind it, like current CRTs and flat screens.
I have heard this feature called `tagging'. Clearly, the contents of the display will need to move gently when the wearer turns around. Otherwise, the user will not be able to see them. At the same time, a small motion should simply move the view to another part of the image, to the equivalent of the eight virtual `desktops' or `workspaces' that I am using now and that are commonplace on contemporary graphic user interfaces. It goes without saying that the human interface for this will be hard to design.
As yet, ordinary people do not use general purpose robots. They do not exist. By robot, I mean a device that can sense and react to a more complex environment than a thermostat, which detects and reacts only to temperature. Robots need not be mobile, but I suspect that household robots will be.
Contemporary computers are general purpose devices. However, they are mostly seen and used as information devices. They are useful for writing and calculating, for electronic mail, and for browsing the Web. Contemporary personal computers primarily detect and respond to keyboard input, Internet connections, and the like. The computers that run automobile engines respond to other inputs, but people mostly ignore that kind of use.
Computers that do one thing, such as vacuum the floor, are special purpose robots. These already exist, but are not yet in widespread use among regular consumers, and not very good at what they do. Vacuuming computers detect and respond to walls and to objects on the floor. They should be able to choose what to do: to vacuum a dust ball; to pick up, but not suck up a lost ring. And to stay away from the dog.
Special purpose robots are easier to design and build than general purpose robots. This is important because even special purpose robots are difficult to design and expensive to build. Currently, special purpose robots are fairly common in manufacturing. They are becoming common for tasks such as drug delivery within hospitals. I am told a robot exists that can make sushi in a restaurant. But hardly any special purpose robots exist in homes. I do not know of any that work to everyone's satisfaction.
But when they become available, general purpose home robots will be popular. Rather than acquire a special purpose robot that vacuums the floor, and another to make the bed, people will choose a general purpose device.
It will be hard to design such a general purpose device. In a home, a general purpose robot will do several complex tasks: not only will it be able to vacuum the floor, but it must be able to recognize what not to vacuum; not only should it be able to crack eggs for breakfast, but it must know how to recognize and remove the tiny bits of egg shell that sometimes fall onto the eggs.
You could operate a general purpose robot with typed commands. But my hunch is that such a robot will need voice recognition. People will not want to type commands. They will want to tell their `artificial stupid' what to do.