Over its 150-year history, there has been no shortage of praise for the brilliant design of the London Underground system - the legendary tube map, the wayfinding and branding, the overall user friendliness, etc. But how much love do those delightful pre-recorded voices receive - "Mind the gap" ring a bell? Not enough, I say. It's time those helpful ladies of the tube, the kind mavens of the buses, got their due. You see, not only are they meticulously informative, reminding us not to fall into the abyss between the train and the platform and helping us get from point A to point B, they are the personality of these otherwise groaning, whining, beeping industrial systems. They humanize the vast networks of underground and overground trains. They breathe life into the thousands of relentless miles of bus routes. Each mode of transportation (down to the very tube line!) has a unique voice (although all are female), which, along with the naming and color schemes, gives each its own distinct personality.
This kind of humanization, through the giving of voice, be it recorded human voice or synthesized, is at work in devices in every sector of the consumer marketplace (think Siri), and is a critical aspect of how we humans relate to technology. What will the future sound like when all of our things have a voice, when our devices are battling to get our attention, to slip us what they are convinced is a helpful tip? Let's see how history can prepare us for this future and then look at the emerging tech that might just be there to greet us with a friendly welcome and a helpful reminder.
According to the Smithsonian Speech Synthesis History Project, speech synthesis technology is defined as “artificial sounds that people would interpret as speech.” In other words, it is “the process by which computers speak to people,” not to be confused with speech recognition, “the complementary process by which computers interpret what people say to them.” Since it was pioneered in the 1930s, speech synthesis technology has made mostly incremental improvements to the quality of the digital voice. Some notable voices along the way include:
- The Voder - the first electronic speech synthesizer, demonstrated by Bell Labs at the 1939 New York World's Fair
- Daisy Bell - the first song sung by a computer, an IBM mainframe at Bell Labs in 1961
- Perfect Paul - probably the most widely used synthesized voice ever
- Watson - IBM's system, famous for its Jeopardy! mastery
Arguably the first consumer product with a human-esque voice was the Speak & Spell, the iconic learning toy that "spoke" to children, teaching both the correct spelling and the pronunciation of a word. Developed in the late 1970s by Texas Instruments, the Speak & Spell was introduced to the public at the Summer Consumer Electronics Show in June 1978.
Listen to the original Speak & Spell here
Its success is owed to the development of digital signal processing (DSP) technology, “the manipulation of analog information into digital. In Speak and Spell's case it was analog ‘sound’ information that was converted into a digital form.”
“It marked the first time the human vocal tract had been electronically duplicated on a single chip of silicon.”
Yet, as groundbreaking as the Speak & Spell and these other synthesized voices were, they were still a long way from sounding human. The subtleties of human emphasis, timing, and intonation were, and still are, dead giveaways of an artificial voice. Their own inventors knew this, and it is painfully obvious today, yet there has never been a scientific way to test whether a synthesized voice succeeds at representing the human voice. In 1950, Alan Turing proposed what is now called the Turing Test as a way to judge machine intelligence. The test asks that a computer program attempt to “impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably — on the basis of the conversational content alone — between the program and a real human.” I propose that this legendary test can be adapted for use with voice instead of written content: a listener, rather than a reader, would judge whether the speaker is human. No doubt, the technology has a long way to go before synthesized speech will be confused with real human speech. And it’s no easy task. To drastically oversimplify the state of the art for the sake of this article, there are two promising directions towards a more human-sounding synthesized voice:
- Truly synthesized speech: “a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds.” This is how most of the classic synthesized voices were created.
- Data-based speech synthesis: Also known as concatenated speech, this technique “draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language.” This is how Roger Ebert’s new voice, as well as Siri’s and Google Maps’ voices, were created.
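To make the data-based approach a little more concrete, here is a minimal Python sketch of the idea, not how Siri or any real product does it. Everything in it is an assumption made for illustration: the unit names, the `UNIT_LIBRARY` dictionary, and the sine tones that stand in for recorded snippets of natural speech. It simply looks up stored units and joins them, blending a few samples at each seam so the joins are less abrupt.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)


def fake_recording(freq_hz, duration_s=0.1):
    """Stand-in for a recorded speech snippet: a plain sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit library; a real one holds hours of recorded natural speech.
UNIT_LIBRARY = {
    "HH": fake_recording(200),
    "EH": fake_recording(300),
    "L": fake_recording(250),
    "OW": fake_recording(350),
}


def concatenate(unit_names, library, crossfade=80):
    """Join stored units end to end, blending `crossfade` samples at each seam."""
    output = []
    for name in unit_names:
        unit = library[name]  # raises KeyError if the unit was never recorded
        overlap = min(crossfade, len(output), len(unit))
        for i in range(overlap):
            weight = (i + 1) / (overlap + 1)  # fades the incoming unit in
            tail_index = len(output) - overlap + i
            output[tail_index] = (1 - weight) * output[tail_index] + weight * unit[i]
        output.extend(unit[overlap:])
    return output


samples = concatenate(["HH", "EH", "L", "OW"], UNIT_LIBRARY)
print(len(samples), "samples,", len(samples) / SAMPLE_RATE, "seconds")
```

Notice what the sketch cannot do: nothing in it changes the pitch, timing, or stress of a unit once it has been recorded, which hints at why emphasis and emotion are so hard for this family of techniques.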
Both methods currently struggle, in particular, with the real challenge of conveying the emotion of human speech.
“Even the best commercially-available concatenated speech systems do not even attempt to conquer the problem of emphasis. In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear in synthesized speech successfully convey these feelings only about 60% of the time (pdf here), and the numbers are even worse for joy.”
This is clearly a critical hurdle for synthesized voices in cases where people rely on text-to-speech software to speak for them. And I propose that if the challenge can be solved for these use cases, the benefits will be enjoyed universally. We humans are best able to understand voices that sound like ours, so at a purely functional level, the more our things talk like us, the more quickly we’ll be able to understand them and process the information they are relaying to us. Think about the classic example of turn-by-turn navigation. How many times has your Garmin given you unintelligible directions? Even Apple Maps (spoken by Siri) and Google Maps, which have all but made navigation-specific devices obsolete, struggle with voice inflection and emphasis, although in my opinion the Google Maps voice sounds far more fluid and human-like than Apple Maps (sorry Siri!). Imagine if the driving directions from your device were just as easy and clear as if they were delivered in your best friend’s voice. What if it had the ability to communicate in an emotional way that made you, a stressed-out driver, feel comforted and confident that your driving buddy had your back?
What I’m getting at goes deeper than the functional comprehensibility of the synthesized voice. We can perhaps all agree that turn-by-turn directions, as rudimentary as they are, tend to be acceptable as they are today. But think about the potential for emotional connection. As our objects start sounding more human, the potential for forming relationships with our things increases tremendously. With the addition of unexpected and candid-feeling language, the technology, the computer chips and plastic, starts to sound less pre-programmed, more spontaneous, more human. The layering in of humor and artificial emotion will be the next big shift in digital voices. Like the London Underground, our industrial systems, our computers, our phones, and our children’s toys will increasingly develop application-specific qualities of voice. A future Speak & Spell, for example, might speak with an authoritative, educational tone, whereas a Beanie Baby might speak more like a young child. Your recycling bin may gently chastise you when you toss in something non-recyclable, and give you a warm thanks when you use it correctly.
And while that may sound a bit frightening (and the video below speaks to why we find it frightening), I encourage you to focus on the potential for delight, for surprise, for love, and for connection.
Genevieve Bell speaks about her research working at Intel at UX Week 2012
The previously inarticulate objects that we touch every day, with the addition of voices, will have entirely new roles in our lives. They will transform from things to friends. From objects to assistants. The previously anonymous systems that we engage with every day to get to work, to buy our groceries, to track our exercise, will all have personalities that, as with other people, we will come to like or dislike, to trust or distrust. Also, like people, they will talk to each other.
They will have human names (think KITT, HAL 9000, Watson, Siri), serving to further anthropomorphize them.
“Using a human-style name reflects our relationship with the thing being named, and shapes it, too. Indoor pets, for instance, tend to be given more human names than outdoor animals. Assigning a name to a car or other possession is both a sign of growing affection and a spur to further bonding. Around my house, I've found that it's nearly impossible to throw out any object that my kids have named. Names give objects emotional life. You say, "the iPhone" and "my iPhone," but not "the Siri." It—she—is simply Siri. The name makes the act of conversing with a metal slab feel natural. And that emotional connection seems to invite a powerful kind of consumer loyalty.”
Apple clearly marketed Siri as a personal assistant, someone who would be there when we needed her, waiting patiently inside our devices. They set our expectations and introduced us to her as one might do with a new friend. Compared with the voices in car navigation systems, Siri felt like an authentic personality. Someone we would joke with, tease, thank, and even say goodnight to. Just as giving a name to something both enhances and signifies an emotional bond, so too does having a conversation with something.
Siri is far from perfect, as many users can attest. Her responses are often impersonal, and verge on unhelpful. It will be the responsibility of designers to craft not only the technical abilities of these voices, but also their synthesized emotions. It is practically a given that speech technology work such as Ray Kurzweil’s at Google will eventually yield intelligent machines with the ability to speak like humans. But then come the interesting questions. Should your microwave’s personality be more like a fellow cooking amateur, or a master chef instructor? Should your home thermostat tell jokes about energy conservation? Should your car exude trustworthiness and focus? Designers of the future will have the responsibility to craft these characteristics, which individuals may or may not judge as appropriate. And it will be the responsibility of marketing and advertising to introduce us to these new “people” in our lives; to let us know what kind of personality we can expect when we take our new device out of a box and turn it on.
Taking this one step further, imagine if our devices could learn things about us and configure themselves to our personality profiles. Instead of relying purely on a database of pre-programmed lines to play at random, our devices could dynamically generate context-appropriate responses and tones of voice. One could easily imagine semantic text mining of tweets and Facebook posts that would tune our devices’ personalities in real time. Devices, like humans, will one day be able to deduce things about other humans, and convert those findings into a response that resembles empathy.
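As a deliberately crude illustration of that idea, the Python sketch below guesses a mood from a person's recent posts and picks a tone of voice to match. It is only a toy under stated assumptions: the word lists, the tone names, and the sample lines are all invented for this example, and a real system would need a proper sentiment model rather than keyword counting.

```python
# Hypothetical word lists; a real system would use a trained sentiment model.
POSITIVE = {"great", "love", "happy", "excited", "thanks"}
NEGATIVE = {"tired", "late", "stressed", "angry", "sad"}

# Hypothetical tones a device voice might offer, with one sample line each.
RESPONSES = {
    "upbeat": "Nice one! Turn left at the next light.",
    "neutral": "Turn left at the next light.",
    "soothing": "No rush. Turn left at the next light when you're ready.",
}


def mood_score(posts):
    """Count positive keywords minus negative keywords across all posts."""
    score = 0
    for post in posts:
        for word in post.lower().split():
            word = word.strip(".,!?")  # drop trailing punctuation
            if word in POSITIVE:
                score += 1
            elif word in NEGATIVE:
                score -= 1
    return score


def pick_tone(posts):
    """Map the mood score to one of the hypothetical tones."""
    score = mood_score(posts)
    if score > 0:
        return "upbeat"
    if score < 0:
        return "soothing"
    return "neutral"


recent_posts = ["So stressed and late again!"]
print(RESPONSES[pick_tone(recent_posts)])
```

The interesting design question is not the counting but the mapping: someone still has to decide that a stressed driver should hear the soothing line rather than the upbeat one.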
In the future, the voice of an object will become a critical touchpoint, as important as its color or build quality. These voices will be used to guide us, to motivate us, to make us laugh, to comfort us when we’re sad, to make us feel loved. And we won’t even care that they’re coming from inanimate objects; that silicon emotion isn’t exactly like human emotion. Because we’ll think of them as friends.