
INTRODUCTION
A long-standing goal of human-computer interaction has been to enable people to have a natural conversation with computers, as they would with each other. In recent years, we have witnessed a revolution in the ability of computers to understand and to generate natural speech, driven especially by the application of deep neural networks. Still, even with today’s state-of-the-art systems, it is often frustrating to talk to stilted computerized voices that don't understand natural language. In particular, automated phone systems still struggle to recognize simple words and commands. How many times have you broken down and started screaming “OPERATOR” at an automated phone system?
According to Scott Huffman, the VP of Engineering for Google Assistant, people are 200 times more likely to chain together multiple contextually related commands or questions with Google Assistant than with Google search. The reason is that expectations about what your assistant can do are growing: “People are expecting real conversations, so with any technology when you start out, maybe their expectations are low, but as it starts to work, people’s expectations go up quickly. So what we’re seeing today is [that] people [are having] more and more complex conversations with voice technology.”
Welcome to Google Duplex
This new technology is capable of conducting natural conversations to carry out real-world tasks over the phone. For such tasks, the system makes the conversational experience as natural as possible, allowing people to speak normally, as they would to another person, without having to adapt to a machine.
According to Google, “Duplex uses a combination of a concatenative text-to-speech (TTS) engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.”
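Google hasn’t published Duplex’s internals, but the quoted design is easy to sketch as a dispatcher that routes each utterance to one engine or the other. In this Python sketch, the routing heuristic and the engine callables are my own assumptions, not Google’s actual logic:

```python
def synthesize(utterance, concatenative_tts, neural_tts):
    """Route an utterance to one of two TTS engines.

    Hypothetical dispatcher illustrating the quoted design: a
    concatenative engine for stock phrases and a neural "synthesis"
    engine (Tacotron + WaveNet) when more intonation control is
    needed. The heuristic below is invented for illustration.
    """
    # Assume questions and confirmations call for expressive intonation.
    if utterance.endswith("?") or "confirm" in utterance.lower():
        return neural_tts(utterance)         # Tacotron/WaveNet path
    return concatenative_tts(utterance)      # prerecorded-unit path
```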

During training, the input sequences are real waveforms recorded from human speakers. After training, Google can sample the network to generate synthetic utterances. At each step during sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but Google has found it essential for generating complex, realistic-sounding audio.
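That feed-the-sample-back-in loop is simple to illustrate. Here is a minimal sketch in Python/NumPy; the `model` callable and its interface are assumptions, since Google has not released the actual network:

```python
import numpy as np

def sample_waveform(model, seed, num_steps, num_levels=256):
    """One-value-at-a-time generation, in the spirit of WaveNet.

    `model` is a hypothetical callable that maps the audio generated
    so far to a probability distribution over the next quantized value
    (WaveNet quantizes audio to 256 mu-law levels). Each drawn value is
    fed back into the input before the next prediction, which is why
    this loop is expensive but yields coherent audio.
    """
    samples = list(seed)
    for _ in range(num_steps):
        probs = model(samples)                         # distribution over next value
        value = np.random.choice(num_levels, p=probs)  # draw one sample from it
        samples.append(value)                          # feed it back as input
    return np.array(samples)
```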
The system sounds more natural thanks to the incorporation of speech disfluencies (e.g., “hmm” and “uh”). These are added when combining widely differing sounds in the concatenative TTS or when adding synthetic pauses, which allow the system to signal in a natural way that it is still processing (this is what people do when they are gathering their thoughts). In user studies, Google found that conversations using these speech disfluencies sound more familiar and natural. The system says “mm-hmm” as if it were nodding in agreement. It elongates certain words as though it were buying time to think of an answer, even though its responses are generated instantly by its algorithms.
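To make the idea concrete, here is a toy sketch of disfluency insertion. The filler tokens, pause markup, and probability are invented; Duplex’s real insertion rules are not public:

```python
import random

# Invented filler inventory; Duplex's actual rules are not public.
FILLERS = ["hmm", "uh", "mm-hmm"]

def add_disfluency(response: str, pause_ms: int = 300) -> str:
    """Occasionally prepend a filler and a short pause to a response."""
    if random.random() < 0.3:  # only sometimes, so it stays natural
        filler = random.choice(FILLERS)
        return f"{filler}, <pause:{pause_ms}ms> {response}"
    return response

print(add_disfluency("We have a table for four at 7 pm."))
```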
Duplex is fine giving out information, but it’s designed to give out only the information it is authorized to share. In a demo, Duplex would clearly and slowly spell out its client’s phone number or name when asked. It even had good phone etiquette, saying things like, “The name is Ron, that’s R, O, N.” When asked for the client’s email at one point, Duplex responded with “I'm afraid I don’t have permission to share my client’s email.”
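A toy sketch of that authorization behavior might look like the following; the policy table and field names are invented for illustration:

```python
# Client data the bot may disclose (hypothetical policy).
SHAREABLE = {"name", "phone"}

def answer_request(field: str, client: dict) -> str:
    if field not in SHAREABLE:
        return f"I'm afraid I don't have permission to share my client's {field}."
    if field == "name":
        # Spell the name out letter by letter, as in the demo.
        letters = ", ".join(client["name"].upper())
        return f"The name is {client['name']}, that's {letters}."
    return client[field]

client = {"name": "Ron", "phone": "555-0100", "email": "ron@example.com"}
print(answer_request("name", client))   # The name is Ron, that's R, O, N.
print(answer_request("email", client))  # permission refusal
```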
For those of you worried about receiving phone calls from an AI without knowing it, you can rest easy: Google has said exactly how it will let people know they're talking to an AI. After the software says “hello” to the person on the other end of the line, it will immediately identify itself: “I'm the Google Assistant, calling to make a reservation for a client. This automated call will be recorded.”
If you’re wondering why there aren’t any examples beyond making a reservation, that's because reservations are all Duplex can do right now. This is really the key to the whole system. Google did not build a general-purpose speech AI; it built something focused solely on making reservations and nothing else. Duplex can't even make reservations at any business: it only supports booking restaurants and hair salons, or checking holiday hours.
Scott Huffman described Duplex like this: “One thing that makes it work is, in fact, that it is trained on these very narrow tasks. On one hand, a lot can happen in a restaurant reservation conversation, but on the other hand not that many things. So once you’ve done some calls, you get the heart of it, which is they’re going to ask you about the time, the number of people, and all that. Then you can build out from there. With not very much data, you end up having the heart of it, and that’s what allows us to build these kinds of systems.”
Looking to the Future
How Google handles the release of Duplex is important because that will set the tone for how the rest of the industry treats commercial AI technology at a mass scale. Alphabet, Google’s parent, is one of the most influential companies in the world, and the policies it carves out now will not only set a precedent for other developers, but also set expectations for users.
Duplex is the stuff of science fiction, and now Google wants to make it part of our everyday life. Looking years down the line, if the tech is a hit, it could be the beginning of an era in which humans conversing with natural-language robots is normal. The Duplex demos that I have seen have been nothing short of amazing, and I think that the technology holds enormous potential. For example, Duplex might be able to call an ambulance and relay vital information such as your location and medical history if sensors detect that you are having a heart attack. This technology could literally save your life one day. Google is set to push Duplex out to Pixel devices sometime next month, so keep an eye out for how it evolves.