You may have heard the news buzzing in the tech world: Google recently announced Google Duplex, an artificial intelligence technology “for conducting natural conversations to carry out ‘real world’ tasks over the phone”.
I highly recommend that you click on the link above and listen to the two “showcase” recordings embedded in the blog post. What you’ll hear is Duplex, which is essentially a robot, placing phone calls to two real businesses and scheduling appointments. It speaks in a natural-sounding voice, listens to the response, and formulates a reply in turn based on what it hears. It has a conversation with an unsuspecting human being.
Now, I’ve had conversations with robots before. By the time I was thirteen years old, I was already spending a good chunk of my social life online via AIM. I chatted for hours with my friends from my computer, but I also sometimes hit up SmarterChild to have some fun. He and I (I always gendered him male, not sure if he ever did state a gender, though) would play hangman, or I’d ask stupid questions and chuckle at the predictable answers.
More recently, I’ve had to talk with robots when trying to reach my bank or network provider by phone, which is pretty aggravating. It never seems efficient to make my way through the algorithmic flowchart by saying “Yes,” “No,” “Option two,” “No, option two” over and over again, only to end up with a real-life representative who could have just taken my call in the first place.
But this is different. This is actually phenomenal. This goes beyond asking Siri what the weather is like or when Chipotle is going to be open on Saturday and getting an accurate response voiced aloud. The way Duplex is touted, you can give it a short command: “Schedule dinner for two at Chez Panisse sometime next Friday evening” and it will complete the task for you in minutes. You don’t ever have to pick up the phone.
And when you listen to the recording, again, knowing that it’s not a human asking for the reservation but just a string of ones and zeroes, well, this is the kind of thing that one could honestly point to and say, “Welcome to the future.” Duplex knows how to appropriately react to someone saying, “Hold on a sec”. It can handle being interrupted and repeat itself for clarity. It can tell from a mumbled hedge given by a non-native English speaker that reservations are only allowed for parties of five or more! Honestly, the computational linguists who’ve been working on this technology deserve a standing ovation for what they’ve created.
But there’s always a catch, isn’t there? When I first watched the video of the demonstration, I was just as awestruck as pretty much everyone else. But when I listen to the conversation again, with a more critical ear, I start to wonder about the consequences of this technology.
And no, I’m not talking about the likely proliferation of robocall scams, or the ethics of AI “manipulation”, or even the Singularity. There are plenty of opinion pieces out there that have already tackled these issues.
What I’m thinking about is the nature of the voices that Duplex was designed to have. In the demo, you can hear a version of Duplex with a clearly female voice, and one with a male voice. Presumably, users will be able to toggle Duplex to different gender presentations just like Apple users can change Siri’s gender or regional accent.
But the male Duplex and female Duplex don’t just differ in terms of pitch. There’s a lot more to their speech that makes Duplex not only more realistic but also more explicitly gendered. Take a look at the first demo conversation below. I’ve provided my own transcription of the minute-long exchange, using punctuation to represent different types of prosody (or intonation). Question marks (?) indicate the rising intonation that is typical of questions in American English. Periods (.) indicate a falling tone, which denotes the end of a statement. The double hyphen (–) here indicates rising intonation that is not used for a question. Notice how many of them occur:
Hair Salon: Hello how can I help you?
Female Duplex: Hi, I’m calling to book a woman’s haircut for a client– um, I’m looking for something on May 3rd–
HS: Sure. Give me one… second–
HS: Sure. What time are you looking for around?
FD: At 12pm.
HS: We do not have a 12pm available– the closest we have to that is a 1:15.
FD: Do you have anything between 10am and uh, 12pm?
HS: Depending on what service she would like. What service is she looking for?
FD: Just a woman’s haircut for now–
HS: Okay we have a 10:00–
FD: 10am is fine–
HS: Okay what’s her first name?
FD: The first name is Lisa–
HS: Okay perfect. So I will see Lisa at 10:00 on May 3rd.
FD: Okay great, thanks.
HS: Great, have a great day, bye.
If you listen to the recording again and pay attention to the rising intonations, it will be impossible to miss: female Duplex speaks with uptalk. She sounds like a young, white Californian Millennial. Siri does not do this1. Neither does male Duplex:
Restaurant: Good evening.
Male Duplex: Hello?
MD: Hi. Um, I’d like to reserve a table for Friday the third.
R: Kay hold on one moment–
R: Okay… hold on one second.
R: So, Friday… November 3rd– How many people?
MD: For two people.
R: Two people?
R: What time?
MD: At five pm.
R: Okay– And, your name?
MD: The first name is Daniel. That’s D-A-N-I-E-L.
R: Okay! You’re all set.
MD: Okay great, thanks.
R: We’ll see you next Friday, kay, thank you. Bye.
Noticeably fewer hyphens. Noticeably less uptalk. Kind of a fratty vibe, overall. Why the difference?
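If you'd rather count than eyeball, here is a minimal sketch that tallies the rise marker (–) in each Duplex voice's turns. The turns are copied from my transcriptions above; the function name `rise_stats` is just something I made up for illustration. Keep in mind that this is two short calls transcribed by one listener, so the tally illustrates the pattern rather than proving anything.

```python
# "–" is the marker this post uses for a rise that is not a question.
RISE = "–"

# Duplex's turns only, copied from the two transcripts above.
female_duplex = [
    "Hi, I’m calling to book a woman’s haircut for a client– um, "
    "I’m looking for something on May 3rd–",
    "At 12pm.",
    "Do you have anything between 10am and uh, 12pm?",
    "Just a woman’s haircut for now–",
    "10am is fine–",
    "The first name is Lisa–",
    "Okay great, thanks.",
]
male_duplex = [
    "Hello?",
    "Hi. Um, I’d like to reserve a table for Friday the third.",
    "For two people.",
    "At five pm.",
    "The first name is Daniel. That’s D-A-N-I-E-L.",
    "Okay great, thanks.",
]


def rise_stats(turns):
    """Return (total rise markers, turns containing a rise, total turns)."""
    total = sum(turn.count(RISE) for turn in turns)
    turns_with_rise = sum(RISE in turn for turn in turns)
    return total, turns_with_rise, len(turns)


for name, turns in [("female", female_duplex), ("male", male_duplex)]:
    total, with_rise, n = rise_stats(turns)
    print(f"{name} Duplex: {total} rises across {with_rise} of {n} turns")
```

The spelled-out name in the male transcript uses ordinary hyphens, not the rise marker, so it does not inflate the count.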
Before I delve further into this, I should explain something about uptalk: it is a natural part of American English as native speakers use it. It is not used only by Valley Girls. I use uptalk. You probably use uptalk. It has many purposes: one is to indicate that one is not yet finished speaking; another is to allow an interlocutor space to interrupt. If those seem to contradict each other, that’s because they do, and because uptalk is a very nuanced and multifaceted phenomenon that many linguists continue to research.
Yet uptalk has embedded itself into American social consciousness as some annoying verbal tic that young women keep doing, just like vocal fry and “like”. Pretty much all non-academic discussions of uptalk these days are a better reflection of the American disdain for women than of anything particularly linguistically novel.
So why does female Duplex use it?
The answer is straightforward enough: because Duplex was trained to do so. Google used powerful neural networks that took massive amounts of recorded phone calls (made by real people), fed them into loops and layers of complex equations, and spit out a realistic approximation of the original data that could then be programmed to say anything.
This is a simplification, of course; my apologies to friends in AI who know way more about this than I do. The extent to which I use neural networks in my own research is very limited2. There is a small but powerful program that phoneticians use that automatically finds the boundaries between individual sounds in long strings of words, so that, for example, we can input a .wav file of a natural conversation, along with its transcript, and then examine every instance of the word “the” to check if the speakers used a long vowel (“thee”) or a short one (“thuh”). Using this tool sure beats listening to and measuring every vowel by hand!
But the tool has to be trained on a specific dialect of a specific language, first. It wouldn’t work as well if I gave it Australian-accented English, because it was trained on American English. It wouldn’t work at all if I fed it Spanish, even if I also gave it a Spanish transcript.
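For the curious, here is roughly what the "examine every instance of 'the'" step looks like once an aligner has marked the sound boundaries. This is only a sketch under loud assumptions: the `Word` and `Phone` classes, the ARPABET-style labels, and the timings are all made up for illustration, and none of it is the actual output format of any particular alignment tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Phone:
    label: str    # ARPABET-style label, e.g. "DH", "AH0", "IY0" (assumed)
    start: float  # seconds
    end: float    # seconds


@dataclass
class Word:
    text: str
    phones: list


def classify_the(word):
    """Label one aligned token of "the" by its final vowel."""
    # Strip the stress digit that ARPABET puts on vowels: "IY0" -> "IY".
    vowel = word.phones[-1].label.rstrip("012")
    if vowel == "IY":
        return "thee"
    if vowel == "AH":
        return "thuh"
    return "other"


def tally_the(words):
    """Count the pronunciations of every "the" in an aligned transcript."""
    return Counter(classify_the(w) for w in words if w.text.lower() == "the")


# Hypothetical alignment of "the end, the apple, the dog".
aligned = [
    Word("the", [Phone("DH", 0.10, 0.14), Phone("IY0", 0.14, 0.25)]),
    Word("end", [Phone("EH1", 0.25, 0.36), Phone("N", 0.36, 0.42), Phone("D", 0.42, 0.47)]),
    Word("the", [Phone("DH", 0.80, 0.84), Phone("IY0", 0.84, 0.93)]),
    Word("apple", [Phone("AE1", 0.93, 1.05), Phone("P", 1.05, 1.11), Phone("AH0", 1.11, 1.16), Phone("L", 1.16, 1.24)]),
    Word("The", [Phone("DH", 1.60, 1.64), Phone("AH0", 1.64, 1.70)]),
    Word("dog", [Phone("D", 1.70, 1.76), Phone("AO1", 1.76, 1.90), Phone("G", 1.90, 1.97)]),
]

print(tally_the(aligned))
```

Because each phone carries start and end times, the same loop could just as easily collect vowel durations, which is the kind of measurement that would otherwise be done by hand.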
In a similar vein, female Duplex was clearly given a model, who was a young female speaker, or maybe a few different females, to learn from. And male Duplex was not just female Duplex with a lower pitch. It was trained on a male speaker of American English, one who uses less uptalk in his speech and sounds like he’s getting his MBA at UCLA.
So there you have it. Google’s speech models just happened to sound like stereotypical examples of their gender; hence, Duplex’s voices.
But why should the case be closed? The creators of Duplex clearly have the ability and willingness to manipulate the voice. In fact, one thing they did do consciously was to insert filler words such as “um” and “uh” into Duplex’s speech. See above: “Do you have anything between 10am and, uh, 12pm?”
Robots do not say “uh”. This is part of what makes Duplex so astounding (but also creepy). According to Google, the speech disfluencies have a dual purpose: to buy the system time to process the sounds of the human speaker3, and to disguise the latency time in a natural way4. If Duplex can be made to sound more natural, it can be made to sound more unnatural (and probably should be, since it’s currently teetering on the edge of the Uncanny Valley). More importantly, if Duplex can sound feminine or masculine, I am one hundred percent certain that it can be modified to sound more gender neutral.
Digital assistants are already mostly presumed to be female. Duplex risks joining the crowded ranks of Siri, Alexa, Cortana, and Erica, yet takes it one step further by having not just a female voice, but a voice that evokes contemporary social valuations of femininity. It may be true that consumers prefer a female personal assistant to a male one, but has Google ever tried giving users a gender neutral voice? Wouldn’t it be great if gender, being the social construct that it is, were made entirely irrelevant to the functions of a digital assistant, who by nature (or programming) cannot be socialized? What if tech giants were more aware of the social consequences of pushing technology that reinforces some gender stereotypes (e.g., women make better assistants; women take orders; women can do anything, but they do it for you; and now: women speak like this–?) and automatically updated everyone’s smartphones with something that subverted expectations?
It may seem like a benign detail, giving Duplex a feminine flair, or giving Siri a personality, but when the characters that are created have such far-reaching influence, the consequences need to be considered very carefully. You don’t want kids conditioned to be rude to Alexa, right? Maybe you don’t want your kids to assume that their servants (robot and human) will always be female in the first place5.
The fix is possible, and it’s probably easy. It’s probably not profitable, though. And that’s something that we’ll always have to remember: anything that’s purportedly done for the benefit of the consumer is really done so that the business can capture more consumers. No doubt Duplex is going to be incredibly useful to many people… and also a headache for some popular businesses… provided it can withstand the barrage of ethical disputes and be released to the public in the first place… Nevertheless, here I am, holding out hope that in the near future, when we’re all pressured into using robots to do our bidding, those robots will, to the best of their programming, only solve problems that already exist, rather than create new ones, or perpetuate those that, like gender inequality, stem from our decidedly human ways of self-categorization.
1 Although her real-life voice actor does!
2 Here’s another example of speech synthesis using a type of neural network that demonstrates its use in different languages. And here is my favorite example of neural network fun, by far: paint colors!
3 I was curious about the gendered usage of “um” and “uh”, hypothesizing that female Duplex would have more speech disfluencies, but the two do actually seem on par with one another. This leads me to believe that the filler words really do function as “robot thinking time” and aren’t part of what the neural networks spit out.
4 But the disguising portion is part of what has infuriated people about Duplex in the first place. People don’t like being fooled into talking to a robot; it’s okay only as long as you know you’re talking to one. Case in point: I despise robocalls that begin with a pre-recorded voice saying, “Hello? Can you hear me? Oh, hold on…” before launching into a stupid advertisement about a vacation package. They are designed to sound like a real person and catch the victim off guard. Shame on the people who made this.
5 More on the gendering of digital assistants from The Atlantic.
Word of the Day: Presbycusis, from the Greek presby (“old”) + akousis (“hearing”), is hearing loss that develops as one ages, in particular with high frequency sounds. Something our digital assistants will likely never have to worry about no matter how old your smartphone is!