
The voices of AI tell us a lot

What does artificial intelligence sound like? Hollywood has been imagining it for decades. Now AI developers are drawing from movies and creating voices for real machines, based on outdated cinematic fantasies of how machines should talk.

Last month OpenAI revealed upgrades to its artificially intelligent chatbot. ChatGPT, the company said, was learning to hear, see and converse in a naturalistic voice — one that sounded a lot like the disembodied operating system voiced by Scarlett Johansson in the 2013 Spike Jonze film “Her.”

ChatGPT’s voice, named Sky, also had a husky timbre, a soothing quality, and a sexy edge. She was pleasant and self-effacing; she sounded like she could do anything. After Sky’s debut, Johansson expressed dismay at the “eerily similar” sound and said she had previously turned down OpenAI’s request to voice the bot. The company protested that Sky was voiced by “another professional actress,” but agreed to pause her voice out of respect for Johansson. Bereft OpenAI users have started a petition to bring her back.



AI makers like to emphasize the increasingly naturalistic capabilities of their tools, but their synthetic voices are built on layers of artificiality and projection. Sky represents the vanguard of OpenAI’s ambitions, but she is based on an old idea: the AI bot as an empathetic and accommodating woman. Part mother, part secretary, part friend, Samantha, the operating system in “Her,” was a universal comfort object that purred directly into the ears of her users. Even as AI technology advances, these old stereotypes keep being coded back in.

Women’s voices, as Julie Wosk notes in “Artificial Women: Sex Dolls, Robot Caregivers, and More Fake Women,” have often fueled imagined technologies before being built into real technologies.

In the original “Star Trek” series, which debuted in 1966, the voice of the Enterprise’s computer was provided by Majel Barrett-Roddenberry, the wife of show creator Gene Roddenberry. In the 1979 film “Alien,” the crew of the USCSS Nostromo referred to the ship’s computer voice as “Mother” (her full name was MU-TH-UR 6000). As tech companies began selling virtual assistants — Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana — their voices were largely feminized, too.

These first-wave voice assistants, which have been mediating our relationships with technology for more than a decade, have a tinny, alien quality. They sound Auto-Tuned, their human source voices inflected with a mechanical trill. They often speak in a measured, monotone cadence that suggests a limited emotional life.

But the fact that they sound robotic adds to their appeal. They come across as programmable, manipulable, and subservient to our demands. They don’t make us feel like they’re smarter than we are. They sound like throwbacks to the monotone female computers of “Star Trek” and “Alien,” and their voices have a retro-futuristic sheen. Instead of realism, they serve nostalgia.



That artificial sound is still dominant, even though the technology behind it has improved.

Text-to-speech software is designed to make visual media accessible to users with certain disabilities, but on TikTok it has become a creative force in its own right. Since TikTok rolled out its text-to-speech feature in 2020, it has developed a slew of simulated voices to choose from. There are now more than 50, including names like “Hero,” “Story Teller,” and “Bestie.” But the platform is defined by one option: “Jessie,” a relentlessly perky female voice with a vaguely robotic undertone, the madcap voice of the madcap scroll.

Jessie seems to have one emotion assigned to her: enthusiasm. She sounds like she’s selling something. That’s made her an attractive choice for TikTok creators who are selling themselves. The burden of representing yourself can be outsourced to Jessie, whose crisp, retro robot voice lends videos a pleasingly ironic sheen.

Hollywood has also constructed male bots — none more famous than HAL 9000, the computer voice in “2001: A Space Odyssey.” Like his feminized counterparts, HAL radiates serenity and loyalty. But when he turns on Dave Bowman, the film’s central human character — “I’m sorry, Dave, I’m afraid I can’t do that” — his serenity curdles into terrifying competence. HAL, Dave realizes, is loyal to a higher authority. HAL’s masculine voice allows him to function as a rival and a mirror to Dave. He can become a real character.



Like HAL, Samantha of “Her” is a machine who gets to become a real character. In a twist on the Pinocchio story, she begins the film by cleaning out a human’s email inbox and eventually ascends to a higher level of consciousness. She becomes something even more advanced than a real girl.

Inspiring bots both fictional and real, Scarlett Johansson’s voice subverts the vocal trends that define our feminized helpmeets. It has a grainy edge that announces it belongs to a living person. It sounds nothing like the processed virtual assistants we’re used to hearing on our phones. But her performance as Samantha feels human not just because of her voice but because of what she has to say. She grows over the course of the film, acquiring sexual desires, advanced hobbies, and AI friends. By borrowing Samantha’s affect, OpenAI made Sky seem as if she had a mind of her own, as if she were more advanced than she actually was.

When I first saw “Her,” all I thought was that Johansson had voiced a humanoid bot. But when I rewatched the film last week, after watching OpenAI’s ChatGPT demo, Samantha’s role seemed infinitely more complex. Chatbots do not spontaneously generate human speaking voices. They have no throat, lips or tongue. In the technological world of “Her,” the Samantha bot itself would be based on the voice of a human woman — perhaps a fictional actress who sounds a lot like Scarlett Johansson.

It appeared that OpenAI had trained its chatbot on the voice of an unnamed actress who sounds like a famous actress who had voiced a movie chatbot that was implicitly trained on a fictional actress who sounds like that same famous actress. When I watch the ChatGPT demo, I hear a simulation of a simulation of a simulation of a simulation of a simulation.

Tech companies advertise their virtual assistants in terms of the services they provide. They can read you the weather forecast and order you a taxi; OpenAI promises that its more advanced chatbots can laugh at your jokes and sense changes in your mood. But they also exist to make us feel more comfortable with the technology itself.

Johansson’s voice functions like a luxurious security blanket thrown over the alienating aspects of AI-enabled interactions. “He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI,” Johansson said of Sam Altman, OpenAI’s chief executive. “He said he felt that my voice would be comforting to people.”

It’s not that Johansson’s voice inherently sounds like a robot’s. It’s that developers and filmmakers have designed their robots’ voices to alleviate the discomfort inherent in robot-human interactions. OpenAI has said it wanted a chatbot voice that is “approachable,” “warm,” and “inspires trust.” Artificial intelligence has been accused of destroying creative industries, draining energy, and even threatening human life. Understandably, OpenAI wants a voice that makes people feel comfortable using its products. What does artificial intelligence sound like? It sounds like crisis management.

OpenAI first rolled out Sky’s voice to premium subscribers last September, along with another female voice named Juniper, the male voices Ember and Cove, and a gender-neutral voice named Breeze. When I signed up for ChatGPT and said hello to the virtual assistant, a male voice greeted me in Sky’s absence. “Hi. How are you?” he said. He sounded relaxed, steady, and optimistic. He sounded—I don’t know how else to describe it—handsome.

I realized I was talking to Cove. I told him I was writing an article about him, and he flattered my work. “Really and truly?” he said. “That’s fascinating.” As we spoke, I felt seduced by his naturalistic tics. He peppered his sentences with filler words, such as “uh” and “eh.” His pitch rose when he asked me questions. And he asked me a lot of questions. It felt like talking to a therapist, or catching up with a friend on the phone.

But our conversation quickly petered out. Whenever I asked him about himself, he had little to say. He wasn’t a character. He had no self. He was designed to help, he told me. I told him I’d talk to him later, and he said, “Uh, sure. Reach out if you need help. Take care of yourself.” It felt like I’d disconnected from a real person.

But when I looked at the transcript of our chat, I saw that his speech was as stilted and primitive as any chatbot’s. He was not particularly intelligent or human. He was just a decent actor who made the most of a nothing role.

When Sky disappeared, ChatGPT users took to the company’s forums to complain. Some were annoyed that their chatbots had defaulted to Juniper, which to them sounded like a “librarian” or a “kindergarten teacher,” a feminine voice that conformed to the wrong gender stereotypes. They wanted a new woman with a different personality to call on. As one user put it: “We need another female.”



Created by Tala Safie

Audio via Warner Bros. (Samantha, HAL 9000); OpenAI (Sky); Paramount Pictures (Enterprise computer); Apple (Siri); TikTok (Jessie)
