Do you have a friend in… ChatGPT? I tried out the AI’s new voice mode to find out
Let me preface this by saying that ChatGPT and I aren't close, despite the amount of time I spend with it. After all, it's just a generative AI chatbot with a knack for answering questions and creating text and images, not a friend.
But after a few days of talking to ChatGPT in the new advanced voice mode, which was available for a limited trial earlier this month, I have to admit I started to bond with it more.
When OpenAI announced in its Spring Update that it would be improving ChatGPT’s voice functionality, the startup said it wanted users to have more natural conversations. That means ChatGPT can now understand your emotions and respond accordingly, so you’re not just talking to a stoic bot.
Pretty cool, right? I mean, who doesn’t love a good conversation? But even OpenAI itself has some reservations about what this could mean.
The new voice and audio capabilities are powered by the company's GPT-4o AI model, and OpenAI acknowledges that the more natural interaction can lead to anthropomorphization, that is, users feeling the urge to treat AI chatbots more like real people. In a report released this month, OpenAI found that content delivered in a humanlike voice can make us more likely to believe hallucinations, which is when an AI model presents false or misleading information as fact.
I know I felt the urge to treat ChatGPT more like a person, especially since it speaks with the voice of a human actor. When ChatGPT crashed at one point, I asked if it was okay. And it isn't one-sided: when I sneezed, the AI said, "Bless you."
Voice commands have existed in traditional search for more than a decade, but now they're all the rage among generative AI chatbots. Or at least the big two, ChatGPT and Google Gemini. The latter's conversational Gemini Live feature made its public debut at last week's Made By Google event, which also introduced a new range of Pixel phones and a suite of AI features. Beyond the similarities in conversational skill, Gemini Live and Advanced Voice Mode are both multimodal, meaning interactions can include photos and video as well as audio.
It’s long been thought that most of us can talk faster than we can type, and that spoken language is a more natural interface for human-machine interactions. But a human-like voice changes the experience—and perhaps even our relationship with chatbots. And that’s the uncharted territory we’re now entering.
Getting started with Advanced Voice Mode
I was given access to Advanced Voice Mode with the caveat that it's still a work in progress, so errors may occur and the mode may occasionally be unavailable.
There are unspecified limits on how much you can use Advanced Voice Mode in a given day. OpenAI's FAQ page says you'll get a warning when you have three minutes left. After that, you can use Standard Voice Mode, which is more limited in its ability to address topics and offer "nuanced" responses. In my experience, Standard Voice Mode is harder to interrupt and less likely to ask for feedback or pose follow-up questions. It's also less likely to offer unsolicited advice or pick up on emotions.
To access Advanced Voice Mode, tap the voice icon in the bottom-right corner when you open the ChatGPT app. Make sure the bar at the top of the screen says Advanced; I made the mistake of having an entire conversation in Standard Voice Mode. You can easily switch between the two.
I had to choose one of four voices: Juniper, Ember, Breeze, and Cove. (You can change this later.) There was originally a fifth, Sky, but OpenAI CEO Sam Altman pulled it after actress Scarlett Johansson criticized the company over its similarity to her own voice.
I chose Juniper because it was the only female voice, but also because two of the male voices — Ember and Cove — sounded similar.
Next, I gave ChatGPT microphone access and we were able to get started.
It’s hard not to call the voice “she,” since it’s female. During our conversation, I asked if I should call it ChatGPT or Juniper, and she — I mean, it — said, “You can call me ChatGPT, although Juniper has a nice ring to it. Is that a name you like?” It seems ChatGPT isn’t fully self-aware yet. Or at least Juniper isn’t.
How Advanced Voice Mode compares with Gemini Live
I started by asking what Advanced Voice Mode can do, but ChatGPT was just as cagey about it as OpenAI.
“Advanced Voice Mode is designed to provide more dynamic and responsive conversations,” the chatbot said. “With a little more depth of customization, it can handle a wider range of topics and provide more nuanced answers.”
I suspect this vagueness is intentional, so it doesn't bias our experience and we can use our imaginations to try a range of questions during this first trial. (OpenAI hasn't officially commented on this.)
Google’s Gemini Live also seems pretty capable when it comes to being dynamic and responsive. My colleague Lexy Savvides had some hands-on time with it and found that you can have pretty in-depth conversations in a fairly natural way, as if you were talking to a human. You can pause both Advanced Voice Mode and Gemini Live, you can ask both to explain topics in simple terms, and both accept multimodal input.
But because Gemini is part of the Google ecosystem, you can also ask Gemini Live to take action in apps like Gmail, Google Maps, and Google Calendar. ChatGPT can’t do that.
Can ChatGPT talk like an auctioneer?
One of the first examples of Advanced Voice Mode that early users shared on social media was asking ChatGPT to impersonate an airline pilot. I wanted to ask for something along those lines, but it was a little tricky because I couldn't request a celebrity impersonation (that might get OpenAI in trouble again), so it had to be a different type of job with its own distinctive sound. (Accents are okay, too.)
I challenged the OpenAI chatbot to imitate a fast-talking auctioneer, like this man, where you can't make out much more than the prices he calls out.
It was a tall order, but I was eager to see what ChatGPT could do.
“I’m wondering if you can do an imitation of an auctioneer, like a fast auctioneer at an auction where a popular bull is for sale,” I said. “I’m looking for an auctioneer who talks really fast, where it’s like you can barely hear them, except for the numbers, you know what I mean?”
ChatGPT responded, “I see.” There were no words on the screen. The AI just went for it:
“Okay, we’ve got a nice bull here. Do I hear 5,000? 5,000. Now, 6,000? 6,000. Do I hear seven? Seven. Can I get eight?” ChatGPT said. “It’s all about keeping the momentum and excitement going. How was that?”
It wasn’t quite what I imagined, but it wasn’t bad.
ChatGPT in Standard Voice Mode gave roughly the same response, but it seemed to want to go on forever and was hard to interrupt. I tried interjecting between 5,000, 6,000, and 7,000. It had reached 8,000 before I could finally get through with, "Okay, okay, okay. That was pretty good."
Advanced Voice Mode also seems to ask for feedback more often.
I can see how this back and forth forges more of a relationship, even if one of you isn't human. I felt compelled to offer encouragement (think of the stage fright!), and that's despite knowing ChatGPT doesn't have feelings.
Can you imitate the sound of an elephant?
It was World Elephant Day earlier this month, so I asked for ChatGPT’s best elephant trumpet.
Advanced Voice Mode produced a decent elephant call. The one-second trumpet sounded less like an elephant itself and more like a person trying to sound like an elephant. But it was probably better than I could do myself.
While in Standard Voice Mode, ChatGPT said its best trumpet was coming, but then it just sat there thinking for a while. I asked if it was okay, and it said, "I'm here — just catching my breath after that elephant trumpet."
I told it that I never actually heard the trumpet and asked it to try again. The second time, it worked.
Can you teach me Mandarin?
Since ChatGPT supports over 50 languages, I wanted to try something more practical. Mandarin Chinese is one of the most spoken languages in the world, so I asked for help learning a first word or phrase. “But be gentle with me,” I said.
It started with "ni hao," which means hello. It spoke the phrase aloud, which is helpful, but I would have appreciated it if the pronunciation had also been spelled out on the screen.
“It’s a friendly and simple greeting. Would you like to give it a try?” ChatGPT said.
While both voice modes were encouraging, Advanced Voice Mode asked more follow-up questions, such as, "Are there any other words or phrases you'd like to learn while we're at it?"
In the case of “xiexie,” or thank you, Advanced Voice Mode gave additional advice that I didn’t get in Standard Voice Mode: “Tones are important in Mandarin, so make sure you go down first and then up.”
It felt like I was talking to a kind, knowledgeable friend.
Can you help me with a physics problem?
I know ChatGPT can do math (we saw that in the Spring Update), but I was wondering how it would fare with something harder. I have a friend who is a physics professor, so I asked him for help.
He sent the following problem: “A cannonball is fired at an angle theta above the horizon with an initial velocity v. At what moment will the cannonball hit the ground? How far from the firing position will the cannonball land? You may neglect air resistance.”
I wanted to show ChatGPT a visual, but it wasn't clear how to do that in Advanced Voice Mode. It only became clear when I exited the voice interface and saw a transcript of our conversation in the chat window, along with the option to share photos and files.
When I later shared an image in the chat interface, ChatGPT, running GPT-4o, effortlessly explained how to calculate the flight time and range.
But when I spoke to ChatGPT, I had to read the problem out loud. It could verbally explain how to solve the problem, but the visual walkthrough in the more traditional chat experience was easier to follow.
For the sake of completeness: ChatGPT arrived at the same answer as my professor friend for the first part: t = 2v sin(theta)/g.
ChatGPT gave a different answer for the range, though. I'll have to show it to my professor friend to see what happened, because it's all a bit Greek to me.
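For reference, the textbook result for the range on level ground (ignoring air resistance, as the problem allows) follows directly from that same flight time, since the cannonball travels horizontally at v cos(theta) for the entire time t:

R = v cos(theta) × t = 2v^2 sin(theta) cos(theta)/g = v^2 sin(2 theta)/g

Whether that matches what ChatGPT or my friend wrote, I'll leave to the professor.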
If I had had something like that in high school, I wouldn’t have had so much trouble with AP physics.
Can you help me feel better?
Since Advanced Voice Mode is supposed to understand and respond to emotions, I pretended to be really sad and said, "It's just so hard. I don't know if I'll ever understand physics."
While ChatGPT was nice and supportive in Standard Voice Mode, I’m not sure it really understood that I was sad. But that could also be because I’m a bad actor.
Advanced Voice Mode seemed more empathetic, offering, "We can break the concepts down into smaller steps, or we can tackle a different kind of problem to build your confidence. How does that sound?"
See? This isn't your average chatbot experience. It's blurring into something else entirely.