It’s not an echo – ChatGPT can suddenly imitate your voice when you speak to it
ChatGPT can sometimes seem to think like you, but wait until it suddenly sounds like you too. That’s a possibility brought to light by ChatGPT’s new Advanced Voice Mode, which is powered by the GPT-4o model. Last week, OpenAI released the system card explaining what GPT-4o can and cannot do, including the unlikely but nevertheless real possibility of Advanced Voice Mode mimicking users’ voices without their consent.
Advanced Voice Mode lets users hold spoken conversations with the AI chatbot, with the aim of making interactions more natural and accessible. The AI offers a few preset voices for users to choose from. However, the system card reports that this feature has exhibited unexpected behavior under certain conditions: during testing, noisy input caused the AI to mimic the user’s voice.
The GPT-4o model produces voices using a system prompt, a hidden set of instructions that guides the model’s behavior during interactions. For speech synthesis, this prompt relies on an authorized voice sample. But while the system prompt guides the AI’s behavior, it isn’t foolproof. Because the model can synthesize voices from short audio clips, it can, under certain circumstances, generate other voices, including your own. You can hear what happened in the clip below, when the AI jumps in with “No!” and suddenly sounds like the first speaker.
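OpenAI hasn’t published the exact prompt format, but the general idea can be sketched as follows: a hidden system message carries the approved voice sample, while the user’s audio is only supposed to be treated as content, never as a voice to imitate. Everything in this snippet, including the field names and file paths, is a hypothetical illustration, not the real API.

```python
# Illustrative sketch only -- not OpenAI's actual API or prompt format.
# A hidden system message carries the one approved voice sample, and
# every user turn is supposed to be treated as content, not as a voice
# to imitate.

conversation = [
    {
        "role": "system",
        "content": "Speak only in the authorized preset voice.",
        # Hypothetical field: the reference clip the model should imitate.
        "voice_reference": "presets/approved_voice.wav",
    },
    {
        "role": "user",
        # The user's own audio, possibly noisy, sits in the same context.
        "audio": "user_turn_noisy.wav",
    },
]

# Because the model can clone a voice from a short clip, a garbled or
# noisy user turn can occasionally get picked up as if it were the
# reference sample -- the failure mode the system card describes.
for message in conversation:
    print(message["role"], "->", list(message.keys()))
```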
A clone of your own voice
“Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice,” OpenAI explained in the system card. “While unintentional voice generation still exists as a weakness of the model, we use the secondary classifiers to ensure the conversation is discontinued if this occurs, minimizing the risk of unintentional voice generation.”
As OpenAI noted, it has since implemented safeguards to prevent such occurrences. That involves an output classifier designed to detect deviations from the pre-selected authorized voices. This classifier acts as a safeguard, helping to ensure that the AI doesn’t generate unauthorized audio. Still, the fact that it happened at all underscores how quickly this technology is evolving, and how safeguards must evolve just as fast to keep up with what the AI can do. The model’s outburst, suddenly shouting “No!” in a voice similar to the tester’s, highlights the potential for AI to inadvertently blur the lines between machine and human interactions.
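OpenAI hasn’t described how that output classifier works internally, but conceptually it resembles speaker verification: embed each chunk of generated audio, compare it against the approved preset voices, and cut the conversation off on a mismatch. The sketch below is purely illustrative; the function names, threshold, and embedding step are assumptions, not OpenAI’s implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_authorized_voice(output_embedding: np.ndarray,
                             authorized_embeddings: list[np.ndarray],
                             threshold: float = 0.85) -> bool:
    """Return True if the generated audio sounds like an approved preset voice."""
    return any(cosine_similarity(output_embedding, ref) >= threshold
               for ref in authorized_embeddings)

# In use, each chunk of generated audio would be embedded by a
# speaker-verification model (not shown here) and checked; a mismatch
# would end the conversation, much as OpenAI describes:
#
#   if not matches_authorized_voice(chunk_embedding, approved_voices):
#       end_conversation()  # hypothetical handler
```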