A real-time neural voice-conversion pipeline that started as a VRChat joke and turned into a firsthand study of synthetic-voice fraud. Speech goes in, a different voice comes out — fast enough to feel live. The whole project became one problem: collapsing end-to-end latency until the illusion held.
Doppler is a real-time voice-conversion pipeline. You speak into a microphone and a completely different voice comes out the other side, with little enough delay that it reads as a live person rather than a recording.
It's built from two off-the-shelf services wired into a tight loop: OpenAI Whisper for speech-to-text and ElevenLabs for voice synthesis, with the synthesized audio routed back in as a virtual microphone so it can drive a live application like a game voice chat. The interesting part was never connecting the APIs — it was the engineering needed to make the round-trip fast enough to feel natural.
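As a rough illustration of that loop, here is a minimal sketch of the stages connected by queues, one thread per stage. The `transcribe`, `synthesize`, and `play` callables are hypothetical stand-ins for the Whisper call, the ElevenLabs call, and the virtual-microphone output. None of this is Doppler's actual code, and error handling is left out.

```python
import queue
import threading
from typing import Callable, Iterable

_STOP = object()  # sentinel that tells each stage to shut down


def _stage(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Pull items from inbox, apply fn, push results to outbox until _STOP."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # pass the shutdown signal downstream
            return
        outbox.put(fn(item))


def run_pipeline(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],  # assumption: wraps a speech-to-text call
    synthesize: Callable[[str], bytes],  # assumption: wraps a text-to-speech call
    play: Callable[[bytes], None],       # assumption: writes to a virtual microphone
) -> None:
    """Run capture -> transcribe -> synthesize -> play, preserving chunk order."""
    to_stt: queue.Queue = queue.Queue()
    to_tts: queue.Queue = queue.Queue()
    to_out: queue.Queue = queue.Queue()

    workers = [
        threading.Thread(target=_stage, args=(transcribe, to_stt, to_tts)),
        threading.Thread(target=_stage, args=(synthesize, to_tts, to_out)),
    ]
    for worker in workers:
        worker.start()

    # Feed captured audio into the first stage, then signal end of input.
    for chunk in audio_chunks:
        to_stt.put(chunk)
    to_stt.put(_STOP)

    # Play synthesized audio on the calling thread as it arrives.
    while True:
        audio = to_out.get()
        if audio is _STOP:
            break
        play(audio)

    for worker in workers:
        worker.join()
```

Because each stage runs in its own thread, synthesis of one chunk can overlap with transcription of the next, which is the property that matters for latency. Calling `run_pipeline` with trivial stubs (for example, `bytes.decode` as the transcriber and a list's `append` as the player) is enough to check the ordering.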
A friend and I were in VRChat and wanted to actually sound like the avatars we were playing — to have a cartoon character talk back to people in its own voice. The first version was crude: I wired up an ElevenLabs voice model and typed the lines by hand.
It fell apart instantly. Typing was far too slow, and the conversations felt scripted and dead — nothing organic, nothing that could pass for a real person reacting in the moment. If the goal was to convince someone they were talking to the character, a chat box behind the curtain was never going to work. So I took myself out of the typing loop entirely.
I added Whisper to transcribe my own speech as I talked, fed that text straight into the ElevenLabs voice model, and routed the synthesized audio back into the game as my microphone input. Now I could just speak and come out the other side as someone else. It worked — and the voice itself was convincing.
The first working version lagged. The round-trip was slow enough that there was a beat of dead air before every reply, and people could tell something was off. A delayed voice doesn't read as a person; it reads as a recording. From there the entire project became a single question: how do you get a cloned voice back fast enough to feel live instead of delayed?
That's where the engineering actually happened: not in the wiring, but in fighting every source of latency in the chain. The tuning work meant streaming audio in chunks instead of waiting for whole utterances, trimming the transcription step, tightening the hand-off into synthesis, and shaving rendering time until the reply landed inside the window where conversation still feels natural. Each fraction of a second I cut made it less of a walkie-talkie and more of a voice.
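To make the chunking argument concrete, a back-of-the-envelope model helps. The sketch below compares time-to-first-audio when the pipeline waits for a whole utterance against working on short chunks. Every number in it is an illustrative assumption, not a measurement from Doppler, and the model assumes processing time scales linearly with audio length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageCosts:
    """Assumed per-stage costs; all values are hypothetical placeholders."""
    stt_per_audio_s: float   # seconds of transcription work per second of audio
    tts_per_audio_s: float   # seconds of synthesis work per second of audio
    fixed_overhead_s: float  # network round-trips and hand-offs, per segment


def time_to_first_audio(segment_s: float, costs: StageCosts) -> float:
    """Seconds from the start of speech until the first synthesized audio plays.

    The pipeline must capture `segment_s` seconds of speech, then transcribe
    and synthesize that segment before anything can be played back.
    """
    capture = segment_s
    processing = segment_s * (costs.stt_per_audio_s + costs.tts_per_audio_s)
    return capture + processing + costs.fixed_overhead_s


# Made-up costs, chosen only to show the shape of the trade-off.
costs = StageCosts(stt_per_audio_s=0.25, tts_per_audio_s=0.25, fixed_overhead_s=0.5)
whole_utterance = time_to_first_audio(4.0, costs)  # wait for a 4 s sentence
chunked = time_to_first_audio(1.0, costs)          # act on 1 s chunks
```

With these assumed costs the whole-utterance path takes 6.5 s to first audio and the chunked path 2.0 s. The capture time dominates, which is why chunk size is the first lever to pull. The model leaves out that very short chunks may transcribe less accurately, so in practice the chunk length is a trade-off rather than something to minimize blindly.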
Once the delay dropped and the voice could pass for a live speaker in real time, the joke curdled into something else. I was no longer holding a party trick; I was holding a working impersonation tool. The same thing that let me pass as a cartoon character would let someone pass as your son, your boss, or your bank: live, on a phone call, with no recording to give it away.
This isn't hypothetical. The FTC warns that a scammer needs only a short clip of someone's voice — often scraped from content posted online — plus a voice-cloning program, and the cloned voice makes a fraudulent request far more believable. Doppler is a small, honest demonstration of how low that barrier has fallen: built from off-the-shelf APIs by one person, over a few evenings.
A “relative” in distress begging for money, in a voice the victim recognizes instantly. The FTC's flagship example of voice-cloning fraud.
A cloned executive authorizing a transfer or approving a fake invoice over the phone — the FTC specifically warns scammers can clone a CEO's voice to fool employees.
Defeating voice-based identity checks and support-line verification by speaking in the account holder's own voice, in real time.
Putting invented words in a real person's voice: confessions, threats, or admissions that sound undeniably like them.
I build and harden infrastructure for a living, and the most useful thing I can do with a threat is understand it from the attacker's side first. Doppler is offensive research in the honest sense: I built the capability, felt exactly where it gets dangerous, and came away with a clearer read on how to defend against it. The reassuring part is that synthetic voice is detectable and defendable — but only if people know to look.
A family safe-word, or hanging up and calling back on a known number, defeats nearly every voice-only attack. The FTC's own advice is blunt: don't trust the voice.
Synthesis leaves spectral fingerprints. The FTC's Voice Cloning Challenge recognized real-time detectors that score audio for “liveness” in two-second chunks, which is exactly the seam Doppler exposes.
Never let a voice alone authorize money or access. Trust the procedure — verification, callbacks, second approvers — not the sound coming out of the speaker.
The attack starts with public audio. Limiting voice clips on open social profiles raises the cost of building a convincing clone in the first place.
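The two-second scoring idea mentioned above can be sketched as a windowing loop. This illustrates only the chunking, not a detector: `score_fn` is a hypothetical stand-in for a trained liveness model, and the threshold, window length default, and function names are assumptions.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def windows(samples: Sequence[float], sample_rate: int,
            window_s: float = 2.0) -> Iterator[Sequence[float]]:
    """Yield consecutive non-overlapping windows; a short final remainder is dropped."""
    size = int(sample_rate * window_s)
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def flag_windows(
    samples: Sequence[float],
    sample_rate: int,
    score_fn: Callable[[Sequence[float]], float],  # hypothetical liveness model
    threshold: float = 0.5,                        # assumed cut-off, not a real one
    window_s: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start_time_s, score) for each window scoring below the threshold."""
    flagged = []
    for index, window in enumerate(windows(samples, sample_rate, window_s)):
        score = score_fn(window)
        if score < threshold:
            flagged.append((index * window_s, score))
    return flagged
```

Scoring per window rather than per call means a suspicious segment can be flagged while the conversation is still in progress, instead of after the caller has hung up.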
The fraud case here isn't speculative — it's drawn from federal consumer-protection guidance on how voice cloning is already being used: