Part 1: The Real Challenge of Voice Translation Wasn't Accuracy
Have you ever tried having a conversation with someone from another country using a voice translation app?
Microsoft Translator, Google Translate, Apple’s AirPods live translation — real-time voice translation from major tech companies is getting more accurate every year. Language support is expanding, and the technology is quite mature.
But when you actually use it, you notice something strange.
The translation is accurate, but the conversation doesn’t flow.
The Problem Was “Silence”
When you use a translation app in a business conversation, this is what happens:
The other person speaks in English → Wait for translation (3–5 seconds of silence) → Read the translation → You speak → The other person waits again
During this “waiting for translation” silence, the other person starts wondering “Can they hear me?” while you can’t respond until the translation appears. The broken conversation rhythm is a structural problem: no matter how much translation quality improves, it doesn’t go away.
All existing voice translation systems use a “sequential translation” architecture. They wait for the speaker to finish, confirm the text, then translate. It’s accurate, but it always creates a “waiting time.”
Why Don’t Simultaneous Interpreters “Wait”?
If you observe professional simultaneous interpreters at work, you notice something interesting.
Interpreters start translating before the sentence is complete.
When they hear “We should probably reschedule the meeting to…”, they’re already conveying “They’re talking about rescheduling the meeting.” Before the specific detail “Tuesday afternoon” comes, they communicate the intent first.
For the listener, the most important thing is understanding “what the topic is” in the first few seconds. The details can be filled in afterwards.
Could we recreate this “communicate intent first” approach in software? That’s what led me to develop the prototype I’m introducing in this series.
The Concept of Intent-First Translation
Intent-First Translation processes information in a different order than traditional translation.
Traditional Translation (Sequential):
Start speaking → Finish speaking → Confirm text → Translate → Display
(Nothing reaches the listener during this time)
Intent-First Translation:
Start speaking → After 0.5s: "Schedule adjustment proposal" displayed
→ After 0.8s: "Wants to move the meeting to Tuesday afternoon" displayed
→ Finish speaking → Updated to confirmed translation
About 500 milliseconds after the speaker starts talking, the intent — “This person is talking about schedule adjustment” — appears on screen. The full translation follows a few hundred milliseconds later.
The listener can understand “what the speaker is talking about” while they’re still speaking. This is an experience that sequential translation architectures cannot provide.
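From the listener’s side, this shows up as a series of incremental updates, each one replacing the last. As a minimal sketch (the stage names, field names, and timings here are illustrative, not the prototype’s actual message format), the sequence for the example above might look like this:

```python
# Hypothetical sequence of updates pushed to the screen for one utterance.
# Stage names, field names, and timings are illustrative only.
updates = [
    {"t_ms": 500,  "stage": "intent",      "text": "Schedule adjustment proposal"},
    {"t_ms": 800,  "stage": "translation", "text": "Wants to move the meeting to Tuesday afternoon"},
    {"t_ms": 2600, "stage": "confirmed",   "text": "We should probably reschedule the meeting to Tuesday afternoon."},
]

for update in updates:
    # Each newer stage replaces the previous line on screen.
    print(f'{update["t_ms"]:>5} ms  [{update["stage"]:<11}]  {update["text"]}')
```

The first two stages arrive while the speaker is still mid-sentence; only the last one waits for the utterance to finish. The intent line is deliberately coarse — its job is just to keep the listener oriented until the fuller translation catches up.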
Built as a Personal Prototype
I built this system as a personal prototype using publicly available APIs and open-source technology; a rough sketch of how the pieces fit together follows the list below.
- Speech Recognition: Deepgram’s streaming API (gets real-time text while speaking)
- Intent Estimation & Translation: Gemini Flash, Groq / Llama 4 Maverick and other LLMs (structured output via streaming)
- Real-Time Communication: WebSocket (instant delivery from backend to frontend)
- Frontend: React + TypeScript
- Backend: Python / FastAPI
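To make the flow concrete, here is a minimal sketch of how these pieces could be wired together. It is not the prototype’s actual code: the transcript, intent, and translation helpers are placeholders standing in for the Deepgram and LLM calls, and the endpoint path and message fields are assumptions.

```python
# Minimal sketch of the backend pipeline (not the actual implementation).
# The helpers below are placeholders: in the real system, partial transcripts
# come from Deepgram's streaming API, and intent/translation come from an LLM
# (Gemini Flash, Llama 4 Maverick via Groq, etc.) with streamed structured output.
import asyncio
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket

app = FastAPI()


async def partial_transcripts() -> AsyncIterator[str]:
    # Placeholder for Deepgram streaming: yields progressively longer
    # partial transcripts while the speaker is still talking.
    for text in [
        "We should probably",
        "We should probably reschedule the meeting",
        "We should probably reschedule the meeting to Tuesday afternoon",
    ]:
        await asyncio.sleep(0.3)  # simulate audio arriving over time
        yield text


async def estimate_intent(partial: str) -> str:
    # Placeholder for an LLM call that summarizes the speaker's intent.
    return "Schedule adjustment proposal"


async def translate(partial: str) -> str:
    # Placeholder for an LLM call that returns a provisional translation.
    return f"[translation of] {partial}"


@app.websocket("/ws/translate")
async def translate_stream(ws: WebSocket) -> None:
    await ws.accept()
    intent_sent = False
    async for partial in partial_transcripts():
        # Push the intent as soon as the first few words are in --
        # this is the update that reaches the listener at roughly 500 ms.
        if not intent_sent and len(partial.split()) >= 3:
            await ws.send_json({"stage": "intent", "text": await estimate_intent(partial)})
            intent_sent = True
        # Keep refining the provisional translation as more speech arrives.
        await ws.send_json({"stage": "translation", "text": await translate(partial)})
    # Once the utterance ends, send the confirmed translation.
    await ws.send_json({"stage": "confirmed", "text": await translate(partial)})
```

With this shape, the frontend’s job reduces to listening on the socket and replacing the displayed line whenever a newer stage arrives.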
Measured Performance
Here’s the actual measured performance:
| Metric | Intent-First Translation | Traditional Sequential Translation |
|---|---|---|
| Intent Display | ~500ms | (No such feature) |
| Translation Display Start | ~500–800ms | ~2–5 seconds |
| Conversation Tempo | Nearly real-time | 3–5 second interruption each time |
The experience of having the intent communicated within about 500 milliseconds is something you can’t get with traditional translation apps.
See It in Action
Here’s a demo of Intent-First Translation in use. Watch how the intent and translation appear while the speaker is still talking:
From “Translating Accurately” to “Conversing Naturally”
This prototype isn’t trying to compete with existing services on translation accuracy. It’s a validation of a different approach: delivering translation while maintaining conversation tempo.
It’s still a prototype under development, but I feel that this simple shift in thinking — “show the intent first” — has the potential to significantly change the voice translation experience.
In the next post, I’ll cover the technical architecture that makes this 500-millisecond response possible.
- Part 2: JSON Field Order Made Translation Display 2x Faster
- Part 3: Can Real-Time Translation Run on a Home GPU?