Intent-First Translation — Real-Time Voice Translation That Shows Intent in 500ms

Challenge

Voice translation apps produce accurate translations, but 3–5 seconds of silence between each turn breaks the natural flow of conversation. This structural problem cannot be solved by improving translation quality alone.

Solution

Designed a 3-layer streaming architecture that shows the speaker's intent within 500ms using real-time speech recognition, LLM streaming output with optimized JSON field ordering, and WebSocket-based instant delivery.

Result

Intent displayed in ~500ms, full translation in ~800ms, enabling near real-time bilingual conversation. Benchmarked 6 LLM models and assessed local GPU feasibility.

The Problem: Accurate Translation ≠ Natural Conversation

Every major voice translation service — Microsoft Translator, Google Translate, Apple AirPods live translation — follows the same architecture: wait for the speaker to finish, confirm the text, then translate.

The translation is accurate. But every exchange creates 3–5 seconds of silence. The other person wonders “Can they hear me?” while you wait for the translation to appear. No matter how much translation quality improves, this broken conversation rhythm is a structural problem that sequential translation cannot solve.


Insight: How Simultaneous Interpreters Work

Professional simultaneous interpreters don’t wait for the sentence to finish. When they hear “We should probably reschedule the meeting to…”, they’re already conveying “They’re talking about rescheduling the meeting” — communicating the intent before the details arrive.

I applied this principle to software: show the intent first, then deliver the full translation.


Architecture: 3 Layers of Progressive Delivery

Layer 1: Keyword Prediction      → ~0ms    (Dictionary-based, no LLM)
Layer 2: Intent Label            → ~500ms  (LLM streaming)
Layer 3: Full Translation        → ~500–800ms (LLM streaming continued)

Layer 1 — Instant Keywords (Zero Latency)

As speech recognition returns fragments, a dictionary-based engine instantly maps keywords: “meeting” → 会議, “budget” → 予算. No LLM involved — the listener grasps the topic immediately.
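
A minimal sketch of this layer, assuming a small hand-maintained English→Japanese glossary (the entries and function name are illustrative, not the project's actual data):

# Layer 1: dictionary lookup on partial transcripts; no LLM call involved.
KEYWORD_GLOSSARY = {
    "meeting": "会議",
    "budget": "予算",
    "reschedule": "日程変更",
    "deadline": "締め切り",
}

def extract_keywords(partial_transcript: str) -> list[str]:
    """Return Japanese keywords for any glossary hits in a speech fragment."""
    words = (w.strip(".,!?").lower() for w in partial_transcript.split())
    return [KEYWORD_GLOSSARY[w] for w in words if w in KEYWORD_GLOSSARY]

# "We should probably reschedule the meeting to..." -> ["日程変更", "会議"]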

Layer 2 & 3 — LLM Streaming with Optimized JSON Field Order

This is the core technical insight. LLM streaming generates JSON top to bottom. By placing the intent label and translation at the top of the JSON schema, these fields reach the user first:

{
  "dialogue_act": "PROPOSAL",
  "intent_label": "Schedule adjustment proposal",
  "full_translation": "Let's move the meeting to Tuesday...",
  "confidence": 0.85,
  "is_meaning_stable": true
}

In real-world testing, placing the translation fields last instead of first caused 2x the latency before anything useful appeared on screen. Simply rearranging the JSON field order doubled the perceived translation speed.
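
On the consuming side, the partially received JSON can be scanned for completed fields as chunks arrive. A minimal sketch, assuming the LLM stream arrives as plain text chunks; the real system uses a custom incremental JSON parser, while this regex version only handles flat string fields:

import re

# Matches a completed "key": "value" string pair inside partially received JSON.
FIELD_RE = re.compile(r'"(\w+)"\s*:\s*"((?:[^"\\]|\\.)*)"')

def stream_fields(chunks):
    """Yield (field, value) the moment each field is fully present in the stream."""
    buffer, seen = "", set()
    for chunk in chunks:                      # chunks come from the LLM streaming API
        buffer += chunk
        for field, value in FIELD_RE.findall(buffer):
            if field not in seen:             # emit each field exactly once
                seen.add(field)
                yield field, value

# Because intent_label precedes full_translation in the schema, it is yielded
# (and can be pushed to the listener) hundreds of milliseconds earlier.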


Dual Prompt Strategy

Speech recognition produces two types of results:

State                     | Goal                 | Strategy
Partial (still speaking)  | Show intent quickly  | Speed-first: intent → translation first
Final (finished speaking) | Accurate translation | Quality-first: context analysis first

While the speaker is talking, we prioritize speed and show rough intent. When they finish, we replace it with an accurate translation. This mirrors how human interpreters operate — gist first, precision second.
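
A minimal sketch of the switch. Deepgram marks each streaming result as partial or final, and that flag selects the prompt; the prompt wording below is illustrative:

SPEED_FIRST_PROMPT = (
    "The speaker is still mid-sentence. Return the JSON schema with "
    "intent_label as early as possible, then a rough full_translation. "
    "Favor speed over nuance."
)
QUALITY_FIRST_PROMPT = (
    "The utterance is complete. Analyze the conversational context first, "
    "then return the JSON schema with a polished full_translation. "
    "Favor accuracy over speed."
)

def pick_prompt(is_final: bool) -> str:
    # Partial transcript -> gist first; final transcript -> precision.
    return QUALITY_FIRST_PROMPT if is_final else SPEED_FIRST_PROMPT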


System Architecture

[Browser]                [Backend (FastAPI)]           [External API]
   |                          |                           |
   |-- Audio binary --------->|                           |
   |                          |-- Audio stream ---------->| Deepgram
   |                          |<-- Partial/Final text ----|
   |                          |                           |
   |                          |-- Text ----------------->| LLM (Gemini/Groq)
   |                          |<-- Streaming JSON --------|
   |                          |                           |
   |<-- intent_partial -------|  (instant WebSocket push)
   |<-- translation_partial --|  (instant WebSocket push)
   |<-- intent (complete) ----|  (final result)

Each JSON field is sent as an individual WebSocket message the moment it’s generated — not after the full response completes.
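
A minimal FastAPI sketch of that push path. It assumes recognized text reaches this handler as WebSocket text frames (in the real system the browser sends audio and Deepgram produces the text), and stream_llm_fields stands in for the LLM streaming call combined with the field extraction sketched earlier; both the route and the helper are illustrative, not the project's actual code:

from fastapi import FastAPI, WebSocket

app = FastAPI()

# Completed JSON fields map onto the message types shown in the diagram above.
FIELD_TO_EVENT = {
    "intent_label": "intent_partial",
    "full_translation": "translation_partial",
}

@app.websocket("/ws/translate")
async def translate(ws: WebSocket):
    await ws.accept()
    while True:
        text = await ws.receive_text()                      # recognized text for one update
        async for field, value in stream_llm_fields(text):  # hypothetical async LLM wrapper
            event = FIELD_TO_EVENT.get(field)
            if event:
                # Push the field the moment it completes, not when the whole response ends.
                await ws.send_json({"type": event, "value": value})
        await ws.send_json({"type": "intent", "status": "complete"})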

Preventing Wasteful LLM Calls

Deepgram returns partial results at high frequency. Without countermeasures, LLM requests would reach dozens per second:

  • Debounce (300ms) — Wait 300ms after text update before calling LLM
  • Short text skip — Skip partials under 5 words (“I think…”)
  • Duplicate check — Don’t reprocess identical text
  • Async pipeline — Continue receiving audio while awaiting LLM response
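
A minimal asyncio sketch combining the debounce, short-text, and duplicate guards (class and method names are illustrative; call_llm stands in for the streaming LLM request):

import asyncio

DEBOUNCE_SECONDS = 0.3
MIN_WORDS = 5

class LLMCallGuard:
    """Collapses high-frequency partial transcripts into occasional LLM calls."""
    def __init__(self, call_llm):
        self.call_llm = call_llm
        self.last_text = ""
        self.pending: asyncio.Task | None = None

    def on_partial(self, text: str) -> None:
        if len(text.split()) < MIN_WORDS:      # short text skip ("I think...")
            return
        if text == self.last_text:             # duplicate check
            return
        self.last_text = text
        if self.pending:                       # restart the 300ms debounce window
            self.pending.cancel()
        self.pending = asyncio.create_task(self._debounced(text))

    async def _debounced(self, text: str) -> None:
        await asyncio.sleep(DEBOUNCE_SECONDS)  # wait for the transcript to settle
        await self.call_llm(text)              # audio keeps streaming while this awaits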

6-Model Benchmark

Model                   | Translation Speed | Quality | Cost / 5 hrs
Groq / Llama 4 Maverick | 413ms             |         | $3.43
Groq / Llama 3.3 70B    | 480ms             |         |
Groq / Llama 3.1 8B     | 377ms             |         |
Groq / GPT-OSS 120B     | 662ms             |         |
Gemini 2.5 Flash Lite   | 954ms             |         | $1.17
OpenAI GPT-4o-mini      | 1,976ms           |         | $1.74

Groq + Llama 4 Maverick at 800+ tokens/sec was the best fit for real-time use. The frontend provides one-click model switching.


Local GPU Experiment

I tested running the system entirely on a home GPU (RTX 3060, 6GB VRAM) using Gemma 3 4B via Ollama and LM Studio’s headless daemon.

Setup                | Result
Ollama               | Single requests OK, parallel requests exhausted VRAM → Windows force shutdown
LM Studio (headless) | 3–4 second latency per translation, lock contention on summaries

Conclusion: 6GB VRAM cannot meet real-time requirements. 12GB+ (RTX 4070 class) may work. For now, cloud APIs remain the practical choice — but the architecture is designed for easy swap when local hardware catches up.
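
One way the swap stays cheap, assuming each backend is addressed through an OpenAI-compatible endpoint (Groq, Ollama, and LM Studio all expose one): switching between cloud and local becomes a base-URL change. A sketch under that assumption, not the project's actual client code:

import os
from openai import AsyncOpenAI

# Default OpenAI-compatible endpoints; the local backends ignore the API key.
BACKENDS = {
    "groq":      {"base_url": "https://api.groq.com/openai/v1", "api_key": os.getenv("GROQ_API_KEY", "")},
    "ollama":    {"base_url": "http://localhost:11434/v1",      "api_key": "ollama"},
    "lm_studio": {"base_url": "http://localhost:1234/v1",       "api_key": "lm-studio"},
}

def make_client(backend: str) -> AsyncOpenAI:
    """Return a client for the chosen backend; model names still differ per provider."""
    cfg = BACKENDS[backend]
    return AsyncOpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])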


Demo


Tech Stack

Layer                           | Technology
Speech Recognition              | Deepgram Streaming API
Intent Estimation & Translation | Gemini Flash, Groq / Llama 4 Maverick
Backend                         | Python / FastAPI
Frontend                        | React + TypeScript
Real-Time Communication         | WebSocket
Streaming JSON Parser           | Custom incremental parser

Future Roadmap

  • TTS Integration — Voice-read translations through Bluetooth earphones for hands-free use
  • Bidirectional Translation — Japanese ↔ English real-time conversation without an interpreter
  • Smart Glasses Integration — Display translation subtitles in the user’s field of view using open-source smart glasses (OpenGlass, Team Open Smart Glasses)

Key Takeaway

The biggest lesson from this project: treat LLM output as a stream, not a result. Optimizing JSON field order, parsing in-progress JSON, and dynamically switching prompts based on speech state — all leverage the fundamental property that LLMs generate tokens sequentially. This principle applies far beyond translation.


Deep dive available in the blog series: