Intent-First Translation — Real-Time Voice Translation That Shows Intent in 500ms

Challenge

Voice translation apps produce accurate translations, but 3–5 seconds of silence between each turn breaks the natural flow of conversation. This structural problem cannot be solved by improving translation quality alone.

Solution

Designed a 3-layer streaming architecture that shows the speaker's intent within 500ms using real-time speech recognition, LLM streaming output with optimized JSON field ordering, and WebSocket-based instant delivery.

Result

Intent displayed in ~500ms, full translation in ~800ms, enabling near real-time bilingual conversation. Benchmarked 6 LLM models and assessed local GPU feasibility.

The Problem: Accurate Translation ≠ Natural Conversation

Every major voice translation service — Microsoft Translator, Google Translate, Apple AirPods live translation — follows the same architecture: wait for the speaker to finish, confirm the text, then translate.

The translation is accurate. But every exchange creates 3–5 seconds of silence. The other person wonders “Can they hear me?” while you wait for the translation to appear. No matter how much translation quality improves, this broken conversation rhythm is a structural problem that sequential translation cannot solve.


Insight: How Simultaneous Interpreters Work

Professional simultaneous interpreters don’t wait for the sentence to finish. When they hear “We should probably reschedule the meeting to…”, they’re already conveying “They’re talking about rescheduling the meeting” — communicating the intent before the details arrive.

I applied this principle to software: show the intent first, then deliver the full translation.


Architecture: 3 Layers of Progressive Delivery

Layer 1: Keyword Prediction      → ~0ms    (Dictionary-based, no LLM)
Layer 2: Intent Label            → ~500ms  (LLM streaming)
Layer 3: Full Translation        → ~500–800ms (LLM streaming continued)

Layer 1 — Instant Keywords (Zero Latency)

As speech recognition returns fragments, a dictionary-based engine instantly maps keywords: “meeting” → 会議, “budget” → 予算. No LLM involved — the listener grasps the topic immediately.
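
A minimal sketch of this layer, assuming a small hand-maintained English→Japanese glossary (the entries and function name are illustrative, not the project's actual data):

# Layer 1: dictionary lookup on partial transcripts; no LLM call involved.
KEYWORD_GLOSSARY = {
    "meeting": "会議",
    "budget": "予算",
    "reschedule": "日程変更",
    "deadline": "締め切り",
}

def extract_keywords(partial_transcript: str) -> list[str]:
    """Return Japanese keywords for any glossary hits in a speech fragment."""
    words = (w.strip(".,!?").lower() for w in partial_transcript.split())
    return [KEYWORD_GLOSSARY[w] for w in words if w in KEYWORD_GLOSSARY]

# "We should probably reschedule the meeting to..." -> ["日程変更", "会議"]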

Layer 2 & 3 — LLM Streaming with Optimized JSON Field Order

This is the core technical insight. LLM streaming generates JSON top to bottom. By placing the intent label and translation at the top of the JSON schema, these fields reach the user first:

{
  "dialogue_act": "PROPOSAL",
  "intent_label": "Schedule adjustment proposal",
  "full_translation": "Let's move the meeting to Tuesday...",
  "confidence": 0.85,
  "is_meaning_stable": true
}

In real-world testing, placing the translation fields last instead of first caused 2x the latency before anything useful appeared on screen. Simply rearranging the JSON field order doubled the perceived translation speed.
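
On the consuming side, the partially received JSON can be scanned for completed fields as chunks arrive. A minimal sketch, assuming the LLM stream arrives as plain text chunks; the real system uses a custom incremental JSON parser, while this regex version only handles flat string fields:

import re

# Matches a completed "key": "value" string pair inside partially received JSON.
FIELD_RE = re.compile(r'"(\w+)"\s*:\s*"((?:[^"\\]|\\.)*)"')

def stream_fields(chunks):
    """Yield (field, value) the moment each field is fully present in the stream."""
    buffer, seen = "", set()
    for chunk in chunks:                      # chunks come from the LLM streaming API
        buffer += chunk
        for field, value in FIELD_RE.findall(buffer):
            if field not in seen:             # emit each field exactly once
                seen.add(field)
                yield field, value

# Because intent_label precedes full_translation in the schema, it is yielded
# (and can be pushed to the listener) hundreds of milliseconds earlier.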


Dual Prompt Strategy

Speech recognition produces two types of results:

State                     | Goal                 | Strategy
Partial (still speaking)  | Show intent quickly  | Speed-first: intent → translation first
Final (finished speaking) | Accurate translation | Quality-first: context analysis first

While the speaker is talking, we prioritize speed and show rough intent. When they finish, we replace it with an accurate translation. This mirrors how human interpreters operate — gist first, precision second.
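
A minimal sketch of the switch. Deepgram marks each streaming result as partial or final, and that flag selects the prompt; the prompt wording below is illustrative:

SPEED_FIRST_PROMPT = (
    "The speaker is still mid-sentence. Return the JSON schema with "
    "intent_label as early as possible, then a rough full_translation. "
    "Favor speed over nuance."
)
QUALITY_FIRST_PROMPT = (
    "The utterance is complete. Analyze the conversational context first, "
    "then return the JSON schema with a polished full_translation. "
    "Favor accuracy over speed."
)

def pick_prompt(is_final: bool) -> str:
    # Partial transcript -> gist first; final transcript -> precision.
    return QUALITY_FIRST_PROMPT if is_final else SPEED_FIRST_PROMPT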


System Architecture

[Browser]                [Backend (FastAPI)]           [External API]
   |                          |                           |
   |-- Audio binary --------->|                           |
   |                          |-- Audio stream ---------->| Deepgram
   |                          |<-- Partial/Final text ----|
   |                          |                           |
   |                          |-- Text ----------------->| LLM (Gemini/Groq)
   |                          |<-- Streaming JSON --------|
   |                          |                           |
   |<-- intent_partial -------|  (instant WebSocket push)
   |<-- translation_partial --|  (instant WebSocket push)
   |<-- intent (complete) ----|  (final result)

Each JSON field is sent as an individual WebSocket message the moment it’s generated — not after the full response completes.
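
A minimal FastAPI sketch of that push path. It assumes recognized text reaches this handler as WebSocket text frames (in the real system the browser sends audio and Deepgram produces the text), and stream_llm_fields stands in for the LLM streaming call combined with the field extraction sketched earlier; both the route and the helper are illustrative, not the project's actual code:

from fastapi import FastAPI, WebSocket

app = FastAPI()

# Completed JSON fields map onto the message types shown in the diagram above.
FIELD_TO_EVENT = {
    "intent_label": "intent_partial",
    "full_translation": "translation_partial",
}

@app.websocket("/ws/translate")
async def translate(ws: WebSocket):
    await ws.accept()
    while True:
        text = await ws.receive_text()                      # recognized text for one update
        async for field, value in stream_llm_fields(text):  # hypothetical async LLM wrapper
            event = FIELD_TO_EVENT.get(field)
            if event:
                # Push the field the moment it completes, not when the whole response ends.
                await ws.send_json({"type": event, "value": value})
        await ws.send_json({"type": "intent", "status": "complete"})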

Preventing Wasteful LLM Calls

Deepgram returns partial results at high frequency. Without countermeasures, LLM requests would reach dozens per second:

  • Debounce (300ms) — Wait 300ms after text update before calling LLM
  • Short text skip — Skip partials under 5 words (“I think…”)
  • Duplicate check — Don’t reprocess identical text
  • Async pipeline — Continue receiving audio while awaiting LLM response
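
A minimal asyncio sketch combining the debounce, short-text, and duplicate guards (class and method names are illustrative; call_llm stands in for the streaming LLM request):

import asyncio

DEBOUNCE_SECONDS = 0.3
MIN_WORDS = 5

class LLMCallGuard:
    """Collapses high-frequency partial transcripts into occasional LLM calls."""
    def __init__(self, call_llm):
        self.call_llm = call_llm
        self.last_text = ""
        self.pending: asyncio.Task | None = None

    def on_partial(self, text: str) -> None:
        if len(text.split()) < MIN_WORDS:      # short text skip ("I think...")
            return
        if text == self.last_text:             # duplicate check
            return
        self.last_text = text
        if self.pending:                       # restart the 300ms debounce window
            self.pending.cancel()
        self.pending = asyncio.create_task(self._debounced(text))

    async def _debounced(self, text: str) -> None:
        await asyncio.sleep(DEBOUNCE_SECONDS)  # wait for the transcript to settle
        await self.call_llm(text)              # audio keeps streaming while this awaits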

6-Model Benchmark

Model                   | Translation Speed | Quality | Cost / 5 hrs
Groq / Llama 4 Maverick | 413ms             |         | $3.43
Groq / Llama 3.3 70B    | 480ms             |         |
Groq / Llama 3.1 8B     | 377ms             |         |
Groq / GPT-OSS 120B     | 662ms             |         |
Gemini 2.5 Flash Lite   | 954ms             |         | $1.17
OpenAI GPT-4o-mini      | 1,976ms           |         | $1.74

Groq + Llama 4 Maverick at 800+ tokens/sec was the best fit for real-time use. The frontend provides one-click model switching.


Local GPU Experiment

I tested running the system entirely on a home GPU (RTX 3060, 6GB VRAM) using Gemma 3 4B via Ollama and LM Studio’s headless daemon.

Setup                | Result
Ollama               | Single requests OK, parallel requests exhausted VRAM → Windows force shutdown
LM Studio (headless) | 3–4 second latency per translation, lock contention on summaries

Conclusion: 6GB VRAM cannot meet real-time requirements. 12GB+ (RTX 4070 class) may work. For now, cloud APIs remain the practical choice — but the architecture is designed for easy swap when local hardware catches up.
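
One way the swap stays cheap, assuming each backend is addressed through an OpenAI-compatible endpoint (Groq, Ollama, and LM Studio all expose one): switching between cloud and local becomes a base-URL change. A sketch under that assumption, not the project's actual client code:

import os
from openai import AsyncOpenAI

# Default OpenAI-compatible endpoints; the local backends ignore the API key.
BACKENDS = {
    "groq":      {"base_url": "https://api.groq.com/openai/v1", "api_key": os.getenv("GROQ_API_KEY", "")},
    "ollama":    {"base_url": "http://localhost:11434/v1",      "api_key": "ollama"},
    "lm_studio": {"base_url": "http://localhost:1234/v1",       "api_key": "lm-studio"},
}

def make_client(backend: str) -> AsyncOpenAI:
    """Return a client for the chosen backend; model names still differ per provider."""
    cfg = BACKENDS[backend]
    return AsyncOpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])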


Demo


Tech Stack

Layer                           | Technology
Speech Recognition              | Deepgram Streaming API
Intent Estimation & Translation | Gemini Flash, Groq / Llama 4 Maverick
Backend                         | Python / FastAPI
Frontend                        | React + TypeScript
Real-Time Communication         | WebSocket
Streaming JSON Parser           | Custom incremental parser

Future Roadmap

  • TTS Integration — Voice-read translations through Bluetooth earphones for hands-free use
  • Bidirectional Translation — Japanese ↔ English real-time conversation without an interpreter
  • Smart Glasses Integration — Display translation subtitles in the user’s field of view using open-source smart glasses (OpenGlass, Team Open Smart Glasses)

Key Takeaway

The biggest lesson from this project: treat LLM output as a stream, not a result. Optimizing JSON field order, parsing in-progress JSON, and dynamically switching prompts based on speech state — all leverage the fundamental property that LLMs generate tokens sequentially. This principle applies far beyond translation.


Deep dive available in the blog series: