Translation Earbuds Prototype — Real-Time Translation Audio via Public APIs
Challenge
Dedicated translation earbuds (AirPods Pro, Timekettle M3) cost ¥17,000–40,000 and tie you to a specific ecosystem. Can a regular smartphone and ordinary Bluetooth earbuds deliver a “hear the translation” experience using only public APIs?
Solution
Added a TTS layer (Web Speech API) to the Intent-First Translation pipeline, worked around mobile autoplay restrictions with a first-tap audio unlock, and implemented platform-adaptive speech-rate control.
Result
Achieved ~3-second end-to-end latency (speech → translated audio), comparable to professional simultaneous interpreters. Identified browser I/O device limitations and validated a clear path to a native-app architecture.
Background
This project is an extension of Intent-First Translation, which displays the speaker’s intent within 500ms during real-time voice translation. Here, we added an audio output layer — translating English speech into Japanese and playing it through Bluetooth earbuds.
The goal: can you replicate the core experience of AirPods Pro live translation using a regular smartphone, Bluetooth earbuds, and public APIs?
Other person speaks English
→ Phone captures audio (Deepgram)
→ LLM translates in real-time (~2 seconds)
→ Japanese TTS plays through Bluetooth earbuds
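On the client, the last leg of this flow is small. A minimal sketch, assuming the backend pushes each translated sentence over the WebSocket as JSON with a `japanese` field (the URL and message shape here are illustrative, not the project’s exact wire format):

```ts
// Client-side sketch: receive a translated sentence over WebSocket and speak it.
const socket = new WebSocket('wss://example.invalid/translate');  // placeholder URL

socket.onmessage = (event: MessageEvent) => {
  const { japanese } = JSON.parse(event.data);   // assumed message shape
  if (!japanese) return;

  const utterance = new SpeechSynthesisUtterance(japanese);
  utterance.lang = 'ja-JP';                      // request a Japanese voice
  window.speechSynthesis.speak(utterance);       // plays over the current audio route (e.g. Bluetooth earbuds)
};
```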
What Worked
- Self-spoken English → Japanese translation in earbuds ✅
- Translation text → audio delay: within ~1 second
- End-to-end latency: ~3 seconds (comparable to professional simultaneous interpreters at 2–3 seconds)
What Didn’t Work
- Capturing another person’s voice through earbud mic ❌ — Bluetooth earbud mics are designed for the wearer’s voice; noise cancellation actively cuts ambient sound
- Separating input/output devices in browser ❌ — iOS WebKit doesn’t support explicit audio input device selection (see the sketch after this list)
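For reference, the desktop-style way to pick an input device uses the standard MediaDevices APIs, which is exactly what iOS WebKit does not support. The helper name and label heuristic below are illustrative, not the project’s code:

```ts
// Desktop-style input selection: get permission, enumerate microphones, then re-acquire a specific one.
async function pickMicByLabel(pattern: RegExp): Promise<MediaStream> {
  // Device labels are only exposed after a permission grant, so request a default stream first.
  const probe = await navigator.mediaDevices.getUserMedia({ audio: true });
  probe.getTracks().forEach((track) => track.stop());

  const devices = await navigator.mediaDevices.enumerateDevices();
  const mic = devices.find((d) => d.kind === 'audioinput' && pattern.test(d.label));

  // Fall back to the default microphone if no label matches.
  return navigator.mediaDevices.getUserMedia({
    audio: mic ? { deviceId: { exact: mic.deviceId } } : true,
  });
}

// e.g. prefer the phone's built-in mic over the earbud mic (label text varies by device):
// const stream = await pickMicByLabel(/built-?in|internal/i);
```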
Platform-Specific TTS Challenges
Web Speech API is a “browser standard,” but the actual engine differs by OS. Three critical problems were discovered and solved:
1. Speech Rate Inconsistency
The same rate=3.0 was “just right” on Windows Chrome but incomprehensibly fast on iPhone: every iOS browser runs on Apple’s WebKit, and therefore on Apple’s TTS engine, so the same rate value maps to a very different actual speed.
```ts
// Rough device detection: mobile TTS engines need a much lower rate to stay intelligible.
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;
```
2. Silent TTS on Mobile
Mobile browsers block speechSynthesis.speak() unless called from a direct user tap. WebSocket callbacks don’t qualify as “user actions.”
```ts
// Unlock audio on the first tap: speaking a silent utterance from a real user gesture
// lets later speak() calls triggered by WebSocket callbacks actually produce sound.
document.addEventListener('click', () => {
  const unlock = new SpeechSynthesisUtterance('');
  unlock.volume = 0;
  window.speechSynthesis.speak(unlock);
}, { once: true });
```
3. Missing TTS Triggers
In-progress translations could appear on screen but be overwritten before the confirmed result fired TTS. Fixed by triggering TTS on any translation text, with duplicate detection (sketched below).
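A minimal sketch of that fix, assuming translation updates arrive as plain strings; the `lastSpoken` guard and `speakTranslation` name are illustrative, and `ttsRate` is the platform-adaptive value from above:

```ts
// Speak whatever translation text arrives, but never the same sentence twice in a row.
let lastSpoken = '';

function speakTranslation(text: string): void {
  const sentence = text.trim();
  if (!sentence || sentence === lastSpoken) return;  // skip empty and duplicate updates
  lastSpoken = sentence;

  const utterance = new SpeechSynthesisUtterance(sentence);
  utterance.lang = 'ja-JP';
  utterance.rate = ttsRate;                          // platform-adaptive rate (see above)
  window.speechSynthesis.speak(utterance);
}
```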
Browser vs Native App
| Feature | Browser | Native App |
|---|---|---|
| TTS Audio Output | ✅ (with limitations) | ✅ (unrestricted) |
| Input/Output Device Separation | ❌ | ✅ |
| Background Operation | ❌ | ✅ |
A native iOS app using AVAudioSession can control input and output devices independently — capturing with the built-in mic while playing through Bluetooth earbuds. The backend (FastAPI + Deepgram + LLM) is fully reusable.
Measured Latency
| Metric | Measurement |
|---|---|
| End of speech → Translation text | Average 2,115ms |
| Translation text → TTS complete (short sentences) | Within ~1 second |
| End of speech → Translation audio heard | ~3 seconds |
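The text-to-audio leg can be timed directly in the browser with utterance events; a sketch of that measurement (the project’s actual instrumentation may differ):

```ts
// Measure the delay from receiving translated text to the end of its TTS playback.
function speakAndMeasure(text: string): void {
  const t0 = performance.now();
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'ja-JP';
  utterance.onend = () => {
    console.log(`Translation text → TTS complete: ${Math.round(performance.now() - t0)}ms`);
  };
  window.speechSynthesis.speak(utterance);
}
```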
Real-Device Testing
Tech Stack
| Layer | Technology |
|---|---|
| Speech Recognition | Deepgram Streaming API |
| Translation | GPT-4 / Gemini Flash / Groq (LLM) |
| Text-to-Speech | Web Speech API (SpeechSynthesis) |
| Real-Time Communication | WebSocket |
| Frontend | React + TypeScript |
| Backend | Python / FastAPI |
Key Takeaway
Dedicated products like AirPods Pro solve audio input with specialized hardware — beamforming, multi-microphone arrays. Software alone can’t cross that wall. But for the architecture of “capture with built-in mic, deliver translation through earbuds”, public APIs and open technology are more than sufficient. The lessons learned here — platform TTS behavior, mobile browser audio constraints, Bluetooth routing limits — provide the foundation for a native app redesign.
Foundation Project
This project builds on Intent-First Translation — the 500ms intent-display real-time translation system. The speech recognition and LLM translation pipeline is shared.
Deep dive in the blog series: