Translation Earbuds Prototype — Real-Time Translation Audio via Public APIs
Challenge
Dedicated translation earbuds (AirPods Pro, Timekettle M3) cost ¥17,000–40,000 and tie you to a specific ecosystem. Can a regular smartphone and ordinary Bluetooth earbuds deliver a “hear the translation” experience using only public APIs?
Solution
Added a TTS layer (Web Speech API) to the Intent-First Translation pipeline, worked around mobile autoplay restrictions with a first-tap audio unlock, and implemented platform-adaptive speech-rate control.
Result
Achieved ~3-second end-to-end latency (speech → translated audio), comparable to professional simultaneous interpreters. Identified browser I/O device limitations and validated a clear path to a native-app architecture.
Background
This project is an extension of Intent-First Translation, which displays the speaker’s intent within 500ms during real-time voice translation. Here, we added an audio output layer — translating English speech into Japanese and playing it through Bluetooth earbuds.
The goal: can you replicate the core experience of AirPods Pro live translation using a regular smartphone, Bluetooth earbuds, and public APIs?
Other person speaks English
→ Phone captures audio (Deepgram)
→ LLM translates in real-time (~2 seconds)
→ Japanese TTS plays through Bluetooth earbuds
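On the client, the last leg of this flow is small. A minimal sketch, assuming the backend pushes each translated sentence over the WebSocket as JSON with a `japanese` field (the URL and message shape here are illustrative, not the project’s exact wire format):

```ts
// Client-side sketch: receive a translated sentence over WebSocket and speak it.
const socket = new WebSocket('wss://example.invalid/translate');  // placeholder URL

socket.onmessage = (event: MessageEvent) => {
  const { japanese } = JSON.parse(event.data);   // assumed message shape
  if (!japanese) return;

  const utterance = new SpeechSynthesisUtterance(japanese);
  utterance.lang = 'ja-JP';                      // request a Japanese voice
  window.speechSynthesis.speak(utterance);       // plays over the current audio route (e.g. Bluetooth earbuds)
};
```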
What Worked
- Self-spoken English → Japanese translation in earbuds ✅
- Translation text → audio delay: within ~1 second
- End-to-end latency: ~3 seconds (comparable to professional simultaneous interpreters at 2–3 seconds)
What Didn’t Work
- Capturing another person’s voice through earbud mic ❌ — Bluetooth earbud mics are designed for the wearer’s voice; noise cancellation actively cuts ambient sound
- Separating input/output devices in browser ❌ — iOS WebKit doesn’t support explicit audio input device selection (see the sketch after this list)
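For reference, the desktop-style way to pick an input device uses the standard MediaDevices APIs, which is exactly what iOS WebKit does not support. The helper name and label heuristic below are illustrative, not the project’s code:

```ts
// Desktop-style input selection: get permission, enumerate microphones, then re-acquire a specific one.
async function pickMicByLabel(pattern: RegExp): Promise<MediaStream> {
  // Device labels are only exposed after a permission grant, so request a default stream first.
  const probe = await navigator.mediaDevices.getUserMedia({ audio: true });
  probe.getTracks().forEach((track) => track.stop());

  const devices = await navigator.mediaDevices.enumerateDevices();
  const mic = devices.find((d) => d.kind === 'audioinput' && pattern.test(d.label));

  // Fall back to the default microphone if no label matches.
  return navigator.mediaDevices.getUserMedia({
    audio: mic ? { deviceId: { exact: mic.deviceId } } : true,
  });
}

// e.g. prefer the phone's built-in mic over the earbud mic (label text varies by device):
// const stream = await pickMicByLabel(/built-?in|internal/i);
```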
Platform-Specific TTS Challenges
Web Speech API is a “browser standard,” but the actual engine differs by OS. Three critical problems were discovered and solved:
1. Speech Rate Inconsistency
The same rate=3.0 was “just right” on Windows Chrome but incomprehensibly fast on iPhone: every iOS browser runs on Apple’s WebKit, and therefore on Apple’s TTS engine, so the same rate value maps to a very different actual speed.
```ts
// Rough device detection: mobile TTS engines need a much lower rate to stay intelligible.
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;
```
2. Silent TTS on Mobile
Mobile browsers block speechSynthesis.speak() unless called from a direct user tap. WebSocket callbacks don’t qualify as “user actions.”
```ts
// Unlock audio on the first tap: speaking a silent utterance from a real user gesture
// lets later speak() calls triggered by WebSocket callbacks actually produce sound.
document.addEventListener('click', () => {
  const unlock = new SpeechSynthesisUtterance('');
  unlock.volume = 0;
  window.speechSynthesis.speak(unlock);
}, { once: true });
```
3. Missing TTS Triggers
In-progress translations could appear on screen but be overwritten before the confirmed result fired TTS. Fixed by triggering TTS on any translation text, with duplicate detection (sketched below).
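A minimal sketch of that fix, assuming translation updates arrive as plain strings; the `lastSpoken` guard and `speakTranslation` name are illustrative, and `ttsRate` is the platform-adaptive value from above:

```ts
// Speak whatever translation text arrives, but never the same sentence twice in a row.
let lastSpoken = '';

function speakTranslation(text: string): void {
  const sentence = text.trim();
  if (!sentence || sentence === lastSpoken) return;  // skip empty and duplicate updates
  lastSpoken = sentence;

  const utterance = new SpeechSynthesisUtterance(sentence);
  utterance.lang = 'ja-JP';
  utterance.rate = ttsRate;                          // platform-adaptive rate (see above)
  window.speechSynthesis.speak(utterance);
}
```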
Browser vs Native App
| Feature | Browser | Native App |
|---|---|---|
| TTS Audio Output | ✅ (with limitations) | ✅ (unrestricted) |
| Input/Output Device Separation | ❌ | ✅ |
| Background Operation | ❌ | ✅ |
A native iOS app using AVAudioSession can control input and output devices independently — capturing with the built-in mic while playing through Bluetooth earbuds. The backend (FastAPI + Deepgram + LLM) is fully reusable.
Measured Latency
| Metric | Measurement |
|---|---|
| End of speech → Translation text | Average 2,115ms |
| Translation text → TTS complete (short sentences) | Within ~1 second |
| End of speech → Translation audio heard | ~3 seconds |
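The text-to-audio leg can be timed directly in the browser with utterance events; a sketch of that measurement (the project’s actual instrumentation may differ):

```ts
// Measure the delay from receiving translated text to the end of its TTS playback.
function speakAndMeasure(text: string): void {
  const t0 = performance.now();
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'ja-JP';
  utterance.onend = () => {
    console.log(`Translation text → TTS complete: ${Math.round(performance.now() - t0)}ms`);
  };
  window.speechSynthesis.speak(utterance);
}
```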
Real-Device Testing
Tech Stack
| Layer | Technology |
|---|---|
| Speech Recognition | Deepgram Streaming API |
| Translation | GPT-4 / Gemini Flash / Groq (LLM) |
| Text-to-Speech | Web Speech API (SpeechSynthesis) |
| Real-Time Communication | WebSocket |
| Frontend | React + TypeScript |
| Backend | Python / FastAPI |
Key Takeaway
Dedicated products like AirPods Pro solve audio input with specialized hardware — beamforming, multi-microphone arrays. Software alone can’t cross that wall. But for the architecture of “capture with built-in mic, deliver translation through earbuds”, public APIs and open technology are more than sufficient. The lessons learned here — platform TTS behavior, mobile browser audio constraints, Bluetooth routing limits — provide the foundation for a native app redesign.
Foundation Project
This project builds on Intent-First Translation — the 500ms intent-display real-time translation system. The speech recognition and LLM translation pipeline is shared.
Deep dive in the blog series: