Part 3: Can Real-Time Translation Run on a Home GPU?
In Part 1, I introduced the concept of Intent-First Translation. In Part 2, I covered the technical design. In this final part, I'll share my attempt to run the system entirely on a local GPU without cloud APIs, and the future vision for this project.
Why I Tried a Local LLM
Cloud APIs are fast and convenient, but they cost money for continuous use.
| Provider | Cost per 5 hours |
|---|---|
| Groq / Llama 4 Maverick | |
| OpenAI GPT-4o-mini | |
| Gemini 2.5 Flash Lite | |
As a daily-use tool, this cost is not negligible. If “real-time translation running entirely on a home GPU” could work, the running cost would be zero.
So I tested Google’s open-source LLM “Gemma 3 4B” on my RTX 3060 (6GB VRAM) to see if it could handle real-time translation.
Testing with Ollama
I started with Ollama.
For single requests it worked fine, and the translation quality was practical. However, when I sent parallel requests, as real-time translation requires, GPU memory was exhausted and Windows shut down forcibly.
The 6GB VRAM constraint was more severe than expected.
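For reference, a single request of the kind that did work looked roughly like the sketch below. It assumes Ollama on its default port 11434 with the gemma3:4b model pulled; the prompt wording is illustrative, not the exact prompt from the project.

```typescript
// Minimal sketch of a single translation request against Ollama's local API.
// Assumes Ollama is running on the default port 11434 and `ollama pull gemma3:4b` has been done.
// The prompt wording is illustrative, not the exact prompt used in the project.

async function translateOnce(englishText: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gemma3:4b",
      stream: false, // wait for the full reply instead of streaming tokens
      messages: [
        { role: "system", content: "Translate the user's English into natural Japanese. Reply with the translation only." },
        { role: "user", content: englishText },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
  const data = await res.json();
  return data.message.content; // /api/chat returns { message: { role, content }, ... }
}

// Single requests like this were fine; firing several in parallel is what exhausted VRAM.
translateOnce("Could we move the meeting to Thursday?").then(console.log);
```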
Testing with LM Studio 0.4.0 (Headless Mode)
Next, I tried the headless daemon llmster added in LM Studio 0.4.0.
```bash
lms daemon up
lms server start --port 1234 --bind 0.0.0.0 --cors
```
It runs as a server without a GUI and exposes an OpenAI-compatible API, so backend code changes were minimal. I also implemented a lock mechanism to control parallel requests (sketched below), but the results were:
| Item | Result |
|---|---|
| Translation | △ Works but 3–4 second latency |
| Summary Generation | × Lock contention prevented execution |
| Real-Time Performance | × Sub-500ms response impossible |
| Cost | ◎ Completely free |
Translation itself works, but taking 3–4 seconds per request means it can’t keep up with speaking speed.
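Part of the problem is the lock itself: to protect the 6GB card, every request has to wait its turn, which is also why summary generation got starved. Conceptually the lock was nothing more than serializing calls, roughly like the sketch below against LM Studio's OpenAI-compatible endpoint on port 1234 (the model identifier and prompt are placeholders).

```typescript
// Sketch of the parallel-request lock: serialize all LLM calls through a promise chain
// so only one generation hits the 6GB GPU at a time.
// Assumes LM Studio's OpenAI-compatible server on port 1234; the model name is a placeholder.

let queue: Promise<unknown> = Promise.resolve();

function withLock<T>(task: () => Promise<T>): Promise<T> {
  const result = queue.then(() => task()); // run after whatever is currently queued
  queue = result.then(() => undefined, () => undefined); // keep the chain alive even if a task fails
  return result;
}

async function translateLocked(englishText: string): Promise<string> {
  return withLock(async () => {
    const res = await fetch("http://localhost:1234/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "gemma-3-4b", // placeholder: whatever identifier LM Studio shows for the loaded model
        messages: [
          { role: "system", content: "Translate into natural Japanese. Reply with the translation only." },
          { role: "user", content: englishText },
        ],
      }),
    });
    const data = await res.json();
    return data.choices[0].message.content;
  });
}
```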
In conclusion, a GPU with 6GB VRAM cannot meet the requirements for real-time translation. With 12GB+ VRAM (RTX 4070 class or above), it might be possible, but for now cloud APIs like Groq and Gemini are the practical choice.
I Also Investigated Mobile Local LLMs
I explored whether we could run the LLM directly on a smartphone’s SoC.
I checked mobile LLM runtimes like MLC Chat, Google AI Edge Gallery, and SmolChat. Even running a 1B model on current smartphones (Snapdragon 8 Gen 2 class SoC) only achieves 12–16 tokens/sec, making sub-500ms real-time translation unrealistic.
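A quick back-of-envelope check shows why (the ~25-token output length for a short Japanese sentence is an assumption, not a measurement):

```typescript
// Back-of-envelope check: how long does a short reply take at mobile token rates?
// 25 output tokens for a short Japanese translation is an assumed figure, not a measurement.
const outputTokens = 25;
for (const tokensPerSec of [12, 16]) {
  const seconds = outputTokens / tokensPerSec;
  console.log(`${tokensPerSec} tok/s -> ~${seconds.toFixed(1)} s`); // prints ~2.1 s and ~1.6 s
}
// Either way it is several times the 500 ms budget, before speech recognition is even counted.
```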
However, mobile SoC compute performance is improving rapidly, and I believe it could reach practical levels within a few years.
Future Development Plans
Having confirmed the limitations of local LLM, I’m rethinking the development direction. Here are three themes I want to work on.
1. Hands-Free Translation with TTS (Text-to-Speech)
The current system displays translations on screen, but having the translations read aloud as well would remove the need to look at the screen at all.
Technically, this uses the browser’s standard Web Speech API (SpeechSynthesis). When the confirmed translation is generated, Japanese speech is synthesized and delivered through Bluetooth earphones.
Other person speaks in English
→ Speech recognition (Deepgram)
→ Intent estimation + translation (LLM, ~500ms)
→ Japanese speech synthesis & playback
→ Translation heard through Bluetooth earphones
This would work as a translation device using just a smartphone and Bluetooth earphones.
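As a rough sketch of the SpeechSynthesis part, assuming the client calls it whenever a confirmed translation arrives (voice availability varies by browser and OS):

```typescript
// Minimal sketch: read a confirmed Japanese translation aloud via the Web Speech API.
// Runs in the browser; with Bluetooth earphones connected, audio routes to them automatically.
// Voice availability varies by browser/OS, so the ja-JP voice pick is best-effort.

function speakTranslation(japaneseText: string): void {
  const utterance = new SpeechSynthesisUtterance(japaneseText);
  utterance.lang = "ja-JP";
  utterance.rate = 1.1; // slightly faster than default to keep up with conversation pace

  // Prefer a Japanese voice if the browser exposes one.
  const jaVoice = speechSynthesis.getVoices().find((v) => v.lang.startsWith("ja"));
  if (jaVoice) utterance.voice = jaVoice;

  speechSynthesis.cancel(); // drop any utterance still playing so playback doesn't fall behind
  speechSynthesis.speak(utterance);
}

// Hypothetical hook point: call this wherever the client receives a confirmed translation.
speakTranslation("木曜日に会議を動かせますか？");
```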
2. Bidirectional Translation
Currently it’s one-directional (English → Japanese), but supporting the reverse direction would enable bidirectional real-time conversation.
- You speak in Japanese → Translated to English → Reaches the other person
- Other person speaks in English → Translated to Japanese → Reaches you
The goal is enabling natural-tempo conversation between speakers of different languages without a human interpreter.
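One way to wire this up would be to route each recognized utterance by its detected language. The sketch below is a rough outline; the language-detection input and the translate/output functions are placeholders for whatever the speech-recognition and LLM layers actually provide.

```typescript
// Rough sketch of direction routing for bidirectional translation.
// `detectedLang` is assumed to come from the speech-recognition layer's language detection;
// translate(), speakTranslation(), and playToOtherParty() stand in for existing pieces.

type Lang = "en" | "ja";

function pickDirection(detectedLang: string): { source: Lang; target: Lang } {
  // My speech goes out to English; the other person's speech comes in as Japanese.
  return detectedLang.startsWith("ja")
    ? { source: "ja", target: "en" }
    : { source: "en", target: "ja" };
}

async function handleUtterance(text: string, detectedLang: string): Promise<void> {
  const { source, target } = pickDirection(detectedLang);
  const translated = await translate(text, source, target);
  if (target === "ja") {
    speakTranslation(translated); // to me, e.g. via the TTS sketch above
  } else {
    playToOtherParty(translated); // to the other person, via whatever output path they use
  }
}

// Ambient placeholders so the sketch type-checks on its own.
declare function translate(text: string, source: Lang, target: Lang): Promise<string>;
declare function speakTranslation(text: string): void;
declare function playToOtherParty(text: string): void;
```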
3. Open-Source Smart Glasses Integration
As the most ambitious theme, I envision displaying translation subtitles on smart glasses.
Several open-source smart glasses projects have emerged recently:
- OpenGlass: Converts regular glasses into AI smart glasses with about $20 (~¥3,000) in parts. Uses a XIAO ESP32S3 Sense with camera, microphone, and Bluetooth.
- Mentra: Smart glasses with camera, speaker, and microphone, providing an open-source SDK and an app store for third-party development.
- Team Open Smart Glasses: Fully open-source smart glasses with display, microphone, and wireless connectivity. A live translation app is officially supported.
By combining these devices with Intent-First Translation, it would be technically possible for a subtitle like “Schedule adjustment proposal” to appear in your field of view 0.5 seconds after the other person starts speaking, followed by the full translation.
All the necessary components — speech recognition API, fast LLM API, WebSocket, Web Speech API, Bluetooth, open-source smart glasses hardware — already exist. The remaining step is to integrate them with the Intent-First design philosophy.
Looking Back at the 3-Part Series
| Post | Theme | Key Point |
|---|---|---|
| Part 1 | Problem & Inspiration | The real challenge of voice translation is “silence,” not accuracy. Show intent first to solve it |
| Part 2 | Technical Design | JSON field order optimization, 3-layer architecture, 6-model benchmarks |
| Part 3 | Challenge & Vision | Confirmed local LLM limitations. Planning TTS, bidirectional translation, smart glasses integration |
Intent-First Translation is still at the prototype stage. Many challenges remain. But “accurate translation” and “natural conversation flow” are separate problems, and I feel confident about this direction as an approach to the latter.
With various AI APIs and open-source hardware available today, even individual developers can experiment with new approaches. I hope this project provides some reference for those interested in the same field.
Part 1: The Real Challenge of Voice Translation Wasn’t Accuracy
Part 2: JSON Field Order Made Translation Display 2x Faster
Part 4: Can You Hear Translations Through Bluetooth Earbuds?