Most translation tools work on text: paste in one language, copy out another. That's fine for written content. It falls short at the moment you actually need translation: mid-thought, writing in a second language, with your ideas running faster than you can type them in that language.
Voice translation skips the middle step. You speak in language A, and the text appears at your cursor in language B. No copy-paste round-trip, no separate tab. By the time you finish the sentence, the translation is already there.
This post is about how that pipeline works on Mac, what's realistic on accuracy, and where it pays off.
How voice translation works on Mac
The pipeline has two stages:
Stage 1: Speech recognition. Your spoken audio gets transcribed into text in the source language. OpenAI's Whisper handles 99+ languages out of the box and can run entirely on-device on an Apple Silicon Mac.
Stage 2 — Translation. The transcribed text gets translated into the target language. Two sub-options here:
- Whisper's built-in translate mode (audio in any language → English text). Free, fast, but only goes to English.
- A separate translation pass via a small local LLM (Gemma, Qwen, LLaMA). Bidirectional between any pair of languages.
Most Mac apps that do "voice translation" use the second approach because it works for any direction, not just to-English. You get full bidirectional translation, all running locally on your Mac.
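To make the two stages concrete, here is a minimal sketch of how they compose. The names (`Transcript`, `voice_translate`, the two stand-in functions) are invented for illustration and are not any app's actual code. The stand-ins return canned values so the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    text: str       # transcribed text in the source language
    language: str   # language code of the speech, e.g. "fr"


# Stage signatures (hypothetical names, for illustration only).
Transcriber = Callable[[str], Transcript]      # audio path -> transcript
Translator = Callable[[str, str, str], str]    # text, source, target -> text


def voice_translate(audio_path: str, target: str,
                    transcribe: Transcriber, translate: Translator) -> str:
    """Run stage 1 (speech recognition), then stage 2 (translation)."""
    transcript = transcribe(audio_path)
    if transcript.language == target:
        # Already in the target language: skip the translation pass.
        return transcript.text
    return translate(transcript.text, transcript.language, target)


# Stand-ins so the sketch runs offline. A real stage 1 might wrap a local
# Whisper model, and a real stage 2 might prompt a small local LLM.
def fake_transcribe(audio_path: str) -> Transcript:
    return Transcript(text="bonjour tout le monde", language="fr")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


result = voice_translate("note.wav", "ja", fake_transcribe, fake_translate)
print(result)
```

Passing the stages in as functions keeps the pipeline independent of which speech model or LLM sits behind each one, which is roughly why swapping model sizes later is cheap.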
What "bidirectional translation" actually means
If you only need to translate Spanish-to-English (audio coming in, text in English going out), Whisper's translate mode alone is enough. It's a single model, fast, accurate for major languages.
If you need any pair — French to Japanese, German to Korean, Spanish to French — you need a translation pass after transcription. A small local LLM can handle this for any of the 99+ languages Whisper recognizes.
Use cases for each direction:
- Any language → English: You have non-English audio, such as a meeting with a partner team in Berlin or a client in São Paulo, and want the notes in English. Whisper translate mode is enough.
- English → any language: You're an English speaker writing to a non-English audience. Dictate in English, get translated text. Common for international sales, support tickets, partner communications.
- Non-English → non-English: Multilingual users writing across language pairs. Less common but real: a Spanish speaker from Mexico writing French emails, a Japanese speaker writing Korean Slack messages, and so on.
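The three directions above boil down to a small routing decision. The sketch below assumes ISO 639-1 language codes, and the route names are made up for this example.

```python
def pick_route(source: str, target: str) -> str:
    """Return which pipeline a language pair needs.

    Codes are ISO 639-1 strings such as "en" or "fr".
    Route names are hypothetical labels for this sketch.
    """
    if source == target:
        return "transcribe_only"       # plain dictation, nothing to translate
    if target == "en":
        return "whisper_translate"     # single model, to-English only
    return "transcribe_then_llm"       # any other pair needs a second pass


for pair in [("es", "en"), ("en", "de"), ("ja", "ko"), ("en", "en")]:
    print(pair, "->", pick_route(*pair))
```

An app could still send to-English pairs through the LLM pass for consistency; the point is only that the to-English case is the one with a shortcut.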
Accuracy expectations
The hardest thing to be honest about with translation is that "accurate" means different things for different tasks.
For casual messaging, summaries, and emails: local voice translation is genuinely usable. The output is close enough to native that a human reader understands without effort and rarely notices errors.
For published content, legal documents, or anything where exact phrasing matters: it's a draft, not a final. You need a native speaker to review.
By language pair:
- English ↔ Spanish, French, German, Italian, Portuguese: Excellent. Whisper + a modern small LLM gets you ~95%+ usable output.
- English ↔ Japanese, Korean, Chinese: Good for prose. Idioms and culturally-loaded phrasing need review.
- English ↔ Arabic, Hindi, Turkish, Russian, Polish: Solid for most content. Specialized vocabulary (legal, medical) more error-prone.
- Less common languages: Variable. Whisper Large-v3 is best for transcription. Translation quality depends on the LLM's training coverage.
These numbers are rough — actual accuracy depends on the model size, audio quality, and how technical your content is. Whisper Large-v3 + a 4B parameter LLM is the practical sweet spot on a 16GB Mac. Whisper Small + the same LLM is faster but loses 1–2 points on accuracy.
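A back-of-the-envelope check shows why that pairing fits on a 16GB Mac. The figures are assumptions: about 1.55 billion parameters for Whisper Large-v3 stored at 16 bits each, and a 4B LLM quantized to 4 bits per weight. The estimate covers weights only and ignores runtime overhead such as activations and caches.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of model weights in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


# Assumed sizes and precisions; real builds vary with the quantization used.
whisper_large = weights_gb(1.55, 16)   # Whisper Large-v3, 16-bit weights
llm_4b = weights_gb(4.0, 4)            # 4B-parameter LLM, 4-bit weights
total = whisper_large + llm_4b

print(f"Whisper Large-v3: {whisper_large:.1f} GB")
print(f"4B LLM (4-bit):   {llm_4b:.1f} GB")
print(f"Weights total:    {total:.1f} GB")
```

Under these assumptions the weights come to roughly 5 GB, which leaves headroom for macOS and your other apps.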
Cloud vs local for translation
The cloud services (Google Translate, DeepL, OpenAI's translation) all do voice translation well. (Apple's built-in translation can run on-device, so it is not counted as cloud here.) The trade-offs:
Cloud wins on:
- Best accuracy on every language pair, including obscure ones
- Real-time translation in conversation mode (Google Translate's two-way feature)
- No model download
Local wins on:
- Privacy. Audio doesn't leave your Mac.
- No subscription. Cloud translation services are usually free up to limits, then paid.
- No network dependency. Works on planes, in conference Wi-Fi, in secure facilities.
- No quota or rate limit.
- One workflow that works in any app instead of a translate app or browser tab.
For Mac users specifically, the gap between local and cloud translation quality has narrowed significantly in the last two years. Local Whisper + a 4B local LLM produces output close enough to DeepL that most users can't reliably distinguish them on common language pairs. The honest gap is more like 5% on specialized content than the 30% it used to be.
Apps that do live voice translation on Mac
Vext ($49 once) — set a target language in settings, dictate in any language, get translated text at your cursor. The translation runs through a local LLM after Whisper transcription. With Enhance enabled, cleanup and translation happen in a single pass — you speak messy French, clean English appears.
Apple Translate (built-in) — voice translation between major language pairs, free, on-device. Works in the Translate app but doesn't paste-at-cursor into other apps. For app-to-app translation you have to copy-paste.
MacWhisper — supports Whisper's translate mode (any language → English). Doesn't do bidirectional or non-English-target translation in a single pass. Good for file-based transcription with translation.
Cloud subscriptions — Wispr Flow, Otter, etc. all have translation features. Subscription-based, cloud-processed.
DeepL desktop — best-in-class text translation. Has voice input on some platforms but the macOS experience leans toward typed input + voice as supplemental. Free tier limited, Pro is $9/month.
Setting it up in Vext
Specific setup for voice translation in Vext:
- Install: brew install muvon/tap/vext
- Open Settings > Languages
- Set Source language to "Auto" (Whisper detects) or pin to a specific language for better accuracy
- Set Target language to whatever you want the output to be
- Enable Enhance — this lets cleanup + translation happen in one LLM pass
- Optional: download a larger Whisper model (Large-v3) for the highest accuracy on non-English source audio
Then: click into any text field, hold the hotkey, speak in source language, release. Translated text appears at the cursor.
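The "cleanup + translation in one LLM pass" idea can be sketched as a single prompt that asks for both jobs at once. This is an illustration of the general technique, not Vext's actual prompt; the function name and wording are invented.

```python
def build_prompt(transcript: str, target_language: str, enhance: bool) -> str:
    """Build one instruction covering translation and, optionally, cleanup."""
    steps = []
    if enhance:
        steps.append("Remove filler words, false starts, and repetitions.")
        steps.append("Fix punctuation and capitalization.")
    steps.append(f"Translate the result into {target_language}.")
    steps.append("Return only the final text, with no commentary.")

    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Process the dictated text below.\n{numbered}\n\nText:\n{transcript}"


prompt = build_prompt("euh, alors, je je pense que c'est bon", "English", enhance=True)
print(prompt)
```

Doing both jobs in one call means the model is loaded and run once per dictation instead of twice, which is where the latency saving would come from.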
Recommendation for users who switch language pairs often: don't rely on automatic source-language detection for every dictation. Pin the source to whatever you're using right now and change it manually when you switch. Auto-detection is usually right, but it occasionally guesses wrong from the first few words, and then the whole dictation gets transcribed in the wrong language. The 2 seconds it takes to flip the source language in settings costs less than redoing a dictation.
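The pin-versus-detect trade-off can be expressed as a small guard. The sketch assumes the detector hands back a dictionary of language probabilities (the open-source Whisper package exposes something along these lines through its language detection); the function name and the 0.8 threshold are arbitrary choices for illustration.

```python
from typing import Dict, Optional


def choose_source_language(probs: Dict[str, float],
                           pinned: Optional[str] = None,
                           min_confidence: float = 0.8) -> Optional[str]:
    """Pick the language to transcribe in.

    probs: language code -> probability from a detector.
    pinned: language the user fixed in settings, if any.
    Returns None when nothing is pinned and detection is too uncertain,
    so the caller can ask the user instead of guessing.
    """
    if pinned is not None:
        return pinned                  # a pinned language always wins
    if not probs:
        return None
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_confidence else None


print(choose_source_language({"es": 0.55, "pt": 0.40}))               # uncertain
print(choose_source_language({"es": 0.55, "pt": 0.40}, pinned="pt"))  # pinned
print(choose_source_language({"fr": 0.97, "it": 0.02}))               # confident
```

Similar-sounding languages such as Spanish and Portuguese are where a short opening phrase is most likely to produce a low-confidence guess, and that is the case pinning avoids.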
Workflows where this changes things
Support tickets in non-native English. Customer support agents whose native language isn't English often write more slowly and edit more when working in English. Speaking in their native language and getting English text removes that writing tax.
Cross-team communication. A Mexican engineering team writing to a Korean product team. Each side writes in their native language; the other side reads in theirs. Translation happens locally on each end.
Sales calls with non-native clients. Take notes during the call in your native language. Export them in the client's language for follow-up.
Language practice. Speak in the language you're learning, see what came out, and compare it to what you meant. Voice translation works as a writing aid for language learners, and it is more demanding than typing because you have to produce the language out loud and hear yourself doing it.
Travel. Working remotely from a country where you don't speak the language. Dictate notes in your native language; get them in the local language when you need to communicate. Or vice versa.
What it doesn't replace
Voice translation in a dictation app is not the same as:
Real-time conversation interpretation. If you're trying to have a live conversation with someone who speaks a different language, you want Google Translate's conversation mode or a phone with that built in. A dictation app is for solo work, not interpretation.
Document translation. For translating an existing document, DeepL or Google Translate's text/file mode is more efficient. Voice doesn't help if you already have the source text.
Subtitling. For video subtitles in another language, you want a dedicated workflow with Whisper translate mode + a captioning tool. Possible with Vext via file export to SRT, but not the primary use case.
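For a sense of what the captioning step involves, here is a minimal sketch that turns timed segments into SRT text. It assumes each segment is a dictionary with `start` and `end` in seconds plus a `text` field, which is the general shape Whisper-style transcription output takes. The function names are invented for this example.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.0, "text": " How are you?"},
]
print(to_srt(segments))
```

A real subtitle workflow also has to split long lines and adjust timing for readability, which is why a dedicated captioning tool is the better fit.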
A note on accuracy and trust
If you're using translated dictation for anything that has consequences — a customer email that needs to read professionally, a contract addendum, a public post — read it before sending. Local voice translation is good enough that you can trust it for first drafts; it's not good enough that you should trust it without review.
The pattern that works:
- Dictate in your native language
- Read the translated output
- Edit anything that sounds off
- Send
That edit step is rare for casual content (Slack, internal email) and important for external-facing or precise content. The translation gets you 95% of the way; you're the 5%.
For Mac users who work multilingually, the unlock isn't that the technology is perfect now. It's that it's good enough that you stop opening a translate tab.