Offline Voice to Text on Mac — How Local Speech Recognition Works

There's a quiet shift happening in voice transcription. Five years ago, anything good ran in the cloud. Apple Silicon changed the math: chips from the M1 onward ship with a Neural Engine fast enough to run real speech recognition models on-device, and for most workloads the gap between cloud and local has narrowed to almost nothing.

This guide covers what offline voice to text on Mac actually means, how the underlying tech works, and which tools are worth using.

Why "offline" became viable

Speech recognition used to be a cloud problem because the models were too big to run on consumer hardware in real time. Whisper Large is around 3 GB, and running it at conversational speed needs serious compute.

What changed:

Apple Silicon's Neural Engine is rated at roughly 11 TOPS on the M1 and 16 TOPS on the M2, scaling to 38 TOPS on the M4. That's enough headroom for Whisper-Medium or Parakeet to run faster than real time.
Smaller models got better. Parakeet (NVIDIA's RNN-T model) hits competitive accuracy at a fraction of Whisper's size and runs at around 150x real-time on M-series chips.
Core ML and Metal matured enough that whisper.cpp and similar implementations actually use the hardware properly instead of pinning the CPU.

The result: you can now dictate, transcribe a meeting, or process an hour-long file locally, even on a fanless MacBook Air.
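To make "faster than real time" concrete, here is a minimal sketch of the arithmetic. The function name is hypothetical, and the speed multiples are the rough figures quoted in this article, not benchmarks.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Time to transcribe a clip when the model runs at `speedup` x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


ONE_HOUR = 60 * 60  # seconds of audio

# Rough figures quoted in this article, not measurements.
for name, speedup in [("Parakeet", 150), ("Whisper Medium", 5)]:
    secs = processing_seconds(ONE_HOUR, speedup)
    print(f"{name}: {secs:.0f} s for an hour of audio")
```

At those rates an hour-long recording takes about 24 seconds with Parakeet and about 12 minutes with Whisper Medium.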

What "offline" actually buys you

Privacy is the obvious one — your audio doesn't go anywhere. But there are practical benefits that matter daily:

Network latency is gone. Cloud transcription adds a network round-trip on top of inference, and even on a fast connection that's 50–200ms of overhead per request. Local inference returns results as fast as the model can produce them, which on Apple Silicon is usually under 200ms total for a short utterance.

Works offline. Flights, trains, hotel Wi-Fi, secure facilities, conference Wi-Fi that throttles everything. None of this matters if the model is on your machine.

No subscription. Cloud services charge by the minute or by the month. Local apps are usually one-time purchases or free.

No vendor lock-in. Your transcripts live in your filesystem. If the company that made the app shuts down, your data is fine.

Predictable. Cloud services change pricing, deprecate APIs, and rate-limit. Local tools just keep working.

How on-device speech recognition works on Mac

Two model families dominate on Apple Silicon:

OpenAI Whisper

Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual speech. It's open-weight, comes in multiple sizes (Tiny, Base, Small, Medium, Large), and handles 99+ languages.
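For a sense of what driving the model looks like in code, here is a minimal sketch using the open-source `openai-whisper` Python package. That choice is an assumption for illustration; Mac apps typically bundle whisper.cpp or a Core ML port instead. `load_model` and `transcribe` are that package's calls, while `format_segments`, `transcribe_file` and the file name are hypothetical.

```python
def format_segments(segments):
    """Render Whisper-style segments (dicts with start, end, text) as timestamped lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_file(path, model_name="small"):
    """Transcribe one audio file locally.

    Requires `pip install openai-whisper` and ffmpeg on the PATH.
    """
    import whisper  # imported lazily so the formatter works without the package

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return format_segments(result["segments"])


# Example (needs the package, ffmpeg and an audio file):
# print(transcribe_file("meeting.m4a"))
```

After the first run the weights are cached on disk, so later calls need no network access.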

Sizes and rough trade-offs on M-series Macs:

Model	Size	RAM	Speed (M2)	WER (English)
Tiny	75 MB	~400 MB	~30x realtime	~9%
Base	142 MB	~500 MB	~20x realtime	~7%
Small	466 MB	~1 GB	~10x realtime	~5.5%
Medium	1.5 GB	~2.5 GB	~5x realtime	~4.8%
Large-v3	3 GB	~5 GB	~2x realtime	~4.2%

Larger models are more accurate but use more RAM and run slower. For most dictation, Small or Medium is the sweet spot. For meetings or files where you want the best accuracy, use Large-v3.
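The WER column is word error rate: the number of word substitutions, deletions and insertions needed to turn the model's output into a reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows; the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to align the reference words seen so far
    # with the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("set a timer for ten minutes",
                      "set the timer for ten minutes")
print(f"WER: {wer:.1%}")  # one wrong word out of six
```

Read this way, a WER of about 5% means roughly one word in twenty needs correcting.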

NVIDIA Parakeet

Parakeet is an RNN-T model (recurrent neural network transducer). It's faster than Whisper at similar accuracy, English-only by default, and runs at around 150x real-time on M2.

Parakeet is the better default for English dictation because the latency advantage is huge — you barely notice the model running. The downside is single-language support. If you need multilingual transcription, Whisper is the choice.

Most modern Mac apps let you pick which engine to use per task.
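That per-task choice boils down to a small decision rule. The sketch below is one hypothetical way to encode it; the function name is invented, and the RAM thresholds are the approximate figures from the Whisper table, not values any particular app uses.

```python
# (model, approximate RAM in GB) from the Whisper table, most accurate first.
WHISPER_MODELS = [("large-v3", 5.0), ("medium", 2.5), ("small", 1.0),
                  ("base", 0.5), ("tiny", 0.4)]


def pick_engine(language: str, free_ram_gb: float) -> str:
    """Return an engine name for a task, following this article's rule of thumb."""
    if language.lower() in ("en", "english"):
        return "parakeet"  # fastest option for English dictation
    # Otherwise pick the most accurate Whisper model that fits in memory.
    for name, ram_gb in WHISPER_MODELS:
        if ram_gb <= free_ram_gb:
            return f"whisper-{name}"
    raise ValueError("not enough free memory for any Whisper model")


print(pick_engine("en", 8))  # English dictation
print(pick_engine("de", 3))  # German on a memory-constrained machine
```

A real app would also weigh latency against accuracy per task, for example preferring a smaller Whisper model for live dictation and a larger one for file transcription.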

What runs locally beyond transcription

Speech recognition is only half the picture. The full pipeline for dictation usually looks like:

Audio capture — microphone input or system audio.
Speech recognition — Whisper or Parakeet produces raw text.
Post-processing — punctuation, capitalization, filler word removal.
Optional: LLM cleanup — a local language model rewrites the text to read like polished writing.
Optional: Translation — output in a different language than the input.

The two optional steps, LLM cleanup and translation, use small local LLMs (Gemma 3 4B, Qwen 3 4B, Llama 3.2 3B) running through llama.cpp or MLX. These are around 2–4 GB each and run at conversational speed on M-series chips. The output reads like edited writing rather than a raw transcript.
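The post-processing stage is the easiest one to show in a few lines. This is a deliberately naive, rule-based sketch; the filler list and function names are assumptions, and real apps typically rely on a punctuation model or the local LLM for this step.

```python
import re

FILLERS = ("um", "uh", "erm", "you know")  # illustrative list, not exhaustive


def remove_fillers(text: str) -> str:
    """Drop filler words, plus a comma that trails them, then tidy spacing."""
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def capitalize_sentences(text: str) -> str:
    """Uppercase the first letter of the text and of each sentence after . ! or ?"""
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)


def postprocess(raw: str) -> str:
    """Raw recognizer output in, lightly cleaned text out."""
    text = capitalize_sentences(remove_fillers(raw))
    if not text:
        return text
    return text if text.endswith((".", "!", "?")) else text + "."


print(postprocess("um, so the demo went well. uh we ship on friday"))
# prints: So the demo went well. We ship on friday.
```

Note that "friday" stays lowercase: fixing proper nouns needs a language model, which is where the optional LLM cleanup step earns its place.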

Meeting transcription adds two more components:

Speaker diarization — figuring out who said what. Done with neural embeddings of voice characteristics, all local.
Summarization — feeding the transcript to a local LLM with a "summarize this meeting" prompt to extract action items and key decisions.

None of this needs the cloud anymore.
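To illustrate the embedding idea behind diarization: each audio segment is turned into a vector, and segments whose vectors point in a similar direction are attributed to the same speaker. The sketch below uses made-up 3-dimensional vectors and a simple greedy rule. Real systems use learned embeddings with far more dimensions and proper clustering, and the 0.8 threshold is an arbitrary assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: reuse the most similar known speaker, else start a new one."""
    speakers = []  # one reference embedding per speaker (the first segment seen)
    labels = []
    for emb in embeddings:
        scores = [cosine(emb, ref) for ref in speakers]
        if scores and max(scores) >= threshold:
            labels.append(scores.index(max(scores)))
        else:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
    return [f"Speaker {i + 1}" for i in labels]


# Toy embeddings for four consecutive segments (hypothetical values).
segments = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.8, 0.2, 0.1], [0.0, 1.0, 0.2]]
print(assign_speakers(segments))
# prints: ['Speaker 1', 'Speaker 2', 'Speaker 1', 'Speaker 2']
```

The labels can then be attached to the transcript segments before the text is handed to the summarization prompt.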

Tools that do this well

Free or low-cost:

Apple Dictation (built into macOS) runs on-device on Apple Silicon for supported languages, but is limited to short dictation.
MacWhisper — free for file transcription, €64 Pro for live dictation.
VoiceInk — open-source, $25–49 once.
FluidVoice — free, open-source, supports Parakeet.

Paid with broader scope:

Vext — $49 once, dictation plus meetings plus translation, all local.
Superwhisper — $249 lifetime, dictation-focused with custom modes.
Voibe — $198 lifetime, privacy-focused dictation.

The split between these is mostly about feature scope. For solo use, the local-vs-cloud trade-off is largely settled: local is competitive on accuracy and faster on latency. Everything below the top tier of cloud services (Otter Premium, Rev) is matched or beaten by what runs on your laptop.

When cloud still wins

To be honest about it: cloud services still have advantages in specific cases.

Team collaboration. Otter, Fireflies, Granola — these have shared transcript libraries, comments, real-time co-watching. If your workflow involves multiple people working on the same transcripts, cloud is built for that.

Industry-specific accuracy. Medical, legal, and technical domains have specialized cloud models trained on industry vocabulary that local Whisper or Parakeet won't match without fine-tuning.

Cross-platform. If you switch between Mac, Windows, and iPhone constantly, a cloud service syncs across all of them.

For solo work on a Mac, none of these usually matter. For team work in regulated industries, they might.

Setting up local voice to text

Three steps:

Pick an app. For most people, the right answer is one of MacWhisper, Vext, or Superwhisper, all of which offer a free trial. Try one, see if it fits.
Download the model. First run downloads the model you pick, typically 600 MB to 3 GB for the mid-size and large ones (the smallest Whisper models are under 150 MB). After that, it just works.
Set a hotkey. Most apps default to the fn key or right Shift as the trigger. Pick something you can hit without thinking.

That's the entire setup. No accounts, no API keys, no usage tiers.

The practical upshot

Offline voice to text on Mac stopped being a compromise in 2023 and crossed into "actually better than cloud" for most use cases by late 2024. The latency is lower, the privacy is real, and the price is one-time instead of monthly.

If you've been using cloud dictation out of habit, it's worth trying a local alternative. The gap you might remember from a few years ago isn't there anymore.