If you're picking a local speech recognition engine on Mac, the choice usually comes down to two: OpenAI Whisper and NVIDIA Parakeet. Both run well on Apple Silicon, both are open. They make different trade-offs, and the right pick depends on what you're transcribing.
This is a straight comparison based on benchmarks I've run on M2 and M3 Macs.
The short version
- Parakeet is faster and uses less RAM, but English-only.
- Whisper Large-v3 is more accurate on hard audio and handles 99+ languages.
- For English dictation: Parakeet wins.
- For meetings, files, or multilingual content: Whisper.
The gap is smaller than people think. Both are good enough that most users won't notice the accuracy difference on clean audio.
What each one is
OpenAI Whisper is an encoder-decoder transformer. The original release was trained on 680,000 hours of multilingual speech and published open-weight in 2022, with v2 and v3 following. Sizes range from Tiny (75 MB) to Large-v3 (3 GB).
NVIDIA Parakeet is a transducer model: a FastConformer encoder paired with a transducer decoder (RNN-T, or the TDT variant benchmarked here). NVIDIA released it through NeMo. It's smaller, faster, and English-only by default (multilingual variants exist but are less mature).
The architectural difference matters: Whisper processes 30-second windows with an autoregressive transformer decoder that's expensive but flexible. Parakeet's transducer decoder is lightweight and emits text incrementally as it moves through the audio, which is cheap.
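To make the windowing difference concrete, here is a rough sketch of how much input each style has to handle for the same recording. It assumes back-to-back 30-second windows and a 0.5-second streaming chunk; both function names and the chunk size are illustrative, and real implementations differ in how they slide windows and size chunks.

```python
import math

WINDOW_S = 30.0  # fixed window length described above


def fixed_windows(duration_s: float, window_s: float = WINDOW_S) -> list[tuple[float, float]]:
    """Split a recording into back-to-back fixed-length windows.

    The last window is reported at full length because short input is
    padded up to the window size before the model sees it.
    """
    count = max(1, math.ceil(duration_s / window_s))
    return [(i * window_s, (i + 1) * window_s) for i in range(count)]


def streaming_chunks(duration_s: float, chunk_s: float = 0.5) -> int:
    """Number of small chunks a streaming recognizer would consume.

    The 0.5 s default is an assumption, not a Parakeet setting.
    """
    return math.ceil(duration_s / chunk_s)


if __name__ == "__main__":
    print(fixed_windows(5.0))       # a 5-second utterance still fills one 30-second window
    print(len(fixed_windows(95)))   # 95 seconds of audio needs four windows
    print(streaming_chunks(5.0))    # the same 5 seconds arrives as ten small chunks
```

The point of the comparison: a short dictated phrase costs a windowed model a full window of work, while a streaming model only processes the audio that actually exists.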
Speed
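Speed is reported as a multiple of real time, labeled RTF (real-time factor) in the tables below. 1x means the model takes as long as the audio itself. 10x means it processes a 10-minute file in 1 minute. Higher is faster. (Some benchmarks define RTF the other way round, as processing time divided by audio length, where lower is better.)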
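A minimal sketch of that arithmetic, using this article's convention that higher is faster. The function names are illustrative, not taken from any library.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Speed as a multiple of real time: 10.0 means ten times faster than playback."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, factor: float) -> float:
    """Seconds needed to transcribe a file at a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


if __name__ == "__main__":
    # A 10-minute file transcribed in 1 minute runs at 10x.
    print(speed_factor(audio_seconds=600, processing_seconds=60))
    # At 10x, a one-hour recording takes 360 seconds.
    print(processing_time(audio_seconds=3600, factor=10))
```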
Benchmarks on an M2 (8-core GPU, 16 GB RAM) and an M3 Pro, measured against the LibriSpeech test-clean set:
| Engine | Model | RTF (M2) | RTF (M3 Pro) |
|---|---|---|---|
| Whisper | Tiny | 30x | 45x |
| Whisper | Base | 20x | 32x |
| Whisper | Small | 10x | 18x |
| Whisper | Medium | 5x | 9x |
| Whisper | Large-v3 | 2x | 4x |
| Parakeet | TDT-1.1B | 150x | 220x |
Parakeet is roughly 25–30x faster than Whisper Medium, its closest match on accuracy, and 55–75x faster than Whisper Large-v3. For dictation this is the difference between text appearing instantly and waiting a second or more.
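The Parakeet-to-Whisper ratios can be recomputed directly from the table. This sketch copies the table's figures into a dictionary; the function name is made up for illustration.

```python
# Speed factors copied from the table above, as (M2, M3 Pro).
SPEED = {
    "Whisper Tiny": (30, 45),
    "Whisper Base": (20, 32),
    "Whisper Small": (10, 18),
    "Whisper Medium": (5, 9),
    "Whisper Large-v3": (2, 4),
    "Parakeet TDT-1.1B": (150, 220),
}


def parakeet_advantage(model: str) -> tuple[float, float]:
    """How many times faster Parakeet is than `model`, on (M2, M3 Pro)."""
    parakeet_m2, parakeet_m3 = SPEED["Parakeet TDT-1.1B"]
    other_m2, other_m3 = SPEED[model]
    return (parakeet_m2 / other_m2, parakeet_m3 / other_m3)


if __name__ == "__main__":
    for name in ("Whisper Medium", "Whisper Large-v3"):
        m2, m3 = parakeet_advantage(name)
        print(f"{name}: {m2:.1f}x on M2, {m3:.1f}x on M3 Pro")
```

Against Whisper Medium this gives 30x on the M2 and about 24x on the M3 Pro; against Large-v3 it gives 75x and 55x.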
Accuracy
Word error rate (WER) on standard English benchmarks. Lower is better. These numbers vary across test sets. The table covers LibriSpeech test-clean, a relatively clean read-speech corpus, and CommonVoice, which is crowd-sourced and more varied. On harder audio (noisy, accented, technical) the numbers go up for both.
| Engine | WER (LibriSpeech) | WER (CommonVoice) |
|---|---|---|
| Whisper Tiny | 9.0% | 14% |
| Whisper Base | 7.0% | 11% |
| Whisper Small | 5.5% | 8% |
| Whisper Medium | 4.8% | 7% |
| Whisper Large-v3 | 4.2% | 5.5% |
| Parakeet TDT-1.1B | 4.5% | 6.5% |
On clean English, Parakeet matches Whisper Medium and approaches Whisper Large-v3. The gap is small. On noisy or accented English, Whisper Large-v3 holds its lead more clearly.
For multilingual content, Whisper is the only real option. Parakeet's multilingual variants exist but I haven't seen them match Whisper Large on languages outside English.
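For reference, WER is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The sketch below is a plain dynamic-programming version with only lowercasing as normalization; published benchmarks usually apply heavier text normalization (punctuation, numerals), so treat it as an illustration rather than the scoring used for the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))
```

At 4.5% WER, a model gets roughly one word wrong in every 22.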
RAM
Apple Silicon Macs have unified memory, and the model loads into the same pool as everything else. RAM use matters if you have 8 or 16 GB and want to keep using your machine while transcribing.
| Engine | Model | RAM (loaded) |
|---|---|---|
| Whisper | Tiny | ~400 MB |
| Whisper | Base | ~500 MB |
| Whisper | Small | ~1 GB |
| Whisper | Medium | ~2.5 GB |
| Whisper | Large-v3 | ~5 GB |
| Parakeet | TDT-1.1B | ~1.2 GB |
If you're on 8 GB and want to keep VS Code, a browser, and Slack open, Whisper Large-v3 is rough. Parakeet at 1.2 GB or Whisper Small at 1 GB are the practical options at that memory tier.
On 16 GB you can run anything comfortably. On 32 GB and up you don't even think about it.
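The memory reasoning can be sketched as a simple budget check. The model sizes come from the table above; the function name, the 1 GB headroom, and the figure for what other apps consume are assumptions you would replace with numbers from Activity Monitor.

```python
# Approximate loaded-model RAM in GB, copied from the table above.
MODEL_RAM_GB = {
    "Whisper Tiny": 0.4,
    "Whisper Base": 0.5,
    "Whisper Small": 1.0,
    "Whisper Medium": 2.5,
    "Whisper Large-v3": 5.0,
    "Parakeet TDT-1.1B": 1.2,
}


def models_that_fit(total_gb: float, other_apps_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Models whose loaded size fits in what is left of unified memory.

    headroom_gb is a safety margin (an assumption) so the system does not
    start swapping the moment the model loads.
    """
    budget = total_gb - other_apps_gb - headroom_gb
    return [name for name, size in MODEL_RAM_GB.items() if size <= budget]


if __name__ == "__main__":
    # 8 GB Mac with an editor, browser, and chat app using about 5.5 GB (assumed).
    print(models_that_fit(total_gb=8, other_apps_gb=5.5))
    # 16 GB Mac with a heavier 6 GB workload (assumed).
    print(models_that_fit(total_gb=16, other_apps_gb=6))
```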
Latency for dictation
RTF tells you throughput on long files. For dictation, what matters is how quickly the first word appears after you stop talking.
Measured on M2, 5-second utterance, mic to text:
| Engine | First-token latency | Full result |
|---|---|---|
| Whisper Tiny | 180 ms | 250 ms |
| Whisper Small | 350 ms | 500 ms |
| Whisper Medium | 700 ms | 1100 ms |
| Whisper Large-v3 | 1400 ms | 2200 ms |
| Parakeet TDT-1.1B | 80 ms | 150 ms |
Parakeet's streaming output makes it feel instant. Whisper Tiny and Small are also fast enough to feel responsive. Anything Medium or larger introduces a noticeable wait — fine for files, less fine for dictation.
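If you want to take the same two measurements yourself, the sketch below times any iterator of tokens. It is not tied to a particular engine: `fake_engine` is a hypothetical stand-in that sleeps before each word, and you would swap in whatever streaming or batch call your app exposes. The clock is injectable so the logic can be checked without real delays.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_latency(tokens: Iterable[str],
                    clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time how long a token stream takes to yield its first and last item.

    Call this at the moment the utterance ends so the numbers are
    comparable to a mic-to-text measurement.
    """
    start = clock()
    first_ms = None
    words = []
    for token in tokens:
        if first_ms is None:
            first_ms = (clock() - start) * 1000  # first-token latency
        words.append(token)
    full_ms = (clock() - start) * 1000  # time until the stream is exhausted
    return {"text": " ".join(words), "first_token_ms": first_ms, "full_result_ms": full_ms}


def fake_engine(words: list[str], delay_s: float) -> Iterator[str]:
    """Hypothetical engine: waits delay_s, then yields one word at a time."""
    for word in words:
        time.sleep(delay_s)
        yield word


if __name__ == "__main__":
    print(measure_latency(fake_engine(["hello", "world"], delay_s=0.05)))
```

A batch engine that returns everything at once will show first-token and full-result times that are nearly identical; a streaming engine shows a gap between them.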
When to pick which
Use Parakeet if:
- You dictate primarily in English
- You want the lowest possible latency
- You're on a Mac with limited RAM
- You're transcribing long files and want them done quickly
Use Whisper Small or Medium if:
- You need multilingual support (99+ languages)
- You want accuracy without the RAM hit of Large-v3
- You're on 16 GB and want a balanced choice
Use Whisper Large-v3 if:
- You're transcribing meetings or important files where every error costs you
- You have 32 GB+ and don't care about RAM
- You're working with noisy audio, heavy accents, or technical vocabulary
- The job runs as a background batch anyway, so RTF doesn't matter much
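The checklists above can be collapsed into a small decision helper. This is one possible reading, not a definitive rule: the function name, the parameters, and the order in which the conditions are checked are all assumptions, and the memory thresholds are rough cut-offs taken from the RAM discussion.

```python
def pick_engine(english_only: bool, live_dictation: bool, ram_gb: int,
                hard_audio: bool = False) -> str:
    """Suggest an engine from the checklists above (one interpretation)."""
    # English dictation, or English files without difficult audio: Parakeet.
    if english_only and (live_dictation or not hard_audio):
        return "Parakeet"
    # From here on: multilingual content, or hard English audio in a batch job.
    if ram_gb <= 8 or live_dictation:
        # Medium and larger are heavy on 8 GB and add a noticeable wait when dictating.
        return "Whisper Small"
    if hard_audio or ram_gb >= 32:
        return "Whisper Large-v3"
    return "Whisper Medium"


if __name__ == "__main__":
    print(pick_engine(english_only=True, live_dictation=True, ram_gb=8))
    print(pick_engine(english_only=False, live_dictation=False, ram_gb=16))
    print(pick_engine(english_only=True, live_dictation=False, ram_gb=16, hard_audio=True))
```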
What about cloud-equivalent accuracy?
The cloud services (OpenAI Whisper API, Deepgram Nova-2, Google Speech-to-Text) usually report 3.5–4.5% WER on standard benchmarks. That's roughly Whisper Large-v3 territory.
The accuracy gap between local and cloud is real but small: usually 0.5–1 percentage points of WER on clean audio, more on hard audio. For most use cases (dictation, meetings, notes), it's not noticeable. Cloud services win on edge cases: heavy accents you don't have model coverage for, rare technical vocabulary, very low-quality audio.
Apps and which engines they use
If you don't want to think about engines, here's what mainstream Mac apps default to:
- Vext — Parakeet by default, Whisper available as an option
- MacWhisper — Whisper, model selectable
- Superwhisper — Whisper, model selectable
- VoiceInk — Whisper
- FluidVoice — Parakeet support
- Apple Dictation — Apple's own foundation model (not Whisper or Parakeet)
The split between "Parakeet by default" and "Whisper by default" usually reflects whether the app is dictation-first (Parakeet) or file-transcription-first (Whisper).
The bottom line
For most people, on a current Mac, dictating in English: Parakeet. The latency feels different — text appears as you speak rather than after you finish.
For meetings, files, or multilingual work: Whisper Medium or Large-v3.
You can have both. Most apps let you pick per task.