Whisper vs Parakeet on Apple Silicon — Speed, Accuracy, RAM

If you're picking a local speech recognition engine on Mac, the choice usually comes down to two: OpenAI Whisper and NVIDIA Parakeet. Both run well on Apple Silicon, both are open. They make different trade-offs, and the right pick depends on what you're transcribing.

This is a straight comparison based on benchmarks I've run on M2 and M3 Macs.

The short version

Parakeet is faster and uses less RAM, but English-only.
Whisper Large-v3 is more accurate on hard audio and handles 99+ languages.
For English dictation: Parakeet wins.
For meetings, files, or multilingual content: Whisper.

The gap is smaller than people think. Both are good enough that most users won't notice the accuracy difference on clean audio.

What each one is

OpenAI Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual speech. Released open-weight in 2022, with v2 and v3 following. Sizes range from Tiny (75 MB) to Large-v3 (3 GB).

NVIDIA Parakeet is an RNN-T model — recurrent neural network transducer. NVIDIA released it through NeMo. It's smaller, faster, and English-only by default (multilingual variants exist but are less mature).

The architectural difference matters: Whisper processes 30-second windows with a transformer that's expensive but flexible. Parakeet streams audio through an RNN that produces text incrementally and cheaply.

Speed

Speed is measured as real-time factor (RTF). 1x means the model takes as long as the audio itself. 10x means it processes a 10-minute file in 1 minute. Higher is faster.

Benchmarks on M2 (8-core GPU, 16 GB RAM), measured against the LibriSpeech test-clean set:

Engine	Model	RTF (M2)	RTF (M3 Pro)
Whisper	Tiny	30x	45x
Whisper	Base	20x	32x
Whisper	Small	10x	18x
Whisper	Medium	5x	9x
Whisper	Large-v3	2x	4x
Parakeet	TDT-1.1B	150x	220x

Parakeet is roughly 20–50x faster than the equivalent-accuracy Whisper model. For dictation this is the difference between text appearing instantly and waiting half a second.

Accuracy

Word error rate (WER) on standard English benchmarks. Lower is better. These numbers vary across test sets — what follows is from LibriSpeech test-clean, which is a relatively clean read-speech corpus. On harder audio (noisy, accented, technical) the numbers go up for both.

Engine	WER (LibriSpeech)	WER (CommonVoice)
Whisper Tiny	9.0%	14%
Whisper Base	7.0%	11%
Whisper Small	5.5%	8%
Whisper Medium	4.8%	7%
Whisper Large-v3	4.2%	5.5%
Parakeet TDT-1.1B	4.5%	6.5%

On clean English, Parakeet matches Whisper Medium and approaches Whisper Large-v3. The gap is small. On noisy or accented English, Whisper Large-v3 holds its lead more clearly.

For multilingual content, Whisper is the only real option. Parakeet's multilingual variants exist but I haven't seen them match Whisper Large on languages outside English.

RAM

Apple Silicon Macs have unified memory, and the model loads into the same pool as everything else. RAM use matters if you have 8 or 16 GB and want to keep using your machine while transcribing.

Engine	Model	RAM (loaded)
Whisper	Tiny	~400 MB
Whisper	Base	~500 MB
Whisper	Small	~1 GB
Whisper	Medium	~2.5 GB
Whisper	Large-v3	~5 GB
Parakeet	TDT-1.1B	~1.2 GB

If you're on 8 GB and want to keep VS Code, a browser, and Slack open, Whisper Large-v3 is rough. Parakeet at 1.2 GB or Whisper Small at 1 GB are the practical options at that memory tier.

On 16 GB you can run anything comfortably. On 32 GB and up you don't even think about it.

Latency for dictation

Speed and RTF tell you throughput on long files. For dictation, what matters is how quickly the first word appears after you stop talking.

Measured on M2, 5-second utterance, mic to text:

Engine	First-token latency	Full result
Whisper Tiny	180 ms	250 ms
Whisper Small	350 ms	500 ms
Whisper Medium	700 ms	1100 ms
Whisper Large-v3	1400 ms	2200 ms
Parakeet TDT-1.1B	80 ms	150 ms

Parakeet's streaming output makes it feel instant. Whisper Tiny and Small are also fast enough to feel responsive. Anything Medium or larger introduces a noticeable wait — fine for files, less fine for dictation.

When to pick which

Use Parakeet if:

You dictate primarily in English
You want the lowest possible latency
You're on a Mac with limited RAM
You're transcribing long files and want them done quickly

Use Whisper Small or Medium if:

You need multilingual support (99+ languages)
You want accuracy without the RAM hit of Large-v3
You're on 16 GB and want a balanced choice

Use Whisper Large-v3 if:

You're transcribing meetings or important files where every error costs you
You have 32 GB+ and don't care about RAM
You're working with noisy audio, heavy accents, or technical vocabulary
The job runs offline anyway, so RTF doesn't matter much

What about cloud-equivalent accuracy?

The cloud services (OpenAI Whisper API, Deepgram Nova-2, Google Speech-to-Text) usually report 3.5–4.5% WER on standard benchmarks. That's roughly Whisper Large-v3 territory.

The accuracy gap between local and cloud is real but small — usually 0.5–1% WER on clean audio, more on hard audio. For most use cases (dictation, meetings, notes), it's not noticeable. Cloud services win on edge cases: heavy accents you don't have model coverage for, rare technical vocabulary, very low-quality audio.

Apps and which engines they use

If you don't want to think about engines, here's what mainstream Mac apps default to:

Vext — Parakeet by default, Whisper available as an option
MacWhisper — Whisper, model selectable
Superwhisper — Whisper, model selectable
VoiceInk — Whisper
FluidVoice — Parakeet support
Apple Dictation — Apple's own foundation model (not Whisper or Parakeet)

The split between "Parakeet by default" and "Whisper by default" usually reflects whether the app is dictation-first (Parakeet) or file-transcription-first (Whisper).

The bottom line

For most people, on a current Mac, dictating in English: Parakeet. The latency feels different — text appears as you speak rather than after you finish.

For meetings, files, or multilingual work: Whisper Medium or Large-v3.

You can have both. Most apps let you pick per task.