Hold a hotkey. Talk. Text appears at your cursor. That's Vext — a voice-to-text app that runs entirely on your Mac. No cloud. No account. No subscription.
This guide covers everything: installation, hotkey configuration, the three modes (dictation, meetings, notes), Enhance, live translation, and every other feature.
Installation
Install via Homebrew:
brew install muvon/tap/vext
Or grab it directly from getvext.app. No account required — install and start using it immediately.
Requirements: macOS 14 Sonoma or later, Apple Silicon (M1–M4).
Your first dictation
- Launch Vext from Applications
- Hold your hotkey
- Speak
- Release — text appears at your cursor
Four steps. No login. The text goes wherever your cursor was when you started talking.
Three modes
Vext has three modes for different workflows.
Dictation
The core experience. Hold a hotkey, speak, release — text appears at your cursor. Works in any text field, any app: browsers, editors, terminals, chat, email, notes.
Dictation is the fastest way to get words into a computer. You speak at 130–150 words per minute. You type at 40–60. For a 100-word message, dictation takes 40 to 46 seconds. Typing takes 100 to 150 seconds, roughly two minutes.
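The arithmetic behind that comparison is easy to check. This is a rough sketch that uses only the speaking and typing rates quoted above; the function name `seconds_for` is made up for illustration.

```python
def seconds_for(words: int, words_per_minute: float) -> float:
    """Seconds needed to produce `words` at the given rate."""
    return words * 60 / words_per_minute


MESSAGE_WORDS = 100

# Speaking rates quoted above: 130-150 words per minute.
speak_fast = seconds_for(MESSAGE_WORDS, 150)
speak_slow = seconds_for(MESSAGE_WORDS, 130)

# Typing rates quoted above: 40-60 words per minute.
type_fast = seconds_for(MESSAGE_WORDS, 60)
type_slow = seconds_for(MESSAGE_WORDS, 40)

print(f"Dictation: {speak_fast:.0f}-{speak_slow:.0f} s")
print(f"Typing:    {type_fast:.0f}-{type_slow:.0f} s")
```

Even the fastest typist in that range needs more than twice as long as the slowest speaker.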
Meetings
Record meetings with speaker identification. Vext captures your microphone and system audio simultaneously, so it works with Zoom, Google Meet, FaceTime, and any other video call.
When the meeting ends, you get:
- A full transcript with speaker labels and timestamps
- An AI-generated summary with key points and action items
- Any screenshots you captured during the call
Notes
Quick voice memos captured with a single key press. Speak your thought, and Vext transcribes it, runs it through Enhance, and stores it locally.
Notes go through the same processing pipeline as dictation — cleanup, translation, the whole chain. The difference is that notes are saved in Vext instead of being pasted at your cursor.
Use notes for capturing ideas mid-task without switching apps, recording quick reminders, or saving context you'll need later.
Hands-free dictation
Standard dictation requires holding a key. Hands-free mode changes this — press once to start, press again to stop. No holding required.
This is useful for longer passages, when your hands are occupied, or when you're walking around talking through an idea. The key acts as a toggle instead of a push-to-talk button.
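The difference between the two behaviors can be pictured as a small state machine. The sketch below is a hypothetical model, not Vext's actual code: the `Mode` and `Recorder` names and the `key_down`/`key_up` methods are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"  # hold to record, release to stop
    HANDS_FREE = "hands_free"      # press to start, press again to stop


class Recorder:
    """Hypothetical model of how a hotkey could drive recording."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def key_down(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True
        else:
            # Hands-free: each press flips the recording state.
            self.recording = not self.recording

    def key_up(self) -> None:
        # Only push-to-talk reacts to the key being released.
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False


hands_free = Recorder(Mode.HANDS_FREE)
hands_free.key_down()
hands_free.key_up()
still_recording = hands_free.recording  # stays on after release

hands_free.key_down()
hands_free.key_up()
stopped = not hands_free.recording  # second press stops it
```

In push-to-talk mode the same press-and-release sequence would leave the recorder idle, because releasing the key ends the recording.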
Enhance
Enhance is AI-powered post-processing that runs on your transcription before it reaches your clipboard. It cleans up filler words, fixes sentence structure, and smooths out the rough edges of spoken language — without changing what you said.
Before Enhance:
"So basically what I was thinking is that uh we should probably um move the API endpoint to like a separate service because it's getting kind of slow"
After Enhance:
"We should move the API endpoint to a separate service because it's getting slow."
The meaning is preserved. The tone is preserved. Enhance just removes the noise.
The raw transcription is always saved alongside the enhanced version. You never lose the original.
Live translation
Set a target language in Vext and speak in any language. The text that appears at your cursor is already translated.
When Enhance is also enabled, cleanup and translation happen in a single pass. You speak messy French, clean English appears at your cursor.
Vext supports translation between any pair of the 99+ languages that Whisper models understand.
Screenshot capture
During a meeting recording, you can capture any area of your screen. Drag to select a region, and the screenshot is automatically attached to your transcript.
This is useful for grabbing slides during a presentation, capturing a code snippet someone is showing, or saving a design that was discussed. Multiple captures per recording session, all saved alongside the transcript.
Audio ducking
When you start recording, Vext automatically fades your system audio so your voice comes through clearly. Release the hotkey and the volume fades back in.
This prevents your computer audio from interfering with the transcription — whether you're listening to music, watching a video, or on a call.
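A fade like this can be thought of as a short ramp of volume levels rather than an abrupt cut. The sketch below is only an illustration of that idea: the function name `fade_levels`, the ducked level of 0.2, and the five-step ramp are all assumed values, not Vext's real settings.

```python
def fade_levels(start: float, end: float, steps: int) -> list[float]:
    """Evenly spaced volume levels from `start` to `end`, inclusive."""
    if steps < 2:
        return [end]
    step = (end - start) / (steps - 1)
    return [round(start + i * step, 3) for i in range(steps)]


# Assumed values: full volume 1.0, ducked volume 0.2, five steps.
duck = fade_levels(1.0, 0.2, 5)     # fade down when recording starts
restore = fade_levels(0.2, 1.0, 5)  # fade back up when the hotkey is released
```

Applying the levels one after another over a fraction of a second gives a smooth transition in both directions.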
YOLO Mode
Turn on YOLO Mode and Vext automatically presses Return after pasting your transcription. Speak, release, and your prompt is already submitted.
This is designed for AI coding tools like Claude Code, ChatGPT, and Cursor. Instead of dictating a prompt, reviewing it, editing it, and pressing Enter — you just talk and it goes. LLMs handle imperfect language better than most people expect.
Transcription engines
Vext ships with multiple speech-to-text engines:
| Engine | Type | Speed |
|---|---|---|
| Parakeet | Local | 150x realtime |
| Apple Dictation | Local | 25x realtime |
| OpenAI-compatible | API | Varies |
Parakeet is the default. It runs entirely on your Apple Silicon GPU and transcribes at 150x realtime — a 60-second recording is processed in under half a second.
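A realtime factor translates directly into processing time: divide the length of the recording by the factor. The sketch below applies that to the two local engines in the table; the function name `processing_seconds` is illustrative, and real timings will vary with hardware and audio.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated time to transcribe audio at a given realtime factor."""
    return audio_seconds / realtime_factor


RECORDING_SECONDS = 60

parakeet = processing_seconds(RECORDING_SECONDS, 150)        # 0.4 s
apple_dictation = processing_seconds(RECORDING_SECONDS, 25)  # 2.4 s

print(f"Parakeet:        {parakeet:.1f} s")
print(f"Apple Dictation: {apple_dictation:.1f} s")
```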
AI processing engines
Enhance, translation, and summarization are powered by local LLMs by default. An OpenAI-compatible API is available as an alternative:
| Model | Type | Size |
|---|---|---|
| Gemma 3 4B | Local (default) | 2.8 GB |
| Qwen 3 4B | Local | 3.2 GB |
| LLaMA 3.2 3B | Local | 2.4 GB |
| Gemma 3 1B | Local | 0.8 GB |
| Phi-3.5 Mini | Local | 2.8 GB |
| OpenAI-compatible | API | — |
All local models run on your Mac's GPU. No internet connection required.
Privacy
Your voice never leaves your Mac. There's no cloud processing, no account, no telemetry, no analytics. Audio is processed on-device and never stored after transcription.
If you use an API-based engine (OpenAI-compatible), your audio is sent to that provider — but this is opt-in and disabled by default.
Pricing
Vext includes a free trial: 100 dictations, 50 notes, and 10 meeting recordings. No credit card, no account.
When you're ready, unlock unlimited use for $49, a one-time payment made from within the app. Updates within your major version are free. Major new versions are available at 50% off for existing owners.
Getting started
- Install via `brew install muvon/tap/vext` or download from getvext.app
- Launch the app and hold your hotkey
- Start talking
The shift from typing to voice feels awkward for about 30 minutes. After that, typing starts to feel like the slow way.