The Complete Guide to Vext: Voice to Text for Mac

Hold a hotkey. Talk. Text appears at your cursor. That's Vext — a voice-to-text app that runs entirely on your Mac. No cloud. No account. No subscription.

This guide covers everything: installation, hotkey configuration, the three modes (dictation, meetings, notes), Enhance, live translation, and every other feature.

Installation

Install via Homebrew:

brew install muvon/tap/vext

Or grab it directly from getvext.app. No account required — install and start using it immediately.

Requirements: macOS 14 Sonoma or later, Apple Silicon (M1–M4).

Your first dictation

Launch Vext from Applications
Hold your hotkey
Speak
Release — text appears at your cursor

Four steps. No login. The text goes wherever your cursor was when you started talking.

Three modes

Vext has three modes for different workflows.

Dictation

The core experience. Hold a hotkey, speak, release — text appears at your cursor. Works in any text field, any app: browsers, editors, terminals, chat, email, notes.

Dictation is the fastest way to get words into a computer. You speak at 130–150 words per minute. You type at 40–60. For a 100-word message, dictation takes about 40 to 46 seconds. Typing takes 100 to 150 seconds, around two minutes.
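The comparison above is simple rate arithmetic. Here is a minimal sketch that reproduces it from the speaking and typing rates quoted in this section; the function name is only illustrative.

```python
def seconds_for(words: int, words_per_minute: float) -> float:
    """Seconds needed to produce `words` at a steady rate."""
    return words * 60 / words_per_minute


message_words = 100

# Fastest and slowest ends of each range quoted in the text.
speak_fast, speak_slow = seconds_for(message_words, 150), seconds_for(message_words, 130)
type_fast, type_slow = seconds_for(message_words, 60), seconds_for(message_words, 40)

print(f"Speaking: {speak_fast:.0f}-{speak_slow:.0f} s")  # Speaking: 40-46 s
print(f"Typing:   {type_fast:.0f}-{type_slow:.0f} s")    # Typing:   100-150 s
```

Even the slowest speaker in that range finishes in less than half the time of the fastest typist.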

Meetings

Record meetings with speaker identification. Vext captures your microphone and system audio simultaneously, so it works with Zoom, Google Meet, FaceTime, and any other video call.

When the meeting ends, you get:

A full transcript with speaker labels and timestamps
An AI-generated summary with key points and action items
Any screenshots you captured during the call

Notes

Quick voice memos captured with a single key press. Speak your thought, and Vext transcribes it, runs it through Enhance, and stores it locally.

Notes go through the same processing pipeline as dictation — cleanup, translation, the whole chain. The difference is that notes are saved in Vext instead of being pasted at your cursor.
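One way to picture "same pipeline, different destination" is the sketch below. It is not Vext's actual code: `strip_fillers`, `run_pipeline`, and `Destination` are hypothetical names, and the filler filter is a toy stand-in for Enhance.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A processing step takes text and returns text.
Step = Callable[[str], str]


def strip_fillers(text: str) -> str:
    """Toy stand-in for Enhance: drops standalone 'uh' and 'um'."""
    fillers = {"uh", "um"}
    return " ".join(word for word in text.split() if word.lower() not in fillers)


def run_pipeline(raw: str, steps: List[Step]) -> str:
    """Apply each step in order to the raw transcription."""
    text = raw
    for step in steps:
        text = step(text)
    return text


@dataclass
class Destination:
    """Hypothetical sink that records what was delivered to it."""
    name: str
    received: List[str] = field(default_factory=list)

    def deliver(self, text: str) -> None:
        self.received.append(text)


steps = [strip_fillers]          # the same chain serves both modes
cursor = Destination("cursor")   # dictation: pasted at the cursor
notes = Destination("notes")     # notes: saved inside the app

raw = "uh we should um ship it"
cursor.deliver(run_pipeline(raw, steps))
notes.deliver(run_pipeline(raw, steps))

print(cursor.received)  # ['we should ship it']
print(notes.received)   # ['we should ship it']
```

Only the last step differs between the two modes; everything before delivery is shared.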

Use notes for capturing ideas mid-task without switching apps, recording quick reminders, or saving context you'll need later.

Hands-free dictation

Standard dictation requires holding a key. Hands-free mode changes this — press once to start, press again to stop. No holding required.

This is useful for longer passages, when your hands are occupied, or when you're walking around talking through an idea. The key acts as a toggle instead of a push-to-talk button.
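The difference between the two behaviors comes down to how key events map to the recording state. This is a small illustrative sketch, with `Mode` and `Recorder` as made-up names rather than anything from Vext itself.

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"  # record only while the key is held
    HANDS_FREE = "hands_free"      # each press flips recording on or off


class Recorder:
    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def key_down(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True
        else:
            self.recording = not self.recording

    def key_up(self) -> None:
        # Releasing the key only matters in push-to-talk mode.
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False


hands_free = Recorder(Mode.HANDS_FREE)
hands_free.key_down()
hands_free.key_up()
print(hands_free.recording)  # True: still recording after release
hands_free.key_down()
print(hands_free.recording)  # False: second press stops it
```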

Enhance

Enhance is AI-powered post-processing that runs on your transcription before it reaches your clipboard. It cleans up filler words, fixes sentence structure, and smooths out the rough edges of spoken language — without changing what you said.

Before Enhance:

"So basically what I was thinking is that uh we should probably um move the API endpoint to like a separate service because it's getting kind of slow"

After Enhance:

"We should move the API endpoint to a separate service because it's getting slow."

The meaning is preserved. The tone is preserved. Enhance just removes the noise.

The raw transcription is always saved alongside the enhanced version. You never lose the original.

Live translation

Set a target language in Vext and speak in any language. The text that appears at your cursor is already translated.

When Enhance is also enabled, cleanup and translation happen in a single pass. Speak messy French, and clean English appears at your cursor.

Vext supports translation between any pair of the 99+ languages that Whisper models understand.

Screenshot capture

During a meeting recording, you can capture any area of your screen. Drag to select a region, and the screenshot is automatically attached to your transcript.

This is useful for grabbing slides during a presentation, capturing a code snippet someone is showing, or saving a design that was discussed. Multiple captures per recording session, all saved alongside the transcript.

Audio ducking

When you start recording, Vext automatically fades your system audio so your voice comes through clearly. Release the hotkey and the volume fades back in.

This prevents your computer audio from interfering with the transcription — whether you're listening to music, watching a video, or on a call.
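A fade like this can be modeled as a short ramp of volume levels. The sketch below assumes a linear ramp, a ducked level of 0.2, and five steps; all three are illustrative choices, not documented Vext settings.

```python
from typing import List


def fade_levels(start: float, end: float, steps: int) -> List[float]:
    """Evenly spaced volume levels from start to end, both included."""
    if steps < 2:
        raise ValueError("a fade needs at least two steps")
    return [round(start + (end - start) * i / (steps - 1), 3) for i in range(steps)]


# Assumed values: full volume is 1.0, ducked volume is 0.2.
duck = fade_levels(1.0, 0.2, 5)     # applied when recording starts
restore = fade_levels(0.2, 1.0, 5)  # applied when recording stops

print(duck)     # [1.0, 0.8, 0.6, 0.4, 0.2]
print(restore)  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

Stepping through the levels over a fraction of a second avoids the abrupt jump you would hear from switching the volume in one go.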

YOLO Mode

Turn on YOLO Mode and Vext automatically presses Return after pasting your transcription. Speak, release, and your prompt is already submitted.

This is designed for AI coding tools like Claude Code, ChatGPT, and Cursor. Instead of dictating a prompt, reviewing it, editing it, and pressing Return, you just talk and it goes. LLMs handle imperfect language better than most people expect.

Transcription engines

Vext ships with multiple speech-to-text engines:

Engine	Type	Speed
Parakeet	Local	150x realtime
Apple Dictation	Local	25x realtime
OpenAI-compatible	API	Varies

Parakeet is the default. It runs entirely on your Apple Silicon GPU and transcribes at 150x realtime — a 60-second recording is processed in under half a second.
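The "Nx realtime" figures translate into processing time by simple division. Here is a quick sketch using the speeds from the table above; the function name is illustrative, and real timings will vary with hardware and audio.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated time to transcribe audio at a given multiple of realtime."""
    return audio_seconds / realtime_factor


print(processing_seconds(60, 150))    # 0.4  (one minute at 150x)
print(processing_seconds(60, 25))     # 2.4  (one minute at 25x)
print(processing_seconds(3600, 150))  # 24.0 (one hour at 150x)
```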

AI processing engines

Enhance, translation, and summarization are powered by LLMs that run locally by default:

Model	Type	Size
Gemma 3 4B	Local (default)	2.8 GB
Qwen 3 4B	Local	3.2 GB
LLaMA 3.2 3B	Local	2.4 GB
Gemma 3 1B	Local	0.8 GB
Phi-3.5 Mini	Local	2.8 GB
OpenAI-compatible	API	—

All local models run on your Mac's GPU. No internet connection required.

Privacy

Your voice never leaves your Mac. There's no cloud processing, no account, no telemetry, no analytics. Audio is processed on-device and never stored after transcription.

If you use an API-based engine (OpenAI-compatible), your audio is sent to that provider — but this is opt-in and disabled by default.

Pricing

Vext includes a free trial: 100 dictations, 50 notes, and 10 meeting recordings. No credit card, no account.

When you're ready, unlock unlimited use for $49, a one-time payment made from within the app. Updates within your major version are free. Major new versions are available at 50% off for existing owners.

Getting started

Install via brew install muvon/tap/vext or download from getvext.app
Launch the app and hold your hotkey
Start talking

The shift from typing to voice feels awkward for about 30 minutes. After that, typing starts to feel like the slow way.