Voice to Text for Zoom and Google Meet on Mac — Transcribing Calls Without a Bot

Meeting bots are everywhere now — Otter, Fireflies, Granola, Read, dozens more. They join the call as a participant, record everything, and ship you a transcript. They also show up as "Otter Bot is recording", which is awkward, sometimes against company policy, and increasingly something attendees actively push back on.

The alternative is transcribing the call from your end — your Mac records the audio it's already playing and the audio from your mic, transcribes it locally, and produces the transcript without any guest in the meeting. This guide is about how to do that for Zoom and Google Meet specifically on Mac.

Why people are moving away from bots

Three reasons come up repeatedly:

Awkwardness. A visible bot in a sales call, a job interview, or a sensitive internal conversation changes the tone in a way that a transcription tool running quietly on your own Mac does not. Some clients refuse it outright. Some companies prohibit it by policy.

Privacy and data residency. Bots route audio through third-party servers. If the call involves customer data, internal strategy, IP discussions, or anything regulated, your legal team probably has opinions. Local transcription means audio never leaves the Mac that's already in the call.

Reliability. Bots get kicked out by some meeting hosts. They fail to join when meeting auth is tightened. They sometimes drop mid-call. A local recorder doesn't have these failure modes — if you can hear the audio, the recorder can capture it.

The downside of going botless: you lose the centralized features bots tend to ship with (shared libraries, team-wide search, automatic CRM sync). For solo work and small teams, this rarely matters. For larger orgs with established Otter/Fireflies workflows, the trade-off is real.

How "transcribe without a bot" actually works on Mac

Three capture options to consider:

Your microphone — your own voice
System audio — everything coming out of your speakers, including the other call participants
Both simultaneously — what you actually want for meeting transcription

Capturing your microphone alone is easy. Capturing system audio is the hard part: for privacy reasons, macOS does not hand system audio to apps by default, so an app needs either a virtual audio device or explicit user permission to capture it.

The standard way around this is a virtual audio device such as Loopback or BlackHole, usually paired with an Aggregate or Multi-Output Device created in Audio MIDI Setup. You route system audio into the virtual device, and the dictation or transcription app uses that device as its input. This works but is fiddly.

Some Mac dictation apps handle this automatically — they bundle the system audio capture and present it as a single "record this meeting" button. That's the experience most people actually want.

Zoom-specific notes

Zoom has its own built-in cloud recording, which produces a video file and a transcript. It works fine and is included with paid Zoom plans. The catch:

The transcript is generated server-side after the call — not real-time, not local
Only available to the host or assigned recorder
Transcript quality is okay, not great
The transcript is stored in Zoom's cloud; Zoom's local recording saves the video to your Mac but does not generate a transcript

If you're the host on a paid plan and don't mind the transcript living on Zoom's servers, this is the lowest-friction option. If any of those constraints bite, you need something else.

Google Meet-specific notes

Google Meet has built-in transcription (paid Workspace plans only) that produces a Google Doc with the transcript after the call. The trade-offs are the same as Zoom's: server-side, post-call, stored in Google's cloud, and usually only the host can enable it.

If you're not on a paid Workspace plan, you don't have native transcription in Meet at all. You're either using a bot or capturing from your end.

Local Mac options for both Zoom and Meet

Apps that capture mic + system audio on Mac and produce a transcript:

Vext — $49 once. Meeting mode captures both audio streams simultaneously, transcribes with Whisper, adds speaker labels via local diarization, and generates an AI summary at the end. Works with Zoom, Meet, FaceTime, Teams — anything that produces audio. Audio stays on your Mac. The summary and transcript are stored in the app.

MacWhisper — Pro version (€64) records and transcribes. Less integrated than Vext for meetings (no built-in speaker labels in some configurations), but solid for file-based transcription if you record with another tool.

Audio Hijack + a transcription pass — Audio Hijack ($64) records system audio cleanly. Pipe the resulting file into MacWhisper, OpenAI's Whisper, or any other transcription tool. More setup, more flexibility.
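For the Audio Hijack route, the transcription pass can be scripted. The sketch below is one way to do it, assuming the open-source openai-whisper package (which installs a `whisper` command) and ffmpeg are available. The file name, output folder, and model size are placeholders chosen for illustration, not defaults of Audio Hijack or any other tool.

```python
import shlex
from pathlib import Path


def whisper_command(audio_path: str, out_dir: str = "transcripts",
                    model: str = "small", language: str = "en") -> list[str]:
    """Build the argument list for the open-source `whisper` CLI.

    Assumes `pip install openai-whisper` and ffmpeg are already set up.
    """
    return [
        "whisper", str(Path(audio_path)),
        "--model", model,            # larger models are slower but more accurate
        "--language", language,      # skips automatic language detection
        "--output_format", "txt",    # plain-text transcript
        "--output_dir", out_dir,
    ]


# Hypothetical recording saved by Audio Hijack.
cmd = whisper_command("standup-recording.m4a")
print(shlex.join(cmd))

# To actually transcribe, run the command:
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.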

Granola — different model. Records from your Mac, but sends audio to its cloud for processing. Polished UX, fast summaries, but not local. Worth mentioning because people ask about it; it's not in the "no-cloud" bucket if that's the requirement.

Apple's built-in Voice Memos — records mic only. Won't get the other participants. Useful for recording your half of the conversation if that's what you want.

The split is between "fully local" (Vext, MacWhisper, Audio Hijack workflow) and "polished cloud" (Granola, Otter, Fireflies). Both have valid use cases.

Setting up Vext for Zoom or Meet

The flow we built it for:

Install Vext: brew install muvon/tap/vext
Open Vext, switch to Meeting mode in the menu bar
Start your Zoom or Meet call as usual
In Vext, click Start Recording — it captures your mic + system audio
Talk through the meeting
Stop the recording when the call ends
Vext transcribes locally (Whisper), produces speaker labels, and generates a summary

No bot joins the call. No participant other than you sees anything. The transcript and summary are stored in Vext on your Mac.

A few practical notes:

The first time you record, macOS will prompt for permission to capture system audio. Grant it. (This uses macOS's audio capture API, not a virtual audio device — no Loopback or BlackHole required.)
Speaker labels work best when participants take clear turns. Overlapping speech is hard for diarization; you'll get the words but the labels may get fuzzy.
The summary uses a local LLM (Gemma 3 4B by default). Quality is decent for typical meetings — action items, key decisions, topic outline. Not as polished as GPT-4 doing the same job, but private and free of API costs.
Screenshots during the meeting: you can drag-select any screen region while recording, and the screenshot gets attached to the transcript at the right timestamp. Useful for slides, code shown on a colleague's screen, design reviews.

What you give up going botless

To be honest about it:

Shared transcripts. Otter and Fireflies make sharing a transcript with the team trivial. With a local tool, you export to TXT/Markdown and paste it into Slack or upload to your shared drive. Friction is small but real.

Automatic CRM sync. Fireflies and Granola write transcript summaries straight into Salesforce, HubSpot, etc. Local tools don't have these integrations. You can build them with Zapier and the export files, but it's a project.

Team search. Otter's team plan has a searchable shared library. Local tools store transcripts on your Mac — not in a team-wide index.

Real-time captions for accessibility. Bots produce live captions during the call. Local tools transcribe after. If a participant needs live captions for accessibility, use Zoom or Meet's built-in live captions, or pair with a separate captioning tool.

For solo workflows, none of these usually matter. For team workflows, weigh them.

What you get

Privacy. Recording, transcription, and summarization all run locally, so audio doesn't leave your Mac.

No subscription. $49 once vs. $20+/month for the bot services.
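To make the comparison concrete, here is a small break-even sketch. It uses the $49 one-time price and treats $20/month as the subscription fee; since the figure above is "$20+", real subscription costs would only be higher. The function names are illustrative.

```python
import math


def breakeven_months(one_time_price: float, monthly_fee: float) -> int:
    """First month in which cumulative subscription fees reach the one-time price."""
    return math.ceil(one_time_price / monthly_fee)


def subscription_cost(monthly_fee: float, months: int) -> float:
    """Total subscription spend after the given number of months."""
    return monthly_fee * months


print(breakeven_months(49, 20))    # 3
print(subscription_cost(20, 12))   # 240
```

At these prices the one-time purchase is paid off in the third month, and a year of the subscription costs $240.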

Reliability. No bot to get kicked out, no API rate limits, no service outages affecting your transcripts.

Trust signal. Some clients and partners actively prefer that no bot was in the call. Particularly true in legal, healthcare, finance, and competitive negotiations.

Cleaner files. No "Otter Bot has joined the meeting" timestamps. Just the conversation.

A decision tree

Paid Zoom/Workspace, host of most calls, fine with server-side processing: Use built-in transcription. Save the money.
Lots of calls, team-wide sharing matters, fine with cloud: Otter, Fireflies, Granola — pick one.
Calls involve sensitive content, prefer no bot, want simple setup: Vext or MacWhisper Pro.
Power user, want maximum flexibility: Audio Hijack + Whisper.
You only need your half of the call: Apple Voice Memos, free.
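The same decision tree can be written as a short function. This is only a sketch: the field names, the order in which the conditions are checked, and the fallback when nothing matches are assumptions made for illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    only_own_audio: bool = False         # you only need your half of the call
    wants_max_flexibility: bool = False  # power user, happy to wire tools together
    sensitive_or_no_bot: bool = False    # sensitive content, prefer no bot
    team_sharing_matters: bool = False   # lots of calls, team-wide sharing
    hosts_on_paid_plan: bool = False     # paid Zoom/Workspace, host of most calls
    ok_with_cloud: bool = False          # fine with server-side processing


def recommend(n: Needs) -> str:
    """Return the option from the decision tree that matches the stated needs."""
    if n.only_own_audio:
        return "Apple Voice Memos"
    if n.wants_max_flexibility:
        return "Audio Hijack + Whisper"
    if n.sensitive_or_no_bot:
        return "Vext or MacWhisper Pro"
    if n.ok_with_cloud and n.team_sharing_matters:
        return "Otter, Fireflies, or Granola"
    if n.ok_with_cloud and n.hosts_on_paid_plan:
        return "Built-in Zoom/Meet transcription"
    # Assumed fallback: without consent to cloud processing, stay local.
    return "Vext or MacWhisper Pro"


print(recommend(Needs(hosts_on_paid_plan=True, ok_with_cloud=True)))
print(recommend(Needs(sensitive_or_no_bot=True)))
```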

What this looks like in practice

A typical week for someone who switched from a bot to local meeting transcription:

6–10 calls a week, mix of internal + external
Vext records each one; transcripts auto-generated
Skim the summary, copy action items into whatever task tracker
Search a specific transcript later for "what did we decide about pricing"
Total time spent post-call: 2 minutes per meeting
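If transcripts are exported as plain text (the TXT export mentioned earlier), the "what did we decide about pricing" search can also be done outside the app. A minimal sketch, assuming a hypothetical `exports` folder of `.txt` files; the folder name and the snippet format are illustrative choices.

```python
from pathlib import Path


def search_transcripts(folder: str, phrase: str, context: int = 1):
    """Yield (file name, line number, snippet) for each case-insensitive match."""
    needle = phrase.lower()
    for path in sorted(Path(folder).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if needle in line.lower():
                # Include neighbouring lines so the hit is readable on its own.
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                yield path.name, i + 1, " / ".join(lines[start:end])


if __name__ == "__main__":
    # "exports" is a placeholder for wherever exported transcripts are kept.
    for name, lineno, snippet in search_transcripts("exports", "pricing"):
        print(f"{name}:{lineno}: {snippet}")
```

The `context` parameter controls how many lines before and after each match are included in the snippet.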

The bot version of this week was: invite the bot, hope it joined, get an email with the transcript, click through to Otter, copy action items. Roughly the same total time. The differences are who saw the bot in the call, where the audio went, and whether the team's data residency policy was happy.

For most solo and small-team use, the local option is now usually the better choice. For larger orgs the math gets more complicated, and either choice is defensible.