Back in 2016 and 2017 I recorded a podcast called Syscast: interviews with people I admired in the Linux, open source and infrastructure world. Matt Holt about Caddy, Daniel Stenberg about curl, Seth Vargo about Vault, and a handful more. Ten episodes, roughly ten hours of audio, and then life got in the way and I put it on pause.
The one thing those episodes never had was transcripts. I always wanted them. Audio is nice, but you can’t search it, you can’t skim it, and Google can’t read it. The problem was that in 2016, transcribing ten hours of two-person interviews yourself just wasn’t realistic. Decent speech-to-text was a paid cloud service, and telling two speakers apart was basically a research project.
It’s 2026 now, so I did it in an evening, on my own laptop, with open-source models and no API bill. Here’s how.
The stack
Two open-source pieces do the work:
- WhisperX wraps OpenAI’s Whisper large-v3 model for the actual speech-to-text, with word-level timestamps.
- pyannote.audio handles the speaker diarization.
“Diarization” was a new word to me when I started this. It’s the step that splits a recording up by speaker: this stretch is one voice, this stretch is another, without knowing who either of them is yet. Whisper writes down what gets said; pyannote works out who said it. Put the two together and a two-person interview reads as a real back-and-forth instead of one long undivided block.
Both run locally. The audio never leaves the machine and there’s no per-minute cost. The only thing you need from the outside world is a free Hugging Face account.
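To make that division of labour concrete, here is a toy sketch of the "who said it" step: given timestamped words from a transcriber and timestamped turns from a diarizer, each word goes to the speaker whose turn overlaps it the most. The data shapes and the `label_words` helper are made up for illustration; this is not how WhisperX or pyannote are implemented internally, just the idea in miniature.

```python
# Hypothetical data shapes, times in seconds:
#   words: (start, end, text) from the speech-to-text step
#   turns: (start, end, speaker) from the diarization step


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Tag each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for start, end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        # A word that overlaps no turn at all keeps speaker None.
        labelled.append((best_speaker, text))
    return labelled


words = [(0.0, 0.4, "Welcome"), (0.5, 0.9, "everyone"), (14.1, 14.4, "Hey")]
turns = [(0.0, 13.8, "SPEAKER_00"), (14.0, 20.0, "SPEAKER_01")]
print(label_words(words, turns))
```

Note that the labels are anonymous at this point: the diarizer only knows that two different voices exist, not whose they are.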
Gated models on Hugging Face
The diarization models are gated on Hugging Face. No idea why. Are they dangerous and you need to sign a waiver? Who knows. 🤷♂️
All I had to do was create a (free) account and a read token, then click “agree” on the model pages before I could download them. I hadn’t done that at first, and WhisperX greeted me with this:
Could not download 'pyannote/speaker-diarization-3.1' pipeline.
It might be because the pipeline is private or gated...
That reads like a network or auth bug, but it just means the licence wasn’t accepted yet. If you try this, accept the conditions on pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 first, drop your token in ~/.hf_token, and it works.
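For the token file, a small loader along these lines would do. This is a sketch, assuming the token sits alone in `~/.hf_token` as plain text; the `read_hf_token` name is hypothetical, and the point is mainly to fail with a clearer message than the download error above.

```python
from pathlib import Path


def read_hf_token(path=None):
    """Return the Hugging Face token stored in a plain-text file.

    Defaults to ~/.hf_token (the location used in this post).
    """
    path = Path(path) if path is not None else Path.home() / ".hf_token"
    if not path.is_file():
        raise FileNotFoundError(
            f"No token file at {path}. Create a read token on Hugging Face first."
        )
    token = path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{path} exists but is empty.")
    return token
```

A valid token still isn't enough on its own: the conditions on both model pages have to be accepted with the same account.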
The pipeline
Setup is a virtualenv and one install (I used uv):
uv venv .venv-whisper
uv pip install --python .venv-whisper whisperx
The core is about a dozen lines of WhisperX. Load the model on CPU with int8 quantization (Apple Silicon has no CUDA, and there is no usable GPU path for this stack), transcribe, align for accurate word-level timestamps, then diarize and assign each word to a speaker:
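import sys
from pathlib import Path
import whisperx
mp3_path = sys.argv[1]  # episode to transcribe, passed in on the command line
hf_token = (Path.home() / ".hf_token").read_text().strip()  # Hugging Face read token
audio = whisperx.load_audio(mp3_path)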
# 1. transcribe with Whisper large-v3
model = whisperx.load_model("large-v3", "cpu", compute_type="int8", language="en")
result = model.transcribe(audio, batch_size=1, language="en")
# 2. align, for accurate word-level timestamps
align_model, meta = whisperx.load_align_model(language_code="en", device="cpu")
result = whisperx.align(result["segments"], align_model, meta, audio, "cpu")
# 3. diarize (who spoke when), then tag each word with a speaker
diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cpu")
result = whisperx.assign_word_speakers(diarize(audio), result)
That gives me a list of segments, each with a speaker, start, end and text. Batching all ten episodes is just a loop over the mp3s, logging the wall-clock time per file (that’s where the benchmark below comes from):
for episode in static/podcast/episodes/*.mp3; do
  start=$(date +%s)
  .venv-whisper/bin/python scripts/transcribe-syscast.py "$episode"
  printf '%s\t%ss\n' "$(basename "$episode")" "$(( $(date +%s) - start ))"
done
Raw WhisperX output is choppy: lots of short segments, SPEAKER_00/SPEAKER_01 labels (just whoever talked first and second), no paragraphs:
[0:00] SPEAKER_00: Welcome to a new episode of Syscast. My name is Mattias Geniar and today I'm joined by Seth Vargo from HashiCorp.
[0:14] SPEAKER_01: Hey Mattias, I'm good. Doing well over here in Pittsburgh.
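Getting from the segment list to lines like these is a few lines of formatting. A minimal sketch, assuming each segment is a dict with `speaker`, `start` and `text` keys as described above; the `fmt_ts` and `render` helpers and the sample timestamps are made up for illustration, and the `.get()` fallback is a defensive assumption for segments that ended up without a speaker label.

```python
def fmt_ts(seconds):
    """Format a time in seconds as m:ss (minutes keep counting past 59)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def render(segments):
    """Render segments as '[m:ss] SPEAKER: text' lines."""
    return "\n".join(
        f"[{fmt_ts(seg['start'])}] {seg.get('speaker', 'UNKNOWN')}: {seg['text'].strip()}"
        for seg in segments
    )


# Illustrative segments, not real pipeline output.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.03, "end": 13.5, "text": " Welcome to a new episode of Syscast."},
    {"speaker": "SPEAKER_01", "start": 14.2, "end": 17.9, "text": " Hey Mattias, I'm good."},
]
print(render(segments))
```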
A small cleanup step merges consecutive segments from the same speaker into one turn, then splits long turns into paragraphs every few sentences:
turns = []
for seg in segments:
    if turns and turns[-1]["spk"] == seg["speaker"]:
        turns[-1]["text"] += " " + seg["text"].strip()  # same speaker, keep merging
    else:
        # speaker changed: start a new turn at this segment's timestamp
        turns.append({"spk": seg["speaker"], "start": int(seg["start"]), "text": seg["text"].strip()})
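The snippet above covers the merging half. The paragraph-splitting half could look something like the sketch below. The sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), and the default of four sentences per paragraph is an assumption standing in for "every few sentences"; `split_paragraphs` is a hypothetical helper name.

```python
import re


def split_paragraphs(text, sentences_per_paragraph=4):
    """Split one long turn into paragraphs of a few sentences each."""
    # Naive sentence boundary: whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]


print(split_paragraphs("One. Two! Three? Four. Five.", sentences_per_paragraph=2))
```

A splitter this simple will trip over abbreviations and version numbers ("v3.1 is out"), which is acceptable for readability breaks but worth knowing about.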
The last touch is mapping SPEAKER_00 to “Mattias” and SPEAKER_01 to the guest (whose name is right there in the episode title), and fixing the obvious mis-hearings. Whisper was very confident my name is “Matthias Genjar”. 😁
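That last pass is plain dictionary work. A sketch, assuming turns shaped like the merge step produces (`spk`, `start`, `text`); the `polish` helper and the example mappings are illustrative, and since the numbering only reflects who spoke first, the speaker map needs checking per episode.

```python
def polish(turns, speakers, corrections):
    """Swap anonymous speaker labels for names and fix known mis-hearings."""
    cleaned = []
    for turn in turns:
        text = turn["text"]
        for wrong, right in corrections.items():
            text = text.replace(wrong, right)
        # Unknown labels are left as-is rather than guessed.
        name = speakers.get(turn["spk"], turn["spk"])
        cleaned.append({**turn, "spk": name, "text": text})
    return cleaned


# Example values for one episode; adjust per episode.
speakers = {"SPEAKER_00": "Mattias", "SPEAKER_01": "Seth"}
corrections = {"Matthias Genjar": "Mattias Geniar"}

turns = [
    {"spk": "SPEAKER_00", "start": 0, "text": "My name is Matthias Genjar."},
    {"spk": "SPEAKER_01", "start": 14, "text": "Hey Mattias, I'm good."},
]
print(polish(turns, speakers, corrections))
```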
The result
That lands on each episode page as a readable, speaker-labelled conversation with clickable timestamps to jump straight into the audio. For example, Seth Vargo on Vault or Jan-Piet Mens on Linux vs BSD.
How long it actually took
Feasible doesn’t mean fast, though. large-v3 plus diarization on a CPU (Apple Silicon, no usable GPU path for this stack) is slow. Per episode, on my own machine:
| Episode | Audio | Transcribe time |
|---|---|---|
| Matt Holt, Caddy | 57m | 1h57m |
| Nils De Moor, Docker | 68m | 2h00m |
| Daniel Stenberg, curl | 62m | 1h48m |
| James Cammarata, Ansible | 53m | 1h35m |
| Scott Arciszewski, security | 65m | 1h56m |
| Config Management Camp recap | 19m | 33m |
| Jan Somers, CPU wars | 65m | 1h53m |
| Jan-Piet Mens, Linux vs BSD | 74m | 2h07m |
| Total | | 13h49m |
Call it roughly twice real-time, and about 14 hours of compute for the whole catalogue. I ran it overnight and the laptop got warm. If I wanted it done in minutes I’d have used a hosted API for a few dollars, but the point here was the opposite: can I do this myself, locally, for free? Yes.
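For anyone who wants to check the "roughly twice real-time" figure, here is the arithmetic over the rows listed in the table (eight episodes appear there, so the totals cover those eight). The `to_minutes` parser is a small helper written for this check.

```python
import re

# (episode, audio length, transcribe time) copied from the table above
rows = [
    ("Matt Holt, Caddy", "57m", "1h57m"),
    ("Nils De Moor, Docker", "68m", "2h00m"),
    ("Daniel Stenberg, curl", "62m", "1h48m"),
    ("James Cammarata, Ansible", "53m", "1h35m"),
    ("Scott Arciszewski, security", "65m", "1h56m"),
    ("Config Management Camp recap", "19m", "33m"),
    ("Jan Somers, CPU wars", "65m", "1h53m"),
    ("Jan-Piet Mens, Linux vs BSD", "74m", "2h07m"),
]


def to_minutes(duration):
    """Parse '1h57m' or '33m' into a number of minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+)m", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes)


audio_total = sum(to_minutes(audio) for _, audio, _ in rows)
compute_total = sum(to_minutes(spent) for _, _, spent in rows)

print(f"audio:   {audio_total} min")    # 463 min, i.e. 7h43m
print(f"compute: {compute_total} min")  # 829 min, i.e. 13h49m
print(f"ratio:   {compute_total / audio_total:.2f}x real-time")
```

That comes out at about 1.8 minutes of compute per minute of audio for the listed rows.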
Worth it?
For ten old episodes that weren’t getting much traffic, the compute time was never the point. Decade-old conversations with people like Matt and Daniel are now text: searchable, skimmable and indexable. The back catalogue gets a second life, and it cost me nothing but a warm laptop and a night of electricity.
The thing that sticks with me is the timeline. The exact task that was out of reach for one person in 2016 now runs, end to end, on the laptop in front of me. That keeps happening, and it’s worth noticing.