Automatically Generating Podcast Transcripts

I’m finding it valuable to create annotations on resources to index into my personal knowledge management system.
(The Obsidian journaling post from late last year goes into some depth about my process.)
I use the Hypothesis service to do this; Hypothesis annotations are imported into Markdown files for Obsidian using the custom script and method I describe in that blog post.
This works well for web pages and PDF files…Hypothesis can attach annotations to those resource types.
Videos are relatively straightforward, too, using Dan Whaley’s DocDrop service; it reads the closed captioning and puts that on an HTML page that enables Hypothesis to do its work.
What I’m missing, though, are annotations on podcast episodes.
Podcast creators that take the time to make transcripts available are somewhat unusual.
Podcasts from NPR and NPR member stations are pretty good about this, but everyone else is slacking off.
My task management system has about a dozen podcast episodes where I’d like to annotate transcripts (and one podcast that seemingly stopped making transcripts just before the episode I wanted to annotate!).
So I wrote a little script that creates a good-enough transcript HTML page.
You can see a sample of what this looks like (from the Search and Ye Might Find episode of 99% Invisible).
Note!
Of course, 99% Invisible has now gone back and added transcripts to all of their episodes, including the one used in this example.
Thanks? … No really, thank you 99PI!
AWS Transcribe to the rescue
Amazon Web Services has a Transcribe service that takes audio, runs it through its machine learning algorithms, and can output a WebVTT file.
Podcasts are typically well-produced audio, so AWS Transcribe has a clean audio track to work with.
In my testing, AWS Transcribe does well with most sentences; it misses unusual proper names and its sentence detection mechanism is good-but-not-great.
It is certainly good enough to get the main ideas across to provide an anchor for annotations.
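To make the Transcribe step concrete, here is a minimal sketch of how a job requesting WebVTT output might be submitted from Python. It is not taken from the script described in this post: the job name, the S3 URI, and both helper function names are hypothetical, and it assumes the episode audio has already been uploaded to S3 and that the third-party boto3 package and AWS credentials are available.

```python
def build_transcription_request(job_name, media_uri, language_code="en-US"):
    """Build the keyword arguments for a Transcribe job that also emits WebVTT.

    job_name and media_uri are placeholders supplied by the caller; nothing
    here comes from the original script.
    """
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",  # assumption: the episode is an MP3 file
        "LanguageCode": language_code,
        # Ask Transcribe to write a .vtt subtitle file next to its JSON output.
        "Subtitles": {"Formats": ["vtt"]},
    }


def submit_transcription_job(request):
    """Send the request to AWS (needs boto3 and configured credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


if __name__ == "__main__":
    # Hypothetical job name and bucket; printing only, no AWS call is made.
    request = build_transcription_request(
        "podcast-episode-demo", "s3://example-bucket/episode.mp3"
    )
    print(request)
```

Keeping the request builder separate from the network call means the parameters can be inspected or logged before anything is sent to AWS.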
A WebVTT file (of a podcast advertisement) looks like this:
WEBVTT

1
00:00:00.190 --> 00:00:04.120
my quest to buy a more eco friendly deodorant quickly started to

2
00:00:04.120 --> 00:00:08.960
stink because sustainability and effectiveness don’t always go hand in hand.

3
00:00:09.010 --> 00:00:11.600
But then I discovered finch Finch is a

4
00:00:11.600 --> 00:00:14.830
free chrome extension that scores everyday products on
After a WEBVTT marker, there are groups of caption statements separated by blank lines.
Each statement is numbered, followed by a time interval, followed by the caption itself.
(WebVTT can be much more complicated than this…to include CSS-like text styling and other features; read the specs if you want more detail.)
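The simple structure described above (marker, then numbered statements with a time interval and caption text) can be read with a few lines of Python. This is a rough sketch under that assumption only, not the parser used in the script: the `Cue` class and `parse_simple_vtt` function are names invented for illustration, and styling, notes, and other WebVTT features are ignored.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int      # the statement number
    start_ms: int   # start of the interval, in milliseconds
    end_ms: int     # end of the interval, in milliseconds
    text: str       # caption text, joined onto one line


# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm"
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_simple_vtt(text):
    """Parse numbered cues from a WebVTT string of the simple form shown above."""
    blocks = re.split(r"\n\s*\n", text.strip())
    if not blocks or not blocks[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT marker")
    cues = []
    for block in blocks[1:]:
        lines = [line.strip() for line in block.splitlines()]
        # Expect: number, time interval, one or more caption lines.
        if len(lines) < 3 or not lines[0].isdigit():
            continue
        match = TIMING.match(lines[1])
        if match is None:
            continue
        parts = match.groups()
        cues.append(
            Cue(int(lines[0]), _to_ms(*parts[:4]), _to_ms(*parts[4:]), " ".join(lines[2:]))
        )
    return cues
```

Joining the `text` fields of consecutive cues gives the running transcript, and the millisecond offsets remain available if the HTML page should show or link to timestamps.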
What the script does
The code for this is up on GitHub now.
The links to the code below point to the version of software at the time this blog po…
