This March, several podcast publishers are participating in “Try pod,” a campaign to encourage people to give podcasts a try. Unlike most people using the #trypod hashtag, I’m not going to use the event as an excuse to hawk my own podcast (I don’t have one). Instead, I’m publishing a large dataset of podcasts and episodes on data.world in the hopes that it will inspire some cool analytics or tools.

In this post, I’ll demonstrate some of the ETL logic used to create the dataset and run some queries on the dataset using Google BigQuery.

ETL

The first step in creating this dataset was to build a large list of podcast feeds to scrape. Since the de facto podcast provider, iTunes, doesn’t expose a public API for browsing its directory, I turned to remnants of the early days of podcasting: feed directories.

The build_lists directory contains scripts that scrape some of the largest podcast directories I could find that were still operating, including All Podcasts, Podcastpedia, and Godcast1000. Fearing those directories were out of date, I also wrote a script to scrape the top podcasts in each category from iTunes.

Shows extract

Once the list of feeds was compiled, I wanted to extract and transform show objects. The transform script iterates over the tags in each feed’s channel element and builds a show object from them.
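At its core, that loop looks something like the sketch below. This is a simplified illustration rather than the actual transform script, and the field names are just examples, not the dataset’s exact schema:

import xml.etree.ElementTree as ET

import requests

def transform_show(feed_url):
    """Fetch one feed and build a show dict from its <channel> children (simplified sketch)."""
    # use .content (bytes) so feeds with an XML encoding declaration parse cleanly
    xml_bytes = requests.get(feed_url, timeout=10).content
    channel = ET.fromstring(xml_bytes).find('channel')

    obj = {'feed_url': feed_url}
    for child in channel:
        # drop any namespace prefix, e.g. '{.../podcast-1.0.dtd}author' -> 'author'
        tag = child.tag.split('}')[-1].lower()

        if tag == 'title':
            obj['title'] = (child.text or '').strip()
        elif tag == 'language':
            obj['language'] = (child.text or '').strip()
        elif tag == 'category':
            # iTunes-style category tags keep the value in a 'text' attribute
            obj['category'] = child.get('text') or (child.text or '').strip()
        # ... many more tags handled the same way

    return obj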

Since the list of feeds I extracted basically spanned the entire history of podcasting, there were a lot of feeds that didn’t follow today’s best practices or even use valid XML. Trying to accommodate those broken feeds led to a pretty messy transform step, with lots of cases like the following:

# handle different date formats
elif tag == 'lastbuilddate':
	try:
		if re.search('[+-][0-9]+$', child.text.strip()):
			# chop the trailing timezone offset (e.g. ' +0000') before parsing
			dt = datetime.datetime.strptime(child.text.strip()[0:-5].strip(), '%a, %d %b %Y %H:%M:%S')
		else:
			dt = datetime.datetime.strptime(child.text.strip(), '%a, %d %b %Y %H:%M:%S')
	except (ValueError, AttributeError):
		dt = datetime.datetime.now()

	obj['last_build_date'] = dt.strftime('%Y-%m-%d %H:%M:%S')
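In hindsight, a more forgiving approach is to lean on the standard library’s RFC 2822 date parser rather than hand-rolled format strings. Something like the following sketch (not what the published scripts actually do) handles timezone offsets and missing values more gracefully:

import datetime
from email.utils import parsedate_to_datetime

def parse_feed_date(text):
    """Parse an RSS pubDate/lastBuildDate string, falling back to 'now' (sketch only)."""
    try:
        # handles 'Mon, 27 Nov 2016 08:00:00 +0000' and similar RFC 2822 dates
        return parsedate_to_datetime(text.strip())
    except (AttributeError, TypeError, ValueError):
        return datetime.datetime.now()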

Episodes extract

Using the deduplicated table of transformed shows, I then extracted episodes from each feed. As with the shows transform, the logic is full of try/except blocks handling missing or malformed elements.

# some feeds use seconds, some use [HH:]MM:SS
elif tag == 'duration': 
	if child.text and ':' in child.text:
		lengths = child.text.split(':')[::-1]
		duration = 0

		# lengths is reversed: index 0 is seconds, 1 is minutes, 2 is hours
		for i, part in enumerate(lengths):
			try:
				duration += (60 ** i) * int(float(part))
			except (ValueError, TypeError):
				pass
	else:
		try:
			duration = int(child.text)
		except (ValueError, TypeError):
			duration = None

	obj['length'] = duration
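As a sanity check on that logic: a duration tag of 1:02:30 splits (reversed) into ['30', '02', '1'], which works out to 30*1 + 2*60 + 1*3600 = 3,750 seconds.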

Analysis

To run analytics on the dataset, I loaded the CSV files into Google BigQuery. One of the advantages of BigQuery is that it makes aggregations over a single column, even across millions of rows, absurdly fast. Some of the aggregations below took 10 times as long to run on my local MySQL database.
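The load itself is a one-time step. As a rough illustration (assuming the google-cloud-bigquery Python client; the bucket, project, and table names here are hypothetical), it looks something like this:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# hypothetical bucket and table names, for illustration only
load_job = client.load_table_from_uri(
    'gs://my-bucket/podcasts/shows*.csv',
    'my-project.podcasts.shows',
    job_config=job_config,
)
load_job.result()  # block until the load finishes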

Note: the episodes_flat table referenced below is a materialized join of shows and episodes. In BigQuery, querying a logical view that performs a join tends to cost more than writing the join out to a physical table once, since every query against the view re-scans the underlying tables to redo the join.
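For reference, materializing that join amounts to running the join query with a physical destination table. Here is a rough sketch using the Python client; the source column names and project id are assumptions based on the queries below, not the dataset’s exact schema:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  e.id        AS episodes_id,
  e.show_id   AS show_id,
  e.pub_date  AS episodes_pub_date,
  e.explicit  AS episodes_explicit,
  e.audio_mime_type AS episodes_audio_mime_type,
  s.category  AS show_category
FROM podcasts.episodes AS e
JOIN podcasts.shows AS s ON e.show_id = s.id
"""

job_config = bigquery.QueryJobConfig(
    destination='my-project.podcasts.episodes_flat',              # hypothetical project id
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,   # replace the table on re-run
)
client.query(sql, job_config=job_config).result()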

Completeness

The first question I want to answer with this dataset is how much of the podcast universe I managed to scrape and ingest. According to a 2015 Myndset article, there are somewhere between 100,000 and 200,000 podcasts. My extract found and ingested 32,832 shows, so I captured roughly 16-33% of that estimate.

As a check against that number, I wanted to see how many of the shows I subscribe to are included in the dataset. I used Exofudge’s Pocketcasts API lib to get a list of the 94 shows I subscribe to on Pocketcasts and compared it to the shows dataset:

SELECT in_dataset, COUNT(*)
FROM (
  SELECT 
    subs.title, case when shows.id is null then false else true end as in_dataset
  FROM podcasts.pocketcasts_subs subs
    LEFT JOIN podcasts.shows shows ON subs.title = shows.title
  GROUP BY 1,2
)
GROUP BY 1;
in_dataset  count
true        62
false       32

So the dataset contains 66% (62 of 94) of the shows I subscribe to. That number is likely skewed upward by the fact that I mainly listen to top-100 shows on iTunes, which one of the extract scripts specifically targeted. In any case, I’m pleased with the coverage.

Basic aggregations

How many episodes are marked explicit?

SELECT 
  episodes_explicit as explicit,
  COUNT(*) as episode_count
FROM podcasts.episodes_flat
GROUP BY 1 ORDER BY 2;
explicit  episode_count
true      127,784
false     1,135,940

What’s the average time between episodes?

SELECT 
  avg((episodes_pub_date - prev_pub_date)/(1000000 * 60 * 60)) as diff_in_hours,
  avg((episodes_pub_date - prev_pub_date)/(1000000 * 60 * 60 * 24)) as diff_in_days
FROM (
  SELECT 
    episodes_id,
    show_id,
    episodes_pub_date,
    LAG(episodes_pub_date, 1) OVER (PARTITION BY show_id ORDER BY episodes_pub_date) as prev_pub_date
  FROM podcasts.episodes_flat
);
diff_in_hours  diff_in_days
257.65         10.73

On average, how many episodes are in a feed?

SELECT AVG(episode_count) as avg, MIN(episode_count) as min, MAX(episode_count) as max
FROM (
  SELECT show_id, COUNT(*) as episode_count
  FROM podcasts.episodes_flat
  GROUP BY 1
);
avg    min  max
82.95  1    24,350

The feed with 24,350 episodes is the TSN.ca podcast, which has published an average of 13 episodes a day since 2011. They keep their entire episode history, going back to 2011, in the feed, which weighs in at a whopping 31 MB.

It’s not a data post without some charts

What are the most popular audio encodings?

SELECT episodes_audio_mime_type, COUNT(*)
FROM podcasts.episodes_flat
GROUP BY 1 HAVING COUNT(*) > 1
ORDER BY 2 DESC;

Episode category over time

SELECT DATE(episodes_pub_date), categories.category, count(*) 
FROM (
  SELECT category, COUNT(*) 
  FROM podcasts.shows 
  WHERE category is not null
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 15) categories
INNER JOIN podcasts.episodes_flat ep ON categories.category=ep.show_category
WHERE episodes_pub_date > timestamp('2006-11-27') and episodes_pub_date < timestamp('2016-11-27')
GROUP BY 1, 2
ORDER BY 1 DESC, 2;

Further work

There is a lot of unstructured text in this dataset; using a tool like Elasticsearch to mine the episode descriptions could surface some interesting insights. This would also be a great dataset to join with popularity data, but as far as I know there is no good public source for that.
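As a starting point, indexing episode descriptions takes only a few lines with the elasticsearch Python client (version 8.x assumed; the index name and fields below are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# index one episode description (fields are illustrative, not the dataset's schema)
es.index(index='episodes', document={
    'show_title': 'Some Show',
    'episode_title': 'Episode 1',
    'description': 'In this episode we talk about the history of podcasting...',
})

# then full-text queries over descriptions become easy
resp = es.search(index='episodes', query={'match': {'description': 'history'}})
print(resp['hits']['total']['value'], 'matching episodes')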

This dataset could also provide a good starting point for a show-recommendation engine based on natural language processing of the descriptions and analysis of the category tags.

Making something cool with this data? Let me know!