This March, several podcast publishers are participating in “Try pod,” a campaign to encourage people to give podcasts a try. Unlike most people using the #trypod hashtag, I’m not going to use the event as an excuse to hawk my own podcast (I don’t have one). Instead, I am publishing a large dataset of podcasts and episodes on data.world in the hopes that it will inspire some cool analytics or tools.
In this post, I’ll demonstrate some of the ETL logic used to create the dataset and run some queries on the dataset using Google BigQuery.
The first step in creating this dataset was to build a large list of podcast feeds to scrape. Since the de facto podcast provider, iTunes, doesn’t expose a public API for enumerating feeds, I turned to remnants of the early days of podcasting: feed directories.
The build_lists directory contains scripts that scrape some of the largest podcast directories I found that were still operating, including All Podcasts, Podcastpedia, and Godcast1000. Fearing those directories were out of date, I also wrote a script to scrape the top podcasts in each category from iTunes.
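In rough outline, each directory scraper just walks a listing page’s anchors and keeps anything that looks like a feed URL. Here is a minimal sketch of that idea using only the standard library; the markup heuristics are illustrative, and each real directory needs its own selectors:

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect href values that look like podcast feed URLs."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            # crude heuristic: directory listings usually link feeds
            # with .xml/.rss extensions or a /feed path segment
            if href.endswith(('.xml', '.rss')) or '/feed' in href:
                self.feeds.append(href)

def extract_feed_urls(page_html):
    """Return the feed-looking hrefs found in a directory page."""
    parser = FeedLinkParser()
    parser.feed(page_html)
    return parser.feeds
```

A real scraper would fetch pages with `requests` and paginate, but the extraction step reduces to something like this.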
Once the list of feeds was compiled, I wanted to extract and transform show objects. The transform script iterates over each tag element in the feed and creates a show object with it.
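In simplified form (the real transform handles many more tags), the iteration looks something like the following; `SAMPLE_FEED` and the tag whitelist are illustrative only:

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Show</title>
  <link>https://example.com</link>
  <description>A sample feed</description>
</channel></rss>"""

def transform_show(feed_xml):
    """Walk each child of <channel> and map known tags onto a show dict."""
    channel = ET.fromstring(feed_xml).find('channel')
    obj = {}
    for child in channel:
        tag = child.tag.lower()
        if tag in ('title', 'link', 'description'):
            obj[tag] = (child.text or '').strip()
    return obj
```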
Since the list of feeds I extracted basically spanned the entire history of podcasting, there were a lot of feeds that didn’t follow today’s best practices or even use valid XML. Trying to accommodate those broken feeds led to a pretty messy transform step with lots of cases like the following:
```python
# handle different date formats
elif tag == 'lastbuilddate':
    try:
        if re.search('[+-][0-9]+$', child.text.strip()):
            dt = datetime.datetime.strptime(child.text.strip()[0:-5].strip(),
                                            '%a, %d %b %Y %H:%M:%S')
        else:
            dt = datetime.datetime.strptime(child.text.strip(),
                                            '%a, %d %b %Y %H:%M:%S')
    except (ValueError, AttributeError):
        dt = datetime.datetime.now()
    obj['last_build_date'] = dt.strftime('%Y-%m-%d %H:%M:%S')
```
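Pulled out of the elif chain into a standalone helper (the name `parse_build_date` is mine, for illustration), that date handling amounts to:

```python
import datetime
import re

def parse_build_date(raw):
    """Parse an RFC 822-style date, tolerating a trailing UTC offset.

    Falls back to the current time on malformed input, mirroring the
    ETL's behavior.
    """
    try:
        text = raw.strip()
        if re.search(r'[+-][0-9]+$', text):
            text = text[:-5].strip()  # drop a trailing offset like ' +0000'
        return datetime.datetime.strptime(text, '%a, %d %b %Y %H:%M:%S')
    except (ValueError, AttributeError):
        return datetime.datetime.now()
```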
Using the unique table of transformed shows, I then extracted episodes from the feeds. As with the shows transform step, the logic is full of try/catches looking for missing or malformed elements.
```python
# some feeds use seconds, some use [HH:]MM:SS
elif tag == 'duration':
    if child.text and ':' in child.text:
        # reverse so index 0 is seconds, 1 is minutes, 2 is hours
        lengths = child.text.split(':')[::-1]
        duration = 0
        for i in range(len(lengths)):
            try:
                duration += (60 ** i) * int(float(lengths[i]))
            except (ValueError, TypeError):
                pass
    else:
        try:
            duration = int(child.text)
        except (ValueError, TypeError):
            duration = None
    obj['length'] = duration
```
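As a standalone, testable helper (again, `parse_duration` is my name for it), the same normalization reads:

```python
def parse_duration(raw):
    """Normalize an <itunes:duration> value to seconds.

    Accepts plain seconds ('4200') or [HH:]MM:SS ('1:10:00');
    returns None when the value can't be parsed.
    """
    if raw and ':' in raw:
        total = 0
        # reverse so index 0 is seconds, 1 is minutes, 2 is hours
        for i, part in enumerate(raw.split(':')[::-1]):
            try:
                total += (60 ** i) * int(float(part))
            except (ValueError, TypeError):
                pass
        return total
    try:
        return int(raw)
    except (ValueError, TypeError):
        return None
```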
To run analytics on the dataset, I loaded the CSV files into Google BigQuery. One of the advantages of using BigQuery is that it makes aggregations on a single column, even over millions of rows, absurdly fast. Some of the aggregations below took 10 times as long on my local MySQL database.
The episodes_flat table referenced below is a materialized join of shows and episodes. In BigQuery, querying a logical view that performs a join tends to be more expensive than materializing the join into a physical table once, since every query against the view re-scans both tables to perform the join.
The first question I want to answer with this dataset is: how much of the podcast universe did I manage to scrape and ingest? According to a 2015 Myndset article, there are somewhere between 100,000 and 200,000 podcasts. My extract found and ingested 32,832 shows, so I captured roughly 16-33% of that total.
As a check against that number, I wanted to see how many of the shows I subscribe to are included in the dataset. I used Exofudge’s Pocketcasts API lib to get a list of the 94 shows I subscribe to on Pocketcasts and compared it to the shows dataset:
```sql
SELECT in_dataset, COUNT(*)
FROM (
  SELECT subs.title,
         CASE WHEN shows.id IS NULL THEN false ELSE true END AS in_dataset
  FROM podcasts.pocketcasts_subs subs
  LEFT JOIN podcasts.shows shows
    ON subs.title = shows.title
  GROUP BY 1, 2
)
GROUP BY 1;
```
So the dataset contains 66% of the shows I subscribe to. That number is likely skewed higher by the fact that I listen mainly to top-100 shows on iTunes, which the iTunes top-charts extract was specifically intended to cover. In any case, I’m pleased with the coverage I achieved.
How many episodes are marked explicit?
```sql
SELECT episodes_explicit AS explicit, COUNT(*) AS episode_count
FROM podcasts.episodes_flat
GROUP BY 1
ORDER BY 2;
```
What’s the average time between episodes?
```sql
SELECT
  AVG((episodes_pub_date - prev_pub_date) / (1000000 * 60 * 60)) AS diff_in_hours,
  AVG((episodes_pub_date - prev_pub_date) / (1000000 * 60 * 60 * 24)) AS diff_in_days
FROM (
  SELECT episodes_id,
         show_id,
         episodes_pub_date,
         LAG(episodes_pub_date, 1) OVER (
           PARTITION BY show_id ORDER BY episodes_pub_date
         ) AS prev_pub_date
  FROM podcasts.episodes_flat
);
```
On average, how many episodes are in a feed?
```sql
SELECT AVG(episode_count) AS avg,
       MIN(episode_count) AS min,
       MAX(episode_count) AS max
FROM (
  SELECT show_id, COUNT(*) AS episode_count
  FROM podcasts.episodes_flat
  GROUP BY 1
);
```
The feed with 24,350 episodes is the TSN.ca podcast, which has published an average of 13 episodes a day since 2011. Their feed retains their entire episode history back to 2011 and weighs in at a whopping 31 MB.
It’s not a data post without some charts
What are the most popular audio encodings?
```sql
SELECT episodes_audio_mime_type, COUNT(*)
FROM podcasts.episodes_flat
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC;
```
Episode category over time
```sql
SELECT DATE(episodes_pub_date), categories.category, COUNT(*)
FROM (
  SELECT category, COUNT(*)
  FROM podcasts.shows
  WHERE category IS NOT NULL
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 15
) categories
INNER JOIN podcasts.episodes_flat ep
  ON categories.category = ep.show_category
WHERE episodes_pub_date > TIMESTAMP('2006-11-27')
  AND episodes_pub_date < TIMESTAMP('2016-11-27')
GROUP BY 1, 2
ORDER BY 1 DESC, 2;
```
There is a lot of unstructured text in this dataset; using a tool like Elasticsearch to mine the episode descriptions could yield some interesting insights. This would also be a great dataset to join with popularity data, but as far as I know there is no good public source for it.
This dataset could also provide a good starting point for a show recommendation engine based on NLP processing of descriptions and tag analysis.
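As a toy illustration of that idea (not anything the dataset ships with), even a plain bag-of-words TF-IDF similarity over show descriptions gets you crude “shows like this one” rankings; the tokenization here is deliberately naive:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (term -> weight) for each description."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (count / len(tokens)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A real recommender would want stemming, stop-word removal, and the show tags as additional features, but the core ranking step is just nearest neighbors under a similarity like this.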
Making something cool with this data? Let me know!