
RSS meta-data discovery and Podcast exploration

Jake Mulford
Customizable code to ingest and aggregate podcast title data from RSS feeds into tables

RSS feeds bring together a large amount of frequently updating data from multiple sources. People often use them to scan headlines and article snippets to figure out what's actually worth their time. Podcasts typically publish their metadata, like episode titles, in RSS streams. With over 850,000 podcasts active in 2021, this provides us with a massive source of real-time data - the kind that Deephaven excels at handling.

Clearly, sifting through that metadata manually would be a Sisyphean chore. In this blog post, we demonstrate how to build a system that aggregates podcast episode titles into a single source. Placing the data into Deephaven tables makes working with information on this scale manageable. Last month, we gave you a DIY program to ingest Reddit posts and perform simple sentiment analysis. Similarly, you can modify this program in various ways, such as finding episodes that feature your favorite athlete or influencer. In fact, you could hook up any RSS feed with data that interests you.

Pulling data from RSS at scale

It turns out podcasts often use RSS feeds to publish information. The Podcast Index claims to have records of over 4.3 million podcasts, and nearly all of these have an associated RSS feed. This has the potential to be an incredibly fruitful resource. Let's pull in some of this data and see what's out there.

We'll use our code from analyzing Reddit RSS feeds as a starting point. Given the sheer number of RSS feeds, we need to adapt our program to avoid performance problems. Our first step is to scale out effectively.

Threading and Deephaven

At the end of the day, RSS feeds are just URLs that follow a standard format. This means that RSS readers actually perform HTTP requests on the backend. Anyone who's worked with HTTP requests knows how slow they can be, and how much performance can improve when they run in parallel threads. We can apply the same idea to our RSS reader.
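To make the benefit concrete, here is a minimal sketch in plain Python, independent of Deephaven. The `slow_fetch` function is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders. The point is only that a thread pool overlaps the time spent waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(url):
    # Hypothetical stand-in for an HTTP request: sleep to simulate network latency
    time.sleep(0.2)
    return f"response from {url}"

# Placeholder URLs, for illustration only
urls = [f"https://example.com/feed/{i}" for i in range(8)]

# Sequential: each request waits for the previous one to finish
start = time.perf_counter()
sequential_results = [slow_fetch(url) for url in urls]
sequential_seconds = time.perf_counter() - start

# Threaded: the requests wait concurrently, so the total is close to one request's latency
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded_results = list(pool.map(slow_fetch, urls))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Because `pool.map` returns results in input order, the threaded version produces the same list as the sequential one, just sooner.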

Deephaven's tables work very well in multi-threaded environments. Specifically for our situation, we want a table that can be written to from multiple threads. One way to accomplish this is to create a table using Deephaven's DynamicTableWriter for each thread, and use the merge function in the main thread to combine the tables. The resulting table will continue to update as the tables in the threads update. This query shows a simple example:

from deephaven import DynamicTableWriter
import deephaven.dtypes as dht
from deephaven.TableTools import merge

import threading
import time

NUMBER_OF_TABLES = 3

def write_to_table(writer):
    writer.logRow("A", 1)
    time.sleep(3)
    writer.logRow("B", 2)
    time.sleep(3)
    writer.logRow("C", 3)

column_names = [
    "Letter",
    "Number"
]

column_types = [
    dht.string,
    dht.int32
]

tables = []
threads = []
for i in range(NUMBER_OF_TABLES):
    writer = DynamicTableWriter(column_names, column_types)
    thread = threading.Thread(target=write_to_table, args=[writer])
    thread.start()

    tables.append(writer.getTable())
    threads.append(thread)

result = merge(tables)

# Create global variables to display in Deephaven UI
for i in range(len(tables)):
    globals()[f"table_{i}"] = tables[i]

# Wait until the threads stop executing
while True:
    thread_is_alive = False
    for thread in threads:
        if thread.is_alive():
            thread_is_alive = True

    if thread_is_alive:
        time.sleep(1)
    else:
        break

We can apply this to our RSS reader to pull podcast information. Given a list of RSS feeds where each feed points to a podcast of our choice, we can read from them in a threaded environment and write their data to Deephaven.

For this example, we read from four arbitrary podcast RSS feeds using two threads and write the title of each podcast episode to our table:

import os
os.system("pip install feedparser")
from deephaven import DynamicTableWriter
import deephaven.dtypes as dht
from deephaven.TableTools import merge
import threading
import time
import feedparser

NUMBER_OF_RSS_TABLES = 2

def read_rss_feeds(feed_urls, table_writer):
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            title = entry["title"]
            table_writer.logRow([title])

rss_feed_urls = [
    [
        "http://feeds.soundcloud.com/users/soundcloud:users:151205561/sounds.rss",
        "https://nocturniarecords.podomatic.com/rss2.xml",
    ],
    [
        "http://feeds.soundcloud.com/users/soundcloud:users:142613909/sounds.rss",
        "http://feeds.soundcloud.com/users/soundcloud:users:155565658/sounds.rss",
    ]
]

column_names = ["EpisodeTitle"]
column_types = [dht.string]

rss_tables = []
rss_threads = []
for i in range(NUMBER_OF_RSS_TABLES):
    writer = DynamicTableWriter(column_names, column_types)
    thread = threading.Thread(target=read_rss_feeds, args=[rss_feed_urls[i], writer])
    thread.start()

    rss_tables.append(writer.getTable())
    rss_threads.append(thread)

rss_feeds = merge(rss_tables)

# Create global variables to display in Deephaven UI
for i in range(len(rss_tables)):
    globals()[f"rss_table_{i}"] = rss_tables[i]

# Wait until the threads stop executing
while True:
    thread_is_alive = False
    for thread in rss_threads:
        if thread.is_alive():
            thread_is_alive = True

    if thread_is_alive:
        time.sleep(1)
    else:
        break

Now we have a single table containing information from our various podcasts. This example only pulls the episode title, but there are many other attributes from the RSS feed that can be used as well.
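As a rough sketch of pulling more than the title, the helper below reads a few fields from a feed entry. It assumes entries behave like dictionaries (feedparser entries support `.get`), and that `link`, `published`, and `summary` are present, which individual feeds do not guarantee. `extract_episode_fields` is a hypothetical helper name, not part of feedparser or Deephaven.

```python
def extract_episode_fields(entry):
    # Use .get with defaults because a feed may omit any of these fields
    return [
        entry.get("title", ""),
        entry.get("link", ""),
        entry.get("published", ""),
        entry.get("summary", ""),
    ]

# A plain dict standing in for a parsed feed entry (illustrative data)
sample_entry = {
    "title": "Episode 1",
    "link": "https://example.com/episodes/1",
    "published": "Mon, 01 Nov 2021 12:00:00 GMT",
}

row = extract_episode_fields(sample_entry)
print(row)
```

To store these values, the table would need four string columns instead of one, with matching entries in `column_names` and `column_types`.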

Continually updating data

As we said, threads are a great way to improve performance in certain applications where processes can run in parallel and asynchronously. In these cases, Deephaven is a powerful tool.

The Deephaven Podcast Aggregation sample app shows an extreme example of using Deephaven in a threaded environment. Not only does this application scale out to millions of podcast RSS feeds, but it also contains pulling logic to continually read from these RSS feeds, allowing updates to the RSS feeds to come in real-time. This project pulls all of the metadata from each podcast it reads from. You can use this data to figure out information like the most recently published podcasts, which podcast episodes contain certain keywords, and which podcasts produce the most episodes in a given period of time. Let us know what you come up with on Slack or in our GitHub Discussions.
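For a sense of what continually reading a feed involves, here is a minimal sketch of one pass that reports only the entries it has not seen before. This is not the sample app's actual code: `poll_once`, `fetch_entries`, and the use of an `id` key are assumptions made for illustration, and the feed is simulated with a list rather than fetched over the network.

```python
def poll_once(fetch_entries, seen_ids):
    """Return entries not seen in earlier passes and record their IDs."""
    new_entries = []
    for entry in fetch_entries():
        # Fall back to the title when a feed does not supply an ID
        entry_id = entry.get("id") or entry.get("title")
        if entry_id not in seen_ids:
            seen_ids.add(entry_id)
            new_entries.append(entry)
    return new_entries

# Simulated feed that gains an episode between passes (illustrative data)
feed_state = [{"id": "ep-1", "title": "Episode 1"}]

def fake_fetch():
    # Hypothetical stand-in for fetching and parsing a real feed
    return list(feed_state)

seen = set()
first_pass = poll_once(fake_fetch, seen)
feed_state.append({"id": "ep-2", "title": "Episode 2"})
second_pass = poll_once(fake_fetch, seen)

print([e["title"] for e in first_pass], [e["title"] for e in second_pass])
```

In a long-running reader, a loop would call a function like `poll_once` on a timer and write each new entry to a table writer, so the table only grows when a feed actually publishes something new.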