Reddit RSS feeds and Python AI are a powerful combination. With them, you can track the sentiment trends of the topics you care about.
Deephaven bridges the gap. You can not only explore the history of subreddits but also track the streaming conversation in real time. In this article, we provide a program that ingests RSS feeds into Deephaven tables and performs simple sentiment analysis with easy-to-customize Python queries. We intend the code to be DIY: since RSS feeds are standardized, these methods can be applied to any RSS feed, such as those that source Wikipedia, Hacker News, CNN, and podcasts.
Read on for a look at WallStreetBets and what people are saying about meme stocks.
Pull data from RSS
How can we pull data from an RSS feed into Deephaven? The Python feedparser package is widely used and well-supported. It provides a method, feedparser.parse(), that can pull data from an RSS feed. We can leverage this to pull data into Deephaven. In this example, we will be pulling information from the WallStreetBets subreddit.
import os
os.system("pip install feedparser")
import feedparser
feed = feedparser.parse("https://www.reddit.com/r/wallstreetbets/new/.rss")
This gives us a snapshot of the current RSS file. To view all of the entries of the RSS feed, we can look at the feed.entries variable.
for entry in feed.entries:
    print(entry)
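Printing whole entries is noisy, so it can help to pull out only a few fields. The sketch below assumes each entry supports dictionary-style access with a get method and exposes title, link, and published keys; a given feed may omit any of them. The helper summarize_entry is hypothetical and not part of feedparser.

```python
def summarize_entry(entry):
    """Return a small dict of commonly used fields from one feed entry.

    Works on any mapping with a .get method; missing keys become None.
    """
    return {
        "title": entry.get("title"),
        "link": entry.get("link"),
        "published": entry.get("published"),
    }


# A stand-in entry for illustration; real entries come from feed.entries.
sample_entry = {"title": "Example post", "link": "https://example.com/post"}
print(summarize_entry(sample_entry))
```

With a parsed feed, the same helper could be applied to each item in feed.entries.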
Create a real-time table
Now we can see the parsed data from the RSS feed. To store this data in Deephaven, we can use the DynamicTableWriter to create a table and write to it.
We can choose any attributes from the RSS feed to write. For this example, we will only write the title of each entry.
from deephaven import DynamicTableWriter
import deephaven.dtypes as dht
column_names = ["Title"]
column_types = [dht.string]
table_writer = DynamicTableWriter(column_names, column_types)
for entry in feed.entries:
    title = entry["title"]
    table_writer.logRow([title])
rss_table_titles = table_writer.getTable()
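The loop above writes one snapshot of the feed. To keep the table growing, the feed has to be parsed repeatedly, and each poll will return many entries that were already written. A minimal sketch of one way to filter those out is shown below. It assumes each entry carries an id key (with link or title as fallbacks); the helper unseen_entries and the sample data are hypothetical.

```python
def unseen_entries(entries, seen_ids):
    """Yield entries not seen in earlier polls, recording each new id."""
    for entry in entries:
        # Assumption: "id" identifies a post; fall back to link, then title.
        entry_id = entry.get("id") or entry.get("link") or entry.get("title")
        if entry_id is None or entry_id in seen_ids:
            continue
        seen_ids.add(entry_id)
        yield entry


# Two made-up polls that overlap on entry "b".
seen = set()
first_poll = [{"id": "a", "title": "First"}, {"id": "b", "title": "Second"}]
second_poll = [{"id": "b", "title": "Second"}, {"id": "c", "title": "Third"}]

new_titles = [
    entry["title"]
    for poll in (first_poll, second_poll)
    for entry in unseen_entries(poll, seen)
]
print(new_titles)
```

In a Deephaven session, each poll would come from a fresh call to feedparser.parse, and each unseen title would be passed to the table writer instead of collected in a list.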
Analyze the data
With our Deephaven table containing the titles from the RSS feed, we can work with the data to glean meaningful insight. Sentiment analysis is one of the easiest ways to analyze text data.
Performing sentiment analysis on Reddit feeds allows you to analyze what certain groups of people are discussing. For example, you could analyze the world news subreddits to see how people feel about current global events. You can get more specific, too. The subreddits on video games can be analyzed to view current sentiment on the gaming industry. This can be especially helpful if there are live events, such as E3, and you want to grab real-time information about what's happening.
The Python NLTK package has several methods for performing sentiment analysis. For this example, we will be using the built-in SentimentIntensityAnalyzer class. This class has a polarity_scores method that returns a sentiment score for the given string. We will use a Deephaven update query to apply this method to our table.
Since the polarity_scores method returns a Python dictionary, we will wrap it with a custom function that returns a list. This will make it easier to use these values in Deephaven queries. We also need to cast the method call to the org.jpy.PyListWrapper type. This will make Deephaven recognize the returned list.
os.system("pip install nltk")
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
def classifier_method(classifier):
    # Wrap polarity_scores so it returns a list instead of a dictionary.
    def method(strn):
        sentiment = classifier.polarity_scores(strn)
        return [sentiment["pos"], sentiment["neu"], sentiment["neg"], sentiment["compound"]]
    return method
custom_polarity_scores = classifier_method(SentimentIntensityAnalyzer())
rss_table_analyzed = rss_table_titles.update(formulas=[
    "Sentiment = (org.jpy.PyListWrapper)custom_polarity_scores(Title)",
    "Positive = (double)Sentiment[0]",
    "Neutral = (double)Sentiment[1]",
    "Negative = (double)Sentiment[2]",
    "Compound = (double)Sentiment[3]",
])
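The compound score is a single value between -1 and 1, and it is often easier to read as a label. The sketch below uses the thresholds suggested by the VADER authors (0.05 and -0.05); the function name sentiment_label is hypothetical, and the cutoff is a parameter so it can be tuned.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, for illustration only.
for score in (0.6, -0.3, 0.0):
    print(score, sentiment_label(score))
```

A function like this could be applied to the Compound column in the same way custom_polarity_scores is applied to Title.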
Now we have a column containing our sentiment analysis. The next step is to generate statistics. The following query computes the average, median, and standard deviation of each value, and the percent of posts that had a larger positive score than negative score.
from deephaven import agg

agg_list = [
    agg.avg(cols=["Avg_Positive = Positive", "Avg_Negative = Negative", "Avg_Neutral = Neutral", "Avg_Compound = Compound"]),
    agg.median(cols=["Med_Positive = Positive", "Med_Negative = Negative", "Med_Neutral = Neutral", "Med_Compound = Compound"]),
    agg.std(cols=["Std_Positive = Positive", "Std_Negative = Negative", "Std_Neutral = Neutral", "Std_Compound = Compound"]),
]

rss_table_analyzed_statistics = rss_table_analyzed.agg_by(agg_list)

# Flag each post whose positive score exceeds its negative score, then
# sum the flags and count the rows to get the share of positive posts.
rss_table_analyzed_positive_percent = rss_table_analyzed.update(formulas=["PositiveCount = Positive > Negative ? 1 : 0"])\
    .agg_by([agg.sum_(cols=["PositiveCount"]), agg.count_("TotalCount")])\
    .update(formulas=["PositivePercent = PositiveCount / TotalCount"])
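To make clear what these aggregations compute, here is the same arithmetic in plain Python using the standard statistics module. The scores are made up for illustration, and stdev here is the sample standard deviation; whether Deephaven's aggregation uses the sample or population form is worth checking in its documentation.

```python
from statistics import mean, median, stdev

# Made-up (positive, negative) score pairs, for illustration only.
scores = [(0.5, 0.0), (0.0, 0.4), (0.2, 0.1), (0.0, 0.0)]

positives = [pos for pos, _ in scores]
summary = {
    "avg_positive": mean(positives),
    "med_positive": median(positives),
    "std_positive": stdev(positives),  # sample standard deviation
}

# Share of posts whose positive score is larger than their negative score.
positive_count = sum(1 for pos, neg in scores if pos > neg)
positive_percent = positive_count / len(scores)

print(summary)
print(positive_percent)
```

Note that, as in the query above, the result is a fraction between 0 and 1; multiply by 100 for a percentage.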
Tailor to your own interests
This example just scratches the surface of what you can do with the wealth of information available in RSS feeds. The Deephaven examples repo contains a more complex example. We've provided this code and the extended example as a starting point for working with Deephaven and the RSS feeds that interest you. Reach out on Gitter or in our GitHub Discussions with any questions or feedback. We'd love to hear from you.