
Fetch Slack messages for data analysis, and save to Parquet files

4 min read
Jake Mulford

Since its founding in 2009, Slack has become a powerhouse for communication within professional environments. Chances are you work at a company that uses Slack to message co-workers. Companies also frequently use Slack's freemium offering to create public workspaces for communicating with customers. This has been a huge success because it provides a familiar medium of communication and doesn't require companies to allocate a budget for it.

However, the freemium product is not without its limitations: it stores a maximum of 10,000 messages. Once that limit is reached, the oldest messages start getting deleted, which becomes a problem when important messages are lost over time.

Using the Slack API, I was able to archive messages before the limit was reached, and preserve them for future analysis. Read on to learn how to set up this solution for yourself.

Using the Slack API

Fortunately for us, Slack provides an API that developers can use. The conversations.history method pulls messages from a channel, and combining it with the conversations.list method provides an easy way to preserve messages for an entire workspace: simply grab all the channels, then grab all the messages in each channel.

After installing the Slack Python SDK and setting up an application, you can run the following Python code to grab your messages.

from slack_sdk import WebClient

import os
import time

SLACK_API_TOKEN = os.environ.get("SLACK_API_TOKEN")

slack_client = WebClient(token=SLACK_API_TOKEN)

def get_public_channels():
    cursor = None
    channels = []
    while True:
        response = slack_client.conversations_list(cursor=cursor)

        for channel in response["channels"]:
            channels.append(channel["id"])

        # An empty next_cursor means there are no more pages of channels
        cursor = response["response_metadata"]["next_cursor"]
        if len(cursor) == 0:
            break
        else:
            print("Pagination found, getting next entries")
            time.sleep(3)

    return channels

def get_channel_messages(slack_channels):
    messages = []
    for slack_channel in slack_channels:
        cursor = None
        while True:
            channel_history = slack_client.conversations_history(channel=slack_channel, cursor=cursor)

            for message in channel_history["messages"]:
                if message["type"] == "message":
                    messages.append((slack_channel, message["text"]))

            # Keep paginating until has_more is false
            if bool(channel_history["has_more"]):
                cursor = channel_history["response_metadata"]["next_cursor"]
            else:
                cursor = None

            if cursor is None:
                break
            else:
                print("Pagination found, getting next entries")
                time.sleep(1.2)

    return messages

slack_channels = get_public_channels()
messages = get_channel_messages(slack_channels)

print(messages)

This looks good...but there's a small issue. If you have any threads in your channels (which is extremely common), you may notice that those thread messages are missing. Thankfully, the conversations.replies method lets you pull messages from threads. Let's redefine get_channel_messages to pull messages from these threads as well.

def get_thread_messages(slack_channel, ts):
    messages = []
    cursor = None

    while True:
        # conversations.replies returns the parent message along with its replies
        thread_replies = slack_client.conversations_replies(channel=slack_channel, ts=ts, cursor=cursor)

        for message in thread_replies["messages"]:
            if message["type"] == "message":
                messages.append(message["text"])

        if bool(thread_replies["has_more"]):
            cursor = thread_replies["response_metadata"]["next_cursor"]
        else:
            cursor = None

        if cursor is None:
            break
        else:
            print("Pagination found, getting next entries")
            time.sleep(1.2)

    return messages

def get_channel_messages(slack_channels):
    messages = []
    for slack_channel in slack_channels:
        cursor = None
        while True:
            channel_history = slack_client.conversations_history(channel=slack_channel, cursor=cursor)

            for message in channel_history["messages"]:
                if message["type"] == "message":
                    if "thread_ts" in message:
                        # Pull the full thread; the parent's text is included in the replies
                        for text in get_thread_messages(slack_channel, message["ts"]):
                            messages.append((slack_channel, text))
                    else:
                        messages.append((slack_channel, message["text"]))

            if bool(channel_history["has_more"]):
                cursor = channel_history["response_metadata"]["next_cursor"]
            else:
                cursor = None

            if cursor is None:
                break
            else:
                print("Pagination found, getting next entries")
                time.sleep(1.2)

    return messages
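
With the thread-aware get_channel_messages in place, re-running the same driver code as before collects the thread replies too:

# Re-run the collection; thread replies are now included alongside channel messages
slack_channels = get_public_channels()
messages = get_channel_messages(slack_channels)

print(messages)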

Using the messages within Deephaven

Now that we know how to pull our message information from Slack, let's put this data into Deephaven. Using the DynamicTableWriter, we can easily write our Slack data to Deephaven tables.

from deephaven import DynamicTableWriter
import deephaven.dtypes as dht

column_definitions = {
    "Channel": dht.string,
    "Message": dht.string
}

table_writer = DynamicTableWriter(column_definitions)

for (slack_channel, message) in messages:
    table_writer.write_row(slack_channel, message)

table = table_writer.table

We can now use all of Deephaven's table operations and tools on our Slack messages! If we want to write the data to disk, we can use Deephaven's Parquet write method to save the table.

from deephaven.parquet import write

write(table, "/data/slack_messages.parquet")
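
As a quick sanity check, you can read the file back into a table and aggregate it. The following is a minimal sketch, assuming the deephaven.parquet read method and the count_by table operation are available in your Deephaven version; the MessageCount column name is just an illustrative choice.

from deephaven.parquet import read

# Read the archived messages back from disk
archived = read("/data/slack_messages.parquet")

# Count how many messages were archived per channel
messages_per_channel = archived.count_by("MessageCount", by=["Channel"])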

Make it your own

As more and more data is generated by various data sources, it's important to know how to retrieve and store this data for future needs. This blog post shows just one of many examples of how you can work with data using Deephaven. The code in this project comes from Deephaven's social data collector, so feel free to check out that project and use it for your own needs. Tell us what other data sources you're working with by reaching out on - where else? - Slack.