The best thing about a CSV file is that it is a simple text format; the worst thing about a CSV file is that it is a simple text format. So goes the love-hate relationship I often find myself in while working with external datasets. "To CSV, or not to CSV", that is the question.
If you haven't seen it yet, r/place had a recent revival: a social experiment where millions of users cooperated, or competed, to carve out pixels on a shared artistic canvas. There are a lot of interesting stories here, but the one I'm focusing on is the public dataset they released: 2022_place_canvas_history.csv.gzip. It contains the timestamp, user id, color, and location of every single pixel placement. Downloader beware: the compressed CSV file is 12GB, and the uncompressed CSV file is 22GB.
While this CSV is a good jumping-off point for reading the data once, its raw form is less than ideal for repeated analyses. Enter Parquet: a binary columnar format with types and built-in compression support.
Translating the data from CSV to Parquet is relatively easy within a Deephaven session:
from deephaven import csv, parquet

def csv_to_parquet(csv_path, parquet_path):
    csv_table = csv.read(csv_path)
    parquet.write(csv_table, parquet_path)

csv_to_parquet(
    '2022_place_canvas_history.csv',
    '2022_place_canvas_history.parquet')
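As a quick sanity check, you can read the Parquet file back and confirm the row count matches the source CSV. This is a minimal sketch, assuming both files are in the working directory:

from deephaven import csv, parquet

# Row counts should match after a lossless conversion
csv_table = csv.read('2022_place_canvas_history.csv')
parquet_table = parquet.read('2022_place_canvas_history.parquet')
print(csv_table.size, parquet_table.size)  # expect the two counts to be equal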
But we can do even better by cleaning up some of the data first:
- Using numeric ids instead of large strings for user_id
- Translating the hex pixel_color to an int rgb column
- Breaking out coordinate into x1, y1, x2, and y2 columns [1]
- Sorting the data based on timestamp
from deephaven import csv, parquet

# 1. Read the place canvas history CSV
place_raw = csv.read('2022_place_canvas_history.csv')

# 2. Clean up the data
# a. use an increasing numeric id for user_id instead of a large string
user_ids = place_raw.select_distinct("user_id").update_view("K=k")

# b. translate the hex pixel_color to an int rgb column
# c. break out coordinate into x1, y1, x2, and y2 columns
# d. sort by timestamp (the raw data is not sorted)
place = place_raw\
    .exact_join(user_ids, "user_id", "K")\
    .update_view([
        "rgb=Integer.parseInt(pixel_color.substring(1), 16)",
        "coordinates=coordinate.split(`,`)",
        "x1=Short.parseShort(coordinates[0])",
        "y1=Short.parseShort(coordinates[1])",
        "x2=coordinates.length == 2 ? NULL_SHORT : Short.parseShort(coordinates[2])",
        "y2=coordinates.length == 2 ? NULL_SHORT : Short.parseShort(coordinates[3])"])\
    .view(["timestamp", "user_id=(int)K", "rgb", "x1", "y1", "x2", "y2"])\
    .sort("timestamp")

# 3. Write to a Parquet file with ZSTD compression and share with others!
parquet.write(place, '2022_place_deephaven.parquet', compression_codec_name='ZSTD')
The full code can be found at place_csv_to_parquet.py.
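Curious how much the cleanup and ZSTD compression actually save? One quick way to check, assuming both files are in the working directory, is to compare the on-disk sizes:

import os

# Compare on-disk sizes of the raw CSV and the cleaned Parquet file
for path in ('2022_place_canvas_history.csv', '2022_place_deephaven.parquet'):
    size_gb = os.path.getsize(path) / 1e9
    print(f'{path}: {size_gb:.1f} GB')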
The result is a 1.5GB Parquet file ready for further data analysis: 2022_place_deephaven.parquet. Feel free to download this dataset as an alternative to the CSV. If you are in a Deephaven session, you can read the Parquet file with:
from deephaven import parquet
place = parquet.read("2022_place_deephaven.parquet")
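Outside of a Deephaven session, any Parquet-aware tool can read the file as well. For example, with pandas, assuming a Parquet engine such as pyarrow is installed:

import pandas as pd

# Read the same Parquet file into a pandas DataFrame
# (requires a Parquet engine such as pyarrow)
place_df = pd.read_parquet("2022_place_deephaven.parquet")
print(place_df.head())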
We've also put this data on Kaggle. Let us know if there's anything you'd like to see!
[1] The x2 and y2 columns are only present on r/place moderation edits where a rectangle was placed over questionable content.