
Batch process CSVs 60x faster than pandas with Parquet

Amanda Martin · 4 min read

DALL·E prompt: A synthwave panda losing a race on a racetrack made of spreadsheet data

I love the versatility of pandas as much as anyone. When it's slow, however, pandas frustrates me as much as anyone.

Below I work through a use case to speed things up. A lot.

Recently, Conor O'Sullivan wrote a great article, Batch Processing 22GB of Transaction Data with Pandas, which shows how to handle large datasets with limited computational resources. His dataset is a single 22GB CSV file. Using pandas, O'Sullivan's aggregations took about 50 minutes each. On my computer, they took 90 minutes. Ouch.

Using the same (constrained) resources, I was able to process O'Sullivan's example in 50 seconds using Parquet files and Deephaven -- absolutely torching the performance of pandas + CSV. Nice!

For context, here are links to O'Sullivan's full Kaggle dataset, his GitHub example notebook, and all of my code at the Deephaven examples GitHub repository.

CSV vs Parquet

The first issue with this dataset is simply loading it so we can work with it in Python. A standard pandas read_csv call loads the entire CSV into memory. With limited resources, that isn't possible, and the kernel dies. This is one of the general problems with CSV files.

I always think it's important to use the right tool for the job. In this case, Parquet is a much better choice.

To convert any large CSV file to Parquet format, we step through the CSV file and save each increment as its own Parquet file. Since a whole directory of Parquet files can be read back as a single table, there is no need to combine these files later.

The code to write CSV to Parquet is below and will work on any CSV if you change the file name. You can choose either the Deephaven reader/writer or the pandas reader/writer. Note that timing code is included for each step:

import time

# Deephaven imports
from deephaven import read_csv
from deephaven import parquet

# pandas imports
# import pandas as pd

# path to the source CSV; adjust to wherever your copy of the 22GB file lives
file = "/data/transactions.csv"

names = ['CUST_ID', 'START_DATE', 'END_DATE', 'TRANS_ID', 'DATE', 'YEAR', 'MONTH', 'DAY', 'EXP_TYPE', 'AMOUNT']

steps = 5000000
count = 0

while True:
    start = time.time()

    # transform CSV to Parquet in Deephaven
    table = read_csv(file, skip_rows=steps*count, num_rows=steps, allow_missing_columns=True, ignore_excess_columns=True)
    parquet.write(table, f"/data/transaction_parquet/{count}.parquet")

    # transform CSV to Parquet in pandas
    # df = pd.read_csv(file, skiprows=steps*count, nrows=steps, names=names)
    # df.to_parquet(f"/data/transaction_parquet/{count}.parquet")

    end = time.time()
    print(f"read {table.size} rows in {end - start} seconds, iteration {count}")

    count += 1

    # exit the loop once a partial (final) chunk has been read
    if table.size != steps:
        break
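
Because every chunk lands in the same directory with the same schema, the whole directory can later be read back as one table. Here is a minimal sketch, assuming the /data/transaction_parquet directory created above and a Deephaven version that reads flat Parquet directories:

from deephaven import parquet
# import pandas as pd

# read every chunk in the directory back as a single table
transactions = parquet.read("/data/transaction_parquet")

# pandas (with the pyarrow engine) can also read the whole directory:
# df = pd.read_parquet("/data/transaction_parquet")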

To make it easy for you, I've made O'Sullivan's data available in Parquet format on my Kaggle account. Feel free to just access it directly:

kaggle datasets download -d amandamartin62/simulated-transactions-parquet-format

Then accessing the Parquet file in your Deephaven Python session is easy:

from deephaven import parquet
table = parquet.read("/data/transaction.parquet")

Time trials

As proof that Parquet is the right tool for this job, I ran some timing trials. Since we can never load the full CSV into memory, computing aggregates on that data means re-reading each chunk from the CSV and keeping only the running aggregate in memory. This is a reasonable workaround, but an inefficient one.
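
That CSV workaround looks roughly like the sketch below. It illustrates the chunked pattern rather than reproducing O'Sullivan's exact code; the /data/transactions.csv path is an assumption, and the column names match the conversion step above.

import pandas as pd

names = ['CUST_ID', 'START_DATE', 'END_DATE', 'TRANS_ID', 'DATE', 'YEAR', 'MONTH', 'DAY', 'EXP_TYPE', 'AMOUNT']

# stream the CSV in chunks so only one chunk is in memory at a time,
# keeping just the running aggregate
total_per_customer = None
for chunk in pd.read_csv("/data/transactions.csv", names=names, chunksize=5000000):
    partial = chunk.groupby("CUST_ID")["AMOUNT"].sum()
    total_per_customer = partial if total_per_customer is None else total_per_customer.add(partial, fill_value=0)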

In O'Sullivan's article, he shows how to use batch processing to create three different aggregations of the transaction data:

  1. Total number of transactions
  2. Total yearly expenditure
  3. Average monthly entertainment expenditure in 2020

I replicate the same aggregations in Deephaven (a sketch of how they can be written follows the timing table). Here are the actual times I got on my ordinary laptop: the old approach of batch aggregation over CSV versus the new approach using Parquet.

Task | CSV [seconds] | Parquet [seconds]
Read data | 5328 | 1
Total yearly expenditure | 2982 | 56
Average monthly entertainment expenditure | 3782 | 153
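
For reference, here is a minimal sketch of those three aggregations against the Parquet-backed table in Deephaven. It assumes the column names from the conversion step and that EXP_TYPE stores the literal string "Entertainment"; the exact code lives in the example repo.

from deephaven import agg

# 1. total number of transactions per customer
num_trans = table.count_by("N_TRANS", by=["CUST_ID"])

# 2. total yearly expenditure per customer
yearly_exp = table.agg_by([agg.sum_(["AMOUNT"])], by=["CUST_ID", "YEAR"])

# 3. average monthly entertainment expenditure in 2020
monthly_ent = (
    table.where(["YEAR == 2020", "EXP_TYPE == `Entertainment`"])
    .agg_by([agg.sum_(["AMOUNT"])], by=["CUST_ID", "MONTH"])
    .view(["CUST_ID", "AMOUNT"])
    .avg_by(["CUST_ID"])
)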

To see how much time Parquet can save you, check out the full instructions and code in our example repo.

There are a lot of options with datasets this large. Time should never be a limiting factor in the data science we can do.

Let us know how your query performs on Slack.

Further reading