I love the versatility of pandas as much as anyone. When it's slow, however, pandas frustrates me as much as anyone.
Below I work through a use case to speed things up. A lot.
Recently Conor O'Sullivan wrote a great article on Batch Processing 22GB of Transaction Data with Pandas, which demonstrates how to handle large datasets with limited computational resources. His dataset is a single 22GB CSV file. Using pandas, O'Sullivan's aggregations took about 50 minutes each. On my computer, they took 90 minutes. Ouch.
Using the same (constrained) resources, I was able to process O'Sullivan's example in 50 seconds using Parquet files and Deephaven -- absolutely torching the performance of pandas + CSV. Nice!
CSV vs Parquet
The first issue with this dataset is simply loading it in Python. pandas tries to load the entire CSV into memory, which is impossible with limited resources and causes the kernel to die. This is one of the general problems with CSV files.
I always think it's important to use the right tool for the job. In this case, Parquet is a much better choice.
To convert any large CSV file to Parquet format, we step through the CSV file in chunks and save each chunk as its own Parquet file. Since a whole directory of Parquet files can be read as a single table, there is no need to combine these files later.
The code to write CSV to Parquet is below and will work on any CSV by changing the file name. You can choose either the Deephaven reader/writer or the pandas reader/writer. Note that timing elements are added so you can see the cost of each step:
# Deephaven imports
from deephaven import read_csv
from deephaven import parquet
# pandas imports
# import pandas as pd
import time

file = "/data/transactions.csv"  # path to the large CSV file

names = ['CUST_ID', 'START_DATE', 'END_DATE', 'TRANS_ID', 'DATE', 'YEAR', 'MONTH', 'DAY', 'EXP_TYPE', 'AMOUNT']
steps = 5000000
count = 0

while True:
    start = time.time()

    # transform CSV to Parquet in Deephaven
    table = read_csv(file, skip_rows=steps*count, num_rows=steps, allow_missing_columns=True, ignore_excess_columns=True)

    # transform CSV to Parquet in pandas
    # df = pd.read_csv(file, skiprows=steps*count, nrows=steps, names=names)

    end = time.time()
    print("read " + str(table.size) + " rows in " + str(end - start) + " seconds. Iteration number " + str(count))

    # write this chunk as its own Parquet file
    parquet.write(table, "/data/transactions/transaction_" + str(count) + ".parquet")

    # stop when we read a partial (final) chunk
    if table.size < steps:
        break
    count += 1
To make it easy for you, I've made O'Sullivan's data available as a Parquet file at my Kaggle. Feel free to just access it directly.
kaggle datasets download -d amandamartin62/simulated-transactions-parquet-format
Then accessing the Parquet file in your Deephaven Python session is easy:
from deephaven import parquet
table = parquet.read("/data/transaction.parquet")
As proof that Parquet is the right tool for this job, I conducted some timing trials. Since we can never load the full CSV into memory, performing aggregations on that data means re-reading each chunk from the CSV and keeping only the running aggregate in memory. This workaround functions, but it is inefficient.
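The CSV workaround described above can be sketched in a few lines of pandas: read the file in fixed-size chunks and keep only the running aggregate in memory. The tiny in-memory CSV below is a stand-in for the real 22GB file, and the column names mirror the transaction schema; this is a minimal sketch, not O'Sullivan's exact code.

```python
import io
import pandas as pd

# stand-in for the real 22GB transactions CSV
csv_data = io.StringIO(
    "CUST_ID,YEAR,AMOUNT\n"
    "C1,2020,10.0\n"
    "C2,2020,5.0\n"
    "C1,2021,7.5\n"
    "C2,2021,2.5\n"
)

# total yearly expenditure, computed one chunk at a time
total = None
for chunk in pd.read_csv(csv_data, chunksize=2):
    part = chunk.groupby("YEAR")["AMOUNT"].sum()
    # keep only the running aggregate in memory
    total = part if total is None else total.add(part, fill_value=0)

print(total)  # 2020 -> 15.0, 2021 -> 10.0
```

Only one chunk plus a small aggregate ever lives in memory, which is why this works at all on constrained hardware, and also why every aggregation pays the full cost of re-reading the CSV.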
| Task | CSV [seconds] | Parquet [seconds] |
|---|---|---|
| Total yearly expenditure | 2982 | 56 |
| Average monthly entertainment expenditure | 3782 | 153 |
To see how much time Parquet can save you, the full instructions and code are in our example repo.
There are a lot of options with datasets this large. Time should never be a limiting factor in the data science we can do.
Let us know how your query performs on Slack.