
Kafka + Parquet: Maximize speed, minimize storage

Amanda Martin · 5 min read

DALL·E prompt: Server room with a parquet tile floor, with tiles made of wood, dramatic lighting, cgsociety 4k
How I use Kafka and Parquet for the perfect stream and storage solution

Kafka is an ideal choice for streaming data. I have seen streams of a couple hundred thousand messages a second in real-world use cases. However, with that kind of volume, storage space can quickly become an issue. Do you want to store the same amount of data while using less than 2% of your current resources? Or, put another way, would you like to store 50 times more data on the same resources? There is no magic here: we just use the tools already at hand in a smart way. Pairing Parquet files with Deephaven allows your streaming, real-time data to be combined with stored data without the need for vast resources.

In this blog, I'll show you how easy it is to save your Kafka stream for future use without a high storage overhead.

In How to implement streaming analytics with Redpanda & Deephaven, we create multiple Kafka topics of financial data. Often the real-time nature of a Kafka stream is what matters, but sometimes we also want a historical record of that data. For this example, I added a volume to my docker-compose.yml file for the Kafka-compatible redpanda service:

    volumes:
      - ./redpanda:/var/lib/redpanda/data

This persists the topic data to the local disk, so when the containers are done running, that data is still available. I wanted to create an hourly aggregate of the data over the course of the last 24 hours. Data was collected for 24 hours on each topic, with each stream accumulating just over one million Kafka offsets. When I explored how Kafka saved that data, I was shocked by the size of the files.

By the numbers

Overall, the streams contained about 5 GB of data!

Disk usage   Kafka source
749M         ./Candle/0_14
413M         ./Order/0_4
625M         ./Quote/0_8
591M         ./Series/0_16
7833M        ./Summary/0_6
888M         ./TimeAndSale/0_12
7833M        ./Trade/0_2
888M         ./Underlying/0_18
Total: 4.9G

My laptop powers my work. It's not a super machine, but a nice tool that gets the job done. I knew that if I kept doing this workflow, I would not be able to do the data analysis I wanted. There had to be a better solution!

This is where Parquet saves my data day. I opened up the Kafka streams from above and simply saved those streams as a ZSTD compressed Parquet file. Now I'm happy with the sizes:

Disk usage   Parquet file
18M          candle_feb15_ZSTD.parquet
12M          order_feb15_ZSTD.parquet
12M          quotes_feb15_ZSTD.parquet
7.3M         series_feb15_ZSTD.parquet
5.7M         summary_feb15_ZSTD.parquet
10M          timeAndSale_feb15_ZSTD.parquet
17M          trades_feb15_ZSTD.parquet
6.6M         underlying_feb15_ZSTD.parquet
Total: 95M

What was nearly 5 GB is now about 100 MB.

To see how the financial data compresses, here is the total disk utilization for each format, with all of the topics combined (the sketch after the table shows one way to measure this yourself):

Disk usage   Format
4.9G         ./kafka
210M         ./LZ4
205M         ./LZO
120M         ./GZIP
228M         ./Snappy
95M          ./ZSTD
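
These numbers come straight from the file system. If you want to reproduce the measurement on your own files, here is a minimal sketch in plain Python; the paths are placeholders for wherever your Kafka data and Parquet files live:

import os

def disk_usage(path):
    """Total size in bytes of a file or of everything under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )

# Placeholder paths; point these at your Kafka data directory and Parquet output directories.
for label in ["./kafka", "./LZ4", "./LZO", "./GZIP", "./Snappy", "./ZSTD"]:
    print(label, round(disk_usage(label) / 1024**2), "MB")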

How does it do this? See below for the various compression options and a simple Deephaven script.

Parquet compression options

Parquet is designed for large-scale data and supports several compression codecs. Depending on your data, a different codec may serve you better; the sketch after this list shows one way to compare them on your own tables.

  • LZ4: Compression codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
  • LZO: Compression codec based on or interoperable with the LZO compression library.
  • GZIP: Compression codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952.
  • Snappy: The default compression codec for Parquet files.
  • ZSTD: Compression codec with the highest compression ratio based on the Zstandard format defined by RFC 8478.
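
To compare codecs on your own data, write the same table once per codec and check the resulting file sizes. This is a minimal sketch that assumes the deephaven2.parquet API shown below, an in-memory table named trades (a placeholder for whichever table you want to persist), and that the codec names above are accepted values for compression_codec_name:

from deephaven2.parquet import write_table

# `trades` is a placeholder for the Deephaven table you want to persist.
# Write the same table once per codec, then compare the files' sizes on disk.
for codec in ["LZ4", "LZO", "GZIP", "SNAPPY", "ZSTD"]:
    write_table(trades, f"/data/trades_feb15_{codec}.parquet",
                compression_codec_name=codec)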

Deephaven's write_table method

To take advantage of this smaller storage, do not write the streaming data to a Kafka directory. Instead, write it to a Parquet file before closing the container:

from deephaven2.parquet import write_table

# Write the in-memory table to a ZSTD-compressed Parquet file
write_table(table, "/data/FILE.parquet", compression_codec_name="ZSTD")

Sometimes I worry about my files being too big. After all, some data streams are not just a million offsets a day, but a million every few minutes. Luckily, when Deephaven loads a Parquet file into a table, it does not read the whole file into RAM. This means that files much larger than the available RAM can still be loaded as tables. So far, I've loaded a 30G file without a problem.
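
Reading a file back is just as simple. A minimal sketch, assuming the parquet module also exposes a read_table counterpart to write_table (check your Deephaven version for the exact function name):

# read_table is assumed to mirror write_table in your Deephaven version.
from deephaven2.parquet import read_table

# Load the compressed Parquet file as a Deephaven table; the whole file is not pulled into RAM.
candles = read_table("/data/candle_feb15_ZSTD.parquet")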

If you want the speed of Kafka and the storage efficiency of Parquet, Deephaven makes it easy to have both. Reach out on Slack with your ideas!

Further reading