Zstd vs. zlib: which should you use for better market data compression?

November 07, 2023


Most financial vendors and open-source projects use zlib/gzip for compressing market data. One "free" way to store more market data is to use Zstandard (zstd) for more efficient compression, something we've done with much success at Databento. We use it for most of Databento's historical API surface, and it's one reason we can deliver vast amounts of order book (MBO) data over the internet quickly. In this article, we'll cover some differences between zstd and zlib and the benefits of using zstd for market data storage.

Zstandard and zlib are lossless data compression algorithms that can be compared on three metrics: compression ratio, compression speed, and decompression speed. Although zlib is widely used and has been around longer than zstd, its performance tends to fall short of more modern compression algorithms. zstd is a newer algorithm that offers more features than zlib and promises better results on all three metrics.

Source: https://facebook.github.io/zstd/
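The three metrics above are easy to measure on your own data. The following is a minimal sketch, not the benchmark behind the published figures: `measure`, `synthetic_ticks`, and `run_benchmark` are hypothetical helper names, the payload is synthetic CSV-like text rather than real market data, and the zstd comparison only runs if the third-party `zstandard` Python package happens to be installed. Levels 6 (zlib) and 3 (zstd) are assumed here simply because they are common defaults.

```python
import time
import zlib


def measure(name, compress, decompress, payload):
    """Return compression ratio and wall-clock timings for one codec."""
    t0 = time.perf_counter()
    packed = compress(payload)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != payload:
        raise ValueError(f"{name}: round trip was not lossless")
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


def synthetic_ticks(n=50_000):
    """Build hypothetical CSV-like tick rows (illustrative only)."""
    rows = (
        f"{1_700_000_000_000 + i},ESZ3,{4500.00 + (i % 40) * 0.25:.2f},{1 + i % 7}\n"
        for i in range(n)
    )
    return "".join(rows).encode()


def run_benchmark(payload):
    """Measure zlib, plus zstd when the optional binding is available."""
    results = [
        measure("zlib-6", lambda b: zlib.compress(b, 6), zlib.decompress, payload)
    ]
    try:
        import zstandard  # third-party package; assumed optional here
    except ImportError:
        return results
    cctx = zstandard.ZstdCompressor(level=3)
    dctx = zstandard.ZstdDecompressor()
    results.append(measure("zstd-3", cctx.compress, dctx.decompress, payload))
    return results


if __name__ == "__main__":
    for row in run_benchmark(synthetic_ticks()):
        print(row)
```

Single-shot timings like these are noisy, so repeat the runs and use a sample of your actual data before drawing conclusions; synthetic rows compress far better than real feeds usually do.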

It's mainstream. zstd is the default compression algorithm for Arch Linux package compression, Fedora's file system, RocksDB, etc., and it is supported by pandas, Parquet, and Wireshark.
It's on the Pareto frontier. At the same compression ratio, we've found it decompresses nearly 4x as fast as zlib on a single core.
Its speed makes it excellent for inline compression/decompression. Most well-designed platforms employ inline decompression to speed up reads and reduce storage requirements for market data everywhere except the latency-sensitive paths. We use zstd inline when processing pcaps and transcoding data to other formats like CSV and JSON.
It supports streaming compression. While we don't recommend it for real-time, latency-sensitive use cases, it's great for transmitting large amounts of data in near-real time when bandwidth is limited. This pattern recurs often in electronic trading environments: for example, sending intraday captures over long-haul links, shipping application logs to a central location, or feeding data to your post-trade and analytics tools.
It has support for skippable frames. Skippable frames make it easy to insert user-defined data and embed metadata within a compressed stream. This is great for financial use cases because it lets you attach information about the data without touching the compressed data itself; you could use it for data quality indicators, data versioning, symbology mappings, dates, and other context needed to interpret the data.
It has dictionary compression support. We haven't found it helpful for large quantities of market data, but you'll probably find this useful if you're working with individual messages or building a web application.
lz4 and zstd are complementary. If you're willing to sacrifice compression ratio for faster compression, you'll probably still have use for lz4. For example, we still use lz4 for inline compression on file systems. However, storage and bandwidth costs are usually the dominant factor, and we've found that zstd has supplanted lz4 in most of our use cases.
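To make the skippable-frame point concrete, here is a minimal sketch of the frame layout as defined in the Zstandard format specification: a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian length, then the user data. The helper names (`make_skippable_frame`, `read_leading_skippable_frame`) and the JSON metadata fields are hypothetical and chosen only for illustration; this is not a description of how DBN or any particular vendor format lays out its metadata.

```python
import json
import struct

# Any magic number from 0x184D2A50 through 0x184D2A5F marks a skippable frame.
SKIPPABLE_MAGIC_BASE = 0x184D2A50
HEADER_SIZE = 8  # 4-byte magic + 4-byte user-data length, both little-endian


def make_skippable_frame(user_data: bytes, variant: int = 0) -> bytes:
    """Wrap user_data in a skippable frame; variant selects the low nibble."""
    if not 0 <= variant <= 0xF:
        raise ValueError("variant must be in 0..15")
    if len(user_data) > 0xFFFFFFFF:
        raise ValueError("user data too large for a 32-bit length field")
    header = struct.pack("<II", SKIPPABLE_MAGIC_BASE + variant, len(user_data))
    return header + user_data


def read_leading_skippable_frame(stream: bytes):
    """Return (user_data, rest) if stream starts with a skippable frame.

    Returns (None, stream) when the stream starts with anything else.
    """
    if len(stream) < HEADER_SIZE:
        return None, stream
    magic, size = struct.unpack_from("<II", stream, 0)
    if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_BASE:
        return None, stream
    end = HEADER_SIZE + size
    if end > len(stream):
        raise ValueError("truncated skippable frame")
    return stream[HEADER_SIZE:end], stream[end:]


if __name__ == "__main__":
    # Hypothetical metadata fields, for illustration only.
    metadata = json.dumps({"dataset": "EXAMPLE", "version": 1}).encode()
    frame = make_skippable_frame(metadata)
    print(read_leading_skippable_frame(frame)[0])
```

In practice you would prepend such a frame to one or more ordinary zstd frames. The format lets decoders skip these frames, but bindings differ in how their one-shot and streaming APIs treat them, so it is worth testing the behavior of the library you actually use.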

We offer free customization options for your data, including a choice between zstd compression and no compression. We recommend zstd if you want faster transfers and smaller files. It's what we use for our DBN encoding, and it's also available for the CSV and JSON encodings.
