Large file and Parquet support for Python client library

January 24, 2024

The latest enhancement to our Python client library introduces a Parquet transcoder and a chunked iterator.

Columnar file formats like Apache Parquet have several advantages over row-based formats (e.g., CSV) when handling large datasets. The format offers fast read/write speeds and smaller compressed file sizes, making it well suited to storing large volumes of historical tick data, such as those required for backtesting.

As a popular open source format, Parquet is compatible with tools and platforms used for the quantitative analysis of market data. This includes integration with Python-based data science libraries like pandas, polars, and pyarrow.

Our proprietary encoding, DBN, is excellent for users in the Python data science ecosystem as both a real-time message encoding and historical data storage format. Now, Python users can also transcode DBN data directly to Parquet.

Transcoding to Parquet supports automatic symbol mapping as well as timestamp and price formatting. Here's an example using Python:

import databento as db

client = db.Historical()
data = client.timeseries.get_range(
    dataset="OPRA.PILLAR",
    symbols="SPX.OPT",
    stype_in="parent",
    schema="trades",
    start="2023-12-15"
)

data.to_parquet(
    "spx-opt-20231215.trades.parquet",
    pretty_ts=True,
    price_type="fixed",
    map_symbols=True,
)

This also provides the benefit of writing large datasets in DBN straight to a Parquet file, even on machines with limited memory resources. Our library method handles chunked writing under the hood, eliminating the need for users to manage large pandas DataFrame objects in memory or buffer the data themselves.
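
Because Parquet is an open standard, the resulting file can then be loaded with any of the libraries mentioned above. Here's a quick sketch using pandas and pyarrow, reusing the file name from the example:

import pandas as pd
import pyarrow.parquet as pq

# Inspect the Parquet schema without loading the full file into memory
print(pq.read_schema("spx-opt-20231215.trades.parquet"))

# Load the file into a pandas DataFrame for analysis
df = pd.read_parquet("spx-opt-20231215.trades.parquet")
print(df.head())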

Our Python client library now supports reading and iterating over large DBN blobs incrementally in chunks, drastically reducing the memory requirements for large files. Alongside the new Parquet support, CSV and JSON transcoding from DBN also benefits from this improvement. The count parameter, available on DBNStore.to_df and DBNStore.to_ndarray, allows the DBN file to be read in chunks of up to count records at a time.

import databento as db

client = db.Historical()
data = client.timeseries.get_range(
    dataset="OPRA.PILLAR",
    symbols="SPX.OPT",
    stype_in="parent",
    schema="trades",
    start="2023-12-15"
)

for df in data.to_df(count=100):
    print(df)  # contains at most 100 rows
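
Since the CSV and JSON transcoders share the same incremental reading, writing the same data out to those formats also avoids buffering the decoded records in memory. A minimal sketch, reusing the data object from the example above:

# Stream the DBN data to CSV and JSON files on disk
data.to_csv("spx-opt-20231215.trades.csv")
data.to_json("spx-opt-20231215.trades.json")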

Parquet support is currently limited to our Python client library and is based on pyarrow. When Parquet support is added to the official DBN library, we'll remove this pyarrow dependency. We then plan to extend Parquet support to C++, Rust, and our batch download service in future releases.