Writing a stock screener in Python with Databento

August 27, 2024

This tutorial shows you how to implement a stock screener in Python using Databento’s market data API.

More generally, you can modify this example to implement any kind of screener or ranker.

We’ll demonstrate how powerful the Databento API is, showing examples that ranks across 600,000+ symbols and computes multiple metrics like average spread size and depth, in the matter of seconds—all on a laptop with a home WiFi connection.

In the context of algorithmic trading:

A screener is any application that scans the entire market—every symbol on one venue or more—to calculate a certain metric or to find symbols that meet a condition.
A ranker is any application that ranks symbols based on a certain metric.

Screeners and rankers tend to work in tandem. The two terms may also be used interchangeably.

These are usually implemented as an ETL pipeline, a web UI or dashboard, or a combination of them.

The key feature of Databento that allows you to easily implement any kind of screener or ranker is the symbols='ALL_SYMBOLS' parameter. This fetches or subscribes to every single symbol on a given venue or dataset.

This type of firehose request is usually an expensive operation, especially on a feed that has options contracts like OPRA (1.3M+ symbols) and CME (600k+ symbols). But Databento’s API backend has some peculiarities that make it possible:

A custom zero-copy binary format.
Inline zstd compression.
DPDK-accelerated routing and firewalls.
Lots of network bandwidth and route diversity through private peering.
A Rust-based backend optimized for large payloads and long-lived connections.

In our first example, we’d like to find symbols whose prices changed the most during the pre-market hours between 04:00 AM and 09:30 AM Eastern.

import databento as db
import pandas as pd

# First, create a historical client
client = db.Historical(key="YOUR_API_KEY")

# Next, request ohlcv-1m bars for the time range we're interested in
data = client.timeseries.get_range(
    dataset="XNAS.ITCH",
    schema="ohlcv-1m",
    symbols="ALL_SYMBOLS",
    start=pd.Timestamp("2023-06-06T04:00", tz="US/Eastern"),
    end=pd.Timestamp("2023-06-06T09:30", tz="US/Eastern"),
)

# We need to obtain the symbology mappings for 'ALL_SYMBOLS'
symbology = data.request_symbology(client)

# This allows us to populate the 'symbol' column when we convert to a DataFrame
data.insert_symbology_json(symbology)

# Convert to dataframe with timestamps in the US/Eastern timezone
df = data.to_df(tz="US/Eastern")

# Combine the first open, last close and volume sum into a DataFrame
df_moves = df.groupby("symbol").agg({"open": "first", "close": "last", "volume": "sum"})

# Calculate the change from start to end of pre-market
df_moves["change"] = (df_moves["close"] - df_moves["open"]) / df_moves["open"]
df_moves.sort_values(by="change", ascending=False, inplace=True)

# Display the top 10 largest movers
print(df_moves.head(10))

           open    close   volume    change
symbol
CRGOW    0.1564   0.3200     1977  1.046036
ZVZZT   13.0200  25.0000      369  0.920123
MLECW    0.0974   0.1798     3394  0.845996
HOTH     2.0300   3.5400  1600550  0.743842
BOAC+    0.0458   0.0782      700  0.707424
RCRT     0.1918   0.2953   575273  0.539625
ATMCR    0.1489   0.2237      400  0.502351
SBFMW    0.5938   0.8915      305  0.501347
RCRTW    0.0229   0.0302      300  0.318777
TCBPW    0.0436   0.0573      300  0.314220

On this particular date, 2,473 symbols traded during the pre-market session and this example screens through all of 2,473 symbols and runs in 2.723s of clock time on my M2 MacBook Air over a home WiFi connection—showing the incredible speed of Databento’s API.

Note that Databento’s API appends the raw (string) symbol when you request specific symbols, but it doesn’t do this automatically when you request ALL_SYMBOLS, so you’ll only have numeric instrument IDs. This is a performance optimization and a form of lazy evaluation, since mapping the symbols is expensive especially when it’s the entire universe on the venue, it defers this decision to you until you really need it.

These two lines manually perform this mapping:

# We need to obtain the symbology mappings for 'ALL_SYMBOLS'
symbology = data.request_symbology(client)

# This allows us to populate the 'symbol' column when we convert to a DataFrame
data.insert_symbology_json(symbology)

Many systematic trading strategies can trade any generic instrument, so long as it has some minimum threshold of liquidity. So a typical precursor to developing a systematic trading strategy is to be able to scan all symbols on a market and then rank them by some liquidity properties like volume, average spread size, and average depth at BBO (touch).

This is particularly useful on a venue like CME Globex, where there are over 600,000 listed instruments at any given time and it’s impossible to select tradable instruments by hand.

There are various ways to characterize the most “liquid” symbols on a market, one way is to look at their volumes through daily OHLCV aggregates:

def rank_by_volume(top=500):
    """
    Returns instrument IDs of instruments that traded most, in descending rank
    """
    data = client.timeseries.get_range(
        dataset='GLBX.MDP3',
        symbols='ALL_SYMBOLS',
        schema='ohlcv-1d',
        start='2023-08-15',
        end='2023-08-16'
    )
    df = data.to_df()
    return df.sort_values(by='volume', ascending=False).instrument_id.tolist()[:top]

Aside from volume and open interest, spread size, depth, and sweep-to-fill liquidity may also be used as quick proxies for liquidity, since they’re closely related to the execution slippage on an instrument.

Databento makes it easy to study this, providing data of every top of book (MBP-1) update and the top of book data in trade space (TBBO):

               median_spread  median_touch
raw_symbol
WY3Q3 C1110                1       70460.2
SR3:AB 01Y U4              1          27.5
SR3M6                      1         704.5
ZQK4                       1          80.5
SR3M7                      1         502.5
...                      ...           ...
HOU3                      10           1.0
HOX3                      11           1.0
RBZ3                      12           1.0
HOZ3                      14           1.2
MGCV3                     20           3.5

See the full code on GitHub here. Results in CSV can also be found here.

The result shows some symbols you’d expect to be very liquid, like the 3-month SOFR futures outrights, but also symbols that you’ll unlikely know without an automated screener like this, such as the 10-Year Treasury Note weekly options contract WY3Q3 C1110 and combination instruments like SR3:AB 01Y U4.

This example takes 4.932s of clock time on the same laptop and home WiFi connection.

It’s better to rank certain instruments by open interest. You can get this from the market statistics schema of Databento.

Since these examples are all based in Python it’s easy to extend them with any ETL or workflow orchestration tools with first-class Python support:

And since Databento’s API is incredibly fast, you can think less about handling failure modes, database ingestion, flat file processing, job queues, and distributed processing. You can run examples like this in the ETL worker or agent without needing to farm it out to a server cluster!

It’s also easy to build a display web app over it, using Python-based frameworks such as:

Databento is a simpler, faster way to get market data. We’re an official distributor of real-time and historical data for over 40 venues, and provide APIs and other solutions for accessing market data.

It only takes a few minutes to sign up and run these examples.

Interested in more articles on practical use cases like this? Check out Databento’s blog.

Writing a stock screener in Python with Databento

Related posts

Computing the option Greeks using Pathway and Databento

Build a fast, real-time stock screener in Python

Download and merge PCAPs with Databento for market replay