Normalization
Normalization refers to the process of converting data from their various source formats into a single, unified format. Such a unified format is often called a normalization format (or normalized schema). The primary reason for normalizing market data is to abstract away differences between source formats, making the data easier to work with.
The normalization process is one of the most likely places where inaccuracies or data errors are introduced. This article describes these issues, the trade-offs involved in addressing them, and the reasons behind the design of Databento's normalization schema.
Examples of normalized data
- Nasdaq's proprietary TotalView data feed has a protocol with its own message format and provides market-by-order data, while IEX's proprietary TOPS data feed has a completely different protocol with another message format and provides top-of-book data. These are examples of raw data formats.
- When you consume market data from a data redistributor's feed, the redistributor will have its own protocol and message format, distinct from the venues'. This is an example of normalized data.
- The most sophisticated trading firms will generally collect data directly from their sources and normalize them to a proprietary format.
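To make this concrete, the sketch below shows what such a proprietary normalization step can look like. It assumes heavily simplified, hypothetical message layouts for both feeds; the field names do not reflect the actual TotalView-ITCH or TOPS wire formats.

```python
from dataclasses import dataclass
from typing import Optional

# One unified record type that events from both sources are normalized into.
@dataclass
class NormalizedEvent:
    source: str                     # originating feed
    symbol: str
    ts_event_ns: int                # event timestamp, nanoseconds since UNIX epoch
    side: str                       # "B" (bid/buy) or "A" (ask/sell)
    price: int                      # fixed-point price in units of 1e-9
    size: int
    order_id: Optional[int] = None  # only available from order-based feeds

def from_totalview_add(msg: dict) -> NormalizedEvent:
    """Map a simplified, hypothetical TotalView-style add-order message."""
    return NormalizedEvent(
        "NASDAQ_TOTALVIEW", msg["stock"], msg["timestamp_ns"],
        "B" if msg["side"] == "B" else "A",
        msg["price"], msg["shares"],
        order_id=msg["order_reference"],
    )

def from_tops_quote(msg: dict) -> list[NormalizedEvent]:
    """Map a simplified, hypothetical TOPS-style top-of-book quote message."""
    return [
        NormalizedEvent("IEX_TOPS", msg["symbol"], msg["timestamp_ns"],
                        "B", msg["bid_price"], msg["bid_size"]),
        NormalizedEvent("IEX_TOPS", msg["symbol"], msg["timestamp_ns"],
                        "A", msg["ask_price"], msg["ask_size"]),
    ]
```

Downstream code then only has to handle NormalizedEvent, regardless of source. Note that even this toy mapping is lossy: an order-based TotalView message is flattened into the same record type as a top-of-book quote, which is the kind of schema mismatch described in the table below.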
Common issues found in normalized market data
There are many ways in which normalization can introduce data errors, lossiness or performance issues.
Issue | Definition | Examples |
---|---|---|
Incompatible schema | The source schema and normalized schema are mismatched. | A direct market feed with an order-based schema is normalized to a vendor's schema that only provides aggregated market depth. |
Truncated timestamps | A direct market feed which originally includes nanosecond-resolution timestamps is normalized to a schema that truncates the timestamps to a lower resolution. | Some vendors have a legacy data schema designed for older FIX dialects, forcing them to truncate nanosecond-resolution timestamps found in modern markets to millisecond resolution. |
Discarded timestamps | A direct market feed which originally includes more than one timestamp field is normalized to a schema that discards that timestamp. This introduces imprecision when the normalized data is used for strategy backtesting. | A proprietary exchange feed may include both match (Tag 60) and sending (Tag 52) timestamps but a vendor's schema may preserve only one of the two. |
Discarded or remapped sequence numbers | The normalized schema either discards the original message sequence numbers or remaps them to a vendor's own message sequence numbers. | This creates problems if you need to resolve post-trade issues with the market or your broker, as it makes it harder to identify the exact event. |
Loss of price precision | The normalized schema represents prices in a type that loses precision. | Many vendors use a floating-point representation for prices, losing precision past 6 decimal places (see the sketch after this table). This can create issues for trading Japanese yen spot rates, fixed income instruments and cryptocurrencies. |
Loss of null semantics | The normalized schema represents null values in a way that changes their meaning. | Some data feeds represent null prices with zeros or a negative value like -1. This can introduce errors downstream if the price is interpreted as non-null (see the sketch after this table). This is also a problem if your application needs to handle both asset classes that can have negative prices (such as futures and spreads) and asset classes that cannot. |
Loss of packet structure | The normalized schema does not preserve packet-level structure. | Many markets publish multiple events within a single packet. Without packet-level structure, the normalized data can create the appearance of artificial trading opportunities between events that actually arrived together in one packet. |
Lossy or irreversible symbology mappings | The normalized schema adopts a proprietary symbology that is different from the original source's symbology. Sometimes, such proprietary symbology cannot be mapped back to the original. | Some vendors adopt a symbology system that only includes lead months of futures contracts, causing the far month contracts to be discarded. |
Lossy abstraction | The normalized schema does not adequately standardize information across multiple datasets, so the end user still needs to understand the specifications of the various source schemas to determine what was lost. | This often happens when normalizing less commonly used features such as matching engine statuses or instrument definitions, putting a significant burden on the user to study the specifications of the various data feeds. |
Statelessness | The normalized schema provides incremental changes but does not provide snapshots or replay of order book state. | This presents an issue when using the normalized data in real time, as the user loses information in the event of a disconnection or late join. |
Coalescing | The normalized schema aggregates the information at a lower granularity. | A vendor may coalesce a feed of tick data into one-second bar aggregates or subsample a source feed. |
Conflation | A normalized feed batches multiple updates into one at some lower frequency, to alleviate bandwidth limitations. Often present along with coalescing. | This is a common practice for retail brokerages, whose data feeds are designed more for display use and consumption over sparse WAN links. |
Dropped packets | A normalized data feed deliberately discards data when the network or system is unable to keep up. | This is often present when the source feed or upstream parts of the vendor's infrastructure use UDP for transmission. |
Buffering | The data server sends stale data, either because there is insufficient network bandwidth or because the client is reading too slowly. The normalized feed obscures this effect or injects incorrect timestamps, so the client misinterprets the stale data as current. | This often manifests when the data feed uses TCP for transmission, which is a common practice when disseminating data over WAN links. |
Ex post cleaning | A data source is cleaned, during the normalization process, using future information. This enhances the historical data with artificial information that may not have been actionable in real time. | The data may be reordered, trades that were canceled after the end of the market session may be removed, or prices may be adjusted with information from a future rollover or dividend event. |
Schema bloat | A normalized schema represents some data fields with types that take up unnecessary space or make the data more difficult to compress, which increases storage costs and reduces application performance. | Common cases of this include representing timestamps as ISO 8601 strings or prices as strings, especially on vendor feeds that use JSON encoding. |
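To make a few of these failure modes concrete, here is a minimal, self-contained sketch contrasting lossy and lossless field representations, as referenced in the Truncated timestamps, Loss of price precision, and Loss of null semantics rows. The constants and helper below are hypothetical, not any vendor's or Databento's actual field definitions.

```python
from decimal import Decimal

FIXED_POINT_SCALE = 10**9   # prices stored as integer multiples of 1e-9
UNDEF_PRICE = 2**63 - 1     # sentinel meaning "no price", distinct from 0 or -1

def to_fixed_point(price: str) -> int:
    """Parse a decimal price string into a lossless fixed-point integer
    (positive prices only, for brevity)."""
    whole, _, frac = price.partition(".")
    return int(whole) * FIXED_POINT_SCALE + int(frac.ljust(9, "0")[:9])

# Loss of price precision: a binary float can only approximate this decimal price.
px = "157.123456789"
print(Decimal(px) == Decimal(float(px)))    # False: the float is an approximation
print(to_fixed_point(px))                   # 157123456789, exact in 1e-9 units

# Truncated timestamps: two distinct nanosecond events collapse at millisecond resolution.
ts_a_ns, ts_b_ns = 1_700_000_000_123_456_789, 1_700_000_000_123_999_999
print(ts_a_ns // 1_000_000 == ts_b_ns // 1_000_000)   # True: their ordering is lost

# Loss of null semantics: 0 or -1 can be a legitimate price for some asset classes
# (such as futures spreads), so a dedicated sentinel keeps "no price" unambiguous
# and is checked explicitly rather than via a test like `price <= 0`.
best_bid = UNDEF_PRICE
print(best_bid == UNDEF_PRICE)              # True: null detected without ambiguity
```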
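The Schema bloat row can likewise be illustrated with a rough size comparison. Both layouts below are made up for illustration; neither is any particular vendor's or Databento's wire format.

```python
import json
import struct

# A JSON message with ISO 8601 string timestamp and string price,
# as commonly seen on vendor feeds that use JSON encoding.
json_record = json.dumps({
    "ts_event": "2024-05-01T13:30:00.123456789Z",
    "symbol": "ESM4",
    "price": "5123.250000000",
    "size": "3",
})

# The same trade as a fixed-width binary record: int64 nanosecond timestamp,
# int64 fixed-point price (1e-9 units), uint32 size, and an 8-byte symbol field.
binary_record = struct.pack(
    "<qqI8s",
    1_714_570_200_123_456_789,   # 2024-05-01T13:30:00.123456789Z as epoch nanoseconds
    5_123_250_000_000,           # 5123.25 in 1e-9 units
    3,
    b"ESM4",
)

print(len(json_record), "bytes as JSON")      # roughly 100 bytes
print(len(binary_record), "bytes as binary")  # 28 bytes
```

The string fields are also more difficult to compress and must be parsed on every read, which is part of why they reduce application performance.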
Our normalization schema is designed to mitigate most of these issues.
Why use normalized data?
Though it may seem counterintuitive, some degree of lossiness introduced during normalization can be preferable to preserving every detail of the source.
A normalized schema that has too many data fields, as a result of trying to preserve information from too many different source schemas, is hard to use.
Here are some ways in which lossiness can be useful:
- Discarding unnecessary data fields can reduce storage and bandwidth requirements, and improve application performance. For example, many strategies execute at time scales where extra timestamps are unnecessary.
- Most status or reference data events are irrelevant for any given business use case. For example, many users only trade during the regular market session and their applications do not need to be aware of special matching conditions that are more typically found outside of regular hours or during pre-market.
- Floating point prices can be easier to use and the modeling error introduced by them could be negligible compared to other, more likely sources of error for the given use case.
- Order book snapshots can be unnecessary on liquid products whose orders turn over very quickly, as pre-existing orders in the snapshot will eventually be filled or canceled, a process commonly referred to as natural refresh.
It should be noted that normalized data is not necessarily smaller or simpler in specification than the source data. If your use case only requires a single dataset, there may be added complexity in using normalized data whose schema was designed to accommodate differences between multiple datasets.
We normalize to our proprietary Databento Binary Encoding (DBN).
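As one example of consuming it, here is a minimal sketch that decodes a DBN file with the databento Python client; the file name is hypothetical and assumes the data was downloaded beforehand.

```python
import databento as db

# Decode a previously downloaded DBN file (hypothetical path) and convert
# the normalized records to a pandas DataFrame for inspection.
store = db.DBNStore.from_file("xnas-itch-20240501.mbo.dbn.zst")
print(store.to_df().head())
```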