Normalized vs. raw data: the pros and cons
Normalization involves converting data from various source schemas into a single, unified set of schemas, referred to as a normalization format or normalized schema. The goal of this process is to eliminate differences between source schemas.
Normalization ensures that data is presented in a standardized manner, allowing users to avoid errors and the need to spend significant amounts of time reconciling and aligning data from multiple sources. This helps streamline the data integration process and makes it easier for users to extract insights.
During the normalization process, some data fields may be discarded, combined, or transformed to create a more uniform representation across different data sources. This can lead to some degree of lossiness (the loss of information or precision) because some data fields may be simplified or removed. Though it may seem counterintuitive, a normalized schema that preserves too many fields can be difficult to use. There are several ways in which lossiness can be useful:
- Discarding unnecessary data fields can reduce storage and bandwidth requirements, as well as improve application performance. For example, many strategies execute at time scales where extra timestamps are unnecessary.
- Most status or reference data events are irrelevant for any given business use case. For example, many users only trade during the regular market session and their applications do not need to be aware of special matching conditions that are more typically found outside of regular hours or during pre-market.
- Order book snapshots can be unnecessary on liquid products whose orders turn over very quickly, as pre-existing orders in the snapshot will eventually be filled or canceled, a process commonly referred to as natural refresh.
It should be noted that normalized data is not necessarily smaller or simpler in specification than the source data. If your use case only requires a single dataset, using normalized data with a schema designed to accommodate differences between multiple datasets may introduce complexity. In practice, however, normalized data will often be smaller or have a smaller specification. We normalize to our proprietary Databento Binary Encoding (DBN).
Normalization is essential for making data easier to work with. Like market data vendors, the most sophisticated trading firms will generally collect data directly from their sources and normalize it to a proprietary format. Our approach balances the degree of lossiness introduced during normalization against the requirements of common use cases.
By providing a normalized schema that is compatible with a wide range of venues and asset classes, our users can focus on analysis rather than dealing with data inconsistencies, formatting issues, or pouring time into standardizing data themselves. The schema is consistent across both historical and live data services, enabling our users to utilize the same code for backtesting and production trading with identical APIs and normalized formats.
We're also mindful not to over-normalize our data and recognize that different use cases may require different levels of precision. For example, when dealing with US equities, negative prices may not be a consideration. However, when dealing with other asset classes, such as futures contracts or foreign currencies, negative prices may come into play. Our prices are expressed as signed integers in fixed decimal format, where every 1 unit corresponds to 1e-9, to account for these possibilities. Many other vendors use a floating-point representation for prices or lose precision past 6 decimal places.
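As a rough sketch of this convention (not our client code), fixed-decimal prices convert to and from exact decimals as follows; the helper names and `PRICE_SCALE` constant are illustrative:

```python
from decimal import Decimal

# Assumed convention: prices are signed 64-bit integers where 1 unit = 1e-9
# of the instrument's currency, so 1_500_000_000 represents 1.50.
PRICE_SCALE = 10**9

def price_to_decimal(fixed: int) -> Decimal:
    """Convert a fixed-decimal integer price to an exact Decimal."""
    return Decimal(fixed) / PRICE_SCALE

def decimal_to_price(value: str) -> int:
    """Convert a decimal string such as '-37.63' to a fixed-decimal integer."""
    return int(Decimal(value) * PRICE_SCALE)

# Negative prices (e.g. crude oil futures in April 2020) round-trip exactly.
assert price_to_decimal(decimal_to_price("-37.63")) == Decimal("-37.63")
```

Storing the scaled value as a signed integer keeps arithmetic exact and leaves the full negative range available when an asset class genuinely trades below zero.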
In our experience, the normalization process is one of the most common places data errors are introduced, and the above is just one example of the issues our normalization schema is designed to mitigate.
If you're experiencing a few (or more) of these issues regularly, it might be time to consider switching market data providers:
The data server sends stale data, either because there is insufficient network bandwidth or because the client is reading too slowly. The client then misinterprets the stale data, either obscuring this effect or injecting incorrect timestamps. This often manifests when the data feed uses TCP for transmission, a common practice when disseminating data over WAN links.
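As a rough sketch of how a consumer might surface this problem, the check below compares an event timestamp against the local clock; the field name and threshold are illustrative assumptions, not any particular vendor's API:

```python
import time

# Hypothetical staleness check: ts_event_ns is the venue's event timestamp in
# nanoseconds since the Unix epoch; the threshold is an illustrative choice.
STALENESS_THRESHOLD_NS = 500_000_000  # 500 ms

def is_stale(ts_event_ns: int) -> bool:
    """Flag messages whose event time lags the local clock by too much,
    which often indicates a slow consumer or a saturated TCP link."""
    return time.time_ns() - ts_event_ns > STALENESS_THRESHOLD_NS
```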
The normalized schema aggregates the information at a lower granularity. For example, a vendor may coalesce a feed of tick data into second bars or subsample a source feed.
To alleviate bandwidth limitations, a normalized feed batches multiple updates into one at some lower frequency. This is often found together with coalescing and is a common practice for retail brokerages, whose data feeds are designed more for display use and consumption over sparse WAN links.
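The toy sketch below, which assumes a stream of (timestamp, instrument, price) tuples, shows how intermediate ticks disappear when only the latest update per interval is kept:

```python
def conflate(updates, interval_ns):
    """Toy conflation: within each interval, keep only the latest price per
    instrument, discarding every intermediate tick."""
    out = []
    bucket = {}
    bucket_start = None
    for ts, instrument, price in updates:
        if bucket_start is None:
            bucket_start = ts
        if ts - bucket_start >= interval_ns:
            out.append(dict(bucket))  # one batched update per interval
            bucket.clear()
            bucket_start = ts
        bucket[instrument] = price  # later ticks overwrite earlier ones
    if bucket:
        out.append(dict(bucket))
    return out

# Three ticks within the same second collapse into a single update.
ticks = [(0, "ESM2", 4150.00), (400_000_000, "ESM2", 4150.25), (900_000_000, "ESM2", 4149.75)]
print(conflate(ticks, 1_000_000_000))  # [{'ESM2': 4149.75}]
```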
The normalized schema either discards the original message sequence numbers or remaps them to a vendor's own message sequence numbers.
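Those original sequence numbers are what let a consumer detect loss in the first place; a minimal, hypothetical gap check:

```python
def detect_gaps(seq_numbers):
    """Report (first_missing, next_received) pairs where the venue's original
    sequence numbers show that messages were dropped."""
    gaps = []
    prev = None
    for seq in seq_numbers:
        if prev is not None and seq != prev + 1:
            gaps.append((prev + 1, seq))
        prev = seq
    return gaps

print(detect_gaps([1, 2, 3, 7, 8]))  # [(4, 7)] -- messages 4 through 6 are missing
```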
A direct market feed which originally includes more than one timestamp field is normalized to a schema that discards all but one of those timestamps. This introduces imprecision when the normalized data is used for strategy backtesting. For example, a proprietary exchange feed may include both match (Tag 60) and sending (Tag 52) timestamps, but a vendor's schema may preserve only one of the two.
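One measurement that becomes impossible once a timestamp is dropped is the venue's own matching-to-publication delay. A sketch, with hypothetical field names standing in for the two FIX tags:

```python
# Hypothetical field names standing in for FIX tag 60 (TransactTime / match
# time) and tag 52 (SendingTime). Both are nanoseconds since the Unix epoch.
def publish_latency_ns(ts_match_ns: int, ts_sent_ns: int) -> int:
    """Time the venue spent between matching an order and sending the message.
    If the schema keeps only one of the two timestamps, this quantity (and any
    backtest model that depends on it) can no longer be recovered."""
    return ts_sent_ns - ts_match_ns

print(publish_latency_ns(1_654_871_400_000_000_000, 1_654_871_400_000_045_000))  # 45000 ns
```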
A normalized data feed deliberately discards data when the network or system is unable to keep up.
A data source is cleaned during the normalization process using future information. This enhances the historical data with artificial information that may not have been actionable in real time. The data may be reordered; trades that were canceled after the end of the market session may be removed; or prices may be adjusted with information from a future rollover or dividend event.
The normalized schema represents null values in a way that changes their meaning. Some data feeds will represent null prices with zeros or a negative value, like -1. This can introduce errors downstream if the price is interpreted as non-null. It is also a problem if your application needs to handle both asset classes that can have negative prices (such as futures and spreads) and asset classes that cannot.
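One way to avoid the ambiguity, sketched below, is to reserve a sentinel far outside the plausible price range and map it to a real null at the boundary; the specific sentinel value and helper are illustrative assumptions:

```python
from typing import Optional

# Illustrative sentinel convention: the maximum signed 64-bit integer marks a
# null price, leaving the entire negative range free for genuinely negative
# prices on futures and spreads.
UNDEF_PRICE = 2**63 - 1

def price_or_none(raw: int) -> Optional[int]:
    """Map the sentinel to a real null instead of a misleading 0 or -1."""
    return None if raw == UNDEF_PRICE else raw

assert price_or_none(UNDEF_PRICE) is None
assert price_or_none(-37_630_000_000) == -37_630_000_000  # negative prices pass through
```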
The source schema and normalized schema are mismatched. A direct market feed with an order-based schema is normalized to a vendor's schema that only provides aggregated market depth.
The normalized schema does not preserve packet-level structure. Many markets publish multiple events within a single packet. Without packet-level structure, the data may create the appearance of artificial trading opportunities between any two events within a packet.
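A consumer can guard against this by acting only on complete packet groups. The sketch below assumes each record carries a hypothetical `is_last_in_packet` flag:

```python
def iter_packets(records):
    """Yield lists of events that the venue published in the same packet,
    assuming each record carries a hypothetical `is_last_in_packet` flag.
    Acting only on complete groups avoids trading 'opportunities' between two
    events that were never separately observable."""
    group = []
    for record in records:
        group.append(record)
        if record["is_last_in_packet"]:
            yield group
            group = []

packets = list(iter_packets([
    {"type": "trade", "is_last_in_packet": False},
    {"type": "book_update", "is_last_in_packet": True},
]))
print(len(packets))  # 1 -- the trade and the book update arrive as one unit
```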
The normalized schema represents prices in a type that loses precision. Many vendors use floating point representation for prices, losing precision past 6 decimal places. This can create issues for trading Japanese yen spot rates, fixed income instruments and cryptocurrencies.
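A toy demonstration of this failure mode, using a low-priced instrument quoted beyond 6 decimal places:

```python
from decimal import Decimal

price = "0.000000015"  # e.g. a low-priced crypto pair quoted beyond 6 decimals

# A schema that rounds floating-point prices to 6 decimal places collapses
# this price to zero.
print(round(float(price), 6))       # 0.0 -- the price is gone

# A signed fixed-decimal integer at 1e-9 per unit preserves it exactly.
fixed = int(Decimal(price) * 10**9)
print(fixed)                        # 15
print(Decimal(fixed) / 10**9)       # 1.5E-8, i.e. 0.000000015 exactly
```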
The normalized schema does not adequately standardize information across multiple datasets. This often happens when normalizing less commonly used features, such as matching engine statuses or instrument definitions, and it puts a significant burden on the user, who must still study the specifications of the various source schemas to determine what information was lost.
The normalized schema adopts a proprietary symbology that is different from the original source's symbology. Sometimes, such proprietary symbology cannot be mapped back to the original. For example, some vendors adopt a symbology system that only includes the lead months of futures contracts, causing the far month contracts to be discarded.
The normalized schema encodes fields in types that are inefficient to parse or store. Common cases of this include representing timestamps as ISO 8601 strings or prices as strings, especially on vendor feeds that use JSON encoding.
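A small, hypothetical example of why this matters: every field costs a string parse on every message, and converting the timestamp to a standard microsecond-resolution datetime would silently drop the last three fractional digits:

```python
import json
from decimal import Decimal

# Hypothetical JSON-encoded message of the kind some vendor feeds emit.
raw = '{"ts": "2022-06-10T14:30:00.123456789Z", "px": "101.250000000"}'
msg = json.loads(raw)

# Splitting the timestamp by hand keeps the nanosecond digits visible;
# Python's datetime type only resolves microseconds.
_, _, frac = msg["ts"].rstrip("Z").partition(".")
print(frac)                 # 123456789 -- nine fractional digits to preserve
print(Decimal(msg["px"]))   # 101.250000000, exact but parsed per message
```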
The normalized schema provides incremental changes but does not provide snapshots or replay of order book state. This presents an issue when using the normalized data in real-time, as the user loses information in the event of a disconnection or late join.
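The sketch below, with an illustrative (price, size) record layout, shows why the snapshot matters: incremental changes can only be applied to a known starting state:

```python
def rebuild_book(snapshot, increments):
    """Apply incremental depth updates on top of a snapshot of price levels.
    The (price, size) layout is illustrative; size 0 deletes a level. Without
    the snapshot, a client that joins late or reconnects has no state for the
    increments to modify."""
    book = dict(snapshot)  # price -> size
    for price, size in increments:
        if size == 0:
            book.pop(price, None)
        else:
            book[price] = size
    return book

snapshot = {4150.00: 25, 4150.25: 40}
increments = [(4150.00, 0), (4150.50, 10)]
print(rebuild_book(snapshot, increments))  # {4150.25: 40, 4150.5: 10}
```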
A direct market feed which originally includes nanosecond resolution timestamps is normalized to a schema that truncates the timestamps to a lower resolution. Some vendors have a legacy data schema designed for older FIX dialects, forcing them to truncate nanosecond resolution timestamps found in modern markets to millisecond resolution.
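A quick illustration of what truncation to millisecond resolution discards:

```python
ts_event_ns = 1_654_871_400_123_456_789  # nanosecond timestamp from the venue

# Truncating to millisecond resolution, as a legacy schema might, discards the
# last six digits, so any two events within the same millisecond become
# indistinguishable in time.
ts_ms = ts_event_ns // 1_000_000
print(ts_ms)                             # 1654871400123
print(ts_event_ns - ts_ms * 1_000_000)   # 456789 nanoseconds lost
```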