Normalization
Quick definition
Normalization is the process of converting financial data in various source formats from different trading venues, exchanges, or data publishers to a single, standardized format.
This is most often applied in the context of market data, where it refers to the process of taking raw market data in different native formats and converting it to normalized market data. However, it can also apply to other forms of financial data like instrument definitions and reference data.
What is normalization?
The primary advantage of normalizing market data is that it improves usability by providing a consistent format. This greatly simplifies the downstream code and business logic, which then only has to work with a single format or set of formats and doesn't need to have exchange-specific or venue-specific logic.
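For example, downstream logic might consume a single normalized message type regardless of which venue produced it. The sketch below is hypothetical; the type and field names are invented for illustration and are not any particular vendor's schema:

```python
from dataclasses import dataclass

# Hypothetical normalized trade message; real vendors define their own schemas.
@dataclass
class NormalizedTrade:
    instrument_id: int  # venue-neutral instrument identifier
    ts_event: int       # event timestamp, nanoseconds since the UNIX epoch (UTC)
    price: int          # fixed-decimal price, assumed here in units of 1e-9
    size: int           # quantity in a standardized unit

def on_trade(trade: NormalizedTrade) -> None:
    # Downstream logic works against one schema; no venue-specific branching.
    notional = trade.price * 1e-9 * trade.size
    print(f"instrument {trade.instrument_id}: notional={notional:,.2f}")

on_trade(NormalizedTrade(instrument_id=42, ts_event=1717407000000000000,
                         price=1_234_560_000_000, size=10))
```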
Normalization will usually involve some degree of lossiness, but this trade-off can be beneficial. By simplifying the schema and removing unnecessary messages or fields, normalization reduces storage and bandwidth requirements while improving application performance.
For instance:
- Administrative messages and heartbeats, which may not be useful for generating trading signals, may be omitted to reduce storage size and bandwidth.
- Additional timestamps or timestamp precision may be irrelevant to a lower-frequency trading strategy and may also be removed to reduce storage size and bandwidth.
- Floating-point prices lose the exactness of fixed-decimal integer prices, but may introduce only negligible modeling error while being easier to use with numerical libraries and mathematical routines (see the sketch after this list).
- In highly liquid markets, order book snapshots may be unnecessary for recovery: rapid turnover ensures that pre-existing orders are quickly filled or canceled, so the order book state eventually becomes consistent on its own.
- Complex derivatives may have reference data or instrument definitions with a wider set of fields and properties. These often include classification codes, which can be useful for administrative purposes and instrument search but irrelevant to a trading strategy.
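To make the floating-point trade-off above concrete, here is a small sketch; the fixed-decimal convention of integer prices in units of 1e-9 is an assumption, as conventions vary by feed:

```python
PRICE_SCALE = 1e-9  # assumed fixed-decimal convention: integer units of 1e-9

raw_price = 1_234_567_890_000      # exact integer encoding of 1234.56789
float_price = raw_price * PRICE_SCALE

# Easier to feed into numerical libraries, but no longer guaranteed exact:
print(f"{float_price:.12f}")       # approximately 1234.567890000000
print(float_price == 1234.56789)   # may be False due to binary rounding
```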
Most trading participants only have access to normalized market data, not actual raw market data. Some vendors may advertise their normalized data as "unfiltered", "raw", "lossless", or "L3", but this is usually misleading marketing: the vendor is not passing on a direct feed, a raw multicast feed, or historical packet captures, and very few vendors actually distribute raw market data.
A downside of normalized market data, however, is that it is more prone to errors introduced in the normalization process. Moreover, the normalization usually takes place within a data vendor's internal systems, which are opaque to end users. This makes it nearly impossible to verify lossiness without the painstaking effort of triangulating the data against other vendors' normalized market data or comparing it to raw market data like packet captures.
Raw market data, in the original source format or native wire protocol, leaves almost no room for errors. Errors in raw market data are usually limited to gaps, packet loss, and bit rot, which are easily identifiable from sequence numbers or checksums.
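A gap check of this kind is simple to implement. Here is an illustrative sketch that assumes each message carries a monotonically increasing sequence number:

```python
def find_gaps(sequence_numbers):
    """Yield (expected, received) pairs wherever the feed skipped messages."""
    expected = None
    for seq in sequence_numbers:
        if expected is not None and seq != expected:
            yield expected, seq
        expected = seq + 1

# Example: messages 3 and 4 were dropped somewhere upstream.
print(list(find_gaps([1, 2, 5, 6])))  # [(3, 5)]
```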
Another downside of normalized market data is that it may be too lossy, discarding valuable information that more sophisticated participants with access to raw market data can still exploit.
Most sophisticated trading participants will adopt a combination of raw and normalized market data depending on the use case, striking a balance between manageability and granularity. Some data vendors, like Databento, provide both raw and normalized market data through packet capture and API services.
Normalization typically involves several kinds of transformations, a few of which are illustrated in the sketch after this list:
- Unit standardization: Different markets may report prices in different units (cents, dollars, index points) or currencies. Normalized market data usually adopts a uniform standard for units.
- Field mapping: Field names may differ slightly from one trading venue to another. For example, terms like "size", "depth", and "volume" often refer to the same concept, and normalization maps them to a single consistent field name.
- Field deduplication: Some trading venues or data publishers include redundant fields in instrument definitions or incremental event messages for backward compatibility. Normalized data usually strips out such redundant fields.
- Symbol standardization: Financial data usually comes with ticker symbols or instrument identifiers that differ across exchanges or data publishers. Normalization usually involves mapping these identifiers to a consistent symbology, such as ISIN, FIGI, RIC, CMS, or CUSIP.
- Timezone standardization: Exchanges often publish their raw market data in their local timezone, while normalized market data usually adopts a single timezone, such as UTC or the local timezone of the trading firm using the data.
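Putting a few of these operations together, a toy normalizer might look like the following sketch; the venue payloads, field names, scale factors, and timezones are all invented for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical per-venue rules: field names, price units, and local timezones.
VENUE_RULES = {
    "VENUE_A": {"size_field": "volume", "price_scale": 0.01,  # prices in cents
                "tz": ZoneInfo("America/Chicago")},
    "VENUE_B": {"size_field": "depth",  "price_scale": 1.0,   # prices in dollars
                "tz": ZoneInfo("Europe/London")},
}

def normalize(venue: str, msg: dict) -> dict:
    rules = VENUE_RULES[venue]
    local_ts = datetime.fromisoformat(msg["ts"]).replace(tzinfo=rules["tz"])
    return {
        "symbol": msg["symbol"],                       # symbology mapping would go here
        "price": msg["price"] * rules["price_scale"],  # unit standardization
        "size": msg[rules["size_field"]],              # field mapping
        "ts": local_ts.astimezone(timezone.utc),       # timezone standardization
    }

print(normalize("VENUE_A", {"symbol": "ABC", "price": 12345, "volume": 10,
                            "ts": "2024-06-03T09:30:00"}))
```

In practice, per-venue rules like these are usually derived from each exchange's published specifications rather than hand-written, but the shape of the transformation is the same.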
Despite the similar name, normalization in the context of finance is unrelated to database or schema normalization, and the two shouldn't be confused. Database normalization refers to structuring the schemas and tables of a relational database to reduce duplication of data.