Author: Richie R. Ma
The Python package cmemdp is inspired by the author's R package cme.mdp, published in April 2025. It covers almost all features of that package and adds other important functions. The goal is to make market data cleaning easier and more user-friendly, and to help users learn how modern financial markets work from a microstructure perspective.
I substantially revised the parser function to make it more efficient in batch-job environments; the previous version was not suited to data-intensive usage. The function automatically saves the parsed data to the target directory. By default it saves every 5,000 packets, and this interval can be changed.
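The idea of saving every fixed number of packets can be illustrated with a minimal sketch. This is not the package's implementation: `save_in_chunks`, `_flush`, and the `chunk_XXXXX.pkl` file names are hypothetical, and the "packets" are plain dictionaries standing in for parsed records.

```python
import pickle
import tempfile
from pathlib import Path


def _flush(buffer, out_dir, index):
    """Pickle one chunk of records and return the file path (hypothetical naming)."""
    path = out_dir / f"chunk_{index:05d}.pkl"
    with open(path, "wb") as fh:
        pickle.dump(buffer, fh)
    return path


def save_in_chunks(records, out_dir, chunk_size=5000):
    """Write `records` to numbered pickle files, `chunk_size` records per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) == chunk_size:
            written.append(_flush(buffer, out_dir, len(written)))
            buffer = []  # memory use stays bounded by one chunk
    if buffer:  # final partial chunk
        written.append(_flush(buffer, out_dir, len(written)))
    return written


# Demo with 12 fake records and a small chunk size.
with tempfile.TemporaryDirectory() as tmp:
    fake_packets = ({"seq": i} for i in range(12))
    files = save_in_chunks(fake_packets, tmp, chunk_size=5)
    chunk_sizes = []
    for f in files:
        with open(f, "rb") as fh:
            chunk_sizes.append(len(pickle.load(fh)))

print(chunk_sizes)  # [5, 5, 2]
```

Flushing at a fixed interval keeps memory use roughly constant for long batch jobs, at the cost of producing several files that have to be combined later.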
CME may not disseminate the first snapshot of the limit order book through the MBP template, but the snapshot can be recovered through the MBO template. I therefore provide another function, sunday_recover, to reconstruct the first snapshot from MBO messages.
Financial markets have become more transparent, and exchanges can provide high-frequency data so that traders can better monitor markets. This has increased demand for high-frequency data, both real-time and historical, in academia and industry. Most exchanges do not disseminate complete, tabulated market data to non-member market participants, and almost all market data are specially encoded to improve communication efficiency, for example with binary protocols (Simple Binary Encoding at the CME). Financial economists therefore first need to know how to clean these non-tabular data, which is a time-consuming task and not very user-friendly. This project focuses on how to parse and clean Chicago Mercantile Exchange (CME) market data under the FIX and binary (new feature) protocols, and provides faster limit order book reconstruction without explicit for-loop statements for outright, implied, or consolidated books.
CME currently provides Market by Price (MBP) data, which aggregate individual order information (e.g., size) at every price level, and Market by Order (MBO) data, which show individual order details (e.g., order priority) at each price level. Both formats are included in the raw PCAP data. The MBO data also give more information in trade summaries than the MBP data, so traders can see which limit orders are matched in each trade and the corresponding matched quantities. The detailed trade summaries also assign trade direction more precisely than the MBP data, and no quote merge is required for almost all trades. In general, CME disseminates the MBP incremental updates followed by the order-level details (e.g., submission, cancellation) that explain the MBP updates. This package takes these characteristics into account and can process both MBP and MBO data, including quote messages and trade summaries.
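The relationship between the two views can be shown with a small pandas sketch: summing the resting orders of an MBO book at each price gives the MBP levels. The data and the column names (`order_id`, `side`, `price`, `size`) are made up for illustration and are not the package's output schema.

```python
import pandas as pd

# Hypothetical resting orders (MBO view): one row per order.
mbo = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "side": ["bid", "bid", "bid", "ask", "ask"],
    "price": [99.75, 99.75, 99.50, 100.00, 100.00],
    "size": [3, 2, 10, 4, 1],
})

# MBP view: total size and number of orders at each price level.
mbp = (
    mbo.groupby(["side", "price"], as_index=False)
       .agg(total_size=("size", "sum"), n_orders=("order_id", "count"))
)
print(mbp)
```

The aggregation only goes one way: order priority and individual order sizes cannot be recovered from the MBP view, which is why the MBO data carry more information.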
Users are strongly encouraged to install the package from GitHub.
pip install git+https://github.com/richie-ma/cmemdp.git
This package closely mirrors the basic features of the cme.mdp package in R. Users can refer to the R package documentation.
CME packet capture (PCAP) data are the raw dataset that captures all public data messages. CME specifies a complete message template schema that uses different numbers to identify message types. The raw messages are stored byte-wise in a binary format that is not human-readable. The message structure in real PCAP data, including the technical header, packet header, payload, etc., is as follows:
# Sequence | Time | Message Size | Block Length | Template ID | Schema ID | Version | FIX header | FIX Message Body |
# (Packet Header) | (2 bytes) |........(Simple Binary Header, 8 bytes)..........|.......(FIX message)...........|
# (12 bytes) |..........................(message header)........................|
# |.................................... MDP messages.................................................|
The PCAP data from CME Datamine are not real PCAP data, but they contain the main parts of real PCAP data. The message structure of the CME Datamine PCAP data is as follows:
# Channel | Length | Sequence | Time | Message Size | Block Length | Template ID | Schema ID | Version | FIX header | FIX Message Body |
# 2 bytes 2 bytes| (Packet Header)| (2 bytes) |........(Simple Binary Header, 8 bytes).......... |.......(FIX message)...........|
# | (12 bytes) |..........................(message header)........................|
# |.................................... MDP messages.................................................|
This package can handle both scenarios. For real PCAP data, use the cme_parser_pcap function, which parses the raw data based on the standard PCAP format. For PCAP data from CME Datamine, use cme_parser_datamine. Both functions are expected to give the same output.
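To make the byte layout concrete, here is a minimal sketch that reads the headers with Python's struct module. It is not the package's parser. It assumes the 12-byte packet header is a 4-byte sequence number followed by an 8-byte sending time, that the MDP fields are little-endian, and that the two Datamine prefix fields use the same byte order; the last point in particular should be checked against real data. The function name `parse_headers` and all sample values are hypothetical.

```python
import struct

# Layouts following the diagrams above ("<" = little-endian, no padding).
DATAMINE_PREFIX = struct.Struct("<HH")  # channel, length (byte order is an assumption)
PACKET_HEADER = struct.Struct("<IQ")    # sequence (4 bytes) + sending time (8 bytes) = 12 bytes
MESSAGE_SIZE = struct.Struct("<H")      # message size, 2 bytes
SBE_HEADER = struct.Struct("<HHHH")     # block length, template ID, schema ID, version = 8 bytes


def parse_headers(buf, offset=0, datamine=False):
    """Return the header fields and the offset where the message body starts."""
    fields = {}
    if datamine:
        fields["channel"], fields["length"] = DATAMINE_PREFIX.unpack_from(buf, offset)
        offset += DATAMINE_PREFIX.size
    fields["sequence"], fields["sending_time"] = PACKET_HEADER.unpack_from(buf, offset)
    offset += PACKET_HEADER.size
    (fields["message_size"],) = MESSAGE_SIZE.unpack_from(buf, offset)
    offset += MESSAGE_SIZE.size
    (fields["block_length"], fields["template_id"],
     fields["schema_id"], fields["version"]) = SBE_HEADER.unpack_from(buf, offset)
    offset += SBE_HEADER.size
    return fields, offset


# Build a synthetic packet with made-up values and parse it back.
body = b"\x00" * 11
packet = (
    PACKET_HEADER.pack(1234, 1745186400000000000)
    + MESSAGE_SIZE.pack(MESSAGE_SIZE.size + SBE_HEADER.size + len(body))
    + SBE_HEADER.pack(len(body), 46, 1, 9)
    + body
)
datamine_packet = DATAMINE_PREFIX.pack(318, len(packet)) + packet

plain_fields, plain_offset = parse_headers(packet)
dm_fields, dm_offset = parse_headers(datamine_packet, datamine=True)
print(plain_fields)
print(dm_fields, dm_offset)
```

The only difference between the two layouts in this sketch is the 4-byte prefix, which is consistent with both parser functions being expected to give the same output once the prefix is skipped.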
Sample PCAP data can be obtained from the CME Datamine Sales Team. The example below uses data from April 20, 2025. One can see the number of messages for each message type. The function returns a dictionary of pandas DataFrames, which is convenient for further manipulation. Users can also specify how many packets to process; by default, all packets in the data are read. To obtain the message sequence numbers and sending timestamps, set cme_header = True. Currently, the function supports saving files in pickle format. The message template script goes back to 2017, when CME first made the PCAP data public.
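Because the result is a dictionary of DataFrames, counting the messages of each type takes one line. The sketch below uses a hand-made dictionary as a stand-in for the parser output; the keys and columns are hypothetical placeholders, so check the actual names returned on your data.

```python
import pandas as pd

# Stand-in for the parser output: message name -> DataFrame (names are hypothetical).
parsed = {
    "book_updates": pd.DataFrame({"MDEntryPX": [1, 2, 3], "MDEntrySize": [5, 6, 7]}),
    "trade_summaries": pd.DataFrame({"MDEntryPX": [2], "MDEntrySize": [1]}),
}

# Number of rows (messages) per message type, largest first.
counts = pd.Series({name: len(df) for name, df in parsed.items()}, name="n_messages")
counts = counts.sort_values(ascending=False)
print(counts)
```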
from cmemdp.cme_parser import cme_parser_datamine

# Parse a CME Datamine PCAP file and save the parsed data every chunk_size packets.
example = cme_parser_datamine(
    path="R:/_RawData/PCAP/20250420-PCAP_318_0_0_0_e",
    max_read_packets=None,  # None reads all packets in the file
    cme_header=True,  # keep message sequence numbers and sending timestamps
    save_file_path="R:/_RawData/PCAP/",
    disable_progress_bar=False,
    chunk_size=5000,
)
Users need to deal with the timestamp conversions and are encouraged to use the timestamp_conversion function. Users also need to handle the display format of the price in the MDEntryPX column. Basically, the numbers shown in the MDEntryPX column needs to times
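For the timestamp conversion mentioned above, a pandas-only sketch may help. It does not use or reproduce the package's timestamp_conversion function, whose signature is not shown here. It assumes the raw sending times are integer nanoseconds since the Unix epoch in UTC, and the column name `SendingTime` and the sample values are made up.

```python
import pandas as pd

# Hypothetical raw sending times: nanoseconds since 1970-01-01 UTC (assumption).
raw = pd.Series([1745186400000000000, 1745186400000001500], name="SendingTime")

# Interpret as UTC, then express in Chicago local time.
utc = pd.to_datetime(raw, unit="ns", utc=True)
central = utc.dt.tz_convert("America/Chicago")
print(central)
```

Keeping the column as a timezone-aware datetime preserves nanosecond precision, which matters when ordering messages that arrive within the same microsecond.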
CME Market by Price messages show the depth of the limit order book incrementally, so this package reconstructs the limit order book from MBP messages. The function also supports the consolidated limit order book, which combines outright quotes generated by traders with implied quotes generated by the CME implied functionality. The function currently lives in the FIX_input module, but it can also be used on the output of the binary data cleaning as long as the columns are named as this package names them for FIX (a future version might bridge the two).
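The general idea of rebuilding a book from incremental updates without an explicit loop can be sketched in pandas. This is a simplified illustration, not the package's algorithm: it keys levels by price rather than by book level, ignores the level numbers and implicit level shifts that a real MBP feed carries, and uses made-up data and column names. The action codes follow the FIX MDUpdateAction convention (0 = new, 1 = change, 2 = delete).

```python
import pandas as pd

# Hypothetical MBP incremental updates for one instrument, in arrival order.
updates = pd.DataFrame({
    "seq":    [1, 2, 3, 4, 5, 6, 7],
    "side":   ["bid", "ask", "bid", "ask", "bid", "bid", "ask"],
    "price":  [99.50, 100.00, 99.75, 100.25, 99.50, 99.75, 100.00],
    "size":   [10, 4, 5, 7, 12, 0, 6],
    "action": [0, 0, 0, 0, 1, 2, 1],
})


def book_as_of(updates, seq, depth=10):
    """Return (bids, asks) after applying all updates up to and including `seq`."""
    seen = updates[updates["seq"] <= seq]
    # The last update per (side, price) is the current state of that level.
    latest = seen.drop_duplicates(["side", "price"], keep="last")
    live = latest[latest["action"] != 2]  # drop levels whose last update is a delete
    bids = live[live["side"] == "bid"].sort_values("price", ascending=False).head(depth)
    asks = live[live["side"] == "ask"].sort_values("price", ascending=True).head(depth)
    cols = ["price", "size"]
    return bids[cols].reset_index(drop=True), asks[cols].reset_index(drop=True)


bids_4, asks_4 = book_as_of(updates, seq=4)
bids_7, asks_7 = book_as_of(updates, seq=7)
print(bids_7)
print(asks_7)
```

A consolidated book could be approximated in the same spirit by concatenating outright and implied levels and summing sizes at each price, again only as a sketch.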
I acknowledge the financial support from the Bielfeldt Office for Futures and Options Research (OFOR) at the University of Illinois at Urbana-Champaign. I also acknowledge prior work by former OFOR members, including but not limited to Anabelle Couleau and Siyu Bian. Some of the code is heavily inspired by their work. OFOR has signed a non-disclosure agreement with the CME, and only sample data are used here for illustration purposes.
Please post an issue on the GitHub repository.