Using Apache Spark and Spark MLlib to build a price movement prediction model from order log data.
The code for this application can be found on GitHub.
This post is based on the paper Modeling High-Frequency Limit Order Book Dynamics with Support Vector Machines. Roughly speaking, I'm implementing the ideas introduced in this paper in Scala with Spark and Spark MLlib. The authors use sampling; I'm going to use the full order log from NYSE (sample data is available from the NYSE FTP), simply because I can do it easily with Spark. Instead of SVM, I'm going to use the Decision Tree algorithm for classification, because in Spark MLlib it supports multiclass classification out of the box.
If you want a deep understanding of the problem and the proposed solution, you need to read the paper. I'm going to give a high-level overview of the problem in less academic language, in one or two paragraphs.
Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.
The authors propose a framework for extracting feature vectors from raw order log data that can be used as input to a machine learning classification method (SVM or Decision Tree, for example) to predict price movement (Up, Down, Stationary). Given a set of training data with assigned labels (price movements), the classification algorithm builds a model that assigns new examples to one of the pre-defined categories.
In the table, each row of the message book represents a trading event that could be either an order submission, an order cancellation, or an order execution. The arrival time, measured from midnight, is in seconds and nanoseconds; price is in US dollars; and volume is in number of shares. Ask: I'm selling and asking for this price. Bid: I want to buy at this price.
From this log it's very easy to reconstruct the state of the order book after each entry. You can read more about order books and limit order books on Investopedia; I'm not going to dive into the details. The concepts are easy to understand.
An electronic list of buy and sell orders for a specific security or financial instrument, organized by price level.
Feature Extraction and Training Data Preparation
After order books are reconstructed from the order log, we can derive the attributes that will form the feature vectors used as input to the classification model. Attributes are divided into three categories: basic, time-insensitive, and time-sensitive, all of which can be computed directly from the data.
Attributes in the basic set are the prices and volumes at both the ask and bid sides up to n = 10 different levels (that is, price levels in the order book at a given moment), which can be fetched directly from the order book. Attributes in the time-insensitive set are easily computed from the basic set at a single point in time. Among these, the bid-ask spread and mid-price, price ranges, as well as average price and volume at different price levels are calculated; v5 is designed to track the accumulated differences of price and volume between the ask and bid sides. By further taking the recent history of the current data into consideration, we derive the features in the time-sensitive set. More about calculating the other attributes can be found in the original paper.
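As a concrete illustration, the basic and time-insensitive attributes can be sketched in Scala as follows. The object and method names here are my own, not the post's actual code; each side of the book is represented as a sequence of (price, volume) levels, best price first:

```scala
object Features {
  // One side of the book: (price, volume) levels, best price first, up to n = 10
  type Levels = Vector[(Double, Int)]

  // Basic set: prices and volumes at the top n levels, flattened into one vector
  def basic(asks: Levels, bids: Levels): Vector[Double] =
    (asks ++ bids).flatMap { case (p, v) => Vector(p, v.toDouble) }

  // Time-insensitive set: derived from the basic set at a single point in time
  def bidAskSpread(asks: Levels, bids: Levels): Double = asks.head._1 - bids.head._1
  def midPrice(asks: Levels, bids: Levels): Double     = (asks.head._1 + bids.head._1) / 2.0
  def priceRange(side: Levels): Double                 = math.abs(side.last._1 - side.head._1)
  def avgPrice(side: Levels): Double                   = side.map(_._1).sum / side.size
  def avgVolume(side: Levels): Double                  = side.map(_._2).sum.toDouble / side.size

  // v5-style accumulated price and volume differences between ask and bid sides
  def accumulatedDiffs(asks: Levels, bids: Levels): (Double, Double) = {
    val pairs = asks.zip(bids)
    (pairs.map { case ((pa, _), (pb, _)) => pa - pb }.sum,
     pairs.map { case ((_, va), (_, vb)) => (va - vb).toDouble }.sum)
  }
}
```

This is only a sketch of the attribute definitions under the level representation above; the real feature set in the paper contains more attributes.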
Labeling Training Data
To prepare training data for machine learning it's also required to label each point with the price movement observed over some time horizon (1 second, for example). It's a straightforward task that only requires two order books: the current order book and the order book after some period of time.
I'm going to use a MeanPriceMove label that can be Up, Down, or Stationary.
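A minimal sketch of what such a label might look like in Scala; the trait and method names here are my assumptions. The label is derived from two mean prices (the midpoint between best ask and best bid) taken one horizon apart:

```scala
sealed trait MeanPriceMove

object MeanPriceMove {
  case object Up         extends MeanPriceMove
  case object Down       extends MeanPriceMove
  case object Stationary extends MeanPriceMove

  // Compare the mean price now with the mean price one time horizon later
  def of(currentMeanPrice: Double, futureMeanPrice: Double): MeanPriceMove =
    if (futureMeanPrice > currentMeanPrice) Up
    else if (futureMeanPrice < currentMeanPrice) Down
    else Stationary
}
```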
Order Log Data
TAQ (Trades and Quotes) historical data products provide a varying range of market depth on a T+1 basis for covered markets. TAQ data products are used to develop and backtest trading strategies, analyze market trends as seen in a real-time ticker plant environment, and research markets for regulatory or audit activity.
Prepare Training Data
An OrderBook is two sorted maps, where the key is price and the value is volume.
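A sketch of that structure, with illustrative method names (not necessarily the ones used in the actual code):

```scala
import scala.collection.immutable.SortedMap

// Two sorted maps: price -> total volume at that price level.
// Asks are sorted ascending and bids descending, so `headOption`
// always yields the best price on each side.
case class OrderBook(
  asks: SortedMap[Double, Int] = SortedMap.empty[Double, Int],
  bids: SortedMap[Double, Int] = SortedMap.empty[Double, Int](Ordering[Double].reverse)
) {
  def addAsk(price: Double, volume: Int): OrderBook =
    copy(asks = asks.updated(price, asks.getOrElse(price, 0) + volume))

  def addBid(price: Double, volume: Int): OrderBook =
    copy(bids = bids.updated(price, bids.getOrElse(price, 0) + volume))

  // A cancellation or execution removes volume; empty levels are dropped
  def removeAsk(price: Double, volume: Int): OrderBook = {
    val remaining = asks.getOrElse(price, 0) - volume
    copy(asks = if (remaining <= 0) asks - price else asks.updated(price, remaining))
  }

  def bestAsk: Option[Double] = asks.headOption.map(_._1)
  def bestBid: Option[Double] = bids.headOption.map(_._1)

  // Mean (mid) price between best ask and best bid
  def meanPrice: Option[Double] =
    for (a <- bestAsk; b <- bestBid) yield (a + b) / 2.0
}
```

Replaying the order log through `addAsk`/`addBid`/`removeAsk`-style updates is what reconstructs the book state after each entry.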
I'm using Cell from the Framian library to represent extracted feature values. It can be a Value, NA (not available), or NM (not meaningful).
As defined in the original paper, we have three feature sets: the first two are calculated from the OrderBook, while the last one requires an OrdersTrail, which is effectively a window computation over the raw order log.
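The OrdersTrail itself might be sketched as a sliding time window over the order log, from which time-sensitive attributes (for example, order arrival counts per side) can be computed. All names in this sketch are my assumptions:

```scala
// One raw order log entry (arrival time measured from midnight, in nanoseconds here)
case class OrderLogEntry(timeNanos: Long, isBid: Boolean, price: Double, volume: Int)

// Sliding window over the order log: keeps only entries within the last
// `windowNanos` nanoseconds relative to the newest entry (inclusive)
case class OrdersTrail(windowNanos: Long, entries: Vector[OrderLogEntry] = Vector.empty) {
  def add(entry: OrderLogEntry): OrdersTrail =
    copy(entries = (entries :+ entry).dropWhile(_.timeNanos < entry.timeNanos - windowNanos))

  // Example of a time-sensitive attribute: order arrival intensity per side
  def bidArrivals: Int = entries.count(_.isBid)
  def askArrivals: Int = entries.count(!_.isBid)
}
```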
Label Training Data
To extract labeled data from orders I'm using an extractor that can be constructed nicely with a builder. The extractor will prepare labeled points using the MeanPriceMovementLabel with 3 features: ask price, bid price, and mean price.
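The builder-style construction might look like the sketch below. The class name LabeledPointsExtractor and all method names are my guesses at the API shape, not the post's actual code, and I return plain (label, features) tuples instead of MLlib's LabeledPoint to keep the sketch dependency-free (the Decision Tree algorithm expects class labels encoded as Doubles):

```scala
// Given a time-ordered series of (askPrice, bidPrice) snapshots and a
// horizon (in snapshots), emit (label, features) pairs. Names assumed.
class LabeledPointsExtractor(horizon: Int) {
  // label encoding: 0.0 = Stationary, 1.0 = Up, 2.0 = Down
  def labeledPoints(series: Vector[(Double, Double)]): Vector[(Double, Array[Double])] =
    series.indices.dropRight(horizon).toVector.map { i =>
      val (ask, bid)   = series(i)
      val mean         = (ask + bid) / 2.0
      val (fAsk, fBid) = series(i + horizon)
      val futureMean   = (fAsk + fBid) / 2.0
      val label =
        if (futureMean > mean) 1.0
        else if (futureMean < mean) 2.0
        else 0.0
      (label, Array(ask, bid, mean))   // the 3 features: ask, bid, mean price
    }
}

object LabeledPointsExtractor {
  // Minimal builder mirroring the "constructed nicely with builder" idea
  case class Builder(horizon: Int = 1) {
    def withHorizon(h: Int): Builder = copy(horizon = h)
    def build(): LabeledPointsExtractor = new LabeledPointsExtractor(horizon)
  }
  def builder: Builder = Builder()
}
```

Usage would then read fluently, e.g. `LabeledPointsExtractor.builder.withHorizon(1).build()`.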
Run Classification Model
In the "real" application I'm using 36 features from all 3 feature sets. I ran my tests with sample data from the NYSE FTP: EQY_US_NYSE_BOOK_20130403 for model training and EQY_US_NYSE_BOOK_20130404 for model validation.
Here is the output of running Decision Tree classification for a single symbol.
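For reference, the MLlib call for training a multiclass Decision Tree has roughly the following shape; the parameter values here are illustrative, not necessarily the ones used for the result that follows, and the snippet assumes a running Spark context providing `RDD[LabeledPoint]` inputs:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train a 3-class (Up / Down / Stationary) Decision Tree and return the
// fraction of validation points classified correctly.
def trainAndValidate(trainingData: RDD[LabeledPoint],
                     validationData: RDD[LabeledPoint]): Double = {
  val model = DecisionTree.trainClassifier(
    trainingData,
    numClasses = 3,
    categoricalFeaturesInfo = Map.empty[Int, Int],  // all features continuous
    impurity = "gini",
    maxDepth = 5,      // illustrative values
    maxBins = 32)

  val correct = validationData
    .map(p => (model.predict(p.features), p.label))
    .filter { case (predicted, actual) => predicted == actual }
    .count()

  correct.toDouble / validationData.count()
}
```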
As you can see, this pretty simple model was able to correctly classify ~70% of the data.
Remark: Despite the fact that this model shows a good success rate, it doesn't mean it can be used to build a profitable automated trading strategy. First of all, I don't check whether it's, say, 95% success predicting Stationary combined with a 95% error rate predicting any actual price movement, averaging out to 70%. The model also doesn't measure the "strength" of a price movement, which has to be sufficient to cover transaction costs. And there are many other details that matter when building a real trading system.
For sure there's huge room for improvement and result validation. Unfortunately it's hard to get enough data; 2 trading days is too small a data set to draw conclusions and start building a system to earn all the money in the world. However, I think it's a good starting point.
I was able to reproduce a fairly complicated research project relatively easily, and at a much larger scale than in the original paper.
The latest Big Data technologies allow building models using all available data, with no need for sampling. Using all of the data helps to build the best possible models and capture all the details from the full data set.
The code for this application can be found on GitHub.