I wrote earlier about our approach for machine learning with Spark at Collective, it was focused on transforming raw data into features that can be used for training a model. At this post I want describe how to assemble multiple building blocks into production application, that efficiently uses Spark cluster and can train/validate hundreds of models.
Training single model is relatively easy, and it’s well covered in Spark documentation and multiple other blog posts. Training hundreds of models can become really tricky from engineering point of view. Spark has lot’s of configuration parameters that can affect cluster performance and stability, and you can use some clever tricks to get higher cluster utilization.
Scale Of The Problem
At Collective we are using Spark and machine learning for online advertising optimization, trying to decide which ads are relevant to which people, at which time and at which web site.
Log data used for training models is huge, billions of rows. Number of users that we target is hundreds of millions. We have hundreds of clients with tens of different campaigns, with different optimization targets and restrictions.
These factors gives an idea of the scale of the problem.
Production Machine Learning Pipeline
Typical machine learning pipeline for one specific ad campaign looks like this:
- Prepare response dataset: based on impression/activity logs define which users should be in positive set (did some actions that we are trying to optimize for, example could be signing up for test drive)
- Prepare train dataset: extract predictors data from logs
- Feauturize: given predictors data extract feature vectors
- Train model: feed features and response data into Spark ML
- Prepare test dataset: data that is going to be used for model performance evaluation
- Evaluate model: for binary classification it can be computing ROC and AUC
Preparing datasets is usually contains of multiple join and filter conditions. Featurization and training built on top of Spark ML Pipeline API, and compose multiple transformers and estimators together. It’s covered in one of previous posts.
Given different nature of each step they have different limiting factors. Preparing dataset is expensive operation and heavily uses shuffle. Model training time is usually dominated by network latency introduced by iterative function optimization algorithm.
Serial execution of these steps can’t efficiently utilize all executor cores. Running multiple models in parallel in our case doesn’t really help as well, multiple models reach modeling stage at almost the same time.
I’ll just put first paragraph from amazing Introduction to scalaz-stream here:
Every application ever written can be viewed as some sort of transformation on data. Data can come from different sources, such as a network or a file or user input or the Large Hadron Collider. It can come from many sources all at once to be merged and aggregated in interesting ways, and it can be produced into many different output sinks, such as a network or files or graphical user interfaces. You might produce your output all at once, as a big data dump at the end of the world (right before your program shuts down), or you might produce it more incrementally. Every application fits into this model.
We model machine learning pipeline as a
scalaz.stream.Process - multistep transformation on data, and use
scalaz-stream combinators to run it
with controlled concurrency and resource safety.
Simple domain model for campaign optimization can be defined like this:
Input would be definition of campaign that needs to be optimized, and output would be model that was trained and evaluated or error if something went wrong.
Each of modeling steps described earlier can be encoded as
Given previously defined modeling steps, optimization pipeline can be defined as
concurrently method, which runs each step with controlled concurrency in separate threads. Steps that are doing heavy shuffles
are running not more than 2 in parallel, in contrast to model training that is relatively lightweight operation and can run with much higher
concurrency. This helper method is described in earlier post.
Push vs Pull Based Streams
scalaz-steam uses pull based model, it means that not first step (prepare response) is pushing data down the transformation chain when it’s ready, but
the bottom step (evaluate model) asks the previous step for new data when it’s done.
This allows to keep Spark cluster always busy, for example when relatively slow running modeling step is done, it
featurize for new data, and it’s already there, which means that modeling can start immediately.
Spark Cluster Tuning
For better cluster utilization I suggest to use
FAIR scheduler mode, that can be turned on with
--conf spark.scheduler.mode=FAIR flag.
Another big problem for us was tuning garbage collection. I’ve spent a lot of time trying to tune
G1 collector, but
ParNewGC showed the best results
in our case. It doesn’t guarantee that it’s also the best choice for your particular case.
scalaz-stream is a great abstraction and as it’s described in scalaz-stream Introduction every application can, and I believe should be modeled this way.
This approach is embraced not only in scala community, but also in clojure, take a look for example at Rich Hickey presentation about clojure core.async channels, and how your application can be modeled with queues.