Eugene Zhulenev

Engineering Machine Learning and Audience Modeling at Collective

Large Scale Deep Learning with TensorFlow on EC2 Spot Instances

In this post I’m demonstrating how to combine together TensorFlow, Docker, EC2 Container Service and EC2 Spot Instances to solve massive cluster computing problems the most cost-effective way.

Source code is on on Github:

Neural Networks and Deep Learning in particular gained a lot of attention over the last year, and it’s only the beginning. Google released to open source their numerical computing framework TensorFlow, which can be used for training and running deep neural networks for wide variety of machine learning problems, especially image recognition.

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Although TensorFlow version used at Google supports distributed training, open sourced version can run only on one node. However some of machine learning problems are still embarrassingly parallel, and can be easily parallelized regardless of single-node nature of the core library itself.

  1. Hyperparameter optimization or model selection is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm’s performance on an independent data set. Naturally parallelized by training models for each set ot parameters in parallel and choosing the best model (parameters) later.
  2. Inference (applying trained model to new data) can be parallelized by splitting input dataset into smaller batches and running trained model on each of them in parallel
Read on →

Spark in Production: Lessons From Running Large Scale Machine Learning

I wrote earlier about our approach for machine learning with Spark at Collective, it was focused on transforming raw data into features that can be used for training a model. At this post I want describe how to assemble multiple building blocks into production application, that efficiently uses Spark cluster and can train/validate hundreds of models.

Training single model is relatively easy, and it’s well covered in Spark documentation and multiple other blog posts. Training hundreds of models can become really tricky from engineering point of view. Spark has lot’s of configuration parameters that can affect cluster performance and stability, and you can use some clever tricks to get higher cluster utilization.

Read on →

Optimizing Spark Machine Learning for Small Data

Update 2015-10-08: Optimization “hack” described in this post still works, however we don’t use it in production anymore. With careful parallelism config, overhead introduced by distributed models is negligible.

You’ve all probably already know how awesome is Spark for doing Machine Learning on Big Data. However I’m pretty sure no one told you how bad (slow) it can be on Small Data.

As I mentioned in my previous post, we extensively use Spark for doing machine learning and audience modeling. It turned out that in some cases, for example when we are starting optimization for new client/campaign we simply don’t have enough positive examples to construct big enough dataset, so that using Spark would make sense.

Read on →

Audience modeling with Spark ML Pipelines

At Collective we are heavily relying on machine learning and predictive modeling to run digital advertising business. All decisions about what ad to show at this particular time to this particular user are made by machine learning models (some of them are real time, and some of them are offline).

We have a lot of projects that uses machine learning, common name for all of them can be Audience Modeling, as they all are trying to predict audience conversion (CTR, Viewability Rate, etc…) based on browsing history, behavioral segments and other type of predictors.

For most of new development we use Spark and Spark MLLib. It is a awesome project, however we found that some nice tools/libraries that are widely used for example in R are missing in Spark. In order to add missing features that we would really like to have in Spark, we created Spark Ext - Spark Extensions Library.

Spark Ext on Github:

I’m going to show simple example of combining Spark Ext with Spark ML pipelines for predicting user conversions based geo and browsing history data.

Spark ML pipeline example: SparkMlExtExample.scala

Read on →

Interactive Audience Analytics with Spark and HyperLogLog

At Collective we are working not only on cool things like Machine Learning and Predictive Modeling, but also on reporting that can be tedious and boring. However at our scale even simple reporting application can become challenging engineering problem. This post is based on talk that I gave at NY-Scala Meetup. Slides are available here.

Example application is available on github:

Read on →

Feature Engineering at Scale with Spark

Check Model Matrix Website and Github project.

At Collective we are in programmatic advertisement business, it means that all our advertisement decisions (what ad to show, to whom and at what time) are driven by models. We do a lot of machine learning, build thousands predictive models and use them to make millions decision per second.

How do we get the most out of our data for predictive modeling?

Success of all Machine Learning algorithms depends on data that you put into it, the better the features you choose, the better the results you will achieve.

Feature Engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work better.

In Ad-Tech it’s finite pieces of information about users that we can put into our models, and it’s almost the same across all companies in industry, we don’t have access to any anonymous data like real name and age, interests on Facebook etc. It really matter how creative you are to get maximum from the data you have, and how fast you can iterate and test new idea.

In 2014 Collective data science team published Machine Learning at Scale paper that describes our approach and trade-offs for audience optimization. In 2015 we solve the same problems, but using new technologies (Spark and Spark MLLib) at even bigger scale. I want to show the tool that I built specifically to handle feature engineering/selection problem, and which is open sources now.

Model Matrix

Read on →

Building Twitter Live Stream Analytics With Spark and Cassandra

This is repost of my article from Pellucid Tech Blog


At Pellucid Analytics we we are building a platform that automates and simplifies the creation of data-driven chartbooks, so that it takes minutes instead of hours to get from raw data to powerful visualizations and compelling stories.

One of industries we are focusing on is Investment Banking. We are helping IB advisory professionals build pitch-books, and provide them with analytical and quantitative support to sell their ideas. Comparable Companies Analysis is central to this business.

Comparable company analysis starts with establishing a peer group consisting of similar companies of similar size in the same industry and region.

The problem we are faced with is finding a scalable solution to establish a peer group for any chosen company.

Read on →

Stock price prediction with Big Data and Machine Learning

Apache Spark and Spark MLLib for building price movement prediction model from order log data.

The code for this application app can be found on Github


This post is based on Modeling high-frequency limit order book dynamics with support vector machines paper. Roughly speaking I’m implementing ideas introduced in this paper in scala with Spark and Spark MLLib. Authors are using sampling, I’m going to use full order log from NYSE (sample data is available from NYSE FTP), just because I can easily do it with Spark. Instead of using SVM, I’m going to use Decision Tree algorithm for classification, because in Spark MLLib it supports multiclass classification out of the box.

If you want to get deep understanding of the problem and proposed solution, you need to read the paper. I’m going to give high level overview of the problem in less academic language, in one or two paragraphs.

Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.

Read on →

Running Spark tests in standalone cluster


Unit testing Spark Applications with standalone Apache Spark Cluster.

Update 2015-10-08: Albeit approach described in this post works and totally valid, now I would suggest to take a look on packaging all tests in a fat jar together with scalatest (or any other test library of your choice) and using spark-submit command to run it

The code for this application app can be found on Github

Running Spark Applications

To be able to run Spark jobs, Spark cluster needs to have all classes used by your application in it’s classpath. You can put manually all jar files required by your application to Spark nodes, but it’s not cool. Another solution is to manually set jar files that required to distribute to worker nodes when you create SparkConf. One way to do it, is to package your application as a “fat-jar”, so you need to distribute only single jar. Industry standard for packaging Spark application is sbt-assembly plugin, and it’s used by Spark itself.

Unit Testing Spark Applications

If you need to test your Spark application, easiest way is to create local Spark Context for each test, or maybe shared between all tests. When Spark is running in local mode, it’s running in the same JVM as your tests with same jar files in classpath.

If your tests requires data that doesn’t fit into single node, for example in integration or acceptance tests, obvious solution is to run them in standalone Spark cluster with sufficient number of nodes. At this time everything becomes more difficult. Now you need to package you application with tests in single jar file, and submit it to Spark cluster with each test.

Read on →

Akka Cluster for Value at Risk calculation (Part 2/2)


The code & sample app can be found on Github

Risk Management in finance is one of the most common case studies for Grid Computing, and Value-at-Risk is most widely used risk measure. In this article I’m going to show how to scale-out Value-at-Risk calculation to multiple nodes with latest Akka middleware. In Part 1 I’m describing the problem and single-node solution, and in Part 2 I’m scaling it to multiple nodes.

[Part 2/2] Scale-out VaR calculation to multiple nodes

Go to Part 1 where Value at Risk calculator defined.

Read on →