Eugene Zhulenev

Building Large Scale Systems at Google

Type-Level Instant Insanity in Scala

| Comments

This post is a Scala version of Haskell Type-Level Instant Insanity by Conrad Parker

This post shows an implementation of Instant Insanity puzzle game at compile time, using powerful Scala type system. This post is based on amazing article by Conrad Parker in the Monad Reader Issue 8. Original article is around 20 pages long, this post is much more concise version of it. Original article is very well written and easy to understand, this post should help with jumping from Scala to Haskell code for people who are not familiar with Haskell language.

Textbook Implementation

Instant Insanity puzzle formulated as:

It consists of four cubes, with faces coloured blue, green, red or white. The problem is to arrange the cubes in a vertical pile such that each visible column of faces contains four distinct colours.

“Classic” solution in scala can be found here, this solution stacks the cubes one at a time, trying each possible orientation of each cube.

I’m going to show how to translate this solution into Scala Type System.

Large Scale Deep Learning With TensorFlow on EC2 Spot Instances

| Comments

In this post I’m demonstrating how to combine together TensorFlow, Docker, EC2 Container Service and EC2 Spot Instances to solve massive cluster computing problems the most cost-effective way.

Source code is on on Github:

Neural Networks and Deep Learning in particular gained a lot of attention over the last year, and it’s only the beginning. Google released to open source their numerical computing framework TensorFlow, which can be used for training and running deep neural networks for wide variety of machine learning problems, especially image recognition.

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Although TensorFlow version used at Google supports distributed training, open sourced version can run only on one node. However some of machine learning problems are still embarrassingly parallel, and can be easily parallelized regardless of single-node nature of the core library itself.

  1. Hyperparameter optimization or model selection is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm’s performance on an independent data set. Naturally parallelized by training models for each set ot parameters in parallel and choosing the best model (parameters) later.
  2. Inference (applying trained model to new data) can be parallelized by splitting input dataset into smaller batches and running trained model on each of them in parallel

Spark in Production: Lessons From Running Large Scale Machine Learning

| Comments

I wrote earlier about our approach for machine learning with Spark at Collective, it was focused on transforming raw data into features that can be used for training a model. At this post I want describe how to assemble multiple building blocks into production application, that efficiently uses Spark cluster and can train/validate hundreds of models.

Training single model is relatively easy, and it’s well covered in Spark documentation and multiple other blog posts. Training hundreds of models can become really tricky from engineering point of view. Spark has lot’s of configuration parameters that can affect cluster performance and stability, and you can use some clever tricks to get higher cluster utilization.

Optimizing Spark Machine Learning for Small Data

| Comments

Update 2015-10-08: Optimization “hack” described in this post still works, however we don’t use it in production anymore. With careful parallelism config, overhead introduced by distributed models is negligible.

You’ve all probably already know how awesome is Spark for doing Machine Learning on Big Data. However I’m pretty sure no one told you how bad (slow) it can be on Small Data.

As I mentioned in my previous post, we extensively use Spark for doing machine learning and audience modeling. It turned out that in some cases, for example when we are starting optimization for new client/campaign we simply don’t have enough positive examples to construct big enough dataset, so that using Spark would make sense.

Audience Modeling With Spark ML Pipelines

| Comments

At Collective we are heavily relying on machine learning and predictive modeling to run digital advertising business. All decisions about what ad to show at this particular time to this particular user are made by machine learning models (some of them are real time, and some of them are offline).

We have a lot of projects that uses machine learning, common name for all of them can be Audience Modeling, as they all are trying to predict audience conversion (CTR, Viewability Rate, etc…) based on browsing history, behavioral segments and other type of predictors.

For most of new development we use Spark and Spark MLLib. It is a awesome project, however we found that some nice tools/libraries that are widely used for example in R are missing in Spark. In order to add missing features that we would really like to have in Spark, we created Spark Ext - Spark Extensions Library.

Spark Ext on Github:

I’m going to show simple example of combining Spark Ext with Spark ML pipelines for predicting user conversions based geo and browsing history data.

Spark ML pipeline example: SparkMlExtExample.scala

Interactive Audience Analytics With Spark and HyperLogLog

| Comments

At Collective we are working not only on cool things like Machine Learning and Predictive Modeling, but also on reporting that can be tedious and boring. However at our scale even simple reporting application can become challenging engineering problem. This post is based on talk that I gave at NY-Scala Meetup. Slides are available here.

Example application is available on github:

Feature Engineering at Scale With Spark

| Comments

Check Model Matrix Website and Github project.

At Collective we are in programmatic advertisement business, it means that all our advertisement decisions (what ad to show, to whom and at what time) are driven by models. We do a lot of machine learning, build thousands predictive models and use them to make millions decision per second.

How do we get the most out of our data for predictive modeling?

Success of all Machine Learning algorithms depends on data that you put into it, the better the features you choose, the better the results you will achieve.

Feature Engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work better.

In Ad-Tech it’s finite pieces of information about users that we can put into our models, and it’s almost the same across all companies in industry, we don’t have access to any anonymous data like real name and age, interests on Facebook etc. It really matter how creative you are to get maximum from the data you have, and how fast you can iterate and test new idea.

In 2014 Collective data science team published Machine Learning at Scale paper that describes our approach and trade-offs for audience optimization. In 2015 we solve the same problems, but using new technologies (Spark and Spark MLLib) at even bigger scale. I want to show the tool that I built specifically to handle feature engineering/selection problem, and which is open sources now.

Model Matrix

Building Twitter Live Stream Analytics With Spark and Cassandra

| Comments

This is repost of my article from Pellucid Tech Blog


At Pellucid Analytics we we are building a platform that automates and simplifies the creation of data-driven chartbooks, so that it takes minutes instead of hours to get from raw data to powerful visualizations and compelling stories.

One of industries we are focusing on is Investment Banking. We are helping IB advisory professionals build pitch-books, and provide them with analytical and quantitative support to sell their ideas. Comparable Companies Analysis is central to this business.

Comparable company analysis starts with establishing a peer group consisting of similar companies of similar size in the same industry and region.

The problem we are faced with is finding a scalable solution to establish a peer group for any chosen company.

Stock Price Prediction With Big Data and Machine Learning

| Comments

Apache Spark and Spark MLLib for building price movement prediction model from order log data.

The code for this application app can be found on Github


This post is based on Modeling high-frequency limit order book dynamics with support vector machines paper. Roughly speaking I’m implementing ideas introduced in this paper in scala with Spark and Spark MLLib. Authors are using sampling, I’m going to use full order log from NYSE (sample data is available from NYSE FTP), just because I can easily do it with Spark. Instead of using SVM, I’m going to use Decision Tree algorithm for classification, because in Spark MLLib it supports multiclass classification out of the box.

If you want to get deep understanding of the problem and proposed solution, you need to read the paper. I’m going to give high level overview of the problem in less academic language, in one or two paragraphs.

Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.

Running Spark Tests in Standalone Cluster

| Comments

Unit testing Spark Applications with standalone Apache Spark Cluster.

Update 2015-10-08: Albeit approach described in this post works and totally valid, now I would suggest to take a look on packaging all tests in a fat jar together with scalatest (or any other test library of your choice) and using spark-submit command to run it

The code for this application app can be found on Github

Running Spark Applications

To be able to run Spark jobs, Spark cluster needs to have all classes used by your application in it’s classpath. You can put manually all jar files required by your application to Spark nodes, but it’s not cool. Another solution is to manually set jar files that required to distribute to worker nodes when you create SparkConf. One way to do it, is to package your application as a “fat-jar”, so you need to distribute only single jar. Industry standard for packaging Spark application is sbt-assembly plugin, and it’s used by Spark itself.

Unit Testing Spark Applications

If you need to test your Spark application, easiest way is to create local Spark Context for each test, or maybe shared between all tests. When Spark is running in local mode, it’s running in the same JVM as your tests with same jar files in classpath.

If your tests requires data that doesn’t fit into single node, for example in integration or acceptance tests, obvious solution is to run them in standalone Spark cluster with sufficient number of nodes. At this time everything becomes more difficult. Now you need to package you application with tests in single jar file, and submit it to Spark cluster with each test.