Eugene Zhulenev

Working on a Tensorflow at Google Brain

Type-Level Instant Insanity in Scala

This post is a Scala version of Haskell Type-Level Instant Insanity by Conrad Parker

This post shows an implementation of Instant Insanity puzzle game at compile time, using powerful Scala type system. This post is based on amazing article by Conrad Parker in the Monad Reader Issue 8. Original article is around 20 pages long, this post is much more concise version of it. Original article is very well written and easy to understand, this post should help with jumping from Scala to Haskell code for people who are not familiar with Haskell language.

Textbook Implementation

Instant Insanity puzzle formulated as:

It consists of four cubes, with faces coloured blue, green, red or white. The problem is to arrange the cubes in a vertical pile such that each visible column of faces contains four distinct colours.

“Classic” solution in scala can be found here, this solution stacks the cubes one at a time, trying each possible orientation of each cube.

I’m going to show how to translate this solution into Scala Type System.

Read on →

Optimizing Spark Machine Learning for Small Data

Update 2015-10-08: Optimization “hack” described in this post still works, however we don’t use it in production anymore. With careful parallelism config, overhead introduced by distributed models is negligible.

You’ve all probably already know how awesome is Spark for doing Machine Learning on Big Data. However I’m pretty sure no one told you how bad (slow) it can be on Small Data.

As I mentioned in my previous post, we extensively use Spark for doing machine learning and audience modeling. It turned out that in some cases, for example when we are starting optimization for new client/campaign we simply don’t have enough positive examples to construct big enough dataset, so that using Spark would make sense.

Read on →

Audience modeling with Spark ML Pipelines

At Collective we are heavily relying on machine learning and predictive modeling to run digital advertising business. All decisions about what ad to show at this particular time to this particular user are made by machine learning models (some of them are real time, and some of them are offline).

We have a lot of projects that uses machine learning, common name for all of them can be Audience Modeling, as they all are trying to predict audience conversion (CTR, Viewability Rate, etc…) based on browsing history, behavioral segments and other type of predictors.

For most of new development we use Spark and Spark MLLib. It is a awesome project, however we found that some nice tools/libraries that are widely used for example in R are missing in Spark. In order to add missing features that we would really like to have in Spark, we created Spark Ext - Spark Extensions Library.

Spark Ext on Github: https://github.com/collectivemedia/spark-ext

I’m going to show simple example of combining Spark Ext with Spark ML pipelines for predicting user conversions based geo and browsing history data.

Spark ML pipeline example: SparkMlExtExample.scala

Read on →

Interactive Audience Analytics with Spark and HyperLogLog

At Collective we are working not only on cool things like Machine Learning and Predictive Modeling, but also on reporting that can be tedious and boring. However at our scale even simple reporting application can become challenging engineering problem. This post is based on talk that I gave at NY-Scala Meetup. Slides are available here.

Example application is available on github: https://github.com/collectivemedia/spark-hyperloglog

Read on →

Feature Engineering at Scale with Spark

Check Model Matrix Website and Github project.

At Collective we are in programmatic advertisement business, it means that all our advertisement decisions (what ad to show, to whom and at what time) are driven by models. We do a lot of machine learning, build thousands predictive models and use them to make millions decision per second.

How do we get the most out of our data for predictive modeling?

Success of all Machine Learning algorithms depends on data that you put into it, the better the features you choose, the better the results you will achieve.

Feature Engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work better.

In Ad-Tech it’s finite pieces of information about users that we can put into our models, and it’s almost the same across all companies in industry, we don’t have access to any anonymous data like real name and age, interests on Facebook etc. It really matter how creative you are to get maximum from the data you have, and how fast you can iterate and test new idea.

In 2014 Collective data science team published Machine Learning at Scale paper that describes our approach and trade-offs for audience optimization. In 2015 we solve the same problems, but using new technologies (Spark and Spark MLLib) at even bigger scale. I want to show the tool that I built specifically to handle feature engineering/selection problem, and which is open sources now.

Model Matrix

Read on →

Building Twitter Live Stream Analytics With Spark and Cassandra

This is repost of my article from Pellucid Tech Blog

Background

At Pellucid Analytics we we are building a platform that automates and simplifies the creation of data-driven chartbooks, so that it takes minutes instead of hours to get from raw data to powerful visualizations and compelling stories.

One of industries we are focusing on is Investment Banking. We are helping IB advisory professionals build pitch-books, and provide them with analytical and quantitative support to sell their ideas. Comparable Companies Analysis is central to this business.

Comparable company analysis starts with establishing a peer group consisting of similar companies of similar size in the same industry and region.

The problem we are faced with is finding a scalable solution to establish a peer group for any chosen company.

Read on →

Stock price prediction with Big Data and Machine Learning

Apache Spark and Spark MLLib for building price movement prediction model from order log data.

The code for this application app can be found on Github

Synopsis

This post is based on Modeling high-frequency limit order book dynamics with support vector machines paper. Roughly speaking I’m implementing ideas introduced in this paper in scala with Spark and Spark MLLib. Authors are using sampling, I’m going to use full order log from NYSE (sample data is available from NYSE FTP), just because I can easily do it with Spark. Instead of using SVM, I’m going to use Decision Tree algorithm for classification, because in Spark MLLib it supports multiclass classification out of the box.

If you want to get deep understanding of the problem and proposed solution, you need to read the paper. I’m going to give high level overview of the problem in less academic language, in one or two paragraphs.

Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.

Read on →