Update 2015-10-08: Optimization “hack” described in this post still works, however we don’t use it in production anymore. With careful parallelism config, overhead introduced by distributed models is negligible.

You’ve all probably already know how awesome is Spark for doing Machine Learning on Big Data. However I’m pretty sure no one told you how bad (slow) it can be on Small Data.

As I mentioned in my previous post, we extensively use Spark for doing machine learning and audience modeling. It turned out that in some cases, for example when we are starting optimization for new client/campaign we simply don’t have enough positive examples to construct big enough dataset, so that using Spark would make sense.

### Spark ML from 10000 feet

Essentially every machine learning algorithm is a function minimization, where function value depends on some calculation using data in `RDD`

. For example logistic regression can calculate function value 1000 times before it will converge and find optimal parameters. It means that it will compute some `RDD`

1000 times. In case of `LogisticRegression`

it’s doing `RDD.treeAggregate`

which is supper efficient, but still it’s distributed computation.

Now imagine that all the data you have is 50000 rows, and you have for example 1000 partitions. It means that each partition has only 50 rows. And each `RDD.treeAggregate`

on every iteration serializing closures, sending them to partitions and collecting result back. It’s **HUGE OVERHEAD** and huge load on a driver.

### Throw Away Spark and use Python/R?

It’s definitely an option, but we don’t want to build multiple systems for data of different size. Spark ML pipelines are awesome abstraction, and we want to use it for all machine learning jobs. Also we want to use the same algorithm, so results would be consistent if dataset size just crossed the boundary between small and big data.

### Run LogisticRegression in ‘Local Mode’

What if Spark could run the same machine learning algorithm, but instead of using `RDD`

for storing input data, it would use `Arrays`

? It solves all the problems, you get consistent model, computed 10-20x faster because it doesn’t need distributed computations.

That’s exactly approach I used in Spark Ext, it’s called LocalLogisticRegression. It’s mostly copy-pasta from Spark `LogisticRegression`

, but when input data frame has only single partition, it’s running function optimization on one of the executors using `mapPartition`

function, essentially using Spark as distributed executor service.

This approach is much better than collecting data to driver, because you are not limited by driver computational resources.

When `DataFrame`

has more than 1 partition it just falls back to default distributed logistic regression.

Code for new `train`

method looks like this:

If input dataset size is less than 100000 rows, it will be placed inside single partition, and regression model will be trained in local mode.

## Results

With a little ingenuity (and copy paste) Spark became perfect tool for machine learning both on Small and Big Data. Most awesome thing is that this new `LocalLogisticRegression`

can be used as drop in replacement in Spark ML pipelines, producing exactly the same `LogisticRegressionModel`

at the end.

It might be interesting idea to use this approach in Spark itself, because in this case it would be possible to do it without doing so many code duplication. I’d love to see if anyone else had the same problem, and how solved it.

More cool Spark things in Github.