In this post I’m demonstrating how to combine TensorFlow, Docker, EC2 Container Service and EC2 Spot Instances to solve massive cluster computing problems in a cost-effective way.
Source code is on GitHub: https://github.com/ezhulenev/distributo
Neural networks, and deep learning in particular, gained a lot of attention over the last year, and it’s only the beginning. Google open-sourced its numerical computing framework, TensorFlow, which can be used for training and running deep neural networks for a wide variety of machine learning problems, especially image recognition.
TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
Although the TensorFlow version used at Google supports distributed training, the open-sourced version can run only on a single node. However, some machine learning problems are embarrassingly parallel and can be easily parallelized regardless of the single-node nature of the core library itself:
- Hyperparameter optimization, or model selection, is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm’s performance on an independent data set. It is naturally parallelized by training a model for each set of parameters in parallel and choosing the best model (parameters) afterwards.
- Inference (applying a trained model to new data) can be parallelized by splitting the input dataset into smaller batches and running the trained model on each of them in parallel.
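
Both patterns above can be sketched in a few lines of Python. This is only an illustration of the embarrassingly parallel structure, not the code from the repository: `train_and_score` is a hypothetical stand-in for training one TensorFlow model and returning its validation score, the hyperparameter names are made up, and a local thread pool plays the role that independent containers would play on a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in: train one model, return its validation score."""
    learning_rate, hidden_units = params
    # Toy score that peaks at learning_rate=0.01 and hidden_units=128.
    return -abs(learning_rate - 0.01) * 100 - abs(hidden_units - 128) / 128


def grid_search(learning_rates, hidden_units, max_workers=4):
    """Score every hyperparameter combination independently, keep the best."""
    grid = list(product(learning_rates, hidden_units))
    # Each task depends only on its own parameters, so tasks can run anywhere.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(train_and_score, grid))
    best_score, best_params = max(zip(scores, grid))
    return best_params, best_score


def split_into_batches(items, batch_size):
    """Split an input dataset into independent batches for parallel inference."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    print(grid_search([0.001, 0.01, 0.1], [64, 128, 256]))
    print(split_into_batches(list(range(10)), 4))
```

Because no task needs to talk to any other task, the only coordination required is handing out parameter sets (or batches) and collecting the results, which is what makes these workloads a good fit for a pool of cheap, interruptible machines.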