Unit Testing Spark Applications with a Standalone Apache Spark Cluster
Update 2015-10-08: Although the approach described in this post still works and is perfectly valid, I would now suggest packaging all tests into a fat jar together with ScalaTest (or any other test library of your choice) and running it with the spark-submit command.
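For example, a tests fat jar could be launched roughly like this (master URL and jar name are placeholders; org.scalatest.tools.Runner is ScalaTest's command-line runner):

```
spark-submit \
  --class org.scalatest.tools.Runner \
  --master spark://spark-master:7077 \
  tests-assembly.jar -R tests-assembly.jar -o
```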
The code for this application can be found on GitHub
Running Spark Applications
To be able to run Spark jobs, the Spark cluster needs to have all classes used by your application on its classpath. You could copy every jar file your application requires to the Spark nodes by hand, but that's not a great option. Another solution is to specify, when you create the SparkConf, which jar files should be distributed to the worker nodes. One way to do that is to package your application as a "fat jar", so that only a single jar needs to be distributed. The industry standard for packaging Spark applications is the sbt-assembly plugin, which is used by Spark itself.
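A minimal sketch of the SparkConf approach (the object name and jar paths are placeholders):

```scala
import org.apache.spark.SparkConf

object ManualJars {
  // List every jar the job needs on the worker nodes; Spark ships them
  // when the SparkContext is created.
  val conf: SparkConf = new SparkConf()
    .setAppName("openbook-orders")
    .setJars(Seq("/path/to/scala-openbook.jar", "/path/to/another-dependency.jar"))
}
```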
Unit Testing Spark Applications
If you need to test your Spark application, the easiest way is to create a local SparkContext for each test, or perhaps one shared between all tests. When Spark runs in local mode, it runs in the same JVM as your tests, with the same jar files on the classpath.
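A minimal sketch of such a test with a per-suite local SparkContext (class and test names are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

class LocalSparkSpec extends FlatSpec with Matchers with BeforeAndAfterAll {

  // local[2] runs Spark inside the test JVM with two worker threads
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("local-test"))
  }

  override def afterAll(): Unit = {
    sc.stop()
  }

  "Spark in local mode" should "count RDD elements" in {
    sc.parallelize(1 to 100).count() shouldBe 100L
  }
}
```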
If your tests require data that doesn't fit on a single node, for example in integration or acceptance tests, the obvious solution is to run them on a standalone Spark cluster with a sufficient number of nodes. At this point everything becomes more difficult: now you need to package your application together with its tests into a single jar file and submit it to the Spark cluster for each test run.
Example Application
To show how to run and test Spark applications, I prepared a very simple application. It uses the Scala OpenBook library to parse NYSE OpenBook messages (the order log of the New York Stock Exchange), distributes them across the cluster as an RDD, and counts Buy and Sell orders by ticker. The only purpose of this application is to have a dependency on a library that is definitely not available on the Spark nodes.
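The core of the job might look roughly like this (the Order fields below are simplified placeholders, not the actual Scala OpenBook API):

```scala
import org.apache.spark.SparkContext._ // pair-RDD functions, needed for Spark 1.2 and earlier
import org.apache.spark.rdd.RDD

// Simplified stand-in for a parsed OpenBook message: 'B' = Buy, 'S' = Sell
case class Order(symbol: String, side: Char)

object OrderCounts {
  // Count Buy and Sell orders per ticker
  def byTicker(orders: RDD[Order]): RDD[((String, Char), Long)] =
    orders
      .map(order => ((order.symbol, order.side), 1L))
      .reduceByKey(_ + _)
}
```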
Assembling the Main Application
Add the sbt-assembly plugin in project/plugin.sbt.
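Something along these lines (the version number is an example; use whichever sbt-assembly release matches your sbt version):

```scala
// project/plugin.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```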
Add assembly settings to build.sbt.
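For the sbt-assembly 0.11.x line the settings look roughly like this (newer releases auto-import and rename some of these keys; the jar name is a placeholder):

```scala
// build.sbt
import AssemblyKeys._

assemblySettings

jarName in assembly := "spark-openbook-assembly.jar"

// Resolve duplicate files (typically under META-INF) pulled in by transitive dependencies
mergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```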
Inside your application you need to create a SparkConf and add the current jar to it.
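A sketch of what that looks like (the object name, app name, and default master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OpenBookApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("openbook-orders")
      .setMaster(sys.props.getOrElse("spark.master", "local[2]"))
      // ship the jar that contains this class -- the assembled fat jar -- to the workers
      .setJars(SparkContext.jarOfClass(this.getClass).toSeq)

    val sc = new SparkContext(conf)
    // ... build RDDs and count orders here ...
    sc.stop()
  }
}
```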
After that you can run the assembly command and launch the assembled application on your Spark cluster.
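For example (jar path, main class, and master URL are placeholders):

```
sbt assembly

java -Dspark.master=spark://spark-master:7077 \
     -cp target/scala-2.10/spark-openbook-assembly.jar \
     com.example.OpenBookApp
```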
Assembling the Tests
The first step to running tests on a standalone Spark cluster is to package all main and test classes into a single jar, which will be transferred to each worker node before the tests run. It's very similar to assembling the main app.
I wrote a simple sbt plugin that provides a test-assembly task. The task first assembles a jar file with the test classes and all dependencies, then stores its location in an environment variable, and finally starts the tests.
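The plugin itself is in the linked repository; a rough sketch of the idea, written directly in build.sbt rather than as a plugin (task and property names are assumptions, and `assembly in Test` requires sbt-assembly's Test-scope settings to be enabled):

```scala
// Sketch only: build a fat jar that also contains the test classes, record its
// location where the tests can find it, then run the tests (needs sbt 0.13.8+).
lazy val testAssembly = TaskKey[Unit]("test-assembly", "Assemble a tests jar and run the tests")

testAssembly := Def.sequential(
  Def.task {
    val testsJar = (assembly in Test).value
    // Tests run in sbt's JVM unless forking is enabled, so a system property is
    // visible to them; the author's plugin records the location in an environment
    // variable instead.
    sys.props += "spark.test.jar" -> testsJar.getAbsolutePath
  },
  test in Test
).value
```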
All Apache Spark tests should inherit from ConfiguredSparkFlatSpec, which provides a configured SparkContext. If the assembled tests jar is available, it is distributed to the Spark worker nodes; if not, only local mode is supported.
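The real trait lives in the linked repository; a sketch of what it might look like, reusing the hypothetical spark.test.jar property and SPARK_TEST_JAR variable from above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

abstract class ConfiguredSparkFlatSpec extends FlatSpec with BeforeAndAfterAll {

  private val conf = {
    val base = new SparkConf()
      .setAppName("spark-tests")
      .setMaster(sys.props.getOrElse("spark.master", "local[2]"))
    // If an assembled tests jar is available, ship it to the worker nodes;
    // otherwise only local mode has the test classes on its classpath.
    sys.props.get("spark.test.jar").orElse(sys.env.get("SPARK_TEST_JAR")) match {
      case Some(jar) => base.setJars(Seq(jar))
      case None      => base
    }
  }

  lazy val sc: SparkContext = new SparkContext(conf)

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
```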
Running Tests
By default the spark.master property is set to local[2], so you can run the tests in local mode. If you want to run the tests on a standalone Apache Spark cluster, you need to override spark.master with the URL of your master node.
If you’ll try to run test command with standalone cluster it will fail with ClassNotFoundException
However, the test-assembly command will succeed.
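For example (the master URL is a placeholder):

```
sbt -Dspark.master=spark://spark-master:7077 test            # fails: workers can't load the test classes
sbt -Dspark.master=spark://spark-master:7077 test-assembly   # ships the tests jar first, then passes
```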
The code for this application can be found on GitHub