Beyond Parallelize and Collect by Holden Karau

Page 1: Beyond Parallelize and Collect by Holden Karau

Beyond Parallelize & Collect

(Effective testing of Spark Programs)

Now mostly “works”*

*See developer for details. Does not imply warranty. :p

Page 2: Beyond Parallelize and Collect by Holden Karau

Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● currently IBM and previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data Processing with Spark
● @holdenkarau
● SlideShare: http://www.slideshare.net/hkarau
● LinkedIn: https://www.linkedin.com/in/holdenkarau
● Spark videos: http://bit.ly/holdenSparkVideos

Page 3: Beyond Parallelize and Collect by Holden Karau

What is going to be covered:
● What I think I might know about you
● A bit about why you should test your programs
● Using parallelize & collect for unit testing (quick skim)
● Comparing datasets too large to fit in memory
● Considerations for Streaming & SQL (DataFrames & Datasets)
● Cute & scary pictures
  ○ I promise at least one panda and one cat
● “Future Work”
  ○ Integration testing lives here for now (sorry)
  ○ Some of this future work might even get done!

Page 4: Beyond Parallelize and Collect by Holden Karau

Who I think you wonderful humans are
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
  ○ If not, buy one of my books or watch Paco’s awesome video
● Familiar with one of Scala, Java, or Python
  ○ If you know R well I’d love to chat, though
● Want to make better software
  ○ (or models, or w/e)

Page 5: Beyond Parallelize and Collect by Holden Karau

So why should you test?
● Makes you a better person
● Saves $s
  ○ May help you avoid losing your employer all of their money
    ■ Or “users” if we were in the bay
  ○ AWS is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● This is really just to guilt trip you & give you flashbacks to your QA internships

Page 6: Beyond Parallelize and Collect by Holden Karau

So why should you test - continued

Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark

Page 7: Beyond Parallelize and Collect by Holden Karau

So why should you test - continued

Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark

Page 8: Beyond Parallelize and Collect by Holden Karau

Why don’t we test?
● It’s hard
  ○ Faking data, setting up integration tests, urgh, w/e
● Our tests can get too slow
● It takes a lot of time
  ○ and people always want everything done yesterday
  ○ or I just want to go home and see my partner
  ○ etc.

Page 9: Beyond Parallelize and Collect by Holden Karau

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

Page 10: Beyond Parallelize and Collect by Holden Karau

An artisanal Spark unit test

@transient private var _sc: SparkContext = _

override def beforeAll() {
  _sc = new SparkContext("local[4]", "test")
  super.beforeAll()
}

override def afterAll() {
  if (sc != null) sc.stop()
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
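The snippets on this and the following slides reference a bare sc; a minimal sketch, my assumption of the accessor that presumably backs it (not shown on the slide):

// hypothetical accessor wrapping the mutable _sc above
def sc: SparkContext = _sc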

Photo by morinesque

Page 11: Beyond Parallelize and Collect by Holden Karau

And on to the actual test...

test("really simple transformation") {
  val input = List("hi", "hi holden", "bye")
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}

def tokenize(f: RDD[String]) = {
  f.map(_.split(" ").toList)
}

Photo by morinesque

Page 12: Beyond Parallelize and Collect by Holden Karau

Wait, where were the batteries?

Photo by Jim Bauer

Page 13: Beyond Parallelize and Collect by Holden Karau

Let’s get batteries!
● Spark unit testing
  ○ spark-testing-base - https://github.com/holdenk/spark-testing-base
  ○ sscheck - https://github.com/juanrh/sscheck
● Integration testing
  ○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
  ○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
  ○ spark-validator - https://github.com/holdenk/spark-validator

Photo by Mike Mozart

Page 14: Beyond Parallelize and Collect by Holden Karau

A simple unit test re-visited (Scala)

class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}

Page 15: Beyond Parallelize and Collect by Holden Karau

Ok, but what about problems @ scale?
● Maybe our program works fine on our local-sized input
● If we are using Spark, our actual workload is probably huge
● How do we test workloads too large for a single machine?
  ○ we can’t just use parallelize and collect

Photo by Qfamily

Page 16: Beyond Parallelize and Collect by Holden Karau

Distributed “set” operations to the rescue*
● Pretty close - already built into Spark
● Doesn’t do so well with floating points :(
  ○ damn floating points keep showing up everywhere :p
● Doesn’t really handle duplicates very well (see the sketch below)
  ○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”}, but with set operations...
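Before the real helpers, a minimal sketch of a set-style comparison (my own illustration, not the spark-testing-base API) that makes the duplicate problem above concrete:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// subtract is a "set" operation: it drops every occurrence of a matching
// element, so duplicate counts are never compared.
def setStyleEqual[T: ClassTag](expected: RDD[T], result: RDD[T]): Boolean = {
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}

With this helper, sc.parallelize(Seq("coffee", "coffee", "panda")) and sc.parallelize(Seq("panda", "coffee")) wrongly compare as equal - hence the helpers on the next two slides.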

Matti Mattila

Page 17: Beyond Parallelize and Collect by Holden Karau

Or use RDDComparisions:

def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}

Matti Mattila

Page 18: Beyond Parallelize and Collect by Holden Karau

Or use RDDComparisions:

def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
      i1.isEmpty || i2.isEmpty || i1.head != i2.head}
    .take(1).headOption
    .map{case (v, (i1, i2)) =>
      (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
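A hedged usage sketch (assuming these helpers live on the RDDComparisions object named above): both return None when the RDDs agree, so a test simply asserts on that:

test("result matches expected, ignoring ordering") {
  val expected = sc.parallelize(Seq("panda", "coffee", "coffee"))
  val result = sc.parallelize(Seq("coffee", "panda", "coffee"))
  // None means no (element, expectedCount, resultCount) mismatch was found
  assert(RDDComparisions.compare(expected, result).isEmpty)
}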

Matti Mattila

Page 19: Beyond Parallelize and Collect by Holden Karau

But where do we get the data for those tests?
● If you have production data you can sample, you are lucky!
  ○ If possible you can try and save it in the same format
● If our data is a bunch of Vectors or Doubles, Spark’s got tools :) (sketch below)
● Coming up with good test data can take a long time
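A minimal sketch of the “Spark’s got tools” point, using mllib’s RandomRDDs generators (my example; the sizes and seed are arbitrary):

import org.apache.spark.mllib.random.RandomRDDs

// 10k standard-normal Doubles and 10k 5-dimensional Vectors of synthetic test data
val doubles = RandomRDDs.normalRDD(sc, 10000L, numPartitions = 10, seed = 42L)
val vectors = RandomRDDs.normalVectorRDD(sc, numRows = 10000L, numCols = 5)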

Lori Rielly

Page 20: Beyond Parallelize and Collect by Holden Karau

QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
● sscheck
  ○ Awesome people*, supports generating DStreams too!
● spark-testing-base
  ○ Also awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs

*I assume

Photo by tara hunt

Page 21: Beyond Parallelize and Collect by Holden Karau

With spark-testing-base

test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)){ rdd =>
    rdd.map(_.length).count() == rdd.count()
  }
}
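A hedged note on the surrounding boilerplate (my assumption; package and trait names may differ by version): with ScalaTest you would typically mix in Checkers and wrap the property in check(...), since forAll on its own only builds a Prop:

import org.scalacheck.Prop.forAll
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers
import com.holdenkarau.spark.testing.{RDDGenerator, SharedSparkContext}

class MapDoesNotResizeTest extends FunSuite with SharedSparkContext with Checkers {
  test("map should not change number of elements") {
    // check(...) runs the ScalaCheck property and fails the test on a counterexample
    check(forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
      rdd.map(_.length).count() == rdd.count()
    })
  }
}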

Page 22: Beyond Parallelize and Collect by Holden Karau

Testing streaming….

Photo by Steve Jurvetson

Page 23: Beyond Parallelize and Collect by Holden Karau

// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
    ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {

  def start() {}

  def stop() {}

  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()

    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }

    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}

Artisanal Stream Testing Code

trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
    with SharedSparkContext {

  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName

  // Master for Spark context
  def master: String = "local[4]"

  // Batch duration
  def batchDuration: Duration = Seconds(1)

  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }

  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }

Photo by Steve Jurvetson

Page 24: Beyond Parallelize and Collect by Holden Karau

and continued….

  /**
   * Create an input stream for the provided input sequence. This is done using
   * TestInputStream as queueStream's are not checkpointable.
   */
  def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
      input: Seq[Seq[T]]): TestInputStream[T] = {
    new TestInputStream(sc, ssc_, input, numInputPartitions)
  }

  // Default before function for any streaming test suite. Override this
  // if you want to add your stuff to "before" (i.e., don't call before { } )
  override def beforeAll() {
    if (useManualClock) {
      logInfo("Using manual clock")
      // We can specify our own clock
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
    } else {
      logInfo("Using real clock")
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
    }
    super.beforeAll()
  }

  /**
   * Run a block of code with the given StreamingContext and automatically
   * stop the context when the block completes or when an exception is thrown.
   */
  def withOutputAndStreamingContext[R](outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
      (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
    val outputStream = outputStreamSSC._1
    val ssc = outputStreamSSC._2
    try {
      block(outputStream, ssc)
    } finally {
      try {
        ssc.stop(stopSparkContext = false)
      } catch {
        case e: Exception =>
          logError("Error stopping StreamingContext", e)
      }
    }
  }
}

Page 25: Beyond Parallelize and Collect by Holden Karau

and now for the clock

/*
 * Allows us access to a manual clock. Note that the manual clock changed
 * between 1.1.1 and 1.3
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)

  def getTime(): Long = getTimeMillis()     // Compat
  def currentTime(): Long = getTimeMillis() // Compat
  def getTimeMillis(): Long = synchronized {
    time
  }

  def setTime(timeToSet: Long): Unit = synchronized {
    time = timeToSet
    notifyAll()
  }

  def advance(timeToAdd: Long): Unit = synchronized {
    time += timeToAdd
    notifyAll()
  }

  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long = synchronized {
    while (time < targetTime) {
      wait(100)
    }
    getTimeMillis()
  }
}
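A tiny sketch of exercising the clock above directly (the streaming wiring happens through the spark.streaming.clock property set in beforeAll):

val clock = new TestManualClock()
clock.advance(1000L)                   // move the clock forward by one second
assert(clock.getTimeMillis() == 1000L)
clock.waitTillTime(1000L)              // returns immediately once time >= target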

Page 26: Beyond Parallelize and Collect by Holden Karau

Testing streaming the happy panda way
● Creating test data is hard
  ○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
  ○ foreachRDD & a var (see the sketch below)
● figuring out when your test is “done”

Let’s abstract all that away into testOperation
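For contrast, a minimal sketch of the manual approach being abstracted away (assumes an existing SparkContext sc; uses queueStream as noted above, foreachRDD plus a local buffer for collection, and a crude timeout to decide when we are “done”):

import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val inputQueue = mutable.Queue(sc.parallelize(Seq("hi", "hi holden", "bye")))
val results = mutable.ArrayBuffer[List[List[String]]]()

ssc.queueStream(inputQueue)
  .map(_.split(" ").toList)                        // same tokenize as earlier
  .foreachRDD(rdd => results += rdd.collect().toList)

ssc.start()
ssc.awaitTerminationOrTimeout(5000)                // crude "are we done yet?"
ssc.stop(stopSparkContext = false)
// now assert on `results`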

Page 27: Beyond Parallelize and Collect by Holden Karau

We can hide all of that:

test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}

Photo by An eye for my mind

Page 28: Beyond Parallelize and Collect by Holden Karau

What about DataFrames?
● We can do the same as we did for RDDs (.rdd) - see the sketch below
● Inside of Spark, validation looks like:
  def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
● Sadly it’s not in a published package & local only
● instead we expose:
  def equalDataFrames(expected: DataFrame, result: DataFrame)
  def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)
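A hedged sketch of the “.rdd” route flagged above (my own illustration, not the spark-testing-base implementation); note it ignores schema and duplicate counts:

import org.apache.spark.sql.DataFrame

def dfRoughlyEqual(expected: DataFrame, result: DataFrame): Boolean = {
  // Rows are mapped to Seqs so element equality is well defined
  val e = expected.rdd.map(_.toSeq)
  val r = result.rdd.map(_.toSeq)
  e.subtract(r).isEmpty() && r.subtract(e).isEmpty()
}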

Page 29: Beyond Parallelize and Collect by Holden Karau

…. and Datasets
● We can do the same as we did for RDDs (.rdd)
● Inside of Spark, validation looks like:
  def checkAnswer(df: Dataset[T], expectedAnswer: T*)
● Sadly it’s not in a published package & local only
● instead we expose:
  def equalDatasets(expected: Dataset[U], result: Dataset[V])
  def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double)

Page 30: Beyond Parallelize and Collect by Holden Karau

This is what it looks like:

test("dataframe should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}

*This may or may not be easier.

Page 31: Beyond Parallelize and Collect by Holden Karau

Which has “built-in” large support :)

Page 32: Beyond Parallelize and Collect by Holden Karau

Photo by allison

Page 33: Beyond Parallelize and Collect by Holden Karau

Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
● Especially since, if we were only ever in local mode, parallelize and collect might be fine

Photo by: Bev Sykes

Page 34: Beyond Parallelize and Collect by Holden Karau

Options beyond local mode:
● Just point at your existing cluster (set master) - see the sketch below
● Start one with your shell scripts & change the master
  ○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
  ○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
  ○ In spark-testing-base, extend SharedMiniCluster
    ■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
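A minimal sketch of the “set master” option above (the environment variable name is my own convention):

import org.apache.spark.{SparkConf, SparkContext}

// Point the tests at a real cluster when one is configured, else fall back to local mode
val conf = new SparkConf()
  .setMaster(sys.env.getOrElse("SPARK_TEST_MASTER", "local[4]"))
  .setAppName("integration-tests")
val sc = new SparkContext(conf)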

Photo by Richard Masoner

Page 35: Beyond Parallelize and Collect by Holden Karau

Validation
● Validation can be really useful for catching errors before deploying a model
  ○ Our tests can’t catch everything
● For now, checking file sizes & execution times seems to be the most common best practice (from the survey)
● Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option (sketch below)
● spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept
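A hedged sketch of the accumulator option mentioned above (all names are illustrative and this is not the spark-validator API): count suspect records during the job and refuse to publish if too many show up:

// Assumes an existing SparkContext `sc` and an RDD[String] `rawData`
val invalidRecords = sc.accumulator(0L, "invalid records")
val parsed = rawData.flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 5) Some(fields) else { invalidRecords += 1L; None }
}
val total = parsed.count()  // an action, so the accumulator is now populated
// Beware: re-running `parsed` can double count - one of the challenges noted above
if (total == 0 || invalidRecords.value > 0.01 * total) {
  sys.error("Validation failed: too many invalid records, not publishing results")
}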

Photo by:Paul Schadler

Page 39: Beyond Parallelize and Collect by Holden Karau

And the next book…..

Still being written - sign up to be notified when it is available:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark

Page 40: Beyond Parallelize and Collect by Holden Karau

Related packages

● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *ALPHA*

● spark-perf - https://github.com/databricks/spark-perf

● spark-integration-tests - https://github.com/databricks/spark-integration-tests

● scalacheck - https://www.scalacheck.org/

Page 41: Beyond Parallelize and Collect by Holden Karau

And including spark-testing-base:

sbt:

"com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1"

maven:

<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base</artifactId>
  <version>${spark.version}_0.3.1</version>
  <scope>test</scope>
</dependency>

Page 42: Beyond Parallelize and Collect by Holden Karau

“Future Work”
● Better ScalaCheck integration (ala sscheck)
● Testing details in my next Spark book
● Whatever* you all want
  ○ Testing with Spark survey: http://bit.ly/holdenTestingSpark

Semi-likely:
● integration testing (for now see @cfriegly’s Spark + Docker setup):
  ○ https://github.com/fluxcapacitor/pipeline

Pretty unlikely:
● Integrating into Apache Spark (SPARK-12433)

*That I feel like doing, or you feel like making a pull request for.

Photo by bullet101

Page 43: Beyond Parallelize and Collect by Holden Karau

Cat wave photo by Quinn Dombrowski

k thnx bye!

If you want to, fill out the survey: http://bit.ly/holdenTestingSpark

I will use the updated results in the Strata presentation & eventually tweet them at @holdenkarau