
Monitoring Spark Applications
Tzach Zohar @ Kenshoo, March/2016

Who am I
System Architect @ Kenshoo

Java backend for 10 years

Working with Scala + Spark for 2 years

https://www.linkedin.com/in/tzachzohar

Who’s Kenshoo
10-year-old, Tel-Aviv-based startup

Industry Leader in Digital Marketing

500+ employees

Heavy data shop

http://kenshoo.com/


And who’re you?

Agenda
Why Monitor

Spark UI

Spark REST API

Spark Metric Sinks

Applicative Metrics

The Importance of Being Earnest

Why Monitor
Failures

Performance

Know your data

Correctness of output

Monitoring Distributed Systems
No single log file

No single User Interface

Often - no single framework (e.g. Spark + YARN + HDFS…)


Spark UI

Spark UI
See http://spark.apache.org/docs/latest/monitoring.html#web-interfaces

The first go-to tool for understanding what’s what

Created per SparkContext

Spark UI
Jobs -> Stages -> Tasks


Spark UI
Use the “DAG Visualization” in Job Details to:

Understand flow

Detect caching opportunities

Spark UI
Jobs -> Stages -> Tasks

Detect unbalanced stages

Detect GC issues

Spark UI
Jobs -> Stages -> Tasks -> “Event Timeline”

Detect stragglers

Detect repartitioning opportunities

Spark UI Disadvantages
“Ad-Hoc”, no history*

Human readable, but not machine readable

Data points, not data trends


Spark UI Disadvantages

UI can quickly become hard to use…


Spark REST API

Spark’s REST API
See http://spark.apache.org/docs/latest/monitoring.html#rest-api

Programmatic access to UI’s data (jobs, stages, tasks, executors, storage…)

Useful for aggregations over similar jobs
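The examples on the next slides hard-code the host, port and application in the URL; the list of applications the server knows about (and their ids) can itself be fetched from the /api/v1/applications endpoint. A minimal sketch of that first step (not from the original slides; SparkApp is just a hypothetical holder for the id and name fields returned by the endpoint), using the same json4s setup as the examples that follow:

import scala.io.Source.fromURL
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class SparkApp(id: String, name: String)
implicit val formats = DefaultFormats

val apps = parse(fromURL("http://<host>:4040/api/v1/applications").mkString).extract[List[SparkApp]]
apps.foreach(app => println(s"${app.id}\t${app.name}"))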

Spark’s REST API
Example: calculate total shuffle statistics:

// assumed imports (not shown on the slide): json4s for JSON parsing (jackson backend assumed,
// the native backend works the same way) and scala.io for fetching the URL
import scala.io.Source.fromURL
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object SparkAppStats {
  case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)

  implicit val formats = DefaultFormats
  val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"

  def main(args: Array[String]) {
    val json = fromURL(url).mkString
    val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
    println("stages count: " + stages.size)
    println("shuffleWriteBytes: " + stages.map(_.shuffleWriteBytes).sum)
    println("memoryBytesSpilled: " + stages.map(_.memoryBytesSpilled).sum)
    println("diskBytesSpilled: " + stages.map(_.diskBytesSpilled).sum)
  }
}

Spark’s REST API
Example: calculate total shuffle statistics:

Example output:
stages count: 1435
shuffleWriteBytes: 8488622429
memoryBytesSpilled: 120107947855
diskBytesSpilled: 1505616236

Spark’s REST API
Example: calculate total time per job name:

// imports and implicit formats as in the previous example, plus java.util.Date
val url = "http://<host>:4040/api/v1/applications/<app-name>/jobs"

case class SparkJob(jobId: Int, name: String, submissionTime: Date, completionTime: Option[Date], stageIds: List[Int]) {
  def getDurationMillis: Option[Long] = completionTime.map(_.getTime - submissionTime.getTime)
}

def main(args: Array[String]) {
  val json = fromURL(url).mkString
  parse(json)
    .extract[List[SparkJob]]
    .filter(j => j.getDurationMillis.isDefined) // only completed jobs
    .groupBy(_.name)
    .mapValues(list => (list.map(_.getDurationMillis.get).sum, list.size))
    .foreach { case (name, (time, count)) => println(s"TIME: $time\tAVG: ${time / count}\tNAME: $name") }
}

Spark’s REST API
Example: calculate total time per job name:

Example output:
TIME: 182570 AVG: 16597 NAME: count at MyAggregationService.scala:132
TIME: 230973 AVG: 1297 NAME: parquet at MyRepository.scala:99
TIME: 120393 AVG: 2188 NAME: collect at MyCollector.scala:30
TIME: 5645 AVG: 627 NAME: collect at MyCollector.scala:103


But that’s still ad-hoc, right?


Spark Metric Sinks

Metrics
See http://spark.apache.org/docs/latest/monitoring.html#metrics

Spark uses the popular dropwizard.metrics library (renamed from codahale.metrics and yammer.metrics)

Metrics: easy Java API for creating and updating metrics stored in memory, e.g.:

// Gauge for executor thread pool's actively executing task counts
metricRegistry.register(name("threadpool", "activeTasks"), new Gauge[Int] {
  override def getValue: Int = threadPool.getActiveCount()
})
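The same API is just as easy to use from application code. A rough, hypothetical sketch (not from the slides; the registry, queue and metric names are made up) of a standalone registry with a gauge and a counter:

import com.codahale.metrics.{Gauge, MetricRegistry}
import com.codahale.metrics.MetricRegistry.name

object MyMetricsExample {
  val registry = new MetricRegistry()
  val pending = new java.util.concurrent.ConcurrentLinkedQueue[String]()

  // a gauge is sampled whenever a reporter/sink polls the registry
  registry.register(name("parser", "queueSize"), new Gauge[Int] {
    override def getValue: Int = pending.size()
  })

  // a counter is updated explicitly by application code
  val parsedRecords = registry.counter(name("parser", "parsedRecords"))
  parsedRecords.inc()
}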

Metrics
What is metered? Couldn’t find any detailed documentation of this

This trick flushes most of them out: search sources for “metricRegistry.register”


Where do these metrics go?

Spark Metric Sinks
A “Sink” is an interface for viewing these metrics, at given intervals or ad-hoc

Available sinks: Console, CSV, SLF4J, Servlet, JMX, Graphite, Ganglia*

We use the Graphite Sink to send all metrics to Graphite

$SPARK_HOME/metrics.properties:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<your graphite hostname>
*.sink.graphite.port=2003
*.sink.graphite.period=30
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=<token>.<app-name>.<host-name>


… and it’s in Graphite (+ Grafana)

Graphite Sink
Very useful for trend analysis

WARNING: Not suitable for short-running applications (will pollute Graphite with new metrics for each application)

Requires some Graphite tricks to get clear readings (wildcards, sums, derivatives, etc.)


Applicative Metrics

The Missing Piece
Spark meters its internals pretty thoroughly, but what about your internals?

Applicative metrics are a great tool for knowing your data and verifying output correctness

We use Dropwizard Metrics + Graphite for this too (everywhere)
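A minimal sketch of what that wiring could look like (hypothetical, not the actual Kenshoo code): applicative counters in a Dropwizard registry, shipped to Graphite by a GraphiteReporter. The host, prefix and metric names are placeholders:

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object AppMetrics {
  val registry = new MetricRegistry()

  // applicative metrics, updated from the job's code
  val inputRecords    = registry.counter("input.records")
  val outputRecords   = registry.counter("output.records")
  val parsingFailures = registry.counter("parsing.failures")

  // push everything to Graphite every 30 seconds
  private val graphite = new Graphite(new InetSocketAddress("<your graphite hostname>", 2003))
  private val reporter = GraphiteReporter.forRegistry(registry)
    .prefixedWith("<token>.<app-name>.<host-name>")
    .build(graphite)
  reporter.start(30, TimeUnit.SECONDS)
}

Driver-side code can then simply call, e.g., AppMetrics.parsingFailures.inc() wherever a record fails to parse.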

Counting RDD Elements
rdd.count() might be costly (another action)

Spark Accumulators are a good alternative

Trick: send accumulator results to Graphite, using “Counter-backed Accumulators”

// assumed imports (not on the slide): RDD, Accumulator, ClassTag, and the yammer metrics Counter/Metrics/MetricName classes
/**
 * Call returned callback after acting on returned RDD to get counter updated
 */
def countSilently[V: ClassTag](rdd: RDD[V], metricName: String, clazz: Class[_]): (RDD[V], Unit => Unit) = {
  val counter: Counter = Metrics.newCounter(new MetricName(clazz, metricName))
  val accumulator: Accumulator[Long] = rdd.sparkContext.accumulator(0, metricName)
  val countedRdd = rdd.map(v => { accumulator += 1; v })
  val callback: Unit => Unit = u => counter.inc(accumulator.value)
  (countedRdd, callback)
}
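Usage would look roughly like this (a hypothetical sketch; rdd, outputPath and the metric name are made up). The counter is only updated when the returned callback is invoked, after an action has actually evaluated the RDD:

val (countedRdd, updateCounter) = countSilently(rdd, "outputRecords", getClass)
countedRdd.saveAsTextFile(outputPath) // the action that evaluates the RDD (and drives the accumulator)
updateCounter(())                     // now push the accumulator's final value into the metrics counter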


We Measure...
Input records

Output records

Parsing failures

Average job time

Data “freshness” histogram (see the sketch after this list)

Much much more...
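For example, the “freshness” histogram could be as simple as a Dropwizard Histogram updated with each record's age at processing time - a hypothetical sketch, not the actual Kenshoo code, with a made-up metric name:

import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry

val registry = new MetricRegistry()
val freshness = registry.histogram("input.freshness.minutes")

// for every input record: how old was it when we processed it?
def recordFreshness(eventTimeMillis: Long): Unit =
  freshness.update(TimeUnit.MILLISECONDS.toMinutes(System.currentTimeMillis() - eventTimeMillis))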


WARNING: it’s addictive...


Conclusions
Spark provides a wide variety of monitoring options

Each one should be used when appropriate; no single one is sufficient on its own

Metrics + Graphite + Grafana can give you visibility into any numeric time series


Questions?


Thank you