raymond-tay
SPARK - NEW KID ON THE BLOCK
ABOUT ME …
• I designed Bamboo (HP’s Big Data Analytics Platform)
• I write software (mostly in Scala, but leaning towards Haskell recently …)
• I like translating sequential algorithms into parallel ones, mostly using CUDA / OpenCL; embedded assembly is an EVIL thing.
• I wrote 2 books
• OpenCL Parallel Programming Development Cookbook
• Developing an Akka Edge
WHAT’S COVERED TODAY?
• What’s Apache Spark
• What’s an RDD? How can I understand it?
• What’s Spark SQL
• What’s Spark Streaming
• References
WHAT’S APACHE SPARK
• As a beginner’s guide, you can refer to Tsai Li Ming’s talk.
• The API model abstracts:
• how to extract data from 3rd-party software (via JDBC, Cassandra, HBase)
• how to extract and compute on data (via GraphX, MLlib, Spark SQL)
• how to store data (data connectors to “local”, “hdfs”, “s3”)
RESILIENT DISTRIBUTED DATASETS
• Apache Spark works on data broken into chunks (partitions)
• A distributed collection of these chunks is called an RDD
• RDDs are chained into a lineage graph => a graph that identifies the relationships between RDDs.
• RDDs can be queried, grouped and transformed, from a coarse-grained manner down to a fine-grained manner.
• An RDD has a lifecycle:
• reification
• lazy-compute/lazy re-compute
• destruction
• An RDD’s lifecycle is managed by the system unless …
• a program commands the RDD to persist() or unpersist(), which affects the lazy computation.
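The lifecycle above can be sketched without Spark. The following is only a toy model in plain Scala: `ToyDataset`, `LifecycleSketch` and the `evaluations` counter are hypothetical names made up for illustration, not part of Spark's API. It shows that a transformation only records lineage, that every action recomputes from that lineage, and that persist() lets later actions reuse the first result.

```scala
object LifecycleSketch {
  // Toy stand-in for an RDD (hypothetical, not Spark's API): it stores the
  // recipe (lineage) for producing its elements, not the elements themselves.
  final class ToyDataset[A](compute: () => Vector[A]) {
    private var persisted = false
    private var cache: Option[Vector[A]] = None // filled only when persisted

    // Transformation: lazy, it only extends the lineage.
    def map[B](f: A => B): ToyDataset[B] =
      new ToyDataset[B](() => collect().map(f))

    def persist(): this.type = { persisted = true; this }
    def unpersist(): this.type = { persisted = false; cache = None; this }

    // Action: walks the lineage and materialises the elements.
    def collect(): Vector[A] = cache.getOrElse {
      val result = compute()
      if (persisted) cache = Some(result)
      result
    }
  }

  var evaluations = 0 // counts how often the mapping function runs

  val source = new ToyDataset[Int](() => Vector(1, 2, 3, 4))
  val doubled = source.map { x => evaluations += 1; x * 2 }

  val beforeAnyAction = evaluations // 0: map() alone computes nothing

  doubled.collect(); doubled.collect()
  val afterTwoUncachedActions = evaluations // 8: each action recomputes all 4 elements

  doubled.persist()
  doubled.collect(); doubled.collect()
  val afterTwoCachedActions = evaluations // 12: one more pass, then served from the cache
}
```

Real RDDs add partitioning, storage levels and eviction on top of this idea; the sketch only captures the lazy compute / re-compute behaviour.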
“AGGREGATE” IN SPARK
> val data = sc.parallelize((1 to 4).toList, 2)
> data.aggregate(0)
> .. (math.max(_, _),
> .. ( _ + _ ))
> …..
> result = 6
def aggregate(zerovalue: U) (fbinary: (U, T) => U, fagg: (U, U) => U): U
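HOW “AGGREGATE” WORKS IN SPARK
[Diagram: an RDD holding the elements e1, e2, e3, e4. fbinary folds the elements, starting from zerovalue, into the partial results res1 and res2; fagg merges the partial results into the final result.]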
caveat: the result can depend on how the data is partitioned; a partition-sensitive algorithm should be written so that it works correctly regardless of the partitions
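To make the caveat concrete, here is a minimal sketch that models aggregate on plain Scala collections rather than on a real RDD. Each inner list stands for one partition; the object name `AggregateSketch` and the explicit `partitions` parameter are assumptions made for illustration. The parameter names follow the signature on the slide.

```scala
object AggregateSketch {
  // Models aggregate on plain collections: each inner Seq is one partition.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zerovalue: U)(
      fbinary: (U, T) => U, fagg: (U, U) => U): U = {
    // Step 1: fold every partition on its own, starting from zerovalue.
    val partialResults = partitions.map(_.foldLeft(zerovalue)(fbinary))
    // Step 2: merge the partial results, again starting from zerovalue.
    partialResults.foldLeft(zerovalue)(fagg)
  }

  val data = (1 to 4).toList
  val twoPartitions = data.grouped(2).toList // List(List(1, 2), List(3, 4))
  val onePartition = List(data)              // List(List(1, 2, 3, 4))

  // max within a partition, sum across partitions: depends on the partitioning
  val maxThenSumTwo = aggregate(twoPartitions)(0)(math.max(_, _), _ + _) // 2 + 4 = 6
  val maxThenSumOne = aggregate(onePartition)(0)(math.max(_, _), _ + _)  // 4

  // sum within and across partitions: the same answer for any partitioning
  val sumTwo = aggregate(twoPartitions)(0)(_ + _, _ + _) // 10
  val sumOne = aggregate(onePartition)(0)(_ + _, _ + _)  // 10
}
```

With two partitions the max-then-sum combination reproduces the result of 6 from the slide, while a single partition gives 4. The plain sum is safe because both functions are associative and the zero value is neutral.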
“COGROUP” IN SPARK
> val x = sc.parallelize(List(1, 2, 1, 3), 1)
> val y = x.map((_, "y"))
> val z = x.map((_, "z"))
> y.cogroup(z).collect
res72: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(Array(y, y),Array(z, z))), (3,(Array(y),Array(z))), (2,(Array(y),Array(z))))
def cogroup[W1, W2, W3] (other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
HOW “COGROUP” WORKS IN SPARK
RDDx: (k1,va) (k2,vb) (k1,vc) (k3,vd) (k1,ve)
RDDy: (k1,vf) (k2,vg) (k1,vh)
RDDx.cogroup(RDDy) =
Array[(k1, ([va,vc,ve], [vf,vh])),
(k2, ([vb], [vg])),
(k3, ([vd], []))]
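The grouping in this example can be reproduced with a small model on plain Scala collections. This is only a sketch of the semantics, not Spark's implementation: `CogroupSketch` and its `cogroup` helper are hypothetical, and a real cogroup returns an RDD whose element order is not guaranteed.

```scala
object CogroupSketch {
  // For every key found in either input, collect that key's values from the
  // left input and from the right input, keeping the two sides separate.
  def cogroup[K, V, W](xs: Seq[(K, V)], ys: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (xs.map(_._1) ++ ys.map(_._1)).distinct
    keys.map { k =>
      val left = xs.filter(_._1 == k).map(_._2)
      val right = ys.filter(_._1 == k).map(_._2)
      (k, (left, right))
    }.toMap
  }

  val rddX = Seq(("k1", "va"), ("k2", "vb"), ("k1", "vc"), ("k3", "vd"), ("k1", "ve"))
  val rddY = Seq(("k1", "vf"), ("k2", "vg"), ("k1", "vh"))

  val grouped = cogroup(rddX, rddY)
  // grouped("k1") is (Seq("va", "vc", "ve"), Seq("vf", "vh"))
  // grouped("k3") is (Seq("vd"), Seq()) because k3 never appears in rddY
}
```

A key that appears in only one input still shows up in the result, paired with an empty collection for the other side.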
“COGROUP” IN SPARK
• cogroup works on both RDDs and Spark streams
• the ability to combine multiple RDDs allows higher abstractions to be constructed
• A Stream in Spark is just a list of (Time,RDD[U])
WHAT’S SPARK SQL
• Spark SQL is new and has largely replaced Shark
• Large-scale queries (inline queries) can be embedded into a Spark program
• Spark SQL supports Apache Hive, JSON, Parquet, RDD.
• Spark SQL’s optimizer is clever!
• Supports Hive UDFs, or you can write your own!
SPARK SQL
[Diagram: Spark SQL at the centre of its data sources: JSON, Hive, Parquet, RDD]
SPARK SQL (AN EXAMPLE)
// import spark sql
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)
// load the JSON input and register it as a temporary table named "tweets"
val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable("tweets")
// run a SQL query against the registered table
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
// pull the text column (index 0) out of every result row
val topTweetContent = topTweets.map(row => row.getString(0))
WHAT’S SPARK STREAMING
• The core component is a DStream
• A DStream is an abstraction over RDDs: its basic component is a (key, value) pair where key = Time, value = RDD.
• Forward and backward queries are supported
• Fault tolerance is achieved by check-pointing RDDs.
• What you can do with RDDs, you can do with DStreams.
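As a rough mental model only, a DStream can be pictured as a sequence of (Time, batch) pairs, and any per-batch operation can be lifted to the whole stream. The sketch below uses plain Scala collections: `ToyDStream`, `Batch` and `transform` are hypothetical names standing in for the Spark types, not Spark's own API.

```scala
object DStreamSketch {
  type Time = Long
  type Batch[A] = Vector[A]                     // stand-in for RDD[A]
  type ToyDStream[A] = Vector[(Time, Batch[A])] // a stream as (Time, RDD) pairs

  // Lifts an RDD-level operation to the stream by applying it batch by batch.
  def transform[A, B](stream: ToyDStream[A])(f: Batch[A] => Batch[B]): ToyDStream[B] =
    stream.map { case (time, batch) => (time, f(batch)) }

  val lines: ToyDStream[String] = Vector(
    (1L, Vector("ok", "error: disk full")),
    (2L, Vector("error: timeout", "ok", "ok")),
    (3L, Vector("ok"))
  )

  // The same filter you would write against one RDD, applied to every batch.
  val errorLines = transform(lines)(_.filter(_.contains("error")))
}
```

Each batch keeps its timestamp; only its contents change, which is why RDD operations carry over to DStreams so directly.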
SPARK STREAMING (QUICK EXAMPLE)
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)
// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines with errors
errorLines.print()
// Start our streaming context and wait for it to "finish"
ssc.start()
// Wait for the job to finish
ssc.awaitTermination()
A DSTREAM LOOKS LIKE …
[Diagram: a DStream drawn along a time axis from the start time, split into the batches t1 to t2, t2 to t3 and t3 to t4]
A DSTREAM CAN HAVE TRANSFORMATIONS ON THEM!
[Diagram: two DStreams, data-1 and data-2, along a time axis; for the batch t1 to t2 a function f is applied between them. Transformation on the fly!]
SPARK STREAM TRANSFORMATION
[Diagram: the DStreams data-1 and data-2 along a time axis; f is applied to the batches t1 to t2 and t2 to t3, then to t3 to t4. Data is output in batches.]
STATEFUL SPARK STREAM TRANSFORMATION
[Diagram: the DStreams data-1 and data-2 along a time axis with the batches t1 to t2, t2 to t3 and t3 to t4; f is applied at every batch.]
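The difference between a stateless and a stateful transformation can be sketched with plain Scala collections. A stateless transformation looks at one batch at a time, while a stateful one carries a value forward from batch to batch. Everything here (`StatefulSketch`, `update`, the sample words) is hypothetical illustration, not Spark's API.

```scala
object StatefulSketch {
  // Three consecutive batches of words (t1 to t2, t2 to t3, t3 to t4).
  val batches = Vector(
    Vector("a", "b", "a"),
    Vector("b"),
    Vector("a", "c")
  )

  // Stateless: every batch is counted on its own, nothing is remembered.
  val perBatchCounts: Vector[Map[String, Int]] =
    batches.map(_.groupBy(identity).map { case (word, hits) => (word, hits.size) })

  // Stateful: the running totals from earlier batches feed into the next one.
  def update(state: Map[String, Int], batch: Vector[String]): Map[String, Int] =
    batch.foldLeft(state)((acc, word) => acc.updated(word, acc.getOrElse(word, 0) + 1))

  // scanLeft yields the initial empty state first, so drop it with tail.
  val runningCounts: Vector[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int])(update).tail
}
```

Because `runningCounts` depends on every earlier batch, losing the state means replaying the stream from the beginning, which is why check-pointing matters most for stateful transformations.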
HOW DOES SPARK STREAMING HANDLE FAULTS?
• As before, check-pointing is the key to fault tolerance (especially in stateful DStream transformations)
• Programs can recover from check-points => no need to restart all over again.
• You can use “monit” to restart Spark jobs or pass the Spark flag “--supervise” to the job config (a.k.a. driver fault tolerance)
• All incoming data to workers is replicated
• RDDs derived inside Spark follow the lineage graph to recover
• The above is known as worker fault tolerance.
• Receiver fault tolerance largely depends on whether the data sources can re-send lost data
• Streams guarantee exactly-once semantics; caveat: multiple writes to HDFS can still occur, and application-specific logic needs to handle them
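One common way for application logic to cope with repeated writes is to make the output idempotent, for example by keying every write on its batch time so that a replayed batch overwrites its earlier copy instead of appending a duplicate. The sketch below shows the idea with an in-memory map. The `sink`, `write` and `IdempotentSinkSketch` names are hypothetical, and a real job would apply the same idea to file paths or database keys.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  // Hypothetical output store keyed by batch time.
  val sink = mutable.Map.empty[Long, Vector[String]]

  // Writing the same batch twice leaves exactly one copy in the store.
  def write(batchTime: Long, records: Vector[String]): Unit = {
    sink(batchTime) = records
  }

  write(1L, Vector("a", "b"))
  write(2L, Vector("c"))
  write(2L, Vector("c")) // batch 2 replayed after a simulated failure

  // Read everything back in batch order.
  val stored: Vector[String] = sink.toVector.sortBy(_._1).flatMap(_._2)
}
```

An append-only sink would have stored "c" twice here; keying on the batch time keeps the replay harmless.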
REFERENCES
• Books:
• “Learning Spark: Lightning Fast Big Data Analytics”
• “Advanced Analytics with Spark: Patterns for Learning from Data at Scale”
• “Fast Data Processing with Spark”
• “Machine Learning with Spark”
• Berkeley Data Bootcamp
• Introduction to Big Data with Apache Spark
• Kien Dang’s introduction to Spark and R using Naive Bayes
• Spark Streaming with Scala and Akka
THE END
QUESTIONS?
TWITTER: @RAYMONDTAYBL GITHUB: @RAYGIT