THE TALE OF THE GLORIOUS LAMBDAS
& THE WERE-CLUSTERZ
Mateusz [email protected]
Michał [email protected]
More than the weather forecast.
MUCH MORE…
WE SPY ON SCIENTISTS
RAW DATA
COMMON MAP OF ACADEMIA
HADOOP
How to read millions of papers?
MAP REDUCE IN ONE PICTURE
WORD COUNT IS THE NEW HELLO WORLD
WORD COUNT IN VANILLA MAP-REDUCE
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
WHAT SHOULD A WORD COUNT
LOOK LIKE?
val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.length))

counts.foreach(println)
SCOOBI, SCALDING
Map-Reduce the right way, with lambdas.
WORD COUNT IN PURE SCALA
val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts.foreach(println)
WORD COUNT IN SCOOBI
val lines = fromTextFile("hdfs://in/...")

val words = lines.mapFlatten(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts
  .toTextFile("hdfs://out/...", overwrite = true)
  .persist()
BEHIND THE SCENES
val lines = fromTextFile("hdfs://in/...")

val words = lines.mapFlatten(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.length))

counts
  .toTextFile("hdfs://out/...", overwrite = true)
  .persist()
[Diagram: the flatMap, groupBy, and map steps of the pipeline compile down to a chain of map and reduce stages on the cluster]
SCOOBI SNACKS
– Joins, group-by, etc. baked in
– Static type checking with custom data types and IO
– One lang to rule them all (and it's THE lang)
– Easy local testing
– REPL
WHICH FRAMEWORK TO CHOOSE?
Scoobi              | Scalding
--------------------|------------------------------------
Pure Scala          | Cascading wrapper
Developed by NICTA  | Developed by Twitter
Strongly typed API  | Field-based and strongly typed API
Has cooler logo
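Scalding is named in the table but never shown; for comparison, here is a word count in Scalding's field-based API, essentially the example from Scalding's own tutorial:

import com.twitter.scalding._

// Run via com.twitter.scalding.Tool with --input and --output arguments.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                      // one tuple per line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                                 // count per word
    .write(Tsv(args("output")))
}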
THE NEW BIG DATA ZOO
Most slides are by Matei Zaharia from the Spark team
SPARK IDEA
MAPREDUCE PROBLEMS…
[Diagram: with MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> ...), and every interactive query (query 1, 2, 3, ...) re-reads the same input from HDFS to produce its result]
… SOLVED WITH SPARK
[Diagram: the input is read from HDFS once (one-time processing) into distributed memory; queries 1, 2, 3, ... then run directly against the in-memory data]
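A minimal sketch of what "solved" looks like in code, assuming a SparkContext `sc` (e.g. in the spark-shell) and an elided HDFS path; this is illustrative, not from the slides. HDFS is scanned once, and every later pass reads memory:

// Illustrative: iterate over cached data instead of re-reading HDFS.
val data = sc.textFile("hdfs://...").map(_.toDouble).cache() // read HDFS once

var guess = 1.0
for (_ <- 1 to 10) {
  // each pass scans the in-memory copy
  val err = data.map(x => x - guess).sum() / data.count()
  guess += 0.5 * err
}
println(guess)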
RESILIENT DISTRIBUTED DATASETS (RDDS)
Restricted form of distributed shared memory
» Partitioned data
» Higher-level operations (map, filter, join, ...)
» No side-effects
Efficient fault recovery using lineage
» List of operations
» Recompute lost partitions on failure
» No cost if nothing fails
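A small sketch of lineage in action, again assuming a SparkContext `sc`: transformations only record operations, an action runs them, and a lost partition is rebuilt by replaying its slice of the recorded chain.

// Each transformation records a step in the lineage; nothing runs yet.
val nums    = sc.parallelize(1 to 1000000, 8)   // partitioned data
val squares = nums.map(n => n.toLong * n)       // lineage: parallelize -> map
val evens   = squares.filter(_ % 2 == 0)        // lineage: ... -> filter

// The action triggers computation; if a partition is later lost,
// only that partition's lineage is recomputed.
println(evens.count())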
API
Scala, Python, Java
+ REPL
map"reduce
filter"groupBy
join"…
SPARK EXAMPLES
EXAMPLE: LOG MINING
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the Master turns each query into tasks and ships them to three Workers; each Worker reads its HDFS block (Block 1-3), filters it, and caches the resulting messages (Cache 1-3), so only results travel back and later queries run entirely against memory]
1 TB of data in 5-7 s (vs 170 s for on-disk data)
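The snippets above are shell-style; a self-contained version of the same pipeline as a Spark application might look like this (the HDFS path stays elided as in the slides, and `local[*]` is just for trying it out):

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    // local[*] is for testing; point master at a real cluster instead
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines      = sc.textFile("hdfs://...")
    val errors     = lines.filter(_.startsWith("ERROR"))
    val messages   = errors.map(_.split('\t')(2))   // third tab-separated field
    val cachedMsgs = messages.cache()               // stays in memory after first use

    // The first count materializes and caches; the second hits memory.
    println(cachedMsgs.filter(_.contains("foo")).count())
    println(cachedMsgs.filter(_.contains("bar")).count())

    sc.stop()
  }
}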
PAGERANK PERFORMANCE
[Chart: time per iteration (s): Hadoop 170.75, Spark 23.01]
SPARK LIBRARIES
SPARK’S ZOO
Spark (core)
Spark Streaming (real-time)
Shark (SQL)
MLlib (machine learning)
GraphX (graph)
BlinkDB
…
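As a taste of one zoo member, a minimal Spark Streaming word count, assuming a text source on localhost:9999 (e.g. `nc -lk 9999`); this mirrors the canonical example from the Spark docs:

import org.apache.spark._
import org.apache.spark.streaming._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("streaming-wc").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                      // dump each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}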
ALL IN ONE
val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
SPARK CONCLUSION
• In-memory processing
• Libraries
• Increasingly popular
USEFUL LINKS
• spark.apache.org
• spark-summit.org (videos & online hands-on tutorials)
Like Spark but less popular and less mature
CONCLUSION
• Big data today is where RDBMSs were in the '80s
• Scala goes well with big data
THANK YOU!!Q&A
SCALABILITY
[Charts: iteration time (s) vs number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem, and Spark; Spark shows the lowest iteration time at every cluster size]
INSUFFICIENT RAM
[Chart: iteration time (s) vs percent of working set in memory (0, 0.25, 0.5, 0.75, 1); time drops from 68.8 s with nothing cached through 58.1, 40.7, and 29.7 s down to 11.5 s with the full working set in memory]
PERFORMANCE
[Charts: SQL response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), and Shark (mem); graph-processing response time (min) for Hadoop, Giraph, and GraphX; streaming throughput (MB/s/node) for Storm and Spark Streaming]