
Scala and big data in ICM. Scoobi, Scalding, Spark, Stratosphere. Scalar 2014


DESCRIPTION

The Tale of the Glorious Lambdas & the Were-Clusterz


  • THE TALE OF THE GLORIOUS LAMBDAS & THE WERE-CLUSTERZ. Mateusz Fedoryszak ([email protected]), Michał Oniszczuk ([email protected])
  • More than the weather forecast.
  • MUCH MORE
  • WE SPY ON SCIENTISTS
  • RAW DATA
  • COMMON MAP OF ACADEMIA
  • HADOOP: How to read millions of papers?
  • IN ONE PICTURE: MapReduce (diagram)
  • WORD COUNT IS THE NEW HELLO WORLD
  • WORD COUNT IN VANILLA MAP-REDUCE
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input line.
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sum the counts emitted for each word.
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }
  • WHAT SHOULD A WORD COUNT LOOK LIKE?
    val lines = List("ala ma kota", "kot ma ale")

    val words  = lines.flatMap(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.length))

    counts.foreach(println)
  • SCOOBI, SCALDING: MapReduce the right way, with lambdas.
  • WORD COUNT IN PURE SCALA
    val lines = List("ala ma kota", "kot ma ale")

    val words  = lines.flatMap(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.size))

    counts.foreach(println)
  • WORD COUNT IN SCOOBI
    val lines = fromTextFile("hdfs://in/...")

    val words  = lines.mapFlatten(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.size))

    counts
      .toTextFile("hdfs://out/...", overwrite = true)
      .persist()
  • BEHIND THE SCENES The same Scoobi word count as above, shown next to the physical plan it is compiled to: the flatMap, groupBy and map steps are translated into Hadoop map and reduce stages (diagram).
  • SCOOBI SNACKS Joins, group-by, etc. baked in. Static type checking with custom data types and IO. One lang to rule them all (and it's THE lang). Easy local testing. REPL!
  • WHICH ONE IS THE FRAMEWORK? Scoobi: pure Scala, developed by NICTA, strongly typed API. Scalding: a Cascading wrapper, developed by Twitter, field-based and strongly typed APIs, has the cooler logo. (A Scalding word-count sketch follows below.)
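    To make the comparison concrete, here is a minimal sketch of the same word count in Scalding's typed API; the job name and the input/output arguments are illustrative, not from the talk:

      import com.twitter.scalding._

      // Hypothetical word-count job using Scalding's typed API.
      // Run with e.g.: --input hdfs://in/... --output hdfs://out/...
      class WordCountJob(args: Args) extends Job(args) {
        TypedPipe.from(TextLine(args("input")))            // one String per input line
          .flatMap { line => line.split("""\s+""").toList } // split lines into words
          .groupBy { word => word }                          // group equal words together
          .size                                              // count each group
          .write(TypedTsv[(String, Long)](args("output")))   // (word, count) pairs as TSV
      }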
  • THE NEW BIG DATA ZOO Most slides are by Matei Zaharia from the Spark team
  • SPARK IDEA
  • MAPREDUCE PROBLEMS (diagram) Iterative jobs write to and re-read HDFS between every iteration, and every interactive query re-reads the input from HDFS.
  • SOLVED WITH SPARK (diagram) Iterations and queries share data through distributed memory; the input is read from HDFS only once.
  • RESILIENT DISTRIBUTED DATASETS (RDDS) A restricted form of distributed shared memory: partitioned data, higher-level operations (map, filter, join, ...), no side effects. Efficient fault recovery using lineage: each RDD keeps the list of operations that built it, lost partitions are recomputed on failure, and there is no cost if nothing fails (see the sketch below).
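    A minimal sketch of what lineage and caching look like from the Spark shell (the data and names are illustrative, not from the talk):

      // In the Spark shell a SparkContext is already available as `sc`.
      val nums    = sc.parallelize(1 to 1000000, 8)        // partitioned data (8 partitions)
      val evens   = nums.filter(_ % 2 == 0)                // lineage step 1 (nothing runs yet)
      val squares = evens.map(n => n.toLong * n).cache()   // lineage step 2, keep result in memory

      squares.count()        // first action: computes the pipeline and fills the cache
      squares.reduce(_ + _)  // reuses cached partitions; a lost partition would be
                             // recomputed from the recorded lineage (filter + map)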
  • API Scala, Python, Java + REPL. Operations such as map, reduce, filter, groupBy, join, ... (a word-count sketch in this API follows below).
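    For comparison with the earlier word counts, a minimal sketch of the same job written against this API in the Spark shell (paths are placeholders):

      // In the Spark shell `sc` is the SparkContext.
      val lines  = sc.textFile("hdfs://in/...")
      val counts = lines
        .flatMap(_.split(" "))        // split lines into words
        .map(word => (word, 1))       // pair each word with a count of 1
        .reduceByKey(_ + _)           // sum the counts per word
      counts.saveAsTextFile("hdfs://out/...")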
  • SPARK EXAMPLES
  • EXAMPLE: LOG MINING Load error messages from a log into memory, then interactively search for various patterns (diagram: a Master/driver and three Workers, each Worker reading one HDFS block and keeping an in-memory cache).
    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(2))
    val cachedMsgs = messages.cache()
  • The slide builds up step by step: the first action, cachedMsgs.filter(_.contains("foo")).count, makes the Master ship tasks to the Workers, which read their HDFS blocks, fill their caches and return results; a second query such as cachedMsgs.filter(_.contains("bar")).count then runs entirely on the cached data. Result: 1 TB of data searched in 5-7 s, vs 170 s for on-disk data.
  • PAGERANK PERFORMANCE (chart) Time per iteration: 170.75 s on Hadoop vs 23.01 s on Spark.
  • SPARK LIBRARIES
  • SPARK'S ZOO Spark, Spark Streaming (real-time), GraphX (graph), Shark (SQL), MLlib (machine learning), BlinkDB.
  • ALL IN ONE Batch SQL, machine learning and streaming combined (illustrative pseudocode from the slides; a runnable sketch follows below):
    val points = sc.runSql[Double, Double](
      "select latitude, longitude from historic_tweets")

    val model = KMeans.train(points, 10)

    sc.twitterStream(...)
      .map(t => (model.closestCenter(t.location), 1))
      .reduceByWindow("5s", _ + _)
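    A minimal sketch of the same batch-plus-streaming idea using the stock Spark 1.x APIs; the input path, host, port and the "lat,lon" line format are assumptions made for illustration:

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._

      object AllInOneSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local[2]", "all-in-one")

          // Batch part: train a k-means model on historic points stored as "lat,lon" lines.
          val points = sc.textFile("hdfs://historic_tweets/...")
            .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
          val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations

          // Streaming part: assign each incoming point to its nearest cluster
          // and count points per cluster over a 5-second window.
          val ssc = new StreamingContext(sc, Seconds(1))
          ssc.socketTextStream("localhost", 9999)
            .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
            .map(v => (model.predict(v), 1))
            .reduceByKeyAndWindow(_ + _, Seconds(5))
            .print()

          ssc.start()
          ssc.awaitTermination()
        }
      }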
  • SPARK CONCLUSION In-memory processing; libraries (Spark Streaming, GraphX, Shark, MLlib, BlinkDB); increasingly popular.
  • USEFUL LINKS spark.apache.org; spark-summit.org (videos & online hands-on tutorials)
  • STRATOSPHERE Like Spark, but less popular and less mature.
  • CONCLUSION Big data today is where RDBMSs were in the '80s; Scala goes well with big data.
  • THANK YOU! Q&A
  • SCALABILITY (backup charts) Iteration time vs. number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem and Spark, with Spark fastest at every cluster size.
  • INSUFFICIENT RAM (backup chart) Performance degrades gracefully when the working set does not fully fit in memory: iteration time grows from 11.5 s with 100% of the working set in memory to 68.8 s with none of it in memory.
  • PERFORMANCE (backup charts) SQL response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem); graph-processing response time (min) for Hadoop, Giraph, GraphX; streaming throughput (MB/s/node) for Storm and Spark Streaming.