
Scala and big data in ICM. Scoobi, Scalding, Spark, Stratosphere. Scalar 2014


DESCRIPTION

The Tale of the Glorious Lambdas & the Were-Clusterz


  • THE TALE OF THE GLORIOUS LAMBDAS & THE WERE-CLUSTERZ. Mateusz Fedoryszak ([email protected]), Michał Oniszczuk ([email protected])
  • More than the weather forecast.
  • MUCH MORE
  • WE SPY ON SCIENTISTS
  • RAW DATA
  • COMMON MAP OF ACADEMIA
  • HADOOP: How to read millions of papers?
  • IN ONE PICTURE: MapReduce (diagram)
  • WORD COUNT IS THE NEW HELLO WORLD
  • WORD COUNT IN VANILLA MAP-REDUCE
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input line.
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sum the counts emitted for each word.
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }
  • WHAT SHOULD A WORD COUNT LOOK LIKE?
    val lines = List("ala ma kota", "kot ma ale")

    val words  = lines.flatMap(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.length))

    counts.foreach(println)
  • SCOOBI, SCALDING: MapReduce the right way, with lambdas.
  • WORD COUNT IN PURE SCALA
    val lines = List("ala ma kota", "kot ma ale")

    val words  = lines.flatMap(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.size))

    counts.foreach(println)
  • WORD COUNT IN SCOOBI
    val lines = fromTextFile("hdfs://in/...")

    val words  = lines.mapFlatten(_.split(" "))
    val groups = words.groupBy(identity)
    val counts = groups.map(x => (x._1, x._2.size))

    counts
      .toTextFile("hdfs://out/...", overwrite = true)
      .persist()
  • BEHIND THE SCENES The same Scoobi word count as above, shown next to the physical plan it is compiled to: the flatMap, groupBy and map steps are translated into Hadoop map and reduce stages (diagram).
  • SCOOBI SNACKS Joins, group-by, etc. baked in. Static type checking with custom data types and IO. One lang to rule them all (and it's THE lang). Easy local testing. REPL!
  • WHICH ONE IS THE FRAMEWORK? Scoobi: pure Scala, developed by NICTA, strongly typed API. Scalding: a Cascading wrapper, developed by Twitter, field-based and strongly typed APIs, has the cooler logo. (A Scalding word-count sketch follows below.)
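    To make the comparison concrete, here is a minimal sketch of the same word count in Scalding's typed API; the job name and the input/output arguments are illustrative, not from the talk:

      import com.twitter.scalding._

      // Hypothetical word-count job using Scalding's typed API.
      // Run with e.g.: --input hdfs://in/... --output hdfs://out/...
      class WordCountJob(args: Args) extends Job(args) {
        TypedPipe.from(TextLine(args("input")))            // one String per input line
          .flatMap { line => line.split("""\s+""").toList } // split lines into words
          .groupBy { word => word }                          // group equal words together
          .size                                              // count each group
          .write(TypedTsv[(String, Long)](args("output")))   // (word, count) pairs as TSV
      }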
  • THE NEW BIG DATA ZOO Most slides are by Matei Zaharia from the Spark team
  • SPARK IDEA
  • MAPREDUCE PROBLEMS (diagram) Iterative jobs write to and re-read HDFS between every iteration, and every interactive query re-reads the input from HDFS.
  • SOLVED WITH SPARK (diagram) Iterations and queries share data through distributed memory; the input is read from HDFS only once.
  • RESILIENT DISTRIBUTED DATASETS (RDDS) A restricted form of distributed shared memory: partitioned data, higher-level operations (map, filter, join, ...), no side effects. Efficient fault recovery using lineage: each RDD keeps the list of operations that built it, lost partitions are recomputed on failure, and there is no cost if nothing fails (see the sketch below).
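    A minimal sketch of what lineage and caching look like from the Spark shell (the data and names are illustrative, not from the talk):

      // In the Spark shell a SparkContext is already available as `sc`.
      val nums    = sc.parallelize(1 to 1000000, 8)        // partitioned data (8 partitions)
      val evens   = nums.filter(_ % 2 == 0)                // lineage step 1 (nothing runs yet)
      val squares = evens.map(n => n.toLong * n).cache()   // lineage step 2, keep result in memory

      squares.count()        // first action: computes the pipeline and fills the cache
      squares.reduce(_ + _)  // reuses cached partitions; a lost partition would be
                             // recomputed from the recorded lineage (filter + map)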
  • API Scala, Python, Java + REPL. Operations such as map, reduce, filter, groupBy, join, ... (a word-count sketch in this API follows below).
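    For comparison with the earlier word counts, a minimal sketch of the same job written against this API in the Spark shell (paths are placeholders):

      // In the Spark shell `sc` is the SparkContext.
      val lines  = sc.textFile("hdfs://in/...")
      val counts = lines
        .flatMap(_.split(" "))        // split lines into words
        .map(word => (word, 1))       // pair each word with a count of 1
        .reduceByKey(_ + _)           // sum the counts per word
      counts.saveAsTextFile("hdfs://out/...")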
  • SPARK EXAMPLES
  • EXAMPLE: LOG MINING Load error messages from a log into memory, then interactively search for various patterns (diagram: a Master/driver and three Workers, each Worker reading one HDFS block and keeping an in-memory cache).
    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(2))
    val cachedMsgs = messages.cache()
  • The slide builds up step by step: the first action, cachedMsgs.filter(_.contains("foo")).count, makes the Master ship tasks to the Workers, which read their HDFS blocks, fill their caches and return results; a second query such as cachedMsgs.filter(_.contains("bar")).count then runs entirely on the cached data. Result: 1 TB of data searched in 5-7 s, vs 170 s for on-disk data.
  • PAGERANK PERFORMANCE (chart) Time per iteration: 170.75 s on Hadoop vs 23.01 s on Spark.
  • SPARK LIBRARIES
  • SPARK'S ZOO Spark, Spark Streaming (real-time), GraphX (graph), Shark (SQL), MLlib (machine learning), BlinkDB.
  • ALL IN ONE Batch SQL, machine learning and streaming combined (illustrative pseudocode from the slides; a runnable sketch follows below):
    val points = sc.runSql[Double, Double](
      "select latitude, longitude from historic_tweets")

    val model = KMeans.train(points, 10)

    sc.twitterStream(...)
      .map(t => (model.closestCenter(t.location), 1))
      .reduceByWindow("5s", _ + _)
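    A minimal sketch of the same batch-plus-streaming idea using the stock Spark 1.x APIs; the input path, host, port and the "lat,lon" line format are assumptions made for illustration:

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._

      object AllInOneSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local[2]", "all-in-one")

          // Batch part: train a k-means model on historic points stored as "lat,lon" lines.
          val points = sc.textFile("hdfs://historic_tweets/...")
            .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
          val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations

          // Streaming part: assign each incoming point to its nearest cluster
          // and count points per cluster over a 5-second window.
          val ssc = new StreamingContext(sc, Seconds(1))
          ssc.socketTextStream("localhost", 9999)
            .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
            .map(v => (model.predict(v), 1))
            .reduceByKeyAndWindow(_ + _, Seconds(5))
            .print()

          ssc.start()
          ssc.awaitTermination()
        }
      }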
  • SPARK CONCLUSION In-memory processing; libraries (Spark Streaming, GraphX, Shark, MLlib, BlinkDB); increasingly popular.
  • USEFUL LINKS spark.apache.org; spark-summit.org (videos & online hands-on tutorials)
  • STRATOSPHERE Like Spark, but less popular and less mature.
  • CONCLUSION Big data today is where RDBMSs were in the '80s; Scala goes well with big data.
  • THANK YOU! Q&A
  • SCALABILITY (backup charts) Iteration time vs. number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem and Spark, with Spark fastest at every cluster size.
  • INSUFFICIENT RAM (backup chart) Performance degrades gracefully when the working set does not fully fit in memory: iteration time grows from 11.5 s with 100% of the working set in memory to 68.8 s with none of it in memory.
  • PERFORMANCE (backup charts) SQL response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem); graph-processing response time (min) for Hadoop, Giraph, GraphX; streaming throughput (MB/s/node) for Storm and Spark Streaming.