Scalding the not-so-basics Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin

Scalding - the not-so-basics @ ScalaDays 2014

DESCRIPTION

Some more in-depth tips on writing and optimising Scalding MapReduce jobs

Page 1: Scalding - the not-so-basics @ ScalaDays 2014

Scalding: the not-so-basics

Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin

Page 2: Scalding - the not-so-basics @ ScalaDays 2014

Konrad `@ktosopl` Malawski

typesafe.com geecon.org

Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London

GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow

hAkker @

Page 3: Scalding - the not-so-basics @ ScalaDays 2014

http://hadoop.apache.org/

http://research.google.com/archive/mapreduce.html

How old is this guy?

Google MapReduce, paper: 2004
Hadoop (Yahoo impl): 2005

Page 5: Scalding - the not-so-basics @ ScalaDays 2014

the Big Landscape

Page 6: Scalding - the not-so-basics @ ScalaDays 2014

Hadoop

Page 7: Scalding - the not-so-basics @ ScalaDays 2014

https://github.com/twitter/scalding

Scalding is “on top of” Hadoop

Page 8: Scalding - the not-so-basics @ ScalaDays 2014

https://github.com/twitter/scalding

Scalding is “on top of” Cascading, which is “on top of” Hadoop

http://www.cascading.org/

Page 9: Scalding - the not-so-basics @ ScalaDays 2014

https://github.com/twitter/scalding
http://www.cascading.org/
https://github.com/twitter/summingbird
http://storm.incubator.apache.org/
http://spark.apache.org/

Summingbird is “on top of” Scalding or Storm; Scalding is “on top of” Cascading, which is “on top of” Hadoop.

Spark is a bit “separate” currently: HDFS yes, MapReduce no. Possibly soon?!

Spark has nothing to do with all this.

This talk: Scalding.

Page 18: Scalding - the not-so-basics @ ScalaDays 2014

Why?

Page 19: Scalding - the not-so-basics @ ScalaDays 2014

Stuff > Memory

Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."   // in Memory

text
  .split(" ")                              // in Memory
  .map(a => (a, 1))                        // in Memory
  .groupBy(_._1)                           // in Memory
  .map(a => (a._1, a._2.map(_._2).sum))    // in Memory

Page 25: Scalding - the not-so-basics @ ScalaDays 2014

Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Page 35: Scalding - the not-so-basics @ ScalaDays 2014

“Field API”

Page 36: Scalding - the not-so-basics @ ScalaDays 2014

map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }   // Int => Int

Scalding:

IterableSource(data, 'number)
  .map('number -> 'doubled) { n: Int => n * 2 }   // Int => Int

available in Pipe
stays in Pipe

must choose type!

Page 43: Scalding - the not-so-basics @ ScalaDays 2014

mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }
data = null   // “release reference”

Scalding:

IterableSource(data, 'number)
  .mapTo('number -> 'doubled) { n: Int => n * 2 }

'doubled stays in Pipe
'number is removed
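The difference between map and mapTo can be tried out without a cluster. The following is only a sketch in plain Scala, with no Scalding dependency: a row is assumed to be a Map from field name to value, and `FieldModel`, `mapField` and `mapToField` are hypothetical names made up for this illustration.

```scala
object FieldModel {
  // A row is modelled as field name -> value (an assumption for illustration).
  type Row = Map[String, Int]

  // Like map('from -> 'to): the new field is added, the existing ones stay.
  def mapField(rows: List[Row], from: String, to: String)(f: Int => Int): List[Row] =
    rows.map(row => row + (to -> f(row(from))))

  // Like mapTo('from -> 'to): only the new field survives.
  def mapToField(rows: List[Row], from: String, to: String)(f: Int => Int): List[Row] =
    rows.map(row => Map(to -> f(row(from))))

  val rows: List[Row] = List(1, 2, 3).map(n => Map("number" -> n))
  val mapped: List[Row] = mapField(rows, "number", "doubled")(_ * 2)
  val mappedTo: List[Row] = mapToField(rows, "number", "doubled")(_ * 2)
}
```

In this model `mapped` still carries "number" next to "doubled", while `mappedTo` carries only "doubled", which is the point of the “release reference” analogy.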

Page 50: Scalding - the not-so-basics @ ScalaDays 2014

flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

val numbers = data flatMap { line =>        // String
  line.split(",")                           // Array[String]
} map { _.toInt }                           // List[Int]

Scalding:

TextLine(data)                                                   // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",") }   // like List[String]
  .map('word -> 'number) { word: String => word.toInt }          // like List[Int]

flatMap, converting inside the function

Scala:

val numbers = data flatMap { line =>        // String
  line.split(",").map(_.toInt)              // Array[Int]
}

Scalding:

TextLine(data)                                                                  // like List[String]
  .flatMap('line -> 'word) { line: String => line.split(",").map(_.toInt) }     // like List[Int]

Page 57: Scalding - the not-so-basics @ ScalaDays 2014

groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil   // List[Int]

val groups = data groupBy { _ < 10 }

groups   // Map[Boolean, List[Int]]

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('lessThanTenCounts) }

groups all with == value
'lessThanTenCounts

groupBy, step by step, summing instead of counting:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }

'total = [3, 72]
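The grouping above can be checked locally with plain Scala collections. This is a minimal sketch, not Scalding code; `GroupByExample` is a hypothetical name, and `sizes` and `totals` stand in for what `_.size` and `_.sum` compute per group.

```scala
object GroupByExample {
  val data: List[Int] = List(1, 2, 30, 42)

  // groupBy keeps the grouped elements, so the result is Map[Boolean, List[Int]]
  val groups: Map[Boolean, List[Int]] = data.groupBy(_ < 10)

  // Per-group count and per-group sum
  val sizes: Map[Boolean, Int] = groups.map { case (k, v) => k -> v.size }
  val totals: Map[Boolean, Int] = groups.map { case (k, v) => k -> v.sum }
}
```

The group with values below ten sums to 1 + 2 = 3, the other group to 30 + 42 = 72.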

Page 68: Scalding - the not-so-basics @ ScalaDays 2014

Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  // args comes from App
  ToolRunner.run(new Configuration, new scalding.Tool, args)

}

Page 70: Scalding - the not-so-basics @ ScalaDays 2014

Word Count in Scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }   // long form: group => group.size('count)
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}

The whole pipeline: 4 lines.
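The slides leave `tokenize` as `implemented`. One possible version, purely as an assumption about what such a helper might do, is sketched below in plain Scala together with an in-memory equivalent of the flatMap and groupBy steps; `Tokenizer` and `wordCount` are hypothetical names and are not part of Scalding.

```scala
object Tokenizer {
  // Hypothetical tokenizer: lower-case the text and split on anything
  // that is not a letter or a digit, dropping empty tokens.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty)

  // In-memory counterpart of: flatMap('line -> 'word) + groupBy('word) { _.size }
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toList)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The in-memory version only works while the input fits on one machine, which is exactly the limitation the job version removes.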

Page 79: Scalding - the not-so-basics @ ScalaDays 2014

1 day in the life of a guy implementing Scalding jobs

Page 80: Scalding - the not-so-basics @ ScalaDays 2014

“How much are my shops selling?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv(output))

1    107
2    144
3    16
…    …

With a header row:

  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
1       107
2       144
3       16
…       …

Page 84: Scalding - the not-so-basics @ ScalaDays 2014

“Which are the top selling shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16
…       …

Page 86: Scalding - the not-so-basics @ ScalaDays 2014

“What are the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }
  .write(Tsv(output, writeHeader = true))

shopId  totalSoldItems
2       144
1       107
3       16

SLOW! Instead do sortWithTake!

Page 89: Scalding - the not-so-basics @ ScalaDays 2014

“What are the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3)
  }
  .write(Tsv(output, writeHeader = true))

x
List((5,146), (2,142), (3,32))

WAT!?

Emits scala.collection.List[_]

Page 93: Scalding - the not-so-basics @ ScalaDays 2014

“What are the top 3 shops?”

Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .groupAll {
    _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) {
      (l: (Long, Long), r: (Long, Long)) =>
        l._2 > r._2   // compare left with right; largest totals come first
    }
  }
  .flatMapTo('x -> ('shopId, 'totalSoldItems)) {
    x: List[(Long, Long)] => x
  }
  .write(Tsv(output, writeHeader = true))

Provide Ordering explicitly because implicit Ordering is not enough for Tuple2 here.

shopId  totalSoldItems
2       144
1       107
3       16

MUCH faster Job = Happier me.
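Why taking the top k directly beats a full sort followed by take can be sketched in plain Scala. This is not Scalding's implementation, only an illustration of the idea under the assumption that each record is a (shopId, totalSold) pair; `TopK` and `topK` are hypothetical names. At most k records are kept at any moment, so the whole data set never has to be sorted.

```scala
import scala.collection.mutable

object TopK {
  // Keep only the k largest totals seen so far instead of sorting everything.
  def topK(shops: Iterator[(Long, Long)], k: Int): List[(Long, Long)] = {
    // Reversed ordering on the total: the queue's head is the smallest kept total.
    val smallestFirst: Ordering[(Long, Long)] =
      Ordering.by[(Long, Long), Long](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(Long, Long)](smallestFirst)

    shops.foreach { shop =>
      heap.enqueue(shop)
      if (heap.size > k) heap.dequeue() // drop the current smallest
    }
    // Only k elements are left, so this final sort is cheap.
    heap.toList.sortBy(shop => -shop._2)
  }
}
```

A usage example: `TopK.topK(Iterator((1L, 107L), (2L, 144L), (3L, 16L), (4L, 5L)), 3)` keeps shops 2, 1 and 3, in that order.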

Page 99: Scalding - the not-so-basics @ ScalaDays 2014

Reduce, these Monoids

interface:

trait Monoid[T] {
  def zero: T
  def +(a: T, b: T): T
}

+ 3 laws:

Closure: (T, T) => T
  ∀ a, b ∈ T: a·b ∈ T

Associativity: (a + b) + c == a + (b + c)
  ∀ a, b, c ∈ T: (a·b)·c = a·(b·c)

Identity element: z + a == a + z == a
  ∃ z ∈ T: ∀ a ∈ T: z·a = a·z = a

Page 110: Scalding - the not-so-basics @ ScalaDays 2014

Reduce, these Monoids

Summing:

object IntSum extends Monoid[Int] {
  def zero = 0
  def +(a: Int, b: Int) = a + b
}
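To see what the interface buys, here is a small self-contained sketch in plain Scala. The trait and `IntSum` follow the slides; `StringConcat` and `sumAll` are hypothetical additions for illustration and are not Algebird's API.

```scala
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def +(a: T, b: T): T
  }

  object IntSum extends Monoid[Int] {
    def zero = 0
    def +(a: Int, b: Int) = a + b
  }

  // Hypothetical second instance: string concatenation
  object StringConcat extends Monoid[String] {
    def zero = ""
    def +(a: String, b: String) = a + b
  }

  // Works for any Monoid: start from zero and keep combining.
  def sumAll[T](xs: Seq[T], m: Monoid[T]): T =
    xs.foldLeft(m.zero)((acc, x) => m.+(acc, x))
}
```

Because of the identity law, `sumAll` is well defined for an empty input, and because of associativity the elements may be combined in any grouping.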

Page 111: Scalding - the not-so-basics @ ScalaDays 2014

Monoid ops can start “Map-side”

Monoid ops can already start being computed map-side!

bear, 2
car, 3
deer, 2
river, 2

Examples: average(), sum(), sortWithTake(), histogram()
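Map-side computation works because partial results can be merged in any grouping. The sketch below simulates this in plain Scala, without Hadoop. The three input splits are an assumption chosen to reproduce the counts shown above, and `MapSideSum`, `partial` and `merge` are hypothetical names.

```scala
object MapSideSum {
  // Merge two partial counts; Map[String, Int] with this merge forms a monoid.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Each "mapper" pre-aggregates its own split before anything is shuffled.
  def partial(split: Seq[String]): Map[String, Int] =
    split.groupBy(identity).map { case (word, ws) => word -> ws.size }

  // Assumed input: three splits, one per mapper.
  val splits: Seq[Seq[String]] = Seq(
    Seq("deer", "bear", "river"),
    Seq("car", "car", "river"),
    Seq("deer", "car", "bear")
  )

  // The "reducer" only has to merge the small partial maps.
  val total: Map[String, Int] =
    splits.map(partial).foldLeft(Map.empty[String, Int])(merge)
}
```

Less data crosses the network because each mapper sends one pre-aggregated entry per distinct word rather than one record per occurrence.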

Page 113: Scalding - the not-so-basics @ ScalaDays 2014

Obligatory: “Go check out Algebird, NOW!” slide

https://github.com/twitter/algebird

ALGE-birds

Page 114: Scalding - the not-so-basics @ ScalaDays 2014

BloomFilterMonoid

https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL

val NUM_HASHES = 6
val WIDTH = 32
val SEED = 1
val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

val bf1 = bfMonoid.create("1", "2", "3", "4", "100")
val bf2 = bfMonoid.create("12", "45")
val bf = bf1 ++ bf2
// bf: com.twitter.algebird.BF =

val approxBool = bf.contains("1")
// approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)

val res = approxBool.isTrue
// res: Boolean = true

Page 120: Scalding - the not-so-basics @ ScalaDays 2014

BloomFilterMonoid

Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))
  .groupBy('shopId) {
    _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) {
      (bf: BF, itemName: String) => bf + itemName
    }
  }
  .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }
  .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }
  .discard('itemBloom)
  .write(Tsv(output, writeHeader = true))

shopId  hasSoldBeer  hasSoldWurst
1       false        true
2       false        true
3       false        true
4       true         false
5       true         false

ApproximateBoolean(true,0.9999580954658956)

Why not a Set[String]? With enough distinct items per shop it would throw an OutOfMemoryError.
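To make the memory argument concrete, here is a minimal sketch of a Bloom filter in plain Scala. `TinyBloom` and `TinyBloomDemo` are hypothetical names and this is not Algebird's `BF` implementation; it only illustrates that the filter uses at most `width` bits no matter how many items are added, and that two filters merge with a bitwise OR.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical, simplified stand-in for Algebird's BF (not the library's code).
final case class TinyBloom(numHashes: Int, width: Int, bits: BitSet = BitSet.empty) {
  // One bit position per hash function, derived by varying the MurmurHash3 seed.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), width)
    }

  // Adding an item only sets bits; the filter never grows past `width` bits.
  def +(item: String): TinyBloom = copy(bits = bits ++ positions(item))

  // Merging two filters is a bitwise OR, which is what makes this a monoid.
  def ++(other: TinyBloom): TinyBloom = copy(bits = bits | other.bits)

  // false means "definitely absent"; true means "probably present".
  def mightContain(item: String): Boolean = positions(item).forall(bits.contains)
}

object TinyBloomDemo {
  val zero   = TinyBloom(numHashes = 6, width = 1024)
  val shop1  = Seq("beer", "pretzel").foldLeft(zero)(_ + _)
  val shop2  = Seq("wurst").foldLeft(zero)(_ + _)
  val merged = shop1 ++ shop2
}
```

Items that were added are always reported as present; items that were never added may occasionally be reported as present too, which is why Algebird returns an ApproximateBoolean rather than a Boolean.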

Page 121: Scalding - the not-so-basics @ ScalaDays 2014

Joins

Page 124: Scalding - the not-so-basics @ ScalaDays 2014

Joins

that.joinWithLarger('id1 -> 'id2, other)
that.joinWithSmaller('id1 -> 'id2, other)

that.joinWithTiny('id1 -> 'id2, other)

joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.

The “usual”

Page 127: Scalding - the not-so-basics @ ScalaDays 2014

Joins

val people = IterableSource(
  (1, "hans") ::
  (2, "bob") ::
  (3, "hermut") ::
  (4, "heinz") ::
  (5, "klemens") :: … :: Nil,
  ('id, 'name))

val cars = IterableSource(
  (99, 1, "bmw") ::
  (123, 2, "mercedes") ::
  (240, 11, "other") :: Nil,
  ('carId, 'ownerId, 'carName))

import com.twitter.scalding.FunctionImplicits._

people.joinWithLarger('id -> 'ownerId, cars)
  .map(('name, 'carName) -> 'sentence) {
    (name: String, car: String) =>
      s"Hello $name, your $car is really nice"
  }
  .project('sentence)
  .write(output)

Hello hans, your bmw is really nice
Hello bob, your mercedes is really nice

Page 128: Scalding - the not-so-basics @ ScalaDays 2014

“map-side” join

that.joinWithTiny('id1 -> 'id2, tinyPipe)

Choose this when:

Left > max(mappers, reducers) * Right

or: when the Left side is 3 orders of magnitude larger.
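The rule of thumb above can be written down as a small helper. This is only a sketch: `JoinChoice`, `Strategy` and `choose` are hypothetical names, not part of Scalding, and the row counts are assumed to be estimates you already have.

```scala
object JoinChoice {
  sealed trait Strategy
  case object MapSideTiny extends Strategy // would use joinWithTiny
  case object ReduceSide  extends Strategy // would use joinWithSmaller / joinWithLarger

  // Rule of thumb from the slide: Left > max(mappers, reducers) * Right.
  def choose(leftRows: Long, rightRows: Long, mappers: Int, reducers: Int): Strategy =
    if (leftRows > math.max(mappers, reducers).toLong * rightRows) MapSideTiny
    else ReduceSide
}
```

For example, a billion-row left side against a thousand-row right side with 200 mappers satisfies the inequality, so the map-side join would be picked.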

Page 132: Scalding - the not-so-basics @ ScalaDays 2014

Skew Joins

val sampleRate = 0.001
val reducers = 10
val replicationFactor = 1
val replicator = SkewReplicationA(replicationFactor)

val genders: RichPipe = …
val followers: RichPipe = …

followers
  .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)
  .project('x1, 'y1, 's1, 'x2, 'y2, 's2)
  .write(Tsv("output"))

1. Sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe.

2. Use these estimated counts to replicate the join keys, according to the given replication strategy.

3. Join the replicated pipes together.
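The three steps can be sketched on ordinary in-memory collections. This is an illustration under simplifying assumptions, not Scalding's implementation: only the left side is sampled, and the replication rule (`replicas`, one replica per `perReplica` sampled rows) is a hypothetical stand-in for strategies such as SkewReplicationA.

```scala
import scala.util.Random

object SkewJoinSketch {
  // Step 1: sample the rows and count how often each key shows up in the sample.
  def sampledCounts[K, V](rows: Seq[(K, V)], sampleRate: Double, rnd: Random): Map[K, Int] =
    rows.filter(_ => rnd.nextDouble() < sampleRate)
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.size }

  // Step 2: hypothetical replication strategy, one replica per `perReplica` sampled rows.
  def replicas(sampled: Int, perReplica: Int): Int = math.max(1, sampled / perReplica)

  def skewJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)],
                        sampleRate: Double, perReplica: Int, seed: Long): Seq[(K, A, B)] = {
    val rnd = new Random(seed)
    val counts = sampledCounts(left, sampleRate, rnd)
    def r(k: K): Int = replicas(counts.getOrElse(k, 0), perReplica)

    // Left rows of a hot key are scattered over its replicas...
    val saltedLeft = left.map { case (k, a) => ((k, rnd.nextInt(r(k))), a) }
    // ...while each matching right row is copied to every replica.
    val saltedRight = right.flatMap { case (k, b) => (0 until r(k)).map(i => ((k, i), b)) }

    // Step 3: an ordinary join on the salted key.
    val rightByKey = saltedRight.groupBy(_._1)
    for {
      (key, a) <- saltedLeft
      (_, b)   <- rightByKey.getOrElse(key, Seq.empty)
    } yield (key._1, a, b)
  }
}
```

Because every left row lands on exactly one replica and every right row is present on all replicas of its key, the result contains the same rows as a plain join; only the distribution of work per key changes.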

Page 136: Scalding - the not-so-basics @ ScalaDays 2014

Where did my type-safety go?!

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Caused by: cascading.flow.FlowException: local step failed
  at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219)
  at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
  at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81)
  at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34)
  at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
  at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NumberFormatException: For input string: "bob"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50)
  at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)

“oh, right… We changed that file to be user names, not ids…”

Page 138: Scalding - the not-so-basics @ ScalaDays 2014

Trap it!

Tsv(in, ('userId1, 'userId2, 'rel))
  .addTrap(Tsv("errors")) // add a trap
  .filter('userId1) { uid1: Long => uid1 == 1337 }
  .write(Tsv(out))

Solves “dirty data”, but is no help for maintenance.
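The idea behind a trap can be shown on plain collections: rows that fail to parse are diverted to a side output instead of failing the whole job. `TrapSketch` and `filterWithTrap` are hypothetical names for this sketch and have nothing to do with how Cascading implements traps.

```scala
import scala.util.Try

object TrapSketch {
  // (userId1, userId2, rel), all still raw text as read from the file.
  type Row = (String, String, String)

  // Returns (rows that passed the filter, rows diverted to the "trap").
  def filterWithTrap(rows: Seq[Row], wanted: Long): (Seq[Row], Seq[Row]) = {
    // Rows whose first column is not a Long go to the trap instead of throwing.
    val (parsed, trapped) = rows.partition(r => Try(r._1.toLong).isSuccess)
    (parsed.filter(_._1.toLong == wanted), trapped)
  }
}
```

As on the slide, this keeps the job alive when the data is dirty, but it does nothing to tell you that the schema of the file changed.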

Page 139: Scalding - the not-so-basics @ ScalaDays 2014

Typed API

Page 142: Scalding - the not-so-basics @ ScalaDays 2014

Typed APIs

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))

import TDsl._

TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Must give a type to each field

Page 146: Scalding - the not-so-basics @ ScalaDays 2014

Typed APIs

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))

import TDsl._

// Tuple arity: 3
TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

// Tuple arity: 2
TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))
  .filter { _._1 == "bob" }
  .write(TypedTsv(out))

Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2
  at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176)

“planning-time” exception

Page 149: Scalding - the not-so-basics @ ScalaDays 2014

Typed APIs

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))

// … with Relationships {
  import TDsl._

  userRelationships(date)
    .filter { _._1 == "bob" }
    .write(TypedTsv(out))

}

Easier to reuse schemas now

Not coupled by Field names, but still too magic for reuse… “_1”?

Page 151: Scalding - the not-so-basics @ ScalaDays 2014

Typed APIs

Tsv(in, ('userId1, 'userId2, 'rel))
  .filter('userId1) { rel: Long => rel == 1337 }
  .write(Tsv(out))

// … with Relationships {
  import TDsl._

  userRelationships(date)
    .filter { p: Person => p.name == "bob" }
    .write(TypedTsv(out))

}

TypedPipe[Person]

Page 154: Scalding - the not-so-basics @ ScalaDays 2014

Typed Joins

case class UserName(id: Long, handle: String)
case class UserFavs(byUser: Long, favs: List[Long])
case class UserTweets(byUser: Long, tweets: List[Long])

def users: TypedSource[UserName]
def favs: TypedSource[UserFavs]
def tweets: TypedSource[UserTweets]

def output: TypedSink[(UserName, UserFavs, UserTweets)]

users.groupBy(_.id)
  .join(favs.groupBy(_.byUser))
  .join(tweets.groupBy(_.byUser))
  .map { case (uid, ((user, favs), tweets)) =>
    (user, favs, tweets)
  }
  .write(output)

3-way-merge in 1 MR step
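For readers who want to see what the typed join computes, here is an in-memory analogue on ordinary Scala collections. It reuses the case classes from the slide; `TypedJoinSketch` and `join3` are hypothetical names, and the sketch assumes at most one favs record and one tweets record per user.

```scala
object TypedJoinSketch {
  case class UserName(id: Long, handle: String)
  case class UserFavs(byUser: Long, favs: List[Long])
  case class UserTweets(byUser: Long, tweets: List[Long])

  // Inner join of three collections on the user id.
  def join3(users: Seq[UserName], favs: Seq[UserFavs], tweets: Seq[UserTweets])
      : Seq[(UserName, UserFavs, UserTweets)] = {
    val favsById   = favs.map(f => f.byUser -> f).toMap
    val tweetsById = tweets.map(t => t.byUser -> t).toMap
    for {
      u <- users
      f <- favsById.get(u.id).toSeq   // users without favs drop out
      t <- tweetsById.get(u.id).toSeq // users without tweets drop out
    } yield (u, f, t)
  }
}
```

The compiler checks that the join key and the resulting triple line up, which is the same guarantee the typed pipeline gives for the distributed version.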

Page 155: Scalding - the not-so-basics @ ScalaDays 2014

Do the DOT

> run pl.project13.oculus.job.WordCountJob
    --local --tool.graph --input in --output out

writing DOT:
  pl.project13.oculus.job.WordCountJob0.dot

writing Steps DOT:
  pl.project13.oculus.job.WordCountJob0_steps.dot

Page 156: Scalding - the not-so-basics @ ScalaDays 2014

Do the DOT

pl.project13.oculus.job.WordCountJob0.dot

pl.project13.oculus.job.WordCountJob0_steps.dot

Page 159: Scalding - the not-so-basics @ ScalaDays 2014

Do the DOT

> dot -Tpng pl.project13.oculus.job.WordCountJob0.dot

[Rendered flow graph, annotated with the labels MAP and RED]

Page 160: Scalding - the not-so-basics @ ScalaDays 2014

Do the DOT

Page 161: Scalding - the not-so-basics @ ScalaDays 2014

<3 Testing

Page 168: Scalding - the not-so-basics @ ScalaDays 2014

<3 Testing

class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        // "kapi" appears twice in the input line
        out.toList should contain ("kapi" -> 2)
      }
      .run
      .finish
  }

}

Page 170: Scalding - the not-so-basics @ ScalaDays 2014

<3 Testing

class WordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        // "kapi" appears twice in the input line
        out.toList should contain ("kapi" -> 2)
      }
      .runHadoop
      .finish
  }

}

run || runHadoop

Page 181: Scalding - the not-so-basics @ ScalaDays 2014

“Parallelize all the batches!”
Feels much like Scala collections
Local Mode thanks to Cascading
Easy to add custom Taps
Type Safe, when you want to
Pure Scala
Testing friendly
Matrix API
Efficient columnar storage (Parquet)

Page 183: Scalding - the not-so-basics @ ScalaDays 2014

Scalding Re-Cap

TextLine(inputFile)
  .flatMap('line -> 'word) { line: String => tokenize(line) }
  .groupBy('word) { _.size }
  .write(Tsv(outputFile))

4{
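For comparison, the same word count on ordinary Scala collections looks almost identical, which is the sense in which Scalding "feels much like Scala collections". This is a sketch only: `WordCountSketch` is a hypothetical name, and the `tokenize` here is an assumed whitespace tokenizer, since the slide does not show its own.

```scala
object WordCountSketch {
  // Assumed tokenizer: lower-case, split on whitespace, drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // flatMap to words, group by the word, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => w -> ws.size }
}
```

The difference is where it runs: the collections version lives in one JVM's memory, while the Scalding pipeline is planned into MapReduce steps.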

Page 184: Scalding - the not-so-basics @ ScalaDays 2014

Try it!

$ activator new activator-scalding

http://typesafe.com/activator/template/activator-scalding

Template by Dean Wampler

Page 185: Scalding - the not-so-basics @ ScalaDays 2014

Loads Of Links

1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about
2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala
3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4
4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3
5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2
6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science
9. https://github.com/parquet/parquet-format
10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code
11. https://github.com/scalaz/scalaz
12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

Page 186: Scalding - the not-so-basics @ ScalaDays 2014

Danke! Dzięki! Thanks! Gracias!

ありがとう!

ktoso @ typesafe.com t: ktosopl / g: ktoso blog: project13.pl