MapReduce with Scalding Antonios Chalkiopoulos 24th Big Data London Meetup Scalding.io

MapReduce with Scalding @ 24th Hadoop London Meetup

Page 1: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup

Scalding.io

Page 2: MapReduce with Scalding @ 24th Hadoop London Meetup

$ whoami

Scalding.io

http://scalding.io

http://github.com/scalding-io

@chalkiopoulos

Page 3: MapReduce with Scalding @ 24th Hadoop London Meetup

My recent achievement..

Page 4: MapReduce with Scalding @ 24th Hadoop London Meetup
Page 5: MapReduce with Scalding @ 24th Hadoop London Meetup

What are we gonna talk about..?

Page 6: MapReduce with Scalding @ 24th Hadoop London Meetup

Page 7: MapReduce with Scalding @ 24th Hadoop London Meetup

A Scala API on top of Cascading

Page 8: MapReduce with Scalding @ 24th Hadoop London Meetup

But what is Cascading?

Page 9: MapReduce with Scalding @ 24th Hadoop London Meetup

A few years ago I started on a fresh Big Data team…

Story!!

Page 10: MapReduce with Scalding @ 24th Hadoop London Meetup

How do we efficiently develop MapReduce jobs for our new Hadoop cluster?

??

Page 11: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Hadoop

(diagram axis: abstraction)

Page 12: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Java MapReduce

Hadoop

(diagram axis: abstraction)

Page 13: MapReduce with Scalding @ 24th Hadoop London Meetup

Java MapReduce Word count example

Page 14: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Java MapReduce

Pig

Hadoop

(diagram axis: abstraction)

Page 15: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Java MapReduce

Pig Hive

Hadoop

(diagram axis: abstraction)

Page 16: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Java MapReduce

Pig Hive

Hadoop

Others

(diagram axis: abstraction)

Page 17: MapReduce with Scalding @ 24th Hadoop London Meetup

MapReduce Techs

Java MapReduce

Pig Hive

Hadoop

Cascading Others

(diagram axis: abstraction)

Page 18: MapReduce with Scalding @ 24th Hadoop London Meetup

The promise of Cascading

Page 19: MapReduce with Scalding @ 24th Hadoop London Meetup

[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.

Page 20: MapReduce with Scalding @ 24th Hadoop London Meetup

[2] Extensions to MANY platforms

Page 21: MapReduce with Scalding @ 24th Hadoop London Meetup

Cascading

NoSQL Databases: MongoDB, Cassandra, HBase, Accumulo, …
SQL Databases
Hadoop Filesystem
Local Filesystem
In-memory systems: Redis, Memcached
Search Platforms: ElasticSearch, Solr, …

Page 22: MapReduce with Scalding @ 24th Hadoop London Meetup

How does it work?

Page 23: MapReduce with Scalding @ 24th Hadoop London Meetup

A pipeline architecture

Page 24: MapReduce with Scalding @ 24th Hadoop London Meetup

where tuples flow through pipes

(diagram: data enters at the source tap, Tuple1 and Tuple2 flow through the pipe, data leaves at the sink tap)

Page 25: MapReduce with Scalding @ 24th Hadoop London Meetup

(diagram labels: Log files, Customer Data, Log & Customer)

Page 26: MapReduce with Scalding @ 24th Hadoop London Meetup

(diagram labels: Log files, Customer Data, Results, Final Results)

Page 27: MapReduce with Scalding @ 24th Hadoop London Meetup

Cascading Example

Page 28: MapReduce with Scalding @ 24th Hadoop London Meetup

Word count in Cascading

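import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

// Cascading 1.x style API, as used on the slide
public class WordCount {
  public static void main(String[] args) {
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCount.class);
    Scheme sourceScheme = new TextLine(new Fields("line"));
    Scheme sinkScheme = new TextLine(new Fields("word", "count"));
    Tap source = new Hfs(sourceScheme, args[0]);
    Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
    Pipe assembly = new Pipe("Word Count");
    // Split each line into words, group by word, count each group
    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator(new Fields("word"), regex);
    assembly = new Each(assembly, new Fields("line"), function);
    assembly = new GroupBy(assembly, new Fields("word"));
    Aggregator count = new Count(new Fields("count"));
    assembly = new Every(assembly, count);
    FlowConnector flowConnector = new FlowConnector(properties);
    Flow flow = flowConnector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}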

70% less boilerplate code

But still some infrastructure code

Page 29: MapReduce with Scalding @ 24th Hadoop London Meetup

Page 30: MapReduce with Scalding @ 24th Hadoop London Meetup

No boilerplate code at all

Functional

Robust & Scalable

Runs on the JVM

Page 31: MapReduce with Scalding @ 24th Hadoop London Meetup

Here it comes

Java MapReduce

Pig Hive

Hadoop

Cascading Others

Scalding

(diagram axis: abstraction)

Page 32: MapReduce with Scalding @ 24th Hadoop London Meetup

The power of Scala on top of Cascading

Page 33: MapReduce with Scalding @ 24th Hadoop London Meetup

Scala fits naturally with data

Page 34: MapReduce with Scalding @ 24th Hadoop London Meetup

Word count in Scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine("input.txt")
    .read
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv("results.tsv"))
}

Page 35: MapReduce with Scalding @ 24th Hadoop London Meetup

Word count in Scalding

Map phase: the .flatMap('line -> 'word) step of the job above

Page 36: MapReduce with Scalding @ 24th Hadoop London Meetup

Word count in Scalding

Reduce phase: the .groupBy('word) { _.size } step of the job above

Page 37: MapReduce with Scalding @ 24th Hadoop London Meetup

Word count in Scalding

4 lines of code!

Code that developers enjoy writing
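
To make the two phases concrete, here is a minimal sketch of the same word count on plain Scala collections, with no Hadoop or Scalding involved. The object name WordCountSketch and the empty-string filter are assumptions added for illustration; they are not part of the job on the slide.

```scala
object WordCountSketch {
  // Map phase: split every line into words, like .flatMap('line -> 'word).
  // The nonEmpty filter is an addition: split yields "" for leading whitespace.
  def words(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+").toSeq).filter(_.nonEmpty)

  // Reduce phase: group identical words and count each group,
  // like .groupBy('word) { _.size }.
  def count(lines: Seq[String]): Map[String, Int] =
    words(lines).groupBy(identity).map { case (word, hits) => word -> hits.size }

  def main(args: Array[String]): Unit =
    count(Seq("cool Scala cool")).toSeq.sortBy(_._1).foreach {
      case (word, n) => println(s"$word\t$n")
    }
}
```

The shape is the same as the Scalding job: a flatMap followed by a group-and-count. The difference is that Scalding plans those steps as MapReduce work on a cluster instead of running them in memory.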

Page 38: MapReduce with Scalding @ 24th Hadoop London Meetup

Who is using it?

Many many others…

Page 39: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding…

…open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading

Page 40: MapReduce with Scalding @ 24th Hadoop London Meetup

Core Concepts

Page 41: MapReduce with Scalding @ 24th Hadoop London Meetup

Sources & Sinks

Tsv("data.tsv", ('productID, 'price, 'quantity))
  .read
  .write(UnpackedAvroSource("data.avro"))

Tsv, Csv, Osv, Avro, Parquet, …

Page 42: MapReduce with Scalding @ 24th Hadoop London Meetup

Map Operations

pipe1.filter('age) { age: Int => age > 18 }
pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
pipe1.project('name, 'surname)

15 map operations translated into map phases
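
For readers new to these operations, the sketch below shows what filter, map and project mean on an ordinary in-memory collection. It is plain Scala, not the Scalding API; the Person case class and the object name are hypothetical and exist only for this illustration.

```scala
object MapOpsSketch {
  // Hypothetical row type, standing in for the named fields of a pipe.
  final case class Person(name: String, surname: String, age: Int)

  // filter: keep only the rows that satisfy a predicate.
  def adults(people: Seq[Person]): Seq[Person] =
    people.filter(_.age > 18)

  // map: derive a new value from an existing field (factor 1.2, as on the slide).
  def withVat(price: Double): Double = price * 1.2

  // project: keep only the chosen fields and drop everything else.
  def names(people: Seq[Person]): Seq[(String, String)] =
    people.map(p => (p.name, p.surname))
}
```

Each of these works on one row at a time, which is why such operations can run in a map phase without any shuffle.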

Page 43: MapReduce with Scalding @ 24th Hadoop London Meetup

Join operations

pipe1.joinWithSmaller('productId -> 'productId, pipe2)
pipe1.joinWithLarger('productId -> 'productId, pipe2)
pipe1.joinWithTiny('productId -> 'productId, pipe2)

Optimize by hinting the relative sizes

Supports Left, Right, Inner, Outer Joins

pipe1
  .joinWithSmaller('productId -> 'productId, pipe2,
    joiner = new LeftJoin)
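
The difference between an inner and a left join can be sketched on in-memory collections as below. This is plain Scala rather than Scalding, and it assumes the right-hand side is small enough to hold in a Map with unique keys, which is roughly the situation a size hint such as joinWithTiny is meant for. The object and method names are hypothetical, and a missing match is modelled as None here, whereas Cascading produces null fields.

```scala
object JoinSketch {
  // Inner join: keep only the left rows whose key exists on the right side.
  def innerJoin[K, A, B](left: Seq[(K, A)], right: Map[K, B]): Seq[(K, A, B)] =
    for {
      (key, a) <- left
      b <- right.get(key).toSeq
    } yield (key, a, b)

  // Left join: keep every left row; a missing right side becomes None.
  def leftJoin[K, A, B](left: Seq[(K, A)], right: Map[K, B]): Seq[(K, A, Option[B])] =
    left.map { case (key, a) => (key, a, right.get(key)) }
}
```

Looking up the small side in memory avoids shuffling the large side, which is the motivation for telling the framework which pipe is the smaller one.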

Page 44: MapReduce with Scalding @ 24th Hadoop London Meetup

Group operations

val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv("results.tsv"))

.groupBy: group by particular fields

.groupAll: group all data
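
As a rough illustration of what the grouped sum computes, here is the same aggregation on a plain Scala collection. The Sale case class and the object name are hypothetical; the Scalding job works on named tuple fields rather than case classes.

```scala
object GroupSketch {
  // Hypothetical row type mirroring the ('shopId, 'itemId, 'quantity) fields.
  final case class Sale(shopId: String, itemId: String, quantity: Long)

  // Like groupBy('shopId) { _.sum[Long]('quantity -> 'totalSoldItems) }:
  // one total per shop.
  def totalSoldItems(sales: Seq[Sale]): Map[String, Long] =
    sales.groupBy(_.shopId).map { case (shop, rows) => shop -> rows.map(_.quantity).sum }

  // Like groupAll: the whole data set is treated as a single group.
  def totalOverall(sales: Seq[Sale]): Long =
    sales.map(_.quantity).sum
}
```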

Page 45: MapReduce with Scalding @ 24th Hadoop London Meetup

Pipe operations

val p = (pipe1 ++ pipe2)         // Concatenate 2 pipes
  .debug                         // Print sample data to screen
  .addTrap(Tsv("bogus_lines"))   // Dirty data are recorded

Simple pipe operations

Page 46: MapReduce with Scalding @ 24th Hadoop London Meetup

Connect with external systems

Page 47: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + Hive

class HiveExample(args: Args) extends Job(args) {
  val USER_SCHEMA = List('userId, 'username, 'photo)

  HiveSource("myHiveTable", SinkMode.KEEP)
    .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
    .write(Tsv("outputFromHive"))
}

Page 48: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + Hive

Define the schema: val USER_SCHEMA = List('userId, 'username, 'photo)

Page 49: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + Hive

Query HCatalog

Read directly from HDFS

Page 50: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + ElasticSearch

val schema = List('number, 'product, 'description)

val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
  .read
  .write(Tsv("data/es-out.tsv"))

val writeES = Tsv("data.tsv")
  .read
  .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))

Page 51: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + ElasticSearch

Read from ElasticSearch in one line!

Page 52: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding + ElasticSearch

Also index new data in ES

Page 53: MapReduce with Scalding @ 24th Hadoop London Meetup

Design patterns

Page 54: MapReduce with Scalding @ 24th Hadoop London Meetup

Dependency Injection
Late bound
External Operations

Page 55: MapReduce with Scalding @ 24th Hadoop London Meetup

How about defining external operations?

val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
  .read
  .ETLOmnitureData
  .calculateOmnitureUserStats
  .joinWithCustomerDB('userId -> 'userId, customerPipe)
  .write(Tsv("omniture-results.tsv"))

Custom operations:
Re-usable modular code
Single responsibility
Testability

Full code: http://bit.ly/1pNSUKf
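
One common Scala way to get chainable custom operations like these is an implicit wrapper class that adds methods to an existing type. The sketch below applies that pattern to a plain Seq instead of a Scalding pipe. It is an assumption about the general technique, not a copy of the linked code, and the Visit type and every method name are hypothetical.

```scala
object ExternalOpsSketch {
  // Hypothetical row type used only for this illustration.
  final case class Visit(userId: String, pageViews: Int)

  // Extension methods: each custom operation lives in one small, testable place.
  implicit class VisitOps(val visits: Seq[Visit]) extends AnyVal {
    // Hypothetical cleaning step: drop rows without a user id.
    def dropAnonymous: Seq[Visit] =
      visits.filter(_.userId.nonEmpty)

    // Hypothetical aggregation step: total page views per user.
    def pageViewsPerUser: Map[String, Int] =
      visits.groupBy(_.userId).map { case (user, rows) => user -> rows.map(_.pageViews).sum }
  }

  // The operations chain like built-in collection methods.
  def stats(visits: Seq[Visit]): Map[String, Int] =
    visits.dropAnonymous.pageViewsPerUser
}
```

Because each step is an ordinary method with a single responsibility, it can be unit tested on a handful of rows before it is chained into a larger job.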

Page 56: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding Testing

Page 57: MapReduce with Scalding @ 24th Hadoop London Meetup

Testing challenges in the context of MR

Acceptance Tests

Unit – Component Tests

System Tests

Integration Tests

Scalding enables testing in every layer & TDD

Page 58: MapReduce with Scalding @ 24th Hadoop London Meetup

example

import com.twitter.scalding._
import org.scalatest.FlatSpec
import org.scalatest.matchers.ShouldMatchers

class TsvWordCountJobTest extends FlatSpec
    with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("cool" -> 2)
      }
      .run
      .finish
  }
}

Replaces taps with in-memory collections and asserts the expected output

Page 59: MapReduce with Scalding @ 24th Hadoop London Meetup

Monitoring

Page 60: MapReduce with Scalding @ 24th Hadoop London Meetup

“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”

http://driven.cascading.io

Page 61: MapReduce with Scalding @ 24th Hadoop London Meetup

Collects telemetry data and exposes it through a web UI

Page 62: MapReduce with Scalding @ 24th Hadoop London Meetup

Advanced Concepts

Page 63: MapReduce with Scalding @ 24th Hadoop London Meetup

Scalding adds:
Typed API
Matrix API
Graphs
Machine Learning Algorithms

Page 64: MapReduce with Scalding @ 24th Hadoop London Meetup

What does the future look like?

Page 65: MapReduce with Scalding @ 24th Hadoop London Meetup

So far…

(diagram axis: abstraction)

Page 66: MapReduce with Scalding @ 24th Hadoop London Meetup

Real Time, Batch, Hybrid

(diagram axis: abstraction)

Summingbird

A unified API for everything

Storm, Tez, Spark

Enables the Lambda architecture

Page 67: MapReduce with Scalding @ 24th Hadoop London Meetup

Questions?