MapReduce 101 by Chaordic Systems

MapReduce: teoria e prática


DESCRIPTION

Nowadays it is easy to gather absurdly large amounts of data. But once you have them, how do you extract information from these amorphous mountains of data? In this short course we present the MapReduce programming model: how it works, what it is for, and how to build applications with it. We will also see how to use Elastic MapReduce, the Amazon service that creates MapReduce clusters on demand, so that you do not have to worry about administering or getting access to a cluster of machines, only about how to make your code digest your data in a distributed way. We will see practical examples in action and code some challenges together.


Page 1: MapReduce: teoria e prática

MapReduce 101

by Chaordic Systems

Page 2: MapReduce: teoria e prática

Brought to you by...

Page 3: MapReduce: teoria e prática

Big Data, what's the big deal?

Why is this talk relevant to you?

● we have too much data to process in a single computer

● we make too few informed decisions based on the data we have

● we have too little {time|CPU|memory} to analyze all this data

● 'cuz not everything needs to be on-line
It's 2013 but doing batch processing is still OK

Page 4: MapReduce: teoria e prática

Map-what?

And why MapReduce and not, say, MPI?

● Simple computation model
MapReduce exposes a simple (and limited) computational model. It can be restraining at times, but it is a trade-off.

● Fault tolerance, parallelization and distribution among machines for free
The framework deals with this for you so you don't have to.

● Because it is the bread and butter of Big Data processing
It is available on all major cloud computing platforms, and it is the baseline other Big Data systems compare themselves against.

Page 5: MapReduce: teoria e prática

Outline

● Fast recap on Python and whatnot

● Introduction to MapReduce

● Counting Words

● MrJob and EMR

● Real-life examples

Page 6: MapReduce: teoria e prática

Fast recap

Page 7: MapReduce: teoria e prática

Let's assume you know what the following is:

● JSON

● Python's yield keyword

● Generators in Python

● Amazon S3

● Amazon EC2

If you don't, raise your hand now. REALLY

Fast recap

Page 8: MapReduce: teoria e prática

Recap: JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format.
It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.

{"timestamp": "2011-08-15 22:17:31.334057",
 "track_id": "TRACCJA128F149A144",
 "tags": [["Bossa Nova", "100"],
          ["jazz", "20"],
          ["acoustic", "20"],
          ["romantic", "20"]],
 "title": "Segredo",
 "artist": "Jo\u00e3o Gilberto"}

Page 9: MapReduce: teoria e prática

Recap: Python generators

From Python's wiki: “Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.”

The difference from a list is that a generator can be iterated (or read) only once, because its values are not stored in memory but created on the fly [2].

You can create generators using the yield keyword.
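A minimal sketch of the "read only once" behaviour described above. The `squares` function is a made-up example, not something from the course material.

```python
def squares(n):
    """Yield the squares of 0..n-1 one at a time, without building a list."""
    for i in range(n):
        yield i * i


gen = squares(4)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left: a generator can be read only once

print(first_pass)
print(second_pass)
```

If you need the values twice, either call `squares(4)` again or keep the list from the first pass.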

Page 10: MapReduce: teoria e prática

Recap: Python yield keyword

It's just like a return, but it turns your function into a generator.
Your function suspends its execution after yielding a value and resumes when the next item is requested from the generator (the next loop iteration).

def count_from_1():
    i = 1
    while True:
        yield i
        i += 1

for j in count_from_1():
    print(j)
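The loop above never ends, because `count_from_1` counts forever. One way to take only a few values from an endless generator, sketched here with `itertools.islice` from the standard library (the choice of five items is arbitrary):

```python
from itertools import islice


def count_from_1():
    i = 1
    while True:
        yield i
        i += 1


# islice stops asking for items after the fifth one, so the
# generator is suspended and never runs its loop again.
first_five = list(islice(count_from_1(), 5))
print(first_five)
```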

Page 11: MapReduce: teoria e prática

Recap: Amazon S3

From Wikipedia: “Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services.”

It's like a distributed filesystem that is easy to use from other Amazon services, especially from Amazon Elastic MapReduce.

Page 12: MapReduce: teoria e prática

Recap: EC2 - Elastic Compute Cloud

From Wikipedia: “EC2 allows users to rent virtual computers on which to run their own computer applications”

So you can rent clusters on demand: no need to maintain, keep fixing and keep up to date your ever-breaking cluster of computers. Less headache, moar action.

Instances can be purchased on demand for fixed prices, or you can bid on them.

Page 13: MapReduce: teoria e prática

MapReduce: a quick introduction

Page 14: MapReduce: teoria e prática

MapReduce

MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results, which are then aggregated in some fashion.

Page 15: MapReduce: teoria e prática

MapReduce

MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results, which are then aggregated in some fashion.

Map

Page 16: MapReduce: teoria e prática

MapReduce

MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results, which are then aggregated in some fashion.

Map Reduce

Page 17: MapReduce: teoria e prática

Typical (big data) problem

● Iterate over a large number of records

● Extract something of interest from each

● Shuffle and sort intermediate results

● Aggregate intermediate results

● Generate final output

Map

Reduce

Page 18: MapReduce: teoria e prática

Phases of a MapReduction

A MapReduce job has the following steps:

map(key, value) -> [(key1, value1), (key1, value2)]

combine

sort + shuffle

reduce(key1, [value1, value2]) -> [(keyX, valueY)]

May happen in parallel, on multiple machines!
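The phases can be imitated in a single Python process. This is only a sketch to make the data flow concrete: `run_mapreduce`, `parity_mapper` and `sum_reducer` are hypothetical names, the optional combine step is left out, and nothing here runs in parallel or on more than one machine.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map -> sort + shuffle -> reduce pass in a single process."""
    # Map: each (key, value) record may emit any number of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Sort + shuffle: collect every intermediate value under its key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each key is handled independently of all the others.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def parity_mapper(key, number):
    yield ("even" if number % 2 == 0 else "odd"), number


def sum_reducer(key, values):
    yield key, sum(values)


records = list(enumerate([1, 2, 3, 4, 5]))
result = run_mapreduce(records, parity_mapper, sum_reducer)
print(result)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, a real framework is free to spread both loops over many machines.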

Page 19: MapReduce: teoria e prática
Page 20: MapReduce: teoria e prática

Notice:

Reduce phase only starts after all mappers have completed.
Yes, there is a synchronization barrier right there.

There is no global knowledge
Neither mappers nor reducers know what other mappers (or reducers) are processing.

Page 21: MapReduce: teoria e prática
Page 22: MapReduce: teoria e prática

Counting Words

Counting the number of occurrences of a word in a document collection is quite a big deal.

Let's try with a small example:

"Me gusta correr, me gustas tu.
Me gusta la lluvia, me gustas tu."

Page 23: MapReduce: teoria e prática

Counting Words

"Me gusta correr, me gustas tu.
Me gusta la lluvia, me gustas tu."

me 4

gusta 2

correr 1

gustas 2

tu 2

la 1

lluvia 1

Page 24: MapReduce: teoria e prática

Counting words - in Python

doc = open('input')
count = {}
for line in doc:
    words = line.split()
    for w in words:
        count[w] = count.get(w, 0) + 1

Easy, right? Yeah... too easy. Let's split what we do for each line and aggregate, shall we?

Page 25: MapReduce: teoria e prática

Counting words - in MapReduce

def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)

Page 26: MapReduce: teoria e prática

def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

What is Map's output?

key=1, line="me gusta correr me gustas tu"

('me', 1)

('gusta', 1)

('correr', 1)

('me', 1)

('gustas', 1)

('tu', 1)

key=2, line="me gusta la lluvia me gustas tu"

('me', 1)

('gusta', 1)

('la', 1)

('lluvia', 1)

('me', 1)

('gustas', 1)

('tu', 1)
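The listing above can be reproduced locally. In this sketch the mapper is written as a plain function (no `self`), and the two input lines are already lower-cased and stripped of punctuation, as on the slide.

```python
def map_get_words(key, line):
    for word in line.split():
        yield word, 1


lines = [
    (1, "me gusta correr me gustas tu"),
    (2, "me gusta la lluvia me gustas tu"),
]

# Feed each (key, line) record to the mapper and collect what it emits.
map_output = [pair for key, line in lines for pair in map_get_words(key, line)]
for pair in map_output:
    print(pair)
```

Note that the mapper ignores its input key and does no counting at all: it only emits one pair per word it sees.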

Page 27: MapReduce: teoria e prática

What about shuffle?

Page 28: MapReduce: teoria e prática
Page 29: MapReduce: teoria e prática

What about shuffle?

Think of it as a distributed group by operation.

In the local map instance/node:

● it sorts the map output by key,
● groups the values by their key,
● sends each group of key and associated values to the reduce node responsible for this key.

In the reduce instance/node:

● the framework joins all values associated with this key in a single list - for you, for free.

Page 30: MapReduce: teoria e prática

What's Shuffle output? Or: what's Reducer input?

Notice:

This table represents a global view.

"In real life", each reducer instance only knows about its own key and values.

Key (input) Values

correr [1]

gusta [1, 1]

gustas [1, 1]

la [1]

lluvia [1]

me [1, 1, 1, 1]

tu [1, 1]
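The "distributed group by" can be imitated on one machine with a sort followed by `itertools.groupby`. This is a local sketch only: the `shuffle` helper is a hypothetical name, and a real cluster would also partition the keys among several reducers.

```python
from itertools import groupby
from operator import itemgetter

map_output = [
    ("me", 1), ("gusta", 1), ("correr", 1), ("me", 1), ("gustas", 1), ("tu", 1),
    ("me", 1), ("gusta", 1), ("la", 1), ("lluvia", 1), ("me", 1), ("gustas", 1),
    ("tu", 1),
]


def shuffle(pairs):
    """Sort pairs by key, then collect the values of each key into one list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


reducer_input = shuffle(map_output)
for key, values in reducer_input:
    print(key, values)
```

`groupby` only merges neighbouring items with equal keys, which is why the sort has to come first.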

Page 31: MapReduce: teoria e prática

def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)

What's Reducer output?

word occurrences output

correr [1] (correr, 1)

gusta [1, 1] (gusta, 2)

gustas [1, 1] (gustas, 2)

la [1] (la, 1)

lluvia [1] (lluvia, 1)

me [1, 1, 1, 1] (me, 4)

tu [1, 1] (tu, 2)

Page 32: MapReduce: teoria e prática
Page 33: MapReduce: teoria e prática

MapReduce (main) Implementations

Google MapReduce
● C++
● Proprietary

Apache Hadoop
● Java
  ○ interfaces for anything that runs in the JVM
  ○ Hadoop Streaming for a pipe-like, programming-language-agnostic interface
● Open source

Nobody really cares about the others (for now... ;)

Page 34: MapReduce: teoria e prática

Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce

● Uses Hadoop with some extra sauce

● Creates a Hadoop cluster on demand

● It's magical -- except when it fails

● Can be sort of unpredictable sometimes
  ○ Installing Python modules can fail for no clear reason

Page 35: MapReduce: teoria e prática

MrJob

It's a Python interface for Hadoop Streaming jobs that is really easy to use

● Can run jobs locally or in EMR.
● Takes care of uploading your Python code to EMR.
● Works best if everything is in a single Python module.
● Easy interface to chain sequences of M/R steps.
● Some basic tools to aid debugging.

Page 36: MapReduce: teoria e prática

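Counting words: full MrJob example

from mrjob.job import MRJob

class MRWordCounter(MRJob):

    # Mapper: emit (word, 1) for every word in the line.
    def get_words(self, key, line):
        for word in line.split():
            yield word, 1

    # Reducer: add up the ones emitted for each word.
    def sum_words(self, word, occurrences):
        yield word, sum(occurrences)

    # self.mr() is the API of the mrjob release used in this course;
    # newer mrjob releases build steps with mrjob.step.MRStep instead.
    def steps(self):
        return [self.mr(self.get_words, self.sum_words),]

if __name__ == '__main__':
    MRWordCounter.run()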

Page 37: MapReduce: teoria e prática

MrJob: launching a job

Running it locally
python countwords.py --conf-path=mrjob.conf input.txt

Running it in EMR
Do not forget to set the AWS_ env. vars!

python countwords.py \
    --conf-path=mrjob.conf \
    -r emr \
    's3://ufcgplayground/data/words/*' \
    --no-output \
    --output-dir=s3://ufcgplayground/tmp/bla/

Page 38: MapReduce: teoria e prática

MrJob: installing and environment setup

Install MrJob using pip or easy_install
Do not, I repeat, DO NOT install the version in Ubuntu/Debian.

sudo pip install mrjob

Set up your environment with AWS credentials
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

Set up your environment to look for MrJob settings:
export MRJOB_CONF=<path to mrjob.conf>

Page 39: MapReduce: teoria e prática

MrJob: installing and environment setup

Use our sample MrJob app as your template
git clone https://github.com/chaordic/mr101ufcg.git

Modify the sample mrjob.conf so that your jobs are labeled with your team
It's the Right Thing © to do.

s3_logs_uri: s3://ufcgplayground/yournamehere/log/
s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/

Profit!

Page 40: MapReduce: teoria e prática

Real

Page 41: MapReduce: teoria e prática

Target Categories

Objective: Find the most commonly viewed categories per user

Input:
● views and orders

Patterns used:
● simple aggregation

Page 42: MapReduce: teoria e prática

Map input

zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]

Page 43: MapReduce: teoria e prática

Map input

zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]

Key: the first two fields of each record

Page 44: MapReduce: teoria e prática

Map input

zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]

Key: the first two fields of each record

Sort + Shuffle

Reduce Input

(zezin, fulano)
  [telefone, celulares, vivo]
  [telefone, celulares, vivo]
  [eletro, caos, furadeira]

(lojaX, fulano)
  [livros, arte, anime]
  [livros, arte, anime]
  [livros, arte, anime]

Page 45: MapReduce: teoria e prática

Reduce Input

(zezin, fulano)
  [telefone, celulares, vivo]
  [telefone, celulares, vivo]
  [eletro, caos, furadeira]

(lojaX, fulano)
  [livros, arte, anime]
  [livros, arte, anime]
  [livros, arte, anime]

Page 46: MapReduce: teoria e prática

Reduce Input

(zezin, fulano)
  [telefone, celulares, vivo]
  [telefone, celulares, vivo]
  [eletro, caos, furadeira]

(lojaX, fulano)
  [livros, arte, anime]
  [livros, arte, anime]
  [livros, arte, anime]

Reduce Output

(zezin, fulano)
  ([telefone, celulares, vivo], 2)
  ([eletro, caos, furadeira], 1)

(lojaX, fulano)
  ([livros, arte, anime], 3)
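A possible mapper and reducer for this "simple aggregation" pattern, run locally on the six records from the slides. The function names are hypothetical, the sort + shuffle phase is replaced by a dictionary, and category paths are turned into tuples so they can be counted.

```python
from collections import Counter, defaultdict

views = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]


def map_view(store, user, categories):
    # Key on (store, user); the category path is the value.
    yield (store, user), tuple(categories)


def reduce_count_categories(key, category_paths):
    # Count how often each category path was seen, most common first.
    yield key, Counter(category_paths).most_common()


# Local stand-in for sort + shuffle.
groups = defaultdict(list)
for record in views:
    for key, value in map_view(*record):
        groups[key].append(value)

results = {}
for key in sorted(groups):
    for out_key, counts in reduce_count_categories(key, groups[key]):
        results[out_key] = counts
        print(out_key, counts)
```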

Page 47: MapReduce: teoria e prática

Filter Expensive Categories

Objective: List all categories where a user purchased something expensive.

Input:
● Orders (for price and user information)
● Products (for category information)

Patterns used:
● merge using reducer

Page 48: MapReduce: teoria e prática

Map Input

BuyOrders
lojaX  livro   fulano    R$ 20
lojaX  iphone  deltrano  R$ 1800

Products
lojaX  livro   [livros, arte, anime]
lojaX  iphone  [telefone, celulares, vivo]

We have to merge those tables above!

Page 49: MapReduce: teoria e prática

Map Input

BuyOrders
lojaX  livro   fulano    R$ 20
lojaX  iphone  deltrano  R$ 1800

Products
lojaX  livro   [livros, arte, anime]
lojaX  iphone  [telefone, celulares, vivo]

Common key: the first two columns of both tables

Page 50: MapReduce: teoria e prática

Map Input and Map Output (the key is the first two columns, the value is on the right)

BuyOrders                                    Value
lojaX  livro   fulano    R$ 20               (nothing, it's cheap)
lojaX  iphone  deltrano  R$ 1800             {"usuario": "deltrano"}

Products                                     Value
lojaX  livro   [livros, arte, anime]         {"cat": [livros...]}
lojaX  iphone  [telefone, celulares, vivo]   {"cat": [telefone...]}

Page 51: MapReduce: teoria e prática

Map Input and Map Output (the key is the first two columns, the value is on the right)

BuyOrders                                    Value
lojaX  livro   fulano    R$ 20               (nothing, it's cheap)
lojaX  iphone  deltrano  R$ 1800             {"usuario": "deltrano"}

Products                                     Value
lojaX  livro   [livros, arte, anime]         {"cat": [livros...]}
lojaX  iphone  [telefone, celulares, vivo]   {"cat": [telefone...]}

Reduce Input

(lojaX, livro)    {"cat": [livros, arte, anime]}
(lojaX, iphone)   {"usuario": "deltrano"}
                  {"cat": [telefone, celulares, vivo]}

Page 52: MapReduce: teoria e prática

Reduce Input

Key               Values
(lojaX, livro)    {"cat": [livros, arte, anime]}
(lojaX, iphone)   {"usuario": "deltrano"}
                  {"cat": [telefone, celulares, vivo]}

Page 53: MapReduce: teoria e prática

Reduce Input

Key               Values
(lojaX, livro)    {"cat": [livros, arte, anime]}
(lojaX, iphone)   {"usuario": "deltrano"}
                  {"cat": [telefone, celulares, vivo]}

Those are the parts we care about!

Page 54: MapReduce: teoria e prática

Reduce Input

Key               Values
(lojaX, livro)    {"cat": [livros, arte, anime]}
(lojaX, iphone)   {"usuario": "deltrano"}
                  {"cat": [telefone, celulares, vivo]}

Reduce Output

(lojaX, deltrano)   [telefone, celulares, vivo]
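The "merge using reducer" pattern (a reduce-side join) can be sketched as follows. Everything here is illustrative: the function names are made up, the `EXPENSIVE` threshold is an assumed value because the slides do not state one, and the shuffle is again a local dictionary.

```python
from collections import defaultdict

EXPENSIVE = 1000  # hypothetical price threshold in R$; the slides do not give one

orders = [("lojaX", "livro", "fulano", 20), ("lojaX", "iphone", "deltrano", 1800)]
products = [
    ("lojaX", "livro", ["livros", "arte", "anime"]),
    ("lojaX", "iphone", ["telefone", "celulares", "vivo"]),
]


def map_order(store, product, user, price):
    # Cheap orders emit nothing; expensive ones tag the buyer.
    if price >= EXPENSIVE:
        yield (store, product), {"usuario": user}


def map_product(store, product, categories):
    yield (store, product), {"cat": categories}


def reduce_join(key, values):
    # All values for one (store, product) arrive together: merge them.
    store, _product = key
    values = list(values)
    users = [v["usuario"] for v in values if "usuario" in v]
    cats = [v["cat"] for v in values if "cat" in v]
    for user in users:
        for cat in cats:
            yield (store, user), cat


# Local stand-in for sort + shuffle: both mappers feed the same groups.
groups = defaultdict(list)
for order in orders:
    for key, value in map_order(*order):
        groups[key].append(value)
for product in products:
    for key, value in map_product(*product):
        groups[key].append(value)

joined = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
print(joined)
```

The join works because both tables are mapped to the same (store, product) key, so the framework brings the matching rows to the same reducer call.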

Page 55: MapReduce: teoria e prática

Real

Datasets

Page 56: MapReduce: teoria e prática

Real datasets, real problems

In the following hour we will write code to analyse some real datasets:
● Twitter Dataset (from an article published in WWW'10)
● LastFM Dataset, from The Million Song Dataset

Supporting code
● available at GitHub, under https://github.com/chaordic/mr101ufcg
● comes with sample data under data for local runs.

Page 57: MapReduce: teoria e prática

Twitter Followers Dataset

A somewhat big dataset
● 41.7 million profiles
● 1.47 billion social relations (who follows who)
● 25 GB of uncompressed data

Available at s3://mr101ufcg/data/twitter/ ...
● splitted/*.gz
  full dataset split into small compressed files
● numeric2screen.txt
  numeric id to original screen name mapping
● followed_by.txt
  original 25 GB dataset as a single file

Page 58: MapReduce: teoria e prática

Twitter Followers Dataset

Each line in followed_by.txt has the following format:

user_id \t follower_id

For instance:

12 \t 38

12 \t 41

13 \t 47

13 \t 52

13 \t 53

14 \t 56
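A small sketch of how one such line might be parsed before it reaches a mapper. It assumes that the ids are integers and that the separator is a bare tab (the spaces around \t above are taken to be there only for readability); `parse_follow_line` is a hypothetical helper.

```python
def parse_follow_line(line):
    """Split one 'user_id<TAB>follower_id' line into a pair of ints."""
    user_id, follower_id = line.rstrip("\n").split("\t")
    return int(user_id), int(follower_id)


sample = ["12\t38\n", "12\t41\n", "13\t47\n", "13\t52\n", "13\t53\n", "14\t56\n"]
pairs = [parse_follow_line(line) for line in sample]
print(pairs)
```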

Page 59: MapReduce: teoria e prática

Million Song Dataset project's Last.fm Dataset

A not-so-big dataset
● 943,347 tracks
● 1.2 GB of compressed data

Yeah, it is not all that big...

Available at s3://mr101ufcg/data/lastfm/ ...
● metadata/*.gz
  Track metadata information, in JSONProtocol format.
● similars/*.gz
  Track similarity information, in JSONProtocol format.
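In JSONProtocol each line holds a JSON-encoded key, a tab character and a JSON-encoded value. A sketch of decoding one metadata line by hand with the standard `json` module; `parse_json_protocol` is a hypothetical helper, and when a job runs under MrJob the framework normally does this decoding for you.

```python
import json

line = ('"TRACHOZ12903CCA8B3"\t{"timestamp": "2011-09-07 22:12:47.150438", '
        '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
        '"title": "Close Up", "artist": "Charles Williams"}')


def parse_json_protocol(line):
    """Decode one 'JSON key <TAB> JSON value' line into Python objects."""
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


track_id, metadata = parse_json_protocol(line)
print(track_id, metadata["title"], metadata["artist"])
```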

Page 60: MapReduce: teoria e prática

Million Song Dataset project's Last.fm Dataset

JSONProtocol encodes a key-value pair in a single line, as two JSON-encoded values separated by a tab character ( \t ).

<JSON encoded data> \t <JSON encoded data>

Example line:

"TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438", "track_id": "TRACHOZ12903CCA8B3", "tags": [], "title": "Close Up", "artist": "Charles Williams"}

Page 61: MapReduce: teoria e prática

Questions?

Page 62: MapReduce: teoria e prática

Stuff I didn't talk about but that is sorta cool

Persistent jobs

Serialization (protocols in MrJob parlance)

Amazon EMR Console

Hadoop dashboard (and port 9100)

Page 63: MapReduce: teoria e prática

Combiners

They are just like reducers, but they run right after a Map and just before data is sent over the network during shuffle.

Combiners must...
● be associative {a.(b.c) == (a.b).c}
● be commutative (a.b == b.a)
● have the same input and output types as your Map output type.

Caveats:
● Combiners can be executed zero, one or many times, so don't make your MR depend on them
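A local sketch of why those rules matter, using the word count. Summing is associative and commutative, and the combiner emits the same (word, number) shape as the mapper, so the final counts are the same whether the combiner runs zero, one or several times. The `run` and `group` helpers are hypothetical, and each input line stands in for the output of one mapper.

```python
from collections import defaultdict


def mapper(_, line):
    for word in line.split():
        yield word, 1


def combiner(word, counts):
    # Same shape in and out as the map output: (word, number).
    yield word, sum(counts)


def reducer(word, counts):
    yield word, sum(counts)


def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run(lines, combine_passes):
    shuffled = []
    for line in lines:  # pretend each line is handled by its own mapper
        local = list(mapper(None, line))
        for _ in range(combine_passes):  # zero, one or many combiner runs
            local = [pair for word, counts in group(local).items()
                     for pair in combiner(word, counts)]
        shuffled.extend(local)
    return dict(pair for word, counts in group(shuffled).items()
                for pair in reducer(word, counts))


lines = ["me gusta correr me gustas tu", "me gusta la lluvia me gustas tu"]
results = [run(lines, passes) for passes in (0, 1, 3)]
print(results[0])
```

With one combiner pass, each mapper sends a single ("me", 2) pair over the network instead of two ("me", 1) pairs, and the reducer still ends up with the same total.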

Page 66: MapReduce: teoria e prática

Life beyond MapReduce

What about reading about other frameworks for distributed processing with Big Data?
● Spark
● Storm
● GraphLab

And don't get me started on NoSQL...

Page 67: MapReduce: teoria e prática

Many thanks to...

for supporting this course.

You know there will be some live, intense, groovy Elastic MapReduce action right after this presentation, right?

Page 68: MapReduce: teoria e prática

Questions?

Feel free to contact me at [email protected]

Or follow us @chaordic

Page 69: MapReduce: teoria e prática
Page 70: MapReduce: teoria e prática

So, let's write some code?

Twitter Dataset
● Count how many followers each user has
● Discover the user with the most followers
● What if I want the top-N most followed?

LastFM
● Merge similarity and metadata for tracks
● What is the most "plain" song?
● What is the plainest rock song according only to rock songs?

Page 71: MapReduce: teoria e prática

Extra slides