Workflows 3: Graphs, PageRank, Loops, Spark
Always start off with a joke, to lighten the mood
The course so far
• Cost of operations
• Streaming learning algorithms
– Parallel streaming with map-reduce
– Building complex parallel algorithms with dataflow languages (Pig, GuineaPig, …)
• Mostly for text

• Another common large data structure: graphs
• Examples: web, Facebook, Twitter, citation data, Freebase, Wikipedia, biological networks, …
– PageRank in dataflow languages
– Iteration in dataflow
– Spark
• Catch up: similarity joins
PageRank
Google’s PageRank
[Figure: a small graph of web sites linking to one another]
• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
• but inlinks from sites with many outlinks are not as “good”…
• “Good” and “bad” are relative.
Google’s PageRank
[Figure: the same web-graph diagram]
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
[Figure: the same web-graph diagram]
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
PageRank ranks pages by the amount of time the pagehopper spends on a page:
• or, if there were many pagehoppers, PageRank is the expected “crowd size”
PageRank in Memory
• Let u = (1/N, …, 1/N)
– dimension = #nodes N
• Let A = adjacency matrix: [a_ij = 1 ⟺ i links to j]
• Let W = [w_ij = a_ij / outdegree(i)]
– w_ij is the probability of a jump from i to j
• Let v^0 = (1, 1, …, 1)
– or anything else you want
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• c is the probability of jumping “anywhere randomly”
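As a concrete illustration, here is a minimal sketch of this in-memory iteration in Python with numpy; the damping constant c and the convergence tolerance are illustrative choices, and it assumes every node has at least one outlink.

    import numpy as np

    def pagerank_in_memory(A, c=0.15, tol=1e-8):
        # A is the N x N adjacency matrix; assumes no dangling nodes.
        N = A.shape[0]
        u = np.ones(N) / N                      # uniform "random jump" distribution
        W = A / A.sum(axis=1, keepdims=True)    # w_ij = a_ij / outdegree(i)
        v = np.ones(N) / N                      # any starting vector works
        while True:
            v_next = c * u + (1 - c) * (v @ W)  # v^(t+1) = c u + (1-c) v^t W
            if np.abs(v_next - v).sum() < tol:
                return v_next
            v = v_next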
Streaming PageRank
Streaming PageRank
• Assume we can store v but not W in memory
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• W is a sparse row matrix: each line is
– i  j_i,1, …, j_i,d  [the neighbors of i]
• Store v′ and v in memory: v′ starts out as c·u
• For each line “i  j_i,1, …, j_i,d”:
– For each j in j_i,1, …, j_i,d:
• v′[j] += (1−c)·v[i]/d
Everything needed for the update is right there in the row…
Each iteration: from v, compute v′, then let v = v′.
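A minimal sketch of one such iteration in Python, assuming the row format above (one line per node: the node id followed by its outlink targets, as integer ids in 0…N−1):

    # One iteration of streaming PageRank: v fits in memory, W is streamed.
    def streaming_pagerank_step(graph_lines, v, N, c=0.15):
        v_next = [c / N] * N                      # v' starts out as c*u
        for line in graph_lines:                  # each line: "i j1 j2 ... jd"
            fields = [int(x) for x in line.split()]
            i, outlinks = fields[0], fields[1:]
            d = len(outlinks)
            for j in outlinks:
                v_next[j] += (1 - c) * v[i] / d   # v'[j] += (1-c) v[i]/d
        return v_next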
Streaming PageRank
• Assume we can store one row of W in memory at a time, but not v
– i.e., the links on a web page fit in memory, but the PageRank values for the whole web do not.
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• Store W as a row matrix: each line is
– i  j_i,1, …, j_i,d  [the neighbors of i]
• v′ starts out as c·u
• For each line “i  j_i,1, …, j_i,d”:
– For each j in j_i,1, …, j_i,d:
• v′[j] += (1−c)·v[i]/d
We need to convert these counter updates to messages, like we did for naïve Bayes. Like in naïve Bayes: a document fits in memory, but the model doesn’t.
Streaming: if we know c and v[i] and have the linked-to j’s, then
• we can compute d
• we can produce messages saying how much to increment each v[j]
Then we sort the messages by the destination j and add up all the increments
• and somehow also add in c/N
So we need to stream through some structure that has v[i] plus all the outlinks from i. This is a copy of the graph with the current PageRank estimates (v) attached to each node.
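A minimal sketch of this message-passing formulation, where each input line pairs a node’s current rank with its outlinks, and the sort stands in for a map-reduce shuffle:

    # Streaming PageRank via messages: neither v nor W needs to fit in memory.
    # Each input line: "i v_i j1 j2 ... jd" (rank attached to the graph row).
    def pagerank_via_messages(graph_lines, N, c=0.15):
        messages = []
        for line in graph_lines:
            fields = line.split()
            v_i, outlinks = float(fields[1]), fields[2:]
            d = len(outlinks)
            for j in outlinks:
                messages.append((j, (1 - c) * v_i / d))   # increment for v'[j]
        messages.sort()                                   # the "shuffle": group by j
        v_next = {}
        for j, inc in messages:
            v_next[j] = v_next.get(j, c / N) + inc        # also adds in c/N
        return v_next  # nodes with no inlinks still need the c/N term added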
PageRank in Dataflow Languages
Iteration in Dataflow
• PIG and GuineaPig are pure dataflow languages
– no conditionals
– no iteration
• To loop you need to embed a dataflow program into a “real” program
[Code slides: a GuineaPig script with params d, docs_in, docs_out]
Recap: a GPig program…
• There’s no “program” object, and no obvious way to write a main that will call a program.

Calling GuineaPig programmatically
• Somewhere in the execution of the plan, GuineaPig will call this script with special arguments that tell it to do a substep of the plan, so we need a Python main that will do that right.
• But I also want to write an actual main program, so…
• Create a Planner object I can call (not a subclass) – remember to call setup()
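The slide shows this as a code screenshot; here is a minimal sketch of the pattern, with a placeholder view definition (check the exact calls against the GuineaPig documentation):

    import sys
    from guineapig import Planner, ReadLines

    # Build a Planner instance directly instead of subclassing it.
    p = Planner()
    p.rows = ReadLines('graph.txt')   # placeholder view definition
    p.setup()                         # required after the views are defined

    if __name__ == "__main__":
        # GuineaPig re-invokes this script with special arguments to run
        # substeps of the plan, so main() must dispatch on sys.argv.
        p.main(sys.argv)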
Calling GuineaPig programmatically: the main program
• creates the planner
• converts from the initial format and assigns the initial PageRank vector v
• moves the initial PageRanks + graph to a tmp file
• then, in a loop: computes the updated PageRanks, and moves the new PageRanks to a tmp file for the next iteration of the loop

A row in initGraph: (i, [j1, …, jd])

Note: there is lots and lots of I/O happening here…
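A minimal sketch of that outer loop; run_pagerank_step is a stand-in for invoking the GuineaPig plan, and the file names are hypothetical:

    import shutil

    def run_pagerank_step(in_file, out_file):
        # Stand-in: invoke the GuineaPig plan that computes v' from v,
        # reading graph+ranks from in_file and writing them to out_file.
        ...

    # initGraph.txt: one row per node, (i, [j1,...,jd]), with initial ranks attached
    shutil.copy('initGraph.txt', 'tmp_in.txt')
    for t in range(10):                             # fixed iteration count for simplicity
        run_pagerank_step('tmp_in.txt', 'tmp_out.txt')
        shutil.move('tmp_out.txt', 'tmp_in.txt')    # new ranks feed the next iteration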
Julien Le Dem - Yahoo
An example from Ron Bekkerman of an iterative dataflow computation:
Example: k-means clustering
• An EM-like algorithm (a sketch in code follows below):
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
– Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
– Find new centroids that maximize that expectation
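A minimal sketch of one E-step/M-step pass with numpy, using Euclidean distance (the Pig version below uses a normalized dot product instead); it assumes no cluster goes empty:

    import numpy as np

    def kmeans_step(X, centroids):
        # E-step: assign each instance (row of X) to the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # M-step: recalculate each centroid as the mean of its instances.
        new_centroids = np.array([X[assign == c].mean(axis=0)
                                  for c in range(len(centroids))])
        return assign, new_centroids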
k-means Clustering
[Figure: data instances and cluster centroids]
Parallelizing k-means
k-means on MapReduce (Panda et al., Chapter 2)
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
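A minimal sketch of the mapper and reducer logic in Python (data_chunk holds numpy vectors; the shuffle between the two stages is left implicit):

    import numpy as np

    def mapper(data_chunk, centroids):
        # Assign each local instance to a cluster; emit local sums and sizes.
        k, dim = centroids.shape
        sums, sizes = np.zeros((k, dim)), np.zeros(k, dtype=int)
        for x in data_chunk:
            c = int(((centroids - x) ** 2).sum(axis=1).argmin())
            sums[c] += x
            sizes[c] += 1
        for c in range(k):
            yield c, (sums[c], sizes[c])    # keyed by cluster id

    def reducer(c, values):
        # Aggregate local centroids, weighted by local cluster sizes.
        total = sum(s for s, _ in values)
        size = sum(n for _, n in values)
        yield c, total / size               # new global centroid for cluster c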
k-means in Apache Pig: input data
• Assume we need to cluster documents
– Stored in a 3-column table D:

  Document  Word      Count
  doc1      Carnegie  2
  doc1      Mellon    2

• Initial centroids are k randomly chosen docs
– Stored in table C in the same format as above
k-means in Apache Pig: E-step

For each document d, find the closest centroid by length-normalized dot product:

c(d) = argmax_c ( Σ_{w∈d} i_wd · i_wc ) / √( Σ_{w∈c} i_wc² )

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;

PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;  -- document d, cluster c similarities

SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);  -- cluster assignments: d → closest c
k-means in Apache Pig: M-step

D_C_W = JOIN CLUSTERS BY d, D BY d;

D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;  -- add up documents in each cluster

D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;  -- figure out cluster size

SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;  -- weighted average of docs in each cluster

Finally – embed in Java (or Python or …) to do the looping.
Data is read, and the model is written, with every iteration: that is the drawback of the k-means-on-MapReduce scheme above (Panda et al., Chapter 2).
Spark: Another Dataflow Language
Spark
• Hadoop: Too much typing
– programs are not concise
• Hadoop: Too low level
– missing abstractions
– hard to specify a workflow
• Pig, GuineaPig, … address these problems; so does Spark
• But they are not well suited to iterative operations
– e.g., PageRank, E/M, k-means clustering, …
– Spark lowers the cost of repeated reads
• Spark offers a set of concise dataflow operations (“transformations”)
• The dataflow operations are embedded in an API together with “actions”
• Sharded files are replaced by “RDDs” – resilient distributed datasets; RDDs can be cached in cluster memory and recreated to recover from errors
Spark examples
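The code on these slides is a screenshot that did not survive extraction; what follows is the classic Spark log-mining example that the annotations below refer to (the file path is hypothetical):

    # Classic Spark log-mining example; "spark" is a SparkContext object.
    lines = spark.textFile("hdfs://log.txt")                 # path hypothetical
    errors = lines.filter(lambda s: s.startswith("ERROR"))   # a transformation
    errors.cache()                      # keep errors' shards in cluster memory
    errors.count()                      # an action: executes the plan
    errors.filter(lambda s: "MySQL" in s).collect()          # transformation + action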
• spark is a SparkContext object
• errors is a transformation, and thus a data structure that explains HOW to do something
• count() is an action: it will actually execute the plan for errors and return a value
• errors.filter() is a transformation; collect() is an action
• everything is sharded, like in Hadoop and GuineaPig
• cache() modifies errors to be stored in cluster memory, so subsequent actions will be much faster: the shards are stored in the memory of the worker machines, not on local disk (if possible)
• You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
• persist-on-disk works because the RDD is read-only (immutable)
Spark examples: wordcount
• the transformations here operate on (key, value) pairs, which are special
• the last step is the action
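Again the slide’s code is a lost screenshot; a standard Spark wordcount matching the annotations (paths hypothetical):

    # Standard Spark wordcount (paths hypothetical).
    text = spark.textFile("hdfs://input.txt")
    counts = text.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)  # transformation on (key,value) pairs
    counts.saveAsTextFile("hdfs://counts.txt")     # the action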
Spark examples: batch logistic regression
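The slide code is again a screenshot; this is the canonical Spark batch logistic-regression example the annotations describe (parsePoint, D, and ITERATIONS are assumed to be defined elsewhere):

    import numpy as np

    # Canonical Spark batch logistic regression sketch.
    points = spark.textFile("hdfs://points.txt").map(parsePoint).cache()
    w = np.random.ranf(size=D)            # initial weight vector, D = #features
    for i in range(ITERATIONS):
        gradient = points.map(
            lambda p: (1 / (1 + np.exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)      # reduce is an action: returns a numpy vector
        w -= gradient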
• p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
• reduce is an action – it produces a numpy vector
• w is defined outside the lambda function, but used inside it. So Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.
• the dataset of points is cached in cluster memory to reduce I/O

Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of Python floats.
• numpy operations like dot, *, and + are calls to optimized C code.
• A little Python logic around a lot of numpy calls is pretty efficient.
Spark logistic regression example

Spark