Workflows 3: Graphs, PageRank, Loops, Spark Always start off with a joke, to lighten the mood 1

Workflows 3: Graphs, PageRank, Loops, Sparkwcohen/10-605/workflows-3.pdf · Workflows 3: Graphs, PageRank, Loops, Spark Always start off with a joke, to lighten the mood 1. The course


Workflows 3: Graphs, PageRank, Loops, Spark

Always start off with a joke, to lighten the mood


The course so far

• Cost of operations
• Streaming learning algorithms
  – Parallel streaming with map-reduce
  – Building complex parallel algorithms with dataflow languages (Pig, GuineaPig, …)
• Mostly for text
• Another common large data structure: graphs
  • Examples: web, Facebook, Twitter, citation data, Freebase, Wikipedia, biological networks, …
  – PageRank in dataflow languages
  – Iteration in dataflow
  – Spark
• Catch up: similarity joins


PageRank


Google’s PageRank

[Figure: diagram of web sites linking to one another]

Inlinks are “good” (recommendations)

Inlinks from a “good” site are better than inlinks from a “bad” site

but inlinks from sites with many outlinks are not as “good”...

“Good” and “bad” are relative.



Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)

[Figure: diagram of web sites linking to one another]

Imagine a “pagehopper” that always either

• follows a random link, or
• jumps to a random page

PageRank ranks pages by the amount of time the pagehopper spends on a page:

• or, if there were many pagehoppers, PageRank is the expected “crowd size”
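The pagehopper picture can be simulated directly. The following is a minimal sketch, not code from the lecture: the toy graph, the jump probability c = 0.15, the step count, and the function name are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_pagehopper(links, c=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of time a random surfer spends on each page.

    links: dict mapping every page to the list of pages it links to (toy data).
    c: probability of jumping to a uniformly random page (assumed value).
    """
    rng = random.Random(seed)
    pages = sorted(links)
    page = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        out = links[page]
        if rng.random() < c or not out:
            page = rng.choice(pages)   # jump to a random page
        else:
            page = rng.choice(out)     # follow a random outlink
    return {p: visits[p] / steps for p in pages}

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = simulate_pagehopper(toy_links)
```

Page "d" has no inlinks, so the hopper only reaches it by random jumps, while "c" collects visits from three other pages.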


PageRank in Memory

• Let u = (1/N, …, 1/N)
  – dimension = #nodes N
• Let A = adjacency matrix: [a_ij = 1 ⇔ i links to j]
• Let W = [w_ij = a_ij / outdegree(i)]
  – w_ij is the probability of a jump from i to j
• Let v^0 = (1, 1, …, 1)
  – or anything else you want
• Repeat until converged:
  – Let v^{t+1} = c u + (1-c) v^t W
• c is the probability of jumping “anywhere randomly”
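A small in-memory version of this iteration might look like the sketch below. It uses plain dictionaries instead of matrices, starts from u rather than from the all-ones vector, and assumes every node has at least one outlink. The toy graph, c = 0.15, and the tolerance are illustrative assumptions.

```python
def pagerank_in_memory(links, c=0.15, tol=1e-10, max_iter=1000):
    """Iterate v <- c*u + (1-c)*v*W on a small graph held entirely in memory.

    links: dict node -> list of out-neighbors. Every node is assumed to have
    at least one outlink (dangling nodes are not handled in this sketch).
    """
    nodes = sorted(links)
    n = len(nodes)
    v = {i: 1.0 / n for i in nodes}            # start from u; any start works
    for _ in range(max_iter):
        new_v = {i: c / n for i in nodes}      # the c*u term
        for i in nodes:
            share = (1.0 - c) * v[i] / len(links[i])   # (1-c) * v[i] * w_ij
            for j in links[i]:
                new_v[j] += share
        delta = sum(abs(new_v[i] - v[i]) for i in nodes)
        v = new_v
        if delta < tol:
            break
    return v

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_in_memory(toy_links)
```

Node "d" has no inlinks, so its score settles at exactly c/N.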


Streaming PageRank


Streaming PageRank

• Assume we can store v but not W in memory
• Repeat until converged:
  – Let v^{t+1} = c u + (1-c) v^t W
• W is a sparse row matrix: each line is
  – i j_{i,1}, …, j_{i,d} [the neighbors of i]
• Store v’ and v in memory: v’ starts out as c u
• For each line “i j_{i,1}, …, j_{i,d}”
  – For each j in j_{i,1}, …, j_{i,d}
    • v’[j] += (1-c) v[i] / d

Everything needed for the update is right there in the row…

Each iteration: from v, compute v’, then let v = v’
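One pass of this scheme can be sketched as follows, with v held in a dictionary and the rows of W read one line at a time. The list `graph_lines` stands in for a file on disk; the function name, the toy graph, and c = 0.15 are assumptions for illustration.

```python
def streaming_pagerank_pass(lines, v, c=0.15):
    """One pass over the rows of W: compute v' from v without storing W.

    lines: iterable of strings "i j1 j2 ... jd" (a node, then its out-neighbors).
    v: dict node -> current score, held in memory.
    """
    n = len(v)
    v_new = {i: c / n for i in v}              # v' starts out as c*u
    for line in lines:
        i, *neighbors = line.split()
        d = len(neighbors)
        for j in neighbors:
            v_new[j] += (1.0 - c) * v[i] / d
    return v_new

graph_lines = ["a b c", "b c", "c a", "d c"]   # stand-in for a file on disk
v = {node: 0.25 for node in "abcd"}
for _ in range(50):
    v = streaming_pagerank_pass(graph_lines, v)   # then let v = v'
```

Only one line of the graph is needed at any moment, so the same loop works when `lines` is an open file handle.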


Streaming PageRank

• Assume we can store one row of W in memory at a time, but not v
  – i.e., the links on a web page fit in memory, but not the PageRank values for the whole web.
• Repeat until converged:
  – Let v^{t+1} = c u + (1-c) v^t W
• Store W as a row matrix: each line is
  – i j_{i,1}, …, j_{i,d} [the neighbors of i]
• v’ starts out as c u
• For each line “i j_{i,1}, …, j_{i,d}”
  – For each j in j_{i,1}, …, j_{i,d}
    • v’[j] += (1-c) v[i] / d

We need to convert these counter updates to messages, like we did for naïve Bayes

Like in naïve Bayes: a document fits, but the model doesn’t


Streaming PageRank

• Assume we can store one row of W in memory at a time, but not v
  – i.e., the links on a web page fit in memory, but not the PageRank values for the whole web.
• Repeat until converged:
  – Let v^{t+1} = c u + (1-c) v^t W
• Store W as a row matrix: each line is
  – i j_{i,1}, …, j_{i,d} [the neighbors of i]
• v’ starts out as c u
• For each line “i j_{i,1}, …, j_{i,d}”
  – For each j in j_{i,1}, …, j_{i,d}
    • v’[j] += (1-c) v[i] / d

Streaming: if we know c, v[i] and have the linked-to j’s then
• we can compute d
• we can produce messages saying how much to increment each v[j]

Then we sort the messages by j and add up all the increments
• and somehow also add in c/N

So we need to stream through some structure that has v[i] plus all the outlinks from i

This is a copy of the graph with the current PageRank estimates (v) for each node attached to the node.
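The message-passing version can be sketched as below. Each row carries a node, its current estimate, and its outlinks; the emitter produces increment messages plus one message that carries the outlinks forward, and the reducer sorts by destination and sums. The in-memory `sorted` call stands in for an external sort or shuffle, and the message tags, function names, toy rows, and c = 0.15 are all assumptions. Every node is assumed to appear as a row with at least one outlink.

```python
from itertools import groupby
from operator import itemgetter

def emit_messages(rows, c=0.15):
    """rows: iterable of (i, v_i, [j1, ..., jd]); yields (destination, message) pairs."""
    for i, v_i, neighbors in rows:
        yield (i, ("links", neighbors))        # pass the graph structure along
        share = (1.0 - c) * v_i / len(neighbors)
        for j in neighbors:
            yield (j, ("inc", share))          # how much to increment v[j]

def reduce_messages(messages, n, c=0.15):
    """Sort messages by destination, add up increments, and add in c/N."""
    ordered = sorted(messages, key=itemgetter(0))   # stand-in for an external sort
    for j, group in groupby(ordered, key=itemgetter(0)):
        total, neighbors = c / n, []
        for _, (kind, payload) in group:
            if kind == "inc":
                total += payload
            else:
                neighbors = payload
        yield (j, total, neighbors)

rows = [("a", 0.25, ["b", "c"]), ("b", 0.25, ["c"]),
        ("c", 0.25, ["a"]), ("d", 0.25, ["c"])]
for _ in range(50):
    rows = list(reduce_messages(emit_messages(rows), n=len(rows)))
ranks = {i: v_i for i, v_i, _ in rows}
```

The output of each pass has the same shape as its input, a copy of the graph with updated estimates attached, so it can be fed straight into the next pass.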


PageRank in Dataflow Languages


Iteration in Dataflow

• Pig and GuineaPig are pure dataflow languages
  – no conditionals
  – no iteration
• To loop you need to embed a dataflow program into a ‘real’ program


params: d, docs_in, docs_out


Recap: a Gpig program….

There’s no “program” object, and no obvious way to write a main that will call a program


Calling GuineaPig programmatically

Somewhere in the execution of the plan we will call this script with special arguments that tell it to do a substep of the plan.

So we need a Python main that will do that right

But I also want to write an actual main program, so…

Create a Planner object I can call (not a subclass), and remember to call setup()


Calling GuineaPig programmatically

• Convert from initial format and assign initial pagerank vector v
• Create the planner
• Move initial pageranks + graph to tmp file
• Compute updated pageranks
• Move new pageranks to tmp file for next iteration of loop
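The shape of such a driver loop can be sketched without GuineaPig. In the sketch below, `run_step` is a hypothetical stand-in for one execution of the dataflow plan (it is not GuineaPig's API), and the JSON-lines file format, file names, and iteration count are assumptions. What it illustrates is the pattern: the 'real' program owns the loop and moves files between iterations.

```python
import json
import os
import shutil
import tempfile

def run_step(in_path, out_path, c=0.15):
    """Hypothetical stand-in for one run of the dataflow plan: read rows, write updated rows."""
    with open(in_path) as f:
        rows = [json.loads(line) for line in f]
    n = len(rows)
    new_v = {r["node"]: c / n for r in rows}
    for r in rows:
        for j in r["links"]:
            new_v[j] += (1.0 - c) * r["v"] / len(r["links"])
    with open(out_path, "w") as f:
        for r in rows:
            row = {"node": r["node"], "v": new_v[r["node"]], "links": r["links"]}
            f.write(json.dumps(row) + "\n")

def drive(graph, iterations=10):
    workdir = tempfile.mkdtemp()
    tmp_path = os.path.join(workdir, "tmp.jsonl")
    out_path = os.path.join(workdir, "out.jsonl")
    # convert from the initial format and assign the initial pagerank vector v
    with open(tmp_path, "w") as f:
        for node, links in graph.items():
            f.write(json.dumps({"node": node, "v": 1.0 / len(graph), "links": links}) + "\n")
    for _ in range(iterations):
        run_step(tmp_path, out_path)      # compute updated pageranks
        os.replace(out_path, tmp_path)    # move them to the tmp file for the next iteration
    with open(tmp_path) as f:
        result = {r["node"]: r["v"] for r in map(json.loads, f)}
    shutil.rmtree(workdir)
    return result

ranks = drive({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

Every iteration reads and rewrites the whole graph on disk, which is the i/o cost the following slides point at.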


lots and lots of i/o happening here…

row in initGraph: (i, [j1, …, jd])


note: there is lots of i/o happening here…


lots of i/o happening here…

Julien Le Dem - Yahoo


An example from Ron Bekkerman of an iterative dataflow computation


Example: k-means clustering

• An EM-like algorithm:
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
  – Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
  – Find new centroids that maximize that expectation
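The two steps can be sketched in a few lines of plain Python. The sketch assumes squared Euclidean distance, a fixed number of iterations instead of a convergence test, and hand-picked initial centroids; the points and function names are illustrative.

```python
def closest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to point."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - q) ** 2 for p, q in zip(point, centroids[k])))

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # E-step: associate each data instance with the closest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # M-step: recalculate each centroid as the average of its instances
        # (an empty cluster keeps its old centroid)
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```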


k-means Clustering

[Figure: k-means clustering illustration with the centroids labeled]


Parallelizing k-means


k-means on MapReduce

• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids

Panda et al, Chapter 2
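A single-process sketch of this division of labor follows. Each mapper emits a local sum and a local size per cluster, which is equivalent to emitting a local centroid weighted by its size; the reducer combines them into global centroids. The shards, function names, and fixed iteration count are assumptions, and no real MapReduce framework is involved.

```python
def mapper(shard, centroids):
    """Assign a shard's points to clusters; emit (cluster, (local_sum, local_size))."""
    local = {}
    for point in shard:
        k = min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
        total, size = local.get(k, ((0.0,) * len(point), 0))
        local[k] = (tuple(t + p for t, p in zip(total, point)), size + 1)
    return list(local.items())

def reducer(pairs, centroids):
    """Aggregate local sums and sizes into new global centroids."""
    sums, sizes = {}, {}
    for k, (total, size) in pairs:
        prev = sums.get(k, (0.0,) * len(total))
        sums[k] = tuple(a + b for a, b in zip(prev, total))
        sizes[k] = sizes.get(k, 0) + size
    # a cluster that received no points keeps its old centroid
    return [tuple(s / sizes[k] for s in sums[k]) if k in sums else centroids[k]
            for k in range(len(centroids))]

shards = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 1.0), (10.0, 11.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    pairs = [pair for shard in shards for pair in mapper(shard, centroids)]
    centroids = reducer(pairs, centroids)
```

Only the small (sum, size) pairs cross from mappers to reducers, not the data points themselves.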


k-means in Apache Pig: input data

• Assume we need to cluster documents
  – Stored in a 3-column table D:

    Document  Word      Count
    doc1      Carnegie  2
    doc1      Mellon    2

• Initial centroids are k randomly chosen docs
  – Stored in table C in the same format as above


k-means in Apache Pig: E-step

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;

PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;

SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);

\[ c_d = \arg\max_c \frac{\sum_{w \in d} i_d^w \cdot i_c^w}{\sqrt{\sum_{w \in c} \left(i_c^w\right)^2}} \]


k-means in Apache Pig: E-step

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;

PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

-- document d, cluster c similarities
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;

-- cluster assignments: d → closest c
SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);
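For readers more comfortable with Python than Pig, the same E-step can be sketched over in-memory tables. The comments name the Pig relation each step mirrors. The rows beyond the doc1 example, the centroid table, and the function name are made-up toy data. As with the inner JOIN in the Pig version, a document that shares no word with any centroid gets no assignment.

```python
from collections import defaultdict
from math import sqrt

def e_step(D, C):
    """D: rows (d, w, i_d); C: rows (c, w, i_c). Returns {d: most similar c}."""
    # LEN_C: squared length of each centroid vector (square root taken below)
    len_c = defaultdict(float)
    for c, w, ic in C:
        len_c[c] += ic * ic
    # D_C, PROD, DOT_PROD: join on word, then sum the products per (d, c)
    by_word = defaultdict(list)
    for c, w, ic in C:
        by_word[w].append((c, ic))
    dot = defaultdict(float)
    for d, w, i_d in D:
        for c, ic in by_word[w]:
            dot[(d, c)] += i_d * ic
    # SIM, CLUSTERS: keep the most similar centroid for each document
    best = {}
    for (d, c), dxc in dot.items():
        sim = dxc / sqrt(len_c[c])
        if d not in best or sim > best[d][1]:
            best[d] = (c, sim)
    return {d: c for d, (c, _) in best.items()}

D = [("doc1", "Carnegie", 2), ("doc1", "Mellon", 2), ("doc2", "Spark", 3)]
C = [("c1", "Carnegie", 1), ("c1", "Mellon", 1), ("c2", "Spark", 1), ("c2", "Mellon", 1)]
clusters = e_step(D, C)
```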


k-means in Apache Pig: M-step

-- add up documents in each cluster
D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;

-- figure out cluster size
D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;

-- figure out weighted average of docs in each cluster
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally: embed in Java (or Python or …) to do the looping
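A Python sketch of the M-step over the same kind of in-memory table is given below. One assumption is worth flagging: this sketch takes the cluster size to be the number of distinct documents in the cluster, which is one reading of "figure out cluster size". That differs from counting the rows of the joined table, which is what COUNT(D_C_W) does literally. The toy rows and the function name are illustrative.

```python
from collections import defaultdict

def m_step(D, clusters):
    """D: rows (d, w, i_d); clusters: {d: c}. Returns centroid rows (c, w, i_c)."""
    sums = defaultdict(float)    # SUMS: total weight per (c, w)
    docs = defaultdict(set)      # documents assigned to each cluster
    for d, w, i_d in D:
        c = clusters[d]
        sums[(c, w)] += i_d
        docs[c].add(d)
    # average per word: total weight divided by the number of documents (assumed size)
    return sorted((c, w, total / len(docs[c])) for (c, w), total in sums.items())

D = [("doc1", "Carnegie", 2), ("doc1", "Mellon", 2), ("doc2", "Mellon", 4)]
new_C = m_step(D, {"doc1": "c1", "doc2": "c1"})
```

The output has the same three-column shape as the centroid table C, so it can be fed to the next E-step by the surrounding loop.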


Data is read, and model is written, with every iteration

• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids

Panda et al, Chapter 2


Spark: Another Dataflow Language


Spark

• Hadoop: Too much typing
  – programs are not concise
• Hadoop: Too low level
  – missing abstractions
  – hard to specify a workflow
• Pig, GuineaPig, … address these problems: also Spark
• Not well suited to iterative operations
  – E.g., PageRank, E/M, k-means clustering, …
  – Spark lowers cost of repeated reads

Set of concise dataflow operations (“transformations”)

Dataflow operations are embedded in an API together with “actions”

Spark: Sharded files are replaced by “RDDs” (resilient distributed datasets). RDDs can be cached in cluster memory and recreated to recover from error


Spark examples


spark is a spark context object


Spark examples

errors is a transformation, and thus a data structure that explains HOW to do something

count() is an action: it will actually execute the plan for errors and return a value.

errors.filter() is a transformation

collect() is an action

everything is sharded, like in Hadoop and GuineaPig
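The split between transformations and actions can be illustrated without Spark. The class below is a toy stand-in and not Spark's API: `LazyDataset`, its methods, and the log lines are all made up for this sketch, and nothing is sharded or distributed. It only shows the idea that a transformation records a plan while an action runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a plan and runs it only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source    # zero-argument callable that yields the raw records
        self._steps = steps      # tuple of ("filter" | "map", function) pairs

    # transformations: return a new plan and do no work
    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def _run(self):
        records = self._source()
        for kind, fn in self._steps:
            records = filter(fn, records) if kind == "filter" else map(fn, records)
        return records

    # actions: execute the plan and return a value
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())

log = ["ERROR disk full", "INFO ok", "ERROR MySQL timeout", "INFO done"]
lines = LazyDataset(lambda: iter(log))
errors = lines.filter(lambda s: s.startswith("ERROR"))   # a plan; nothing runs yet
n_errors = errors.count()                                 # action: runs the plan
mysql = errors.filter(lambda s: "MySQL" in s).collect()   # new plan, then an action
```

Because each action re-runs the plan from the source, repeated actions repeat the work, which is the cost that caching a dataset in memory is meant to avoid.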


Spark examples

# modify errors to be stored in cluster memory

subsequent actions will be much faster

everything is sharded … and the shards are stored in memory of worker machines not local disk (if possible)

You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.


Spark examples


persist-on-disk works because the RDD is read-only (immutable)


Spark examples: wordcount

the action

transformation on (key,value) pairs, which are special


Spark examples: batch logistic regression

reduce is an action: it produces a numpy vector

p.x and w are vectors, from the numpy package. numpy overloads Python operators like * and + for vectors.


Spark examples: batch logistic regression

Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of python floats.
• numpy operations like dot, *, + are calls to optimized C code
• a little python logic around a lot of numpy calls is pretty efficient


Spark examples: batch logistic regression

w is defined outside the lambda function, but used inside it

So: python builds a closure (code including the current value of w) and Spark ships it off to each worker. So w is copied, and must be read-only.
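The closure pattern can be seen in a single-process sketch of batch logistic regression that uses Python's built-in `map` and `functools.reduce` in place of Spark. Plain lists stand in for numpy vectors, labels are assumed to be in {-1, +1}, and the toy points, step size, and iteration count are made up for illustration.

```python
import math
from functools import reduce

def gradient(point, w):
    """Per-example gradient of the logistic loss; labels y are in {-1, +1}."""
    x, y = point
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]

def add(a, b):
    """Element-wise sum of two equal-length lists (stand-in for numpy's +)."""
    return [ai + bi for ai, bi in zip(a, b)]

points = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = [0.0, 0.0]
for _ in range(100):
    # the lambda closes over the current w; the mapped function only reads it
    grad = reduce(add, map(lambda p: gradient(p, w), points))
    # the driver rebinds w between passes; the mapped function never mutates it
    w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]
```

Here the closure and the data live in one process. In a distributed setting the closure is serialized and shipped to workers, so an update a worker made to its copy of w would not be seen by the driver, which is why w is treated as read-only inside the lambda.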


Spark examples: batch logistic regression

dataset of points is cached in cluster memory to reduce i/o


Spark logistic regression example


Spark
