Workflows 3: Graphs, PageRank, Loops, Spark
Always start off with a joke, to lighten the mood
The course so far
• Cost of operations
• Streaming learning algorithms
– Parallel streaming with map-reduce
– Building complex parallel algorithms with dataflow languages (Pig, GuineaPig, …)
• Mostly for text

• Another common large data structure: graphs
• Examples: web, Facebook, Twitter, citation data, Freebase, Wikipedia, biological networks, …
– PageRank in dataflow languages
– Iteration in dataflow
– Spark
• Catch up: similarity joins
PageRank
Google’s PageRank
[Figure: a small graph of web sites linking to one another]
• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
• but inlinks from sites with many outlinks are not as “good”…
• “Good” and “bad” are relative.
Google’s PageRank
[Figure: the same web-graph diagram]
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
[Figure: the same web-graph diagram]
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to a random page
PageRank ranks pages by the amount of time the pagehopper spends on a page:
• or, if there were many pagehoppers, PageRank is the expected “crowd size”
PageRank in Memory
• Let u = (1/N, …, 1/N)
– dimension = #nodes N
• Let A = adjacency matrix: [a_ij = 1 ⟺ i links to j]
• Let W = [w_ij = a_ij / outdegree(i)]
– w_ij is the probability of a jump from i to j
• Let v^0 = (1, 1, …, 1)
– or anything else you want
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• c is the probability of jumping “anywhere randomly”
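As a concrete illustration, here is a minimal sketch of this in-memory iteration in Python with numpy; the damping constant c and the convergence tolerance are illustrative choices, and it assumes every node has at least one outlink.

    import numpy as np

    def pagerank_in_memory(A, c=0.15, tol=1e-8):
        # A is the N x N adjacency matrix; assumes no dangling nodes.
        N = A.shape[0]
        u = np.ones(N) / N                      # uniform "random jump" distribution
        W = A / A.sum(axis=1, keepdims=True)    # w_ij = a_ij / outdegree(i)
        v = np.ones(N) / N                      # any starting vector works
        while True:
            v_next = c * u + (1 - c) * (v @ W)  # v^(t+1) = c u + (1-c) v^t W
            if np.abs(v_next - v).sum() < tol:
                return v_next
            v = v_next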
Streaming PageRank
Streaming PageRank
• Assume we can store v but not W in memory
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• W is a sparse row matrix: each line is
– i  j_i,1, …, j_i,d  [the neighbors of i]
• Store v′ and v in memory: v′ starts out as c·u
• For each line “i  j_i,1, …, j_i,d”:
– For each j in j_i,1, …, j_i,d:
• v′[j] += (1−c)·v[i]/d
Everything needed for the update is right there in the row…
Each iteration: from v, compute v′, then let v = v′.
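A minimal sketch of one such iteration in Python, assuming the row format above (one line per node: the node id followed by its outlink targets, as integer ids in 0…N−1):

    # One iteration of streaming PageRank: v fits in memory, W is streamed.
    def streaming_pagerank_step(graph_lines, v, N, c=0.15):
        v_next = [c / N] * N                      # v' starts out as c*u
        for line in graph_lines:                  # each line: "i j1 j2 ... jd"
            fields = [int(x) for x in line.split()]
            i, outlinks = fields[0], fields[1:]
            d = len(outlinks)
            for j in outlinks:
                v_next[j] += (1 - c) * v[i] / d   # v'[j] += (1-c) v[i]/d
        return v_next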
Streaming PageRank
• Assume we can store one row of W in memory at a time, but not v
– i.e., the links on a web page fit in memory, but the PageRank values for the whole web do not.
• Repeat until converged:
– Let v^(t+1) = c·u + (1−c)·v^t·W
• Store W as a row matrix: each line is
– i  j_i,1, …, j_i,d  [the neighbors of i]
• v′ starts out as c·u
• For each line “i  j_i,1, …, j_i,d”:
– For each j in j_i,1, …, j_i,d:
• v′[j] += (1−c)·v[i]/d
We need to convert these counter updates to messages, like we did for naïve Bayes. Like in naïve Bayes: a document fits in memory, but the model doesn’t.
Streaming: if we know c and v[i] and have the linked-to j’s, then
• we can compute d
• we can produce messages saying how much to increment each v[j]
Then we sort the messages by the destination j and add up all the increments
• and somehow also add in c/N
So we need to stream through some structure that has v[i] plus all the outlinks from i. This is a copy of the graph with the current PageRank estimates (v) attached to each node.
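A minimal sketch of this message-passing formulation, where each input line pairs a node’s current rank with its outlinks, and the sort stands in for a map-reduce shuffle:

    # Streaming PageRank via messages: neither v nor W needs to fit in memory.
    # Each input line: "i v_i j1 j2 ... jd" (rank attached to the graph row).
    def pagerank_via_messages(graph_lines, N, c=0.15):
        messages = []
        for line in graph_lines:
            fields = line.split()
            v_i, outlinks = float(fields[1]), fields[2:]
            d = len(outlinks)
            for j in outlinks:
                messages.append((j, (1 - c) * v_i / d))   # increment for v'[j]
        messages.sort()                                   # the "shuffle": group by j
        v_next = {}
        for j, inc in messages:
            v_next[j] = v_next.get(j, c / N) + inc        # also adds in c/N
        return v_next  # nodes with no inlinks still need the c/N term added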
PageRank in Dataflow Languages
Iteration in Dataflow
• PIG and GuineaPig are pure dataflow languages
– no conditionals
– no iteration
• To loop you need to embed a dataflow program into a “real” program
[Code slides: a GuineaPig script with params d, docs_in, docs_out]
Recap: a GPig program…
• There’s no “program” object, and no obvious way to write a main that will call a program.

Calling GuineaPig programmatically
• Somewhere in the execution of the plan, GuineaPig will call this script with special arguments that tell it to do a substep of the plan, so we need a Python main that will do that right.
• But I also want to write an actual main program, so…
• Create a Planner object I can call (not a subclass) – remember to call setup()
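The slide shows this as a code screenshot; here is a minimal sketch of the pattern, with a placeholder view definition (check the exact calls against the GuineaPig documentation):

    import sys
    from guineapig import Planner, ReadLines

    # Build a Planner instance directly instead of subclassing it.
    p = Planner()
    p.rows = ReadLines('graph.txt')   # placeholder view definition
    p.setup()                         # required after the views are defined

    if __name__ == "__main__":
        # GuineaPig re-invokes this script with special arguments to run
        # substeps of the plan, so main() must dispatch on sys.argv.
        p.main(sys.argv)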
Calling GuineaPig programmatically: the main program
• creates the planner
• converts from the initial format and assigns the initial PageRank vector v
• moves the initial PageRanks + graph to a tmp file
• then, in a loop: computes the updated PageRanks, and moves the new PageRanks to a tmp file for the next iteration of the loop

A row in initGraph: (i, [j1, …, jd])

Note: there is lots and lots of I/O happening here…
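A minimal sketch of that outer loop; run_pagerank_step is a stand-in for invoking the GuineaPig plan, and the file names are hypothetical:

    import shutil

    def run_pagerank_step(in_file, out_file):
        # Stand-in: invoke the GuineaPig plan that computes v' from v,
        # reading graph+ranks from in_file and writing them to out_file.
        ...

    # initGraph.txt: one row per node, (i, [j1,...,jd]), with initial ranks attached
    shutil.copy('initGraph.txt', 'tmp_in.txt')
    for t in range(10):                             # fixed iteration count for simplicity
        run_pagerank_step('tmp_in.txt', 'tmp_out.txt')
        shutil.move('tmp_out.txt', 'tmp_in.txt')    # new ranks feed the next iteration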
Julien Le Dem - Yahoo
An example from Ron Bekkerman of an iterative dataflow computation:
Example: k-means clustering
• An EM-like algorithm (a sketch in code follows below):
• Initialize k cluster centroids
• E-step: associate each data instance with the closest centroid
– Find expected values of cluster assignments given the data and centroids
• M-step: recalculate centroids as an average of the associated data instances
– Find new centroids that maximize that expectation
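A minimal sketch of one E-step/M-step pass with numpy, using Euclidean distance (the Pig version below uses a normalized dot product instead); it assumes no cluster goes empty:

    import numpy as np

    def kmeans_step(X, centroids):
        # E-step: assign each instance (row of X) to the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # M-step: recalculate each centroid as the mean of its instances.
        new_centroids = np.array([X[assign == c].mean(axis=0)
                                  for c in range(len(centroids))])
        return assign, new_centroids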
k-means Clustering
[Figure: data instances and cluster centroids]
Parallelizing k-means
k-means on MapReduce (Panda et al., Chapter 2)
• Mappers read data portions and centroids
• Mappers assign data instances to clusters
• Mappers compute new local centroids and local cluster sizes
• Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
• Reducers write the new centroids
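A minimal sketch of the mapper and reducer logic in Python (data_chunk holds numpy vectors; the shuffle between the two stages is left implicit):

    import numpy as np

    def mapper(data_chunk, centroids):
        # Assign each local instance to a cluster; emit local sums and sizes.
        k, dim = centroids.shape
        sums, sizes = np.zeros((k, dim)), np.zeros(k, dtype=int)
        for x in data_chunk:
            c = int(((centroids - x) ** 2).sum(axis=1).argmin())
            sums[c] += x
            sizes[c] += 1
        for c in range(k):
            yield c, (sums[c], sizes[c])    # keyed by cluster id

    def reducer(c, values):
        # Aggregate local centroids, weighted by local cluster sizes.
        total = sum(s for s, _ in values)
        size = sum(n for _, n in values)
        yield c, total / size               # new global centroid for cluster c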
k-means in Apache Pig: input data
• Assume we need to cluster documents
– Stored in a 3-column table D:

  Document  Word      Count
  doc1      Carnegie  2
  doc1      Mellon    2

• Initial centroids are k randomly chosen docs
– Stored in table C in the same format as above
k-means in Apache Pig: E-step

For each document d, find the closest centroid by length-normalized dot product:

c(d) = argmax_c ( Σ_{w∈d} i_wd · i_wc ) / √( Σ_{w∈c} i_wc² )

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;

PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;  -- document d, cluster c similarities

SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);  -- cluster assignments: d → closest c
k-means in Apache Pig: M-step

D_C_W = JOIN CLUSTERS BY d, D BY d;

D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;  -- add up documents in each cluster

D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;  -- figure out cluster size

SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;  -- weighted average of docs in each cluster

Finally – embed in Java (or Python or …) to do the looping.
Data is read, and the model is written, with every iteration: that is the drawback of the k-means-on-MapReduce scheme above (Panda et al., Chapter 2).
Spark: Another Dataflow Language
Spark
• Hadoop: Too much typing
– programs are not concise
• Hadoop: Too low level
– missing abstractions
– hard to specify a workflow
• Pig, GuineaPig, … address these problems; so does Spark
• But they are not well suited to iterative operations
– e.g., PageRank, E/M, k-means clustering, …
– Spark lowers the cost of repeated reads
• Spark offers a set of concise dataflow operations (“transformations”)
• The dataflow operations are embedded in an API together with “actions”
• Sharded files are replaced by “RDDs” – resilient distributed datasets; RDDs can be cached in cluster memory and recreated to recover from errors
Spark examples
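The code on these slides is a screenshot that did not survive extraction; what follows is the classic Spark log-mining example that the annotations below refer to (the file path is hypothetical):

    # Classic Spark log-mining example; "spark" is a SparkContext object.
    lines = spark.textFile("hdfs://log.txt")                 # path hypothetical
    errors = lines.filter(lambda s: s.startswith("ERROR"))   # a transformation
    errors.cache()                      # keep errors' shards in cluster memory
    errors.count()                      # an action: executes the plan
    errors.filter(lambda s: "MySQL" in s).collect()          # transformation + action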
• spark is a SparkContext object
• errors is a transformation, and thus a data structure that explains HOW to do something
• count() is an action: it will actually execute the plan for errors and return a value
• errors.filter() is a transformation; collect() is an action
• everything is sharded, like in Hadoop and GuineaPig
• cache() modifies errors to be stored in cluster memory, so subsequent actions will be much faster: the shards are stored in the memory of the worker machines, not on local disk (if possible)
• You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
• persist-on-disk works because the RDD is read-only (immutable)
Spark examples: wordcount
• the transformations here operate on (key, value) pairs, which are special
• the last step is the action
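Again the slide’s code is a lost screenshot; a standard Spark wordcount matching the annotations (paths hypothetical):

    # Standard Spark wordcount (paths hypothetical).
    text = spark.textFile("hdfs://input.txt")
    counts = text.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)  # transformation on (key,value) pairs
    counts.saveAsTextFile("hdfs://counts.txt")     # the action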
Spark examples: batch logistic regression
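The slide code is again a screenshot; this is the canonical Spark batch logistic-regression example the annotations describe (parsePoint, D, and ITERATIONS are assumed to be defined elsewhere):

    import numpy as np

    # Canonical Spark batch logistic regression sketch.
    points = spark.textFile("hdfs://points.txt").map(parsePoint).cache()
    w = np.random.ranf(size=D)            # initial weight vector, D = #features
    for i in range(ITERATIONS):
        gradient = points.map(
            lambda p: (1 / (1 + np.exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)      # reduce is an action: returns a numpy vector
        w -= gradient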
• p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
• reduce is an action – it produces a numpy vector
• w is defined outside the lambda function, but used inside it. So Python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.
• the dataset of points is cached in cluster memory to reduce I/O

Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of Python floats.
• numpy operations like dot, *, and + are calls to optimized C code.
• A little Python logic around a lot of numpy calls is pretty efficient.
Spark logistic regression example

Spark