CS 6604: Data Mining Large Networks and Time-Series
B. Aditya Prakash
Lecture #6: Hadoop and Graph Analysis
What to do if data is really large?
§ Peta-bytes (exabytes, zettabytes, ...)
§ Google processed 24 PB of data per day (2009)
§ FB adds 0.5 PB per day
Prakash 2013 CS 6604: DM Large Networks & Time-‐Series 2
BIG data
Single vs Cluster
§ 4TB HDDs are coming out
§ Cluster?
– How many machines?
– Handle machine and drive failure
– Need redundancy, backup, ...
How to analyze such large datasets?
First thing, how to store them?
Single machine? 4TB drive is out
Cluster of machines?
• How many machines?
• Need to worry about machine and drive failure. Really?
• Need data backup, redundancy, recovery, etc.
3% of 100,000 hard drives fail within the first 3 months
Failure Trends in a Large Disk Drive Population
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
Hadoop
§ Open source software
– Reliable, scalable, distributed computing
§ Can handle thousands of machines
§ Written in JAVA
§ A simple programming model
§ HDFS (Hadoop Distributed File System)
– Fault tolerant (can recover from failures)
Open-source software for reliable, scalable, distributed computing
Written in Java
Scale to thousands of machines
• Linear scalability: ideally, a job on 2 machines runs twice as fast
Uses simple programming model (MapReduce)
Fault tolerant (HDFS)
• Can recover from machine/disk failure (no need to restart computation)
http://hadoop.apache.org
Why Hadoop?
§ Many research groups/projects use it
§ Fortune 500 companies use it
§ Low cost to set up and pick up
§ It's FREE!!
Map-Reduce [Dean and Ghemawat 2004]
§ Abstraction for simple computing
– Hides details of parallelization, fault-tolerance, data-balancing
Programming Model
§ Input: key/value pairs
§ Output: key/value pairs
§ User has to specify TWO functions: map() and reduce()
– Map(): takes an input pair and produces k intermediate key/value pairs
– Reduce(): given an intermediate key and the set of values for that key, outputs another key/value pair
Master-Slave Architecture: Phases
1. Map phase: divide data and computation into smaller pieces; each machine ('mapper') works on one piece in parallel.
2. Shuffle phase: the master sorts and moves intermediate results to reducers.
3. Reduce phase: machines ('reducers') combine results in parallel.
Example: Histogram of Fruit names
Example: histogram of fruit names
(figure: input fruit records in HDFS flow through Map 0-2, which emit pairs like (apple, 1), (strawberry, 1); the shuffle groups pairs by fruit name; Reduce 0-1 output the counts (apple, 2), (orange, 1), (strawberry, 1) back to HDFS)
map( fruit ) {
  output(fruit, 1);
}

reduce( fruit, v[1..n] ) {
  sum = 0;
  for (i = 1; i <= n; i++)
    sum = sum + v[i];
  output(fruit, sum);
}
Source: U Kang, 2013
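The map()/reduce() pseudocode above can be simulated end-to-end in one process. This is only a sketch of the programming model in plain Python, not actual Hadoop code; on a real cluster the map, shuffle, and reduce phases run in parallel across machines.

```python
from itertools import groupby

def map_fn(fruit):
    # map( fruit ) { output(fruit, 1); }
    return [(fruit, 1)]

def reduce_fn(fruit, values):
    # reduce( fruit, v[1..n] ) { sum the v[i]; output(fruit, sum); }
    return (fruit, sum(values))

def run_mapreduce(records):
    # Map phase: every record is mapped independently (in parallel on Hadoop).
    intermediate = [pair for r in records for pair in map_fn(r)]
    # Shuffle phase: group intermediate pairs by key (the framework sorts them).
    intermediate.sort(key=lambda kv: kv[0])
    grouped = groupby(intermediate, key=lambda kv: kv[0])
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, [v for _, v in g]) for k, g in grouped)

print(run_mapreduce(["apple", "orange", "apple", "strawberry"]))
# {'apple': 2, 'orange': 1, 'strawberry': 1}
```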
Map-‐Reduce (MR) as SQL
§ select count(*) from FRUITS group by fruit-name
– the "group by" corresponds to the Mapper, the "count(*)" to the Reducer
More examples
§ Count of URL access frequency
– Map(): output <URL, 1>
– Reduce(): output <URL, total_count>
§ Reverse Web-link graph
– Map(): output <target, source> for each target link in a source web-page
– Reduce(): output <target, list[source]>
§ TF-IDF of a document corpus
– Map(): ?
– Reduce(): ?
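The reverse web-link graph example can be sketched the same way: map emits a <target, source> pair per link, and reduce collects the sources per target. A hedged single-process simulation (the toy pages below are made up for illustration):

```python
from collections import defaultdict

def map_fn(source, targets):
    # Emit <target, source> for each outgoing link on the source page.
    return [(target, source) for target in targets]

def reduce_fn(target, sources):
    # Emit <target, list[source]> (sorted for deterministic output).
    return (target, sorted(sources))

def reverse_links(pages):
    intermediate = defaultdict(list)          # shuffle: group by target
    for source, targets in pages.items():
        for target, src in map_fn(source, targets):
            intermediate[target].append(src)
    return dict(reduce_fn(t, s) for t, s in intermediate.items())

pages = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
print(reverse_links(pages))
# {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}
```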
Architecture
Execution Overview
Master
§ For each map and reduce task, stores
– The state (idle, in-progress, completed)
– The identity of the worker machine (for non-idle tasks)
Fault Tolerance
§ Master pings each worker periodically
§ If the response times out, the worker is marked as failed
§ MR tasks on a failed worker become eligible for rescheduling
Locality
§ GFS (Google File System)
– 3-way replication of data
§ The Master knows the location of the input file replicas, and schedules each map task on a machine holding a replica
Backup Tasks (Speculative Execution)
§ The problem of 'stragglers'
§ Solution
– When a MR operation is close to completion, the Master schedules backup executions of the remaining in-progress tasks
– A task is marked as completed whenever either the primary or the backup execution completes
§ 44% faster
Refinements
§ Partitioning Function
§ Ordering Guarantee
§ Combiner function
§ Status information
§ Counter
Partitioning Function
§ Decides which mapper outputs are assigned to which reducer
§ Default:
– Hashing: hash(key) mod R
• R = # of reducers
§ Other options
– Range Partition
• When?
– Identity Partition
• When?
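The default hash(key) mod R scheme can be illustrated in a few lines. This is only a toy sketch (Hadoop's actual HashPartitioner is Java); the key property is that every mapper applies the same function, so all pairs with the same key land on the same reducer.

```python
def partition(key, num_reducers):
    # Stand-in for hash(key): a deterministic hash, so the example is
    # reproducible across runs (Python's built-in str hash is salted).
    h = sum(ord(c) for c in key)
    return h % num_reducers

R = 2
pairs = [("apple", 1), ("orange", 1), ("apple", 1)]
buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition(key, R)].append((key, value))

# Both ("apple", 1) pairs end up in the same bucket, whichever one that is.
```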
Ordering Guarantee
§ Within a given partition, the intermediate key/value pairs are processed in increasing key order
Combiners (~ Reducers)
§ When to use?
– Repetition in the intermediate keys produced by map
– User-specified Reduce() is
• Commutative (e.g. x * y = y * x)
• Associative (e.g. (x * y) * z = x * (y * z))
– E.g. color count
§ When not to use?
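For the count example, summation is commutative and associative, so the reducer can double as a combiner that pre-aggregates on each mapper before the shuffle. A hedged local sketch (the mapper outputs below are made up):

```python
from collections import Counter

def combine(pairs):
    # Runs on each mapper: collapse repeated keys into partial sums.
    c = Counter()
    for key, value in pairs:
        c[key] += value
    return list(c.items())

mapper0 = [("apple", 1), ("apple", 1), ("orange", 1)]
mapper1 = [("apple", 1)]

# Without a combiner, 4 pairs cross the network; with it, only 3.
shuffled = combine(mapper0) + combine(mapper1)
final = combine(shuffled)                 # the reducer is the same function
print(sorted(final))
# [('apple', 3), ('orange', 1)]
```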
Status Information
§ Master has an internal HTTP server
– Exports status web-pages
Counter
§ Global variable that only increases
§ Useful for, e.g., making sure that the # of output pairs = # of input pairs
Performance: Grep
(figure: Grep benchmark; source: U Kang, 2013)
Performance: Sort
(figure: Sort benchmark; source: U Kang, 2013)
MR @ Google
(figures: MapReduce usage at Google; source: U Kang, 2013)
GRAPH ANALYSIS ON HADOOP
Based on slides from Prof. U. Kang
Problem Definition
§ Given: a BIG graph G(V, E)
§ Compute:
– Connected Components
– Diameter
– PageRank
– ...
§ Solution: Use Hadoop
GIM-‐V: One unified framework [Kang+, 2008]
Example: GIM-V at work, computing connected components of YahooWeb (|V| = 1.4B, |E| = 6.6B)
(figure: count vs. size of connected components; e.g. ~500 components of size 300 and ~65 components of size 1100. Why?)
Main Idea
§ GIM-V
– Generalized Iterative Matrix-Vector Multiplication
– Extension of plain matrix-vector multiplication
– Includes as special cases:
• Connected Components
• PageRank
• RWR
• Diameter
• ...
Main Idea: Intuition
§ Plain M-V multiplication: M x v = v'
– v'_i = sum_j m_ij * v_j
– A weighted combination of node values ("colors"): ~ message passing
(figure: a 4-node example of M x v = v'; source: U Kang, 2013)
§ Three implicit operations in v'_i = sum_j m_ij * v_j:
– combine2: multiply m_ij and v_j (message sending)
– combineAll: sum the n multiplication results for node i (message combination)
– assign: update v'_i with the combined result
Main Idea
§ GIM-V
– Matrix represents edges (src, dst)
– Vector represents node values/labels
– Customizing the three operations leads to many algorithms
(table: for standard M-V multiplication, combine2 = Multiply, combineAll = Sum, assign = Assign; customizing these yields Connected Components, PageRank, RWR, Diameter, ...)
Source: U Kang, 2013
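The combine2/combineAll/assign abstraction can be sketched on a single machine. This is only an illustrative skeleton, not the Hadoop implementation: one iteration computes v'_i = assign(v_i, combineAll_i of the combine2(m_ij, v_j) results, and plugging in different operations yields different algorithms.

```python
def gimv_iteration(edges, v, combine2, combineAll, assign):
    # edges: dict i -> list of (j, m_ij); v: dict of node values.
    partial = {}
    for i, in_edges in edges.items():
        partial[i] = [combine2(m_ij, v[j]) for j, m_ij in in_edges]
    return {i: assign(v[i], combineAll(i, partial.get(i, []))) for i in v}

# Standard matrix-vector multiplication: Multiply / Sum / Assign.
mv = lambda edges, v: gimv_iteration(
    edges, v,
    combine2=lambda m, vj: m * vj,
    combineAll=lambda i, xs: sum(xs),
    assign=lambda old, new: new)

# Toy 3-node example (made up): m_01 = 1, m_10 = 1, m_12 = 0.5, m_21 = 1.
edges = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 0.5)], 2: [(1, 1.0)]}
v = {0: 1.0, 1: 2.0, 2: 4.0}
print(mv(edges, v))
# {0: 2.0, 1: 3.0, 2: 2.0}
```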
GIM-V for Connected Components
§ combine2 = Boolean multiply, combineAll = MIN, assign = MIN
GIM-V for CC
§ How many connected components?
– == which node belongs to which component?
(figure: an 8-node example graph with three components G1 = {1, 2, 3, 4}, G5 = {5, 6}, G7 = {7, 8}; the output maps each node id to the minimum node id in its component: nodes 1-4 -> 1, nodes 5-6 -> 5, nodes 7-8 -> 7)
GIM-V for CC
(figure: iterating GIM-V on the 8-node example; the vector is initialized to the node ids (1, 2, ..., 8) and converges to the final vector (1, 1, 1, 1, 5, 5, 7, 7), the minimum node id in each component)
GIM-‐V for CC
(figure: one GIM-V iteration with MIN on the 8-node example. "Sending Invitations": each node sends its current value to its neighbors; "Accept the Smallest": each node i updates v'_i = min(v_i, min of v_j over neighbors j), e.g. v'_2 = min(2, min(1, 3)) = 1. The vector evolves (1,2,3,4,5,6,7,8) -> (1,1,2,3,5,5,7,7) -> (1,1,1,2,5,5,7,7) -> (1,1,1,1,5,5,7,7). Source: Faloutsos and Kang, SIGMOD'12)
GIM-‐V for CC
§ 1 GIM-V iteration with MIN = find min. node ids within 1 hop
GIM-‐V for CC
§ k GIM-V iterations with MIN = find min. node ids within k hops
GIM-‐V for CC
§ Max. iterations = diameter of the graph
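The min-id propagation above can be sketched directly: every node repeatedly accepts the smallest id among itself and its neighbors, converging within diameter-many iterations. A single-machine illustration (not the Hadoop version):

```python
def connected_components(n, edges):
    # n nodes labeled 1..n; edges is a list of undirected (i, j) pairs.
    neighbors = {i: [] for i in range(1, n + 1)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    v = {i: i for i in range(1, n + 1)}        # init vector: own node id
    while True:
        # One GIM-V iteration: combineAll and assign are both MIN.
        new_v = {i: min([v[i]] + [v[j] for j in neighbors[i]]) for i in v}
        if new_v == v:                          # converged
            return v
        v = new_v

# The 8-node example from the slides: components {1,2,3,4}, {5,6}, {7,8}.
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
print(connected_components(8, edges))
# {1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 7, 8: 7}
```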
GIM-V
(table: customizing GIM-V's three operations yields the algorithms below)
Algorithm   | combine2                      | combineAll             | assign
Standard MV | Multiply                      | Sum                    | Assign
Con. Cmpt.  | Multiply                      | MIN                    | MIN
PageRank    | Multiply with c               | Sum with restart prob. | Assign
RWR         | Multiply with c               | Sum with rj prob.      | Assign
Diameter    | Multiply bit-vector (approx.) | BIT-OR                 | BIT-OR
Source: Faloutsos and Kang, SIGMOD'12
HDFS Restrictions
§ [R1] HDFS is location transparent
– Users don't know which machine has which file
§ [R2] A line is never split
– A large file is split into pieces of a fixed size (e.g. 256MB)
– Users don't know the split points
Fast GIM-V
§ Given R1 and R2, how to design faster algorithms for GIM-V?
§ 1. Block Multiplication
§ 2. Clustering
§ 3. Compression
Block Multiplication
§ I1) Block method
1. Group elements together into one line
2. Storage for an element: 2 log n bits -> 2 log b bits (b: block width, n: # of nodes)
3. Adjust the MapReduce code (block multiplication)
(figure: the matrix is partitioned into b x b blocks and the vector into length-b sub-vectors; source: Faloutsos and Kang, SIGMOD'12)
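The block method can be sketched as a block-wise matrix-vector multiply: each b x b matrix block paired with a length-b sub-vector is one independent unit of work, and partial results are summed per block row. A dense single-machine toy (made-up matrix, for illustration only):

```python
def block_mv(M, v, b):
    # Multiply an n x n matrix (list of rows) by a vector, block by block.
    n = len(v)
    result = [0.0] * n
    for bi in range(0, n, b):                 # block row
        for bj in range(0, n, b):             # block column
            # Block M[bi:bi+b, bj:bj+b] times sub-vector v[bj:bj+b];
            # on Hadoop each such pair would be one mapper's unit of work.
            for i in range(bi, min(bi + b, n)):
                for j in range(bj, min(bj + b, n)):
                    result[i] += M[i][j] * v[j]
    return result

M = [[1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 0, 0]]
v = [1.0, 2.0, 3.0, 4.0]
print(block_mv(M, v, b=2))
# [3.0, 3.0, 6.0, 1.0]
```

Grouping per block also explains the storage saving on the slide: inside a block, an element's coordinates only need log b bits each instead of log n.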
Clustering and Compression
§ I2) Clustering
– Preprocess the matrix so that nonzeros cluster into a few dense blocks; only nonempty blocks are stored in HDFS
§ I3) Compression
– Compress the clustered blocks (e.g. ZIP)
(figures: clustering and compression of the adjacency matrix; source: Faloutsos and Kang, SIGMOD'12)
Performance
Method | Block Encoding? | Compression? | Clustering?
RAW    | No              | No           | No
NNB    | Yes             | No           | No
NCB    | Yes             | Yes          | No
CCB    | Yes             | Yes          | Yes
(figure: 43x smaller storage, 9.2x faster; source: U Kang, 2013)
Many Extensions and Optimizations
§ Local queries
§ Diagonal block iterations
§ ...
Other systems
§ GraphLab (CMU/UW)
§ Pregel (Google)
§ Shark/Spark (Berkeley)
§ ...
§ HOT area