CS 6604: Data Mining Large Networks and Timeseries B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis

CS 6604: Data Mining Large … people.cs.vt.edu/~badityap/classes/cs6604-Fall13/lectures/lecture... GIM-V for CC. Prakash 2013, CS 6604: DM Large Networks & Time-Series


What to do if data is really large?

• Petabytes (exabytes, zettabytes, ...)
• Google processed 24 PB of data per day (2009)
• FB adds 0.5 PB per day

Prakash  2013   CS  6604:  DM  Large  Networks  &  Time-­‐Series   2  


BIG  data  

Single vs Cluster

How to analyze such large datasets? First thing: how to store them?

• Single machine? 4 TB HDDs are coming out
• Cluster of machines?
  – How many machines?
  – Need to handle machine and drive failure. Really?
  – Need data backup, redundancy, recovery, etc.

3% of 100,000 hard drives fail within the first 3 months.
Source: Failure Trends in a Large Disk Drive Population, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

Hadoop

• Open-source software for reliable, scalable, distributed computing
• Written in Java
• Scales to thousands of machines
  – Linear scalability: ideally, with 2 machines your job runs about twice as fast
• Uses a simple programming model (MapReduce)
• HDFS (Hadoop Distributed File System) is fault tolerant
  – Can recover from machine/disk failure (no need to restart computation)

http://hadoop.apache.org

Why Hadoop?

• Many research groups/projects use it
• Fortune 500 companies use it
• Low cost to set up and pick up
• It's FREE!!

Map-Reduce [Dean and Ghemawat 2004]

• Abstraction for simple computing
  – Hides details of parallelization, fault tolerance, data balancing

Programming Model

• Input: key/value pairs
• Output: key/value pairs
• User has to specify TWO functions: map() and reduce()
  – map(): takes an input pair and produces k intermediate key/value pairs
  – reduce(): given an intermediate key and the set of values for that key, outputs another key/value pair

Master-Slave Architecture: Phases

1. Map phase: divide data and computation into smaller pieces; each machine ('mapper') works on one piece in parallel.
2. Shuffle phase: master sorts and moves results to reducers.
3. Reduce phase: machines ('reducers') combine results in parallel.

Example: Histogram of Fruit names

[Figure: Map 0, Map 1 and Map 2 read the input from HDFS and emit pairs such as (apple, 1), (apple, 1), (strawberry, 1), (orange, 1); after the Shuffle, Reduce 0 and Reduce 1 write (apple, 2), (orange, 1), (strawberry, 1) to HDFS]

map( fruit ) {
    output(fruit, 1);
}

reduce( fruit, v[1..n] ) {
    sum = 0;
    for (i = 1; i <= n; i++)
        sum = sum + v[i];
    output(fruit, sum);
}

Source: U Kang, 2013 (CS760)
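The pseudocode above can be tried out on one machine. The following is only a sketch in Python, not Hadoop code: `run_job`, `map_fruit`, `reduce_fruit` and the `n_mappers` parameter are hypothetical names, and the three phases are simulated in memory.

```python
from collections import defaultdict

def map_fruit(fruit):
    """map(): emit one (fruit, 1) pair per input record."""
    yield (fruit, 1)

def reduce_fruit(fruit, counts):
    """reduce(): sum all counts that share the same key."""
    return (fruit, sum(counts))

def run_job(records, n_mappers=3):
    # Map phase: split the input into pieces, one piece per simulated mapper.
    pieces = [records[i::n_mappers] for i in range(n_mappers)]
    intermediate = [pair for piece in pieces for rec in piece for pair in map_fruit(rec)]
    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce() call per distinct key, in sorted key order.
    return [reduce_fruit(key, groups[key]) for key in sorted(groups)]

histogram = run_job(["apple", "orange", "apple", "strawberry"])
print(histogram)  # [('apple', 2), ('orange', 1), ('strawberry', 1)]
```

In a real cluster the pieces would live on different machines and the shuffle would move data over the network; here everything is a Python list.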

Map-Reduce (MR) as SQL

• SELECT fruit_name, COUNT(*) FROM FRUITS GROUP BY fruit_name
  – Mapper: picks the key (GROUP BY fruit_name)
  – Reducer: aggregates the values per key (COUNT(*))

More examples

• Count of URL access frequency
  – map(): output <URL, 1>
  – reduce(): output <URL, total_count>
• Reverse Web-link graph
  – map(): output <target, source> for each target link in a source web page
  – reduce(): output <target, list[source]>
• TF-IDF of a document corpus
  – map(): ?
  – reduce(): ?
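A minimal sketch of the reverse web-link graph example, assuming the input is a dictionary from each source page to the list of pages it links to. The function names and the page names are made up for illustration, and the shuffle is simulated in memory.

```python
from collections import defaultdict

def map_links(source, targets):
    """map(): for each link found on `source`, emit <target, source>."""
    for target in targets:
        yield (target, source)

def reduce_links(target, sources):
    """reduce(): collect every page that links to `target`."""
    return (target, sorted(sources))

def reverse_web_link_graph(pages):
    groups = defaultdict(list)  # shuffle: group sources by target
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            groups[target].append(src)
    return dict(reduce_links(t, s) for t, s in groups.items())

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"], "c.html": []}
reversed_graph = reverse_web_link_graph(pages)
print(reversed_graph)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```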

Architecture

[Figure: Execution Overview. Source: U Kang (2013), CS760]

Master

• For each map and reduce task, stores
  – The state (idle, in-progress, completed)
  – The identity of the worker machine (for non-idle tasks)

Fault Tolerance

• Master pings each worker periodically
• If the response times out, the worker is marked as failed
• An MR task on a failed worker is eligible for rescheduling
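The bookkeeping behind these three bullets can be sketched as follows. This is a toy model, not Hadoop's or Google's implementation: the `Master` class, its method names and the explicit `now` timestamps are assumptions, and only in-progress tasks are reset (a simplification).

```python
class Master:
    """Toy master that tracks worker liveness and task state."""

    def __init__(self, timeout):
        self.timeout = timeout   # time without a reply before a worker counts as failed
        self.last_reply = {}     # worker -> time of the last ping reply
        self.tasks = {}          # task -> (state, worker)

    def assign(self, task, worker, now):
        self.tasks[task] = ("in-progress", worker)
        self.last_reply.setdefault(worker, now)

    def ping_reply(self, worker, now):
        self.last_reply[worker] = now

    def check_workers(self, now):
        """Mark timed-out workers as failed and reset their tasks to idle."""
        failed = {w for w, t in self.last_reply.items() if now - t > self.timeout}
        for task, (state, worker) in list(self.tasks.items()):
            if worker in failed and state == "in-progress":
                self.tasks[task] = ("idle", None)  # eligible for rescheduling
        return failed

master = Master(timeout=10)
master.assign("map-0", "worker-A", now=0)
master.assign("map-1", "worker-B", now=0)
master.ping_reply("worker-A", now=12)   # worker-B never replies
failed = master.check_workers(now=15)
print(failed)                 # {'worker-B'}
print(master.tasks["map-1"])  # ('idle', None)
```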

Locality

• GFS (Google File System)
  – 3-way replication of data
• MASTER
  – Has location information of the input files
  – Schedules a map task on a machine with a replica

Backup Tasks (Speculative Execution)

• The problem of 'stragglers'
• Solution
  – When an MR operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
  – A task is marked as completed whenever either the primary or the backup execution completes
• 44% faster (in [Dean and Ghemawat 2004], the sort benchmark takes 44% longer with backup tasks disabled)

Refinements

• Partitioning function
• Ordering guarantee
• Combiner function
• Status information
• Counter

Partitioning Function

• Decides which mapper outputs are assigned to which reducer
• Default:
  – Hashing: hash(key) mod R
    • R = # of reducers
• Other options
  – Range partition
    • When?
  – Identity partition
    • When?
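A small sketch of two of these partitioners, assuming string keys. The function names and the boundary list are hypothetical; `zlib.crc32` stands in for "hash" only because Python's built-in `hash()` of a string changes between runs.

```python
import bisect
import zlib

def hash_partition(key, num_reducers):
    """Default scheme: hash(key) mod R."""
    # crc32 gives the same value in every process, unlike the built-in str hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def range_partition(key, boundaries):
    """Range scheme: reducer i gets the keys below boundaries[i]; the last reducer gets the rest."""
    return bisect.bisect_right(boundaries, key)

keys = ["apple", "orange", "strawberry", "apple"]
hash_assignment = [hash_partition(k, 2) for k in keys]
range_assignment = [range_partition(k, ["m"]) for k in keys]  # 2 reducers: keys < "m", keys >= "m"
print(range_assignment)  # [0, 1, 1, 0]
```

One possible reason to prefer the range scheme: every key on reducer 0 sorts before every key on reducer 1, so concatenating the reducer outputs yields globally sorted output. With hashing, equal keys still meet at the same reducer, but there is no order across reducers.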

Ordering Guarantee

• Within a given partition, the intermediate key/value pairs are processed in increasing key order

Combiners (~ Reducers)

• When to use?
  – Repetition in intermediate keys produced by map
  – User-specified reduce() is
    • Commutative (e.g. x * y = y * x)
    • Associative (e.g. (x * y) * z = x * (y * z))
  – E.g. color count
• When not to use?
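To make the commutative/associative condition concrete, here is a sketch that applies the same function first per mapper (as a combiner) and then globally (as the reducer). The `combine` helper and the data are made up for illustration; the mean is used as one example of a function that does not satisfy the condition.

```python
from collections import defaultdict

def combine(pairs, fn):
    """Group (key, value) pairs by key and fold each group's values with fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, fn(values)) for key, values in sorted(groups.items())]

def mean(values):
    return sum(values) / len(values)

# Color count: intermediate pairs produced by two simulated mappers.
mapper_outputs = [
    [("red", 1), ("red", 1), ("blue", 1)],
    [("red", 1), ("blue", 1), ("blue", 1), ("blue", 1)],
]
all_pairs = [p for out in mapper_outputs for p in out]

# Sum is commutative and associative: combining locally first gives the same result.
direct_sum = combine(all_pairs, sum)
combined_sum = combine([p for out in mapper_outputs for p in combine(out, sum)], sum)

# Mean is not associative: a mean of local means is generally wrong.
values_per_mapper = [[("x", 1), ("x", 2), ("x", 3)], [("x", 10)]]
direct_mean = combine([p for out in values_per_mapper for p in out], mean)
combined_mean = combine([p for out in values_per_mapper for p in combine(out, mean)], mean)

print(direct_sum, combined_sum)    # [('blue', 4), ('red', 3)] [('blue', 4), ('red', 3)]
print(direct_mean, combined_mean)  # [('x', 4.0)] [('x', 6.0)]
```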

Status information

• Master has an internal HTTP server
  – Exports status web pages

Counter

• Global variable that only increases
• Useful for
  – E.g. making sure that the # of output pairs = # of input pairs

Performance: Grep

[Figure: Grep performance. Source: U Kang (2013), CS760]

Performance: Sort

[Figure: Sort performance. Source: U Kang (2013), CS760]

MR @ Google

[Two figures: MapReduce in Google. Source: U Kang (2013), CS760]

GRAPH ANALYSIS ON HADOOP

Based on slides from Prof. U. Kang

Problem Definition

• Given: a BIG graph G(V, E)
• Compute:
  – Connected Components
  – Diameter
  – PageRank
  – ....
• Solution: Use Hadoop

GIM-V: One unified framework [Kang+, 2008]

Example: GIM-V At Work

• Connected Components on YahooWeb: |V| = 1.4B, |E| = 6.6B

[Figure: connected component Size vs. Count. Source: U Kang (2013), CS760]

• 300-size cmpt × 500. Why?
• 1100-size cmpt × 65. Why?

Main Idea

• GIM-V
  – Generalized Iterative Matrix-Vector Multiplication
  – Extension of plain matrix-vector multiplication
  – Includes as special cases
    • Connected Components
    • PageRank
    • RWR
    • Diameters
    • ....

Main Idea: Intuition

• Plain M-V multiplication: M × v = v'
  – Weighted combination of colors
  – ~ Message passing

$v'_4 = \sum_{i=1}^{4} m_{4,i} v_i$

[Figure: a row of M with weights 1, 1, 0.1 combined with v to give one entry of v'. Source: U Kang (2013), CS760]

Main Idea

• Plain M-V multiplication: M × v = v', with $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$ (n = 4 in the figure)
• Three implicit operations here:
  – combine2: multiply m_{i,j} and v_j (message sending)
  – combineAll: sum n multiplication results (message combination)
  – assign: update v'_i

Source: U Kang (2013), CS760
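One way to see the three operations is to write matrix-vector multiplication with each of them passed in as a function. This is a sketch under assumptions: the `gimv` signature is illustrative (not the API of any library), the matrix is a dense list of rows, and zero entries are skipped as if the matrix were sparse.

```python
def gimv(matrix, vector, combine2, combine_all, assign):
    """One generalized step: v'_i = assign(v_i, combine_all(i, [combine2(m_ij, v_j) ...]))."""
    new_vector = []
    for i, row in enumerate(matrix):
        # combine2: one partial result ("message") per nonzero entry m_ij
        messages = [combine2(m_ij, vector[j]) for j, m_ij in enumerate(row) if m_ij != 0]
        # combineAll merges the messages for row i; assign decides the new v_i
        new_vector.append(assign(vector[i], combine_all(i, messages)))
    return new_vector

# Plain M-V multiplication: multiply, sum, overwrite.
M = [[1, 2],
     [3, 4]]
v = [1, 1]
result = gimv(
    M, v,
    combine2=lambda m_ij, v_j: m_ij * v_j,
    combine_all=lambda i, messages: sum(messages),
    assign=lambda old, new: new,
)
print(result)  # [3, 7]
```

Swapping in other functions for the three arguments changes the algorithm without touching the loop.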

Main Idea

• GIM-V
  – Matrix represents edge(src, dst)
  – Vector represents some value of nodes ('node values/labels')
  – Customizing the three operations leads to many algorithms

Operation    Standard MV
combine2     Multiply
combineAll   Sum
assign       Assign

Other columns of the table: PageRank, RWR, Diameter, Con. Cmpt.

Source: U Kang (2013), CS760

GIM-V for Connected Components

Operation    Standard MV   Con. Cmpt.
combine2     Multiply      Bool. ×
combineAll   Sum           MIN
assign       Assign        MIN

Source: U Kang (2013), CS760

GIM-V for CC

• How many connected components?
  – == which node belongs to which component?

Input graph: nodes 1-8 in three components G1, G5, G7.

Output:

node id   component id
1         1
2         1
3         1
4         1
5         5
6         5
7         7
8         7

Source: U Kang (2013), CS760

GIM-V for CC

[Figure: adjacency matrix of the 8-node graph (components G1, G5, G7) applied to the node-label vector. Source: U Kang (2013), CS760]

init vector:  1 2 3 4 5 6 7 8
final vector: 1 1 1 1 5 5 7 7

How do we get from the init vector to the final vector?

GIM-V for CC

Each node sends its label to its neighbors ("Sending Invitations") and keeps the smallest label it sees ("Accept the Smallest"):

node 1: min(1, min(2))
node 2: min(2, min(1,3))
node 3: min(3, min(2,4))
node 4: min(4, min(3))
node 5: min(5, min(6))
node 6: min(6, min(5))
node 7: min(7, min(8))
node 8: min(8, min(7))

node id:            1 2 3 4 5 6 7 8
init vector:        1 2 3 4 5 6 7 8
after 1 iteration:  1 1 2 3 5 5 7 7
after 2 iterations: 1 1 1 2 5 5 7 7
after 3 iterations: 1 1 1 1 5 5 7 7

• 1 GIM-V with MIN = find min. node ids within 1 hop
• k GIM-V with MIN = find min. node ids within k hops
• Max. iterations = diameter

Source: Faloutsos and Kang (CMU), SIGMOD'12
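The iteration can be replayed in a few lines. This is a sketch, assuming the graph is given as an adjacency dictionary whose edges (1-2, 2-3, 3-4, 5-6, 7-8) are read off the min(...) expressions on this slide; the function names are illustrative.

```python
def gimv_min(neighbors, labels):
    """One GIM-V step with MIN: each node keeps the smallest label among itself and its neighbors."""
    return {
        node: min(labels[node], min((labels[nb] for nb in neighbors[node]), default=labels[node]))
        for node in neighbors
    }

def connected_components(neighbors):
    labels = {node: node for node in neighbors}  # init vector: every node is its own component
    iterations = 0
    while True:
        new_labels = gimv_min(neighbors, labels)
        if new_labels == labels:                 # nothing changed: converged
            return labels, iterations
        labels, iterations = new_labels, iterations + 1

neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5], 7: [8], 8: [7]}
labels, iterations = connected_components(neighbors)
print(labels)      # {1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 7, 8: 7}
print(iterations)  # 3
```

The three label-changing iterations match the diameter of the largest component (the path 1-2-3-4); one extra pass is needed only to detect that nothing changed.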

GIM-V

Operation    Standard MV   PageRank            RWR                     Diameter (approx.)    Con. Cmpt.
combine2     Multiply      Multiply with c     Multiply with c         Multiply bit-vector   Multiply
combineAll   Sum           Sum with rj prob.   Sum with restart prob   BIT-OR()              MIN
assign       Assign        Assign              Assign                  BIT-OR()              MIN

Source: Faloutsos and Kang (CMU), SIGMOD'12
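As an illustration of the PageRank column, the sketch below spells out the three operations inside a plain loop. Assumptions: the matrix is dense and column-normalized, c = 0.85 and 50 iterations are arbitrary illustrative choices, "rj prob." is read as the random-jump term (1 - c) / n, and the 3-node cycle is a made-up example.

```python
def pagerank_gimv(matrix, c=0.85, iterations=50):
    """PageRank as repeated GIM-V steps on a column-normalized matrix."""
    n = len(matrix)
    vector = [1.0 / n] * n
    for _ in range(iterations):
        new_vector = []
        for i in range(n):
            # combine2: multiply with c
            messages = [c * matrix[i][j] * vector[j] for j in range(n)]
            # combineAll: sum, plus the random-jump probability (1 - c) / n
            combined = (1 - c) / n + sum(messages)
            # assign: take the new value
            new_vector.append(combined)
        vector = new_vector
    return vector

# 3-node cycle 0 -> 1 -> 2 -> 0: matrix[i][j] = 1 / out_degree(j) if j links to i, else 0.
matrix = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
ranks = pagerank_gimv(matrix)
print(ranks)  # every node ends up with a rank of about 1/3
```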

HDFS restrictions

• [R1] HDFS is location transparent
  – Users don't know which machine has which file
• [R2] A line is never split
  – A large file is split into pieces of a given size (e.g. 256 MB)
  – Users don't know the point of split

Fast GIM-V

• Given R1 and R2, how to design faster algs. for GIM-V?
  1. Block Multiplication
  2. Clustering
  3. Compression

Block Multiplication

I1) Block-Method

1. Group elements together into 1 line
2. Storage for an element: 2 log n bits -> 2 log b bits
3. Adjust the MapReduce code (block multiplication)

b: block width, n: # of nodes

[Figure: 8 × 8 matrix and vector over nodes 1-8, split into blocks covering nodes 1-4 and 5-8. Source: Faloutsos and Kang (CMU), SIGMOD'12]
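The storage argument in step 2 can be checked with a small sketch. Assumptions: node ids are 0-based, an element is stored as a (row, column) pair, and the edge list reuses the 8-node connected-components example; the function names are illustrative.

```python
import math
from collections import defaultdict

def to_blocks(edges, b):
    """Group edges (i, j) into b x b blocks, storing only in-block offsets."""
    blocks = defaultdict(list)
    for i, j in edges:
        block_id = (i // b, j // b)              # which block the element falls in
        blocks[block_id].append((i % b, j % b))  # position inside the block
    return dict(blocks)

def bits_per_element(size):
    """Bits for one (row, column) pair when each index ranges over `size` values."""
    return 2 * math.ceil(math.log2(size))

edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4), (6, 7), (7, 6)]
blocks = to_blocks(edges, b=4)
raw_bits = bits_per_element(8)     # 2 log n with n = 8
block_bits = bits_per_element(4)   # 2 log b with b = 4
print(sorted(blocks))              # [(0, 0), (1, 1)]
print(raw_bits, block_bits)        # 6 4
```

Only two of the four blocks are non-empty here, and each element inside a block needs 4 bits instead of 6; the block id is stored once per block rather than once per element.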

Clustering and Compression

I2) Clustering
• Preprocess
  – A: preprocessing for clustering (only green blocks are stored in HDFS)

I3) Compression
• Compress
  – A: compress clustered blocks (ZIP)

[Figures: matrix blocks before and after clustering and compression. Source: Faloutsos and Kang (CMU), SIGMOD'12]

Performance

Method   Block Encoding?   Compression?   Clustering?
RAW      No                No             No
NNB      Yes               No             No
NCB      Yes               Yes            No
CCB      Yes               Yes            Yes

[Figure annotations: 43x smaller, 9.2x faster. Source: U Kang (2013), CS760]

Many extensions and optimizations

• Local queries
• Diagonal block iterations
• ....

Other systems

• GraphLab (CMU/UW)
• Pregel (Google)
• Shark/Spark (Berkeley)
• ....
• HOT area