Michael J. Bommarito II, Daniel Martin Katz. Advanced Network Analysis Methods: Community Detection

Advanced Methods in Network Science: Community Detection Algorithms


Definition – Simple Version

• Broadly: "a group of nodes that are relatively densely connected to each other but sparsely connected to other dense groups in the network"
  ◦ Porter, Onnela, Mucha. Communities in Networks. Notices of the AMS, 2009.

• Examples:
  ◦ Cliques in a high school social network
  ◦ Voting coalitions in Congress
  ◦ Consumer types in a network of co-purchases


Example – Social Networks

Imagine this graph…

Vertices: People
Edges: Friendship

What factors might affect the formation of friendships in a high school social network?

Ideas: Age, Gender, Class, Race, Interests

How might we assign communities to this network?

[Figure: the same graph with two communities labeled Girls and Boys.]

Example – Voting Coalitions

Vertices: People
Edges: Co-voted at least once

Now let's look at the same network as if it represented co-voting in the Senate.

Ideas: Issue position, geography, ethnicity, gender

How might we assign communities to this network?

[Figure: the same graph with three communities labeled Republicans, Democrats, and Independents.]

Context!

Note that we have assigned community membership differently despite observing the same graph!

Community detection is not a concept that can be divorced from context.


Directedness

[Figure: an undirected graph and a directed graph.]

Many methods do not incorporate direction!

Many methods that do incorporate direction do not allow for bidirected edges.

Different software packages may implement the same "method" with or without support for directed edges.


Weights

Unweighted
• Binary relationships
• Data limitations

Weighted
• Relationship strength
• Frequency of relationship
• Flow

[Figure: unweighted and weighted versions of a graph. Note edge thickness.]

Many methods do not incorporate edge weights!

Methods that do incorporate edge weights may differ in acceptable values!
• Integers or real weights
• Strictly positive weights

Different software packages may implement the same "method" with or without support for weighted edges.


Resolution

Resolution is a concept inherited from optics. According to Wikipedia: "Optical resolution describes the ability of an imaging system to resolve detail in the object that is being imaged."

High resolution
• Can make out many details! (15.1MP)
• But…
  ◦ Details may be noise
  ◦ Sometimes they don't matter!

Low resolution
• Can't read a word!
• But…
  ◦ Can focus on broad regions
  ◦ Noise is out of focus

[Figure: the same graph shown at high resolution (microscopic) and at low resolution (macroscopic).]

Different hypotheses or questions correspond to different resolutions.

Different methods are more or less effective at detecting community structure at different resolutions.

Modularity-based methods cannot detect structure below a known resolution limit.


Overlapping Communities

Palla, Derenyi, Farkas, Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 2005.


Computational Complexity Refresher

Computational complexity is a serious issue!

Data is becoming more abundant and more detailed.

Many quantitative research projects hinge on the feasibility of calculations.

Understanding computational complexity can allow you to communicate with department IT personnel or computer scientists to solve your problem.

Make sure your project is feasible before committing the time!


Computational complexity in the context of modern computing is primarily focused on two resources:

1. Time: How long does it take to perform a sequence of operations?
   • CPU/GPU
   • Exact vs. approximate solutions

2. Storage: How much space does it take to store our problem?
   • Memory and "persistent" storage (to a lesser degree)
   • Data representations

We tend to communicate time and storage complexity through "Big-O notation."


In computational complexity, "Big-O notation" conveys information about how time and storage costs scale with inputs.

• O(1): constant, independent of input
• O(n): scales linearly with the size of input
• O(n^2): scales quadratically with the size of input
• O(n^3): scales cubically with the size of input

These terms often occur with log n terms and are then given the prefix "quasi-" (for example, O(n log n) is quasilinear).

For graph algorithms, the input n is typically
• |V|, the number of vertices
• |E|, the number of edges
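To make the growth rates above concrete, here is a minimal sketch that tabulates rough operation counts for a few input sizes. The function name `estimated_operations` and the chosen sizes are illustrative assumptions, and constant factors are ignored, so the numbers only indicate orders of magnitude.

```python
import math

# Growth rates named above, written as functions of the input size n.
# Constant factors are deliberately ignored (assumption of this sketch).
GROWTH = {
    "O(1)": lambda n: 1,
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
    "O(n^3)": lambda n: n ** 3,
}


def estimated_operations(n):
    """Return a rough operation count for each growth rate at input size n."""
    return {name: rate(n) for name, rate in GROWTH.items()}


# Hypothetical graph sizes: a small club, a mid-sized network, a large one.
for n in (34, 10_000, 1_000_000):
    counts = estimated_operations(n)
    print(n, {name: f"{count:.3g}" for name, count in counts.items()})
```

Reading across a row shows why a cubic method that is comfortable on a few dozen vertices can become infeasible on a large network.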

       


Taxonomy of Methods

This taxonomy of methods follows the history of their development.

• Divisive Methods
  ◦ Edge-betweenness (2002)
• Modularity Methods
  ◦ Fast-greedy (2004)
  ◦ Leading Eigenvector (2006)
• Dynamic Methods
  ◦ Clique percolation (2005)
  ◦ Walktrap (2005)


Edge Betweenness

Publication(s): Girvan, Newman. Community structure in social and biological networks. PNAS, 2002.

Basic Idea: Divide the network into successively smaller pieces by finding and removing edges that "bridge" communities.

Constraints:
• Can be adapted to directed networks (igraph).
• Can be adapted to weights (no public software).

Time Complexity: O(|E|^2 |V|) in general, which is O(|V|^3) on sparse networks; O(|V|^2 log |V|) for special cases
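The divisive idea can be sketched in plain Python. This is a simplified illustration, not the authors' or igraph's implementation: it assumes an undirected, unweighted graph without self-loops stored as a dict of neighbor sets, computes edge betweenness with a Brandes-style breadth-first search, and removes the highest-scoring edge until the graph splits. The names `edge_betweenness`, `components`, and `split_once`, and the toy graph, are all hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an undirected, unweighted graph (dict of sets)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every undirected path was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def split_once(adj):
    """Remove highest-betweenness edges until the component count grows."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph: two triangles joined by the single "bridge" edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(sorted(part) for part in split_once(toy)))
```

Recomputing betweenness after every removal is what drives the cost quoted above; calling `split_once` again on each resulting piece would continue the division.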


Quick Aside – Zach's Karate Club

Zachary's Karate Club: Social network of friendships between 34 members of a karate club at a US university in the 1970s.

Event: During the observation period, the club broke into 2 smaller clubs. This split occurred along a pre-existing social division between the two "communities" in the network.

Drawn from the Paper: Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33, 1977.

Download the Data: http://www-personal.umich.edu/~mejn/netdata/

     

 

   


Edge Betweenness

[Figure: edge-betweenness result, with one vertex annotated "Only misclassification".]

Betweenness tends to get the big picture right.

However, resolution can be a problem!

Do not draw conclusions about small communities from this algorithm alone.


Modularity

$$Q = \sum_{i} \left[ \frac{e_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right]$$

• e_i is the number of edges in module i
• d_i is the total degree of vertices in module i
• m is the total number of edges in the network

Q is the difference between the observed connectivity within modules and its expected value (EV) under the configuration model (degree distribution fixed).
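As a worked illustration of these definitions, the sketch below computes Q in the standard form Q = sum over modules i of [e_i/m - (d_i/(2m))^2] for an undirected, unweighted edge list. The function name `modularity`, the `membership` dictionary, and the toy graph are assumptions made for the example.

```python
def modularity(edges, membership):
    """Q = sum_i [ e_i/m - (d_i/(2m))^2 ] for an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # e_i: edges with both endpoints in module i
    degree = {}    # d_i: total degree of the vertices in module i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by one edge (m = 7).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {v: "all" for v in range(6)}

print(modularity(edges, by_triangle))  # each triangle: e_i = 3, d_i = 7
print(modularity(edges, one_module))   # a single module: e = 7, d = 14
```

For the two-triangle partition each module contributes 3/7 - (7/14)^2, so Q = 5/14, while placing every vertex in one module gives Q = 0.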


Remember our previous discussion on computational complexity?

Modularity maximization is an NP-hard problem.

This means that no polynomial-time algorithm for finding the exact maximum is known, and none exists unless P = NP!

All practical methods therefore try to solve for approximate solutions.

Benjamin H. Good, Yves-Alexandre de Montjoye & Aaron Clauset, The Performance of Modularity Maximization in Practical Contexts, Phys. Rev. E 81, 046106 (2010)

Fast Greedy

Publication(s):
• Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 2004.
• Clauset, Newman, Moore. Finding community structure in very large networks. Phys. Rev. E, 2004.
• Wakita, Tsurumi. Finding Community Structure in Mega-scale Social Networks. 2007.

Basic Idea: Greedily assemble larger and larger communities from the ground up. Start by placing each vertex in its own community and then, at each step, merge the pair of communities that produces the best modularity.

Constraints:
• Can be adapted to directed edges (no public software).
• Can be adapted to weights (igraph).

Time Complexity: O(|E||V| log |V|) worst case
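The agglomerative idea can be sketched as follows. This is a naive illustration that recomputes modularity for every candidate merge, so it does not have the running time of the published algorithm; it assumes an undirected, unweighted edge list with integer vertex labels. The names `modularity` and `greedy_communities` are hypothetical.

```python
def modularity(edges, membership):
    """Q = sum_i [ e_i/m - (d_i/(2m))^2 ] for an undirected, unweighted graph."""
    m = len(edges)
    internal, degree = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def merged(membership, keep, drop):
    """Return a copy of membership with community `drop` folded into `keep`."""
    return {v: (keep if c == drop else c) for v, c in membership.items()}


def greedy_communities(edges):
    """Merge communities greedily until no merge improves modularity."""
    vertices = sorted({v for edge in edges for v in edge})
    membership = {v: v for v in vertices}  # every vertex starts alone
    best_q = modularity(edges, membership)
    while True:
        # Candidate merges: pairs of communities joined by at least one edge.
        pairs = set()
        for u, v in edges:
            cu, cv = membership[u], membership[v]
            if cu != cv:
                pairs.add((min(cu, cv), max(cu, cv)))
        best_pair = None
        for keep, drop in sorted(pairs):
            q = modularity(edges, merged(membership, keep, drop))
            if q > best_q + 1e-12:  # require a strict improvement
                best_q, best_pair = q, (keep, drop)
        if best_pair is None:
            return membership, best_q
        membership = merged(membership, *best_pair)


# Toy graph: two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
membership, q = greedy_communities(edges)
print(membership, round(q, 4))
```

Because every step commits to the locally best merge, an early decision is never revisited, which is one reason results can differ from other methods.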

Fast-Greedy also tends to aggressively create larger communities to the detriment of smaller communities.

[Figure annotation: "Why is this node red instead of blue?"]


Leading Eigenvector

Publication(s):
• Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 2006.
• Leicht, Newman. Community structure in directed networks. Phys. Rev. Lett., 2008.

Basic Idea: Use the signs of the components of the leading eigenvector of the modularity matrix to sequentially divide the network.

Constraints:
• Can be adapted to directed edges (no public software).
• Can be adapted to weights (igraph).

Time Complexity: O(|V|^2)
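A single division step can be sketched with plain power iteration. This is a simplified illustration under stated assumptions: vertices are labeled 0..n-1, the graph is undirected and unweighted, the modularity matrix is taken as B_ij = A_ij - k_i k_j / (2m), and only one bisection is performed (the full method keeps dividing each part). The name `leading_eigenvector_split` and the fixed iteration count are hypothetical choices.

```python
def leading_eigenvector_split(edges, n, iterations=500):
    """Bisect vertices 0..n-1 by the signs of B's leading eigenvector."""
    degree = [0] * n
    adjacency = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adjacency[u][v] = adjacency[v][u] = 1.0
        degree[u] += 1
        degree[v] += 1
    two_m = sum(degree)
    # Modularity matrix: B_ij = A_ij - k_i * k_j / (2m).
    B = [[adjacency[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shift by the largest absolute row sum so all eigenvalues of the shifted
    # matrix are non-negative; power iteration then converges to the
    # eigenvector belonging to the largest eigenvalue of B itself.
    shift = max(sum(abs(x) for x in row) for row in B)
    x = [float(i + 1) for i in range(n)]  # deterministic, non-uniform start
    for _ in range(iterations):
        y = [sum(B[i][j] * x[j] for j in range(n)) + shift * x[i]
             for i in range(n)]
        norm = max(abs(value) for value in y)
        x = [value / norm for value in y]
    positive = {i for i in range(n) if x[i] > 0}
    return positive, set(range(n)) - positive


# Toy graph: two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(leading_eigenvector_split(edges, 6))
```

A production implementation would use a sparse eigensolver and would stop dividing a part once its leading eigenvalue is no longer positive.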

Note that the leading eigenvector's results seem to split the difference between edge betweenness and fast-greedy in this case.

[Figure annotation: "Why are these nodes not a part of the larger modules?"]


Walktrap

Publication(s): Pons, Latapy. Computing communities in large networks using random walks. JGAA, 2006.

Basic Idea: Use many short random walks on the network to compute pairwise similarity measures between vertices. Use these similarity values to aggregate vertices into communities.

Constraints:
• Can be adapted to directed edges (igraph).
• Can be adapted to weights (igraph).
• Can alter resolution by walk length (igraph).

Time Complexity: depends on walk length, O(|V|^2 log |V|) typically
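The similarity measure behind this idea can be sketched as below. This is a simplified illustration, not the published algorithm in full: it assumes an undirected, unweighted graph with no isolated vertices, computes the t-step walk distributions exactly rather than by simulation, and compares two vertices with a degree-normalized distance between their distributions. The names `walk_distributions` and `walk_distance` and the walk length of 3 are hypothetical choices.

```python
import math


def walk_distributions(adj, t):
    """dist[i][j] = probability that a t-step random walk from i ends at j."""
    vertices = sorted(adj)
    dist = {i: {j: 1.0 if i == j else 0.0 for j in vertices} for i in vertices}
    for _ in range(t):
        step = {}
        for i in vertices:
            row = {j: 0.0 for j in vertices}
            for k, p in dist[i].items():
                if p:
                    # From k, move to each neighbor with equal probability.
                    for j in adj[k]:
                        row[j] += p / len(adj[k])
            step[i] = row
        dist = step
    return dist


def walk_distance(adj, dist, i, j):
    """Degree-normalized L2 distance between two walk distributions."""
    return math.sqrt(sum((dist[i][k] - dist[j][k]) ** 2 / len(adj[k])
                         for k in adj))


# Toy graph: two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
dist = walk_distributions(toy, t=3)
print("same triangle:", walk_distance(toy, dist, 0, 1))
print("across bridge:", walk_distance(toy, dist, 0, 5))
```

Vertices in the same triangle end up with similar walk distributions and therefore a small distance; an agglomerative step would merge the closest groups first. Changing `t` changes how far the walks spread, which is the resolution knob mentioned in the constraints.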


Walktrap assigns vertices to different communities than previous algorithms.

Note that the walk length can be changed to alter resolution.

Furthermore, implementations that estimate walk probabilities by simulation are stochastic, and thus results may change even after fixing the walk length and input graph!


Method Comparison

[Figure: four panels comparing Edge-Betweenness, Fast-Greedy, Leading Eigenvector, and Walktrap.]


Recommended Software - igraph

• Core Library: C
• Interfaces: Python, R, Ruby
• Features: Graph operations & algorithms, random graph generation, graph statistics, community detection, visualization layout, plotting
• URL: http://igraph.sourceforge.net/
• Documentation: http://igraph.sourceforge.net/documentation.html

 


Example  Python  Source  Code  


Frontiers of Community Detection: Temporal Network Dynamics

Gergely Palla, Albert-Laszlo Barabasi & Tamas Vicsek, Quantifying Social Group Evolution, Nature 446:7136, 664-667 (2007)


Frontiers of Community Detection: Community Structure Over Scales, Time Period, etc.

Science 14 May 2010, Vol. 328. no. 5980, pp. 876 - 878


Community Detection Review Articles

Some Useful Review Articles:

• Mason A. Porter, Jukka-Pekka Onnela and Peter J. Mucha. 2009. "Communities in Networks." Notices of the American Mathematical Society 56: 1082-1166.
• Santo Fortunato. 2010. "Community detection in graphs." Physics Reports 486: 75-174.


A Transition to Our Sink Method Paper

• Provide a very brief introduction to the Exponential Random Graph Models (p*)
• Now we are going to transition to a specific project, where we apply some of the ideas contained herein

 


Our Sink Paper – Physica A

Dynamic Acyclic Digraphs

• We are interested in conducting community detection in the special case of dynamic acyclic digraphs…
• Before we transition to the full presentation, some background:
  ◦ Dynamic = Changing both locally and globally
  ◦ Digraph = Directed graph
  ◦ Acyclic = No cycles, because current documents generally cannot cite documents in the future

 


Case-to-case judicial citation networks are dynamic acyclic digraphs.

So are academic citation networks, patents, etc.
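Whether a given citation data set really is acyclic can be checked directly. The sketch below uses Kahn's topological-sort algorithm on a list of (citing, cited) pairs; the function name `is_acyclic` and the document labels are hypothetical, chosen only for illustration.

```python
from collections import deque


def is_acyclic(citations):
    """Kahn's algorithm: True if the directed citation graph has no cycle."""
    out = {}       # document -> documents it cites
    indegree = {}  # document -> number of documents citing it
    for citing, cited in citations:
        out.setdefault(citing, []).append(cited)
        out.setdefault(cited, [])
        indegree[cited] = indegree.get(cited, 0) + 1
        indegree.setdefault(citing, 0)
    # Start from documents that nothing cites and peel the graph layer by layer.
    queue = deque(doc for doc, count in indegree.items() if count == 0)
    visited = 0
    while queue:
        doc = queue.popleft()
        visited += 1
        for target in out[doc]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    # Any document left unvisited lies on (or behind) a cycle.
    return visited == len(indegree)


# Hypothetical cases: later opinions cite earlier ones.
chronological = [("Case C", "Case B"), ("Case C", "Case A"), ("Case B", "Case A")]
mutual = [("Case A", "Case B"), ("Case B", "Case A")]
print(is_acyclic(chronological), is_acyclic(mutual))
```

Real data sets occasionally contain mutual citations (for example, opinions issued on the same day), which is why the definition above says "generally"; a check like this flags such cases before applying methods that assume acyclicity.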