Big Graph Data Science Lise Getoor University of California, Santa Cruz SF MLConf November 14, 2014

Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF


DESCRIPTION

Abstract: One of the challenges in big data analytics lies in being able to reason collectively about extremely large, heterogeneous, incomplete, noisy interlinked data. We need data science techniques that can represent and reason effectively with this form of rich and multi-relational graph data. In this presentation, I will describe some common collective inference patterns needed for graph data, including collective classification (predicting missing labels for nodes in a network), link prediction (predicting potential edges), and entity resolution (determining when two nodes refer to the same underlying entity). I will describe three key capabilities required: relational feature construction, collective inference, and scaling. Finally, I will briefly describe some of the cutting-edge analytic tools being developed within the machine learning, AI, and database communities to address these challenges.



Big Graph Data Science

Lise Getoor University of California, Santa Cruz

SF MLConf November 14, 2014


Data is not flat BIG

©2004-2013 lonnitaylor


Data is multi-modal, multi-relational, spatio-temporal, multi-media


NEED: Data Science for Graphs

New V: Vinculate


Patterns   Tools   Key Ideas



☐ Collective Classification  ☐ Link Prediction  ☐ Entity Resolution


Collective Classification: inferring the labels of nodes in a graph


Collective Classification

[Figure: social network with edges labeled spouse, colleague, and friend. Question: which of two possible labels does a node have?]


Collective Classification

[Figure: social network with edges labeled spouse, colleague, and friend; three nodes are marked "?" because their labels are unknown]


✔ Collective Classification  ☐ Link Prediction  ☐ Entity Resolution


Link Prediction: inferring the existence of edges in a graph


Link Prediction
• Entities
  – People, Emails
• Observed relationships
  – Communications, co-location
• Predict relationships
  – Supervisor, subordinate, colleague


✔ Collective Classification  ✔ Link Prediction  ☐ Entity Resolution


Entity Resolution: determining which nodes refer to the same underlying entity


Ironically, Entity Resolution has many duplicate names

Doubles, Duplicate detection, Record linkage, Deduplication, Object identification, Object consolidation, Coreference resolution, Entity clustering, Reference reconciliation, Reference matching, Householding, Household matching, Fuzzy match, Approximate match, Merge/purge, Hardening soft databases, Identity uncertainty


Entity Resolution & Network Analysis

[Figure: a network before and after entity resolution]

Tutorial on Entity Resolution in Big Data w/ Ashwin Machanavajjhala @ KDD13 (slides available)


✔ Collective Classification  ✔ Link Prediction  ✔ Entity Resolution


Graph Identification

• Goal:
  – Given an input graph, infer an output graph
• Three major components:
  – Entity Resolution (ER): Infer the set of nodes
  – Link Prediction (LP): Infer the set of edges
  – Collective Classification (CC): Infer the node labels
• Challenge: The components are intra- and inter-dependent

Namata, Kok, Getoor, KDD 2011



☐ Feature Construction  ☐ Collective Reasoning  ☐ Blocking


Key Idea #1: Relational Feature Construction


Key Idea: Feature Construction
• Feature informativeness is key to the success of a relational classifier
• Relational feature construction
  – Node-specific measures
    ◦ Aggregates: summarize attributes of relational neighbors
    ◦ Structural properties: capture characteristics of the relational structure
  – Node-pair measures: summarize properties of (potential) edges
    ◦ Attribute-based measures
    ◦ Edge-based measures
    ◦ Neighborhood similarity measures
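The node-specific and node-pair measures listed on this slide can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the toy graph, the `PARTY` attribute, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy graph: adjacency sets plus one observed attribute per node.
NEIGHBORS = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
PARTY = {"bob": "red", "cat": "red", "dan": "blue"}  # "ann" is unlabeled


def neighbor_label_fractions(node):
    """Aggregate: fraction of labeled neighbors carrying each observed label."""
    counts = Counter(PARTY[n] for n in NEIGHBORS[node] if n in PARTY)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()} if total else {}


def degree(node):
    """Structural property: number of relational neighbors."""
    return len(NEIGHBORS[node])


def jaccard(a, b):
    """Neighborhood similarity for a node pair (a candidate edge)."""
    union = NEIGHBORS[a] | NEIGHBORS[b]
    return len(NEIGHBORS[a] & NEIGHBORS[b]) / len(union) if union else 0.0


# Feature vector for one node and one similarity score for one node pair.
features_ann = {"degree": degree("ann"), **neighbor_label_fractions("ann")}
pair_bob_cat = jaccard("bob", "cat")
```

Features like these depend only on observed values, so they can be computed once and handed to any off-the-shelf classifier.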


Relational Classifiers
• Pros:
  – Efficient: Can handle large amounts of data
    ◦ Features can often be pre-computed ahead of time
  – Flexible: Can take advantage of well-understood classification/regression algorithms
  – One of the most commonly-used ways of incorporating relational information
• Cons:
  – Features are based on observed values/evidence, cannot be based on attributes or relations that are being predicted
  – Makes incorrect independence assumptions
  – Hard to impose global constraints on joint assignments
    ◦ For example, when inferring a hierarchy of individuals, we may want to enforce constraint that it is a tree


✔ Feature Construction  ☐ Collective Reasoning  ☐ Blocking


Collective Classification

[Figure: a person whose label is unknown ("?")]


Collective Classification

[Figure: a person marked "?" with "$" icons, a tweet, and a status update as evidence]

Donates(A, ) => Votes(A, ): 5.0

Mentions(A, “Affordable Health”) => Votes(A, ): 0.3


Collective Classification

vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8

vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3

[Figure: social network with edges labeled spouse, colleague, and friend]
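To make the effect of the two weighted rules concrete, here is a rough sketch that spreads soft vote beliefs through a small network by weighted neighbor averaging. This is a simple label-propagation style relaxation, not PSL inference; the edges, names, and observed votes are hypothetical, and only the weights 0.8 and 0.3 come from the slide.

```python
# Rule weights from the slide; no weight is given there for "colleague", so it is omitted.
WEIGHTS = {"spouse": 0.8, "friend": 0.3}

# Hypothetical edges (a, b, relation) and observed votes: 1.0 = votes for P, 0.0 = does not.
EDGES = [("ann", "bob", "spouse"), ("bob", "cat", "friend"), ("cat", "dan", "friend")]
OBSERVED = {"ann": 1.0, "dan": 0.0}


def propagate(edges, observed, weights, rounds=50):
    """Repeatedly set each unobserved belief to the weighted mean of its neighbors' beliefs."""
    nodes = {n for a, b, _ in edges for n in (a, b)}
    belief = {n: observed.get(n, 0.5) for n in nodes}
    for _ in range(rounds):
        updated = dict(belief)
        for node in nodes - set(observed):
            num = den = 0.0
            for a, b, rel in edges:
                if node in (a, b) and rel in weights:
                    other = b if node == a else a
                    num += weights[rel] * belief[other]
                    den += weights[rel]
            if den:
                updated[node] = num / den
        belief = updated
    return belief


beliefs = propagate(EDGES, OBSERVED, WEIGHTS)
```

Because the spouse rule carries more weight than the friend rule, the belief for `bob` ends up closer to his spouse's vote than the belief for `cat`, whose evidence arrives only through friends.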


Link Prediction
• People, emails, words, communication, relations
• Use model to express dependencies
  – “If email content suggests type X, it is of type X”
  – “If A sends deadline emails to B, then A is the supervisor of B”
  – “If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues”


[Figure: email text with the words “complete by” and “due” highlighted]

HasWord(E, “due”) => Type(E, deadline) : 0.6


Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B) : 0.8


Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞


Entity Resolution
• Entities
  – People References
• Attributes
  – Name
• Relationships
  – Friendship
• Goal: Identify references that denote the same person

[Figure: references A (name: John Smith) and B (name: J. Smith) with friend links among references C through H; "=" marks pairs of references that denote the same person]


Entity Resolution
• References, names, friendships
• Use model to express dependencies
  – “If two people have similar names, they are probably the same”
  – “If two people have similar friends, they are probably the same”
  – “If A=B and B=C, then A and C must also denote the same person”


A.name ≈{str_sim} B.name => A≈B : 0.8


{A.friends} ≈{} {B.friends} => A≈B : 0.6


A≈B ^ B≈C => A≈C : ∞
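The name-similarity rule and the hard transitivity rule can be approximated by thresholded pairwise matching followed by a union-find closure. The sketch below is a deterministic stand-in for the soft rules, not PSL itself: the 0.7 threshold, the references "C" and "D", and the use of `difflib.SequenceMatcher` as `str_sim` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical references; only "John Smith" and "J. Smith" appear on the slide.
NAMES = {"A": "John Smith", "B": "J. Smith", "C": "Jon Smith", "D": "Mary Jones"}


def name_similarity(x, y):
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def resolve(names, threshold=0.7):
    """Merge pairs above the threshold, then close the result under transitivity."""
    parent = {ref: ref for ref in names}

    def find(ref):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]
            ref = parent[ref]
        return ref

    for a, b in combinations(names, 2):
        if name_similarity(names[a], names[b]) >= threshold:
            parent[find(a)] = find(b)  # union: A≈B

    clusters = {}
    for ref in names:
        clusters.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(c) for c in clusters.values())


clusters = resolve(NAMES)
```

Union-find enforces transitivity by construction: once A≈B and B≈C are accepted, A and C share a representative without ever being compared directly.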


Challenges
• Collective Classification: labeling nodes in graph
  – Irregular structure, not a chain, not a grid
  – Challenge: One large partially labeled graph
• Link prediction: predicting edges in graph
  – Dependencies among edges
  – Don’t want to reason about all possible edges
  – Challenge: scaling & extremely skewed probabilities
• Entity resolution: determine nodes that refer to same entities in a graph
  – Dependencies between clusters
  – Challenge: enforcing constraints, e.g. transitive closure


✔ Feature Construction  ✔ Collective Reasoning  ☐ Blocking


Blocking: Motivation
• Naïve pairwise: |N|^2 pairwise comparisons
  – 1,000 business listings each from 1,000 different cities across the world
  – 1 trillion comparisons
  – 11.6 days (if each comparison is 1 µs)
• Mentions from different cities are unlikely to be matches
  – Blocking Criterion: City
  – 1 billion comparisons
  – 16 minutes (if each comparison is 1 µs)
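The figures on this slide follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the counts and the 1 µs cost per comparison are taken from the slide.

```python
LISTINGS_PER_CITY = 1_000
CITIES = 1_000
SECONDS_PER_COMPARISON = 1e-6  # 1 microsecond, as assumed on the slide

total_listings = LISTINGS_PER_CITY * CITIES

# Naive: the slide's |N|^2 count over all listings.
naive_comparisons = total_listings ** 2
naive_days = naive_comparisons * SECONDS_PER_COMPARISON / 86_400

# Blocked by city: compare listings only within the same city.
blocked_comparisons = CITIES * LISTINGS_PER_CITY ** 2
blocked_minutes = blocked_comparisons * SECONDS_PER_COMPARISON / 60

print(f"naive:   {naive_comparisons:.0e} comparisons, {naive_days:.1f} days")
print(f"blocked: {blocked_comparisons:.0e} comparisons, {blocked_minutes:.1f} minutes")
```

Blocking on city reduces the work by a factor equal to the number of cities, here 1,000.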


Blocking: Motivation
• Mentions from different cities are unlikely to be matches
  – May miss potential matches


Blocking: Motivation

[Figure: Venn diagram. Within the set of all pairs of nodes, the matching pairs of nodes overlap the pairs of nodes satisfying the blocking criterion]


Blocking Algorithms 1
• Hash based blocking
  – Each block C_i is associated with a hash key h_i.
  – Mention x is hashed to C_i if hash(x) = h_i.
  – Within a block, all pairs are compared.
  – Each hash function results in disjoint blocks.
• What hash function?
  – Deterministic function of attribute values
  – Boolean Functions over attribute values [Bilenko et al ICDM’06, Michelson et al AAAI’06, Das Sarma et al CIKM’12]
  – minHash (min-wise independent permutations) [Broder et al STOC’98]; locality sensitive hashing
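A minimal sketch of hash based blocking with a deterministic key over attribute values, assuming the city is the blocking attribute as in the earlier motivation slide. The mentions and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mentions: (id, business name, city).
MENTIONS = [
    (1, "Joe's Pizza", "Santa Cruz"),
    (2, "Joes Pizza", "Santa Cruz"),
    (3, "Joe's Pizza", "Boston"),
    (4, "Blue Cafe", "Boston"),
]


def block_key(mention):
    """Deterministic function of attribute values: here, the normalized city."""
    _, _, city = mention
    return city.strip().lower()


def candidate_pairs(mentions, key=block_key):
    """Hash each mention to exactly one block, then emit all pairs within each block."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[key(mention)].append(mention[0])
    return sorted(pair for ids in blocks.values() for pair in combinations(ids, 2))


pairs = candidate_pairs(MENTIONS)
```

Each mention lands in a single block, so the blocks are disjoint. Mentions 1 and 3 share a name but are never compared, which illustrates how a blocking criterion can miss potential matches.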


Blocking Algorithms 2
• Pairwise Similarity/Neighborhood based blocking
  – Nearby nodes according to a similarity metric are clustered together
  – Results in non-disjoint canopies.
• Techniques
  – Sorted Neighborhood Approach [Hernandez et al SIGMOD’95]
  – Canopy Clustering [McCallum et al KDD’00]
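The sorted neighborhood idea can be sketched as sorting records by a key and pairing only those that fall inside a sliding window. This is a simplified single-pass illustration, not the cited algorithm in full; the names, the sorting key, and the window size are assumptions.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by a key, slide a fixed-size window, and pair records that share a window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, left in enumerate(ordered):
        # Each record is paired with the next (window - 1) records in sorted order.
        for right in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((left, right))))
    return sorted(pairs)


# Hypothetical names to deduplicate.
NAMES = ["J. Smith", "John Smith", "Jon Smith", "Mary Jones", "M. Jones"]


def sort_key(name):
    """Assumed sorting key: lowercased text with punctuation and spaces removed."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


pairs = sorted_neighborhood_pairs(NAMES, sort_key, window=2)
```

With five records there are ten possible pairs, but a window of two yields only four candidates. Consecutive windows overlap, so the resulting neighborhoods are not disjoint, in contrast to hash based blocks.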


✔ Feature Construction  ✔ Collective Reasoning  ✔ Blocking



http://psl.umiacs.umd.edu

Matthias Broecheler Lily Mihalkova Stephen Bach Stanley Kok Alex Memory

Bert Huang

Angelika Kimmig

Arti Ramesh Jay Pujara Shobeir Fakhraei Hui Miao Ben London


Probabilistic Soft Logic (PSL)
Declarative language based on logics to express collective probabilistic inference problems
  – Predicate = relationship or property
  – Atom = (continuous) random variable
  – Rule = capture dependency or constraint
  – Set = define aggregates

PSL Program = Rules + Input DB



PSL Foundations
• PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
• Three principles justify this mapping:
  – LP programs for MAX SAT with approximation guarantees [Goemans and Williamson, ’94]
  – Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright, et al., ’02]
  – Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, ’95]
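As a small illustration of how Łukasiewicz logic turns a rule into a convex function, the sketch below evaluates one grounding of the supervisor rule on soft truth values in [0, 1]. The operator definitions are the standard Łukasiewicz ones; the helper names and the truth values are hypothetical and are not the PSL library's API.

```python
def l_and(a, b):
    """Lukasiewicz conjunction (t-norm) on truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def l_or(a, b):
    """Lukasiewicz disjunction (t-conorm)."""
    return min(1.0, a + b)


def l_not(a):
    """Lukasiewicz negation."""
    return 1.0 - a


def distance_to_satisfaction(body, head):
    """How far the rule body => head is from being fully true: max(0, body - head)."""
    return max(0.0, body - head)


# Hypothetical soft truth values for one grounding of
# Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B).
sends, is_deadline, supervisor = 1.0, 0.75, 0.25
body = l_and(sends, is_deadline)
penalty = distance_to_satisfaction(body, supervisor)
```

The penalty is a hinge, that is, the maximum of zero and a linear function of the truth values, so it is convex in the unknowns.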


Hinge-loss Markov Random Fields

\[
P(Y \mid X) = \frac{1}{Z} \exp\left[ -\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j} \right]
\]

• Continuous variables in [0,1]
• Potentials are hinge-loss functions
• Subject to arbitrary linear constraints
• Log-concave!
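The density on this slide can be evaluated directly for a toy model. The sketch below computes the weighted hinge-loss energy and the unnormalized density, then finds a MAP state by brute-force grid search. The one-variable model, the second potential, and the grid search are assumptions made for illustration; practical solvers use convex optimization instead.

```python
import math


def hinge_energy(y, potentials):
    """Sum of w_j * max(l_j(y), 0) ** p_j over all potentials."""
    return sum(w * max(ell(y), 0.0) ** p for w, ell, p in potentials)


def unnormalized_density(y, potentials):
    """exp(-energy); dividing by the partition function Z would give P(Y | X)."""
    return math.exp(-hinge_energy(y, potentials))


# Hypothetical one-variable model: y is the soft truth value of vote(B, P).
# Each entry is (weight w_j, linear function l_j, exponent p_j).
potentials = [
    (0.8, lambda y: 1.0 - y, 1),  # spouse rule, given a spouse whose vote is 1.0
    (0.3, lambda y: y, 1),        # assumed weak prior against vote(B, P)
]

# Crude MAP estimate over a grid on [0, 1].
grid = [i / 100 for i in range(101)]
y_map = min(grid, key=lambda y: hinge_energy(y, potentials))
```

Here the energy is 0.8(1 - y) + 0.3y, which decreases in y, so the grid search settles on y = 1: the stronger spouse rule outweighs the prior.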


PSL in a Slide
• MAP Inference in PSL translates into convex optimization problem -> inference is really fast!
• Inference further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM & GraphLab -> inference even faster!
• Outperforms discrete MRFs in terms of speed, and (very) often accuracy
• PSL is flexible: Applied to image segmentation, activity recognition, stance-detection, sentiment analysis, document classification, drug target prediction, latent social groups and trust, engagement modeling, ontology alignment, and looking for more!


psl.umiacs.umd.edu


Discussion


NEED: Data Science for Graphs


Closing Comments
• Need new statistical and ML theory for very large graphs
  – Better understand bias, identifiability, and mechanism design in graph domains
• Important topics not touched on
  – Generative mechanisms, causal reasoning, dynamic modeling
  – Privacy, Users, & Visualization
• Many exciting opportunities to develop new theory and algorithms and apply them to compelling business, intelligence, social, and societal problems!


http://psl.umiacs.umd.edu


Thank You!

Contact information: [email protected]
