Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF


Abstract: One of the challenges in big data analytics lies in being able to reason collectively about extremely large, heterogeneous, incomplete, noisy, interlinked data. We need data science techniques which can represent and reason effectively with this form of rich and multi-relational graph data. In this presentation, I will describe some common collective inference patterns needed for graph data, including: collective classification (predicting missing labels for nodes in a network), link prediction (predicting potential edges), and entity resolution (determining when two nodes refer to the same underlying entity). I will describe three key capabilities required: relational feature construction, collective inference, and scaling. Finally, I will briefly describe some of the cutting-edge analytic tools being developed within the machine learning, AI, and database communities to address these challenges.


Big Graph Data Science

Lise Getoor, University of California, Santa Cruz

MLconf SF, November 14, 2014

BIG Data is not flat


Data is multi-modal, multi-relational, spatio-temporal, multi-media

NEED: Data Science for Graphs

New V: Vinculate (to link)

Patterns   Tools   Key Ideas

☐ Collective Classification   ☐ Link Prediction   ☐ Entity Resolution

Collective Classification: inferring the labels of nodes in a graph

Collective Classification

[Figure, shown across several build slides: a social network whose edges are labeled spouse, colleague, and friend; some people are labeled with the party they vote for, others are marked "?". Question: which party do the unlabeled people vote for?]

✔ Collective Classification   ☐ Link Prediction   ☐ Entity Resolution

Link Prediction: inferring the existence of edges in a graph

Link Prediction

- Entities: people, emails
- Observed relationships: communications, co-location
- Predict relationships: supervisor, subordinate, colleague

✔ Collective Classification   ✔ Link Prediction   ☐ Entity Resolution

Entity Resolution: determining which nodes refer to the same underlying entity

Ironically, Entity Resolution has many duplicate names:

- Doubles
- Duplicate detection
- Record linkage
- Deduplication
- Object identification
- Object consolidation
- Coreference resolution
- Entity clustering
- Reference reconciliation
- Reference matching
- Householding
- Household matching
- Fuzzy match
- Approximate match
- Merge/purge
- Hardening soft databases
- Identity uncertainty

[Figure: a reference graph before and after entity resolution.]

Entity Resolution & Network Analysis

See the Tutorial on Entity Resolution in Big Data with Ashwin Machanavajjhala at KDD 2013 (slides available).

✔ Collective Classification   ✔ Link Prediction   ✔ Entity Resolution

Graph Identification

- Goal: given an input graph, infer an output graph
- Three major components:
  - Entity Resolution (ER): infer the set of nodes
  - Link Prediction (LP): infer the set of edges
  - Collective Classification (CC): infer the node labels
- Challenge: the components are intra- and inter-dependent

Namata, Kok, Getoor, KDD 2011

Patterns   Tools   Key Ideas

☐ Feature Construction   ☐ Collective Reasoning   ☐ Blocking

Key Idea #1: Relational Feature Construction

Key Idea: Feature Construction

- Feature informativeness is key to the success of a relational classifier
- Relational feature construction:
  - Node-specific measures
    - Aggregates: summarize attributes of relational neighbors
    - Structural properties: capture characteristics of the relational structure
  - Node-pair measures: summarize properties of (potential) edges
    - Attribute-based measures
    - Edge-based measures
    - Neighborhood similarity measures
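To ground these measures, here is a minimal sketch in Python of both kinds of features on a toy social network; the graph, labels, and specific feature choices (degree, neighbor-label fractions, common neighbors, Jaccard) are illustrative assumptions, not from the talk.

```python
# A minimal sketch of relational feature construction on a toy social
# network. The graph, labels, and feature choices are made up.
from collections import Counter

# Adjacency list: node -> set of neighbors
graph = {
    "ann":  {"bob", "carl"},
    "bob":  {"ann", "carl", "dana"},
    "carl": {"ann", "bob"},
    "dana": {"bob"},
}
# Partially observed node labels (e.g., the party each person votes for)
labels = {"ann": "red", "carl": "blue"}

def node_features(node):
    """Node-specific measures: a structural property (degree) plus
    aggregates over the labels of relational neighbors."""
    neighbors = graph[node]
    observed = Counter(labels[n] for n in neighbors if n in labels)
    total = sum(observed.values())
    return {
        "degree": len(neighbors),
        # Aggregates: fraction of labeled neighbors voting each way
        "frac_red": observed["red"] / total if total else 0.0,
        "frac_blue": observed["blue"] / total if total else 0.0,
    }

def pair_features(a, b):
    """Node-pair measures: neighborhood similarity for a potential edge."""
    common = graph[a] & graph[b]
    union = graph[a] | graph[b]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }

print(node_features("bob"))          # degree + neighbor-label aggregates
print(pair_features("ann", "dana"))  # features for a potential edge
```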

Relational Classifiers

- Pros:
  - Efficient: can handle large amounts of data
    - Features can often be pre-computed ahead of time
  - Flexible: can take advantage of well-understood classification/regression algorithms
  - One of the most commonly used ways of incorporating relational information
- Cons:
  - Features are based on observed values/evidence; they cannot be based on attributes or relations that are being predicted
  - Makes incorrect independence assumptions
  - Hard to impose global constraints on joint assignments
    - For example, when inferring a hierarchy of individuals, we may want to enforce the constraint that it is a tree

✔ Feature Construction   ☐ Collective Reasoning   ☐ Blocking

Collective Classification

[Figure, across two build slides: a person whose vote is unknown ("?"), with local evidence attached: donations ("$"), a tweet, a status update. Below, ⟨party⟩ stands for a party logo shown as an image on the slide.]

Donates(A, ⟨party⟩) => Votes(A, ⟨party⟩) : 5.0

Mentions(A, “Affordable Health”) => Votes(A, ⟨party⟩) : 0.3

Collective Classification

vote(A,P) ∧ spouse(B,A) => vote(B,P) : 0.8

vote(A,P) ∧ friend(B,A) => vote(B,P) : 0.3

[Figure, across several build slides: the spouse/colleague/friend network, with votes propagating along the edges until every node is labeled.]
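To make the collective intuition concrete, here is a minimal sketch of iterative inference in the spirit of these two rules; it is not PSL (that comes later in the talk), and the graph, seed scores, and weighted-average update are illustrative assumptions.

```python
# A minimal sketch of collective classification by iterative updates,
# not PSL itself: soft "votes for P" scores propagate over typed edges
# with weights echoing the rules above. Graph and seeds are made up.
edges = [  # (person, person, relation)
    ("ann", "bob", "spouse"),
    ("bob", "carl", "friend"),
    ("carl", "dana", "friend"),
]
weight = {"spouse": 0.8, "friend": 0.3}

# Observed evidence: ann votes for P (1.0), dana votes against (0.0)
score = {"ann": 1.0, "bob": 0.5, "carl": 0.5, "dana": 0.0}
observed = {"ann", "dana"}

# Symmetric neighbor lists carrying the relation weights
neighbors = {p: [] for p in score}
for a, b, rel in edges:
    neighbors[a].append((b, weight[rel]))
    neighbors[b].append((a, weight[rel]))

for _ in range(20):  # iterate toward a (heuristic) fixed point
    for person, nbrs in neighbors.items():
        if person in observed or not nbrs:
            continue
        # Weighted average of neighbors' current scores: spouse edges
        # pull harder than friend edges, as in the rule weights
        total_w = sum(w for _, w in nbrs)
        score[person] = sum(score[n] * w for n, w in nbrs) / total_w

print(score)  # bob leans strongly toward spouse ann; carl sits between
```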

Link Prediction

- People, emails, words, communications, relations
- Use a model to express dependencies:
  - "If email content suggests type X, it is of type X"
  - "If A sends deadline emails to B, then A is the supervisor of B"
  - "If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues"

[Figure, across several build slides: an organizational email network; one email snippet reads "complete by ... due".]

HasWord(E, “due”) => Type(E, deadline) : 0.6

Sends(A,B,E) ^ Type(E, deadline) => Supervisor(A,B) : 0.8

Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞
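As a rough illustration of how such rules could score candidate edges, here is a sketch that treats the deadline rule as a hard heuristic and chains the supervisor and colleague rules; the email data, keyword test, and evidence threshold are all made-up assumptions.

```python
# A rough sketch of rule-style link prediction over a toy email graph.
# The talk's soft, weighted rules are hardened into heuristics here.
from collections import Counter

emails = [  # (sender, recipient, body)
    ("alice", "bob",   "please complete the report by Friday, it is due"),
    ("alice", "bob",   "reminder: the deliverable is due tomorrow"),
    ("alice", "carol", "budget figures are due Monday"),
    ("bob",   "carol", "lunch?"),
]

def is_deadline(body):
    # HasWord(E, "due") => Type(E, deadline), applied as a hard test
    return "due" in body.lower()

# Sends(A,B,E) ^ Type(E, deadline) => Supervisor(A,B): count the evidence
deadline_counts = Counter((a, b) for a, b, body in emails if is_deadline(body))
supervisors = {pair for pair, n in deadline_counts.items() if n >= 1}

# Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C): chain the predictions
colleagues = {
    tuple(sorted((b, c)))
    for (a, b) in supervisors
    for (a2, c) in supervisors
    if a == a2 and b != c
}
print(supervisors)  # supervisor edges inferred from deadline emails
print(colleagues)   # colleague edges chained from a shared supervisor
```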

Entity Resolution

- Entities: people references
- Attributes: name
- Relationships: friendship
- Goal: identify references that denote the same person
- Use a model to express dependencies:
  - "If two people have similar names, they are probably the same"
  - "If two people have similar friends, they are probably the same"
  - "If A=B and B=C, then A and C must also denote the same person"

[Figure, across several build slides: two references A ("John Smith") and B ("J. Smith"), each with their own friends among references C through H; dashed "=" edges mark candidate matches.]

A.name ≈{str_sim} B.name => A≈B : 0.8

{A.friends} ≈{} {B.friends} => A≈B : 0.6

A≈B ^ B≈C => A≈C : ∞
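A minimal sketch of these three rules in Python, with the soft rules hardened into thresholded similarity tests and the ∞-weight transitivity rule enforced by union-find; the references, similarity functions, and thresholds are illustrative assumptions.

```python
# A minimal sketch of the three ER rules above: name similarity and
# shared friends propose matches, and union-find enforces the hard
# transitivity rule A=B ^ B=C => A=C. Data and thresholds are made up.
from difflib import SequenceMatcher

refs = {
    "A":  {"name": "John Smith", "friends": {"C", "D", "E"}},
    "B":  {"name": "J. Smith",   "friends": {"F", "G", "H"}},
    "B2": {"name": "Jon Smith",  "friends": {"C", "D"}},
}

def name_sim(x, y):    # stand-in for str_sim in the rule
    return SequenceMatcher(None, x, y).ratio()

def friend_sim(x, y):  # Jaccard overlap of friend sets
    return len(x & y) / len(x | y) if x | y else 0.0

# Union-find gives the transitive closure of accepted matches
parent = {r: r for r in refs}
def find(r):
    while parent[r] != r:
        parent[r] = parent[parent[r]]
        r = parent[r]
    return r
def union(a, b):
    parent[find(a)] = find(b)

ids = list(refs)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        same_name = name_sim(refs[a]["name"], refs[b]["name"]) > 0.7
        same_friends = friend_sim(refs[a]["friends"], refs[b]["friends"]) > 0.4
        if same_name or same_friends:
            union(a, b)

clusters = {}
for r in refs:
    clusters.setdefault(find(r), set()).add(r)
print(list(clusters.values()))  # references grouped into entities
```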

Challenges

- Collective classification: labeling nodes in a graph
  - Irregular structure: not a chain, not a grid
  - Challenge: one large, partially labeled graph
- Link prediction: predicting edges in a graph
  - Dependencies among edges
  - Don't want to reason about all possible edges
  - Challenge: scaling and extremely skewed probabilities
- Entity resolution: determining which nodes refer to the same entities in a graph
  - Dependencies between clusters
  - Challenge: enforcing constraints, e.g. transitive closure

✔ Feature Construction   ✔ Collective Reasoning   ☐ Blocking

Blocking: Motivation

- Naïve pairwise comparison: |N|² pairwise comparisons
  - 1,000 business listings each from 1,000 different cities across the world
  - 1 trillion comparisons
  - 11.6 days (if each comparison takes 1 μs)
- Mentions from different cities are unlikely to be matches
  - Blocking criterion: city
  - 1 billion comparisons
  - 16 minutes (if each comparison takes 1 μs)
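The arithmetic behind those counts, as a quick sanity check (using the slide's |N|² convention rather than counting each unordered pair once):

```python
# Quick check of the comparison counts on the slide.
mentions = 1_000 * 1_000           # 1,000 listings x 1,000 cities
naive = mentions ** 2              # ~1e12 pairwise comparisons
blocked = 1_000 * 1_000 ** 2       # 1,000 city blocks of 1,000 each

us_per_day = 86_400 * 10**6        # microseconds in a day
print(naive / us_per_day)          # ~11.6 days at 1 microsecond each
print(blocked / (60 * 10**6))      # ~16.7 minutes at 1 microsecond each
```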

Blocking: Motivation

- Mentions from different cities are unlikely to be matches
  - Caveat: blocking may miss potential matches

Blocking: Motivation

[Figure: Venn diagram relating the set of all pairs of nodes, the matching pairs of nodes, and the pairs of nodes satisfying the blocking criterion; a good criterion covers (nearly) all matching pairs with far fewer candidates.]

Blocking Algorithms 1

- Hash-based blocking
  - Each block Ci is associated with a hash key hi
  - Mention x is hashed to Ci if hash(x) = hi
  - Within a block, all pairs are compared
  - Each hash function results in disjoint blocks
- What hash function?
  - Deterministic function of attribute values
  - Boolean functions over attribute values [Bilenko et al. ICDM'06, Michelson et al. AAAI'06, Das Sarma et al. CIKM'12]
  - minHash (min-wise independent permutations) [Broder et al. STOC'98]; locality sensitive hashing
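Here is a minimal sketch of hash-based blocking with city as the deterministic blocking key; the listings and key choice are illustrative assumptions.

```python
# A minimal sketch of hash-based blocking: a deterministic function of
# attribute values (here, city) assigns each mention to one disjoint
# block, and only within-block pairs are compared. Data is made up.
from itertools import combinations

mentions = [
    {"name": "Joe's Pizza", "city": "NYC"},
    {"name": "Joes Pizza",  "city": "NYC"},
    {"name": "Joe's Pizza", "city": "SF"},
    {"name": "Blue Bottle", "city": "SF"},
]

def block_key(m):
    # Deterministic blocking criterion: the mention's city
    return m["city"]

blocks = {}
for m in mentions:
    blocks.setdefault(block_key(m), []).append(m)

# Compare all pairs, but only within each block
candidate_pairs = [
    (a["name"], b["name"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # cross-city pairs are never compared
```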

Blocking Algorithms 2

- Pairwise similarity / neighborhood-based blocking
  - Nearby nodes according to a similarity metric are clustered together
  - Results in non-disjoint canopies
- Techniques
  - Sorted Neighborhood Approach [Hernandez et al. SIGMOD'95]
  - Canopy Clustering [McCallum et al. KDD'00]
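And a minimal sketch of canopy clustering in the style of McCallum et al.: a cheap similarity with a loose threshold admits items into (possibly overlapping) canopies, while a tight threshold retires them from the candidate pool; the similarity function, thresholds, and data are illustrative assumptions.

```python
# A minimal sketch of canopy clustering [McCallum et al. KDD'00] with a
# cheap token-overlap similarity; thresholds and data are made up.
def cheap_sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def canopies(items, t_loose=0.2, t_tight=0.6):
    remaining = list(items)
    result = []
    while remaining:
        center = remaining.pop(0)
        # Loose threshold admits items into the canopy (canopies overlap)
        canopy = [center] + [x for x in remaining
                             if cheap_sim(center, x) >= t_loose]
        # Tight threshold removes items from further consideration
        remaining = [x for x in remaining
                     if cheap_sim(center, x) < t_tight]
        result.append(canopy)
    return result

names = ["Joe's Pizza NYC", "Joes Pizza NYC", "Blue Bottle SF",
         "Blue Bottle Coffee SF"]
for c in canopies(names):
    print(c)  # expensive ER comparisons happen only inside each canopy
```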

✔ Feature Construction   ✔ Collective Reasoning   ✔ Blocking

Patterns   Tools   Key Ideas

http://psl.umiacs.umd.edu

Matthias Broecheler, Lily Mihalkova, Stephen Bach, Stanley Kok, Alex Memory, Bert Huang, Angelika Kimmig, Arti Ramesh, Jay Pujara, Shobeir Fakhraei, Hui Miao, Ben London

Probabilistic Soft Logic (PSL)

A declarative language based on logic for expressing collective probabilistic inference problems:
- Predicate = relationship or property
- Atom = (continuous) random variable
- Rule = captures a dependency or constraint
- Set = defines aggregates

PSL Program = Rules + Input DB

Collective Classification

vote(A,P) ∧ spouse(B,A) => vote(B,P) : 0.8

vote(A,P) ∧ friend(B,A) => vote(B,P) : 0.3

[Figure: the spouse/colleague/friend voting network again, now expressed as a PSL program.]

PSL Foundations

- PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
- Three principles justify this mapping:
  - LP programs for MAX SAT with approximation guarantees [Goemans and Williamson, '94]
  - Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright et al., '02]
  - Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, '95]

Hinge-loss Markov Random Fields

$$P(Y \mid X) = \frac{1}{Z} \exp\left[-\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j}\right]$$

- Continuous variables in [0,1]
- Potentials are hinge-loss functions
- Subject to arbitrary linear constraints
- Log-concave!
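To see one such potential in action, here is a minimal sketch that grounds the earlier rule vote(A,P) ∧ spouse(B,A) => vote(B,P) : 0.8 under Łukasiewicz semantics, where a ground rule's hinge loss is its distance to satisfaction; the truth values below are illustrative assumptions.

```python
# A minimal sketch of one HL-MRF potential: the ground rule
# vote(A,P) ∧ spouse(B,A) => vote(B,P) : 0.8 under Lukasiewicz
# semantics. The [0,1] truth values are made up.
def lukasiewicz_and(a, b):
    # Lukasiewicz t-norm for conjunction over [0,1] truth values
    return max(a + b - 1.0, 0.0)

def hinge_potential(w, body, head, p=1):
    # w * max{distance to satisfaction of body => head, 0}^p
    return w * max(body - head, 0.0) ** p

vote_A = 0.9   # vote(A,P): A very likely votes for P
spouse = 1.0   # spouse(B,A): observed
vote_B = 0.4   # vote(B,P): current soft assignment for B

body = lukasiewicz_and(vote_A, spouse)    # = 0.9
phi = hinge_potential(0.8, body, vote_B)  # = 0.8 * max(0.9 - 0.4, 0) = 0.4
print(phi)  # a convex penalty that MAP inference drives toward 0
```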

PSL in a Slide

- MAP inference in PSL translates into a convex optimization problem -> inference is really fast!
- Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM & GraphLab -> inference even faster!
- Outperforms discrete MRFs in terms of speed, and (very) often accuracy
- PSL is flexible: it has been applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug-target prediction, latent social groups and trust, engagement modeling, and ontology alignment, and we are looking for more!

psl.umiacs.umd.edu

Discussion

Patterns   Tools   Key Ideas

NEED: Data Science for Graphs

Closing Comments

- Need new statistical and ML theory for very large graphs
  - Better understand bias, identifiability, and mechanism design in graph domains
- Important topics not touched on:
  - Generative mechanisms, causal reasoning, dynamic modeling
  - Privacy, users, & visualization
- Many exciting opportunities to develop new theory and algorithms and apply them to compelling business, intelligence, social, and societal problems!

http://psl.umiacs.umd.edu

Thank You!

Contact information: getoor@soe.ucsc.edu
