Tools and Algorithms for Querying and Mining Large Graphs. Hanghang Tong, Machine Learning Department, Carnegie Mellon University. [email protected] http://www.cs.cmu.edu/~htong

Page 1: Tools and Algorithms for  Querying and Mining Large Graphs

Tools and Algorithms for Querying and Mining Large Graphs

Hanghang Tong

Machine Learning Department

Carnegie Mellon University

[email protected]

http://www.cs.cmu.edu/~htong

1

Page 2: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Committee

• Christos Faloutsos
• William Cohen
• Jeff Schneider
• Philip S. Yu

2

Page 3: Tools and Algorithms for  Querying and Mining Large Graphs

Graphs are everywhere!

3

Page 4: Tools and Algorithms for  Querying and Mining Large Graphs

Motivating Questions (high level)

• Given a large graph, we want to
  + Task A: Querying
  + Task B: Mining

[Figure: two example results. Left: CePS on DBLP [Tong+ KDD 06], a weighted co-authorship subgraph over R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes and Daryl Pregibon. Right: T3 on CIKM [Tong+ CIKM 08].]

Will return to this later…

4

Page 5: Tools and Algorithms for  Querying and Mining Large Graphs

Motivating Questions (in detail)

• Querying [Goal: query complex relationships]
  – Q.1. Find complex user-specific patterns;
  – Q.2. Link Prediction & Proximity Tracking;
  – Q.3. Answer all the above questions quickly.

• Mining [Goal: find interesting patterns]
  – M.1. Spot Anomalies;
  – M.2. Mine time & space;
  – M.3. Detect communities.

5

Page 6: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: overview diagram laying out the querying questions Q1, Q2, Q3 and the mining questions M1, M2, M3.]

6

Page 7: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

Questions That We Ask (Completed vs. Proposed)

Completed:
• Q1: CePS, G-Ray, ProSIN (KDD06, KDD07 a, ICDM08)
• Q2: DAP (KDD07 b); pTrack/cTrack (SDM08, SAM08)
• Q3: FastProx (ICDM06, KAIS07, KDD07 b, ICDM08); FastProx (SDM08, SAM08)
• M1: Colibri-S (KDD08 b); Colibri-D (KDD08 b)
• M2: T3/MT3 (CIKM08)

Proposed:
• P1: marked next to M3
• P2: marked next to M2
• P3: marked next to Q3, M1 and M2

7

Page 8: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview: Impact

Tasks: Impact, Applications

Querying
• Q1: Identify master-mind criminal; money-laundering ring; interactive search & summarization
• Q2: Predict who-calls-whom; trend analysis on graph level
• Q3: Scale all the above applications to large, disk-resident graphs

Mining
• M1: Efficient anomaly detection in an intuitive, dynamic way
• M2: Mine time/space in complex settings
• M3: Detect communities w/ optional constraints

Footnote: Our work for Q1 has been transferred into an IBM product (Cyano)

8

Page 9: Tools and Algorithms for  Querying and Mining Large Graphs

Roadmap

• Introduction
• Completed Work
  – Querying
    • Preliminary
    • Q1
    • Q2
    • Q3
  – Mining
• Proposed Work

9

Page 10: Tools and Algorithms for  Querying and Mining Large Graphs

Preliminary: Proximity Measurement

[Figure: example graph with nodes labeled A to J and unit edge weights.]

a.k.a. Relevance, Closeness, 'Similarity'…

10

Page 11: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

11

Page 12: Tools and Algorithms for  Querying and Mining Large Graphs

Completed work on Q1

• Goal: Find complex user-specific patterns
  – Q1.1. Center-Piece Subgraph Discovery
    – e.g., master-mind criminal given some suspects X, Y and Z?
  – Q1.2. Best-Effort Pattern Match
    – e.g., money-laundering ring
  – Q1.3. Interactive querying (e.g., negation)
    – e.g., find the conferences most similar to KDD, but not like ICML?

12

Page 13: Tools and Algorithms for  Querying and Mining Large Graphs

Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06]

Q: How to find hub for the black nodes?

[Figure: input (original graph with black query nodes A, B, C) and output (CePS, with the CePS node shown in red).]

Red: Max (Prox(A, Red) x Prox(B, Red) x Prox(C, Red))
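
The AND-query score on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CePS implementation: Prox is taken to be the random walk with restart score, the six-node graph is a made-up toy, and the names `rwr` and `and_query_scores` are hypothetical.

```python
import numpy as np

def rwr(W, seed, c=0.9, iters=200):
    """Random walk with restart scores for one seed (simple power iteration)."""
    e = np.zeros(W.shape[0])
    e[seed] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = c * (W @ r) + (1 - c) * e
    return r

def and_query_scores(A, queries, c=0.9):
    """For every node, the product over query nodes q of Prox(q, node)."""
    W = A / A.sum(axis=0, keepdims=True)  # column-normalize the adjacency matrix
    scores = np.ones(A.shape[0])
    for q in queries:
        scores *= rwr(W, q, c)
    return scores

# Hypothetical toy graph: queries 0, 1, 2 all touch node 3;
# nodes 4 and 5 each hang off a single query node.
edges = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 5)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

queries = [0, 1, 2]
scores = and_query_scores(A, queries)
candidates = [n for n in range(6) if n not in queries]
center_piece = max(candidates, key=lambda n: scores[n])
print(center_piece)
```

In this toy graph the node adjacent to all three query nodes (node 3) gets the highest product among the non-query nodes, which is the "hub" behavior the slide asks for.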

Page 14: Tools and Algorithms for  Querying and Mining Large Graphs

CePS: Example (AND Query)

[Figure: center-piece subgraph for an AND query on DBLP, with numeric edge weights; nodes: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes, Daryl Pregibon.]

DBLP co-authorship network:
  - 400,000 authors, 2,000,000 edges

14

Page 15: Tools and Algorithms for  Querying and Mining Large Graphs

K_SoftAND: Relaxation of AND

• Asking an AND query? No answer!
  – Disconnected communities
  – Noise

15

Page 16: Tools and Algorithms for  Querying and Mining Large Graphs

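CePS: 2_SoftAND

[Figure: 2_SoftAND center-piece subgraph on DBLP, with numeric edge weights and two groups labeled "Stat." and "DB"; nodes: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Umeshwar Dayal, Bernhard Scholkopf, Peter L. Bartlett, Alex J. Smola.]

16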

Page 17: Tools and Algorithms for  Querying and Mining Large Graphs

Q1.2. Best-Effort Pattern Match [Tong+ KDD 2007 b]

Q: How to find matching subgraph?

[Figure: input (a data graph plus a query graph over the node types Accountant, CEO, Manager, SEC) and output (the matching subgraph); one element of the figure is labeled "Interception".]

Page 18: Tools and Algorithms for  Querying and Mining Large Graphs

G-Ray: How to?

[Figure: query graph and data graph with four matching nodes (4, 7, 11, 12).]

Goodness = Prox(12, 4) x Prox(4, 12) x Prox(7, 4) x Prox(4, 7) x Prox(11, 7) x Prox(7, 11) x Prox(12, 11) x Prox(11, 12)

Observation: [formula lost in extraction], etc.

details

18
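
The goodness formula above multiplies the proximity in both directions for every matched query edge. A minimal sketch of that product follows; the proximity matrix is filled with a made-up constant purely for illustration, and `goodness` is a hypothetical helper name, not G-Ray's API.

```python
import numpy as np

def goodness(prox, matched_edges):
    """Product of Prox(u, v) * Prox(v, u) over the matched query edges."""
    g = 1.0
    for u, v in matched_edges:
        g *= prox[u, v] * prox[v, u]
    return g

# The four matched edges named on the slide (node ids 4, 7, 11, 12).
matched_edges = [(12, 4), (7, 4), (11, 7), (12, 11)]

# Hypothetical proximity matrix: every pairwise proximity set to 0.5.
prox = np.full((13, 13), 0.5)

g = goodness(prox, matched_edges)
print(g)
```

With eight factors of 0.5 the result is 0.5 to the eighth power; a real run would plug in proximities computed on the data graph.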

Page 19: Tools and Algorithms for  Querying and Mining Large Graphs

Effectiveness: star-query

[Figure: star-shaped query and its result; node labels include Databases, Bio-medical, Intelligent Agent.]

19

Page 20: Tools and Algorithms for  Querying and Mining Large Graphs

Effectiveness: line-query

[Figure: line-shaped query and its result; node labels include Databases, Learning, Bio-medical, Theory.]

20

Page 21: Tools and Algorithms for  Querying and Mining Large Graphs

Q1.3: Interactive Querying

[Figure: querying loop refined by repeated rounds of user feedback.]

21

Page 22: Tools and Algorithms for  Querying and Mining Large Graphs

Q1.3 ProSIN for Interactive Querying [Tong+ ICDM 08]

What are the most related conferences wrt KDD? (DBLP author-conference bipartite graph)

• Initial results: ICDM, ICML, SDM, VLDB, ICDE, SIGMOD, NIPS, PKDD, IJCAI, PAKDD
• After "No" to ICML: ICDM, SDM, PKDD, ICDE, VLDB, SIGMOD, PAKDD, CIKM, SIGIR, WWW
• After "Yes" to SIGIR: SIGIR, TREC, CIKM, ECIR, CLEF, ICDM, JCDL, VLDB, ACL, ICDE

Two main sub-communities in KDD: DBs (green) vs. Stat (red)

Negative feedback on ICML will exclude other stats confs (NIPS, IJCAI)

Positive feedback on SIGIR will bring more IR (brown) conferences.

22

Page 25: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

25

Page 26: Tools and Algorithms for  Querying and Mining Large Graphs

Q2.1 Link Prediction: direction [Tong+ KDD 07 a]

• Q: Given the existence of the link, what is the direction of the link?
• A: (DAP) Compare Prox(i→j) and Prox(j→i)

[Figure: density plot of Prox(i→j) - Prox(j→i) for linked node pairs i, j; annotation: >70%.]

Data: Web Link
  - 4,000 nodes
  - 10,000 edges

26
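
The comparison rule on this slide can be sketched as below. Note the swap: this sketch uses random walk with restart on a directed graph as a stand-in proximity, which is an assumption for illustration and not the DAP measure itself. The 3-cycle graph and the names `proximity_matrix` and `guess_direction` are hypothetical.

```python
import numpy as np

def proximity_matrix(A, c=0.9):
    """P[i, j]: RWR score of node j for a walker that restarts at node i.

    A[i, j] = 1 means a directed edge i -> j. Assumes every node has at
    least one outgoing edge, so the row normalization is well defined.
    """
    n = A.shape[0]
    M = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    W = M.T                                 # column-stochastic form used by RWR
    Q = (1 - c) * np.linalg.inv(np.eye(n) - c * W)
    return Q.T                              # column i of Q is the score vector of i

def guess_direction(P, i, j):
    """Guess i -> j when Prox(i -> j) is at least Prox(j -> i)."""
    return (i, j) if P[i, j] >= P[j, i] else (j, i)

# Hypothetical directed 3-cycle: 0 -> 1 -> 2 -> 0.
A = np.zeros((3, 3))
for u, v in [(0, 1), (1, 2), (2, 0)]:
    A[u, v] = 1.0

P = proximity_matrix(A)
print(guess_direction(P, 0, 1))
```

On the cycle, node 1 is one step downstream of node 0 but node 0 is two steps downstream of node 1, so the forward proximity is larger and the rule recovers the true direction of each edge.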

Page 27: Tools and Algorithms for  Querying and Mining Large Graphs

Q2.2 pTrack/cTrack: Challenge [Tong+ SDM 08]

• Observations (CePS, G-Ray, ProSIN…)
  – All for static graphs
  – Proximity: main tool
• Graphs are evolving over time!
  – New nodes/edges show up;
  – Existing nodes/edges die out;
  – Edge weights change…

Q: How to make everything incremental?
A: Track Proximity!

27

Page 28: Tools and Algorithms for  Querying and Mining Large Graphs

pTrack/cTrack: Trend analysis on graph level

[Figure: Rank of Influence vs. Year for M. Jordan, G. Hinton, C. Koch and T. Sejnowski.]

28

Page 29: Tools and Algorithms for  Querying and Mining Large Graphs

pTrack: Problem Definitions

• [Given]
  – (1) a large, skewed, time-evolving bipartite graph,
  – (2) the query nodes of interest
• [Track]
  – (1) top-k most related nodes for each query node at each time step t;
  – (2) the proximity score (or rank of proximity) between any two query nodes at each time step t

29

Page 30: Tools and Algorithms for  Querying and Mining Large Graphs

pTrack: Philip S. Yu's Top-5 conferences up to each year

1992: ICDE, ICDCS, SIGMETRICS, PDIS, VLDB
1997: CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
2002: KDD, SIGMOD, ICDM, CIKM, ICDCS
2007: ICDM, KDD, ICDE, SDM, VLDB

Topic labels: Databases, Performance, Distributed Sys. (earlier years); Databases, Data Mining (later years)

DBLP (Au. x Conf.): 400k authors, 3.5k confs, 20 yrs

30

Page 31: Tools and Algorithms for  Querying and Mining Large Graphs

KDD's Rank wrt. VLDB over years

[Figure: Prox. Rank vs. Year; the rank moves in the direction marked "(Closer)".]

Data Mining and Databases are getting closer & closer

31

Page 32: Tools and Algorithms for  Querying and Mining Large Graphs

cTrack: 10 most influential authors in NIPS community up to each year

Author-paper bipartite graph from NIPS 1987-1999: 1740 papers, 2037 authors, spread over 13 years

[Figure: ranking over the years; T. Sejnowski and M. Jordan are highlighted.]

32

Page 33: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

33

Page 34: Tools and Algorithms for  Querying and Mining Large Graphs

Proximity is the main tool

• Q.1: CePS, G-Ray, ProSIN
• Q.2: DAP, pTrack/cTrack

Q: What is a 'good' Score?

[Figure: example graph with nodes labeled A to J and unit edge weights, repeated from Page 10.]

a.k.a. Relevance, Closeness, 'Similarity'…

34

Page 35: Tools and Algorithms for  Querying and Mining Large Graphs

Random walk with restart [Pan+ KDD 2004]

Ranking vector $\vec{r}_4$ (query: Node 4) on a 12-node example graph:

Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02

[Figure: the same graph with each node colored by its score.]

• More red, more relevant
• Nearby nodes, higher scores

Page 36: Tools and Algorithms for  Querying and Mining Large Graphs

Why RWR is a good score?

$Q = (I - cW)^{-1} = I + cW + c^2W^2 + c^3W^3 + \dots$

• $cW$: all paths from i to j with length 1
• $c^2W^2$: all paths from i to j with length 2
• $c^3W^3$: all paths from i to j with length 3

W: adjacency matrix. c: damping factor

$Q(i, j) \propto r_{i,j}$

RWR summarizes all the weighted paths from i to j
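
The link between the ranking vector and the path sum can be written out as a short derivation. It is a sketch under two assumptions that the slide does not state: W is normalized so that the spectral radius of cW is below 1 (for example, a column-normalized W with 0 < c < 1), and $\vec{e}_i$ denotes the unit vector for query node i.

```latex
\begin{align*}
\vec{r}_i &= cW\vec{r}_i + (1-c)\,\vec{e}_i
  && \text{(steady state of the walk restarting at } i\text{)} \\
(I - cW)\,\vec{r}_i &= (1-c)\,\vec{e}_i \\
\vec{r}_i &= (1-c)\,(I - cW)^{-1}\vec{e}_i = (1-c)\,Q\,\vec{e}_i \\
Q &= (I - cW)^{-1} = \sum_{k=0}^{\infty} c^k W^k
  && \text{(Neumann series, spectral radius of } cW < 1\text{)}
\end{align*}
```

Each term $c^k W^k$ collects the weighted paths of length k, discounted by $c^k$, so longer paths contribute less to the final score.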

Page 37: Tools and Algorithms for  Querying and Mining Large Graphs

Computing RWR

• On the fly: $\vec{r}_i[t+1] = cW\vec{r}_i[t] + (1-c)\vec{e}_i$
  – No pre-computation
  – Light storage cost (W)
  – Slow on-line response: O(mE)
• Pre-compute: $Q = (I - cW)^{-1}$
  – Fast on-line response
  – Prohibitive pre-compute cost: $O(n^3)$
  – Prohibitive storage cost: $O(n^2)$

37
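
The two strategies above can be compared on a small example. This is a minimal sketch, assuming a column-normalized adjacency matrix and a made-up 6-node graph; the function names are hypothetical. It only checks that the iterative and the pre-computed answers agree, not the cost figures on the slide.

```python
import numpy as np

def normalize_columns(A):
    """Column-normalize an adjacency matrix (assumes no isolated nodes)."""
    return A / A.sum(axis=0, keepdims=True)

def rwr_on_the_fly(W, i, c=0.9, m=200):
    """Iterate r[t+1] = c W r[t] + (1 - c) e_i for m steps."""
    e = np.zeros(W.shape[0])
    e[i] = 1.0
    r = e.copy()
    for _ in range(m):
        r = c * (W @ r) + (1 - c) * e
    return r

def rwr_precompute(W, c=0.9):
    """Q = (I - cW)^{-1}: cubic time and quadratic storage in general."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - c * W)

# Hypothetical graph: a 6-node ring plus one chord.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]:
    A[u, v] = A[v, u] = 1.0

c = 0.9
W = normalize_columns(A)
Q = rwr_precompute(W, c)

query = 3
r_iter = rwr_on_the_fly(W, query, c)
r_pre = (1 - c) * Q[:, query]      # on-line step is a single column lookup
print(np.max(np.abs(r_iter - r_pre)))
```

The on-line cost of the pre-computed route is one column read, while the iterative route pays one sparse matrix-vector product per iteration, which is the trade-off the slide describes.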

Page 38: Tools and Algorithms for  Querying and Mining Large Graphs

Q: How to Balance?

On-line vs. Off-line

Goal: Efficiently get (elements of) $Q = (I - cW)^{-1}$

38

Page 39: Tools and Algorithms for  Querying and Mining Large Graphs

B_Lin: Basic Idea [Tong+ ICDM 2006]

[Figure: the 12-node example graph with its RWR scores, processed in three steps: Find Community, Fix the remaining, Combine.]

39

Page 40: Tools and Algorithms for  Querying and Mining Large Graphs

B_Lin: details

[Figure: the normalized matrix written as a sum of two parts; one part is labeled "Cross community".]

details

40

Page 41: Tools and Algorithms for  Querying and Mining Large Graphs

B_Lin: details

$(I - c\tilde{W})^{-1} \approx (I - c\tilde{W}_1 - cUSV)^{-1}$

• $I - c\tilde{W}_1$: easy to be inverted
• $cUSV$: LRA difference

Sherman–Morrison Lemma!

If … Then … [formulas lost in extraction]

details

41
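
The decomposition above can be sketched numerically. This is an illustration under assumptions, not the B_Lin code: the community blocks are supplied by hand, the cross-community part is factored with a truncated SVD, and the low-rank correction uses the matrix form of the Sherman–Morrison lemma (often called the Woodbury identity). The toy graph and the names `b_lin_precompute` and `b_lin_query` are hypothetical.

```python
import numpy as np

def b_lin_precompute(W, blocks, c=0.9, rank=2):
    """Invert each community block separately, then build a low-rank correction."""
    W1 = np.zeros_like(W)
    for idx in blocks:
        W1[np.ix_(idx, idx)] = W[np.ix_(idx, idx)]   # within-community part
    W2 = W - W1                                      # cross-community part
    U, s, Vt = np.linalg.svd(W2)
    U, S, V = U[:, :rank], np.diag(s[:rank]), Vt[:rank, :]

    Q1 = np.zeros_like(W)                            # (I - c W1)^{-1}, block by block
    for idx in blocks:
        k = len(idx)
        Q1[np.ix_(idx, idx)] = np.linalg.inv(np.eye(k) - c * W1[np.ix_(idx, idx)])

    Lam = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1 @ U)
    return Q1, U, Lam, V

def b_lin_query(Q1, U, Lam, V, i, c=0.9):
    """r_i = (1 - c) (Q1 e_i + c Q1 U Lam V Q1 e_i), using only mat-vec products."""
    e = np.zeros(Q1.shape[0])
    e[i] = 1.0
    x = Q1 @ e
    return (1 - c) * (x + c * (Q1 @ (U @ (Lam @ (V @ x)))))

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
W = A / A.sum(axis=0, keepdims=True)

c = 0.9
Q1, U, Lam, V = b_lin_precompute(W, blocks=[[0, 1, 2], [3, 4, 5]], c=c, rank=2)
r_fast = b_lin_query(Q1, U, Lam, V, i=0, c=c)
r_exact = (1 - c) * np.linalg.inv(np.eye(6) - c * W)[:, 0]
print(np.max(np.abs(r_fast - r_exact)))
```

In this toy case the cross-community part has rank 2, so a rank-2 factorization loses nothing and the result matches the exact inverse; on a real graph the truncation makes the answer approximate.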

Page 42: Tools and Algorithms for  Querying and Mining Large Graphs

B_Lin: summary

• Pre-Compute Stage
  • Q: Efficiently compute and store Q
  • A: A few small, instead of ONE BIG, matrix inversions
• On-Line Stage
  • Q: Efficiently recover one column of Q
  • A: A few, instead of MANY, matrix-vector multiplications

42

Page 43: Tools and Algorithms for  Querying and Mining Large Graphs

Query Time vs. Pre-Compute Time

[Figure: Log Query Time vs. Log Pre-compute Time; the point(s) marked "Our Results".]

• Quality: 90%+
• On-line:
  • Up to 150x speedup
• Pre-computation:
  • Two orders of magnitude saving

43

Page 44: Tools and Algorithms for  Querying and Mining Large Graphs

More on Scalability Issues for Querying (the spectrum of "FastProx")

• B_Lin: one large linear system
  – [Tong+ ICDM06, KAIS08]
• BB_Lin: the intrinsic complexity is small
  – [Tong+ KAIS08]
• FastUpdate: time-evolving linear system
  – [Tong+ SDM08, SAM08]
• FastAllDAP: multiple linear systems
  – [Tong+ KDD07 a]
• Fast-ProSIN: dealing w/ on-line feedback
  – [Tong+ ICDM 2008]

44

Page 45: Tools and Algorithms for  Querying and Mining Large Graphs

Roadmap

• Introduction
• Completed Work
  – Querying
  – Mining
    • M1: Spotting Anomalies
    • M2: Mining Time
• Proposed Work

45

Page 46: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

46

Page 47: Tools and Algorithms for  Querying and Mining Large Graphs

Motivation [Tong+ KDD 08 b]

• Q: How to find patterns?
  – e.g., communities, anomalies, etc.
• A: Low-Rank Approximation (LRA) for Adjacency Matrix of the Graph.

$A \approx L \times M \times R$

47

Page 48: Tools and Algorithms for  Querying and Mining Large Graphs

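LRA for Graph Mining: Example

[Figure: author-conference bipartite graph (authors John, Tom, Bob, Carl, Van, Roy; conferences KDD, ICDM, RECOMB, ISMB) and its adjacency matrix A.]

$A \approx L \times M \times R$

• L: Au. clusters
• M: Interaction
• R: Conf. cluster

Recon. error is high → 'Carl' is abnormal

48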
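
The reconstruction-error idea can be sketched with a rank-2 SVD. The edge pattern below is a made-up assumption (the figure's actual edges did not survive extraction): three authors publish in the two data mining venues, two in the two bioinformatics venues, and Carl in one venue of each group.

```python
import numpy as np

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
confs = ["KDD", "ICDM", "RECOMB", "ISMB"]

# Hypothetical author-by-conference adjacency matrix A.
A = np.array([
    [1, 1, 0, 0],   # John
    [1, 1, 0, 0],   # Tom
    [1, 1, 0, 0],   # Bob
    [1, 0, 1, 0],   # Carl: straddles both groups
    [0, 0, 1, 1],   # Van
    [0, 0, 1, 1],   # Roy
], dtype=float)

def low_rank(A, k):
    """Rank-k factors (L, M, R) from a truncated SVD, so that A ~ L @ M @ R."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

L, M, R = low_rank(A, k=2)
errors = np.linalg.norm(A - L @ M @ R, axis=1)   # per-author reconstruction error
most_abnormal = authors[int(np.argmax(errors))]
print(most_abnormal)
```

Two factors are enough to describe the two clean groups, so the row that fits neither group is the one left with the largest residual.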

Page 49: Tools and Algorithms for  Querying and Mining Large Graphs

Challenges: How to get (L, M, R)?

• Efficiently
  • both time and space
• Intuitively
  • easy for interpretation
• Dynamically
  • track patterns over time

None of the existing methods fully meets our wish list!

49

Page 50: Tools and Algorithms for  Querying and Mining Large Graphs

Why Not SVD and CUR/CX?

• SVD: Optimal in $L_2$ and $L_F$
  – Efficiency
    • Time: $O(\min(n^2 m, nm^2))$
    • Space: (L, R) are dense
  – Interpretation
    • Linear combination of many columns
  – Dynamic: Not easy
• CUR: Example-based
  – Efficiency
    • Better than SVD
    • Redundancy in L
  – Interpretation
    • Actual columns from A
  – Dynamic: Not easy

50

Page 51: Tools and Algorithms for  Querying and Mining Large Graphs

Solutions: Colibri [Tong+ KDD 08 b]

• Colibri-S: for static graph
  – Basic idea: remove linear redundancy
  – Same accuracy as CUR/CX
  – Significant savings in both time & space
    • Up to 53x speed-up
• Colibri-D: for dynamic graph
  – Basic idea: leverage smoothness between time steps
  – Same accuracy as CUR/CMD
    • Up to 112x speed-up

details

51
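
The "remove linear redundancy" idea can be illustrated as follows. This is a sketch of the general principle, not the Colibri-S algorithm as published: it assumes columns were sampled with replacement (so duplicates occur), keeps a linearly independent subset with a one-pass Gram-Schmidt test, and then projects A onto that subset. All names and the toy matrix are hypothetical.

```python
import numpy as np

def independent_columns(C, tol=1e-10):
    """Indices of columns of C that are not in the span of earlier kept columns."""
    kept = []
    basis = np.zeros((C.shape[0], 0))          # orthonormal basis of kept columns
    for j in range(C.shape[1]):
        v = C[:, j]
        resid = v - basis @ (basis.T @ v)      # part of v outside the current span
        norm = np.linalg.norm(resid)
        if norm > tol:
            kept.append(j)
            basis = np.column_stack([basis, resid / norm])
    return kept

def redundancy_free_lra(A, sampled_cols):
    """A ~ L @ M @ R with L = independent sampled columns of A."""
    C = A[:, sampled_cols]
    L = C[:, independent_columns(C)]
    M = np.linalg.inv(L.T @ L)
    R = L.T @ A
    return L, M, R

# Hypothetical 6 x 4 matrix and a sample that repeats columns 0 and 2.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
sampled_cols = [0, 0, 1, 2, 2]

L, M, R = redundancy_free_lra(A, sampled_cols)
C = A[:, sampled_cols]
approx_small = L @ M @ R                      # uses 3 columns
approx_full = C @ np.linalg.pinv(C) @ A       # uses all 5 sampled columns
print(L.shape, np.max(np.abs(approx_small - approx_full)))
```

Both approximations project A onto the same column span, so they agree while the redundancy-free version stores fewer columns; that matches the slide's "same accuracy, less time and space" claim in spirit.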

Page 52: Tools and Algorithms for  Querying and Mining Large Graphs

A Pictorial Comparison (for static graphs)

[Figure: four panels (SVD, CUR, CMD, Colibri-S); the SVD panel shows the 1st and 2nd singular vectors.]

details

52

Page 53: Tools and Algorithms for  Querying and Mining Large Graphs

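Comparison: SVD, CUR vs. Colibris

Wish list: Efficiency, Interpretation, Dynamics

Methods compared: SVD [Golub+ 1989], CUR/CX [Drineas+ 2005], Colibri [Tong+ 2008]

[Table: the per-method marks for each wish-list item did not survive extraction.]

details

53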

Page 54: Tools and Algorithms for  Querying and Mining Large Graphs

Performance of Colibri-S

[Figure: two bar charts, Time and Space, comparing CUR, CMD and Ours.]

• Accuracy
  • Same 91%+
• Time
  • 12x vs. CMD
  • 28x vs. CUR
• Space
  • ~1/3 of CMD
  • ~10% of CUR

Data set: Network traffic
  - 21,837 sources/destinations, 158,805 edges

54

Page 55: Tools and Algorithms for  Querying and Mining Large Graphs

Performance of Colibri-D

[Figure: Time vs. # of changed cols for CMD, Colibri-S and Colibri-D.]

Colibri-D achieves up to 112x speedups

Data set: Network traffic
  - 21,837 nodes
  - 1,220 hours
  - 22,800 edges/hr

55

Page 56: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

56

Page 57: Tools and Algorithms for  Querying and Mining Large Graphs

M2: How to mine time in some complex context?

[Tong+ CIKM 08]

57

Page 58: Tools and Algorithms for  Querying and Mining Large Graphs

A Motivating Example: Inputs

Time     Event (e.g., Session)   Entity
Oct. 26  Link Analysis           Tom, Bob
         Clustering              Bob, Alan
Oct. 27  Classification          Bob, Alan
         Anomaly Detection       Alan, Beck
Oct. 28  Party                   Beck, Dan
Oct. 29  Web Search              Dan, Jack
         Advertising             Jack, Peter
Oct. 30  Enterprise Search       Jack, Peter
Oct. 31  Q & A                   Peter, Smith

58

Page 59: Tools and Algorithms for  Querying and Mining Large Graphs

A Motivating Example: Outputs

• Time Cluster, Rep. Entities: "Tom", "Bob", "Alan"
• Abnormal Time, Rep. Entities: "Beck", "Dan"
• Time Cluster, Rep. Entities: "Jack", "Peter", "Smith"

[Figure: the time stamps Oct. 26 through Oct. 30 laid out with their clusters; node labels include "Jack".]

Page 60: Tools and Algorithms for  Querying and Mining Large Graphs

Problem Definitions (How to mine time in such complex context)

• Given data sets collected at different time stamps;
• We want to find
  + 1: Time Clusters
  + 2: Abnormal Time stamps
  + 3: Interpretations
  + 4: Right time granularity

Our Solutions: T3, MT3

60

Page 61: Tools and Algorithms for  Querying and Mining Large Graphs

Data Sets

• CIKM: from CIKM proceedings
  • Time: Publication year (1993-2007, 15)
  • Event: Paper-published (952)
  • Entities: Author (1895) & Session (279)
  • Attribute: Keyword (158)
• DeviceScan: from MIT Reality Mining
  • Time: the day scanning happened (1/1/2004-5/5/2005, 294)
  • Event: Bluetooth device scanning person (114,046)
  • Entities: Device (103) & Person (97)
  • Attribute: NA

61

Page 62: Tools and Algorithms for  Querying and Mining Large Graphs

T3 on 'CIKM' Data Set

Rep. Authors: James P. Callan, W. Bruce Croft, James Allan, Philip S. Yu, George Karypis, Charles Clarke
Rep. Keywords: Web, Cluster, Classification, XML, Language, Stream

Rep. Authors: Elke Rundensteiner, Daniel Miranker, Andreas Henrich, Il-Yeol Song, Scott B Huffman, Robert J. Hall
Rep. Keywords: Knowledge, System, Unstructured, Rule, Object-oriented, Deductive

62

Page 63: Tools and Algorithms for  Querying and Mining Large Graphs

MT3 on 'DeviceScan' Data Set

• Aggregate by month: Apr. 2004 is an anomaly
• Aggregate by day: work days vs. semester break & holiday

63

Page 64: Tools and Algorithms for  Querying and Mining Large Graphs

Roadmap

• Introduction
• Completed Work
  – Querying
  – Mining
• Proposed Work
  – P1: Community detection
  – P2: Mining Space
  – P3: Diffusion Wavelets

64

Page 65: Tools and Algorithms for  Querying and Mining Large Graphs

Thesis Overview

[Figure: thesis overview diagram, repeated from Page 7.]

65

Page 66: Tools and Algorithms for  Querying and Mining Large Graphs

Detecting Communities (P1)

• Observations: two seemingly opposite efforts in community detection
  – E1: parameter-free (no user intervention)
  – E2: cluster w/ constraints (listen to users)
• Challenge: How to fill the gap?
• Idea: MDL-based method, encoding the constraints in descriptions.

66

Page 67: Tools and Algorithms for  Querying and Mining Large Graphs

Mining Space

67

P2

Page 68: Tools and Algorithms for  Querying and Mining Large Graphs

Diffusion Wavelets

68

P3

Page 69: Tools and Algorithms for  Querying and Mining Large Graphs

Time Line

• Dec. '08: Thesis Proposal
• Jan. – Feb. '09:
  – Research on Community Detection (P1)
• Mar. – Apr. '09:
  – Research on Mining Space (P2)
• May – Jul. '09:
  – Research on Diffusion Wavelets (P3)
• Aug. '09: Thesis Write-up
• Sep. '09: Defense

69

Page 70: Tools and Algorithms for  Querying and Mining Large Graphs

Selected References

• H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006.
• H. Tong, C. Faloutsos, & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006. (b.p. award)
• H. Tong, Y. Koren, & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007.
• H. Tong, B. Gallagher, C. Faloutsos, & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.
• H. Tong, S. Papadimitriou, P.S. Yu, & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM, 2008. (b.p. award)
• H. Tong, S. Papadimitriou, J. Sun, P.S. Yu, & C. Faloutsos. (2008) Fast Mining of Static and Dynamic Graphs. In KDD, 2008.
• H. Tong, Y. Sakurai, T. Eliassi-Rad, & C. Faloutsos. (2008) Fast Mining of Complex Time-Stamped Events. In CIKM, 2008.
• H. Tong, H. Qu, & H. Jamjoom. (2008) Measuring Proximity on Graphs with Side Information. In ICDM, 2008.

70

Page 71: Tools and Algorithms for  Querying and Mining Large Graphs

My other work during Ph.D study

• GhostEdge (w/ Brian, Christos and Tina, in KDD 08)
  – Classification in Sparsely Labeled Network
• GMine (w/ Junio, Agma, Christos and Jure, in VLDB 06)
  – Interactive Graph Visualization and Mining
• Graphite (w/ Polo, Christos, Jason, Brian and Tina, in ICDM 08)
  – Visual Query System for Attributed Graphs
• TANGENT (w/ Kensuke and Christos)
  – "surprise-me" recommendation
• PaCK (w/ Jingrui, Spiros, Tina, Jaime and Christos)
  – Community detection for heterogeneous graphs

71

Page 72: Tools and Algorithms for  Querying and Mining Large Graphs

Acknowledgements (the old way)

• Christos Faloutsos, Jia-Yu Pan, Yehuda Koren, Spiros Papadimitriou, Philip S. Yu, Jimeng Sun, Huiming Qu, Hani Jamjoom, Tina Eliassi-Rad, Brian Gallagher, Yasushi Sakurai,
• Kensuke Oonuma, Duen Horng (Polo) Chau, Jason I. Hong, Jingrui He, Jaime Carbonell, José Fernando Rodrigues Jr., Jure Leskovec, Agma J. M. Traina,
• Charalampos (Babis) Tsourakakis, Meng Su

72

Page 73: Tools and Algorithms for  Querying and Mining Large Graphs

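A Graph Miner's Way: My Collaboration Graph (During Ph.D Study)

[Figure: collaboration graph whose nodes are the projects CePS, ProSIN, G-Ray, DAP, pTrack, cTrack, BLin, BBLin, FastUpdate, FastDAP, Fast-ProSIN, Colibri, T3/MT3, GhostEdge, Graphite, PaCK, TANGENT, GMine and the proposed P1, P2, P3, annotated with the questions Q1, Q2, Q3, M1, M2, M3.]

Legends: Green: Querying; Blue: Mining; Purple: Others; separate markers for Completed and Proposed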

Page 74: Tools and Algorithms for  Querying and Mining Large Graphs

Q & A

Thank you!

74