Efficient Identification of Overlapping Communities Jeffrey Baumes Mark Goldberg Malik Magdon-Ismail Rensselaer Polytechnic Institute, Troy, NY


Efficient Identification of Overlapping Communities

Jeffrey Baumes, Mark Goldberg, Malik Magdon-Ismail

Rensselaer Polytechnic Institute, Troy, NY

Outline

• Communities as clusters

• What is a cluster?

• Cluster seed procedure (LA)

• Cluster refinement procedure (IS2)

• Experimental results

• Conclusions and future work

Communities as clusters

• Malicious groups use large communication networks for planning and coordination

• Their goal: remain undetected

• Our goal: sift through communications for suspicious patterns, using structure only, not content

Communities as clusters

• Detecting all social groups (malicious or not) will aid in searching for “hidden” groups

• Social groups tend to communicate densely

• Approach: find social groups by finding clusters in the graph of the communication network

[Figure: example communication graph. Labels: actor A, actor B; A communicates with B; likely a social group; likely not a social group; Add external edges]

What is a cluster?

• Many partitioning algorithms exist

• Social groups often overlap

• Instead, define clusters as locally optimal with respect to density

[Figure: partitioning compared with overlapping clustering]
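The slide does not say which density metric is used, so the following is only a minimal sketch under one assumption: density is the number of edges inside the cluster divided by the number of edges inside or leaving it. The names `density` and `is_locally_optimal` and the example graph are illustrative and do not come from the talk. A cluster counts as locally optimal here when adding or removing any single node does not raise its density.

```python
# Undirected graph as a dict of adjacency sets (illustrative example:
# two triangles a-b-c and d-e-f joined by the single edge c-d).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}


def density(graph, cluster):
    """ASSUMED metric: internal edges / (internal edges + boundary edges)."""
    internal_twice = 0  # every internal edge is seen from both endpoints
    boundary = 0        # edges with exactly one endpoint in the cluster
    for u in cluster:
        for v in graph[u]:
            if v in cluster:
                internal_twice += 1
            else:
                boundary += 1
    internal = internal_twice // 2
    total = internal + boundary
    return internal / total if total else 0.0


def is_locally_optimal(graph, cluster):
    """True if no single-node addition or removal increases the density."""
    base = density(graph, cluster)
    for v in graph:
        candidate = cluster ^ {v}  # toggle membership of v
        if candidate and density(graph, candidate) > base:
            return False
    return True


print(density(graph, {"a", "b", "c"}))
print(is_locally_optimal(graph, {"a", "b", "c"}))
print(is_locally_optimal(graph, {"a", "b"}))
```

Under this assumed metric a whole connected component always scores 1.0, so a practical metric would likely need an additional term; the sketch only illustrates the definition of local optimality.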

Two-stage process

communication network → seed procedure → seed clusters → refinement procedure → final clusters

Original procedures

communication network → Rank Removal (RaRe) → seed clusters → Iterative Scan (IS) → final clusters

Jeffrey Baumes, Mark Goldberg, Mukkai Krishnamoorthy, Malik Magdon-Ismail, Nathan Preston. "Finding Communities by Clustering a Graph into Overlapping Subgraphs", International Conference on Applied Computing (IADIS 2005), Feb 22-25, Algarve, Portugal.

Proposed new procedures

communication network → Link Aggregate (LA) → seed clusters → Iterative Scan 2 (IS2) → final clusters

Link Aggregate (LA)

• Order the nodes (two routines are used)

• Pass through the nodes
– For each node, add it to the clusters it improves, or start a new cluster
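The pass described above can be sketched roughly as follows. This is not the authors' implementation: the two ordering routines are not given on the slide, so the node order is passed in explicitly, and "improves" is assumed to mean "strictly increases an internal-to-total edge ratio". The names `density`, `link_aggregate`, and the example graph are hypothetical.

```python
def density(graph, cluster):
    """ASSUMED metric: internal edges / (internal edges + boundary edges)."""
    internal_twice = 0
    boundary = 0
    for u in cluster:
        for v in graph[u]:
            if v in cluster:
                internal_twice += 1
            else:
                boundary += 1
    internal = internal_twice // 2
    total = internal + boundary
    return internal / total if total else 0.0


def link_aggregate(graph, order):
    """One pass over `order`; returns a list of (possibly overlapping) seed clusters."""
    clusters = []
    for v in order:
        added = False
        for cluster in clusters:
            # Add v to every existing cluster whose density it improves.
            if density(graph, cluster | {v}) > density(graph, cluster):
                cluster.add(v)
                added = True
        if not added:
            # v improves no cluster, so it starts a new one.
            clusters.append({v})
    return clusters


# Two triangles joined by the edge c-d (illustrative data only).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
seeds = link_aggregate(graph, order=["a", "b", "c", "e", "f", "d"])
print(seeds)
```

The result depends on the order in which nodes are visited, which is presumably why the ordering routine is treated as a separate step.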

LA procedure

[Figure: the LA procedure animated over seven slides on an example graph with nodes labeled 1 to 35]

Iterative Scan (IS)

• Old refinement procedure
– Traverses the entire node list, adding / removing nodes which increase the density
– Repeats the process until no improvements are possible

• May be inefficient in sparse networks

• Guaranteed to be locally optimal
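A minimal sketch of this scan, assuming the same internal-to-total edge ratio as a stand-in density metric (the slide does not name one). `iterative_scan` and the example graph are hypothetical names. Each pass visits every node in the graph, which is the cost the slide points to for sparse networks.

```python
def density(graph, cluster):
    """ASSUMED metric: internal edges / (internal edges + boundary edges)."""
    internal_twice = 0
    boundary = 0
    for u in cluster:
        for v in graph[u]:
            if v in cluster:
                internal_twice += 1
            else:
                boundary += 1
    internal = internal_twice // 2
    total = internal + boundary
    return internal / total if total else 0.0


def iterative_scan(graph, seed):
    """Refine `seed` by full passes over all nodes until no toggle helps."""
    cluster = set(seed)
    best = density(graph, cluster)
    improved = True
    while improved:
        improved = False
        for v in graph:                 # full pass over the entire node list
            candidate = cluster ^ {v}   # add v if absent, remove it if present
            if not candidate:
                continue                # never empty the cluster
            d = density(graph, candidate)
            if d > best:                # keep only strict improvements
                cluster, best = candidate, d
                improved = True
    return cluster


graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(iterative_scan(graph, {"a", "b"}))
print(iterative_scan(graph, {"a", "b", "c", "d"}))
```

Because the density strictly increases with every accepted change, the loop terminates, and at termination no single-node change improves the cluster, which matches the local-optimality claim.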

Iterative Scan 2 (IS2)

• New refinement procedure
– Traverses only the neighborhood of the cluster, adding / removing nodes which increase the density
– Repeats the process until no improvements are possible

• More efficient in sparse networks in spite of overhead, less efficient in dense networks
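The neighborhood-only scan might look like the sketch below. It is a simplification under stated assumptions: the density metric is an assumed internal-to-total edge ratio, the neighborhood is rebuilt from scratch on every pass (an efficient version would presumably maintain it incrementally, which would be the overhead the slide mentions), and nodes are assumed to be sortable so the scan order is deterministic. `iterative_scan_2` is a hypothetical name.

```python
def density(graph, cluster):
    """ASSUMED metric: internal edges / (internal edges + boundary edges)."""
    internal_twice = 0
    boundary = 0
    for u in cluster:
        for v in graph[u]:
            if v in cluster:
                internal_twice += 1
            else:
                boundary += 1
    internal = internal_twice // 2
    total = internal + boundary
    return internal / total if total else 0.0


def iterative_scan_2(graph, seed):
    """Refine `seed`, examining only cluster members and their neighbors."""
    cluster = set(seed)
    best = density(graph, cluster)
    improved = True
    while improved:
        improved = False
        # Candidate nodes: the cluster itself plus every adjacent node.
        neighborhood = set(cluster)
        for u in cluster:
            neighborhood |= graph[u]
        for v in sorted(neighborhood):  # sorted only for reproducibility
            candidate = cluster ^ {v}
            if not candidate:
                continue
            d = density(graph, candidate)
            if d > best:
                cluster, best = candidate, d
                improved = True
    return cluster


graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(iterative_scan_2(graph, {"a", "b"}))
```

In this example nodes e and f are never examined when refining the seed {a, b}, since they are never adjacent to the cluster.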

IS2 procedure

[Figure: the IS2 procedure animated over five slides]

Experimental results

• Compare run time of new vs. old

• Compare cluster quality of new vs. old

• Compare on different network types
– Random
– Preferential attachment
– Real-world

• Compare possible actor orderings for LA

RaRe vs. LA run time

[Figure: two run-time plots; one compares New RaRe and LA, the other compares Original RaRe, New RaRe, and LA]


IS vs. IS2 run time

Define IS* = IS for dense graphs, IS2 for sparse graphs
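One way to read this definition is as a simple dispatch on how dense the graph is. The slide does not say how "dense" and "sparse" are decided, so the edge-density measure and the `threshold` parameter below are assumptions made purely for illustration, as is the function name `choose_refinement`.

```python
def edge_density(graph):
    """Fraction of possible undirected edges that are present."""
    n = len(graph)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in graph.values()) // 2
    return 2 * m / (n * (n - 1))


def choose_refinement(graph, threshold):
    """Return "IS" for dense graphs and "IS2" for sparse ones.

    `threshold` is a HYPOTHETICAL cut-off; the talk gives no value.
    """
    return "IS" if edge_density(graph) >= threshold else "IS2"


# A complete graph on 4 nodes and a path on 10 nodes (illustrative data).
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
path10 = {i: {j for j in (i - 1, i + 1) if 0 <= j < 10} for i in range(10)}

print(choose_refinement(complete4, threshold=0.5))
print(choose_refinement(path10, threshold=0.5))
```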

Old vs. new quality

[Figure: two quality plots, each comparing New RaRe → IS with LA → IS2]

Preferential attachment

[Figure: two plots for preferential attachment networks, each comparing New RaRe → IS with LA → IS2]

Real-World Networks

Ratio = new/old = (LA→IS*)/(RaRe→IS)

[Figure: Quality Ratio for the E-mail, Web, Newsgroup, and Fortune 500 networks]

[Figure: Run-time Ratio for the E-mail, Web, Newsgroup, and Fortune 500 networks]

IS* = IS2 (E-mail), IS (Web), IS2 (Newsgroup), IS2 (Fortune 500)


LA ordering


Conclusions and future work

• Overlapping clustering may be used to discover social groups in communication networks

• The new algorithm is more efficient in many cases, while keeping the same or better quality

• A unified algorithm should choose strategies and parameters based on network properties


Questions

Rank Removal

• Existing seed procedure
– Removes highly connected nodes until the network is broken into small clusters
– Adds each removed node back into the clusters it is well connected to

• Two main inefficiencies
– Computed PageRank at each iteration
– Computed connected components at each iteration

• PageRank could be computed once, but reprocessing connected components is crucial
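The outline above can be sketched as follows, with several loud assumptions: node degree is used as a stand-in for PageRank, "small" means at most `max_size` nodes, and "well connected" means at least `min_links` neighbors in a cluster core. None of these choices, nor the names `rank_removal` and `connected_components`, come from the talk. The sketch does follow the last bullet: the rank is computed once, while connected components are recomputed after every removal.

```python
def connected_components(graph, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    component.add(v)
                    stack.append(v)
        components.append(component)
    return components


def rank_removal(graph, max_size, min_links=2):
    # ASSUMPTION: degree replaces PageRank; it is computed only once.
    rank = {v: len(graph[v]) for v in graph}
    remaining = set(graph)
    removed = []
    components = connected_components(graph, remaining)
    while any(len(c) > max_size for c in components):
        # Remove the top-ranked node among the oversized components.
        oversized = set().union(*(c for c in components if len(c) > max_size))
        top = max(oversized, key=lambda v: (rank[v], str(v)))
        remaining.remove(top)
        removed.append(top)
        # Components must be recomputed after every removal.
        components = connected_components(graph, remaining)
    cores = [frozenset(c) for c in components]
    clusters = [set(c) for c in cores]
    for v in removed:
        for core, cluster in zip(cores, clusters):
            # ASSUMED rule for "well connected": enough neighbors in the core.
            if len(graph[v] & core) >= min_links:
                cluster.add(v)
    return clusters


# Hub h links two triangles (illustrative data only).
graph = {
    "a": {"b", "c", "h"}, "b": {"a", "c", "h"}, "c": {"a", "b"},
    "d": {"e", "f", "h"}, "e": {"d", "f", "h"}, "f": {"d", "e"},
    "h": {"a", "b", "d", "e"},
}
print(rank_removal(graph, max_size=3))
```

In this example the hub is removed first, the two triangles become the cluster cores, and the hub is then added back to both, so the resulting seed clusters overlap.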

LA procedure detail

IS2 procedure detail

RaRe vs. LA (3 slides)

IS vs. IS2 (3 slides)

Run time RaRe vs. LA

Run time IS vs. IS2

Cluster quality (2 slides)

Preferential attachment run time

Preferential attachment quality

LA ordering run time

LA ordering quality