Emergent Behavior Detection in Massive Graphs · 40 Vert.20 Vertices 42 Edges30 Edges Seven Bridges of Konigsberg 1735 4 Vertices 7 Edges Hamilton’s Zachary’s Karate Club 1977

Emergent Behavior Detection in

Massive Graphs

Nadya T. Bliss and Benjamin A. Miller

MIT Lincoln Laboratory

15 February 2012

Minisymposium on Parallel Analysis of Massive Social Networks

SIAM Conference on Parallel Processing for Scientific Computing

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Contract FA8721-05-C-

0002. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any

copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily

representing the official policies or endorsements, either expressed or implied, of IARPA or the U.S. Government.

VLG - 2

BAM 02/15/12

Large Scale Data & Information Processing

High Level Composable API:

D4M

Novel Analytics:

Graph Analysis

Network Discovery

WEAK SIGNATURES,

NOISY DATA,

DYNAMICS

High Performance Computing:

LLGrid

Interactive

super-

computing

Distributed Database:

Accumulo

Distributed

database/

distributed file

system

A

C

E

B

Array

algebra

VLG - 3

BAM 02/15/12

Graphs and Networks From Data to Relationships

Graphs are a natural way to represent relationships between entities

Graph analytics enable detection of unobserved coordination

between discrete elements or events in massive, noisy datasets

ISR Cyber Social

• Graphs represent entities

and relationships detected

through multi-INT sources

• CONOPS: Identify

anomalous patterns of

activities

• Graphs represent

communication patterns of

computers on a network

• CONOPS: Detect cyber

attacks or malicious

software

• Graphs represent

relationships between

individuals or documents

• CONOPS: Identify hidden

networks

VLG - 4

BAM 02/15/12

Trends in Graph Data Sizes

AIDS Sex

Partners

1985

40 Vert.

42 Edges

Seven Bridges

of Konigsberg

1735

4 Vertices

7 Edges

Hamilton’s

Game

1856

20 Vertices

30 Edges

Zachary’s

Karate Club

1977

34 Vertices

78 Edges Les Miserables

Characters

1993

77 Vertices

254 Edges

Internet

Topology

1999

88,107 Vertices

99,664 Edges

Western U.S.

Power Grid

1998

4,941 Vertices

6,594 Edges

Enron E-mail

Corpus

2005

93,526 Vert.

344,264 Edg.

1800 1700 2010 1990 1980 1970 2000

10 10,000 60 50 40 30 1,000 0 70 80 100,000 20

Classical Era Modern Era

2010

42M V

>100M E

Citation Data

VLG - 5

BAM 02/15/12

Massive Graph Analysis

Efforts at MIT Lincoln Laboratory

Dynamic

Techniques

Development of Novel

Graph Analytics Architecture for Big Data

A

C

E

B

D

D4M High Level

Composable API

AccumuloDistributed

Database

LLGrid

High Performance

Computing

AA Ε

-Graph Model

Residual Analysis

Community Development

Lincoln Laboratory is involved in several aspects of graph research;

this presentation focuses on analytics

VLG - 6

BAM 02/15/12

Detecting Emergent Activity

recruitment planning execution (Example threat network created under the Counter Terror Social Network Analysis and Intent Recognition effort at Lincoln Laboratory)

Evolving

planning of

violent activity

Emerging

coordinated activity

(e.g., botnets)

Emerging

technologies in

scientific literature

Detection of emergent behavior is applicable to a variety of domains

VLG - 7

BAM 02/15/12

• Introduction

• Detection Framework

• Example Dataset: Web of Science

• Ongoing Work: New Algorithms and Models

• Summary

Outline

VLG - 8

BAM 02/15/12

Graph Based Residuals Analysis

• Least-squares residuals from

a best-fit line

• Analysis of variance (ANOVA)

describes fit

• “Explained” vs “unexplained”

variance signal/noise

discrimination

Linear Regression

• “Residuals” from a best-fit

graph model

• Analysis of variance from

expected topology

• Unexplained variance in

graph residuals subgraph

detection

Graph “Regression”

VLG - 9

BAM 02/15/12

Subgraph Detection Algorithm Overview

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT

SELECTION DETECTION IDENTIFICATION

Input:

• A, adjacency matrix

representation of G

• No cue

Output:

• vs, set of vertices identified as

belonging to subgraph Gf

Processing chain* for subgraph detection analogous to a traditional signal processing chain

*N. Bliss, B. A. Miller and P. J. Wolfe, “Spectral Methods for Subgraph Detection,” SIAM Annual Meeting, 2010.

VLG - 10

BAM 02/15/12

Residuals Construction

*M. E. J. Newman. “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, 74:036104, 2006.

M

KKAB

T

2

EXAMPLE:

GRAPH G

1 2

3

4 7

6

5

20

1 -

NUMBER

OF EDGES

ADJACENCY

MATRIX

1

2

3

4

5

6

7

1 2 3 4 5 6 7

*

DEGREE VECTOR

2

3

3

3

3

2

4

2 3 3 3 3 2 4

•The modularity matrix* is an example of a residuals matrix • Commonly used to evaluate quality of division of a graph into communities

• Observed minus “expected” edges, yielding residuals

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT


VLG - 11

BAM 02/15/12

Eigen Decomposition

B=UDUT

Projection onto principal components of the modularity matrix yields good

separation between background and a foreground with large residuals

Eigenvalues, sorted

by magnitude

N

N

1

2

1

...

Corresponding

eigenvectors

u1 |u2 | ... |uN-1 |uN[ ]

Principal

components • Each point represents a vertex in G

• Vertices in G: 1024

• Vertices in Gf: 12

• Uncued background/foreground separation

COLORING BASED

ON KNOWN TRUTH

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT


VLG - 12

BAM 02/15/12

Toy Detection Example

Background (normal behavior), Gb only

Background and foreground (includes

anomalous behavior), Gb + Gf

H0

H1

H0 and H1 distributions are well separated

TEST STATISTIC:

SYMMETRY OF THE

PROJECTION ONTO

SELECTED

COMPONENTS H0 H1

Powerlaw Background, 12-Vertex Dense Subgraph

Test Statistic

B.A. Miller, N.T. Bliss and P.J. Wolfe, “Toward signal processing theory for graphs and non-Euclidean data,” in Proc. ICASSP, 2010.

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT


VLG - 13

BAM 02/15/12

Dynamic Graph Processing Chain

Output:

• vs, set of vertices identified as

belonging to subgraph Gf

Processing chain for dynamic subgraph detection incorporates temporal integration

TEMPORAL

INTEGRATION

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT


Input:

• A(t), time-varying

adjacency matrix

representation of G

• No cue

time

identified vertices

VLG - 14

BAM 02/15/12

Temporal Integration

)(nB )1( nB )1( nB

Temporal integration achieved by creating a linear

combination of individual time-step graphs

)0(h )1(h )1( h+ + +

1

0

)()()(~

k

knBkhnB : Weighted sum of residuals

TEMPORAL

INTEGRATION

RESIDUALS

CONSTRUCTION

EIGEN

DECOMPOSITION

COMPONENT


Modularity

(residuals)

matrix at time

n

Filter

coefficients

(scalars)

VLG - 15

BAM 02/15/12

• Introduction




• Summary

Outline

VLG - 16

BAM 02/15/12

Thomson Reuters “Web of Science” Database

• Commercially available research database of papers in the sciences, social sciences, arts, and humanities

– More than 42 million records from 1900 to present

– Articles from over 12,000 journals and 148,000 conference proceedings

• Records typically include

– Author(s), title, publication date, type

– Document IDs for works cited

– May also include a number of other fields, e.g. subject area, institution,

keywords, abstract, as provided by the publication

VLG - 17

BAM 02/15/12

Examples of Web of Science (WOS) Graphs

CITATION GRAPH

nodes are papers

(directed) edges are citations

PAPER #1

PAPER #2

PAPER #3

PAPER #4

Paper #2 cites

paper #4

CO-AUTHOR GRAPH

nodes are authors

(undirected) edges indicate

co-authorship

J. DOE

M. SMITH

A. PERSON

J. BLOGGS

A. PERSON and J. DOE have

co-authored a paper together

N-GRAM GRAPH

nodes are title subsequences

(weighted) edges count

co-occurrence

METHOD

FOR

A

FAST

FOURIER

THEORY

STATISTICS

OF

“STATISTICS OF” and “METHOD FOR”

co-occur in 5,334 document titles

5,334 10,471

998 83

14

677

VLG - 18

BAM 02/15/12

Eigenspace Analysis of Citation Graph

• 4,668,824 documents (documents in the database and those cited)

• Modularity matrices integrated over a 5-year window with a linear ramp filter

• Exceptionally large eigenvalues coincide with documents with thousands of citations (typically review articles)

• 549,726 unique authors

• Modularity matrices integrated over a 5-year window with a linear ramp filter

• Exceptionally large eigenvalues coincide with papers with many authors (sudden, large cliques)

Eigenvalues significantly larger than the general trend correspond

to “clutter”-type behavior B. A. Miller et al., “A scalable signal processing architecture for massive graph analysis,” IEEE ICASSP, 2012. To appear.

VLG - 19

BAM 02/15/12

Emerging Clusters: Citation Graph

• Consider vertices aligned with principal eigenvectors after

manually pruning large citation lists • i.e., ignore eigenvectors highly concentrated on a single vertex

• Two subsets of vertices with significant internal connectivity • Most documents are in biochemistry and microbiology

• Largely focus on metabolic properties of acids and proteins

• Temporal integration increases the strength of these subsets in the

residuals space

1950 1951 1952 1953 1954

Emerging interconnectedness in a major research area

emphasized by linear ramp filter

VLG - 20

BAM 02/15/12

Emerging Clusters: Coauthor Graph

• Consider vertices aligned with principal eigenvectors after

manually pruning large author lists • Ignore eigenvectors aligned with sudden cliques

• Again, two tightly-connected subgraphs emerge • One arises from pathology case records of Mass. General Hospital,

published in New England J. Med.

• Similar periodicals in the (newly-founded) American J. Med comprise

the other cluster

• In both cases, teams of medical researchers form gradually over time

• Ramp filter emphasizes the densifying behavior

1946 1947 1948 1949 1950

Emerging collaboration network stands out after temporal integration

VLG - 21

BAM 02/15/12

• Introduction




• Summary

Outline

VLG - 22

BAM 02/15/12

Preferential Attachment with Memory

Before 1945

Since 1945

• Preferential attachment is a

popular model for graph evolution • New nodes connect to existing

ones with probability proportional

to degree

• Does not account for recency

• New model: predict new

attachment rates by applying a

finite impulse response filter to

the sequence of recent

connection counts

• Captures more variability in the

WOS citation network than other

models in the literature

Least squares fit of coefficients:

more correlation with recent

citations than older ones

New perspective on preferential attachment; integrating this

technique into residuals analysis

VLG - 23

BAM 02/15/12

Filtering Away Periodic Behavior

)()(~

)(~

)(min

1

ktGhtG

tGtG

k

k

F

• In some applications,

connections from autonomous

nodes exhibit periodic behavior • Observed in web traffic data

• Model the expected graph as a

linear combination of previous

graphs to account for this effect

• Recent results indicate that this

technique helps detect non-

periodic coordinated activity

Moving average filter emphasizes aperiodic behavior and improves

anomaly detection

VLG - 24

BAM 02/15/12

COEFFICIENT VECTOR COVARIATE VECTOR

Metadata-Based Modeling

• Topology-based modeling has

significant practical limitations • Connection likelihood depends

on more than degree

• Additional vertex or edge data can

improve the expected value model • Used, e.g., in link prediction

• A recent statistical framework*

models the probability of

connection as a function of vertex

and edge parameters:

Subject-to-Subject Citation Counts

Additional graph metadata improves connection likelihood

estimation, enabling more detailed residuals analysis

More citations within a subject

than between subjects

*P. O. Perry and P. J. Wolfe, “Point process modeling for directed interaction networks,”

2010, http://arxiv.org/abs/1011.1703.

T

ijji xvvP )(logit

VLG - 25

BAM 02/15/12

Summary

• Applying a dynamic subgraph detection framework to large

graph data to pull emergent activity out of the background

• In the Web of Science document database, emergent

clusters are found underneath some clutter activity in

citation and coauthorship graphs using a simple expected

value term • Integrating residuals over time emphasizes dynamic behavior

(e.g., densification), enabling the detection of gradually-

growing subnetworks

• New models for graph behavior incorporate dynamics of the

topology and vertex and edge metadata into the graph’s

expected value

• Ongoing work includes experimentation with the entire WOS

corpus and integration of the new models into the analysis

of residuals

VLG - 26

BAM 02/15/12

Acknowledgements

• Nick Arcolano

• Michelle Beard

• Jeremy Kepner

• Matt Schmidt

• Bill Campbell

• Cliff Weinstein

• Patrick Wolfe

Documents

Emergent Behavior Detection in Massive Graphs · 40 Vert.20 Vertices 42 Edges30 Edges Seven Bridges of Konigsberg 1735 4 Vertices 7 Edges Hamilton’s Zachary’s Karate Club 1977