Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Emergent Behavior Detection in
Massive Graphs
Nadya T. Bliss and Benjamin A. Miller
MIT Lincoln Laboratory
15 February 2012
Minisymposium on Parallel Analysis of Massive Social Networks
SIAM Conference on Parallel Processing for Scientific Computing
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Contract FA8721-05-C-
0002. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or implied, of IARPA or the U.S. Government.
VLG - 2
BAM 02/15/12
Large Scale Data & Information Processing
High Level Composable API:
D4M
Novel Analytics:
Graph Analysis
Network Discovery
WEAK SIGNATURES,
NOISY DATA,
DYNAMICS
High Performance Computing:
LLGrid
Interactive
super-
computing
Distributed Database:
Accumulo
Distributed
database/
distributed file
system
A
C
E
B
Array
algebra
VLG - 3
BAM 02/15/12
Graphs and Networks From Data to Relationships
Graphs are a natural way to represent relationships between entities
Graph analytics enable detection of unobserved coordination
between discrete elements or events in massive, noisy datasets
ISR Cyber Social
• Graphs represent entities
and relationships detected
through multi-INT sources
• CONOPS: Identify
anomalous patterns of
activities
• Graphs represent
communication patterns of
computers on a network
• CONOPS: Detect cyber
attacks or malicious
software
• Graphs represent
relationships between
individuals or documents
• CONOPS: Identify hidden
networks
VLG - 4
BAM 02/15/12
Trends in Graph Data Sizes
AIDS Sex
Partners
1985
40 Vert.
42 Edges
Seven Bridges
of Konigsberg
1735
4 Vertices
7 Edges
Hamilton’s
Game
1856
20 Vertices
30 Edges
Zachary’s
Karate Club
1977
34 Vertices
78 Edges Les Miserables
Characters
1993
77 Vertices
254 Edges
Internet
Topology
1999
88,107 Vertices
99,664 Edges
Western U.S.
Power Grid
1998
4,941 Vertices
6,594 Edges
Enron E-mail
Corpus
2005
93,526 Vert.
344,264 Edg.
1800 1700 2010 1990 1980 1970 2000
10 10,000 60 50 40 30 1,000 0 70 80 100,000 20
Classical Era Modern Era
2010
42M V
>100M E
Citation Data
VLG - 5
BAM 02/15/12
Massive Graph Analysis
Efforts at MIT Lincoln Laboratory
Dynamic
Techniques
Development of Novel
Graph Analytics Architecture for Big Data
A
C
E
B
D
D4M High Level
Composable API
AccumuloDistributed
Database
LLGrid
High Performance
Computing
AA Ε
-Graph Model
Residual Analysis
Community Development
Lincoln Laboratory is involved in several aspects of graph research;
this presentation focuses on analytics
VLG - 6
BAM 02/15/12
Detecting Emergent Activity
recruitment planning execution (Example threat network created under the Counter Terror Social Network Analysis and Intent Recognition effort at Lincoln Laboratory)
Evolving
planning of
violent activity
Emerging
coordinated activity
(e.g., botnets)
Emerging
technologies in
scientific literature
Detection of emergent behavior is applicable to a variety of domains
VLG - 7
BAM 02/15/12
• Introduction
• Detection Framework
• Example Dataset: Web of Science
• Ongoing Work: New Algorithms and Models
• Summary
Outline
VLG - 8
BAM 02/15/12
Graph Based Residuals Analysis
• Least-squares residuals from
a best-fit line
• Analysis of variance (ANOVA)
describes fit
• “Explained” vs “unexplained”
variance signal/noise
discrimination
Linear Regression
• “Residuals” from a best-fit
graph model
• Analysis of variance from
expected topology
• Unexplained variance in
graph residuals subgraph
detection
Graph “Regression”
VLG - 9
BAM 02/15/12
Subgraph Detection Algorithm Overview
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
Input:
• A, adjacency matrix
representation of G
• No cue
Output:
• vs, set of vertices identified as
belonging to subgraph Gf
Processing chain* for subgraph detection analogous to a traditional signal processing chain
*N. Bliss, B. A. Miller and P. J. Wolfe, “Spectral Methods for Subgraph Detection,” SIAM Annual Meeting, 2010.
VLG - 10
BAM 02/15/12
Residuals Construction
*M. E. J. Newman. “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, 74:036104, 2006.
M
KKAB
T
2
EXAMPLE:
GRAPH G
1 2
3
4 7
6
5
20
1 -
NUMBER
OF EDGES
ADJACENCY
MATRIX
1
2
3
4
5
6
7
1 2 3 4 5 6 7
*
DEGREE VECTOR
2
3
3
3
3
2
4
2 3 3 3 3 2 4
•The modularity matrix* is an example of a residuals matrix • Commonly used to evaluate quality of division of a graph into communities
• Observed minus “expected” edges, yielding residuals
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
VLG - 11
BAM 02/15/12
Eigen Decomposition
B=UDUT
Projection onto principal components of the modularity matrix yields good
separation between background and a foreground with large residuals
Eigenvalues, sorted
by magnitude
N
N
1
2
1
...
Corresponding
eigenvectors
u1 |u2 | ... |uN-1 |uN[ ]
Principal
components • Each point represents a vertex in G
• Vertices in G: 1024
• Vertices in Gf: 12
• Uncued background/foreground separation
COLORING BASED
ON KNOWN TRUTH
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
VLG - 12
BAM 02/15/12
Toy Detection Example
Background (normal behavior), Gb only
Background and foreground (includes
anomalous behavior), Gb + Gf
H0
H1
H0 and H1 distributions are well separated
TEST STATISTIC:
SYMMETRY OF THE
PROJECTION ONTO
SELECTED
COMPONENTS H0 H1
Powerlaw Background, 12-Vertex Dense Subgraph
Test Statistic
B.A. Miller, N.T. Bliss and P.J. Wolfe, “Toward signal processing theory for graphs and non-Euclidean data,” in Proc. ICASSP, 2010.
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
VLG - 13
BAM 02/15/12
Dynamic Graph Processing Chain
Output:
• vs, set of vertices identified as
belonging to subgraph Gf
Processing chain for dynamic subgraph detection incorporates temporal integration
TEMPORAL
INTEGRATION
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
Input:
• A(t), time-varying
adjacency matrix
representation of G
• No cue
time
identified vertices
VLG - 14
BAM 02/15/12
Temporal Integration
)(nB )1( nB )1( nB
Temporal integration achieved by creating a linear
combination of individual time-step graphs
)0(h )1(h )1( h+ + +
1
0
)()()(~
k
knBkhnB : Weighted sum of residuals
TEMPORAL
INTEGRATION
RESIDUALS
CONSTRUCTION
EIGEN
DECOMPOSITION
COMPONENT
SELECTION DETECTION IDENTIFICATION
Modularity
(residuals)
matrix at time
n
Filter
coefficients
(scalars)
VLG - 15
BAM 02/15/12
• Introduction
• Detection Framework
• Example Dataset: Web of Science
• Ongoing Work: New Algorithms and Models
• Summary
Outline
VLG - 16
BAM 02/15/12
Thomson Reuters “Web of Science” Database
• Commercially available research database of papers in the sciences, social sciences, arts, and humanities
– More than 42 million records from 1900 to present
– Articles from over 12,000 journals and 148,000 conference proceedings
• Records typically include
– Author(s), title, publication date, type
– Document IDs for works cited
– May also include a number of other fields, e.g. subject area, institution,
keywords, abstract, as provided by the publication
VLG - 17
BAM 02/15/12
Examples of Web of Science (WOS) Graphs
CITATION GRAPH
nodes are papers
(directed) edges are citations
PAPER #1
PAPER #2
PAPER #3
PAPER #4
Paper #2 cites
paper #4
CO-AUTHOR GRAPH
nodes are authors
(undirected) edges indicate
co-authorship
J. DOE
M. SMITH
A. PERSON
J. BLOGGS
A. PERSON and J. DOE have
co-authored a paper together
N-GRAM GRAPH
nodes are title subsequences
(weighted) edges count
co-occurrence
METHOD
FOR
A
FAST
FOURIER
THEORY
STATISTICS
OF
“STATISTICS OF” and “METHOD FOR”
co-occur in 5,334 document titles
5,334 10,471
998 83
14
677
VLG - 18
BAM 02/15/12
Eigenspace Analysis of Citation Graph
• 4,668,824 documents (documents in the database and those cited)
• Modularity matrices integrated over a 5-year window with a linear ramp filter
• Exceptionally large eigenvalues coincide with documents with thousands of citations (typically review articles)
• 549,726 unique authors
• Modularity matrices integrated over a 5-year window with a linear ramp filter
• Exceptionally large eigenvalues coincide with papers with many authors (sudden, large cliques)
Eigenvalues significantly larger than the general trend correspond
to “clutter”-type behavior B. A. Miller et al., “A scalable signal processing architecture for massive graph analysis,” IEEE ICASSP, 2012. To appear.
VLG - 19
BAM 02/15/12
Emerging Clusters: Citation Graph
• Consider vertices aligned with principal eigenvectors after
manually pruning large citation lists • i.e., ignore eigenvectors highly concentrated on a single vertex
• Two subsets of vertices with significant internal connectivity • Most documents are in biochemistry and microbiology
• Largely focus on metabolic properties of acids and proteins
• Temporal integration increases the strength of these subsets in the
residuals space
1950 1951 1952 1953 1954
Emerging interconnectedness in a major research area
emphasized by linear ramp filter
VLG - 20
BAM 02/15/12
Emerging Clusters: Coauthor Graph
• Consider vertices aligned with principal eigenvectors after
manually pruning large author lists • Ignore eigenvectors aligned with sudden cliques
• Again, two tightly-connected subgraphs emerge • One arises from pathology case records of Mass. General Hospital,
published in New England J. Med.
• Similar periodicals in the (newly-founded) American J. Med comprise
the other cluster
• In both cases, teams of medical researchers form gradually over time
• Ramp filter emphasizes the densifying behavior
1946 1947 1948 1949 1950
Emerging collaboration network stands out after temporal integration
VLG - 21
BAM 02/15/12
• Introduction
• Detection Framework
• Example Dataset: Web of Science
• Ongoing Work: New Algorithms and Models
• Summary
Outline
VLG - 22
BAM 02/15/12
Preferential Attachment with Memory
Before 1945
Since 1945
• Preferential attachment is a
popular model for graph evolution • New nodes connect to existing
ones with probability proportional
to degree
• Does not account for recency
• New model: predict new
attachment rates by applying a
finite impulse response filter to
the sequence of recent
connection counts
• Captures more variability in the
WOS citation network than other
models in the literature
Least squares fit of coefficients:
more correlation with recent
citations than older ones
New perspective on preferential attachment; integrating this
technique into residuals analysis
VLG - 23
BAM 02/15/12
Filtering Away Periodic Behavior
)()(~
)(~
)(min
1
ktGhtG
tGtG
k
k
F
• In some applications,
connections from autonomous
nodes exhibit periodic behavior • Observed in web traffic data
• Model the expected graph as a
linear combination of previous
graphs to account for this effect
• Recent results indicate that this
technique helps detect non-
periodic coordinated activity
Moving average filter emphasizes aperiodic behavior and improves
anomaly detection
VLG - 24
BAM 02/15/12
COEFFICIENT VECTOR COVARIATE VECTOR
Metadata-Based Modeling
• Topology-based modeling has
significant practical limitations • Connection likelihood depends
on more than degree
• Additional vertex or edge data can
improve the expected value model • Used, e.g., in link prediction
• A recent statistical framework*
models the probability of
connection as a function of vertex
and edge parameters:
Subject-to-Subject Citation Counts
Additional graph metadata improves connection likelihood
estimation, enabling more detailed residuals analysis
More citations within a subject
than between subjects
*P. O. Perry and P. J. Wolfe, “Point process modeling for directed interaction networks,”
2010, http://arxiv.org/abs/1011.1703.
T
ijji xvvP )(logit
VLG - 25
BAM 02/15/12
Summary
• Applying a dynamic subgraph detection framework to large
graph data to pull emergent activity out of the background
• In the Web of Science document database, emergent
clusters are found underneath some clutter activity in
citation and coauthorship graphs using a simple expected
value term • Integrating residuals over time emphasizes dynamic behavior
(e.g., densification), enabling the detection of gradually-
growing subnetworks
• New models for graph behavior incorporate dynamics of the
topology and vertex and edge metadata into the graph’s
expected value
• Ongoing work includes experimentation with the entire WOS
corpus and integration of the new models into the analysis
of residuals
VLG - 26
BAM 02/15/12
Acknowledgements
• Nick Arcolano
• Michelle Beard
• Jeremy Kepner
• Matt Schmidt
• Bill Campbell
• Cliff Weinstein
• Patrick Wolfe