scalable and efficient algorithms for analysis of massive, streaming graphs
E. Jason Riedy and David A. Bader
MS76 Scalable Network Analysis: Tools, Algorithms, Applications
SIAM PP, 15 April 2016
HPC Lab, School of Computational Science and Engineering
Georgia Institute of Technology
motivation and applications
(insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graphs are a motif / theme in data analysis.
• Changing and dynamic graphs are important!
outline
1. Motivation and background
2. Incremental PageRank
3. Seed set expansion
4. Community maintenance
5. STINGER: Framework for streaming graph analysis
why graphs?
Another tool, like dense and sparse linear algebra.
• Combine things with pairwise relationships
• Smaller, more generic than raw data.
• Taught (roughly) to all CS students...
• Semantic attributions can capture essential relationships.
• Traversals can be faster than filtering DB joins.
• Provide clear phrasing for queries about relationships.
potential applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine if new connections can access a device or represent a new threat in < 5ms...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Integrate all the customer’s data, identify in real-time
streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
We need to analyze only the changes and not the entire graph.
Throughput & latency trade off and expose different levels of concurrency.
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
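The quoted rates can be checked with a few lines of arithmetic. This is only a rough sketch: the Ethernet framing figures (64 to 1518 byte frames plus 20 bytes of preamble, start delimiter, and inter-frame gap) are standard values assumed here, not taken from the slide.

```python
# Back-of-the-envelope check of the data rates quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

twitter_per_sec = 500e6 / SECONDS_PER_DAY   # 500M posts per day
facebook_per_sec = 3e6 / 60                 # 3M posts per minute
flow_budget_us = 1e6 / 130_000              # microseconds available per flow

# Assumption (not on the slide): every frame costs 20 extra bytes on the
# wire, and frame sizes range from 64 to 1518 bytes.
LINK_BITS_PER_SEC = 1e9
WIRE_OVERHEAD_BYTES = 20
min_pps = LINK_BITS_PER_SEC / ((1518 + WIRE_OVERHEAD_BYTES) * 8)
max_pps = LINK_BITS_PER_SEC / ((64 + WIRE_OVERHEAD_BYTES) * 8)

print(f"Twitter:  {twitter_per_sec:,.0f} posts/s")
print(f"Facebook: {facebook_per_sec:,.0f} posts/s")
print(f"10 GigE:  {flow_budget_us:.2f} us per flow")
print(f"GigE:     {min_pps:,.0f} to {max_pps:,.0f} packets/s")
```

Under these assumptions the results land close to the rounded figures on the slide: roughly 6k and 50k posts per second, under 7.7 µs per flow, and about 81k to 1.5M packets per second.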
streaming graph analysis
Terminology:
• Streaming changes into a massive, evolving graph
• Not CS streaming algorithm (tiny memory)
• Need to handle deletions as well as insertions
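To make the terminology concrete, here is a minimal sketch of applying a batch of insertions and deletions to an in-memory graph. It is a toy illustration and not the STINGER data structure or API; the function name `apply_batch` and the `("+", u, v)` / `("-", u, v)` update encoding are hypothetical.

```python
from collections import defaultdict


def apply_batch(adj, batch):
    """Apply edge updates to an undirected graph stored as adjacency sets.

    adj   -- defaultdict(set) mapping vertex -> set of neighbors
    batch -- iterable of (op, u, v), where op is "+" (insert) or "-" (delete)

    Returns the set of vertices whose neighborhoods actually changed, which
    is the only part of the graph an incremental analysis needs to revisit.
    """
    touched = set()
    for op, u, v in batch:
        if op == "+":
            if v not in adj[u]:          # ignore duplicate insertions
                adj[u].add(v)
                adj[v].add(u)
                touched.update((u, v))
        elif op == "-":
            if v in adj[u]:              # ignore deletions of absent edges
                adj[u].discard(v)
                adj[v].discard(u)
                touched.update((u, v))
        else:
            raise ValueError(f"unknown update operation: {op!r}")
    return touched


graph = defaultdict(set)
changed = apply_batch(graph, [("+", 1, 2), ("+", 2, 3), ("-", 1, 2), ("-", 4, 5)])
print(sorted(changed))
```

Returning the touched vertices, rather than just mutating the graph, is the design point: later analyses can restrict their work to that set.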
Previous throughput results (not comprehensive review):
Data ingest: >2M up/sec [Ediger, McColl, Poovey, Campbell, & B 2014]
Clustering coefficients: >100K up/sec [R, Meyerhenke, Bader, Ediger, & Mattson 2012]
Connected comp.: >1M up/sec [McColl, Green, & B 2013]
Community clustering: >100K up/sec∗ [R & B 2013]
incremental pagerank
pagerank
Everyone’s “favorite” metric: PageRank.
• Stationary distribution of the random surfer model.
• Eigenvalue problem can be re-phrased as a linear system
  $(I - \alpha A^T D^{-1}) x = k v$, with
  $\alpha$: teleportation constant, much < 1
  $A$: adjacency matrix
  $D$: diagonal matrix of out degrees, with x/0 = x (self-loop)
  $v$: personalization vector, here 1/|V|
  $k$: irrelevant scaling constant
• Amenable to analysis, etc.
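As a small illustration of the linear-system view, the sketch below solves $(I - \alpha A^T D^{-1}) x = k v$ by Jacobi iteration in plain Python. The function name, the dictionary-based graph format, the default `alpha=0.85`, and the choice `k = 1 - alpha` (so that x sums to one) are assumptions made for the example, not details from the talk.

```python
def pagerank_jacobi(out_edges, alpha=0.85, tol=1e-12, max_iter=1000):
    """Jacobi iteration for (I - alpha A^T D^{-1}) x = k v.

    out_edges maps every vertex to the list of its out-neighbors.
    v is uniform (1/|V|); k = 1 - alpha is an assumed scaling.
    """
    n = len(out_edges)
    k = 1.0 - alpha
    v = 1.0 / n
    x = {u: v for u in out_edges}
    for _ in range(max_iter):
        new_x = {u: k * v for u in out_edges}
        for u, nbrs in out_edges.items():
            if nbrs:
                share = alpha * x[u] / len(nbrs)
                for w in nbrs:
                    new_x[w] += share
            else:
                # x/0 = x: a vertex with no out-edges acts as a self-loop.
                new_x[u] += alpha * x[u]
        change = sum(abs(new_x[u] - x[u]) for u in out_edges)
        x = new_x
        if change < tol:
            break
    return x


ranks = pagerank_jacobi({0: [1, 2], 1: [2], 2: [0]})
print(ranks)
```

Each sweep touches every edge once, which is exactly the cost the incremental variants try to avoid.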
incremental pagerank
• Streaming data setting, update PageRank without touching the entire graph.
• Existing methods maintain databases of walks, etc.
• Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph, want to solve for $x + \Delta x$.
• Simple algebra:
  $(I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x$,
  and the right-hand side is sparse.
• Re-arrange for Jacobi,
  $\Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x$,
  iterate, ...
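The Jacobi re-arrangement for the update can be sketched as follows. This is an illustrative version under stated assumptions: the helper names `push` and `delta_pagerank` are hypothetical, both graphs share one vertex set, and the right-hand side is formed densely for clarity even though only entries near the changed edges are nonzero.

```python
def push(out_edges, x):
    """Return A^T D^{-1} x, treating a vertex with no out-edges as a self-loop."""
    y = {u: 0.0 for u in out_edges}
    for u, nbrs in out_edges.items():
        if nbrs:
            share = x[u] / len(nbrs)
            for w in nbrs:
                y[w] += share
        else:
            y[u] += x[u]
    return y


def delta_pagerank(old_edges, new_edges, x, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate dx <- alpha * P_new dx + b, with b = alpha * (P_new - P_old) x."""
    old_push = push(old_edges, x)
    new_push = push(new_edges, x)
    b = {u: alpha * (new_push[u] - old_push[u]) for u in new_edges}
    dx = dict(b)
    for _ in range(max_iter):
        pushed = push(new_edges, dx)
        nxt = {u: alpha * pushed[u] + b[u] for u in new_edges}
        change = sum(abs(nxt[u] - dx[u]) for u in new_edges)
        dx = nxt
        if change < tol:
            break
    return dx


old = {0: [1], 1: [2], 2: [0]}
new = {0: [1, 2], 1: [2], 2: [0]}          # edge 0 -> 2 inserted
x_old = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}     # exact PageRank of the 3-cycle
dx = delta_pagerank(old, new, x_old)
x_new = {u: x_old[u] + dx[u] for u in x_old}
print(x_new)
```

In this toy case the old solution is exact, so the update is too; the sketch does not reproduce the drift that appears when x is itself only approximate and updates are chained.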
incremental pagerank: accumulating error
• And fail. The updated solution wanders away fromthe true solution. Top rankings stay the same...
incremental pagerank: think instead
• The old solution x is an ok, not exact, solution to the original problem, now a nearby problem.
• How close? Residual:
  $r' = k v - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1}) x$.
• Solve $(I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = r'$.
• Cheat by not refining all of $r'$, only region growing around the changes:
  $(I - \alpha A_\Delta^T D_\Delta^{-1}) \Delta x = r'|_\Delta$
• (Also cheat by updating r rather than recomputing at the changes.)
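One way to realize the restricted solve is a push-style (Gauss-Southwell) relaxation that starts from the vertices where the residual changed and only grows outward while residual entries stay above a tolerance. This is a swapped-in, well-known technique used for illustration and is not claimed to be the authors' implementation. The function name `restricted_update` is hypothetical, and the sketch assumes the held residual r is negligible, so that $r'|_\Delta$ reduces to the change term.

```python
from collections import deque


def restricted_update(old_edges, new_edges, x, changed, alpha=0.85, tol=1e-10):
    """Approximate dx for (I - alpha P_new) dx = r'|_Delta by residual pushing.

    changed -- vertices whose out-edge lists differ between the two graphs.
    Returns (dx, r): the sparse update and the leftover residual.
    """
    # r'|_Delta = alpha * (P_new - P_old) x, nonzero only next to the changes.
    r = {}
    for u in changed:
        for edges, sign in ((new_edges, 1.0), (old_edges, -1.0)):
            nbrs = edges[u] or [u]       # x/0 = x: no out-edges means self-loop
            share = sign * alpha * x[u] / len(nbrs)
            for w in nbrs:
                r[w] = r.get(w, 0.0) + share

    dx = {}
    queue = deque(u for u in r if abs(r[u]) > tol)
    queued = set(queue)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        rho = r[u]
        if abs(rho) <= tol:
            continue
        # Move the residual at u into the solution and update (not recompute)
        # the residual at u's out-neighbors.
        r[u] = 0.0
        dx[u] = dx.get(u, 0.0) + rho
        nbrs = new_edges[u] or [u]
        share = alpha * rho / len(nbrs)
        for w in nbrs:
            r[w] = r.get(w, 0.0) + share
            if abs(r[w]) > tol and w not in queued:
                queue.append(w)
                queued.add(w)
    return dx, r


old = {0: [1], 1: [2], 2: [0]}
new = {0: [1, 2], 1: [2], 2: [0]}          # edge 0 -> 2 inserted
x_old = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
dx, leftover = restricted_update(old, new, x_old, changed=[0])
x_new = {u: x_old[u] + dx.get(u, 0.0) for u in x_old}
print(x_new)
```

Vertices that never receive a residual above `tol` are never visited, which is what keeps the work proportional to the affected region rather than the whole graph.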
incremental pagerank: works
Riedy, GABB at IPDPS 2016, to appear.
incremental pagerank: worst latency
[Figure: update time (s) versus batch size (10, 100, 1000) for the algorithms dpr, dprheld, and pr_restart, with one panel per graph: power, PGPgiantcompo, caidaRouterLevel, belgium.osm, coPapersCiteseer.]
Riedy, GABB at IPDPS 2016, to appear.
seed set expansion
graphs: big, nasty hairballs
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
but no shortage of structure...
Protein interactions, Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• There are methods for global community detection.
• Also need local communities around seeds for queries and targeted analysis.
seed set expansion
• Seed set expansion finds the “best” subgraph or communities for a set of vertices of interest
  • Many quality criteria: Modularity, conductance, etc.
• Can be applied to cryptocurrency to identify and track groups of interacting entities
• Dynamic algorithm updates communities faster than recomputation, allowing us to keep up with new data produced
static seed set expansion
Greedy expansion starting from S = { seed vertices }:
1. Check the fitness of every vertex v neighboring S.
   • fitness = f(S ∪ {v}) − f(S)
2. If any fitness is positive, include most fit v in S.
   • Currently sequential, could include all sufficiently good neighbors.
3. Record at which step v is included.
   • This list is the base for updates.
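The three steps above can be sketched as a short greedy loop. The quality function here is negative conductance, picked only as an example of a criterion f. The function names, the adjacency-set graph format, and the tie-breaking by sorted vertex order are assumptions for illustration.

```python
def conductance(adj, S):
    """Cut edges of S divided by the smaller of vol(S) and vol(V \\ S)."""
    vol_s = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_s
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 1.0


def neg_conductance(adj, S):
    """Example quality f(S); larger is better."""
    return -conductance(adj, S)


def expand_seed_set(adj, seeds, f=neg_conductance):
    """Greedy expansion; returns the set S and the step at which each vertex joined."""
    S = set(seeds)
    order = [(v, 0) for v in sorted(S)]    # seeds join at step 0
    step = 0
    while True:
        base = f(adj, S)
        frontier = {w for u in S for w in adj[u]} - S
        best, best_fit = None, 0.0
        for v in sorted(frontier):         # step 1: fitness of each neighbor
            fit = f(adj, S | {v}) - base
            if fit > best_fit:
                best, best_fit = v, fit
        if best is None:                   # no positive fitness: stop
            break
        step += 1
        S.add(best)                        # step 2: include the most fit vertex
        order.append((best, step))         # step 3: record when it joined
    return S, order


# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, order = expand_seed_set(adj, seeds=[0])
print(community, order)
```

On this toy graph the expansion from vertex 0 stops at its own triangle, since crossing the bridge would raise the conductance.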
Now the dynamic version by example...
dynamic seed set example
In preparation with Anita Zakrzewska and Eisha Nathan
dynamic seed set quality
Graphs from the Koblenz Network Collection.
dynamic seed set speed-up
Graphs from the Koblenz Network Collection.
global community updates
community detection
• Partition a graph’s vertices into disjoint communities.
• A community locally optimizes some metric, NP-hard.
• Trying to capture that vertices are more similar within one community than between communities.
Jason’s network via LinkedIn Labs
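As one concrete example of such a metric, the sketch below computes modularity for a given partition. Modularity is used here only because it is a common choice; the slide does not say which metric is optimized. The sketch assumes a simple, unweighted, undirected graph without self-loops, and the function name and data layout are hypothetical.

```python
def modularity(adj, community):
    """Modularity Q = sum over communities c of m_c / m - (vol_c / 2m)^2.

    adj       -- vertex -> set of neighbors (undirected, every edge stored twice)
    community -- vertex -> community label
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())
    internal = {}   # per community: twice the number of internal edges
    volume = {}     # per community: sum of member degrees
    for u, nbrs in adj.items():
        c = community[u]
        volume[c] = volume.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(1 for w in nbrs if community[w] == c)
    return sum(internal[c] / two_m - (volume[c] / two_m) ** 2 for c in volume)


# Two triangles joined by the bridge 2-3, split at the bridge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_split = modularity(adj, split)
q_single = modularity(adj, {u: "all" for u in adj})
print(q_split, q_single)
```

Splitting at the bridge scores higher than lumping everything into one community, which matches the intuition that vertices are more similar within a community than between communities.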
what about streaming?
• Simple approach based on agglomeration:
  1. Extract all vertices touched by an update.
  2. Re-start agglomeration.
• “Works” and is fast (MTAAP 2013), but never mentioned quality.
  • Extracted vertices form bridges and do not re-merge.
• Some methods based on label propagation, not metric-driven.
• Backtracking (e.g. Görke, et al., JEA 2013) preserves quality at cost of change size.
  • Ongoing: Can we limit backtracking?
community quality: achievable
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
community quality: change size
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
closing
future directions
• Of course, continuing to develop streaming / dynamic / incremental algorithms.
  • For massive graphs, computing small changes is always a win.
• Improving approximations or replacing expensive metrics like betweenness centrality would be great.
• Including more external and semantic data.
  • If vertices are documents or data records, many more measures of similarity.
  • Only now being exploited in concert with static graph algorithms.
hpc lab people
Faculty:
• David A. Bader
• Oded Green
Data:
• Pushkar Godbolé
• Anita Zakrzewska
• Eisha Nathan
STINGER:
• Robert McColl,
• James Fairbanks,
• Adam McLaughlin,
• Daniel Henderson,
• David Ediger (now GTRI),
• Jason Poovey (GTRI),
• Karl Jiang, and
• feedback from users in industry, government, academia
Support: DoD, DoE, NSF, Intel, IBM, Oracle
stinger: where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing with contributions.
Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi, ...