STINGER Dynamic Graph Analysis

Introduction to STINGER


DESCRIPTION

An introduction to the STINGER dynamic graph structure and analysis package. Shows the motivation for STINGER, what has been done with it, and how you can use it. More at http://cc.gatech.edu/stinger


Page 1: Introduction to STINGER

STINGER
Dynamic Graph Analysis

Page 2: Introduction to STINGER

Contributors

• David Bader
• David Ediger
• Rob McColl
• Jason Riedy
• Kamesh Madduri
• Jason Poovey

Page 3: Introduction to STINGER

Outline

• Motivation

• Dynamic Graph Basics

• What is STINGER?

• What can STINGER do?

• Why STINGER?

Page 4: Introduction to STINGER

Big Data problems need Graph Analysis

• Health Care: Finding outbreaks, population epidemiology

• Social Networks: Advertising, searching, grouping, influence

• Intelligence: Decisions at scale, regulating algorithms

• Systems Biology: Understanding interactions, drug design

• Power Grid: Disruptions, conversion

• Simulation: Discrete events, cracking meshes

Page 5: Introduction to STINGER

Graphs are pervasive

• Graphs: things and relationships

• Different kinds of things, different kinds of relationships, but graphs provide a framework for analyzing the relationships.

• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

Astrophysics
  Problem: Outlier detection
  Challenges: Massive data sets, temporal variation
  Graph Problems: Matching, clustering

Bioinformatics
  Problem: Identifying target proteins
  Challenges: Data heterogeneity, quality
  Graph Problems: Centrality, clustering

Social Informatics
  Problem: Emergent behavior, information spread
  Challenges: New analysis, data uncertainty, scale
  Graph Problems: Clustering, flows, shortest paths

Page 6: Introduction to STINGER

Data rates and volumes are immense

• Facebook:
  • ~1 billion users
  • Average of 130 friends per user
  • 30 billion pieces of content shared / month

• Twitter:
  • 500 million active users
  • 340 million tweets / day

• Internet – 100s of exabytes / year
  • 300 million new websites per year
  • 48 hours of video uploaded to YouTube per minute
  • 30,000 YouTube videos played per second

Page 7: Introduction to STINGER

Our focus is streaming graphs

• As relationships change
  • Edges (relationships) are inserted, updated, and removed
  • New vertices (things) join and leave the network

• What are the effects?
  • On information flow
  • On community structure
  • On the integrity of data and structure

• Which actors and relationships are…
  • The key players and influencers in the change?
  • The anomalies and threats?


Page 8: Introduction to STINGER

What is STINGER?

Spatio-Temporal Interaction Networks and Graphs Extensible Representation
D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, S. C. Poulos

• A scalable, high performance in-memory dynamic graph data structure
  • Stores semantic and temporal information.
  • Designed to be flexible and extendable.
  • Be useful for the entire “large graph” community.
  • Permit good performance: No single structure is optimal for all.
  • Assume globally addressable memory access.
  • Support multiple, parallel readers and a single parallel writer.

• A software suite for dynamic graph analysis
  • Targets large shared-memory x86 and the Cray XMT
  • Written in C with OpenMP and XMT pragma support for parallelism

Page 9: Introduction to STINGER

As a data structure

• Fast insertions, deletions, and updates: A data structure that grows and changes at the speed of the data.

• Edge and vertex types and weights: Represent complex relationships and multiple simultaneous networks.

• Filtering traversal mechanisms: Traverse serially or in parallel on specific edge types, time ranges, vertex sets, etc.

• Experimental workflow server: Multiple data streams and analytics with one persistent data structure.

• Experimental Java and Python bindings: Use productivity-oriented languages without sacrificing performance.
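To make the ideas of typed, weighted, timestamped edges and filtered traversal concrete, here is a minimal Python sketch. It is a toy model only, not the STINGER data structure or its bindings: the class `TinyDynamicGraph`, the `Edge` record, and every method name are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Edge:
    dst: int
    etype: str       # hypothetical edge-type label, e.g. "follows"
    weight: float
    timestamp: int   # time of the most recent insert or update


class TinyDynamicGraph:
    """Toy in-memory dynamic graph (illustrative; not STINGER's layout)."""

    def __init__(self):
        # source vertex -> {(destination, edge type): Edge}
        self.adj = defaultdict(dict)

    def insert_edge(self, src, dst, etype, weight, timestamp):
        """Insert a directed edge, or overwrite weight/timestamp if it exists."""
        self.adj[src][(dst, etype)] = Edge(dst, etype, weight, timestamp)

    def remove_edge(self, src, dst, etype):
        """Remove a directed edge; return True if it was present."""
        return self.adj.get(src, {}).pop((dst, etype), None) is not None

    def edges_of(self, src, etype=None, t_min=None, t_max=None):
        """Yield edges out of src, filtered by edge type and time range."""
        for edge in self.adj.get(src, {}).values():
            if etype is not None and edge.etype != etype:
                continue
            if t_min is not None and edge.timestamp < t_min:
                continue
            if t_max is not None and edge.timestamp > t_max:
                continue
            yield edge


if __name__ == "__main__":
    g = TinyDynamicGraph()
    g.insert_edge(1, 2, "follows", 1.0, timestamp=10)
    g.insert_edge(1, 3, "mentions", 1.0, timestamp=20)
    for e in g.edges_of(1, etype="mentions", t_min=15):
        print(e)
```

Keying each adjacency entry by (destination, edge type) lets several relationship types coexist between the same pair of vertices, which is one simple way to model "multiple simultaneous networks" in a single structure.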

Page 10: Introduction to STINGER

As an analysis package

• Streaming edge insertions and deletions: Performs new edge insertions, updates, and deletions in batches or individually.

• Streaming clustering coefficients: Tracks the local and global clustering coefficients of a graph under both edge insertions and deletions.

• Streaming connected components: Accurately tracks the connected components of a graph with insertions and deletions.

• Streaming community detection: Track and update the community structures within the graph as they change.

• Parallel agglomerative clustering: Find clusters that are optimized for a user-defined edge scoring function.

• Streaming Betweenness Centrality: Find the key points within information flows and structural vulnerabilities.

• K-core Extraction: Extract additional communities and filter noisy high-degree vertices.

• Classic breadth-first search: Performs a parallel breadth-first search of the graph starting at a given source vertex to find shortest paths.
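As an illustration of what a streaming clustering coefficient update involves, the sketch below maintains per-vertex triangle counts incrementally: inserting or deleting an undirected edge (u, v) changes the triangle counts only for u, v, and their common neighbors. This is a serial textbook-style sketch under the assumption of a simple undirected graph without self-loops. The class and method names are hypothetical, and it is not STINGER's parallel implementation.

```python
from collections import defaultdict


class StreamingTriangles:
    """Incrementally maintained triangle counts (illustrative sketch)."""

    def __init__(self):
        self.nbrs = defaultdict(set)   # vertex -> set of neighbors
        self.tri = defaultdict(int)    # vertex -> triangles through it

    def _adjust(self, u, v, sign):
        # Each common neighbor w closes (or loses) one triangle {u, v, w}.
        common = self.nbrs[u] & self.nbrs[v]
        for w in common:
            self.tri[w] += sign
        self.tri[u] += sign * len(common)
        self.tri[v] += sign * len(common)

    def insert(self, u, v):
        if u == v or v in self.nbrs[u]:
            return  # ignore self-loops and duplicate edges
        self._adjust(u, v, +1)
        self.nbrs[u].add(v)
        self.nbrs[v].add(u)

    def delete(self, u, v):
        if v not in self.nbrs[u]:
            return  # edge not present
        self.nbrs[u].discard(v)
        self.nbrs[v].discard(u)
        self._adjust(u, v, -1)

    def local_cc(self, v):
        """Local coefficient: 2 * triangles(v) / (deg(v) * (deg(v) - 1))."""
        d = len(self.nbrs[v])
        return 0.0 if d < 2 else 2.0 * self.tri[v] / (d * (d - 1))

    def global_cc(self):
        """Global coefficient: closed wedges divided by all wedges."""
        wedges = sum(len(n) * (len(n) - 1) // 2 for n in self.nbrs.values())
        closed = sum(self.tri.values())  # equals 3 * (number of triangles)
        return 0.0 if wedges == 0 else closed / wedges
```

The cost of each update is proportional to the size of the neighbor-set intersection, rather than to the size of the whole graph, which is the point of tracking the metric instead of recomputing it.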

Page 11: Introduction to STINGER

How is the graph stored?

Page 12: Introduction to STINGER

What can STINGER represent?

• Nearly any set of relationships
  • Healthcare
  • Social Networks
  • Intelligence
  • Systems biology
  • Power grid
  • Travel networks

• Example: Twitter
  • Users, hashtags, tweets as vertex types
  • Authorship, retweet, mentions, follows / followed by edge types

• Example: Work Environment
  • Users, PCs, printers, emails, URLs, files, etc. as vertex types
  • Email alias, from, to, access, logon/off, print, IM, etc. as edge types

Page 13: Introduction to STINGER

What can STINGER do?

• Optimized to update at rates of over 3 million edges per second on graphs of one billion edges
  • D. Ediger, R. McColl, J. Riedy, and D. A. Bader, "STINGER: High Performance Data Structure for Streaming Graphs," The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 20-22, 2012. Best Paper Award.

RMAT – Recursive MATrix graph generator. RMAT(N) indicates 2^N vertices.
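For readers unfamiliar with R-MAT, the following sketch shows the basic recursive idea: each edge is placed by choosing one of four adjacency-matrix quadrants N times, so RMAT(N) yields vertex ids in the range [0, 2^N). The quadrant probabilities used here (0.57, 0.19, 0.19, 0.05) are an assumption borrowed from common benchmark practice; the slides do not state which parameters were used, and the function names are hypothetical.

```python
import random


def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
    """Generate one (src, dst) pair for a graph with 2**scale vertices.

    a, b, c are the probabilities of the top-left, top-right, and
    bottom-left quadrants; the bottom-right gets 1 - a - b - c.
    These defaults are assumed values, not taken from the slides.
    """
    src = dst = 0
    for _ in range(scale):
        src <<= 1
        dst <<= 1
        r = rng.random()
        if r < a:
            pass                # top-left: both next bits are 0
        elif r < a + b:
            dst |= 1            # top-right
        elif r < a + b + c:
            src |= 1            # bottom-left
        else:
            src |= 1            # bottom-right
            dst |= 1
    return src, dst


def rmat_edges(scale, num_edges, seed=0):
    """Generate a list of R-MAT edges (duplicates and self-loops allowed)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng) for _ in range(num_edges)]


if __name__ == "__main__":
    for s, d in rmat_edges(scale=4, num_edges=5, seed=42):
        print(s, d)
```

Because the top-left quadrant is favored at every level, low-numbered vertices accumulate many more edges than high-numbered ones, giving the skewed degree distribution that makes R-MAT a common stand-in for social-network-like graphs.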

Page 14: Introduction to STINGER

What can STINGER do?

• Maintaining connected components in a graph of half a billion edges
  • Up to 1.26 million updates per sec.
  • 137x faster than recomputing.

• Scalable parallel streaming community detection
  • Built on parallel insert / delete mechanisms.

• Streaming approximate betweenness
  • Used to analyze influencers on Twitter during Hurricane Sandy over time.
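To see why maintaining connected components can be much cheaper than recomputing them, consider the insertion-only case, sketched below with a union-find (disjoint-set) structure. This is a deliberately simplified stand-in: it does not handle edge deletions, which the slides say STINGER's algorithm does, and it is not STINGER's method. The class and method names are hypothetical.

```python
class IncrementalComponents:
    """Connected components under edge insertions only (union-find sketch)."""

    def __init__(self):
        self.parent = {}
        self.size = {}
        self.count = 0   # current number of components

    def find(self, v):
        """Return the representative of v's component, adding v if new."""
        if v not in self.parent:
            self.parent[v] = v
            self.size[v] = 1
            self.count += 1
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            nxt = self.parent[v]
            self.parent[v] = root
            v = nxt
        return root

    def insert_edge(self, u, v):
        """Insert edge (u, v); return True if two components merged."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru          # union by size
        self.size[ru] += self.size[rv]
        self.count -= 1
        return True

    def connected(self, u, v):
        return self.find(u) == self.find(v)
```

Each insertion touches only a short chain of parent pointers instead of the whole graph. Deletions are the hard part, since removing an edge may or may not split a component, and handling them requires additional machinery beyond this sketch.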

Page 15: Introduction to STINGER

What does STINGER not do?

• Does not provide all ACID properties
  • Why: Not intended to be the backing data store.
  • Why: Allows for greater ingest and processing speeds.
  • Alternative: Back STINGER ingest with an ACID DB
  • Alternative: STINGER does provide consistency, partial isolation

• No text-based query language – for now
  • Why: Currently, no language is general enough to describe most or all queries
  • Alternative: Filtering traversal APIs, unlimited query flexibility through code
  • Alternative: Productivity language bindings (Python, Java)

• No distributed / Hadoop-like cluster support
  • Why: Good fit for ingest, but poor for streaming analysis; random access is too slow
  • Alternative: Larger shared memory systems such as the Cray XMT and SGI UV systems
  • Alternative: Processing billion-edge graphs in shared memory on affordable Intel servers
  • Alternative: Extract key portions of the graph from a larger data store and perform fast in-memory processing in STINGER

Page 16: Introduction to STINGER

What sizes, performance can it handle?

Desktop (Intel Core i7-2600, 16GB DDR3)

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
1M     8M     22-14    1.184       0.316                      2.7M
2M     16M    22-14    2.384       0.75                       2.3M
4M     33M    22-14    4.768       2                          2.3M
8M     67M    24-14    9.536       5.36                       0.85M
4M     67M    24-14    7.984       3                          1.38M
4M     134M   24-14    14.336      5.7                        0.8M

Server (4x Opteron 6282, 256GB DDR3)

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
16M    512M   25-14    60          13.7                       696K
16M    256M   25-14    24.6        9.82                       2.1M

Cray XMT2 – 64 Processors, 2TB DDR2

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
67M    512M   28-32    86          13.8                       3.3M
268M   4.3B   28-32    312         52.3                       2.34M

• The only limitation on size is system memory
  • Billions of vertices and edges are possible

• V vertices and E edges in each graph
  • E counts are undirected
  • STINGER stores both directions

• Config is STINGER-specific parameters

Page 17: Introduction to STINGER

Why not existing technologies?

• Traditional SQL databases
  • Not structured to do any meaningful graph queries with any level of efficiency or timeliness

• Graph databases - mostly on-disk
  • Distributed disk can keep up with storing / indexing, but is simply too slow at random graph access to process the graph as it updates

• Hadoop and HDFS-based projects
  • Not really the right programming model for many structural queries over the entire graph; random access performance is poor

• Smaller graph libraries, processing tools
  • Can't scale, can't process dynamic graphs, frequently leads to impossible visualization attempts

Page 18: Introduction to STINGER

Who is GTRI?

• Georgia Tech Research Institute
  • Largest research entity at Georgia Institute of Technology
  • One of the world's premier university-based applied R&D organizations for 75 years
  • Non-profit with over 1,600 employees and 21 locations world-wide
  • Over $240 million per year of government and industry contracts

• Innovative Computing Division of the Cyber Technology and Information Security Lab
  • Dedicated to the application of practical HPC expertise and cutting-edge fundamental research to solve real-world problems
  • Experts in high-performance computing, algorithms, and big data

Page 19: Introduction to STINGER

How can I start using STINGER?

• Information, code, help
  • http://cc.gatech.edu/stinger
  • [email protected]

• Together, GTRI and Georgia Tech can offer
  • Consulting: Understand how your organization can benefit from graph analytics.
  • Training: Learn how to use graph analysis and apply STINGER to your data.
  • Implementation: Customize and extend STINGER to suit your needs using our experts.
  • Research Expertise: Connect with researchers on the cutting edge of big data to develop novel solutions to your open problems.