Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Distributed Graph Analytics with Gradoop
9th LDBC TUC Meeting
9-10 February 2017
Walldorf, Germany
Martin Junghanns University of Leipzig – Database Research Group
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 2
Gradoop Extended Property Graphs Summary Evaluation
2
Gradoop
„An open-source graph dataflow system for declarative analytics of heterogeneous graph data.“
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 3
Gradoop Extended Property Graphs Summary Evaluation
3
Gradoop
Distributed Graph Storage (Apache HDFS)
Apache Flink Operator Implementation
Distributed Operator Execution (Apache Flink)
Extended Property Graph Model (EPGM)
Graph Dataflow Operators
I/O
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 4
Gradoop Extended Property Graphs Summary Evaluation
4
Extended Property Graphs
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
1 2
3
4
5 1 2 3
4
5
2
1
Extended Property Graph Model (EPGM)
Person name : Eve
Person name : Alice yob : 1984
Band name : Metallica founded : 1981
Person name : Bob
Band name : AC/DC founded : 1973
likes since : 2014
likes since : 2013
likes since : 2015
knows
likes
|Community|interest:Heavy Metal
|Community|interest:Hard Rock
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 5
Gradoop Extended Property Graphs Summary Evaluation
5
Extended Property Graphs
EPGM Operators/Transformations
Unary Binary
Gra
ph
Co
llect
ion
Lo
gica
l Gra
ph
Equality
Union
Intersection
Difference
Limit
Selection
Pattern Matching
Distinct
Apply
Reduce
Call
Aggregation
Pattern Matching
Transformation
Grouping
Call
Subgraph
Equality
Combination
Overlap
Exclusion
Fusion
Algorithms
Flink Gelly Library
BTG Extraction
Frequent Subgraph Mining
Adaptive Partitioning
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 6
Gradoop Extended Property Graphs Summary Evaluation
6
Extended Property Graphs
1 3
4
5 2
3
1 2
1 3
4
5 2
1 2 4
5
Combination
Overlap
Exclusion
3 Basic Binary Operators
LogicalGraph graph3 = graph1.combine(graph2); LogicalGraph graph4 = graph1.overlap(graph2); LogicalGraph graph5 = graph1.exclude(graph2);
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 7
Gradoop Extended Property Graphs Summary Evaluation
7
Extended Property Graphs
3
1 3
4
5 2 3
4
1 2 UDF
LogicalGraph graph4 = graph3.subgraph( (vertex => vertex.getLabel().equals(‘Green’)), (edge => edge.getLabel().equals(‘orange’)));
Subgraph
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 8
Gradoop Extended Property Graphs Summary Evaluation
8
Extended Property Graphs
1 3
4
5 2
3
1 3
4
5 2
3 | vertexCount: 5
UDF
Aggregation
graph3 = graph3.aggregate(new VertexCount());
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 9
Gradoop Extended Property Graphs Summary Evaluation
9
Extended Property Graphs
Keys
3
1 3
4
5 2
+Aggregate
3
a:23 a:84
a:42
a:12
1 3
4
5 2
a:13
a:21
4
count:2 count:3
max(a):42
max(a):84
max(a):13 max(a):21
6 7
4
6 7
Grouping
LogicalGraph grouped = graph3.groupBy() .useVertexLabel() .useEdgeLabel() .addVertexAggregate(new CountAggregator()) .addEdgeAggregate(new MaxAggregator(‘a’));
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 10
Gradoop Extended Property Graphs Summary Evaluation
10
Extended Property Graphs
3
1 3
4
5 2 Pattern
4 5
1 3
4
2
Graph Collection
GraphCollection collection = graph3.match(‘(:Green)-[:orange]->(:Orange)’);
Pattern Matching (Single Graph Input)
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 11
Gradoop Extended Property Graphs Summary Evaluation
11
Extended Property Graphs
Pattern
Graph Collection
Pattern Matching (Graph-Transaction Setting)
GraphCollection filtered = coll.match(‘(v0:Green)-[:blue]->(:Orange),(v0)-[:blue]->(:Orange)’);
1 |
2 |
0 2
3
4 1
5 7 8 6
Graph Collection
1 | contained = true
2 | contained = false
0 2
3
4 1
5 7 8 6
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 12
Gradoop Extended Property Graphs Summary Evaluation
12
Extended Property Graphs
Algorithm
1
0 2
3
4 1
5 7 8 6
2
3
0 2
3
4 1
5 7 8 6
Call (e.g. Clustering)
GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm())
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 13
Gradoop Extended Property Graphs Summary Evaluation
13
Evaluation
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons location and gender
8. Aggregate vertex and edge count of grouped graph
http://ldbcouncil.org/
Overall performance [3]
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 14
Gradoop Extended Property Graphs Summary Evaluation
14
Evaluation
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons location and gender
8. Aggregate vertex and edge count of grouped graph
https://git.io/vD2Ii
Overall performance [3]
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 15
Gradoop Extended Property Graphs Summary Evaluation
15
Evaluation
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT
• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
Overall performance [3]
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 16
Gradoop Extended Property Graphs Summary Evaluation
16
Evaluation
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT
• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
0
200
400
600
800
1000
1200
1 2 4 8 16
Ru
nti
me
[s]
Number of workers
Graphalytics.100
1
2
4
8
16
1 2 4 8 16
Spe
ed
up
Number of workers
Graphalytics.100 Linear
Overall performance [3]
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 17
Gradoop Extended Property Graphs Summary Evaluation
17
Evaluation
1
10
100
1000
10000
Ru
nti
me
[s]
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Graphalytics.10000 90,025,613 10,872,109,028
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT
• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
Overall performance [3]
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 18
Gradoop Extended Property Graphs Summary Evaluation
18
Evaluation
Grouping [1]
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0.3
• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960
Dataset # Vertices # Edges
Graphalytics.1 61,613 2,026,082
Graphalytics.10 260,613 16,600,778
Graphalytics.100 1,695,613 147,437,275
Graphalytics.1000 12,775,613 1,363,747,260
Pokec 1,632,803 30,622,564
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 19
Gradoop Extended Property Graphs Summary Evaluation
19
Summary
• Extended Property Graph Model • Schema flexible: Type Labels and Properties • Logical Graphs / Graphs Collection
• Graph and Collection Operators • Combination to analytical workflows
• Implemented on Apache Flink • Built-in scalability • Combine with other libraries
Distributed Graph Analytics with Gradoop – 9th LDBC TUC Meeting – Martin Junghanns 20
Gradoop Extended Property Graphs Summary Evaluation Summary
www.gradoop.com
[1] Junghanns, M.; Petermann, A.; K.; Rahm, E., „Distributed Grouping of Property Graphs with Gradoop“, Proc. BTW Conf. , 2017.
[2] Petermann, A.; Junghanns, M.; Kemper, S.; Gomez, K.; Teichmann, N.; Rahm, E., „Graph Mining for Complex Data Analytics “,
Proc. ICDM Conf. (Demo), 2016.
[3] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E., „Analyzing Extended Property Graphs with Apache Flink“,
Int. Workshop on Network Data Analytics (NDA), SIGMOD, 2016.
[4] Petermann, A.; Junghanns, M., „Scalable Business Intelligence with Graph Collections“,
it – Special Issue on Big Data Analytics, 2016.
[5] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E., „Graph-based Data Integration and Business Intelligence with BIIIG“,
Proc. VLDB Conf. (Demo), 2014.