Evaluation of Parallel Graph Loading Techniques

Manuel Then, Moritz Kaufmann, Alfons Kemper, Thomas Neumann

Technical University of Munich, Chair of Database Systems

Manuel Then (TUM) | Evaluation of Parallel Graph Loading Techniques

General Graph Loading Pipeline

Goal: Efficiently load a given graph dataset for explorative analytics

Read

• Parse edges and create relabeling

• Write edges to worker-local buffer

Sync

• Find unique vertices

• Count neighbors

Write

• Create final graph data structure

• Apply final relabeling

Analytics

• The actual analytics work
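The Read, Sync, and Write phases above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Edge`, `Csr`, and `loadGraph` are hypothetical names, the input is assumed to be already parsed into dense identifiers, CSR is assumed as the target structure, and the workers are simulated by plain loops instead of threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<std::uint32_t, std::uint32_t>;  // (source, target), dense ids assumed

struct Csr {
    std::vector<std::size_t> offsets;      // neighbors of v: [offsets[v], offsets[v + 1])
    std::vector<std::uint32_t> neighbors;  // all neighbor lists, packed back to back
};

// Hypothetical driver; each loop over `w` stands in for one worker thread.
Csr loadGraph(const std::vector<Edge>& input, std::size_t workerCount) {
    assert(workerCount > 0);

    // Read: each worker copies its chunk of the input into a worker-local buffer.
    std::vector<std::vector<Edge>> buffers(workerCount);
    const std::size_t chunk = (input.size() + workerCount - 1) / workerCount;
    for (std::size_t w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(w * chunk, input.size());
        const std::size_t end = std::min(begin + chunk, input.size());
        buffers[w].assign(input.begin() + begin, input.begin() + end);
    }

    // Sync: determine the number of vertices and count the neighbors of each one.
    std::uint32_t maxId = 0;
    for (const auto& buffer : buffers)
        for (const Edge& e : buffer) maxId = std::max(maxId, std::max(e.first, e.second));
    const std::size_t vertexCount = input.empty() ? 0 : static_cast<std::size_t>(maxId) + 1;

    Csr graph;
    graph.offsets.assign(vertexCount + 1, 0);
    for (const auto& buffer : buffers)
        for (const Edge& e : buffer) ++graph.offsets[static_cast<std::size_t>(e.first) + 1];
    for (std::size_t v = 0; v < vertexCount; ++v) graph.offsets[v + 1] += graph.offsets[v];

    // Write: place every target at its final position in the packed neighbor array.
    graph.neighbors.resize(input.size());
    std::vector<std::size_t> cursor(graph.offsets.begin(), graph.offsets.end() - 1);
    for (const auto& buffer : buffers)
        for (const Edge& e : buffer) graph.neighbors[cursor[e.first]++] = e.second;
    return graph;
}
```

A real loader would run the per-worker loops on separate threads and would fold parsing and relabeling into the Read phase; the sketch only shows how the work is divided between the phases.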

Scenario-specific Graph Loading

Problem: The optimal way of loading the graph depends on various factors:

• Format of the graph data

• Source of the data

• Properties of the input data

• Target graph data structure

• Execution machine

Graph loading pipeline must be adapted to the scenario at hand

General Graph Loading Pipeline

Questions raised along the pipeline:

• Identifier data type? binary, decimal, string?

• Random access possible?

• Can input data be read multiple times?

• Explicit vertex list available?

• Which data structure to generate?

Parsers

Binary reader

• No parsing necessary => directly copy vertex identifiers

• Every edge same size => work splitting trivial

Library-provided decimal parsing

• Readily available for many languages

• We evaluated C++’s stream operator and strtol

• Varying edge length => work splitting more complex

Iterative decimal parsing

• Multiply by ten and add character’s respective digit

Vectorized decimal parsing

• Leverage wide vector units for identifier parsing

• T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. Proceedings of the VLDB Endowment, 2013.
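Iterative decimal parsing ("multiply by ten and add character’s respective digit") might look like the sketch below. The function names and the input format (one `source target` pair per line, separated by a space or tab) are assumptions for illustration; the slides do not show the evaluated code. A wrapper around the standard `std::strtoull` is included as the library-provided counterpart.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct Edge {
    std::uint64_t source;
    std::uint64_t target;
};

// Iterative parsing: consume digits from pos, multiplying the running value
// by ten and adding each digit. No sign, overflow, or whitespace handling.
inline std::uint64_t parseId(const char*& pos, const char* end) {
    assert(pos <= end);
    std::uint64_t value = 0;
    while (pos != end && *pos >= '0' && *pos <= '9') {
        value = value * 10 + static_cast<std::uint64_t>(*pos - '0');
        ++pos;
    }
    return value;
}

// Parses one "source target" line (assumed format) and moves pos to the next line.
inline Edge parseEdge(const char*& pos, const char* end) {
    Edge edge;
    edge.source = parseId(pos, end);
    while (pos != end && (*pos == ' ' || *pos == '\t')) ++pos;  // skip the separator
    edge.target = parseId(pos, end);
    while (pos != end && *pos != '\n') ++pos;                   // skip the rest of the line
    if (pos != end) ++pos;                                      // consume the newline
    return edge;
}

// Library-provided alternative for comparison; needs a NUL-terminated buffer.
inline std::uint64_t parseIdLibrary(const char*& pos) {
    char* next = nullptr;
    const std::uint64_t value = std::strtoull(pos, &next, 10);
    pos = next;
    return value;
}
```

The hand-written loop skips locale handling, base detection, and error reporting, which is one plausible reason it can beat a general-purpose library routine on trusted input.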

Parser code generation

[Figure annotations: 2x, 20x, 200x]
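The work-splitting remarks for the binary reader and the decimal parsers can be made concrete with a small sketch. `splitBinary` and `splitText` are hypothetical helpers, and the text format is assumed to hold one edge per line: with fixed-size binary edges, chunk boundaries follow from arithmetic alone, whereas text input needs each tentative boundary moved forward to the start of the next line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Binary input: every edge occupies edgeSize bytes, so chunk boundaries are
// plain multiples of edgeSize. Returns workers + 1 byte offsets.
std::vector<std::size_t> splitBinary(std::size_t byteCount, std::size_t edgeSize,
                                     std::size_t workers) {
    assert(edgeSize > 0 && workers > 0 && byteCount % edgeSize == 0);
    const std::size_t edges = byteCount / edgeSize;
    std::vector<std::size_t> starts;
    for (std::size_t w = 0; w <= workers; ++w)
        starts.push_back(edges * w / workers * edgeSize);
    return starts;
}

// Text input: edges vary in length, so each tentative boundary is advanced
// until it sits directly after a newline (or at the end of the data).
std::vector<std::size_t> splitText(const std::string& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<std::size_t> starts;
    starts.push_back(0);
    for (std::size_t w = 1; w < workers; ++w) {
        std::size_t pos = std::max(data.size() * w / workers, starts.back());
        while (pos > 0 && pos < data.size() && data[pos - 1] != '\n') ++pos;
        starts.push_back(pos);
    }
    starts.push_back(data.size());
    return starts;
}
```

Worker `w` would then process the byte range from `starts[w]` up to `starts[w + 1]`; with text input some ranges can come out empty when lines are long relative to the chunk size.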

Data Structures and Identifier Relabeling

Closely related areas

No relabeling (Identity) => Map of Neighbor Lists (hash-based access)

• Directly use dataset identifiers

• Runtime overhead for neighbor and property accesses

• Simple and efficient to load

Dense relabeling => Compressed Sparse Row (CSR) (offset-based access)

• Dense identifiers [0, |V|-1]

• Packed, sequential memory layout

• Allows offset-based data structure access, e.g. for neighbor lists or properties

• Overhead during loading

[Figure: neighbor lists (1), (1 2), (0 2) stored per vertex with hash-based access, and packed as 1 1 2 0 2 with offset-based access]
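The difference between hash-based and offset-based access can be sketched with the neighbor lists from the figure, (1), (1 2), (0 2). The type and function names are hypothetical, and assigning the three lists to vertices 0, 1, and 2 is an assumption: the map variant looks up the original dataset identifier in a hash table, while CSR reads two adjacent entries of an offset array indexed by the dense identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Map of neighbor lists: keyed by the original dataset identifier (no relabeling).
using NeighborMap = std::unordered_map<std::uint64_t, std::vector<std::uint64_t>>;

// CSR: requires dense identifiers in [0, |V|-1].
struct CsrGraph {
    std::vector<std::size_t> offsets;      // size |V| + 1
    std::vector<std::uint32_t> neighbors;  // packed neighbor lists
};

// Hash-based access: one hash lookup per vertex access.
std::size_t degree(const NeighborMap& graph, std::uint64_t datasetId) {
    const auto it = graph.find(datasetId);
    return it == graph.end() ? 0 : it->second.size();
}

// Offset-based access: two reads from a sequential array.
std::size_t degree(const CsrGraph& graph, std::uint32_t denseId) {
    assert(static_cast<std::size_t>(denseId) + 1 < graph.offsets.size());
    return graph.offsets[denseId + 1] - graph.offsets[denseId];
}
```

For the figure's example, the CSR form would hold `offsets = {0, 1, 3, 5}` and `neighbors = {1, 1, 2, 0, 2}`. The same offset arithmetic applies to per-vertex property arrays, which is where the dense identifiers pay off during analytics.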

Relabeling Strategies

Mapping

• Assign dense identifiers while reading the input data

• Global: All workers use a shared map

• Local: Each worker creates a local relabeling

Collection

• Gather unique identifiers while reading the input

• Assign dense identifiers at the end

• Global: Shared identifier set for all workers

• Local: Use a local set per worker

Relabeling is finalized/applied when the graph data structure is written

[Figure: per-worker results merged by set union (∪)]
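The two strategies above differ in when dense identifiers are handed out. The sketch below is illustrative only: `mapId` and `collectIds` are hypothetical names, the code is single-threaded, and sorting the collected identifiers before numbering them is an added assumption to make the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using RawId = std::uint64_t;    // identifier as it appears in the dataset
using DenseId = std::uint32_t;  // identifier in [0, |V|-1]
using Relabeling = std::unordered_map<RawId, DenseId>;

// Mapping: the first time a raw identifier is seen while reading, it receives
// the next free dense identifier; later occurrences reuse it.
DenseId mapId(Relabeling& map, RawId raw) {
    assert(map.size() < std::numeric_limits<DenseId>::max());
    const auto result = map.emplace(raw, static_cast<DenseId>(map.size()));
    return result.first->second;
}

// Collection (local variant): each worker only gathers identifiers in its own
// set; the sets are united and numbered once reading has finished.
Relabeling collectIds(const std::vector<std::unordered_set<RawId>>& localSets) {
    std::vector<RawId> all;
    for (const auto& set : localSets) all.insert(all.end(), set.begin(), set.end());
    std::sort(all.begin(), all.end());
    all.erase(std::unique(all.begin(), all.end()), all.end());
    Relabeling map;
    for (RawId raw : all) map.emplace(raw, static_cast<DenseId>(map.size()));
    return map;
}
```

In the global mapping variant the shared map would additionally need synchronization (for example a lock or a concurrent hash map), which this sketch leaves out.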

Relabeling Strategies - Measurements

[Figure: graph loading times for various relabeling strategies; no further dataset properties leveraged]

Leveraging Dataset Properties

Explicit vertex lists

• All unique vertices in the dataset are known beforehand

• No need to find and count vertices => improves loading efficiency

Partitioned edge list

• Edge list partitioned by source vertex

• Each source vertex has a responsible worker thread, determined by the input data chunk

• Significantly reduces worker communication overhead

Example edge lists (source target):

Partitioned: 1 2, 1 3, 1 4, 2 1, 2 4, 3 1, 3 2, 4 3

Unpartitioned: 4 3, 1 3, 3 1, 1 4, 2 1, 1 2, 3 2, 2 4
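Whether an edge list is partitioned by source vertex, and how chunks can respect that partitioning, can be sketched as below. `isPartitionedBySource` and `alignToSource` are hypothetical helpers; the sketch assumes edges are already parsed and that "partitioned" means all edges of a source vertex are contiguous, as in the example lists above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using Edge = std::pair<std::uint32_t, std::uint32_t>;  // (source, target)

// True if the edges of every source vertex form a single contiguous run.
bool isPartitionedBySource(const std::vector<Edge>& edges) {
    std::unordered_set<std::uint32_t> finished;  // sources whose run has ended
    for (std::size_t i = 1; i < edges.size(); ++i) {
        if (edges[i].first != edges[i - 1].first) {
            finished.insert(edges[i - 1].first);
            if (finished.count(edges[i].first) != 0) return false;
        }
    }
    return true;
}

// Moves a tentative chunk boundary forward so that no source vertex is split
// across two chunks; each source then has exactly one responsible worker.
std::size_t alignToSource(const std::vector<Edge>& edges, std::size_t pos) {
    assert(pos <= edges.size());
    while (pos > 0 && pos < edges.size() && edges[pos].first == edges[pos - 1].first) ++pos;
    return pos;
}
```

With boundaries aligned this way, a worker can count and write the neighbors of its own source vertices without coordinating with other workers.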

Leveraging Dataset Properties - Measurements

Graphs

• LDBC-1000, |V| = 3.6M, |E| = 447M

• Twitter, |V| = 41.6M, |E| = 1.5B

Comparison with Existing Systems

                        Twitter          LDBC
Oracle PGX              2153s            632s
GraphBIG                out of memory    1682s
Ours non-partitioned    88s              24s
Ours partitioned        34s              7s

Graphs

• LDBC-1000, |V| = 3.6M, |E| = 447M

• Twitter, |V| = 41.6M, |E| = 1.5B

Machine:

• 2x Intel Xeon E5-2660 v2 (2 × 20 @ 2.2GHz)

• 256GB, Ubuntu 15.10, kernel 4.2.0

Influence on Analytics

Load + Run = Total

                     CSR (relabeled)         Neighbors Map (identity)
                     Load   Run    Total     Load   Run    Total
PageRank             37s    33s    70s       25s    194s   219s
Triangle Counting    37s    49s    86s       25s    66s    92s

Graphs

• Twitter, |V| = 41.6M, |E| = 1.5B

Machine:

• 2x Intel Xeon E5-2660 v2 (2 × 20 @ 2.2GHz)

• 256GB, Ubuntu 15.10, kernel 4.2.0

Summary

Optimal loading pipeline for a graph dataset is highly dependent on the

• Data format

• Source of the data

• Properties of the dataset

• Algorithm-dependent graph data structure

• Target machine

Custom iterative identifier parsing always beneficial

Concurrent identifier relabeling mostly beneficial

• More challenging than identity mapping, but usually worth it

Leveraging properties of the dataset can lead to enormous speedups