Large-Scale Graph Processing on Emerging Storage Devices
Nima Elyasi¹, Changho Choi², Anand Sivasubramaniam¹
¹Pennsylvania State University
²Samsung Semiconductor Inc.
Graph Processing is Commonplace
Search Engines, Social Media, Recommendations and Ads, Map and Navigation
Large-Scale Graph Processing Challenges
Huge Datasets, Irregular Accesses
High cost of DRAM: DRAM ($$$$) vs. NVMe SSD ($)
External Graph Processing is Desirable
NVMe SSD: Fine-Grained and Random Accesses
Fine-Grained Access in External Graph Processing
SSD Page Size and Vertex Accesses Don’t Match!
SSD Page: Several Kilobytes (4KB ~ 16KB)
Vertex Value: Several Bytes, e.g., 4 Bytes
Irregular Accesses
Vertex updates are detrimental to: Performance, Device Endurance
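The size mismatch can be made concrete with a little arithmetic. The sketch below assumes the worst case, where every vertex access touches a different SSD page, so a whole page is transferred for each vertex value; the function name is hypothetical and chosen only for illustration.

```python
def access_amplification(page_bytes, vertex_bytes=4):
    """Bytes moved per useful byte when one whole page is
    transferred for a single vertex value (assumed worst case)."""
    return page_bytes // vertex_bytes


# Page sizes taken from the slide: 4KB and 16KB pages, 4-byte vertex values.
for page_kb in (4, 16):
    factor = access_amplification(page_kb * 1024)
    print(f"{page_kb}KB page, 4B vertex: {factor}x amplification")
```

In practice several accessed vertices may share a page, so the real factor depends on the access pattern.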
Providing Perfect Sequentiality as a Remedy
• If vertex data could be stored in DRAM, fine-grained accesses would be less of an issue
• Instead, the prior external graph processing framework (GraFBoost, ISCA’18) maintains vertex data on SSD
• It achieves perfect sequentiality by coalescing fine-grained accesses
Programming Model
Vertex-centric Programming Model
- Iterative programming model
- Each vertex runs a user-defined program
- Sending updates to neighbors along outgoing edges
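One iteration of this model can be sketched as follows. This is a minimal illustration, not the API of any particular framework: `vertex_centric_iteration`, `program`, and the dictionary-based graph layout are all assumptions made for the example.

```python
def vertex_centric_iteration(out_edges, values, program):
    """Run the user-defined program once per vertex and deliver its
    update to every neighbor along the vertex's outgoing edges.

    out_edges: {vertex: [neighbor, ...]}
    values:    {vertex: current value}
    program:   (value, out_degree) -> update sent along each out-edge
    Returns {vertex: [updates received this iteration]}.
    """
    inbox = {v: [] for v in values}
    for v, neighbors in out_edges.items():
        if not neighbors:
            continue
        update = program(values[v], len(neighbors))
        for n in neighbors:
            inbox[n].append(update)
    return inbox


# PageRank-style program: a vertex splits its value evenly among neighbors.
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = {v: 1.0 / 3 for v in graph}
inbox = vertex_centric_iteration(graph, ranks, lambda val, deg: val / deg)
new_ranks = {v: 0.15 / len(graph) + 0.85 * sum(msgs) for v, msgs in inbox.items()}
```

The updates land on arbitrary destination vertices, which is the source of the irregular accesses discussed earlier.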
Prior External Graph Processing -- GraFBoost
Vertex Data, Index File, Edge File
V0 → Vx, V0 → Vy, V0 → Vz
Keys: {Vx, Vy, Vz}, Value: {V0 value}
<Vx, V0 value>, <Vy, V0 value>, <Vz, V0 value>
GraFBoost sorts key-value pairs in memory, logs them in SSD, merges them, and updates the vertex list in SSD
Sang-Woo Jun, et al. GraFBoost: Using accelerated flash storage for external graph analytics, ISCA’18.
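The key-value flow described on this slide can be illustrated with a small in-memory sketch. It is not GraFBoost's implementation, which logs sorted runs to SSD and merges them externally; the function names and the use of `sum` as the reduction are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def generate_updates(out_edges, values):
    """Emit one <destination, source value> pair per out-edge."""
    return [(dst, values[src]) for src, dsts in out_edges.items() for dst in dsts]


def sort_reduce(updates, reduce_fn):
    """Sort updates by destination key and merge pairs with equal keys,
    so the vertex list can then be updated in one sequential pass."""
    ordered = sorted(updates, key=itemgetter(0))
    return [
        (key, reduce_fn(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# V0 (id 0) has out-edges to vertices 3, 1, 2; vertex 4 also points to 1.
updates = generate_updates({0: [3, 1, 2], 4: [1]}, {0: 1.0, 4: 2.0})
merged = sort_reduce(updates, sum)
```

After the sort, updates arrive in vertex-ID order, which is what turns the random vertex writes into sequential ones.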
Computation Overhead of Sort!
• Up to 60% sort overhead (web graph)
• Higher sort overhead for PageRank
- Processes all vertices in each iteration and generates more updates
Current External Graph Processing:
- Read from SSD: Linear Time O(|E|)
- Sort in Memory: |E|*log(|E|)
- Write to SSD: Linear Time O(|E|)
Scalability Issue
Current External Graph Processing: Read from SSD, O(|E|); Sort in Memory, |E|*log(|E|); Write to SSD, O(|E|)
Assuming DRAM “k” times faster than SSD (e.g., k=30):
When k < log(|E|) → Sorting can become bottleneck
Instead, we propose a vertex partitioning to eliminate the sorting
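To get a feel for where the crossover falls, the sketch below evaluates the condition k < log(|E|) numerically. It assumes a base-2 logarithm and ignores constant factors, neither of which the slide specifies, so the numbers are indicative only.

```python
import math


def crossover_edges(k):
    """Edge count above which log2(|E|) exceeds k (base 2 is an assumption)."""
    return 2 ** k


def sort_can_dominate(num_edges, k):
    """True when the in-memory sort term |E|*log2(|E|) outweighs the
    SSD term k*|E|, ignoring constant factors."""
    return math.log2(num_edges) > k


k = 30  # example value from the slide
print(f"k={k}: sorting can dominate beyond {crossover_edges(k):,} edges")
```

Under these assumptions, k=30 puts the crossover at roughly one billion edges.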
Partitioning Graph Data
Extensive Prior Efforts on Partitioning Graph Data:
- Not well suited for fully external graph processing
Prior work: FlashGraph, FAST’15; GraphChi, OSDI’12; Mosaic, EuroSys’17; PowerGraph, OSDI’12; GridGraph, USENIX ATC’15; GraphP, HPCA’18
Limitations:
- Require all vertices to be present in main memory
- Do not decouple vertices and edges
- Need each partition to be completely present in cache or memory
- Dramatically increase the number of partitions and incur high cross-partition communication
Instead, We Propose a Partitioning for Vertex Data
Reorganizing graph data so that vertices associated with each partition can fit in main memory
[Figure: grid of Source Vertices by Destination Vertices. Partition 0 holds Source Vertex Data (Vertex ID & Value, sorted based on Vertex ID), an Index (Vertex A: Offset A, Vertex B: Offset B, Vertex C: Offset C), and Edge Data (out-edges of Vertex A, Vertex B, Vertex C)]
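A possible layout for such partitions is sketched below. The slides do not spell out the partitioning rule, so this assumes contiguous destination-vertex ranges sized to a memory budget; `partition_by_destination` and the dictionary fields are hypothetical names for illustration.

```python
from collections import defaultdict


def partition_by_destination(edges, num_vertices, vertices_per_partition):
    """Group edges by destination range. Each partition keeps its source
    vertices sorted by ID, an index (source -> offset into the edge data),
    and the out-edges that land inside the partition's destination range."""
    num_parts = -(-num_vertices // vertices_per_partition)  # ceiling division
    buckets = [defaultdict(list) for _ in range(num_parts)]
    for src, dst in edges:
        buckets[dst // vertices_per_partition][src].append(dst)

    partitions = []
    for p, bucket in enumerate(buckets):
        lo = p * vertices_per_partition
        hi = min(lo + vertices_per_partition, num_vertices)
        index, edge_data = {}, []
        for src in sorted(bucket):
            index[src] = len(edge_data)  # offset of this source's first out-edge
            edge_data.extend(bucket[src])
        partitions.append(
            {"dst_range": (lo, hi), "sources": sorted(bucket),
             "index": index, "edges": edge_data}
        )
    return partitions


parts = partition_by_destination(
    [(0, 1), (0, 3), (1, 2), (2, 0), (3, 1)], num_vertices=4, vertices_per_partition=2
)
```

Because every destination of a partition falls inside one bounded range, the destination values can stay in memory while source data and edges are streamed.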
Execution Flow
In each iteration:
[Figure: for Partition 0, (1) destination vertex data for a partition is read from SSD into memory; (2) a chunk of source vertex data (32MB) is streamed from SSD, with Vertex ID & Value and the Index (Vertex A: Offset A, Vertex B: Offset B, Vertex C: Offset C); (3) neighboring information is read and destination vertices are updated in memory; (4) all updated vertex data is written on SSD; (5) mirror updates are generated for other partitions, using the metadata for the current partition]
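The flow in the figure can be approximated with a small in-memory simulation. This is a sketch under several assumptions: a Python list stands in for vertex data on SSD, `chunk_size` counts source vertices rather than bytes, and mirror propagation is not modeled (source values are simply read from the shared list). All names are hypothetical.

```python
def run_iteration(partitions, values, chunk_size, update):
    """One pass over all partitions.

    partitions: list of {"dst_range": (lo, hi), "edges_by_src": {src: [dst, ...]}}
    values:     list standing in for the vertex data file on SSD
    update:     (destination value, source value) -> new destination value
    """
    new_values = list(values)
    for part in partitions:
        lo, hi = part["dst_range"]
        dst_buf = values[lo:hi]  # load this partition's destination vertices
        sources = sorted(part["edges_by_src"])
        for start in range(0, len(sources), chunk_size):  # stream source chunks
            for src in sources[start:start + chunk_size]:
                for dst in part["edges_by_src"][src]:
                    dst_buf[dst - lo] = update(dst_buf[dst - lo], values[src])
        new_values[lo:hi] = dst_buf  # one sequential write-back per partition
    return new_values


# BFS-style distances on the chain 0 -> 1 -> 2 -> 3, two partitions.
INF = float("inf")
chain = [
    {"dst_range": (0, 2), "edges_by_src": {0: [1]}},
    {"dst_range": (2, 4), "edges_by_src": {1: [2], 2: [3]}},
]
dist = [0, INF, INF, INF]
for _ in range(3):
    dist = run_iteration(chain, dist, chunk_size=1, update=lambda d, s: min(d, s + 1))
```

No sorting step appears anywhere in the loop: updates to a partition's destination vertices stay in memory until the single write-back.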
Updating Vertices in Memory
How to Update Vertex List in Main Memory?
Multiple threads are updating elements of the same vertex list
- High synchronization cost
[Figure: Vertex List; Buffer, e.g., 1MB]
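One common way to cut synchronization cost is to let each thread collect updates in a private buffer and apply them in batches. The slide shows only a vertex list and a buffer, so the scheme below is an assumption about how such a buffer might be used, not the authors' confirmed design; the buffer capacity counts updates rather than bytes.

```python
import threading


def apply_updates_buffered(vertex_list, updates_per_thread, buffer_capacity):
    """Each thread batches (index, delta) updates locally and takes the
    shared lock once per full buffer instead of once per update."""
    lock = threading.Lock()

    def flush(buf):
        with lock:
            for idx, delta in buf:
                vertex_list[idx] += delta
        buf.clear()

    def worker(updates):
        buf = []
        for item in updates:
            buf.append(item)
            if len(buf) >= buffer_capacity:
                flush(buf)
        if buf:
            flush(buf)  # apply whatever is left at the end

    threads = [threading.Thread(target=worker, args=(u,)) for u in updates_per_thread]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return vertex_list


result = apply_updates_buffered(
    [0, 0, 0, 0], [[(0, 1), (1, 2), (0, 3)], [(1, 1), (3, 5)]], buffer_capacity=2
)
```

With additive updates the final vertex list does not depend on the order in which threads flush.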
Updating Vertex Mirrors on Different Partitions
Required Meta-Data for Mirror Updates
O(|V|) running time for updating mirrors
[Figure: Source Vertex Table (Vertex 0, Vertex 1, Vertex 2, each with Part ID(s)); Metadata for each partition (Partition 0 ... Partition i: Start Index, End Index) and for each vertex (Start Index, End Index); Vertex Values of Partition i, including Mirrors for Partition 0]
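A mirror update driven by a per-vertex list of partition IDs might look like the sketch below. The data layout (plain dictionaries rather than the start/end index ranges in the figure) and the function name are assumptions for illustration.

```python
def propagate_mirrors(updated_values, mirror_parts, mirrors):
    """Copy each updated vertex value to its mirror in every partition
    listed for that vertex.

    updated_values: {vertex: new value} for the current partition
    mirror_parts:   {vertex: [partition IDs holding a mirror]}
    mirrors:        {partition ID: {vertex: mirrored value}}
    Returns the number of mirror writes performed.
    """
    writes = 0
    for vertex, value in updated_values.items():
        for part_id in mirror_parts.get(vertex, ()):
            mirrors[part_id][vertex] = value
            writes += 1
    return writes


mirrors = {0: {5: 0.0}, 1: {5: 0.0, 7: 0.0}}
count = propagate_mirrors({5: 0.4, 7: 0.9, 8: 0.1}, {5: [0, 1], 7: [1]}, mirrors)
```

Vertices without an entry in the table (vertex 8 here) have no mirrors and are skipped.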
Experimental Setup
• Processor: Intel Xeon, 48 cores
• Memory: 256 GB DRAM
• SSD: Two Samsung NVMe SSDs, 3.2 TB capacity in total, and 6.4 GB/s sequential read speed
• Graph Algorithms: PageRank and Breadth-First Search (BFS)
• Input Graphs: Web, Twitter, Synthetic (Kron)
Performance Evaluation
• More than 2X improvement compared to GraFSoft
• Provides higher benefits for larger graphs (Web, Kron32)
• Incurs around 10% space overhead for partitioning
Execution Time Breakdown
• Mirror updates account for 8-12% of execution time
• I/O is no longer the main contributor to the total execution time
[Figure: execution time breakdown for PageRank]
Concluding Remarks
• Large-scale graph processing suffers from random updates to vertices
• State-of-the-art provides perfect sequentiality by sorting all updates
- High computation overhead
• A partitioning for vertex data is proposed to eliminate the need for perfect sequentiality
• Future work: addressing graphs that evolve over time
• Thanks to the GraFBoost authors (Sang-Woo Jun)!
Thanks!