Harvesting the Opportunity of GPU-based Acceleration

Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia

Joint work with Abdullah Gharaibeh, Samer Al-Kiswany


Page 1

Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh, Samer Al-Kiswany

Page 2

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)
University of British Columbia

Page 3

Hybrid architectures in Top 500 [Nov’10]

Page 4

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  [operated today at low efficiency]

• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

Pages 5-11

Acknowledgement: Slides 5-11 borrowed from a presentation by Kayvon Fatahalian.

Page 12

Idea #3: Feed the cores with data

The processing elements are data hungry!
=> Wide, high-throughput memory bus

Page 13

Idea #4: Hide memory access latency

10,000x parallelism!
=> Hardware-supported multithreading
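Hardware multithreading hides latency only if enough threads are in flight. Little's law gives a rough estimate; the sketch below is illustrative, and the latency and issue-rate numbers are assumptions, not figures from the talk:

```python
# Little's law: concurrency needed = latency x issue rate.
# Assumed numbers: ~400-cycle global-memory latency, one memory
# request issued per cycle per multiprocessor.
def threads_to_hide_latency(latency_cycles, requests_per_cycle):
    """In-flight requests needed to keep the memory pipeline busy."""
    return latency_cycles * requests_per_cycle

per_sm = threads_to_hide_latency(400, 1)   # ~400 threads per multiprocessor
total = per_sm * 14                        # x14 multiprocessors (a C2050-class
                                           # part): thousands of threads
```

This is why the hardware schedules thousands of threads rather than a handful: the surplus threads are not for extra throughput, they exist to cover memory stalls.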

Page 14

The Resulting GPU Architecture

[Figure: GPU with N multiprocessors; each multiprocessor has M cores with per-core registers, a shared instruction unit, and shared memory; global, texture, and constant memories are shared across multiprocessors; the GPU is attached to a host machine and its host memory]

NVIDIA Tesla C2050
448 cores

Four 'memories':
• Shared: fast (4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read-only
• Constant: read-only

Hybrid: PCIe x16 link to the host, 4GB/s
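The 'Hybrid' note hides a key ratio: device memory streams at ~150GB/s while the PCIe link delivers ~4GB/s, so staging data from the host can dominate end-to-end time. A quick back-of-the-envelope sketch using the slide's numbers:

```python
# Time to move the card's 3GB of data once over each channel,
# using the bandwidth figures from the slide.
data_gb = 3.0
pcie_gbps = 4.0     # host <-> device, PCIe x16
gmem_gbps = 150.0   # on-device global memory

pcie_seconds = data_gb / pcie_gbps   # staging from the host: 0.75 s
gmem_seconds = data_gb / gmem_gbps   # streaming from device memory: 0.02 s
ratio = gmem_gbps / pcie_gbps        # the host link is ~37x slower
```

The ~37x gap is the arithmetic behind the "high host-device communication overhead" bullet on the next slide.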

Page 15

GPUs offer different characteristics:
• High peak compute power
• High host-device communication overhead
• Complex to program
• High peak memory bandwidth
• Limited memory space

Page 16

Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)

Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR

GPU-optimized building blocks, data structures and libraries:
• GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

Page 17

Motivating Question: How should we design applications to efficiently exploit GPU characteristics?

Context: A bioinformatics problem: Sequence Alignment

A string matching problem; data intensive (10^2 GB)

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A.Gharaibeh, M. Ripeanu, SC’10

Page 18

Past work: sequence alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]; ~4x speedup (end-to-end) compared to the CPU version, yet more than 50% of the runtime is overhead.

Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics

Page 19

Idea: trade off time for space. Use a space-efficient data structure (though from a higher computational complexity class): the suffix array.

Result: 4x speedup compared to the suffix-tree-based GPU implementation.

Consequences:
• Significant overhead reduction
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage

Page 20

Outline for the rest of this talk
• Sequence alignment: background and offloading to the GPU
• Space/time trade-off analysis
• Evaluation

Page 21

Background: Sequence Alignment Problem

Problem: find where each query most likely originated from.

Queries (reads):
CCAT GGCT... .....CGCCCTA GCAATTT.... ...GCGG ...TAGGC TGCGC... ...CGGCA... ...GGCG ...GGCTA ATGCG... ....TCGG... TTTGCGG.... ...TAGG ...ATAT... ....CCTA... CAATT....

Reference:
..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..

Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
Reference: 10^6 to 10^11 symbols (up to ~400GB)
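For intuition, the matching task can be sketched as naive exact substring search. This is an illustrative toy, not how aligners work: at 10^8 queries against a 10^11-symbol reference, a linear scan per query is hopeless, which is exactly why the index structures discussed later are needed.

```python
# Naive exact matching: list every position where each query (read)
# occurs in the reference.
def match_naive(reference, queries):
    hits = {}
    for q in queries:
        positions, pos = [], reference.find(q)
        while pos != -1:
            positions.append(pos)
            pos = reference.find(q, pos + 1)   # allow overlapping matches
        hits[q] = positions
    return hits

ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"  # the slide's example string
hits = match_naive(ref, ["GGC", "TAT"])
```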

Page 22

GPU Offloading: Opportunity and Challenges

Opportunity:
• Sequence alignment: easy to partition, memory intensive
• GPU: massively parallel, high memory bandwidth

Challenges:
• Data intensive, large output size
• Limited memory space
• No direct access to other I/O devices (e.g., disk)

Page 23

GPU Offloading: addressing the challenges

• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
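The round-based host loop can be mocked up in plain Python with the GPU calls stubbed out; names like match_kernel are placeholders for illustration, not the paper's API:

```python
# Divide-and-compute-in-rounds: every (query-chunk, reference-chunk)
# pair is small enough for limited device memory. Real implementations
# overlap sub-references so matches spanning chunk boundaries are not
# lost; this sketch ignores that detail.
def chunk(xs, n):
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def align_in_rounds(reference, queries, ref_chunk, qry_chunk, match_kernel):
    results = []
    for qs in chunk(queries, qry_chunk):            # CopyToGPU(subqryset)
        for subref in chunk(reference, ref_chunk):  # CopyToGPU(subref)
            results.append(match_kernel(qs, subref))  # MatchKernel + CopyFromGPU
    return results  # Decompress(results) would follow here, on the CPU

# Stub "kernel": count how many of the queries occur in the sub-reference.
rounds = align_in_rounds("ACACAGT", ["ACA", "AGT"], 4, 1,
                         lambda qs, sub: sum(q in sub for q in qs))
```

With 2 query chunks and 2 reference chunks, the loop issues 4 kernel rounds, which is exactly the cost structure the space/time analysis below tries to shrink.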

Page 24

Space/Time Trade-off Analysis

Page 25

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

[Figure: suffix tree of TACACA$]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))

Page 26

The core data structure

Massive number of queries and a long reference => pre-process the reference into an index.

Past work: build a suffix tree (MUMmerGPU [Schatz 07]). Mapped onto the stages of the host algorithm:
• Search: O(qry_len) per query -- Efficient
• Space: O(ref_len), but with a high constant (~20 x ref_len), making the data transfers Expensive
• Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query -- Expensive

Page 27

A better matching data structure?

[Figure: suffix tree of TACACA$]

Suffix array of TACACA$:
0  A$
1  ACA$
2  ACACA$
3  CA$
4  CACA$
5  TACACA$

              Suffix Tree                       Suffix Array
Space         O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
Search        O(qry_len)                        O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: Reduced communication (less data to transfer)
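A minimal suffix-array sketch (illustrative Python, not the MUMmerGPU implementation) makes the table concrete: each entry is just an integer offset, which is where the small space constant comes from, and lookup is a binary search over sorted suffixes, O(qry_len x log ref_len):

```python
def build_suffix_array(text):
    # Sort suffix start offsets lexicographically. Each entry is one
    # integer, versus a suffix tree's node-and-edge records (~20 x ref_len).
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, query):
    # Binary search: O(log ref_len) probes, each comparing up to
    # qry_len characters => O(qry_len x log ref_len) per query.
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix with prefix > query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])         # match positions in the reference

sa = build_suffix_array("TACACA$")      # [6, 5, 3, 1, 4, 2, 0]
```

On the slide's example string, find_all("TACACA$", sa, "CA") returns [2, 4], the two occurrences of "CA".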

Page 28

A better matching data structure

[Suffix tree / suffix array figure and comparison table repeated from the previous slide]

Impact 2: Better data locality is achieved at the cost of additional per-thread processing time. Space for longer sub-references => fewer processing rounds.

Page 29

A better matching data structure

[Suffix tree / suffix array figure and comparison table repeated from the previous slide]

Impact 3: Lower post-processing overhead.

Page 30

Evaluation

Page 31

Evaluation setup

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species             Reference length   # of queries   Avg. read length
HS1  - Human (chromosome 2)    ~238M              ~78M           ~200
HS2  - Human (chromosome 3)    ~100M              ~2M            ~700
MONO - L. monocytogenes        ~3M                ~6M            ~120
SUIS - S. suis                 ~2M                ~26M           ~36

Testbed:
• Low-end GeForce 9800 GX2 GPU (512MB)
• High-end Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Page 32

Speedup: array-based over tree-based

Page 33

Dissecting the overheads

Significant reduction in data transfers and post-processing

Workload: HS1, ~78M queries, ~238M ref. length on GeForce

Page 34

Comparing with CPU performance [baseline: single-core performance]

[Chart; bars labeled Suffix tree, Suffix tree, Suffix array]

Page 35

Summary

• GPUs have drastically different performance characteristics.
• Reconsidering the choice of data structure is necessary when porting applications to the GPU.
• A good matching data structure ensures:
  – Low communication overhead
  – Data locality (which might be achieved at the cost of additional per-thread processing time)
  – Low post-processing overhead

Page 36

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca