Harvesting the Opportunity of GPU-based Acceleration

Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia

Joint work with Abdullah Gharaibeh, Samer Al-Kiswany


Page 1

Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh, Samer Al-Kiswany

Page 2

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)
University of British Columbia

Page 3

Hybrid architectures in Top 500 [Nov’10]

Page 4

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  [operated today at low efficiency]

• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

Pages 5-11

Acknowledgement: Slides 5-11 borrowed from a presentation by Kayvon Fatahalian.

Page 12

Idea #3: Feed the cores with data

The processing elements are data hungry!
=> Wide, high-throughput memory bus

Page 13

Idea #4: Hide memory access latency

10,000x parallelism!
=> Hardware-supported multithreading
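Hardware multithreading hides latency only if enough threads are in flight. Little's law gives a rough estimate; the sketch below is illustrative, and the latency and issue-rate numbers are assumptions, not figures from the talk:

```python
# Little's law: concurrency needed = latency x issue rate.
# Assumed numbers: ~400-cycle global-memory latency, one memory
# request issued per cycle per multiprocessor.
def threads_to_hide_latency(latency_cycles, requests_per_cycle):
    """In-flight requests needed to keep the memory pipeline busy."""
    return latency_cycles * requests_per_cycle

per_sm = threads_to_hide_latency(400, 1)   # ~400 threads per multiprocessor
total = per_sm * 14                        # x14 multiprocessors (a C2050-class
                                           # part): thousands of threads
```

This is why the hardware schedules thousands of threads rather than a handful: the surplus threads are not for extra throughput, they exist to cover memory stalls.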

Page 14

The Resulting GPU Architecture

[Figure: GPU with N multiprocessors; each multiprocessor has M cores with per-core registers, a shared instruction unit, and shared memory; global, texture, and constant memories are shared across multiprocessors; the GPU is attached to a host machine and its host memory]

NVIDIA Tesla C2050
448 cores

Four 'memories':
• Shared: fast (4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read-only
• Constant: read-only

Hybrid: PCIe x16 link to the host, 4GB/s
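The 'Hybrid' note hides a key ratio: device memory streams at ~150GB/s while the PCIe link delivers ~4GB/s, so staging data from the host can dominate end-to-end time. A quick back-of-the-envelope sketch using the slide's numbers:

```python
# Time to move the card's 3GB of data once over each channel,
# using the bandwidth figures from the slide.
data_gb = 3.0
pcie_gbps = 4.0     # host <-> device, PCIe x16
gmem_gbps = 150.0   # on-device global memory

pcie_seconds = data_gb / pcie_gbps   # staging from the host: 0.75 s
gmem_seconds = data_gb / gmem_gbps   # streaming from device memory: 0.02 s
ratio = gmem_gbps / pcie_gbps        # the host link is ~37x slower
```

The ~37x gap is the arithmetic behind the "high host-device communication overhead" bullet on the next slide.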

Page 15

GPUs offer different characteristics:
• High peak compute power
• High host-device communication overhead
• Complex to program
• High peak memory bandwidth
• Limited memory space

Page 16

Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)

Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR

GPU-optimized building blocks, data structures and libraries:
• GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

Page 17

Motivating Question: How should we design applications to efficiently exploit GPU characteristics?

Context: A bioinformatics problem: Sequence Alignment

A string matching problem; data intensive (10^2 GB)

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A.Gharaibeh, M. Ripeanu, SC’10

Page 18

Past work: sequence alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]; ~4x speedup (end-to-end) compared to the CPU version, yet more than 50% of the runtime is overhead.

Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics

Page 19

Idea: trade off time for space. Use a space-efficient data structure (though from a higher computational complexity class): the suffix array.

Result: 4x speedup compared to the suffix-tree-based GPU implementation.

Consequences:
• Significant overhead reduction
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage

Page 20

Outline for the rest of this talk
• Sequence alignment: background and offloading to the GPU
• Space/time trade-off analysis
• Evaluation

Page 21

Background: Sequence Alignment Problem

Problem: find where each query most likely originated from.

Queries (reads):
CCAT GGCT... .....CGCCCTA GCAATTT.... ...GCGG ...TAGGC TGCGC... ...CGGCA... ...GGCG ...GGCTA ATGCG... ....TCGG... TTTGCGG.... ...TAGG ...ATAT... ....CCTA... CAATT....

Reference:
..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..

Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
Reference: 10^6 to 10^11 symbols (up to ~400GB)
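For intuition, the matching task can be sketched as naive exact substring search. This is an illustrative toy, not how aligners work: at 10^8 queries against a 10^11-symbol reference, a linear scan per query is hopeless, which is exactly why the index structures discussed later are needed.

```python
# Naive exact matching: list every position where each query (read)
# occurs in the reference.
def match_naive(reference, queries):
    hits = {}
    for q in queries:
        positions, pos = [], reference.find(q)
        while pos != -1:
            positions.append(pos)
            pos = reference.find(q, pos + 1)   # allow overlapping matches
        hits[q] = positions
    return hits

ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"  # the slide's example string
hits = match_naive(ref, ["GGC", "TAT"])
```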

Page 22

GPU Offloading: Opportunity and Challenges

Opportunity:
• Sequence alignment: easy to partition, memory intensive
• GPU: massively parallel, high memory bandwidth

Challenges:
• Data intensive, large output size
• Limited memory space
• No direct access to other I/O devices (e.g., disk)

Page 23

GPU Offloading: addressing the challenges

• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
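The round-based host loop can be mocked up in plain Python with the GPU calls stubbed out; names like match_kernel are placeholders for illustration, not the paper's API:

```python
# Divide-and-compute-in-rounds: every (query-chunk, reference-chunk)
# pair is small enough for limited device memory. Real implementations
# overlap sub-references so matches spanning chunk boundaries are not
# lost; this sketch ignores that detail.
def chunk(xs, n):
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def align_in_rounds(reference, queries, ref_chunk, qry_chunk, match_kernel):
    results = []
    for qs in chunk(queries, qry_chunk):            # CopyToGPU(subqryset)
        for subref in chunk(reference, ref_chunk):  # CopyToGPU(subref)
            results.append(match_kernel(qs, subref))  # MatchKernel + CopyFromGPU
    return results  # Decompress(results) would follow here, on the CPU

# Stub "kernel": count how many of the queries occur in the sub-reference.
rounds = align_in_rounds("ACACAGT", ["ACA", "AGT"], 4, 1,
                         lambda qs, sub: sum(q in sub for q in qs))
```

With 2 query chunks and 2 reference chunks, the loop issues 4 kernel rounds, which is exactly the cost structure the space/time analysis below tries to shrink.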

Page 24

Space/Time Trade-off Analysis

Page 25

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

[Figure: suffix tree of TACACA$]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))

Page 26

The core data structure

Massive number of queries and a long reference => pre-process the reference into an index.

Past work: build a suffix tree (MUMmerGPU [Schatz 07]). Mapped onto the stages of the host algorithm:
• Search: O(qry_len) per query -- Efficient
• Space: O(ref_len), but with a high constant (~20 x ref_len), making the data transfers Expensive
• Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query -- Expensive

Page 27

A better matching data structure?

[Figure: suffix tree of TACACA$]

Suffix array of TACACA$:
0  A$
1  ACA$
2  ACACA$
3  CA$
4  CACA$
5  TACACA$

              Suffix Tree                       Suffix Array
Space         O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
Search        O(qry_len)                        O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: Reduced communication (less data to transfer)
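A minimal suffix-array sketch (illustrative Python, not the MUMmerGPU implementation) makes the table concrete: each entry is just an integer offset, which is where the small space constant comes from, and lookup is a binary search over sorted suffixes, O(qry_len x log ref_len):

```python
def build_suffix_array(text):
    # Sort suffix start offsets lexicographically. Each entry is one
    # integer, versus a suffix tree's node-and-edge records (~20 x ref_len).
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, query):
    # Binary search: O(log ref_len) probes, each comparing up to
    # qry_len characters => O(qry_len x log ref_len) per query.
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix with prefix > query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])         # match positions in the reference

sa = build_suffix_array("TACACA$")      # [6, 5, 3, 1, 4, 2, 0]
```

On the slide's example string, find_all("TACACA$", sa, "CA") returns [2, 4], the two occurrences of "CA".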

Page 28

A better matching data structure

[Suffix tree / suffix array figure and comparison table repeated from the previous slide]

Impact 2: Better data locality is achieved at the cost of additional per-thread processing time. Space for longer sub-references => fewer processing rounds.

Page 29

A better matching data structure

[Suffix tree / suffix array figure and comparison table repeated from the previous slide]

Impact 3: Lower post-processing overhead.

Page 30

Evaluation

Page 31

Evaluation setup

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species             Reference length   # of queries   Avg. read length
HS1  - Human (chromosome 2)    ~238M              ~78M           ~200
HS2  - Human (chromosome 3)    ~100M              ~2M            ~700
MONO - L. monocytogenes        ~3M                ~6M            ~120
SUIS - S. suis                 ~2M                ~26M           ~36

Testbed:
• Low-end GeForce 9800 GX2 GPU (512MB)
• High-end Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Page 32

Speedup: array-based over tree-based

Page 33

Dissecting the overheads

Significant reduction in data transfers and post-processing

Workload: HS1, ~78M queries, ~238M ref. length on GeForce

Page 34

Comparing with CPU performance [baseline: single-core performance]

[Chart; bars labeled Suffix tree, Suffix tree, Suffix array]

Page 35

Summary

• GPUs have drastically different performance characteristics.
• Reconsidering the choice of data structure is necessary when porting applications to the GPU.
• A good matching data structure ensures:
  – Low communication overhead
  – Data locality (which might be achieved at the cost of additional per-thread processing time)
  – Low post-processing overhead

Page 36

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca