Nvidia in bioinformatics

GPU Acceleration of Bioinformatics Pipelines

Jonathan Cohen and Mark Berger, NVIDIA

Agenda

GPU Programming in 10 slides – Cohen (10 minutes)

GPUs for Bioinformatics – Cohen (10 minutes)

Experiences porting SeqAn to CUDA – Siragusa (15 minutes)

Resources – Berger (5 minutes)

Discussion, Q&A – All (20 minutes)

GPU Programming in Ten Slides

CUDA – Programming for Throughput

CPU threads:

Large amount of memory per thread

Full-featured instruction set

1-16 execute simultaneously

CUDA threads:

Lightweight footprint

Full-featured instruction set

10,000 execute simultaneously

CPU (host): executes functions

Runs few threads, each one very fast

GPU (device): executes kernels

Runs many threads, each one slow => total throughput high

CUDA Kernels: Parallel Threads

A kernel is an array of threads, executed in parallel

All threads execute the same code

Each thread has an ID, used to select input/output data and to make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

CUDA Kernels: Subdivide into Blocks


Threads are grouped into blocks



Blocks are grouped into a grid




A kernel is executed as a grid of blocks of threads
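The grid/block/thread hierarchy can be sketched as a complete program. This is a minimal illustration, not code from the talk: the kernel name apply_func, the helper run_grid_example, the body of func, and the block size of 128 are all assumptions chosen for the example. It shows how each thread derives a global ID from its block and thread indices and why a bounds check is needed when the element count is not a multiple of the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the slide's func(x); any per-element function would do.
__device__ float func(float x) { return 2.0f * x + 1.0f; }

__global__ void apply_func(const float* input, float* output, int n)
{
    // Global thread ID: index of this block in the grid, times threads
    // per block, plus index of this thread inside its block.
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {            // the last block may have spare threads
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}

// Hypothetical host helper: runs the kernel on n > 0 elements and checks results.
bool run_grid_example(int n)
{
    float *input, *output;
    // Managed memory is visible to both CPU and GPU (assumes CUDA 6 or later).
    cudaMallocManaged(&input, n * sizeof(float));
    cudaMallocManaged(&output, n * sizeof(float));
    for (int i = 0; i < n; ++i) input[i] = (float)i;

    int threadsPerBlock = 128;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    apply_func<<<blocksPerGrid, threadsPerBlock>>>(input, output, n);
    cudaDeviceSynchronize();       // wait before the CPU reads the results

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (output[i] != 2.0f * i + 1.0f) ok = false;
    cudaFree(input);
    cudaFree(output);
    return ok;
}

int main()
{
    printf("grid example %s\n", run_grid_example(1000) ? "passed" : "failed");
    return 0;
}
```

With 1000 elements and 128 threads per block, the launch uses 8 blocks (1024 threads), so the last 24 threads fail the bounds check and do nothing. Error checking on the CUDA calls is omitted to keep the sketch short.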





Accelerated Computing: Multi-core plus Many-cores

CPU: optimized for serial tasks

GPU accelerator: optimized for many parallel tasks

3-10x+ compute throughput, 7x memory bandwidth, 5x energy efficiency

How GPU Acceleration Works

Application code is split between the two processors:

GPU: compute-intensive functions (5% of code)

CPU: rest of sequential CPU code
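As a rough illustration of this split, the sketch below keeps the sequential work (setup, printing) on the CPU and offloads one data-parallel function to the GPU with explicit copies in each direction. The example task (complementing DNA bases) and the names complement_kernel and complement_on_gpu are assumptions made for illustration; they do not come from the talk.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Compute-intensive part: one GPU thread complements one base.
__global__ void complement_kernel(const char* in, char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    char b = in[i];
    char c = 'N';                        // anything unexpected maps to N
    if      (b == 'A') c = 'T';
    else if (b == 'T') c = 'A';
    else if (b == 'C') c = 'G';
    else if (b == 'G') c = 'C';
    out[i] = c;
}

// Hypothetical host wrapper: copy in, launch, copy out.
std::string complement_on_gpu(const std::string& seq)
{
    int n = (int)seq.size();
    if (n == 0) return std::string();

    char *d_in, *d_out;
    cudaMalloc((void**)&d_in, n);
    cudaMalloc((void**)&d_out, n);
    cudaMemcpy(d_in, seq.data(), n, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    complement_kernel<<<blocks, threads>>>(d_in, d_out, n);

    std::string result(n, 'N');
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result[0], d_out, n, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    // Sequential CPU code: input handling and reporting stay on the host.
    std::string seq = "ACGTTGCA";
    std::string comp = complement_on_gpu(seq);
    printf("%s -> %s\n", seq.c_str(), comp.c_str());
    return 0;
}
```

For an input this small the copies dominate the run time; offloading pays off only when the work per transferred byte is large. Error checking is omitted for brevity.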

Hello World in CUDA

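    #include <stdio.h>

    // Kernel: runs once per GPU thread
    __global__
    void parallel_hello_world()
    {
        printf("Hello, world. This is thread %d, block %d!\n",
               threadIdx.x, blockIdx.x);
    }

    int main()
    {
        // Launch 128 blocks of 128 threads each
        parallel_hello_world<<<128,128>>>();
        // Wait for the kernel to finish so its printf output is flushed
        cudaDeviceSynchronize();
        return 0;
    }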

> nvcc -o hello_world -arch=sm_30 main.cu

> ./hello_world

Hello, world. This is thread 0, block 0!

Hello, world. This is thread 1, block 0!

...

GPUs for Bioinformatics

Life Technologies

Ion Proton

3 GPUs per Device

S3229 - GPU Accelerated Signal Processing in Ion Proton

Whole Genome Sequencer

Mohit Gupta ( Life Technologies )

Jakob Siegel ( Life Technologies )

https://registration.gputechconf.com/form/session-listing





BGI & NVIDIA

Joint Innovation Lab

SOAP3 Aligner

S3257 - Tackling Big Data in Genomics with GPU

BingQiang Wang (Beijing Genomics Institute)







CUDASW++

From Bertil Schmidt’s group: http://cudasw.sourceforge.net/homepage.htm

Y. Liu, A. Wirawan, B. Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.

Performance comparisons on the Swiss-Prot database:

“On GTX680 (GTX690), CUDASW++ 3.0 yields an average performance of 109.4 (169.7) GCUPS, with a maximum of 119.0 (185.6) GCUPS.”




http://www.biomedcentral.com/1471-2105/14/117
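GCUPS stands for billions of dynamic-programming cell updates per second: query length times total database residues, divided by run time in seconds and by 10^9. The sketch below is a deliberately simple Smith-Waterman search that assigns one GPU thread to each database sequence and reports that figure. It is not the CUDASW++ implementation: the linear gap penalty, the scoring values, the MAX_QUERY cap, and the names sw_kernel, sw_search and gcups are all assumptions made for this illustration.

```cuda
#include <cstdio>
#include <string>
#include <vector>
#include <cuda_runtime.h>

#define MAX_QUERY 128   // assumed cap so each thread can hold one DP row
#define MATCH       2   // assumed scores, linear gap penalty
#define MISMATCH   -1
#define GAP        -1

// One thread scores the query against one database sequence.
__global__ void sw_kernel(const char* query, int qlen, const char* db,
                          const int* offsets, int nseq, int* scores)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nseq) return;
    int H[MAX_QUERY + 1];                     // one DP row, updated in place
    for (int j = 0; j <= qlen; ++j) H[j] = 0;
    int best = 0;
    for (int i = offsets[s]; i < offsets[s + 1]; ++i) {
        int diag = 0;                         // H[i-1][j-1]
        for (int j = 1; j <= qlen; ++j) {
            int up = H[j];                    // H[i-1][j]
            int h = diag + (query[j - 1] == db[i] ? MATCH : MISMATCH);
            h = max(h, up + GAP);
            h = max(h, H[j - 1] + GAP);       // H[i][j-1], already updated
            h = max(h, 0);                    // local alignment floor
            diag = up;
            H[j] = h;
            best = max(best, h);
        }
    }
    scores[s] = best;
}

double gcups(double cells, double seconds) { return cells / seconds / 1e9; }

// Hypothetical host helper; seconds_out may be null.
std::vector<int> sw_search(const std::string& query,
                           const std::vector<std::string>& seqs,
                           double* seconds_out)
{
    int nseq = (int)seqs.size(), qlen = (int)query.size();
    std::vector<int> scores(nseq, 0);
    if (nseq == 0 || qlen == 0 || qlen > MAX_QUERY) return scores;
    std::string db;                           // concatenated database
    std::vector<int> offsets(1, 0);           // start of each sequence in db
    for (int s = 0; s < nseq; ++s) { db += seqs[s]; offsets.push_back((int)db.size()); }

    char *d_query, *d_db; int *d_off, *d_scores;
    cudaMalloc((void**)&d_query, qlen);
    cudaMalloc((void**)&d_db, db.size() + 1);
    cudaMalloc((void**)&d_off, offsets.size() * sizeof(int));
    cudaMalloc((void**)&d_scores, nseq * sizeof(int));
    cudaMemcpy(d_query, query.data(), qlen, cudaMemcpyHostToDevice);
    cudaMemcpy(d_db, db.data(), db.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_off, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    sw_kernel<<<(nseq + 127) / 128, 128>>>(d_query, qlen, d_db, d_off, nseq, d_scores);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // kernel time in milliseconds
    if (seconds_out) *seconds_out = ms / 1000.0;

    cudaMemcpy(scores.data(), d_scores, nseq * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_query); cudaFree(d_db); cudaFree(d_off); cudaFree(d_scores);
    return scores;
}

int main()
{
    std::vector<std::string> seqs;
    seqs.push_back("ACGT"); seqs.push_back("TTTT"); seqs.push_back("AGT");
    double seconds = 0.0;
    std::vector<int> scores = sw_search("ACGT", seqs, &seconds);
    for (size_t s = 0; s < scores.size(); ++s)
        printf("sequence %d: score %d\n", (int)s, scores[s]);
    double cells = 4.0 * (4 + 4 + 3);         // query length x database residues
    if (seconds > 0.0) printf("%.6f GCUPS\n", gcups(cells, seconds));
    return 0;
}
```

A toy database like this one measures launch overhead rather than throughput, so the printed GCUPS value is only meaningful for large inputs. Production tools add affine gaps, substitution matrices, and SIMD-style packing that this sketch leaves out.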




NVIDIA GPU Life Science Focus

Molecular Dynamics: All codes are available

AMBER, CHARMM, DESMOND, DL_POLY, GROMACS, LAMMPS, NAMD

Great multi-GPU performance

GPU codes: Abalone, ACEMD, HOOMD-Blue

Focus: scaling to large numbers of GPUs

Quantum Chemistry: key codes ported or being optimized

Active GPU acceleration projects: VASP, NWChem, Gaussian, GAMESS, ABINIT, Quantum Espresso, BigDFT, CP2K, GPAW, etc.

GPU code: TeraChem

Analytical and Medical Imaging Instruments

NVBIO

A GPU-based C++ framework for High Throughput Sequence Analysis

Short & Long Read Alignment

Variant Calling

Compression

…

Overall Design:

flexibility & customizability – a templated library

parallelism at every level

optimize throughput, server-like design

optimize the whole pipeline, not just a single component

(e.g. including data transfers, SAM, BAM, CRAM I/O, …)

A modular library

Tries: FM-index, Suffix Trie, Radix Tree, Sorted Dictionary

DP Alignment: Edit Distance, Smith-Waterman, Needleman-Wunsch, Gotoh, Banded/Full DP

Text Search: Exact Search, Backtracking

Sequence I/O: FASTQ, FASTA

Alignment I/O: SAM, BAM, CRAM

Support Tools: HTML report generators

GPU: O(1k-10k) threads

CPU: O(10-100) threads

nvBowtie2 - Real Datasets

Ion Proton, 100M x 175bp (8-350), end-to-end (accession: -): speedup 4.3x, alignment rate +0.5%, disagreement 0.002%

Illumina Genome Analyzer II, 10M x 100bp x 2, end-to-end (accession: ERR161544): speedup 2.4x, alignment rate =, disagreement 0.006%

Ion Proton, 100M x 175bp (8-350), local (accession: -): speedup 7.6x, alignment rate -0.6%, disagreement 0.03%

Illumina Genome Analyzer II, 10M x 100bp x 2, local (accession: ERR161544): speedup 2.6x, alignment rate =, disagreement 0.022%

TT32

NVBIO: efficient sequences analysis on GPUs

Jacopo Pantaleoni Tuesday 2:10 pm, Hall 9

GPU Technology Conference


Tag: “Bioinformatics and Genomics”

http://www.gputechconf.com/page/home.html

Google: “GPU Technology Conference”







Resources

3 Ways to Accelerate Applications

Applications can be accelerated through:

Libraries: “drop-in” acceleration

OpenACC directives: easily accelerate applications

Programming languages: maximum flexibility

GPU Accelerated Libraries: “Drop-in” Acceleration for your Applications

Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE

Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND

Data Struct. & AI (Sort, Scan, Zero Sum): GPU AI - Board Games, GPU AI - Path Finding

Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

OpenACC: Open, Simple, Portable

• Open Standard

• Easy, Compiler-Driven Approach

• Portable on GPUs and Xeon Phi

    main() {
      …
      <serial code>
      …
      #pragma acc kernels    // compiler hint
      {
        <compute intensive code>
      }
      …
    }

CAM-SE Climate: 6x faster on GPU, 2x faster on CPU only

Top Kernel: 50% of Runtime


GPU Programming Languages

Fortran: OpenACC, CUDA Fortran

C: OpenACC, CUDA C

C++: Thrust, CUDA C++

Python: PyCUDA, Anaconda Accelerate

C#: GPU.NET

Numerical analytics: R, MATLAB, Mathematica, LabVIEW
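To give a feel for the C++ route, here is a small Thrust sketch that needs no hand-written kernel. It counts distinct values in a list by sorting on the GPU and dropping adjacent duplicates. The use case (integer-encoded k-mers) and the function name count_distinct are assumptions made for illustration; only the Thrust calls themselves (device_vector, sort, unique) are library API.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Hypothetical helper: number of distinct codes, computed on the GPU.
int count_distinct(const std::vector<unsigned int>& codes)
{
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<unsigned int> d(codes.begin(), codes.end());

    // Parallel sort, so equal values become adjacent.
    thrust::sort(d.begin(), d.end());

    // unique() compacts the range and returns the end of the deduplicated part.
    thrust::device_vector<unsigned int>::iterator last =
        thrust::unique(d.begin(), d.end());

    return (int)(last - d.begin());
}

int main()
{
    // Stand-in data; in practice these could be 2-bit-packed k-mers.
    unsigned int raw[] = {3, 1, 3, 2, 1};
    std::vector<unsigned int> codes(raw, raw + 5);
    printf("distinct codes: %d\n", count_distinct(codes));
    return 0;
}
```

The file is compiled with nvcc like any other CUDA source. Thrust picks launch parameters itself, which is the trade-off against CUDA C++ kernels: less control, much less code.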

Reaching New Developers - CUDA Python: Python Productivity + GPU Performance

Easy to Learn

Powerful Libraries

Popular with New Developers

HPC & Data Analytics

Data from CodeEval.com, based on 100k+ code samples

Easiest Way to Learn CUDA

50K Registered

127 Countries


Learn from the Best

Anywhere, Any Time

It’s Free!

Engage with an Active Community

Feedback/Discussion