Presentation by Jonathan Cohen & Mark Berger at Bioinformatics conference, July 2013. It covers: GPU programming in 10 slides; GPUs in bioinformatics; porting SeqAn to CUDA; resources for developers and bioinformatics professionals.
GPU ACCELERATION OF BIOINFORMATICS PIPELINES
Jonathan Cohen and Mark Berger, NVIDIA
Agenda
GPU Programming in 10 slides – Cohen (10 minutes)
GPUs for Bioinformatics – Cohen (10 minutes)
Experiences porting SeqAn to CUDA – Siragusa (15 minutes)
Resources – Berger (5 minutes)
Discussion, Q&A – All (20 minutes)
GPU Programming in Ten Slides
CUDA – Programming for Throughput
CPU threads:
Large amount of memory per thread
Full-featured instruction set
1-16 execute simultaneously
Run few threads, each one very fast
CUDA threads:
Lightweight footprint
Full-featured instruction set
10,000 execute simultaneously
Run many threads, each one slow => total throughput high
CPU (host): executes functions
GPU (device): executes kernels
CUDA Kernels: Parallel Threads
A kernel is an array of threads, executed in parallel
All threads execute the same code
Each thread has an ID, used to:
Select input/output data
Make control decisions
float x = input[threadID];
float y = func(x);
output[threadID] = y;
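The three-line fragment above can be expanded into a complete kernel plus the host code that launches it. This is a minimal sketch, not code from the talk: `func` is given a hypothetical body (squaring), `apply_func` and `run_apply_func` are illustrative names, the kernel runs in a single block so that `threadIdx.x` alone serves as the thread ID, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the slide's func(): squares its input.
__device__ float func(float x) { return x * x; }

// Every thread runs this same code; threadID selects the element it handles.
__global__ void apply_func(const float* input, float* output)
{
    int threadID = threadIdx.x;   // single block, so the thread index is the ID
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Hypothetical host helper: one thread per element of `in`.
// Assumes in.size() <= 1024, the per-block thread limit on current GPUs.
std::vector<float> run_apply_func(const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    apply_func<<<1, n>>>(d_in, d_out);   // 1 block of n threads

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `run_apply_func` on a small vector should return each element squared, with every element computed by its own thread.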
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
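One way to see the grid/block/thread hierarchy in code is the common pattern of combining the block index and the thread index into a single global index. The sketch below is illustrative only: the kernel `scale`, the helper `run_scale`, and the block size of 256 are assumptions chosen for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread derives a unique global index from its block and thread indices.
__global__ void scale(const float* input, float* output, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // the last block may contain spare threads
        output[i] = factor * input[i];
}

// Hypothetical host helper: launches enough blocks to cover all n elements.
std::vector<float> run_scale(const std::vector<float>& in, float factor)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    const int threadsPerBlock = 256;   // assumed block size for this sketch
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch configuration is <<<blocks in the grid, threads per block>>>.
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, factor, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The bounds check matters because the grid is rounded up to a whole number of blocks, so some threads in the final block may fall past the end of the data.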
Accelerated Computing: Multi-core plus Many-cores
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Many Parallel Tasks
3-10X+ Compute Throughput
7X Memory Bandwidth
5X Energy Efficiency
How GPU Acceleration Works
Application Code
GPU: Compute-Intensive Functions (5% of Code)
CPU: Rest of Sequential CPU Code
Hello World in CUDA
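#include <stdio.h>

__global__
void parallel_hello_world()
{
    printf("Hello, world. This is thread %d, block %d!\n",
           threadIdx.x, blockIdx.x);
}

int main()
{
    // Launch 128 blocks of 128 threads each
    parallel_hello_world<<<128,128>>>();
    // Wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}
> nvcc -o hello_world -arch=sm_30 main.cu
> ./hello_world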
Hello, world. This is thread 0, block 0!
Hello, world. This is thread 1, block 0!
...
GPUs for Bioinformatics
Life Technologies
Ion Proton
3 GPUs per Device
S3229 - GPU Accelerated Signal Processing in Ion Proton Whole Genome Sequencer
Mohit Gupta ( Life Technologies )
Jakob Siegel ( Life Technologies )
https://registration.gputechconf.com/form/session-listing
BGI & NVIDIA
Joint Innovation Lab
SOAP3 Aligner
S3257 - Tackling Big Data in Genomics with GPU
BingQiang Wang (Beijing Genomics Institute)
https://registration.gputechconf.com/form/session-listing
CUDASW++
From Bertil Schmidt’s group: http://cudasw.sourceforge.net/homepage.htm
Y. Liu, A. Wirawan, B. Schmidt: "CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions". BMC Bioinformatics, 2013, 14:117.
Performance comparisons on the Swiss-Prot database.
“On GTX680 (GTX690), CUDASW++ 3.0 yields an average performance of 109.4 (169.7) GCUPS, with a maximum of 119.0 (185.6) GCUPS.”
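For readers unfamiliar with the unit in the figures quoted above: GCUPS stands for billions of cell updates per second, where one cell update is the computation of one entry of the Smith-Waterman dynamic-programming matrix. As a sketch, with symbols introduced here for illustration (m for the query length, n for the total number of residues in the database, t for the search time in seconds):

```latex
\mathrm{GCUPS} = \frac{m \cdot n}{t \cdot 10^{9}}
```

On this definition, 109.4 GCUPS corresponds to roughly 1.094 x 10^11 matrix cells computed per second.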
NVIDIA GPU Life Science Focus
Molecular Dynamics: All codes are available
AMBER, CHARMM, DESMOND, DL_POLY,
GROMACS, LAMMPS, NAMD
Great multi-GPU performance
GPU codes: Abalone, ACEMD, HOOMD-Blue
Focus: scaling to large numbers of GPUs
Quantum Chemistry: key codes ported or optimizing
Active GPU acceleration projects:
VASP, NWChem, Gaussian, GAMESS, ABINIT,
Quantum Espresso, BigDFT, CP2K, GPAW, etc.
GPU code: TeraChem
Analytical and Medical Imaging Instruments
NVBIO
A GPU-based C++ framework for High Throughput Sequence Analysis
Short & Long Read Alignment
Variant Calling
Compression
…
Overall Design:
flexibility & customizability – a templated library
parallelism at every level
optimize throughput, server-like design
optimize the whole pipeline, not just a single component (e.g. including data transfers, SAM, BAM, CRAM I/O, …)
A modular library
Tries: FM-index, Suffix Trie, Radix Tree, Sorted Dictionary
DP Alignment: Edit Distance, Smith-Waterman, Needleman-Wunsch, Gotoh, Banded/Full DP
Text Search: Exact Search, Backtracking
Sequence I/O: FASTQ, FASTA
Alignment I/O: SAM, BAM, CRAM
Support Tools: HTML report generators
GPU: O(1k-10k) threads
CPU: O(10-100) threads
nvBowtie2 - Real Datasets
Ion Proton 100M x 175bp (8-350), end-to-end: speedup 4.3x, alignment rate +0.5%, disagreement 0.002%
Illumina Genome Analyzer II 10M x 100bp x 2 (ERR161544), end-to-end: speedup 2.4x, alignment rate =, disagreement 0.006%
Ion Proton 100M x 175bp (8-350), local: speedup 7.6x, alignment rate -0.6%, disagreement 0.03%
Illumina Genome Analyzer II 10M x 100bp x 2 (ERR161544), local: speedup 2.6x, alignment rate =, disagreement 0.022%
TT32
NVBIO: efficient sequences analysis on GPUs
Jacopo Pantaleoni Tuesday 2:10 pm, Hall 9
GPU Technology Conference
https://registration.gputechconf.com/form/session-listing
Tag: “Bioinformatics and Genomics”
http://www.gputechconf.com/page/home.html
Google: “GPU Technology Conference”
Resources
3 Ways to Accelerate Applications
Applications
Libraries: “Drop-in” Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
GPU Accelerated Libraries: “Drop-in” Acceleration for your Applications
Linear Algebra: FFT, BLAS, SPARSE, Matrix
Numerical & Math: RAND, Statistics
Data Struct. & AI: Sort, Scan, Zero Sum
Visual Processing: Image & Video
Libraries shown: NVIDIA cuFFT, cuBLAS, cuSPARSE; NVIDIA Math Lib; NVIDIA cuRAND; NVIDIA NPP; NVIDIA Video Encode; GPU AI – Board Games; GPU AI – Path Finding
OpenACC: Open, Simple, Portable
• Open Standard
• Easy, Compiler-Driven Approach
• Portable on GPUs and Xeon Phi
main() {
  …
  <serial code>
  …
  #pragma acc kernels   // compiler hint
  {
    <compute intensive code>
  }
  …
}
CAM-SE Climate: 6x Faster on GPU, 2x Faster on CPU only
Top Kernel: 50% of Runtime
GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Anaconda Accelerate
C#: GPU.NET
Numerical analytics: R, MATLAB, Mathematica, LabVIEW
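To give a feel for the C++ option listed above, here is a minimal Thrust sketch that sorts and sums a vector on the GPU without writing an explicit kernel. The helper name `sort_and_sum` is hypothetical and not from the talk; the Thrust calls themselves (`device_vector`, `sort`, `reduce`, `copy`) are standard library facilities.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical helper: sorts `values` in place using the GPU
// and returns the sum of its elements.
int sort_and_sum(std::vector<int>& values)
{
    // Constructing a device_vector from host iterators copies host -> device.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                   // parallel sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);  // parallel sum on the device

    // Copy the sorted data back, device -> host.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

The appeal of this style is that memory management and kernel launches stay hidden behind STL-like containers and algorithms, at the cost of less control than hand-written CUDA C++.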
Reaching New Developers - CUDA Python
Python Productivity + GPU Performance
Easy to Learn
Powerful Libraries
Popular with New Developers
HPC & Data Analytics
Data from CodeEval.com, based on 100k+ code samples
Easiest Way to Learn CUDA
50K Registered
127 Countries
Learn from the Best
Anywhere, Any Time
It’s Free!
Engage with an Active Community
Feedback/Discussion