© NVIDIA Corporation 2011 The ‘Super’ Computing Company From Super Phones to Super Computers...

The ‘Super’ Computing Company

From Super Phones to Super Computers

CUDA 4.0

CUDA Toolkit 4.0 Release Candidate: Available to Registered Developers on March 4th

Press Embargo : February 28th – 6am PST (San Francisco)

Rapid Application Porting: Unified Virtual Addressing

Faster Multi-GPU Programming: GPUDirect 2.0

CUDA 4.0: Application Porting Made Simpler

Easier Parallel Programming in C++: Thrust

CUDA 4.0 for Broader Developer Adoption

CUDA 1.0 (2007): Researchers and Early Adopters

CUDA 2.0 (2008): Scientists and Applications

CUDA 3.0 (2009): Application Innovation Leaders

CUDA 4.0 (2011): Broader Developer Adoption

NVIDIA GPUDirect™: Towards Eliminating the CPU Bottleneck

Version 1.0: for applications that communicate over a network

• Direct access to GPU memory for 3rd party devices

• Eliminates unnecessary system memory copies & CPU overhead

• Supported by Mellanox and QLogic

• Up to 30% improvement in communication performance

Version 2.0: for applications that communicate within a node

• Peer-to-Peer memory access, transfers & synchronization

• MPI implementations natively support GPU data transfers

• Less code, higher programmer productivity

Details @ http://www.nvidia.com/object/software-for-tesla-products.html

Before GPUDirect v2.0: Required Copy into Main Memory

[Diagram: GPU1 Memory, GPU2 Memory, Chipset, System Memory]

GPUDirect v2.0: Peer-to-Peer Communication, Direct Transfers between GPUs

[Diagram: GPU1 Memory, GPU2 Memory, Chipset, System Memory]
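The direct GPU-to-GPU transfer shown in the second diagram can be sketched with the CUDA runtime API. This is a minimal sketch, assuming a machine with at least two peer-capable GPUs; the device ordinals 0 and 1, the 1 MiB buffer size, and the `CHECK` helper macro are illustrative choices, not taken from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call and exit main().
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // illustrative 1 MiB buffer

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    // Ask the runtime whether each GPU can address the other's memory.
    int access01 = 0, access10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&access01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&access10, 1, 0));

    float *buf0 = NULL, *buf1 = NULL;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaMemset(buf0, 0, bytes));
    if (access01) CHECK(cudaDeviceEnablePeerAccess(1, 0));

    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, bytes));
    if (access10) CHECK(cudaDeviceEnablePeerAccess(0, 0));

    // Copy from GPU 0 to GPU 1. With peer access enabled the data can move
    // directly between the GPUs; otherwise the runtime stages it through
    // system memory.
    CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
    CHECK(cudaDeviceSynchronize());
    printf("Peer access 0->1: %d, 1->0: %d\n", access01, access10);

    CHECK(cudaFree(buf1));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(buf0));
    return 0;
}
```

Checking `cudaDeviceCanAccessPeer` first keeps the sketch usable on systems where the two GPUs cannot reach each other directly; the copy still completes, only without the direct path.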

Unified Virtual Addressing: Easier to Program with Single Address Space

No UVA: Multiple Memory Spaces

[Diagram: System Memory, GPU0 Memory, and GPU1 Memory each have their own 0x0000 to 0xFFFF address range; CPU and GPUs connected over PCI-e]

UVA: Single Address Space

[Diagram: System Memory, GPU0 Memory, and GPU1 Memory share a single 0x0000 to 0xFFFF address range; CPU and GPUs connected over PCI-e]
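One practical consequence of a single address space is that the runtime can tell from a pointer value where the memory lives. The following is a minimal sketch, assuming device 0 supports UVA (a 64-bit process on a capable GPU); the buffer size and the `CHECK` helper macro are illustrative and not part of the slides.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call and exit main().
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.unifiedAddressing) {
        printf("Device 0 does not report unified addressing.\n");
        return 0;
    }

    const int n = 256;                       // illustrative element count
    const size_t bytes = n * sizeof(float);
    float *host = NULL, *dev = NULL;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMallocHost(&host, bytes));     // pinned host buffer
    CHECK(cudaMalloc(&dev, bytes));
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    // cudaMemcpyDefault lets the runtime infer the copy direction from the
    // pointer values instead of the caller spelling it out.
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyDefault));
    memset(host, 0, bytes);
    CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDefault));

    // The runtime can also report which device owns a given pointer.
    cudaPointerAttributes attr;
    CHECK(cudaPointerGetAttributes(&attr, dev));
    printf("dev belongs to device %d; host[10] = %f\n", attr.device, host[10]);

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

Without UVA, each `cudaMemcpy` call would need an explicit direction such as `cudaMemcpyHostToDevice`, because a bare pointer value would be ambiguous across the separate memory spaces.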

C++ Templatized Algorithms & Data Structures (Thrust)

Powerful open source C++ parallel algorithms & data structures

Similar to C++ Standard Template Library (STL)

Automatically chooses the fastest code path at compile time

Divides work between GPUs and multi-core CPUs

Parallel sorting @ 5x to 100x faster than STL and TBB

Data Structures

• thrust::device_vector

• thrust::host_vector

• thrust::device_ptr

• Etc.

Algorithms

• thrust::sort

• thrust::reduce

• thrust::exclusive_scan

• Etc.
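The containers and algorithms listed above compose much like their STL counterparts. The following is a minimal sketch, assuming the Thrust headers that ship with the CUDA Toolkit and compilation with nvcc; the input size and the random test data are illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

int main() {
    // Fill a host-side vector with illustrative random values.
    thrust::host_vector<int> h(1 << 16);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;

    // Assigning a host_vector to a device_vector copies the data to the GPU.
    thrust::device_vector<int> d = h;

    // Sort on the device, then sum all elements (initial value 0).
    thrust::sort(d.begin(), d.end());
    int total = thrust::reduce(d.begin(), d.end());

    // Exclusive prefix sum: offsets[i] is the sum of d[0] .. d[i-1].
    thrust::device_vector<int> offsets(d.size());
    thrust::exclusive_scan(d.begin(), d.end(), offsets.begin());

    // Reading single elements copies them back to the host.
    int smallest = d.front();
    int lastOffset = offsets.back();
    printf("smallest = %d, total = %d, last offset = %d\n",
           smallest, total, lastOffset);
    return 0;
}
```

The iterator-based calls mirror `std::sort` and `std::accumulate`, which is what lets existing STL-style code move to the GPU with few changes.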

Parallel Programming Sweet Spot

CUDA 4.0: Highlights

Easier Parallel Application Porting

• Share GPUs across multiple threads

• Single thread access to all GPUs

• No-copy pinning of system memory

• New CUDA C/C++ features

• Thrust templated primitives library

• NPP image/video processing library

• Layered Textures

New & Improved Developer Tools

• Auto Performance Analysis

• C++ Debugging

• GPU Binary Disassembler

• cuda-gdb for MacOS

Faster Multi-GPU Programming

• Unified Virtual Addressing

• NVIDIA GPUDirect™ v2.0

• Peer-to-Peer Access

• Peer-to-Peer Transfers

• GPU-accelerated MPI
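Two of the porting highlights, single-thread access to all GPUs and no-copy pinning of system memory, can be sketched together. This is a minimal sketch, assuming a POSIX system (for `posix_memalign`) and a 4 KiB page size; the buffer size and the `CHECK` helper macro are illustrative. The buffer is page-aligned as a conservative choice, on the assumption that some toolkit versions expect aligned pointers for registration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call and exit main().
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

int main() {
    const size_t bytes = 1 << 22;  // illustrative 4 MiB, a multiple of 4 KiB

    // Stand-in for memory that an existing application already allocated.
    void* data = NULL;
    if (posix_memalign(&data, 4096, bytes) != 0) return 1;
    memset(data, 1, bytes);

    // Pin the existing allocation in place instead of copying it into a
    // buffer obtained from cudaMallocHost.
    CHECK(cudaHostRegister(data, bytes, cudaHostRegisterPortable));

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    std::vector<void*> dbuf(deviceCount, static_cast<void*>(NULL));

    // One host thread drives every GPU by switching the current device.
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMalloc(&dbuf[dev], bytes));
        CHECK(cudaMemcpyAsync(dbuf[dev], data, bytes,
                              cudaMemcpyHostToDevice, 0));
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaDeviceSynchronize());
        CHECK(cudaFree(dbuf[dev]));
    }
    printf("Copied %zu bytes to %d device(s).\n", bytes, deviceCount);

    CHECK(cudaHostUnregister(data));
    free(data);
    return 0;
}
```

Registering with `cudaHostRegisterPortable` makes the pinned range usable from every device's context, which matches the loop that issues copies to each GPU from the same thread.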

GPU Technology Conference 2011

Oct. 11-14 | San Jose, CA

3rd annual GPU Technology Conference

New for 2011:

Co-located with Los Alamos HPC Symposium

300+ Research Scientists from National Labs

2010 highlights

• 280 hours of sessions

• 100+ Research posters

• 42 countries represented

www.gputechconf.com

BACKGROUND SLIDES

CUDA 4.0

NVIDIA CUDA Summary

Libraries

New in CUDA 4.0: Thrust C++ Library (Templated Performance Primitives)

NVIDIA Library Support

• Complete math.h

• Complete BLAS Library (1, 2 and 3)

• Sparse Matrix Math Library

• RNG Library

• FFT Library (1D, 2D and 3D)

• Image Processing Library

• Video Processing Library (NPP)

3rd Party Math Libraries

• CULA Tools

• MAGMA

• IMSL

• VSIPL

Tools

New in CUDA 4.0: Parallel Nsight Pro

NVIDIA Tools Support

• Parallel Nsight 1.0 IDE

• cuda-gdb Debugger with multi-GPU

• CUDA/OpenCL Visual Profiler

• CUDA Memory Checker

• CUDA C SDK

• CUDA Disassembler

CUDA Partner Tools

• Allinea DDT

• RogueWave / TotalView

• Vampir

• Tau

• CAPS HMPP

Platform

New in CUDA 4.0: GPUDirect 2.0 (Fast Path to Data)

Hardware Support

• ECC Memory

• Double Precision

• Native 64-bit Architecture

• Concurrent Kernel Execution

• Dual Copy Engines

• Multi-GPU support

• 6GB per GPU supported

Operating System Support

• MS Windows 32/64

• Linux 32/64 support

• Mac OSX support

Cluster Management

• GPUDirect

• Tesla Compute Cluster (TCC)

• Graphics Interoperability

Programming Model

New in CUDA 4.0: Unified Virtual Addressing, C++ new/delete, C++ Virtual Functions

C support

• NVIDIA C Compiler

• CUDA C Parallel Extensions

• Function Pointers

• Recursion

• Atomics

• malloc/free

C++ support

• Classes/Objects

• Class Inheritance

• Polymorphism

• Operator Overloading

• Class Templates

• Function Templates

• Virtual Base Classes

• Namespaces

Fortran, OpenCL
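The new programming-model items, C++ new/delete and virtual functions in device code, can be sketched in one kernel. This is a minimal sketch, assuming a GPU of compute capability 2.0 or later and a matching nvcc `-arch` setting; the `Shape`, `Square`, and `Circle` classes and the `computeAreas` kernel are hypothetical names invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical class hierarchy used only for this illustration.
struct Shape {
    __host__ __device__ virtual ~Shape() {}
    __host__ __device__ virtual float area() const = 0;
};

struct Square : Shape {
    float side;
    __host__ __device__ explicit Square(float s) : side(s) {}
    __host__ __device__ virtual float area() const { return side * side; }
};

struct Circle : Shape {
    float radius;
    __host__ __device__ explicit Circle(float r) : radius(r) {}
    __host__ __device__ virtual float area() const {
        return 3.14159265f * radius * radius;
    }
};

// Each thread allocates an object on the device heap, calls a virtual
// function through the base pointer, and frees the object again.
__global__ void computeAreas(float* out) {
    const int i = threadIdx.x;
    Shape* shape = NULL;
    if (i % 2 == 0) {
        shape = new Square(2.0f);
    } else {
        shape = new Circle(1.0f);
    }
    if (shape != NULL) {      // device-side new yields NULL when the heap is full
        out[i] = shape->area();
        delete shape;
    } else {
        out[i] = -1.0f;       // marker for a failed allocation
    }
}

int main() {
    const int n = 4;          // illustrative thread count
    float* d_out = NULL;
    float h_out[n];
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    computeAreas<<<1, n>>>(d_out);
    if (cudaMemcpy(h_out, d_out, n * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) printf("area[%d] = %f\n", i, h_out[i]);
    cudaFree(d_out);
    return 0;
}
```

The objects are created inside the kernel because an object with virtual functions generally has to be constructed on the side that calls them.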

cuda-gdb Now Available for MacOS

Details @ http://developer.nvidia.com/object/cuda-gdb.html

Automated Performance Analysis in Visual Profiler

Summary analysis & hints

• Session

• Device

• Context

• Kernel

New UI for kernel analysis

• Identify limiting factor

• Analyze instruction throughput

• Analyze memory throughput

• Analyze kernel occupancy

NVIDIA Parallel Nsight™

Professional features now available free of charge!

Key Features

Professional Profiler Standard

Microsoft Visual Studio 2010 support

Single System Debugging

Tesla Compute Cluster

CUDA Toolkit 3.2

CUDA 3rd Party Ecosystem

Parallel Debuggers

Visual Studio IDE with Parallel Nsight Pro

Allinea DDT Debugger

TotalView Debugger

Performance Tools

ParaTools VampirTrace

TauCUDA Performance Tools

HPC Toolkit

Compute Platform Providers

Cloud Compute

Amazon EC2

Peer 1

OEMs

Cluster Tools

Cluster Management

Platform LSF Cluster Manager

Platform Symphony

Bright Cluster Manager

Job Scheduling

Altair PBS

Cluster Resources TORQUE

MPI Libraries

OpenMPI

QLogic OFED

Compilers

PGI CUDA Fortran

PGI Accelerators

PGI CUDA x86

CAPS HMPP

TidePowerd GPU.NET

PyCUDA

NVIDIA CUDA Developer Resources

ENGINES & LIBRARIES

Math Libraries: CUFFT, CUBLAS, CUSPARSE, CURAND

3rd Party Libraries: CULA LAPACK, VSIPL,

NPP Image Libraries: Performance primitives for imaging

App Acceleration Engines: Ray Tracing (OptiX, iray)

Video Libraries: NVCUVID / NVCUVENC

DEVELOPMENT TOOLS

CUDA Toolkit: Complete GPU computing development kit

cuda-gdb: GPU hardware debugging

Visual Profiler: GPU hardware profiler for CUDA C and OpenCL

Parallel Nsight: Integrated development environment for Visual Studio

SDKs AND CODE SAMPLES

GPU Computing SDK: CUDA C/C++, DirectCompute, OpenCL code samples and documentation

Books: CUDA by Example, GPU Gems

Optimization Guides: Best Practices for GPU computing and graphics development

http://developer.nvidia.com

Proven Research Vision

Johns Hopkins University

Nanyang University

Technical University-Czech

SINTEF

HP Labs

Barcelona Supercomputing Center

Clemson University

Fraunhofer SCAI

Karlsruhe Institute of Technology

World Class Research Leadership and Teaching

University of Cambridge

Harvard University

University of Utah

University of Tennessee

University of Maryland

University of Illinois at Urbana-Champaign

Tsinghua University

Tokyo Institute of Technology

Chinese Academy of Sciences

National Taiwan University

Georgia Institute of Technology

http://research.nvidia.com

GPGPU Education: 350+ Universities

Academic Partnerships / Fellowships

GPU Computing Research & Education

Mass. Gen. Hospital/NE Univ

North Carolina State University

Swinburne University of Tech.

Technische Univ. Munich

University of New Mexico

University of Warsaw-ICM

VSB-Technical University of Ostrava

And more coming shortly.

CUDA Applications Momentum Increasing

Today’s CUDA CAE Solutions

Structural Mechanics

• ANSYS Mechanical

• Abaqus/Standard (beta)

Fluid Dynamics

• AcuSolve

• Moldflow

• Culises (OpenFOAM)

• Particleworks

Electromagnetics

• Nexxim

• EMPro

• CST MS

• XFdtd

• SEMCAD X
