14

Nvidia Kepler GPU Microarchitecture By: Andrew Kuebler & Joshua Lake

Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Download PDF Report

Upload
others
View
33
Download
0

Embed Size (px)

Citation preview

Page 1: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Nvidia Kepler GPU Microarchitecture

By: Andrew Kuebler & Joshua Lake

Page 2: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Overview● Kepler Architecture● Streaming Multiprocessor (SMX)● Improvements over Fermi Architecture● Hyper-Q Technology● Dynamic Parallelism ● Nvidia GRID● Kepler Applications

Page 3: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Nvidia Kepler Microarchitecture● 192 single-precision

CUDA cores● 28 nm manufacturing

process● 15 SMX processors● Six 64-bit memory

controllers

Page 4: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Kepler SMX Processor● 192 Single-precision● 64 Double-precision● 32 Load/Store● 32 Special Function● 16 Texture● 65536 32-bit Registers● 4 Warps

Page 5: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Nvidia GPU Cache Hierarchy ● L1/L2 Unified Cache● 64 KB Configurable

Shared Memory and L1 Cache

● 48 KB Read-Only Data Cache

● 1536 KB L2 Cache● Content protected using

SECDED EEC

Page 6: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Kepler Quad-Warp Scheduler● 1 Warp = 32 parallel

threads● Each SMX can handle 4

warps and 8 instruction dispatch units

● Each warp allows two independent instructions per cycle

Page 7: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Kepler vs Fermi (Last Gen)● Most cores operate on GPU core

clock instead of shader clocko 3x increase in performance

per watt at cost of larger area● 5 additional atomic operations

o Increases performance of parallel programming

● Replaced texture binding table with bindless texture stateso Allows for unlimited

simultaneous texture access● Shuffle instruction added to ISA

o Shares data between threads in the same warp

Page 8: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Hyper-Q Technology● 32 simultaneous

hardware-managed connections/queues

● Vastly increases GPU utilization by keeping many more cores busy

● Removes false dependencies that caused bottlenecks

Page 9: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Dynamic Parallelism● Removes dependence

on CPU for launching new kernels

● Allows GPU to generate work for itself

● Parallel/recursive/nested loop heavy code can run solely on GPU

● Frees up CPU to work on other tasks

Page 10: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Kepler and Cloud Computing: GRID● First true GPU

Virtualization solution● Comes in GPU or

standalone modules● Contains up to 4 Kepler

based cards per unit● Boasts extreme energy

efficiency

Page 11: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Page 12: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Kepler Applications: Jetson TK1Tegra K1 SOC

• NVIDIA Kepler GPU with 192 CUDA cores

• NVIDIA 4-Plus-1™ quad-core ARM® Cortex-A15 CPU- See

more at: http://www.nvidia.com/object/jetson-tk1-

embedded-dev-kit.html#sthash.TpimqqUN.dpuf2 GB

memory

16 GB eMMC

Gigabit Ethernet

USB 3.0

SD/MMC

miniPCIe

HDMI 1.4

SATA

Line out/Mic in

RS232 serial port

Expansion ports for additional display, GPIOs, and high-

bandwidth camera interface

Page 13: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Jetson TK1 (Cont)

Page 14: Nvidia Kepler GPU Microarchitecturemeseec.ce.rit.edu/551-projects/spring2014/4-2.pdfNvidia GPU Cache Hierarchy L1/L2 Unified Cache 64 KB Configurable Shared Memory and L1 Cache 48

Questions?

G2 Ruby 3rdRail CodeGear - Embarcadero Website€¦ · .NET Delphi C++Builder ... Core 1 Core 2 L1 Instruction Cache L1 Data Cache L1 Instruction Cache L1 Data Cache L2 for Core 1

G2 Ruby 3rdRail CodeGear - Embarcadero Website€¦ · .NET Delphi C++Builder ... Core 1 Core 2 L1 Instruction Cache L1 Data Cache L1 Instruction Cache L1 Data Cache L2 for Core 1

Documents

ASLR on the Line - NetCologne€¦ · L1 code / L1 data L2... L3 (Last Level Cache), shared between cores... memory access data virtual address physical address MMU TLB cache PT walk

ASLR on the Line - NetCologne€¦ · L1 code / L1 data L2... L3 (Last Level Cache), shared between cores... memory access data virtual address physical address MMU TLB cache PT walk

Documents

L1: PAST PROJECTS - Carnegie Mellon Universityece545/F12/slides/L01_PastProjects.pdf · L1: PAST PROJECTS. 18-545: FALL 2011 ... OpenGL GPU GPU on board ... Dig through project reports

L1: PAST PROJECTS - Carnegie Mellon Universityece545/F12/slides/L01_PastProjects.pdf · L1: PAST PROJECTS. 18-545: FALL 2011 ... OpenGL GPU GPU on board ... Dig through project reports

Documents

L1 Data Cache Decomposition for Energy Efficiency

L1 Data Cache Decomposition for Energy Efficiency

Documents

Selective GPU Caches to Eliminate CPU–GPU HW Cache Coherence

Selective GPU Caches to Eliminate CPU–GPU HW Cache Coherence

Documents

AN Using the i.MXRT L1 Cache - NXP Semiconductors · PDF fileUsing the i.MXRT L1 Cache, Application Note, Rev. 1, 12/2017 ... fetched into a linefill buffer by the AXI bus before being

AN Using the i.MXRT L1 Cache - NXP Semiconductors · PDF fileUsing the i.MXRT L1 Cache, Application Note, Rev. 1, 12/2017 ... fetched into a linefill buffer by the AXI bus before being

Documents

Computação Paralela em GPU Usando CUDAbosco.sobral/ensino/ine5645/Computacao... · 32K Registradores/SM 64K Shared L1 768K Cache L2. C U D A Computer Unified Device Architecture

Computação Paralela em GPU Usando CUDAbosco.sobral/ensino/ine5645/Computacao... · 32K Registradores/SM 64K Shared L1 768K Cache L2. C U D A Computer Unified Device Architecture

Documents

Analytical Modeling of Partially Shared Caches in ......L1 I cache L1 D cacheL1 I cache L1 I cache L1 D cacheL1 I cache core1 core2 core3 app0 app1 app2 app3 3/20 Memory app0 app3

Analytical Modeling of Partially Shared Caches in ......L1 I cache L1 D cacheL1 I cache L1 I cache L1 D cacheL1 I cache core1 core2 core3 app0 app1 app2 app3 3/20 Memory app0 app3

Documents

GPU-based MRC Methods for Overlapping eBeam Shots › docs › bacus-2013-gpu-kato.pdf · GFLOP comparison（CPU vs GPU） Architecture of CPU and GPU CPU has a single cache memory

GPU-based MRC Methods for Overlapping eBeam Shots › docs › bacus-2013-gpu-kato.pdf · GFLOP comparison（CPU vs GPU） Architecture of CPU and GPU CPU has a single cache memory

Documents

– 1 – P6 (PentiumPro,II,III,Celeron) memory system bus interface unit DRAM Memory bus instruction fetch unit L1 i-cache L2 cache cache bus L1 d-cache inst

– 1 – P6 (PentiumPro,II,III,Celeron) memory system bus interface unit DRAM Memory bus instruction fetch unit L1 i-cache L2 cache cache bus L1 d-cache inst

Documents

Adaptive Cache Management for Energy-efﬁcient …impact.crhc.illinois.edu/shared/Papers/chen-micro47.pdfAdaptive Cache Management for Energy-efﬁcient GPU Computing Xuhao Chenzy,

Adaptive Cache Management for Energy-efﬁcient …impact.crhc.illinois.edu/shared/Papers/chen-micro47.pdfAdaptive Cache Management for Energy-efﬁcient GPU Computing Xuhao Chenzy,

Documents

What is the Difference Between l1 and l2 Cache Memory?

What is the Difference Between l1 and l2 Cache Memory?

Documents

RTX ON THE NVIDIA TURING GPU - Hot Chips2 Turing SM 14 TFLOPS + 14 TIPS Concurrent FP & INT Enhanced L1 cache Uniform datapath & RF INTRODUCING TURING Greatest Leap Since 2006 CUDA

RTX ON THE NVIDIA TURING GPU - Hot Chips2 Turing SM 14 TFLOPS + 14 TIPS Concurrent FP & INT Enhanced L1 cache Uniform datapath & RF INTRODUCING TURING Greatest Leap Since 2006 CUDA

Documents

Caches - Stony Brook UniversityL1 I-Cache L1 D-Cache L2 Cache I-TLB D-TLB Main Memory (DRAM) L3 Cache (LLC) Spring 2015 :: CSE 502 –Computer Architecture Processor Memory Organization

Caches - Stony Brook UniversityL1 I-Cache L1 D-Cache L2 Cache I-TLB D-TLB Main Memory (DRAM) L3 Cache (LLC) Spring 2015 :: CSE 502 –Computer Architecture Processor Memory Organization

Documents

University of Washington Today Move on to … 1. University of Washington Hmm… Lets look at this again Disk Main Memory L2 unified cache L1 I-cache L1 D-cache

University of Washington Today Move on to … 1. University of Washington Hmm… Lets look at this again Disk Main Memory L2 unified cache L1 I-cache L1 D-cache

Documents

Radiance Cache Splatting: A GPU-Friendly Global ... · Radiance Cache Splatting: A GPU-Friendly Global Illumination Algorithm Pascal Gautron pgautron@irisa.fr IRISA ∗- UCF † Jaroslav

Radiance Cache Splatting: A GPU-Friendly Global ... · Radiance Cache Splatting: A GPU-Friendly Global Illumination Algorithm Pascal Gautron [email protected] IRISA ∗- UCF † Jaroslav

Documents

Sttřřeedošškkoollsskkáá atteecchhnniikka 22001100 · 2010. 6. 2. · Velikost paměti cache je 384 kB, L1 cache je 128 kB, L2 cache 256 kB. Napětí procesoru je 1,65 V. Počítá

Sttřřeedošškkoollsskkáá atteecchhnniikka 22001100 · 2010. 6. 2. · Velikost paměti cache je 384 kB, L1 cache je 128 kB, L2 cache 256 kB. Napětí procesoru je 1,65 V. Počítá

Documents

SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Documents

A Detailed GPU Cache Model Based on Reuse Distance Theory · HPCA-20 | GPU Cache Model | Gert-Jan van den Braak | February 2014 5 But what can it be used for? Examples of using the

A Detailed GPU Cache Model Based on Reuse Distance Theory · HPCA-20 | GPU Cache Model | Gert-Jan van den Braak | February 2014 5 But what can it be used for? Examples of using the

Documents

Cache&Memories:&& WhyProgrammersNeedtoKnow!& · 2011. 4. 4. · 2 Duke University 3 Compsci 104 IntelCore&i7Cache&Hierarchy Regs L1 d-cache L1 i-cache L2 unified cache Core 0 Regs

Cache&Memories:&& WhyProgrammersNeedtoKnow!& · 2011. 4. 4. · 2 Duke University 3 Compsci 104 IntelCore&i7Cache&Hierarchy Regs L1 d-cache L1 i-cache L2 unified cache Core 0 Regs

Documents

Tag-Split Cache for Efﬁcient GPGPU Cache Utilizationzz124/ics16.pdf · 2016-10-13 · GPUs, e.g., hardware-managed L1 D-caches and software-managed shared memory (software cache)

Tag-Split Cache for Efﬁcient GPGPU Cache Utilizationzz124/ics16.pdf · 2016-10-13 · GPUs, e.g., hardware-managed L1 D-caches and software-managed shared memory (software cache)

Documents

Memory Hierarchy. Hierarchy List Registers L1 Cache L2 Cache Main memory Disk cache Disk Optical Tape

Memory Hierarchy. Hierarchy List Registers L1 Cache L2 Cache Main memory Disk cache Disk Optical Tape

Documents

Das digitale Flipchart für effizientere Zusammenarbeit€¦ · On-Chip-Cache-Speicher L1-Instruction-Cache: 48 KB / L1-Daten-Cache: 32 KB / L2-Cache: 2 MB Taktgeschwindigkeit 1,7

Das digitale Flipchart für effizientere Zusammenarbeit€¦ · On-Chip-Cache-Speicher L1-Instruction-Cache: 48 KB / L1-Daten-Cache: 32 KB / L2-Cache: 2 MB Taktgeschwindigkeit 1,7

Documents

Power Optimization in L1 Cache of Embedded Processors ... · Power Optimization in L1 Cache of Embedded Processors Using CBF Based TOB Architecture B.Gomathi Assistant Professor Department

Power Optimization in L1 Cache of Embedded Processors ... · Power Optimization in L1 Cache of Embedded Processors Using CBF Based TOB Architecture B.Gomathi Assistant Professor Department

Documents

An Adaptive Load Balancer For Graph Analytical ... (SMs). Each SM is associated with resources such as streaming processors (SPs), registers, scratchpad memory, and L1 cache. The GPU

An Adaptive Load Balancer For Graph Analytical ... (SMs). Each SM is associated with resources such as streaming processors (SPs), registers, scratchpad memory, and L1 cache. The GPU

Documents

Analysis and Improvement of Cache performance for ...elec525/projects/mmcache_report.pdf · we only use L-1 data cache miss rate to represent cache performance. Figure 1. L1 D-cache

Analysis and Improvement of Cache performance for ...elec525/projects/mmcache_report.pdf · we only use L-1 data cache miss rate to represent cache performance. Figure 1. L1 D-cache

Documents

Lecture 6: GPU Programming · Data cache (A big one) 13 Credit: Kayvon Fatahalian (Stanford) GPU Architecture (recap)Programming GPUs. Slimming down ... GPU Architecture (recap)Programming

Lecture 6: GPU Programming · Data cache (A big one) 13 Credit: Kayvon Fatahalian (Stanford) GPU Architecture (recap)Programming GPUs. Slimming down ... GPU Architecture (recap)Programming

Documents

A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicoreswongwf/papers/ASPDAC14-ppt.pdf · 2014. 6. 2. · STT-RAM L1 Cache Architecture for Shared Memory

A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicoreswongwf/papers/ASPDAC14-ppt.pdf · 2014. 6. 2. · STT-RAM L1 Cache Architecture for Shared Memory

Documents

OpenPit - Princeton University · with private L1 and L1.5 caches and a distributed, shared L2 cache. Each tile in OpenPiton contains an instance of the L1 cache, L1.5 cache, and

OpenPit - Princeton University · with private L1 and L1.5 caches and a distributed, shared L2 cache. Each tile in OpenPiton contains an instance of the L1 cache, L1.5 cache, and

Documents

MIC & GPU Architecture - LRZ27/04/15 Momme Allalen: MIC&GPU@LRZ , April 27-29, 2015 5 Control ALU Cache - Simple flow control and limited cache. - Devotes more transistors for computing

MIC & GPU Architecture - LRZ27/04/15 Momme Allalen: MIC&GPU@LRZ , April 27-29, 2015 5 Control ALU Cache - Simple flow control and limited cache. - Devotes more transistors for computing

Documents