LAB5 GPGPU-Sim Tutorial
Content
• Introduction
• Configurations of GPGPU-sim
• Experiment & Result
• References
2017/12/15 2
Introduction
What is a GPU?
• Graphics Processing Unit
• Optimized for Highly Parallel Workloads
• Highly Programmable
• Commodity Hardware
Architecture of V100
Why GPU?
Architecture of GPU
Software
• CUDA and OpenCL
• Extensions of C to support the coprocessor model
• GPGPU-Sim supports both
What is GPGPU-sim?
• Microarchitecture performance model of contemporary GPUs
  • Functional model
  • Timing model
  • Power model: GPUWattch
• Runs unmodified CUDA/OpenCL
• BSD License
Modules Overview
[Block diagram. Labels: CUDA/OpenCL API Library Interface; PTX Emulator (CUDA-Sim), the functional simulator that executes PTX kernels; GPGPU-Sim Entrypoint; Timing Model; Abstract HW Model; Power Model: GPUWattch]
Top Level Organization
Microarchitecture of GPU
See details in the micro-2013 slides
GPGPU-sim: Functional model (1/2)
• Functional model for PTX/SASS
PTX (Parallel Thread eXecution)
• Scalar PTX ISA (instruction level)
• Scalar control flow (if-branch, for-loops)
• Register allocation not done in PTX
• Intermediate representation in the CUDA tool chain
SASS (native ISA for NVIDIA GPUs)
• Better correlation with HW GPU
• NVIDIA’s cuobjdump
GPGPU-sim: Functional model (2/2)
GPGPU-sim: Timing Model
GPGPU-Sim simulates the timing model of a GPU running each launched CUDA kernel
• Reports the number of cycles spent running the kernels
• Excludes any time spent on data transfers over the PCIe bus
• The CPU may run concurrently with asynchronous kernel launches
GPGPU-Sim with SASS achieves a correlation of about 0.98 with real hardware
GPGPU-sim: Power Model
GPUWattch
• Estimates the power consumed by the GPU from its timing behavior
• Validated with power measurements from a real GTX 480
Debugging and Visualization
GPGPU-Sim provides tools to debug and visualize simulated GPU behavior
• GDB macros
  • Cycle-level debugging
• AerialVision
  • High-level performance dynamics
Configurations of GPGPU-sim
GPGPU-Sim Configurations
Change the configuration by modifying ‘GTX480_run_dir/gpgpusim.conf’
1. Simulation Run
2. Statistics Collection
3. High-Level Architecture
4. Additional Architecture
5. Scheduler
6. Shader Core Pipeline
7. Memory Sub-System
8. Operand Collector
9. DRAM/Memory Controller
10. Interconnection
11. PTX
12. Power information
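To inspect or compare settings across these groups, the configuration file can be read into a dictionary. The following is a minimal sketch, not part of GPGPU-Sim: it assumes each setting sits on its own line as an option name followed by a value, optionally prefixed with "-", and that "#" starts a comment. The helper name parse_gpgpusim_conf and the sample values are hypothetical.

```python
def parse_gpgpusim_conf(text):
    """Parse 'option value' lines into a dict, ignoring '#' comments.

    Assumption: one option per line, with an optional leading '-'.
    """
    options = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if not parts:
            continue  # blank or comment-only line
        name = parts[0].lstrip("-")
        if not name:
            continue  # a stray '-' with no option name
        options[name] = parts[1].strip() if len(parts) > 1 else ""
    return options


# Illustrative values only; they are not copied from a shipped configuration.
SAMPLE_CONF = """\
# scheduler
-gpgpu_num_sched_per_core 2
-gpgpu_max_insn_issue_per_warp 1
# power model
-power_simulation_enabled 1  # enable the power model
"""

options = parse_gpgpusim_conf(SAMPLE_CONF)
print(options["gpgpu_num_sched_per_core"])
```

All values are kept as strings because several options (cache and pipeline settings, for example) pack multiple fields into one value.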
Scheduler
Modify the properties of the warp scheduler
• Number of warp schedulers in a core
• Issue width of each warp scheduler (maximum instructions issued per warp)
Examples
• gpgpu_num_sched_per_core
• gpgpu_max_insn_issue_per_warp
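The two scheduler options together bound how many instructions one core can issue per cycle. The sketch below assumes that each scheduler issues from a single warp per cycle, so the bound is simply the product of the two settings. The function name and the values 2 and 1 are illustrative, not taken from a shipped configuration.

```python
def peak_issue_per_core(num_sched_per_core, max_insn_issue_per_warp):
    """Upper bound on instructions issued by one core in one cycle.

    Assumption: every scheduler picks one warp per cycle and issues at most
    max_insn_issue_per_warp instructions from it.
    """
    if num_sched_per_core < 1 or max_insn_issue_per_warp < 1:
        raise ValueError("both settings must be at least 1")
    return num_sched_per_core * max_insn_issue_per_warp


print(peak_issue_per_core(2, 1))  # prints 2
```

This is only a ceiling; stalls, dependencies, and divergence keep the achieved issue rate below it.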
Shader Core Pipeline
Modify the properties of the shader core
• Pipeline
• Number of registers
• Concurrent cooperative thread arrays (CTAs)
• Branch divergence
Examples
• gpgpu_shader_core_pipeline <# thread/shader core>:<warp size>:<pipeline SIMD width>
• gpgpu_shader_registers <# registers/shader core, default=8192>
• gpgpu_shader_cta <# CTA/shader core, default=8>
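These three options interact: together they limit how many CTAs can be resident on one shader core. The sketch below parses the pipeline string in the format shown above and applies a simplified occupancy estimate that considers only threads, registers, and the CTA limit. The real simulator also accounts for other resources such as shared memory. The pipeline value "1536:32:32", the CTA size of 256 threads, and the 8 registers per thread are illustrative assumptions; 8192 and 8 are the defaults listed above.

```python
from typing import NamedTuple


class PipelineSpec(NamedTuple):
    threads_per_core: int
    warp_size: int
    simd_width: int


def parse_pipeline(spec):
    """Parse '<# thread/shader core>:<warp size>:<pipeline SIMD width>'."""
    fields = spec.split(":")
    if len(fields) != 3:
        raise ValueError(f"expected 3 colon-separated fields, got {spec!r}")
    return PipelineSpec(*(int(f) for f in fields))


def max_warps_per_core(pipeline):
    """Warps needed to hold every thread of a core (rounded up)."""
    return -(-pipeline.threads_per_core // pipeline.warp_size)


def resident_ctas(threads_per_core, registers_per_core, max_ctas,
                  cta_threads, regs_per_thread):
    """Simplified estimate: the tightest of the thread, register, and CTA limits."""
    by_threads = threads_per_core // cta_threads
    by_registers = registers_per_core // (cta_threads * regs_per_thread)
    return min(max_ctas, by_threads, by_registers)


pipeline = parse_pipeline("1536:32:32")  # illustrative value
print(max_warps_per_core(pipeline))      # prints 48
print(resident_ctas(pipeline.threads_per_core, 8192, 8,
                    cta_threads=256, regs_per_thread=8))  # prints 4
```

In this example the register file is the binding limit (4 CTAs), even though the thread count would allow 6 and the CTA limit 8.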
Memory Sub-System Configuration
Set the size and behavior of several kinds of memory and cache
• Memory: shared memory
• Cache: texture, constant, instruction, data cache
Examples:
• gpgpu_perfect_mem <0=off (default), 1=on>
• gpgpu_tex_cache:l1 <nsets>:<bsize>:<assoc>:<rep>:<wr>:<alloc>,<mshr>:<N>:<merge>,<mq>
• gpgpu_const_cache:l1 <nsets>:<bsize>:<assoc>:<rep>:<wr>:<alloc>,<mshr>:<N>:<merge>,<mq>
• gpgpu_cache:il1 <nsets>:<bsize>:<assoc>:<rep>:<wr>:<alloc>,<mshr>:<N>:<merge>,<mq>
• gpgpu_cache:dl2 <nsets>:<bsize>:<assoc>:<rep>:<wr>:<alloc>,<mshr>:<N>:<merge>,<mq>
• gpgpu_shmem_size <shared memory size, default=16kB>
• gpgpu_shmem_warp_parts
• gpgpu_flush_cache <0=off (default), 1=on>
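The first three fields of a cache option determine the cache capacity: sets × block size × associativity. The sketch below reads them from a string in the `<nsets>:<bsize>:<assoc>:...` layout shown above, assuming the block size is given in bytes. The sample string is a made-up value that follows that layout, not a setting from a shipped configuration, and the function name is hypothetical.

```python
def cache_capacity_bytes(spec):
    """Return nsets * bsize * assoc for a cache configuration string.

    Assumption: the string starts with '<nsets>:<bsize>:<assoc>' and the
    block size is in bytes; the remaining fields are ignored here.
    """
    geometry = spec.split(",")[0]          # part before the MSHR settings
    fields = geometry.split(":")
    if len(fields) < 3:
        raise ValueError(f"cannot find nsets:bsize:assoc in {spec!r}")
    nsets, bsize, assoc = (int(f) for f in fields[:3])
    return nsets * bsize * assoc


# Hypothetical value: 64 sets, 128-byte blocks, 2-way set associative.
print(cache_capacity_bytes("64:128:2:L:R:f,A:2:32,4"))  # prints 16384
```

Doubling any one of the three fields doubles the capacity, which makes it easy to set up size-sweep experiments.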
Power information
Enable and configure the GPGPU-Sim power model
Examples
• power_simulation_enabled 1 # Enable power model
• gpuwattch_xml_file gpuwattch_gtx480.xml # Choose the configuration file
• power_trace_enabled 1 # Enable output: detailed average power traces
• steady_power_levels_enabled 1 # Enable output: steady state average power levels and corresponding performance counters
Experiment & Result
Experiment Environment
• VirtualBox (recommended)
  • Install Oracle VM VirtualBox
  • Go to http://www.gpgpu-sim.org/
  • Download the fully set-up virtual machine
  • Double-click the downloaded file
• GitHub
  • Go to GPGPU-sim’s GitHub repository
  • Follow the manual
Installation - AerialVision
Step 1. Install AerialVision dependencies
$ sudo apt-get install python-pmw python-ply python-numpy libpng12-dev python-matplotlib
Step 2. Run bin/aerialvision.py in the GPGPU-Sim distribution
$ python aerialvision.py
Hint: VM password: gpgpu-sim
Benchmarks
63 executable CUDA benchmarks
Run a simple program
CUDA program
• Benchmark: vectorAdd.cu
$ ./run_gpgpu-sim.sh ~/cuda/sdk/4.2/C/bin/linux/release/vectorAdd
Host code
Device code
Simulation Result - Overall
• Simulated cycles of the GPU
• Simulated instructions of the GPU
• IPC of the GPU
• …
Overall report
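IPC in the overall report is the instruction count divided by the cycle count, so it can be recomputed from a saved log. The sketch below assumes the log contains lines of the form `gpu_sim_cycle = N` and `gpu_sim_insn = N`. Those counter names are an assumption about the simulator output and should be checked against your own log. The sample numbers are made up.

```python
import re

# Assumed counter names; adjust them if your log uses different ones.
_STAT = re.compile(r"^(gpu_sim_cycle|gpu_sim_insn)\s*=\s*(\d+)", re.MULTILINE)


def kernel_ipc(log_text):
    """Return instructions per cycle from the last reported kernel in a log."""
    stats = {}
    for name, value in _STAT.findall(log_text):
        stats[name] = int(value)  # later kernels overwrite earlier ones
    if "gpu_sim_cycle" not in stats or "gpu_sim_insn" not in stats:
        raise ValueError("cycle or instruction counter not found in log")
    return stats["gpu_sim_insn"] / stats["gpu_sim_cycle"]


# Made-up log excerpt for illustration.
SAMPLE_LOG = """\
gpu_sim_cycle = 2000
gpu_sim_insn = 9000
"""

print(kernel_ipc(SAMPLE_LOG))  # prints 4.5
```

Comparing this value across two configuration files is a quick way to see the effect of a single changed option.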
Simulation result - Cache
Behavior of every cache in the GPU
• Number of accesses
• Number of misses
• Number of pending hits
• Number of reservation failures
Cache report
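The raw counters in the cache report are easier to compare across runs as a miss rate, that is, misses divided by accesses. A minimal sketch follows; the function name and the counter values are made up for illustration, and how the simulator labels these counters in its log is not assumed here.

```python
def miss_rate(accesses, misses):
    """Fraction of cache accesses that missed; 0.0 for an unused cache."""
    if accesses < 0 or misses < 0 or misses > accesses:
        raise ValueError("counters must satisfy 0 <= misses <= accesses")
    if accesses == 0:
        return 0.0
    return misses / accesses


# Made-up counters for two runs with different cache sizes.
print(miss_rate(accesses=1000, misses=250))  # prints 0.25
print(miss_rate(accesses=1000, misses=100))  # prints 0.1
```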
Simulation result - Interconnect
Statistics of the interconnect
• Packet latency
• Network latency
• …
Interconnect report
Simulation result - Power information
Total power
List of average power
Source code – You can modify it!
Simulation result - Visualization
Source code view
Time Lapse view
References
• GPGPU-sim Official Website
• GPGPU-sim Manual
• GPUWattch Manual
• Micro-45, Tutorial, GPGPU-Sim 3.x: A Performance Simulator for Manycore Accelerator Research
• McPAT
• NVIDIA OpenCL
• CUDA Toolkit
• Chinese Note