Overview of Ocelot: architecture

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY


Overview

GPU Ocelot overview

Building, configuring, and executing Ocelot programs

Ocelot Device Interface and CUDA Runtime API

Ocelot PTX Internal Representation

PTX Pass Manager

Ocelot: Multiplatform Dynamic Compilation

Just-in-time code generation and optimization for data-intensive applications

(Figure labels: language front end, data parallel IR. Image credits: esd.lbl.gov; R. Dominguez & D. Kaeli (NEU).)

• Environment for i) compiler research, ii) architecture research, and iii) productivity tools

NVIDIA’s Compute Unified Device Architecture (CUDA)

Integrate the concept of a compute kernel called from standard languages

• Multithreaded host programs
• The compute kernel specifies data parallel computation as thousands of threads

An accelerator model of computing

• Explicit functions for off-loading computation to GPUs
• Data movement explicitly managed by the programmer

For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

NVIDIA’s Compute Unified Device Architecture (CUDA)

Structure of a Compute Kernel

• Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to single instruction stream, multiple data stream (SIMD) processor
• Parallel Thread Execution (PTX) instruction set architecture
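The CTA and barrier structure can be illustrated with a serial C++ emulation. This is only a sketch under stated assumptions: the function name `reverseTiles` and the tile-reversal workload are hypothetical, and it is not how Ocelot's emulator is implemented. Each outer iteration plays the role of one CTA, the inner loops play the role of its threads, and the point between the two inner loops corresponds to a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial emulation of a kernel launch: every CTA reverses its
// own tile of the input through a per-CTA "shared memory" buffer.
std::vector<int> reverseTiles(const std::vector<int>& input, std::size_t ctaSize) {
    assert(ctaSize > 0 && input.size() % ctaSize == 0);
    std::vector<int> output(input.size());
    const std::size_t ctaCount = input.size() / ctaSize;

    for (std::size_t cta = 0; cta < ctaCount; ++cta) {   // grid of CTAs
        std::vector<int> shared(ctaSize);                // per-CTA shared memory

        // Phase 1: each thread stores one element into shared memory.
        for (std::size_t tid = 0; tid < ctaSize; ++tid) {
            shared[tid] = input[cta * ctaSize + tid];
        }

        // Barrier point: all threads of this CTA have completed phase 1.

        // Phase 2: each thread reads an element written by another thread.
        for (std::size_t tid = 0; tid < ctaSize; ++tid) {
            output[cta * ctaSize + tid] = shared[ctaSize - 1 - tid];
        }
    }
    return output;
}
```

Phase 2 is only correct because every thread of the CTA has finished phase 1, which is the guarantee a barrier provides on real hardware.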

NVIDIA Fermi GF100

• 4 Graphics Processing Clusters (GPCs) containing 4 streaming multiprocessors (SMs) each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own shared memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit memory controllers (total 384 bits wide)
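The totals quoted for Fermi follow directly from the per-unit figures. The sketch below redoes that arithmetic at compile time; the constant and function names are hypothetical, and only the numbers come from the slide.

```cpp
#include <cassert>

// Per-unit figures as quoted for the Fermi GF100.
constexpr int kGpcs = 4;
constexpr int kSmsPerGpc = 4;
constexpr int kAlusPerSm = 32;
constexpr int kRegistersPerAlu = 1024;
constexpr int kBytesPerRegister = 4;      // 32-bit registers
constexpr int kMemoryControllers = 6;
constexpr int kBitsPerController = 64;

constexpr int totalSms() { return kGpcs * kSmsPerGpc; }
constexpr int totalAlus() { return totalSms() * kAlusPerSm; }
constexpr int registerFileBytesPerSm() {
    return kAlusPerSm * kRegistersPerAlu * kBytesPerRegister;
}
constexpr int memoryBusBits() { return kMemoryControllers * kBitsPerController; }

// The derived totals match the ones stated in the list.
static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB of registers per SM");
static_assert(memoryBusBits() == 384, "384-bit memory interface");
```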

Ocelot Structure¹

• Ocelot is built with nvcc and the LLVM backend
• Structured around a PTX IR → LLVM IR translator
• Compile stock CUDA applications without modification

CUDA to PTX

• PTX modules stored as string literals in fat binary
• We ignore the accompanying binary image (GPU native binary)

¹ G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.


Dependencies

Software
• C++ compiler (GCC 4.5.x)
• Lex lexer generator (Flex 2.5.35)
• YACC parser generator (Bison 2.4.1)
• SCons (Python 2.7)
• LLVM (3.1)

Libraries
• boost_system (1.46)
• boost_filesystem (1.46)
• boost_serialization (1.46)
• GLEW (1.5, optional for GL interop)
• GL (for NVIDIA GPU devices)

Library headers
• Boost (1.46)

http://code.google.com/p/gpuocelot/wiki/Installation

Ocelot Source Code

• Freely available via Google Code project site (New BSD License)

• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

http://code.google.com/p/gpuocelot/

svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only


Building GPU Ocelot

Obtain source code
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only

Compile with SCons
    sudo ./build.py --install

Build and execute unit tests
    sudo ./build.py --test=full

Output appears in .release_build
• libocelot.so
• OcelotConfig
• Tests

Installation directory:
• /usr/local/include/ocelot
• /usr/local/lib

http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot

configure.ocelot
• Controls Ocelot’s initial state
• Located in application’s startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties

trace:
• memoryChecker - ensures …
• raceDetector - enforces synchronized access to .shared
• debugger - interactive debugger

executive:
• devices: list of Ocelot backend devices that are enabled
  • nvidia - NVIDIA GPU backend
  • emulated - Ocelot PTX emulator (trace generators)
  • llvm - efficient execution of PTX on multicore CPU
  • amd - translation to AMD IL for PTX on AMD RADEON

{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}

Building and Executing CUDA Programs

    nvcc -c example.cu -arch sm_23

    g++ -o example example.o `OcelotConfig -l`

• `OcelotConfig -l` expands to ‘-locelot’
• libocelot.so replaces libcudart.so


CUDA Runtime API

• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaRuntimeInterface
• Maps CUDA RT onto the Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule()

Ocelot CUDA Runtime Overview

• Kernels execute anywhere: key to portability!
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
• Link against libocelot.so instead of libcudart

Ocelot CUDA Runtime

• Clean device abstraction
  • All back-ends implement same interface
• Ocelot API Extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Ocelot Source Code: CUDA Runtime API

• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Ocelot CUDA Runtime API Implementation

• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface

• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime’s 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
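The idea of one device abstraction shared by every backend and every front end can be sketched as follows. All names here (`Device`, `BookkeepingDevice`, `selectDevice`, and their methods) are hypothetical and far smaller than the real executive::Device; the sketch only assumes that a front end holds base-class pointers and never needs to know which backend it is talking to.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced device interface (memory management only).
class Device {
public:
    virtual ~Device() = default;
    virtual std::string name() const = 0;
    virtual int allocate(std::size_t bytes) = 0;   // returns an allocation handle
    virtual void release(int handle) = 0;
    virtual std::size_t bytesAllocated() const = 0;
};

// Hypothetical backend that only tracks allocation sizes; it stands in for an
// emulator, multicore, or GPU backend in this sketch.
class BookkeepingDevice : public Device {
public:
    explicit BookkeepingDevice(std::string name) : _name(std::move(name)) {}
    std::string name() const override { return _name; }
    int allocate(std::size_t bytes) override {
        const int handle = _nextHandle++;
        _allocations[handle] = bytes;
        return handle;
    }
    void release(int handle) override { _allocations.erase(handle); }
    std::size_t bytesAllocated() const override {
        std::size_t total = 0;
        for (const auto& entry : _allocations) total += entry.second;
        return total;
    }
private:
    std::string _name;
    int _nextHandle = 0;
    std::map<int, std::size_t> _allocations;
};

// Front-end helper: device switching is a lookup over the common interface.
Device* selectDevice(const std::vector<std::unique_ptr<Device>>& devices,
                     const std::string& name) {
    for (const auto& device : devices) {
        if (device->name() == name) return device.get();
    }
    return nullptr;
}
```

Because the front end depends only on the base class, adding a backend in this sketch means adding one derived class and leaving the front end untouched.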


Ocelot PTX Intermediate Representation (IR)

• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
  • LLVM (x86/PowerPC/ARM)
  • CAL (AMD GPUs)

Ocelot Source Code: Intermediate Representation

• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h

Ocelot PTX Internal Representation

• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid()
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator

Ocelot PTX IR: Kernels

.global .f32 globalVariable;

.entry sequence (
    .param .u64 __cudaparm_sequence_A,
    .param .s32 __cudaparm_sequence_N)
{
    .reg .u32 %r<11>;
    .reg .u64 %rd<6>;
    .local .u32 %rp0;
    . . .
$LDWbegin_sequence:
    ld.param.s32 %r6, [__cudaparm_sequence_N];
    setp.le.s32 %p1, %r6, %r5;
    @%p1 bra $Lt_0_1026;
    . . .
$Lt_0_1026:
    exit;
$LDWend_sequence:
} // sequence

IR classes annotated in the figure: ir::Module (the whole file), ir::Kernel (.entry), ir::BasicBlock (labeled instruction sequences), ir::Local (.local), ir::Parameter (.param), ir::Global (.global)

Ocelot PTX IR: Instructions

ir::BasicBlock (a sequence of ir::PTXInstruction objects):

    add.s32 %r7, %r5, 1;
    ld.param.u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [%rd4 + 0], %r7;
    @%p1 bra $Lt_0_6146;

ir::PTXInstruction fields: opcode, addressSpace, dataType, d, a, guard predicate

ir::PTXOperand address modes: address, register, immediate, indirect, label
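How an instruction object with typed operands maps back to PTX text can be sketched with two small structs. This is an illustration only: `Operand`, `Instruction`, their string-typed fields, and the extra `b` operand are hypothetical simplifications and do not reproduce the actual ir::PTXOperand or ir::PTXInstruction classes.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>

// Hypothetical simplified operand: the address mode decides how it prints.
struct Operand {
    enum AddressMode { Register, Immediate, Address, Indirect, Label, Invalid };
    AddressMode addressMode = Invalid;
    std::string identifier;    // register, symbol, or label name
    long long immediate = 0;   // value when addressMode == Immediate
    int offset = 0;            // byte offset when addressMode == Indirect

    std::string toString() const {
        switch (addressMode) {
            case Register:  return identifier;
            case Immediate: return std::to_string(immediate);
            case Address:   return "[" + identifier + "]";
            case Indirect:  return "[" + identifier + " + " + std::to_string(offset) + "]";
            case Label:     return identifier;
            default:        return "";
        }
    }
};

// Hypothetical simplified instruction with an emitter back to PTX text.
struct Instruction {
    std::string opcode;        // e.g. "add", "st", "bra"
    std::string addressSpace;  // e.g. "global"; empty when not applicable
    std::string dataType;      // e.g. "s32"; empty when not applicable
    std::string guard;         // guard predicate register; empty when unguarded
    Operand d, a, b;           // destination and up to two sources

    std::string toString() const {
        std::string text;
        if (!guard.empty()) text += "@" + guard + " ";
        text += opcode;
        if (!addressSpace.empty()) text += "." + addressSpace;
        if (!dataType.empty()) text += "." + dataType;
        bool first = true;
        for (const Operand* operand : {&d, &a, &b}) {
            if (operand->addressMode == Operand::Invalid) continue;  // unused slot
            text += (first ? " " : ", ") + operand->toString();
            first = false;
        }
        return text + ";";
    }
};
```

Setting opcode "add", data type "s32", two register operands and an immediate reproduces the first instruction of the block above.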

Control and Data-Flow Graphs

• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
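The traversal orders named above can be illustrated on a toy graph. In this sketch `Block`, `Cfg`, `postOrder`, and `reversePostOrder` are hypothetical names, blocks are identified by their index in a vector, and the reverse post-order is used as the topological order, which holds for acyclic graphs only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy basic block: a label plus successor indices into the CFG.
struct Block {
    std::string label;
    std::vector<std::size_t> successors;
};

using Cfg = std::vector<Block>;

// Depth-first helper: a block is emitted after all of its successors.
void visitPostOrder(const Cfg& cfg, std::size_t block,
                    std::vector<bool>& visited, std::vector<std::string>& order) {
    if (visited[block]) return;
    visited[block] = true;
    for (std::size_t successor : cfg[block].successors) {
        visitPostOrder(cfg, successor, visited, order);
    }
    order.push_back(cfg[block].label);
}

std::vector<std::string> postOrder(const Cfg& cfg, std::size_t entry) {
    std::vector<bool> visited(cfg.size(), false);
    std::vector<std::string> order;
    if (entry < cfg.size()) visitPostOrder(cfg, entry, visited, order);
    return order;
}

// For an acyclic CFG, reversing the post-order yields a topological order.
std::vector<std::string> reversePostOrder(const Cfg& cfg, std::size_t entry) {
    std::vector<std::string> order = postOrder(cfg, entry);
    std::reverse(order.begin(), order.end());
    return order;
}
```

Forward data-flow analyses usually prefer the reverse post-order because each block is then visited after its predecessors whenever the graph has no back edges.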

Example: Control-Flow Graphs

// example: splits basic blocks containing barriers
//
for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
    bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks

    unsigned int n = 0;
    ir::BasicBlock::InstructionList::iterator inst_it;

    for (inst_it = (bb_it)->instructions.begin();
        inst_it != (bb_it)->instructions.end();
        ++inst_it, n++) { // iterate over instructions in *bb_it

        const ir::PTXInstruction *inst =
            static_cast< const ir::PTXInstruction *>(*inst_it);

        if (inst->opcode == ir::PTXInstruction::Bar) {
            if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                // split block containing bar.sync so that it’s always
                // the last instruction in a block
                std::string label = (bb_it)->label + "_bar";
                kernel->cfg()->split_block(bb_it, n+1,
                    ir::BasicBlock::Edge::FallThrough, label);
            }
            break;
        }
    } // end for (inst_it)
} // end for (bb_it)
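The same transformation can be tried out without Ocelot on a toy block list. The sketch below is a stand-in, not the ir::ControlFlowGraph API: `SimpleBlock`, `splitBlock`, and `splitAtBarriers` are hypothetical, instructions are reduced to opcode strings, and edges are implied by list order.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Hypothetical toy basic block: a label and a list of opcode names.
struct SimpleBlock {
    std::string label;
    std::vector<std::string> opcodes;
};

using BlockList = std::list<SimpleBlock>;

// Moves instructions [position, end) of *block into a new block placed right
// after it (an implied fall-through edge) and returns the new block.
BlockList::iterator splitBlock(BlockList& blocks, BlockList::iterator block,
                               std::size_t position, const std::string& label) {
    assert(position <= block->opcodes.size());
    const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(position);
    SimpleBlock tail;
    tail.label = label;
    tail.opcodes.assign(block->opcodes.begin() + offset, block->opcodes.end());
    block->opcodes.erase(block->opcodes.begin() + offset, block->opcodes.end());
    return blocks.insert(std::next(block), tail);
}

// Splits blocks so that every barrier is the last instruction of its block.
void splitAtBarriers(BlockList& blocks) {
    for (BlockList::iterator block = blocks.begin(); block != blocks.end(); ++block) {
        for (std::size_t n = 0; n < block->opcodes.size(); ++n) {
            if (block->opcodes[n] != "bar") continue;
            if (n + 1 < block->opcodes.size()) {
                splitBlock(blocks, block, n + 1, block->label + "_bar");
            }
            break;  // the new tail block is scanned by the outer loop
        }
    }
}
```

A std::list is used so that inserting the tail block leaves the iterator of the block being scanned valid, which mirrors why the loop can keep iterating after each split.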

Example: Spilling Live Values

// ocelot/analysis/implementation/RemoveBarrierPass.cpp

void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
    unsigned int bytes = 0;

    // load the base address of the spill stack into a fresh register
    ir::PTXInstruction move ( ir::PTXInstruction::Mov );
    move.type = ir::PTXOperand::u64;
    move.a.identifier = "__ocelot_remove_barrier_pass_stack";
    move.a.addressMode = ir::PTXOperand::Address;
    move.a.type = ir::PTXOperand::u64;
    move.d.reg = _kernel->dfg()->newRegister();
    move.d.addressMode = ir::PTXOperand::Register;
    move.d.type = ir::PTXOperand::u64;

    _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );

    // store each live register to the stack at consecutive offsets
    for( DataflowGraph::Block::RegisterSet::const_iterator
        reg = alive.begin(); reg != alive.end(); ++reg )
    {
        ir::PTXInstruction save( ir::PTXInstruction::St );
        save.type = reg->type;
        save.addressSpace = ir::PTXInstruction::Local;
        save.d.addressMode = ir::PTXOperand::Indirect;
        save.d.reg = move.d.reg;
        save.d.type = ir::PTXOperand::u64;
        save.d.offset = bytes;
        bytes += ir::PTXOperand::bytes( save.type );
        save.a.addressMode = ir::PTXOperand::Register;
        save.a.type = reg->type;
        save.a.reg = reg->id;

        _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
    }

    _spillBytes = std::max( bytes, _spillBytes );
}

IR for AMD and LLVM

LLVM IR
• Implements all of the LLVM instruction set
• Decouples translator from LLVM project
• Easier to construct than LLVM’s actual IR

AMD IL
• Supports translation from PTX to AMD interface

Emitters construct parseable string representations of modules

AMD Backend: R. Dominguez & D. Kaeli (NEU)


PTX PassManager

• Orchestrates analysis and transformation passes
• Derived from LLVM model
  • Analysis Passes generate meta-data
  • Meta-data consumed by transformations
  • Transformation Passes modify the IR

Using the Pass Manager

• Passes added to a manager
  • Schedules execution
  • Manages analysis meta-data
• Ensures meta-data available
  • Up to date; not redundantly computed
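The requirement that meta-data be available, up to date, and not redundantly computed amounts to caching analysis results and invalidating them when a transformation changes the IR. The sketch below shows that policy on a toy kernel; `ToyKernel`, `ToyPassManager`, and all of its methods are hypothetical and much simpler than Ocelot's pass manager, which this does not attempt to reproduce.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy "kernel": the IR is just a list of integers.
struct ToyKernel {
    std::vector<int> values;
};

class ToyPassManager {
public:
    using Analysis = std::function<int(const ToyKernel&)>;
    // A transformation receives its required meta-data and reports whether it
    // modified the kernel.
    using Transform = std::function<bool(ToyKernel&, int)>;

    void registerAnalysis(const std::string& name, Analysis analysis) {
        _analyses[name] = analysis;
    }

    void addTransform(const std::string& requiredAnalysis, Transform transform) {
        _transforms.push_back(std::make_pair(requiredAnalysis, transform));
    }

    void run(ToyKernel& kernel) {
        for (auto& entry : _transforms) {
            const int metadata = getAnalysis(entry.first, kernel);
            if (entry.second(kernel, metadata)) {
                _cache.clear();  // the IR changed: cached meta-data is stale
            }
        }
    }

    int analysisRuns() const { return _analysisRuns; }

private:
    // Returns cached meta-data when valid, otherwise recomputes it once.
    int getAnalysis(const std::string& name, const ToyKernel& kernel) {
        auto cached = _cache.find(name);
        if (cached != _cache.end()) return cached->second;
        ++_analysisRuns;
        const int result = _analyses.at(name)(kernel);
        _cache[name] = result;
        return result;
    }

    std::map<std::string, Analysis> _analyses;
    std::vector<std::pair<std::string, Transform>> _transforms;
    std::map<std::string, int> _cache;
    int _analysisRuns = 0;
};
```

Clearing the whole cache after any modification is the most conservative policy; a finer-grained manager would let each transformation declare which analyses it preserves.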

Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers

Analysis Passes: Supported Analysis Structures

• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h

Transformation Passes

• Modify the PTX IR
• Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5)

Setup pass dependencies

DeadCodeEliminationPass::DeadCodeEliminationPass()
    : KernelPass(Analysis::DataflowGraphAnalysis
          | Analysis::StaticSingleAssignment,
          "DeadCodeEliminationPass")
{
}

Dead Code Elimination (2 of 5)

Run pass

void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
{
    // Get analysis metadata
    Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
    assert(dfgAnalysis != 0);

    // cast up
    analysis::DataflowGraph& dfg =
        *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
    assert(dfg.ssa());
    // ...

Dead Code Elimination (3 of 5)

Loop until no change

BlockSet blocks;
for (iterator block = dfg.begin(); block != dfg.end(); ++block)
{
    report(" Queueing up BB_" << block->id());
    blocks.insert(block);
}

while (!blocks.empty())
{
    iterator block = *blocks.begin();
    blocks.erase(blocks.begin());
    eliminateDeadInstructions(dfg, blocks, block);
}

Dead Code Elimination (4 of 5)

Remove unused live-out values

AliveKillList aliveOutKillList;
for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
    aliveOut != block->aliveOut().end(); ++aliveOut)
{
    if (canRemoveAliveOut(dfg, block, *aliveOut))
    {
        report(" removed " << aliveOut->id);
        aliveOutKillList.push_back(aliveOut);
    }
}
for (AliveKillList::iterator killed = aliveOutKillList.begin();
    killed != aliveOutKillList.end(); ++killed)
{
    block->aliveOut().erase(*killed);
}

Dead Code Elimination (5 of 5)

Check if an instruction can be removed

if (ptx.hasSideEffects()) return false;

for (RegisterPointerVector::iterator reg = instruction->d.begin();
    reg != instruction->d.end(); ++reg)
{
    // the reg is alive outside the block
    if (block->aliveOut().count(*reg) != 0) return false;

    InstructionVector::iterator next = instruction;
    for (++next; next != block->instructions().end(); ++next)
    {
        for (RegisterPointerVector::iterator source = next->s.begin();
            source != next->s.end(); ++source)
        {
            // found a user in the block
            if (*source->pointer == *reg->pointer) return false;
        }
    }
}

// no side effects, not live-out, and no later user in the block
return true;

Dead Code Elimination

• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
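The overall approach can be condensed into a toy version that works on a single block. This is a sketch under stated assumptions, not the DeadCodeEliminationPass itself: `ToyInstruction` and `eliminateDeadCode` are hypothetical, registers are plain strings, the block is assumed to be in SSA form (each register is defined once), and phi instructions and alive-in sets are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy instruction in SSA form.
struct ToyInstruction {
    std::string dest;                  // defined register; empty if none
    std::vector<std::string> sources;  // registers read by the instruction
    bool hasSideEffects;               // e.g. stores and barriers
};

// Removes instructions whose result is neither live-out nor read later in the
// block, repeating until nothing changes. Returns the number removed.
std::size_t eliminateDeadCode(std::vector<ToyInstruction>& block,
                              const std::set<std::string>& aliveOut) {
    std::size_t removed = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < block.size(); ++i) {
            const ToyInstruction& inst = block[i];
            if (inst.hasSideEffects) continue;               // must be kept
            if (aliveOut.count(inst.dest) != 0) continue;    // used outside the block

            bool used = false;
            for (std::size_t j = i + 1; j < block.size() && !used; ++j) {
                for (const std::string& source : block[j].sources) {
                    if (source == inst.dest) { used = true; break; }
                }
            }
            if (!used) {
                block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
                ++removed;
                changed = true;  // removing a user may expose new dead code
                break;
            }
        }
    }
    return removed;
}
```

Restarting the scan after each removal is what lets a chain of dead instructions disappear: once the last user of a value is deleted, the instruction producing that value becomes dead on the next pass.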

Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp

Questions

GPU Ocelot
• Google Code site: http://code.google.com/p/gpuocelot
• Research Project site: http://gpuocelot.gatech.edu
• Mailing list: gpuocelot@googlegroups.com

Contributors
• Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili

Sponsors
• AMD, IBM, Intel, LogicBlox, NSF, NVIDIA
