ENVISION. ACCELERATE. ARRIVE.
Overview
ClearSpeed Technical Training
December 2007
Presenters

Ronald Langhi, Technical Marketing Manager
Brian Sumner, Senior Engineer
ClearSpeed Technology: Company Background
• Founded in 2001
– Focused on alleviating the power, heat, and density challenges of HPC systems
– 103 patents granted and pending (as of September 2007)
– Offices in San Jose, California and Bristol, UK
Agenda

• Accelerators
• ClearSpeed and HPC
• Hardware overview
• Installing hardware and software
• Thinking about performance
• Software Development Kit
• Application examples
• Help and support
ENVISION. ACCELERATE. ARRIVE.
What is an accelerator?
What is an accelerator?

• A device to improve performance
  – Relieves the main CPU of workload
  – Or augments the CPU's capability
• An accelerator card can increase performance
  – On specific tasks
  – Without aggravating facility limits on clusters (power, size, cooling)
FPGAs
• Good for integer, bit-level ops
• Programming looks like circuit design
• Low power per chip, but 20x more power than custom VLSI
• Not for 64-bit FLOPS

Cell and GPUs
• Good for video gaming tasks
• 32-bit FLOPS, not IEEE
• Unconventional programming model
• Small local memory
• High power consumption (> 200 W)

ClearSpeed
• Good for HPC applications
• IEEE 64-bit and 32-bit FLOPS
• Custom VLSI, true coprocessor
• At least 1 GB local memory
• Very low power consumption (25 W)
• Familiar programming model

All accelerators are good… for their intended purpose.
The case for accelerators

• Accelerators designed for HPC applications can improve performance as well as performance per watt, per cabinet, and per dollar
• Accelerators enable:
  – Larger problems for a given compute time, or
  – Higher accuracy for a given compute time, or
  – The same problem in a shorter time
• Host-to-card latency and bandwidth are not major barriers to the successful use of properly designed accelerators
ENVISION. ACCELERATE. ARRIVE.
What can be accelerated?
Good application targets for acceleration

• An application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
  – The software depends on executing large numbers of arithmetic calculations
  – Usually 64-bit FLoating point Operations per Second (FLOPS)
  – Should also have a high ratio of FLOPS to data movement (bandwidth)
  – Computationally intensive applications may run for many hours or more, even on large clusters
• Data parallelism:
  – The software performs the same sequence of operations again and again, but on a different item of data each time
• Example computationally intensive, data-parallel problems include:
  – Large matrix arithmetic (linear algebra)
  – Molecular simulations
  – Monte Carlo options pricing in financial applications
  – And many, many more…
Example data parallel problems that can be accelerated

• Structural Analysis
• Electromagnetic Modeling
• Radar Cross-Section
• Ab initio Computational Chemistry
• Global Illumination Graphics
HPC Requirements

• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)
• Need to consider:
  – Type of application
  – Software
  – Data type and precision
  – Compatibility with the host (logical and physical)
  – Memory size (local to the accelerator)
  – Latency and bandwidth to the host
An HPC-specific accelerator

• CSX600 coprocessor for math acceleration
  – Assists a serial CPU running compute-intensive math libraries
  – Available on add-in boards, e.g. PCI-X, PCIe
  – Potentially integrated on the motherboard
  – Can also be used for embedded applications
• Significantly accelerates certain libraries and applications
  – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL
  – Mathematical modeling tools: Mathematica®, MATLAB®, etc.
  – In-house code: using the SDK to port compute-intensive kernels
• ClearSpeed Advance™ board
  – Dual CSX600 coprocessors
  – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – PCI-X, PCI Express x8
  – Low power; typically 25-35 Watts
Plug-and-play Acceleration

• ClearSpeed host-side library CSXL
  – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions
  – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK
  – Executes calls heterogeneously across both the multi-core host and the ClearSpeed accelerators simultaneously for maximum performance
  – Compatible with ACML from AMD and MKL from Intel
• The user and application do not need to be aware of ClearSpeed
  – Except that the application suddenly runs faster

Programming considerations

• Is my main data type integer or floating-point?
• Is the data parallel in nature?
• What precision do I need?
• How much data needs to be local to the accelerated task?
• Does existing accelerator software meet my needs, or do I have to write my own?
• If I have to write my own code, will the existing tools meet my needs (for example: compiler, debugger, and simulator)?
ENVISION. ACCELERATE. ARRIVE.
Hardware Overview
CSX600: A chip designed for HPC

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz clock, the key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPU
  – Hence around one quarter of the chip is floating-point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~1 TB/s internal bandwidth
• 128 million transistors
• Approximately 10 Watts

[Photo: the ClearSpeed CSX600 chip.]
CSX600

• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
  – Each has multiple execution units
  – Including double-precision floating point and integer units

[Diagram: the CSX600 processor core. A mono controller with instruction and data caches, plus control and debug logic, drives PEs 0 through 95 via the poly controller; the PEs share a peripheral network and programmable I/O to DRAM, and both controllers attach to the system network.]
CSX600 processing element (PE)

[Diagram: the PE datapath (32/64-bit IEEE 754). Each PE contains an FP multiplier, FP adder, divide/square root unit, MAC, and ALU fed by a 128-byte register file; a 6 KB PE SRAM; PIO collection and distribution; and swazzle connections to PE n-1 and PE n+1.]

• Multiple execution units
  – 4-stage floating-point adder
  – 4-stage floating-point multiplier
  – Divide/square root unit
  – Fixed-point MAC: 16x16 → 32+64
  – Integer ALU with shifter
  – Load/store
• 5-port register file (3 reads, 2 writes)
• Closely coupled 6 KB SRAM for data
• High-bandwidth per-PE DMA (PIO)
• Per-PE address generators (serve as hardware gather-scatter)
• Fast inter-PE communication path
Advance accelerator memory hierarchy

Tier    Memory                               Size                             Aggregate bandwidth
Tier 3  Host DRAM                            1-32 GBytes typical              ~1 GB/s to the board
Tier 2  CSX DRAM (2 banks of 0.5 GBytes)     1.0 GBytes                       5.4 GB/s
Tier 1  Poly memory (6 KBytes per PE)        192 PEs x 6 KB = 1.1 MB          161 GB/s
Tier 0  Register memory (128 Bytes per PE)   192 PEs x 128 Bytes = 24 KB      725 GB/s

Per-PE arithmetic: 0.42 GFLOPS. Swazzle path: 322 GB/s aggregate. The Tier 2 to Tier 1 path works out to ~0.03 GB/s per PE.

Total: 80 GFLOPS and 1.1 TB/s of internal bandwidth… but only 25 Watts.
Acceleration by plug-in card

• Dual ClearSpeed CSX600 coprocessors
• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – The hardware also supports 32-bit floating-point and integer calculations
• 133 MHz PCI-X two-thirds-length (8″, 203 mm), full-height form factor (Advance X620)
• PCIe x8 half-length, full-height form factor (Advance e620)
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance

Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels.
Host to board DMA performance

• The board includes a host DMA controller which can act as a bus master.
• All DMA transfers are at least 8-byte aligned.
• The host DMA engine will attempt to use the full bandwidth of the bus.
• Note: measured bandwidth is highly system-dependent
  – Variations of up to 50% have been observed
  – Depends on system chipset, operating system, bus contention…

Type of slot       Peak bandwidth   Expected DMA speed
PCI Express x8     2,000 MB/s       Up to 1,300 MB/s
PCI-X 133 MHz      1,066 MB/s       Up to 750 MB/s
ENVISION. ACCELERATE. ARRIVE.
Installing Hardware and Software
Configuration support

Supported compilers
• For Linux: gcc, icc, fort, pgf
• For Windows XP, 2003: Visual C++ 2005

Supported host BLAS libraries
• AMD ACML
• Intel MKL
• Goto
• ATLAS

Advance supports the following host operating systems, on IA32 (x86) and AMD64/EM64T (x86-64):
• SuSE Linux Enterprise Server 9
• Red Hat Enterprise Linux 4
• Windows XP SP2
• Windows Server 2003 (preview)

For the latest support information go to http://support.clearspeed.com
Base software

• All ClearSpeed software on Linux is installed using the rpm command.
• The software consists of three parts:
  – Runtime and driver software
  – Diagnostics
  – ClearSpeed standard libraries, CSXL and CSFFT
• You can download the latest versions from the ClearSpeed support website: https://support.clearspeed.com/downloads
Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.
2. Install the runtime software, using the command:
   rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm
3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:
   /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx
4. Install the board diagnostics:
   rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm
5. Install the CSXL library package:
   rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm

Note: For Windows, a Jungo driver will need to be installed and configured; see the installation manual for more details.
Confirming successful installation

ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:
  – Some tests take several minutes to complete.
  – Each test writes Pass or Fail to standard output.
  – A log file, test.log, is written in the current directory.

1. Open a shell window and go to an appropriate directory: cd /tmp
2. Set up the ClearSpeed environment variables by typing: source /opt/clearspeed/csx600_m512_le/bin/bashrc
3. Run the diagnostic program by typing the command: /opt/clearspeed/csx600_m512_le/bin/run_tests.pl
csreset

• The csreset command reinitializes an Advance board and its processors.
• It must be run after start-up or reboot of the system or simulator.
• It is also a good idea to run csreset at the start of a batch job that calls the Advance board.
• The csreset command can take argument flags to provide a finer level of control. These include:
  -A  Reset all boards.
  -v  Verbose output; shows the details about each board.
  -h  Help; shows the full list of options.
If you have problems with software installation

• Make sure you are logged in as the super-user.
  – As root for Linux.
  – As administrator for Windows.
• If the configure or make install steps fail, check that you have the appropriate header files.
  – Check the preconfigured header files and, if necessary, obtain the appropriate configured header file.
• If the system cannot access the board but the driver is installed, make sure the board is seated well.
  – Try removing the board and reinstalling it.
ENVISION. ACCELERATE. ARRIVE.
Targeting ClearSpeed Advance: Exploiting Data Parallelism
Alternative approaches

Three main approaches to acceleration:

1. Use an application which is already ported
2. Plug and play
3. Custom port using the SDK
Using an application which is already ported

• Acceleration: simply insert ClearSpeed
• Latest list of ported applications: http://www.clearspeed.com/products/applicationsupport/
• Includes:
  – Amber
  – Mathematica
  – MATLAB
  – Star-P
Plug and play libraries: CSXL

• Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions
• Includes key functions from:
  – LAPACK
  – Level 3 BLAS
• As an example, BLAS is used by:
  – AMD ACML
  – Intel MKL
  – Full list at http://www.clearspeed.com/products/compatibility/
• The application is transparently accelerated
  – No modifications to the application
Acceleration using CSXL and standard libraries

[Diagram: the application's LAPACK/BLAS calls pass through a CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.) on the accelerator.]
Considerations for a custom port of an application

• Is the task large enough to consider acceleration?
  – It takes time to ship data to the accelerator
• The accelerator can work in parallel with the host
  – Overlap computation
• Performance considerations
  – Look for areas of data parallelism
  – Overlap compute with data I/O
  – Make full use of the ClearSpeed I/O paths
• Analysis starts with a model based on the memory tiers and can be verified using the performance profiling tools
Is this trip necessary? Considering I/O

• The time to move N bytes to or from another node or an accelerator is roughly latency + N/B seconds.
• Because local memory bandwidth is usually higher than B, the acceleration might be lost in the communication time.
• Estimate the break-even point for the task. (Note: offloading is different from accelerating, where the host continues working.)

[Diagram: a node and an accelerator, each with its own memory, connected by a link of bandwidth B; a speed-versus-time plot shows the break-even problem size at which the accelerator overtakes the node.]
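To make the latency + N/B model concrete, here is a small host-side C sketch that compares offload time against host time over a range of problem sizes. All the rates (latency, link bandwidth, host and accelerator FLOP rates, FLOPs per element) are illustrative assumptions, not measured ClearSpeed figures:

    #include <stdio.h>

    int main(void) {
        const double latency = 10e-6;   /* host-to-board latency, s (assumed)  */
        const double B       = 1.0e9;   /* link bandwidth, bytes/s (assumed)   */
        const double host    = 5.0e9;   /* host FLOP rate, FLOP/s (assumed)    */
        const double accel   = 50.0e9;  /* accelerator FLOP rate (assumed)     */
        long n;

        for (n = 1000; n <= 100000000L; n *= 10) {
            double flops  = 100.0 * n;                 /* ~100 FLOPs/element (assumed) */
            double t_off  = latency + 16.0 * n / B     /* 8 bytes in + 8 bytes out     */
                          + flops / accel;
            double t_host = flops / host;
            printf("n=%9ld  offload %.6f s  host %.6f s  -> %s\n",
                   n, t_off, t_host,
                   t_off < t_host ? "offload wins" : "host wins");
        }
        return 0;
    }

With these numbers the crossover lands near n = 5000; raising the FLOPs per element moves it lower, which is exactly why compute-intensive kernels offload well.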
Memory bandwidth dictates performance

• Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts
  – Applications residing in accelerator DRAM do not make use of the massive internal memory bandwidth
• GPUs face a very similar issue

[Diagram: a multicore x86 node with 17 GB/s node memory connects over PCI-X or PCIe (1 to 2 GB/s) to the accelerator, whose DRAM provides 5.4 GB/s while its local RAM provides 192 GB/s.]
Latency and bandwidth: Simple offload model

• The accelerator must be quite fast for this approach to have benefit
• This "mental picture" may stem from the early days of the Intel 80x87 and Motorola 6888x math coprocessors

[Diagram: a timeline in which the host stops, data crosses to the accelerator (latency plus bandwidth-limited transfer), the accelerator computes alone, and results cross back before the host resumes.]
Latency and bandwidth: Acceleration model

• The host continues working
  – The accelerator needs only be fast enough to make up for the time lost to bandwidth and latency
• Easiest use model
  – Host and accelerator share the same task, as in DGEMM
• More flexible
  – Host and accelerator each specialize in what they do

[Diagram: a timeline in which the host keeps computing while data moves to and from the accelerator, which computes in parallel.]
Accelerator need not wait for all data before starting

• The host can work while data is moved
  – PCI transfers might burden a single x86 core by up to 40%
  – Other cores on the host continue productive work at full speed
• The accelerator can work while data is moved
  – It can be slower than the host, and still add performance!
• In practice, latency is microseconds while an accelerator task takes seconds
  – The latency gaps in such a timeline would be microscopic if drawn to scale

[Diagram: a timeline with host work, data transfer, and accelerator work all overlapping.]
Performance considerations

• Look for data parallelism
  – Fine-grained: vector operations
  – Medium-grained: unrolled independent loops
  – Coarse-grained: multiple simultaneous data channels/sets
• Performance analysis for accelerator cards
  – Like analysis for message-passing parallelism, but with more levels of memory and communication
• Application porting success depends heavily on attention to memory bandwidths
  – (Surprisingly) not so much on the bandwidth between host and accelerator card
PCI Bus

• ClearSpeed boards utilize either the PCI-X or PCIe bus
  – PCI-X 133 MHz: 1 GB/s peak
  – PCIe x8: 1.6 GB/s peak
• Available memory on board
  – 1 GB of 200 MHz DDR2 SDRAM, shared by the two CSX600 processors
• Must consider both the transfer rate AND the available memory
  – If the application requires more memory, then more communication to the board is necessary
• Lower bound on time, even with an infinitely fast board:
  – Time = total data transferred / bus bandwidth
PCI Bus

• Driver performance is very machine-specific and depends on transfer size, direction, etc.
  – Transfer rate varies with transfer size
  – See the Runtime User's Guide for current driver performance
On-board Memory

• Two-level memory hierarchy
  – 1 GB "mono" shared memory
  – 6 KB "poly" memory per processing element (PE)
    • 6 KB/PE x 96 PEs = 576 KB per CSX600
• Peak bandwidth between levels
  – 2.7 GB/s x 2 chips = 5.4 GB/s
• Must consider both the transfer rate AND the available memory
  – If the application requires more memory, then more communication to the board is necessary
• Lower bound on time, even with infinitely fast PEs:
  – Time = total data transferred / mono-to-poly bandwidth
• Secondary considerations
  – Burst size: 64 bytes/PE (i.e., 8 doubles)
  – Transfers can be smaller, but at reduced efficiency
SIMD Computing

• What is SIMD?
  – Single Instruction, Multiple Data
    • Each PE sees the same instruction stream
    • Each PE issues "load", "multiply", etc., simultaneously
    • But acts on different data per PE
  – PARALLEL COMPUTATION
• ClearSpeed SIMD is enhanced by:
  – Local memory for each PE
    • data management is easier within "poly" memory
    • does not require adjacent access for all 96 elements involved in the computation from a shared memory pool
  – PEs can be enabled/disabled
    • not required to always use all PEs
    • useful for handling "boundaries" (see the sketch below)
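A tiny Cn-style sketch of the boundary case. get_penum and poly conditionals are introduced in the SDK section later in this deck; process_element and the remainder count are hypothetical, for illustration only:

    poly int me = get_penum();     // 0..95, a different value on each PE
    mono int remainder = 40;       // e.g. 40 leftover elements (illustrative)

    if (me < remainder) {
        process_element(me);       // hypothetical per-PE work function
    }                              // other PEs are disabled, not branched around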
SIMD Array

• 96 PEs per CSX600
  – 210 MHz
  – One double-precision multiply-accumulate per cycle per PE
  – 4-cycle pipeline depth for multiply and accumulate
• For top performance, use operations on 4-element vectors on each PE
• Nearest-neighbor communication
  – The "swazzle" path topology is a line or ring
  – Bandwidth: 8 bytes per cycle between register files
    • 8 x 96 x 210 MHz = 161 GB/s
  – Useful for fine-grained communication
Good Example Kernels

• Dense linear algebra
  – Matrix-matrix products (DGEMM)
    • Low memory bandwidth required = high data re-use
    • Inner kernel: matrix-multivector product (see the sketch below)
      – 96x96 matrix times 4 vectors
        » 96x96 matrix due to the 96 PEs
        » 4 vectors due to the multiply-accumulate pipeline depth
• Monte Carlo (computational finance)
  – "Embarrassingly parallel" task distribution
  – Very little data requirement
• Molecular dynamics (Amber, BUDE)
  – Large numbers of identical tasks can be found
  – Requires small working data sets
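As a reference for the inner-kernel shape named above, here is plain C for a 96x96 matrix times four vectors. This shows the mathematical shape only, not ClearSpeed's tuned implementation; on the CSX600 the 96 rows map onto the 96 PEs and the 4 vectors keep the 4-stage MAC pipeline full:

    /* C[96][4] += A[96][96] * B[96][4] */
    void mat_multivec(const double A[96][96],
                      const double B[96][4],
                      double C[96][4])
    {
        for (int i = 0; i < 96; i++)          /* one row per PE on the CSX600     */
            for (int k = 0; k < 96; k++)
                for (int j = 0; j < 4; j++)   /* 4 vectors fill the MAC pipeline  */
                    C[i][j] += A[i][k] * B[k][j];
    }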
Possible Kernels

• Partial differential equations
  – Some are memory-bandwidth limited, so not a good candidate for ClearSpeed acceleration
    • a small stencil implies little computation per grid point
    • a wide, sparse stencil implies a large active data set
  – But some PDE simulations are good candidates; those that:
    • require a small grid, so can run entirely in PE memory (computational finance)
    • have large, dense stencils
      – large amounts of computation per grid point
      – a sufficiently small active data set
    • use implicit time stepping
      – large systems of equations solved via direct methods
      – direct solvers utilize dense linear algebra kernels (i.e., DGEMM)
Keys to Success

• Parallelism is essential
• Proper management of the "poly" memory is also critical
  – The application must accept the memory bandwidth limits
    • PCIe or PCI-X
    • On-board memory hierarchy
  – The SDK enables asynchronous data transfers
    • permits efficient "double buffering" to manage data streams, accommodating the size limit
  – The application must employ a small working data set
    • less than 576 KB, distributed across 96 PEs
    • also be aware of the 1 GB shared memory limit
• While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board!
Remember the host processor

• Today's multi-core hosts are very useful for managing "other tasks" that are not accelerated by ClearSpeed
• Many applications can overlap these tasks with ClearSpeed-accelerated tasks
• Profile the host portion of your application as well, using any of a variety of tools
  – Use the ClearSpeed Visual Profiler for CSAPI utilization
General optimization techniques

• Latency hiding
  – Overlap compute with I/O
• Data reuse
  – On-chip swazzle path
• Maximize PE usage
  – Ensure all PEs are processing, not idle
Overlap data with compute

• Double-buffer
• There are many levels of data I/O and compute parallelism:
  – PE load/store overlaps PE compute
  – PE to board memory can also overlap
  – Board memory to host memory can also overlap
• Hence, if the task is compute bound:
  – Data takes "no time" to transfer
• If the task is I/O bound:
  – Compute takes "no time" to calculate
Data reuse

• Swazzle path
  – Left or right 64-bit transfer (8 bytes)
  – 8 bytes per cycle, so ~161 GB/s per CSX processor
  – Can be a complete loop or a linear chain
• Parallel with other data I/O
  – Register-to-register moves run in parallel with on-chip/off-chip transfers
    • Doesn't impinge on DRAM access
  – PE local memory to register transfers also proceed in parallel
    • Doesn't impinge on local memory access
Maximize PE usage

• Aim for 100% efficiency
• PEs use predicated execution
  – PEs are "disabled" rather than code being skipped
  – Minimize the effects: extract common code from conditionals
• The mono processor can branch
  – Skips blocks of code
Detail of I/O widths for performance analysis

Each accelerator board has (aggregate bandwidth for two CSX600 chips):
• 161 GB/s bandwidth, PE register file to PE memory (4 bytes per cycle per PE)
• 322 GB/s swazzle path bandwidth (8 bytes per cycle per PE)
• 968 GB/s bandwidth, PE register file to PE ALU (24 bytes per cycle per PE)
• 5.4 GB/s DRAM bandwidth (32 bytes per cycle)

[Diagram: the PE datapath (FP multiplier, FP adder, divide/square root, MAC, ALU, 128-byte register file, 6 KB PE SRAM, PIO collection and distribution), annotated with 161 GB/s between register file and PE SRAM, 322 GB/s swazzle links to PE n-1 and PE n+1, 968 GB/s into the execution units, and 5.4 GB/s to the 1 GB CSX DRAM.]
ENVISION. ACCELERATE. ARRIVE.
Software Development Kit
ClearSpeed SDK overview

• Cn compiler: C with extensions for SIMD control
• Assembler
• Linker
• Simulator
• Debugger
• Graphical profiler
• Libraries
• Documentation
• Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9)
Agenda

1. Introduction to Cn
2. Cn Libraries
3. Debugging Cn
4. CSAPI: Host / Board Communication
ENVISION. ACCELERATE. ARRIVE.
Introduction to Cn
Software Development

The CSX architecture is simpler to program:

• A single program handles both serial and parallel operations
• Architecture and compiler were co-designed
• Instruction and data caches
• Simple, regular 32-bit instruction set
• Large, flexible register file
• Fast thread context switching
• Built-in debug support
• Same development process as traditional architectures: compile, assemble, link
• Cn is a simple parallel extension of C
Cn — C with vector extensions for CSX

• New keywords: the mono and poly storage qualifiers
  – mono declares a serial (single) variable
  – poly declares a parallel (vector) variable
• Mono variables live in the 1 GB DRAM
• Poly variables live in the 6 KB SRAM of each PE
Cn differences from C

• New data type multiplicity modifiers:
  – mono: denotes a serial variable
    • resident in "mono" memory
    • mono is the default multiplicity
  – poly: denotes a parallel/vector variable
    • resident in "poly" memory local to each PE
  – The modifiers apply to pointers, doubly so:
    • mono int * poly foo;   (foo is a pointer in poly memory to an int in mono memory)
    • poly int * mono bar;   (bar is a pointer in mono memory to an int in poly memory)
    • int * poly * mono good_grief;   (as you would expect…)
• Pointer sizes:
  – mono int *: 4 bytes (32-bit addressable space, 512 MB)
  – poly int *: 2 bytes (16-bit addressable space, 6 KB)
Cn differences from C

• Execution context:
  – Alters branch/jump behavior
  – In mono context, jumps occur as in a traditional architecture
  – In poly context, PEs are enabled/disabled
    • if (penum > 32) {…} else {…} disables the false PEs on the true branch, then re-enables the false PEs and disables the other PEs for the false branch; both branches are executed
  – break, continue: the selected PEs are disabled until the end of scope on all PEs
  – return: the selected PEs are disabled until all PEs return, or the end of scope
Porting C to Cn (Example 1)

C code:

    int i, j;

    for (i = 0; i < 96; i++) {
        j = 2*i;
    }

Similar Cn code:

    poly int i, j;

    i = get_penum();   // i=0 on PE0, i=1 on PE1, etc.
    j = 2*i;           // j=0 on PE0, j=2 on PE1, etc.
Porting C to Cn (Example 2)

C code:

    int i;

    for (i = 0; i < N; i++) {
        …
    }

Similar Cn code:

    poly int me, i;
    mono int npes;

    me = get_penum();       // me=0 on PE0, me=1 on PE1, etc.
    npes = get_num_pes();   // npes = 96

    // i = 0, 96, 192, … on PE0; 1, 97, 193, … on PE1; etc.
    for (i = me; i < N; i += npes) {
        …
    }
Simple Cn example

    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96.
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a[24];
        poly double b[4] = {0., 0., 0., 0.};
        int i;

        while (n) {
            memcpym2p(a, A + 24*get_penum(), 24*sizeof(double));
            A += 24*96;
            for (i = 0; i < 24; i += 2) {   // stride 2: consumes (a[i], a[i+1]) pairs in bounds
                b[0] += a[i]*mat[0] + a[i+1]*mat[1];
                b[1] += a[i+1]*mat[0] + a[i]*mat[1];
                b[2] += a[i]*mat[2] - a[i+1]*mat[3];
                b[3] += a[i+1]*mat[2] - a[i]*mat[3];
            }
            n -= 24*96;
        }

        memcpyp2m(B + 4*get_penum(), b, 4*sizeof(double));
        return;
    }
ENVISION. ACCELERATE. ARRIVE.
Cn Libraries
Runtime libraries

• Cn supports the standard C runtime, including:
  – malloc
  – printf
  – sqrt
  – memcpy
• Cn extensions include:
  – sqrtp
  – memcpym2p / memcpyp2m
  – get_penum
  – swazzle
  – any / all
Asynchronous I/O

• For the most efficient use of limited PE memory, overlap data transfers between mono memory and poly memory:
  – async_memcpym2p / async_memcpyp2m
  – sem_sig / sem_wait
• For greatest efficiency, the async_memcpy routines bypass the data cache, so coherency must be maintained:
  – dcache_flush / dcache_flush_address
Asynchronous I/O example

    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96.
        poly unsigned short penum = get_penum();
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a_front[12], a_back[12];
        poly double b[4] = {0., 0., 0., 0.};
        int i;

        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
        A += 12*96;
        n -= 24*96;
        while (n) {
            async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(19);
            for (i = 0; i < 12; i += 2) {   // stride 2 keeps a_front[i+1] in bounds
                b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
                b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
                b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
                b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
            }
            n -= 12*96;
            async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(17);
            for (i = 0; i < 12; i += 2) {
                …   // compute on a_back, then finish outside the while loop
            }
        }
    }
ENVISION. ACCELERATE. ARRIVE.
Cn Pointers
Cn — mono and poly pointers

• Using mono and poly with pointers:

    mono int * mono mPmi;   // mono pointer to mono int
    poly int * mono mPpi;   // mono pointer to poly int
    mono int * poly pPmi;   // poly pointer to mono int
    poly int * poly pPpi;   // poly pointer to poly int

• Most commonly used is a mono pointer to poly:

    poly <type> * mono <variable_name>;
Cn — mono and poly pointers

• mono pointer to mono int: mono int * mono mPmi
  [Diagram: both the pointer and the int it references live in mono memory.]

• mono pointer to poly int: poly int * mono mPpi
  [Diagram: the pointer lives in mono memory and points to the same location in each PE's poly memory.]

• poly pointer to poly int: poly int * poly pPpi
  [Diagram: the pointer is stored at the same location in each PE's poly memory; each PE's copy can reference a different int in its own poly memory.]

• poly pointer to mono int: mono int * poly pPmi
  [Diagram: the pointer is stored at the same location in each PE's poly memory; each PE's copy can reference a different int in mono memory.]
ENVISION. ACCELERATE. ARRIVE.
Conditional Expressions
Conditional Expressions: mono-if

• Conditions based on mono expressions
  – The expression has the same value on all PEs
  – A code block is selected according to the expression, and a branch instruction is executed

    mono int i, j;

    i = j = 1;
    if (i == j) {
        // this block executed on all PEs
    } else {
        // this block branched over on all PEs
    }
Conditional Expressions: poly-if

• Conditions based on poly expressions
  – The expression may have different values on different PEs
  – But the SIMD model implies that all PEs execute the same instruction simultaneously
  – All branches are executed on all PEs, with each PE enabled only if the conditional expression is true (like predicated instructions)

    poly int i;

    i = get_penum();
    if (i < 48) {
        // PEs 0, 1, 2, … execute instructions
        // PEs 48, 49, … have instructions issued but ignored
    } else {
        // PEs 0, 1, 2, … have instructions issued but ignored
        // PEs 48, 49, … execute instructions
    }
Conditional Expressions: poly-while

• While loops based on poly expressions
  – The loop continues execution until the condition is false on all PEs
  – PEs are disabled one by one until the while condition is false on all PEs
  – count tracks the total number of iterations (95 in this case, set by the largest initial value of me)

    mono int count = 0;
    poly int me;

    me = get_penum();
    while (me > 0) {
        --me;
        ++count;
    }
Other variations between C and Cn

• Labeled break and continue statements
• No switch statement using poly variables (use multiple if statements)
• No goto statement in poly context
ENVISION. ACCELERATE. ARRIVE.
Moving Data
Data flow

1. Board and host communicate via a Linux kernel module or Windows driver; create a handle and establish the connection.
2. Register intent to use the first processor on the card; load the code onto the enabled processor.
3. Transfer data from host to board; semaphores synchronize transfers between host and board.
4. Run the code on the enabled processor; the host can continue with other work.
5. Send results back to the host; halt the board program and clean up.
Implicit broadcast from mono to poly

• Implicit broadcast from mono to poly by assignment
• Assigning poly to mono is not permitted

    mono int m = 7;
    poly int p;

    p = m;   // implicit broadcast to all PEs

    mono int m;
    poly int p = get_penum();

    m = p;   // NO! m would receive a different value from each PE
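Since a poly value cannot be assigned to a mono variable, one common pattern is to land each PE's value in mono memory with memcpyp2m (introduced on a later slide) and reduce serially on the mono side. A minimal sketch, assuming that adding a poly index to a mono base yields the per-PE destination pointer the copy routine expects:

    poly int me = get_penum();
    poly int p  = me * me;                   // per-PE value to collect (illustrative)
    mono int all[96];
    mono int sum = 0;
    mono int k;

    memcpyp2m(all + me, &p, sizeof(int));    // each PE writes its own mono slot
    for (k = 0; k < 96; k++)
        sum += all[k];                       // serial reduction in mono code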
Explicit data movement – mono to poly

memcpym2p(); async_memcpym2p()

• Memory copy of n bytes from mono to poly
  – The source is a poly pointer to mono memory, which can have a different value for each PE
  – The destination is a mono pointer to poly memory; that is, the destination address is the same for all PEs

[Diagram: source data in mono memory fans out to the same destination address on each of PE0 through PE95.]
Explicit data movement – poly to mono

memcpyp2m(); async_memcpyp2m()

• Memory copy of n bytes from poly to mono
  – The source is a mono pointer to poly memory; therefore the source address is the same for every PE
  – The destination is a poly pointer to mono memory, which can have a different value for each PE

[Diagram: the same source address on each of PE0 through PE95 fans in to destination data in mono memory.]
Explicit data movement – asynchronous

async_memcpym2p(); async_memcpyp2m()

• Asynchronous memory copy of n bytes from mono to poly, or from poly to mono
  – Computation continues during the data copy
  – The mono memory data cache is NOT flushed
  – Restrictions on the alignment of data
  – Use semaphores to wait for completion of the copy
  – Much higher bandwidth than the synchronous versions

    dcache_flush();
    async_memcpym2p(semaphore, …);
    // computation continues
    sem_wait(semaphore);
    // use the data that has been transferred from mono memory
Explicit data movement – swazzle

• Register-to-register transfer between neighboring PEs

[Diagram: PE n's register file connects to PE n-1 and PE n+1; each PE also shows its memory, ALU, enable stack, and status flags.]
Swazzle operations

• The assembly language versions operate directly on the register file
• The Cn versions operate on data and include implicit data movement from memory to registers
• Variants:
  – swazzle_up( poly int src );    // copy to the higher-numbered PE
  – swazzle_down( poly int src );  // copy to the lower-numbered PE
  – swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
  – swazzle_down_generic( … );
  – Similar swazzles operating on other data types
  – Functions to set the data copied into the ends of the swazzle chain
Data movement bandwidths per CSX600

• Mono memory to poly memory: 2.7 GB/s aggregate over 96 PEs
• Poly memory to registers: 840 MB/s per PE, 81 GB/s aggregate
• Swazzle path: 1680 MB/s per PE, 161 GB/s aggregate
• Total bandwidth for an Advance board (2 CSX600 processors): ~0.5 TB/s
DMA performance

• The Advance board has a host DMA controller which can act as a PCI bus master
• All DMA transfers are at least 8-byte aligned
• The host DMA engine will attempt to use the entire bus bandwidth

[Chart: ClearSpeed Advance DMA performance; transfer rate (MB/s, 0 to 1200) versus transfer size (2.0 to 7.8 MB) for e620 and X620 average read and write rates.]
ENVISION. ACCELERATE. ARRIVE.

CSAPI: Host-Board Communication
Host-Board interaction basics

• The basic model for interaction between the host and the card is very simple:
  – The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host.
  – The host pushes data to and pulls data from the board.
  – The host can also signal and receive semaphores.
Connecting to the board

• A host application needs to perform the following sequence to launch a process on the board (see the sketch below):
  – Create a CSAPI handle: CSAPI_new
  – Establish a connection with the board: CSAPI_connect
  – Register the host application with the driver: CSAPI_register_application
  – Load the CSX application on the desired chip: CSAPI_load
  – Run the CSX application on the desired chip: CSAPI_run
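Strung together as host-side C, the sequence looks roughly like this. The function names come from this slide, but the header name, handle type, and argument lists are assumptions; treat this as a sketch and see the CSX600 Runtime Software User Guide for the real prototypes:

    #include <csapi.h>                /* assumed header name */

    CSAPI *cs;                        /* handle type is an assumption    */
    cs = CSAPI_new();                 /* create a CSAPI handle           */
    CSAPI_connect(cs, 0);             /* connect to board 0 (assumed)    */
    CSAPI_register_application(cs);   /* register with the driver        */
    CSAPI_load(cs, 0, "kernel.csx");  /* load a CSX binary on chip 0     */
    CSAPI_run(cs, 0);                 /* start the board process running */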
Interacting with the board

• Get the board memory address of a known symbol: CSAPI_get_symbol_value
  – This must be done after the application is loaded, if the dynamic load capability is to be used
• Write/read data at a retrieved memory address: CSAPI_write_mono_memory, CSAPI_read_mono_memory
  – Asynchronous variants of these routines also exist
  – A process does not need to be running for these operations to succeed, but it does need to be loaded
  – These should not be performed DURING process termination
• Managing semaphores:
  – CSAPI_allocate_shared_semaphore declares a semaphore for use on both host and card
  – CSAPI_semaphore_wait, CSAPI_semaphore_signal
Cleaning up

• Process termination: CSAPI_wait_on_terminate, CSAPI_get_return_value
• Clean-up: CSAPI_delete (the sketch below continues the launch example)
• See the CSX600 Runtime Software User Guide for more details, including:
  – managing multiple processes on the board/chip at once
  – managing board control registers
  – board reset
  – managing multi-threaded CSX applications
  – board memory allocation
  – managing multiple boards/chips
  – error handling
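Continuing the launch sketch, the data-exchange and shutdown calls named above might be combined as follows; again, the exact signatures, the symbol name, and the semaphore are assumptions for illustration only:

    unsigned addr;
    double results[96];

    /* Find where the CSX program placed its output buffer
       (the symbol name "out_buf" is illustrative). */
    CSAPI_get_symbol_value(cs, "out_buf", &addr);
    CSAPI_semaphore_wait(cs, sem_done);     /* sem_done: assumed shared semaphore */
    CSAPI_read_mono_memory(cs, 0, addr, sizeof(results), results);

    CSAPI_wait_on_terminate(cs, 0);         /* let the board process finish */
    CSAPI_delete(cs);                       /* release the handle           */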
ENVISION. ACCELERATE. ARRIVE.
Debugging Cn
csgdb

• csgdb is a port of the open source gdb debugger
  – full symbolic debugging of mono/poly variables
  – full gdb breakpoint support
  – step through Cn or assembly
  – view mono and poly registers
  – view the PE enabled state
  – also accessible via DDD, which allows graphical data visualization
Debug control

• To enable debugging:
  – export CS_CSAPI_DEBUGGER=1 initializes the debug interface within the host application
  – export CS_CSAPI_DEBUGGER_ATTACH=1 makes the host application write a port number to stdout and wait for <Return/Enter> to be pressed, so that csgdb can be manually attached to the connected board process
• Launch the host application
  – This can be done with or without a debugger
• Launch csgdb in a new shell:
  – csgdb <csx_file_name> <port_number>
  – No need to "connect"; the host application did this already
  – Set desired breakpoints, then run
• Note that the host is currently blocked waiting for <Return/Enter>, so the card process may also be blocked waiting for the host
• Press return in the host shell for the host and card applications to proceed
csgdb Debugger (shown with the DDD front-end)

[Screenshot: DDD running csgdb, showing a real-time plot of PE memory contents; Cn source-level breakpoints, watchpoints, and single-stepping; disassembly with breakpoints and watchpoints; register contents; and on-chip poly array contents.]
csgdb command-line example

    % cscn foo.cn -g -o foo.csx
    % csgdb ./foo.csx
    (gdb) connect
    0x80000000 in __FRAME_BEGIN_MONO__ ()
    (gdb) break 109
    Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
    (gdb) run
    Starting program: /home/kris/my_app/foo.csx
    Breakpoint 1, main () at foo.cn:109
    (gdb) next
    110  y = MINY + (get_penum() * STEPY);
    (gdb) print y
    $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
      1, 1, 1, 1}
ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Visual Profiler: Explaining Performance
ClearSpeed Visual Profiler (csvprof)

• Host tracing
  – Trace CSAPI functions
  – The user can infer overlapping host/board utilization
  – Locate hot spots
• Board tracing
  – Trace board-side functions without instrumentation
  – Locate hot spots
• Board hardware utilization
  – Display activity of CSX functional units, including: ld/st, PIO, SIMD microcode, instruction cache, data cache, thread
  – Cycle accurate
  – View corresponding source
• Unified GUI
Detailed profiling is essential for accelerator tuning

[Diagram: host CPU(s) connected to an Advance accelerator board with two CSX600 pipelines, annotated with the four profiling views below.]

• HOST/BOARD INTERACTION: infer cause and effect; measure transfer bandwidth; check overlap of host and board compute.
• ACCELERATOR PIPE: view instruction issue; visualize overlap of executing instructions; get cycle-accurate timing; remove instruction-level performance bottlenecks.
• CSX600 SYSTEM: trace at system level; inspect overlap of compute and I/O; view cache utilization; graph performance.
• HOST CODE PROFILING: visually inspect multiple host threads; time specific code sections; check overlap of host threads.
csvprof: Host tracing

• Dynamic loading of the CSAPI trace implementation
• Triggered with an environment variable:
  – export CS_CSAPI_TRACE=1
  – (Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1)
• Specify the tracing format:
  – export CS_CSAPI_TRACE_CSVPROF=1
  – currently this is the only implementation, but in the future…
• Specify the output file for the trace:
  – export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst
  – default filename: csvprof_data.cst
• The output file is written during CSAPI_delete
csvprof: Host-Board interaction [screenshot]

csvprof: Host code profile – Linpack benchmark [screenshot]

csvprof: CSX600 system profile [screenshot]

csvprof: Accelerator pipeline profile [screenshot]

csvprof: Instruction pipeline stalls [screenshot]
csvprof: Advance board tracing

• Enabled using the debugger, csgdb
  – Can be used interactively or through a gdb script
• Can select events to profile, or all events
• Requires buffer allocation on the card
  – Today, this is done statically
  – One could use CSAPI to allocate the buffer, but the developer must get the location and size of the buffer to the user, to be entered into csgdb
  – Easy if running on only one chip: place the buffer in the other chip's memory
• Explicit dump to generate the trace file
  – Can control the type of data to be dumped
csvprof: Sample gdb script

    % cat ./csgdb_trace.gdb
    connect
    load ./foo.csx
    cstrace buffer 0x60000000 0x1000000
    cstrace event all on
    tbreak test_me
    continue
    cstrace enable
    continue
    cstrace dump foo.cst
    cstrace dump branch dgemm_test4_branch.cst
    quit

    % csgdb --command=./csgdb_trace.gdb
ENVISION. ACCELERATE. ARRIVE.
Tuning Tips
Pipelined arithmetic

• Four-stage floating-point pipeline
• Use vector types, vector intrinsic functions, and the vector math library for high efficiency

    __DVECTOR a, b, c;
    poly double x[N];

    a = *((__DVECTOR *)&x[0]);
    b = *((__DVECTOR *)&x[4]);
    c = cs_sqrt(__cs_vadd(a, b));
Poly conditionals

• When possible, remove common subexpressions from poly if-blocks to reduce the amount of replicated work (see the sketch below).
• It may even pay to compute and throw away results if doing so leads to fewer poly conditional blocks.
• A poly if-block uses predicated instructions, not a branch, so it is cheap as long as not many additional instructions are executed.
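A minimal Cn-style sketch of the idea (the variable names are illustrative). Hoisting the shared product means each predicated arm carries one add instead of a multiply and an add:

    poly float x, y, r;

    // Instead of writing x*y inside both arms of the conditional,
    // compute the common subexpression once, unconditionally:
    poly float t = x * y;

    if (x > 0.0f)
        r = t + 1.0f;   // predicated: only the add is replicated
    else
        r = t - 1.0f;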
Poly loop counters

• Loops with poly counters are more expensive than those with mono counters
• Use mono loop counters where possible (see the sketch below)
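A small sketch of the point: when the trip count is the same on every PE, keeping the counter mono makes the loop test a single scalar operation rather than per-PE predicated work:

    mono int k;             // scalar counter: one loop test covers all PEs
    poly double s = 0.0;

    for (k = 0; k < 24; k++) {
        s += 1.0;           // the poly body still executes on every PE
    }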
Arrays

• Pointer incrementing is more efficient than using array index notation
• Poly addresses require only 16 bits
• Use short for poly pointer increments (see the sketch below)
  – This avoids conversion from int to short
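A small sketch (names illustrative): since poly addresses are 16-bit, declaring the increment as short avoids an int-to-short conversion on each pointer bump:

    poly double buf[64];
    poly double * mono p = buf;   // pointer into the 16-bit poly address space
    short step = 8;               // short increment: no int-to-short conversion

    p += step;                    // bump the poly pointer by 8 doubles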
Data transfer

• The synchronous functions are completely general
  – flush the data cache on each transfer
  – memcpyp2m(), memcpym2p()
• The asynchronous functions maximize performance
  – do not flush the cache
  – have data size and alignment restrictions
  – require use of a wait semaphore
  – async_memcpyp2m(); sem_wait()  and  async_memcpym2p(); sem_wait()
• Large data blocks are more efficient than small blocks
  – Host to board, board to host, mono to poly, and poly to mono
ENVISION. ACCELERATE. ARRIVE.
Application Examples
Math function speed comparison

Typical speedup of ~8x over the fastest x86 processors, because the math functions stay in local memory on the card.

[Chart: 64-bit function operations per second (billions, 0 to 2.5) for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card.]
Nucleic Acid Builder (NAB)

• Newton-Raphson refinement is now possible; large DGEMM calls from computed second derivatives will be in AMBER 10
• A 2.5x speedup was obtained for this operation with three hours of programmer effort
• Enables accurate computation of entropy and Gibbs free energy for the first time
• AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once the symmetry of atom-atom interactions is exploited
AMBER molecular modeling with ClearSpeed

AMBER model   Host        Advance X620   Speedup
Gen Born 1    83.5 min    24.6 min       3.4×
Gen Born 2    84.6 min    23.5 min       3.6×
Gen Born 6    37.9 min    4.0 min        9.4×

[Chart: run time in minutes (0 to 100) for AMBER Generalized Born models 1, 2, and 6, host versus Advance X620.]
Monte Carlo methods exploit high local bandwidth

• Monte Carlo methods are ideal for ClearSpeed acceleration:
  – High regularity and locality of the algorithm
  – Very high compute-to-I/O ratio
  – Very good scalability to high degrees of parallelism
  – Needs 64-bit
• Excellent results for parallelization
  – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
  – Maintains the high precision required by the computations
    • True 64-bit IEEE 754 floating point throughout
  – 25 W per card typical when the card is computing
• ClearSpeed has a Monte Carlo example code, available in source form for evaluation
Monte Carlo applications scale very well

• No acceleration: 200M samples, 79 seconds
• 1 Advance board: 200M samples, 3.6 seconds
• 5 Advance boards: 200M samples, 0.7 seconds

[Chart: European option pricing model speedup (0 to 10) versus number of ClearSpeed Advance boards (0 to 5), showing near-linear scaling.]
Why do Monte Carlo applications need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But when you sum many similar values, you start to lose the significant digits.
• 64-bit summation is needed even to get a single-precision result! (The small program below demonstrates the effect.)

Single precision: 1.0000x10^8 + 1 = 1.0000x10^8
Double precision: 1.0000x10^8 + 1 = 1.00000001x10^8
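The effect is easy to reproduce in plain C; this small host-side program (a sketch, not ClearSpeed-specific) shows the single-precision sum absorbing the increment while the double-precision sum keeps it:

    #include <stdio.h>

    int main(void) {
        float  s32 = 1.0e8f;   /* 2^26 < 1e8 < 2^27, so the float ulp here is 8 */
        double s64 = 1.0e8;

        s32 += 1.0f;           /* lost: the result is still 1.0000e8   */
        s64 += 1.0;            /* kept: the result is 1.00000001e8     */

        printf("single: %.1f\n", s32);
        printf("double: %.1f\n", s64);
        return 0;
    }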
ENVISION. ACCELERATE. ARRIVE.
Help and Support
Installed documentation

• docs directory
  – CSXL user guide
  – Runtime user guide
  – csvprof Visual Profiler overview and examples
  – SDK: getting started, gdb manual, instruction set manual, Cn library manual, reference manual
  – release notes
• examples directory
ClearSpeed online

• General information, news, etc.: company website, www.clearspeed.com
• Report a problem, find answers, etc.: support website, support.clearspeed.com
• The support website has:
  – Documentation, user guides, reference manuals
  – Solutions knowledge base
  – Software downloads
  – Case logging
Join the ClearSpeed Developer Program!

• Designed to support the leading-edge community of developers using accelerators
• Membership is free and has the following benefits:
  – Access to the ClearSpeed Developer website
  – ClearSpeed Developer Community online forum
  – Invitations to participate in ClearSpeed Developer and User Community meetings and events
  – A repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community
  – Technical updates, tips, and tricks from the gurus at ClearSpeed and the Developer Community
  – And more, including opportunities to preview new software releases and developer discount programs
• Leverage the expertise of developers worldwide
• Ask a question, or share your knowledge
• Register now at developer.clearspeed.com!

Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com