ENVISION. ACCELERATE. ARRIVE.
Overview
ClearSpeed Technical Training
December 2007
Presenters

Ronald Langhi, Technical Marketing Manager
Brian Sumner, Senior Engineer
ClearSpeed Technology: Company Background
• Founded in 2001
– Focused on alleviating the power, heat, and density challenges of HPC systems
– 103 patents granted and pending (as of September 2007)
– Offices in San Jose, California and Bristol, UK
Agenda

• Accelerators
• ClearSpeed and HPC
• Hardware overview
• Installing hardware and software
• Thinking about performance
• Software Development Kit
• Application examples
• Help and support
ENVISION. ACCELERATE. ARRIVE.
What is an accelerator?
What is an accelerator?

• A device to improve performance
  – Relieves the main CPU of workload
  – Or augments the CPU's capability
• An accelerator card can increase performance
  – On specific tasks
  – Without aggravating facility limits on clusters (power, size, cooling)
FPGAs
• Good for integer, bit-level ops
• Programming looks like circuit design
• Low power per chip, but 20x more power than custom VLSI
• Not for 64-bit FLOPS

Cell and GPUs
• Good for video gaming tasks
• 32-bit FLOPS, not IEEE
• Unconventional programming model
• Small local memory
• High power consumption (> 200 W)

ClearSpeed
• Good for HPC applications
• IEEE 64-bit and 32-bit FLOPS
• Custom VLSI, true coprocessor
• At least 1 GB local memory
• Very low power consumption (25 W)
• Familiar programming model

All accelerators are good… for their intended purpose.
The case for accelerators

• Accelerators designed for HPC applications can improve performance as well as performance per watt, per cabinet, and per dollar
• Accelerators enable:
  – Larger problems for a given compute time, or
  – Higher accuracy for a given compute time, or
  – The same problem in a shorter time
• Host-to-card latency and bandwidth are not major barriers to the successful use of properly designed accelerators
ENVISION. ACCELERATE. ARRIVE.
What can be accelerated?
Good application targets for acceleration

• An application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
  – The software depends on executing large numbers of arithmetic calculations
  – Usually 64-bit FLoating point Operations per Second (FLOPS)
  – Should also have a high ratio of FLOPS to data movement (bandwidth)
  – Computationally intensive applications may run for many hours or more, even on large clusters
• Data parallelism:
  – The software performs the same sequence of operations again and again, but on a different item of data each time
• Example computationally intensive, data-parallel problems include:
  – Large matrix arithmetic (linear algebra)
  – Molecular simulations
  – Monte Carlo options pricing in financial applications
  – And many, many more…
Example data parallel problems that can be accelerated

• Structural Analysis
• Electromagnetic Modeling
• Radar Cross-Section
• Ab initio Computational Chemistry
• Global Illumination Graphics
HPC Requirements

• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)
• Need to consider:
  – Type of application
  – Software
  – Data type and precision
  – Compatibility with the host (logical and physical)
  – Memory size (local to the accelerator)
  – Latency and bandwidth to the host
An HPC-specific accelerator

• CSX600 coprocessor for math acceleration
  – Assists a serial CPU running compute-intensive math libraries
  – Available on add-in boards, e.g. PCI-X, PCIe
  – Potentially integrated on the motherboard
  – Can also be used for embedded applications
• Significantly accelerates certain libraries and applications
  – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL
  – Mathematical modeling tools: Mathematica®, MATLAB®, etc.
  – In-house code: using the SDK to port compute-intensive kernels
• ClearSpeed Advance™ board
  – Dual CSX600 coprocessors
  – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – PCI-X, PCI Express x8
  – Low power; typically 25-35 Watts
Plug-and-play Acceleration

• ClearSpeed host-side library CSXL
  – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions
  – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK
  – Executes calls heterogeneously across both the multi-core host and the ClearSpeed accelerators simultaneously for maximum performance
  – Compatible with ACML from AMD and MKL from Intel
• The user and application do not need to be aware of ClearSpeed
  – Except that the application suddenly runs faster

Programming considerations

• Is my main data type integer or floating-point?
• Is the data parallel in nature?
• What precision do I need?
• How much data needs to be local to the accelerated task?
• Does existing accelerator software meet my needs, or do I have to write my own?
• If I have to write my own code, will the existing tools meet my needs (for example: compiler, debugger, and simulator)?
ENVISION. ACCELERATE. ARRIVE.
Hardware Overview
CSX600: A chip designed for HPC

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz clock, the key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPU
  – Hence around one quarter of the chip is floating-point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~1 TB/s internal bandwidth
• 128 million transistors
• Approximately 10 Watts

[Photo: the ClearSpeed CSX600 chip.]
CSX600

• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
  – Each has multiple execution units
  – Including double-precision floating point and integer units

[Diagram: the CSX600 processor core. A mono controller with instruction and data caches, plus control and debug logic, drives PEs 0 through 95 via the poly controller; the PEs share a peripheral network and programmable I/O to DRAM, and both controllers attach to the system network.]
CSX600 processing element (PE)

[Diagram: the PE datapath (32/64-bit IEEE 754). Each PE contains an FP multiplier, FP adder, divide/square root unit, MAC, and ALU fed by a 128-byte register file; a 6 KB PE SRAM; PIO collection and distribution; and swazzle connections to PE n-1 and PE n+1.]

• Multiple execution units
  – 4-stage floating-point adder
  – 4-stage floating-point multiplier
  – Divide/square root unit
  – Fixed-point MAC: 16x16 → 32+64
  – Integer ALU with shifter
  – Load/store
• 5-port register file (3 reads, 2 writes)
• Closely coupled 6 KB SRAM for data
• High-bandwidth per-PE DMA (PIO)
• Per-PE address generators (serve as hardware gather-scatter)
• Fast inter-PE communication path
Advance accelerator memory hierarchy

Tier    Memory                               Size                             Aggregate bandwidth
Tier 3  Host DRAM                            1-32 GBytes typical              ~1 GB/s to the board
Tier 2  CSX DRAM (2 banks of 0.5 GBytes)     1.0 GBytes                       5.4 GB/s
Tier 1  Poly memory (6 KBytes per PE)        192 PEs x 6 KB = 1.1 MB          161 GB/s
Tier 0  Register memory (128 Bytes per PE)   192 PEs x 128 Bytes = 24 KB      725 GB/s

Per-PE arithmetic: 0.42 GFLOPS. Swazzle path: 322 GB/s aggregate. The Tier 2 to Tier 1 path works out to ~0.03 GB/s per PE.

Total: 80 GFLOPS and 1.1 TB/s of internal bandwidth… but only 25 Watts.
Acceleration by plug-in card

• Dual ClearSpeed CSX600 coprocessors
• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – The hardware also supports 32-bit floating-point and integer calculations
• 133 MHz PCI-X two-thirds-length (8″, 203 mm), full-height form factor (Advance X620)
• PCIe x8 half-length, full-height form factor (Advance e620)
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance

Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels.
Host to board DMA performance

• The board includes a host DMA controller which can act as a bus master.
• All DMA transfers are at least 8-byte aligned.
• The host DMA engine will attempt to use the full bandwidth of the bus.
• Note: measured bandwidth is highly system-dependent
  – Variations of up to 50% have been observed
  – Depends on system chipset, operating system, bus contention…

Type of slot       Peak bandwidth   Expected DMA speed
PCI Express x8     2,000 MB/s       Up to 1,300 MB/s
PCI-X 133 MHz      1,066 MB/s       Up to 750 MB/s
ENVISION. ACCELERATE. ARRIVE.
Installing Hardware and Software
Configuration support

Supported compilers
• For Linux: gcc, icc, fort, pgf
• For Windows XP, 2003: Visual C++ 2005

Supported host BLAS libraries
• AMD ACML
• Intel MKL
• Goto
• ATLAS

Advance supports the following host operating systems, on IA32 (x86) and AMD64/EM64T (x86-64):
• SuSE Linux Enterprise Server 9
• Red Hat Enterprise Linux 4
• Windows XP SP2
• Windows Server 2003 (preview)

For the latest support information go to http://support.clearspeed.com
Base software

• All ClearSpeed software on Linux is installed using the rpm command.
• The software consists of three parts:
  – Runtime and driver software
  – Diagnostics
  – ClearSpeed standard libraries, CSXL and CSFFT
• You can download the latest versions from the ClearSpeed support website: https://support.clearspeed.com/downloads
Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.
2. Install the runtime software, using the command:
   rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm
3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:
   /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx
4. Install the board diagnostics:
   rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm
5. Install the CSXL library package:
   rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm

Note: For Windows, a Jungo driver will need to be installed and configured; see the installation manual for more details.
Confirming successful installation

ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:
  – Some tests take several minutes to complete.
  – Each test writes Pass or Fail to standard output.
  – A log file, test.log, is written in the current directory.

1. Open a shell window and go to an appropriate directory: cd /tmp
2. Set up the ClearSpeed environment variables by typing: source /opt/clearspeed/csx600_m512_le/bin/bashrc
3. Run the diagnostic program by typing the command: /opt/clearspeed/csx600_m512_le/bin/run_tests.pl
csreset

• The csreset command reinitializes an Advance board and its processors.
• It must be run after start-up or reboot of the system or simulator.
• It is also a good idea to run csreset at the start of a batch job that calls the Advance board.
• The csreset command can take argument flags to provide a finer level of control. These include:
  -A  Reset all boards.
  -v  Verbose output; shows the details about each board.
  -h  Help; shows the full list of options.
If you have problems with software installation

• Make sure you are logged in as the super-user.
  – As root for Linux.
  – As administrator for Windows.
• If the configure or make install steps fail, check that you have the appropriate header files.
  – Check the preconfigured header files and, if necessary, obtain the appropriate configured header file.
• If the system cannot access the board but the driver is installed, make sure the board is seated well.
  – Try removing the board and reinstalling it.
ENVISION. ACCELERATE. ARRIVE.
Targeting ClearSpeed Advance: Exploiting Data Parallelism
Alternative approaches

Three main approaches to acceleration:

1. Use an application which is already ported
2. Plug and play
3. Custom port using the SDK
Using an application which is already ported

• Acceleration: simply insert ClearSpeed
• Latest list of ported applications: http://www.clearspeed.com/products/applicationsupport/
• Includes:
  – Amber
  – Mathematica
  – MATLAB
  – Star-P
Plug and play libraries: CSXL

• Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions
• Includes key functions from:
  – LAPACK
  – Level 3 BLAS
• As an example, BLAS is used by:
  – AMD ACML
  – Intel MKL
  – Full list at http://www.clearspeed.com/products/compatibility/
• The application is transparently accelerated
  – No modifications to the application
Acceleration using CSXL and standard libraries

[Diagram: the application's LAPACK/BLAS calls pass through a CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.) on the accelerator.]
Considerations for a custom port of an application

• Is the task large enough to consider acceleration?
  – It takes time to ship data to the accelerator
• The accelerator can work in parallel with the host
  – Overlap computation
• Performance considerations
  – Look for areas of data parallelism
  – Overlap compute with data I/O
  – Make full use of the ClearSpeed I/O paths
• Analysis starts with a model based on the memory tiers and can be verified using the performance profiling tools
Is this trip necessary? Considering I/O

• The time to move N bytes to or from another node or an accelerator is roughly latency + N/B seconds.
• Because local memory bandwidth is usually higher than B, the acceleration might be lost in the communication time.
• Estimate the break-even point for the task. (Note: offloading is different from accelerating, where the host continues working.)

[Diagram: a node and an accelerator, each with its own memory, connected by a link of bandwidth B; a speed-versus-time plot shows the break-even problem size at which the accelerator overtakes the node.]
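To make the latency + N/B model concrete, here is a small host-side C sketch that compares offload time against host time over a range of problem sizes. All the rates (latency, link bandwidth, host and accelerator FLOP rates, FLOPs per element) are illustrative assumptions, not measured ClearSpeed figures:

    #include <stdio.h>

    int main(void) {
        const double latency = 10e-6;   /* host-to-board latency, s (assumed)  */
        const double B       = 1.0e9;   /* link bandwidth, bytes/s (assumed)   */
        const double host    = 5.0e9;   /* host FLOP rate, FLOP/s (assumed)    */
        const double accel   = 50.0e9;  /* accelerator FLOP rate (assumed)     */
        long n;

        for (n = 1000; n <= 100000000L; n *= 10) {
            double flops  = 100.0 * n;                 /* ~100 FLOPs/element (assumed) */
            double t_off  = latency + 16.0 * n / B     /* 8 bytes in + 8 bytes out     */
                          + flops / accel;
            double t_host = flops / host;
            printf("n=%9ld  offload %.6f s  host %.6f s  -> %s\n",
                   n, t_off, t_host,
                   t_off < t_host ? "offload wins" : "host wins");
        }
        return 0;
    }

With these numbers the crossover lands near n = 5000; raising the FLOPs per element moves it lower, which is exactly why compute-intensive kernels offload well.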
Memory bandwidth dictates performance

• Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts
  – Applications residing in accelerator DRAM do not make use of the massive internal memory bandwidth
• GPUs face a very similar issue

[Diagram: a multicore x86 node with 17 GB/s node memory connects over PCI-X or PCIe (1 to 2 GB/s) to the accelerator, whose DRAM provides 5.4 GB/s while its local RAM provides 192 GB/s.]
Latency and bandwidth: Simple offload model

• The accelerator must be quite fast for this approach to have benefit
• This "mental picture" may stem from the early days of the Intel 80x87 and Motorola 6888x math coprocessors

[Diagram: a timeline in which the host stops, data crosses to the accelerator (latency plus bandwidth-limited transfer), the accelerator computes alone, and results cross back before the host resumes.]
Latency and bandwidth: Acceleration model

• The host continues working
  – The accelerator needs only be fast enough to make up for the time lost to bandwidth and latency
• Easiest use model
  – Host and accelerator share the same task, as in DGEMM
• More flexible
  – Host and accelerator each specialize in what they do

[Diagram: a timeline in which the host keeps computing while data moves to and from the accelerator, which computes in parallel.]
Accelerator need not wait for all data before starting

• The host can work while data is moved
  – PCI transfers might burden a single x86 core by up to 40%
  – Other cores on the host continue productive work at full speed
• The accelerator can work while data is moved
  – It can be slower than the host, and still add performance!
• In practice, latency is microseconds while an accelerator task takes seconds
  – The latency gaps in such a timeline would be microscopic if drawn to scale

[Diagram: a timeline with host work, data transfer, and accelerator work all overlapping.]
Performance considerations

• Look for data parallelism
  – Fine-grained: vector operations
  – Medium-grained: unrolled independent loops
  – Coarse-grained: multiple simultaneous data channels/sets
• Performance analysis for accelerator cards
  – Like analysis for message-passing parallelism, but with more levels of memory and communication
• Application porting success depends heavily on attention to memory bandwidths
  – (Surprisingly) not so much on the bandwidth between host and accelerator card
PCI Bus

• ClearSpeed boards utilize either the PCI-X or PCIe bus
  – PCI-X 133 MHz: 1 GB/s peak
  – PCIe x8: 1.6 GB/s peak
• Available memory on board
  – 1 GB of 200 MHz DDR2 SDRAM, shared by the two CSX600 processors
• Must consider both the transfer rate AND the available memory
  – If the application requires more memory, then more communication to the board is necessary
• Lower bound on time, even with an infinitely fast board:
  – Time = total data transferred / bus bandwidth
PCI Bus

• Driver performance is very machine-specific and depends on transfer size, direction, etc.
  – Transfer rate varies with transfer size
  – See the Runtime User's Guide for current driver performance
On-board Memory

• Two-level memory hierarchy
  – 1 GB "mono" shared memory
  – 6 KB "poly" memory per processing element (PE)
    • 6 KB/PE x 96 PEs = 576 KB per CSX600
• Peak bandwidth between levels
  – 2.7 GB/s x 2 chips = 5.4 GB/s
• Must consider both the transfer rate AND the available memory
  – If the application requires more memory, then more communication to the board is necessary
• Lower bound on time, even with infinitely fast PEs:
  – Time = total data transferred / mono-to-poly bandwidth
• Secondary considerations
  – Burst size: 64 bytes/PE (i.e., 8 doubles)
  – Transfers can be smaller, but at reduced efficiency
SIMD Computing

• What is SIMD?
  – Single Instruction, Multiple Data
    • Each PE sees the same instruction stream
    • Each PE issues "load", "multiply", etc., simultaneously
    • But acts on different data per PE
  – PARALLEL COMPUTATION
• ClearSpeed SIMD is enhanced by:
  – Local memory for each PE
    • data management is easier within "poly" memory
    • does not require adjacent access for all 96 elements involved in the computation from a shared memory pool
  – PEs can be enabled/disabled
    • not required to always use all PEs
    • useful for handling "boundaries" (see the sketch below)
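A tiny Cn-style sketch of the boundary case. get_penum and poly conditionals are introduced in the SDK section later in this deck; process_element and the remainder count are hypothetical, for illustration only:

    poly int me = get_penum();     // 0..95, a different value on each PE
    mono int remainder = 40;       // e.g. 40 leftover elements (illustrative)

    if (me < remainder) {
        process_element(me);       // hypothetical per-PE work function
    }                              // other PEs are disabled, not branched around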
SIMD Array

• 96 PEs per CSX600
  – 210 MHz
  – One double-precision multiply-accumulate per cycle per PE
  – 4-cycle pipeline depth for multiply and accumulate
• For top performance, use operations on 4-element vectors on each PE
• Nearest-neighbor communication
  – The "swazzle" path topology is a line or ring
  – Bandwidth: 8 bytes per cycle between register files
    • 8 x 96 x 210 MHz = 161 GB/s
  – Useful for fine-grained communication
Good Example Kernels

• Dense linear algebra
  – Matrix-matrix products (DGEMM)
    • Low memory bandwidth required = high data re-use
    • Inner kernel: matrix-multivector product (see the sketch below)
      – 96x96 matrix times 4 vectors
        » 96x96 matrix due to the 96 PEs
        » 4 vectors due to the multiply-accumulate pipeline depth
• Monte Carlo (computational finance)
  – "Embarrassingly parallel" task distribution
  – Very little data requirement
• Molecular dynamics (Amber, BUDE)
  – Large numbers of identical tasks can be found
  – Requires small working data sets
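As a reference for the inner-kernel shape named above, here is plain C for a 96x96 matrix times four vectors. This shows the mathematical shape only, not ClearSpeed's tuned implementation; on the CSX600 the 96 rows map onto the 96 PEs and the 4 vectors keep the 4-stage MAC pipeline full:

    /* C[96][4] += A[96][96] * B[96][4] */
    void mat_multivec(const double A[96][96],
                      const double B[96][4],
                      double C[96][4])
    {
        for (int i = 0; i < 96; i++)          /* one row per PE on the CSX600     */
            for (int k = 0; k < 96; k++)
                for (int j = 0; j < 4; j++)   /* 4 vectors fill the MAC pipeline  */
                    C[i][j] += A[i][k] * B[k][j];
    }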
Possible Kernels

• Partial differential equations
  – Some are memory-bandwidth limited, so not a good candidate for ClearSpeed acceleration
    • a small stencil implies little computation per grid point
    • a wide, sparse stencil implies a large active data set
  – But some PDE simulations are good candidates; those that:
    • require a small grid, so can run entirely in PE memory (computational finance)
    • have large, dense stencils
      – large amounts of computation per grid point
      – a sufficiently small active data set
    • use implicit time stepping
      – large systems of equations solved via direct methods
      – direct solvers utilize dense linear algebra kernels (i.e., DGEMM)
Keys to Success

• Parallelism is essential
• Proper management of the "poly" memory is also critical
  – The application must accept the memory bandwidth limits
    • PCIe or PCI-X
    • On-board memory hierarchy
  – The SDK enables asynchronous data transfers
    • permits efficient "double buffering" to manage data streams, accommodating the size limit
  – The application must employ a small working data set
    • less than 576 KB, distributed across 96 PEs
    • also be aware of the 1 GB shared memory limit
• While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board!
Remember the host processor

• Today's multi-core hosts are very useful for managing "other tasks" that are not accelerated by ClearSpeed
• Many applications can overlap these tasks with ClearSpeed-accelerated tasks
• Profile the host portion of your application as well, using any of a variety of tools
  – Use the ClearSpeed Visual Profiler for CSAPI utilization
General optimization techniques

• Latency hiding
  – Overlap compute with I/O
• Data reuse
  – On-chip swazzle path
• Maximize PE usage
  – Ensure all PEs are processing, not idle
Overlap data with compute

• Double-buffer
• There are many levels of data I/O and compute parallelism:
  – PE load/store overlaps PE compute
  – PE to board memory can also overlap
  – Board memory to host memory can also overlap
• Hence, if the task is compute bound:
  – Data takes "no time" to transfer
• If the task is I/O bound:
  – Compute takes "no time" to calculate
Data reuse

• Swazzle path
  – Left or right 64-bit transfer (8 bytes)
  – 8 bytes per cycle, so ~161 GB/s per CSX processor
  – Can be a complete loop or a linear chain
• Parallel with other data I/O
  – Register-to-register moves run in parallel with on-chip/off-chip transfers
    • Doesn't impinge on DRAM access
  – PE local memory to register transfers also proceed in parallel
    • Doesn't impinge on local memory access
Maximize PE usage

• Aim for 100% efficiency
• PEs use predicated execution
  – PEs are "disabled" rather than code being skipped
  – Minimize the effects: extract common code from conditionals
• The mono processor can branch
  – Skips blocks of code
Detail of I/O widths for performance analysis

Each accelerator board has (aggregate bandwidth for two CSX600 chips):
• 161 GB/s bandwidth, PE register file to PE memory (4 bytes per cycle per PE)
• 322 GB/s swazzle path bandwidth (8 bytes per cycle per PE)
• 968 GB/s bandwidth, PE register file to PE ALU (24 bytes per cycle per PE)
• 5.4 GB/s DRAM bandwidth (32 bytes per cycle)

[Diagram: the PE datapath (FP multiplier, FP adder, divide/square root, MAC, ALU, 128-byte register file, 6 KB PE SRAM, PIO collection and distribution), annotated with 161 GB/s between register file and PE SRAM, 322 GB/s swazzle links to PE n-1 and PE n+1, 968 GB/s into the execution units, and 5.4 GB/s to the 1 GB CSX DRAM.]
ENVISION. ACCELERATE. ARRIVE.
Software Development Kit
ClearSpeed SDK overview

• Cn compiler: C with extensions for SIMD control
• Assembler
• Linker
• Simulator
• Debugger
• Graphical profiler
• Libraries
• Documentation
• Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9)
Agenda

1. Introduction to Cn
2. Cn Libraries
3. Debugging Cn
4. CSAPI: Host / Board Communication
ENVISION. ACCELERATE. ARRIVE.
Introduction to Cn
Software Development

The CSX architecture is simpler to program:

• A single program handles both serial and parallel operations
• Architecture and compiler were co-designed
• Instruction and data caches
• Simple, regular 32-bit instruction set
• Large, flexible register file
• Fast thread context switching
• Built-in debug support
• Same development process as traditional architectures: compile, assemble, link
• Cn is a simple parallel extension of C
Cn — C with vector extensions for CSX

• New keywords: the mono and poly storage qualifiers
  – mono declares a serial (single) variable
  – poly declares a parallel (vector) variable
• Mono variables live in the 1 GB DRAM
• Poly variables live in the 6 KB SRAM of each PE
Cn differences from C

• New data type multiplicity modifiers:
  – mono: denotes a serial variable
    • resident in "mono" memory
    • mono is the default multiplicity
  – poly: denotes a parallel/vector variable
    • resident in "poly" memory local to each PE
  – The modifiers apply to pointers, doubly so:
    • mono int * poly foo;   (foo is a pointer in poly memory to an int in mono memory)
    • poly int * mono bar;   (bar is a pointer in mono memory to an int in poly memory)
    • int * poly * mono good_grief;   (as you would expect…)
• Pointer sizes:
  – mono int *: 4 bytes (32-bit addressable space, 512 MB)
  – poly int *: 2 bytes (16-bit addressable space, 6 KB)
Cn differences from C

• Execution context:
  – Alters branch/jump behavior
  – In mono context, jumps occur as in a traditional architecture
  – In poly context, PEs are enabled/disabled
    • if (penum > 32) {…} else {…} disables the false PEs on the true branch, then re-enables the false PEs and disables the other PEs for the false branch; both branches are executed
  – break, continue: the selected PEs are disabled until the end of scope on all PEs
  – return: the selected PEs are disabled until all PEs return, or the end of scope
Porting C to Cn (Example 1)

C code:

    int i, j;

    for (i = 0; i < 96; i++) {
        j = 2*i;
    }

Similar Cn code:

    poly int i, j;

    i = get_penum();   // i=0 on PE0, i=1 on PE1, etc.
    j = 2*i;           // j=0 on PE0, j=2 on PE1, etc.
Porting C to Cn (Example 2)

C code:

    int i;

    for (i = 0; i < N; i++) {
        …
    }

Similar Cn code:

    poly int me, i;
    mono int npes;

    me = get_penum();       // me=0 on PE0, me=1 on PE1, etc.
    npes = get_num_pes();   // npes = 96

    // i = 0, 96, 192, … on PE0; 1, 97, 193, … on PE1; etc.
    for (i = me; i < N; i += npes) {
        …
    }
Simple Cn example

    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96.
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a[24];
        poly double b[4] = {0., 0., 0., 0.};
        int i;

        while (n) {
            memcpym2p(a, A + 24*get_penum(), 24*sizeof(double));
            A += 24*96;
            for (i = 0; i < 24; i += 2) {   // stride 2: consumes (a[i], a[i+1]) pairs in bounds
                b[0] += a[i]*mat[0] + a[i+1]*mat[1];
                b[1] += a[i+1]*mat[0] + a[i]*mat[1];
                b[2] += a[i]*mat[2] - a[i+1]*mat[3];
                b[3] += a[i+1]*mat[2] - a[i]*mat[3];
            }
            n -= 24*96;
        }

        memcpyp2m(B + 4*get_penum(), b, 4*sizeof(double));
        return;
    }
ENVISION. ACCELERATE. ARRIVE.
Cn Libraries
Runtime libraries

• Cn supports the standard C runtime, including:
  – malloc
  – printf
  – sqrt
  – memcpy
• Cn extensions include:
  – sqrtp
  – memcpym2p / memcpyp2m
  – get_penum
  – swazzle
  – any / all
Asynchronous I/O

• For the most efficient use of limited PE memory, overlap data transfers between mono memory and poly memory:
  – async_memcpym2p / async_memcpyp2m
  – sem_sig / sem_wait
• For greatest efficiency, the async_memcpy routines bypass the data cache, so coherency must be maintained:
  – dcache_flush / dcache_flush_address
Asynchronous I/O example

    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96.
        poly unsigned short penum = get_penum();
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a_front[12], a_back[12];
        poly double b[4] = {0., 0., 0., 0.};
        int i;

        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
        A += 12*96;
        n -= 24*96;
        while (n) {
            async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(19);
            for (i = 0; i < 12; i += 2) {   // stride 2 keeps a_front[i+1] in bounds
                b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
                b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
                b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
                b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
            }
            n -= 12*96;
            async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(17);
            for (i = 0; i < 12; i += 2) {
                …   // compute on a_back, then finish outside the while loop
            }
        }
    }
ENVISION. ACCELERATE. ARRIVE.
Cn Pointers
Cn — mono and poly pointers

• Using mono and poly with pointers:

    mono int * mono mPmi;   // mono pointer to mono int
    poly int * mono mPpi;   // mono pointer to poly int
    mono int * poly pPmi;   // poly pointer to mono int
    poly int * poly pPpi;   // poly pointer to poly int

• Most commonly used is a mono pointer to poly:

    poly <type> * mono <variable_name>;
Cn — mono and poly pointers

• mono pointer to mono int: mono int * mono mPmi
  [Diagram: both the pointer and the int it references live in mono memory.]

• mono pointer to poly int: poly int * mono mPpi
  [Diagram: the pointer lives in mono memory and points to the same location in each PE's poly memory.]

• poly pointer to poly int: poly int * poly pPpi
  [Diagram: the pointer is stored at the same location in each PE's poly memory; each PE's copy can reference a different int in its own poly memory.]

• poly pointer to mono int: mono int * poly pPmi
  [Diagram: the pointer is stored at the same location in each PE's poly memory; each PE's copy can reference a different int in mono memory.]
ENVISION. ACCELERATE. ARRIVE.
Conditional Expressions
Conditional Expressions: mono-if

• Conditions based on mono expressions
  – The expression has the same value on all PEs
  – A code block is selected according to the expression, and a branch instruction is executed

    mono int i, j;

    i = j = 1;
    if (i == j) {
        // this block executed on all PEs
    } else {
        // this block branched over on all PEs
    }
Conditional Expressions: poly-if

• Conditions based on poly expressions
  – The expression may have different values on different PEs
  – But the SIMD model implies that all PEs execute the same instruction simultaneously
  – All branches are executed on all PEs, with each PE enabled only if the conditional expression is true (like predicated instructions)

    poly int i;

    i = get_penum();
    if (i < 48) {
        // PEs 0, 1, 2, … execute instructions
        // PEs 48, 49, … have instructions issued but ignored
    } else {
        // PEs 0, 1, 2, … have instructions issued but ignored
        // PEs 48, 49, … execute instructions
    }
Conditional Expressions: poly-while

• While loops based on poly expressions
  – The loop continues execution until the condition is false on all PEs
  – PEs are disabled one by one until the while condition is false on all PEs
  – count tracks the total number of iterations (95 in this case, set by the largest initial value of me)

    mono int count = 0;
    poly int me;

    me = get_penum();
    while (me > 0) {
        --me;
        ++count;
    }
Other variations between C and Cn

• Labeled break and continue statements
• No switch statement using poly variables (use multiple if statements)
• No goto statement in poly context
ENVISION. ACCELERATE. ARRIVE.
Moving Data
Data flow

1. Board and host communicate via a Linux kernel module or Windows driver; create a handle and establish the connection.
2. Register intent to use the first processor on the card; load the code onto the enabled processor.
3. Transfer data from host to board; semaphores synchronize transfers between host and board.
4. Run the code on the enabled processor; the host can continue with other work.
5. Send results back to the host; halt the board program and clean up.
Implicit broadcast from mono to poly

• Implicit broadcast from mono to poly by assignment
• Assigning poly to mono is not permitted

    mono int m = 7;
    poly int p;

    p = m;   // implicit broadcast to all PEs

    mono int m;
    poly int p = get_penum();

    m = p;   // NO! m would receive a different value from each PE
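Since a poly value cannot be assigned to a mono variable, one common pattern is to land each PE's value in mono memory with memcpyp2m (introduced on a later slide) and reduce serially on the mono side. A minimal sketch, assuming that adding a poly index to a mono base yields the per-PE destination pointer the copy routine expects:

    poly int me = get_penum();
    poly int p  = me * me;                   // per-PE value to collect (illustrative)
    mono int all[96];
    mono int sum = 0;
    mono int k;

    memcpyp2m(all + me, &p, sizeof(int));    // each PE writes its own mono slot
    for (k = 0; k < 96; k++)
        sum += all[k];                       // serial reduction in mono code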
Explicit data movement – mono to poly

memcpym2p(); async_memcpym2p()

• Memory copy of n bytes from mono to poly
  – The source is a poly pointer to mono memory, which can have a different value for each PE
  – The destination is a mono pointer to poly memory; that is, the destination address is the same for all PEs

[Diagram: source data in mono memory fans out to the same destination address on each of PE0 through PE95.]
Explicit data movement – poly to mono

memcpyp2m(); async_memcpyp2m()

• Memory copy of n bytes from poly to mono
  – The source is a mono pointer to poly memory; therefore the source address is the same for every PE
  – The destination is a poly pointer to mono memory, which can have a different value for each PE

[Diagram: the same source address on each of PE0 through PE95 fans in to destination data in mono memory.]
Explicit data movement – asynchronous

async_memcpym2p(); async_memcpyp2m()

• Asynchronous memory copy of n bytes from mono to poly, or from poly to mono
  – Computation continues during the data copy
  – The mono memory data cache is NOT flushed
  – Restrictions on the alignment of data
  – Use semaphores to wait for completion of the copy
  – Much higher bandwidth than the synchronous versions

    dcache_flush();
    async_memcpym2p(semaphore, …);
    // computation continues
    sem_wait(semaphore);
    // use the data that has been transferred from mono memory
Explicit data movement – swazzle

• Register-to-register transfer between neighboring PEs

[Diagram: PE n's register file connects to PE n-1 and PE n+1; each PE also shows its memory, ALU, enable stack, and status flags.]
Swazzle operations

• The assembly language versions operate directly on the register file
• The Cn versions operate on data and include implicit data movement from memory to registers
• Variants:
  – swazzle_up( poly int src );    // copy to the higher-numbered PE
  – swazzle_down( poly int src );  // copy to the lower-numbered PE
  – swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
  – swazzle_down_generic( … );
  – Similar swazzles operating on other data types
  – Functions to set the data copied into the ends of the swazzle chain
Data movement bandwidths per CSX600

• Mono memory to poly memory: 2.7 GB/s aggregate over 96 PEs
• Poly memory to registers: 840 MB/s per PE, 81 GB/s aggregate
• Swazzle path: 1680 MB/s per PE, 161 GB/s aggregate
• Total bandwidth for an Advance board (2 CSX600 processors): ~0.5 TB/s
DMA performance

• The Advance board has a host DMA controller which can act as a PCI bus master
• All DMA transfers are at least 8-byte aligned
• The host DMA engine will attempt to use the entire bus bandwidth

[Chart: ClearSpeed Advance DMA performance; transfer rate (MB/s, 0 to 1200) versus transfer size (2.0 to 7.8 MB) for e620 and X620 average read and write rates.]
ENVISION. ACCELERATE. ARRIVE.

CSAPI: Host-Board Communication
Host-Board interaction basics

• The basic model for interaction between the host and the card is very simple:
  – The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host.
  – The host pushes data to and pulls data from the board.
  – The host can also signal and receive semaphores.
Connecting to the board

• A host application needs to perform the following sequence to launch a process on the board (see the sketch below):
  – Create a CSAPI handle: CSAPI_new
  – Establish a connection with the board: CSAPI_connect
  – Register the host application with the driver: CSAPI_register_application
  – Load the CSX application on the desired chip: CSAPI_load
  – Run the CSX application on the desired chip: CSAPI_run
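Strung together as host-side C, the sequence looks roughly like this. The function names come from this slide, but the header name, handle type, and argument lists are assumptions; treat this as a sketch and see the CSX600 Runtime Software User Guide for the real prototypes:

    #include <csapi.h>                /* assumed header name */

    CSAPI *cs;                        /* handle type is an assumption    */
    cs = CSAPI_new();                 /* create a CSAPI handle           */
    CSAPI_connect(cs, 0);             /* connect to board 0 (assumed)    */
    CSAPI_register_application(cs);   /* register with the driver        */
    CSAPI_load(cs, 0, "kernel.csx");  /* load a CSX binary on chip 0     */
    CSAPI_run(cs, 0);                 /* start the board process running */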
Interacting with the board

• Get the board memory address of a known symbol: CSAPI_get_symbol_value
  – This must be done after the application is loaded, if the dynamic load capability is to be used
• Write/read data at a retrieved memory address: CSAPI_write_mono_memory, CSAPI_read_mono_memory
  – Asynchronous variants of these routines also exist
  – A process does not need to be running for these operations to succeed, but it does need to be loaded
  – These should not be performed DURING process termination
• Managing semaphores:
  – CSAPI_allocate_shared_semaphore declares a semaphore for use on both host and card
  – CSAPI_semaphore_wait, CSAPI_semaphore_signal
Cleaning up

• Process termination: CSAPI_wait_on_terminate, CSAPI_get_return_value
• Clean-up: CSAPI_delete (the sketch below continues the launch example)
• See the CSX600 Runtime Software User Guide for more details, including:
  – managing multiple processes on the board/chip at once
  – managing board control registers
  – board reset
  – managing multi-threaded CSX applications
  – board memory allocation
  – managing multiple boards/chips
  – error handling
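Continuing the launch sketch, the data-exchange and shutdown calls named above might be combined as follows; again, the exact signatures, the symbol name, and the semaphore are assumptions for illustration only:

    unsigned addr;
    double results[96];

    /* Find where the CSX program placed its output buffer
       (the symbol name "out_buf" is illustrative). */
    CSAPI_get_symbol_value(cs, "out_buf", &addr);
    CSAPI_semaphore_wait(cs, sem_done);     /* sem_done: assumed shared semaphore */
    CSAPI_read_mono_memory(cs, 0, addr, sizeof(results), results);

    CSAPI_wait_on_terminate(cs, 0);         /* let the board process finish */
    CSAPI_delete(cs);                       /* release the handle           */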
ENVISION. ACCELERATE. ARRIVE.
Debugging Cn
csgdb

• csgdb is a port of the open source gdb debugger
  – full symbolic debugging of mono/poly variables
  – full gdb breakpoint support
  – step through Cn or assembly
  – view mono and poly registers
  – view the PE enabled state
  – also accessible via DDD, which allows graphical data visualization
Debug control

• To enable debugging:
  – export CS_CSAPI_DEBUGGER=1 initializes the debug interface within the host application
  – export CS_CSAPI_DEBUGGER_ATTACH=1 makes the host application write a port number to stdout and wait for <Return/Enter> to be pressed, so that csgdb can be manually attached to the connected board process
• Launch the host application
  – This can be done with or without a debugger
• Launch csgdb in a new shell:
  – csgdb <csx_file_name> <port_number>
  – No need to "connect"; the host application did this already
  – Set desired breakpoints, then run
• Note that the host is currently blocked waiting for <Return/Enter>, so the card process may also be blocked waiting for the host
• Press return in the host shell for the host and card applications to proceed
csgdb Debugger (shown with the DDD front-end)

[Screenshot: DDD running csgdb, showing a real-time plot of PE memory contents; Cn source-level breakpoints, watchpoints, and single-stepping; disassembly with breakpoints and watchpoints; register contents; and on-chip poly array contents.]
csgdb command-line example

    % cscn foo.cn -g -o foo.csx
    % csgdb ./foo.csx
    (gdb) connect
    0x80000000 in __FRAME_BEGIN_MONO__ ()
    (gdb) break 109
    Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
    (gdb) run
    Starting program: /home/kris/my_app/foo.csx
    Breakpoint 1, main () at foo.cn:109
    (gdb) next
    110  y = MINY + (get_penum() * STEPY);
    (gdb) print y
    $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
      1, 1, 1, 1}
ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Visual Profiler: Explaining Performance
ClearSpeed Visual Profiler (csvprof)

• Host tracing
  – Trace CSAPI functions
  – The user can infer overlapping host/board utilization
  – Locate hot spots
• Board tracing
  – Trace board-side functions without instrumentation
  – Locate hot spots
• Board hardware utilization
  – Display activity of CSX functional units, including: ld/st, PIO, SIMD microcode, instruction cache, data cache, thread
  – Cycle accurate
  – View corresponding source
• Unified GUI
Detailed profiling is essential for accelerator tuning

[Diagram: host CPU(s) connected to an Advance accelerator board with two CSX600 pipelines, annotated with the four profiling views below.]

• HOST/BOARD INTERACTION: infer cause and effect; measure transfer bandwidth; check overlap of host and board compute.
• ACCELERATOR PIPE: view instruction issue; visualize overlap of executing instructions; get cycle-accurate timing; remove instruction-level performance bottlenecks.
• CSX600 SYSTEM: trace at system level; inspect overlap of compute and I/O; view cache utilization; graph performance.
• HOST CODE PROFILING: visually inspect multiple host threads; time specific code sections; check overlap of host threads.
csvprof: Host tracing

• Dynamic loading of the CSAPI trace implementation
• Triggered with an environment variable:
  – export CS_CSAPI_TRACE=1
  – (Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1)
• Specify the tracing format:
  – export CS_CSAPI_TRACE_CSVPROF=1
  – currently this is the only implementation, but in the future…
• Specify the output file for the trace:
  – export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst
  – default filename: csvprof_data.cst
• The output file is written during CSAPI_delete
csvprof: Host-Board interaction [screenshot]

csvprof: Host code profile – Linpack benchmark [screenshot]

csvprof: CSX600 system profile [screenshot]

csvprof: Accelerator pipeline profile [screenshot]

csvprof: Instruction pipeline stalls [screenshot]
csvprof: Advance board tracing

• Enabled using the debugger, csgdb
  – Can be used interactively or through a gdb script
• Can select events to profile, or all events
• Requires buffer allocation on the card
  – Today, this is done statically
  – One could use CSAPI to allocate the buffer, but the developer must get the location and size of the buffer to the user, to be entered into csgdb
  – Easy if running on only one chip: place the buffer in the other chip's memory
• Explicit dump to generate the trace file
  – Can control the type of data to be dumped
csvprof: Sample gdb script

    % cat ./csgdb_trace.gdb
    connect
    load ./foo.csx
    cstrace buffer 0x60000000 0x1000000
    cstrace event all on
    tbreak test_me
    continue
    cstrace enable
    continue
    cstrace dump foo.cst
    cstrace dump branch dgemm_test4_branch.cst
    quit

    % csgdb --command=./csgdb_trace.gdb
ENVISION. ACCELERATE. ARRIVE.
Tuning Tips
Pipelined arithmetic

• Four-stage floating-point pipeline
• Use vector types, vector intrinsic functions, and the vector math library for high efficiency

    __DVECTOR a, b, c;
    poly double x[N];

    a = *((__DVECTOR *)&x[0]);
    b = *((__DVECTOR *)&x[4]);
    c = cs_sqrt(__cs_vadd(a, b));
Poly conditionals

• When possible, remove common subexpressions from poly if-blocks to reduce the amount of replicated work (see the sketch below).
• It may even pay to compute and throw away results if doing so leads to fewer poly conditional blocks.
• A poly if-block uses predicated instructions, not a branch, so it is cheap as long as not many additional instructions are executed.
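A minimal Cn-style sketch of the idea (the variable names are illustrative). Hoisting the shared product means each predicated arm carries one add instead of a multiply and an add:

    poly float x, y, r;

    // Instead of writing x*y inside both arms of the conditional,
    // compute the common subexpression once, unconditionally:
    poly float t = x * y;

    if (x > 0.0f)
        r = t + 1.0f;   // predicated: only the add is replicated
    else
        r = t - 1.0f;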
Poly loop counters

• Loops with poly counters are more expensive than those with mono counters
• Use mono loop counters where possible (see the sketch below)
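A small sketch of the point: when the trip count is the same on every PE, keeping the counter mono makes the loop test a single scalar operation rather than per-PE predicated work:

    mono int k;             // scalar counter: one loop test covers all PEs
    poly double s = 0.0;

    for (k = 0; k < 24; k++) {
        s += 1.0;           // the poly body still executes on every PE
    }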
Arrays

• Pointer incrementing is more efficient than using array index notation
• Poly addresses require only 16 bits
• Use short for poly pointer increments (see the sketch below)
  – This avoids conversion from int to short
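A small sketch (names illustrative): since poly addresses are 16-bit, declaring the increment as short avoids an int-to-short conversion on each pointer bump:

    poly double buf[64];
    poly double * mono p = buf;   // pointer into the 16-bit poly address space
    short step = 8;               // short increment: no int-to-short conversion

    p += step;                    // bump the poly pointer by 8 doubles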
Data transfer

• The synchronous functions are completely general
  – flush the data cache on each transfer
  – memcpyp2m(), memcpym2p()
• The asynchronous functions maximize performance
  – do not flush the cache
  – have data size and alignment restrictions
  – require use of a wait semaphore
  – async_memcpyp2m(); sem_wait()  and  async_memcpym2p(); sem_wait()
• Large data blocks are more efficient than small blocks
  – Host to board, board to host, mono to poly, and poly to mono
ENVISION. ACCELERATE. ARRIVE.
Application Examples
Math function speed comparison

Typical speedup of ~8x over the fastest x86 processors, because the math functions stay in local memory on the card.

[Chart: 64-bit function operations per second (billions, 0 to 2.5) for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card.]
Nucleic Acid Builder (NAB)

• Newton-Raphson refinement is now possible; large DGEMM calls from computed second derivatives will be in AMBER 10
• A 2.5x speedup was obtained for this operation with three hours of programmer effort
• Enables accurate computation of entropy and Gibbs free energy for the first time
• AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once the symmetry of atom-atom interactions is exploited
AMBER molecular modeling with ClearSpeed

AMBER model   Host        Advance X620   Speedup
Gen Born 1    83.5 min    24.6 min       3.4×
Gen Born 2    84.6 min    23.5 min       3.6×
Gen Born 6    37.9 min    4.0 min        9.4×

[Chart: run time in minutes (0 to 100) for AMBER Generalized Born models 1, 2, and 6, host versus Advance X620.]
Monte Carlo methods exploit high local bandwidth

• Monte Carlo methods are ideal for ClearSpeed acceleration:
  – High regularity and locality of the algorithm
  – Very high compute-to-I/O ratio
  – Very good scalability to high degrees of parallelism
  – Needs 64-bit
• Excellent results for parallelization
  – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
  – Maintains the high precision required by the computations
    • True 64-bit IEEE 754 floating point throughout
  – 25 W per card typical when the card is computing
• ClearSpeed has a Monte Carlo example code, available in source form for evaluation
Monte Carlo applications scale very well

• No acceleration: 200M samples, 79 seconds
• 1 Advance board: 200M samples, 3.6 seconds
• 5 Advance boards: 200M samples, 0.7 seconds

[Chart: European option pricing model speedup (0 to 10) versus number of ClearSpeed Advance boards (0 to 5), showing near-linear scaling.]
Why do Monte Carlo applications need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But when you sum many similar values, you start to lose the significant digits.
• 64-bit summation is needed even to get a single-precision result! (The small program below demonstrates the effect.)

Single precision: 1.0000x10^8 + 1 = 1.0000x10^8
Double precision: 1.0000x10^8 + 1 = 1.00000001x10^8
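The effect is easy to reproduce in plain C; this small host-side program (a sketch, not ClearSpeed-specific) shows the single-precision sum absorbing the increment while the double-precision sum keeps it:

    #include <stdio.h>

    int main(void) {
        float  s32 = 1.0e8f;   /* 2^26 < 1e8 < 2^27, so the float ulp here is 8 */
        double s64 = 1.0e8;

        s32 += 1.0f;           /* lost: the result is still 1.0000e8   */
        s64 += 1.0;            /* kept: the result is 1.00000001e8     */

        printf("single: %.1f\n", s32);
        printf("double: %.1f\n", s64);
        return 0;
    }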
ENVISION. ACCELERATE. ARRIVE.
Help and Support
Installed documentation

• docs directory
  – CSXL user guide
  – Runtime user guide
  – csvprof Visual Profiler overview and examples
  – SDK: getting started, gdb manual, instruction set manual, Cn library manual, reference manual
  – release notes
• examples directory
ClearSpeed online

• General information, news, etc.: company website, www.clearspeed.com
• Report a problem, find answers, etc.: support website, support.clearspeed.com
• The support website has:
  – Documentation, user guides, reference manuals
  – Solutions knowledge base
  – Software downloads
  – Case logging
Join the ClearSpeed Developer Program!

• Designed to support the leading-edge community of developers using accelerators
• Membership is free and has the following benefits:
  – Access to the ClearSpeed Developer website
  – ClearSpeed Developer Community online forum
  – Invitations to participate in ClearSpeed Developer and User Community meetings and events
  – A repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community
  – Technical updates, tips, and tricks from the gurus at ClearSpeed and the Developer Community
  – And more, including opportunities to preview new software releases and developer discount programs
• Leverage the expertise of developers worldwide
• Ask a question, or share your knowledge
• Register now at developer.clearspeed.com!

Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com