End User Update: High-Performance Reconfigurable Computing
Tarek El-Ghazawi
Director, GW Institute for Massively Parallel Applications and Computing Technologies (IMPACT)
Co-Director, NSF Center for High-Performance Reconfigurable Computing (CHREC)
The George Washington University, hpcl.gwu.edu
Tarek El-Ghazawi, GWU HPC User Forum, Roanoke, April 21, 2008
Paul Muzio's Outline!

Performance
- What hardware accelerators are you using/evaluating?
- Describe the applications that you are porting to accelerators.
- What kinds of speed-ups are you seeing (provide the basis for the comparison)?
- How does it compare to scaling out (i.e., just using more x86 processors)?
- What are the bottlenecks to further performance improvements?

Economics
- Describe the programming effort required to make use of the accelerator.
- Amortization
- Compare accelerator cost to scaling-out cost
- Ease-of-use issues

Futures
- What is the future direction of hardware-based accelerators?
- Software futures?
- What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
Why Accelerators: A Historical Perspective

[Figure: performance vs. time. Vector machines gave way to massively parallel processors (1993, HPCC), and now to MPPs with multicores and heterogeneous accelerators. 2006 marked the end of Moore's-Law clock scaling; hopes are in architecture!]
Which Accelerators?
- We considered HPRCs more than anything else (to be addressed today)
- We are increasingly using GPUs
- Some Cell
High-Performance Reconfigurable Computing (HPRC)
IEEE Computer, March 2007
High-Performance Reconfigurable Computers are parallel computing systems that contain multiple microprocessors and multiple FPGAs. In current settings, the design uses FPGAs as coprocessors that are deployed to execute the small portion of the application that takes most of the time—under the 10-90 rule, the 10 percent of code that takes 90 percent of the execution time.
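The 10-90 rule maps directly onto Amdahl's law: if the FPGA only runs the hot 90 percent, the serial 10 percent bounds overall speedup. A minimal sketch with hypothetical numbers (not figures from the talk):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Under the 10-90 rule the FPGA runs the 90% hot portion; even an
# arbitrarily fast coprocessor caps the whole application near 10x.
print(round(amdahl_speedup(0.9, 100), 2))  # 9.17
print(round(amdahl_speedup(0.9, 1e9), 2))  # 10.0
```

This is why profiling to identify the true hot portion matters before committing a kernel to hardware.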
Evaluated FPGA-Accelerated Systems
- SRC-6
- SRC-6E
- Cray XD1
- HC-36
- SGI Altix-350
- SGI Altix-4700
An Architectural Classification for Hardware-Accelerated High-Performance Computers
El-Ghazawi et al., "The Performance Potential of HPRCs," IEEE Computer, February 2008
Uniform Nodes, Non-Uniform Systems (UNNS)

[Figure: homogeneous µP nodes (µP 1 … µP N) and homogeneous RP nodes (RP 1 … RP N), connected through an interconnection network (IN) and/or global shared memory (GSM).]

HPRC examples: SRC-6/7, SGI Altix/RC100 systems
Non-Uniform Nodes, Uniform Systems (NNUS)

[Figure: identical hybrid nodes, each pairing a µP with an RP, connected through an interconnection network (IN) and/or global shared memory (GSM).]

HPRC examples: Cray XD1, Cray XT5h
Applications and Performance

Cryptography, Remote Sensing and Bioinformatics
Multispectral / Hyperspectral Imagery Comparison

- Multispectral imagery: tens of bands (MODIS: 36 bands, SeaWiFS: 8 bands, IKONOS: 5 bands)
- Hyperspectral imagery: hundreds to thousands of bands (AVIRIS: 224 bands, AIRS: 2378 bands)
- Challenge: the curse of dimensionality
- Solution: dimension reduction
Hyperspectral Dimension Reduction
Hyperspectral Dimension Reduction (Techniques)

Principal Component Analysis (PCA):
- Most common method for dimension reduction
- Complex and global computations: difficult for parallel processing and hardware implementations

Wavelet-Based Dimension Reduction*:
- Simple and local operations, suited to high-performance implementation
- Multi-resolution wavelet decomposition of each pixel's 1-D spectral signature (preserves spectral locality)

* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, "Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis," IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April 2003, pp. 863-871.
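As a rough illustration only (the cited paper's algorithm also selects the reduction level automatically, which is omitted here), the core local operation is a Haar low-pass decomposition applied level by level to one pixel's spectral signature; the function name and parameters below are mine, not the authors':

```python
import numpy as np

def haar_reduce(signature, levels):
    """Sketch: apply the low-pass half of a Haar wavelet decomposition
    `levels` times to a 1-D spectral signature, halving the number of
    bands at each level."""
    sig = np.asarray(signature, dtype=float)
    for _ in range(levels):
        # Combine adjacent bands (scaled pairwise sums): a purely local
        # operation, unlike PCA's global covariance computation.
        sig = (sig[0::2] + sig[1::2]) / np.sqrt(2.0)
    return sig

pixel = np.random.rand(224)          # an AVIRIS-like 224-band signature
print(haar_reduce(pixel, 3).shape)   # (28,)
```

Because each output band depends only on a small neighborhood of input bands, the computation maps naturally onto an FPGA pipeline.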
Wavelet-Based Dimension Reduction (Execution Profiles on SRC)

- Total execution time = 20.21 sec (Pentium 4, 1.8 GHz)
- Total execution time = 1.67 sec (SRC-6E, P3)
  - Speedup = 12.08x (without streaming)
  - Speedup = 13.21x (with streaming)
- Total execution time = 0.84 sec (SRC-6)
  - Speedup = 24.06x (without streaming)
  - Speedup = 32.04x (with streaming)
Cloud Detection

[Figure: software/reference mask compared against hardware masks for Band 2 (green), Band 3 (red), Band 4 (near-IR), Band 5 (mid-IR), and Band 6 (thermal IR); the hardware floating-point and fixed-point masks use approximate normalization.]
Protein and DNA Matching: The Scoring Matrix

[Figure: dynamic-programming scoring matrix for aligning the example sequences GCTATTGG and GATACTTT, filled by the recurrence below.]

F(i,j) = max( F(i-1,j-1) + s(x_i, y_j),
              F(i-1,j) - gap_penalty,
              F(i,j-1) - gap_penalty,
              0 )
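The Smith-Waterman recurrence can be sketched directly; the scoring parameters below (match = 2, mismatch = -1, gap = 1) are illustrative assumptions, not necessarily the slide's values:

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Fill the local-alignment scoring matrix F and return the best score."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          F[i - 1][j] - gap,    # gap in y
                          F[i][j - 1] - gap)    # gap in x
            best = max(best, F[i][j])
    return best

print(smith_waterman("GGTT", "GGT"))  # 6 (best local alignment: GGT vs GGT)
```

Each cell depends only on its three neighbors, so the anti-diagonals of F can be computed in parallel, which is why this kernel maps so well onto FPGA systolic arrays.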
Savings of HPRC (Based on one Altix 4700 10U rack)

Application                        Speedup  Power Savings  Cost Savings  Size Reduction
IDEA Breaker                           961            86x            2x             28x
Smith-Waterman (DNA Sequencing)       8723           779x           22x            253x
DES Breaker                          38514          3439x           96x           1116x
RC5(32/12/16) Breaker                 6838           610x           17x            198x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 400
- Power factor µP : RP = 1 : 11.2 (one 10U rack: 1230 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 34.5 (cluster of 100 µPs = four 19-inch racks, footprint 6 square feet; reconfigurable computer (10U) footprint 2.07 square feet)
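Each savings figure in these tables is simply the measured speedup divided by the corresponding per-unit µP:RP factor. A quick sketch of the arithmetic, using the Altix 4700 IDEA Breaker numbers:

```python
def savings(speedup, factor):
    """Savings = measured speedup divided by the per-unit RP:µP factor."""
    return speedup / factor

# IDEA Breaker on the Altix 4700: speedup 961, with cost factor 400,
# power factor 11.2, and size factor 34.5 per RP relative to one µP.
print(round(savings(961, 400), 1))  # 2.4  -> ~2x cost savings
print(round(savings(961, 11.2)))    # 86   -> 86x power savings
print(round(savings(961, 34.5)))    # 28   -> 28x size reduction
```

The same division reproduces every row, which also shows why low-speedup kernels (e.g., cloud detection) can fall below 1x on cost or size.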
Savings of HPRC (Based on one Cray XD1 chassis)

Application                        Speedup  Power Savings  Cost Savings  Size Reduction
RC5(32/8/8) Breaker                   2321           116x           23x             24x
IDEA Breaker                          2402           120x           24x             25x
Hyperspectral Dimension Reduction      125             6x            1x              1x
Smith-Waterman (DNA Sequencing)       2794           140x           28x             29x
DES Breaker                          12162           608x          122x            127x
Cloud Detection                        110             5x            1x              1x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 100
- Power factor µP : RP = 1 : 20 (one XD1 chassis: 2200 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 95.8 (cluster of 100 µPs = four 19-inch racks, footprint 6 square feet; one XD1 chassis footprint 5.75 square feet)
Savings of HPRC (Based on SRC-6)

Application                        Speedup  Power Savings  Cost Savings  Size Reduction
RC5(32/12/16) Breaker                 1140           313x            6x             34x
Hyperspectral Dimension Reduction       32             9x         0.16x           0.96x
Smith-Waterman (DNA Sequencing)       1138           313x            6x             34x
DES Breaker                           6757          1856x           34x            203x
IDEA Breaker                           641           176x            3x             19x
Cloud Detection                         28             8x         0.14x           0.84x

Assumptions:
- 100% cluster efficiency
- Cost factor µP : RP = 1 : 200
- Power factor µP : RP = 1 : 3.64 (SRC-6 reconfigurable processor: 200 W; µP board with two µPs: 220 W)
- Size factor µP : RP = 1 : 33.3 (cluster of 100 µPs = four 19-inch racks, footprint 6 square feet; SRC MAPstation footprint 1 square foot)
Historical Perspective: Users, Tools, and Technology Evolution

- Circuit designers
  - Tools: schematics, RTL
  - Technology/applications: glue logic; logic fabric (180 nm)
- DSP and networking designers
  - Tools: IP core generators, HDLs
  - Technology/applications: custom computation; DSP slices and dual-port block RAM (130 nm)
- Embedded software engineers and embedded system designers
  - Tools: HW/SW codesign, embedded and DSP IDEs, HLLs
  - Technology/applications: PSoC; embedded processors and transceivers (90 nm)
- Domain scientists and RC-aware domain scientists
  - Tools: platform specifications and parallel SW languages, improved HLLs, new methodologies, programming models and IDEs
  - Technology/applications: in-socket accelerators (65 nm); HPRC
Productivity Analysis of Existing Tools

Tools considered: Impulse-C, Handel-C, Carte-C, Mitrion-C, SysGen, RC-Toolbox, HDLs

Metrics: utility (frequency, area); cost (acquisition time, learning time, development time)

Results excerpted from GWU papers in the SPL'07 and FPT'07 conferences.
Future Hardware Development

- More use of socket-based integration
- Better integration with the memory hierarchy
- Better accelerators on the FPGA side
  - More computationally oriented/floating-point cores?
  - Coarser-grain FPGAs?
- On-chip FPGAs and accelerators?
Parallelism Concepts: From Systems to Commodity Chips

- Early 1970s: first vector and SIMD systems (CDC STAR-100, TI ASC, ILLIAC IV)
- 1971-78: first MIMD system (CMU C.mmp, 16 PDP-11s)
- 1985: FPGA (Xilinx)
- 1996-1998: SIMD AltiVec (by Apple, IBM, and Motorola)
- 1998: HPRC (SRC)
- 2001: vector processor/SIMD Cell BE
- Multicore CPUs (IBM POWER4)
- GPGPUs (NVIDIA and AMD)
- Coming soon? Hybrid-reconfigurable chips? Accelerators as cores?
Future Software

- More user/application-centric programming
- Unified parallel programming interface?
- More efficient compiling
- Tools for accelerator-GPP application co-design
- Virtualization for ease of use and portability

HELP MAY BE COMING?
DARPA Studies
- DARPA is looking at bridging the productivity gap for FPGAs
- NSF CHREC schools (UF, GWU) and (BYU, VT) conducted a DARPA study
- DARPA has at least one more ongoing study
- Are we going to see any BAAs?
Conclusions

- Lots of common issues among accelerators
- For the applications that they can do well, they do really well!
- FPGAs were not built originally for computing:
  - Limited applications
  - Less-than-user-friendly interfaces
  - Very long compile times
- Programming languages expose a restrictive view of the system and are often hardware oriented; a single system-wide language paradigm is needed
- A major bottleneck is the data transfer rate between the microprocessor and the FPGA
- More work is needed on how to manage heterogeneity:
  - Virtualization for portability and ease of use
  - Advanced programming models based on parallel computing
  - New tools for performance tuning and debugging in heterogeneous environments
  - Better integration into the memory hierarchy
- The above requires fundamental work that is unlikely to be supported by vendors alone; it needs, for example, a DARPA-driven industry/university effort (like HPCS)