Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP
Presented by John K. Antonio
University of Oklahoma
Second Annual Review, September 23, 1999
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed
Outline
Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP
University of Oklahoma: John K. Antonio and Sudarshan K. Dhall

Applications
• SAR
• STAP
Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration

New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic "tuning" of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system

Schedule (Start Jun 97, Jun 98, Jun 99, Jun 00, End Dec 00)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies

Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed
Outline
Personnel (Program Management Status)
• John K. Antonio, Principal Investigator
  • Ph.D., Texas A&M University
  • Professor/Director of CS, University of Oklahoma
  • Over 70 publications in HPC and related areas
  • PI or co-PI of 17 contracts/grants totaling over $2.1M
• Sudarshan K. Dhall, Co-Principal Investigator
  • Ph.D., University of Illinois
  • Professor of CS, University of Oklahoma
  • Over 80 publications, 2 books, 3rd underway
  • PI or co-PI of grants and contracts totaling about $1M
• Jack West, Research Scholar
  Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator; STAP Implementation
• Jeff Muehring, Research Scholar
  Optimal GPP/DSP/FPGA Configuration Techniques for SAR; SAR Implementation
  Intern at IBM/Houston, 8/99 to 1/00
  Research Scholar at OU, 1/00 to 7/00
• Hongping Li, Research Assistant, Ph.D. Student
  Calibration of Power Prediction Simulator, System Interfacing, SAR Implementation
• Sirirut Vanichayobon, Research Assistant, Ph.D. Student
  FPGA-Based Linear Equation Solver for STAP, System Interfacing, STAP Implementation
• Seok-Hyun Ko, Research Assistant, M.S. Student
  Power Simulator Enhancements
• Tim Osmulski, Research Assistant, M.S. Student
  Power Prediction Simulator for FPGAs
  Graduated May 1998
• Nikhil Gupta, Research Assistant, M.S. Student
  Algorithms for STAP Weight Calculation; Mapping Inner Product Computations onto FPGAs
  Graduated August 1998
• Brian Veale, Research Assistant, M.S. Student
  Space and Power Study for High-Performance Integer and Floating Point Reconfigurable Architectures
  Graduated August 1999
Contacts, Partners, Vendors, and Other Communications (Program Management Status)
• DARPA: José Muñoz
• Rome Lab: Ralph Kohler
• MIT Lincoln Lab: David Martinez, Jim Ward
• MITRE: Richard Games
• Northrop Grumman: Marc Campbell
• Synplicity, Inc.: Madelyn Miller
• Xilinx: Jason Feinsmith
• Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
• ISI: Milissa Benincasa, David Coker
• Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms

Equipment Status (Program Management Status)
Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• MC/OS, Cross Assembler, Toolkit
• MPI-Pro for MC/OS
• 9U VME RACE Board
• 1 SHARC Daughtercard (2 CNs, 8MB/CN, 3 SHARCs/CN) = 6 SHARCs
• 3 SHARC Daughtercards (2 CNs, 16MB/CN, 3 SHARCs/CN) = 18 SHARCs
• 4 PowerPC Daughtercards (2 CNs, 16MB/CN, 1 PPC/CN) = 8 PPCs
• RIN-T Input Card
• ROUT-T Output Card
Annapolis Micro Systems
• 4 PCI WILDONE Cards (Xilinx 4028/4036)
• 4 PCI WILDFORCE Array Cards (5 Xilinx 4085s)
• Interfacing Cables
Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)
Schedule of Milestones (Program Management Status)
[Gantt chart. Time axis: June 1997, Mar. 1998, Dec. 1998, Sept. 1999, June 2000, Dec. 2000. Key: GPP/DSP Sub-System, FPGA Sub-System, and GPP/DSP/FPGA System tasks, each marked as Research/Design or Implement/Test.]
• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Design, Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator
Budget Summary (Program Management Status)
Category        Current Budget   Balance on 8/1/99   Projected Expenses 8/99-7/00   Projected Expenses 8/00-12/00
Personnel       246,223          108,635             154,024                        52,123
Fringes         72,117           36,051              27,712                         9,340
Consulting      40,000           37,000              0                              0
Expenses        9,781            6,261               10,000                         5,069
Travel          17,545           4,889               12,000                         7,372
Equipment       217,670          42,652              42,652                         0
Indirect Cost   181,262          90,632              87,317                         31,674
Total           784,598          326,120             333,705                        105,578
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed
Outline
Highlights from Year 1
• Optimal Configuration of Compute Nodes for SAR Processing
• Network Simulator
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
• FPGA Power Prediction Simulator
Optimal Configuration of Compute Nodes for SAR Processing
(Highlights from Year 1)
• Motivation and SAR Basics
• Parallelization of SAR Processing
• The Optimal Configuration Problem
  • Formulation
  • Numerical Results
• Conclusions
[Figure: Nominal UAV payload ("Predator") and the footprint of aerial side-looking SAR, showing targets, the azimuth and range directions, platform velocity v, range R, range swath Rs, and the footprint.]
[Figure: Offset overlapping beams (real azimuth resolution) compared with synthetic beams (compressed resolution).]
Optimal Configuration of Compute Nodes for SAR Processing
(Highlights from Year 1)
• Motivation and SAR Basics
• Parallelization of SAR Processing
• The Optimal Configuration Problem
  • Formulation
  • Numerical Results
• Conclusions
Parallelization of SAR Processing
[Figure: Range processing (shown across 3 range processors) on a pulse number by range samples array, followed by a distributed corner-turn and azimuth processing (shown across 4 azimuth processors); sections are labeled with Sa and Kr.]
where Sa is the azimuth section length and Kr is the range reference kernel size
Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.
Sectioned Convolution
[Figure: sectioned convolution, showing the kernel, the overlap and section portions of each FFT-sized block, and the discarded output samples.]
• Large Overlap/Section ratio ⇒ small azimuth memory, large number of azimuth processors
• Small Overlap/Section ratio ⇒ large azimuth memory, small number of azimuth processors
Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.
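The overlap/section tradeoff above can be illustrated with a minimal overlap-save sketch. This is not the cited application note's implementation; the function name, the NumPy dependency, and the choice of the smallest power-of-two FFT that holds one section plus the overlap are assumptions made for illustration.

```python
import numpy as np

def sectioned_convolution(signal, kernel, section_len):
    """Overlap-save fast convolution (illustrative sketch).

    Returns the 'valid' part of the linear convolution of signal and kernel.
    """
    overlap = len(kernel) - 1             # samples shared by adjacent blocks
    fft_size = 1
    while fft_size < section_len + overlap:
        fft_size *= 2                     # power-of-two FFT >= section + overlap
    step = fft_size - overlap             # new output samples per FFT block
    kernel_f = np.fft.fft(kernel, fft_size)
    n_out = len(signal) - overlap         # length of the valid convolution
    out = np.empty(n_out, dtype=complex)
    for start in range(0, n_out, step):
        block = signal[start:start + fft_size]
        block = np.pad(block, (0, fft_size - len(block)))   # zero-pad the last block
        y = np.fft.ifft(np.fft.fft(block) * kernel_f)
        good = y[overlap:]                # discard the wrapped-around samples
        m = min(step, n_out - start)
        out[start:start + m] = good[:m]
    return out
```

Each block spends `overlap` of its `fft_size` samples on data that is discarded, so a large kernel relative to the section means more FFT work per useful output sample, while a large section means a larger buffer per processor.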
System Parameters
• radar-dependent: R (range), Rs (range swath), and λ (wavelength)
• application-dependent: δ (desired resolution) and v (platform velocity)
• processor-dependent: αr and αa (non-fast-convolution range and azimuth loading) and γ (fast convolution throughput)
• software-dependent: Sa (azimuth convolution section length), Fa (azimuth FFT length), and Fr (range FFT length)
Derivations for Memory and Processor Requirements
[Equations: expressions for the range and azimuth processor requirements (Pr, Pa) and the range and azimuth memory requirements (Mr, Ma) in terms of the system parameters v, δ, R, Rs, λ, αr, αa, γ, Fr, Fa, and Sa; equation images not preserved.]
Optimal Configuration of Compute Nodes for SAR Processing
(Highlights from Year 1)
• Motivation and SAR Basics
• Parallelization of SAR Processing
• The Optimal Configuration Problem
  • Formulation
  • Numerical Results
• Conclusions
Optimal Configuration Formulation
• Objective: Determine configurations for the CNs, number of CNs of each configuration, and section size, to satisfy processor and memory requirements and minimize power consumption
• Notation and Definitions:
  • CN Configuration: Specifies the daughtercard type and number of range and azimuth CEs (per configured CN)
  • X, Y: The two possible CN configurations
  • XT, YT: Daughtercard type for each CN configuration
  • Xr, Yr: Number of range processors per CN (for each configuration)
  • Xa, Ya: Number of azimuth processors per CN (for each configuration)
  • NX, NY: Number of CNs of configurations X and Y
  • ΠCN(•): Power per CN as a function of daughtercard type
  • MCN(•): Memory per CN as a function of daughtercard type
  • PCN(•): Processors per CN as a function of daughtercard type
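Optimal Configuration Formulation
Minimize:
$$Z = N_X\,\Pi_{CN}(X_T) + N_Y\,\Pi_{CN}(Y_T)$$
Subject to:
$$P_r \le N_X X_r + N_Y Y_r$$
$$P_a(S_a) \le N_X X_a + N_Y Y_a$$
$$M_{CN}(X_T) \ge X_r \frac{M_r}{P_r} + X_a \frac{M_a(S_a)}{P_a(S_a)}$$
$$M_{CN}(Y_T) \ge Y_r \frac{M_r}{P_r} + Y_a \frac{M_a(S_a)}{P_a(S_a)}$$
$$X_r + X_a \le P_{CN}(X_T), \qquad Y_r + Y_a \le P_{CN}(Y_T)$$
$$F_a = 2^k \ge S_a + K_a, \qquad k = 1, 2, \ldots$$
$$N_X, N_Y, X_r, X_a, Y_r, Y_a \ge 0, \qquad S_a \ge 1$$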
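For a fixed section size, the formulation can be explored by exhaustive search. The sketch below is only an illustration under stated assumptions: the daughtercard table `CARDS` holds made-up power, memory, and processor values, the requirements P_r, P_a, M_r, and M_a are passed in as already-computed numbers, and the helper names are hypothetical. It is not the mathematical programming method used in the study.

```python
from itertools import product
from math import ceil

# Hypothetical daughtercard table (illustrative values only):
# type -> (power per CN in W, memory per CN in MB, processors per CN)
CARDS = {1: (12.0, 16.0, 3), 2: (9.0, 16.0, 1)}

def cn_needed(deficit, per_cn):
    """Smallest number of CNs covering `deficit` processors, or None if impossible."""
    if deficit <= 0:
        return 0
    return ceil(deficit / per_cn) if per_cn > 0 else None

def best_configuration(P_r, P_a, M_r, M_a, max_cns=32):
    """Search CN configurations X and Y for minimum total power Z."""
    mem_r, mem_a = M_r / P_r, M_a / P_a    # memory per range / azimuth processor
    # Feasible (type, range CEs, azimuth CEs) triples: processor and memory limits per CN.
    configs = [(t, r, a)
               for t, (_, mem, procs) in CARDS.items()
               for r in range(procs + 1)
               for a in range(procs + 1 - r)
               if r + a > 0 and r * mem_r + a * mem_a <= mem]
    best = None
    for (xt, xr, xa), (yt, yr, ya) in product(configs, repeat=2):
        for n_x in range(max_cns + 1):
            need_r = cn_needed(P_r - n_x * xr, yr)
            need_a = cn_needed(P_a - n_x * xa, ya)
            if need_r is None or need_a is None:
                continue                    # configuration Y cannot cover the shortfall
            n_y = max(need_r, need_a)
            z = n_x * CARDS[xt][0] + n_y * CARDS[yt][0]
            if best is None or z < best["Z"]:
                best = {"Z": z, "X": (xt, xr, xa), "N_X": n_x,
                        "Y": (yt, yr, ya), "N_Y": n_y}
    return best
```

Calling `best_configuration(4, 6, 8.0, 24.0)` searches for the cheapest mix of the two hypothetical card types that supplies 4 range and 6 azimuth processors. An outer loop over candidate values of S_a would extend the search to the section size.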
Optimal Configuration of Compute Nodes for SAR Processing
(Highlights from Year 1)
• Motivation and SAR Basics
• Parallelization of SAR Processing
• The Optimal Configuration Problem
  • Formulation
  • Numerical Results
• Conclusions
[Figure panels, each plotted over resolution (0.5 to 2) and velocity (50 to 400): Minimum Power; Azimuth FFT Size; Optimal Azimuth Section Size; Optimal Ratio of Kernel Size to Section Size; Percentage of Power Usage by Card Type 1; Optimal CN Configurations.]
1 1 22 1 11 1 2 1 2 1
XT  Xr  Xa  YT  Yr  Ya
1   1   2   2   0   1
1   2   1   2   0   2
1   3   0   2   0   2
1   3   0   2   1   1
2   0   2   2   1   1
1   1   2   2   1   1
2   1   1   2   2   0
1   1   2   2   0   2
Optimal Configuration of Compute Nodes for SAR Processing
(Highlights from Year 1)
• Motivation and SAR Basics
• Parallelization of SAR Processing
• The Optimal Configuration Problem
  • Formulation
  • Numerical Results
• Conclusions
Conclusions
• A method for optimally configuring CN-based parallel systems for SAR processing was introduced.
• The method provides detailed HW and SW design and implementation information about how to best utilize system resources for given values of application parameters.
• The numerical studies show that the optimal ratio of daughtercard types can be relatively constant over regions of the application parameter space.
• For a fixed hardware configuration, the CNs can be re-configured (via software re-configuration) to achieve optimal power consumption over specified regions.
Highlights from Year 1
• Optimal Configuration of Compute Nodes for SAR Processing
• Network Simulator
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
• FPGA Power Prediction Simulator
Network Simulator (Highlights from Year 1)
• Parallel STAP: The Motivation behind the Network Simulator
• Overview of the Network Simulator
• Numerical Studies
• Conclusions
[Figure: STAP processing flow. Stages: input data, pulse compression, Doppler filtering, QR decomposition (using steering vectors to produce weights), and beamforming to beam outputs, with rotations of the data cube (channels, pulses, range) between stages.]
STAP Processing Flow
1. Partition STAP data cube over a 2-D process set.
2. Process the contiguous dimension.
3. Re-partition the data cube before processing the next dimension.
4. Rotate the newly distributed data to make the next dimension sequential in memory.
5. Repeat steps 1 through 4 before each processing phase.
Sub-Cube Bar Partitioning Methodology
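The re-partitioning step can be made concrete by counting how many data-cube elements each process must send to each other process when the layout changes from range-whole (pulses split across the columns of the process set) to pulse-whole (range split across the columns). The sketch below is an illustration only; the function names and the even-split rule are assumptions, not the simulator's code.

```python
import numpy as np

def split(n, parts):
    """Index ranges dividing n items over `parts` processes as evenly as possible."""
    edges = [i * n // parts for i in range(parts + 1)]
    return [(edges[i], edges[i + 1]) for i in range(parts)]

def transfer_counts(channels, pulses, ranges, rows, cols):
    """counts[src, dst] = elements moved from process src to process dst.

    Channels are split over the rows of the process set in both layouts, so
    data only moves between processes in the same row.
    """
    ch = split(channels, rows)
    pu = split(pulses, cols)     # pulse slabs owned before re-partitioning
    ra = split(ranges, cols)     # range slabs owned after re-partitioning
    counts = np.zeros((rows * cols, rows * cols), dtype=int)
    for r in range(rows):
        n_ch = ch[r][1] - ch[r][0]
        for src in range(cols):
            n_pu = pu[src][1] - pu[src][0]
            for dst in range(cols):
                n_ra = ra[dst][1] - ra[dst][0]
                counts[r * cols + src, r * cols + dst] = n_ch * n_pu * n_ra
    return counts
```

For a 3 x 4 process set and a cube with 16 channels, 22 pulses, and 200 range samples, `transfer_counts(16, 22, 200, 3, 4)` gives a 12 x 12 matrix whose diagonal entries are the elements that stay local.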
[Figure: Pulse compression partitioning with the range dimension whole. The channels by pulses face of the data cube is divided over a 3 x 4 process set (processes 1 to 12).]
[Figure: Doppler filtering partitioning with the pulses dimension whole. The channels by range face of the data cube is divided over the same 3 x 4 process set.]
STAP Data Cube Partitioning Examples
• Re-partitioning involves exchanging data with the next whole dimension.
• Interprocessor communication is required between processors in the same row.
[Figure: the 3 x 4 process set with the range dimension contiguous and with the pulse dimension contiguous.]
STAP Data Cube Repartitioning: Required Data Transfers
[Figure: Network interconnection configuration. Processes 1 to 12 are placed on CNs attached to 6-port crossbars, with interprocessor communication (IPC) between the pulse compression layout and the Doppler filtering layout.]
STAP Data Cube Repartitioning: Data Re-distribution Mapping
Network Simulator (Highlights from Year 1)
• Parallel STAP: The Motivation behind the Network Simulator
• Overview of the Network Simulator
• Numerical Studies
• Conclusions
Some RACE Network Features
1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.
2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)
3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
[Figure: fat-tree of 6-port crossbars with CNs at the leaves, showing the message path from a message source CN to a message destination CN.]
RACE Network Interconnect Fat-Tree Topology
Standard Crossbar Priority Arbitration Algorithm Table
Transaction status column groups, left to right: Active; Not Yet Active, Port E Involved; Not Yet Active, Port E Not Involved

Hardware   Active                      Port E Involved             Port E Not Involved
Priority   Entry Port    Exit Port     Entry Port    Exit Port     Entry Port    Exit Port
7          F             A,B,C,D,E     F             A,B,C,D,E     F             A,B,C,D
6          E             F             E             F             A,B,C,D*      A,B,C,D*
5          A,B,C,D       F             A,B,C,D       F             A,B,C,D       F
4          E             A,B,C,D       E             A,B,C,D       -             -
3          *A,B,C,D      *A,B,C,D,E    A,B,C,D*      A,B,C,D*      -             -
2          -             -             A,B,C,D       E             -             -
1          -             -             -             -             -             -
* - Peer Kill Rules Apply
Network Class Details
[Figure: crossbars connected by links, with compute nodes attached.]
• Random Scan: generates pseudo-random CN scan ordering
• Clock: based on network clock frequency (factor of 5); data transfer rate equates to effective network bandwidth
• Network Methods:
  - Dynamic network construction
  - Dynamic routing table creation
  - Dynamic CN and CE message traffic generation
  - Simulates packet traffic

Crossbar Class Details
• Crossbar: two parent port connections, four child port connections, internal switch connections, four CN connections for terminal crossbars
• Link: connects crossbar objects; link status is occupied or free
• Crossbar Methods:
  - Implements hardware priority arbitration (top-level algorithm, standard algorithm)
  - Query port status
  - Routes packets to next location
  - Allocates and frees internal port connections and connected link objects
  - Transmits packet data

Compute Node Class Details
• Compute Node: processor information, outgoing and received message queues, outgoing and received packet stack
• Packets are self-routing
• Compute Node Methods:
  - Manages outgoing and received message queues
  - Manages outgoing and received packet stack
  - Explodes the top outgoing message into packets of size 2048 or less
  - Handles DMA chaining of packets
  - Establishes path through network and transmits packet data
[Figure: the outgoing message queue (Message 1, Message 2, Message 3, ...) is exploded into the packet stack (Packet 1, Packet 2, Packet 3, Packet 4, ...).]
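The "explode" step of the compute node, which splits the top outgoing message into packets of at most 2048 bytes, can be sketched as follows. The function name and the representation of a packet as a plain byte count are assumptions for illustration, not the simulator's data structures.

```python
def explode(message_bytes, max_packet=2048):
    """Split a message of `message_bytes` bytes into a list of packet sizes."""
    return [min(max_packet, message_bytes - offset)
            for offset in range(0, message_bytes, max_packet)]
```

For example, a 5000-byte message becomes two full 2048-byte packets and one 904-byte packet.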
Simulator UML Sequence Diagram
[Figure: UML sequence diagram with participants User (actor), Data Cube, Process Set, Network, Crossbar, CN, and Clock. Labels on the diagram: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic; Phase 1 DMA:Y; Build Messages; X, Y, Mapping Matrices; Message Matrices; Connection/Data Transfer; Increment Simulation Clock (iterative process); Clean Up; Messages, Time; Simulation Time = 2 ms; Pass 1 and Pass 2.]
Packet UML Statechart (Simulation Pass 1 and Pass 2)
[Figure: packet states in the simulation pass subsystem: Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, Completed.]
Network Simulator (Highlights from Year 1)
• Parallel STAP: The Motivation behind the Network Simulator
• Overview of the Network Simulator
• Numerical Studies
• Conclusions
Process Set Performance Metric
[Figure: Communication Phase 1. Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F): count vs. time (0.5 to 2 ms) for process sets 12x3, 9x4, 6x6, and 4x9.]
[Figure: Communication Phase 2. Process Set - Phase 2 (CN:12, R:200, P:22, C:16, Routing:F): count vs. time (3 to 6 ms) for process sets 12x3, 9x4, 6x6, and 4x9.]
Message Traffic Performance Metric
[Figure: Communication Phase 1. Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF): count vs. time (2 to 2.5 ms) for CN traffic and CE traffic.]
[Figure: Communication Phase 2. Message Traffic - Phase 2 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF): count vs. time (10 to 25 ms) for CN traffic and CE traffic.]
DMA Chaining Performance Metric
[Figure: Communication Phase 1. DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F): count vs. time (14 to 22 ms) with chaining and with no chaining.]
[Figure: Communication Phase 2. DMA Chaining - Phase 2 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F): count vs. time (21 to 27 ms) with chaining and with no chaining.]
Network Simulator (Highlights from Year 1)
• Parallel STAP: The Motivation behind the Network Simulator
• Overview of the Network Simulator
• Numerical Studies
• Conclusions
Conclusions
1. Designed and implemented a platform-independent simulator.
2. The simulator demonstrates that the process set, the CN or CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.
3. Allows users to experiment with possible current and future configurations.
4. The communication pattern is implemented for STAP but may be used for other applications with phased communication patterns.
Highlights from Year 1
• Optimal Configuration of Compute Nodes for SAR Processing
• Network Simulator
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
• FPGA Power Prediction Simulator
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Highlights from Year 1)
• Overview of STAP Weight Calculation
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
• Conclusions
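Typical STAP Processing Flow
[Figure: stages are input data, pulse compression, Doppler filter, weight computation (using the steering vector and the covariance matrix), weight application, threshold detection, and target decision, with data cubes (pulses by range, then Doppler by range) between stages. Percentage labels on the figure: 8%, 91.5%, 0.5%.]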
Space-Time Adaptive Processing
• Effective partially adaptive STAP technique
• The architecture consists of
• Doppler processing across all pulse repetition intervals
• Adaptive filtering across
  • all channels and
  • K adjacent Doppler bins
Kth-Order Doppler Factored STAP
$$x(k,r):\ \hat{N} \times 1 = 3N \times 1$$
$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r)$$
Kth-Order Doppler Factored STAP
[Figure: slice of the data cube with N channels, Doppler bins (k - 1), k, and (k + 1), and the bth range segment (with Lr cells).]
Data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth range segment using Kth-Order Doppler Factored STAP with K = 3
Matrix-Based Derivation of ψ(k,b)
$$X(k,b):\ \hat{N} \times L_r = 3N \times L_r$$
$$\psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) = \frac{1}{L_r} X(k,b)\, X^H(k,b)$$
The Weight Equation:
$$\psi(k,b)\, w(k,b) = s$$
STAP Weight Calculation
Using QR-decomposition Method to solve the Weight Equation: ψw = s
$$\psi(k,b)\, w(k,b) = s$$
$$\frac{1}{L_r} X(k,b)\, X^H(k,b)\, w(k,b) = s$$
Take QR decomposition: $X^T(k,b) = QR$
$$\frac{1}{L_r} R^T Q^T Q^* R^*\, w(k,b) = s$$
$$\frac{1}{L_r} R^T R^*\, w(k,b) = s$$
Note that $R^T = [\, R_1^T \;\; 0 \,]$
$$R_1^T R_1^*\, w(k,b) = L_r s$$
Let $R_1^* w(k,b) = p$
Solve $R_1^T p = L_r s$ for $p$ using forward elimination
Solve $R_1^* w(k,b) = p$ for $w(k,b)$ using backward substitution
STAP Weight Calculation
Using Conjugate Gradient Method to solve the Weight Equation: ψw = s
Initialization
Choose $w_0$, set $g_0 = \psi w_0 - s$, $d_0 = -g_0$
Iteration
$$w_{i+1} = w_i - \frac{g_i^T d_i}{d_i^T \psi d_i}\, d_i$$
$$g_{i+1} = \psi w_{i+1} - s$$
$$d_{i+1} = -g_{i+1} + \frac{g_{i+1}^T \psi d_i}{d_i^T \psi d_i}\, d_i$$
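The two weight solvers can be compared on a small synthetic problem. The sketch below is an illustration under stated assumptions: the data are real-valued (so transposes stand in for conjugate transposes), the function names are hypothetical, and `np.linalg.solve` is used for the two triangular systems instead of dedicated forward elimination and backward substitution routines.

```python
import numpy as np

def qr_solve(X, s):
    """Solve (1/L) X X^T w = s via the QR decomposition of X^T (real data assumed)."""
    L = X.shape[1]
    _, R1 = np.linalg.qr(X.T)          # reduced QR: R1 is square upper triangular
    p = np.linalg.solve(R1.T, L * s)   # lower-triangular system R1^T p = L s
    return np.linalg.solve(R1, p)      # upper-triangular system R1 w = p

def cg_solve(psi, s, tol=1e-10, max_iter=1000):
    """Conjugate gradient iteration for psi w = s (psi symmetric positive definite)."""
    w = np.zeros_like(s)
    g = psi @ w - s                    # gradient (residual with sign flipped)
    d = -g                             # initial search direction
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol * np.linalg.norm(s):
            break
        psi_d = psi @ d
        denom = d @ psi_d
        w = w - (g @ d) / denom * d
        g = psi @ w - s
        d = -g + (g @ psi_d) / denom * d
    return w
```

Loosening `tol` stops the conjugate gradient iteration earlier, trading accuracy for fewer floating-point operations, whereas the QR route does a fixed amount of work.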
Numerical Studies
[Figure: flop count (10^8 to 10^10) vs. tolerance (10^-8 to 10^-1) for the CG and QR weight solvers, shown for Lr = 125 and Lr = 250.]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Highlights from Year 1)
• Overview of STAP Weight Calculation
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
• Conclusions
FPGA Inner Product Co-Processor: Design 1
[Figure: host processor connected through the interconnection bus to the FPGA board. Buffers feed operands a and b to a multiplier, followed by a 1's complement/register stage (sign of a, sign of b), a normalizing unit, and an adder with an output register; datapath is sign + 16-bit mantissa.]
• Multiply-Accumulate Pipe
• Reads two block floating point operands per cycle
• Performs two operations per cycle
• Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums

FPGA Inner Product Co-Processor: Design 2
[Figure: host processor connected through the interconnection bus to the FPGA board. Buffers supply data for the first multiplier and data for the second multiplier; the two products pass through a 1's complement/register stage (sign a, sign b) into an adder; datapath is sign + 16-bit mantissa.]
• Multiply-Add Reduction Pipe
• Reads four operands per cycle
• Performs three operations per cycle
• No normalization required
• 2 N-vectors reduced to N/2 partial sums
• Basic Tradeoff: First design has lower throughput, but can perform more work
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Highlights from Year 1)
• Overview of STAP Weight Calculation
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
• Conclusions
Two Orders of Magnitude Experiment
[Figures: data histogram (values 0 to 110), exponent histogram (exponents 119 to 145), accuracy histogram for Design 1 (accuracy 0.999893 to 0.999927), and accuracy histogram for Design 2 (accuracy 0.99399 to 0.99999); vertical axes show frequency.]
Five Orders of Magnitude Experiment
[Figures: data value histogram (values 0 to 103008), exponent histogram (exponents 119 to 143), accuracy histogram for Design 1 (accuracy 0.999912 to 0.999998), and accuracy histogram for Design 2 (accuracy 0.00000 to 0.99998); vertical axes show frequency.]
“Outlier” Experiment
[Figures: data value histogram (values 0.0009 to 1000.0000, with the outlier marked), exponent histogram (exponents 114 to 138), accuracy histogram for Design 1 (accuracy 0.593067 to 0.78369), and accuracy histogram for Design 2 (accuracy 0.00 to 0.92); vertical axes show frequency.]
FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
(Highlights from Year 1)
• Overview of STAP Weight Calculation
• Two FPGA Inner-Product Circuit Designs
• Numerical Accuracy Studies
• Conclusions
Conclusions
• CG weight solver provides tradeoff between accuracy and required FLOPs (compared to QR weight solver)
• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2
• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges
• Block floating point accuracy breaks down when there are a few large outliers in the data set
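The last two conclusions can be reproduced with a small software model of block floating point. This is a hedged sketch, not the FPGA designs: it assumes one shared exponent per vector, a 16-bit mantissa with rounding to nearest, exact accumulation of the integer products, and an accuracy measure defined as the ratio of the block floating point result to the double-precision result. All names and the test data are illustrative.

```python
import numpy as np

def to_block_fp(values, mantissa_bits=16):
    """Quantize a vector to integer mantissas that share one exponent."""
    max_abs = np.max(np.abs(values))
    exponent = int(np.floor(np.log2(max_abs))) + 1 if max_abs > 0 else 0
    scale = 2.0 ** (mantissa_bits - exponent)   # largest value maps near 2**mantissa_bits
    return np.round(values * scale), scale

def block_fp_inner_product(a, b, mantissa_bits=16):
    """Inner product computed on the quantized mantissas, then rescaled."""
    ma, sa = to_block_fp(a, mantissa_bits)
    mb, sb = to_block_fp(b, mantissa_bits)
    return float(ma @ mb) / (sa * sb)

def accuracy(a, b):
    """Ratio of the block floating point result to the double-precision result."""
    return block_fp_inner_product(a, b) / float(a @ b)

rng = np.random.default_rng(0)
a = rng.uniform(1.0, 100.0, 256)       # data spanning two orders of magnitude
b = rng.uniform(1.0, 100.0, 256)
acc_uniform = accuracy(a, b)

a_out, b_out = a.copy(), b.copy()
a_out[0] = 1.0e7                       # one large outlier sets the shared exponent
b_out[0] = 0.0                         # paired with zero so it does not dominate the exact result
acc_outlier = accuracy(a_out, b_out)
```

With the outlier present, the shared exponent is so large that every other element of `a_out` rounds to a zero mantissa, so the quantized inner product loses essentially all of the information carried by the remaining data.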
Highlights from Year 1
• Optimal Configuration of Compute Nodes for SAR Processing
• Network Simulator
• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers
• FPGA Power Prediction Simulator
FPGA Power Prediction Simulator
(Highlights from Year 1)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Conclusions and Demo
Power Dissipation in CMOS
• Leakage Current
• Transient Current
• Dynamic Capacitance Charging Current
  • Most important for CMOS
  • Dependent on clock frequency
  • Dependent on signal activity
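Power Equations
Equivalent model of a transistor’s gate...
$$v_c(t) = V\left(1 - e^{-t/RC}\right)$$
$$v_R(t) = V e^{-t/RC}$$
$$p_R(t) = \frac{V^2 e^{-2t/RC}}{R}$$
$$p_{avg} = \frac{1}{\tau}\int_0^{\tau} \frac{V^2 e^{-2t/RC}}{R}\, dt = \frac{CV^2}{2\tau}\int_0^{\tau} \frac{2}{RC}\, e^{-2t/RC}\, dt$$
$$p_{avg} = \frac{CV^2}{2\tau}\left(-e^{-2t/RC}\right)\Big|_0^{\tau} \approx \frac{CV^2}{2\tau}$$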
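The final approximation holds whenever the charging interval τ is long compared with the time constant RC. The short check below evaluates both forms; the component values are hypothetical and chosen only so that τ corresponds to one period of a 40 MHz clock.

```python
import math

def avg_power_exact(C, V, R, tau):
    """Average power in R over [0, tau]: (C V^2 / (2 tau)) (1 - exp(-2 tau / (R C)))."""
    return C * V**2 / (2.0 * tau) * (1.0 - math.exp(-2.0 * tau / (R * C)))

def avg_power_approx(C, V, tau):
    """Approximation C V^2 / (2 tau), valid when tau is much larger than R C."""
    return C * V**2 / (2.0 * tau)

# Hypothetical values: R C = 1 ns, tau = 25 ns.
R, C, V, tau = 1.0e3, 1.0e-12, 5.0, 25.0e-9
exact = avg_power_exact(C, V, R, tau)
approx = avg_power_approx(C, V, tau)
```

When τ is comparable to RC the exponential term is no longer negligible, and the exact form gives a noticeably smaller value than the approximation.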
Probabilistic Modeling
p(s): the probability that signal s attains a logical value of true at any given clock cycle.
A(s): the probability that signal s transitions at any given clock cycle.
Example: p(clock) = 0.50, A(clock) = 1.0; p(x1) = 0.88, A(x1) = 0.10; p(x2) = 0.29, A(x2) = 0.17; p(x3) = 0.69, A(x3) = 0.27

Probabilistic Modeling
[Figure: gate network with inputs x1, x2, x3 and output y, with waveforms for each signal.]
Signal        p      A
x1(t)         0.88   0.10
x2(t)         0.29   0.17
x3(t)         0.69   0.27
x1x2(t)       0.83   0.17
x1x2x3(t)     0.10   0.13
Calculation of average power:
$$P_{avg} = \frac{1}{2} V^2 \sum_{g \,\in\, \text{all gates}} C_g A_g$$
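The average-power sum is straightforward to evaluate once each gate has a capacitance and an activity. In the sketch below the activities are the example values from the probabilistic modeling slide, while the supply voltage and the per-gate capacitance of 1 pF are hypothetical. Because A is a per-clock-cycle transition probability, the sum as written is an energy per cycle; multiplying by the clock frequency f, as the calibration formula later in the deck does, gives watts.

```python
def average_power(voltage, capacitances, activities):
    """(1/2) V^2 * sum over gates of C_g * A_g."""
    if len(capacitances) != len(activities):
        raise ValueError("one capacitance and one activity per gate are required")
    return 0.5 * voltage**2 * sum(c * a for c, a in zip(capacitances, activities))

activities = [0.10, 0.17, 0.27, 0.17, 0.13]   # example activities from the slide
capacitances = [1.0e-12] * len(activities)    # hypothetical: 1 pF per gate
p_cycle = average_power(5.0, capacitances, activities)
```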
Probabilistic Equations
Signal probability transformations...*
Signal activity transformations...†
[Equations for the signal probability p(y) and the signal activity A(y) of a gate output in terms of the probabilities and activities of its inputs; equation images not preserved.]
* Probabilistic Treatment of General Combinatorial Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching
FPGA Power Prediction Simulator
(Highlights from Year 1)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Conclusions and Demo
FPGA Design
FPGA internal structure design...
[Figure: FPGA internal structure with CLBs, IOBs, and buffers (BUF).]
Routing Fabric Design
Example routings...
• The Xilinx 4000 series routing fabric is very intricate.
• Xilinx synthesis tools use shortest path routing where possible.
• The distance the signal travels is the metric considered in this model.
Signal Design
• Symbolic Probability
• Numeric Probability
• Numeric Activity
• Signal Reference
• Manhattan Distance
[Figure: a local signal (L) within a CLB and a remote signal (R) between CLBs.]
Routing Example
[Figure: routing signal connections among LUTs through the interconnection, with each signal marked local (L) or remote (R).]
FPGA Power Prediction Simulator
(Highlights from Year 1)
• CMOS Power Consumption and Past Research
• Design and Implementation of the Power Prediction Simulator
• Conclusions
Conclusions
• Designed and implemented power prediction simulator for Xilinx 4000 series FPGAs.
• Inputs to simulator:
  • Place & Route bit stream (from Xilinx Tool)
  • Activity and Probability factors for pin signals
• Simulator calculates probabilities and activities for all internal signals
• Tool outputs power consumption of FPGA chip
• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed
Outline
Highlights from Year 2
• Efforts to Calibrate the FPGA Power Prediction Simulator
• Comparison of Integer and Floating Point Computations on FPGAs
• Architecture of Prototype System for SAR and STAP Processing
• Integration of Reconfigurable Computing into SAR
• Configuration Technique for STAP
Basic Approach to Calibration
• N x N array of CLBs (configurable logic blocks)
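• Programmable interconnect
• Let S denote the set of all internal signals for a configuration and S_i denote all signals of length i
• Let A_i denote the sum of activities for all signals of length i
• 2N + 1 distinct capacitances (C) dependent on signal length

$$P_{avg} = \frac{1}{2} V^2 f \sum_{s \in S} C_{d(s)} A_s$$

Here d(s) denotes the length of signal s. Grouping the signals by length gives:

$$P_{avg} = \frac{1}{2} V^2 f \left( C_0 \sum_{s \in S_0} A_s + C_1 \sum_{s \in S_1} A_s + \cdots + C_{2N} \sum_{s \in S_{2N}} A_s \right)$$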
Basic Approach to Calibration
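$$\frac{1}{2} V^2 f \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,2N} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,2N} \\ \vdots & \vdots & \ddots & \vdots \\ A_{2N,0} & A_{2N,1} & \cdots & A_{2N,2N} \end{bmatrix} \begin{bmatrix} C_0 \\ C_1 \\ \vdots \\ C_{2N} \end{bmatrix} = \begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_{2N} \end{bmatrix}$$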
• For the j-th design/data set combination:
  – let P_j denote the measured power
  – let A_{j,k} denote the aggregate activity of all signals of length k
• For each design/data set combination, the simulator provides the values for one row of the above matrix
• Given 2N + 1 measured values for Pj, the unknown capacitance values are then determined. This is how the simulator is calibrated.
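The calibration step above amounts to solving a square linear system for the unknown capacitances. A minimal sketch in Python follows, assuming the activity matrix is non-singular; the function names, the plain Gaussian elimination, and the example values are illustrative assumptions and are not taken from the simulator itself.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build the augmented matrix [matrix | rhs] as floats.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("activity matrix is singular; use different designs")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def calibrate_capacitances(activities, measured_power, voltage, frequency):
    """Return C_0..C_2N from (1/2) V^2 f * A * C = P.

    activities[j][k] is A_{j,k} for design/data set j and signal length k;
    measured_power[j] is P_j. Both are hypothetical inputs for this sketch.
    """
    scale = 0.5 * voltage ** 2 * frequency
    scaled = [[scale * a for a in row] for row in activities]
    return solve_linear_system(scaled, measured_power)
```

With N = 36 the system has 73 unknowns, so 73 design/data set combinations with linearly independent activity rows are needed for a unique solution.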
Efforts to Calibrate the Simulator
• For the Xilinx 4036 family of parts, N = 36
• Generated a total of 73 (= 2N + 1) design/data set combinations
• Created a utility for generating data sets with specified statistics
• Created a utility for computing statistics associated with a given data set
• Attempts at Measuring Consumed Power
  – Heat
  – Current
Heat Measurement Approach
Heat Measurement Approach (continued)
Current Measurement Approach
Highlights from Year 2
• Efforts to Calibrate the FPGA Power Prediction Simulator
• Comparison of Integer and Floating Point Computations on FPGAs
• Architecture of Prototype System for SAR and STAP Processing
• Integration of Reconfigurable Computing into SAR
• Configuration Technique for STAP
Comparison of Integer and Floating Point Computations on FPGAs
(Highlights from Year 2)
• Integer Pipelined Multiplier
• Floating Point Pipelined Multiplier
• Floating Point Pipelined Adder
• Comparison of Two Inner-Product Designs
• Conclusions
Array-Based Integer Multiplier
[Figure: array-based integer multiplier. The partial products b0·A through b11·A feed a chain of carry-save adders, CSA 0 through CSA 9, whose sum and carry outputs feed a propagate adder.]
Carry-Save Adders in a 5-bit Multiplier
[Figure: the partial products b0a0 … b4a4 are reduced by rows of half adders and full adders (CSA 0 through CSA 2); the resulting sum and carry bits feed a propagate adder built from full adders.]
[Figure: the sum and carry outputs of CSA 9 feed the propagate adder, which produces the upper 13 bits of the product.]
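The carry-save structure in the figures can be modeled behaviorally. The sketch below is an illustration rather than the implemented circuit: it treats each carry-save adder as a bitwise 3:2 compressor over whole integers, and the function names are assumptions.

```python
def carry_save_add(x, y, z):
    """3:2 compression: return (sum, carry) with sum + carry == x + y + z."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry


def array_multiply(a, b, width=12):
    """Multiply two unsigned width-bit integers (width >= 3).

    Returns (product, number of carry-save adder stages used).
    """
    # One partial product per multiplier bit: b_i * A, shifted into place.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    total, carry = carry_save_add(partials[0], partials[1], partials[2])
    stages = 1
    for partial in partials[3:]:
        total, carry = carry_save_add(total, carry, partial)
        stages += 1
    # The propagate adder resolves the final sum and carry vectors.
    return total + carry, stages
```

For a 12-bit multiplier this uses ten carry-save stages, which matches the CSA 0 through CSA 9 labels, and for a 5-bit multiplier it uses three, matching CSA 0 through CSA 2.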
Pipelining Results for Array-Based Integer Multiplier
• The Wild-One system runs at a maximum speed of 50 MHz
• The 4036xla has more routing resources than the 4028ex
• Table shows maximum achieved clock rate as a function of the number of pipelined stages employed

# of stages   4028ex speed (MHz)   4036xla speed (MHz)
1             14                   28
2             19                   25
3             21                   N/A
4             22                   27
5             29                   28
6             39                   28
7             22                   29
8             33                   50
Comparison of Integer and Floating Point Computations on FPGAs
(Highlights from Year 2)
• Integer Pipelined Multiplier
• Floating Point Pipelined Multiplier
• Floating Point Pipelined Adder
• Comparison of Two Inner-Product Designs
• Conclusions
16-bit Floating-Point Format
• The floating point format chosen is a 16-bit format supported by the ADSP-2106x family of SHARC DSP processors
• The exponent is represented in excess-7 notation
• Range: ±1.5625×10^-2 to ±2.559375×10^2
Short Word Floating-Point Format
bit 15: s (sign) | bits 14-11: e3 … e0 (exponent) | bits 10-0: f10 … f0 (fraction of the mantissa 1.f)
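A short decoding sketch makes the bit layout and the stated range concrete. It assumes that exponent code 0 is reserved for zero and exponent code 15 for infinity, as the multiplier and adder slides suggest; the function name is hypothetical.

```python
import math

EXP_BIAS = 7    # excess-7 exponent
FRAC_BITS = 11  # explicit fraction bits f10..f0


def decode_short_float(word):
    """Decode a 16-bit word laid out as s | e3..e0 | f10..f0."""
    sign = -1.0 if (word >> 15) & 0x1 else 1.0
    exponent = (word >> FRAC_BITS) & 0xF
    fraction = word & 0x7FF
    if exponent == 0:    # assumption: exponent code 0 means zero
        return sign * 0.0
    if exponent == 15:   # assumption: exponent code 15 means infinity
        return sign * math.inf
    mantissa = 1.0 + fraction / (1 << FRAC_BITS)  # hidden leading 1
    return sign * mantissa * 2.0 ** (exponent - EXP_BIAS)
```

Under these assumptions the smallest nonzero magnitude is 2^-6 = 1.5625×10^-2 (exponent code 1, fraction 0) and the largest finite magnitude is (2 − 2^-11)·2^7 = 2.559375×10^2 (exponent code 14, all fraction bits set), which agrees with the range on the slide.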
Floating Point Multiplier
[Figure: floating point multiplier. The mantissas 1.m1 and 1.m2 feed a 12-bit array-based multiplier, and the upper 13 bits of the product are used to form the result mantissa. The exponents e1 and e2 feed an excess-7 adder with an exponent adjust select and underflow (unf) and overflow (ovf) flags. The signs s1 and s2 form the result sign. Result fields: sign (1 bit), exponent (4 bits), mantissa (11 bits).]
• Mantissa:
  – If the msb = 1, take the bits msb-1…msb-11
  – If the msb = 0, take the bits msb-2…msb-12
• Exponent:
  – If underflow = 1, set exponent = 0
  – If overflow = 1, set exponent = 15 (representing infinity)
  – If e1 or e2 = 0, set exponent = 0
  – If e1 or e2 = 15, set exponent = 15
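The mantissa and exponent rules of the multiplier can be exercised with a small bit-level model. This is a behavioral sketch, not the pipelined hardware: it truncates rather than rounds, it zeroes the fraction on underflow and overflow (an assumption), and it takes 11 fraction bits in both msb cases.

```python
def fp16_multiply(a, b):
    """Multiply two 16-bit words in the s | e3..e0 | f10..f0 format."""
    sign = ((a >> 15) ^ (b >> 15)) & 1
    e1, e2 = (a >> 11) & 0xF, (b >> 11) & 0xF
    if e1 == 15 or e2 == 15:          # infinity in, infinity out
        return (sign << 15) | (15 << 11)
    if e1 == 0 or e2 == 0:            # zero in, zero out
        return sign << 15
    m1 = 0x800 | (a & 0x7FF)          # 12-bit mantissa 1.m1
    m2 = 0x800 | (b & 0x7FF)          # 12-bit mantissa 1.m2
    upper13 = (m1 * m2) >> 11         # upper 13 bits of the 24-bit product
    if upper13 & 0x1000:              # msb = 1: product in [2, 4)
        fraction = (upper13 >> 1) & 0x7FF   # bits msb-1 .. msb-11
        adjust = 1
    else:                             # msb = 0: product in [1, 2)
        fraction = upper13 & 0x7FF          # bits msb-2 .. msb-12
        adjust = 0
    exponent = e1 + e2 - 7 + adjust   # excess-7 addition plus adjust
    if exponent <= 0:                 # underflow: exponent = 0
        return sign << 15
    if exponent >= 15:                # overflow: exponent = 15 (infinity)
        return (sign << 15) | (15 << 11)
    return (sign << 15) | (exponent << 11) | fraction
```

For example, 1.5 is encoded as 0x3C00, and multiplying it by itself yields 0x4100, which decodes to 2.25.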
Comparison of Integer and Floating Point Computations on FPGAs
(Highlights from Year 2)
• Integer Pipelined Multiplier
• Floating Point Pipelined Multiplier
• Floating Point Pipelined Adder
• Comparison of Two Inner-Product Designs
• Conclusions
Floating Point Adder
[Figure: pipelined floating point adder. Inputs: e1, e2, s1, s2, 1.m1, 1.m2. Stages, separated by registers: Check for Absolute Zero and Infinity and Add Phantom Bit; Compare Exponents by Subtraction (difference, pos./neg.); Choose Exponent and Align Mantissas; Add/Subtract Mantissas; Normalize Mantissa and Adjust Exponent. Outputs: exponent, mantissa, sign.]
Comparison of Integer and Floating Point Computations on FPGAs
(Highlights from Year 2)
• Integer Pipelined Multiplier
• Floating Point Pipelined Multiplier
• Floating Point Pipelined Adder
• Comparison of Two Inner-Product Designs
• Conclusions
Inner Product Co-processor Designs
[Figure: two inner-product co-processor designs. Multiply-Add Scheme: Input Buffer, two Pipeline Multipliers, Pipeline Adder, Output Buffer. Multiply-Accumulate Scheme: Input Buffer, one Pipeline Multiplier, Pipeline Adder, Output Buffer.]
Performance

Co-Processor Type             Speed (MHz)  # of CLBs  # of Flip-Flops  # of 3-Input LUTs  # of 4-Input LUTs  Equivalent Gate Count  Estimated Power Consumption
Integer Multiply-Accumulate   50           622        720              180                794                10076                  N/A
Integer Multiply-Add          43           1013       1148             423                1421               16809                  415
F.P. Multiply-Accumulate      38           437        414              154                742                8072                   454
F.P. Multiply-Add             34           716        654              254                1082               11766                  390
Notes:
1. Integer co-processors implemented with 16-bit integer multipliers and 32-bit integer adders
2. The estimated power consumption is calculated from the power simulator based on simplified (non-calibrated) constants:

$$\text{Estimated Power} = 1 \sum_{s \in S_0} A_s + 2 \sum_{s \in S_1} A_s + \cdots + 2N \sum_{s \in S_{2N-1}} A_s + (2N+1) \sum_{s \in S_{2N}} A_s$$
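A minimal sketch of the simplified estimate follows. It assumes that the non-calibrated constants weight a signal of length i by i + 1, which is a reading of the note above; the input format (one pair of length and activity per signal) is hypothetical.

```python
def estimated_power(signals):
    """Linearly weighted activity sum over (length, activity) pairs.

    Assumption: a signal of length i contributes (i + 1) * activity.
    """
    return sum((length + 1) * activity for length, activity in signals)


# Example: two local signals (length 0) and one signal of length 2.
example = [(0, 0.5), (0, 0.25), (2, 1.0)]
print(estimated_power(example))
```

The result is unitless, so values such as those in the table are only meaningful for comparing designs against each other.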
[Figure: F.P. Multiply-Add vs F.P. Multiply-Accumulate, Non-Weighted Activity Values. x-axis: Interconnection Length (0 to 46); y-axis: Activity Value (0 to 3.5); series: Multiply-Add, Multiply-Accumulate.]
[Figure: F.P. Multiply-Add vs F.P. Multiply-Accumulate, Linearly-Weighted Activity Values. x-axis: Interconnection Length (0 to 46); y-axis: Weighted Activity (0 to 70); series: Multiply-Add, Multiply-Accumulate.]
Comparison of Integer and Floating Point Computations on FPGAs
(Highlights from Year 2)
• Integer Pipelined Multiplier
• Floating Point Pipelined Multiplier
• Floating Point Pipelined Adder
• Comparison of Two Inner-Product Designs
• Conclusions
Conclusions
• Developed libraries of efficient integer and floating point pipelined multipliers and adders
• Discovered that increasing the degree of pipelining increases required hardware
• Discovered that increasing the degree of pipelining generally increases maximum clock rate
• 16-bit F.P. inner-product designs require less hardware than integer inner-product designs, which employ 16-bit multiplier(s) and a 32-bit adder
• Multiply-accumulate designs consume more power (estimated) than multiply-add designs due to the requirement for long feedback paths
• Developed 50 page User’s Manual for Annapolis System
Highlights from Year 2
• Efforts to Calibrate the FPGA Power Prediction Simulator
• Comparison of Integer and Floating Point Computations on FPGAs
• Architecture of Prototype System for SAR and STAP Processing
• Integration of Reconfigurable Computing into SAR
• Configuration Technique for STAP
Architecture of Prototype System
[Figure: a Data Source feeds a Reconfigurable Subsystem (Annapolis System hosted in a PC over PCI), which connects to the DSP/GPP Subsystem (Mercury System of CNs and PEs with a SPARC host on VME), which connects to a second Reconfigurable Subsystem (Annapolis System hosted in a PC over PCI) and then to a Data Sink. The subsystems are joined by custom 120 MB/sec links.]
SAR Processing Flow
[Figure: Range Compression, Data Transfer, and Azimuth Processing stages; data axes: Azimuth and Range]
STAP Processing Flow
[Figure: Range Compression, Doppler Filtering, and Weight Computation stages separated by Data Transfers; data cube axes: Doppler, Channel, and Range]
Refer to Poster for Physical View of Architecture
Highlights from Year 2
• Efforts to Calibrate the FPGA Power Prediction Simulator
• Comparison of Integer and Floating Point Computations on FPGAs
• Architecture of Prototype System for SAR and STAP Processing
• Integration of Reconfigurable Computing into SAR
• Configuration Technique for STAP
Integration of Reconfigurable Computing into SAR
(Highlights from Year 2)
• The SAR Benchmark
• Comparison of Two FIR Filter Designs
• Including FPGAs in the SAR Optimization Formulation
The SAR Benchmark
• Retrieved Benchmark from
http://www.rl.af.mil/programs/hpcbench/
• Developed under the ARPA/Tri-Services Rapid Prototyping of Application Specific Signal Processors (RASSP) program
• Two main programs
• Synthetic SAR data generator (400 lines of code)
• Serial SAR processor (1600 lines of code)
• The SAR algorithm is stripmap mode - currently processes 4 frames of hh polarization data
Integration of Reconfigurable Computing into SAR
(Highlights from Year 2)
• The SAR Benchmark
• Comparison of Two FIR Filter Designs
• Including FPGAs in the SAR Optimization Formulation
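Comparison of Two FIR Filter Designs
Serial-Multiply/Parallel Add
[Figure: four-tap filter with inputs xk0 … xk3, D flip-flop registers, multipliers, and adders operating in parallel]
• Ease of routing
• Poor modularity
Parallel-Multiply/Serial Add
[Figure: four-tap filter with inputs xk0 … xk3, multipliers, and a serial chain of adders and D flip-flop registers]
• Poor routing
• Good modularity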
Comparison of Two FIR Filter Designs
• Both designs implemented using fixed-point complex data (16-bit fixed-point real and imaginary components)
• Both designs make use of constant coefficient multipliers (from core generator)
• Four tap serial-multiply/parallel-add filter fit onto one 4036xla part
• Three tap parallel-multiply/serial-add filter fit onto one 4036xla part (insufficient routing resources for four taps)
• Four tap parallel-multiply/serial-add filter implemented across two parts on one board (one 4036 and one 4013)
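The two structures compute the same filter output and differ only in where the registers and adders sit. The behavioral sketch below assumes that serial-multiply/parallel-add corresponds to a direct-form FIR (a delay line on the inputs feeding an adder tree) and that parallel-multiply/serial-add corresponds to a transposed-form FIR (the input broadcast to every multiplier, with registered partial sums); that mapping is an interpretation of the diagrams, and the function names are hypothetical.

```python
def fir_direct(coeffs, samples):
    """Delay line on the inputs; all tap products are summed each cycle."""
    delay = [0] * len(coeffs)
    outputs = []
    for x in samples:
        delay = [x] + delay[:-1]                  # shift the new sample in
        outputs.append(sum(c * d for c, d in zip(coeffs, delay)))
    return outputs


def fir_transposed(coeffs, samples):
    """Input broadcast to every multiplier; partial sums move through registers."""
    taps = len(coeffs)
    regs = [0] * taps                             # regs[taps - 1] stays 0
    outputs = []
    for x in samples:
        products = [c * x for c in coeffs]
        outputs.append(products[0] + regs[0])
        # Each register takes the next tap's product plus the next register.
        regs = [products[k + 1] + regs[k + 1] for k in range(taps - 1)] + [0]
    return outputs
```

Both functions accept Python complex numbers, so complex fixed-point data can be approximated by passing complex samples and coefficients. The broadcast of each sample to every multiplier in the transposed form is one plausible reason for its heavier routing demand.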
Integration of Reconfigurable Computing into SAR
(Highlights from Year 2)
• The SAR Benchmark
• Comparison of Two FIR Filter Designs
• Including FPGAs in the SAR Optimization Formulation
Including FPGAs in the SAR Optimization Formulation
• Power estimates must be determined for a range of kernel sizes for both filter designs
• Hybrid designs may exist for multi-chip implementations that yield desired features of both modularity and routability
• Binary optimization variable defines whether entry-FPGA or DSP/GPP subsystems perform range compression
• Real optimization variable defines fraction of azimuth processing divided among GPP/DSP and exit-FPGA subsystems
Highlights from Year 2
• Efforts to Calibrate the FPGA Power Prediction Simulator
• Comparison of Integer and Floating Point Computations on FPGAs
• Architecture of Prototype System for SAR and STAP Processing
• Integration of Reconfigurable Computing into SAR
• Configuration Technique for STAP
Configuration Technique for STAP
• Incorporate New Features into the Network Simulator
• Testing and Calibration of the Network Simulator
• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer
• Optimization Problem
• Computational Investigation
NEW FEATURES FOR THE NETWORK SIMULATOR
• Incorporate Software Overhead Times in the Simulation Model
  – Currently, the simulator performs hardware switch-level modeling (i.e., packet level simulation at the crossbar level).
  – Modify the Network Simulator to include software overhead times for two communication protocols.
  – Empirical analysis will be utilized to capture software overhead times for the communication protocols.
• Provide Additional Timing Information from Simulation Runs
  – Currently, the simulator outputs completion times after each corner turn of the STAP data cube.
  – Modify the Network Simulator to output message queue completion times for each Compute Node (CN) sending messages.
  – Message queue completion times will become vital input into the optimization algorithm.
• Add PowerPC Compute Node Configuration to the Simulator
INCORPORATE SOFTWARE OVERHEAD TIMES
• Communication Time for a Message:

$$T_C = T_{O(Software)} + T_{O(Hardware)} + \frac{M}{B}$$

where:
T_C = Completion Time
T_{O(Software)} = Software Overhead Time
T_{O(Hardware)} = Hardware Overhead Time
M = Message Size
B = Network Bandwidth

• Modeled by Simulator: T_{O(Hardware)} and M/B
• Include Software Overhead Time in the Simulation Model: T_{O(Software)}
SOFTWARE PROTOCOLS
• Two Communication Protocol Times will be added to the Simulation Model
  – DMA MC/OS Communication Times (DMA Transfers between CNs)
  – MPI (Message Passing Interface) Software Layer Communication Times
• Incorporating Software Overhead Times into the Simulation Model will be accomplished through Empirical Analysis.
  – For each of the two software protocols, zero length messages will be sent through the network. Their resulting communication times will be measured.
  – After analysis of multiple runs, the simulator will be calibrated to include both DMA transfer overhead and MPI software overhead.
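The message-time model and the zero-length-message calibration can be sketched as follows. The function names, the averaging over runs, and the numeric values are illustrative assumptions; the actual overheads would come from measurements on the Mercury system.

```python
def communication_time(message_bytes, bandwidth_bytes_per_s,
                       software_overhead_s, hardware_overhead_s):
    """T_C = T_O(Software) + T_O(Hardware) + M / B."""
    return (software_overhead_s + hardware_overhead_s
            + message_bytes / bandwidth_bytes_per_s)


def estimate_software_overhead(zero_length_times_s, hardware_overhead_s):
    """Estimate T_O(Software) from measured zero-length message times.

    With M = 0 the model reduces to T_C = T_O(Software) + T_O(Hardware),
    so the mean measured time minus the hardware overhead is the estimate.
    """
    mean_time = sum(zero_length_times_s) / len(zero_length_times_s)
    return mean_time - hardware_overhead_s
```

Running the estimate once per protocol (DMA transfers and MPI) would give the two software overhead terms that the simulator needs.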
SOFTWARE COMPONENTS
[Figure: software component stack. The User Application and the MPI Software Layer sit on the MC/OS Runtime Environment, which comprises the Interprocessor Communication System (ICS), POSIX API, MCexec, Loadable Device Drivers, and the 'DX' Data Transfer Facility. Below the HARDWARE ABSTRACTION LAYER are the DMA Controller, the CN ASIC (Registers, Interrupts, Timers, etc.), and the CPU Registers.]
PROPOSED WORK
• Incorporate New Features into the Network Simulator
• Testing and Calibration of the Network Simulator
• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer
• Optimization Problem
• Computational Investigation
TESTING AND CALIBRATION OF THE NETWORK SIMULATOR
• Test Specific Communication Patterns to Verify Accuracy of the Network Simulator
  – Implement a Communication Task on the Mercury RACE® Computer
  – Replicate the Communication Task on the Network Simulator
  – Compare the Resultant Completion Times
  – If Necessary, Fine-Tune the Network Simulator
• Two Types of Communication Patterns will be used to Test and Calibrate the Network Simulator
  – Simple Test Patterns (Hand-Calculated Verification)
  – Complex Test Patterns (Empirical Verification)
TESTING AND CALIBRATION WITH SPECIFIC TEST PATTERNS
• Simple Test Patterns (Hand-Calculated Verification)
  – Implement simple test patterns between CNs to verify the accuracy and assist in fine-tuning of the Network Simulator. The test pattern communication time can be hand-calculated for comparison to the simulated result.
    • Single Source Message Tests
    • Two Source Message Tests (Non-Contending Paths)
    • Two Source Message Tests (Contending Paths)
    • N Source Message Tests (Non-Contending Paths)
    • N Source Message Tests (Contending Paths)
• Complex Test Patterns (Empirical Verification)
  – Implement more complex basic communication patterns to test the validity of the simulator. Empirical data from the Mercury Computer implementing the same test pattern will be used to calibrate the Network Simulator.
    • All-to-All Personalized Communication Test
    • Randomized Message Queue Communication Test
SIMPLE TEST PATTERNS
Single Source Message Tests
• Test Plan Development Diagram
[Figure: START, then message count (Single Message, Two Messages, 3..N Messages), then packets per message (Single Packet/Message, Two Packets/Message, 3..P Packets/Message), then crossbars traversed (Single Crossbar, 3..C Crossbars), then RUN TEST]
SIMPLE TEST PATTERNS
Two Source Message Tests (*Non-Contending Paths)
• Test Plan Development Diagram (For Each Source)
[Figure: START, then messages per CN (Single Message/CN, Two Messages/CN, 3..N Messages/CN), then packets per message (Single Packet/Message, Two Packets/Message, 3..P Packets/Message), then crossbars traversed (Single Crossbar, 3..C Crossbars; Non-Contending), then RUN TEST]
SIMPLE TEST PATTERNS
Two Source Message Tests (*Contending Paths)
• Test Plan Development Diagram (For Each Source)
[Figure: START, then messages per CN (Single Message/CN, Two Messages/CN, 3..N Messages/CN), then packets per message (Single Packet/Message, Two Packets/Message, 3..P Packets/Message), then crossbars traversed (Single Crossbar, 3..C Crossbars; Contending), then RUN TEST]
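One way to turn the test plan diagrams into a concrete list of runs is to take the cross product of the three dimensions. The sketch below assumes that every combination is exercised, which the diagrams do not state explicitly; the dictionary keys are hypothetical.

```python
from itertools import product

MESSAGES = ["single message", "two messages", "3..N messages"]
PACKETS = ["single packet/message", "two packets/message", "3..P packets/message"]
CROSSBARS = ["single crossbar", "3..C crossbars"]


def enumerate_test_cases(sources, contending):
    """List every (messages, packets, crossbars) combination for one test family."""
    return [
        {
            "sources": sources,
            "contending": contending,
            "messages": messages,
            "packets": packets,
            "crossbars": crossbars,
        }
        for messages, packets, crossbars in product(MESSAGES, PACKETS, CROSSBARS)
    ]


# Single-source tests plus the two-source tests with and without contention.
plan = (enumerate_test_cases(1, False)
        + enumerate_test_cases(2, False)
        + enumerate_test_cases(2, True))
print(len(plan))
```

Each entry can then be run on both the Mercury RACE® computer and the Network Simulator so that the completion times can be compared.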
Configuration Technique for STAP
• Incorporate New Features into the Network Simulator
• Testing and Calibration of the Network Simulator
• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer
• Optimization Problem
• Computational Investigation
MERCURY RACE® COMPUTER CONFIGURATION
[Figure: multi-level tree of crossbars connecting the CNs, with a VME Port and I/O. CN types shown: PPC 603e, 16 Mb, 100 MHz; 3 SHARC DSPs, 8 Mb, 40 MHz; 3 SHARC DSPs, 16 Mb, 40 MHz.]
STAP IMPLEMENTATION ON MERCURY RACE® COMPUTER
• Implementation of STAP on the Mercury RACE® Computer involves the following tasks:
  – Build the RT_STAP¹ benchmark designed and developed by MITRE (requires MPI software).
  – Successfully install and build MPI Software Technology, Inc.’s message passing interface software (MPI/PRO™) for the Mercury Computer (used by the RT_STAP benchmark).
  – Build both the sequential host and parallel Mercury Computer versions of the benchmark.
• After successfully building and executing the RT_STAP benchmark on the 8 node PowerPC Mercury RACE® computer, perform the following tasks:
  – Analyze the RT_STAP benchmark source code to determine the partitioning of the data (i.e., the mapping) and the scheduling of the messages. Replicate the data partitioning and message scheduling on the Network Simulator.
  – Verify the reported communication times from the RT_STAP benchmark with the Network Simulator.
  – Modify the RT_STAP source code to allow for specification of mapping and ordering of the data distribution. Verify results with the Network Simulator.
¹ Cain, K.C., Torres, J.A., and Williams, R.T. RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark. MITRE Technical Report MTR 96B0000021, February 1997.
MPI/PRO™ BUILD FOR MERCURY RACE® COMPUTER
• MPI/PRO™ for RACE® is a Commercial Off-the-Shelf Standards-Based Message-Passing Middleware.
• Provides robust messaging and implements the Message Passing Interface (MPI) defined by the Message-Passing Forum.
• MPI/PRO™ supports MPI 1.2 extensions.
• Currently supports RACE® PowerPC and i860 CNs.
• MPI/PRO™ is layered on Mercury’s MC/OS development and runtime environment.
RT_STAP BENCHMARK ON MERCURY RACE® COMPUTER
• The RT_STAP benchmark, developed by MITRE, was designed to evaluate the application of scalable, high performance computers to the real time implementation of STAP techniques.
• The benchmark has the capability to vary the sophistication and computational complexity of the adaptive algorithms employed.
• The goal is to build and execute the MITRE RT_STAP benchmark software on an 8 node PPC 603e Mercury Computer (MCOS 4.4.2) using MPI Software Technology, Inc.’s MPI/PRO™.
• The RT_STAP benchmark software employs a QR-decomposition algorithm component in the space-time adaptive processing. A QRD benchmark is also provided to characterize a single processor’s performance on QR-decompositions.
Configuration Technique for STAP
• Incorporate New Features into the Network Simulator
• Testing and Calibration of the Network Simulator
• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer
• Optimization Problem
• Computational Investigation
OPTIMIZATION PROBLEM
• Overview of the Approach
• Definition of a Class of Mappings for Data Partitioning
• Development of an Objective Function to Evaluate Defined Classes of Mappings
• Implementation of a Genetic Algorithm to Produce Schedules for the Top Mapping Candidates generated by the Mapping Objective Function.
  – Use the Simulator to Evaluate the Communication Performance.
OVERVIEW OF THE APPROACH
[Figure: flow of the approach. Boxes: STAP Data Cube; Select # CNs (P) (P = Allocated Compute Nodes); Select Fixed or Random Mapping; Minimize Mapping (Use Objective Function); Genetic Algorithm (Determine Optimal Schedule); Network Simulator (Estimate Overall Communication Time); Mercury RACE® (Configured with 1..P CNs). An OPTIMIZE loop includes an Adjust Allocated P step.]
DEFINITION OF A CLASS OF MAPPINGS FOR DATA PARTITIONING
• Let the matrix T_k represent the mapping for the kth processing phase:
[Figure: T_k drawn as a 2-d process set with M rows and N columns]
• Equation for the number of CNs:

$$P = M_k \cdot N_k$$

• The mapping matrices T_1, T_2, T_3 could be defined by any one of the following:

$$T_1 : M_1 \times N_1, \qquad T_2 : M_2 \times N_2, \qquad T_3 : M_3 \times N_3$$

• Possible values for M and N:

$$(M, N) \in \{(i, j) \mid i \cdot j = P\}$$

• Total number of combinations:

$$\left| \{(i, j) \mid i \cdot j = P\} \right|^3$$

For example, assume P = 12. The possible (M × N) values are:

$$\{(1 \times 12), (2 \times 6), (3 \times 4), (4 \times 3), (6 \times 2), (12 \times 1)\}$$

Assuming the CN assignments within a mapping matrix are raster ordered left to right, the total number of combinations is:

$$6^3 = 36 \cdot 6 = 216$$
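The size of the mapping class is easy to check by enumeration. The short sketch below lists the M × N shapes for a given P and counts the combinations over three processing phases; the function names are illustrative.

```python
def factor_pairs(p):
    """All (M, N) with M * N == p, in order of increasing M."""
    return [(m, p // m) for m in range(1, p + 1) if p % m == 0]


def count_mapping_combinations(p, phases=3):
    """Number of ways to pick one M x N shape per processing phase."""
    return len(factor_pairs(p)) ** phases


print(factor_pairs(12))
print(count_mapping_combinations(12))
```

For P = 12 this reproduces the six shapes and the 216 combinations given above. The count assumes raster-ordered CN assignment, so only the shape of each mapping matrix varies.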
OBJECTIVE FUNCTION DEVELOPMENT
Quality of Mapping
• An objective function can be developed based on the definition of a class of mappings for data partitioning.

Using the following definitions:
T_1 = [T_1(r, c)], an M_1 × N_1 matrix such that each entry T_1(r, c) represents the CN where the corresponding data vector is distributed.
T_2 = [T_2(r, c)], an M_2 × N_2 matrix such that each entry T_2(r, c) represents the CN where the corresponding data vector is distributed.
T_3 = [T_3(r, c)], an M_3 × N_3 matrix such that each entry T_3(r, c) represents the CN where the corresponding data vector is distributed.
m_ij = message from CN i to CN j
|m_ij| = message size of m_ij
d_ij = minimum number of required crossbar connections for message m_ij
ε = { (i, j) | CN i communicates with CN j }

T_1 → T_2: Corner-Turn Produces Messages
Objective:

$$\min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij}$$

T_2 → T_3: Corner-Turn Produces Messages
Objective:

$$\min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$
OBJECTIVE FUNCTION DEVELOPMENT
Quality of Mapping
• An objective function for the communication time (first corner turn plus second corner turn):

$$k_1 \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} + k_2 \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$

• An objective function for STAP processing:

$$k_1 \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} + k_2 \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij} + k_3 (\text{Range Computation Time}) + k_4 (\text{Doppler Computation Time}) + k_5 (\text{Weight Computation Time})$$
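Evaluating the objective for one candidate mapping could look like the sketch below. It is not the project's implementation: the message sets are passed in directly as dictionaries, and `two_level_hops` is a hypothetical stand-in for d_ij that assumes a two-level crossbar tree with four CNs per leaf crossbar.

```python
def corner_turn_cost(messages, hops):
    """Sum of |m_ij| * d_ij over all (i, j) pairs that communicate.

    messages maps (source CN, destination CN) to a message size.
    """
    return sum(size * hops(i, j) for (i, j), size in messages.items())


def two_level_hops(i, j, cns_per_crossbar=4):
    """Hypothetical d_ij: 1 crossbar within a leaf, 3 across leaves."""
    if i == j:
        return 0
    same_leaf = i // cns_per_crossbar == j // cns_per_crossbar
    return 1 if same_leaf else 3


def stap_objective(first_turn, second_turn, compute_times, k, hops=two_level_hops):
    """Weighted sum of two corner-turn costs and three computation times."""
    k1, k2, k3, k4, k5 = k
    t_range, t_doppler, t_weight = compute_times
    return (k1 * corner_turn_cost(first_turn, hops)
            + k2 * corner_turn_cost(second_turn, hops)
            + k3 * t_range + k4 * t_doppler + k5 * t_weight)
```

Minimizing over the class of mappings then means evaluating this value for each candidate (T_1, T_2, T_3) and keeping the smallest results as input to the scheduling step.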
GENETIC ALGORITHMS
• A genetic algorithm (GA) is a population-based model that uses selection and recombination operators to generate new sample points in a search space.
• A GA encodes a potential solution to a specific problem on a chromosome-like data structure and applies recombination operators to these structures so as to preserve critical information.
• Often, GAs are viewed as function optimizers. As a result, researchers are typically interested in GAs as optimization tools.
• Implementation of a GA begins with a population of chromosomes. Once each chromosome is evaluated, reproduction opportunities are applied in such a way that those chromosomes which represent a better solution to the target problem are given more chances to reproduce than chromosomes with poorer solutions.
• Currently, GAs are a promising heuristic approach to locating near-optimal solutions in large search spaces.
GENETIC ALGORITHMS
• A genetic algorithm is typically composed of two main components that are problem dependent:
  – The problem encoding
    • The first component involves generating an encoding scheme to represent possible solutions to the optimization problem. Candidate solutions are usually represented as strings of fixed length, like chromosomes, usually coded with a binary character set.
  – The evaluation function
    • An evaluation function measures the quality of a particular solution. In this research, the evaluation of a particular candidate will be accomplished by the Network Simulator. The fitness of the candidate from the population space will be measured based on its simulated performance.
• The objective of a GA search is to locate the chromosome that has the optimal fitness value. For this research, if the chromosome represented the scheduling of messages and the fitness value the completion time of the schedule, the objective of the GA would be to find the smallest value (i.e., shortest completion time).
IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC
• Implementation of a GA involves the following steps:¹
  – Generate an initial population: This initial population is the first generation where evolution starts. A random set of chromosomes is often used as the initial population.
  – An evaluation using the evaluation or fitness function: Evaluate the quality of each chromosome in the initial population.
  – A selection mechanism: In this step, chromosomes are duplicated or eliminated based on their relative quality or fitness. The population size is kept constant.
  – A crossover mechanism: Some pairs of the chromosomes are selected from the current population, and some of their corresponding components are exchanged to form two valid chromosomes. The new chromosomes may or may not be in the current population.
1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.
IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC
• Implementation of a GA involves the following steps (continued):¹
  – A mutation mechanism: After a crossover operation, each string in the population may be mutated with some probability. The mutation process transforms a chromosome into another valid one that may or may not be in the population. The motivation for using mutation is to prevent the algorithm from getting stuck in a local minimum.
  – Reevaluation of the population: The new population after selection, crossover, and mutation is reevaluated. The fitness value for each new chromosome is computed.
  – A set of stopping criteria: The stopping criteria specify the conditions under which the algorithm terminates. If the stopping criteria have not been met, the new population goes through another cycle of selection, crossover, mutation, and evaluation. This cycle repeats until one of the stopping criteria is met.
1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.
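The GA steps listed on these slides can be sketched for a message-scheduling chromosome. Everything specific here is an assumption made for illustration: the chromosome is a permutation of message indices, `completion_time` is a toy contention model standing in for the Network Simulator, selection is a binary tournament, crossover is an order crossover, and mutation is a swap.

```python
import random


def completion_time(schedule, messages):
    """Toy stand-in for the Network Simulator (hypothetical contention model).

    Each message (src, dst, size) starts once its source and destination
    are both free and occupies them for `size` time units.
    """
    free_at, finish = {}, 0.0
    for index in schedule:
        src, dst, size = messages[index]
        start = max(free_at.get(("s", src), 0.0), free_at.get(("d", dst), 0.0))
        end = start + size
        free_at[("s", src)] = free_at[("d", dst)] = end
        finish = max(finish, end)
    return finish


def order_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a; fill the rest in parent_b's order."""
    lo, hi = sorted(rng.sample(range(len(parent_a) + 1), 2))
    middle = parent_a[lo:hi]
    rest = [gene for gene in parent_b if gene not in middle]
    return rest[:lo] + middle + rest[lo:]


def mutate(chromosome, rate, rng):
    """Swap two positions with probability `rate`; result stays a permutation."""
    child = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def genetic_search(messages, pop_size=20, generations=50, patience=15,
                   mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    genes = list(range(len(messages)))

    def fitness(chromosome):
        return completion_time(chromosome, messages)

    # Initial population: random schedules, then evaluation.
    population = [rng.sample(genes, len(genes)) for _ in range(pop_size)]
    best = min(population, key=fitness)
    stale = 0
    for _ in range(generations):
        # Selection: binary tournament, population size kept constant.
        selected = [min(rng.sample(population, 2), key=fitness)
                    for _ in range(pop_size)]
        # Crossover and mutation produce the next generation.
        children = []
        for k in range(0, pop_size, 2):
            a, b = selected[k], selected[(k + 1) % pop_size]
            children.append(mutate(order_crossover(a, b, rng), mutation_rate, rng))
            children.append(mutate(order_crossover(b, a, rng), mutation_rate, rng))
        population = children[:pop_size]
        # Reevaluation and stopping criteria (generation cap or no improvement).
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best, stale = candidate, 0
        else:
            stale += 1
        if stale >= patience:
            break
    return best, fitness(best)
```

Order crossover and swap mutation are used because they always return valid permutations, which a binary-string encoding with plain bit crossover would not guarantee for schedules.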
Configuration Technique for STAP
• Incorporate New Features into the Network Simulator
• Testing and Calibration of the Network Simulator
• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer
• Optimization Problem
• Computational Investigation
COMPUTATIONAL INVESTIGATION
• A QR-D computation is deterministic (i.e, its complexity can be calculated).
• A Conjugate Gradient (CG) computation is not deterministic. Its complexity depends on the initial condition and desired tolerance.
  – This work proposes the investigation of the impact of “intelligent” initial condition values to a CG algorithm.
CONJUGATE GRADIENT APPROACH
Investigation of Initial Condition Values
[Figure: adjacent range bins A, B, C, D]
Solve the following equations:

$$\psi_1(A,B,C)\, w_1 = s, \qquad \psi_2(B,C,D)\, w_2 = s$$

Where:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H, \quad x_1 = \begin{bmatrix} A \\ B \\ C \end{bmatrix}, \qquad \psi_2(B,C,D) = x_2 \cdot x_2^H, \quad x_2 = \begin{bmatrix} B \\ C \\ D \end{bmatrix}$$

w_1 = weight vector
s = steering vector
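CONJUGATE GRADIENT APPROACH
Investigation of Initial Condition Values
• Expanding ψ_1(A,B,C) and ψ_2(B,C,D) yields the following:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H = \begin{bmatrix} A \\ B \\ C \end{bmatrix} \begin{bmatrix} A^H & B^H & C^H \end{bmatrix} = \begin{bmatrix} AA^H & AB^H & AC^H \\ BA^H & BB^H & BC^H \\ CA^H & CB^H & CC^H \end{bmatrix}$$

$$\psi_2(B,C,D) = x_2 \cdot x_2^H = \begin{bmatrix} B \\ C \\ D \end{bmatrix} \begin{bmatrix} B^H & C^H & D^H \end{bmatrix} = \begin{bmatrix} BB^H & BC^H & BD^H \\ CB^H & CC^H & CD^H \\ DB^H & DC^H & DD^H \end{bmatrix}$$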
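CONJUGATE GRADIENT APPROACH
Investigation of Initial Condition Values
• Attempting to solve the following equation for w_1:

$$\psi_1(A,B,C)\, w_1 = s, \qquad \psi_1(A,B,C) \begin{bmatrix} w_{1,1} \\ w_{1,2} \\ w_{1,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$AA^H w_{1,1} + AB^H w_{1,2} + AC^H w_{1,3} = s_1$$
$$BA^H w_{1,1} + BB^H w_{1,2} + BC^H w_{1,3} = s_2$$
$$CA^H w_{1,1} + CB^H w_{1,2} + CC^H w_{1,3} = s_3$$

• Attempting to solve the following equation for w_2:

$$\psi_2(B,C,D)\, w_2 = s, \qquad \psi_2(B,C,D) \begin{bmatrix} w_{2,1} \\ w_{2,2} \\ w_{2,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$BB^H w_{2,1} + BC^H w_{2,2} + BD^H w_{2,3} = s_1$$
$$CB^H w_{2,1} + CC^H w_{2,2} + CD^H w_{2,3} = s_2$$
$$DB^H w_{2,1} + DC^H w_{2,2} + DD^H w_{2,3} = s_3$$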
CONJUGATE GRADIENT APPROACH
Investigation of Initial Condition Values
• Investigation of the two sets of linear equations reveals similarities among the sets of equations:

$$AA^H w_{1,1} + AB^H w_{1,2} + AC^H w_{1,3} = s_1$$
$$BA^H w_{1,1} + BB^H w_{1,2} + BC^H w_{1,3} = s_2$$
$$CA^H w_{1,1} + CB^H w_{1,2} + CC^H w_{1,3} = s_3$$

$$BB^H w_{2,1} + BC^H w_{2,2} + BD^H w_{2,3} = s_1$$
$$CB^H w_{2,1} + CC^H w_{2,2} + CD^H w_{2,3} = s_2$$
$$DB^H w_{2,1} + DC^H w_{2,2} + DD^H w_{2,3} = s_3$$

• The similarities between the equations may provide insight into the selection of the initial condition values. Assuming the steering vector remains the same for each set of linear equations, the initial values could be assigned as follows:
  – If range bin D is similar to range bin C, then

$$w_{2,1} \leftarrow w_{1,2}, \qquad w_{2,2} \leftarrow w_{1,3}, \qquad w_{2,3} \leftarrow w_{1,3}$$

  – If range bin D is similar to range bin A, then

$$w_{2,1} \leftarrow w_{1,2}, \qquad w_{2,2} \leftarrow w_{1,3}, \qquad w_{2,3} \leftarrow w_{1,1}$$
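The initial-value idea can be tried on a small example. The sketch below makes several simplifying assumptions: the data are real-valued, so the Hermitian transpose reduces to a plain transpose; each range bin is a short list of samples, so every entry of ψ is a dot product; and the helper names are hypothetical. It solves the first system from a zero start and then seeds the second solve with the shifted weights.

```python
def gram(bins):
    """psi = x * x^H for real data: entry (i, j) is the dot product of bins i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in bins] for u in bins]


def conjugate_gradient(matrix, rhs, x0, tol=1e-10, max_iter=1000):
    """Solve matrix * x = rhs for a symmetric positive definite matrix.

    Returns (solution, number of iterations used).
    """
    n = len(rhs)

    def matvec(v):
        return [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    x = list(x0)
    r = [b - ax for b, ax in zip(rhs, matvec(x))]
    p = list(r)
    rs_old = dot(r, r)
    iterations = 0
    while rs_old ** 0.5 > tol and iterations < max_iter:
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


def shifted_initial_guess(w1, similar_to):
    """Initial w_2 from w_1: w_{2,1} <- w_{1,2}, w_{2,2} <- w_{1,3}, then by similarity."""
    last = w1[2] if similar_to == "C" else w1[0]
    return [w1[1], w1[2], last]


if __name__ == "__main__":
    A, B = [2.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]
    C, D = [1.0, 0.0, 2.0, 1.0], [1.0, 0.5, 2.0, 0.5]   # D chosen close to C
    s = [1.0, 2.0, 3.0]
    w1, _ = conjugate_gradient(gram([A, B, C]), s, [0.0, 0.0, 0.0])
    psi_2 = gram([B, C, D])
    _, cold = conjugate_gradient(psi_2, s, [0.0, 0.0, 0.0])
    _, warm = conjugate_gradient(psi_2, s, shifted_initial_guess(w1, "C"))
    print("iterations from zero start:", cold, "from shifted start:", warm)
```

On a 3 × 3 system CG converges within a few iterations from any starting point, so any benefit of a good initial guess would only become visible on larger systems or looser tolerances.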
• Program Overview and Introduction (Quad Chart)
• Program Management Status
• Highlights from Year 1
• Highlights from Year 2
• Work to be Completed
Outline
Work to be Completed
• Interfacing of FPGA and GPP/DSP Subsystems
• Implement Parallel SAR Algorithm on GPP/DSP System
• Integrate FPGA FIR Filters for Range and Azimuth Processing for SAR
• Implement Parallel STAP Algorithm for GPP/DSP System
• Integrate FPGA FIR Filters for Range Processing for STAP
• Implement FPGA-based Linear Equation Solver
• Integrate FPGA-based Linear Equation Solver with STAP