
Optimal Configuration of Combined GPP/DSP/FPGA Systems for Minimal SWAP

Presented by John K. Antonio

University of Oklahoma

Second Annual Review, September 23, 1999

Outline

• Program Overview and Introduction (Quad Chart)

• Program Management Status

• Highlights from Year 1

• Work to be Completed

Configuring Combined GPP/DSP/FPGA Systems for Minimal SWAP

University of Oklahoma: John K. Antonio and Sudarshan K. Dhall

Applications
• SAR
• STAP

Requirements
• Throughput
• SWAP

• Combined Technology
• Minimal SWAP Configuration
• Mixed-Mode Operation
• Demonstration

New Ideas
• Systematic determination of minimal SWAP configuration based on proven mathematical programming techniques
• Optimal configuration based on automatic “tuning” of system design parameters
  - number and types of cards used
  - data mapping and communication schemes
  - place and route schemes
• Novel computing techniques based on characteristics of GPP/DSP/FPGA system

Schedule (Jun 97 Start, Jun 98, Jun 99, Jun 00, Dec 00 End)
• Develop optimal configuration techniques
• Construction and integration of GPP/DSP/FPGA system
• Implement and test optimal configurations on GPP/DSP/FPGA system
• Develop practical design methods based on SAR and STAP applications
• Demonstrate advantages of combining technologies

Impact
• Embedded systems requirements for the 21st century can be satisfied with the combined use of GPP, DSP, and FPGA technologies
• Demonstrate use of FPGA boards as co-processors for embedded multiprocessor GPP and DSP systems
• Demonstrate systematic approaches to optimally configure GPP/DSP/FPGA systems for minimal SWAP for embedded applications







Personnel (Program Management Status)

• John K. Antonio, Principal Investigator
  • Ph.D., Texas A&M University
  • Professor/Director of CS, University of Oklahoma
  • Over 70 publications in HPC and related areas
  • PI or co-PI of 17 contracts/grants totaling over $2.1M

• Sudarshan K. Dhall, Co-Principal Investigator
  • Ph.D., University of Illinois
  • Professor of CS, University of Oklahoma
  • Over 80 publications, 2 books, 3rd underway
  • PI or co-PI of grants and contracts totaling about $1M

• Jack West, Research Scholar
  Optimal Mapping, Scheduling, and Configuration Techniques for STAP; Network Simulator; STAP Implementation

• Jeff Muehring, Research Scholar
  Optimal GPP/DSP/FPGA Configuration Techniques for SAR; SAR Implementation
  Intern at IBM/Houston, 8/99 to 1/00
  Research Scholar at OU, 1/00 to 7/00

• Hongping Li, Research Assistant, Ph.D. Student
  Calibration of Power Prediction Simulator, System Interfacing, SAR Implementation

• Sirirut Vanichayobon, Research Assistant, Ph.D. Student
  FPGA-Based Linear Equation Solver for STAP, System Interfacing, STAP Implementation

• Seok-Hyun Ko, Research Assistant, M.S. Student
  Power Simulator Enhancements

• Tim Osmulski, Research Assistant, M.S. Student
  Power Prediction Simulator for FPGAs
  Graduated May 1998

• Nikhil Gupta, Research Assistant, M.S. Student
  Algorithms for STAP Weight Calculation; Mapping Inner Product Computations onto FPGAs
  Graduated August 1998

• Brian Veale, Research Assistant, M.S. Student
  Space and Power Study for High-Performance Integer and Floating Point Reconfigurable Architectures
  Graduated August 1999

Contacts, Partners, Vendors, and Other Communications (Program Management Status)

• DARPA: José Muñoz
• Rome Lab: Ralph Kohler
• MIT Lincoln Lab: David Martinez, Jim Ward
• MITRE: Richard Games
• Northrop Grumman: Marc Campbell
• Synplicity, Inc.: Madelyn Miller
• Xilinx: Jason Feinsmith
• Annapolis Micro Systems: Jenny Donaldson, Bill Hulbert, Paul Kowalewski
• ISI: Milissa Benincasa, David Coker
• Mercury Computer: Thomas Einstein, Ed Holstien, Craig Lund, Dave Toms

Equipment Status (Program Management Status)

Mercury
• 20 Slot Hybrid Chassis with SPARC 5V
• Solaris 2.5 with C Compiler
• MC/OS, Cross Assembler, Toolkit
• MPI-Pro for MC/OS
• 9U VME RACE Board
• 1 SHARC Daughtercard (2 CNs, 8 MB/CN, 3 SHARCs/CN) = 6 SHARCs
• 3 SHARC Daughtercards (2 CNs, 16 MB/CN, 3 SHARCs/CN) = 18 SHARCs
• 4 PowerPC Daughtercards (2 CNs, 16 MB/CN, 1 PPC/CN) = 8 PPCs
• RIN-T Input Card
• ROUT-T Output Card

Annapolis Micro Systems
• 4 PCI WILDONE Cards (Xilinx 4028/4036)
• 4 PCI WILDFORCE Array Cards (5 Xilinx 4085s)
• Interfacing Cables

Other Vendors
• ModelSim Simulation Software (Model Technology, Inc.)
• Synplify Synthesis Software (Synplicity, Inc.)
• Xilinx Foundation Software (Xilinx, Inc.)

Schedule of Milestones (Program Management Status)

[Figure: milestone chart on a timeline marked June 1997, Mar. 1998, Dec. 1998, Sept. 1999, June 2000, and Dec. 2000. Key: GPP/DSP Sub-System, FPGA Sub-System, and GPP/DSP/FPGA System, each split into Research/Design and Implement/Test.]

Milestones shown:
• Design STAP Iterative Weight Solver for FPGA
• Inter-GPP/DSP Comm. Simulator for STAP
• Optimal GPP/DSP Config. for SAR
• GPP/DSP/FPGA Platform Construction and Independent Testing of GPP/DSP and FPGA Subsystems
• Implement STAP Iterative Weight Solver on FPGA
• Optimal GPP/DSP Config. for STAP
• Implement SAR Linear Filtering on FPGA
• Optimal GPP/DSP/FPGA Config. for SAR/STAP
• GPP/DSP and FPGA Subsystem Design, Integration and Testing
• Optimal GPP/DSP/FPGA Config. for SAR
• Demonstrate Combined SAR/STAP on GPP/DSP/FPGA Platform
• Implement SAR on GPP/DSP
• Design SAR Linear Filtering for FPGA
• Implement STAP on GPP/DSP
• Implement SAR on GPP/DSP/FPGA Platform
• Optimal GPP/DSP/FPGA Config. for STAP
• Implement STAP on GPP/DSP/FPGA Platform
• Develop FPGA Power Consumption Simulator
• Test FPGA Power Consumption Simulator

Budget Summary (Program Management Status)

                 Current    Balance on   Projected Expenses   Projected Expenses
                 Budget     8/1/99       8/99-7/00            8/00-12/00
Personnel        246,223    108,635      154,024              52,123
Fringes           72,117     36,051       27,712               9,340
Consulting        40,000     37,000            0                   0
Expenses           9,781      6,261       10,000               5,069
Travel            17,545      4,889       12,000               7,372
Equipment        217,670     42,652       42,652                   0
Indirect Cost    181,262     90,632       87,317              31,674
Total            784,598    326,120      333,705             105,578







Highlights from Year 1

• Optimal Configuration of Compute Nodes for SAR Processing

• Network Simulator

• FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers

• FPGA Power Prediction Simulator

Optimal Configuration of Compute Nodes for SAR Processing

(Highlights from Year 1)

• Motivation and SAR Basics

• Parallelization of SAR Processing

• The Optimal Configuration Problem
  • Formulation
  • Numerical Results

• Conclusions

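Footprint of Aerial Side-Looking SAR

[Figure: nominal UAV payload (“Predator”) with velocity v, showing targets and the radar footprint; axes: azimuth and range.]

[Figure: offset overlapping beams over range swath Rs, showing the real azimuth resolution; axis: azimuth.]

[Figure: synthetic beams over range R and range swath Rs, showing the compressed resolution.]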






Parallelization of SAR Processing

[Figure: range processing (shown across 3 range processors), a distributed corner-turn, and azimuth processing (shown across 4 azimuth processors). Axes: range samples and pulse no. Section labels: Kr and Sa.]

where Sa is the azimuth section length and Kr is the range reference kernel size

Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.

Sectioned Convolution

[Figure: sectioned convolution, with labels: kernel, section, overlap, discard, FFT size.]

• Large overlap/section ratio ⇒ small azimuth memory, large number of azimuth processors
• Small overlap/section ratio ⇒ large azimuth memory, small number of azimuth processors

Reference: T. Einstein, “Realtime Synthetic Aperture Radar Processing on the RACE Multicomputer,” App. Note 203.0, Mercury Computing Sys, 1996.
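The sectioned (overlap-save) fast convolution named on this slide can be sketched as follows. This is a minimal illustration, not the implementation from the referenced application note: the function names are hypothetical, a small recursive radix-2 FFT is included only to keep the example self-contained, and the sketch assumes FFT size = section + (kernel length - 1).

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def sectioned_convolution(x, kernel, fft_size):
    """Overlap-save: first len(x) samples of the linear convolution of x and kernel."""
    if fft_size & (fft_size - 1):
        raise ValueError("fft_size must be a power of two")
    overlap = len(kernel) - 1          # samples shared by neighbouring sections
    section = fft_size - overlap       # valid outputs produced per FFT
    if section <= 0:
        raise ValueError("fft_size must exceed len(kernel) - 1")
    kernel_f = fft(list(kernel) + [0.0] * (fft_size - len(kernel)))
    n_sections = -(-len(x) // section)  # ceiling division
    padded = [0.0] * overlap + list(x) + [0.0] * (n_sections * section - len(x))
    out = []
    for s in range(n_sections):
        block = padded[s * section: s * section + fft_size]
        product = [b * k for b, k in zip(fft(block), kernel_f)]
        y = [v / fft_size for v in fft(product, inverse=True)]
        out.extend(y[overlap:])        # discard the wrapped-around outputs
    return out[:len(x)]


def direct_convolution(x, kernel):
    """Reference result computed directly from the convolution sum."""
    return [sum(kernel[m] * x[n - m] for m in range(len(kernel)) if n - m >= 0)
            for n in range(len(x))]
```

A larger `fft_size` for the same kernel means fewer discarded samples per transform (less work) but a larger buffer per section, which is the memory versus processor tradeoff stated in the bullets.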

System Parameters

• radar-dependent: R (range), Rs (range swath), and λ (wavelength)

• application-dependent: δ (desired resolution) and v (platform velocity)

• processor-dependent: αr and αa (non-fast-convolution range and azimuth loading) and γ (fast convolution throughput)

• software-dependent: Sa (azimuth convolution section length), Fa (azimuth FFT length), and Fr (range FFT length)

Derivations for Memory and Processor Requirements

[Equations: expressions for the range and azimuth processor requirements (Pr, Pa) and the range and azimuth memory requirements (Mr, Ma) in terms of v, δ, R, Rs, λ, αr, αa, γ, Sa, Fa, and Fr.]

Optimal Configuration Formulation

• Objective: Determine configurations for the CNs, the number of CNs of each configuration, and the section size that satisfy the processor and memory requirements and minimize power consumption

• Notation and Definitions:
  • CN Configuration: Specifies the daughtercard type and the number of range and azimuth CEs (per configured CN)
  • X, Y: The two possible CN configurations
  • XT, YT: Daughtercard type for each CN configuration
  • Xr, Yr: Number of range processors per CN (for each configuration)
  • Xa, Ya: Number of azimuth processors per CN (for each configuration)
  • NX, NY: Number of CNs of configurations X and Y
  • ΠCN(•): Power per CN as a function of daughtercard type
  • MCN(•): Memory per CN as a function of daughtercard type
  • PCN(•): Processors per CN as a function of daughtercard type


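Minimize:

\[ Z = N_X\,\Pi_{CN}(X_T) + N_Y\,\Pi_{CN}(Y_T) \]

Subject to:

\[
\begin{aligned}
P_r &\le N_X X_r + N_Y Y_r, \qquad P_a(S_a) \le N_X X_a + N_Y Y_a \\
M_{CN}(X_T) &\ge X_r \frac{M_r}{P_r} + X_a \frac{M_a(S_a)}{P_a(S_a)} \\
M_{CN}(Y_T) &\ge Y_r \frac{M_r}{P_r} + Y_a \frac{M_a(S_a)}{P_a(S_a)} \\
X_r + X_a &\le P_{CN}(X_T), \qquad Y_r + Y_a \le P_{CN}(Y_T) \\
F_a &= 2^k \ge S_a + K_a, \qquad k = 1, 2, \ldots \\
N_X, N_Y, X_r, X_a, Y_r, Y_a &\ge 0, \qquad S_a \ge 1
\end{aligned}
\]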
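The formulation above can be explored on a toy instance with an exhaustive search. The sketch below is illustrative only: the daughtercard catalogue, the kernel length, and the `requirements` function are hypothetical stand-ins (the actual requirement formulas come from the derivation slides), and a brute-force enumeration replaces the mathematical programming techniques used in the project.

```python
import itertools

# Hypothetical daughtercard catalogue (illustrative numbers, not from the slides):
# type -> (power per CN in W, memory per CN in MB, processors per CN)
CARDS = {1: (10.0, 16.0, 3), 2: (6.0, 32.0, 1)}
K_A = 64  # hypothetical azimuth kernel length


def requirements(s_a):
    """Hypothetical stand-ins for P_r, M_r, P_a(S_a), and M_a(S_a)."""
    f_a = 1 << (s_a + K_A - 1).bit_length()  # smallest F_a = 2^k >= S_a + K_a
    p_r, m_r = 3.5, 20.0
    p_a = 2.0 * f_a / s_a  # larger overlap/section ratio -> more azimuth processors
    m_a = 0.1 * f_a        # longer sections -> more azimuth memory
    return p_r, m_r, p_a, m_a


def feasible(cfg):
    """Check the processor, memory, and per-CN constraints for one candidate."""
    p_r, m_r, p_a, m_a = requirements(cfg["s_a"])
    if cfg["n_x"] * cfg["x_r"] + cfg["n_y"] * cfg["y_r"] < p_r:
        return False
    if cfg["n_x"] * cfg["x_a"] + cfg["n_y"] * cfg["y_a"] < p_a:
        return False
    for card, r, a in ((cfg["x_t"], cfg["x_r"], cfg["x_a"]),
                       (cfg["y_t"], cfg["y_r"], cfg["y_a"])):
        _, memory, processors = CARDS[card]
        if r + a > processors or r * m_r / p_r + a * m_a / p_a > memory:
            return False
    return True


def splits(card):
    """All (range, azimuth) processor counts that fit on one CN of this type."""
    processors = CARDS[card][2]
    return [(r, a) for r in range(processors + 1) for a in range(processors + 1 - r)]


def optimize(max_cns=8, sections=(64, 192, 448, 960)):
    """Return (minimum power Z, configuration) over the enumerated candidates."""
    best = None
    for s_a in sections:
        for x_t, y_t in itertools.product(CARDS, repeat=2):
            for (x_r, x_a), (y_r, y_a) in itertools.product(splits(x_t), splits(y_t)):
                for n_x, n_y in itertools.product(range(max_cns + 1), repeat=2):
                    cfg = dict(n_x=n_x, x_t=x_t, x_r=x_r, x_a=x_a,
                               n_y=n_y, y_t=y_t, y_r=y_r, y_a=y_a, s_a=s_a)
                    if not feasible(cfg):
                        continue
                    z = n_x * CARDS[x_t][0] + n_y * CARDS[y_t][0]
                    if best is None or z < best[0]:
                        best = (z, cfg)
    return best
```

Calling `optimize()` returns the lowest-power feasible candidate for the made-up data; swapping in real requirement formulas and card data would only require changing `CARDS` and `requirements`.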

Numerical Results

[Figure: plots over the resolution (0.5 to 2) versus velocity (50 to 400) plane. Panels: Minimum Power, Azimuth FFT Size, Optimal Azimuth Section Size, Optimal Ratio of Kernel Size to Section Size, Percentage of Power Usage by Card Type 1, and Optimal CN Configurations.]

Optimal CN configurations labeled in the plot regions (XT, Xr, Xa, YT, Yr, Ya) include:
1 1 2 2 0 1
1 2 1 2 0 2
1 3 0 2 0 2
1 3 0 2 1 1
2 0 2 2 1 1
1 1 2 2 1 1
2 1 1 2 2 0
1 1 2 2 0 2

Conclusions

• A method for optimally configuring CN-based parallel systems for SAR processing was introduced.

• The method provides detailed HW and SW design and implementation information about how to best utilize system resources for given values of application parameters.

• The numerical studies show that the optimal ratio of daughtercard types can be relatively constant over regions of the application parameter space.

• For a fixed hardware configuration, the CNs can be re-configured (via software re-configuration) to achieve optimal power consumption over specified regions.






Network Simulator (Highlights from Year 1)

• Parallel STAP: The Motivation behind the Network Simulator

• Overview of the Network Simulator

• Numerical Studies

• Conclusions

STAP Processing Flow

[Figure: STAP processing flow on the data cube (axes: channels, pulses, range). Stages shown: pulse compress, rotate, Doppler filter, rotate, QR decomposition, beamform. Other labels: input data, steering vectors, weights, beam outputs.]

Sub-Cube Bar Partitioning Methodology

1. Partition STAP data cube over a 2-D process set.

2. Process the contiguous dimension.

3. Re-partition the data cube before processing the next dimension.

4. Rotate the newly distributed data to make the next dimension sequential in memory.

5. Repeat steps 1 through 4 before each processing phase.
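The partition and re-partition steps above can be sketched as a message-size calculation. This is a hedged illustration, not the simulator's code: it assumes channels are split over the rows of the process set while pulses (before) and range cells (after) are split over its columns, so that data moves only between processes in the same row. The function names are hypothetical.

```python
def block_sizes(n, parts):
    """Split n items into `parts` nearly equal contiguous blocks."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def repartition_messages(channels, pulses, ranges, rows, cols):
    """Element counts moved from process (row, src) to process (row, dst).

    Before: each process holds a channel block, a pulse block, and all range cells
    (pulse compression layout). After: each process holds the same channel block,
    all pulses, and a range block (Doppler filtering layout).
    """
    chan = block_sizes(channels, rows)
    pul = block_sizes(pulses, cols)
    rng = block_sizes(ranges, cols)
    messages = {}
    for row in range(rows):
        for src in range(cols):
            for dst in range(cols):
                messages[(row, src), (row, dst)] = chan[row] * pul[src] * rng[dst]
    return messages
```

For example, `repartition_messages(16, 22, 200, 3, 4)` describes a 16-channel, 22-pulse, 200-range-cell cube on a 3 x 4 process set; entries with `src == dst` are data that stays local.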

STAP Data Cube Partitioning Examples

[Figure: a 3 x 4 process set (processes 1 to 12) mapped onto the data cube (axes: channels, pulses, range) for two cases: pulse compression partitioning with the range dimension whole, and Doppler filtering partitioning with the pulses dimension whole.]

STAP Data Cube Repartitioning: Required Data Transfers

[Figure: the 3 x 4 process set (processes 1 to 12) on the data cube before re-partitioning (range dimension is contiguous) and after re-partitioning (pulse dimension is contiguous).]

• Re-partitioning involves exchanging data with the next whole dimension.

• Interprocessor communication is required between processors in the same row.

STAP Data Cube Repartitioning: Data Re-distribution Mapping

[Figure: network interconnection configuration (CNs attached to a 6-port crossbar, with IPC) and the mapping of processes 1 to 12 onto CNs for the pulse compression and Doppler filtering phases; data cube axes: channel, pulses, range.]

Some RACE Network Features

1. 40 MHz clock, 32-bit data paths, 2048-byte circuit-switched packets.

2. Contention resolved using priorities.
   a. User-programmable message priority
   b. Hardware priority assigned at each crossbar along a path (based on complex connection rules)

3. A packet with higher priority preempts (suspends) a lower priority packet (active or inactive) to gain control of a crossbar port.
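As a quick worked example of what the clock rate, data-path width, and packet size listed above imply, the sketch below computes a raw per-link figure. It assumes one 32-bit word is transferred per clock cycle and ignores arbitration, path setup, and preemption, so it is an upper bound under those assumptions rather than a measured bandwidth.

```python
CLOCK_HZ = 40_000_000   # 40 MHz network clock (from the slide)
PATH_BYTES = 32 // 8    # 32-bit data path
PACKET_BYTES = 2048     # maximum packet size


def peak_bytes_per_second(clock_hz=CLOCK_HZ, path_bytes=PATH_BYTES):
    """Raw link rate, assuming one word moves per clock cycle."""
    return clock_hz * path_bytes


def packet_cycles(packet_bytes=PACKET_BYTES, path_bytes=PATH_BYTES):
    """Clock cycles needed to stream one full packet."""
    return -(-packet_bytes // path_bytes)  # ceiling division


def packet_time_us(packet_bytes=PACKET_BYTES, clock_hz=CLOCK_HZ, path_bytes=PATH_BYTES):
    """Streaming time of one packet in microseconds (no setup overhead)."""
    return packet_cycles(packet_bytes, path_bytes) / clock_hz * 1e6
```

Under these assumptions the raw rate is 160 MB/s and a full 2048-byte packet occupies a link for 512 cycles, or 12.8 microseconds.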

RACE Network Interconnect Fat-Tree Topology

[Figure: fat-tree of 6-port crossbars with CNs attached at the leaves, showing a message path from a message source CN to a message destination CN.]

Standard Crossbar Priority Arbitration Algorithm Table

Transaction status:   Active                      Not Yet Active,             Not Yet Active,
                                                  Port E Involved             Port E Not Involved
Hardware Priority     Entry Port   Exit Port      Entry Port   Exit Port      Entry Port   Exit Port
7                     F            A,B,C,D,E      F            A,B,C,D,E      F            A,B,C,D
6                     E            F              E            F              A,B,C,D*     A,B,C,D*
5                     A,B,C,D      F              A,B,C,D      F              A,B,C,D      F
4                     E            A,B,C,D        E            A,B,C,D        -            -
3                     *A,B,C,D     *A,B,C,D,E     A,B,C,D*     A,B,C,D*       -            -
2                     -            -              A,B,C,D      E              -            -
1                     -            -              -            -              -            -

* - Peer Kill Rules Apply

Network Class Details

[Figure: a Network object composed of Crossbar, Link, and Compute Node objects.]

• Compute Node: processor information; outgoing and received message queues; outgoing and received packet stack
• Random Scan: generates pseudo-random CN scan ordering
• Clock: based on network clock frequency (factor of 5); data transfer rate equates to effective network bandwidth
• Network Methods: dynamic network construction; dynamic routing table creation; dynamic CN and CE message traffic generation; simulates packet traffic

Crossbar Class Details

• Crossbar: two parent port connections; four child port connections; internal switch connections; four CN connections for terminal crossbars
• Link: connects crossbar objects; link status: occupied or free
• Crossbar Methods:
  • Implements hardware priority arbitration (top-level algorithm, standard algorithm)
  • Queries port status
  • Routes packets to next location
  • Allocates and frees internal port connections and connected link objects
  • Transmits packet data

Compute Node Class Details

• Compute Node Methods:
  • Manages outgoing and received message queues
  • Manages outgoing and received packet stack
  • Explodes the top outgoing message into packets of size 2048 or less
  • Handles DMA chaining of packets
  • Establishes path through network and transmits packet data

• Packets are self-routing

[Figure: the outgoing message queue (Message 1, Message 2, Message 3, ...) is exploded into the packet stack (Packet 1, Packet 2, Packet 3, Packet 4, ...).]
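The "explode" step described above, which breaks the top outgoing message into packets of at most 2048 bytes, might look like the following minimal sketch. The class and method names are hypothetical and are not taken from the simulator's source; DMA chaining, routing, and arbitration are omitted.

```python
from collections import deque

MAX_PACKET_BYTES = 2048  # packet size limit stated on the slide


def explode(message_bytes, max_packet=MAX_PACKET_BYTES):
    """Return the packet sizes for one message: full packets plus a remainder."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    full, rest = divmod(message_bytes, max_packet)
    return [max_packet] * full + ([rest] if rest else [])


class ComputeNode:
    """Toy compute node holding an outgoing message queue and a packet stack."""

    def __init__(self):
        self.outgoing = deque()   # (message_id, size_in_bytes), oldest first
        self.packet_stack = []    # top of stack is the end of the list

    def enqueue(self, message_id, size):
        self.outgoing.append((message_id, size))

    def explode_top(self):
        """Explode the top outgoing message and push its packets on the stack."""
        message_id, size = self.outgoing.popleft()
        packets = [(message_id, seq, n) for seq, n in enumerate(explode(size), 1)]
        # Push in reverse so that packet 1 ends up on top of the stack.
        self.packet_stack.extend(reversed(packets))
        return packets
```

For instance, a 5000-byte message yields packets of 2048, 2048, and 904 bytes, and the first packet is the next one popped from the stack.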

Simulator UML Sequence Diagram

[Figure: sequence diagram with participants User (actor), Data Cube, Process Set, CN, Network, Crossbar, and Clock. Example inputs: R:200, P:22, C:16; CEs:48; X:6, Y:8; Routing:F; CN Traffic; Phase 1; DMA:Y. Steps shown: build messages (X, Y, mapping matrices, message matrices), Pass 1 and Pass 2 (connection/data transfer, an iterative process), increment simulation clock, clean up. Outputs: messages, time (simulation time = 2 ms).]

Packet UML Statechart: Simulation Pass 1 and Pass 2

[Figure: packet states in the simulation pass subsystem: Start Up, Ready, Active, Blocked, Suspended, Waiting for Kill, Completed.]

Process Set Performance Metric: Communication Phase 1

[Figure: Process Set - Phase 1 (CN:12, R:200, P:22, C:16, Routing:F); count vs. time (ms), 0.5 to 2 ms, for process sets CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), and CN 12 (4x9).]

Process Set Performance Metric

[Figure: count vs. time (ms), 3 to 6 ms, for process sets CN 12 (12x3), CN 12 (9x4), CN 12 (6x6), and CN 12 (4x9).]

Message Traffic Performance Metric

[Figure: Message Traffic - Phase 1 (CN:16, X:12, Y:4, R:400, P:22, C:16, Routing:EF); count vs. time (ms), 2 to 2.5 ms, for CN traffic and CE traffic.]

Message Traffic Performance Metric

[Figure: count vs. time (ms), 10 to 25 ms.]

DMA Chaining Performance Metric

[Figure: DMA Chaining - Phase 1 (CE:24, X:8, Y:3, R:800, P:32, C:22, Routing:F); count vs. time (ms), 14 to 22 ms, for chaining and no chaining.]

DMA Chaining Performance Metric

[Figure: count vs. time (ms), 21 to 27 ms.]

Conclusions

1. Designed and implemented a platform-independent simulator.

2. The simulator demonstrates that the process set, the CN or CE message traffic, DMA chaining, adaptive routing, and the scheduling of messages all affect performance.

3. Allows users to experiment with possible current and future configurations.

4. The communication pattern is implemented for STAP but may be used for other applications with a phased communication pattern.






FPGA Inner-Product Co-Processor Designs for STAP Weight Solvers


• Overview of STAP Weight Calculation

• Two FPGA Inner-Product Circuit Designs

• Numerical Accuracy Studies

• Conclusions

Typical STAP Processing Flow

[Figure: input data, pulse compress, Doppler filter, weight computation (with covariance matrix and steering vector), weight application, threshold detection, and target decision; data cubes with axes pulses/range and Doppler/range; annotated percentages: 8%, 91.5%, 0.5%.]

Space-Time Adaptive Processing

Kth-Order Doppler Factored STAP

• Effective partially adaptive STAP technique

• The architecture consists of
  • Doppler processing across all pulse repetition intervals
  • Adaptive filtering across
    • all channels and
    • K adjacent Doppler bins

\[ x(k,r):\ \hat{N} \times 1 = 3N \times 1 \]

\[ \psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) \]

Kth-Order Doppler Factored STAP

[Figure: data matrix needed for calculating the covariance matrix for the kth Doppler bin and bth range segment using Kth-order Doppler factored STAP with K = 3. Axes: N channels, Doppler bins (k - 1), k, (k + 1), and the bth range segment (with Lr cells).]

Matrix-Based Derivation of ψ(k,b)

\[ X(k,b):\ \hat{N} \times L_r = 3N \times L_r \]

\[ \psi(k,b) = \frac{1}{L_r} \sum_{r=(b-1)L_r+1}^{bL_r} x(k,r)\, x^H(k,r) = \frac{1}{L_r}\, X(k,b)\, X^H(k,b) \]

The Weight Equation:

\[ \psi(k,b)\, w(k,b) = s \]

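STAP Weight Calculation

Using the QR-decomposition method to solve the weight equation ψw = s:

\[
\begin{aligned}
\psi(k,b)\, w(k,b) &= s \\
\frac{1}{L_r}\, X(k,b)\, X^H(k,b)\, w(k,b) &= s \\
\text{Take QR decomposition: } X^T(k,b) &= QR \\
\frac{1}{L_r}\, R^T Q^T Q^* R^*\, w(k,b) &= s \\
\frac{1}{L_r}\, R^T R^*\, w(k,b) &= s \\
\text{Note that } R &= \begin{bmatrix} R_1^T & 0 \end{bmatrix}^T \\
R_1^T R_1^*\, w(k,b) &= L_r\, s
\end{aligned}
\]

Let \( R_1^*\, w(k,b) = p \).

Solve \( R_1^T\, p = L_r\, s \) for \( p \) using forward elimination.

Solve \( R_1^*\, w(k,b) = p \) for \( w(k,b) \) using backward substitution.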

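STAP Weight Calculation

Using the conjugate gradient method to solve the weight equation ψw = s:

Initialization

\[ \text{Choose } w_0, \text{ set } g_0 = \psi w_0 - s, \quad d_0 = -g_0 \]

Iteration

\[
\begin{aligned}
w_{i+1} &= w_i - \frac{g_i^T d_i}{d_i^T \psi\, d_i}\, d_i \\
g_{i+1} &= \psi\, w_{i+1} - s \\
d_{i+1} &= -g_{i+1} + \frac{g_{i+1}^T \psi\, d_i}{d_i^T \psi\, d_i}\, d_i
\end{aligned}
\]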
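The conjugate gradient initialization and iteration above can be sketched in a few lines. This is an illustration under simplifying assumptions: it uses a real symmetric positive definite matrix (the STAP covariance matrix is complex Hermitian), starts from w0 = 0, and stops on the residual norm; the function names and the tolerance are hypothetical.

```python
def mat_vec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def conjugate_gradient(psi, s, tol=1e-10, max_iter=None):
    """Solve psi w = s for a real symmetric positive definite psi.

    Returns (w, iterations). Starts from w0 = 0 with g0 = psi w0 - s, d0 = -g0.
    """
    n = len(s)
    w = [0.0] * n
    g = [gi - si for gi, si in zip(mat_vec(psi, w), s)]
    d = [-gi for gi in g]
    iterations = 0
    for _ in range(max_iter or 10 * n):
        if dot(g, g) ** 0.5 <= tol:
            break
        psi_d = mat_vec(psi, d)
        curvature = dot(d, psi_d)                      # d_i^T psi d_i
        alpha = -dot(g, d) / curvature
        w = [wi + alpha * di for wi, di in zip(w, d)]  # w_{i+1}
        g = [gi - si for gi, si in zip(mat_vec(psi, w), s)]  # g_{i+1}
        beta = dot(g, psi_d) / curvature               # uses symmetry of psi
        d = [-gi + beta * di for gi, di in zip(g, d)]  # d_{i+1}
        iterations += 1
    return w, iterations
```

Loosening `tol` ends the loop after fewer iterations, which is the accuracy versus operation count tradeoff that distinguishes this solver from a direct QR solve.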

Numerical Studies

[Figure: flop count (10^8 to 10^10) vs. tolerance (10^-8 to 10^-1) for the CG and QR weight solvers; one panel for Lr = 125 and one for Lr = 250.]

FPGA Inner Product Co-Processor: Design 1

[Figure: host processor, interconnection bus, and FPGA board with two input buffers, a multiplier (X), 1's comp/register, normalizing unit, adder (+), and output register; operands a and b; sign + 16-bit mantissa data path.]

• Multiply-Accumulate Pipe
• Reads two block floating point operands per cycle
• Performs two operations per cycle
• Performs exponent normalization prior to accumulation
• 2 N-vectors reduced to a constant number of partial sums

FPGA Inner Product Co-Processor: Design 2

[Figure: host processor, interconnection bus, and FPGA board with two input buffers, two multipliers (X X) fed by data for the first multiplier and data for the second multiplier, 1's comp/register, and adder (+); sign a, sign b; sign + 16-bit mantissa data path; 2 ff; unit clocked here.]

• Multiply-Add Reduction Pipe
• Reads four operands per cycle
• Performs three operations per cycle
• No normalization required
• 2 N-vectors reduced to N/2 partial sums

• Basic Tradeoff: First design has lower throughput, but can perform more work

Two Orders of Magnitude Experiment

[Figure: four histograms (vertical axes: frequency): Accuracy Histogram, Design 1 (horizontal axis 0.999893 to 0.999927); Data Histogram (0 to 110); Exponent Histogram (119 to 145); and an unlabeled histogram (0.99399 to 0.99999).]

Five Orders of Magnitude Experiment

[Figure: four histograms (vertical axes: frequency): an unlabeled histogram (0.999912 to 0.999998); Data Value Histogram (0 to 103008); Exponent Histogram (119 to 143); and an unlabeled histogram (0.00000 to 0.99998).]

“Outlier” Experiment

[Figure: four histograms (vertical axes: frequency): an unlabeled histogram (0.00 to 0.92); Exponent Histogram (114 to 138); Data Value Histogram (0.0009 to 1000.0000); and an unlabeled histogram (0.593067 to 0.78369) with the outlier marked.]

Conclusions

• CG weight solver provides a tradeoff between accuracy and required FLOPs (compared to the QR weight solver)

• Tradeoff between two FPGA designs: Design 1 (Mult & Accum) has lower peak throughput, but can perform more total work than Design 2

• Block floating point provides acceptable accuracy for uniformly distributed data over reasonable dynamic ranges

• Block floating point accuracy breaks down when there are a few large outliers in the data set
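The block floating point behavior summarized above can be illustrated with a small sketch. It assumes one shared exponent per vector and a sign plus 16-bit mantissa magnitude per element; the exact hardware format and rounding mode are not given on the slides, so the details below (round to nearest, clamping) are assumptions, and the function names are hypothetical.

```python
import math


def block_quantize(values, mant_bits=16):
    """Quantize a vector to integer mantissas sharing one block exponent.

    Each value is approximated by mantissa * 2**(exp - mant_bits).
    """
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    _, exp = math.frexp(peak)          # peak = f * 2**exp with 0.5 <= f < 1
    limit = (1 << mant_bits) - 1
    scale = math.ldexp(1.0, mant_bits - exp)
    mants = [max(-limit, min(limit, round(v * scale))) for v in values]
    return mants, exp


def block_inner_product(a, b, mant_bits=16):
    """Inner product computed on the quantized mantissas (exact integer sum)."""
    ma, ea = block_quantize(a, mant_bits)
    mb, eb = block_quantize(b, mant_bits)
    acc = sum(x * y for x, y in zip(ma, mb))
    return math.ldexp(float(acc), ea + eb - 2 * mant_bits)
```

With values spread over about two orders of magnitude, such as `[1.0, 1.5, 100.0]`, every element keeps several significant bits. With one large outlier, such as `[1.0, 1.5, 1e9]`, the shared exponent is set by the outlier and the small elements quantize to zero.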






FPGA Power Prediction Simulator


• CMOS Power Consumption and Past Research

• Design and Implementation of the Power Prediction Simulator

• Conclusions and Demo

Power Dissipation in CMOS

• Leakage Current
• Dynamic Capacitance Charging Current
• Transient Current

Annotations on the slide: most important for CMOS; dependent on clock frequency; dependent on signal activity.

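Power Equations

Equivalent model of a transistor’s gate...

\[ v_c(t) = V \left( 1 - e^{-t/RC} \right) \]

\[ v_R(t) = V e^{-t/RC} \]

\[ p_R(t) = \frac{V^2 e^{-2t/RC}}{R} \]

\[ p_{avg} = \frac{1}{\tau} \int_0^{\tau} \frac{V^2 e^{-2t/RC}}{R}\, dt = \frac{C V^2}{2\tau} \int_0^{\tau} \frac{2}{RC}\, e^{-2t/RC}\, dt \]

\[ p_{avg} = \frac{C V^2}{2\tau} \left( 1 - e^{-2\tau/RC} \right) \approx \frac{C V^2}{2\tau} \]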

Probabilistic Modeling

p(s): the probability that signal s attains a logical value of true at any given clock cycle.

A(s): the probability that signal s transitions at any given clock cycle.

Example values: p(clock) = 0.50, A(clock) = 1.0; p(x1) = 0.88, A(x1) = 0.10; p(x2) = 0.29, A(x2) = 0.17; p(x3) = 0.69, A(x3) = 0.27

Probabilistic Modeling

[Figure: gate network with inputs x1, x2, x3 and output y, with waveforms for x1(t), x2(t), x3(t), and two internal product signals. Annotations in order: p=0.88, A=0.10; p=0.29, A=0.17; p=0.69, A=0.27; p=0.83, A=0.17; p=0.10, A=0.13.]

Calculation of average power:

\[ P_{avg} = \frac{1}{2} V^2 \sum_{g \,\in\, \text{all gates}} C_g A_g \]

Probabilistic Equations

Signal probability transformations... *

Signal activity transformations... †

[Equations: expressions for the output signal probability p(y) and the output signal activity A(y) of a logic function f in terms of the probabilities and activities of its inputs.]

* Probabilistic Treatment of General Combinatorial Networks
† Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching

FPGA Design

[Figure: FPGA internal structure design, showing CLB, IOB, and BUF elements.]

Routing Fabric Design

[Figure: example routings.]

Xilinx 4000 series routing fabric is very intricate.

Xilinx synthesis tools use shortest path routing where possible.

The distance the signal travels is the metric considered in this model.

Signal Design

[Figure: signal object with fields: symbolic probability, numeric probability, numeric activity, signal reference, and Manhattan distance; a local signal (L) stays within a CLB and a remote signal (R) connects CLBs.]

Routing Example

[Figure: routing signal connections among LUTs through the interconnection, with each connection marked local (L) or remote (R).]

Conclusions

• Designed and implemented a power prediction simulator for Xilinx 4000 series FPGAs.

• Inputs to simulator:
  • Place & Route bit stream (from Xilinx Tool)
  • Activity and Probability factors for pin signals

• Simulator calculates probabilities and activities for all internal signals

• Tool outputs power consumption of FPGA chip

• Currently calibrating/tuning simulator using both heat and DC current measurement cross-calibration methods






Outline


• Efforts to Calibrate the FPGA Power Prediction Simulator

• Comparison of Integer and Floating Point Computations on FPGAs

• Architecture of Prototype System for SAR and STAP Processing

• Integration of Reconfigurable Computing into SAR

• Configuration Technique for STAP

Basic Approach to Calibration

• N x N array of CLBs (configurable logic blocks)
• Programmable interconnect
• Let S denote the set of all internal signals for a configuration and Si denote all signals of length i
• Let Ai denote the sum of activities for all signals of length i
• 2N + 1 distinct capacitances (C) dependent on signal length

\[ P_{avg} = \frac{1}{2} V^2 f \sum_{s \in S} C_{d(s)} A_s \]

\[ P_{avg} = \frac{1}{2} V^2 f \left( C_0 \sum_{s \in S_0} A_s + C_1 \sum_{s \in S_1} A_s + \cdots + C_{2N} \sum_{s \in S_{2N}} A_s \right) \]

Basic Approach to Calibration

\[
\frac{1}{2} V^2 f
\begin{bmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,2N} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,2N} \\
\vdots & \vdots & \ddots & \vdots \\
A_{2N,0} & A_{2N,1} & \cdots & A_{2N,2N}
\end{bmatrix}
\begin{bmatrix} C_0 \\ C_1 \\ \vdots \\ C_{2N} \end{bmatrix}
=
\begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_{2N} \end{bmatrix}
\]

• For the j-th design/data set combination:
  • let Pj denote the measured power
  • let Aj,k denote the aggregate activity of all signals of length k

• For each design/data set combination, the simulator provides the values for one row of the above matrix

• Given 2N + 1 measured values for Pj, the unknown capacitance values are then determined. This is how the simulator is calibrated.
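The calibration step described above amounts to solving a square linear system for the capacitances. The sketch below shows this on a tiny made-up case with three signal lengths; the activity matrix, voltage, frequency, and capacitance values are hypothetical, and plain Gaussian elimination stands in for whatever solver the project actually used.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("activity matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def predict_power(activity_row, caps, v, f):
    """P = 1/2 V^2 f sum_k C_k A_k for one design/data set combination."""
    return 0.5 * v * v * f * sum(a * c for a, c in zip(activity_row, caps))


def calibrate(activity, measured_power, v, f):
    """Recover one capacitance per signal length from measured powers.

    activity[j][k] is the aggregate activity of length-k signals in combination j.
    """
    k = 0.5 * v * v * f
    return solve_linear([[k * a for a in row] for row in activity], measured_power)
```

In practice the design/data set combinations would need to be chosen so that the activity matrix is well conditioned; otherwise small measurement errors in the powers are amplified in the recovered capacitances.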

Efforts to Calibrate the Simulator

• For the Xilinx 4036 family of parts, N = 36

• Generated a total of 73 (= 2N + 1) design/data set combinations

• Created a utility for generating data sets with specified statistics

• Created a utility for computing statistics associated with a given data set

• Attempts at Measuring Consumed Power
  • Heat
  • Current

[Figure: Heat Measurement Approach]

[Figure: Heat Measurement Approach (continued)]

[Figure: Current Measurement Approach]







Comparison of Integer and Floating Point Computations on FPGAs


• Integer Pipelined Multiplier

• Floating Point Pipelined Multiplier

• Floating Point Pipelined Adder

• Comparison of Two Inner-Product Designs

• Conclusions

Array-Based Integer Multiplier

[Figure: partial products b0A through b11A feeding a chain of carry-save adders, CSA 0 through CSA 9 (sum and carry outputs), followed by a propagate adder.]

Carry-Save Adders in a 5-bit Multiplier

[Figure: rows of half adders and full adders forming CSA 0, CSA 1, and CSA 2 (partial products such as b3a0 through b3a4; sum and carry outputs), followed by a propagate adder built from full adders and a half adder.]

[Figure: CSA 9 and the propagate adder, producing the upper 13 bits of the product.]

Pipelining Results for Array-Based Integer Multiplier

• The Wild-One system runs at a maximum speed of 50 MHz

• The 4036xla has more routing resources than the 4028ex

• Table shows maximum achieved clock rate as a function of the number of pipeline stages employed

# of stages    Speed (MHz)
               4028ex    4036xla
1              14        28
2              19        25
3              21        N/A
4              22        27
5              29        28
6              39        28
7              22        29
8              33        50

16-bit Floating-Point Format

• The floating point format chosen is a 16-bit format supported by the ADSP-2106x family of SHARC DSP processors

• The exponent is represented in excess-7 notation

• Range: ±1.5625×10^-2 to ±2.559375×10^2

Short Word Floating-Point Format: bit 15 is the sign (s); bits 14 to 11 are the exponent (e3 ... e0); bits 10 to 0 are the fraction (f10 ... f0) of the mantissa 1.f
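The format described above can be checked with a small decoder. The sketch follows only what the slides state (sign in bit 15, a 4-bit excess-7 exponent, an 11-bit fraction with an implied leading 1) and treats exponent 0 as zero and exponent 15 as infinity, as the multiplier and adder slides do; it has not been checked against the ADSP-2106x documentation, and the function name is hypothetical.

```python
def decode_short_float(word):
    """Decode a 16-bit short word: sign (1 bit), exponent (4 bits), fraction (11 bits)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    sign = -1.0 if (word >> 15) & 1 else 1.0
    exponent = (word >> 11) & 0xF
    fraction = word & 0x7FF
    if exponent == 0:
        return sign * 0.0            # exponent 0 is treated as zero
    if exponent == 15:
        return sign * float("inf")   # exponent 15 represents infinity
    return sign * (1.0 + fraction / 2048.0) * 2.0 ** (exponent - 7)
```

Decoding the smallest and largest normal encodings (`0x0800` and `0x77FF`) gives 0.015625 and 255.9375, which matches the range quoted on the slide.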

Floating Point Multiplier

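[Figure: floating point multiplier datapath. Labels recoverable from the diagram:]

• Inputs: signs s1, s2; exponents e1, e2; mantissas 1.m1, 1.m2
• Mantissas: a 12 bit Array-Based Multiplier forms the product of 1.m1 and 1.m2; the upper 13 bits of the product are used
  – If the msb = 1 take the bits msb-1…msb-11
  – If the msb = 0 take the bits msb-2…msb-11
• Exponents: an excess-7 adder combines e1(3)…e1(0) and e2(3)…e2(0), with an exponent adjust select and underflow (unf) and overflow (ovf) outputs
  – If underflow = 1, set exponent = 0
  – If overflow = 1, set exponent = 15 (representing infinity)
  – If e1 or e2 = 0, set exponent = 0
  – If e1 or e2 = 15, set exponent = 15
• Output fields: sign (1 bit), exponent (4 bits), mantissa (11 bits)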








Floating Point Adder

[Figure: pipelined floating point adder. Inputs: 1.m1, 1.m2, e1, e2, s1, s2. Outputs: exponent, mantissa, sign. Stages, separated by registers:]

• Check for Absolute Zero and Infinity and Add Phantom Bit
• Compare Exponents by Subtraction (produces difference and pos./neg.)
• Choose Exponent; Align Mantissas
• Add/Subtract Mantissas
• Normalize Mantissa and Adjust Exponent








Inner Product Co-processor Designs

[Figure: two inner-product co-processor block diagrams.]

• Multiply-Add Scheme: Input Buffer, two Pipeline Multipliers, Pipeline Adder, Output Buffer
• Multiply-Accumulate Scheme: Input Buffer, Pipeline Multiplier, Pipeline Adder, Output Buffer

Performance

Co-Processor Type             Speed   # of   # of        # of 3-Input   # of 4-Input   Equivalent   Estimated Power
                              (MHz)   CLBs   Flip-Flops  LUTs           LUTs           Gate Count   Consumption
Integer Multiply-Accumulate   50      622    720         180            794            10076        N/A
Integer Multiply-Add          43      1013   1148        423            1421           16809        415
F.P. Multiply-Accumulate      38      437    414         154            742            8072         454
F.P. Multiply-Add             34      716    654         254            1082           11766        390

Notes:
1. Integer co-processors implemented with 16-bit integer multipliers and 32-bit integer adders
2. The estimated power consumption calculated from power simulator based on simplified (non-calibrated) constants:

Estimated Power = a weighted combination of the activity sums $\sum_{s \in S_1} A_s,\ \sum_{s \in S_2} A_s,\ \ldots,\ \sum_{s \in S_{N-1}} A_s,\ \sum_{s \in S_N} A_s$ [the weights are not recoverable from this copy]

[Figure: F.P. Multiply-Add vs F.P. Multiply-Accumulate, Non-Weighted Activity Values. x-axis: Interconnection Length (0 to 46); y-axis: Activity Value (0 to 3.5); series: Multiply-Add, Multiply-Accumulate.]

[Figure: F.P. Multiply-Add vs F.P. Multiply-Accumulate, Linearly-Weighted Activity Values. x-axis: Interconnection Length (0 to 46); y-axis: Weighted Activity (0 to 70); series: Multiply-Add, Multiply-Accumulate.]








Conclusions

• Developed libraries of efficient integer and floating point pipelined multipliers and adders

• Discovered that increasing the degree of pipelining increases required hardware

• Discovered that increasing the degree of pipelining generally increases maximum clock rate

• 16-bit F.P. inner-product designs require less hardware than integer inner-product designs, which employ 16-bit multiplier(s) and a 32-bit adder

• Multiply-accumulate designs consume more power (estimated) than multiply-add designs due to the requirement for long feedback paths

• Developed 50-page User’s Manual for Annapolis System







Architecture of Prototype System

[Figure: block diagram of the prototype. Labeled elements: Data Source; Reconfigurable Subsystem (Annapolis System in a PC, PCI); DSP/GPP Subsystem (Mercury System with CNs and PEs, SPARC, VME); a second Reconfigurable Subsystem (Annapolis System in a PC, PCI); Data Sink. Links are labeled Custom and 120 MB/sec.]

SAR Processing Flow: Range Compression, Data Transfer, Azimuth Processing (data axes: Range, Azimuth)

STAP Processing Flow: Range Compression, Data Transfer, Doppler Filtering, Data Transfer, Weight Computation (data axes: Range, Channel, Doppler)

Refer to Poster for Physical View of Architecture







Integration of Reconfigurable Computing into SAR


• The SAR Benchmark

• Comparison of Two FIR Filter Designs

• Including FPGAs in the SAR Optimization Formulation

The SAR Benchmark

• Retrieved Benchmark from http://www.rl.af.mil/programs/hpcbench/

• Developed under the ARPA/Tri-Services Rapid Prototyping of Application Specific Signal Processors (RASSP) program

• Two main programs

• Synthetic SAR data generator (400 lines of code)

• Serial SAR processor (1600 lines of code)

• The SAR algorithm is stripmap mode - currently processes 4 frames of hh polarization data






Comparison of Two FIR Filter Designs

[Figure: two four-tap FIR filter structures built from D-Q registers, constant coefficient multipliers (×k0 … ×k3), and adders, with n-bit data paths.]

Serial-Multiply/Parallel Add
• Ease of routing
• Poor modularity

Parallel-Multiply/Serial Add
• Poor routing
• Good modularity

Comparison of Two FIR Filter Designs

• Both designs implemented using fixed-point complex data (16-bit fixed-point real and imaginary components)

• Both designs make use of constant coefficient multipliers (from core generator)

• Four tap serial-multiply/parallel-add filter fit onto one 4036xla part

• Three tap parallel-multiply/serial-add filter fit onto one 4036xla part (insufficient routing resources for four taps)

• Four tap parallel-multiply/serial-add filter implemented across two parts on one board (one 4036 and one 4013)
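The two structures are not defined in the extracted text, so the sketch below rests on an assumption: that "serial-multiply/parallel-add" corresponds to the familiar direct form (a delay line on the input feeding an adder tree) and "parallel-multiply/serial-add" to the transposed form (input broadcast to every multiplier, registers in the adder chain). Under that reading both compute the same output, which the code demonstrates on complex samples. Function names are hypothetical, and floating-point values stand in for the 16-bit fixed-point hardware data.

```python
def fir_direct(samples, coeffs):
    """Direct form: registers delay the input; products are summed together."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]          # shift the new sample in
        outputs.append(sum(k * d for k, d in zip(coeffs, delay_line)))
    return outputs


def fir_transposed(samples, coeffs):
    """Transposed form: input goes to all multipliers; registers hold partial sums."""
    taps = len(coeffs)
    regs = [0] * (taps - 1)                         # registers between adders
    outputs = []
    for x in samples:
        products = [k * x for k in coeffs]          # all multiplies see the same x
        outputs.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its product plus the previous value upstream.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < taps - 1 else 0)
            for i in range(taps - 1)
        ]
    return outputs


coeffs = [1 + 2j, 0.5, -1j, 2]                      # hypothetical four-tap kernel
samples = [1, 2 - 1j, 0, 3j, -1, 4]
print(fir_direct(samples, coeffs) == fir_transposed(samples, coeffs))
```

The broadcast of each input sample to every multiplier in the transposed form is one plausible source of the routing pressure noted on the slide, while its identical tap cells would account for the better modularity.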






Including FPGAs in the SAR Optimization Formulation

• Power estimates must be determined for a range of kernel sizes for both filter designs

• Hybrid designs may exist for multi-chip implementations that yield desired features of both modularity and routability

• Binary optimization variable defines whether entry-FPGA or DSP/GPP subsystems perform range compression

• Real optimization variable defines fraction of azimuth processing divided among GPP/DSP and exit-FPGA subsystems







Configuration Technique for STAP

• Incorporate New Features into the Network Simulator

• Testing and Calibration of the Network Simulator

• Build and Execute RT_STAP Benchmark on Mercury RACE® Computer

• Optimization Problem

• Computational Investigation

NEW FEATURES FOR THE NETWORK SIMULATOR

• Incorporate Software Overhead Times in the Simulation Model
  – Currently, the simulator performs hardware switch-level modeling (i.e., packet level simulation at the crossbar level).
  – Modify the Network Simulator to include software overhead times for two communication protocols.
  – Empirical analysis will be utilized to capture software overhead times for the communication protocols.

• Provide Additional Timing Information from Simulation Runs
  – Currently, the simulator outputs completion times after each corner turn of the STAP data cube.
  – Modify the Network Simulator to output message queue completion times for each Compute Node (CN) sending messages.
  – Message queue completion times will become vital input into the optimization algorithm.

• Add PowerPC Compute Node Configuration to the Simulator

INCORPORATE SOFTWARE OVERHEAD TIMES

• Communication Time for a Message:

$$T_C = T_{O(Software)} + T_{O(Hardware)} + \frac{M}{B}$$

where:

$T_C$ = Completion Time
$T_{O(Software)}$ = Software Overhead Time
$T_{O(Hardware)}$ = Hardware Overhead Time
$M$ = Message Size
$B$ = Network Bandwidth

• Modeled by Simulator: $T_{O(Hardware)}$ and $M/B$
• Include Software Overhead Time in the Simulation Model: $T_{O(Software)}$

SOFTWARE PROTOCOLS

• Two Communication Protocol Times will be added to the Simulation Model
  – DMA MC/OS Communication Times (DMA Transfers between CNs)
  – MPI (Message Passing Interface) Software Layer Communication Times

• Incorporating Software Overhead Times into the Simulation Model will be accomplished through Empirical Analysis.
  – For each of the two software protocols, zero length messages will be sent through the network. Their resulting communication times will be measured.
  – After analysis of multiple runs, the simulator will be calibrated to include both DMA transfer overhead and MPI software overhead.
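A small sketch may help show why zero length messages isolate the overhead terms: with message size M = 0 the bandwidth term vanishes, so the measured time is the sum of the software and hardware overheads. The function names, the averaging step, and all numeric values below are hypothetical; the actual calibration procedure is not specified in the slides.

```python
def communication_time(message_bytes, bandwidth_bytes_per_s,
                       sw_overhead_s, hw_overhead_s):
    """T_C = T_O(Software) + T_O(Hardware) + M / B."""
    return sw_overhead_s + hw_overhead_s + message_bytes / bandwidth_bytes_per_s


def estimate_software_overhead(zero_length_times_s, hw_overhead_s):
    """Average several zero-length measurements and remove the hardware part.

    Assumption: the hardware overhead is already known from the simulator.
    """
    mean_time = sum(zero_length_times_s) / len(zero_length_times_s)
    return mean_time - hw_overhead_s


# Hypothetical numbers for illustration only.
measured = [34e-6, 36e-6, 35e-6]
sw = estimate_software_overhead(measured, hw_overhead_s=5e-6)
print(sw)
print(communication_time(1_200_000, 120e6, sw, 5e-6))
```

Running the estimate separately for DMA transfers and for MPI would yield one overhead constant per protocol, matching the two protocol times the slide proposes to add.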

SOFTWARE COMPONENTS

[Figure: software components on a CN. Labeled elements: User Application; MPI Software Layer; MC/OS Runtime Environment; Interprocessor Communication System (ICS); POSIX API; MCexec; Loadable Device Drivers; ‘DX’ Data Transfer Facility; HARDWARE ABSTRACTION LAYER; DMA Controller; CN ASIC Registers, Interrupts, Timers, etc.; CPU Registers.]

PROPOSED WORK





TESTING AND CALIBRATION OF THE NETWORK SIMULATOR

• Test Specific Communication Patterns to Verify Accuracy of the Network Simulator
  – Implement a Communication Task on the Mercury RACE® Computer
  – Replicate the Communication Task on the Network Simulator
  – Compare the Resultant Completion Times
  – If Necessary, Fine-Tune the Network Simulator

• Two Types of Communication Patterns will be used to Test and Calibrate the Network Simulator
  – Simple Test Patterns (Hand-Calculated Verification)
  – Complex Test Patterns (Empirical Verification)

TESTING AND CALIBRATION WITH SPECIFIC TEST PATTERNS

• Simple Test Patterns (Hand-Calculated Verification)
  – Implement simple test patterns between CNs to verify the accuracy and assist in fine-tuning of the Network Simulator. The test pattern communication time can be hand-calculated for comparison to the simulated result.
    • Single Source Message Tests
    • Two Source Message Tests (Non-Contending Paths)
    • Two Source Message Tests (Contending Paths)
    • N Source Message Tests (Non-Contending Paths)
    • N Source Message Tests (Contending Paths)

• Complex Test Patterns (Empirical Verification)
  – Implement more complex basic communication patterns to test the validity of the simulator. Empirical data from the Mercury Computer implementing the same test pattern will be used to calibrate the Network Simulator.
    • All-to-All Personalized Communication Test
    • Randomized Message Queue Communication Test

SIMPLE TEST PATTERNS: Single Source Message Tests

• Test Plan Development Diagram

[Figure: test plan flow from START to RUN TEST, stepping through the number of messages (Single Message, Two Messages, 3..N Messages), packets per message (Single Packet, Two Packets, 3..P Packets), and crossbars (Single Crossbar, 3..C Crossbars).]

SIMPLE TEST PATTERNS: Two Source Message Tests (Non-Contending Paths)

• Test Plan Development Diagram (For Each Source)

[Figure: test plan flow from START to RUN TEST, stepping through messages per CN (Single Message, Two Messages, 3..N Messages), packets per message, and crossbars (Single Crossbar, 3..C Crossbars), all on non-contending paths.]

SIMPLE TEST PATTERNS: Two Source Message Tests (Contending Paths)

• Test Plan Development Diagram (For Each Source)

[Figure: test plan flow from START to RUN TEST, stepping through messages per CN (Single Message, Two Messages, 3..N Messages), packets per message, and crossbars (Single Crossbar, 3..C Crossbars), all on contending paths.]






MERCURY RACE® COMPUTER CONFIGURATION

[Figure: CNs attached to layered crossbars, with a VME Port and I/O. Legend of CN types: PPC 603e, 16Mb, 100 MHz; 3 SHARC DSPs, 8Mb, 40 MHz; 3 SHARC DSPs, 16Mb, 40 MHz.]

STAP IMPLEMENTATION ON MERCURY RACE® COMPUTER

• Implementation of STAP on the Mercury RACE® Computer involves the following tasks:

– Build the RT_STAP¹ benchmark designed and developed by MITRE (requires MPI software).

– Successfully install and build MPI Software Technology, Inc.’s message passing interface software (MPI/PRO™) for the Mercury Computer (used by the RT_STAP benchmark).

– Build both the sequential host and parallel Mercury Computer versions of the benchmark.

• After successfully building and executing the RT_STAP benchmark on the 8 node PowerPC Mercury RACE® computer, perform the following tasks:

– Analyze the RT_STAP benchmark source code to determine the partitioning of the data (i.e., the mapping) and the scheduling of the messages. Replicate the data partitioning and message scheduling on the Network Simulator.

– Verify the reported communication times from the RT_STAP benchmark with the Network Simulator.

– Modify the RT_STAP source code to allow for specification of mapping and ordering of the data distribution. Verify results with the Network Simulator.

¹ Cain, K.C., Torres, J.A., and Williams, R.T. RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark. MITRE Technical Report MTR 96B0000021, February 1997.

MPI/PRO™ BUILD FOR MERCURY RACE® COMPUTER

• MPI/PRO™ for RACE® is a Commercial Off-the-Shelf Standards-Based Message-Passing Middleware.

• Provides robust messaging and implements the Message Passing Interface (MPI) defined by the Message Passing Interface Forum.

• MPI/PRO™ supports MPI 1.2 extensions.

• Currently supports RACE® PowerPC and i860 CNs.

• MPI/PRO™ is layered on Mercury’s MC/OS development and runtime environment.

RT_STAP BENCHMARK ON MERCURY RACE® COMPUTER

• The RT_STAP benchmark, developed by MITRE, was designed to evaluate the application of scalable, high performance computers to the real time implementation of STAP techniques.

• The benchmark has the capability to vary the sophistication and computational complexity of the adaptive algorithms employed.

• The goal is to build and execute the MITRE RT_STAP benchmark software on an 8 node PPC 603e Mercury Computer (MCOS 4.4.2) using MPI Software Technology, Inc. MPI/PRO.

• The RT_STAP benchmark software employs a QR-decomposition algorithm component in the space-time adaptive processing. A QRD benchmark is also provided to characterize a single processor's performance on QR-decompositions.






OPTIMIZATION PROBLEM

• Overview of the Approach

• Definition of a Class of Mappings for Data Partitioning

• Development of an Objective Function to Evaluate Defined Classes of Mappings

• Implementation of a Genetic Algorithm to Produce Schedules for the Top Mapping Candidates generated by the Mapping Objective Function.
  – Use the Simulator to Evaluate the Communication Performance.

OVERVIEW OF THE APPROACH

[Figure: flow of the approach. Boxes: STAP Data Cube; Select # CNs (P) (P = Allocated Compute Nodes); Select Fixed or Random Mapping; Minimize Mapping (Use Objective Function); Genetic Algorithm (Determine Optimal Schedule); Network Simulator (Estimate Overall Communication Time); Mercury RACE® (Configured with 1..P CNs). An OPTIMIZE loop feeds back through Adjust Allocated P.]

DEFINITION OF A CLASS OF MAPPINGS FOR DATA PARTITIONING

• Let the matrix $T_k$ represent the mapping for the kth processing phase (a 2-d process set with $M$ rows and $N$ columns):

$$T_1 : M_1 \times N_1, \qquad T_2 : M_2 \times N_2, \qquad T_3 : M_3 \times N_3$$

• Equation for the number of CNs: $P = M_k \cdot N_k$

• Possible values for M and N: $(M, N) \in \{(i, j) \mid i \cdot j = P\}$

• Total number of combinations: $\left|\{(i, j) \mid i \cdot j = P\}\right|^{3}$

For Ex. Assume: $P = 12$

The mapping matrices $T_1, T_2, T_3$ could be defined by any one of the following:

$$\{(1 \times 12),\ (2 \times 6),\ (3 \times 4),\ (4 \times 3),\ (6 \times 2),\ (12 \times 1)\}$$

Assuming the CN assignments within a mapping matrix are raster ordered left to right, the total number of combinations is: $6^3 = 6 \cdot 36 = 216$
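The count in the P = 12 example can be reproduced by enumerating the factor pairs of P. This is only a sketch of the counting argument, with hypothetical function names; it assumes, as the slide does, raster ordering within each matrix, so that a mapping is fully determined by its shape.

```python
from itertools import product


def mapping_shapes(p):
    """All (M, N) with M * N == p, i.e. the candidate shapes for one phase."""
    return [(m, p // m) for m in range(1, p + 1) if p % m == 0]


def mapping_combinations(p, phases=3):
    """Every choice of one shape per processing phase (T1, T2, T3)."""
    return list(product(mapping_shapes(p), repeat=phases))


shapes = mapping_shapes(12)
print(shapes)
print(len(mapping_combinations(12)))
```

For P = 12 this lists the six shapes shown on the slide and 216 combinations for the three phases.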

OBJECTIVE FUNCTION DEVELOPMENT: Quality of Mapping

• An objective function can be developed based on the definition of a class of mappings for data partitioning.

Using the following definitions:

$T_k = [T_k(r,c)]_{M_k \times N_k}$ for $k = 1, 2, 3$, such that each $T_k(r,c)$ represents the CN where the data vector is distributed.
$\varepsilon_k$ = { $(i, j)$ | CN i communicates with CN j } for the kth corner turn
$m_{ij}$ = message from CN i to CN j
$|m_{ij}|$ = message size of $m_{ij}$
$d_{ij}$ = minimum number of required crossbar connections for message $m_{ij}$

Corner-Turn from $T_1$ to $T_2$ Produces Messages $\varepsilon_1$. Objective:

$$\min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij}$$

Corner-Turn from $T_2$ to $T_3$ Produces Messages $\varepsilon_2$. Objective:

$$\min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$

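OBJECTIVE FUNCTION DEVELOPMENT: Quality of Mapping

• An objective function for the communication time:

$$k_1 \cdot \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} \;+\; k_2 \cdot \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij}$$

(first term: First Corner Turn; second term: Second Corner Turn)

• An objective function for STAP processing:

$$k_1 \cdot \min \sum_{(i,j) \in \varepsilon_1} |m_{ij}| \cdot d_{ij} \;+\; k_2 \cdot \min \sum_{(i,j) \in \varepsilon_2} |m_{ij}| \cdot d_{ij} \;+\; k_3\,(\text{Range Computation Time}) \;+\; k_4\,(\text{Doppler Computation Time}) \;+\; k_5\,(\text{Weight Computation Time})$$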
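To make the objective concrete, the sketch below evaluates it for one candidate mapping, with each message given as a (size, crossbar connections) pair. The minimization itself, which ranges over candidate mappings, is left to the caller. The function names, the tuple representation, and the numbers are assumptions for illustration and are not taken from the project's code.

```python
def corner_turn_cost(messages):
    """Sum of |m_ij| * d_ij over the messages produced by one corner turn.

    `messages` is an iterable of (size, crossbar_connections) pairs.
    """
    return sum(size * hops for size, hops in messages)


def stap_objective(turn1, turn2, k, compute_times):
    """Weighted total for one mapping: two corner turns plus three compute phases.

    k = (k1, k2, k3, k4, k5); compute_times = (range, doppler, weight).
    """
    k1, k2, k3, k4, k5 = k
    range_time, doppler_time, weight_time = compute_times
    return (k1 * corner_turn_cost(turn1)
            + k2 * corner_turn_cost(turn2)
            + k3 * range_time
            + k4 * doppler_time
            + k5 * weight_time)


# Hypothetical candidate: message sizes in bytes, d_ij as a crossbar count.
first_turn = [(100, 1), (50, 3)]
second_turn = [(10, 2)]
print(stap_objective(first_turn, second_turn,
                     k=(1, 2, 1, 1, 1), compute_times=(5, 6, 7)))
```

Ranking candidate mappings then reduces to calling `min` over them with this function as the key.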

GENETIC ALGORITHMS

• A genetic algorithm (GA) is a population-based model that uses selection and recombination operators to generate new sample points in a search space.

• A GA encodes a potential solution to a specific problem on a chromosome-like data structure and applies recombination operators to these structures so as to preserve critical information.

• Often, GAs are viewed as function optimizers. As a result, researchers are typically interested in GAs as optimization tools.

• Implementation of a GA begins with a population of chromosomes. Once each chromosome is evaluated, reproduction opportunities are applied in such a way that those chromosomes which represent a better solution to the target problem are given more chances to reproduce than chromosomes with poorer solutions.

• Currently, GAs are a promising heuristic approach to locating near-optimal solutions in large search spaces.

GENETIC ALGORITHMS

• A genetic algorithm is typically composed of two main components that are problem dependent:

– The problem encoding
  • The first component involves generating an encoding scheme to represent possible solutions to the optimization problem. Candidate solutions are usually represented as strings of fixed length, like chromosomes, usually coded with a binary character set.

– The evaluation function
  • An evaluation function measures the quality of a particular solution. In this research, the evaluation of a particular candidate will be accomplished by the Network Simulator. The fitness of the candidate from the population space will be measured based on its simulated performance.

• The objective of a GA search is to locate the chromosome that has the optimal fitness value. For this research, if the chromosome represented the scheduling of messages and the fitness value the completion time of the schedule, the objective of the GA would be to find the smallest value (i.e., shortest completion time).

IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC

• Implementation of a GA involves the following steps:¹

– Generate an initial population: This initial population is the first generation where evolution starts. A random set of chromosomes is often used as the initial population.

– An evaluation using the evaluation or fitness function: Evaluate the quality of each chromosome in the initial population.

– A selection mechanism: In this step, chromosomes are duplicated or eliminated based on their relative quality or fitness. The population size is kept constant.

– A crossover mechanism: Some pairs of the chromosomes are selected from the current population, and some of their corresponding components are exchanged to form two valid chromosomes. The new chromosomes may or may not be in the current population.

1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.

IMPLEMENTATION OF A GENETIC ALGORITHM HEURISTIC

• Implementation of a GA involves the following steps (continued):¹

– A mutation mechanism: After a crossover operation, each string in the population may be mutated with some probability. The mutation process transforms a chromosome into another valid one that may or may not be in the population. The motivation for using mutation is to prevent the algorithm from getting stuck in a local minimum.

– Reevaluation of the population: The new population after selection, crossover, and mutation is reevaluated. The fitness value for each new chromosome is computed.

– A set of stopping criteria: The stopping criteria specify the conditions upon which the algorithm terminates. If the stopping criteria have not been met, the new population goes through another cycle of selection, crossover, mutation, and evaluation. This cycle repeats until one of the stopping criteria is met.

1 Wang, L., Siegel, H.J., Roychowdhury, V.P., and Maciejewski, A.A. Task Matching and Scheduling in Heterogeneous Computing Environments using a Genetic Algorithm-Based Approach, Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing.
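The steps listed on these two slides can be sketched as a small permutation GA for ordering messages. Everything specific here is an assumption made for illustration: the chromosome is a message order, the `evaluate` function is a toy stand-in for the Network Simulator (it sums message completion times on a single serial link), selection is a binary tournament, crossover is order-preserving, and the stopping criterion is a fixed generation count. It is not the project's implementation.

```python
import random


def evaluate(schedule, durations):
    """Toy fitness (lower is better): sum of completion times on one serial link."""
    clock, total = 0, 0
    for message in schedule:
        clock += durations[message]
        total += clock
    return total


def order_crossover(parent1, parent2, rng):
    """Keep a slice of parent1; fill the rest in parent2's order (stays a permutation)."""
    a, b = sorted(rng.sample(range(len(parent1)), 2))
    middle = parent1[a:b + 1]
    rest = [gene for gene in parent2 if gene not in middle]
    return rest[:a] + middle + rest[a:]


def mutate(chromosome, rng, rate):
    """With probability `rate`, swap two positions; result is still valid."""
    child = chromosome[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def run_ga(durations, pop_size=20, generations=50, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    n = len(durations)

    def fitness(chromosome):
        return evaluate(chromosome, durations)

    # Initial population: random permutations of the message indices.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    best = min(population, key=fitness)

    for _ in range(generations):                     # stopping criterion
        # Selection: binary tournaments, population size kept constant.
        parents = [min(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        children = []
        for i in range(0, pop_size, 2):
            p1, p2 = parents[i], parents[(i + 1) % pop_size]
            children.append(mutate(order_crossover(p1, p2, rng), rng, mutation_rate))
            children.append(mutate(order_crossover(p2, p1, rng), rng, mutation_rate))
        population = children[:pop_size]
        # Reevaluation: remember the best schedule seen so far.
        best = min(population + [best], key=fitness)
    return best, fitness(best)


durations = [5, 1, 4, 2, 3, 8, 6, 7]                 # hypothetical message times
print(run_ga(durations))
```

In the approach described in the slides, the call to `evaluate` would be replaced by a Network Simulator run for the candidate schedule.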






COMPUTATIONAL INVESTIGATION

• A QR-D computation is deterministic (i.e., its complexity can be calculated).

• A Conjugate Gradient (CG) computation is not deterministic. Its complexity depends on the initial condition and desired tolerance.
  – This work proposes the investigation of the impact of “intelligent” initial condition values on a CG algorithm.

CONJUGATE GRADIENT APPROACH: Investigation of Initial Condition Values

Range bins: A B C D

Solve the following equations:

$$\psi_1(A,B,C)\, w_1 = s, \qquad \psi_2(B,C,D)\, w_2 = s$$

Where:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H, \quad x_1 = \begin{bmatrix} A \\ B \\ C \end{bmatrix}, \qquad \psi_2(B,C,D) = x_2 \cdot x_2^H, \quad x_2 = \begin{bmatrix} B \\ C \\ D \end{bmatrix}$$

$w_1$ = weight vector (likewise $w_2$), $s$ = steering vector

• Expanding $\psi_1(A,B,C)$ and $\psi_2(B,C,D)$ yields the following:

$$\psi_1(A,B,C) = x_1 \cdot x_1^H = \begin{bmatrix} A \\ B \\ C \end{bmatrix} \begin{bmatrix} A^H & B^H & C^H \end{bmatrix} = \begin{bmatrix} AA^H & AB^H & AC^H \\ BA^H & BB^H & BC^H \\ CA^H & CB^H & CC^H \end{bmatrix}$$

$$\psi_2(B,C,D) = x_2 \cdot x_2^H = \begin{bmatrix} B \\ C \\ D \end{bmatrix} \begin{bmatrix} B^H & C^H & D^H \end{bmatrix} = \begin{bmatrix} BB^H & BC^H & BD^H \\ CB^H & CC^H & CD^H \\ DB^H & DC^H & DD^H \end{bmatrix}$$

• Attempting to solve the following equation for $w_1$:

$$\psi_1(A,B,C)\, w_1 = s, \qquad \psi_1(A,B,C) \begin{bmatrix} w_{1,1} \\ w_{1,2} \\ w_{1,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$AA^H w_{1,1} + AB^H w_{1,2} + AC^H w_{1,3} = s_1$$
$$BA^H w_{1,1} + BB^H w_{1,2} + BC^H w_{1,3} = s_2$$
$$CA^H w_{1,1} + CB^H w_{1,2} + CC^H w_{1,3} = s_3$$

• Attempting to solve the following equation for $w_2$:

$$\psi_2(B,C,D)\, w_2 = s, \qquad \psi_2(B,C,D) \begin{bmatrix} w_{2,1} \\ w_{2,2} \\ w_{2,3} \end{bmatrix} = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}$$

Set of Linear Equations:

$$BB^H w_{2,1} + BC^H w_{2,2} + BD^H w_{2,3} = s_1$$
$$CB^H w_{2,1} + CC^H w_{2,2} + CD^H w_{2,3} = s_2$$
$$DB^H w_{2,1} + DC^H w_{2,2} + DD^H w_{2,3} = s_3$$

• Investigation of the two sets of linear equations above reveals similarities among the sets of equations.

• The similarities between the equations may provide insight into the selection of the initial condition values. Assuming the steering vector remains the same for each set of linear equations, the initial values could be assigned as follows:

– If range bin D is similar to range bin C, then

$$w_{2,1} \leftarrow w_{1,2}, \qquad w_{2,2} \leftarrow w_{1,3}, \qquad w_{2,3} \leftarrow w_{1,3}$$

– If range bin D is similar to range bin A, then

$$w_{2,1} \leftarrow w_{1,2}, \qquad w_{2,2} \leftarrow w_{1,3}, \qquad w_{2,3} \leftarrow w_{1,1}$$
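The effect of the initial condition on CG can be illustrated with a small sketch. It makes simplifying assumptions not found in the slides: the matrix is real, symmetric, and positive definite (the STAP systems are complex), the example system is arbitrary, and the function names are hypothetical. `shifted_initial_guess` encodes the two assignment rules from the slide.

```python
def conjugate_gradient(a, b, x0, tol=1e-10, max_iter=100):
    """Solve a x = b for symmetric positive-definite `a`, starting from x0.

    Returns (solution, iterations_used).
    """
    def matvec(m, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(a, x))]   # initial residual
    p = list(r)
    rs = dot(r, r)
    iterations = 0
    while rs ** 0.5 > tol and iterations < max_iter:
        ap = matvec(a, p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        iterations += 1
    return x, iterations


def shifted_initial_guess(w1, similar_to):
    """Initial w2 from the previous solution w1, per the slide's two cases."""
    if similar_to == "C":      # range bin D resembles range bin C
        return [w1[1], w1[2], w1[2]]
    if similar_to == "A":      # range bin D resembles range bin A
        return [w1[1], w1[2], w1[0]]
    raise ValueError("similar_to must be 'A' or 'C'")


a = [[4.0, 1.0], [1.0, 3.0]]                           # arbitrary SPD example
b = [1.0, 2.0]
x_cold, n_cold = conjugate_gradient(a, b, [0.0, 0.0])
x_warm, n_warm = conjugate_gradient(a, b, x_cold)      # start at a good guess
print(n_cold, n_warm)
```

A start that is already close to the solution needs fewer iterations than a zero start, which is the motivation for carrying weights over from the neighboring range-bin solve.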






Outline

Work to be Completed

• Interfacing of FPGA and GPP/DSP Subsystems

• Implement Parallel SAR Algorithm on GPP/DSP System

• Integrate FPGA FIR Filters for Range and Azimuth Processing for SAR

• Implement Parallel STAP Algorithm for GPP/DSP System

• Integrate FPGA FIR Filters for Range Processing for STAP

• Implement FPGA-based Linear Equation Solver

• Integrate FPGA-based Linear Equation Solver with STAP
