47
Introduction to Custom Computing Introduction to Custom Computing Oskar Mencer Oskar Mencer [email protected]

Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

  • Upload
    others

  • View
    23

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Introduction to Custom ComputingIntroduction to Custom Computing

Oskar MencerOskar Mencer [email protected]

Page 2: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

ObjectivesObjectives

• What is Custom Computing, and Why

• Compilation issues

• Run-time issues

• Systems: putting it all together

Page 3: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

Page 4: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

some Historysome History

1st “FPGA” Patents

1984 2002

XilinxXilinxmajor companies

AlteraAltera

SRAM-basedlookup tables

(LUTs)

large SRAM blocks FPGA+Multipliers

199719911st FPL Conference

Reconfigurable -Custom

Computing

DEC PRLPAM Project

SAT Solverson FPGAs

Datastructureson FPGAs

Signal Processingon FPGAs

1989TriscendTriscend

FPGA+Processor

Systolic ArraysNeural Networks

Flexible+Variable Structs.DataflowVLIW,…

RSA EncryptionSpeed Record

National SemiconductorNational Semiconductor

Page 5: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

FPGA versus MicroprocessorFPGA versus Microprocessor

partly from P. Alfke, Xilinx.

Page 6: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

… … and the price per gate goes down …and the price per gate goes down …

data taken partly from Xilinx from P.Alfke, Xilinx.

Page 7: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Time/months

Update 1

Update 2

1 10 20

30 Source: Algotronix Consulting / Xilinx

FPGA versus ASIC Accelerator Card FPGA versus ASIC Accelerator Card

Number of New Users

FPGA accelerator

ASIC accelerator

Page 8: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

HW execution

HW compilation

The Speedup ArgumentThe Speedup Argument

>100 speedup

Execution Time

DIMACS Benchmarks, Xilinx FPGAs vs. GRASP Software

Boolean Satisfiability: FPGA vs Pentium

Page 9: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Source: J. Babb PhD thesis, MIT, 2001

Speedup: FPGA vs MIPS ProcessorSpeedup: FPGA vs MIPS Processor

Page 10: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Source: J. Babb PhD thesis, MIT, 2001

Energy Reduction: FPGA vs MIPS CoreEnergy Reduction: FPGA vs MIPS Core

Page 11: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

Page 12: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

A B C D Z

0 0 0 0 00 0 0 1 00 0 1 0 00 0 1 1 10 1 0 0 10 1 0 1 1 . . . 1 1 0 0 01 1 0 1 01 1 1 0 01 1 1 1 1

Look-Up Table (LUT): A Universal GateLook-Up Table (LUT): A Universal Gate• combinatorial logic: stored in Look-Up Tables• capacity limited by number of inputs not complexity• LUT delay: independent of logic!

16-bit SRAM

Combinatorial Logic

AB

CD

Z

Page 13: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

CLB

CLB

CLB

CLB

S witchMa trix

ProgrammableInterconnect

Xilinx FPGAs: XC4000, Virtex, Virtex IIXilinx FPGAs: XC4000, Virtex, Virtex II

Configurable Logic Block (CLB)

FPGA Chip

LUT

4-LUTs flip-flops

tri-state buffers

carry chain

Page 14: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

CLB

Slice 0Slice 0

LUT Carry

LUT Carry D QCE

PRE

CLR

DQ

CE

PRE

CLR

Slice 1Slice 1

LUT Carry

LUT Carry D QCE

PRE

CLR

D QCE

PRE

CLR

Simplified CLB Structure of Xilinx FPGAsSimplified CLB Structure of Xilinx FPGAs

Page 15: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

The VLSI CAD Productivity GapThe VLSI CAD Productivity Gap

1

Log i

c Tr

a nsi

sto r

s p e

r Ch i

p(K

)

P

rodu

cti v

i t yTr

ans .

/Sta

f f - M

onth

10

100

1,000

10,000

100,000

1,000,000

10,000,000

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000 Logic Transistors/ChipTransistor/Staff Month

58%/Yr. compoundComplexity growth rate

21%/Yr. compoundProductivity growth rate

Source: SEMATECH

1981

1983

1985

1 987

1989

1991

1993

1995

1 997

1999

2003

2001

200 5

2 007

200 9

xx xx x

x

x

Page 16: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

Page 17: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Programming FPGAs ToolflowProgramming FPGAs Toolflow

C/Matlab/Premiere

Compiler

Machinecode

Run-timeinterface

Configurationinformation

Fixed processor Custom processorCustom computing system

Page 18: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

The “Compiler” for FPGAsThe “Compiler” for FPGAs

Conventional Compiler Passes

Architecture Generation

Module Generation

Gate Level CADFPGA Netlist

loop transformationsbitwidth analysispointer analysismemory managementdatastructure transformarchitecture selection

Module instantiationModule connectionControl generation

“instruction set” generationoptimal technology mappingcarry chain generation

FPGA Vendor Tools

FPGA Configuration

Page 19: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Languages for Architecture GenerationLanguages for Architecture Generation

• generate particular instances of a generic architecture, such as a signal processor, a stream architecture, a neural network, a cellular automaton, etc…

• HW Description Languages: VHDL, Verilog

• SW Languages used for Custom Computing• C/C++, Java, Ruby, etc.

• C++ example: ASC – A Stream Compiler

Page 20: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

ASC - A Stream Compiler for FPGAs ASC - A Stream Compiler for FPGAs Bell Labs

ASC generates Stream Architectures based on C++ input.

MemoryMemoryMemoryMemory

featuresdistributed registersdistributed delay FIFOslocal memory blocksexternal MemoryPAM-Blox II modules

(pre-pipelined)

Page 21: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

ASC Example: IDEA EncryptionASC Example: IDEA EncryptionLOOP(8); LoopIndex(i);

word1 = HWmul(word1, key[i*6+0]); word2 = word2 + key[i*6+1]; word3 = word3 + key[i*6+2]; word4 = HWmul(word4, key[i*6+3]);

t2 = word1 ^ word3; t2 = HWmul(t2, key[i*6+4]);

t1 = (t2 + (word2 ^ word4)); t1 = HWmul(t1, key[i*6+5]); t2 = (t1 + t2);

word1 ^= t1; word4 ^= t2;

t2 ^= word2; word2 = word3 ^ t1; word3 = t2;LOOP_END();

1

10

100

1000

Processors XCV300 XCV600 XCV2000

Mbits/s

Performance

ratio ioncommunicatncomputatio high

*“unroll once” “unroll 8x”

OPTIMIZE=REDUNDANT;

Page 22: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

Page 23: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

The “Compiler” for FPGAsThe “Compiler” for FPGAs

Conventional Compiler Passes

Architecture Generation

Module Generation

Gate Level CADFPGA Netlist

loop transformationsbitwidth analysispointer analysismemory managementdatastructure transformarchitecture selection

Module instantiationModule connectionControl generation

“instruction set” generationoptimal technology mappingcarry chain generation

FPGA Vendor Tools

FPGA Configuration

Page 24: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Bitwidth Analysis (Type Size Inference)Bitwidth Analysis (Type Size Inference)

• for programs written in a high-level language

• minimum bitwidth required for • each variable at every

static location of the program • each static operation

of the program• reduces memory bandwidth

dependency, MBD (slide 15, above)

int a;int b;char c;

a = c;

a = a / 2;

b = a >> 4;

b = a + b;

8 bits

7 bits

3 bits

8 bits

7 bits

8 bits 7 bits

Page 25: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

HAGAR: Hardware Graph AcceleratorsHAGAR: Hardware Graph Accelerators

0010100001000110

3

2

1

0

3210

vvvv

vvvv

operations: insert, delete, reachability, …reachability speedup over PC: 10 to 1000 for 1K nodes

Row 3

Row 2

Row 1

Row 0

Col. 0 Col. 1 Col. 2 Col. 3

Flip-flop

.. .

. ..

..

..

.....

....

.. . ..

.

..

.. ...

11

11

1111

11

TristateBuffer

Memory Latency Dependent, e.g. data structures, pointers implement data-structure + algorithm in hardware

Page 26: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

from Lorenz Huelsbergen

1000 Vertices

100 Vertices

high inter-chip delay

low inter-chip delay

Speedup of ReachabilitySpeedup of ReachabilitySpeedup over Software(simulation results)

Page 27: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

Page 28: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Run-time Reconfiguration Run-time Reconfiguration

FPGAs are reconfigurable within milliseconds!

1. generating configurations• at compile time• at run time

2. scheduling reconfigurations• at compile time - control driven• at run time – data driven

Page 29: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Generating & Scheduling ReconfigurationsGenerating & Scheduling ReconfigurationsApplication

source or executable

Phase 1

Phase 2

FPGA Configurations

C1 C2

Control driven reconfiguration from phase 1 to phase 2.Data driven reconfiguration from C1 to C2=> insert reconfiguration strategy (algorithm) into application

Page 30: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Generating & Scheduling Reconfigurations Generating & Scheduling Reconfigurations

@compile time

Viterbi decoder:adaptive error

correction

@run time

boolean satisfiability

media processing

sche

dulin

g re

conf

igur

atio

ns control driven

generating configurations

data driven

network intrusion detection

image processing

Rijndael AES encryption

Page 31: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Media Processing ExampleMedia Processing Example

• shape-adaptive template matching (SATM)• MPEG4, MPEG7: find arbitrarily shaped object in video

• template contains object of interest• Sum of Absolute Distances of template and video

• adapt design to template and search frame

Page 32: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

SATM: FPGA versus PCSATM: FPGA versus PC • HDTV frames: 1920 by 1080 pixels• for 300 HDTV frames, use PC time as reference TPC : execution time of PC with 1.4GHz Pentium 4 SD : speedup of dynamic design SPD : speedup of partially dynamic design SS : speedup of static design

Page 33: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

• dynamic design superior when:• a large number of consecutive frames• templates do not change often • small templates

T_D : time for dynamic design T_S : time for static design T_PD : time for partially dynamic design

SATM: Performance ResultsSATM: Performance Results

dynamicpartially-dynamic

static

Page 34: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Overview of Custom ComputingOverview of Custom Computing

1. FPGAs: what and why1. FPGAs: what and why FPGA, the big picture FPGA devices programming FPGAs - module generation

4. Putting it altogether4. Putting it altogether system view PAM: a PC-size system SONIC: a PCI card

3. Run-time reconfiguration3. Run-time reconfiguration generating configurations - at compile time, at run time scheduling reconfigurations - control driven, data driven

2. Compiling to FPGAs2. Compiling to FPGAs architecture generation compiler passes - loop transformations - bitwidth analysis

Page 35: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

System View of Custom AcceleratorsSystem View of Custom Accelerators

Microprocessor

FPGA

Memory

Interconnect

from the IBM websiteIntel Pentium

“Custom Accelerator”Local Area NetworkPCI BusPC Card/CardbusMemory Bus / DIMMCustom InterconnectOn-chip Interconnect

Page 36: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Xilinx Virtex-II ProXilinx Virtex-II Pro

Source: Xilinx

Page 37: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Case Studies Case Studies

1 PAM project at DIGITAL PRL (Compaq)• various applications on multi-FPGA platform

2 SONIC project at Imperial College and Sony• broadcast-quality video processing PCI card

Page 38: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Case Study 1: The PAM Project Case Study 1: The PAM Project

PProgrammable rogrammable AActive ctive MMemories emories (source: Mark Shand)(source: Mark Shand)

4 Platforms: - Perle-0 (1989)

50k gate reconfigurable VME board - DECPeRLe-1 (1992)

200k gates, 20ms turnaround - TURBOchannel Pamette (1994) - PCI Pamette (1996) PCI card with 40K - 112K gates

People: Jean Vuillemin Patrice Bertin Didier Roncin Herve Touati

Philippe BoucardHarry PrintzMark Shand

Applications: Long Int. Multiplication RSA Cryptography Dynamic Programming Laplace Heat Equation Viterbi Decoder Sound Synthesis Neural Networks

Goals: Maximum performance Rapid turnaround Exploring how to build and program PAM Exploring the application space (Non-goals: high-level synthesis, platform independence, improving FPGAs)

Stereo VisionHough TransformHigh Energy PhysicsImage AquisitionWireless LAN testbed

All applications are implemented by hand at the gate level using PamDC.

Page 39: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

DECPeRLe-1DECPeRLe-1

Page 40: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Case Study 2: SONICCase Study 2: SONIC

professional video processing professional video processing at SONY and Imperial Collegeat SONY and Imperial College

Page 41: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

PIPE 1 PIPE 2PIPEN-2

PIPEN-1 PIPE N

Host Bus

PIPEFlow BPIPEFlow A

ExternalVideoBuses

LBC(Local BusController)

SONIC architectureSONIC architecture

PIPE = Plug In Processing Element

Page 42: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

PIPE 1 PIPE 2PIPEN-2

PIPEN-1 PIPE N

Host Bus

PIPEFlow BPIPEFlow A

ExternalVideoBuses

LBC(Local BusController)PIPE Router

(PR)

PIPE Engine(PE)

PIPEMemory

(PM)

PIPEFlowLeft

PIPEFlowRight

PIPEFlowA

PIPEFlowB

PIPE Bus

PIPEFlowIn

PIPEFlowOut

local busglobal (flow) bus

PIPE ArchitecturePIPE Architecture

PR provides abstractionfor plug-in builder

Page 43: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

LBC

PE

PR

PM

PE

PRPM

PE

PR

PM

PE

PR

PM

Host BUS

SONIC Processing an ImageSONIC Processing an Image

Image Transfer In

Processing

Image Transfer Out

Configuration of Plug-In

Image Transfer In

Processing

Image Transfer Out

PE

PR PRPM PM

PE

Configuration of Plug-In

Plug-In

PE PEPMPR PR

PM

Page 44: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

SONIC with Multiple Plug-insSONIC with Multiple Plug-ins

PE

PRPM

LBC

PE

PR

PM

PE

PR

PM

PE

PR

PM

PE

PR

PM

PE

PR

PM

Host BUS

Plug-In

PE

PR

PM

Plug-In2

Plug-In3

Plug-In1 1

Page 45: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

SONIC versus PentiumSONIC versus Pentium

0

510

15

20

2530

35

40

2-D FIR 3x3 FIR 2-DTransform

ColourCorrector

3x3 FIR +Colour

Corrector

Task

Spe

ed-U

p 1 PIPE2 PIPEs4 PIPEs6 PIPEs

Speedup Compared with an MMX Pentium II 300Mhz

Page 46: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Current Generation: UltraSONICCurrent Generation: UltraSONIC

• PE, PR on one FPGA• 8MB SRAM on PIPE• up to 16 PIPEs• 64-bit, 66MHz PCI• real-time image registration

Page 47: Introduction to Custom Computingoskar/CCintro1h.pdf · word4 ^= t2; t2 ^= word2; word2 = word3 ^ t1; word3 = t2; LOOP_END(); 1 10 100 1000 Processors XCV300 XCV600 XCV2000 Mbits/s

Pointers to Conferences and JournalsPointers to Conferences and Journals

FPGA ConferencesFPGA Conferences FPGA Conference FCCM FPL FPT ERSAHardware Design ICCAD DACComputer Architecture ISCA, ISC, HPCA, ASPLOS, MICRO

JournalsJournals IEEE Transactions on Computers

IEEE Transactions on VLSI

IEEE Transactions on CAD

Kluwer Journal on VLSI and Signal Processing

ACM Transactions on Architecture and Code Optimization