Page 1: Low Power Multimedia Reconfigurable Platforms

Low Power Multimedia Reconfigurable Platforms

Jun-Dong Cho
SungKyunKwan Univ., Dept. of ECE, Vada Lab.
http://vada.skku.ac.kr

Page 2: Low Power Multimedia Reconfigurable Platforms

VLSI Algorithmic Design Automation Lab. at SKKU

What are the Challenges? [STMicroelectronics, MorphICs, Dataquest, eASIC]

[Figure: growth factor (1 to 2) versus time in months (0, 10, 12, 18). Curves: communication bandwidth (Hansen’s law); integration density (1.4/year, Moore’s law); µprocessor integration density (1.2/year).]
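The yearly rates quoted on the slide (1.4/year and 1.2/year) compound, so the gap between the two curves widens over time. A minimal sketch of that arithmetic follows; the function names are illustrative and only the two rates come from the slide.

```python
import math


def growth_factor(rate_per_year, years):
    """Compound growth: a rate of 1.4/year means x1.4 every year."""
    return rate_per_year ** years


def doubling_time_months(rate_per_year):
    """Months needed for a quantity growing at rate_per_year to double."""
    return 12.0 * math.log(2.0) / math.log(rate_per_year)


def gap_after(years, fast=1.4, slow=1.2):
    """Ratio between the fast and slow curves after the given number of years."""
    return growth_factor(fast, years) / growth_factor(slow, years)
```

For instance, `gap_after(10)` gives the factor by which integration density at 1.4/year would outgrow µprocessor integration density at 1.2/year over a decade, assuming both rates stay constant.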

Page 3: Low Power Multimedia Reconfigurable Platforms

Reconfigurable System

Reconfigurable systems are suitable for the dynamic application and communication environment of wireless multimedia devices such as SDR.

A hierarchical system model is used in which Quality of Service and energy consumption play a crucial role.

Dynamically partition tasks of an application.

Page 4: Low Power Multimedia Reconfigurable Platforms

Reconfigurable SOC

As technology (and supply voltage) scales down, logic (transistors) becomes virtually free, while interconnect becomes the bottleneck and the main power consumer.

Parallel execution of nested DO-loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.

It offers a compromise among three orthogonal issues: design time, power consumption, and performance.
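Why an array of processing elements at a moderate clock can save power is easiest to see with the standard CMOS switching-power relation P = C · V² · f. The sketch below is an idealized model, not taken from the slides: it assumes the work splits evenly across the elements, that a lower clock permits a lower supply voltage, and it ignores interconnect and leakage. All numeric values are hypothetical.

```python
def dynamic_power(c_eff, vdd, freq):
    """Switching power of CMOS logic: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq


def array_power(n_pe, c_eff_per_pe, vdd, total_rate):
    """n_pe elements share the work, so each one runs at total_rate / n_pe."""
    return n_pe * dynamic_power(c_eff_per_pe, vdd, total_rate / n_pe)


# Hypothetical numbers: one 400 MHz unit at 1.8 V versus four 100 MHz units.
single = dynamic_power(1e-9, 1.8, 400e6)
same_voltage = array_power(4, 1e-9, 1.8, 400e6)     # same power as `single`
lower_voltage = array_power(4, 1e-9, 1.0, 400e6)    # saving comes from V^2
```

At the same supply voltage the array only trades frequency for area; the power saving appears once the slower clock lets the voltage drop.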

Page 5: Low Power Multimedia Reconfigurable Platforms

Context: SoC and Customizable Platform-Based Design

Specifications: processing power, area, power consumption, etc.

[Figure: a platform with controller, CPU, RAM, ROM, and Flash blocks plus an open slot ("?"). Candidates for the slot: ASIC 1, ASIC 2, DSP, reconfigurable hardware (coarse grain), reconfigurable hardware (fine grain).]

We need metrics to compare!

Page 6: Low Power Multimedia Reconfigurable Platforms

First choose the right architecture …

[Figure (Jan Rabaey): flexibility versus area or power for different architecture styles.]

• Architecture styles shown: embedded processor (lpArm), DSP (e.g. TI 320CXX), reconfigurable processors (Maia), embedded FPGA, direct-mapped hardware
• Block labels shown: MAC unit, Addr Gen, P, Prog Mem
• Energy-efficiency ranges shown: .5-5 MIPS/mW, 10-100 MOPS/mW, 100-1000 MOPS/mW
• Spread between the extremes: a factor of 100-1000
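The MOPS/mW figures above convert directly into energy per operation, which makes the comparison concrete. This is a small worked conversion, assuming only the unit definitions; the helper names and the 500 MOPS workload are illustrative. MIPS and MOPS count different things, so the ranges are not strictly comparable.

```python
def energy_per_op_pj(mops_per_mw):
    """Energy per operation in picojoules.

    1 MOPS/mW = 1e6 ops/s per 1e-3 W = 1e9 ops/J, i.e. 1000 pJ per operation.
    """
    return 1000.0 / mops_per_mw


def power_mw(workload_mops, mops_per_mw):
    """Power in mW needed to sustain a workload at a given efficiency."""
    return workload_mops / mops_per_mw
```

Under these definitions a hypothetical 500 MOPS workload needs `power_mw(500, 100)` = 5 mW at 100 MOPS/mW but `power_mw(500, 0.5)` = 1000 mW at 0.5 MOPS/mW.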

Page 7: Low Power Multimedia Reconfigurable Platforms

Design Space of Reconfigurable Architecture

RECONFIGURABLE ARCHITECTURES (R-SOC)

• FINE GRAIN (FPGA): island topology, hierarchical topology
• MULTI GRANULARITY (heterogeneous): processor + coprocessor (coarse-grain coprocessor, fine-grain coprocessor), tile-based architecture
• COARSE GRAIN (systolic): linear topology, hierarchical topology, mesh topology

Example architectures, grouped as on the slide:

• Chameleon, REMARC, Morphosys
• Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC
• Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
• Altera Stratix, Altera Apex, Altera Cyclone
• Systolic Ring, RaPiD, PipeRench
• DART, FPFA
• RAW, CHESS, MATRIX, KressArray, Systolix PulseDSP
• aSoC, E-FPFA

Page 8: Low Power Multimedia Reconfigurable Platforms

Semiconductor Revolutions: “Mainstream Silicon Application is switching every 10 Years”

Makimoto’s Wave

[Figure: Makimoto’s wave, alternating between standard and custom phases on a timeline marked 1957, 1967, 1977, 1987, 1997, 2007. Technology labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable. Annotations: 1st design crisis; 2nd design crisis; hardware people, new breed (M&C); software people, new breed needed; structured VLSI design; instruction streams; data streams.]

Communication gap: terminology clean-up

Page 9: Low Power Multimedia Reconfigurable Platforms

3 different mind sets

[Figure: timeline marked 1957, 1967, 1977, 1987, 1997, 2007. Technology labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGAs; coarse grain; soft CPUs. Groups: hardware people; CS people; new breed needed.]

Common terminology needed

Page 10: Low Power Multimedia Reconfigurable Platforms

Machine paradigms

[Figure: two machine paradigms.]

• von Neumann (instruction stream machine): an instruction sequencer drives the CPU (DPU) with an instruction stream; memory (M) and I/O are attached. Programmed with Software.
• Data-stream machine: data address generators (data sequencers) with memory (asM*) feed data streams to a DPU or rDPU, or to a (r)DPA surrounded by memories (M) and I/O. Programmed with Configware and Flowware.

* embedded memory architecture

Page 11: Low Power Multimedia Reconfigurable Platforms

FPGA Chip vs. DSP Chip

Programming language
• FPGA chip: VHDL, Verilog
• DSP chip: C, assembly language

Ease of software programming
• FPGA chip: Fairly easy; however, a programmer needs to understand the hardware architecture before programming
• DSP chip: Easy

Performance
• FPGA chip: Can be very fast if an appropriate architecture is designed
• DSP chip: Speed is limited by the clock speed of the DSP chip

Reconfigurability
• FPGA chip: SRAM-type FPGAs can be reconfigured an unlimited number of times
• DSP chip: Can be reconfigured by changing the program memory content

Page 12: Low Power Multimedia Reconfigurable Platforms

FPGA Chip vs. DSP Chip (continued)

Reconfiguration method
• FPGA chip: Downloading configuration data to the chip electronically
• DSP chip: Reading a program at a memory address

Application area
• FPGA chip: FIR filter, IIR filter, correlator, convolver, FFT
• DSP chip: A signal processing program

Power consumption
• FPGA chip: Can be minimized if the circuit is designed to save power
• DSP chip: Power consumption does not change

Speed of MAC
• FPGA chip: Can be fast if a parallel algorithm is used
• DSP chip: Limited by the speed of the DSP chip

Parallelism
• FPGA chip: Can be parallelized to achieve high performance
• DSP chip: DSP chip programming is usually sequential
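The MAC-speed and parallelism rows can be illustrated with an FIR filter, one of the applications listed above. The sketch below is an idealized step-count model under stated assumptions: the "sequential" version stands for a single MAC unit visiting every tap, the "parallel" version for one multiplier per tap feeding an adder tree. It ignores pipelining and adder-tree latency, and the function names are illustrative.

```python
def fir_sequential(samples, taps):
    """One MAC unit: each output costs len(taps) multiply-accumulate steps."""
    outputs, mac_steps = [], 0
    for n in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * samples[n - k]
            mac_steps += 1          # the single MAC unit is busy for every tap
        outputs.append(acc)
    return outputs, mac_steps


def fir_parallel(samples, taps):
    """One multiplier per tap plus an adder tree: one step per output sample."""
    outputs, steps = [], 0
    for n in range(len(samples)):
        products = [c * samples[n - k] for k, c in enumerate(taps) if n - k >= 0]
        outputs.append(sum(products))   # all products are formed in the same step
        steps += 1
    return outputs, steps
```

Both versions produce the same outputs; only the modeled step count differs, by a factor equal to the number of taps.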

Page 13: Low Power Multimedia Reconfigurable Platforms

Architecture Choices for Real-time Embedded System

Greg Delagi, TI

Page 14: Low Power Multimedia Reconfigurable Platforms

Fine-Grained RSOCs: Xilinx Virtex-II Pro

• Xilinx, Inc., San Jose, CA
• Up to 4 PowerPC 405 processor cores
• Up to 160k reconfigurable logic cells (4-input, 1-output lookup table)
• Up to 216 dedicated 18-bit x 18-bit multipliers
• Up to 216 18-kbit on-chip distributed memory blocks
• Up to 852 I/O pins
• www.xilinx.com

Page 15: Low Power Multimedia Reconfigurable Platforms

Xilinx’s Xtreme

Page 16: Low Power Multimedia Reconfigurable Platforms

Fine-Grained RSOCs: Altera Excalibur

Altera, San Jose, CA

32-bit ARM9 Based Microprocessor @200 MHz

Up to 256kbytes SRAM

Up to 1M programmable logic gates

200 MHz Bus

Built-in SDRAM Controller

Page 17: Low Power Multimedia Reconfigurable Platforms

Fine-Grained RSOCs: Triscend A7 CSOC

• A7 family, Triscend
• 32-bit ARM7 with 8 kB cache
• 3200 logic cells max. (40K gates)
• Up to 3800 flip-flops
• Up to 300 programmable I/O pins
• www.triscend.com

Page 18: Low Power Multimedia Reconfigurable Platforms

Coarse-Grained RSOCs: Chameleon Structure

Chameleon Systems Inc.

Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters

• 32-bit ARC control processor
• Up to 84 32-bit datapath units (DPUs); DPU = a 32-bit ALU + a 32-bit barrel shifter
• Up to 24 16x24-bit multipliers
• Up to 48 128x32-bit local memory modules
• Up to 160 programmable I/O pins
• Targeted at 3rd-generation wireless base stations, wireless local loop, software radio, etc.
• www.chameleonsystems.com

Page 19: Low Power Multimedia Reconfigurable Platforms

Architectural Rationale and Motivation

• Configurable processors have shown orders of magnitude performance improvements
• Tensilica has shown ~2x to ~50x performance improvements
  • Specialized functional units
  • Memory configurations
• Tensilica matches the architecture with software development tools

[Figure: a base processor (FU, register file, memory, I-cache) is configured into a processor with more FUs plus DCT and Huffman (HUF) blocks. Configuration: set memory parameters; add DCT and Huffman blocks for a JPEG app.]

Scott Weber, University of California at Berkeley

Page 20: Low Power Multimedia Reconfigurable Platforms

Architectural Rationale and Motivation

In order to continue this performance improvement trend:

• Architectural features which exploit more concurrency are required
• Heterogeneous configurations need to be made possible
• Software development tools must support new configuration options

[Figure: evolution of the configured processor, with four captions: “...begins to look like a VLIW...” (core with several FUs plus DCT and HUF blocks); “...concurrent processes are required in order to continue performance improvement trend...” (array of PEs); “...generic mesh may not suit the application’s topology...”; “...configurable VLIW PEs and network topology...”.]

Page 21: Low Power Multimedia Reconfigurable Platforms

IXP1200 Network Processors

• Six micro-engines
• Support 24 contexts
• Hash instructions
• StrongArm core
• Bus and memory controllers
• Example of an architecture we want to be able to configure to

[Figure: IXP1200 Network Processor (Intel) block diagram: SDRAM controller, SRAM controller, PCI interface, SA core with I-cache, D-cache, and mini D-cache, six micro-engines, scratchpad SRAM, IX bus interface, hash engine.]

Page 22: Low Power Multimedia Reconfigurable Platforms

Architecture Goals

• Provide template for the exploration of a range of architectures
• Retarget compiler and simulator to the architecture
• Enable compiler to exploit the architecture
• Concurrency
  • Multiple instructions per processing element
  • Multiple threads per and across processing elements
  • Multiple processes per and across processing elements
• Support for efficient computation
  • Special-purpose functional units, intelligent memory, processing elements
• Support for efficient communication
  • Configurable network topology
  • Combined shared memory and message passing

Page 23: Low Power Multimedia Reconfigurable Platforms

Architecture Template

• Prototyping template for array of processing elements
  • Configure processing element for efficient computation
  • Configure memory elements for efficient retiming
  • Configure the network topology for efficient communication

[Figure: three refinement steps applied to PEs built from FUs, register file, memory, and I-cache: “...configure PE...” (add FUs and DCT/HUF blocks); “...configure memory elements...”; “...configure PEs and network to match the application...”.]

Page 24: Low Power Multimedia Reconfigurable Platforms

Architecture Template

• Templates provide prototyping platform for constrained refinement
  • Estimators feed back system performance and guide configuration
  • System designer refines configuration or the process is automated
  • Refined elements have a compatible interface in the system

[Figure: tool flow linking the uArch designer, the programmer’s model, a generated compiler, a generated simulator, object code (.o), and estimation.]

Page 25: Low Power Multimedia Reconfigurable Platforms

Synthesis of Architectures

• Not inventing new architectures
• We are providing a tool for the prototyping and synthesis of a family of architectures
  • Gives a micro-architecture, ISA, compiler, and simulator
  • Refine within an instance to improve characteristics of the design
• Most existing architectures are a point in the architecture spectrum
• We want to allow a wide range of architectures to be realized
  • Each coupled with supporting software development tools

Page 26: Low Power Multimedia Reconfigurable Platforms

Initial Processing Element

• VLIW class architecture
  • HPL-PD architecture
  • Exploit ILP
• Malleable elements
  • Memory size
  • Cache size
  • Register file size
  • Number of functional units
  • Specialized functional units

[Figure: processing element with four FUs and one SFU on top of a register file, memory system, and instruction cache.]
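The malleable elements listed above can be pictured as the fields of a configuration record that a designer refines step by step. The following is a minimal sketch only: `PEConfig`, its field names, and all values are hypothetical and do not reflect the actual HPL-PD or Mescal description format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PEConfig:
    """Hypothetical description of one processing element (PE)."""
    memory_kb: int
    icache_kb: int
    registers: int
    num_fus: int                 # general-purpose functional units
    special_fus: tuple = ()      # names of specialized functional units

    def issue_slots(self) -> int:
        # One operation slot per functional unit, general or specialized.
        return self.num_fus + len(self.special_fus)


# Start from a scalar configuration and refine it for a JPEG-like workload.
scalar = PEConfig(memory_kb=64, icache_kb=8, registers=32, num_fus=1)
jpeg_pe = replace(scalar, num_fus=4, special_fus=("DCT", "HUF"))
```

Keeping the record immutable and deriving variants with `replace` mirrors the idea of refining within one instance of the template while leaving the other parameters untouched.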

Page 27: Low Power Multimedia Reconfigurable Platforms

Future Processing Element

• Specialized memory systems for efficient memory utilization
  • Multi-ported, banked, levels, and intelligent memory
• Split register file allows greater register bandwidth to FUs
  • Groups of functional units have dedicated register files
• Sticky state for specialized FUs saves register file reads and writes
• Multiple contexts for a processing element provide latency tolerance
  • Hardware for efficient context switching to fill empty instruction slots
• Specialized functional units and processing elements
  • SIMD instructions
  • Re-configurable fabrics for bit-level operations
  • Re-use IP blocks for more efficient computation
  • Custom hardware for the highest performance

Page 28: Low Power Multimedia Reconfigurable Platforms

Initial Distributed Architecture

• Array of concurrent PEs and supporting network
• Malleable network topology
  • Topology matches application
  • Efficient communication

[Figure: two arrays of nine PEs each.]

Page 29: Low Power Multimedia Reconfigurable Platforms

Initial Distributed Architecture

• Array of concurrent PEs and supporting network
• Malleable network and PEs
  • Topology matches application
  • Refine to meet system constraints
• Memory organized around a PE
  • Each PE has physical memory
  • Message passing between PEs

[Figure: a nine-PE array and an eight-PE array.]

Page 30: Low Power Multimedia Reconfigurable Platforms

Future Distributed Architecture

• Multiple processing elements share a memory space
  • Shared memory communication
  • Snooping cache coherency protocol
  • Directory based protocol required if the number of PEs in a shared memory space is large
• Introspective processing elements
  • Use processing elements to analyze the computation or communication
  • Identify dynamic bottlenecks and remove them on the fly
  • Reschedule and bind tasks as the introspective elements report

Page 31: Low Power Multimedia Reconfigurable Platforms

Communication Models

• Shared memory
  • Hardware handles loads and stores from PEs to a common memory
  • Synchronization is separate from communication
  • Interacting threads on a single or group of processing elements
• Message passing
  • Hardware to send and receive messages and invoke a handler
  • Synchronization and communication are together
  • Interacting processes between single or group of processing elements
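The contrast between the two models can be sketched in software. In the example below, Python threads stand in for processing elements; this is an analogy under that assumption, not the hardware mechanism the slide describes. In the shared-memory version the lock (synchronization) is separate from the store (communication); in the message-passing version a blocking receive does both at once.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Threads communicate through a shared variable; the lock is separate."""
    total = [0]
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                 # synchronization step
            total[0] += partial    # communication step (store to shared memory)

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(chunks):
    """Workers send results as messages; receiving also synchronizes."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))    # send

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    # Each get() blocks until a message has arrived.
    total = sum(mailbox.get() for _ in chunks)
    for t in threads:
        t.join()
    return total
```

Both functions compute the same total; they differ only in where the synchronization lives.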

Page 32: Low Power Multimedia Reconfigurable Platforms

Memory Model

• Relax the consistency model
  • Hardware implements lock and unlock mutex instructions
  • Synchronization instructions inserted in program
  • Loads and stores before a lock must complete before loads and stores after the lock are started
  • Relaxes the ordering of reads and writes in order to increase memory utilization
  • Compiler is constrained on reordering around synchronization barriers

Page 33: Low Power Multimedia Reconfigurable Platforms

Range of Architectures

• Scalar Configuration
• EPIC Configuration
• EPIC with special FUs
• Mesh of HPL-PD PEs
• Customized PEs, network
• Supports a family of architectures
  • Plan to extend the family with the micro-architectural features presented

[Figure: a single FU with register file, memory system, and instruction cache.]

Page 34: Low Power Multimedia Reconfigurable Platforms

Range of Architectures (continued)

[Figure: five FUs with register file, memory system, and instruction cache, shown with a nine-PE array.]

Page 35: Low Power Multimedia Reconfigurable Platforms

Range of Architectures (continued)

[Figure: two FUs plus FFT, DCT, and DES units with register file, memory system, and instruction cache.]

Page 36: Low Power Multimedia Reconfigurable Platforms

Range of Architectures (continued)

[Figure: two FUs plus FFT, DCT, and DES units with register file, memory system, and instruction cache, shown with a nine-PE array.]

Page 37: Low Power Multimedia Reconfigurable Platforms

Range of Architectures (continued)

[Figure: an eight-PE array.]

Page 38: Low Power Multimedia Reconfigurable Platforms

Range of Architectures (Future)

• Template support for such an architecture
• Prototype architecture
• Software development tools generated
  • Generate compiler
  • Generate simulator

[Figure: IXP1200 Network Processor (Intel) block diagram: SDRAM controller, SRAM controller, PCI interface, SA core with I-cache, D-cache, and mini D-cache, six micro-engines, scratchpad SRAM, IX bus interface, hash engine.]

Page 39: Low Power Multimedia Reconfigurable Platforms

The Research Playground

[Figure: the design stack covered by the project: Component Assembly and Synthesis; Microarchitecture; Architecture; Verification and Manufacture Test; What is the Programmer’s Model?; Algorithm; Software Implementation; Compilation and SW Environment; Application.]

Page 40: Low Power Multimedia Reconfigurable Platforms

Mescal Compiler

Manish Vachharajani
Princeton University

Page 41: Low Power Multimedia Reconfigurable Platforms

Outline

• Compiler goals
• Compiler research issues
• Compiler infrastructure requirements
• Trimaran 2.0 compiler infrastructure
• Ongoing work
• Summary

Page 42: Low Power Multimedia Reconfigurable Platforms

So What’s Different?

• General purpose compiler hand tuned to:
  • SPEC benchmarks
  • A particular general purpose machine
• Need compiler tuned to:
  • Specific application
  • A particular application specific machine
• And…
  • Meet code density, real-time, and power constraints
  • Do this automatically for a range of applications/architectures

Page 43: Low Power Multimedia Reconfigurable Platforms

So What’s Different?

• Traditional application hw/sw design requires
  • Hand selection of traditional general purpose OS components
  • Hand written customization of
    • device drivers
    • memory management…
• Instead…
  • Application specific synthesis of traditional OS components
    • scheduling
    • synchronization…
  • Automatic synthesis of hardware specific code from specifications
    • device drivers
    • memory management…

Page 44: Low Power Multimedia Reconfigurable Platforms


Compiler Goals

Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.

10 Year Vision:

• Will have fully automatically-retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
• Compiled code size and performance will be within 10% of hand-coding

Page 45: Low Power Multimedia Reconfigurable Platforms

Compiler Research Issues

• Synthesis of RTOS elements in the compiler
  • On the application side: Generation of an efficient application-specific static/run-time scheduler and synchronization
  • On the hardware side: Generation of device drivers, memory management primitives, etc. using hardware specifications
• Automatic retargetability for family of target architectures while preserving aggressive optimization
• Automatic application partitioning
  • Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in programmer’s model
• Effective visualization for family of target architectures

Page 46: Low Power Multimedia Reconfigurable Platforms

Compiler Infrastructure Requirements

• High level of usability
  • good documentation, well coded
• Large suite of machine-independent code optimizations
• Significant level of retargetability
• Strong support for instruction-level parallelism
• Support for memory as a first-class citizen
• Simulation tools
• Preferably
  • visualization tools
  • a good support team

Page 47: Low Power Multimedia Reconfigurable Platforms

Trimaran 2.0 Compiler Overview

• IMPACT/ELCOR features strong VLIW data structure and algorithm support
• Data structures
  • basic, hyper, super blocks
  • loop analysis
  • procedure analysis
  • miscellaneous, e.g. lists, sets
• Algorithms
  • if-conversion
  • software pipelining
  • scheduling/register allocation

[Figure: Trimaran flow: C source enters the IMPACT front-end (U. of Illinois IMPACT Group), then the ELCOR back-end (HP Labs CAR Group) driven by MDES, then the simulator and visualization tools (NYU ReaCT-ILP Group).]

www.trimaran.org

Page 48: Low Power Multimedia Reconfigurable Platforms

Trimaran 2.0 Overview: Simulator and Visualization Tools

• Cycle-level simulator easily extensible to support new specialized operations
  • Simply augment table specifying operation semantics
• Visualization tools visualize assortment of useful static / dynamic information
  • Instruction schedule
  • Data-dependency graphs
  • Total cycles per function / region
  • Percentage of total function operations that are branches, loads, stores, integer ALU, floating-point ALU, etc.
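The idea of extending a simulator by augmenting a table of operation semantics can be sketched as follows. This is a functional toy, not Trimaran's simulator or its table format: the opcode names, the tuple-based program encoding, and the `norm` semantics (left shifts needed to move a positive value's top bit to bit `width - 2`, with non-positive inputs returning 0) are all assumptions made for illustration.

```python
# Table mapping opcode names to their semantics (hypothetical encoding).
OP_SEMANTICS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "SHL": lambda a, b: a << b,
}


def run(program, regs, semantics=OP_SEMANTICS):
    """Execute (opcode, dst, src...) tuples against a register dictionary."""
    regs = dict(regs)                       # do not mutate the caller's state
    for op, dst, *srcs in program:
        regs[dst] = semantics[op](*(regs[s] for s in srcs))
    return regs


def norm(value, width=16):
    """Assumed normalization: shifts needed to bring the MSB to bit width-2."""
    if value <= 0:
        return 0
    return (width - 1) - value.bit_length()


# Supporting the new specialized operation only takes one more table entry.
EXTENDED = dict(OP_SEMANTICS, NORM=norm)
```

The interpreter loop stays untouched when `NORM` is added, which is the property the slide attributes to the table-driven design.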

Page 49: Low Power Multimedia Reconfigurable Platforms

Trimaran 2.0 Overview: Machine Description (MDES)

• Target specified in high-level machine-description language
  • Translated into low-level language
• ELCOR supports PlayDoh
  • Parameterized non-clustered VLIW architecture
  • Support for speculative/predicated execution, software pipelining
• User may modify the following PlayDoh parameters:
  • number of registers
  • number of integer, floating-point, memory, branch FUs
  • operation latencies

[Figure: Trimaran flow: C source, IMPACT front-end, ELCOR back-end, simulator and visualization; the high-level PlayDoh MDES is translated into the low-level PlayDoh MDES.]

Page 50: Low Power Multimedia Reconfigurable Platforms

Extensions to Trimaran 2.0: Support for Multiple PEs

• ELCOR does not provide MDES and data structure support for multiple PlayDoh PEs
• New MDES format has been devised to support multiple PEs with varying connectivity
• Array of MDES data structures maintained, one per PE
• Each code region must be associated with an MDES PE prior to code generation
• Communication channels between PEs currently not modeled

[Figure: MESCAL Machine Description: machine descriptions for PE1, PE2, ..., PEm; channels Channel1 (from PE1 to PE2), Channel2 (from PE1 to PE3), ..., Channeln (from PEi to PEj).]
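A machine description with one entry per PE plus a list of directed channels could be organized roughly as below. This is a sketch under loud assumptions: the class names, fields, and values are hypothetical and are not the MESCAL MDES syntax. The channel set mirrors the figure; the slide notes that channels were not yet modeled at the time.

```python
from dataclasses import dataclass, field


@dataclass
class PEDescription:
    """Hypothetical per-PE machine description (not the real MDES syntax)."""
    registers: int
    fus: dict                      # FU kind -> count, e.g. {"int": 2}


@dataclass
class MachineDescription:
    pes: dict                      # PE name -> PEDescription
    channels: set = field(default_factory=set)   # directed (src, dst) pairs

    def connected(self, src: str, dst: str) -> bool:
        return (src, dst) in self.channels

    def bind(self, region_to_pe: dict) -> dict:
        """Check that every code region is bound to a known PE."""
        for region, pe in region_to_pe.items():
            if pe not in self.pes:
                raise ValueError(f"region {region!r} bound to unknown PE {pe!r}")
        return region_to_pe


md = MachineDescription(
    pes={
        "PE1": PEDescription(registers=32, fus={"int": 2, "mem": 1}),
        "PE2": PEDescription(registers=64, fus={"int": 4, "float": 2}),
        "PE3": PEDescription(registers=16, fus={"int": 1}),
    },
    channels={("PE1", "PE2"), ("PE1", "PE3")},
)
binding = md.bind({"dct_loop": "PE2", "control": "PE1"})
```

The `bind` check corresponds to the requirement that each code region be associated with a PE before code generation.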

Page 51: Low Power Multimedia Reconfigurable Platforms

Extensions to Trimaran 2.0: Support for Specialized FUs and Operations

• ELCOR lacks support for specialized FUs and operations
• MESCAL supports specialized FUs and operations via function intrinsics, which get translated into special operations
• Special operations only require a map from the intrinsic for implementation

[Figure: normalization hardware example: the intrinsic x = NORM(y) maps to the assembly operation NORM B.]
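The intrinsic-to-operation map can be pictured as a simple lookup applied to statements of the form `x = NORM(y)`. The sketch below is illustrative only: the regular expression, the table, and the emitted `OP dst, src` text are assumptions, and the slide's own assembly form (`NORM B`) does not spell out its operands.

```python
import re

# Hypothetical map from intrinsic name to special operation mnemonic.
INTRINSIC_TO_OP = {"NORM": "NORM"}

# Matches statements shaped like "dst = NAME(src)" with an optional semicolon.
CALL = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\((\w+)\)\s*;?\s*$")


def lower(statement, table=INTRINSIC_TO_OP):
    """Return an assembly-like line for a known intrinsic call, else None."""
    match = CALL.match(statement)
    if match is None:
        return None
    dst, name, src = match.groups()
    if name not in table:
        return None                # an ordinary function call, not an intrinsic
    return f"{table[name]} {dst}, {src}"
```

Adding another specialized operation then only means adding one entry to the table, in the spirit of the last bullet above.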

Page 52: Low Power Multimedia Reconfigurable Platforms

Mescal Compiler Framework

• MESCAL source code layer exists on top of ELCOR
• All Trimaran source code needing modification is copied over to the MESCAL layer
• MESCAL source code is compatible with future Trimaran releases

[Figure: Trimaran flow (C source, IMPACT front-end, ELCOR back-end, simulator and visualization) with the MESCAL layer on top of ELCOR; the ELCOR MDES is shown alongside the MESCAL MDES.]

Page 53: Low Power Multimedia Reconfigurable Platforms

Mescal Compiler: What Do You Get?

Mescal Compiler will feature:

• Automatic retargetability for architectures consisting of multiple heterogeneous PEs and a configurable communication topology
• Mapping of coarse-grain parallelism onto multiple PEs via guidance from programmer’s model
• Programmer’s model will allow code-generation with size and performance comparable to hand-coding
• Synthesis of RTOS elements and synchronization that are tuned to the application

[Figure: application code in the programmer’s model and a hardware description feed RTOS synthesis, the compiler front end, and the compiler back end, which produce system code.]

Page 54: Low Power Multimedia Reconfigurable Platforms

Ongoing Work

• Automatic device driver synthesis from a system specification
  • Xiaoling Xu, Minxi Gao: UCB
• Support for additional classes of processors (e.g. DSPs) within Mescal framework
  • Subbu Rajagopalan: Princeton
  • involves adding support for memory as a first-class citizen
• Tuning of front/back end code optimizations, based on application and micro-architectural characteristics
  • Manish Vachharajani: Princeton
• Automatic synthesis of RTOS elements in compiler
  • Shaojie Wang: Princeton
• Dynamically-reconfigurable computing for systems-on-a-chip
  • Zhining Huang: Princeton
• MESCAL compiler overview
  • Niraj Shah, Michael Shilman: UCB

Page 55: Low Power Multimedia Reconfigurable Platforms

The Research Playground

[Figure: the design stack covered by the project: Component Assembly and Synthesis; Microarchitecture; Architecture; Verification and Manufacture Test; What is the Programmer’s Model?; Algorithm; Software Implementation; Compilation; Application.]

Page 56: Low Power Multimedia Reconfigurable Platforms

MESCAL Programmer’s Model

Niraj Shah
University of California at Berkeley

Page 57: Low Power Multimedia Reconfigurable Platforms

Outline

• Motivation
• Goals
• Our Approach
• Initial Model
• Ongoing Research

Page 58: Low Power Multimedia Reconfigurable Platforms

Motivation

• Silicon integration is allowing for high micro-architectural complexity on a die (e.g. Intel IXP1200)
  • multiple processors
  • specialized execution units
  • hardware context swap
• Circuit architects are designing more complex devices
• How do we program these architectures?

[Figure: IXP1200 block diagram: SDRAM controller, SRAM controller, PCI interface, StrongArm core with I-cache, D-cache, and mini D-cache, six engines, scratchpad SRAM, IX bus interface, hash engine.]

Page 59: Low Power Multimedia Reconfigurable Platforms

Example: C Language

• C compilers of the early 70’s were not good, but C became the standard for writing efficient code.
• C provided an abstraction (programmer’s model) of standard processors that allowed programmers to write efficient code
• It exposed the 20% of assembler capability needed to capture 80% of program efficiency:
  • register keyword
  • pointer arithmetic
  • bit-level operations

Page 60: Low Power Multimedia Reconfigurable Platforms

Goals

• Capture the 20% of architectural features of new architectural platforms to get 80% of the performance
  • Concurrency
    • processor level
    • functional unit level
    • bit level
  • Memory
    • useful characteristics of specialized memories
    • address generation units
• Present the programmer with an abstraction of the architecture while giving them the power to write efficient code

Page 61: Low Power Multimedia Reconfigurable Platforms

Our Approach

• Combine bottom-up and top-down views
• Bottom-up: create an abstraction of the architecture
  • visibility - sufficient detail of the architecture to allow the programmer to improve the efficiency of the program
  • opacity - hide micro-architectural details from programmer
• Top-down: expressive enough for the programmer to relay all the information he/she knows about the program to the compiler

Page 62: Low Power Multimedia Reconfigurable Platforms

Bottom-Up View

• Visible
  • Specialized hardware
    • FU’s
    • PE’s
  • Communication
    • explicit message passing
    • shared address space
• Opaque
  • Micro-architectural features
    • pipelines
    • cache details

[Figure: a PE with four FUs and one SFU, and an eight-PE array.]

Page 63: Low Power Multimedia Reconfigurable Platforms

Top-Down View

• Parallelism at different levels
  • Process level - communicate via message passing
  • Task/thread level - communicate via shared memory
• OS capabilities
  • Scheduling
  • Binding
  • Synchronization

Page 64: Low Power Multimedia Reconfigurable Platforms

Initial Programmer’s Model

• Start with C
• View specialized FU’s through intrinsics (e.g. normalization)
• Model process level concurrency through a hybrid communication model
  • Processes - subset of Message Passing Interface (MPI)
  • Threads - shared memory

[Figure: the intrinsic x = NORM(y) maps to the assembly operation NORM B.]

Page 65: Low Power Multimedia Reconfigurable Platforms

Message Passing Interface (MPI)

• A standard interface for communication on multiprocessor systems
• Messages are passed between processes, which the user must specify
• “Push” style communication – sender specifies data rate
• Types of Communication
  • Blocking: stall until send/receive buffer can be used
  • Non-blocking: allows overlap of computation and communication
• Simulator included
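The difference between blocking and non-blocking communication can be mimicked with a bounded buffer. The sketch below uses Python's `queue.Queue` as a stand-in for a one-slot channel; it is not the MPI API (in MPI itself the corresponding calls are `MPI_Send`/`MPI_Recv` for blocking and `MPI_Isend`/`MPI_Irecv` for non-blocking transfers), and the helper names are illustrative.

```python
import queue


def blocking_send(chan, msg):
    """Stall until the buffer has room for the message."""
    chan.put(msg)


def blocking_recv(chan):
    """Stall until a message is available."""
    return chan.get()


def try_send(chan, msg):
    """Non-blocking send: report failure instead of stalling."""
    try:
        chan.put_nowait(msg)
        return True
    except queue.Full:
        return False


def try_recv(chan):
    """Non-blocking receive: returns (ok, message)."""
    try:
        return True, chan.get_nowait()
    except queue.Empty:
        return False, None
```

A caller whose `try_send` returns False is free to continue computing and retry later, which is the overlap of computation and communication mentioned in the last bullet on communication types.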

Page 66: Low Power Multimedia Reconfigurable Platforms

Ongoing Research

• The programmer’s model is the Holy Grail of the MESCAL project
  • Right abstraction for memory
  • Incorporate bit level concurrency
• Compiler for Intel IXP1200 - test initial programmer’s model

Page 67: Low Power Multimedia Reconfigurable Platforms

The Research Playground

[Figure: the design stack covered by the project: Component Assembly and Synthesis; Microarchitecture; Architecture; Verification and Manufacture Test; What is the Programmer’s Model?; Algorithm; Software Implementation; Compilation; Application.]

Page 68: Low Power Multimedia Reconfigurable Platforms

Scalable Self-Test for Designs with Embedded Programmable Components

Tim Cheng
University of California, Santa Barbara

Page 69: Low Power Multimedia Reconfigurable Platforms


Test and diagnosis are applications of a highly programmable system!!

Goals

Reuse of on-chip programmable components for test
  Processor/DSP/FPGA cores for on-chip test generation, measurement, response analysis and even diagnosis

Self-test a processor/DSP using its instruction set for high structural fault coverage

Use the tested processor/DSP to test buses, interfaces and other components, including analog and mixed-signal components

Extend for self-diagnosis

Page 70: Low Power Multimedia Reconfigurable Platforms


Functional Self-Test for Structural Faults - Motivation

At-speed testing of GHz IC’s increasingly difficult with external testers
  Growing gap between IC and tester performance
  Growing cost of high performance testers
  Increasing yield loss caused by inherent tester inaccuracy

Self-testing using instructions enables natural application of at-speed test of GHz processors and SoC’s

Potential advantages over structural BIST (such as scan-based BIST) include: area, performance, design time, power consumption during test

Page 71: Low Power Multimedia Reconfigurable Platforms


Functional Self-Test vs. Structural BIST

Good understanding of the capability and limitations of functional self-test could support further new development of hybrid solutions combining strengths of functional and structural self-test

Lesson from memory self-test: from functional, to structural, now back to functional self-test

Logic self-test?

Page 72: Low Power Multimedia Reconfigurable Platforms


Initial Projects on Processor Functional Self-Test

Self-Testing of Embedded Processor Cores and SoC (UCSD)
  Delivering deterministic tests using instruction set
  Automatic synthesis of programs for:
    on-chip test generation (constraint-aware software LFSR)
    test pattern delivery
    test response analysis

Self-Testing of Processor Cores for Delay Faults (UCSB)
  Automatic synthesis of test programs for path delay faults
  Applying deterministic delay tests by execution of test program
  Tests generated by integrated process combining structural ATPG and instruction-level ATPG
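
A software LFSR for on-chip test generation can be sketched as follows. This is a plain 16-bit Fibonacci LFSR using the well-known maximal-length taps 16, 14, 13, 11; it is an assumption made for illustration, not the UCSD program, and it omits the constraint-awareness mentioned above. The function names and the seed are hypothetical.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one shift (taps 16, 14, 13, 11)."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def generate_patterns(seed, count):
    """Return `count` successive 16-bit pseudo-random test patterns."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero seed locks the LFSR at zero")
    patterns = []
    for _ in range(count):
        state = lfsr16_step(state)
        patterns.append(state)
    return patterns
```

Because the generator is a handful of shifts and XORs, it could run as an ordinary instruction sequence on the processor under test, which is the point of a software (rather than hardware) LFSR.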

Page 73: Low Power Multimedia Reconfigurable Platforms


Self-Test for Embedded Processor Components

[Figure: external tester, CPU, instruction memory, and data memory on a processor bus. Labeled items: on-chip test generation program, test pattern delivery program, test response analysis program, test patterns, test response, response signature, self-test signature.]

Page 74: Low Power Multimedia Reconfigurable Platforms


Functional Self-Testing of Processor Cores for Path Delay Faults

[Figure: flow with inputs Instr. Set Architecture, µ-architecture & Netlist; stages Automatic Constraint Extraction (spatial and temporal constraints between registers and control signals), Path Classification, Constrained Structural ATPG, and Test Program Synthesis; output Test Program]

Some structurally testable paths are not functionally testable by instructions

Identifying functionally testable paths

Vector generation for functionally testable paths

Mapping test vectors to instruction sequences

Page 75: Low Power Multimedia Reconfigurable Platforms


Path Classification: DLX - A 32-bit RISC Processor

No. of paths: ~430K paths
datapath
No. of paths: ~18K paths
controller

Structurally testable ~97%
Structurally testable ~51%
Functionally testable ~40%
Functionally testable ~46%

Automatic identification of paths testable by instructions
Structurally testable but functionally untestable paths need not be tested.

Page 76: Low Power Multimedia Reconfigurable Platforms


Self-Test for Analog and Mixed-Signal Components in Highly Programmable Systems

Reuse on-chip digital programmable components and A/D and/or D/A converters for test signal generation, on-chip measurement and response analysis for analog/mixed signal components

  To relieve the need for expensive mixed-signal testers
  To avoid noisy external measurement
  To provide maximum flexibility for customized/optimized self-test solutions for different types of analog components

Page 77: Low Power Multimedia Reconfigurable Platforms


Analog/Mixed-Signal Self-Test Approaches

DSP-based analog self-test
  Targeting systems with both DAC and ADC

Pulse-Density-Modulation-based analog self-test
  Targeting systems without an ADC and/or a DAC

Page 78: Low Power Multimedia Reconfigurable Platforms


DSP-Based Self-Testing

[Figure: DSP/programmable components connected through D/A and A/D converters to the analog component under test, with synchronization]

Test signal:
  • digitized sinusoid
  • digitized multi-tone
  • pseudo random

Response analysis:
  • FFT
  • IEEE 1057 sinewave fitting
  • cross-correlation
  • auto-correlation

Pros
  • more efficient
  • single setup for multiple types of tests

Cons
  • limited measurement resolution (improving)
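
As a minimal sketch of the digitized-sinusoid stimulus and a frequency-domain response analysis, the following uses a single-bin DFT in place of a full FFT. It assumes coherent sampling (a whole number of cycles in the record) and an ideal converter; all function names and the gain limits are hypothetical.

```python
import cmath
import math


def digitized_sinusoid(amplitude, cycles, num_samples):
    """Test stimulus: `cycles` whole periods of a sine over `num_samples` points."""
    return [amplitude * math.sin(2 * math.pi * cycles * n / num_samples)
            for n in range(num_samples)]


def tone_amplitude(samples, cycles):
    """Estimate the amplitude of the tone in DFT bin `cycles`."""
    total = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * cycles * n / total)
              for n, x in enumerate(samples))
    return 2 * abs(acc) / total


def gain_within_spec(stimulus, response, cycles, gain_min, gain_max):
    """Pass/fail decision on the gain of the component under test at one tone."""
    gain = tone_amplitude(response, cycles) / tone_amplitude(stimulus, cycles)
    return gain_min <= gain <= gain_max
```

A multi-tone stimulus would repeat the same measurement at several bins; an FFT becomes worthwhile once many bins are needed at once.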

Page 79: Low Power Multimedia Reconfigurable Platforms


Pulse-Density-Modulation-Based Self-Test

Targeting designs without a DAC and/or an ADC
  Use simple yet high-tolerant DA & AD conversion techniques
  Use DSP techniques for test synthesis and response analysis
  Excellent flexibility

[Figure: test synthesis path (Spec., software 1-bit modulator, bitstream ..0101... in memory, 1-bit DAC & low-pass filter, test stimulus, analog CUT) and response analysis path (analog CUT, 1-bit modulator, bitstream ..0101..., DSP, pass/fail); blocks labeled SOC and ATE]

Page 80: Low Power Multimedia Reconfigurable Platforms


PDM-Based Analog Self-Test: Current Status

A general self-test architecture for mixed-signal systems

Use modulation principle for stimulus generation and signal acquisition

Characterization and calibration of 1-bit first-order modulator for on-chip signal analysis

For compensating the error caused by the imperfections associated with the modulator

A self-test scheme for testing on-chip ADC and DAC
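
The modulation principle can be illustrated with a software 1-bit first-order modulator whose pulse density tracks the input. This is a textbook first-order loop written as a sketch; it is not the characterized modulator of the slide, and `one_bit_modulate` and `low_pass` are hypothetical names. The low-pass filter is reduced to a plain average.

```python
def one_bit_modulate(samples):
    """First-order 1-bit modulator: map samples in [-1, 1] to a +1/-1 bitstream."""
    integrator = 0.0
    bits = []
    for x in samples:
        bit = 1 if integrator >= 0 else -1   # 1-bit quantizer
        integrator += x - bit                # accumulate input minus fed-back output
        bits.append(bit)
    return bits


def low_pass(bits):
    """Crude low-pass filter: the mean pulse density over the whole window."""
    return sum(bits) / len(bits)
```

For a constant input the integrator stays bounded, so the averaged bitstream converges to the input value as the window grows; that is why a 1-bit DAC followed by a low-pass filter can stand in for a multi-bit converter.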

Page 81: Low Power Multimedia Reconfigurable Platforms


Directions for the Next Three Years

Processor self-test and self-diagnosis
  Adding new “test instructions” to aid self-test and self-diagnosis
  Test program synthesis for response analysis and self-diagnosis

Analog/mixed signal self-test
  Hardware validation of proposed PDM-based schemes
  High-frequency applications
  Defect-oriented test synthesis and response analysis

Full-chip self-test using self-tested processors
  Testing buses, interfaces and other digital components
  Reconfiguration of bus arbiters and communication protocols for test delivery

Page 82: Low Power Multimedia Reconfigurable Platforms


The Research Playground

[Figure: design layers labeled Application; Algorithm; Software Implementation; Compilation; “What is the Programmer’s Model?”; Architecture; Microarchitecture; Component Assembly and Synthesis; Verification and Manufacture Test]

Page 83: Low Power Multimedia Reconfigurable Platforms

Functional Verification for a Family of Microarchitectures

Serdar Tasiran
University of California at Berkeley

Page 84: Low Power Multimedia Reconfigurable Platforms


Outline

Verification goal
State-of-the-art in processor verification
Our strategy
  Rationale
  Implementation
Current projects
Future extensions
Three year goals

Page 85: Low Power Multimedia Reconfigurable Platforms


Verification Goal

Develop comprehensive, focused functional verification support for identified microarchitectural family

Ideally, the verification approach…
  …must be adaptable: must not require new theory and tools for
    different configurations
    different environments for design
    different verification requirements (cache coherence, consistency with programmer’s model, …)
  …must lend itself to incremental changes in design
  …must degrade gracefully

Page 86: Low Power Multimedia Reconfigurable Platforms


Processor Verification: State-of-the-art

Heated research activity on verification of pipelines, superscalar processors, out-of-order and speculative execution.
  Datapath abstraction
    Reduce width
    Symbolic representations (e.g. multiway decision graphs)
  Symbolic simulation
  Theorem proving
    Verifying functional units (ALUs, FPUs, etc.)
  Compositional (assume-guarantee) reasoning
    Divide verification problem into pieces
    Can use a variety of methods for each piece
  Reduce problem to equivalence checking of formulas
    Propositional logic with uninterpreted functions and predicates

Page 87: Low Power Multimedia Reconfigurable Platforms


Processor Verification: State-of-the-art

Formal verification valuable when applicable, but
  each technique addresses only one aspect of the problem
  verification expertise required from designer
  methods not incremental or adaptable
  difficult to use in large design groups
  capacity much short of current processor complexity

Validation relies heavily on simulation
  Even more likely to be the case for complex, highly programmable systems

Page 88: Low Power Multimedia Reconfigurable Platforms


Our Verification Strategy

Validation of complex, highly programmable systems will require semi-formal methods
  The natural way to verify these systems is to simulate and debug
  Practical goal: Make “optimal” use of simulation resources

IDEAL: Comprehensive validation with minimal redundant effort

OUR APPROACH: Use coverage analysis to guide verification
  Identify good verification coverage metrics
  Develop corresponding vector generation methodology

Page 89: Low Power Multimedia Reconfigurable Platforms


Validation using Simulation: Current Picture

[Figure: simulation driver, simulation engine, monitors]

SHORTCOMINGS:
  Vector generation
    Manual: A lot of user effort, ad hoc
    Random: Little control over what gets exercised
  Quantifying comprehensiveness
    Low bug detection rate is the main criterion
    Likely interpretation: Not generating quality vectors any more.

[Figure: bugs per week versus weeks, with regions labeled functional testing, purgatory, and tapeout. Courtesy Prof. Dill]

Page 90: Low Power Multimedia Reconfigurable Platforms


Verification Using Intelligent Simulation

[Figure: simulation flow with blocks Simulation driver, Simulation engine, Monitors, Symbolic simulation, Coverage analysis, Diagnosis of non-verified portions, and Vector generation; a legend marks blocks as Conventional or Novel]

Page 91: Low Power Multimedia Reconfigurable Platforms


Verification Using Intelligent Simulation – Rationale

[Figure: simulation flow with blocks Simulation driver, Simulation engine, Monitors, Symbolic simulation, Coverage analysis, Diagnosis of non-verified portions, and Vector generation; a legend marks blocks as Conventional or Novel]

Need formal means to:
  Gauge status and progress of verification
  Automate generation of quality vectors

Page 92: Low Power Multimedia Reconfigurable Platforms


Coverage Analysis – Why?

What aspects of design haven’t been exercised?
  Guides vector generation

How comprehensive is the verification so far?
  A heuristic stopping criterion

Coordinate and compare
  Separate sets of simulation runs
  Model checking, symbolic simulation, …

Helps allocate verification resources

[Figure: simulation flow with blocks Simulation driver, Simulation engine, Monitors, Symbolic simulation, Coverage analysis, Diagnosis of unverified portions, and Vector generation; a legend marks blocks as Conventional or Novel]

Page 93: Low Power Multimedia Reconfigurable Platforms


Observability and Coverage Analysis

Portion of design covered only when
  it is exercised (controllability)
  a discrepancy originating there causes discrepancy in a monitored variable (observability)

We initially focus on tag coverage [Devadas, Keutzer, Ghosh ’96]
  Code coverage metrics + observability requirement.
  All other verification metrics overlook observability

Tag coverage:
  Bugs modeled as errors in assignments.
  A buggy assignment may be stimulated, but still missed
    Wrong value generated speculatively, but never used.
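
The gap between exercising an assignment and observing its effect can be shown on a toy design. This is a simplified illustration of the observability idea, not the tag-propagation algorithm of the cited work; `design`, `tag_covered`, and `statement_covered` are hypothetical names, and the tag is modeled as a +1 error injected at one assignment.

```python
def design(a, b, use_sum, tag=0):
    """Toy design: a speculative sum that reaches the output only when selected."""
    speculative = a + b + tag            # the tagged assignment
    return speculative if use_sum else a


def statement_covered(vectors):
    """Plain code coverage: the assignment executes on every vector."""
    return len(vectors) > 0


def tag_covered(vectors):
    """Tag coverage: some vector must make the monitored output differ."""
    return any(design(a, b, sel) != design(a, b, sel, tag=1)
               for a, b, sel in vectors)
```

A vector with `use_sum` false executes the assignment, so statement coverage reports it as covered, yet the injected error never reaches the output; only a vector that selects the sum covers the tag.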

Page 94: Low Power Multimedia Reconfigurable Platforms


Biased-Random Vector Generation - Rationale

Vector generation methods trade-off between
  Time to find “good” vectors
  Time to simulate vectors

Typically > 50% of time spent on biased random simulation.
  Improved random vectors => Improved overall validation quality
  Less intelligence for selecting next step but many more vectors
    Can explore deeper into state space
    Deterministic methods bad at “deep errors”
      Example: 8-bit counter must expire for bug to be exercised

[Figure: portion of computation time, 0% to 100%, split between Find and Simulate]

Page 95: Low Power Multimedia Reconfigurable Platforms


Contrast with Alternatives

Elaborate vector generation methods justified
  if they yield better verification quality for given computation time, or
  if they exercise difficult corner cases
  BUT: Hard to judge “quality” of test vectors a-priori.

Heavyweight methods have limited application
  Can’t handle large sequential depth.
  Too costly to use all the time

We spend most effort on initial determination of weights
  Can run many simulation/emulation cycles fast

Our target: Get all but the most difficult bugs out.

[Figure: portion of computation time, 0% to 100%, split between Find and Simulate]

Page 96: Low Power Multimedia Reconfigurable Platforms


Our Approach to Biased Random Vector Generation

Primary inputs at each clock cycle selected according to a probability distribution
  Distributions are functions of circuit state
  Distributions (“weights”) determined prior to simulation
    Faster simulation

Algorithm determines weights based on
  Set of tags targeted
  A structural netlist describing the circuit

Goal of weight determination algorithm:
  Maximize expected number of tags covered in a given # of simulation cycles
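
A minimal sketch of state-dependent input weights, using the 8-bit counter example from the rationale slide: the enable input is drawn each cycle with a probability looked up from the current state, and the weights are fixed before simulation starts. The state abstraction ("lower"/"upper" half of the count), the weight values, and the function name are assumptions made for illustration; the weight determination algorithm itself is not shown.

```python
import random


def cycles_until_expiry(weights, max_cycles, rng):
    """Simulate an 8-bit counter whose enable input is drawn from state-dependent weights.

    `weights` maps an abstract circuit state to P(enable = 1) for that cycle.
    Returns the cycle in which the counter reaches 255, or None if it never does.
    """
    count = 0
    for cycle in range(1, max_cycles + 1):
        state = "upper" if count >= 128 else "lower"
        enable = rng.random() < weights[state]   # biased random primary input
        if enable:
            count += 1
        if count == 255:
            return cycle
    return None
```

Raising the enable weight shortens the expected time to reach the deep state, which is how biasing lets cheap random simulation reach behavior that uniform inputs take much longer to hit.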

Page 97: Low Power Multimedia Reconfigurable Platforms


Current Projects
Biased-Random Vector Generation for Tag Coverage
(Chinnery, Jin, Keutzer, Tasiran, Weber, UCB)

Select primary input distributions based on
  (1) Circuit structure
  (2) Current state
  (3) Tags to be covered

Heuristic based on circuit structure and set of tags
  IDEA: At each gate, bias inputs towards pins with more tags in their transitive fan-in.

Estimate and optimize detectability of tags
  Propagate input probability distributions across circuit
  Estimate steady-state distributions of latches
  Estimate detectability of tags along “most likely” paths
  Modify input weights to maximize expected number of detected tags
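
Propagating input probabilities across a circuit can be sketched gate by gate. The rules below are the standard signal-probability formulas and assume the gate inputs are statistically independent, so they ignore reconvergent fan-out; the example circuit and all names are hypothetical.

```python
def p_not(p):
    """P(output = 1) of an inverter."""
    return 1.0 - p


def p_and(pa, pb):
    """P(output = 1) of a 2-input AND with independent inputs."""
    return pa * pb


def p_or(pa, pb):
    """P(output = 1) of a 2-input OR with independent inputs."""
    return pa + pb - pa * pb


def output_probability(p_a, p_b, p_c):
    """P(out = 1) for the example circuit out = (a AND b) OR (NOT c)."""
    return p_or(p_and(p_a, p_b), p_not(p_c))
```

With uniform inputs the output is 1 with probability 0.625; biasing `a` and `b` to 0.9 raises it to 0.905, which shows how input weights shift the likelihood of exercising a given internal condition.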

Page 98: Low Power Multimedia Reconfigurable Platforms


Current Projects
Vector Generation for Tag Coverage of Processor Datapaths
(Keutzer, Meyerowitz, Tasiran, UCB)

Identify commonly encountered structures in processor datapaths

Determine input distributions that increase tag coverage of these structures

(In collaboration with configurable processor IP provider)

Initial approach: Model control by hand-written abstract machine

[Figure: Control block drawn as a state machine with states sinit, s2, s3, s4, s5, s6, connected to a Datapath block]

Page 99: Low Power Multimedia Reconfigurable Platforms


Directions for the Next Three Years

Now: Biased-random vector generation
  Initial focus: Configurable processor control and datapaths
  Topology-based heuristics with tag coverage goal

Later:
  More sophisticated methods for bias selection
  Methods that address control and datapath together

Overall: “Closed feedback loop” that integrates a variety of
  Coverage metrics, analysis and feedback methods
  Coverage guided, automatic vector generation methods

[Figure: simulation flow with blocks Simulation driver, Simulation engine, Monitors, Symbolic simulation, Coverage analysis, Diagnosis of unverified portions, and Vector generation]

Page 100: Low Power Multimedia Reconfigurable Platforms


The Research Playground

[Figure: design layers labeled Application; Algorithm; Software Implementation; Compilation and Software Environment; “What is the Programmer’s Model?”; Architecture; Microarchitecture; Component Assembly and Synthesis; Verification and Manufacture Test]

Page 101: Low Power Multimedia Reconfigurable Platforms


Evaluation Strategy

Quantify quality of results of final implementation according to:
  Speed
  Power
  Area/cost
  Design time
  Design cost

Compare to:
  Other purely programmable solutions
    FPGA, microprocessor, specialized processor
  ASIC solutions

Page 102: Low Power Multimedia Reconfigurable Platforms


Ten Year Vision Elaborated

Significant percentage of embedded system applications fielded using only fully programmable components.

Supporting efficient but fully programmable solutions in areas of emerging standards.

Design-time brought within acceptable limits to achieve time-to-market goals.

Enabling new applications: Supporting greater complexity. Reducing overall design cost.

Page 103: Low Power Multimedia Reconfigurable Platforms


What Will Get Us There?…

Flexible architectural templates covering a large design space.
  Multiple levels of support for concurrency

Automated software development environment.
  Retargetable compilers/assemblers/debuggers
  Architectural simulators
  Run-time environments – schedulers/synchronizers
  Analysis tools – design visualization, performance monitoring, power analysis…

Page 104: Low Power Multimedia Reconfigurable Platforms


First Year Progress Against Strategies

Identified and assembled key application – VPN router.

Identified and assembled compiler infrastructure: Trimaran 2.0.

Initiated multiple compiler/run-time environment projects.

(Mostly) identified initial architectural family.

Simulator for one processing element of the architectural family assembled.

Test strategy for one processing element determined.

Page 105: Low Power Multimedia Reconfigurable Platforms


Further Progress Against Strategies

In two years:
  Automatic retargeting onto a family of architectures and microarchitectures from a hardware-description language.
  Automatically generated performance estimator, simulator.
  Automatic generation of assembler, compiler, run-time system.
  Automatically generated hardware for special purpose units?

In five years:
  Much like the above, but across a much broader range of architectures/microarchitectures.

Real breakthrough will be in the development of a natural programmer’s model

Page 106: Low Power Multimedia Reconfigurable Platforms


[Figure: research areas labeled Reconfigurable Computing (FPFA); Energy-efficient wireless communication; System architecture for mobile multimedia computers; Security]

Field Programmable Function Array: Chameleon

Page 107: Low Power Multimedia Reconfigurable Platforms


Montium Processing Tile

Page 108: Low Power Multimedia Reconfigurable Platforms


Montium Tile Processor

Page 109: Low Power Multimedia Reconfigurable Platforms


U-P vs XPP

Page 110: Low Power Multimedia Reconfigurable Platforms


A SDR/Multimedia Solution

Page 111: Low Power Multimedia Reconfigurable Platforms


PACT’s SDR XPP

Page 112: Low Power Multimedia Reconfigurable Platforms


PACT’s SDR XPP

Page 113: Low Power Multimedia Reconfigurable Platforms


Current Multimedia Processors

Digital Signal Processor => Multimedia Processor
  RISC instruction set and pipelining to gain higher clock frequency
  Instruction level parallelism (ILP)
  Growing concern with data movement and I/O interface
  More attention paid to low power design

Page 114: Low Power Multimedia Reconfigurable Platforms


Current Multimedia Processors

Name                                TMS320C82     Mpact 2                    Trimedia TM1   MSP
Architecture                        Multiproc.    VLIW                       VLIW           Multiproc.
CMOS Technology (µm)                0.5           0.35                       0.35           0.35
Vcc (Volts)                         3.3           3.3                        3.3            3.3
Power (Watts)                       3 (@50 MHz)   4.45                       4              4
Clock frequency (MHz)               50, 60        125                        100            100
Performance (BOPS, 8-bit integer)   1.5           6                          4              6.4
Manufacturer                        TI            Toshiba & Chromatic Res.   Philips        Samsung
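
The table's performance and power rows can be combined into a rough energy-efficiency figure (BOPS per watt). This ratio is derived here and does not appear in the original table; note that the TMS320C82 power figure is quoted at 50 MHz, so the comparison is only indicative. The dictionary and function names are hypothetical.

```python
# Performance (BOPS, 8-bit integer) and power (W) copied from the table.
PROCESSORS = {
    "TMS320C82":    {"bops": 1.5, "watts": 3.0},    # power quoted at 50 MHz
    "Mpact 2":      {"bops": 6.0, "watts": 4.45},
    "Trimedia TM1": {"bops": 4.0, "watts": 4.0},
    "MSP":          {"bops": 6.4, "watts": 4.0},
}


def bops_per_watt(name):
    """Billions of 8-bit operations per second delivered per watt."""
    entry = PROCESSORS[name]
    return entry["bops"] / entry["watts"]


def rank_by_efficiency():
    """Processor names ordered from most to least BOPS per watt."""
    return sorted(PROCESSORS, key=bops_per_watt, reverse=True)
```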

Page 115: Low Power Multimedia Reconfigurable Platforms


TMS320C6x VelociTI

Highest Performance (1 GFLOPS) Floating point DSP
6-ns Instruction Cycle Time
167-MHz Clock Rate
Eight 32-Bit Instructions/Cycle
Instruction packing
Complex programming model
Poor energy and memory efficiency
600 MHz, $110
Good tools and third party support

Page 116: Low Power Multimedia Reconfigurable Platforms


StarCore SC140, Infineon

6-issue 16-bit fixed-point architecture
Up to four 16-bit MACs per cycle
5-stage pipeline with single-cycle latency
Strong performance on most metrics
Multi-vendor architecture: Motorola, Agere and now Infineon
Limited product offerings: poor cost-efficiency
300 MHz, $132

Page 117: Low Power Multimedia Reconfigurable Platforms


Analog Devices TigerSHARC

4-issue fixed- and floating-point hierarchical SIMD architecture
Up to 8 16-bit fixed point MACs per cycle
Special CDMA-oriented instructions
High memory bandwidth (8Gb/s)
250 MHz, $175
2-level SIMD complicates programming
Good tools

Page 118: Low Power Multimedia Reconfigurable Platforms


LSI Logic ZSP400

A 4-Way Superscalar DSP Core
Up to 2 16-bit MACs per cycle
Five-stage pipeline with single-cycle latencies
Available as core, ASIC library component, ASSP
200 MHz, $36
Cost, energy and memory efficient
Superscalar architecture both simplifies and complicates programming
Unproven tools and third party support

Page 119: Low Power Multimedia Reconfigurable Platforms


Target Applications

Video - DVD, MPEG 1 & 2 decoding
Audio - Dolby AC-3, 3D Audio, MPEG Decode, Wavetable Synthesis
Graphics - 2D & 3D acceleration
Communication
  Vocoder
  ADSL, Fax/MODEM: V.34, 56k
  Echo canceller
  Desktop Videoconferencing

Page 120: Low Power Multimedia Reconfigurable Platforms


Advanced DSP
130 nm Copper Technology

Greg Delagi, TI

Page 121: Low Power Multimedia Reconfigurable Platforms


Reconfigurable Computing Research Group

DARPA’s Adaptive Computing Systems Project
Virginia Tech
University of California at Berkeley
Brigham Young University
Chameleon Systems Inc.
Morphic Inc.
Quicksilver Technology Inc.
Sirius Inc.

Page 122: Low Power Multimedia Reconfigurable Platforms


Quicksilver’s ACM

Page 123: Low Power Multimedia Reconfigurable Platforms


SDR-processing requirements for Mobile Communications (GSM)

Modem w/ basic equalizer 2 MFLOPS for CDMA sector 2.5 MFLOPS for a wideband CDMA 4 MFLOPS for a G4

Requires high performance devices such as
  PowerPC G4, PowerPC with Altivec CPUs
  TMS320-C6x, SHARC/Tiger-SHARC DSPs

Page 124: Low Power Multimedia Reconfigurable Platforms


The need for a software configurable platform

A platform is needed that can handle standards such as AM, FM, GSM, UMTS, digital broadcasting standards (DAB, Sirius, XM-Sat Radio), analog and digital television, and other data links.

A fully software reconfigurable multi-channel broadband sampling receiver for standards in the 100 MHz band

Page 125: Low Power Multimedia Reconfigurable Platforms


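Versatile Reconfigurable Block Array

[Figure: blocks VRB1 to VRB6 and F1 to F6 between IN and OUT, with a Master, an MPU, High Speed and Low Speed paths, and power management]

Advantages
  No waiting latency; requires little silicon area.
  Data transfer compatible with IP through a simple wrapper

Disadvantages
  Timing accuracy decreases in large-scale systems
  Testing is difficult for complex systems
  Arbiter delay increases as the number of masters increases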

Page 126: Low Power Multimedia Reconfigurable Platforms


Comparisons

Only 1 cycle to (re)configure the DSP
Few cycles to (re)configure coarse grain RA (8)
Many cycles to (re)configure fine grain RA

Name            Type              F (MHz)   NPE    Nc     R
ARDOISE         Fine Grain RA     33        2304   0.14   16457
Systolic Ring   Coarse Grain RA   200       24     4      6
DART            Coarse Grain RA   130       24     4      6
MorphoSys       Coarse Grain RA   100       128    16     8
TMS320C62       DSP VLIW          300       8      8      1

$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
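
The R column appears to follow R = (N_PE · Fe) / (Nc · Fc). As a check, the sketch below recomputes R from the N_PE and Nc columns, taking Fe = Fc (an assumption, since the slide does not give the two frequencies separately); under that assumption R reduces to N_PE / Nc, and the listed values match after rounding. The function and variable names are hypothetical.

```python
# (name, N_PE, N_c, R as listed on the slide)
ARCHITECTURES = [
    ("ARDOISE", 2304, 0.14, 16457),
    ("Systolic Ring", 24, 4, 6),
    ("DART", 24, 4, 6),
    ("MorphoSys", 128, 16, 8),
    ("TMS320C62", 8, 8, 1),
]


def reconfiguration_ratio(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * Fe) / (Nc * Fc); Fe = Fc by default (assumed)."""
    return (n_pe * f_e) / (n_c * f_c)


def table_is_consistent():
    """True if every listed R equals the recomputed value after rounding."""
    return all(round(reconfiguration_ratio(n_pe, n_c)) == r
               for _, n_pe, n_c, r in ARCHITECTURES)
```

Read this way, R is the number of cycles needed to (re)configure the whole array, which agrees with the three statements at the top of the slide: 1 for the DSP, a few for the coarse grain arrays, and many for the fine grain array.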

Page 127: Low Power Multimedia Reconfigurable Platforms


Multi-DSP Tree Structure

A. K. Salkintzis, N. Hong and P. T. Mathiopoulos

Page 128: Low Power Multimedia Reconfigurable Platforms


Multi-DSP Network Structure

[Figure: network of DSP blocks labeled Data Processing, CRC insertion, Channel Coding, Interleaving, Encryption, Multiplexing & Burst Construction, Rate matching, Segmentation, Channelization, Spreading, Modulation, Equalization, Sequencer, and Radio Resource]

Data traffic is reduced with each connection

Page 129: Low Power Multimedia Reconfigurable Platforms


Platform Classification

Application platform:
  Multimedia platform: Nexperia, TI’s OMAP
  3G wireless platform: Infineon’s M-gold
  Bluetooth platform: Parthus
  Wireless platform: ARM’s PrimeXsys

Process-centric platform:
  Improv System, ARC, Tensilica, Triscend

Communication-centric platform:
  Sonics, Palmchip

Page 130: Low Power Multimedia Reconfigurable Platforms


Recent Computing Machines

ACM (Adaptive Computing Machine) – Quicksilver: www.qstech.com (image appl.)

RCF (Reconfigurable Compute Fabric) – Motorola (SDR base-station), array of DSP cores connected through high-bandwidth interconnect and high-speed local memory, controlled by a RISC.

Page 131: Low Power Multimedia Reconfigurable Platforms


What is Software Radio

A transceiver in which all aspects of its operation are determined using versatile general purpose hardware whose configuration is under software control

Flexible all-purpose radios that can implement new and different standards or protocols through reprogramming.

Same hardware for all air interfaces and modulation schemes

Page 132: Low Power Multimedia Reconfigurable Platforms


Key Technological Constraints

High speed wide band ADCs.
High speed DSPs.
Real Time Operating Systems (isochronous software)
Power Consumption

Page 133: Low Power Multimedia Reconfigurable Platforms


Applications

User Applications and Base Station Applications
Evolve as a universal terminal
Spectrum management: Reconfigurability is a big advantage
Application updates, service enhancements and personalization

Page 134: Low Power Multimedia Reconfigurable Platforms


Research and Commercialization

DARPA’s Adaptive computing system project

Virginia Tech – algorithms and architecture; multi user receiver based on reconfigurable computing; generic soft radio architecture for reconfigurable hardware

UC Berkeley – Pleiades, ultra low power, high performance multimedia computing; high power efficiency by providing programmability

Sirius Inc – Software Reconfigurable Code Division Multiple Access (CDMAx)

Page 135: Low Power Multimedia Reconfigurable Platforms


Research and Commercialization

Brigham Young University – Development of JHDL to facilitate hardware synthesis in reconfigurable processors

Chameleon Systems – Reconfigurable Platform Architecture for wireless base station

MorphIC Inc – Programmable hardware reconfigurable code using DRL

Quicksilver Tech. Inc – Universal Wireless `Ngine (WunChip) baseband algorithms

Page 136: Low Power Multimedia Reconfigurable Platforms


Programmable OFDM-CDMA Transceiver

CDMA suffers from Multiple access interference and ISI.

OFDM reduces interference and helps better spectrum utilization and attainment of satisfactory BER.

It is proposed that this might be implemented by using SDR.

Page 139: Low Power Multimedia Reconfigurable Platforms


SDR Architecture

[Figure: SDR block diagram. Blocks shown: RF unit (Rx SYN, Tx SYN, RX, TX, EX., PA, LNA; two receive/transmit paths); signal processing/control unit (data converter, quadrature MODEM, baseband MODEM, interface/control); cPCI bus; HMI terminal; input/output. Source: Hitachi Kokusai Electric Inc.]

Page 140: Low Power Multimedia Reconfigurable Platforms

Signal processing/control unit

The signal processing/control unit consists of the following modules:

Data converter

Quadrature modem

Baseband modem

Interface/control

All modules are connected to one another by the PCI bus, and a CPU is provided in addition to the FPGA and DSP devices.

Page 141: Low Power Multimedia Reconfigurable Platforms

Quadrature modem module

The quadrature modem uses FPGAs to convert between the IF sampling rate and the baseband sampling rate. It performs:

Quadrature modulation

Quadrature detection

Sampling rate conversion

Filtering
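The receive-side functions listed above (quadrature detection, filtering, and sampling rate conversion) can be sketched in a few lines. This is an illustrative model only, not the FPGA design from the slides: it assumes the real IF signal is centred at one quarter of the sample rate, uses a simple boxcar average as the low-pass filter, and the name `quadrature_detect` is hypothetical.

```python
import cmath
import math

def quadrature_detect(samples, decim=4):
    """Mix real IF samples centred at fs/4 to complex baseband, filter, and decimate.

    Assumption: the IF sits at exactly fs/4, so the local oscillator
    exp(-j*pi*n/2) only takes the values 1, -j, -1, j and needs no multipliers.
    """
    rotation = [1, -1j, -1, 1j]
    mixed = [s * rotation[n % 4] for n, s in enumerate(samples)]
    baseband = []
    # Boxcar low-pass filter followed by decimation (sampling rate conversion).
    for start in range(0, len(mixed) - decim + 1, decim):
        block = mixed[start:start + decim]
        baseband.append(sum(block) / decim)
    return baseband

# Usage: a carrier at fs/4 with a fixed phase offset.
phase = 0.7
if_samples = [math.cos(math.pi * n / 2 + phase) for n in range(16)]
iq = quadrature_detect(if_samples)
recovered_phase = cmath.phase(iq[0])
```

With a four-sample boxcar, the unwanted image at half the sample rate falls on a filter null, so the recovered I/Q value carries the carrier phase directly. A practical design would more likely use CIC and FIR stages for better stopband rejection.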

Page 142: Low Power Multimedia Reconfigurable Platforms

Baseband modem module

The baseband modem performs multi-channel modulation and multi-channel demodulation using four floating-point DSP devices.

An individual DSP is assigned to each channel. Therefore, even while one channel is processing, a program can be downloaded to another channel.
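The one-DSP-per-channel arrangement can be modelled as follows. This is a hypothetical sketch of the idea only: the class names, the `download`/`start`/`stop` methods, and the rule that a busy channel must be stopped before reloading are all assumptions for illustration, not the prototype's actual control software.

```python
from dataclasses import dataclass, field

@dataclass
class DspChannel:
    """One DSP device dedicated to one radio channel (hypothetical model)."""
    index: int
    program: str = "idle"
    busy: bool = False

@dataclass
class BasebandModem:
    """Four independent channels, matching the four DSP devices on the slide."""
    channels: list = field(default_factory=lambda: [DspChannel(i) for i in range(4)])

    def download(self, ch, program):
        """Load a waveform program onto one channel's DSP; other channels are untouched."""
        target = self.channels[ch]
        if target.busy:
            # Assumption: a running channel must be stopped before it is reloaded.
            raise RuntimeError(f"channel {ch} is processing; stop it first")
        target.program = program

    def start(self, ch):
        self.channels[ch].busy = True

    def stop(self, ch):
        self.channels[ch].busy = False

# Usage: channel 0 keeps running QPSK while channel 1 is reprogrammed.
modem = BasebandModem()
modem.download(0, "QPSK")
modem.start(0)
modem.download(1, "FM")
```

Keeping the state per channel is what allows one waveform to be swapped without interrupting the others.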

Page 143: Low Power Multimedia Reconfigurable Platforms

Specification of Prototype

RF range: 2~500 MHz

Waveform: SSB, AM, FM, BPSK, QPSK, 8PSK, 16QAM

Number of channels: four, full-duplex

Radio relay: Repeat/Bridge

Frequency accuracy: <0.1 ppm

Rx IF frequency: 70 MHz

Tx IF frequency: 25 MHz

Dynamic range: 14 bits

Rx IF sampling frequency: 40 MHz

Tx IF sampling frequency: 100 MHz
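A short calculation shows how the IF and sampling figures above relate. The sketch below assumes ideal sampling and reads the 14-bit figure as converter resolution; the helper names `alias_frequency` and `ideal_sqnr_db` are hypothetical, and the 6.02N + 1.76 dB rule is the textbook bound for a full-scale sine wave, not a measured result for this prototype.

```python
def alias_frequency(f_signal, f_sample):
    """Frequency at which f_signal appears in the first Nyquist zone after sampling."""
    folded = f_signal % f_sample
    return f_sample - folded if folded > f_sample / 2 else folded

def ideal_sqnr_db(bits):
    """Ideal quantization SNR of an N-bit converter for a full-scale sine wave."""
    return 6.02 * bits + 1.76

# Rx: a 70 MHz IF sampled at 40 MHz is undersampled (bandpass sampling).
rx_alias_hz = alias_frequency(70e6, 40e6)

# Tx: ratio of the sampling frequency to the IF frequency.
tx_ratio = 100e6 / 25e6

# 14-bit converter, ideal case.
sqnr_db = ideal_sqnr_db(14)
```

Under these assumptions the 70 MHz receive IF aliases to 10 MHz, which is one quarter of the 40 MHz sample rate, and the 25 MHz transmit IF is likewise one quarter of the 100 MHz sample rate. An IF at a quarter of the sample rate is a common choice because the digital mixer then reduces to multiplying by 0 and ±1.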

Page 144: Low Power Multimedia Reconfigurable Platforms

Specification of Prototype

Signal processing: FPGA for the quadrature MODEM; DSP for the baseband MODEM

FPGA: XCV2000E x 3

DSP: TMS320C6701 x 4

CPU: Control module : Celeron Peripheral module

System bus: cPCI

Operating system: Linux

HMI: operates from a web browser

Interface: Audio I/O, Serial I/O, Ethernet (100BASE-TX)