Codesign Extended Applications. Brian Grattan, Greg Stitt, Frank Vahid*, Dept. of Computer Science & Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems at UC Irvine. This work was supported in part by the National Science Foundation and by NEC C&C Research Labs.


Page 1: Codesign Extended Applications

Codesign Extended Applications

Brian Grattan, Greg Stitt, Frank Vahid*
Dept. of Computer Science & Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation and by NEC C&C Research Labs

Page 2: Codesign Extended Applications

CODES’02 – Codesign Extended Applications. Brian Grattan, Greg Stitt, Frank Vahid, Univ. of California, Riverside

Outline

  • Introduction: Hardware/Software Partitioning
      • And the common assumption of a single specification
  • Different Algorithms in Hardware/Software
  • Codesign Extended Applications
  • Experiments
  • Future Work and Conclusions

Page 3: Codesign Extended Applications

Introduction – Hw/Sw Partitioning

  • Hw/sw partitioning can speed up software
      • Shown by numerous researchers
          • E.g., Balboni, Fornaciari, Sciuto CODES’96; Eles, Peng, Kuchcinski, Doboli DAES’97; Gajski, Vahid, Narayan, Gong Prentice-Hall 1997; Grode, Knudsen, Madsen DATE’98; many others
      • 1.5 to 10x common
      • Some examples like image processing get 100-800x speedup
          • E.g., Cameron project, FCCM’02
  • Can reduce energy too
      • E.g., Henkel, Li CODES’98; Wan, Ichikawa, Lidsky, Rabaey CICC’98; Stitt, Grattan, Villarreal, Vahid FCCM’02
      • 60-80% energy savings measured on real single-chip uP/FPGA devices

Page 4: Codesign Extended Applications

Hw/Sw Partitioning on Single-Chip Platforms

  • Numerous single-chip commercial devices with uP and FPGA
      • Triscend E5 (shown)
      • Triscend A7
      • Atmel FPSLIC
      • Xilinx Virtex II Pro
      • Altera Excalibur
      • More sure to come…
  • Make hw/sw partitioning even more attractive

[Figure: Triscend E5 device, with regions labeled "uP and peripherals", "Cache/memory", and "Configurable logic"]

Page 5: Codesign Extended Applications

Hw/Sw Partitioning – Commercial Tools Evolving

  • Commercial products evolving
      • Synopsys’ Nimble compiler (2000) attempt
      • Proceler
          • Microprocessor Report’s 2001 Technology of the Year Award
      • Others coming…

Page 6: Codesign Extended Applications

Hw/Sw Partitioning – Single-Spec Assumption

  • Assumption – Start from a single specification
      • Typically sw source
  • Partitioning
      • Find critical sw kernels, map some to hw
  • This assumption is made in most research efforts as well as commercial tools

[Figure: Specification -> Hw/sw partitioner -> Sw / Hw -> Compilation / Synthesis -> Binaries / Netlists]

Page 7: Codesign Extended Applications

Digital Camera Example

  • Developed with intent of exploring hw/sw tradeoffs
      • Captures images, compresses, uploads to PC
  • Soon found that a single specification wasn’t reasonable
      • Two key functions had different hw/sw algorithms
          • CRC
          • DCT

[Figure: digital camera block diagrams, with blocks labeled Controller, Communications, DCT, CCD Pre-Processor, Huffman Encoder, and CRC calculation]

Page 8: Codesign Extended Applications

Digital Camera Example

  • Partitioning from the single specification results in a weak hw design
      • We would have written CRC and DCT differently had we known they’d be mapped to hw
      • Yet, we’d keep the original algorithms if they ended up in software

[Figure: Spec (DCT, Huffman, CRC, CCD, Ctrl) -> Hw/sw partitioner -> Sw (Huff., CCD, Ctrl) / Hw (CRC, DCT) -> Compilation / Synthesis -> Binaries / Netlists, with the hw result marked "Weak"]

Page 9: Codesign Extended Applications

Different Algorithms in Hw vs. Sw

  • The single-specification assumption doesn’t always hold
  • Key observation
      • Designers often use very different algorithms if a behavior is mapped to hardware versus if that behavior is mapped to software
      • Widely known by designers
      • In textbooks
      • Also known in parallel processing – sequential and parallel algorithms

Page 10: Codesign Extended Applications

Different Algorithms – Sorting Example

  • Suppose desired behavior fills a buffer, sorts the buffer, and transmits the sorted list

    Fill()
    Sort()
    Transmit()

  • Sort() in software – Quicksort
      • Simple and fast in sw
      • Poor in hw, can’t be parallelized well
  • Sort() in hardware – Parallel Mergesort
      • Very fast in hardware
      • Slow in sw (if sequential) due to overhead
  • Derive one from the other?

[Figure: one Quicksort block for software versus several mergesort (MS) blocks working in parallel for hardware]
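To make the contrast concrete, here is a minimal sketch of a bottom-up mergesort in C. It is not the authors' implementation; the function names and the pass-by-pass structure are assumptions chosen for illustration. Within one pass, every merge works on a disjoint slice of the buffer, which is the property that lets hardware run the merges side by side, while the copy between passes is part of the overhead a sequential processor pays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge two adjacent sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Bottom-up mergesort; buf and tmp must each hold n ints.
 * All merge_runs() calls inside one pass touch disjoint ranges, so they
 * could run concurrently in hardware. This software model runs them in turn. */
void mergesort_bottom_up(int *buf, int *tmp, size_t n)
{
    assert(n == 0 || (buf != NULL && tmp != NULL));
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(buf, tmp, lo, mid, hi);
        }
        memcpy(buf, tmp, n * sizeof *buf);  /* extra copy per pass: sequential overhead */
    }
}
```

A caller supplies the buffer to sort plus a scratch buffer of the same size, for example `mergesort_bottom_up(a, scratch, 7)`.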

Page 11: Codesign Extended Applications

Different Algorithms – CRC Example

  • CRC – Cyclic Redundancy Check
      • Used for error checking during communication, stronger than parity
      • Mathematically, divides a constant into the data and saves the remainder

[Figure: Main Function calls crc() with parameters:
    init_crc - initial value
    *data - pointer to data
    len - length of data
    jinit - initializing options
crc() returns: value of CRC for given data
Message layout: crc/data/data/data]
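The "divide and keep the remainder" description can be sketched as a bit-serial division in C. This is an illustration under assumptions, not code from the slides: the generator 0x1021 (x^16 + x^12 + x^5 + 1) is assumed because it is consistent with the per-bit equations shown on the hardware slide, the function names are hypothetical, and the buffer is indexed from 0 rather than from 1 as in the slides' code.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC16_POLY 0x1021u  /* assumed generator: x^16 + x^12 + x^5 + 1 */

/* Fold one data byte into the CRC, most significant bit first.
 * Each iteration is one step of polynomial long division: shift left,
 * and subtract (XOR) the generator whenever a 1 is shifted out. */
uint16_t crc16_update(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)(byte << 8);
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ CRC16_POLY)
                              : (uint16_t)(crc << 1);
    return crc;
}

/* CRC of len bytes starting from init_crc; the result is the remainder. */
uint16_t crc16(uint16_t init_crc, const uint8_t *data, size_t len)
{
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++)
        crc = crc16_update(crc, data[i]);
    return crc;
}
```

With an initial value of 0 this matches the widely published check value 0x31C3 for the ASCII string "123456789".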

Page 12: Codesign Extended Applications

Different Algorithms – CRC in Hardware

Hardware Version

    char crc_hw(…)
    {
        unsigned short j, crc_value = init_crc;
        unsigned short new_crc_value;

        if (jinit >= 0)
            crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
        for (j = 1; j <= len; j++) {
            /* each output bit is an XOR of selected data bits and previous CRC bits */
            new_crc_value = bit(4,data[j]) ^ bit(0,data[j]) ^ bit(8,crc_value) ^ bit(12,crc_value);  // bit 0
            new_crc_value = new_crc_value | (bit(5,data[j]) ^ bit(1,data[j]) ^ bit(9,crc_value) ^ bit(13,crc_value)) << 1;
            new_crc_value = new_crc_value | (bit(6,data[j]) ^ bit(2,data[j]) ^ bit(10,crc_value) ^ bit(14,crc_value)) << 2;
            /* … continue for bits 3 through 7 … */
        }
        return (new_crc_value);
    }

  • Knowing the generator polynomial, one can calculate the XORs for each individual bit
  • Each CRC value is the result of bit-wise XORs with the data and the previous CRC value
  • Synthesizes to hw very nicely; but getting bits and shifting are inefficient in sw
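The per-bit XOR equations can be checked against a bit-serial division. The sketch below does that for the three equations visible on the slide (output bits 0 to 2). It assumes the generator polynomial is 0x1021, which the slide does not state, and the definition of `bit()` and all function names here are illustrative rather than the authors'.

```c
#include <stdint.h>

/* Bit n of v as 0 or 1. Plays the role of bit() on the slide; definition assumed. */
static unsigned bit(unsigned n, unsigned v) { return (v >> n) & 1u; }

/* Reference: fold one byte into the CRC by serial division (assumed poly 0x1021). */
static uint16_t crc_step_serial(uint16_t crc, uint8_t d)
{
    crc ^= (uint16_t)(d << 8);
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                              : (uint16_t)(crc << 1);
    return crc;
}

/* Output bits 0..2 written as the slide's XOR equations. */
static unsigned crc_low3_parallel(uint16_t crc, uint8_t d)
{
    unsigned b0 = bit(4, d) ^ bit(0, d) ^ bit(8, crc)  ^ bit(12, crc);
    unsigned b1 = bit(5, d) ^ bit(1, d) ^ bit(9, crc)  ^ bit(13, crc);
    unsigned b2 = bit(6, d) ^ bit(2, d) ^ bit(10, crc) ^ bit(14, crc);
    return b0 | (b1 << 1) | (b2 << 2);
}

/* Returns 1 if the equations agree with serial division for every
 * previous CRC value and every data byte, 0 otherwise. */
int equations_match(void)
{
    for (unsigned crc = 0; crc <= 0xFFFFu; crc++)
        for (unsigned d = 0; d <= 0xFFu; d++)
            if ((crc_step_serial((uint16_t)crc, (uint8_t)d) & 7u)
                    != crc_low3_parallel((uint16_t)crc, (uint8_t)d))
                return 0;
    return 1;
}
```

In hardware each equation is a small XOR network evaluated in parallel. In software the same equations cost a shift and mask per term, which is the inefficiency the slide points to.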

Page 13: Codesign Extended Applications

Different Algorithms – CRC in Software

Software Version

  • Before doing any calculations, create an initialization table that calculates the CRC for each individual character
  • Use data as index into initialization table and execute two XORs
  • Requires lookups, but faster for a sequential calculation

    char crc_sw(…)  // Source: Numerical Recipes in C
    {
        unsigned short initialize_table(unsigned short crc, unsigned char one_char);
        static unsigned short icrctb[256], init = 0;
        unsigned short tmp1, j, crc_value = init_crc;

        if (!init) {  /* build the 256-entry table once, on the first call */
            init = 1;
            for (j = 0; j <= 255; j++) {
                icrctb[j] = initialize_table(j << 8, (uchar)0);
            }
        }
        if (jinit >= 0)
            crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
        for (j = 1; j <= len; j++) {
            tmp1 = data[j] ^ HIBYTE(crc_value);                  /* table index */
            crc_value = icrctb[tmp1] ^ LOBYTE(crc_value) << 8;
        }
        return (crc_value);
    }

Page 14: Codesign Extended Applications

Different Algorithms – DCT

  • DCT – Discrete Cosine Transform
      • Computationally intensive, numerous matrix multiplies
      • Accounts for perhaps 70% of JPEG encoding time
  • Dozens of possible algorithms
      • Best algorithm depends largely on computational resources
      • Certainly different for sw and hw
          • Doing multiplications in floating-point vs. fixed-point
          • Multiplication by a constant can be efficiently mapped to hardware, but accuracy will be lost by not using floating-point
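The floating-point versus fixed-point tradeoff can be illustrated with a single constant multiply of the kind a DCT performs many times. This is a sketch under assumptions: the constant cos(pi/4), the 13 fractional bits, and the function names are chosen for illustration and are not taken from the authors' DCT.

```c
#include <stdint.h>

#define FRAC_BITS 13
/* cos(pi/4) ~= 0.70710678; scaled by 2^13 this is about 5792.6, rounded to 5793 */
#define COS_PI_4_FIXED 5793

/* Fixed-point multiply by the constant, rounded to the nearest integer.
 * Assumes 0 <= x <= 300000 so the 32-bit product cannot overflow.
 * In hardware a constant multiply like this reduces to shifts and adds. */
int32_t mul_cos_fixed(int32_t x)
{
    return (x * COS_PI_4_FIXED + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
}

/* Floating-point reference for the same multiply. */
double mul_cos_float(int32_t x)
{
    return x * 0.7071067811865476;
}
```

For x = 1000 the fixed-point version returns 707, while the floating-point product is about 707.107. The difference is the accuracy given up in exchange for a cheap hardware mapping.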

Page 15: Codesign Extended Applications

Codesign Extended Applications (CEAs)

  • Basic idea: Write two versions of certain functions
      • Only the critical functions, and
      • Only those with different sw and hw algorithms
      • Typically only a handful of these
          • Most time is spent in just a few critical functions
  • Include both function versions in the specification
      • But use compiler flags to include either sw or hw version

    main()
    {
        …
        crc();
        …
    }

    char crc(…)
    {
    #ifdef cea_crc_hw
        return crc_hw(…);   /* hw version selected with -Dcea_crc_hw */
    #else
        return crc_sw(…);   /* default: sw version */
    #endif
    }

    % gcc -Dcea_crc_hw main.c

Page 16: Codesign Extended Applications

CEAs when using C/C++ and VHDL

C code

    crc_hw(…inputs…)
    {
        /* Hardware crc... */
        for (j = 1; j <= len; j++) {
            TSHORT(to_hw) = data[j];  /* write the next data value to the hw input */
            TBYTE(enable) = 1;        /* raise enable: hw computes one CRC step */
            TBYTE(enable) = 0;        /* lower enable: hw clears done, drives output */
        }
        crc_value = TSHORT(result);   /* read back the final CRC */
        return (crc_value);
    }

VHDL code

    if (rst = '1') then
        crc <= "0000000000000000";
        done <= '0';
    elsif (clk'event and clk = '1') then
        if (enable = '1') then
            if done = '0' then
                crc <= nextCRC16_D8(input, crc);
                done <= '1';
            end if;
        else
            done <= '0';
            output <= crc;
        end if;
    end if;

Page 17: Codesign Extended Applications

CEAs Enable Hw/Sw Partitioning Tool

  • Traditional hw/sw partitioner
      • Compiler, estimators, search heuristics, technology files, etc.
      • Drawback: heavy impact on tool flow
  • CEAs plus platforms result in simple partitioner
      • Script uses existing compiler, synthesis, and evaluation (simulation or physical measurement)
      • Drawbacks: must write two versions of critical functions, script may use simpler search function
      • Different partitioners for different domains

[Figure, traditional flow: Specification -> Hw/sw partitioner -> Sw / Hw -> Compilation / Synthesis -> Binaries / Netlists. The partitioner is essentially a compiler, search heuristic, and estimator. Heavy-duty tool.]

[Figure, CEA flow: CEA -> Script -> Sw / Hw -> Compilation / Synthesis -> Binaries / Netlists -> Evaluator. The script provides the search heuristic and tool control. Lightweight tool.]
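One way such a script could enumerate candidate partitions is sketched below in C. The slides do not show the script, so everything here is an assumption: the function names `multiply`, `sum`, and `bit_share` are hypothetical, the flag pattern simply extends `cea_crc_hw` from the earlier slide, and the search is a plain exhaustive walk over all 2^n hw/sw assignments. The sketch only builds compile commands; synthesis and evaluation are left to the existing tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_FUNCS 3

/* Hypothetical CEA function names, one compiler flag per function. */
static const char *const func_names[NUM_FUNCS] = { "multiply", "sum", "bit_share" };

/* Build the -D flags for one partition into out.
 * Bit i of mask set means function i is mapped to hardware.
 * Returns the number of characters written, or -1 if out is too small. */
int partition_flags(unsigned mask, char *out, size_t out_size)
{
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (int i = 0; i < NUM_FUNCS; i++) {
        if (mask & (1u << i)) {
            int n = snprintf(out + used, out_size - used,
                             "-Dcea_%s_hw ", func_names[i]);
            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
        }
    }
    return (int)used;
}

/* Print one compile command per candidate partition (2^NUM_FUNCS in total). */
void print_all_partitions(void)
{
    char flags[128];
    for (unsigned mask = 0; mask < (1u << NUM_FUNCS); mask++)
        if (partition_flags(mask, flags, sizeof flags) >= 0)
            printf("gcc %smain.c\n", flags);
}
```

Exhaustive enumeration is only practical because a CEA has a handful of dual-version functions. A larger set would need the simpler search function the slide mentions.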

Page 18: Codesign Extended Applications

Experiments

  • Compared hw and sw CRC algorithms
      • Synthesized to FPGA
      • Compiled to MIPS uP
  • Demonstrates need for different algorithms

Sw and hw CRC algorithms in FPGA.

                              Size (Blocks)    Delay (clock cycles/character)
    Hardware CRC algorithm    19               1
    Software CRC algorithm    44               3

Sw and hw CRC algorithms on a microprocessor.

                              Size (Assembly Lines)    Clock Cycles
    Software CRC algorithm    1061                     180,000
    Hardware CRC algorithm    1298                     814,000

Page 19: Codesign Extended Applications

Experiments

  • Wrote small signal processing example as CEA
      • Wrote sw and hw versions of core functions
      • In this case, algorithms were similar
  • Set up power measurement for two real platforms
      • XS40 (board with microcontroller chip and Xilinx FPGA chip)
      • E5 (single chip with microcontroller and FPGA)
  • Partitioning script automatically partitioned and measured power and cycles (overnight – due to place & route time)
  • Demonstrates how CEAs enable simple yet practical hw/sw partitioning
      • Easily migrates to different platforms, different chips

Partitioning and energy on E5 device.

    Multiply    Sum    Bit-Share    Energy (Joules)
    SW          SW     SW           12.4
    SW          SW     HW           8.6
    SW          HW     SW           8.8
    HW          SW     SW           8.0
    SW          HW     HW           4.8
    HW          SW     HW           Does not Route
    HW          HW     SW           Does not Route
    HW          HW     HW           Does not Route
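The selection step over these measurements is simple enough to sketch in C. The energy values are copied from the E5 table. The struct layout, the sentinel for partitions that do not route, and the function names are assumptions made for this illustration.

```c
#include <stddef.h>

#define DOES_NOT_ROUTE (-1.0)  /* sentinel: place & route failed */

struct partition {
    const char *multiply, *sum, *bit_share;  /* "SW" or "HW" per function */
    double energy_joules;                    /* measured energy, or DOES_NOT_ROUTE */
};

/* Measurements from the E5 table in the slides. */
static const struct partition e5_results[] = {
    { "SW", "SW", "SW", 12.4 },
    { "SW", "SW", "HW", 8.6 },
    { "SW", "HW", "SW", 8.8 },
    { "HW", "SW", "SW", 8.0 },
    { "SW", "HW", "HW", 4.8 },
    { "HW", "SW", "HW", DOES_NOT_ROUTE },
    { "HW", "HW", "SW", DOES_NOT_ROUTE },
    { "HW", "HW", "HW", DOES_NOT_ROUTE },
};
#define NUM_RESULTS (sizeof e5_results / sizeof e5_results[0])

/* Index of the routable partition with the lowest energy, or -1 if none routes. */
int best_partition(const struct partition *r, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (r[i].energy_joules < 0.0)
            continue;  /* skip partitions that did not route */
        if (best < 0 || r[i].energy_joules < r[best].energy_joules)
            best = (int)i;
    }
    return best;
}

/* Percent energy saved relative to a baseline measurement. */
double percent_saved(double baseline, double value)
{
    return 100.0 * (baseline - value) / baseline;
}
```

On this data the best routable partition is Multiply in software with Sum and Bit-Share in hardware, at 4.8 J. Relative to the all-software 12.4 J that is (12.4 - 4.8) / 12.4, roughly 61% less energy.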

Page 20: Codesign Extended Applications

Issues and Future Work

  • Issues
      • What if hw versions not used after partitioning? Wasted effort?
      • Verification of all possible combinations?
      • Must use wisely or problem grows unwieldy
  • Future work
      • More examples, more platforms
      • Several versions of the same function
          • One hardware area-conscious
          • One hardware speed-conscious
          • One software code-size-conscious
          • One software speed-conscious
          • …more…
      • Experimenting with communication between hardware and software
          • DMA transfer, wide-access memories, …

Page 21: Codesign Extended Applications

Conclusions

  • Basic hw/sw partitioning assumption of a single specification doesn’t always hold
  • Codesign Extended Applications help support different algorithms
  • CEAs enable hw/sw partitioning in existing tool flows
      • Utilizes existing compilation, synthesis, mapping, evaluation tools, and platforms
      • Simple yet effective approach to hw/sw partitioning