
TKT-2431 SoC Design

Lec 8 – Optimization, ASIP

Erno Salminen

Department of Computer Systems, Tampere University of Technology

Fall 2012

Erno Salminen - Oct. 2012

Copyright notice

Part of the slides adapted from the slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley: http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml

Part of the figures from: J. Heikkinen, J. Sertamo, T. Rautiainen and J. Takala, "Design of Transport Triggered Architecture Processor for Discrete Cosine Transform", in Proc. 15th Ann. IEEE Int. ASIC/SOC Conf., Rochester, NY, U.S.A., Sept. 25-28 2002, pp. 87-91


Outline
- Determine bottlenecks: Amdahl's law
- Methods
  - Architectural choices
  - Algorithm modifications, assembly coding
  - Custom processors, e.g. ASIP
  - HW accelerators
  - (Parallel processing on next lecture)


At first

Make sure that simple things work before even trying more complex ones


Foreword
- "Premature optimization is the root of all evil" Donald Knuth [quoting Hoare]
- Sutter, Alexandrescu
  - 1st rule: Don't optimize
  - 2nd rule (for experts only): Don't do it yet. Measure twice, optimize once.
- Focus on making code as clear and readable as possible
- Optimizations make design and code more complex
- Optimize only when a performance bottleneck has been proven

System bottlenecks (1)
- Determine what's taking time
  - Or area, power, memory
- Bottleneck halts other parts of the system

[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, Tampere SoC, Nov. 2004]

[Berkeley Design Technology Inc., Alternatives to DSPs: What and Why?, Tampere SoC, Nov. 2003]

System bottlenecks (2)
- Concentrate optimization on bottlenecks
  - Don't optimize everything, e.g. a function taking 3% of runtime
- System may be refined into smaller blocks to define the bottlenecks, e.g. in logic's critical path or area usage
  - Otherwise, it is difficult to determine the relation between HDL source line and schematic
- Removing a single bottleneck might have a minor effect if the second worst is almost as bad
  - Consider e.g. critical paths in logic
- Embarrassingly trivial Matlab example
  - Removed one unnecessary #include from m files: 12x speedup
  - Locating the bottleneck took a few hours, fixing took 1 minute
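Locating a bottleneck can start with nothing more than timing each stage and ranking the stages by their share of total runtime. The following is a minimal Python sketch of that idea; the helper `time_shares` and the three workload functions are hypothetical names invented for illustration, not part of the course material.

```python
import time

def time_shares(stages, repeat=3):
    """Return {name: fraction of total runtime} for a dict of callables."""
    elapsed = {}
    for name, func in stages.items():
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            func()
            best = min(best, time.perf_counter() - start)
        elapsed[name] = best
    total = sum(elapsed.values())
    return {name: t / total for name, t in elapsed.items()}

# Hypothetical workload: one heavy stage and two light ones.
def heavy():
    return sum(i * i for i in range(200_000))

def light_a():
    return sum(range(2_000))

def light_b():
    return sorted(range(2_000), reverse=True)

shares = time_shares({"heavy": heavy, "light_a": light_a, "light_b": light_b})
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {100 * share:5.1f} %")
```

Only the stage at the top of such a ranking is worth optimizing first; a real profiler gives the same kind of answer with finer granularity.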


Amdahl's Law

$$t_{new} = t_{old} \cdot \left( (1 - fraction_{enhanced}) + \frac{fraction_{enhanced}}{speedup_{enhanced}} \right)$$

$$speedup_{overall} = \frac{t_{old}}{t_{new}} = \frac{1}{(1 - fraction_{enhanced}) + \dfrac{fraction_{enhanced}}{speedup_{enhanced}}}$$

Note! Very important!

[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]

Amdahl's Law Example
- Floating point instructions improved to run 2X, but only 10% of actual instructions are FP

$$t_{new} = t_{old} \cdot (1.0 - 0.1 + 0.1/2) = 0.95 \cdot t_{old}$$

$$speedup_{overall} = \frac{1}{0.95} = 1.053$$

- Max. $speedup_{overall} = 1 / (1 - fraction_{enhanced})$
- Be careful: "new is 5% smaller than old" means that "old is 5.3% larger than new"
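The floating-point example can be checked numerically. This short Python sketch evaluates Amdahl's law for the slide's numbers; the function names `amdahl_speedup` and `max_speedup` are chosen here for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the runtime is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def max_speedup(fraction_enhanced):
    """Limit of the overall speedup as speedup_enhanced grows without bound."""
    return 1.0 / (1.0 - fraction_enhanced)

overall = amdahl_speedup(0.1, 2.0)
print(round(overall, 3))            # 1.053
print(round(max_speedup(0.1), 3))   # 1.111
```

Even an infinitely fast FP unit would give only about 1.11x overall here, which is why the enhanced fraction matters more than the local speedup.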

Architectural choices

Architectural choices: qualitative

[Figure: log flexibility vs. log efficiency (increasing speed, decreasing power and area). Design points shown: general purpose microprocessor, SW programmable DSP, hardware reconfigurable processor (FPGA), std. cell ASIC, full custom ASIC (direct mapped HW). The "dream solution" exists only in marketing material. No free lunch this time either.]

Architectural choices: quantitative data

[Figure: quantitative comparison of general-purpose CPU, DSP, FPGA, ASIP, std-cell ASIC and full custom ASIC. Figure data by T. Noll, RWTH Aachen.]

Heinrich Meyr, Future Wireless Communication Systems…, VTC, 2005.
http://www.ieeevtc.org/vtc2005spring/presentations/2020_presentations/HMeyr.pdf

Note: flexibility and price are not included

Architectural choices: quantitative (2)
- Area and energy efficiencies of comparable MPEG-4 encoder implementations (the bigger the better)

[Figure: area efficiency [Mpixels/s/mm2] and energy efficiency [Mpixels/s/W] of the implementations, with the dream solution marked. Values include RAM.]

[O. Silven and K. Jyrkkä, Observations on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, Vol 2007, Article ID 56976, 10 pages, 2007.]

Algorithmic modifications, assembly language


Example: Sorting
- Simplest algorithms have O(n^2) execution time
- More complex O(n log n)
  - Require recursion, advanced data structures, and multiple arrays
  - Recursion may lead to stack overflow
  - Multiple arrays require big memory

[Figure: execution times of bubble, selection, insertion, shell, heap, merge and quick sort. Fig: http://linux.wku.edu/~lamonml/algor/sort/sort.html]

P.S. Avoid light-colored lines (e.g. yellow).
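The gap between O(n^2) and O(n log n) can be made concrete by counting comparisons. This is a minimal Python sketch, written for illustration rather than taken from the cited page, comparing a plain bubble sort with a recursive merge sort that uses extra arrays; the input size and random seed are arbitrary.

```python
import random

def bubble_sort(values):
    """O(n^2) sort; returns (sorted list, number of comparisons)."""
    data = list(values)
    comparisons = 0
    n = len(data)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return data, comparisons

def merge_sort(values):
    """O(n log n) sort using recursion and extra arrays; returns (sorted list, comparisons)."""
    if len(values) <= 1:
        return list(values), 0
    mid = len(values) // 2
    left, c_left = merge_sort(values[:mid])
    right, c_right = merge_sort(values[mid:])
    merged = []
    comparisons = c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons

random.seed(0)
data = [random.randrange(10_000) for _ in range(1024)]
_, bubble_cmp = bubble_sort(data)
_, merge_cmp = merge_sort(data)
print(bubble_cmp, merge_cmp)
```

For 1024 elements the bubble sort always performs n(n-1)/2 = 523776 comparisons, while the merge sort stays below n log2 n = 10240, at the price of recursion and temporary arrays.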

Algorithm manipulation
- Do not perform over-accurate calculation
  - Single/double prec. floating-point vs. fixed point
  - Fixed point is less accurate but may be enough
  - SW emulation of floating point operations is s-l-o-w, tens to hundreds of cycles per operation (+, *, /…)
  - HW FPUs are big:
    - Nios II/f + periph ~2 kALUT, FPU incl. DIV 4.2 kALUT
    - ~5.7 mm2 @0.35 um [Brunelli, TreSoc04], ~120 kgates (compare to RISC core ~50 kgates)
- Word width optimization
  - Useful especially on HW
  - However, smallest is not necessarily fastest on SW
    - Using type char may require additional shift/AND/OR instructions
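To illustrate the fixed-point trade-off, here is a small Python sketch of Q15 arithmetic, where a multiply needs only an integer multiply and a shift. The Q15 format, the helper names and the test values are assumptions chosen for the example; the slides do not prescribe a particular format.

```python
FRAC_BITS = 15            # Q15: 1 sign bit, 15 fractional bits (assumed format)
SCALE = 1 << FRAC_BITS

def to_q15(x):
    """Quantize a float in [-1, 1) to a 16-bit Q15 integer (saturating)."""
    return max(-SCALE, min(SCALE - 1, int(round(x * SCALE))))

def q15_mul(a, b):
    """Multiply two Q15 numbers using only integer multiply and shift."""
    return (a * b) >> FRAC_BITS

def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / SCALE

a, b = 0.7071, -0.3333
exact = a * b
approx = from_q15(q15_mul(to_q15(a), to_q15(b)))
print(exact, approx, abs(exact - approx))
```

The result differs from the floating-point product by less than 1e-4, which is often accurate enough, and no floating-point operation is needed in the multiply itself.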


Example2: Sacrificing quality
- Decrease data width of HW

[Ramchan Woo, Tampere SoC, Nov. 2004]

Assembly coding (1)
- Try assembly only if everything else fails
  - Keep also the high-level language (HLL) version to allow portability and reuse
- Sometimes required with special instructions
  - Such as interrupt handling, MMX, processor mode (user/supervisor)
- Speedup with RISC processors not that great
  - Usually only one execution unit
  - (Few) instructions, simple addressing
  - Decent compilers available

Assembly coding (2)
- DSPs most likely benefit from assembly
  - Tight loops
  - Complex micro-architecture is difficult for compiler
- "Latest Compilers fall short of hand-optimized performance substantially even for DSP Kernels"

[Naji S. Ghazal et al., Retargetable Estimation for DSP, Architecture Selection, Tampere SoC, Nov. 1999]

Optimization impact
- RISC = estimated number of required basic "RISC" operations
- fm = fitting coefficient = measured_cycles / estim_RISC_ops
- N.O. = no optimization
- H.O. = hand optimized
- It was no use to hand-optimize the codes for single-issue RISC (=ARM)

[O. Lehtoranta, PhD Thesis, TUT 2006]

Assembly example: vector copy, B[] = A[]

First version

    start_copy:
        ld   r1, [r2]      // r2 is src addr, A[i]
        st   [r3], r1      // r3 is dst addr, B[i]
        inc  r2
        inc  r3
        dec  r4            // r4 is data amount, one data copied
        cmp  r4, 0         // is enough copied?
        bneq start_copy    // loop back if needed

Load causes pipeline stall if next instruction depends on loaded value, like here

Second version

        ld   r1, [r2]
        inc  r2
        st   [r3], r1
        and so on ...

- Incrementing r2 does not depend on r1 and stall is avoided
- Load could be performed just before branch
  - Load delay happens during pipeline stall
- Some ISAs support auto-increment in load and store
- Poor compiler might even load the table addresses again on every iteration
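The effect of reordering can be checked with a toy model that counts load-use hazards. This Python sketch assumes a hypothetical pipeline with a one-cycle penalty whenever an instruction reads a register written by the immediately preceding load; the instruction encoding and the remainder of the second loop body (elided as "and so on" in the slide) are assumptions made for illustration.

```python
def count_load_use_stalls(program):
    """Count stalls where an instruction reads a register loaded by the
    immediately preceding 'ld' (assumed one-cycle load-use penalty)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dst, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "ld" and prev_dst in cur_srcs:
            stalls += 1
    return stalls

# Each instruction: (opcode, destination register or None, source registers)
first_version = [
    ("ld",   "r1", ("r2",)),
    ("st",   None, ("r3", "r1")),   # needs r1 right after the load: stall
    ("inc",  "r2", ("r2",)),
    ("inc",  "r3", ("r3",)),
    ("dec",  "r4", ("r4",)),
    ("cmp",  None, ("r4",)),
    ("bneq", None, ()),
]
second_version = [
    ("ld",   "r1", ("r2",)),
    ("inc",  "r2", ("r2",)),        # independent of r1, fills the load delay
    ("st",   None, ("r3", "r1")),
    ("inc",  "r3", ("r3",)),
    ("dec",  "r4", ("r4",)),
    ("cmp",  None, ("r4",)),
    ("bneq", None, ()),
]
print(count_load_use_stalls(first_version), count_load_use_stalls(second_version))
```

In this model the first ordering stalls once per iteration and the second not at all, with the same instruction count.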


Assembly example: delayed branch

    Addr  Instruction
    a1    i1: MR=MR+MX0*MY0 (SS);
    a2    i2: IF COND JUMP aa1;
    a3    i3
    a4    i4
    a5    i5
    a6    i6
    a7    i7
    ...   ...
    aa1   ii1

Fig 2. 'Normal' branch
- Four-cycle stall before ii1 is executed

Fig 3. Delayed branch
- Two instr. (i3 + i4) following the branch are also executed. They must be nop if others are not found
- Only two-cycle stall

[http://www.analog.com/UploadedFiles/Application_Notes/587795865ee_123.pdf]

Custom processors (ASIPs)


Custom processors
- ASIP = Application Specific Instruction-set Processor
- Extend CPU with application (domain) specific instructions
  - MAC, sum with clipping, DCT etc.
  - Extension tightly coupled with CPU pipeline
- Optimize internal communication within CPU
- Remove unnecessary instructions
- Otherwise configure CPU (num of registers, data width...)
- Allow using C/C++ compilation


Custom processor performance (1)
- Tensilica Xtensa
- Kernel speed-up 6x – 100x
  - Depends heavily on application
- Base CPU ~20 000 gates
  - HW overhead 20% - 150%
- Sidenote: (most likely) the largest multiprocessor chip in the world contains 192 Xtensa processors (Cisco's CSR-1 router chip)

Fig: [Monica Lam, Compiler Technology for Configurable Processors, Tampere SoC, Nov. 2001.]


Custom processor performance (3)
- Beneficial also for energy
  - Note: E = P * t

[Figure: energy comparison, with (6.1x speedup) and (8.0x speedup) marked]

[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, TreSoC 2004]

Transport Triggered Architecture (TTA)
- Application-specific processor
- Wide range in performance vs. cost
  - Can reach almost the same cycle count as ASIC
  - Still allows programmability, more flexible than HW
- Easily configurable
  - Number and type of execution units
  - Connections between units
  - Number of cores (multi-threading)
  - Many trade-offs between area and performance
- Easy way to design an accelerator
  - Designer gives C and HW description
  - Tools generate synthesizable VHDL
  - Automated exploration is under construction

Screen caps: tce.cs.tut.fi

TTA (2)
- Harvard architecture
  - Separate instruction and data memories
  - Supports multiple data memories
- C compiler and simulator automatically configured to new micro-architecture
- Only one instruction: move
  - e.g. "Add r2, r3, r3":
        move RF[2] -> ALU.op1
        move RF[3] -> ALU.trig
        move ALU.result -> RF[3]
- Instruction word has as many fields as there are internal buses
  - Resembles VLIW, everything scheduled at compile-time
  - Larger code size than RISC
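The three-move addition can be mimicked with a toy simulator in which the only operation is a data transport and the ALU fires as a side effect of writing its trigger port. This is a minimal Python sketch; the class `TinyTTA`, its port names and its single add-only ALU are simplifications invented here and do not reflect the actual TCE tools.

```python
class TinyTTA:
    """Toy model: a register file and one ALU whose add fires on a write to 'trig'."""

    def __init__(self, registers):
        self.rf = list(registers)
        self.alu = {"op1": 0, "trig": 0, "result": 0}

    def read(self, src):
        unit, index = src
        return self.rf[index] if unit == "RF" else self.alu[index]

    def move(self, src, dst):
        value = self.read(src)
        unit, index = dst
        if unit == "RF":
            self.rf[index] = value
        else:
            self.alu[index] = value
            if index == "trig":   # writing the trigger port starts the operation
                self.alu["result"] = self.alu["op1"] + self.alu["trig"]

cpu = TinyTTA([0, 0, 5, 7])
cpu.move(("RF", 2), ("ALU", "op1"))      # move RF[2] -> ALU.op1
cpu.move(("RF", 3), ("ALU", "trig"))     # move RF[3] -> ALU.trig (triggers the add)
cpu.move(("ALU", "result"), ("RF", 3))   # move ALU.result -> RF[3]
print(cpu.rf[3])   # 12
```

The program never names an "add" operation; the addition happens because data arrived at the trigger port.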

Move instruction is handy
- Instructions control the internal buses, and operations happen as a side-effect
- Resource sharing for buses
  - Move result from FU's output to next one's input, instead of going through register file -> fewer registers and ports to register file
- More freedom in code scheduling than traditional CPUs
  - Move can happen later (or earlier) if the result reg (input reg) is not needed for other data -> fewer buses needed, supports different pipeline depths in FUs
- Number of units and buses easily configurable
- Number of inputs and outputs in an FU is easily configurable (not just 2 inputs and 1 output)


TTA performance
- Better area and performance than general purpose RISC
- Special function unit (SFU)
  - Designed and added manually
  - Arbitrary latency and num of operands (thanks to transport-triggered scheme)
  - Decreases ex. time but increases area
- For certain algorithms, same cycle counts as ASIC may be achieved
  - ASIC has larger operating frequency, though
- Currently, TTA+tools developed at TUT
  - Download: http://tce.cs.tut.fi/
  - Used in course TKT-3526 Processor design
  - Interested students may do project work on TTA

Area vs. runtime trade-off
- TTA's cycle count smaller than RISC, close to ASIC
- TTA's area between ASIC and RISC
- ASIC has highest frequency

[Figure: RC4 exploration (memory excluded)]

[Hämäläinen, Euromicro DSD, 2005]

HW accelerators


Recap: HW accelerators
- Favor: highest performance, smallest area and power
- Against: longest design time, narrow application domain
- Do not require code memory like programmable processors (CPU, ASIP, DSP)
- Accelerated function should give identical results with original
  - Additional conversion functions reduce speedup
  - E.g. converting 16b results to 32b integers with SW or transposing the result matrix on SW take time
  - Optimally, the next function can accept slightly different input

Example: 8x8 DCT

# | Type | um | Cycles | Area | Speedup (in cycle count) | Freq [MHz] | Max perf [blocks/s] | Perf/area [blocks/s / gates]
A | RISC (ARM9) | 0.18 | 2660 | 190 kilogates + mem | 1.0 | 160 | 60 M | 0.32
B | ASIP (TTA+SFU) | 0.13 | 538 | 56 kilogates + 34 kilogates mem | 4.9 | 250 | 464 M | 5.16
C | HW (by student) | 0.18 | 250 | 44 kilogates | 10.6 | 182 | 728 M | 16.55
D | HW (by PhD) | 0.11 | 94 | 39 kilogates + control logic | 29.3 | 253 | 2691 M | 69.01

D: [J. Nikara, Application-Specific Parallel Structures for Discrete Cosine Transform and Variable Length Decoding, PhD thesis, TUT, June 2004]

HW accelerators: private vs. shared
- Originally, accelerators were always attached to CPU memory bus
  - Smaller SoCs, just 1 CPU, poor portability
- Today, both private and shared accelerators are used
- Each shared resource needs some arbitration mechanism which decides who can use it
  - Leads to contention and (usually) unpredictable delay
- Data "flowing through" the accelerator (e.g. cpu1 → acc → cpu2) is better than "roundtrip" (cpu1 → acc → cpu1)

[Figure: CPU 1 and CPU 2 (each with I+D mem), network IFs, on-chip network, accel 1, accel 2, accel 3]
- Local, private acc. => Low time overhead. Large area if all CPUs have their own
- Remote, shared acc. => Larger and more unpredictable time overhead, but also a smaller area

HW accelerators: Usage overhead
- Regular, data-flow type functions most suitable for HW
- Communication between CPU and HW critical
  - Delay for feeding the input and getting results
  - Ensuring mutual exclusion so that no other CPU uses the same HW
- Pipelining reduces the idle period in CPU

[Figure: execution timelines for "CPU only", "CPU + HW v.1" and "CPU + HW v.2"; 4x speedup marked]
- CPU + HW v.1: Communication overhead reduces the overall speedup. Moreover, CPU is idle when HW is active
- CPU + HW v.2: Pipeline uses CPU and HW concurrently once the first results from HW are ready. Requires a bit more bookkeeping in SW.

HW accelerators: Pipelined usage

Orig SW:

    for i=0:N loop
        load r1, [r2]
        add, sub, mul, cmp, beq, other processing
        st r1, [r3]
    end loop

Measured SW ex. time includes loading input values and storing the results

SW + HW, straightforward polling

    start_hw()
    while (hw_ready==0) {}     // polling = busy wait
    for i=0:N loop
        load r1, [r2]
    end loop

Even if HW does processing much faster, data transfers from CPU to HW must be taken into account

SW + HW, pipelined

    start_hw()
    other_function_x();
    while (hw_ready==0) {}
    for i=0:N loop
        load r1, [r2]
    end loop

Function X executed in parallel with HW. Less time wasted in polling. Most efficient when HW and Function_X take nearly the same time
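A simple timing model shows why overlapping Function X with the accelerator helps. The Python sketch below assumes, for illustration only, that Function X must run in both variants, that the HW takes `t_hw` cycles, and that reading back the results costs `t_readback` cycles; all names and numbers are hypothetical.

```python
def polling_time(t_hw, t_x, t_readback):
    """CPU busy-waits for the HW, reads the results, then runs function X."""
    return t_hw + t_readback + t_x

def pipelined_time(t_hw, t_x, t_readback):
    """Function X overlaps with the HW; the CPU polls only if the HW is still busy."""
    return max(t_hw, t_x) + t_readback

def idle_time(t_hw, t_x):
    """Cycles during which either the CPU or the HW waits for the other."""
    return abs(t_hw - t_x)

for t_x in (200, 1000, 3000):
    print(t_x, polling_time(1000, t_x, 300), pipelined_time(1000, t_x, 300), idle_time(1000, t_x))
```

In this model the saving equals the shorter of the two durations, and neither unit waits for the other when the HW and Function X take the same time.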

HW accelerator: Overhead (2)
- Sometimes, overhead is even larger than actual computation
- In the example below, both Motion Estimation (ME) and DCT-Quant-Iquant-IDCT took about 25 kcycles on Nios
- Accelerators in ideal conditions (in TB) took 1/70x and 1/14x of SW time
- Especially the ME requires large input data (>2.1 kcycles) and large transfer contends for memory access with other parts of SoC (>4.3 kcycles)
- Despite overheads, accelerators offered about 3.5x and 6.5x speedups

[A. Rasmus et al., "IP Integration Overhead Analysis…", DDECS, 2007]
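The gap between the ideal 70x and the observed speedup follows from adding the overhead to the accelerated time. The Python sketch below is a rough model, not the paper's method: it treats the two quoted ME figures (2.1 and 4.3 kcycles, both lower bounds) as additive overheads, which is an assumption.

```python
def effective_speedup(t_sw, ideal_factor, overhead):
    """Speedup when an accelerator that is ideal_factor times faster than SW
    also pays a fixed overhead (same time unit as t_sw)."""
    return t_sw / (t_sw / ideal_factor + overhead)

# Motion estimation: 25 kcycles in SW, 70x faster in ideal conditions,
# overheads assumed additive: 2.1 kcycles input data + 4.3 kcycles contention.
me = effective_speedup(25_000, 70, 2_100 + 4_300)
print(round(me, 2))
```

Under these assumptions the model gives roughly 3.7x, in the same range as the reported speedup of about 3.5x; the overhead, not the accelerator itself, dominates the accelerated runtime.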

HW accelerators: Getting the results
- Interrupts allow more efficient parallel execution than polling
  - Or, CPU can enter a sleep state to save energy
- Most SoCs include DMA units that can efficiently transfer data between resources
- CPU controlled transfers vs. DMA
  a) CPU transfers all the data, time O(n), e.g. 7 cycles/word

        memcpy (&B[0], &A[0], 64*sizeof(int));

  b) CPU just inits DMA controller, cpu_time O(1), dma_time O(n) but only ~1 cycle/word

        start_dma:
            st #DMA_SRC_ADDR, r1
            st #DMA_DST_ADDR, r2
            st #DMA_AMOUNT, r4
        do_other_stuff()
        ...

- CPU is free once DMA is started
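The two transfer styles can be compared with a small cost model. The Python sketch below uses the per-word figures quoted above (7 cycles/word for a CPU copy, about 1 cycle/word for DMA); the assumption that each of the three DMA register writes costs one CPU cycle is made up for illustration.

```python
def cpu_copy_cycles(n_words, cycles_per_word=7):
    """CPU moves every word itself: the CPU is busy for O(n) cycles."""
    return n_words * cycles_per_word

def dma_copy_cycles(n_words, setup_cycles=3, cycles_per_word=1):
    """CPU only programs the DMA controller; returns (cpu_busy, transfer_time)."""
    cpu_busy = setup_cycles   # O(1): three register writes, assumed 1 cycle each
    transfer = setup_cycles + n_words * cycles_per_word
    return cpu_busy, transfer

n = 64
print(cpu_copy_cycles(n))   # 448
print(dma_copy_cycles(n))   # (3, 67)
```

For the 64-word copy the CPU is occupied for 448 cycles in the first case and only for the setup in the second, so it can do other work during the transfer.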

HW block-level optimization (1)
- Reuse benefits from configurability and many parameters
  - Run-time configurability is often costly
  - Good for simulation-based testing
- Convert input signals into generics for synthesis
  - Turn unwanted features off to save area and power
  - Perhaps increases the max freq also
  - if enable_g = '1' then <code>;
- Example: config memory inside bus wrapper
  - 2 generics: 1. we = write enable, 2. re = read enable
  - Optimize according to application

[Figure: memory area [gates] (axis 0 to 1400) per configuration (we=0, re=0; we=0, re=1; we=1, re=0; we=1, re=1) for No slots, 1 slot and 2 slots; memory type rom / ram]

HW block-level optimization (2)
- Try to design HW so that propagation delay is not (linearly) dependent on data width
  - Scalable solution
  - Bad example: if data < 55 then data <= data+1;
  - Better: if data /= 55 then data <= data+1;
- Turn on boundary optimization
  - Logic in different entities optimized together
- Hand-coding might be required with more complicated boundary-optimization
  - E.g. inverters can be removed
  - Restricted value set in output (if output uses < 16 of all possible values)

[Figure: block A drives block B over a 4b signal. (This can be optimized for smaller range.) Note: combinatorial outputs not recommended]

HW block-level optimization (3)
- Minimize the data width of signals
- Remove unnecessary flip-flops (each 4-6 eq. gates)
  - i.e. those with constant output
  - DC: set compile_seqmap_propagate_constants true
  - Optimizes also the logic after the flip-flop
- By default, synthesis does NOT remove any registers
- All signals that are assigned in sequential process (clk, rst_n) produce a flip-flop

[Figure: flip-flop with constant output; the propagated constant makes downstream signals always 1 or always 0]

HW block-level optimization (4)
- Do not 'reset' registers when value is not needed
  - e.g. if valid_in = '0' then data_r <= (others => '0');
  - Unnecessary input MUX
  - Good for visualization in simulation though
    - if dbg_enable_g = '1' then reg <= dbg_value;
    - Easy to see when these are valid

[Figure: real logic vs. logic with a "debug value" fed through an unnecessary mux. Validity determined according to signal empty]

HW optim: Aim at "fast enough"
- Do not overoptimize HW, if performance limit is known
  - 100 frames/sec encoder is not better than 25 fps enc, if camera restricts the frame rate anyway
- Minimizing critical path causes large area
  - Requires larger drive strength for gates
  - They also have higher leakage currents
- Minimizing cycle count needs many parallel sub-blocks (e.g. ALUs)
- Consider the integration overheads also

Fig: [J. Wei, C. Rowen, "Implementing low-power configurable processors…", DAC 2005]

Conclusion
- Remember Amdahl's law: concentrate on appropriate parts of the system
- ASIPs provide great improvements (like ASIC) but allow programmability (like CPU)
- Communication between components has great impact on performance
  - Use interrupts and DMA controllers
  - Pipeline SW and HW
- Don't overdo things, aim for fast enough and then minimize area and power


Sidenote: ASIC vs. FPGA Design Starts

[Figure: two bar charts for 2001-2004, "ASIC Design Starts" (axis 0 to 6000) and "PLD/FPGA Design Starts" (axis 0 to 600000). Source: Gartner Group]

"ASIC design starts will decline 12.3 percent to 4,345 this year following the precipitous 36 percent drop in design starts in 2001" (B. Lewis, Gartner Dataquest, 10/28/02)

PLD/FPGAs are becoming more and more the driving force in microelectronics technology, CAD tools and System-on-Chip design.

Note! New ASICs are much larger (much more logic, much more personnel involved) than previously.

Custom processor performance (2)
- SC140 = original Star Core DSP
- GFISA = special instructions for Galois field operations added
  - HW overhead ~10%
- Special ISA does not help every algorithm!

[Figure: Reed-Solomon decoding cycle count (runtime). Speedup = t(sc140) / t(gfisa): 22.1, 14.5, 6.3, 1.0]

[Yasmin Oz et al., Galois Field Instruction Set Accelerator in the StarCore SC140 DSP, Tampere SoC, Nov. 2001.]

Last warning: Scheduling anomaly
- Improving one part of the system may result in worse performance
  - Faster NoC, faster PE
- This is due to changes in the schedule, i.e. order of execution

[Figure: two schedules of tasks a0, a1, a2, b0, b1 on PE0, PE1 and PE2 over time, with a deadline for a2. Original system: deadline met. With the faster PE0': deadline violated.]
