Page 1: Embedded System Hardware - Processing -

Peter Marwedel, Informatik 12, TU Dortmund, Germany
technische universität dortmund, fakultät für informatik, informatik 12
2011-05-14

Graphics: © Alexandra Nolte, Gesine Marwedel, 2003
These slides use Microsoft clip arts. Microsoft copyright restrictions apply.


Page 2: Embedded System Hardware

Embedded system hardware is frequently used in a loop ("hardware in a loop"):

→ cyber-physical systems

Page 3: Processing units

Need for efficiency (power + energy):

"Power is considered as the most important constraint in embedded systems" [in: L. Eggermont (ed): Embedded Systems Roadmap 2002, STW]

Why worry about energy and power?

Energy consumption by IT is the key concern of green computing initiatives (embedded computing leading the way)

http://www.esa.int/images/earth,4.jpg

Page 4: Importance of Energy Efficiency

(Figure: © Hugo De Man, IMEC, Philips, 2007)

Page 5: Power and energy are related to each other

$E = \int P \, dt$

(Figure: power P over time t, with energies E and E')

In many cases, faster execution also means less energy, but the opposite may be true if power has to be increased to allow faster execution.
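The relation between power and energy can be checked numerically. The following is a minimal sketch, assuming constant-power execution profiles sampled once per second; the wattages and run times are made-up illustration values, not figures from the slides.

```python
def energy(power_samples_w, dt_s):
    """Rectangle-rule approximation of E = integral of P dt, in joules."""
    return sum(power_samples_w) * dt_s


# Hypothetical profiles (illustration values only), one sample per second.
slow = [1.0] * 10        # 1.0 W for 10 s
fast = [1.5] * 5         # 1.5 W for 5 s: faster and less energy
fast_hungry = [3.0] * 5  # 3.0 W for 5 s: faster but more energy

e_slow = energy(slow, 1.0)
e_fast = energy(fast, 1.0)
e_fast_hungry = energy(fast_hungry, 1.0)
```

Halving the run time saves energy only as long as the power does not rise by more than a factor of two.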

Page 6: Low Power vs. Low Energy Consumption

• Minimizing power consumption is important for
  • the design of the power supply
  • the design of voltage regulators
  • the dimensioning of interconnect
  • short term cooling

• Minimizing energy consumption is important due to
  • restricted availability of energy (mobile systems)
  • limited battery capacities (only slowly improving)
  • very high costs of energy (solar panels, in space)
  • cooling
    • high costs
    • limited space
  • dependability
    • long lifetimes, low temperatures

Page 7: Power density continues to get worse

(Figure: power density trend; labels include "Nuclear reactor" and "Prescott: 90 W/cm², 90 nm [c't 4/2004]")

© Intel, M. Pollack, Micro-32

Page 8: Surpassed hot (kitchen) plate …? Why not use it?

http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html

Page 9: Energy consumption in mobile devices

[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005] Thanks to Thorsten Koch (Nokia/TU Dortmund) for providing this source.

Page 10: Need to consider CPU & System Power

Mobile PC (notebook). Note: Based on Actual Measurements

Component          Thermal Design (TDP) System Power   Average System Power
600/500 MHz uP     37%                                 13%
LCD 10"            19%                                 30%
HDD                9%                                  19%
Memory+Graphics    12%                                 15%
Power Supply       10%                                 10%
Other              13%                                 13%

CPU Dominates Thermal Design Power
Multiple Platform Components Comprise Average Power

[Courtesy: N. Dutt; Source: V. Tiwari]

Page 11: Application Specific Circuits (ASICs) or Full Custom Circuits

Approach suffers from
• long design times,
• lack of flexibility (changing standards) and
• high costs (e.g. mask costs in the millions of $).

Custom-designed circuits necessary
• if ultimate speed or
• energy efficiency is the goal and
• large numbers can be sold.

Page 12: Mask cost for specialized HW becomes very expensive

[http://www.molecularimprints.com/Technology/tech_articles/MII_COO_NIST_2001.PDF9]

→ Trend towards implementation in Software

HW synthesis not covered in this course.

Page 13: Key requirements for processors

1. Energy/power-efficiency

Page 14: Dynamic power management (DPM)

Example: STRONGARM SA1100

• RUN: operational
• IDLE: a SW routine may stop the CPU when not in use, while monitoring interrupts
• SLEEP: Shutdown of on-chip activity

State    Power
RUN      400 mW
IDLE     50 mW
SLEEP    160 µW

(Figure: state diagram; transition labels: 90 µs, 10 µs, 10 µs, 160 ms, "Power fault signal")
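To see why the long wake-up time matters, the energy of waiting in IDLE can be compared with that of going to SLEEP. This is a minimal sketch under two loudly stated assumptions that are not on the slide: the mapping of the transition times to state pairs (10 µs for RUN to IDLE and back, 90 µs into SLEEP, 160 ms from SLEEP to RUN), and that transitions are charged at RUN power. The function names are hypothetical.

```python
# Power per state in watts, as listed for the StrongARM SA1100 example.
POWER_W = {"RUN": 400e-3, "IDLE": 50e-3, "SLEEP": 160e-6}

# ASSUMPTION: assignment of the transition times to state pairs.
T_RUN_TO_IDLE = 10e-6
T_IDLE_TO_RUN = 10e-6
T_RUN_TO_SLEEP = 90e-6
T_SLEEP_TO_RUN = 160e-3


def gap_energy(gap_s, low_state, t_enter, t_leave):
    """Energy (J) for a gap of gap_s seconds spent in low_state.

    ASSUMPTION: both transitions are charged at RUN power.
    """
    t_trans = t_enter + t_leave
    return POWER_W["RUN"] * t_trans + POWER_W[low_state] * (gap_s - t_trans)


def better_state(gap_s):
    """Return the cheaper low-power state for an idle gap of gap_s seconds."""
    if gap_s < T_RUN_TO_SLEEP + T_SLEEP_TO_RUN:
        return "IDLE"  # the gap is too short to enter and leave SLEEP
    e_idle = gap_energy(gap_s, "IDLE", T_RUN_TO_IDLE, T_IDLE_TO_RUN)
    e_sleep = gap_energy(gap_s, "SLEEP", T_RUN_TO_SLEEP, T_SLEEP_TO_RUN)
    return "SLEEP" if e_sleep < e_idle else "IDLE"
```

Under these assumptions SLEEP only pays off for gaps of more than about a second, because the 160 ms wake-up at RUN power has to be amortized first.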

Page 15: Fundamentals of dynamic voltage scaling (DVS)

Power consumption of CMOS circuits (ignoring leakage):

$P = \alpha \, C_L \, V_{dd}^2 \, f$

with
$\alpha$: switching activity
$C_L$: load capacitance
$V_{dd}$: supply voltage
$f$: clock frequency

Delay for CMOS circuits:

$\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$

with $V_t$: threshold voltage ($V_t < V_{dd}$)

→ Decreasing $V_{dd}$ reduces P quadratically, while the run-time of algorithms is only linearly increased
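The two CMOS formulas can be evaluated directly. The sketch below is only an illustration: the function names are hypothetical, and any numeric values for the switching activity, load capacitance, constant k and threshold voltage are placeholders, not data from the slides.

```python
def dynamic_power(alpha, c_load, v_dd, f):
    """P = alpha * C_L * Vdd^2 * f (leakage ignored)."""
    return alpha * c_load * v_dd ** 2 * f


def gate_delay(k, c_load, v_dd, v_t):
    """tau = k * C_L * Vdd / (Vdd - Vt)^2, valid only for Vt < Vdd."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * c_load * v_dd / (v_dd - v_t) ** 2


# Halving Vdd at a fixed clock frequency quarters the power ...
p_ratio = dynamic_power(0.5, 1e-9, 1.0, 1e8) / dynamic_power(0.5, 1e-9, 2.0, 1e8)

# ... while the delay only doubles if Vt is negligible compared to Vdd.
d_ratio = gate_delay(1.0, 1.0, 0.5, 0.0) / gate_delay(1.0, 1.0, 1.0, 0.0)
```

For a threshold voltage that is not negligible, the delay grows faster than linearly as the supply voltage approaches it.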

Page 16: Voltage scaling: Example

(Figure: plot over Vdd) [Courtesy, Yasuura, 2000]

Page 17: Variable-voltage/frequency example: INTEL Xscale

(Figure: from Intel's Web Site)

OS should schedule distribution of the energy budget.

Page 18: Low voltage, parallel operation more efficient than high voltage, sequential operation

Basic equations
Power: P ~ VDD²,
Maximum clock frequency: f ~ VDD,
Energy to run a program: E = P × t, with: t = runtime (fixed)
Time to run a program: t ~ 1/f

Changes due to parallel processing, with α operations per clock:
Clock frequency reduced to: f' = f / α,
Voltage can be reduced to: VDD' = VDD / α,
Power for parallel processing: P° = P / α² per operation,
Power for α operations per clock: P' = α × P° = P / α,
Time to run a program is still: t' = t,
Energy required to run program: E' = P' × t = E / α

→ Argument in favour of voltage scaling, VLIW processors, and multi-cores

Rough approximations!
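The chain of proportionalities can be traced step by step. The following is a minimal sketch using only the rough model from the slide (P ~ VDD², f ~ VDD); the function name and the unit values in the example are hypothetical.

```python
def parallel_scaling(p, f, v_dd, t, alpha):
    """Apply the slide's rough model for alpha operations per clock.

    Returns (f_new, v_new, p_new, t_new, e_new).
    """
    f_new = f / alpha                    # clock can run alpha times slower
    v_new = v_dd / alpha                 # f ~ VDD, so the voltage scales too
    p_per_op = p * (v_new / v_dd) ** 2   # P ~ VDD^2, per operation
    p_new = alpha * p_per_op             # alpha operations run in parallel
    t_new = t                            # same throughput, same run time
    e_new = p_new * t_new
    return f_new, v_new, p_new, t_new, e_new


# Example with unit values: P = 1, f = 1, VDD = 1, t = 10, alpha = 2.
f2, v2, p2, t2, e2 = parallel_scaling(1.0, 1.0, 1.0, 10.0, 2)
```

With α = 2 the energy drops from E = 10 to E' = 5, matching E' = E / α.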

Page 19: Application: VLIW processing and voltage scaling in the Crusoe processor

• VDD: 32 levels (1.1V - 1.6V)
• Clock: 200MHz - 700MHz in increments of 33MHz

Scaling is triggered when CPU load change is detected by software (~1/2 ms).
• More load: Increase of supply voltage (~20 ms/step), followed by scaling clock frequency
• Less load: reduction of clock frequency, followed by reduction of supply voltage

Worst case (1.1V to 1.6V VDD, 200MHz to 700MHz) takes 280 ms

Page 20: Result (as published by transmeta)

(Figure: Pentium vs. Crusoe, running the same multimedia application.) [www.transmeta.com]

Page 21: Key requirement #2: Code-size efficiency

• CISC machines: RISC machines designed for run-time-, not for code-size-efficiency
• Compression techniques: key idea

Page 22: Code-size efficiency

• Compression techniques (continued):
  • 2nd instruction set, e.g. ARM Thumb instruction set:

    16-bit Thumb instr. ADD Rd #constant:
      001 | 10 | Rd | Constant
      (major opcode | minor opcode | source=destination | constant)

    Dynamically decoded at run-time into:
      1110 | 001 | 01001 | 0 Rd | 0 Rd | 0000 | Constant
      (constant is zero extended)

  • Reduction to 65-70 % of original code size
  • 130% of ARM performance with 8/16 bit memory
  • 85% of ARM performance with 32-bit memory

[ARM, R. Gupta]

Same approach for LSI TinyRisc, …
Requires support by compiler, assembler etc.
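The expansion of the 16-bit Thumb ADD into its 32-bit ARM counterpart can be modelled with a few shifts. This is a minimal sketch of the bit layout shown above, covering only this one instruction format; the function name is hypothetical and real decoders handle many more formats.

```python
def expand_thumb_add_imm(halfword):
    """Expand the Thumb 'ADD Rd, #constant' format 001 10 Rd Constant."""
    if not 0 <= halfword <= 0xFFFF:
        raise ValueError("not a 16-bit instruction")
    if halfword >> 11 != 0b00110:
        raise ValueError("not a Thumb ADD Rd, #constant instruction")
    rd = (halfword >> 8) & 0b111   # 3-bit register number
    constant = halfword & 0xFF     # 8-bit constant, zero extended
    return (
        (0b1110 << 28)     # leading field 1110
        | (0b001 << 25)    # field 001
        | (0b01001 << 20)  # field 01001
        | (rd << 16)       # source register: 0 Rd
        | (rd << 12)       # destination register: 0 Rd
        | (0b0000 << 8)    # field 0000
        | constant
    )


# ADD R1, #5 in Thumb is 001 10 001 00000101 = 0x3105.
arm_word = expand_thumb_add_imm(0x3105)
```

The register number appears twice in the 32-bit word because source and destination are the same register in the Thumb format.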

Page 23: Dictionary approach, two level control store (indirect addressing of instructions)

"Dictionary-based coding schemes cover a wide range of various coders and compressors. Their common feature is that the methods use some kind of a dictionary that contains parts of the input sequence which frequently appear. The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over."

[Á. Beszédes et al.: Survey of Code-Size Reduction Methods, ACM Computing Surveys, Vol. 35, Sept. 2003, pp 223-267]

Page 24: Key idea (for d bit instructions)

Uncompressed storage of a d-bit-wide instructions requires a × d bits.

In compressed code, each instruction pattern is stored only once.

(Figure: the instruction address selects one of a entries in a table S, each b « d bits wide. For each instruction address, S contains the table address of the instruction. This address selects one of c ≤ 2^b entries, each d bits wide, in a small table of used instructions ("dictionary"), which feeds the CPU.)

Hopefully, a×b + c×d < a×d.

Called nanoprogramming in the Motorola 68000.
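The size condition a×b + c×d < a×d can be tried out on a toy program. The sketch below is an assumption-laden illustration: instructions are modelled as plain integers, the function names are hypothetical, and b is chosen as the smallest index width that can address all c dictionary entries.

```python
def compress(instructions):
    """Return (S, dictionary): per-address indices and the unique patterns."""
    dictionary = []
    index_of = {}
    table_s = []
    for instr in instructions:
        if instr not in index_of:
            index_of[instr] = len(dictionary)
            dictionary.append(instr)  # each pattern is stored only once
        table_s.append(index_of[instr])
    return table_s, dictionary


def sizes_in_bits(instructions, d):
    """Return (uncompressed, compressed) sizes: a*d and a*b + c*d."""
    table_s, dictionary = compress(instructions)
    a, c = len(table_s), len(dictionary)
    b = max(1, (c - 1).bit_length())  # smallest b with c <= 2**b
    return a * d, a * b + c * d


# Toy program of a = 8 instructions with c = 3 distinct 32-bit patterns.
program = [0xE2911005, 0xE1A00000, 0xE2911005, 0xE1A00000,
           0xE2911005, 0xE1A00000, 0xE12FFF1E, 0xE2911005]
table_s, dictionary = compress(program)
plain_bits, packed_bits = sizes_in_bits(program, 32)
```

Fetching instruction i then means reading dictionary[table_s[i]], which is the indirect addressing the slide title refers to. The saving disappears when almost every instruction is distinct, since c×d then approaches a×d.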

Page 25: Cache-based decompression

• Main idea: decompression whenever cache-lines are fetched from memory.
• Cache lines ↔ variable-sized blocks in memory
  → line address tables (LATs) for translation of instruction addresses into memory addresses.
• Tables may become large and have to be bypassed by a line address translation buffer.

[A. Wolfe, A. Chanin, MICRO-92]

Page 26: More information on code compaction

• Popular code compaction library by Rik van de Wiel [http://www.extra.research.philips.com/ccb] has been moved to
  http://www-perso.iro.umontreal.ca/~latendre/codeCompression/codeCompression/node1.html
• http://www.iro.umontreal.ca/~latendre/compactBib/ (153 entries as per 11/2004)

Page 27: Key requirement #3: Run-time efficiency - Domain-oriented architectures -

Example: Filtering in Digital signal processing (DSP)

Signal at t=ts (sampling points)

Page 28: Filtering in digital signal processing

ADSP 2100

-- outer loop over
-- sampling times ts
{ MR:=0; A1:=1; A2:=s-1;
  MX:=w[s]; MY:=a[0];
  for (k=0; k <= (n-1); k++)
  { MR:=MR + MX * MY;
    MX:=w[A2]; MY:=a[A1];
    A1++; A2--;
  }
  x[s]:=MR;
}

Maps nicely

Page 29: DSP-Processors: multiply/accumulate (MAC) and zero-overhead loop (ZOL) instr.

MR:=0; A1:=1; A2:=s-1; MX:=w[s]; MY:=a[0];
for ( k:=0 <= n-1)
{ MR:=MR+MX*MY; MY:=a[A1]; MX:=w[A2]; A1++; A2-- }

Multiply/accumulate (MAC) instruction
Zero-overhead loop (ZOL) instruction preceding MAC instruction.
Loop testing done in parallel to MAC operations.
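The register-transfer loop above computes x[s] as the sum of a[k] × w[s-k] for k = 0 .. n-1. The following is a minimal software sketch of that computation, not ADSP 2100 code: the names mirror the slide's registers, the operand prefetch into MX and MY is folded into the multiply, and the function name is hypothetical.

```python
def fir_sample(w, a, s):
    """Return x[s] = sum over k of a[k] * w[s - k], for k = 0 .. n-1."""
    n = len(a)
    if s < n - 1 or s >= len(w):
        raise IndexError("need n samples of w ending at index s")
    mr = 0          # MR: multiply/accumulate result register
    a1, a2 = 0, s   # address registers walking a upwards and w downwards
    for _ in range(n):
        mr += w[a2] * a[a1]  # one multiply/accumulate (MAC) step
        a1 += 1
        a2 -= 1
    return mr


w = [1, 2, 3, 4]   # sampled signal
a = [2, 1]         # filter coefficients, n = 2
x1 = fir_sample(w, a, 1)
x3 = fir_sample(w, a, 3)
```

On the DSP, each loop iteration is a single instruction, because the MAC, both address updates and the loop test run in parallel.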

Page 30: Heterogeneous registers

Example (ADSP 210x):

(Figure: data path. Labels: D, P; multiplier unit (*, +, -) with registers MX, MY, MF, MR; arithmetic unit (+, -, ..) with registers AX, AY, AF, AR; address generation unit (AGU) with address registers A0, A1, A2 ..)

Different functionality of registers An, AX, AY, AF, MX, MY, MF, MR

Page 31: Separate address generation units (AGUs)

Example (ADSP 210x):

• Data memory can only be fetched with address contained in A,
• but this can be done in parallel with operation in main data path (takes effectively 0 time).
• A := A ± 1 also takes 0 time,
• same for A := A ± M;
• A := <immediate in instruction> requires extra instruction
  → Minimize load immediates
  → Optimization in optimization chapter

Page 32: Modulo addressing

Modulo addressing: Am++ ≡ Am:=(Am+1) mod n
(implements ring or circular buffer in memory)

(Figure: sliding window over signal w at time t1, covering the n most recent values)

Memory, t=t1      Memory, t2=t1+1
..                ..
w[t1-1]           w[t1-1]
w[t1]             w[t1]
w[t1-n+1]         w[t1+1]
w[t1-n+2]         w[t1-n+2]
..                ..
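The ring buffer behind modulo addressing can be sketched in a few lines. This is a software illustration only: on a DSP the modulo update is done by the address generation unit at no extra cost, and the class and method names are hypothetical.

```python
class RingBuffer:
    """Keeps the n most recent values using modulo addressing."""

    def __init__(self, n):
        self.n = n
        self.mem = [0] * n
        self.am = 0  # address register Am

    def push(self, value):
        self.am = (self.am + 1) % self.n  # Am++ with wrap-around
        self.mem[self.am] = value         # overwrites the oldest value

    def recent(self):
        """Return the stored values, newest first."""
        return [self.mem[(self.am - k) % self.n] for k in range(self.n)]


buf = RingBuffer(3)
for sample in [1, 2, 3, 4]:
    buf.push(sample)
```

After the fourth push the oldest sample has been overwritten in place, so no data has to be shifted when the window slides.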

Page 33: Saturating arithmetic

• Returns largest/smallest number in case of over/underflows
• Example:
  a                                              0111
  b                                            + 1001
  standard wrap around arithmetic             (1)0000
  saturating arithmetic                          1111
  (a+b)/2: correct                               1000
           wrap around arithmetic                0000
           saturating arithmetic + shifted       0111   "almost correct"
• Appropriate for DSP/multimedia applications:
  • No timeliness of results if interrupts are generated for overflows
  • Precise values less important
  • Wrap around arithmetic would be worse.
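The 4-bit example can be reproduced directly. The sketch below assumes unsigned 4-bit operands, as in the example above; the function names are hypothetical.

```python
BITS = 4
MAX_VALUE = (1 << BITS) - 1  # 1111 for 4 bits


def add_wrap(a, b):
    """Standard wrap-around addition: the carry out of bit 3 is dropped."""
    return (a + b) & MAX_VALUE


def add_sat(a, b):
    """Saturating addition: clamp to the largest representable value."""
    return min(a + b, MAX_VALUE)


a, b = 0b0111, 0b1001
wrapped = add_wrap(a, b)
saturated = add_sat(a, b)

# Averages (a+b)/2, computed by shifting the 4-bit sum right by one.
avg_correct = (a + b) >> 1
avg_wrapped = wrapped >> 1
avg_saturated = saturated >> 1
```

The saturated average is off by one from the correct value, while the wrap-around result is off by the full range.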

Page 34: Example

MATLAB Demo

Page 35: Fixed-point arithmetic

Shifting required after multiplications and divisions in order to maintain binary point.

Page 36: Properties of fixed-point arithmetic

• Automatic scaling a key advantage for multiplications.
• Example:
  x = 0.5 × 0.125 + 0.25 × 0.125 = 0.0625 + 0.03125 = 0.09375
  For iwl=1 and fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093
  Like a floating point system with numbers ∈ (-1..1), with no stored exponent (bits used to increase precision).
• Appropriate for DSP/multimedia applications (well-known value ranges).
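The decimal example can be replayed with scaled integers. This is a minimal sketch assuming non-negative values with fwl = 3 decimal fraction digits, so every number is stored as an integer count of thousandths; the function names are hypothetical, and a binary implementation would rescale with a shift instead of a division.

```python
FWL = 3             # fractional word length in decimal digits
SCALE = 10 ** FWL   # a value v is stored as the integer v * 1000


def to_fixed(value):
    """Convert a non-negative number to its scaled integer form."""
    return int(round(value * SCALE))


def fx_mul(a, b):
    """Multiply two fixed-point numbers and rescale the product.

    The raw product carries 2 * FWL fraction digits; integer division
    chops off the extra digits (valid for non-negative operands).
    """
    return (a * b) // SCALE


x = fx_mul(to_fixed(0.5), to_fixed(0.125)) + fx_mul(to_fixed(0.25), to_fixed(0.125))
```

Here x is 93, which stands for 0.093, while the exact result is 0.09375.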

Page 37: Real-time capability

• Timing behavior has to be predictable
  Features that cause problems:
  • Unpredictable access to shared resources
  • Caches with difficult to predict replacement strategies
  • Unified caches (conflicts between instructions and data)
  • Pipelines with difficult to predict stall cycles ("bubbles")
  • Unpredictable communication times for multiprocessors
  • Branch prediction, speculative execution
  • Interrupts that are possible any time
  • Memory refreshes that are possible any time
  • Instructions that have data-dependent execution times
→ Trying to avoid as many of these as possible.

[Dagstuhl workshop on predictability, Nov. 17-19, 2003]

Page 38: Multiple memory banks or memories

(Figure: data path. Labels: D, P; multiplier unit (*, +, -) with registers MX, MY, MF, MR; arithmetic unit (+, -, ..) with registers AX, AY, AF, AR; address generation unit (AGU) with address registers A0, A1, A2 ..)

Simplifies parallel fetches

Page 39: Multimedia-Instructions/Processors

• Multimedia instructions exploit that many registers, adders etc are quite wide (32/64 bit),
• whereas most multimedia data types are narrow (e.g. 8 bit per color, 16 bit per audio sample per channel)
→ 2-8 values can be stored per register and added. E.g.:

    a1 | a2    (32 bits)
  + b1 | b2    (32 bits)
  = c1 | c2    (32 bits)

2 additions per instruction; carry disabled at half-word boundaries.
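Disabling the carry at the half-word boundary can be emulated in software by adding the two 16-bit lanes separately. This is a minimal sketch with wrap-around lanes; the function name is hypothetical and does not refer to a specific processor instruction.

```python
MASK16 = 0xFFFF


def packed_add16(x, y):
    """Add two 32-bit words as two independent 16-bit lanes."""
    low = ((x & MASK16) + (y & MASK16)) & MASK16                   # c2 = a2 + b2
    high = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16  # c1 = a1 + b1
    return (high << 16) | low


# Lanes (1, 0xFFFF) + (1, 1): the low lane overflows and wraps to 0.
packed = packed_add16(0x0001FFFF, 0x00010001)
plain = (0x0001FFFF + 0x00010001) & 0xFFFFFFFF
```

The plain 32-bit addition lets the carry of the low lane leak into the high lane, which is exactly what the packed instruction prevents.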

Page 40: Early example: HP precision architecture (hp PA)

Half word add instruction HADD:

(Figure: half word add)

Optional saturating arithmetic.
Up to 10 instructions can be replaced by HADD.

Page 41: Pentium MMX-architecture (1)

Multimedia registers mm0 - mm7, consistent with floating-point registers (OS unchanged).
64-bit vectors representing 8 byte encoded, 4 word encoded or 2 double word encoded numbers.
Wrap around/saturating options.

Instruction                 Options                   Comments
Padd[b/w/d], PSub[b/w/d]    wrap around, saturating   addition/subtraction of bytes, words, double words
Pcmpeq[b/w/d]                                         Result= "11..11" if true, "00..00" otherwise
Pcmpgt[b/w/d]                                         Result= "11..11" if true, "00..00" otherwise
Pmullw                                                multiplication, 4*16 bits, least significant word
Pmulhw                                                multiplication, 4*16 bits, most significant word

Page 42: Pentium MMX-architecture (2)

Instruction                            Options                                       Comments
Psra[w/d], Psll[w/d/q], Psrl[w/d/q]    No. of positions in register or instruction   Parallel shift of words, double words or 64 bit quad words
Punpckl[bw/wd/dq]                                                                    Parallel unpack
Punpckh[bw/wd/dq]                                                                    Parallel unpack
Packss[wb/dw]                          saturating                                    Parallel pack
Pand, Pandn, Por, Pxor                                                               Logical operations on 64 bit words
Mov[d/q]                                                                             Move instruction

Page 43: Application

Scaled interpolation between two images

Next word = next pixel, same color.
4 pixels processed at a time.

pxor mm7,mm7        ;clear register mm7
movq mm3,fade_val   ;load scaling value
movd mm0,imageA     ;load 4 red pixels for A
movd mm1,imageB     ;load 4 red pixels for B
punpcklbw mm1,mm7   ;unpack,bytes to words
punpcklbw mm0,mm7   ;upper bytes from mm7
psubw mm0,mm1       ;subtract pixel values
pmulhw mm0,mm3      ;scale
paddw mm0,mm1       ;add to image B
packuswb mm0,mm7    ;pack, words to bytes
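The effect of the MMX sequence on one group of pixels can be emulated lane by lane. This is a minimal sketch, assuming that fade_val holds the same signed 16-bit fraction (scaled by 65536) in every lane and that pixel values are 0..255; the function name is hypothetical and the list-based loop stands in for the 4 parallel lanes.

```python
def fade_pixels(image_a, image_b, fade_val):
    """Emulate psubw / pmulhw / paddw / packuswb on lists of pixel values."""
    result = []
    for a, b in zip(image_a, image_b):
        diff = a - b                      # psubw: A - B, fits in 16 bits
        scaled = (diff * fade_val) >> 16  # pmulhw: high word of signed product
        value = scaled + b                # paddw: add to image B
        result.append(min(255, max(0, value)))  # packuswb: saturate to a byte
    return result


# A fade value of 0x4000 corresponds to 0x4000 / 65536 = 0.25.
faded = fade_pixels([200, 0, 50, 255], [100, 100, 50, 0], 0x4000)
```

Each output pixel is approximately B + fade × (A - B), so a fade value of 0 reproduces image B.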

Page 44: Short vector instruction set extensions for Intel® Pentium®/AMD® processors

• 3DNow! (AMD, 1998)
• Streaming SIMD Extensions SSE (Intel, 1999)
  • 16 new registers, floating point SIMD
• SSE2 (Intel, 2001; AMD, 2003)
  • MMX instructions available for new SSE registers
• SSE3 (Intel, 2004; AMD)
  • vector reduction, floating point conversion independent of global rounding mode, relaxed alignment restrictions
• SSE4 (Intel, 2006; AMD: 4 instructions implemented)
  • String comparison, counting 1's, CRC, …
• SSE5 (AMD, 2007)
  • 3-address instructions, …
• Advanced vector extensions AVX (Intel, 2008)
  • Registers 256, … bit wide

Page 45: Summary

Hardware in a loop
• Sensors
• Discretization
• Information processing
  • Importance of energy efficiency
  • Special purpose HW very expensive
  • Energy efficiency of processors
  • Code size efficiency
  • Run-time efficiency
  • MPSoCs
• D/A converters
• Actuators