ELEC692 VLSI Signal Processing Architecture Lecture 1 Introduction to DSP Systems


Issues of VLSI Signal Processing Architecture

• Performance
• Area/Cost
• Speed of execution, throughput and clock rate
• Power dissipation, or the amount of energy required to perform a given task
• Fixed-point DSP systems: finite-wordlength performance
– Quantization and roundoff noise
• Special features of DSP systems
– Real-time throughput requirements
– Data-driven property

Typical DSP algorithm and applications (I)

• Speech coding and decoding, speech encryption and decryption
– Cell phones, cordless phones, multimedia computers, secure communications
• Speech recognition
– Advanced user interfaces, phones, consumer products, machine/human interfaces
• Speech synthesis
– Advanced user interfaces, consumer products, machine/human interfaces
• Modem algorithms
– Phones, wireless communications, data/fax modems, secure communications

Typical DSP algorithm and applications (II)

• Noise cancellation
– Audio applications, wireless communications
• Audio equalization
– Audio applications
• Image compression and decompression
– Digital cameras, video, multimedia applications
• Beamforming
– Navigation, radar/sonar, wireless communications
• Echo cancellation
– Speakerphones, modems, telephone switches

Issues in wireless system design

• Ubiquitous services put wireless system spectrum at a premium

• Current spectral efficiency far below theoretical limits

• Emerging solutions
– Adoption of better spectrum utilization techniques
  • E.g. interference cancellation, multiple antennas, MIMO systems
• Multi-functional, adaptive systems
• Even higher bit-rate wireless applications
– IEEE 802.11a, wireless IEEE 1394

Improving spectral efficiency and bit rate comes at a performance and power cost

• Digital baseband processing requirements (wide-band CDMA; FDMA with multiple antennas)

Algorithm                    Matched filter   Blind MMSE   Exact decorrelator   SVD
Performance (bits/sec/Hz)    1                2            2                    6
Multiplications              124              496          230,000              736
Memory                       248              1240         640,000              2120
ALU                          124              502          240,000              800
Word length                  8-bit            12-bit       16-bit               16-bit

From Jan Rabaey of UC Berkeley

Shannon beats Moore’s Law

Energy plays a critical role

Battery capacity

Programmable processor vs. ASIC

• DSP Selection guide for mobile multimedia

DSP computation - Convolution

$$y(n) = x(n) * h(n) = \sum_{k=-\infty}^{\infty} x(k)\,h(n-k)$$

• Convolution is used to describe and analyze linear time-invariant (LTI) systems, which are completely characterized by their unit-sample (or impulse) response h(n).
• Finite impulse response (FIR) – systems whose h(n) contains a finite number of nonzero samples, i.e. h(n) is of finite duration.
• Infinite impulse response (IIR) – h(n) is of infinite duration.
• A system is causal if y(n0) depends only on the current and past input samples x(k), k ≤ n0.
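As a minimal sketch of the convolution sum for finite-length sequences (the function name `convolve` and the sample values are illustrative assumptions, not part of the lecture), each input sample x(k) contributes x(k) h(j) to output sample y(k + j):

```python
def convolve(x, h):
    """Linear convolution y(n) = sum_k x(k) h(n - k) of two finite sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for j, hj in enumerate(h):
            # Sample x(k) weighted by h(j) lands at output index n = k + j.
            y[k + j] += xk * hj
    return y


x = [1, 2, 3]   # illustrative input samples
h = [1, 1]      # illustrative 2-tap impulse response (an FIR system)
y = convolve(x, h)
print(y)
```

Because h has only two nonzero samples, this is an FIR system; swapping the roles of x and h gives the same output, since convolution is commutative.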

DSP computation - Correlation

• Widely used in digital communication
• Correlation of 2 sequences a(n) and x(n):

$$y(n) = \sum_{k=-\infty}^{\infty} a(k)\,x(n+k)$$

• It can be described as a convolution as follows:

$$y(n) = \sum_{k=-\infty}^{\infty} a(-k)\,x(n-k) = a(-n) * x(n)$$

• If a(n) and x(n) have finite length N, i.e. they are nonzero only for n = 0, 1, …, N-1, the digital correlation operation is given as:

$$y(n) = \sum_{k=0}^{N-1} a(k)\,x(n+k)$$
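The finite-length correlation can be sketched as follows, assuming the definition y(n) = sum over k of a(k) x(n+k) with samples outside 0..N-1 taken as zero. The helper names `correlate` and `convolve` and the sample values are assumptions made for illustration. The last lines check the stated link to convolution by convolving the time-reversed a(n) with x(n).

```python
def correlate(a, x):
    """Finite-length correlation y(n) = sum_{k=0}^{N-1} a(k) x(n + k).

    Returns a dict mapping each lag n in -(N-1)..N-1 to y(n); samples of x
    outside 0..N-1 are treated as zero.
    """
    N = len(a)
    if len(x) != N:
        raise ValueError("both sequences must have length N")
    return {
        n: sum(a[k] * x[n + k] for k in range(N) if 0 <= n + k < N)
        for n in range(-(N - 1), N)
    }


def convolve(x, h):
    """Linear convolution, used here only to cross-check the correlation."""
    y = [0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for j, hj in enumerate(h):
            y[k + j] += xk * hj
    return y


a = [1, 2, 3]   # illustrative sequences
x = [4, 5, 6]
y = correlate(a, x)
# Convolving the time-reversed a(n) with x(n) gives the same values,
# ordered from lag -(N-1) up to lag N-1.
y_conv = convolve(a[::-1], x)
print([y[n] for n in sorted(y)], y_conv)
```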

DSP computation – Digital Filters

• A causal digital filter is characterized by its unit-sample response h(n), by its frequency response $H(e^{j\omega})$, or by a difference equation.

• A linear, time-invariant, and causal filter is given by

$$y(n) = -\sum_{k=1}^{N} a_k\,y(n-k) + \sum_{k=0}^{M-1} b_k\,x(n-k)$$

• If $a_k = 0$ for 1 ≤ k ≤ N, we have

$$y(n) = \sum_{k=0}^{M-1} b_k\,x(n-k)$$

• This is a non-recursive M-tap finite impulse response (FIR) filter, where h(k) = $b_k$.

• If at least one $a_k$ is nonzero, the filter is recursive and its unit-sample response has infinite duration. This is referred to as an IIR filter.
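A minimal sketch of the difference equation, assuming zero initial conditions and the sign convention in which the feedback terms are subtracted (the function name `difference_filter` and the coefficient values are illustrative assumptions). With no feedback coefficients the impulse response simply reproduces the b coefficients (FIR); with one nonzero feedback coefficient it never dies out (IIR).

```python
def difference_filter(b, a, x):
    """Evaluate y(n) = -sum_{k=1}^{N} a_k y(n-k) + sum_{k=0}^{M-1} b_k x(n-k).

    b = [b_0, ..., b_{M-1}], a = [a_1, ..., a_N]; zero initial conditions.
    """
    y = []
    for n in range(len(x)):
        # Feed-forward (non-recursive) part.
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Feedback (recursive) part, using outputs already computed.
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y


impulse = [1, 0, 0, 0, 0]
fir_response = difference_filter([1, 2, 3], [], impulse)   # all a_k = 0
iir_response = difference_filter([1], [-0.5], impulse)     # y(n) = 0.5 y(n-1) + x(n)
print(fir_response, iir_response)
```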

DSP computation – Digital Filters

• Linear-phase FIR filter
– Unit-sample responses are symmetric, so only about half the number of multiplications is required
– For an M-tap linear-phase FIR filter: h(n) = h(M-1-n).
– E.g. 7-tap linear-phase FIR filter with impulse response h(0)=h(6)=b0, h(1)=h(5)=b1, h(2)=h(4)=b2, h(3)=b3:
– y(n) = b0 x(n) + b1 x(n-1) + b2 x(n-2) + b3 x(n-3) + b2 x(n-4) + b1 x(n-5) + b0 x(n-6)
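The saving in multiplications can be sketched by adding the two input samples that share a coefficient before multiplying. The code below assumes an odd number of taps and zero samples before the start of the input; the names `fir_direct` and `fir_linear_phase` and the test data are illustrative assumptions. For the 7-tap case it uses 4 multiplications per output instead of 7.

```python
def fir_direct(h, x):
    """Reference FIR: y(n) = sum_k h(k) x(n - k), one multiplication per tap."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_linear_phase(b, x):
    """Symmetric FIR with taps b0, b1, ..., b_mid, ..., b1, b0 (odd length).

    Samples sharing a coefficient are added first, so each output needs
    len(b) multiplications instead of 2 * len(b) - 1.
    """
    mid = len(b) - 1            # index of the centre tap
    M = 2 * mid + 1             # total number of taps

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    y = []
    for n in range(len(x)):
        acc = b[mid] * sample(n - mid)
        for k in range(mid):
            acc += b[k] * (sample(n - k) + sample(n - (M - 1 - k)))
        y.append(acc)
    return y


b = [1, 2, 3, 4]                       # illustrative b0..b3
h = b + b[-2::-1]                      # full 7-tap response
x = [1, 0, 2, -1, 3, 0, 0, 5, 1, -2]   # illustrative input
print(fir_linear_phase(b, x) == fir_direct(h, x))
```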

DSP computation – Adaptive Filter

• The filter coefficients change and are updated at each iteration.

• Used for applications such as echo cancellation, channel equalization, voiceband modems and many others.

• It predicts one random process y(n) from observations of another random process x(n) using linear models such as digital filters.

• Coefficients are updated to minimize the difference between the filter output and the desired signal. Updating continues until the coefficients converge.

• Consists of two blocks: a general filter block and a coefficient updating block.

DSP computation – LMS Adaptive Filter

• Notations:
– $W^T(n) = [w_1(n), w_2(n), \ldots, w_N(n)]$ = weight vector
– $U^T(n) = [u(n), u(n-1), \ldots, u(n-N+1)]$ = vector of current and past input samples
– $\hat{d}(n)$ is the estimate of the desired signal d(n), and e(n) is the estimation error.
– We have

$$\hat{d}(n) = W^T(n-1)\,U(n)$$

$$e(n) = d(n) - \hat{d}(n) = d(n) - W^T(n-1)\,U(n)$$


• In the n-th iteration, the LMS algorithm selects $W^T(n)$ to minimize the squared error $e^2(n)$.

• An LMS adaptive filter consists of an FIR filter block with coefficient vector $W^T(n)$ and input sequence u(n), and a weight update block.


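• Weight update algorithm

$$\nabla_{W^T}\,e^2 = \nabla_{W^T}\,(d - W^T U)^2 = -2dU + 2(W^T U)\,U = -2(d - W^T U)\,U = -2eU$$

$$W(n) = W(n-1) - \tfrac{1}{2}\,\mu\,\nabla_{W^T}\big(e^2(n)\big)$$

$$W(n) = W(n-1) + \mu\,e(n)\,U(n)$$

where $\mu$ is the step size.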
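The two blocks of the LMS filter (FIR filtering and weight update) can be sketched in a system-identification setting. The function name `lms`, the step size `mu = 0.05`, the "unknown" 2-tap system and the random input are all assumptions chosen for illustration; the update line is the standard LMS rule W(n) = W(n-1) + mu e(n) U(n).

```python
import random


def lms(u, d, num_taps, mu):
    """LMS adaptive filter: returns the final weights and the error sequence."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(u)):
        # U(n) = [u(n), u(n-1), ..., u(n-N+1)], zero before the first sample.
        U = [u[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        d_hat = sum(wi * ui for wi, ui in zip(w, U))      # FIR filter block
        e = d[n] - d_hat                                  # estimation error
        w = [wi + mu * e * ui for wi, ui in zip(w, U)]    # weight update block
        errors.append(e)
    return w, errors


rng = random.Random(0)
u = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
unknown = [0.5, -0.3]   # hypothetical system to be identified
d = [sum(unknown[k] * u[n - k] for k in range(len(unknown)) if n - k >= 0)
     for n in range(len(u))]

w, errors = lms(u, d, num_taps=2, mu=0.05)
print(w, abs(errors[-1]))
```

In this noise-free setup the weights approach the hypothetical system coefficients and the error shrinks toward zero; with a step size that is too large the recursion would diverge instead.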

Other common DSP computations
• Motion estimation
– Used in interframe predictive coding
• Discrete Cosine Transform
– Frequency transform used in image processing
• Fast Fourier Transform
– Frequency transform used in communication and audio/voice processing
• Vector Quantization
– Used for data compression in speech, image and video coding
• Viterbi algorithm
– Error control coding, used for communication and other data correction applications
• Decimator and Expander
– Multirate systems for image compression, digital audio and adaptive signal processing

Implementation of DSP algorithms

• Many applications can be implemented on a programmable DSP processor or a media microprocessor

• For some applications, due to complexity and power constraints, special VLSI architectures or ASICs are still required

• E.g. MPEG2 encoder
– Block matching for motion estimation (ME) on HDTV frames needs ~370 GOPs/sec
– 2D-DCT for HDTV = 3.84 GOPs/sec

DSP representation
• Non-terminating programs and iteration based

For n = 1 to n = ∞:

$$y(n) = h_0\,x(n) + h_1\,x(n-1) + h_2\,x(n-2)$$

[Figure: DSP block with input x(n) and output y(n)]

• Iteration period – time required to execute one iteration
• Sampling rate (throughput) – number of samples processed per second
• Latency – difference between the time an output is generated and the time at which its corresponding input was received
• Critical path delay
• Clock period (the clock rate is not necessarily equal to the sampling rate)
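The non-terminating, iteration-based nature of a DSP program can be sketched with a Python generator: each loop pass is one iteration that consumes one input sample and produces one output sample, and the loop only ends if the input stream does. The name `fir3` and the coefficient values are illustrative assumptions.

```python
from itertools import count, islice


def fir3(samples, h0, h1, h2):
    """Streaming 3-tap FIR: y(n) = h0 x(n) + h1 x(n-1) + h2 x(n-2)."""
    x1 = x2 = 0                  # delay elements holding x(n-1) and x(n-2)
    for x in samples:            # one pass of this loop is one iteration
        yield h0 * x + h1 * x1 + h2 * x2
        x2, x1 = x1, x           # shift the delay line for the next iteration


# count(1) is an endless input stream 1, 2, 3, ...; only four outputs are taken.
first_outputs = list(islice(fir3(count(1), 1, 2, 3), 4))
print(first_outputs)
```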

DSP representation
• Mathematical formulation
• Behavioral description languages
– Applicative languages
  • Set of equations
– Prescriptive languages
  • Specify the order of assignment statements
  • E.g. Pascal, C, SystemC
– Descriptive languages
  • Represent the structure of the DSP system
  • E.g. VHDL, Verilog
• Graphical representation
– For investigating and analyzing data-flow properties
– Exhibits parallelism and data-driven (dependency) properties, and provides insight into space-time tradeoffs
– Mapping DSP algorithms to hardware implementations
  • Block diagram, Signal-Flow Graph (SFG), Data-Flow Graph (DFG), and Dependence Graph (DG)

Block Diagram

• Consists of functional blocks connected with directed edges, which represent the data flow from an input block to an output block.

• Edges may or may not contain delay elements

Signal Flow Graph (SFG)

• An SFG is a graph whose nodes represent computations/tasks; a directed edge e(j,k) denotes a branch originating at node j and terminating at node k.

• With the input signal at node j and the output signal at node k, e(j,k) denotes a linear transformation from the signal at node j to the signal at node k.

• In digital networks, the edges are usually restricted to constant gain multipliers or delay elements

• Adders and multipliers are described by a node with multiple incoming edges and one outgoing edge.

• 2 special nodes – sink and source

Example SFG of a direct-form 3-tap FIR filter

Transposition of SFG

• Linear SFGs can be transformed into different forms
– Flow graph reversal or transposition for single-input-single-output (SISO) systems
– Transform operations
  • Reversing the direction of all edges
  • Exchanging the input and output nodes while keeping the edge gain or edge delay unchanged
– Resulting SFG maintains the same functionality
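That transposition preserves functionality can be checked on a small FIR example by simulating both structures register by register. This is a sketch under the assumption of zero initial register contents; the names `fir_direct` and `fir_transposed` and the data are illustrative. In the direct form the delay line holds past inputs, while in the transposed form the registers hold partial sums.

```python
def fir_direct(h, x):
    """Direct form: the delay line stores past input samples."""
    delay_line = [0] * (len(h) - 1)          # x(n-1), x(n-2), ...
    out = []
    for xn in x:
        taps = [xn] + delay_line
        out.append(sum(hk * tk for hk, tk in zip(h, taps)))
        delay_line = taps[:-1]               # shift in the newest sample
    return out


def fir_transposed(h, x):
    """Transposed form: the input is broadcast to all multipliers and the
    registers store partial sums."""
    regs = [0] * (len(h) - 1)
    out = []
    for xn in x:
        out.append(h[0] * xn + (regs[0] if regs else 0))
        for i in range(len(regs)):
            # regs[i + 1] still holds its previous-cycle value at this point.
            nxt = regs[i + 1] if i + 1 < len(regs) else 0
            regs[i] = h[i + 1] * xn + nxt
    return out


h = [1, 2, 3]              # illustrative 3-tap filter
x = [1, 0, 0, 2, -1]       # illustrative input
print(fir_direct(h, x), fir_transposed(h, x))
```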

Data Flow Graph (DFG)
• Graph G = (N,E), where nodes represent computations (or functions or subtasks) and directed edges represent data paths (communications between nodes). Each edge has a non-negative number of delays associated with it.

Data Flow Graph (DFG)

• DFG captures the data-driven property
• A node can execute only when all of its input data are available.
• Concurrent execution
• A node with multiple input edges can only execute when all its precedent nodes have executed, thus describing the precedence constraints
– If an edge has zero delay: intra-iteration precedence
– If an edge has non-zero delay: inter-iteration precedence
• DFGs are generally used for high-level synthesis, to map concurrent implementations of DSP applications onto parallel hardware
– Task scheduling and resource allocation

Example of DFG

Synchronous Data Flow graph (SDFG)

• Special case of DFG
– Number of data samples produced or consumed by each node in each execution is specified a priori
– Both for single-rate and multi-rate systems
– Unrolling (unfolding) multirate systems to single-rate

Dependence Graph

• A directed graph that shows the dependence of the computation

• Nodes represent computations and edges represent precedence constraints

• Similar to a DFG, except that the nodes in a DFG cover only the computations in one iteration, whereas a DG contains the computations for all iterations. A DFG contains delay elements that store and pass data between iterations, while a DG does not contain delay elements.

Example of a DG

Critical Path of a DFG
• Critical path – path with the longest computation time among all paths that contain zero delay (i.e. without delay element)
• The minimum clock period of the DSP system depends on the critical path delay
• In DSP systems, e.g. filter element, the critical path depends on the delay of the following:
– Input to the delay element
– Input to the output
– Delay element to the output
– Delay element to delay element
• E.g.

[Figure: example DFG from In to Out with delay elements (D), multipliers (X) and adders (+); node computation times of 1 and 2 time units are annotated]

Critical path comparison

[Figure: two 4-tap FIR filter structures with input x(n) and output y(n), each built from delay elements (D), multipliers (X) and adders (+)]

Direct Form 4-tap FIR
• Critical path = delay(mult) + (N-1) delay(add), where N is the number of taps
• Delay element: shorter bitwidth

Transposed Form 4-tap FIR
• Critical path = delay(mult) + delay(add)
• Delay element: longer bitwidth
• Fanout of the input is larger
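The two critical-path expressions can be compared numerically. In the sketch below the unit delays (`t_mult = 2`, `t_add = 1`) and the function names are hypothetical values chosen for illustration, not figures from the lecture. The direct-form path grows with the number of taps, while the transposed-form path stays constant.

```python
def critical_path_direct(num_taps, t_mult, t_add):
    """Direct form: one multiplier followed by a chain of N - 1 adders."""
    return t_mult + (num_taps - 1) * t_add


def critical_path_transposed(t_mult, t_add):
    """Transposed form: one multiplier and one adder between registers."""
    return t_mult + t_add


t_mult, t_add = 2, 1    # hypothetical delays in arbitrary time units
for taps in (4, 8, 16):
    print(taps,
          critical_path_direct(taps, t_mult, t_add),
          critical_path_transposed(t_mult, t_add))
```

Since the minimum clock period is set by the critical path, the transposed form allows a shorter clock period for long filters, at the cost of wider registers and a larger input fanout.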

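Iteration Period
• Iteration: execution of all computations of an algorithm once
• Iteration period: the time required for execution of an iteration
• E.g. y(n) = a y(n-1) + x(n)

[Figure: block diagram of y(n) = a y(n-1) + x(n) with input x(n), coefficient a and a delay D producing y(n-1); and the corresponding DFG with nodes A and B, computation times (2) and (4), and one delay D]

Order of execution across iterations: $A_0 \rightarrow B_0 \rightarrow A_1 \rightarrow B_1 \rightarrow A_2 \rightarrow B_2 \rightarrow \ldots$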

Loop Bound

• Loop: a directed path that begins and ends at the same node.

• Loop bound of a loop
– Lower bound on the loop computation time
– Defined as $t_l / w_l$, where $t_l$ is the loop computation time and $w_l$ is the number of delays in the loop

• E.g.

[Figure: DFG with input x(n), output y(n), nodes A and B with computation times (2) and (4), coefficient a, and one delay D in the loop]

A → B → A is a loop with $t_l = 2 + 4 = 6$ and $w_l = 1$, hence the loop bound is 6/1 = 6.

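Loop Bound
• Another example

[Figure: DFG with input x(n), output y(n), nodes A and B with computation times (2) and (4), coefficient a, and two delays (2D) in the loop]

A → B → A is a loop with $t_l = 2 + 4 = 6$ and $w_l = 2$ (since 2D), hence the loop bound is 6/2 = 3.

This means that, on average, one iteration of the loop can be executed every 3 time units. The loop splits into two independent sets of precedence constraints:

$A_0 \rightarrow B_0 \rightarrow A_2 \rightarrow B_2 \rightarrow A_4 \rightarrow B_4 \rightarrow \ldots$ (even iterations)

$A_1 \rightarrow B_1 \rightarrow A_3 \rightarrow B_3 \rightarrow A_5 \rightarrow B_5 \rightarrow \ldots$ (odd iterations)

• Another example

[Figure: DFG with nodes A, B and C, computation times (2), (4) and (5), and delay elements D and 2D]

Two loops:
– A → B → A: T = 6, W = 2, bound = 3
– A → B → C → A: T = 11, W = 1, bound = 11

Hence the maximum loop bound of this DFG is max{3, 11} = 11.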

Iteration Bound

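• Critical loop: the loop with the maximum loop bound

• Iteration bound ($T_{it}$): the loop bound of the critical loop

$$T_{it} = \max_{i \in \text{all loops}} \left\{ \frac{T_i}{W_i} \right\}$$

where $T_i$ is the computation time of loop i and $W_i$ is the number of delays in loop i.

• It is not possible to achieve an iteration period lower than the iteration bound, even with infinite processing power

• E.g.

[Figure: DFG with nodes A, B, C and D, computation times (4), (3), (2) and (4), and delay elements D, D, D and 2D]

– Loop (A → B → A): T/W = 7/1 = 7
– Loop (A → B → C → A): T/W = 9/2 = 4.5
– Loop (B → C → D → B): T/W = 9/3 = 3
– Iteration bound = max(7, 4.5, 3) = 7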
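The iteration bound of a small DFG can be sketched by enumerating its loops directly and taking the largest T/W ratio. The function names are illustrative, and the placement of delays on individual edges is an assumption: the figure is not recoverable, so the edge delays below are one assignment that reproduces the loop totals quoted in the example (W = 1, 2 and 3). Direct enumeration is only practical for small graphs, which is the motivation for dedicated algorithms.

```python
from fractions import Fraction


def simple_loops(edges):
    """Return every simple loop once, as a node list starting at its smallest node."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    loops = []

    def extend(start, path):
        for nxt in successors.get(path[-1], []):
            if nxt == start:
                loops.append(list(path))
            elif nxt > start and nxt not in path:
                extend(start, path + [nxt])

    for start in sorted(successors):
        extend(start, [start])
    return loops


def iteration_bound(times, edges):
    """times: node -> computation time; edges: (u, v) -> number of delays."""
    bounds = {}
    for loop in simple_loops(edges):
        closed = loop + [loop[0]]
        t = sum(times[node] for node in loop)
        w = sum(edges[(u, v)] for u, v in zip(closed, closed[1:]))
        if w == 0:
            raise ValueError("delay-free loop: the DFG is not computable")
        bounds[tuple(loop)] = Fraction(t, w)
    return max(bounds.values()), bounds


times = {"A": 4, "B": 3, "C": 2, "D": 4}
# Assumed edge delays, chosen to match the loop delay counts in the example.
edges = {("A", "B"): 0, ("B", "A"): 1, ("B", "C"): 1,
         ("C", "A"): 1, ("C", "D"): 0, ("D", "B"): 2}
bound, bounds = iteration_bound(times, edges)
print(bound, bounds)
```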

Algorithms for computing iteration bound

• Longest Path Matrix Algorithm

• Minimum Cycle Mean Algorithm

Longest Path Matrix Algorithm (LPM)

• Construct a series of matrices; the iteration bound is found by examining the diagonal elements of the matrices

• Let d be the number of delay elements in the DFG, and $d_i$ be the i-th delay element.

• Construct matrices $L^{(m)}$, where m = 1, 2, …, d, such that the value of $l^{(m)}_{i,j}$ is the longest computation time of all paths from delay element $d_i$ to $d_j$ that pass through exactly m-1 delays. $l^{(m)}_{i,j} = -1$ if no such path exists.

• $L^{(m+1)}$ can be obtained from $L^{(1)}$ and $L^{(m)}$ recursively by

$$l^{(m+1)}_{i,j} = \max_{k} \left( l^{(1)}_{i,k} + l^{(m)}_{k,j} \right)$$

if there is a k such that $l^{(1)}_{i,k} \neq -1$ and $l^{(m)}_{k,j} \neq -1$ (the maximum is taken over all such k); otherwise $l^{(m+1)}_{i,j} = -1$.

LPM algorithm
• The diagonal element $l^{(m)}_{i,i}$ represents the longest computation time of all loops with m delays that contain $d_i$. The iteration bound is then equal to

$$T_{it} = \max_{i,\,m \in \{1, 2, \ldots, d\}} \left\{ \frac{l^{(m)}_{i,i}}{m} \right\}$$

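LPM algorithm (example)

[Figure: DFG with nodes 1 to 6 and four delay elements d1 to d4; the annotated node computation times are (1), (1), (1), (2), (2), (2)]

$$L^{(1)} = \begin{bmatrix} -1 & 0 & -1 & -1 \\ 4 & -1 & 0 & -1 \\ 5 & -1 & -1 & 0 \\ 5 & -1 & -1 & -1 \end{bmatrix}$$

E.g. $l^{(1)}_{3,1}$: the path from d3 to d1 that passes through exactly zero delays is d3 → 5 → 3 → 2 → 1 → d1, so $l^{(1)}_{3,1} = 2 + 1 + 1 + 1 = 5$.

E.g.

$$l^{(2)}_{2,1} = \max_{k \in \{3\}} \left( -1,\; l^{(1)}_{2,k} + l^{(1)}_{k,1} \right) = \max(-1,\; 0 + 5) = 5$$

$$L^{(2)} = \begin{bmatrix} 4 & -1 & 0 & -1 \\ 5 & 4 & -1 & 0 \\ 5 & 5 & -1 & -1 \\ -1 & 5 & -1 & -1 \end{bmatrix}$$

$$L^{(3)} = \begin{bmatrix} 5 & 4 & -1 & 0 \\ 8 & 5 & 4 & -1 \\ 9 & 5 & 5 & -1 \\ 9 & -1 & 5 & -1 \end{bmatrix}$$

$$L^{(4)} = \begin{bmatrix} 8 & 5 & 4 & -1 \\ 9 & 8 & 5 & 4 \\ 10 & 9 & 5 & 5 \\ 10 & 9 & -1 & 5 \end{bmatrix}$$

$$T_{it} = \max_{i,\,m \in \{1, 2, \ldots, d\}} \left\{ \frac{l^{(m)}_{i,i}}{m} \right\} = \max\left\{ \frac{4}{2}, \frac{4}{2}, \frac{5}{3}, \frac{5}{3}, \frac{5}{3}, \frac{8}{4}, \frac{8}{4}, \frac{5}{4}, \frac{5}{4} \right\} = 2$$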
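The LPM recursion and the final maximization can be sketched as follows, starting from an $L^{(1)}$ matrix that is entered by hand rather than derived from the DFG. The function names are illustrative assumptions, and the sketch assumes the graph has at least one loop (otherwise there is no diagonal entry to maximize over). The input matrix is the 4-delay $L^{(1)}$ from the example, with -1 marking "no path".

```python
from fractions import Fraction


def next_matrix(L1, Lm):
    """Compute L(m+1) from L(1) and L(m); -1 marks 'no such path'."""
    d = len(L1)
    result = [[-1] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            candidates = [L1[i][k] + Lm[k][j] for k in range(d)
                          if L1[i][k] != -1 and Lm[k][j] != -1]
            if candidates:
                result[i][j] = max(candidates)
    return result


def lpm_iteration_bound(L1):
    """Return (iteration bound, [L(1), ..., L(d)]) for a d x d matrix L(1)."""
    d = len(L1)
    matrices = [L1]
    while len(matrices) < d:
        matrices.append(next_matrix(L1, matrices[-1]))
    # Diagonal entry l(m)[i][i] is the longest loop through d_i with m delays.
    ratios = [Fraction(L[i][i], m)
              for m, L in enumerate(matrices, start=1)
              for i in range(d) if L[i][i] != -1]
    return max(ratios), matrices


L1 = [[-1, 0, -1, -1],
      [4, -1, 0, -1],
      [5, -1, -1, 0],
      [5, -1, -1, -1]]
bound, matrices = lpm_iteration_bound(L1)
print(bound)
for m, L in enumerate(matrices, start=1):
    print(m, L)
```

Exact fractions are used so that ratios such as 5/3 are compared without rounding.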

LPM algorithm (another example)

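[Figure: DFG with nodes 1 to 7, annotated node computation times (1), (2), (1), (1), (2), (1), (1), and two delay elements d1 and d2]

$$L^{(1)} = \begin{bmatrix} 4 & 4 \\ 8 & 8 \end{bmatrix}, \qquad L^{(2)} = \begin{bmatrix} 12 & 12 \\ 16 & 16 \end{bmatrix}$$

$$T_{it} = \max\left\{ \frac{4}{1}, \frac{8}{1}, \frac{12}{2}, \frac{16}{2} \right\} = 8$$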
