
    MEMORY-EFFICIENT CONCURRENT VLSI ARCHITECTURES FOR TWO-DIMENSIONAL DISCRETE WAVELET TRANSFORM

    Synopsis

    of

    Ph.D. thesis

    By

    Anurag Mahajan (Enrollment Number: 07P01005G)

    Under the Guidance of

    Prof. B.K.Mohanty

    Department of Electronics and Communication Engineering
    JAYPEE UNIVERSITY OF ENGINEERING AND TECHNOLOGY, GUNA (M.P.) - INDIA

    August 2013


    Preface

    Discrete wavelet transform (DWT) is a mathematical technique that provides a new method for signal processing. It decomposes a signal in the time domain using dilated/contracted and translated versions of a single basis function, known as the prototype wavelet (Mallat, 1989; Daubechies, 1992; Meyer, 1993; Vetterli and Kovacevic, 1995). DWT offers a wide variety of useful features over other unitary transforms such as the discrete Fourier transform (DFT), discrete cosine transform (DCT) and discrete sine transform (DST). Two-dimensional (2-D) DWT has been applied to image compression, image analysis, image watermarking, etc. (Lewis and Knowles, 1992). Currently, 2-D DWT is used in the JPEG 2000 image compression standard (Skodras et al., 2001). The 2-D DWT is highly computation intensive, and many of its applications need real-time processing to deliver better performance. The 2-D DWT is therefore implemented in very large scale integration (VLSI) systems to meet the space-time requirements of various real-time applications. Several design schemes have been suggested for efficient implementation of 2-D DWT in a VLSI system.
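
    To make the decomposition concrete, the following minimal Python sketch computes a single-level convolution-based 1-D DWT; the Haar analysis filters used in the example call are an illustrative assumption, not the filters considered in the thesis.

        import numpy as np

        def dwt_1d(x, lo, hi):
            """Single-level 1-D DWT: filter with the low-pass and high-pass
            analysis filters, then downsample by two."""
            approx = np.convolve(x, lo, mode="full")[1::2]   # low-pass (approximation) subband
            detail = np.convolve(x, hi, mode="full")[1::2]   # high-pass (detail) subband
            return approx, detail

        # Illustrative call with Haar analysis filters (an assumed choice)
        lo = np.array([1.0, 1.0]) / np.sqrt(2.0)
        hi = np.array([1.0, -1.0]) / np.sqrt(2.0)
        approx, detail = dwt_1d(np.arange(8.0), lo, hi)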

    The hardware complexity of a multilevel 2-D DWT structure is broadly divided into two parts: (i) arithmetic and (ii) memory. The arithmetic component comprises multipliers and adders, and its complexity depends on the wavelet filter size (k). The memory component comprises the line buffer and the frame buffer, and its complexity depends on the image size (MN), where M and N represent the height and width of the input image. Small filters (k < 10) are used in DWT, whereas the standard image size is (512 × 512). Therefore, the complexity of a multilevel 2-D DWT structure is dominated by the complexity of the memory component. Most of the existing design strategies focus on arithmetic complexity, cycle period and throughput rate; no specific memory-centric design method has been proposed for multilevel 2-D DWT. The objective of the proposed thesis work is to explore memory-centric design approaches and to propose area-delay-power-efficient hardware designs for the implementation of multilevel 2-D DWT.

    Objective

    The thesis entitled “Memory-Efficient Concurrent VLSI Architectures for Two-Dimensional

    Discrete Wavelet Transform” has the following aims and objectives:

    • To improve the memory utilization efficiency of the 2-D DWT structure.

    • To reduce the transposition memory size.

    • To eliminate the frame buffer.

    • To reduce arithmetic complexity using a low-complexity design scheme.

    The summary of the thesis is given below:

    Chapter 1: Introduction

    In this chapter, the computation schemes of one-dimensional (1-D) and 2-D DWT are discussed. 1-D DWT can be performed using the convolution scheme or the lifting scheme proposed by Sweldens (1996). The convolution scheme involves more arithmetic resources and memory space than the lifting scheme; however, the lifting scheme is suitable for bi-orthogonal wavelet filters. The 2-D DWT computation is performed by two approaches: (i) separable and (ii) non-separable. In the non-separable approach, the row and column transforms of 2-D DWT are performed simultaneously using 2-D wavelet filters. In the separable approach, the row and column transforms of 2-D DWT are performed separately using 1-D DWT. The separable approach is more popular than the non-separable approach as it demands less computation; however, it requires a transposition memory between the row and column transforms. Multilevel 2-D DWT computation can be performed using the pyramid algorithm (PA), the recursive pyramid algorithm (RPA) of Vishwanath (1994), or the folded scheme of Wu and Chen (2001). Due to its design simplicity, 100% hardware utilization efficiency (HUE) and lower arithmetic resource requirement, the folded scheme is more popular than the PA and RPA for hardware realization. Keeping this in view, several architectures based on the folded scheme have been proposed for efficient implementation of 2-D DWT.
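
    As an illustration of the separable approach, the sketch below applies the 1-D DWT to the rows of an image, holds the intermediate matrices (the counterparts of Ul and Uh), and then applies the 1-D DWT to their columns to obtain the LL, LH, HL and HH subbands. It is a behavioural Python model only; function and variable names are illustrative and no hardware detail is implied.

        import numpy as np

        def dwt_1d(x, lo, hi):
            # Single-level 1-D DWT: convolve with the analysis filters, downsample by two
            return (np.convolve(x, lo, mode="full")[1::2],
                    np.convolve(x, hi, mode="full")[1::2])

        def dwt_2d_separable(image, lo, hi):
            """Single-level separable 2-D DWT: row transform followed by column transform."""
            # Row transform: 1-D DWT of every row gives the intermediate matrices U_l, U_h
            rows = [dwt_1d(row, lo, hi) for row in image]
            U_l = np.array([a for a, _ in rows])    # low-pass intermediate matrix
            U_h = np.array([d for _, d in rows])    # high-pass intermediate matrix

            # Column transform: 1-D DWT of every column of U_l and U_h
            def column_transform(U):
                cols = [dwt_1d(col, lo, hi) for col in U.T]
                return (np.array([a for a, _ in cols]).T,
                        np.array([d for _, d in cols]).T)

            LL, LH = column_transform(U_l)   # subbands from the low-pass intermediate
            HL, HH = column_transform(U_h)   # subbands from the high-pass intermediate
            return LL, LH, HL, HH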

    Chapter 2: Hardware Complexity Analysis

    Folded 2-D DWT computation is performed level by level using one separable 2-D DWT unit, referred to as the processing unit (PU), and one frame buffer. The low-low (LL) subband of the current DWT level is stored in the frame buffer to compute the higher DWT levels. The PU comprises one row-processor (to perform 1-D DWT computation row-wise), one column-processor (to perform 1-D DWT computation column-wise), one transposition memory and one temporal memory. The transposition memory stores the intermediate low-pass (Ul) and high-pass (Uh) matrices, while the temporal memory is used by the column-processor to store the partial results of the column DWT. The frame memory may be either on-chip or off-chip, while the other two are usually on-chip memories.
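
    A behavioural sketch of this folded computation is given below: a single 2-D DWT routine (the counterpart of the PU) is reused level by level, and the LL subband of each level is stored and fed back as the input of the next level, playing the role of the frame buffer contents. It reuses the dwt_2d_separable helper sketched in Chapter 1; names are illustrative.

        import numpy as np

        def folded_multilevel_dwt(image, lo, hi, levels):
            """Folded multilevel 2-D DWT: one processing unit reused level by level."""
            subbands = []
            frame_buffer = np.asarray(image, dtype=float)   # models the LL frame buffer
            for _ in range(levels):
                # One pass of the PU over the current frame buffer contents
                LL, LH, HL, HH = dwt_2d_separable(frame_buffer, lo, hi)
                subbands.append((LL, LH, HL, HH))
                frame_buffer = LL        # store the LL subband for the next DWT level
            return subbands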


    The arithmetic complexity of the folded structure depends on the DWT computation scheme and the filter length. The frame buffer size is MN/4 words, which is independent of the data access scheme, the type of DWT computation scheme (convolution or lifting) and the length of the wavelet filter. The temporal memory size depends on the DWT computation scheme and the wavelet filter length. For convolution-based 2-D DWT, the temporal memory size is zero when a direct-form FIR structure is used for the computation of 1-D DWT. In the case of lifting-based 2-D DWT, the size of the temporal memory depends on the number of lifting steps of the bi-orthogonal wavelet filter. The transposition memory size mainly depends on the data access scheme adopted to feed the 2-D input samples and on the DWT computation scheme (convolution or lifting). In general, the sizes of the transposition memory and temporal memory are some multiple of the image width, while the size of the frame memory is some multiple of the image size. On the other hand, the complexity of the arithmetic component depends on the size of the wavelet filter. The standard image size is (512 × 512), whereas the size of the most commonly used wavelet filters is less than 10. The hardware complexity of the folded 2-D DWT structure is therefore dominated by the complexity of the memory component.
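
    A rough worked example of these estimates for M = N = 512 is sketched below: the frame buffer holds MN/4 words, while the transposition and temporal memories together amount to a few multiples of N (the 5.5N figure quoted for the designs discussed in Chapter 3 is used here); the multiplier and adder counts assumed for a 9/7 lifting processing unit are illustrative only.

        # Illustrative memory versus arithmetic count for a folded 2-D DWT structure
        M = N = 512

        frame_buffer_words = (M * N) // 4        # MN/4 words, independent of filter and scheme
        on_chip_words = int(5.5 * N)             # transposition + temporal memory, O(N) words
        total_memory_words = frame_buffer_words + on_chip_words   # 68352 words

        multipliers = 20                         # assumed: row and column lifting processors
        adders = 32                              # assumed
        arithmetic_units = multipliers + adders  # 52 units

        ratio = total_memory_words / arithmetic_units   # ~1.3e3: memory dominates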

    Several VLSI architectures have been suggested for the folded 2-D DWT in the last decade to meet the space and time requirements of real-time applications. These designs differ in arithmetic complexity, cycle period and throughput rate, but they use almost the same amount of on-chip memory words and an equal amount of frame buffer words. The arithmetic complexity (in terms of multipliers and adders) and memory complexity (in terms of memory words) of the best available designs are estimated for the 9/7 wavelet filter and image size (512 × 512) (Wu et al., 2005; Xiong et al., 2006; Xiong et al., 2007; Cheng et al., 2007). It is found that the memory complexity is almost 10^3 times higher than the arithmetic complexity. Consequently, the memory words per output (MPO) of the existing designs are significantly higher than the arithmetic complexity per output. Since the logic complexities of arithmetic components and memory components are widely different, the transistor count is considered to estimate the arithmetic and memory complexity of the existing structures. We find that the transistor count of the memory component is, on average, almost 97% of the total transistor count of the folded designs. Therefore, the memory component of a folded design consumes most of the chip area and power. However, the existing design approaches focus on optimizing the arithmetic complexity and cycle period; no specific design has been suggested to address the memory complexity, which is the major component of the folded 2-D DWT structure.
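
    A back-of-the-envelope transistor estimate illustrating this dominance is sketched below; all per-unit counts (6-transistor SRAM cells, 16-bit words, rough gate counts for a 16-bit multiplier and adder) are assumptions for illustration, and the thesis relies on its own estimates for the cited designs.

        # Hedged transistor-count comparison for a folded 2-D DWT structure (512 x 512 image)
        WORD_BITS = 16
        T_PER_SRAM_BIT = 6             # 6-transistor SRAM cell (assumed memory style)
        T_PER_MULTIPLIER = 3000        # assumed for a 16-bit multiplier
        T_PER_ADDER = 500              # assumed for a 16-bit adder

        memory_words = 512 * 512 // 4 + int(5.5 * 512)       # frame buffer + on-chip memory
        memory_transistors = memory_words * WORD_BITS * T_PER_SRAM_BIT

        arithmetic_transistors = 20 * T_PER_MULTIPLIER + 32 * T_PER_ADDER

        memory_share = memory_transistors / (memory_transistors + arithmetic_transistors)
        # memory_share is roughly 0.99 under these assumptions, of the same order as
        # the ~97% average figure reported above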


    Chapter 3: Block-Based Architecture for Folded 2-D DWT Using Line Scanning

    The folded 2-D DWT structure is memory intensive. A few two-input two-output and four-input four-output designs have been suggested in Xiong et al. (2006), Xiong et al. (2007), Li et al. (2009) and Lai et al. (2009) for high-throughput implementation of folded 2-D DWT. The arithmetic complexity of these structures varies proportionally with the throughput rate, but the memory complexity is almost independent of the throughput rate. For example, the structure of Xiong et al. (2007) processes four samples per cycle and involves on-chip memory of nearly 5.5N words, whereas the structure of Xiong et al. (2006) processes two samples per cycle and involves on-chip memory of 5.5N words, and both designs involve a frame buffer of MN/4 words. In general, the on-chip and off-chip memory of a folded design is almost independent of the input block size. Therefore, block
