VLSI titles

VLSI IEEE-2012 Titles

Image processing

1. Implementation of image reconstruction algorithm using

compressive sensing in FPGA

Abstract

Compressive Sensing (CS) is a technique that suggests the possibility of

reconstructing a signal vector from far fewer linear measurements than

its dimension. Sparse signals are acquired in vectors using sensing matrices. If

the signals are sparse enough, the original signal can be reconstructed

successfully. In CS applications, the signal can be acquired using basic

methods, but reconstructing it from incomplete data sets requires high processing

power and complex statistical computations. This research uses OMP

(Orthogonal Matching Pursuit), which is faster and more hardware-implementable

than other reconstruction algorithms. The OMP

algorithm is implemented on a Virtex-6 FPGA (Field Programmable Gate

Array). With various optimizations, the designed system yielded results at least

a thousand times faster than CPU (Central Processing Unit) and GPU

(Graphics Processing Unit) applications.
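
The greedy loop behind OMP can be sketched in a few lines of NumPy. This is a minimal software sketch under stated assumptions, not the FPGA design described in the abstract: the function name `omp`, the dense sensing matrix `A`, and the known sparsity level `sparsity` are illustrative choices that do not come from the paper.

```python
import numpy as np

def omp(A, y, sparsity):
    """Estimate a `sparsity`-sparse x from y = A @ x (hypothetical helper)."""
    n = A.shape[1]
    y = np.asarray(y, dtype=np.float64)
    residual = y.copy()
    support = []              # indices of the columns selected so far
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Step 1: pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0   # never select a column twice
        support.append(int(np.argmax(correlations)))
        # Step 2: least-squares fit of y on the selected columns.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        # Step 3: the new residual is orthogonal to the selected columns.
        residual = y - A[:, support] @ coeffs
    x_hat = np.zeros(n)
    x_hat[support] = coeffs
    return x_hat

if __name__ == "__main__":
    # Assumed demo sizes: 20 measurements of a length-40 signal with 3 nonzeros.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 40)) / np.sqrt(20)
    x = np.zeros(40)
    x[[3, 17, 31]] = [1.5, -2.0, 0.7]
    x_hat = omp(A, A @ x, sparsity=3)
    print("max reconstruction error:", np.max(np.abs(x - x_hat)))
```

The least-squares step dominates the cost, which is presumably the part a hardware implementation would need to optimize most carefully.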

2. Implementation of algorithm for detection and correction of

defective pixels in FPGA


Abstract

Defect pixels are a common occurrence in digital camera sensors, either

resulting from the manufacturing process or developing over time. Though low

in quantity, they are very noticeable and can destroy the perceived quality of the

images. This paper presents a method for detection and correction of defect

pixels in images generated by a Bayer mosaic image sensor. We propose an

online and adaptive algorithm, which analyzes the images retrieved from a

Bayer array sensor on a pixel by pixel basis. We consider the values of adjacent

pixels to determine if the current pixel is possibly defective, which is either

confirmed or refuted by repeating the analysis in subsequent frames. For the

confirmed defective pixels, interpolation is performed to restore the image

quality. The algorithm is implemented on an FPGA logic device, suitable for

very high frequency operation required to correct defect pixels in images

produced by high definition (HD) cameras.
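
The detect, confirm, and interpolate flow described above can be sketched as follows. This is a minimal sketch under assumptions, not the paper's algorithm: the names `analyse_frame` and `correct_defects`, the fixed `threshold`, the `confirm_frames` counter, and the median-of-four interpolation are all hypothetical. The only Bayer-specific detail used is that same-colour neighbours sit two pixels away.

```python
import numpy as np

def analyse_frame(frame, threshold):
    """Return (suspect mask, interpolated estimate) for one raw Bayer frame."""
    f = frame.astype(np.int32)
    h, w = f.shape
    # Reflect padding by 2 keeps the Bayer colour parity at the borders.
    p = np.pad(f, 2, mode="reflect")
    offsets = [(-2, 0), (2, 0), (0, -2), (0, 2)]  # same-colour neighbours
    neighbours = np.stack(
        [p[2 + dy:2 + dy + h, 2 + dx:2 + dx + w] for dy, dx in offsets]
    )
    # Suspect only if the pixel differs strongly from every neighbour.
    suspect = np.abs(neighbours - f).min(axis=0) > threshold
    estimate = np.median(neighbours, axis=0)
    return suspect, estimate

def correct_defects(frames, threshold=50, confirm_frames=3):
    """Correct pixels that stay suspicious for `confirm_frames` frames in a row."""
    hits = np.zeros(frames[0].shape, dtype=np.int32)
    corrected = []
    for frame in frames:
        suspect, estimate = analyse_frame(frame, threshold)
        # The counter grows while a pixel stays suspicious and resets otherwise.
        hits = np.where(suspect, hits + 1, 0)
        confirmed = hits >= confirm_frames
        out = frame.copy()
        out[confirmed] = estimate[confirmed].astype(frame.dtype)
        corrected.append(out)
    return corrected, hits >= confirm_frames
```

Requiring several consecutive detections is one simple way to avoid "correcting" genuine image detail that happens to look like a defect in a single frame.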

3. Real time hardware co-simulation of Edge Detection for video

processing system

Abstract

A methodology for implementing real-time DSP applications on field

programmable gate arrays (FPGAs) using Xilinx System Generator (XSG) for

MATLAB is presented in this paper. It presents an architecture for Edge Detection

using a Sobel filter for image processing using Xilinx System Generator. The

design was implemented targeting a Spartan3A DSP 3400 device

(XC3SD3400A-4FGG676C), then a Virtex 5 (xc5vlx50-1ff676). The Edge

Detection method has been verified successfully, with no visually perceptible

errors in the resulting images.
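
For reference, a Sobel edge detector can be sketched in software as below. This is an illustrative NumPy sketch, not the System Generator design: the function name `sobel_edges`, the edge-replicating border handling, the |Gx| + |Gy| magnitude approximation, and the fixed `threshold` are assumptions.

```python
import numpy as np

def sobel_edges(image, threshold):
    """Return a boolean edge map for a 2-D grayscale image (hypothetical helper)."""
    p = np.pad(image.astype(np.float64), 1, mode="edge")
    # Horizontal gradient: right column minus left column, weights 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Vertical gradient: bottom row minus top row, weights 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    # |Gx| + |Gy| avoids a square root; a common simplification in hardware.
    magnitude = np.abs(gx) + np.abs(gy)
    return magnitude > threshold
```

A software model like this is typically what the hardware output is compared against when checking for visible errors.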

4. FPGA implementation of graph cut based image thresholding

Abstract

Thresholding is an important process in many image processing applications.

Recently, a bi-level image thresholding method based on graph cut was

proposed. The method provided thresholding results which were superior to

those obtained with previous techniques. Moreover, the technique was

computationally less complex compared to other graph cut-based image

thresholding approaches. However, the execution time requirements may still

be significant, especially if it is of interest to perform real-time thresholding of

a large number of images, such as in the case of high-resolution video

sequences. In this paper, we propose a method based on the previously

proposed graph cut thresholding method, which is nevertheless appropriate for

hardware (FPGA) real-time implementations. A subset of the proposed

modifications is also appropriate for a general software implementation.

Considering only this subset, the C implementation of the modified method is

approximately 2.2 times faster than the original method, as it was presented in

the original graph cut-based thresholding paper. Furthermore, the FPGA-based

implementation is designed to be 70-100 times faster than the software

implementation, depending on the image used.


5. Background subtraction algorithm for moving object detection in

FPGA

Abstract

Currently, both the market and the academic communities have required

applications based on image and video processing with several real-time

constraints. On the other hand, detection of moving objects is a very important

task in mobile robotics and surveillance applications. In order to achieve an

alternative design that allows for rapid development of real time motion

detection systems, this paper proposes a hardware architecture for motion

detection based on the background subtraction algorithm, which is implemented

on FPGAs (Field Programmable Gate Arrays). For achieving this, the following

steps are executed: (a) a background image (in gray-level format) is stored in an

external SRAM memory, (b) a low-pass filter is applied to both the stored and

current images, (c) the two images are subtracted,

and (d) a morphological filter is applied over the resulting image. Afterward,

the gravity center of the object is calculated and sent to a PC (via RS-232

interface). Both the practical results of the motion detection system and the

synthesis results demonstrate the feasibility of implementing

the proposed algorithms on an FPGA-based hardware platform. The

implemented system provides one processed pixel per FPGA clock cycle

(after the latency time) and speeds up the software implementation (using the

real-time xPC Target OS from MathWorks) by a factor of 32.
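
Steps (a) to (d) and the gravity-center computation can be sketched in software as follows. This is a minimal sketch under assumptions: the 3x3 box filter as the low-pass filter, the fixed `threshold`, the 3x3 erosion as the morphological filter, and all function names are hypothetical choices, since the abstract does not specify them.

```python
import numpy as np

def box_filter3(img):
    """3x3 mean filter, used here as the low-pass filter of step (b)."""
    h, w = img.shape
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0

def erode3(mask):
    """3x3 binary erosion, used here as the morphological filter of step (d)."""
    h, w = mask.shape
    p = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            out &= p[dy:dy + h, dx:dx + w]
    return out

def detect_motion(background, frame, threshold=30):
    """Return (foreground mask, (row, col) gravity center or None)."""
    # (b) low-pass both images, (c) subtract, then threshold the difference.
    diff = np.abs(box_filter3(frame) - box_filter3(background))
    # (d) erosion removes isolated noisy pixels.
    mask = erode3(diff > threshold)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return mask, None
    return mask, (float(rows.mean()), float(cols.mean()))
```

Every stage here is a fixed-size neighbourhood operation, which is what makes a one-pixel-per-clock streaming pipeline plausible in hardware.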


6. Efficient FPGA implementation of steerable Gaussian smoothers

Abstract

Smoothing filters have been extensively used in image and video analysis. In

particular, directional smoothers have been employed in motion analysis, edge

detection, line parameter estimation, and texture analysis. Such applications

often necessitate the use of several directional filters oriented at different

angles. However, applying a large number of filters commonly requires a

significant amount of computing resources. In such cases, real-time

performance may be achieved by using hardware devices

having parallel processing capabilities. Additionally, techniques can take

advantage of the inherent properties of certain smoothing filters. Such a

property is steerability, which implies that the outputs of several filtering

operations can be linearly combined in order to produce the output of a

directional filter at an arbitrary orientation. Although several efficient FPGA

implementations of the convolution operation have been presented in the

literature for non-separable and separable filters, research on steerable filter

implementations on FPGA is limited. In this paper, steerable Gaussian

smoothers are implemented on an FPGA platform. The technique is compared

with a software-based implementation. Performance comparisons indicate that

the FPGA technique provides a significant speed-up factor of at least ~6, utilizing

only a small percentage of the FPGA resources.
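
The steerability property can be illustrated with the classic first-derivative-of-Gaussian example. This is not the smoother family used in the paper; it is a textbook case chosen as an assumption because its basis needs only two filters. The names `basis_kernels`, `directional_kernel`, and `steer`, and the values of `sigma` and `radius`, are hypothetical.

```python
import numpy as np

def _grid(radius):
    ax = np.arange(-radius, radius + 1, dtype=np.float64)
    return np.meshgrid(ax, ax)  # x varies along columns, y along rows

def basis_kernels(sigma=1.5, radius=4):
    """Two basis filters: Gaussian derivatives along x and along y."""
    x, y = _grid(radius)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return -x / sigma**2 * g, -y / sigma**2 * g

def directional_kernel(theta, sigma=1.5, radius=4):
    """Gaussian derivative computed directly along direction theta."""
    x, y = _grid(radius)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    u = x * np.cos(theta) + y * np.sin(theta)
    return -u / sigma**2 * g

def steer(resp_x, resp_y, theta):
    """Linearly combine the two basis responses (or kernels) for angle theta."""
    return np.cos(theta) * resp_x + np.sin(theta) * resp_y

if __name__ == "__main__":
    gx, gy = basis_kernels()
    patch = np.random.default_rng(1).random((9, 9))
    rx, ry = (patch * gx).sum(), (patch * gy).sum()   # filter only twice
    for theta in (0.0, 0.7, 2.1):
        direct = (patch * directional_kernel(theta)).sum()
        print(theta, steer(rx, ry, theta), direct)
```

Only the two basis filters touch the image; every additional orientation costs two multiplications and one addition per output sample.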


7. An FPGA-Based Hardware Implementation of Configurable Pixel-

Level Color Image Fusion

Abstract

Image fusion has attracted a lot of interest in recent years. As a result, different

fusion methods have been proposed mainly in the fields of remote sensing and

computer (e.g., night) vision, while hardware implementations have been also

presented to tackle real-time processing in different application domains. In this

paper, a linear pixel-level fusion method is employed and implemented on a

field-programmable-gate-array-based hardware system that is suitable for

remotely sensed data. Our work incorporates a fusion technique (called VTVA)

that is a linear transformation based on the Cholesky decomposition of the

covariance matrix of the source data. The circuit is composed of different

modules, including covariance estimation, Cholesky decomposition, and

transformation ones. The resulting compact hardware design can be

characterized as a linear configurable implementation since the color properties

of the final fused color can be selected by the user by controlling the

resulting correlation between color components.
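
One way a Cholesky-based linear transform can impose a chosen correlation between output components is sketched below. This is a generic covariance-matching transform written under assumptions; it is not claimed to be the exact VTVA formulation, and the function name `cholesky_color_transform` and the `target_cov` parameter are hypothetical.

```python
import numpy as np

def cholesky_color_transform(pixels, target_cov):
    """Map (N, bands) source pixels to outputs whose covariance is target_cov."""
    centered = pixels - pixels.mean(axis=0)
    source_cov = np.cov(centered, rowvar=False)
    # Factor both covariances as C = L @ L.T with L lower triangular.
    l_src = np.linalg.cholesky(source_cov)
    l_tgt = np.linalg.cholesky(np.asarray(target_cov, dtype=np.float64))
    # With A = L_tgt @ inv(L_src), A @ C_src @ A.T equals the target covariance.
    a = l_tgt @ np.linalg.inv(l_src)
    return centered @ a.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixing = np.array([[1.0, 0.8, 0.3], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
    pixels = rng.normal(size=(500, 3)) @ mixing      # correlated source bands
    target = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
    fused = cholesky_color_transform(pixels, target)
    print(np.round(np.cov(fused, rowvar=False), 3))
```

Choosing `target_cov` is the user-facing knob in this sketch: a diagonal matrix gives decorrelated components, while larger off-diagonal entries give more strongly correlated ones.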

8. Implementation of image reconstruction algorithm using

compressive sensing in FPGA

Abstract

Compressive Sensing (CS) is a technique that suggests the possibility of

reconstructing a signal vector from far fewer linear measurements than

its dimension. Sparse signals are acquired in vectors using sensing matrices. If

the signals are sparse enough, the original signal can be reconstructed

successfully. In CS applications, the signal can be acquired using basic

methods, but reconstructing it from incomplete data sets requires high processing

power and complex statistical computations. This research uses OMP

(Orthogonal Matching Pursuit), which is faster and more hardware-implementable

than other reconstruction algorithms. The OMP

algorithm is implemented on a Virtex-6 FPGA (Field Programmable Gate

Array). With various optimizations, the designed system yielded results at least

a thousand times faster than CPU (Central Processing Unit) and GPU

(Graphics Processing Unit) applications.

9. A hardware acceleration of a real time video processing

Abstract

This paper presents a method based on the edge histogram descriptor to accelerate

a shot cut detection algorithm for real-time applications. Before any

content-based manipulation, the hierarchical structure of the video must be determined, and

a pure software solution is not suitable for this application due to the constraints

imposed by this algorithm. In this context, we have used a Field Programmable

Gate Array (FPGA) integrated architecture to accelerate this processing.

10. A non linear equation based cryptosystem for image encryption and

decryption

Abstract

In this paper a new approach for image encryption and decryption using a chaotic

map and a non linear equation known as BB equation is described. Chaotic maps

have been widely used in data encryption. Various chaos map based encryption


and decryption algorithms are used but are found to be insecure. Hence a new

method is implemented based on BB (Brahmagupta-Bhaskara) equation which is

combined with chaos to give a non linear dependency and thus improved security.

VLSI architecture for the proposed algorithm is designed and realized using Xilinx

ISE VLSI software for hardware implementation.


DSP

1. Efficient VLSI implementation of soft-input soft-output fixed-

complexity sphere decoder

Abstract

Fixed-complexity sphere decoder (FSD) is one of the most promising techniques

for the implementation of multiple-input multiple-output (MIMO) detection, with

relevant advantages in terms of constant throughput and high flexibility of parallel

architecture. The reported works on FSD are mainly based on software level

simulations, and few details have been provided on hardware implementation.

The authors present the study based on a four-nodes-per-cycle parallel FSD

architecture with several examples of VLSI implementation in 4×4 systems with

both 16-quadrature amplitude modulation (QAM) and 64-QAM modulation and

both real and complex signal models. The implementation aspects and details of

the architecture are analysed in order to provide a variety of performance-

complexity trade-offs. The authors also provide a parallel implementation of log-

likelihood-ratio (LLR) generator with optimised algorithm to enhance the proposed

FSD architecture to be a soft-input soft-output (SISO) MIMO detector. To the

authors' best knowledge, this is the first complete VLSI implementation of an FSD

based SISO MIMO detector. The implementation results show that the proposed

SISO FSD architecture is highly efficient and flexible, making it very suitable for

real applications.


2. Lossy Compression of Discrete Sources via the Viterbi Algorithm

Abstract

We present a new lossy compressor for finite-alphabet sources. For coding a

sequence x^n, the encoder starts by assigning a certain cost to each possible

reconstruction sequence. It then finds the one that minimizes this cost and

describes it losslessly to the decoder via a universal lossless compressor. The cost

of each sequence is a linear combination of its distance from the sequence x^n and a

linear function of its kth order empirical distribution. The structure of the cost

function allows the encoder to employ the Viterbi algorithm to find the sequence

with minimum cost. We identify a choice of the coefficients used in the cost

function which ensures that the algorithm universally achieves the optimum rate-

distortion performance for any stationary ergodic source, in the limit of large n,

provided that k increases as o(log n). Iterative techniques for approximating the

coefficients, which alleviate the computational burden of finding the optimal

coefficients, are proposed and studied.

3. FPGA implementation of IEEE 802.15.3c receiver

Abstract

This paper presents the implementation of the OFDM demodulator and the Viterbi

decoder, proposed as part of a wireless High Definition video receiver to be

integrated in an FPGA. These blocks were implemented in a Xilinx Virtex-6

FPGA. The complete system was previously modeled and simulated using


MATLAB/Simulink to extract important hardware characteristics for the FPGA

implementation.

4. A Network-on-Chip-based turbo/LDPC decoder architecture

Abstract

The current convergence process in wireless technologies demands strong

efforts in conceiving highly flexible and interoperable equipment. This

contribution focuses on one of the most important baseband processing units in

wireless receivers, the forward error correction unit, and proposes a Network-on-

Chip (NoC) based approach to the design of multi-standard decoders. High level

modeling is exploited to drive the NoC optimization for a given set of both turbo

and Low-Density-Parity-Check (LDPC) codes to be supported. Moreover,

synthesis results prove that the proposed approach can offer a fully compliant

WiMAX decoder, supporting the whole set of turbo and LDPC codes with higher

throughput and an occupied area comparable or lower than previously reported

flexible implementations. In particular, the mentioned design case achieves a

worst-case throughput higher than 70 Mb/s at the area cost of 3.17 mm² on a 90 nm

CMOS technology.


5. Design and implementation of an optical OFDM baseband receiver in

FPGA

Abstract

In this paper, a baseband receiver design and its FPGA implementation for an

OOFDM system aimed at the NG-PON (passive optical network) applications are

presented. A low cost IMDD (intensity modulation, direct detection) architecture is

adopted and baseband DSP measures are employed to compensate various optical

impairments. Targeting a 4GSps throughput rate, an 8-way parallel architecture is

developed to perform the synchronization, FFT and equalization each with massive

parallelism. A real valued FFT module taking advantage of the Hermitian spectrum

is also developed to reduce the circuit complexity significantly. The simulation

results show the proposed baseband receiver is capable of achieving an 8Gbps

(effective) transmission bandwidth for 64-QAM coded OFDM symbols over a

25 km long single mode fiber network. The uncoded BER reaches 10^-3 when the

received optical power is -16dBm. Due to the speed and resource limitation, the

FPGA implementation obtains a fully functional but speed degraded system. The

maximum working frequency is 250 MHz, which is one half of the 500MHz

required for real time processing. The design occupies 21,423 logic slices and 56

embedded multiplier modules.
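
The Hermitian-spectrum property exploited by the real-valued FFT module can be checked numerically. This is a small software illustration, not the receiver's hardware architecture; the helper name `full_spectrum_from_half` is hypothetical. For a real input of length N, bin N-k is the complex conjugate of bin k, so only about half of the bins need to be computed.

```python
import numpy as np

def full_spectrum_from_half(half, n):
    """Rebuild all n FFT bins of a real signal from its n//2 + 1 rfft bins."""
    # Bins n//2+1 .. n-1 mirror bins 1 .. (n+1)//2 - 1, conjugated.
    mirrored = np.conj(half[1:(n + 1) // 2][::-1])
    return np.concatenate([half, mirrored])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=16)              # real-valued time-domain samples
    half = np.fft.rfft(x)                # only 9 of the 16 bins are computed
    rebuilt = full_spectrum_from_half(half, x.size)
    print("bins computed:", half.size, "of", x.size)
    print("matches full FFT:", np.allclose(rebuilt, np.fft.fft(x)))
```

Skipping the redundant half of the spectrum is where the reduction in circuit complexity comes from.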

6. VLSI Architecture for a Reconfigurable Spectrally Efficient FDM

Baseband Transmitter

Abstract

Spectrally efficient FDM (SEFDM) systems employ non-orthogonal overlapped

carriers to improve spectral efficiency for future communication systems. One of


the key research challenges for SEFDM systems is to demonstrate efficient

hardware implementations for transmitters and receivers. Focusing on transmitters,

this paper explains the SEFDM concept and examines the complexity of published

modulation algorithms, with particular consideration to implementation issues. We

then present two new variants of a digital baseband transmitter architecture for

SEFDM, based on a modulation algorithm which employs the discrete Fourier

transform (DFT) implemented efficiently using the fast Fourier transform (FFT).

The algorithm requires multiple FFTs, which can be configured either as parallel

transforms, which is optimal for throughput or using a multi-stream FFT

architecture, for reduced circuit area. We propose a simplified approach to IFFT

pruning for pipeline architectures, based on a token-flow control style, specifically

optimized for the SEFDM application. Reconfigurable implementations for

different bandwidth compression ratios, including conventional OFDM, are easily

derived from the proposed implementations. The SEFDM transmitters have been

synthesized, placed and routed in a commercial 32 nm CMOS process technology

and also verified in FPGA. We report circuit area and simulated power dissipation

figures, which confirm the feasibility of SEFDM transmitters.

7. A Nonbinary LDPC Decoder Architecture With Adaptive Message

Control

Abstract

A new decoder architecture for nonbinary low-density parity-check (LDPC) codes

is presented in this paper to reduce the hardware operational complexity in VLSI

implementations. The low decoding complexity is achieved by employing adaptive

message control (AMC) that dynamically trims the message length of belief

information to reduce the amount of memory accesses and arithmetic operations.


To implement the proposed AMC, we develop the architecture of a horizontal

sequential nonbinary LDPC decoder. Key components in the architecture have

been designed with the consideration of variable message lengths to leverage the

benefit of the proposed AMC. Simulation results demonstrate that the proposed

nonbinary LDPC decoder architecture can significantly reduce hardware

operations and power consumption as compared with existing work with negligible

performance degradation.

8. Design and implementation of low power FFT/IFFT processor for

wireless communication

Abstract

Fast Fourier transform (FFT) processing is one of the key procedures in popular

orthogonal frequency division multiplexing (OFDM) communication systems.

Structured pipeline architectures, low power consumption, high speed and reduced

chip area are the main concerns in this VLSI implementation. In this paper, the

efficient implementation of FFT/IFFT processor for OFDM applications is

presented. The processor can be used in various OFDM-based communication

systems, such as Worldwide Interoperability for Microwave Access (WiMAX),

digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T).

We adopt a single-path delay feedback architecture. To eliminate the read-only

memories (ROMs) used to store the twiddle factors, the proposed architecture

applies a reconfigurable complex multiplier to achieve a ROM-less FFT/IFFT

processor, and to reduce the truncation error we adopt a fixed-width modified

Booth multiplier. Three processing elements (PEs) and delay-line (DL) buffers are

used for computing the IFFT. The design thus achieves low power, lower hardware cost,

high efficiency and reduced chip size.


9. High-Speed Low-Power Viterbi Decoder Design for TCM Decoders

Abstract

High-speed, low-power design of Viterbi decoders for trellis coded modulation

(TCM) systems is presented in this paper. It is well known that the Viterbi

decoder (VD) is the dominant module determining the overall power

consumption of TCM decoders. We propose a pre-computation architecture

incorporated with T-algorithm for VD, which can effectively reduce the power

consumption without degrading the decoding speed much. A general solution to

derive the optimal pre-computation steps is also given in the paper.

Implementation result of a VD for a rate-3/4 convolutional code used in a TCM

system shows that compared with the full trellis VD, the precomputation

architecture reduces the power consumption by as much as 70% without

performance loss, while the degradation in clock speed is negligible.


Network On Chip

1. Design methodology for on-chip bus architectures using system-on-chip

network protocol

Abstract

As the number of IP cores that can be integrated into a single chip has increased

significantly in recent years, various types of multi-layered bus architectures are

now being used. However, a reckless use of bus layers may lead to an excessive

number of wires and low resource utilization. To reduce such waste, researchers

have studied automated on-chip bus design methods for optimal architecture

synthesis. This study expands the existing studies in two aspects. First, it considers

all possible topologies and redefines the existing exploration problem, whereas the

existing studies assume only a few types of topologies. Second, the study includes

an exploration process based on a new on-chip bus protocol, system-on-chip

network protocol (SNP), as well as processes based on existing protocols to solve

the redefined problem. After the time complexity is investigated, it is found that

the problem is NP-hard. Accordingly, this study proposes fast search algorithms

that can be applied to each of the exploration steps. The proposed algorithms are

implemented as a software program of exploration. The overall reduction ratio of

the time complexity reaches about three millionths, with a maximal 16% increase


in communication time (CT). Considering today's design life cycle, this seems to be

a good trade-off.

2. Configuring algorithm for reconfigurable Network-on-Chip

architecture

Abstract

With the challenge that a larger number of cores will be integrated on a single

chip, Network-on-Chip (NoC) has gradually become the popular solution.

Recently, researchers have focused on improving the performance of NoC to

achieve well-performing chips. In this paper, we propose a configuring

algorithm based on a reconfigurable NoC architecture to design application-

specific NoCs. The reconfigurable NoC architecture decreases the design

complexity and makes NoC design more flexible compared to the topology-

generation-floorplanning scheme and the mapping scheme, respectively. Besides, our

configuring algorithm aims at optimized networks with better performance. For

one specific application, we can choose the reconfigurable NoC architecture with

suitable size and configure it according to the communication relationship to make

sure that the final network is optimized. A cycle-accurate simulator is used to carry

out simulations for three networks designed by our scheme and two other methods

for the same application under the same environment. The results show that our

network performs better.


3. A Novel Encoding Scheme for Low Power in Network on Chip Links

Abstract

Dynamic power dissipation in interconnects is a major contributor to power

consumption in Networks on Chip (NoCs). This is mainly due to two factors: self

switching activity of the particular link and coupling switching activity among

adjacent links. Two novel techniques are proposed to reduce power consumption

due to switching transitions and crosstalk. The first technique reorders the data in such

a way that switching transition is brought down. In the second technique, it is

ensured that power consumption due to cross coupling activity is reduced. An end

to end encoding scheme facilitating two stage coding to reduce power consumption

in wormhole routed network on chip is designed using the proposed power

reduction techniques. An encoder and decoder implementing the proposed scheme have

been described at RTL level in Verilog HDL, synthesized and mapped into

UMC180 nm technology library. It has been observed that the proposed technique

(TSC) offers an average reduction in dynamic power consumption of 17.34%.

Proposed scheme was compared with existing techniques and observations

concluded that there was not much degradation in area, speed and static power

dissipation. Power reduction when subjected to different kinds of data streams was

analyzed and results indicate that proposed scheme offers uniform power reduction

irrespective of the nature of the data stream, unlike the existing techniques.

4. Dynamic buffer management to improve the performance of fault

tolerance adaptive Network-On-Chip applications


Abstract

Networks-On-Chip are developed with a trade-off between latency and power

dissipation defined at design time. But, if the communication pattern is changed,

decisions taken at design time (say buffer size) may result in large area and power

consumption or higher latency. Using large buffers to guarantee performance leads

to excessive power dissipation. Small buffers reduce power consumption but result

in increased latency. The purpose of the proposed work is to design a

heterogeneous router where the buffer slots are dynamically assigned to improve

the performance, under different communication needs in fault-tolerant adaptive

NoC applications. In the proposed router, buffer slots can dynamically be re-

allocated for various applications to improve performance. Reallocation is based

on the number of hotspots using EBLA (Extended Buffer Loan Algorithm). By

introducing oversized IPs (OIPs), regular mesh-based NoC architecture may be

destroyed. Resulting mesh-based NoC becomes irregular and needs new routing

algorithms to solve routing problems in case of faulty links. A NoC with irregular

2D mesh topology is considered and a fault-tolerant adaptive routing algorithm is

used.

5. Congestion mitigation using flexible router architecture for

Network-on-Chip

Abstract

An important topic in Network-on-Chip (NoC) design is the tradeoff between area

and performance. Some techniques tend to increase the number of buffers to

improve performance. However, this method increases both the chip area and the

power consumption. In this paper we introduce a new flexible router architecture


that can improve the performance of the overall network using the same amount of

buffering available but in an efficient way. Therefore there is no need to increase

the size of buffers or to use extra virtual channels (VCs) which have high power

and area overheads or complex logic. If there is a request to a busy buffer the

router will store the incoming packet in any other suitable free buffer in the router.

The Flexible router shows an increase in performance in terms of increasing the

saturation rate for Hotspot, Uniform, and Nearest-Neighbor traffic, especially

Hotspot with 11.4% increase. Discussion about area overhead over a standard Base

router and the analysis of arriving unordered packets (side-effect) are also

presented.

6. Performance evaluation of a flow control algorithm for Network-

on-Chip

Abstract

Network-on-chip (NoC) has been proposed for SoC (System-on-Chip) as an

alternative to on-chip bus-based interconnects to achieve better performance and

lower energy consumption. Several approaches have been proposed to deal with

NoC design and can be classified into two main categories: design-time

approaches and run-time approaches. Design-time approaches are generally

tailored for an application domain or a specific application by providing a

customized NoC. All parameters, such as routing and switching schemes, are

defined at design time. Run-time approaches, however, provide techniques that

allow a NoC to continuously adapt its structure and its behavior (i.e., at runtime).

In this paper, performance evaluation of a flow control algorithm for congestion

avoidance in NoCs is presented. This algorithm allows NoC elements to


dynamically adjust their inflow by using a feedback control-based mechanism.

Analytical and simulation results are reported to show the viability of this

mechanism for congestion avoidance in NoCs.

7. Low-area boundary BIST architecture for mesh-like network-on-

chip

Abstract

This paper proposes a Built-In Self-Test (BIST) architecture for targeting the

routing infrastructure of mesh-like NoCs from their boundaries. The architecture

contains a counter and a Finite State Machine (FSM) implementing the test

configurations. Test data is generated and test responses compacted by a dedicated

hardware structure requiring very little silicon area. The advantage of this new

boundary BIST concept with respect to existing methods is that costly data

wrappers in the NoC network are unnecessary, and thus, area and performance

penalties are avoided. We have also improved previously developed test

configurations. Experiments show that up to two orders of magnitude gains in the

speed of testing are achieved using the new method for large NoCs.

8. Effect of Application Mapping on Network-on-Chip Performance

Abstract

Network-on-Chip (NoC) is a developing and promising on-chip communication

paradigm that improves scalability and performance of System-on-Chips. NoC

design flow contains many problems from different areas, for example networking,

embedded design and computer architecture. Application mapping is one of these


problems, which is well studied in literature but generally considered as a

communication energy minimization problem. The present study discusses the

effect of application mapping on network parameters such as average queuing

delay or packet loss rates of routers. On the other hand, self similarity is a

phenomenon that is used to characterize Ethernet and/or wide area network traffic,

as well as most on-chip network traffic. The main concern of this study is to

analyze the effect of application mapping on network related parameters by using

an on-chip traffic characterization that contains self similarity. The results of our

computational study show that mapping of cores may have a significant

degenerative effect on network performance, and so adding network related terms

to application mapping problem may improve the overall on-chip network

performance considerably.

9. AdNoC: Runtime Adaptive Network-on-Chip Architecture

Abstract

Networks-on-chip (NoCs) have emerged as a promising on-chip interconnect for

future multi/many-core architectures as NoCs are able to scale communication

links with the growing number of cores. State-of-the-art NoC designs rely mainly

on a static network configuration using fixed routing algorithms and buffer

placements. These approaches are not effective in dealing with hard-to-predict

system behavior, for instance due to user behavior or varying workloads, since in

order for static NoCs to cover these scenarios, they would have to be designed for

worst case scenarios. In this paper, we address these problems with a runtime

adaptive network-on-chip (AdNoC). Focusing on the architecture-level adaptation,

we present an adaptive route allocation algorithm which provides a required level


of QoS (guaranteed bandwidth) coupled with an adaptive buffer assignment

scheme which reassigns buffer blocks on-demand. Furthermore, the adaptivity

requires a comprehensive, hardly intrusive, runtime observability infrastructure,

i.e., using monitoring components, in order to gather data on the system state. The

area overhead introduced by the adaptive scheme can be traded off against the

flexibility gained. Moreover, the area overhead is also reduced by resource

multiplexing due to the on-demand buffer assignment at each output port (we

achieved on an average 42% buffer saving in our experiments). We demonstrate

the advantage by using various digital media applications and compare our

approach to state-of-the-art static NoC architectures, e.g., Xpipe, QNoC, and

Æthereal.

10. Fine-Grained Bandwidth Adaptivity in Networks-on-Chip

Using Bidirectional Channels

Abstract

Networks-on-Chip (NoC) serve as efficient and scalable communication substrates

for many-core architectures. Currently, the bandwidth provided in NoCs is

overprovisioned for their typical usage case. In real-world multi-core applications, less

than 5% of channels are utilized on average. Large bandwidth resources serve to

keep network latency low during periods of peak communication demands.

Increasing the average channel utilization through narrower channels could

improve the efficiency of NoCs in terms of area and power; however, in current

NoC architectures this degrades overall system performance. Based on thorough

analysis of the dynamic behaviour of real workloads, we design a novel NoC

architecture that adapts to changing application demands. Our architecture uses


fine-grained bandwidth-adaptive bidirectional channels to improve channel

utilization without negatively affecting network latency. Running PARSEC

benchmarks on a cycle-accurate full-system simulator, we show that fine-grained

bandwidth adaptivity can save up to 75% of channel resources while achieving

92% of overall system performance compared to the baseline network; no

performance is sacrificed in our network design configured with 50% of the

channel resources used in the baseline.

11. Active Memory Processor for Network-on-Chip-Based

Architecture

Abstract

Memory-intensive operations and their memory access latency are often the

performance bottleneck in parallel applications. In this paper, we investigate the

concept of active memory operation which is an active data processing operation

performed on the memory side. Utilizing the active memory operation, we can

replace multiple transactions of memory accesses over the on-chip network and

related computations on the processor side with a smaller number of high-level

transactions and computations on the memory side. To realize the concept, we

have designed a special-purpose processor called active memory processor which

is tightly coupled with the memory and executes the active memory operations. In

our case studies, we have applied the concept to five real-world applications

(parallelized JPEG, FFT, text indexing for data mining, histogram, and eikonal

equation solver) running on a 36-tile architecture with 64 cores and four memory

tiles and found that the proposed approach can improve performance by 20.5 to

259.3 percent.


12. Implementation of CDMA technique for Network-on-Chip

Abstract

A Code-Division Multiple Access (CDMA) based on-chip communication network

is proposed in this paper. The proposed design features a novel encoding and

decoding scheme for CDMA transmission which improves area, latency and power

dissipation of the Network-on-Chip (NoC). The orthogonal and balance properties

of Walsh codes are used for the routing of data between the resources on the

network. The proposed CDMA encoding and decoding schemes are compared with

the conventional schemes. The overall area required to implement the proposed

CDMA NoC design is reduced by 54%. The design decreases the latency of the

network by 48.2%. The total power consumption required to achieve the proposed

design is decreased by 54.8%.
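
The role of Walsh-code orthogonality in sharing one medium can be illustrated with a minimal Python sketch. It is not the encoding and decoding scheme proposed in the paper; the function names (`walsh_codes`, `encode`, `decode`) and the choice to skip the all-ones code so that every sender uses a balanced code are assumptions made for illustration.

```python
def walsh_codes(order):
    """Build 2**order Walsh codes (Hadamard rows) with +1/-1 chips."""
    codes = [[1]]
    for _ in range(order):
        # Sylvester construction: [[H, H], [H, -H]]
        codes = ([row + row for row in codes] +
                 [row + [-chip for chip in row] for row in codes])
    return codes


def encode(bits, codes):
    """Spread one bit per sender and sum all chips on the shared medium."""
    channel = [0] * len(codes[0])
    for bit, code in zip(bits, codes):
        symbol = 1 if bit else -1
        for i, chip in enumerate(code):
            channel[i] += symbol * chip
    return channel


def decode(channel, code):
    """Correlate with one sender's code; other senders cancel out."""
    correlation = sum(s * c for s, c in zip(channel, code))
    return 1 if correlation > 0 else 0


# Hypothetical setup: three senders, using the balanced codes only
# (the all-ones code at index 0 is skipped).
sender_codes = walsh_codes(2)[1:]
sent_bits = [1, 0, 1]
medium = encode(sent_bits, sender_codes)
received_bits = [decode(medium, code) for code in sender_codes]
```

Because distinct Walsh codes have zero cross-correlation, each receiver recovers its own sender's bit from the summed signal without seeing interference from the others.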

Cryptography

1. A novel architecture for VLSI implementation of RSA cryptosystem

Abstract

The RSA system is widely employed in networking applications and achieves good

performance and high security. In this paper, we use Verilog to implement a 16-bit

RSA block cipher system. The whole implementation includes three parts: key

generation, encryption and decryption process. The key generation stage aims to

generate a pair of public and private keys, and then the private key will be

distributed to the receiver according to certain key distribution schemes. The memory


usage and overhead associated with key generation are eliminated by the

proposed system model. The ciphertext can be decrypted at the receiver side using the RSA

secret key. These stages are simulated in Xilinx and the hardware is synthesized using RTL

Compiler. The existing and proposed models are then analyzed for performance

measures using Synopsys Design Vision. The netlist generated from RTL Compiler

will be used to generate the IC layout.
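
The three stages named in the abstract (key generation, encryption, decryption) can be sketched in Python at the same 16-bit scale. This is a software illustration only, not the paper's Verilog architecture; the primes 251 and 241, the public exponent 17, and the function names are assumptions chosen so that the modulus fits in 16 bits.

```python
from math import gcd


def generate_keys(p, q, e=17):
    """Return ((e, n), (d, n)) for two distinct primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("public exponent must be coprime with phi(n)")
    d = pow(e, -1, phi)  # modular inverse, available in Python 3.8+
    return (e, n), (d, n)


def encrypt(message, public_key):
    """Encrypt an integer message with 0 <= message < n."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    """Recover the integer message from the ciphertext."""
    d, n = private_key
    return pow(ciphertext, d, n)


# Illustrative 16-bit modulus: 251 * 241 = 60491
public_key, private_key = generate_keys(251, 241)
ciphertext = encrypt(12345, public_key)
recovered = decrypt(ciphertext, private_key)
```

A modulus this small offers no real security; it only mirrors the word size used in the abstract.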

2. VLSI Implementation of Advanced Encryption Standard

Abstract

Information security is always a primary concern for a user, since it is

required to protect data. There are a number of approaches, as well as a number

of available software tools, for achieving information security. The proposed work

presents one such cryptographic technique, AES. The proposed work

implements AES in hardware using VHDL. For the simulation

and implementation we use Active-HDL software. The system accepts the

input and performs the encoding; the decoding process is defined afterwards. The results

are presented as waveforms.

3. Highly secured high throughput VLSI architecture for AES algorithm

Abstract

This paper provides an efficient VLSI architecture to increase the throughput and

security of the Advanced Encryption Standard (AES) algorithm. Existing

architectures use the look-up table technique for the SubBytes and inverse

SubBytes transformations used in the AES algorithm; our proposed technique uses

combinational circuits and pipelining, which increase the throughput and


reduce the delay. This design proposes a new technique for implementing the S-box,

which determines the speed and power of the AES architecture. The basic

components of this architecture are made completely fault detectable by using

pseudo-nMOS technology, thereby increasing the security of the system. This

AES design was modeled using Verilog HDL and synthesized using TSMC's 90

nm standard cell library with RTL Compiler, and physical design implementation

was done using SoC Encounter, achieving a throughput of 58.18 Gbps

after detailed routing. The basic security of the system is validated by using

Cadence Virtuoso in the transistor level design.
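
The difference between a look-up table S-box and a computed one can be sketched in Python. The code below derives each SubBytes value from its standard definition (multiplicative inverse in GF(2^8) followed by the affine transform) instead of reading a stored table. It is not the paper's combinational circuit; the function names and the bit-serial arithmetic are assumptions made for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a**254; maps 0 to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, shift):
    """Rotate an 8-bit value left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def sbox(byte):
    """AES SubBytes value computed without a stored table."""
    inv = gf_inv(byte)
    return (inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)


table = [sbox(value) for value in range(256)]
```

In hardware, the same inverse-plus-affine structure is what allows the S-box to be built from logic gates and pipelined, at the cost of a deeper combinational path than a single memory read.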

4. A non linear equation based cryptosystem for image encryption and

decryption

Abstract

In this paper a new approach for image encryption and decryption using chaotic

map and a non linear equation known as BB equation is described. Chaotic maps

have been widely used in data encryption. Various chaos map based encryption

and decryption algorithms are used but are found to be insecure. Hence a new

method is implemented based on BB (Brahmagupta-Bhaskara) equation which is

combined with chaos to give a non linear dependency and thus improved security.

A VLSI architecture for the proposed algorithm is designed and realized using Xilinx

ISE VLSI software for hardware implementation.


Low power VLSI

1. Routing-efficient implementation of an internal-response-based BIST

architecture

Abstract

Internal-response-based BIST techniques have recently been proposed. By using internal

circuit responses to directly generate test patterns, these techniques can

significantly reduce or even eliminate storage requirement for test data. For these

techniques, appropriate routing of the circuit internal nets to the BIST circuitry is

crucial for minimizing the required area overhead and the induced performance

impact. In this paper, an efficient net sharing algorithm together with special

response decompressor hardware is proposed to minimize the total number of

required internal nets for an internal-response-based BIST scheme. Experimental

results show that on average 3.24% of nets and 2.83% area overhead of the


response decompressors are sufficient to achieve complete fault coverage for

ISCAS'85 circuits.

2. A high throughput sort free VLSI architecture for wireless applications

Abstract

Multiple Input Multiple Output (MIMO) technology is used in wireless

communications to achieve high data rates. The use of multiple antennas at both the

transmitter and receiver significantly increases the capacity and spectral efficiency of wireless

systems. This paper presents a Field Programmable Gate Array (FPGA)

implementation for a 4 × 4 breadth first K-best MIMO decoder using a 64

Quadrature Amplitude Modulation (QAM) scheme. A novel sort free approach to

path extension, together with quantized metrics, results in high throughput, low power

and low area. Finally, VLSI architectural tradeoffs are explored for a design synthesized using

Synopsys, with power and throughput analysis in 120 nm technology. The

power needed is 20.0025 μW.

3. Design and implementation of high-performance high-valency Ling

adders

Abstract

Parallel prefix adders are used for efficient VLSI implementation of binary number

additions. Ling architecture offers a faster carry computation stage compared to the

conventional parallel prefix adders. Recently, Jackson and Talwar proposed a new

method to factorize Ling adders, which helps to reduce the complexity as well as

the delay of the adder further. This paper discusses the design and implementation

details for such lower complexity, fast parallel prefix adders based on Ling theory


of factorization. In particular, valency or radix, the number of inputs to a single

node, is explored as a design parameter. Several low and high valency adders are

implemented in 65 nm CMOS technology. Experimental results show that the

high-valency Ling adders have superior area×delay characteristics over previously

reported Ling-based or non-Ling adders for the same input size. Moreover, our 20-

bit high valency adder has a better area×delay measurement than the previously-

published 16-bit adders.

4. Design of 64-bit low power parallel prefix VLSI adder for high speed

arithmetic circuits

Abstract

The addition of two binary numbers is the basic and most often used arithmetic

operation on microprocessors, digital signal processors and data processing

application specific integrated circuits. Parallel prefix adder is a general technique

for speeding up binary addition. This method implements logic functions which

determine whether groups of bits will generate or propagate a carry. The proposed

64-bit adder is designed using four different types of prefix cell operators: even-dot

cells, odd-dot cells, even-semi-dot cells and odd-semi-dot cells. It offers robust

adder solutions typically used for low power and high-performance design

application needs. Parallel prefix adders with various input ranges are compared

in terms of power, number of transistors, and number of nodes.

The Tanner EDA tool was used to simulate the parallel prefix adder designs in

250 nm technology.
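
The generate/propagate idea behind parallel prefix addition can be sketched in Python. The sketch uses a Kogge-Stone style prefix network as a stand-in, not the even-dot and odd-dot cell arrangement of the proposed adder; the function name `prefix_add` and its `width` parameter are assumptions made for illustration.

```python
def prefix_add(a, b, width=64):
    """Add two unsigned integers with a Kogge-Stone style prefix network.

    Returns (sum modulo 2**width, carry_out).
    """
    # Bit-level signals: generate (both bits set) and propagate (bits differ)
    generate = [(a >> i) & (b >> i) & 1 for i in range(width)]
    propagate = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    group_g, group_p = generate[:], propagate[:]
    distance = 1
    while distance < width:
        next_g, next_p = group_g[:], group_p[:]
        for i in range(distance, width):
            # Prefix operator: merge the group ending at i with the
            # group ending `distance` positions lower.
            next_g[i] = group_g[i] | (group_p[i] & group_g[i - distance])
            next_p[i] = group_p[i] & group_p[i - distance]
        group_g, group_p = next_g, next_p
        distance *= 2

    # The carry into bit i is the group generate of bits 0..i-1.
    total = 0
    for i in range(width):
        carry_in = group_g[i - 1] if i > 0 else 0
        total |= (propagate[i] ^ carry_in) << i
    return total, group_g[width - 1]


result, carry_out = prefix_add(0xFFFF_FFFF_FFFF_FFFF, 1)
```

Each pass of the `while` loop corresponds to one level of the prefix tree, so a 64-bit addition needs six levels rather than a 64-stage ripple chain.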


5. Low-power dissipation using FPGA architecture

Abstract

Power optimization is the process of generating the best design in digital VLSI

circuits without violating design specifications. In this paper, the existing FPGA

routing switch is compared with the proposed low-power FPGA routing circuitry.

The experimental results show that the power dissipation in the proposed technique

is lower than in the existing FPGA design.

4G Technology

1. Design and implementation of an optical OFDM baseband receiver

in FPGA

Abstract

In this paper, a baseband receiver design and its FPGA implementation for an

OOFDM system aimed at the NG-PON (passive optical network) applications are

presented. A low cost IMDD (intensity modulation, direct detection) architecture is

adopted and baseband DSP measures are employed to compensate various optical

impairments. Targeting a 4GSps throughput rate, an 8-way parallel architecture is

developed to perform the synchronization, FFT and equalization each with massive

parallelism. A real valued FFT module taking advantage of the Hermitian spectrum

is also developed to reduce the circuit complexity significantly. The simulation


results show the proposed baseband receiver is capable of achieving an 8Gbps

(effective) transmission bandwidth for 64-QAM coded OFDM symbols over a

25km long single mode fiber network. The uncoded BER reaches 10^-3 when the

received optical power is -16dBm. Due to the speed and resource limitation, the

FPGA implementation obtains a fully functional but speed degraded system. The

maximum working frequency is 250 MHz, which is one half of the 500MHz

required for real time processing. The design occupies 21,423 logic slices and 56

embedded multiplier modules.
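
The Hermitian symmetry that a real-valued FFT exploits can be illustrated with a short Python sketch. It uses a naive DFT, not an FFT and not the paper's 8-way parallel module, and the function names are hypothetical. For a real input of even length N, only bins 0 to N/2 need to be computed, because X[N-k] equals the complex conjugate of X[k].

```python
import cmath


def dft_bin(samples, k):
    """Compute one DFT bin directly from the definition."""
    n = len(samples)
    return sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
               for t in range(n))


def real_dft_half(samples):
    """Compute only bins 0..N/2 of the DFT of a real sequence."""
    n = len(samples)
    return [dft_bin(samples, k) for k in range(n // 2 + 1)]


def expand_hermitian(half, n):
    """Rebuild all N bins (N even) using X[N-k] = conj(X[k])."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.5, 4.0]
half_bins = real_dft_half(signal)
all_bins = expand_hermitian(half_bins, len(signal))
```

Computing roughly half of the bins is the source of the circuit-complexity reduction mentioned in the abstract.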

2. Performance analysis of Multiple carrier code division multiple

access system

Abstract

To achieve high data rates, Multi-Carrier Code Division Multiple Access (MC-CDMA)

is one suitable choice for next-generation wireless communication systems.

MC-CDMA combines the CDMA and OFDM schemes, thereby

gaining the advantages of both. Capacity planning is one of the major

issues in the design of wireless communication systems, and

it depends greatly on the bit error rate (BER). This study

investigates the BER performance of an MC-CDMA system over a Rayleigh fading

channel for different spreading code lengths. The Walsh-Hadamard (W-H) code has

been chosen for spreading, which reduces the multiple access interference (MAI)

in the downlink due to its orthogonal property. Simulation results show the


improvement in BER performance with increasing length of the spreading

code. In addition, a comparative study of BER performance over different modulation

techniques shows that the minimum BER is obtained with the BPSK modulation

technique.
