CSLA Implementation Technique to Minimize the Area, Power and Delay

A Project Report

submitted in partial fulfillment of the requirements for the award of the degree of

MASTER OF TECHNOLOGY

in VLSI & EMBEDDED SYSTEMS

by

G.BHAGYA SRI (13MK1D6805)

under the esteemed guidance of

Prof. P.BALA MURALI KRISHNA

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

SRI MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN (Approved by AICTE, New Delhi & Affiliated to JNTU, Kakinada)

NH-5, TUMMALAPALEM, GUNTUR-522233, A.P.

2013-2015

SRI MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN

(Approved by AICTE, New Delhi & Affiliated to JNTU, Kakinada)

NH5, TUMMALAPALEM, GUNTUR-522233, A.P.

Department of Electronics and Communication Engineering

CERTIFICATE

This is to certify that the project report entitled CSLA IMPLEMENTATION TECHNIQUE TO MINIMIZE THE AREA, POWER AND DELAY is being submitted by GUTTIKONDA BHAGYA SRI (13MK1D6805) of SRI MITTAPALLI INSTITUTE OF TECHNOLOGY FOR WOMEN, GUNTUR, in partial fulfillment of the requirements for the award of the degree of Master of Technology in VLSI & EMBEDDED SYSTEMS to Jawaharlal Nehru Technological University, Kakinada, during the year 2013-2015.

PROJECT GUIDE
Prof. P. Bala Murali Krishna
Professor
Department of ECE
SMITW

HEAD OF THE DEPARTMENT
G. Suseelamma
Associate Professor
Department of ECE
SMITW

EXTERNAL EXAMINER

ACKNOWLEDGEMENT

I express my sincere thanks to Sri M.V. Koteswara Rao, Chairman, and Sri M.B.V. Satyanarayana, Secretary and Correspondent of Sri Mittapalli Institute of Technology for Women, Guntur, for providing the facilities to carry out this project.

It is an honor to express my deep sense of gratitude to our principal and my project guide, Prof. P. Bala Murali Krishna, Department of ECE, Sri Mittapalli Institute of Technology for Women, Guntur, for his valuable guidance, constant encouragement, and for every scientific and personal concern throughout the course of the investigation and the successful completion of this work.

I wish to extend my sincere thanks to G. Suseelamma, Head of the Department of ECE, Sri Mittapalli Institute of Technology for Women, Guntur, for her constant support and encouragement, and for enabling me to do a work of this magnitude.

My sincere thanks go to the teaching and non-teaching staff members of the ECE Department of Sri Mittapalli Institute of Technology for Women, Guntur.

Lastly, I bow to my affectionate parents for their love and blessings, which have sustained me in completing this project work successfully.

BY

G. BHAGYA SRI (13MK1D6805)

CONTENTS

TITLE Page No

ABSTRACT I

LIST OF FIGURES II & III

LIST OF TABLES IV

CHAPTER 1: INTRODUCTION 1

1.1 Introduction to VLSI 1

1.2 Objective 6

1.3 Existing system 6

1.3.1 Existing System Disadvantages 6

1.4 Proposed system 7

1.5 Project Outline 7

CHAPTER 2: LITERATURE REVIEW 8

2.1 CMOS Technology 12

2.1.1 CMOS Transmission Gate 15

2.1.2 Fabrication Technology 15

2.2 FPGA Design Flow 16

2.3 FPGA Performance 17

2.4 Basic FPGA Architecture 19

2.5 FPGA Design and Programming 20

2.6 VHDL & Verilog 21

CHAPTER 3: DESIGN APPROACH 23

3.1 Overview of Carry Select Adder 24

3.2 Operation 27

3.3 Why we replaced Regular CSLA with Modified CSLA? 29

3.4 Logic Formulation 31

3.4.1 Logic Expressions of the SCG Unit of the Conventional CSLA 32

3.4.2 Logic Expression of the SCG Unit of the BEC-Based CSLA 33

3.5 Proposed Adder Design 34

3.5.1 Ripple Carry Adder (RCA) 36

3.5.2 Carry Select Adders (CSLA) 37

3.5.3 Carry Look Ahead Adders (CLA) 37

3.5.4 Binary to Excess-1 Converter 38

3.5.5 Multiplexer 39

3.6 Analysis of Adders 44

3.7 Square Root CSLA (SQRT-CSLA) 46

CHAPTER 4: RESULTS ANALYSIS 48

4.1 Performance Evaluation 48

4.1.1 Ripple Carry Adder (8-bit) 48

4.1.2 CSA (8-bit) 49

4.1.3 Proposed CSA (8-bit) 50


4.2.2 CSA (16-bit) 53



4.3.2 CSA (32-bit) 57


4.2 Performance Comparison 61

4.3 Synthesis Report 61

4.4 Applications 63

4.5 Advantages 64

CHAPTER 5: CONCLUSION AND FUTURE SCOPE 65

5.1 Conclusion 65

5.2 Future Scope 65

REFERENCES 66 & 67

ABSTRACT

With the advancements in semiconductor technology, there has been an increased emphasis on low-power design techniques over the last few decades. Reversible computing has been proposed by several researchers as a possible alternative to address the energy dissipation problem. This project describes the design of the Mach Zehnder Interferometer and reviews its applications in emerging optical communication networks. The Mach Zehnder Interferometer is used to measure the relative phase shift between two collimated beams from a coherent light source. Using this basic principle, a number of devices have been designed; a few of these, such as optical sensors, all-optical switches, the optical add-drop multiplexer and an implementation of the sum function, are discussed in this project.


LIST OF FIGURES

NAME OF THE FIGURE Page No

Fig. 2.1 MOS TRANSISTOR 14

Fig. 3.1 Block diagram of regular CSLA 29

Fig. 3.2 Block diagram of modified CSLA 30

Fig. 3.3 The 5-bit Binary to Excess-1 Code Converter: (a) BEC (without carry) (b) BECWC (with carry) 31

Fig. 3.4 (a) Conventional CSLA; n is the input operand bit-width. (b) The logic operations of the RCA shown in split form 32

Fig. 3.5 Structure of the BEC-based CSLA; n is the input operand bit-width 34

Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width (b) Gate-level design of the HSG (c) Gate-level optimized design of (CG0) for input-carry = 0 (d) Gate-level optimized design of (CG1) for input-carry = 1 (e) Gate-level design of the CS unit (f) Gate-level design of the final sum generation (FSG) 34

Fig. 3.7 A 4-bit Ripple Carry Adder 36

Fig. 3.8 A Carry Select Adder with 1 level using n/2- bit RCA 37

Fig. 3.9 34-BIT CLA Logic equations 37

Fig. 3.10 Proposed SQRT-CSLA for n = 16. All intermediate and output signals are labeled with delay 46

Fig. 4.1 (a) Simulation Waveform Result of 8-bit Ripple Carry Adder 48

Fig. 4.1 (b) RTL Diagram of 8 bit Ripple Carry Adder 49

Fig. 4.2 (a) Simulation Waveform Result of 8-bit CSA 49

Fig. 4.2 (b) RTL diagram of 8 bit CSA 50


Fig. 4.3 (a) Simulation Waveform Result of 8-bit Proposed CSA 50

Fig. 4.3 (b) Design Summary of 8-bit Proposed CSA 51

Fig. 4.3 (c) RTL diagram of 8 bit proposed CSA 52


Fig. 4.4 (b) RTL diagram of 16 bit Ripple Carry Adder 53


Fig. 4.5 (b) RTL diagram of 16 bit CSA 54



Fig. 4.6 (c) RTL diagram of 16 bit Proposed CSA 56


Fig. 4.7 (b) RTL diagram of 32-bit Ripple Carry Adder 57


Fig. 4.8 (b) RTL diagram of 32-bit CSA 58



Fig. 4.9 (c) RTL diagram of 32-bit Proposed CSA 60


LIST OF TABLES

NAME OF THE TABLE Page No

Table 3.1 Truth table 31

Table 3.2 Functional table of the 4-bit BEC 39

Table 3.3 Categorization of adders w.r.t. delay time and capacity 41

Table 3.4 Theoretical Comparison of Area Occupied 45

Table 3.5 Theoretical Comparison of Time Required 45

Table 3.6 Theoretical Area Delay Product (AxT) 45

Table 3.7 Comparison of Time Required (Simulated value) 46

Table 4.1 Device Utilization summary of 8-bit Ripple Carry Adder 48

Table 4.2 Device Utilization summary of 8-bit CSA 49

Table 4.3 Synthesis Report of 8-bit Proposed CSA 51

Table 4.4 Device Utilization summary of 8-bit Proposed CSA 51









Table 4.13 Theoretical Estimation 62

Table 4.14 Comparison of post-layout synthesis results 62

Table 4.15 Design Summary 62

Table 4.16 Comparison of the Regular and Modified SQRT CSLA 63


CHAPTER 1

INTRODUCTION

This chapter introduces VLSI and presents the objective, the existing system, the proposed system, and the project outline.

1.1 Introduction to VLSI

VLSI Design presents state-of-the-art papers in VLSI design, computer aided design, design analysis, design implementation, simulation and testing. Its scope also includes papers that address technical trends, pressing issues, and educational aspects in VLSI Design. The Journal provides a dynamic high quality international forum for original papers and tutorials by academic, industrial, and other scholarly contributors in VLSI Design.

The development of microelectronics spans a time shorter than the average life expectancy of a human, and yet it has seen as many as four generations. The early 1960s saw the low-density fabrication processes classified under Small Scale Integration (SSI), in which the transistor count was limited to about 10. This rapidly gave way to Medium Scale Integration (MSI) in the late 1960s, when around 100 transistors could be placed on a single chip. It was the time when the cost of research began to decline and private firms started entering the competition, in contrast to the earlier years when the main burden was borne by the military. Transistor-Transistor Logic (TTL), offering higher integration densities, outlasted other IC families like ECL and became the basis of the first integrated circuit revolution. It was the production of this family that gave impetus to semiconductor giants like Texas Instruments, Fairchild and National Semiconductors.

The early 1970s marked the growth of the transistor count to about 1000 per chip, called Large Scale Integration (LSI). By the mid-1980s, the transistor count on a single chip had grown well beyond 1000, and hence came the age of Very Large Scale Integration, or VLSI. Though many improvements have been made and the transistor count is still rising, further names of generations like ULSI are generally avoided. It was during this time that TTL lost the battle to the MOS family, owing to the same problems that had pushed vacuum tubes into obsolescence: power dissipation and the limit it imposed on the number of gates that could be placed on a single die. The second age of the integrated circuit revolution started with the introduction of the first microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974. Today many companies like Texas Instruments, Infineon, Alliance Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel, Philips, Motorola and many other firms have been established and are dedicated to the various fields in "VLSI" like Programmable Logic Devices, Hardware Description Languages, design tools, embedded systems, etc.

The term VLSI is a 1980s holdover from an outdated taxonomy of integration levels, evidently influenced by the naming of frequency bands (HF, VHF and UHF). Sources disagree on what is measured (gates or transistors):

SSI Small-Scale Integration (0 - 10^2)

MSI Medium-Scale Integration (10^2 - 10^3)

LSI Large-Scale Integration (10^3 - 10^5)

VLSI Very Large-Scale Integration (10^5 - 10^7)

ULSI Ultra Large-Scale Integration (>= 10^7)

VLSI Technology Inc. was a company which designed and manufactured custom and semi-custom ICs. The company was based in Silicon Valley, with headquarters at 1109 McKay Drive in San Jose, California. Along with LSI Logic, VLSI Technology defined the leading edge of the application-specific integrated circuit (ASIC) business, which accelerated the push of powerful embedded systems into affordable products. The company was founded in 1979 by a trio from Fairchild Semiconductor by way of Synertek - Jack Balletto, Dan Floyd, Gunnar Wetlesen - and by Doug Fairbairn of Xerox PARC and Lambda (later VLSI Design) magazine. Alfred J. Stein became the CEO of the company in 1982. Subsequently VLSI built its first fab in San Jose; eventually a second fab was built in San Antonio, Texas.

VLSI had its initial public offering in 1983, and was listed on the stock market as (NASDAQ: VLSI). The company was later acquired by Philips and survives to this day as part of NXP Semiconductors.

The first semiconductor chips held two transistors each. Subsequent advances added more and more transistors, and, as a consequence, more individual functions or systems were integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a single device. This stage is now known retrospectively as small-scale integration (SSI). Improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past this mark, and today's microprocessors have many millions of gates and billions of individual transistors. At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI, and terms like ultra-large-scale integration (ULSI) were used. But the huge number of gates and transistors available on common devices has rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.

As of early 2008, billion-transistor processors are commercially available. This is expected to become more commonplace as semiconductor fabrication moves from the current generation of 65 nm processes to the next 45 nm generation (while experiencing new challenges such as increased variation across process corners). A notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache. Current designs, as opposed to the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high-performance logic blocks like the SRAM (Static Random Access Memory) cell, however, are still designed by hand to ensure the highest efficiency (sometimes by bending or breaking established design rules to obtain the last bit of performance by trading stability).

What is VLSI?

VLSI stands for "Very Large Scale Integration". This is the field which involves packing more and more logic devices into smaller and smaller areas.

Simply put, an integrated circuit is many transistors on one chip. VLSI is the design and manufacturing of extremely small, complex circuitry using modified semiconductor material. An integrated circuit (IC) may contain millions of transistors, each a few µm in size.


Why VLSI?

Integration improves the design: lower parasitics mean higher speed and lower power consumption, and the result is physically smaller. Integration also reduces manufacturing cost, since (almost) no manual assembly is needed. The course will cover basic theory and techniques of digital VLSI design in CMOS technology. Topics include: CMOS devices and circuits, fabrication processes, static and dynamic logic structures, chip layout, simulation and testing, low power techniques, design tools and methodologies, and VLSI architecture.

We use full custom techniques to design basic cells and regular structures such as data path and memory. There is an emphasis on modern design issues in interconnect and clocking. We will also use several case studies to explore recent real-world VLSI designs (e.g. Pentium, Alpha, PowerPC, StrongARM, etc.) and papers from the recent research literature. On-campus students will design small test circuits using various CAD tools. Circuits will be verified and analyzed for performance with various simulators. Some final project designs will be fabricated and returned to students the following semester for testing.

Very-large-scale-integration (VLSI) is the process of creating integrated circuits by combining thousands of transistor based circuits into a single chip. VLSI began in the 1970s when complex semiconductor and communication technologies were being developed. The microprocessor is a VLSI device. The term is no longer as common as it once was, as chips have increased in complexity into the hundreds of millions of transistors.

As noted above, the first semiconductor chips held two transistors each, and the first integrated circuits held only a few devices, making it possible to fabricate one or more logic gates on a single device (small-scale integration, SSI). Improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI), and then to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past this mark, and today's microprocessors have many millions of gates and hundreds of millions of individual transistors.

Applications of VLSI

I. Electronic systems in cars.

II. Digital electronics that control VCRs.

III. Transaction processing systems, ATMs.

IV. Personal computers and workstations.

V. Medical electronic systems.

I. Electronic systems now perform a wide variety of tasks in daily life. Electronic systems in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other means; electronics are usually smaller, more flexible, and easier to service. In other cases, electronic systems have created totally new applications. Electronic systems perform a variety of tasks, some of them visible, some more hidden: personal entertainment systems such as portable MP3 players and DVD players perform sophisticated algorithms with remarkably little energy. Electronic systems in cars operate stereo systems and displays; they also control fuel injection systems, adjust suspensions to varying terrain, and perform the control functions required for anti-lock braking systems.

II. Digital electronics compress and decompress video, even at high-definition data rates, on the fly in consumer electronics. Low-cost terminals for Web browsing still require sophisticated electronics, despite their dedicated function.

III. Personal computers and workstations provide word processing, financial analysis, and games. Computers include both central processing units and special-purpose hardware for disk access, faster screen display, etc.

IV. Medical electronic systems measure bodily functions and perform complex processing algorithms to warn about unusual conditions. The availability of these complex systems, far from overwhelming consumers, only creates demand for even more complex systems.

The growing sophistication of applications continually pushes the design and manufacturing of integrated circuits and electronic systems to new levels of complexity. Perhaps the most amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the increasing demands of customers continue to test the limits of design and manufacturing.

1.2 Objective

The main objective of this study is to identify redundant logic operations and data dependencies so as to provide a parallel path for carry propagation, which helps to reduce the overall adder delay. The CSLA has two units: 1) the sum and carry generator unit (SCG) and 2) the sum and carry selection unit. Accordingly, we remove all redundant logic operations and sequence the remaining logic operations based on their data dependency.

1.3 Existing System

In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially, only after the previous bit position has been summed and a carry propagated into the next position. In earlier designs, the carry look-ahead adder was used to overcome this delay: it produces all the carries at the same time, but it requires more circuitry. These adders were later replaced by carry select adders using dual RCAs.

1.3.1 Existing System Disadvantages

The Ripple Carry Adder (RCA) provides the most compact design but takes the longest computing time. For an N-bit RCA, the delay is linearly proportional to N.

Thus for large values of N the RCA gives the highest delay of all adders. The Carry Look Ahead Adder (CLA) gives fast results but consumes a large area. For a higher number of bits, the CLA gives a higher delay than other adders due to the large fan-in and the large number of logic gates. The Carry Select Adder (CSA) provides a compromise between the small-area but longer-delay RCA and the large-area but shorter-delay CLA. In the rapidly growing mobile industry, faster units are not the only concern; smaller area and lower power have also become major concerns in the design of digital circuits.
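The linear growth of the RCA delay with N can be illustrated with a small behavioural model. The following Python sketch is only an illustration and not part of the implemented design; the function names `full_adder` and `ripple_carry_add` are hypothetical, and the returned stage count is a simple stand-in for the carry path length rather than a measured delay.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n, cin=0):
    """Add two n-bit integers bit by bit, LSB first.

    Returns (sum, carry_out, stages). `stages` counts the full adders the
    carry passes through, which is n for an n-bit RCA.
    """
    carry = cin
    total = 0
    stages = 0
    for i in range(n):
        a = (x >> i) & 1
        b = (y >> i) & 1
        # Bit i cannot be resolved before the carry from bit i-1 is known.
        s, carry = full_adder(a, b, carry)
        total |= s << i
        stages += 1
    return total, carry, stages
```

Doubling the operand width doubles the number of full adders on the carry path, which is the behaviour the carry select organization tries to avoid.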

1.4 Proposed System

In this technique, a single ripple carry adder is used instead of dual ripple carry adders to reduce the area, power and delay. A carry-select adder can be implemented by using a single ripple carry adder and an add-one circuit instead of dual ripple-carry adders. This work proposes a new add-one circuit using a first-zero finding circuit and multiplexers to reduce the area and power with no speed penalty. For bit length n = 64, this new carry-select adder requires 38 percent fewer transistors than the dual ripple-carry carry-select adder and 29 percent fewer transistors than Chang's carry-select adder using a single ripple carry adder.
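The idea behind an add-one circuit built from a first-zero finder and multiplexers can be sketched behaviourally: adding one to a binary number inverts every bit up to and including the first 0 seen from the LSB and leaves the higher bits unchanged. The Python model below is a minimal sketch under that assumption; the name `add_one_first_zero` is hypothetical and the code does not reproduce the transistor-level circuit.

```python
def add_one_first_zero(s, n):
    """Return (s + 1) as an (n-bit value, carry_out) pair without an adder.

    Scanning from the LSB, every bit up to and including the first 0 is
    inverted; bits above the first 0 pass through unchanged.
    """
    result = 0
    invert = 1  # stays 1 until the first zero has been passed
    for i in range(n):
        bit = (s >> i) & 1
        out = bit ^ 1 if invert else bit  # per-bit 2:1 multiplexer
        result |= out << i
        invert &= bit  # cleared once a 0 bit has been seen
    carry_out = invert  # still 1 only if all n bits were 1
    return result, carry_out
```

In a carry-select block, the input `s` would be the sum produced by the single ripple carry adder for carry-in 0, and the output is the candidate sum for carry-in 1.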

1.5 Project Outline

The project is organized into five chapters, namely Introduction, Literature Review, Design Approach, Results Analysis, and Conclusion and Future Scope. Chapter 2 contains the literature review together with the background on CMOS technology and FPGAs. Chapter 3 describes the logic formulation, the proposed adder design and the analysis of adders. Chapter 4 explains the simulation and synthesis results. Chapter 5 presents the conclusion of the proposed work and the scope for enhancing the project in the future.


CHAPTER 2

LITERATURE REVIEW

Adders are of fundamental importance in a wide variety of digital systems. Several types of fast adders exist, but fast addition with low area and power is still challenging. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder, so the CSLA is used in many computational systems to alleviate the problem of carry propagation delay. Many papers have been published on this topic, with several examples of such adders and many efficient implementations.

A number of modifications have been suggested by researchers to improve the performance of the carry select adder. Reference [1] proposes a logic formulation for the CSLA by removing all the redundant logic operations from the conventional CSLA design. In this design the carry select (CS) operation is scheduled before the calculation of the final sum. Reference [3] presents various architectures of the CSLA and analyzes them for speed and area. A power- and area-efficient gate-level modified design is implemented in [15, 4, 8] by minimizing the logic operations in comparison with the conventional CSLA design. An analysis of the 16-bit conventional CSLA and the Binary to Excess-1 Converter (BEC) CSLA is presented in [7], where a D-latch based CSLA architecture is also proposed. An area-delay optimized architecture of 16-bit, 32-bit and 64-bit CSLA adders is proposed and analyzed in [5, 6]. Reference [16] presents the simulation and performance evaluation of a 16-bit modified architecture of the Square-Root CSLA (SQRT-CSLA). Area-delay-power based simulation of a redundant-logic-optimized modified design of the CSLA with respect to the conventional CSLA design is shown in [9, 10, 11, 12]. A modified design for 16-bit, 32-bit and 64-bit CSLA that does not use a multiplexer architecture is proposed in [19]; that paper also shows a comparative analysis of the proposed architecture with the conventional architecture. A logic converter unit (LCU) based modified adder architecture is proposed in [20] for optimized area-delay-power parameters. The modified architectures find applications in high-performance VLSI system architectures in the development of modern electronic devices and gadgets. An efficient adder architecture essentially improves the overall performance of complex systems. The different sections of the proposed work are arranged as follows: Section II presents the architecture of the 64-bit CSLA and the design of its building blocks using gate-level logic. Section III presents the simulation and synthesis results, together with a comparative analysis of the design for dynamic power consumption on different FPGAs. Section IV presents the conclusion based on the present design simulation analysis. Finally, the paper closes with the acknowledgement and the references.

In 1962, O.J. Bedrij [1] described an extremely fast digital adder with sum selection and multiple-radix carry. He compared the amount of hardware and the logical delay for a 100-bit ripple-carry adder and a carry-select adder. The problem of carry-propagation delay was overcome by independently generating multiple-radix carries and using these carries to select between simultaneously generated sums. In this adder system, the addend and augend were divided into subaddend and subaugend sections that were added twice to produce two subsums. One addition was done with a carry digit forced into each section, and the other addition combined the operands without the forced carry digit. The selection of the correct subsum from each of the adder sections depended upon whether or not there actually was a carry into that adder section.
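The sum-selection scheme described above can be summarized in a short behavioural model. This Python sketch is illustrative only: the names `add_block` and `carry_select_add` and the fixed section width of 4 bits are assumptions, not details taken from [1]. Each section is added twice, once with and once without a forced carry, and the actual incoming carry only selects between the two results.

```python
def add_block(a, b, width, cin):
    """Add two width-bit sections; returns (sub_sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, n, block=4):
    """n-bit carry-select addition using sections of `block` bits."""
    result, carry = 0, 0
    for lo in range(0, n, block):
        width = min(block, n - lo)
        mask = (1 << width) - 1
        a = (x >> lo) & mask
        b = (y >> lo) & mask
        # Both candidates are computed without waiting for the real carry.
        sum0, c0 = add_block(a, b, width, 0)
        sum1, c1 = add_block(a, b, width, 1)
        # The incoming carry only drives the selection (multiplexer).
        sub_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result |= sub_sum << lo
    return result, carry
```

In hardware the two additions of every section run in parallel, so the critical path is one section adder plus a chain of selections, at the cost of duplicated adder hardware.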

Bedrij (1962) proposes [1] that the problem of carry propagation delay be overcome by independently generating multiple-radix carries and using these carries to select between simultaneously generated sums. Ramkumar et al. (2010) proposed a BEC method to reduce the maximum delay of carry propagation in the final stage of the carry save adder [2]. Ramkumar and Harish (2011) [7] propose the BEC technique, which is a simple and efficient gate-level modification to significantly reduce the area and power of the square root CSLA.

There are many carry select adder approaches available, but most of them use ripple carry adders. T.Y. Chang and M.J. Hsiao [3] suggested that, instead of using dual ripple carry adders, a carry select adder scheme using an add-one circuit to replace one ripple carry adder requires 29.2% fewer transistors with a speed penalty of 5.9% for bit length n = 64. If speed is important for this 64-bit adder, then two of the carry-select adder blocks can be substituted by the proposed scheme with a 6.3% area saving and the same speed.


Youngjoon Kim and Lee-Sup Kim [4] suggested that a carry-select adder could be implemented by using a single ripple carry adder and an add-one circuit instead of dual ripple-carry adders. They proposed a new add-one circuit using a first-zero finding circuit and multiplexers to reduce the area and power with no speed penalty. For n = 64 bits, this new carry-select adder requires 38% fewer transistors than the dual ripple-carry carry-select adder and 29% fewer transistors than Chang's carry-select adder using a single ripple carry adder. This new 64-bit adder, using a 0.25 µm CMOS technology, had a 3.45 ns delay time at a 2.5 V power supply. Behnam Amelifard et al. [6] suggested a new adder called the carry select adder with sharing (CSAS), which was area efficient but had a larger delay. M. Alioto et al. [5] suggested using variable block sizing depending on the multiplexer delay.

B. Ramkumar, H.M. Kittur, and P.M. Kannan [7] suggested a very simple approach to improve the speed of addition. Based on this approach, 16-, 32- and 64-bit adder architectures were developed and compared with conventional fast adder architectures. In many parallel multipliers, a CLA arranged in the form of a Carry Select Adder (CSLA) was used to speed up the final addition. But due to its structure the CSLA occupied more chip area, because it uses multiple pairs of RCAs to generate the partial sum and carry by considering Cin = 0 and Cin = 1. Thus the complexity of the final adder structure was high. So they replaced the RCA (CLA) with Cin = 1 by BEC logic, which reduced the area and delay in the final adder structure.
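The replacement of the Cin = 1 adder by BEC logic can be illustrated with the commonly used 4-bit BEC equations (one NOT gate plus XOR and AND gates). The Python sketch below is a behavioural illustration under that assumption; the function names `bec4` and `bec_csla_block` and the 3-bit section width are hypothetical choices made for the example.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter built from NOT, XOR and AND."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                # X0 = NOT B0
    x1 = b0 ^ b1               # X1 = B0 XOR B1
    x2 = b2 ^ (b0 & b1)        # X2 = B2 XOR (B0 AND B1)
    x3 = b3 ^ (b0 & b1 & b2)   # X3 = B3 XOR (B0 AND B1 AND B2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_csla_block(a, b, cin):
    """3-bit CSLA section: one adder for Cin = 0, a 4-bit BEC for Cin = 1."""
    with_cin0 = (a & 7) + (b & 7)   # 3 sum bits plus carry-out (4 bits)
    with_cin1 = bec4(with_cin0)      # same result incremented by one
    selected = with_cin1 if cin else with_cin0  # multiplexer
    return selected & 7, selected >> 3  # (sum, carry_out)
```

An n-bit section therefore needs an (n+1)-bit BEC, because the carry-out of the Cin = 0 adder is converted together with the sum bits.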

In 2013, L. Mugilvannan and S. Ramasamy [5] observed that the Carry Select Adder (CSLA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. From the structure of the CSLA, it is clear that there is scope for reducing its area and power consumption. Their work uses a simple and efficient transistor-level modification in the BEC-1 converter to significantly reduce the area and power of the CSLA. Based on this modification, a 16-bit square root CSLA (SQRT CSLA) architecture was developed and compared with the SQRT CSLA architecture using the ordinary BEC-1 converter. The proposed design has reduced area and power as compared with the SQRT CSLA using the ordinary BEC-1 converter, with only a slight increase in the delay. The work evaluates the performance of the proposed designs in terms of delay, area, and power by hand with logical effort and through Cadence Virtuoso. The results analysis shows that the proposed CSLA structure is better than the SQRT CSLA with the ordinary BEC-1 converter.

In 2013, V. Vijayalakshmi, R. Seshadri and S. Ramakrishnan [6] studied the VLSI design of a carry look-ahead adder (CLAA) based 32-bit unsigned integer multiplier and the VLSI design of a carry select adder (CSLA) based 32-bit unsigned integer multiplier. Both multiplier designs multiply two 32-bit unsigned integer values and give a 64-bit product. The CLAA-based multiplier takes a delay of 99 ns to perform the multiplication, and the CSLA-based multiplier takes nearly the same delay. However, the area needed for the CLAA multiplier is reduced to 31% by the CSLA-based multiplier. These multipliers are implemented using Altera Quartus II, and the timing diagrams are viewed through AvanWaves.

In 2014, S. Pandey, A.A. Khan and R. Sarma [1] compared the design of an 8T adder based Carry Select Adder (CSA) and a 10T adder based CSA. Using both adder designs, a 4-bit CSA architecture was developed and compared with the 28T adder based 4-bit CSA. The 10T CSA design has reduced delay, power and area as compared with the 28T CSA, with a slight tradeoff in area as compared to the 8T CSA. The analysis shows that the 10T CSA is better than both the 8T adder based CSA and the 28T CSA. The work evaluates the performance of the 10T CSA design in terms of power, delay and area using a 180 nm CMOS process technology, the Cadence Virtuoso tool and the Spectre simulator.

In 2014, D. Paradhasaradhi, M. Prashanthi and N. Vivek [2] described that the Carry Select Adder (CSLA) provides a good compromise between cost and performance in carry propagation adder design. A Square Root Carry Select Adder using RCAs is introduced, but it incurs some speed penalty. However, the conventional CSLA is still area-consuming due to the dual ripple carry adder structure. In the proposed work, the partial products of the Wallace multiplier are reduced as soon as possible, and a carry select adder is used in the final carry propagation path. The modification is done at gate level to reduce area and power consumption. The Modified Square Root Carry Select Adder (MCSLA) is designed using Common Boolean Logic and then compared with the respective regular CSLA architectures, and this MCSLA is implemented in a Wallace tree multiplier. This work gives a reduced area compared to the normal Wallace tree multiplier. Finally, an area-efficient Wallace tree multiplier is designed using the Common Boolean Logic based square root carry select adder.

In 2014, Naaz.S.A.A, Pradeep.M.N.N, Bhairannawar.S, and Halvi.S [4] studied arithmetic for communication and signal processing applications. Every such application demands high-throughput arithmetic operations. One of the key arithmetic operations is multiplication, which takes the most execution time.

The development of efficient multipliers has been a subject of interest for decades, so there is a need for an efficient multiplier that gives higher performance in real-time signal processing applications. The authors present a modular design of a Vedic multiplier using a carry select adder. The delay of the proposed multiplier is reduced by the high-speed carry select adder. The proposed multiplier is applied to a parallel FIR filter, and the combinational delay is reduced for the proposed multiplier compared to the existing architecture.

Ramkumar and Harish (2011) [4] proposed the BEC technique, a simple and efficient gate-level modification that significantly reduces the area and power of the square root CSLA. Veena Nair (2013) suggested a new approach in which a D-latch with an enable signal is used instead of the BEC [6]. Based on this approach, 16-, 32- and 64-bit adder architectures were developed and compared with conventional fast adder architectures. The new structure reduces the delay of the adder.

2.1 CMOS Technology

In the present decade, the chips being designed are made using CMOS technology. CMOS stands for Complementary Metal Oxide Semiconductor. It consists of both NMOS and PMOS transistors. To understand CMOS better, we first need to know about the MOS transistor.

MOS Transistor

MOS stands for Metal Oxide Semiconductor. The MOS field effect transistor is the basic element in the design of a large scale integrated circuit. It is a voltage controlled device. These transistors are formed as a "sandwich" consisting of a semiconductor layer, usually a slice, or wafer, from a single crystal of silicon; a layer of silicon dioxide (the oxide); and a layer of metal. These layers are patterned in a manner which permits transistors to be formed in the semiconductor material (the "substrate"). The MOS transistor consists of three regions: source, drain and gate.

The source and drain regions are quite similar, and are labeled according to what they are connected to. The source is the terminal, or node, which acts as the source of charge carriers; charge carriers leave the source and travel to the drain. In the case of an N channel MOSFET (NMOS), the source is the more negative of the terminals; in the case of a P channel device (PMOS), it is the more positive of the terminals. The area under the gate oxide is called the "channel". Fig. 2.1 shows a MOS transistor.

The transistor normally needs a gate voltage for the channel to form. When no channel is formed, the transistor is said to be in the cut-off region. The voltage at which the transistor starts conducting (a channel begins to form between the source and the drain) is called the threshold voltage. Above the threshold, with a small drain-to-source voltage, the transistor is said to be in the linear region. The transistor is said to go into the saturation region when the channel pinches off near the drain and the drain current no longer increases appreciably with the drain-to-source voltage. CMOS technology is made up of both NMOS and PMOS transistors. Complementary Metal Oxide Semiconductor (CMOS) logic devices are the most common devices used today in the high density, high transistor count circuits found in everything from complex microprocessor integrated circuits to signal processing and communication circuits.

The CMOS structure is popular because of its inherently lower power requirements, high operating clock speed, and ease of implementation at the transistor level. The complementary p-channel and n-channel transistor networks are used to connect the output of the logic device to either the VDD or VSS power supply rail for a given input logic state. The MOSFET transistors can be treated as simple switches. The switch must be on (conducting) to allow current to flow between the source and drain terminals. In CMOS, each node has only one driver, but a gate can drive many other gates. In CMOS technology, the output always drives another CMOS gate input. The charge carriers in PMOS transistors are holes and the charge carriers in NMOS transistors are electrons. The mobility of electrons is about two times that of holes. Because of this, the output rise and fall times differ. To make them equal, the W/L ratio of the PMOS transistor is made about twice that of the NMOS transistor. This way, the PMOS and NMOS transistors have the same drive strength. In a standard cell library, the length L of a transistor is always constant. The width W is changed to give different drive strengths for each gate. The resistance is proportional to L/W. Therefore, increasing the width decreases the resistance.
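The sizing rule above can be sketched numerically. The following Python fragment is only an illustration under simplifying assumptions: on-resistance is taken as proportional to L/(mobility x W), electron mobility is taken as exactly twice hole mobility, and the function name `on_resistance` and all constants are hypothetical, not values from a real process.

```python
def on_resistance(length, width, mobility, k=1.0):
    """Relative on-resistance of a MOS switch: R is proportional to L / (mobility * W)."""
    return k * length / (mobility * width)

# Assumed relative mobilities (electrons taken as 2x holes, as in the text).
MU_N = 2.0
MU_P = 1.0
L = 1.0  # channel length, held constant across the cell library

r_nmos = on_resistance(L, width=1.0, mobility=MU_N)
r_pmos_same_w = on_resistance(L, width=1.0, mobility=MU_P)    # same width: weaker pull-up
r_pmos_double_w = on_resistance(L, width=2.0, mobility=MU_P)  # doubled width: matched

print("NMOS:", r_nmos)
print("PMOS, same W:", r_pmos_same_w)
print("PMOS, 2x W:", r_pmos_double_w)
```

With these assumed numbers, doubling the PMOS width brings its resistance down to that of the NMOS device, which is the reason given in the text for the 2:1 width ratio.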

Fig. 2.1 MOS TRANSISTOR

Power Dissipation in CMOS ICs

A large percentage of the power dissipation in CMOS ICs is due to the charging and discharging of capacitors. The main issue in most low-power CMOS IC designs is to reduce power dissipation. The main sources of power dissipation are:

a. Dynamic Switching Power: Due to the charging and discharging of circuit capacitances, a low-to-high output transition draws energy from the power supply. A high-to-low transition dissipates the energy stored in the CMOS transistor.

b. Short Circuit Current: It occurs when the rise/fall time at the input of the gate is larger than the output rise/fall time.

c. Leakage Current Power: It has two causes, the first of which is reverse-bias diode leakage on transistor drains. This happens in a CMOS design when one transistor is off and the active transistor charges up/down the drain using the bulk potential of the other transistor.
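For the dynamic switching component, the commonly used first-order estimate is P = a x C x VDD^2 x f, where a is the switching activity factor, C the switched capacitance, and f the clock frequency. The sketch below evaluates that estimate in Python; the function name and every numeric value are illustrative assumptions, not measurements from this project.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order dynamic switching power estimate: P = a * C * VDD^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Assumed example values: 10% activity, 1 pF switched capacitance, 100 MHz clock.
p_full = dynamic_power(0.1, 1e-12, 1.8, 100e6)      # VDD = 1.8 V
p_half_vdd = dynamic_power(0.1, 1e-12, 0.9, 100e6)  # VDD halved to 0.9 V

print("P at 1.8 V:", p_full, "W")
print("P at 0.9 V:", p_half_vdd, "W")
print("ratio:", p_full / p_half_vdd)
```

Because the supply voltage enters as a square, halving VDD in this model cuts the dynamic power to a quarter, while halving the capacitance or the frequency only halves it.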

2.1.1 CMOS Transmission Gate

A PMOS transistor is connected in parallel with an NMOS transistor to form a transmission gate. The transmission gate simply transmits the value at the input to the output. It consists of both NMOS and PMOS transistors because the PMOS transistor transmits a strong 1 and the NMOS transistor transmits a strong 0. The advantages of using a transmission gate are:

It shows better characteristics than a switch.

The resistance of the circuit is reduced, since the transistors are connected in parallel.

2.1.2 Fabrication Technology

Silicon of extremely high purity is chemically purified and then grown into large crystals. The crystals are sliced into wafers; wafer diameters are currently 150 mm, 200 mm, and 300 mm. After fabrication, the chips are packaged and binned according to their performance.

2.2 FPGA Design Flow

The designer facing a design problem must go through a series of steps between initial ideas and final hardware. This series of steps is commonly referred to as the design flow. First, after all the requirements have been spelled out, a proper digital design phase must be carried out. It should be stressed that the tools supplied by the different FPGA vendors to target their chips do not help the designer in this phase. They only enter the scene once the designer is ready to translate a given design into working hardware. The most common flow used nowadays in the design of FPGAs involves the following phases:

Design entry: This step consists of transforming the design ideas into some form of computerized representation. This is most commonly accomplished using Hardware Description Languages (HDLs). The two most popular HDLs are Verilog and the Very High Speed Integrated Circuit HDL (VHDL) [2]. It should be noted that an HDL, as its name implies, is only a tool to describe a design that pre-existed in the mind, notes, and sketches of a designer. It is not a tool to design electronic circuits. Another point to note is that HDLs differ from conventional software programming languages in the sense that they do not support the concept of sequential execution of statements in the code. This is easy to understand if one considers the alternative schematic representation of an HDL file: what one sees in the upper part of the schematic cannot be said to happen before or after what one sees in the lower part.

Synthesis: The synthesis tool receives the HDL and a choice of FPGA vendor and model. From these two pieces of information, it generates a netlist which uses the primitives proposed by the vendor in order to satisfy the logic behavior specified in the HDL files. Most synthesis tools go through additional steps such as logic optimization, register load balancing, and other techniques to enhance timing performance, so the resulting netlist can be regarded as a very efficient implementation of the HDL design.

Place and route: The placer takes the synthesized netlist and chooses a place for each of the primitives inside the chip. The router's task is then to interconnect all these primitives while satisfying the timing constraints. The most obvious constraint for a design is the frequency of the system clock, but there are more involved constraints one can impose on a design using the software packages supported by the vendors.

Bit stream generation: FPGAs are typically configured at power-up time from some sort of external permanent storage device, typically a flash memory. Once the place and route process is finished, the resulting choices for the configuration of each programmable element in the FPGA chip, be it logic or interconnect, must be stored in a file to program the flash.

Of these four phases, only the first one is human labor intensive. Somebody has to type in the HDL code, which can be tedious and error prone for complicated designs involving, for example, lots of digital signal processing. This is the reason for the appearance, in recent years, of alternative flows which include a preliminary phase in which the user can draw blocks at a higher level of abstraction and rely on the software tool for the generation of the HDL. Some of these tools also include the capability of simulating blocks which will become HDL together with other blocks which provide stimuli and processing to make the simulation output easier to interpret.

The concept of hardware co-simulation is also becoming widely used. In co-simulation, stimuli are sent to a running FPGA hosting the design to be tested, and the outputs of the design are sent back to a computer for display (typically through a Joint Test Action Group (JTAG) or Ethernet connection). The advantage of co-simulation is that one is testing the real system, therefore suppressing all possible misinterpretations present in a pure simulator. In other cases, co-simulation may be the only way to simulate a complex design in a reasonable amount of time.

The standard FPGA design flow starts with design entry using schematics or a hardware description language (HDL), such as Verilog HDL or VHDL. In this step, you create the digital circuit that is implemented inside the FPGA. The flow then proceeds through compilation, simulation, programming, and verification in the FPGA hardware. We first define the relevant terminology in the field and then describe the recent evolution of field-programmable devices (FPDs). The three main categories of FPDs are delineated: Simple PLDs (SPLDs), Complex PLDs (CPLDs) and Field-Programmable Gate Arrays (FPGAs).

2.3 FPGA Performance

While the headline performance increase offered by FPGAs is often very large (more than 100 times for some algorithms), it is important to consider a number of factors when assessing their usefulness for accelerating a particular application. Firstly, is it practical to implement the whole application on an FPGA? The answer to this is likely to be no, particularly for floating-point intensive applications, which tend to swallow up a large amount of logic. If it is either impractical or impossible to implement the whole application on an FPGA, the next best option is to implement those kernels within the application that are responsible for the majority of the run time, which may be determined by profiling. Next, the real speedup of the whole application must be estimated once the kernel has been implemented in an FPGA. Even if that kernel was originally responsible for 90% of the runtime, the total speedup that you can achieve for your application cannot exceed 10 times (even if you achieve a 1000 times speedup for the kernel). This is an example of Amdahl's law, that long-time irritant of the HPC software engineer. Once such an estimate has been made, one must decide if the potential gain is worthwhile given the complexity of instantiating the algorithm on an FPGA.
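The 90% example above follows directly from Amdahl's law, speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that is accelerated and s is the speedup of that fraction. A minimal Python sketch is given below; the function name `overall_speedup` is a hypothetical helper introduced only for this illustration.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only a fraction is accelerated."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)

# Kernel responsible for 90% of the run time, accelerated 1000 times on the FPGA.
s = overall_speedup(0.90, 1000.0)
print("overall speedup:", round(s, 2))

# A more modest 10x kernel speedup for comparison.
print("with 10x kernel:", round(overall_speedup(0.90, 10.0), 2))
```

Under these numbers the whole application speeds up by a little under 10 times, since the remaining 10% of the run time is untouched by the accelerator.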

In general terms, FPGAs are best at tasks that use short word length integer or fixed point data and exhibit a high degree of parallelism, but they are not so good at high precision floating-point arithmetic (although they can still outperform conventional processors in many cases). The cost of shipping data to the FPGA from the CPU and vice versa must also be considered, for if that outweighs any improvement in the kernel then implementing the algorithm in an FPGA may be an exercise in futility. FPGAs are best suited to integer arithmetic. Unfortunately, the vast majority of scientific codes rely heavily on 64-bit IEEE floating point arithmetic (often referred to as double precision floating point arithmetic). It is not unreasonable to suggest that, in order to get the most out of FPGAs, computational scientists must perform a thorough numerical analysis of their code and ideally reimplement it using fixed point arithmetic or lower precision floating-point arithmetic. Scientists who have been used to continual performance increases provided by each new generation of processor are not easily convinced that the large amount of effort required for such an exercise will be sufficiently rewarded. That said, the recent development of efficient floating point cores has gone some way towards encouraging scientists to use FPGAs.

If the performance of such cores can be demonstrated by accelerating a number of real world applications, then the wider acceptance of FPGAs will move a step closer. At present there is very little performance data available for 64-bit floating-point intensive algorithms on FPGAs. To give an indication of expected performance, we have therefore used data taken from the Xilinx floating point cores (v3) datasheet. It is useful to measure the area, performance and power consumption gap between field programmable gate arrays (FPGAs) and standard cell application-specific integrated circuits (ASICs) for the following reasons:

I. In the early stages of system design, when system architects choose their implementation medium, they often choose between FPGAs and ASICs. Such decisions are based on the differences in cost (which is related to area), performance and power consumption between these implementation media, but to date there have been few attempts to quantify these differences. A system architect can use these measurements to assess whether implementation in an FPGA is feasible.

II. These measurements can also be useful for those building ASICs that contain programmable logic, by quantifying the impact of leaving part of a design to be implemented in the programmable fabric.

III. FPGA makers seeking to improve FPGAs can gain insight from quantitative measurements of these metrics, particularly when it comes to understanding the benefit of less programmable (but more efficient) hard heterogeneous blocks such as block memory, multipliers/accumulators and multiplexers that modern FPGAs often employ.

2.4 Basic FPGA Architecture

The most common FPGA architecture consists of an array of configurable logic blocks (CLBs), I/O pads, and routing channels. Generally, all the routing channels have the same width (number of wires). Multiple I/O pads may fit into the height of one row or the width of one column in the array. An application circuit must be mapped into an FPGA with adequate resources. While the number of CLBs and I/Os required is easily determined from the design, the number of routing tracks needed may vary considerably even among designs with the same amount of logic. (For example, a crossbar switch requires much more routing than a systolic array with the same gate count.) Since unused routing tracks increase the cost (and decrease the performance) of the part without providing any benefit, FPGA manufacturers try to provide just enough tracks so that most designs that will fit in terms of LUTs and I/Os can be routed. This is determined by estimates such as those derived from Rent's rule or by experiments with existing designs.

2.5 FPGA Design and Programming

To define the behavior of the FPGA, the user provides a hardware description language (HDL) description or a schematic design. The HDL form might be easier to work with when handling large structures because it is possible to specify them numerically rather than having to draw every piece by hand. On the other hand, schematic entry can allow for easier visualization of a design. Then, using an electronic design automation tool, a technology-mapped netlist is generated. The netlist can then be fitted to the actual FPGA architecture using a process called place-and-route, usually performed by the FPGA company's proprietary place-and-route software. The user validates the map, place and route results via timing analysis, simulation, and other verification methodologies. Once the design and validation process is complete, the binary file generated (also using the FPGA company's proprietary software) is used to (re)configure the FPGA. In other words, the source files are fed to a software suite from the FPGA/CPLD vendor that, through different steps, produces a file. This file is then transferred to the FPGA/CPLD via a serial interface or to an external memory device such as an EEPROM. The most common HDLs are VHDL and Verilog, although, in an attempt to reduce the complexity of designing in HDLs, which have been compared to assembly languages, there are moves to raise the abstraction level through the introduction of alternative languages.

Advantages of Using Hardware Description Languages (HDLs) to Design FPGA Devices

Using Hardware Description Languages (HDLs) to design high-density FPGA devices has the following advantages:

I. Top-Down Approach for Large Projects Designers use HDLs to create complex designs. The top-down approach to system design works well for large HDL projects that require many designers working together. After the design team determines the overall design plan, individual designers can work independently on separate code sections.

II. Functional Simulation Early in the Design Flow You can verify design functionality early in the design flow by simulating the HDL description. Testing your design decisions before the design is implemented at the Register Transfer Level (RTL) or gate level allows you to make any necessary changes early on.

III. Synthesis of HDL Code to Gates Synthesizing your hardware description to target the FPGA implementation:

Decreases design time by allowing a higher-level design specification, rather than specifying the design from the FPGA base elements.

Reduces the errors that can occur during a manual translation of a hardware description to a schematic design.

Allows you to apply the automation techniques used by the synthesis tool (such as machine encoding styles and automatic I/O insertion) during optimization to the original HDL code. This results in greater optimization and efficiency.

IV. Early Testing of Various Design Implementations HDLs allow you to test different design implementations early in the design flow. Use the synthesis tool to perform the logic synthesis and optimization into gates. Additionally, Xilinx FPGA devices allow you to implement your design at your computer. Since the synthesis time is short, you have more time to explore different architectural possibilities at the Register Transfer Level (RTL). You can reprogram Xilinx FPGA devices to test several design implementations. You can retarget RTL code to new FPGA devices with minimum recoding.

2.6 VHDL & Verilog

Both VHDL and Verilog are well established hardware description languages. They have the advantage that the user can define high-level algorithms and low-level optimizations (gate-level and switch-level) in the same language. A basic example of VHDL code, the evaluation of the Fibonacci series, is shown below, and it is a good example of the points made above. The code itself is reasonably straightforward for a software programmer to understand, provided that he/she understands that this is a truly parallel language and all lines are executing at once. It is also straightforward to simulate a simple design of this nature. However, it is surprisingly difficult to implement it in hardware, and this difficulty is a direct result of I/O issues. As noted above, for a design to work in hardware, access is required to resources that are external to the FPGA, such as memory, and an FPGA is, by its very nature, unaware of the components to which it is connected. If you want to retrieve a value from main memory and use it on the FPGA, then you need to instantiate a memory controller. While systems such as the Cray XD1 provide cores for communicating with memory, such cores are still complex and unfamiliar to software programmers. Our early experiences with VHDL have indicated that it should only be used for FPGA development if you are in a position to work closely with experienced hardware designers throughout the development process.
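The remark that "all lines are executing at once" can be illustrated with a small software model. The Python sketch below is not the VHDL listing referred to above; it is a hedged stand-in that assumes a design with two registers, here called `a` and `b` (hypothetical names), both updated on the same clock edge from the values they held before the edge.

```python
def clock_edge(state):
    """Compute every register's next value from the current state, then commit all at once."""
    return {
        "a": state["b"],               # a <= b
        "b": state["a"] + state["b"],  # b <= a + b, using the OLD value of a
    }

def fibonacci_registers(cycles):
    """Run the two-register model for a number of clock cycles and record register a."""
    state = {"a": 0, "b": 1}
    outputs = []
    for _ in range(cycles):
        outputs.append(state["a"])
        state = clock_edge(state)
    return outputs

print(fibonacci_registers(10))
```

If the two assignments were executed one after the other, as in ordinary software, `b` would see the already updated `a` and the sequence would be wrong. Building the whole next state from the old state is what mimics the concurrent behavior of the hardware description.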


CHAPTER-3

DESIGN APPROACH

Low-power, area-efficient, and high-performance VLSI systems are increasingly used in portable and mobile devices, multi-standard wireless receivers, and biomedical instrumentation [1], [2]. An adder is the main component of an arithmetic unit. A complex digital signal processing (DSP) system involves several adders. An efficient adder design essentially improves the performance of a complex DSP system. A ripple carry adder (RCA) uses a simple design, but carry propagation delay (CPD) is the main concern in this adder. Carry look-ahead and carry select (CS) methods have been suggested to reduce the CPD of adders. A conventional carry select adder (CSLA) is an RCA-RCA configuration that generates a pair of sum words and output-carry bits corresponding to the anticipated input-carry (cin = 0 and 1) and selects one out of each pair for the final-sum and final-output-carry [3]. A conventional CSLA has less CPD than an RCA, but the design is not attractive since it uses a dual RCA. A few attempts have been made to avoid the dual use of RCA in CSLA design. Kim and Kim [4] used one RCA and one add-one circuit instead of two RCAs, where the add-one circuit is implemented using a multiplexer (MUX). He et al. [5] proposed a square-root (SQRT)-CSLA to implement large bit-width adders with less delay. In a SQRT-CSLA, CSLAs with increasing size are connected in a cascading structure. The main objective of the SQRT-CSLA design is to provide a parallel path for carry propagation that helps to reduce the overall adder delay. A binary to excess-1 converter (BEC)-based CSLA was suggested in [6]. The BEC-based CSLA involves less logic resources than the conventional CSLA, but it has marginally higher delay. A CSLA based on common Boolean logic (CBL) is also proposed in [7] and [8]. The CBL-based CSLA of [7] involves significantly less logic resource than the conventional CSLA, but it has a longer CPD, which is almost equal to that of the RCA. To overcome this problem, a SQRT-CSLA based on CBL was proposed in [8]. However, the CBL-based SQRT-CSLA design of [8] requires more logic resource and delay than the BEC-based SQRT-CSLA of [6].

We observe that logic optimization largely depends on the availability of redundant operations in the formulation, whereas adder delay mainly depends on data dependence. In the existing designs, logic is optimized without giving any consideration to the data dependence. In this work, we analyze the logic operations involved in conventional and BEC-based CSLAs to study the data dependence and to identify redundant logic operations. Based on this analysis, we propose a logic formulation for the CSLA.

The main contribution of this work is a logic formulation based on data dependence, together with an optimized carry generator (CG) and CS design. Based on the proposed logic formulation, we have derived an efficient logic design for the CSLA. Due to the optimized logic units, the proposed CSLA involves a significantly lower area-delay product (ADP) than the existing CSLAs. We have shown that the SQRT-CSLA using the proposed CSLA design involves nearly 32% less ADP and consumes 33% less energy than the corresponding SQRT-CSLA.

3.1 Overview of Carry Select Adder

The carry-select adder generally consists of two ripple carry adders and a multiplexer. Adding two n-bit numbers with a carry select adder is done with two adders (therefore two ripple carry adders) in order to perform the calculation twice, one time with the assumption of the carry being zero and the other assuming one. After the two results are calculated, the correct sum, as well as the correct carry, is selected with the multiplexer once the correct carry is known. The number of bits in each carry select block can be uniform or variable. In the uniform case, the optimal delay occurs for a block size of $\lfloor \sqrt{n} \rfloor$. When variable, the block size should have a delay, from addition inputs A and B to the carry out, equal to that of the multiplexer chain leading into it, so that the carry out is calculated just in time. The $O(\sqrt{n})$ delay is derived from uniform sizing, where the ideal number of full-adder elements per block is equal to the square root of the number of bits being added, since that will yield an equal number of MUX delays.
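The operation described above can be checked with a small bit-level model. The Python sketch below is a functional illustration only, not the HDL design of this project: each block runs two ripple carry additions (carry-in 0 and carry-in 1) and a multiplexer step picks one pair once the incoming carry is known. All function names and the block-size lists are assumptions made for this example.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin):
    """Add two equal-length bit lists (LSB first); returns (sum_bits, carry_out)."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

def carry_select_add(a, b, block_sizes, cin=0):
    """Add integers a and b; the adder width is sum(block_sizes)."""
    width = sum(block_sizes)
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result_bits = []
    carry = cin
    pos = 0
    for size in block_sizes:
        a_blk = a_bits[pos:pos + size]
        b_blk = b_bits[pos:pos + size]
        # Both candidate results are computed without waiting for the carry.
        sum0, c0 = ripple_carry_add(a_blk, b_blk, 0)
        sum1, c1 = ripple_carry_add(a_blk, b_blk, 1)
        # Multiplexer: the incoming carry selects one candidate pair.
        chosen_sum, carry = (sum1, c1) if carry else (sum0, c0)
        result_bits.extend(chosen_sum)
        pos += size
    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry

print(carry_select_add(0xABCD, 0x1234, [4, 4, 4, 4]))  # uniform 4-bit blocks
print(carry_select_add(200, 100, [2, 3, 3]))            # variable blocks, 8 bits
```

For simplicity the model duplicates the first block as well, although in hardware the first block can be a single ripple carry adder because its carry-in is known from the start.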

However, the carry select adder is not area efficient because it uses multiple pairs of ripple carry adders to generate the partial sum and carry for each assumed carry input, and the final sum and carry are then selected by the multiplexers (MUX). To overcome this problem, the CSLA is modified by using n-bit Binary to Excess-1 code converters (BEC) to improve the speed of addition. The logic can be implemented with any type of adder to further improve the speed. We use the Binary to Excess-1 Converter (BEC) instead of the second ripple carry adder in the regular CSLA to achieve lower area and power consumption. The main advantage of this BEC logic comes from its smaller number of logic gates compared with the full adder (FA) structure. The modified design has reduced area and power as compared with the regular SQRT-CSLA, with an increase in the delay. Therefore, an improved CSLA was designed with a D-latch replacing the BEC in the modified CSLA. This design efficiently reduces the delay, thereby increasing the speed and making it a high speed carry select adder. The factors which are desirable in adders are as follows:

High speed

Low power consumption

Area efficiency

Robustness and noise stability

Insensitivity to process variables

Less internal activity when activity is low

According to the requirements of the adder, the designer has to consider all these parameters while choosing a structure. What makes this decision even harder is that most of these parameters are usually not independent of each other. The trade-offs between the desired parameters make this decision a multi-dimensional optimization problem for a high performance system. A multi-dimensional optimization problem for a non-linear system that usually has hundreds of variables is unfortunately impossible to solve within the limited design time. The idea of this thesis is to explore the area, power consumption and time delay of different adder structures. This gives a good understanding of the different structures and makes the decision easier for designers.

The Ripple Carry Adder (RCA) provides the most compact design but takes longer computing time. For an N-bit RCA, the delay is linearly proportional to N. Thus, for large values of N, the RCA gives the highest delay of all adders. The Carry Look Ahead Adder (CLA) gives fast results but consumes large area. For an N-bit adder, the CLA is fast for N ≤ 4, but for large values of N its delay increases more than that of other adders. So, for a higher number of bits, the CLA gives higher delay than other adders due to the large fan-in and the large number of logic gates.


The Carry Select Adder (CSA) provides a compromise between the small-area but longer-delay RCA and the large-area but shorter-delay CLA. In the rapidly growing mobile industry, faster units are not the only concern; smaller area and lower power have also become major concerns in the design of digital circuits. In mobile electronics, reducing area and power consumption are key factors in increasing portability and battery life. Even in servers and desktop computers, power dissipation is an important design constraint. The design of area- and power-efficient high-speed data path logic systems is one of the most substantial areas of research in VLSI system design. In the present work, the design of 8-bit adder topologies such as the ripple carry adder, carry look ahead adder, carry skip adder, carry select adder, carry increment adder, carry save adder and carry bypass adder is presented. Performance issues such as area, power dissipation and propagation delay for all the adders are analyzed in a 0.12 µm, 6-metal-layer CMOS technology using the Microwind tool, which tightly integrates mixed-signal implementation with digital implementation, circuit simulation, transistor-level extraction and verification.

In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially only after the previous bit position has been summed and a carry propagated into the next position. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [1]. However, the CSLA is not area efficient because it uses multiple pairs of ripple carry adders (RCA) to generate the partial sum and carry by considering carry inputs Cin = 0 and Cin = 1, after which the final sum and carry are selected by the multiplexers (MUX). The basic idea of this work is to use a simple combinational circuit instead of the RCA with Cin = 1 and the multiplexer in the regular CSLA to achieve lower area and power. The main advantage of this logic is its lower power compared with the n-bit full adder (FA) structure. The SQRT CSLA has been developed using the simple combinational circuit and compared with the regular SQRT CSLA.

A regular CSLA uses two copies of the carry evaluation block, one with a block carry input of zero and the other with a block carry input of one. The regular CSLA suffers from the disadvantage of occupying more chip area. The modified CSLA reduces the area and power compared to the regular CSLA, with an increase in delay, by the use of the Binary to Excess-1 converter. This project proposes a scheme which reduces the delay, area and power relative to the regular and modified CSLA by the use of D-latches.
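As a hedged illustration of the Binary to Excess-1 idea, the Python sketch below models a BEC as "invert a bit if all lower-order bits are 1", which is one common way to express an add-one circuit, and uses it in place of the second adder of a block. It is a functional model under stated assumptions, not the gate-level netlist used in this project; `bec`, `add_block_cin0` and `bec_csla_block` are hypothetical names, and plain integer addition stands in for the n-bit RCA with carry-in 0.

```python
def bec(bits):
    """Binary to excess-1: returns the input value plus one (LSB first, same width)."""
    out = []
    all_lower_ones = 1  # AND of all lower-order input bits (1 for the LSB)
    for b in bits:
        out.append(b ^ all_lower_ones)  # flip this bit only if every lower bit is 1
        all_lower_ones &= b
    return out

def add_block_cin0(a, b, n):
    """Stand-in for an n-bit RCA with carry-in 0: n sum bits followed by the carry-out."""
    total = a + b
    return [(total >> i) & 1 for i in range(n + 1)]

def bec_csla_block(a, b, n, cin):
    """One CSLA block: the carry-in 1 result comes from an (n+1)-bit BEC, not a second RCA."""
    result_cin0 = add_block_cin0(a, b, n)
    result_cin1 = bec(result_cin0)
    chosen = result_cin1 if cin else result_cin0  # multiplexer controlled by the carry-in
    value = sum(bit << i for i, bit in enumerate(chosen))
    return value & ((1 << n) - 1), value >> n     # (sum, carry_out)

print(bec([1, 1, 0, 0]))            # binary 3 in, binary 4 out (LSB first)
print(bec_csla_block(9, 6, 4, 1))   # 9 + 6 + 1 in a 4-bit block
```

The model reflects the trade-off stated above: the BEC needs fewer gates than a second ripple carry adder, but it can only start once the carry-in 0 result is available, which is where the extra delay comes from.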

3.2 Operation Carry Select Adders (CSA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. The carry-select adder partitions the adder into several groups, each of which performs two additions in parallel. Therefore, two copies of ripple-carry adder act as carry evaluation block per select stage. One copy evaluates the carry chain assuming the block carry-in is zero, while the other assumes it to be one. Once the carry signals are finally computed, the correct sum and carry-out signals will be simply selected by a set of multiplexers. The 4-bit adder block is RCA.Systems are one of the most substantial areas of research in VLSI system design. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially only after the previous bit position has been summed and a carry propagated into the next position. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then select a carry to generate the sum. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate partial sum and carry by considering carry input and, then the final sum and carry are selected by the multiplexers (MUX). The carry-select adder generally consists of two ripple carry adders and a multiplexer. Adding two n-bit numbers with a carry-select adder is done with two adders (therefore two ripple carry adders) in order to perform the calculation twice, one time with the assumption of the carry being zero and the other assuming one. After the two results are calculated, the correct sum, as well as the correct carry, is then selected with the multiplexer once the correct carry is known. The number of bits in each carry select block can be uniform, or variable. 
In the uniform case, the optimal delay occurs for a block size of $\lfloor \sqrt{n} \rfloor$. In the variable case, each block should have a delay, from the addition inputs A and B to the carry-out, equal to that of the multiplexer chain leading into it, so that the carry-out is calculated just in time. The resulting $O(\sqrt{n})$ delay is derived from uniform sizing, where the ideal number of full-adder elements per block is equal to the square root of the number of bits being added, since that yields an equal number of MUX delays.

In the basic building block, two 4-bit ripple carry adders are multiplexed together, and the resulting carry and sum bits are selected by the carry-in. Since one ripple carry adder assumes a carry-in of 0 and the other assumes a carry-in of 1, selecting the adder that made the correct assumption via the actual carry-in yields the desired result. A 16-bit carry-select adder with a uniform block size of 4 can be created with three of these blocks and a 4-bit ripple carry adder. Since the carry-in is known at the beginning of the computation, a carry-select block is not needed for the first four bits. The delay of this adder is four full-adder delays plus three MUX delays.

A 16-bit carry-select adder with variable block sizes can be created similarly. Here we show an adder with block sizes. This break-up is ideal when the full-adder delay is equal to the MUX delay, which is unlikely. The total delay is two full-adder delays and four MUX delays.

Addition is the heart of computer arithmetic, and the arithmetic unit is often the workhorse of a computational circuit. Adders are a necessary component of a data path, e.g. in a microprocessor or a signal processor. There are many ways to design an adder. The Ripple Carry Adder (RCA) provides the most compact design but takes the longest computing time. For an N-bit RCA, the delay is linearly proportional to N, so for large values of N the RCA has the highest delay of all adders. The Carry Look Ahead Adder (CLA) gives fast results but consumes a large area. For an N-bit adder, the CLA is fast for N ≤ 4, but for large values of N its delay increases more than that of other adders.
So for a higher number of bits, the CLA gives a higher delay than other adders because of its large fan-in and large number of logic gates. The carry-select adder (CSLA) provides a compromise between the RCA, which has a small area but a longer delay, and the CLA, which has a larger area but a shorter delay.

In the rapidly growing mobile industry, faster units are not the only concern; smaller area and lower power have also become major concerns in the design of digital circuits. In mobile electronics, reducing area and power consumption are key factors in increasing portability and battery life. Even in servers and desktop computers, power dissipation is an important design constraint. Among the various adders, the CSLA is intermediate regarding speed and area.
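The block-wise operation described in this section can be sketched behaviourally in Python. This is only a minimal sketch, assuming a 16-bit adder with a uniform block size of 4 in which the first block is a plain RCA; the function names (`ripple_carry_add`, `carry_select_add`) are illustrative and are not taken from the VHDL code of this project.

```python
def to_bits(x, n):
    """Return the n-bit representation of x as a list, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Convert an LSB-first bit list back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, cin):
    """Add two equal-length bit lists with a chain of full adders."""
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return sum_bits, carry


def carry_select_add(a, b, cin=0, n=16, block=4):
    """Behavioural model of a uniform-block carry-select adder."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    total, carry = [], cin
    for lo in range(0, n, block):
        xa, xb = a_bits[lo:lo + block], b_bits[lo:lo + block]
        if lo == 0:
            # The carry-in of the first block is known, so one RCA is enough.
            block_sum, carry = ripple_carry_add(xa, xb, carry)
        else:
            # Both candidate results are computed "in parallel" ...
            sum0, cout0 = ripple_carry_add(xa, xb, 0)
            sum1, cout1 = ripple_carry_add(xa, xb, 1)
            # ... and the actual block carry-in drives the multiplexer.
            block_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        total.extend(block_sum)
    return from_bits(total), carry
```

For example, `carry_select_add(1234, 4321)` returns the 16-bit sum together with the carry-out. The model reproduces the function of the adder only; it says nothing about gate delay or area.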

3.3 Why we replaced Regular CSLA with Modified CSLA?

The regular CSLA has two ripple carry adders (RCAs) in each module, which perform the addition for the two possible values of the carry.

Using two RCAs in each module increases the number of transistors.

The increase in the number of transistors leads to an increase in area and power consumption.

The second RCA in each module can be replaced by a binary to excess-1 converter, which performs the same operation with fewer transistors. This leads to the modified CSLA, which is area efficient and consumes less power.

Fig. 3.1 Block diagram of regular CSLA


Fig. 3.2 Block diagram of modified CSLA

Code converters are essential in digital systems. The excess-1 code is obtained by adding one to the binary value. The detailed structures of the 5-bit BEC without carry (BEC) and with carry (BECWC) are shown in Fig. 3.3. The BEC takes n inputs and generates n outputs; the BECWC takes n inputs and generates n+1 outputs, so that the carry output can serve as the selection input of the next-stage mux used in the final adder design. The function tables of the BEC and BECWC are shown in Table 3.1.


Table 3.1 Truth table

Large bit-width multipliers require multiple BECs, and each of them takes its selection input from the carry output of the preceding BEC.

Fig. 3.3 The 5-bit Binary to Excess-1 Code Converter: (a) BEC (without carry), (b) BECWC (with carry).

3.4 Logic Formulation

The CSLA has two units: 1) the sum and carry generator (SCG) unit and 2) the sum and carry selection unit [9]. The SCG unit consumes most of the logic resources of the CSLA and contributes significantly to the critical path. Different logic designs have been suggested for efficient implementation of the SCG unit. We studied the logic designs suggested for the SCG unit of the conventional and BEC-based CSLAs of [6] by means of suitable logic expressions. The main objective of this study is to identify redundant logic operations and data dependence. Accordingly, we remove all redundant logic operations and sequence the remaining logic operations based on their data dependence.

Fig. 3.4 (a) Conventional CSLA; n is the input operand bit-width. (b) The logic operations of the RCA are shown in split form, where HSG, HCG, FSG, and FCG represent half-sum generation, half-carry generation, full-sum generation, and full-carry generation, respectively.

3.4.1 Logic Expressions of the SCG Unit of the Conventional CSLA

The SCG unit of the conventional CSLA [3], shown in Fig. 3.4 (a), is composed of two n-bit RCAs, where n is the adder bit-width. The logic operation of the n-bit RCA shown in Fig. 3.4 (b) is performed in four stages:

Half-sum generation (HSG);
Half-carry generation (HCG);
Full-sum generation (FSG); and
Full-carry generation (FCG).

Suppose two n-bit operands are added in the conventional CSLA. Then RCA-1 and RCA-2 generate the n-bit sums ($s^0$ and $s^1$) and the output-carries ($c_{out}^0$ and $c_{out}^1$) corresponding to the input-carry (cin = 0 and cin = 1), respectively. The logic expressions of RCA-1 and RCA-2 of the SCG unit of the n-bit CSLA are given as

$s_0^0(i) = A(i) \oplus B(i), \quad c_0^0(i) = A(i) \cdot B(i)$

$s_1^0(i) = s_0^0(i) \oplus c_1^0(i-1)$

$c_1^0(i) = c_0^0(i) + s_0^0(i) \cdot c_1^0(i-1), \quad c_{out}^0 = c_1^0(n-1)$

$s_0^1(i) = A(i) \oplus B(i), \quad c_0^1(i) = A(i) \cdot B(i)$

$s_1^1(i) = s_0^1(i) \oplus c_1^1(i-1)$

$c_1^1(i) = c_0^1(i) + s_0^1(i) \cdot c_1^1(i-1), \quad c_{out}^1 = c_1^1(n-1)$ (1)

where the input-carries are $c_1^0(-1) = 0$ and $c_1^1(-1) = 1$.
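The four stages of the conventional SCG unit can be traced in a short Python sketch. It assumes that the operands are given as bit lists with the LSB first and that the input-carry of RCA-1 is 0 and that of RCA-2 is 1; the names `to_bits` and `scg_conventional` are illustrative only.

```python
def to_bits(x, n):
    """Return the n-bit representation of x as a list, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def scg_conventional(a_bits, b_bits):
    """Model RCA-1 (cin = 0) and RCA-2 (cin = 1) of the SCG unit stage by stage.

    Returns a dict mapping cin to (full_sum_bits, carry_out).
    """
    results = {}
    for cin in (0, 1):
        # HSG and HCG: both RCAs repeat exactly the same half-sum and half-carry.
        half_sum = [a ^ b for a, b in zip(a_bits, b_bits)]
        half_carry = [a & b for a, b in zip(a_bits, b_bits)]
        full_sum, carry = [], cin  # carry holds the full-carry of position i-1
        for s, c in zip(half_sum, half_carry):
            full_sum.append(s ^ carry)   # FSG
            carry = c | (s & carry)      # FCG
        results[cin] = (full_sum, carry)  # the last full-carry is the carry-out
    return results
```

The sketch makes the redundancy visible: the half-sum and half-carry words are computed twice with identical results, once for each assumed input-carry.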

As stated above, the main idea of this work is to use a BEC instead of the RCA with Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To replace the n-bit RCA, an (n+1)-bit BEC is required.

3.4.2 Logic Expression of the SCG Unit of the BEC-Based CSLA

The RCA, as shown in Fig. 3.2, calculates the n-bit sum $s_1^0$ and the output-carry $c_{out}^0$ corresponding to cin = 0. The BEC unit receives $s_1^0$ and $c_{out}^0$ from the RCA and generates the (n + 1)-bit excess-1 code. The most significant bit (MSB) of the BEC output represents $c_{out}^1$, and the n least significant bits (LSBs) represent $s_1^1$. The logic expressions of the BEC unit are given as

$s_1^1(0) = \overline{s_1^0(0)}, \quad c_1^1(0) = s_1^0(0)$

$s_1^1(i) = s_1^0(i) \oplus c_1^1(i-1)$

$c_1^1(i) = s_1^0(i) \cdot c_1^1(i-1)$

$c_{out}^1 = c_1^0(n-1) \oplus c_1^1(n-1)$ (2)
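The BEC expressions can be checked with a small Python sketch that derives the cin = 1 result from the cin = 0 result of the RCA. It assumes LSB-first bit lists, an inverted LSB as the first excess-1 sum bit, and XOR for the sum and output-carry terms; the function name `bec_from_rca` is illustrative.

```python
def bec_from_rca(sum0_bits, cout0):
    """Derive the cin = 1 sum and carry-out from the cin = 0 RCA outputs.

    sum0_bits is the n-bit sum for cin = 0 (LSB first), cout0 its carry-out.
    """
    sum1_bits = [sum0_bits[0] ^ 1]      # LSB: adding 1 inverts the bit
    carry = sum0_bits[0]                # carry of the increment out of bit 0
    for bit in sum0_bits[1:]:
        sum1_bits.append(bit ^ carry)   # sum bit of the incrementer
        carry = bit & carry             # carry of the incrementer
    cout1 = cout0 ^ carry               # the two terms are never 1 together
    return sum1_bits, cout1
```

The two terms of the output-carry cannot both be 1, because a carry-out of the RCA for cin = 0 implies that its sum bits are not all 1, so XOR and OR give the same result in the last step.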

The selected carry word is added to the half-sum (s0) to generate the final-sum (s). Using this method, one can have three design advantages:

1. Calculation of $s_1^0$ is avoided in the SCG unit;

2. An n-bit select unit is required instead of an (n+1)-bit one; and

3. The output-carry delay is small.

All these features result in an area-delay and energy-efficient design for the CSLA. We have removed all the redundant logic operations of (1) and rearranged its logic expressions based on their dependence. The proposed logic formulation for the CSLA is given as

$s_0(i) = A(i) \oplus B(i), \quad c_0(i) = A(i) \cdot B(i)$

$c_1^0(i) = c_1^0(i-1) \cdot s_0(i) + c_0(i)$, with $c_1^0(-1) = 0$

$c_1^1(i) = c_1^1(i-1) \cdot s_0(i) + c_0(i)$, with $c_1^1(-1) = 1$

$c(i) = c_1^0(i)$ if $c_{in} = 0$

$c(i) = c_1^1(i)$ if $c_{in} = 1$ (3)
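The proposed formulation can be sketched behaviourally in Python as follows. The sketch assumes that the final-sum is obtained by XORing the half-sum with the selected carry word shifted by one position, with cin entering at the LSB, as described in the text; the function name `proposed_csla` and its helper names are illustrative.

```python
def proposed_csla(a, b, cin, n):
    """Behavioural model of the proposed carry-select formulation.

    Returns (final_sum, carry_out) for n-bit operands a and b.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]

    # Half-sum and half-carry words, computed once and shared.
    half_sum = [x ^ y for x, y in zip(a_bits, b_bits)]
    half_carry = [x & y for x, y in zip(a_bits, b_bits)]

    def carry_word(initial_carry):
        """Carry generation for a fixed input-carry (0 or 1)."""
        word, carry = [], initial_carry
        for s, c in zip(half_sum, half_carry):
            carry = (carry & s) | c
            word.append(carry)
        return word

    carry0, carry1 = carry_word(0), carry_word(1)

    # Carry selection: pick the carry word that matches the actual cin.
    carry = carry1 if cin else carry0

    # Final-sum: half-sum XOR the carry entering each bit position.
    sum_bits = [half_sum[0] ^ cin]
    sum_bits += [half_sum[i] ^ carry[i - 1] for i in range(1, n)]
    final_sum = sum(bit << i for i, bit in enumerate(sum_bits))
    return final_sum, carry[n - 1]
```

Only the carry words are duplicated in this model; the sum is formed once, after the selection, which is the source of the saving claimed above.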


Fig. 3.5 Structure of the BEC-based CSLA; n is the input operand bit-width.

3.5 Proposed Adder Design

The proposed CSLA is based on the logic formulation given in (3), and its structure is shown in Fig. 3.6 (a). It consists of one HSG unit, one FSG unit, one CG unit, and one CS unit. The CG unit is composed of two CGs (CG0 and CG1) corresponding to input-carry 0 and 1. The HSG receives two n-bit operands (A and B) and generates the half-sum word s0 and the half-carry word c0, each n bits wide. Both CG0 and CG1 receive s0 and c0 from the HSG unit and generate two n-bit full-carry words $c_1^0$ and $c_1^1$ corresponding to input-carry 0 and 1, respectively.

Fig. 3.6 (a) Proposed CS adder design, where n is the input operand bit-width, a value in square brackets [ ] represents delay (in units of inverter delay), and η = max (t, 3.5n + 2.7). (b) Gate-level design of the HSG. (c) Gate-level optimized design of CG0 for input-carry = 0. (d) Gate-level optimized design of CG1 for input-carry = 1. (e) Gate-level design of the CS unit. (f) Gate-level design of the final sum generation (FSG) unit.


The logic diagram of the HSG unit is shown in Fig. 3.6 (b). The logic circuits of CG0 and CG1 are optimized to take advantage of the fixed input-carry bits. The optimized designs of CG0 and CG1 are shown in Fig. 3.6 (c) and (d), respectively. The CS unit selects one final carry word from the two carry words available at its input lines using the control signal cin. It selects $c_1^0$ when cin = 0; otherwise, it selects $c_1^1$. The CS unit can be implemented using an n-bit 2-to-1 MUX. However, we find from the truth table of the CS unit that the carry words $c_1^0$ and $c_1^1$ follow a specific bit pattern: if $c_1^0(i) = 1$, then $c_1^1(i) = 1$, irrespective of s0(i) and c0(i), for 0 ≤ i ≤ n − 1. This feature is used for logic optimization of the CS unit. The optimized design of the CS unit is shown in Fig. 3.6 (e), which is composed of n AND-OR gates. The final carry word c is obtained from the CS unit. The MSB of c is sent to the output as cout, and its (n − 1) LSBs are XORed with the (n − 1) MSBs of the half-sum (s0) in the FSG [shown in Fig. 3.6 (f)] to obtain the (n − 1) MSBs of the final-sum (s). The LSB of s0 is XORed with cin to obtain the LSB of s.
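The bit pattern used to optimize the CS unit can be checked exhaustively for small n. The following Python sketch assumes one possible AND-OR form of the selection, c(i) = c0word(i) OR (cin AND c1word(i)), where c0word and c1word stand for the carry words generated for input-carry 0 and 1; this form and all function names are assumptions made for illustration, and the gate-level circuit of Fig. 3.6 (e) may differ in detail.

```python
from itertools import product


def carry_words(half_sum, half_carry):
    """Return the carry words for input-carry 0 and 1 (LSB first)."""
    words = []
    for carry in (0, 1):
        word = []
        for s, c in zip(half_sum, half_carry):
            carry = (carry & s) | c
            word.append(carry)
        words.append(word)
    return words[0], words[1]


def cs_mux(word0, word1, cin):
    """Reference selection with an n-bit 2-to-1 multiplexer."""
    return [hi if cin else lo for lo, hi in zip(word0, word1)]


def cs_and_or(word0, word1, cin):
    """Assumed AND-OR form of the selection: one AND and one OR per bit."""
    return [lo | (cin & hi) for lo, hi in zip(word0, word1)]


def check_cs_unit(n=4):
    """Check the bit pattern and the MUX equivalence for all n-bit operands."""
    for a_bits in product((0, 1), repeat=n):
        for b_bits in product((0, 1), repeat=n):
            half_sum = [a ^ b for a, b in zip(a_bits, b_bits)]
            half_carry = [a & b for a, b in zip(a_bits, b_bits)]
            word0, word1 = carry_words(half_sum, half_carry)
            # Pattern: a 1 in the cin = 0 word implies a 1 in the cin = 1 word.
            if any(lo == 1 and hi == 0 for lo, hi in zip(word0, word1)):
                return False
            for cin in (0, 1):
                if cs_mux(word0, word1, cin) != cs_and_or(word0, word1, cin):
                    return False
    return True
```

Calling `check_cs_unit(4)` walks through all 256 operand pairs; the AND-OR form is valid only because the pattern holds, so it is not a general replacement for a multiplexer.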

We have considered all the gates to be made of 2-input AND, 2-input OR, and inverter (AOI). A 2-input XOR is composed of 2 AND, 1 OR, and 2 NOT gates. The area and delay of the 2-input AND, 2-input OR, and NOT gates are taken from the Synopsys Armenia Educational Department (SAED) 90-nm standard cell library datasheet for theoretical estimation. The area and delay of a design are calculated using the following relations:

$A = a \cdot N_a + r \cdot N_o + i \cdot N_i$

$T = n_a \cdot T_a + n_o \cdot T_o + n_i \cdot T_i$ (4)

where (Na, No, Ni) and (na, no, ni), respectively, represent the (AND, OR, NOT) gate counts of the total design and of its critical path, and (a, r, i) and (Ta, To, Ti), respectively, represent the area and delay of one (AND, OR, NOT) gate. The area and delay of each design are calculated from its AOI gate counts (Na, No, Ni) and (na, no, ni) and the cell details. The delay of each intermediate and output signal of the proposed n-bit CSLA design of Fig. 3.6 is shown in square brackets against each signal. We can observe that the proposed n-bit single-stage CSLA involves 6n fewer AOI gates than the CSLA of [6] and takes 2.7 and 6.6 units less delay to calculate the final-sum and the output-carry, respectively. Compared with the CBL-based CSLA of [7], the proposed CSLA design involves n more AOI gates, and it takes (n − 4.7) units less delay to calculate the output-carry.

In this work the following adder structures are used:

Ripple Carry Adder
Carry Look-Ahead Adder
Carry Skip Adder
Carry Select Adder
Carry Save Adder
Carry Increment Adder
Carry Bypass Adder

3.5.1 Ripple Carry Adder (RCA)

The ripple carry adder is constructed by cascading full adder (FA) blocks in series, as shown in Fig. 3.7. One full adder is responsible for the addition of two binary digits at any stage of the ripple carry, and the carry-out of one stage is fed directly to the carry-in of the next stage, so an n-bit parallel adder requires n full adders. Even though this is a simple adder that can be used to add numbers of unrestricted bit length, it is not very efficient for large bit lengths. One of its most serious drawbacks is that the delay increases linearly with the bit length. The worst-case delay of the RCA occurs when a carry signal transition ripples through all stages of the adder chain from the least significant bit to the most significant bit, and is approximated by:

t = (n-1) tc + ts

where tc is the carry-propagation delay of one full adder and ts is its sum-generation delay.

Fig. 3.7 A 4-bit Ripple Carry Adder


The RCA is not very efficient when numbers with a large bit length are used, because its delay increases linearly with the bit length.

3.5.2 Carry Select Adders (CSLA)

In the carry-select adder scheme, blocks of bits are added in two ways: one assuming a carry-in of 0 and the other a carry-in of 1. This results in two precomputed sum and carry-out signal pairs ($s^0_{i-1:k}$, $c^0_i$; $s^1_{i-1:k}$, $c^1_i$). Later, as the block's true carry-in ($c_k$) becomes known, the correct signal pair is selected. Generally, multiplexers are used to propagate the carries.

Fig. 3.8 A Carry Select Adder with 1 level using n/2-bit RCA

Because of the multiplexers, a larger area is required. The CSLA has a smaller delay than the ripple carry adder (about half the delay of the RCA). Hence the carry-select adder is preferred when working with a smaller number of bits.

3.5.3 Carry Look Ahead Adders (CLA)

The carry look-ahead adder can produce carries faster because the carry bits are generated in parallel by additional circuitry whenever the inputs change. This technique uses carry bypass logic to speed up the carry propagation.

Fig. 3.9 34-BIT CLA

Logic equations

Let $a_i$ and $b_i$ be the augend and addend inputs, $c_i$ the carry input, and $s_i$ and $c_{i+1}$ the sum and carry-out of the i-th bit position. The auxiliary functions $p_i$ and $g_i$, called the propagate and generate signals, and the sum and carry outputs are defined as follows.

$p_i = a_i \oplus b_i, \quad g_i = a_i \cdot b_i$

$s_i = p_i \oplus c_i, \quad c_{i+1} = g_i + p_i \cdot c_i$

As we increase the number of bits in the carry look-ahead adder, the complexity increases because the number of gates in the expression for $c_{i+1}$ increases. So in practice it is not desirable to use the traditional CLA shown above for wide operands, because it increases both the area and the power required.

Instead, we use smaller carry look-ahead adders in levels to create a larger CLA. The smaller CLA is commonly taken to be a 4-bit CLA, so we can define carry look-ahead over a group of 4 bits. Hence we now redefine the terms carry generate as the group generated carry g[i, i+3] and carry propagate as the group propagated carry p[i, i+3], which are defined below.

$g[i, i+3] = g_{i+3} + p_{i+3} \cdot g_{i+2} + p_{i+3} \cdot p_{i+2} \cdot g_{i+1} + p_{i+3} \cdot p_{i+2} \cdot p_{i+1} \cdot g_i$

$p[i, i+3] = p_{i+3} \cdot p_{i+2} \cdot p_{i+1} \cdot p_i$
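A 4-bit carry look-ahead group can be sketched in Python as follows. The sketch assumes the standard textbook definitions of the propagate and generate signals (p = a XOR b, g = a AND b) and writes every carry directly in terms of p, g and the group carry-in; the function name `cla4` and its return layout are illustrative.

```python
def cla4(a, b, cin):
    """4-bit carry look-ahead addition.

    Returns (sum, carry_out, group_generate, group_propagate).
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate signals
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate signals

    # Each carry depends only on p, g and cin, so no carry has to ripple.
    c1 = g[0] | (p[0] & cin)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & cin))

    # Group signals over the four bits, usable by a second look-ahead level.
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    c4 = group_g | (group_p & cin)

    carries = [cin, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4, group_g, group_p
```

The growth of the product terms from c1 to c3 illustrates why the gate count and fan-in rise quickly with the bit width, and the group signals are what a higher level would combine to build a wider adder from 4-bit groups.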

3.5.4 Binary to Excess-1 Converter

The main idea of this work is to use a BEC instead of the RCA with Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To replace the n-bit RCA, an (n+1)-bit BEC is required. The function table of a 4-bit BEC is given in Table 3.2. The basic function of the CSLA is obtained by using the 4-bit BEC together with the mux: one input of the 2:1 mux receives the bits (B3, B2, B1, and B0) directly, and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux selects either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction it gives when a CSLA with a large number of bits is designed. The Boolean expressions of the 4-bit BEC are listed below (the functional symbols are ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)


In the 4-bit BEC with a 2:1 multiplexer, one input of the 2:1 MUX is the output of the 4-bit BEC and the other input is the output of the 4-bit adder with input carry equal to zero. The selection line is the carry of the previous stage, which selects one of the inputs as the output; if Cin = 1, the output is the 4-bit BEC output.

Table 3.2 Functional table of the 4-bit BEC

B3 B2 B1 B0    X3 X2 X1 X0
0  0  0  0     0  0  0  1
0  0  0  1     0  0  1  0
0  0  1  0     0  0  1  1
0  0  1  1     0  1  0  0
0  1  0  0     0  1  0  1
0  1  0  1     0  1  1  0
0  1  1  0     0  1  1  1
0  1  1  1     1  0  0  0
1  0  0  0     1  0  0  1
1  0  0  1     1  0  1  0
1  0  1  0     1  0  1  1
1  0  1  1     1  1  0  0
1  1  0  0     1  1  0  1
1  1  0  1     1  1  1  0
1  1  1  0     1  1  1  1
1  1  1  1     0  0  0  0
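The Boolean expressions of the 4-bit BEC can be checked against the functional table with a few lines of Python. This is a minimal sketch in which single bits are modelled as the integers 0 and 1; the function names `bec4` and `bec4_table` are illustrative.

```python
def bec4(b3, b2, b1, b0):
    """4-bit binary to excess-1 converter built from the Boolean expressions."""
    x0 = b0 ^ 1                 # X0 = ~B0
    x1 = b0 ^ b1                # X1 = B0 ^ B1
    x2 = b2 ^ (b0 & b1)         # X2 = B2 ^ (B0 & B1)
    x3 = b3 ^ (b0 & b1 & b2)    # X3 = B3 ^ (B0 & B1 & B2)
    return x3, x2, x1, x0


def bec4_table():
    """Regenerate the rows of the functional table as (inputs, outputs) pairs."""
    rows = []
    for value in range(16):
        bits = tuple((value >> k) & 1 for k in (3, 2, 1, 0))
        rows.append((bits, bec4(*bits)))
    return rows
```

Each output row equals the input value plus one, modulo 16, so the input 1111 wraps around to 0000; the carry of that case is not produced by this 4-bit version.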

3.5.5 Multiplexer

In electronics, a multiplexer (or MUX) is a device that selects one of several analog or digital input signals and forwards the selected input onto a single line. A multiplexer with $2^n$ inputs has n select lines, which are used to select which input line to send to the output. Multiplexers are mainly used to increase the amount of data that can be sent over a network within a certain amount of time and bandwidth. A multiplexer is also called a data selector. An electronic multiplexer makes it possible for several signals to share one device or resource, for example one A/D converter or one communication line, instead of having one device per input signal.


In digital circuit design, the selector wires carry digital values. In the case of a 2-to-1 multiplexer, a logic value of 0 on the selector connects input A to the output, while a logic value of 1 connects input B to the output. In larger multiplexers, the number of selector pins is equal to $\lceil \log_2 n \rceil$, where n is the number of inputs. A 2-to-1 multiplexer has the Boolean equation $Z = (A \cdot \overline{S}) + (B \cdot S)$, where A and B are the two inputs, S is the selector input, and Z is the output.
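The behaviour of the 2-to-1 multiplexer can be sketched in Python. The sketch assumes the usual convention that a selector value of 0 passes the first input and a value of 1 passes the second, and that the number of select lines is the base-2 logarithm of the input count rounded up; the names `mux2`, `mux2_word` and `select_lines` are illustrative.

```python
def mux2(a, b, s):
    """2-to-1 multiplexer on single bits: (a AND NOT s) OR (b AND s)."""
    return (a & (s ^ 1)) | (b & s)


def mux2_word(word_a, word_b, s):
    """n-bit 2-to-1 multiplexer: one shared selector for every bit pair."""
    return [mux2(x, y, s) for x, y in zip(word_a, word_b)]


def select_lines(num_inputs):
    """Number of select lines needed for num_inputs inputs (log2, rounded up)."""
    return (num_inputs - 1).bit_length()
```

In the modified CSLA, `word_a` would be the adder output for a carry-in of zero, `word_b` the BEC output, and `s` the carry of the previous stage.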

Addition is the most common and most often used arithmetic operation in microprocessors, digital signal processors and, especially, digital computers. It also serves as a building block for the synthesis of all other arithmetic operations. Therefore, for the efficient implementation of an arithmetic unit, the binary adder structure becomes a very critical hardware unit. Any book on computer arithmetic shows that there exists a large number of different circuit architectures with different performance characteristics that are widely used in practice. Although much research dealing with binary adder structures has been done, studies based on their comparative performance analysis are few.

In this project, qualitative evaluations of the classified binary adder architectures are given. Among the large number of adders, we wrote VHDL (VHSIC Hardware Description Language) code for the ripple-carry, carry-select and carry look-ahead adders to emphasize the common performance properties of their classes. In the following section, we give a brief description of the studied adder architectures. With respect to asymptotic delay time and area complexity, the binary adder architectures can be categorized into four primary classes as given in Table 3.3. The given re
