

80 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 40, NO. 1, JANUARY 2005

8-Gb/s Source-Synchronous I/O Link With Adaptive Receiver Equalization, Offset Cancellation, and Clock De-Skew

James E. Jaussi, Member, IEEE, Ganesh Balamurugan, Student Member, IEEE, David R. Johnson, Member, IEEE, Bryan Casper, Member, IEEE, Aaron Martin, Member, IEEE, Joseph Kennedy, Member, IEEE, Naresh Shanbhag, Senior Member, IEEE, and Randy Mooney, Member, IEEE

Abstract—A source-synchronous I/O link with adaptive receiver-side equalization has been implemented in 0.13-µm bulk CMOS technology. The transceiver is optimized for small area (360 µm × 360 µm) and low power (280 mW). The analog equalizer is implemented as an 8-way interleaved, 4-tap discrete-time linear filter. The equalization improved the data rate of a 102 cm backplane interconnect by 110%. On-die adaptive logic determines optimal receiver settings through comparator offset cancellation, data alignment of the transmitter and receiver, clock de-skew and setting filter coefficients for equalization. The noise-margin degradation due to statistical variation in converged coefficient values was less than 3%.

Index Terms—Adaptive equalizers, analog equalization, high-speed I/O, offset cancellation, transceivers, waveform capture.

I. INTRODUCTION

AS DATA RATES increase, the variation in channel responses becomes more pronounced and adaptive solutions are desirable to maximize link performance. PC desktop configurations have channel lengths that range from 5 to 17 cm with one socket, while server configurations range from 25 to 100 cm of channel length with sockets and connectors. Channel loss is a function of frequency, interconnect length and discontinuities, which results in intersymbol interference (ISI). To accommodate many different interconnects, PC desktop and backplane topologies and configurations, we implemented an adaptive equalizer to remove the ISI and extend the maximum I/O data rate. In general, the equalizer can be implemented at the transmitter or receiver. Adaptive receiver equalization has advantages over adaptive transmit equalization. First, transmit equalization constrains the magnitude sum of the equalizer taps, which reduces the cursor amplitude. Second, adaptive transmit equalization requires the receiver information be conveyed back to the transmitter [1].

The linear equalizer is implemented as an analog 4-tap discrete-time finite impulse response (DT-FIR) filter [2]–[4]. The filter coefficients are determined by the on-die adaptive control unit (ACU) that updates the coefficients based on a modified partial zero-forcing (PZF) algorithm or a modified sign-sign least mean squares (SS-LMS) algorithm [2]. The ACU also controls transmitter and receiver data alignment, per bit clock de-skew and comparator offset cancellation. At lower data rates and higher noise-margins, the bit error ratios (BER) become very low and are therefore an ineffective metric for comparing link margin and performance. More efficient relative voltage margin comparisons between multiple links are made possible by a figure of merit called normalized estimated noise-margin (NENM).

With equalization enabled, the data rate improved by a factor of 1.3 to 2.1 depending on the channel characteristics. The cost of the equalization is relatively low in terms of I/O power (280 mW at 8 Gb/s with 1.7-V supply) and area (360 µm × 360 µm) in a 0.13-µm bulk CMOS technology.

Manuscript received April 15, 2004; revised July 30, 2004.

J. E. Jaussi, G. Balamurugan, D. R. Johnson, B. Casper, A. Martin, J. Kennedy, and R. Mooney are with Circuits Research, Intel Labs, Hillsboro, OR 97124 USA (e-mail: [email protected]).

N. Shanbhag is with the University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA.

Digital Object Identifier 10.1109/JSSC.2004.838009

II. LINK ARCHITECTURE

A diagram of the link architecture is shown in Fig. 1. Unidirectional unencoded non-return-to-zero (NRZ) data is transmitted in parallel with a unidirectional clock. The forwarded clock is used to eliminate the need for data coding and its associated circuit complexity, port latency and bandwidth overhead [5]. The clock is transmitted with a differential cascode current-mode driver with terminating resistances to ground [6]. The clock and data transmitter are identical circuits. The clock receiver is a delay-locked loop (DLL) used to lock the 0° clock phase to the 180° phase and produce four global phases and their complements separated by 45° [6]. These global phases are buffered and driven to each I/O cell. Local to the I/O cell, an interpolator receives the differential low swing clock phases as shown in Fig. 2. The clock phases pass through the digital coarse select that is implemented as analog pass gates. The two selected adjacent phases are applied to the differential stages with common linearized load elements [7]. The digital fine select adjusts the transconductance of each differential input stage through digitally controlled current sources to achieve a 6.4° resolution. The output of the interpolator feeds the clock to the phase generator.
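
The coarse/fine phase selection described above can be illustrated with a small idealized model. This is only a sketch: the number of fine steps (`fine_steps=7`) and the perfectly linear blending are assumptions, not values stated in the text. Seven equal steps between adjacent 45° phases is simply one setting that is numerically consistent with a resolution of about 6.4.

```python
PHASE_SPACING_DEG = 45.0  # spacing of the global phases and their complements


def interpolated_phase_deg(coarse: int, fine: int, fine_steps: int = 7) -> float:
    """Ideal output phase for a coarse/fine code pair.

    coarse selects one of eight adjacent-phase pairs; fine blends linearly
    between the two selected phases (hypothetical, idealized behavior).
    """
    if not 0 <= coarse < 8:
        raise ValueError("coarse code out of range")
    if not 0 <= fine < fine_steps:
        raise ValueError("fine code out of range")
    step = PHASE_SPACING_DEG / fine_steps
    return (coarse * PHASE_SPACING_DEG + fine * step) % 360.0


fine_resolution_deg = PHASE_SPACING_DEG / 7  # about 6.43 degrees per fine step
```

A real interpolator is not perfectly linear, so the step size varies across the fine code range.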

A phase generator produces eight phases that clock the samplers at the front-end of the filter and the current latches (I-latch). Fig. 3 shows state elements that accept low swing differential clocks, where the inputs and outputs are complementary signals. During the reset state, a 1 0 0 0 state is set for both the upper and lower state machines. During port operation, this state cycles continuously through the flip-flops. The timing diagram shown in Fig. 4 contains the clock and phase relationship information. The even and odd phases are generated with respect to the rising and falling clock edge, respectively. These clock phases are then applied to the FIR filter.

0018-9200/$20.00 © 2005 IEEE

Fig. 1. Link architecture.

Fig. 2. Interpolator design.

The authors of [3] proposed an FIR filter architecture that holds the samples stationary while rotating the filter coefficients. By holding the analog samples stationary, a high signal-to-noise ratio (SNR) can be maintained [3]. This work proposes maintaining both the samples and filter coefficients stationary, thereby drastically reducing the dynamic power requirements. In order to keep both terms stationary, the filter must be interleaved by one more than the number of filter taps. The 4-tap FIR filter is interleaved by eight to allow sufficient time for samples to be acquired and transient currents to settle before the evaluation of the I-latch. Based on simulations, the proposed 8-way interleaved FIR filter with stationary samples and filter coefficients consumes less power compared with a two-way interleaved FIR filter with rotating weights.
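
As a behavioral illustration only (not the circuit), the following sketch models an 8-way interleaved, 4-tap filter in which each sample is written once into its own hold slot and never moved, and the tap weights stay fixed. The names `interleaved_fir`, `slots`, `TAPS` and `WAYS` are hypothetical; the check against `numpy.convolve` shows that holding samples stationary still produces the ordinary transversal FIR response.

```python
import numpy as np

TAPS = 4   # filter taps
WAYS = 8   # interleaving factor (one hold slot per sampling phase)


def interleaved_fir(x, w):
    """Behavioral model: output n is formed by filter slice n % WAYS."""
    slots = np.zeros(WAYS)            # sample-and-hold slots, initially empty
    y = np.zeros(len(x))
    for n, sample in enumerate(x):
        slots[n % WAYS] = sample      # a new sample overwrites only its own slot
        # The slice reads the four most recent slots in place; neither the
        # samples nor the weights w[k] are shifted or rotated.
        y[n] = sum(w[k] * slots[(n - k) % WAYS] for k in range(TAPS))
    return y


x = np.random.default_rng(0).standard_normal(64)
w = np.array([-0.1, 1.0, -0.3, 0.05])   # arbitrary example weights
y = interleaved_fir(x, w)
reference = np.convolve(x, w)[: len(x)]
```

Because a slot is reused only every eight unit intervals, each held sample remains valid for all four filter slices that need it.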

The total simulated power for the transceiver is 280 mW with a power supply of 1.7 V. The power breakdown is 46% for the transmitter and 54% for the receiver. The analog filter and clock phase generator constitute 66% of the receiver power.
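
The percentages above translate into absolute figures by simple arithmetic; the variable names are illustrative, and the results are derived values rather than numbers reported separately in the text.

```python
total_mw = 280.0                     # simulated transceiver power at 1.7 V
tx_mw = 0.46 * total_mw              # transmitter share: 128.8 mW
rx_mw = 0.54 * total_mw              # receiver share: 151.2 mW
filter_and_clkgen_mw = 0.66 * rx_mw  # analog filter + phase generator: about 99.8 mW
```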

Each filter tap consists of a voltage-to-current converter (VIC) and a current steering DAC (I-DAC) shown in Fig. 5. The VIC is a differential input stage where the resistance shown as R1 and R2 can be bypassed by closing switches S1 and S2. Therefore, the source degeneration can be enabled or disabled through digital control. The I-DAC is implemented with binary-weighted parallel NMOS devices. The sources of the I-DAC devices are connected to the drains of the current mirror. The drains of the I-DAC devices are connected to either the input of the I-latch or the VCC power supply. During port operation, the VIC’s output current is entirely passed to the I-latch input or the VIC’s output current is divided by the filter coefficient value. The I-DAC outputs from the four taps are then summed together at the input of the I-latch. Another component that sums into the I-latch is the offset current DAC (C-DAC). The C-DAC is sized as a cascode current source with the lower device designed as a binary weighted, selectable current source. A fixed upper cascode device minimizes the capacitive load at the I-latch input. The purpose of the C-DAC is to cancel unavoidable intrinsic offsets due to device mismatches in the I-latch, current mirror and VIC. It is also required to provide a reference offset during the adaptation process.

Fig. 3. Clock phase generator.

Fig. 4. Sampling phase and latch evaluation phase timing diagram.

Fig. 5. FIR filter architecture.

The I-latch requires a low input impedance to maintain a small voltage swing and keep the input parasitic pole as high as possible, even though many devices from the I-DAC and C-DAC are connected to its input. The I-latch architecture is based on the current summing circuit proposed in [8], where the upper devices, M6 and M7, are forced into the linear region by M4 and M5, respectively. The linear devices provide a low impedance node for current summation.

Each tap, by way of the I-DAC, multiplies the differential and common-mode current by the weight factor. Depending on the polarity of the weight coefficient, the differential summed current can decrease in magnitude. The common-mode current, however, will continue to increase per tap. The I-latch devices were sized to accommodate large variations in the common-mode current due to multiple taps with a wide range of possible tap weights.

The required transversal behavior for the FIR filter operation is achieved by taking adjacent samples with adjacent clock phases, separated by one unit interval (UI). These samples are applied to the VIC and multiplied by the I-DAC. In order to reduce the required circuitry and power requirements, the output of each VIC is current mirrored to four different I-DACs that are associated with four independent filters. For example, when an I/O sample is acquired, it becomes the fourth, third, second and first sample for four adjacent FIR filters. Thus, the transversal behavior is implemented without requiring additional devices that load the I/O pad.

Fig. 6. Adaptive receiver diagram.

III. ADAPTIVE ALGORITHM

The on-die ACU cancels receiver offsets, sets the transmitter and receiver bit alignment, determines the optimal clock de-skew and optimizes the channel filter coefficients to cancel ISI [9]. Fig. 6 shows the block diagram of the ACU and its interface with the analog receiver. The adaptation of the receiver is based on a training sequence, which is summarized in the simplified flowchart in Fig. 7. There are five required states to adapt the receiver, numbered one through five. The following paragraphs describe the purpose of each state.

Fig. 7. Adaptive training sequence block diagram.

State 1: During the initial offset trim phase, dominant offsets at the eight I-latch inputs are cancelled. The transmitter sends “0” and “1” dc patterns. For each transmitted dc pattern, the C-DAC offset code is incremented until the I-latch outputs are tripped to the opposite state. The offset code results from each dc value are averaged, leaving the intrinsic offset value. By initially removing intrinsic offsets, poor sensitivities do not dominate the filter response and the risk of a suboptimum filter response is substantially reduced.
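
The trim procedure of State 1 can be sketched as follows. The latch model (`make_latch`), the signed mid-scale code `MID`, and the unit-less swing and LSB values are all hypothetical; only the procedure itself (step the code until the output trips for each dc pattern, then average the two trip codes) follows the description above.

```python
MID = 512  # assumed mid-scale code of a 10-bit offset DAC (zero offset current)


def make_latch(offset, swing=100.0, lsb=1.0):
    """Hypothetical comparator: outputs 1 when signal + offset + DAC current > 0."""
    def latch(bit, code):
        signal = swing if bit else -swing
        return int(signal + offset + (code - MID) * lsb > 0)
    return latch


def trip_code(latch, pattern_bit, max_code=1023):
    """Increment the offset code until the latch leaves its starting state."""
    start = latch(pattern_bit, 0)
    for code in range(1, max_code + 1):
        if latch(pattern_bit, code) != start:
            return code
    raise RuntimeError("latch never tripped")


def intrinsic_offset_code(latch):
    """Average the trip codes for the dc '0' and dc '1' patterns."""
    return (trip_code(latch, 0) + trip_code(latch, 1)) / 2


trim = intrinsic_offset_code(make_latch(offset=7.0))
```

Averaging removes the contribution of the transmitted swing, which shifts the two trip points by equal and opposite amounts in this model, and leaves the code that cancels the intrinsic offset.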

State 2: The alignment phase consists of optimally aligning the received bit-stream with the expected training pattern to enable decision-directed adaptation. This state has a two-step process involving coarse alignment and fine alignment. Initially 0’s are transmitted followed by 32 1’s. In the presence of channel loss, the 0 to 1 transition will be detected, albeit late. Even though it is detected late, the location of the transition narrows the alignment window. Then, a 128 bit pseudo-random bit sequence (PRBS) is transmitted repetitively. Even in the presence of ISI induced errors and with no equalization, the training patterns can be aligned by correlating the expected pattern with the received pattern. The adaptor adjusts the variable delay multiplexer (shown in Fig. 6) until the number of detected errors reaches a minimum, indicating the patterns are aligned.
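
The fine-alignment step can be illustrated with a minimal software analogue: try each candidate delay and keep the one with the fewest mismatches against the expected pattern. The function name `best_delay`, the random stand-in for the training pattern, and the injected bit errors are assumptions made for the example.

```python
import random


def best_delay(expected, received, max_delay):
    """Return the delay setting with the fewest mismatches against the pattern."""
    n = len(expected)

    def errors(delay):
        return sum(received[(i + delay) % len(received)] != expected[i]
                   for i in range(n))

    return min(range(max_delay + 1), key=errors)


rng = random.Random(1)
pattern = [rng.getrandbits(1) for _ in range(128)]   # stand-in 128 bit pattern
received = pattern[-5:] + pattern[:-5]               # channel delays the pattern by 5 bits
for pos in rng.sample(range(128), 10):               # emulate ISI-induced bit errors
    received[pos] ^= 1
aligned_delay = best_delay(pattern, received, max_delay=15)
```

Even with errors present, a misaligned pseudo-random pattern mismatches in roughly half its positions, so the minimum is easy to identify.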

Fig. 8. Modified adaptor implementation.

State 3: The adaptation phase optimally determines the filter coefficients and offsets. Both filter and offset coefficients need to be adapted simultaneously, as input offsets are scaled by the filter coefficients before they appear at the latch input. While the eight interleaved equalizers share the same filter tap weights, the offset C-DACs are independently controlled to account for within-die mismatch. Two adaptive update equations were implemented using sign-sign least mean squares (SS-LMS) and partial zero-forcing (PZF) algorithms. To implement the SS-LMS algorithm, one of the eight filters has independently controlled coefficients. The adjacent filters are enabled to extract the sign of the received samples. Following adaptation, the filter coefficients are shared with the remaining seven filters. This modification avoids the requirement of additional samplers and comparators. All eight filters can be adapted simultaneously using the PZF algorithm implementation. Only the PZF algorithm is described in this paper. The PZF coefficient update equations are variations of those in [10] and are given by the equations shown at the bottom of the page, where is the vector of filter (offset reference) coefficients, is the desired data vector, is the current adaptation error, is the filter (offset) tap index, is the update index, is the actual transmit binary symbol and represents the adaptation polarity. and are the update step sizes for coefficients and offsets, respectively. The updates are performed in a block-based fashion by averaging over 32 bits. Use of selective updates through the indicator function simplifies the adaptor implementation as shown in Fig. 8 by eliminating the need for a high-speed analog-digital interface. The selective update, followed by averaging the results for both polarities, allows both the offset cancellation current and adaptation reference current to be realized using a single C-DAC.
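
The sketch below is not the authors' update equation. It illustrates a generic sign-based zero-forcing update in the spirit of [10], with a selective (indicator-gated) update and block averaging over 32 bits as described above. The function name `pzf_block_update`, the step size `mu`, the error-sign convention (comparator output minus reference) and the tap indexing are all assumptions.

```python
def pzf_block_update(weights, data, err_sign, polarity, mu=0.25, block=32):
    """One block update of the tap weights (hypothetical formulation).

    data:     known training symbols, each +1 or -1
    err_sign: sign of (filter output - reference) for each bit, +1 or -1
    polarity: only bits whose transmitted symbol equals this value are used
    """
    taps = len(weights)
    corr = [0.0] * taps
    for n in range(taps - 1, block):
        if data[n] != polarity:        # indicator function: skip the other polarity
            continue
        for k in range(taps):
            # Zero forcing correlates the error with the known data symbols.
            corr[k] += err_sign[n] * data[n - k]
    # Average over the block and step against the accumulated correlation.
    return [w - mu * c / block for w, c in zip(weights, corr)]


data = [+1] * 32
err = [+1] * 32
updated = pzf_block_update([0.0] * 4, data, err, polarity=+1)
skipped = pzf_block_update([0.0] * 4, data, err, polarity=-1)
```

Gating on one polarity means the comparator only ever needs a single fixed reference per run, which is one way to read the remark that the selective update removes the need for a high-speed analog-digital interface.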


Fig. 9. Transceiver cell plot.

State 4: Following adaptation and while the 128 bit PRBS training pattern is still being transmitted, the noise margin for the current sampling phase is estimated by setting the C-DAC offsets to their maximum offset value and then decreasing the offset until no errors are detected. This offset code is recorded along with the sampling phase code. States 1 through 4 are then repeated for the desired number of sampling phases, controlled through the interpolator.
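
A minimal sketch of the State 4 search, assuming a hypothetical callback `count_errors(code)` that returns the number of training-pattern errors observed with a given offset code applied; the maximum code and the threshold of 37 in the example are illustrative values only.

```python
def estimate_margin_code(count_errors, max_code=511):
    """Lower the offset from its maximum until the pattern is received error-free."""
    for code in range(max_code, -1, -1):
        if count_errors(code) == 0:
            return code
    return None  # no error-free setting found at this sampling phase


# Hypothetical link: error-free as long as the injected offset stays at or below 37.
margin_code = estimate_margin_code(lambda code: 0 if code <= 37 else 5)
```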

State 5: The optimal sampling phase is selected based on the sampling phase code with the largest offset value and lowest number of detected errors. This concludes the port training process and the adaptor is disabled. The filter and offset settings are held unchanged for the duration of the port operation.
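
The selection rule of State 5 can be written compactly. The record layout and the tie-breaking order (largest offset first, then fewest errors) are assumptions; the text states both criteria without ranking them.

```python
def pick_sampling_phase(records):
    """records: iterable of (phase_code, margin_offset_code, detected_errors)."""
    best = max(records, key=lambda r: (r[1], -r[2]))
    return best[0]


# Illustrative sweep results, not measured data.
sweep = [(0, 10, 0), (1, 37, 0), (2, 37, 2), (3, 20, 0)]
chosen_phase = pick_sampling_phase(sweep)
```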

The adaptation process requires approximately 20 ms to complete. While testing the chip, supply voltage and temperature were well controlled so that the equalization settings and clock de-skew settings did not require updates. For implementations where significant environment variations exist, a subset of the adaptive algorithm can be run. For example, States 3 and 4 would scan through offset from the original optimum de-skew phase. State 5 would also run, selecting the new optimum sampling phase. Depending on the gradient of environment change, the frequency of additional adaptation runs can be determined. The abbreviated adaptive sequence would require only hundreds of microseconds.

IV. EXPERIMENTAL RESULTS

A. Circuit Characterization

A transceiver I/O cell plot is shown in Fig. 9. The chip was fabricated in 0.13-µm bulk CMOS technology and was flip-chip bonded into a ball-grid array package. The test board consists of two chips separated by 5 cm of FR4 and 17 cm of FR4. An additional chip is connected through a backplane with 55 cm and 102 cm FR4 links.

Fig. 10. An FIR tap receiver sensitivity at 6 GS/s.

Fig. 11. Simulated FIR tap input referred device noise breakdown.

Through measurements, we determined that when source degeneration in the VIC was enabled, the filter sensitivity was very poor since the degeneration severely limited the maximum differential current. To obtain a reasonable sensitivity, we disabled the source degeneration and the differential current increased through the differential input pair by a factor of four. With the source degeneration disabled and the weight value of the I-DAC set to its maximum value, the sensitivity of one filter tap (which includes a VIC, I-DAC and I-latch shown in Fig. 5) is shown in Fig. 10 as a function of input differential voltage. The measured behavior follows a Gaussian distribution with the sigma value of 6.3 mV at 6 GS/s. The Gaussian distribution suggests that the sensitivity is dominated by device noise. Taking a closer look at the device noise effects by using a time domain noise simulator, it is clear that the latch contributes a significant portion of the input referred noise as shown in Fig. 11. The I-latch devices make up 73% of the input referred noise where the input devices and current mirror yield a rather small 6% of the total input referred noise. Simulations confirm that redesigning the latch with a pre-amplification stage and optimizing the equalizing switch can dramatically improve the sensitivity. Fig. 12 shows the measured peak-to-peak input referred noise as a function of C-DAC offset code. This measurement shows the effective sensitivity due to various offsets from the C-DAC. With the peak-to-peak sensitivity curve relatively flat as a function of the C-DAC offset code, the C-DAC contributes negligible amounts of device noise for a reasonable offset range.
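
To make the Gaussian sensitivity figure concrete, the following sketch evaluates the probability that a comparator with 6.3 mV of input-referred Gaussian noise resolves a given differential input as a "1". It assumes zero residual offset and purely Gaussian noise; the function name is illustrative.

```python
import math

SIGMA_MV = 6.3  # measured sigma at 6 GS/s, from the text


def prob_resolve_one(v_diff_mv, sigma_mv=SIGMA_MV):
    """Gaussian CDF: probability the latch output is 1 for a given input."""
    return 0.5 * (1.0 + math.erf(v_diff_mv / (sigma_mv * math.sqrt(2.0))))


p_zero = prob_resolve_one(0.0)        # 0.5: the decision is a coin flip
p_one_sigma = prob_resolve_one(6.3)   # about 0.84 at one sigma of input
```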

Nonidealities that limit the accuracy of linear equalizers also hinder their effectiveness. For this architecture, the VIC is the most critical element since it sets the resolution ceiling for the filter design. The VIC was designed to have five bits of linearity when the source degeneration is enabled. The measured integral nonlinearity (INL) resolution with and without source degeneration is 5.6 bits and 3.0 bits, respectively. Because of the aforementioned I-latch sensitivity issue, source degeneration was not used. The second important linearity component is the I-DAC. The I-DAC is designed to have 6.0 bits of resolution while the measured INL was 3.3 bits. Finally, the C-DAC is designed with a 10-bit resolution and the measured INL was 6.7 bits. For both the I-DAC and C-DAC, the accuracy could be enhanced by increasing the design area to implement better layout techniques. However, because the footprint was kept small to be compatible with parallel port implementations, some accuracy was sacrificed.

Fig. 12. Peak-to-peak FIR tap sensitivity as a function of C-DAC offset.

Fig. 13. FIR tap offset histogram, normalized to maximum transmit swing of ±900 mV (72 samples).

Fig. 13 shows the statistical offset distributions for 72 filter taps and the relative contribution of I-latch offsets, current mirror offsets and the transmitter and VIC offsets. The offset code is normalized based on a maximum transmitter differential swing of ±900 mV. The I-latch contributes the largest amount of offset. This offset is directly canceled by the C-DAC. Current mirror and VIC offsets, however, are multiplied by the weight coefficient and will change as the ACU updates the filter weights. Hence, the filter tap weights and offset coefficients need to be updated simultaneously as described in Section III.

B. Adaptive Algorithm Performance and Characterization

Additional states were introduced in the synthesized ACU to enable observation of the adaptation dynamics. A typical example of measured coefficient and offset evolution curves is shown in Fig. 14. Fractional step sizes can be approximated by using excess precision in the registers storing the C-DAC offsets and FIR coefficients. It can be seen that while the FIR tap weights converge to the same value regardless of the adaptation polarity, the offset C-DAC codes converge to different values since they include the polarity-dependent adaptation reference. The starting value for the C-DAC codes is the initial calculated offsets as determined in State 1 of the ACU (Section III). As shown, the starting values are significantly different for each C-DAC, i.e., the offset of each latch can be significantly different. When the average of the two adaptive polarities is calculated, the resulting offset takes into account the offset from the four VICs, four I-DACs, and one I-latch. A typical convergence time of State 3 is less than 25 µs.

The statistical nature of the adaptation process implies a certain amount of random variation in the converged coefficient values. This variation degrades noise-margins above and beyond those due to channel and circuit nonidealities. Fig. 15 shows this variation for the eight I-latch offset values over 500 iterations of the adaptor. The variation is grouped into five bins. Fractional step sizes of bit of an offset code and bit of an I-DAC code were used for the offsets and filter coefficients, respectively. The filter coefficients have a tighter distribution since more error bits are available for their adaptation. As all the interleaved FIR filters share the same filter coefficients, 64 bits (half of the 128 error bits) are available for adaptation for each polarity as described in Section III. However, the offsets are independently controlled and each offset code is updated based on just eight bits of error information (1/8th of 64 bits). Hence, greater variation is observed in the offset codes compared to the filter tap weights. The additional degradation in noise-margins due to these statistical fluctuations is found to be less than 3% of the transmit swing.
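
The bit counts in the paragraph above follow from simple arithmetic. The final ratio is an added rule-of-thumb estimate, assuming independent error bits so that averaging over N bits reduces the spread by the square root of N; it is not a figure from the measurements.

```python
import math

pattern_bits = 128
bits_per_polarity = pattern_bits // 2        # 64 error bits per adaptation polarity
interleaved_latches = 8
bits_per_offset = bits_per_polarity // interleaved_latches   # 8 bits per offset code

# Assumed scaling only: relative spread of offset estimates vs. tap-weight estimates.
relative_spread = math.sqrt(bits_per_polarity / bits_per_offset)   # about 2.8
```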

C. Link and Circuit Performance

Using BERs to evaluate the operating margins with and without the equalizer enabled for a single data rate can become impractical if the BER is very low. For this reason, we have chosen to use the measured normalized estimated noise-margin (NENM). We normalize this figure of merit by the transmit swing. Using the estimated noise-margin calculated by the adaptor, the relative increase in the link margin can be determined. Fig. 16 shows the link margins over 102 cm of FR4 at 4 Gb/s with and without equalization as a function of interpolator sampling phase, demonstrating the dramatic improvement in negative link margin to positive link margin afforded by equalization. Additionally, it shows that there are a number of sampling phases that yield the peak voltage margins. During the adaptation phase, the ACU records this estimated noise-margin plot and picks the optimum sampling phase based on the maximum margins. Another method available for port characterization and to show the improvement from equalization is the pulse response capture capability. Using waveform capture methods [6], the discrete pulse responses with and without equalization are captured at 3.2 Gb/s over the 102 cm channel as shown in Fig. 17. Again, the y-axis is normalized by the dc transmitter swing. The pulse response before equalization shows pre- and post-cursor ISI. After equalization, the first pre-cursor and the first and second post-cursor ISI terms are significantly reduced. The cursor value is essentially unchanged as compared to the reduced cursor value in transmit equalization [1]. The residual ISI is a result of the finite resolution and span of the FIR filter.

Fig. 14. Evolution of four coefficients and eight offsets for 4 Gb/s transmission over 102 cm FR4.

Fig. 15. Four histograms of calculated offsets during adaptation (500 iterations).

Fig. 16. Estimated noise margin with and without equalization over 102 cm FR4 at 4 Gb/s.

Fig. 17. Comparison of a sampled pulse response with and without equalization over 102 cm FR4 at 3.2 Gb/s with one pre-cursor and two post-cursor taps.

Summarized in Fig. 18 is the measured NENM given as a function of data rate for four different FR4 channels. These measurements were made with the same link operating conditions for all data rates. The margins with and without equalization are compared. Assuming the minimum required noise-margin is 10% of the transmit signal swing, the 4-tap adaptive equalizer increases the achievable data rates by 33% for 5 cm and 17 cm FR4, 60% for 55 cm FR4 and 110% for 102 cm FR4. A 10% NENM measured at the receiver approximately equates to a BER less than . For maximum data rates, the supply was elevated from 1.5 to 1.7 V and the circuit was optimized to reach a data rate of 8 Gb/s over 5 cm FR4, 8 Gb/s over 17 cm FR4 and 5.6 Gb/s over 50 cm FR4, all with a BER of .

Fig. 18. Measured comparison of NENM versus data rate for 5–102 cm FR4 at 1.5-V VCC.

To quantify the maximum amount of loss that the FIR filter can equalize given the previously measured nonlinearities and sensitivities, the link is operated over a lossy interconnect. Based on the characterization of this lossy interconnect and the measured pad capacitance, the approximate channel loss is 16 dB at 3.75 GHz. At this frequency, the FIR filter equalized the channel to zero NENM at 7.5 Gb/s.
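
Two quick conversions help put these figures in context: 3.75 GHz is the Nyquist frequency of 7.5 Gb/s NRZ data, and 16 dB of loss corresponds to an amplitude ratio of roughly 0.16. The variable names are illustrative and the ratio is a derived value, not a measured one.

```python
data_rate_gbps = 7.5
nyquist_ghz = data_rate_gbps / 2            # 3.75 GHz for NRZ signaling

loss_db = 16.0
amplitude_ratio = 10 ** (-loss_db / 20)     # about 0.158 of the launched amplitude
```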

V. CONCLUSION

An adaptive, analog 4-tap FIR filter is described. The analog circuitry requires relatively low power and small silicon area. Significant data rate improvements, up to 110%, are shown when the equalizer is enabled. The receiver equalization demonstrates several interpolator sampling phases that yield the maximum received voltage margin. The PZF and SS-LMS algorithms are modified to accommodate operating the digital adaptive control circuitry at a lower frequency than the I/O data rate to conserve power. Using an adaptive training sequence, the comparators are offset trimmed, the transmitter and receiver are logically aligned, sampling clocks are de-skewed and equalization coefficients are optimized. Multiple adaptive iterations yielded minor statistical variations in the filter coefficients and offset values.

ACKNOWLEDGMENT

The authors would like to thank G. Dermer and C. Roberts for design and characterization of the test boards, and H. Wilson, J. Howard, G. Ruhl, D. Klowden, K. Truong, C. Parsons and K. Ikeda for their help in building the test chip.

REFERENCES

[1] J. T. Stonick et al., “An adaptive PAM-4 5 Gb/s backplane transceiver in 0.25-μm CMOS,” IEEE J. Solid-State Circuits, vol. 38, no. 3, pp. 436–443, Mar. 2003.

[2] J. E. Jaussi et al., “8 Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation and clock de-skew,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, pp. 244–246.

[3] T. Lee et al., “A 125-MHz CMOS mixed-signal equalizer for gigabit ethernet on copper wire,” Proc. IEEE CICC, pp. 131–134, 2001.

[4] R. Farjad-Rad et al., “A 0.3-μm CMOS 8 Gb/s 4-PAM serial link transceiver,” IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 757–764, May 2000.

[5] R. Mooney et al., “A 900 Mb/s bidirectional signaling scheme,” IEEE J. Solid-State Circuits, vol. 30, no. 12, pp. 1538–1543, Dec. 1995.

[6] B. Casper et al., “8 Gb/s SBD link with on-die waveform capture,” IEEE J. Solid-State Circuits, vol. 38, no. 12, pp. 2111–2120, Dec. 2003.

[7] S. Sidiropoulos et al., “A semi-digital DLL with unlimited phase shift capability and 0.08–400 MHz operation range,” in IEEE ISSCC Dig. Tech. Papers, Feb. 1997, pp. 332–333.

[8] D. Comer et al., “A high-frequency CMOS current summing circuit,” Analog Integrated Circuits and Signal Processing, pp. 215–220, 2003.

[9] G. Balamurugan et al., “Receiver adaptation and system characterization of an 8 Gbps source-synchronous I/O Link using on-die circuits in 0.13-μm CMOS,” in VLSI Symp. Tech. Dig., Jun. 2004, pp. 356–359.

[10] J. G. Proakis, Digital Communications, 4th ed. New York: McGraw-Hill, 2000.


James E. Jaussi (M’01) received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT, in 2000. He is currently working toward the Ph.D. degree in electrical engineering at Oregon State University, Corvallis, OR.

For the past four years, he has worked for Intel Laboratories, Hillsboro, OR. His main focus is research, design, and characterization of high-speed CMOS transceivers and mixed signal circuits, with an emphasis in receiver equalization and clock extraction architectures.

Ganesh Balamurugan (S’99) received the Ph.D. degree from the University of Illinois at Urbana-Champaign in 2004.

His research interests include adaptive equalization and noise cancellation in high-speed I/O, and noise-tolerant digital system design.

David R. Johnson (M’97) received the B.S. degree in electrical engineering from the South Dakota School of Mines and Technology (SDSMT), Rapid City, SD, in 1996 and the M.S. degree in electrical engineering from the Oregon Graduate Institute (OGI), Beaverton, OR, in 1997.

Since 1997, he has been a Design Engineer at Intel Corporation, where his research focus has been to develop integrated CMOS circuit designs for applications in high speed copper link interfaces. He is currently a strategic design lead in the Enterprise Products Group (EPG).

Bryan Casper (S’97–M’98) received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT.

He is a Circuit Researcher with Intel Labs, Hillsboro, OR. He joined Intel in 1998. His current responsibilities include research, design, validation and characterization of high-speed mixed-signal circuits and I/O systems.

Aaron Martin (M’99) received the Bachelor’s and Master’s degrees in electrical engineering from Brigham Young University, Provo, UT.

He is a Signaling Circuits Researcher at Intel Labs, Hillsboro, OR. For the past four years, his responsibilities have included researching, developing and testing I/O circuits for PC and server platform interconnects.

Joseph Kennedy (S’88–M’91) received the B.S. degree in electrical and computer engineering from Oregon State University, Corvallis, OR, in 1991.

He is a Senior Circuits Researcher with Intel’s Circuits Research Labs, Hillsboro, OR. Over the past nine years at Intel, his responsibilities have included all aspects of research and design of high-speed mixed-signal circuits and I/O systems. Prior to joining Intel, Joe spent four years with Lattice Semiconductor where he worked as a lead circuit designer developing data-path circuits and I/O interfaces for electrically programmable logic components.

Naresh R. Shanbhag received the B.Tech. degree from the Indian Institute of Technology, New Delhi, India, in 1988, the M.S. degree from the Wright State University in 1990, and the Ph.D. degree from the University of Minnesota, Minneapolis, in 1993, all in electrical engineering.

From 1993 to 1995, he worked at AT&T Bell Laboratories, Murray Hill, NJ, where he was the lead chip architect for AT&T’s 51.84 Mb/s transceiver chips over twisted-pair wiring for Asynchronous Transfer Mode (ATM)-LAN and very high-speed digital subscriber line (VDSL) chip-sets. Since August 1995, he has been with the Department of Electrical and Computer Engineering, and the Coordinated Science Laboratory where he is presently a Professor. His research interests are in the design of integrated circuits and systems for broadband communications, including digital signal processing and error-control coding algorithms and VLSI architectures, and digital and analog integrated circuit design. He has published more than 90 journal articles/book chapters/conference publications in this area and holds three U.S. patents. He is also a co-author of the research monograph Pipelined Adaptive Digital Filters (Kluwer, 1994).

Dr. Shanbhag received the 2001 IEEE TRANSACTIONS ON VLSI Best Paper Award, the 1999 IEEE Leon K. Kirchmayer Best Paper Award, the 1999 Xerox Faculty Award, the National Science Foundation CAREER Award in 1996, and the 1994 Darlington Best Paper Award from the IEEE Circuits and Systems Society. Since July 1997, he has been a Distinguished Lecturer for the IEEE Circuits and Systems Society. From 1997 to 1999 and from 1999 to 2002, he served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS: PART II and the IEEE TRANSACTIONS ON VLSI, respectively. He has served on the technical program committees of major conferences such as the IEEE Conference on Acoustics, Speech and Signal Processing, the IEEE International Symposium on Low-Power Electronic Design, the IEEE Workshop on Signal Processing Systems, and the IEEE International Symposium on Circuits and Systems.

Randy Mooney (M’88) received the M.S. degree in electrical engineering from Brigham Young University, Provo, UT.

He is currently an Intel Fellow and director of I/O research in Intel’s Corporate Technology Group. He was with the Standard Products Division of Signetics Corporation in Orem, UT, from 1980 to 1992, where he developed products in Bipolar, CMOS, and BiCMOS technologies, concentrating on components for use in bus driving applications. In 1992, he joined the Supercomputer Systems Division of Intel Corporation, and worked on development of interconnect components for parallel processor communications. He was responsible for the development of signaling technology for these components, and developed a method of simultaneous bidirectional signaling that was used for the Intel Teraflops Supercomputer. His current work is focused on multi-gigabit, differential, serial stream based interfaces and point-to-point memory interfaces.