    This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

    Embedded Debug Architecture for Bypassing Blocking Bugs During Post-Silicon Validation

    Ehab Anis Daoud, Member, IEEE, and Nicola Nicolici, Member, IEEE

    Abstract: Once a bug is found during post-silicon validation, any other bugs that have escaped pre-silicon verification are expected to be identified as well before committing to a silicon respin of the design. This minimizes the number of respins, which in turn reduces the implementation costs. However, this is hindered by the presence of blocking bugs in one erroneous module, which inhibit the search for bugs in other parts of the chip that process data received from this erroneous module. To address this problem, in this paper we propose a novel embedded debug architecture for bypassing the blocking bugs when dealing with deterministic debug experiments.

    Index Terms: Blocking bugs, post-silicon validation.

    I. INTRODUCTION

    MODERN system-on-a-chip (SoC) designs contain tens or even hundreds of logic blocks. Many of these logic blocks are embedded cores, which are reused from previous designs in order to shorten the implementation cycle. Although these embedded cores are verified as independent entities, the complex interactions between them can be exercised only when the SoC is integrated. However, because the SoC gate count is commonly in excess of one million gates, pre-silicon verification [2] does not scale to formally proving that the circuit is free of design errors. Therefore, once sufficient confidence has been reached during pre-silicon verification, the circuit is sent for fabrication. Fabrication defects can affect individual devices, in which case known manufacturing test techniques can be used to screen the defective chips [3]. Nonetheless, if manufacturing test is successful and the circuit works on the tester but fails in the field when exercised by a user application, the root cause can be either undetected fabrication defects or design errors (or bugs) that have escaped to silicon. Understanding what causes in-field failures is referred to as silicon debug and diagnosis [4]. If the problem is caused by a design error and it affects every single fabricated device, then searching for bugs in silicon is also referred to as post-silicon validation and debug, or just post-silicon validation.

    Embedded cores in an SoC interact with each other and pass data between producer and consumer cores. It is often the case that a design bug in a producer core causes the results of a consumer core to fail, regardless of whether the consumer core is correctly implemented or not. In this case, if there are any bugs in the consumer core, their excitation and manifestation can be masked by the erroneous values at its inputs. If no mechanism is provided to help post-silicon validation adequately validate the consumer core, then, if this consumer core is indeed erroneous, its bugs will be found only after another silicon respin. This problem lengthens the implementation cycle, adds to the mask costs, and affects the time-to-market. Therefore, it provides the key motivation for the work presented in this paper. To better understand the context in which our solution is relevant, we first outline the main challenges and approaches used in post-silicon validation.

    Manuscript received May 28, 2009; revised September 29, 2009. A preliminary version of this paper appeared in the Proceedings of the IEEE European Test Symposium (ETS), pp. 69–74, 2008.

    The authors are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: [email protected]; [email protected]).

    Digital Object Identifier 10.1109/TVLSI.2009.2038390

    A. Related Work

    There are two main types of bugs that escape to silicon. Electrical bugs are caused by imperfect device models, and narrowing down their root cause eventually requires probing the silicon [5]–[7]. Due to the complexity of this task, a pruning step that narrows down the suspect nets to be probed is necessary. This involves analyzing the data captured on-chip against the reference values from simulation [8]. The same task, which we regard as logic probing, is required when localizing functional bugs, which are the bugs that cause even the register-transfer level (RTL) source code to fail. Design-for-debug (DFD) techniques are necessary to aid logic probing, due to the limited observability of the internal nets in the circuit-under-debug (CUD).

    Scan chains, already present in the circuit for simplifying manufacturing test, are one of the most commonly used DFD techniques in practice [9]–[13]. When a particular event of interest occurs on-chip, the CUD state is captured in the scan chains and, through a scan dump, the state is offloaded to the debug software through a low-bandwidth interface, such as Boundary Scan [14]. The main advantage of using scan chains is the full visibility of the state elements. However, because the state needs to be dumped after acquisition, real-time collection of data in consecutive clock cycles is not possible. Moreover, in many cases the states that help most with debugging are the ones that occur before the capture events, which cannot be collected through scan. These problems can be addressed by complementary DFD techniques that collect a limited number of signals in real-time. They are commonly referred to as trace-based techniques and are discussed next.

    Real-time trace collection can be achieved off-chip [15] or on-chip [16]. For a large acquisition bandwidth, the off-chip approaches require more pins than the on-chip approaches, which use embedded memories as trace buffers. Trace buffer-based techniques (commonly referred to as embedded logic analysis) have been used in post-silicon validation for debugging microprocessors [17], [18], SoC designs [19]–[21], and field-programmable gate-array designs [22]–[24]. The usefulness of embedded logic analysis is limited by the capacity of the on-chip buffers, which determines how many signals can be traced and for how many clock cycles. Recent research has investigated how to improve the real-time observability in the presence of on-chip buffers that are limited in capacity [25]–[27]. Nonetheless, to the best of the authors' knowledge, none of the known techniques has dealt explicitly with blocking bugs, which are explained in the next subsection.

    Fig. 1. Debug scenarios with and without the bypassing blocking bugs feature. (a) Debug scenario without the bypassing blocking bugs feature; (b) debug scenario with the bypassing blocking bugs feature.

    1063-8210/$26.00 © 2010 IEEE

    Authorized licensed use limited to: Hindusthan College of Engineering and Tech. Downloaded on June 15,2010 at 05:43:55 UTC from IEEE Xplore. Restrictions apply.

    B. Motivation and Summary of Contributions

    To avoid unnecessary respins, it is essential to identify the design bugs that have escaped to silicon as soon as the first prototype is available. Because a design bug located at a certain point is an obstacle to debugging the remaining parts of the design that are connected to this point, it is important to bypass its erroneous behavior. This type of bug is called a blocking bug. In the presence of blocking bugs, the erroneous samples have to be replaced in real-time with the correct stimuli.

    To illustrate the problem solved in this paper, we show two different debug scenarios in Fig. 1. In Fig. 1(a), the on-chip trace buffer is used just for capturing the debug data based on the debug configuration that specifies the trigger condition at which the acquisition process starts. After the trace buffer is filled, the captured data is offloaded to the debug software, where the debug information is compared against the behavioral model. In order to replace the erroneous behavior caused by the blocking bugs, another level of triggering is needed to enable the trace buffer to provide the correct stimuli only at the specific times determined by the occurrence of the blocking bugs. This motivates our research to develop the stimuli selection module, which is shaded in Fig. 1(b).

    As detailed later in this paper, in this scenario the debug configuration includes the following:

    • the initial trigger event for providing the stimuli data;

    • the stimuli control information to be uploaded into the stimuli selection circuitry; this information specifies the times at which the stimuli data will be provided, as well as the duration of the stimuli intervals;

    • the stimuli data to be uploaded into the on-chip trace buffer.

    It is important to note that in order to achieve the above, the following two assumptions need to be satisfied. First, the debug data has to be deterministically computed and reproduced using a reference behavioral model of the CUD. Second, the target application board (on which the CUD is located) has a deterministic execution behavior, where reapplying the same input data will always produce the same output data. This deterministic behavior is common for application boards where stimuli are applied synchronously, such as audio/video applications. As discussed in the next section, for cases where nondeterminism is present, techniques have recently been proposed to enable deterministic debugging [28].

    Our objective in this paper is to develop a debug methodology, and the associated logic circuitry to be integrated in embedded logic analyzers, that enables the real-time replacement of the erroneous behavior (caused by blocking functional bugs) with the correct stimuli. Our contributions are motivated by the observation that design errors which escape to silicon will manifest in burst mode and only several times over a long execution time (bugs of this type are indeed the hardest to detect, as discussed in [29]). This key observation also makes it possible to stream the stimuli data and control in real-time through a low-bandwidth interface connected to the debug software. The main contributions of this paper are summarized as follows:

    • we propose a novel architecture that enables a hierarchical event detection mechanism to provide correct stimuli from an embedded trace buffer, in order to replace the erroneous samples caused by the blocking bugs;

    • we present an architectural feature in the debug module that enables sharing of the stimuli data, stimuli control, and capture data in a segmented trace buffer; in different debug experiments, the trace buffer can be configured with different segment sizes for its three segments;

    • we show that by leveraging the streaming feature of the low-bandwidth interface to the debug software, we can further extend the duration of debug experiments by feeding the stimuli data in real-time;

    • we propose an on-chip decompressor to extract the stimuli in real-time; this architecture is particularly useful when more erroneous samples (than provided by the capacity of the trace buffer) need to be bypassed.

    The rest of this paper is organized as follows. Section II describes the proposed debug methodology for bypassing blocking bugs. Section III presents the proposed debug architecture. Experimental results in Section IV show how the proposed debug architecture is used to aid in debugging an MP3 audio decoder. Finally, Section V concludes this paper.

    Fig. 2. Bypassing blocking bugs framework.

    II. METHODOLOGY FOR BYPASSING BLOCKING BUGS

    As outlined in the previous section, our debug methodology for dealing with blocking bugs relies on two assumptions: 1) validation data can be deterministically computed from a reference behavioral model of the design, which avoids time-consuming circuit simulation; and 2) during post-silicon validation the circuit exhibits deterministic behavior, i.e., the nondeterminism caused by asynchronous events is masked out. To eliminate the nondeterminism caused by I/O devices, a buffering module can be employed to record the input data and its time stamps, as described in [28]. In this technique, when replaying the execution, the I/O devices can be temporarily suspended and the input buffer will reproduce the input signals that have been recorded.

    To illustrate the proposed debug methodology, which relies on two phases, consider the two embedded cores shown in Fig. 2. In the first phase, the lossy compression technique presented in [25] is used to identify the hard-to-find functional bugs in Core 1. In this technique, lossy compression is achieved by employing a multiple input signature register (MISR) placed in front of the trace buffer to map a sequence of samples into one signature. The debug architecture proposed in [25] enables an iterative debug flow for zooming only into the intervals containing erroneous samples that occur intermittently in long observation windows. The observation window is defined as the length of the debug experiment during which the circuit is monitored without interrupting its real-time execution. This iterative debug flow starts by capturing the enlarged observation window as a sequence of signatures in the trace buffer during the initial debug experiment. In the succeeding debug experiments, the debug engineer then iteratively zooms only into the intervals that contain erroneous samples. After identifying the exact times when the bugs from Core 1 are activated (and knowing from the behavioral model the correct values that need to appear on the core's output), the erroneous behavior should be bypassed in order to validate in silicon the other parts of the design that are connected to the erroneous core (i.e., Core 2 in Fig. 2).
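    The first phase can be pictured with a small software model. The sketch below compacts windows of samples into signatures with a simple MISR model and compares them with signatures computed from reference data to flag suspect windows. The 16-bit width, the feedback polynomial, the window length, and all function names are assumptions made for illustration; the actual architecture of [25] is not reproduced here.

```python
from typing import List, Sequence

WIDTH = 16                  # assumed signature width in bits
POLY = 0x1021               # assumed feedback polynomial (illustrative choice)
MASK = (1 << WIDTH) - 1


def misr_signature(samples: Sequence[int]) -> int:
    """Compact a sequence of WIDTH-bit samples into one signature."""
    state = 0
    for sample in samples:
        msb = (state >> (WIDTH - 1)) & 1
        state = (state << 1) & MASK      # shift the register by one position
        if msb:
            state ^= POLY                # apply the feedback taps
        state ^= sample & MASK           # fold in the parallel inputs
    return state


def suspect_windows(observed: Sequence[int], reference: Sequence[int],
                    window: int) -> List[int]:
    """Return the indices of windows whose signature differs from the reference."""
    bad = []
    for start in range(0, len(reference), window):
        sig_obs = misr_signature(observed[start:start + window])
        sig_ref = misr_signature(reference[start:start + window])
        if sig_obs != sig_ref:
            bad.append(start // window)
    return bad


reference = list(range(64))
observed = list(reference)
observed[37] ^= 0x4                      # one erroneous sample, inside window 4
print(suspect_windows(observed, reference, window=8))
```

    A follow-up experiment would repeat the comparison with a shorter window restricted to the flagged interval, which mirrors the zooming step described above.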

    In phase 2, discussed in this paper, we replace the effect of the blocking bugs from Core 1 with the stimuli required for the validation of Core 2. We rely on the fact that a blocking bug generates erroneous patterns in bursts that are not consecutive and whose durations differ from one another, as shown in Fig. 2 (the shaded segments at the output of Core 1 represent the erroneous patterns). Note that, in addition to uploading the replacements for the erroneous samples, a stimuli pointer (Stimuli pointer) that represents the beginning of the erroneous samples and an associated stimuli number (Stimuli No) are also uploaded into the embedded debug module. The debug steps for validating Core 2 are explained next.

    The debug module is first uploaded with the debug configuration [in Fig. 2, step (1)]. The debug configuration includes the trigger event that represents the beginning of the data block that has erroneous samples, as well as the stimuli data and stimuli control (i.e., stimuli pointer and stimuli number) for the first few erroneous samples to be replaced (the layout and the update of this information in the embedded trace buffer are detailed in the following section). The stimuli pointer represents the time at which the stimuli will be provided at the probe point. The stimuli number specifies the number of stimuli that are needed to replace the erroneous samples upon the occurrence of the stimuli pointer.
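    One way to picture the debug configuration is as a small record that groups the trigger event, the stimuli control pairs, and the stimuli data. The sketch below is a software-side illustration only: the class names, the field names, and the consistency check are hypothetical, and pointers are assumed to be counted from the trigger event.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StimuliControl:
    pointer: int   # assumed: selection-event count (after the trigger) at which replacement starts
    number: int    # how many consecutive samples are replaced


@dataclass
class DebugConfiguration:
    trigger_event: int                                   # value marking the start of the erroneous data block
    controls: List[StimuliControl] = field(default_factory=list)
    stimuli: List[int] = field(default_factory=list)

    def validate(self) -> None:
        """Check that the stimuli data covers every burst and bursts do not overlap."""
        needed = sum(c.number for c in self.controls)
        if needed != len(self.stimuli):
            raise ValueError(f"expected {needed} stimuli, got {len(self.stimuli)}")
        last_end = 0
        for c in self.controls:
            if c.pointer < last_end:
                raise ValueError("bursts must be ordered and non-overlapping")
            last_end = c.pointer + c.number


config = DebugConfiguration(
    trigger_event=0xA5,
    controls=[StimuliControl(pointer=3, number=2), StimuliControl(pointer=10, number=1)],
    stimuli=[7, 8, 9],
)
config.validate()   # raises ValueError if the configuration is inconsistent
```

    Checking the configuration in the debug software before upload is a design choice of this sketch; the paper does not state where such checks are performed.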

    After the CUD starts execution [in Fig. 2, step (2)] and upon the occurrence of the trigger event, an internal counter in the debug module starts its operation, and it increments whenever a specific stimuli selection condition occurs. This stimuli selection condition represents the event at which the stimuli data should be made available to Core 2. Once this internal counter reaches the first stimuli pointer, the stimuli are read from the trace buffer until the number of erroneous samples that have been replaced equals the stimuli number [in Fig. 2, step (3)]. By exploiting the fact that erroneous values caused by Core 1 are sparse, the stimuli data and control can be streamed in real-time through a low-bandwidth interface connected to the debug software, as detailed in the following section.

    Fig. 3. Proposed embedded debug module.
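    The counter-based replacement described for steps (2) and (3) can be modeled behaviorally as follows. This is a minimal sketch, not the hardware implementation: the function name is hypothetical, the trigger is modeled as a predicate on the sample value, and every sample after the trigger is assumed to be a stimuli selection event.

```python
from typing import Callable, List, Sequence, Tuple


def bypass_stream(
    core1_out: Sequence[int],
    trigger: Callable[[int], bool],
    controls: Sequence[Tuple[int, int]],   # (stimuli pointer, stimuli number) pairs
    stimuli: Sequence[int],
) -> List[int]:
    """Return the sample stream seen by Core 2 after replacement."""
    out: List[int] = []
    triggered = False
    counter = 0      # counts selection events after the trigger
    ctrl_idx = 0     # index of the next (pointer, number) pair
    stim_idx = 0     # read position in the stimuli data
    remaining = 0    # samples still to be replaced in the current burst
    for sample in core1_out:
        if not triggered:
            triggered = trigger(sample)
            out.append(sample)
            continue
        # Start a new burst when the counter reaches the next stimuli pointer.
        if (remaining == 0 and ctrl_idx < len(controls)
                and counter == controls[ctrl_idx][0]):
            remaining = controls[ctrl_idx][1]
            ctrl_idx += 1
        if remaining > 0:
            out.append(stimuli[stim_idx])   # replace the erroneous sample
            stim_idx += 1
            remaining -= 1
        else:
            out.append(sample)              # pass the Core 1 output through
        counter += 1
    return out


erroneous = [0xFF, 1, 2, 99, 98, 5, 6, 97, 8]
fixed = bypass_stream(erroneous, lambda s: s == 0xFF, [(2, 2), (6, 1)], [3, 4, 7])
print(fixed)
```

    In this example the two bursts (two samples starting at count 2, one sample at count 6) are replaced, so Core 2 would observe the sequence 1 through 8 after the trigger value.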

    In order to capture data from Core 2, which is connected to Core 1 that has blocking bugs, trigger events for Core 2 need to be uploaded into the embedded debug module. These trigger events determine the times at which the acquisition processes start, while the associated sample numbers specify the required amount of sampling data, as shown in Fig. 2. Thus, the trace buffer is employed as a segmented buffer to allow providing stimuli at the output of Core 1 and capturing data at the output of Core 2. The acquisition process can be configured to capture the debug data for a specified number of clock cycles. This process is stopped once the required amount of sampling data is captured or the capture data segment is filled. Thereafter, the captured data is offloaded to the debug software. To increase the effectiveness of the embedded trace buffer during the debug process, we develop an architectural feature that enables sharing the stimuli data, stimuli control, and capture data in a segmented trace buffer that can be configured with different segment sizes. Providing programmable features that let the user decide, for each experiment, how much data is allocated to each segment is essential because the amount of erroneous data introduced by the blocking bugs cannot be known a priori. Likewise, the amount of data that needs to be captured cannot be known at design time.
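    The per-experiment partitioning of the trace buffer can be sketched as a simple address-map computation. The names below are hypothetical, sizes are expressed in buffer words, and the segment order (stimuli control, stimuli data, capture data) follows the description of the trace buffer given in Section III-C.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SegmentLayout:
    control_base: int   # first word of the stimuli control segment
    stimuli_base: int   # first word of the stimuli data segment
    capture_base: int   # first word of the capture data segment
    depth: int          # total number of words in the trace buffer

    @property
    def sizes(self) -> Tuple[int, int, int]:
        """Sizes of the (control, stimuli, capture) segments in words."""
        return (self.stimuli_base - self.control_base,
                self.capture_base - self.stimuli_base,
                self.depth - self.capture_base)


def make_layout(depth: int, control_words: int, stimuli_words: int) -> SegmentLayout:
    """Split the buffer; whatever is left after control and stimuli is used for capture."""
    if control_words < 0 or stimuli_words < 0:
        raise ValueError("segment sizes must be non-negative")
    if control_words + stimuli_words >= depth:
        raise ValueError("no room left for the capture segment")
    return SegmentLayout(0, control_words, control_words + stimuli_words, depth)


# Two experiments on the same 1024-word buffer with different splits.
print(make_layout(1024, 16, 240).sizes)
print(make_layout(1024, 16, 752).sizes)
```

    The second call models an experiment that needs more stimuli to reach a later trigger event, at the cost of a smaller capture segment.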

    III. PROPOSED EMBEDDED DEBUG MODULE

    This section introduces the proposed embedded debug module, which enables bypassing the blocking bugs, based on the methodology described in Section II. Our contributions in this embedded debug module are the stimuli selection module, the embedded trace buffer control, and the decoder architecture, which are shaded in Fig. 3. First, we describe the main features of the embedded debug module. Then we introduce the stimuli selection module architecture and the associated selection mechanism that facilitates bypassing the blocking bugs. This is followed by the architectural features of the embedded trace buffer control. Finally, we show how the limited capacity of the trace buffer to provide the stimuli is addressed by using dictionary-based compression techniques combined with a new on-chip decoder architecture.

    A. Overview of the Embedded Debug Module

    A standard embedded debug module can capture a set of internal samples after the occurrence of a trigger event that matches a user-programmable triggering condition. This is achieved by a detection mechanism using the event detector, as shown in Fig. 3. In our implementation, the triggering condition is based on bitwise, comparison, or logical operations between any selected trigger signal and a specified constant value. The purpose of using two event detector circuits is to concurrently monitor trigger signals from two different cores. To further enhance the detection ability, each event detector is followed by an event sequencer to monitor a user-specified sequence of events. The configuration of the trigger signal selection, the trigger events, and the choice of signals that need to be probed are uploaded to the embedded debug module control through the low-bandwidth interface (e.g., JTAG [14]). It should be noted that the trigger signals, the stimuli selection signals (used by the stimuli selection module described in Section III-B), and the data that is traced from the embedded cores are not mutually exclusive signals.
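    A behavioral sketch of an event detector followed by an event sequencer is given below. The operation set, the event encoding, and the function names are assumptions for illustration; the text only states that bitwise, comparison, or logical operations against a constant are supported.

```python
import operator
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed operation set, keyed by a hypothetical operation name.
OPS: Dict[str, Callable[[int, int], bool]] = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
    "mask_nonzero": lambda signal, const: (signal & const) != 0,   # bitwise test
}

Event = Tuple[str, int]   # (operation name, constant value)


def detect(event: Event, signal: int) -> bool:
    """Evaluate one triggering condition against the current signal value."""
    op_name, const = event
    return OPS[op_name](signal, const)


def sequencer(events: List[Event], signal_trace: Iterable[int]) -> int:
    """Return the cycle at which the last event of the sequence fires, or -1."""
    if not events:
        return -1
    stage = 0
    for cycle, value in enumerate(signal_trace):
        if detect(events[stage], value):
            stage += 1                      # advance to the next event in the sequence
            if stage == len(events):
                return cycle
    return -1


sequence = [("eq", 0x10), ("gt", 0x80), ("mask_nonzero", 0x01)]
trace = [0x00, 0x10, 0x20, 0x90, 0x92, 0x93]
print(sequencer(sequence, trace))
```

    Here the three events are matched at cycles 1, 3, and 5, so the sequence completes at cycle 5. In this model at most one event of the sequence is matched per cycle.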

    B. Stimuli Selection Module

    The event detection capability of the embedded debug module can be extended to enable a mechanism for bypassing blocking bugs. This mechanism provides a second level of triggering at which the erroneous behavior of blocking bugs is replaced with the correct stimuli from the trace buffer. The first level of triggering represents the beginning of the first set of data that has erroneous samples, and the second level indicates when the erroneous samples start to occur after the first-level trigger condition is satisfied. The second trigger level is specified by the stimuli pointer.

    Fig. 4. Example of on-chip stimuli pointers.

    Fig. 5. Stimuli selection module with S stimuli control registers.

    The process of updating the stimuli control registers is constrained by the time needed to upload new values into these registers through the low-bandwidth interface (also referred to as the serial interface). This time depends on the serial interface frequency and the width of the stimuli control (i.e., the width of the stimuli pointer register combined with that of the stimuli number register). Therefore, the number of on-chip clock cycles required for updating the stimuli control equals $N_{\text{update}} = R \times W_{\text{ctrl}}$, where $R = f_{\text{sampling}} / f_{\text{serial}}$ and $f_{\text{sampling}}$ is the on-chip sampling frequency; $f_{\text{serial}}$ is the frequency of the serial interface; $W_{\text{ctrl}}$ is the width of the stimuli control. If the interval between two erroneous samples is less than $N_{\text{update}}$, then even the correct samples that occur between these two erroneous samples must also be loaded into the on-chip stimuli memory. Hence, one stimuli pointer is used to indicate the beginning of this interval, whose length is identified by the stimuli number. If the time between two stimuli pointers is sufficient to update the on-chip stimuli control registers during the debug experiment, the amount of stimuli will be reduced. We illustrate the importance of having multiple on-chip stimuli pointers with the following example.

    Fig. 4 shows part of the observation window that has multiple erroneous bursts over two interval groups. We assume that the on-chip stimuli registers are uploaded with the information of stimuli control 1, 2, 3, and 4 in group A. In Fig. 4, the time between the two groups is sufficient to upload these registers with new values. As a result, for the case of having four on-chip stimuli pointers, their values will be updated, and hence any correct samples between any two consecutive erroneous samples within each group do not need to be stored on-chip as stimuli. On the other hand, for the case of having just one on-chip stimuli control register (i.e., one register for the stimuli pointer and one register for the stimuli number), the information of stimuli control 1 from Fig. 4 will indicate all the erroneous bursts in group A. Since the time between any two bursts within group A is less than the time needed to upload one stimuli control, the correct samples that occur between any two bursts within this group need to be stored in the on-chip trace buffer. Because similar patterns occur in group B, the information of stimuli pointer number 5 will be uploaded to the on-chip stimuli pointer register to indicate the erroneous bursts in group B. Given the importance of having on-chip information for stimuli control, we discuss different approaches to implementing it.

    In Fig. 5, the individual registers that store the stimuli pointers and the associated stimuli numbers are accessed by the index counter. When the stimuli pointer counter reaches a specific stimuli pointer, which is determined by the index counter, the stimuli are enabled from the trace buffer for a certain duration specified by the stimuli number. The stimuli flag is enabled after the occurrence of the stimuli pointer. When the stimuli number counter reaches the value of the associated stimuli number register, this flag is disabled. The more stimuli pointers are stored on-chip, the less stimuli data needs to be stored. However, the drawback of this architecture is the area overhead caused by the physical registers used for storing the on-chip stimuli pointers.
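    Returning to the update-time constraint and the single-register case of the Fig. 4 example, the arithmetic can be sketched as below. The function names and the numeric values are assumptions for illustration. The sketch assumes one bit is transferred per serial clock cycle and one stimuli selection event per on-chip clock cycle, so pointers and gaps are both measured in on-chip cycles.

```python
import math
from typing import List, Tuple

Burst = Tuple[int, int]   # (stimuli pointer, stimuli number)


def update_cycles(f_sampling_hz: float, f_serial_hz: float, control_width_bits: int) -> int:
    """On-chip clock cycles needed to shift in one stimuli control word."""
    return math.ceil(control_width_bits * f_sampling_hz / f_serial_hz)


def merge_bursts(bursts: List[Burst], n_update: int) -> List[Burst]:
    """Merge bursts whose gap is too short to reload a single control register.

    Correct samples inside a merged interval must then be stored as stimuli too.
    """
    if not bursts:
        return []
    merged = [bursts[0]]
    for pointer, number in bursts[1:]:
        prev_ptr, prev_num = merged[-1]
        gap = pointer - (prev_ptr + prev_num)
        if gap < n_update:
            merged[-1] = (prev_ptr, pointer + number - prev_ptr)
        else:
            merged.append((pointer, number))
    return merged


n_update = update_cycles(50e6, 10e6, 32)          # hypothetical frequencies and width
bursts = [(100, 4), (150, 2), (1000, 3)]
merged = merge_bursts(bursts, n_update)
print(n_update, merged, sum(n for _, n in merged))
```

    With these hypothetical numbers an update takes 160 on-chip cycles, so the first two bursts collapse into one interval of 52 samples and 55 stimuli must be stored instead of 9. This is the storage penalty that additional on-chip stimuli control registers are meant to reduce.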

    Fig. 6. Stimuli selection module with stimuli control stored in the trace buffer.

    An alternative is to store all of the stimuli control information for the entire observation window in the trace buffer (along with the stimuli) and allocate one stimuli control register in the stimuli selection module. These stimuli control values are accessed from the trace buffer by the stimuli selection module control. This has a smaller area overhead than the register-based approach of Fig. 5. However, it consumes as many trace buffer locations as there are stimuli control entries. This solution is constrained by the limited capacity of the trace buffer and hence it impacts the length of the observation window.

    The solution adopted in our work stores only a user-programmable number of the stimuli pointers (and the associated stimuli numbers) in a segment of the trace buffer. Consequently, one stimuli pointer register and one stimuli number register can be allocated in the stimuli selection module, as shown in Fig. 6. The values for stimuli pointers and stimuli numbers can be updated from the off-chip software via the serial interface by exploiting the slack between consecutive bursts of erroneous samples that need to be substituted on-chip due to the blocking bugs. This solution combines the benefits of the approaches described in the previous two paragraphs: it has an area overhead smaller than using dedicated physical registers for storing control information and it uses only a few locations in the trace buffer. Furthermore, by employing the low-bandwidth interface to the debug software, one can timeshare this physical link to also update the stimuli data in the embedded trace buffer.

    C. Embedded Trace Buffer Control

    As shown in Fig. 7, the dual-port embedded trace buffer has three segments: the first segment stores the stimuli control, the second one stores the stimuli data, and the third one captures the data responses from the core that is currently being validated and is connected to the core that has the blocking bugs. The segments for stimuli data and stimuli control work as circular buffers (i.e., if the reading address of any segment reaches its depth, it starts again from the beginning address of this segment). This is necessary in order to stream data into the trace buffer while running the CUD for long observation windows. This feature is essential for the stimuli control segment because having a few on-chip stimuli control values helps reduce the amount of stored stimuli data (as explained in Section III-B and substantiated later in the experimental results). Because the hard-to-detect bugs occur intermittently over long observation windows, we have observed that there will be sufficient time to stream in new stimuli data during a long error-free interval that occurs between two erroneous intervals. It is also important to note that the embedded debug module CTRL is capable of timesharing the low-bandwidth interface between stimuli control and data. It distinguishes between the stimuli data and control based on one-bit tag information embedded in the stream that is supplied from the off-chip debug software. In summary, the embedded trace buffer CTRL controls the following processes:

    • writing the captured data into the trace buffer and reading it to be streamed out while running the debug experiment;

    • reading the stimuli control and stimuli data from the trace buffer and writing new stimuli control and new stimuli data, which are streamed in through the low-bandwidth interface. The accessed stimuli control and stimuli data are overwritten by the streamed new stimuli control and stimuli data, respectively.
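    The circular behavior of the stimuli segments and the tag-based sharing of the serial stream can be modeled as follows. This is a simplified sketch: the class and function names are hypothetical, the tag encoding (1 for stimuli control, 0 for stimuli data) is an assumption, and no check is made that a streamed word only overwrites a location that has already been read.

```python
from typing import Iterable, List, Tuple


class CircularSegment:
    """Fixed-depth segment whose read and write addresses wrap to its base."""

    def __init__(self, initial: List[int]):
        self.mem = list(initial)
        self.read_addr = 0
        self.write_addr = 0

    def read(self) -> int:
        value = self.mem[self.read_addr]
        self.read_addr = (self.read_addr + 1) % len(self.mem)
        return value

    def refill(self, value: int) -> None:
        """Overwrite the next location, in read order, with a streamed-in word."""
        self.mem[self.write_addr] = value
        self.write_addr = (self.write_addr + 1) % len(self.mem)


def route_stream(words: Iterable[Tuple[int, int]],
                 control: CircularSegment, data: CircularSegment) -> None:
    """Dispatch (tag, payload) words; tag 1 is assumed to mark stimuli control."""
    for tag, payload in words:
        target = control if tag == 1 else data
        target.refill(payload)


control_seg = CircularSegment([1, 2])
data_seg = CircularSegment([10, 11, 12])
consumed = [data_seg.read(), data_seg.read()]            # locations 0 and 1 are now free
route_stream([(0, 20), (1, 5), (0, 21)], control_seg, data_seg)
print(consumed, data_seg.mem, control_seg.mem, data_seg.read(), data_seg.read())
```

    After the refill, reading continues with the last original word and then wraps around to the newly streamed data.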

    Because the trace buffer is used for both stimuli and data capture, it is essential to configure the size of the stimuli segment and the size of the capture segment such that as much of the observation window as possible can be debugged. These sizes can be changed from one debug experiment to another in order to target an observation window longer than the previously debugged one. For example, in Fig. 2, when running the debug experiment that targets capturing the debug data from Core 2 upon trigger event 2, the stimuli segment size can be selected to be smaller than the capture segment size. This is because the amount of stimuli needed to bypass the blocking bugs at Core 1 does not require a large segment size to reach this trigger event. On the other hand, when running the debug experiment that targets capturing the debug data from Core 2 upon trigger event 3, the stimuli segment has to accommodate all the stimuli required to bypass the blocking bugs until the occurrence of this trigger event, and hence a larger stimuli segment size will be needed. It is important to note that the proposed methodology can be used effectively whenever the on-chip trace buffer accommodates the stimuli required for bypassing the entire erroneous behavior caused by the blocking bugs.

    Fig. 7. Embedded trace buffer CTRL.

    Fig. 8. Proposed decoder architecture.

    In order to deal with multiple blocking bugs, distributed embedded logic analyzers can be employed. The proposed debug architecture can be integrated with the recent work that proposes distributed embedded logic analyzers for post-silicon validation [21]. The number of blocking bugs that can be bypassed depends on the number of embedded logic analyzers that are available and on how close together the blocking bugs occur. For example, if the time between the occurrences of successive blocking bugs is sufficient to update the embedded debug architecture with the debug configuration required to bypass the second blocking bug, there will be no need to employ another embedded debug architecture. Otherwise, another embedded debug architecture would be needed to bypass both blocking bugs simultaneously. Future work will address the problem of bypassing multiple blocking bugs from different cores using distributed embedded logic analyzers.

    D. Lossless Decoder Architecture

    Given the necessity of providing stimuli to bypass blocking bugs that cause more erroneous samples than the capacity of the trace buffer, compression techniques can be employed to store the stimuli in compressed form in the trace buffer. It should be noted that the requirements for data decompression using an embedded debug module are fundamentally different from those that apply when scan is used for manufacturing test. This is because the stimuli are fully specified for debug, unlike scan patterns, which have only a few care bits specified. It is also important to note that dictionary-based algorithms have been explored before by the authors for lossless compression of debug data that is captured on-chip [27]. The novel implementation presented in this section, however, is concerned with on-chip decompression.

    TABLE I: AREA OF THE PROPOSED DEBUG MODULE IN NAND2 EQUIVALENTS, EXCLUDING TRACE BUFFER AND DECODER AREA

    TABLE II: AREA OF THE PROPOSED DECODER ARCHITECTURE IN NAND2 EQUIVALENTS

    TABLE III: EFFECT OF THE ON-CHIP STIMULI POINTERS ON THE TOTAL NUMBER OF STIMULI AND ON THE TOTAL NUMBER OF STIMULI POINTERS THAT ARE USED DURING DEBUGGING THE ENTIRE OBSERVATION WINDOW (SONG = 2661.75 k SAMPLES, SAMPLE = 2-BYTE WORD)

    Because compression algorithms may vary in terms of their resource requirements for real-time compression, a tradeoff between the amount of additional area overhead and the compression ratio needs to be taken into account. As a consequence, we implemented two adaptive dictionary-based decompression techniques [30], [31] in order to achieve a good compression ratio while keeping the impact on silicon area small. The adaptive dictionary-based algorithms can achieve high throughput and competitive compression ratios compared to the adaptive statistical algorithms [30]. The difference between the implemented algorithms lies in the structure of the dictionary: it either has a fixed width, as in the locally adaptive data compression algorithm (BSTW) [30], or it is a hierarchical dictionary with variable word width, as in the word-based dynamic Lempel-Ziv (WDLZW) data compression algorithm [31]. In the decoding process, the dictionary is represented by a lookup table (LUT). Next, we explain how the decoding process works by introducing the hardware implementation of the single-symbol-width dictionary BSTW decoding algorithm, shown in Fig. 8.

    TABLE IV: LENGTH OF THE OBSERVATION WINDOW (k SAMPLES) VERSUS THE ON-CHIP STIMULI POINTERS FOR DIFFERENT SIZES OF TRACE BUFFERS (KBYTES); HALF OF EACH TRACE BUFFER SIZE IS USED FOR CAPTURING DEBUG DATA (SONG = 2661.75 k SAMPLES)

    At the beginning of the debug experiment, the trace buffer is loaded with the compressed stimuli. Since the stimuli are known a priori, the LUT is uploaded with the symbols. When the decoder and the stimuli signals are active, the code and data extractor module reads the trace buffer data and, based on the value of the codeword, the symbol is retrieved from (or added to) the LUT. If the codeword equals 0 (this codeword indicates that the symbol that follows it does not exist in the LUT), the extractor writes the symbol that follows this codeword to the stimuli output; this symbol is also written into the LUT at the address pointed to by a control counter in the Decoder control. This LUT address counter is initialized to 1 at the beginning of the debug experiment; it increments each time a codeword 0 is read and returns to 1 whenever it reaches the LUT depth (i.e., a first-in first-out replacement policy). As shown in Fig. 8, the LUT is a dual-port RAM where one port is used for writing the symbol when a 0 codeword is detected and the other port is used for reading the symbol that is addressed by the codeword. To enhance the compression ratio for correlated consecutive stimuli, the dictionary should support a multiple-symbol encoding approach, and hence a multiple-symbol-width LUT is needed in the decoder architecture.
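The decoding steps described above can be mirrored in a short software model. This is a behavioral sketch, not the hardware of Fig. 8. It assumes the compressed stream is a flat list of integers in which a codeword 0 is followed by a literal symbol. It also reads "returns to 1 whenever it reaches the LUT depth" as meaning that addresses 1 to depth-1 are usable, with address 0 reserved for the escape codeword. The function name `bstw_fifo_decode` and the optional `preload` argument are hypothetical.

```python
def bstw_fifo_decode(stream, lut_depth, preload=None):
    """Decode a codeword stream with a FIFO-replaced lookup table.

    stream:    flat list of ints; codeword 0 means "literal symbol follows".
    lut_depth: number of LUT entries; address 0 is never used for symbols.
    preload:   optional {address: symbol} dict modelling a LUT uploaded
               before the debug experiment starts (assumption).
    """
    lut = [None] * lut_depth
    for addr, symbol in (preload or {}).items():
        lut[addr] = symbol

    write_addr = 1          # control counter, initialized to 1
    output = []
    tokens = iter(stream)
    for codeword in tokens:
        if codeword == 0:
            symbol = next(tokens)        # literal that follows the 0 codeword
            lut[write_addr] = symbol     # write port of the dual-port LUT
            write_addr += 1
            if write_addr == lut_depth:  # wrap: first-in first-out replacement
                write_addr = 1
        else:
            if not 0 < codeword < lut_depth or lut[codeword] is None:
                raise ValueError(f"codeword {codeword} has no LUT entry")
            symbol = lut[codeword]       # read port of the dual-port LUT
        output.append(symbol)
    return output


# Two literals fill addresses 1 and 2, then three codewords reuse them.
decoded = bstw_fifo_decode([0, 0xA, 0, 0xB, 1, 2, 1], lut_depth=4)
```

Note that the original BSTW scheme [30] reorders its list on every access, whereas the decoder described here replaces entries in first-in first-out order; the sketch follows the latter.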

    IV. EXPERIMENTAL RESULTS

    This section discusses the area and the advantages of the proposed embedded debug module for bypassing blocking bugs. The area results have been estimated using a 180-nm standard cell library. The debug data has been collected from a field-programmable gate array (FPGA) prototype of an MP3 audio decoder [32], and the stimuli are taken from a reference behavioral model of the MP3 decoder design.

    A. Area of the Proposed Debug Module

    Table I shows the cell area of the proposed debug module (excluding the trace buffer and the MP3 decoder area) in terms of the equivalent area of two-input NAND (NAND2) gates. This area is estimated using a 180-nm application-specific integrated circuit (ASIC) standard cell library. The area results are given both for the case when no stimuli selection module is used and for the case when the stimuli selection module is employed. For the latter case, the following numbers of stimuli control registers were considered: 1, 2, 4, 8, and 16. The stimuli selection module with one stimuli control register is shown in Fig. 6. The results for 2, 4, 8, and 16 stimuli control registers are for the stimuli selection module illustrated in Fig. 5.

    It is essential to note that as the number of on-chip stimuli control registers increases, the area overhead of the debug module can increase significantly compared to the debug module that has only one stimuli control register. Therefore, it is desirable to store the values of the stimuli pointers and stimuli numbers in a few locations in the trace buffer and access these values through the proposed low-cost architecture shown in Fig. 6. As can be noted, there is an approximately 30% impact on the silicon area when using the debug module with the architecture shown in Fig. 6, compared to the one that has no stimuli selection module. Note, however, that when the area of the trace buffer is accounted for, this overhead is substantially diminished.

    Table II shows the area overhead of the decoder architecture, including the area of the LUT, in terms of two-input NAND (NAND2) gates. The results shown in Table II are reported for different LUT depths and for both the BSTW and WDLZW decoding approaches. For the WDLZW dictionary-based method, the depth of the single-symbol LUT is half of the shown LUT depth in the cases of the two-symbol (2-Sym) and three-symbol (3-Sym) encoding approaches. The depth of the two-symbol-width LUT is half of the remaining depth for the three-symbol approach (whereas the depth of the three-symbol-width LUT is the other half). As noted from Table II, the WDLZW decoding architectures for multiple symbols have smaller area overhead than the BSTW ones. In addition, the more symbols are accounted for during compression, the larger the area overhead of the decoder, which likely affects the compression benefits when using a specific dictionary structure, as explained in [27].
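The way the total LUT depth is divided among the WDLZW sub-dictionaries can be written out as a small helper. This is a sketch of the split described above; the function name `wdlzw_lut_depths` is hypothetical, and for the 2-Sym case it assumes that the two-symbol LUT takes the half not used by the single-symbol LUT, which the text implies but does not state explicitly.

```python
def wdlzw_lut_depths(total_depth, max_symbols):
    """Return {symbols_per_entry: depth} for a 2-Sym or 3-Sym WDLZW decoder."""
    single = total_depth // 2            # single-symbol LUT: half of the total
    remaining = total_depth - single
    if max_symbols == 2:
        # Assumption: the two-symbol LUT gets all of the remaining half.
        return {1: single, 2: remaining}
    if max_symbols == 3:
        two = remaining // 2             # half of the remaining depth
        return {1: single, 2: two, 3: remaining - two}
    raise ValueError("only the 2-Sym and 3-Sym cases are described")


depths_2sym = wdlzw_lut_depths(256, 2)
depths_3sym = wdlzw_lut_depths(256, 3)
```

For a total LUT depth of 256, this gives 128 + 128 entries for 2-Sym and 128 + 64 + 64 entries for 3-Sym.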

    Fig. 9. Length of the observation window (k samples) versus stimuli segment size ratio without using stimuli data streaming and without using compression, for the number of on-chip stimuli pointers equal to 2, 4, 8, and 16, respectively (Song = 2661.75 k samples). (a) Number of on-chip stimuli pointers equal to 2; (b) number of on-chip stimuli pointers equal to 4; (c) number of on-chip stimuli pointers equal to 8; (d) number of on-chip stimuli pointers equal to 16.

    B. MP3 Decoder Experiments

    The debug experiments have been performed on an FPGA prototype of an MP3 audio decoder, and the stimuli are taken from a reference behavioral model of the MP3 decoder design. It should be noted that for the experiments we have performed, the output of the stereo decoder module at one channel has erroneous sample patterns similar to the ones illustrated in Fig. 2. These erroneous samples are due to a functional bug (in the RTL code) that was identified in the stereo decoder module. After analyzing the erroneous behavior of this blocking functional bug, we have noted that the errors occur only within a few music frames (each frame has two granules and each granule has 576 samples for each of the two channels) throughout the songs that have a specific stereo decoding configuration. This justifies the condition that the blocking bugs that are very difficult to find manifest themselves only intermittently over long observation windows (as noted also in [29]).

    Table III shows the effect of the on-chip stimuli pointers on the total number of stimuli and on the total number of stimuli pointers that are used while debugging the entire observation window for an MP3 song. As discussed in Section III-B, the number of on-chip stimuli pointers influences the total number of stimuli pointers that need to be uploaded through the low-bandwidth interface. As the number of on-chip stimuli pointers increases, the total amount of stimuli decreases. This is due to the fact that the erroneous behavior of the detected blocking bugs occurs over intermittent intervals throughout the entire song. The time between two intervals is used to upload new values to the on-chip stimuli pointers and hence enables a large number of trigger pointers to be used. In Table III, we have considered (stimuli pointer combined with stimuli-number width is 32 bits) and (note, the MP3 decoder has been implemented for energy efficiency at low frequencies, which are comparable with the speed of the low-bandwidth interface). It is important to note that the handshaking between the host and the CUD does not incur any additional latency because the processing of stimuli data/control is done while the debug experiment is running on-chip and the stimuli data/control is being transferred from the debug software, i.e., the on-chip trace buffer is used to mask the latency. For example, when the stimuli data/control is read, the embedded debug module communicates with the debug software to transfer new stimuli data/control while the debug experiment is running.

    Fig. 10. Length of the observation window (k samples) versus stimuli segment size ratio without using stimuli data streaming and with using the two-symbol WDLZW decompression algorithm with LUT depth equal to 256, for the number of on-chip stimuli pointers equal to 2, 4, 8, and 16, respectively (Song = 2661.75 k samples). (a) Number of on-chip stimuli pointers equal to 2; (b) number of on-chip stimuli pointers equal to 4; (c) number of on-chip stimuli pointers equal to 8; (d) number of on-chip stimuli pointers equal to 16.

    Table IV shows the length of the observation windows (i.e., the number of samples that can be observed from the entire song when using the stored stimuli in the trace buffer) for different numbers of on-chip stimuli pointers and different sizes of the trace buffer. There are multiple possible cases, depending on whether the streaming and compression features are employed. In the case of using compression, we have used a two-symbol WDLZW decompression architecture whose LUT depth equals 256, as an instance of the decoder architectures from Table II. It is important to note that the achieved compression depends on both the correlation of the compressed data and the structure of the dictionary, as described in [27]. In these debug experiments, the total number of samples for the MP3 data is 2661.75 k samples (sample width 2-byte word, number of frames ). Based on the results from Tables III and IV, we emphasize the following points.

    • The stored stimuli represent intermittent intervals throughout the entire observation window. Thus, the larger the trace buffer we use, the larger the amount of stimuli that can be stored and hence the longer the observation window that can be achieved.

    • There is no significant difference between the lengths of the observation windows when using the architecture from Fig. 5 and the architecture from Fig. 6. As pointed out in Section III-B, the architecture from Fig. 5 has more stimuli stored on-chip than the architecture from Fig. 6 (this is due to the space allocated for the stimuli control segment in Fig. 6). Note that these few extra stimuli in Fig. 5 are used after the end of the observation window for Fig. 6. Hence, the observation window length will be affected only if these extra stimuli are applied at an interval long after the end of the observation window achieved by the architecture from Fig. 6.

    • As observed from the compression results in Table IV, the proposed lossless decoder architecture can be used to provide the required stimuli to bypass the blocking bugs that cause more erroneous samples than can be accommodated in the stimuli segment, and hence extend the observation window.

    • As shown in Table IV, the stimuli data streaming feature provides an increase in the length of the observation window. This improvement is primarily due to exploiting the time between two erroneous intervals to stream in new stimuli data. Note that in the case of using the decoder architecture, the stimuli data is streamed in compressed form.

    • The debug data stream used in these debug experiments, which contains 2366 frames (i.e., 2661.75 k samples), is five times larger than the other three data streams previously used in [1]. In this paper, we have considered a longer data stream to emphasize the benefit of the proposed method when debugging longer observation windows. In addition, the obtained results show how observability is improved when the streaming and compression features are combined.
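The figure of 2661.75 k samples follows from the frame structure given earlier (two granules per frame, 576 samples per granule for each channel) and the 2366 frames of the data stream. The check below assumes that the count is per channel and that "k" denotes 1024; both are assumptions, although they reproduce the stated value exactly. The constant and function names are illustrative.

```python
FRAMES = 2366               # frames in the debug data stream
GRANULES_PER_FRAME = 2      # each frame has two granules
SAMPLES_PER_GRANULE = 576   # samples per granule for one channel
BYTES_PER_SAMPLE = 2        # sample = 2-byte word


def samples_per_channel(frames):
    """Number of samples for one channel over the given number of frames."""
    return frames * GRANULES_PER_FRAME * SAMPLES_PER_GRANULE


total_samples = samples_per_channel(FRAMES)        # 2,725,632 samples
total_k_samples = total_samples / 1024             # 2661.75 k samples
total_kbytes = total_samples * BYTES_PER_SAMPLE / 1024
```

Under the same assumptions, the full stream amounts to 5323.5 KB of 2-byte samples.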

    Fig. 9 shows the length of the observation windows (i.e., the number of samples that can be debugged from the entire song when using the stored stimuli in the trace buffer) versus the stimuli segment size ratio for different numbers of on-chip stimuli pointers and different sizes of the trace buffers (TBs). Note that when the stimuli segment size equals half of the size of the trace buffer (i.e., 50% in Figs. 9 and 10), the achieved results are consistent with the results in Table IV. These results show the importance of the architectural feature developed in the embedded trace buffer controller to enable sharing the stimuli data, stimuli control, and capture data in a segmented trace buffer that can be configured with different segment sizes.

    As shown by the results in Fig. 10, when compression is employed for the stimuli data, the targeted observation window can be extended to cover the entire observation window of debug data (i.e., 2661.75 k samples in these debug experiments), as observed when the TB sizes equal 64 KB and 128 KB and the stimuli segment size is more than 75% of the TB size. This is achieved at the additional cost of the on-chip decoder area, as shown in Table II. The proposed decoder architecture has a smaller impact on silicon area than forgoing compression and extending the trace buffer to target the same observation window length. In addition, the results from Figs. 9 and 10 also show how the stimuli segment size constrains the length of the observation window that can be debugged. As discussed in Section III-C, the stimuli segment size can be configured to target a specific length of the observation window, depending on how much space needs to be employed by the capture data segment.

    V. CONCLUSION

    In this paper, we have presented a novel embedded debug architecture to deal with blocking bugs during post-silicon validation and debugging. These blocking bugs, unless bypassed, will inhibit the detection of all the bugs in silicon, and the undetected bugs will hence show up during the following respin, thus lengthening the implementation cycle. Our approach facilitates the validation of the other parts of the chip that process data received from the erroneous module. It relies on a hierarchical event detection mechanism to provide correct stimuli from an embedded trace buffer, in order to replace the erroneous samples caused by the blocking bugs. We developed an architectural feature in the embedded trace buffer controller in order to enable sharing the stimuli data, stimuli control, and capture data in a segmented trace buffer that can be configured with different segment sizes. Moreover, we have shown that by leveraging the streaming feature of the low-bandwidth interface to the debug software, we can further improve the observability by streaming the stimuli data. Also, we have presented an on-chip lossless decoder to extract the stimuli data in real time in order to bypass the blocking bugs that cause more erroneous samples than can be accommodated in the stimuli segment.

    REFERENCES

    [1] E. Anis and N. Nicolici, “On bypassing blocking bugs during post-silicon validation,” in Proc. IEEE Eur. Test Symp. (ETS), May 2008, pp. 69–74.

    [2] W. K. Lam, Hardware Design Verification: Simulation and Formal Method-Based Approaches. Englewood Cliffs, NJ: Prentice-Hall, 2005.

    [3] M. Bushnell and V. Agrawal, Essentials of Electronic Testing. Norwell, MA: Kluwer, 2000.

    [4] M. Abramovici, E. J. Marinissen, M. Ricchetti, and B. West, “Suggested terminology standard for silicon debug and diagnosis,” in Proc. IEEE Int. Silicon Debug Diagnosis Workshop (SDD), Nov. 2005, pp. 1–6.

    [5] R. Livengood and D. Medeiros, “Design for (physical) debug for silicon microsurgery and probing of flip-chip packaged integrated circuits,” in Proc. IEEE Int. Test Conf. (ITC), Sep. 1999, pp. 877–882.

    [6] R. Desplats, F. Beaudoin, P. Perdu, N. Natara, T. Lundquist, and K. Shah, “Fault localization using time resolved photon emission and STIL waveforms,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 2003, pp. 254–263.

    [7] M. Paniccia, T. Eiles, V. R. M. Rao, and W. M. Yee, “Novel optical probing technique for flip chip packaged microprocessors,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 1998, pp. 740–747.

    [8] D. P. Vallett, “IC failure analysis: The importance of test and diagnostics,” IEEE Des. Test Comput., vol. 14, no. 3, pp. 76–82, Jul. 1997.

    [9] B. Vermeulen, T. Waayers, and S. K. Goel, “Core-based scan architecture for silicon debug,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 2002, pp. 638–647.

    [10] D. Josephson, S. Poehhnan, and V. Govan, “Debug methodology for the McKinley processor,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 2001, pp. 451–460.

    [11] H. Balachandran, K. Butler, and N. Simpson, “Facilitating rapid first silicon debug,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 2002, pp. 628–637.

    [12] D. Josephson and B. Gottlieb, “The crazy mixed up world of silicon debug,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Oct. 2004, pp. 665–670.

    [13] R. Datta, A. Sebastine, and J. A. Abraham, “Delay fault testing and silicon debug using scan chains,” in Proc. IEEE Eur. Test Symp. (ETS), May 2004, pp. 46–51.

    [14] IEEE Standard Test Access Port and Boundary-Scan Architecture, IEEE JTAG 1149.1-2001 Std., IEEE Computer Society, 2001.

    [15] B. Vermeulen and S. K. Goel, “Design for debug: Catching design errors in digital chips,” IEEE Des. Test Comput., vol. 19, no. 3, pp. 35–43, May 2002.

    [16] R. Leatherman and N. Stollon, “An embedded debugging architecture for SoCs,” IEEE Potentials, vol. 24, no. 1, pp. 12–16, Feb. 2005.

    [17] A. Hopkins and K. McDonald-Maier, “Debug support strategy for systems-on-chips with multiple processor cores,” IEEE Trans. Comput., vol. 55, no. 2, pp. 174–184, Feb. 2006.

    [18] T. J. Foster, D. L. Lastor, and P. Singh, “First silicon functional validation and debug of multicore microprocessors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 5, pp. 495–504, May 2007.

    [19] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, “A reconfigurable design-for-debug infrastructure for SoCs,” in Proc. ACM/IEEE Des. Autom. Conf. (DAC), Jul. 2006, pp. 7–12.

    [20] Y.-C. Hsu, F. Tsai, W. Jong, and Y.-T. Chang, “Visibility enhancement for silicon debug,” in Proc. ACM/IEEE Des. Autom. Conf. (DAC), Jul. 2006, pp. 13–18.

    [21] H. F. Ko, A. B. Kinsman, and N. Nicolici, “Distributed embedded logic analysis for post-silicon validation of SoCs,” presented at the IEEE Int. Test Conf. (ITC), Oct. 2008, paper 16.3, pp. 1–10.

    [22] Altera Corporation, “Altera verification tool, SignalTap II Embedded Logic Analyzer,” 2009. [Online]. Available: http://www.altera.com/literature/hb/qts/qts_qii53009.pdf

    [23] Xilinx Corporation, “Xilinx verification tool, ChipScope Pro,” 2008. [Online]. Available: http://www.xilinx.com/ise/optional_prod/cspro.html

    [24] Synplicity, “Synplicity verification tool, Identify,” 2008. [Online]. Available: http://www.synplicity.com/products/identify/index.html

    [25] E. Anis and N. Nicolici, “Low cost debug architecture using lossy compression for silicon debug,” in Proc. Des., Autom., Test Eur. (DATE), Apr. 2007, pp. 225–230.

    [26] H. F. Ko and N. Nicolici, “Automated trace signals identification and state restoration for improving observability in post-silicon validation,” in Proc. Des., Autom., Test Eur. (DATE), Mar. 2008, pp. 1298–1303.

    [27] E. Anis and N. Nicolici, “On using lossless compression of debug data in embedded logic analysis,” presented at the IEEE Int. Test Conf. (ITC), Oct. 2007, paper 18.3, pp. 1–10.

    [28] S. Sarangi, B. Greskamp, and J. Torrellas, “CADRE: Cycle-Accurate Deterministic Replay for hardware debugging,” in Proc. IEEE Int. Conf. Dependable Syst. Netw. (IDSN), Jun. 2006, pp. 301–312.

    [29] D. Josephson, “The manic depression of microprocessor debug,” in Proc. IEEE Int. Test Conf. (ITC), Oct. 2002, pp. 657–663.

    [30] J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei, “A locally adaptive data compression scheme,” Commun. ACM, vol. 29, no. 4, pp. 320–330, Apr. 1986.

    [31] J. Jiang and S. Jones, “Word-based dynamic algorithms for data compression,” IEE Proc., Commun., Speech Vision, vol. 139, no. 6, pp. 582–586, Dec. 1992.

    [32] S. Hacker, MP3: The Definitive Guide. O’Reilly & Associates, Inc., May 2000.

    Ehab Anis Daoud (S’07–M’09) received the B.Sc. (with honors) and M.Sc. degrees in electronics and communications engineering from Cairo University, Cairo, Egypt, in 1998 and 2003, respectively, and the Ph.D. degree in electrical and computer engineering from McMaster University, Hamilton, ON, Canada, in 2008.

    His research interests include post-silicon validation and debug, lossless data compression, and VLSI systems design.

    Nicola Nicolici (S’99–M’00) received the Dipl.Ing. degree in computer engineering from the University of Timisoara, Timisoara, Romania, in 1997 and the Ph.D. degree in electronics and computer science from the University of Southampton, Southampton, U.K., in 2000.

    He is currently an Associate Professor with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, Canada. His research interests include the area of computer-aided design and test. He has authored a number of papers in this area.

    Dr. Nicolici was the recipient of the IEEE TTTC Beausang Award for the Best Student Paper at the International Test Conference in 2000 and the Best Paper Award at the IEEE/ACM Design Automation and Test in Europe Conference in 2004.