
HIGH LEVEL DEBUGGING TECHNIQUES FOR MODERN VERIFICATION FLOWS

by

Zissis Poulos

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2014 by Zissis Poulos


Abstract

High Level Debugging Techniques for Modern Verification Flows

Zissis Poulos

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2014

Early closure to functional correctness of the final chip has become a crucial success factor in the semiconductor industry. In this context, the tedious task of functional debugging poses a significant bottleneck in modern electronic design processes, where new debugging-related problems are constantly introduced and predominantly handled manually. This dissertation proposes methodologies that address two emerging debugging problems in modern design flows.

First, it proposes a novel and automated triage framework for Register-Transfer-Level (RTL) debugging. The proposed framework employs clustering techniques to automate the grouping of the plethora of failures that occur during regression verification. Experiments demonstrate accuracy improvements of up to 40% compared to existing triage methodologies.

Next, it introduces new techniques for Field Programmable Gate Array (FPGA) debugging that leverage reconfigurability to allow debugging to operate without iterative executions of computationally-intensive design re-synthesis tools. Experiments demonstrate productivity improvements of up to 30× vs. conventional approaches.


Acknowledgements

First and foremost, I would like to sincerely thank my supervisor, Professor Andreas Veneris, for being an excellent mentor, guide and teacher throughout my journey into research. You are a constant source of motivation and passion that drives me to achieve my full potential.

I would like to acknowledge my MASc committee members, Professors Jason Anderson and Vaughn Betz, for their thorough reviews of my dissertation and their insightful feedback.

I would like to thank my parents, who have been a never-ending source of love and support. You have instilled in me the values and lessons that have guided me through my life. Mother, your spirit keeps guiding me.

I am indebted to several colleagues at the University of Toronto for all their shared wisdom and constructive feedback. Special thanks to Hratch Mangassarian, Terry Yang, Bao Le, Mohammad Fawaz, and Kalin Ovtcharov.

Finally, acknowledgements are due to the University of Toronto for its financial support.


for my Mother


Contents

List of Tables
List of Figures

1 Introduction
 1.1 Background and Motivation
  1.1.1 Failure Triage in Verification
  1.1.2 Functional Debug and FPGAs
 1.2 Purpose and Scope
 1.3 Thesis Outline

2 Background
 2.1 Introduction
 2.2 Design Verification
  2.2.1 Verification Tools
  2.2.2 Notation and Iterative Logic Arrays
  2.2.3 Multiple Counter-examples in Verification
 2.3 SAT-based Design Debugging
  2.3.1 Debugging Multiple Counter-examples
 2.4 Simulation Metrics in Verification and Debugging
 2.5 Field Programmable Gate Arrays
 2.6 Summary

3 A Triage Engine for RTL Design Debug
 3.1 Introduction
 3.2 Triage in Debugging
  3.2.1 Motivation
 3.3 Counter-example Triage Framework
  3.3.1 Error Behavior Analysis
  3.3.2 Suspect Ranking
  3.3.3 Counter-example Proximity
  3.3.4 Error Count Estimation
  3.3.5 Counter-example Clustering
  3.3.6 Overall Flow
 3.4 Experimental Results
 3.5 Summary

4 Re-configurability in FPGA Functional Debug
 4.1 Introduction
 4.2 FPGA Functional Debug
 4.3 A Reconfigurability-Driven Approach to FPGA Functional Debug
  4.3.1 An Area-Optimized Multiplexer Implementation
  4.3.2 Debugging with Limited Output Pins
 4.4 Experimental Study
  4.4.1 Area Usage and Timing Analysis
  4.4.2 Productivity and Stability
 4.5 Summary

5 Conclusions and Future Work
 5.1 Summary of Contributions
 5.2 Future Work

List of Tables

3.1 Proposed Triage Engine Performance (R=4)
4.1 Benchmarks
4.2 Effects of area-optimized multiplexer
4.3 Effects of area-optimized multiplexers with shift registers
4.4 Compilation time of SignalTap

List of Figures

1.1 Simplified VLSI design flow
2.1 The Iterative Logic Array method
2.2 Two distinct counter-examples
2.3 Typical automated debugging flow
2.4 Debugging result for counter-examples C1 and C2
2.5 Prefix and suffix windows of suspects $\langle s_{1_1},5\rangle$ and $\langle s_{2_2},6\rangle$
2.6 FPGA architecture
2.7 FPGA hardware structures
3.1 Incorrect grouping by conventional techniques
3.2 Probabilistic behavior of errors
3.3 Example of suspect ranking for a counter-example
3.4 Counter-example proximity
3.5 Counter-example hierarchical clustering
3.6 Proposed triage framework
3.7 Examples of injected RTL errors
3.8 Features of real errors and suspects
3.9 Effect of selecting the R highest in rank suspects
3.10 Effect of modification on Ward's Method
4.1 Area overhead of SignalTap
4.2 Multiplexer for signal selection
4.3 16-to-1 MUX implementation in 6-input LUTs
4.4 Multiplexer with a 4-bit Shift Register
4.5 Area and Fmax of multiplexers
4.6 Stability of SignalTap


Chapter 1

Introduction

1.1 Background and Motivation

Today, the size and complexity of modern Very Large Scale Integration (VLSI) computer chips grow at a galloping pace. This exponential trend constitutes the driving force that sustains growth in the semiconductor industry. The domain is characterized by escalating competitiveness, increasing customer demands, strict time-to-market deadlines and tight budgetary constraints. Consequently, the semiconductor industry is on a constant lookout to automate most of the steps involved in typical VLSI design flows. Computer Aided Design (CAD) tools are crucial components in this effort, as they automate many steps that would otherwise be intractable manually.

The realization of a computer chip comprises a multitude of steps, where each step refers to a different layer of design abstraction. Fig. 1.1 illustrates a simplified view of the VLSI design flow. The flow begins with a high level behavioral specification, composed as a natural language document or written with the use of refined behavioral models in languages such as C or Matlab. Behavioral synthesis is the step that follows immediately, and converts the behavioral specification into a Register Transfer Level (RTL) description in a language such as Verilog or VHDL. The next step in the design flow is logic synthesis, where the generated RTL description is translated into a gate-level netlist. The gate-level description is subsequently synthesized to a transistor-level netlist, which in turn is placed and routed on the chip fabric. Finally, the resulting physical layout is passed for fabrication to a chip manufacturing facility.


Figure 1.1: Simplified VLSI design flow


Two integral processes of any VLSI design flow are those of verification and test, which are interposed between implementation steps to guarantee that each abstraction layer matches the next one. More precisely, functional verification is a process that matches the behavioral specification to the RTL description and the RTL description to the gate-level netlist, ensuring that no functional discrepancies exist. When functional verification fails, design debugging commences. The purpose of debugging is to locate the root-cause that led to a functional mismatch exposed by verification. Finally, test is the process which ensures that the manufactured chip matches its gate-level netlist. If a mismatch is identified at that step, then either silicon debug or fault diagnosis commences in order to locate functional errors or fabric defects, respectively.

In modern design flows, most of the verification and debugging steps in Fig. 1.1 are automated. Equivalence checking [5, 14], property checking [9, 28], automatic test-bench generation, assertion-based verification [13], functional coverage and the employment of powerful formal engines, such as Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT) solvers and Quantified Boolean Formulae (QBF) [12, 27, 34, 35, 37, 38], constitute an evolution that re-defined the verification approach. These algorithms and methods brought automation to functional debugging and achieved a significant decrease in the associated time and cost.

However, despite the above breakthroughs, the ever-growing complexity and size of electronic designs has intensified the cost of debugging-related issues that were previously easier to manage. For example, more and more engineers with very specialized responsibilities are involved in design cycles. As a result, the task of determining the appropriate engineer(s) to analyze a verification failure has become far from trivial. Another negative impact of this massive growth appears in the field of FPGA functional debug, where, today, the internal logic becomes harder to observe and debug, and thus requires the use of external equipment and time-consuming iterative approaches.

In the following two sections, we discuss the current status and issues regarding the above growing problems in verification.


1.1.1 Failure Triage in Verification

Typical modern designs incorporate a broad collection of interconnected functional blocks that interact to define overall design operation. Evidently, functional verification of such complex Systems-on-Chip (SoCs) becomes much more demanding in time, cost and resources. It is commonplace for verification and design engineering teams to repetitively exercise the system's behavior in an effort to expose as many failures as possible, a process known as regression verification.

Despite recent efforts in industry and academia that bring automation to ease functional debugging, another emerging problem that relates to regression verification has been left unexplored. Specifically, automation in debugging mainly focuses on single failures, where each failure is treated separately and in isolation, even if verification exposes dozens of mismatches. This narrow approach becomes an impediment in regression verification scenarios and complicates the debugging task, since regression can potentially generate a plethora of failures to be fixed. Neglecting potential relations between these failures may lead to multiple engineers investing effort to resolve the same design error, wasting precious resources. Moreover, since the design's erroneous behavior can be familiar to some of the engineers but totally opaque to others, it is hard to identify the most suitable engineer(s) to debug these failures.

In this context, triage is the high-level debugging task following regression verification that has a twofold purpose. First, it tries to group together those failures that are closely related with respect to their root-cause. Second, it aims to identify the most suitable engineer(s) to perform detailed debugging for each one of the formed groups. The benefit of triage lies in the fact that only those engineers familiar with the erroneous behavior will pursue detailed debugging for a given group. Moreover, any fix for a failure can potentially eliminate most failures belonging to the same group, since they are probably caused by the same design error.

Current studies indicate that triage can potentially occupy up to 30% of the debugging effort [18]. Despite these projections, triage in modern flows is predominantly performed in an ad hoc, manual and time-consuming manner [36].


1.1.2 Functional Debug and FPGAs

As the cost of state-of-the-art ASIC design continues to escalate, field-programmable gate arrays (FPGAs) have become widely used platforms for digital circuit implementation. FPGAs carry several advantages over ASICs, including reconfigurability and lower NRE costs for mid-to-high volume applications. While there remains a gap between FPGAs and ASICs in terms of circuit speed, power and logic density [24], innovations in FPGA architecture, circuits and CAD tools have produced steady improvements on all of these fronts [7, 8, 39]. Today, FPGAs are a viable target technology for all but the highest volume or low-power applications.

The reconfigurability property of FPGAs reduces the cost associated with fixing the various functional errors that can occur during the design cycle. Whether used as hardware emulation platforms or as actual target devices, FPGAs offer a different debugging approach by allowing design iterations to include actual silicon execution. Designers verify their design in hardware using the same (or a similar) FPGA they intend to deploy in the field. When design errors are discovered, the design's RTL is altered, re-synthesized and executed in hardware. Although widely adopted, the nature of such verification and debugging approaches is clearly iterative, where each iteration includes the re-execution of time- and resource-intensive tool flow steps.

The underlying factor that strongly affects the number of debug iterations is the observability of the design's internal logic. If the engineer is able to observe a relatively large number of internal signals during silicon execution, then fewer debugging iterations are required. However, in silicon execution the number of observable signals is limited by the number of user-dedicated pins, which, in most cases, is prohibitively small to accommodate the needs of functional debugging. As a result, typical techniques that utilize commercial internal or external logic analyzers have their efficiency severely degraded by this limitation [4, 42]. Engineers are required to observe a predefined set of internal signals, analyze the respective waveforms, and then re-synthesize the design along with the interconnections to the logic analyzer, before they can observe a new set of signals.

As such, the domain of FPGA functional debug faces an emerging need for robust debugging techniques that leverage reconfigurability to increase observability and eventually reduce the number of debugging iterations.


1.2 Purpose and Scope

This thesis aims to address the aforementioned emerging issues in functional debugging. The first contribution is the introduction of a novel and automated triage framework for RTL design debugging. The second contribution suggests new hardware structures and software techniques that leverage reconfigurability to reduce FPGA functional debugging time.

In summary, the contributions of this thesis are as follows:

• A novel automated triage methodology is presented, which groups together related failures that are generated by regression verification flows. The framework is based on newly introduced metrics that define relations between failures and make predictions about the number of co-existing RTL errors in the failing design. Part of the framework is a novel ranking scheme that helps engineers identify those potential error locations that should be prioritized for detailed debugging. The proposed framework formulates triage as a pattern recognition problem, and generates solutions by employing clustering techniques. The developed framework is tested on four different industrial designs with multiple injected errors, demonstrating significant gains over existing triage methodologies. Specifically, among all regression verification scenarios, we observed an overall accuracy of 94%, which constitutes a 40% improvement over conventional approaches.

• New hardware and software techniques are introduced that speed up FPGA functional debugging by allowing the tracing of internal design signals during silicon execution without the need for time-intensive re-synthesis iterations. The proposed method requires a sole execution of the synthesis flow to trace a large number of signals for an arbitrary number of cycles using a limited number of output pins. Experimental results demonstrate that our approach improves run-time by up to 30× vs. a conventional approach for FPGA functional debug. Our approach also offers stability in the timing characteristics of the circuit being debugged.

1.3 Thesis Outline

This thesis is organized as follows. Chapter 2 presents background information on design verification, design debugging and FPGAs. Chapter 3 describes a novel automated triage framework for design debugging. Chapter 4 proposes new techniques for FPGA functional debug that exploit the re-configurable nature of these devices. Finally, Chapter 5 concludes this thesis.


Chapter 2

Background

2.1 Introduction

This chapter presents background material that is relevant to the contributions of this thesis. It is organized as follows. Section 2.2 provides an overview of functional verification and design debugging. Next, Sections 2.3 and 2.4 discuss the basic concepts of SAT-based functional debugging and simulation metrics in debugging and verification, topics that are related to the first contribution of this dissertation. Finally, Section 2.5 summarizes the basic architectural features of modern FPGAs that form the platform on which the second thesis contribution is based.

2.2 Design Verification

The goal of functional verification is to determine whether the implementation of a design conforms to its specification. Although many different types of errors exist, this thesis addresses functional errors in the RTL. Errors relating to power consumption and/or timing violations are not considered. For clarity, the term error refers to any incorrect RTL element(s) that cause an observable discrepancy between the design's implementation and its specification under the same input stimuli. Broadly speaking, there are two main types of errors, introduced either by bugs in CAD tools or by the human factor: those related to incorrect design code (wrong signal(s), operation(s), disconnected port(s), etc.) and those related to erroneous verification code (wrong assertion(s), incorrect stimuli, etc.). The former are referred to as design errors, whereas the latter are known as verification environment errors [22]. In the context of this dissertation, the general term error refers to both of the above error categories.

The observable effect of an error is called a verification failure. A discrepancy related to a verification failure takes the form of conflicting signal values (0, 1 or X for unknown) at observation points, such as primary output pins, observed internal circuit signals or failing assertions.

2.2.1 Verification Tools

Functional verification can be coarsely categorized into simulation-based, emulation-based and formal verification. Broadly speaking, simulation-based verification [1, 6, 32] explores the design space by providing stimulus to the design through a testbench. Mainstream simulation-based verification strategies utilize a logic simulator to determine the existence of failures by repeatedly applying input patterns to exercise corner-case functionality. The major drawback of this approach is its non-exhaustive nature. As such, it is unable to guarantee functional correctness of the design. Along these lines, several coverage metrics exist that indicate whether simulation can achieve a high degree of confidence in design correctness. Despite its shortcomings, simulation-based verification remains the predominant verification methodology in the industry today, as it is used in more than 90% of verification cases [18]. The success of simulation-based verification is primarily due to its simplicity and its relative scalability.

In contrast to simulation, formal verification explores the design space exhaustively with the use of formal engines and mathematical models, such as Binary Decision Diagrams (BDDs), Satisfiability (SAT) and Quantified Boolean Formula (QBF) solvers [12, 27, 34, 35, 37, 38]. Consequently, formal techniques can prove or disprove the correctness of the design, but they may suffer from limited practicality due to the poor scalability of the underlying mathematical models.

Finally, emulation-based verification is predominantly based on FPGA prototyping [16, 25]; the design is first implemented on an FPGA and is exercised for the purpose of verification. Once the design passes verification, it is then implemented on a similar platform that is intended for deployment in the field. The major advantage of emulation/FPGA prototyping is speed. The design can often run orders of magnitude faster than simulation, and the verification process does not slow down as new functional blocks are integrated in the design. On the other hand, its major drawback is observability. Accessing the values of internal signals and state elements requires the use of logic analyzers or scan chains that can only monitor a limited number of signals [4, 20].


In the simplest of cases, verification tools are run on-the-fly by the engineer to identify a single failure, and then debugging commences so that the failure is eventually rectified. However, this dissertation addresses a more complex scenario: that of regression verification. In modern verification environments, regression verification refers to the practice where large sets of tests are run overnight on computer clusters to intensively exercise the design's behavior and potentially expose a large number of failures. For a regression run, it is only when all observed failures are collected that the debugging step begins. As such, the discussion that follows assumes the existence of multiple failures.

2.2.2 Notation and Iterative Logic Arrays

Before moving into details regarding verification and debugging, it is crucial to first introduce appropriate notation for sequential circuits and describe the basic circuit representation used in verification flows.

For a given sequential circuit with $n$ primary inputs, $m$ observation points and $l$ state elements, let $x_1, x_2, \ldots, x_n$, $y_1, y_2, \ldots, y_m$ and $e_1, e_2, \ldots, e_l$ denote the primary inputs, observation points and state elements, respectively. Let also vectors $x^i = \{x_1^i, x_2^i, \ldots, x_n^i\}$, $y^i = \{y_1^i, y_2^i, \ldots, y_m^i\}$ and $e^i = \{e_1^i, e_2^i, \ldots, e_l^i\}$ correspond to the values of primary inputs, observation points and state elements at cycle $i$, respectively. The vector $y^i$ is generally referred to as the observed response for cycle $i$ and corresponds to the values provided by the implementation of the design. On the other hand, for the specification of the design, the vector $\hat{y}^i = \{\hat{y}_1^i, \hat{y}_2^i, \ldots, \hat{y}_m^i\}$ represents the values that define the correct design behavior, and is referred to as the expected response for cycle $i$. If verification finds at least one cycle $i$ for which $y^i \neq \hat{y}^i$, then the design is said to fail verification, and a failure is said to be observed at cycle $i$. Finally, let $X_i^j = \{x^i, x^{i+1}, \ldots, x^j\}$ and $Y_i^j = \{y^i, y^{i+1}, \ldots, y^j\}$ denote the set of vectors of primary input values and the set of vectors of observed responses from cycle $i$ to cycle $j$, respectively.

In verification, it is necessary to model the behavior of a sequential circuit. This can be achieved by using a method called the Iterative Logic Array (ILA), also known as time-frame expansion. Assuming a single clock domain, time-frame expansion for $k$ cycles models a sequential circuit by extracting its combinational part and replicating it $k$ times, such that the next-state of each copy, or time-frame, is connected to the current-state of the next time-frame. Using the notation presented above, the following example demonstrates the time-frame expansion process.


Figure 2.1: The Iterative Logic Array method. (a) Original circuit. (b) Extracted circuit. (c) Two time-frame expanded circuit.

Example 1 Figure 2.1 shows how to generate the ILA of a circuit for two cycles. Figure 2.1(a) shows the original sequential circuit, with primary inputs $x_1$ and $x_2$, a single primary output $y_1$, and a single state element $e_1$. First, the combinational component is separated from the sequential components by extracting the flip-flops, as shown in Figure 2.1(b). The pseudo-inputs and pseudo-outputs are implicitly shown as wires coming out of the dotted box. Next, the combinational component is replicated into a two time-frame expanded circuit, shown in Figure 2.1(c).
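To make the mechanics of time-frame expansion concrete, the following sketch unrolls a toy sequential circuit in software. It is only an illustration under simplifying assumptions (a single clock domain, the combinational part given as a pure function); the function names and the toy logic are hypothetical and not taken from the thesis tool flow.

```python
# A minimal sketch of time-frame expansion (ILA) in software.
# The combinational part of the toy circuit is modeled as a pure function
# mapping (primary inputs, current state) to (outputs, next state).

def comb(x, e):
    """Hypothetical combinational part: inputs x = (x1, x2), state e = (e1,)."""
    x1, x2 = x
    (e1,) = e
    y1 = (x1 & e1) | x2        # output logic
    e1_next = x1 ^ x2          # next-state logic (the extracted flip-flop input)
    return (y1,), (e1_next,)

def unroll(comb, initial_state, input_vectors):
    """Replicate the combinational part once per cycle; each time-frame's
    next-state feeds the current-state of the following time-frame."""
    state = initial_state
    responses = []
    for x in input_vectors:            # one iteration per time-frame
        y, state = comb(x, state)
        responses.append(y)
    return responses                   # observed responses y^1, ..., y^k

# Two time-frames, mirroring Example 1.
print(unroll(comb, initial_state=(0,), input_vectors=[(1, 0), (0, 1)]))
```

In a formal flow the same unrolling is performed symbolically, with the replicated logic becoming part of a SAT instance, rather than by concrete simulation as above.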

2.2.3 Multiple Counter-examples in Verification

In the vast majority of verification cases, for each failure exposed during regression, both simulation-based and formal verification tools generate a counter-example, that is, a sequence of primary inputs, starting from a given initial state, that leads to a discrepancy between the actual and expected responses of a design's implementation and specification, respectively. As such, each counter-example contains precisely the information needed to excite and observe the erroneous behavior related to each failure.

It should be clarified that there are cases where the design fails verification because it cannot reach specific states, and as a result no counter-example is generated [31]. For example, consider the case where a design is exercised to reach a particular state, but the state is actually unreachable for the input stimuli and initial states that are applied. Despite the fact that cases like this are not rare, this thesis does not deal with debugging instances where no counter-example exists. In what follows, we formally define the concept of a counter-example within a regression verification context.

Suppose that a regression verification process applies a single, long test sequence on the design under verification and generates a set of distinct counter-examples. Let this set be $\mathcal{C} = \{C_1, C_2, \ldots, C_{|\mathcal{C}|}\}$. Since there is a single counter-example $C_i \in \mathcal{C}$ for each observed failure, the set of counter-examples exposes exactly $|\mathcal{C}|$ verification failures. Each counter-example $C_i$ has an associated length, denoted as $L_i$, which refers to the number of cycles the design is simulated until the corresponding failure is observed. Assuming that all counter-examples start from cycle 1, each counter-example $C_i$ can be formally defined as follows.

Definition 1 Given an erroneous design, a counter-example $C_i$ that exposes a failure at cycle $L_i$ is a tuple $\langle I, X_1^{L_i}, Y_1^{L_i}, \hat{y}^{L_i}\rangle$, where $I$ is a set of initial states, $X_1^{L_i} = \{x^1, x^2, \ldots, x^{L_i}\}$ is a set of vectors of primary input values from cycle 1 to cycle $L_i$, $Y_1^{L_i} = \{y^1, y^2, \ldots, y^{L_i}\}$ is a set of vectors of observed responses from cycle 1 to cycle $L_i$, and $\hat{y}^{L_i}$ is a vector of expected responses for cycle $L_i$, with $\hat{y}^{L_i} \neq y^{L_i}$.

Note that $\hat{y}^{L_i}$ differs from $y^{L_i}$ ($\hat{y}^{L_i} \neq y^{L_i}$) in that the former represents the expected response at cycle $L_i$, whereas the latter corresponds to the actually observed response at the same cycle. In this context, a counter-example comes into conflict with the simulation of the buggy design only at the responses related to the corresponding verification failure. Also note that a counter-example typically ends at the cycle where the mismatch related to the failure is observed. In other words, the discrepancy associated with counter-example $C_i$ of length $L_i$ is observed exactly at cycle $L_i$.

According to the above definition, it should be clarified that each counter-example includes only the expected response that exposes a single failure at a specific simulation cycle. For example, a counter-example $C_i$ includes only the expected response at cycle $L_i$ where the corresponding failure is observed, even if more failures are observed in previous cycles. Those failures are captured by different counter-examples and are not associated with counter-example $C_i$.

Figure 2.2: Two distinct counter-examples. (a) A counter-example of length $m$. (b) A counter-example of length $m-1$.

Finally, regression verification generates counter-examples in increasing order of finish time. Since all counter-examples start at cycle 1, their finish time is equal to their length. As such, all counter-examples are considered to appear in sorted order in the counter-example set $\mathcal{C}$, such that $L_i \le L_j$ if and only if $i < j$.
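For a programmatic view, the tuple of Definition 1 maps naturally onto a small data structure. The sketch below is a hypothetical representation (field names are illustrative, not from any tool in the thesis), assuming per-cycle vectors stored in lists where index 0 corresponds to cycle 1.

```python
# A sketch of the counter-example tuple <I, X_1^Li, Y_1^Li, yhat^Li>
# of Definition 1. All names are illustrative.
from dataclasses import dataclass

@dataclass
class CounterExample:
    initial_states: tuple      # I: set of initial states
    inputs: list               # X_1^Li: one primary-input vector per cycle
    observed: list             # Y_1^Li: one observed-response vector per cycle
    expected_last: tuple       # yhat^Li: expected response at the failing cycle

    @property
    def length(self):
        """L_i: the failure is observed at the last cycle of the trace."""
        return len(self.inputs)

# Regression yields counter-examples sorted by finish time, i.e. by length:
# regression = sorted(counter_examples, key=lambda c: c.length)
```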

Example 2 Based on the above definitions, Figure 2.2 illustrates a time-frame expanded erroneous design with two counter-examples $C_1$ and $C_2$ of respective lengths $L_1 = m$ and $L_2 = m-1$. Counter-examples $C_1$ and $C_2$ expose two distinct failures at cycles $m$ and $m-1$, respectively. According to the notation presented above, $C_1 = \langle I, \{x^1, x^2, \ldots, x^m\}, \{y^1, y^2, \ldots, y^m\}, \hat{y}^m\rangle$. Similarly, we have that $C_2 = \langle I, \{x^1, x^2, \ldots, x^{m-1}\}, \{y^1, y^2, \ldots, y^{m-1}\}, \hat{y}^{m-1}\rangle$. In Figure 2.2(a), an error (illustrated by a circle) is excited at cycle $m-2$ and eventually propagates to cause a mismatch at primary output $y_3$ during cycle $m$. This erroneous behavior is captured by counter-example $C_1$. Counter-example $C_2$, on the other hand, exposes the erroneous behavior related to the second error, which is excited in cycle $k$ and causes a failure in cycle $m-1$ at output $y_2$, as illustrated by Figure 2.2(b). It is important to note that each counter-example contains expected responses only for a single failure that has not already been exposed by another counter-example. Thus, $C_1$ includes the expected response $\hat{y}^m$ for cycle $m$ but does not include the expected response $\hat{y}^{m-1}$ for cycle $m-1$, because the failure at cycle $m-1$ has already been exposed by $C_2$.

As we will see in the next section, a counter-example is the main object of analysis during functional debugging.

2.3 SAT-based Design Debugging

Functional debugging is the process in which a failure is analyzed to identify its error source(s). When the error is in the design or the verification code, the term commonly used in the literature is design debugging [1, 35]. With respect to design debugging, the term suspect is used to specify a design component that can potentially be the error source of a failure. A suspect component can be a line of RTL code, a Verilog always statement, an if or case statement, a module instantiation, etc. More precisely, a suspect is defined as follows.

Definition 2 Let $C_i = \langle I, X_1^{L_i}, Y_1^{L_i}, \hat{y}^{L_i}\rangle$ represent a counter-example exposing a failure at cycle $L_i$. Then a design component is called a suspect for $C_i$ if and only if the component can be functionally modified such that $C_i$ no longer exposes the observed failure at cycle $L_i$.

SAT-based design debugging is a formal method that encodes the design debugging problem into a SAT instance for a given counter-example [12, 34, 35, 38]. All satisfying assignments for this SAT instance correspond to suspects that can be functionally altered to rectify the erroneous behavior exposed by the counter-example.

As an additional benefit, SAT-based mechanics allow debuggers to return error propagation paths in the circuit that show how a value from a suspect location propagates through consecutive cycles to reach the output where a mismatch is observed [22]. Modern SAT-based debugging formulations [22], known as time-based design debugging, further allow the debugging tool to pinpoint the exact cycle where a possible error can be excited at a suspect location to cause the observed failure. This information augments the knowledge of the engineer(s) during the effort of identifying the actual cause of a failure among all returned suspect locations.

Definition 3 The cycle at which an erroneous value appears at the output(s) of a suspect component and eventually propagates to cause a failure is referred to as the excitation cycle for that suspect.

Definition 4 An error propagation path is a path in the ILA representation of a sequential circuit that starts from a suspect location at the corresponding excitation cycle and shows how the erroneous value propagates to the failing output.

It becomes apparent that a SAT-based debugger does not return a suspect solely as a design location, but also provides the excitation cycle for each suspect component. For the remainder of this thesis, we refer to possibly erroneous design locations as suspect locations, irrespective of the excitation cycle, and we use the more general term suspect to refer to the suspect location along with its excitation cycle.

Using the above definitions we can define the input and output of a SAT-based debugger as follows:

Definition 5 A SAT-based debugger takes as input an erroneous design and a counter-example $C_i$ demonstrating a failure. The output of the debugger is a set of suspects $S_i = \{\langle s_{i_1}, t_{i_1}\rangle, \langle s_{i_2}, t_{i_2}\rangle, \ldots, \langle s_{i_{|S_i|}}, t_{i_{|S_i|}}\rangle\}$ containing all possible suspect locations $s_{i_j}$ and the respective excitation cycles $t_{i_j}$. Set $S_i$ is referred to as the suspect set (or solution set) of counter-example $C_i$.

Here, we should mention that SAT-based debugging traditionally requires a parameter called error cardinality, denoted as $N$. The error cardinality is equal to the number of suspects that can simultaneously rectify the counter-example. Specifically, if the debugger is run with error cardinality $N = 1$, then each suspect in the solution set corresponds to a single design component. If, on the other hand, the error cardinality is set to a higher value, then each suspect is a tuple of $N$ design components that must be simultaneously modified in order to rectify the failure. For example, if $N = 2$ then the returned solution set contains all possible pairs of design components that, if both modified, can rectify the counter-example. In any case, the excitation cycles of each design component are also provided in the corresponding solution set.
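The enumeration implied by higher cardinalities is easy to picture in code. The sketch below is a hypothetical illustration only: `rectifies` stands for whatever check decides that a tuple of components jointly rectifies the failure (in the brute-force analogue above, simulating with several simultaneous overrides; in the real method, a SAT query).

```python
# A sketch of error cardinality N: candidate suspects are N-tuples of
# components that must be modified together. `rectifies` is a hypothetical
# predicate supplied by the underlying debugging engine.
from itertools import combinations

def solution_set(components, rectifies, N):
    """All N-tuples of components that jointly rectify the counter-example."""
    return [combo for combo in combinations(components, N) if rectifies(combo)]

# As described in the text, N starts at 1 and is incremented until
# explanations are discovered:
# N = 1
# while not solution_set(components, rectifies, N):
#     N += 1
```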

Traditionally, error cardinality $N$ is set to 1 and is iteratively incremented until all possible combinations of suspects are discovered. Nevertheless, in practice it has been shown that in the vast majority of debugging scenarios it suffices to set $N$ to the value of 1 or 2, that is, to search for single suspects or pairs of suspects [35]. For simplicity, the methods described in this thesis assume that $N = 1$. This is not a limitation, and as we will discuss in Chapter 3, enforcing higher error cardinality can be effectively managed by the proposed methodologies.

Figure 2.3: Typical automated debugging flow

The above definitions imply that a suspect is not necessarily the actual error that was introduced into the design. It can be shown that there can be multiple equivalent suspects that explain a single failure [41]. However, the exhaustive nature of SAT-based debuggers guarantees that the actual error will be included as a suspect in the returned suspect set [35]. Evidently, the returned suspect set provides vital hints to the engineer(s) as to where the actual error location should be searched for. The engineer(s) use their intuition and inherent understanding of the design's behavior in order to identify those suspect locations that could be actual errors or related to them. The process of rigorously examining the suspect set is called detailed debugging [36].


Figure 2.4: Debugging result for counter-examples $C_1$ and $C_2$. (a) Counter-example $C_1$ and debugging suspects. (b) Counter-example $C_2$ and debugging suspects.

In summary of the above, a typical verification flow including the task of automated debugging is shown in Figure 2.3. The verification tool takes as input the design under test (DUT) as well as the high level specification. If verification fails, a counter-example is generated. The debugging process takes the counter-example, as well as the DUT, and returns a set of suspects that can rectify the erroneous behavior.

The following example illustrates the aforementioned concepts.

Example 3 An example of debugging two counter-examples is depicted in Figure 2.4 by using the Iterative Logic Array representation of the sequential design. In this example, two errors cause two distinct failures. One is excited in cycle $m-2$ and propagates to cause a failure at the output in cycle $m$, as shown in Figure 2.4(a), and the other is excited in cycle $k$ and propagates to an output in cycle $m-1$, as illustrated by Figure 2.4(b). The generated counter-examples $C_1$ and $C_2$, of lengths $L_1 = m$ and $L_2 = m-1$ respectively, are then passed to an automated debugger. For counter-example $C_1$, the result is a solution set $S_1 = \{\langle s_{1_1}, k\rangle, \langle s_{1_2}, m-2\rangle, \langle s_{1_3}, m-1\rangle\}$ of circuit components that can explain the wrong output at cycle $m$. On the other hand, for counter-example $C_2$, the debugger returns a solution set $S_2 = \{\langle s_{2_1}, k\rangle, \langle s_{2_2}, m-1\rangle\}$ of circuit components that can be responsible for the failure at cycle $m-1$. All suspects $s_{1_1}$, $s_{1_2}$, $s_{1_3}$, $s_{2_1}$ and $s_{2_2}$, excited at cycles $k$, $m-2$, $m-1$, $k$ and $m-1$ respectively, are illustrated in Figure 2.4 along with their propagation paths. Note that the above excitation cycles are also included in the returned suspect sets. In Figure 2.4 each suspect location is denoted by a dotted circle in the cycle at which it is excited. The suspects that correspond to the actual error locations are illustrated by a solid circle. Also notice that the actual error locations are returned as suspects in the respective solution sets: the error responsible for counter-example $C_1$ is returned as suspect $s_{1_2}$, while the error responsible for counter-example $C_2$ is returned as suspect $s_{2_1}$.

2.3.1 Debugging Multiple Counter-examples

One of the major limitations in traditional SAT-based debugging flows is that each counter-example is treated in isolation, without considering potential relations to other counter-examples. When the number of counter-examples is relatively small, this narrow approach rarely affects the debugging cycle. On the other hand, when regression generates dozens or even hundreds of counter-examples, ignoring such correlations may heavily jeopardize and/or prolong the verification cycle. For example, imagine a scenario where many counter-examples expose failures that are related to the same design error. In this case, debugging and individually analyzing each counter-example introduces significant redundancy in the verification cycle. If the engineer(s) had some knowledge available regarding this correlation, then the combined information from these counter-examples would significantly aid the discovery of the actual design error.

In this context, a straightforward correlation between two counter-examples can be easily defined based on the suspect locations they share. We refer to these suspects as the mutual suspect set, which we define as follows.

Definition 6 For two distinct counter-examples $C_i, C_j \in \mathcal{C}$, the set containing all common suspects and their respective excitation cycles is referred to as the mutual suspect set of $C_i$ and $C_j$, denoted as $M_{ij}$ and formally defined as:

$$M_{ij} = \big\{\,\{\langle s_{i_u}, t_{i_u}\rangle, \langle s_{j_v}, t_{j_v}\rangle\} : s_{i_u} = s_{j_v}\,\big\} \qquad (2.1)$$


Obviously, in the above definition, $M_{ij} = M_{ji}$. Remark that in our notation, the equality sign "=" between two suspect locations corresponds to a matching between their unique names in the netlist representation. This implies the exclusion of the excitation time for the purposes of this comparison; two suspect locations are considered identical when they correspond to the same circuit component, even if their excitation cycles differ. Also, for uniformity, $M_{ii}$ is defined under Eq. 2.1, where all pairs contain each suspect in $S_i$ twice. The example that follows demonstrates the concept of mutual suspect sets.

Example 4 For the example in Fig. 2.4, presented before, the mutual suspect set of counter-examples $C_1$ and $C_2$ is $M_{12} \equiv M_{21} = \{\{\langle s_{1_1}, k\rangle, \langle s_{2_1}, k\rangle\}\}$. Note that, trivially, we have $M_{11} = \{\{\langle s_{1_1}, k\rangle, \langle s_{1_1}, k\rangle\}, \{\langle s_{1_2}, m-2\rangle, \langle s_{1_2}, m-2\rangle\}, \{\langle s_{1_3}, m-1\rangle, \langle s_{1_3}, m-1\rangle\}\}$ and similarly $M_{22} = \{\{\langle s_{2_1}, k\rangle, \langle s_{2_1}, k\rangle\}, \{\langle s_{2_2}, m-1\rangle, \langle s_{2_2}, m-1\rangle\}\}$.
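Definition 6 translates directly into code. The following sketch assumes suspect sets stored as lists of (location, excitation cycle) pairs, as in Definition 5; the concrete location names are hypothetical placeholders.

```python
# A direct transcription of Definition 6 (Eq. 2.1). Excitation cycles are
# carried along but excluded from the location comparison.
def mutual_suspect_set(Si, Sj):
    """All pairs of suspects from Si and Sj that share a suspect location."""
    return [((s_iu, t_iu), (s_jv, t_jv))
            for (s_iu, t_iu) in Si
            for (s_jv, t_jv) in Sj
            if s_iu == s_jv]

# Example 4 revisited with placeholder location names: suspects s11 and s21
# of Example 3 refer to the same netlist location, here called "u0.g5".
S1 = [("u0.g5", "k"), ("u0.g7", "m-2"), ("u1.g2", "m-1")]
S2 = [("u0.g5", "k"), ("u1.g9", "m-1")]
print(mutual_suspect_set(S1, S2))
# -> [(('u0.g5', 'k'), ('u0.g5', 'k'))], i.e. |M12| = 1
```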

The above definition forms the basis upon which this dissertation builds in order to extract useful information from the massive data available after regression verification.

2.4 Simulation Metrics in Verification and Debugging

As previously discussed, simulation-based verification is widely adopted in the semiconductor industry. Apart from the benefits of speed and scalability, simulation tools also offer a wide range of metrics that can be exploited to extract useful information for the debugging process. Simulation coverage is one such metric. Generally, the term simulation coverage refers to various coverage metrics, such as functional coverage, branch coverage or code coverage. For the purposes of this thesis, the term simulation coverage (or simply coverage) refers to the number of times each code line, block, expression or branch in the RTL description of a sequential circuit is exercised.

The basic goal of simulation coverage is to provide a degree of confidence that a design is indeed correct whenever it passes verification. For example, if an RTL block is never covered by simulation, then this RTL block is not guaranteed to be bug-free even if the design passes verification. On the other hand, coverage can also provide useful information when verification fails, such as the number of times a component is simulated before the failure is observed.


Broadly speaking, knowing whether a design component is rigorously exercised or not provides a measure of certainty when one attempts to estimate how reliably that component can be considered bug-free. One question that arises is whether coverage can still provide useful information given that a design component is potentially erroneous, that is, given that the component is a suspect location for a particular counter-example.

Intuitively, if a suspect component is rigorously exercised for many cycles before an error at that location is excited and propagates to an observation point, then one can speculate that this component may actually correspond to an error that is hard to excite and/or has a small effect on the rest of the circuit; otherwise, it would only take a small number of cycles to excite an error at that suspect location. Conversely, a suspect location that needs to be exercised only for a few cycles until an erroneous value appears and causes a failure can be considered more "severe", that is, easier to excite and easier to propagate to an observation point. Simulation coverage can provide useful information which, combined with SAT-based debugging results, may enrich our knowledge around the nature of each suspect component.

To this end, we need to extract coverage information with respect to each suspect that appears in the solution set of a counter-example. However, each suspect has its own excitation cycle within a counter-example, and, as a result, different parts of the counter-example need to be examined for each suspect component. In this context, for each suspect there are two parts of the counter-example that should be examined separately, namely the prefix window and the suffix window of the suspect. In what follows, we define these two concepts.

If a suspect location is a potential fix for more than one counter-example, then it must appear in more than one suspect set, having possibly different excitation cycles. In that case, the prefix window of a suspect location with respect to a given counter-example is defined as the part of the counter-example that lies between the following cycles:

• the excitation cycle of the suspect location that is related to the given counter-example

• the excitation cycle of the same suspect location that chronologically precedes the excitation that relates to the given counter-example.

Note that the chronologically preceding excitation does not relate to the given counter-example, but to a different one for which the suspect location is also a possible fix. More precisely, consider that the $u$th suspect for counter-example $C_i$, denoted as $\langle s_{i_u}, t_{i_u}\rangle$, includes the same suspect location as the $v$th suspect for counter-example $C_j$, denoted as $\langle s_{j_v}, t_{j_v}\rangle$. That is, $s_{i_u} = s_{j_v}$, but $t_{i_u}$ and $t_{j_v}$ are potentially different. Also assume that the excitation cycle $t_{j_v}$ precedes the excitation cycle $t_{i_u}$ in time and that between cycles $t_{i_u}$ and $t_{j_v}$ there is no other excitation cycle for the same suspect location. Then the prefix window of suspect $\langle s_{i_u}, t_{i_u}\rangle$ with respect to counter-example $C_i$ is defined as $pre(\langle s_{i_u}, t_{i_u}\rangle) = \langle X_{t_{j_v}}^{t_{i_u}}, Y_{t_{j_v}}^{t_{i_u}}\rangle$, which effectively defines the part of counter-example $C_i$ that starts at cycle $t_{j_v}$ and ends at cycle $t_{i_u}$. If $C_i$ is the only counter-example for which suspect location $s_{i_u}$ is returned in the solution set, then $pre(\langle s_{i_u}, t_{i_u}\rangle) = \langle X_1^{t_{i_u}}, Y_1^{t_{i_u}}\rangle$; that is, the prefix window starts from cycle 1, since there is no other excitation for the same suspect location in any cycle before cycle $t_{i_u}$. The prefix window of a suspect is examined separately because it contains coverage information for a given suspect between two consecutive excitations. As previously discussed, this information may prove significant when one attempts to analyze how easy or difficult it is to excite an erroneous value at a specific suspect location.

On the other hand, the suffix window of a suspect location is the part of the corresponding counter-

example that begins one cycle after the suspect excitation cycle and ends at the cycle where the failure

is observed. Evidently, the suffix window of a suspect is the one that contains its error propagation path

and demonstrates how easily an erroneous value at a suspect location can propagate to the observation

point. We denote the suffix window of suspect $\langle s_{i_u}, t_{i_u}\rangle$ as $suf(\langle s_{i_u}, t_{i_u}\rangle) = \langle X^{L_i}_{(t_{i_u}+1)}, Y^{L_i}_{(t_{i_u}+1)}\rangle$.

For the prefix and suffix windows of a suspect component we also refer to their length as the number of cycles included in the window. With respect to the above definitions, we denote the length of the prefix and suffix window of suspect $\langle s_{i_u}, t_{i_u}\rangle$ as $\|pre(\langle s_{i_u}, t_{i_u}\rangle)\|$ and $\|suf(\langle s_{i_u}, t_{i_u}\rangle)\|$, respectively.

Example 5 Consider the same counter-examples $C_1$ and $C_2$ that were previously shown in Figure 2.4, along with their suspect components and the respective excitation cycles. Figure 2.5 illustrates the prefix and suffix windows of suspects $\langle s_{1_1}, 5\rangle$ and $\langle s_{2_2}, 6\rangle$. The prefix window of suspect $\langle s_{1_1}, 5\rangle$ is given as $pre(\langle s_{1_1}, 5\rangle) = \langle X^{t_{1_1}}_{t_{2_1}}, Y^{t_{1_1}}_{t_{2_1}}\rangle = \langle X^5_3, Y^5_3\rangle$, since the chronologically preceding excitation of the same suspect location happens at cycle 3 and is captured by counter-example $C_2$. The suffix window of suspect $\langle s_{1_1}, 5\rangle$ is given as $suf(\langle s_{1_1}, 5\rangle) = \langle X^{L_1}_{(t_{1_1}+1)}, Y^{L_1}_{(t_{1_1}+1)}\rangle = \langle X^{10}_6, Y^{10}_6\rangle$.


[Figure 2.5: Prefix and suffix windows of suspects $\langle s_{1_1},5\rangle$ and $\langle s_{2_2},6\rangle$]

Similarly, the prefix window of suspect $\langle s_{2_2}, 6\rangle$ is given as $pre(\langle s_{2_2}, 6\rangle) = \langle X^{t_{2_2}}_1, Y^{t_{2_2}}_1\rangle = \langle X^6_1, Y^6_1\rangle$, since there is no excitation of the same suspect location before the excitation at cycle 6. The suffix window of suspect $\langle s_{2_2}, 6\rangle$ is given as $suf(\langle s_{2_2}, 6\rangle) = \langle X^{L_2}_{(t_{2_2}+1)}, Y^{L_2}_{(t_{2_2}+1)}\rangle = \langle X^7_7, Y^7_7\rangle$.

Now that the prefix window of a suspect is defined, we can also define the number of times a suspect location is exercised for the duration of its associated prefix window, as follows.

Definition 7 The number of cycles for which the input(s) of a suspect component $s_{i_u}$ toggle(s) during its prefix window is referred to as the frequency of the suspect, and is denoted as $f_{i_u}$.

Since all suspects $s_{i_u}$ in a counter-example suspect set $S_i$ correspond to circuit components, a mapping between $s_{i_u}$ and its corresponding frequency $f_{i_u}$ is always feasible.
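To make these definitions concrete, the following is a minimal sketch (illustrative Python, not the thesis tool flow) of how prefix windows, suffix windows, their lengths and suspect frequencies could be computed; the `toggles` map is a hypothetical stand-in for simulation trace data.

```python
# A minimal sketch of the window and frequency computations defined above.

def prefix_window(t, all_excitations):
    """Prefix window of a suspect excited at cycle t: it starts at the
    chronologically preceding excitation of the same location (across all
    counter-examples), or at cycle 1 if no earlier excitation exists."""
    earlier = [c for c in all_excitations if c < t]
    return (max(earlier) if earlier else 1, t)

def suffix_window(t, L):
    """Suffix window: one cycle after the excitation up to the failure cycle L."""
    return (t + 1, L)

def length(window):
    start, end = window
    return end - start + 1  # number of cycles included in the window

def frequency(location, window, toggles):
    """Frequency f: cycles inside the prefix window where the suspect's
    input(s) toggle; toggles[location] is a set of cycles (assumed data)."""
    start, end = window
    return sum(1 for c in toggles.get(location, ()) if start <= c <= end)

# Example 5 data: suspect <s_11, 5> in C1 (L1 = 10); the same location is
# also excited at cycle 3, captured by counter-example C2.
pre = prefix_window(5, all_excitations=[3, 5])
suf = suffix_window(5, L=10)
print(pre, suf, length(suf))                   # (3, 5) (6, 10) 5
print(frequency("s11", pre, {"s11": {3, 4}}))  # 2 (hypothetical toggle data)
```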


[Figure 2.6: FPGA architecture]

2.5 Field Programmable Gate Arrays

In this Section we offer a brief description of the basic concepts behind a typical Field Programmable Gate Array architecture. It is exactly these architectural features that the presented FPGA debugging approach exploits in order to raise productivity in FPGA functional debug.

An FPGA is a two-dimensional array of programmable logic blocks (LBs) and a configurable routing network, as shown in Figure 2.6. Combinational logic functions within logic blocks in FPGAs are implemented using $K$-input look-up-tables (LUTs), which are small memories capable of implementing any logic function of up to $K$ variables. As shown in Figure 2.7(a), each LUT in an FPGA logic block is normally coupled with a flip-flop, which can optionally be bypassed. SRAM configuration cells are programmed to specify the truth table of the logic function implemented by the LUT, as well as to control the flip-flop bypass MUX.
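As a toy illustration of this point, the sketch below (illustrative Python, not vendor hardware) models a $K$-input LUT as a $2^K$-entry truth table held in SRAM configuration cells; the input pattern simply indexes into the table.

```python
# A toy model of a K-input LUT: the configuration bits ARE the function.

class LUT:
    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k
        self.k = k
        self.truth_table = truth_table  # models the SRAM configuration cells

    def evaluate(self, inputs):
        """Interpret the input bits as an index into the truth table."""
        assert len(inputs) == self.k
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

# Programming a 2-input LUT as an AND gate: entries for inputs 00, 01, 10, 11.
and_lut = LUT(2, [0, 0, 0, 1])
print(and_lut.evaluate((1, 1)))  # 1
print(and_lut.evaluate((1, 0)))  # 0
```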

Fig. 2.7(b) shows a simplified view of a programmable routing structure. The inputs to the MUX attach to logic block output pins or routing conductors in the FPGA device (metal wire segments). The output of the buffer can drive a routing conductor or a logic block input.

[Figure 2.7: FPGA hardware structures. (a) FPGA logic structures; (b) Routing structures]

Again, SRAM configuration cells drive the select inputs on the MUX, and the SRAM values specify a particular input whose signal is driven through the buffer.

Fig. 2.7 is intended to illustrate that the logic functionality and routing connectivity of an FPGA depend entirely on values in the programming bitstream that is shifted into the FPGA's SRAM configuration cells (which are connected in a scan chain). The programming bitstream also specifies the initial value (logic-0 or logic-1) for each flip-flop in the device.

2.6 Summary

This Section presented background material necessary for understanding the contributions of this thesis.

First, definitions and concepts about verification and debugging were presented. Next, SAT-based debugging and simulation metrics for verification were briefly discussed. Following that, a brief presentation of regression verification was given. Finally, the basic architecture and concepts of Field Programmable Gate Arrays were described.


Chapter 3

A Triage Engine for RTL Design Debug

3.1 Introduction

Debugging commences once a discrepancy between the specification and the implementation of a design

is discovered. As discussed in the previous Chapter, cutting-edge automated debuggers employ formal methodologies to ease the debugging process [19, 35]. This is accomplished by utilizing

counter-examples to generate a set of suspect locations in the design that can explain the erroneous

behavior. These locations provide vital suggestions to the engineer as to where the actual error lies in

the design.

However, regression verification flows complicate and prolong the debugging task, since they can

potentially generate hundreds of counter-examples to be fixed. At the end of regression tests, knowledge

of the relation between counter-examples and their culprit is limited. Normally, this causes confusion in

the design and/or verification engineering team, as each failure is constantly assigned and re-assigned

to various engineers until the most suitable one is found to fix the responsible design error.

Recall that triage is the high-level debugging task following regression verification that aims to

group together counter-examples with respect to their root cause errors and provide information that

helps determine the most suitable engineer to perform detailed debugging for each one of the formed

groups. Usually, this information comes in the form of a list of suspicious design components across

counter-examples that belong to a specific group.

Despite its growing complexity, triage in modern flows is predominantly performed in an ad-hoc,


manual and time-consuming manner. In the majority of cases, triage is based on scripts that parse

verification error messages to group the observed failures and determine the responsible engineers. In

less common cases, a single engineer is assigned to monitor and analyze verification error logs on a

daily basis to determine the best suited engineer for detailed debugging. The scripting approach suffers

from frequent inaccuracy in counter-example classification, whereas the manual nature of binding an

engineer to the triage task incurs significant cost in terms of time and relies on the engineer’s intuition

and inherent understanding of the design’s behavior. In this work we present a novel automated counter-

example triage framework. More precisely, our contributions are as follows.

1. First, we devise a ranking system for possibly erroneous design locations (suspects) to quantify

their likelihood of being an actual error. This is achieved by performing a probabilistic analysis to

show that errors i) usually have a suffix window of small length, and ii) usually have a relatively

small frequency during their prefix window. The ranking scheme forms the core of the proposed

triage metrics that correlate counter-examples.

2. Second, we introduce the concept of counter-example proximity, a novel speculative metric that

expresses similarity or dissimilarity between counter-examples based on the likelihood of orig-

inating from the same error source or from distinct ones. The suggested metric is constructed

by exploiting simulation coverage, satisfiability, and the proposed ranking system to determine

counter-example correlation.

3. Triage is then formulated as a pattern recognition problem and solved via hierarchical cluster-

ing methodologies. Our approach allows us to employ machine learning algorithms to build an

automated debugging triage framework.

The proposed triage engine can be seamlessly integrated in a regression verification flow and can be

viewed as a vital preprocessing step before detailed debugging commences. The framework is tested on

four different designs with multiple injected errors and achieves significant gains in accuracy, of up to

40%, compared to existing triage methodologies.

This Chapter is organized as follows. Section 3.2 defines the problem of triage in design debug-

ging. Section 3.3 introduces the proposed failure triage framework along with suggested metrics and

heuristics. Finally, Section 3.4 provides experimental results and Section 3.5 summarizes the chapter.


3.2 Triage in Debugging

As presented in Chapter 2, debugging a single counterexample is a straightforward procedure; the auto-

mated debugger will return a solution set that can justify the erroneous behavior, and from that set, all

suspects will be examined by the engineer to track down the actual error. Moreover, a quick overview

of the suspect locations is usually adequate to identify the rightful owner that should proceed with fix-

ing the counter-example. However, the existence of multiple counter-examples at the end of regression

verification necessitates a preprocessing step where one needs to identify the relation between counter-examples, perform some coarse-grain analysis and group them before they are delivered to the appropriate design/verification engineer. In this context, counter-example triage is defined as follows.

Definition 1: Given an erroneous design and a set of counter-examples $\mathcal{C} = \{C_1, C_2, \ldots, C_{|\mathcal{C}|}\}$, counter-example triage is a complete disjoint partition of $\mathcal{C}$ into $N$ clusters/groups $g_1, g_2, \ldots, g_N$, with $g_i \subseteq \mathcal{C}$ ($1 \leq i \leq N$), such that the following properties hold (a small structural check of the first two is sketched after this list):

• jointly exhaustive property: There is no counter-example that does not belong to some cluster, that is, $\bigcup_{i=1}^{N} g_i = \mathcal{C}$.

• mutually exclusive property: Each counter-example belongs to exactly one cluster, that is, $g_i \cap g_j = \emptyset$ if $i \neq j$.

• relation property: Each cluster contains related counter-examples that have a high probability of originating from the same design error.
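The first two properties of Definition 1 are mechanically checkable for any candidate grouping; a minimal sketch (illustrative Python) follows. The relation property is probabilistic and cannot be verified structurally.

```python
# A minimal structural check of the jointly exhaustive and mutually
# exclusive properties of Definition 1.

def is_valid_partition(counter_examples, groups):
    """counter_examples: iterable of ids; groups: list of sets of ids."""
    union = set().union(*groups) if groups else set()
    jointly_exhaustive = union == set(counter_examples)
    # Mutually exclusive iff no id is counted twice across groups.
    mutually_exclusive = sum(len(g) for g in groups) == len(union)
    return jointly_exhaustive and mutually_exclusive

print(is_valid_partition(["C1", "C2", "C3"], [{"C1", "C2"}, {"C3"}]))        # True
print(is_valid_partition(["C1", "C2", "C3"], [{"C1", "C2"}, {"C2", "C3"}]))  # False
```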

From the above definitions, two central points arise and have to be carefully addressed:

1. First, the relation between counter-examples belonging to the same group has to be clearly de-

fined. Ideally, counter-examples that belong to the same group are all caused by the same design

error. However, since state-of-the-art debugging tools only approximate the actual error loca-

tions, it is practically impossible to develop a method that guarantees the above. Instead, each

counter-example is assigned to a specific group only if there is a high probability that it is caused

by the same design error as the rest of the counter-examples in that group. Conversely, counter-

examples belonging to different groups should have a low probability of originating from the

same error source.


[Figure 3.1: Incorrect grouping by conventional techniques. (a) Different outputs failing because of the same error; (b) Same output failing because of different errors]


2. Second, a decision has to be made on the number of groups to be eventually formed by the

engine. Ideally, the number of formed groups should be equal to the number of co-existing errors

responsible for the whole set of generated counter-examples. However, in a real verification

environment there is no prior knowledge on what this number is. As such, triage needs to make a

guess on the number of co-existing errors. The quality of this process depends on how close this

guess is to the actual number of errors in the design.

It becomes clear that triage is characterized by uncertainty; a feature that makes the overall accuracy

of the engine sensitive to the way counter-example similarity is defined. Being able to identify accu-

rate correlations and making a good guess on the number of actual errors drastically increases triage

accuracy. In practice though, producing such accurate estimations is anything but a trivial task.

3.2.1 Motivation

Conventional approaches, such as script-based grouping of error logs or manual analysis frequently

fall short when it comes to identifying counter-example relationships and are usually devoid of any

estimation of the actual number of design errors [36]. Fig. 3.1 illustrates two common cases where

traditional methods tend to fail. In Fig. 3.1(a) an error propagates due to different stimulus through

different circuit elements and eventually causes two failures at distinct observation points, y2 and y3,

and at different cycles, 7 and 10 respectively. The counter-examples exposing those two failures will be

wrongly grouped into two separate groups $g_k$ and $g_w$, biased by the fact that the observation points (and hence the error messages) are different. The opposite scenario can also happen. Fig. 3.1(b) illustrates

two distinct errors causing a discrepancy at the same observation point, y3 (although in different cycles 8

and 10) by following different propagation paths. Traditionally, the counter-examples will be placed into

the same group, which is not the desired result. Finally, there is no clear way for the existing automated

methods to determine the rightful owner for each formed group of counter-examples. Frequently, a

verification engineer will be assigned with this task, which involves a lot of manual effort and essentially

defeats the purpose of automation.

Here, it should be noted that for traditional script-based methods, it is hard to base the error message

comparison on the timing information. Particularly, if the same output(s) fail, the error messages will

be considered identical in most cases even if the error messages are triggered in different cycles. Still,


there are cases where scripting techniques can identify patterns in the cycles where error messages

are triggered and hence are able to distinguish between those that are likely to correspond to different

design errors. For example, if two error messages refer to the same observation point but one is triggered

every 500ns and the other one every 2000ns, then these messages will be considered different and the

corresponding counter-examples will be grouped separately. The scripts used for comparison in this

dissertation abide by these empirical rules. Nevertheless, it becomes clear that triage strategies that rely

solely on error messages or timing information will often suffer from poor accuracy.
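For concreteness, the following is a minimal sketch (illustrative Python) of the naive script-style grouping: error messages are keyed on the failing output alone, which is exactly how the two mis-groupings of Fig. 3.1 arise. The `(output, cycle)` log format is a hypothetical simplification, and real regression scripts would also apply the period-based empirical rules mentioned above.

```python
# A sketch of naive script-based triage: group error messages by output only.

from collections import defaultdict

def script_triage(log_entries):
    """Group error messages by failing output, ignoring suspects entirely."""
    groups = defaultdict(list)
    for output, cycle in log_entries:
        groups[output].append(cycle)
    return dict(groups)

# Failures of Fig. 3.1: y2@7 and y3@10 stem from one error, y3@8 from another.
print(script_triage([("y3", 10), ("y2", 7), ("y3", 8)]))
# {'y3': [10, 8], 'y2': [7]} : the related y2/y3 failures are split, while
# the unrelated y3 failures are merged; both groupings are wrong.
```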

Where the above strategies fail, the proposed automated triage engine aims to extract deeper infor-

mation from each counter-example, then automatically group those that are similar and finally pass them

to the suitable engineer(s) for further detailed debugging. In order for the grouping to be acceptably ac-

curate, we propose the following:

1. First, a well-defined similarity metric is devised that quantifies the relationship between two given

counter-examples based on their likelihood of sharing the same error source. The proposed simi-

larity metric, called counter-example proximity, is based on a probabilistic error behavior analysis.

This analysis generates a ranking scheme for possible error locations (suspects). The way ranked

suspects are distributed among counter-examples defines how proximity is constructed.

2. Second, a “guess” is made on the number of errors causing the generated set of counter-examples,

because this determines the number of groups to be formed. The above guess is referred to as

the error count estimation and is also based on the suspect ranking scheme presented in this

dissertation.

3. Furthermore, an efficient technique is applied to form closely related groups by using the similar-

ity metric between all possible pairs of counter-examples. This is achieved by formulating triage

as a clustering machine learning problem.

4. Finally, the triage engine provides suggestions that help assign each group to the most appropriate

engineer(s). The suspect ranking scheme presented in the next Sections essentially shows which

suspects should be prioritized (i.e., examined first). The engineers responsible for high priority

suspects are the ones to whom the counter-examples are assigned.


The following Section describes detailed work on the aforementioned issues.

3.3 Counter-example Triage Framework

3.3.1 Error Behavior Analysis

In this dissertation we make the assumption that the majority of design errors in typical regression veri-

fication flows are introduced by the human factor. In our discussion, we often refer to human-introduced

errors as actual errors. The exhaustive nature of SAT-based debugging guarantees that the location of

an actual error will be returned as a single suspect in the solution set of the resulting counter-example.

Some of the remaining suspects will often be related to the error as well, such as locations in the fan-out

of the error, as shown in [26]. However, a significant portion of suspects are not related to the actual

design bug, even if they fix the same failure. As such, before constructing any triage metrics, it is crucial

to identify those suspects that present similar characteristics to actual design errors.

We address the above by speculating on the way an actual error is excited and eventually propagates to the failing responses. Suspects that follow our assumptions on how an error behaves are promoted over suspects that violate our expectations. Such a filtering approach is expected to lead to more accurate triage metrics by identifying important suspects and separating them from suspects that are likely to carry noisy information.

Generally, we expect human-introduced errors to be excited in temporal proximity to the failing observation points, an intuitive argument that is central to Bounded Model Debugging [33]. Moreover, we expect that a human-introduced error needs to be exercised only a relatively small number of times in simulation before it is excited. Recall that this number is referred to as the frequency of the respective design location. In the remainder of this Section, we present a thorough probabilistic analysis that supports the observations above.

Assuming that a single error exists in the design and that simulation starts at cycle 1, let $ex_i$ be the probability that the error is excited at cycle $i$. Also, let $pr_i$ be the probability of the error propagating from cycle $i$ to cycle $i+1$, and $ob_i$ the probability of observing a failure at an observation point at cycle $i$ given that the error has propagated to that cycle. Also assume that the input vector sequences are temporally independent and stationary random sequences.


Proposition 1: The probability $p_m$ of observing the first failure at cycle $m$, given that the error is excited for the first time at cycle $n$, where $m > n$, is:

$$p_m = \prod_{i=1}^{n-1}(1 - ex_i)\times ex_n\times\prod_{i=n}^{m-1}pr_i\times\prod_{i=n}^{m-1}(1 - ob_i)\times ob_m \quad (3.1)$$

Proof: Let events:

$E_i$ = {an error is excited at cycle $i$},

$X_i$ = {an error propagates from cycle $i$ to cycle $i+1$ given that it propagated to cycle $i$},

$O_i$ = {a failure is observed in cycle $i$ given that an error has propagated to cycle $i$}.

The probability $p_m$ can be expressed in terms of the events $E_i$, $X_i$ and $O_i$ (with $\overline{E_i}$ and $\overline{O_i}$ denoting their complements) as follows:

$$p_m = P\bigg(\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n\cap\Big(\bigcap_{i=n}^{m-1}X_i\cap\bigcap_{i=n}^{m-1}\overline{O_i}\cap O_m\;\Big|\;E_n\Big)\bigg).$$

But the events $\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n$ are conditionally independent of $\bigcap_{i=n}^{m-1}X_i$ and $\bigcap_{i=n}^{m-1}\overline{O_i}\cap O_m$. Thus:

$$p_m = P\Big(\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n\Big)\times P\Big(\bigcap_{i=n}^{m-1}X_i\cap\bigcap_{i=n}^{m-1}\overline{O_i}\cap O_m\;\Big|\;E_n\Big).$$

By Bayes' law and the chain rule we have $P(A\cap B\,|\,C) = P(A\,|\,C)\times P(B\,|\,A\cap C)$. Hence:

$$p_m = P\Big(\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n\Big)\times P\Big(\bigcap_{i=n}^{m-1}X_i\;\Big|\;E_n\Big)\times P\Big(\bigcap_{i=n}^{m-1}\overline{O_i}\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big)\times P\Big(O_m\;\Big|\;\bigcap_{i=n}^{m-1}\overline{O_i}\cap\bigcap_{i=n}^{m-1}X_i\cap E_n\Big).$$

But events $O_m$ and $\bigcap_{i=n}^{m-1}\overline{O_i}$ are conditionally independent given $E_n$, thus $p_m$ can be re-written as:

$$p_m = P\Big(\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n\Big)\times P\Big(\bigcap_{i=n}^{m-1}X_i\;\Big|\;E_n\Big)\times P\Big(\bigcap_{i=n}^{m-1}\overline{O_i}\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big)\times P\Big(O_m\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big).$$

By assumption, inputs of consecutive cycles are temporally independent. As a result, $X_i$ is independent of $X_j$, and $E_i$ is independent of $E_j$, for all $i \neq j$, meaning that:

$$P(X_i\cap X_j\,|\,E_n) = P(X_i\,|\,E_n)\times P(X_j\,|\,E_n),\quad\text{and}\quad P(\overline{E_i}\cap\overline{E_j}) = P(\overline{E_i})\times P(\overline{E_j}).$$

Consequently:

$$P\Big(\bigcap_{i=n}^{m-1}X_i\;\Big|\;E_n\Big) = \prod_{i=n}^{m-1}P(X_i\,|\,E_n),\quad\text{and}\quad P\Big(\bigcap_{i=1}^{n-1}\overline{E_i}\cap E_n\Big) = \prod_{i=1}^{n-1}P(\overline{E_i})\times P(E_n).$$

Similarly, conditional independence between $\overline{O_i}$ and $\overline{O_j}$ yields:

$$P\Big(\bigcap_{i=n}^{m-1}\overline{O_i}\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big) = \prod_{i=n}^{m-1}P\Big(\overline{O_i}\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big).$$

Hence, $p_m$ can be simplified to:

$$p_m = \prod_{i=1}^{n-1}P(\overline{E_i})\times P(E_n)\times\prod_{i=n}^{m-1}P(X_i\,|\,E_n)\times\prod_{i=n}^{m-1}P\Big(\overline{O_i}\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big)\times P\Big(O_m\;\Big|\;\bigcap_{i=n}^{m-1}X_i\cap E_n\Big).$$

Based on the assumptions, $P(\overline{E_i}) = 1 - ex_i$, $P(E_n) = ex_n$, $P(X_i\,|\,E_n) = pr_i$ and $P\big(O_i\,\big|\,\bigcap_{i=n}^{m-1}X_i\cap E_n\big) = ob_i$; therefore $p_m$ can be defined as:

$$p_m = \prod_{i=1}^{n-1}(1 - ex_i)\times ex_n\times\prod_{i=n}^{m-1}pr_i\times\prod_{i=n}^{m-1}(1 - ob_i)\times ob_m. \qquad\square$$

Since we simply aim to construct a generalized view of error behavior, we can assume that the probabilities $pr_i = pr$, $ob_i = ob$ and $ex_i = ex$ remain fixed for all cycles $i$. Then Proposition 1 implies:

$$p_m = (1 - ex)^{n-1}\times ex\times pr^{m-n}\times(1 - ob)^{m-n}\times ob \quad (3.2)$$
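A few lines of Python suffice to evaluate Eq. 3.2 and reproduce the qualitative trend discussed next; this is an illustrative sketch, with the parameter values taken from the first setting of Fig. 3.2.

```python
# A minimal sketch evaluating Eq. 3.2 with fixed per-cycle probabilities.

def p_first_failure(ex, pr, ob, n, m):
    """Probability of the first failure at cycle m, given that the error
    is excited for the first time at cycle n (Eq. 3.2)."""
    assert m > n >= 1
    return (1 - ex) ** (n - 1) * ex * pr ** (m - n) * (1 - ob) ** (m - n) * ob

# Setting of Fig. 3.2(a) with pr = ob = 0.5, ex = 0.5 and n = 1:
for m in range(2, 7):
    print(m - 1, p_first_failure(0.5, 0.5, 0.5, 1, m))
# p_m shrinks roughly geometrically as the suffix window (m - n) lengthens,
# e.g. 0.0625 for length 1 versus about 0.0156 for length 2.
```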

Fig. 3.2 illustrates the results of plotting Eq. 3.2. To show our findings, $p_m$ is plotted under two different settings. In the first, depicted in Fig. 3.2(a), the cycle where the error is first excited is kept constant such that $n = 1$, whereas the cycle $m$ where the failure is observed is selected from the set $\{2,3,4,5,6\}$. Recall that for a single error and a single failure the prefix window of the error corresponds to the sequence of cycles from the initial state to the cycle where the error is excited, and that the suffix window refers to the sequence of cycles immediately after the excitation cycle until the cycle where the failure is observed. In the above setting, the prefix window is set to a constant length of 1 and the suffix window length varies.


[Figure 3.2: Probabilistic behavior of errors. (a) Effect of suffix window length; (b) Relation between excitation and failure observation probability]

Additionally, the propagation, observation and excitation probabilities are set constant, such that $pr = ob \in \{0.2, 0.5, 0.8\}$ and $ex = 0.5$. Probability $p_m$ is plotted as a function of the failure observation cycle $m$.

In the second setting, depicted in Fig. 3.2(b), the number of cycles between the first excitation cycle and the cycle where the error is observed is kept constant so that $m - n = 2$, while $n$ (the cycle of first excitation) takes values from the set $\{2,3,4,5,6\}$. Essentially, the prefix window length now varies whereas the suffix window length is fixed to 2 cycles. Probabilities $pr$ and $ob$ are set to 0.8. In this setting, $p_m$ is plotted as a function of $ex$.

In Fig. 3.2(a), the negative exponential nature of the probability curves confirms the expectation that an error usually causes a failure only a small number of cycles after it has been excited. Hence, the error's suffix is expected to be relatively short. Fig. 3.2(b) leads us to an additional observation. We observe that as the prefix length increases, the highest value of $p_m$ is achieved for smaller excitation probabilities. This behavior confirms our intuition that even when an error is first excited close to the failure point, the longer the prefix is, the harder the error should be to excite (the excitation probability that maximizes $p_m$ drops). The excitation probability is proportional to the likelihood of the error location being covered by simulation; an error that is covered by simulation a small number of times has a small likelihood of being excited (it is harder to excite), and vice versa. Therefore, any conclusions related to error excitation can be applied to error coverage (frequency) as well.

It has to be noted that the description above serves not as a theoretical proof of the behavior of errors, but only as experimental intuition for the most typical cases, because it rests on certain assumptions. Specifically, we assume conditional independence between error excitation and error propagation events for cycles that precede the error excitation cycle, which does not hold in general. However, this probabilistic analysis offers a generalized model of error behavior and

leads to the following general observations. An error is expected:

1. to have a relatively short suffix window

2. to exhibit low coverage (low frequency) during its prefix window

The above observations form the basis of the suspect ranking scheme described in the next Section.

The proposed ranking guides triage metrics to more accurate outcomes, as will be demonstrated by

experimental results.

3.3.2 Suspect Ranking

For a counter-example $C_i \in \mathcal{C}$, the returned solution set $S_i$ contains all possible suspects for the observed mismatch. However, there are many cases where the solution set is large. To make matters worse, some of the returned suspects are incidental and not typical of common error locations, such as reset signals, input suspects, or stuck-at faults at the gate level. Along these lines, the goal of suspect ranking is to generate a ranked version of the solution that serves two purposes. First, it segregates suspects that are not typical of human-introduced errors from suspects that are, or are closely related to, actual errors. As

not by treating all suspects evenly. Second, it aids engineers to prioritize detailed debugging by first

examining those suspects high in rank. Moreover, these high-ranked suspects can offer better guidance

when deciding the most suitable engineer for detailed debugging. For example, if one or more high-

ranked suspects are located in the same design module then the engineer(s) responsible for this particular

module should be the ones to investigate the counter-examples in detail. Conventional debugging does

not offer any such guidance, since all suspects are generated without any sense of priority.

In order to generate an appropriate ranking, we need to quantify the observations of Section 3.3.1. Recall that $\langle s_{i_u}, t_{i_u}\rangle \in S_i$ refers to the $u$th suspect location and its excitation cycle in the suspect set of counter-example $C_i$. Also, as previously defined, the suffix window of suspect $\langle s_{i_u}, t_{i_u}\rangle$ is denoted as $suf(\langle s_{i_u}, t_{i_u}\rangle)$. Finally, $f_{i_u}$ denotes the frequency of suspect location $s_{i_u}$ with respect to counter-example $C_i$. Recall that the frequency equals the number of cycles for which the input(s) of suspect component $s_{i_u}$ toggle(s) during its prefix window, denoted as $pre(\langle s_{i_u}, t_{i_u}\rangle)$. Let $score(\langle s_{i_u}, t_{i_u}\rangle)$ be a scoring function quantifying the likelihood of $\langle s_{i_u}, t_{i_u}\rangle$ being an actual error, defined as follows:

$$score(\langle s_{i_u}, t_{i_u}\rangle) = \frac{L_i - \|suf(\langle s_{i_u}, t_{i_u}\rangle)\|}{L_i}\times\Big(1 - \frac{f_{i_u} - \gamma}{\max\{f_{i_v} : \langle s_{i_v}, t_{i_v}\rangle \in S_i\}}\Big) \quad (3.3)$$

Based on the probabilistic analysis in the previous Section, the higher $score(\langle s_{i_u}, t_{i_u}\rangle)$ is, the more typical of an actual error suspect $s_{i_u}$ is considered. The first factor $\frac{L_i - \|suf(\langle s_{i_u}, t_{i_u}\rangle)\|}{L_i}$ in Eq. 3.3 lies in the range $[0,1]$ and quantifies the expectation that a real error is excited in temporal proximity to the observed mismatch. The shorter the suffix window $suf(\langle s_{i_u}, t_{i_u}\rangle)$ is, the higher $score(\langle s_{i_u}, t_{i_u}\rangle)$ becomes, as desired. The second factor increases as the frequency $f_{i_u}$ of suspect $\langle s_{i_u}, t_{i_u}\rangle$ decreases, respectively resulting in an increase to the score function, again as desired. Similar to the first factor, the second also falls within the range $[0,1]$ for homogeneity. The denominator is set to the maximum frequency observed among all suspects in the corresponding solution, as a measure of comparison. A relatively small offset $\gamma$ is subtracted from the numerator to avoid zeroing out the contribution of suspects that have the maximum frequency for the counter-example.

Based on the scoring function above we can construct a ranking for all suspects. Let $rank$ be a relation such that, for two distinct suspects $\langle s_{i_u}, t_{i_u}\rangle$ and $\langle s_{i_v}, t_{i_v}\rangle$, if $rank(\langle s_{i_u}, t_{i_u}\rangle) < rank(\langle s_{i_v}, t_{i_v}\rangle)$, then $\langle s_{i_u}, t_{i_u}\rangle$ is more likely to be the actual error than $\langle s_{i_v}, t_{i_v}\rangle$; respectively, $score(\langle s_{i_u}, t_{i_u}\rangle) > score(\langle s_{i_v}, t_{i_v}\rangle)$. Given that $score(\langle s_{i_u}, t_{i_u}\rangle)$ has been computed for all suspects $\langle s_{i_u}, t_{i_u}\rangle \in S_i$, the rank of suspect $\langle s_{i_u}, t_{i_u}\rangle$ is formally defined as:

$$rank(\langle s_{i_u}, t_{i_u}\rangle) = \big\{r : |\{\langle s_{i_v}, t_{i_v}\rangle \in S_i : score(\langle s_{i_v}, t_{i_v}\rangle) > score(\langle s_{i_u}, t_{i_u}\rangle)\}| = r - 1\big\} \quad (3.4)$$

It can easily be confirmed that a suspect with the highest score for a given counter-example is assigned a rank of 1, and a suspect with the lowest score is assigned a rank of $|S_i|$, the lowest possible for that particular suspect set. In our implementation, ties between suspect scores are broken randomly. Based on the above equations, real errors and suspects related to them are more likely to be placed high in rank, exactly as desired.


[Figure 3.3: Example of suspect ranking for a counter-example]

Example 6 Consider a counter-example $C_1$ of length $L_1 = 10$ (from cycle 1 to cycle 10) that exposes a conflicting value at output $y_3$ in cycle 10. The output of a SAT-based debugger is a solution set $S_1 = \{\langle s_{1_1},5\rangle, \langle s_{1_2},8\rangle, \langle s_{1_3},9\rangle\}$, as shown in Figure 3.3, where suspect location $s_{1_2}$ corresponds to the actual error location. Suppose that the frequencies of the suspects are $f_{1_1} = 2$, $f_{1_2} = 2$ and $f_{1_3} = 5$. The suffix length of each suspect is computed directly from the counter-example. In this example, we have $\|suf(\langle s_{1_1},5\rangle)\| = 5$, $\|suf(\langle s_{1_2},8\rangle)\| = 2$ and $\|suf(\langle s_{1_3},9\rangle)\| = 1$. Based on Eq. 3.3, and by setting $\gamma = 0.1$, the resulting suspect scores are:

$$score(\langle s_{1_1},5\rangle) = \frac{5}{10}\times\Big(1 - \frac{1.9}{5}\Big) = 0.31$$

$$score(\langle s_{1_2},8\rangle) = \frac{8}{10}\times\Big(1 - \frac{1.9}{5}\Big) = 0.49$$

$$score(\langle s_{1_3},9\rangle) = \frac{9}{10}\times\Big(1 - \frac{4.9}{5}\Big) = 0.018$$

Consequently, the ranking scheme yields $rank(\langle s_{1_1},5\rangle) = 2$, $rank(\langle s_{1_2},8\rangle) = 1$ and $rank(\langle s_{1_3},9\rangle) = 3$.

In the above example we observe that the actual error location eventually becomes the top-ranked

suspect. Of course, this is not always the case, but, in general, the proposed scoring will push actual

errors higher in the rank, as shown by experimental results.
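The computations of Example 6 can be reproduced with the short sketch below (illustrative Python, not the thesis implementation), which implements Eq. 3.3 and the strictly-greater ranking rule of Eq. 3.4.

```python
# Reproducing Example 6: Eq. 3.3 scores and Eq. 3.4 ranks, gamma = 0.1.

def score(L, suffix_len, freq, max_freq, gamma=0.1):
    return (L - suffix_len) / L * (1 - (freq - gamma) / max_freq)

suspects = {"s11": (5, 2), "s12": (2, 2), "s13": (1, 5)}  # (suffix len, freq)
L = 10
max_f = max(f for _, f in suspects.values())

scores = {s: score(L, suf, f, max_f) for s, (suf, f) in suspects.items()}
# Rank r = 1 + number of suspects with a strictly higher score
# (ties would be broken randomly, as in the text).
ranks = {s: 1 + sum(other > scores[s] for other in scores.values())
         for s in scores}

print({s: round(v, 3) for s, v in scores.items()})
# {'s11': 0.31, 's12': 0.496, 's13': 0.018}
print(ranks)  # {'s11': 2, 's12': 1, 's13': 3}
```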


3.3.3 Counter-example Proximity

The cornerstone of triage is a well-defined metric to express a relation between any two given counter-

examples. In order to develop such a metric we exploit information from the suspect ranking scheme

along with the number of suspects that two counter-examples share in common.

As defined in Chapter 2, the set of mutual suspects between two counter-examples $C_i$ and $C_j$ is denoted as $M_{ij}$. Intuitively, when $M_{ij}$ is large relative to the number of total suspects in both solutions, then $C_i$ and $C_j$ are considered to be strongly related, thus possibly originating from the same error source. However, if mutual suspects are ranked low in both or at least one of the solutions, then this correlation becomes weaker. For example, if a mutual suspect is high-ranked in $S_i$ but low-ranked in $S_j$, then it is more likely to be a real error for counter-example $C_i$ and not for $C_j$, even if it can fix both; counter-example $C_j$ is expected to be caused by a suspect higher in rank. We combine the above expectations into a speculative metric called counter-example proximity between any pair $C_i$ and $C_j$, which is denoted as $prox(C_i, C_j)$ and defined as:

$$prox(C_i, C_j) = 1 - \frac{|M_{ij}|}{|S_i| + |S_j| - |M_{ij}|}\times\prod_{\{\langle s_{i_u},t_{i_u}\rangle,\langle s_{j_v},t_{j_v}\rangle\}\in M_{ij}}\Big(1 - \frac{|rank(\langle s_{i_u},t_{i_u}\rangle) - rank(\langle s_{j_v},t_{j_v}\rangle)|}{\max\{|S_i|, |S_j|\}}\Big) \quad (3.5)$$

In the literature, proximity can take various forms and express similarity or dissimilarity between the objects that are analyzed [10]. In the context of this dissertation the proximity metric between counter-examples expresses dissimilarity; thus, a small proximity indicates a strong relation (small dissimilarity).

According to Eq. 3.5, when $C_i$ and $C_j$ are strongly related then $prox(C_i,C_j)$ tends to 0, whereas a weak correlation sets $prox(C_i,C_j)$ closer to 1. In principle, $prox(C_i,C_j) = 0$ implies that the counter-examples are guaranteed to be caused by the same error, which should only happen when the counter-examples expose identical behavior. On the other extreme, $prox(C_i,C_j) = 1$ means that the counter-examples are definitely caused by different errors, and this should only happen when they do not share any suspects. Observe that the number of mutual suspects over total suspects is encoded in the factor $\frac{|M_{ij}|}{|S_i|+|S_j|-|M_{ij}|}$. As desired, a large mutual suspect set $M_{ij}$ will force $prox(C_i,C_j)$ closer to 0.


In the case where all suspects in solutions $S_i$ and $S_j$ are mutual, then $|M_{ij}| = |S_i| = |S_j|$, thus $\frac{|M_{ij}|}{|S_i|+|S_j|-|M_{ij}|} = 1$, and the first factor maximizes its contribution. In this context, the second factor quantifies the contribution of mutual suspects based on their ranking. Ideally, counter-examples caused by the same error will exhibit similar behavior. Therefore, their mutual suspects are expected to have similar ranks in their respective solution sets. Based on Eq. 3.5, proximity increases (i.e., the estimated relation weakens) as the difference $|rank(\langle s_{i_u},t_{i_u}\rangle) - rank(\langle s_{j_v},t_{j_v}\rangle)|$ in the ranking of mutual suspects increases, which models the above expectation. Remark that $prox(C_i,C_i) = 0$, as desired, since all suspects are mutual ($\frac{|M_{ij}|}{|S_i|+|S_j|-|M_{ij}|} = 1$) and have the same rank ($|rank(\langle s_{i_u},t_{i_u}\rangle) - rank(\langle s_{i_v},t_{i_v}\rangle)| = 0$ always). On the other hand, if $C_i$ and $C_j$ share no mutual suspects then they are definitely unrelated and caused by different errors, which is successfully captured by Eq. 3.5, since in that case $|M_{ij}| = 0$ and thus $prox(C_i,C_j) = 1$.

Example 7 Consider three counter-examples $C_1$, $C_2$ and $C_3$ that expose three distinct failures at outputs $y_3$ at cycle 10, $y_2$ at cycle 7 and $y_3$ at cycle 8, respectively. The corresponding solution sets are $S_1 = \{\langle s_{1_1},5\rangle,\langle s_{1_2},8\rangle,\langle s_{1_3},9\rangle\}$, $S_2 = \{\langle s_{2_1},3\rangle,\langle s_{2_2},6\rangle,\langle s_{2_3},6\rangle\}$ and $S_3 = \{\langle s_{3_1},4\rangle,\langle s_{3_2},8\rangle\}$, as shown in Figure 3.4, where $s_{1_2}$, $s_{2_2}$ and $s_{3_1}$ correspond to the actual error locations. Note that counter-examples $C_1$ and $C_2$ expose failures that originate from the same error location. Suppose that the proposed suspect ranking scheme produces the ranking shown in Figure 3.4. In this example, the mutual suspect sets between each pair of counter-examples are $M_{12} = \{(\langle s_{1_1},5\rangle,\langle s_{2_1},3\rangle),(\langle s_{1_2},8\rangle,\langle s_{2_2},6\rangle)\}$, $M_{13} = \{(\langle s_{1_3},9\rangle,\langle s_{3_2},8\rangle)\}$ and $M_{23} = \emptyset$, since $C_2$ and $C_3$ do not share any suspects in common. Based on Eq. 3.5, the resulting proximity for each pair of counter-examples is:

$$prox(C_1,C_2) = 1 - \frac{2}{3+3-2}\times\Big(1 - \frac{|2-1|}{3}\Big)\times\Big(1 - \frac{|1-2|}{3}\Big) = 1 - 0.22 = 0.78$$

$$prox(C_1,C_3) = 1 - \frac{1}{3+2-1}\times\Big(1 - \frac{|3-2|}{3}\Big) = 1 - 0.17 = 0.83$$

$$prox(C_2,C_3) = 1$$

We observe that the relation between $C_1$ and $C_2$ is estimated to be stronger compared to the relation between $C_1$ and $C_3$, since the proximity for the first pair is smaller, even though different outputs fail for the first pair and the same output fails for the second one. Thus, leveraging information from the suspect sets and the ranking scheme allows us to define similarity more accurately compared to traditional


[Figure 3.4: Counter-example proximity]

methodologies. Recall that existing automated triage techniques would erroneously decide that $C_1$ and $C_3$ are related and, on the contrary, that $C_1$ and $C_2$ are not, because scripts base the decision solely on the failing outputs. For the same example, we observe that the computed ranks of the mutual suspects between $C_1$ and $C_2$ are different. This incurs some uncertainty that does not allow us to be "too confident" that $C_1$ and $C_2$ are caused by the same error. If, on the other hand, the ranks were identical, then the corresponding proximity would be lower (specifically $prox(C_1,C_2) = 0.5$) and our confidence would effectively be stronger.
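The proximity values of Example 7 can be reproduced with the following sketch (illustrative Python; the letters are hypothetical names for the shared suspect locations, paired with the ranks computed in the example).

```python
# A sketch of counter-example proximity (Eq. 3.5), reproducing Example 7.

def proximity(ranks_i, ranks_j):
    """ranks_i, ranks_j: {suspect_location: rank} for two counter-examples."""
    mutual = set(ranks_i) & set(ranks_j)
    if not mutual:
        return 1.0  # no shared suspects: definitely different errors
    jaccard = len(mutual) / (len(ranks_i) + len(ranks_j) - len(mutual))
    prod = 1.0
    for s in mutual:
        prod *= 1 - abs(ranks_i[s] - ranks_j[s]) / max(len(ranks_i), len(ranks_j))
    return 1 - jaccard * prod

S1 = {"a": 2, "b": 1, "c": 3}  # a ~ s11, b ~ s12, c ~ s13
S2 = {"a": 1, "b": 2, "d": 3}  # locations a and b shared with S1
S3 = {"e": 1, "c": 2}          # location c shared with S1

print(round(proximity(S1, S2), 2))  # 0.78
print(round(proximity(S1, S3), 2))  # 0.83
print(proximity(S2, S3))            # 1.0
```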

3.3.4 Error Count Estimation

For a grouping of the generated counter-examples to be meaningful, it is necessary to define the num-

ber of groups expected to be formed. Ideally, this number should equal the number of design errors

responsible for the set of counter-examples $\mathcal{C}$. However, in the vast majority of regression scenarios the

number of co-existing errors is not known a priori. Therefore an initial guess on the number of groups

has to be made that will reflect an acceptable grouping scheme.


For that purpose, we construct a heuristic called error count estimation that leverages information from the suspect ranking scheme. For each mutual suspect set $M_{ij}$ a subset $M^R_{ij} \subseteq M_{ij}$ is created, which is referred to as the reduced set $M^R_{ij}$. This set contains only those suspects that have at most a rank of $R \le \min\{|S_i|, |S_j|\}$ in suspect sets $S_i$ or $S_j$. Formally:

$$M^R_{ij} = M_{ij} - \left\{ \{\langle s_{i_u}, t_{i_u}\rangle, \langle s_{j_v}, t_{j_v}\rangle\} \in M_{ij} : \big(\mathrm{rank}(\langle s_{i_u}, t_{i_u}\rangle) > R\big) \vee \big(\mathrm{rank}(\langle s_{j_v}, t_{j_v}\rangle) > R\big) \right\} \quad (3.6)$$

As described in Section 3.3.2, high-ranked suspects, that is, suspects with small rank values, generally have a stronger relation to actual errors. Intuitively, a large number of high-ranked mutual suspects indicates that the counter-examples are caused by a small number of errors, and vice versa. Along these lines, after computing all possible sets $M_{ij}$ and the reduced sets $M^R_{ij}$, for each suspect location $s_{i_u}$ we count all its appearances in the reduced mutual sets. To do so, for each suspect location $s_{i_u}$ we construct the set $CNT_{s_{i_u}}$, which contains all suspects found in reduced mutual sets that correspond to the same suspect location as $s_{i_u}$, including $s_{i_u}$ itself:

$$CNT_{s_{i_u}} = \left\{ s_{j_v} : \{\langle s_{i_u}, t_{i_u}\rangle, \langle s_{j_v}, t_{j_v}\rangle\} \in M^R_{ij} \right\} \quad (3.7)$$

Effectively, $|CNT_{s_{i_u}}|$ provides the number of times that suspect location $s_{i_u}$ appears in the reduced mutual sets. To extrapolate over all suspects, let $CNT$ be a set that contains all computed $CNT_{s_{i_u}}$ sets, without including duplicate sets:

$$CNT = \bigcup_{i=1}^{|\mathcal{C}|} \bigcup_{u=1}^{|S_i|} CNT_{s_{i_u}} \quad (3.8)$$

Note that $|CNT|$ corresponds to the total number of distinct high-ranked suspect locations that are returned among all debugging sessions for the set of generated counter-examples. Now, the average number of times such high-ranked suspects participate in a mutual set, and hence in a solution set, estimates how many counter-examples we expect those suspects to be responsible for. This number is denoted as $CNT_{avg}$, and given as:

$$CNT_{avg} = \frac{\sum_{CNT_{s_{i_u}} \in\, CNT} |CNT_{s_{i_u}}|}{|CNT|} \quad (3.9)$$


Then, our error count estimation, denoted as $e$, is given by:

$$e = \left\lceil \frac{|\mathcal{C}|}{CNT_{avg}} \right\rceil \quad (3.10)$$

Eq. 3.10 essentially says that the expected number of co-existing errors responsible for all counter-examples is calculated by dividing the number of counter-examples $|\mathcal{C}|$ by the average number of counter-examples we expect each high-ranked suspect to be responsible for. Observe that, if no mutual suspects of high rank exist, then $|CNT_{s_{i_u}}| = |\{s_{i_u}\}| = 1$ for all high-ranked suspects, and $CNT_{avg} = 1$ according to Eq. 3.9. Then the error count estimation is $e = |\mathcal{C}|$, acceptably predicting that each counter-example is most likely caused by a unique error, and thus the number of errors equals the number of counter-examples. On the other hand, the existence of high-ranked suspects among various solutions incurs a decrease in $e$, since $CNT_{avg}$ increases. Eq. 3.10 offers a loose approximation of the number of co-existing errors but, in practice, a good estimate of the number of groups to be formed, as demonstrated by the experimental results.
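To make the estimation concrete, here is a minimal Python sketch of Eqs. 3.6-3.10 under a simplifying assumption: suspects are identified by a location name, so mutual suspects are simply shared locations. The data layout and helper name are hypothetical.

import math

def error_count_estimate(counter_examples, R):
    """counter_examples: list of dicts, suspect location -> rank."""
    n = len(counter_examples)
    cnt = {}                                   # location -> CNT set (Eq. 3.7)
    for i in range(n):
        for j in range(i, n):                  # i == j keeps the self-pairs
            Si, Sj = counter_examples[i], counter_examples[j]
            for loc in set(Si) & set(Sj):
                # Eq. 3.6: keep a mutual pair only if both ranks are <= R
                if Si[loc] <= R and Sj[loc] <= R:
                    cnt.setdefault(loc, set()).update({(i, loc), (j, loc)})
    if not cnt:
        return n                               # no high-ranked mutual suspects
    distinct = {frozenset(s) for s in cnt.values()}           # Eq. 3.8
    cnt_avg = sum(len(s) for s in distinct) / len(distinct)   # Eq. 3.9
    return math.ceil(n / cnt_avg)                             # Eq. 3.10

# Example 8 data: locations "A", "B" model the mutual suspects of C1 and C2;
# "C" is shared by C1 and C3 but has rank 3 in C1, so R = 2 filters it out.
C = [{"A": 2, "B": 1, "C": 3},     # C1
     {"A": 1, "B": 2, "D": 3},     # C2
     {"E": 1, "C": 2}]             # C3
print(error_count_estimate(C, R=2))   # -> 2, matching Example 8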

Example 8 Consider the previously presented counter-examples $C_1$, $C_2$, and $C_3$, as shown in Figure 3.4. Suppose that we decide to select only those mutual suspects that have rank 1 or 2 for each counter-example ($R = 2$). Then the reduced mutual suspect sets are:

$M^2_{11} = \{\{\langle s_{1_1},5\rangle,\langle s_{1_1},5\rangle\}, \{\langle s_{1_2},8\rangle,\langle s_{1_2},8\rangle\}\}$,

$M^2_{12} = \{\{\langle s_{1_1},5\rangle,\langle s_{2_1},3\rangle\}, \{\langle s_{1_2},8\rangle,\langle s_{2_2},6\rangle\}\}$,

$M^2_{22} = \{\{\langle s_{2_1},3\rangle,\langle s_{2_1},3\rangle\}, \{\langle s_{2_2},6\rangle,\langle s_{2_2},6\rangle\}\}$,

$M^2_{33} = \{\{\langle s_{3_1},4\rangle,\langle s_{3_1},4\rangle\}, \{\langle s_{3_2},8\rangle,\langle s_{3_2},8\rangle\}\}$,

$M^2_{13} = \emptyset$ (since $\mathrm{rank}(\langle s_{1_3},9\rangle) = 3 > 2$) and

$M^2_{23} = \emptyset$.

Notice that for the error count estimation we also need the set $M^R_{ii}$ for each counter-example $C_i$. Next, we compute the number of times each suspect location appears in these sets, which gives $CNT_{s_{1_2}} = CNT_{s_{2_2}} = \{s_{1_2}, s_{2_2}\}$, $CNT_{s_{1_1}} = CNT_{s_{2_1}} = \{s_{1_1}, s_{2_1}\}$, $CNT_{s_{3_1}} = \{s_{3_1}\}$, and $CNT_{s_{3_2}} = \{s_{3_2}\}$. Then we compute the set that contains all the above counting sets, $CNT = \{\{s_{1_2}, s_{2_2}\}, \{s_{1_1}, s_{2_1}\}, \{s_{3_1}\}, \{s_{3_2}\}\}$. Finally, the average number of times these suspects appear in the reduced mutual suspect sets is computed as follows:


$$CNT_{avg} = \frac{|\{s_{1_2}, s_{2_2}\}| + |\{s_{1_1}, s_{2_1}\}| + |\{s_{3_1}\}| + |\{s_{3_2}\}|}{|CNT|} = \frac{2+2+1+1}{4} = \frac{3}{2}$$

Finally, the error count estimation $e$ is calculated by Eq. 3.10:

$$e = \left\lceil \frac{|\mathcal{C}|}{CNT_{avg}} \right\rceil = \left\lceil \frac{3}{3/2} \right\rceil = 2$$

For the previous example, the error count estimation successfully predicts that there are two actual errors responsible for the three generated counter-examples. As expected, the existence of high-ranked mutual suspects supports a prediction that indicates a relatively small number of actual errors. On the other hand, the existence of high-ranked suspects that are not shared among other counter-examples prevents the estimate from predicting too few errors. Finally, observe that if only the top-ranked suspects (rank 1) were selected for the estimate, useful information would be discarded, resulting in a "bad guess" ($e = 3$) that wrongly predicts that every counter-example is caused by a distinct error.

On the contrary, if all suspects were selected for the computations in the above equations, then noisy suspects that do not relate to actual errors would have an unpredictable contribution to the method's estimate. If most of them appear as common suspects, the estimate would be even lower, whereas if most of them appear solely in their respective counter-example, the estimate would be higher. This scenario cannot be demonstrated in the previous example due to its simplicity, since there are only a few noisy suspects; the error count estimation would still be $e = 2$. Obviously, selecting extreme values for the suspect rank would either discard useful knowledge or include noisy information. As such, it is generally more reasonable to select ranks that fall between the extreme values.

The error count estimation is mainly based on how various suspects are shared among counter-examples. As a result, the lower bound on the method's estimate is implicit and might be affected by noise, as described previously. Nonetheless, the triage engine should ideally enforce a strict lower bound. This is indeed possible, since there are counter-examples that we know are definitely unrelated: those that share no common suspects. The number of counter-examples that are guaranteed to be unrelated essentially equals the minimum number of distinct errors that should be considered by the triage engine.


For example, if three counter-examples are definitely unrelated, then we know that there exist at least three distinct errors responsible for all the observed failures. Their number might be higher, and the proposed method will attempt to approximate this number, but it is definitely no less, and the method should never predict a smaller number. It is possible to explicitly enforce a minimum error count estimation, so that, in the worst case, the proposed estimation does not fall below this lower bound. The lower bound on the error count estimation, denoted as $e_{min}$, is given by:

$$e \ge e_{min} = \left|\left\{ C_i : \sum_{j=1}^{|\mathcal{C}|} |M_{ij}| = 0 \right\}\right| \quad (3.11)$$

In essence, Eq. 3.11 says that $e_{min}$ is equal to the number of counter-examples that share no mutual suspects with any other counter-example.
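As a sketch, and reusing the hypothetical representation above (suspects keyed by location), the lower bound of Eq. 3.11 reduces to counting counter-examples that share no suspect location with any other:

def e_min(counter_examples):
    """Number of counter-examples with no mutual suspects (Eq. 3.11)."""
    n = len(counter_examples)
    return sum(
        1 for i in range(n)
        if not any(set(counter_examples[i]) & set(counter_examples[j])
                   for j in range(n) if j != i))

print(e_min(C))   # -> 0 for the Example 8 data: every Ci shares a suspect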

3.3.5 Counter-example Clustering

The information embedded in the metrics described above is applied in the last step of the proposed triage framework, which is the formation of groups of similar counter-examples. For that purpose, we

formulate triage as a clustering problem and employ a hierarchical clustering algorithm [10, 40] to solve

it.

Hierarchical clustering aims to group together elements based on their relationship, which is quantified by a metric called distance [10, 40]. A distance metric takes positive real values, and is assigned

per pair of elements. A small distance between a pair of elements indicates a strong relationship, and

vice versa. In our framework, the elements to be grouped are essentially counter-examples. Since

counter-example proximity expresses relationship in the same manner as the required distance metric, it

is natural to generate the clustering distance metric from the computed counter-example proximity.

In the context of our work, hierarchical clustering takes as input the set of all counter-examples $\mathcal{C}$, and the proximity between all pairs of counter-examples in the form of a $|\mathcal{C}| \times |\mathcal{C}|$ matrix, referred to as the proximity matrix $P_{|\mathcal{C}|,|\mathcal{C}|}$:


$$P_{|\mathcal{C}|,|\mathcal{C}|} = \begin{bmatrix} \mathrm{prox}(C_1,C_1) & \mathrm{prox}(C_1,C_2) & \cdots & \mathrm{prox}(C_1,C_{|\mathcal{C}|}) \\ \mathrm{prox}(C_2,C_1) & \mathrm{prox}(C_2,C_2) & \cdots & \mathrm{prox}(C_2,C_{|\mathcal{C}|}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{prox}(C_{|\mathcal{C}|},C_1) & \mathrm{prox}(C_{|\mathcal{C}|},C_2) & \cdots & \mathrm{prox}(C_{|\mathcal{C}|},C_{|\mathcal{C}|}) \end{bmatrix} \quad (3.12)$$

Each row $i$ of the proximity matrix $P_{|\mathcal{C}|,|\mathcal{C}|}$ contains exactly the information that correlates counter-example $C_i$ with the rest of the generated counter-examples. As such, we can define the proximity vector of counter-example $C_i$ as $\vec{C_i} = [\mathrm{prox}(C_i,C_1), \mathrm{prox}(C_i,C_2), \ldots, \mathrm{prox}(C_i,C_{|\mathcal{C}|})] \in \mathbb{R}^{|\mathcal{C}|}$. If we consider $\vec{C_i}$ as a coefficient vector, then we can map counter-example $C_i$ to a $|\mathcal{C}|$-dimensional Euclidean space. The desired clustering distance metric can then be defined as the Euclidean distance between each pair of counter-examples in this space. We denote this distance as $d_{ij}$. Formally:

$$d_{ij} = \|\vec{C_i} - \vec{C_j}\| = \left( \sum_{k=1}^{|\mathcal{C}|} |\mathrm{prox}(C_i,C_k) - \mathrm{prox}(C_j,C_k)|^2 \right)^{1/2} \quad (3.13)$$

The corresponding distance matrix, denoted as $D_{|\mathcal{C}|,|\mathcal{C}|}$, is given below:

$$D_{|\mathcal{C}|,|\mathcal{C}|} = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1|\mathcal{C}|} \\ d_{21} & d_{22} & \cdots & d_{2|\mathcal{C}|} \\ \vdots & \vdots & \ddots & \vdots \\ d_{|\mathcal{C}|1} & d_{|\mathcal{C}|2} & \cdots & d_{|\mathcal{C}||\mathcal{C}|} \end{bmatrix} \quad (3.14)$$

The reason for using the distance matrix $D_{|\mathcal{C}|,|\mathcal{C}|}$ to express counter-example relation, rather than the proximity matrix $P_{|\mathcal{C}|,|\mathcal{C}|}$, is that counter-example proximity does not necessarily respect the Euclidean property. As we will see in the next Section, mapping counter-examples to data points in a high-dimensional Euclidean space allows us to use more flexible metrics for cluster merging, something that is not possible when the proximity matrix is used in the form presented here. During the development of the proposed triage engine, we observed that for data sets that are well separable the effect of transitioning from the proximity matrix to the distance matrix is negligible.
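A minimal sketch of this transformation (Eqs. 3.12-3.14), assuming the pairwise proximities have already been computed; the helper name is illustrative:

import numpy as np

def distance_matrix(P):
    """Euclidean distance matrix D (Eqs. 3.13-3.14) from a proximity
    matrix P (Eq. 3.12); each row of P is one counter-example's vector."""
    P = np.asarray(P, dtype=float)
    diff = P[:, None, :] - P[None, :, :]   # all pairwise row differences
    return np.sqrt((diff ** 2).sum(axis=-1))

# Proximity matrix of Example 9 (diagonal set to 0 by convention).
P = [[0.00, 0.78, 0.83],
     [0.78, 0.00, 1.00],
     [0.83, 1.00, 0.00]]
print(np.round(distance_matrix(P), 3))     # [[0, 1.116, 1.194], ...]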


The output of hierarchical clustering is not a single partition of the counter-example set $\mathcal{C}$. In the

general case, the algorithm produces a nested sequence of partitions, with a single, all-inclusive cluster

at the top and singleton clusters of individual data points at the bottom. Each intermediate iteration

can be viewed as combining (splitting) two clusters from the next lower (next higher) iteration. In the

proposed implementation, clusters of counter-examples are formed in a bottom-up fashion (agglomera-

tive) by merging clusters that are likely to contain similar counter-examples (that have small Euclidean

distance). The process stops when $e_{min}$ clusters are formed. The reason is that any partition with fewer than $e_{min}$ clusters implies the existence of fewer errors than the ones guaranteed to exist, and thus should be discarded. Of course, in the case where $e_{min} = 1$, an all-inclusive cluster that contains all counter-

examples is produced. At each iteration of the algorithm we merge only two clusters, and therefore the total number of formed clusters always decreases by one.

The algorithm generates all possible partitions of the counter-example set $\mathcal{C}$ that involve $e_{min}$ clusters or more. However, the error count estimation $e$, presented in this Section, suggests the most reasonable partition based on our pre-processing and the assumptions made. This is the reason why the algorithm does not stop when the number of currently formed clusters equals the error count estimation, but proceeds until the number of clusters is $e_{min}$. We wish to have all partitions available at the end for examination, in case the suggested partition does not satisfy the verification engineer(s).

So far, it has become obvious that the proximity metric is what defines the distance between counter-examples. However, all clustering algorithms, including hierarchical clustering, additionally require a definition of what the distance (relation) between two clusters is. The decision to merge two clusters is determined by a linkage criterion. There are various linkage criteria that confer different behavior to the algorithm and usually produce different partitions of the data set. The most popular of them are the Single-Linkage, Complete-Linkage and Group Average criteria, and Ward's Method [10]. The first three criteria do not require that the clustered objects reside in a Euclidean space and can be applied to any proximity matrix as long as it expresses some similarity/dissimilarity between objects. Ward's Method, on the other hand, requires that the objects are represented by a feature vector [40], which is the case in the proposed formulation. As such, the proposed distance matrix and the vector-based representation of counter-examples allow the application of a wider range of linkage criteria. Nevertheless, all the aforementioned criteria are based on greedy choices when clusters are merged, which is why hierarchical clustering


is a greedy algorithm and is not based on any optimization of a given cost function.

In this work, we use Ward’s Method [40], where at each step we merge the pair of clusters that leads

to minimum increase in total within-cluster variance after merging. More precisely, Ward’s Method says

that the distance between two clusters $A$ and $B$, denoted as $\Delta_{AB}$, is the amount of increase in variance of

the data points that belong to these clusters, if we decide to merge them into a larger one. Formally:

$$\Delta_{AB} = \sum_{C_i \in A \cup B} \|\vec{C_i} - \vec{m}_{A\cup B}\|^2 \;-\; \sum_{C_i \in A} \|\vec{C_i} - \vec{m}_A\|^2 \;-\; \sum_{C_i \in B} \|\vec{C_i} - \vec{m}_B\|^2 = \frac{n_A \times n_B}{n_A + n_B} \times \|\vec{m}_A - \vec{m}_B\|^2 \quad (3.15)$$

where $\vec{m}_k$ is the center of cluster $k$, and $n_k$ is the number of counter-examples in it.

With hierarchical clustering, the sum of squares starts out at zero, because every point is in its own

singleton cluster, and then grows as we merge clusters. Ward’s method keeps this growth as small as

possible. This property tends to create compact clusters, which proved to perform well, as shown by

experiments in the next Section.

Even though the linkage criterion presented here will try to merge clusters of related counter-examples, the counter-example proximity is derived in such a way that we know with full certainty that some counter-examples should never belong to the same cluster. Ward's Method, and in reality any other linkage criterion, does not have such information available. The distance/proximity between definitely unrelated counter-examples is the maximum that can be generated, but this alone cannot guarantee that clusters containing such counter-examples will never be merged. One possible solution is to explicitly force the distance between such counter-examples to a relatively high number, so that a massive increase in variance occurs when the respective clusters attempt to merge. However, there is no straightforward way of performing such a modification without violating the relative distance between these counter-examples and the ones that are indeed related to them. Instead, it is much simpler to modify the linkage criterion so that it prohibits the merging of clusters that contain definitely unrelated counter-examples, leaving the distances themselves unchanged. More specifically, we modify Ward's Method such that the distance between clusters that contain definitely unrelated counter-examples is set to infinity. We denote the modified cluster distance as $\Delta'_{AB}$:


$$\Delta'_{AB} = \begin{cases} +\infty & \exists\, C_i \in A,\ C_j \in B : |M_{ij}| = 0 \\ \Delta_{AB} & \text{otherwise} \end{cases} \quad (3.16)$$

[Figure: counter-examples $C_1$, $C_2$ and $C_3$ mapped as points on a 2-D plane, with the clusters formed at the 1st, 2nd and 3rd iterations of the algorithm outlined.]

Figure 3.5: Counter-example hierarchical clustering

This way, the merging of two clusters that contain counter-examples guaranteed to be caused by different errors is always blocked, since the distance between these clusters will never be minimal. Also note that the above issue cannot be resolved merely by forcing a lower bound of $e_{min}$ clusters, because the lower bound $e_{min}$ constrains the minimum number of clusters to be formed, and not the way clusters are merged.

To summarize, the hierarchical clustering algorithm is performed in three major steps (a minimal sketch follows the list):

1. Place each counter-example into its own singleton cluster

2. Iteratively merge the two closest clusters

3. Stop when all counter-examples are merged into emin clusters
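The sketch below illustrates these three steps with Ward's criterion modified per Eq. 3.16. It is a simplified, unoptimized illustration that operates directly on the proximity vectors; the thesis's actual platform is not reproduced here.

import numpy as np

def cluster(vectors, unrelated, e_min):
    """Agglomerative clustering with the modified Ward criterion.
    vectors: rows of the proximity matrix; unrelated: pairs (i, j) with
    |Mij| = 0 that must never share a cluster; e_min: stopping point."""
    vectors = np.asarray(vectors, dtype=float)
    clusters = [[i] for i in range(len(vectors))]
    partitions = [[c[:] for c in clusters]]
    while len(clusters) > e_min:
        best, best_pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                A, B = clusters[a], clusters[b]
                if any((i, j) in unrelated or (j, i) in unrelated
                       for i in A for j in B):
                    continue                   # Eq. 3.16: distance is +inf
                mA, mB = vectors[A].mean(0), vectors[B].mean(0)
                delta = (len(A) * len(B)) / (len(A) + len(B)) \
                        * np.sum((mA - mB) ** 2)   # Ward's increase (Eq. 3.15)
                if delta < best:
                    best, best_pair = delta, (a, b)
        if best_pair is None:
            break                              # only blocked merges remain
        a, b = best_pair
        clusters[a] += clusters.pop(b)
        partitions.append([c[:] for c in clusters])
    return partitions                          # nested sequence of partitions

P = [[0.00, 0.78, 0.83], [0.78, 0.00, 1.00], [0.83, 1.00, 0.00]]
print(cluster(P, unrelated={(1, 2)}, e_min=1))   # C1 and C2 merge; C3 blocked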

Example 9 Consider again the same three counter-examples, for which the proximity and error count estimation were previously computed in Examples 7 and 8, respectively. The proximity matrix for counter-examples $C_1$, $C_2$ and $C_3$ is:

$$P_{3,3} = \begin{bmatrix} 0 & 0.78 & 0.83 \\ 0.78 & 0 & 1 \\ 0.83 & 1 & 0 \end{bmatrix} \quad (3.17)$$

The corresponding Euclidean distance matrix is computed as:

$$D_{3,3} = \begin{bmatrix} 0 & 1.116 & 1.194 \\ 1.116 & 0 & 1.415 \\ 1.194 & 1.415 & 0 \end{bmatrix} \quad (3.18)$$

Note that the relation between the counter-examples is preserved after this transformation. For illustration purposes only, assume that we map the data points corresponding to the three counter-examples onto a 2-D Euclidean plane, as shown in Figure 3.5. The hierarchical clustering algorithm will initially consider each counter-example as a singleton cluster and in two iterations will produce the final all-inclusive cluster. At the second iteration, counter-examples $C_1$ and $C_2$ are merged into a single cluster and $C_3$ remains a singleton. Recall that the error count estimation for this example was computed as $e = 2$, and thus the above partition is the suggested one. Notice that these counter-examples are now correctly grouped, as opposed to the unfortunate outcome of conventional triage techniques for the same counter-examples, shown at the beginning of this Chapter.

3.3.6 Overall Flow

The overall flow that contains the steps described in this Chapter is illustrated in Figure 3.6. The input to the flow is a set of counter-examples generated by regression verification. The debugger is invoked

and provides a solution set for each counter-example. Based on Eq. 3.3, a ranked version of the suspects

is constructed. The ranking scheme is subsequently utilized for the computation of counter-example

proximity and the error count estimation based on Eq. 3.5 and Eq. 3.10 respectively. Those metrics are


then passed to the clustering algorithm that forms all possible clusters of related counter-examples. The output of the triage engine is the unique partitioning that comprises $e$ related clusters, as suggested by the error count estimation. The grouping is then examined by engineers, along with the already computed suspect ranking scheme. Note that the triage process is initially executed with the error count estimation, but it is up to the engineer to accept the formation of $e$ groups or to examine an alternative number of groups already generated by the hierarchical clustering algorithm.

[Figure: flow diagram. Counter-examples $\{C_1, C_2, \ldots, C_N\}$ enter the automated debugger, which produces solution sets $\{S_1, S_2, \ldots, S_N\}$; suspect ranking feeds counter-example proximity and the error count estimation ($e$); counter-example clustering forms groups $g_1, g_2, \ldots, g_e$; if the clustering is satisfying, detailed root cause analysis follows, otherwise an alternative clustering is accepted.]

Figure 3.6: Proposed triage framework
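Stitched together, the flow of Figure 3.6 can be summarized by a driver along the following lines, reusing the hypothetical helpers sketched earlier in this Chapter (proximity, error_count_estimate, e_min and cluster); again, this is an illustration, not the thesis's implementation.

def triage(counter_examples, R=4):
    n = len(counter_examples)
    def mutual(i, j):                          # shared suspect locations
        return [(loc, loc) for loc in
                set(counter_examples[i]) & set(counter_examples[j])]
    # Proximity matrix P (Eq. 3.12); its rows double as feature vectors.
    P = [[0.0 if i == j else
          proximity(len(counter_examples[i]), len(counter_examples[j]),
                    mutual(i, j), counter_examples[i], counter_examples[j])
          for j in range(n)] for i in range(n)]
    unrelated = {(i, j) for i in range(n) for j in range(i + 1, n)
                 if not mutual(i, j)}          # pairs with |Mij| = 0
    e = error_count_estimate(counter_examples, R)        # Eq. 3.10
    partitions = cluster(P, unrelated, e_min(counter_examples))
    # Suggest the partition whose cluster count is closest to e; keep all
    # partitions so the engineer can examine alternatives.
    suggested = min(partitions, key=lambda p: abs(len(p) - e))
    return suggested, partitions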

It should be clarified that the debugging step preceding the clustering process requires manual effort


from the engineer(s). However, this does not add any overhead to the whole debugging flow. Rather, this

manual task is simply moved earlier in the flow in order to generate the necessary metrics for clustering.

As such, it will not have to be repeated once the groups of counter-examples are passed to the rightful

engineer(s).

The complexity of the overall flow is dominated by the complexity of hierarchical clustering and is

upper-bounded by $O(|\mathcal{C}|^3)$, where $|\mathcal{C}|$ is the number of counter-examples to be grouped. Since the size

of the counter-example set is not expected to be in the thousands in the majority of regression cases, the

triage engine is expected to scale well within typical regression verification flows.

3.4 Experimental Results

This Section presents preliminary experimental results for the proposed triage framework. All experiments are conducted on a single core of an Intel Core i5 3.1 GHz workstation with 8GB of RAM. Four

OpenCores [29] designs are used for the evaluation (vga, fpu, spi and mem ctrl). The underlying

automated debugging tool used for extracting the suspect locations is implemented based on [35]. A

platform coded in Python is developed to parse the returned results of the debugger, calculate the rele-

vant metrics and perform hierarchical clustering on the resulting counter-examples. For each design, a

set of different “typical” errors is injected each time by modifying the RTL description. In total, sixteen

regression simulations are run, generating a different number of counter-examples each time, caused by

a different set of errors.

For each design, a pre-generated set of test sequences is used that is stored in vector files. Each

regression run involves hundreds to thousands of input vectors. For the purpose of capturing failures

caused by the injected design errors we use end-to-end checkers that compare the expected value for

various operations, exception checkers and various assertions throughout the designs. It should be

noted that the injected RTL errors are not generated randomly, as we observed that the majority of randomly introduced bugs are either not captured by the pre-defined test suites or create trivial cases for counter-example clustering (i.e., counter-examples that are "shifted-in-time" versions of other counter-examples). Instead, we meticulously inject errors that resemble typical human-introduced ones and that also generate interesting non-trivial scenarios for triage.


"14540: ERROR: output mismatch. Expected f292e945, Got

f309efe9 (3ff759808cd7826af292e945) in vector: 4"

"27540: ERROR: output mismatch. Expected f00007b2, Got

efcda8a0 (cd7fa2441cff92e8f00007b2) in vector: 17"

"33540: ERROR: output mismatch. Expected 795a1f75, Got

79804398 (7b9e426741b9bdbf795a1f75) in vector: 23"

"34540: ERROR: output mismatch. Expected 35804398, Got

35dae339 (b3e7a98fbde72f7a35804398) in vector: 24"

"** Error: Assertion error.

Time: 1150 ns Started: 950 ns Scope:

test.dut.chk_fpu.a_div File: ../sva/fpu.sv Line: 233"

"ERROR: Underflow Exception Expected: 0, Got 1

45540: ERROR: output mismatch. Expected 00000000, Got

00000000 (8a314ad1997a7e9b00000000) in vector: 35"

"24540: ERROR: output mismatch. Expected ceac709c, Got

cf2c709c (4ef3129a4f4fc19bceac709c) in vector: 14"

"49540: ERROR: output mismatch. Expected 45aad895, Got

462ad895 (c17e453045ab57b845aad895) in vector: 39"

"** Error: Assertion error.

Time: 1350 ns Started: 1250 ns Scope:

test.dut.u1.chk_pre_norm.a_check_pos_sign File:

../sva/pre_norm.sv Line: 70"

"** Error: Assertion error.

Time: 2650 ns Started: 2550 ns Scope:

test.dut.u1.chk_pre_norm.a_check_neg_sign File:

../sva/pre_norm.sv Line: 75"

"43540: ERROR: output mismatch. Expected 6fcfb17a, Got

efcfb179 (6fcfb17a1bb73bd36fcfb17a) in vector: 33"

"48540: ERROR: output mismatch. Expected aebaa9dd, Got

2ebaa9dd (aebaa9de996ed347aebaa9dd) in vector: 38"

"ERROR: DIV_BY_ZERO Exception: Expected: 1, Got 0

28540: ERROR: output mismatch. Expected 00000000, Got 00000000 (92bf785f9b6e56a400000000) in vector: 18"

another. If the similarity score is high then two failures should be

grouped together; if they are low, then they should be separated.

Other information can be used as additional metrics to either

bias the weights when comparing separate error paths, or simply

used as tie-breakers when two failures are borderline similar. For

example, recently changed code could act as a simple filter to

disregard, or change the weight of the different components in the

path. Another example is using different operation modes as a tie

breaking score for borderline similar failures. These metrics are

much more dependent on the environment and can be tuned for

optimal use.

Once a similarity score is generated from two signatures,

there are many clustering algorithms that can be applied to group

them. These algorithms typically involve a threshold parameter

that will decide how easily two similar failures can be grouped

together. This threshold value need not be static either as it can

change based on the environment information as well. In fact, it

may be best to experiment with many such variables until settling

on an appropriate set of thresholds and metrics.

Finally, when the bins are created, the best suited engineer

to fix the problem must be identified. The source control database

can tag engineers based on the owner of the most common

modules/files or the author of the last change committed for each

bin.

4. CASE STUDY

The failure triage infrastructure described in this paper has

been developed and is available for commercial use. Its industrial

use has been applied to commercial designs in communication

applications. Due to the confidential nature of the commercial

designs, we cannot disclose the level of data required for this

paper. However, to provide a detailed level of information and

illustrate the effectiveness of the triage infrastructure, we have

collaborated with graduate students from the University of

Toronto where the triage tool was applied on two sample designs.

To create a realistic verification environment, dozens of “typical”

bugs were created in the RTL and testbench components of the

designs. A sample of the bugs are shown in the remaining of the

paper.

In this section we describe in detail the two case studies. For

each we present an overview of the design, provide a sample of

the failures found during simulation and show the results of the

triage engine.

4.1 Design 1: FPU module

The FPU design used in the case study is a single precision

IEEE 754 compliant Float Point Unit (FPU) from Opencores [5]

with some modifications. The design is written in Verilog and is

composed of eight modules totalling 1415 lines of code. It can

perform six operations and supports four rounding modes. The

architecture consists of a floating point exception number units,

floating point pre-normalization unit that adjust the numbers to

equal exponents, primitive operation modules, and floating point

post-normalization that denormalizes and rounds the result.

The test suite contains a test set for each FPU operation with

different round modes. The test sequences are pre-generated and

stored in vector files. Depending on the operation and round

mode, the corresponding test sequence is loaded into the memory,

as well as the expected values. There is an end-to-end checker that

compares the expected value for each operation, exception

checkers and some assertions throughout the design.

When simulating the design with the different test

sequences we get many failures that occur. Due to space

constraints we only show some of the firings as follows.

Notice that some errors are due to assertion failures and

others are due to golden value mismatches and exception

catching. We run the OnPoint root cause analysis engine and use

the result to generate the signatures described. The clustering

algorithm groups all the failures into five bins. After performing

further manual debugging on each bin, the root cause of each error

is identified. In these experiments, the grouping is performed

correctly as the same bugs are grouped together and distinct bugs

are grouped separately. Note that if binning was done purely

based on the failure messages there would have been four bins

where at least two bug sources would have gone unidentified, and

another bin would have presented a duplicate error.

Next, each of the bins containing the root causes are briefly

described. The first bin groups four checker failures and one

assertion failure together, which is typically hard to do manually.

Bin 1: 4 checkers, 1 assertion

Bug Location: primitives.v : 90 & 98

-> // Bug: missing one clock delay

-> always @(posedge clk)

-> quo <= #1 opa / opb;

-> always @(posedge clk)

-> rem <= #1 opa % opb;

-> // Fix:

-> always @(posedge clk) begin

-> quo1 <= #1 opa / opb;

-> quo <= #1 quo1;

-> end

-> always @(posedge clk) begin

-> rem1 <= #1 opa % opb;

-> rem <= #1 rem1;

-> end

(a) Missing pipeline stage

"# ** Error: Assertion error.

# Time: 603 ns Started: 603 ns Scope:

test.dut.chk_top_i1.assertion_a_fifo_rreq File:

../sva/vga_top.sv Line: 92"

"# ** Error: Assertion error.

# Time: 609 ns Started: 603 ns Scope:

test.dut.wbm.clut_sw_fifo.chk_fifo_i1._a_read_pointer File:

../sva/vga_fifo.sv Line: 32"

"# ** Error: Assertion error.

# Time: 609 ns Started: 603 ns Scope:

test.dut.wbm.data_fifo.chk_fifo_i1._a_word_down_counter File:

../sva/vga_fifo.sv Line: 72"

"# ** Error: Assertion error.

# Time: 897 ns Started: 897 ns Scope:

test.dut.sigMap_i1.assertion_wbs_dat_o File:

../sva/VennsaChecker.sv Line: 45

# 897.0 ns: expected aaaaaaaa, got ffffff9f"

"# ** Error: Assertion error.

# Time: 633 ns Started: 627 ns Scope:

test.dut.wbm.data_fifo.chk_fifo_i1._a_read_pointer File:

../sva/vga_fifo.sv Line: 32"

"# ** Error: Assertion error.

# Time: 651 ns Started: 645 ns Scope:

test.dut.pixel_generator.rgb_fifo.chk_fifo_i1._a_read_pointer

File: ../sva/vga_fifo.sv Line: 32"

"# ** Error: Assertion error.

# Time: 831 ns Started: 831 ns Scope:

test.dut.sigMap_i1.assertion_wbs_dat_o File:

../sva/VennsaChecker.sv Line: 45

# 831.0 ns: expected ffffff9f, got ffffffXf"

"# At time 273.0 ns: ERROR in wishbone:

golden=0000000000000000, actual=0000000100000000"

"# ** Error: Assertion error.

# Time: 273 ns Started: 273 ns Scope:

test.dut.sigMap_i1.assertion_wbs_dat_o File:

../sva/VennsaChecker.sv Line: 45

# 273.0 ns: expected 00000000, got 00000001"

"# At time 111.0 ns: ERROR in sync: golden=0,

actual=1"

The correct and buggy RTL is shown above. In this case the bug

is that there is a missing pipeline stage.

The second bin contains a single exception error. The bug in

this case resides in the Verilog testbench where bad stimulus is

generated. Interestingly this failure is distinguished from the

others and is binned on its own.

The third bin groups two checker failures. In this case a

basic grouping algorithm would have resulted in a similar result.

This bug is in the RTL and is due to setting the top-most bit to

zero instead of one.

Bin four groups two assertion failures and two checker

failures, which is typically hard to identify manually. The bug is

in the RTL and is a result of incorrectly decoded signals inside a

case statement.

Bin five captures a single exception on its own. In this case,

after root cause analysis, it finds that the bug is in the expected

model used for the exception handling. This strengths the notion

that the triage approach is also valid for bugs outside of the DUT.

4.2 Design 2: VGA controller

The VGA controller is from Opencores [5] with some

modification, it is written in Verilog, composed of 17 modules

totalling 4,076 lines of code and approximately 90,000

synthesized gates. The controller provides VGA capabilities for

embedded systems. The architecture consists of a Color

Processing module and a Color Lookup Table (CLUT), a Cursor

Processing module, a Line FIFO that controls the data stream to

the display, a Video Timing Generator, and Wishbone master and

slave interfaces to communicate with all external memory and the

host, respectively.

The operation of the core is as follows. Image data is

fetched automatically via the Wishbone Master interface from the

video memory located outside the primary core. The Color

Processor then decodes the image data and passes it to the Line

FIFO to transmit to the display. The Cursor Processor controls the

location and image of the cursor processor on the display. The

Video Timing Generator module generates synchronization pulses

and interrupt signals for the host.

The test suite for the VGA core is constructed using UVM.

Four main tests are used for verifying this design. These include

register, timing, pixel data, and FIFO tests. The transaction has

randomly generated control-data pairing packets under certain

constraints. These transactions are expected to cover all the VGA

operation modes in the tests (and they may be reused to test other

video cores such as DVI, etc). The sequencer exercises different

combinations of these transactions through a given testing scheme

so that most corner cases and/or mode switching are covered. The

monitors are connected to the DUT and the reference model

Bin 5: 1 exception

Bug Location: test_top.v : 302 - 306

-> // Bug: incorrect reference model

-> if(div_by_zero != exc4[2])

-> begin

-> exc_err=1;

-> $display("\nERROR: DIV_BY_ZERO Exception:

Expected: %h, Got %h\n",exc4[2],div_by_zero);

-> end

Bin 4: 2 assertions, 1 checker, 1 exception

Bug Loaction: pre_norm.v : 213 - 216

-> always @(signa or signb or add ...

-> ...

-> // Bug: switched assignments

-> 3'b0_0_0: sign_d = 1;

-> 3'b0_1_0: sign_d = !fractb_lt_fracta;

-> 3'b1_0_0: sign_d = fractb_lt_fracta;

-> 3'b1_1_0: sign_d = 0;

-> // Fix:

-> 3'b0_0_0: sign_d = fractb_lt_fracta;

-> 3'b0_1_0: sign_d = 0;

-> 3'b1_0_0: sign_d = 1;

-> 3'b1_1_0: sign_d = !fractb_lt_fracta;

Bin 3: 2 checkers

Bug Location: post_norm.v : 354

-> // Bug: Incorrect padding bit

-> assign {exp_rnd_adj0, fract_out_rnd0} = round ?

fract_out_pl1 : {1'b1, fract_out};

-> // Fix:

-> assign {exp_rnd_adj0, fract_out_rnd0} = round ?

fract_out_pl1 : {1'b0, fract_out};

Bi

Bug Location: test_top.v : 322 - 326

-> // Bug: incorrect stimulus

-> ...

-> @(posedge clk);

-> #1;

-> ...

-> oper = tmp[103:96];

-> ...

-> case(oper)

-> 8'b00000001: fpu_op=3'b000; // Add

-> ...

-> 8'b01000000: fpu_op=3'b110; // rem

-> default: fpu_op=3'bx;

-> endcase

-> ...

(b) Bad stimulus generated

respectively. They check the protocols of the responses, and make

sure that the data being sent to scoreboard has correct timing. The

scoreboard and checkers contain all the field checkers which

compare the data from the DUT, and reports the mismatches.

The golden reference model is implemented using C++. It

receives the same set of stimulus from the driver (uvm_driver

class) and produces the expected value of the outputs. Along with

the reference model, 50 SystemVerilog Assertions (SVA) are used

to do some instant checks. While running simulation, SVA can

catch unexpected behaviours of the design and prevent corrupted

data going through the flow.

A sample of the failures that occurred during a suite of

simulation tests is shown. Notice that there are a set of assertions

and correctness checkers that fire. As in the FPU case, OnPoint is

run on the test to generate the suspects which are used as

signatures during the triage process.

If triage were performed based purely on the error message

the result would be six bins. In contrast there are only four errors

in this case, thus time would have been wasted analyzing

redundant failures. Furthermore, two of the errors could also have

been missed if only one failure is analyzed within each bin. In

contrast, the triage infrastructure proposed correctly generates

four bins, one for each error. The resulting triage bins and the root

cause of the failures are shown below.

Bin one groups four assertions failures based on three

different assertions thus eliminating wasted time by analyzing

each one separately. The single bug source is due to an incorrect

assignment based on the state of the vga color processor.

Bin two catches three distinct assertion failures once again.

In this case, the RTL bug is due to picking the wrong bit of a read

pointer inside a fifo.

Bin three contains both a checker and an assertion failure.

Interestingly, this bug resides in the testbench where some

stimulus signals are instantiated using the wrong models. As a

result both the checker and an assertion fail. Debug such cases

typically would involved multiple designers and verification

engineers.

Bin four contains a single checker failure. This failure is

caused by a bug in the testbench where the reference model

contains the bug.

In all these cases, we have confirmed that the bins generated

by the proposed triage approach correctly bin the failure based on

the same root cause. We confirmed the finding by verifying that

fixing the bugs remove all the failures for a given bin. It should be

noted that the proposed triage approach may not always be

correct, if distinct bugs are close in proximity they may end up in

the same bin.

5. Conclusion

In this work we presented a novel failure triage approach

that is both automated and generates better results than previous

script-based and manual techniques. The triage engine relies on

information from root cause analysis tools that provide visibility

into the propagation paths of the bug. These paths along with their

activation times provide unique insight that is used to group

similar failure together. To illustrate the effectiveness of the

approach we provide two small case studies where distinct bugs

are correctly binned separately. Further research in this area will

focus on improving the resolution and quality of the binning

algorithms and generating custom heuristics for testbench and

environment originating bugs.

6. REFERENCES

[1] H. Foster, “Assertion-based verification: Industry myths to

realities (invited tutorial),” in Computer Aided Verification,

2008, pp. 5–10.

[2] S. Huang and K. Cheng, Formal Equivalence Checking and

Design Debugging. Kluwer Academic Publisher, 1998.

[3] A. Smith, A. Veneris, M. F. Ali, and A. Viglas, “Fault

Diagnosis and Logic Debugging Using Boolean

Satisfiability,” IEEE Trans. on CAD,vol. 24, no. 10, pp.

1606–1621, 2005.

[4] Vennsa Technologies Inc.,

http://www.vennsa.com/product_simulation.html

[5] OpenCores, http://www.opencores.org

Bin 4: 1 checker

Bug Location : self_checking.v : 110 - 121

-> // Bug: incorrect "blanc_golden" generated from

erroreous reference model

-> if(^{hsync_golden, vsync_golden, csync_golden,

blanc_golden} !==1'bx)

-> if({hsync_golden, vsync_golden, csync_golden,

blanc_golden} != {hsync, vsync, csync, blanc})

-> begin

-> $display("At time %t: ERROR in sync:

golden=%h, actual=%h", $time,

-> {hsync_golden, vsync_golden,

csync_golden, blanc_golden},

-> {hsync, vsync, csync, blanc}

-> );

-> ->ERROR;

->

-> end

Bin 3: 1 checker, 1 assertion

Bug Location: test_bench_top.v : 684

-> // Bug: incorrect stimulus "wb_err_i" generated from

wb_slv model

-> wb_slv #(24) s0(.clk( clk ),

-> .rst( rst ),

-> .adr( {1'b0, wb_addr_o[30:0]} ),

-> ...

-> .err( wb_err_i ),

-> .rty( )

-> );

Bin 2: 3 assertions (3 different)

Bug Location : vga_fifo.v : 191

-> always @(posedge clk or negedge aclr)

-> ...

-> // Bug: missing use of function

-> else if (frreq) rp <= #1 {rp[aw-1:1], rp};

-> // Fix:

-> else if (frreq) rp <= #1 {rp[aw-1:1], lsb(rp)};

Bin 1: 4 assertions (3 different)

Bug Location: vga_colproc.v : 263

-> always @(c_state or vdat_buffer_empty or colcnt or

DataBuffer or rgb_fifo_full or clut_ack or clut_q or Ba or Ga

or Ra)

-> begin : output_decoder

->

-> // initial values

-> // Bug incorrect initial value

-> ivdat_buf_rreq = 1'b1;

->

-> // Fix:

-> ivdat_bug_rreq = 1'b0;

(c) Incorrect assignment based on state of vga color processor

respectively. They check the protocols of the responses, and make

sure that the data being sent to scoreboard has correct timing. The

scoreboard and checkers contain all the field checkers which

compare the data from the DUT, and reports the mismatches.

The golden reference model is implemented using C++. It

receives the same set of stimulus from the driver (uvm_driver

class) and produces the expected value of the outputs. Along with

the reference model, 50 SystemVerilog Assertions (SVA) are used

to do some instant checks. While running simulation, SVA can

catch unexpected behaviours of the design and prevent corrupted

data going through the flow.

A sample of the failures that occurred during a suite of

simulation tests is shown. Notice that there are a set of assertions

and correctness checkers that fire. As in the FPU case, OnPoint is

run on the test to generate the suspects which are used as

signatures during the triage process.

If triage were performed based purely on the error message

the result would be six bins. In contrast there are only four errors

in this case, thus time would have been wasted analyzing

redundant failures. Furthermore, two of the errors could also have

been missed if only one failure is analyzed within each bin. In

contrast, the triage infrastructure proposed correctly generates

four bins, one for each error. The resulting triage bins and the root

cause of the failures are shown below.

Bin one groups four assertions failures based on three

different assertions thus eliminating wasted time by analyzing

each one separately. The single bug source is due to an incorrect

assignment based on the state of the vga color processor.

Bin two catches three distinct assertion failures once again.

In this case, the RTL bug is due to picking the wrong bit of a read

pointer inside a fifo.

Bin three contains both a checker and an assertion failure.

Interestingly, this bug resides in the testbench where some

stimulus signals are instantiated using the wrong models. As a

result both the checker and an assertion fail. Debug such cases

typically would involved multiple designers and verification

engineers.

Bin four contains a single checker failure. This failure is

caused by a bug in the testbench where the reference model

contains the bug.

In all these cases, we have confirmed that the bins generated

by the proposed triage approach correctly bin the failure based on

the same root cause. We confirmed the finding by verifying that

fixing the bugs remove all the failures for a given bin. It should be

noted that the proposed triage approach may not always be

correct, if distinct bugs are close in proximity they may end up in

the same bin.

5. Conclusion

In this work we presented a novel failure triage approach

that is both automated and generates better results than previous

script-based and manual techniques. The triage engine relies on

information from root cause analysis tools that provide visibility

into the propagation paths of the bug. These paths along with their

activation times provide unique insight that is used to group

similar failure together. To illustrate the effectiveness of the

approach we provide two small case studies where distinct bugs

are correctly binned separately. Further research in this area will

focus on improving the resolution and quality of the binning

algorithms and generating custom heuristics for testbench and

environment originating bugs.

6. REFERENCES

[1] H. Foster, “Assertion-based verification: Industry myths to

realities (invited tutorial),” in Computer Aided Verification,

2008, pp. 5–10.

[2] S. Huang and K. Cheng, Formal Equivalence Checking and

Design Debugging. Kluwer Academic Publisher, 1998.

[3] A. Smith, A. Veneris, M. F. Ali, and A. Viglas, “Fault

Diagnosis and Logic Debugging Using Boolean

Satisfiability,” IEEE Trans. on CAD,vol. 24, no. 10, pp.

1606–1621, 2005.

[4] Vennsa Technologies Inc.,

http://www.vennsa.com/product_simulation.html

[5] OpenCores, http://www.opencores.org

Bin 4: 1 checker

Bug Location : self_checking.v : 110 - 121

-> // Bug: incorrect "blanc_golden" generated from

erroreous reference model

-> if(^{hsync_golden, vsync_golden, csync_golden,

blanc_golden} !==1'bx)

-> if({hsync_golden, vsync_golden, csync_golden,

blanc_golden} != {hsync, vsync, csync, blanc})

-> begin

-> $display("At time %t: ERROR in sync:

golden=%h, actual=%h", $time,

-> {hsync_golden, vsync_golden,

csync_golden, blanc_golden},

-> {hsync, vsync, csync, blanc}

-> );

-> ->ERROR;

->

-> end

Bin 3: 1 checker, 1 assertion

Bug Location: test_bench_top.v : 684

-> // Bug: incorrect stimulus "wb_err_i" generated from

wb_slv model

-> wb_slv #(24) s0(.clk( clk ),

-> .rst( rst ),

-> .adr( {1'b0, wb_addr_o[30:0]} ),

-> ...

-> .err( wb_err_i ),

-> .rty( )

-> );

Bin

Bug Location : vga_fifo.v : 191

-> always @(posedge clk or negedge aclr)

-> ...

-> // Bug: missing use of function

-> else if (frreq) rp <= #1 {rp[aw-1:1], rp};

-> // Fix:

-> else if (frreq) rp <= #1 {rp[aw-1:1], lsb(rp)};

Bin 1: 4 assertions (3 different)

Bug Location: vga_colproc.v : 263

-> always @(c_state or vdat_buffer_empty or colcnt or

DataBuffer or rgb_fifo_full or clut_ack or clut_q or Ba or Ga

or Ra)

-> begin : output_decoder

->

-> // initial values

-> // Bug incorrect initial value

-> ivdat_buf_rreq = 1'b1;

->

-> // Fix:

-> ivdat_bug_rreq = 1'b0;

(d) Picking the wrong bit of read pointer inside fifo

Figure 3.7: Examples of injected RTL errors

Figure 3.7 shows some examples of injected RTL errors for the vga and fpu designs. Note that the error in Figure 3.7(b) resides in the testbench; we also introduce and group errors in the verification environment, and not only in the design that undergoes debugging.

A first set of experiments is conducted to confirm the claims made based on our probabilistic analysis

in Section 3.3.1.


[Figure: two histograms comparing actual errors against suspects, as a percentage over the total number — (a) suspect allocation based on suffix window length (1, 2, 3, >=4 cycles) across all testcases; (b) suspect allocation based on the number of times covered by simulation (1, 2, 3, 4, >=5) across all testcases.]

Figure 3.8: Features of real errors and suspects

After regression simulation, 285 counter-examples are collected across all designs; since the actual error is known among the returned suspects, we record its first excitation cycle along with the frequency of the corresponding RTL component. Results are illustrated in Fig. 3.8, where we see that both real errors and suspects generally follow our expectations: the suffix window length is generally short, and these locations are exercised only a small number of times during their prefix window. However, actual human-introduced errors tend to follow the above pattern more closely than the rest of the suspects, a feature that places these errors generally high in the suspect ranking scheme and complies with our expectations.

A second set of experimental results is depicted in Figure 3.9, where we explore the effect of the ranking scheme on the framework's average accuracy across all sixteen testcases. For the experiments presented in this Section the $\gamma$ offset (Eq. 3.3) is set to 0.1. Furthermore, for this set of experiments we run the triage engine for each regression testcase once with the standard and once with the modified Ward's Method criterion (WM and WM-Mod, respectively). The ratio $\left(1 - \frac{\text{misclassified } C_i\text{'s}}{|\mathcal{C}|}\right)$ determines accuracy. A counter-example belonging to the wrong group is considered misclassified. In Figure 3.9, $R = 0$ denotes the absence of the ranking scheme ($M^R_{ij} = M_{ij}$ always). More precisely, Figure 3.9(a) demonstrates how much larger the method's estimate $e$ is compared to the actual number of errors. Evidently, computations devoid of any suspect ranking lead to the inclusion of 2.8 extra clusters on average. This translates to a 77% average accuracy for the triage engine, shown in Figure 3.9(b). Remark that selecting only the top-ranked suspect ($R = 1$) discards useful knowledge by excluding the rest of the suspects and results in 3.1 extra clusters on average, decreasing overall accuracy to 74%.


[Figure: (a) average error on $e$ across all testcases as a function of $R$ (0 to 10); (b) effect of $R$ on average triage accuracy (%) across all testcases, comparing the proposed triage with WM and WM-Mod against the script baseline.]

Figure 3.9: Effect of selecting the R highest in rank suspects

Also observe that, when $R$ is set too high, low-rank suspects are included in the computations and introduce noise to the error count estimation. This also incurs a decrease in the overall accuracy shown in Figure 3.9(b). However, any reasonable selection of $R$ between 2 and 10 results in better accuracy overall, with the best outcome of 94% achieved when $R = 4$. Note that in the latter case, $e$ is off only by 0.7 on average. Generally, even for extreme values of $R$ (1 or 10) the triage engine always outperforms conventional scripting-based triage, which achieves a 67% average accuracy, shown by the red line in Fig. 3.9(b). Finally, we observe that modifying Ward's Method achieves higher accuracy overall. With the standard criterion employed, the triage engine exhibits an average accuracy of 83% across all selections of the parameter $R$; with the modified criterion it achieves an average accuracy of 86%.

Table 3.1 demonstrates detailed results for all sixteen testcases and four designs with $R = 4$, so that the top 4 suspects in rank are selected for the error count estimation. This selection is indicative; we choose this value of $R$ because it reflects good behavior for the algorithm. Again, though, any reasonable $R \in [2, 10]$ generates similar results, as shown in Fig. 3.9. The first, second and third columns refer to the design name, the testcase number and the size of the design in gates, respectively. Columns 4 and 5 contain the actual number of errors injected into the design and the number of test vectors used for each of the sixteen regression runs. Column 6 indicates the total number of counter-examples generated by each regression run. Column 7 shows the error count estimation ($e$) that the proposed method generates for each testcase. Columns 8 to 10 include a comparison in accuracy between the proposed triage flow and a typical binning strategy based on a script that exploits


Table 3.1: Proposed Triage Engine Performance (R=4)

circuit  | testcase No. | # gates | # errors | # test vectors | |C| | e | triage (e) | triage (# errors) | script | # suspects (avg) | error rank (high-low) | time (sec)
fpu      | 1  | 83303 | 4 | 7550  | 15 | 4 | 100% | 100% | 80% | 12.5 | 1-6 | 12.8
fpu      | 2  | 83303 | 5 | 7550  | 20 | 6 | 83%  | 100% | 65% | 14.1 | 2-5 | 13.8
fpu      | 3  | 83303 | 6 | 9802  | 24 | 6 | 100% | 100% | 68% | 12.7 | 1-4 | 16.7
fpu      | 4  | 83303 | 7 | 9802  | 31 | 7 | 95%  | 95%  | 65% | 14.3 | 1-7 | 19.8
vga      | 5  | 72292 | 3 | 21203 | 14 | 4 | 88%  | 100% | 71% | 14.0 | 2-6 | 14.3
vga      | 6  | 72292 | 4 | 21203 | 15 | 6 | 78%  | 89%  | 67% | 13.9 | 2-9 | 15.5
vga      | 7  | 72292 | 5 | 21203 | 22 | 5 | 100% | 100% | 55% | 15.1 | 1-4 | 17.9
vga      | 8  | 72292 | 6 | 21203 | 29 | 7 | 89%  | 95%  | 69% | 12.9 | 1-5 | 19.3
spi      | 9  | 1724  | 2 | 3370  | 8  | 2 | 100% | 100% | 75% | 10.8 | 1-3 | 12.4
spi      | 10 | 1724  | 3 | 3370  | 13 | 3 | 100% | 100% | 62% | 11.4 | 1-4 | 12.7
spi      | 11 | 1724  | 4 | 5019  | 16 | 5 | 90%  | 100% | 63% | 12.2 | 1-5 | 13.4
spi      | 12 | 1724  | 5 | 5019  | 16 | 6 | 94%  | 100% | 69% | 10.3 | 1-5 | 15.7
mem ctrl | 13 | 46767 | 4 | 10834 | 9  | 4 | 100% | 100% | 56% | 15.8 | 1-6 | 12.2
mem ctrl | 14 | 46767 | 5 | 10834 | 15 | 5 | 100% | 100% | 67% | 14.6 | 1-4 | 12.7
mem ctrl | 15 | 46767 | 6 | 19507 | 18 | 8 | 89%  | 94%  | 72% | 15.0 | 2-4 | 13.8
mem ctrl | 16 | 46767 | 7 | 17006 | 20 | 9 | 90%  | 95%  | 70% | 15.2 | 1-5 | 15.7
AVG      |    |       |   |       |    |   | 94%  | 98%  | 67% |      |     | 14.9

error message information. Specifically, columns 8 and 9 refer to the accuracy of the triage flow when

performed with our error count estimation (e) or with the actual number of errors (# errors) respectively,

assuming prior knowledge of that number. The eleventh column presents the average number of suspects

across all counter-examples in each regression session. Column 12 presents the lowest and highest rank

assigned to the actual error in the ranking list. Finally, the last column indicates the total time consumed

by the calculation of the two metrics and the clustering process.

The engine's average accuracy reaches 94% when the algorithm is executed with our initial guess (column 7) and reaches 98% for those groupings where the number of clusters equals the number of design errors. Generally, a perfect initial guess that matches the actual number of errors is observed in seven out of sixteen testcases, achieving a 99% accuracy on average. On the other hand, in cases where the error count estimation is off by one or two clusters, accuracy drops to 88%. A conventional approach similar to the one in Section 3.2, consisting of scripts to perform the grouping, achieves an overall accuracy of 67%. As such, the proposed method improves accuracy by up to 40% when the initial guess is utilized; a solid improvement that indicates the potential of the proposed framework.


[Figure: per-testcase classification accuracy (%) for WM and WM-Mod with R = 4, across all sixteen testcases.]

Figure 3.10: Effect of modification on Ward's Method

guess is utilized; a solid improvement that indicates the potential of the proposed framework. Moreover,

the actual error is assigned a high rank in the suspect ranking list, as shown in column 10. Finally,

computation of the two metrics and clustering consume an average of 14.9 seconds in total, which is

acceptable for the purposes of triage.

Detailed results demonstrating the benefit of constraining Ward’s Method to prohibit the merging of

clusters that contain definitely unrelated counter-examples are shown in Fig. 3.10. Results are generated

per testcase, using the same configuration (R = 4, γ = 0.1) as in Table 3.1. Fig. 3.10 illustrates accuracy

results for the triage engine when standard Ward’s Method (WM) is used as a linkage criterion and when

the modified version (WM-Mod) is applied. We observe that the proposed modified criterion achieves

higher accuracy in 9 out of 16 testcases, improving the average accuracy by 4.5% overall. Recall that the

detailed results in Table 3.1 are generated when the modified version of Ward’s Method is applied and

thus agree with the results presented in Fig. 3.10.

3.5 Summary

In this Chapter, a novel automated debugging triage framework is proposed. The algorithm extracts

information from simulation and debugging results to define relationships between various counter-

examples. Strongly related counter-examples are then grouped together to guide detailed debugging.


In order to quantify counter-example relations, we introduce the concept of counter-example proximity and propose a suspect ranking scheme for its computation. Furthermore, we devise a speculative metric to estimate the number of co-existing errors. The applicability and efficacy of the triage engine are

demonstrated by experimental results within typical regression verification flows, indicating a signifi-

cant increase in grouping accuracy compared to traditional triage techniques.


Chapter 4

Leveraging Re-configurability To Raise Productivity In FPGA Functional Debug

4.1 Introduction

Compared to high-cost, state-of-the-art ASIC design, field-programmable gate arrays (FPGAs) offer a wide gamut of benefits when employed as platforms for digital circuit implementation. What mainly distinguishes FPGAs is that they carry the advantage of re-configurability and relatively low NRE costs, making them attractive for low-to-medium volume applications.

The reconfigurability property is one of the outstanding assets of FPGAs when it comes to functional

verification. With ASICs, designers spend considerable time in simulation/verification before tape out,

including, for example, simulation with post-layout extracted capacitances and cross-talk noise analysis.

Conversely with FPGAs, designers rarely do post-routing full delay simulations. Instead, reconfigurabil-

ity allows design iterations to include actual silicon execution. Designers verify their design in hardware

using the same (or a similar) FPGA they intend to deploy in the field. When design errors are discovered,

the design’s RTL is altered, re-synthesized and executed in hardware.

The time needed for design cycles in FPGAs is dominated by re-synthesis (logic synthesis, technol-

ogy mapping, placement and routing) tool run-times. FPGA placement and routing can take hours or

days for the largest designs [15], and such run-times are an impediment to designer productivity. With

this observation in mind, in this dissertation, we present new techniques for FPGA functional debug that


exploit the reconfigurability concept to raise productivity by reducing the number of compute-intensive

design re-synthesis runs that are needed.

At a high level, our approaches work as follows: say, for example, an engineer wishes to trace a large number, N, of a design’s internal signals during functional debug, using a small number of available external pins, m (N ≫ m). We augment the design with additional circuitry that allows the N signals to be traced with ⌈N/m⌉ FPGA device re-configurations and hardware executions. The key value of our approach is that the design is only synthesized, placed and routed once, rather than ⌈N/m⌉ times. This is achieved by selecting the different sets of m trace signals through modifications to the FPGA’s configuration bitstream (i.e., the post-routed design).
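As an illustrative example (the numbers are ours, not taken from the experiments), tracing N = 256 signals through m = 8 pins would require ⌈256/8⌉ = 32 re-configurations and hardware executions, but still only a single synthesis, placement and routing run.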

While the proposed approach leverages reconfigurability to reduce loops through the design process,

a further contribution of this work is a new multiplexer (MUX) design scheme for FPGAs that uses

significantly less area than a traditional MUX design. The new MUX is suitable for use in cases wherein

the MUX select inputs are changed using the FPGA bitstream, instead of using normally routed logic

signals. We also present a design variant to handle the scenario where limited external pins are available

for debugging.

As compared with design re-synthesis for each group of m signals, experimental results demonstrate

that our approach improves run-time by up to 30×. Our approach also offers stability in the timing

characteristics of the circuit being debugged.

This Chapter is organized as follows. Section 4.2 discusses the role of FPGAs in functional de-

bug and outlines prior work in the field. Section 4.3 introduces the proposed approach to FPGA

functional debug. Finally, Section 4.4 provides experimental results and Section 4.5 summarizes the

Chapter.

4.2 FPGA Functional Debug

There are two major approaches to perform functional debug with an FPGA. The first approach is to

implement the complete design in an FPGA device. This is suitable for small designs that do not need

to be executed at a high frequency. Because of the reconfigurability, debugging modules can be easily

added or modified at little cost. A set of circuit modifications that enhance debug capability is presented


in [25]. It provides software-like debug features, such as watchpoints and breakpoints. However, any

modification to watchpoints or breakpoints requires recompilation of designs – a run-time intensive

task. In a somewhat similar manner to what is proposed in this work, Graham et al. improve debugging

productivity by instrumenting FPGA bitstreams [16]. An embedded logic analyzer is inserted into

the design without connecting to any signals. After place-and-route, the signals targeted for tracing

are routed to the logic analyzer by modifying bitstreams using vendor tools. Although the approach

provides flexibility in choosing the desired internal signals for tracing, it remains a very complicated

procedure. Furthermore, when different sets of signals are selected for tracing, re-routing needs to be

performed, which can significantly affect the timing closure of the design.

Xilinx’s ChipScope tool provides features to trace different signals without re-executing place and

route [42]. Special logic analyzer hardware is inserted into a design to trace internal signals during

design execution. The captured signals can then be displayed using Xilinx’s ChipScope Pro Analyzer

Tool, running on a connected host computer. While the approach bears similarity to our own, the

techniques and tools associated with ChipScope are proprietary and not disclosed publicly.

The second approach to using FPGAs for functional debug is that of embedding reconfigurable logic

into SoCs to enhance debug capability [2, 30]. The programmability of reconfigurable logic can be

applied to implement various debug paradigms, such as assertions, signal capture and what-if analysis.

Those paradigms help engineers to understand the internal behavior of the chip and provide at-speed

in-system debug. Engineers can instrument the reconfigurable logic on-the-fly, as needed. However,

each change to the debug circuitry incurs significant cost and overhead.

A recent work presented in [21] proposes a methodology for post-mapping incremental trace-

insertion that only utilizes unoccupied logic in order to preserve the original FPGA mapping. The

method allows for small incremental re-compilations for signal tracing, while also having a negligible

effect on critical path delay. The approach presented in our work can potentially be integrated into such robust methodologies to further reduce the time between debugging sessions and preserve timing closure

in a sign-off circuit.

Finally, several works on selecting the signals that one may wish to trace for debugging have been

proposed [20, 23, 43]. While most works target ASIC designs, the work in [20] is designed specifically

for FPGAs. It predicts which signals may be useful for debugging and automatically instruments the

design. Any prior work on signal selection could be used in conjunction with our approach.

[Figure 4.1: Area overhead of SignalTap: # ALMs + # registers versus the number of traced nodes (16 to 256).]

[Figure 4.2: Multiplexer for signal selection: m groups of w-bit inputs (i0, i1, ..., im), select lines s0, s1, ..., sn, and a w-bit output out.]

4.3 A Reconfigurability-Driven Approach to FPGA Functional Debug

This Section presents a new approach to enhance the observability of FPGA designs for functional

debug. To debug functional errors in an FPGA design, the design is first synthesized, placed and routed

on the target FPGA device. The programming bitstream is generated, programmed into the FPGA, and

execution commences. If unexpected behavior is observed, a set of internal signals is selected to be

traced by a logic analyzer to provide more information. In the conventional debug process, the design

needs to be recompiled and the FPGA needs to be reprogrammed. Fig. 4.1 shows the area overhead of

Altera’s SignalTap II [4] logic analyzer vs. the number of signals being tapped. One can see that the

overhead grows significantly as the number of monitored signals increases. Due to the area overhead of

the logic analyzer, usually only a small set of signals are traced at any one time. The process is repeated

until the values for all signals of interest are acquired. The main issue with this process is that it can

take hours to compile large designs [3]. As such, repeated compilation can introduce significant time

overhead and prolong the overall debug process.

To alleviate the issue, a new design process that avoids recompilation is presented in this work. The

idea is to modify the bitstream directly when different signals need to be traced. This is achieved by

inserting a multiplexer into the design implemented on the FPGA, with the MUX inputs being all sig-

nals that one potentially wants to trace. Fig. 4.2 depicts a multiplexer that can select one of m groups of

w signals. The select signals of the multiplexer are preset to logic-0 or to logic-1. Then, one can trace


different signals by manipulating the bitstream to set the select signals to different constants. Since there

is no re-routing required, the bitstream modifications can be done easily. As a result, the time overhead

of this process is reduced to a bitstream modification followed by a bitstream downloading. Bitstream

downloading normally requires only seconds – significantly less overhead than the re-compilation ap-

proach.
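To make the structure concrete, the following Verilog sketch shows one plausible form of the inserted selection logic; the module, port and parameter names are ours and are not part of any vendor flow. The select lines are tied to constants, so re-targeting a different signal group amounts to rewriting those constant bits in the bitstream.

// A minimal sketch of the debug MUX of Fig. 4.2 (names illustrative):
// one of M groups of W signals is forwarded to the trace outputs.
module debug_mux #(
    parameter W = 8,   // signals traced per silicon execution
    parameter M = 16,  // number of W-bit candidate groups
    parameter S = 4    // select width, S = ceil(log2(M))
) (
    input  wire [M*W-1:0] trace_in,  // all trace candidates, concatenated
    output wire [W-1:0]   trace_out  // driven to external pins / analyzer
);
    // Held constant during operation; changed only between device
    // configurations by editing the corresponding bitstream bits.
    wire [S-1:0] sel = {S{1'b0}};

    // Indexed part-select picks the sel-th W-bit group.
    assign trace_out = trace_in[sel*W +: W];
endmodule

Note that, written this way, synthesis would normally propagate the constant select and optimize the multiplexer away; in the actual flow the selection ends up encoded in LUT configuration bits, which is precisely what Section 4.3.1 exploits.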

Another advantage of the proposed process is its negligible effect on the stability of the design. In

the conventional debug process, the design is re-routed each time different signals are selected.

As a result, designers often need to readjust the design to meet the various timing constraints. Even

though recent FPGA tools provide incremental compilation to preserve the engineering efforts from

previous place/route steps, experiments show that the speed performance of designs after incremental

compilation can vary. In the proposed process, because all signals one potentially wants to trace are

connected to the selection module from the beginning, only the original compilation is necessary. As

a result, selecting different signals through bitstream modifications minimizes the overall impact on the

performance of the design. Note that although our study targets Altera FPGAs, the proposed debugging

flow is not limited to Altera, and applies equally to FPGAs from other vendors.

4.3.1 An Area-Optimized Multiplexer Implementation

It is well-known that FPGAs are inefficient at implementing multiplexers. Therefore, in this Section,

a novel multiplexer implementation, optimized in the number of LUTs, is presented. The proposed

construction also takes advantage of the bitstream changes (described above).

Fig. 4.3(a) shows a traditional 16-to-1 MUX implementation in a Stratix III FPGA (the image is a

screen capture from Altera’s technology map viewer tool). Observe that five 6-input LUTs are required.

In a traditional MUX, the values of signals on the MUX select inputs can change at any time while

the circuit operates. However, in the proposed design process, the selected trace signals do not need

to change as the circuit operates. Rather, the set of selected signals is determined by the FPGA bit-

stream, and as such, may only change between device configurations. This makes an alternative MUX

implementation possible – one that consumes only three 6-LUTs in the 16-to-1 case.

[Figure 4.3: 16-to-1 MUX implementation in 6-input LUTs. (a) Traditional: five 6-LUTs fed by the data and select inputs. (b) Proposed: three 6-LUTs over inputs i0-i15, programmed here as f1 = i5, f2 = X (don’t care), f3 = f1.]

The new MUX design is based on recognizing that a LUT’s internal hardware contains a MUX, coupled with SRAM configuration cells. In our design, the LUT’s internal MUX forms a portion of the MUX we wish to implement (made possible owing to the MUX select lines being held constant

during device operation). Fig. 4.3(b) shows the proposed 16-to-1 MUX, where the 16 inputs are labeled

(i0-i15). In this case, the LUT configuration SRAM cells (i.e., the truth table) determine which MUX

input signal is passed to the output. For the purposes of illustration, in Fig. 4.3(b), each LUT is labeled

with the logic function needed to select the 6th MUX input (i5) to the output. Only three LUTs are

required: the LUT labeled f1 passes input i5 to its output. LUT f2 can implement any logic function since its output is not observable (however, to save power, f2 should be programmed to constant logic-0 or logic-1). LUT f3 is programmed to pass f1 to its output. The proposed design offers significant area

savings relative to the traditional design, and allows signal selection via bitstream changes.
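As a truth-table view of f1, the following sketch uses the generic LUT6 primitive of Xilinx-style netlists purely for illustration; the thesis targets Altera devices, whose low-level cell and parameter names differ.

// f1 of Fig. 4.3(b) as a truth-table-programmed 6-LUT (illustrative
// primitive; INIT bit k gives the output for input address k).
// out = i5 means exactly the upper 32 entries (address bit 5 = 1) are 1.
LUT6 #(
    .INIT(64'hFFFFFFFF_00000000)  // output follows data input I5
) f1 (
    .O (f1_out),
    .I0(i0), .I1(i1), .I2(i2),
    .I3(i3), .I4(i4), .I5(i5)
);

Re-selecting a different input is then a pure bitstream edit of the 64 INIT bits: for instance, 64'hAAAAAAAA_AAAAAAAA would pass I0 instead.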

4.3.2 Debugging with Limited Output Pins

The debugging architecture described above requires multiple output pins if a group of signals is traced

in one silicon execution. This approach may not be feasible in cases where the output pins are limited.

Therefore, an alternative architecture that utilizes a parallel-in serial-out shift register is presented in

Fig. 4.4.

[Figure 4.4: Multiplexer with a 4-bit shift register: four 4-bit signal groups (A-D) are selected via s1 and s2 and shifted out on clkdebug.]

In Fig. 4.4, only one output pin is used. Values of the target group are loaded into the shift register in parallel in each clock cycle. Then, the system clock is stopped and a second debug clock is used to

shift out the stored value. There is a trade-off between the number of output pins and the test execution

time. If more output pins are available, the data can be distributed into multiple shift registers which

feed different output pins. This results in fewer clock cycles for retrieving data from the shift registers.
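A minimal Verilog sketch of this structure, with illustrative names and a load control standing in for the clock-stopping protocol described above, is the following:

// Parallel-in, serial-out shift register of Fig. 4.4 (sketch): the
// selected W-bit trace group is captured in parallel, then shifted
// out over a single external pin on the debug clock.
module piso_trace #(
    parameter W = 4
) (
    input  wire         clk_debug,
    input  wire         load,         // capture the trace group this cycle
    input  wire [W-1:0] trace_group,  // output of the debug MUX
    output wire         pin_out       // single external debug pin
);
    reg [W-1:0] shift;

    always @(posedge clk_debug) begin
        if (load)
            shift <= trace_group;           // parallel load
        else
            shift <= {1'b0, shift[W-1:1]};  // shift toward bit 0
    end

    assign pin_out = shift[0];
endmodule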

This architecture can be improved to obtain all values stored in the shift registers within one system

clock cycle (without stopping the system clock). Instead of shifting the data with a debug clock supplied

from off-chip, one can use the on-FPGA PLL to synthesize the debug clock from the system clock, with

the debug clock being n times faster than the system clock, where n is the width of the shift registers.

The advantage of this implementation is that the design does not need to be halted after each cycle in

order to empty the shift registers. However, this approach is only feasible if the system can be operated

at a low frequency.
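As a concrete (illustrative) instance of this constraint: the PLL must produce f_debug = n × f_sys, so with n = 4 and debug-path timing that closes at, say, 400 MHz, the system clock cannot exceed 400/4 = 100 MHz.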

4.4 Experimental Study

This Section presents the area overhead and timing impact of the proposed structures. The structures

are integrated into benchmarks selected from the OpenCores [29] and CHStone [17] benchmark suites. The

CHStone benchmarks are synthesized from the C language to Verilog RTL using a high-level synthesis

tool [11]. All RTL benchmarks are then compiled using Altera’s Quartus II 11.0, targeting the 65 nm

Stratix III FPGA (EP3SL70), with a maximum frequency constraint of 1 GHz. Table 4.1 summarizes

the ALM and register utilization of each original benchmark (i.e., without any debugging structures

integrated). The table also shows the post-routing maximum frequency (Fmax) of the benchmarks.


Table 4.1: Benchmarks.

Ckt.        # ALM   # Reg   Fmax (MHz)
ethernet    1323    1256    321.85
mem ctrl    1024    1051    266.95
tmu         2336    3425    168.63
rsdecoder   658     539     730.46
main        24483   20046   37.47
dfsin       13946   16367   118.29
aes         8224    9090    129.10
adpcm       11330   9852    101.58

In our experiment setup, registers in each module of each benchmark are randomly selected as

tracing candidates. Benchmarks are modified such that traced signals are wired to the top-level of the

benchmark and connected to the proposed structures. Altera’s synthesis attributes, keep and noprune,

are used to ensure that all signals survive optimization. In the following discussion, the notation m-w represents the tracing setting where m signals are candidates for tracing and w signals are traced

concurrently in one silicon execution.
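For instance, a traced register and its tapped net can be protected from optimization roughly as follows (signal names are ours; the attributes are Altera’s, as noted above):

// Prevent the register from being swept away (noprune) and the tapped
// net from being merged or removed (keep); names are illustrative.
reg  [7:0] state_q /* synthesis noprune */;
wire [7:0] tap_n   /* synthesis keep */;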

Experimental results of the structures described in Section 4.3 are presented in the next subsection,

followed by an analysis of the productivity and the stability of the proposed design process.

4.4.1 Area Usage and Timing Analysis

The area overhead and Fmax of the proposed architectures with various sizes are depicted in Figure 4.5.

Four implementations are investigated: a traditional MUX implementation, a 6-LUT-based MUX im-

plementation (as proposed in Section 4.3.1), a 4-LUT-based MUX implementation (same as proposed in Section 4.3.1, except using 4-LUTs instead of 6-LUTs), and a shift-register-based implementation

(as proposed in Section 4.3.2). As shown in Fig. 4.5(a), the 6-input LUT implementation uses, on av-

erage, 35% fewer ALMs than the traditional MUX implementation. The 4-input LUT implementation

can further reduce the usage of ALMs. This is because each ALM in a Stratix III device can contain

two 4-input LUTs, and Quartus II may merge two 4-input LUTs into one ALM. However, there is no

user control to force such an optimization to happen, and therefore, in the remaining experiments, all

multiplexers in the proposed structures are implemented with the 6-input LUT approach. In this exper-

iment, the shift-register implementation only uses one output pin and is driven with an external debug

clock. Due to the shift register, the area cost is slightly greater than the cost with the full multiplexer

implementation.


[Figure 4.5: Area and Fmax of multiplexers for configurations 128-2 through 256-8, comparing the Traditional MUX, 6LUT MUX, 4LUT MUX and Shift Register implementations. (a) Area (# ALMs). (b) Fmax (MHz).]

Fig. 4.5(b) shows the Fmax of each MUX implemented in isolation. Since the area-optimized im-

plementation requires fewer ALMs to construct a multiplexer, less parasitic capacitance is introduced

on the critical path. Consequently, multiplexers with the 4-input LUT implementation have the highest

frequency in most cases.

Table 4.2(a) reports the percentage increase in ALMs and registers of benchmarks when the area-

optimized multiplexer is integrated. Two groups of tracing settings are considered. The worst-case in

each group is shown in bold. Results show that in most cases the area overhead is less than 10%. The

area overhead is contributed not only by the additional structure, but also because we wire signals from

sub-modules up to the top-level module. The maximum frequency Fmax of the benchmarks with the


same tracing settings is reported in Table 4.2(b). Overall, Fmax is not affected greatly – changes are mainly due to algorithmic noise. The only exception is rsdecoder, the reason being that the critical path for this benchmark is altered to pass through the multiplexer.

Table 4.2: Effects of area-optimized multiplexer.

(a) Area increase percentage (ALMs + registers) (%)

Ckt.        128-2   128-4   128-8   256-2   256-4   256-8
ethernet    6.91    7.10    7.34    10.87   11.26   11.53
mem ctrl    6.12    6.48    6.90    10.57   10.66   11.69
tmu         5.95    6.02    6.11    10.95   10.99   11.12
rsdecoder   11.36   10.52   9.86    13.03   19.05   17.79
main        0.27    0.29    0.65    0.77    0.75    1.15
dfsin       0.46    0.52    0.39    1.08    1.06    1.05
aes         1.14    1.31    1.67    2.54    2.48    2.94
adpcm       1.61    1.52    1.66    1.76    1.59    1.64

(b) Fmax change percentage (%)

Ckt.        128-2    128-4    128-8    256-2    256-4    256-8
ethernet    -0.28    -0.02    -0.07    -0.31    -0.06    -0.11
mem ctrl    -3.2     -10.1    -5.2     -8.19    -12.15   -7.23
tmu         1.99     2.12     2.2      1.06     0.98     0.92
rsdecoder   -35.06   -32      -17.51   -33.99   -29.87   -28.51
main        -1.81    -1.36    -4.06    -0.43    2.86     2.16
dfsin       3.53     -1.5     3.29     1.57     -0.06    -3.51
aes         -0.74    -0.33    0.77     0.17     -0.6     -1.14
adpcm       3.62     2.07     -0.27    1.79     -0.5     -0.36

Similar to Table 4.2(a) and Table 4.2(b), the effect of the shift-register-based structure on the area

and Fmax of benchmarks is summarized in Table 4.3(a) and Table 4.3(b), respectively. Here, instead

of using an external debug clock, a faster debug clock is generated from the system clock using the

Stratix III PLL. The faster clock allows us to shift out the content of the shift-register within one system

clock cycle. As expected, because of the additional shift registers, the overall area overhead can be a

bit higher than the area overhead of the full multiplexer discussed previously. Furthermore, Fmax drops

significantly in all cases – the system clock speed is limited by the debug clock speed. For three of the

eight benchmarks, Fmax drops more than 50%.


Table 4.3: Effects of area-optimized multiplexers with shift registers.

(a) Area increase percentage (ALMs + registers) (%)

Ckt.        128-2   128-4   128-8   256-2   256-4   256-8
ethernet    6.81    7.43    7.16    9.71    10.53   11.42
mem ctrl    6.72    6.57    6.92    10.56   11.11   11.93
tmu         6.56    6.20    6.70    9.81    10.29   10.76
rsdecoder   12.36   13.37   12.53   19.21   20.21   19.80
main        0.21    0.25    0.25    0.60    0.64    0.63
dfsin       0.29    0.27    0.39    0.73    0.76    0.85
aes         1.09    1.02    1.13    2.14    2.24    2.17
adpcm       1.22    1.20    1.49    1.70    1.77    1.88

(b) Fmax change percentage (%)

Ckt.        128-2    128-4    128-8    256-2    256-4    256-8
ethernet    -42.76   -44.2    -44.89   -51.1    -54.26   -53.32
mem ctrl    -29.25   -28.05   -28.67   -39.87   -35.37   -31.33
tmu         -5.24    -6.08    -6.03    -0.43    -9.68    -7.14
rsdecoder   -75.55   -71.96   -69.02   -75      -76.25   -76.41
main        0.77     0.61     -0.72    -2.86    -1.49    0.43
dfsin       -10.74   -9.38    -8.57    -10.24   -7.75    -6.02
aes         -4.93    -2.98    -1.9     -11.11   -10.22   -11.96
adpcm       -3.63    1.1      1.55     4.1      3.36     2.63

4.4.2 Productivity and Stability

In the last set of experiments, we evaluate the productivity and stability of the conventional design process against the proposed one. Altera’s SignalTap II is used as the embedded logic analyzer. As mentioned in Section 4.3, due

to the size of SignalTap, acquiring trace data for a large number of signals is often achieved by successively tracing multiple smaller groups. Recompilation is required when a different group of signals is selected.

Table 4.4: Compilation time of SignalTap.

                     128-8                              256-8
Ckt.        Prop.    SignalTap (sec)           Prop.    SignalTap (sec)
            (sec)    First    Incr.    Total   (sec)    First    Incr.    Total
ethernet    139      134      117      2006    141      134      119      3946
mem ctrl    150      143      124      2129    156      143      123      4073
tmu         169      161      137      2354    179      161      140      4639
rsdecoder   106      103      99       1685    109      103      98       3233
main        1449     1448     293      6141    1453     1448     290      10737
dfsin       706      696      216      4150    711      696      217      7648
aes         465      453      186      3428    466      453      184      5901
adpcm       634      615      226      4234    639      615      225      7815

[Figure 4.6: Stability of SignalTap. (a) Normalized Fmax (relative to the original benchmark) versus the number of traced nodes on the critical path, for ethernet, mem ctrl, tmu, main, dfsin, aes and adpcm. (b) Normalized Fmax per debug session for rsdecoder when tracing random nodes.]

The experiment is carried out as follows. Two tracing settings are studied: 128-8 and 256-8. In

order to use the incremental compilation feature in Quartus II, only post-fitting signals are considered.

First, the design is compiled without the SignalTap module. 128 (256) post-fitting nodes are randomly selected after the first compilation. Next, eight signals from the set are monitored. The procedure is repeated until all 128 (256) signals are traced.

The compilation time results are summarized in Table 4.4. The first column lists the benchmarks.

The next four columns report the results for the first tracing setting: the compilation time of the proposed

process, the first compilation time of the SignalTap process, the average compilation time of each data acquisition session, and the total cumulative compilation time of the SignalTap-based debugging process.

The results for the second tracing setting are reported in the final four columns. As shown in the table, since the proposed bitstream-modifications-only process requires just one compilation, its compilation time roughly equals the time of the first compilation in the SignalTap process. Although incremental compilation reduces the compilation time by 4%-80%, each additional compilation adds time overhead. Overall, the proposed process can save up to 93% of the cumulative compilation time (i.e., 139/2006 for ethernet) in the 128-8 scenario, and 97% (i.e., 109/3233 for rsdecoder) in the 256-8 scenario.

Incremental compilation tries to preserve the engineering effort from a previous compilation to

minimize the impact on design performance. While it does well in many cases, experiments show that

Fmax can still vary when the monitored signals are on the critical path. The result is plotted in Fig. 4.6(a).

In each case, a total of 32 signals are traced. The x-axis of the plot is the number of traced signals that

are on the critical path. The y-axis is the normalized Fmax, where the base is the Fmax of the original

benchmark. One can see that Fmax drops to varying degrees, by as much as 10%, depending on which signals are monitored. However, the reader should note that cases where the majority of the traced signals reside on the critical path correspond to worst-case scenarios. Still, even in cases where only a small number of traced signals are on the critical path, one can observe unpredictable behavior in the maximum frequency.

For designs that can be operated at a very high frequency, the SignalTap module can in fact be

where the critical path resides. In this case, monitoring any set of signals can change Fmax, as shown

in Fig. 4.6(b). The x-axis of the plot is the data acquisition session, where 8 signals are traced in each

session with 32 sessions in total. The plot shows that Fmax is unstable from one session to another.

4.5 Summary

Functional debugging using FPGA devices provides several advantages over the traditional software

simulation approach. This Chapter presents a set of hardware structures that take advantage of the FPGA reconfigurability feature to enhance observability for debugging. Furthermore, experimental results demonstrate that the new techniques can improve the productivity of the debugging process by up

to 30×.


Chapter 5

Conclusions and Future Work

5.1 Summary of Contributions

Debugging today remains a complicated and resource-intensive process, with its multiple aspects, re-

quirements and limitations spanning various hardware design practices and design flows. New prob-

lems related to debugging are constantly introduced in modern design flows, requiring the development

of novel debugging methodologies to keep up with this growing complexity and heterogeneity of the

debugging task. Two of those challenging debugging problems that constantly grow in importance and

difficulty are those of triage in RTL design debug and FPGA functional debug.

The purpose of this thesis is to present practical techniques to address these problems. The first

contribution is a novel automated triage framework for RTL design debug, which is developed in an

effort to offer a viable alternative to traditional script-based or manual triage. The second contribution

introduces new hardware and software techniques that leverage the re-configurability property of FPGAs

in order to increase productivity for FPGA functional debug.

• In Chapter 3, a novel automated triage framework is presented, which groups together related

counter-examples that are generated by regression verification flows. The framework is based on

newly introduced metrics that define relations between counter-examples and make predictions on

the number of co-existing RTL errors in the failing design. The proposed framework formulates

triage as a clustering problem and generates partitions of the counter-example set by employing

hierarchical clustering techniques.


• In Chapter 4, novel hardware and software techniques are introduced that accelerate FPGA func-

tional debug by allowing the tracing of internal design signals during silicon execution without

the need for time-intensive re-synthesis iterations. The proposed method requires a sole execution

of the synthesis flow to trace a large number of signals for an arbitrary number of cycles using a

limited number of output pins.

5.2 Future Work

The following provides a summary of extensions and future directions relating to the contributions of

Chapters 3 and 4.

• There are several areas for extensions and future directions with respect to the contributions of

Chapter 3. To the best of our knowledge, the proposed framework is the first published work

on automated counter-example triage for RTL design debug. As a result, there is a lot of fertile

ground for extensions and improvements for this approach to eventually reach maturity. One of

the limitations of the proposed work in Chapter 3 is that it only performs well when the human-introduced errors responsible for the counter-example set are relatively coarse and affect the circuit more broadly than gate-level errors or stuck-at faults do. This implies that when the actual error is, for example, a wrong bit inversion or a wrong gate, it will most likely not be identified

as an important suspect. In the same context, errors introduced by CAD tools are not modeled

and their typical behavior is not investigated. Apart from the above, the way counter-example

proximity is defined only allows hierarchical clustering to be applied. One idea is to formulate

triage using more flexible models that will allow the application of various clustering methods,

such as K-means, K-medoids, and Gaussian Mixture Models [10].

• One of the extensions to the work presented in Chapter 4 can be the integration of debug features,

such as trigger events, into the proposed structures to enhance the debugging ability. Another

interesting extension is developing a debugging algorithm that utilizes the proposed structures

and provides an efficient and effective FPGA debugging environment.


Bibliography

[1] M. Abramovici, M. Breuer, and A. Friedman, Digital Systems Testing and Testable Design. Computer Science Press, 1990.

[2] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, “A reconfigurable design-for-debug infrastructure for SoCs,” in Design Automation Conference, 2006, pp. 7–12.

[3] Increasing Productivity With Quartus II Incremental Compilation, Altera Corp., San Jose, CA, 2008.

[4] Design Debugging Using the SignalTap II Logic Analyzer, Altera Corp., San Jose, CA, 2011.

[5] J. Baumgartner, H. Mony, V. Paruthi, R. Kanzelman, and G. Janssen, “Scalable sequential equivalence checking across arbitrary design transformations,” in International Conference on Computer Design, 2006.

[6] J. Bergeron, Writing Testbenches: Functional Verification of HDL Models, Second Edition. Kluwer Academic Publishers, 2003.

[7] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs: Area-efficiency vs. input sharing and size,” in IEEE Custom Integrated Circuits Conf., Santa Clara, CA, 1997, pp. 551–554.

[8] ——, “FPGA routing architecture: Segmentation and buffering to optimize speed and density,” in ACM/SIGDA Int’l Symposium on FPGAs, Monterey, CA, 1999, pp. 140–149.

[9] A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y. Zhu, “Bounded model checking,” Advances in Computers, vol. 58, pp. 118–149, 2003.

[10] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.

[11] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski, “LegUp: high-level synthesis for FPGA-based processor/accelerator systems,” in International Symposium on Field-Programmable Gate Arrays, 2011, pp. 33–36.

[12] F. M. De Paula, M. Gort, A. J. Hu, S. Wilton, and J. Yang, “Backspace: Formal analysis for post-silicon debug,” in International Conference on Formal Methods in CAD, 2008, pp. 1–10.

[13] H. Foster, A. Krolnik, and D. Lacey, Assertion-Based Design. Kluwer Academic Publishers, 2003.

[14] E. Goldberg, M. Prasad, and R. Brayton, “Using SAT for combinational equivalence checking,” in Design, Automation and Test in Europe, 2001, pp. 114–121.

[15] M. Gort and J. Anderson, “Deterministic multi-core parallel routing for FPGAs,” in International Conference on Field Programmable Logic and Applications, 2010, pp. 78–86.

[16] P. Graham, B. Nelson, and B. Hutchings, “Instrumenting bitstreams for debugging FPGA circuits,” in International Symposium on Field-Programmable Custom Computing Machines, 2001, pp. 41–50.

[17] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis,” Journal of Information Processing, vol. 17, pp. 242–254, 2009.

[18] H. Foster, “From volume to velocity: The transforming landscape in functional verification,” in Design Verification Conference, 2011.

[19] S. Huang and K. Cheng, Formal Equivalence Checking and Design Debugging. Kluwer Academic Publishers, 1998.

[20] E. Hung and S. Wilton, “Speculative debug insertion for FPGAs,” in International Conference on Field Programmable Logic and Applications, 2011, pp. 524–531.

[21] E. Hung and S. J. Wilton, “Incremental trace-buffer insertion for FPGA debug,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. PP, no. 99, pp. 1–1, 2013.

[22] B. Keng and A. Veneris, “Path directed abstraction and refinement in SAT-based design debugging,” in Design Automation Conference, 2012.

[23] H. F. Ko and N. Nicolici, “Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug,” IEEE Transactions on CAD, vol. 28, no. 2, pp. 285–297, 2009.

[24] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Transactions on CAD, vol. 26, no. 2, pp. 203–215, 2007.

[25] L. Lagadec and D. Picard, “Software-like debugging methodology for reconfigurable platforms,” in IEEE International Symposium on Parallel and Distributed Processing, 2009, pp. 1–4.

[26] H. Mangassarian, L. Bao, A. Goultiaeva, A. Veneris, and F. Bacchus, “Leveraging dominators for preprocessing QBF,” in Design, Automation and Test in Europe Conference and Exhibition, 2010, pp. 1695–1700.

[27] H. Mangassarian, A. Veneris, S. Safarpour, M. Benedetti, and D. Smith, “A performance-driven QBF-based iterative logic array representation with applications to verification, debug and test,” in International Conference on Computer Aided Design, 2007.

[28] K. McMillan, “Interpolation and SAT-based model checking,” in Computer Aided Verification, 2003.

[29] OpenCores.org, “http://www.opencores.org,” 2007.

[30] B. Quinton and S. Wilton, “Programmable logic core based post-silicon debug for SoCs,” in IEEE International Silicon Debug and Diagnosis Workshop, 2007.

[31] R. K. Ranjan, C. C., and S. S., “Beyond verification: Leveraging formal for debugging,” in Design Automation Conference, 2009, pp. 648–651.

[32] P. Rashinkar, P. Paterson, and L. Singh, System-on-a-chip Verification: Methodology and Techniques. Kluwer Academic Publishers, 2000.

[33] S. Safarpour, A. Veneris, and F. Najm, “Managing verification error traces with bounded model debugging,” in ASP Design Automation Conference, 2010.

[34] O. Sarbishei, M. Tabandeh, B. Alizadeh, and M. Fujita, “A formal approach for debugging arithmetic circuits,” IEEE Transactions on CAD, vol. 28, no. 5, pp. 742–754, May 2009.

[35] A. Smith, A. Veneris, M. F. Ali, and A. Viglas, “Fault diagnosis and logic debugging using Boolean satisfiability,” IEEE Transactions on CAD, vol. 24, no. 10, pp. 1606–1621, 2005.

[36] S. Safarpour, B. Keng, Y. S. Yang, and E. Qin, “Failure triage: The neglected debugging problem,” in Design and Verification Conference, 2012.

[37] S. Safarpour, M. Liffiton, H. Mangassarian, A. Veneris, and K. A. Sakallah, “Improved design debugging using maximum satisfiability,” in International Conference on Formal Methods in CAD, 2007.

[38] A. Suelflow, G. Fey, R. Bloem, and R. Drechsler, “Using unsatisfiable cores to debug multiple design errors,” in Great Lakes Symposium on VLSI, 2008.

[39] J. Swartz, V. Betz, and J. Rose, “A fast routability-driven router for FPGAs,” in ACM/SIGDA Int’l Symposium on FPGAs, Monterey, CA, 1998, pp. 140–149.

[40] G. J. Szekely and M. L. Rizzo, “Hierarchical clustering via joint between-within distances: Extending Ward’s minimum variance method,” Journal of Classification, vol. 22, no. 2, pp. 151–183, 2005.

[41] A. Veneris and I. N. Hajj, “Design error diagnosis and correction via test vector simulation,” IEEE Transactions on CAD, vol. 18, no. 12, pp. 1803–1816, 1999.

[42] ChipScope ILA Tools Tutorial, Xilinx Inc., San Jose, CA, 2003.

[43] Y.-S. Yang, N. Nicolici, and A. Veneris, “Automating data analysis and acquisition setup in a silicon debug environment,” IEEE Transactions on VLSI, 2011.