Automatic Detection and Repair of Errors in Data Structures Brian Demsky Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology


Automatic Detection and Repair of Errors in Data Structures

Brian Demsky, Martin Rinard

Laboratory for Computer Science, Massachusetts Institute of Technology

Motivation

[Figure: Broken Data Structure, drawn as nodes with fields F, G, I, and J.]

Errors
• Missing elements
• Inappropriate sharing
• Dangling references
• Out of bounds array indices
• Inconsistent values

Goal

[Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure. Nodes carry fields F, G, I, and J.]

Goal

[Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure, with Consistency Properties From Developer supplied to the Repair Algorithm.]

What Does Repair Algorithm Produce?

• Data structure that
  • Satisfies consistency properties, and
  • Heuristically close to broken data structure
• Not necessarily the same data structure as (hypothetical) correct program would produce
• But enough to keep program going

Precursors

• Data structure repair has historically appeared in systems with extreme reliability goals
  • 5ESS switch – hand coded audit routines
  • IBM MVS operating system – hand coded failure recovery routines
• Key component of these systems

Where Is This Likely To Be Useful?

• Not for transient errors in systems with slack – you can just reboot
  • Must be willing to lose volatile state
  • Must be willing to wait for system to come back up
• Permanent data structures
  • File systems
  • Application files (Word, PowerPoint, …)
• Autonomous systems
• Critical systems

Architecture

[Figure: Broken Bits → Broken Abstract Model → Repaired Abstract Model → Repaired Bits. The stages are labeled Model Definition & Translation, Internal Consistency Properties, and External Consistency Properties.]

Architecture Rationale

Why go through abstract model?

• Simple, uniform structure
  • Sets of objects
  • Relations between objects
• Simplifies both
  • Expression of consistency properties
  • Repair algorithm
• Enables system to support full range of efficient, heavily encoded data structures

File System Example

[Figure: Directory Entries ("abst", "intro") and Disk Blocks; values shown: 0 2 1 and -5 1 -1.]

struct Entry {
  byte name[Length];
  int firstBlock;
}

struct Block {
  int nextBlock;
  byte data[BlockSize];
}

struct Disk {
  Entry dir[NumEntries];
  Block block[NumBlocks];
}

Disk D;

Model Definition

• Sets of objects
    set blocks of integer : partition used | free;
• Relations between objects – values of object fields, referencing relationships between objects
    relation next : used, used;

[Figure: the set blocks partitioned into used and free, with the relation next on used.]

Model Translation

Bits translated to sets and relations in abstract model using statements of the form:

Quantifiers, Condition ⇒ Inclusion Constraint

for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used

for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ <b, D.block[b].nextBlock> in next

for <b,n> in next, true ⇒ n in used

for b in 0..NumBlocks, not (b in used) ⇒ b in free
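The four translation rules can be sketched in Python as a small fixed-point computation. This is only an illustration, not the authors' implementation: `build_model`, `FIRST_BLOCK`, and `NEXT_BLOCK` are hypothetical names, and the disk contents are one reading of the example figure, so treat them as an assumption.

```python
def build_model(first_block, next_block):
    """Apply the four translation rules; return (used, free, next_rel)."""
    num_blocks = len(next_block)
    used, next_rel = set(), set()
    # Rule 1: every in-range firstBlock of a directory entry is a used block.
    for fb in first_block:
        if 0 <= fb < num_blocks:
            used.add(fb)
    # Rules 2 and 3 feed each other (next adds to used, used adds to next),
    # so repeat them until nothing changes.
    changed = True
    while changed:
        changed = False
        for b in list(used):
            n = next_block[b]
            if 0 <= n < num_blocks:
                if (b, n) not in next_rel:
                    next_rel.add((b, n))
                    changed = True
                if n not in used:
                    used.add(n)
                    changed = True
    # Rule 4: every block that is not used is free.
    free = set(range(num_blocks)) - used
    return used, free, next_rel


# Assumed disk contents (hypothetical reading of the example figure).
FIRST_BLOCK = [0, 2]          # D.dir[i].firstBlock
NEXT_BLOCK = [1, -5, 1, -1]   # D.block[b].nextBlock
used, free, next_rel = build_model(FIRST_BLOCK, NEXT_BLOCK)
print(sorted(used), sorted(free), sorted(next_rel))
```

Out-of-range field values such as -5 never enter the model, which is how the translation keeps corrupted indices out of the sets and relations.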

Model in Example

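[Figure: abstract model of the example. blocks = {0, 1, 2, 3}; used = {0, 1, 2}; free = {3}; next contains <0,1> and <2,1>. Shown beside the file system figure (Directory Entries "abst", "intro"; Disk Blocks; values 0 2 1 and -5 1 -1).]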

Internal Consistency Properties

Quantifiers, Body

• Body is first-order property of basic propositions
  • Inequality constraints on values of numeric fields
    • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Presence of required number of objects
    • size(S) = C, size(S) ≥ C, size(S) ≤ C
  • Topology of region surrounding each object
    • size(V.R) = C, size(V.R) ≥ C, size(V.R) ≤ C
    • size(R.V) = C, size(R.V) ≥ C, size(R.V) ≤ C
  • Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R
• Example: for b in used, size(next.b) ≤ 1

Internal Consistency Violations

Evaluate consistency properties, find violations

for b in used, size(next.b) ≤ 1 is false for b = 1

[Figure: the model from the example; block 1 has two incoming next edges, from blocks 0 and 2.]
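Evaluating the example property amounts to computing the inverse image next.b for each used block and comparing its size to the bound. A minimal sketch, assuming the model is held in plain Python sets (`find_violations` is a hypothetical helper, not part of the described system):

```python
def find_violations(used, next_rel, limit=1):
    """Return the bindings of b for which size(next.b) <= limit is false."""
    violations = []
    for b in sorted(used):
        # next.b: the set of blocks whose next pointer refers to b.
        predecessors = {a for (a, n) in next_rel if n == b}
        if len(predecessors) > limit:
            violations.append(b)
    return violations


# Model values taken from the example: two next edges lead into block 1.
violations = find_violations({0, 1, 2}, {(0, 1), (2, 1)})
print(violations)
```

Each returned value is a binding for the quantified variable, which is what the repair step starts from.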

Repairing Violations of Internal Consistency Properties

• Violation provides binding for quantified variables

• Convert Body to disjunctive normal form
    (p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm)
  p1 … pn, q1 … qm are basic propositions
• Choose a conjunction to satisfy
• Repair violated basic propositions in conjunction
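The steps above can be sketched as follows. The slides do not say how the conjunction is chosen, so the rule used here (pick the conjunction with the fewest violated propositions) is an assumption, and the representation of a proposition as a `(holds, repair)` pair of functions is hypothetical.

```python
def repair_body(dnf, state):
    """dnf is a list of conjunctions; a conjunction is a list of
    (holds, repair) pairs, both functions of the mutable state."""
    def violated(conjunction):
        return [prop for prop in conjunction if not prop[0](state)]

    # Assumed heuristic: satisfy the conjunction that needs the fewest repairs.
    chosen = min(dnf, key=lambda conjunction: len(violated(conjunction)))
    for _holds, repair in violated(chosen):
        repair(state)
    return chosen


# Body: (x <= 5 and y = 0) or (x = 10 and y = 1), evaluated on a broken state.
state = {"x": 7, "y": 0}
body = [
    [(lambda s: s["x"] <= 5, lambda s: s.update(x=5)),
     (lambda s: s["y"] == 0, lambda s: s.update(y=0))],
    [(lambda s: s["x"] == 10, lambda s: s.update(x=10)),
     (lambda s: s["y"] == 1, lambda s: s.update(y=1))],
]
repair_body(body, state)
print(state)
```

The sketch ignores the possibility that one repair invalidates another proposition in the same conjunction; the slides handle such interactions separately, through the restriction on repair dependences.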

Repairing Violations of Basic Propositions

• Inequality constraints on values of numeric fields
  • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Compute value of expression, assign field
• Presence of required number of objects
  • size(S) = C, size(S) ≥ C, size(S) ≤ C
  • Remove or insert objects from/to set
• Topology of region surrounding each object
  • size(V.R) = C, size(V.R) ≥ C, size(V.R) ≤ C
  • size(R.V) = C, size(R.V) ≥ C, size(R.V) ≤ C
  • Remove or insert pairs from/to relation
• Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R
  • Add object or pair to set or relation
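As an illustration of the second kind of repair, a size constraint on a set can be restored by inserting or removing objects. This is a sketch under stated assumptions: `repair_set_size` and the `candidates` pool of objects available for insertion are hypothetical, and the choice of which object to remove is arbitrary.

```python
def repair_set_size(members, low, high, candidates):
    """Return a copy of members with low <= size <= high."""
    repaired = set(members)
    # Too few objects: insert from the pool of candidates.
    for obj in candidates:
        if len(repaired) >= low:
            break
        repaired.add(obj)
    if len(repaired) < low:
        # Running out of objects is a resource limitation, not a logic error.
        raise RuntimeError("not enough candidate objects to repair the set")
    # Too many objects: remove (here, arbitrarily, the largest first).
    while len(repaired) > high:
        repaired.remove(max(repaired))
    return repaired


grown = repair_set_size({1}, 3, 5, [1, 2, 3, 4])
shrunk = repair_set_size({1, 2, 3, 4}, 0, 2, [])
print(sorted(grown), sorted(shrunk))
```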

Repair in Example

for b in used, size(next.b) ≤ 1 is false for b = 1

Must repair size(next.1) ≤ 1

Can remove either <0,1> or <2,1> from next

[Figure: model before the repair, with both next edges into block 1.]

Repair in Example

[Figure: model after the repair, with only one next edge into block 1 remaining.]
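The repair in this example removes pairs from the relation until the bound holds. A minimal sketch follows; the slides allow either pair to be removed, so keeping the lowest-numbered predecessor is an arbitrary choice made here, and `repair_in_degree` is a hypothetical name.

```python
def repair_in_degree(next_rel, b, limit=1):
    """Remove pairs <a, b> from next until size(next.b) <= limit."""
    predecessors = sorted(a for (a, n) in next_rel if n == b)
    repaired = set(next_rel)
    # Keep the first `limit` predecessors, drop the rest (arbitrary choice).
    for a in predecessors[limit:]:
        repaired.discard((a, b))
    return repaired


repaired_next = repair_in_degree({(0, 1), (2, 1)}, 1)
print(sorted(repaired_next))
```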

Acyclic Repair Dependences

• Questions
  • Isn’t it possible for the repair of one constraint to invalidate another constraint?
  • What about infinite repair loops?
  • What about unsatisfiable specifications?
• Answer
  • We require specifications to have no cyclic repair dependences between constraints
  • So all repair sequences terminate
  • Repair can fail only because of resource limitations
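Checking for cyclic repair dependences is a cycle check on a directed graph. The sketch below assumes a hypothetical representation in which each constraint name maps to the constraints its repair may invalidate; how those edges are derived from a specification is not covered by the slides and is not modeled here.

```python
def has_cycle(deps):
    """deps maps a constraint to the constraints its repair may invalidate."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GREY
        for d in deps.get(c, ()):
            state = color.get(d, WHITE)
            if state == GREY:          # back edge: d is on the current path
                return True
            if state == WHITE and visit(d):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(deps))


acyclic = {"c1": ["c2"], "c2": ["c3"], "c3": []}
cyclic = {"c1": ["c2"], "c2": ["c1"]}
print(has_cycle(acyclic), has_cycle(cyclic))
```

A specification whose graph passes this check cannot send the repair algorithm around a loop of constraints that keep invalidating each other.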

External Consistency Constraints

Quantifiers, Condition ⇒ Body

• Body of form V = E, V.F = E, V.F[I] = E
• Example
    for b in free, true ⇒ D.block[b].nextBlock = -2
    for <i,j> in next, true ⇒ D.block[i].nextBlock = j
    for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1
• Repair simply performs assignments
• Translates model repairs to bit repairs
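The three example constraints can be read as assignments that write the repaired model back into the nextBlock fields. A minimal sketch, assuming the model is held in Python sets and the disk is a list of nextBlock values; `apply_external` is a hypothetical name and the starting values are an assumed reading of the example.

```python
def apply_external(used, free, next_rel, next_block):
    """Return nextBlock values that agree with the repaired model."""
    bits = list(next_block)
    # for b in free, true => D.block[b].nextBlock = -2
    for b in free:
        bits[b] = -2
    # for <i,j> in next, true => D.block[i].nextBlock = j
    for (i, j) in next_rel:
        bits[i] = j
    # for b in used, size(b.next) = 0 => D.block[b].nextBlock = -1
    has_successor = {i for (i, _j) in next_rel}
    for b in used:
        if b not in has_successor:
            bits[b] = -1
    return bits


# Repaired model (the pair <2,1> was removed) and assumed broken bits.
repaired_bits = apply_external({0, 1, 2}, {3}, {(0, 1)}, [1, -5, 1, -1])
print(repaired_bits)
```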

Repair in Example

[Figure: Inconsistent File System (Directory Entries "abst", "intro"; Disk Blocks; values 0 2 1 and -5 1 -1) beside Repaired File System (values 0 2 1 and -1 -1 -2).]

What About Corrupted Pointers?

• Sets may contain pointers to structs
• System only allows valid structs in model
  • struct must be completely in valid memory
  • one struct may be nested inside another struct (but must agree on memory format)

[Figure: regions of Valid Memory and Invalid Memory, with examples of Valid Structs and an Invalid Struct.]
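The "completely in valid memory" rule can be illustrated with a simple range test. This is a simplified sketch: memory is modeled as a hypothetical list of `(start, end)` address ranges, a struct spanning two adjacent ranges is rejected, and the nesting rule is not modeled.

```python
def struct_is_valid(addr, size, valid_regions):
    """True if [addr, addr + size) lies entirely inside one valid region."""
    return any(start <= addr and addr + size <= end
               for (start, end) in valid_regions)


def admit_pointers(pointers, size, valid_regions):
    """Keep only the pointers that may enter the model."""
    return {p for p in pointers if struct_is_valid(p, size, valid_regions)}


VALID = [(96, 128)]                       # assumed valid address range
admitted = admit_pointers({100, 120, 4000}, 16, VALID)
print(sorted(admitted))
```

Here the struct at 120 starts in valid memory but runs past its end, so it is excluded just like the wild pointer 4000.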

When to Test for Consistency and Repair

• Persistent data structures
  • Repair can be independent activity, or
  • Repair when data written out or read in
• Volatile data structures in running program
  • Under programmer control
  • Transaction-based approach
    • Identify transaction start and end
    • Repair at start, end, or both
  • Failure-based approach
    • Wait until program fails
    • Repair and restart from latest safe point
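The transaction-based approach can be sketched with a context manager that marks the transaction boundaries and runs a check-and-repair routine at the start, the end, or both. The names `repaired_transaction` and `check_and_repair` are hypothetical, and sorting a list stands in for a real repair routine.

```python
from contextlib import contextmanager


@contextmanager
def repaired_transaction(data, check_and_repair, at_start=True, at_end=True):
    """Run check_and_repair(data) at the chosen transaction boundaries."""
    if at_start:
        check_and_repair(data)
    try:
        yield data
    finally:
        # Runs even if the transaction body raised an exception.
        if at_end:
            check_and_repair(data)


# Stand-in consistency property: the list must be sorted; repair = sort it.
ledger = [3, 1, 2]
with repaired_transaction(ledger, list.sort) as txn:
    seen_at_start = list(txn)   # already repaired at transaction start
    txn.append(0)               # the body breaks the property again
print(seen_at_start, ledger)
```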

Experience

• We acquired three benchmarks
  • Simplified Linux file system
  • Freeciv interactive game
  • Microsoft Word files
• We developed specifications for all three
  • Less than a week of development time
  • Most of time spent figuring out Freeciv
• Each benchmark has
  • Workload
  • Fault insertion methodology
• Ran benchmarks with and without repair

Simplified Linux File System

[Figure: disk blocks labeled superblock, group block, inode bitmap block, block bitmap block, inode block (inode, inode, …), and directory block; sample contents shown: intro, 110, 0, 1011.]

Some Consistency Properties
• inode bitmap consistent with inode usage
• block bitmap consistent with block usage
• directory entries refer to valid inodes
• files contain valid blocks only
• files do not share blocks
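Two of these properties, "block bitmap consistent with block usage" and "files do not share blocks", can be illustrated by recomputing the bitmap from the files' block lists. This is a sketch over a hypothetical in-memory representation (a dict from file name to block numbers), not the on-disk layout of the benchmark.

```python
def recompute_block_bitmap(files, num_blocks):
    """Return (bitmap, shared): the bitmap implied by actual block usage
    and the set of blocks claimed by more than one file."""
    owner, shared = {}, set()
    for name in sorted(files):
        for b in files[name]:
            if not 0 <= b < num_blocks:
                continue                 # invalid block index, skipped here
            if b in owner and owner[b] != name:
                shared.add(b)
            owner.setdefault(b, name)
    bitmap = [b in owner for b in range(num_blocks)]
    return bitmap, shared


bitmap, shared = recompute_block_bitmap({"a": [0, 1], "b": [1, 3]}, 4)
print(bitmap, sorted(shared))
```

Comparing the recomputed bitmap with the stored one exposes bitmap errors, and a non-empty `shared` set exposes illegal sharing.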

Results

• Workload – write and verify several files
• Fault insertion – crash file system
  • Inode and block bitmap errors
  • Partially initialized directory and inode entries
• Without repair
  • Incorrect file contents because of inode and disk block sharing
• With repair
  • Bitmaps repaired preventing illegal sharing, correct file contents

Freeciv

[Figure: Terrain Grid with rows PO MM, OO MP, PO MM, PP MP (O = Ocean, P = Plain, M = Mountain) and City Structures with loc: 3,0 and loc: 2,3.]

Consistency Properties
• Tiles have valid terrain values
• Cities are not in the ocean
• Each city has exactly one reference from city location grid
• City locations are consistent in
  • City structures and
  • tile grid
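The first two properties can be checked with a short scan over the grid and the city list. The sketch below uses a made-up grid with one corrupted tile and assumes city locations are (x, y) pairs indexing column then row; neither the grid nor that convention is taken from Freeciv itself.

```python
VALID_TERRAIN = {"O", "P", "M"}   # Ocean, Plain, Mountain


def check_map(grid, city_locs):
    """Return a list of (problem, (x, y)) pairs."""
    problems = []
    for y, row in enumerate(grid):
        for x, terrain in enumerate(row):
            if terrain not in VALID_TERRAIN:
                problems.append(("invalid terrain", (x, y)))
    for (x, y) in city_locs:
        inside = 0 <= y < len(grid) and 0 <= x < len(grid[y])
        if not inside:
            problems.append(("city off the map", (x, y)))
        elif grid[y][x] == "O":
            problems.append(("city in ocean", (x, y)))
    return problems


GRID = ["POMM", "OOMP", "PO?M", "PPMP"]   # hypothetical grid, "?" is corrupt
CITIES = [(3, 0), (1, 1)]                 # hypothetical city locations
problems = check_map(GRID, CITIES)
print(problems)
```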

Results

• Workload – Freeciv software plays against itself
• Fault insertion – randomly corrupt terrain values
• Without repair – program fails (seg fault)
• With repair
  • Game runs just fine
  • But game plays out differently because of the different terrain values

Microsoft Word Files

• Files consist of a sequence of streams
• Streams stored using FAT-based data structure

[Figure: file layout with Header, FAT block, and Disk Blocks; FAT values shown: -1 -1 -2 1.]

Consistency Properties

• The FAT blocks exist
• FAT contains valid values only
  • -1 – terminates FAT streams
  • -2 – indicates free blocks
  • Valid disk block index – next block in stream
• FAT streams properly terminated
• Free blocks properly marked
• Streams contain valid blocks only
• Streams do not share blocks
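Most of these properties can be checked by walking each stream through the FAT. The sketch below uses only the encoding stated on the slide (-1 terminates a stream, -2 marks a free block); `check_fat` and the list of stream start blocks are hypothetical, and the real file format has details that are not modeled here.

```python
def check_fat(fat, stream_starts):
    """Return a list of error strings; an empty list means consistent."""
    errors, n, owner = [], len(fat), {}
    for i, v in enumerate(fat):
        if v not in (-1, -2) and not 0 <= v < n:
            errors.append(f"block {i}: invalid FAT value {v}")
    for s, start in enumerate(stream_starts):
        b, seen = start, set()
        while b != -1:                       # -1 terminates the stream
            if not 0 <= b < n:
                errors.append(f"stream {s}: invalid block {b}")
                break
            if b in seen:
                errors.append(f"stream {s}: cycle at block {b}")
                break
            if fat[b] == -2:
                errors.append(f"stream {s}: uses free block {b}")
                break
            if b in owner:
                errors.append(f"stream {s}: block {b} shared with stream {owner[b]}")
                break
            owner[b] = s
            seen.add(b)
            b = fat[b]
    for i in range(n):                       # unowned blocks must be free
        if i not in owner and fat[i] != -2:
            errors.append(f"block {i}: unused but not marked free")
    return errors


ok_errors = check_fat([1, -1, -2, -2], [0])
bad_errors = check_fat([1, -1, 1, -2], [0, 2])
print(ok_errors, bad_errors)
```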

Results

• Workload – several Microsoft Word files
• Fault insertion – scramble FAT
• Without repair
  • If blocks containing the FAT were incorrectly marked as free, Word successfully loads file
  • Otherwise, “The document name or path is not valid”
• With repair
  • Word loads all files

Related Work

• Hand-coded repair
  • Lucent 5ESS switch
  • IBM MVS operating system
• Transactions
  • Identify actions that leave system consistent
  • If action fails, roll back to consistent state
• Checkpoint and recovery
  • Reboot system from scratch
  • Logging for roll-forward
• Self-stabilizing algorithms

Conclusion

• Data structure repair interesting way to (potentially) improve reliability

• Specification-based approach promises to make technique more widely applicable

• Moving towards more robust, probabilistic, continuous concept of system behavior