CHESS : Systematic Testing of Concurrent Programs
Madan Musuvathi, Shaz Qadeer
Microsoft Research
Testing multithreaded programs is HARD
• Specific thread interleavings expose subtle errors
  • Testing often misses these errors
• Even when found, errors are hard to debug
  • No repeatable trace
  • Source of the bug is far away from where it manifests
Concurrency is a real problem
• Windows 2000 hot fixes
  • Concurrency errors most common defects among “detectable errors”
  • Incorrect synchronization and protocol errors most common defects among all coding errors
• Windows Server 2003 late cycle defects
  • Synchronization errors second in the list, next to buffer overruns
• Race conditions can result in security exploits
Current practice
• Concurrency testing == Stress testing
• Example: testing a concurrent queue
  • Create 100 threads performing queue operations
  • Run for days/weeks
  • Pepper the code with sleep( random() )
• Stress increases the likelihood of rare interleavings
  • Makes any error found hard to debug
CHESS: Unit testing for concurrency
• Example: testing a concurrent queue
  • Create 1 reader thread and 1 writer thread
  • Exhaustively try all thread interleavings
• Run the test repeatedly on a specialized scheduler
  • Explore a different thread interleaving each time
  • Use model checking techniques to avoid redundancy
• Check for assertions and deadlocks in every run
  • The error-trace is repeatable
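As a rough illustration of "exhaustively try all thread interleavings", the sketch below models two threads as lists of atomic steps and replays every interleaving, checking an assertion at the end of each run. It is a toy model, not CHESS itself: `State`, `Step`, `explore`, and `testRacyIncrement` are names invented here, and the example program (two unsynchronized increments of a shared counter) is an assumption chosen because it is small.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy shared state: one counter plus a private register per thread.
struct State {
    int counter = 0;
    int reg[2] = {0, 0};
};

using Step = std::function<void(State&)>;

struct Result {
    int executions = 0;  // complete interleavings explored
    int failures = 0;    // interleavings in which the final check failed
};

// Depth-first enumeration of every interleaving. pc[t] is the index of the
// next step of thread t. State is copied per branch so siblings are isolated.
void explore(const std::vector<std::vector<Step>>& threads,
             std::vector<std::size_t> pc, State s,
             const std::function<bool(const State&)>& check, Result& out) {
    bool anyRunnable = false;
    for (std::size_t t = 0; t < threads.size(); ++t) {
        if (pc[t] == threads[t].size()) continue;  // thread t has finished
        anyRunnable = true;
        State next = s;
        std::vector<std::size_t> npc = pc;
        threads[t][npc[t]++](next);                // run one step of thread t
        explore(threads, npc, next, check, out);
    }
    if (!anyRunnable) {                            // all threads done: check
        ++out.executions;
        if (!check(s)) ++out.failures;
    }
}

// Hypothetical test scenario: each thread does load; store(load + 1).
Result testRacyIncrement() {
    std::vector<std::vector<Step>> threads(2);
    for (int t = 0; t < 2; ++t) {
        threads[t].push_back([t](State& s) { s.reg[t] = s.counter; });      // load
        threads[t].push_back([t](State& s) { s.counter = s.reg[t] + 1; });  // store
    }
    Result r;
    explore(threads, std::vector<std::size_t>(2, 0), State(),
            [](const State& s) { return s.counter == 2; }, r);
    return r;
}
```

Because each run is determined entirely by the order in which steps were chosen, any failing interleaving can be replayed exactly, which is the property the slide calls a repeatable error trace.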
Systematic Stress Testing Using CHESS
[Diagram: the Program sits on top of CHESS, which sits on top of the Win32 API and the Kernel (Threads, Scheduler, Synchronization Objects).]
While(not done) { TestScenario() }
TestScenario() { … }
• Tester provides a Test Scenario
• CHESS runs the scenario in a loop
  • Every run takes a different interleaving
  • Every run is repeatable
Conditions on Test Scenario
• Test scenario should terminate in all interleavings
• Test scenario should be idempotent
  • Free all resources (handles, memory, …)
  • Clear the hardware state
• Key observation: existing stress tests already have these properties
  • Because they repeatedly run for ever
Perturb the System as Little as Possible
[Diagram: the Program, running While(not done) { TestScenario() }, sits on top of CHESS, which sits on top of the Win32 API and the Kernel (Threads, Scheduler, Synchronization Objects).]
• Detour Win32 API calls
  • To control and introduce nondeterminism
• Run the system as is
  • On the actual OS, hardware
  • Using system threads, synchronization
• Advantages
  • Avoid reporting false errors
  • Easy to add to existing test frameworks
  • Use existing debuggers
Implementation details
• Handle all the Win32 synchronization mechanisms
  • Critical sections, locks, semaphores, events, …
  • Threadpools
  • Asynchronous procedure calls
  • Timers
  • IO Completions
• No modification to the kernel scheduler / Win32 library
• CHESS drives the system along a desired interleaving by ‘hijacking’ the scheduler
Controlling the Scheduling Nondeterminism
• Nondeterministic choices for the scheduler
  • Determine when to context switch
  • On context switch, pick the next runnable thread to run
  • On resource release, wake up one of the waiting threads
• Hijack these choices from the scheduler
  • Ensure at most one thread is runnable
  • No thread is waiting on a resource
  • At chosen schedule points, block the current thread while waking the next thread
• Emulate program execution on a uniprocessor with context switches only at synchronization points
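The "at most one thread is runnable" discipline can be sketched with standard C++ threads passing a baton. This is only an illustration under stated assumptions: CHESS itself intercepts Win32 calls, whereas `TurnScheduler`, `waitForTurn`, `endStep`, and `runUnderSchedule` are hypothetical names, and the schedule is given up front as a list of thread ids rather than chosen by a search.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Baton-passing scheduler: schedule[i] names the thread allowed to run
// global step i. Every other thread stays blocked on the condition variable.
class TurnScheduler {
public:
    explicit TurnScheduler(std::vector<int> schedule)
        : schedule_(std::move(schedule)), pos_(0) {}

    // Schedule point: block the caller until it is its turn.
    void waitForTurn(int tid) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] {
            return pos_ < schedule_.size() && schedule_[pos_] == tid;
        });
    }

    // End of a step: advance the schedule and wake the next thread.
    void endStep() {
        std::lock_guard<std::mutex> lock(m_);
        ++pos_;
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<int> schedule_;
    std::size_t pos_;
};

// Runs numThreads workers under the given schedule (thread ids must be in
// range) and returns the order in which steps actually executed.
std::vector<int> runUnderSchedule(const std::vector<int>& schedule,
                                  int numThreads) {
    TurnScheduler sched(schedule);
    std::vector<int> steps(numThreads, 0);
    for (int tid : schedule) ++steps[tid];  // steps each thread must perform

    std::vector<int> trace;  // only touched by the thread holding the turn
    std::vector<std::thread> workers;
    for (int tid = 0; tid < numThreads; ++tid) {
        workers.emplace_back([&, tid] {
            for (int s = 0; s < steps[tid]; ++s) {
                sched.waitForTurn(tid);
                trace.push_back(tid);  // the "work" of this step
                sched.endStep();
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return trace;
}
```

Running the same schedule twice yields the same trace, so the real threads behave like a uniprocessor execution whose context switches happen only at the chosen points.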
Partial-order reduction
• Many thread interleavings are equivalent
  • Accesses to separate memory locations by different threads can be reordered
• Avoid exploring equivalent thread interleavings
Partial-order reduction in CHESS
• Algorithm:
  • Assume the program is data-race free
  • Context switch only at synchronization points
  • Check for data-races in each execution
• Theorem:
  • If the algorithm terminates without reporting races, then the program has no assertion failures
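The slides do not say how the per-execution data-race check is implemented. As one possible illustration, the sketch below applies a simple lockset-style test to a recorded trace: two accesses are flagged when they come from different threads, touch the same location, include at least one write, and share no lock. This is a swapped-in textbook technique, not necessarily what CHESS does, and it can flag accesses that are in fact ordered by other synchronization. The `Access` record and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One memory access observed during a single execution (hypothetical format).
struct Access {
    int thread;
    std::string location;
    bool isWrite;
    std::set<std::string> locksHeld;
};

// True when the two lock sets have no lock in common.
bool disjoint(const std::set<std::string>& a, const std::set<std::string>& b) {
    for (const std::string& lock : a) {
        if (b.count(lock) != 0) return false;
    }
    return true;
}

// Lockset-style check over every pair of accesses in the trace.
bool hasPotentialRace(const std::vector<Access>& trace) {
    for (std::size_t i = 0; i < trace.size(); ++i) {
        for (std::size_t j = i + 1; j < trace.size(); ++j) {
            const Access& x = trace[i];
            const Access& y = trace[j];
            if (x.thread == y.thread) continue;       // same thread: ordered
            if (x.location != y.location) continue;   // different memory
            if (!x.isWrite && !y.isWrite) continue;   // two reads never race
            if (disjoint(x.locksHeld, y.locksHeld)) return true;
        }
    }
    return false;
}
```

If such a check reports nothing across all explored executions, the assumption that made it safe to switch only at synchronization points has been validated for those executions.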
Executions on Multi-cores
• CHESS checks for data-races
• If a Test Scenario manifests a bug on a multi-core machine, then CHESS will
  • Either report a data-race
  • Or the bug
• CHESS systematically enumerates all sequentially consistent executions
  • Any data-race free multi-core execution is equivalent to a sequentially consistent execution
State space explosion
[Figure: two threads, one executing x = 1; y = 1; and the other x = 2; y = 2;, with the tree of all their interleavings. Each node is labeled with the resulting (x, y) values, from 0,0 at the root through 1,0 and 2,0 down to 1,1, 1,2, 2,1 and 2,2.]
State space explosion
[Figure: n threads of k steps each, for example x = 1; … y = 1; in one thread and x = 2; … y = 2; in another.]
• Number of executions = O( n^(nk) )
• Exponential in both n and k
  • Typically: n < 10, k > 100
• Limits scalability to large programs (large k)
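To make the growth concrete, the sketch below counts the interleavings of n threads with k atomic steps each, assuming no thread ever blocks. The exact count under that assumption is the multinomial (nk)! / (k!)^n, which is at most n^(nk). The function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// C(m, r) built incrementally. After i iterations the value is
// C(m - r + i, i), so each integer division is exact.
std::uint64_t binomial(std::uint64_t m, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) {
        result = result * (m - r + i) / i;
    }
    return result;
}

// (nk)! / (k!)^n as the product C(k,k) * C(2k,k) * ... * C(nk,k).
// Valid for small n and k only: the result overflows 64 bits very quickly,
// which is itself a symptom of the explosion.
std::uint64_t countInterleavings(unsigned n, unsigned k) {
    std::uint64_t total = 1;
    for (unsigned t = 1; t <= n; ++t) {
        total *= binomial(static_cast<std::uint64_t>(t) * k, k);
    }
    return total;
}
```

Two threads of two steps each give 6 interleavings, while two threads of ten steps each already give 184,756.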
Bounding execution depth
• Works very well for message-passing programs
  • Limit the number of message exchanges
  • Message processing code executed atomically
  • Can go ‘deep’ in the state space
• Does not work for multithreaded programs
  • Even toy programs can have large number of steps (shared-variable accesses)
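Iterative context bounding
[Figure: Thread 1 executes x = 1; if (p != 0) { x = p->f; } and Thread 2 executes p = 0;. In the interleaving shown, Thread 2 runs between the p != 0 check and x = p->f;. The switch into Thread 2 is labeled "preemption" and the switch back to Thread 1 is labeled "non-preemption".]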
Iterative context-bounding algorithm
• The scheduler has a budget of c preemptions
  • Nondeterministically choose the preemption points
• Resort to non-preemptive scheduling after c preemptions
• Once all executions explored with c preemptions
  • Try with c+1 preemptions
• Iterative context-bounding has desirable properties
  • Property 0: Easy to implement
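The budgeted search can be sketched on the two-thread null-dereference example (x = 1; if (p != 0) { x = p->f; } against p = 0;). This is a minimal toy model, assuming every statement is one atomic step and that no thread ever blocks. `Mem`, `bugReachable`, `minPreemptionsToBug`, and `nullDerefExample` are names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy memory for the example: whether p is null, whether the p != 0 check
// passed, and whether a null dereference has happened.
struct Mem {
    bool pNull = false;
    bool checked = false;
    bool crashed = false;
};

using Step = std::function<void(Mem&)>;
using Program = std::vector<std::vector<Step>>;

// Explores every continuation that uses at most `budget` more preemptions.
// `cur` is the thread that ran last (-1 before the first step).
bool bugReachable(const Program& prog, std::vector<std::size_t> pc, Mem m,
                  int cur, int budget) {
    if (m.crashed) return true;
    bool curEnabled = cur >= 0 && pc[cur] < prog[cur].size();
    for (std::size_t t = 0; t < prog.size(); ++t) {
        if (pc[t] == prog[t].size()) continue;  // thread t has finished
        // Leaving a thread that could still run costs one preemption;
        // switching after it finishes is free (non-preempting).
        bool preempt = curEnabled && static_cast<int>(t) != cur;
        if (preempt && budget == 0) continue;
        Mem next = m;
        std::vector<std::size_t> npc = pc;
        prog[t][npc[t]++](next);
        if (bugReachable(prog, npc, next, static_cast<int>(t),
                         budget - (preempt ? 1 : 0))) {
            return true;
        }
    }
    return false;
}

// Iterate c = 0, 1, 2, ... and return the first bound that exposes the bug,
// or -1 if none does up to maxBound.
int minPreemptionsToBug(const Program& prog, int maxBound) {
    for (int c = 0; c <= maxBound; ++c) {
        if (bugReachable(prog, std::vector<std::size_t>(prog.size(), 0),
                         Mem(), -1, c)) {
            return c;
        }
    }
    return -1;
}

Program nullDerefExample() {
    Program prog(2);
    prog[0].push_back([](Mem&) {});                           // x = 1;
    prog[0].push_back([](Mem& m) { m.checked = !m.pNull; });  // if (p != 0)
    prog[0].push_back([](Mem& m) {                            // x = p->f;
        if (m.checked && m.pNull) m.crashed = true;
    });
    prog[1].push_back([](Mem& m) { m.pNull = true; });        // p = 0;
    return prog;
}
```

With a budget of zero the search finds nothing, because each thread runs to completion once started. With a budget of one it finds the crash by preempting the first thread between the check and the dereference.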
Property 1: Polynomial state space
• Terminating program with fixed inputs and deterministic threads
  • n threads, k steps each, c preemptions
• Number of executions <= C(nk, c) · (n+c)! = O( (n^2 k)^c · n! )
• Exponential in n and c, but not in k
[Figure: the step sequences of the threads (x = 1; … y = 1; and x = 2; … y = 2;) are cut at the chosen preemption points into atomic blocks, which are then reordered.]
• Choose c preemption points
• Permute n+c atomic blocks
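The two counting steps above multiply directly into the bound C(nk, c) · (n+c)!. A small sketch that evaluates it, with function names chosen here for illustration and 64-bit arithmetic that only holds for modest n, k, and c:

```cpp
#include <cassert>
#include <cstdint>

// Number of ways to choose r preemption points out of m steps.
std::uint64_t choose(std::uint64_t m, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) {
        result = result * (m - r + i) / i;  // exact: value is C(m - r + i, i)
    }
    return result;
}

// Number of orderings of m atomic blocks.
std::uint64_t factorial(unsigned m) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= m; ++i) result *= i;
    return result;
}

// Upper bound on executions with c preemptions: C(nk, c) * (n + c)!.
std::uint64_t contextBoundedExecutions(unsigned n, unsigned k, unsigned c) {
    return choose(static_cast<std::uint64_t>(n) * k, c) * factorial(n + c);
}
```

For n = 2, k = 100, c = 2 the bound is 477,600 executions, whereas the number of unrestricted interleavings of the same two threads is C(200, 100), roughly 9 × 10^58.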
Property 2: Deep exploration possible with small bounds
• A context-bounded execution has unbounded depth
  • A thread may execute an unbounded number of steps within each context
• Even a context-bound of zero yields complete terminating executions
Property 3: Finds the ‘simplest’ error trace
• Finds smallest number of preemptions to the error
• Number of preemptions better metric of error complexity than execution length
Property 4: Coverage metric
• If search terminates with context-bound of c, then any remaining error must require at least c+1 preemptions
• Intuitive estimate for
  • The complexity of the bugs remaining in the program
  • The chance of their occurrence in practice
Property 5: Lots of bugs with small number of preemptions
• A non-blocking implementation of the work-stealing queue algorithm
  • bounded circular buffer accessed concurrently by readers and stealers
• Developer provided
  • test harness
  • three buggy variations of the program
• Each bug found with at most 2 preemptions
  • executions with 35 preemptions are possible!
Context-bounding + Partial-order reduction
• Algorithm:
  • Assume the program is data-race free
  • Context switch only at synchronization points
  • Explore executions with c preemptions
  • Check for data-races in each execution
• Theorem:
  • If the algorithm terminates without reporting races, then the program has no assertion failures reachable with c preemptions
  • Requires that a thread can block only at synchronization points
• Proof (Musuvathi and Qadeer, PLDI 2007)
Bugs found
Program               KLOC   Max Num Threads   Bugs Reachable with Preemption Count
                                               0     1     2     3     Total
Bluetooth             0.4    3                 0     1     0     0     1
Work-Stealing Queue   1.3    3                 0     1     2     0     3
Transaction Manager   7.0    2                 0     0     2     1     3
APE                   18.9   4                 2     1     1     -     4
Dryad Channels        16.0   5                 1     5     1     -     7
// Function called by a worker thread
// of RChannelReaderImpl
void RChannelReaderImpl::AlertApplication(RChannelItem* item)
{
    // Notify Application

    // XXX: Preempt here for the bug
    EnterCriticalSection(&m_baseCS);
    // process before exit
    LeaveCriticalSection(&m_baseCS);
}

// Function called by the main thread
void TestChannel(WorkQueue* workQueue, ...)
{
    // Creating a channel
    // allocates worker threads
    RChannelReader* channel = new RChannelReaderImpl(..., workQueue);

    // ... do work here

    channel->Close();
    // wrong assumption that channel->Close()
    // waits for worker threads to be finished

    delete channel;
    // BUG: deleting the channel when
    // worker threads still have a valid
    // reference to the channel
}
Facts about Dryad error trace
• Long error trace but requires only one preemption
  • Depth-bounding cannot find it without a lot of luck
• The error trace has 6 non-preempting context switches
  • It is important to leave unbounded the number of non-preempting context switches
• This (and the other 6 errors) in Dryad remained in spite of careful regression testing and months of production use
Coverage vs. Context-bound
Dryad (coverage vs. time)
Current CHESS applications (work in progress)
• Dryad (library for distributed dataflow programming)
• Singularity/Midori (OS in managed code)
• User-mode drivers
• Cosmos (distributed file system)
• SQL database
Conclusion
• Concurrency is important
  • Building robust concurrent software is still a challenge
  • Lack of debugging and testing tools
• CHESS: Concurrency unit-testing
  • Exhaustively try all interleavings
  • Attempt to seamlessly integrate with existing test frameworks
  • Provide replay capability
• Iterative context-bounding algorithm key to the design