Recent Advances in Software Engineering in Microsoft Research
Judith Bishop, Microsoft Research ([email protected])
University of Nanjing, 28 May 2015
• Hardware: Statistics, Trends
• Maintenance: WER, CRANE, Testing
• Prevention: Z3 and friends
• Education: IntelliTest, Code Hunt
Software runs on hardware – lots of it
Worldwide PC units for personal devices increased by 5% year over year in 1Q14 with sales of basic and utility tablets in emerging markets, plus smartphones driving total device market growth during the quarter. Gartner June 2014
Connected Devices and The Cloud
Most recent technology shift
Desktop operating system market share
Source: www.netmarketshare.com
Mobile/tablet market share
Source: www.netmarketshare.com
Market share of operating systems in the United States, January 2012 to September 2014
(Chart annotation: the growing share is "Not Windows".)
Maintenance
The Challenge for Microsoft
Microsoft ships software to 1 billion users around the world.
We want to:
• fix bugs regardless of source: application or OS; software, hardware, or malware
• prioritize bugs that affect the most users
• generalize the solution so it can be used by any programmer
• get the solutions out to users most efficiently
• try to prevent bugs in the first place
Debugging in the Large with WER…
(Diagram: client minidumps flow into the WER service, are analyzed with !analyze, and are bucketed at scale.)
WER's properties
• The huge database can be mined to prioritize work: fix bugs from the most (not the loudest) users.
• Correlate failures to co-located components: show when a collection of unrelated crashes all contain the same culprit (e.g. a device driver).
• Proven itself "in the wild": found and fixed 5,000 bugs in beta releases of Windows after programmers had found 100,000 with static analysis and model checking tools.
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt, Debugging in the (Very) Large: Ten Years of Implementation and Experience, in SOSP '09, Big Sky, MT, October 2009
Bucketing mostly works
• One bug can hit multiple buckets: up to 40% of error reports; duplicate buckets must be hand-triaged.
• Multiple bugs can hit one bucket: up to 4% of error reports; harder to isolate each bug.
• But if bucketing is wrong 44% of the time? Solution: scale is our friend. With billions of error reports, we can throw away a few million.
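The core idea of bucketing can be sketched in a few lines of Python. The heuristic below, keying each crash report on its faulting module and offset, is a deliberate simplification for illustration; WER's real labeling uses many more attributes (application version, exception code, stack signature, …) and is refined server-side.

```python
from collections import Counter

def bucket_id(report):
    # Simplified bucketing heuristic (illustration only): key on the
    # faulting module and the crash offset within it.
    return (report["module"], report["offset"])

# Invented example reports standing in for minidump metadata.
reports = [
    {"module": "foo.dll", "offset": 0x1A2B},
    {"module": "foo.dll", "offset": 0x1A2B},
    {"module": "bar.sys", "offset": 0x0042},
]

counts = Counter(bucket_id(r) for r in reports)

# Fix bugs from the most (not the loudest) users: triage big buckets first.
for bucket, hits in counts.most_common():
    print(bucket, hits)
```

Sorting buckets by hit count is what lets a handful of fixes help the majority of users.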
Top 20 Buckets for MS Word 2010
(Chart: CDF of relative hit count by bucket number, from a 3-week internal deployment to 9,000 users.)
Just 20 buckets account for 50% of all errors: fixing a small number of bugs will help many users.
Hardware: Processor Bug
(Chart: reports as a percentage of peak, by day number.)
WER helped fix a hardware error; the manufacturer could have caught this earlier with WER.
WER works because bucketing mostly works. Windows Error Reporting (WER):
• is the first post-mortem reporting system with automatic diagnosis
• is the largest client-server system in the world (by installs)
• has helped 700 companies fix 1000s of bugs and billions of errors
• fundamentally changed software development at Microsoft
http://winqual.microsoft.com
CRANE: Risk Prediction and Change Risk Analysis
Goal: to improve hotfix quality and response time
• CRANE adoption in Windows
• Retrospective evaluation of CRANE on Windows
• Categorization of fixes that failed in the field
Recommendation: make metrics simple; empirical and insightful; project- and context-specific; non-redundant; and actionable.
Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, Alex Teterev: CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows. ICST 2011: 357-366
IMPROVING TESTING PROCESSES
Release cycles impact the verification process
• Testing becomes a bottleneck for development.
• How much testing is enough?
• How reliable and effective are tests?
• When should we run a test?
Kim Herzig, Michaela Greiler, Jacek Czerwonka, Brendan Murphy
The Art of Testing Less without Sacrificing Code Quality, ICSE 2015.
Engineering Process
(Diagram: code flows from the engineer's desktop through the integration process.)
System and Integration Testing
Quality gates
• Developers have to pass quality gates (no control over test selection).
• They check system constraints, e.g. compatibility or performance.
• Failures are not isolated: they involve human inspections and cause a development freeze for the corresponding branch.
System and Integration Testing
Software testing is expensive
• 10k+ gates executed, 1M+ test cases
• Different branches, architectures, languages, …
• Aims to find code issues as early as possible
• Slows down product development
Research Objective
Only run effective and reliable tests
• Not every test performs equally well; it depends on the code base.
• Reduce the execution frequency of tests that cause false test alarms (failures due to test and infrastructure issues).
Do not sacrifice code quality
• Run every test at least once on every code change.
• Eventually find all code defects; accepting the risk of finding some defects later is OK.
Running fewer tests increases code velocity
• We cannot run all tests on all code changes anymore.
• Identify tests that are more likely to find defects (not coverage).
(Quadrant chart, reliability vs. effectiveness: tests range from high cost with unknown or low value ($$$$) to low cost with good value ($$) or low cost with low value ($).)
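The cost/value trade-off above can be phrased as an expected-cost comparison. The rule and the cost parameters below are a hypothetical sketch for illustration, not the exact model from the paper:

```python
def should_run(p_defect, p_false_alarm, cost_exec, cost_inspect, cost_escape):
    # Hypothetical expected-cost rule (illustration only): running a test
    # costs its execution time plus the inspections its false alarms
    # trigger; skipping it risks the cost of an escaped defect.
    expected_cost_run = cost_exec + p_false_alarm * cost_inspect
    expected_cost_skip = p_defect * cost_escape
    return expected_cost_run < expected_cost_skip

# A reliable, effective test (low cost, good value): worth running.
print(should_run(p_defect=0.10, p_false_alarm=0.01,
                 cost_exec=1.0, cost_inspect=50.0, cost_escape=1000.0))  # True

# An unreliable, ineffective test (high cost, low value): skip it.
print(should_run(p_defect=0.001, p_false_alarm=0.20,
                 cost_exec=1.0, cost_inspect=50.0, cost_escape=1000.0))  # False
```

The same test can land in different quadrants in different execution contexts, which is why the probabilities are tracked per context.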
HISTORIC TEST FAILURE PROBABILITIES
Analyzing past test runs yields failure probabilities:
• How often did the test fail and detect a code defect?
• How often did the test report a false test alarm?
(Diagram: execution history of builds flowing through a quality gate over time.)
These probabilities depend on the execution context!
Does it Pay Off?
• Fewer test executions reduce cost.
• Taking risk increases cost.
Simulated over three datasets:
• ~11-month period, > 30 million test executions, multiple branches
• ~3-month period, > 1.2 million test executions, single branch
• ~12-month period, > 6.5 million test executions, multiple branches
Across All Products
TABLE I. Simulation results for Microsoft Windows, Office, and Dynamics.

Measurement            | Windows rel. / cost    | Office rel. / cost   | Dynamics rel. / cost
Test executions        | 40.58% / --            | 34.9% / --           | 50.36% / --
Test time              | 40.31% / $1,567,607.76 | 40.1% / $76,509.24   | 47.45% / $19,979.03
Test result inspection | 33.04% / $61,532.80    | 21.1% / $104,880.00  | 32.53% / $2,337,926.40
Escaped defects        | 0.20% / $11,970.56     | 8.7% / $75,326.40    | 13.40% / $310,159.42
Total cost balance     | $1,617,170.00          | $106,063.24          | $2,047,746.01
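As a sanity check on the table, each total cost balance is the savings from skipped test time and avoided result inspections, minus the cost of escaped defects. For the Dynamics column:

```python
# Dynamics column of Table I: savings minus the cost of escaped defects.
test_time_saved = 19_979.03
inspections_saved = 2_337_926.40
escaped_defects_cost = 310_159.42

balance = test_time_saved + inspections_saved - escaped_defects_cost
print(round(balance, 2))  # 2047746.01, matching the reported total
```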
Results vary with the branching structure and the runtime of tests, but we save cost on all products.
Fine-tuning is possible and gives better results, but it is not general.
DYNAMIC & SELF-ADAPTIVE
Probabilities are dynamic (they change over time):
• Skipping tests influences the risk factors of higher-level branches.
• Tests are re-enabled when code quality drops.
• There is a feedback loop between decision points.
(Chart: relative test reduction rate over time for Windows 8.1, from 0% to 70%, showing the training period and the points where tests are automatically enabled again.)
Impact on Development Process
Secondary improvements:
• Machine setup: we may lower the number of machines allocated to the testing process.
• Developer satisfaction: removing false test failures increases confidence in the testing process.
Development speed:
• The impact on development speed is hard to estimate through simulation.
• Product teams invest because they believe that removing tests:
  - increases code velocity (at least a lower bound)
  - avoids additional changes due to merge conflicts
  - reduces the number of required integration branches, as their main purpose is to test the product
“We used the data your team has provided to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)
Prevention
Continual abstraction
Z3: Automated Theorem Prover
• Won 19/21 divisions in the 2011 SMT competition
• The most influential tool paper in the first 20 years of TACAS (2014)
Z3 reasons over a combination of logical theories: Boolean algebra, bit vectors, linear arithmetic, floating point, first-order axioms, non-linear real arithmetic, algebraic data types, sets/maps, …
Leonardo de Moura and Nikolaj Bjørner. Satisfiability modulo theories: introduction and applications. Commun. ACM, 54(9):69-77, 2011.
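As a flavor of the mixed-theory queries Z3 answers, here is a minimal SMT-LIB sketch combining bit vectors and linear integer arithmetic; the constants are invented for illustration:

```smtlib
(declare-const x (_ BitVec 8))
(declare-const y Int)
(assert (= (bvand x #x0F) #x0A))   ; bit-vector constraint on x's low nibble
(assert (and (> y 3) (< y 5)))     ; linear arithmetic forces y = 4
(check-sat)
(get-model)
```

Z3 decides satisfiability of the conjunction across both theories and, on `sat`, returns a model assigning concrete values to x and y.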
SAGE: Binary File Fuzzing
• Symbolic execution of x86 traces to generate new input files
• Z3 theories: bit vectors and arrays
Automated test generation and safety/termination checking
• Finds assertion violations using stratified inlining of procedures and calls to Z3
• Z3 theories: arrays, linear arithmetic, bit vectors, uninterpreted functions
(Chart: fuzzing bugs found in Win7, over 100s of file parsers, split among random + regression, all others, and SAGE.)
Corral: Whole-Program Analysis
As of Windows Threshold, Corral is the program analysis engine for SDV (Static Driver Verifier).

Validating Network ACLs in the Datacenter
Problem:
• 1000s of devices
• Low-level access control lists for different policies
• Updates to an edge ACL can break policies
• Complexity is "inhumane"
Education
IntelliTest in Visual Studio 2015
Available in Visual Studio since 2010 (as Pex and Smart Unit Tests)
Nikolai Tillmann, Jonathan de Halleux, Tao Xie: Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger. ASE 2014: 385-396
Working and learning for fun
• Enjoyment adds to long-term retention on a task.
• Discovery is a powerful driver, in contrast with direct instruction.
• Gaming joins these two, and is hugely popular.
• Can we add these elements to coding? Code Hunt can!
www.codehunt.com
Code Hunt
• is a serious programming game
• works in C# and Java (Python coming)
• appeals to coders wishing to hone their programming skills, and also to students learning to code
• has had over 300,000 users since launching in March 2014, with around 1,000 users a day
• stickiness (loyalty) is very high
Gameplay
1. User writes code in the browser.
2. The cloud analyzes the code; test cases show differences.
3. As long as there are differences, the user must adapt the code and repeat.
4. When there are no more differences, the user wins the level!
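The loop above can be sketched as follows. Real Code Hunt generates discriminating test cases with dynamic symbolic execution (Pex/IntelliTest), not random probing; the random probe here is just a stand-in, and the secret and attempt functions are invented examples.

```python
import random

def secret(x):
    # The puzzle's hidden reference implementation (invented example).
    return x * (x + 1) // 2

def attempt(x):
    # The player's current guess.
    return sum(range(x + 1))

def find_difference(f, g, trials=200, seed=0):
    # Cloud-side check, much simplified: probe with generated inputs and
    # return an input where the functions disagree, or None if none found.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000)
        if f(x) != g(x):
            return x
    return None

print(find_difference(secret, attempt))  # None: no difference, level solved
```

When a difference is found, it is shown to the player as a failing test case, which is exactly the feedback that drives the next iteration of the loop.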
(Diagram: the player's code is compared against the secret code via generated test cases.)
void CoverMe(int[] a)
{
    if (a == null) return;
    if (a.Length > 0)
        if (a[0] == 1234567890)
            throw new Exception("bug");
}
Dynamic Symbolic Execution
(Execution tree: branches on a==null, a.Length>0, and a[0]==1234567890.)
Loop: choose the next path, solve its constraints, generate an input, execute and monitor, and record the observed constraints.

Constraints to solve                      | Input  | Observed constraints
(none yet)                                | null   | a==null
a!=null                                   | {}     | a!=null && !(a.Length>0)
a!=null && a.Length>0                     | {0}    | a!=null && a.Length>0 && a[0]!=1234567890
a!=null && a.Length>0 && a[0]==1234567890 | {123…} | a!=null && a.Length>0 && a[0]==1234567890
Done: there is no path left.
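A concrete way to see the table: translating CoverMe to Python, the four inputs that dynamic symbolic execution discovers each drive a distinct path. The path enumeration itself needs a constraint solver (Z3, in IntelliTest); here we only replay the discovered inputs, returning a path label instead of throwing so each branch outcome is visible.

```python
def cover_me(a):
    # Python translation of the C# CoverMe example.
    if a is None:
        return "a==null"
    if len(a) > 0:
        if a[0] == 1234567890:
            return "bug!"          # the hidden assertion violation
        return "a[0] != magic"
    return "empty array"

# Inputs in the order dynamic symbolic execution discovers them:
for inp in [None, [], [0], [1234567890]]:
    print(inp, "->", cover_me(inp))
```

Four inputs, four paths: once every feasible path has a witness input, the loop stops with "there is no path left".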
Code Hunt: the APCS (default) Zone
• Opened in March 2014
• 129 problems covering the Advanced Placement Computer Science course
• By August 2014, over 45,000 users had started.
(Chart: players per level across the first three sectors of the APCS Zone, dropping from 45K to 1K; yellow marks division puzzles, blue operators, green sector boundaries.)
Effect of difficulty on drop-off in sectors 1-3.
Effect of Puzzle Difficulty on Drop-off (Aug 2014 and Feb 2015)
(Chart: percentage drop-off per puzzle level, August 2014 vs. February 2015.)

Puzzle                             | Level | Aug | Feb-A
Compute -X                         | 1.1   | 17  | 22
Compute 4 / X                      | 1.6   | 18  | 21
Compute X-Y                        | 1.7   | 18  | 22
Compute X/Y                        | 1.11  | 32  | 38
Compute X%3+1                      | 1.13  | 15  | 18
Compute 10%X                       | 1.14  | 12  | 16
Construct a list of numbers 0..N-1 | 2.1   | 37  | 48
Construct a list of multiples of N | 2.2   | 19  | 23
Compute x^y                        | 3.1   | 11  | 18
Compute X! the factorial of X      | 3.2   | 16  | 19
Compute sum of i*(i+1)/2           | 3.5   | 17  | 22
Towards a Course Experience
Total Try Count | Average Try Count | Max Try Count | Total Solved Users
13374           | 363               | 1306          | 1581
Public data release in open source
• For ImCupSept: 257 users x 24 puzzles x approx. 10 tries = about 13,000 programs
• For experimentation on how people program and reach solutions
• github.com/microsoft/code-hunt
Upcoming events
• PLOOC 2015 at PLDI 2015, June 14, 2015, Portland, OR, USA
• CHESE 2015 at ISSTA 2015, July 14, 2015, Baltimore, MD, USA
• Worldwide intern and summer school contests
• Public Code Hunt contests are over for the summer
• Special ICSE attendees contest: register at aka.ms/ICSE2015
• Code Hunt Workshop, February 2015
Summary: Code Hunt, A Game for Coding
1. Powerful and versatile platform for coding as a game
2. Unique in working from unit tests, not specifications
3. Contest experience is fun and robust
4. Large contest numbers, with public data sets from cloud data
   • Enables testing hypotheses and drawing conclusions about how players master coding, and what holds them up
5. Has potential to be a teaching platform (collaborators needed)
Websites
Game:         www.codehunt.com
Project:      research.microsoft.com/codehunt
Community:    research.microsoft.com/codehuntcommunity
Data release: github.com/microsoft/code-hunt
Blogs:        linked on the project page
Office Mix:   mix.office.com
Conclusions
1. Software runs on hardware, and hardware is increasingly varied.
2. The growing hardware sector (mobile) is the most tricky.
3. Maintenance increases in complexity with the number of deployments.
4. Addressing human factors in large maintenance teams pays off.
5. Prevention is a hugely valuable aid to maintenance.
6. Gaming is a way to practice software engineering skills.
Thank you! Questions?