Recent Advances in Software Engineering in Microsoft Research
Judith Bishop, Microsoft Research ([email protected])
University of Nanjing, 28 May 2015
• Hardware: Statistics, Trends
• Maintenance: WER, CRANE, Testing
• Prevention: Z3 and friends
• Education: IntelliTest, Code Hunt
Software runs on hardware – lots of it
Worldwide PC units for personal devices increased by 5% year over year in 1Q14 with sales of basic and utility tablets in emerging markets, plus smartphones driving total device market growth during the quarter. Gartner June 2014
Connected Devices and The Cloud
Most recent technology shift
Desktop operating system market share
Source: www.netmarketshare.com
Mobile/tablet market share
Source: www.netmarketshare.com
Market share of operating systems in the United States, January 2012 to September 2014
(Chart annotation: the growing share is "Not Windows".)
Maintenance
The Challenge for Microsoft
Microsoft ships software to 1 billion users around the world.
We want to:
• fix bugs regardless of source: application or OS; software, hardware, or malware
• prioritize bugs that affect the most users
• generalize the solution so it can be used by any programmer
• get the solutions out to users most efficiently
• try to prevent bugs in the first place
Debugging in the Large with WER…
(Diagram: client minidumps flow into the WER service, are analyzed with !analyze, and are bucketed at scale.)
WER's properties
• The huge database can be mined to prioritize work: fix bugs from the most (not the loudest) users.
• Correlate failures to co-located components: show when a collection of unrelated crashes all contain the same culprit (e.g. a device driver).
• Proven itself "in the wild": found and fixed 5,000 bugs in beta releases of Windows after programmers had found 100,000 with static analysis and model checking tools.
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt, Debugging in the (Very) Large: Ten Years of Implementation and Experience, in SOSP '09, Big Sky, MT, October 2009
Bucketing mostly works
• One bug can hit multiple buckets: up to 40% of error reports; duplicate buckets must be hand-triaged.
• Multiple bugs can hit one bucket: up to 4% of error reports; harder to isolate each bug.
• But if bucketing is wrong 44% of the time? Solution: scale is our friend. With billions of error reports, we can throw away a few million.
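The core idea of bucketing can be sketched in a few lines of Python. The heuristic below, keying each crash report on its faulting module and offset, is a deliberate simplification for illustration; WER's real labeling uses many more attributes (application version, exception code, stack signature, …) and is refined server-side.

```python
from collections import Counter

def bucket_id(report):
    # Simplified bucketing heuristic (illustration only): key on the
    # faulting module and the crash offset within it.
    return (report["module"], report["offset"])

# Invented example reports standing in for minidump metadata.
reports = [
    {"module": "foo.dll", "offset": 0x1A2B},
    {"module": "foo.dll", "offset": 0x1A2B},
    {"module": "bar.sys", "offset": 0x0042},
]

counts = Counter(bucket_id(r) for r in reports)

# Fix bugs from the most (not the loudest) users: triage big buckets first.
for bucket, hits in counts.most_common():
    print(bucket, hits)
```

Sorting buckets by hit count is what lets a handful of fixes help the majority of users.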
Top 20 Buckets for MS Word 2010
(Chart: CDF of relative hit count by bucket number, from a 3-week internal deployment to 9,000 users.)
Just 20 buckets account for 50% of all errors: fixing a small number of bugs will help many users.
Hardware: Processor Bug
(Chart: reports as a percentage of peak, by day number.)
WER helped fix a hardware error; the manufacturer could have caught this earlier with WER.
WER works because bucketing mostly works. Windows Error Reporting (WER):
• is the first post-mortem reporting system with automatic diagnosis
• is the largest client-server system in the world (by installs)
• has helped 700 companies fix 1000s of bugs and billions of errors
• fundamentally changed software development at Microsoft
http://winqual.microsoft.com
CRANE: Risk Prediction and Change Risk Analysis
Goal: to improve hotfix quality and response time
• CRANE adoption in Windows
• Retrospective evaluation of CRANE on Windows
• Categorization of fixes that failed in the field
Recommendation: make metrics simple; empirical and insightful; project- and context-specific; non-redundant; and actionable.
Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, Alex Teterev: CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows. ICST 2011: 357-366
IMPROVING TESTING PROCESSES
Release cycles impact the verification process
• Testing becomes a bottleneck for development.
• How much testing is enough?
• How reliable and effective are tests?
• When should we run a test?
Kim Herzig, Michaela Greiler, Jacek Czerwonka, Brendan Murphy
The Art of Testing Less without Sacrificing Code Quality, ICSE 2015.
Engineering Process
(Diagram: code flows from the engineer's desktop through the integration process.)
System and Integration Testing
Quality gates
• Developers have to pass quality gates (no control over test selection).
• They check system constraints, e.g. compatibility or performance.
• Failures are not isolated: they involve human inspections and cause a development freeze for the corresponding branch.
System and Integration Testing
Software testing is expensive
• 10k+ gates executed, 1M+ test cases
• Different branches, architectures, languages, …
• Aims to find code issues as early as possible
• Slows down product development
Research Objective
Only run effective and reliable tests
• Not every test performs equally well; it depends on the code base.
• Reduce the execution frequency of tests that cause false test alarms (failures due to test and infrastructure issues).
Do not sacrifice code quality
• Run every test at least once on every code change.
• Eventually find all code defects; accepting the risk of finding some defects later is OK.
Running fewer tests increases code velocity
• We cannot run all tests on all code changes anymore.
• Identify tests that are more likely to find defects (not coverage).
(Quadrant chart, reliability vs. effectiveness: tests range from high cost with unknown or low value ($$$$) to low cost with good value ($$) or low cost with low value ($).)
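The cost/value trade-off above can be phrased as an expected-cost comparison. The rule and the cost parameters below are a hypothetical sketch for illustration, not the exact model from the paper:

```python
def should_run(p_defect, p_false_alarm, cost_exec, cost_inspect, cost_escape):
    # Hypothetical expected-cost rule (illustration only): running a test
    # costs its execution time plus the inspections its false alarms
    # trigger; skipping it risks the cost of an escaped defect.
    expected_cost_run = cost_exec + p_false_alarm * cost_inspect
    expected_cost_skip = p_defect * cost_escape
    return expected_cost_run < expected_cost_skip

# A reliable, effective test (low cost, good value): worth running.
print(should_run(p_defect=0.10, p_false_alarm=0.01,
                 cost_exec=1.0, cost_inspect=50.0, cost_escape=1000.0))  # True

# An unreliable, ineffective test (high cost, low value): skip it.
print(should_run(p_defect=0.001, p_false_alarm=0.20,
                 cost_exec=1.0, cost_inspect=50.0, cost_escape=1000.0))  # False
```

The same test can land in different quadrants in different execution contexts, which is why the probabilities are tracked per context.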
HISTORIC TEST FAILURE PROBABILITIES
Analyzing past test runs yields failure probabilities:
• How often did the test fail and detect a code defect?
• How often did the test report a false test alarm?
(Diagram: execution history of builds flowing through a quality gate over time.)
These probabilities depend on the execution context!
Does it Pay Off?
• Fewer test executions reduce cost.
• Taking risk increases cost.
Simulated over three datasets:
• ~11-month period, > 30 million test executions, multiple branches
• ~3-month period, > 1.2 million test executions, single branch
• ~12-month period, > 6.5 million test executions, multiple branches
Across All Products
TABLE I. Simulation results for Microsoft Windows, Office, and Dynamics.

Measurement            | Windows rel. / cost    | Office rel. / cost   | Dynamics rel. / cost
Test executions        | 40.58% / --            | 34.9% / --           | 50.36% / --
Test time              | 40.31% / $1,567,607.76 | 40.1% / $76,509.24   | 47.45% / $19,979.03
Test result inspection | 33.04% / $61,532.80    | 21.1% / $104,880.00  | 32.53% / $2,337,926.40
Escaped defects        | 0.20% / $11,970.56     | 8.7% / $75,326.40    | 13.40% / $310,159.42
Total cost balance     | $1,617,170.00          | $106,063.24          | $2,047,746.01
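As a sanity check on the table, each total cost balance is the savings from skipped test time and avoided result inspections, minus the cost of escaped defects. For the Dynamics column:

```python
# Dynamics column of Table I: savings minus the cost of escaped defects.
test_time_saved = 19_979.03
inspections_saved = 2_337_926.40
escaped_defects_cost = 310_159.42

balance = test_time_saved + inspections_saved - escaped_defects_cost
print(round(balance, 2))  # 2047746.01, matching the reported total
```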
Results vary with the branching structure and the runtime of tests, but we save cost on all products.
Fine-tuning is possible and gives better results, but it is not general.
DYNAMIC & SELF-ADAPTIVE
Probabilities are dynamic (they change over time):
• Skipping tests influences the risk factors of higher-level branches.
• Tests are re-enabled when code quality drops.
• There is a feedback loop between decision points.
(Chart: relative test reduction rate over time for Windows 8.1, from 0% to 70%, showing the training period and the points where tests are automatically enabled again.)
Impact on Development Process
Secondary improvements:
• Machine setup: we may lower the number of machines allocated to the testing process.
• Developer satisfaction: removing false test failures increases confidence in the testing process.
Development speed:
• The impact on development speed is hard to estimate through simulation.
• Product teams invest because they believe that removing tests:
  - increases code velocity (at least a lower bound)
  - avoids additional changes due to merge conflicts
  - reduces the number of required integration branches, as their main purpose is to test the product
“We used the data your team has provided to cut a bunch of bad content and are running a much leaner BVT system […] we’re panning out to scale about 4x and run in well under 2 hours” (Jason Means, Windows BVT PM)
Prevention
Continual abstraction
Z3: Automated Theorem Prover
• Won 19/21 divisions in the 2011 SMT competition
• The most influential tool paper in the first 20 years of TACAS (2014)
Z3 reasons over a combination of logical theories: Boolean algebra, bit vectors, linear arithmetic, floating point, first-order axioms, non-linear real arithmetic, algebraic data types, sets/maps, …
Leonardo de Moura and Nikolaj Bjørner. Satisfiability modulo theories: introduction and applications. Commun. ACM, 54(9):69-77, 2011.
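As a flavor of the mixed-theory queries Z3 answers, here is a minimal SMT-LIB sketch combining bit vectors and linear integer arithmetic; the constants are invented for illustration:

```smtlib
(declare-const x (_ BitVec 8))
(declare-const y Int)
(assert (= (bvand x #x0F) #x0A))   ; bit-vector constraint on x's low nibble
(assert (and (> y 3) (< y 5)))     ; linear arithmetic forces y = 4
(check-sat)
(get-model)
```

Z3 decides satisfiability of the conjunction across both theories and, on `sat`, returns a model assigning concrete values to x and y.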
SAGE: Binary File Fuzzing
• Symbolic execution of x86 traces to generate new input files
• Z3 theories: bit vectors and arrays
Automated test generation and safety/termination checking
• Finds assertion violations using stratified inlining of procedures and calls to Z3
• Z3 theories: arrays, linear arithmetic, bit vectors, uninterpreted functions
(Chart: fuzzing bugs found in Win7, over 100s of file parsers, split among random + regression, all others, and SAGE.)
Corral: Whole-Program Analysis
As of Windows Threshold, Corral is the program analysis engine for SDV (Static Driver Verifier).

Validating Network ACLs in the Datacenter
Problem:
• 1000s of devices
• Low-level access control lists for different policies
• Updates to an edge ACL can break policies
• Complexity is "inhumane"
Education
IntelliTest in Visual Studio 2015
Available in Visual Studio since 2010 (as Pex and Smart Unit Tests)
Nikolai Tillmann, Jonathan de Halleux, Tao Xie: Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger. ASE 2014: 385-396
Working and learning for fun
• Enjoyment adds to long-term retention on a task.
• Discovery is a powerful driver, in contrast with direct instruction.
• Gaming joins these two, and is hugely popular.
• Can we add these elements to coding? Code Hunt can!
www.codehunt.com
Code Hunt
• is a serious programming game
• works in C# and Java (Python coming)
• appeals to coders wishing to hone their programming skills, and also to students learning to code
• has had over 300,000 users since launching in March 2014, with around 1,000 users a day
• stickiness (loyalty) is very high
Gameplay
1. User writes code in the browser.
2. The cloud analyzes the code; test cases show differences.
3. As long as there are differences, the user must adapt the code and repeat.
4. When there are no more differences, the user wins the level!
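The loop above can be sketched as follows. Real Code Hunt generates discriminating test cases with dynamic symbolic execution (Pex/IntelliTest), not random probing; the random probe here is just a stand-in, and the secret and attempt functions are invented examples.

```python
import random

def secret(x):
    # The puzzle's hidden reference implementation (invented example).
    return x * (x + 1) // 2

def attempt(x):
    # The player's current guess.
    return sum(range(x + 1))

def find_difference(f, g, trials=200, seed=0):
    # Cloud-side check, much simplified: probe with generated inputs and
    # return an input where the functions disagree, or None if none found.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000)
        if f(x) != g(x):
            return x
    return None

print(find_difference(secret, attempt))  # None: no difference, level solved
```

When a difference is found, it is shown to the player as a failing test case, which is exactly the feedback that drives the next iteration of the loop.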
(Diagram: the player's code is compared against the secret code via generated test cases.)
void CoverMe(int[] a)
{
    if (a == null) return;
    if (a.Length > 0)
        if (a[0] == 1234567890)
            throw new Exception("bug");
}
Dynamic Symbolic Execution
(Execution tree: branches on a==null, a.Length>0, and a[0]==1234567890.)
Loop: choose the next path, solve its constraints, generate an input, execute and monitor, and record the observed constraints.

Constraints to solve                      | Input  | Observed constraints
(none yet)                                | null   | a==null
a!=null                                   | {}     | a!=null && !(a.Length>0)
a!=null && a.Length>0                     | {0}    | a!=null && a.Length>0 && a[0]!=1234567890
a!=null && a.Length>0 && a[0]==1234567890 | {123…} | a!=null && a.Length>0 && a[0]==1234567890
Done: there is no path left.
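A concrete way to see the table: translating CoverMe to Python, the four inputs that dynamic symbolic execution discovers each drive a distinct path. The path enumeration itself needs a constraint solver (Z3, in IntelliTest); here we only replay the discovered inputs, returning a path label instead of throwing so each branch outcome is visible.

```python
def cover_me(a):
    # Python translation of the C# CoverMe example.
    if a is None:
        return "a==null"
    if len(a) > 0:
        if a[0] == 1234567890:
            return "bug!"          # the hidden assertion violation
        return "a[0] != magic"
    return "empty array"

# Inputs in the order dynamic symbolic execution discovers them:
for inp in [None, [], [0], [1234567890]]:
    print(inp, "->", cover_me(inp))
```

Four inputs, four paths: once every feasible path has a witness input, the loop stops with "there is no path left".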
Code Hunt: the APCS (default) Zone
• Opened in March 2014
• 129 problems covering the Advanced Placement Computer Science course
• By August 2014, over 45,000 users had started.
(Chart: players per level across the first three sectors of the APCS Zone, dropping from 45K to 1K; yellow marks division puzzles, blue operators, green sector boundaries.)
Effect of difficulty on drop-off in sectors 1-3.
Effect of Puzzle Difficulty on Drop-off (Aug 2014 and Feb 2015)
(Chart: percentage drop-off per puzzle level, August 2014 vs. February 2015.)

Puzzle                             | Level | Aug | Feb-A
Compute -X                         | 1.1   | 17  | 22
Compute 4 / X                      | 1.6   | 18  | 21
Compute X-Y                        | 1.7   | 18  | 22
Compute X/Y                        | 1.11  | 32  | 38
Compute X%3+1                      | 1.13  | 15  | 18
Compute 10%X                       | 1.14  | 12  | 16
Construct a list of numbers 0..N-1 | 2.1   | 37  | 48
Construct a list of multiples of N | 2.2   | 19  | 23
Compute x^y                        | 3.1   | 11  | 18
Compute X! the factorial of X      | 3.2   | 16  | 19
Compute sum of i*(i+1)/2           | 3.5   | 17  | 22
Towards a Course Experience
Total Try Count | Average Try Count | Max Try Count | Total Solved Users
13374           | 363               | 1306          | 1581
Public data release in open source
• For ImCupSept: 257 users x 24 puzzles x approx. 10 tries = about 13,000 programs
• For experimentation on how people program and reach solutions
• github.com/microsoft/code-hunt
Upcoming events
• PLOOC 2015 at PLDI 2015, June 14, 2015, Portland, OR, USA
• CHESE 2015 at ISSTA 2015, July 14, 2015, Baltimore, MD, USA
• Worldwide intern and summer school contests
• Public Code Hunt contests are over for the summer
• Special ICSE attendees contest: register at aka.ms/ICSE2015
• Code Hunt Workshop, February 2015
Summary: Code Hunt, A Game for Coding
1. Powerful and versatile platform for coding as a game
2. Unique in working from unit tests, not specifications
3. Contest experience is fun and robust
4. Large contest numbers, with public data sets from cloud data
   • Enables testing hypotheses and drawing conclusions about how players master coding, and what holds them up
5. Has potential to be a teaching platform (collaborators needed)
Websites
Game:         www.codehunt.com
Project:      research.microsoft.com/codehunt
Community:    research.microsoft.com/codehuntcommunity
Data release: github.com/microsoft/code-hunt
Blogs:        linked on the project page
Office Mix:   mix.office.com
Conclusions
1. Software runs on hardware, and hardware is increasingly varied.
2. The growing hardware sector (mobile) is the most tricky.
3. Maintenance increases in complexity with the number of deployments.
4. Addressing human factors in large maintenance teams pays off.
5. Prevention is a hugely valuable aid to maintenance.
6. Gaming is a way to practice software engineering skills.
Thank you! Questions?