
Enhancing Developer Productivity with Code Forensics


DESCRIPTION

Imagine an engineering system that could evaluate developer performance, recognize rushed check-ins, and use that data to speed up development. “Congratulations Jane. You know this code well. No check-in test gate for you.” Anthony Voellm shares how behavioral analysis and developer assessments can be applied to improve productivity. This approach was motivated by today's test systems, tools, and processes that are all designed around the premise that “all developers are created equal.” Studies have shown developer error rates can vary widely and have a number of root causes—the mindset of the developer at the time the code was written, experience level, amount of code in a check-in, complexity of the code, and much more. With Digital Code Forensics, a set of metrics that can evaluate developers, Anthony demonstrates how even modest applications of this approach can speed up development. Discover and use the cutting edge of engineering productivity.


Page 1: Enhancing Developer Productivity with Code Forensics

BW10 Session 6/5/2013 3:45 PM

"Enhancing Developer Productivity with Code Forensics"

Presented by:
Anthony Voellm, Google, Inc.

Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com

Page 2: Enhancing Developer Productivity with Code Forensics

Anthony Voellm Google, Inc.

At Google Anthony Voellm is focused on delivering performance, reliability, and security to the Google Compute Engine, Google App Engine, Google Cloud SQL, and Google Cloud BigQuery while also innovating new offerings. His experience ranges from kernel and database engines to image processing and graphics. Anthony is an avid inventor who holds seven technology patents. Prior to joining Google in 2011, Anthony held multiple roles at Microsoft leading the Windows reliability, security, and privacy test teams. Anthony has taught performance testing to more than 2,000 people worldwide and given dozens of informative talks on software fundamentals and the cloud. He writes a technology blog on software fundamentals.

Page 3: Enhancing Developer Productivity with Code Forensics

Enhancing Developer Productivity with Code Forensics: Applications of behavioral analysis and developer assessment to improve productivity

Presented at Better Software Conference West - Las Vegas - June 5th, 2013

Anthony F. Voellm, Google Cloud Security, Performance and Test Manager

[email protected] / G+ / @p3rfguy

The hypothesis:

Today - one size fits all testing
• Do everything
• Do nothing
• Best guess
• Static code analysis

Tomorrow - the right amount of tests, based on
• Skills / Knowledge
• Experience
• State of mind
• Behavior

Page 4: Enhancing Developer Productivity with Code Forensics

Overview

• Part 1 - The backdrop

• Part 2 - The big question

• Part 3 - Measure, Measure, Measure

• Part 4 - Where is the science?

• Part 5 - The path forward

The backdrop

Page 5: Enhancing Developer Productivity with Code Forensics

All developers are the same - right?

Chevy Nova is a great name for a car in English. In Spanish, however, "no va" means "doesn't go."

We live in a world of bugs:
• Internationalization bugs
• Reliability bugs
• Security bugs
• Performance bugs
• Logical bugs
• Accessibility bugs

...
int x[10];
x[10] = -1;   /* out-of-bounds write: valid indices are 0..9 */
...

Page 6: Enhancing Developer Productivity with Code Forensics

Where do bugs come from?

Humans... (see "The Google Brain" - NYTimes article) ... and, in the future, machines.

How do bugs happen?

Fridays… :)

Page 7: Enhancing Developer Productivity with Code Forensics

How do bugs happen?

Test Blog - "... If you're tired, angry or frustrated for instance (like Patriots fans this morning) then you're almost guaranteed to make some careless mistakes. ..."

How do bugs happen?

College mentee of mine -"Cut and paste is wicked."

Page 8: Enhancing Developer Productivity with Code Forensics

How do bugs happen?

"How do fixes become bugs" paper in 2011 - "...the bug-fixing process can also introduce errors... Developers and reviewers for incorrect fixes usually do not have enough knowledge..."

How do bugs happen?

NIST Study - "Software is error-ridden in part because of its growing complexity. The size of software products is no longer measured in thousands of lines of code, but in millions. Software developers already spend approximately 80 percent of development costs on identifying and correcting defects, and yet few products of any type other than software are shipped with such high levels of errors."

Page 9: Enhancing Developer Productivity with Code Forensics

Speaking of complexity...

From - "How do fixes become bugs" paper

The cost of bugs
• Time
  o 25%+ of developers' time is spent fixing bugs
  o A 1-line fix requiring 1+ hours of testing is common
• Money
  o ~$60 billion (9 zeros) to the US economy each year - in 2002!
• Reputation
  o 10% error rate on critical security fixes.

Page 10: Enhancing Developer Productivity with Code Forensics

Reputation … this might be the most important.

A reputation takes years to develop and only minutes to destroy.

The big question

Page 11: Enhancing Developer Productivity with Code Forensics

One size fits all. With all the evidence that humans are the root cause of bugs *and* we all have different levels of skill... why do we all test the same?

From - "How do fixes become bugs" paper

Types of testing
• All unit tests (15 minutes or less should be the target)
  o The most basic test, with all layers stripped away.
• All integration / system tests (1 hour or less)
  o Uses multiple features together.
• All performance tests (8 hours or less)
  o Micro-benchmarks (fio, iperf, ...)
  o Industry benchmarks (SpecCPU, TPC-C, Hibernate, ...)
• All reliability tests (days)
  o Longhaul
  o Leak detection tools
• All security tests (weeks)
  o Smart fuzzers
  o Static code analyzers like Coverity
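Read as a menu with time budgets, the tiers above lend themselves to a simple scheduling rule: run every tier whose full budget fits the time available. A minimal sketch; the tier names and minute budgets are taken loosely from the list above and are illustrative assumptions, not any real gating system:

```python
# Sketch: pick every test tier whose budget fits the time available
# before a cut. Budgets roughly follow the list above (unit <= 15 min,
# integration <= 1 hr, performance <= 8 hr, ...); all numbers here
# are illustrative assumptions.

TIERS = [
    ("unit", 15),                    # minutes
    ("integration", 60),
    ("performance", 8 * 60),
    ("reliability", 3 * 24 * 60),
    ("security", 14 * 24 * 60),
]

def tiers_within_budget(minutes_available):
    """Return the tier names whose full budget fits the time available."""
    chosen = []
    for name, budget in TIERS:
        if budget <= minutes_available:
            chosen.append(name)
    return chosen

print(tiers_within_budget(90))   # unit and integration fit in 90 minutes
```

With 90 minutes before a release cut, only the unit and integration tiers fit; the deeper tiers would run in a parallel, asynchronous pipeline.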

Page 12: Enhancing Developer Productivity with Code Forensics

HURRAY!

"…time pressure prevents testers from conducting thorough regression tests before releasing the fix."*

How do we choose what to run?

• Today:
  o Run everything
  o Run nothing
  o Selectively run based on complex change-to-test associations
  o Best guess based on developer caution

Page 13: Enhancing Developer Productivity with Code Forensics

How do we choose what to run?

• Tomorrow:
  o Run based on developer skill
  o Run based on familiarity with the code base
  o Run based on the complexity of the code
  o Run based on the type of bug being fixed
  o Run based on behavioral analysis of the code

Measure, Measure, Measure

Page 14: Enhancing Developer Productivity with Code Forensics

What should we measure?

Artifacts

• What is the frequency of check-ins by the developer?

• How often has this developer checked in a [severe] bug?

• Do bugs trail check-ins?
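All three artifact questions can be answered from version-control history alone. A sketch over a simplified commit log; the (author, day, is_bug_fix) record layout is an assumption for illustration, not any real VCS schema:

```python
# Sketch: per-developer artifact metrics from a simplified commit log.
# Each record is (author, day_number, is_bug_fix). Real data would come
# from `git log` or the code-review system; this layout is illustrative.

def artifact_metrics(commits):
    """Check-in frequency and bug-fix share per developer."""
    stats = {}
    for author, day, is_bug_fix in commits:
        s = stats.setdefault(author, {"checkins": 0, "bug_fixes": 0, "days": set()})
        s["checkins"] += 1
        s["bug_fixes"] += int(is_bug_fix)
        s["days"].add(day)
    return {
        author: {
            "checkins_per_day": s["checkins"] / len(s["days"]),
            "bug_fix_ratio": s["bug_fixes"] / s["checkins"],
        }
        for author, s in stats.items()
    }

log = [("jane", 1, False), ("jane", 1, False), ("jane", 2, True),
       ("bob", 1, True), ("bob", 2, True)]
m = artifact_metrics(log)
# jane: 3 check-ins over 2 active days; 1 of her 3 check-ins is a bug fix
```

A rising bug_fix_ratio is one concrete signal that bugs trail a developer's check-ins.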

Page 15: Enhancing Developer Productivity with Code Forensics

Behavior and skill

(Figure: large volumes of code attributed to Freshman, Sophomore, Junior, and Senior developers, vs. a much smaller volume attributed to a Guru.)

Measures

• Is the check-in in code the developer is "familiar" with?

• Knowledge of the code reviewer.

• Peer ranking on how people feel about your level of expertise.

• What is the size of the check-in?
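One way to make "familiar" concrete is authorship share: the fraction of lines in the touched files that the developer last wrote. A sketch under that assumption (a real system would feed it from `git blame`; the data layout here is illustrative):

```python
# Sketch: approximate "familiarity" as the share of lines in the
# touched files that the developer authored. `blame` maps each file to
# a list of per-line authors, as blame data would yield; the layout is
# an illustrative assumption, not a specific tool's output format.

def familiarity(developer, touched_files, blame):
    """Fraction of lines in touched files last written by developer."""
    total = mine = 0
    for path in touched_files:
        for author in blame.get(path, []):
            total += 1
            mine += int(author == developer)
    return mine / total if total else 0.0

blame = {"a.c": ["jane", "jane", "bob", "jane"], "b.c": ["bob", "bob"]}
# jane touches both files: she authored 3 of the 6 lines -> 0.5
```

The check-in size measure then falls out of the same data: the line count of the change itself.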

Page 16: Enhancing Developer Productivity with Code Forensics

Emotions

http://people.brandeis.edu/~sekuler/eegERP.html

Measures

• Is the check-in the day before a weekend?

• Is the time of day of the check-in unusual for the developer?

• Bug debt
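The timing signals above are cheap to derive from commit timestamps. A sketch using only the standard library; the Friday rule and the 08:00-20:00 "usual hours" window are assumed heuristics, not measured per-developer baselines:

```python
# Sketch: behavioral flags from a commit timestamp. "Pre-weekend"
# means a Friday commit; "unusual hour" means outside 08:00-20:00.
# Both thresholds are illustrative assumptions; a real system would
# learn each developer's own baseline.
from datetime import datetime

def timing_flags(ts):
    """Return behavioral risk flags for a commit datetime."""
    return {
        "pre_weekend": ts.weekday() == 4,      # Monday is 0, Friday is 4
        "unusual_hour": not (8 <= ts.hour < 20),
    }

f = timing_flags(datetime(2013, 6, 7, 23, 30))   # a Friday, 11:30 PM
# both flags raised: pre-weekend and late-night check-in
```

A per-developer version would replace the fixed window with that developer's historical commit-hour distribution.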

Page 17: Enhancing Developer Productivity with Code Forensics

Cyclomatic complexity

http://en.wikipedia.org/wiki/Cyclomatic_complexity

Measures

• Does the developer write complex code?

• How layered is the code?

• Does the developer write unit tests?

• What percentage of the check-in is covered by tests?
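Cyclomatic complexity for a single function is 1 plus its number of decision points. A rough sketch for C-like source; it counts decision tokens naively (no comment or string handling), so treat the result as an estimate only:

```python
# Sketch: rough cyclomatic complexity of a C-like function body,
# computed as 1 + number of decision points. Naive token matching
# (no comment/string handling), so the result is only an estimate.
import re

DECISION = r"\b(if|for|while|case|catch)\b|&&|\|\||\?"

def cyclomatic_estimate(source):
    """1 plus the count of branching tokens in the source text."""
    return 1 + len(re.findall(DECISION, source))

body = """
if (x > 0 && y > 0) { for (i = 0; i < n; i++) { work(i); } }
"""
print(cyclomatic_estimate(body))   # 1 + if + && + for = 4
```

Real measurement would parse the code (as the static analyzers mentioned later do); the point here is only that the measure is mechanical and cheap.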

Page 18: Enhancing Developer Productivity with Code Forensics

Where is the science?

Studies...

Empirical investigation of software product line quality

Researcher: Katerina Goseva-Popstajanova
Lane Department of Computer Science and Electrical Engineering
West Virginia University

***Special thanks to Katerina Goseva-Popstajanova who presented at GTAC2013 and graciously allowed me to use these slides. You can see her full talk here - http://www.youtube.com/watch?v=fiG-SdNcjTE

Page 19: Enhancing Developer Productivity with Code Forensics

The following slides are based on the paper "A longitudinal study of post-release faults in an evolving, open-source software product line" by T. Devine, K. Goseva-Popstajanova, S. Krishnan, and R. R. Lutz. Submitted to a journal, currently under review.

Open source product line: basics

Eclipse can be treated as a SPL
• Currently consists of fourteen different members that share main components and are set apart by variable components
• Considered four products: Classic, C/C++, Java, and JavaEE
• Large size: these four products consist of over 125,000 files and 20 million LoC
• Evolving product line: considered seven releases
• Goals: assessment and prediction of post-release faults

Page 20: Enhancing Developer Productivity with Code Forensics

Open source product line: basics

Release       Year  Classic  C/C++  Java  JavaEE  Total  Pkgs  Faulty
                    Pkgs     Pkgs   Pkgs  Pkgs    KLoC         Pkgs
2.0           2002    34      -      -     -        773    34     26
2.1           2003    41      -      -     -       1054    41     37
3.0           2004    76      -      -     -       1756    76     70
3.3 Europa    2007    85     62    103    185      3988   185    148
3.4 Ganymede  2008    89     62    105    200      4291   200    152
3.5 Galileo   2009    77     61    104    188      3913   188    120
3.6 Helios    2010    77     61    105    206      4262   206    103

Different degrees of reuse

(Figure: per-release package reuse for Europa, Ganymede, Galileo, and Helios.)

Page 21: Enhancing Developer Productivity with Code Forensics

Metrics

Code metrics:
• LoC
• Statements
• Percent Branch Statements
• Method Call Statements
• Percent Lines with Comments
• Classes and Interfaces
• Methods per Class
• Average Statements per Method
• Max Complexity
• Average Complexity
• Max Block Depth
• Average Block Depth
• Statements at Block Level n (0, 1, … 9)

Change metrics:
• Revisions
• Refactorings
• Bugfixes
• Authors
• LoC Added
• Max LoC Added
• Average LoC Added
• LoC Deleted
• Max LoC Deleted
• Codechurn
• Max Codechurn
• Average Codechurn
• Max Changeset
• Average Changeset
• Age
• Weighted Age

Assessment: Evolution through releases

1. Does quality, measured by the number of post-release faults for the packages in each release, consistently improve as the SPL matures?

Post-release fault density decreases as the product line matures through releases

Page 22: Enhancing Developer Productivity with Code Forensics


Assessment: Post-release fault distribution

2. Do the majority of faults reside in a small subset of packages?

For each release, from 66% to 93% of post-release faults were located in 20% of packages, with average around 81%
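That 20% concentration is easy to check on any per-package fault-count dataset. A sketch with toy numbers (not the study's Eclipse data):

```python
# Sketch: fraction of post-release faults held by the top 20% of
# packages ranked by fault count. The counts below are toy numbers,
# not the Eclipse data from the study.

def top20_fault_share(fault_counts):
    """Share of all faults held by the 20% most-faulty packages."""
    ranked = sorted(fault_counts, reverse=True)
    k = max(1, round(0.2 * len(ranked)))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

faults = [40, 25, 10, 5, 5, 5, 4, 3, 2, 1]   # 10 packages
print(top20_fault_share(faults))  # top 2 packages hold 65/100 = 0.65
```

Numbers in this 0.66-0.93 range are exactly why risk-targeted testing can pay off: most of the fault mass sits in a small, identifiable subset.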


Prediction: What features are good predictors?

7. Are some features better indicators of the number of post-release faults in a package than others?

Feature selection via stepwise regression selected from 1 to 16 features out of 112.

Page 23: Enhancing Developer Productivity with Code Forensics


Prediction: What features are good predictors?

Of the fifteen features appearing in more than a quarter of models, only four are static code metrics


Prediction: What features are good predictors?

Change metrics (correlation 0.726 - 0.768)
• total and maximum number of bugfixes
• total authors
• total code churn
• total revisions

Static code metrics (correlation 0.610 - 0.683)
• maximum statements at block level one
• maximum and total statements at block level four
• maximum method call statements
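The quoted correlations are plain Pearson coefficients between a metric and per-package post-release fault counts. A sketch with toy data; the study's figures came from the Eclipse releases, not from these numbers:

```python
# Sketch: Pearson correlation between a change metric (e.g. code churn
# per package) and post-release fault counts. Toy data only; the
# study's quoted ranges (0.726-0.768 for change metrics) came from
# the Eclipse releases, not from these numbers.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

churn = [100, 200, 300, 400]     # toy per-package code churn
faults = [2, 4, 6, 8]            # toy per-package fault counts
print(round(pearson(churn, faults), 3))  # perfectly linear -> 1.0
```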

Page 24: Enhancing Developer Productivity with Code Forensics

The path forward

Use the human factor to ship faster...
• Create [AI] models that account for...
  o Experience
  o Knowledge of the code base
  o Code complexity
  o Measures of behavior
  o ... and much more

• Use the models as part of check-in and code health...
  o Developers with less risky profiles run fewer tests
  o Developers with higher risk profiles run more tests

• Let automated systems running in parallel be the safety net.

Page 25: Enhancing Developer Productivity with Code Forensics

Human factors success - Blint! By Erick Fejta

Let productivity soar!

Page 26: Enhancing Developer Productivity with Code Forensics

End - Questions?

Name: Anthony F. Voellm (aka Tony)
Contact: [email protected]
Blog: http://perfguy.blogspot.com
G+: http://goo.gl/mPXcX
Twitter: @p3rfguy

Appendix

Page 27: Enhancing Developer Productivity with Code Forensics

Abstract:
This talk will present data and findings on how behavioral analysis and developer assessment can be applied to improving productivity. Just imagine an engineering system that could recognize rushed check-ins, "grade" developer knowledge, and use that data to speed up development - "Congrats Jane, you know this code well ... no check-in test gate for you." The approach has been motivated by looking at today's test systems, tools, and processes and recognizing these are designed around the premise that all developers are created equal. Studies have shown developer error rates can vary widely and have a number of root causes - mindset of the developer at the time the code was written, experience level, amount of code in a check-in, complexity of the code, and much more. This talk will introduce a number of metrics and concepts such as cyclomatic complexity and Digital Code Forensics, and demonstrate how even modest application of the approach can speed up development. This is the bleeding edge of engineering productivity.

The message:

Not all developers have the same experience or skill level, and we can use this to improve the speed of development. Speed up the better developers, and slow down the less precise. We don't need a one-size-fits-all policy; however, we do need to base the decisions on data.

Page 28: Enhancing Developer Productivity with Code Forensics

References
• http://en.wikipedia.org/wiki/Software_bug#cite_note-1
• http://www.cs.unm.edu/~forrest/classes/readings/HowDoFixesBecomeBugs.pdf
• http://blog.utest.com/the-software-testing-mindset/2012/02/
• http://www.cse.buffalo.edu/~mikeb/Billions.pdf
• http://software-testing-zone.blogspot.com/2008/12/why-are-bugsdefects-in-software.html
• http://www.itbusinessedge.com/cm/community/features/guestopinions/blog/battling-software-defects-one-developer-at-a-time/?cs=39611
• http://istqbexamcertification.com/what-is-the-psychology-of-testing/
• http://ubuntuforums.org/archive/index.php/t-1582847.html
• http://sqa.stackexchange.com/questions/545/how-does-a-testers-perspective-toward-software-differ-from-a-developers
• http://software-testing-zone.blogspot.com/2009/04/software-testing-diplomacy-deal.html