Transferring Software Testing Tools to Practice (AST 2017 Keynote)


Tao Xie
University of Illinois at Urbana-Champaign

http://taoxie.cs.illinois.edu/

In collaboration with colleagues from Microsoft Research, Tencent, and Salesforce

(Automated) Test Generation

• Human
  – Expensive, incomplete, …

• Brute force
  – Pairwise, predefined data, etc.

• Tool automation!

…

Getting Real to Produce Practice Impact


Making real impact

Building real technologies

Solving real problems

Software testing tools are naturally tied with software development practice

Dynamic Symbolic Execution

Code to generate inputs for:

void CoverMe(int[] a) {
    if (a == null) return;
    if (a.Length > 0)
        if (a[0] == 1234567890)
            throw new Exception("bug");
}

Exploration loop: choose next path → solve → execute & monitor. Each constraint to solve is an observed path condition with its last condition negated.

Constraints to solve | Data | Observed constraints
(none) | null | a==null
a!=null | {} | a!=null && !(a.Length>0)
a!=null && a.Length>0 | {0} | a!=null && a.Length>0 && a[0]!=1234567890
a!=null && a.Length>0 && a[0]==1234567890 | {123…} | a!=null && a.Length>0 && a[0]==1234567890

Done: There is no path left.

[DART: Godefroid et al. PLDI’05]
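The execute-and-monitor step can be illustrated with a small sketch. It is written in Python rather than the slides' C#, and it is not how Pex or DART is implemented: the four inputs are the ones from the Data column, hard-coded here in place of a constraint solver, and the names `cover_me`, `flip_last`, and `show` are made up for this illustration.

```python
TARGET = 1234567890

def cover_me(a):
    """Python port of CoverMe that records each branch condition and its outcome."""
    path = []  # list of (condition text, outcome) pairs observed on this run
    is_null = a is None
    path.append(("a==null", is_null))
    if is_null:
        return path, False
    non_empty = len(a) > 0
    path.append(("a.Length>0", non_empty))
    if not non_empty:
        return path, False
    hit = a[0] == TARGET
    path.append(("a[0]==1234567890", hit))
    return path, hit  # hit == True corresponds to the "bug" branch

def flip_last(path):
    """Keep the path prefix and negate the last outcome: the next constraint to solve."""
    *prefix, (cond, outcome) = path
    return prefix + [(cond, not outcome)]

def show(path):
    """Render a path condition as a conjunction of (possibly negated) conditions."""
    return " && ".join(c if v else f"!({c})" for c, v in path)

# Inputs taken from the slide's Data column; a real tool would obtain each one
# by handing the flipped path condition to a constraint solver.
for a in [None, [], [0], [TARGET]]:
    path, bug = cover_me(a)
    print(f"{a!r:>14}  observed: {show(path)}  |  flipped: {show(flip_last(path))}  |  bug: {bug}")
```

The sketch prints `!(a==null)` where the slide writes `a!=null`; the two are equivalent.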

Microsoft Research Automated Test Generation Tool: Pex & Relatives

• Pex (released in May 2008): http://research.microsoft.com/pex/
  – 30K downloads after 20 months
  – Active user community: 1.4K forum posts during ~3 years
  – Shipped with Visual Studio 2015 as IntelliTest

• Moles (released in Sept 2009)
  – Shipped with Visual Studio 2012 as Fakes
  – “Provide Microsoft Fakes w/ all Visual Studio editions” got 1.5K community votes


Explosion of Search Space

There are decision procedures for individual path conditions, but…

• Number of potential paths grows exponentially with number of branches

• Reachable code not known initially

• Without guidance, same loop might be unfolded forever

Fitnex search strategy [Xie et al. DSN’09]: http://taoxie.cs.illinois.edu/publications/dsn09-fitnex.pdf


DSE Example

public bool TestLoop(int x, int[] y) {
    if (x == 90) {
        for (int i = 0; i < y.Length; i++)
            if (y[i] == 15)
                x++;
        if (x == 110)
            return true;
    }
    return false;
}

TestLoop(0, {0})

Path condition: !(x == 90)
↓
New path condition: (x == 90)
↓
New test input: TestLoop(90, {0})

DSE Example

TestLoop(90, {0})

Path condition: (x == 90) && !(y[0] == 15) && !(x == 110)
↓
New path condition: (x == 90) && (y[0] == 15)
↓
New test input: TestLoop(90, {15})

Challenge in DSE

TestLoop(90, {15})

Path condition: (x == 90) && (y[0] == 15) && !(x+1 == 110)
↓
New path condition: (x == 90) && (y[0] == 15) && (x+1 == 110)
↓
New test input: No solution!?

A Closer Look

TestLoop(90, {15})

Path condition: (x == 90) && (y[0] == 15) && (0 < y.Length) && !(1 < y.Length) && !(x+1 == 110)
↓
New path condition: (x == 90) && (y[0] == 15) && (0 < y.Length) && (1 < y.Length)
→ Expand array size

A Closer Look

TestLoop(90, {15})

We can have infinite paths!

Manual analysis shows that at least 20 loop iterations are needed to cover the target branch.

Exploring all paths up to 20 loop iterations is infeasible: 2^20 paths

Fitnex: Fitness-Guided Exploration [Xie et al. DSN 2009]

Key observations, with respect to the coverage target:

• not all paths are equally promising for branch-node flipping

• not all branch nodes are equally promising to flip

• Our solution:

– Prefer to flip branch nodes on the most promising paths

– Prefer to flip the most promising branch nodes on paths

– Fitness function to measure how promising a path or branch node is

TestLoop(90, {15, 0})
TestLoop(90, {15, 15})



Fitness Function

• FF computes fitness value (distance between the current state and the goal state)

• Search tries to minimize fitness value

[Tracey et al. 98, Liu et al. 05, …]

Fitness Function for (x == 110)

Fitness function: |110 – x|

Compute Fitness Values for Paths

(x, y) | Fitness value
(90, {0}) | 20
(90, {15}) | 19
(90, {15, 0}) | 19
(90, {15, 15}) | 18
(90, {15, 15, 0}) | 18
(90, {15, 15, 15}) | 17
(90, {15, 15, 15, 0}) | 17
(90, {15, 15, 15, 15}) | 16
(90, {15, 15, 15, 15, 0}) | 16
(90, {15, 15, 15, 15, 15}) | 15
… | …

Give preference to flipping paths with better fitness values.
We still need to address which branch node to flip on paths …
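The fitness values listed for TestLoop can be reproduced with a short sketch. It is in Python rather than the slides' C#, and the helper names `x_at_target_branch` and `fitness` are invented here; the sketch simply replays the loop and applies |110 – x| at the point where `x == 110` would be evaluated.

```python
def x_at_target_branch(x, y):
    """Value of x when `if (x == 110)` is evaluated, or None if that branch is not reached."""
    if x != 90:
        return None
    for v in y:
        if v == 15:
            x += 1
    return x

def fitness(x, y):
    """Distance to the goal state x == 110; smaller is better, 0 means the target is covered."""
    reached = x_at_target_branch(x, y)
    return None if reached is None else abs(110 - reached)

for y in ([0], [15], [15, 0], [15, 15], [15, 15, 0], [15, 15, 15]):
    print((90, y), fitness(90, y))

# Twenty elements equal to 15 are needed before the fitness value reaches 0.
print(fitness(90, [15] * 20))
```

Each additional `15` in `y` lowers the fitness value by one, while an appended `0` only lengthens the path, which is why paths ending in `15` are the more promising ones to extend.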

Compute Fitness Gains for Branches

(x, y) | Flipped branch | Fitness value
(90, {0}) | | 20
(90, {15}) | b4 | 19
(90, {15, 0}) | b2 | 19
(90, {15, 15}) | b4 | 18
(90, {15, 15, 0}) | b2 | 18
(90, {15, 15, 15}) | b4 | 17
(90, {15, 15, 15, 0}) | b2 | 17
(90, {15, 15, 15, 15}) | b4 | 16
(90, {15, 15, 15, 15, 0}) | b2 | 16
(90, {15, 15, 15, 15, 15}) | b4 | 15
… | … | …

Branch b1: i < y.Length
Branch b2: i >= y.Length
Branch b3: y[i] == 15
Branch b4: y[i] != 15

• Flipping branch b4 (b3) gives us an average fitness gain (loss) of 1 (-1)

• Flipping branch b2 (b1) gives us an average fitness gain (loss) of 0

Compute Fitness Gain for Branches cont.

• For a flipped node leading to F_new, find the old fitness value F_old before flipping

• Assign fitness gain (F_old – F_new) to the branch of the flipped node

• Assign fitness gain (F_new – F_old) to the other branch of the flipped node

• Compute the average fitness gain for each branch over time
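The bookkeeping described in these bullets can be sketched as follows. This is an illustrative Python sketch, not Fitnex's actual implementation: the `OPPOSITE` map, the `average_gains` helper, and the tuple layout `(flipped branch, F_old, F_new)` are assumptions made here, and the sample flips are read off the first rows of the TestLoop table.

```python
from collections import defaultdict

# Assumed pairing of the two branches at each branch node (b1/b2 and b3/b4).
OPPOSITE = {"b1": "b2", "b2": "b1", "b3": "b4", "b4": "b3"}

def average_gains(flips):
    """flips: iterable of (flipped_branch, f_old, f_new). Returns the average gain per branch."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for branch, f_old, f_new in flips:
        # The flipped branch is credited with F_old - F_new, its sibling with F_new - F_old.
        for b, gain in ((branch, f_old - f_new), (OPPOSITE[branch], f_new - f_old)):
            totals[b] += gain
            counts[b] += 1
    return {b: totals[b] / counts[b] for b in totals}

# (90,{0}) -> (90,{15}) by flipping b4, (90,{15}) -> (90,{15,0}) by flipping b2, and so on.
flips = [("b4", 20, 19), ("b2", 19, 19), ("b4", 19, 18), ("b2", 18, 18), ("b4", 18, 17)]
gains = average_gains(flips)
print(gains)
```

With these sample flips, b4 averages a gain of 1, b3 a loss of 1, and b1 and b2 average 0, matching the figures quoted for TestLoop.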

Search Frontier

• Each branch node that is a candidate for flipping is prioritized by its composite fitness value:

  – (Fitness value of node – Fitness gain of its branch)

• Select first the one with the best composite fitness value

To avoid local optima or biases, the fitness-guided strategy is integrated with Pex’s fairness search strategies
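A minimal sketch of the frontier prioritization, in Python rather than the slides' C#. The candidate list, node names, and gain values are hypothetical, "best" is taken to mean the lowest composite value (since the search minimizes fitness), and the integration with Pex's fairness strategies is not modeled.

```python
def composite(candidate, gains):
    """Composite fitness value: fitness value of the node minus the fitness gain of its branch."""
    _node, branch, fitness_value = candidate
    return fitness_value - gains.get(branch, 0.0)

def select_next(frontier, gains):
    """Pick the candidate with the lowest composite value; ties keep frontier order."""
    return min(frontier, key=lambda c: composite(c, gains))

# Hypothetical frontier entries: (node id, branch taken at that node, fitness value of its path).
frontier = [("n1", "b2", 18), ("n2", "b4", 18), ("n3", "b4", 19)]
gains = {"b1": 0.0, "b2": 0.0, "b3": -1.0, "b4": 1.0}  # assumed average gains

chosen = select_next(frontier, gains)
print(chosen, composite(chosen, gains))
```

Here `n2` wins: it sits on one of the best paths and its branch b4 has historically improved fitness when flipped.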





What went on behind the scenes to build a user base?
See more in the ASE 2014 Experience Report: http://taoxie.cs.illinois.edu/publications/ase14-pexexperiences.pdf

Lesson 1. Evolving Vision

void TestAdd(ArrayList a, object o) {
    Assume.IsTrue(a != null);
    int i = a.Count;
    a.Add(o);
    Assert.IsTrue(a[i] == o);
}

Parameterized Unit Tests Supported by Pex
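To make the idea of a parameterized unit test concrete, here is a rough Python analogue of TestAdd. It is only a sketch: the function name `check_add` and the input list are invented here, the inputs are chosen by hand rather than generated by Pex, and a failed assumption is modeled by returning "skipped".

```python
def check_add(a, o):
    """Python analogue of TestAdd: holds for any list `a` and any object `o`."""
    if a is None:           # Assume.IsTrue(a != null): inputs outside the precondition are skipped
        return "skipped"
    i = len(a)              # int i = a.Count;
    a.append(o)             # a.Add(o);
    assert a[i] is o        # Assert.IsTrue(a[i] == o); reference comparison, as for C# objects
    return "passed"

# Hand-picked arguments standing in for tool-generated ones.
inputs = [(None, 1), ([], "x"), ([1, 2], None)]
results = [check_add(a, o) for a, o in inputs]
print(results)
```

The test body states a property over all arguments; a test generation tool's job is then to pick argument values that exercise different paths through the code under test.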


• Surrounding (Moles/Fakes)

• Retargeting (Pex4Fun/Code Hunt)

• Simplifying (IntelliTest)

Lesson 2. Landing the First Customer

• Developer/manager: “Who took a dependency on your tool?”

• Pex team: “Do you want to be the first?”

• Developer/manager: “I love your tool but no.”

Tool Adoption by (Mass) Target Users

Tool Shipping with Visual Studio

Macro Perspective

Micro Perspective

Lesson 2. Landing the First Customer (cont.)

• Tackle real-world challenges
  – Demo Pex on real-world cases (e.g., ResourceReader) beyond textbook examples
  – Demo Moles addressing important scenarios well (e.g., unit testing SharePoint code)

• Address technical/non-technical barriers for tech adoption in industry
  – Offer a tool license that does not prohibit commercial use

• Incremental shipping
  – Ship experimental reduced versions and gather feedback
  – Find early adopters

• Provide quantitative info (reflecting the extent of the tool’s importance or benefit)
  – Not all downloads are equal! (e.g., those from Fortune 500)

Lesson 3. Human Factors – Generated Data Consumed by Human

• Developer: “Code Digger generates a lot of "\0" strings as input. I can’t find a way to create such a string via my own C# code. Could anyone show me a C# snippet? I meant zero-terminated string.”

• Pex team: “In C#, a \0 in a string does not mean zero-termination. It’s just yet another character in the string (a very simple character where all bits are zero), and you can create it as Pex shows the value: "\0".”

• Developer: “Your tool generated "\0".”

• Pex team: “What did you expect?”

• Developer: “Marc.”

…

Lesson 3. Human Factors – Generated Name Consumed by Human

• Developer: “Your tool generated a test called Capitalize01. I don’t like it.”

• Pex team: “What did you expect?”

• Developer: “Capitalize_Should_Fail_When_Value_Is_Null.”

Lesson 3. Human Factors – Generated Results Consumed by Human

Object Creation messages suppressed (e.g., Covana by Xiao et al. [ICSE’11])

Exception Tree View

Exploration Tree View

Exploration Results View

Lesson 4. Misconceptions

• Someone advertises: “Simply one mouse click and then everything would work just perfectly”
  – Often need environment isolation w/ Moles/Fakes

• “One mouse click, a test generation tool would detect all or most kinds of faults in the code under test”
  – Developer: “Your tool only finds null references.”
  – Pex team: “Did you write any assertions?”
  – Developer: “Assertion???”

• “I do not need test generation; I already practice unit testing (and/or TDD). Test generation does not fit into the TDD process”

Lesson 5. Embracing Feedback

Gathered feedback from target tool users:

• Directly, e.g., via
  – MSDN Pex forum, tech support, outreach to MS engineers and .NET user groups, outreach to external early adopters

• Indirectly, e.g., via
  – interactions with the Visual Studio team (a tool vendor to its huge user base)

• Lack of basic test isolation in practice => Moles
  – Our suggestion of refactoring code for testability faced strong resistance in practice

• Observation at Agile 2008 conference
  – Large focus on mock objects and tool support for mock objects

Gathering Feedback

• Early drops on VS Code Gallery of the Pex Extension, and the Code Digger extensions for Visual Studio 2013

• Visual Studio MVP Community

• Internal dogfooding by teams within Microsoft

• UserVoice feedback (> 20 ideas): https://visualstudio.uservoice.com/forums/121579-visual-studio-2015?query=Intellitest

• Stack Overflow: active forum with questions tagged with "Pex" or "IntelliTest"

• Facebook: https://www.facebook.com/PexMoles/

• Twitter: https://twitter.com/pexandmoles

Item | Votes
Add support for NUnit and xUnit.net | 304
Add Support for VB.NET | 336
Make it available with Visual Studio Professional | 319
Enable IntelliTest for 64 bit projects | 72

UserVoice suggestion pages:

https://visualstudio.uservoice.com/forums/121579-visual-studio-2015/suggestions/6792167-enable-intellitest-to-generate-test-code-in-xunit

https://visualstudio.uservoice.com/forums/121579-visual-studio-2015/suggestions/6825493-enable-intellitest-to-generate-tests-for-vb-net-pr

https://visualstudio.uservoice.com/forums/121579-visual-studio-2015/suggestions/6773265-make-intellitest-available-to-visual-studio-profes

https://visualstudio.uservoice.com/forums/121579-visual-studio-2015/suggestions/10139661-enable-intellitest-for-64-bit-projects

Pex to IntelliTest - Adds and Cuts

• Additions
  – Externalizing Test Framework support

• Cuts
  – Visual Basic Support
  – CommandLine Support
  – FixIt
  – Code Contract integration
  – Pex Explorer
  – Reporting

From http://bbcode.codeplex.com/

Pex to IntelliTest - Shipped!

Extension installs (as of May 19, 2017):

• NUnit Extension: 37,378 installs

• xUnit.net Extension: 17,438 installs

https://visualstudiogallery.msdn.microsoft.com/bd30bf3f-4183-4b00-a245-1875316b8cd3

https://visualstudiogallery.msdn.microsoft.com/bf74d890-a81e-4e49-beb7-1ad3a4e012af

Collaboration with Academia

• Win-win collaboration model
  – Win (Industry Lab): longer-term research innovation, manpower, research impacts, …
  – Win (University): powerful infrastructure, relevant/important problems in practice, both research and industry impacts, …

• Hosting academic visitors
  – Faculty visits, e.g., Fitnex [Xie et al. DSN’09], Pex4Fun [Tillmann et al. ICSE’13 SEE]
  – Student internships, e.g., FloPSy [Lakhotia et al. ICTSS’10], DyGen [Thummalapenta et al. TAP’10]

http://research.microsoft.com/pex/community.aspx#publications

Engaging Broader Academic Communities

• Academic research inspiring internal technology development
  – Reggae [Li et al. ASE’09] → Rex [Veanes et al. ICST’10]
  – MSeqGen [Thummalapenta et al. FSE’09] → DyGen [Thummalapenta et al. TAP’10]
  – …

• Academic research exploring longer-term research frontiers
  – DySy [Csallner et al. ICSE’08]
  – Seeker [Thummalapenta et al. OOPSLA’11]
  – Covana [Xiao et al. ICSE’11]
  – SEViz [Honfi et al. ICST’15]
  – Pex + Code Contracts [Christakis et al. ICSE’16]
  – …

Going from Pex to Coding Duels

Secret Implementation

class Secret {
    public static int Puzzle(int x) {
        if (x <= 0) return 1;
        return x * Puzzle(x-1);
    }
}

Player Implementation

class Player {
    public static int Puzzle(int x) {
        return x;
    }
}

class Test {
    public static void Driver(int x) {
        if (Secret.Puzzle(x) != Player.Puzzle(x))
            throw new Exception("Mismatch");
    }
}

Behavior: Secret Impl == Player Impl
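The duel can be mimicked with a small Python sketch (the slides use C#). One substitution should be stated plainly: Pex finds a distinguishing input through dynamic symbolic execution of the Driver, whereas this sketch just tries a handful of candidate integers; `find_mismatch` and the candidate ranges are invented for the illustration.

```python
def secret_puzzle(x):
    """Secret implementation from the slide (factorial-like)."""
    return 1 if x <= 0 else x * secret_puzzle(x - 1)

def player_puzzle(x):
    """Player's current attempt from the slide."""
    return x

def find_mismatch(secret, player, candidates):
    """Return the first input on which the two implementations disagree, or None."""
    for x in candidates:
        if secret(x) != player(x):
            return x
    return None

print(find_mismatch(secret_puzzle, player_puzzle, range(-3, 6)))
print(find_mismatch(secret_puzzle, player_puzzle, range(1, 6)))
```

A returned input plays the role of the failing test case shown to the player; `None` over the tried candidates only means no difference was found among them, not that the two implementations are equivalent.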


About Code Hunt

Powerful and versatile platform for coding as a game

Built on the symbolic execution of Pex

Addressing different audiences – students, developers, researchers

Data available in the cloud

Unique in working from unit tests not specifications

Open sourced data available for analysis

Over 350,000 players as of August 2016 (since mid 2014)

Blogs and Sites

• April 29, 2015: http://blogs.msdn.com/b/msr_er/archive/2015/04/29/code-hunt-creating-a-community-with-a-game.aspx

• May 15, 2014: http://blogs.msdn.com/b/msr_er/archive/2014/05/15/what-if-coding-were-a-game.aspx

• http://www.codehunt.com/

• http://research.microsoft.com/codehunt

• http://research.microsoft.com/codehuntcommunity

• Data site on GitHub: http://www.github.com/microsoft/code-hunt


It’s a game!

1. iterative gameplay
2. adaptive
3. personalized
4. no cheating
5. clear winning criterion

Score is based on

• how many puzzles solved,

• how well solved, and

• when solved

code

test cases

Lesson 6: Following the Data

• Java is provided by a source-to-source translator
  – We watched which features players used and what errors they made to concentrate translation efforts for maximum effect

• The bank of over 400 puzzles records difficulty levels
  – These are updated by crowdsourcing users’ attempts

• The vast number of attempts at solving puzzles gives reliable data as to where programmers have difficulty (see the open-sourced data)

For ImCupSept: 257 users x 24 puzzles x approx. 10 tries = about 13,000 programs

ICSE 2015 JSEET: http://taoxie.cs.illinois.edu/publications/icse15jseet-codehunt.pdf

ICSE 2013 SEE: http://taoxie.cs.illinois.edu/publications/icse13see-pex4fun.pdf

Test Generation for Mobile Apps

When Monkey and WeChat Meet …

FSE 2016 Industry Track: http://taoxie.cs.illinois.edu/publications/fse16industry-wechat.pdf

ICSE 2017 SEIP Track: http://taoxie.cs.illinois.edu/publications/icse17seip-wechat.pdf

Motivation

Choudhary et al. [ASE’15]: Do we have good-enough tools to test Android apps?
• Evaluated six research tools and Monkey on 68 open-source apps
• Monkey tool outperformed all six research tools

Their study can be further extended
• No industrial-strength Android app was studied
• No demonstration of whether and how techniques can further improve Monkey under industrial settings

Challenges on Testing Industrial Mobile Apps

Challenges for code coverage measurement:
• Requiring app’s source code
• Industrial-strength Android app can cause 64K reference limit exception during instrumentation

Challenges for applicability:
• Scalability on testing apps with large codebases
• OS compatibility of testing tool

WeChat Overview

• WeChat = WhatsApp + Facebook + Instagram + PayPal + Uber …
• 846 million monthly active users
• Daily numbers: dozens of billions of messages sent, hundreds of millions of photos uploaded, hundreds of millions of payment transactions executed
• WeChat backend: 2K+ microservices running on 40K+ servers
• 10M queries per second during Chinese New Year’s Eve
• Large codebase on WeChat Android

Monkey: Experimental Setup

Experiment setup
• Set Monkey to fire random events every 500 milliseconds
• Run Monkey on WeChat 5 times independently
• Run Monkey for 18 hours each time (2 hours without login)

Evaluation metrics
• Line coverage
• Activity coverage

Monkey: Coverage Result Findings

Finding 1: Monkey has low line coverage (19.5%) and low activity coverage (10.3%).

Finding 2: Monkey allocates a lopsided distribution of exploration time on each activity.

[Figure: line coverage and activity coverage (%) of Monkey over exploration time (1 to 18 hours); annotation: manual login]

Monkey: Exploration Time Challenges

Widget obliviousness: It is difficult to generate events on small-sized GUI elements

State obliviousness: Monkey explores the same two activities repeatedly without contributing to new code coverage

[Figure: SelectContactUI and ContactLabelUI screens, annotated "To another activity" and "Back to other activities"]

New Approach

Design goals
• Have direct access to UI elements on activity under test
• Allocate more exploration time towards new GUI states

Techniques
• Use UIAutomator framework to obtain the UI layout tree
• Abstract GUI states to guide firing of state-changing events
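The two techniques can be illustrated with a toy Python sketch. Nothing here calls UIAutomator or reflects the actual tool: the layout trees are plain dictionaries, the app is a hand-written transition table, `DetailUI` and the event names are hypothetical, and the greedy "least-visited target" rule is just one simple way to favor state-changing events.

```python
from collections import Counter

def abstract_state(node):
    """Abstract a UI layout tree into a hashable key: keep widget classes, drop text content."""
    return (node["class"], tuple(abstract_state(c) for c in node.get("children", [])))

def explore(app, start, steps):
    """Greedy state-aware exploration: fire the event whose known target state has been
    visited least; events not yet tried count as leading to a state with zero visits."""
    visits = Counter({start: 1})
    known = {}  # (state, event) -> next state observed when the event was fired
    state = start
    for _ in range(steps):
        events = sorted(app[state])
        event = min(events, key=lambda e: visits[known[(state, e)]] if (state, e) in known else 0)
        nxt = app[state][event]
        known[(state, event)] = nxt
        visits[nxt] += 1
        state = nxt
    return visits

# Two screens that differ only in displayed text map to the same abstract state.
s1 = {"class": "ListView", "children": [{"class": "TextView", "text": "Alice"}]}
s2 = {"class": "ListView", "children": [{"class": "TextView", "text": "Bob"}]}
print(abstract_state(s1) == abstract_state(s2))

# Hypothetical transition table keyed by abstract state name.
app = {
    "SelectContactUI": {"tap_label": "ContactLabelUI", "tap_small_button": "DetailUI"},
    "ContactLabelUI": {"back": "SelectContactUI"},
    "DetailUI": {"back": "SelectContactUI"},
}
visits = explore(app, "SelectContactUI", steps=4)
print(dict(visits))
```

Because the explorer remembers where each event led, it stops bouncing between the first two screens and tries the untested small button on its second visit to `SelectContactUI`.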


New Approach: Coverage Result

The new approach covers 11.1 percentage points more lines and 18.4 percentage points more activities than Monkey does!

Categorization of Not-covered Activities

Dead activity examples: unreleased features, activities for older devices

Insufficient account state examples: requiring financial information, account history, enabled features

Example Not-covered Activities

Default disabled features of WeChat

Activity for searching saved favorite history

Activity for showing details of search results

Substring Hole Analysis

• Substring hole: a set of not-covered activities whose names share a substring

• Identified “wallet”-related activities as not covered

• Manual testing of wallet-related activities reduced the substring hole of “wallet” from 85.9% (61 / 71) to about 22.5% (16 / 71)
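A substring hole can be computed from activity names alone, as the following Python sketch suggests. The activity names and the covered set are hypothetical placeholders, and `substring_hole` is a helper invented here; only the 61/71 and 16/71 ratios come from the slide.

```python
def substring_hole(all_activities, covered, substring):
    """Return (not covered, total) among activities whose name contains the substring."""
    related = [a for a in all_activities if substring.lower() in a.lower()]
    missed = [a for a in related if a not in covered]
    return len(missed), len(related)

# Hypothetical activity names, for illustration only.
activities = ["WalletBalanceUI", "WalletPayUI", "ChatUI", "SettingsUI"]
covered = {"ChatUI", "WalletPayUI"}
missed, total = substring_hole(activities, covered, "wallet")
print(f"wallet hole: {missed}/{total}")

# The percentages reported for WeChat's "wallet" hole, before and after manual testing.
before = round(61 / 71 * 100, 1)
after = round(16 / 71 * 100, 1)
print(before, after)
```

Grouping missed activities by a shared name fragment points testers at a whole feature area (here, the wallet) rather than at individual screens.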


Learning for Test Prioritization: An Industrial Case Study

Benjamin Busjaeger, Salesforce

Tao Xie, University of Illinois, Urbana-Champaign

FSE 2016 Industry Track: http://taoxie.cs.illinois.edu/publications/fse16industry-learning.pdf

Large Scale Continuous Integration

2K+ Modules

600K+ Tests

2K+ Engineers

500+ Changes/Day

Main repository ...

350K+ Source files

Motivation

 | Before Commit | After Commit
Objective | Reject faulty changes | Detect test failures as quickly as possible
Current | Run hand-picked tests | Parallelize (8000 concurrent VMs) & batch (1-1000 changes per run)
Problem | Too complex for human | Feedback time constrained by capacity & batching complicates fault assignment
Desired | Select top-k tests likely to fail | Prioritize all tests by likelihood of failure

Insight: Need Multiple Existing Techniques

● Heterogeneous languages: Java, PL/SQL, JavaScript, Apex, etc.

● Non-code artifacts: metadata and configuration

● New/recently-failed tests more likely to fail

Test code coverage of change

Textual similarity between test and change

Test age and recent failures

Insight: Need Scalable Techniques

● Complex integration tests: concurrent execution

● Large data volume: 500+ changes, 50M+ test runs, 20TB+ results per day

→ Our approach: Periodically collect coarse-grained measurements

Test code coverage of change

Textual similarity between test and change

Test age and recent failures

model for learning from past test results

New Approach: Learning for Test Prioritization

Test code coverage of change

Textual similarity between test and change

Test age and recent failures

Change

Test Ranking

→ Implementation currently in pilot use at Salesforce
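How the three signals might be combined into a ranking can be sketched in Python. This is not the Salesforce implementation: the slides describe a model learned from past test results, whereas the fixed weights below are a hand-set stand-in, Jaccard token overlap is an assumed similarity measure, and all field names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap between two texts (an assumed stand-in for textual similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rank_tests(change, tests, weights=(0.5, 0.3, 0.2)):
    """Order tests by a weighted sum of coverage, text similarity, and recent failure.
    The weights are illustrative; the approach in the slides learns the model from history."""
    w_cov, w_txt, w_fail = weights

    def score(t):
        files = change["files"]
        cov = len(files & t["covered_files"]) / len(files) if files else 0.0
        txt = jaccard(change["text"], t["text"])
        fail = 1.0 if t["failed_recently"] else 0.0
        return w_cov * cov + w_txt * txt + w_fail * fail

    return sorted(tests, key=score, reverse=True)

change = {"files": {"Account.java"}, "text": "update account balance"}
tests = [
    {"name": "t3", "covered_files": set(), "text": "report export", "failed_recently": False},
    {"name": "t2", "covered_files": set(), "text": "login page test", "failed_recently": True},
    {"name": "t1", "covered_files": {"Account.java"}, "text": "account balance test",
     "failed_recently": False},
]
ranking = [t["name"] for t in rank_tests(change, tests)]
print(ranking)
```

Test selection before commit would run only the top-k entries of such a ranking, while prioritization after commit would run all tests in ranked order.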

Empirical Study: Setup

● Test results of 45K tests for ~3 month period

● In this period, 711 changes with ≥1 failure
  • 440 for training
  • 271 for testing

● Collected once for each test:
  • Test code coverage
  • Test text content

● Collected continuously:
  • Test age and recent failures

Results: Test selection (before commit)

● New approach achieves highest average recall at all cutoff points
  • 50% failures detected with top 0.2% of tests
  • 75% failures detected with top 3% of tests

Results: Test prioritization (after commit)

● New approach achieves highest APFD with least variance
  • Median: 85%
  • Average: 99%
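The two metrics can be made concrete with a short Python sketch. It uses the standard APFD definition, APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), and treats each failing test as its own fault, which is an assumption made here; the ranking and failing set are made-up sample data, and every failing test is assumed to appear in the ranking.

```python
def recall_at(ranking, failing, k):
    """Fraction of failing tests found among the top-k ranked tests (test selection)."""
    return len(set(ranking[:k]) & failing) / len(failing)

def apfd(ranking, failing):
    """Average Percentage of Faults Detected for a full ordering (test prioritization).
    Each failing test is treated as one fault, detected at its 1-based rank position."""
    n, m = len(ranking), len(failing)
    positions = [i + 1 for i, t in enumerate(ranking) if t in failing]
    return 1 - sum(positions) / (n * m) + 1 / (2 * n)

ranking = ["t3", "t1", "t4", "t2", "t5"]  # hypothetical prioritized order
failing = {"t3", "t4"}                    # hypothetical failing tests

print(recall_at(ranking, failing, 1), recall_at(ranking, failing, 3))
print(round(apfd(ranking, failing), 3))
```

Recall at a cutoff answers the before-commit question (how many failures would a small selected subset catch), while APFD rewards orderings that place failing tests early across the whole run.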

Summary: Learning for Test Prioritization

● Main insights gained in conducting test prioritization in industry

● Novel learning-based approach to test prioritization

● Implementation currently in pilot use at Salesforce

● Empirical evaluation using a large Salesforce dataset

Summary

• Pex practice impact by surrounding, retargeting, simplifying

• Lessons in transferring tools to practice
  1. Evolving vision
  2. Landing your first customer
  3. Human factors
  4. Misconceptions
  5. Embracing feedback

• Collaboration/engagement with academia

• Educational impact and lesson learned
  6. Following the data

• Important to research on testing industrial apps (e.g., WeChat)
  – Beyond open-source ones





Making real impact



Software testing/analysis tools are naturally tied with software development practice

Thank you! Questions?


This material is based upon work supported by the Maryland Procurement Office under Contract No. H98230-14-C-0141. This work is also supported in part by National Science Foundation under grants no. CCF-1409423, CNS-1434582, CNS-1513939, CNS-1564274.

