85
Software Analytics: Data Analytics for Software Engineering Achievements and Opportunities Tao Xie Department of Computer Science University of Illinois at Urbana-Champaign, USA [email protected] In Collaboration with Microsoft Research; Most Slides Adapted from Slides Co-Presented with Dongmei Zhang (Microsoft Research, http://research.microsoft.com/en-us/groups/sa/)

Software Analytics: Data Analytics for Software Engineering

  • Upload
    tao-xie

  • View
    304

  • Download
    5

Embed Size (px)

Citation preview

Page 1: Software Analytics: Data Analytics for Software Engineering

Software Analytics:

Data Analytics for Software EngineeringAchievements and Opportunities

Tao XieDepartment of Computer Science

University of Illinois at Urbana-Champaign, USA

[email protected]

In Collaboration with Microsoft Research; Most Slides Adapted from Slides Co-Presented with

Dongmei Zhang (Microsoft Research, http://research.microsoft.com/en-us/groups/sa/)

Page 2: Software Analytics: Data Analytics for Software Engineering

New Era…Software itself is changing...

Software Services

Page 3: Software Analytics: Data Analytics for Software Engineering

How people use software is changing…

Page 4: Software Analytics: Data Analytics for Software Engineering

Individual Isolated

Not much data/content

generation

How people use software is changing…

Page 5: Software Analytics: Data Analytics for Software Engineering

How people use software is changing…

Individual Isolated

Not much data/content

generation

Page 6: Software Analytics: Data Analytics for Software Engineering

How people use software is changing…

Individual

Social

Isolated

Not much data/content

generation

Collaborative

Huge amount of data/artifacts

generated anywhere anytime

Page 7: Software Analytics: Data Analytics for Software Engineering

How software is built & operated is changing…

Page 8: Software Analytics: Data Analytics for Software Engineering

How software is built & operated is changing…

Data pervasive

Long product cycle

Experience & gut-feeling

In-lab testing

Informed decision making

Centralized development

Code centric

Debugging in the large

Distributed development

Continuous release

… …

Page 9: Software Analytics: Data Analytics for Software Engineering

How software is built & operated is changing…

Data pervasive

Long product cycle

Experience & gut-feeling

In-lab testing

Informed decision making

Centralized development

Code centric

Debugging in the large

Distributed development

Continuous release

… …

Page 10: Software Analytics: Data Analytics for Software Engineering

Software Analytics

Software analytics is to enable software

practitioners to perform data exploration and

analysis in order to obtain insightful and

actionable information for data-driven tasks

around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software

Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011

http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Page 11: Software Analytics: Data Analytics for Software Engineering

Software Analytics

Software analytics is to enable software

practitioners to perform data exploration and

analysis in order to obtain insightful and

actionable information for data-driven tasks

around software and services.

http://research.microsoft.com/en-us/groups/sa/

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Page 12: Software Analytics: Data Analytics for Software Engineering

Research topics – the trinity view

Software

Users

Software

Development

Process

Software

System

• Covering different areas of

software domain

• Throughout entire development

cycle

• Enabling practitioners to obtain

insights

Page 13: Software Analytics: Data Analytics for Software Engineering

Data sources

Runtime traces

Program logs

System events

Perf counters

Usage log

User surveys

Online forum posts

Blog & Twitter

Source code

Bug history

Check-in history

Test cases

Eye tracking

MRI/EMG

Page 14: Software Analytics: Data Analytics for Software Engineering

Target audience – software practitioners

Page 15: Software Analytics: Data Analytics for Software Engineering

Target audience – software practitioners

Developer

Tester

Page 16: Software Analytics: Data Analytics for Software Engineering

Target audience – software practitioners

Developer

Tester

Program Manager

Usability engineer

Designer

Support engineer

Management personnel

Operation engineer

Page 17: Software Analytics: Data Analytics for Software Engineering

Output – insightful information

• Conveys meaningful and useful understanding or

knowledge towards completing the target task

• Not easily attainable via directly investigating raw data

without aid of analytics technologies

• Examples

– It is easy to count the number of re-opened bugs, but how to

find out the primary reasons for these re-opened bugs?

– When the availability of an online service drops below a

threshold, how to localize the problem?

Page 18: Software Analytics: Data Analytics for Software Engineering

Output – actionable information

• Enables software practitioners to come up with concrete

solutions towards completing the target task

• Examples

– Why bugs were re-opened?

• A list of bug groups each with the same reason of re-

opening

– Why availability of online services dropped?

• A list of problematic areas with associated confidence values

Page 19: Software Analytics: Data Analytics for Software Engineering

Research topics & technology pillars

Software

Users

Software

Development

Process

Software

System Vertical

Horizontal

Information Visualization

Data Analysis Algorithms

Large-scale Computing

Page 20: Software Analytics: Data Analytics for Software Engineering

Connection to practice

• Software Analytics is naturally tied with software

development practice

• Getting real

Page 21: Software Analytics: Data Analytics for Software Engineering

Connection to practice

• Software Analytics is naturally tied with software

development practice

• Getting real

Real

Data

Real

Problems

Real

Users

Real

Tools

Page 22: Software Analytics: Data Analytics for Software Engineering

Outline

• Software Engineering for Big Data

• Overview of Software Analytics

• Selected Projects

– StackMine: Performance debugging in the large

– XIAO: Scalable code clone analysis

– SAS: Incident management of online services

– Code Hunt/Pex4Fun: Teaching & learning via interactive gaming

Page 23: Software Analytics: Data Analytics for Software Engineering

StackMinePerformance debugging in the large

via mining millions of stack traces

http://research.microsoft.com/en-us/groups/sa/stackmine_icse2012.pdf

Page 24: Software Analytics: Data Analytics for Software Engineering

Performance issues in the real world• One of top user complaints

• Impacting large number of users every day

• High impact on usability and productivity

High Disk I/O High CPU consumption

Page 25: Software Analytics: Data Analytics for Software Engineering

Performance issues in the real world• One of top user complaints

• Impacting large number of users every day

• High impact on usability and productivity

High Disk I/O High CPU consumption

As modern software systems tend to get more and more complex, given limited time and resource before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance.

Page 26: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Page 27: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Trace StorageTrace collection

Network

Page 28: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Trace StorageTrace collection

Network

Trace analysis

Page 29: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Trace StorageTrace collection

Bug DatabaseNetwork

Trace analysis

Bug filing

Page 30: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Trace StorageTrace collection

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

Bug filing

Page 31: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Pattern Matching

Trace StorageTrace collection

Bug update

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

Bug filing

Page 32: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Pattern Matching

Trace StorageTrace collection

Bug update

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

Bug filing

Key to issue

discovery

Page 33: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Pattern Matching

Trace StorageTrace collection

Bug update

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

Bug filing

Key to issue

discovery

Bottleneck of

scalability

Page 34: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Pattern Matching

Trace StorageTrace collection

Bug update

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

How many issues are

still unknown?

Bug filing

Key to issue

discovery

Bottleneck of

scalability

Page 35: Software Analytics: Data Analytics for Software Engineering

Performance debugging in the large

Pattern Matching

Trace StorageTrace collection

Bug update

Problematic Pattern

Repository Bug DatabaseNetwork

Trace analysis

How many issues are

still unknown?

Which trace file should I

investigate first?

Bug filing

Key to issue

discovery

Bottleneck of

scalability

Page 36: Software Analytics: Data Analytics for Software Engineering

Problem definition

Given OS traces collected from tens of thousands

(potentially millions) of users,

help domain experts identify impactful program execution

patterns

(that cause the most impactful underlying performance problems)

with limited time and resource.

Page 37: Software Analytics: Data Analytics for Software Engineering

Challenges

Combination of expertise• Generic machine learning tools without domain

knowledge guidance do not work well

Highly complex analysis• Numerous program runtime combinations triggering

performance problems

• Multi-layer runtime components from application to

kernel being intertwined

Large-scale trace data• TBs of trace files and increasing

• Millions of events in single trace streamInternet

Page 38: Software Analytics: Data Analytics for Software Engineering

Technical highlights

• Machine learning for system domain

– Formulate the discovery of problematic execution patterns as

callstack mining & clustering

– Systematic mechanism to incorporate domain knowledge

• Interactive performance analysis system

– Parallel mining infrastructure based on HPC + MPI

– Visualization aided interactive exploration

Page 39: Software Analytics: Data Analytics for Software Engineering

Impact“We believe that the MSRA tool is highly valuable and much more

efficient for mass trace (100+ traces) analysis. For 1000 traces, we

believe the tool saves us 4-6 weeks of time to create new signatures,

which is quite a significant productivity boost.”

Highly effective new issue discovery on Windows

mini-hang

Continuous impact on future Windows

versions

Page 40: Software Analytics: Data Analytics for Software Engineering

XIAOScalable code clone analysis

2012

http://research.microsoft.com/jump/175199

Page 41: Software Analytics: Data Analytics for Software Engineering

XIAO: Code Clone Analysis

• Motivation

– Copy-and-paste is a common developer behavior

– A real tool widely adopted internally and externally

• XIAO enables code clone analysis in the following way

– High tunability

– High scalability

– High compatibility

– High explorability

Page 42: Software Analytics: Data Analytics for Software Engineering

High tunability – what you tune is what you get

• Intuitive similarity metric: effective control of the

degree of syntactical differences between two code

snippets

for (i = 0; i < n; i ++) {

a ++;

b ++;

c = foo(a, b);

d = bar(a, b, c);

e = a + c; }

for (i = 0; i < n; i ++) {

c = foo(a, b);

a ++;

b ++;

d = bar(a, b, c);

e = a + d;

e ++; }

Page 43: Software Analytics: Data Analytics for Software Engineering

High scalability• Four-step analysis process

• Easily parallelizable based on source code partition

Pre-processing

Pruning Fine Matching

Coarse Matching

Page 44: Software Analytics: Data Analytics for Software Engineering

High compatibility

• Compiler independent

• Light-weight built-in parsers for C/C++ and C#

• Open architecture for plug-in parsers to support

different languages

• Easy adoption by product teams

– Different build environment

– Almost zero cost for trial

Page 45: Software Analytics: Data Analytics for Software Engineering

High explorability

1. Clone navigation based on source tree hierarchy

2. Pivoting of folder level statistics

3. Folder level statistics

4. Clone function list in selected folder

5. Clone function filters

6. Sorting by bug or refactoring potential

7. Tagging

1 2 3 4 5 6

7

1. Block correspondence

2. Block types

3. Block navigation

4. Copying

5. Bug filing

6. Tagging

1

2

3

4

1

6

5

Page 46: Software Analytics: Data Analytics for Software Engineering

Scenarios & SolutionsQuality gates at milestones• Architecture refactoring

• Code clone clean up

• Bug fixing

Post-release maintenance• Security bug investigation

• Bug investigation for sustained engineering

Development and testing• Checking for similar issues before check-in

• Reference info for code review

• Supporting tool for bug triage

Online code clone search

Offline code clone analysis

Page 47: Software Analytics: Data Analytics for Software Engineering

Benefiting developer community

Available in Visual Studio 2012 RC

Searching similar snippets

for fixing bug onceFinding refactoring

opportunity

Page 48: Software Analytics: Data Analytics for Software Engineering

More secure Microsoft products

Code Clone Search service integrated into

workflow of Microsoft Security Response Center

Over 590 million lines of code indexed across

multiple products

Real security issues proactively identified and addressed

Page 49: Software Analytics: Data Analytics for Software Engineering

Example – MS Security Bulletin MS12-034Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published: Tuesday, May 08, 2012

3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (office), Silver Light, Windows Journal viewer

Microsoft Technet Blog about this bulletin

However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.

Page 50: Software Analytics: Data Analytics for Software Engineering

SASIncident management of online services

http://research.microsoft.com/apps/pubs/?id=202451

Page 51: Software Analytics: Data Analytics for Software Engineering

Motivation

Incident Management (IcM) is a critical task to

assure service quality

• Online services are increasingly popular & important

• High service quality is the key

Page 52: Software Analytics: Data Analytics for Software Engineering

Incident Management: Workflow

Detect a

service

issue

Alert On-

Call

Engineers

(OCEs)

Investigate

the problem

Restore

the

service

Fix root cause

via

postmortem

analysis

Page 53: Software Analytics: Data Analytics for Software Engineering

Incident Management: Characteristics

Shrink-Wrapped

Software Debugging

Root Cause and Fix

Debugger

Controlled

Environment

Online Service

Incident

Management

Workaround

No Debugger

Live Data

Page 54: Software Analytics: Data Analytics for Software Engineering

Incident Management: Challenges

Large volume and noisy data

Highly complex problem space

No knowledge of entire system

Knowledge not well organized

Page 55: Software Analytics: Data Analytics for Software Engineering

SAS: Incident management of online services

SAS, developed and deployed to effectively reduce MTTR

(Mean Time To Restore) via automatically analyzing

monitoring data

5

5

Design Principle of SAS

Automating Analysis

Handling Heterogeneity

Accumulating Knowledge

Supporting human-in-the-loop (HITL)

Page 56: Software Analytics: Data Analytics for Software Engineering

Techniques Overview

• System metrics

– Identifying Incident Beacons

• Transaction logs

– Mining Suspicious Execution Patterns

• Historical incidents

– Mining Historical Workaround Solutions

Page 57: Software Analytics: Data Analytics for Software Engineering

Industry Impact of SAS

Deployment

• SAS deployed to

worldwide datacenters for

Service X (serving

hundreds of millions of

users) since June 2011

• OCEs now heavily depend

on SAS

Usage

• SAS helped successfully

diagnose ~76% of the

service incidents assisted

with SAS

Page 58: Software Analytics: Data Analytics for Software Engineering

Code Hunt/Pex4FunTeaching and learning via interactive gaming

http://research.microsoft.com/pubs/189240/icse13see-pex4fun.pdf

http://pex4fun.com/https://www.codehunt.com/

Page 59: Software Analytics: Data Analytics for Software Engineering

Coding Duels

1,841,880 clicked 'Ask Pex!'

http://pex4fun.com/

Page 60: Software Analytics: Data Analytics for Software Engineering

Code Hunt: Resigned As Gamehttps://www.codehunt.com/

Page 61: Software Analytics: Data Analytics for Software Engineering

Behind the Scene

Secret Implementation

class Secret {

public static int Puzzle(int

x) {

if (x <= 0) return 1;

return x * Puzzle(x-1);

}

}

Player Implementation

class Player {

public static int Puzzle(int x) {

return x;

}

}

class Test {

public static void Driver(int x) {

if (Secret.Puzzle(x) != Player.Puzzle(x))

throw new Exception(“Mismatch”);

}

}

behaviorSecret Impl == Player Impl

62

Page 62: Software Analytics: Data Analytics for Software Engineering

Coding Duels: Fun and Engaging

Iterative gameplay

Adaptive

Personalized

No cheating

Clear winning criterion

Page 63: Software Analytics: Data Analytics for Software Engineering

Teaching and Learning

Page 64: Software Analytics: Data Analytics for Software Engineering

Coding Duels for Course Assignments@Grad Software Engineering Course

http://pexforfun.com/gradsofteng

Observed Benefits

• Automatic Grading

• Real-time Feedback (for Both Students and Teachers)

• Fun Learning Experiences

Page 65: Software Analytics: Data Analytics for Software Engineering

Coding Duel Competition

@ICSE 2011

http://pexforfun.com/icse2011

Page 66: Software Analytics: Data Analytics for Software Engineering

Example User Feedback

“It really got me *excited*. The part that got me most

is about spreading interest in teaching CS: I do think

that it’s REALLY great for teaching | learning!”

“I used to love the first person shooters and

the satisfaction of blowing away a whole team

of Noobies playing Rainbow Six, but this is far

more fun.”

“I’m afraid I’ll have to constrain myself to spend just an

hour or so a day on this really exciting stuff, as I’m

really stuffed with work.”

Released since 2010

X

Page 67: Software Analytics: Data Analytics for Software Engineering

Usage ScenariosStudents: proceed through a sequence on puzzles to learn and practice.

Educators: exercise different parts of a curriculum, and track students’ progress

Recruiters: use contests to inspire communities and results to guide hiring

Researchers: mine extensive data in Azure to evaluate how people code and learn

http://taoxie.cs.illinois.edu/publications/icse15jseet-codehunt.pdfICSE 2015 JSEET:

http://taoxie.cs.illinois.edu/publications/icse13see-pex4fun.pdfICSE 2013 SEE:

Page 68: Software Analytics: Data Analytics for Software Engineering

Conclusion

Software

Users

Software

Development

Process

Software

System Vertical

Horizontal

Information Visualization

Data Analysis Algorithms

Large-scale Computing

Page 69: Software Analytics: Data Analytics for Software Engineering

Q & A

http://taoxie.cs.illinois.edu/

Contact: [email protected]

Page 70: Software Analytics: Data Analytics for Software Engineering

Conclusion

Software

Users

Software

Development

Process

Software

System Vertical

Horizontal

Information Visualization

Data Analysis Algorithms

Large-scale Computing

Page 71: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 72: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 73: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 74: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 75: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 76: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 77: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 78: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 79: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 80: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 81: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 82: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 83: Software Analytics: Data Analytics for Software Engineering

Code Hunt Programming Game

Page 84: Software Analytics: Data Analytics for Software Engineering

Leaderboard and Dashboard

Publically visible, updated during the contest

Visible only to the organizer

Page 85: Software Analytics: Data Analytics for Software Engineering

It’s a game!

iterative gameplay

adaptive

personalized

no cheating

clear winning criterion

code

test cases