Upload
lytruc
View
224
Download
6
Embed Size (px)
Citation preview
Experimentation in Software Engineering: Theory and Practice
Part I – Planning and Designing your Experiment
Massimiliano Di PentaUniversity of Sannio, Italy
Outline
Empirical studies in software engineering Different kinds of study Experiment definition and planning Experiment design Analysis of threats to validity Experiment operation On Sunday:
Analysis of results
Social aspects in software development Why we bother with experiments with human
subjects?
In other disciplines the human factor does not play an important role Physics, traditional engineering branches
What about software engineering? Human component essential part of the development task The usefulness of a method/tool depends on who is going to
use it We have many commonalities with social sciences Experimentation becomes more complex
Kinds of empirical studies Survey: Retrospective (post mortem), e.g. about a
technology/tool being adopted for a period of time Case study: monitoring an ongoing (real) project Experiment: performed in a laboratory setting, with a
high degree of control Objective: manipulate some variables (e.g. method A vs.
method B) and control others (e.g. ability, experience, experimental objects)
Quasi-experiments: you could not really control all variables
Survey Collecting opinions, market analysis
Example: whether a development process is becoming popularin industry
Use of questionnaires to collect data
Characteristics: Intended to understand the entire population, not just the
sample
Often we can really observe a limited number of variables
Of course we can collect data from many variables and observe some of them only
Various kinds of surveys Descriptive: analyze the distribution of some attributes
on a population Distribution of Java knowledge among software developers
Explanatory: try to explain some phenomenon Why developers prefer a technique or the other
Exploratory: preliminary to further studies Understand the developers’ characteristics before an
experiment
Case study Investigate a phenomenon in a specific time frame
E.g. evaluate the use of a technique on a real project
Study the application of SE techniques in industry settings
Differences from experiments
Experiments sample on manipulated variables
Case studies look at a real situation
Pros:
Easy to design
Often more realistic setting than experiments
Cons:
Results not generalizable
(Controlled) Experiments Need to apply more treatments to evaluate results
Compare two different testing techniques
Use of UML models vs. stereotyped models
For each variable involved, I need more measures The same technique should be applied by more subjects
Could be performed on line High level of control
Limited time, thus need for easy tasks
…or off-line Lower level of control
Could involve more complex tasks
Experiments are useful for: Confirm known theories Confirm (or sometimes contradict) a common
wisdom Explore relations existing among variables Evaluate the performances of a method Many times you believe something is true
…but then you might discover many nice surprises
Kinds of empirical studies Quantitative: to get numerical relations among
variables Are programmers more productive with Java than with C#? Are defects correlated with Chidamber-Kemerer metrics?
Qualitative: to interpret a phenomenon just observing it in its context E.g. by using explanations obtained by interviewing
developers I interview developers to know why a given method
improves their productivity Live interview Survey questionnaires
Often quantitative studies should be combined with qualitative ones to better interpret them
In vitro or in vivo
In vitro: Performed in laboratory Controlled conditions Reasonable costs, low risks
Experiments carried out with students to evaluate the effectiveness of a testing technique
Reality could be different
I can use the experiment to prepare furtherstudies in a more realistic context
In vitro vs in vivo studies
In vivo: Real projects Cannot control the experimental conditions More realistic settings and subjects Results may be different Higher costs Possibly unacceptable risks
Can do it when we are sure that the study ismature In vitro experiments provide encouraging results
Running example
Use of UML Stereotypes in comprehension and maintenance tasks Filippo Ricca, Massimiliano Di Penta, Marco Torchiano, Paolo Tonella,
Mariano Ceccato: How Developers' Experience and Ability Influence Web Application Comprehension Tasks Supported by UML Stereotypes: A Series of Four Experiments. IEEE Trans. Software Eng. 36(1): 96-118 (2010)
Filippo Ricca, Massimiliano Di Penta, Marco Torchiano, Paolo Tonella, Mariano Ceccato: The Role of Experience and Ability in Comprehension Tasks Supported by UML Stereotypes. ICSE 2007: 375-384
In the following briefly referred as “Conallen”
Motivations
General-purpose notations not adequate Solution: domain-specific languages
Example: Web applications Several notation have been proposed WebML, WSDM, OOHDM, or… WAE (Conallen’s UML stereotypes)
Extends basic UML with stereotypes that model Web application pages and their relationships
Basic UML vs. Stereotypes
Basic UML Conallen’s Stereotypes
Does it enhance comprehension
during maintenance?
Who would benefit more of such an
enhanced notation?
Where do we start?
We have an idea/conjecture about a cause effect relation
We have a theory Thus we can formulate a hypothesis And to test it.. we run an experiment!
Experiment principles
CauseConstruct
EffectConstruct
Treatment Outcome
Experiment objective
cause-effect construct
treatment-outcome construct
Experiment operationIndependent
variableDependent
variable
Theory
Observation
Terminology - I
Dependent (or response) variables: variableswe are interested to study
Independent variables: variables we control Example: evaluate productivity (dependentvariable) based on development method, skills, tool (independent variables)
Process…Indep.
variables
Dependentvariables
Terminology – II
The experiment studies how changesoccurring on independent variables (factors) influence a dependent variable
A treatment is a particular value for a factor e.g. I want to study how effective is a new
development method (main) factor: development method Treatments (2): the old method and the new one
Terminology – III
A treatment is applied on a combination of subjects and objects
An experiment is a set of tests (o trials) defined as combinations of treatments, subjects and objects Joe (subject) uses the new development method
(treatment) to develop the program A (object) The number of tests influences the ability of
making statistically significant conclusions
Controlling the variables
Process…Independent
Variables
DependentvariableExp.
design…
Treatment
Fixed independent variables
Steps of an experimentalprocess
Definition
Planning
Operation
Analysis &interpretation
Presentation &package
Idea
Conclusions
Experimentation
Definition Phase
Based on Goal-Question Metrics[Basili, 93]
Poses the basis for the experimentation Wrong definition useless results
Idea Define Experimentdefinition
Goal Definition Template
Analyze <Object(s) of study>for the purpose of <Purpose>with respect to their <Quality focus>from the point of view of the <Perspective>in the context of <Context>
Goal Definition Template - II Object of study: Entity to study
Products, processes, theories, tools Purpose: intent of the experiment
Compare two techniques, characterize a learning process
Quality focus: Effect to study Effectiveness, cost, efficiency, precision…
Perspective: from what point of view should I interpret the results? Researcher, project manager, developer,…
Goal Definition Template - III Context: environment where the study
is carried out Subjects: experience, specific skills, etc. Objects: complexity, application domain,
etc.
Definition Framework
Object of study Purpose Quality focus Perspective Context
ProductProcessModelMetricTheory
CharacterizeMonitorEvaluatePredictControlChange
EffectivenessCostReliabilityMaintainabilityPortability
DeveloperMaintainerProject managerCorporatemanagerCustomerUserResearcher
SubjectsObjects
Example - I
Goal: Analyze the use of stereotyped UML diagrams with the purpose of evaluating their usefulness in Web application comprehension for different categories of users
Quality focus: high comprehensibility and maintainability Perspective: researchers, project managers Context:
Two Web apps: WfMS and Claros Undergrad and Graduate students from Trento and Unisannio,
researchers from both Trento and Unisannio
Planning
The Definition describes why we run anexperiment
The planning determines how the experiment will be executed
Indispensible for any engineering task
Planning: steps
Contextselection
Definition
Hypothesisformulation
Variableselection
Selection ofsubjects
Experimentdesign
InstrumentationValidity
evaluationExperiment
design
Experiment planning
Context selection
Set of objects and subjects involved in the experiment
4 dimensions Off-line vs. on-line Students vs. professionals Toy problems vs. real problems Specific vs. general
Selection of objects In many experimental designs you might need more
than one object Good to vary among domains But their complexity should not be too different The objects should be simple enough to allow
performing the task in a limited time frame But if possible avoid toy examples (sub) systems of small-medium OSS
Sometimes you have to prepare them E.g. inject faults etc. Be careful to avoid biasing the experiment Check against mistakes in the objects, a major cause of
experiment failures!
Objects: Conallen study
Two Web-based systems developed in Java WfMS (Workflow management system) Claros (Web mail system)
Selection of Subjects Influences the possibility of generalizing our
results Need to sample the population Probabilistic sampling
Simple random sampling, systematic sampling, stratified random sampling
Convenience sampling Just select the available subjects ..or those more appropriate
Often convenience sampling is the only way toproceed
Experiments with students Very often the easiest way to run your
experiments is to do it within a course Need to go through an Ethical committee If done well, it could be a nice exercise and
students will appreciate! Hints:
Don't make it compulsory Don't evaluate students on their
performance in the experiment Provide them a little reward for their
participation
Experiments with professionals More realistic … but more difficult to achieve Suppose you want to do an experiment with
30 subjects, lasting 6 hours in total How much does it cost?
Hint First, do experiments with students Then you could do small replications with
professionals … or case studies
Experiments with students: the good and the bad
Many students could be more expert and better trained on some techniques you want to experiment than professionals
They have the same experience of junior developers
The setting is different, students don't have the pressure industrial developers have when completing a project and meeting deadlines
The experience is different from senior developers able to face off tough problems
Assessing subjects Important to:
Influence the sampling (whenever possible) Assign subjects with different experience and
ability uniformly across experimental groups i.e. avoid that one treatment is performed mainly by high
ability or low ability subjects
Ability could be assesses a-priori Bachelor/laurea degree, previous exams grades
Same for experience Better to discretize
Divide ability and experience in macro-categories High, Low
Pre-test
Better way to assess ability Option 1: ask subjects to self-assess
themselves Easy, but could be subjective
Option 2: ask subjects to perform a task related to what they will do in the experiment E.g. understanding source code Expensive to evaluate
Example of pre-testPlease rate your knowledge of the following subjects:(Answers: 1. Very poor; 2. Poor; 3. Satisfactory; 4. Good; 5. Very good.)
1 English 1 2 3 4 5
2 Java programming 1 2 3 4 5
3 Eclipse IDE for Java 1 2 3 4 5
4 Understanding/evolving existing systems 1 2 3 4 5
Please indicate the number of years you have been practicing the following activities:
5 Programming (any programming language): ______ years
6 Java programming: ______ years
7 Performing maintenance on an existing code base: ______ years
Conallen study: subjects
74 among students and researchers Exp I - Trento (13 Master students) Exp II - Trento (28 Bachelor students) Exp III - Benevento (15 Master students) Exp IV – Trento/Benevento (8 Researchers)
Hypothesis formulation The experiment aims at rejecting a null hypothesis We can reject the null hypothesis we can draw conclusions
2 hypotheses: Null hypothesis H0: there do not exist trend/patterns in the
experimental setting: the occurred differences are due tochances Example: there is no difference in code comprehension with the
new technique and the old one H0 µNold= µNnew
Alternative hypothesis Ha: in favor of which the null hypothesisis rejected Example: the new technique allows a better level of code
comprehension than the old one H0 µNold< µNnew
One-tailed vs. two-tailedOne tailed: We are interested to see
whether one mean was higher than the other Not interested in whether
the first mean was lower than the other
One side of the probability distribution
Two-tailed:to see if two means are
different from each otherWe don’t know a priori the
direction of the differenceBoth sides of the probability
distribution
ExamplesOne-tailed: We would like to see if
additional documentation improves the software comprehension level We don’t care to test if this
decreases the comprehension level
We would like to see if complementing testing technique A with technique B would increase the number of faults discovered We are testing the
significance of the increment
Two-tailed:We would compare the
effort/time needed to perform a task with two techniqueWe don’t know which one
requires more timeNumber of faults discovered
with two different testing techniquesWe don’t know which one is
better
Example: Conallen study
Null Hypotheses: H0: use of stereotypes does not influence comprehension
One tailed H0e: subjects’ ability does not interact with the main factor H0a: subjects’ experience does not interact with main factor H0ea: no interaction ability, experience, and main factor
IMPORTANT! An experiment does not prove any theory, it
can only fail to reject an hypothesis The logic of scientific discovery
[Popper, 1959] Any statement made in a scientific field is true
until anybody can contradict it Thus…
Our experiments can only say something if they reject null hypotheses
If we don’t reject our H0 we cannot really say thatwe reject Ha
Well.. In practice we could do it after severalreplications…
Dependent variables… Used to measure the effect of treatments Derived from hypotheses Sometimes not directly measured need for indirect
measures Validation needed, possible threats
Need for specifying the measure scale Nominal, ordinal, interval, ratio, absolute
Need for specifying the range If for different systems the variable assumes too different
levels, need to normalize
M norm= M− minmax− min
Nominal Scale
Labeling/Classification Any numerical representation of classes
is acceptable No order relation Symbols are not associated to particular
values
Nominal Scale: Example
Localize where a fault is located requirement, design, code
M(x) =
1 if x is a specification fault2 if x is a design fault3 if x is a code fault
{
Ordinal Scale
Order relation among categories Classes ordered wrt. an attribute
Any mapping preserving ordering is acceptable E.g. numbers where higher numbers correspond
to higher classes
Numbers just represent rankings Additions, subtractions and other operations are
not applicable
Ordinal Scale: Example Capture the subjective complexity of a method using
terms “trivial”, “simple”, “moderate”, “complex” e “incomprehensible”
Implicit “less than” relation “trivial” less complex than “simple” etc.
M(x)=
1 if x is trivial2 if x is simple3 if x is moderate
4 if x is complex5 if x is incomprehensible
Interval Scale Captures information about the size of intervals separating
classes Preserves ordering Preserves the difference operation but does not allow
comparisons I can compute the difference between two classes but not the
ratio Addition and subtraction allowed, multiplication and division not
possible
Examples: calendar, temperature scales Given two mappings M e M’, it is always possible to find 2
numbers a>0 e b such that:M’=aM+b
Interval Scale: Example I
Temperature can be represented using Celsius or Fahrenheit scales Same interval The temperature in Rome increases from 20°C to
21°C The temperature in Washington increases from
30°F to 31°F Washington is not 50% warmer than Rome! Transformation from C to F:
F=9/5 C + 32
Interval Scale: Example II
I can transform M1 into M3 using the formula M3=2M1+1.1
M1(x)=
1 if x is trivial
2 if x is simple
3 if x is moderate
4 if x is complex
5 if x is incomprehensible
M2(x)=
0 if x is trivial
2 if x is simple
4 if x is moderate
6 if x is complex
8 if x is incomprehensible
M3(x)=
3.1 if x is trivial
5.1 if x is simple
7.1 if x is moderate
9.1 if x is complex
11.1 if x is incomprehensible
Ratio Scale Preserves ordering, size of intervals, and ratio
between entities There is a null element (zero attribute)
indicating the absence of a value The mapping starts from the zero value and
increases with equal intervals (units) Any arithmetic operation makes sense Transformations are in the form
M=aM’where a is a positive scalar
Ratio Scale: Examples
Length of an object in cm An object is twice another
Length of a program in LOC A program is twice longer than another
Absolute Scale
Given two measures, M e M’ only the identity transformation is possible Measures performed just counting
elements Any arithmetic operation possible
Absolute Scale: Examples
Failures detected during integration testing Developers working on a project What about LOC?
If LOC measure the size of a program, they are in ratio scale I could measure size differently (statements, kbytes…)
If they are just lines of code, then the scale isabsolute
Conallen study
Dependent Variable: comprehension level Assessed through a questionnaire 12 questions per task Covering both system specific and generic
changes Subjects had to answer by listing items
Measured by means of Precision, Recall and F-Measure
Standard information retrieval metrics Comprehension level is the mean across questions
QuestionsSample question:
Q2: Suppose that you have to substitute, in the entire application, the form-based communication mechanism between pages with another mechanism (i.e. Applet, ActiveX, ...).
Which classes/pages does this change impact?
Q2: Suppose that you have to substitute, in the entire application, the form-based communication mechanism between pages with another mechanism (i.e. Applet, ActiveX, ...).
Which classes/pages does this change impact?
CORRECT ANSWER: main.jsp, login.jsp, start.jsp
F-Measure s,i=2 precisions,i recall s,i
precisions,i +recall s,i= 2 0 . 5 1
1+ 0 .5= 0 .67
precisions,i=C i∩ As,i
As,i= 3
6= 0 .5
recall s,i=C i∩As,i
C i= 3
3= 1
Sample answer:
Independent variables Variables we can control and modify
Of course a lot depends on the experimental design!
The choice depends on the domain knowledge As usual we need to specify scale and range One independent variable is the main factor of our
experiment Often one level for the control group
E.g. use of old/traditional technique/tool
One or more levels for experimental groups E.g. use of new technique(s) tool(s)
Other independent variables are the co-factors
Co-factors Our main (experimented) factor is of course
not the only variable influencing the dependent variable(s)
There are other factors Co-factors or sometimes confounding factors
In a good experiment limit their effect through a good experimental
design able to separate their effect from main factors analyze the interaction with main factors
Of course we would never account for all possible co-factors
Conallen Study: Co-factors
Main factor treatments: Pure UML vs stereotyped (Conallen)
Co-Factors: Lab {Lab1, Lab2}
Ability {High, Low}
Experience {Grad, Undergrad}
System {Claros, WfMS}
Experiment Design
Is the set of treatment tests Combinations of treatments, subjects and objects
Defines how tests are organized and executed
Influences the statistical analyses we can do Based on the formulated hypotheses Influences the ability of performing
replications And combining results
Basic Principles - I Experimental design is based on three
principles1. Randomization2. Blocking3. Balancing
Randomization: observation must be madeon random variables Influences the allocation of objects, subjects and
the ordering in which tests are performed Useful to mitigate confounding effects
E.g. influence of objects, learning effect
Basic Principles - II Blocking: sometimes some factors influence
our results but we want to mitigate theireffects I can split my population in blocks with same (or
similar) level of this factor e.g. subjects’ experience
Balancing: I should try to have the same (or similar) number of subjects for eachtreatment Simplifies the statistical analysis Not strictly needed and sometimes we cannot
achieve a perfect balancing
Different kinds of design
One factor and two treatments one factor and >2 treatments Two factors and two treatments >2 factors, each one with two
treatments
One factor and two treatments
Notation µi: dependent variable mean for treatment i yij: j-th measure of the dependent variable for
treatment i Example:
I’d like to experiment whether a new design produces less fault-prone code than the old design Factor: design method Treatments:
1. New method2. Old method
Dependent variable: number of faults detected
Completely randomized design
Examples of hypotheses: H0: µ1= µ2
Ha: µ1≠ µ2, µ1< µ2 o µ1> µ2
Analyses t-test (unpaired) Mann-Whitney test
Subjects Treatment 1 Treatment 2
1 X
2 X
3 X
4 X
5 X
6 X
Paired comparison design
Each subject applies differenttreatments Need to have different objects
Need to minimize the ordering effect
Examples of hypotheses: Given dj=y1j-y2j
Given µd the mean of differences H0: µd= 0 Ha: µd≠ 0, µd<0 o µd>0
Analyses: Paired t-test Sign test Wilcoxon
Subjects Treatment 1
Treatment 2
1 2 1
2 1 2
3 2 1
4 2 1
5 1 2
6 1 2
How to instantiate it
Subjects should work on different systems/objects on different labs To avoid learning effects
Different possible orderings of main factor treatments between labs
Different possible orderings of systems/objects
Example: Conallen
ClarosClarosWfMSWfMS
Lab 2
WfMSWfMSClarosClaros
Lab 1
Group 4Group 3Group 2Group 1
ConallenUML
UML UML
C o n a l le n
Conallen Conallen
UML
Subjects received: Short description of the application Diagrams Source code
One factor and >2 treatments
Example: Fault proneness wrt. Programming
language adopted C, C++, Java
Completely randomized design
Example of hypotheses: H0: µ1= µ2= µ3= µa
Ha: µi≠ µj for at least one pair (i, j)
Analyses: ANOVA
(ANalysis Of VAriance) Kruskal-Wallis
Subjects Treat-ment 1
Treat-ment 2
Treat-ment 3
1 X
2 X
3 X
4 X
5 X
6 X
Randomized complete block design
Example of hypotheses:
H0: µ1= µ2= µ3= µa Ha: µi≠ µj for at least
one pair (i, j) Analyses:
ANOVA (ANalysis Of VAriance)
Kruskal-Wallis Repeated Measures
ANOVA
Subjects Treat-ment 1
Treat-ment 2
Treat-ment 3
1 1 3 2
2 3 1 2
3 2 3 1
4 2 1 3
5 3 2 1
6 1 2 3
Two Factors The experiment becomes more complex The hypothesis need to be split into three hypotheses
Effect of the first factor Effect of the second factor Effect of the interaction between the two factors
Notation: τi: effect of treatment i on factor A βj: effect of treatment j on factor B (τβ)ij: effect of interaction between τi and βj
Example: Investigate the comprehensibility of design documents
Structured vs. OO design (factor A) Well-structured vs. poorly structured documents (factor B)
2*2 factorial design
Examples of hypotheses: H0: τ1= τ2=0 Ha: at least one τi ≠ 0
H0: β1= β2=0 Ha: at least one βj ≠ 0
H0: (τβ)ij=0 for each i,j Ha: for each (τβ)ij ≠ 0
Analysis: ANOVA
(ANalysis Of VAriance)
Factor A
Treatment A1
Treatment A2
FactorB
Treatment B1
Subjects4, 6
Subjects1, 7
Treatment B2
Subjects2, 3
Subjects5, 8
Two-stage nested design - I
Hierarchical design Useful when a factor is similar, but not
identical for different treatments of the otherfactor
Example: Evaluate the effectiveness of a unit testing
strategy OO Programs and procedural programs (factor A) Presence of defects (factor B) Factor B is slightly different for OO and procedural code
Two-stage nested design - II
Factor ATreatment A1 Treatment A2
Factor B Factor BTreatment
B1'Treatment
B2'Treatment
B1''Treatment
B2''
Subjects: 1,3
Subjects: 6,2
Subjects:7,8
Subjects: 5,4
More than two factors
Need to evaluate the impact of the dependent variable of differentinteracting co-factors
Factorial design In the following we will consider
examples limited to two treatments only
2k factorial design
Generalizes the 2*2 (# of factors k=2)
2k treatment combinations
Factor A Factor B Factor C Subjects
A1 B1 C1 2, 3
A2 B1 C1 1, 13
A1 B2 C1 5, 6
A2 B2 C1 10, 16
A1 B1 C2 7, 15
A2 B2 C2 8, 11
A1 B1 C2 4, 9
A2 B2 12, 14C2
2k fractional factorial design
Disadvantage Number of combinations increasing with
the # of factors Therefore:
Some interactions could be useless to be analyzed
We could analyze only some combinations
One-half fractional factorialdesign
Considers half of the 2k
design combinations Selection performed
such that if a factor is removed, the remainingdesing is a 2k-1 factorialdesign
Two alternative fractions Performed in sequence
(replications) you willobtain a 2k factorialdesign
FactorA
FactorB
FactorC
Subjects
A1 B1 C2 2, 3
A2 B1 C1 1, 8
A1 B2 C1 5, 6
A2 B2 C2 4, 7
FactorA
FactorB
FactorC
Subjects
A1 B1 C1 2, 3
A2 B1 C2 1, 8
A1 B2 C2 5, 6
A2 B2 C1 4, 7
One-quarter fractional factorialdesign One quarter of the 2k combinations If you remove 2 factors the remaining design
is a 2k-2 factorial design Dependences between factors Four alternatives
In sequence (replications) allow to obtain a 2k
factorial design
One-quarter fractional factorialdesign: Example
FactorA
FactorB
FactorC
FactorD
FactorE
Subj.
A1 B1 C1 D2 E2 3, 16
A2 B1 C1 D1 E1 7, 9
A1 B2 C1 D1 E2 1, 4
A2 B2 C1 D2 E1 8, 10
A1 B1 C2 D2 E1 5, 12
A2 B1 C2 D1 E2 2, 6
A1 B2 C2 D1 E1 11, 15
A2 B2 D2 E2 13, 14
D depends on a combinationof A and B We have D2
for each combination of A1 B1 or A2 B2
Similarly,E depends on a combinationof A and C
C2
One-quarter fractional factorialdesign: Example
If I removefactors C and E (or B and D), it becomesa doublereplication of a 23-1 factorialdesign
1
2
FactorA
FactorB
FactorC
FactorD
FactorE
Subj.
A1 B1 C1 D2 E2 3, 16
A2 B1 C1 D1 E1 7, 9
A1 B2 C1 D1 E2 1, 4
A2 B2 C1 D2 E1 8, 10
A1 B1 C2 D2 E1 5, 12
A2 B1 C2 D1 E2 2, 6
A1 B2 C2 D1 E1 11, 15
A2 B2 D2 E2 13, 14C2
One-quarter fractional factorialdesign: Example
FactorA
FactorB
FactorC
FactorD
FactorE
Subj.
A1 B1 C1 D2 E2 3, 16
A2 B1 C1 D1 E1 7, 9
A1 C1 D1 E2 1, 4
A2 B2 C1 E1 8, 10
A1 B1 C2 D2 E1 5, 12
A2 B1 C2 D1 E2 2, 6
A1 B2 C2 D1 E1 11, 15
A2 B2 C2 D2 E2 13, 14
If I remove D and E, itbecomes a 23
full factorialdesign withfactorsA, B, C
B2
Experimental design: conclusions
Essential choice when doing anexperiment
Conclusions we may make depend on the kind of design we choose
Constraints on statistical methods If possible, use a simple design Maximize the usage of the available
subjects Often not many subjects available
Planning
Contextselection
Definition
Hypothesisformulation
Variableselection
Selection ofsubjects
Experimentdesign
InstrumentationValidity
evaluationExperiment
design
Experiment planning
Validity evaluation Crucial questions in the analysis of experiment results are
To what extent are our results valid? They should be at least valid for the population of interest Then, if we could generalize…
Be careful: having limited threats to validity does not meanability to generalize your results
Threats to validity [Campbell and Stanley, 63]1. Conclusion validity (C)2. Internal validity (I)3. Construct validity (S)4. External validity (E)
Threats to validity
Conclusion validity (C): concerns the relation between treatment and outcome There must be a statistically significant relation
Internal validity (I): concerns factors that can affect our results We don’t control nor measure it
Construct validity (S): relation between theory and observation The treatment should reflect the construct of the cause The outcome reflects the effect construct
External validity (E): concerns the generalization of results If there is a causal relation between construct and effect,
could this relation be generalized?
Mapping Experiment Principles
CauseConstruct
EffectConstruct
Treatment Outcome
Experiment objective
cause-effect construct
treatment-outcome construct
Experiment operationIndependent
variableDependent
variable
Theory
Observation
E
S S
C I
Conclusion validity - I Low statistical power:
Results not statistically significant There is a significant difference but the statistical test does not
reveal it due to the low number of data points
Violated assuptions of statistical tests I use a test when I could not Erroneous conclusions Many tests (e.g. t-test) assume normally distributed and linearly
independent samples
Fishing and the error rate I look at a particular result and use it to make conclusions Error rate: I do 3 tests on the same data sets with significance
level 0.05 the actual significance level would be (1-0.05)3=0.14 When doing multiple mean comparisons with a two-means or two
median test, I need to correct the p-value!
Conclusion validity - II Reliability of measures
I repeat a measure under the same conditions and should obtainthe same results
Could be due to wrong instrumentation Prefer objective measures over subjective ones
Reliability of treatment implementation The treatment implementation could vary when applied to different
subjects or in different context e.g. highly experienced developers would prefer command line
tools over graphical ones Some tools can create issues on slow computers I would limit the use of complex tools in experiments if possible!!!
Random irrelevancies in experimental setting External elements disturbing the experiment Even noise outside your room
Conclusion validity - III Random heterogeneity of subjects
The effect of such an heterogeneity could hide the effectof the main factor treatment
Students are often more homogeneous than professionals But I could have external validity problems
Internal validity –Single group threats - I No control group
Same group of subjects work with the old method and the new
Not sure if the effect was caused by the treatment or by a confounding factor
Single group threats: History: experiment performed in particular time frames
(after holidays, before exams…)
Maturation: as time passes subjects react differently Tiresome effect, boredome effect, learning effect
Testing: if I perform a test twice with the same subjects, the second time they already know something about the task Even if the task is slightly different
Internal validity –Single group threats - II
Instrumentation: Unclear forms, inprecision in measurement instruments
Statistical regression: suppose I do subjects’blocking based on a previous experiment I assume a subject would not perform well based on a
previous experiment, but it may not be the case for the new experiment
Selection: due to the performance variability of subjects Experiment with volutneers better motivated persons
than average, thus may not be representative
Internal validity –Single group threats - III
Mortality: some subjects may abandon the experiment Would this affect the representativeness of our sample?
E.g. imagine I loose all high ability subjects If a subject does not show up in multiple labs this limits
the data points for paired analysis
Ambiguity about direction of casual influence: A causes B, B causes A, or X causes A and B? e.g. correlation between complexity and fault-proneness
Complexity causes fault-proneness… (A) Could it be that fault-prone code (B) tend to be on average
more complex (A)? Or else problem-specific factors (X) make code more
complex (A) and fault-prone (B)
Internal validity –Multiple group threats
Arise when studying different groups A control group (on which I apply the old
method) and an experimental group (on which I apply the new method) are influencedby single group threats
Threats: Interactions with selection: different group react in
different way The maturation effect influences more a group than the
other e.g. a group learns faster than the other
The history effect influences more a group than the other
e.g. I’ve experimented with two groups in different dates
Internal validity –Social threats - I Applicable to single and multiple group
experiments Threats:
Diffusion or imitation of treatments: the control group imitates the experimental group e.g. the control group uses (even unconsciously) the new
method assuming this would help to perform better
Compensatory equalization of treatments: The control group should not be rewarded for using the
old method Nor we should let them use a third, alternative method
Internal validity –Social threats - II
Compensatory rivalry: subjects applying the lessdesirable treatment could be motivated to performbetter Thus you might see better results for the old method
than for the new one
Resentful demoralization: opposite of the previousproblem The most boring treatment could decrease performances
or even cause abandonment Sometimes subjects are better motivated with the new
stuffs…
Construct validity –Design threats - I Related to the experimental design and its possible
influence on the study construct Threats:
Inadequate preoperational explication of constructs:construct not well defined before being translated into measures Theory unclear Comparing two methods, but not clear what does mean that a
method is better than another
Mono-operation bias: I have one independent variable only, one single object or treatment the experiment could not represent the theory E.g. inspection conducted on a single document not
representative of the set of documents on which the techniqueis often applied
Construct validity –Design threats - II
Mono-method bias: I use a single type of measure If the measure is biased, it influences the results e.g. I should use different clone detector, different measures
for complexity…
Confounding constructs and levels of constructs: e.g. the (lack of) knowledge may or may not be a meaningful
variable for the experiment …while the years of experience could be a meaningful variable!
Interaction of different treatments: if I apply differenttreatments A and B to the same subject, the outcome couldbe due to treatment A, to treatment B, or to their interaction
Construct validity –Design threats - III
Interaction of testing and treatment: If subjects are aware that I’m measuring their mistakes,
they do their very best to avoid making any mistake
Restricted generalizability across constructs: a treatment could act positively on a construct and negatively on others difficult to generalize ourresults The new method improves the productivity However it reduces the maintainability, that I’m not
measuring What can I conclude? Does it make sense to propose the
new method?
Construct validity –Social threats - I
Hypothesis guessing: in this case subjects could act Positively, influencing the results towards the expected one Negatively, contradicting the expected results Note: in experiments we should not really expect any
result!!!
Evaluation apprehension: If I evaluate subjects based on how they perform in the experiment, this couldbias the results I’m experimenting a testing strategy, subjects might try to
identify more faults without following the strategy
Construct validity –Social threats - II
Experimenter expectancies: the scientist couldinfluence the experiment based on herexpectancies Questionnaire that “guides” the subjects towards
answers aimed at validating her own theory Someone else (not aware about the theory) should
prepare the questionnaire
External validity - I Threats:
Interaction of selection and treatment: the population of subjects is not representative of the one for which I wouldlike to generalize my results Performing experiments with students to use results in industry I do a code inspection experiment with programmers, while
testers would behave differently than programmers
Interaction of setting and treatment: the experimentalsetting or the material are not representative E.g. I let the subjects using tools that they don’t use in the
reality E.g. Web development using textual editors I use toy objects
External validity - II
Interaction of history and treatment: I perform the experiment in particular days and this couldinfluence the results Questionnaire about the system reliability compiled the
day after a major crash
In conclusion: we should make the experimental environment as much realisticas possible
Prioritize threats to validity Many threats are conflictual
When I reduce one threat, the other could increase Quite an optimization problem
Experiments with students Larger samples, higher homogeneity good conclusion validity
Low external validity
Use of different measures to make sure thattreatment and outcome well represent the construct Good construct validity Conclusion validity problems errors due to multiple measures
Conallen study: Threats to validity
Conclusion validity Statistical tests properly used F-Measure as aggregate measure precision and recall also
checked separately Construct validity
Questionnaires as a measure of comprehension Ability measure
Internal validity Abandonment Paired tests on few subjects (enough),
unpaired on all subjects Learning effect balanced by the design
External validity Different categories of students (and Universities) but… Results need to be extended to professionals
Operation
After having designed an experiment we needto execute it
We are in touch with subjects for the first time Besides the pre-experiment briefing and training
Although the design and plan are perfect, everything depends on the operation If something goes wrong in a couple of hours we
could waste months of work…
Experiment operation: steps
Preparation
Experimentdesign
Execution
Datavalidation
Experimentdata
Experiment operation
Preparation Obtain consent:
Participants agree with the research objectives But should not be aware of the hypotheses!
Explain how we will use the experiment results Participation should not be compulsory
Confidentiality: do not disseminate sensitive data E.g. participants’ productivity
Reward in some way participation
Preparation - Briefing Before the experiment, it is advisable to show a
short presentation for: Explaining the experiment and its objectives
Be careful to introduce any bias Be careful to hypothesis guessing
Introduce the objects (briefly describe them, show class diagrams, etc.)
Introduce the instrumentation (form, tool, etc.) Indicate where subjects can get forms/tools Where they can get documentation… Explain how to use tools if needed
Describe in detail the steps of the experiment Be careful with tiny details: how to name files to be sent back,
etc.
Instrumentation
Strongly influences the experiment outcome Tipi di instrumentation:
Objects: specification, code, documentation,… Guidelines:
Description of the experimental process Checklists Guidelines need to be complemented with a proper training
Measurement instruments: Paper-based forms, interviews, web-based tools Prepare questionnaires suitable for subjects’ skills Avoid intrusive questionnaires! Results should not depend on the measurement instrument
Preparation: instrumentation
Before the experiment the instrumentationmust be ready Objects, guidelines, tools, measurement
instruments Put data in a homogeneous format easier to analyze
Anonymous form if you don’t need to identify the participant
But then you might not contact her/him anymore
If interviews are planned, prepare the questions before the experiment
Conallen: Material
Description of the application URL of the “running” Web application Source code of the application (URL) Paper Diagrams (Conallen/Pure-UML) Visual UML Diagrams (Conallen/Pure-UML) Questionnaire
Conallen:Experimental Procedure
1. Read the description of the application2. Read a question of the questionnaire3. Understand it4. Infer the answer using: - Diagrams
- Code(If you want you can execute the application)
5. Compile the questionnaire
Execution
Different ways to execute an experiment Online
You can actually monitor the experiment
Offline Distribute the task via email and wait for results
Data collection
Manual: analyze forms, artifacts produced by the subjects Form vs. interviews
Forms do not require you to actively take part to the experiment execution
However interviews may reveal things that form could notreveal
Automatic: web based forms, automated analysis e.g. test case execution to analyze the correctness of
the produced code As in the Fit experiment
Post-Experiment Questionnaire
Used to understand: Whether anything went wrong with clarity of objectives
material, time available, tasks How much time (approx) subjects spent on particular
artifacts Whether they “felt” a particular method easier/better
than another
Qualitative information Used to explain quantitative results not to replace them!
Building your questionnaire
Responses using a Likert scale1.Strongly disagree2.Weakly disagree3.Uncertain4.Weakly Agree5.Strongly AgreeNA Not applicable
May or may not use itavoid if you want the subjectto lean towards a positive or negative answer
May or may not use itavoid if you want the subjectto lean towards a positive or negative answer
Example: Conallen
Asked the subjects to assess: Clarity of
task objectives and individual questions
Difficulty in reading diagrams code
Time spent on diagrams code
For task with Conallen's notation understandability of stereotypes usefulness of stereotypes
Pros and cons of survey questionnaires
Help to better understand the subjects behavior during an experiment
Can be used (of course) if you have developers available
Controlled experiments, in vivo case studies Not possible for MSR studies Risk of bias very high Sometimes subjects tend to be overly positive or
negative It remains a purely qualitative feedback Don’t try to make strong conclusions based only on
that
Interviewing
Alternative to survey questionnaire
Respondent better think about the answer
… but they could feel under pressure
The questions can be adapted case by case
…but the risk is to have a too unstructured set of
answers
Difficult to make comparisons
Contacting team members
Questionnaires/surveys cannot be applied to
MSR studies
However it is worth trying to contact project
contributors / core project members
Instead of just “guessing”
They may or may not respond, but it costs
nothing…
Using Eye Tracking tools
Yusuf, S., Kagdi, H., and Maletic, J. I. 2007. Assessing the Comprehension of UML Class Diagrams via Eye Tracking. In Proceedings of the 15th IEEE international Conference on Program Comprehension(June 26 - 29, 2007)
Eye Tracking: Pros and Cons
You can really record what subjects looked at
during the study
Might be somewhat expensive
Very likely, you need to perform the study in sequence, with a few subjects only
Complex tasks difficult to be tracked
Other monitoring techniques
Taping the session Recording (thinking aloud) Intercepting events on the machine
Diana Coman, Alberto Sillitti, Giancarlo Succi: A case-study on using an Automated In-process Software Engineering Measurement and Analysis system in an industrial environment. ICSE 2009: 89-99
Problems: Too invasive Data analysis might require a huge
amount of work
Data validation Once the experiment has been completed, we
need to do a consistency check on the collected data Were treatments correctly applied? Did subjects understand the provided forms? Did subjects correctly fill the forms? Remove subjects that
Did not participate to the experiment Exhibited a weird behavior (e.g. did not pay attention to
the task)
Try to have a quick look at data as soon aspossible
At least using descriptive statistics
Replication
“We do not take even our own observationquite seriosly, or accept them as scientificobservations, until we have repeated and tested them” (Popper, 1960)
Replication is essential in experimentation You should document your experiments so that
others can Repeat the experiment with different subjects Replicate your statistical analysis
Look at the non-replicable case of the cold fusion!
Advantages of replications
Fix problems occurred in the first experiments Training not adequate Tasks too complex
Consider a wide variety of subjects Different subjects in different replications Sometimes different objects also
Increase the statistical power Possibility of analyzing results as a whole or do
meta-analysis
Consolidating ideas
Idea
Exp. I Exp. n…
Laboratorysetting
Research Lab
ExperimentalProject I
ExperimentalProject II…
Production
Project I Project II…
Many replications every time
Conclusions
Software engineering/development is a human-intensive activity
Surely you can assess your new approach using various measures but…..
…. to really show the usefulness/efficiency/* of a technique, you should see how developers perform with that technique
Conclusions Experiment definition and planning is crucial
Wrong definition you’re studying something irrelevant
Design influences the kind of analyses you can do on your data
Experiment operation needs to be carefully performed You can loose months of work in one shot
Carefully analyze the threats to validity Don’t be afraid of doing that…there’s no perfect
study
Suggested Readings - I Experimentation in
Software Engineering: An IntroductionClaes Wohlin, Per Runeson, Martin Höst , Springer, 1999
Basics of Software Engineering ExperimentationNatalia Juristo, Ana M. Moreno, Springer, 2010