Empirical Methods in Computer Science © 2006-now Gal Kaminka 2
Experimental Lifecycle

[Figure: the experimental lifecycle as a cycle. A vague idea and "groping around" experiences lead to initial observations; these inform a model/theory, from which a hypothesis is derived; the hypothesis drives an experiment; data, analysis, and interpretation follow; results feed the final presentation and back into the model/theory.]
A Slightly Revised View...
[Figure: a simplified cycle. Model/Theory leads to Hypothesis, Hypothesis to Experiment, Experiment to Analysis, and Analysis back to Model/Theory.]
Proving a Theory?
We've discussed 4 methods of proving a proposition:
- Everyone knows it
- Someone specific says it
- An experiment supports it
- We can mathematically prove it

Some propositions cannot be verified empirically:
- "This mega-compiler has linear run-time"
- Infinitely many possible inputs --> cannot prove empirically

But they may still be disproved:
- e.g., by code that causes the compiler to run non-linearly
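The asymmetry above can be made concrete with a small sketch. The code below is illustrative: `compile_cost` is a hypothetical, instrumented stand-in for a compiler (it counts abstract operations rather than wall-clock time, and its hidden quadratic term is invented for the example). No number of passing checks proves linearity, but one failing check falsifies it.

```python
# Sketch: probing a run-time claim empirically (hypothetical example).
# We cannot prove "run-time is linear" over infinitely many inputs,
# but one input family whose cost grows faster than linearly
# falsifies the claim.

def compile_cost(n: int) -> int:
    # Hypothetical cost model with a hidden quadratic term.
    return 10 * n + n * n // 100

def looks_linear(cost, sizes) -> bool:
    """Doubling the input should roughly double the cost (within 10%)."""
    for n in sizes:
        ratio = cost(2 * n) / cost(n)
        if ratio > 2.2:   # grows faster than linear -> claim falsified
            return False
    return True

# Small inputs may look linear; large ones expose the quadratic term.
print(looks_linear(compile_cost, [10, 20]))          # -> True
print(looks_linear(compile_cost, [10_000, 50_000]))  # -> False
```

Note that the first check passing does not verify the claim; only the failing check carries logical weight.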
Karl Popper's Philosophy of Science
Popper advanced a particular philosophy of science: falsifiability

For a theory to be considered scientific, it must be falsifiable:
- There must be some way to refute it, in principle
- Not falsifiable <==> Not scientific

Examples:
- "All crows are black": falsifiable by finding a white crow
- "Compiles in linear time": falsifiable by non-linear performance

A theory is tested on its predictions
Proving by disproving...
Platt ("Strong Inference", 1964) offers a specific method:
1) Devise alternative hypotheses for the observations
2) Devise experiment(s) allowing elimination of hypotheses
3) Carry out the experiments to obtain a clean result
4) Go to 1.

The idea is to eliminate (falsify) hypotheses
Forming Hypotheses
So, to support theory X, we:
1) Construct falsification hypotheses X1, ..., Xn, ...
2) Systematically experiment, attempting to disprove X by proving each Xi
3) If all falsification hypotheses are eliminated, this lends support to X

Note that future falsification hypotheses may be formed:
- The theory must continue to hold against "attacks"
- Popper: scientific evolution, "survival of the fittest theory"

How does this view hold in computer science?
Forming Hypotheses in CS
(1) Carefully identify the theoretical object we are studying:
- e.g., "the relation between input size and run-time is linear"
- e.g., "the algorithm causes robots to collect pucks better"
- e.g., "the display improves user performance"

(2) Identify the falsification hypothesis (null hypothesis) H0:
- e.g., "there is an input size for which run-time is non-linear"
- e.g., "the algorithm will cause robots to collect fewer pucks"
- e.g., "the display will have no effect on user performance"

(3) Now, experiment to eliminate H0
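Step (3) can be sketched with a permutation test against H0 for the pucks example. The puck counts below are invented for illustration; the test asks how often a random relabeling of the trials produces a gap at least as large as the observed one.

```python
# Sketch: testing H0 ("the algorithm has no effect on pucks collected")
# with a permutation test. The counts are invented, not real data.
import random
import statistics

with_alg = [12, 15, 14, 16, 13, 15]      # pucks collected, new algorithm
without_alg = [10, 11, 9, 12, 10, 11]    # pucks collected, baseline

def perm_test(a, b, trials=10_000, seed=0):
    """Fraction of label shufflings with a mean gap >= the observed one."""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = a + b
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    return hits / trials

p = perm_test(with_alg, without_alg)
print(f"p = {p:.4f}")   # a small p makes H0 hard to sustain
```

A small p-value does not prove the theory; it only eliminates H0 for this experiment, in the spirit of the falsification loop above.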
The Basics of Experiment Design
Experiments identify a relation between variables X, Y, ...

Simple experiments provide an indication of a relation:
- Better/worse, linear or non-linear, ...

Advanced experiments help identify causes and interactions:
- e.g., linear in input size, but the constant factor depends on the type of data
Types of Experiments and Variables
Manipulation experiments:
- Manipulate (= set the value of) independent variables
- Observe (= measure the value of) dependent variables

Observation experiments:
- Observe predictor variables
- Observe response variables

Other variables:
- Endogenous: on the causal path between independent and dependent variables
- Extraneous: other variables influencing the dependent variables
An example observation experiment
Theory: Gender affects test performance

Falsifying hypothesis: Gender does not affect performance

Cannot use manipulation experiments:
- Cannot control gender
- Must use observation experiments
An example observation experiment (ala "Empirical Methods in AI", Cohen 1995)

[Figure: records of two children. Child 1: # siblings 2, mother an artist, male, 145cm, teacher's attitude, child confidence, test score 650. Child 2: # siblings 3, mother a doctor, female, 135cm, teacher's attitude, child confidence, test score 720. The independent (predictor) variables are highlighted.]
An example observation experiment (ala "Empirical Methods in AI", Cohen 1995)

[Figure: the same two records, with the dependent (response) variables highlighted.]
An example observation experiment (ala "Empirical Methods in AI", Cohen 1995)

[Figure: the same two records, with the endogenous variables highlighted.]
An example observation experiment (ala "Empirical Methods in AI", Cohen 1995)

[Figure: the same two records, with the exogenous variables highlighted.]
Experiment Design: Introduction
Different experiment types explore different hypotheses

For instance, a very simple design: the treatment experiment
- Sometimes known as a lesion study

  treatment:  Ind & Ex1 & Ex2 & ... & Exn ==> Dep1
  control:          Ex1 & Ex2 & ... & Exn ==> Dep2

Treatment condition: with the independent variable
Control condition: with no independent variable
Comparison Experiments
An improvement over treatment experiments: allows comparison of different conditions

  treatment1:  Ind1 & Ex1 & Ex2 & ... & Exn ==> Dep1
  treatment2:  Ind2 & Ex1 & Ex2 & ... & Exn ==> Dep2
  control:            Ex1 & Ex2 & ... & Exn ==> Dep3

Compare the performance of algorithm A to B to C, ...
Control condition: optional (e.g., to establish a baseline)
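A minimal sketch of such a comparison, with toy algorithms chosen for illustration: two search strategies run on the same input (Ex fixed), measured by comparison counts (Dep). Linear scan here doubles as the "silly" control baseline.

```python
# Sketch of a comparison experiment: each treatment condition runs a
# different algorithm (Ind1, Ind2) on the SAME inputs; the dependent
# variable is the number of comparisons. Algorithms are toy examples.

def linear_scan(xs, target):          # control / baseline condition
    return next(i + 1 for i, x in enumerate(xs) if x == target)

def binary_search(xs, target):        # treatment condition
    comps, lo, hi = 0, 0, len(xs) - 1
    while lo <= hi:
        comps += 1
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return comps
        lo, hi = (mid + 1, hi) if xs[mid] < target else (lo, mid - 1)

xs = list(range(1000))
target = 900
results = {
    "linear": linear_scan(xs, target),
    "binary": binary_search(xs, target),
}
print(results)   # -> {'linear': 901, 'binary': 10}
```

Holding the inputs fixed across conditions is what makes the comparison meaningful; each Dep differs only because the Ind differs.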
Example of Comparison Experiments
Compare the performance of user interface A to B to C, ... (Kaminka and Elmaliach 2006)

[Figure: bar chart comparing the "Split&Tool" and "Only Tool" interfaces on split deviation, in degrees; vertical axis from 0 to 6.]
Careful!

An effect on the dependent variable may not be as expected

Example: an experiment
- Hypothesis: a fly's ears are on its wings
- Take a fly with two wings. Make a loud noise. Observe flight.
- Take a fly with one wing. Make a loud noise. No flight.
- Conclusion: a fly with only one wing cannot hear!

What's going on here?
- First, interpretation by the experimenter
- But also, a lack of sufficient falsifiability: there are other possible explanations for why the fly wouldn't fly
Controlling for other factors
Often, we cannot manipulate all extraneous variables

Then, we need to make sure they are sampled randomly:
- Randomization averages out their effect

This can be difficult:
- e.g., suppose we are trying to relate gender and math scores
- We control for the effect of # of siblings by random sampling
- But # of siblings may be related to gender:
  Parents continue to have children hoping for a boy (Beal 1994)
  Thus # of siblings is tied to gender
- Must separate results based on # of siblings
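"Separating results based on # of siblings" means stratifying: comparisons are made within groups that share the same sibling count, so that variable cannot masquerade as a gender effect. A minimal sketch, with invented records:

```python
# Sketch: stratifying results by an extraneous variable (# of siblings).
# Records (gender, # siblings, math score) are invented for illustration.
from collections import defaultdict
from statistics import mean

records = [
    ("F", 1, 80), ("F", 1, 82), ("F", 3, 70),
    ("M", 1, 78), ("M", 3, 72), ("M", 3, 74),
]

# Group scores by (# siblings, gender); compare genders within a stratum.
strata = defaultdict(list)
for gender, siblings, score in records:
    strata[(siblings, gender)].append(score)

for key in sorted(strata):
    print(key, mean(strata[key]))
```

Within each sibling-count stratum, any remaining gender difference is no longer confounded with family size.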
Factorial Experiment Designs

Every combination of factor values is sampled:
- The hope is to exclude or reveal interactions

This creates a combinatorial number of experiments:
- N factors, k values each = k^N combinations

Strategies for eliminating values:
- Merge values into categories. Skip values.
- Focus on extremes, to get a general trend.

[Figure: two plots of performance vs. head-turn velocity.]
Tips for Factorial Experiments
For "numerical" variables, 2 value ranges are not enough:
- They don't give a good sense of the function relating the variables

Measure, measure, measure:
- Piggybacking measurement is cheaper than re-running experiments

Simplify comparisons:
- Use the same number of data points (trials) for all configurations
Experiment Validity
Types of validity: internal and external validity

Internal validity:
- The experiment shows the relationship (independent causes dependent)

External validity:
- The degree to which results generalize to other conditions

Threats: uncontrolled conditions threatening validity
Internal validity threats: Examples
Order effects:
- Practice effects in human or animal test subjects
- A bug in the testing system leaves the system "unclean" for the next trial

Demand effects:
- The experimenter influences the subjects
- e.g., answering subjects' questions

Confounding effects:
- See "fly with no wings cannot hear"
Order Effects
Order effects can confound results

If treatment/control are given in two different orders:
- e.g., an order good for treatment, bad for control (or vice versa)
- Solution: counter-balancing (all possible orders to all groups)

If treatment/control are given in the exact same order:
- Practice effects in humans and animals
- Solution: randomize the order of presentation to subjects
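Both solutions are easy to sketch; the condition names below are illustrative.

```python
# Sketch: two defenses against order effects.
import itertools
import random

conditions = ["treatment", "control"]

# Counter-balancing: enumerate every possible order; assign one per group.
orders = list(itertools.permutations(conditions))
print(orders)   # -> [('treatment', 'control'), ('control', 'treatment')]

# Randomization: each subject gets an independently shuffled order.
rng = random.Random(42)   # seeded here only for reproducibility
subject_order = conditions[:]
rng.shuffle(subject_order)
print(subject_order)
```

Counter-balancing cancels order effects exactly but needs enough groups to cover all orders; randomization only averages them out, but scales to many conditions.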
External threats to validity
Sampling bias: non-representative samples
- e.g., non-representative external factors

Floor and ceiling effects:
- Problems tested are too hard, or too easy

Regression effects:
- Results have no way to go but up (or down)

Solution approach: run pilot experiments
Sampling Bias
Preferring to set/measure specific values over others

For instance:
- Including only results that were found by some deadline

Solution: detect, and remove
- e.g., by visualization, looking for non-normal distributions
- e.g., a surprising distribution of the dependent data for different values of the independent variable
Baselines: Floor and Ceiling Effects
How do we know A is good? Bad?
- Maybe the problems are too simple? Too hard?

For example:
- A new machine learning algorithm has 95% accuracy
- Is this good?

Controlling for floor/ceiling effects:
- Establish baselines
- Find the range of inputs
- Show whether a "silly" approach achieves a close result
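The 95% example can be made concrete. With an invented label set that is 95% one class, a "silly" majority-class guesser also scores 95%, so the reported accuracy may be the floor rather than an achievement:

```python
# Sketch: why "95% accuracy" needs a baseline. Labels are invented:
# 95% of examples belong to the negative class.
labels = [0] * 95 + [1] * 5

def majority_baseline(labels):
    """A 'silly' classifier: always predict the most common label."""
    majority = max(set(labels), key=labels.count)
    predictions = [majority] * len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

print(majority_baseline(labels))   # -> 0.95
```

Any reported accuracy should therefore be compared against such a baseline before being called "good".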
Regression Effects
General phenomenon: "regression towards the mean"
- Repeated measurements converge towards mean values

Example threat: run a program on 100 different inputs
- Problems 6, 14, 15 get a very low score
- We now fix the problem, and want to re-test
- If chance has anything to do with scoring, then we must re-run all. Why?
- Scores on 6, 14, 15 have nowhere to go but up
- So re-running only these problems will show improvement by chance

Solution:
- Re-run the complete tests, or sample conditions uniformly
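The effect is easy to demonstrate by simulation. In the invented model below, every problem has the same true difficulty and observed scores are just truth plus noise, so the worst round-1 scorers "improve" on re-run purely by chance, with no fix at all:

```python
# Sketch: regression toward the mean, simulated. All numbers invented.
import random
from statistics import mean

rng = random.Random(7)   # seeded for reproducibility
TRUE_SCORE = 50.0

# Round 1: 100 problems, identical true difficulty, noisy measurement.
round1 = [TRUE_SCORE + rng.uniform(-20, 20) for _ in range(100)]

# Pick the 3 worst problems (like problems 6, 14, 15)...
worst = sorted(range(100), key=lambda i: round1[i])[:3]

# ...and re-run only those: fresh noise, same true difficulty, no "fix".
round2 = {i: TRUE_SCORE + rng.uniform(-20, 20) for i in worst}

before = mean(round1[i] for i in worst)
after = mean(round2[i] for i in worst)
print(f"worst-3 before: {before:.1f}, after re-run: {after:.1f}")
```

The worst scorers were low partly because of unlucky noise; a fresh measurement regresses toward the true mean, which is why a fix must be validated by re-running the complete test set.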
Summary
Defensive thinking:
- If I were trying to disprove the claim, what would I do?
- Then think of ways to counter any possible attack on the claim

Strong Inference, and Popper's falsification ideas:
- Science moves by disproving theories (empirically)

Experiment design: carefully think through threats
- Ideal independent variables: easy to manipulate
- Ideal dependent variables: measurable, sensitive, and meaningful

Next week: Hypothesis testing (?)