Presented at JAX London 2013. So you're a big data and distributed systems "expert": you've collected 500 billion data points, thrown them into sci-lib-of-the-week, you're using Hadoop backed onto those cool AWS GPU instances, you've let it grind away for days, and it's spat out the answer to life, the universe and everything. But is it really better than a coin toss? How do you validate whether your data analysis algorithm works? Are you learning a solution to your problems or just the data you already have? What problems can you encounter when analysing your data?
ARE YOU BETTER THAN A COIN TOSS?
BY JOHN OLIVER AND RICHARD WARBURTON
WHO ARE WE?
Why you should care
The Fundamentals
Practical Problems
Applying the Theory
"EXPERTS" AREN'T VERY GOOD
BIG DATA SOLVES ALL KNOWN PROBLEMS
... HELPS
VALIDATION = TESTS FOR DATA
FUNDAMENTALS
NULL HYPOTHESIS
Until proven otherwise, there is no relationship between phenomena
WHEN YOU HEAR "WOLF!" THERE IS A WOLF NEARBY

                          Cry "Wolf!"       Stay Quiet
Wolf Nearby               Ok                False Negative
It's really a chicken!    False Positive    Ok
WHY IS THIS IMPORTANT?
It is better that ten guilty persons escape than that one innocent suffer
- William Blackstone
STATIC ANALYSIS
COST BENEFIT ANALYSIS
Costs a lot to jail an innocent man
Costs very little to show someone an inappropriate house
Credibility, Liberty, Morality are also costs
CHOOSE THE RIGHT MEASUREMENT
There's more than one concept of accuracy
RECALL
$$\text{Recall} = \frac{\text{number of true positives}}{\text{number of actually true values}} = \frac{tp}{tp + fn}$$
Also called True Positive Rate or Sensitivity
PRECISION
$$\text{Precision} = \frac{\text{number of true positives}}{\text{number of predicted true values}} = \frac{tp}{tp + fp}$$
Also called Positive Predictive Value
F MEASURE
$$F_\beta = \frac{(1 + \beta^2) \cdot tp}{(1 + \beta^2) \cdot tp + \beta^2 \cdot fn + fp}$$
Don't worry about the formula!
CASE STUDY: MEMORY LEAKS
About 10% of our dataset had memory leaks
Predict "never leaks memory" ~= 0.9 accuracy, but F1 = 0
Our algorithm ~= 0.9 accuracy and F1 ~= 0.9
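The gap between accuracy and F1 in the case study can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind the case study: the 100-item dataset and the function names (confusion_counts, accuracy, f_beta) are made up for illustration, and F is defined as 0 when there are no positives at all.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for two equal-length lists of booleans."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)


def f_beta(actual, predicted, beta=1.0):
    """F measure; beta=1 gives F1. Returns 0.0 when the denominator is 0 (an assumption)."""
    tp, fp, fn, _ = confusion_counts(actual, predicted)
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denominator == 0 else (1 + b2) * tp / denominator


# Hypothetical dataset: 100 programs, 10 of which leak memory.
actual = [True] * 10 + [False] * 90
never_leaks = [False] * 100  # the "never leaks memory" predictor

print(accuracy(actual, never_leaks))  # 0.9
print(f_beta(actual, never_leaks))    # 0.0
```

The do-nothing predictor scores well on accuracy only because 90% of the data is negative; F1 ignores true negatives, so it exposes the predictor as useless.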
PROBLEM: RELIABILITY OF MEASUREMENT
RULE OF THUMB
If the graph looks like random noise, it probably is random noise.
SOLUTION: CHECK YOUR DATA
Low Standard Deviation
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
$$\text{Coefficient of Variation} = \frac{\sigma}{\text{Mean}}$$
CAVEAT: NON-NORMAL DISTRIBUTIONS
GO MAD (MEDIAN ABSOLUTE DEVIATION)
$$\text{MAD} = \operatorname{median}_i\left(\left|X_i - \operatorname{median}_j(X_j)\right|\right)$$
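A rough sketch of both checks using only Python's standard statistics module. The timing samples are invented for illustration, and the function names are hypothetical. statistics.pstdev is used because it divides by N, which matches the standard deviation formula with the 1/N factor.

```python
import statistics


def coefficient_of_variation(xs):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(xs) / statistics.mean(xs)


def median_absolute_deviation(xs):
    """Median of the absolute distances from the median."""
    centre = statistics.median(xs)
    return statistics.median([abs(x - centre) for x in xs])


# Hypothetical measurements with one large outlier (50).
timings = [10, 11, 9, 10, 10, 11, 9, 50]

print(coefficient_of_variation(timings))
print(median_absolute_deviation(timings))  # 1.0
```

On this sample the single outlier inflates the standard deviation, and with it the coefficient of variation, while the MAD stays at 1.0. That is why the MAD is the safer check when the data is not normally distributed.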
PROBLEM: EXPERIMENTAL FLUKES
IS YOUR A/B TEST A HEISEN TEST?
SOLUTION: P-VALUE
Many tests: eg Chi-Squared or Student's T
How many times do you need to roll heads before you know your coin isn't biased?
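For the coin question, one possible approach is an exact two-sided binomial test, which is a swapped-in alternative to the Chi-Squared and Student's T tests named above and is simple enough to write by hand. This is a sketch under that assumption; binomial_p_value is a hypothetical helper, not a library function.

```python
from math import comb


def binomial_p_value(heads, flips, p=0.5):
    """Exact two-sided p-value: the total probability, under a coin with
    P(heads) = p, of every outcome that is no more likely than the observed one."""
    def pmf(k):
        return comb(flips, k) * p ** k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # The small tolerance guards against floating point ties.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))


print(binomial_p_value(6, 10))   # well above 0.05: consistent with a fair coin
print(binomial_p_value(10, 10))  # 0.001953125: very unlikely for a fair coin
```

A small p-value says the result would be rare if the null hypothesis (a fair coin) were true. It does not say how biased the coin is.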
SCIENCE WORKS - B****ES!
PRACTICAL PROBLEMS
PROBLEM: FALSE PROPHETS
I'M AN EXPERT, LISTEN TO ME!
SOLUTION: ESTABLISH GOALS AND HYPOTHESIS, THEN TEST SOLUTIONS
PROBLEM: CODE QUALITY
The math works :-) the code does not :-(
- @headinthebox
GROWTH IN A TIME OF DEBT
SOLUTION: SOFTWARE ENGINEERING PRACTICES
Everyone Lies
- House
SOLUTION: UNDERSTAND BIASES AND DESIGN AROUND THEM
Gay couples should have an equal right to get married, not just to have civil partnerships
Populus: 65% vs 27%
Marriage should continue to be defined as a life-long exclusive commitment between a man and a woman
Comres: 22% vs 70%
ACQUIESCENCE BIAS
Answer yes
REMOVAL OF PARTICULAR ADVERTISING AND SPONSORSHIP BANS
FOR: 1045 AGAINST: 731 ABSTAIN: 121 Motion Carried
MAINTAINING AN ETHICAL UNION BY REAFFIRMING ADVERTISING AND SPONSORSHIP BANS
FOR: 858 AGAINST: 755 ABSTAIN: 166 Motion Carried
SOLUTION: PHRASE QUESTIONS NEUTRALLY
And only have one question
SOCIAL DESIRABILITY
Poor people overestimate their income, rich people underestimate it.
SOLUTIONS
Anonymisation
Confidentiality
Randomized Response
Bogus Pipeline
BIAS TOWARDS THE FIRST ANSWER OF A QUESTION
Make sure to randomise the order of answers
PROBLEM: CORRELATION DOESN'T IMPLY CAUSALITY
DATABASE AND NETWORK ACTIVITY CORRELATING
Performance Diagnosis: was actually a Garbage Collection Problem.
SOLUTION: DOMAIN KNOWLEDGE
SOLUTIONS
Use domain knowledge - ask Pilots
Stratified sample sets
Measure outcomes - are planes surviving more?
BE RIGOROUS
APPLYING THE THEORY
CORRELATION
A MEASURE OF THE STRENGTH OF DEPENDENCE BETWEEN TWO VARIABLES
PEARSON CORRELATION
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
Err... Just look it up
(Assumes linear relationship)
Range    Strength
<0.4     Weak/No Correlation
<0.7     Some Correlation
>0.7     Strong Correlation
CASE STUDY: PERFORMANCE PROBLEM WITH HIGH SYSTEM TIME
Hypothesis: caused by Disk I/O
Correlation Strength: 0.78453
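A minimal sketch of the Pearson calculation and the rule-of-thumb strength bands, in plain Python. The sample series are invented and are not the case study's measurements. Two choices here are assumptions rather than anything stated in the slides: the strength lookup uses the absolute value, so strong negative correlations also count as strong, and a value of exactly 0.7 is treated as strong.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def strength(r):
    """Map a correlation to the rule-of-thumb bands (abs() is an assumption)."""
    r = abs(r)
    if r < 0.4:
        return "Weak/No Correlation"
    if r < 0.7:
        return "Some Correlation"
    return "Strong Correlation"


# Hypothetical samples: disk I/O operations and system time per interval.
disk_io = [120, 340, 200, 560, 410, 90]
system_time = [0.8, 2.1, 1.9, 3.0, 2.2, 1.1]

r = pearson(disk_io, system_time)
print(r, strength(r))
print(strength(0.78453))  # Strong Correlation
```

Even a strong score only supports the hypothesis. It measures linear dependence, not cause.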
MACHINE LEARNING
Application of statistics to learn a relationship
HOW MANY CLUSTERS?
SOLUTION: ELBOW ESTIMATORS
FITTING
SOLUTION: CROSS VALIDATION
CHOOSE CROSS VALIDATION DATA WISELY
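One common way to set up cross validation is a k-fold split, sketched below with the standard library only. The helper name k_fold_indices is hypothetical. Shuffling before splitting is an assumption that suits independent samples; for time-ordered or grouped data it can leak information between the training and validation sets, which is one reason to choose the validation data with care.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumes samples are independent
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Each sample lands in the validation set of exactly one fold.
for training, validation in k_fold_indices(n=10, k=3):
    print(sorted(validation), len(training))
```

A model is trained on each training list and scored on the matching validation list; the spread of the k scores gives a feel for how much the result depends on the particular split.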
SELF VALIDATING
Ensemble methods - Train lots of weak classifiers and merge
RANDOM FOREST AND BAGGING
Divide the data into bootstrap sets
Use the rest for calculating error
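The bootstrap idea can be sketched as sampling row indices with replacement and keeping the rows that were never drawn (often called the out-of-bag rows) for error estimation. The function name and dataset size below are illustrative assumptions; a real random forest implementation does more than this.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)
for tree in range(3):  # one bootstrap set per weak classifier
    in_bag, out_of_bag = bootstrap_split(20, rng)
    # Train on in_bag rows, then measure error on out_of_bag rows.
    print(tree, len(set(in_bag)), len(out_of_bag))
```

Because each classifier is scored on rows it never saw, the ensemble gets a validation estimate without setting aside a separate hold-out set.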
LEARNING CURVES
UNDER-FITTING (BIAS)
OVER-FITTING (VARIANCE)
HOW MUCH IS TOO MUCH?
ACCURACY FOR DIFFERENT TREE SIZES
F1 FOR DIFFERENT TREE SIZES
MONITOR PRODUCTION DATA... IT CHANGES
Does it look like the same data that you learnt with?
A/B TEST NEW SYSTEMS
Satisfaction/Profit/Traffic...
COMMON THREADS
Training set errors are misleading
Cross Validation, Production Monitored Values are the ones that really matter
Visualise and compare these errors
CONCLUSION
Analytics are increasingly important
Wide variety of statistical and practical tips to get them right
Have fun and Best of luck!