antony-ward
Biostat 200 Lecture 2
1
Today
• Discussion of data cleaning
• Probability
2
Data cleaning
• Data cleaning is always necessary with a new data set
• Assume your data set has errors and your job is to find them
• The first step is to use tables and summary statistics and graphs to identify outliers and anomalies
• Outliers are defined as extreme values
• We do NOT automatically remove outliers!
3
Outliers – what do we do?
• First consider if the value is physically possible
• Example: Our original data set had a person who was 3’4” tall. Yes, that is physically possible but fairly unusual.
• Look at the other variables for clues. We found (last year) age=3.
• For this one, we remove the entire observation from the analysis data set because of ineligibility
• We document this, and retain a copy of the original data set
4
Outliers – what do we do?
• If age had been 20, we might have asked the interviewer about this value.
• Another example – there were a few other strange heights: 5’12”, 5’20”, 5’41” ...
• Probably typos? Check the original source document.
• You can prevent some of this by programming your data entry programs not to accept out-of-range values.
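The last point can be sketched as a simple range check at data entry. This is a minimal illustration in Python, not the course's data entry program: the function names and the accepted range of 48 to 84 inches are assumptions made for the example.

```python
import re

# Hypothetical plausible range for adult height in total inches (an assumption).
MIN_INCHES = 48
MAX_INCHES = 84


def parse_height(text):
    """Convert a height written like 5'11" to total inches, or raise ValueError."""
    match = re.fullmatch(r"\s*(\d+)'\s*(\d+)\"\s*", text)
    if match is None:
        raise ValueError(f"unrecognized height format: {text!r}")
    feet, inches = int(match.group(1)), int(match.group(2))
    if inches >= 12:
        # Rejects entries such as 5'12", 5'20" and 5'41"
        raise ValueError(f"inches must be between 0 and 11, got {inches}")
    total = 12 * feet + inches
    if not MIN_INCHES <= total <= MAX_INCHES:
        raise ValueError(f"height of {total} inches is outside the accepted range")
    return total


def is_valid_height(text):
    """Return True if the entry passes both checks above."""
    try:
        parse_height(text)
        return True
    except ValueError:
        return False
```

With these assumed limits, an entry of 5'11" is accepted as 71 inches, while 5'12" and 3'4" are rejected and would be sent back for checking against the source document.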
5
Outliers – what do we do?
• We also had 2 observations with weight = 25 and 30 pounds
• If we can’t explain them but we are pretty sure that these values are not reasonable, we might exclude these values (but not the whole observation unless we suspect poor data throughout!)
6
Outliers – what do we do?
• What about these high values?
[Figure: histogram of the number of minutes participants typically exercise (x-axis, 0 to 200) against fraction (y-axis, 0 to 0.2)]
7
Outliers – what do we do?
• What about outliers that seem reasonable?
• May have large influence on some analyses
• Be aware of them; do not exclude them.
• Think about more robust analyses. E.g. which measures of central tendency might you use?
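One way to see why the choice of measure matters is to compare the mean and the median with and without an extreme value. The sketch below uses Python's standard library, and the exercise times are made-up numbers, not the class data set.

```python
from statistics import mean, median

# Hypothetical minutes of exercise; the last value is plausible but extreme.
minutes = [0, 10, 20, 30, 30, 45, 60, 200]
without_outlier = minutes[:-1]

mean_all = mean(minutes)                # 49.375
median_all = median(minutes)            # 30.0
mean_trimmed = mean(without_outlier)    # about 27.9
median_trimmed = median(without_outlier)  # 30

print(f"mean:   {mean_trimmed:.1f} without the outlier, {mean_all:.1f} with it")
print(f"median: {median_trimmed} without the outlier, {median_all} with it")
```

In this made-up sample the single large value nearly doubles the mean while the median does not move, which is the sense in which the median is the more robust measure.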
8
Data management strategies
• Keep a .do file for all your recodes
– At the beginning of the .do file, read in the original raw data
– At the end of the file, save the data to another filename
• Use comments, set off by ***s, to remind yourself why you are making these recodes
• Make .do files for your analyses
– I often keep these separate from my recodes files
• Make a generic .do file to create value labels that you might use across data sets. Each definition needs a label name before the values (testl here is only a placeholder name):
– label define sexl 0 "Male" 1 "Female"
– label define testl 0 "Negative" 1 "Positive" 2 "Indeterminate"
• Use the command include filename.do to include the value label .do file in your recode .do file
9
Example .do file for recoding and labeling variable levels
use "H:\Biostat200\colddata_2011.dta", clear
summ age, detail
** children were not eligible for the study **
drop if age<18
include "H:\Biostat200\label defines.do"
label values educ educl
label values sex sexl
save "H:\Biostat200\colddata_2011_v2.dta"
10
Basic probability
11
Basic probability
• Probability is the foundation of statistical inference
– Statistical inference is what is needed to make statements about the characteristics of the population from which a sample was drawn
– p-values and confidence intervals tell us how our sample might relate to the population
• Many of the quantities we use daily are probabilities – e.g. the probability of breast cancer given that a person is BRCA1/2 positive
[Figure: a sample drawn from a population]
12
Basic probability
• Event
– Result of an experiment or observation
– Occurs or does not occur
– Denoted by uppercase letters, e.g. A, B, X
– We will apply probability to events – i.e. we will want to know the probability that an event occurs
– E.g. a disease occurrence, an extreme laboratory value
13
Basic probability
• Frequentist definition of probability: If an experiment is repeated n times under essentially identical conditions, and if the event A occurs m times, then as n grows large, the ratio m/n approaches a fixed limit that is the probability of A
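The frequentist definition can be illustrated by simulation. The sketch below is in Python; the true probability of 0.3, the seed, and the function name are all arbitrary choices for the example. It repeats a yes/no experiment n times and reports the ratio m/n.

```python
import random


def relative_frequency(p, n, seed=0):
    """Simulate n trials in which event A has probability p; return m/n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    m = sum(1 for _ in range(n) if rng.random() < p)
    return m / n


# The ratio m/n should settle near p as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(0.3, n))
```

For small n the ratio can be far from 0.3; for large n it should sit close to it, which is the "fixed limit" in the definition.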
14
Basic probability
• Probability of an event – relative frequency of its occurrence in a large number of trials repeated under the same conditions
– E.g. Probability of picking a red ball out of a bag of red and black balls
– Always lies between 0 and 1 (inclusive)
– Denoted P(A) or P(X)
15
Basic probability
• Complement of an event, Ā or A^c (read Not A or A complement)
– E.g. the event that the person does not have malaria
– P(A) = 1 - P(Ā)
• In epidemiology, we often write E for exposed and Ē for not exposed
• Ω is the universe, all the possible outcomes of an experiment
• P(Ω) = P(A) + P(Ā) = 1
[Figure: Venn diagram of event A and its complement Ā within the universe Ω]
16
Complement example
• Probability that someone has extensively drug-resistant tuberculosis (XDR TB) versus the probability that they do not
• P(XDR TB+) + P(XDR TB-) = 1
17
Basic probability
• The intersection of 2 events is written A ∩ B
• The intersection is when both A and B occur
– E.g. The event that a person has both malaria and pulmonary tuberculosis
– The probability that both occur is written P(A ∩ B)
18
Basic probability
• The union of 2 events is written A U B
• The union is if either A or B or both occur
– E.g. The event that a person has either malaria or tuberculosis or both
– P(A U B) = P(A) + P(B) – P(A ∩ B)
– The probability of A or B is the sum of their individual probabilities minus the probability of their intersection
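The union formula can be checked by counting. In this Python sketch the ten people and their disease status are invented for illustration; the point is only that subtracting the intersection avoids counting people with both diseases twice.

```python
from fractions import Fraction

# Hypothetical population of 10 people, identified by number.
n_people = 10
malaria = {1, 2, 3, 4}      # people with malaria (event A)
tuberculosis = {3, 4, 5}    # people with tuberculosis (event B)

p_a = Fraction(len(malaria), n_people)
p_b = Fraction(len(tuberculosis), n_people)
p_a_and_b = Fraction(len(malaria & tuberculosis), n_people)

# Union by the formula, and by counting the union directly.
p_union_formula = p_a + p_b - p_a_and_b
p_union_counted = Fraction(len(malaria | tuberculosis), n_people)

print(p_union_formula, p_union_counted)  # both are 1/2
```

Exact fractions are used here so that the two results compare equal without rounding error.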
19
Basic probability
• Two events are mutually exclusive if they cannot occur together
– In English: for mutually exclusive events, the probability of A or B occurring is the sum of their individual probabilities; both cannot occur together so P(A ∩ B) = 0
– In probability lexicon: P(A U B) = P(A) + P(B) - P(A ∩ B) = P(A) + P(B)
20
Basic probability
• Two events are mutually exclusive if they cannot occur together
– This is true for complements
– E.g. being pregnant and not pregnant: you cannot be both
21
Basic probability
• If A and B are mutually exclusive, P(A U B) = P(A) + P(B)
• This is the additive rule of probability
• E.g.
P(HCV genotype 1) in the US = .7
P(HCV genotype 2) in the US = .15
P(HCV genotype 3,4,6) = .15
P(HCV genotype 1 or 2) = .85
22
Basic probability
• The additive rule of probability can be applied to three or more mutually exclusive events
• If none of the events can occur together, then
P(A1 U A2 U … U An) = P(A1) + P(A2) + … + P(An)
23
Probability summary
• Complement: P(A) = 1 - P(Ā)
• Union: Prob A or B or both = P(A U B)
P(A U B) = P(A) + P(B) – P(A ∩ B)
• Intersection: Prob A and B = P(A ∩ B)
• For mutually exclusive events: P(A ∩ B) = 0
P(A U B) = P(A) + P(B) additive rule
• So A and Ā are mutually exclusive
24
Basic probability example
• A = the event that an individual is exposed to high levels of carbon monoxide
• B = the event that an individual is exposed to high levels of nitrogen dioxide
– What is the event A ∩ B called? What is that in this example?
– What is the event A U B called? What is it in this example?
– What is the complement of A?
– Are A and B mutually exclusive?
25
Basic probability example
– A ∩ B is the intersection of A and B. It is the event that the person is exposed to high levels of both gases.
– A U B is the union of A and B. It is the event that the person is exposed to high levels of one or the other or both.
– A^c is the event that the person is not exposed to high levels of carbon monoxide.
– Are A and B mutually exclusive? Can they both occur? Yes. So NOT mutually exclusive.
26
Conditional probability
• The probability that an event B will occur given that event A has occurred
– Notation: P(B|A)
– Read: the probability of B given A
• Example: Probability of a person becoming infected with malaria given that he/she uses a bed net at night
• Event A is using a bed net• Event B is becoming infected with malaria
27
Conditional probability
• Multiplicative rule of probability
P(A ∩ B) = P(A) P(B|A)
So P(B|A) = P(A ∩ B) / P(A)
• Example: P(becoming infected with malaria | use a bed net)
Answer: P(becoming infected and using a bed net) / P(using a bed net)
= number of people who become infected with malaria who use a bed net / number of people who use a bed net
28
Probability example
1992 U.S. birth statistics

Age of mother    Probability
<15              0.003
15-19            0.124
20-24            0.263
25-29            0.290
30-34            0.220
35-39            0.085
40-44            0.014
45-49            0.001
Total            1.000

• Probability that mother’s age was ≤24 = 0.003 + 0.124 + 0.263 = 0.390 (What probability rule?)
• Given that a mother is under age 30, what is the probability that she is under age 20?
P(Mother’s age <20 | Mother’s age <30)
= P(Mother’s age <20 and <30) / P(Mother’s age <30)
= (0.003 + 0.124) / (0.003 + 0.124 + 0.263 + 0.290)
= 0.127 / 0.680 = 0.187
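The conditional probability in this example can be reproduced from the table. The Python sketch below uses the probabilities as given; the dictionary and variable names are choices made for the example.

```python
# 1992 U.S. birth statistics: probability by age of mother (from the table).
age_probs = {
    "<15": 0.003, "15-19": 0.124, "20-24": 0.263, "25-29": 0.290,
    "30-34": 0.220, "35-39": 0.085, "40-44": 0.014, "45-49": 0.001,
}

under_20 = ["<15", "15-19"]
under_30 = under_20 + ["20-24", "25-29"]

# The age groups are mutually exclusive, so the additive rule applies.
p_under_20 = sum(age_probs[group] for group in under_20)
p_under_30 = sum(age_probs[group] for group in under_30)

# "Under 20 and under 30" is the same event as "under 20".
p_under_20_given_under_30 = p_under_20 / p_under_30

print(round(p_under_20_given_under_30, 3))  # 0.187
```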
29
Examples of conditional probabilities
• Relative risk is the ratio of 2 conditional probabilities
P(disease | exposed) / P(disease | not exposed)
• Odds also include conditional probabilities
Odds of disease among the exposed: P(disease | exposed) / (1 - P(disease | exposed))
Odds of disease among the unexposed: P(disease | not exposed) / (1 - P(disease | not exposed))
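These quantities are easy to compute once the two conditional probabilities are known. In the Python sketch below the risks of 0.20 and 0.10 are hypothetical, and the odds ratio is included only as the natural ratio of the two odds.

```python
def odds(p):
    """Odds corresponding to a probability p (requires p < 1)."""
    return p / (1 - p)


# Hypothetical conditional probabilities, for illustration only.
p_disease_exposed = 0.20
p_disease_unexposed = 0.10

relative_risk = p_disease_exposed / p_disease_unexposed
odds_exposed = odds(p_disease_exposed)
odds_unexposed = odds(p_disease_unexposed)
odds_ratio = odds_exposed / odds_unexposed

print(round(relative_risk, 2))  # 2.0
print(round(odds_exposed, 3))   # 0.25
print(round(odds_unexposed, 3)) # 0.111
print(round(odds_ratio, 2))     # 2.25
```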
30
Independence
• If the occurrence of B does not depend on A,
– then P(B|A) = P(B)
– Example: Probability of becoming infected with malaria given that you wear a blue shirt = probability of becoming infected with malaria
– Then the multiplicative rule is P(A ∩ B) = P(A) P(B)
– Example: coin tosses – the probability of a heads on the 2nd throw is independent of the outcome on the first throw
31
Independence
Note that independence ≠ mutual exclusivity!
– Mutual exclusivity
• 2 events cannot both occur
• P(A ∩ B) = 0
– Independence
• 2 events do not depend on each other
• P(B|A) = P(B)
• P(A ∩ B) = P(A) P(B)
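The two definitions can be written as two separate checks on the same three numbers. This Python sketch uses a hypothetical helper, `describe`, and two illustrative cases: two fair coin tosses, and an event with its complement.

```python
import math


def describe(p_a, p_b, p_a_and_b):
    """Return (mutually_exclusive, independent) for events A and B."""
    mutually_exclusive = math.isclose(p_a_and_b, 0.0, abs_tol=1e-12)
    independent = math.isclose(p_a_and_b, p_a * p_b, abs_tol=1e-12)
    return mutually_exclusive, independent


# Heads on toss 1 and heads on toss 2: independent, not mutually exclusive.
coin_tosses = describe(0.5, 0.5, 0.25)

# An event with probability 0.3 and its complement: mutually exclusive, not independent.
complements = describe(0.3, 0.7, 0.0)

print(coin_tosses)  # (False, True)
print(complements)  # (True, False)
```

When both events have nonzero probability, mutually exclusive events cannot be independent, because P(A) P(B) is positive while P(A ∩ B) is 0.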
32
Law of Total Probability
• The law of total probability:
P(B) = P(B ∩ A) + P(B ∩ Ā)
P(B) = P(B|A)P(A) + P(B|Ā)P(Ā)
More generally, if A1, A2, …, An are mutually exclusive and P(A1 U A2 U … U An) = 1, then
P(B) = P(B ∩ A1) + P(B ∩ A2) + … + P(B ∩ An)
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An)
33
Law of Total Probability
• Helpful when you cannot directly calculate a probability
• Example:
– Suppose you know the TB prevalence in different areas and the population size in those areas, and you want to know the worldwide TB prevalence
– P(TB+) = P(TB+| live in lower income country)*P(live in lower income country) + P(TB+| live in upper income country)*P(live in upper income country)
– Weighted average of the 2 TB rates
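The weighted average can be sketched in a few lines of Python. The population shares and TB prevalences below are invented for illustration and are not real estimates.

```python
# Each stratum: (name, share of world population, TB prevalence in that stratum).
# All numbers are hypothetical.
strata = [
    ("lower income country", 0.75, 0.004),
    ("upper income country", 0.25, 0.0005),
]

# The strata must be mutually exclusive and cover everyone.
total_share = sum(share for _, share, _ in strata)

# Law of total probability: P(TB+) = sum of P(TB+ | stratum) * P(stratum).
p_tb = sum(share * prevalence for _, share, prevalence in strata)

print(round(p_tb, 6))  # 0.003125
```

The overall prevalence lies between the two stratum prevalences and is pulled toward the stratum with the larger population share.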
34
Diagnostic tests
• Diagnostic tests of disease are rarely perfect
– True positives – the test is positive given the person has the disease
• The probability of this is P(T+|D+) = Sensitivity
– False positives – the test is positive although the person does not have the disease
– True negatives – the test is negative given the person does not have the disease
• The probability of this is P(T-|D-) = Specificity
– False negatives – the test is negative even though the person has the disease
35
Diagnostic tests
• Sensitivity = P(T+|D+) = P(T+∩D+)/P(D+) = TP/(TP+FN)
• Specificity = P(T-|D-) = P(T-∩D-)/P(D-) = TN/(FP+TN)

              TRUTH
              D+    D-
Test   T+     TP    FP
       T-     FN    TN
36
Diagnostic tests
• Diagnostic test characteristics (sensitivity and specificity) are based on experiments in which the test is compared to a “gold standard”
37
Diagnostic test validation example
• New biological markers of alcohol consumption are being developed. Phosphatidylethanol (PEth) is a metabolite of alcohol that is formed only in the presence of alcohol.
• We examined 77 HIV-positive persons in Mbarara, Uganda. We followed them for 21 days and did daily breathalyzer tests and drinking surveys. If the breathalyzer result was ever >0 and/or the participant reported drinking, we considered this any alcohol consumption.
• We drew blood at the end of the 21 days to test for PEth.
38
Diagnostic test example
• Proportion of positive PEth tests (PEth >=10 ng/ml) among those with any alcohol consumption in the prior 21 days = Sensitivity
= 45/51 = 88.2%
• Proportion of negative PEth tests among the abstainers = Specificity
= 23/26 = 88.5%

                 “TRUTH”
                 Alc+   Alc-
PEth test   +    45     3
            -    6      23
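The sensitivity and specificity above follow directly from the four cells of the table. A short Python check, with variable names chosen for the example:

```python
# Cells of the 2x2 table from the PEth example.
true_pos = 45   # PEth positive, any alcohol consumption
false_pos = 3   # PEth positive, abstainer
false_neg = 6   # PEth negative, any alcohol consumption
true_neg = 23   # PEth negative, abstainer

sensitivity = true_pos / (true_pos + false_neg)   # P(T+ | Alc+)
specificity = true_neg / (false_pos + true_neg)   # P(T- | Alc-)

print(f"Sensitivity = {100 * sensitivity:.1f}%")  # 88.2%
print(f"Specificity = {100 * specificity:.1f}%")  # 88.5%
```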
39
Diagnostic tests
• The level of the cutoff for a diagnostic test can be set to
– Maximize sensitivity -- this will decrease specificity!
• This might be ideal if a follow-up confirmatory test is easy and you want to be sure not to miss any positives
– Maximize specificity -- this will decrease sensitivity!
• This might be necessary if there are grave ramifications of a false positive test
• Receiver operating characteristic (ROC) curves illustrate this tension
– The ROC curve plots the sensitivity versus 1 - specificity for a test at every possible test cutoff
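The trade-off can be made concrete by computing one (1 - specificity, sensitivity) point per cutoff. The Python sketch below uses made-up marker values, not the PEth study data, and treats a result as positive when the value is at or above the cutoff.

```python
# Hypothetical marker values for people with and without the condition.
with_condition = [5, 12, 30, 80, 150]
without_condition = [0, 0, 3, 8, 15]


def roc_points(pos_values, neg_values):
    """Return (cutoff, sensitivity, 1 - specificity) for every observed cutoff."""
    cutoffs = sorted(set(pos_values) | set(neg_values))
    points = []
    for cutoff in cutoffs:
        sensitivity = sum(v >= cutoff for v in pos_values) / len(pos_values)
        specificity = sum(v < cutoff for v in neg_values) / len(neg_values)
        points.append((cutoff, sensitivity, 1 - specificity))
    return points


for cutoff, sens, false_pos_rate in roc_points(with_condition, without_condition):
    print(cutoff, round(sens, 2), round(false_pos_rate, 2))
```

At the lowest cutoff everyone tests positive (sensitivity 1, specificity 0); raising the cutoff trades sensitivity for specificity, and plotting the printed pairs traces the ROC curve.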
40
ROC of PEth to detect alcohol consumption in persons with HIV in Mbarara, Uganda
41
Application of laws of probability to diagnostic tests
• Suppose you have a panel of diagnostic tests and each gives false positive results 2% of the time (98% specificity)
• If you test your patient with one of the tests and they do not have the disease, there is a 2% chance you’ll get a false positive result
• There is a 98% chance you will get the correct negative result.
42
Application of laws of probability to diagnostic tests
• If you give the patient 2 tests, what is the chance of at least 1 false positive?
• Possible results are (multiplying the probabilities assumes the 2 tests are independent):
• You could get Neg Neg
P(Neg test 1 ∩ Neg test 2) = 0.98*0.98 = .9604
• You could get Neg Pos
P(Neg test 1 ∩ Pos test 2) = 0.98*0.02 = .0196
• You could get Pos Neg
P(Pos test 1 ∩ Neg test 2) = 0.02*0.98 = .0196
• You could get Pos Pos
P(Pos test 1 ∩ Pos test 2) = 0.02*0.02 = .0004
43
Application of laws of probability to diagnostic tests
• All 4 of these possibilities add to 1:
.9604 + .0196 + .0196 + .0004 = 1
• P(1 or more test is pos) = P(Neg test 1 ∩ Pos test 2) + P(Pos test 1 ∩ Neg test 2) + P(Pos test 1 ∩ Pos test 2) = .0196 + .0196 + .0004 = .0396
An easier way:
P(1 or more test is pos) = 1 - P(both tests are neg)
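Both routes can be checked by listing the four outcomes explicitly. This Python sketch assumes, as the calculation above does, that the two tests are independent and that the patient does not have the disease.

```python
from itertools import product

# Per-test result probabilities for a patient without the disease.
result_probs = {"Neg": 0.98, "Pos": 0.02}

# Joint probability of each pair of results, assuming independent tests.
outcomes = {
    (first, second): result_probs[first] * result_probs[second]
    for first, second in product(result_probs, repeat=2)
}

total = sum(outcomes.values())

# Long way: add up every outcome that contains at least one positive.
p_any_pos_long = sum(p for results, p in outcomes.items() if "Pos" in results)

# Easier way: complement of both tests being negative.
p_any_pos_short = 1 - outcomes[("Neg", "Neg")]

print(round(total, 4), round(p_any_pos_long, 4), round(p_any_pos_short, 4))
```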
44
Application of laws of probability to diagnostic tests
• P(both tests are neg) = P(Neg test 1 ∩ Neg test 2) = .98*.98
• So P(1 or more test is pos) = 1 - .98*.98 = 0.0396
• In general, P(at least one false positive)
= 1 - P(no false positives occur over all tests)
= 1 - (test specificity)^(# of tests)
Here = 1 - 0.98^2
45
Application of laws of probability to diagnostic tests
• What is the probability of at least one false positive if 5 tests were run?
1 - 0.98^5 = 0.096
• What if the false positive proportion was .05?
1 - 0.95^5 = 0.226
• What is the probability of at least one false positive if 10 tests were run (where P(FP) = 0.02)?
1 - 0.98^10 = 0.183
• What if the false positive proportion was .05?
1 - 0.95^10 = 0.401
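The four answers above come from one formula, so a small function covers all of them. This is a Python sketch; the function name is made up, and it assumes every test has the same specificity and that the tests are independent.

```python
def p_at_least_one_false_positive(specificity, n_tests):
    """P(at least one false positive) across n independent tests in a disease-free patient."""
    return 1 - specificity ** n_tests


for specificity in (0.98, 0.95):
    for n_tests in (5, 10):
        p = p_at_least_one_false_positive(specificity, n_tests)
        print(f"specificity {specificity}, {n_tests} tests: {p:.3f}")
```

The probability climbs quickly with the number of tests, which is why a long panel of individually good tests still produces frequent false alarms.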
46
Bayes’ theorem for diagnostic tests
• Suppose you know from diagnostic testing that
– The sensitivity of a new rapid HIV antibody test (P(T+|HIV+)) is 0.96
– The specificity (P(T-|HIV-)) of the test is 0.99
• You want to know the probability that someone with a positive test using this test is truly infected with HIV – What is P(HIV+|T+) ?
• This is called the Positive Predictive Value (PPV) of the test
48
Bayes’ theorem
• P(A|B) = P(B|A)P(A) / P(B)
• Proof:
– By definition of conditional probability
– P(A|B) = P(A∩B)/P(B)
• so P(A∩B) = P(A|B)*P(B)
– P(B|A) = P(A∩B)/P(A)
• so P(A∩B) = P(B|A)*P(A)
– Therefore P(A|B)*P(B) = P(B|A)*P(A); rearrange to get P(A|B) = P(B|A)*P(A) / P(B)
49
Bayes’ theorem for diagnostic tests
Probability of being truly infected with HIV (HIV+) if you have a positive test result
By Bayes’ theorem:
P(HIV+|T+) = P(T+|HIV+)*P(HIV+) / P(T+)
using P(A|B) = P(B|A)P(A) / P(B)
50
Bayes’ theorem for diagnostic tests
Want to know P(HIV+|T+)
Instead we know:
Sensitivity P(T+|HIV+) and
Specificity P(T-|HIV-) and
P(T-|HIV+) = 1 - sensitivity (false negatives) and
P(T+|HIV-) = 1 - specificity (false positives)
51
Bayes’ theorem for diagnostic tests
P(HIV+|T+) = P(T+|HIV+)*P(HIV+) / P(T+)
P(T+|HIV+) = 0.96 (sensitivity)
P(HIV+) in sub-Saharan Africa = 0.02
P(T+) = the overall chance of having a positive test
P(T+) = P(T+|HIV+) P(HIV+) + P(T+|HIV-) P(HIV-) by the law of total probability
= 0.96*0.02 + 0.01*0.98
P(HIV+|T+) = 0.96*0.02/(0.0192+0.0098) = 0.662
52
Prior and posterior probability
The prevalence of HIV was assumed to be 2%
So before testing, the probability that a randomly selected person is infected with HIV is .02
This is the prior probability.
The probability that someone who tests positive has HIV is .662
This is the posterior probability
It incorporates the information gained by doing the test
In reality, HIV tests have much higher sensitivity than 96% … So the PPV is higher
53
Bayes’ theorem for diagnostic tests
What is P(HIV+|T+) in a population in which the HIV prevalence is 0.004?
P(HIV+|T+) = P(T+|HIV+)*P(HIV+) / P(T+)
P(T+|HIV+) = 0.96
P(HIV+) = 0.004
P(T+) = P(T+|HIV+) P(HIV+) + P(T+|HIV-) P(HIV-) = 0.96*0.004 + 0.01*0.996
P(HIV+|T+) = 0.96*0.004/(0.00384+0.00996) = 0.278
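The positive predictive value can be wrapped in one function so that only the prevalence changes between scenarios. This Python sketch uses the sensitivity (0.96) and specificity (0.99) from the example; the function name is an assumption.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(D+ | T+) by Bayes' theorem, with P(T+) from the law of total probability."""
    p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_test_pos


for prevalence in (0.02, 0.004):
    ppv = positive_predictive_value(0.96, 0.99, prevalence)
    print(f"prevalence {prevalence}: PPV = {ppv:.3f}")
# prevalence 0.02: PPV = 0.662
# prevalence 0.004: PPV = 0.278
```

With the same test, the PPV drops from 0.662 to 0.278 when the prevalence falls from 2% to 0.4%, because false positives from the much larger uninfected group make up a bigger share of all positive results.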
54
Bayes’ theorem
Bayes’ theorem allows you to use what you know about the conditional probability of one event on another to help you understand the inverse
P(A1| B) = P(A1 ∩ B) / P(B)
= P( B | A1 ) P(A1) / P(B)
= P( B|A1 ) P(A1) / (P(B|A1)P(A1) + P(B|A2)P(A2) )
Remember P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) by the law of total probability
55
For next time
• Read Pagano and Gauvreau
– Chapter 6 (Review of today’s material)
– Chapter 7
• Bring your textbook to lecture next Tuesday
56