DATASET INTRODUCTION 1. Dataset: Urine 2 From Cleveland Clinic 1981-1984

DATASET INTRODUCTION

Dataset: Urine

From Cleveland Clinic 1981-1984

Outcome Variable:

Categorical Variable

Calcium Oxalate Crystal Presence

• In this analysis, this variable will be our

  • Outcome Variable

  • Response Variable

  • Dependent Variable

• Note: The dataset is coded directly as Yes/No (not 0/1 coding)

Other Variables (Covariates)

Quantitative Variables

Specific Gravity

pH

Osmolarity

Conductivity

Urea Concentration (millimoles/liter)

Calcium Concentration (millimoles/liter)

Cholesterol: serum cholesterol levels

Discussion/Review

Purpose of dataset: Determine which of the covariates are related to the outcome.

Covariates can also be called

• Independent Variables

• Predictors

• Explanatory Variables

Outcomes and covariates can each be categorical or quantitative

A given study can have more than one outcome and many covariates, with any mixture of variable types

Calcium Oxalate
Crystal Presence    N     Mean    Std Dev   Min     Q1      Med     Q3      Max
No                  42    2.69    1.90      0.17    1.22    2.16    3.93    8.48
Yes                 31    5.92    3.59      0.27    3.10    6.19    7.82    14.34

Discussion

Those with calcium oxalate crystals present tend to have higher calcium concentrations (mean 5.92 vs. 2.69; median 6.19 vs. 2.16)

Later we will learn to conduct hypothesis tests in such situations

Now we use this data to illustrate concepts of probability

Comments

To facilitate our discussion of probability and classification tests

We will categorize the quantitative variable Calcium Concentration into four groups

1 = 0-1.99
2 = 2-4.99
3 = 5-7.99
4 = 8 or More
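The grouping above can be sketched in code. This is only an illustration, not the software used for the slides: the function name `calcium_group` is hypothetical, and it assumes the groups are left-closed (a value of exactly 2 falls in group 2, exactly 5 in group 3, and so on).

```python
from bisect import bisect_right

# Lower bounds of groups 2, 3 and 4 (assumed left-closed intervals).
CUT_POINTS = [2.0, 5.0, 8.0]
LABELS = {1: "0-1.99", 2: "2-4.99", 3: "5-7.99", 4: "8 or More"}


def calcium_group(calcium):
    """Return the group code (1-4) for a calcium concentration in millimoles/liter."""
    # bisect_right counts how many cut points are <= the value.
    return bisect_right(CUT_POINTS, calcium) + 1


# Example values taken from the summary statistics (min, median, median, max).
for value in (0.17, 2.16, 6.19, 14.34):
    code = calcium_group(value)
    print(value, "->", code, LABELS[code])
```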

BASIC PROBABILITY

Part 1 (Unconditional Probability using Logic)

Back to the Urine Dataset

Suppose one individual is selected from our sample and consider the following questions

• What is the probability that the individual has calcium oxalate crystals present?

• What is the probability that the individual has a calcium concentration of 5 or more?

• What is the probability the individual has calcium oxalate crystals present AND has a calcium concentration of 5 or more?

• What is the probability the individual has calcium oxalate crystals present OR has a calcium concentration of 5 or more?

Comments

All four of these probability questions relate to the ENTIRE SAMPLE

We begin by answering the questions logically from the table we created using software

Let’s Practice!

Basic Probability of an Event

• What is the probability that the individual has calcium oxalate crystals present? We will denote this event by A.

• P(A) = 31/73 ≈ 0.425 = PREVALENCE of calcium oxalate crystals in our sample

Table of group by r

group (Calcium          r (Calcium Oxalate Crystal Presence)
Concentration Group)
Frequency               No     Yes    Total
0-1.99                  19      4      23
2-4.99                  17      9      26
5-7.99                   5     11      16
8 or More                1      7       8
Total                   42     31      73

Let’s Practice!

Basic Probability of an Event

• What is the probability that the individual has a calcium concentration of 5 or more? We will denote this event by B.

Let’s Practice!

Basic Probability of an Event: Intersections

• What is the probability the individual has calcium oxalate crystals present AND has a calcium concentration of 5 or more?

Let’s Practice!

Basic Probability of an Event: Unions

• What is the probability the individual has calcium oxalate crystals present OR has a calcium concentration of 5 or more?

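The four questions can be answered by counting cells in the frequency table. The sketch below is one way to do that counting; the variable names are illustrative, and the counts are copied from the table of group by r.

```python
from fractions import Fraction

# Frequency table from the slides: group -> (No, Yes) counts of crystal presence.
TABLE = {
    "0-1.99": (19, 4),
    "2-4.99": (17, 9),
    "5-7.99": (5, 11),
    "8 or More": (1, 7),
}
HIGH = ("5-7.99", "8 or More")  # groups that make up event B (calcium of 5 or more)

total = sum(no + yes for no, yes in TABLE.values())   # everyone in the sample
n_a = sum(yes for _, yes in TABLE.values())           # A: crystals present
n_b = sum(sum(TABLE[g]) for g in HIGH)                # B: calcium of 5 or more
n_a_and_b = sum(TABLE[g][1] for g in HIGH)            # A AND B: Yes cells in the high groups
n_a_or_b = n_a + sum(TABLE[g][0] for g in HIGH)       # A OR B: all Yes cells plus No cells in the high groups

p_a = Fraction(n_a, total)
p_b = Fraction(n_b, total)
p_a_and_b = Fraction(n_a_and_b, total)
p_a_or_b = Fraction(n_a_or_b, total)

for name, p in [("P(A)", p_a), ("P(B)", p_b), ("P(A and B)", p_a_and_b), ("P(A or B)", p_a_or_b)]:
    print(name, "=", p, "=", round(float(p), 3))
```

Every denominator is the full sample size of 73, which is what makes these unconditional probabilities.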

USING PROBABILITY RULES

Part 1

Probability Rules

Rules are created and used for many reasons

The rules and properties stated previously are important and useful in probability and sometimes in statistics

Not always needed

• If you can determine the answer through logic alone you may not need a rule!

• If you are provided only pieces of the puzzle, sometimes a rule is faster than logic!

Continuing

We now illustrate a few formulas using the questions we have already answered using logic

Let’s Practice Again!

Complement Rule

• What is the probability that the individual DOES NOT have calcium oxalate crystals present?

• We could use logic and count the No’s instead of the Yes’s; however, since we already know P(Yes) = P(A), the complement rule gives P(not A) = 1 - P(A)

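As a worked instance of the complement rule with the sample counts (31 Yes out of 73), the calculation can be written as follows; it agrees with counting the 42 No’s directly.

```latex
P(\text{not } A) = 1 - P(A) = 1 - \frac{31}{73} = \frac{42}{73} \approx 0.575
```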

Let’s Practice Again!

Addition Rule (Unions)

• What is the probability the individual has calcium oxalate crystals present OR has a calcium concentration of 5 or more?

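A short sketch of the addition rule, P(A or B) = P(A) + P(B) - P(A and B), using the sample probabilities found earlier by counting. The subtraction removes the 18 individuals who would otherwise be counted twice.

```python
from fractions import Fraction

p_a = Fraction(31, 73)        # crystals present
p_b = Fraction(24, 73)        # calcium concentration of 5 or more
p_a_and_b = Fraction(18, 73)  # both

# Addition rule for unions: subtract the overlap so it is not double-counted.
p_a_or_b = p_a + p_b - p_a_and_b

print("P(A or B) =", p_a_or_b, "=", round(float(p_a_or_b), 3))
```

The result, 37/73, matches the answer obtained by counting cells in the table.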

INDEPENDENCE

Part 1

Independent Events

Two events are independent if knowing one event occurs does not change the probability of the other

This is not the same as “disjoint” events which are separate in that they cannot occur together

These are two different concepts entirely

Independence is a statement about the equality of the probability of one event whether or not the other event occurs (or is occurring, or has occurred)

Let’s Practice!

Investigating Independence Part 1

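We know the following from our sample: P(A) = 31/73 ≈ 0.425, P(B) = 24/73 ≈ 0.329, and P(A and B) = 18/73 ≈ 0.247

Is P(A and B) = P(A)P(B)?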

Let’s Practice!

Investigating Independence Part 1

From our sample we have: P(A)P(B) = (31/73)(24/73) ≈ 0.140

This is clearly not equal to P(A and B) = 18/73 ≈ 0.247!!

In our sample the events are dependent (we can test this hypothesis about the population later)
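The comparison can be sketched in code as a check of the product rule for independent events, P(A and B) = P(A)P(B). It describes the sample only and is not a hypothesis test about the population.

```python
from fractions import Fraction

p_a = Fraction(31, 73)        # crystals present
p_b = Fraction(24, 73)        # calcium concentration of 5 or more
p_a_and_b = Fraction(18, 73)  # both, counted from the table

# Under independence the joint probability would equal this product.
product = p_a * p_b
independent_in_sample = (p_a_and_b == product)

print("P(A)P(B)   =", round(float(product), 3))
print("P(A and B) =", round(float(p_a_and_b), 3))
print("Independent in this sample?", independent_in_sample)
```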

BASIC PROBABILITY

Part 2: Conditional Probability (Logic & Formula)

Conditional Probability

So far, we have divided by the TOTAL

Sometimes, however, we have additional CONDITIONS that cause us to alter the denominator (bottom) of our probability calculation

Suppose, when choosing one person from the Urine data, we ask

• Given the individual has Calcium Oxalate Crystals present, what is the probability the individual’s calcium concentration is 5 or above?

“Conditional” refers to the fact that we have these additional conditions, restrictions, or other information

Let’s Practice!

CONDITIONAL Probability of an Event

• Given the individual has Calcium Oxalate Crystals present, what is the probability the individual’s calcium concentration is 5 or above?

Let’s Practice!

CONDITIONAL Probability FORMULA

• Given the individual has Calcium Oxalate Crystals present, what is the probability the individual’s calcium concentration is 5 or above?
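Written with the standard definition of conditional probability and the sample counts, with A = crystals present and B = calcium concentration of 5 or more, the calculation is:

```latex
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{18/73}{31/73} = \frac{18}{31} \approx 0.581
```

The total of 73 cancels, which is why restricting attention to the 31 individuals with crystals gives the same answer.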

Let’s Practice!

CONDITIONAL Probability of an Event

• Given the individual DOES NOT HAVE Calcium Oxalate Crystals present, what is the probability the individual’s calcium concentration is 5 or above?

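Both conditional questions amount to changing the denominator from the whole sample to a single column of the table. A minimal sketch, with the helper name `p_high_given` chosen only for illustration:

```python
from fractions import Fraction

# Frequency table from the slides: group -> (No, Yes) counts of crystal presence.
TABLE = {
    "0-1.99": (19, 4),
    "2-4.99": (17, 9),
    "5-7.99": (5, 11),
    "8 or More": (1, 7),
}
HIGH = ("5-7.99", "8 or More")  # calcium concentration of 5 or more


def p_high_given(crystals):
    """P(calcium of 5 or more | crystal status), where crystals is 'Yes' or 'No'."""
    column = 1 if crystals == "Yes" else 0
    denominator = sum(row[column] for row in TABLE.values())  # only that column's total
    numerator = sum(TABLE[g][column] for g in HIGH)
    return Fraction(numerator, denominator)


p_b_given_a = p_high_given("Yes")
p_b_given_not_a = p_high_given("No")
print("P(B | A)     =", p_b_given_a, "=", round(float(p_b_given_a), 3))
print("P(B | not A) =", p_b_given_not_a, "=", round(float(p_b_given_not_a), 3))
```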

MORE PRACTICE

Conditional Probability

Let’s Verify!

CONDITIONAL Probability of an Event

• Given the individual has a calcium concentration of 5 or above, what is the probability the individual has calcium oxalate crystals?

• We have a small amount of rounding error this time: counting gives 18/24 = 0.75, while the formula with rounded inputs gives 0.247/0.329 ≈ 0.751

INDEPENDENCE

Part 2

Let’s Practice!

Investigating Independence Part 2

We know the following from our sample: P(B|A) = 18/31 ≈ 0.581 and P(B|not A) = 6/42 ≈ 0.143

Is P(B|A) = P(B|not A)?

Comments

Investigating Independence Part 2

These probabilities are clearly unequal in our sample; our eventual question might be whether this is also true for our population

In this sample, these events are dependent

From our analysis so far, it seems likely they may be dependent in our population (we can test later)

Knowing whether or not the person has calcium oxalate crystals present CHANGES the probability of having a calcium concentration of 5 or above!!

GENERAL MULTIPLICATION RULE

General Multiplication Rule

This formula comes from rearranging the definition of conditional probability

To achieve the second formulation on the right consider the formula below for P(A|B) instead and note that the numerator is unchanged
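For reference, the rule in its standard textbook form is given below (the slide's own rendering is not reproduced here), followed by a check with the sample counts where A = crystals present and B = calcium concentration of 5 or more.

```latex
P(A \text{ and } B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)

\frac{31}{73}\cdot\frac{18}{31} = \frac{18}{73} = \frac{24}{73}\cdot\frac{18}{24}
```

Both orderings give 18/73 ≈ 0.247, the value found by counting.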

REPEATED SAMPLING

Repeated Sampling

Often we consider problems in which we draw multiple individuals from a set of individuals

• Drawing parts from a box where some are defective

• Choosing multiple people from a certain population

The formulas we have investigated can be used to calculate probabilities in these situations

Let’s Practice!

If we select two subjects at random from our sample, what is the probability that both have a calcium concentration of 8 or more?

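One way to work this out is with the general multiplication rule. The sketch assumes the two subjects are drawn without replacement, so the second draw is made from the 72 remaining individuals, 7 of whom are in the "8 or More" group.

```python
from fractions import Fraction

n_total = 73  # individuals in the sample
n_high = 8    # individuals with a calcium concentration of 8 or more

# P(first is 8 or more) * P(second is 8 or more | first was 8 or more),
# assuming sampling without replacement.
p_first = Fraction(n_high, n_total)
p_second_given_first = Fraction(n_high - 1, n_total - 1)
p_both = p_first * p_second_given_first

print("P(both 8 or more) =", p_both, "=", round(float(p_both), 4))
```

With replacement, the second factor would instead be 8/73 again; the assumption changes the answer only slightly here, but the two setups are different calculations.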

WANT TO LEARN MORE?

READ THE FOLLOWING OPTIONAL MATERIAL

The remaining slides are optional. They illustrate some more difficult probability rules along with additional examples of probability related to the health sciences

Optional Content: Read About

Relative Risk

Total Probability Rule

Bayes Rule

Screening Tests

• Sensitivity/Specificity

• PV+/PV-

• False Positive and False Negative Rates

ROC Curves

Relative Risk

Relative risk is

• the risk of an “event” relative to an “exposure”

• the ratio of the probability of the event occurring among “exposed” versus “non-exposed”

• If A and B are independent, the relative risk is 1

In our rule, RR = P(B|A) / P(B|not A), B is the EVENT and A is the EXPOSURE

Let’s Practice!

Find the Relative Risk of High Calcium Concentration Given Calcium Oxalate Crystal Presence

• Note: this is the reverse of what we probably want in this case; consider that for more practice!

• INTERPRET RR: Having a calcium concentration of 5 or more is around 4 times more likely among those with calcium oxalate crystals than among those without.
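The relative risk quoted above can be reproduced as the ratio of the two conditional probabilities, with the event B = calcium concentration of 5 or more and the exposure A = crystals present. This sketch uses the table counts directly.

```python
from fractions import Fraction

# Counts from the frequency table.
high_with_crystals, n_with_crystals = 18, 31        # exposed: crystals present
high_without_crystals, n_without_crystals = 6, 42   # non-exposed: crystals absent

p_event_exposed = Fraction(high_with_crystals, n_with_crystals)          # P(B | A)
p_event_unexposed = Fraction(high_without_crystals, n_without_crystals)  # P(B | not A)

relative_risk = p_event_exposed / p_event_unexposed
print("RR =", relative_risk, "=", round(float(relative_risk), 2))
```

A value of 1 would correspond to independence of A and B in the sample; here the ratio is roughly 4.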

Total Probability Rule
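In its standard two-event form, the rule splits P(B) over A and not A. The version below is the textbook statement, checked against the sample with A = crystals present and B = calcium concentration of 5 or more.

```latex
P(B) = P(B \mid A)\,P(A) + P(B \mid \text{not } A)\,P(\text{not } A)

P(B) = \frac{18}{31}\cdot\frac{31}{73} + \frac{6}{42}\cdot\frac{42}{73} = \frac{18}{73} + \frac{6}{73} = \frac{24}{73} \approx 0.329
```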

Bayes’ Rule

We want to find P(A|B), so we will need to “rearrange” the formula, swapping A’s and B’s
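The standard statement of Bayes' rule for two events is shown below, together with the sample values for A = crystals present and B = calcium concentration of 5 or more. The denominator is P(B) expanded with the total probability rule.

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \text{not } A)\,P(\text{not } A)}

P(A \mid B) = \frac{\frac{18}{31}\cdot\frac{31}{73}}{\frac{18}{31}\cdot\frac{31}{73} + \frac{6}{42}\cdot\frac{42}{73}} = \frac{18/73}{24/73} = \frac{18}{24} = 0.75
```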

Let’s Verify!

CONDITIONAL Probability of an Event

• Given the individual has a calcium concentration of 5 or above, what is the probability the individual has calcium oxalate crystals?

• We have a small amount of rounding error this time

SCREENING TESTS

and ROC Curves

Screening Tests

Sensitivity & Specificity

“Epi” Style

                 Has Condition        Does not have Condition
Test Positive    A (TP)               B (FP)                     Total Positive Test (A+B)
Test Negative    C (FN)               D (TN)                     Total Negative Test (C+D)
                 Number with          Number without
                 Condition (A+C)      Condition (B+D)
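Using the cell labels A, B, C, D from the table, the usual screening measures can be sketched as below. The function name `screening_measures` is illustrative; the example counts are the "5 or more" cut-point from the urine data (A = 18, B = 6, C = 13, D = 36).

```python
from fractions import Fraction


def screening_measures(a, b, c, d):
    """Standard screening-test measures from a 2x2 table.

    a = true positives, b = false positives, c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": Fraction(a, a + c),          # P(Test+ | Condition)
        "specificity": Fraction(d, b + d),          # P(Test- | No condition)
        "PV+": Fraction(a, a + b),                  # P(Condition | Test+)
        "PV-": Fraction(d, c + d),                  # P(No condition | Test-)
        "false positive rate": Fraction(b, b + d),  # 1 - specificity
        "false negative rate": Fraction(c, a + c),  # 1 - sensitivity
    }


measures = screening_measures(a=18, b=6, c=13, d=36)
for name, value in measures.items():
    print(name, "=", value, "=", round(float(value), 2))
```

Sensitivity and specificity condition on the true status (column totals), while PV+ and PV- condition on the test result (row totals).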

Sensitivity & Specificity

                        Has Condition   Does not have Condition
0-1.99 (NEGATIVE)             4                  19
2 or more (POSITIVE)         27                  23
Total                        31                  42

group (Calcium          r (Calcium Oxalate Crystal Presence)
Concentration Group)
Frequency               Yes    No     Total
0-1.99                    4    19      23
2-4.99                    9    17      26
5-7.99                   11     5      16
8 or More                 7     1       8
Total                    31    42      73

Sensitivity & Specificity

                        Has Condition   Does not have Condition
0-4.99 (NEGATIVE)            13                  36
5 or more (POSITIVE)         18                   6
Total                        31                  42

Sensitivity & Specificity

                        Has Condition   Does not have Condition
0-7.99 (NEGATIVE)            24                  41
8 or more (POSITIVE)          7                   1
Total                        31                  42

Bayes’ Rule

                        Has Condition   Does not have Condition
Negative (0-7.99)            24                  41
Positive (≥ 8)                7                   1
Total                        31                  42

Here we define: A = Disease, B = Test Positive
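With A = Disease and B = Test Positive, Bayes' rule turns sensitivity, specificity, and prevalence into the predictive value positive, P(A|B). The sketch below uses the counts for the "8 or more" cut-point and takes the sample prevalence 31/73 as P(A), which is an assumption that only holds when the sample reflects the population of interest.

```python
from fractions import Fraction

sensitivity = Fraction(7, 31)   # P(B | A): test positive given disease
specificity = Fraction(41, 42)  # P(not B | not A): test negative given no disease
prevalence = Fraction(31, 73)   # P(A), taken from the sample (assumption)

# Bayes' rule: P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | not A) P(not A)]
numerator = sensitivity * prevalence
denominator = numerator + (1 - specificity) * (1 - prevalence)
pv_positive = numerator / denominator

print("PV+ = P(A | B) =", pv_positive, "=", float(pv_positive))
```

The result agrees with reading the positive row of the table directly: 7 of the 8 positive tests have the condition.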

Choosing Different Cut-Off: 2 or more

Cut-point    Sensitivity   Specificity
2 or more       0.87          0.45
5 or more       0.58          0.86
8 or more       0.23          0.98

High Sensitivity but Low Specificity

Choosing Different Cut-Off: 5 or more

Sensitivity 0.58, Specificity 0.86

Specificity increased, but you reduce sensitivity

Choosing Different Cut-Off: 8 or more

Sensitivity 0.23, Specificity 0.98

Very High Specificity, Very Low Sensitivity (High False Negative Rate)
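The cut-point table can be recomputed from the group counts. In this sketch a test is called positive when the calcium group is at or above the cut-point; the list names are illustrative.

```python
# Counts per calcium group, in order: 0-1.99, 2-4.99, 5-7.99, 8 or More.
YES = [4, 9, 11, 7]   # has condition (crystals present)
NO = [19, 17, 5, 1]   # does not have condition
LOWER_BOUNDS = [0, 2, 5, 8]

results = {}
for i in range(1, len(LOWER_BOUNDS)):
    cut = LOWER_BOUNDS[i]
    # Positive test = groups at or above the cut-point (indices i and up).
    sensitivity = sum(YES[i:]) / sum(YES)  # positives among those with the condition
    specificity = sum(NO[:i]) / sum(NO)    # negatives among those without it
    results[cut] = (round(sensitivity, 2), round(specificity, 2))
    print(f"{cut} or more: sensitivity={results[cut][0]}, specificity={results[cut][1]}")
```

Raising the cut-point moves individuals from the positive to the negative side, so sensitivity can only fall and specificity can only rise.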

What happens when

We assign all individuals a positive test result?

• Sensitivity = P(Test+|Disease) = 1

• Specificity = P(Test-|No Disease) = 0

• 1 – Specificity = 1

We assign all individuals a negative test result?

• Sensitivity = P(Test+|Disease) = 0

• Specificity = P(Test-|No Disease) = 1

• 1 – Specificity = 0

Receiver Operating Characteristic curve (ROC curve)

Cut-point    Sensitivity   Specificity
2 or more       0.87          0.45
5 or more       0.58          0.86
8 or more       0.23          0.98

[Figure: ROC Curve for Calcium Oxalate. x-axis: False Positive Rate (1-Specificity); y-axis: True Positive Rate (Sensitivity); points labeled with cut-points 2, 5, and 8]

ROC Curves

Area under the curve = probability that for a randomly selected pair of normal and abnormal subjects, the test will correctly identify the normal subject given the “measurement”

Area = 0.89 for the example on the left

Trapezoidal Rule (FYI)
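As an illustration of the trapezoidal rule, the sketch below computes the area under the piecewise-linear ROC curve through the three calcium cut-points plus the endpoints (0, 0) and (1, 1). The function name `trapezoid_area` is hypothetical. The result applies only to this three-cut-point curve, so it need not match an area quoted for a different example curve.

```python
def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through (xs[i], ys[i]), with xs increasing."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2  # width times average height of each trapezoid
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


# ROC points ordered by false positive rate: (0, 0), cut-points 8, 5, 2, then (1, 1).
false_positive_rate = [0, 1 / 42, 6 / 42, 23 / 42, 1]  # 1 - specificity
true_positive_rate = [0, 7 / 31, 18 / 31, 27 / 31, 1]  # sensitivity

auc = trapezoid_area(false_positive_rate, true_positive_rate)
print("Area under the ROC curve:", round(auc, 3))
```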