Unit 5a: Survival Analysis: Questions about Whether and When
© Andrew Ho, Harvard Graduate School of Education Unit 5a– Slide 1
http://xkcd.com/931/
• Research questions addressed by survival analysis: Whether + When
• Contrasting 2 Data Formats: Person vs. Person-Period
• Life Table Analysis: Hazard Probability vs. Survival Probability
Course Roadmap: Unit 5a

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

• Do your residuals meet the required assumptions?
    • Test for residual normality.
    • Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis.
    • Use individual growth modeling.
    • Specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis… (Today’s Topic Area)
• If your outcome is categorical, you need to use…
    • Binomial logistic regression analysis (dichotomous outcome).
    • Multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with,
    • Create taxonomies of fitted models and compare them.
    • Form composites of the indicators of any common construct.
    • Conduct a Principal Components Analysis.
    • Use Cluster Analysis.
    • Use Factor Analysis: EFA or CFA?
• If your outcome vs. predictor relationship is non-linear,
    • Use non-linear regression analysis.
    • Transform the outcome or predictor.
The “Whether and When” Test: You need survival analysis if your research questions ask “Whether” and “When” a critical event occurs.
Time-to-Relapse Among Treated Alcoholics: Cooney, et al. (1991).
Research Questions:
• Whether, and if so when, do rehabilitated alcoholics relapse to drinking?
• Which treatment regimens are more effective in preventing relapse?
89 post-treatment alcoholics, randomized to either a “coping skills” or an “interaction skills” follow-up treatment.
Prospective data collection for 2 years.
During follow-up, 57 patients relapsed to alcoholism, 28 remained abstinent, and 4 disappeared after remaining abstinent for a short time.
Time-to-Relapse depended on:
• Type of follow-up program.
• Psychopathology of the patient.
Age at 1st Suicide Ideation For Adolescents: Bolger, et al. (1989).
Research Questions:
• Whether, and if so when, does an adolescent 1st consider suicide?
• Does occurrence of suicide ideation differ by gender and developmental phase?
391 undergraduates, aged 16 through 22.
Retrospective data collection, through current age.
At interview, 275 respondents had considered suicide, 116 had not.
Time-to-First-Suicide-Ideation:
• Greatest risk in middle adolescence.
• Higher among females.
• Higher in adolescents w/ absent parents.
• Race by Age interaction.
Research questions addressed by Survival Analysis
Classical Methods of Survival Analysis (Today)
Simple data-analytic approaches for summarizing survival data appropriately:
• Estimation of the sample hazard function.
• Estimation of the sample survivor function.
• Estimation of the median lifetime.
Simple tests of differences in survivor functions by “group”:
• Survival analytic equivalent of the t-test.

Discrete-Time Survival Analysis (Next 2-3 class meetings)
Replicates classical methods of survival analysis, using logistic regression analysis.
Extends classical survival analytic methods by making a regression format available:
• Can include multiple predictors, including interactions.
• Provides single parameter and GLH testing, using the -2LL statistic.
• Fitted hazard functions, survivor functions & median lifetimes can be recovered from the fitted logistic regression model.

Continuous-Time Survival Analysis (Time Permitting)
Replaces discrete-time survival analysis when time has been measured continuously.
Imposes additional assumptions on the data.
Extends classical survival analytic methods by making a regression format available:
• Can include predictors, including interactions.
• Has its own testing procedures, based on standard practices.
• Fitted hazard functions, survivor functions & median lifetimes are easily recovered from the fitted models.
Analytic Approaches to Survival Analysis
Dataset: SPEC_ED.txt
Overview: Discrete-time person-level dataset on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.
Source: State Department of Education, Michigan.
Sample size: 3941 teachers.
More Info: Singer & Willett, 2003
Note on labeling of discrete-time “bins”: We regarded a teacher’s physical first year as their zeroth year, a year in which they must have taught in order to be a part of the study. If they quit sometime during the following year, they were classified as having taught for one year and having quit in “bin one.”
Important Distinction You Must Keep In Mind
The two “modern” approaches to survival analysis are distinct in the way that duration must be measured:
• In Discrete-time Survival Analysis, time is measured in discrete units, such as semesters, years, etc.
• In Continuous-time Survival Analysis, time can be measured to any level of precision.
Research Question
Whether, and if so when, do special education teachers in Michigan leave the teaching profession for the first time?
“Multiple Cohort” Sample Design
Multiple annual cohorts of special education teachers are pooled together in the sample:
• Cohorts entered the sample sequentially between the 1972/3 and 1978/9 school years.*
• All cohorts were followed until the end of the 1984/5 school year (i.e., in June 1985).

72 |--|--|--|--|--|--|--|--|--|--|--|--|--|85
73    |--|--|--|--|--|--|--|--|--|--|--|--|85
74       |--|--|--|--|--|--|--|--|--|--|--|85
75          |--|--|--|--|--|--|--|--|--|--|85
76             |--|--|--|--|--|--|--|--|--|85
77                |--|--|--|--|--|--|--|--|85
78                   |--|--|--|--|--|--|--|85
The SPEC_ED Dataset
The dataset is straightforward, containing Teacher IDs and length of service, with one small hitch …
Structure of Dataset
Col# | Var Name | Variable Description | Variable Metric/Labels
1 | ID | Teacher identification code. | Integer
2 | YRSTCH | # of years the teacher remained in teaching, to first quit, or until the teacher was censored in 1985 by the end of the study. | Integer
3 | CENSOR | Dummy variable to indicate whether a teacher’s career was censored by the end of data collection in 1985. | Dichotomous variable: 0 = not censored, 1 = censored.
There is a problem intrinsic to survival data, and it is illustrated here:
• The event of interest is “quitting teaching for the first time.”
• But not every teacher experiences this event while being observed by researchers.
• We say that these teachers are “censored” by the end of the data collection.
• We call this “right censoring” because the YRSTCH range is cut off on the right (positive) side. The actual observation (if we had waited) would be higher.

Key Idea: The presence of the censored cases is telling you something about the probability that the time-to-event is longer than the period of observation.

If you want an unbiased estimate of time-to-event, you cannot ignore the censored cases, but must find a way to include them in the analysis so that they can contribute whatever information they contain.

Why Is Censoring A Problem For Data Analysis?
… because if censoring occurs, we don’t know the time-to-event for the people in the sample who may have the longest times-to-event.
Dataset variables and the issue of “Censoring”
Bearing this in mind, let’s explore the special educator data in the Stata do-file, Unit5a.do …

    *---------------------------------------------------------------------------
    * Input the raw dataset, name and label the variables and selected values.
    *---------------------------------------------------------------------------
    * Input the target dataset:
    infile ID YRSTCH CENSOR ///
        using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED.txt"
    * Label the variables:
    label variable ID "Teacher Identification Code"
    label variable YRSTCH "Number of Years in Teaching"
    label variable CENSOR "Was Teaching Career Censored?"
    * Label the values of important categorical variables:
    * Dichotomous censoring variable CENSOR:
    label define censorlbl 0 "Not Censored" 1 "Censored"
    label values CENSOR censorlbl
    *---------------------------------------------------------------------------
    * Examining the data, for the first 40 cases.
    *---------------------------------------------------------------------------
    list ID YRSTCH CENSOR in 1/40, clean

• Standard data-input and labeling statements.
• Print out the data on the first 40 teachers in the dataset for inspection …
The “Person-Level” Dataset

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
 15. | 16       12       Censored |
     |----------------------------|
 16. | 17        2   Not Censored |
 17. | 18       12       Censored |
 18. | 19        1   Not Censored |
 19. | 20        3   Not Censored |
  …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+
Here’s the data listing (with cases omitted to save space) …
A dataset formatted in this way is known as a person-level dataset, because it contains one row of event history data per teacher.

Teacher #2 was in the dataset for 2 years and was not censored.
• S/he experienced the event of interest in the second year.
• That is, s/he quit teaching for the first time sometime during the second year.

Teacher #5 was in the dataset for 12 years and was censored.
• S/he outlasted the data collection.
• S/he taught for at least 12 years, and possibly more.
We tend to be drawn to dangerous analyses with this dataset structure!!!
The “Person-Level” dataset encourages dangerous analyses…
Frequency of Teachers with Careers of Different Lengths

Career Length | Uncensored Cases | Censored Cases
1 | 456 | 0
2 | 384 | 0
3 | 359 | 0
4 | 295 | 0
5 | 218 | 0
6 | 184 | 0
7 | 123 | 280
8 | 79 | 307
9 | 53 | 255
10 | 35 | 265
11 | 16 | 241
12 | 5 | 386

One sensible thing you can do in such datasets is display the frequency with which each career length occurs, in a vertical histogram that includes all the teachers in the sample, both censored and uncensored.

[Figure: histogram of Frequency (# of teachers) by Number of Years in Teaching, with separate bars for Uncensored and Censored cases.]

Note the impact of the multi-cohort research design: any teacher who began teaching after 1978 and taught longer than six years is a censored case.

Comparing Uncensored and Censored Cases
Here are two misleading, but common, strategies for trying to summarize teachers’ career length while trying to deal with censoring …

First Misleading Approach
If you take the average of the career lengths of only the uncensored teachers, their sample mean teaching career is 3.73 years, a negatively biased estimate of the average population teaching career length.

Second Misleading Approach
If you set the career lengths of the censored teachers to their longest observed career length, then the sample mean teaching career length is 6.31 years. This too is a negatively biased estimate of population career length, even if only one teacher has lasted longer than the censored duration.
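Both sample means can be checked from the career-length frequency table. The following is a minimal Stata sketch, not part of Unit5a.do: it keys in the table counts by hand, and the lowercase variable names (yrstch, censor, n) and scalar names are made up for illustration.

```stata
* Frequency-form copy of the career-length table (counts keyed in from the slide).
clear
input yrstch censor n
 1 0 456
 2 0 384
 3 0 359
 4 0 295
 5 0 218
 6 0 184
 7 0 123
 8 0  79
 9 0  53
10 0  35
11 0  16
12 0   5
 7 1 280
 8 1 307
 9 1 255
10 1 265
11 1 241
12 1 386
end

* First misleading approach: average the uncensored careers only.
summarize yrstch if censor == 0 [fweight = n]
scalar mean_uncensored = r(mean)

* Second misleading approach: treat each censored duration as a completed career.
summarize yrstch [fweight = n]
scalar mean_all = r(mean)

display "Uncensored only: " %4.2f mean_uncensored   // 3.73
display "All teachers:    " %4.2f mean_all          // 6.31
```

The frequency weights stand in for the 3941 individual rows of the person-level file; running the same two summarize commands on YRSTCH and CENSOR in the real dataset should give the same means.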
Bias imparted when ignoring censoring
Dataset: SPEC_ED_PP.txt
Overview: Person-period dataset containing the same information as the SPEC_ED.txt person dataset, on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.
Source: State Department of Education, Michigan.
Sample size: 24875 annual person-period records.
More Info: Singer & Willett, 2003
It is easier to appreciate these data when they are reformatted into a person-period format. In a person-period dataset, you can gain a better understanding of a class of summary statistics that address the “whether” and “when” questions.
• Hazard probability: Probability of failure at time t, conditional upon survival to that time point.
• Survival probability: Probability of surviving beyond time t.
• Median lifetime: Lifetime above which half of the persons survive.
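In symbols, these three quantities can be written as below. This is a sketch in standard discrete-time notation; the symbols T (the period in which a person experiences the event) and t_j (the j-th discrete period) are assumptions introduced here, and the product identity is the usual life-table relationship rather than something stated on the slide.

```latex
% T   = discrete time period in which the event occurs (assumed notation)
% t_j = the j-th discrete time period
\begin{align*}
h(t_j) &= \Pr\left[\,T = t_j \mid T \ge t_j\,\right]
  && \text{(hazard probability)} \\
S(t_j) &= \Pr\left[\,T > t_j\,\right]
        = \prod_{k=1}^{j} \bigl[\,1 - h(t_k)\,\bigr]
  && \text{(survival probability)} \\
\text{median lifetime} &: \ \text{the time } t \text{ at which } S(t) = 0.5
\end{align*}
```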
Notice that the name of the dataset is different.
Here’s a clue to the difference between the person-level and the person-period dataset… There is a row for every person-period combination in the data.
The Person-Period Dataset
To convert from one to the other, use the dthaz library. Type “net install dthaz.pkg” or type “findit prsnperd”. The library was created by a former Ph.D. student at our School of Public Health, Alexis Dinno (now an Assistant Professor at Portland State).
In a person-period dataset, each person has one row of data for each discrete time-period, each containing …

Col | Var | Variable Description | Labels
1 | ID | Teacher identification code. | Integer
2 | PERIOD | Records the discrete time period to which each record refers. | Integer
3 | EVENT | Dummy variable indicating whether the teacher experienced the event of interest in this period. | 0 = no; 1 = yes
4 | P1 | |
5 | P2 | |
6 | P3 | |
Etc.

• The earlier YRSTCH variable, which recorded the duration of the teaching career in the person-level dataset, has been replaced by variable PERIOD, which labels the time-period to which each row of the person-period dataset refers.
• We’ve also acquired a new variable called EVENT, which records whether a teacher experienced the event of interest (“Quit Teaching For The 1st Time”) in the particular discrete time-period in question.
• The person-period dataset contains other variables too (P1, P2, P3, etc.), listed in the last rows of the codebook. We ignore them here, but will return to them later during the presentation on discrete-time survival analysis.
The Person-Period Data Structure
In Unit5a.do, I input the special educator person-period dataset and list the data, including estimation of a life table …

    *-----------------------------------------------------------------------------
    * Input the person-period dataset
    *-----------------------------------------------------------------------------
    * Input the dataset:
    infile ID PERIOD EVENT P1-P12 ///
        using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED_PP.txt"
    * Label the variables:
    label variable ID "Teacher Identification Code"
    label variable PERIOD "Current Time Period"
    label variable EVENT "Did Teacher Quit in this Time Period"
    * Label the values of important categorical variables:
    * Dichotomous event occurrence variable EVENT:
    label define eventlbl 0 "No Quit" 1 "Quit"
    label values EVENT eventlbl
    *-----------------------------------------------------------------------------
    * Inspect the structure of the new person-period dataset.
    * Notice that there is one row per discrete time-period for each person.
    *-----------------------------------------------------------------------------
    list ID PERIOD EVENT in 1/40
    *-----------------------------------------------------------------------------
    * Carry out the life-table analysis, by classical contingency table analysis.
    *-----------------------------------------------------------------------------
    tabulate EVENT PERIOD, column

• Standard data input statements, reading in the ID, PERIOD and EVENT variables and the mystery variables, P1 through P12, that we will return to later during our discrete-time survival-analysis presentation.
• Print out the first 40 cases for inspection.
• Carry out a Life Table Analysis: Tabulate the frequencies of EVENT by PERIOD. Kill the row & total percentage computation, but retain the estimation of percentages in the columns defined by PERIOD.
Reading in the Person-Period Dataset
Person-Level Dataset

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
  …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

Person-Period Dataset

     +-----------------------+
     | ID   PERIOD     EVENT |
     |-----------------------|
  1. |  1        1      Quit |
  2. |  2        1   No Quit |
  3. |  2        2      Quit |
  4. |  3        1      Quit |
  5. |  4        1      Quit |
     |-----------------------|
  6. |  5        1   No Quit |
  7. |  5        2   No Quit |
  8. |  5        3   No Quit |
  9. |  5        4   No Quit |
 10. |  5        5   No Quit |
     |-----------------------|
 11. |  5        6   No Quit |
 12. |  5        7   No Quit |
 13. |  5        8   No Quit |
 14. |  5        9   No Quit |
 15. |  5       10   No Quit |
     |-----------------------|
 16. |  5       11   No Quit |
 17. |  5       12   No Quit |
 18. |  6        1      Quit |
 19. |  7        1   No Quit |
 20. |  7        2   No Quit |
     |-----------------------|
 21. |  7        3   No Quit |
 22. |  7        4   No Quit |
 23. |  7        5   No Quit |
 24. |  7        6   No Quit |
 25. |  7        7   No Quit |
     |-----------------------|
 26. |  7        8   No Quit |
 27. |  7        9   No Quit |
  …
In a person-period dataset:
• Each person contributes one row of data for each time-period.
• The data record continues until the time-period in which they either experience the event of interest, or they are censored.

Teacher #2 is not censored, and so s/he experiences the event of interest (i.e., quits teaching for the first time) in the 2nd year.

Teacher #5 is censored: s/he never experiences the event of interest (i.e., doesn’t quit teaching for the first time) in all the 12 years during which teachers are observed.
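The mapping from one row per teacher to one row per teacher-period can be sketched by hand with Stata's expand command. This is an illustrative sketch only: it is not the prsnperd implementation from the dthaz package, it builds only PERIOD and EVENT (not P1 through P12), and the three teachers keyed in below are Teachers #1, #2 and #5 from the listing, with lowercase variable names chosen here for illustration.

```stata
* Three person-level records taken from the listing (IDs 1, 2 and 5).
clear
input id yrstch censor
1  1 0
2  2 0
5 12 1
end

* Create one copy of each teacher's row per year observed.
expand yrstch

* Number the copies within teacher: this is the discrete time period.
bysort id: generate period = _n

* EVENT is 1 only in the last observed period of an uncensored teacher.
generate event = (period == yrstch) & (censor == 0)

sort id period
list id period event, sepby(id) noobs
```

Teacher #2 contributes two rows with the event in period 2, and Teacher #5 contributes twelve rows with no event, which matches the person-period listing.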
Person-Level vs. Person-Period Datasets
Here’s the Life Table: it’s a Two-Way Contingency Table Analysis of EVENT by PERIOD …

. tabulate EVENT PERIOD, column

Key: each cell shows the frequency (first line) and the column percentage (second line).
EVENT: 0 = Event did not happen; 1 = Event happened.

         |                                       PERIOD                                       |
   EVENT |      1      2      3      4      5      6      7      8      9     10     11     12 |   Total
---------+-------------------------------------------------------------------------------------+--------
 No Quit |  3,485  3,101  2,742  2,447  2,229  2,045  1,922  1,563  1,203    913    632    386 |  22,668
         |  88.43  88.98  88.42  89.24  91.09  91.75  93.99  95.19  95.78  96.31  97.53  98.72 |   91.13
---------+-------------------------------------------------------------------------------------+--------
    Quit |    456    384    359    295    218    184    123     79     53     35     16      5 |   2,207
         |  11.57  11.02  11.58  10.76   8.91   8.25   6.01   4.81   4.22   3.69   2.47   1.28 |    8.87
---------+-------------------------------------------------------------------------------------+--------
   Total |  3,941  3,485  3,101  2,742  2,447  2,229  2,045  1,642  1,256    948    648    391 |  24,875
         | 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 |  100.00
Use frequencies to estimate a hazard probability describing “risk of quitting teaching for the 1st time” in each time-period, given that the teacher survived earlier periods.

Hazard probability is the (conditional) probability that a teacher will experience the event of interest (i.e., quit teaching for the first time) in a particular time-period, given that s/he has “survived” up until this period.

In Discrete Time-Period #1, for instance:
• There are 3941 teachers “at risk of quitting for the first time.”
• Of this “risk set,” 456 were observed to quit for the first time.
• Hence, the probability that a teacher will quit for the first time in this period (given that s/he entered it) is 456/3941, or 0.1157.
• So, the sample hazard probability in Discrete Time-Period #1 is $\hat{h}(t_1) = 0.1157$.
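The same division can be repeated for every period. The sketch below is not part of Unit5a.do: it keys in the Total and Quit rows of the life table by hand (the variable names period, atrisk, quit and hazard are chosen here for illustration) and reproduces the Quit column percentages as proportions.

```stata
* Risk set (Total row) and number quitting (Quit row), keyed in from the life table.
clear
input period atrisk quit
 1 3941 456
 2 3485 384
 3 3101 359
 4 2742 295
 5 2447 218
 6 2229 184
 7 2045 123
 8 1642  79
 9 1256  53
10  948  35
11  648  16
12  391   5
end

* Sample hazard probability: quitters divided by the risk set in each period.
generate hazard = quit / atrisk
format hazard %6.4f

list period atrisk quit hazard, noobs clean
```

The first row gives 0.1157 and the last gives 0.0128, matching the 11.57 and 1.28 column percentages in the table.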
Life Tables: At Each Time Point, for People Who Survived
Here’s the sample hazard probability for discrete time-period #2 …

Sample hazard probability (or “risk”) in discrete time-period #2 is:
• 3485 teachers survive from time-period #1 and enter the risk set for time-period #2.
• Of these, 384 quit for the first time.
• Hence, the risk that a teacher will quit for the first time in time-period #2, given that s/he survived to that point, is 384/3485, or 0.1102.
• So, the sample hazard probability in discrete time-period #2 is 11.02%.

How did we get that number? Note that the survivors at the target time point are the survivors from the previous time point minus the “quitters” (3941 − 456 = 3485). For now…
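One reading of the “For now…” caveat is that, from period 7 onward, censored teachers also leave the risk set. The sketch below tests that reading against the numbers: it keys in the quit and censored counts from the career-length table (an assumption about how the two tables line up, not something Unit5a.do does) and rebuilds the risk set recursively. The variable names are illustrative.

```stata
* Quit and censored counts per period, keyed in from the career-length table.
clear
input period quit censored
 1 456   0
 2 384   0
 3 359   0
 4 295   0
 5 218   0
 6 184   0
 7 123 280
 8  79 307
 9  53 255
10  35 265
11  16 241
12   5 386
end

* Everyone is at risk in period 1.
generate atrisk = 3941 in 1

* Next risk set = previous risk set minus those who quit or were censored.
* replace works through the rows in order, so each row sees the updated previous row.
replace atrisk = atrisk[_n-1] - quit[_n-1] - censored[_n-1] in 2/l

list period atrisk quit censored, noobs clean
```

Through period 7 the risk set shrinks only by the quitters; after that, subtracting the censored teachers as well reproduces the Total row of the life table (1642, 1256, 948, 648, 391).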
Hazard Probability: For each Time Point, the Probability of “Fail”
Here’s the sample hazard probability for discrete time-period #3 …
Sample hazard probability (or “risk”) in discrete time-period #3: 3101 teachers survive from time-period #2 and enter the risk set for time-period #3. Of these, 359 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #3, given that she survived to that point, is (359/3101), or 0.1158. So, the sample hazard probability in discrete time-period #3 is 11.58%.
How did we get that number? The teachers who enter the risk set at the target time point are still those in the risk set at the previous time point minus the “quitters.” For now…
Hazard Probability: For each Time Point, the Probability of “Fail”
Here’s the sample hazard probability for discrete time-period #11 …
Sample hazard probability (or “risk”) in discrete time-period #11: 913 teachers survive from time-period #10… BUT ONLY 648 enter the risk set for time-period #11. Of these, 16 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #11, given that she survived to that point, is (16/648), or 0.0247. So, the sample hazard probability in discrete time-period #11 is 2.47%.
Where did the other 265 teachers go? At time 10, they were censored. They did not quit, but, although they did survive until time-period #11, we do not know whether they quit at that time. So, we leave them out of the risk set rather than let them bias our interpretation of the hazard probability at time-period #11. However, they can contribute to hazard estimates at times before 11!
Hazard Probability: For each Time Point, the Probability of “Fail”
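The number of censored teachers can be recovered from the tabulated counts. The following is a minimal sketch, assuming that anyone who neither quit in a period nor appears in the next period's risk set was censored at the end of that period; the function name `censored_after_each_period` is hypothetical.

```python
# Counts for periods 1..12, read from the tabulation of EVENT by PERIOD.
risk_set = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
quit_counts = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]


def censored_after_each_period(risk_set, quit_counts):
    """Return the number censored at the end of each period except the last.

    Assumption: a teacher who did not quit in period j and is missing from
    the risk set of period j + 1 was censored at the end of period j.
    """
    censored = []
    for j in range(len(risk_set) - 1):
        survivors = risk_set[j] - quit_counts[j]
        censored.append(survivors - risk_set[j + 1])
    return censored


censored = censored_after_each_period(risk_set, quit_counts)
for period, n in enumerate(censored, start=1):
    print(f"end of period {period}: {n} censored")
```

Under that assumption, nobody is censored before period 7, and the 913 - 648 = 265 teachers missing from the period 11 risk set show up as censored at the end of period 10.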
Time     # Teachers in the risk set   # Teachers who quit     Sample hazard
period   in this time period          in this time period     probability
1        3941                         456                     0.1157
2        3485                         384                     0.1102
3        3101                         359                     0.1158
4        2742                         295                     0.1076
5        2447                         218                     0.0891
6        2229                         184                     0.0825
7        2045                         123                     0.0601
8        1642                         79                      0.0481
9        1256                         53                      0.0422
10       948                          35                      0.0369
11       648                          16                      0.0247
12       391                          5                       0.0128
[Figure: sample hazard function. Vertical axis: Hazard Probability (0.0000 to 0.1400); horizontal axis: Year in Teaching Career (1 to 12).]
Collect the sample hazard probabilities together and plot them as a sample hazard function …
The Hazard Function
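The hazard column can be recomputed from the two count columns. This is a minimal sketch rather than the course's own code (the slides use Stata); the function name `sample_hazard` is hypothetical.

```python
# Counts for periods 1..12, read from the tabulation of EVENT by PERIOD.
risk_set = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
quit_counts = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]


def sample_hazard(risk_set, quit_counts):
    """Sample hazard per period: number who quit divided by number at risk."""
    return [quits / at_risk for at_risk, quits in zip(risk_set, quit_counts)]


hazard = sample_hazard(risk_set, quit_counts)
for period, h in enumerate(hazard, start=1):
    print(f"period {period:2d}: hazard = {h:.4f}")
```

Plotting `hazard` against period number would give the sample hazard function described on this slide.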
Time     Sample Hazard        Sample Survival
Period   Probability h(t)     Probability S(t)
0                             1.0000
1        0.1157               0.8843
2        0.1102               0.7869
3        0.1158               0.6958
4        0.1076               0.6209
5        0.0891               0.5656
6        0.0825               0.5189
7        0.0601               0.4877
8        0.0481               0.4642
9        0.0422               0.4446
10       0.0369               0.4282
11       0.0247               0.4177
12       0.0128               0.4123
Once you have the sample hazard probabilities, you can cumulate them to get sample survival probabilities …
Sample Survival Probability
Survival probability in any time period is the probability of “surviving” beyond that period (i.e., the probability of not experiencing the event of interest until after the period).
Here, all teachers survived the 0th time period, so the estimated sample survival probability in the 0th period is 1.0000.
The estimated hazard probability suggests that a proportion of 0.1157 of teachers in the 1st period risk set will “die” in the 1st period (i.e., quit teaching).
Because a proportion of 0.1157 of the risk set will “die” in the 1st period, we know that (1 - 0.1157) or 0.8843 of the 1st period risk set will survive.
In other words, 0.8843 of the entering “1.0000” will remain “alive” beyond the 1st time-period (and will therefore be potentially available to quit teaching for the first time at some later period).
The sample survival probability in the 1st time period is therefore 0.8843 × 1.0000, or:
$\hat{S}(t_1) = 0.8843$
The Survival Probability
And, the estimated survival probability in discrete time period #2…
Here, according to the estimated sample survival probability, a proportion of 0.8843 of the teachers survived the 1st time period.
Estimated hazard probability suggests that a proportion of 0.1102 of teachers in the 2nd period risk set will “die” in the 2nd period (i.e., quit teaching for the first time).
Because a proportion of 0.1102 of the risk set will “die” in the 2nd period, we know that (1 - 0.1102), or 0.8898, of the 2nd period risk set will survive.
In other words, a proportion of 0.8898 of the entering “0.8843” will remain “alive” beyond the 2nd time period (and be potentially available to quit teaching for the first time, later).
Sample survival probability in the 2nd time period is therefore 0.8898 × 0.8843, or:
$\hat{S}(t_2) = 0.7869$
The Survival Probability
And, the estimated survival probability in discrete time period #3 … etc.
Here, according to the estimated sample survival probability, a proportion of 0.7869 of the teachers survived the 2nd time period.
The estimated hazard probability suggests that a proportion of 0.1158 of teachers in the 3rd period risk set will “die” in the 3rd period (i.e., quit teaching for the first time).
Because a proportion of 0.1158 of the risk set will “die” in the 3rd period, we know that (1 - 0.1158), or 0.8842, of the 3rd period risk set will survive.
In other words, a proportion of 0.8842 of the entering “0.7869” will remain “alive” beyond the 3rd time period (and be potentially available to quit teaching for the first time, later).
The sample survival probability in the 3rd time period is therefore 0.8842 × 0.7869, or:
$\hat{S}(t_3) = 0.6958$
The Survival Probability
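One way to check the three steps so far is to write each hazard as a ratio of counts from the tabulation. As long as nobody has been censored (true for periods 1 through 6 in these data), the product telescopes to the share of the original 3941 teachers who are still teaching. The expansion below is a consistency check under that no-censoring assumption.

```latex
\hat{S}(t_3)
  = \left(1 - \tfrac{456}{3941}\right)
    \left(1 - \tfrac{384}{3485}\right)
    \left(1 - \tfrac{359}{3101}\right)
  = \frac{3485}{3941} \cdot \frac{3101}{3485} \cdot \frac{2742}{3101}
  = \frac{2742}{3941}
  \approx 0.6958
```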
Time         Sample Hazard        Sample Survival
Period       Probability h(t)     Probability S(t)
$t_{j-1}$                         $\hat{S}(t_{j-1})$
$t_j$        $\hat{h}(t_j)$       $\hat{S}(t_j)$
Thus, as a general principle, the estimated survivor probability in any time period j can be found by substituting into a simple little rule …
So, in general, in any time period j:
$\hat{S}(t_j) = [1 - \hat{h}(t_j)]\,\hat{S}(t_{j-1})$
The Survival Probability – General Equation
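The general rule can be applied period by period, starting from a survival probability of 1 in period 0. The sketch below recomputes each hazard from the raw counts instead of the rounded table values, so that rounding does not accumulate; the function name `sample_survival` is hypothetical.

```python
# Counts for periods 1..12, read from the tabulation of EVENT by PERIOD.
risk_set = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
quit_counts = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]


def sample_survival(risk_set, quit_counts):
    """Apply S(t_j) = [1 - h(t_j)] * S(t_{j-1}), starting from S(t_0) = 1."""
    survival = [1.0]  # index 0 holds period 0, where everyone is still teaching
    for at_risk, quits in zip(risk_set, quit_counts):
        hazard = quits / at_risk
        survival.append((1.0 - hazard) * survival[-1])
    return survival


survival = sample_survival(risk_set, quit_counts)
for period, s in enumerate(survival):
    print(f"period {period:2d}: survival = {s:.4f}")
```

Because each step multiplies by a factor between 0 and 1, the resulting sequence can never increase.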
[Figure: Sample Survivor Function. Vertical axis: Sample Survivor Probability (0.0000 to 1.0000); horizontal axis: Year in Teaching (0 to 12).]
Plotting the sample survival probabilities against time period provides the sample survivor function.
Typical monotonically decreasing survivor function …
We can also use this to estimate the median survival time, by projecting across from a survival probability of 0.5 to the curve and then down to the Time axis.
The Survival Function
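The graphical projection from 0.5 can be approximated numerically. The sketch below assumes linear interpolation between the two periods whose survival probabilities bracket 0.5; the slides describe only the graphical method, so the interpolation rule and the function name `median_lifetime` are assumptions made for illustration.

```python
# Sample survival probabilities for periods 0..12, copied from the life table.
survival = [1.0000, 0.8843, 0.7869, 0.6958, 0.6209, 0.5656, 0.5189,
            0.4877, 0.4642, 0.4446, 0.4282, 0.4177, 0.4123]


def median_lifetime(survival, target=0.5):
    """Estimate when the survivor function crosses `target`.

    Assumption: the function is a straight line between adjacent periods.
    Returns None if the survivor function never falls to the target.
    """
    if survival[0] <= target:
        return 0.0
    for j in range(1, len(survival)):
        if survival[j] <= target:
            prev, curr = survival[j - 1], survival[j]
            # Fraction of the way from period j - 1 to period j.
            return (j - 1) + (prev - target) / (prev - curr)
    return None


median = median_lifetime(survival)
print(f"estimated median time to first quit: {median:.2f} years")
```

With these data the crossing falls between periods 6 (0.5189) and 7 (0.4877), so the estimate lands between 6 and 7 years in teaching.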