STATA SURVIVAL ANALYSIS AND EPIDEMIOLOGICAL TABLES REFERENCE MANUAL

RELEASE 11

A Stata Press Publication
StataCorp LP
College Station, Texas

Copyright © 1985–2009 by StataCorp LP
All rights reserved
Version 11

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in TeX
Printed in the United States of America

ISBN-10: 1-59718-061-0
ISBN-13: 978-1-59718-061-0

This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior written permission of StataCorp LP unless permitted by the license granted to you by StataCorp LP to use the software and documentation. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document.

StataCorp provides this manual “as is” without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice.

The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto DVD, CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright © 1979 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979.

Stata and Mata are registered trademarks and NetCourse is a trademark of StataCorp LP.

Other brand and product names are registered trademarks or trademarks of their respective companies.

For copyright information about the software, type help copyright within Stata.

The suggested citation for this software is

StataCorp. 2009. Stata: Release 11. Statistical Software. College Station, TX: StataCorp LP.

Table of contents

intro                        Introduction to survival analysis manual
survival analysis            Introduction to survival analysis & epidemiological tables commands
ct                           Count-time data
ctset                        Declare data to be count-time data
cttost                       Convert count-time data to survival-time data
discrete                     Discrete-time survival analysis
epitab                       Tables for epidemiologists
ltable                       Life tables for survival data
snapspan                     Convert snapshot data to time-span data
st                           Survival-time data
st_is                        Survival analysis subroutines for programmers
stbase                       Form baseline dataset
stci                         Confidence intervals for means and percentiles of survival time
stcox                        Cox proportional hazards model
stcox PH-assumption tests    Tests of proportional-hazards assumption
stcox postestimation         Postestimation tools for stcox
stcrreg                      Competing-risks regression
stcrreg postestimation       Postestimation tools for stcrreg
stcurve                      Plot survivor, hazard, cumulative hazard, or cumulative incidence function
stdescribe                   Describe survival-time data
stfill                       Fill in by carrying forward values of covariates
stgen                        Generate variables reflecting entire histories
stir                         Report incidence-rate comparison
stpower                      Sample-size, power, and effect-size determination for survival analysis
stpower cox                  Sample size, power, and effect size for the Cox proportional hazards model
stpower exponential          Sample size and power for the exponential test
stpower logrank              Sample size, power, and effect size for the log-rank test
stptime                      Calculate person-time, incidence rates, and SMR
strate                       Tabulate failure rates and rate ratios
streg                        Parametric survival models
streg postestimation         Postestimation tools for streg
sts                          Generate, graph, list, and test the survivor and cumulative hazard functions
sts generate                 Create variables containing survivor and related functions
sts graph                    Graph the survivor and cumulative hazard functions
sts list                     List the survivor or cumulative hazard function
sts test                     Test equality of survivor functions
stset                        Declare data to be survival-time data
stsplit                      Split and join time-span records
stsum                        Summarize survival-time data
sttocc                       Convert survival-time data to case–control data
sttoct                       Convert survival-time data to count-time data
stvary                       Report whether variables vary over time
Glossary
Subject and author index

Cross-referencing the documentation

When reading this manual, you will find references to other Stata manuals. For example,

[U] 26 Overview of Stata estimation commands
[R] regress
[D] reshape

The first example is a reference to chapter 26, Overview of Stata estimation commands, in the User’s Guide; the second is a reference to the regress entry in the Base Reference Manual; and the third is a reference to the reshape entry in the Data-Management Reference Manual.

All the manuals in the Stata Documentation have a shorthand notation:

[GSM]  Getting Started with Stata for Mac
[GSU]  Getting Started with Stata for Unix
[GSW]  Getting Started with Stata for Windows
[U]    Stata User’s Guide
[R]    Stata Base Reference Manual
[D]    Stata Data-Management Reference Manual
[G]    Stata Graphics Reference Manual
[XT]   Stata Longitudinal-Data/Panel-Data Reference Manual
[MI]   Stata Multiple-Imputation Reference Manual
[MV]   Stata Multivariate Statistics Reference Manual
[P]    Stata Programming Reference Manual
[SVY]  Stata Survey Data Reference Manual
[ST]   Stata Survival Analysis and Epidemiological Tables Reference Manual
[TS]   Stata Time-Series Reference Manual
[I]    Stata Quick Reference and Index

[M] Mata Reference Manual

Detailed information about each of these manuals may be found online at

http://www.stata-press.com/manuals/

Title

intro — Introduction to survival analysis manual

Description

This entry describes this manual and what has changed since Stata 10. See the next entry, [ST] survival analysis, for an introduction to Stata’s survival analysis capabilities.

Remarks

This manual documents commands for survival analysis and epidemiological tables and is referred to as [ST] in cross-references. Following this entry, [ST] survival analysis provides an overview of the commands.

This manual is arranged alphabetically. If you are new to Stata’s survival analysis and epidemiological tables commands, we recommend that you read the following sections first:

[ST] survival analysis    Introduction to survival analysis & epidemiological tables commands
[ST] st                   Survival-time data
[ST] stset                Set variables for survival data

Stata is continually being updated, and Stata users are always writing new commands. To find out about the latest survival analysis features, type search survival after installing the latest official updates; see [R] update. To find out about the latest epidemiological features, type search epi.

What’s new

This section is intended for previous Stata users. If you are new to Stata, you may as well skip it.

1. New command stcrreg fits competing-risks regression models for survival data, according to the method of Fine and Gray (1999). In a competing risks model, subjects are at risk of failure because of two or more separate and possibly correlated causes. See [ST] stcrreg. Existing command stcurve will now graph cumulative incidence functions after stcrreg; see [ST] stcurve.

2. Stata’s new multiple-imputation commands for dealing with missing values may be used with stcox, streg, and stcrreg; see [MI] intro. Either stset your data before using mi set, or use mi’s mi stset to stset your data afterward.

3. Factor variables may now be used with stcox, streg, and stcrreg. See [U] 11.4.3 Factor variables.

4. New reporting options baselevels and allbaselevels control how base levels of factor variables are displayed in output tables. New reporting option noemptycells controls whether missing cells in interactions are displayed.

These new options are supported by estimation commands stcox, streg, and stcrreg, and by existing postestimation commands estat summarize and estat vce. See [R] estimation options.

5. New reporting option noomitted controls whether covariates that are dropped because of collinearity are reported in output tables. By default, Stata now includes a line in estimation and related output tables for collinear covariates and marks those covariates as “(omitted)”. noomitted suppresses those lines.

noomitted is supported by estimation commands stcox, streg, and stcrreg, and by existing postestimation commands estat summarize and estat vce. See [R] estimation options.

6. New option vsquish eliminates blank lines in estimation and related tables. Many output tables now set off factor variables and time-series–operated variables with a blank line. vsquish removes these lines.

vsquish is supported by estimation commands stcox, streg, and stcrreg, and by existing postestimation command estat summarize. See [R] estimation options.

7. Estimation commands stcox, streg, and stcrreg support new option coeflegend to display the coefficients’ legend rather than the coefficient table. The legend shows how you would type a coefficient in an expression, in a test command, or in a constraint definition. See [R] estimation options.

8. Estimation commands streg and stcrreg support new option nocnsreport to suppress reporting constraints; see [R] estimation options.

9. Concerning predict:

a. predict after stcox offers three new diagnostic measures of influence: DFBETAs, likelihood displacement values, and LMAX statistics. See [ST] stcox postestimation.

b. predict after stcox can now calculate diagnostic statistics basesurv(), basechazard(), basehc(), mgale(), effects(), esr(), schoenfeld(), and scaledsch(). Previously, you had to request these statistics when you fit the model by specifying the option with the stcox command. Now you obtain them by using predict after estimation. The options continue to work with stcox directly but are no longer documented. See [ST] stcox postestimation.

c. predict after stcox and streg now produces subject-level residuals by default. Previously, record-level or partial results were produced, although there was an inconsistency. This affects multiple-record data only because there is no difference between subject-level and partial residuals in single-record data. This change affects predict’s options mgale, csnell, deviance, and scores after stcox (and new options ldisplace, lmax, and dfbeta, of course); and it affects mgale and deviance after streg. predict, deviance was the inconsistency; it always produced subject-level results.

For instance, in previous Stata versions you typed

. predict cs, csnell

to obtain partial Cox–Snell residuals. One statistic per record was produced. To obtain subject-level residuals, for which there is one per subject and which predict stored on each subject’s last record, you typed

. predict ccs, ccsnell

In Stata 11, when you type

. predict cs, csnell

you obtain the subject-level residual. To obtain the partial, you use the new partial option:

. predict cs, csnell partial

The same applies to all the other residuals. Concerning the inconsistency, partial deviances are now available.

Not affected is predict, scores after streg. Log-likelihood scores in parametric models are mathematically defined at the record level and are meaningful only if evaluated at that level.

Prior behavior is restored under version control. See [ST] stcox postestimation, [ST] streg postestimation, and [ST] stcrreg postestimation.

10. stcox now allows up to 100 time-varying covariates as specified in option tvc(). The previous limit was 10. See [ST] stcox.

11. Existing commands stcurve and estat phtest no longer require that you specify the appropriate options to stcox before using them. The commands automatically generate the statistics they require. See [ST] stcurve and [ST] stcox PH-assumption tests.

12. Existing epitab commands ir, cs, cc, and mhodds now treat missing categories of variables in by() consistently. By default, missing categories are now excluded from the computation. This may be overridden by specifying by()’s new option missing. See [ST] epitab.

13. Existing command sts list has new option saving() that creates a dataset containing the results. See [ST] sts list.

For a complete list of all the new features in Stata 11, see [U] 1.3 What’s new.

Reference

Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association 94: 496–509.

Also see

[U] 1.3 What’s new
[R] intro — Introduction to base reference manual

Title

survival analysis — Introduction to survival analysis & epidemiological tables commands

Description

Stata’s survival analysis routines are used to compute sample size, power, and effect size and to declare, convert, manipulate, summarize, and analyze survival data. Survival data is time-to-event data, and survival analysis is full of jargon: truncation, censoring, hazard rates, etc. See the glossary in this manual. For a good Stata-specific introduction to survival analysis, see Cleves et al. (2008).

Stata also has several commands for analyzing contingency tables resulting from various forms of observational studies, such as cohort or matched case–control studies.

This manual documents the following commands, which are described in detail in their respective manual entries.

Declaring and converting count data
    ctset         [ST] ctset         Declare data to be count-time data
    cttost        [ST] cttost        Convert count-time data to survival-time data

Converting snapshot data
    snapspan      [ST] snapspan      Convert snapshot data to time-span data

Declaring and summarizing survival-time data
    stset         [ST] stset         Declare data to be survival-time data
    stdescribe    [ST] stdescribe    Describe survival-time data
    stsum         [ST] stsum         Summarize survival-time data

Manipulating survival-time data
    stvary        [ST] stvary        Report whether variables vary over time
    stfill        [ST] stfill        Fill in by carrying forward values of covariates
    stgen         [ST] stgen         Generate variables reflecting entire histories
    stsplit       [ST] stsplit       Split time-span records
    stjoin        [ST] stsplit       Join time-span records
    stbase        [ST] stbase        Form baseline dataset

Obtaining summary statistics, confidence intervals, tables, etc.
    sts           [ST] sts           Generate, graph, list, and test the survivor and cumulative hazard functions
    stir          [ST] stir          Report incidence-rate comparison
    stci          [ST] stci          Confidence intervals for means and percentiles of survival time
    strate        [ST] strate        Tabulate failure rate
    stptime       [ST] stptime       Calculate person-time, incidence rates, and SMR
    stmh          [ST] strate        Calculate rate ratios with the Mantel–Haenszel method
    stmc          [ST] strate        Calculate rate ratios with the Mantel–Cox method
    ltable        [ST] ltable        Display and graph life tables

Fitting regression models
    stcox                [ST] stcox                        Cox proportional hazards model
    estat concordance    [ST] stcox postestimation         Calculate Harrell’s C
    estat phtest         [ST] stcox PH-assumption tests    Test Cox proportional-hazards assumption
    stphplot             [ST] stcox PH-assumption tests    Graphically assess the Cox proportional-hazards assumption
    stcoxkm              [ST] stcox PH-assumption tests    Graphically assess the Cox proportional-hazards assumption
    streg                [ST] streg                        Parametric survival models
    stcurve              [ST] stcurve                      Plot survivor, hazard, cumulative hazard, or cumulative incidence function
    stcrreg              [ST] stcrreg                      Competing-risks regression

Sample-size and power determination for survival analysis
    stpower cox            [ST] stpower cox            Sample size, power, and effect size for the Cox proportional hazards model
    stpower exponential    [ST] stpower exponential    Sample size and power for the exponential test
    stpower logrank        [ST] stpower logrank        Sample size, power, and effect size for the log-rank test

Converting survival-time data
    sttocc    [ST] sttocc    Convert survival-time data to case–control data
    sttoct    [ST] sttoct    Convert survival-time data to count-time data

Programmer’s utilities
    st_*      [ST] st_is     Survival analysis subroutines for programmers

Epidemiological tables
    ir         [ST] epitab    Incidence rates for cohort studies
    iri        [ST] epitab    Immediate form of ir
    cs         [ST] epitab    Risk differences, risk ratios, and odds ratios for cohort studies
    csi        [ST] epitab    Immediate form of cs
    cc         [ST] epitab    Odds ratios for case–control data
    cci        [ST] epitab    Immediate form of cc
    tabodds    [ST] epitab    Tests of log odds for case–control data
    mhodds     [ST] epitab    Odds ratios controlled for confounding
    mcc        [ST] epitab    Analysis of matched case–control data
    mcci       [ST] epitab    Immediate form of mcc

Remarks

Remarks are presented under the following headings:

    Introduction
    Declaring and converting count data
    Converting snapshot data
    Declaring and summarizing survival-time data
    Manipulating survival-time data
    Obtaining summary statistics, confidence intervals, tables, etc.
    Fitting regression models
    Sample size and power determination for survival analysis
    Converting survival-time data
    Programmer’s utilities
    Epidemiological tables

Introduction

All but one entry in this manual deals with the analysis of survival data, which is used to measure the time to an event of interest such as death or failure. Survival data can be organized in two ways. The first way is as count data, which refers to observations on populations, whether people or generators, with observations recording the number of units at a given time that failed or were lost because of censoring. The second way is as survival-time, or time-span, data. In survival-time data, the observations represent periods and contain three variables that record the start time of the period, the end time, and an indicator of whether failure or right-censoring occurred at the end of the period. The representation of the response of these three variables makes survival data unique in terms of implementing the statistical methods in the software.

Survival data may also be organized as snapshot data (a small variation of the survival-time format), in which observations depict an instance in time rather than an interval. When you have snapshot data, you simply use the snapspan command to convert it to survival-time data before proceeding.

Stata commands that begin with ct are used to convert count data to survival-time data. Survival-time data are analyzed using Stata commands that begin with st, known in our terminology as st commands. You can express all the information contained in count data in an equivalent survival-time dataset, but the converse is not true. Thus Stata commands are made to work with survival-time data because it is the more general representation.

The one remaining entry is [ST] epitab, which describes epidemiological tables. [ST] epitab covers many commands dealing with analyzing contingency tables arising from various observational studies, such as case–control or cohort studies. [ST] epitab is included in this manual because the concepts presented there are related to concepts of survival analysis, and both topics use the same terminology and are of equal interest to many researchers.

Declaring and converting count data

Count data must first be converted to survival-time data before Stata’s st commands can be used. Count data can be thought of as aggregated survival-time data. Rather than having observations that are specific to a subject and a period, you have data that, at each recorded time, record the number lost because of failure and, optionally, the number lost because of right-censoring.

ctset is used to tell Stata the names of the variables in your count data that record the time, the number failed, and the number censored. You ctset your data before typing cttost to convert it to survival-time data. Because you ctset your data, you can type cttost without any arguments to perform the conversion. Stata remembers how the data are ctset.
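
As a minimal sketch of this two-step workflow, the do-file fragment below builds a tiny count-time dataset by hand and converts it. The variable names time, nfail, and ncens and all the counts are made up for illustration; see [ST] ctset and [ST] cttost for the full syntax.

```stata
* Hypothetical count-time data: one row per recorded time
clear
input time nfail ncens
1 2 0
3 1 1
5 0 2
end

* Declare the time, number-failed, and number-censored variables
ctset time nfail ncens

* Convert to survival-time data; no arguments are needed
* because Stata remembers how the data were ctset
cttost

* The data are now st data, so st commands may be used
list
sts list
```

Because the converted data are weighted (each record stands for several units), listing them after cttost is a quick way to see how the counts were carried over.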


Converting snapshot data

Snapshot data are data in which each observation records the status of a given subject at a certain point in time. Usually you have multiple observations on each subject that chart the subject’s progress through the study.

Before using Stata’s survival analysis commands with snapshot data, you must first convert the data to survival-time data; that is, the observations in the data should represent intervals. When you convert snapshot data, the existing time variable in your data is used to record the end of a time span, and a new variable is created to record the beginning. Time spans are created using the recorded snapshot times as breakpoints at which new intervals are to be created. Before converting snapshot data to time-span data, you must understand the distinction between enduring variables and instantaneous variables. Enduring variables record characteristics of the subject that endure throughout the time span, such as sex or smoking status. Instantaneous variables describe events that occur at the end of a time span, such as failure or censoring. When you convert snapshots to intervals, enduring variables obtain their values from the previous recorded snapshot or are set to missing for the first interval. Instantaneous variables obtain their values from the current recorded snapshot because the existing time variable now records the end of the span.

Stata’s snapspan makes this whole process easy. You specify an ID variable identifying your subjects, the snapshot time variable, the name of the new variable to hold the beginning times of the spans, and any variables that you want to treat as instantaneous variables. Stata does the rest for you.
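
The following sketch shows what such a conversion might look like. The dataset is invented: smoker plays the role of an enduring variable and died the role of an instantaneous variable, and time0 is simply the name we chose for the new beginning-of-span variable.

```stata
* Hypothetical snapshot data: two snapshots per subject
clear
input id time smoker died
1 5 1 0
1 9 0 1
2 4 0 0
2 8 0 0
end

* id        identifies subjects
* time      is the snapshot time (becomes the end of each span)
* died      is treated as instantaneous
* time0     is created to hold the beginning of each span
* replace   allows the unsaved data in memory to be changed
snapspan id time died, generate(time0) replace

list, sepby(id)
```

Listing the result lets you check how the enduring variable smoker was shifted relative to the instantaneous variable died before you go on to stset the data.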

Declaring and summarizing survival-time data

Stata does not automatically recognize survival-time data, so you must declare your survival-time data to Stata by using stset. Every st command relies on the information that is provided when you stset your data. Survival-time data come in different forms. For example, your time variables may be dates, time measured from a fixed date, or time measured from some other point unique to each subject, such as enrollment in the study. You can also consider the following questions. What is the onset of risk for the subjects in your data? Is it time zero? Is it enrollment in the study or some other event, such as a heart transplant? Do you have censoring, and if so, which variable records it? What values does this variable record for censoring/failure? Do you have delayed entry? That is, were some subjects at risk of failure before you actually observed them? Do you have simple data and wish to treat everyone as entering and at risk at time zero?

Whatever the form of your data, you must first stset it before analyzing it, and so if you are new to Stata’s st commands, we highly recommend that you take the time to learn about stset. It is really easy once you get the hang of it, and [ST] stset has many examples to help. For more discussion of stset, see chapter 6 of Cleves et al. (2008).

Once you stset the data, you can use stdescribe to describe the aspects of your survival data. For example, you will see the number of subjects you were successful in declaring, the total number of records associated with these subjects, the total time at risk for these subjects, time gaps for any of these subjects, any delayed entry, etc. You can use stsum to summarize your survival data, for example, to obtain the total time at risk and the quartiles of time-to-failure in analysis-time units.
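
To make the declaration step concrete, here is a small sketch with invented single-record data in which one subject enters late. The variable names id, entry, exit, and dead are assumptions made for this illustration only.

```stata
* Hypothetical data: one record per subject; subject 3 has delayed entry
clear
input id entry exit dead
1 0 10 1
2 0  7 0
3 2 12 1
4 0  5 1
end

* exit  is the time of failure or censoring
* dead  is 1 for failure and 0 for censoring
* entry is the time at which each subject comes under observation
stset exit, failure(dead) enter(time entry) id(id)

* Describe and summarize the declared data
stdescribe
stsum
```

After stset, the variables _t0, _t, _d, and _st hold the entry time, exit time, failure indicator, and inclusion flag that every st command relies on.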

Manipulating survival-time data

Once your data have been stset, you may want to clean them up a bit before beginning your analysis. Suppose that you had an enduring variable and snapspan recorded it as missing for the interval leading up to the first recorded snapshot time. You can use stfill to fill in missing values of covariates, either by carrying forward the values from previous periods or by making the covariate equal to its earliest recorded (nonmissing) value for all time spans. You can use stvary to check for time-varying covariates or to confirm that certain variables, such as sex, are not time varying. You can use stgen to generate new covariates based on functions of the time spans for each given subject. For example, you can create a new variable called eversmoked that equals one for all a subject’s observations, if the variable smoke in your data is equal to one for any of the subject’s time spans. Think of stgen as just a convenient way to do things that could be done using by subject_id: with survival-time data.
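
The sketch below applies these three commands to a made-up multiple-record dataset. The variable names and values are hypothetical; sex is missing on each subject’s second record so that there is something to carry forward.

```stata
* Hypothetical multiple-record data: two time spans per subject
clear
input id time0 time dead sex smoke
1 0 3 0 1 0
1 3 8 1 . 1
2 0 4 0 0 0
2 4 9 0 . 0
end

stset time, id(id) time0(time0) failure(dead)

* Report whether sex and smoke vary within subject and where they are missing
stvary sex smoke

* Fill in missing values of sex by carrying earlier values forward
stfill sex, forward

* eversmoked is 1 on all of a subject's records if smoke is ever 1
stgen eversmoked = ever(smoke == 1)

list id _t0 _t sex smoke eversmoked, sepby(id)
```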

stsplit is useful for creating data that have multiple records per subject from data that have one record per subject. Suppose that you have already stset your data and wish to introduce a time-varying covariate. You would first need to stsplit your data so that separate time spans could be created for each subject, allowing the new covariate to assume different values over time within a subject. stjoin is the opposite of stsplit. Suppose that you have data with multiple records per subject but then realize that the data could be collapsed into single-subject records with no loss of information. Using stjoin would speed up any subsequent analysis using the st commands without changing the results.
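
A round trip through stsplit and stjoin might look like the following sketch. The data and the split points 5 and 10 are invented; period is just the name we give to the variable that stsplit creates.

```stata
* Hypothetical single-record data
clear
input id time dead
1 12 1
2  7 0
3 20 1
end

* stsplit requires that an ID variable be declared
stset time, failure(dead) id(id)

* Split each subject's record at analysis times 5 and 10
stsplit period, at(5 10)
list id _t0 _t _d period, sepby(id)

* To rejoin, first drop the variable that stsplit created;
* otherwise adjacent records differ and cannot be joined
drop period
stjoin
list id _t0 _t _d
```

Between the two steps is where you would define a time-varying covariate; once such a covariate differs across a subject’s records, stjoin will no longer merge them.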

stbase can be used to set every variable in your multiple-record st data to the value at baseline, defined as the earliest time at which each subject was observed. It can also be used to convert st data to cross-sectional data.

Obtaining summary statistics, confidence intervals, tables, etc.

Stata provides several commands for nonparametric analysis of survival data that can produce a wide array of summary statistics, inference, tables, and graphs. sts is a truly powerful command, used to obtain nonparametric estimates, inference, tests, and graphs of the survivor function, the cumulative hazard function, and the hazard function. You can compare estimates across groups, such as smoking versus nonsmoking, and you can adjust these estimates for the effects of other covariates in your data. sts can present these estimates as tables and graphs. sts can also be used to test the equality of survivor functions across groups.
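
As an illustration, the sketch below uses the cancer dataset that is installed with Stata (variables studytime, died, drug, and age). The indicator treated is a variable we create here for the example; it is not part of the dataset.

```stata
* Drug-trial data shipped with Stata
sysuse cancer, clear
stset studytime, failure(died)

* Hypothetical grouping: drug 1 is the placebo, drugs 2 and 3 are active
generate byte treated = drug > 1

* Tabulate and graph the estimated survivor function by group
sts list, by(treated)
sts graph, by(treated)

* Test equality of survivor functions across groups (log-rank by default)
sts test treated
```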

stir is used to estimate incidence rates and to compare incidence rates across groups. stci is the survival-time data analog of ci and is used to obtain confidence intervals for means and percentiles of time to failure. strate is used to tabulate failure rates. stptime is used to calculate person-time and standardized mortality/morbidity ratios (SMRs). stmh calculates rate ratios by using the Mantel–Haenszel method, and stmc calculates rate ratios by using the Mantel–Cox method.

ltable displays and graphs life tables for individual-level or aggregate data.
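
The commands in this section can be tried in sequence on the cancer dataset installed with Stata, as in the sketch below. As before, treated is a 0/1 indicator that we define for the example.

```stata
sysuse cancer, clear
stset studytime, failure(died)
generate byte treated = drug > 1

* Incidence rates and their comparison between the two groups
stir treated

* Confidence intervals for median survival time, by group
stci, by(treated)

* Failure rates and person-time, by group
strate treated
stptime, by(treated)

* Rate ratio by the Mantel-Haenszel method
stmh treated

* Life table built directly from the time and failure variables
ltable studytime died, by(treated)
```

Note that ltable does not use the stset declaration; it takes the time and failure variables as arguments.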

Fitting regression models

Stata has commands for fitting both semiparametric and parametric regression models to survival data. stcox fits the Cox proportional hazards model, and predict after stcox can be used to retrieve estimates of the baseline survivor function, the baseline cumulative hazard function, and the baseline hazard contributions. predict after stcox can also calculate a myriad of Cox regression diagnostic quantities, such as martingale residuals, efficient score residuals, and Schoenfeld residuals. stcox has four options for handling tied failures. stcox can be used to fit stratified Cox models, where the baseline hazard is allowed to differ over the strata, and it can be used to model multivariate survival data by using a shared-frailty model, which can be thought of as a Cox model with random effects. After stcox, you can use estat phtest to test the proportional-hazards assumption or estat concordance to calculate Harrell’s C. With stphplot and stcoxkm, you can graphically assess the proportional-hazards assumption.
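
A short Cox regression session along these lines is sketched below, again using the cancer dataset installed with Stata. The new variable names mg and s0 are our own choices.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Cox model with drug as a factor variable
stcox i.drug age

* Test the proportional-hazards assumption and compute Harrell's C
estat phtest, detail
estat concordance

* Martingale residuals and the baseline survivor function
predict double mg, mgale
predict double s0, basesurv

* Stratified Cox model using the Efron method for tied failures
stcox age, strata(drug) efron
```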


Stata offers six parametric regression models for survival data: exponential, Weibull, lognormal, loglogistic, Gompertz, and gamma. All six models are fit using streg, and you can specify the model you want with the distribution() option. All these models, except for the exponential, have ancillary parameters that are estimated (along with the linear predictor) from the data. By default, these ancillary parameters are treated as constant, but you may optionally model the ancillary parameters as functions of a linear predictor. Stratified models may also be fit using streg. You can also fit frailty models with streg and specify whether you want the frailties to be treated as spell-specific or shared across groups of observations.
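
The sketch below fits a few of these parametric variants to the cancer dataset installed with Stata. The choice of covariates is arbitrary and meant only to show where each option goes.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Weibull model; the shape parameter is treated as constant
streg i.drug age, distribution(weibull)

* Weibull model with the ancillary (shape) parameter modeled as a function of age
streg i.drug age, distribution(weibull) ancillary(age)

* Stratified Weibull model, with drug as the stratification variable
streg age, distribution(weibull) strata(drug)
```

Changing the argument of distribution() is all that is needed to switch among the six models.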

stcrreg fits a semiparametric regression model for survival data in the presence of competing risks. Competing risks impede the failure event under study from occurring. An analysis of such competing-risks data focuses on the cumulative incidence function, the probability of failure in the presence of competing events that prevent that failure. stcrreg provides an analogue to stcox for such data. The baseline subhazard function (that which generates failures under competing risks) is left unspecified, and covariates act multiplicatively on the baseline subhazard.

stcurve is for use after stcox, streg, and stcrreg and will plot the estimated survivor, hazard, cumulative hazard, and cumulative incidence function for the fitted model. Covariates, by default, are held fixed at their mean values, but you can specify other values if you wish. stcurve is useful for comparing these functions across different levels of covariates.
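The following sketch shows one way stcurve might be used to compare fitted curves across covariate levels. The dataset (cancer.dta) and the derived indicator treated are assumptions made for illustration and do not appear in this entry.

```stata
* Minimal sketch; "treated" is a hypothetical indicator built for this example
sysuse cancer, clear
stset studytime, failure(died)
generate byte treated = drug > 1                    // 0 = placebo, 1 = any active drug
stcox age treated
stcurve, survival at1(treated=0) at2(treated=1)     // age held at its mean by default
streg age treated, distribution(weibull)
stcurve, hazard at1(treated=0) at2(treated=1)       // fitted Weibull hazards
```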

Sample size and power determination for survival analysis

Stata has commands for computing sample size, power, and effect size for survival analysis using the log-rank test, the Cox proportional hazards model, and the exponential test comparing exponential hazard rates.

stpower logrank estimates required sample size, power, and effect size for survival analysis comparing survivor functions in two groups using the log-rank test. It provides options to account for unequal allocation of subjects between the two groups, possible withdrawal of subjects from the study (loss to follow-up), and uniform accrual of subjects into the study.

stpower cox estimates required sample size, power, and effect size for survival analysis using Cox proportional hazards (PH) models with possibly multiple covariates. It provides options to account for possible correlation between the covariate of interest and other predictors and for withdrawal of subjects from the study.

stpower exponential estimates required sample size and power for survival analysis comparing two exponential survivor functions using the exponential test (in particular, the Wald test of the difference between hazards or, optionally, of the difference between log hazards). It accommodates unequal allocation between the two groups, flexible accrual of subjects into the study (uniform and truncated exponential), and group-specific losses to follow-up.

The stpower commands allow automated production of customizable tables and have options to assist with the creation of graphs of power curves and more.
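As a hedged illustration of stpower logrank, the sketch below uses made-up survival probabilities (0.5 in the control group and 0.7 in the experimental group at the end of the study). The numbers are assumptions chosen only to show the shape of the call; see [ST] stpower logrank for the authoritative syntax.

```stata
* Minimal sketch; all numeric inputs are illustrative assumptions
stpower logrank 0.5 0.7                // required sample size at the default power and alpha
stpower logrank 0.5 0.7, power(0.9)    // required sample size for 90% power
stpower logrank 0.5 0.7, n(200)        // power obtained with 200 subjects
```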

Converting survival-time data

Stata has commands for converting survival-time data to case–control and count data. These commands are rarely used, because most analyses are performed using data in the survival-time format. sttocc is useful for converting survival data to case–control data suitable for estimation with clogit. sttoct is the opposite of cttost and will convert survival-time data to count data.


Programmer’s utilities

Stata also provides routines for programmers interested in writing their own st commands. These are basically utilities for setting, accessing, and verifying the information saved by stset. For example, st_is verifies that the data have in fact been stset and gives the appropriate error if not. st_show is used to preface the output of a program with key information on the st variables used in the analysis. Programmers interested in writing st code should see [ST] st_is.

Epidemiological tables

See the Description section of [ST] epitab for an overview of Stata’s commands for calculating statistics and performing lists that are useful for epidemiologists.

Reference

Cleves, M. A., W. W. Gould, R. G. Gutierrez, and Y. Marchenko. 2008. An Introduction to Survival Analysis Using Stata. 2nd ed. College Station, TX: Stata Press.

Also see

[ST] stset — Declare data to be survival-time data
[ST] intro — Introduction to survival analysis manual


Title

ct — Count-time data

Description

The term ct refers to count-time data and the commands (all of which begin with the letters “ct”) for analyzing them. If you have data on populations, whether people or generators, with observations recording the number of units under test at time t (subjects alive) and the number of subjects that failed or were lost because of censoring, you have what we call count-time data.

If, on the other hand, you have data on individual subjects with observations recording that this subject came under observation at time t0 and that later, at t1, a failure or censoring was observed, you have what we call survival-time data. If you have survival-time data, see [ST] st.

Do not confuse count-time data with counting-process data, which can be analyzed using the st commands; see [ST] st.

There are two ct commands:

    ctset     [ST] ctset     Declare data to be count-time data
    cttost    [ST] cttost    Convert count-time data to survival-time data

The key is the cttost command. Once you have converted your count-time data to survival-time data, you can use the st commands to analyze the data. The entire process is as follows:

1. ctset your data so that Stata knows that they are count-time data; see [ST] ctset.
2. Type cttost to convert your data to survival-time data; see [ST] cttost.
3. Use the st commands; see [ST] st.
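The three steps can be sketched as a short do-file. The counts below are the generator failures used in Example 1 of [ST] ctset, typed in by hand so that the sketch does not depend on a downloaded file; sts list is just one arbitrary choice of st command for step 3.

```stata
* Minimal sketch of the ct workflow; data entered by hand for illustration
clear
input failtime fail
 22 1
 30 1
 40 2
 52 1
 54 4
 55 2
 85 7
 97 1
100 3
122 2
140 1
end
ctset failtime fail     // step 1: declare the data to be count-time data
cttost, clear           // step 2: convert (clear is needed because the data are unsaved)
sts list                // step 3: use any st command
```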

Also see

[ST] ctset — Declare data to be count-time data
[ST] cttost — Convert count-time data to survival-time data
[ST] st — Survival-time data
[ST] survival analysis — Introduction to survival analysis & epidemiological tables commands


Title

ctset — Declare data to be count-time data

Syntax

Declare data in memory to be count-time data and run checks on data

    ctset timevar nfailvar [ncensvar [nentvar]] [, by(varlist) noshow]

Specify whether to display identities of key ct variables

    ctset, {show | noshow}

Clear ct setting

    ctset, clear

Display identity of key ct variables and rerun checks on data

    {ctset | ct}

where timevar refers to the time of failure, censoring, or entry. It should contain times ≥ 0.

nfailvar records the number failing at time timevar.

ncensvar records the number censored at time timevar.

nentvar records the number entering at time timevar.

Stata sequences events at the same time as

    at timevar                     nfailvar failures occurred,
    then at timevar + 0            ncensvar censorings occurred,
    finally at timevar + 0 + 0     nentvar subjects entered the data.

Menu

Statistics > Survival analysis > Setup and utilities > Declare data to be count-time data

Description

ct refers to count-time data and is described here and in [ST] ct. Do not confuse count-time data with counting-process data, which can be analyzed using the st commands; see [ST] st.

In the first syntax, ctset declares the data in memory to be ct data, informing Stata of the key variables. When you ctset your data, ctset also checks that what you have declared makes sense.

In the second syntax, ctset changes the value of show/noshow. In show mode (the default), the other ct commands display the identities of the key ct variables before their normal output. If you type ctset, noshow, they will not do this. If you type ctset, noshow and then wish to restore the default behavior, type ctset, show.


In the third syntax, ctset, clear causes Stata to no longer consider the data to be ct data. The dataset itself remains unchanged. It is not necessary to type ctset, clear before doing another ctset. ctset, clear is used mostly by programmers.

In the fourth syntax, ctset (which can be abbreviated ct here) displays the identities of the key ct variables and reruns the checks on your data. Thus ct can remind you of what you have ctset (especially if you have ctset, noshow) and reverify your data if you make changes to the data.

Options

by(varlist) indicates that counts are provided by group. For instance, consider data containing records such as

    t    fail   cens   sex   agecat
    5     10      2     0       1
    5      6      1     1       1
    5     12      0     0       2

These data indicate that, in the category sex = 0 and agecat = 1, 10 failed and 2 were censored at time 5; for sex = 1, 1 was censored and 6 failed; and so on.

The above data would be declared

. ctset t fail cens, by(sex agecat)

The order of the records is not important, nor is it important that there be a record at every time for every group or that there be only one record for a time and group. However, the data must contain the full table of events.

noshow and show specify whether the identities of the key ct variables are to be displayed at the start of every ct command. Some users find the report reassuring; others find it repetitive. In any case, you can set and unset show, and you can always type ct to see the summary.

clear makes Stata no longer consider the data to be ct data.


    Examples
    Data errors flagged by ctset

Examples

About all you can do with ct data in Stata is convert it to survival-time (st) data so that you can use the survival analysis commands. To analyze count-time data with Stata,

. ctset . . .

. cttost

. ( now use any of the st commands )


Example 1: Simple ct data

We have data on generators that are run until they fail:

. use http://www.stata-press.com/data/r11/ctset1

. list, sep(0)

         failtime   fail

      1.       22      1
      2.       30      1
      3.       40      2
      4.       52      1
      5.       54      4
      6.       55      2
      7.       85      7
      8.       97      1
      9.      100      3
     10.      122      2
     11.      140      1

For instance, at time 54, four generators failed. To ctset these data, we could type

. ctset failtime fail

       dataset name:  http://www.stata-press.com/data/r11/ctset1.dta
               time:  failtime
           no. fail:  fail
           no. lost:  --    (meaning 0 lost)
          no. enter:  --    (meaning all enter at time 0)

It is not important that there be only 1 observation per failure time. For instance, according to our data, at time 85 there were seven failures. We could remove that observation and substitute two in its place: one stating that at time 85 there were five failures and another that at time 85 there were two more failures. ctset would interpret that data just as it did the previous data.

In more realistic examples, the generators might differ from one another. For instance, the following data show the number failing with old-style (bearings = 0) and new-style (bearings = 1) bearings:


. list, sepby(bearings)

         bearings   failtime   fail

      1.        0         22      1
      2.        0         40      2
      3.        0         54      1
      4.        0         84      2
      5.        0         97      2
      6.        0        100      1

      7.        1         30      1
      8.        1         52      1
      9.        1         55      1
     10.        1        100      3
     11.        1        122      2
     12.        1        140      1

That the data are sorted on bearings is not important. The ctset command for these data is


. ctset failtime fail, by(bearings)


           no. fail:  fail
           no. lost:  --    (meaning 0 lost)
          no. enter:  --    (meaning all enter at time 0)
                 by:  bearings

Example 2: ct data with censoring

In real data, not all units fail in the time allotted. Say that the generator experiment was stopped after 150 days. The data might be


. list

         bearings   failtime   fail   censored

      1.        0         22      1          0
      2.        0         40      2          0
      3.        0         54      1          0
      4.        0         84      2          0
      5.        1         97      2          0

      6.        0        100      1          0
      7.        0        150      0          2
      8.        1         30      1          0
      9.        1         52      1          0
     10.        1         55      1          0

     11.        1        122      2          0
     12.        1        140      1          0
     13.        1        150      0          3

The ctset command for these data is

. ctset failtime fail censored, by(bearings)


           no. fail:  fail
           no. lost:  censored
          no. enter:  --    (meaning all enter at time 0)
                 by:  bearings

In some other data, observations might also be censored along the way; that is, the value of censored would not be 0 before time 150. For instance, a record might read

    bearings   failtime   fail   censored
           0         84      2          1

This would mean that at time 84, two failed and one was lost because of censoring. The failure and censoring occurred at the same time, and when we analyze these data, Stata will assume that the censored observation could have failed, that is, that the censoring occurred after the two failures.


Example 3: ct data with delayed entry

Data on survival time of patients with a particular kind of cancer are collected. Time is measured as time since diagnosis. After data collection started, the sample was enriched with some patients from hospital records who had been previously diagnosed. Some of the data are

    time   die   cens   ent   other variables
       0     0      0    50
       1     0      0     5   . . .
     ...
      30     0      0     3   . . .
      31     0      1     2   . . .
      32     1      0     1   . . .
     ...
     100     1      1     0   . . .
     ...

Fifty patients entered at time 0 (time of diagnosis); five patients entered 1 day after diagnosis; and three, two, and one patients entered 30, 31, and 32 days after diagnosis, respectively. On the 32nd day, one of the previously entered patients died.

If the other variables are named sex and agecat, the ctset command for these data is

. ctset time die cens ent, by(sex agecat)

               time:  time
           no. fail:  die
           no. lost:  cens
          no. enter:  ent
                 by:  sex agecat

The count-time format is an inferior way to record data like these (data in which not every subject enters at time 0) because some information is already lost. When did the patient who died on the 32nd day enter? There is no way of telling.

For traditional survival analysis calculations, it does not matter. More modern methods of estimating standard errors, however, seek to identify each patient, and these data do not support using such methods.

This issue concerns the robust estimates of variance and the vce(robust) options on some of the st analysis commands. After converting the data, you must not use the vce(robust) option, even if an st command allows it, because the identities of the subjects (tying together when a subject starts and ceases to be at risk) are assigned randomly by cttost when you convert your ct to st data. When did the patient who died on the 32nd day enter? For conventional calculations, it does not matter, and cttost chooses a time randomly from the available entry times.

Data errors flagged by ctset

ctset requires only two things of your data: that the counts all be positive or zero and, if you specify an entry variable, that the entering and exiting subjects (failure + censored) balance.

If all subjects enter at time 0, we recommend that you do not specify a number-that-enter variable. ctset can determine for itself the number who enter at time 0 by summing the failures and censorings.
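Before declaring data with an entry variable, it can help to check the balance by hand. The sketch below uses a small made-up dataset (the variable names time, die, cens, and ent and all counts are assumptions) and compares only the overall totals; ctset's own checks may be stricter than this.

```stata
* Minimal sketch; data and variable names are hypothetical
clear
input time die cens ent
 0 0 0 5
10 1 0 2
20 2 1 0
30 1 2 0
end
egen long n_out = total(die + cens)    // everyone who leaves (failed or censored)
egen long n_in  = total(ent)           // everyone who enters
assert n_out == n_in                   // stops with an error if the totals differ
drop n_out n_in
ctset time die cens ent
```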


Methods and formulas

ctset is implemented as an ado-file.

Also see

[ST] ct — Count-time data
[ST] cttost — Convert count-time data to survival-time data

Title

cttost — Convert count-time data to survival-time data

Syntax

    cttost [, options]

    options         description

    t0(t0var)       name of entry-time variable
    wvar(wvar)      name of frequency-weighted variable
    clear           overwrite current data in memory
    †nopreserve     do not save the original data; programmer’s command

    †nopreserve is not shown in the dialog box.
    You must ctset your data before using cttost; see [ST] ctset.

Menu

Statistics > Survival analysis > Setup and utilities > Convert count-time data to survival-time data

Description

cttost converts count-time data to their survival-time format so that they can be analyzed with Stata. Do not confuse count-time data with counting-process data, which can also be analyzed with the st commands; see [ST] ctset for a definition and examples of count data.

Options

t0(t0var) specifies the name of the new variable to create that records entry time. (For most ct data, no entry-time variable is necessary because everyone enters at time 0.)

Even if an entry-time variable is necessary, you need not specify this option. cttost will, by default, choose t0, time0, or etime according to which name does not already exist in the data.

wvar(wvar) specifies the name of the new variable to be created that records the frequency weights for the new pseudo-observations. Count-time data are actually converted to frequency-weighted st data, and a variable is needed to record the weights. This sounds more complicated than it is. Understand that cttost needs a new variable name, which will become a permanent part of the st data.

If you do not specify wvar(), cttost will, by default, choose w, pop, weight, or wgt according to which name does not already exist in the data.

clear specifies that it is okay to proceed with the conversion, even though the current dataset has not been saved on disk.
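As a small illustration of these options, the sketch below converts a made-up count-time dataset with delayed entry and names the new entry-time and weight variables explicitly. The data and the names entry and freq are assumptions chosen for this example.

```stata
* Minimal sketch; data and new variable names are hypothetical
clear
input time die cens ent
 0 0 0 5
10 1 0 2
20 2 1 0
30 1 2 0
end
ctset time die cens ent
cttost, t0(entry) wvar(freq) clear   // entry holds entry times, freq the frequency weights
list
```

Listing the converted data shows the pseudo-observations together with the two variables that cttost created.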


The following option is available with cttost but is not shown in the dialog box:

nopreserve speeds the conversion by not saving the original data that can be restored should things go wrong or should you press Break. nopreserve is intended for use by programmers who use cttost as a subroutine. Programmers can specify this option if they have already preserved the original data. nopreserve does not affect the conversion.

Remarks

Converting ct to st data is easy. We have some count-time data,

. use http://www.stata-press.com/data/r11/cttost

. ct
       dataset name:  http://www.stata-press.com/data/r11/cttost.dta
               time:  time
           no. fail:  ndead
           no. lost:  ncens
          no. enter:  --    (meaning all enter at time 0)
                 by:  agecat treat

. list in 1/5

         agecat   treat   time   ndead   ncens

      1.      2       1    464       4       0
      2.      3       0    268       3       1
      3.      2       0    638       2       0
      4.      1       0    803       1       4
      5.      1       0    431       2       0

and to convert it, we type cttost:

. cttost

         failure event:  ndead != 0 & ndead < .
    obs. time interval:  (0, time]
     exit on or before:  failure
                weight:  [fweight=w]

           33  total obs.
            0  exclusions

           33  physical obs. remaining, equal to
           82  weighted obs., representing
           39  failures in single record/single failure data
        48726  total analysis time at risk, at risk from t =         0
                                 earliest observed entry t =         0
                                      last observed exit t =      1227


Now that it is converted, we can use any of the st commands:

. sts test treat, logrank

             failure _d:  ndead
       analysis time _t:  time
                 weight:  [fweight=w]

    Log-rank test for equality of survivor functions

              Events      Events
    treat    observed    expected

    0              22       17.05
    1              17       21.95

    Total          39       39.00

                 chi2(1) =       2.73
                 Pr>chi2 =     0.0986

Methods and formulas

cttost is implemented as an ado-file.

Also see

[ST] ct — Count-time data
[ST] ctset — Declare data to be count-time data

Title

discrete — Discrete-time survival analysis

As of the date that this manual was printed, Stata does not have a suite of built-in commands for discrete-time survival models matching the st suite for continuous-time models, but a good case could be made that it should. Instead, these models can be fit easily using other existing estimation commands and data manipulation tools.

Discrete-time survival analysis concerns analysis of time-to-event data whenever survival times are either a) intrinsically discrete (e.g., numbers of machine cycles) or b) grouped into discrete intervals of time (“interval censoring”). If intervals are of equal length, the same methods can be applied to both a) and b); survival times will be positive integers.

You can fit discrete-time survival models with the maximum likelihood method. Data may contain completed or right-censored spells, and late entry (left truncation) can also be handled, as well as unobserved heterogeneity (also termed “frailty”). Estimation makes use of the property that the sample likelihood can be rewritten in a form identical to the likelihood for a binary dependent variable multiple regression model and applied to a specially organized dataset (Allison 1984, Jenkins 1995). For models without frailty, you can use, for example, logistic (or logit) to fit the discrete-time logistic hazard model or cloglog to fit the discrete-time proportional hazards model (Prentice and Gloeckler 1978). Models incorporating normal frailty may be fit using xtlogit and xtcloglog. A model with gamma frailty (Meyer 1990) may be fit using pgmhaz (Jenkins 1997).

Estimation consists of three steps:

1. Data organization: The dataset must be organized so that there is 1 observation for each period when a subject is at risk of experiencing the transition event. For example, if the original dataset contains one row for each subject, $i$, with information about their spell length, $T_i$, the new dataset requires $T_i$ rows for each subject, one row for each period at risk. This may be accomplished using expand or stsplit. (This step is episode splitting at each and every interval.) The result is data of the same form as a discrete panel (xt) dataset with repeated observations on each panel (subject).

2. Variable creation: You must create at least three types of variables. First, you will need an interval identification variable, which is a sequence of positive integers $t = 1, \ldots, T_i$. For example,

. sort subject_id

. by subject_id: generate t = _n

Second, you need a period-specific censoring indicator, $d^*_{it}$. If $d_i = 1$ when subject $i$’s spell is complete and $d_i = 0$ when the spell is right-censored, the new indicator is $d^*_{it} = 1$ if $d_i = 1$ and $t = T_i$, and $d^*_{it} = 0$ otherwise.

Third, you must define variables (as functions of $t$) to summarize the pattern of duration dependence. These variables are entered as covariates in the regression. For example, for a duration dependence pattern analogous to that in the continuous-time Weibull model, you could define a new variable $x_1 = \log t$. For a quadratic specification, you define variables $x_1 = t$ and $x_2 = t^2$. We can achieve a piecewise constant specification by defining a set of dummy variables, with each group of periods sharing the same hazard rate, or a semiparametric model (analogous to the Cox regression model for continuous survival-time data) using separate dummy variables for each and every duration interval. No duration variable need be defined if you want to fit a model with a constant hazard rate.

In addition to these three essentials, you may define other time-varying covariates.


3. Estimation: You fit a binary dependent variable multiple regression model, with $d^*_{it}$ as the dependent variable and with the duration variables and any other covariates as regressors.
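The three steps above can be sketched as follows. The sketch assumes the cancer.dta example dataset shipped with Stata and treats its integer studytime (months) as the number of discrete intervals at risk; the variable names id, t, dstar, and logt are hypothetical names introduced here.

```stata
* Minimal sketch; dataset, covariates, and new variable names are assumptions
sysuse cancer, clear
generate long id = _n                                // one original row per subject
expand studytime                                     // step 1: T_i rows per subject
bysort id: generate int t = _n                       // step 2a: interval identifier 1, ..., T_i
generate byte dstar = died == 1 & t == studytime     // step 2b: 1 only in the final interval of a completed spell
generate double logt = ln(t)                         // step 2c: Weibull-like duration dependence
cloglog dstar logt age i.drug                        // step 3: discrete-time proportional hazards model
logit dstar logt age i.drug                          // step 3 (alternative): logistic hazard model
```

Replacing logt with a full set of interval dummies would give the semiparametric specification described in step 2, at the cost of many more parameters.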

For estimation using spell data with late entry, the stages are the same as those outlined above, with one modification and one warning. To fit models without frailty, you must drop all intervals prior to each subject’s entry to the study. For example, if entry is in period $e_i$, you drop the observation if $t < e_i$. If you want to fit frailty models on the basis of discrete-time data with late entry, then be aware that the estimation procedure outlined does not lead to correct estimates. (The sample likelihood in the reorganized data does not account for conditioning for late entry here. You will need to write your own likelihood function by using ml; see [R] maximize.)

To derive predicted hazard rates, use the predict command. For example, after logistic or cloglog, use predict, pr. After xtlogit or xtcloglog, use predict, pu0 (which predicts the hazard assuming the individual effect is equal to the mean value). Estimates of the survivor function, $S_{it}$, can then be derived from the predicted hazard rates, $p_{it}$, because $S_{it} = (1 - p_{i1})(1 - p_{i2}) \cdots (1 - p_{it})$.
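One way to turn predicted hazards into survivor estimates is a running product within subject, computed here as the exponential of a running sum of logs. As before, the dataset (cancer.dta), the model, and the variable names id, t, dstar, logt, p, and S are assumptions made for this sketch.

```stata
* Minimal sketch; dataset, model, and variable names are assumptions
sysuse cancer, clear
generate long id = _n
expand studytime
bysort id: generate int t = _n
generate byte dstar = died == 1 & t == studytime
generate double logt = ln(t)
quietly cloglog dstar logt age
predict double p, pr                                        // predicted hazard p_it
bysort id (t): generate double S = exp(sum(ln(1 - p)))      // S_it = (1-p_i1)...(1-p_it)
list id t p S if id == 1
```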

Acknowledgment

We thank Stephen Jenkins, University of Essex, UK, for drafting this initial entry.

References

Allison, P. D. 1984. Event History Analysis: Regression for Longitudinal Event Data. Newbury Park, CA: Sage.

Jenkins, S. P. 1995. Easy estimation methods for discrete-time duration models. Oxford Bulletin of Economics and Statistics 57: 129–138.

Jenkins, S. P. 1997. sbe17: Discrete time proportional hazards regression. Stata Technical Bulletin 39: 22–32. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 109–121. College Station, TX: Stata Press.

Meyer, B. D. 1990. Unemployment insurance and unemployment spells. Econometrica 58: 757–782.

Prentice, R. L., and L. Gloeckler. 1978. Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34: 57–67.

Also see

[D] expand — Duplicate observations
[ST] stcox — Cox proportional hazards model
[ST] stcrreg — Competing-risks regression
[ST] streg — Parametric survival models
[R] cloglog — Complementary log-log regression
[R] logistic — Logistic regression, reporting odds ratios
[XT] xtcloglog — Random-effects and population-averaged cloglog models
[XT] xtlogit — Fixed-effects, random-effects, and population-averaged logit models


Title

epitab — Tables for epidemiologists

Syntax

Cohort studies

    ir var_case var_exposed var_time [if] [in] [weight] [, ir_options]

    iri #a #b #N1 #N2 [, tb level(#)]

    cs var_case var_exposed [if] [in] [weight] [, cs_options]

    csi #a #b #c #d [, csi_options]

Case–control studies

    cc var_case var_exposed [if] [in] [weight] [, cc_options]

    cci #a #b #c #d [, cci_options]

    tabodds var_case [expvar] [if] [in] [weight] [, tabodds_options]

    mhodds var_case expvar [vars_adjust] [if] [in] [weight] [, mhodds_options]

Matched case–control studies

    mcc var_exposed_case var_exposed_control [if] [in] [weight] [, tb level(#)]

    mcci #a #b #c #d [, tb level(#)]

    ir_options                description

    Options
      by(varname[, missing])  stratify on varname
      estandard               combine external weights with within-stratum statistics
      istandard               combine internal weights with within-stratum statistics
      standard(varname)       combine user-specified weights with within-stratum statistics
      pool                    display pooled estimate
      nocrude                 do not display crude estimate
      nohom                   do not display homogeneity test
      ird                     calculate standard incidence-rate difference
      tb                      calculate test-based confidence intervals
      level(#)                set confidence level; default is level(95)


    cs_options                description

    Options
      by(varlist[, missing])  stratify on varlist
      estandard               combine external weights with within-stratum statistics
      istandard               combine internal weights with within-stratum statistics
      standard(varname)       combine user-specified weights with within-stratum statistics
      pool                    display pooled estimate
      nocrude                 do not display crude estimate
      nohom                   do not display homogeneity test
      rd                      calculate standardized risk difference
      binomial(varname)       number of subjects variable
      or                      report odds ratio
      woolf                   use Woolf approximation to calculate SE and CI of the odds ratio
      tb                      calculate test-based confidence intervals
      exact                   calculate Fisher’s exact p
      level(#)                set confidence level; default is level(95)

    csi_options               description

      or                      report odds ratio
      woolf                   use Woolf approximation to calculate SE and CI of the odds ratio
      tb                      calculate test-based confidence intervals
      exact                   calculate Fisher’s exact p
      level(#)                set confidence level; default is level(95)

    cc_options                description

    Options
      by(varname[, missing])  stratify on varname
      estandard               combine external weights with within-stratum statistics
      istandard               combine internal weights with within-stratum statistics
      standard(varname)       combine user-specified weights with within-stratum statistics
      pool                    display pooled estimate
      nocrude                 do not display crude estimate
      nohom                   do not display homogeneity test
      bd                      perform Breslow–Day homogeneity test
      tarone                  perform Tarone’s homogeneity test
      binomial(varname)       number of subjects variable
      cornfield               use Cornfield approximation to calculate CI of the odds ratio
      woolf                   use Woolf approximation to calculate SE and CI of the odds ratio
      tb                      calculate test-based confidence intervals
      exact                   calculate Fisher’s exact p
      level(#)                set confidence level; default is level(95)


    cci_options               description

      cornfield               use Cornfield approximation to calculate CI of the odds ratio
      woolf                   use Woolf approximation to calculate SE and CI of the odds ratio
      tb                      calculate test-based confidence intervals
      exact                   calculate Fisher’s exact p
      level(#)                set confidence level; default is level(95)

    tabodds_options           description

    Main
      binomial(varname)       number of subjects variable
      level(#)                set confidence level; default is level(95)
      or                      report odds ratio
      adjust(varlist)         report odds ratios adjusted for the variables in varlist
      base(#)                 reference group of control variable for odds ratio
      cornfield               use Cornfield approximation to calculate CI of the odds ratio
      woolf                   use Woolf approximation to calculate SE and CI of the odds ratio
      tb                      calculate test-based confidence intervals
      graph                   graph odds against categories
      ciplot                  same as graph option, except include confidence intervals

    CI plot
      ciopts(rcap_options)    affect rendition of the confidence bands

    Plot
      marker_options          change look of markers (color, size, etc.)
      marker_label_options    add marker labels; change look or position
      cline_options           affect rendition of the plotted points

    Add plots
      addplot(plot)           add other plots to the generated graph

    Y axis, X axis, Titles, Legend, Overall
      twoway_options          any options other than by() documented in [G] twoway_options

    mhodds_options            description

    Options
      by(varlist[, missing])  stratify on varlist
      binomial(varname)       number of subjects variable
      compare(v1,v2)          override categories of the control variable
      level(#)                set confidence level; default is level(95)

fweights are allowed; see [U] 11.1.6 weight.


Menu

ir

Statistics > Epidemiology and related > Tables for epidemiologists > Incidence-rate ratio

iri

Statistics > Epidemiology and related > Tables for epidemiologists > Incidence-rate ratio calculator

cs

Statistics > Epidemiology and related > Tables for epidemiologists > Cohort study risk-ratio etc.

csi

Statistics > Epidemiology and related > Tables for epidemiologists > Cohort study risk-ratio etc. calculator

cc

Statistics > Epidemiology and related > Tables for epidemiologists > Case-control odds ratio

cci

Statistics > Epidemiology and related > Tables for epidemiologists > Case-control odds-ratio calculator

tabodds

Statistics > Epidemiology and related > Tables for epidemiologists > Tabulate odds of failure by category

mhodds

Statistics > Epidemiology and related > Tables for epidemiologists > Ratio of odds of failure for two categories

mcc

Statistics > Epidemiology and related > Tables for epidemiologists > Matched case-control studies

mcci

Statistics > Epidemiology and related > Tables for epidemiologists > Matched case-control calculator

Description

ir is used with incidence-rate (incidence-density or person-time) data. It calculates point estimates and confidence intervals for the incidence-rate ratio and difference, along with attributable or prevented fractions for the exposed and total population. iri is the immediate form of ir; see [U] 19 Immediate commands. Also see [R] poisson and [ST] stcox for related commands.
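A quick way to see what ir reports is to call its immediate form with counts typed on the command line. The numbers below are illustrative assumptions: 41 exposed cases and 15 unexposed cases, observed over 28,010 and 19,017 units of person-time, respectively.

```stata
* Minimal sketch; all counts are illustrative assumptions
iri 41 15 28010 19017          // #a #b #N1 #N2
iri 41 15 28010 19017, tb      // test-based confidence intervals
```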

cs is used with cohort study data with equal follow-up time per subject and sometimes with cross-sectional data. Risk is then the proportion of subjects who become cases. It calculates point estimates and confidence intervals for the risk difference, risk ratio, and (optionally) the odds ratio, along with attributable or prevented fractions for the exposed and total population. csi is the immediate form of cs; see [U] 19 Immediate commands. Also see [R] logistic and [R] glogit for related commands.
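The immediate form csi can be tried with a small 2 × 2 table. The counts are illustrative assumptions: 7 exposed cases, 12 unexposed cases, 9 exposed noncases, and 2 unexposed noncases.

```stata
* Minimal sketch; all counts are illustrative assumptions
csi 7 12 9 2, exact        // risk difference and ratio, plus Fisher's exact p
csi 7 12 9 2, or woolf     // also report the odds ratio with a Woolf interval
```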

cc is used with case–control and cross-sectional data. It calculates point estimates and confidence intervals for the odds ratio, along with attributable or prevented fractions for the exposed and total population. cci is the immediate form of cc; see [U] 19 Immediate commands. Also see [R] logistic and [R] glogit for related commands.
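Likewise, cci accepts the four cell counts directly. The numbers are illustrative assumptions: 4 exposed and 386 unexposed cases, and 4 exposed and 1,250 unexposed controls.

```stata
* Minimal sketch; all counts are illustrative assumptions
cci 4 386 4 1250                // odds ratio with the default interval
cci 4 386 4 1250, woolf         // Woolf approximation for the SE and CI
cci 4 386 4 1250, level(90)     // 90% confidence level
```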


tabodds is used with case–control and cross-sectional data. It tabulates the odds of failure against a categorical explanatory variable expvar. If expvar is specified, tabodds performs an approximate χ2 test of homogeneity of odds and a test for linear trend of the log odds against the numerical code used for the categories of expvar. Both tests are based on the score statistic and its variance; see Methods and formulas. When expvar is absent, the overall odds are reported. The variable var_case is coded 0/1 for individual and simple frequency records and equals the number of cases for binomial frequency records.

Optionally, tabodds tabulates adjusted or unadjusted odds ratios, using either the lowest levels of expvar or a user-defined level as the reference group. If adjust(varlist) is specified, it produces odds ratios adjusted for the variables in varlist along with a (score) test for trend.

mhodds is used with case–control and cross-sectional data. It estimates the ratio of the odds of failure for two categories of expvar, controlled for specified confounding variables, vars_adjust, and tests whether this odds ratio is equal to one. When expvar has more than two categories but none are specified with the compare() option, mhodds assumes that expvar is a quantitative variable and calculates a 1-degree-of-freedom test for trend. It also calculates an approximate estimate of the log odds-ratio for a one-unit increase in expvar. This is a one-step Newton–Raphson approximation to the maximum likelihood estimate calculated as the ratio of the score statistic, U, to its variance, V (Clayton and Hills 1993, 103).
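Written out, the one-step approximation described above is the ratio of the score to its variance. The symbol $\hat\beta$ for the log odds-ratio per unit of expvar is notation introduced here; the standard error and the trend statistic shown alongside are the forms that usually accompany such a score-based estimate and are stated as an assumption, not as a transcription of Methods and formulas.

```latex
\hat\beta \approx \frac{U}{V}, \qquad
\widehat{\mathrm{SE}}(\hat\beta) \approx \frac{1}{\sqrt{V}}, \qquad
\chi^2_{(1)} = \frac{U^2}{V}
```

The estimated odds ratio for a one-unit increase in expvar is then $\exp(U/V)$.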

mcc is used with matched case–control data. It calculates McNemar’s chi-squared; point estimatesand confidence intervals for the difference, ratio, and relative difference of the proportion with thefactor; and the odds ratio and its confidence interval. mcci is the immediate form of mcc; see[U] 19 Immediate commands. Also see [R] clogit and [R] symmetry for related commands.

Options

Options are listed in the order that they appear in the syntax tables above. The commands for which the option is valid are indicated in parentheses immediately after the option name.

Options (ir, cs, cc, and mhodds) / Main (tabodds)

by(varname [, missing]) (ir, cs, cc, and mhodds) specifies that the tables be stratified on varname. Missing categories in varname are omitted from the stratified analysis, unless option missing is specified within by(). Within-stratum statistics are shown and then combined with Mantel–Haenszel weights. If estandard, istandard, or standard() is also specified (see below), the weights specified are used in place of Mantel–Haenszel weights.

estandard, istandard, and standard(varname) (ir, cs, and cc) request that within-stratum statistics be combined with external, internal, or user-specified weights to produce a standardized estimate. These options are mutually exclusive and can be used only when by() is also specified. (When by() is specified without one of these options, Mantel–Haenszel weights are used.)

estandard external weights are the person-time for the unexposed (ir), the total number of unexposed (cs), or the number of unexposed controls (cc).

istandard internal weights are the person-time for the exposed (ir), the total number of exposed (cs), or the number of exposed controls (cc). istandard can be used to produce, among other things, standardized mortality ratios (SMRs).

standard(varname) allows user-specified weights. varname must contain a constant within stratum and be nonnegative. The scale of varname is irrelevant.
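As a minimal sketch of how the weighting options combine with by(), the do-file lines below assume the Rothman and Monson dataset used in example 3 later in this entry, with variables deaths, male, pyears, and age. The choice of male as the exposure variable is an assumption made for illustration, and no output is reproduced here.

```stata
* Sketch only: stratify on age and vary the weights used to combine strata
use http://www.stata-press.com/data/r11/rm, clear

* Default: within-stratum estimates combined with Mantel-Haenszel weights
ir deaths male pyears, by(age)

* External weights: person-time of the unexposed in each stratum
ir deaths male pyears, by(age) estandard

* Internal weights: person-time of the exposed in each stratum
ir deaths male pyears, by(age) istandard
```

Because the three weighting options are mutually exclusive, each appears in a separate command.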


pool (ir, cs, and cc) specifies that, in a stratified analysis, the directly pooled estimate also be displayed. The pooled estimate is a weighted average of the stratum-specific estimates using inverse-variance weights, which are the inverse of the variance of the stratum-specific estimate. pool is relevant only if by() is also specified.
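The inverse-variance weighting described for pool can be written generically as follows. The symbols are assumptions introduced for illustration: the estimate in stratum i is written as theta-hat sub i and its weight as w_i. For ratio measures, pooling is commonly done on the log scale and then exponentiated; the scale each command uses is given in Methods and formulas and is not restated here.

```latex
\[
\hat{\theta}_{\mathrm{pool}}
  = \frac{\sum_{i} w_i \,\hat{\theta}_i}{\sum_{i} w_i},
\qquad
w_i = \frac{1}{\widehat{\mathrm{Var}}(\hat{\theta}_i)}
\]
```

Strata whose estimates are more precise (smaller variance) therefore contribute more to the pooled estimate.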

nocrude (ir, cs, and cc) specifies that in a stratified analysis the crude estimate (an estimate obtained without regard to strata) not be displayed. nocrude is relevant only if by() is also specified.

nohom (ir, cs, and cc) specifies that a χ2 test of homogeneity not be included in the output of a stratified analysis. This tests whether the exposure effect is the same across strata and can be performed for any pooled estimate, whether directly pooled or Mantel–Haenszel. nohom is relevant only if by() is also specified.

ird (ir) may be used only with estandard, istandard, or standard(). It requests that ir calculate the standardized incidence-rate difference rather than the default incidence-rate ratio.

rd (cs) may be used only with estandard, istandard, or standard(). It requests that cs calculate the standardized risk difference rather than the default risk ratio.

bd (cc) specifies that Breslow and Day's χ2 test of homogeneity be included in the output of a stratified analysis. This tests whether the exposure effect is the same across strata. bd is relevant only if by() is also specified.

tarone (cc) specifies that Tarone's χ2 test of homogeneity, which is a correction to the Breslow–Day test, be included in the output of a stratified analysis. This tests whether the exposure effect is the same across strata. tarone is relevant only if by() is also specified.

binomial(varname) (cs, cc, tabodds, and mhodds) supplies the number of subjects (cases plus controls) for binomial frequency records. For individual and simple frequency records, this option is not used.

or (cs, csi, and tabodds), for cs and csi, reports the odds ratio in addition to the risk ratio if by() is not specified. With by(), or specifies that a Mantel–Haenszel estimate of the combined odds ratio be made rather than the Mantel–Haenszel estimate of the risk ratio. In either case, this is the same calculation that would be made by cc and cci. Typically, cc, cci, or tabodds is preferred for calculating odds ratios. For tabodds, or specifies that odds ratios be produced; see base() for details about selecting a reference category. By default, tabodds will calculate odds.

adjust(varlist) (tabodds) specifies that odds ratios adjusted for the variables in varlist be calculated.

base(#) (tabodds) specifies that the #th category of expvar be used as the reference group for calculating odds ratios. If base() is not specified, the first category, corresponding to the minimum value of expvar, is used as the reference group.

cornfield (cc, cci, and tabodds) requests that the Cornfield (1956) approximation be used to calculate the confidence interval of the odds ratio. By default, cc and cci report an exact interval and tabodds reports a standard-error–based interval, with the standard error coming from the square root of the variance of the score statistic.

woolf (cs, csi, cc, cci, and tabodds) requests that the Woolf (1955) approximation, also known as the Taylor expansion, be used for calculating the standard error and confidence interval for the odds ratio. By default, cs and csi with the or option report the Cornfield (1956) interval; cc and cci report an exact interval; and tabodds reports a standard-error–based interval, with the standard error coming from the square root of the variance of the score statistic.

tb (ir, iri, cs, csi, cc, cci, tabodds, mcc, and mcci) requests that test-based confidence intervals (Miettinen 1976) be calculated wherever appropriate in place of confidence intervals based on other approximations or exact confidence intervals. We recommend that test-based confidence intervals be used only for pedagogical purposes and never for research work.

exact (cs, csi, cc, and cci) requests that Fisher's exact p be calculated rather than the χ2 and its significance level. We recommend specifying exact whenever samples are small. When the least-frequent cell contains 1,000 cases or more, there will be no appreciable difference between the exact significance level and the significance level based on the χ2, but the exact significance level will take considerably longer to calculate. exact does not affect whether exact confidence intervals are calculated. Commands always calculate exact confidence intervals where they can, unless cornfield, woolf, or tb is specified.
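A short sketch of how these options might be typed with the immediate commands follows. The four counts are hypothetical values chosen only to give a small table (exposed cases, unexposed cases, exposed noncases, unexposed noncases, in that order), and no output is shown.

```stata
* Hypothetical small 2x2 table; the counts are made up for illustration
* Request Fisher's exact p instead of the chi-squared test
csi 7 12 9 2, exact

* Also report the odds ratio, using the Woolf interval for it
csi 7 12 9 2, or woolf
```

The first command changes only the significance test; the second changes which interval is reported for the odds ratio.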

compare(v1,v2) (mhodds) indicates the categories of expvar to be compared; v1 defines the numerator and v2, the denominator. When compare() is not specified and there are only two categories, the second is compared to the first; when there are more than two categories, an approximate estimate of the odds ratio for a unit increase in expvar, controlled for specified confounding variables, is given.

level(#) (ir, iri, cs, csi, cc, cci, tabodds, mhodds, mcc, and mcci) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level.
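The two ways of choosing a confidence level can be sketched as follows, borrowing the X-ray fluoroscopy counts from example 1 later in this entry. The value 90 is an arbitrary choice for illustration.

```stata
* 90% confidence intervals for this command only
iri 41 15 28010 19017, level(90)

* Change the default level for subsequent commands in the session
set level 90
iri 41 15 28010 19017
```

Both iri commands above should report intervals at the same level; the difference is only in how long the setting persists.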

The following options are for use only with tabodds.

Main

graph (tabodds) produces a graph of the odds against the numerical code used for the categories of expvar. All graph options except connect() are allowed. This option is not allowed with the or option or the adjust() option.

ciplot (tabodds) produces the same plot as the graph option, except that it also includes the confidence intervals. This option may not be used with either the or option or the adjust() option.

CI plot

ciopts(rcap_options) (tabodds) is allowed only with the ciplot option. It affects the rendition of the confidence bands; see [G] rcap_options.

Plot

marker_options (tabodds) affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G] marker_options.

marker_label_options (tabodds) specify if and how the markers are to be labeled; see [G] marker_label_options.

cline_options (tabodds) affect whether lines connect the plotted points and the rendition of those lines; see [G] cline_options.

Add plots

addplot(plot) (tabodds) provides a way to add other plots to the generated graph; see [G] addplot_option.

Y axis, X axis, Titles, Legend, Overall

twoway_options (tabodds) are any of the options documented in [G] twoway_options, excluding by(). These include options for titling the graph (see [G] title_options) and options for saving the graph to disk (see [G] saving_option).


Remarks are presented under the following headings:

    Incidence-rate data
    Stratified incidence-rate data
    Standardized estimates with stratified incidence-rate data
    Cumulative incidence data
    Stratified cumulative incidence data
    Standardized estimates with stratified cumulative incidence data
    Case–control data
    Stratified case–control data
    Case–control data with multiple levels of exposure
    Case–control data with confounders and possibly multiple levels of exposure
    Standardized estimates with stratified case–control data
    Matched case–control data

To calculate appropriate statistics and suppress inappropriate statistics, the ir, cs, cc, tabodds, mhodds, and mcc commands, along with their immediate counterparts, are organized in the way epidemiologists conceptualize data. ir processes incidence-rate data from prospective studies; cs, cohort study data with equal follow-up time (cumulative incidence); cc, tabodds, and mhodds, case–control or cross-sectional (prevalence) data; and mcc, matched case–control data. With the exception of mcc, these commands work with both simple and stratified tables.

Epidemiological data are often summarized in a contingency table from which various statistics are calculated. The rows of the table reflect cases and noncases or cases and person-time, and the columns reflect exposure to a risk factor. To an epidemiologist, cases and noncases refer to the outcomes of the process being studied. For instance, a case might be a person with cancer and a noncase might be a person without cancer.

A factor is something that might affect the chances of being ultimately designated a case or a noncase. Thus a case might be a cancer patient and the factor, smoking behavior. A person is said to be exposed or unexposed to the factor. Exposure can be classified as a dichotomy, smokes or does not smoke, or as multiple levels, such as number of cigarettes smoked per week.

For an introduction to epidemiological methods, see Walker (1991). For an intermediate treatment, see Clayton and Hills (1993) and Lilienfeld and Stolley (1994). For other advanced discussions, see van Belle et al. (2004); Kleinbaum, Kupper, and Morgenstern (1982); and Rothman, Greenland, and Lash (2008). For an anthology of writings on epidemiology since World War II, see Greenland (1987). See Jewell (2004) for a text aimed at graduate students in the medical professions that uses Stata for much of the analysis. See Dohoo, Martin, and Stryhn (2003) for a graduate-level text on the principles and methods of veterinary epidemiologic research; Stata datasets and do-files are available.


Incidence-rate data

In incidence-rate data from a prospective study, you observe the transformation of noncases into cases. Starting with a group of noncase subjects, you monitor them to determine whether they become cases (e.g., stricken with cancer). You monitor two populations: those exposed and those unexposed to the factor (e.g., multiple X-rays). A summary of the data is

                     Exposed    Unexposed    Total
    Cases               a           b        a + b
    Person-time         N1          N0       N1 + N0
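Using the cell labels in the summary table above, the incidence rates and their two comparisons can be written as follows. The symbols I_1 and I_0 for the exposed and unexposed rates are assumptions introduced here for illustration.

```latex
\[
I_1 = \frac{a}{N_1}, \qquad I_0 = \frac{b}{N_0}
\]
\[
\mathrm{IRD} = I_1 - I_0, \qquad \mathrm{IRR} = \frac{I_1}{I_0}
\]
```

IRD is the incidence-rate difference and IRR the incidence-rate ratio; both compare the rate at which noncases become cases in the two groups.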

Example 1: iri

It will be easiest to understand these commands if we start with the immediate forms. Remember, in the immediate form, we specify the data on the command line rather than specifying names of variables containing the data; see [U] 19 Immediate commands. We have data (Boice Jr. and Monson [1977]; reported in Rothman, Greenland, and Lash [2008, 244]) on breast cancer cases and person-years of observation for women with tuberculosis repeatedly exposed to multiple X-ray fluoroscopies, and those not so exposed:

                              X-ray fluoroscopy
                              Exposed    Unexposed
    Breast cancer cases            41           15
    Person-years               28,010       19,017

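Using iri, the immediate form of ir, we specify the values in the table following the command:

. iri 41 15 28010 19017

                     |   Exposed   Unexposed  |      Total
    -----------------+------------------------+------------
               Cases |        41          15  |         56
         Person-time |     28010       19017  |      47027
    -----------------+------------------------+------------
      Incidence rate |  .0014638    .0007888  |   .0011908

                     |      Point estimate    |    [95% Conf. Interval]
    -----------------+------------------------+------------------------
     Inc. rate diff. |         .000675        |    .0000749    .0012751
     Inc. rate ratio |         1.855759       |    1.005684      3.6093 (exact)
     Attr. frac. ex. |         .4611368       |    .0056519     .722938 (exact)
     Attr. frac. pop |          .337618       |

        (midp)   Pr(k>=41) = 0.0177 (exact)
        (midp) 2*Pr(k>=41) = 0.0355 (exact)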

iri shows the table, reports the incidence rates for the exposed and unexposed populations, and then shows the point estimates of the difference and ratio of the two incidence rates along with their confidence intervals. The incidence rate is simply the frequency with which noncases are transformed into cases.

Next iri reports the attributable fraction among the exposed population, an estimate of the proportion of exposed cases attributable to exposure. We estimate that 46.1% of the 41 breast cancer cases among the exposed were due to exposure. (Had the incidence-rate ratio been less than 1, iri would have reported the prevented fraction in the exposed population, an estimate of the net proportion of all potential cases in the exposed population that was prevented by exposure; see the following technical note.)


After that, the table shows the attributable fraction in the total population, which is the net proportion of all cases attributable to exposure. This number, of course, depends on the proportion of cases that are exposed in the base population, which here is taken to be 41/56 and may not be relevant in all situations. We estimate that 33.8% of the 56 cases were due to exposure. We estimate that 18.9 cases were caused by exposure; that is, 0.338 × 56 = 0.461 × 41 = 18.9.
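The attributable fractions discussed above can be checked by hand with display. The lines below are a sketch: the scalar names are arbitrary, and the formulas (ratio minus one over the ratio, then scaled by the exposed share of cases, 41/56) are assumptions that reproduce the reported values to the digits shown; Methods and formulas gives the definitions iri actually uses.

```stata
* Hand check of the attributable fractions for the fluoroscopy data
scalar irr = (41/28010) / (15/19017)
scalar afe = (irr - 1) / irr
scalar afp = afe * 41 / 56

display "Inc. rate ratio             = " irr
display "Attr. frac. among exposed   = " afe
display "Attr. frac. in population   = " afp
display "Cases attributed (of 56)    = " afp * 56
display "Cases attributed (of 41)    = " afe * 41
```

The last two lines should agree with each other, which is the equality 0.338 × 56 = 0.461 × 41 stated in the text.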

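At the bottom of the table, iri reports both one- and two-sided exact significance tests. For the one-sided test, the probability that the number of exposed cases is 41 or greater is 0.0177. This is a "midp" calculation; see Methods and formulas below. The two-sided test is twice the one-sided probability and is reported as 0.0355; doubling the rounded value gives 2 × 0.0177 = 0.0354, which differs in the last digit only because of rounding.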

Technical note

When the incidence-rate ratio is less than 1, iri (and ir, cs, csi, cc, and cci) substitutes the prevented fraction for the attributable fraction. Let's reverse the roles of exposure in the above data, treating as exposed a person who did not receive the X-ray fluoroscopy. You can think of this as a new treatment for preventing breast cancer, the suggested treatment being not to use fluoroscopy.

. iri 15 41 19017 28010


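                     |   Exposed   Unexposed  |      Total
    -----------------+------------------------+------------
               Cases |        15          41  |         56
         Person-time |     19017       28010  |      47027
    -----------------+------------------------+------------
      Incidence rate |  .0007888    .0014638  |   .0011908

                     |      Point estimate    |    [95% Conf. Interval]
    -----------------+------------------------+------------------------
     Inc. rate diff. |        -.000675        |   -.0012751   -.0000749
     Inc. rate ratio |         .5388632       |     .277062    .9943481 (exact)
     Prev. frac. ex. |         .4611368       |    .0056519     .722938 (exact)
     Prev. frac. pop |         .1864767       |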

(midp) Pr(k
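The prevented fractions in the reversed table can be related to the incidence-rate ratio with a short worked calculation. The formulas below are a sketch that reproduces the reported values to the digits shown; scaling by the exposed share of person-time is an assumption inferred from the numbers, not a statement of the command's documented formula.

```latex
\[
\mathrm{PF}_e = 1 - \mathrm{IRR} = 1 - 0.5388632 = 0.4611368
\]
\[
\mathrm{PF}_p = \mathrm{PF}_e \times \frac{N_1}{N_1 + N_0}
             = 0.4611368 \times \frac{19017}{47027} \approx 0.1865
\]
```

The prevented fraction among the exposed equals the attributable fraction among the exposed from example 1 because reversing the roles of exposure inverts the incidence-rate ratio.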


Example 2: ir

ir works like iri, except that it obtains the entries in the tables by summing data. You specify three variables: the first represents the number of cases represented by this observation, the second indicates whether the observation is for subjects exposed to the factor, and the third records the total time the subjects in this observation were observed. An observation may reflect one subject or a group of subjects.

For instance, here is a 2-observation dataset for the table in the previous example:

. use http://www.stata-press.com/data/r11/irxmpl

. list

           cases   exposed    time
      1.      41         0   28010
      2.      15         1   19017

If we typed ir cases exposed time, we would obtain the same output that we obtained above. Another way the data might be recorded is

. use http://www.stata-press.com/data/r11/irxmpl2

. list

           cases   exposed    time
      1.      20         0   14000
      2.      21         0   14010
      3.      15         1   19017

Here the first 2 observations will be automatically summed by ir because both have the same value of exposed. Finally, the data might be individual-level data:

. use http://www.stata-press.com/data/r11/irxmpl3

. list in 1/5

           cases   exposed    time
      1.       1         1      10
      2.       0         1       8
      3.       0         0       9
      4.       1         0       2
      5.       0         1       1

The first observation represents a woman who got cancer, was exposed, and was observed for 10 years. The second is a woman who did not get cancer, was exposed, and was observed for 8 years, and so on.

Technical note

ir (and all the other commands) assumes that a subject was exposed if the exposed variable is nonzero and not missing, assumes the subject was not exposed if the variable is zero, and ignores the observation if the variable is missing. For ir, the case variable and the time variable are restricted to nonnegative integers and are summed within the exposed and unexposed groups to obtain the entries in the table.


Stratified incidence-rate data

Example 3: ir with stratified data

ir can work with stratified tables, as well as with single tables. For instance, Rothman (1986, 185) discusses data from Rothman and Monson (1973) on mortality by sex and age for patients with trigeminal neuralgia:

                     Age through 64          Age 65+
                     Males    Females    Males    Females
    Deaths              14         10       76        121
    Person-years      1516       1701      949       2245

Entering the data into Stata, we have the following dataset:

. use http://www.stata-press.com/data/r11/rm
(Rothman and Monson 1973 data)

. list

age male deaths pyears

1.


Technical note

Stratification is one way to deal with confounding; that is, perhaps sex affects the incidence of trigeminal neuralgia and so does age, so the table was stratified by age in an attempt to uncover the sex effect. (We are concerned that age may confound the true association between sex and the incidence of trigeminal neuralgia because the age distributions are so different for males and females. If age affects incidence, the difference in the age distributions would induce different incidences for males and females and thus confound the true effect of sex.)

We do not, however, have to use tables to uncover effects; the estimation alternative when we have aggregate data is Poisson regression, and we can use the same data on which we ran ir with poisson. Poisson regression also works with individual-level data.

(Although age in the previous example appears to be a string, it is actually a numeric variable taking on values 0 and 1. We attached a value label to produce the labels

    chi2 = 0.0000
    Log likelihood = -10.733944                       Pseudo R2       =     0.8843

          deaths |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |   1.495096   .2060997     2.92   0.004     1.141118     1.95888
             age |   8.888775   1.934943    10.04   0.000     5.801616    13.61867
          pyears | (exposure)

Compare these results with the Mantel–Haenszel estimates produced by ir:

    Source                   IR Ratio    95% Conf. Int.
    Mantel–Haenszel (ir)       1.50       1.14    1.96
    poisson                    1.50       1.14    1.96

The results from poisson agree with the Mantel–Haenszel estimates to two decimal places. But poisson also estimates an incidence-rate ratio for age. Here the estimate is not of much interest, because the outcome variable is total mortality and we already knew that older people have a higher mortality rate. In other contexts, however, the estimate might be of greater interest.

See [R] poisson for an explanation of the poisson command.
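The comparison between the stratified table and Poisson regression can be sketched as the following do-file lines. Both commands are inferred rather than quoted: the variable names come from the dataset listing in example 3, the ir command assumes male is the exposure variable and age the stratifier, and the poisson command is reconstructed from the coefficient table shown above.

```stata
* Sketch: Mantel-Haenszel estimate from a table stratified on age
use http://www.stata-press.com/data/r11/rm, clear
ir deaths male pyears, by(age)

* Poisson regression on the same aggregate data, reporting rate ratios
poisson deaths male age, exposure(pyears) irr
```

The incidence-rate ratio reported for male by poisson is the quantity to set beside the Mantel–Haenszel combined estimate from ir.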

Technical note

Both the model fit above and the preceding table asserted that exposure effects are the same across age categories and, if they are not, then both of the previous results are equally inappropriate. The table presented a test of homogeneity, reassuring us that the exposure effects do indeed appear to be constant. The Poisson-regression alternative can be used to reproduce that test by including interactions between the age groups and exposure:


. poisson deaths male age male#c.age, exposure(pyears) irr

Iteration 0: log likelihood = -10.898799Iterati
