
  • 71

    IX. ADVANCED DATA MANAGEMENT TOPICS

    This section provides further detail on the data management commands listed below and on issues related to missing data:

    Write (Export)
    Delete File/Table
    Delete Records / Undelete Records
    Merge
    Relate

    Write (Export) data

    Exporting to other file types

    The Write (Export) command saves the current data to a different Epi Info .MDB data file or to another file format supported by the command. With the Write command you can also specify which variables are written to the new file and their order. As an example, Read the viewEvansCounty file in Sample.mdb (see the previous Read section) and export it to an Excel file. Click on Write (Export) in the Analysis Commands dialog box, and the Write dialog box is presented as follows:

    Figure 83. Dialog box for Write (Export) command, Epi Info.

    In the Write dialog box, the All (*) box is checked by default. This option writes all variables from the current data set to the new data set. To exclude some variables from the new Data table, use the All (*) Except option; All (*) must first be unchecked before All (*) Except can be selected. You can also select individual variables by right-clicking them in the variable box, after unchecking both All (*) and All (*) Except. For simplicity, this example keeps All (*) checked so that all variables are written to the new data set.

    Next, use Output Mode to decide whether the data being written will Append to or Replace the existing data set. For this example, use Replace. With Replace checked, the new data replace any existing data set in the output file; with Append checked, the new data are added to it.

    In the Output Formats box, click the down-arrow button and select Excel 4.0. The down-arrow button lists the data file formats available in Epi Info:

    Epi2000
    Access 97, 2000
    dBase III, IV, 5.0
    Paradox 3.x, 4.x, 5.x
    Excel 3.0, 4.0
    Epi Info 6
    Text (Delimited)

  • 72

    Pressing the button labeled ". . ." to the right of File Name displays a dialog box where you can select a folder in which to save the new file. Here, let's go to the C:\Epi Info folder and type the file name EvansCounty. You will see .xls in the Save as type section of the dialog box. Click Save, and the new Excel file is ready to be created. Click OK, and EvansCounty.xls is written (exported) to the folder C:\Epi Info. To check EvansCounty.xls for accuracy, use the Read/Import command or open the file in Excel.

    The Data table option applies only when Output Formats is Epi2000 or Access. Only then can you type a table name in the Data table box. Using the down-arrow button, the Data table box also lets you select an existing Data table to receive the output data set; this applies when you want to replace or append to an existing Data table. However, neither Epi Info view files nor the Data tables of views appear in the Data table list, because the Write command cannot be used to add data to a view file. In that case, use the Merge command. That is why you do not see the view file viewEvansCounty or its related Data table EvansCounty in the Data table box.

    Similarly, you can create a new data set in other file formats (dBase, text, etc.), with different variables and different output modes, by following the steps above.
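    The difference between the Replace and Append output modes, and the effect of choosing which variables to write and in what order, can be illustrated outside Epi Info. The following Python sketch is not Epi Info code: the function names write_export and read_rows, the file name, and the records are hypothetical, and a delimited text file stands in for the output format.

```python
import csv
import os
import tempfile


def write_export(records, path, variables=None, mode="replace"):
    """Write a list of dict records to a delimited text file.

    variables: names and order of the columns to write (None = all).
    mode: "replace" overwrites the file; "append" adds rows to it.
    """
    if mode not in ("replace", "append"):
        raise ValueError("mode must be 'replace' or 'append'")
    if variables is None:
        variables = list(records[0].keys()) if records else []
    appending = mode == "append" and os.path.exists(path)
    with open(path, "a" if appending else "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=variables,
                                extrasaction="ignore")
        if not appending:
            writer.writeheader()  # header only when starting a new file
        writer.writerows(records)


def read_rows(path):
    """Read the file back, as a check on what was written."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [{"ID": "1", "AGE": "40", "SEX": "M"},
               {"ID": "2", "AGE": "55", "SEX": "F"}]
    out = os.path.join(tempfile.mkdtemp(), "example.csv")
    write_export(records, out, variables=["SEX", "ID"], mode="replace")
    write_export(records, out, variables=["SEX", "ID"], mode="append")
    print(len(read_rows(out)), "rows after Replace followed by Append")
```

    Reading the file back after writing mirrors the accuracy check suggested above.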

    Delete File/Table

    Delete File/Table is used when you want to delete a file, a table from within an Epi2000/Access file, or a view from within an Epi2000/Access file (see Figure 84 for an example).

    Figure 84. Dialog box for Delete File/Table, Epi Info.

    As an example, Read the viewEvansCounty file, then use the Write (Export) command to save it as the table Delete_Me in the Sample.MDB file. Next, use Delete File/Table: in the dialog box click on Table, for the Database select Sample.MDB, and for the Table Name select Delete_Me.

    Delete Records / Undelete Records

    Using Delete Records you can either mark records for deletion or permanently remove records from the file (Figure 85). Records that are marked for deletion remain in the data file but are usually ignored during analyses. (Note: in the Set command, the usual setting for Process Records is Normal, i.e., analyses are performed only on undeleted records; the two other options are to analyze both undeleted records and records marked for deletion [Both] or only records marked for deletion [Deleted].) The other option is to permanently remove records from the file. As shown in Figure 85, you specify criteria for determining which records to delete: * deletes all records, and other criteria, such as Age>50 or Sex=M, use the same types of functions and mathematical comparisons described for Select (see Appendix 2). When the Run Silent option is not checked, the command makes a sound and pops up a small dialog box; when it is checked, neither the sound nor the pop-up window occurs.

  • 73

    Records marked for deletion can be undeleted using the Undelete Records command (Figure 86). Specific criteria can be given as to which records to undelete.

    Figure 85. Dialog box for Delete Records command, Epi Info.

    Figure 86. Dialog box for Undelete Records command, Epi Info.

    (Note inconsistency between command Undelete Records and dialog box name UNDELETE)
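    The distinction between marking records for deletion and permanently removing them, together with the three Process Records settings, can be sketched in Python. This is an illustration of the logic described above, not Epi Info code; the Record class, the function names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    id: int
    age: int
    sex: str
    deleted: bool = False  # True = marked for deletion, still in the file


def delete_records(records, criterion, permanent=False):
    """Mark matching records for deletion, or remove them if permanent."""
    if permanent:
        return [r for r in records if not criterion(r)]
    return [replace(r, deleted=True) if criterion(r) else r for r in records]


def undelete_records(records, criterion):
    """Clear the deletion mark on matching records."""
    return [replace(r, deleted=False) if criterion(r) else r for r in records]


def process_records(records, mode="normal"):
    """Select records for analysis: 'normal', 'deleted', or 'both'."""
    if mode == "normal":
        return [r for r in records if not r.deleted]
    if mode == "deleted":
        return [r for r in records if r.deleted]
    if mode == "both":
        return list(records)
    raise ValueError("mode must be 'normal', 'deleted', or 'both'")


if __name__ == "__main__":
    # Made-up records, for illustration only.
    data = [Record(1, 62, "M"), Record(2, 45, "F"), Record(3, 51, "M")]
    marked = delete_records(data, lambda r: r.age > 50)
    print(len(process_records(marked, "normal")), "record(s) analyzed under Normal")
```

    A marked record can be restored with undelete_records, whereas a permanently removed record cannot, which is the practical difference between the two options.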

    Relate files

    In some situations you may want to Relate two files. One common example is health clinic data, where one file contains information on the individual, such as name, age, sex, and address, and another contains information on clinic visits. Another example is survey data, where one file contains information at the household level and another has information on the individual. The investigator may want to Relate these two files and perform an analysis of the combined data table. A visual example is shown in Figure 87. To Relate two files, you must have a variable common to both data tables on which to link, such as a clinic ID number or a household number.

    Figure 87. Relating two data tables: Data table A (main table) + Data table B (the other table that is to be related to the main table) = Data table C (a combination of A and B).

    As an example, let's relate the data table viewFamily to another data table, viewPatient; both can be found in Refugee.MDB, an example file included with Epi Info. (The details of these files can be found in Appendix 1.) A partial listing of the viewFamily table, the viewPatient table, and the related file is shown in Figure 88.

  • 74

    Figure 88. The viewFamily table, the viewPatient table, and the related file.

    viewFamily table
    Line  Family Id Number  household  Date of Arrival:  Port of Entry:  Country of Origin:  Language spoken
    1     1                 1          12-22-1998        NEW YORK        BOSNIA              4
    2     2                 2          01-06-1999        NEW YORK        BOSNIA              4
    3     3                 3          01-20-1999        NEW YORK        BOSNIA              4
    4     4                 4          01-12-1999        CALIFORNIA      VIETNAM             3
    5     5                 5          01-20-1999        NEW YORK        BOSNIA

    viewPatient table
    Line   Today date  Family ID Number  BOH ID NUMBER:  BOH Re-entry
    16229  04-07-1999  1                 688174          688174
    16230  01-11-1999  1                 9569112         9569112
    16231  03-18-1999  1                 8251382         8251382
    16232  03-19-1999  2                 8188724         8188724
    16233  08-16-1999  2                 7335445         7335445

    Related viewFamily and viewPatient tables
    Line  Family Id Number  household  Date of Arrival:  Port of Entry:  Country of Origin:  Language spoken
    1     1                 1          12-22-1998        NEW YORK        BOSNIA              4
    2     1                 1          12-22-1998        NEW YORK        BOSNIA              4
    3     1                 1          12-22-1998        NEW YORK        BOSNIA              4
    4     2                 2          01-06-1999        NEW YORK        BOSNIA              4
    5     2                 2          01-06-1999        NEW YORK        BOSNIA              4
    6     4                 4          01-12-1999        CALIFORNIA      VIETNAM             3

    Read the data table viewFamily (you will need to change the Data Source to C:\Epi_Info\Refugee.MDB). Then click the Relate command in Analysis Commands on the left, and the Relate dialog box will appear (Figure 89). Again, you will need to change the Data Source to C:\Epi_Info\Refugee.MDB. In the Views portion of the dialog box, click on viewPatient, the table you want to relate.

    You must supply a Key variable that exists in both tables and allows records to be related. Click the Build Key button, and the Relate - Build Key dialog box appears (Figure 90). With the main Current Table(s) (viewFamily) selected, click the down arrow next to the blank Available Variables box and select the key variable FAMIDNUM. Then click OK. Select the Related Table (viewPatient), click the down arrow next to Available Variables once again, and select FAMIDNUM. Click OK again to close the Relate - Build Key dialog box and return to the Relate dialog box. The Key at the bottom of the Relate dialog box will now read FAMIDNUM :: FAMIDNUM. Click the OK button, and the relationship between the files is created, with the message shown in Figure 91 presented in the Analysis Output window.

  • 75

    Figure 89. Dialog box for Relate command, Epi Info.

    Figure 90. Dialog box for Relate - Build Key, Epi Info

    Figure 91. Example Output from Relate command

    Current View: C:\Epi_Info\Refugee.MDB:viewFamily
    Relate: LNK_2
    Record Count: 1772 (Deleted records excluded) Date: 6/29/2005 10:53:25 AM

    One option when relating files (Figure 89) is Use Unmatched (All). If this box is checked, the related file will contain all records from both files whether or not they can be related to one another; if it is not checked, only records that can be related to one another will be in the related file. Note that more than two tables can be related and that the common identifier may span several fields.
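    The effect of relating two tables on a key, with and without Use Unmatched (All), can be sketched in Python. This is an illustration of the behavior described above, not Epi Info code. The key name FAMIDNUM comes from the example; the function name relate, the other field names, and the shortened rows (based on the partial listing in Figure 88) are assumptions made for the sketch.

```python
def relate(main, related, key, use_unmatched=False):
    """Combine rows of two tables that share the same key value.

    With use_unmatched=False only related rows are returned; with
    use_unmatched=True, rows from either table that have no partner
    are kept as well.
    """
    by_key = {}
    for row in related:
        by_key.setdefault(row[key], []).append(row)

    combined, matched_keys = [], set()
    for main_row in main:
        partners = by_key.get(main_row[key], [])
        if partners:
            matched_keys.add(main_row[key])
            for partner in partners:
                combined.append({**main_row, **partner})
        elif use_unmatched:
            combined.append(dict(main_row))
    if use_unmatched:
        for value, rows in by_key.items():
            if value not in matched_keys:
                combined.extend(dict(r) for r in rows)
    return combined


# Shortened rows; field names other than FAMIDNUM are hypothetical.
FAMILIES = [
    {"FAMIDNUM": 1, "ORIGIN": "BOSNIA"},
    {"FAMIDNUM": 2, "ORIGIN": "BOSNIA"},
    {"FAMIDNUM": 3, "ORIGIN": "BOSNIA"},
    {"FAMIDNUM": 4, "ORIGIN": "VIETNAM"},
    {"FAMIDNUM": 5, "ORIGIN": "BOSNIA"},
]
PATIENTS = [
    {"PATIENT_LINE": 16229, "FAMIDNUM": 1},
    {"PATIENT_LINE": 16230, "FAMIDNUM": 1},
    {"PATIENT_LINE": 16231, "FAMIDNUM": 1},
    {"PATIENT_LINE": 16232, "FAMIDNUM": 2},
    {"PATIENT_LINE": 16233, "FAMIDNUM": 2},
]

if __name__ == "__main__":
    for row in relate(FAMILIES, PATIENTS, "FAMIDNUM"):
        print(row)
```

    Because only five patient rows are listed here, the output covers families 1 and 2 only; the full tables in Refugee.MDB contain many more records.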

  • 76

    Merge files

    Here we describe two ways to Merge files in Epi Info: Append and Update. The first approach is to Read a file and Append (or concatenate) records from another file to the master file (Figure 92). An example of this approach is when two people enter data from a study on separate computers and you would like to combine the two files into one file.

    Figure 92. Conceptual approach to use of Merge using Append option.

    Master Table (Read)    Second Table (Merge)    Merged Table (Append)
    ID  Ltr                ID  Ltr                 ID  Ltr
    1   A                  6   F                   1   A
    2   B                  7   G                   2   B
    3   C                  8   H                   3   C
    4   D                  9   I                   4   D
    5   E                  10  J                   5   E
                                                   6   F
                                                   7   G
                                                   8   H
                                                   9   I
                                                   10  J

    The second approach is to Update a file: a file is Read, and its records are then updated with the information in the Merge table wherever the key matches. Only fields that are found in both datasets and have a non-empty value in the Merge table are replaced. A conceptual example is presented in Figure 93. A practical example is a state health department reportable disease system, where a master file is kept at the state and a local health department sends a table containing updated information.

    Figure 93. Conceptual approach to use of Merge using Update option.

    Master Table (Read)    Second Table (Merge)    Merged Table (Update)
    ID  Ltr                ID  Ltr                 ID  Ltr
    1   A                  3   F                   1   A
    2   B                  5   G                   2   B
    3   C                                          3   F
    4   D                                          4   D
    5   E                                          5   G

    In general, the steps are:

    Read a master file
    Use Merge (see Figure 94 for the dialog box)
      o Select a table or file
      o Choose either Update or Append or both
      o Provide one or more Key variables by pressing the Build Key button and completing the Relate - Build Key dialog box (see Figure 90)
      o Click the OK button on both dialog boxes
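    The Append and Update behavior shown in Figures 92 and 93 can be sketched in Python. This is an illustration, not Epi Info code: the function name merge and its parameters are hypothetical, and the sketch assumes that Append adds Merge-table records whose key has no match in the master file, while Update replaces only fields that exist in both tables and are non-empty in the Merge table.

```python
def merge(master, second, key, update=True, append=True):
    """Return a merged copy of master using records from second."""
    merged = [dict(row) for row in master]        # leave master unchanged
    index = {row[key]: row for row in merged}
    for row in second:
        target = index.get(row[key])
        if target is not None:
            if update:
                for field, value in row.items():
                    # replace only shared fields with a non-empty new value
                    if field in target and value not in (None, ""):
                        target[field] = value
        elif append:
            new_row = dict(row)
            merged.append(new_row)
            index[row[key]] = new_row
    return merged


MASTER = [{"ID": i, "Ltr": ltr} for i, ltr in enumerate("ABCDE", start=1)]
SECOND_APPEND = [{"ID": i, "Ltr": ltr} for i, ltr in enumerate("FGHIJ", start=6)]
SECOND_UPDATE = [{"ID": 3, "Ltr": "F"}, {"ID": 5, "Ltr": "G"}]

if __name__ == "__main__":
    appended = merge(MASTER, SECOND_APPEND, "ID")
    updated = merge(MASTER, SECOND_UPDATE, "ID")
    print("".join(r["Ltr"] for r in appended))
    print("".join(r["Ltr"] for r in updated))
```

    Choosing both Update and Append corresponds to leaving both flags set, so matching records are updated and non-matching records are added in a single pass.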

  • 77

    Figure 94. Dialog box for Merge command, Epi Info.

  • 78

    Acknowledgments

    We would like to thank Andrew Dean, MD, MPH, for his comments and suggestions on this document. Should you have any suggestions to improve this document, please feel free to contact Kevin Sullivan at [email protected] This document was made possible, in part, by a grant from the Bill and Melinda Gates Foundation.

    References

    Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer Verlag Publishers, 1996.
    Kleinbaum DG, Klein M. Logistic Regression: A Self-Learning Text, 2nd Ed. Springer Verlag Publishers, 2002.
    Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. John Wiley and Sons Publishers, New York, 1982.
    Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Multivariable Methods, 3rd Ed. Duxbury Press, 1998.
    Kleinbaum DG, Sullivan KM, Barker N. ActivEpi Companion Textbook. Springer Verlag Publishers, 2003.
    Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, Pacific Grove, 2000.

  • 79

    APPENDICES

    Appendix 1. Data Dictionaries

    This appendix contains the data dictionaries for the examples in this document, in alphabetical order. The files in Sample.mdb are:

    Addicts, Anderson, BFmeasles, Chemo, Myeloma, Stanford, Vets, viewAddfull, viewAgeWithCount, viewBabyBloodPressure, viewEpi1, viewEpi10, viewEstriolandBirthweight, viewEvansCounty, viewhmohiv, viewLasum, viewLEUKEM2, viewOswego, viewRely, viewSmoke

    The files in Refugee.mdb, used for merging or relating datasets, are:

    viewFamily, viewPatient

    Addicts - Survival Analysis

    These data are based on a cohort study of 238 heroin addict patients, comparing the treatment effectiveness of one clinic with that of another. The number of days from entry to a clinic until departure was the outcome variable. This is an example file, called Addicts, in the text by Kleinbaum. Please note that these data were originally provided by John Caplehorn (The University of Sydney, Department of Public Health).

    Reference: Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer-Verlag, New York, 1996.

    File Name: Addicts
    Project: Sample.mdb
    Number of records: 238

    Variable: Clinic
    Label: Main predictor of interest. This is the exposure variable, which assigns the study subjects to clinic 1 or clinic 2.
    Values/Description: 1 = clinic 1; 2 = clinic 2
    Freq: 161; 77

    Variable: status
    Label: Censored variable. This variable denotes whether or not the patient has developed an event (exit from clinic).
    Values/Description: 0 = censored; 1 = uncensored (exit from clinic)
    Freq: 150; 88

    Variable: Survival_Time_Days
    Label: Survival time in days from entry to a clinic until departure. This is the outcome variable (time to an event).
    Values/Description: Range: 2-1076 days; Mean: 404.6555; Median: 367.5

    Variable: Prison_Record
    Label: Past history of imprisonment
    Values/Description: 0 = No; 1 = Yes
    Freq: 126; 112

    Variable: Methadone_dose__mg_day_
    Label: Daily dose of Methadone substitute (mg/day)
    Values/Description: Range: 20-110; Mean: 60.542; Median: 60

  • 80

    Anderson - Survival Analysis

    This is a clinical trial studying survival times in weeks (remission) of 42 leukemia patients to compare the effect of a steroid (6-mercaptopurine) with placebo. The duration of the relapse-free period after treatment or placebo was the outcome variable. This is an example file, called Anderson, in Survival Analysis: A Self-Learning Text by Kleinbaum. Please note that these data are originally from Freireich et al.

    Data source: Freireich et al. The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia. Blood 21: 699-716, 1963.

    File Name: Anderson
    Project: Sample.mdb
    Number of records: 42

    Variable: Stime
    Label: Survival time in weeks until relapse. This is the outcome variable (time to an event).
    Values/Description: Range: 1-35 weeks; Mean: 12.881; Median: 10.5

    Variable: status
    Label: Censored variable. This variable denotes whether or not the patient has developed an event (relapse).
    Values/Description: 0 = censored; 1 = relapsed
    Freq: 12; 30

    Variable: sex
    Label: Gender
    Values/Description: 0 = female; 1 = male
    Freq: 22; 20

    Variable: Log_wbc
    Label: Log value of white blood cells
    Values/Description: Range: 1.4-5; Mean: 2.9302; Median: 2.8

    Variable: Rx
    Label: Main predictor of interest. This is the exposure variable (treatment or placebo) randomly assigned to the leukemia patients.
    Values/Description: 1 = placebo; 0 = treatment
    Freq: 21; 21

    BFMeasles - Measles Outbreak Investigation

    These are test data provided courtesy of the Epi Info working group, Epidemiology Program Office, CDC. Thanks to Roger Friedman for sharing the information about these data for this document.

    File Name: BFMeasles
    Project: Sample.mdb
    Number of records: 262

    Variable: EPID
    Label: Location code (expressed as text in the fields Province, District, Town, Village/Neighborhood)
    Values/Description: From BFA-TEN-OUA-01-0005 to BFA-BOB-DAN-02-1297
    Freq: 262

    Variable: PROVINCE
    Label: Name of Province of patient
    Values/Description: 11 provinces ranging from BANFORA to TENKODOGO (alphabetically)
    Freq: 262

    Variable: DISTRICT
    Label: Name of District of patient
    Values/Description: 40 districts ranging from BANFORA to ZORGHO (alphabetically)
    Freq: 262

    Variable: TOWN
    Label: Name of Town

    Variable: VILLNEIG
    Label: Name of Village/Neighborhood
    Values/Description: 160 village/neighborhoods ranging from ABSINDO to ZOUMAMISSIRI (alphabetically); (.) missing
    Freq: 227; 35

    Variable: AMAPCODE
    Label: A location code which matches that on the map file used to display the data.
    Values/Description: 40 codes ranging from BFA BAN BAN to BFA TEN ZAB

    Variable: NEARHF
    Label: Name of nearest Hospital Facility responsible for the patient.
    Values/Description: 136 facilities; (.) missing
    Freq: 256; 6

    Variable: UR
    Label: Unknown code
    Values/Description: 1; 2
    Freq: 36; 226

    Variable: DOB
    Label: Date of birth
    Values/Description: 02/27/1999; (.) missing. Note: Date format is month, day, and 4-digit year.
    Freq: 1; 261

  • 81

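    Variable: AGEYR
    Label: Age of patient (years)
    Values/Description: every value is 3

    Variable: AGEMO
    Label: Age of patient (months)
    Values/Description: Range: 1-4; Mean: 2; Median: 1.5; (.) missing
    Freq: 6; 256

    Variable: SEX
    Label: Gender
    Values/Description: F = female; M = male

    Variable: DNOT
    Label: Date of notification
    Values/Description: Range: 01/18/2001-07/17/2002 (Note: Date format is month, day, and 4-digit year); (.) missing
    Freq: 255; 7

    Variable: DOI
    Label: Date of investigation
    Values/Description: Range: 12/19/2001-07/17/2002 (same format as above); (.) missing
    Freq: 68; 194

    Variable: DONSET
    Label: Date of onset of illness
    Values/Description: Range: 01/17/2001-07/10/2002 (same format as above)
    Freq: 262

    Variable: DIED
    Label: Status of patient: died or alive
    Values/Description: 1 = yes; 2 = no; 9 = unknown; (.) missing
    Freq: 10; 198; 53; 1

    Variable: DOSES
    Label: Number of doses of vaccine
    Values/Description: 0 = not vaccinated; 1 = vaccinated 1 time; 9 = unknown
    Freq: 42; 24; 196

    Variable: DVAC
    Label: Date of last vaccination
    Values/Description: Range: 05/07/1998-03/11/2002; (.) missing
    Freq: 20; 242

    Variable: DCOLL
    Label: Date of sample collection
    Values/Description: Range: 12/19/2001-07/17/2002; (.) missing
    Freq: 56; 206

    Variable: DSENT1
    Label: Date the sample was sent to lab
    Values/Description: Range: 01/06/2002-04/02/2002; (.) missing
    Freq: 6; 256

    Variable: DREC1
    Label: Date the sample was received at the lab
    Values/Description: Range: 01/07/2002-04/15/2002; (.) missing
    Freq: 7; 255

    Variable: DRESULT1
    Label: Date of result received from lab
    Values/Description: Range: 01/22/2002-07/24/2002; (.) missing

    Variable: INDIR
    Label: Result of measles immunoassay test
    Values/Description: 1 = positive; 2 = negative; 3 = indeterminate; (.) missing
    Freq: 38; 10; 2; 212

    Variable: RUBTEST
    Label: Result of rubella test
    Values/Description: 1 = positive; 2 = negative; 3 = indeterminate; (.) missing
    Freq: 1; 11; 1; 249

    Variable: INVESTIGAT
    Label: Name of investigator
    Values/Description: (.) missing
    Freq: 262

    Variable: INVRESULT
    Label: Result of investigation (in French)
    Values/Description: Positive result value in French; (.) missing
    Freq: 249; 13

    Variable: CLASS2
    Label: Case categories 1-5 (meanings are unknown)
    Values/Description: 1; 3; 4; 5
    Freq: 38; 208; 10; 6

    Variable: CLASS
    Label: Case categories 1-5 (meanings are unknown)
    Values/Description: 1; 2; 3; 4; 5; (.) missing
    Freq: 38; 7; 56; 135; 19; 7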

  • 82

    Chemo - Survival Analysis

    These data are from a clinical trial on gastric carcinoma by Stablein et al., involving 95 patients randomized to either chemotherapy alone or a combination of chemotherapy and radiation, in order to assess treatment outcome. The number of days from treatment until death was the outcome variable. This is an example file, called Chemo.dat, in the self-learning text by Kleinbaum.

    Data source: Stablein DM, Carter WH Jr, Novak JW. Analysis of survival data with nonproportional hazard functions. Controlled Clinical Trials. 2(2): 149-59, 1981 Jun.

    File Name: Chemo
    Project: Sample.mdb
    Number of records: 95

    Variable: Rx
    Label: Main predictor of interest. This is the exposure variable, which denotes either chemotherapy alone or a combination of chemotherapy and radiation.
    Values/Description: 1 = chemotherapy alone; 2 = chemotherapy and radiation
    Freq: 47; 48

    Variable: status
    Label: Censored variable. This variable denotes whether or not the patient has developed an event (death).
    Values/Description: 0 = censored; 1 = died
    Freq: 17; 78

    Variable: STime
    Label: Survival time in days from treatment until death. This is the outcome variable (time to an event).
    Values/Description: Range: 1-1519 days; Mean: 529.1368; Median: 401

    Myeloma - Survival Analysis

    These data are based on a study at the Medical Centre of the University of West Virginia, USA, which examined the association between some probable explanatory variables and the survival time of patients. The response variable was the time (in months) from diagnosis until death from multiple myeloma. The data in the table were reported in Krall et al. and relate to 48 patients, aged from 50 to 80 years.

    Reference: Krall, J. M., Uthoff, V. A. and Harley, J. B. (1975). A step-up procedure for selecting variables associated with survival. Biometrics, 31, 49-57.

    File Name: Myeloma
    Project: Sample.mdb
    Number of records: 48

    Variable: PATIENT
    Label: Identification number
    Values/Description: Range: 1-48

    Variable: STIME
    Label: Survival time in months from entry to the study until death. This is the outcome variable (time to an event).
    Values/Description: Range: 1-91; Mean: 23.375; Median: 14.5

    Variable: STATUS
    Label: Censored variable. This variable denotes whether or not the patient has developed an event (died).
    Values/Description: 0 = censored; 1 = died
    Freq: 12; 36

    Variable: AGE
    Label: Age of patients (years)
    Values/Description: Range: 50-77; Mean: 62.8958; Median: 62.5

    Variable: SEX
    Label: Gender
    Values/Description: 1 = male; 2 = female
    Freq: 29; 19

    Variable: BUN
    Label: Blood urea nitrogen (mg%)
    Values/Description: Range: 6-172; Mean: 33.9167; Median: 21

  • 83

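    Variable: CA
    Label: Serum calcium (mg%)
    Values/Description: Range: 8-15; Mean: 9.9375; Median: 10

    Variable: HB
    Label: Hemoglobin (mg%)
    Values/Description: Range: 4.9-14.6; Mean: 10.2521; Median: 10.2

    Variable: PC
    Label: Percentage of plasma cells in the bone marrow (%)
    Values/Description: Range: 3-100; Mean: 42.9375; Median: 33

    Variable: BJ
    Label: Presence of Bence-Jones protein in the urine
    Values/Description: Yes; No
    Freq: 15; 33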

    Stanford - Survival Analysis

    These data are based on a Stanford heart transplant study by Kalbfleisch et al., involving 249 patients who either received a transplant or did not, with varying waiting times before the transplant. The study was conducted to assess the effect of different patient attributes on survival time among patients who received transplants, as well as to compare survival time between patients with heart transplants and those without. The survival time, a combination of pre-transplant survival time and post-transplant survival time (if any), was the outcome variable. This is an ideal example for the extended Cox model, which takes into account the different pre-transplant survival times (waiting times), because patients change treatment status during the course of the study. The data file, called Stanf.dat, can be found in Survival Analysis: A Self-Learning Text by Kleinbaum.

    Data source: Kalbfleisch, J and Prentice, R. The statistical analysis of failure time data. John Wiley and Sons, New York, 1980.

    File Name: Stanford
    Project: Sample.mdb
    Number of records: 249

    Variable: PRE_TRANSPLANT_SURVIVAL_TIME
    Label: Survival time from entry to the study until death before the transplant or until the transplant.
    Values/Description: Range: 0-340 days; Mean: 40.7068; Median: 26

    Variable: STATUS
    Label: Censored variable 1. This variable denotes whether or not the patient has died at the first end-point (the time of transplant).
    Values/Description: 0 = censored; 1 = died; (.) = missing
    Freq: 193; 55; 1

    Variable: POSTTRANSPLANT_SURVIVAL_TIME
    Label: Survival period from the time of transplant until death or until the patient is censored.
    Values/Description: Range: 0-3694 days; Mean: 696.9348; Median: 351; (.) = missing
    Freq: 184; 65

    Variable: STATUS_AT_SECOND_ENDPOINT
    Label: Censored variable 2. This variable denotes whether or not the patient has died at the time of the second end-point (Feb 1980).
    Values/Description: 0 = censored; 1 = died; (.) = missing
    Freq: 65; 119; 65

    Variable: AGE
    Label: Age of patient at the time of transplant
    Values/Description: Range: 12-64 years; Mean: 41.0924; Median: 44; (.) = missing
    Freq: 184; 65

    Variable: TISSUE_MISMATCH_SCORE
    Label: Tissue mismatch score
    Values/Description: Range: 0-3.05; Mean: 1.1166; Median: 1.04; (.) = missing
    Freq: 157; 92

  • 84

    Vets - Survival Analysis

    These data are from the Veterans Administration lung cancer trial among 137 patients with pulmonary carcinoma, comparing the effectiveness of a test treatment with that of the standard treatment. The survival time in days until death was the outcome variable. These data were originally provided by Kalbfleisch et al. and are used as an example data file, called Anderson.dat, in Survival Analysis: A Self-Learning Text by Kleinbaum.

    Data source: Kalbfleisch, J and Prentice, R. The statistical analysis of failure time data. John Wiley and Sons, New York, 1980.

    File Name: Vets
    Project: Sample.mdb
    Number of records: 99

    Variable: treatment
    Label: Main predictor of interest. This is the exposure variable, which assigns the study subjects to test or standard treatment.
    Values/Description: 1 = standard; 2 = test
    Freq: 69; 30

    Variable: cell_type_1
    Label: Cancer cell type - large cell
    Values/Description: 0 = other; 1 = large cell
    Freq: 84; 15

    Variable: cell_type_2
    Label: Cancer cell type - adeno cell
    Values/Description: 0 = other; 1 = adeno cell; (.) = missing
    Freq: 89; 9; 1

    Variable: cell_type_3
    Label: Cancer cell type - small cell
    Values/Description: 1 = small cell; 0 = other
    Freq: 59; 40

    Variable: cell_type_4
    Label: Cancer cell type - squamous cell
    Values/Description: 1 = squamous cell; 0 = other
    Freq: 64; 35

    Variable: STime
    Label: Survival time in days until death. This is the outcome variable (time to an event).
    Values/Description: Range: 1-999 days; Mean: 136.8889; Median: 95

    Variable: performance_status
    Label: Performance status (0 = worst, ..., 100 = best)
    Values/Description: Range: 20-90; Mean: 9.0202; Median: 6

    Variable: disease_duration
    Label: Disease duration (months from diagnosis)
    Values/Description: Range: 1-58 months; Mean: 404.6555; Median: 367.5

    Variable: age
    Label: Age of patients (years)
    Values/Description: Range: 34-81; Mean: 58.4343; Median: 60

    Variable: prior_therapy
    Label: History of prior therapy
    Values/Description: 0 = none; 10 = some
    Freq: 68; 31

    Variable: status
    Label: Censored variable. This variable denotes whether or not the patient has died.
    Values/Description: 0 = censored; 1 = death
    Freq: 8; 91

  • 85

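    ViewADDFULL - Attention deficit disorder

    Note: we were not able to find more details on this data file. A label shown as ? is unknown.

    File Name: ViewADDFULL
    Project: Sample.mdb
    Number of records: 359

    Variable: GENDER
    Label: Gender of patient
    Values/Description: 1 = female??; 2 = male??
    Freq: 198; 161

    Variable: REPEAT
    Label: ?
    Values/Description: 0 = no history of repetition; 1 = history of repetition; (.) missing
    Freq: 324; 34; 1

    Variable: ENGL
    Label: ?
    Values/Description: 1; 2; 3; (.) missing
    Freq: 40; 254; 46; 19

    Variable: ENGG
    Label: ?
    Values/Description: 0; 1; 2; 3; 4; (.) missing
    Freq: 11; 37; 122; 135; 41; 13

    Variable: OLMAT
    Label: ?
    Values/Description: Range: 55-137; Mean: 102.7333; Median: 103; (.) missing
    Freq: 210; 149

    Variable: KF
    Label: ?
    Values/Description: Range: 75-129; Mean: 104.8444; Median: 105; (.) missing
    Freq: 90; 269

    Variable: GPA
    Label: ?
    Values/Description: Range: 0-4; Mean: 2.3797; Median: 2.5; (.) missing
    Freq: 347; 12

    Variable: SOCPROB
    Label: ?
    Values/Description: 0; 1; (.) missing
    Freq: 304; 44; 11

    Variable: SCORE2
    Label: ?
    Values/Description: Range: 25-90; Mean: 53.3287; Median: 52

    Variable: SCORE4
    Label: ?
    Values/Description: Range: 22-90; Mean: 52.8936; Median: 53; (.) missing
    Freq: 357; 2

    Variable: SCORE5
    Label: ?
    Values/Description: Range: 22-87; Mean: 53.2696; Median: 52; (.) missing
    Freq: 319; 40

    Variable: DROPOUT
    Label: ?
    Values/Description: 0 = no history of dropout; 1 = history of dropout; (.) missing
    Freq: 297; 46; 16

    Variable: ADDSC
    Label: ?
    Values/Description: Range: 24.6667-80; Mean: 53.1068; Median: 53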

  • 86

    Variable: IQ
    Label: ?
    Values/Description: Range: 55-137; Mean: 102.3712; Median: 103; (.) missing
    Freq: 233; 126

    viewAgeWithCount

    File name: viewAgeWithCount
    Project: Sample.mdb
    Number of records: 16
    Number of observations: 85

    Variable: RecordNumber
    Values/Description: Range: 1-10

    Variable: Age
    Values/Description: Range: 1-10

    Variable: Count
    Values/Description: Range: 1-20

    viewBabyBloodPressure - Hypertension in Infants

    In these data, birth weight and systolic blood pressure were measured in 16 infants. Systolic blood pressure is the dependent variable, and birth weight and age of the infant are independent variables.

    Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.

    File name: viewBabyBloodPressure
    Project: Sample.mdb
    Number of records: 16

    Variable: Birthweight
    Label: Birth weight of infant (in ounces); an independent variable
    Values/Description: Range: 90-160; Mean: 120.31; SD: 18.75

    Variable: AgeInDays
    Label: Age in days; an independent variable
    Values/Description: Range: 2-5; Mean: 3.31; SD: 0.95

    Variable: SystolicBlood
    Label: Systolic blood pressure (mm Hg); the dependent variable
    Values/Description: Range: 77-98; Mean: 88.06; SD: 6.69

    viewEpi1 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method

    These data are based on a 30-cluster survey using the Expanded Program on Immunization (EPI) methodology. Using this methodology, 30 communities (i.e., clusters) are selected from a listing of all communities in a geographic area using the proportional to population size (PPS) sampling technique. The PPS methodology is self-weighted, i.e., statistical weights are not necessary when analyzing the data. Survey teams visit each cluster and, using one of several sampling techniques, visit households to identify seven children in the appropriate age range and assess their immunization status. The EPI survey is frequently referred to as a 30x7 cluster design, i.e., 30 clusters, each with 7 children.

    File name: viewEpi1
    Project: Sample.mdb
    Number of records: 210

    Variable: CLUSTER
    Label: A variable specifying the cluster in which the individual lived.
    Values/Description: Range: 1-30

    Variable: PRENATAL
    Label: A question concerning whether or not the mother had received prenatal care for the child being assessed.
    Values/Description: 1 = received prenatal care; 2 = no prenatal care
    Freq: 87; 123

    Variable: VAC
    Label: Whether the child was vaccinated.
    Values/Description: 1 = vaccinated; 2 = not vaccinated
    Freq: 155; 55

  • 87

    viewEpi10 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method with 10 strata

    The viewEpi10 file is an example of a country performing an EPI survey in each of its 10 provinces, i.e., 10 separate EPI surveys were carried out, one in each province. This is considered a stratified cluster survey. The viewEpi10 data have the same variables as viewEpi1 plus two additional variables: a numeric code identifying the province in which the child lived (LOCATION) and a weight that accounts for the differences in population size among the provinces (POPW). To calculate national estimates, it is important to take into account the population size of each province. The weighting scheme is presented in Table A1; each weight is calculated as the population size of the location divided by the number sampled in that location. In Location 1, each child sampled represents 43.87 children; in Location 8, each child sampled represents 853.02 children. Please note that there are other methods for weighting data than the one presented here.

    Table A1. Population weights for children in each location

    Location   Population   Sample     POPW
    1               9,870      225    43.87
    2              33,600      219   153.42
    3              14,130      212    66.65
    4              27,900      219   127.40
    5              12,750      212    60.14
    6              15,810      214    73.88
    7              16,050      210    76.43
    8             180,840      212   853.02
    9               9,030      217    41.61
    10             25,650      212   120.99
    Total         345,630    2,152

    POPW = Population/Sample

    File name: viewEpi10
    Project: Sample.mdb
    Number of records: 2152

    Variable   Label                                      Values/Description                                 Freq
    LOCATION   Code for the 10 strata                     Range: 1-10
    POPW       Statistical weight used to obtain          Range: 41.61-853.02
               unbiased national estimates, taking
               into account strata population sizes
    CLUSTER    Cluster number                             Range: 1-30
    PRENATAL   Whether the mother received prenatal       1 = received prenatal care; 2 = no prenatal care   1088; 1064
               care for the child being assessed
    VAC        Whether the child was vaccinated           1 = vaccinated; 2 = not vaccinated                 1242; 910
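As a cross-check on Table A1, the weights can be recomputed from the population and sample sizes. The short Python sketch below is an illustration only and is not part of Epi Info; the names `TABLE_A1` and `population_weight` are made up for this example.

```python
# Population and sample sizes for each location, copied from Table A1.
TABLE_A1 = {
    1: (9870, 225),
    2: (33600, 219),
    3: (14130, 212),
    4: (27900, 219),
    5: (12750, 212),
    6: (15810, 214),
    7: (16050, 210),
    8: (180840, 212),
    9: (9030, 217),
    10: (25650, 212),
}


def population_weight(population, sample):
    """Number of children in the population represented by each sampled child."""
    return population / sample


# POPW = Population/Sample, rounded to two decimals as in Table A1.
weights = {
    location: round(population_weight(population, sample), 2)
    for location, (population, sample) in TABLE_A1.items()
}

total_population = sum(population for population, _ in TABLE_A1.values())
total_sample = sum(sample for _, sample in TABLE_A1.values())

for location, weight in weights.items():
    print(f"Location {location:2d}: POPW = {weight:.2f}")
print(f"Total population = {total_population}, total sample = {total_sample}")
```

The recomputed weights and totals agree with the values listed in Table A1.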

    viewEstriolandBirthweight - Estriol and Birth Weight Data

    These data, from Greene and Touchstone, are used as an example in the text by Rosner to study the relationship between the estriol level of pregnant women and the birth weight of their infants.

    Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.

    File name: viewEstriolandBirthweight
    Project: Sample.mdb
    Number of records: 31

    Variable      Label                                        Values/Description                     Freq
    ESTRIOL       Estriol level of pregnant woman (mg/24 hr)   Range: 7-27; Mean: 17.23; SD: 4.75
    BIRTHWEIGHT   Birth weight of infant (g/100)               Range: 24-43; Mean: 32.0; SD: 4.74


    viewEvansCounty - Evans County Heart Disease Study Data

    The data are based on the Evans County heart disease cohort study on the seven-year incidence of coronary heart disease in 609 white males. The variable CAT (endogenous catecholamine level) was fabricated for illustrative purposes and dichotomized into categories "high" (top quintile of cohort values) and "low." There are no missing values in this dataset. Thanks to Dr. David Kleinbaum for making the data available.

    Reference: Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. Lifetime Learning Publications, Belmont, California, 1982.

    File name: viewEvansCounty
    Project: Sample.mdb
    Number of records: 609

    Variable   Label                                  Values/Description                            Freq
    ID         Identification Number                  Range: 21-19161
    CHD        Coronary Heart Disease                 No = not a case; Yes = case                   538; 71
    AGE        Age (years)                            Range: 40-76; Mean: 53.71; SD: 9.26
    CAT        Catecholamine Level                    No = low; Yes = high                          487; 122
    CHL        Serum Cholesterol (mg/100 mL)          Range: 94-357; Mean: 211.74; SD: 39.83
    DBP        Diastolic Blood Pressure (mmHg)        Range: 60-170; Mean: 91.18; SD: 14.50
    ECG        Electrocardiogram                      No = normal ECG; Yes = abnormal ECG           443; 166
    HEM        Hematocrit (percent)                   Range: 29-58; Mean: 46.26; SD: 3.47
    MAR        Marital Status                         No = not married; Yes = married               64; 545
    OCC        Occupation                             1 = ?; 2 = ?                                  365; 244
    PLS        Pulse (beats/min)                      Range: 45-120; Mean: 74.59; SD: 12.67
    QTI        Quetelet Index*                        Range: 2.121-6.041; Mean: 3.62; SD: 0.59
    SBP        Systolic Blood Pressure (mmHg)         Range: 92-300; Mean: 145.48; SD: 27.50
    SES        Socioeconomic Status                   Range: 20-84; Mean: 57.86; SD: 13.62
               (McGuire-White index)
    SMK        Cigarette Smoking                      No = never smoked; Yes = smoker               222; 387
    AGEG1      Age Group 1 (Years)                    No = LT 55; Yes = GE 55                       358; 251
    AGEG2      Age Group 2 (Years)                    1 = 40-44; 2 = 45-49; 3 = 50-54; 4 = 55-59;   109; 138; 111; 92;
                                                      5 = 60-64; 6 = 65-69; 7 = 70+                 63; 52; 44
    CHLG       Cholesterol Group                      No = LT 250; Yes = GE 250                     504; 105
    QTIG       QTI Group                              No = LT 3.57; Yes = GE 3.57                   306; 303
    SESG       SES Group                              No = GE 57; Yes = LT 57                       330; 279

    Hypertension HPT No = SBP94

    354 255

    GE = greater than or equal to; LT = less than
    *Quetelet Index = 100 x (weight in pounds)/(height in inches)^2

    viewhmohiv - Survival Analysis

    These data are provided courtesy of the Epi Info development team, Epidemiology Program Office, CDC.

    File name: viewhmohiv
    Project: Sample.mdb
    Number of records: 100

    Variable   Label                                    Values/Description                       Freq
    ID         Identification of patient                Range: 1-100
    TIME1      Survival time                            Range: 1-60; Mean: 11.36; Median: 5
    AGE        Age                                      Range: 20-54; Mean: 36.07; Median: 35
    DRUG       Exposure                                 0 = placebo; 1 = treatment               51; 49
    CENSOR                                              0 = censored; 1 = event                  20; 80
    ENTDATE    Date the patient first entered           Range: 1-12-1989 to 12-27-1991
               the study                                Format: mm-dd-yyyy
    ENDDATE    Date the patient was last observed       Range: 2-15-1989 to 11-13-1995
                                                        Format: mm-dd-yyyy

    viewLasum - Estrogen and Endometrial Cancer Matched Case-Control Study (weighted analysis)

    These data come from a Los Angeles study of 315 participants designed to determine whether the use of exogenous estrogen is related to endometrial cancer. The study design is a matched case-control study in which each of the 63 cases with endometrial cancer is matched to four control women who were born within one year of the case, had the same marital status, and lived in the same retirement community for the same length of time. Please note that the data set is in summary file format: individual records with similar characteristics were summarized into 25 groups. This study can be used as an example for conditional logistic regression analysis, taking into account the count (frequency) variable.

    Reference: Breslow and Day. Statistical Methods in Cancer Research: Volume 1 - The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer, 1980.

    File name: viewLasum.dat
    Project: Sample.mdb
    Number of summary records: 25
    Number of observations: 315

    Variable   Label                                  Values/Description                          Freq
    OBS        Obesity                                0 = not obese; 1 = obese; (.) = missing     97; 167; 51
    DOS        Estrogen conjugated dose (mg/day):     0 = none; 1 = 0.1-0.299; 2 = 0.3-0.625;     8; 155; 61; 56; 35
               an exposure variable                   3 = 0.626+; (.) = unknown
    OUTCOME    Disease outcome: a dependent           0 = no; 1 = yes                             252; 63
               variable
    COUNT      A weight variable: summary number      Range: 1-61
               of records

    viewLeukem2 - Survival Analysis

    This is a clinical trial of remission times (in weeks) among 42 leukemia patients, comparing the effect of the drug 6-mercaptopurine (6-MP) with placebo. The duration of the relapse-free period after treatment or placebo was the outcome variable. Please note that these data are the same as Anderson (mentioned earlier), but the covariates sex and logwbc have been omitted.

    File name: viewLeukem2
    Project: Sample.mdb
    Number of records: 42

    Variable   Label                                          Values/Description                   Freq
    ID         Identification of patient                      Range: 1-42
    Rx         Main predictor of interest: the exposure       placebo; 6-MP                        21; 21
               variable (6-mercaptopurine vs. placebo),
               randomly assigned to the patients
    status     Censoring variable: denotes whether the        0 = censored; 1 = relapsed           12; 30
               patient developed an event (exit from
               clinic)
    Stime      Survival time in weeks until relapse: the      Range: 1-35 weeks;
               outcome variable (time to an event)            Mean: 12.8810; Median: 10.5

    viewOswego - Oswego Classical Study of Disease Outbreak Investigation

    These data are based on a classical study of an outbreak of acute gastrointestinal illness in the village of Lycoming, Oswego County, New York, reported to the District Health Officer in Syracuse on April 19, 1940. It was learned that all persons known to be ill had attended a church supper the previous evening, April 18. Accordingly, the goal of the study was to find which food or foods caused the outbreak. The outcome variable is disease (yes/no). Possible risk factors (predictor variables) are the foods and drinks consumed. Interviews regarding the presence of symptoms, including the day and hour of onset, and the food consumed at the church supper were completed for 75 of the 80 persons known to have been present. A total of 46 persons who had experienced gastrointestinal illness were identified.

    Reference: The data and information for this outbreak are derived from an educational program developed by the CDC in Atlanta and were provided by Dr. A.M. Rubin, then an epidemiologist-in-training, who actually conducted the investigation.


    File name: viewOswego
    Project: Sample.mdb
    Number of records: 75

    Variable     Label                                  Values/Description                                Freq
    AGE          Age of patient (years)                 Range: 3-77; Mean: 36.8133; Median: 36
    SEX          Gender                                 Female; Male                                      44; 31
    ILL          Outcome variable: diarrheal illness    Yes; No                                           46; 29
    BAKEDHAM     Food items                             Yes; No                                           46; 29
    SPINACH      Food items                             Yes; No                                           43; 32
    MASHEDPOTA   Food items                             Yes; No; (.)                                      37; 37; 1
    CABBAGESAL   Food items                             Yes; No                                           28; 47
    JELLO        Food items                             Yes; No                                           23; 52
    ROLLS        Food items                             Yes; No                                           37; 38
    BROWNBREAD   Food items                             Yes; No                                           27; 48
    FRUITSALAD   Food items                             Yes; No                                           6; 69
    MILK         Beverages                              Yes; No                                           4; 71
    COFFEE       Beverages                              Yes; No                                           31; 44
    WATER        Beverages                              Yes; No                                           24; 51
    CAKE         Desserts                               Yes; No                                           40; 35
    VANILLA      Desserts                               Yes; No                                           54; 21
    CHOCOLATE    Desserts                               Yes; No; (.)                                      47; 27; 1
    DATEONSET    Date of onset of illness               04-18-1940; 3pm - 04-19-1940; 10:30am
                 (mm-dd-yyyy, time)
    TIMESUPPER   Date of supper (mm-dd-yyyy, time)      04-18-1940; 12am - 04-18-1940; 10pm
    NAME         Name code of patient                   Range: patient1-patient75
    CODE_RW      Identification number                  Range: P1-P75

    (.) = missing value

    viewRely - Rely Tampons and Toxic Shock Syndrome Matched Case-Control Data

    This is an example of a matched case-control data set in which cases (women who were diagnosed with toxic shock syndrome) were each matched to four controls. The specifics of the matching are not provided, but it was probably based on age and geographic location. As mentioned in the Match command section, the ID is repeated five times: once for the case and once for each of the four matched controls.


    File name: viewRely
    Project: Sample.mdb
    Number of records: 56

    Variable   Label                                         Values/Description                     Freq
    ID         Identification number that links each         Range: 1-14
               case with her individually matched
               controls
    CASE       Case of toxic shock syndrome? Outcome         No = control; Yes = case               42; 14
               variable that divides the study group
               into cases and controls
    RELY       Use of Rely tampons? Exposure variable        No = did not use; Yes = did use        32; 24
               that separates the group into exposed
               and not exposed

    viewSmoke - A Telephone Survey With Multistage Stratified Cluster Design

    These data are based on a random digit telephone survey of adults (18 years of age and older) in a state, using a stratified three-stage design. Clusters are defined as sets of telephone numbers that share the same first eight digits of a 10-digit telephone number. Separately for each county, a with-replacement sample of clusters is randomly chosen with probabilities proportional to size (PPS), where size is the number of residential telephone numbers. Next, a random sample of three participating households is selected in each cluster. Finally, an interview is completed with one adult who is chosen at random within each participating household. This would be considered a stratified three-stage sample, with clusters of telephone numbers as primary sampling units (PSUs), primary stratification by county, residential phone numbers as the second stage, and the random selection of one adult in the household as the third stage (see Table A2).

    Table A2. Stages used in telephone survey

    Stage   List Used                             Sampling Method
    One     8-digit telephone number clusters     Random PPS within 8-digit clusters
            by county                             (stratified by county)
    Two     Clusters from Stage One               Three random households per cluster
    Three   Households from Stage Two             One adult selected at random from
                                                  participating households

    File name: viewSmoke
    Project: Sample.mdb
    Number of records: 337

    Variable   Label                                   Values/Description*                               Freq
    PSUID      Primary Sampling Unit (PSU) ID number   Range: 15-1310
    DATE       Date of interview                       Range: 010190-032490
                                                       Note: a character field; dates are month,
                                                       day, and 2-digit year
    INTERID    Interviewer's initials
    SMOKE      Do you smoke now?                       1 = Yes; 2 = No                                   83; 254
    NUMCIGAR   Number of cigarettes smoked per day     Range: 2-40; n: 82; Mean: 17.256; SE: 0.972
                                                       Note: question asked of cigarette smokers only
    AGE        Age of participant in years             Range: 9-96; Mean: 43.818; SE: 1.053
                                                       Note: the value of 9 appears to be an error,
                                                       since the survey was limited to adults
    RACE       Race of participant                     1 = White; 2 = Black; 5 = Other                   289; 47; 1
    MARITAL    Marital status                          1 = Married; 2 = Divorced; 3 = Widowed;           184; 45; 48;
                                                       4 = Separated; 5 = Never married; 9 = Refused     6; 52; 2
    WEIGHT     Weight (without shoes) in pounds        Range: 88-285
                                                       Also 777 = don't know; 999 = refused
    HEIGHT     Height (without shoes) in feet          Range: 410-607
               and inches                              Also 777 = don't know; 999 = refused
                                                       Note: 3-digit numeric field; 1st digit = height
                                                       in feet; next 2 digits = height in inches
    SEX        Sex of participant                      1 = Male; 2 = Female                              122; 215
    SAMPW      Sample weight                           Range: 47100152.009-47113103.03
    STRATA     Stratum                                 1 = County A; 2 = County B; 3 = County C          113; 112; 112

    *Note that mean and standard error (SE) estimates take into account the complex survey design and statistical weighting.

    viewFamily - Merging/Relating files

    This Data table is provided with the Epi Info software in the dataset named Refugee.MDB. It contains information concerning refugee families that have arrived in the United States (e.g., the language they speak or their country of origin).

    File name: viewFamily
    Project: Refugee.MDB
    Number of records: 539

    Variable              Label                     Values/Description                       Freq
    APARTMENT             Apartment
    CITY                  City:
    Contact Information   Contact Information
    COUNTRY               Country of Origin:
    COUNTY                County:
    DTOFARR               Date of Arrival:
    ENTRY                 Port of Entry in USA:     AL; CA; CALIFORNIA; CHICAGO; FL; IL;     2; 1; 67; 53; 1; 9;
                                                    LA; LOS ANGELES; MIAMI; NEW YORK; NY     1; 6; 1; 248; 123
    FAMHMPH               Family Home Phone:
    FAMIDNUM              Family Id Number          0-539
    HOUSEHOLD             household
    INTERPRETE            Interpreter code
    LANG                  Language spoken
    SPONSOR               Sponsor:
    STATE                 State:
    STREET                Street:
    ZIPCODE               Zip Code:

    NB: Descriptions of the individual variables were not available.

    viewPatient - Merging/Relating files


    This Data table is provided with the Epi Info software in the dataset named Refugee.MDB.

    File name: viewPatient
    Project: Refugee.MDB
    Number of records: 18000

    Variable     Label                          Values/Description                          Freq
    TODAYDATE    Date of record entry
    FAMIDNUM     Family ID Number               No: 1 to 546
    BOHID        BOH ID NUMBER:
    BOH          BOH Re-entry
    ALIENNUM2    Alien Number2:
    ALIENNUMBE   Alien Number:
    LASTNAME     Last Name
    FIRSTNAME    First Name:
    HEAD         Head of Household:             Yes; No; Missing                            434; 1325; 16241
    RELATION     Relationship with the          Missing; 0; 1; 2; 3; 4; 5;                  16342; 435; 8; 14; 209; 25; 457;
                 household head                 6; 7; 8; 9; 10; 11; 13                      338; 2; 4; 11; 30; 1; 124
    DOB          Date of Birth:
    AGE          Age in years:                  Range: 0-80 yrs; Mean: 24.1;
                                                Median: 21 (n=1751)
    SEX          Sex:                           Missing; F; M                               16232; 832; 936
    RACE         Race:                          Missing; A; B; H; O; White                  16232; 195; 727; 3; 3; 840
    I94STATUS    I-94 Status:                   Missing; 1; 2; 3                            16229; 1767; 3; 1
    RESETTL      Previous Resettlement:         No; Missing                                 1772; 16228
    FROM         From:                          Missing                                     18000
    CLASS        Health classification          Missing; B; B1; B2; O                       16285; 429; 21; 50; 1215

    NB: Actual data were available for only 546 families (based on FAMIDNUM); the remaining records have missing values in all variables except the last and first name of a refugee.


    Appendix 2. Operators/Functions - for use in arithmetic and logical expressions

    Below is a partial listing of operators and functions.

    Arithmetic
    +    addition
    -    subtraction
    *    multiplication
    /    division
    ^    exponentiation (use ^0.5 for square root)

    Comparison
    >    greater than
    <    less than
    >=   greater than or equal to



    Appendix 3. Answers to Exercises

    Answers Exercise 1

    1. Mean of HEM using Means command:

    Obs   Total        Mean      Variance   Std Dev
    609   28173.0000   46.2611   12.0584    3.4725

    Minimum   25%       Median    75%       Maximum   Mode
    29.0000   44.0000   46.0000   48.0000   58.0000   46.0000

    2. Appear to be normally distributed? Use the Graph module and make a histogram, bar, or line chart with HEM as the X-axis. The data appear to be somewhat normally distributed. While there are statistical tests of whether a variable is normally distributed, Epi Info does not perform them.

    3. Descriptive Statistics for Each Value of Crosstab Variable

          Obs   Total        Mean      Variance   Std Dev
    Yes   251   11459.0000   45.6534   13.1954    3.6325
    No    358   16714.0000   46.6872   10.8542    3.2946

    ANOVA, a Parametric Test for Inequality of Population Means
    (For normally distributed data only)

    Variation   SS          df    MS         F statistic
    Between     157.6822    1     157.6822   13.3420
    Within      7173.8055   607   11.8185
    Total       7331.4877   608

    T Statistic = 3.652   P-value = 0.0003

    Bartlett's Test for Inequality of Population Variances
    Bartlett's chi square = 2.8276   df = 1   P value = 0.0927

    A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.

    Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
    Kruskal-Wallis H (equivalent to Chi square) = 14.7051
    Degrees of freedom = 1
    P value = 0.0001


    Are the variances approximately equal? Yes, Bartlett's test has a p-value of 0.09, so we can assume approximately equal variances. Therefore, we can use the t-test p-value of 0.0003 and state that there is a statistically significant difference in mean hematocrit between younger and older adults, with older adults (55 years and older) having a slightly lower mean hematocrit (45.65 vs. 46.69).
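The ANOVA table for question 3 can be rebuilt from the per-group summary statistics alone. The Python sketch below is an illustration and not Epi Info code; the function name `one_way_anova` is made up for this example. It also shows that, with only two groups, the F statistic is the square of the pooled t statistic.

```python
import math

# Summary statistics for HEM by the crosstab variable, copied from the
# Means output for question 3: group -> (observations, mean, variance).
GROUPS = {
    "Yes": (251, 45.6534, 13.1954),
    "No": (358, 46.6872, 10.8542),
}


def one_way_anova(groups):
    """Return (SS between, SS within, F statistic) from per-group n, mean, variance."""
    total_n = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
    ss_within = sum((n - 1) * variance for n, _, variance in groups.values())
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_statistic


ss_between, ss_within, f_statistic = one_way_anova(GROUPS)
# With two groups, F equals the square of the pooled-variance t statistic.
t_statistic = math.sqrt(f_statistic)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}")
print(f"F = {f_statistic:.3f}, t = {t_statistic:.3f}")
```

Because the inputs are rounded to four decimals, the results agree with the Epi Info output (F = 13.3420, T = 3.652) only to about two or three decimal places.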

    4. Mean is 57.86.

    Obs   Total        Mean      Variance   Std Dev
    609   35234.0000   57.8555   185.5712   13.6225

    Minimum   25%       Median    75%       Maximum   Mode
    20.0000   49.0000   57.0000   71.0000   84.0000   71.0000

    5. Using the Graph module, make a histogram, bar, or line chart. The data do not seem to be normally distributed.

    6. Descriptive Statistics for Each Value of Crosstab Variable

        Obs   Total       Mean      Variance   Std Dev
    1   109   6186.0000   56.7523   206.8733   14.3831
    2   138   7879.0000   57.0942   193.9254   13.9257
    3   111   6479.0000   58.3694   165.2896   12.8565
    4   92    5530.0000   60.1087   162.4935   12.7473
    5   63    3661.0000   58.1111   180.1326   13.4213
    6   52    2822.0000   54.2692   198.8673   14.1020
    7   44    2677.0000   60.8409   182.8811   13.5234

        Minimum   25%       Median    75%       Maximum   Mode
    1   20.0000   48.0000   57.0000   68.0000   84.0000   71.0000
    2   20.0000   47.0000   57.0000   71.0000   81.0000   71.0000
    3   26.0000   51.0000   57.0000   71.0000   84.0000   57.0000
    4   32.0000   51.0000   59.0000   71.0000   84.0000   57.0000
    5   34.0000   49.0000   55.0000   72.0000   84.0000   54.0000
    6   20.0000   44.5000   54.0000   62.5000   84.0000   54.0000
    7   38.0000   51.0000   57.0000   71.5000   84.0000   54.0000

    ANOVA, a Parametric Test for Inequality of Population Means
    (For normally distributed data only)

    Variation   SS            df    MS         F statistic
    Between     1774.0885     6     295.6814   1.6028
    Within      111053.1955   602   184.4737
    Total       112827.2841   608

    P-value = 0.1438

    Bartlett's Test for Inequality of Population Variances
    Bartlett's chi square = 2.4070   df = 6   P value = 0.8787

    A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.

    Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
    Kruskal-Wallis H (equivalent to Chi square) = 8.3535
    Degrees of freedom = 6
    P value = 0.2133

    The data do not seem to be normally distributed, so it might be better to use the Kruskal-Wallis test. Conclusion: there is no statistically significant difference in SES score among the age groups.

    7. OR=1.21, RR=1.18; no statistically significant association.

    Single Table Analysis
                                     Point      95% Confidence Interval
                                     Estimate   Lower      Upper
    PARAMETERS: Odds-based
    Odds Ratio (cross product)       1.2065     0.6448     2.2576 (T)
    Odds Ratio (MLE)                 1.2061     0.6252     2.2224 (M)
                                                0.5945     2.3087 (F)
    PARAMETERS: Risk-based
    Risk Ratio (RR)                  1.1789     0.6833     2.0343 (T)
    Risk Difference (RD%)            2.0238     -5.0418    9.0895 (T)

    (T=Taylor series; C=Cornfield; M=Mid-P; F=Fisher Exact)

    STATISTICAL TESTS                Chi-square   1-tailed p     2-tailed p
    Chi square - uncorrected         0.3456                      0.5566322582
    Chi square - Mantel-Haenszel     0.3450                      0.5569564191
    Chi square - corrected (Yates)   0.1770                      0.6739617631
    Mid-p exact                                   0.2751234372
    Fisher exact                                  0.3287296811

                CHD
    CHLG      Yes      No       TOTAL
    Yes       14       91       105
    Row %     13.3     86.7     100.0
    Col %     19.7     16.9     17.2
    No        57       447      504
    Row %     11.3     88.7     100.0
    Col %     80.3     83.1     82.8
    TOTAL     71       538      609
    Row %     11.7     88.3     100.0
    Col %     100.0    100.0    100.0
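The point estimates and Taylor series confidence limits in the output for question 7 can be reproduced by hand from the four cell counts. The Python sketch below is an illustration and not Epi Info code; the function names are made up for this example, and 1.96 is used as the normal critical value for a 95% interval.

```python
import math

# Cell counts from the CHLG by CHD table for question 7.
a, b = 14, 91    # CHLG = Yes: CHD Yes, CHD No
c, d = 57, 447   # CHLG = No:  CHD Yes, CHD No
Z_95 = 1.96      # normal critical value for a 95% confidence interval


def odds_ratio_with_ci(a, b, c, d, z=Z_95):
    """Cross-product odds ratio with Taylor series (log-based) confidence limits."""
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


def risk_ratio_with_ci(a, b, c, d, z=Z_95):
    """Risk ratio (exposed vs. unexposed) with Taylor series confidence limits."""
    estimate = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


odds_ratio, or_lower, or_upper = odds_ratio_with_ci(a, b, c, d)
risk_ratio, rr_lower, rr_upper = risk_ratio_with_ci(a, b, c, d)

print(f"OR = {odds_ratio:.4f} ({or_lower:.4f}, {or_upper:.4f})")
print(f"RR = {risk_ratio:.4f} ({rr_lower:.4f}, {rr_upper:.4f})")
```

Both intervals include 1, which is consistent with the conclusion that there is no statistically significant association.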


    8.

    Third Variable   Interaction p-value   Crude OR(1)   Adjusted OR(2)   Conclusion?(3)
    ECG              0.42                  2.9           2.4              Confounding
    MAR              0.46                  2.9           2.8              Neither
    SMK              0.46                  2.9           2.9              Neither
    AGEG1            0.81                  2.9           2.2              Confounding
    QTIG             0.07                  2.9           2.9              Neither
    HPT


    The mean CHD_index is 6.5224.

    Obs   Total       Mean     Variance   Std Dev
    609   3972.1560   6.5224   6.5602     2.5613

    4. Do those who developed CHD have a significantly higher or lower mean CHD_index compared to those who did not develop CHD? Assuming a normal distribution, we would conclude that there is no statistically significant difference in mean CHD_index between those with and without CHD.

    Descriptive Statistics for Each Value of Crosstab Variable

          Obs   Total       Mean     Variance   Std Dev
    Yes   71    453.0072    6.3804   4.7808     2.1865
    No    538   3519.1488   6.5412   6.8013     2.6079

          Minimum   25%      Median   75%      Maximum   Mode
    Yes   2.8617    4.6229   6.3317   7.4880   14.2804   2.8617
    No    2.5707    4.8616   6.0801   7.6534   28.9549   5.0062

    ANOVA, a Parametric Test for Inequality of Population Means
    (For normally distributed data only)

    Variation   SS          df    MS       F statistic
    Between     1.6215      1     1.6215   0.2469
    Within      3986.9572   607   6.5683
    Total       3988.5787   608

    T Statistic = 0.4969, P-value = 0.6195

    Bartlett's Test for Inequality of Population Variances
    Bartlett's chi square = 3.4989   df = 1   P value = 0.0614

    5. First, Define the variable agegroup; next, use the Recode command as follows: on the first Recode dialog box, click on Fill Ranges to get to the screen below; provide the Start, End, and By values:


    Click OK to see the Recode dialog box with the ranges completed:

    To determine the number in each group, use the Frequencies command:

    agegroup    Frequency   Percent   Cum Percent
    >39 - 59    450         73.9%     73.9%
    >59 - 79    159         26.1%     100.0%
    Total       609         100.0%    100.0%
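The agegroup frequencies can be cross-checked against the AGEG2 frequencies listed for viewEvansCounty in Appendix 1. The Python sketch below is an illustration and not Epi Info code. The Start, End, and By values (39, 79, and 20) are an assumption inferred from the range labels in the frequency table, ages are assumed to be whole years, and `age_group` is a made-up helper name.

```python
from collections import Counter

# AGEG2 bands and frequencies copied from the viewEvansCounty listing; the
# upper bound of the last band (76) is the maximum AGE given there.
AGEG2_BANDS = [
    ((40, 44), 109),
    ((45, 49), 138),
    ((50, 54), 111),
    ((55, 59), 92),
    ((60, 64), 63),
    ((65, 69), 52),
    ((70, 76), 44),
]


def age_group(age, start=39, end=79, by=20):
    """Label the range (lower, lower + by] that contains a whole-year age.

    The defaults are assumed Fill Ranges values, not taken from the text.
    """
    if not (start < age <= end):
        return None
    lower = start + ((age - start - 1) // by) * by
    return f">{lower} - {lower + by}"


counts = Counter()
for (low, high), frequency in AGEG2_BANDS:
    # Each five-year band falls entirely inside one 20-year range.
    assert age_group(low) == age_group(high)
    counts[age_group(low)] += frequency

for label, frequency in counts.items():
    print(f"{label}: {frequency}")
```

Summing the AGEG2 bands this way gives 450 and 159, the same counts as the Frequencies output.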

    6. First, Define the variable Anemic. There are different programming approaches to doing this. One way is as follows:

        IF HEM < 39 AND SMK = (-) THEN Anemic = 1 END
        IF HEM >= 39 AND SMK = (-) THEN Anemic = 2 END
        IF HEM < 40 AND SMK = (+) THEN Anemic = 1 END
        IF HEM >= 40 AND SMK = (+) THEN Anemic = 2 END

    Another approach that would work just as well is:

        ASSIGN Anemic = 1
        IF HEM >= 39 AND SMK = (-) THEN Anemic = 2 END
        IF HEM >= 40 AND SMK = (+) THEN Anemic = 2 END
        IF HEM = (.) OR SMK = (.) THEN Anemic = (.) END

    The prevalence of anemia is 1.1%.

    Anemic   Frequency   Percent   Cum Percent
    1        7           1.1%      1.1%
    2        602         98.9%     100.0%
    Total    609         100.0%    100.0%
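For readers who find it easier to check the rule outside Epi Info, the same logic can be written as a small function. The Python sketch below is an illustration only; the function name `anemic` and the use of `None` for a missing value are choices made for this example.

```python
def anemic(hem, smoker):
    """Return 1 (anemic), 2 (not anemic), or None when either input is missing.

    The hematocrit cutoffs follow the IF statements for question 6:
    below 39 for non-smokers and below 40 for smokers.
    """
    if hem is None or smoker is None:
        return None
    cutoff = 40 if smoker else 39
    return 1 if hem < cutoff else 2


# A few hand-made records (hematocrit, smoker) to exercise each branch.
examples = [(38, False), (39, False), (39, True), (40, True), (None, True)]
for hem, smoker in examples:
    print(hem, smoker, anemic(hem, smoker))

# Prevalence from the Frequencies output: 7 anemic out of 609 records.
prevalence_percent = round(100 * 7 / 609, 1)
print(f"Prevalence of anemia = {prevalence_percent}%")
```

A hematocrit of 39 is classified as not anemic for a non-smoker but as anemic for a smoker, which is the only place where the two cutoffs give different answers for whole-number values.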

    7. In the Program Editor, click on the Save button; a Save Program dialog box will appear. Save the program under the name Anemic and then click on the OK button. Next, click on the Open button in the Program Editor, click on the down arrow at the right of Program, select the Anemic program, edit it to remove commands that are not needed, and then Save the edited program. Now Read viewEvansCounty again, Open the Anemic program, and click the Run button. Double-check that the program worked correctly by running Frequencies on Anemic.

    Answers Exercise 3

    Third Variable   Interaction p-value   Crude OR   Adjusted OR   Conclusion?(1)
    ECG              0.42                  2.9        2.4           Confounding
    MAR              0.46                  2.9        2.8           Neither
    SMK              0.46                  2.9        2.9           Neither
    AGEG1            0.81                  2.9        2.2           Confounding
    QTIG             0.07                  2.9        2.9           Neither
    HPT              0.003                 2.9        2.0           Interaction

    (1) Interaction, confounding, or neither


    Appendix 4. Analysis commands by number and types of variables

    The tables below show which analysis command is appropriate, depending on the number of variables under consideration (one or more), the types of variables (categorical vs. continuous), and whether the analysis assumes simple random sampling or a complex sampling design.

    Table A.4.1. Epi Info commands for the analysis of one variable of interest, assuming simple random sampling

    A variable of interest Analysis command Categorical variable

    e.g., Illness=Yes or No, sex Frequencies

    Means Continuous variable

    e.g., age, blood pressure, cholesterol level Means

    Time to event *

    e.g., survival time until an event occurs Kaplan-Meier Survival

    *Requires two variables: a time variable and a variable indicating whether or not an event occurred.

    Table A.4.2. Epi Info commands for the analysis of a predictor variable vs. an outcome variable, assuming simple random sampling

    Outcome                   Paired             Predictor variables
                              observations(1)    Categorical variable       Continuous variable        Both categorical and
                                                 ( 2 categories)                                       continuous variables

    Categorical variable      No                 Tables;                    Logistic Regression        Logistic Regression
    e.g., illness =                              Logistic Regression        (unconditional);           (unconditional)
    Yes or No                                    (unconditional)            Means(2)

                              Yes                Match;                     Logistic Regression        Logistic Regression
                                                 Logistic Regression        (conditional)              (conditional)
                                                 (conditional)

    Continuous variable       No                 Means(2);                  Linear Regression          Linear Regression
    e.g., age, blood                             Linear Regression
    pressure

    Time to event             No                 Kaplan-Meier Survival;     Cox Proportional           Cox Proportional
    e.g., survival time                          Cox Proportional           Hazards;                   Hazards;
    until an event                               Hazards;                   Extended Cox model(3)      Extended Cox model(3)
    occurs/is censored                           Extended Cox model(3)

    (1) e.g., matched case-control study
    (2) Student t-test and ANOVA for parametric tests, and Kruskal-Wallis test for non-parametric tests.
    (3) Used when predictor variable(s) are time-dependent or Cox PH assumptions are violated.


    Table A.4.3. Epi Info commands for the analysis of one variable of interest in a survey using a complex sample design

    One variable of interest                           Analysis command
    Categorical variable, e.g., illness=Yes or No      Complex Sample Frequencies
    Continuous variable, e.g., age, blood pressure     Complex Sample Means

    Table A.4.4. Epi Info commands for the analysis of a predictor variable vs. an outcome variable in a survey using a complex sample design

    Outcome                                            Predictor variable (Categorical variable)
    Categorical variable, e.g., illness=Yes or No      Complex Sample Tables
    Continuous variable, e.g., age, blood pressure     Complex Sample Means

    Intro to Epi Info 3.3.2 Analysis.doc January 4 2007