Introduction to Stata using the UK Labour Force Survey

G10 The series of ESDS Guides are available online at www.esds.ac.uk


ESDS Government Author:

Anthony Rafferty

Version: 7.2 Date: September 2008



Contents

1.0 Introduction

2.0 Getting Started: The Basic Features of Stata

3.0 Exploring your data

4.0 Generating variables and changing their values

5.0 Graphics in Stata

6.0 Statistical modelling using Stata: A brief introduction

7.0 Do-files: Using and saving commands

Appendix A Resources for Learners

Appendix B Entering and transferring data into Stata

Appendix C Reserved names and Stata operators

Stata ® and the Stata logo ® are registered trademarks of StataCorp LP. This guide has not been sponsored or approved by Stata. The author is solely responsible for any mistakes.



1.0 Introduction

This short guide provides an introduction to Stata 9 using the ESDS Labour Force Survey (LFS) Teaching Dataset (2002). Its central aim is to provide a learning resource for those who have little or no experience of using Stata, through practical examples which you can try for yourself. The dataset accompanying this guide was produced by ESDS Government, and can be downloaded together with supporting documentation from the following website:

http://www.esds.ac.uk/government/lfs/resources/#teaching

The LFS teaching dataset (2002) is a subset of data drawn from the UK Labour Force Survey, containing data from all four quarters of the 2002/3 LFS for respondents aged 16-65 and resident in the UK (n=63,559). For ease of use within a teaching context, the dataset is restricted to a subset of 58 key (mainly individual level) variables [1].

In order to access this data you will need to register. For UK academic users, or previous users of the UK Data Archive, this can be done by registering as an ESDS and UK Data Archive user with your ATHENS username and password. Simply follow the instructions which appear when you attempt to download the data.

Using this Guide

The guide is separated into seven sections. Following this introduction, section two explores the visual operating environment of Stata and demonstrates basic commands for opening, examining, and saving datasets. Section three looks at ways of producing frequency tables, cross-tabulations and summary statistics. Some specific functions for the coding of missing and inapplicable values are also considered. In section four, ways to create, manipulate, and recode variables are outlined, whereas section five goes on to explore the graphical capabilities of more recent versions of Stata (beyond version 8). Section six gives a basic introduction to estimation and post-estimation commands used for statistical modelling, illustrating examples of multiple linear and logistic regression. Finally, section seven considers how to record and edit syntax commands using do-files.

Throughout the guide, illustrations are given for Stata SE version 9. This version of Stata has slightly greater capabilities than some of its predecessors. Those with older versions of Stata will find most of the illustrations functional (bar section 5 on using the menu system to produce graphics). Information about the different versions of Stata can be found on page 1 of the Getting Started with Stata for Windows Guide (Release 9) (StataCorp, 2005).

[1] There are two household variables that take the same value for all members of the household: ‘ten96’ and ‘house’.



Why use the Labour Force Survey?

As well as giving a basic introduction to Stata, a further intention of this guide is to promote the use of the full LFS dataset for secondary analysis. A wider exploration of the documentation supporting the LFS, available from the ESDS Government website, is consequently encouraged to supplement the present text (see http://www.esds.ac.uk/government/lfs/).

The LFS is carried out by the Office for National Statistics (ONS). Other than the Population Census, it represents the only comprehensive source of information about all aspects of the labour market. The survey has a high research potential for secondary analysis due to its large sample size and detailed questions.

Since 1992 (following major methodological changes and the introduction of the quarterly LFS), a simple, stratified random, unclustered sample design has been used to select a sample of addresses. Each quarterly LFS sample of 57,000 responding UK households is made up of five waves, each of approximately 11,000 private households. Each wave is interviewed in five successive quarters, so that in any one quarter, one wave will be receiving their first interview, one wave their second, and so forth, with one wave receiving their fifth and final interview. Thus there is an 80 per cent overlap in the samples for each successive quarter. All adults within responding households are interviewed face to face at their first inclusion in the survey and by telephone (if possible) at quarterly intervals thereafter. Each household has its fifth and last quarterly interview on the anniversary of the first.

Unlike most other large-scale government surveys, the LFS includes people living in NHS accommodation. Information is also available for young people aged 16 to 24 years, as the LFS sample includes those living away from their parental home in a student hall of residence or similar institution during term time.

Due to the large sample size and stratified unclustered random sample, the LFS has small sampling errors for main population sub-groups. The sample design also allows representative results to be published for any thirteen-week period.

In terms of its limitations, the LFS has a high proportion of proxy interviews (c. 30%) in comparison to other surveys such as the General Household Survey (c. 5%). Also, as with most UK government surveys, response rates have dropped in recent years. In 1999/2000, the response rate for the LFS was 63%. Despite this, the LFS remains a primary resource for those wishing to undertake secondary analysis on the UK labour market and employment related issues. Some other notable features of the LFS data are that:

• Longitudinal datasets are available which link the quarters e.g. June 2001 to August 2002.

• Separate datasets exist for the analyses of households. These are available for every quarter and may be used for household level analyses, or for individual analyses which draw on household and family characteristics.

• Special Licence data are available which contain additional detail and geography.

• Aggregated Local Authority level datasets are available on the standard ‘end user’ licence.



• The survey can be used for the analyses of ethnic minorities and other small samples. In order to obtain adequate sample sizes, it may be necessary to combine a number of years of data together.

• A related dataset, the Annual Population Survey (APS), combines results from five different sources: The LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS); the Welsh Labour Force Survey (WLFS); The Scottish Labour Force Survey (SLFS); and the Annual Population Boost Sample (APS (B)). The APS (B) ceased to exist at the end of December 2005, therefore APS data from January 2006 onwards will contain all of the above data apart from APS (B) data.

The publication database (www.esds.ac.uk/government/citations) on the ESDS Government website provides a useful way to search for publications resulting from secondary analyses of the LFS or other ESDS Government supported datasets. In addition to ESDS Government resources, information about the LFS can be obtained by searching the Office for National Statistics website (www.statistics.gov.uk). Any further queries can be directed to the ESDS Government Helpdesk:

E-mail: [email protected]
Tel: +44 (0)161 275 1980
Fax: 0161 275 4722

Postal address:

ESDS Government
CCSR
School of Social Sciences
University of Manchester
Crawford House
Manchester M13 9PL



2.0 Getting Started: The Basic Features of Stata

This section provides a brief introduction to the visual operating environment of Stata. Some of the basic commands for opening and exploring the contents of datasets are considered. This part of the guide will be mainly relevant to those who have no experience of using the software.

2.1 The Stata Environment

Opening Stata

You can open Stata in the same way as you would most other software packages, by clicking on its icon or menu item as shown above. When you open Stata, you should see the following screen (although the layout of the windows might vary somewhat and some windows may be minimised or shaped differently):

The large black window is the results window. The results window will contain your output. This includes:

• The commands that you run
• The results you obtain
• Error messages
• Active links to Stata web pages, the help system and further output



The Review window is designed to contain a list of past commands. The Variables window contains the list of variables in your data file. The Stata Command window is where you can type commands. At the start of a new Stata session, all three windows are empty. The review and variable windows will also be minimised. To view each window click on the tab. The pushpin icon allows you to toggle between different preset sizes. Try opening the variables window, and resizing it using the pushpin. The relative sizes of the windows can be altered by clicking on the meeting point of the windows in the same way that you can change the size of cells in a table. At the top of the screen is an icon bar menu. Some of the menu items will be familiar whereas others are more specific to Stata. You can see a description of what each item does by running your mouse pointer over it.

In Stata version 10, the icons have been modified and will appear as:

2.2 Setting memory size

Before loading your dataset, you will probably need to set the amount of memory allocated to Stata by your computer. Stata achieves a higher processing speed by holding data within memory whilst performing calculations (as opposed to accessing it from hard disk). This means that the size of the dataset you can load into Stata is limited by the amount of memory allocated. An error message will appear if you attempt to load datasets larger than your allocated memory. The default memory allocated is roughly one megabyte. For most datasets, it will probably be necessary to increase this allocation. A memory size of 16 Mb or slightly higher will be enough for most purposes. For example, the dataset we will be using is around 2.5MB. We will set the memory to 20 Mb by typing a simple instruction into the Command window at the bottom of the screen. Type the following in the Command window and hit the return key:

set mem 20m

• Notice that your command has appeared in the results window



• If you click the review window tab to make this visible (if it is not already) you will see that the command has also appeared there.

• Stata’s response to your command is also given in the results window (this will give some information about your settings and will include a message saying that you have set memory to 20m).
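The steps above can be sketched as a short sequence of commands. This is a minimal sketch assuming Stata 9, as used throughout this guide; in Stata 12 and later, memory is managed automatically and set memory is no longer needed.

```stata
* Allocate 20 megabytes of memory. In the versions of Stata covered by
* this guide, this must be done before any data are loaded.
set memory 20m

* Display the current memory settings to confirm the change
query memory
```

The query memory command only reports settings; it does not change them.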

If you want to keep the memory set to 20 megabytes permanently (until you instruct Stata otherwise) then type:

set memory 20m, permanently

This means that each time you open Stata, a memory allocation of 20Mb will already be assigned. Notice that a comma separates the option from the command. This is a general syntax feature for specifying options in Stata commands.

2.3 Opening a file

As with many of Stata’s functions, when opening a data file, you can either use the menu system or enter instructions through the command box. In section seven, we will also consider how commands can be entered from text files known as do-files. For users of SPSS, these files are Stata’s equivalent of syntax files.

Using the Open (Use) menu button:

To open a file using the menu system:

• Click on the Open (Use) button

You should obtain a dialogue box:

• You will find yourself in the default directory: Browse to the folder which you extracted the LFS teaching dataset to. The graphic shows an example in which the data has been saved as ‘lfs2002.dta’ within a folder called stata8.



• Click on the filename and then click the ‘Open’ button to load the file. You will find that:

• The command has been echoed in the review window as before
• There should be no error message in the results window
• The variables window now contains a list of variables.

Click on the variables tab to see the variables window

Click the pushpin to view the labels for the variables as well as the names. Click the pushpin a second time to hide the window.

Opening files using written commands:

The command for opening a data file takes the following form:

use <filename>, clear

Throughout this guide, words in angle brackets such as <filename> indicate where the specific name of a file or variable needs to be substituted. Underlined portions of command words indicate abbreviations, which can be used instead of typing out full commands. Note that Stata commands are case sensitive and are typed in lower case. The clear option at the end of the instruction means that any data that you currently have stored in memory will be erased.

Stata saves and opens files from a default directory on your hard disk (c:\data\). You can however organise your data as you wish, determining the directory that you use for loading and saving data. Suppose the file we wish to open is LFS2002.dta and is located in the directory C:\Data_Stata\Course4_Stata_for_LFS\. We could use the following command to open this file:

use "C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta", clear
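As a brief illustration of the use command, the sketch below loads the teaching dataset and then checks what was loaded. The path is the example location used in this guide and should be replaced with your own; the second form, which reads only selected variables, is an additional variant not otherwise covered here.

```stata
* Load the teaching dataset, discarding any data already in memory.
* Straight double quotes are required around a path.
use "C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta", clear

* Confirm what was loaded: number of observations and variables only
describe, short

* Variant: read just three variables from the same file
use ten96 sex age using "C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta", clear
```

Reading a subset of variables can be useful when memory is limited, but the full file should be reloaded before continuing with the rest of this guide.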



Changing the default directory

You can also alter the directory from which Stata loads and saves data by using the cd (‘change directory’) command. For example, if you were to type in the following:

cd C:\Data_Stata\Course4_Stata_for_LFS

all load and save operations would now operate to and from this specified location. This means you can simply enter the use <filename> command without specifying the file directory each time.

2.4 Opening the Data Browser

Unlike some other statistical packages, the data in Stata is not immediately visible upon opening a data file. To view the raw data, it is necessary to open the data browser window. This window allows you to view the data, but not to change its values (see Appendix B for information on entering data into Stata).

Click on the button to open the data browser. Note that:

• The data for each individual is on a separate row. We call the individual the case because we will be analysing the data at the individual level (although it is possible to structure data differently)

• Each row is numbered.

• Each column relates to a particular variable.

• Each column is headed with the name of the variable.

• By double clicking the name of the variable at the top of the column, you can read a variable label, which will help you to understand what the variable is.

• Each cell contains text describing the value for a particular variable and individual case. The text associated with each value was defined by the data creator, using a value label.

• By clicking on a particular cell the value for that individual for that particular variable is given at the top of the screen.

• If we click on the first cell (the cell for case 1 for ten96), the value will appear in the space above the data. ten96[1] = 2 means that the value of ten96 for case 1 is 2. Each value of a categorical variable is associated with a specific label indicating what this value represents. The cell contains the text ‘being bou.’, which is truncated. The text ‘being bought with a mortgage or loan’ is the full value label associated with the value 2 for the ten96 variable. This label is therefore associated with all cases that have value 2 for ten96.



• By double clicking on the word ten96 at the top of the ten96 column we obtain the following window:

This gives:

• The variable name

• The variable label, in this case ‘accommodation details’, which gives us more information about what the variable is

• Some information about the format in which the data is displayed (in this case, %9.0g is Stata’s general numeric format with a display width of nine characters, in which Stata chooses how many decimal places to show)

• The name of the value label.

All of this information is greyed out to prevent you from accidentally changing any details. N.B. You will need to close the browser window and any associated dialogue boxes in order to run further commands.
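The browser can also be opened from the Command window, and the same cases can be printed in the results window with the list command. The following is a minimal sketch; the choice of variables and the range of the first ten cases are purely illustrative.

```stata
* Open the Data Browser for three variables and the first ten cases only
browse ten96 sex age in 1/10

* Print the same cases in the results window, showing value labels
list ten96 sex age in 1/10

* Repeat, showing the underlying numeric codes instead of the labels
list ten96 sex age in 1/10, nolabel
```

Comparing the two list outputs shows the same distinction between values and value labels that the browser reveals when a cell is clicked. As noted above, the browser window must be closed before further commands will run.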



2.5 The describe command

Another way to look at what variables are in a dataset is to use the describe command.

• Close the data browser window. In the command window, type:

describe

Typing describe alone, without any variable names, will give you information on all of the variables contained in the dataset:

Contains data from C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta
  obs:        63,559
 vars:            58
 size:     5,021,161 (75.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
ten96           byte   %8.0g       ten96      accommodation details
house           byte   %8.0g       house      accommodation details (grouped)
sex             byte   %8.0g       sex        sex
age             byte   %8.0g       age        age last birthday
ages            byte   %8.0g       ages       age groups in 5 yearly intervals
nation          byte   %8.0g       nation     nationality
cry01           byte   %8.0g       cry01      country of birth
region          byte   %8.0g       region     region of usual residence
numchild        byte   %8.0g                  number of children in the household aged 0-4
numchil1        byte   %8.0g                  number of children in the household aged 5-16
ayfl19          byte   %8.0g       ayfl19     age of youngest dependent child in family aged <19
ethnic          byte   %8.0g       ethnic     ethnicity

…and so forth for each variable. Otherwise, you can type variable names after the describe command to obtain details for a single variable or a subset of variables:

describe numchil1

Using the ‘help’ and ‘findit’ functions:

You can find out about the format and function of a command by using the help command. For example, you could use this to find out information about the describe command:

help describe

You can also search for a command if you do not know its full name using the findit instruction:

findit describe

To supplement this guide, you may find it useful to use these commands to find out further information about each command.
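The different ways of calling describe can be sketched together as follows. The variable names are taken from the output listed in this section; the lookfor command and its keyword are an additional suggestion, not something this guide relies on elsewhere.

```stata
* Describe a subset of variables by listing their names
describe ten96 house sex

* A hyphen selects a run of adjacent variables, in dataset order
describe sex-region

* Summary only: number of observations, variables and dataset size
describe, short

* Search variable names and variable labels for a keyword
lookfor child
```

The hyphen form depends on the order of variables in the dataset, so it is worth running describe on its own first to check that order.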



2.6 The Codebook command

The codebook command is used to obtain more detailed information about the values of specific variables. This can be obtained using the command window by typing codebook <variable name> and hitting return. For example:

• Type: codebook ten96

Once again, you can obtain the same information using the menu system:

• Click on Data > Describe Data > Describe Data Contents (Codebook)

• In the dialogue box type ten96 into the variable field, or click the drop down variable list and select ten96 from this.

• Click OK.


How many cases have the value 6?

What is the name of the value label attached to the variable?

To find out what value descriptions are defined by the ten96 label:

• Click Data from the Menu

• Select Labels, label values

• Click on List value labels

• Click on ten96 in the list of value labels (you may need to scroll down to do this)



What value description is associated with the value –8?

What value description is associated with the value 3?

What command was generated to run this procedure? (Look either in the results window or the review window)

• Use the codebook command to find out the name of the label for the ilodefa variable (hint: you can click on the review window to obtain the correct command structure, then replace the name of the variable by using the backspace button and typing in ‘ilodefa’)

What is the variable label for ilodefa?

What value label is attached to the ilodefa variable?

• Use the label list command to obtain the value descriptions for values of the label attached to ilodefa.

What are the value descriptions associated with each value of ilodefa?
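The command pattern used in these exercises can be sketched with a different variable, so as not to give away the answers. The sketch assumes, as the describe output in section 2.5 indicates, that the value label attached to sex is also named sex.

```stata
* Detailed description of one variable, including the name of its value label
codebook sex

* List the value descriptions stored under the value label named sex
label list sex
```

The same two steps apply to any labelled variable: codebook reports the name of the value label, and label list shows what each value stands for.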

2.7 Saving data files

In the subsequent sections, we shall go on to recode some of the variables in the dataset. It may therefore be useful to save a copy of your dataset under a different filename and from now on use this file for the following examples. This will mean that you will always have a back up copy of the dataset to hand which is in its original form.

To save using the command box:

• Type: save <filename>

As with the use command, if you do not specify the directory prior to the filename (and have not changed directory using the cd command) this will save into the default directory (c:\data\).
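A minimal sketch of saving a working copy is given below. The filename lfs2002_working.dta is hypothetical and chosen only for illustration.

```stata
* Save a working copy under a new (hypothetical) filename,
* leaving the original LFS2002.dta untouched
save "lfs2002_working.dta"

* After making changes, overwrite the working copy. Without the
* replace option Stata refuses to overwrite an existing file.
save "lfs2002_working.dta", replace
```

Keeping the replace option off the first save acts as a safeguard: Stata will report an error rather than silently overwrite a file that already exists.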



3.0 Exploring your data

We will now consider ways to explore the dataset in more detail using frequency tables and cross-tabulations. Options for estimating measures of central tendency (mean, mode, and median) and measures of dispersion (standard deviation, variance, and deciles) are also demonstrated. Additionally, an overview of Stata options for the coding of missing and inapplicable values is given.

Before You Start…

Create a permanent log of your output:

Before undertaking any analysis or data manipulation, it is recommended that you create a log file to record your output. By default, the output in Stata is temporary: the results history accessible through the results window is confined to a particular (although adjustable) size. Creating a log file gives you a permanent record for future reference, which can also be printed out.

You can start logging your results by clicking on the log button and completing the resulting dialogue box. This button can also be used to close or to suspend (pause recording on) a log file.

Alternatively, use the command window to start a log file:

• Open a log file by typing the following command in the command window:

log using <filename>

• You will need to substitute <filename> with the name you wish to give to your log file.

• Note that the results window indicates that the log file has a .smcl extension to its name. This extension identifies the file as a Stata log file.

When using log, the options replace and append are available; they are placed after a comma following the filename. These instruct Stata either to replace an existing log of the same file name with a new one, or to append the new log to the bottom of the existing file denoted by <filename>. Logs can also be suspended, restarted, or closed by typing:

log off
log on
log close

N.B. Log files can also be saved as text files, which may make them easier to access and read.
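A complete logging session might look like the following sketch. The filename lfs_session.log is hypothetical; the text option requests a plain text log rather than the default .smcl format.

```stata
* Start a plain-text log; replace overwrites any existing file of this name
log using "lfs_session.log", text replace

* Commands and their output are now recorded
tabulate sex

* Pause logging, run something unrecorded, then resume
log off
describe, short
log on

* Finish and close the log file
log close
```

Using append in place of replace would add this session to the end of an existing log instead of overwriting it.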



3.1 Obtaining frequency counts

Once you have opened your log file, you are ready to begin exploring the data. As well as the codebook command, tabulate can be used to create frequency tables. It can also be used to cross-tabulate two or more variables. In the following example, we will first consider how many women and men there are in the dataset.

• To obtain the frequency of men and women in the sample enter the following:

tabulate sex

This should give the following output:

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |      30396       47.82       47.82
     female |      33163       52.18      100.00
------------+-----------------------------------
      Total |      63559      100.00

• The columns are fairly self-explanatory;

o The first column contains counts of the number of cases observed with that value

o The second contains the percentage of cases with a non-missing value that have that value

o The final column contains the cumulative percentage (those which have an identical or lower value).
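The percentages in the table can be checked by hand with the display command, which works as a simple calculator. This is a small worked sketch using the frequencies shown above.

```stata
* Percent for males: 30,396 of 63,559 cases (about 47.82)
display 100 * 30396 / 63559

* Cumulative percent at 'female': both categories together, hence 100
display 100 * (30396 + 33163) / 63559
```

Because sex has no missing values, the denominator is the full sample of 63,559; for a variable with missing values, the denominator would be the number of non-missing cases only.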

We can next use the codebook command to examine what numerical values are connected to the labels ‘male’ and ‘female’. It will be important to understand the values underpinning our value labels when we come to the issue of recoding shortly.

codebook sex

3.2 Handling inapplicable and missing values

Missing data occurs where respondents fail to answer certain questions. It can also exist where respondents are not interviewed at a specific wave of a longitudinal survey. Another form of survey non-response, coded as ‘inapplicable’, occurs where a specific question does not apply to a given respondent, such as a question about the number of hours of paid work per week for an unemployed respondent. Within a survey interview, sets of instructions attached to questions (or in the computer programme if it is a computer assisted interview) route the survey so that only questions relevant to specific respondents are asked. In the above example, a question on employment status would route respondents past questions on current employment if they indicated they were unemployed, as answering such questions would not make sense.



From the codebook for the variable sex, we can see that all of the values are positive, and there are no cases coded by Stata as missing. However, if we look at another variable, jobtyp, which records whether employed respondents have permanent contracts or not, we can see that this is not always the case.

codebook jobtyp

--------------------------------------------------------------------------------
jobtyp                                                permanent or temporary job
--------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  jobtyp

                 range:  [-9,2]                       units:  1
         unique values:  4                        missing .:  0/63559

            tabulation:  Freq.   Numeric  Label
                         23388        -9  does not apply
                            15        -8  no answer
                         37544         1  permanent
                          2612         2  not permanent in some way

The codebook command gives a sample of values for a variable indicative of its value range. We can examine the full range of values for a variable by using the tabulate command:

ta jobtyp

   permanent or temporary |
                      job |      Freq.     Percent        Cum.
--------------------------+-----------------------------------
           does not apply |      23388       36.80       36.80
                no answer |         15        0.02       36.82
                permanent |      37544       59.07       95.89
not permanent in some way |       2612        4.11      100.00
--------------------------+-----------------------------------
                    Total |      63559      100.00

Another way to look at what values underpin value labels is to use the ‘no label’ option (nolabel, abbreviated to nol) for tabulate. This gives the numerical values for variables as opposed to their value labels. As usual for Stata commands, the option(s) for the tabulate command are specified following a comma:

ta jobtyp, nol



  permanent |
         or |
  temporary |
        job |      Freq.     Percent        Cum.
------------+-----------------------------------
         -9 |      23388       36.80       36.80
         -8 |         15        0.02       36.82
          1 |      37544       59.07       95.89
          2 |       2612        4.11      100.00
------------+-----------------------------------
      Total |      63559      100.00

It can be seen that the jobtyp variable has two negative values (-8 = “no answer” and -9 = “does not apply”). The latter denotes a number of groups to whom this question does not apply. These include people who are not in paid employment and the self-employed. It is good practice to check in more detail who the inapplicable category refers to by looking at the documentation for the survey, which will typically give the question routing information. The “-8 no answer” category represents missing data.

Note that the ‘missing .’ field on the right-hand side of the codebook output suggested that there were no missing values. This is because, although missing values are indicated within the dataset, they have not yet been assigned a value that Stata recognises as missing. In Stata, missing data is denoted by a full stop/dot “.”, or by a dot followed by a letter suffix ranging from a to z, such as “.a”. This option for multiple codings for different types of missing values is available from version 8 of the software onwards; prior to this, there was one code for missing data, “.”. Once values have been recoded as missing in this way, Stata will omit them from your analysis.

Handling Missing Values:

When you receive survey datasets from the UK Data Archive (and from many other sources), missing values may not already be coded into Stata recognised missing values. Instead, they may take on a negative number, or some other format. This means that it is up to you to recode missing values. It also remains your decision how missing data will be handled within your analysis. For present purposes, missing data will be ignored (although in practice this may be inadvisable). More detailed information about handling missing data can be found at: http://www.lshtm.ac.uk/msu/missingdata/index.html.

To prevent inapplicable cases from appearing in your output you must either exclude cases with invalid answers or set their values to missing. We will now set to missing the cases to which the variable jobtyp did not apply.
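The first of these two alternatives, excluding cases rather than recoding them, can be sketched with an if qualifier. The sketch relies only on the codes shown in the tables above, where both non-valid codes for jobtyp are negative.

```stata
* Tabulate jobtyp for valid answers only; the stored data are unchanged.
* Codes -9 (does not apply) and -8 (no answer) are both below zero.
tabulate jobtyp if jobtyp > 0

* Count the excluded cases: on the original data these are
* the 23,388 inapplicable cases plus the 15 with no answer
count if jobtyp < 0
```

This approach leaves the dataset untouched, but the condition has to be repeated on every command, which is why recoding to missing is usually more convenient.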

Using the menus:

To set jobtyp value -9 to missing using the menu system:

• From the menu click on Data

• Select Create or Change variables

• Click on Other variable transformation commands

• Click on Change numeric values to missing



• A dialogue box will appear

• In the Variable field type jobtyp (or select it from the drop down menu as before)

• In the Conversion Rules field type -9 = .a

• Click OK

• This command will convert all cases with the current value -9 to .a for jobtyp

What command has appeared in the review window?

• Type codebook jobtyp in the command window and hit return to obtain the following information about the variable.

How many values have been set to missing according to the unique mv codes field?

How many cases have been set to a specific missing value? (see the missing .* field)

What 4 values can cases take for this variable now?

• Re-run the tab command for jobtyp

• Compare this tabular output with that produced before -9 was set to missing

ta jobtyp

   permanent or temporary |
                      job |      Freq.     Percent        Cum.
--------------------------+-----------------------------------
                no answer |         15        0.04        0.04
                permanent |     37,544       93.46       93.50
not permanent in some way |      2,612        6.50      100.00
--------------------------+-----------------------------------
                    Total |     40,171      100.00

Has the percentage of those who are not permanent in some way changed as compared with your previous table?



Which of these percentages do you think would be most appropriate if you were writing a report?

Can you think of an occasion when you might wish to use the other figures from the earlier table?

How would you set -8 to be .b?

Using the mvdecode command:

Stata has a number of specific commands for the manipulation of missing values. A particularly useful one is mvdecode. We can turn the -8 code, which denotes that a question was applicable but the respondent did not answer (or any other such value), into a Stata recognised missing value by using mvdecode:

mvdecode hourpay, mv(-8)

Multiple transformation rules may be specified if they are separated by a backward slash. If all the variables in the dataset used the -8 code and we wished to set it to a Stata recognised missing value (“.”) throughout, we could do this by:

mvdecode _all, mv(-8)

where _all is a reserved name in Stata representing all the variables in the dataset. It can be used in a variety of contexts to denote that you wish to perform a command on every variable.

To turn these missing values back into numerically coded values, we can use mvencode:

mvencode hourpay, mv(-8)

mvencode _all, mv(-8)

Now all of the missing values we previously coded as “.” are once again coded as -8.

N.B. You may have to update Stata in order to use mvdecode or mvencode. Stata can be updated online if your computer is connected to the internet.

• You can check if your version of Stata is up to date by typing:

update query
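The use of several transformation rules separated by a backward slash can be sketched with jobtyp, whose codes are known from the tables earlier in this section. The choice of .a and .b as the target codes follows the exercise above and is otherwise arbitrary.

```stata
* Two rules separated by a backslash: -9 becomes .a and -8 becomes .b
mvdecode jobtyp, mv(-9=.a \ -8=.b)

* Show the missing categories alongside the valid ones
tabulate jobtyp, missing

* Reverse the recoding, restoring the original numeric codes
mvencode jobtyp, mv(.a=-9 \ .b=-8)
```

Using distinct codes keeps ‘does not apply’ and ‘no answer’ distinguishable after recoding, which a single “.” would not.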

3.3 Cross-tabulating variables

Cross-tabulating variables provides another useful means for describing and exploring data. This is undertaken using the tabulate command. When creating cross-tabs, we can specify a number of statistics as options. These include row percentages (row), column percentages (col), and whether missing value frequencies are contained in the table (m):

tab ethnic sex, r m


20

| sex ethnicity | male female | Total -----------+----------------------+---------- white | 28391 30880 | 59271 | 47.90 52.10 | 100.00 -----------+----------------------+---------- caribbean | 263 321 | 584 | 45.03 54.97 | 100.00 -----------+----------------------+---------- african | 189 238 | 427 | 44.26 55.74 | 100.00 -----------+----------------------+---------- black oth | 87 131 | 218 | 39.91 60.09 | 100.00 -----------+----------------------+---------- indian | 486 546 | 1032 | 47.09 52.91 | 100.00 -----------+----------------------+---------- pstani | 269 312 | 581 | 46.30 53.70 | 100.00 -----------+----------------------+---------- bdeshi | 116 128 | 244 | 47.54 52.46 | 100.00 -----------+----------------------+---------- chinese | 92 89 | 181 | 50.83 49.17 | 100.00 -----------+----------------------+---------- other | 491 509 | 1000 | 49.10 50.90 | 100.00 -----------+----------------------+---------- . | 12 9 | 21 | 57.14 42.86 | 100.00 -----------+----------------------+---------- Total | 30396 33163 | 63559 | 47.82 52.18 | 100.00

The tab2 command alternatively creates all possible two-way tables for combinations of the variables in a variable list specified in your command. In the following example, we create all possible cross-tabulations for the variables sex, married (marital status), and fb (foreign born, i.e. whether or not a respondent was born outside the UK):

tab2 sex married fb, r m

-> tabulation of sex by married

           |       whether
           |  married/cohabiting
       sex |        no        yes |     Total
-----------+----------------------+----------
      male |     13052      17344 |     30396
           |     42.94      57.06 |    100.00
-----------+----------------------+----------
    female |     14052      19111 |     33163
           |     42.37      57.63 |    100.00
-----------+----------------------+----------
     Total |     27104      36455 |     63559
           |     42.64      57.36 |    100.00

-> tabulation of sex by fb

           | whether born outside
           |          uk
       sex |        no        yes |     Total
-----------+----------------------+----------
      male |     27816       2580 |     30396
           |     91.51       8.49 |    100.00
-----------+----------------------+----------
    female |     30138       3025 |     33163
           |     90.88       9.12 |    100.00
-----------+----------------------+----------
     Total |     57954       5605 |     63559
           |     91.18       8.82 |    100.00

-> tabulation of married by fb

   whether | whether born outside
married/co |          uk
  habiting |        no        yes |     Total
-----------+----------------------+----------
        no |     25126       1978 |     27104
           |     92.70       7.30 |    100.00
-----------+----------------------+----------
       yes |     32828       3627 |     36455
           |     90.05       9.95 |    100.00
-----------+----------------------+----------
     Total |     57954       5605 |     63559
           |     91.18       8.82 |    100.00

3.4 Creating Summary Statistics

The summarize command can be used to create summary statistics for continuous variables. In the following examples, we will consider the variable for gross hourly pay (hourpay). The variable hourpay will again have many inapplicable cases. If we run our summary statistics before assigning Stata recognised missing values, we will include “-8” and “-9” values in our analysis, distorting the results. If we look at the codebook for hourpay, using the if qualifier to see the values for those who are not in paid employment, we can see that this question only applies to those in paid employment (where stat==1).

codebook hourpay if stat ~=1

hourpay                                          gross hourly pay (£)
---------------------------------------------------------------------
                  type:  numeric (double)
                 label:  hourpay

                 range:  [-9,-9]                      units:  1
         unique values:  1                        missing .:  0/22942

            tabulation:  Freq.   Numeric  Label
                         22942        -9  does not apply

Like most Stata commands, summarize can be used with if to select a subset of the data. Note that the symbol “~=” means “not equal to”. Considering that the hourpay variable is routed by stat, another way to handle inapplicable cases is to select those who are in paid employment when creating our summary statistics, i.e. where stat==1. However, if we check the codebook for hourpay for respondents where stat==1, we can see that even amongst those in paid employment, there are still some missing values (-9).

codebook hourpay if stat==1

hourpay                                          gross hourly pay (£)
---------------------------------------------------------------------
                  type:  numeric (double)
                 label:  hourpay, but 4333 nonmissing values are not labeled

                 range:  [-9,204.8]                   units:  .01
         unique values:  4334                     missing .:  0/40617

              examples:  -9     does not apply
                         5.25
                         7.6399999
                         11.63

summarize hourpay if stat==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     hourpay |     31410    9.700548     6.99723        .05      204.8

To omit not answered and inapplicable cases, we could alternatively select cases with a value greater than zero:

su hourpay if stat==1 & hourpay >0 & hourpay ~=.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     hourpay |     31410    9.700548     6.99723        .05      204.8

The “~=” operator is one of many which can be used in Stata to make conditional statements in commands. The table below gives some other commonly used operators.

Arithmetic                Logical      Relational (numeric and string)
+  addition               ~  not       >   greater than
-  subtraction            !  not       <   less than
*  multiplication         |  or        >=  greater than or equal
/  division               &  and       <=  less than or equal
^  power                               ==  equal
+  string concatenation                ~=  not equal
                                       !=  not equal

Note that a double equal sign (==) is used for equality testing.



Note that in the above example using the summarize command, the hourpay ~= . condition is superfluous. This is because: a) we know from the codebook that there are no Stata coded missing values for hourpay, and b) the summarize command ignores Stata missing values. However, it is good practice to account for missing values in commands that use the greater than (>) or greater than or equal to (>=) operators. This is because Stata stores missing values as a value greater than all other values of a variable. These operators will therefore include missing values unless we tell Stata to exclude them by adding a condition such as < . or ~= . to the expression. Further details on this issue can be found in section 4.2.

Examining Continuous Variables: Note that the minimum value is 0.05 (5 pence per hour!). This is probably due to a coding error in the dataset. In a fuller analysis, you may have to make decisions on how to ‘clean’ errors, outliers, and improbable values on continuous variables such as income. The detail option for summarize gives a wider range of summary statistics.
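The point about missing values and the > operator can be checked on a tiny made-up variable. This is a sketch only: x and its three values are invented for illustration and are not part of the LFS data.

```stata
* Toy data (invented): one small value, one large value, one missing
clear
input x
1
5000
.
end

* Counts 2 observations: 5000 and the missing value
count if x > 4000

* Counts 1 observation: the extra condition excludes missing values
count if x > 4000 & x < .
```

The condition x < . excludes extended missing values (.a, .b and so on) as well as “.”, because Stata treats all of them as larger than any number.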

• Type the following: su hourpay if stat==1 & hourpay >0 & hourpay ~=., detail

                    gross hourly pay (£)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         2.06            .05
 5%         3.66            .09
10%          4.2            .11       Obs               31410
25%         5.47            .11       Sum of Wgt.       31410

50%         7.81                      Mean           9.700548
                        Largest       Std. Dev.       6.99723
75%         11.9         138.46
90%        17.18            150       Variance       48.96123
95%        21.35          163.5       Skewness       4.826005
99%         34.6          204.8       Kurtosis       65.15382

We can also combine the functions for cross-tabulation and summarize to create tables of summary statistics for sub-groups as defined through cross-tabulation. In the following example, we compare the mean hourly pay for men and women, dependent upon whether they are living as a couple (i.e. married or in a cohabiting union), or whether they are not living as a couple:-

• Type: ta married sex if (stat==1& hourpay>0), su(hourpay) means



        Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 9.0985452  7.9928881 | 8.5115412
       yes | 12.454774   8.619823 | 10.514367
-----------+----------------------+----------
     Total |  11.13273  8.3577719 | 9.7005479

The option means specifies that we only require the mean values for each category. If we had not specified this option, we would also obtain standard deviations and cell frequencies in the output table. This table suggests that for both men and women, married people get higher pay than non-married people (although we could also check whether these differences are statistically significant). However, this pattern may be confounded by age. Non-married people may be younger, have less work experience and seniority, and thus be paid less. We need to account for age differences in order to discount this explanation. We can approach this by considering whether the relationship between marital status and income differs between age groups. We can use the variable ages, derived from age, for this. It gives the ages of respondents in five-year categories. The bysort prefix allows us to perform operations by levels of a specified variable:

bysort ages: tabulate married sex if (status==1&hourpay>0), summ(hourpay) means nofreq

-> ages = 16-19
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 4.3975325  4.4363439 | 4.4177293
       yes |      4.23      4.652 | 4.5816667
-----------+----------------------+----------
     Total | 4.3973349  4.4375108 | 4.4182844

-> ages = 20-24
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 6.7361486  6.3238977 | 6.5155047
       yes | 7.1792592  6.5097887 | 6.6942347
-----------+----------------------+----------
     Total | 6.7581009  6.3436704 | 6.5299505

-> ages = 25-29
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 9.3453418  8.5594706 | 8.9495492
       yes | 9.7817412  8.5467598 | 9.0923597
-----------+----------------------+----------
     Total | 9.4733402  8.5551396 | 8.9949653

-> ages = 30-34
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 10.548311      9.437 | 10.012208
       yes | 11.821378  9.2466534 | 10.445421
-----------+----------------------+----------
     Total | 11.243172  9.3234061 | 10.259968

-> ages = 35-39
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.777334  9.5189622 | 10.585505
       yes | 12.909386  8.9489756 |   10.9662
-----------+----------------------+----------
     Total | 12.586221  9.1294836 | 10.851559

-> ages = 40-44
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.414991  9.5329118 | 10.426241
       yes | 13.720475  8.7016084 | 11.164072
-----------+----------------------+----------
     Total | 13.136032  8.9225785 | 10.972367

-> ages = 45-49
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.790461  9.7145892 | 10.597074
       yes | 13.240801  8.6652131 |  10.86329
-----------+----------------------+----------
     Total | 12.924504  8.9365285 | 10.799492

-> ages = 50-54
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.206868  9.1092617 | 9.9189148
       yes | 12.658364  8.7526648 | 10.612248
-----------+----------------------+----------
     Total | 12.399727  8.8377683 | 10.465945

-> ages = 55-59
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 10.193532  8.1655882 | 8.9578674
       yes | 11.846913  7.6297244 | 9.8324965
-----------+----------------------+----------
     Total | 11.575704  7.7640855 |   9.65073

-> ages = 60-64
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no |  9.069663  7.8004444 | 8.3047322
       yes |  10.54429  7.7155882 | 9.5276321
-----------+----------------------+----------
     Total | 10.355453  7.7397053 | 9.2935043

-> ages = 65-69
        Means of gross hourly pay (£)
   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 8.0625001  7.5027273 |     7.652
       yes | 9.2046154  6.8515789 | 8.4337931
-----------+----------------------+----------
     Total | 9.0983721  7.0903333 | 8.2731507



For men (with the exception of those below 20 years of age), those who are married tend to earn more than their non-married counterparts. However, the difference between the earnings of married and non-married women is less marked. One problem is that the above output gives a large number of cross-tabulations, making it more difficult to interpret the results. In the following section, we shall go on to consider how we can recode variables into formats more specifically tailored to our questions and analyses.

The ‘table’ command: N.B. If you wish to present your results with fewer decimal places, it is necessary to use the table command and its format() option instead of tabulate:

table married sex if (status==1&hourpay>0), c(mean hourpay) format(%9.2f)

See help table for further details.
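The tabulate, summarize() and bysort steps in this section can also be tried on a dataset that needs no download. The sketch below assumes the auto example dataset shipped with Stata (loaded with sysuse), which is not part of the LFS materials; price, mpg, foreign and rep78 simply stand in for hourpay, married, sex and ages.

```stata
* Example data shipped with Stata (an assumption: not the LFS dataset)
sysuse auto, clear

* Mean price within each cell of a two-way table
tabulate rep78 foreign if price > 0 & price < ., summarize(price) means

* The same kind of summary repeated for each level of a third variable
bysort foreign: summarize price mpg
```

Note that bysort sorts the data by the grouping variable before running the command, so the order of observations in memory changes.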

Exercises and suggested answers

2.1 Do male and female respondents in the dataset have a similar age profile? [sex ages]
2.2 Do levels of male and female hourly and weekly pay differ by a) ethnicity b) educational attainment? [sex ethnic hourpay grsswk hiqual]
2.3 Which ethnic group is most likely and which is least likely to be not born in the UK? [ethnic fb]
2.4 For men aged 30-65, which ethnic groups get the highest pay? [age sex ethnic grsswk]
2.5 Which ethnic group are least likely to be owner-occupiers? [ethnic house]

Suggested answers:

*Exercise 2.1
sort sex
tab sex, summ(age)
bysort sex: summ age
by sex, sort: su age
table sex, content(mean age)

*Exercise 2.2
tab ethnic sex if status==1&hourpay>0, summ(hourpay) nost nofr
tab ethnic sex if status==1&grsswk>0, summ(grsswk) nost nofr
table ethnic sex if status==1&hourpay>0, c(mean hourpay) f(%5.2f)
table ethnic sex if status==1&grsswk>0, c(mean grsswk) f(%5.2f)
table ethnic sex if status==1&grsswk>0&hiqual>=1&hiqual<=5, c(mean grsswk) f(%5.2f)

*Exercise 2.3
tab ethnic fb, row

*Exercise 2.4
tab ethn if (age>=30&age<=65&sex==0&status==1&hourpay>0), summ(hourpay)
tab ethn if (age>=30&age<=65&sex==0&status==1&grsswk>0), summ(grsswk)

*Exercise 2.5
tab ethni house if house >0, row



4.0 Generating variables and changing their values

From the preceding examples, it can be seen that the format in which secondary datasets are often supplied will mean that some variable recoding will be required prior to analysis. In this section, we focus upon three of the most important commands in Stata for this task: generate, recode, and replace.

4.1 Creating variables using the generate command

The generate command allows you to create new variables. This command can be used in conjunction with label, which is used to assign a label to a variable, to define a set of value labels for the numerical values of a variable, and to attach these value labels to a given variable.

• Generate the following variable: generate sex1 = sex

• The label variable <variable name> "label" command gives a descriptive label to your variable: label variable sex1 "sex"

• label define <label name> establishes a set of labels for a set of values: label define sexlabel 0 "male" 1 "female"

• label values <variable name> <label name> attaches the labels created through label define to the values of a specified variable: label values sex1 sexlabel

Note that labels created through label define are stored independently of variables. This means that one label can be assigned to a number of different variables:

• For example:
ge sex3 = sex
label values sex3 sexlabel

This would use the same label (sexlabel) again. Alternatively, we could have used the label that already existed for sex to label our new variables, given that their values are identical.
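The labelling steps above can be run end to end on a small made-up variable. This is a sketch for practice only: the three observations are invented, and the coding 0 = male, 1 = female follows the label definition used in this section.

```stata
* Toy data (invented): 0 = male, 1 = female
clear
input sex
0
1
1
end

* Two replica variables that share one set of value labels
generate sex1 = sex
generate sex3 = sex
label variable sex1 "sex"
label define sexlabel 0 "male" 1 "female"
label values sex1 sexlabel
label values sex3 sexlabel

* Both variables now display the same labels
tabulate sex1 sex3
```

Because the value label is stored separately from the variables, changing sexlabel later would change how both sex1 and sex3 are displayed.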



Creating ‘replica’ variables: Instead of recoding original variables, it is good practice to create ‘replica’ variables, identical in their values to the original variables, and perform any recoding on these. The advantage of this is that you will keep the integrity of the original values of the variables in your dataset. The original variables can then provide a useful check on your recoding. They can also be indispensable for correcting otherwise potentially irreversible changes that result from your data manipulation.

Generate also allows you to create a variable with values based upon the mathematical transformation of another variable. If, for instance, we find a curvilinear relationship between age and income, we might fit a quadratic model. In such cases, we may need a variable for age squared. This can be created easily using the generate command:

generate agesquared=age^2

Raising to a power is indicated by the “^” symbol.

Another use for the generate command can be found in the handling of variables which relate to a date, year, or some other form of calendar information. The LFS dataset contains the variable conmpy indicating the year in which a respondent first joined their present employing organisation. From this we can use generate to create a ‘tenure’, or ‘length of service’, variable by calculating how many years prior to the interview year (2003) respondents were present at their current organisation:

• Create a replica variable identical to conmpy named lengthcom
• Convert the -8 and -9 values of your new variable to two different Stata missing values.
• Next, create a variable indicating length of tenure, the value of which equals 2003 minus lengthcom (2003-lengthcom) as the dataset is for 2002. This will mean that the value one will indicate those with one year or less of experience.
• Check your values using the tabulate command

If you did the above correctly, you should have done something like the following. First, create a replica variable so as not to alter the values of the original one:

gen lengthcom=conmpy

Second, recode missing values (N.B. a backslash can separate multiple codings):

mvdecode lengthcom, mv(-8=.\ -9=.a)

Next, a variable is created to indicate tenure (approximated to no. years prior to interview year+1, to avoid minus values):

gen tenure=2003 - lengthcom

We can check our new variable by using tabulate, once again specifying that missing values are included in the table:



tab tenure, m

     tenure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        132        0.21        0.21
          1 |      5,435        8.55        8.76
          2 |      6,348        9.99       18.75
          3 |      3,926        6.18       24.92
          4 |      3,081        4.85       29.77
          5 |      2,478        3.90       33.67
          6 |      1,957        3.08       36.75
          7 |      1,705        2.68       39.43
          8 |      1,379        2.17       41.60
          9 |      1,104        1.74       43.34
         10 |        831        1.31       44.65
         11 |        912        1.43       46.08
         12 |        943        1.48       47.56
        ... (values omitted here) ...
         40 |         28        0.04       63.28
         41 |         30        0.05       63.33
         42 |         25        0.04       63.37
         43 |         18        0.03       63.40
         44 |          8        0.01       63.41
         45 |          7        0.01       63.42
         46 |          5        0.01       63.43
         47 |          6        0.01       63.44
         48 |          4        0.01       63.45
         49 |          7        0.01       63.46
         50 |          6        0.01       63.47
          . |     23,220       36.53      100.00
------------+-----------------------------------
      Total |     63,559      100.00

Another common use for the generate command is to create indicator or ‘dummy’ variables. These are typically binary variables (holding valid values of 1 or 0) which can be used to enter categorical variables into statistical models. For example, we might wish to make a binary indicator variable from tenure, which assigns a value of one to those with ten or fewer years of service, and zero to those with longer service. This identifies respondents with ten years of tenure or less within their current organisation. To do this, the generate command can be used alone or in combination with the cond() function and the if qualifier. Here are two alternative ways of doing this:

gen tenure_a= tenure <=10 if tenure ~=.
(23220 missing values generated)

or alternatively:

gen tenure_b= cond(tenure <=10, 1,0) if tenure ~=.
(23220 missing values generated)



• Create two indicator variables using the above procedures • Cross-tabulate their values using the tabulate command

In the above two commands, we are asking Stata to generate two new variables, tenure_a and tenure_b (similar to the if command in SPSS). If the condition is true (namely, tenure <=10), a code of 1 is assigned; otherwise a code of 0 is given. Observations excluded by the if qualifier (those with a missing value on tenure) are set to missing. Note that in the second command the condition is applied explicitly through cond(), whilst in the first it is implied. Stata evaluates the logical expression in the first command and assigns a value of 1 to those with values less than or equal to 10 on tenure, and 0 to those with other valid values in the variable. If you cross-tabulate these two variables, you will see they are identical:

tab tenure_a tenure_b

   tenure_ |       tenure_b
         a |         0          1 |     Total
-----------+----------------------+----------
         0 |     11963          0 |     11963
         1 |         0      28376 |     28376
-----------+----------------------+----------
     Total |     11963      28376 |     40339

Renaming Variables: If you wish to change the name of a variable, you can use the rename <oldname> <newname> command, e.g.

rename tenure_a tenure_z

Using rename does not alter the variable and value labels associated with a variable.

4.2 Using the replace command to change the values of existing variables

The command replace in combination with if changes variable values based upon one or more combinations of numerical, arithmetical, logical, or relational conditions. In such cases, replace assigns a given value to a variable if the specified conditions are met. The replace command can consequently also be used as a way of creating dummy variables:

gen tenure_c =1 if tenure <=10
replace tenure_c =0 if tenure_c ~=1 & tenure ~=.

The commands label define and label values can also be used in conjunction with replace to assign labels to your new values.

label define tenure_c 1 "<=10 yrs" 0 ">10 yrs"
label values tenure_c tenure_c

Avoiding accidentally changing missing values: In section 3.4, it was noted that although omitted from your analysis, system missing values in Stata are stored as a number higher than all other numerical values. Care must consequently be taken when using the >= “greater than or equal to” or > “greater than” operators. For example, if missing data were present in your analysis and you typed the following:

replace varx =1 if varz >4000

Stata would assess missing values to be greater than 4000 and code them as 1. To avoid this, you would type:

replace varx =1 if varz > 4000 & varz ~= .

N.B. This applies to all uses of the greater than or greater than or equal to operators, and not just when using the replace command.

4.3 Recoding variables

Recode provides a slightly different function to replace. Typically, recode is used to collapse the number of categories in a variable. In the following example, we will create a recoded version of the continuous variable tenure. Our new variable, tenure4, will have four categories. As always when transforming or creating new variables, it is first important to look at the codebook and survey documentation. This is to ensure that we understand what each value means and what range of values exists (i.e. what are the max and min). It also allows us to consider whether there are any missing or inapplicable cases which need to be accounted for. From the above tabulations, we know that the longest tenure is 50 years whereas the shortest is zero years. The latter value indicates those who joined their current organisation in 2003. In order to keep the integrity of the values of the original variable, we will first create a replica variable which we will then recode into four categories:

gen tenure4= tenure
recode tenure4 min/5=1 6/10=2 11/20=3 21/50=4 *=.
label variable tenure4 "num. years in present company"
label define tenure4 1 "0-5yrs" 2 "6-10yrs" 3 "11-20yrs" /*
*/ 4 "21- 50"
label values tenure4 tenure4

Note that if we had not recoded the -8 and -9 values to “.”, the min option denoting the minimum value would include these in the recode. For long commands, we can use /* at the end of one line and */ at the beginning of the next to tell Stata that the text on different lines is part of the same command.

Some notes can be made on the symbols usable within recode statements. Stata understands ‘/’ to mean 'through' (6/10 in the present context thus means 6 through 10). The symbol ‘*’ in the context of recoding instructions means 'remaining' or 'all others'. To those familiar with SPSS, this is similar to the 'else' option. We can also use ‘min’ to denote the minimum value. Just as when using the greater than (>) or greater than or equal to (>=) operators, care must be taken when using the max option so as not to accidentally recode missing values as valid cases.

Instead of using generate and recode as separate commands, you can also use generate as an option of recode. The name of the new variable is defined in the brackets that follow the generate option:

recode tenure min/5=1 6/10=2 11/20=3 21/50=4 *=., generate(tenure4_b)
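To see how the recode rules treat boundary values and missing values, the rules can be applied to a handful of made-up tenure values. This is a sketch under stated assumptions: the seven values are invented, the new variable name tenure4_b is hypothetical, and the rules are written in the parenthesised form with else, which recode also accepts.

```stata
* Toy data (invented): boundary values of each band plus one missing value
clear
input tenure
0
5
6
10
15
50
.
end

* Collapse into four bands; anything else, including missing, becomes "."
recode tenure (min/5 = 1) (6/10 = 2) (11/20 = 3) (21/50 = 4) (else = .), generate(tenure4_b)

* Compare original and recoded values, showing missing values too
tabulate tenure tenure4_b, missing
```

Cross-tabulating the original against the recoded variable, as in the last line, is a quick way to confirm that each value landed in the intended band.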

Exercises and suggested answers

4.1 Return to the example in 3.4, which looks at differences in earnings by marital status and age. Create a recoded variable called ‘age3’ which recodes the variable age into the following categories: 1 = 16-35yrs, 2 = 36-50yrs, 3 = 51-65yrs.
4.2 Using the bysort command again, reconsider how the relationship between marital status and income differs by age and gender using your new variable (age3, married, sex).

Suggested answers:

*Exercise 4.1
tab age,m
gen age3=age
recode age3 min/35=1 36/50=2 51/max=3
label var age3 "Age groups"
label def age3 1 "16-35" 2 "36-50" 3 "51-65"
label val age3 age3
tab age age3, m

*Exercise 4.2
bys age3: tabulate married sex if (status==1&hourpay>0), summ(hourpay) means nofreq



5.0 Graphics in Stata

The graphical capabilities of recent versions of Stata (beyond Stata 8) have been improved considerably compared to earlier versions of the software. This means you can now produce publication quality graphics with relative ease, and without having to transfer data or output to different software. Below, we will consider some techniques for producing simple histograms and two-way scatter plots. These can provide a useful accompaniment to the summary statistics created in Section 3.4.

5.1 Producing a histogram

The menu system is a good way to produce graphics because the structure of the graph commands is different to the usual structure of Stata commands. Because the full range of options available in the graph menus can be excessive, we’ll stick to the Easy graphs option.

• Open the histogram dialogue by choosing Graphs, Easy graphs, Histogram

• Select hourpay using the drop down menu and click OK

[Figure: histogram of gross hourly pay (£). x axis: gross hourly pay (£), 0 to 200; y axis: Density]

Where did the graph appear?



Does the graph include cases where the values are not valid? What is the y axis marked as?

• Use the if/in tab to include only those cases where hourpay>=0
• Click on the Options tab to change 3 options:
o Set the Scheme to s1 monochrome. This will change the colour scheme to a black and white one, which will be better suited to printing
o Set the bin width to 10, to make the bars £10 wide
o Change the y axis to percent
• Click OK to run the graph

What happened to your original graph when your new graph appeared? How do you think you might add a title and notes?

• Try playing around with the various options in the histogram dialogue
• When you’re happy, click OK to run the graph
• Save the graph by right clicking on it. Save the graph in Windows Metafile format (.wmf) to import it into a Word document. N.B. Stata 9 can now also save to .tiff, which may be preferable when producing results for publication.



5.2 Producing a two-way scatter plot

To produce a two-way scatter plot of men’s hourly pay by age:

• Select Graphics, Easy graphs, Scatter plot
• Select age as your x variable and hourpay as your y variable in the Main tab
• Limit the procedure to those cases where hourpay >=1 and sex==0 (male) in the if tab
• Can you locate the appropriate options to replicate the graph below?
• Do this and save the file in a format appropriate to import into a Word document.

[Figure: scatter plot titled "Men's Hourly Pay by Age". x axis: Age (in years), 20 to 70; y axis: Hourly pay in pounds, 0 to 200. Source: Labour Force Survey 2002 Teaching Dataset]

Type help graph for further options
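The menu choices in 5.1 and 5.2 also have command equivalents, which can be useful once you start recording your work. The sketch below is illustrative only and assumes the auto example dataset shipped with Stata rather than the LFS data; price and mpg stand in for hourpay and age, and the bin width, title and note text are arbitrary choices.

```stata
* Example data shipped with Stata (an assumption: not the LFS dataset)
sysuse auto, clear

* Histogram with fixed bin width, percent on the y axis, monochrome scheme
histogram price if price >= 0 & price < ., width(1000) percent scheme(s1mono)

* Two-way scatter plot for a subset of cases, with a title and a source note
twoway scatter price mpg if foreign == 0, title("Price by mileage") note("Source: auto example data") scheme(s1mono)
```

Each new graph command replaces the graph currently shown in the Graph window, so save or export a graph before drawing the next one.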



6.0 Statistical modelling using Stata: A brief introduction One of the most powerful aspects of Stata is its range of capabilities for statistical modelling. Most estimation commands in Stata follow a similar syntactical structure. This means once you have learnt the basic estimation procedure for one command, with a little knowledge, you will be able to produce results using a wide range of procedures. In this section, worked examples of two common forms of statistical models are given, these being multiple linear and logistic regression.

6.1 Example 1: Multiple Linear Regression

Multiple linear regression can be used to consider the extent to which a set of explanatory (independent) covariates predict the values of a continuous outcome (dependent) variable. Introductory texts for these techniques are suggested in Appendix A. In Stata, commands that produce statistical models are referred to as estimation commands. Their general form is illustrated by the basic structure of the linear regression command:

regress depvar [varlist] [weight] [if exp] [in range] [, level(#) beta robust noconstant noheader]

The dependent variable is indicated by the first variable after the regress command. Subsequent listed variables define independent/explanatory variables. Sampling weights (see note 2) can also be defined in the statement, as can conditional statements [if exp] or a range of cases [in range]. These latter two qualifiers allow for the selection of a subset of data. Following the comma in estimation commands, a number of other options can be specified. A full list of options for the regress command can be viewed by searching for ‘regress’ using the help option. For now, we will consider some of the more important options. level(#) sets the confidence level, as a percentage, for the confidence intervals in your results. beta provides standardised coefficients in your output. noconstant suppresses the constant term (intercept) in the model. The option robust specifies a robust estimation of the variance-covariance matrix. noheader suppresses the display of the ANOVA table and summary statistics at the top of the output, so that only the coefficient table is displayed.

• Type help regress to find out about further options available for the regress command.
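As a minimal sketch of how these options attach to the command, the lines below fit the same regression three times with different options. The variable names grsswk, sex and married and the filter status==1 & grsswk>=0 are borrowed from the worked examples later in this section; the choice of 99 as the confidence level is purely illustrative.

```stata
* Sketch only: one model, three sets of options

* 99 per cent confidence intervals instead of the default 95
regress grsswk sex married if status==1 & grsswk>=0, level(99)

* standardised (beta) coefficients, showing the coefficient table only
regress grsswk sex married if status==1 & grsswk>=0, beta noheader

* robust estimate of the variance-covariance matrix
regress grsswk sex married if status==1 & grsswk>=0, robust
```

The coefficients are the same in all three runs; only the way their uncertainty is estimated or displayed changes.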

2 See http://www.esds.ac.uk/government/resources/analysis/ for a guide to sample weighting.



In the following example, we will ask a number of questions relating to ethnicity, gender, and income:

1. Which ethnic groups are more likely to have higher incomes?
2. Are such levels of income related to gender and marital status?
3. Do people who were not born in the UK suffer a ‘nativity penalty’ in terms of lower pay?
4. Do ethnic differences in wages persist after controlling for educational attainment and job experience?

To attempt to answer these questions, we will estimate four models:

Model 1: income = β0 + β1ethnic + ε
Model 2: income = β0 + β1ethnic + β2sex + β3married + ε
Model 3: income = β0 + β1ethnic + β2sex + β3married + β4foreignborn + ε
Model 4: income = β0 + β1ethnic + β2sex + β3married + β4foreignborn + β5education + β6jobtenure + ε

Before proceeding with our models, it is first necessary to prepare the data. In Section 4.1, we considered how indicator variables can be created. These allow categorical variables to be entered into models as sets of binary variables, taking on the values 1 and 0. The variables sex, married and fb (‘foreign born’) are already coded as indicator (dummy) variables in the dataset:

sex (female = 1, male = 0)
married (married = 1, non-married = 0)
fb (foreign born = 1, native born = 0)

In addition to these, we need to create indicator variables for ethnicity. In Section 4, we learnt a few commands we can use to do this, including generate, replace, and recode. In this section, we shall introduce another way, which is particularly useful in modelling contexts: using the char <variablename>[omit] # command in conjunction with the xi: “interaction expansion” prefix. When entering indicator variables for a categorical variable into a model, it is always important to omit one category of the variable. This category acts as the reference group to which all other categories of the variable are compared. The char <variablename>[omit] # command tells Stata which category value (#) of the categorical variable <variablename> is to be omitted when xi: expands it into indicator variables.



Suppose that we wish to use the white category as the reference group for our ethnicity variable.

• Enter the following:

char ethnic[omit] 1

The # in the command syntax is replaced by the number 1, denoting the category which we wish to use as the reference group (in this case 1 = ‘white’). Once specified, Stata remembers the category we have selected and omits it from every model we run (unless we specify another category). We can decide to omit whichever group we prefer. Although this will change the values of the coefficients, it will not change the difference between the coefficients of any two categories of the variable.

Selecting Comparison Groups: When selecting your omitted category, choose a theoretically and empirically meaningful category. For instance, in relation to our ethnicity variable, using the ‘Other’ ethnic category may not be meaningful if we do not have a clear idea about who the ‘Other’ group are. The reference group should also be of a fairly large size, otherwise this could affect the stability of the models. Stata will give a message encouraging the use of another reference category if it considers that the one you have chosen is too small.

Identifying indicator variables using the xi: command

The regress command, like all other estimation commands, does not automatically recognise categorical independent variables listed in our instructions. Consequently, it is necessary to tell Stata that a model contains such variables, by prefixing our regression statement with the xi: “interaction expansion” term. When using xi: we also need to prefix each categorical variable in the model with i. to tell Stata that it is to be split into indicator variables. Otherwise, Stata will treat such categorical variables as continuous, in many cases producing meaningless results. When Stata sees a variable preceded by i. it will treat it as a categorical variable, omitting the category we have declared as the base group. If we haven’t specified a baseline category, Stata by default will treat the category assigned the lowest numerical value as the comparison group.

In each of the following models, we will tell Stata that the models contain categorical variables. We will also filter the sample, selecting those who are in paid employment (status==1) and who do not have missing or inapplicable codes (such as -8) on the weekly gross pay variable (grsswk>=0).

Always know your data before conducting any analysis. Prior to analysis, we might also want to check the distribution of our income variable, check and clean our variables, perform descriptive analyses, handle missing values, and consider potential transformations of our variables. For present illustrative purposes, we shall omit these stages.
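The two preparatory ideas above (expanding a categorical variable into indicators, and filtering the sample) can be tried out before any model is fitted. The following is only a sketch: it assumes the teaching dataset is open and that ethnic, status and grsswk are coded as described in this guide.

```stata
* Sketch: see what xi creates, and how many cases the filter keeps

* declare category 1 ('white') as the reference group
char ethnic[omit] 1

* expand ethnic into _Iethnic_* indicator variables without fitting a model
xi i.ethnic
describe _Iethnic_*

* employees with a valid (non-negative) weekly gross pay value
count if status==1 & grsswk>=0
summarize grsswk if status==1 & grsswk>=0, detail
```

Running xi on its own in this way is a convenient check that the expected category has been omitted before the same i. term is used inside an estimation command.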



• In the first model estimated, we will consider the univariate relationship between ethnicity and gross pay:

xi: regress grsswk i.ethnic if status==1 & grsswk>=0

This should produce the following results in your output screen:

i.ethnic          _Iethnic_1-9        (naturally coded; _Iethnic_1 omitted)

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F(  8, 31559) =    4.94
       Model |  3068172.75     8  383521.594           Prob > F      =  0.0000
    Residual |  2.4508e+09 31559  77656.2294           R-squared     =  0.0013
-------------+------------------------------           Adj R-squared =  0.0010
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  278.67

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -23.72905   17.73292    -1.34   0.181    -58.48626    11.02816
  _Iethnic_3 |  -5.981017   21.49624    -0.28   0.781    -48.11448    36.15245
  _Iethnic_4 |  -118.1102   31.39386    -3.76   0.000    -179.6434   -56.57704
  _Iethnic_5 |   25.67318   14.46086     1.78   0.076    -2.670672    54.01702
  _Iethnic_6 |  -56.42616   22.96276    -2.46   0.014    -101.4341   -11.41826
  _Iethnic_7 |  -115.4301   40.25447    -2.87   0.004    -194.3304   -36.52978
  _Iethnic_8 |    15.2324   36.01186     0.42   0.672    -55.35227    85.81706
  _Iethnic_9 |   34.13716   14.29295     2.39   0.017     6.122419     62.1519
       _cons |   349.0343   1.607448   217.14   0.000     345.8836    352.1849
------------------------------------------------------------------------------

We find that, when only ethnicity is included in the model, ‘Black Other’, ‘Pakistani’ and ‘Bangladeshi’ respondents receive significantly lower incomes than ‘White’ respondents, while the ‘Other’ ethnic group receives significantly higher incomes. The R-squared tells us how well our regression model fits the observed data. For example, a value of 0.23 would tell us that the model accounts for 23 per cent of the variance in the dependent variable. The R-squared for our first model is 0.001, which is very low. If we enter more variables into the model which predict our dependent variable, the fit of our model to the data should improve, and this value should rise.

Storing estimates

If you have opened a log file, the output from your results will be stored permanently. Within its active memory, however, Stata holds results only for the last model run. Once we run a subsequent model, the results of the previous model are lost, and we can no longer perform any further estimations or statistical tests on it. We can prevent this by using the estimates store <model name> command to assign names to models, which we can then use to recall results and perform further estimation procedures at a later time:

est store model1

Below, we will consider how the est command can be used to handle a number of different estimated models simultaneously. In section 6.2, we will also go on to see how est options can be used to perform post-estimation procedures following the initial estimation and storing of our models.
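To make the R-squared figure less of a black box, the sketch below recomputes it as the model sum of squares divided by the total sum of squares. The first line simply types in the figures from the first model's ANOVA table; the remaining lines assume that this regression is still the active estimation result, and use the results that regress leaves behind in e().

```stata
* Sketch: R-squared = model sum of squares / total sum of squares

* typed in from the printed output of the first model
display 3068172.75/2.4538e+09

* the same quantity from the saved results of the last regress
display e(mss)/(e(mss) + e(rss))
display e(r2)
```

The first calculation comes to roughly 0.0013, in line with the R-squared reported in the output.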



• In our second model, we will include gender and marital status:

xi: regress grsswk i.ethni i.sex i.married if status==1&grsswk>=0
est store model2

i.ethnic          _Iethnic_1-9        (naturally coded; _Iethnic_1 omitted)
i.sex             _Isex_0-1           (naturally coded; _Isex_0 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F( 10, 31557) =  473.51
       Model |   320155308    10  32015530.8           Prob > F      =  0.0000
    Residual |  2.1337e+09 31557  67613.0751           R-squared     =  0.1305
-------------+------------------------------           Adj R-squared =  0.1302
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  260.03

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -1.831729   16.55295    -0.11   0.912    -34.27617    30.61271
  _Iethnic_3 |  -5.290509   20.05968    -0.26   0.792    -44.60826    34.02725
  _Iethnic_4 |  -79.97105   29.30674    -2.73   0.006    -137.4134    -22.5287
  _Iethnic_5 |   17.44311   13.49736     1.29   0.196    -9.012241    43.89846
  _Iethnic_6 |  -71.83893   21.42786    -3.35   0.001    -113.8384    -29.8395
  _Iethnic_7 |  -151.5194   37.56624    -4.03   0.000    -225.1507    -77.8881
  _Iethnic_8 |   12.12303   33.60301     0.36   0.718    -53.74018    77.98624
  _Iethnic_9 |   33.56038    13.3376     2.52   0.012     7.418158     59.7026
     _Isex_1 |  -188.2034   2.930134   -64.23   0.000    -193.9466   -182.4602
 _Imarried_1 |   66.28188   2.984302    22.21   0.000     60.43253    72.13123
       _cons |   406.6536   2.795909   145.45   0.000     401.1735    412.1337
------------------------------------------------------------------------------

The results indicate the effect of each variable after controlling for the effects of the other variables in the model. This model suggests that women receive significantly lower incomes than men, and that married people receive higher incomes than non-married people. Through the inclusion of gender and marital status, the R-squared value has increased to 0.13, or 13 per cent explained variance. Since sex and married are already coded as binary dummies, prefixing the two variable names with i. makes no difference to the estimates. You could consequently just write the following:

xi: regress grsswk i.ethni sex married if status==1& grsswk>=0

We can assess whether the inclusion of additional variables significantly improves overall model fit by using the testparm command to produce Wald tests. This is one of many post-estimation commands you can use after fitting your model. The example below follows the version of the model fitted with i.sex and i.married, since only that version contains the _Isex_1 and _Imarried_1 terms:

testparm _Is* _Ima*

 ( 1)  _Isex_1 = 0.0
 ( 2)  _Imarried_1 = 0.0

       F(  2, 31557) = 2344.87
            Prob > F =    0.0000



Note that by inspecting the output, we know that _Isex_1 stands for sex and _Imarried_1 for married, and that no other terms in the model begin with _Is or _Ima, so we can simply use _Is* and _Ima* to stand for the two terms respectively.
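Once two models have been stored, their coefficients can be set side by side rather than compared across separate blocks of output. The line below is a sketch that assumes the names model1 and model2 used with est store above; the display format %9.2f is an arbitrary choice.

```stata
* Sketch: compare the stored models in a single table
* b(%9.2f)    formats the coefficients to two decimal places
* star        marks coefficients by significance level
* stats(N r2) adds the number of observations and the R-squared
estimates table model1 model2, b(%9.2f) star stats(N r2)
```

Each stored model forms one column, which makes it easy to see how the ethnicity coefficients shift once sex and marital status are controlled for.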

• In the third model, we will include the variable fb to control for income differences between people who were born in the United Kingdom and those who were born in another country:

xi: regress grsswk i.ethni sex married fb if status==1&grsswk>=0
est store model3

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F( 11, 31556) =  441.62
       Model |   327352435    11  29759312.3           Prob > F      =  0.0000
    Residual |  2.1265e+09 31556  67387.1429           R-squared     =  0.1334
-------------+------------------------------           Adj R-squared =  0.1331
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  259.59

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -26.30005   16.69402    -1.58   0.115    -59.02099    6.420884
  _Iethnic_3 |  -56.19266   20.62295    -2.72   0.006    -96.61445   -15.77087
  _Iethnic_4 |  -87.46918   29.26673    -2.99   0.003    -144.8331   -30.10525
  _Iethnic_5 |  -22.66187   14.02247    -1.62   0.106    -50.14646    4.822708
  _Iethnic_6 |  -102.5066   21.59687    -4.75   0.000    -144.8373   -60.17588
  _Iethnic_7 |  -187.8878   37.66817    -4.99   0.000    -261.7189   -114.0567
  _Iethnic_8 |   -33.2174   33.83249    -0.98   0.326     -99.5304     33.0956
  _Iethnic_9 |  -11.09539   13.99887    -0.79   0.428    -38.53372    16.34295
         sex |  -188.3054   2.925251   -64.37   0.000     -194.039   -182.5718
     married |   65.41331   2.980497    21.95   0.000     59.57142     71.2552
          fb |   67.33948   6.515964    10.33   0.000     54.56794    80.11103
       _cons |   404.2611   2.800818   144.34   0.000     398.7714    409.7508
------------------------------------------------------------------------------

Surprisingly, people who were not born in the UK (fb==1) have significantly higher incomes than native born respondents (after controlling for ethnicity). The sex and marital status parameters are similar to those in Model 2. However, once sex, marital status and nativity are all controlled for, several of the non-white ethnic groups still have significantly lower incomes than the white category.

• Does the inclusion of fb make a statistically significant contribution to the terms already included in Model 2?

testparm fb

 ( 1)  fb = 0.0

       F(  1, 31556) =  106.80
            Prob > F =    0.0000

The answer is yes, it does.

Instead of just comparing all other ethnic groups to the White category, we can also consider whether there are statistically significant differences between any two particular ethnic groups. This is achieved using the test command to produce Wald tests.



The following example tests whether there are significant differences between the Black Caribbean and Black Other groups, between the Indian and Pakistani groups, and between the Pakistani and Bangladeshi groups:

test _Iethnic_2=_Iethnic_4

 ( 1)  _Iethnic_2 - _Iethnic_4 = 0.0

       F(  1, 31556) =    3.32
            Prob > F =    0.0684

test _Iethnic_5=_Iethnic_6

 ( 1)  _Iethnic_5 - _Iethnic_6 = 0.0

       F(  1, 31556) =   10.03
            Prob > F =    0.0015

test _Iethnic_6=_Iethnic_7

 ( 1)  _Iethnic_6 - _Iethnic_7 = 0.0

       F(  1, 31556) =    3.92
            Prob > F =    0.0477

After controlling for gender, marital status and country of birth, the results indicate that there is no significant income difference between the Black Caribbean and Black Other groups at the 5 per cent level (p = 0.068). Those within the Indian ethnic category, however, received significantly higher incomes than the Pakistani group, and Pakistanis received significantly more income than Bangladeshis, although the latter difference is only marginally significant (p = 0.048). Note that when we tell Stata to test _Iethnic_2=_Iethnic_4, it automatically re-orders the equation so that it becomes _Iethnic_2 - _Iethnic_4 = 0.0.
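The test command reports only whether two coefficients differ. If the size of the difference is also of interest, the lincom command estimates it directly, with a standard error and confidence interval. The sketch below assumes that Model 3 was stored as model3, as above.

```stata
* Sketch: estimate the Indian-Pakistani difference itself
* make Model 3 the active estimation result
est restore model3

* difference between the two coefficients
lincom _Iethnic_5 - _Iethnic_6
```

From the printed coefficients, the point estimate should be about -22.66 - (-102.51), or roughly 79.8, in the units of grsswk.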

• Type ‘help test’. What is the difference between the ‘test’ and ‘testparm’ commands?

• What other variables that predict income do you think need to be added to the model?

Although the above models give some initial indications regarding ethnic differences in income, they are incomplete: many other important explanatory variables for income (such as educational attainment and job tenure) are missing from the model, which could mean that our current estimates are biased. In the next model, variables denoting level of educational attainment and job tenure are included. For simplicity, we will restrict education to a binary variable indicating whether or not a respondent has a higher level qualification (although you can try some more detailed categorisations if you wish; see footnote 3). We can obtain a definition of a higher qualification from the National Qualifications Framework as used for the 2001 Census.

3 The variable hiquald is included in the teaching dataset which gives more detailed categorisation. You may wish to try using this variable as an alternative.



• Create a higher education variable:

gen hieduc=hiqual
replace hieduc=. if hiqual<1
tab hieduc,m

The National Qualifications Framework

Level of qualification | General qualification                    | Vocationally-related qualification | Occupational qualification | Age (typically) | Stage
5                      | Higher education (e.g. BA, BSc, MA, PhD) |                                    | Level 5 NVQ                | 19+             | Post-compulsory education
4                      | Higher education (e.g. BA, BSc, MA, PhD) |                                    | Level 4 NVQ                | 19+             | Post-compulsory education
3 advanced level       | A/AS level                               | Vocational A level (Advanced GNVQ) | Level 3 NVQ                | 17-18           | Post-compulsory education
2 intermediate level   | O level, GCSE grade A-C                  | Intermediate GNVQ                  | Level 2 NVQ                | 15-16           | Compulsory education
1 foundation level     | GCSE grade D-G                           | Foundations GNVQ                   | Level 1 NVQ                | 15-16           | Compulsory education
Entry level            | Certificate of (educational) achievement |                                    |                            | 14 or less      | Compulsory education

• We will define codes 1 to 14 as indicating higher qualifications:

replace hieduc=1 if hiqual>=1 & hiqual<=14
replace hieduc=0 if hiqual>=15 & hiqual<=41
lab var hieduc "Higher educational qualifications"
lab def hieduc 1 "Higher" 0 "No higher qual"
lab val hieduc hieduc

• Check whether the variable is coded correctly:

tab hiqual hieduc, m

                      |       Higher educational
                      |         qualifications
highest qualification | No Higher     Higher          . |     Total
----------------------+---------------------------------+----------
            no answer |         0          0          8 |         8
        higher degree |         0      2,591          0 |     2,591
          nvq level 5 |         0         61          0 |        61
         first degree |         0      5,801          0 |     5,801
         other degree |         0        693          0 |       693
          nvq level 4 |         0        264          0 |       264
diploma in higher edu |         0        599          0 |       599
hnc,hnd,btec etc high |         0      2,076          0 |     2,076
teaching, further edu |         0        160          0 |       160
teaching, secondary e |         0        156          0 |       156
teaching, primary edu |         0        243          0 |       243
teaching, level not s |         0          7          0 |         7
          nursing etc |         0      1,047          0 |     1,047
   rsa higher diploma |         0         40          0 |        40
other higher educatio |         0        326          0 |       326
          nvq level 3 |     1,421          0          0 |     1,421
        gnvq advanced |       402          0          0 |       402
a level or equivalent |     3,958          0          0 |     3,958
 rsa advanced diploma |        84          0          0 |        84
ond,onc,btec etc, nat |     1,156          0          0 |     1,156
city & guilds advance |     1,839          0          0 |     1,839
        scottish csys |        62          0          0 |        62
sce higher or equival |       548          0          0 |       548
a,s level or equivale |        61          0          0 |        61
 trade apprenticeship |     3,932          0          0 |     3,932
nvq level 2 or equiva |     1,597          0          0 |     1,597
    gnvq intermediate |       217          0          0 |       217
          rsa diploma |        91          0          0 |        91
  city & guilds craft |       475          0          0 |       475
btec,scotvec first/ge |       109          0          0 |       109
o level, gcse grade a |     9,162          0          0 |     9,162
nvq level 1 or equiva |       281          0          0 |       281
 gnvq,gsvq foundation |        30          0          0 |        30
cse below grade1,gcse |     1,951          0          0 |     1,951
btec,scotvec first/ge |        22          0          0 |        22
      scotvec modules |        71          0          0 |        71
            rsa other |       548          0          0 |       548
  city & guilds other |       184          0          0 |       184
   yt,ytp certificate |        74          0          0 |        74
  other qualification |     4,308          0          0 |     4,308
    no qualifications |     6,986          0          0 |     6,986
           don't know |       147          0          0 |       147
----------------------+---------------------------------+----------
                Total |    39,716     14,064          8 |    53,788
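The cross-tabulation above is a visual check. As a complementary sketch, the same indicator can be rebuilt in a single generate statement and compared with hieduc automatically. The name hieduc2 is hypothetical and is dropped again at the end; the code ranges 1 to 14 and 15 to 41 are those used above.

```stata
* Sketch: rebuild the indicator in one step and confirm that it agrees
* the expression in brackets is 1 for codes 1-14 and 0 otherwise;
* the if qualifier leaves all codes outside 1-41 as missing
gen hieduc2 = (hiqual>=1 & hiqual<=14) if hiqual>=1 & hiqual<=41

* assert stops with an error if any observation disagrees
assert hieduc2 == hieduc
drop hieduc2
```

If assert returns silently, the two versions match on every observation, including those left missing.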

We can also use our job tenure variable (created in section 4.1), and create a quadratic job tenure variable (tenure squared), since although wages may increase with job tenure, this could be at a diminishing rate:

ge tensq = tenure^2

The model with all of the variables is therefore:

xi: regress grsswk i.ethni sex married fb hieduc tenure tensq if status==1&grsswk>=0
est store model4

      Source |       SS       df       MS              Number of obs =   31567
-------------+------------------------------           F( 14, 31552) =  871.22
       Model |   684113077    14  48865219.8           Prob > F      =  0.0000
    Residual |  1.7697e+09 31552  56088.3668           R-squared     =  0.2788
-------------+------------------------------           Adj R-squared =  0.2785
       Total |  2.4538e+09 31566  77735.9572           Root MSE      =  236.83

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -14.46197   15.23146    -0.95   0.342    -44.31623     15.3923
  _Iethnic_3 |  -61.14157   18.82518    -3.25   0.001    -98.03965   -24.24348
  _Iethnic_4 |  -47.27793   26.70726    -1.77   0.077    -99.62521     5.06935
  _Iethnic_5 |  -35.92875   12.79591    -2.81   0.005    -61.00924   -10.84826
  _Iethnic_6 |  -87.28629   19.71208    -4.43   0.000    -125.9227   -48.64984
  _Iethnic_7 |  -137.6793   34.37475    -4.01   0.000    -205.0552   -70.30348
  _Iethnic_8 |  -59.08771   30.86834    -1.91   0.056    -119.5909    1.415435
  _Iethnic_9 |  -6.121715   12.77978    -0.48   0.632    -31.17059    18.92716
         sex |  -179.9912   2.681839   -67.11   0.000    -185.2478   -174.7347
     married |   40.47744   2.801253    14.45   0.000     34.98687      45.968
          fb |   62.48901   5.949279    10.50   0.000     50.82819    74.14983
      hieduc |   211.7115    2.93175    72.21   0.000     205.9652    217.4579
      tenure |   5.394739   .1689575    31.93   0.000     5.063576    5.725903
       tensq |  -.0026798   .0000878   -30.53   0.000    -.0028519   -.0025078
       _cons |   307.4406   2.923535   105.16   0.000     301.7104    313.1709
------------------------------------------------------------------------------
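Because tensq enters with a negative coefficient, the fitted relationship between pay and tenure rises and then flattens. The point at which it stops rising is where the slope, b_tenure + 2 × b_tensq × tenure, equals zero. The sketch below computes that point in two ways; the first line assumes the full model is still the active estimation result.

```stata
* Sketch: tenure value at which the fitted quadratic reaches its peak

* from the coefficients of the active regression
display -_b[tenure]/(2*_b[tensq])

* the same calculation typed in from the printed coefficients
display 5.394739/(2*.0026798)
```

With the printed coefficients this works out at roughly 1,007, expressed in whatever units the tenure variable is measured in. Whether that value lies inside the observed range of tenure is worth checking with summarize before interpreting it.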

Although having a higher level qualification and greater job tenure increase wages, the wage penalty for many of the minority ethnic groups persists after controlling for such differences. The quadratic term for job tenure indicates that wages increase with tenure, albeit at a diminishing rate. The estimates command offers a number of functions for handling multiple models. If we forget which of our stored models we are currently using, we can use estimates query:

• Type est query:

(the active estimation result is model3)

We can also look at the directory of stored estimates using the est dir command:

est dir

-------------------------------------------------------
       model | command      depvar       npar  title
-------------+-----------------------------------------
      model1 | regress      grsswk          9
      model2 | regress      grsswk         11
      model3 | regress      grsswk         12
      model4 | regress      grsswk         15
-------------------------------------------------------

If we wish to see the results for model 1 again, we could use the estimates replay <model name> command. Alternatively, to change which model is currently active (i.e. the one on which we are performing post-estimation procedures) we can use est restore:

est restore model1
(results model1 are active now)

If we want to erase specific models from memory, we can use the estimates drop <model name> command to make room for another model. Alternatively, estimates clear drops all stored models from memory.

• Type help estimates for further details of these commands.

It is now necessary to check whether the assumptions of OLS regression are violated, to consider the influence of outliers, to assess the model fit, and so on (see the texts in the resources for learners section, Appendix A).
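As a starting point for those checks, the sketch below lists a few standard post-estimation commands that can be run after regress in Stata 9. It is not a complete diagnostic routine, and the variable name res is hypothetical.

```stata
* Sketch: some basic checks after fitting a model with regress

* residuals plotted against fitted values
rvfplot, yline(0)

* save the residuals for the estimation sample and inspect their distribution
predict res if e(sample), residuals
histogram res, normal

* test for heteroskedasticity, and variance inflation factors
estat hettest
estat vif
```

The if e(sample) qualifier restricts the residuals to the observations actually used in the model, which matters here because the models were fitted on a filtered subset of the data.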

• What variables would you add into this model?
• Would it be more sensible to model male and female income in separate models?



6.2 Example 2: Multiple Logistic Regression

The table below indicates some of the different modelling strategies and associated commands for varying types of dependent variables. From this, it can be seen that in many cases in social research the dependent variable of interest will not be continuous but categorical. In this section, we provide an example of modelling a binary categorical dependent variable using multiple logistic regression.

Types of dependent variables

                                | Continuous        | Binary              | Multinomial                  | Ordinal
Examples                        | Income            | Long-term illness   | Employment status            | Levels of schooling
Modelling techniques involved   | Linear regression | Logistic regression | Multinomial logit regression | Ordinal logit regression
Stata techniques                | regress           | logistic/logit      | mlogit                       | ologit
Type of independent variables   | Any               | Any                 | Any                          | Any

Logistic models provide a common solution for modelling binary outcomes. A number of reasons exist for using a logistic as opposed to a linear model when analysing a dichotomous dependent variable. Firstly, the predicted probability of an event occurring for a binary response must lie between 0 (non-occurrence) and 1 (occurrence). A problem with using a linear predictor is that it may give any value between minus infinity and plus infinity. Furthermore, the observed values for a binary outcome do not follow a normal distribution, violating assumptions of the linear regression model. Logistic regression models provide a strategy that overcomes these problems. They also provide easily interpretable results through their construction around odds ratios. In the logistic model, the log odds of an event (as measured by the dependent variable) occurring is modelled:

log(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk

where p is the probability that the dependent variable Y=1, X1, X2, … Xk are independent variables, and β0, β1, β2, … βk are the regression coefficients estimated from the data.

Depending on the research topic, the dependent variable could indicate anything that can be considered in terms of binary states (where either y=1 or y=0). In the following example, we will use logistic regression to predict economic activity, where y=1 if a person is economically active, and y=0 if they are economically inactive. Before we do this, some further terms require explaining.

‘Odds’ indicate the probability (p) of an event occurring divided by the probability of it not occurring (1-p), so are expressed mathematically as p/(1-p). Odds are more commonly recognised for their use in gambling, where a bookmaker may give odds against a horse winning. In such contexts, they are more commonly expressed as fractions rather than decimals. Thus, when the odds against a horse winning are 4/1, in decimal this is expressed as an odds of 4.0. If the odds against a horse winning were 1/4, in decimal format this would be 0.25.

The odds ratio, in contrast, gives the ratio of two sets of odds, i.e. [p1/(1-p1)] / [p0/(1-p0)], where p1 and p0 are the probabilities of the event in the two groups being compared. For a continuous independent variable, the odds ratio indicates how a unit increase in the independent variable changes the odds of the dependent variable: for example, how one extra year of age increases or decreases the odds of an event occurring. For a categorical variable, the odds ratio indicates the odds of an event occurring for one level of a variable divided by the odds of the event occurring for the selected base comparison group. For example, if a dependent variable equals 1 if a person is employed, and 0 otherwise, and an independent variable sex indicates 1 for women and 0 for men (men are the base category), the odds ratio for the variable sex indicates the odds of women being employed (p/(1-p) for women) divided by the odds of men being employed (p/(1-p) for men). Thus if the odds of women being employed are higher than the odds for men, the odds ratio (women’s odds/men’s odds) will be above 1. If the odds of women being employed are lower than for men, the odds ratio will be below 1. The odds ratio thus gives information on the relative size of the odds of the dependent variable equalling 1 for the different groups.

In Stata, the logit and logistic commands can both be used to perform logistic regression. These are identical, bar the manner in which the results are displayed. Whereas logit reports coefficients on the log odds scale, logistic reports odds ratios. For interpreting coefficients in logit models, odds ratios can be calculated by taking the exponential of the regression coefficients to unlog the odds. However, using the logistic command in Stata will give you odds ratios anyway.
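The arithmetic behind these definitions can be tried directly with display. In the sketch below, the probabilities 0.8 and 0.5 are made-up values for two groups, used only to illustrate the steps from probability to odds to odds ratio.

```stata
* Sketch: from probabilities to odds to an odds ratio
* (0.8 and 0.5 are hypothetical probabilities for two groups)

* odds for each group
display 0.8/(1 - 0.8)
display 0.5/(1 - 0.5)

* odds ratio: first group relative to second
display (0.8/(1 - 0.8))/(0.5/(1 - 0.5))

* a coefficient b on the log odds scale corresponds to an odds ratio of exp(b)
display exp(ln(4))
```

The two sets of odds are 4 and 1, so the odds ratio is 4. The last line shows that taking the exponential of the log of that odds ratio returns the odds ratio itself, which is the relationship between logit coefficients and logistic output.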

• Compare the command structure for logistic regression to that for linear regression in the previous section.

logistic depvar [varlist] [weight] [if exp] [in range] [, options]

In the following worked example, we will continue to explore the themes of gender, ethnicity, and the labour market. Instead of looking at wages, this time we will use a discrete outcome variable indicating whether a person is economically active or not. We will ask the following questions:

1. Are some ethnic groups more likely to be economically active than others?
2. How do levels of economic activity differ by gender?
3. Does the probability of being economically active differ by level of educational attainment?



We will define economic activity to denote whether a person is in paid or self-employment, or is an unpaid family worker, but omit those who are economically active but unemployed (by contrast, the International Labour Office definition of economic activity includes the unemployed). In passing, other forms of non-waged work such as childrearing or (non-family) voluntary work make an economic contribution that could be considered as economic activity. From the example models in the preceding section, we know that a number of variables may be relevant to explaining levels of economic activity. These include sex, marital status, and whether a respondent was born within the UK or not. We will also restrict our analysis to working age people, selecting a sample of 19-59 year olds for both men and women. We can do this by using the if conditional statement in our estimation command. Alternatively, we can use the keep command to select our sample prior to estimation, and permanently remove those aged under 19 or aged 60 and above from our dataset.

• Type the following:

keep if age >= 19 & age < 60

If you wish to save this dataset, give it a different name to distinguish it from the original one. To attempt to answer the above questions, three models will be used:

Model 1: active = β0 + β1ethnic + ε
Model 2: active = β0 + β1ethnic + β2sex + β3married + β4foreignborn + ε
Model 3: active = β0 + β1ethnic + β2sex + β3married + β4foreignborn + β5hieduc + ε

A binary dependent variable is required to indicate whether a person is economically active or not. We will derive this variable from nstat, which gives the economic status of respondents in their main job. We will select the employed (nstat==1), self-employed (nstat==2), and unpaid family workers (nstat==4) for our indicator of economic activity.

• Enter the following:

ge active = 0
label var active "economically active"
replace active = 1 if nstat == 1 | nstat == 2 | nstat == 4
codebook active
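After deriving a variable like this, it is worth confirming that it lines up with its source. The sketch below cross-tabulates active against nstat and then makes the same check automatically; it assumes the commands above have just been run.

```stata
* Sketch: confirm that active was derived from nstat as intended
tab nstat active, m

* every case coded as active should come from nstat categories 1, 2 or 4
assert inlist(nstat, 1, 2, 4) if active == 1
```

Note that ge active = 0 assigns 0 to every case, including any with a missing value of nstat, so the tabulation with the m option is also a way to see whether such cases exist.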



Before conducting any analysis using logistic regression, it is important to tabulate your dependent variable against each of your categorical explanatory variables. As well as providing an initial exploration of each relationship, this lets you check that a given categorical explanatory variable does not perfectly predict your dependent variable. By perfect prediction, we mean a case where one category of an explanatory variable is represented at only one level of your dependent variable (i.e. all of its cases have ‘zero’, or all have ‘one’, on the dependent variable). In such cases, it will be impossible to derive an odds ratio for the corresponding indicator variable, because the odds for that category are either zero or undefined. We therefore need to check that each level of a categorical variable is represented in both states of the dependent variable prior to using the xi: command. If this is not the case, we can consider recoding or collapsing the number of categories. A chi-square test (chi2) can be used to consider the level of association between your dependent and categorical variables.

• Using the variable ‘hieduc,’ derived in the preceding section, type:

ta active hieduc, col chi2

           |  Higher educational
economical |    qualifications
 ly active | A level o     Higher |     Total
-----------+----------------------+----------
         0 |    10,385      1,706 |    12,091
           |     26.15      12.13 |     22.48
-----------+----------------------+----------
         1 |    29,331     12,358 |    41,689
           |     73.85      87.87 |     77.52
-----------+----------------------+----------
     Total |    39,716     14,064 |    53,780
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  1.2e+03   Pr = 0.000

From the above table, we can see that both levels of our higher qualification variable are well represented in the two states of our outcome variable, active. Around 88 per cent of those with a higher level qualification are economically active, compared to around 74 per cent of those who do not have a higher level qualification. We might therefore expect higher qualification to be a significant predictor of economic activity in our final model.
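The cell counts in this table are enough to work out an unadjusted odds ratio by hand, which is a useful way to see what the logistic command will later report. The sketch below types in the counts from the table; the final line assumes active and hieduc exist as created above.

```stata
* Sketch: unadjusted odds ratio for hieduc from the cell counts

* odds of being active, with a higher qualification
display 12358/1706

* odds of being active, without a higher qualification
display 29331/10385

* odds ratio: higher versus no higher qualification
display (12358/1706)/(29331/10385)

* the same odds ratio from a one-variable logistic regression
logistic active hieduc
```

The two sets of odds are roughly 7.2 and 2.8, giving an odds ratio of about 2.6: the odds of being economically active are a little over two and a half times higher for those with a higher qualification, before any other variable is taken into account.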

• We can also tabulate economic activity by ethnicity:

ta ethnic active, row chi2

(row percentages)

           |  economically active
 ethnicity |         0          1 |     Total
-----------+----------------------+----------
     white |    10,713     39,354 |    50,067
           |     21.40      78.60 |    100.00
-----------+----------------------+----------
 caribbean |       147        359 |       506
           |     29.05      70.95 |    100.00
-----------+----------------------+----------
   african |       133        244 |       377
           |     35.28      64.72 |    100.00
-----------+----------------------+----------
 black oth |        65        108 |       173
           |     37.57      62.43 |    100.00
-----------+----------------------+----------
    indian |       256        636 |       892
           |     28.70      71.30 |    100.00
-----------+----------------------+----------
    pstani |       255        244 |       499
           |     51.10      48.90 |    100.00
-----------+----------------------+----------
    bdeshi |       125         82 |       207
           |     60.39      39.61 |    100.00
-----------+----------------------+----------
   chinese |        62         95 |       157
           |     39.49      60.51 |    100.00
-----------+----------------------+----------
     other |       331        561 |       892
           |     37.11      62.89 |    100.00
-----------+----------------------+----------
     Total |    12,087     41,683 |    53,770
           |     22.48      77.52 |    100.00

          Pearson chi2(8) = 664.9402   Pr = 0.000

From this, we can see that levels of economic activity appear lowest amongst those within the Bangladeshi ethnic group (around 40 per cent), followed by the Pakistani ethnic group (around 49 per cent).

Regarding which variables to include in your model, Hosmer and Lemeshow (2000) suggest that variables which exhibit significance at the p < 0.25 level at the univariate stage should be taken forward into the multivariate analysis. If the p = 0.05 level is used at the univariate stage to omit variables, we may risk throwing away important variables that may exhibit significance in the multivariate analysis. You might also consider including variables that are of theoretical or clinical importance in your model even if they do not reach such significance. For the above table, the association between ethnicity and economic activity is highly significant (p<0.01). Estimating univariate logistic regression models provides another way to assess the significance of variables, and is useful for continuous variables, where the use of cross-tabulations may be less appropriate.

• For our first model:

xi:logistic active i.ethni if hieduc~=.

Logistic regression                               Number of obs   =      53765
                                                  LR chi2(8)      =     566.34
                                                  Prob > chi2     =     0.0000
Log likelihood = -28366.142                       Pseudo R2       =     0.0099

------------------------------------------------------------------------------
      active | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |   .6647224   .0654912    -4.15   0.000     .5479948    .8063141
  _Iethnic_3 |   .4993458   .0540954    -6.41   0.000     .4038207    .6174676
  _Iethnic_4 |   .4522448   .0711659    -5.04   0.000     .3322211    .6156303
  _Iethnic_5 |   .6762081   .0505908    -5.23   0.000     .5839791    .7830029
  _Iethnic_6 |   .2604431   .0234958   -14.91   0.000     .2182337    .3108163
  _Iethnic_7 |    .178553   .0254486   -12.09   0.000     .1350355    .2360946
  _Iethnic_8 |   .4170567   .0682422    -5.34   0.000     .3026319    .5747454
  _Iethnic_9 |   .4627135   .0324942   -10.97   0.000     .4032145    .5309922
------------------------------------------------------------------------------

• Store the first model to memory calling it “model1”:

est store model1

• The second model includes variables for sex, marital status, and whether a person was born within the United Kingdom:

xi:logistic active i.ethni i.sex i.married i.fb if hieduc~=.

Logistic regression                               Number of obs   =      53765
                                                  LR chi2(11)     =    2132.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -27583.019                       Pseudo R2       =     0.0372

------------------------------------------------------------------------------
      active | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |   .8012041   .0814019    -2.18   0.029     .6565405    .9777431
  _Iethnic_3 |   .6282631   .0725774    -4.02   0.000     .5009682    .7879033
  _Iethnic_4 |   .5426481     .08717    -3.81   0.000     .3960799    .7434533
  _Iethnic_5 |   .7479508   .0600289    -3.62   0.000     .6390833    .8753639
  _Iethnic_6 |   .2752299   .0263188   -13.49   0.000     .2281915    .3319645
  _Iethnic_7 |   .1888436   .0280893   -11.21   0.000     .1410884    .2527628
  _Iethnic_8 |   .4812966   .0815334    -4.32   0.000     .3453155    .6708254
  _Iethnic_9 |    .540557   .0418911    -7.94   0.000     .4643834    .6292254
     _Isex_1 |   .4652456     .01016   -35.04   0.000     .4457525     .485591
 _Imarried_1 |   1.428184   .0304908    16.69   0.000     1.369656    1.489213
      _Ifb_1 |   .7735376   .0319667    -6.21   0.000     .7133542    .8387985
------------------------------------------------------------------------------

est store model2

From earlier cross-tabulations, we saw that whether a person had a higher level qualification was strongly associated with being economically active. In our third model, we will include the higher education derived variable (hieduc) to test whether differences in educational attainment account for some of the observed differences in economic activity by ethnic group. If so, we might expect the coefficients for the ethnicity variable to change as the effects of educational attainment are controlled for:

• ..so for the third model:

xi:logistic active i.ethni i.sex i.married i.fb hieduc if hieduc~=.

Logistic regression                               Number of obs   =      53765
                                                  LR chi2(12)     =    3423.42
                                                  Prob > chi2     =     0.0000
Log likelihood = -26937.604                       Pseudo R2       =     0.0597

------------------------------------------------------------------------------
      active | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |   .8353799   .0859731    -1.75   0.081     .6827828    1.022081
  _Iethnic_3 |   .5896085   .0699287    -4.45   0.000     .4673148    .7439057
  _Iethnic_4 |   .5613082    .091286    -3.55   0.000     .4081029    .7720279
  _Iethnic_5 |   .7197111   .0590272    -4.01   0.000     .6128396    .8452197
  _Iethnic_6 |   .2937087   .0285837   -12.59   0.000     .2427046    .3554314
  _Iethnic_7 |   .2151108   .0324225   -10.19   0.000     .1600903    .2890409
  _Iethnic_8 |   .4547321   .0787488    -4.55   0.000     .3238529    .6385039
  _Iethnic_9 |   .5227579   .0412819    -8.21   0.000     .4477975    .6102665
     _Isex_1 |   .4599479   .0101594   -35.16   0.000     .4404607    .4802972
 _Imarried_1 |   1.440996   .0311539    16.90   0.000     1.381211    1.503369
      _Ifb_1 |   .7512998   .0317853    -6.76   0.000     .6915148    .8162536
      hieduc |   2.616867   .0752294    33.46   0.000     2.473497    2.768547
------------------------------------------------------------------------------

est store model3

The hieduc variable indicates that the odds of being economically active for those with a higher level educational qualification are 2.62 times the odds for those without such a qualification, and that this finding is highly statistically significant (p<0.01). Yet despite controlling for educational attainment, differences in economic activity by ethnicity persist. For example, the odds of Pakistani respondents being economically active were 0.29 times those of white respondents (p<0.01).

To illustrate the interpretation of odds ratios that are below one, we can invert the odds ratio, swapping the numerator and denominator, by dividing 1 by the odds ratio:

Hypothetical example: odds ratio 1/5 = 0.2; inverted: 5/1 = 5, or 1/0.2 = 5

Doing this for the above example tells us that the odds of white respondents being economically active were about 3.4 times those of Pakistani respondents:

1/odds ratio = 1/0.2937 = 3.40

Since we have stored our models using the est store command, we can now use some of the post-estimation est commands:
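The inversion of an odds ratio can also be carried out within Stata itself. The lines below are a minimal sketch, assuming that model3 has been stored as above and that the Pakistani group corresponds to the dummy variable _Iethnic_6, as the ordering of the cross-tabulation suggests. After logistic, _b[] holds coefficients on the log-odds scale, so exp(_b[...]) recovers the odds ratio.

```stata
* Invert the model3 odds ratio for _Iethnic_6 by typing the value directly
display 1/0.2937087

* The same calculation using the stored results instead of retyping the number
estimates restore model3
display 1/exp(_b[_Iethnic_6])
```

Both lines should return a value of roughly 3.40. Working from the stored results avoids carrying rounding error over from the printed table.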

• The est table command can be used to visually compare output from our different models. This allows us to compare the values of coefficients following the addition of further variables. This may be of interest for considering the relative influence of a given covariate before and after statistically controlling for the effects of another added variable.

• The star and eform options tell Stata to show stars next to parameter estimates to indicate level of statistical significance, and to display the results in ‘exponentiated form’ respectively (in this case as odds ratios).



est table model1 model2 model3, star eform

--------------------------------------------------------------
    Variable |     model1          model2          model3
-------------+------------------------------------------------
  _Iethnic_2 |    .66472241***    .80120407*      .83537987
  _Iethnic_3 |    .49934579***    .62826308***    .58960846***
  _Iethnic_4 |    .45224483***    .54264806***    .56130815***
  _Iethnic_5 |    .67620809***    .74795082***    .71971115***
  _Iethnic_6 |     .2604431***    .27522987***    .29370875***
  _Iethnic_7 |    .17855296***    .18884356***    .21511075***
  _Iethnic_8 |    .41705672***    .48129661***    .45473214***
  _Iethnic_9 |    .46271346***    .54055699***    .52275793***
     _Isex_1 |                    .46524555***    .45994785***
 _Imarried_1 |                    1.4281837***     1.440996***
      _Ifb_1 |                     .7735376***     .7512998***
      hieduc |                                    2.6168669***
       _cons |      3.67398***    4.7307938***    3.8771301***
--------------------------------------------------------------
                      legend: * p<0.05; ** p<0.01; *** p<0.001

We can also test whether the inclusion of the hieduc variable significantly improves overall model fit. The logistic model uses maximum likelihood estimation. This means that in addition to Wald tests, the Likelihood Ratio test can be used to test whether the inclusion of a variable increases the fit of a model. This test assesses whether an added variable significantly increases the likelihood:

• Type:

lrtest model2 model3

Likelihood-ratio test                                  LR chi2(1)  =   1290.83
(Assumption: model2 nested in model3)                  Prob > chi2 =    0.0000

Stata treats the smaller model (without the extra included variable(s)) as nested within the larger one, telling us whether the greater number of variables significantly improves the likelihood, and so model fit. In this case it does.

N.B. The order in which you list the models you wish to compare is irrelevant as they are ordered in the analysis by size. When using post-estimation commands such as lrtest, it is important that the sample sizes used for both models are the same, otherwise Stata will not calculate the statistic. If the hieduc education variable had missing values (which were not accounted for in previous models), then model three would have fewer observations than model two. This is why the condition if hieduc~=. is applied to every model above.
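To see where the likelihood-ratio statistic comes from, it can be reproduced by hand from the log likelihoods printed in the output for models two and three. The sketch below assumes both models have been stored under the names used above. It first lists the number of observations and log likelihood of each model side by side, then computes twice the difference in log likelihoods and its p-value on one degree of freedom.

```stata
* Confirm that both models use the same number of observations
estimates table model2 model3, stats(N ll) eform b(%9.3f)

* LR statistic: 2 x (log likelihood of larger model - log likelihood of smaller model)
display 2*(-26937.604 - (-27583.019))

* p-value from the chi-square distribution with 1 degree of freedom
display chi2tail(1, 2*(-26937.604 - (-27583.019)))
```

The first display line should give approximately 1290.83, matching the LR chi2(1) value reported by lrtest.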

• Try another recoding of the education variable using a greater number of categories. Does this give a different pattern of results?

• What other explanatory variables would you include in this model?



• Are there other factors that might be more difficult to measure in survey data that could affect differences in pay and economic activity between ethnic groups?

Stata has a wealth of other post-estimation commands. Type ‘help logistic’ or ‘help logistic postestimation’ for further details.



7.0 Do-files: Using and saving commands

Up until now, we have focused upon two main ways of instructing Stata. These were entering commands through the input window and using the menu system. When conducting more complicated coding or analysis it is preferable to write, structure, and organise larger chunks of commands before running them in Stata. Do-files provide a valuable function for this task, allowing you to re-run previous commands or analyses, and specify further options without re-typing everything into the command window again.

7.1 Saving your commands in a .do file

A do-file is a set of commands that can be saved, edited and used again in the future. Whilst conducting your analysis, it is helpful to produce a single do-file (or collection of ordered files) that takes you from your original data, through your data manipulation, to your complete analysis. These files are saved with a .do extension. One way you can save the commands you have run is by saving the contents of the review window.

• Click on the review tab to make the review window visible (if it is not already)

• Right click in the window and select save review contents in the pop-up menu
• You will then be prompted for a name and location to save the file to

However, it is more advisable to set up a do-file before running any commands, and to run your commands directly from this file.

• To write your own do-file now, go to the menu at the top of the screen



• Click on the above icon

This opens the do-file editor where you can write your file. If you prefer, you can use Microsoft Word or another word processor instead to create your text, provided that you save your document as a text file with the .do extension. The editor should have this menu bar at the top of an otherwise blank screen.

The do-file editor has some commands that will be familiar from word processors, such as save, cut & paste, print etc. The find and replace commands may be particularly useful for larger variable recoding tasks. Now you can type in a few commands.

• Type:

use <filename>, clear
set more off
sum hourpay if hourpay>0
tab sex if hourpay>0, sum(hourpay)

Remember to substitute the name and location of your teaching dataset file for “<filename>”.

This set of commands does the following:

• Clears any data out of memory and opens the file
• Sets an option so that you are not prompted to press more at the end of each screen of output
• Produces the summary descriptive statistics for the hourpay variable for cases where hourpay is greater than zero
• Produces a table containing the summary statistics for each sex for cases where hourpay is greater than zero.

You have used ‘set more off’ to stop Stata from pausing for a prompt. This happens by default when your output fills your results window. To run these commands, click on



Then minimise the do-file editor by clicking on the minimise button in the top right-hand corner. You should now be back to your original Stata screen. You’ll see in the main results window that all the commands in your do-file have been executed. The end of the do-file is denoted by the line “end of do-file”.

It’s good practice to add notes to your commands, particularly to explain what it is that you have done and why. Comments are preceded by two forward slashes (//) or by an asterisk ‘*’.
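The following is a minimal sketch of how the commands above might look as a commented do-file. The file names example.do and example.log are hypothetical, and the log commands are an optional addition that keeps a record of the output; as before, substitute the name and location of your teaching dataset for <filename>.

```stata
* example.do (hypothetical file name)
* Lines beginning with an asterisk are comments
// Two forward slashes also introduce a comment

capture log close                 // close any log left open by an earlier run
log using example.log, replace    // record the output in a text log file

use <filename>, clear             // substitute your teaching dataset file
set more off                      // do not pause at each screen of output

* Summary statistics for hourly pay, excluding zero values
sum hourpay if hourpay>0

* Summary statistics for hourly pay by sex
tab sex if hourpay>0, sum(hourpay)

log close
```

Once saved, a do-file can also be run from the command window by typing do followed by the file name, for example do example.do.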

Do-files in Stata Version 10

In Stata Version 10, the do-file editor has been updated with some added functionality. The main change is that users can now open up multiple do-files in one do-file editor. These are represented as different tabs along the top of the editor:

• Clicking on untitled1.do or untitled2.do in the above example will allow you to move between the different do files.

• When you open existing or new do files, these will appear as extra tabs along the tab bar.



Appendix A. Resources for Learners

Other useful ESDS Resources

The ESDS resources pages contain links to:

• A Guide to Weighting the Social Surveys
• A range of other useful resources and links to relevant external sites, including the Practical Exemplars on Analysing Surveys site, which contains guidance on analysing complex surveys using Stata (see http://www.napier.ac.uk/depts/fhls/peas/)

Labour Force Survey

• Information about the LFS is also available on the ESDS site at http://www.esds.ac.uk/government/lfs/ including the questionnaires.

The Stata website

The website can be found at http://www.stata.com. Resources include:

• Guides to getting started with Stata
• Information on future Stata-based courses
• Information on how to join the Stata email list
• Help
• A bookstore

You will also find links to web based resources (see: http://www.stata.com/links/resources1.html ) for example the clear and helpful An Introduction to Stata 8 by Svend Juul.

Stata Manuals

The content of the Stata manuals is very similar to the help menus, although these books will provide a good printed resource for regular users.

Books: Statistics with Stata

There are two main general introductions to using Stata. Both of these are newly released to reflect the changes to the software.

• Statistics with Stata (Updated for Version 9) Lawrence C. Hamilton (2006) Duxbury

• A Handbook of Statistical Analyses using Stata, 3rd ed Sophia Rabe-Hesketh, Brian Everitt (2004) Chapman and Hall

Both of these books include reasonably advanced topics. Rabe-Hesketh and Everitt have a particular emphasis on longitudinal data and epidemiology. Hamilton contains sections on regression diagnostics and data reduction.



There is also a range of more advanced, specialised Stata books, including those on topics such as categorical regression and survival modelling. Information about these is available on the Stata website (see above).

Textbooks on linear regression:

Draper, N. R. & Smith, H. (1999) Applied Regression Analysis, New York: Wiley.
Kohler, U. & Kreuter, F. (2005) Data Analysis Using Stata, Texas: Stata Press.

Textbooks on logistic regression:

Hosmer, D. W. & Lemeshow, S. (2000) Applied Logistic Regression (2nd ed.), New York: Wiley.
Long, J. S. & Freese, J. (2006) Regression Models for Categorical Dependent Variables Using Stata (2nd ed.), Texas: Stata Press.



Appendix B. Entering and transferring data into Stata

Sometimes it may be necessary to enter data from other sources into Stata. Most commonly this will be required when we have new raw data from our own research or data in other formats. The following three examples are considered: (1) how to input data directly in Stata; (2) how to get data in other formats into Stata format and use them; (3) how to use StatTransfer.

Entering data directly in Stata

Suppose we have raw data from a survey sample which we wish to enter directly into Stata.

• Clear any current data from Stata’s memory using the clear command
• Go to the Data Editor (not the Data Browser).

In the following example, a hypothetical dataset composed of four variables and eight records is considered.

• Put the cursor in row 1 and column 1 and begin to enter the below data.

By default, the variables are called var1 var2 var3 var4. We can rename and label each variable independently.



• E.g.:

rename var1 id
label variable id "identification no."

Alternatively, we can use the user-written renvars command to rename all the variables using one instruction (if it is not already installed, type findit renvars to locate it):

• Type the following:

renvars v*\id sex age income
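If renvars is not available on your machine, the same result can be reached with the built-in rename and label variable commands, one variable at a time. The sketch below assumes the four default variable names var1 to var4 from the example above; the label text for sex, age and income is illustrative only.

```stata
* Rename the default variables one at a time
rename var1 id
rename var2 sex
rename var3 age
rename var4 income

* Attach descriptive labels (label text is illustrative)
label variable id "identification no."
label variable sex "sex of respondent"
label variable age "age in years"
label variable income "income"

* Check the new names and labels
describe
```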

After inputting, we can save the data in Stata format. It is always good practice to compress the data within Stata before saving to reduce the amount of memory required to hold the dataset:

compress
save "C:\Data\Input_data_example", replace

(note: file C:\Data\Input_data_example.dta not found)
file C:\Data\Input_data_example.dta saved

Importing files of other formats into Stata

In the following example, a hypothetical comma separated values (csv) file called CSV_Example located in the directory “C:\Data\” is imported into Stata:

clear
set mem 30m
set more off
* store numeric variables as double rather than the float default (greater precision, at the cost of more memory)
insheet using "C:\DATA\CSV_Example.csv", double

Stata can also import .dat or raw data files. In the next example, suppose that we only wish to obtain four variables from a file called slim.dat, and that the first is a string which we will allow to have up to 20 characters:

infile str20 name cond status resp using slim.dat
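After importing a file, it is worth checking that the variables have arrived with the names and storage types you expect before going any further. The following sketch shows one way of doing this; the path in the save command is an assumption that follows the hypothetical csv example above, and list in 1/5 assumes the file contains at least five records.

```stata
* Inspect variable names, storage types and labels
describe

* Look at the first five records
list in 1/5

* Reduce storage types where possible, then save in Stata format
compress
save "C:\Data\CSV_Example", replace
```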

Using StatTransfer

StatTransfer allows you to convert around 26 different formats of files into Stata, including those from more commonly used statistical packages such as SPSS and SAS. It can also be used to transfer Stata data files into other formats. In the following example, StatTransfer (version 6) is used to transfer an .xls file into Stata format.

The most frequent use of StatTransfer may be to import SPSS portable files into Stata. We can do this using StatTransfer just as we do other kinds of files. StatTransfer will give the same directory and file name but will change the xls or por file to dta file format.



Appendix C. Reserved names and Stata operators

The following are the reserved names that should not be used as variable names:

_all     double   long     _rc
_b       float    _n       _se
byte     if       _N       _skip
_coef    in       _pi      using
_cons    int      _pred    with

Some system names in Stata are reserved for specific operations:

_n       running number of the current observation
_N       total number of observations
_all     all variables
_b       vector of regression coefficients
_se      vector of standard errors of regression coefficients
_merge   variable created after merging files, which tells us the source of the resulting observation

Further information can be found in the Stata User guide (U).
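As a brief illustration of the system names _n and _N listed above, the sketch below uses the auto example dataset that is supplied with Stata (not the LFS teaching dataset). The new variable names obsno, within and groupsize are arbitrary choices made for this example.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* _n: running number of the current observation
generate obsno = _n

* _N: total number of observations in the dataset
display _N

* When used with by, _n and _N are evaluated separately within each group
bysort foreign: generate within = _n
bysort foreign: generate groupsize = _N

list make foreign obsno within groupsize in 1/5
```

Used with the by prefix in this way, _n and _N provide a convenient means of numbering records within groups, for example individuals within households.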



ESDS Government Economic and Social Data Service Cathie Marsh Centre for Census and Survey Research University of Manchester Manchester M13 9PL Email: [email protected] Tel: +44 (0)161 275 1980 Fax: 0161 275 4722 www.esds.ac.uk/government
