All You Need to Know About STATA

Takashi Yamano

Fall Semester 2005

Lecture Notes on Advanced Econometrics

Lecture A2: All you need to know about STATA

In this lecture note, I explain the STATA commands that you typically need to do the homework in my class. All I can do here, however, is introduce some STATA commands to you. To master STATA, you need to consult the STATA manuals and practice with actual data.

For now, I assume that you are typing each command in the STATA Command window. But I urge you to learn how to use STATA do-files. In a do-file, you can keep your commands in a file and execute all of them at once. In this way, you keep a record of what you have done for as long as you have the file. You will find do-files very useful. Instructions about do-files are presented later. But, for now, let’s start with some very important commands.

First, I open a STATA file and close it.

. clear

. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA

. clear

STATA can hold only one data file in its memory. So before you open a STATA data file, you need to “clear” the STATA memory. You can open a data file by typing “use” followed by the file name with its directory. Alternatively, you can open a file by pulling down the File menu and choosing Open. After opening a file, you can discard the data from the STATA memory by typing “clear” again. Note that the original file is still in the same folder on disk, so you can open the same file again if you like.
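
The two steps, clearing the memory and opening a file, can also be combined, because use accepts a clear option. A minimal sketch, using the auto data set that ships with STATA (an assumption made only so that the lines run on any installation; it is not the course data):

```stata
* One-step version of "clear" followed by "use".
* sysuse loads the example data set auto.dta that ships with STATA.
sysuse auto, clear

* The same option works with use and a file of your own, e.g.:
* use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA, clear
```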

Next, I open the same STATA file and save it into a different folder.

. clear

. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA

. save C:/Docs/tmp/WAGE1.DTA

Note that I am saving this file in a different folder so that I do not replace the original data file. My advice to you is “Do not replace original files!” If you create new variables (such as a squared variable) and want to save them, save the data in a different folder or under a different file name. There is always a danger of replacing an original file with a new file that has fewer observations or variables.
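
As a sketch of this advice, the lines below create a squared variable and save the result under a new file name, leaving the original file untouched. The auto data set and the file name auto_with_squares are assumptions for illustration only.

```stata
* load an example data set that ships with STATA
sysuse auto, clear

* create a new variable
gen price_sq = price^2

* save under a NEW name; replace can then only overwrite
* auto_with_squares.dta, never the original data file
save auto_with_squares, replace
```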

Descriptive Statistics

Next, there are two commands to obtain descriptive information: describe and summarize. “describe” shows the storage types, display formats, and labels of the variables. This is especially helpful when you use a data file for the first time. “summarize” provides descriptive statistics of variables: number of observations, means, standard deviations, minimums, and maximums. If you type “summarize x, d(etail),” you get more detailed statistics about the variable x, such as percentiles, skewness, and kurtosis. Here is how they work:

. describe

Contains data from C:\Docs\FASID\Classes\Econometrics\wooldridge_data\WAGE1.DTA
  obs:           526
 vars:            24                          16 Sep 1996 15:52
 size:        18,936 (97.8% of memory free)
---------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
wage            float  %8.2g                  average hourly earnings
educ            byte   %8.0g                  years of education
exper           byte   %8.0g                  years potential experience
lwage           float  %9.0g                  log(wage)
expersq         int    %9.0g                  exper^2
tenursq         int    %9.0g                  tenure^2
---------------------------------------------------------------------------

. summarize

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+-----------------------------------------------------
        wage |     526    5.896103   3.693086        .53      24.98
        educ |     526    12.56274   2.769022          0         18
       exper |     526    17.01711   13.57216          1         51
       lwage |     526    1.623268   .5315382  -.6348783   3.218076
     expersq |     526    473.4354   616.0448          1       2601
     tenursq |     526    78.15019   199.4347          0       1936

. summarize wage, d

                  average hourly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.67            .53
 5%         2.75           1.43
10%         2.92            1.5       Obs                 526
25%         3.33            1.5       Sum of Wgt.         526

50%         4.65                      Mean           5.896103
                        Largest       Std. Dev.      3.693086
75%         6.88          21.86
90%           10           22.2       Variance       13.63888
95%           13          22.86       Skewness       2.007325
99%           20          24.98       Kurtosis       7.970083
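
After summarize, STATA keeps the statistics as stored results in r(), which can be reused in later commands. A minimal sketch, assuming WAGE1.DTA is already open; the variable name wage_dev is hypothetical:

```stata
* assumes WAGE1.DTA is in memory
* summarize leaves its results in r()
summarize wage
display "mean = " r(mean) "   sd = " r(sd)

* use the stored mean to build a deviation-from-mean variable
gen wage_dev = wage - r(mean)
summarize wage_dev
```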

To obtain the frequency of each value of a categorical variable, you can use “table.” “table” can also provide descriptive statistics of other variables for each value of the categorical variable.

. table educ

----------------------
years of  |
education |      Freq.
----------+-----------
        0 |          2
        2 |          1
       12 |        198
       13 |         39
       18 |         19
----------------------

. table educ, c(mean wage sd wage min wage max wage n wage)

----------------------------------------------------------------------
years of  |
education | mean(wage)   sd(wage)  min(wage)  max(wage)    N(wage)
----------+-----------------------------------------------------------
        0 |       3.53   .9050967       2.89       4.17          2
        2 |       3.75                  3.75       3.75          1
       12 |    5.37136   3.092932        .53      22.20        198
       13 |    5.59897   3.026567       2.00      15.38         39
       18 |    10.6789   5.913146       3.50      24.98         19
----------------------------------------------------------------------
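
The table syntax shown here is the one from the STATA version used in this note. In recent releases (version 17 and later) table was redesigned and the c() option is no longer accepted. A hedged alternative that gives the same kind of output is tabulate for frequencies and tabstat for statistics by group, assuming WAGE1.DTA is open:

```stata
* assumes WAGE1.DTA is in memory
* frequencies of each value of educ
tabulate educ

* mean, sd, min, max, and count of wage for each value of educ
tabstat wage, by(educ) statistics(mean sd min max n)
```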

Creating Variables

You can create variables by using “generate” or “gen” for short:

. gen educsq=educ*educ

or

. gen educsq=educ^2

If you want to drop (or delete) a variable, use “drop.”

. drop educsq

If you want to modify a variable, you need to use “replace.”

. replace female=2 if female==0

Here I have replaced the zeros in female with 2. So now female is one for female workers and two for male workers, instead of zero for male workers. In STATA, you need to type “=” twice (“==”) to test whether the value of a variable is equal to something; a single “=” assigns a value. Other comparisons are “>”, “>=”, “<=”, and “<”. These are, respectively, “larger than,” “larger than or equal to,” “smaller than or equal to,” and “smaller than.”

Now, because female is no longer a zero-one dummy variable, I create a new dummy variable by typing:

. gen women=0

. replace women=1 if female == 1

Or you can have STATA create the dummy variable in one step by typing:

. gen women=(female==1)

Neat, isn’t it?
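
One caution with the one-line version: a logical expression such as (female==1) is zero whenever it is false, including when female is missing, so missing values would silently become zeros. A small sketch with made-up data shows one way to keep them missing:

```stata
clear
* three made-up observations; the third has a missing value
input female
0
1
.
end

* women is 1 or 0 where female is observed, and missing otherwise
gen women = (female==1) if !missing(female)
list
```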

OLS estimations

It is very easy to estimate OLS models in STATA. You just need to type:

. regress y x1 x2 x3 x4 x5

You can obtain a predicted variable by typing:

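. predict yhat

Then a variable called yhat, containing the predicted values, is created. (The new variable needs a name that is not already in use, so here it cannot be called y.) If you want a residual variable, then you need to type:

. predict e, residual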

Here is an example:

. regress lwage educ exper expersq female married northcen south west

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  8,   517) =   45.95
       Model |  61.6387993     8  7.70484991           Prob > F      =  0.0000
    Residual |  86.6909521   517  .167680758           R-squared     =  0.4156
-------------+------------------------------           Adj R-squared =  0.4065
       Total |  148.329751   525   .28253286           Root MSE      =  .40949

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0808207   .0070101    11.53   0.000      .067049    .0945925
       exper |   .0363615   .0052269     6.96   0.000     .0260929    .0466301
     expersq |   -.000645   .0001128    -5.72   0.000    -.0008665   -.0004235
      female |  -.3345661   .0364315    -9.18   0.000     -.406138   -.2629941
     married |   .0711934   .0417725     1.70   0.089    -.0108712     .153258
    northcen |   -.070182   .0519674    -1.35   0.177    -.1722752    .0319111
       south |  -.1162238   .0486825    -2.39   0.017    -.2118637   -.0205839
        west |     .04643   .0576308     0.81   0.421    -.0667894    .1596494
       _cons |   .4625799   .1074952     4.30   0.000     .2513988     .673761
------------------------------------------------------------------------------

. predict y

(option xb assumed; fitted values)
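
Continuing this example, fitted values and residuals can be stored under names of your choice and checked with summarize. A sketch, assuming the regression above has just been run on WAGE1.DTA; the variable names lwage_hat and e are arbitrary:

```stata
* assumes the regress command above was the last estimation run
* fitted values (xb is the default) and residuals
predict lwage_hat, xb
predict e, residuals

* OLS residuals average to zero when the model includes a constant
summarize lwage lwage_hat e

* joint F-test that the three region dummies are all zero
test northcen south west
```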

Graphs (this section is for version 8)

It is sometimes a good idea to examine the data visually. Here, I just explain two

types of graphs: histograms and two-way graphs. “Histogram” is useful to see

frequency and “two-way” is useful to examine a relationship between two variables.

. twoway histogram wage

[Figure: histogram of wage. Horizontal axis: average hourly earnings (0 to 25); vertical axis: Density.]

When you have a discrete variable, you can add the discrete option to get one bar for each value of the variable:

. twoway histogram educ, discrete

[Figure: histogram of educ with one bar per value. Horizontal axis: years of education (0 to 20); vertical axis: Density.]

When you want to examine a relationship between two variables, you can create a

two-way graph by typing:

. graph twoway scatter wage educ

[Figure: scatter plot of wage against educ. Vertical axis: average hourly earnings (0 to 25).]

Or you can omit “graph” and type

. twoway scatter wage educ

to get the same graph.

You can also include a fitted line by adding “lfit wage educ.” But because there are now two plot types, you need to put each one in parentheses:

. twoway (scatter wage educ) (lfit wage educ)

[Figure: scatter plot of wage against educ with a fitted line. Vertical axis: average hourly earnings/Fitted values (0 to 25); legend: average hourly earnings, Fitted values.]

You can learn more about graphs in a STATA manual called “Graphics.”
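
Titles can be added as options after a comma, and the graph currently on screen can be written to a file with graph export. A sketch, assuming WAGE1.DTA is open and the lines are run from a do-file (the /// line continuation does not work in the Command window); the file name wage_educ.png is hypothetical, and the file formats available depend on the operating system:

```stata
* assumes WAGE1.DTA is in memory
* scatter plot with fitted line and explicit titles
twoway (scatter wage educ) (lfit wage educ), ///
    title("Wage and education") ///
    ytitle("average hourly earnings") xtitle("years of education")

* save the graph that is currently displayed
graph export wage_educ.png, replace
```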

All you need to know about using Do-files in STATA

There are three types of files in STATA: data-file (.dta), log-file (.log), and do-file (.do).

(There is one more type called ado-file, but I ignore this type of files in this note.)

Data-files contain data. Many kinds of data files can be converted into STATA data files by using Stat/Transfer, a separate file-conversion program.

Log-files record the commands and results displayed in the STATA Results window. I will discuss log-files later.

Do-files execute the commands recorded in them. By recording all of your commands in a do-file, you can keep a history of your work. This way, you can execute exactly the same commands days or years later. You do not need to remember what you have done. You just need to remember the file names. (Actually, this is not easy either. Occasionally, I spend many hours looking for old do-files. I recommend descriptive file names.)

Why do you need to use do-files?

Even though the advantages of using do-files become clear as you get used to using

them, you may think do-files are cumbersome at the beginning because you have to

type every single command in do-files. There are three major reasons for using

do-files:

(i) it is easy to use do-files,

(ii) you will be able to reproduce your results (even after many years),

(iii) you can communicate with your colleagues by exchanging do-files.

(i) You may not like typing all of your commands in do-files instead of pointing and clicking in the STATA menus. However, once you remember some important commands, you can do most of your work. When necessary, you can look up the manuals or use the help command in STATA to learn about commands.

(ii) You will need to reproduce your results even after many months. For instance, your adviser may want you to modify your models. With do-files you can just make small changes and produce results according to your adviser’s comments; you do not need to start from scratch every time you change specifications.

(iii) When you work with your colleagues, it is useful to share the same data sets and exchange do-files. As long as the data sets are the same, the same do-files will produce the same results. This way, your colleagues can check your work and make adjustments.

So let’s start using do-files!

How to open a do-file

Just click File-Do, and you can open existing do-files. Or click the icon with a note and pencil, and a new do-file will show up.

How to execute do-files

After typing commands in a do-file, you can just click an icon with a lined-note. For

instance, type the following commands in a do-file:

clear

use c:\docs\fasid\econometrics\homework\wage1.dta

sum wage

sum wage, d

table female

table female, c(mean wage)

Then click an icon with a lined-note.

You will probably see an error message

file c:\docs\fasid\econometrics\homework\wage1.dta not found

This is because you don’t have the “wage1.dta” data-file in the specified directory.

But at least you know that the do-file has tried to execute your commands. Now,

correct the directory and execute the do-file again.

If you did not face any problems, you should find:

. sum wage

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
    wage |     526    5.896103   3.693086        .53      24.98

. table female

----------+-----------
   female |      Freq.
----------+-----------
        0 |        274
        1 |        252
----------+-----------

You have run a do-file. We will learn these two commands (sum and table) later.

But for now, you should save the do-file by clicking File-Save As.

Commands You Need to Know

There is a note by Wooldridge called “Rudiments of STATA.” That note explains most of the important commands, so I do not repeat them here. Instead, I will show you an example of a do-file:

Example 6-1

*This is a do-file, called how_to_STATA, for Lecture 6

clear

use c:\docs\fasid\econometrics\homework\wage1.dta

*log close

log using c:\docs\fasid\econometrics\homework\wage1.log, replace

*Describe the data

des

sum wage

sum wage, d

table female

table female, c(mean wage min wage max wage)

*Generate a wage variable in log

gen logwage=ln(wage)

*Generate a squared variable of experience

*(named exper2 because wage1.dta already contains expersq)

gen exper2=exper*exper

*Run OLS, predict logwage, and do F-test

reg logwage female educ exper exper2

predict yhat

test exper exper2

End of Example 6-1

One very useful command is this: *. This is called a star. It is not exactly a command, because a star (*) does not do any work. Instead, a star (*) at the beginning of a line prevents the rest of that line from executing. For instance, in the above do-file, the second star (*) prevents the command log close from executing.

I have left a star in front of log close because I do not want to execute this command yet. At this point there is no log-file open. If I try to close a log-file (by saying log close), STATA will give me an error message and will not execute the remaining commands. Thus I leave the second star. After this do-file has run once, a log-file will be open and will keep recording all the results shown in the STATA Results window. Thus from the second run on, I delete the star in front of log close. As you can see, the star (*) is very useful for temporarily preventing some commands from executing.
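
A common alternative to adding and deleting the star by hand is to put capture in front of log close. capture runs the command and suppresses any error, so the do-file continues whether or not a log-file is open. A sketch; the auto data set is used only so that the lines run anywhere, and the log-file name wage1.log without a directory is an assumption (the file is written to the current working directory):

```stata
* close a log-file if one is open; do nothing otherwise
capture log close
log using wage1.log, replace

* example commands whose output goes into the log-file
sysuse auto, clear
summarize price

* close the log-file at the end so the next run starts cleanly
log close
```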

Another way of using a star (*) is to put notes in do-files. Sometimes you want to leave notes in do-files to remind yourself of something or to explain your work to your colleagues. Remember that you may need to open your do-files after many months or years, and you may not remember all the details by then. In my experience, it is a good idea to leave some notes in your do-files, as I have done in this do-file.

Using log-files

As I mentioned above, a log-file records all the results displayed in the STATA Results window. You can open a log-file in a word processor, such as Word. A fixed-width font such as Courier works best with STATA output.

When you need to replace an old log-file under the same name, you need to add

replace after a comma:

log using c:\docs\fasid\econometrics\homework\wage1.log, replace

If you want to add new results at the end of an old log-file, you need to add append after a comma:

log using c:\docs\fasid\econometrics\homework\wage1.log, append

As I mentioned before, you can close a log-file by using

log close

All you need to know about managing data in STATA

Sorting the data

sort arranges the observations in ascending order of the values of a variable. For instance, assume that the variable income contains household income; then

sort income

arranges the observations from the lowest income observation to the highest. You can

sort observations according to more than one variable. For instance, if you type

sort female_head income

STATA sorts the observations first by female_head and then, within each value of female_head, by income. Incomes are thus ordered separately for male- and female-headed households.

To see the sorted data, you can look into the data window, or you can use list. list shows the specified variables on the screen. For instance,

list income

shows all income values, starting from the lowest.

list income in 1/20

shows the income values of the first 20 observations, that is, the 20 lowest incomes.

Although sort is a useful command, it can only sort the observations in ascending order. Sometimes you may want to sort observations in descending order, from the largest to the smallest. For this purpose, you can use gsort:

gsort -income

This will sort observations from the largest to the smallest. You can also use more than one variable:

gsort female_head -income

This will sort observations by female_head in ascending order and, within male- and female-headed households separately, by income from the largest to the smallest.
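
The sorting commands in this section can be tried on a few made-up observations. In this sketch the four households and their incomes are invented for illustration:

```stata
clear
* made-up data: four households
input female_head income
0 120
1  90
0 300
1 210
end

* ascending female_head, descending income within each group
gsort female_head -income
list

* the three highest incomes overall
gsort -income
list in 1/3
```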

Aggregating the data

In surveys, information is collected at different units. For instance, a typical household survey collects information not only at the household level (e.g., How much does this household use?) but also at the individual level (e.g., How old is this person?).

To combine information collected at different units, we need either to aggregate data up to a higher unit or to merge data from a higher unit into data at a lower unit. For instance, we may need to aggregate data from the individual level up to the household level.

In STATA, we can use collapse to create an aggregated data set. For instance, assume that we have demographic information at the individual level:

HHID PersonID Age Gender

1 1 42 Male

1 2 37 Female

1 3 10 Female

2 1 28 Male

2 2 24 Female

HHID is the ID number of the household to which each individual belongs; PersonID is the ID number of each individual within the household; and Age and Gender contain personal information.

Suppose that we want to create a variable called HHsize that indicates the household size. To do so, I first create HHsize as a variable that is one for all individuals:

gen HHsize = 1

HHID PersonID Age Gender HHsize

1 1 42 Male 1

1 2 37 Female 1

1 3 10 Female 1

2 1 28 Male 1

2 2 24 Female 1

Then, I would aggregate up the data to the household level.

collapse (sum) HHsize, by(HHID)

collapse aggregates the data up to the level identified by the variable specified in by( ). In this example, I am aggregating the data up to the HHID level.

In the example, we get an aggregated data set that looks like:

HHID HHsize

1 3

2 2

Notice that all the other variables are eliminated. In addition to summing up, you can

also calculate means, standard deviations, maximums, minimums, median, etc. For

instance, you can calculate average ages and find the maximum age within the

household by typing:

collapse (sum) HHsize (mean) Age (max) Agemax = Age, by(HHID)

HHID HHsize Age Agemax

1 3 29.7 42

2 2 26 28
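
collapse replaces the data in memory. If the individual-level rows are still needed, one alternative (a sketch, not the method used in this note) is to compute the household statistics with by, which adds them as new columns instead. The data are the five individuals from the example above; the variable name Agemean is hypothetical:

```stata
clear
input HHID PersonID Age str6 Gender
1 1 42 "Male"
1 2 37 "Female"
1 3 10 "Female"
2 1 28 "Male"
2 2 24 "Female"
end

* _N is the number of observations within each HHID group
bysort HHID: gen HHsize = _N

* egen computes group statistics without collapsing the data
bysort HHID: egen Agemean = mean(Age)
bysort HHID: egen Agemax  = max(Age)
list, sepby(HHID)
```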

After creating an aggregated data set, you can combine it with another data set using an identifying variable. In the example, the identifying variable is HHID. Before merging this file with other data files at the household level, you need to sort the data according to the identifying variable and save it. Thus,

sort HHID

save c:/data/tmp/hhsize, replace

Merging data files

To combine data from different files, we need to merge files. Files must be sorted by the same identifying variable in the same order before merging. For instance, suppose that we have a data set of household income at the household level and want to bring in HHsize from a different file to create a per capita income variable, called PCincome.

First, we need to open a base file. In this example, this is a file with household income:

HHID income

1 302

2 189

Then, we merge this file with a file that contains HHsize:

sort HHID

merge HHID using c:/data/tmp/hhsize

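HHID  income  HHsize  Age   Agemax  _merge
1     302     3       29.7  42      3
2     189     2       26    28      3

Thus, we have merged two data files at the household level (HHID). The variable _merge, which STATA creates automatically, equals 3 for observations found in both files.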
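
The merge syntax above is the one from older STATA releases. In version 11 and later the type of match is stated explicitly (merge 1:1 here) and the files no longer have to be sorted beforehand. The sketch below rebuilds the two small files from this section, merges them, and creates PCincome. The temporary file name hhsize is hypothetical, and the lines are meant to be run together as one do-file:

```stata
* household-size file (as produced by collapse above)
clear
input HHID HHsize
1 3
2 2
end
tempfile hhsize
save `hhsize'

* base file with household income
clear
input HHID income
1 302
2 189
end

* one-to-one match on HHID; _merge==3 marks matched households
merge 1:1 HHID using `hhsize'
tabulate _merge

* per capita income
gen PCincome = income/HHsize
list
```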
