Takashi Yamano
Fall Semester 2005
Lecture Notes on Advanced Econometrics
Lecture A2: All you need to know about STATA
In this lecture note, I explain the STATA commands that you will typically need to do
the homework in my class. I can only introduce these commands here, however. To
master STATA, you need to consult the STATA manuals and practice with actual
data.
For now, I assume that you will type each command in the STATA Command
Window. But I urge you to learn how to use STATA do-files. In a do-file,
you can keep your commands in a file and execute all of them at
once. This way, you keep a record of what you have done for as long as you have the
file. You will find do-files very useful. Instructions about do-files are presented
later. But, for now, let’s start with some very important commands.
First, I open a STATA file and close it.
. clear
. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
. clear
STATA can hold only one data file in its memory. So before you open a STATA data
file, you need to “clear” the STATA memory. You can open a data file by typing
“use” followed by the file name with its directory. Alternatively, you can open a file by
pulling down the File menu and choosing Open. After opening a file, you can
discard the data from the STATA memory by typing “clear” again. Note that the
original file is still in the same folder, so you can open the same file again if you
like.
Next, I open the same STATA file and save it into a different folder.
. clear
. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
. save C:/Docs/tmp/WAGE1.DTA
Note that I am saving this file in a different folder so that I do not replace the original
data file. My advice to you is “Do not replace original files!” If you create new
variables (such as a squared variable) and want to save them, save the file in a different
folder or under a different file name. There is always a danger of replacing an original
file with a new file that has fewer observations or variables.
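The advice above can be sketched as follows. This is only an illustration: the file name WAGE1_educsq.DTA is hypothetical and does not appear in the original notes.

```stata
* Open the original file and add a squared term
clear
use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
gen educsq = educ^2
* Save under a new (hypothetical) file name; the original WAGE1.DTA is untouched
save C:/Docs/tmp/WAGE1_educsq.DTA
```

Without the replace option, save stops with an error if a file of that name already exists, which gives some protection against accidental overwriting.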
Descriptive Statistics
Next, there are two commands to obtain descriptive information: describe and
summarize. “describe” lists the storage types and labels of the variables. This is
especially helpful when you use a data file for the first time. “summarize” provides
descriptive statistics of the variables: means, standard deviations, minimums, and
maximums. If you type “summarize x, d(etail),” you get more detailed information
about the variable x. Here is how they work:
. describe
Contains data from C:\Docs\FASID\Classes\Econometrics\wooldridge_data\WAGE1.DTA
  obs:           526
 vars:            24                          16 Sep 1996 15:52
 size:        18,936 (97.8% of memory free)
---------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
wage            float  %8.2g                  average hourly earnings
educ            byte   %8.0g                  years of education
exper           byte   %8.0g                  years potential experience
lwage           float  %9.0g                  log(wage)
expersq         int    %9.0g                  exper^2
tenursq         int    %9.0g                  tenure^2
---------------------------------------------------------------------------
. summarize
    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
        wage |     526    5.896103   3.693086        .53      24.98
        educ |     526    12.56274   2.769022          0         18
       exper |     526    17.01711   13.57216          1         51
       lwage |     526    1.623268   .5315382  -.6348783   3.218076
     expersq |     526    473.4354   616.0448          1       2601
     tenursq |     526    78.15019   199.4347          0       1936
. summarize wage, d
                  average hourly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.67            .53
 5%         2.75           1.43
10%         2.92            1.5       Obs                 526
25%         3.33            1.5       Sum of Wgt.         526
50%         4.65                      Mean           5.896103
                        Largest       Std. Dev.      3.693086
75%         6.88          21.86
90%           10           22.2       Variance       13.63888
95%           13          22.86       Skewness       2.007325
99%           20          24.98       Kurtosis       7.970083
To obtain the frequencies of a categorical variable, you can use “table.” “table” can also
provide descriptive statistics of other variables for each value of the categorical
variable.
. table educ
----------------------
 years of |
education |      Freq.
----------+-----------
        0 |          2
        2 |          1
       12 |        198
       13 |         39
       18 |         19
----------------------
. table educ, c(mean wage sd wage min wage max wage n wage)
----------------------------------------------------------------------
 years of |
education | mean(wage)    sd(wage)   min(wage)   max(wage)     N(wage)
----------+-----------------------------------------------------------
        0 |       3.53    .9050967        2.89        4.17           2
        2 |       3.75                    3.75        3.75           1
       12 |    5.37136    3.092932         .53       22.20         198
       13 |    5.59897    3.026567        2.00       15.38          39
       18 |    10.6789    5.913146        3.50       24.98          19
----------------------------------------------------------------------
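A related command, tabstat, reports similar group-wise statistics. The lines below are a sketch that assumes the WAGE1 data are in memory; tabstat and tabulate are not covered in the original notes. To my knowledge, STATA releases 17 and later redesigned table, so the c() syntax shown above may need a version prefix there.

```stata
* Mean, standard deviation, minimum, maximum, and count of wage by years of education
tabstat wage, by(educ) statistics(mean sd min max n)
* One-way frequency table of educ with percentages
tabulate educ
```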
Creating Variables
You can create variables by using “generate” or “gen” for short:
. gen educsq=educ*educ
or
. gen educsq=educ^2
If you want to drop (or delete) a variable, use “drop.”
. drop educsq
If you want to modify an existing variable, you need to use “replace.”
. replace female=2 if female==0
Here I have replaced zeros in female with 2. So now female equals one for female
workers and two for male workers, instead of zero for male workers. In STATA, you
need to type “=” twice (“==”) to test whether a variable is equal to some value; a
single “=” assigns a value. Other comparison operators are “>”, “>=”, “<=”, and “<”.
These are respectively “larger than,” “larger than or equal to,” “smaller than or equal
to,” and “smaller than.”
Now, because female is no longer a zero-one dummy variable, I create a new dummy
variable by typing:
. gen women=0
. replace women=1 if female == 1
Or STATA can create a dummy variable automatically by typing:
. gen women=(female==1)
Neat, isn’t it?
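One caveat about this shortcut: the expression (female==1) evaluates to 0 whenever female is not equal to 1, including when female is missing. A cautious sketch, assuming you want missing values to stay missing; the variable name women2 is hypothetical:

```stata
* women2 is 1 if female==1, 0 for other non-missing values, and missing otherwise
gen women2 = (female==1) if !missing(female)
* Cross-check the new variable against the original, showing missing values too
tabulate female women2, missing
```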
OLS estimations
It is very easy to estimate OLS models in STATA. You just need to type:
. regress y x1 x2 x3 x4 x5
You can obtain the predicted values by typing “predict” followed by a new variable name.
The name must not already exist in the data, so it cannot be y itself:
. predict yhat
Then, a variable called yhat is created that holds the predicted values. If you want the
residuals, you need to type:
. predict e, residual
Here is an example:
. regress lwage educ exper expersq female married northcen south west
      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  8,   517) =   45.95
       Model |  61.6387993     8  7.70484991           Prob > F      =  0.0000
    Residual |  86.6909521   517  .167680758           R-squared     =  0.4156
-------------+------------------------------           Adj R-squared =  0.4065
       Total |  148.329751   525   .28253286           Root MSE      =  .40949
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0808207   .0070101    11.53   0.000      .067049    .0945925
       exper |   .0363615   .0052269     6.96   0.000     .0260929    .0466301
     expersq |   -.000645   .0001128    -5.72   0.000    -.0008665   -.0004235
      female |  -.3345661   .0364315    -9.18   0.000     -.406138   -.2629941
     married |   .0711934   .0417725     1.70   0.089    -.0108712     .153258
    northcen |   -.070182   .0519674    -1.35   0.177    -.1722752    .0319111
       south |  -.1162238   .0486825    -2.39   0.017    -.2118637   -.0205839
        west |     .04643   .0576308     0.81   0.421    -.0667894    .1596494
       _cons |   .4625799   .1074952     4.30   0.000     .2513988     .673761
------------------------------------------------------------------------------
. predict y
(option xb assumed; fitted values)
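As a follow-up sketch, fitted values and residuals can be stored under names that do not clash with existing variables. The names lwage_hat and e are hypothetical, and the WAGE1 data are assumed to be in memory.

```stata
regress lwage educ exper expersq female married northcen south west
* Fitted values (option xb is the default)
predict lwage_hat
* Residuals: lwage minus the fitted values
predict e, residual
* With a constant in the model, the residuals average to (numerically) zero
summarize lwage lwage_hat e
* Joint F-test that both experience terms are zero
test exper expersq
```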
Graphs (this section is for version 8)
It is sometimes a good idea to examine the data visually. Here, I just explain two
types of graphs: histograms and two-way graphs. “Histogram” is useful to see
frequency and “two-way” is useful to examine a relationship between two variables.
. twoway histogram wage
[Figure: histogram of wage. Vertical axis: Density; horizontal axis: average hourly earnings.]
When you have a discrete variable, adding the “discrete” option gives you one bar for
each value of the variable:
. twoway histogram educ, discrete
[Figure: histogram of educ. Vertical axis: Density; horizontal axis: years of education.]
When you want to examine a relationship between two variables, you can create a
two-way graph by typing:
. graph twoway scatter wage educ
[Figure: scatter plot. Vertical axis: average hourly earnings; horizontal axis: years of education.]
Or you can omit “graph” and type
. twoway scatter wage educ
to get the same graph.
You can also add a fitted line with the plot type “lfit wage educ.” Because the graph
then combines two plot types, you need to enclose each one in parentheses:
. twoway (scatter wage educ) (lfit wage educ)
[Figure: scatter plot with fitted line. Vertical axis: average hourly earnings/Fitted values; horizontal axis: years of education. Legend: average hourly earnings, Fitted values.]
You can learn more about graphs in a STATA manual called “Graphics.”
All you need to know about using Do-files in STATA
There are three types of files in STATA: data-files (.dta), log-files (.log), and do-files (.do).
(There is one more type called ado-file, but I ignore this type of file in this note.)
Data-files contain data. Most kinds of data files can be converted into STATA data
files by using Stat/Transfer (available from STATA Corporation).
Log-files record the commands and results displayed in the STATA Results window. I
will discuss log-files later.
Do-files execute the commands recorded in them. By recording all of your commands in
a do-file, you can keep a history of your work. This way, you can execute the exact
same commands days or years later. You do not need to remember what you have
done. You just need to remember the file names. (Actually this is not easy either.
Occasionally, I spend many hours looking for old do-files. I recommend descriptive
file names.)
Why do you need to use do-files?
Even though the advantages of using do-files become clear as you get used to using
them, you may think do-files are cumbersome at the beginning because you have to
type every single command in do-files. There are three major reasons for using
do-files:
(i) it is easy to use do-files,
(ii) you will be able to reproduce your results (even after many years),
(iii) you can communicate with your colleagues by exchanging do-files.
(i) You may not like typing all of your commands in do-files, instead of
pointing and clicking in the STATA menus. However, once you remember some of the
important commands, you can do most of your work. When necessary, you can look up the
manuals or use the help command in STATA to learn about commands.
(ii) You will need to reproduce your results even after many months. For instance,
your adviser may want you to modify your models. With do-files you can just make
small changes and produce results according to your adviser’s comments; you do not
need to start from scratch every time you change specifications.
(iii) When you work with your colleagues, it is useful to share the same data sets
among your colleagues and exchange do-files. As long as data sets are the same, the
same do-files will produce the same results. This way, your colleagues can check
your work and make adjustments.
So let’s start using do-files!
How to open a do-file
To open an existing do-file, just click File-Do. To start a new do-file, click the icon
with a note and pencil, and a new do-file will show up.
How to execute do-files
After typing commands in a do-file, you can just click an icon with a lined-note. For
instance, type the following commands in a do-file:
clear
use c:\docs\fasid\econometrics\homework\wage1.dta
sum wage
sum wage, d
table female
table female, c(mean wage)
Then click an icon with a lined-note.
You will probably see an error message
file c:\docs\fasid\econometrics\homework\wage1.dta not found
This is because you don’t have the “wage1.dta” data-file in the specified directory.
But at least you know that the do-file has tried to execute your commands. Now,
correct the directory and execute the do-file again.
If you did not face any problems, you should find:
. sum wage
    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
        wage |     526    5.896103   3.693086        .53      24.98
. table female
----------------------
   female |      Freq.
----------+-----------
        0 |        274
        1 |        252
----------------------
You have run a do-file. We will learn these two commands (sum and table) later.
But for now, you should save the do-file by clicking File-Save As.
Commands You Need to Know
There is a note written by Wooldridge called “Rudiments of STATA.” That note
explains most of the important commands, so I do not repeat them here. Instead, I will
show you an example of a do-file:
Example 6-1
*This is a do-file, called how_to_STATA, for Lecture 6
clear
use c:\docs\fasid\econometrics\homework\wage1.dta
*log close
log using c:\docs\fasid\econometrics\homework\wage1.log, replace
*Describe the data
des
sum wage
sum wage, d
table female
table female, c(mean wage min wage max wage)
*Generate a wage variable in log
gen logwage=ln(wage)
*Generate a squared variable of experience
gen expersq=exper*exper
*Run OLS, predict logwage, and do F-test
reg logwage female educ exper expersq
predict yhat
test exper expersq
End of Example 6-1
One very useful command is this: *. This is called a star. This is not exactly a
command because a star (*) does not execute any work. Instead a star (*) prevents a
command from executing. For instance, in the above do-file, the second star (*) is
preventing a command log close from executing.
I have left a star in front of log close because I do not want to execute this command
yet.
At this point there is no log-file open. If I try to close a log-file (by saying log close),
STATA will give me an error message and will not execute the remaining commands.
Thus I leave the second star. After running this do-file once, a log-file will be open and
will keep recording all the results in the STATA Results window. Thus from the second
run onward, I delete the star in front of log close. As you can see, the star (*) is very
useful for temporarily preventing some commands from executing.
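A common alternative to adding and deleting the star in front of log close is the capture prefix, which lets a do-file continue even when the prefixed command fails. The following is a sketch, not part of the original example; the paths are hypothetical and assume a copy of WAGE1.DTA saved in c:/docs/tmp.

```stata
* Close a log-file if one is open; capture suppresses the error when none is open
capture log close
* Hypothetical log path
log using c:/docs/tmp/wage1.log, replace
* Hypothetical data path; the clear option discards any data already in memory
use c:/docs/tmp/WAGE1.DTA, clear
sum wage
* Close the log at the end so that nothing is left open for the next run
log close
```

With this pattern the same do-file runs unchanged the first time and every time after.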
Another way of using a star (*) is to put notes in do-files. Sometimes, you want to
leave some notes in do-files to remind yourself or to explain things to your colleagues.
Remember you may need to open your do-files after many months or years. You may
not remember all the details about your do-files at that time. In my experience, it
is a good idea to leave some notes in your do-files, as I have done in this do-file.
Using log-files
As I mentioned above, a log-file records all the results displayed in the STATA Results
window. You can open a log-file in a word processor, such as Word. A fixed-width font
such as Courier works best with STATA output.
When you need to replace an old log-file under the same name, you need to add
replace after a comma:
log using c:\docs\fasid\econometrics\homework\wage1.log, replace
If you want to add new results at the end of an old log-file, you need to add append
after a comma:
log using c:\docs\fasid\econometrics\homework\wage1.log, append
As I mentioned before, you can close a log-file by using
log close
All you need to know about managing data in STATA
Sorting the data
sort arranges the observations in ascending order of the values of a variable. For
instance, if a variable called income contains income, then
sort income
arranges the observations from the lowest income observation to the highest. You can
sort observations according to more than one variable. For instance, if you type
sort female_head income
STATA sorts the observations first by female_head and then, within each value of
female_head, by income, so male and female headed households are sorted separately.
To see the sorted data, you can look into the data window, or you can use list. list
shows the specified variables on the screen. For instance,
list income
shows income values starting from the lowest.
list income in 1/20
shows the income values of the first 20 observations, that is, the 20 lowest incomes.
Although sort is a useful command, it can only sort the observations in ascending order.
Sometimes, you may want to sort observations in descending order, from the largest to
the smallest. For this purpose, you can use gsort with a minus sign in front of the variable:
gsort -income
This will sort observations from the largest to the smallest. You can also use more
than one variable.
gsort female_head -income
This will sort observations from the largest to the smallest income, separately for male
and female headed households.
Aggregating the data
In survey data, information is collected at different units. For instance, a typical
household survey collects information not only at the household level (e.g., How much
does this household use?) but also at the individual level (e.g., How old is this
person?).
To combine information collected at different units, we need to either aggregate data
up to a higher unit or merge data from a higher unit into data at a lower unit. For
instance, we may need to aggregate data from the individual level up to the
household level.
In STATA, we can use collapse to create an aggregated data set. For instance, assume
that we have demographic information at the individual level:
HHID   PersonID   Age   Gender
1      1          42    Male
1      2          37    Female
1      3          10    Female
2      1          28    Male
2      2          24    Female
HHID is the ID number of the household to which each individual belongs; PersonID
is the ID number of each individual within the household; and Age and Gender contain
personal information.
Suppose that we want to create a variable called HHsize that indicates the household
size. To do so, I first create HHsize as a variable that equals one for every individual:
gen HHsize = 1
HHID   PersonID   Age   Gender   HHsize
1      1          42    Male     1
1      2          37    Female   1
1      3          10    Female   1
2      1          28    Male     1
2      2          24    Female   1
Then, I would aggregate up the data to the household level.
collapse (sum) HHsize, by(HHID)
collapse aggregates the data up to the level identified by the variable
specified in by( ). In this example, I am aggregating the data up to the HHID level.
In the example, we get an aggregated data set that looks like this:
HHID HHsize
1 3
2 2
Notice that all the other variables are eliminated. In addition to summing up, you can
also calculate means, standard deviations, maximums, minimums, median, etc. For
instance, you can calculate average ages and find the maximum age within the
household by typing:
collapse (sum) HHsize (mean) Age (max) Agemax = Age, by(HHID)
HHID   HHsize   Age    Agemax
1      3        29.7   42
2      2        26     28
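The household size can also be obtained without first generating a column of ones. The lines below are a sketch, assuming the individual-level data shown earlier are in memory (before any collapse) and that PersonID is never missing.

```stata
* Alternative 1: keep the individual records and attach the household size to each row
bysort HHID: egen HHsize = count(PersonID)
* Alternative 2: collapse to one row per household, counting non-missing PersonID values
* (drop the HHsize created above first so that collapse can create it again)
drop HHsize
collapse (count) HHsize=PersonID (mean) Age (max) Agemax=Age, by(HHID)
```

The first alternative is convenient when you still need the individual-level variables, because collapse discards every variable that is not named in the command.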
After creating an aggregated data set, you can combine it with another data set using an
identifying variable. In the example, the identifying variable is HHID. Before
merging this file with other data files at the household level, you need to sort the data
according to the identifying variable and save it. Thus,
sort HHID
save c:/data/tmp/hhsize, replace
Merging data files
To combine data from different files, we need to merge files. Files must be sorted by
the same identifying variable in the same order before merging. For instance,
suppose that we have a data set of household income at the household level and want to
bring in HHsize from a different file to create a per capita income variable, called PCincome.
First, we need to open a base file. In this example, this is a file with household
income:
HHID income
1 302
2 189
Then, we merge this file with a file that contains HHsize:
sort HHID
merge HHID using c:/data/tmp/hhsize
HHID   income   HHsize   Age    Agemax   _merge
1      302      3        29.7   42       3
2      189      2        26     28       3
Thus, we have merged two data files at the household level (HHID).
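To finish the example, the per capita income variable can be created once the merge has been checked. STATA records the result of the merge in a variable named _merge, where 3 means that the household was found in both files. The lines below are a sketch that assumes every household matched.

```stata
* Check how the merge went: 3 means the household appears in both files
tabulate _merge
* Per capita income = household income divided by household size
gen PCincome = income / HHsize
* Remove the merge indicator so that a later merge can create it again
drop _merge
```

In STATA 11 and later, the preferred syntax states the match type explicitly, for example merge 1:1 HHID using c:/data/tmp/hhsize, and does not need the files to be sorted beforehand.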