INTRODUCTION TO STATA

Hui-shien Tsao, Manager of User Support
Center for Social and Demographic Analysis
October 9, 2015

INTRODUCTION TO TATA - University at Albany, SUNYmumford.albany.edu/downloads/Stata2015/Introduction to... · 2015-10-09 · Clicking the left mouse button on a command at the “Review”




TOPICS

  • Stata Windows Interface
  • Basics of Stata
  • Data Input and Saving
  • Variables and Data Properties
  • Data Management
      General Commands
      Creating and Modifying Variables
      Labeling Variables
      Combining Datasets
  • Data Analysis


WHICH STATA IS RIGHT FOR YOU?

Package      Max. no. of  Max. no. of            Max. no. of    64-bit version  Fastest: designed for  Platforms
             variables    right-hand variables   observations   available?      parallel processing?
Stata/MP     32,767       10,998                 20 billion*    Yes             Yes                    Windows, Mac, or Unix
Stata/SE     32,767       10,998                 2.14 billion*  Yes             No                     Windows, Mac, or Unix
Stata/IC     2,047        798                    2.14 billion*  Yes             No                     Windows, Mac, or Unix
Small Stata  99           99                     1,200          No              No                     Windows, Mac, or Unix


STATA WINDOWS INTERFACE (STATA 13)

[Screenshot of the Stata 13 main window with the following parts labeled: Interactive Menus; Do-file Editor; Data Editor (Ctrl+8); Command Review; Command Window; Results Window; Variables in memory; Variables and Datasets Properties]


BASICS OF STATA: DO FILE AND LOG FILE

A do-file is simply a list of Stata commands, saved as a script in a plain text file with the extension .do. Keeping a do-file allows the analyses and results to be reproduced. A Stata data file has the extension .dta.

Using log: Stata does not log your results by default, and the Results window retains only a limited amount of output. Stata can keep logs in two formats: SMCL or text. SMCL is a text markup and control language that is understood only by Stata. The default format is SMCL.
→ syntax: log using filename [, append replace text]
→ syntax: log close
→ example: log using output, text
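The logging workflow above can be sketched as a short do-file. This is a minimal illustration, not part of the original slides: the log name session1 is hypothetical, and auto is the example dataset that ships with Stata.

```stata
* Sketch only: "session1" is a hypothetical log name; auto is Stata's bundled example dataset.
capture log close                  // close any log left open; ignore the error if none is open
log using session1, text replace   // plain-text log; overwrite it if it already exists
sysuse auto, clear
summarize price mpg
log close
```

Starting a do-file with capture log close is a common habit, since log using fails when a log is already open.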


BASICS OF STATA: COMMAND SYNTAX

command [varlist] [, options]

  • The square brackets [ ] surround optional parts.
  • The language is case-sensitive.
  • Many commands and options have abbreviations.

For example:
→ example: summarize (with no variable list, it summarizes ALL variables in the dataset)
→ example: summarize agep
→ example: summarize agep, detail
→ example: sum agep, detail

Previous commands can be repeated by pressing the Page Up key.


BASICS OF STATA: ABOUT VARIABLES, REVIEW, AND RESULTS WINDOWS

Clicking on a variable name in the “Variables” window copies that name to the Command window.

Clicking the left mouse button on a command in the “Review” window copies it to the Command window, where it can be edited and re-run.

Double-clicking on a command in the “Review” window runs the command directly.

The “Results” window retains only a limited amount of your output, so start a log file to make sure that nothing is lost.


INTERACTION WITH THE OPERATING SYSTEM AND OTHER BASIC COMMANDS

Display the name of the working directory
→ syntax: pwd

Change directory
→ syntax: cd directory_name
→ example: cd C:\temp

List all of the files in the current working directory
→ syntax: dir

Create a new directory
→ syntax: mkdir directory_name
→ example: mkdir C:\temp\workshop

Note that you need straight double quotation marks if the directory or folder name has spaces in it.
→ example: mkdir "C:\temp\2015 workshop"


INTERACTION WITH THE OPERATING SYSTEM AND OTHER BASIC COMMANDS

Tell Stata to pause or not pause for --more-- messages
→ syntax: set more {on|off} [, permanently]
→ example: set more off

Update the ado-files, the executable, and the utilities
→ syntax: update all [, from(location)]
→ example: update all

Display the details of a command
→ syntax: help command_or_topic_name
→ example: help summarize

Search for a command whose name you don’t know
→ syntax: search word [word ...]
→ example: search regression


DATA INPUT: READING DATA INTO STATA

If your data are already in Stata format, simply read them in with use:
→ syntax: use filename [, options]
→ syntax: use [varlist] [if] [in] using filename [, options]
→ example: use "C:\temp\acs2013.dta", clear
→ example: use agep sex using "C:\temp\acs2013.dta", clear (read only the variables agep and sex)

(The clear option clears the dataset currently in memory before opening the other one.)

Once you are done modifying the dataset, save your changes with the save command. saveold writes the file in an older Stata format so that earlier versions of Stata can read it.

→ syntax: save [filename] [, save_options]
→ syntax: saveold [filename] [, saveold_options]
→ example: save "C:\temp\acs2013.dta", replace

(The replace option overwrites the existing acs2013.dta file.)
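A minimal sketch of the save-and-reload round trip described above. It is an illustration under stated assumptions: auto_small is a hypothetical file name written to the current working directory, and auto is Stata's bundled example dataset rather than the workshop's ACS file.

```stata
* Sketch only: auto_small.dta is a hypothetical file in the working directory.
sysuse auto, clear
keep make price mpg                      // keep three variables
save auto_small, replace                 // writes auto_small.dta, overwriting any earlier copy
use make price using auto_small, clear   // read back only two of the saved variables
describe
```

Reading a subset of variables with use varlist using can save memory when the file on disk is large.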


DATA INPUT: READING DATA INTO STATA

If the original data are a text file prepared by a spreadsheet, in either csv (comma-separated values, ASCII text) or txt (tab-delimited ASCII text) format, use the insheet command.

→ syntax: insheet [varlist] using filename [, options]
→ example: insheet using "C:\temp\cnty13.csv", clear
→ example: insheet using "C:\temp\cnty13.txt", clear
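To try insheet without the workshop files, one option is to write a small csv file first and then read it back. This is a sketch under stated assumptions: auto_small.csv is a hypothetical file name, and outsheet is used here only to create a file to read.

```stata
* Sketch only: auto_small.csv is a hypothetical file in the working directory.
sysuse auto, clear
outsheet make price mpg using auto_small.csv, comma replace   // write a comma-separated file
insheet using auto_small.csv, comma clear                     // read it back; names come from the first row
describe
```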


DATA INPUT: READING DATA INTO STATA

Some data are saved in a fixed-width ASCII format. You need the data codebook (data dictionary) to identify the length and exact column position of each variable. In that case, use the infix command to read the data.

For example, suppose we want the state population estimates from 1981 to 1989. The data and codebook are at the Census Bureau website.

→ syntax: infix specifications using filename [if] [in] [, clear]
→ example: infix stfips 1-2 year 3 race 4 sex 5 age0_4 6-12 age5_9 13-19 using "C:\temp\st_int_asrh.txt", clear


VARIABLES AND DATA PROPERTIES

Stata stores data in one of two ways: numeric or string. Numeric variables store numbers, while string variables store text. Stata stores numbers in different storage types: byte, int, long, float (default), and double. Before Stata 13, a string variable could contain up to 244 characters; since Stata 13, a fixed-length string (str#) can hold up to 2,045 characters, and strL variables can hold much longer text. To save space, store each variable in the smallest type that fits. The command “compress” goes through each variable and chooses the least space-consuming storage type that does not sacrifice the current level of accuracy of the data.

Missing values: missing numeric values are denoted by a single dot (.); missing string values are denoted by an empty string ("").

Numbers are sometimes read in as string variables. Use “tostring” to convert a numeric variable to string; use “destring” to convert a string variable to numeric.
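The conversions above can be sketched on Stata's bundled auto dataset. The new variable names mpg_str and mpg_num are hypothetical, chosen only for this illustration.

```stata
* Sketch only: mpg_str and mpg_num are illustrative names.
sysuse auto, clear
tostring mpg, generate(mpg_str)       // numeric -> new string variable
destring mpg_str, generate(mpg_num)   // string -> new numeric variable
compress                              // shrink storage types where no information is lost
describe mpg mpg_str mpg_num          // compare the storage types
count if missing(rep78)               // numeric missing values are stored as .
```

Using generate() keeps the original variable untouched; both commands also accept replace if the converted values should overwrite it.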


VARIABLES AND DATA PROPERTIES

list displays the values of variables, which allows you to examine the variables and their values.
→ syntax: list [varlist] [if] [in] [, options]
→ example: list name cnty_fips totpop in 1/10 (list the first 10 observations of the selected variables)
→ example: list name cnty_fips totpop if totpop >=50000 (list the observations whose total population is at least 50,000)

browse opens the Data Editor in browse mode.

order reorders the variables in the dataset.

describe reports basic information about the dataset and its variables (size, number of observations, storage type, etc.).

codebook provides extra information on the variables, such as summary statistics.

summarize provides summary statistics, such as means, standard deviations, and so on.


GENERAL COMMANDS

sort arranges the observations of the current data in ascending order based on the values of the variables. To sort in descending order, you need the gsort command.

gsort can sort in either ascending or descending order.
→ syntax: gsort [+|-] varname
→ example: gsort -state (sort by the variable state in descending order)

by repeats a Stata command on subsets of the data; the sort command is often used together with by. Before Stata repeats the command, it first verifies that the dataset is sorted by the by-variables.

→ syntax: by varlist: stata_cmd
→ example: by state, sort: summarize age

by and bysort are really the same command; bysort is just by with the sort option.

→ example: bysort state: summarize agep
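The sorting and by-group commands above can be tried on Stata's bundled auto dataset; a minimal sketch, using that dataset's variables (foreign, mpg, price) in place of the workshop's state and agep:

```stata
* Sketch only: uses Stata's bundled auto dataset instead of the workshop data.
sysuse auto, clear
sort foreign                      // ascending sort
by foreign: summarize mpg         // works because the data are sorted by foreign
bysort foreign: summarize mpg     // sorts and repeats the command in one step
gsort -price                      // descending sort on price
list make price in 1/5            // the five most expensive cars
```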


GENERAL COMMANDS

if is a “qualifier” and it will evaluate each observation as the command passes over the data. if is allowed with most Stata commands.

→ example: drop if totpop<=1000

When you are creating expressions (e.g., using if in a statement) or generating new variables, it’s important to know the logical and mathematical operators and functions.

~        not
|        or
&        and
==       equals
~=       not equal to
>        greater than
>=       greater than or equal to
<        less than
<=       less than or equal to
+        plus
-        minus
*        multiplied by
/        divided by
^        raised to
abs()    absolute value
exp()    exponential
ln()     natural log
sqrt()   square root
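A few of these operators and functions in use, as a minimal sketch on Stata's bundled auto dataset (the new variable names ln_price and root_wt are illustrative). One point worth noting: Stata treats a numeric missing value as larger than any number, so a condition such as rep78 > 4 also selects observations where rep78 is missing unless they are excluded explicitly.

```stata
* Sketch only: ln_price and root_wt are illustrative names.
sysuse auto, clear
count if price > 5000 & foreign == 1          // "and"
list make mpg if mpg >= 30 | weight <= 2000   // "or"
generate ln_price = ln(price)                 // natural log
generate root_wt = sqrt(weight)               // square root
count if rep78 > 4                            // also counts missing rep78
count if rep78 > 4 & ~missing(rep78)          // excludes missing rep78
```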


GENERAL COMMANDS

It’s good practice to keep notes in your do-file so that when you look back over it you know what you were trying to achieve with each command or set of commands. There are several comment indicators in Stata:

1. *      At the beginning of a line
2. /* */  Everything in between is ignored
3. //     Can be used at the beginning or end of a line
4. ///    Used at the end of a line to continue a long command on the next line
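The four comment styles, sketched in a short do-file on Stata's bundled auto dataset. Note that // and /// are meant for do-files; they do not work on commands typed into the Command window.

```stata
* A whole-line comment starts with an asterisk
sysuse auto, clear
/* A block comment
   can span several lines */
summarize price mpg        // a comment at the end of a line
regress price mpg weight   /// the command continues on the next line
        foreign
```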


DATA MANAGEMENT: CREATING AND MODIFYING VARIABLES

generate: to create new variables
→ syntax: generate newvar = exp [if]
→ example: gen age2=age^2
→ example: gen hispan=1 if hisp~=1

replace: to change contents of existing variables
→ syntax: replace oldvar = exp [if] [in]
→ example: replace hispan=0 if hisp==1

egen: extensions to generate
→ syntax: egen [type] newvar = fcn(arguments) [if] [in] [, options]
→ example: egen avg=mean(income)
→ example: egen std_score=std(score)
→ example: egen tot_inc=rowtotal(inc1 inc2 inc3)
→ example: egen avg_inc=rowmean(inc1 inc2 inc3)
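The same pattern, sketched on Stata's bundled auto dataset so that it can be run as is. All new variable names (mpg2, highmpg, avg_price, std_price, avg_mpg) are illustrative, and the cutoff of 25 mpg is an arbitrary assumption.

```stata
* Sketch only: variable names and the 25-mpg cutoff are illustrative.
sysuse auto, clear
generate mpg2 = mpg^2                                    // squared term
generate byte highmpg = 1 if mpg >= 25 & ~missing(mpg)   // indicator, first half
replace highmpg = 0 if mpg < 25                          // indicator, second half
egen avg_price = mean(price)                             // overall mean, repeated on every row
egen std_price = std(price)                              // standardized price
bysort foreign: egen avg_mpg = mean(mpg)                 // group mean within foreign
```

Building an indicator in two steps (generate, then replace) leaves the new variable missing wherever the source variable is missing, which is usually what is wanted.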


DATA MANAGEMENT: CREATING AND MODIFYING VARIABLES

drop/keep: eliminate/keep observations or variables
→ syntax: drop varlist (drop variables)
→ syntax: drop if exp (drop observations)
→ syntax: drop in range [if exp]
→ example: drop if age < 18
→ example: drop pwgtp* (drop all variables with prefix pwgtp)

rename: rename variables
→ syntax: rename old_varname new_varname
→ example: rename income inc

recode: recode categorical variables
→ syntax: recode varlist (rule) [(rule) ...] [, generate(newvar)]
→ example: recode sex (1=0) (2=1), gen(female)
→ example: recode schl (1=1) (2/15=2) (16/17=3) (18/20=4) (21/24=5), gen(educ)


DATA MANAGEMENT: LABELING VARIABLES

label variable
→ syntax: label variable varname ["label"]
→ example: label variable educ "Education"

define value label
→ syntax: label define lblname # "label" [# "label" ...] [, add modify replace nofix]
→ example: label define edu_label 1 "No School" 2 "LT High School" 3 "HS / GED"
→ example: label define edu_label 4 "Some College" 5 "College or More", add

assign value label to variable(s)
→ syntax: label values varlist [lblname|.]
→ example: label values educ edu_label
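Recoding and labeling usually go together. A minimal sketch on Stata's bundled auto dataset, where the grouping of rep78 into three categories and the names rep3 and rep3_lbl are assumptions made for illustration only:

```stata
* Sketch only: the three-way grouping and the names rep3 / rep3_lbl are illustrative.
sysuse auto, clear
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), generate(rep3)
label variable rep3 "Repair record (3 groups)"
label define rep3_lbl 1 "Poor" 2 "Average"
label define rep3_lbl 3 "Good", add      // add extends an existing value label
label values rep3 rep3_lbl
tabulate rep3, missing                   // check the result, including missing values
```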


DATA MANAGEMENT: COMBINING DATASETS

Use ‘append’ to add a second dataset to the end of the dataset currently in use. Usually it is used to add more observations to an existing dataset.

→ syntax: append using filename [, options]
→ example: append using psam_p34.dta

When datasets share the same observations but have different variables, use the ‘merge’ command to join them. For example, you can use merge to combine student data and school data: students in the same school share the same school characteristics. Merge adds variables from a second dataset to existing observations.


DATA MANAGEMENT: COMBINING DATASETS

Syntax:
One-to-one merge: merge 1:1 varlist using filename
Many-to-one merge: merge m:1 varlist using filename
One-to-many merge: merge 1:m varlist using filename
Many-to-many merge: merge m:m varlist using filename

In each case, merge adds the variable _merge (or the variable named in the generate() option, if provided) to the data. It records where each observation came from:

_merge==1 observation appeared in the master data only
_merge==2 observation appeared in the using data only
_merge==3 observation appeared in both the master and the using data (matched)
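A one-to-one merge and its _merge codes can be sketched with two pieces of Stata's bundled auto dataset. This is an illustration only: the temporary file prices and the choice to drop four cars from it are assumptions made so that unmatched observations appear. Because it uses a tempfile and local macro, it should be run as a do-file in one go.

```stata
* Sketch only: "prices" is a temporary file created for this illustration.
sysuse auto, clear
keep make price
drop in 1/4                     // four cars will be missing from this file
tempfile prices
save `prices'

sysuse auto, clear
keep make mpg                   // master data: all 74 cars
merge 1:1 make using `prices'   // make uniquely identifies each car
tabulate _merge                 // master-only vs. matched observations
```

Tabulating _merge right after every merge is a quick way to confirm that the match rate is what you expect.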


DATA ANALYSIS

summarize provides descriptive statistics
→ syntax: summarize [varlist] [if] [in] [weight] [, options]
→ example: summarize agep if sex==1, detail

tabulate creates one- and two-way frequency tables; tab1 produces a one-way table for each variable listed, and tab2 produces a two-way table for each pair of variables listed
→ syntax: tabulate varname [if] [in] [weight] [, tabulate1_options]
→ example: tab1 mar cow if agep>=25
→ syntax: tabulate varname1 varname2 [if] [in] [weight] [, options]
→ example: tab2 mar cow if agep>=25

tabstat displays a table of summary statistics
→ syntax: tabstat varlist [if] [in] [weight] [, options]
→ example: tabstat agep wagp pincp, by(sex) stat(mean sd min max)
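The descriptive commands above, sketched on Stata's bundled auto dataset so they can be run without the workshop's ACS file:

```stata
* Sketch only: uses Stata's bundled auto dataset instead of the workshop data.
sysuse auto, clear
summarize price if foreign == 1, detail    // detailed statistics for a subset
tabulate rep78                             // one-way frequency table
tabulate rep78 foreign                     // two-way frequency table
tabstat price mpg weight, by(foreign) statistics(mean sd min max)
```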


DATA ANALYSIS

Mean-comparison tests: ttest
→ syntax: ttest varname [if] [in] , by(groupvar) [options1]
→ example: ttest pincp, by(sex)

Correlation matrix: corr or pwcorr (pairwise)
→ syntax: correlate [varlist] [if] [in] [weight]
→ syntax: pwcorr [varlist] [if] [in] [weight] [, pwcorr_options]
→ example: correlate pincp schl
→ example: pwcorr pincp schl, star(.05)

Linear regression: regress
→ syntax: regress depvar [indepvars] [if] [in] [weight] [, options]

Logistic regression: logit
→ syntax: logit depvar [indepvars] [if] [in] [weight] [, options]

Include dummy variables in the analyses
→ syntax: xi [, prefix(string) noomit] : any_stata_command varlist_with_terms
→ example: xi: regress mpg i.make

Arrange regression, summary, and tabulation results into an illustrative table: outreg2 (a user-written command; install it with ssc install outreg2)
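The tests and models above can be sketched on Stata's bundled auto dataset. The choice of variables is an assumption made for illustration only, not a recommended model. The i.foreign term uses factor-variable notation (available since Stata 11), which creates the dummy variables without the xi prefix; the xi form is shown alongside it for comparison.

```stata
* Sketch only: variable choices are illustrative, not a recommended model.
sysuse auto, clear
ttest mpg, by(foreign)                 // compare mean mpg across two groups
correlate price mpg weight             // correlation matrix (listwise deletion)
pwcorr price mpg weight, star(.05)     // pairwise correlations, star if p < .05
regress price mpg weight i.foreign     // linear regression with a factor variable
xi: regress price mpg weight i.rep78   // same idea using the xi prefix
logit foreign mpg weight               // logistic regression on a 0/1 outcome
```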