Descriptive Statistics and Visualizing Data in STATA

BIOS 514/517

R. Y. Coley

Week of October 7, 2013

Log Files, Getting Data in STATA

Log files save your commands

cd /home/students/rycoley/bios514-517

• To change directory

log using stata-section-oct7, replace text

• To name log file (change stata-section-oct7)

• capture log close to close log file

insheet using http://courses.washington.edu/b517/Datasets/FEVdata.csv

• To get FEV data in

Defining, Labeling Variables

table smoke

• Currently coded as 1 and 2

• No missing data (would be coded as 9)

label define smokelabel 1 "smoker" 2 "non-smoker"

label values smoke smokelabel

label define sexlabel 1 "male" 2 "female"

label values sex sexlabel

Labeling Variables

label variable age "Age (years)"

label variable fev "FEV (L/s)"

label variable height "Height (in)"

Descriptive Statistics

Basic commands detailed in this week’s lecture notes:

• summarize

• means

• centile

• tabstat

• tabulate

Descriptive Stats by Group

bysort sex: tabstat fev, stat(n mean sd min p25 med p75 max) col(stat) format

bysort sex: tabulate smoke

Defining New Variables

A few ways:

• gen age9over = age>=9

• gen age9over = 0

replace age9over=1 if age>=9

• gen age9over = age==9 | age==10 | age==11... | age==19

Measures of Spread

• Range: tabstat fev, stat(min max range)

• Variance: tabstat fev, stat(var)

• Standard Deviation: tabstat fev, stat(sd)

• Interquartile Range: tabstat fev, stat(p25 p75 iqr)

• IQR is the distance between the 25th and 75th percentiles of the data

Visualizing Data- Histograms

histogram fev

to save: graph export hist-fev.png, replace

Height of each bar is proportional to the proportion of observations in that bin’s range

Visualizing Data- Histograms

histogram fev, kdensity by(sex)

kdensity adds smooth line estimating density

Visualizing Data- Dotplots

dotplot fev

Each dot represents an observation

Visualizing Data- Box Plots

• a.k.a. “Box and whiskers” plots

• Box extends from lower quartile (25th percentile of data) to upper quartile (75th percentile) with a line at the median (50th percentile).

• Whiskers extend from lower quartile to “lower adjacent value” and from upper quartile to “upper adjacent value”

\[
\text{LAV} = \text{lower quartile} - \tfrac{3}{2}\,\text{IQR}, \qquad \text{UAV} = \text{upper quartile} + \tfrac{3}{2}\,\text{IQR} \tag{1}
\]

(Stata draws each whisker to the most extreme observation that lies within these limits.)

• Observations outside the UAV and LAV plotted as points

• (Some box plots have whiskers extend to minimum and maximum observations.)
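As a small worked illustration of the whisker limits, suppose (hypothetically; these are not the FEV quartiles) that the lower quartile is 2.5 and the upper quartile is 3.5. Then:

\[
\text{IQR} = 3.5 - 2.5 = 1.0, \qquad
\text{LAV} = 2.5 - \tfrac{3}{2}(1.0) = 1.0, \qquad
\text{UAV} = 3.5 + \tfrac{3}{2}(1.0) = 5.0
\]

With these made-up quartiles, an observation of 5.4 would fall above the UAV and be plotted as an individual point, while an observation of 4.8 would sit inside the whisker.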

Visualizing Data- Box Plots

graph box fev

Visualizing Data- Box Plots

graph box fev, over(sex)

Visualizing Data- Scatterplots

scatter fev height

Visualizing Data- Bar Charts

gen one=1

graph bar (count) one, over(smoke) ytitle("frequency")

Another Example

log using cause-of-death, text replace

set obs 10

input float deaths str30 cause

700142 "Heart Disease"

553768 "Cancer"

163538 "Cerebrovascular Disease"

123013 "Chronic respiratory disease"

101537 "Accidental Death"

71372 "Diabetes"

62034 "Flu and pneumonia"

53852 "Alzheimer’s disease"

39480 "Kidney disorder"

32238 "Septicemia"

Visualizing Data- Bar Charts

gen dthou=deaths/1000

graph hbar dthou, over(cause) ytitle("Annual deaths (thousands)")

Visualizing Data- Bar Charts

gen dthou=deaths/1000

graph hbar dthou, over(cause, sort(1) descending) ytitle("Annual deaths (thousands)")

Visualizing Data- Pie Charts

graph pie deaths, over(cause) sort descending

Doing it all over again in R!

Look at the code I have posted on the discussion board. It is extensively commented (##)! Comments omitted here.

data<-read.csv("FEVdata.csv",header=TRUE)

names(data)

dim(data)

n<-dim(data)[1]

(Re-)defining variables

Variables don’t have labels like in Stata. But, we can improve upon the current coding of “smoke” and “sex”.

data$SMOKE[data$SMOKE==2]<-0

data$FEMALE<-data$SEX==2
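One way to get something close to Stata's value labels in R is a factor. The sketch below is only an illustration: it uses a small made-up data frame (`toy`) with the original 1/2 coding, and the column names `sex.f` and `smoke.f` are hypothetical.

```r
# Toy data using the same 1/2 coding as the FEV file; the values are made up.
toy <- data.frame(SEX = c(1, 2, 2, 1), SMOKE = c(2, 2, 1, 2))

# factor() attaches a text label to each numeric code,
# much like label define + label values in Stata.
toy$sex.f   <- factor(toy$SEX,   levels = c(1, 2), labels = c("male", "female"))
toy$smoke.f <- factor(toy$SMOKE, levels = c(1, 2), labels = c("smoker", "non-smoker"))

table(toy$sex.f)                 # counts are shown by label
table(toy$smoke.f, toy$sex.f)    # cross tabulation with labelled margins
```

The numeric columns are left untouched, so either coding can be used in later commands.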

Creating a new variable:

data$age9over<-data$AGE>=9

Descriptive Statistics

summary(data$FEV) #min, 1Q, Med, Mean, 3Q, Max

mean(data$FEV)

quantile(data$FEV, probs=c(0.25, 0.5, 0.75))

table(data$SMOKE)

xtabs(~data$SMOKE+data$FEMALE) #to get cross tabulation
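Descriptive statistics by group (what `bysort sex: tabstat` does in Stata) can be sketched in R with `tapply()` or `aggregate()`. The data frame `toy` below holds made-up values standing in for the FEV data; only the column names follow the slides.

```r
# Made-up values standing in for the FEV data.
toy <- data.frame(FEV    = c(2.1, 3.4, 2.8, 4.0, 3.1, 2.5),
                  FEMALE = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))

# tapply() splits FEV by group and applies a function to each piece.
tapply(toy$FEV, toy$FEMALE, mean)      # mean FEV for males (FALSE) and females (TRUE)
tapply(toy$FEV, toy$FEMALE, sd)        # standard deviation by group
tapply(toy$FEV, toy$FEMALE, summary)   # min, quartiles, mean, max by group

# aggregate() gives the group means as a data frame.
by.sex <- aggregate(FEV ~ FEMALE, data = toy, FUN = mean)
by.sex
```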

Measures of Spread

range(data$FEV) #gives min and max

var(data$FEV) #variance

sd(data$FEV) #standard deviation
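The width of the range and the interquartile range can be computed in a similar way. This is a minimal sketch on a made-up vector `x`, not the FEV data. Percentiles from R may differ slightly from Stata's, since the two programs can use different percentile definitions.

```r
x <- c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0)   # made-up values

diff(range(x))                              # width of the range: max - min
q <- quantile(x, probs = c(0.25, 0.75))     # 25th and 75th percentiles
q
IQR(x)                                      # interquartile range, q[2] - q[1]
```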

Histograms

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")

To save the graph:

pdf(file="fev-hist-R.pdf")

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")

graphics.off()

[Figure: “Histogram of FEV”; x-axis: FEV (L/s), y-axis: Frequency]

Histograms

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV", prob=TRUE)

lines(density(data$FEV))

[Figure: “Histogram of FEV” with smooth density line; x-axis: FEV (L/s), y-axis: Density]
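To see what the density scale means, the bin counts and bar heights can be inspected without drawing anything. The sketch below uses a made-up vector `x`; on the density scale the areas of the bars (height times bin width) add up to one.

```r
x <- c(1.2, 1.9, 2.4, 2.6, 3.1, 3.3, 3.8, 4.4, 5.2)   # made-up values

h <- hist(x, plot = FALSE)       # compute the bins without drawing

h$counts                         # number of observations in each bin
h$density                        # bar heights used when prob=TRUE
bin.width <- diff(h$breaks)
sum(h$density * bin.width)       # total area of the bars (1, up to rounding)
```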

Histogram

par(mfrow=c(1,2)) #two panels side by side

hist(data$FEV[data$FEMALE==0], xlab="FEV (L/s)", main="Males", ylim=c(0,80))

hist(data$FEV[data$FEMALE==1], xlab="FEV (L/s)", main="Females", xlim=c(0,6))

[Figure: two histograms, “Males” and “Females”; x-axis: FEV (L/s), y-axis: Frequency]

Boxplot

boxplot(data$FEV, ylab="FEV (L/s)")

[Figure: box plot of FEV (L/s), with a few points plotted beyond the whiskers]
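Which observations end up as individual points can be checked by hand. The sketch below applies the 3/2 IQR rule to a made-up vector `x` and compares the result with `boxplot.stats()`, which reports what `boxplot()` would draw. The names `lower.limit` and `upper.limit` are just for illustration.

```r
x <- c(1.8, 2.2, 2.5, 2.7, 3.0, 3.2, 3.6, 6.5)   # made-up values; 6.5 is unusually large

q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
lower.limit <- q[1] - 1.5 * iqr    # lower quartile - (3/2) IQR
upper.limit <- q[2] + 1.5 * iqr    # upper quartile + (3/2) IQR

# Observations beyond the limits are the ones drawn as points.
x[x < lower.limit | x > upper.limit]

# boxplot.stats() uses hinges rather than quantile(), so its limits can
# differ slightly; $out holds the observations plotted as points.
boxplot.stats(x)$out
```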

Boxplot

boxplot(data$FEV~data$FEMALE, ylab="FEV (L/s)", xaxt="n")

axis(1, at=c(1,2), labels=c("Male", "Female"))

[Figure: side-by-side box plots of FEV (L/s) for Male and Female]

Scatter Plot

plot(data$FEV~data$HEIGHT, ylab="FEV (L/s)", xlab="Height (in)")

[Figure: scatter plot of FEV (L/s) against Height (in)]

Bar Plot

counts<-table(data$SMOKE)

barplot(counts, xlab="Smoker", names.arg=c("No","Yes")) #0 = non-smoker, 1 = smoker after recoding

[Figure: bar plot of counts by Smoker (No, Yes)]

Cause of Death Example in R

n.deaths<-c(700142, 553768, 163538, 123013, 101537, 71372, 62034, 53852, 39480, 32238)

cause<-c("Heart Disease", "Cancer", "Cerebrovascular Disease", "Chronic Respiratory Disease", "Accidental death", "Diabetes", "Flu and Pneumonia", "Alzheimer’s Disease", "Kidney Disorder", "Septicemia")

n.deaths<-n.deaths/1000

Cause of Death Example

par(mar=c(4,6.5,1,1))

bar.mids<-barplot(n.deaths, horiz=TRUE, yaxt="n", xlab="Number of Deaths (Thousands)", main="Cause of Death") #barplot() returns the bar midpoints

text(x=par("usr")[1], y=bar.mids, labels=cause, srt=45, pos=2, xpd=TRUE, cex=0.75) #angled labels to the left of the axis

[Figure: horizontal bar plot “Cause of Death”; x-axis: Number of Deaths (Thousands), one bar per cause with angled labels]
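The Stata option `sort(1) descending` has no direct counterpart in `barplot()`, but the bars can be ordered before plotting. The sketch below is one possible approach, not the code from the discussion board: it uses `order()`, puts the labels in `names.arg`, and assumes a left margin of 10 lines is wide enough for them.

```r
# Counts (in thousands) and causes copied from the slides.
n.deaths <- c(700142, 553768, 163538, 123013, 101537,
              71372, 62034, 53852, 39480, 32238) / 1000
cause <- c("Heart Disease", "Cancer", "Cerebrovascular Disease",
           "Chronic Respiratory Disease", "Accidental death", "Diabetes",
           "Flu and Pneumonia", "Alzheimer's Disease", "Kidney Disorder",
           "Septicemia")

# A horizontal barplot() draws the first element at the bottom, so sorting
# in increasing order puts the largest bar at the top of the chart.
ord <- order(n.deaths)

par(mar = c(4, 10, 1, 1))          # wide left margin for the labels (assumed size)
barplot(n.deaths[ord], names.arg = cause[ord], horiz = TRUE, las = 1,
        cex.names = 0.75, xlab = "Number of Deaths (Thousands)")
```

Using `names.arg` with `las = 1` keeps the labels horizontal and aligned with the bars.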

Cause of Death Example

pie(n.deaths, cause, main="Cause of Death")

[Figure: pie chart “Cause of Death”, one slice per cause]