Descriptive Statistics and Visualizing Data in STATA BIOS 514/517 R. Y. Coley Week of October 7, 2013


Descriptive Statistics and Visualizing Data

in STATA

BIOS 514/517

R. Y. Coley

Week of October 7, 2013


Log Files, Getting Data in STATA

Log files save your commands

cd /home/students/rycoley/bios514-517

• To change directory

log using stata-section-oct7, replace text

• To name log file (change stata-section-oct7)

• capture log close to close log file

insheet using http://courses.washington.edu/b517/Datasets/FEVdata.csv

• To get FEV data in


Defining, Labeling Variables

table smoke

• Currently coded as 1 and 2

• No missing data (would be coded as 9)

label define smokelabel 1 "smoker" 2 "non-smoker"

label values smoke smokelabel

label define sexlabel 1 "male" 2 "female"

label values sex sexlabel


Labeling Variables

label variable age "Age (years)"

label variable fev "FEV (L/s)"

label variable height "Height (in)"


Descriptive Statistics

Basic commands detailed in this week’s lecture notes:

• summarize

• means

• centile

• tabstat

• tabulate


Descriptive Stats by Group

bysort sex: tabstat fev, stat(n mean sd min p25 med p75 max) col(stat) format

bysort sex: tabulate smoke


Defining New Variables

A few ways:

• gen age9over = age>=9

• gen age9over = 0

replace age9over=1 if age>=9

• gen age9over = age==9 | age==10 | age==11 ... | age==19


Measures of Spread

• Range: tabstat fev, stat(min max range)

• Variance: tabstat fev, stat(var)

• Standard Deviation: tabstat fev, stat(sd)

• Interquartile Range: tabstat fev, stat(p25 p75 iqr)

• IQR is the distance between the 25th and 75th percentiles of the data


Visualizing Data- Histograms

histogram fev

to save: graph export hist-fev.png, replace

Height of each bar is proportional to the proportion of observations in that bin’s range


Visualizing Data- Histograms

histogram fev, kdensity by(sex)

kdensity adds smooth line estimating density


Visualizing Data- Dotplots

dotplot fev

Each dot represents an observation


Visualizing Data- Box Plots

• a.k.a. “Box and whiskers” plots

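• Box extends from lower quartile (25th percentile of data) to upper quartile (75th percentile) with a line at the median (50th percentile).

• Whiskers extend from lower quartile to “lower adjacent value” (LAV) and from upper quartile to “upper adjacent value” (UAV): the most extreme observations lying within the limits

$$\text{LAV} \ge \text{lower quartile} - \tfrac{3}{2}\,\text{IQR}, \qquad \text{UAV} \le \text{upper quartile} + \tfrac{3}{2}\,\text{IQR} \qquad (1)$$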

• Observations outside the UAV and LAV plotted as points

• (Some box plots have whiskers extend to minimum and maximum observations.)
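As a rough illustration of the whisker rule, the R sketch below computes the quartiles, the 3/2 IQR limits, and the adjacent values for a small made-up vector (not the FEV data). It assumes R's default quantile method; Stata may interpolate percentiles slightly differently, so the numbers can differ a little between the two packages.

```r
# Made-up data for illustration only (not the FEV data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Lower and upper quartiles using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Limits that lie 3/2 IQR beyond each quartile
lower.limit <- q[1] - 1.5 * iqr
upper.limit <- q[2] + 1.5 * iqr

# Adjacent values: the most extreme observations inside the limits
lav <- min(x[x >= lower.limit])
uav <- max(x[x <= upper.limit])

# Observations beyond the adjacent values would be plotted as points
outside <- x[x < lav | x > uav]
print(c(LAV = lav, UAV = uav))
print(outside)
```

Here the single large value falls beyond the upper limit, so it would appear as a separate point above the upper whisker.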


Visualizing Data- Box Plots

graph box fev


Visualizing Data- Box Plots

graph box fev, over(sex)


Visualizing Data- Scatterplots

scatter fev height


Visualizing Data- Bar Charts

gen one=1

graph bar (count) one, over(smoke) ytitle("frequency")


Another Example

log using cause-of-death, text replace

set obs 10

input float deaths str30 cause

700142 "Heart Disease"

553768 "Cancer"

163538 "Cerebrovascular Disease"

123013 "Chronic respiratory disease"

101537 "Accidental Death"

71372 "Diabetes"

62034 "Flu and pneumonia"

53852 "Alzheimer’s disease"

39480 "Kidney disorder"

32238 "Septicemia"


Visualizing Data- Bar Chart

gen dthou=deaths/1000

graph hbar dthou, over(cause) ytitle("Annual deaths (thousands)")


Visualizing Data- Bar Charts

gen dthou=deaths/1000

graph hbar dthou, over(cause, sort(1) descending) ytitle("Annual deaths (thousands)")


Visualizing Data- Pie Charts

graph pie deaths, over(cause) sort descending


Doing it all over again in R!

Look at the code I have posted on the discussion board. It is extensively commented (##)! Comments omitted here.

data<-read.csv("FEVdata.csv",header=TRUE)

names(data)

dim(data)

n<-dim(data)[1]


(Re-)defining variables

Variables don’t have labels like in Stata. But, we can improve upon the current coding of "smoke" and "sex".

data$SMOKE[data$SMOKE==2]<-0

data$FEMALE<-data$SEX==2

Creating a new variable:

data$age9over<-data$AGE>=9
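One way to get something close to Stata's value labels in R is a factor. A minimal sketch on a made-up data frame: the column names mirror the FEV data and the 1/2 coding is taken from the Stata label definitions earlier in these notes, while the names toy, SMOKE.f and SEX.f are only illustrative.

```r
# Toy data frame; the values are made up for illustration
toy <- data.frame(SMOKE = c(1, 2, 2, 1, 2),
                  SEX   = c(1, 1, 2, 2, 2))

# Attach text labels to the numeric codes by converting to factors
toy$SMOKE.f <- factor(toy$SMOKE, levels = c(1, 2),
                      labels = c("smoker", "non-smoker"))
toy$SEX.f <- factor(toy$SEX, levels = c(1, 2),
                    labels = c("male", "female"))

# Tables (and plots) now show the labels instead of the codes
smoke.counts <- table(toy$SMOKE.f)
print(smoke.counts)
print(table(toy$SMOKE.f, toy$SEX.f))
```

The original numeric columns are left untouched, so either coding can be used in later steps.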


Descriptive Statistics

summary(data$FEV) #min, 1Q, Med, Mean, 3Q, Max

mean(data$FEV)

quantile(data$FEV, probs=c(0.25, 0.5, 0.75))

table(data$SMOKE)

xtabs(~data$SMOKE+data$FEMALE) #to get cross tabulation
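The by-group summaries from the Stata section (bysort sex: tabstat ...) have base R counterparts in tapply() and aggregate(). A minimal sketch on a made-up data frame; the column names mirror the FEV data but the values are invented:

```r
# Toy data; values are made up for illustration
toy <- data.frame(FEV    = c(2.1, 3.4, 2.8, 4.0, 3.1, 2.5),
                  FEMALE = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))

# One statistic at a time, computed within each level of FEMALE
group.mean <- tapply(toy$FEV, toy$FEMALE, mean)
group.sd   <- tapply(toy$FEV, toy$FEMALE, sd)

# Several statistics at once, one row per group
group.stats <- aggregate(FEV ~ FEMALE, data = toy,
                         FUN = function(v) c(n = length(v),
                                             mean = mean(v),
                                             sd = sd(v)))
print(group.mean)
print(group.stats)
```

tapply() returns a named vector indexed by group, which is convenient for quick checks; aggregate() returns a data frame, which is easier to carry into a report.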


Measures of Spread

range(data$FEV) #gives min and max

var(data$FEV) #variance

sd(data$FEV) #standard deviation
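The Stata slide on spread also reports the width of the range and the interquartile range. A sketch of base R equivalents on a made-up vector (not the FEV data), using diff() and IQR():

```r
# Made-up values for illustration only
x <- c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.5)

# Width of the range: maximum minus minimum
range.width <- diff(range(x))

# Interquartile range: 75th percentile minus 25th percentile
iqr <- IQR(x)
iqr.by.hand <- unname(diff(quantile(x, probs = c(0.25, 0.75))))

# The standard deviation is the square root of the variance
sd.from.var <- sqrt(var(x))

print(c(range.width = range.width, iqr = iqr, sd = sd.from.var))
```

IQR() relies on R's default quantile method, so its result may differ slightly from Stata's iqr statistic on the same data.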


Histograms

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")

To save the graph:

pdf(file="fev-hist-R.pdf")

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")

graphics.off()

[Figure: Histogram of FEV; x-axis FEV (L/s), y-axis Frequency]


Histograms

hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV", prob=TRUE)

lines(density(data$FEV))

[Figure: Histogram of FEV with kernel density line; x-axis FEV (L/s), y-axis Density]
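With prob=TRUE the bar heights are densities rather than counts, so the bar areas sum to 1. That puts the histogram on the same scale as the kernel density line. A quick check on simulated data (the seed and the normal parameters are arbitrary assumptions, not estimates from the FEV data):

```r
# Simulated values standing in for a continuous measurement
set.seed(1)
x <- rnorm(200, mean = 2.6, sd = 0.9)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Area of the bars on the density scale: height times bin width
bar.area <- sum(h$density * diff(h$breaks))

# Approximate area under the kernel density estimate (trapezoid rule)
d <- density(x)
curve.area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

print(c(bar.area = bar.area, curve.area = curve.area))
```

Both areas come out at (or very close to) 1, which is why the density line can be overlaid directly on the histogram.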


Histogram

# par(mfrow) puts the two histograms side by side (assumed layout)

par(mfrow=c(1,2))

hist(data$FEV[data$FEMALE==0], xlab="FEV (L/s)", main="Males", ylim=c(0,80))

hist(data$FEV[data$FEMALE==1], xlab="FEV (L/s)", main="Females", xlim=c(0,6))

[Figure: two histograms of FEV, panels "Males" and "Females"; x-axis FEV (L/s), y-axis Frequency]


Boxplot

boxplot(data$FEV, ylab="FEV (L/s)")

[Figure: box plot of FEV; y-axis FEV (L/s)]


Boxplot

boxplot(data$FEV~data$FEMALE, ylab="FEV (L/s)", xaxt="n")

axis(1, at=c(1,2), labels=c("Male", "Female"))

[Figure: box plots of FEV by sex (Male, Female); y-axis FEV (L/s)]


Scatter Plot

plot(data$FEV~data$HEIGHT, ylab="FEV (L/s)", xlab="Height (in)")

[Figure: scatter plot of FEV against height; x-axis Height (in), y-axis FEV (L/s)]


Bar Plot

counts<-table(data$SMOKE)

# names.arg labels each bar at its midpoint (0 = No, 1 = Yes after recoding)

barplot(counts, xlab="Smoker", names.arg=c("No","Yes"))

[Figure: bar plot of counts by smoking status (No, Yes); x-axis Smoker]


Cause of Death Example in R

n.deaths<-c(700142, 553768, 163538, 123013,

101537, 71372, 62034, 53852, 39480, 32238)

cause<-c("Heart Disease", "Cancer", "Cerebrovascular Disease",

"Chronic Respiratory Disease", "Accidental death", "Diabetes",

"Flu and Pneumonia", "Alzheimer’s Disease", "Kidney Disorder",

"Septicemia")

n.deaths<-n.deaths/1000


Cause of Death Example

par(mar=c(4,6.5,1,1))

# barplot() returns the bar midpoints, used below to place the labels

bp<-barplot(n.deaths, horiz=TRUE, yaxt="n", xlab="Number of Deaths (Thousands)", main="Cause of Death")

text(x=par("usr")[1], y=bp, labels=cause, srt=45, pos=2, xpd=TRUE, cex=0.75)

[Figure: horizontal bar plot "Cause of Death"; x-axis Number of Deaths (Thousands), bars labeled by cause]
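The Stata version sorts the bars with sort(1) descending. A rough R counterpart, sketched under the assumption that order() plus names.arg is acceptable here (the margin and text-size settings are arbitrary choices):

```r
# Death counts (in thousands) and causes, as in the example above
n.deaths <- c(700142, 553768, 163538, 123013, 101537,
              71372, 62034, 53852, 39480, 32238) / 1000
cause <- c("Heart Disease", "Cancer", "Cerebrovascular Disease",
           "Chronic Respiratory Disease", "Accidental death", "Diabetes",
           "Flu and Pneumonia", "Alzheimer's Disease", "Kidney Disorder",
           "Septicemia")

# order() returns indices from smallest to largest count; horizontal
# bars are drawn bottom to top, so the largest count ends up on top
ord <- order(n.deaths)

# Wide left margin so the horizontal labels fit
par(mar = c(4, 11, 1, 1))
barplot(n.deaths[ord], names.arg = cause[ord], horiz = TRUE, las = 1,
        cex.names = 0.75, xlab = "Number of Deaths (Thousands)")
```

Using names.arg with las = 1 prints the labels horizontally next to each bar, which avoids positioning them by hand with text().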


Cause of Death Example

pie(n.deaths, cause, main="Cause of Death")

[Figure: pie chart "Cause of Death" with one slice per cause]