© Andrew Ho, Harvard Graduate School of Education
This is S-030: Applied Regression and Data Analysis
For those of you looking for S-052 with John Willett, we’ve switched rooms.
Go upstairs to Larsen 106.
Today’s class starts at 10:15. All subsequent classes will start promptly at 10:10.
Unit 1 / Page 1
Room Switch!
Welcome to S-030!
http://xkcd.com/715/
Building Perception:
1. A well-designed figure should be the goal of every analysis.
2. Know your axes.
3. Know your axis ranges.
4. Anthropomorphize yourself as a point on a figure and ask: who am I, and what story do I tell?
5. Outliers are observations to be explained, not blemishes to be erased or fixed.
How you’ll spend your time in S-030, Part I: What we’ll do in class
Each unit addresses the following:
1) Research Questions and Data Sets
2) Statistical Concepts and Content
3) Interpretation and Presentation of Results
Lectures with your questions: active participation is encouraged, time permitting.
Note-taking: on laptops (in laptop zones at the edges of the lecture hall) or printouts of handouts.
Please be courteous: no cellphones, email, websurfing, IM, texting, or other electronic distractions during class. Attend class. On time.
How you’ll spend time in S-030, Part II: What you’ll do outside of class

Assignments
• 6 homework assignments (~⅔ of grade), pairs mandatory except for the 3rd assignment (done individually).
• 1 final project (~⅓ of grade), completed individually.
• Due e-submitted on iSites by 1PM on the date noted on the syllabus.
• No late assignments accepted.

Individual and group work
• Work in study groups as you’d like, but write and submit HWs in pairs (or individually for the 3rd assignment and final).
• Find partners and study groups on your own, in sections, by emailing us, or by using this Google Doc (linked on the iSite under the Assignments tab): https://docs.google.com/spreadsheet/ccc?key=0AuXHUPzC9fGPdGdSc0VacnFzUnI5YmswbkoxNGE3YWc
• Reference any collaborations beyond yourself and your partner with a footnote on the first page of your completed assignment.

Where to get help:
- Lectures and the Course Website
- Partners and Study Groups
- Your “Homeroom” TF and weekly section
- Any other TF and other sections
- Professor office hours: http://andrew-ho-office-hours.wikispaces.com/

Course website: http://my.gse.harvard.edu/course/gse-s030/2012/spring
No required reading, but optionally: …and review course slides.
The First Three Weeks
January 2012
• Sun 22 – Week 1 begins; no sections
• Tue 24 – Class 1
• Wed 25 – “Doodle” poll for sections sent to you
• Thu 26 – Class 2
• Fri 27 – Doodle poll for sections due, 1PM
• Sun 29 – Week 2 begins; sections begin
• Mon 30 – Assignment 1 out
• Tue 31 – Class 3; sections

February 2012
• Wed 1 – Week 2 (cont.); sections
• Thu 2 – Class 4; sections
• Sun 5 – Week 3 begins
• Tue 7 – Class 5; sections
• Wed 8 – Sections
• Thu 9 – Class 6; sections
• Fri 10 – Drop/grade-change deadline; Assignment 1 due, 1PM
Unit 1: Introduction to Simple Linear Regression (Classes 2–4)
http://xkcd.com/1007/
Assignment #1 has been posted: assignment, data, template, grading guidelines, and partnering spreadsheet.
Sections begin this week. Priya’s section meets today at 2:40 in Gutman 440 (this week only); all other sections meet in Gutman 302.
Where is Unit 1 in our 11-Unit Sequence?
Building a solid foundation
Unit 1: Introduction to simple linear regression
Unit 2: Correlation and causality
Unit 3: Inference for the regression model

Mastering the subtleties
Unit 4: Regression assumptions: evaluating their tenability
Unit 5: Transformations to achieve linearity

Adding additional predictors
Unit 6: The basics of multiple regression
Unit 7: Statistical control in depth: correlation and collinearity

Generalizing to other types of predictors and effects
Unit 8: Categorical predictors I: dichotomies
Unit 9: Categorical predictors II: polychotomies
Unit 10: Interaction and quadratic effects

Pulling it all together
Unit 11: Regression in practice. Common extensions.
How Can Regression Help Us? Prediction.
• Prediction
– Given lifestyle and background, I can predict your lifespan.
– Given polling results to date, I can predict who will win the Florida primary.
– Given previous award wins and nominations, I can predict who will win an Oscar.
http://www.livingto100.com/
http://elections.nytimes.com/2012/fivethirtyeight/primaries/florida
http://carpetbagger.blogs.nytimes.com/2011/02/24/4-rules-to-win-your-oscar-pool/
How Can Regression Help Us? Association and Theory Building.
• Establishing association and relative strengths of associations between variables.
• Building theory: Which constructs are similar and which dissimilar?
http://www.education.com/reference/article/triarchic-theory-of-intelligence/
http://www.socialresearchmethods.net/kb/mtmmmat.php
http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&id=2011-19822-001
How Can Regression Help Us? Understanding Cause and Effect.
• Understanding Cause and Effect (Be careful, here.)
– If we link teacher pay to student performance, do students learn more?
– If we increase our movie’s advertising budget, how much will our box office revenues grow?
– If my parents had gone to college, what might my GPA have been?
• The causal inference must be supported by a *design*, not merely a model (more on this in Unit 2).
• Consider causal inference a second layer of inference, above and beyond prediction and association, that requires a higher standard of evidence and argument.
Three Types of Variables
Question Predictor(s)
A variable or variables whose effects or predictive potential you wish to examine.
Graphically: on the horizontal axis.
Also: Independent Variable(s)

Outcome Variable
A variable that you wish to predict, have an effect upon, or use to measure the effects of the predictors.
Graphically: on the vertical axis.
Also: Response Variable, Dependent Variable

Covariate(s)/Control(s)
Additional predictors whose effects you would like to adjust for, account for, or “statistically control for.”
Graphically: on the legend.
Also: Control Predictors, Extraneous Variables
The distinction between Predictors and Covariates is substantive, not statistical. Sometimes, no distinction is made between them.
The distinction between the predictor (X) and the outcome (Y) is substantive AND statistical.
Example 1: Critical Acclaim and Oscar Odds
• How well does the overall critical acclaim of a movie predict whether it will win an Oscar?
– Question Predictor: Critical Acclaim (Rotten Tomatoes score: what percent of critics like/recommend the movie?)
– Outcome Variable: Oscar Win
– Control Predictor: Golden Globe (alternative award) Win
– In boxes and arrows: Critical_Acclaim → Oscar_Win, with Golden_Globe as a control.
– In an informal prediction equation: Oscar_Win = Critical_Acclaim + Golden_Globe
– In a graph: the outcome on the vertical axis, the question predictor usually on the horizontal axis.
[Figure: Oscar win probability (vertical axis, 0–100%) vs. Critical_Acclaim (Tomatometer, horizontal axis, 0–100%), with separate curves for Golden Globe Win and Golden Globe Loss.]
Example 2: The SATs and Freshman GPA
• In 2001, an influential study on University of California students prompted the redesign of the SATs.
• The variables: Freshman GPA (FGPA), High School GPA (HSGPA), SAT I, and SAT II (Achievement/Subject Tests).
[Path diagram: HSGPA, SAT I, and SAT II as predictors of FGPA.]
Model Language: How well does the SAT I predict freshman GPA when accounting for (after controlling for) high school GPA and SAT II scores?
http://www.ucop.edu/sas/research/researchandplanning/pdf/sat_study.pdf
Statistical models will almost never fit the data perfectly. Interpreting their results requires considering:
• Other structural (systematic) components (not included in the model or not measured)
• Sampling Variation – Differences between the sample data that you have and the population to which you hope to generalize.
• Neglecting these is to neglect systematic and random error, respectively.
Models: Simplified representations of relationships among variables

Mathematical/Physical Models
Geometry: Area of a square = (length)²
Or kinetic energy: KE = (1/2)·mass·velocity²
Mathematical models and most physical models are deterministic:
• Some are linear; some nonlinear, but…
• All squares behave this way: once we know the “rule,” we can use it to fit the model to data perfectly.
• We might say that “the data fit the model” rather than “the model fits the data.” The former is not a widely accepted expression in statistics.

Statistical Models
Modeling people, organizations, or any type of social or cultural unit.
Outcome = Structural Component (Predictors) + Residual (Error)
Step 1: Identify the structural (systematic) components and note their predictive utility for the outcome.
Step 2: Assess how well we did by examining the residual (error) terms.
Step 3: Evaluate whether an alternative structural model may better explain the outcome.
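The three steps above can be sketched in code. This is an illustrative Python sketch with made-up numbers, not the course’s Stata workflow; the data and variable names are invented for the example:

```python
# Outcome = Structural Component (Predictors) + Residual (Error)
# Hypothetical data, invented for illustration only.
xs = [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]  # predictor
ys = [68.0, 74.0, 85.0, 95.0, 99.0, 112.0, 118.0]   # outcome

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Step 1: identify a structural component -- here a straight line, fit by least squares.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Step 2: assess how well we did by examining the residuals (observed - predicted).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse_line = sum(e ** 2 for e in residuals)

# Step 3: compare against an alternative structural model -- here, "mean only,
# no predictor" -- to ask whether the line explains the outcome better.
sse_mean = sum((y - y_bar) ** 2 for y in ys)
print(b0, b1, sse_line, sse_mean)
```

The comparison in Step 3 foreshadows the ANOVA decomposition later in this unit: the mean-only model’s error is the total variation that the line tries to explain.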
The Art of Statistical Modeling
• Statistical mantra (attributed to George Box): All models are wrong; some are useful.
• Our goal is less a single statistical model than a clearly written account of the journey we take to get there.
• We take the perspective that all statistical models are vast oversimplifications that may nonetheless lead us to make accurate predictions, increase our understanding of relationships and mechanisms, and build useful theory.
• Our most common and dangerous errors (and the ones we grade most harshly) typically involve overstating our results, sliding into deterministic, unwarranted causal language, and drifting lazily towards a single, simple answer.
Outcome = Structural Component (Predictors) + Residual (Error)
http://xkcd.com/793/
The sordid history of Statistics: Galton and Burt
Sir Francis Galton (1822-1911)
• A strong relationship in which nature dominates: “families of reputation were much more likely than ordinary families to produce offspring of ability”
• Recommended “judicious marriages during several generations” to “produce a highly gifted race of men”
• His “genetic utopia”: “Bright, healthy individuals were treated and paid well, and encouraged to have plenty of children. Social undesirables were treated with reasonable kindness so long as they worked hard and stayed celibate.”
No data for “intelligence,” so he instead studied HEIGHT.
Outsourced modeling to JD Dickson, a Cambridge mathematician, who formalized linear regression.
• Sir Cyril Burt (1883-1971)
• Burt’s father was Galton’s physician
• Over a 30-year period, he and two RAs (Miss Howard and Miss Conway) accrued data on 53 pairs of separated twins:
• 15 pairs in 1943
• Up to 21 pairs in 1955
• Up to 53 pairs in 1966
• “‘Intelligence’, when adequately assessed, is largely dependent on genetic constitution” (Burt, 1966)
Studied heredity by fitting statistical models predicting IQs of identical twins raised in “foster” (adoptive) homes from IQs of siblings raised in biological parents’ homes
IQ scores for Cyril Burt’s identical twins reared apart
Results of the list command in Stata
Predictor (X): owniq
Outcome (Y): fostiq
RQ: What’s the relationship between the IQ of the child raised in an adoptive home and his/her identical twin raised in the birth home? n = 53
Instinct 1: Get your eyes on (and fingers in) the data. [Also, try the Data Editor (browse) button.]
Try also: summarize, detail. Or: histogram owniq, frequency kdensity. Or: dotplot owniq fostiq.
Univariate summaries of the predictor and the outcome
Results of the summarize, stem, and graph box commands
Instinct 2: Visualize the data. Univariate summaries, stat.
• What is the unit of analysis? What are we looking at? Persons? Schools? Countries?
• What is the scale? What does the distance between the minimum and the maximum mean?
• What is the central tendency (mean/median)? Is that a reasonable value on this scale?
• What is the standard deviation or interquartile range? Is that a reasonable value on this scale?
• What is the skewness/symmetry? Is the mean much larger than the median (suggests positive skew)?
• Are there many outlying observations (high kurtosis/heavy tails) or surprisingly few (whiskers much shorter than 1.5·IQR)?
• Remember that there are no hard and fast definitions for positive skew, heavy tails, or outliers. The relevance of these observations must be mediated by your substantive knowledge of the domain and the final inferences you wish to draw from your analysis.
• The smaller the sample, the more anomalies we expect. Don’t obsess over slight skewness and kurtosis when samples are small.
• Looking at these distributions, they are similar and symmetrical with no obvious outlying univariate observations.
Questions to ask about univariate data graphs and summaries.
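Several items on this checklist are directly computable. Here is a Python sketch with invented data (the course reads these quantities off Stata’s summarize, stem, and graph box output instead):

```python
# Invented data (in $ millions) with one large value, for illustration only.
data = [40.6, 12.3, 65.8, 30.0, 22.5, 95.1, 381.0, 55.2, 18.9, 48.7]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def sample_sd(xs):
    # Sample standard deviation, dividing by n - 1.
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

m, med, s = mean(data), median(data), sample_sd(data)

# Mean much larger than median suggests positive skew.
print("mean > median (positive skew)?", m > med)

# Flag observations far from the mean -- to be explained, not erased.
far_out = [x for x in data if abs(x - m) > 2 * s]
print("observations beyond 2 SDs:", far_out)
```

The 2-SD threshold here is an arbitrary illustration; as the checklist says, there is no hard and fast definition of an outlier.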
We fit models to data; we don’t fit data to models.
Growing accusations
• In 1973, Arthur Jensen, a supporter of Burt, noted “misprints and inconsistencies in some of the data”
• In 1974, Leon Kamin noted how odd it was that Burt’s correlation coefficients remained virtually unchanged as the sample size increased (r = .770, r = .771, and r = .771)
• In 1976, a London Sunday Times reporter tried to find the RAs and concluded that they did not exist
• In 1979, The British Journal of Psychology added the following notice to Burt’s 1966 paper: “The attention of readers of the Journal is drawn to the fact that it has now been established that this paper contains spurious data”
• In 1995, an edited volume with 5 essays, Cyril Burt: Fraud or Framed (Oxford), found evidence of sloppy writing and cutting and pasting of text, but perhaps not fraudulent data
• Debate continues to this day, and with Burt long dead, the conclusion may be that we’ll never know
• See also Stephen Jay Gould’s classic treatise against the folly of these early hereditarians: The Mismeasure of Man (1981).
Less “Ideal” Data: The 2011 Box Office
• Instinct 1: Get your eyes on the data
• movie11.dta – All 146 movies that achieved wide release in 2011.
grossmil – 2011 box office gross by yesterday (in millions of dollars)
theaters – Number of theater locations upon wide release (>600)
openday – Day of the year in 2011 (1 is January 1, 365 is December 31)
openmil – Opening weekend box office gross (in millions of dollars)
Source: http://boxofficemojo.com/
[Figures: US box office gross (in millions) by 1/25/12; number of theater locations upon opening.]
Univariate Summaries and Model Language
[Figure 1: histogram, frequency vs. US box office gross (in millions) by 1/25/12. Figure 2: histogram, frequency vs. number of theater locations upon opening.]
Instinct 2: Visualize. Univariate Summaries
Figure 1 shows the distribution of US box office gross revenues for the 146 movies that achieved a wide release in 2011. The distribution exhibits a strong positive skew, with a mean of $65.8 million, while half of releases earned less than the median of $40.6 million. The two top-grossing movies, Harry Potter and Transformers, were well over four standard deviations above the mean.
Figure 2 shows… The distribution has a noticeable negative skew, with a mean of 2800 below the median of 2954. There is the suggestion of bimodality with a second local maximum below 1000. This is consistent with a second category of “limited release” movies restricted from the analysis by variable definition.
*As an aside, don’t sweat kurtosis.
The bivariate relationship between the predictor and the outcome
Results of the graph twoway scatter or, simply, scatter command
Learn the standard terminology: a scatterplot of Y vs. X, or Y on X. Here, we plot fostiq on owniq.
Instinct 3: Visualize. Bivariate Summaries
Figure 3 represents the bivariate relationship between twin IQs as a scatterplot. The association is strong, positive, and linear, with no noticeable outlying observations. Hm…
[Figure 4: scatterplot of US box office gross (in millions) by 1/25/12 vs. number of theater locations upon opening.]
Less “Ideal” Bivariate Summaries
Instinct 3: Visualize. Bivariate Summaries
Figure 4 shows… There appears to be a strong, nonlinear, positive association between the two variables.
A small number of observations appear to have gross revenues larger than what might be predicted by the number of theater locations, although this conclusion requires further analysis.
Foreshadowing: A log transformation of the grossmil variable is the clear next step.
. scatter grossmil theaters
Questions to ask when examining scatterplots
First, all the same questions we asked for univariate graphs... for each variable. Then:
• Direction of relationship?
• Linearity of relationship?
• Strength of relationship?
• Any unusual observations?
How do we statistically model the bivariate relationship between X and Y?
Use theory and bivariate graphics to decide on the model’s structure: its functional form.
Simple linear regression: Why are straight lines so popular?
• Transformations to achieve linearity: In Units 5 & 10, we’ll learn how to tweak our variables so that we can use straight-line machinery to fit curves to data.
• A limited range of X may yield linearity: Range restrictions are common in social research and can allow the linear model to fit the data well within an observed range. Appreciating this, we must be cautious of extending predictions beyond the range of observed data.
• Actual linearity: Many relationships – particularly those in the physical world – are in fact linear.
• Mathematical and conceptual simplicity: A straight line is among the simplest of mathematical relationships between variables; it makes our work very tractable. Even when data arise from more complex nonlinear processes, the linear model can get us surprisingly accurate predictions.
[Figure: scatterplots of fostiq (IQ of twin raised in ‘foster home’ by adoptive parents) vs. owniq (IQ of twin raised in ‘own home’ by birth parents), both axes 60–140.]
How do we statistically model the bivariate relationship between X and Y?
Mathematically represent the model’s structure.
Any two points define a line. We use slope-intercept form: Y = β₀ + β₁X.
Y and X are our outcome and predictor, on the vertical and horizontal axes, respectively. β is the Greek “b” – beta.
http://www.livingwaterbiblegames.com/greek-alphabet-handwriting.html
[Figure: the fostiq vs. owniq scatterplot redrawn with both axes extended to 0, so the intercept (the predicted value of Y at X = 0) is visible.]
β₀ – The intercept (also beta-zero, beta-nought, the constant term): the predicted/estimated value of Y when X is 0, even if a 0 value for X is unobserved, unrealistic, or meaningless.
β₁ – The regression coefficient (also beta-one, the slope): the increment in the predicted/estimated value of Y for any unit change in X.
There is a big oversimplification here (parameters vs. statistics) that we will revisit in a moment!
Eyeballing the fitted line: intercept 10ish? 11ish? Slope: rise over run, just less than 1ish?
How do we statistically model the bivariate relationship between X and Y?
Formalize the statistical model.
As we said earlier, a statistical model has two components: a systematic/structural component and a residual/error component. We formalize that here: Yᵢ = β₀ + β₁Xᵢ + εᵢ.
ε is the Greek “e” – epsilon.
http://www.livingwaterbiblegames.com/greek-alphabet-handwriting.html
ε – The residual (also, the error term, “e”): the difference between the observed (actual) value and the expected (predicted) value from the systematic/structural component. The “miss.” Reflects the reality that our model is not deterministic. We generally want these values to be as small as possible. They represent a challenge to our structural model: what is left to be explained.
The Population Regression Equation and Residuals
(Xᵢ, Yᵢ), a shorthand for datum i.
β₀ = 10 and β₁ = .9, let’s say…
The “miss,” εᵢ, visualizable as a vertical distance from the line.
Ŷᵢ = 10 + .9Xᵢ: “Y-hat,” our predicted/estimated value… The Prediction Equation. Assumes known parameters.
Population Parameters vs. Sample Statistics
The Population. An abstraction to which we generalize.All possible twins, separated at birth, that we could theoretically sample.Generally thought to be infinite.
The Sample. The data we have sampled, ideally in an unbiased, representative fashion, from the population. Of finite size, in this case, n = 53.
A parameter is a fact about a population.
Parameters β₀ and β₁. Written in Greek. Rarely if ever known in practice.
A statistic is a fact about a sample.
Statistics β̂₀ and β̂₁. Written in Roman, or in Greek with hats designating parameter estimates. What we use for inference about the population.
Inference about the population from sample data:How do we fit the hypothesized model to observed data?
How do we get β̂₀ and β̂₁?
Understanding the (ordinary) least squares (OLS) criterion
Observed values (Yᵢ) are the sample foster IQ points.
Residuals (eᵢ) are the vertical distances between the observed and predicted values for each datum. Every data point has a residual.
So a “good” line would go through the “center” of the data and have small residuals (eᵢ);
…perhaps as small as possible???
Predicted values are estimated using the prediction equation, Ŷᵢ = β̂₀ + β̂₁Xᵢ.
Ordinary Least Squares (OLS) criterion: Minimize the sum of the squared vertical residuals.
The least-squares criterion selects the statistics that make the sum of squared residuals as small as possible (for this particular sample). Obtaining the least-squares regression line for the sample data is a purely geometrical problem requiring no assumptions.
How do we find the “best-fit” line that has the smallest residuals possible?
http://illuminations.nctm.org/activitydetail.aspx?id=82
Σ(eᵢ)² = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − (β̂₀ + β̂₁Xᵢ))²
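The OLS criterion has a closed-form solution: β̂₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and β̂₀ = Ȳ − β̂₁X̄. As an aside, a Python sketch with invented numbers (not Burt’s 53 pairs) can check that the resulting line really does make the sum of squared vertical residuals as small as possible:

```python
# Invented (X, Y) pairs for illustration; not Burt's actual data.
xs = [68.0, 75.0, 82.0, 90.0, 97.0, 105.0, 114.0, 120.0]
ys = [63.0, 80.0, 78.0, 94.0, 96.0, 109.0, 125.0, 118.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Closed-form least-squares estimates.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

def ssr(a, b):
    """Sum of squared vertical residuals for the line Yhat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = ssr(b0, b1)
# Nudging the line in any direction increases the criterion.
for da, db in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert ssr(b0 + da, b1 + db) > best
print(round(b0, 3), round(b1, 3), round(best, 1))
```

Because the criterion is a convex function of the intercept and slope, any perturbation of the least-squares line strictly increases the sum of squared residuals, which is what the loop verifies.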
Three “best-fit” regression lines. The line you want, and why.
The OLS criterion minimizes the sum of vertical squared residuals.
Other definitions of “best fit” are possible:
• Vertical squared residuals (OLS)
• Horizontal squared residuals (X on Y)
• Orthogonal residuals
Why does it make substantive sense to minimize vertical residuals? Because we are interested in trying to predict Y given X.
Midpoint of the vertical slice.
Regression in the Population: The Conditional Distribution
[Figure: density plots of the unconditional distribution of fostiq (IQ of twin raised in ‘foster home’ by adoptive parents), and of the conditional distributions of fostiq given particular values of owniq.]
Regression in the sample: Imagining conditional distributions in the population
The lfit command and layering plots using ()
Y = β̂₀ + β̂₁X + e
Ŷ = β̂₀ + β̂₁X
From this perspective, we can see that the predicted value, Ŷ, represented by the regression line, is a conditional mean: an average of a conditional distribution at a particular vertical slice through the data.
We should start to perceive these population conditional distributions, their means, and their variances.
When X = 93, Ŷ ≈ 94; when X = 106, Ŷ ≈ 106.
The mean and standard deviation of conditional distributions
You should be able to eyeball the mean and standard deviation of any approximately normal distribution. The mean is the balance point, the central tendency. This is straightforward. The standard deviation is the distance from the mean to the inflection point of the curve. Approximately 2/3 of observations are within ±1 standard deviation of the mean.
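The “about 2/3” figure can be checked against the normal CDF, which Python’s standard library exposes via math.erf. A numerical aside, not part of the course materials:

```python
import math

def normal_cdf(z):
    # CDF of the standard normal, written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability that a normal observation falls within 1 SD of the mean.
within_1sd = normal_cdf(1.0) - normal_cdf(-1.0)
print(round(within_1sd, 4))  # ≈ 0.6827, i.e., roughly 2/3
```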
For prediction, do we want conditional distributions to have large or small standard deviations?
Population mean: μ, the Greek “m,” pronounced “moo.”
Unconditional population mean: μ_Y. Unconditional sample mean: Ȳ. Conditional population mean: μ_{Y|X}. Conditional estimated mean: Ŷ.
Population standard deviation (SD): σ, sigma, the Greek “s” (vs. capital Sigma: Σ).
Unconditional population SD: σ_Y. Unconditional sample SD: s_Y. Conditional population SD: σ_{Y|X}. Conditional estimated SD: σ̂_{Y|X}, the RMSE.
These four assumptions are often summarized in a shorthand: εᵢ ~ iid N(0, σ²). This reads: the population residuals, εᵢ, are independent and identically normally distributed with mean 0 (values centered on the prediction) and a common variance σ² (and thus a common SD).
The linear regression model and its four assumptions
Assumption 1: At each value of X, there is a conditional distribution of Y that is normal with mean μ_{Y|X} and SD σ_{Y|X}.
Assumption 2: The straight-line model is correct: the μ_{Y|X} fall on a line.
Assumption 3: Homoscedasticity. The conditional standard deviations, σ_{Y|X}, are equal across all X.
Assumption 4: Conditional independence. For any value of X, the ε’s are independent. They share no hidden common association. Cannot be visualized from the plot.
Results of fitting a least squares regression line to Cyril Burt’s data
The regress command in Stata
Intercept parameter estimate: β̂₀ = 9.7195
Interpretation:
Intercept – For a pair of twins where owniq = 0, the predicted value of fostiq is 9.7195.
Slope – A unit increment in owniq is associated with a 0.9079-unit increment in the predicted value
of fostiq.
Mistake: A unit increment in fostiq is associated with…
Double check the number of observations.
Predictor Variable
Slope parameter estimate: β̂₁ = 0.9079
Always takes the form regress Y X
The Estimated Prediction Equation: Ŷᵢ = 9.7195 + 0.9079·Xᵢ
Prediction Postestimation
(80, 82.35)
(120, 118.67)
(97.36, 98.11)
The Estimated Prediction Equation: Ŷᵢ = 9.7195 + 0.9079·Xᵢ
. predict yhat, xb
After the regress command, you can save the predicted values, Ŷᵢ, for each Xᵢ:
When X = 80, Ŷ = 82.35.
When X = 120, Ŷ = 118.67.
When X = X̄ = 97.36, Ŷ = Ȳ = 98.11.
The regression line will always go through the centroid: (X̄, Ȳ).
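The three predicted values follow directly from the estimated prediction equation; a quick Python check using the coefficients printed on the slide (9.7195 and 0.9079):

```python
# Coefficients as printed on the slide for Burt's data.
b0, b1 = 9.7195, 0.9079

def yhat(x):
    """Predicted fostiq for a given owniq, from the estimated prediction equation."""
    return b0 + b1 * x

print(yhat(80.0))    # ≈ 82.35
print(yhat(120.0))   # ≈ 118.67
print(yhat(97.36))   # ≈ 98.11: the line passes through the centroid (Xbar, Ybar)
```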
Disciplined perception: four values, three distances.
The story of Mary Kate and Ashley.
Four Values
Xᵢ, IQ of Mary Kate in own home.
Yᵢ, IQ of Ashley in foster home.
Ŷᵢ, predicted IQ of Ashley in foster home.
Ȳ, the unconditional mean of foster twins.
Three Distances
(Yᵢ − Ȳ), the Total Deviation, what to “explain”: Ashley above average.
(Yᵢ − Ŷᵢ), the Error/Residual Deviation, what is “unexplained”: Ashley above conditional expectation.
(Ŷᵢ − Ȳ), the Regression Deviation, what is “explained”: Ashley accounted for.
The Analysis of Variance (ANOVA) Decomposition
The ANOVA Decomposition of Deviation:
Total Deviation = Regression Deviation + Error Deviation
Point to the Mean = Line to the Mean + Point to the Line
This is a tautology, like saying,
It’s more interesting to ask, instead of decomposing the deviations for a single observation, how can we summarize the balance of regression vs. error deviations over all observations?
The ANOVA Decomposition of Variance:
Total Variation = Regression Variation + Error Variation
Step 1: Let’s make sure we understand how to compute residuals: vertical distances between observed values (Yᵢ) and fitted values (Ŷᵢ).
Conclusion: The IQ of the first foster twin is 8.5 points lower than the linear regression model predicts given his/her non-adoptive twin’s IQ.
Observation 1: X₁ = 68, Y₁ = 63
Ŷ₁ = 9.7195 + 0.9079(68) = 71.46
Over-predicted
Under-predicted
Positive residuals:
Negative residuals
Sometimes we under-predict, sometimes we over-predict, but across the full sample, the residuals will always sum to 0.
This is pretty nifty, but it also indicates why the sum of residuals is not a good summary of error.
First: For a particular Xᵢ, compute Ŷᵢ by substituting into the regression equation.
Second: Calculate the residual using eᵢ = Yᵢ − Ŷᵢ: e₁ = 63 − 71.46 = −8.46
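The two-step residual computation, plus the sum-to-zero property, in a short Python sketch. The first part uses the coefficients and observation printed on the slides; the second sample is invented to illustrate the property:

```python
# Slide's estimates for Burt's data, and Observation 1 (owniq = 68, fostiq = 63).
b0, b1 = 9.7195, 0.9079
x1, y1 = 68.0, 63.0
yhat1 = b0 + b1 * x1      # First: predicted value from the regression equation
e1 = y1 - yhat1           # Second: residual = observed - predicted
print(round(yhat1, 2), round(e1, 2))   # ≈ 71.46 and ≈ -8.46

# Residuals from any least-squares fit sum to (essentially) zero.
# Invented sample for illustration:
xs = [68.0, 80.0, 95.0, 110.0, 120.0]
ys = [63.0, 85.0, 92.0, 108.0, 121.0]
n = len(xs)
xb, yb = sum(xs) / n, sum(ys) / n
slope = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
         / sum((x - xb) ** 2 for x in xs))
intercept = yb - slope * xb
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(abs(sum(resid)) < 1e-9)  # True: which is why the plain sum is a poor error summary
```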
Ȳ = 98.11
Ŷᵢ = 113.22
Yᵢ = 125
Total Deviation: (Yᵢ − Ȳ) = 26.89
Step 2: To what might we reference the size of the residual?
Regression Deviation: (Ŷᵢ − Ȳ) = 15.11
Error Deviation: (Yᵢ − Ŷᵢ) = 11.78
(ID 46: owniq = 114, fostiq = 125)
Two ways the ANOVA decomposition helps us evaluate the quality of the fit
1.Total deviations provide a reference point for evaluating the magnitude of the residuals
2.Regression deviations quantify the accomplishment of the regression model.
Now:
1. Generalize these ideas across cases.
2. Arrive at a meaningful metric.
The mean would be our “best guess” for all values of Y if we had no information from our regression model. It is our reference point.
ANOVA decomposition of deviations
$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
Total Deviation = Regression Deviation + Error Deviation
(point to mean) = (line to mean) + (point to line)
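For ID 46, this identity is easy to verify numerically. A minimal Python check using the slide's rounded values:

```python
# ANOVA decomposition of deviations for ID 46 (values rounded as on the slide).
y_bar, y_hat, y = 98.11, 113.22, 125

total_dev = y - y_bar        # point to mean: 26.89
regr_dev = y_hat - y_bar     # line to mean:  15.11
error_dev = y - y_hat        # point to line: 11.78

# Total = Regression + Error, up to floating-point noise.
assert abs(total_dev - (regr_dev + error_dev)) < 1e-9
print(round(total_dev, 2), round(regr_dev, 2), round(error_dev, 2))
```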
Step 3: Summarize across observations to a single descriptive statistic
Total Dev: $(Y_i - \bar{Y})$ = Regr Dev: $(\hat{Y}_i - \bar{Y})$ + Error Dev: $(Y_i - \hat{Y}_i)$
Point to mean = Line to mean + Point to line
What is $R^2$?

$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

Sum of Squares Total (SST) = Sum of Squares Model (SSM) + Sum of Squares Error (SSE)

$R^2 = \dfrac{SSM}{SST} = \dfrac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$

The proportion of total variation that is accounted for by the model.
Analysis of Variance regression decomposition in Stata
$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

$12{,}035 \approx 9{,}251 + 2{,}785$

$R^2 = \dfrac{9{,}251}{12{,}035} \approx 76.86\%$
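A back-of-the-envelope check in Python, using the rounded sums of squares from the Stata ANOVA table (the small discrepancy in SS Total is rounding):

```python
# Rounded sums of squares from the lecture's ANOVA decomposition.
ss_model, ss_error = 9251, 2785
ss_total = ss_model + ss_error   # 12036; the table shows ~12035 (rounding)

r2 = ss_model / ss_total         # proportion of variation "accounted for"
print(round(100 * r2, 2))        # -> 76.86 (percent)
```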
Interpreting $R^2$

76.9 percent of the variation in the foster twins' IQ scores is "attributable to" or "accounted for by" or "explained by" or "associated with" or "predicted by" the IQ of the twin raised in the parental home.

What about the remaining 23.1%? Environment, SES of household, nutrition, measurement error, random error, individual variation, alien abductions… Error is what we haven't modeled yet.
The Ubiquitous $R^2$

The variance of $Y$ that is accounted for by $X$… The single most widespread and easily interpretable summary statistic derivable from a single regression analysis. Essential to describing the overall predictive function of the model.
SS Total = SS Model + SS Error
One last parameter to estimate: the residual variance, $\sigma^2_{Y|X}$
[Figure: at each value of $X$ ($X_1$, $X_2$, $X_3$, …), a conditional distribution of $Y$ centered at $\mu_{Y|X_1}$, $\mu_{Y|X_2}$, $\mu_{Y|X_3}$]
1. At each value of $X$, there is a distribution of $Y$. These distributions have a mean of $\mu_{Y|X}$ and a variance of $\sigma^2_{Y|X}$.
2. Homoscedasticity: the variances of each of these distributions, the $\sigma^2_{Y|X}$, are identical.
Of what importance is $\sigma^2_{Y|X}$, the residual variance of $Y$ at each value of $X$?
$\hat{\sigma}^2_Y = \dfrac{\sum (Y_i - \bar{Y})^2}{n-1}$
Why do we divide by $n-1$? Because we estimated 1 parameter (the mean) in order to estimate this other parameter (the variance). If we didn't do this, the sample variance would be a biased estimate of the population variance.
So $\hat{\sigma}^2_{Y|X}$ tells us about the variability of the residuals—the unexplained variability in $Y$ that's "left over"
From Variation to Variance
Let's start by reviewing the sample variance of $Y$
$\hat{\sigma}^2_Y = \dfrac{12{,}035}{52} \approx 231.45$

Does this numerator look familiar? It is SS Total, the sum of squared total deviations.

Now we move from the variance of $Y$ to the conditional variance of $Y$.
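The $n-1$ denominator is exactly what Python's `statistics.variance` uses. A quick check with hypothetical scores (not the actual twin data):

```python
from statistics import mean, variance

y = [63, 92, 125, 98, 112]      # hypothetical scores, not the Burt data
n = len(y)

# SS Total: squared "point to mean" deviations, summed.
ss_total = sum((yi - mean(y)) ** 2 for yi in y)
by_hand = ss_total / (n - 1)    # divide SS Total by n - 1

# statistics.variance (the *sample* variance) also divides by n - 1.
assert by_hand == variance(y)
print(by_hand)                  # -> 546.5
```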
From estimating $\sigma^2_Y$ to estimating $\sigma^2_{Y|X}$

$\hat{\sigma}^2_{Y|X} = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \dfrac{2784.66}{51} = 54.60$

Take away 2 because we estimated both $\beta_0$ and $\beta_1$ to estimate $\hat{Y}_i$.
Mean Square Error (MSE): $\hat{\sigma}^2_{Y|X} = 54.60$
Root Mean Square Error (RMSE): $\hat{\sigma}_{Y|X} = \sqrt{54.60} = 7.39$, the estimated standard deviation of the residuals

The Underappreciated RMSE
The estimated standard deviation of $Y$ conditional on $X$, $\hat{\sigma}_{Y|X}$. Expressed on the scale of $Y$, so it requires knowledge of that scale to interpret. Not distorted by restriction of range. Tends to remind you of just how much variance there is left to explain.
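Putting MSE and RMSE together in a short Python sketch (SS Error from the lecture; the $n-1 = 52$ denominator earlier implies 53 twin pairs; with one predictor we divide by $n-2$):

```python
from math import sqrt

ss_error = 2784.66    # sum of squared residuals from the lecture
n, k = 53, 1          # 53 twin pairs, one predictor (owniq)

mse = ss_error / (n - (k + 1))   # Mean Square Error: divide by n - 2 = 51
rmse = sqrt(mse)                 # back on the IQ-point scale of Y

print(round(mse, 2))             # -> 54.6
print(round(rmse, 2))            # -> 7.39
```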
$\widehat{\text{Variance}} = \dfrac{\text{Sum of Squares}}{n - \text{number of parameters estimated}}$
A loose understanding of degrees of freedom (df)
Why do we divide SS Total (Point to Mean) by $n-1$ to get our estimated variance, but divide SS Residual (Point to Line) by $n-2$ to get our estimated conditional variance?
The simplest answer: that's what we have to do to get unbiased estimates of the population parameters.
A somewhat less simple answer: we divide by the number of observations ($n$) minus the number of parameters we must estimate in order to calculate our target.
To estimate the variance, we first need to estimate the mean (Point to Mean), so we divide by $n-1$.
To estimate the conditional variance, we first need to estimate the regression line (Point to Line), and the line has two parameters, so we divide by $n-2$.
For the real answer, take a matrix algebra course!
In general: if you have a regression equation with $k$ predictors, you will need to estimate $k+1$ parameters. For example, in simple linear regression we have one predictor (e.g., owniq) and two parameters to estimate (slope and intercept). Thus, for OLS linear regression in general, our degrees of freedom for error will be $n-(k+1) = n-k-1$.
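A tiny sketch of the general rule (`df_error` is a hypothetical helper name, not course code):

```python
def df_error(n: int, k: int) -> int:
    """Error degrees of freedom for OLS with n observations and k predictors."""
    return n - (k + 1)   # k slopes plus 1 intercept are estimated

print(df_error(53, 1))   # simple regression on the 53 twin pairs -> 51
print(df_error(100, 3))  # three predictors -> 96
```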
Unit 1 Recap: Let’s step back a bit and see what we’ve learned.
• Burt wanted to understand how well the IQ of a separated twin in his or her own home predicted the IQ of a matched twin raised in a foster home.
• We obtained a prediction equation and two statistics that describe how well the prediction equation works: $R^2$ and the MSE or RMSE.
• I would describe this as a strong predictive relationship given the context. A substantial proportion of the variance of foster twins' IQs can be accounted for by the IQs of their respective twins raised in their own homes.
A 1 IQ-point increment in owniq is associated with a 0.91 IQ-point increment in predicted fostiq.
An $R^2$ of .7686: 76.86% of the variance in fostiq is accounted for by the predictor variable, owniq.
An RMSE of 7.39: conditional on a particular owniq value, actual fostiq values tend to vary around the prediction with a standard deviation of 7.39. The unconditional sd was 15.21.
$\widehat{fostiq} = 9.719 + 0.908 \cdot owniq$
Don’t forget your research question… and other research questions of interest
• Apart from prediction for and association between owniq and fostiq, is there another research question that might arise from these data?
• It is worth remembering that regression works perfectly well when $X$ and $Y$ are on different scales. We can use weight to predict height, socioeconomic status to predict test scores, and whether our class meets to predict snowfall.
• However, when X and Y are on the same scale, different research questions can arise…
Regression helps us understand whether there is an association between $X$ and $Y$, whether we can predict $Y$ from $X$, and whether $X$ can account for $Y$'s variance.
Mean comparisons address whether the means of $X$ and $Y$ are equal or different.
Different research questions; different statistical approaches
Caution: Don't abuse or overinterpret kdensities! Reserve them for large sample sizes or well-behaved distributions.
What are the take-home messages from this unit?
• The regression model represents your hypothesis about the population
– When you fit a regression model to data, you are computing sample estimates of population parameters that you'll never actually observe directly
– Don’t confuse sample estimates with population parameters—estimates are just estimates, even if they have sound statistical properties
– The regression model is a model for the average of $Y$ at each given value of $X$: the conditional mean of $Y$ given $X$. Individual variation figures prominently in the model through the error term (and the residuals that represent it)
• Be sure to fully understand the meaning of the regression coefficients
– These are the building blocks for all further data analysis; take the time to make sure you have a complete and intuitive understanding of what they tell us
– Distinguish clearly between the magnitude (the value of $\hat{\beta}_1$ in context) and the strength of a predictive relationship (high $R^2$, low RMSE)—don't confuse these separate concepts.
– The regression approach assumes linearity. We’ll learn in Units 4 and 5 how to evaluate these assumptions and what to do if they don’t hold
• $R^2$ is a nifty summary of how much the regression model helps us
– Be careful about causal language—the phrases “accounted for” or “explained by” do not imply determinism or causation.
– The ANOVA decomposition of variance, which leads to $R^2$ and our estimate of the RMSE (the residual standard deviation), will appear in subsequent calculations; be sure you understand them.