WS 2006/07 Prof. Dr. J. Schütze, FB GW 1
Fachhochschule Jena / University of Applied Sciences Jena
Overview

Descriptive statistics          Inferential statistics
Sample                          Population
Relative frequency              Probability
Mean ...                        Expectation ...

Probability theory supplies the model; inferential statistics delivers estimations with calculation of risk.
Contents of the lecture
A Descriptive statistics
basic terms, parameters for univariate and multivariate samples
B Probability theory
calculating with probabilities, random variables, distributions, limit theorems
C Inferential statistics
estimation of parameters, confidence intervals, hypothesis testing, non-parametric tests

Literature:
J.D. Jobson, Applied Multivariate Data Analysis, Springer Texts in Statistics
Hinkelmann, Kempthorne, Design and Analysis of Experiments, Wiley Series
Examples
Engineering
When measuring a certain quantity, various uncontrollable parameters can affect the result. Repeated measurement may lead to different results. We consider the data to be outcomes of a random variable. From the different results, we estimate the unknown quantity. How exact or reliable is this estimation?
Biology/Medicine
A newly developed drug to reduce cholesterol has to be compared to a standard drug concerning its effectiveness. For reasons of time and cost, it is not possible to test the drug in the whole population of people suffering from an elevated cholesterol level. How certain are conclusions drawn from measurements in only a sample of the whole population? Which differences in effectiveness are random, and when do we see a significant difference in effectiveness?
Statistical methods

• samples also allow reliable conclusions
• however, only with a certain confidence
• the sample survey has to be random

                      Advantages          Disadvantages
Complete population   exact information   high effort, costs,
                                          data protection
Sample survey         low effort          result depends on how well the
                                          sample represents the population
Basic terms
Sample space: all observation units possible in principle (intended population)
Sample: randomly chosen subset of observation units from the sample space
Survey unit/statistical unit (proband): each observation unit contained in the sample
Basic terms
Characteristic/statistical variable: intended aim (observed values) of the survey
Values of characteristic: measured or observed values of a statistical unit/proband
Possible values: range of the statistical variable

Characteristics differ in their information content. We distinguish several levels of scale.
Question:
Which chocolate do you buy?
Ritter Sport
Milka
Sarotti
else
Level of scale of a variable
The possible answers are categories, which can only have the relations "equal" or "different". For data input they are usually coded with numbers; nevertheless, no meaningful calculation with these codes is possible. No averages, no spread, only reports on frequency! Such variables have categorical/nominal level.
Question:
How much do you like chocolate?
1. not at all
2. not much
3. neutral
4. a little
5. very much

The possible answers 1, ..., 5 form an ordinal scale. They have the relations '>' and '<'. The distances between the values may be perceived differently, depending on the observer. Such variables have an ordinal level.
Level of scale of a variable
Question:
How much money do you spend on chocolate per week? _ _,_ _ €
The answer is a number on the natural scale. We can distinguish the relations '>' and '<'. Additionally, we can compute meaningful differences, sums and averages.
Such variables have a cardinal or metric scale.
For the metric scale, we distinguish between discrete and continuous scales. A discrete scale results if the answer originates from counting. Continuous variables can assume each value in an interval.
Level of scale of a variable
Nominal scale: data expresses qualitative attributes (categories); ordering on a scale is impossible (only equal or different); differences between values cannot be measured.
Ordinal scale: data can be sorted, but differences in values are not quantifiable.
Cardinal scale (metric): data is measured on a discrete or continuous scale; the difference between values characterizes a correspondingly large difference in characteristics.
The information content increases along this sequence.
Level of scale of data
Univariate discrete characteristics

A discrete characteristic X is measured n times, giving the sample $x_1, \ldots, x_n$. If there are only $k < n$ possible different values $x_1, \ldots, x_k$, one can count how often these values occur.

Absolute frequency $h(x_i)$, $1 \le i \le k$: number of occurrences of $x_i$ among the values of the sample.

Relative frequency: $f(x_i) = \frac{h(x_i)}{n}$, $1 \le i \le k$.

Properties:
$0 \le h(x_i) \le n$, $\sum_{i=1}^{k} h(x_i) = n$
$0 \le f(x_i) \le 1$, $\sum_{i=1}^{k} f(x_i) = 1$
Univariate discrete characteristics
Example 1: grades of 20 students:
2, 1, 2, 3, 3, 3, 1, 4, 2, 4, 3, 3, 2, 3, 5, 4, 5, 4, 3, 2
sample size n = 20
different possible values: 1, 2, 3, 4, 5, that means k = 5

Determination of frequency:

value   h(x_i)   f(x_i)   cumulative frequency
1       2        0.10     0.10 (10%)
2       5        0.25     0.35 (35%)
3       7        0.35     0.70 (70%)
4       4        0.20     0.90 (90%)
5       2        0.10     1.00 (100%)
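The frequency table can be reproduced with a few lines of code; a minimal sketch in Python (the variable names are ours):

```python
from collections import Counter

grades = [2, 1, 2, 3, 3, 3, 1, 4, 2, 4, 3, 3, 2, 3, 5, 4, 5, 4, 3, 2]
n = len(grades)                     # sample size n = 20

abs_freq = Counter(grades)          # h(x_i): absolute frequencies
cum = 0.0
for value in sorted(abs_freq):
    f = abs_freq[value] / n         # f(x_i) = h(x_i) / n
    cum += f                        # cumulative relative frequency
    print(value, abs_freq[value], round(f, 2), round(cum, 2))
```

The loop prints exactly the rows of the table above.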
Graphical display, e.g. with bar charts of absolute/relative frequencies or absolute/relative cumulative frequencies.

[Bar charts: absolute frequencies of the grades 1-5 (heights 2, 5, 7, 4, 2) and cumulative percentages (10, 35, 70, 90, 100).]
Univariate discrete characteristics
Univariate continuous characteristics

A continuous characteristic X is measured n times, giving the sample $x_1, \ldots, x_n$. In general, all occurring values are different, and a simple bar chart gives no information about the distribution of X.

We divide the interval between the smallest and largest observation into equally wide, mutually exclusive classes $K_i$, $1 \le i \le k$, with number of classes $k \ll n$.

Absolute frequency of a class: $h(K_i)$ = number of the $x_1, \ldots, x_n$ in $K_i$, $1 \le i \le k$.
Relative frequency: $f(K_i) = \frac{h(K_i)}{n}$, $1 \le i \le k$.

The absolute/relative cumulative frequency H(x)/F(x) is computed as the sum of the frequencies over all classes left of x, including x:
$H(x) = \sum_{\text{all classes up to } x} h(K_i)$, $F(x) = \sum_{\text{all classes up to } x} f(K_i)$
Example 2: cholesterol values of 1067 probands (raw data are classified)

[Histogram of relative frequencies and function of cumulative frequency, both plotted over the cholesterol class midpoints from 100 to 380.]
Univariate continuous characteristics
Empirical quantiles
Problem: Below which boundary do we find half of (a tenth of, ...) the sample?

Example 3: height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 57, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

E.g. the 10%-quantile is a boundary which divides the sample such that exactly one value (10%) lies below and nine values (90%) lie above it. Each value between 48 and 49 would do, so the boundary is not unique. To avoid this problem, we take the value in the middle, 48.5.

Concerning the 15%-quantile, the demanded percentage can only be fulfilled approximately: with only ten values, each observation accounts for 10%, so 15% cannot be met exactly.

We sort the sample in ascending order. With size n, one value corresponds to a fraction 1/n, and k values to a fraction k/n.

From k/n = α we find that the fraction α is reached for the first time when the value with index α·n (if it is an integer) is reached.
Calculation of empirical quantiles

Sample sorted in ascending order: $x_{\min} = x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)} = x_{\max}$

The empirical α-quantile for $0 < \alpha < 1$ is the number

$x_\alpha = \begin{cases} x_{(k)}, & k-1 < \alpha n < k \\ \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right), & k = \alpha n \text{ an integer} \end{cases}$

The α-quantile divides the range of the sample into two parts such that below the quantile there are α·100%, and above it (1-α)·100%, of the sample values.
Empirical quantiles
Example 3 continued
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 57, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

10%-quantile: $\alpha n = 10 \cdot 0.10 = 1$ is an integer, so $x_{0.10} = \frac{1}{2}(x_{(1)} + x_{(2)}) = 48.5$
15%-quantile: $\alpha n = 10 \cdot 0.15 = 1.5$, so $k = 2$ and $x_{0.15} = x_{(2)} = 49$
Example 3 continued
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

lower quartile: $\alpha n = 10 \cdot 0.25 = 2.5$, so $k = 3$ and $x_{0.25} = x_{(3)} = 49$
median: $\alpha n = 10 \cdot 0.5 = 5$ is an integer, so $x_{0.5} = \frac{1}{2}(x_{(5)} + x_{(6)}) = 50.5$
upper quartile: $\alpha n = 10 \cdot 0.75 = 7.5$, so $k = 8$ and $x_{0.75} = x_{(8)} = 51$
interquartile range: $d = x_{0.75} - x_{0.25} = 51 - 49 = 2$
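The quantile rule above can be sketched in Python; `quantile` is our own helper, not part of the lecture:

```python
import math

def quantile(sample, alpha):
    """Empirical alpha-quantile following the lecture's rule:
    x_(k) with k = ceil(alpha*n) if alpha*n is not an integer,
    otherwise the mean of x_(k) and x_(k+1) with k = alpha*n."""
    xs = sorted(sample)
    n = len(xs)
    k = alpha * n
    # note: for arbitrary alpha, guard against floating-point error in alpha*n
    if k == int(k):                      # alpha*n is an integer
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2   # average of x_(k) and x_(k+1)
    return xs[math.ceil(k) - 1]          # x_(k) with k = ceil(alpha*n)

heights = [51, 50, 51, 49, 49, 51, 50, 57, 48, 52]
print(quantile(heights, 0.10))   # 48.5
print(quantile(heights, 0.25))   # 49   (lower quartile)
print(quantile(heights, 0.50))   # 50.5 (median)
print(quantile(heights, 0.75))   # 51   (upper quartile)
```

These reproduce the values computed by hand above.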
Boxplots

[Boxplot of the sorted sample 48, 49, 49, 50, 50, 51, 51, 51, 52, 57 with lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$ and upper quartile $x_{0.75} = 51$.]

width of box = interquartile range = 2
Detection of outliers
Width of box = 2, therefore we find: normal range (49 - 1.5·2, 51 + 1.5·2) = (46, 54). Values outside of the normal range are outliers, like the value 57. The fences denote the area which is covered by the "normal values" of the sample.

The normal range extends from the lower quartile - 1.5·box width to the upper quartile + 1.5·box width.
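The fence rule translates directly into code; a sketch reusing the lecture's quantile convention (`outliers` is our own name):

```python
import math

def quantile(xs_sorted, alpha):
    # empirical alpha-quantile as defined in this lecture
    n = len(xs_sorted)
    k = alpha * n
    if k == int(k):
        k = int(k)
        return (xs_sorted[k - 1] + xs_sorted[k]) / 2
    return xs_sorted[math.ceil(k) - 1]

def outliers(sample):
    """Values outside (lower quartile - 1.5*IQR, upper quartile + 1.5*IQR)."""
    xs = sorted(sample)
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1                        # box width
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(outliers([51, 50, 51, 49, 49, 51, 50, 57, 48, 52]))  # [57]
print(outliers([51, 50, 51, 49, 49, 51, 50, 53, 48, 52]))  # []
```

The two calls match the original and the altered data set discussed on the following slides.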
Altered data set
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 53, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 53
lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$, upper quartile $x_{0.75} = 51$
box width = 2, normal range (49 - 3, 51 + 3) = (46, 54); there is no outlier
Detection of outliers
Newly altered data set
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 55, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 55
lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$, upper quartile $x_{0.75} = 51$
box width = 2, normal range (49 - 3, 51 + 3) = (46, 54); the value 55 is an outlier
Detection of outliers
Statistical measures
Measures of location:

arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  with absolute frequencies: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i^* \, h(x_i^*)$
  with relative frequencies: $\bar{x} = \sum_{i=1}^{k} x_i^* \, f(x_i^*)$
median: $\tilde{x} = x_{0.5}$

Measures of deviation:

empirical variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$
  with absolute frequencies: $s^2 = \frac{1}{n-1}\sum_{i=1}^{k}(x_i^* - \bar{x})^2 \, h(x_i^*)$
  with relative frequencies: $s^2 = \frac{n}{n-1}\sum_{i=1}^{k}(x_i^* - \bar{x})^2 \, f(x_i^*)$
standard deviation: $s = \sqrt{s^2}$
coefficient of variation: $v = \frac{s}{\bar{x}}$
standard error: $s_{\bar{x}} = \frac{s}{\sqrt{n}}$
mean absolute deviation: $d = \frac{1}{n}\sum_{i=1}^{n}\left| x_i - \tilde{x} \right|$
interquartile range: $\tilde{d}_{0.5} = x_{0.75} - x_{0.25}$
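As a check on the formulas, here is a short Python sketch applying the location and deviation measures to the height data of example 3 (the variable names are ours):

```python
import math

x = [51, 50, 51, 49, 49, 51, 50, 57, 48, 52]       # heights, example 3
n = len(x)

mean = sum(x) / n                                   # arithmetic mean: 50.8
x_sorted = sorted(x)
median = (x_sorted[n // 2 - 1] + x_sorted[n // 2]) / 2   # for even n: 50.5

s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)    # empirical variance
s = math.sqrt(s2)                                   # standard deviation
v = s / mean                                        # coefficient of variation
se = s / math.sqrt(n)                               # standard error
mad = sum(abs(xi - median) for xi in x) / n         # mean absolute deviation

# the shortcut form of the variance from the table gives the same value
s2_alt = (sum(xi * xi for xi in x) - n * mean ** 2) / (n - 1)
```

Both variance formulas agree up to floating-point rounding, as the equality in the table states.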
Multivariate characteristics
When measuring more than one characteristic at the same object, we often want to know if those characteristics are dependent.
Example 4: During a traffic check, all drivers who had to pay a speeding fine were asked their age.

Age:      20 23 24 59 55 26 32 29 43 38 31 36
Speeding: 22 22 40 23 34 22 22 21 28 27 25 29

Here we have two cardinal characteristics for 12 drivers. To analyze the dependence, the points can be shown graphically; such plots are called scatter plots.
Example of a scatter plot

[Scatter plot of speeding fine versus age, with the mean values $\bar{x} = 34.67$ (age) and $\bar{y} = 26.25$ (speeding) marked.]

Measures of dependence for cardinal characteristics

Age:      20 23 24 59 55 26 32 29 43 38 31 36
Speeding: 22 22 40 23 34 22 22 21 28 27 25 29
We divide the area of the plot into 4 quadrants by the mean values of the single dimensions. An argument for a linear dependence is that all points are situated in the first and third, respectively in the second and fourth, quadrant.

ascending tendency: $x_i > \bar{x}, y_i > \bar{y}$ or $x_i < \bar{x}, y_i < \bar{y}$, therefore $(x_i - \bar{x})(y_i - \bar{y}) > 0$
decreasing tendency: $x_i > \bar{x}, y_i < \bar{y}$ or $x_i < \bar{x}, y_i > \bar{y}$, therefore $(x_i - \bar{x})(y_i - \bar{y}) < 0$

If the points are distributed over all quadrants, there is no linear tendency.

The products $(x_i - \bar{x})(y_i - \bar{y})$ are the main items for the measure of dependence.

Covariance: $\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Correlation coefficient (Pearson): $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

Measures of dependence for cardinal characteristics
Equivalent notations of the Pearson correlation:

$r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$
Measures of dependence for cardinal characteristics
Interpretation of the Pearson correlation

The correlation coefficient of Pearson measures how close the linear dependence between X and Y is.

We always find: $-1 \le r \le 1$.

For r = 1, all data points are situated on an ascending line.
For r = -1, all data points are situated on a descending line.
For r = 0, there is no linear tendency.
Measures of dependence for cardinal characteristics
Example 4: calculation of the correlation coefficient of Pearson

  X    Y     X²     Y²     XY
 20   22    400    484    440
 23   22    529    484    506
 24   40    576   1600    960
 59   23   3481    529   1357
 55   34   3025   1156   1870
 26   22    676    484    572
 32   22   1024    484    704
 29   21    841    441    609
 43   28   1849    784   1204
 38   27   1444    729   1026
 31   25    961    625    775
 36   29   1296    841   1044
Σ   416  315  16102   8641  11067

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} = \frac{12 \cdot 11067 - 416 \cdot 315}{\sqrt{(12 \cdot 16102 - 416^2)(12 \cdot 8641 - 315^2)}} = 0.1858$
Measures of dependence for cardinal characteristics
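The same computation in Python, using the shortcut formula from the table (a sketch; the names are ours):

```python
import math

age      = [20, 23, 24, 59, 55, 26, 32, 29, 43, 38, 31, 36]
speeding = [22, 22, 40, 23, 34, 22, 22, 21, 28, 27, 25, 29]
n = len(age)

sx, sy = sum(age), sum(speeding)                     # 416, 315
sxx = sum(x * x for x in age)                        # 16102
syy = sum(y * y for y in speeding)                   # 8641
sxy = sum(x * y for x, y in zip(age, speeding))      # 11067

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))  # 0.1858
```

This confirms the column sums and the value of r from the hand calculation.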
Measures of dependence for cardinal characteristics
Correlation coefficient of Spearman
for ordinal characteristics, or cardinal characteristics with outliers

$R(x_i)$: rank number of $x_i$ in the ascending order of the values of X
$R(y_i)$: rank number of $y_i$ in the ascending order of the values of Y
Values appearing multiple times receive the same middle rank.

$r_s = \frac{\sum \left(R(x_i) - \bar{R}\right)\left(R(y_i) - \bar{R}\right)}{\sqrt{\sum \left(R(x_i) - \bar{R}\right)^2 \sum \left(R(y_i) - \bar{R}\right)^2}} = \frac{\sum R(x_i) R(y_i) - n\bar{R}^2}{\sqrt{\left(\sum R(x_i)^2 - n\bar{R}^2\right)\left(\sum R(y_i)^2 - n\bar{R}^2\right)}}$, where $\bar{R} = \frac{n+1}{2}$.

The correlation coefficient of Spearman measures a monotonous dependence.
For r = 1, all data points show a monotonous ascending tendency.
For r = -1, all data points show a monotonous descending tendency.
For r = 0, there is no monotonous tendency.
Example 4
Correlation coefficient of Spearman
Measures of dependence for cardinal characteristics
xi:     20  23  24  59  55  26  32  29  43  38  31  36
yi:     22  22  40  23  34  22  22  21  28  27  25  29
R(xi):   1   2   3  12  11   4   7   5  10   9   6   8
Example 4
Correlation coefficient of Spearman
The smallest value of y is 21; it receives rank 1.
Then the value 22 appears four times, on the ranks 2, 3, 4 and 5. Instead of these rank numbers, each receives the same average rank
$\frac{2+3+4+5}{4} = 3.5$,
and afterwards the numbering is continued with 6.

xi:     20   23  24  59  55   26   32  29  43  38  31  36
yi:     22   22  40  23  34   22   22  21  28  27  25  29
R(yi): 3.5  3.5  12   6  11  3.5  3.5   1   9   8   7  10
Example 4
Correlation coefficient of Spearman

xi:     20   23  24  59  55   26   32  29  43  38  31  36
yi:     22   22  40  23  34   22   22  21  28  27  25  29
R(xi):   1    2   3  12  11    4    7   5  10   9   6   8
R(yi): 3.5  3.5  12   6  11  3.5  3.5   1   9   8   7  10

$n = 12$, $\sum R(x_i) = \sum R(y_i) = \frac{n(n+1)}{2} = \frac{12 \cdot 13}{2} = 78$, $\bar{R} = \frac{n+1}{2} = 6.5$
$\sum R(x_i)^2 = 650$, $\sum R(y_i)^2 = 645$, $\sum R(x_i) R(y_i) = 567$

$r_s = \frac{\sum R(x_i) R(y_i) - n\bar{R}^2}{\sqrt{\left(\sum R(x_i)^2 - n\bar{R}^2\right)\left(\sum R(y_i)^2 - n\bar{R}^2\right)}} = \frac{567 - 12 \cdot 6.5^2}{\sqrt{(650 - 12 \cdot 6.5^2)(645 - 12 \cdot 6.5^2)}} = 0.427$

Measures of dependence for cardinal characteristics
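A sketch of the Spearman computation in Python; `midranks` is our own helper implementing the middle-rank convention:

```python
import math

def midranks(values):
    """Ranks in ascending order; tied values share the average rank."""
    xs = sorted(values)
    rank = {}
    for v in set(values):
        first = xs.index(v) + 1          # first 1-based position of v
        last = first + xs.count(v) - 1   # last 1-based position of v
        rank[v] = (first + last) / 2     # middle rank
    return [rank[v] for v in values]

age      = [20, 23, 24, 59, 55, 26, 32, 29, 43, 38, 31, 36]
speeding = [22, 22, 40, 23, 34, 22, 22, 21, 28, 27, 25, 29]

rx, ry = midranks(age), midranks(speeding)
n = len(rx)
rbar = (n + 1) / 2                       # mean rank, here 6.5
num = sum(a * b for a, b in zip(rx, ry)) - n * rbar ** 2
den = math.sqrt((sum(a * a for a in rx) - n * rbar ** 2)
                * (sum(b * b for b in ry) - n * rbar ** 2))
print(round(num / den, 3))  # 0.427
```

The ranks produced by `midranks` match the tables above, and the quotient reproduces r_s = 0.427.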
Linear regression

If the cardinal characteristics X, Y have a high correlation coefficient, they have a close linear dependence, which can be described by a linear equation.

Set-up: $y = b_0 + b_1 x$

The residuals $y_i - (b_0 + b_1 x_i)$ are the vertical differences between the measured points and the line. The sum of the squared residuals is minimized by choosing suitable parameters.

The unknown coefficients $b_0, b_1$ are determined by the following criterion of optimality (method of least squares):
$\sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 \to \min$
Linear regression

The optimal estimates for the parameters are obtained by calculating the minimum of
$f(b_0, b_1) = \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)^2 \to \min$

Setting the partial derivatives of f with respect to $b_0$, $b_1$ to zero leads to the system of normal equations
$\sum y_i = n b_0 + b_1 \sum x_i$
$\sum x_i y_i = b_0 \sum x_i + b_1 \sum x_i^2$

with the following solutions:

Regression coefficient: $b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$

Regression constant: $b_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i y_i \sum x_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
Example 5
Linear regression
x 1 2 4 5
y 3 2 1 1
Linear regression

In the example:

  x   y   x²  xy
  1   3    1   3
  2   2    4   4
  4   1   16   4
  5   1   25   5
sum  12   7   46  16

$b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{4 \cdot 16 - 12 \cdot 7}{4 \cdot 46 - 12^2} = -0.5$

$b_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i y_i \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{7 \cdot 46 - 16 \cdot 12}{4 \cdot 46 - 12^2} = 3.25$

Regression function: y = 3.25 - 0.5x
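The normal-equation solutions can be checked with a few lines of Python (a sketch; the names are ours):

```python
x = [1, 2, 4, 5]
y = [3, 2, 1, 1]
n = len(x)

sx, sy = sum(x), sum(y)                           # 12, 7
sxx = sum(v * v for v in x)                       # 46
sxy = sum(a * b for a, b in zip(x, y))            # 16

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)    # regression coefficient
b0 = (sy * sxx - sxy * sx) / (n * sxx - sx ** 2)  # regression constant
print(b0, b1)  # 3.25 -0.5
```

This reproduces the regression function y = 3.25 - 0.5x of example 5.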
Linear regression
Function of regression: y = 3.25 - 0.5x

Thus the calculated function of regression fits the points optimally according to the used criterion.

Because the criterion minimizes the sum of the squared vertical deviations between the points and the line, it is called MLS regression (Method of Least Squares).
Linear regression

Goodness of fit of the regression function

Residuals: $\hat{e}_i = y_i - \hat{y}_i$, the vertical differences between the measured points and the regression line. From these, we derive the residual variance.

Sum of squares: $SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$

Residual variance: $\hat{s}^2 = MSE = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$  (MSE: Mean Square Error)

For the MSE, the divisor n - 2 corresponds to the number of degrees of freedom, which diminishes the sample size n by two because of the two estimated parameters $b_0$ and $b_1$.
Evaluation of goodness of fit for y = 3.25 - 0.5x

  x   y   ŷᵢ = b̂₀ + b̂₁xᵢ   yᵢ - ŷᵢ   (yᵢ - ŷᵢ)²
  1   3   2.75              0.25     0.0625
  2   2   2.25             -0.25     0.0625
  4   1   1.25             -0.25     0.0625
  5   1   0.75              0.25     0.0625
sum  12   7                 0        SSE = 0.25

The sum of the residuals is always zero, therefore we cannot use it for evaluating the goodness of the fit. We calculate the residual variance from the sum of the squares of the residuals:

$SSE = \sum\left(y_i - \hat{y}_i\right)^2 = \sum\left(y_i - \hat{b}_0 - \hat{b}_1 x_i\right)^2 = 0.25$

Residual variance: $MSE = \frac{1}{n-2} SSE = \frac{1}{n-2}\sum\left(y_i - \hat{y}_i\right)^2$
Linear regression

  x   y   ŷᵢ     yᵢ - ŷᵢ   (yᵢ - ŷᵢ)²   (ŷᵢ - ȳ)²
  1   3   2.75    0.25     0.0625       1
  2   2   2.25   -0.25     0.0625       0.25
  4   1   1.25   -0.25     0.0625       0.25
  5   1   0.75    0.25     0.0625       1
sum  12   7       0        0.25         2.5

$SS\hat{Y} = \sum\left(\hat{y}_i - \bar{y}\right)^2 = 2.5$

The explained variance is the squared deviation of the points $\hat{y}_i$ on the regression function from the mean line $\bar{y} = 1.75$.
Linear regression

total variance: $SSY = \sum\left(y_i - \bar{y}\right)^2 = 2.75$
explained variance: $SS\hat{Y} = \sum\left(\hat{y}_i - \bar{y}\right)^2 = 2.5$, with $\hat{y}_i = b_0 + b_1 x_i$ and $\bar{y} = 1.75$
residual variance: $SSE = \sum\left(y_i - \hat{y}_i\right)^2 = 0.25$

We always get: $SSY = SS\hat{Y} + SSE$ (decomposition of variance)

Attention: the decomposition of variance is only valid for the sums of squares, without factors.
Linear regression

Coefficient of determination

From the decomposition of variance $SSY = SS\hat{Y} + SSE$ we find, dividing by SSY:
$1 = \frac{SS\hat{Y}}{SSY} + \frac{SSE}{SSY}$

The coefficient of determination is the share of the explained variance in the total variance:
$B_{xy} = \frac{SS\hat{Y}}{SSY} = 1 - \frac{SSE}{SSY}$

It is equal to the square of the correlation coefficient: $B_{xy} = r_{xy}^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}$

For the example we find, with SSE = 0.25 and SSY = 2.75, the coefficient of determination
$B_{xy} = 1 - \frac{0.25}{2.75} = 0.909$
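The decomposition of variance and the coefficient of determination for example 5, sketched in Python (the names are ours):

```python
x = [1, 2, 4, 5]
y = [3, 2, 1, 1]
n = len(x)
b0, b1 = 3.25, -0.5                        # regression function from example 5

yhat = [b0 + b1 * xi for xi in x]          # fitted values
ybar = sum(y) / n                          # mean, here 1.75

sse     = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual part
ssy_hat = sum((yh - ybar) ** 2 for yh in yhat)            # explained part
ssy     = sum((yi - ybar) ** 2 for yi in y)               # total variance

B = 1 - sse / ssy                          # coefficient of determination
print(sse, ssy_hat, ssy, round(B, 3))      # 0.25 2.5 2.75 0.909
assert ssy == ssy_hat + sse                # decomposition of variance
```

For these values the decomposition holds exactly even in floating point, since all quantities are representable in binary.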
Linear regression

The goodness of fit of the linear regression can be affected by outliers in the data set. Consider again example 4.

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.087x + 23.218
Coefficient of determination: 0.035
The linear relationship is not very strong.
Linear regression

Removing this possible outlier improves the coefficient of determination:

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.197x + 17.971
Coefficient of determination: 0.365
Linear regression

Repeating the procedure with another 'outlier', we get:

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.375x + 12.717
Coefficient of determination: 0.831
Linear regression
The uncritical elimination of outliers can be misleading: it can simulate strong relationships which are merely wishful thinking.

all data:             y = 0.087x + 23.218, coefficient of determination 0.035
one outlier removed:  y = 0.197x + 17.971, coefficient of determination 0.365
two outliers removed: y = 0.375x + 12.717, coefficient of determination 0.831
Linear regression with SPSS
Data set reg.sav
Analyze / Regression / Linear
Linear regression with SPSS
data set reg.sav
Analyze / Regression / Linear
Output for coefficient
In column B we find the coefficients for the regression function,
y = 3.25 – 0.5 x.
Coefficients(a)
               Unstandardized Coefficients   Standardized Coefficients
Model 1        B        Std. Error           Beta        t        Sig.
(Constant)      3.250   0.379                             8.572   0.013
x              -0.500   0.112                -0.953      -4.472   0.047
a. Dependent Variable: y
Linear regression with SPSS
Output for coefficient of determination
In column R we find the correlation coefficient between the observed and estimated values of Y.
In column R Square we find the coefficient of determination of the regression.
R Square = 0.909 means that 90.9% of the variation of the values of Y is explained by the regression.
The standard error of the estimate is the square root of the MSE:

$\hat{s} = \sqrt{MSE} = \sqrt{\frac{1}{n-2} SSE} = \sqrt{\frac{0.25}{2}} = 0.354$

Model Summary
Model 1:  R = 0.953(a)   R Square = 0.909   Adjusted R Square = 0.864   Std. Error of the Estimate = 0.354
a. Predictors: (Constant), x
Linear regression with SPSS
Output for the decomposition of variance

The column Sum of Squares contains the parts of the decomposition of variance:
Total: SSY = 2.75 (total variance of Y)
Residual: SSE = 0.25 (residual variance)
Regression: SSŶ = 2.5 (explained variance)

ANOVA(b)
Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     2.500            1    2.500         20.000   0.047(a)
Residual       0.250            2    0.125
Total          2.750            3
a. Predictors: (Constant), x
b. Dependent Variable: y