WS 2006/07 Prof. Dr. J. Schütze, FB GW 1
Fachhochschule Jena / University of Applied Sciences Jena
Overview

Descriptive statistics          Inferential statistics
Sample                          Population
Relative frequency              Probability
Mean ...                        Expectation ...

Probability theory supplies the model; inferential statistics delivers estimations with calculation of risk.
Contents of the lecture
A Descriptive statistics
basic terms, parameters for univariate and multivariate samples
B Probability theory
calculating with probabilities, random variables, distributions, limit theorems
C Inferential statistics
estimation of parameters, confidence intervals, hypothesis testing, non-parametric tests

Literature:
J.D. Jobson, Applied Multivariate Data Analysis, Springer Texts in Statistics
Hinkelmann, Kempthorne, Design and Analysis of Experiments, Wiley Series
Examples
Engineering
When measuring a certain quantity, various uncontrollable parameters can affect the result. Repeated measurement may lead to different results. We consider the data to be outcomes of a random variable. From the different results, we estimate the unknown quantity. How exact or reliable is this estimation?
Biology/Medicine
A newly developed drug to reduce cholesterol has to be compared to a standard drug concerning its effectiveness. For reasons of time and cost, it is not possible to test the drug in the whole population of people suffering from an elevated cholesterol level. How certain are conclusions drawn from measurements in only a sample of the whole population? Which differences in effectiveness are random, and when do we see a significant difference in effectiveness?
Statistical methods

• samples also allow reliable conclusions
• however, only with a certain confidence
• the sample survey has to be random

                      Advantages          Disadvantages
Complete population   exact information   high effort, costs,
                                          data protection
Sample survey         low effort          result depends on how well the
                                          sample represents the population
Basic terms
Sample space: all observation units possible in principle (intended population)
Sample: randomly chosen subset of observation units from the sample space
Survey unit/statistical unit (proband): each observation unit contained in the sample
Basic terms
Characteristic/statistical variable: intended aim (observed values) of the survey
Values of characteristic: measured or observed values of a statistical unit/proband
Possible values: range of the statistical variable

Characteristics differ in their information content. We distinguish several levels of scale.
Question:
Which chocolate do you buy?
Ritter Sport
Milka
Sarotti
else
Level of scale of a variable
The possible answers are categories, which can only have the relations "equal" or "different". For data input they are usually coded with numbers; nevertheless, no meaningful calculation with these codes is possible. No averages, no spread, only reports on frequency! Such variables have categorical/nominal level.
Question:
How much do you like chocolate?
1. not at all
2. not much
3. neutral
4. a little
5. very much

The possible answers 1, ..., 5 form an ordinal scale. They have the relations '>' and '<'. The distances between the values may be perceived differently, depending on the observer. Such variables have an ordinal level.
Level of scale of a variable
Question:
How much money do you spend on chocolate per week? _ _,_ _ €
The answer is a number on the natural scale. We can distinguish the relations '>' and '<'. Additionally, we can compute meaningful differences, sums and averages.
Such variables have a cardinal or metric scale.
For the metric scale, we distinguish between discrete and continuous scales. A discrete scale results if the answer originates from counting. Continuous variables can assume each value in an interval.
Level of scale of a variable
Nominal scale: data expresses qualitative attributes (categories); ordering on a scale is impossible (only equal or different); differences between values cannot be measured.
Ordinal scale: data can be sorted, but differences in values are not quantifiable.
Cardinal scale (metric): data is measured on a discrete or continuous scale; the difference between values characterizes a correspondingly large difference in characteristics.
The information content increases along this sequence.
Level of scale of data
Univariate discrete characteristics

A discrete characteristic X is measured n times, giving the sample $x_1, \ldots, x_n$. If there are only $k < n$ possible different values $x_1, \ldots, x_k$, one can count how often these values occur.

Absolute frequency $h(x_i)$, $1 \le i \le k$: number of occurrences of $x_i$ among the values of the sample.

Relative frequency: $f(x_i) = \frac{h(x_i)}{n}$, $1 \le i \le k$.

Properties:
$0 \le h(x_i) \le n$, $\sum_{i=1}^{k} h(x_i) = n$
$0 \le f(x_i) \le 1$, $\sum_{i=1}^{k} f(x_i) = 1$
Univariate discrete characteristics
Example 1: grades of 20 students:
2, 1, 2, 3, 3, 3, 1, 4, 2, 4, 3, 3, 2, 3, 5, 4, 5, 4, 3, 2
sample size n = 20
different possible values: 1, 2, 3, 4, 5, that means k = 5

Determination of frequency:

value   h(x_i)   f(x_i)   cumulative frequency
1       2        0.10     0.10 (10%)
2       5        0.25     0.35 (35%)
3       7        0.35     0.70 (70%)
4       4        0.20     0.90 (90%)
5       2        0.10     1.00 (100%)
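The frequency table can be reproduced with a few lines of code; a minimal sketch in Python (the variable names are ours):

```python
from collections import Counter

grades = [2, 1, 2, 3, 3, 3, 1, 4, 2, 4, 3, 3, 2, 3, 5, 4, 5, 4, 3, 2]
n = len(grades)                     # sample size n = 20

abs_freq = Counter(grades)          # h(x_i): absolute frequencies
cum = 0.0
for value in sorted(abs_freq):
    f = abs_freq[value] / n         # f(x_i) = h(x_i) / n
    cum += f                        # cumulative relative frequency
    print(value, abs_freq[value], round(f, 2), round(cum, 2))
```

The loop prints exactly the rows of the table above.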
Graphical display, e.g. with bar charts of absolute/relative frequencies or absolute/relative cumulative frequencies.

[Bar charts: absolute frequencies of the grades 1-5 (heights 2, 5, 7, 4, 2) and cumulative percentages (10, 35, 70, 90, 100).]
Univariate discrete characteristics
Univariate continuous characteristics

A continuous characteristic X is measured n times, giving the sample $x_1, \ldots, x_n$. In general, all occurring values are different, and a simple bar chart gives no information about the distribution of X.

We divide the interval between the smallest and largest observation into equally wide, mutually exclusive classes $K_i$, $1 \le i \le k$, with number of classes $k \ll n$.

Absolute frequency of a class: $h(K_i)$ = number of the $x_1, \ldots, x_n$ in $K_i$, $1 \le i \le k$.
Relative frequency: $f(K_i) = \frac{h(K_i)}{n}$, $1 \le i \le k$.

The absolute/relative cumulative frequency H(x)/F(x) is computed as the sum of the frequencies over all classes left of x, including x:
$H(x) = \sum_{\text{all classes up to } x} h(K_i)$, $F(x) = \sum_{\text{all classes up to } x} f(K_i)$
Example 2: cholesterol values of 1067 probands (raw data are classified)

[Histogram of relative frequencies and function of cumulative frequency, both plotted over the cholesterol class midpoints from 100 to 380.]
Univariate continuous characteristics
Empirical quantiles
Problem: Below which boundary do we find half of (a tenth of, ...) the sample?

Example 3: height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 57, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

E.g. the 10%-quantile is a boundary which divides the sample such that exactly one value (10%) lies below and nine values (90%) lie above it. Each value between 48 and 49 would do, so the boundary is not unique. To avoid this problem, we take the value in the middle, 48.5.

Concerning the 15%-quantile, the demanded percentage can only be fulfilled approximately: with only ten values, each observation accounts for 10%, so 15% cannot be met exactly.

We sort the sample in ascending order. With size n, one value corresponds to a fraction 1/n, and k values to a fraction k/n.

From k/n = α we find that the fraction α is reached for the first time when the value with index α·n (if it is an integer) is reached.
Calculation of empirical quantiles

Sample sorted in ascending order: $x_{\min} = x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)} = x_{\max}$

The empirical α-quantile for $0 < \alpha < 1$ is the number

$x_\alpha = \begin{cases} x_{(k)}, & k-1 < \alpha n < k \\ \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right), & k = \alpha n \text{ an integer} \end{cases}$

The α-quantile divides the range of the sample into two parts such that below the quantile there are α·100%, and above it (1-α)·100%, of the sample values.
Empirical quantiles
Example 3 continued
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 57, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

10%-quantile: $\alpha n = 10 \cdot 0.10 = 1$ is an integer, so $x_{0.10} = \frac{1}{2}(x_{(1)} + x_{(2)}) = 48.5$
15%-quantile: $\alpha n = 10 \cdot 0.15 = 1.5$, so $k = 2$ and $x_{0.15} = x_{(2)} = 49$
Example 3 continued
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 57

lower quartile: $\alpha n = 10 \cdot 0.25 = 2.5$, so $k = 3$ and $x_{0.25} = x_{(3)} = 49$
median: $\alpha n = 10 \cdot 0.5 = 5$ is an integer, so $x_{0.5} = \frac{1}{2}(x_{(5)} + x_{(6)}) = 50.5$
upper quartile: $\alpha n = 10 \cdot 0.75 = 7.5$, so $k = 8$ and $x_{0.75} = x_{(8)} = 51$
interquartile range: $d = x_{0.75} - x_{0.25} = 51 - 49 = 2$
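The quantile rule above can be sketched in Python; `quantile` is our own helper, not part of the lecture:

```python
import math

def quantile(sample, alpha):
    """Empirical alpha-quantile following the lecture's rule:
    x_(k) with k = ceil(alpha*n) if alpha*n is not an integer,
    otherwise the mean of x_(k) and x_(k+1) with k = alpha*n."""
    xs = sorted(sample)
    n = len(xs)
    k = alpha * n
    # note: for arbitrary alpha, guard against floating-point error in alpha*n
    if k == int(k):                      # alpha*n is an integer
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2   # average of x_(k) and x_(k+1)
    return xs[math.ceil(k) - 1]          # x_(k) with k = ceil(alpha*n)

heights = [51, 50, 51, 49, 49, 51, 50, 57, 48, 52]
print(quantile(heights, 0.10))   # 48.5
print(quantile(heights, 0.25))   # 49   (lower quartile)
print(quantile(heights, 0.50))   # 50.5 (median)
print(quantile(heights, 0.75))   # 51   (upper quartile)
```

These reproduce the values computed by hand above.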
Boxplots

[Boxplot of the sorted sample 48, 49, 49, 50, 50, 51, 51, 51, 52, 57 with lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$ and upper quartile $x_{0.75} = 51$.]

width of box = interquartile range = 2
Detection of outliers
Width of box = 2, therefore we find: normal range (49 - 1.5·2, 51 + 1.5·2) = (46, 54). Values outside of the normal range are outliers, like the value 57. The fences denote the area which is covered by the "normal values" of the sample.

The normal range extends from the lower quartile - 1.5·box width to the upper quartile + 1.5·box width.
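The fence rule translates directly into code; a sketch reusing the lecture's quantile convention (`outliers` is our own name):

```python
import math

def quantile(xs_sorted, alpha):
    # empirical alpha-quantile as defined in this lecture
    n = len(xs_sorted)
    k = alpha * n
    if k == int(k):
        k = int(k)
        return (xs_sorted[k - 1] + xs_sorted[k]) / 2
    return xs_sorted[math.ceil(k) - 1]

def outliers(sample):
    """Values outside (lower quartile - 1.5*IQR, upper quartile + 1.5*IQR)."""
    xs = sorted(sample)
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1                        # box width
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(outliers([51, 50, 51, 49, 49, 51, 50, 57, 48, 52]))  # [57]
print(outliers([51, 50, 51, 49, 49, 51, 50, 53, 48, 52]))  # []
```

The two calls match the original and the altered data set discussed on the following slides.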
Altered data set
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 53, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 53
lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$, upper quartile $x_{0.75} = 51$
box width = 2, normal range (49 - 3, 51 + 3) = (46, 54); there is no outlier
Detection of outliers
Newly altered data set
height of 10 newborn babies: 51, 50, 51, 49, 49, 51, 50, 55, 48, 52
in ascending order: 48, 49, 49, 50, 50, 51, 51, 51, 52, 55
lower quartile $x_{0.25} = 49$, median $x_{0.5} = 50.5$, upper quartile $x_{0.75} = 51$
box width = 2, normal range (49 - 3, 51 + 3) = (46, 54); the value 55 is an outlier
Detection of outliers
Statistical measures
Measures of location:

arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  with absolute frequencies: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i^* \, h(x_i^*)$
  with relative frequencies: $\bar{x} = \sum_{i=1}^{k} x_i^* \, f(x_i^*)$
median: $\tilde{x} = x_{0.5}$

Measures of deviation:

empirical variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$
  with absolute frequencies: $s^2 = \frac{1}{n-1}\sum_{i=1}^{k}(x_i^* - \bar{x})^2 \, h(x_i^*)$
  with relative frequencies: $s^2 = \frac{n}{n-1}\sum_{i=1}^{k}(x_i^* - \bar{x})^2 \, f(x_i^*)$
standard deviation: $s = \sqrt{s^2}$
coefficient of variation: $v = \frac{s}{\bar{x}}$
standard error: $s_{\bar{x}} = \frac{s}{\sqrt{n}}$
mean absolute deviation: $d = \frac{1}{n}\sum_{i=1}^{n}\left| x_i - \tilde{x} \right|$
interquartile range: $\tilde{d}_{0.5} = x_{0.75} - x_{0.25}$
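As a check on the formulas, here is a short Python sketch applying the location and deviation measures to the height data of example 3 (the variable names are ours):

```python
import math

x = [51, 50, 51, 49, 49, 51, 50, 57, 48, 52]       # heights, example 3
n = len(x)

mean = sum(x) / n                                   # arithmetic mean: 50.8
x_sorted = sorted(x)
median = (x_sorted[n // 2 - 1] + x_sorted[n // 2]) / 2   # for even n: 50.5

s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)    # empirical variance
s = math.sqrt(s2)                                   # standard deviation
v = s / mean                                        # coefficient of variation
se = s / math.sqrt(n)                               # standard error
mad = sum(abs(xi - median) for xi in x) / n         # mean absolute deviation

# the shortcut form of the variance from the table gives the same value
s2_alt = (sum(xi * xi for xi in x) - n * mean ** 2) / (n - 1)
```

Both variance formulas agree up to floating-point rounding, as the equality in the table states.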
Multivariate characteristics
When measuring more than one characteristic at the same object, we often want to know if those characteristics are dependent.
Example 4: During a traffic check, all drivers who had to pay a speeding fine were asked their age.

Age:      20 23 24 59 55 26 32 29 43 38 31 36
Speeding: 22 22 40 23 34 22 22 21 28 27 25 29

Here we have two cardinal characteristics for 12 drivers. To analyze the dependence, the points can be shown graphically; such plots are called scatter plots.
Example of a scatter plot

[Scatter plot of speeding fine versus age, with the mean values $\bar{x} = 34.67$ (age) and $\bar{y} = 26.25$ (speeding) marked.]

Measures of dependence for cardinal characteristics

Age:      20 23 24 59 55 26 32 29 43 38 31 36
Speeding: 22 22 40 23 34 22 22 21 28 27 25 29
We divide the area of the plot into 4 quadrants by the mean values of the single dimensions. An argument for a linear dependence is that all points are situated in the first and third, respectively in the second and fourth, quadrant.

ascending tendency: $x_i > \bar{x}, y_i > \bar{y}$ or $x_i < \bar{x}, y_i < \bar{y}$, therefore $(x_i - \bar{x})(y_i - \bar{y}) > 0$
decreasing tendency: $x_i > \bar{x}, y_i < \bar{y}$ or $x_i < \bar{x}, y_i > \bar{y}$, therefore $(x_i - \bar{x})(y_i - \bar{y}) < 0$

If the points are distributed over all quadrants, there is no linear tendency.

The products $(x_i - \bar{x})(y_i - \bar{y})$ are the main items for the measure of dependence.

Covariance: $\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Correlation coefficient (Pearson): $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

Measures of dependence for cardinal characteristics
Equivalent notations of the Pearson correlation:

$r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$
Measures of dependence for cardinal characteristics
Interpretation of the Pearson correlation

The correlation coefficient of Pearson measures how close the linear dependence between X and Y is.

We always find: $-1 \le r \le 1$.

For r = 1, all data points are situated on an ascending line.
For r = -1, all data points are situated on a descending line.
For r = 0, there is no linear tendency.
Measures of dependence for cardinal characteristics
Example 4: calculation of the correlation coefficient of Pearson

  X    Y     X²     Y²     XY
 20   22    400    484    440
 23   22    529    484    506
 24   40    576   1600    960
 59   23   3481    529   1357
 55   34   3025   1156   1870
 26   22    676    484    572
 32   22   1024    484    704
 29   21    841    441    609
 43   28   1849    784   1204
 38   27   1444    729   1026
 31   25    961    625    775
 36   29   1296    841   1044
Σ   416  315  16102   8641  11067

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} = \frac{12 \cdot 11067 - 416 \cdot 315}{\sqrt{(12 \cdot 16102 - 416^2)(12 \cdot 8641 - 315^2)}} = 0.1858$
Measures of dependence for cardinal characteristics
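The same computation in Python, using the shortcut formula from the table (a sketch; the names are ours):

```python
import math

age      = [20, 23, 24, 59, 55, 26, 32, 29, 43, 38, 31, 36]
speeding = [22, 22, 40, 23, 34, 22, 22, 21, 28, 27, 25, 29]
n = len(age)

sx, sy = sum(age), sum(speeding)                     # 416, 315
sxx = sum(x * x for x in age)                        # 16102
syy = sum(y * y for y in speeding)                   # 8641
sxy = sum(x * y for x, y in zip(age, speeding))      # 11067

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))  # 0.1858
```

This confirms the column sums and the value of r from the hand calculation.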
Measures of dependence for cardinal characteristics
Correlation coefficient of Spearman
for ordinal characteristics, or cardinal characteristics with outliers

$R(x_i)$: rank number of $x_i$ in the ascending order of the values of X
$R(y_i)$: rank number of $y_i$ in the ascending order of the values of Y
Values appearing multiple times receive the same middle rank.

$r_s = \frac{\sum \left(R(x_i) - \bar{R}\right)\left(R(y_i) - \bar{R}\right)}{\sqrt{\sum \left(R(x_i) - \bar{R}\right)^2 \sum \left(R(y_i) - \bar{R}\right)^2}} = \frac{\sum R(x_i) R(y_i) - n\bar{R}^2}{\sqrt{\left(\sum R(x_i)^2 - n\bar{R}^2\right)\left(\sum R(y_i)^2 - n\bar{R}^2\right)}}$, where $\bar{R} = \frac{n+1}{2}$.

The correlation coefficient of Spearman measures a monotonous dependence.
For r = 1, all data points show a monotonous ascending tendency.
For r = -1, all data points show a monotonous descending tendency.
For r = 0, there is no monotonous tendency.
Example 4
Correlation coefficient of Spearman
Measures of dependence for cardinal characteristics
xi:     20  23  24  59  55  26  32  29  43  38  31  36
yi:     22  22  40  23  34  22  22  21  28  27  25  29
R(xi):   1   2   3  12  11   4   7   5  10   9   6   8
Example 4
Correlation coefficient of Spearman
The smallest value of y is 21; it receives rank 1.
Then the value 22 appears four times, on the ranks 2, 3, 4 and 5. Instead of these rank numbers, each receives the same average rank
$\frac{2+3+4+5}{4} = 3.5$,
and afterwards the numbering is continued with 6.

xi:     20   23  24  59  55   26   32  29  43  38  31  36
yi:     22   22  40  23  34   22   22  21  28  27  25  29
R(yi): 3.5  3.5  12   6  11  3.5  3.5   1   9   8   7  10
Example 4
Correlation coefficient of Spearman

xi:     20   23  24  59  55   26   32  29  43  38  31  36
yi:     22   22  40  23  34   22   22  21  28  27  25  29
R(xi):   1    2   3  12  11    4    7   5  10   9   6   8
R(yi): 3.5  3.5  12   6  11  3.5  3.5   1   9   8   7  10

$n = 12$, $\sum R(x_i) = \sum R(y_i) = \frac{n(n+1)}{2} = \frac{12 \cdot 13}{2} = 78$, $\bar{R} = \frac{n+1}{2} = 6.5$
$\sum R(x_i)^2 = 650$, $\sum R(y_i)^2 = 645$, $\sum R(x_i) R(y_i) = 567$

$r_s = \frac{\sum R(x_i) R(y_i) - n\bar{R}^2}{\sqrt{\left(\sum R(x_i)^2 - n\bar{R}^2\right)\left(\sum R(y_i)^2 - n\bar{R}^2\right)}} = \frac{567 - 12 \cdot 6.5^2}{\sqrt{(650 - 12 \cdot 6.5^2)(645 - 12 \cdot 6.5^2)}} = 0.427$

Measures of dependence for cardinal characteristics
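A sketch of the Spearman computation in Python; `midranks` is our own helper implementing the middle-rank convention:

```python
import math

def midranks(values):
    """Ranks in ascending order; tied values share the average rank."""
    xs = sorted(values)
    rank = {}
    for v in set(values):
        first = xs.index(v) + 1          # first 1-based position of v
        last = first + xs.count(v) - 1   # last 1-based position of v
        rank[v] = (first + last) / 2     # middle rank
    return [rank[v] for v in values]

age      = [20, 23, 24, 59, 55, 26, 32, 29, 43, 38, 31, 36]
speeding = [22, 22, 40, 23, 34, 22, 22, 21, 28, 27, 25, 29]

rx, ry = midranks(age), midranks(speeding)
n = len(rx)
rbar = (n + 1) / 2                       # mean rank, here 6.5
num = sum(a * b for a, b in zip(rx, ry)) - n * rbar ** 2
den = math.sqrt((sum(a * a for a in rx) - n * rbar ** 2)
                * (sum(b * b for b in ry) - n * rbar ** 2))
print(round(num / den, 3))  # 0.427
```

The ranks produced by `midranks` match the tables above, and the quotient reproduces r_s = 0.427.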
Linear regression

If the cardinal characteristics X, Y have a high correlation coefficient, they have a close linear dependence, which can be described by a linear equation.

Set-up: $y = b_0 + b_1 x$

The residuals $y_i - (b_0 + b_1 x_i)$ are the vertical differences between the measured points and the line. The sum of the squared residuals is minimized by choosing suitable parameters.

The unknown coefficients $b_0, b_1$ are determined by the following criterion of optimality (method of least squares):
$\sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 \to \min$
Linear regression

The optimal estimates for the parameters are obtained by calculating the minimum of
$f(b_0, b_1) = \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)^2 \to \min$

Setting the partial derivatives of f with respect to $b_0$, $b_1$ to zero leads to the system of normal equations
$\sum y_i = n b_0 + b_1 \sum x_i$
$\sum x_i y_i = b_0 \sum x_i + b_1 \sum x_i^2$

with the following solutions:

Regression coefficient: $b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$

Regression constant: $b_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i y_i \sum x_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
Example 5
Linear regression
x 1 2 4 5
y 3 2 1 1
Linear regression

In the example:

  x   y   x²  xy
  1   3    1   3
  2   2    4   4
  4   1   16   4
  5   1   25   5
sum  12   7   46  16

$b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{4 \cdot 16 - 12 \cdot 7}{4 \cdot 46 - 12^2} = -0.5$

$b_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i y_i \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{7 \cdot 46 - 16 \cdot 12}{4 \cdot 46 - 12^2} = 3.25$

Regression function: y = 3.25 - 0.5x
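The normal-equation solutions can be checked with a few lines of Python (a sketch; the names are ours):

```python
x = [1, 2, 4, 5]
y = [3, 2, 1, 1]
n = len(x)

sx, sy = sum(x), sum(y)                           # 12, 7
sxx = sum(v * v for v in x)                       # 46
sxy = sum(a * b for a, b in zip(x, y))            # 16

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)    # regression coefficient
b0 = (sy * sxx - sxy * sx) / (n * sxx - sx ** 2)  # regression constant
print(b0, b1)  # 3.25 -0.5
```

This reproduces the regression function y = 3.25 - 0.5x of example 5.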
Linear regression
Function of regression: y = 3.25 - 0.5x

Thus the calculated function of regression fits the points optimally according to the used criterion.

Because the criterion minimizes the sum of the squared vertical deviations between the points and the line, it is called MLS regression (Method of Least Squares).
Linear regression

Goodness of fit of the regression function

Residuals: $\hat{e}_i = y_i - \hat{y}_i$, the vertical differences between the measured points and the regression line. From these, we derive the residual variance.

Sum of squares: $SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$

Residual variance: $\hat{s}^2 = MSE = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$  (MSE: Mean Square Error)

For the MSE, the divisor n - 2 corresponds to the number of degrees of freedom, which diminishes the sample size n by two because of the two estimated parameters $b_0$ and $b_1$.
Evaluation of goodness of fit for y = 3.25 - 0.5x

  x   y   ŷᵢ = b̂₀ + b̂₁xᵢ   yᵢ - ŷᵢ   (yᵢ - ŷᵢ)²
  1   3   2.75              0.25     0.0625
  2   2   2.25             -0.25     0.0625
  4   1   1.25             -0.25     0.0625
  5   1   0.75              0.25     0.0625
sum  12   7                 0        SSE = 0.25

The sum of the residuals is always zero, therefore we cannot use it for evaluating the goodness of the fit. We calculate the residual variance from the sum of the squares of the residuals:

$SSE = \sum\left(y_i - \hat{y}_i\right)^2 = \sum\left(y_i - \hat{b}_0 - \hat{b}_1 x_i\right)^2 = 0.25$

Residual variance: $MSE = \frac{1}{n-2} SSE = \frac{1}{n-2}\sum\left(y_i - \hat{y}_i\right)^2$
Linear regression

  x   y   ŷᵢ     yᵢ - ŷᵢ   (yᵢ - ŷᵢ)²   (ŷᵢ - ȳ)²
  1   3   2.75    0.25     0.0625       1
  2   2   2.25   -0.25     0.0625       0.25
  4   1   1.25   -0.25     0.0625       0.25
  5   1   0.75    0.25     0.0625       1
sum  12   7       0        0.25         2.5

$SS\hat{Y} = \sum\left(\hat{y}_i - \bar{y}\right)^2 = 2.5$

The explained variance is the squared deviation of the points $\hat{y}_i$ on the regression function from the mean line $\bar{y} = 1.75$.
Linear regression

total variance: $SSY = \sum\left(y_i - \bar{y}\right)^2 = 2.75$
explained variance: $SS\hat{Y} = \sum\left(\hat{y}_i - \bar{y}\right)^2 = 2.5$, with $\hat{y}_i = b_0 + b_1 x_i$ and $\bar{y} = 1.75$
residual variance: $SSE = \sum\left(y_i - \hat{y}_i\right)^2 = 0.25$

We always get: $SSY = SS\hat{Y} + SSE$ (decomposition of variance)

Attention: the decomposition of variance is only valid for the sums of squares, without factors.
Linear regression

Coefficient of determination

From the decomposition of variance $SSY = SS\hat{Y} + SSE$ we find, dividing by SSY:
$1 = \frac{SS\hat{Y}}{SSY} + \frac{SSE}{SSY}$

The coefficient of determination is the share of the explained variance in the total variance:
$B_{xy} = \frac{SS\hat{Y}}{SSY} = 1 - \frac{SSE}{SSY}$

It is equal to the square of the correlation coefficient: $B_{xy} = r_{xy}^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}$

For the example we find, with SSE = 0.25 and SSY = 2.75, the coefficient of determination
$B_{xy} = 1 - \frac{0.25}{2.75} = 0.909$
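The decomposition of variance and the coefficient of determination for example 5, sketched in Python (the names are ours):

```python
x = [1, 2, 4, 5]
y = [3, 2, 1, 1]
n = len(x)
b0, b1 = 3.25, -0.5                        # regression function from example 5

yhat = [b0 + b1 * xi for xi in x]          # fitted values
ybar = sum(y) / n                          # mean, here 1.75

sse     = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual part
ssy_hat = sum((yh - ybar) ** 2 for yh in yhat)            # explained part
ssy     = sum((yi - ybar) ** 2 for yi in y)               # total variance

B = 1 - sse / ssy                          # coefficient of determination
print(sse, ssy_hat, ssy, round(B, 3))      # 0.25 2.5 2.75 0.909
assert ssy == ssy_hat + sse                # decomposition of variance
```

For these values the decomposition holds exactly even in floating point, since all quantities are representable in binary.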
Linear regression

The goodness of fit of the linear regression can be affected by outliers in the data set. Consider again example 4.

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.087x + 23.218
Coefficient of determination: 0.035
The linear relationship is not very strong.
Linear regression

Removing this possible outlier improves the coefficient of determination:

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.197x + 17.971
Coefficient of determination: 0.365
Linear regression

Repeating the procedure with another 'outlier', we get:

[Scatter plot of speeding versus age with fitted line.]
Function of regression: y = 0.375x + 12.717
Coefficient of determination: 0.831
Linear regression
The uncritical elimination of outliers can be misleading: it can simulate strong relationships which are merely wishful thinking.

all data:             y = 0.087x + 23.218, coefficient of determination 0.035
one outlier removed:  y = 0.197x + 17.971, coefficient of determination 0.365
two outliers removed: y = 0.375x + 12.717, coefficient of determination 0.831
Linear regression with SPSS
Data set reg.sav
Analyze / Regression / Linear
Linear regression with SPSS
data set reg.sav
Analyze / Regression / Linear
Output for coefficient
In column B we find the coefficients for the regression function,
y = 3.25 – 0.5 x.
Coefficients(a)
               Unstandardized Coefficients   Standardized Coefficients
Model 1        B        Std. Error           Beta        t        Sig.
(Constant)      3.250   0.379                             8.572   0.013
x              -0.500   0.112                -0.953      -4.472   0.047
a. Dependent Variable: y
Linear regression with SPSS
Output for coefficient of determination
In column R we find the correlation coefficient between the observed and estimated values of Y.
In column R Square we find the coefficient of determination of the regression.
R Square = 0.909 means that 90.9% of the variation of the values of Y is explained by the regression.
The standard error of the estimate is the square root of the MSE:

$\hat{s} = \sqrt{MSE} = \sqrt{\frac{1}{n-2} SSE} = \sqrt{\frac{0.25}{2}} = 0.354$

Model Summary
Model 1:  R = 0.953(a)   R Square = 0.909   Adjusted R Square = 0.864   Std. Error of the Estimate = 0.354
a. Predictors: (Constant), x
Linear regression with SPSS
Output for the decomposition of variance

The column Sum of Squares contains the parts of the decomposition of variance:
Total: SSY = 2.75 (total variance of Y)
Residual: SSE = 0.25 (residual variance)
Regression: SSŶ = 2.5 (explained variance)

ANOVA(b)
Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     2.500            1    2.500         20.000   0.047(a)
Residual       0.250            2    0.125
Total          2.750            3
a. Predictors: (Constant), x
b. Dependent Variable: y