
PRACTICAL MANUAL

AGRICULTURAL STATISTICS

STATISTICAL METHODS, SAMPLING TECHNIQUES AND EXPERIMENTAL DESIGNS

( UG/PG COURSES)

Compiled by

SURABHI JAIN

(Asst. Prof./Scientist)

&

H. L. Sharma

Professor and Head,

DEPARTMENT OF MATHEMATICS AND STATISTICS

Jawaharlal Nehru Krishi Vishwa Vidyalaya,

JABALPUR 482 004 ( M.P.)

Contents

S. No. Chapter Name

1. Frequency Distribution
2. Graphical Representation of data
3. Curve fitting
4. Measures of Central Tendency
5. Measures of Dispersion
6. Skewness and Kurtosis
7. Probability
8. Discrete and Continuous Distribution
9. Correlation and Regression
10. Multiple and Partial Correlation
11. Multiple Regression Equation and Analysis Technique
12. Simple and Stratified Random sampling
13. Ratio and Regression Estimator
14. Large sample test
15. Small sample test
16. Chi-Square test
17. Experimental Design
18. Factorial Design
19. Confounding

References


1. Frequency distribution

Frequency distribution: a frequency distribution condenses a large amount of data into classes and presents the information of interest in a compact form.

To construct the frequency distribution, first find the range = maximum value − minimum value.

An approximate number k of classes can be determined from

k = 1 + 3.322 log10(N) or k = log(N)/log(2), where N is the total frequency (number of observations). The answer is rounded to an integer; the second form is rounded up to the next integer.

Dividing the range by the number of classes gives the class interval.

Kinds of data: The list of IQ scores is: 118, 123, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149, 150, 154.

Solution: Here Range = 154 − 118 = 36

The number of classes is k = 1 + 3.322 log10(17) ≈ 5.09 ≈ 5, or log(17)/log(2) ≈ 4.09, which rounds up to 5. So 5 classes can be considered.

Class interval = range / no. of classes = 36/5 = 7.2 ≈ 8

Upper limit of the first class = min value + class interval = 118 + 8 = 126

Since the data are discrete, we subtract 1 from this upper limit, so the first class is 118-125. The next class starts from the next integer, 126.

So our frequency distribution table would be

I.Q. (class interval) Number or frequency

118-125 4

126-133 6

134-141 3

142-149 2

150-157 2

Classes in which both the lower and the upper limits are included in the same class are called inclusive classes, whereas classes in which the upper limit of one class is the same as the lower limit of the next class, e.g. 10-15, 15-20 etc., are known as exclusive classes.

Remark: Before applying any statistical technique, inclusive classes should first be converted to exclusive classes. For this purpose we find the conversion factor

$$\text{conversion factor} = \frac{\text{lower limit of second class} - \text{upper limit of first class}}{2}$$

and add this amount to the upper limit of each class and subtract it from the lower limit of each class.

In the present example the conversion factor = (126 − 125)/2 = 0.5

So we add 0.5 to 125 and subtract 0.5 from 126, adjust the remaining limits in the same way, and finally get the exclusive classes.

I.Q. (class interval) Number or frequency

117.5-125.5 4

125.5-133.5 6

133.5-141.5 3

141.5-149.5 2

149.5-157.5 2

Generally the number of classes should lie between 5 and 15.
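The construction above can be sketched in Python. This is only an illustrative sketch, not part of the manual: the function name `frequency_table` is hypothetical, the number of classes is taken from the log(N)/log(2) rule rounded up, and the data are assumed to be integers so that inclusive classes make sense.

```python
import math

def frequency_table(data, width=None):
    """Build inclusive integer classes and count the observations in each.

    Assumption: data are discrete (integers), as in the IQ example.
    """
    n = len(data)
    lo, hi = min(data), max(data)
    k = math.ceil(math.log(n) / math.log(2))   # number of classes, rounded up
    if width is None:
        width = math.ceil((hi - lo) / k)       # e.g. 36 / 5 = 7.2 -> 8
    table = []
    lower = lo
    while lower <= hi:
        upper = lower + width - 1              # discrete data: subtract 1
        count = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, count))
        lower = upper + 1                      # next class starts at the next integer
    return table

iq = [118, 123, 124, 125, 127, 128, 129, 130, 130, 133,
      136, 138, 141, 142, 149, 150, 154]

for lower, upper, count in frequency_table(iq):
    print(f"{lower}-{upper}: {count}")
```

For the IQ scores this reproduces the five inclusive classes 118-125, ..., 150-157 of the table above.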


2. Graphical Representation of data

Graphical Representation of data: graphs are used to make data attractive, effective and ready for comparison, and they also save time and energy.

Type of graph: Simple Bar Diagram
Characteristics: Thick lines (bars), drawn at equal distances, represent the corresponding figures.
Data: Population (in millions): UP 78, MP 80, AP 83
[Figure: Bar diagram. x-axis: state (UP, MP, AP); y-axis: population (in millions)]

Type of graph: Histogram
Characteristics: In a histogram the area of each rectangle is proportional to the frequency of the corresponding class.
Data:
Marks in maths 0-10 10-20 20-30 30-40 40-50
No. of students 2 5 7 5 3
[Figure: Histogram. x-axis: marks in maths; y-axis: number of students]

Type of graph: Pie Diagram
Characteristics: In a pie diagram the sectors of the circle represent the components of the total quantity; the components can also be expressed in %.
Data:
Items Food Cloth Edu. HRA Other
Exp. (in Rs.) 800 200 350 400 250
Total expenditure = Rs. 2000
For the percentage, Food = (800/2000) × 100 = 40%
For the sector angle, Food = (800/2000) × 360 = 144 degrees
Percentages: food 40%, cloth 10%, education 17.5%, HRA 20%, other 12.5% (education and other are shown as 17% and 13% on the chart)
Sector angles (in degrees): food 144, cloth 36, education 63, HRA 72, other 45
[Figure: Pie diagrams of monthly expenditure, one labelled in % and one labelled with sector angles in degrees]

Type of graph: Frequency Polygon
Characteristics: It is obtained by plotting the mid points of the class intervals on the x-axis against their corresponding frequencies on the y-axis and joining the points (same data as the histogram).
[Figure: Frequency polygon. x-axis: marks in maths; y-axis: number of students]
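The percentage and sector-angle calculations for the pie diagram can be checked with a few lines of Python. This is a minimal sketch using the monthly expenditure figures given above; the variable names are illustrative only, and the angles are in degrees (a full circle being 360).

```python
# Monthly expenditure in Rs., taken from the pie diagram example.
expenditure = {"food": 800, "cloth": 200, "education": 350, "HRA": 400, "other": 250}

total = sum(expenditure.values())

# For each item: (share in %, sector angle in degrees).
shares = {
    item: (100 * amount / total, 360 * amount / total)
    for item, amount in expenditure.items()
}

for item, (percent, angle) in shares.items():
    print(f"{item}: {percent:.1f}% -> {angle:.0f} degrees")
```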


3. Curve fitting

Curve fitting: is used to find an analytic expression of the form y = f(x) for the functional relationship suggested by the given data.

Fitting of a straight line: Y = a + bX

In curve fitting by the principle of least squares we have to determine a and b so that

$$E = \sum (y_i - a - b x_i)^2$$

is minimum.

Differentiating E with respect to a and b and equating to zero gives the two normal equations

$$\sum y_i = na + b\sum x_i$$

$$\sum x_i y_i = a\sum x_i + b\sum x_i^2$$

By solving these two equations the values of a and b can be obtained.

Fitting of a second degree parabola: Y = a + bX + cX^2

In curve fitting by the principle of least squares we have to determine a, b and c so that

$$E = \sum (y_i - a - b x_i - c x_i^2)^2$$

is minimum.

Differentiating E with respect to a, b and c and equating to zero gives the three normal equations

$$\sum y_i = na + b\sum x_i + c\sum x_i^2$$

$$\sum x_i y_i = a\sum x_i + b\sum x_i^2 + c\sum x_i^3$$

$$\sum x_i^2 y_i = a\sum x_i^2 + b\sum x_i^3 + c\sum x_i^4$$

By solving these three equations the values of a, b and c can be obtained.

Objective: Fitting of a straight line.

Kinds of data: Treating X as the independent variable.

X: 1 2 3 4 6 8

Y: 2.4 3.0 3.6 4 5 6

Solution: Let the line be Y = a + bX

X Y X^2 XY

1 2.4 1 2.4

2 3.0 4 6.0

3 3.6 9 10.8

4 4.0 16 16.0

6 5.0 36 30.0

8 6.0 64 48.0

Total 24 24 130 113.2

Using the two normal equations of the straight line, we have

24 = 6a + 24b and 113.2 = 24a + 130b

On solving the two equations, we get a = 1.976 and b = 0.506

Thus Ŷ = 1.976 + 0.506X is the fitted straight line.
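The straight-line fit can be verified numerically. The sketch below solves the two normal equations in closed form; the function name `fit_line` is illustrative and not part of the manual.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + bX from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a from the normal equations gives b, then a follows.
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

xs = [1, 2, 3, 4, 6, 8]
ys = [2.4, 3.0, 3.6, 4.0, 5.0, 6.0]

a, b = fit_line(xs, ys)
print(f"Y = {a:.3f} + {b:.3f} X")
```

With the data of this example the coefficients agree with a = 1.976 and b = 0.506 to three decimal places.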

Objective: Fitting of a quadratic curve to the following data treating X as the independent variable.

Kinds of data:

X: 0 1 2 3 4

Y: 1 1.8 1.3 2.5 6.3

Solution: Let Y = a + bX + cX^2 be the equation of the quadratic curve.

X Y X^2 X^3 X^4 XY X^2Y

0 1 0 0 0 0 0

1 1.8 1 1 1 1.8 1.8

2 1.3 4 8 16 2.6 5.2

3 2.5 9 27 81 7.5 22.5

4 6.3 16 64 256 25.2 100.8

Total 10 12.9 30 100 354 37.1 130.3

Using the three normal equations for the quadratic curve, we have

12.9 = 5a + 10b + 30c

37.1 = 10a + 30b + 100c

130.3 = 30a + 100b + 354c

Solving the above three equations, we have

a = 1.42, b = -1.07 and c = 0.55

Thus the quadratic curve is fitted as

Ŷ = 1.42 - 1.07X + 0.55X^2

Remark: If the values of X and Y are large, the computation of ∑X, ∑X^2, ∑X^3, ∑XY, … becomes difficult and takes more time and energy. The calculations can therefore be reduced by a change of origin and scale of the data.
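The three normal equations of the parabola can also be solved by machine. The following is a minimal sketch in plain Python, assuming a small Gauss-Jordan elimination is acceptable for a 3 × 3 system; the helper names `solve` and `fit_parabola` are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_parabola(xs, ys):
    """Least-squares fit of Y = a + bX + cX^2 from the three normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]                      # sums of x^0 .. x^4
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]   # sums of y, xy, x^2 y
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    return solve(matrix, t)

a, b, c = fit_parabola([0, 1, 2, 3, 4], [1, 1.8, 1.3, 2.5, 6.3])
print(f"Y = {a:.2f} + ({b:.2f}) X + {c:.2f} X^2")
```

For the quadratic example above this gives the same coefficients as the hand calculation, a = 1.42, b = -1.07 and c = 0.55.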


4. Measures of Central Tendency

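Measures of Central Tendency: give us an idea about the average/central value of the distribution.

Notations: A = any arbitrary value of the variable x (generally a mid value), h = class interval, L = lower limit of the Median/Quartile/Decile/Percentile/modal class, N = $\sum f_i$ = total frequency, F = cumulative frequency of the class preceding the Median/Quartile/Decile/Percentile class, f = frequency of that class, f1 = frequency of the modal class, f0 and f2 = frequencies of the classes preceding and succeeding the modal class.

Important note:

• A measure of central tendency is expressed in the same unit of measurement as the given dataset.

• Q2 = Median = D5 = P50

Measure: Mean
Definition: Average of the given data; the most stable measure of central tendency.
Ungrouped data: A.M. = (sum of observations)/(no. of observations), i.e. $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
Grouped data:
1. Ordinary method: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
2. Change of origin method: $\bar{x} = A + \frac{\sum f_i o_i}{\sum f_i}$, where $o_i = x_i - A$
3. Change of scale method: $\bar{x} = \frac{\sum f_i s_i}{\sum f_i} \times h$, where $s_i = \frac{x_i}{h}$
4. Change of origin and scale method: $\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} \times h$, where $d_i = \frac{x_i - A}{h}$

Measure: Median
Definition: Divides the whole data set into 2 equal parts; can also be used for qualitative data.
Ungrouped data: Arrange the observations in ascending order. If the number of observations n is
(1) odd: the $\left(\frac{n+1}{2}\right)$th term is the median;
(2) even: the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th terms is the median.
Grouped data: $Md = L + \frac{\frac{N}{2} - F}{f} \times h$

Measure: Mode
Definition: Most frequently occurring observation; it can be ill-defined.
Ungrouped data: Prepare the frequency table and find the observation with the highest frequency.
Grouped data: $Mo = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$

Measure: Geometric Mean
Definition: nth root of the product of the n observations; gives more weightage to small items.
Ungrouped data: $GM = \text{Antilog}\left(\frac{1}{n}\sum \log x_i\right)$
Grouped data: $GM = \text{Antilog}\left(\frac{1}{\sum f_i}\sum f_i \log x_i\right)$

Measure: Harmonic Mean
Definition: Reciprocal of the arithmetic mean of the reciprocals of the given values; gives more weightage to small items.
Ungrouped data: $HM = \frac{n}{\sum \frac{1}{x_i}}$
Grouped data: $HM = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}$

Quartiles, Deciles and Percentiles

Measure: Quartiles
Definition: 3 in number; they divide the series into 4 equal parts.
Ungrouped data: $Q_i = \frac{i(n+1)}{4}$th observation, where i = 1, 2, 3
Grouped data: $Q_i = L + \frac{\frac{iN}{4} - F}{f} \times h$

Measure: Deciles
Definition: 9 in number; they divide the series into 10 equal parts.
Ungrouped data: $D_i = \frac{i(n+1)}{10}$th observation, where i = 1, 2, 3, .., 9
Grouped data: $D_i = L + \frac{\frac{iN}{10} - F}{f} \times h$

Measure: Percentiles
Definition: 99 in number; they divide the series into 100 equal parts.
Ungrouped data: $P_i = \frac{i(n+1)}{100}$th observation, where i = 1, 2, 3, .., 99
Grouped data: $P_i = L + \frac{\frac{iN}{100} - F}{f} \times h$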


Objective: Computation of Measures of Central Tendency by all methods for Ungrouped data.

Data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12

Solution:

(1) Arithmetic mean = (10+7+11+9+9+10+7+9+12)/9 = 84/9 = 9.33

(2) Median: Arrange the observations in ascending order

7, 7, 9, 9, 9, 10, 10, 11, 12. Here the number of observations is odd, so the $\left(\frac{9+1}{2}\right)$th term is the median.

The 5th term is 9. So the median is 9.

(3) Mode: Prepare the frequency table

observation frequency

7 2

9 3

10 2

11 1

12 1

The maximum frequency is 3. So the mode value is 9.

(4) Geometric Mean = $(10 \times 7 \times 11 \times 9 \times 9 \times 10 \times 7 \times 9 \times 12)^{1/9}$

$\log_{10} GM = \frac{1}{9}(\log_{10}10 + \log_{10}7 + \log_{10}11 + \log_{10}9 + \log_{10}9 + \log_{10}10 + \log_{10}7 + \log_{10}9 + \log_{10}12)$

= (1/9)(1.00+0.85+1.04+0.95+0.95+1.00+0.85+0.95+1.08) = 8.67/9 = 0.963

GM = Antilog(0.963) = 9.19

(5) Harmonic Mean = $\frac{9}{\frac{1}{10}+\frac{1}{7}+\frac{1}{11}+\frac{1}{9}+\frac{1}{9}+\frac{1}{10}+\frac{1}{7}+\frac{1}{9}+\frac{1}{12}}$ = 9/0.99 = 9.06

(6) Calculation of 3rd Quartile, 5th Decile and 60th Percentile (first arrange the data in ascending order)

(a) Quartile: $Q_i = \frac{i(n+1)}{4}$th observation, so $Q_3 = \frac{3(9+1)}{4}$th = (30/4)th observation = 7.5th observation

So Q3 = 7th observation + 0.5 × (8th observation − 7th observation) = 10 + 0.5 × (11 − 10) = 10.5

So 10.5 is the 3rd Quartile value.

(b) Decile: $D_i = \frac{i(n+1)}{10}$th observation, so $D_5 = \frac{5(9+1)}{10}$th = 5th observation = 9

So 9 is the 5th Decile value.

(c) Percentile: $P_i = \frac{i(n+1)}{100}$th observation, so $P_{60} = \frac{60(9+1)}{100}$th = 6th observation = 10

So 10 is the 60th Percentile.
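The ungrouped calculations can be cross-checked with Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The helper `position_value` is a hypothetical name; it implements the (n+1)-based position rule used above with linear interpolation, and is a sketch only (it assumes the position lies between 1 and n).

```python
import statistics

def position_value(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    whole = int(position)
    frac = position - whole
    value = sorted_data[whole - 1]
    if frac > 0:
        value += frac * (sorted_data[whole] - sorted_data[whole - 1])
    return value

data = [10, 7, 11, 9, 9, 10, 7, 9, 12]
ordered = sorted(data)
n = len(ordered)

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
gm = statistics.geometric_mean(data)
hm = statistics.harmonic_mean(data)

q3 = position_value(ordered, 3 * (n + 1) / 4)      # 7.5th observation
d5 = position_value(ordered, 5 * (n + 1) / 10)     # 5th observation
p60 = position_value(ordered, 60 * (n + 1) / 100)  # 6th observation

print(round(mean, 2), median, mode, round(gm, 2), round(hm, 2), q3, d5, p60)
```

The geometric mean computed this way is about 9.20; the hand value 9.19 differs slightly because two-decimal logarithms were used.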


Objective: Computation of Measures of Central Tendency by all methods for Grouped data.

Kinds of data: The following data relate to the percentage of marks obtained by 556 students in a certain examination.

Class intervals (Marks) 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100

No. of Students 2 20 50 90 107 115 91 53 20 8

Solution: In the table the classes are taken as 0-10, 10-20, ..., 90-100 with mid values x_i = 5, 15, ..., 95. Here f_i is the frequency (no. of students), c.f. is the cumulative frequency, o_i = x_i − A with A = 55 (change of origin method), s_i = x_i/h with h = 10 (change of scale method) and d_i = (x_i − A)/h (change of origin and scale method).

Class f_i c.f. x_i f_ix_i o_i f_io_i s_i f_is_i d_i f_id_i

0-10 2 2 5 10 -50 -100 0.5 1.0 -5 -10

10-20 20 22 15 300 -40 -800 1.5 30.0 -4 -80

20-30 50 72 25 1250 -30 -1500 2.5 125.0 -3 -150

30-40 90 162 35 3150 -20 -1800 3.5 315.0 -2 -180

40-50 107 (f0) 269 45 4815 -10 -1070 4.5 481.5 -1 -107

50-60 115 (f1) 384 55 6325 0 0 5.5 632.5 0 0

60-70 91 (f2) 475 65 5915 10 910 6.5 591.5 1 91

70-80 53 528 75 3975 20 1060 7.5 397.5 2 106

80-90 20 548 85 1700 30 600 8.5 170.0 3 60

90-100 8 556 95 760 40 320 9.5 76.0 4 32

Total 556 28200 -50 -2380 50 2820 -5 -238

Arithmetic Mean

(a) Ordinary method = 28200/556 = 50.72 %

(b) Change of origin method = 55 + (−2380)/556 = 55 − 4.28 = 50.72 %

(c) Change of scale method = (2820/556) × 10 = 50.72 %

(d) Change of origin and scale method = 55 + (−238/556) × 10 = 55 − 4.28 = 50.72 %

Median

First we find the median class: $\frac{\sum f_i}{2} = \frac{556}{2} = 278$

278 is first reached at the cumulative frequency 384, so 50-60 is the median class.

So, $Md = 50 + \frac{\frac{556}{2} - 269}{115} \times 10 = 50 + \frac{90}{115} = 50.78$ %

Mode

The highest frequency is 115, so the modal class is 50-60.

$Mo = 50 + \frac{115 - 107}{2 \times 115 - 107 - 91} \times 10 = 50 + \frac{80}{32} = 50 + 2.5 = 52.5$ %
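The grouped-data formulas for the mean, the median (and other partition values) and the mode can be sketched as below. The function names are illustrative; the classes are assumed to be exclusive and of equal width h, and F is taken as the cumulative frequency before the class containing the required value, as in the worked solution.

```python
def grouped_mean(lower_limits, freqs, h):
    """Ordinary method: sum(f * x) / sum(f), with x the class mid value."""
    mids = [lower + h / 2 for lower in lower_limits]
    return sum(f * x for f, x in zip(freqs, mids)) / sum(freqs)

def grouped_quantile(lower_limits, freqs, h, fraction):
    """L + (fraction * N - F) / f * h for the class that contains fraction * N."""
    target = fraction * sum(freqs)
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * h
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

def grouped_mode(lower_limits, freqs, h):
    """L + (f1 - f0) / (2 f1 - f0 - f2) * h for the class with the highest frequency."""
    i = freqs.index(max(freqs))
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * h

lower_limits = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
freqs = [2, 20, 50, 90, 107, 115, 91, 53, 20, 8]
h = 10

mean = grouped_mean(lower_limits, freqs, h)
median = grouped_quantile(lower_limits, freqs, h, 0.5)
mode = grouped_mode(lower_limits, freqs, h)
print(round(mean, 2), round(median, 2), mode)
```

The same `grouped_quantile` call with fractions 0.75, 0.6 and 0.2 gives the third quartile, sixth decile and 20th percentile of this distribution.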


Geometric Mean and Harmonic Mean: the following table is used for both.

xi logxi fi filogxi fi/xi

5 0.70 2 1.40 0.40

15 1.18 20 23.52 1.33

25 1.40 50 69.90 2.00

35 1.54 90 138.97 2.57

45 1.65 107 176.89 2.38

55 1.74 115 200.14 2.09

65 1.81 91 164.98 1.40

75 1.88 53 99.38 0.71

85 1.93 20 38.59 0.24

95 1.98 8 15.82 0.08

Total 556 929.58 13.20

Geometric Mean: $GM = \text{Antilog}\left(\frac{1}{\sum f_i}\sum f_i \log x_i\right)$

GM = Antilog(929.58/556) = Antilog(1.67) = 46.98 %

Harmonic Mean: $HM = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}$

HM = 556/13.20 = 42.12 %

Quartiles, Deciles and Percentiles

Third quartile, sixth decile and 20th percentile.

First we determine the quartile class: (3 × 556)/4 = 417.

417 falls in the cumulative frequency of the class 60-70. So the third quartile is

$Q_3 = 60 + \frac{\frac{3 \times 556}{4} - 384}{91} \times 10 = 60 + \frac{330}{91} = 60 + 3.63 = 63.63$ %

Sixth Decile: Here (6 × 556)/10 = 333.6, so the decile class is 50-60.

$D_6 = 50 + \frac{\frac{6 \times 556}{10} - 269}{115} \times 10 = 50 + \frac{646}{115} = 50 + 5.62 = 55.62$ %

20th Percentile: Here (20 × 556)/100 = 111.2, so the percentile class is 30-40.

$P_{20} = 30 + \frac{\frac{20 \times 556}{100} - 72}{90} \times 10 = 30 + \frac{39.2}{90} \times 10 = 30 + 4.36 = 34.36$ %

Objective: Computation of the algebraic sum of the deviations of a set of values from their arithmetic mean.

Kinds of data: The following data relate to the frequency distribution of the number of workers according to their wages in a certain factory.

Wages (in Rs) Below 10 Below 20 Below 30 Below 40 Below 50 Below 60 Below 70 Below 80

No. of workers 15 35 60 84 96 127 198 250

Solution: The data are given as cumulative frequencies of the less than type. First we convert them to ordinary frequencies by subtracting each cumulative frequency from the next one. Then the data become:

Wages (in Rs) 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

No. of workers 15 20 25 24 12 31 71 52

Then the A.M. = ∑fixi/N = 12600/250 = Rs. 50.40.

The algebraic sum of the deviations of a set of values from their arithmetic mean is $\sum f_i (x_i - \bar{x})$, which is equal to zero (apart from rounding error in the mean).
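For the wages example, the conversion from less-than cumulative frequencies to class frequencies and the zero-sum property of deviations about the mean can be checked as follows. This is an illustrative sketch; the mid values 5, 15, ..., 75 assume classes 0-10, 10-20, ..., 70-80.

```python
# "Below 10", "Below 20", ..., "Below 80" cumulative numbers of workers.
cumulative = [15, 35, 60, 84, 96, 127, 198, 250]

# Class frequency = difference between successive cumulative frequencies.
freqs = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

mids = [5, 15, 25, 35, 45, 55, 65, 75]   # mid values of 0-10, ..., 70-80
n = sum(freqs)

mean = sum(f * x for f, x in zip(freqs, mids)) / n
deviation_sum = sum(f * (x - mean) for f, x in zip(freqs, mids))

print(freqs)
print(mean, deviation_sum)
```

The sum of deviations is zero up to floating-point rounding.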


Objective: Computation of pooled mean of the given data.

Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers (second series) is 50.

Solution: We know that the pooled mean formula is

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{5 \times 40 + 4 \times 50}{5 + 4} = \frac{400}{9} = 44.44$$

Objective: Computation of average speed using harmonic mean.

Kinds of data: A train covered the first 5 km of its journey at a speed of 30 km/h and the next 15 km at a speed of 45 km/h. Show that the average speed of the train was 40 km/h.

Solution: Since different distances are covered at different speeds, we calculate the weighted average speed, using the harmonic mean formula with the distances as weights instead of frequencies.

X (km/h) 30 45

w (km) 5 15

Now the formula becomes

$$\text{average speed} = \frac{\sum w_i}{\sum \frac{w_i}{x_i}} = \frac{20}{\frac{5}{30} + \frac{15}{45}} = \frac{20}{0.5} = 40 \text{ km/h}$$

Kinds of data: A man goes from place A to place B at a speed of 10 km/h and comes back from B to A at a speed of 15 km/h. Find the average speed if the distance between A and B is x.

Solution: We know that

$$\text{average speed} = \frac{\text{total distance}}{\text{total time taken}} = \frac{x + x}{\frac{x}{10} + \frac{x}{15}} = \frac{300x}{25x} = 12 \text{ km/h},$$

which is the harmonic mean of 10 and 15.
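Both average-speed problems are instances of the weighted harmonic mean, which can be sketched in a few lines. The function name is illustrative; for the round trip the two legs have equal distance, so equal weights are used.

```python
def weighted_harmonic_mean(values, weights):
    """sum(w) / sum(w / x): total distance divided by total time."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

# Train: 5 km at 30 km/h and 15 km at 45 km/h.
train_speed = weighted_harmonic_mean([30, 45], [5, 15])

# Round trip A -> B -> A: equal distances at 10 km/h and 15 km/h.
round_trip_speed = weighted_harmonic_mean([10, 15], [1, 1])

print(round(train_speed, 2), round(round_trip_speed, 2))
```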

Objective: Relationship between AM, GM and HM.

Kinds of data: For two values of X, the arithmetic mean = 25 and the harmonic mean = 9. Find the geometric mean.

Solution: We know that for two positive values A.M. × H.M. = (G.M.)^2. Putting in the values we get

25 × 9 = (G.M.)^2, hence G.M. = √225 = 15

Here we can also see that AM ≥ GM ≥ HM (25 ≥ 15 ≥ 9).

Effect of change of origin and scale on Arithmetic Mean:

Change of Origin: If the origin is shifted to another value A, then the new measurement will be X − A and the new mean will be $\bar{X} - A$.

Change of Scale: If each observation X is divided by a value h, then the new measurement will be $\frac{X}{h}$ and the new mean will be $\frac{\bar{X}}{h}$.

Change of Origin and Scale both: If both are changed, the new observation will be $\frac{X - A}{h}$ and the new mean will be $\frac{\bar{X} - A}{h}$.

Kinds of data: The mean of 100 observations is 50. What will be the new mean if

(i) 6 is added to each observation, (ii) each observation is multiplied by 3,

(iii) 5 is subtracted from each observation and the result is then divided by 4.

Solution: (i) Since 6 is added to each observation, the new mean will be 50 + 6 = 56

(ii) If each observation is multiplied by 3, the new mean will be 3 × 50 = 150

(iii) The new variable will be $U = \frac{X - 5}{4}$, hence the mean is $\bar{U} = \frac{\bar{X} - 5}{4} = \frac{50 - 5}{4} = 11.25$


5. Measures of Dispersion

Measures of Dispersion: give us an idea about the scatteredness of the data.

Measure: Range
Definition: It is defined as the difference between the maximum and minimum value of any dataset.
Ungrouped data: Range = Maximum value − Minimum value
Grouped data: Range = upper limit of the last class interval − lower limit of the first class interval

Measure: Quartile Deviation
Definition: It is the difference between the third and first quartiles divided by 2.
Ungrouped data: $QD = \frac{Q_3 - Q_1}{2}$
Grouped data: $QD = \frac{Q_3 - Q_1}{2}$

Measure: Mean Deviation
Definition: It is defined as the average of the absolute deviations of all the observations from their mean.
Ungrouped data: $MD = \frac{\sum |X_i - \bar{X}|}{n}$
Grouped data: $MD = \frac{\sum f_i |X_i - \bar{X}|}{\sum f_i}$

Measure: Standard Deviation (Best Measure)
Definition: It is defined as the square root of the average of the squares of the deviations of all the observations from their mean.
Ungrouped data: $SD = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$
Grouped data: $SD = \sqrt{\frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}}$

Measures for comparison of two series

Measure: Coefficient of Dispersion (CD)
Definition: Used to compare the variability of two series. It can be based upon
(1) Range: $CD = \frac{Max - Min}{Max + Min}$
(2) Quartile Deviation: $CD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
(3) Mean Deviation: $CD = \frac{MD}{\text{average from which it is calculated}}$
(4) Standard Deviation: $CD = \frac{SD}{Mean}$

Measure: Coefficient of Variation (CV)
Definition: 100 times the coefficient of dispersion based upon the standard deviation (unitless measure).
$CV = \frac{SD}{Mean} \times 100$


Objective: Computation of Measures of Dispersion by all methods for Ungrouped data.

Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12

Solution:

(1) Range = 12 − 3 = 9

(2) Quartile Deviation: Arrange the observations in ascending order

3, 5, 7, 7, 9, 9, 10, 10, 12

$Q_1 = \frac{1 \times (9+1)}{4}$th = (10/4)th = 2.5th observation

So Q1 = 5 + 0.5 × (7 − 5) = 6

Similarly $Q_3 = \frac{3 \times (9+1)}{4}$th = (30/4)th = 7.5th observation

So Q3 = 10 + 0.5 × (10 − 10) = 10

Now QD = (10 − 6)/2 = 2

(3) Mean Deviation:

Mean = (10+7+5+9+9+10+7+3+12)/9 = 72/9 = 8

MD = (1/9)(|10 − 8| + |7 − 8| + |5 − 8| + |9 − 8| + |9 − 8| + |10 − 8| + |7 − 8| + |3 − 8| + |12 − 8|)

= (1/9)(2+1+3+1+1+2+1+5+4) = 20/9 = 2.22

(4) Standard Deviation:

Mean = (10+7+5+9+9+10+7+3+12)/9 = 8

$SD = \sqrt{\frac{(10-8)^2+(7-8)^2+(5-8)^2+(9-8)^2+(9-8)^2+(10-8)^2+(7-8)^2+(3-8)^2+(12-8)^2}{9}}$

$= \sqrt{\frac{4+1+9+1+1+4+1+25+16}{9}} = \sqrt{\frac{62}{9}} = 2.62$
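The four measures for the ungrouped data can be checked with the sketch below. The helper `position_value` is a hypothetical name implementing the (n+1)-based position rule with linear interpolation used in the solution; the standard deviation divides by n, as in the formula of this chapter.

```python
import math

def position_value(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    whole = int(position)
    frac = position - whole
    value = sorted_data[whole - 1]
    if frac > 0:
        value += frac * (sorted_data[whole] - sorted_data[whole - 1])
    return value

data = [10, 7, 5, 9, 9, 10, 7, 3, 12]
ordered = sorted(data)
n = len(data)

data_range = max(data) - min(data)
q1 = position_value(ordered, (n + 1) / 4)        # 2.5th observation
q3 = position_value(ordered, 3 * (n + 1) / 4)    # 7.5th observation
qd = (q3 - q1) / 2

mean = sum(data) / n
md = sum(abs(x - mean) for x in data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

print(data_range, qd, round(md, 2), round(sd, 2))
```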

Objective: Computation of Measures of Dispersion by all methods for Grouped data.

Kinds of data: The age distribution of 542 members is given below

Age (in years) 20-30 30-40 40-50 50-60 60-70 70-80 80-90 Total

No. of members 3 61 132 153 140 51 2 542

Solution:

(1) Range = 90 − 20 = 70

The following working table is used for the remaining measures. Here f_i is the number of members, c.f. the cumulative frequency, x_i the class mid value and x̄ = 54.72 the mean.

Age f_i c.f. x_i f_ix_i (x_i − x̄) f_i|x_i − x̄| (x_i − x̄)^2 f_i(x_i − x̄)^2

20-30 3 3 25 75 -29.7 89.2 883.3 2649.8

30-40 61 64 35 2135 -19.7 1202.9 388.9 23721.6

40-50 132 196 45 5940 -9.7 1283.0 94.5 12471.1

50-60 153 349 55 8415 0.3 42.8 0.1 12.0

60-70 140 489 65 9100 10.3 1439.2 105.7 14795.0

70-80 51 540 75 3825 20.3 1034.3 411.3 20975.2

80-90 2 542 85 170 30.3 60.6 916.9 1833.8

Total 542 2183 385 29660 1.96 5152 2800.54 76458.49

(2) Quartile Deviation: first we find the first and third quartiles.

First we determine the first quartile class: (1 × 542)/4 = 135.5

135.5 falls in the cumulative frequency of the class 40-50. So the first quartile is

$Q_1 = 40 + \frac{\frac{1 \times 542}{4} - 64}{132} \times 10 = 40 + \frac{715}{132} = 40 + 5.42 = 45.42$ years

Similarly for Q3: (3 × 542)/4 = 406.5

406.5 falls in the cumulative frequency of the class 60-70. So the third quartile is

$Q_3 = 60 + \frac{\frac{3 \times 542}{4} - 349}{140} \times 10 = 60 + \frac{575}{140} = 60 + 4.11 = 64.11$ years

So the quartile deviation is (64.11 − 45.42)/2 = 18.69/2 = 9.345

(3) Mean Deviation: first calculate the mean

Mean = 29660/542 = 54.72

From the above table, Mean Deviation = 5152/542 = 9.51

(4) Standard Deviation = $\sqrt{\frac{76458.49}{542}} = \sqrt{141.07} = 11.88$

Objective: Computation of variability of two series by coefficient of variation.

Kinds of data: Goals scored by two teams A and B in a football season were as follows

No. of goals scored in a match 0 1 2 3 4

No. of matches (A) 27 9 8 5 4

No. of matches (B) 17 9 6 5 3

Solution: Here we have to calculate the CV of both the teams separately. In the table x_i and y_i denote the number of goals for teams A and B, with frequencies f_A and f_B.

Team A:

x_i f_A f_Ax_i (x_i − x̄) (x_i − x̄)^2 f_A(x_i − x̄)^2

0 27 0 -1.05 1.10 29.77

1 9 9 -0.05 0.00 0.02

2 8 16 0.95 0.90 7.22

3 5 15 1.95 3.80 19.01

4 4 16 2.95 8.70 34.81

Total 53 56 4.75 14.51 90.83

Team B:

y_i f_B f_By_i (y_i − ȳ) (y_i − ȳ)^2 f_B(y_i − ȳ)^2

0 17 0 -1.2 1.44 24.48

1 9 9 -0.2 0.04 0.36

2 6 12 0.8 0.64 3.84

3 5 15 1.8 3.24 16.2

4 3 12 2.8 7.84 23.52

Total 40 48 4 13.2 68.4

First we calculate the mean and standard deviation of the first (A) series

$\bar{X}_A = \frac{56}{53} = 1.05$, $\sigma_A = \sqrt{\frac{90.83}{53}} = \sqrt{1.714} = 1.31$, then $CV = \frac{\sigma_A}{\bar{X}_A} \times 100 = \frac{1.31}{1.05} \times 100 = 124.76$

Now we calculate the mean and standard deviation of the second (B) series

$\bar{X}_B = \frac{48}{40} = 1.2$, $\sigma_B = \sqrt{\frac{68.4}{40}} = \sqrt{1.71} = 1.31$, then $CV = \frac{\sigma_B}{\bar{X}_B} \times 100 = \frac{1.31}{1.2} \times 100 = 109.17$

Comparing the coefficients of variation of series A and B, series B has the lower CV value and is therefore more consistent.
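The comparison of the two teams can be reproduced with a short function. This is a sketch with an illustrative function name; it keeps full precision throughout, so the CV values (about 123.9 for A and 109.0 for B) differ slightly from the hand calculation, which rounds the mean and standard deviation to two decimals first. The conclusion is unchanged.

```python
import math

def coefficient_of_variation(values, freqs):
    """CV = SD / mean * 100 for a discrete frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values)) / n
    return math.sqrt(variance) / mean * 100

goals = [0, 1, 2, 3, 4]
cv_a = coefficient_of_variation(goals, [27, 9, 8, 5, 4])
cv_b = coefficient_of_variation(goals, [17, 9, 6, 5, 3])

more_consistent = "B" if cv_b < cv_a else "A"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```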


6. Skewness and Kurtosis

Skewness and Kurtosis: Skewness gives us an idea about the symmetry of the curve whereas Kurtosis gives us an idea about the shape (peakedness) of the curve.

Measure of shape of a curve: Skewness
Definition: Means lack of symmetry (Mean ≠ Median ≠ Mode)
Types:
(i) Zero skewed or normal curve
(ii) Positively skewed (Mean > Median or Mean > Mode)
(iii) Negatively skewed (Mean < Median or Mean < Mode)
Coefficients:
(1) Absolute Measures
Sk = Mean − Median
Sk = Mean − Mode
Sk = Q3 + Q1 − 2 Median
(2) Relative Measures
(i) Karl Pearson Coefficient of Skewness
$Sk = \frac{Mean - Mode}{SD} = \frac{3(Mean - Median)}{SD}$
Range of coefficient is -3 to +3
(ii) Prof. Bowley's Coefficient of Skewness
$Sk = \frac{Q_3 + Q_1 - 2\,Median}{Q_3 - Q_1}$
Range of coefficient is -1 to +1
(3) Based on Moments
$Sk = \frac{\sqrt{\beta_1}(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$ where $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$, $\beta_2 = \frac{\mu_4}{\mu_2^2}$

Measure of shape of a curve: Kurtosis
Definition: Flatness or peakedness of the curve
Types:
Leptokurtic (highly peaked)
Mesokurtic (normal curve)
Platykurtic (flatter than normal)
Coefficients:
$\beta_2 = \frac{\mu_4}{\mu_2^2}$ or $\gamma_2 = \beta_2 - 3$
Leptokurtic if β2 > 3 or γ2 > 0,
Mesokurtic if β2 = 3 or γ2 = 0,
Platykurtic if β2 < 3 or γ2 < 0

Note: Here μ2, μ3 and μ4 are central moments, or the moments about the mean, and are defined as $\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$


Moments: a moment is the average of the deviations from the mean, or from some other value, raised to a certain power.

Moments about any arbitrary value A: $\mu_r' = \frac{\sum f_i (X_i - A)^r}{\sum f_i}$, where A is any arbitrary value.

Moments about the origin: $m_r = \frac{\sum f_i X_i^r}{\sum f_i}$

Moments about the arithmetic mean: $\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$, where $\bar{X}$ is the arithmetic mean.

Relationship between μr and μr′:

$$\mu_r = \mu_r' - \binom{r}{1}\mu_1'\,\mu_{r-1}' + \binom{r}{2}(\mu_1')^2\,\mu_{r-2}' - \dots + (-1)^r(\mu_1')^r$$

In particular,

$\mu_2 = \mu_2' - (\mu_1')^2$

$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$

$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6(\mu_1')^2\mu_2' - 3(\mu_1')^4$

Important: (i) μ0 = μ0′ = 1 (ii) The first central moment is always zero. (iii) μ2 = SD^2 = variance

Objective: Computation of mean and variance when moments about an arbitrary value are given.

Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16 and -40.

Solution: Here the arbitrary value is A = 2 and the moments are μ1′ = 1, μ2′ = 16 and μ3′ = -40

We know that $\mu_1' = \frac{\sum f_i (X_i - 2)}{\sum f_i} = 1$, hence

$\frac{\sum f_i x_i}{\sum f_i} - 2\,\frac{\sum f_i}{\sum f_i} = 1$, which gives $\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = 1 + 2 = 3$

Hence the mean is 3.

We know that μ2 = μ2′ − (μ1′)^2; putting in the values we get

μ2 = 16 − 1 × 1 = 15

Hence the variance is 15.

Objective: Computation of first four central moments, β1 and β2.

Kinds of data: The distribution of data is given below

x 0 1 2 3 4 5 6 7 8

f 1 8 28 56 70 56 28 8 1

Solution:

x_i f_i f_ix_i (x_i − x̄) f_i(x_i − x̄) f_i(x_i − x̄)^2 f_i(x_i − x̄)^3 f_i(x_i − x̄)^4

0 1 0 -4 -4 16 -64 256

1 8 8 -3 -24 72 -216 648

2 28 56 -2 -56 112 -224 448

3 56 168 -1 -56 56 -56 56

4 70 280 0 0 0 0 0

5 56 280 1 56 56 56 56

6 28 168 2 56 112 224 448

7 8 56 3 24 72 216 648

8 1 8 4 4 16 64 256

Total 256 1024 0 0 512 0 2816

First we calculate the mean of the series

$\bar{x} = \frac{1024}{256} = 4$

Then $\mu_1 = \frac{\sum f_i (x_i - \bar{x})}{\sum f_i} = 0$, $\mu_2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{512}{256} = 2$

$\mu_3 = \frac{\sum f_i (x_i - \bar{x})^3}{\sum f_i} = \frac{0}{256} = 0$, $\mu_4 = \frac{\sum f_i (x_i - \bar{x})^4}{\sum f_i} = \frac{2816}{256} = 11$

Now we calculate β1 and β2

$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0^2}{2^3} = 0$, $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{11}{2^2} = 2.75$. Since β2 < 3, the curve is platykurtic.
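The central moments and the coefficients β1 and β2 of this example can be computed directly from the definition. The function name `central_moment` is illustrative only.

```python
def central_moment(values, freqs, r):
    """r-th moment about the arithmetic mean: sum(f * (x - mean)^r) / sum(f)."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    return sum(f * (x - mean) ** r for f, x in zip(freqs, values)) / n

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
fs = [1, 8, 28, 56, 70, 56, 28, 8, 1]

mu2 = central_moment(xs, fs, 2)
mu3 = central_moment(xs, fs, 3)
mu4 = central_moment(xs, fs, 4)

beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

if beta2 > 3:
    shape = "leptokurtic"
elif beta2 < 3:
    shape = "platykurtic"
else:
    shape = "mesokurtic"

print(mu2, mu3, mu4, beta1, beta2, shape)
```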


Objective: Calculate the appropriate measure of skewness from the following cumulative frequency distribution.

Kinds of data: The distribution of data is given below

Age (under years) 20 30 40 50 60 70

Number of persons 12 29 48 75 94 106

Solution: Here, upper limits along with cumulative frequencies are given in the data. We first find the lower limit and the frequency of each class.

Age (years) Cumulative frequency Number of persons (Frequency)

Below 20 12 12

20-30 29 29 − 12 = 17

30-40 48 48 − 29 = 19

40-50 75 75 − 48 = 27

50-60 94 94 − 75 = 19

60-70 106 106 − 94 = 12

Total N = 106

Since the distribution is open ended, the mean cannot be calculated, so none of the methods that require the mean can be used. Bowley's method, which is based on quartiles, can be used instead.

First we determine the first quartile class: (1 × 106)/4 = 26.5

26.5 falls in the cumulative frequency of the class 20-30. So the first quartile is

$Q_1 = 20 + \frac{\frac{1 \times 106}{4} - 12}{17} \times 10 = 20 + \frac{145}{17} = 20 + 8.53 = 28.53$ years

Similarly for the median Q2: (2 × 106)/4 = 53

53 falls in the cumulative frequency of the class 40-50. So the second quartile is

$Q_2 = 40 + \frac{\frac{2 \times 106}{4} - 48}{27} \times 10 = 40 + \frac{50}{27} = 40 + 1.85 = 41.85$ years

Similarly for Q3: (3 × 106)/4 = 79.5

79.5 falls in the cumulative frequency of the class 50-60. So the third quartile is

$Q_3 = 50 + \frac{\frac{3 \times 106}{4} - 75}{19} \times 10 = 50 + \frac{45}{19} = 50 + 2.37 = 52.37$ years

Prof. Bowley's Coefficient of Skewness: $Sk = \frac{Q_3 + Q_1 - 2\,Median}{Q_3 - Q_1}$

Putting the values in the formula we get

$Sk = \frac{52.37 + 28.53 - 2 \times 41.85}{52.37 - 28.53} = \frac{-2.8}{23.84} = -0.117$

Hence the coefficient of skewness is -0.117.
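Bowley's coefficient for this open-ended distribution can be sketched as follows. The function name and the `(lower, upper, frequency)` layout are assumptions made for illustration; the unknown lower limit of the first class is stored as `None`, and the function refuses to interpolate inside that class. With unrounded quartiles the coefficient comes out at about -0.118, very close to the hand value -0.117.

```python
def grouped_quantile(classes, fraction):
    """L + (fraction * N - F) / f * h over a list of (lower, upper, frequency)."""
    target = fraction * sum(f for _, _, f in classes)
    cumulative = 0
    for lower, upper, f in classes:
        if cumulative + f >= target:
            if lower is None:
                raise ValueError("quantile falls in an open-ended class")
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

uppers = [20, 30, 40, 50, 60, 70]
cumulative_counts = [12, 29, 48, 75, 94, 106]

# Class frequencies from the less-than cumulative frequencies.
freqs = [cumulative_counts[0]] + [
    b - a for a, b in zip(cumulative_counts, cumulative_counts[1:])
]
lowers = [None] + uppers[:-1]            # first class is "Below 20"
classes = list(zip(lowers, uppers, freqs))

q1 = grouped_quantile(classes, 0.25)
q2 = grouped_quantile(classes, 0.50)
q3 = grouped_quantile(classes, 0.75)
bowley = (q3 + q1 - 2 * q2) / (q3 - q1)

print(round(q1, 2), round(q2, 2), round(q3, 2), round(bowley, 3))
```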


7. Probability

Probability: is defined as the ratio of the number of cases favourable to an event to the total number of all possible outcomes.

Important terms and Laws of Probability

Term: Trial
Definition: An experiment is called a trial.
Example: Experiment: tossing of a single coin

Term: Event
Definition: Outcomes are known as events.
Example: Head and Tail

Term: Exhaustive events
Definition: The total number of all possible outcomes of any experiment.
Example: Exhaustive events = {H, T}

Term: Mutually exclusive events
Definition: Two or more events of which only one can happen at a time.
Example: In a single throw either H or T will come.

Term: Equally likely events
Definition: Events for which there is no reason to expect one in preference to the other.
Example: The chances of occurrence of H or T are ½ each.

Term: Independent events
Definition: The happening of one event is not affected by the happening of the other.
Example: In tossing of two coins, the occurrence of H or T on each one is independent of the other.

Term: Dependent events
Definition: The happening of one event is affected by the happening of the other.

Laws of Probability

Law of Addition: If E is an event which includes the happening of any one of the n mutually exclusive events E1, E2, …, En, then P(E) = P(E1) + P(E2) + … + P(En)

Law of Multiplication: If E is an event which includes the happening of all of the n independent events E1, E2, …, En, then P(E) = P(E1) × P(E2) × … × P(En)

Law of Total Probability: For any two events A and B, the probability of happening of either A or B is given by
P(A∪B) = P(A) + P(B) − P(A∩B)

Particular case: Probability that at least one event happens = 1 − P(all the events fail to happen)

Note:

• The sum of the probabilities of all outcomes is always equal to 1. If p is the probability of success of any event, then q = 1 − p is the probability of failure of that event.

• Sometimes the probability is based on combination (selection). The number of ways of selecting r things out of n things is given by $\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$, where n! = n(n−1)(n−2)…3·2·1.

e.g. 5! = 5 × 4 × 3 × 2 × 1 = 120


1. Find the chance of throwing at least one ace in a single throw with two dice.

Solution: The probability of getting an ace in a single throw of one die is p = (no. of favourable cases)/(all possible outcomes) = 1/6

So the probability of not getting an ace in a single throw is q = 1 − 1/6 = 5/6

We know that the probability that at least one event happens = 1 − P(all the events fail to happen)

= 1 − (5/6) × (5/6) = 1 − 25/36 = 11/36

2. From a bag containing 4 white and 5 black balls, 3 are drawn at random. What are the odds against these being all black?

Solution: The probability of selecting 3 black balls = (no. of favourable cases for selection of 3 black balls)/(total no. of all possible cases)

$$= \frac{\binom{5}{3}}{\binom{9}{3}} = \frac{\frac{5!}{3!\,2!}}{\frac{9!}{3!\,6!}} = \frac{10}{84} = \frac{5}{42}$$

And the probability of not selecting 3 black balls = 1 − 5/42 = 37/42

So the odds against these being all black are 37 : 5

3. What is the chance of drawing a pie from a purse, one compartment of which contains 3 paise and 2 pies and the other, 2 paise and 1 pie?

Solution: Here the probability of selecting each compartment is 1/2

So the probability of drawing a pie from the purse = (1/2) × (2/5) + (1/2) × (1/3) = 1/5 + 1/6 = 11/30

4. A bag contains 4 red balls and 3 blue balls. Two drawings of 2 balls are made. Find the chance that the first drawing gives 2 red balls and the second drawing, 2 blue balls,

(a) if the balls are returned to the bag after the first draw,

(b) if the balls are not returned.

Solution: (a) In the first case, the balls are returned to the bag after the first draw, so both drawings are made from 7 balls.

Probability = [(no. of ways of selecting 2 red balls)/(no. of ways of selecting 2 balls out of 7)] × [(no. of ways of selecting 2 blue balls)/(no. of ways of selecting 2 balls out of 7)]

$$= \frac{\binom{4}{2}}{\binom{7}{2}} \times \frac{\binom{3}{2}}{\binom{7}{2}} = \frac{6}{21} \times \frac{3}{21} = \frac{2}{49}$$

(b) In the second case, the balls are not returned to the bag after the first draw, so the second drawing is made from the remaining 5 balls.

Probability = [(no. of ways of selecting 2 red balls)/(no. of ways of selecting 2 balls out of 7)] × [(no. of ways of selecting 2 blue balls)/(no. of ways of selecting 2 balls out of the remaining 5)]

$$= \frac{\binom{4}{2}}{\binom{7}{2}} \times \frac{\binom{3}{2}}{\binom{5}{2}} = \frac{6}{21} \times \frac{3}{10} = \frac{3}{35}$$


5. If three coins are tossed, what is the chance of getting (a) exactly two heads, (b) at least two heads, (c) at most two heads? (Answers: 3/8, 1/2, 7/8)

Solution: Each of the 8 possible outcomes has probability (1/2) × (1/2) × (1/2) = 1/8.

(a) P = P{H,H,T} + P{H,T,H} + P{T,H,H}

So, P = 1/8 + 1/8 + 1/8 = 3/8

(b) P = P{H,H,T} + P{H,T,H} + P{T,H,H} + P{H,H,H}

So, P = 1/8 + 1/8 + 1/8 + 1/8 = 4/8 = 1/2

(c) P = P{H,H,T} + P{H,T,H} + P{T,H,H} + P{T,T,H} + P{T,H,T} + P{H,T,T} + P{T,T,T}

So, P = 1/8 + 1/8 + 1/8 + 1/8 + 1/8 + 1/8 + 1/8 = 7/8
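The three answers to the coin problem can be confirmed by listing all eight equally likely outcomes. This is a small illustrative sketch; the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of tossing three coins, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

def probability(event):
    """Favourable outcomes / all outcomes; event is a condition on the number of heads."""
    favourable = sum(1 for outcome in outcomes if event(outcome.count("H")))
    return Fraction(favourable, len(outcomes))

exactly_two = probability(lambda heads: heads == 2)
at_least_two = probability(lambda heads: heads >= 2)
at_most_two = probability(lambda heads: heads <= 2)

print(exactly_two, at_least_two, at_most_two)
```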

6. Four persons are chosen at random from a group consisting of 3 men, 2 women and 4 children. Show that the chance that exactly 2 of them will be children is 10/21.

Solution: Here p = (no. of ways of selecting 2 children out of 4 and the remaining 2 persons from the 5 men and women)/(total no. of ways of selecting 4 persons out of 9)

$$P = \frac{\binom{4}{2}\binom{5}{2}}{\binom{9}{4}} = \frac{\frac{4!}{2!\,2!} \times \frac{5!}{2!\,3!}}{\frac{9!}{4!\,5!}} = \frac{60}{126} = \frac{30}{63} = \frac{10}{21}$$

7. A card is drawn from a well shuffled pack of playing cards. What is the probability that it is either a spade or an ace?

Solution: Let A be the event of selecting a spade, B the event of selecting an ace and A∩B the event of selecting the ace of spades. Then by the law of total probability we can write

P(A∪B) = P(A) + P(B) − P(A∩B)

= 13/52 + 4/52 − 1/52 = 16/52 = 4/13


8. Discrete and Continuous Distribution

Distribution: The distribution of a statistical data set (or a population) is a listing or function showing all

the possible values (or intervals) of the data and how often they occur.

Distributions are of two kinds:

Discrete: for a discontinuous variable (Binomial and Poisson distributions)

Continuous: for a continuous variable (Normal distribution)

Binomial Dist.

Prob. mass function is

$P(X = x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, ..., n; and 0 otherwise,

where n and p are the parameters of the dist. and q = 1 - p.

Here

• X represents the number of successes of the event

• The probability of x successes is $p(x) = \binom{n}{x} p^x q^{n-x}$

• The frequency of x successes in N sets, each of n trials = N·p(x)

• Mean = np

• Variance = npq

Conditions for binomial dist.

• Each trial results in one of two mutually disjoint outcomes, i.e. success and failure

• The no. of trials n is finite

• The trials are independent of each other

• The probability of success is constant for each trial

• Examples of binomial dist.: tossing of a coin, throwing of a die, etc.

Fitting of Binomial Dist.

• Calculate the mean $\bar{x}$

• Equate $\bar{x} = np$, so $p = \bar{x}/n$

• Expand the binomial $N(q+p)^n = N\left[q^n + \binom{n}{1} q^{n-1} p + \dots + p^n\right]$

• Or obtain each term from the preceding one using the multiplying factor $\frac{n-r+1}{r} \cdot \frac{p}{q}$

• Apply chi square to test the goodness of fit.

Poisson Dist.

Prob. mass function is

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, 2, ...; and 0 otherwise,

where λ > 0 is the parameter of the dist.

Here

• X represents the no. of occurrences of the rare event, e.g. 0, 1, 2, ...

• The probability of x occurrences is $p(x) = \frac{e^{-\lambda}\lambda^x}{x!}$

• The frequency of x out of N cases = N·p(x)

• Mean = Variance = λ

Condition:

The Poisson dist. occurs when events do not occur as outcomes of a definite number of trials but occur at random points of time and space, and where our interest lies only in the number of occurrences of the event, e.g. number of deaths from a disease, number of faulty blades in a packet of 100.

Fitting of Poisson Dist.

• Calculate the mean $\bar{x}$

• Equate $\bar{x} = \lambda$

• Find $e^{-\lambda}$, then calculate $\frac{e^{-\lambda}\lambda^x}{x!}$; the recurrence formula $P(x+1) = \frac{\lambda}{x+1} P(x)$ can also be used

• Apply chi square to test the goodness of fit.

Normal Dist.

Limiting form of the binomial distribution when n is large (n → ∞) and neither p nor q is very small.

Probability density function is

$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$, -∞ < x < ∞, -∞ < µ < ∞, σ > 0

µ and σ² are the parameters of the dist.

Mean = µ

Variance = σ²

Properties of Normal Distribution:

• Curve is bell shaped and symmetrical

• Mean = median = mode

• As |x - µ| increases, f(x) decreases rapidly

• β1 = 0 and β2 = 3

• f(x) can never be negative

• The x-axis is an asymptote to the curve

• Mean deviation about mean = $\sigma\sqrt{2/\pi} \approx \frac{4}{5}\sigma$

• QD : MD : SD = 10 : 12 : 15

• Area Property

P(µ - σ < X < µ + σ) = 0.6826

P(µ - 2σ < X < µ + 2σ) = 0.9544

P(µ - 3σ < X < µ + 3σ) = 0.9973

Fitting of Normal Dist.

• Calculate the mean µ and standard deviation σ from the given data.

• Then calculate the standard normal variate $z_i = \frac{x_i - \mu}{\sigma}$ corresponding to the lower limit of each class interval. The area under the normal curve to the left of the ordinate at z = z_i, say Φ(z_i), is then read from the tables.

• The areas for successive class intervals are obtained by subtraction, $\Phi(z_{i+1}) - \Phi(z_i)$, i = 1, 2, ...

• By multiplying these areas by N we get the expected normal frequencies.

• Apply chi square to test the goodness of fit.


Objective: Fitting of binomial distribution

Kinds of data: The following data relate to the frequency distribution of number of boys in the first seven

children in families of Swedish minister

No. of boys/family    0    1    2    3    4    5    6    7   Total
No. of families       6   57  206  362  365  256   69   13   1334

Solution:

x_i    f_i    f_i x_i   P(x)                                  Expected frequency N·P(x)   (O_i - E_i)²/E_i
0      6      0         $\binom{7}{0}(0.51)^0(0.49)^7$        9.05                        1.03
1      57     57        $\binom{7}{1}(0.51)^1(0.49)^6$        65.9                        1.21
2      206    412       $\binom{7}{2}(0.51)^2(0.49)^5$        205.8                       0.00
3      362    1086      $\binom{7}{3}(0.51)^3(0.49)^4$        357.0                       0.07
4      365    1460      $\binom{7}{4}(0.51)^4(0.49)^3$        371.6                       0.12
5      256    1280      $\binom{7}{5}(0.51)^5(0.49)^2$        232.1                       2.47
6      69     414       $\binom{7}{6}(0.51)^6(0.49)^1$        80.5                        1.65
7      13     91        $\binom{7}{7}(0.51)^7(0.49)^0$        12.0                        0.09
Total  1334   4800                                            1334                        χ² = 6.62

(x_i = no. of boys per family, f_i = no. of families.)

Mean = 4800/1334 = 3.6; equating np = 3.6 we get p = 3.6/7 = 0.51.

Then q = 1 - 0.51 = 0.49, and the expected frequencies are calculated in the table.

Then we apply the χ² test for goodness of fit. The calculated χ² value, 6.62, is less than the tabulated χ² value at 5 degrees of freedom (11.07 at the 5% level), with $\hat{p}$ = 0.51. Hence the binomial distribution fits well to the number of boys in the first seven children in families of Swedish ministers.
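The fitting steps can be reproduced with a short script. This is a sketch under the assumption that p is rounded to two decimals, as in the worked solution; the variable names are illustrative.

```python
from math import comb

observed = [6, 57, 206, 362, 365, 256, 69, 13]  # families with 0..7 boys
n = len(observed) - 1          # 7 children per family
N = sum(observed)              # 1334 families

mean = sum(x * f for x, f in enumerate(observed)) / N
p = round(mean / n, 2)         # rounded to two decimals, as in the hand calculation
q = 1 - p

# Expected frequency of x boys: N * C(n, x) * p^x * q^(n - x)
expected = [N * comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

for x, (o, e) in enumerate(zip(observed, expected)):
    print(f"{x}  observed={o:4d}  expected={e:7.1f}")
print(f"chi-square = {chi_square:.2f}")
```

Keeping p unrounded (mean/7) would change the expected frequencies slightly, so the rounding choice should match whichever convention the hand calculation uses.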

Problem: Ten coins are thrown simultaneously. Find the probability of getting at least 7 heads.

Solution: In tossing of a coin P(H) = P(T) = 1/2.

The probability of getting x heads in a random throw of 10 coins is

$P(X = x) = \binom{10}{x}\left(\frac{1}{2}\right)^x\left(\frac{1}{2}\right)^{10-x} = \binom{10}{x}\left(\frac{1}{2}\right)^{10}$; x = 0, 1, 2, ..., 10

The probability of getting at least seven heads is given by

$P(X \ge 7) = P(7) + P(8) + P(9) + P(10) = \left(\frac{1}{2}\right)^{10}\left\{\binom{10}{7} + \binom{10}{8} + \binom{10}{9} + \binom{10}{10}\right\} = \frac{120 + 45 + 10 + 1}{1024} = \frac{176}{1024} = \frac{11}{64}$


Objective: Fitting of Poisson distribution

Kinds of data: The following data relate to the number of α –particles emitted by a film of polonium in

2608 successive intervals of one-eighth of a minute.

No. of α-particles    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Observed frequency   57  203  383  525  532  408  273  139   45   27   10    4    0    1    1

Solution: Calculate the mean of the observed data and equate it to the theoretical mean λ.

Mean = 10097/2608 = 3.87

So λ = 3.87.

To fit the distribution we first find the value of $e^{-\lambda}$. Putting x = 0 in $P[X = x] = e^{-\lambda}\lambda^x/x!$

we get $P[X = 0] = e^{-3.87}$

$\log_{10} P(0) = -3.87 \log_{10} e = -3.87 \times 0.4343 = -1.68$

Then P(0) = Antilog(-1.68) = 0.0208

Then we find out the expected frequency from its probability mass function as shown in table.

No. of α-particles x_i   Observed frequency f_i   f_i x_i   Probability P(x)                               Expected frequency   (O_i - E_i)²/E_i
0                        57                       0         $P(0) = e^{-3.87} = 0.0208$                    54.3                 0.1331
1                        203                      203       $P(1) = e^{-3.87}(3.87)^1/1! = 0.0806$         210.3                0.2514
2                        383                      766       $P(2) = e^{-3.87}(3.87)^2/2! = 0.1560$         407.0                1.4194
3                        525                      1575      P(3) = 0.2014                                  525.3                0.0002
4                        532                      2128      P(4) = 0.1949                                  508.4                1.0937
5                        408                      2040      P(5) = 0.1509                                  393.7                0.5214
6                        273                      1638      P(6) = 0.0974                                  254.0                1.4180
7                        139                      973       P(7) = 0.0539                                  140.5                0.0159
8                        45                       360       P(8) = 0.0261                                  68.0                 7.7744
9                        27                       243       P(9) = 0.0112                                  29.2                 0.1728
10                       10                       100       P(10) = 0.0043                                 11.3                 0.1547
11                       4                        44        P(11) = 0.0015                                 4.0                  0.0069
12                       0                        0         P(12) = 0.0005                                 1.3
13                       1                        13        P(13) = 0.0001                                 0.4
14                       1                        14        P(14) = 0.0000                                 0.1
Total                    2608                     10097                                                    2608                 χ² = 12.9616

Then we apply the χ² test for goodness of fit, using the observed and expected frequencies given together in the table above.

The calculated χ² value, 12.96, is less than the tabulated χ² value at 10 degrees of freedom, 18.307, with $\hat{\lambda}$ = 3.87. Hence the Poisson distribution fits the data well.
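The expected Poisson frequencies can be generated with the recurrence P(x+1) = λ/(x+1) · P(x). The sketch below assumes λ is rounded to 3.87 as in the solution; it uses the exact value of e^(-λ), so the results may differ from the logarithm-table values in the last digit.

```python
from math import exp

observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
N = sum(observed)                                   # 2608 intervals
lam = round(sum(x * f for x, f in enumerate(observed)) / N, 2)   # mean, rounded

# Recurrence: P(0) = e^(-lam), P(x + 1) = lam / (x + 1) * P(x)
probs = [exp(-lam)]
for x in range(len(observed) - 1):
    probs.append(probs[-1] * lam / (x + 1))

expected = [N * p for p in probs]
for x, (o, e) in enumerate(zip(observed, expected)):
    print(f"{x:2d}  observed={o:4d}  expected={e:6.1f}")
```

The recurrence avoids computing factorials and large powers separately for each x.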

Problem: In a book of 520 pages, 390 typographical errors occur. Assuming the Poisson law for the number of errors per page, find the probability that a random sample of 5 pages will contain no error.

Solution: First we find the average number of typographical errors per page, λ = 390/520 = 0.75.

By the Poisson probability law, $P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{e^{-0.75}(0.75)^x}{x!}$

So the probability that a random sample of 5 pages will contain no error is

$[P(X = 0)]^5 = (e^{-0.75})^5 = e^{-3.75} \approx 0.0235$

Objective: Fitting of normal distribution

Kinds of data: The following data relate to the frequency distribution of the intelligence scores of a population of 1000.

Class intervals   60-65   65-70   70-75   75-80   80-85   85-90   90-95   95-100   Total
Observed freq.    3       21      150     335     326     135     26      4        1000

Solution: Set up null hypothesis H0: There is no significant difference between observed frequency and

expected frequency and H1: There is significant difference between these two.

Class interval   f_i    x_i    f_i x_i    Lower class limit   z = (x - μ)/σ   Φ(z)    ΔΦ(z) = Φ(z_{i+1}) - Φ(z_i)   Expected freq.   Rounded exp. freq.
Below 60                                   -∞                  -∞              0       0                             0.12             0
60-65            3      62.5   187.5      60                  -3.66           0       0.003                         2.9              3
65-70            21     67.5   1417.5     65                  -2.75           0.003   0.031                         31               31
70-75            150    72.5   10875      70                  -1.83           0.034   0.148                         147.8            148
75-80            335    77.5   25962.5    75                  -0.91           0.182   0.322                         322.1            322
80-85            326    82.5   26895      80                  0.01            0.504   0.319                         319.3            319
85-90            135    87.5   11812.5    85                  0.93            0.823   0.144                         144.1            144
90-95            26     92.5   2405       90                  1.85            0.967   0.03                          29.8             30
95-100           4      97.5   390        95                  2.76            0.997   0.003                         2.7              3
100 & over                                 ∞                   ∞               1
Total            1000          79945

Calculate the mean and standard deviation of the observed data as 79.945 and 5.445 respectively.

(i) Find the value of z = (x - μ)/σ for the lower limit of each class.

(ii) Find the area under the normal curve to the left of z, Φ(z).

(iii) Then determine ΔΦ(z) by taking successive differences.

(iv) Multiply ΔΦ(z) by N to obtain the expected frequency.

(v) Apply the χ² test for goodness of fit.

(vi) Interpret the result about the fitting of the normal distribution: the calculated χ² value, 4.614, is less than the tabulated χ² value at 3 d.f. (7.815 at the 5% level), so H0 is not rejected.

Hence the normal distribution fits the data well.
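Steps (i) to (iv) can be sketched in code. This example assumes the mean and standard deviation are computed from the class mid-values with divisor N, and it evaluates Φ(z) with the error function instead of reading it from tables; the helper name phi is illustrative.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

lower_limits = [60, 65, 70, 75, 80, 85, 90, 95]
width = 5
observed = [3, 21, 150, 335, 326, 135, 26, 4]
N = sum(observed)

# Mean and standard deviation from the class mid-values
mids = [lo + width / 2 for lo in lower_limits]
mean = sum(f * x for f, x in zip(observed, mids)) / N
sd = sqrt(sum(f * (x - mean) ** 2 for f, x in zip(observed, mids)) / N)

# Expected frequency of a class = N * [Phi(z_upper) - Phi(z_lower)]
expected = []
for lo in lower_limits:
    area = phi((lo + width - mean) / sd) - phi((lo - mean) / sd)
    expected.append(N * area)

for lo, o, e in zip(lower_limits, observed, expected):
    print(f"{lo}-{lo + width}  observed={o:4d}  expected={e:6.1f}")
```

Because Φ(z) is computed exactly here, the expected frequencies may differ slightly from values obtained with three-figure table areas.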


Problem: X is a normal variate with mean 30 and standard deviation 5. Find the probabilities that (i) 26 ≤ X ≤ 40 (ii) X ≥ 45 (iii) |X - 30| > 5

Solution:

(i) Here it is given that µ = 30 and σ = 5.

So first we calculate the standard normal variate $z = \frac{X - \mu}{\sigma}$.

For X = 26, $z = \frac{26 - 30}{5} = -0.8$, and for X = 40, $z = \frac{40 - 30}{5} = 2$.

Now P(-0.8 ≤ Z ≤ 2) = P(-0.8 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 2)

= P(0 ≤ Z ≤ 0.8) + P(0 ≤ Z ≤ 2) = 0.2881 + 0.4772 = 0.7653

(ii) P(X ≥ 45)

$z = \frac{45 - 30}{5} = 3$

P(Z ≥ 3) = 0.5 - P(0 ≤ Z ≤ 3)

= 0.5 - 0.4986 = 0.0014

(iii) P(|X - 30| > 5) can be written as

P(|X - 30| > 5) = 1 - P(|X - 30| ≤ 5)

= 1 - P(25 ≤ X ≤ 35)

Converting to the standard normal variate we get

= 1 - P(-1 ≤ Z ≤ 1) = 1 - 2·P(0 ≤ Z ≤ 1)

= 1 - 2 × 0.3413 = 1 - 0.6826 = 0.3174
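The three probabilities can be cross-checked without tables. The sketch below computes the normal distribution function from the error function; normal_cdf is an illustrative helper name, and the results may differ from the table-based answers in the fourth decimal place.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal variate with mean mu and s.d. sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

mu, sigma = 30.0, 5.0
p_between = normal_cdf(40, mu, sigma) - normal_cdf(26, mu, sigma)   # (i)
p_upper = 1.0 - normal_cdf(45, mu, sigma)                            # (ii)
p_outside = 1.0 - (normal_cdf(35, mu, sigma) - normal_cdf(25, mu, sigma))  # (iii)
print(round(p_between, 4), round(p_upper, 4), round(p_outside, 4))
```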


9. Correlation and Regression

Comparison of correlation and regression

Correlation: Measure of the linear relationship between two variables.
Regression: Measure of the average relationship between two or more variables.

Correlation: The correlation coefficient of X with Y and of Y with X is the same, and is calculated by Karl Pearson's correlation formula
$r_{xy} = \frac{cov(x,y)}{\sigma_x \sigma_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
Regression: The regression lines of Y on X and of X on Y are different and are given by
$(y - \bar{y}) = b_{yx}(x - \bar{x})$ for y on x
$(x - \bar{x}) = b_{xy}(y - \bar{y})$ for x on y
where $b_{yx} = \frac{cov(x,y)}{\sigma_x^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b_{xy} = \frac{cov(x,y)}{\sigma_y^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2}$.
Here $b_{yx}$ and $b_{xy}$ are the regression coefficients; each shows the change in the dependent variable for a unit change in the independent variable.

Correlation: The correlation coefficient lies between -1 and +1.
Regression: A regression coefficient lies between -∞ and +∞.

Correlation: The correlation coefficient is independent of change of origin and scale.
Regression: A regression coefficient is independent of change of origin but not of scale.

Correlation: Test of significance of the correlation coefficient (null hypothesis ρ = 0)
$t_{cal} = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$ at (n-2) d.f.
Regression: Test of significance of the regression coefficient (null hypothesis $b_{yx} = 0$, $b_{xy} = 0$)
$t_{cal} = \frac{b_{yx}}{S.E.(b_{yx})} = \frac{b_{yx}}{\sqrt{\left[\sum (y - \bar{y})^2 - \frac{\left(\sum (x - \bar{x})(y - \bar{y})\right)^2}{\sum (x - \bar{x})^2}\right] \Big/ \left[(n-2)\sum (x - \bar{x})^2\right]}}$ based on (n-2) d.f. (for y on x)
$t_{cal} = \frac{b_{xy}}{S.E.(b_{xy})} = \frac{b_{xy}}{\sqrt{\left[\sum (x - \bar{x})^2 - \frac{\left(\sum (x - \bar{x})(y - \bar{y})\right)^2}{\sum (y - \bar{y})^2}\right] \Big/ \left[(n-2)\sum (y - \bar{y})^2\right]}}$ based on (n-2) d.f. (for x on y)

Relationship between correlation and regression coefficients

• The correlation coefficient is the geometric mean of the regression coefficients: $r = \pm\sqrt{b_{yx} \cdot b_{xy}}$

• If one of the regression coefficients is greater than unity, the other must be less than unity: since $r^2 \le 1$, $b_{yx} \cdot b_{xy} \le 1$

• The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r if r > 0: $\frac{1}{2}(b_{yx} + b_{xy}) \ge r$

• If r = 0, θ = π/2: if the two variables are uncorrelated, the lines of regression become perpendicular to each other.

• If r = ±1, θ = 0 or π: the two lines of regression coincide with each other.

(θ is the angle between the two lines of regression.)


Spearman's Rank correlation: is used to estimate the correlation between two characters on the basis of the ranks of the individuals.

The formula for rank correlation is $\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$,

where $d_i$ is the difference between the two ranks of the i-th individual and n is the number of individuals.

• If two or more individuals receive the same rank, the arithmetic mean of the ranks they would otherwise occupy is assigned to each tied individual, and the next individual is given the rank it would have had without the tie. In this case a correction factor $\frac{1}{12}\sum (p^3 - p)$ is added to $\sum d_i^2$.

• Now $\rho = 1 - \frac{6\left[\sum d_i^2 + \frac{1}{12}\sum (p^3 - p)\right]}{n(n^2 - 1)}$, where p is the number of items sharing a common rank and the sum runs over every group of tied ranks.

• The limits of the rank correlation coefficient are -1 ≤ ρ ≤ +1.

Objective: Computation of correlation coefficient and the equations of the line of regression of Y on X and

X on Y and the estimation of the value of Y when the value of X is known and the value of X when the

value of Y is known.

Kinds of data: The following table relates to the stature (inches) of brother and sister from Pearson and Lee's sample of 1,401 families.

Family number   1    2    3    4    5    6    7    8    9    10   11
Brother, X      71   68   66   67   70   71   70   73   72   65   66
Sister, Y       69   64   65   63   65   62   65   64   66   59   62

Solution: Calculation of correlation coefficient

Family number   Brother X   Sister Y   (X_i - X̄)   (Y_i - Ȳ)   (X_i - X̄)²   (Y_i - Ȳ)²   (X_i - X̄)(Y_i - Ȳ)
1               71          69         2            5            4             25            10
2               68          64         -1           0            1             0             0
3               66          65         -3           1            9             1             -3
4               67          63         -2           -1           4             1             2
5               70          65         1            1            1             1             1
6               71          62         2            -2           4             4             -4
7               70          65         1            1            1             1             1
8               73          64         4            0            16            0             0
9               72          66         3            2            9             4             6
10              65          59         -4           -5           16            25            20
11              66          62         -3           -2           9             4             6
Total           759         704                                  74            66            39

First we calculate the means $\bar{X} = \frac{759}{11} = 69$, $\bar{Y} = \frac{704}{11} = 64$.

Then by using the formula of the correlation coefficient, we have

$r_{xy} = \frac{39}{\sqrt{74 \times 66}} = 0.558$

Test of significance of correlation coefficient

$t = \frac{0.558 \times \sqrt{11 - 2}}{\sqrt{1 - 0.558^2}} = 2.018$

The table value of t at 9 d.f. at the 5% level of significance is 2.26.

Since the calculated t is less than the tabulated t, the null hypothesis is accepted: the correlation coefficient is not significant.

Calculation of Regression Coefficient

Using the formulae for the regression coefficients of Y on X and of X on Y, we have

$b_{yx} = \frac{39}{74} = 0.527$, $b_{xy} = \frac{39}{66} = 0.591$

Hence, the equation of the regression line of Y on X is

Y - 64 = 0.527 (X - 69)

and the equation of the regression line of X on Y is

X - 69 = 0.591 (Y - 64)

Estimation of Y when X is given:

To calculate the value of Y for X = 70, we put X = 70 in the line of regression of Y on X and get Y - 64 = 0.527 × (70 - 69).

Hence Y = 64 + 0.527 × 1 = 64.527

Estimation of X when Y is given:

To calculate the value of X for Y = 62, we put Y = 62 in the line of regression of X on Y and get X - 69 = 0.591 × (62 - 64).

Hence X = 69 + 0.591 × (-2) = 67.82

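Test of significance of regression coefficient of y on x

$t_{yx} = \frac{0.527}{\sqrt{\left[66 - \frac{(39)^2}{74}\right] \Big/ \left[(11 - 2) \times 74\right]}} = \frac{0.527}{0.261} = 2.017$

Test of significance of regression coefficient of x on y

$t_{xy} = \frac{0.591}{\sqrt{\left[74 - \frac{(39)^2}{66}\right] \Big/ \left[(11 - 2) \times 66\right]}} = \frac{0.591}{0.292} \approx 2.02$

Since the calculated values of t are less than the tabulated t (2.26 at 9 d.f.), the regression coefficients are not significant.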
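The whole calculation for the brother and sister data can be sketched as below. Variable names are illustrative. In a two-variable problem the t statistics for r, b_yx and b_xy are algebraically identical, which gives a useful check on the arithmetic.

```python
from math import sqrt

x = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]   # brother's stature
y = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]   # sister's stature
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares and products about the means
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)          # correlation coefficient
b_yx = sxy / sxx                   # regression coefficient of y on x
b_xy = sxy / syy                   # regression coefficient of x on y

# t statistics, each with n - 2 degrees of freedom
t_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)
t_b_yx = b_yx / sqrt((syy - sxy ** 2 / sxx) / ((n - 2) * sxx))
t_b_xy = b_xy / sqrt((sxx - sxy ** 2 / syy) / ((n - 2) * syy))

print(round(r, 3), round(b_yx, 3), round(b_xy, 3))
print(round(t_r, 3), round(t_b_yx, 3), round(t_b_xy, 3))
```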

Objective: Computation of rank correlation coefficient.

Kinds of data: The following two series of data are given. The rank correlation is to be computed by the method of rank differences (after ranking them in proper order).

X    75    88    92    70    60    80    81    50
Y    120   124   150   115   110   140   142   100

Solution:

X     Y     Rank of x (R_x)   Rank of y (R_y)   d_i = R_x - R_y   d_i²
75    120   5                 5                 0                 0
88    124   2                 4                 -2                4
92    150   1                 1                 0                 0
70    115   6                 6                 0                 0
60    110   7                 7                 0                 0
80    140   4                 3                 1                 1
81    142   3                 2                 1                 1
50    100   8                 8                 0                 0
Total                                                             Σd_i² = 6

Coefficient of rank correlation $= 1 - \frac{6 \times 6}{8 \times (8^2 - 1)} = 1 - \frac{36}{8 \times 63} = 1 - 0.0714 = 0.929$

Thus, there is a high positive correlation.
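A small routine can assign the ranks and apply the formula. The sketch below is an illustration, not the manual's own procedure: it ranks the largest value as 1, gives tied values the mean of their ranks, and adds the correction (p³ - p)/12 for every group of p tied values, so the same function covers both the untied and the tied case.

```python
def average_ranks(values):
    """Rank values with 1 = largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2      # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def tie_correction(values):
    """Sum of (p^3 - p) / 12 over every group of p equal values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum((p ** 3 - p) / 12 for p in counts.values())

def spearman_rho(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d2 += tie_correction(x) + tie_correction(y)   # zero when there are no ties
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [75, 88, 92, 70, 60, 80, 81, 50]
y = [120, 124, 150, 115, 110, 140, 142, 100]
rho = spearman_rho(x, y)
print(round(rho, 3))  # 0.929
```

Ranking from the smallest value instead of the largest would give the same coefficient, since only the rank differences enter the formula.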

Objective: Computation of rank correlation coefficient when ranks are being repeated.

Kinds of data: The following two series of data are given on the variables X and Y .

X 12 15 18 20 16 15 18 22 15 21 18 15

Y 10 18 19 12 15 19 17 19 16 14 13 17

Solution:

X     Y     Rank of x (R_x)   Rank of y (R_y)   d_i = R_x - R_y   d_i²
12    10    12                12                0                 0.00
15    18    9.5               4                 5.5               30.25
18    19    5                 2                 3                 9.00
20    12    3                 11                -8                64.00
16    15    7                 8                 -1                1.00
15    19    9.5               2                 7.5               56.25
18    17    5                 5.5               -0.5              0.25
22    19    1                 2                 -1                1.00
15    16    9.5               7                 2.5               6.25
21    14    2                 9                 -7                49.00
18    13    5                 10                -5                25.00
15    17    9.5               5.5               4                 16.00
Total                                                             Σd_i² = 258

$r_{rank} = 1 - \frac{6\left[\sum d^2 + \frac{p^3 - p}{12} + \frac{p^3 - p}{12} + \dots\right]}{n(n^2 - 1)}$

In the X series, 18 is repeated 3 times after the third rank, so the common rank assigned to each of these values is the average (4 + 5 + 6)/3 = 5. The next value, 16, gets the next rank, 7. The value 15 occurs four times, and the common rank assigned to it is 9.5, the arithmetic mean of 8, 9, 10 and 11. The next number, 12, gets rank 12. Similarly, in the Y series the value 19 occurs thrice and the common rank assigned to each is 2, i.e. the arithmetic mean of 1, 2 and 3; the value 17 occurs twice and gets the common rank 5.5, the mean of 5 and 6; ranks are assigned to the other values accordingly. Each rank of y is entered against its own pair (x, y). So here, in the X series p = 3 for 18 and p = 4 for 15, and in the Y series p = 3 for 19 and p = 2 for 17.

$\rho = 1 - \frac{6\left[258 + \frac{4^3 - 4}{12} + \frac{3^3 - 3}{12} + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12}\right]}{12(12^2 - 1)} = 1 - \frac{6 \times 267.5}{1716} = 1 - 0.9353 = 0.0647$

This is a very poor rank correlation.


Objective: Testing the significance of an observed sample correlation coefficient and determination of 95%

and 99% confidence limits.

Kinds of data: In a random sample of 27 pairs of observations from a bivariate population the correlation

coefficient is obtained as 0.6.

Solution: (i) Set up the null and alternative hypothesis as

Ho: 𝜌=0

H1:𝜌 ≠0

(ii) Choose a suitable level of significance 𝛼=0.05 (say)

(iii) Compute the 't' statistic

$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$ with (n-2) d.f.

$t = \frac{0.6\sqrt{27 - 2}}{\sqrt{1 - 0.36}} = \frac{3}{\sqrt{0.64}} = 3.75$ with 25 degrees of freedom.

(iv) The tabulated value of t for a two-tailed test at the 5% level of significance with 25 degrees of freedom is 2.06.

(v) Since the calculated value of t (3.75) is greater than the tabulated value of t, H0 is rejected at the 5% level of significance. Hence it is concluded that the variables are correlated in the population.

95% confidence limits for ρ (population correlation coefficient)

$r \pm 1.96 \times \text{Standard error} = r \pm 1.96 \times \frac{1 - r^2}{\sqrt{n}}$

$= 0.6 \pm 1.96 \times \frac{1 - 0.36}{\sqrt{27}}$

= 0.6 ± 0.2414

= 0.3586 to 0.8414

Likewise, 99% confidence limits for ρ are

$r \pm 2.58 \times \text{Standard error} = r \pm 2.58 \times \frac{1 - r^2}{\sqrt{n}}$

$= 0.6 \pm 2.58 \times \frac{1 - 0.36}{\sqrt{27}}$

= 0.6 ± 0.3178

= 0.2822 to 0.9178
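The test statistic and both sets of limits follow from two formulas, sketched below with the large-sample standard error (1 - r²)/√n used in the solution. The helper name confidence_limits is illustrative.

```python
from math import sqrt

r, n = 0.6, 27

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Large-sample standard error of r
se = (1 - r ** 2) / sqrt(n)

def confidence_limits(z):
    """Limits r - z*SE and r + z*SE for a normal deviate z."""
    return r - z * se, r + z * se

lo95, hi95 = confidence_limits(1.96)
lo99, hi99 = confidence_limits(2.58)
print(round(t, 2), round(lo95, 4), round(hi95, 4), round(lo99, 4), round(hi99, 4))
```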


Objective: Computation of correlation coefficient for the bivariate frequency distribution.

Kinds of data: The following data provides according to age the frequency of marks obtained by

100 students in an intelligence test.

Age in years

Marks 18 19 20 21 Total

10-20 4 2 2 - 8

20-30 5 4 6 4 19

30-40 6 8 10 11 35

40-50 4 4 6 8 22

50-60 - 2 4 4 10

60-70 - 2 3 1 6

Total 19 22 31 28 100

Solution: Let X denote age in years and Y the mid-value of the marks class.

Let U = X - 19 and V = (Y - 35)/10, and prepare the table as shown below. Each cell shows the frequency f(u,v), with the product u·v·f(u,v) in brackets.

v       Mid-value y   Marks    u = -1 (age 18)   u = 0 (age 19)   u = 1 (age 20)   u = 2 (age 21)   f(v)   v f(v)   v² f(v)   Σ u v f(u,v)
-2      15            10-20    4 (8)             2 (0)            2 (-4)           -                8      -16      32        4
-1      25            20-30    5 (5)             4 (0)            6 (-6)           4 (-8)           19     -19      19        -9
0       35            30-40    6 (0)             8 (0)            10 (0)           11 (0)           35     0        0         0
1       45            40-50    4 (-4)            4 (0)            6 (6)            8 (16)           22     22       22        18
2       55            50-60    -                 2 (0)            4 (8)            4 (16)           10     20       40        24
3       65            60-70    -                 2 (0)            3 (9)            1 (6)            6      18       54        15
Total f(u)                     19                22               31               28               100    25       167       52
u f(u)                         -19               0                31               56               68
u² f(u)                        19                0                31               112              162
Σ u v f(u,v)                   9                 0                13               30               52

Mean of u $= \frac{\sum u f(u)}{\sum f(u)} = \frac{68}{100} = 0.68$. Mean of v $= \frac{\sum v f(v)}{\sum f(v)} = \frac{25}{100} = 0.25$

Cov(u,v) $= \frac{52}{100} - 0.68 \times 0.25 = 0.35$

Variance of u $= \frac{162}{100} - (0.68)^2 = 1.1576$

Variance of v $= \frac{167}{100} - (0.25)^2 = 1.6075$

$r(u,v) = \frac{0.35}{\sqrt{1.1576 \times 1.6075}} = 0.257$

Since the correlation coefficient is independent of change of origin and scale,

r(x,y) = r(u,v) = 0.257


10. Multiple and Partial Correlation

Multiple Correlation Coefficient: gives the maximum degree of linear relationship between two or more independent variables and a single dependent variable. It is a measure of how well a given variable can be predicted using a linear function of a set of other variables. The multiple correlation coefficient can never be negative; its square, R², represents the proportion of the variance of the dependent variable explained by all the independent variables.

The multiple correlation coefficient is denoted by $R_{1.23}$, where X1 is the dependent variable and X2, X3 are the independent variables, and the formula is given by

$R_{1.23}^2 = 1 - \frac{\omega}{\omega_{11}} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$, where $0 \le R_{1.23} \le 1$,

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$ and $\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix}$

F-test for significance of multiple correlation coefficient:

The null and alternative hypotheses are H0: $R_{1.23} = 0$, H1: $R_{1.23} \ne 0$.

The test statistic $F = \frac{R^2}{1 - R^2} \times \frac{n - k - 1}{k}$ follows the F distribution with (k, n-k-1) degrees of freedom, where k is the number of independent variables. If the calculated value of F is less than the tabulated value, the null hypothesis is accepted.

Partial Correlation Coefficient: measures the degree of association between two random variables X1 and

X2 after the effect of a set of controlling random variables, say X3, is removed. For example, if we have

economic data on the consumption, income, and wealth of various individuals and we wish to see if there is a

relationship between consumption and income, failing to control for wealth when computing a correlation

coefficient between consumption and income would give a misleading result, since income might be

numerically related to wealth which in turn might be numerically related to consumption; a measured

correlation between consumption and income might actually be contaminated by these other correlations.

The use of a partial correlation avoids this problem. Like the correlation coefficient, the partial correlation

coefficient takes on a value in the range from –1 to 1.Partial correlation coefficient helps in deciding whether

to include or not an additional independent variable in regression analysis.

The correlation coefficient between X1 and X2 after the linear effect of X3 on each of them has been eliminated is called the partial correlation coefficient, and the formula is given by

$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$, $-1 \le r_{12.3} \le +1$

t-test for significance of partial correlation coefficient:

The null and alternative hypotheses are H0: $\rho_{12.3} = 0$, H1: $\rho_{12.3} \ne 0$.

The test statistic $t = \frac{r_{12.3}\sqrt{n - k - 2}}{\sqrt{1 - r_{12.3}^2}}$ follows the t distribution with (n-k-2) degrees of freedom, where k is the number of variables from which the effect of the common variable is eliminated. If the calculated value of t is less than the tabulated value, the null hypothesis is accepted.

Relation between multiple, total and partial correlations: $1 - R_{1.23}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)$


Objective: Computation of multiple correlation coefficients from the tri-variate population.

Kinds of data: Given r12=0.60,r13=0.70 and r23=0.65

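Solution: We know that $R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}}$

$= \sqrt{\frac{0.6^2 + 0.7^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.65^2}} = \sqrt{\frac{0.36 + 0.49 - 0.546}{0.5775}} = \sqrt{0.526} = 0.725$

Thus, $R_{1.23} = 0.725$

Likewise, $R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}}$

$= \sqrt{\frac{0.7^2 + 0.65^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.60^2}} = \sqrt{\frac{0.49 + 0.4225 - 0.546}{0.64}} = \sqrt{0.573} = 0.757$

Lastly, $R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{13}^2}}$

$= \sqrt{\frac{0.6^2 + 0.65^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.70^2}} = \sqrt{\frac{0.36 + 0.4225 - 0.546}{0.51}} = \sqrt{0.464} = 0.681$

In this way, we have all the values of the multiple correlation coefficients.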

Objective: Testing the significance of an observed multiple correlation coefficient.

Kinds of data: The value of $R_{1.23}$ = 0.725 from a tri-variate distribution for n = 25.

Solution: (i) Set up the null and alternative hypotheses as

H0: $R_{1.23} = 0$, H1: $R_{1.23} \ne 0$

(ii) Choose a suitable level of significance, α = 0.05 (say)

(iii) Compute the 'F' statistic

$F = \frac{R^2}{1 - R^2} \times \frac{n - k - 1}{k}$, which follows the F distribution with (k, n-k-1) degrees of freedom, where k is the number of independent variables.

Thus, $F = \frac{0.725^2}{1 - 0.725^2} \times \frac{25 - 2 - 1}{2} = \frac{0.5256}{0.4744} \times \frac{22}{2} = 1.1079 \times 11 = 12.187$

The tabulated value of F at (2, 22) d.f. at the 5% level of significance is 3.44. The calculated value of the F statistic is greater than the tabulated value. Thus, we reject the null hypothesis and conclude that the multiple correlation $R_{1.23}$ is not zero, i.e. the observed multiple correlation coefficient is significant in the population.
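The multiple correlation coefficients and the F statistic can be sketched as follows, assuming the three-variable formula given earlier; multiple_r is an illustrative helper name. The code keeps R unrounded, so F comes out slightly higher than the hand value, which uses R rounded to 0.725.

```python
from math import sqrt

def multiple_r(r_a, r_b, r_between):
    """Multiple correlation of one variable on two others.
    r_a, r_b: its correlations with the two predictors;
    r_between: correlation between the two predictors."""
    return sqrt((r_a**2 + r_b**2 - 2 * r_a * r_b * r_between) / (1 - r_between**2))

r12, r13, r23 = 0.60, 0.70, 0.65
R1_23 = multiple_r(r12, r13, r23)   # X1 on X2 and X3
R2_13 = multiple_r(r12, r23, r13)   # X2 on X1 and X3
R3_12 = multiple_r(r13, r23, r12)   # X3 on X1 and X2

# F test of H0: R_1.23 = 0 with (k, n - k - 1) degrees of freedom
n, k = 25, 2
F = R1_23**2 / (1 - R1_23**2) * (n - k - 1) / k

print(round(R1_23, 3), round(R2_13, 3), round(R3_12, 3), round(F, 2))
```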


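Objective: Calculation of $r_{23.1}$, $b_{12.3}$, $b_{13.2}$ and $\sigma_{1.23}$

Kinds of data: In a tri-variate distribution $\sigma_1 = 2$, $\sigma_2 = \sigma_3 = 3$, $r_{12} = 0.7$, $r_{23} = r_{31} = 0.5$

Solution: (i) We know that $r_{23.1} = \frac{r_{23} - r_{21} r_{31}}{\sqrt{(1 - r_{21}^2)(1 - r_{31}^2)}}$

By putting in the values we get $r_{23.1} = \frac{0.5 - 0.7 \times 0.5}{\sqrt{(1 - 0.7^2)(1 - 0.5^2)}} = 0.2425$

(ii) $b_{12.3} = r_{12.3} \times \frac{\sigma_{1.3}}{\sigma_{2.3}}$ and $b_{13.2} = r_{13.2} \times \frac{\sigma_{1.2}}{\sigma_{3.2}}$

Now, $r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.7 - 0.5 \times 0.5}{\sqrt{(1 - 0.5^2)(1 - 0.5^2)}} = 0.6$

Similarly, $r_{13.2} = \frac{r_{13} - r_{12} r_{32}}{\sqrt{(1 - r_{12}^2)(1 - r_{32}^2)}} = \frac{0.5 - 0.7 \times 0.5}{\sqrt{(1 - 0.7^2)(1 - 0.5^2)}} = 0.2425$

$\sigma_{1.3} = \sigma_1\sqrt{1 - r_{13}^2} = 2\sqrt{1 - 0.5^2} = 1.7320$

$\sigma_{2.3} = \sigma_2\sqrt{1 - r_{23}^2} = 3\sqrt{1 - 0.5^2} = 2.5980$

$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} = 2\sqrt{1 - 0.7^2} = 1.4282$

$\sigma_{3.2} = \sigma_3\sqrt{1 - r_{32}^2} = 3\sqrt{1 - 0.5^2} = 2.5980$

By putting in these values we get

$b_{12.3} = 0.6 \times \frac{1.7320}{2.5980} = 0.4$ and $b_{13.2} = 0.2425 \times \frac{1.4282}{2.5980} = 0.1333$

(iii) We know that $\sigma_{1.23 \dots n}^2 = \sigma_1^2 \frac{\omega}{\omega_{11}}$, so $\sigma_{1.23}^2 = \sigma_1^2 \frac{\omega}{\omega_{11}}$, where

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.7 & 0.5 \\ 0.7 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{vmatrix}$

and $\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.5 \\ 0.5 & 1 \end{vmatrix}$

Expanding the determinants we get

$\omega = 1(1 \times 1 - 0.5 \times 0.5) - 0.7(0.7 \times 1 - 0.5 \times 0.5) + 0.5(0.7 \times 0.5 - 1 \times 0.5) = 0.36$

and $\omega_{11} = (1 \times 1 - 0.5 \times 0.5) = 0.75$

Hence $\sigma_{1.23}^2 = \sigma_1^2 \frac{\omega}{\omega_{11}} = 2^2 \times \frac{0.36}{0.75} = 1.92$,

then $\sigma_{1.23} = 1.386$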


Objective: Testing the significance of an observed partial correlation coefficient.

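Kinds of data: The value of $r_{12.3}$ = -0.60 from a trivariate distribution for n = 29.

Solution: (i) Set up the null and alternative hypotheses as

H0: $\rho_{12.3} = 0$, H1: $\rho_{12.3} \ne 0$

(ii) Choose a suitable level of significance, α = 0.05 (say)

(iii) Compute the 't' statistic

$t = \frac{r_{12.3}\sqrt{n - k - 2}}{\sqrt{1 - r_{12.3}^2}} = \frac{-0.60 \times \sqrt{29 - 2 - 2}}{\sqrt{1 - (0.6)^2}} = -3.75$

The table value of t at the 5% level of significance at 25 degrees of freedom is 2.06. As the absolute value of the computed t (3.75) is greater than the table value of t, we reject the null hypothesis. Thus the observed partial correlation coefficient is significantly different from zero in the population.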


11. Multiple Regression Equation and Analysis Technique

Equation of plane of regression:

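The equation of the plane of regression of $X_i$ on the remaining variables $X_j$ (j ≠ i; j = 1, 2, ..., n) is given by

$\frac{X_1}{\sigma_1}\omega_{i1} + \frac{X_2}{\sigma_2}\omega_{i2} + \dots + \frac{X_i}{\sigma_i}\omega_{ii} + \dots + \frac{X_n}{\sigma_n}\omega_{in} = 0$; i = 1, 2, ..., n

where $\omega_{ij}$ is the cofactor of the element in the i-th row and j-th column of ω.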

Objective: Determination of regression equation when the values are given and also determine the value of

X3 when X1 =30 and X2 =45

Kinds of data: For a trivariate distribution

$\bar{X}_1 = 40$, $\bar{X}_2 = 70$, $\bar{X}_3 = 90$

$\sigma_1 = 3$, $\sigma_2 = 6$, $\sigma_3 = 7$

$r_{12} = 0.4$, $r_{23} = 0.5$, $r_{13} = 0.6$

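Solution: The equation of the plane of regression of X1 on X2 and X3 is given by

$\frac{X_1}{\sigma_1}\omega_{11} + \frac{X_2}{\sigma_2}\omega_{12} + \frac{X_3}{\sigma_3}\omega_{13} = 0$

Since the plane of regression passes through the means, it can be written as

$\frac{(X_1 - \bar{X}_1)}{\sigma_1}\omega_{11} + \frac{(X_2 - \bar{X}_2)}{\sigma_2}\omega_{12} + \frac{(X_3 - \bar{X}_3)}{\sigma_3}\omega_{13} = 0$

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.4 & 0.6 \\ 0.4 & 1 & 0.5 \\ 0.6 & 0.5 & 1 \end{vmatrix}$ and

$\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.5 \\ 0.5 & 1 \end{vmatrix} = (1 \times 1 - 0.5 \times 0.5) = 0.75$

$\omega_{12} = -\begin{vmatrix} r_{21} & r_{23} \\ r_{31} & 1 \end{vmatrix} = -\begin{vmatrix} 0.4 & 0.5 \\ 0.6 & 1 \end{vmatrix} = -(0.4 \times 1 - 0.5 \times 0.6) = -0.10$

$\omega_{13} = \begin{vmatrix} r_{21} & 1 \\ r_{31} & r_{32} \end{vmatrix} = \begin{vmatrix} 0.4 & 1 \\ 0.6 & 0.5 \end{vmatrix} = (0.4 \times 0.5 - 1 \times 0.6) = -0.4$

By putting these values in the regression equation we get

$\frac{(X_1 - 40)}{3} \times 0.75 + \frac{(X_2 - 70)}{6} \times (-0.10) + \frac{(X_3 - 90)}{7} \times (-0.4) = 0$

This is the required plane of regression.

By putting X1 = 30 and X2 = 45 in the equation we get

$\frac{(30 - 40)}{3} \times 0.75 + \frac{(45 - 70)}{6} \times (-0.10) + \frac{(X_3 - 90)}{7} \times (-0.4) = 0$

i.e. $-2.50 + 0.417 - (X_3 - 90) \times 0.057 = 0$

Hence $-2.50 + 0.417 = (X_3 - 90) \times 0.057$,

then $X_3 = \frac{-2.083}{0.057} + 90 = 53.46$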

Objective: Find the multiple regression equation of X1 on X2 and X3 .

Kinds of data: The data relating to three variables are given below :

X1 4 6 7 9 13 15

X2 15 12 8 6 4 3

X3 30 24 20 14 10 4

Solution: The regression equation of X1 on X2 and X3 is given by

$X_1 = a + b_{12.3} X_2 + b_{13.2} X_3 + e$

The values of the constants a, $b_{12.3}$ and $b_{13.2}$ are obtained by solving the following three normal equations.

$\sum X_1 = na + b_{12.3}\sum X_2 + b_{13.2}\sum X_3$

$\sum X_1 X_2 = a\sum X_2 + b_{12.3}\sum X_2^2 + b_{13.2}\sum X_2 X_3$

$\sum X_1 X_3 = a\sum X_3 + b_{12.3}\sum X_2 X_3 + b_{13.2}\sum X_3^2$

The sums of squares and sums of products required in the above equations are obtained as below:

X1      X2    X3    X1X2   X1X3   X2X3   X2²   X3²    X1²
4       15    30    60     120    450    225   900    16
6       12    24    72     144    288    144   576    36
7       8     20    56     140    160    64    400    49
9       6     14    54     126    84     36    196    81
13      4     10    52     130    40     16    100    169
15      3     4     45     60     12     9     16     225
Total
54      48    102   339    720    1034   494   2188   576

Substituting the values in the normal equations, we get

6a + 48 b12.3 + 102 b13.2 = 54 (i)

48a + 494 b12.3 + 1034 b13.2 = 339 (ii)

102a + 1034 b12.3 + 2188 b13.2 = 720 (iii)

Multiplying equation (i) by 8, we get 48a + 384 b12.3 + 816 b13.2 = 432 (iv)

Subtracting equation (iv) from (ii), we get 110 b12.3 + 218 b13.2 = -93 (v)

Multiplying equation (i) by 17, we get 102a + 816 b12.3 + 1734 b13.2 = 918 (vi)

Subtracting equation (vi) from equation (iii), we get 218 b12.3 + 454 b13.2 = -198 (vii)

Multiplying equation (v) by 109, we get 11990 b12.3 + 23762 b13.2 = -10137 (viii)

Multiplying equation (vii) by 55, we get 11990 b12.3 + 24970 b13.2 = -10890 (ix)

Subtracting equation (viii) from equation (ix), we get 1208 b13.2 = -753

Hence, b13.2 = -753/1208 = -0.623

Substituting the value of b13.2 in equation (v), we get 110 b12.3 = -93 - 218(-0.623)

110 b12.3 = 135.814 - 93

Hence, b12.3 = 42.814/110 = 0.389

Substituting the values of b12.3 and b13.2 in equation (i), we get

6a + 48(0.389) + 102(-0.623) = 54

6a = 54 + 63.546 - 18.672 = 98.874

Hence, a = 16.479

Thus the required regression equation is X1 = 16.479 + 0.389 X2 - 0.623 X3
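The elimination above can be cross-checked by building the three normal equations from the data and solving them numerically. This is a sketch using plain Gauss-Jordan elimination, not the manual's own method; the names dot and aug are illustrative. Because it carries full precision, the constants may differ from the hand values in the third decimal place.

```python
x1 = [4, 6, 7, 9, 13, 15]
x2 = [15, 12, 8, 6, 4, 3]
x3 = [30, 24, 20, 14, 10, 4]
n = len(x1)

def dot(u, v):
    """Sum of products of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

ones = [1] * n
columns = [ones, x2, x3]

# Augmented matrix [A | rhs] of the normal equations for a, b12.3, b13.2
aug = [[dot(ci, cj) for cj in columns] + [dot(ci, x1)] for ci in columns]

# Gauss-Jordan elimination with partial pivoting
size = len(aug)
for col in range(size):
    pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
    aug[col], aug[pivot] = aug[pivot], aug[col]
    for row in range(size):
        if row != col:
            factor = aug[row][col] / aug[col][col]
            aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]

a, b12_3, b13_2 = (aug[i][size] / aug[i][i] for i in range(size))
print(round(a, 3), round(b12_3, 3), round(b13_2, 3))
```

The same routine extends to more predictors by adding further columns to the list named columns.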

Multiple Regression Analysis: It is a technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors. For example, the yield of rice per acre depends upon quality of seed, fertility of soil, fertilizer used, temperature and rainfall. If we want to study the joint effect of all these variables on rice yield, we can use this technique. An additional advantage of this technique is that it also enables us to study the individual influence of these variables on yield.

In general, the multiple regression equation of Y on X1, X2, ..., Xk is given by:

Y = b0 + b1 X1 + b2 X2 + ... + bk Xk

Here b0 is the intercept, and b1, b2, b3, ..., bk are analogous to the slope in the linear regression equation and are also called regression coefficients. They can be interpreted in the same way as a slope. Thus if bi = 2.5, Y will increase by 2.5 units if Xi increases by 1 unit, the other variables being held constant.

ANOVA for Multiple Regression: is similar to ANOVA for simple linear regression except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model.

Analysis of variance table for simple regression analysis

Source of variation   Degrees of freedom   Sum of squares                            Mean sum of squares          Fcal       Ftab (5%) at (source and error) d.f.
Model                 1                    SSM = $\sum (\hat{Y}_i - \bar{Y})^2$      MSM = SSM/d.f. of model      MSM/MSE
Error                 n-2                  SSE = $\sum (Y_i - \hat{Y}_i)^2$          MSE = SSE/d.f. of error
Total                 n-1                  $\sum (Y_i - \bar{Y})^2$

In simple regression we test the null hypothesis that β1 = 0, and the test statistic F = MSM/MSE has an F distribution with d.f. (1, n-2).

In multiple regression analysis with p explanatory variables, the model degrees of freedom are equal to p, the error d.f. are equal to n-p-1 and the total d.f. are equal to n-1.

Analysis of variance table for multiple regression analysis

Source of variation   Degrees of freedom   Sum of squares                            Mean sum of squares          Fcal       Ftab (5%) at (source and error) d.f.
Model                 p                    SSM = $\sum (\hat{Y}_i - \bar{Y})^2$      MSM = SSM/d.f. of model      MSM/MSE
Error                 n-p-1                SSE = $\sum (Y_i - \hat{Y}_i)^2$          MSE = SSE/d.f. of error
Total                 n-1                  $\sum (Y_i - \bar{Y})^2$

In multiple regression we test the null hypothesis that β1 = β2 = ... = βp = 0, and the test statistic F = MSM/MSE has an F distribution with d.f. (p, n-p-1).

Test of significance of model: The appropriateness of the multiple regression model as a whole can be

tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at

least one of the X's. The suitability of model for prediction is examined by the coefficient of determination

(R2). R2 always lies between 0 and 1.The closer R2 is to 1, the better is the model and its prediction.

t-test: To test whether the independent variables individually influence the dependent variable significantly or not, we test the null hypothesis that the relevant regression coefficient is zero. This can be done using the t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences Y significantly while controlling for the other explanatory variables.

Objective: Fitting a linear regression with two predictors by the matrix approach and determining the value of R2.

Kinds of data: Twenty-five observations of pounds of steam used per month in a plant, along with the average atmospheric temperature in degrees Fahrenheit and the number of operating days in the month.

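| Observation number | Pounds of steam used per month (Y) | Average atmospheric temperature (X1) | Number of operating days in the month (X2) |
|---|---|---|---|
| 1 | 10.98 | 35.3 | 20 |
| 2 | 11.13 | 29.7 | 20 |
| 3 | 12.51 | 30.8 | 23 |
| 4 | 8.4 | 58.8 | 20 |
| 5 | 9.27 | 61.4 | 21 |
| 6 | 8.73 | 71.3 | 22 |
| 7 | 6.36 | 74.4 | 11 |
| 8 | 8.5 | 76.7 | 23 |
| 9 | 7.82 | 70.7 | 21 |
| 10 | 9.14 | 57.5 | 20 |
| 11 | 8.24 | 46.4 | 20 |
| 12 | 12.19 | 28.9 | 21 |
| 13 | 11.88 | 28.1 | 21 |
| 14 | 9.57 | 39.1 | 19 |
| 15 | 10.94 | 46.8 | 23 |
| 16 | 9.58 | 48.5 | 20 |
| 17 | 10.09 | 59.3 | 22 |
| 18 | 8.11 | 70 | 22 |
| 19 | 6.83 | 70 | 11 |
| 20 | 8.88 | 74.5 | 23 |
| 21 | 7.68 | 72.1 | 20 |
| 22 | 8.47 | 58.1 | 21 |
| 23 | 8.86 | 44.6 | 20 |
| 24 | 10.36 | 33.4 | 20 |
| 25 | 11.08 | 28.6 | 22 |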

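Solution: We know that the least squares estimates of $\beta_0$, $\beta_1$ and $\beta_2$ are given by

$$b = (X'X)^{-1}X'Y$$

where b is the vector of estimates of the elements of $\beta$, provided that $X'X$ is a non-singular matrix. Here

$$Y = \begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ 8.4 \\ \vdots \\ 10.36 \\ 11.08 \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & 35.3 & 20 \\ 1 & 29.7 & 20 \\ 1 & 30.8 & 23 \\ 1 & 58.8 & 20 \\ \vdots & \vdots & \vdots \\ 1 & 33.4 & 20 \\ 1 & 28.6 & 22 \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \vdots \\ \varepsilon_{24} \\ \varepsilon_{25} \end{bmatrix}$$

where Y is a (25x1) vector, X is a (25x3) matrix, $\beta$ is a (3x1) vector and $\varepsilon$ is a (25x1) vector.

Now,

$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
= \left( \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 1 & 35.3 & 20 \\ 1 & 29.7 & 20 \\ 1 & 30.8 & 23 \\ \vdots & \vdots & \vdots \\ 1 & 28.6 & 22 \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ \vdots \\ 11.08 \end{bmatrix}$$

$$b = \begin{bmatrix} 25 & 1315 & 506 \\ 1315 & 76323.42 & 26353.30 \\ 506 & 26353.30 & 10460 \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ \vdots \\ 11.08 \end{bmatrix}$$

Thus,

$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 9.1266 \\ -0.0724 \\ 0.2029 \end{bmatrix}$$

Thus, the fitted least squares equation is

$$\hat{Y} = 9.1266 - 0.0724 X_1 + 0.2029 X_2$$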

After the regression equation is estimated, we find the estimated value of Y for each pair (X1, X2) and then find the total, regression and residual sums of squares.


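| Observation number | Pounds of steam used per month (Y) | Estimated value $\hat{Y}$ | $(Y_i - \bar{Y})^2$ | $(\hat{Y}_i - \bar{Y})^2$ | $(Y_i - \hat{Y}_i)^2$ |
|---|---|---|---|---|---|
| 1 | 10.98 | 10.63 | 2.42 | 1.45 | 0.12 |
| 2 | 11.13 | 11.03 | 2.91 | 2.59 | 0.01 |
| 3 | 12.51 | 11.56 | 9.52 | 4.58 | 0.90 |
| 4 | 8.4 | 8.93 | 1.05 | 0.25 | 0.28 |
| 5 | 9.27 | 8.94 | 0.02 | 0.23 | 0.11 |
| 6 | 8.73 | 8.43 | 0.48 | 0.99 | 0.09 |
| 7 | 6.36 | 5.97 | 9.39 | 11.92 | 0.15 |
| 8 | 8.5 | 8.24 | 0.85 | 1.40 | 0.07 |
| 9 | 7.82 | 8.27 | 2.57 | 1.33 | 0.20 |
| 10 | 9.14 | 9.02 | 0.08 | 0.16 | 0.01 |
| 11 | 8.24 | 9.83 | 1.40 | 0.16 | 2.51 |
| 12 | 12.19 | 11.30 | 7.65 | 3.50 | 0.80 |
| 13 | 11.88 | 11.35 | 6.03 | 3.72 | 0.28 |
| 14 | 9.57 | 10.15 | 0.02 | 0.53 | 0.34 |
| 15 | 10.94 | 10.40 | 2.30 | 0.96 | 0.29 |
| 16 | 9.58 | 9.67 | 0.02 | 0.06 | 0.01 |
| 17 | 10.09 | 9.30 | 0.44 | 0.02 | 0.63 |
| 18 | 8.11 | 8.52 | 1.73 | 0.81 | 0.17 |
| 19 | 6.83 | 6.29 | 6.73 | 9.82 | 0.29 |
| 20 | 8.88 | 8.40 | 0.30 | 1.05 | 0.23 |
| 21 | 7.68 | 7.96 | 3.04 | 2.13 | 0.08 |
| 22 | 8.47 | 9.18 | 0.91 | 0.06 | 0.51 |
| 23 | 8.86 | 9.96 | 0.32 | 0.28 | 1.20 |
| 24 | 10.36 | 10.77 | 0.88 | 1.80 | 0.17 |
| 25 | 11.08 | 11.52 | 2.74 | 4.39 | 0.19 |
| Total | | | Total SS = 63.84 | Regression/Model SS = 54.21 | Error SS = 9.63 |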

The ANOVA table for this regression model is given below:

| Source of variation | d.f. | SS | MS | Fcal | Ftab(2,22) |
|---|---|---|---|---|---|
| Regression | 2 | 54.21 | 27.105 | 61.92 | 3.44 |
| Residual | 22 | 9.63 | 0.4377 | | |
| Total | 24 | 63.84 | | | |

We can also split the S.S. due to regression into the S.S. due to X1 and the S.S. due to X2.

For this purpose we fit the simple line of regression of Y on X1 as $\hat{Y} = 13.62 - 0.08 X_1$.

Now again we calculate the SS due to X1 = $\sum(\hat{Y}_i - \bar{Y})^2$ = 45.79

| Source of variation | d.f. | SS | MS | Fcal | Ftab(1,22) |
|---|---|---|---|---|---|
| Regression due to X1 | 1 | 45.79 | 45.79 | 104.6 | 4.30 |
| Regression due to X2 | 1 | 8.42 | 8.42 | 19.23 | 4.30 |
| Residual | 22 | 9.63 | 0.4377 | | |
| Total | 24 | 63.84 | | | |


Here, since the calculated F values 104.6 and 19.23 exceed Ftab(1,22,0.95)=4.30, the predictors X1 and X2 are both found to be significant. It is to be noted that the sum of squares due to X1 (45.79) is obtained by fitting the single variable X1, and the sum of squares due to X2 is obtained as the regression SS minus the SS due to X1 (54.21 - 45.79 = 8.42).

To check the suitability of the model for prediction, the coefficient of determination R2 (the square of the multiple correlation coefficient) is calculated.

$$R^2 = \frac{\text{sum of squares due to regression}}{\text{total (corrected) sum of squares}} = \frac{45.79 + 8.42}{63.84} = 0.8491 = 84.91\%$$

Hence it is found that 84.91 % of the variability in Y is explained by the explanatory variables in the fitted model.
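The matrix calculation above can be checked numerically. The following is a minimal sketch in plain Python, assuming the 25 observations exactly as tabulated; the helper `solve` and all variable names are illustrative and not part of the manual. It solves the normal equations (X'X)b = X'Y rather than forming the inverse explicitly, which gives the same b.

```python
# Steam data from the table above: (Y, X1, X2) for the 25 observations.
data = [
    (10.98, 35.3, 20), (11.13, 29.7, 20), (12.51, 30.8, 23), (8.40, 58.8, 20),
    (9.27, 61.4, 21), (8.73, 71.3, 22), (6.36, 74.4, 11), (8.50, 76.7, 23),
    (7.82, 70.7, 21), (9.14, 57.5, 20), (8.24, 46.4, 20), (12.19, 28.9, 21),
    (11.88, 28.1, 21), (9.57, 39.1, 19), (10.94, 46.8, 23), (9.58, 48.5, 20),
    (10.09, 59.3, 22), (8.11, 70.0, 22), (6.83, 70.0, 11), (8.88, 74.5, 23),
    (7.68, 72.1, 20), (8.47, 58.1, 21), (8.86, 44.6, 20), (10.36, 33.4, 20),
    (11.08, 28.6, 22),
]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


X = [(1.0, x1, x2) for _, x1, x2 in data]  # design matrix with intercept column
Y = [y for y, _, _ in data]

# Normal equations: (X'X) b = X'Y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(X, Y)) for i in range(3)]
b = solve(xtx, xty)

y_bar = sum(Y) / len(Y)
fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
ss_total = sum((y - y_bar) ** 2 for y in Y)
ss_error = sum((y - f) ** 2 for y, f in zip(Y, fitted))
ss_model = ss_total - ss_error  # valid because the model has an intercept
r_squared = ss_model / ss_total

print("b =", [round(v, 4) for v in b])
print("Total SS, Model SS, Error SS =",
      round(ss_total, 2), round(ss_model, 2), round(ss_error, 2))
print("R2 =", round(r_squared, 4))
```

Small differences from the hand calculation in the last decimal place are to be expected, because the table rounds the fitted values to two decimals.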

t-test for regression coefficient $b_{yx_1} = -0.0724$

$$t_{cal} = \frac{b_{yx}}{S.E.\ of\ b_{yx}} = \frac{b_{yx}}{\sqrt{\dfrac{\sum(y-\bar{y})^2 - \dfrac{\left(\sum(x-\bar{x})(y-\bar{y})\right)^2}{\sum(x-\bar{x})^2}}{(n-2)\sum(x-\bar{x})^2}}}$$

based on (n-2) d.f. (for y on x).

Here the null hypothesis is $b_{yx_1} = 0$.

$\sum(y-\bar{y})^2 = 63.82$, $\sum(x-\bar{x})^2 = 7154.42$, $\left(\sum(x-\bar{x})(y-\bar{y})\right)^2 = 326187.2$

$$t_{cal} = \frac{-0.0724}{\sqrt{\dfrac{63.82 - \dfrac{326187.2}{7154.42}}{(25-2) \times 7154.42}}} = -6.87$$

The table value of t at 23 d.f. and 5 % level of significance is 2.068. Since the calculated absolute value of t is greater than the tabulated value, the null hypothesis is rejected.

t-test for regression coefficient $b_{yx_2} = 0.2029$

$$t_{cal} = \frac{b_{yx}}{S.E.\ of\ b_{yx}} = \frac{b_{yx}}{\sqrt{\dfrac{\sum(y-\bar{y})^2 - \dfrac{\left(\sum(x-\bar{x})(y-\bar{y})\right)^2}{\sum(x-\bar{x})^2}}{(n-2)\sum(x-\bar{x})^2}}}$$

based on (n-2) d.f. (for y on x).

Here the null hypothesis is $b_{yx_2} = 0$.

$\sum(y-\bar{y})^2 = 63.82$, $\sum(x-\bar{x})^2 = 218.56$, $\left(\sum(x-\bar{x})(y-\bar{y})\right)^2 = 4008.916$

$$t_{cal} = \frac{0.2029}{\sqrt{\dfrac{63.82 - \dfrac{4008.916}{218.56}}{(25-2) \times 218.56}}} = 2.13$$

The table value of t at 23 d.f. and 5 % level of significance is 2.068. Since the calculated absolute value of t is greater than the tabulated value, the null hypothesis is rejected.

Hence both the regression coefficients are found to be significant.


12. Simple and Stratified Random Sampling

Simple Random Sampling (SRS): It is the process of selecting a sample from given population according

to some law of chance in which each unit of population has an equal and independent chance of being

included in the sample.

SRSWR (With Replacement): A selection process in which the unit selected at any draw is returned to the population before the next draw is known as simple random sampling with replacement. In this case the number of possible samples of size n selected from a population of size N is $N^n$. The samples selected through this method need not be distinct, since the same unit may be drawn more than once.

SRSWOR (Without Replacement): A selection process in which the unit selected at any draw is not returned to the population before the next draw, so that the next unit is selected from the remaining population, is known as simple random sampling without replacement. In this case the number of possible samples of size n selected from a population of size N is $^{N}C_n$. The samples selected through this method are distinct.

Note: The sample mean is an unbiased estimate of the population mean in both SRSWR and SRSWOR, whereas the sample mean square $s^2$ is an unbiased estimate of the population mean square $S^2$ in the case of SRSWOR only.

SRSWOR is more efficient than SRSWR because $V(\bar{y}_n)_{SRSWOR} < V(\bar{y}_n)_{SRSWR}$.

Stratified Random Sampling: When the population is heterogeneous and we wish every section of the population to be represented in the sample, we divide the whole population into a number of strata such that the strata differ as much as possible from one another, whereas the units within each stratum are as homogeneous as possible. This technique of selecting a representative sample of the whole population is known as stratified random sampling.

In stratified random sampling, the allocation of the sample size to the different strata is based on the stratum sizes ($N_i$), the variability within the strata ($S_i^2$) and the cost of surveying per sampling unit in each stratum ($C_i$).

Methods for allocation of sample size to different strata are

Equal Allocation: $n_i = \dfrac{n}{k}$

Proportional Allocation: $n_i = \dfrac{n N_i}{N}$

Neyman Allocation: $n_i = n \cdot \dfrac{N_i S_i}{\sum N_i S_i}$

Optimum Allocation (based on cost): $n_i = n \cdot \dfrac{N_i S_i / \sqrt{C_i}}{\sum N_i S_i / \sqrt{C_i}}$


Objective: To show, with the help of a hypothetical population, that in SRSWOR the sample mean and the sample mean square are unbiased estimates of the population mean and the population mean square, and to determine the variance and S.E. of the sample mean.

Kinds of data: The data relate to a hypothetical population whose units are 1, 2, 3, 4 and 5. Draw samples of size n=3 using SRSWOR.

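Solution: The number of all possible samples of size n=3 under SRSWOR is given by $^{N}C_n = {}^{5}C_3 = 10$.

Compute the mean of each sample, $\bar{y}_n = \dfrac{\sum y_i}{n}$, and the sample mean square, $s^2 = \dfrac{1}{n-1}\sum(y_i - \bar{y}_n)^2$.

Similarly, the mean of the population is $\bar{y}_N = \dfrac{\sum y_i}{N} = \dfrac{15}{5} = 3$ and the population mean square is $S^2 = \dfrac{1}{N-1}\sum(y_i - \bar{y}_N)^2$:

$$S^2 = \frac{1}{4}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right] = \frac{10}{4} = 2.5$$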

The 10 possible samples are given below in the table.

| S.No. | Possible samples | Sample mean $\bar{y}_n$ | Sample mean square ($s^2$) | Sampling error ($\bar{y}_n - \bar{y}_N$) |
|---|---|---|---|---|
| 1 | 1,2,3 | 2.0 | 1.0 | -1.0 |
| 2 | 2,3,4 | 3.0 | 1.0 | 0.0 |
| 3 | 3,4,5 | 4.0 | 1.0 | 1.0 |
| 4 | 4,5,1 | 3.33 | 4.33 | 0.33 |
| 5 | 5,1,2 | 2.67 | 4.33 | -0.33 |
| 6 | 1,3,4 | 2.67 | 2.33 | -0.33 |
| 7 | 2,4,5 | 3.67 | 2.33 | 0.67 |
| 8 | 3,5,1 | 3.0 | 4.00 | 0.0 |
| 9 | 4,1,2 | 2.33 | 2.33 | -0.67 |
| 10 | 5,2,3 | 3.33 | 2.33 | 0.33 |
| Total | | 30.0 | 24.98 ≈ 25 | 0.00 |

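Now we have to check whether $E(\bar{y}_n) = \bar{y}_N$ and $E(s^2) = S^2$:

$$E(\bar{y}_n) = \frac{\sum \bar{y}_n}{^{N}C_n} = \frac{30}{10} = 3 = \bar{y}_N \quad \text{and} \quad E(s^2) = \frac{\sum s_i^2}{^{N}C_n} = \frac{25}{10} = 2.5 = S^2$$

Hence we can say that the sample mean $\bar{y}_n$ and the sample mean square $s^2$ are unbiased estimators of the population mean $\bar{y}_N$ and the population mean square $S^2$ respectively.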

The variance of the sample mean in SRSWOR is given by

$$V(\bar{y}_n)_{SRSWOR} = \frac{N-n}{Nn} S^2 = \frac{5-3}{5 \times 3} \times 2.5 = 0.33$$

We can verify this variance directly from the 10 sample means:

$$V(\bar{y}_n) = \frac{\sum(\bar{y}_n - E(\bar{y}_n))^2}{^{N}C_n} = \frac{1}{^{N}C_n}\left[\sum \bar{y}_n^2 - \frac{(\sum \bar{y}_n)^2}{^{N}C_n}\right] = \frac{1}{10}[93.33 - 90] = 0.33$$

This shows that $V(\bar{y}_n)_{SRSWOR}$ is correct.

Standard Error of $\bar{y}_n$ = $\sqrt{V(\bar{y}_n)} = \sqrt{0.33} = 0.57$

We can also compare the two variances, one in SRSWOR and the other in SRSWR (for the same n=3):

$$V(\bar{y}_n)_{SRSWR} = \frac{N-1}{Nn} S^2 = \frac{5-1}{5 \times 3} \times 2.5 = 0.67$$

Since $V(\bar{y}_n)_{SRSWOR} < V(\bar{y}_n)_{SRSWR}$, we can say that SRSWOR is more efficient than SRSWR.
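The enumeration above is small enough to reproduce by machine. The sketch below, with illustrative helper names `mean` and `mean_square`, lists all SRSWOR samples of size 3 from the population 1 to 5 and checks the two unbiasedness results and the variance formula using exact fractions.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5]
n = 3


def mean(values):
    """Arithmetic mean as an exact fraction."""
    return Fraction(sum(values), len(values))


def mean_square(values):
    """Mean square with divisor (number of values - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


samples = list(combinations(population, n))  # every SRSWOR sample of size n
sample_means = [mean(s) for s in samples]
sample_s2 = [mean_square(s) for s in samples]

pop_mean = mean(population)        # population mean
pop_S2 = mean_square(population)   # population mean square S^2

expected_mean = mean(sample_means)  # E(sample mean) over all samples
expected_s2 = mean(sample_s2)       # E(s^2) over all samples
var_of_mean = sum((m - expected_mean) ** 2 for m in sample_means) / len(samples)

N = len(population)
formula_var = Fraction(N - n, N * n) * pop_S2  # (N - n) / (N n) * S^2

print("number of samples:", len(samples))
print("E(mean) =", expected_mean, " population mean =", pop_mean)
print("E(s2)   =", expected_s2, " S2 =", pop_S2)
print("V(mean) by enumeration =", var_of_mean, " by formula =", formula_var)
```

Using `Fraction` avoids the rounding that produces 24.98 instead of 25 in the hand-computed total.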


Objective: To show, with the help of a hypothetical example, that in simple random sampling with replacement (SRSWR) the sample mean is an unbiased estimator of the population mean whereas the sample mean square is a biased estimator of the population mean square, and to determine the variance and standard error (S.E.) of the sample mean.

Kinds of data: Consider a finite population of size N=5 whose sampling units have the values (1,2,3,4,5). Enumerate all possible samples of size n=2 using SRSWR. Find the estimate of $V(\bar{y}_n)$ based on the 9th sample.

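Solution: The number of all possible samples of size n=2 under SRSWR is given by $N^n = 5^2 = 25$.

Compute the mean of each sample, $\bar{y}_n = \dfrac{\sum y_i}{n}$, and the sample mean square, $s^2 = \dfrac{1}{n-1}\sum(y_i - \bar{y}_n)^2$.

Similarly, the mean of the population is $\bar{y}_N = \dfrac{\sum y_i}{N} = \dfrac{15}{5} = 3$ and the population mean square is $S^2 = \dfrac{1}{N-1}\sum(y_i - \bar{y}_N)^2$:

$$S^2 = \frac{1}{4}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right] = \frac{10}{4} = 2.5$$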

| S.No. | Possible samples | Sample mean $\bar{y}_n$ | Sample mean square ($s^2$) | Sampling error ($\bar{y}_n - \bar{y}_N$) |
|---|---|---|---|---|
| 1 | 1,2 | 1.5 | 0.50 | -1.5 |
| 2 | 1,3 | 2.0 | 2.00 | -1.0 |
| 3 | 1,4 | 2.5 | 4.50 | -0.5 |
| 4 | 1,5 | 3.0 | 8.00 | 0.0 |
| 5 | 2,3 | 2.5 | 0.50 | -0.5 |
| 6 | 2,4 | 3.0 | 2.00 | 0.0 |
| 7 | 2,5 | 3.5 | 4.50 | 0.5 |
| 8 | 3,4 | 3.5 | 0.50 | 0.5 |
| 9 | 3,5 | 4.0 | 2.00 | 1.0 |
| 10 | 4,5 | 4.5 | 0.50 | 1.5 |
| 11 | 2,1 | 1.5 | 0.50 | -1.5 |
| 12 | 3,1 | 2.0 | 2.00 | -1.0 |
| 13 | 4,1 | 2.5 | 4.50 | -0.5 |
| 14 | 5,1 | 3.0 | 8.00 | 0.0 |
| 15 | 3,2 | 2.5 | 0.50 | -0.5 |
| 16 | 4,2 | 3.0 | 2.00 | 0.0 |
| 17 | 5,2 | 3.5 | 4.50 | 0.5 |
| 18 | 4,3 | 3.5 | 0.50 | 0.5 |
| 19 | 5,3 | 4.0 | 2.00 | 1.0 |
| 20 | 5,4 | 4.5 | 0.50 | 1.5 |
| 21 | 1,1 | 1.0 | 0.00 | -2.0 |
| 22 | 2,2 | 2.0 | 0.00 | -1.0 |
| 23 | 3,3 | 3.0 | 0.00 | 0.0 |
| 24 | 4,4 | 4.0 | 0.00 | 1.0 |
| 25 | 5,5 | 5.0 | 0.00 | 2.0 |
| Total | | 75.0 | 50.00 | |

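Now we have to check whether $E(\bar{y}_n) = \bar{y}_N$ and $E(s^2) = S^2$:

$$E(\bar{y}_n) = \frac{\sum \bar{y}_n}{N^n} = \frac{75}{25} = 3 = \bar{y}_N \quad \text{and} \quad E(s^2) = \frac{\sum s_i^2}{N^n} = \frac{50}{25} = 2 \neq S^2$$

Hence we can say that the sample mean $\bar{y}_n$ is an unbiased estimate of the population mean, whereas the sample mean square $s^2$ is not an unbiased estimate of the population mean square $S^2$ in the case of SRSWR.

The variance of the sample mean in SRSWR is given by

$$V(\bar{y}_n) = \frac{\sigma^2}{n} = \frac{N-1}{Nn} S^2 = \frac{5-1}{5 \times 2} \times 2.5 = 1.0$$

Standard Error of $\bar{y}_n$ = $\sqrt{V(\bar{y}_n)} = \sqrt{1} = 1$

To find the estimate of $V(\bar{y}_n)$ based on the 9th sample, we replace $S^2$ by the sample value $s^2 = 2.0$:

$$\hat{V}(\bar{y}_n) = \frac{N-1}{Nn} s^2 = \frac{5-1}{5 \times 2} \times 2.0 = 0.8$$

Standard Error of $\bar{y}_n$ = $\sqrt{\hat{V}(\bar{y}_n)} = \sqrt{0.80} = 0.894$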
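The 25 ordered samples can likewise be enumerated by machine. This is a minimal sketch with illustrative names; it uses `itertools.product` to generate the with-replacement samples and confirms that E(s2) equals σ² = 2 rather than S² = 2.5.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]
n = 2
N = len(population)

pop_mean = Fraction(sum(population), N)
# sigma^2 uses divisor N, S^2 uses divisor N - 1
sigma2 = sum((y - pop_mean) ** 2 for y in population) / N
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)

samples = list(product(population, repeat=n))  # all N**n ordered SRSWR samples

means = [Fraction(sum(s), n) for s in samples]
s2_values = [sum((y - m) ** 2 for y in s) / (n - 1) for s, m in zip(samples, means)]

expected_mean = sum(means) / len(samples)
expected_s2 = sum(s2_values) / len(samples)
var_of_mean = sum((m - expected_mean) ** 2 for m in means) / len(samples)

print("number of samples:", len(samples))
print("E(mean) =", expected_mean, " population mean =", pop_mean)
print("E(s2) =", expected_s2, " sigma2 =", sigma2, " S2 =", S2)
print("V(mean) by enumeration =", var_of_mean, " sigma2/n =", sigma2 / n)
```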


Objective: Drawing of samples in stratified random sampling under different allocations, along with determination of their variances and standard errors.

Kinds of data: A hypothetical population of N=3000 is divided into four strata; their population sizes and standard deviations are given as follows:

| Strata | Size $N_i$ | SD $S_i$ |
|---|---|---|
| I | 400 | 4 |
| II | 600 | 6 |
| III | 900 | 9 |
| IV | 1100 | 12 |

A stratified random sample of size 800 is to be selected from the population.

Solution:

(i) In equal allocation the sample sizes allocated to the different strata are the same. Hence

$$n_i = \frac{n}{k} = \frac{\text{total sample size}}{\text{number of strata}} = \frac{800}{4} = 200 \text{ units from each stratum.}$$

(ii) In proportional allocation $n_i$ (i=1,2,3,4) is given by $n_i = n p_i$, where $p_i = N_i/N$, i.e. $n_i = \dfrac{n N_i}{N}$.

Hence $n_1 = \dfrac{800 \times 400}{3000} = 106.67 \approx 107$ units from stratum I

$n_2 = \dfrac{800 \times 600}{3000} = 160$ units from stratum II

$n_3 = \dfrac{800 \times 900}{3000} = 240$ units from stratum III

$n_4 = \dfrac{800 \times 1100}{3000} = 293.33 \approx 293$ units from stratum IV

Thus $n_1 + n_2 + n_3 + n_4 = 800$ constitutes the sample required from all the strata.

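(iii) The sample size in Neyman allocation is given by

$$n_i = n \cdot \frac{p_i S_i}{\sum p_i S_i} = n \cdot \frac{N_i S_i}{\sum N_i S_i}$$

Here $\sum N_i S_i = 400 \times 4 + 600 \times 6 + 900 \times 9 + 1100 \times 12 = 26500$.

Hence $n_1 = \dfrac{800 \times 400 \times 4}{26500} = 48$, $n_2 = \dfrac{800 \times 600 \times 6}{26500} = 109$,

$n_3 = \dfrac{800 \times 900 \times 9}{26500} = 245$, $n_4 = \dfrac{800 \times 1100 \times 12}{26500} = 398$.

In Neyman allocation, the sample sizes from the four strata are 48, 109, 245 and 398, which together constitute the required sample size of 800.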

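Variance of $\bar{y}_{st}$ in equal allocation:

$$V(\bar{y}_{st}) = \frac{k \sum p_i^2 S_i^2}{n} - \frac{\sum p_i S_i^2}{N}$$

From the above data $\sum p_i S_i = 8.83$, $\sum p_i S_i^2 = 86.43$ and $\sum p_i^2 S_i^2 = 28.37$.

$$V(\bar{y}_{st}) = \frac{4 \times 28.37}{800} - \frac{86.43}{3000} = 0.141 - 0.028 = 0.1130$$

Standard Error of $\bar{y}_{st}$ = $\sqrt{V(\bar{y}_{st})} = \sqrt{0.1130} = 0.336$

Variance of $\bar{y}_{st}$ in proportional allocation:

$$V(\bar{y}_{st}) = \left(\frac{1}{n} - \frac{1}{N}\right)\sum p_i S_i^2 = \left(\frac{1}{800} - \frac{1}{3000}\right) \times 86.43 = 0.0792$$

Standard Error of $(\bar{y}_{st})_{prop}$ = $\sqrt{V(\bar{y}_{st})} = \sqrt{0.0792} = 0.2815$

Variance of $\bar{y}_{st}$ in Neyman allocation:

$$V(\bar{y}_{st}) = \frac{\left(\sum p_i S_i\right)^2}{n} - \frac{\sum p_i S_i^2}{N} = \frac{8.83^2}{800} - \frac{86.43}{3000} = 0.0687$$

Standard Error of $(\bar{y}_{st})_{ney}$ = $\sqrt{V(\bar{y}_{st})} = \sqrt{0.0687} = 0.262$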
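The three allocations and their variances can be computed in one pass. The sketch below assumes the general stratified-sampling variance $V(\bar{y}_{st}) = \sum p_i^2 S_i^2 (1/n_i - 1/N_i)$, which reduces to each of the three formulas above when the corresponding $n_i$ is substituted; the function name `var_stratified` is illustrative. Unrounded allocations are used, so results may differ slightly from values based on rounded sample sizes.

```python
from math import sqrt

N_i = [400, 600, 900, 1100]  # stratum sizes
S_i = [4, 6, 9, 12]          # stratum standard deviations
n = 800                      # total sample size

N = sum(N_i)
k = len(N_i)
p_i = [Ni / N for Ni in N_i]  # stratum weights N_i / N

# Sample sizes under the three allocation rules (not yet rounded)
equal = [n / k for _ in N_i]
proportional = [n * p for p in p_i]
total_NS = sum(Ni * Si for Ni, Si in zip(N_i, S_i))
neyman = [n * Ni * Si / total_NS for Ni, Si in zip(N_i, S_i)]


def var_stratified(n_i):
    """V(y_st) = sum of p_i^2 * S_i^2 * (1/n_i - 1/N_i) for a given allocation."""
    return sum(p * p * s * s * (1 / ni - 1 / Ni)
               for p, s, ni, Ni in zip(p_i, S_i, n_i, N_i))


for name, alloc in [("equal", equal), ("proportional", proportional),
                    ("Neyman", neyman)]:
    v = var_stratified(alloc)
    print(name, [round(a, 2) for a in alloc],
          "V =", round(v, 4), "S.E. =", round(sqrt(v), 4))
```

The ordering Neyman < proportional < equal seen in the hand calculation should also appear in the printed variances.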


Objective: Determination of the estimates of the population mean and population total in stratified random sampling, and of the sample sizes under different allocations.

Kinds of data: A population of size N=4000 has been divided into five strata, with their sizes, S.D.'s and sample means as follows:

| Strata | Size $N_i$ | Sample mean $\bar{y}_{n_i}$ | Standard deviation $S_i$ |
|---|---|---|---|
| I | 300 | 8 | 2 |
| II | 600 | 10 | 4 |
| III | 900 | 15 | 6 |
| IV | 1200 | 18 | 8 |
| V | 1000 | 13 | 5 |

A stratified random sample of size 800 is to be drawn from the population.

Solution:

(i) In equal allocation the sample sizes allocated to the different strata are the same. Hence

$$n_i = \frac{n}{k} = \frac{\text{total sample size}}{\text{number of strata}} = \frac{800}{5} = 160 \text{ units from each stratum.}$$

(ii) In proportional allocation $n_i$ (i=1,2,3,4,5) is given by $n_i = n p_i$, where $p_i = N_i/N$, i.e. $n_i = \dfrac{n N_i}{N}$.

Hence $n_1 = \dfrac{800 \times 300}{4000} = 60$ units from stratum I

$n_2 = \dfrac{800 \times 600}{4000} = 120$ units from stratum II

$n_3 = \dfrac{800 \times 900}{4000} = 180$ units from stratum III

$n_4 = \dfrac{800 \times 1200}{4000} = 240$ units from stratum IV

$n_5 = \dfrac{800 \times 1000}{4000} = 200$ units from stratum V

Thus $n_1 + n_2 + n_3 + n_4 + n_5 = 800$ constitutes the sample required from all the strata.

(iii) The sample size in Neyman allocation is given by

$$n_i = n \cdot \frac{p_i S_i}{\sum p_i S_i} = n \cdot \frac{N_i S_i}{\sum N_i S_i}$$

Here $\sum N_i S_i = 300 \times 2 + 600 \times 4 + 900 \times 6 + 1200 \times 8 + 1000 \times 5 = 23000$.

Hence $n_1 = \dfrac{800 \times 300 \times 2}{23000} = 20.87 \approx 21$, $n_2 = \dfrac{800 \times 600 \times 4}{23000} = 83.48 \approx 83$,

$n_3 = \dfrac{800 \times 900 \times 6}{23000} = 187.83 \approx 188$, $n_4 = \dfrac{800 \times 1200 \times 8}{23000} = 333.91 \approx 334$,

$n_5 = \dfrac{800 \times 1000 \times 5}{23000} = 173.91 \approx 174$.

In Neyman allocation, the sample sizes from the five strata are 21, 83, 188, 334 and 174, which together constitute the required sample size of 800.

We know that an unbiased estimate of the population mean $\bar{Y}_N$ can be worked out as

$$\hat{\bar{Y}}_N = \bar{y}_{st} = \frac{1}{N}\sum N_i \bar{y}_{n_i} = \frac{300 \times 8 + 600 \times 10 + 900 \times 15 + 1200 \times 18 + 1000 \times 13}{4000} = 14.125$$

An appropriate estimator of the population total is given by

$$\hat{Y} = N \bar{y}_{st} = 4000 \times 14.125 = 56500$$


13. Ratio and Regression Estimator

Ratio Estimator: The ratio method of estimation is based on the information available on an auxiliary variable. When the correlation coefficient between the study variable and the auxiliary variable is positive and high, the ratio method of estimation can be used to estimate the population parameters of the study variable Y.

The ratio estimator is given by $\bar{y}_R = \dfrac{\bar{y}_n}{\bar{x}_n}\bar{X}_N$, where $\bar{y}_n$ and $\bar{x}_n$ are the sample means of y and x respectively and $\bar{X}_N$ is the population mean of x.

The ratio estimator is not, in general, an unbiased estimate of the population mean. To the first order of approximation its bias is zero only when $\rho C_y = C_x$, that is, when the regression line of y on x passes through the origin.

The bias of the ratio estimator to the first order of approximation is given by

$$B_1(\bar{y}_R) = \frac{N-n}{Nn}\bar{Y}_N\left(C_x^2 - \rho C_x C_y\right), \quad \text{where } C_x = \frac{S_X}{\bar{X}_N} \text{ and } C_y = \frac{S_Y}{\bar{Y}_N}$$

The variance of the ratio estimator is given by

$$V(\bar{y}_R) = \frac{N-n}{Nn}\left(S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right), \quad \text{where } R = \frac{\bar{Y}_N}{\bar{X}_N}$$

Regression Estimator: The ratio estimator is used if y and x are linearly related and the line of regression of y on x passes through the origin, i.e. the variate y is approximately a constant multiple of the auxiliary variate x. When the relationship is linear but the line does not pass through the origin, the regression estimator is used.

The regression estimator can be defined as $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n)$.

The regression estimator is also a biased estimate of the population mean.

The variance of the regression estimator is given by

$$V(\bar{y}_{lr}) = \frac{N-n}{Nn} s_y^2\left(1 - r_{xy}^2\right), \quad \text{where } r_{xy} = \frac{s_{xy}}{s_x s_y} \text{ and } b_{yx} = \frac{s_{xy}}{s_x^2},$$

$$s_x^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right] \quad \text{and} \quad s_{xy} = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$

• The regression estimator is at least as efficient as the ratio estimator: $V(\bar{y}_{lr}) \le V(\bar{y}_R)$.

• If the correlation coefficient is equal to zero, we should not apply the regression estimator.

Objective: Estimation of the average number of bullocks per acre using the ratio estimator, and showing that it is a biased estimator of the population mean. Compute the bias and variance along with the standard error.

Kinds of data: A bivariate population of size N=6 is given below:

| No. of bullocks (Y) | Farm size in acres (X) |
|---|---|
| 3 | 15 |
| 4 | 20 |
| 8 | 40 |
| 9 | 45 |
| 6 | 25 |
| 9 | 42 |

Enumerate all possible samples of size n=2 using SRSWOR.

Solution: Here it is given that N=6 and n=2.

The total number of possible samples of size n=2 is $^{N}C_n = {}^{6}C_2 = 15$.


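| S.No. | Possible samples ($y_i$) | Possible samples ($x_i$) | Sample mean $\bar{y}_n$ | Sample mean $\bar{x}_n$ | $\bar{y}_R = \frac{\bar{y}_n}{\bar{x}_n}\bar{X}_N$ | $\hat{R} = \frac{\bar{y}_n}{\bar{x}_n}$ |
|---|---|---|---|---|---|---|
| 1 | 3,4 | 15,20 | 3.5 | 17.5 | 6.233 | 0.20 |
| 2 | 3,8 | 15,40 | 5.5 | 27.5 | 6.233 | 0.20 |
| 3 | 3,9 | 15,45 | 6 | 30 | 6.233 | 0.20 |
| 4 | 3,6 | 15,25 | 4.5 | 20 | 7.013 | 0.225 |
| 5 | 3,9 | 15,42 | 6 | 28.5 | 6.561 | 0.211 |
| 6 | 4,8 | 20,40 | 6 | 30 | 6.233 | 0.20 |
| 7 | 4,9 | 20,45 | 6.5 | 32.5 | 6.233 | 0.20 |
| 8 | 4,6 | 20,25 | 5 | 22.5 | 6.930 | 0.222 |
| 9 | 4,9 | 20,42 | 6.5 | 31 | 6.535 | 0.210 |
| 10 | 8,9 | 40,45 | 8.5 | 42.5 | 6.233 | 0.20 |
| 11 | 8,6 | 40,25 | 7 | 32.5 | 6.713 | 0.215 |
| 12 | 8,9 | 40,42 | 8.5 | 41 | 6.461 | 0.207 |
| 13 | 9,6 | 45,25 | 7.5 | 35 | 6.679 | 0.214 |
| 14 | 9,9 | 45,42 | 9 | 43.5 | 6.448 | 0.207 |
| 15 | 6,9 | 25,42 | 7.5 | 33.5 | 6.978 | 0.224 |
| Total | | | | 467.5 | 97.716 | 3.135 |

$$\bar{X}_N = \frac{\sum X_i}{N} = \frac{187}{6} = 31.17, \quad \bar{Y}_N = \frac{\sum Y_i}{N} = \frac{39}{6} = 6.50$$

$$E(\bar{y}_R) = \frac{\sum \bar{y}_R}{^{N}C_n} = \frac{97.716}{15} = 6.514$$

Since $E(\bar{y}_R) \neq \bar{Y}_N$, the ratio estimator is not an unbiased estimator of the population mean $\bar{Y}_N$.

The bias of the ratio estimator to the first order of approximation is given by

$$B_1(\bar{y}_R) = \frac{N-n}{Nn}\bar{Y}_N\left(C_x^2 - \rho C_x C_y\right), \quad \text{where } C_x = \frac{S_X}{\bar{X}_N} \text{ and } C_y = \frac{S_Y}{\bar{Y}_N}$$

Now, $S_X = \sqrt{\dfrac{1}{5}\left(6639 - \dfrac{187^2}{6}\right)} = 12.73$ and $S_Y = 2.588$,

$C_x = 0.408$, $C_y = 0.397$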

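To find the value of the correlation coefficient $\rho$ between X and Y, we need the following values:

$\sum y = 39$, $\sum x = 187$, $\sum xy = 1378$, $\sum x^2 = 6639$, $\sum y^2 = 287$

$$\rho = \frac{\dfrac{1378}{6} - \dfrac{187 \times 39}{6 \times 6}}{\sqrt{\dfrac{6639}{6} - \left(\dfrac{187}{6}\right)^2} \times \sqrt{\dfrac{287}{6} - \left(\dfrac{39}{6}\right)^2}} = 0.9859$$

Hence

$$B_1(\bar{y}_R) = \frac{6-2}{6 \times 2} \times 6.50 \times \left(0.408^2 - 0.9859 \times 0.408 \times 0.397\right) = 0.014$$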

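The variance of the ratio estimator is given by

$$V(\bar{y}_R) = \frac{N-n}{Nn}\left(S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right), \quad \text{where } R = \frac{\bar{Y}_N}{\bar{X}_N} = 0.208$$

$$= \frac{6-2}{6 \times 2}\left(2.58^2 + 0.208^2 \times 12.73^2 - 2 \times 0.208 \times 0.9859 \times 12.73 \times 2.58\right) \approx 0.066$$

The above formula for the variance can be written in terms of the coefficients of variation as:

$$V(\bar{y}_R) = \frac{N-n}{Nn}\bar{Y}_N^2\left(C_y^2 + C_x^2 - 2\rho C_x C_y\right)$$

$$= \frac{6-2}{6 \times 2} \times 6.50^2 \times \left(0.397^2 + 0.408^2 - 2 \times 0.9859 \times 0.408 \times 0.397\right) = 0.0660$$

Both values of the variance of the ratio estimator are approximately equal.

Standard Error of the ratio estimator $\bar{y}_R$ = $\sqrt{V(\bar{y}_R)} = \sqrt{0.0660} = 0.257$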
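The expectation of the ratio estimator over all 15 samples can be reproduced with a short enumeration. This is a sketch with illustrative variable names; besides $E(\bar{y}_R)$ it also computes the exact bias and the exact mean square error about $\bar{Y}_N$, which can be compared with the first-order approximations above.

```python
from itertools import combinations

y = [3, 4, 8, 9, 6, 9]        # number of bullocks
x = [15, 20, 40, 45, 25, 42]  # farm size (acres)
n = 2

N = len(y)
X_bar = sum(x) / N  # population mean of x
Y_bar = sum(y) / N  # population mean of y

# Ratio estimate of the population mean of y for every SRSWOR sample
estimates = []
for idx in combinations(range(N), n):
    y_mean = sum(y[i] for i in idx) / n
    x_mean = sum(x[i] for i in idx) / n
    estimates.append(y_mean / x_mean * X_bar)

expected = sum(estimates) / len(estimates)  # E(ratio estimator)
exact_bias = expected - Y_bar
exact_mse = sum((e - Y_bar) ** 2 for e in estimates) / len(estimates)

print("number of samples:", len(estimates))
print("E(y_R) =", round(expected, 3), " Y_bar =", Y_bar)
print("exact bias =", round(exact_bias, 4))
print("exact MSE  =", round(exact_mse, 4))
```

With n as small as 2 the first-order formulas are only rough approximations, so the exact values need not match them closely.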


Objective: Determination of the regression estimator, its comparison with the ratio estimator, and its sampling variance and standard error.

Kinds of data: From a bivariate population of size N=85 with population means $\bar{X}_N = 6.55$ and $\bar{Y}_N = 8.55$, a random sample of size n=10 was drawn using the SRSWOR scheme and recorded as

Y: 11, 8, 7, 6, 4, 5, 3, 2, 9, 10

X: 10, 7, 6, 5, 3, 4, 2, 1, 8, 9

Solution: First we calculate $\bar{y}_n$ and $\bar{x}_n$ and note $\bar{X}_N$:

$$\bar{y}_n = \frac{65}{10} = 6.5, \quad \bar{x}_n = \frac{55}{10} = 5.5, \quad \text{and } \bar{X}_N = 6.55$$

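The regression estimator is $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n)$, where

$$b_{yx} = \frac{s_{xy}}{s_x^2}, \quad s_x^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right] \quad \text{and} \quad s_{xy} = \frac{1}{n-1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$

| | Y | X | YX | Y2 | X2 |
|---|---|---|---|---|---|
| | 11 | 10 | 110 | 121 | 100 |
| | 8 | 7 | 56 | 64 | 49 |
| | 7 | 6 | 42 | 49 | 36 |
| | 6 | 5 | 30 | 36 | 25 |
| | 4 | 3 | 12 | 16 | 9 |
| | 5 | 4 | 20 | 25 | 16 |
| | 3 | 2 | 6 | 9 | 4 |
| | 2 | 1 | 2 | 4 | 1 |
| | 9 | 8 | 72 | 81 | 64 |
| | 10 | 9 | 90 | 100 | 81 |
| Total | 65 | 55 | 440 | 505 | 385 |

By putting in the values we get $s_x^2 = \dfrac{1}{9}\left(385 - \dfrac{55^2}{10}\right) = \dfrac{82.5}{9} = 9.16$, $s_y^2 = 9.16$ and $s_{xy} = 9.16$.

So the value of $b_{yx} = \dfrac{s_{xy}}{s_x^2} = \dfrac{9.16}{9.16} = 1$.

Now the regression estimator is $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n) = 6.5 + 1 \times (6.55 - 5.5) = 7.55$.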

Estimate of the sampling variance of the estimator $\bar{y}_{lr}$:

$$V(\bar{y}_{lr}) = \frac{N-n}{Nn} s_y^2\left(1 - r_{xy}^2\right), \quad \text{where } r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{9.16}{9.16} = 1$$

By putting in the values we get

$$V(\bar{y}_{lr}) = \frac{85-10}{85 \times 10} \times 9.16 \times (1 - 1^2) = 0$$

Hence we can say that in the case of perfect positive correlation the variance of the linear regression estimator $\bar{y}_{lr}$ of the population mean $\bar{Y}_N$ will always be equal to zero.

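The variance of the ratio estimator is given by

$$V(\bar{y}_R) = \frac{N-n}{Nn}\left(s_y^2 + \hat{R}^2 s_x^2 - 2\hat{R}\, r_{xy}\, s_x s_y\right), \quad \text{where } \hat{R} = \frac{\bar{y}_n}{\bar{x}_n} = \frac{6.5}{5.5} = 1.18$$

$$V(\bar{y}_R) = \frac{85-10}{85 \times 10}\left(9.16 + 1.18^2 \times 9.16 - 2 \times 1.18 \times 1 \times 9.16\right) = 0.0262$$

The estimated sampling variance of the sample mean is given by

$$V(\bar{y}_n)_{SRSWOR} = \frac{N-n}{Nn} s_y^2 = \frac{85-10}{85 \times 10} \times 9.16 = 0.808$$

Here $V(\bar{y}_{lr}) < V(\bar{y}_R) < V(\bar{y}_n)_{SRSWOR}$.

This shows that the sample mean $\bar{y}_n$ is less efficient than the ratio and regression estimators.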
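The three estimated variances can be recomputed from the raw sample. This is a minimal sketch with illustrative variable names, assuming N=85 and a population mean of x equal to 6.55 as given in the data; it uses unrounded intermediate values, so the ratio-estimator variance differs slightly from the figure obtained with $\hat{R}$ rounded to 1.18.

```python
from math import sqrt

y = [11, 8, 7, 6, 4, 5, 3, 2, 9, 10]
x = [10, 7, 6, 5, 3, 4, 2, 1, 8, 9]
N = 85             # population size
X_bar_pop = 6.55   # known population mean of x

n = len(y)
y_bar = sum(y) / n
x_bar = sum(x) / n

# Sample mean squares and mean product (divisor n - 1)
s_x2 = (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)
s_y2 = (sum(v * v for v in y) - sum(y) ** 2 / n) / (n - 1)
s_xy = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)

b_yx = s_xy / s_x2
r_xy = s_xy / sqrt(s_x2 * s_y2)
fpc = (N - n) / (N * n)  # the factor (N - n) / (N n)

# Regression estimator and its estimated variance
y_lr = y_bar + b_yx * (X_bar_pop - x_bar)
var_lr = fpc * s_y2 * (1 - r_xy ** 2)

# Ratio estimator variance; note 2 * R * r * s_x * s_y equals 2 * R * s_xy
R_hat = y_bar / x_bar
var_ratio = fpc * (s_y2 + R_hat ** 2 * s_x2 - 2 * R_hat * s_xy)

# Variance of the plain sample mean under SRSWOR
var_mean = fpc * s_y2

print("b_yx =", b_yx, " r_xy =", round(r_xy, 4))
print("regression estimate =", round(y_lr, 2))
print("V(lr), V(ratio), V(mean) =",
      round(var_lr, 4), round(var_ratio, 4), round(var_mean, 4))
```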


14. Large Sample Test

For large values of the sample size n (usually n > 30), the sampling distributions of almost all statistics are approximately normal. To solve problems involving large samples, the normal variable X is transformed to a new variable $Z = \dfrac{X - \mu}{\sigma}$, which is known as the standard normal variate, with mean 0 and variance 1.

By the area property of the normal distribution, the standard normal variate almost certainly lies between -3 and +3.

Hence, if |Z| > 3, the null hypothesis will always be rejected.

If |Z| ≤ 3, the null hypothesis will be tested for possible rejection at a chosen level of significance.

For a two-tailed test

if |Z| < 1.96, H0 is accepted at the 5 % level of significance and

if |Z| < 2.58, H0 is accepted at the 1 % level of significance.

For a single-tailed (right or left) test

if |Z| < 1.645, H0 is accepted at the 5 % level of significance and

if |Z| < 2.33, H0 is accepted at the 1 % level of significance.

Objective: Testing the significance of a single mean based on a large sample.

Kinds of data: A random sample of 900 items has a mean of 3.4 cm and a standard deviation of 2.61 cm. The population mean and standard deviation are given as 3.25 cm and 2.62 cm respectively.

Solution: We set up the null hypothesis H0: µ = 3.25 and σ = 2.62, and the alternative H1: µ ≠ 3.25 and σ ≠ 2.62.

Under H0: $Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ follows N(0,1).

$$Z = \frac{3.4 - 3.25}{2.62/\sqrt{900}} = 1.72$$

Since 1.72 < 1.96, the null hypothesis is accepted at the 5 % level of significance.

Objective: Testing the significance of the difference between two means based on large samples.

Kinds of data: Random samples of 1000 and 2000 members have means of 67.5 and 68.0 inches respectively; the population standard deviation for both samples is 2.5 inches.

Solution: We set up the null hypothesis H0: µ1=µ2, and H1: µ1≠µ2.

Under H0: $Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ follows N(0,1).

Since $\sigma^2 = (2.5)^2 = 6.25$,

$$Z = \frac{67.5 - 68}{\sqrt{6.25\left(\dfrac{1}{1000} + \dfrac{1}{2000}\right)}} = -5.16$$

Hence Z = -5.16 and the absolute value of Z = 5.16.

The difference between these two sample means is highly significant, since the calculated value |Z| = 5.16 is larger than 3 and therefore larger than the tabulated value of Z (2.58) at the 1 % level of significance.


Objective: Testing the significance of a single proportion based on a large sample.

Kinds of data: In a sample of 1000 people in Maharashtra, 540 were found to be rice eaters and the rest were wheat eaters. Test whether rice and wheat are equally popular in this state at the 1 % level of significance.

Solution: We set up the null hypothesis H0: P=0.5, and H1: P≠0.5, so that Q = 1 - P = 0.5.

p = proportion of rice eaters in the sample = $\dfrac{540}{1000} = 0.54$, q = 1 - p = 0.46

Under H0:

$$Z = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \frac{0.54 - 0.5}{\sqrt{\dfrac{0.5 \times 0.5}{1000}}} = 2.53$$

Since Z = 2.53 < 2.58, the null hypothesis that rice and wheat eaters are equally popular in the state is accepted at the 1 % level of significance.

Objective: Testing the significance of the difference between two proportions based on large samples.

Kinds of data: Random samples of 400 men and 600 women were asked whether they would like a beautiful garden to be built near their residence. Two hundred men and 325 women were in favour of the proposal.

Solution: We set up the null hypothesis H0: P1=P2, and H1: P1≠P2.

p1 = proportion of men in favour of the proposal = 200/400 = 0.50

p2 = proportion of women in favour of the proposal = 325/600 = 0.542

Under H0: $Z = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ follows N(0,1),

where $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{200 + 325}{400 + 600} = 0.525$ and Q = 1 - P = 0.475.

Hence Z = -1.29 and the absolute value of Z = 1.29.

The difference between the opinions of men and women in favour of building a garden near their residence is not significant, since the calculated value |Z| = 1.29 is less than the tabulated value of Z (1.96) at the 5 % level of significance.

Objective: Testing the significance of the difference between two standard deviations based on large samples.

Kinds of data: Random samples of 1000 and 1200 members have standard deviations of 2.58 and 2.50 inches respectively.

Solution: We set up the null hypothesis H0: σ1=σ2, and H1: σ1≠σ2.

Under H0: $Z = \dfrac{S_1 - S_2}{\sqrt{\dfrac{S_1^2}{2n_1} + \dfrac{S_2^2}{2n_2}}}$ follows N(0,1).

$$Z = \frac{2.58 - 2.50}{\sqrt{\dfrac{2.58^2}{2 \times 1000} + \dfrac{2.50^2}{2 \times 1200}}} = 1.04$$

These two sample standard deviations do not differ significantly, since the calculated value of Z (1.04) is less than the tabulated Z (1.96) at the 5 % level of significance.
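The five large-sample Z statistics worked out in this chapter can be recomputed with a few small functions. This is a sketch with illustrative function names; it assumes that the standard error for a single proportion is evaluated under H0, i.e. as the square root of P(1-P)/n. Values are computed without intermediate rounding, so they may differ from hand calculations in the second decimal place.

```python
from math import sqrt


def z_one_mean(x_bar, mu, sigma, n):
    """Z for a single mean with known population S.D."""
    return (x_bar - mu) / (sigma / sqrt(n))


def z_two_means(x1, x2, sigma, n1, n2):
    """Z for two means with a common known population S.D."""
    return (x1 - x2) / sqrt(sigma ** 2 * (1 / n1 + 1 / n2))


def z_one_proportion(p, P, n):
    """Z for a single proportion, S.E. evaluated under H0."""
    return (p - P) / sqrt(P * (1 - P) / n)


def z_two_proportions(a1, n1, a2, n2):
    """Z for two proportions using the pooled proportion P."""
    p1, p2 = a1 / n1, a2 / n2
    P = (a1 + a2) / (n1 + n2)
    return (p1 - p2) / sqrt(P * (1 - P) * (1 / n1 + 1 / n2))


def z_two_sds(s1, n1, s2, n2):
    """Z for two standard deviations from large samples."""
    return (s1 - s2) / sqrt(s1 ** 2 / (2 * n1) + s2 ** 2 / (2 * n2))


z1 = z_one_mean(3.4, 3.25, 2.62, 900)
z2 = z_two_means(67.5, 68.0, 2.5, 1000, 2000)
z3 = z_one_proportion(0.54, 0.5, 1000)
z4 = z_two_proportions(200, 400, 325, 600)
z5 = z_two_sds(2.58, 1000, 2.50, 1200)

for label, z in [("single mean", z1), ("two means", z2),
                 ("single proportion", z3), ("two proportions", z4),
                 ("two S.D.s", z5)]:
    print(f"{label}: Z = {z:.3f}")
```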


15. Small Sample Test

Small Sample Test: If the sample size n is small, the distributions of various statistics, e.g. $z = \dfrac{\bar{x} - \mu}{S/\sqrt{n}}$, are far from normal. In such cases the small-sample tests developed by Student (W. S. Gosset) are used.

Student's t: Let $x_i$ (i = 1, 2, ..., n) be a random sample of size n from a normal population with mean $\mu$ and variance $\sigma^2$. Then Student's t is defined by the statistic $t = \dfrac{\bar{x} - \mu}{S/\sqrt{n}}$, where $\bar{x}$ is the sample mean and $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$ is an unbiased estimate of the population variance $\sigma^2$; it follows Student's t distribution with (n-1) degrees of freedom.

Assumptions of t test:

• The parent population from which the sample is drawn should be normal.

• The sample observations are independent or randomly selected.

• Population standard deviation 𝜎 is unknown.

t-test for single mean: It is used to test whether the sample has been drawn from a population with mean $\mu$, i.e. whether there is no significant difference between the sample mean $\bar{x}$ and the population mean $\mu$.

The test statistic $|t| = \dfrac{|\bar{x} - \mu_0|}{S/\sqrt{n}}$, where $\bar{x} = \dfrac{\sum x_i}{n}$ and $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$, follows Student's t distribution with (n-1) degrees of freedom.

Two independent sample t-test: It is used to test whether two samples differ significantly from one another in their means, or whether they may belong to the same population. The test statistic is

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \text{where } S^2 = \frac{\sum(X_i - \bar{X})^2 + \sum(Y_i - \bar{Y})^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2},$$

which follows Student's t distribution with (n1+n2-2) degrees of freedom.

Paired t-test: It is used when the sample sizes are equal and the two samples are not independent, the sample observations being paired together.

The test statistic is given by $|t| = \dfrac{|\bar{d}|}{S/\sqrt{n}}$, where $d_i = X_i - Y_i$ and $S^2 = \dfrac{1}{n-1}\sum(d_i - \bar{d})^2$; it follows Student's t distribution with (n-1) degrees of freedom.

Test of significance of null hypothesis: To test the null hypothesis, the calculated value of t is compared with the table value of t at a chosen level of significance, generally 5 %. If the calculated value of |t| > tabulated t, the null hypothesis is rejected; if the calculated value of |t| < tabulated t, the null hypothesis is accepted.

Note: If it is not clear whether the samples are paired or independent, this can be decided from the degrees of freedom of the tabulated value given in the question. For independent samples the d.f. of the table value is (n1 + n2 - 2), whereas for paired samples the d.f. is (n-1).


Objective: Testing the significance of the difference between the sample mean and the population mean.

Kinds of data: The data relate to the IQs of ten randomly selected boys: 70, 120, 110, 101, 88, 83, 95, 98, 107 and 100. The population mean is given as μ=100.

Solution: Here the null hypothesis is H0: The data are consistent with the assumption of a mean IQ of 100 in the population. First we find the sample mean

$$\bar{x} = \frac{70+120+110+101+88+83+95+98+107+100}{10} = 97.2$$

Now we calculate the sample variance $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$.

| $X_i$ | $X_i - \bar{x}$ | $(X_i - \bar{x})^2$ |
|---|---|---|
| 70 | -27.2 | 739.84 |
| 120 | 22.8 | 519.84 |
| 110 | 12.8 | 163.84 |
| 101 | 3.8 | 14.44 |
| 88 | -9.2 | 84.64 |
| 83 | -14.2 | 201.64 |
| 95 | -2.2 | 4.84 |
| 98 | 0.8 | 0.64 |
| 107 | 9.8 | 96.04 |
| 100 | 2.8 | 7.84 |
| Total | | 1833.60 |

By substituting the values in the formula we get $S^2 = \dfrac{1}{10-1} \times 1833.60 = 203.73$ and S = 14.27.

Now we apply the t statistic:

$$|t| = \frac{|\bar{x} - \mu_0|}{S/\sqrt{n}} = \frac{|97.2 - 100|}{14.27/\sqrt{10}} = \frac{2.8}{4.51} = 0.62$$

Here the absolute value of t = 0.62 and the tabulated value of t at 9 d.f. and the 0.05 level of significance = 2.26.

Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we conclude that the data are consistent with the assumption of a mean IQ of 100 in the population.
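The single-mean t statistic for the IQ data can be checked with a few lines of Python. This is a minimal sketch with illustrative names; the tabulated value 2.262 for 9 d.f. at the 5 % level (two-tailed) is entered by hand from a t table rather than computed.

```python
from math import sqrt

iq = [70, 120, 110, 101, 88, 83, 95, 98, 107, 100]
mu0 = 100  # hypothesised population mean

n = len(iq)
x_bar = sum(iq) / n
s2 = sum((v - x_bar) ** 2 for v in iq) / (n - 1)  # unbiased variance estimate
t_abs = abs(x_bar - mu0) / sqrt(s2 / n)

t_table = 2.262  # t at 9 d.f., 5 % level (two-tailed), taken from a t table
decision = "rejected" if t_abs > t_table else "accepted"

print("mean =", x_bar, " S2 =", round(s2, 2), " |t| =", round(t_abs, 2))
print("H0 is", decision, "at the 5 % level")
```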

Objective: Testing the significance of the difference between two means in small samples.

Kinds of data: The data relate to two random samples drawn from two normal populations with the following results:

$n_1 = 6$, $\bar{X}_1 = 25$, $S_1^2 = 36$, $n_2 = 8$, $\bar{X}_2 = 20$ and $S_2^2 = 25$, provided that the variances of the two populations are equal, i.e. $\sigma_1^2 = \sigma_2^2$.

Solution: The null hypothesis H0: There is no significant difference between the two means, i.e. μ1=μ2.

H1: There is a significant difference between the two means, i.e. μ1≠μ2.

Now we calculate the pooled sample mean square:

$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = \frac{(6-1) \times 36 + (8-1) \times 25}{6 + 8 - 2} = \frac{355}{12} = 29.58$$

Applying the t-statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{25 - 20}{\sqrt{29.58 \times \left(\dfrac{1}{6} + \dfrac{1}{8}\right)}} = \frac{5}{2.93} = 1.70$$

Here the absolute value of t = 1.70 and the tabulated value of t at 12 d.f. and the 0.05 level of significance = 2.18.

Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we conclude that there is no significant difference between the two means.


Objective: Testing the significance of the difference between two means in small samples when the observations of the two samples are paired together.

Kinds of data: The data relate to the increase in weight of pigs due to food A and food B:

Solution: H0: There is no significant difference between the means of foods A and B, i.e. μ1=μ2.

H1: There is a significant difference between the means of foods A and B, i.e. μ1≠μ2.

Take the differences $d_i = A_i - B_i$ and calculate $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{-16}{8} = -2$.

| Pig number | Food A | Food B | $d_i = A_i - B_i$ | $(d_i - \bar{d})^2$ |
|---|---|---|---|---|
| 1 | 49 | 52 | -3 | 1 |
| 2 | 53 | 55 | -2 | 0 |
| 3 | 51 | 52 | -1 | 1 |
| 4 | 52 | 53 | -1 | 1 |
| 5 | 47 | 50 | -3 | 1 |
| 6 | 50 | 54 | -4 | 4 |
| 7 | 52 | 54 | -2 | 0 |
| 8 | 53 | 53 | 0 | 4 |

Now we calculate the value of $S^2 = \dfrac{1}{n-1}\sum(d_i - \bar{d})^2 = \dfrac{12}{7} = 1.71$ and S = 1.31.

Applying the t-statistic:

$$|t| = \frac{|\bar{d}|}{S/\sqrt{n}} = \frac{2}{1.31/\sqrt{8}} = \frac{2}{0.46} = 4.32$$

Here the absolute value of t = 4.32 and the tabulated value of t at 7 d.f. and the 0.05 level of significance = 2.37.

Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and we conclude that the two foods studied differ significantly at the 5 % level of significance.

Objective: Testing the significance of difference of two means in small samples when the observations

of the two samples are independent.

Kinds of data: The data give the increase in weight due to food A and food B,

now assuming that the two samples of pigs are independent:

Pig number 1 2 3 4 5 6 7 8

Food A 49 53 51 52 47 50 52 53

Food B 52 55 52 53 50 54 54 53

Solution: H0: There is no significant difference between two means of foods A and B. i.e.μ1=μ2.



Here $\bar{X} = \frac{407}{8} = 50.88$ and $\bar{Y} = \frac{423}{8} = 52.88$

Pig number | X (Food A) | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$ | Y (Food B) | $Y_i - \bar{Y}$ | $(Y_i - \bar{Y})^2$
1 | 49 | -1.88 | 3.53 | 52 | -0.88 | 0.77
2 | 53 | 2.12 | 4.49 | 55 | 2.12 | 4.49
3 | 51 | 0.12 | 0.01 | 52 | -0.88 | 0.77
4 | 52 | 1.12 | 1.25 | 53 | 0.12 | 0.01
5 | 47 | -3.88 | 15.05 | 50 | -2.88 | 8.29
6 | 50 | -0.88 | 0.77 | 54 | 1.12 | 1.25
7 | 52 | 1.12 | 1.25 | 54 | 1.12 | 1.25
8 | 53 | 2.12 | 4.49 | 53 | 0.12 | 0.01
Total | 407 | -0.04 | 30.88 | 423 | -0.04 | 16.88

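The test statistic used here is

$t = \frac{\bar{X} - \bar{Y}}{\sqrt{S^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$, where $S^2 = \frac{\sum(X_i-\bar{X})^2 + \sum(Y_i-\bar{Y})^2}{n_1+n_2-2}$

Now $S^2 = \frac{30.88+16.88}{8+8-2} = \frac{47.75}{14} = 3.41$

Then $|t| = \frac{|50.88-52.88|}{\sqrt{3.41\left(\frac{1}{8}+\frac{1}{8}\right)}} = \frac{2}{0.923} = 2.165$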

Here the absolute value of t= 2.165 and tabulated value of t at 14 d.f. at 0.05 level of significance=2.15

Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and

we conclude that the two foods taken into study for pigs differ significantly at 5% level of significance.
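For comparison with the paired analysis, the independent-samples statistic can be computed directly from the same observations. The sketch below is a plain-Python illustration; `independent_t` is a hypothetical helper name.

```python
import math

food_a = [49, 53, 51, 52, 47, 50, 52, 53]
food_b = [52, 55, 52, 53, 50, 54, 54, 53]

def independent_t(x, y):
    """Return (pooled variance, t) for two independent samples."""
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    pooled_var = (ss_x + ss_y) / (n1 + n2 - 2)
    t = (mean_x - mean_y) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t

pooled_var, t = independent_t(food_a, food_b)
print(round(pooled_var, 2), round(abs(t), 2))  # 3.41 2.17
```

Treating the same data as unpaired roughly halves the test statistic, because the pig-to-pig variation is no longer removed from the error.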


16. Chi-Square Test

Chi-Square Test : is used to test the null hypothesis based on some general law of nature or any reasoning.

Types of problem dealt with chi-square test:

• Whether a particular distribution is in agreement with the normal distribution or

• Whether the two given distributions are in agreement with each other or

• Whether the two observed frequencies are in any particular given ratio or

• Whether the two sets of classification are independent of each other and so on.

Conditions for the applicability of chi- square test:

• The sample observations should be independent.

• Constraints on the cell frequency should be linear eg. ∑𝑂𝑖 = ∑𝐸𝑖.

• N the total frequency should be large (>50).

• No observed frequency should be less than 5.

• If any one of the cell frequencies is less than 5, it is pooled with the preceding or succeeding frequency so

that the pooled frequency is more than 5, and then we adjust for the degrees of freedom lost in pooling.

χ² test of goodness of fit: The null hypothesis is that there is no difference between the experimental result

and theory. If $O_i$ (i=1,2,…n) is the set of observed frequencies and $E_i$ is the set of expected frequencies,

the Karl Pearson χ² is given by

$\chi^2 = \sum_{i=1}^{n} \frac{(O_i-E_i)^2}{E_i}$, which follows the χ² distribution with (n-1) d.f.

χ² test for 2X2 contingency table: for the 2X2 contingency table

a | b | (a+b)
c | d | (c+d)
(a+c) | (b+d) | N=a+b+c+d

the χ² statistic is given by $\chi^2 = \frac{N(ad-bc)^2}{(a+c)(b+d)(a+b)(c+d)}$, where N=a+b+c+d, with 1 d.f.

Alternatively, we can calculate the expected frequency of each cell and then apply the chi-square test of

goodness of fit, e.g. $E(a) = \frac{(a+b)(a+c)}{a+b+c+d}$, $E(b) = \frac{(a+b)(b+d)}{a+b+c+d}$, and so on.

Yates' correction for continuity: if any one of the cell frequencies is less than 5 in a 2X2 contingency table,

then pooling would reduce the degrees of freedom to 0. In this case we apply Yates' correction for

continuity, which consists of adding 0.5 to the cell frequency which is less than 5 and then adjusting the

remaining cell frequencies accordingly so that the marginal totals are not disturbed at all. After correction

we get

$\chi^2 = \frac{N\left[\left(a\mp\frac{1}{2}\right)\left(d\mp\frac{1}{2}\right)-\left(b\pm\frac{1}{2}\right)\left(c\pm\frac{1}{2}\right)\right]^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{N\left[|ad-bc|-\frac{N}{2}\right]^2}{(a+c)(b+d)(a+b)(c+d)}$

Note: In the χ² test, if the calculated value of χ² is greater than the tabulated value, the null hypothesis is rejected,

which means there is a significant difference between the experimental result and theory.

For an mxn table the degrees of freedom are (m-1)X(n-1).

Objective: Testing whether the frequencies are equally distributed in a given dataset.

Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits were as

follows.

Digits 0 1 2 3 4 5 6 7 8 9

Frequency 22 21 16 20 23 15 18 21 19 25

Solution: We set up the null hypothesis H0: The digits were equally distributed in the given dataset.

Under the null hypothesis the expected frequency of each digit is $\frac{\text{sum of frequencies}}{\text{number of digits}} = \frac{200}{10} = 20$

Then the value of $\chi^2 = \frac{(22-20)^2}{20} + \frac{(21-20)^2}{20} + \frac{(16-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(23-20)^2}{20} + \frac{(15-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(21-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(25-20)^2}{20}$

$= \frac{1}{20}(4+1+16+0+9+25+4+1+1+25) = \frac{86}{20} = 4.3$

The tabulated value of χ² at 9 d.f. and 5 % level of significance is 16.92. Since the calculated value of χ² is

less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the digits are equally

distributed in the given dataset.

Objective: Testing the goodness of fit between experimental result and theory.

Kinds of data: The theory predicts the proportion of beans in the four groups A, B, C and D should be in the

ratio 9:3:3:1. In an experiment having 1600 beans, the observed numbers in the four groups were found to be

882,313,287 and 118.

Solution: We set up the null hypothesis H0: There is no significant difference between experimental result

and theory.

The expected frequencies can be calculated as follows:

Total number of beans =882+313+287+118.=1600. Given ratio is 9:3:3:1

$E(882) = \frac{9}{16}\times 1600 = 900$ ; $E(313) = \frac{3}{16}\times 1600 = 300$

$E(287) = \frac{3}{16}\times 1600 = 300$ ; $E(118) = \frac{1}{16}\times 1600 = 100$

Thus, χ² for testing the goodness of fit is

$\chi^2 = \sum_{i=1}^{4} \frac{(O_i-E_i)^2}{E_i} = \frac{(882-900)^2}{900} + \frac{(313-300)^2}{300} + \frac{(287-300)^2}{300} + \frac{(118-100)^2}{100}$

=0.36+0.56+0.56+3.24=4.72

Now, d.f.=4-1=3. The tabulated χ² at 5% level of significance at 3 d.f.=7.815

Since the calculated value of χ² is less than the tabulated value, it is not significant. Hence, the null

hypothesis is accepted and we conclude that there is good correspondence between theory and experimental

result.
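Both goodness-of-fit examples (equally likely digits and the 9:3:3:1 ratio) follow the same recipe, which can be sketched in Python as below. The function name `chi_square_gof` and its `ratio` argument are illustrative assumptions, not notation from the manual.

```python
def chi_square_gof(observed, ratio):
    """Expected frequencies, chi-square and d.f. for a hypothesised ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2, len(observed) - 1

digits = [22, 21, 16, 20, 23, 15, 18, 21, 19, 25]
_, chi2_digits, df_digits = chi_square_gof(digits, [1] * 10)
print(round(chi2_digits, 2), df_digits)  # 4.3 9

beans = [882, 313, 287, 118]
exp_beans, chi2_beans, df_beans = chi_square_gof(beans, [9, 3, 3, 1])
print(exp_beans, round(chi2_beans, 2), df_beans)  # [900.0, 300.0, 300.0, 100.0] 4.73 3
```

The unrounded value for the beans is 4.73; the 4.72 in the text comes from adding terms already rounded to two decimals.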

Objective: Testing the significance of independence of two attributes using the χ² test.

Kinds of data: The data relate to the sample of married women according to their level of education and

their marriage adjustment score

Level of education | Marriage adjustment score: Very low | Low | High | Very High | Total
College | 24 | 97 | 62 | 58 | 241
High School | 22 | 28 | 30 | 41 | 121
Middle School | 32 | 10 | 11 | 20 | 73
Total | 78 | 135 | 103 | 119 | 435


Solution: We set up the null hypothesis H0: The two attributes level of education and marriage adjustment

scores both are independent.

The expected frequencies can be calculated as follows:

We sum the rows and columns. The three row totals are 241, 121 and 73 respectively. The four column

totals are 78, 135, 103 and 119 respectively, giving the grand total as 435. The expected frequency of each cell is (row total x column total)/grand total:

$E(24) = \frac{78\times 241}{435} = 43.2$ ; $E(97) = \frac{135\times 241}{435} = 74.8$ ; $E(22) = \frac{78\times 121}{435} = 21.7$

$E(28) = \frac{135\times 121}{435} = 37.6$ ; $E(32) = \frac{78\times 73}{435} = 13.1$ ; $E(10) = \frac{135\times 73}{435} = 22.7$

$E(62) = \frac{103\times 241}{435} = 57.1$ ; $E(58) = \frac{119\times 241}{435} = 65.9$ ; $E(30) = \frac{103\times 121}{435} = 28.7$

$E(41) = \frac{119\times 121}{435} = 33.1$ ; $E(11) = \frac{103\times 73}{435} = 17.3$ ; $E(20) = \frac{119\times 73}{435} = 20.0$

Thus, χ² for testing independence is

$\chi^2 = \frac{(24-43.2)^2}{43.2} + \frac{(22-21.7)^2}{21.7} + \frac{(32-13.1)^2}{13.1} + \frac{(97-74.8)^2}{74.8} + \frac{(28-37.6)^2}{37.6} + \frac{(10-22.7)^2}{22.7} + \frac{(62-57.1)^2}{57.1} + \frac{(30-28.7)^2}{28.7} + \frac{(11-17.3)^2}{17.3} + \frac{(58-65.9)^2}{65.9} + \frac{(41-33.1)^2}{33.1} + \frac{(20-20.0)^2}{20.0} = 57.6$

Now, d.f.=(4-1)x(3-1)=6. The tabulated value of χ² at 5% level of significance and 6 d.f.= 12.592

Since the calculated value of χ² is more than the tabulated value, it is significant. Hence, we reject the

null hypothesis and conclude that level of education and marriage adjustment score are not independent,

i.e. there is an association between the two attributes.
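The expected frequencies and the χ² value for an m x n table can be computed in a few lines. The following Python sketch uses the observed table above; `chi_square_independence` is an illustrative name, and the marginal totals are recomputed from the cell counts rather than typed in.

```python
def chi_square_independence(table):
    """Expected frequencies, chi-square and d.f. for an m x n contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi2, df

observed = [
    [24, 97, 62, 58],   # College
    [22, 28, 30, 41],   # High School
    [32, 10, 11, 20],   # Middle School
]
expected, chi2, df = chi_square_independence(observed)
print(round(chi2, 1), df)  # 57.6 6
```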

Objective: Computation of the χ² value in the case of a contingency table to test the independence of attributes

where one of the cell frequencies is less than five.

Kinds of data: The data relate to the height of father and their youngest son at the age of 40 years.

Height of youngest sons | Height of fathers: Tall | Height of fathers: Short | Total
Tall | 8 | 2 | 10
Short | 7 | 6 | 13
Total | 15 | 8 | 23

Solution: Here the null hypothesis is H0: The height of youngest son is independent of the height of

Fathers and H1: They are dependent on each other.

Since one cell frequency is less than five, we apply Yates' correction and correct the contingency

table as given below.

Height of youngest sons | Height of fathers: Tall | Height of fathers: Short | Total
Tall | 7.5 (a) | 2.5 (b) | 10
Short | 7.5 (c) | 5.5 (d) | 13
Total | 15 | 8 | 23

Now compute the value of $\chi^2 = \frac{N(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}$ using the corrected frequencies:

$\chi^2 = \frac{23(7.5\times 5.5-2.5\times 7.5)^2}{10\times 13\times 15\times 8} = 0.746$

The table value of χ² at 5 % level of significance and 1 d.f. = 3.841

Since the calculated value of χ² is less than the tabulated value of χ², we do not reject the null hypothesis and

conclude that the height of youngest sons is independent of the height of their fathers.
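The corrected statistic can also be obtained from the uncorrected cell counts through the closed form N(|ad-bc| - N/2)² divided by the product of the marginal totals. A minimal Python sketch (the function name is an assumption for illustration):

```python
def yates_chi_square(a, b, c, d):
    """Chi-square for a 2x2 table with Yates' continuity correction."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Observed (uncorrected) frequencies from the father/son height table
print(round(yates_chi_square(8, 2, 7, 6), 3))  # 0.746
```

Here |ad-bc| - N/2 = 34 - 11.5 = 22.5, the same quantity as 7.5 x 5.5 - 2.5 x 7.5 in the corrected table.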


17. Design of Experiment

(CRD, RBD, LSD, Split and Strip Plot Design)

It is the planning (sequence of steps taken well in time) of an experiment to obtain appropriate data with

respect to the problem under investigation.

Principles of experimental design: There are 3 basic principles of experimental design:

Replication: Repetition of treatment under investigation is known as replication.

Randomization: The process of assigning treatment to various experimental unit in purely chance manner is

known as randomization.

Local Control: The process of reducing the experimental error by dividing the relatively heterogeneous

experimental area into homogeneous blocks is known as local control.

Type of Design:

Completely Randomized Design: This design is used when the experimental material is homogeneous eg.

laboratory or pot experiment. The principle of local control is absent in CRD.

Randomized Block Design: This design is used when the fertility gradient of soil is only in one direction.

Then the whole field is divided into a number of equal blocks perpendicular to the direction of fertility

gradient and then each block is divided into number of plots equal to the number of treatments. In RBD we try

to minimize the within-block variation while keeping the between-block variation as large as possible.

Latin Square Design: This design is used when the fertility gradient of the soil in both the directions. Then

the field is divided into homogeneous blocks in two ways. The blocks in one direction are known as rows

whereas the blocks in other direction are known as columns. In LSD number of replications must be equal to

number of treatments. Number of row, column and treatment should be equal and randomization of treatment

is done in such a way that each treatment occurs once and only once in each row and column.

Split Plot Design: is used when there are two types of treatments and both are to be estimated with different

precision. The treatment which is to be estimated with greater precision is allotted as subplot treatment. In

split plot design the effect of the subplot treatments and the interactions with the main plot treatments can be

estimated more precisely.

Strip Plot Design: is used when there are two factors and both of them require large experimental unit.

Suppose four levels of spacing and three methods of ploughing. In strip plot design interaction effect is

estimated with greater precision. The experimental area is divided into three plots namely vertical strip,

horizontal strip and intersection plot.

All the analysis of experimental design is based on the analysis of variance table.

ANOVA: It is a technique by which the total variation in any experiment may be split into several

physically assignable components. In ANOVA we determine the source of variation and check whether this

source of variation is significant or not. To check the significance of source of variation F-test is used.

Format of ANOVA:

Analysis of variance table

Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %) at (treatment and error) d.f.


Objective: Analyzing the data of completely randomized design with unequal number of replication per

treatment.

Kinds of data: The following data relate to a varietal trial on green gram (in coded form) that was

conducted using CRD having four varieties V1, V2, V3 and V4 with 3, 6, 5 and 4 replications

respectively. The results are given below (kg/plot):

Varieties Seed yield of greengram (kg/plot) Total Mean

V1 2 1 4 7 2.33

V2 3 4 2 1 5 6 17 2.83

V3 1 5 4 2 4 31 6.20

V4 4 6 3 5 18 4.50

Total 73

Solution: Here we test whether the varieties differ significantly or not.

Grand total = 73

Correction factor = $\frac{73^2}{18}$ = 296.06

Total sum of squares = $(2^2 + 1^2 + \cdots + 3^2 + 5^2)$ - 296.06 = 80.94

Variety sum of squares = $\frac{7^2}{3} + \frac{17^2}{6} + \frac{31^2}{5} + \frac{18^2}{4}$ - CF

= 337.70 - 296.06 = 41.64

Error sum of squares = Total sum of squares - Variety sum of squares

= 80.94 - 41.64 = 39.30


Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %) at (3,14) d.f.
Between varieties | 3 | 41.64 | 13.88 | 4.94 (S) | 3.34
Within varieties (Error) | 14 | 39.30 | 2.80 | |
Total | 17 | 80.94 | | |

Since Fcal >Ftab, F test indicates that there are significant differences between the treatment means.

The individual varieties can be compared with the help of critical difference.

In case of unequal number of replication the standard error of difference between treatment means varies

from pair to pair .

Standard error of difference between V1 and V2 = $\sqrt{2.80\left(\frac{1}{3}+\frac{1}{6}\right)}$ = 1.18

∴ Critical Difference = (S.E.)diff X t 0.05 at 14 d.f. = 1.18 x 2.14 = 2.52

Similarly, we can compute the value of C.D. for the other treatment comparisons.

The following treatment comparisons can be made on the basis of the C.D. values, with the varieties arranged in descending order of mean yield:

V3 (6.20) V4 (4.50) V2 (2.83) V1 (2.33)

Conclusions: Variety V3 gives the highest yield. It is at par with variety V4 (difference 1.70 < C.D. 2.40) but

differs significantly from variety V2 (3.37 > 2.17) and variety V1 (3.87 > 2.62). Varieties V4, V2 and V1 are at par

with one another, since none of their differences exceeds the corresponding C.D. (V4-V2: 1.67 < 2.31; V4-V1: 2.17 < 2.73; V2-V1: 0.50 < 2.52), and they give the lower yields of green gram.
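Because the critical difference changes with the number of replications in each pair, it is convenient to tabulate every pair at once. The Python sketch below assumes the variety means, replication numbers, error mean square (2.80) and tabulated t (2.14 at 14 d.f.) quoted in this example; the helper name `critical_difference` is illustrative.

```python
import math
from itertools import combinations

means = {"V1": 2.33, "V2": 2.83, "V3": 6.20, "V4": 4.50}
reps = {"V1": 3, "V2": 6, "V3": 5, "V4": 4}
EMS = 2.80      # error mean square from the ANOVA table
T_TABLE = 2.14  # tabulated t at 5 % and 14 d.f.

def critical_difference(r_i, r_j, ems=EMS, t=T_TABLE):
    """C.D. for two treatments with r_i and r_j replications."""
    return t * math.sqrt(ems * (1 / r_i + 1 / r_j))

for v_i, v_j in combinations(means, 2):
    diff = abs(means[v_i] - means[v_j])
    cd = critical_difference(reps[v_i], reps[v_j])
    verdict = "significant" if diff > cd else "at par"
    print(v_i, v_j, round(diff, 2), round(cd, 2), verdict)
```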


Objective: Analyzing the data of completely randomized design with equal number of replications per

treatment.

Kinds of data : The data relate to the five varieties of sesame using CRD conducted in a greenhouse with

four pots per variety.

Varieties Seed yield of sesame

(gm/plot)

Total Mean

V1 25 22 22 18 87 21.75

V2 25 28 26 25 104 26

V3 24 24 18 21 87 21.75

V4 20 17 18 19 74 18.5

V5 14 15 15 11 55 13.75

Total 407


The Grand total = 407.

Correction factor = $\frac{407^2}{20}$ = 8282.45

Total sum of squares = $(25^2 + 22^2 + \cdots + 11^2)$ - CF

= 8685 - CF = 402.55

Variety sum of squares = $\frac{87^2 + 104^2 + \cdots + 55^2}{4}$ - CF

= 8613.75 - CF = 331.30

Error sum of squares = 402.55 - 331.30 = 71.25



Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %) at (4,15) d.f.
Between varieties | 4 | 331.30 | 82.83 | 17.44 (S) | 3.06
Within varieties (Error) | 15 | 71.25 | 4.75 | |
Total | 19 | 402.55 | | |

Since Fcal >Ftab, F test indicates that there are significant differences between the treatment means.

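Standard error of difference between two treatments = $\sqrt{\frac{2\times EMSS}{r}} = \sqrt{\frac{2\times 4.75}{4}}$ = 1.541

∴ Critical Difference = (S.E.)diff x t 0.05 at 15 d.f. = 1.541 x 2.13 = 3.28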


As per F test the varieties differ significantly, the individual varieties can be compared with the help of

critical difference.

The varieties can be compared by setting them in the descending order of their yields in the following

manner.

V2(26) V1(21.75) V3(21.75) V4(18.5) V5(13.75)

If the difference between two varieties means is greater than the critical difference the varieties differ

significantly. The varieties which do not differ significantly have been underlined by a bar.

Conclusions: The variety V2 gives significantly higher yield than all other varieties. The varieties V1, V3

and V4 are at par but differ significantly with variety V2 and V5 .The variety V5 gives lower yield of

sesame.
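The sums of squares for a completely randomized design can be verified with a short routine. This is a sketch in plain Python using the sesame data; `one_way_anova` is a hypothetical helper, and the t value 2.13 (5 %, 15 d.f.) is taken from a t-table rather than computed.

```python
import math

def one_way_anova(groups):
    """CRD analysis; groups may have unequal numbers of replications."""
    n = sum(len(g) for g in groups)
    grand_total = sum(sum(g) for g in groups)
    cf = grand_total ** 2 / n                                  # correction factor
    total_ss = sum(y * y for g in groups for y in g) - cf
    treat_ss = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    error_ss = total_ss - treat_ss
    ems = error_ss / (n - len(groups))                         # error mean square
    f_cal = (treat_ss / (len(groups) - 1)) / ems
    return treat_ss, error_ss, ems, f_cal

sesame = [
    [25, 22, 22, 18],  # V1
    [25, 28, 26, 25],  # V2
    [24, 24, 18, 21],  # V3
    [20, 17, 18, 19],  # V4
    [14, 15, 15, 11],  # V5
]
treat_ss, error_ss, ems, f_cal = one_way_anova(sesame)
se_diff = math.sqrt(2 * ems / 4)
cd = se_diff * 2.13  # tabulated t at 5 % and 15 d.f.
print(round(treat_ss, 2), round(error_ss, 2), round(f_cal, 2))  # 331.3 71.25 17.44
print(round(se_diff, 3), round(cd, 2))  # 1.541 3.28
```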


Objective : Analyzing the data of randomized complete block design and the computation of efficiency as

compared to completely randomized design.

Kinds of data : The data relate to the yields of 6 wheat varieties(in rounded figures) in an experiment with

4 randomized blocks.

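Yield of wheat varieties

Block | V1 | V2 | V3 | V4 | V5 | V6 | Total
1 | 27 | 30 | 27 | 16 | 16 | 24 | 140
2 | 27 | 28 | 22 | 15 | 17 | 22 | 131
3 | 28 | 31 | 34 | 14 | 17 | 22 | 146
4 | 38 | 39 | 36 | 19 | 15 | 26 | 173
Total | 120 | 128 | 119 | 64 | 65 | 94 | 590
Mean | 30 | 32 | 29.75 | 16 | 16.25 | 23.5 |

Correction factor = $\frac{590^2}{24}$ = 14504.17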

Total sum of squares = $(27^2 + 30^2 + \cdots + 26^2)$ - CF = 15834 - CF = 1329.83

Block sum of squares = $\frac{140^2+131^2+146^2+173^2}{6}$ - CF = 14667.67 - CF = 163.50

Variety sum of squares = $\frac{120^2+128^2+119^2+64^2+65^2+94^2}{4}$ - CF = 15525.50 - CF = 1021.33

Error sum of squares = 1329.83 - 163.50 - 1021.33 = 145.00



Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %)
Block | 3 | 163.50 | 54.50 | 5.64 | 3.29 at (3,15) d.f.
Between varieties | 5 | 1021.33 | 204.27 | 21.12 | 2.90 at (5,15) d.f.
Within varieties (Error) | 15 | 145.00 | 9.67 | |
Total | 23 | 1329.83 | | |

Since Fcal >Ftab, F test indicates that there are significant differences between the Variety means.

The individual varieties can be compared with the help of critical difference.


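Standard error of difference between two treatments = $\sqrt{\frac{2\times EMSS}{r}} = \sqrt{\frac{2\times 9.67}{4}}$ = 2.20

∴ Critical Difference = (S.E.)diff x t 0.05 at 15 d.f. = 2.20 x 2.13 = 4.69

The varieties can be compared by setting them in the descending order of their yields in the following

manner.

V2(32) V1(30) V3(29.75) V6(23.5) V5(16.25) V4(16.0)

If the difference between two variety means is greater than the critical difference, the varieties differ

significantly. The varieties which do not differ significantly have been underlined by a bar.

Conclusions: Variety V4 gives the lowest yield and does not differ significantly from V5, while V6

differs significantly from V4, V5, V1, V3 and V2. Variety V2 gives the highest yield; it is at

par with V1 and V3 but differs significantly from V4, V5 and V6.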

Relative Efficiency of RBD as compared to CRD:

$E = \frac{r(t-1)S_E^2 + (r-1)S_B^2}{(rt-1)S_E^2} = \frac{4\times(6-1)\times 9.67+(4-1)\times 54.50}{(4\times 6-1)\times 9.67} = 1.60$

Here the d.f. for error is less than 20, so we use the precision factor given below:

$\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)}$, where $n_1$ and $n_2$ are the error degrees of freedom for the two experiments (here $n_1$ = 18 for CRD and $n_2$ = 15 for RBD); it adjusts the relative

efficiency of the second experiment as compared to the first.

$\frac{(15+1)(18+3)}{(18+1)(15+3)} = 0.982$

The corrected relative efficiency is, then, given by

1.60 x 0.982 = 1.57

Therefore, the gain in efficiency in RBD is 57 % as compared to completely randomized design.
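The block design analysis and the efficiency calculation can be sketched together in Python. The code below uses the wheat data of this example; the function names `rbd_anova` and `precision_factor` are illustrative, and the efficiency formula is the one quoted above.

```python
def rbd_anova(data):
    """data[i][j] = yield of treatment j in block i."""
    r, t = len(data), len(data[0])
    cf = sum(map(sum, data)) ** 2 / (r * t)
    total_ss = sum(y * y for row in data for y in row) - cf
    block_ss = sum(sum(row) ** 2 for row in data) / t - cf
    treat_ss = sum(sum(col) ** 2 for col in zip(*data)) / r - cf
    error_ss = total_ss - block_ss - treat_ss
    ms_block = block_ss / (r - 1)
    ms_error = error_ss / ((r - 1) * (t - 1))
    # Relative efficiency of RBD over CRD
    rel_eff = (r * (t - 1) * ms_error + (r - 1) * ms_block) / ((r * t - 1) * ms_error)
    return block_ss, treat_ss, error_ss, rel_eff

def precision_factor(n1, n2):
    """n1, n2 = error d.f. of the first (CRD) and second (RBD) experiment."""
    return (n2 + 1) * (n1 + 3) / ((n1 + 1) * (n2 + 3))

wheat = [
    [27, 30, 27, 16, 16, 24],
    [27, 28, 22, 15, 17, 22],
    [28, 31, 34, 14, 17, 22],
    [38, 39, 36, 19, 15, 26],
]
block_ss, treat_ss, error_ss, rel_eff = rbd_anova(wheat)
print(round(block_ss, 2), round(treat_ss, 2), round(error_ss, 2))  # 163.5 1021.33 145.0
print(round(rel_eff, 2), round(rel_eff * precision_factor(18, 15), 2))  # 1.6 1.58
```

Without intermediate rounding the corrected efficiency is about 1.58; the hand calculation reaches 1.57 because it multiplies the rounded values 1.60 and 0.982.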

Objective : Analyzing the data of Latin square design

Kinds of data : The yields of five varieties of wheat tried in a LSD along with the plan have been given

below in oz.

E (68) B(78) D(80) C (122) A(100)

D(72) E(73) A(70) B(58) C(129)

A(78) C(99) E(57) D(75) B(72)

C(113) A(69) B(60) E(73) D(64)

B(48) D (70) C(76) A(82) E (73)

Solution: For getting the row, column and treatment totals, the following table is prepared.

Rows | Column I | Column II | Column III | Column IV | Column V | Row totals | Treatment totals | Treatment means
I | E(68) | B(78) | D(80) | C(122) | A(100) | 448 | TA = 399 | 79.8
II | D(72) | E(73) | A(70) | B(58) | C(129) | 402 | TB = 316 | 63.2
III | A(78) | C(99) | E(57) | D(75) | B(72) | 381 | TC = 539 | 107.8
IV | C(113) | A(69) | B(60) | E(73) | D(64) | 379 | TD = 361 | 72.2
V | B(48) | D(70) | C(76) | A(82) | E(73) | 349 | TE = 344 | 68.8
Column totals | 379 | 389 | 343 | 410 | 438 | GT = 1959 | |

Now correction factor = $\frac{1959^2}{25}$ = 153507.24

Total sum of squares = $(68^2 + 78^2 + \cdots + 73^2)$ - CF = 162941 - 153507.24 = 9433.76

Sum of squares (Rows) = $\frac{448^2+402^2+381^2+379^2+349^2}{5}$ - CF = 154582.2 - 153507.24 = 1074.96

Sum of squares (Columns) = $\frac{379^2+389^2+343^2+410^2+438^2}{5}$ - CF = 154511 - 153507.24 = 1003.76

Sum of squares (Treatments) = $\frac{399^2+316^2+539^2+361^2+344^2}{5}$ - CF = 159647 - 153507.24 = 6139.76

Sum of squares (Error) = 9433.76 - 1074.96 - 1003.76 - 6139.76 = 1215.28

Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %) at (4,12) d.f.
Row | 4 | 1074.96 | 268.74 | - | -
Column | 4 | 1003.76 | 250.94 | - | -
Treatment | 4 | 6139.76 | 1534.94 | 15.16 | 3.26
Error | 12 | 1215.28 | 101.27 | |
Total | 24 | 9433.76 | | |

Since Fcal > Ftab, there are significant differences among the treatment means.

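Standard error of difference between two treatments = $\sqrt{\frac{2\times EMSS}{r}} = \sqrt{\frac{2\times 101.27}{5}}$ = 6.36

∴ Critical Difference = (S.E.)diff x t 0.05 at 12 d.f. = 6.36 x 2.18 = 13.86

The varieties can be compared by setting them in the descending order of their yields in the following

manner.

Tc(107.8) Ta(79.8) Td(72.2) Te(68.8) Tb(63.2)

If the difference between two variety means is greater than the critical difference, the varieties differ

significantly. The varieties which do not differ significantly can be underlined by a bar.

Conclusions: Variety C gives the highest yield and differs significantly from all the other varieties.

Varieties A, D and E are at par with one another, as are D, E and B, but A differs significantly from B (79.8 - 63.2 = 16.6 > 13.86).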
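A Latin square can be analysed by storing each plot as a (treatment, yield) pair, so that row, column and treatment totals all come from the same layout. The Python sketch below uses the wheat plan of this example; `latin_square_anova` is an illustrative helper name.

```python
def latin_square_anova(square):
    """square[i][j] = (treatment, yield) for row i and column j."""
    m = len(square)
    yields = [[y for _, y in row] for row in square]
    cf = sum(map(sum, yields)) ** 2 / (m * m)
    total_ss = sum(y * y for row in yields for y in row) - cf
    row_ss = sum(sum(row) ** 2 for row in yields) / m - cf
    col_ss = sum(sum(col) ** 2 for col in zip(*yields)) / m - cf
    totals = {}
    for row in square:
        for treatment, y in row:
            totals[treatment] = totals.get(treatment, 0) + y
    treat_ss = sum(t * t for t in totals.values()) / m - cf
    error_ss = total_ss - row_ss - col_ss - treat_ss
    f_cal = (treat_ss / (m - 1)) / (error_ss / ((m - 1) * (m - 2)))
    return totals, treat_ss, error_ss, f_cal

plan = [
    [("E", 68), ("B", 78), ("D", 80), ("C", 122), ("A", 100)],
    [("D", 72), ("E", 73), ("A", 70), ("B", 58), ("C", 129)],
    [("A", 78), ("C", 99), ("E", 57), ("D", 75), ("B", 72)],
    [("C", 113), ("A", 69), ("B", 60), ("E", 73), ("D", 64)],
    [("B", 48), ("D", 70), ("C", 76), ("A", 82), ("E", 73)],
]
totals, treat_ss, error_ss, f_cal = latin_square_anova(plan)
print(totals["A"], totals["B"], totals["C"], totals["D"], totals["E"])  # 399 316 539 361 344
print(round(treat_ss, 2), round(error_ss, 2), round(f_cal, 2))  # 6139.76 1215.28 15.16
```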

Objective : Analyzing the data of Latin square design and the computation of efficiency as compared to

RCBD and CRD.

Kinds of data : The data relate to the Latin square design to test the efficiency of methods of spacing:

A, 2"; B, 4"; C, 6"; D, 8"; E, 10". The yield in grams of plots of millet arranged in LSD and the layout of the

treatments are given below:

Rows | Column I | Column II | Column III | Column IV | Column V | Row totals | Treatment totals | Treatment means
I | B(257) | E(230) | A(279) | C(287) | D(202) | 1255 | TA = 1349 | 269.8
II | D(245) | A(283) | E(245) | B(280) | C(260) | 1313 | TB = 1314 | 262.8
III | E(182) | B(252) | C(280) | D(246) | A(250) | 1210 | TC = 1262 | 252.4
IV | A(203) | C(204) | D(227) | E(193) | B(259) | 1086 | TD = 1191 | 238.2
V | C(231) | D(271) | B(266) | A(334) | E(338) | 1440 | TE = 1188 | 237.6
Column totals | 1118 | 1240 | 1297 | 1340 | 1309 | GT = 6304 | |

Thus, In order to test the significant difference among the treatment means, we have to analyze the above

data.

The computation of the sums of squares is given below:

Now correction factor = $\frac{6304^2}{25}$ = 1589617

Total sum of squares = $(257^2 + 230^2 + \cdots + 338^2)$ - CF = 1626188 - 1589617 = 36571

Sum of squares (Rows) = $\frac{1255^2+\cdots+1440^2}{5}$ - CF = 1603218 - 1589617 = 13601

Sum of squares (Columns) = $\frac{1118^2+\cdots+1309^2}{5}$ - CF = 1595763 - 1589617 = 6146

Sum of squares (Treatments) = $\frac{1349^2+1314^2+1262^2+1191^2+1188^2}{5}$ - CF = 1593773 - 1589617 = 4156

Sum of squares (Error) = 36571 - 13601 - 6146 - 4156 = 12668



Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %) at (4,12) d.f.
Row | 4 | 13601 | 3400 | 3.22 | 3.26
Column | 4 | 6146 | 1536 | 1.45 | 3.26
Treatment | 4 | 4156 | 1039 | 0.98 | 3.26
Error | 12 | 12668 | 1056 | |
Total | 24 | 36571 | | |


On comparison of the calculated and tabulated values of F, we find that rows, columns and spacings

are not significant. This is probably due to the shape of the plots. They were long and narrow; hence the

columns are narrow strips running the length of the rectangular area. Under these conditions, the Latin

square may have little advantage on the average over a randomized block plan.

In order to compare means of pairs of treatments, we look up t 0.05 at 12 degrees of freedom =2.18.

The least significant difference between two means is

$2.18\times\sqrt{\frac{2\times 1056}{5}} = 44.80$

Efficiency of LSD as compared to RBD, when rows are taken as blocks:

$E = \frac{S_C^2 + (m-1)S_E^2}{m\,S_E^2} = \frac{1536+(5-1)\times 1056}{5\times 1056} = 1.09$

Since the degrees of freedom for error are less than 20, the precision factor is to be computed

for the corrected efficiency of LSD as compared to RCBD and CRD.

The precision factor is $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(16+3)}{(16+1)(12+3)} = 0.9686$

The corrected efficiency = 0.9686 x 1.09 = 1.06

The gain in efficiency is only 6%.

Efficiency of LSD as compared to RBD, when columns are taken as blocks:

$E = \frac{S_R^2 + (m-1)S_E^2}{m\,S_E^2} = \frac{3400+(5-1)\times 1056}{5\times 1056} = 1.44$

The precision factor is $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(16+3)}{(16+1)(12+3)} = 0.9686$, the same as for rows.

The corrected efficiency = 0.9686 x 1.44 = 1.40

The gain in efficiency is 40%.

It reveals that the rows have not played any significant role to control the errors.

Efficiency of LSD as compared to CRD:

The formula for relative efficiency is given by

$E = \frac{S_R^2 + S_C^2 + (m-1)S_E^2}{(m+1)\,S_E^2} = \frac{3400+1536+(5-1)\times 1056}{(5+1)\times 1056} = 1.45$

The precision factor = $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(20+3)}{(20+1)(12+3)} = 0.9492$

The corrected efficiency is given by 0.9492 x 1.45 = 1.37

The gain in efficiency is 37%.
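The three efficiency comparisons differ only in which mean squares enter the numerator, so they can be wrapped in small functions. The sketch below is plain Python with illustrative function names; the mean squares (rows 3400, columns 1536, error 1056) are those of the millet example.

```python
def precision_factor(n1, n2):
    """n1, n2 = error d.f. of the first and second experiment."""
    return (n2 + 1) * (n1 + 3) / ((n1 + 1) * (n2 + 3))

def lsd_vs_rbd(ms_dropped, ms_error, m):
    """ms_dropped = mean square of the classification not used as blocks."""
    return (ms_dropped + (m - 1) * ms_error) / (m * ms_error)

def lsd_vs_crd(ms_row, ms_col, ms_error, m):
    return (ms_row + ms_col + (m - 1) * ms_error) / ((m + 1) * ms_error)

m, ms_row, ms_col, ms_error = 5, 3400, 1536, 1056

rows_as_blocks = lsd_vs_rbd(ms_col, ms_error, m) * precision_factor(16, 12)
cols_as_blocks = lsd_vs_rbd(ms_row, ms_error, m) * precision_factor(16, 12)
versus_crd = lsd_vs_crd(ms_row, ms_col, ms_error, m) * precision_factor(20, 12)
print(round(rows_as_blocks, 2), round(cols_as_blocks, 2), round(versus_crd, 2))  # 1.06 1.4 1.37
```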


Objective: Analysis of data in relation to split plot experiment.

Kinds of data : The data relate to the yields of 3 varieties of Alfalfa obtained in a split plot experiment

with 4 dates of final cutting. The yields are reported in tons per acre.

Yields of Alfalfa in a split plot experiment.

Variety Block

Date 1 2 3 4 5 6 Total

A 2.17 1.88 1.62 2.34 1.58 1.66 11.25

Ladak B 1.58 1.26 1.22 1.59 1.25 0.94 7.84

C 2.29 1.60 1.67 1.91 1.39 1.12 9.98

D 2.23 2.01 1.82 2.10 1.66 1.10 10.92

Total 8.27 6.75 6.33 7.94 5.88 4.82 39.99

A 2.33 2.01 1.70 1.78 1.42 1.35 10.59

Cossack B 1.38 1.30 1.85 1.09 1.13 1.06 7.81

C 1.86 1.70 1.81 1.54 1.67 0.88 9.46

D 2.27 1.81 2.01 1.40 1.31 1.06 9.86

Total 7.84 6.82 7.37 5.81 5.53 4.35 37.72

A 1.75 1.95 2.13 1.78 1.31 1.30 10.22

Ranger B 1.52 1.47 1.80 1.37 1.01 1.31 8.48

C 1.55 1.61 1.82 1.56 1.23 1.13 8.9

D 1.56 1.72 1.99 1.55 1.51 1.33 9.66

Total 6.38 6.75 7.74 6.26 5.06 5.07 37.26

G.Total 22.49 20.32 21.44 20.01 16.47 14.24 114.97

Solution: First we prepare two way table of main plot treatment and replication. Here main plot is variety (3)

whereas subplot is dates(4).

Replication Variety (RV)

Ladak Cossack Ranger Total

I 8.27 7.84 6.38 22.49

II 6.75 6.82 6.75 20.32

III 6.33 7.37 7.74 21.44

Iv 7.94 5.81 6.26 20.01

V 5.88 5.53 5.06 16.47

VI 4.82 4.35 5.07 14.24

Total 39.99 37.72 37.26 114.97

Correction factor = $\frac{114.97^2}{72}$ = 183.58

Total Sum of squares = $(2.17^2+1.88^2+\cdots+1.51^2+1.33^2)$ - 183.58 = 9.12

Block Sum of squares = $\frac{22.49^2+20.32^2+21.44^2+20.01^2+16.47^2+14.24^2}{4\times 3}$ - 183.58 = 4.15

Variety Sum of squares = $\frac{39.99^2+37.72^2+37.26^2}{6\times 4}$ - 183.58 = 183.76 - 183.58 = 0.18

Total sum of squares from RV table = $\frac{\sum RV^2}{b}$ - CF = $\frac{8.27^2+7.84^2+\cdots+5.07^2}{4}$ - 183.58 = 189.27 - 183.58 = 5.69

Main plot Error S S or Error I = TSS of RV - BSS - VSS = 5.69 - 4.15 - 0.18 = 1.36

Next we prepare the main plot x subplot table

Variety | Date A | Date B | Date C | Date D | Total
Ladak | 11.25 | 7.84 | 9.98 | 10.92 | 39.99
Cossack | 10.59 | 7.81 | 9.46 | 9.86 | 37.72
Ranger | 10.22 | 8.48 | 8.90 | 9.66 | 37.26
Total | 32.06 | 24.13 | 28.34 | 30.44 | 114.97

Dates Sum of squares = $\frac{32.06^2+24.13^2+28.34^2+30.44^2}{6\times 3}$ - CF = 185.54 - 183.58 = 1.96

Sum of squares due to Interaction (VD) = $\frac{\sum VD^2}{r}$ - CF - SSV - SSD

= $\frac{11.25^2+7.84^2+\cdots+9.66^2}{6}$ - 183.58 - 0.18 - 1.96 = 185.93 - 183.58 - 0.18 - 1.96 = 0.21

Sub plot Error S S or Error II = Total Sum of squares - all other sums of squares

= 9.12 - 4.15 - 0.18 - 1.36 - 1.96 - 0.21 = 1.26

The complete analysis can now be set up : Analysis of variance table

Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %)
Block | 5 | 4.15 | 0.83 | |
Variety | 2 | 0.18 | 0.09 | <1 | F.05 at (2,10) = 4.10
Error I | 10 | 1.36 | 0.136 | |
Dates | 3 | 1.96 | 0.65 | 23.21** | F.05 at (3,45) = 2.81
Interaction | 6 | 0.21 | 0.035 | 1.25 | F.05 at (6,45) = 2.31
Error II | 45 | 1.26 | 0.028 | |
Total | 71 | 9.12 | | |

The means for the dates of cutting are significantly different, but the other effects are found to be non-significant.

Standard errors, and the critical differences needed for specific comparisons among treatment means, are computed

only when the F-test shows significant differences.

The standard errors of mean differences can be worked out according to the formulas given below, where r = 6 blocks, a = 3 varieties, b = 4 dates, Ea = 0.136 (Error I mean square) and Eb = 0.028 (Error II mean square):

(i) Standard Error of difference (Variety) = $\sqrt{\frac{2E_a}{rb}}$ = 0.1065

Critical difference for two variety means = SEdiff x t5%(10 d.f.) = 0.1065 x 2.23 = 0.237

(ii) Standard Error of difference (Date of cutting) = $\sqrt{\frac{2E_b}{ra}}$ = 0.0558

Critical difference for two date of cutting means = SEdiff x t5%(45 d.f.) = 0.0558 x 2.02 = 0.113

(iii) S. E. of difference between two dates of cutting at the same level of variety = $\sqrt{\frac{2E_b}{r}}$ = 0.0966

Critical difference for two date of cutting means at the same level of variety = SEdiff x t5%(45 d.f.) = 0.0966 x 2.02 = 0.195

(iv) S. E. of difference between two variety means at the same or different level of date of cutting

= $\sqrt{\frac{2[(b-1)E_b+E_a]}{rb}}$ = 0.1354

For (iv) the standard error of the mean difference involves two error terms, so we use the following equation to

calculate a weighted t value:

$t = \frac{(b-1)E_b t_b + E_a t_a}{(b-1)E_b + E_a} = \frac{3\times 0.028\times 2.02+0.136\times 2.23}{3\times 0.028+0.136} = 2.150$, where ta and tb are the t values at the error d.f. of Ea and Eb

respectively.

Critical difference for two variety means at the same or different level of date of cutting = SEdiff x t

= 0.1354 x 2.150 = 0.291

Conclusions: There was no significant difference among variety means. Yield was significantly affected by

dates of final cutting. However, the interaction between variety and final date of cutting was not significant.
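The partition of the total sum of squares in a split plot experiment can be verified from the raw plot yields. The following Python sketch is one way to organise the calculation; the dictionary layout and the helper `ss` are illustrative choices, with the alfalfa yields copied from the data table of this example.

```python
# yields[variety][date] = yields in blocks 1..6 (tons per acre)
yields = {
    "Ladak": {"A": [2.17, 1.88, 1.62, 2.34, 1.58, 1.66],
              "B": [1.58, 1.26, 1.22, 1.59, 1.25, 0.94],
              "C": [2.29, 1.60, 1.67, 1.91, 1.39, 1.12],
              "D": [2.23, 2.01, 1.82, 2.10, 1.66, 1.10]},
    "Cossack": {"A": [2.33, 2.01, 1.70, 1.78, 1.42, 1.35],
                "B": [1.38, 1.30, 1.85, 1.09, 1.13, 1.06],
                "C": [1.86, 1.70, 1.81, 1.54, 1.67, 0.88],
                "D": [2.27, 1.81, 2.01, 1.40, 1.31, 1.06]},
    "Ranger": {"A": [1.75, 1.95, 2.13, 1.78, 1.31, 1.30],
               "B": [1.52, 1.47, 1.80, 1.37, 1.01, 1.31],
               "C": [1.55, 1.61, 1.82, 1.56, 1.23, 1.13],
               "D": [1.56, 1.72, 1.99, 1.55, 1.51, 1.33]},
}
varieties, dates = list(yields), ["A", "B", "C", "D"]
a, b, r = len(varieties), len(dates), 6

values = [y for v in varieties for d in dates for y in yields[v][d]]
cf = sum(values) ** 2 / len(values)  # correction factor

def ss(totals, plots_per_total):
    """Sum of squares of totals, each total being a sum of plots_per_total plots."""
    return sum(t * t for t in totals) / plots_per_total - cf

total_ss = ss(values, 1)
block_ss = ss([sum(yields[v][d][k] for v in varieties for d in dates) for k in range(r)], a * b)
variety_ss = ss([sum(sum(yields[v][d]) for d in dates) for v in varieties], r * b)
main_plot_ss = ss([sum(yields[v][d][k] for d in dates) for v in varieties for k in range(r)], b)
error1_ss = main_plot_ss - block_ss - variety_ss
date_ss = ss([sum(sum(yields[v][d]) for v in varieties) for d in dates], r * a)
cell_ss = ss([sum(yields[v][d]) for v in varieties for d in dates], r)
interaction_ss = cell_ss - variety_ss - date_ss
error2_ss = total_ss - block_ss - variety_ss - error1_ss - date_ss - interaction_ss

for name, value in [("Total", total_ss), ("Block", block_ss), ("Variety", variety_ss),
                    ("Error I", error1_ss), ("Dates", date_ss),
                    ("Interaction", interaction_ss), ("Error II", error2_ss)]:
    print(name, round(value, 2))
```

Dividing each sum of squares by its degrees of freedom gives the mean squares used in the F tests.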


Objective : Analysis of data in relation to strip plot experiment.

Kinds of data : The data relate to the four dates of optimum schedule for five different varieties of wheat

with three replications.

The layout plan and the yields in Kg/plot are given below :

S1 S3 S2 S4

Replication I V2 5.60 2.30 6.70 4.93

V5 5.46 5.87 2.63 6.78

V3 2.24 5.67 3.48 6.58

V1 5.67 6.89 2.56 3.78

V4 2.60 5.65 3.26 2.57

S3 S1 S4 S2

Replication II V4 3.50 6.45 4.80 6.90

V1 6.50 4.69 1.59 4.96

V5 5.32 6.89 2.45 5.36

V2 4.25 3.45 5.69 4.62

V3 2.86 4.39 4.68 2.90

S2 S3 S1 S4

Replication III V3 6.89 4.36 4.26 2.89

V4 4.89 4.58 5.69 5.36

V5 6.89 3.25 2.56 4.60

V2 2.68 4.89 8.90 6.09

V1 2.68 1.89 3.89 2.80

Solution : Prepare two way tables of Replication x Variety, Replication x spacing and Variety x spacing .

(a) Replication x Variety (each figure is a sum of 4 plots)

Replicate Variety

V1 V2 V3 V4 V5 Total

I 18.90 19.53 17.97 14.08 20.74 91.22

II 17.74 18.01 14.83 21.65 20.02 92.25

III 11.26 22.56 18.39 20.52 17.30 90.03

Total 47.90 60.10 51.19 56.25 58.06 273.50

(b) Replication x Spacing (each figure is a sum of 5 plots)

Replicate Spacing

S1 S2 S3 S4 Total

I 21.57 26.38 18.63 24.64 91.22

II 25.87 24.74 22.43 19.21 92.25

III 25.30 24.03 18.96 21.74 90.03

Total 72.74 75.15 60.02 65.59 273.50

(c) Variety x Spacing (each figure is a sum of 3 plots)

Variety Spacing

S1 S2 S3 S4 Total

V1 14.25 14.53 10.95 8.15 47.90

V2 17.95 9.60 15.84 16.71 60.10

V3 10.89 15.46 10.69 14.15 51.19

V4 14.74 17.44 11.34 12.73 56.25

V5 14.91 18.12 11.20 13.83 58.06

Total 72.74 75.15 60.02 65.59 273.50


Grand total of the observations = 273.50

Correction factor = $\frac{273.50^2}{60}$ = 1246.70

Total Sum of squares = $(5.60^2 + 2.30^2 + \cdots + 2.80^2)$ - CF = 158.92

Replicate Sum of squares = $\frac{91.22^2+92.25^2+90.03^2}{4\times 5}$ - CF = 0.128

Variety S.S. = $\frac{47.90^2+60.10^2+51.19^2+56.25^2+58.06^2}{3\times 4}$ - CF = 8.45

Total Sum of squares (1) = $\frac{18.90^2+19.53^2+\cdots+17.30^2}{4}$ - CF = 1278.19 - 1246.70 = 31.48

Error I = TSS(1) - Replicate S.S. - Variety S.S. = 31.48 - 0.128 - 8.45 = 22.91

Spacing Sum of squares = $\frac{72.74^2+75.15^2+60.02^2+65.59^2}{3\times 5}$ - CF = 1256.20 - 1246.70 = 9.50

Total Sum of squares (2) = $\frac{21.57^2+26.38^2+\cdots+21.74^2}{5}$ - CF = 1263.69 - 1246.70 = 16.99

Error II = TSS(2) - Replicate SS - Spacing SS = 16.99 - 0.128 - 9.50 = 7.36

Total Sum of squares (3) = $\frac{14.25^2+14.53^2+\cdots+13.83^2}{3}$ - CF = 1298.84 - 1246.70 = 52.14

Interaction = TSS(3) - Variety S.S. - Spacing S.S. = 52.14 - 8.45 - 9.50 = 34.19

Error III = Total sum of squares - all other sums of squares = 76.38


Source of variation | Degree of freedom | Sum of Square | Mean sum of square | Fcal | Ftab (5 %)
Replicates | 2 | 0.128 | 0.06 | |
Variety | 4 | 8.45 | 2.11 | 0.74 | F.05 at (4,8) = 3.84
Error I | 8 | 22.91 | 2.86 | |
Spacing | 3 | 9.50 | 3.17 | 2.58 | F.05 at (3,6) = 4.76
Error II | 6 | 7.36 | 1.23 | |
Variety x Spacing | 12 | 34.19 | 2.85 | 0.90 | F.05 at (12,24) = 2.18
Error III | 24 | 76.38 | 3.18 | |
Total | 59 | 158.92 | | |

None of the effects are found to be significant in the strip plot design. The standard errors of variety, spacing

and their interaction can be determined on the parallel line of split plot experiment as given below:

Here r = 3 replications, a = 5 varieties, b = 4 spacings, and Ea, Eb and Ec are the mean squares of Error I, Error II and Error III.

(i) S.E. of difference between two variety means = $\sqrt{\frac{2E_a}{rb}}$ = 0.690

(ii) S.E. of difference between two spacing means = $\sqrt{\frac{2E_b}{ra}}$ = 0.405

(iii) S.E. of difference between two variety means at the same level of spacing = $\sqrt{\frac{2[(b-1)E_c+E_a]}{rb}}$ = 1.44

(iv) S.E. of difference between two spacing means at the same level of variety = $\sqrt{\frac{2[(a-1)E_c+E_b]}{ra}}$ = 1.36

Critical difference is obtained by multiplying the standard error by the table value of t at the respective degrees of

freedom for (i) and (ii). For (iii) and (iv) the following equations are used to compute the weighted values

of t: $t = \frac{(b-1)E_c t_c + E_a t_a}{(b-1)E_c + E_a}$ and $t = \frac{(a-1)E_c t_c + E_b t_b}{(a-1)E_c + E_b}$, where ta, tb and tc are the table values of t at the error degrees of

freedom of Ea, Eb and Ec respectively.

18. Factorial Design

Factorial Design: In a factorial experiment the effects of several factors of variation are studied and investigated

simultaneously, the treatments being the combinations of different factors under study. In this experiment we

estimate the effect of each of the factors and also their interaction effects. In the case of a $2^n$ experiment there are n

factors each at 2 levels.

Let us suppose a $2^2$ experiment. There are 2 factors each at 2 levels. The treatment combinations are $a_0b_0$,

$a_1b_0$, $a_0b_1$ and $a_1b_1$.

In factorial experiment analysis can be done as usual manner in CRD, RBD but the treatment sum of square

is split into orthogonal components.

In factorial experiment factorial effect totals are given by the expression.

[A] = (a-1)(b+1) =[ab]-[b]+[a]-[1]

[B] = (a+1)(b-1) =[ab]+[b]-[a]-[1]

[AB] = (a-1)(b-1) = [ab]-[a]-[b]+[1]
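The symbolic products above amount to a simple sign rule: a treatment combination enters an effect with a factor of +1 for every letter of the effect it contains and -1 for every letter it lacks. A minimal sketch of that rule follows; the treatment totals are hypothetical numbers chosen only for illustration, and the function names are not from the manual.

```python
def sign(effect, treatment):
    """Coefficient (+1 or -1) of a treatment combination in an effect contrast.

    Each factor letter in `effect` contributes +1 if it is present in the
    treatment combination (upper level) and -1 otherwise.
    """
    s = 1
    for factor in effect:
        s *= 1 if factor in treatment else -1
    return s


def effect_total(effect, totals):
    """Factorial effect total: signed sum of the treatment totals."""
    return sum(sign(effect, t) * y for t, y in totals.items())


# Hypothetical treatment totals; the empty string stands for (1).
totals = {"": 40, "a": 52, "b": 46, "ab": 60}

for effect in ("a", "b", "ab"):
    print(effect.upper(), effect_total(effect, totals))
```

With these totals, [A] = 60 - 46 + 52 - 40 = 26, which matches the expansion [ab] - [b] + [a] - [1].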

Yates Method can also be used for computing factorial effect totals.

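| Treatment combination | Total yield | (1) | (2) | Effect total | SS = (effect total)²/(2² × r) |
|---|---|---|---|---|---|
| (1) | [1] | [1] + [a] | [1] + [a] + [b] + [ab] | GT | GT²/(2² × r) = CF |
| a | [a] | [b] + [ab] | [a] - [1] + [ab] - [b] | [A] | [A]²/(2² × r) |
| b | [b] | [a] - [1] | [b] + [ab] - [1] - [a] | [B] | [B]²/(2² × r) |
| ab | [ab] | [ab] - [b] | [ab] - [b] - [a] + [1] | [AB] | [AB]²/(2² × r) |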

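Analysis of variance table for $2^2$ factorial experiment

| Source of variation | Degrees of freedom | Sum of squares | Mean sum of squares | Fcal |
|---|---|---|---|---|
| Blocks | r - 1 | $S_R^2$ | $S_R^2/(r-1)$ | Mean sum of squares / error mean sum of squares |
| Main effect A | 1 | $S_A^2 = [A]^2/(2^2 r)$ | $S_A^2$ | Mean sum of squares / error mean sum of squares |
| Main effect B | 1 | $S_B^2 = [B]^2/(2^2 r)$ | $S_B^2$ | Mean sum of squares / error mean sum of squares |
| Interaction A × B | 1 | $S_{AB}^2 = [AB]^2/(2^2 r)$ | $S_{AB}^2$ | Mean sum of squares / error mean sum of squares |
| Error | 3(r - 1) | By subtraction | Error S.S./[3(r - 1)] | |
| Total | $2^2 r - 1$ | $\sum\sum y_{ij}^2 - CF$ | | |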

Similarly, in a $3^2$ experiment there are two factors, each at three levels, and the total number of treatment combinations is 9.


Objective: Analysis of a $2^3$ factorial experiment.

Kinds of data: A $2^3$ experiment in eight randomized blocks was conducted in order to obtain an idea of the interactions among three factors N, P and K, each at two levels. The design and yield per plot are given below:

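| Replicate | Block | Plot 1 | Plot 2 | Plot 3 | Plot 4 |
|---|---|---|---|---|---|
| Replicate 1 | Block 1 | (1) 25 | pk 24 | nk 32 | np 30 |
| Replicate 1 | Block 2 | n 30 | k 32 | npk 36 | p 27 |
| Replicate 2 | Block 3 | p 32 | npk 42 | n 46 | k 39 |
| Replicate 2 | Block 4 | nk 34 | (1) 44 | np 30 | pk 36 |
| Replicate 3 | Block 5 | npk 30 | k 32 | n 28 | p 26 |
| Replicate 3 | Block 6 | (1) 24 | pk 20 | nk 28 | np 36 |
| Replicate 4 | Block 7 | np 32 | (1) 34 | pk 39 | nk 41 |
| Replicate 4 | Block 8 | npk 45 | n 41 | p 29 | k 35 |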

Solution: Null hypothesis H0: Blocks as well as treatments are homogeneous.

First we calculate the block and treatment totals.

The eight block totals are: 111, 125, 159, 144, 116, 108, 146 and 150.

The eight treatment totals are [1] = 127, [n] = 145, [p] = 114, [np] = 128, [k] = 138, [nk] = 135, [pk] = 119 and [npk] = 153.

The total number of observations is 32.

Grand total = 1059

CF = $\frac{1059^2}{32} = 35046.28$

Total sum of squares = $(25^2 + 24^2 + \cdots + 35^2) - CF = 36381 - 35046.28 = 1334.72$

Block sum of squares = $\frac{236^2 + 303^2 + 224^2 + 296^2}{8} - CF = 35662.13 - 35046.28 = 615.845$, where 236, 303, 224 and 296 are the totals of the two blocks in each replicate (111 + 125, 159 + 144, 116 + 108 and 146 + 150).

Treatment sum of squares = $\frac{127^2 + 145^2 + \cdots + 153^2}{4} - CF = 35343.25 - 35046.28 = 296.97$

Error sum of squares = TSS - BSS - TRSS = 1334.72 - 615.845 - 296.97 = 421.90
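These sums of squares can be reproduced directly from the plot yields. The sketch below is one possible way to do it, assuming the layout is entered block by block; the "block" figure is computed from the four replicate totals, as in the formula above, and the variable names are illustrative.

```python
# Yields per block, keyed by treatment combination (layout of the 2^3 trial).
blocks = [
    {"(1)": 25, "pk": 24, "nk": 32, "np": 30},
    {"n": 30, "k": 32, "npk": 36, "p": 27},
    {"p": 32, "npk": 42, "n": 46, "k": 39},
    {"nk": 34, "(1)": 44, "np": 30, "pk": 36},
    {"npk": 30, "k": 32, "n": 28, "p": 26},
    {"(1)": 24, "pk": 20, "nk": 28, "np": 36},
    {"np": 32, "(1)": 34, "pk": 39, "nk": 41},
    {"npk": 45, "n": 41, "p": 29, "k": 35},
]

yields = [y for blk in blocks for y in blk.values()]
grand_total = sum(yields)
cf = grand_total ** 2 / len(yields)                 # correction factor
total_ss = sum(y * y for y in yields) - cf

# Treatment totals over the four replicates.
treatment_totals = {}
for blk in blocks:
    for t, y in blk.items():
        treatment_totals[t] = treatment_totals.get(t, 0) + y
treatment_ss = sum(v * v for v in treatment_totals.values()) / 4 - cf

# Totals of consecutive block pairs, i.e. the four replicates (8 plots each).
block_totals = [sum(blk.values()) for blk in blocks]
replicate_totals = [block_totals[i] + block_totals[i + 1] for i in range(0, 8, 2)]
replicate_ss = sum(v * v for v in replicate_totals) / 8 - cf

error_ss = total_ss - replicate_ss - treatment_ss
print(round(cf, 2), round(total_ss, 2), round(replicate_ss, 3),
      round(treatment_ss, 2), round(error_ss, 2))
```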

Now we break up the treatment S.S. with 7 d.f. into 7 orthogonal components, each with 1 d.f. For this we use Yates' method to compute the factorial effect totals and their sums of squares.

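| Treatment combination | Total yield | (1) | (2) | (3) | Effect total | SS = (effect total)²/32 |
|---|---|---|---|---|---|---|
| (1) | 127 | 272 | 514 | 1059 | GT | 35046.28 |
| n | 145 | 242 | 545 | 63 | [N] | 124.03 |
| p | 114 | 273 | 32 | -31 | [P] | 30.03 |
| np | 128 | 272 | 31 | 33 | [NP] | 34.03 |
| k | 138 | 18 | -30 | 31 | [K] | 30.03 |
| nk | 135 | 14 | -1 | -1 | [NK] | 0.03 |
| pk | 119 | -3 | -4 | 29 | [PK] | 26.28 |
| npk | 153 | 34 | 37 | 41 | [NPK] | 52.53 |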
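The Yates columns are built by repeatedly writing the sums of successive pairs followed by their differences (second minus first). A short sketch of that procedure, applied to the eight treatment totals in standard order, is given below; the function name `yates` is an illustrative choice, not something defined in the manual.

```python
def yates(totals):
    """Apply Yates' method to 2^n treatment totals in standard order.

    Each pass writes the sums of successive pairs, then their differences
    (second minus first). After n passes the column holds GT and the
    factorial effect totals.
    """
    col = list(totals)
    n = len(col)
    passes = n.bit_length() - 1          # n = 2 ** passes
    for _ in range(passes):
        sums = [col[i] + col[i + 1] for i in range(0, n, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
        col = sums + diffs
    return col


labels = ["GT", "N", "P", "NP", "K", "NK", "PK", "NPK"]
totals = [127, 145, 114, 128, 138, 135, 119, 153]   # (1), n, p, np, k, nk, pk, npk

effects = dict(zip(labels, yates(totals)))
ss = {name: value ** 2 / 32 for name, value in effects.items()}

for name in labels:
    print(name, effects[name], round(ss[name], 2))
```

The seven effect sums of squares add up to the treatment sum of squares with 7 d.f., which gives a quick arithmetic check on the table.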


Analysis of variance table for $2^3$ factorial experiment

| Source of variation | Degrees of freedom | Sum of squares | Mean sum of squares | Fcal | Ftab (5 %) |
|---|---|---|---|---|---|
| Blocks | 7 | 615.84 | 87.98 | 3.54* | F.05 at (7,17) = 2.61 |
| N | 1 | 124.03 | 124.03 | 5.00* | F.05 at (1,17) = 4.45 |
| P | 1 | 30.03 | 30.03 | 1.21 | |
| NP | 1 | 34.03 | 34.03 | 1.37 | |
| K | 1 | 30.03 | 30.03 | 1.21 | |
| NK | 1 | 0.03 | 0.03 | 0.00 | |
| PK | 1 | 26.28 | 26.28 | 1.06 | |
| NPK | 1 | 52.53 | 52.53 | 2.12 | |
| Error | 17 | 421.90 | 24.82 | | |
| Total | 31 | 1334.72 | | | |

Here the blocks differ significantly. Among the treatment effects, only the main effect of N is significant in this experiment. The other effects are not significant.

Standard error of any factorial effect total = $\sqrt{r \cdot 2^3 \cdot S_E^2} = \sqrt{4 \times 8 \times 24.82} = 28.18$

Significant value for any factorial effect total = t5% for 17 d.f. × 28.18 = 2.109 × 28.18 = 59.45

Conclusion: Comparing this value with the factorial effect totals in the Yates table, we find that only the main effect N is significant and the others are non-significant.
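The comparison of each effect total with the significant value can be sketched as follows. The effect totals and the error mean square are taken from the tables above, and 2.109 is the t value quoted in the text; the variable names are illustrative.

```python
import math

r, n_factors = 4, 3
error_ms = 24.82          # error mean square from the ANOVA table
t_5pct = 2.109            # t at 5 % for 17 d.f., as quoted in the text

se_effect_total = math.sqrt(r * 2 ** n_factors * error_ms)
significant_value = t_5pct * se_effect_total

effect_totals = {"N": 63, "P": -31, "NP": 33, "K": 31, "NK": -1, "PK": 29, "NPK": 41}

# An effect is declared significant when its total exceeds the value in magnitude.
significant = [name for name, total in effect_totals.items()
               if abs(total) > significant_value]
print(round(se_effect_total, 2), significant)
```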

Objective: Analysis of data from a $3^2$ factorial experiment.

Kinds of data: Hypothetical data on two factors A and B, each at three levels, in four randomized blocks are given below:

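| Blocks | a2b2 | a2b1 | a2b0 | a1b2 | a1b1 | a1b0 | a0b2 | a0b1 | a0b0 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| I | 26 | 33 | 28 | 14 | 18 | 21 | 28 | 10 | 11 | 189 |
| II | 30 | 28 | 26 | 24 | 31 | 18 | 20 | 16 | 14 | 207 |
| III | 20 | 24 | 17 | 24 | 23 | 13 | 8 | 11 | 13 | 153 |
| IV | 24 | 27 | 17 | 22 | 12 | 8 | 24 | 19 | 18 | 171 |
| Total | 100 | 112 | 88 | 84 | 84 | 60 | 80 | 56 | 56 | 720 |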

Solution: H0: There is no significant difference among the three levels of factor A or of factor B.

First we construct the two-way table of treatment totals over all the blocks.

| A/B | b2 | b1 | b0 | Total |
|---|---|---|---|---|
| a2 | 100 | 112 | 88 | 300 |
| a1 | 84 | 84 | 60 | 228 |
| a0 | 80 | 56 | 56 | 192 |
| Total | 264 | 252 | 204 | 720 |

C.F. = $\frac{720^2}{36} = 14400$

Total sum of squares = $26^2 + 30^2 + \cdots + 18^2 - 14400 = 16028 - 14400 = 1628$

Block S.S. = $\frac{189^2 + 207^2 + 153^2 + 171^2}{9} - C.F. = 180$

Treatment S.S. = $\frac{100^2 + 112^2 + \cdots + 56^2}{4} - C.F. = 768$

Error S.S. = T.S.S. - Block S.S. - Treatment S.S. = 680
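The same quantities can be recomputed from the data table. This is a small sketch, assuming the yields are entered row by row (one row per block, nine treatment columns); the names are illustrative.

```python
# Rows: blocks I-IV. Columns: a2b2, a2b1, a2b0, a1b2, a1b1, a1b0, a0b2, a0b1, a0b0.
data = [
    [26, 33, 28, 14, 18, 21, 28, 10, 11],
    [30, 28, 26, 24, 31, 18, 20, 16, 14],
    [20, 24, 17, 24, 23, 13, 8, 11, 13],
    [24, 27, 17, 22, 12, 8, 24, 19, 18],
]

n_blocks = len(data)
n_treatments = len(data[0])

grand_total = sum(sum(row) for row in data)
cf = grand_total ** 2 / (n_blocks * n_treatments)

total_ss = sum(y * y for row in data for y in row) - cf
block_ss = sum(sum(row) ** 2 for row in data) / n_treatments - cf

treatment_totals = [sum(col) for col in zip(*data)]   # column totals
treatment_ss = sum(t * t for t in treatment_totals) / n_blocks - cf

error_ss = total_ss - block_ss - treatment_ss
print(cf, total_ss, block_ss, treatment_ss, error_ss)
```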

Now, let us compute the value and sum of squares of the eight contrasts.

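| Contrast | Value of Z | Divisor (D × r) | Sum of squares = Z²/(D × r) |
|---|---|---|---|
| A1 | 300 - 192 = 108 | 6 × 4 | 486 |
| A2 | 492 - 456 = 36 | 18 × 4 | 18 |
| B1 | 264 - 204 = 60 | 6 × 4 | 150 |
| B2 | 468 - 504 = -36 | 18 × 4 | 18 |
| A1B1 | 156 - 168 = -12 | 4 × 4 | 9 |
| A1B2 | 300 - 360 = -60 | 12 × 4 | 75 |
| A2B1 | 300 - 312 = -12 | 12 × 4 | 3 |
| A2B2 | 660 - 624 = 36 | 36 × 4 | 9 |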
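The values of Z and the divisors in the table are consistent with linear coefficients (-1, 0, 1) and quadratic coefficients (1, -2, 1) applied to the levels 0, 1, 2 of each factor. The sketch below makes that assumption explicit and rebuilds the eight contrasts from the two-way table of treatment totals; the names are illustrative.

```python
r = 4   # number of blocks (replications)

# T[i][j] = treatment total for level a_i of A and b_j of B (i, j = 0, 1, 2).
T = [
    [56, 56, 80],     # a0: b0, b1, b2
    [60, 84, 84],     # a1
    [88, 112, 100],   # a2
]

MEAN, LIN, QUAD = (1, 1, 1), (-1, 0, 1), (1, -2, 1)


def contrast(coef_a, coef_b):
    """Return (Z, D, SS) for the contrast with coefficients coef_a[i] * coef_b[j]."""
    cells = [(coef_a[i] * coef_b[j], T[i][j]) for i in range(3) for j in range(3)]
    z = sum(c * total for c, total in cells)
    d = sum(c * c for c, _ in cells)
    return z, d, z * z / (d * r)


contrasts = {
    "A1": (LIN, MEAN), "A2": (QUAD, MEAN),
    "B1": (MEAN, LIN), "B2": (MEAN, QUAD),
    "A1B1": (LIN, LIN), "A1B2": (LIN, QUAD),
    "A2B1": (QUAD, LIN), "A2B2": (QUAD, QUAD),
}

results = {name: contrast(ca, cb) for name, (ca, cb) in contrasts.items()}
for name, (z, d, ss) in results.items():
    print(name, z, d, ss)
```

Because the eight contrasts are mutually orthogonal, their sums of squares add up to the treatment sum of squares of 768.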

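Analysis of variance table for $3^2$ factorial experiment

| Source of variation | Degrees of freedom | Sum of squares | Mean sum of squares | Fcal | Ftab (5 %) |
|---|---|---|---|---|---|
| Blocks | 3 | 180 | 60 | | |
| Treatments | 8 | 768 | 96 | 3.39* | 2.36 |
| A1 | 1 | 486 | 486 | 17.15* | 4.26 |
| A2 | 1 | 18 | 18 | <1 | 4.26 |
| B1 | 1 | 150 | 150 | 5.29* | 4.26 |
| B2 | 1 | 18 | 18 | <1 | 4.26 |
| A1B1 | 1 | 9 | 9 | <1 | 4.26 |
| A1B2 | 1 | 75 | 75 | 2.65 | 4.26 |
| A2B1 | 1 | 3 | 3 | <1 | 4.26 |
| A2B2 | 1 | 9 | 9 | <1 | 4.26 |
| Error | 24 | 680 | 28.33 | | |
| Total | 35 | 1628 | | | |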

The treatments are found to be significant, especially the linear effects of the factors A and B (contrasts A1 and B1).

Remark: The $2^3$ factorial experiment analysed earlier in this chapter has four replicates, and each replicate has been divided into two blocks of four plots each. That experiment can therefore also be analysed by the method of complete confounding, including the sum of squares due to NPK in error, since NPK has been completely confounded; this analysis is given in practical No. 48.


19. Confounding

Confounding is the technique by which the precision of the main effects and of certain interactions (generally those of lower order) is increased by sacrificing precision on certain higher-order interactions.

Complete Confounding: When the same interaction is confounded in all the replicates, it is known as complete or total confounding.

The analysis procedure is the same as for a factorial experiment, except that one degree of freedom is lost from treatments and added to error, because the same interaction is confounded in all the replicates.

Partial confounding: This is used when we want to divide each replicate into smaller homogeneous blocks but do not want to lose all information on any of the interactions. In partial confounding a different effect is confounded in each replicate, so that each interaction effect can be estimated from the remaining replicates in which it was not confounded.

In partial confounding too, the analysis procedure is the same as for a factorial experiment. Only the calculation of the sums of squares of the partially confounded effects changes.

Objective: Layout of a completely confounded design

Kinds of data: A $2^3$ factorial experiment in which ABC is confounded in all three replicates.

Solution: The contrast for ABC is

ABC = (a-1)(b-1)(c-1) = (abc + a + b + c) - (ab + ac + bc + (1))

The same block contents are repeated in the other replications, with a fresh randomization within blocks.

Thus, the layout of a $2^3$ experiment in which ABC is confounded is given below:

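| Replicate | Block | Treatment combinations (in plot order) |
|---|---|---|
| Rep. I | Block 1 | a, abc, b, c |
| Rep. I | Block 2 | (1), ab, ac, bc |
| Rep. II | Block 3 | c, b, abc, a |
| Rep. II | Block 4 | ab, ac, bc, (1) |
| Rep. III | Block 5 | b, c, a, abc |
| Rep. III | Block 6 | bc, (1), ac, ab |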

Objective: Layout of a partially confounded design

Kinds of data: A $2^3$ factorial experiment in which ABC, AC and BC are confounded in three replicates, one in each.

Solution: The contrasts for ABC, AC and BC are

ABC = (a-1)(b-1)(c-1) = (abc + a + b + c) - (ab + ac + bc + (1))

AC = (a-1)(b+1)(c-1) = (abc + b + ac + (1)) - (a + c + ab + bc)

BC = (a+1)(b-1)(c-1) = (abc + a + bc + (1)) - (ab + ac + b + c)

Thus, the layout of a $2^3$ experiment in which ABC, AC and BC are confounded is given below:

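| Replicate (effect confounded) | Block | Treatment combinations (in plot order) |
|---|---|---|
| Rep. I (ABC) | Block 1 | a, abc, b, c |
| Rep. I (ABC) | Block 2 | ac, (1), ab, bc |
| Rep. II (AC) | Block 3 | abc, b, (1), ac |
| Rep. II (AC) | Block 4 | a, ab, bc, c |
| Rep. III (BC) | Block 5 | (1), a, bc, abc |
| Rep. III (BC) | Block 6 | b, ab, c, ac |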
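The two blocks of a replicate can be generated from the effect to be confounded: treatment combinations that share an even number of letters with the effect carry one sign in its contrast and go into the block containing (1), while the remaining combinations form the other block. The following sketch applies this standard even/odd rule; the function names are illustrative and not taken from the manual.

```python
from itertools import combinations


def treatment_combinations(factors="abc"):
    """All 2^n treatment combinations; the empty string stands for (1)."""
    return ["".join(c)
            for k in range(len(factors) + 1)
            for c in combinations(factors, k)]


def blocks_confounding(effect, factors="abc"):
    """Split the treatment combinations into two blocks confounding `effect`.

    Combinations sharing an even number of letters with the effect form the
    block that contains (1); the others form the second block.
    """
    with_one, without_one = [], []
    for t in treatment_combinations(factors):
        shared = sum(1 for f in effect if f in t)
        (with_one if shared % 2 == 0 else without_one).append(t or "(1)")
    return with_one, without_one


for effect in ("abc", "ac", "bc"):
    print(effect.upper(), blocks_confounding(effect))
```

The order within each block is arbitrary here; in the field layout the plots within a block are randomized afresh in every replicate.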


Objective: Analysis of data of complete confounding

Kinds of data: The following data relate to a completely confounded $2^3$ experiment on the factors A, B and C, conducted in 4 replications. In each replicate the interaction ABC is confounded.

Effect confounded: ABC (in every replicate)

| Replicate 1, block (i) | Replicate 1, block (ii) | Replicate 2, block (i) | Replicate 2, block (ii) | Replicate 3, block (i) | Replicate 3, block (ii) | Replicate 4, block (i) | Replicate 4, block (ii) |
|---|---|---|---|---|---|---|---|
| (1) 19.1 | a 18.6 | (1) 20.7 | a 25.9 | (1) 23.4 | a 22.2 | (1) 19.1 | a 23.6 |
| ab 19.2 | b 18.2 | ab 22.1 | b 23.0 | ab 20.4 | b 21.0 | ab 21.9 | b 23.7 |
| ac 18.8 | c 19.0 | ac 21.2 | c 24.9 | ac 23.2 | c 23.6 | ac 18.6 | c 21.0 |
| bc 19.4 | abc 20.4 | bc 20.1 | abc 23.4 | bc 20.3 | abc 21.6 | bc 21.5 | abc 22.8 |

Solution: H0: The data are homogeneous with respect to blocks and treatments.

Since each replicate has been divided into 2 blocks, one effect has been confounded in each replicate. We find that the effect ABC has been confounded in every replicate.

Grand total of the observations, GT = 681.9

Correction factor = $\frac{681.9^2}{32} = 14530.863$

Raw S.S. = $19.1^2 + 18.6^2 + \cdots + 22.8^2 = 14658.87$

Total sum of squares (corrected) = 14658.87 - CF = 128.007

The eight block totals are 76.5, 76.2, 84.1, 97.2, 87.3, 88.4, 81.1 and 91.1.

So, Block sum of squares = $\frac{76.5^2 + 76.2^2 + \cdots + 91.1^2}{4} - CF = 92.03$

The sums of squares due to the 6 unconfounded factorial effects are obtained by Yates' technique.

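| Treatment combination | Total yield | (1) | (2) | (3) | Effect total | SS = (effect total)²/32 |
|---|---|---|---|---|---|---|
| (1) | 82.3 | 172.6 | 342.1 | 681.9 | GT | |
| a | 90.3 | 169.5 | 339.8 | 5.9 | [A] | 1.09 |
| b | 85.9 | 170.3 | 5.7 | -3.9 | [B] | 0.48 |
| ab | 83.6 | 169.5 | 0.2 | 3.3 | [AB] | 0.34 |
| c | 88.5 | 8 | -3.1 | -2.3 | [C] | 0.17 |
| ac | 81.8 | -2.3 | -0.8 | -5.5 | [AC] | 0.95 |
| bc | 81.3 | -6.7 | -10.3 | 2.3 | [BC] | 0.17 |
| abc | 88.2 | 6.9 | 13.6 | 23.9 | [ABC] | not estimable |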

Put all sum of squares in ANOVA table and test the main effects and interactions excluding

the factors that are completely confounded with blocks.
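The whole analysis can be traced from the plot yields. The sketch below is one way to do so, assuming the data are entered block by block; it uses the sign rule for effect contrasts instead of the Yates columns (both give the same effect totals), and the names are illustrative.

```python
# Yields by block; blocks 1-2 form replicate 1, blocks 3-4 replicate 2, and so on.
blocks = [
    {"(1)": 19.1, "ab": 19.2, "ac": 18.8, "bc": 19.4},
    {"a": 18.6, "b": 18.2, "c": 19.0, "abc": 20.4},
    {"(1)": 20.7, "ab": 22.1, "ac": 21.2, "bc": 20.1},
    {"a": 25.9, "b": 23.0, "c": 24.9, "abc": 23.4},
    {"(1)": 23.4, "ab": 20.4, "ac": 23.2, "bc": 20.3},
    {"a": 22.2, "b": 21.0, "c": 23.6, "abc": 21.6},
    {"(1)": 19.1, "ab": 21.9, "ac": 18.6, "bc": 21.5},
    {"a": 23.6, "b": 23.7, "c": 21.0, "abc": 22.8},
]


def sign(effect, treatment):
    """+1 or -1 coefficient of a treatment combination in an effect contrast."""
    s = 1
    for factor in effect:
        s *= 1 if factor in treatment else -1
    return s


yields = [y for blk in blocks for y in blk.values()]
cf = sum(yields) ** 2 / len(yields)
total_ss = sum(y * y for y in yields) - cf

block_totals = [sum(blk.values()) for blk in blocks]
block_ss = sum(t * t for t in block_totals) / 4 - cf

totals = {}
for blk in blocks:
    for t, y in blk.items():
        totals[t] = totals.get(t, 0.0) + y

# ABC is confounded with blocks, so only six effects are estimated.
effect_totals = {e: sum(sign(e, t) * y for t, y in totals.items())
                 for e in ("a", "b", "ab", "c", "ac", "bc")}
effect_ss = {e: v ** 2 / 32 for e, v in effect_totals.items()}

error_df = 31 - 7 - 6
error_ss = total_ss - block_ss - sum(effect_ss.values())
f_blocks = (block_ss / 7) / (error_ss / error_df)
print(round(block_ss, 2), round(error_ss, 2), round(f_blocks, 2))
```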



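| Source of variation | Degrees of freedom | Sum of squares | Mean sum of squares | Fcal | Ftab (5 %) |
|---|---|---|---|---|---|
| Blocks | 7 | 92.03 | 13.15 | 7.22* | F.05 at (7,18) = 2.58 |
| A | 1 | 1.09 | 1.09 | <1 | F.05 at (1,18) = 4.41 |
| B | 1 | 0.48 | 0.48 | <1 | |
| AB | 1 | 0.34 | 0.34 | <1 | |
| C | 1 | 0.17 | 0.17 | <1 | |
| AC | 1 | 0.95 | 0.95 | <1 | |
| BC | 1 | 0.17 | 0.17 | <1 | |
| Error | 18 | 32.79 | 1.82 | | |
| Total | 31 | 128.01 | | | |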

From the above analysis of variance table, we find that none of the treatment effects is significant, as would be expected for data taken from a uniformity trial. The block effect, however, is significant; hence confounding is found effective.

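Objective: Analysis of data of partial confounding.

Kinds of data: The following data relate to a partially confounded $2^3$ experiment taken from a uniformity trial. Each pair of columns is one replicate, headed by the effect confounded in it.

| AB, block (i) | AB, block (ii) | AC, block (i) | AC, block (ii) | BC, block (i) | BC, block (ii) | ABC, block (i) | ABC, block (ii) |
|---|---|---|---|---|---|---|---|
| (1) 25.7 | a 23.2 | (1) 27.6 | a 25.6 | (1) 21.4 | b 18.8 | (1) 23.9 | a 25.4 |
| ab 21.1 | b 21.0 | ac 26.7 | c 27.9 | bc 18.6 | c 16.0 | ab 21.4 | b 26.9 |
| c 17.6 | ac 18.6 | b 26.2 | ab 28.5 | a 18.8 | ab 16.4 | ac 20.6 | c 25.2 |
| abc 17.5 | bc 18.3 | abc 22.0 | bc 27.2 | abc 18.2 | ac 16.6 | bc 22.4 | abc 30.1 |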

Solution: H0: The data are homogeneous with respect to blocks and treatments. Since each replicate has been divided into 2 blocks, one effect has been confounded in each replicate. Replicate 1 confounds AB, replicate 2 confounds AC, replicate 3 confounds BC and replicate 4 confounds ABC. Hence, this is an example of partial confounding.

Grand total of the observations, GT = 715.4

Correction factor = $\frac{715.4^2}{32} = 15993.66$

Raw S.S. = $25.7^2 + 23.2^2 + \cdots + 30.1^2 = 16520.42$

Total sum of squares (corrected) = 16520.42 - CF = 526.76

The eight block totals are: 81.9, 81.1, 102.5, 109.2, 77.0, 67.8, 88.3 and 107.6.

So, Block sum of squares = $\frac{81.9^2 + 81.1^2 + \cdots + 107.6^2}{4} - CF = 410.39$


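The effect totals are first obtained by Yates' technique. The sums of squares of the three unconfounded main effects follow directly; the partially confounded interactions need the adjustment described below.

| Treatment combination | Total yield | (1) | (2) | (3) | Effect total | SS = (effect total)²/32 |
|---|---|---|---|---|---|---|
| (1) | 98.6 | 191.6 | 371.9 | 715.4 | GT | |
| a | 93.0 | 180.3 | 343.5 | -14 | [A] | 6.13 |
| b | 92.9 | 169.2 | -11.1 | -6.2 | [B] | 1.20 |
| ab | 87.4 | 174.3 | -2.9 | 5.6 | [AB] | not estimable directly |
| c | 86.7 | -5.6 | -11.3 | -28.4 | [C] | 25.21 |
| ac | 82.5 | -5.5 | 5.1 | 8.2 | [AC] | not estimable directly |
| bc | 86.5 | -4.2 | 0.1 | 16.4 | [BC] | not estimable directly |
| abc | 87.8 | 1.3 | 5.5 | 5.4 | [ABC] | not estimable directly |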

To estimate the sum of squares of a partially confounded effect, an adjustment factor (AF) is calculated for each interaction.

AF = [Total of the block containing (1) in the replicate in which the effect is confounded] - [Total of the block not containing (1) in that replicate]

Interaction AB is estimated by $\frac{1}{4}(a-1)(b-1)(c+1)$, in which the sign of (1) is positive. Hence

AF for AB = [25.7 + 21.1 + 17.6 + 17.5] - [23.2 + 21.0 + 18.6 + 18.3] = 0.8

Adjusted effect total for AB = 5.6 - 0.8 = 4.8

Sum of squares of AB = $\frac{[AB]^2}{24} = \frac{4.8^2}{24} = 0.96$

AF for AC = [27.6 + 26.7 + 26.2 + 22.0] - [25.6 + 27.9 + 28.5 + 27.2] = -6.7

Adjusted effect total for AC = 8.2 - (-6.7) = 14.9

Sum of squares of AC = $\frac{[AC]^2}{24} = \frac{14.9^2}{24} = 9.25$

AF for BC = 77.0 - 67.8 = 9.2

Adjusted effect total for BC = 16.4 - 9.2 = 7.2

Sum of squares of BC = $\frac{[BC]^2}{24} = \frac{7.2^2}{24} = 2.16$

Interaction ABC is estimated by $\frac{1}{4}(a-1)(b-1)(c-1)$, in which the sign of (1) is negative, so the order of subtraction is reversed. Hence

AF for ABC = [25.4 + 26.9 + 25.2 + 30.1] - [23.9 + 21.4 + 20.6 + 22.4] = 19.3

Adjusted effect total for ABC = 5.4 - 19.3 = -13.9

Sum of squares of ABC = $\frac{[ABC]^2}{24} = \frac{(-13.9)^2}{24} = 8.05$

S.S. due to error = total S.S. - block S.S. - treatment S.S. = 526.76 - 410.39 - 52.95 = 63.42
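The adjustment steps follow one pattern for all four interactions, which the sketch below makes explicit. It takes the Yates effect totals, the block totals and the total and block sums of squares from the text; the dictionary layout and names are illustrative assumptions.

```python
# Unadjusted effect totals from the Yates table.
yates_total = {"A": -14.0, "B": -6.2, "C": -28.4,
               "AB": 5.6, "AC": 8.2, "BC": 16.4, "ABC": 5.4}

# For each confounded interaction: (total of block containing (1),
# total of the other block) in the replicate where it is confounded.
confounded_blocks = {"AB": (81.9, 81.1), "AC": (102.5, 109.2),
                     "BC": (77.0, 67.8), "ABC": (88.3, 107.6)}

# Sign of (1) in each interaction contrast: +1 for even-letter effects, -1 for ABC.
sign_of_one = {"AB": 1, "AC": 1, "BC": 1, "ABC": -1}

r = 4
adjusted, ss = {}, {}
for effect, total in yates_total.items():
    if effect in confounded_blocks:
        with_one, without_one = confounded_blocks[effect]
        af = sign_of_one[effect] * (with_one - without_one)   # adjustment factor
        adjusted[effect] = total - af
        ss[effect] = adjusted[effect] ** 2 / (8 * (r - 1))     # estimated from 3 replicates
    else:
        ss[effect] = total ** 2 / (8 * r)                      # estimated from all 4 replicates

treatment_ss = sum(ss.values())
total_ss, block_ss = 526.76, 410.39                            # from the text
error_ss = total_ss - block_ss - treatment_ss
print({k: round(v, 2) for k, v in ss.items()}, round(error_ss, 2))
```

The divisor 24 is 2³ × 3, since each partially confounded interaction is estimated from only three of the four replicates.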

Put all these sums of squares in an ANOVA table and test the main effects and their interactions.


Analysis of variance table for the partially confounded $2^3$ experiment

| Source of variation | Degrees of freedom | Sum of squares | Mean sum of squares | Fcal | Ftab |
|---|---|---|---|---|---|
| Blocks | 7 | 410.39 | 58.63 | 15.72* | F.05 at (7,17) = 2.61 |
| Treatments | 7 | 52.95 | | | |
| A | 1 | 6.12 | 6.12 | 1.64 | F.05 at (1,17) = 4.45 |
| B | 1 | 1.20 | 1.20 | <1 | F.01 at (1,17) = 8.40 |
| AB | 1 | 0.960 | 0.960 | <1 | |
| C | 1 | 25.205 | 25.205 | 6.76* | |
| AC | 1 | 9.25 | 9.25 | 2.48 | |
| BC | 1 | 2.16 | 2.16 | <1 | |
| ABC | 1 | 8.05 | 8.05 | 2.16 | |
| Error | 17 | 63.42 | 3.73 | | |
| Total | 31 | 526.76 | | | |

From the above table, it is observed that only the main effect of the factor C is significant at the 5% level. The other effects are found to be non-significant. The block effect is also found to be significant.

In comparing means it is important to keep in mind that the interactions are estimated from only ¾ of the replications. Thus, the standard error of a mean interaction response is $\sqrt{\frac{2 \times 3.73}{3 \times 2^2}} = 0.788$

Similarly, the standard error of a main effect is $\sqrt{\frac{2 \times 3.73}{4 \times 2^2}} = 0.683$
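The two standard errors differ only in the number of replicates from which the effect is estimated. A brief sketch of the arithmetic, using the error mean square from the ANOVA table and an illustrative helper function, is:

```python
import math

error_ms = 3.73   # error mean square from the ANOVA table


def se_effect(replicates_used):
    """S.E. of a mean effect estimated from the given number of replicates."""
    return math.sqrt(2 * error_ms / (replicates_used * 2 ** 2))


se_interaction = se_effect(3)   # partially confounded: 3 of 4 replicates
se_main = se_effect(4)          # unconfounded: all 4 replicates
print(round(se_interaction, 3), round(se_main, 3))
```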


REFERENCES:

1. Practicals in Statistics, by Dr. H. L. Sharma

2. Statistical Methods, by G. W. Snedecor

3. Experimental Designs and Survey Sampling: Methods and Applications, by H. L. Sharma

4. A Handbook of Agricultural Statistics, by Dr. S. R. S. Chandel

5. The Theory of Sample Surveys and Statistical Decisions, by K. S. Kushwaha and Rajesh Kumar

6. Fundamentals of Mathematical Statistics, by S. C. Gupta and V. K. Kapoor
