
Biostatistics (Dr Shilpi Gilra)


Bio Statistics

Introduction

• Any science needs precision for its development.

• For precision, facts, observations or measurements have to be

expressed in figures.

• “It has been said when you can measure what you are speaking about

and express it in numbers, you know something about it, but when

you cannot express it in numbers your knowledge is of meagre and

unsatisfactory kind.” - Lord Kelvin

• Similarly in medicine, be it diagnosis, treatment or research, everything depends on measurement.

• E.g. you have to count the number of missing teeth or measure the vertical dimension and express it as a number so that it makes sense.

• Statistic or datum means a measured or counted fact or piece of information stated as a figure, such as the height of one person, the birth weight of a baby etc.

• Statistics or data is the plural of the same.

• Statistics is the science of figures.

• Bio statistics is the term used when tools of statistics are applied to

data that is derived from biological sciences such as medicine.


Applications and uses of bio statistics as a science

• In physiology and anatomy

– To define the limits of normality for variables such as height, weight or blood pressure in a population.

– Variation beyond the natural limits may be pathological, i.e. abnormal, owing to the play of certain external factors.

– To find correlation between two variables like height and

weight.

• In pharmacology

– To find the action of drugs

– To compare the action of two drugs or two successive dosages

of same drug

– To find the relative potency of a new drug with respect to a

standard drug

• In medicine

– To compare the efficacy of a particular drug, operation or line of treatment

– To find association between two attributes such as cancer and

smoking

– To identify signs and symptoms of disease

• In community medicine and public health

– To test usefulness of sera or vaccine in the field


– In epidemiologic studies the role of causative factors is

statistically tested

• In research

– It helps in compilation of data , drawing conclusions and

making recommendations.

• For students

– By learning the methods in biostatistics a student learns to

evaluate articles published in medical and dental journals or

papers read in medical and dental conferences.

– He also understands the basic methods of observation in his

clinical practice and research.

Common Statistical Terms

• Constant

– Quantities that do not vary e.g. in biostatistics, mean, standard

deviation are considered constant for a population

• Variable

– A characteristic that takes different values for different persons, places or things, such as height, weight or blood pressure

• Population

– Population includes all persons, events and objects under study.

It may be finite or infinite.

• Sample


– Defined as a part of a population generally selected so as to be

representative of the population whose variables are under

study

• Parameter

– It is a constant that describes a population e.g. in a college there

are 40% girls. This describes the population, hence it is a

parameter.

• Statistic

– A statistic is a constant that describes a sample, e.g. in a sample of 200 students from the same college, 45% are girls. This 45% is a statistic, as it describes the sample

• Attribute

– A characteristic on the basis of which the population can be divided into categories or classes, e.g. gender, caste, religion.

Source of data

• The main sources for collection of data

– Experiments

– Surveys

– Records

• Experiments

– Experiments are performed to collect data for investigations

and research by one or more workers.

• Surveys


– Carried out for Epidemiological studies in the field by trained

teams to find incidence or prevalence of health or disease in a

community.

• Records

– Records are maintained as a routine in registers and books over

a long period of time

– They provide ready-made data.

Types of data

• Data is of two types

• Qualitative or discrete data

• In such data there is no notion of magnitude or size of an

attribute as the same cannot be measured.

• The number of persons having the same attribute varies and is counted

• E.g. out of 100 people, 75 have class I occlusion, 15 have class II occlusion and 10 have class III occlusion.

• Class I, II and III are attributes, which cannot be measured in figures; only the number of people having each can be determined

• Quantitative or continuous data

• Here the attribute has a magnitude. Both the attribute and the number of persons having the attribute vary


• E.g. freeway space. It varies for every patient. It is a quantity with a different value for each individual and is measurable. It is continuous, as it can take any value between 2 and 4 mm, such as 2.10, 2.55 or 3.07.

Data presentation

• Statistical data once collected should be systematically arranged and

presented

– To arouse interest of readers

– For data reduction

– To bring out important points clearly and strikingly

– For easy grasp and meaningful conclusions

– To facilitate further analysis

– To facilitate communication

• Two main types of data presentation are

– Tabulation

– Graphic representation with charts and diagrams

Tabulation

• It is the most common method

• Data presentation is in the form of columns and rows

• It can be of the following types

– Simple tables

– Frequency distribution tables


Simple Table

Month       Number of patients at KIDS, Bgm
Jan 06      2,800
Feb 06      1,900
March 06    1,750

Frequency distribution table

• In a frequency distribution table, the data is first split into convenient

groups ( class interval ) and the number of items ( frequency ) which

occurs in each group is shown in adjacent column.

Number of cavities    Number of patients
0 to 3                78
3 to 6                67
6 to 9                32
9 and above           16


Charts and diagrams

• Useful method of presenting statistical data

• Powerful impact on imagination of the people

They are

• Bar chart

• Histogram

• Frequency polygon

• Frequency curve

• Line diagram

• Cumulative frequency diagram or ogive

• Scatter diagram

• Pie chart

• Pictogram

• Spot map or map diagram

Bar chart

• Length of bars drawn vertical or horizontal is proportional to

frequency of variable.

• suitable scale is chosen

• bars usually equally spaced

• They are of three types

-simple bar chart

-multiple bar chart

• two or more variables are grouped together

-component bar chart


• bars are divided into two or more parts

• each part representing certain item and proportional to

magnitude of that item

Simple Bar Chart

Multiple Bar Chart


Component Bar Chart

Histogram

• pictorial presentation of frequency distribution

• consists of series of rectangles

• class interval given on horizontal axis, frequency on vertical axis

• area of rectangle is proportional to the frequency

Frequency polygon

• obtained by joining midpoints of histogram blocks at the height of

frequency by straight lines usually forming a polygon


Frequency curve

• when the number of observations is very large and the class interval is reduced, the frequency polygon loses its angulations, becoming a smooth curve known as a frequency curve

Line diagram

• line diagrams are used to show the trends of events with the passage of time


Cumulative Frequency Diagram

• graphical representation of cumulative frequency.

• the cumulative frequency of a class is obtained by adding its frequency to the frequencies of all previous classes

Scatter or Dot diagram

• shows relationship between two variables

• If the dots are clustered showing a straight line, it shows a relationship

of linear nature


Pie chart

• In this, the frequencies of the groups are shown as segments of a circle

• The angle of each segment denotes the frequency

• The angle is calculated as

– angle (in degrees) = (class frequency / total observations) × 360

Pictogram

• Popular method of presenting data to the common man


Spot map or map diagram

• These maps are prepared to show geographic distribution of

frequencies of characteristics

Measures of statistical averages or central tendency

• Average value in a distribution is the one central value around which

all the other observations are concentrated

• Average value helps

– to find most characteristic value of a set of measurements

– to find which group is better off by comparing the average of

one group with that of the other

• the most commonly used averages are

– mean

– median

– mode

Mean

• refers to arithmetic mean

• it is the summation of all the observations divided by the total number

of observations (n)

• denoted by x̄ for a sample and µ for a population

• x̄ = (x1 + x2 + x3 + … + xn) / n

• Advantages – it is easy to calculate

• Disadvantages – influenced by extreme values


Median

• When all the observations are arranged in either ascending or descending order, the middle observation is known as the median

• In case of an even number of observations, the average of the two middle values is taken

• The median is a better indicator of central value when extreme values are present, as it is not affected by them

Mode

• Most frequently occurring observation in a data is called mode

• Not often used in medical statistics.

Example

• Number of decayed teeth in 10 children

2,2,4,1,3,0,10,2,3,8

• Mean = 35 / 10 = 3.5

• Median: arranged in order (0, 1, 2, 2, 2, 3, 3, 4, 8, 10), the two middle values are 2 and 3, so median = (2 + 3) / 2 = 2.5

• Mode = 2 (occurs 3 times).
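The three averages for the decayed-teeth example can be checked with a few lines of Python. This is only an illustrative sketch using the standard library's `statistics` module; the variable names are chosen here and are not part of the original notes.

```python
from statistics import mean, median, mode

# Number of decayed teeth in 10 children (data from the example above)
decayed_teeth = [2, 2, 4, 1, 3, 0, 10, 2, 3, 8]

mean_value = mean(decayed_teeth)      # sum of all observations divided by n
median_value = median(decayed_teeth)  # average of the two middle values when n is even
mode_value = mode(decayed_teeth)      # most frequently occurring observation

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

Note how the two extreme values (8 and 10) pull the mean above the median, which is the disadvantage of the mean mentioned earlier.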

Types of variability

• There are three types of variability

– Biological variability

– Real variability

– Experimental variability

• Experimental variability is of three subtypes

– Observer Error

– Instrumental Error

– Sampling Error

Biological variability

• It is the natural difference which occurs in individuals due to age,

gender and other attributes which are inherent

• This difference is small and occurs by chance and is within certain

accepted biological limits

• e.g. vertical dimension may vary from patient to patient

Real Variability

• such variability is more than the normal biological limits

• the cause of difference is not inherent or natural and is due to some

external factors

• e.g. difference in incidence of cancer among smokers and non

smokers may be due to excessive smoking and not due to chance only

Experimental Variability

• it occurs due to the experimental study

• they are of three types

– Observer error

• the investigator may alter some information or not record

the measurement correctly

– Instrumental error

• this is due to defects in the measuring instrument

• both the observer and the instrument error are called non

sampling error

– Sampling error or errors of bias


• this is the error which occurs when the samples are not

chosen at random from population.

• Thus the sample does not truly represent the population

Measures of variation or dispersion

• Biological data collected by measurement shows variation

• e.g. BP of an individual can show variation even if taken by

standardized method and measured by the same person.

• Thus one should know what is the normal variation and how to

measure it.

• The various measures of variation or dispersion are

• Range

• Mean or average deviation

• Standard deviation

• Co efficient of variation

Range

• It is the simplest

• Defined as the difference between the highest and the lowest figures

in a sample

• Defines the normal limits of a biological characteristic e.g. freeway

space ranges between 2-4 mm

• Not satisfactory as based on two extreme values only


Mean deviation

• It is the average of the differences or deviations from the mean in any distribution, ignoring the + or – sign

• Denoted by MD

MD = ∑ |x – x̄| / n

x = observation

x̄ = mean

n = number of observations

Standard deviation

• Also called root mean square deviation

• It is an Improvement over mean deviation used most commonly in

statistical analysis

• Denoted by SD or s for sample and σ for a population

• Given by the formula

SD = √( ∑ (x – x̄)² / n ) for a population, or √( ∑ (x – x̄)² / (n – 1) ) for a sample

• Greater the standard deviation, greater will be the magnitude of

dispersion from mean


• Small standard deviation means a high degree of uniformity of the

observations

• Usually measurements beyond the range of mean ± 2 SD are considered rare or unusual in any distribution

• Uses of Standard Deviation

• It summarizes the deviation of a large distribution from its

mean.

• It helps in finding the suitable size of sample e.g. greater

deviation indicates the need for larger sample to draw

meaningful conclusions

• It helps in calculation of standard error which helps us to

determine whether the difference between two samples is by

chance or real

Coefficient of variation

• It is used to compare attributes having two different units of

measurement e.g. height and weight

• Denoted by CV

CV = (SD / Mean) × 100

• It is expressed as a percentage
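The four measures of dispersion described above can be sketched in Python as follows. This is a minimal illustration, assuming the sample form of the standard deviation (divisor n – 1) and reusing the decayed-teeth data from the averages example; the function name `dispersion` is hypothetical.

```python
from math import sqrt

def dispersion(values):
    """Return range, mean deviation, sample SD and coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    value_range = max(values) - min(values)
    # Mean deviation: average absolute deviation from the mean
    mean_deviation = sum(abs(x - mean) for x in values) / n
    # Sample standard deviation: assumes the n - 1 divisor
    sample_sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    # Coefficient of variation, expressed as a percentage
    cv_percent = sample_sd / mean * 100
    return {
        "range": value_range,
        "mean_deviation": mean_deviation,
        "sd": sample_sd,
        "cv_percent": cv_percent,
    }

# Number of decayed teeth in 10 children (same data as the averages example)
result = dispersion([2, 2, 4, 1, 3, 0, 10, 2, 3, 8])
for name, value in result.items():
    print(name, round(value, 2))
```

Because the CV is a ratio, it has no unit, which is what allows height and weight to be compared.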


Normal distribution or normal curve

• Considerable physiological variation occurs in any observation

• It is therefore necessary to

– Define normal limits

– Determine the chances of an observation being normal

– Determine the proportion of observations that lie within a given range

• The normal distribution or normal curve, used most commonly in statistics, helps us to find these

• Large number of observations with a narrow class interval gives a

frequency curve called the normal curve

It has the following characteristics

• Bell shaped

• Bilaterally symmetrical

• Frequency increases from one side reaches its highest and decreases

exactly the way it had increased

• The highest point denotes mean, median and mode which coincide


• Mean ± 1 SD includes 68.27% of all observations. Such observations are fairly common

• Mean ± 2 SD includes 95.45% of all observations, i.e. by convention values beyond this range are uncommon or rare. Their chance of being normal is 100 – 95.45%, i.e. only 4.55%.

• Mean ± 3 SD includes 99.73% of all observations. Values beyond this range are very rare. Their chance of being normal is only 0.27%

• These limits on either side of measurement are called confidence

limits

• the look of frequency distribution curve may vary depending on mean

and SD . thus it becomes necessary to standardize it.

• Eg- One study has SD as 3 and other has SD as 2,thus it becomes

difficult to compare them

• Thus normal curve is standardized by using the unit of standard

deviation to place any measurement with reference to mean.

• The curve that emerges through this procedure is called standard

normal curve
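The percentages quoted for mean ± 1, 2 and 3 SD follow from the area under the normal curve. As a rough check, the sketch below computes them with the error function from Python's `math` module; the helper name `coverage_within` is an assumption made for this illustration.

```python
from math import erf, sqrt

def coverage_within(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    inside = coverage_within(k) * 100
    # Percentage inside the limits and percentage falling beyond them
    print(f"Mean ± {k} SD: {inside:.2f}% inside, {100 - inside:.2f}% outside")
```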


Properties of standard normal curve

• smooth bell shaped

• perfectly symmetrical

• based on infinite number of observations thus curve does not touch X

axis

• mean is zero

• SD is always 1

• total area under the curve is 1

• mean median mode coincide


• the unit of SD here is the relative or standard normal deviate and is denoted by Z

• Z = (Observation – Mean) / SD

• With the help of Z value we can find the area under the curve from a

table

• This area helps to give the P value

Sampling

• It is not possible to include each and every member of population as it

will be time consuming, costly , laborious .

• therefore sampling is done

• Sampling is a process by which some unit of a population or universe

are selected for the study and by subjecting it to statistical

computation, conclusions are drawn about the population from which

these units are drawn

• The sample will be representative of the entire population only if

• It is sufficiently large

• It is unbiased

• Such sample will have its statistics almost equal to parameters of

entire population

Two main characteristics of a representative sample are

• Precision

• Unbiased character

Precision

• Precision depends on a sample size


• Ordinarily sample size should not be less than 30

• Precision = √n/s

• n = sample size , s = standard deviation

• Precision is directly proportional to square root of sample size, greater

the sample size greater the precision

• Also greater the SD, less will be the precision

• Thus in such cases to obtain precision, sample size needs to be

increased

Unbiased character

• The sample should be unbiased i.e. every individual should have an

equal chance to be selected in the sample.

• Thus a standard random sampling method should be used

• Non sampling errors can be taken care of by

– Using standardized instruments and criteria

– By single , double , triple blind trials

– Use of a control group

Determination of sample size

For Quantitative Data

• The investigator needs to decide how large an error due to sampling

defect is allowable i.e. allowable error L

• Either the investigator should start with assumed SD or do a pilot

study to estimate SD

sample size n = 4 SD² / L²


• E.g. the mean pulse rate of a population is 70 beats per minute with a standard deviation of 8 beats. What should the sample size be if the allowable error is ± 1 beat?

n = (4 × 8 × 8) / (1 × 1) = 256

• If L is less n will be more i.e. larger the sample size lesser is the error.
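The sample size formula for quantitative data can be written as a small function. This is a sketch only; the function and parameter names are chosen here for illustration.

```python
def sample_size_quantitative(sd, allowable_error):
    """Sample size n = 4 * SD^2 / L^2 for quantitative data."""
    return 4 * sd ** 2 / allowable_error ** 2

# Pulse rate example: SD = 8 beats, allowable error L = 1 beat
n_pulse = sample_size_quantitative(8, 1)
print(n_pulse)

# Halving the allowable error quadruples the required sample size
print(sample_size_quantitative(8, 0.5))
```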

For qualitative data

• In such data we deal with proportion

Sample size = n = 4 p q / L²

• p = proportion of positive character

• q = proportion of negative character

• q = 1-p or (100-p if expressed in percent)

• L = allowable error usually 10% of p

• E.g. the incidence rate in the last influenza epidemic was found to be 5% of the population exposed. What should the sample size be to find the incidence rate in the current epidemic if the allowable error is 10%?

• p = 5%, q = 95%

• L = 10% of p = 0.5%

n = (4 × 5 × 95) / (0.5 × 0.5) = 7600
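The same calculation for qualitative data is sketched below, with p and L expressed in percent as in the influenza example. The function name and the default of taking L as 10% of p are assumptions for this illustration.

```python
def sample_size_qualitative(p_percent, allowable_error_percent=None):
    """Sample size n = 4 * p * q / L^2, with p, q and L in percent."""
    q_percent = 100 - p_percent
    if allowable_error_percent is None:
        # Assumed default: allowable error is 10% of p
        allowable_error_percent = 0.10 * p_percent
    return 4 * p_percent * q_percent / allowable_error_percent ** 2

# Influenza example: p = 5%, L = 10% of p = 0.5%
n_influenza = sample_size_qualitative(5, 0.5)
print(n_influenza)
```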

Probability or p value


• Concept of probability is very important in statistics

• Probability is the chance of occurrence of any event or permutation

combination.

• It is denoted by p for sample and P for population

• In various tests of significance we are often interested to know whether the observed difference between 2 samples is due to chance (sampling variation) or to some external factor.

• Here the probability or p value is used

• P ranges from 0 to 1

• 0 = there is no chance that the observed difference could be due to sampling variation

• 1 = it is absolutely certain that observed difference between 2 samples

is due to sampling variation

• However such extreme values are rare.

• P = 0.4 i.e. chances that the difference is due to sampling variation is

4 in 10

• Obviously the chances that it is not due to sampling variation will be 6

in 10

• The essence of any test of significance is to find out p value and draw

inference

• If p value is 0.05 or more

• it is customary to accept that difference is due to chance

(sampling variation) .

• The observed difference is said to be statistically not

significant.

• If p value is less than 0.05


• the observed difference is not due to chance but due to the role of some external factors.

• The observed difference here is said to be statistically

significant.

From shape of normal curve

• We know that about 95% of observations lie within mean ± 2 SD. Thus the probability of a value falling outside this range is about 5%

From probability tables

• p value is also determined by probability tables in case of student t

test or chi square test

By area under normal curve

• Here z, the standard normal deviate, is calculated

• Corresponding to the z value, the area under the curve between the mean and z is determined (A)

• The probability is given by 2(0.5 – A)
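In place of a printed table, the area A and the resulting probability can be computed directly. The sketch below assumes A is the area between the mean and z, obtained from the error function; the function names are hypothetical.

```python
from math import erf, sqrt

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (0) and z."""
    return 0.5 * erf(abs(z) / sqrt(2))

def two_tailed_p(z):
    """Probability of a deviate at least as extreme as z: 2 * (0.5 - A)."""
    return 2 * (0.5 - area_mean_to_z(z))

for z in (1.0, 2.0, 3.0):
    print(f"z = {z}: A = {area_mean_to_z(z):.4f}, p = {two_tailed_p(z):.4f}")
```

The factor 2 counts both tails of the curve, so this corresponds to a two tailed test.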

Tests of significance

• Whatever be the sampling procedure or the care taken while selecting

sample, the sample statistics will differ from the population

parameters

• Also variations between 2 samples drawn from the same population

may also occur


• i.e. differences in the results between two research workers for the

same investigation may be observed

• Thus it becomes important to find out the significance of this

observed variation

• ie whether it is due to

• chance or biological variation (statistically not significant) OR

• due to influence of some external factors ( statistically

significant)

• To test whether the variation observed is of significance, the various

tests of significance are done. The test of significance can be broadly

classified as

1. Parametric tests

2. Non parametric tests

Parametric tests

• Parametric tests are those tests in which certain assumptions are made

about the population

– Population from which sample is drawn has normal distribution

– The variances of sample do not differ significantly

– The observations found are truly numerical thus arithmetic

procedure such as addition, division, and multiplication can be

used

• Since these tests make assumptions about the population parameters, they are called parametric tests.

• These are usually used to test the difference

• They are:


– Student t test( paired or unpaired)

– ANOVA

– Test of significance between two means

Non parametric tests

• In many biological investigations, the research worker may not know

the nature of distribution or other required values of the population.

• Also some biological measurements may not be true numerical values

hence arithmetic procedures are not possible in such cases.

• In such cases distribution free or non parametric tests are used in

which no assumption are made about the population parameters e.g.

• Mann Whitney test

• Chi square test

• Phi coefficient test

• Fisher’s Exact test

• Sign Test

• Friedman’s Test

• Test of significance can also be divided into one tailed or 2 tailed test

Two tailed test

• This test determines if there is a difference between the two groups

without specifying whether difference is higher or lower

• It includes both ends or tails of the normal distribution

• Such test is called Two tailed test


• Eg when one wants to know if mean IQ in malnourished children is

different from well nourished children but does not specify if it is

more or less

One tailed test

• In the test of significance when one wants to specifically know if the

difference between the two groups is higher or lower

• ie the direction plus or minus side is specified.

• Then one end or tail of the distribution is excluded

• eg if one wants to know if mal nourished children have less mean IQ

than well nourished then higher side of the distribution will be

excluded

• Such test of significance is called one tailed test

Stages in performing test of significance

• State the null hypothesis

• State the alternative hypothesis

• Accept or reject the null hypothesis

• Finally determine the p value

State the null hypothesis

• Null hypothesis

• It is a hypothesis of no difference between statistics of a sample and

parameter of the population or between statistics of two samples

• It nullifies the claim that the experimental result is different from or

better than the one observed already


State the Alternative hypothesis

• It is hypothesis stating that the sample result is different ie larger or

smaller than the value of population or statistics of one sample is

different from the other

Accept or reject the null hypothesis

• Null Hypothesis is accepted or rejected depending on whether the

result falls in zone of acceptance or zone of rejection

• If the result of a sample falls in the area of mean ± 2SE the null

hypothesis is accepted.

• This area of normal curve is called zone of acceptance for null

hypothesis

• If the result of sample falls beyond the area of mean ± 2 SE

• null hypothesis of no difference is rejected and alternate hypothesis

accepted

• This area of normal curve is called zone of rejection for null

hypothesis

Finally determine the p value

• P value is determined using any of the previously mentioned methods

• If p > 0.05 the difference is attributed to chance and is not statistically significant

• If p < 0.05 the difference is attributed to some external factor and is statistically significant


Types of error

• While drawing conclusions in a study we are likely to commit two

types of error.

– Type I error

– Type II error

Type I error

• This type of error occurs

• When we conclude that the difference is significant when in fact there

is no real difference in the population ie, we reject the null hypothesis

when it is true

• Denoted by α

Type II error

• This type of error occurs

• When we say that the difference is not significant when in fact there is

a real difference between the populations i.e. the null hypothesis is not

rejected when it is actually false

• It is denoted by β

Tests of significance for large samples

• These tests are used for sample size greater than 30

• The test used is Z test

• Z is the standard normal deviate and has been discussed under normal distribution

Z = (observation – mean) / SD


• However in Z test standard deviation is replaced by standard error

In Z test, Z = observed difference / standard error

• We know that standard deviation measures the variation within a sample

• Standard error is the measure of difference in values occurring

– between a sample and population

– between two samples of the same population

• Standard error used in Z test can be

– Standard error of mean

– Standard error of proportion

– Standard error of difference between 2 means

– Standard error of difference between 2 proportions

• If in the Z test Z > 2, i.e. if the observed difference between the 2 means or proportions is greater than 2 times the standard error of difference, then p < 0.05 according to the given table

Z    1.6    2.0    2.3    2.6
P    0.1    0.05   0.02   0.01

• Thus the difference is not due to chance and may be due to the influence of some external factor, i.e. the difference is statistically significant

Standard error of mean

• Used for quantitative data


• Standard error of mean measures the variation of the sample mean from the population mean and is given by

SE of mean = SD of sample / √n

• The population mean will lie within sample mean ± 2 standard errors of mean (about 95% confidence limits)

• This will enable us to know whether the sample mean is within the limits of population mean

Here Z = (sample mean – population mean) / SE of mean

• In a random sample of 100 the mean blood sugar is 80 mg% with SD 6 mg%. Within what limits will the population mean lie? What can be said about another sample whose mean is 82 mg%?

SE = 6 / √100 = 6 / 10 = 0.6

• Thus the population mean will be 80 ± 2 × 0.6 = 78.8 to 81.2

• A sample with a mean of 82 mg% is not within the limits of the population mean, thus it does not seem to be drawn from the same population
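The blood sugar example can be reproduced with a short sketch. The function names are assumptions made for this illustration, and the limits use the ± 2 SE convention from the notes.

```python
from math import sqrt

def standard_error_of_mean(sample_sd, n):
    """SE of mean = SD of sample / sqrt(n)."""
    return sample_sd / sqrt(n)

def population_mean_limits(sample_mean, sample_sd, n):
    """Limits for the population mean: sample mean ± 2 SE."""
    se = standard_error_of_mean(sample_sd, n)
    return sample_mean - 2 * se, sample_mean + 2 * se

# Blood sugar example: n = 100, mean = 80 mg%, SD = 6 mg%
lower, upper = population_mean_limits(80, 6, 100)
other_sample_mean = 82
within_limits = lower <= other_sample_mean <= upper

print(f"Limits: {lower:.1f} to {upper:.1f}")
print("Second sample within limits:", within_limits)
```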

Standard error of difference between 2 means

• Used for quantitative data

• It is the difference between means of two samples drawn from the

same population

• It helps to know what is the significance of difference obtained by 2

research workers for the same investigation

SE (x̄1 – x̄2) = √( SD1² / n1 + SD2² / n2 )

• Eg.Find the significance of difference in mean heights of 50 girls and

50 boys with following values

         Mean     SD
Girls    147.4    6.6
Boys     151.6    6.3

SE = √( (6.6)² / 50 + (6.3)² / 50 ) = 1.29

Z = observed difference / SE

Z = (151.6 – 147.4) / 1.29 = 3.26

• Since Z value is more than 2 ,p will be less than .05

• Thus difference is statistically significant and it can be concluded that

boys are taller than girls
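The height comparison can be sketched as a small function. The names are hypothetical, and the code keeps the unrounded standard error, so the Z value may differ in the second decimal from a hand calculation that rounds SE to 1.29 first.

```python
from math import sqrt

def z_for_two_means(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Z = observed difference / SE of difference between two means."""
    se_difference = sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    z = abs(mean_1 - mean_2) / se_difference
    return se_difference, z

# Heights of 50 boys (mean 151.6, SD 6.3) and 50 girls (mean 147.4, SD 6.6)
se_difference, z = z_for_two_means(151.6, 6.3, 50, 147.4, 6.6, 50)
significant = z > 2  # Z above 2 corresponds to p below 0.05

print(f"SE = {se_difference:.2f}, Z = {z:.2f}, significant: {significant}")
```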

Standard error of proportion

• In case of qualitative data where character remains same but its

frequency varies we express it in proportion instead of mean

• p is the proportion of individuals having the special character

• q is the proportion of individuals not having the character

• p + q = 1, or 100 if expressed as a percentage

• Standard error of proportion is the unit which measures variation in

proportion of a character from sample to population

SE of proportion = √( p × q / n )

p=proportion of positive character

q=proportion of negative character

n=sample size


• Also proportion of population = proportion of sample ± 2 SEP

• Thus one can determine whether the proportion of sample is within

limits of population proportion

• E.g. the proportion of blood group B among Indians is 30%. If in a sample of 100 individuals it is 25%, what is your conclusion about the group?

SEP = √( p × q / n ) = √( 25 × 75 / 100 ) = 4.33

Z = observed difference / SE = (30 – 25) / 4.33 = 1.15

• Since Z is < 2, p will be more than 0.05, thus the difference is not significant.
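The blood group example can be checked with the sketch below, which works in percentages as the notes do. The function name is an assumption, and the standard error is based on the sample proportion, following the worked example.

```python
from math import sqrt

def z_for_proportion(sample_percent, population_percent, n):
    """Z = observed difference / SE of proportion (values in percent)."""
    q_percent = 100 - sample_percent
    se_proportion = sqrt(sample_percent * q_percent / n)
    z = abs(population_percent - sample_percent) / se_proportion
    return se_proportion, z

# Blood group B: 25% in a sample of 100, against 30% in the population
se_proportion, z_proportion = z_for_proportion(25, 30, 100)
significant_proportion = z_proportion > 2

print(f"SEP = {se_proportion:.2f}, Z = {z_proportion:.2f}")
print("Significant:", significant_proportion)
```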

Standard error of difference between 2 proportion

• Measures the difference in proportion of a character from sample to

sample

SE (p1 – p2) = √( p1 q1 / n1 + p2 q2 / n2 )

• If typhoid mortality in a sample of 100 is 20 % and other sample of

100 is 30% then is this difference in mortality rate significant ?

• p1 = 20 : q1 = 80 : n1 = 100

• p2 = 30 : q2 = 70 : n2 = 100

• SE(p1 – p2) = √( 20 × 80 / 100 + 30 × 70 / 100 ) = √37 = 6.08

• Z = (30 – 20) / 6.08 = 1.64

• Z < 2, so p > 0.05, thus the difference observed is not significant
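The typhoid mortality comparison follows the same pattern for two samples. This is an illustrative sketch with hypothetical names, again working in percentages.

```python
from math import sqrt

def z_for_two_proportions(p1_percent, n1, p2_percent, n2):
    """Z = observed difference / SE of difference between two proportions."""
    q1_percent = 100 - p1_percent
    q2_percent = 100 - p2_percent
    se_difference = sqrt(p1_percent * q1_percent / n1 + p2_percent * q2_percent / n2)
    z = abs(p1_percent - p2_percent) / se_difference
    return se_difference, z

# Typhoid mortality: 20% in one sample of 100, 30% in another sample of 100
se_two_props, z_two_props = z_for_two_proportions(20, 100, 30, 100)
significant_two_props = z_two_props > 2

print(f"SE = {se_two_props:.2f}, Z = {z_two_props:.2f}")
print("Significant:", significant_two_props)
```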

Test of significance for small samples


• In case of samples less than 30 the Z value will not follow the normal

distribution

• Hence Z test will not give the correct level of significance .

• In such cases Student’s t test is used

• It was given by W. S. Gosset, whose pen name was Student

• There are two types of student t Test

1. Unpaired t test

2. Paired t test

Unpaired t test

• Applied to unpaired data of observation made on individuals of 2

separate groups to find the significance of difference between 2 means

• Sample size is less than 30

• e.g. difference in accuracy in an impression using two different

impression materials

Steps in unpaired t Test are

• Calculate the mean of two samples

• Calculate combined standard deviation

• Calculate the standard error of mean which is given by

SEM = SD × √( 1/n1 + 1/n2 )

• Calculate observed difference between means X1 – X2

• Calculate t value = observed difference / Standard error of mean

• Determine the degree of freedom which is one less than no of

observation in a sample (n -1)

• Here combined degree of freedom will be = (n1 – 1) + (n2 – 1)


• Refer to table and find the probability of the t value corresponding to

degree of freedom

• P< 0.05 states difference is significant

• P> 0.05 states difference is not significant

• In a nutritional study 13 children in group A are given the usual diet along with vitamin A and vitamin D, while 12 children in group B take the usual diet only.

• The gain in weight in pounds for both groups after 12 months is shown in the table

Group A   Group B
5         1
3         3
4         2
3         4
2         2
6         1
3         3
2         4
3         3
6         2
7         2
5         3
3         -

• Are vitamins A and D responsible for the gain in weight?

• Mean of group A = 4

• Mean of group B = 2.5

• Combined SD = 1.37

• SE of the difference = 1.37 × √(1/13 + 1/12) = 0.548

• t = observed difference / SE


• t = (4 – 2.5) / 0.548 = 2.74

• Combined degrees of freedom = n1 + n2 – 2 = 13 + 12 – 2 = 23

• The p value corresponding to this t value at 23 d.f. is checked from the t table

• It is < 0.02

• Thus the difference is statistically significant

• and can be attributed to the role of vitamins A and D
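The steps of the unpaired t test can be sketched in Python using the weight-gain figures from the table. The function name `unpaired_t` is made up for illustration, and the critical values in the comments are the usual two-tailed t-table entries for 23 d.f. rather than anything computed here.

```python
import math
from statistics import mean

group_a = [5, 3, 4, 3, 2, 6, 3, 2, 3, 6, 7, 5, 3]  # usual diet + vitamins A and D
group_b = [1, 3, 2, 4, 2, 1, 3, 4, 3, 2, 2, 3]     # usual diet only

def unpaired_t(x, y):
    """Unpaired t test using a combined (pooled) standard deviation."""
    n1, n2 = len(x), len(y)
    m1, m2 = mean(x), mean(y)
    # Pool the squared deviations of each group about its own mean.
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    df = (n1 - 1) + (n2 - 1)
    sd = math.sqrt(ss / df)
    # Standard error of the difference between the two means.
    se = sd * math.sqrt(1 / n1 + 1 / n2)
    t = (m1 - m2) / se
    return m1, m2, sd, se, t, df

m1, m2, sd, se, t, df = unpaired_t(group_a, group_b)
print(f"means = {m1}, {m2}; SD = {sd:.2f}; SE = {se:.3f}; t = {t:.2f}; d.f. = {df}")

# Two-tailed t-table values at 23 d.f.: 2.069 (p = 0.05), 2.500 (p = 0.02), 2.807 (p = 0.01)
print("significant at 5%:", abs(t) > 2.069)
```

Without intermediate rounding the SE is 0.547 rather than 0.548, but t still rounds to 2.74 and lies between the 0.02 and 0.01 table values.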

Paired t test

• It is applied to paired observations from one sample only.

• Used in samples of less than 30

• Each individual gives a pair of observations, e.g. observations before and after taking a drug

• The steps involved are

• Calculate the difference in each pair of observations, i.e. before and after = x1 – x2 = y

• Calculate the mean of these differences = ȳ

• Calculate the SD of the differences

• Calculate SE = SD / √n

• Determine t = ȳ / SE

• Determine the degrees of freedom

• Since there is one sample, d.f. = n – 1

• Refer to table and find the probability of the t value corresponding to

degree of freedom

• P< 0.05 states difference is significant

• P> 0.05 states difference is not significant


E.g. systolic BP of nine normal individuals before and after injection of a hypotensive drug is given in the table. Does the drug lower the BP?

BP before drug (X1)   BP after drug (X2)   Difference X1 – X2 = y
122                   120                  2
121                   118                  3
120                   115                  5
115                   110                  5
126                   122                  4
130                   130                  0
120                   116                  4
125                   124                  1
128                   125                  3

• Mean of differences ȳ = ∑y / n = 27 / 9 = 3

• SD = √( ∑(y – ȳ)² / (n – 1) ) = √(24 / 8) = 1.73

• SE = SD / √n = 1.73 / √9 = 0.58

• t = ȳ / SE = 3 / 0.58 = 5.17

• Degrees of freedom = n – 1 = 9 – 1 = 8

• p value corresponding to t = 5.17 and d.f. 8 is < 0.001

• Thus the difference is highly significant

• Thus the decrease in BP can be attributed to the drug
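The paired t test on the BP readings can be sketched as follows. The function name `paired_t` is illustrative, and the critical value quoted in the comment is the usual two-tailed t-table entry for 8 d.f.

```python
import math
from statistics import mean, stdev

before = [122, 121, 120, 115, 126, 130, 120, 125, 128]
after = [120, 118, 115, 110, 122, 130, 116, 124, 125]

def paired_t(x1, x2):
    """Paired t test on before/after observations from the same individuals."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd = stdev(diffs)          # sample SD, divisor n - 1
    se = sd / math.sqrt(n)     # standard error of the mean difference
    t = mean_diff / se
    return mean_diff, sd, se, t, n - 1

mean_diff, sd, se, t, df = paired_t(before, after)
print(f"mean difference = {mean_diff}, SD = {sd:.2f}, SE = {se:.2f}, t = {t:.2f}, d.f. = {df}")

# Two-tailed t-table value at 8 d.f. for p = 0.001 is 5.041
print("p < 0.001:", abs(t) > 5.041)
```

Carrying full precision gives t of about 5.20 instead of the hand-calculated 5.17 (which uses SE rounded to 0.58); the conclusion is the same.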

Chi square test

• Chi square test, unlike the Z and t tests, is a non-parametric test

• The test involves calculation of a quantity called chi square.

• Chi square is denoted by χ²


• It was developed by Karl Pearson

• The most important applications of the chi square test in medical statistics are

• Test of proportion

• Test of association

• Test of goodness of fit

• Test of proportion

– Used as an alternative test to find the significance of difference in 2 or more proportions

• Test of association

– To measure the probability of association between 2 discrete attributes, e.g. smoking and cancer

• Test of goodness of fit

– Tests whether the observed values of a character differ from the expected values by chance or due to the play of some external factor

χ² = ∑ (O – E)² / E

• χ² denotes chi square

• O = observed value

• E = expected value

Steps in Chi Square Test

• State the null hypothesis

• Determine the Chi square value

• Find the degree of freedom

• Refer to the chi square table to find the probability value corresponding to the degrees of freedom


Let us consider the following example

• We are making a field trial of 2 vaccines

• The results of field trial are

Vaccine   Attacked   Not Attacked   Total   Attack Rate
A         22         68             90      24.4%
B         14         72             86      16.3%
Total     36         140            176

• Vaccine B seems to be superior to vaccine A

• We perform the chi square test to verify whether vaccine B is really superior to vaccine A or whether the difference is merely due to chance

State the null hypothesis

• It states that the vaccines have equal efficacy

Determining the Chi Square Value

• Find total attack and non attack rates

• Total Attack rate = 36 / 176 = 0.204

• Total Non Attack Rate = 140 / 176 = 0.795

Vaccine      Attacked                  Not Attacked
A (n = 90)   O = 22                    O = 68
             E = 0.204 × 90 = 18.36    E = 0.795 × 90 = 71.55
             O – E = +3.64             O – E = –3.55
B (n = 86)   O = 14                    O = 72
             E = 0.204 × 86 = 17.54    E = 0.795 × 86 = 68.37
             O – E = –3.54             O – E = +3.63

χ² = ∑ (O – E)² / E

= (3.64)² / 18.36 + (3.55)² / 71.55 + (3.54)² / 17.54 + (3.63)² / 68.37

= 0.72 + 0.18 + 0.71 + 0.19

= 1.80

• Find the Degree of Freedom = (c-1) (r-1)

• c = number of Columns

• r = number of Rows

• d.f. = (2-1)(2-1) = 1

• Find the p value

• On referring to Chi square table with one degree of freedom the p

value was more than 0.05.

• Hence the difference is not statistically significant and the null hypothesis of no difference between the vaccines is not rejected.
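The vaccine trial can be worked through in Python as a check. This is a sketch with an illustrative function name (`chi_square`); it computes expected values from the unrounded row and column totals, so the result agrees with the hand calculation only to within rounding. The p value line relies on the fact that, for 1 degree of freedom only, the chi square tail probability equals erfc(√(χ²/2)).

```python
import math

def chi_square(table):
    """Chi square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of equal efficacy.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: vaccine A, vaccine B; columns: attacked, not attacked.
chi2, df = chi_square([[22, 68], [14, 72]])

# Tail probability; this shortcut is valid for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi square = {chi2:.2f}, d.f. = {df}, p = {p_value:.2f}")
```

The statistic is about 1.80, below the table value of 3.84 for p = 0.05 at 1 d.f., so the difference between the vaccines is not significant.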

ANOVA

Analysis of variance

• Investigations may not always be confined to comparison of 2

samples only

• E.g. we might like to compare the difference in vertical dimension obtained using 3 or more methods like phonetics, swallowing and Niswonger's method


• In such cases where more than 2 samples are used ANOVA can be

used

• Also, when measurements are influenced by several factors playing their role, e.g. factors affecting retention of a denture, ANOVA can be used.

• ANOVA helps to decide which factors are more important

• Requirements

• Data for each group are assumed to be independent and

normally distributed

• Sampling should be at random

• One way ANOVA

– Where only one factor affects the result between groups

• Two way ANOVA

– Where 2 factors affect the result or outcome

• Multi way ANOVA

– Where three or more factors affect the result or outcome between groups

F test

F = Mean Square between Samples / Mean Square within Samples

• F = variance ratio

• The values of the mean squares are obtained from the analysis of variance table, as the sum of squares divided by the degrees of freedom (both of which are calculated)

• Mean Square between Samples

– It denotes the variation of the sample means of all groups involved in the study (A, B, C etc.) about the overall mean

• Mean Square within Samples

– It denotes the variation of the individual observations within each sample about that sample's own mean

• The greater the mean square between samples relative to the mean square within samples (i.e. the larger the F value), the greater the difference between the samples

• The F value observed from the study is compared with the theoretical F values obtained from the tables at the 1% and 5% levels of significance.

• The results are then interpreted.

• If the observed value is more than the theoretical value at 1%, the difference is highly significant.

• If the observed value is less than the theoretical value at 5%, it is not significant.

• If the observed value lies between the theoretical values at 5% and 1%, it is statistically significant.
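The variance ratio can be sketched in Python as below. The presentation gives no ANOVA data, so the sketch reuses the weight-gain figures of groups A and B from the unpaired t test example; this is an assumption made purely for illustration, and the function `one_way_anova_f` (an illustrative name) accepts any number of groups. With only two groups F equals t², which gives a convenient check. The critical values in the comment are the usual F-table entries for (1, 23) degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Variance ratio F for a one way ANOVA on two or more groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between samples: group means about the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within samples: observations about their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between, ms_within, ms_between / ms_within

group_a = [5, 3, 4, 3, 2, 6, 3, 2, 3, 6, 7, 5, 3]
group_b = [1, 3, 2, 4, 2, 1, 3, 4, 3, 2, 2, 3]

ms_between, ms_within, f = one_way_anova_f(group_a, group_b)
print(f"MS between = {ms_between:.2f}, MS within = {ms_within:.2f}, F = {f:.2f}")

# F-table values at (1, 23) d.f.: 4.28 at 5%, 7.88 at 1%
print("significant at 5%:", f > 4.28, "| highly significant at 1%:", f > 7.88)
```

Here F lies between the 5% and 1% table values, so the difference is statistically significant but not highly significant, in line with the t test result for the same data.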

Presented by Dr Shilpi Gilra
