lseidman@matcmadison.edu

Biotechnology Laboratory Technician Program

Course: Basic Biotechnology Laboratory Skills for a Regulated Workplace

Lisa Seidman, Ph.D.

STATISTICS

A BRIEF INTRODUCTION

WHY LEARN ABOUT STATISTICS?

Statistics provides tools that are used in:

Quality control
Research
Measurements
Sports

IN THIS COURSE

We will use some of these tools:

Ideas
Vocabulary
A few calculations

VARIATION

There is variation in the natural world:

People vary
Measurements vary
Plants vary
Weather varies

Variation among organisms is the basis of natural selection and evolution

EXAMPLE

100 people take a drug and 75 of them get better

100 people don’t take the drug but 68 get better without it

Did the drug help?

VARIABILITY IS A PROBLEM

There is variation in response to the illness

There is variation in response to the drug

So it is difficult to figure out whether the drug helped

STATISTICS

Provides mathematical tools to help arrive at meaningful conclusions in the presence of variability

Might help researchers decide if a drug is helpful or not

This is a more advanced application of statistics than we will get into

DESCRIPTIVE STATISTICS

Chapter 16 in your textbook

Descriptive statistics is one area within statistics

DESCRIPTIVE STATISTICS

Provides tools to DESCRIBE, organize and interpret variability in our observations of the natural world

DEFINITIONS

Population: Entire group of events, objects, results, or individuals, all of which share some unifying characteristic

POPULATIONS

Examples:

All of a person’s red blood cells
All the enzyme molecules in a test tube
All the college students in the U.S.

SAMPLE

Sample: Portion of the whole population that represents the whole population

Example: It is virtually impossible to measure the level of hemoglobin in every cell of a patient

Rather, take a sample of the patient’s blood and measure the hemoglobin level

MORE ABOUT SAMPLES

Representative sample: sample that truly represents the variability in the population -- good sample

TWO VOCABULARY WORDS

A sample is random if all members of the population have an equal chance of being drawn

A sample is independent if the choice of one member does not influence the choice of another

Samples need to be taken randomly and independently in order to be representative
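One way to picture a random draw is to let a random number generator do the choosing. The sketch below is only an illustration: the batch of 500 numbered vials and the sample size of 15 are assumptions, not values from the course.

```python
import random

# Hypothetical population: 500 numbered vials in one batch (made up for illustration).
population = list(range(1, 501))

# A fixed seed makes the sketch repeatable; real sampling would not fix the seed.
rng = random.Random(42)

# random.sample draws without replacement, and every vial has the same
# chance of being selected, which is what "random" means here.
sample = rng.sample(population, k=15)

print(sorted(sample))
```

Because the generator, not a person, picks each vial, choosing one vial does not steer the choice toward its neighbors on the shelf.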

SAMPLING

How we take a sample is critical and often complex

If a sample is not taken correctly, it will not be representative

EXAMPLE

How would you sample a field of corn?

VARIABLES

Variables: Characteristics of a population (or a sample) that can be observed or measured

Called variables because they can vary among individuals

VARIABLES

Examples:

Blood hemoglobin levels
Activity of enzymes
Test scores of students

A population or sample can have many variables that can be studied

Example: The same population of six-year-old children can be studied for:

Height
Shoe size
Reading level
Etc.

DATA

Data: Observations of a variable (singular is datum)

May or may not be numerical

Examples:

Heights of all the children in a sample (numerical)
Lengths of insects (numerical)
Pictures of mouse kidney cells (not numerical)

ALWAYS UNCERTAINTY

Even if you take a sample correctly, there is uncertainty when you use a sample to represent the whole population

Various samples from the same population are unlikely to be identical

So, we need to be careful about drawing conclusions about a population based on a sample; there is always some uncertainty

SAMPLE SIZE

If a sample is drawn correctly, then, the larger the sample, the more likely it is to accurately reflect the entire population

If it is not done correctly, then a bigger sample may not be any better
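The effect of sample size can be seen in a small simulation. This is a rough sketch under stated assumptions: the population of 10,000 vial weights is invented, and the sample sizes of 5 and 100 are arbitrary choices. It draws many correctly taken random samples of each size and checks how much the sample means wander.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 vial weights in mg (made-up values near 200 mg).
population = [rng.gauss(200, 10) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=500):
    """Draw many random samples and report how much their means vary (as an SD)."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(100)

print(f"n = 5:   sample means vary by about {spread_small:.2f} mg")
print(f"n = 100: sample means vary by about {spread_large:.2f} mg")
```

The means of the larger samples cluster more tightly around the population mean. The sketch says nothing about badly drawn samples: if only the vials at the front of the shelf were ever picked, a bigger sample would repeat the same bias.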

How does this apply to the corn field?

INFERENTIAL STATISTICS

Another branch of statistics

We won’t talk about it much

Deals with tools to handle the uncertainty of using a sample to represent a population

EXAMPLE PROBLEM

In a quality control setting, 15 vials of product from a batch are tested. What is the sample? What is the population?

In an experiment, the effect of a carcinogenic compound was tested on 2000 lab rats. What is the sample? What is the population?

In a clinical study, a new drug was tested on fifty patients. What is the sample? What is the population?

ANSWERS

15 vials, the sample, were tested for QC. The population is all the vials in the batch.

The sample is the rats that were tested. The population is probably all lab rats.

The sample is the 50 patients tested in the trial. The population is all patients with the same condition.

EXAMPLE PROBLEM

An advertisement says that 2 out of 3 doctors recommend Brand X.

What is the sample?
What is the population?
Is the sample representative?
Does this statement ensure that Brand X is better than competitors?

ANSWER

Many abuses of statistics relate to poor sampling. The population of interest is all doctors. There is no way to know what the sample is. The sample could have included only relatives of employees at Brand X headquarters, or only doctors in a certain area. Therefore, the statement does not ensure that the majority of doctors recommend Brand X. It certainly does not ensure that Brand X is best.

DESCRIBING DATA SETS

Draw a sample from a population

Measure values for a particular variable

The result is a data set

DATA SETS

Individuals vary, therefore the data set has variation

Data without organization is like letters that aren’t arranged into words

Numerical data can be arranged in ways that are meaningful – or that are confusing or deceptive

DESCRIPTIVE STATISTICS

Provides tools to organize, summarize, and describe data in meaningful ways

Example: Exam scores for a class are the data set

What is the variable of interest?

We can summarize with the class “average”; what does this tell you?

A measure that describes a data set, such as the average, is sometimes called a “statistic”

Average gives information about the center of the data

MEDIAN AND MODE

Two other statistics that give information about the center of a set of data

Median is the middle value

Mode is the most frequent value

MEASURES OF CENTRAL TENDENCY

Measures that describe the center of a data set are called measures of central tendency:

Mean, median, and mode

HYPOTHETICAL DATA SET

2 5 6 7 8 3 9 3 10 4 7 4 6 11 9

Simplest way to organize them is to put in order:

2 3 3 4 4 5 6 6 7 7 8 9 9 10 11

By inspection they center around 6 or 7

MEAN

Mean is basically the same as the average

Add all the numbers together and divide by the number of values

2 3 3 4 4 5 6 6 7 7 8 9 9 10 11

What is the mean for this data set?

NOMENCLATURE

Mean = 6.3 = $\bar{X}$ (read “X bar”)

The observations are called $X_1$, $X_2$, etc. There are 15 observations in this example, so the last one is $X_{15}$

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Where n = number of values
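The formula for the mean can be checked with a few lines of Python. This is just a sketch of the arithmetic for the 15-value hypothetical data set; the variable names are arbitrary.

```python
# The hypothetical data set of 15 observations (X1 through X15).
data = [2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 9, 10, 11]

n = len(data)        # number of values
total = sum(data)    # the sum of all Xi
mean = total / n     # X bar

print(f"n = {n}, sum = {total}, mean = {mean:.1f}")
```

The sum is 94 and n is 15, so the mean is 94 / 15, which rounds to 6.3.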

EXAMPLE

Data set 2 3 3 4 5 6 7 8 9

What is the mode?

What is the median?
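After working the example by hand, the answers can be checked with Python's standard `statistics` module. A minimal sketch:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7, 8, 9]

# Median: the middle value once the data are in order (the 5th of 9 values).
median = statistics.median(data)

# Mode: the value that occurs most often.
mode = statistics.mode(data)

print(f"median = {median}, mode = {mode}")
```

With nine values, the median is the fifth value in the ordered list, and 3 is the only value that appears more than once.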

MEAN OF A POPULATION VERSUS THE MEAN OF A SAMPLE

Statisticians distinguish between the mean of a sample and the mean of a population

The sample mean is $\bar{X}$
The population mean is μ

It is rare to know the population mean, so the sample mean is used to represent it

DISPERSION

Data sets A and B both have the same average:

A: 4 5 5 5 6 6
B: 1 2 4 7 8 9

But they are not the same:

A is more clumped around the central value
B is more dispersed, or spread out
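A quick check of data sets A and B makes the point concrete. In this sketch the spread is summarized simply as the highest value minus the lowest, which is one possible choice among several.

```python
set_a = [4, 5, 5, 5, 6, 6]
set_b = [1, 2, 4, 7, 8, 9]

mean_a = sum(set_a) / len(set_a)
mean_b = sum(set_b) / len(set_b)

# Highest value minus lowest value: a very simple picture of spread.
spread_a = max(set_a) - min(set_a)
spread_b = max(set_b) - min(set_b)

print(f"A: mean = {mean_a:.2f}, highest - lowest = {spread_a}")
print(f"B: mean = {mean_b:.2f}, highest - lowest = {spread_b}")
```

Both means are 31 / 6, about 5.17, yet the values in B cover 8 units while those in A cover only 2.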

MEASURES OF DISPERSION

Measures of central tendency do not describe how dispersed a data set is

Measures of dispersion do; they describe how much the values in a data set vary from one another

MEASURES OF DISPERSION

Common measures of dispersion are:

Range
Variance
Standard deviation
Coefficient of variation

CALCULATIONS OF DISPERSION

Measures of dispersion, like measures of central tendency, are calculated

Range is the difference between the lowest and highest values in a data set

Example:

2 3 3 4 4 5 6 6 7 7 8 9 9 10 11

Range: 11 - 2 = 9, or 2 to 11

Range is not particularly informative because it is based on only two values from the data set

CALCULATING VARIANCE AND STANDARD DEVIATION

Variance and standard deviation are measures of the average amount by which each observation varies from the mean

Example: a data set of the lengths of 8 insects

4 cm 5 cm 6 cm 7 cm 7 cm 7 cm 9 cm 11 cm

4 cm 5 cm 6 cm 7 cm 7 cm 7 cm 9 cm 11 cm

The mean is 7 cm

How much do they vary from one another?

Intuitively, we might look at how much each point varies from the mean

This is called the deviation

CALCULATION OF DEVIATIONS FROM MEAN

4 cm 5 cm 6 cm 7 cm 7 cm 7 cm 9 cm 11 cm

Value - Mean    Deviation (in cm)
(4 - 7)         -3
(5 - 7)         -2
(6 - 7)         -1
(7 - 7)          0
(7 - 7)          0
(7 - 7)          0
(9 - 7)         +2
(11 - 7)        +4

Sum of deviations = 0

Sum of the deviations from the mean is always zero

Therefore, we cannot use the average deviation

Instead, mathematicians decided to square each deviation so that negative and positive deviations no longer cancel

Value - Mean    Deviation (in cm)    Squared deviation
(4 - 7)         -3                    9 cm²
(5 - 7)         -2                    4 cm²
(6 - 7)         -1                    1 cm²
(7 - 7)          0                    0
(7 - 7)          0                    0
(7 - 7)          0                    0
(9 - 7)         +2                    4 cm²
(11 - 7)        +4                   16 cm²

Total squared deviation = sum of squares = 34 cm²

VARIANCE

Total squared deviation (sum of squares) divided by the number of measurements:

$$\frac{34\ \text{cm}^2}{8} = 4.25\ \text{cm}^2$$

STANDARD DEVIATION

Square root of the variance:

$$\sqrt{4.25\ \text{cm}^2} = 2.06\ \text{cm}$$

Note that the SD has the same units as the data

Note also that the larger the variance and SD, the more dispersed are the data
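The whole chain of steps for the insect lengths (deviations, squared deviations, variance, SD) can be retraced in Python. This is a sketch of the arithmetic only; it divides by the number of measurements, exactly as in the calculation on these slides.

```python
import math

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

n = len(lengths_cm)
mean = sum(lengths_cm) / n                      # 7 cm

# Deviation of each value from the mean; these always add up to zero.
deviations = [x - mean for x in lengths_cm]

# Squaring removes the signs so the deviations no longer cancel.
sum_of_squares = sum(d ** 2 for d in deviations)

variance = sum_of_squares / n                   # in cm squared
sd = math.sqrt(variance)                        # back in cm

print(f"sum of deviations = {sum(deviations)}")
print(f"sum of squares    = {sum_of_squares} cm^2")
print(f"variance          = {variance} cm^2")
print(f"SD                = {sd:.2f} cm")
```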

VARIANCE AND SD OF POPULATION VS SAMPLE

Statisticians distinguish between the mean and SD of a population and those of a sample

The variance of a population is called sigma squared, $\sigma^2$

Variance of a sample is $S^2$

The standard deviation of a population is called sigma, σ

Standard deviation of a sample is S or SD

STANDARD DEVIATION OF A SAMPLE

$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
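To see what the n - 1 in the denominator does, the sketch below applies the sample formula to the eight insect lengths used earlier (4, 5, 6, 7, 7, 7, 9, 11 cm) and compares the result with dividing by n. Treating those eight insects as a sample is an assumption made here for illustration.

```python
import math
import statistics

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]
n = len(lengths_cm)
mean = sum(lengths_cm) / n
sum_of_squares = sum((x - mean) ** 2 for x in lengths_cm)

# Dividing by n treats the eight insects as the whole population.
population_sd = math.sqrt(sum_of_squares / n)

# Dividing by n - 1 treats them as a sample drawn from a larger population.
sample_sd = math.sqrt(sum_of_squares / (n - 1))

print(f"divide by n:     {population_sd:.2f} cm")
print(f"divide by n - 1: {sample_sd:.2f} cm")

# The standard library offers both versions under different names.
print(statistics.pstdev(lengths_cm), statistics.stdev(lengths_cm))
```

The sample SD comes out slightly larger (about 2.20 cm versus 2.06 cm). Calculators and spreadsheets usually offer both versions, so it is worth checking which one a given button or function computes.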

EXAMPLE PROBLEM

A biotechnology company sells cultures of E. coli. The bacteria are grown in batches that are freeze dried and packaged into vials. Each vial is expected to have 200 mg of bacteria. A QC technician tests a sample of vials from each batch and reports the mean weight and SD.

Batch Q-21 has a mean weight of 200 mg and an SD of 12 mg. Batch P-34 has a mean weight of 200 mg and an SD of 4 mg. Which batch appears to have been packaged in a more controlled fashion?

ANSWER

The SD can be interpreted as an indication of consistency. The SD of the weights of Batch P-34 is lower than of Batch Q-21. Therefore, the weights for vials for Batch P-34 are less dispersed than those for Batch Q-21 and Batch P-34 appears to have been better controlled.
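The coefficient of variation, listed earlier among the measures of dispersion, expresses the SD as a percentage of the mean. The sketch below applies the usual definition (CV = SD / mean × 100%), which these slides do not spell out, to the two batches.

```python
def coefficient_of_variation(mean, sd):
    """Return the SD as a percentage of the mean."""
    return 100 * sd / mean

# Mean and SD in mg, as reported for each batch.
batches = {"Q-21": (200, 12), "P-34": (200, 4)}

cv = {name: coefficient_of_variation(mean, sd) for name, (mean, sd) in batches.items()}

for name, value in cv.items():
    print(f"Batch {name}: CV = {value:.0f}%")
```

Because both batches have the same mean, the CV tells the same story as the SD here: 6% for Q-21 versus 2% for P-34. The CV becomes more useful when the means differ.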

FREQUENCY DISTRIBUTIONS

So far, we have talked about calculations to describe data sets

Now we will talk about graphical methods

TABLE 5
THE WEIGHTS OF 175 FIELD MICE (in grams)

19 22 20 24 22 19 27 20
21 22 20 22 24 24 21 25
19 21 20 23 25 22 19 17
20 20 21 25 21 22 27 22
19 22 23 22 25 22 24 23
20 21 22 23 21 24 19 21
22 22 25 22 23 20 23 22
22 26 21 24 23 21 25 20
23 20 21 24 23 18 20 23
21 22 22 25 21 23 22 24
20 21 23 21 19 21 24 20
22 23 20 22 19 22 24 20
25 21 22 22 24 21 22 23
25 21 19 19 21 23 22 22
24 21 23 22 23 28 20 23
26 21 22 24 20 21 23 20
22 23 21 19 20 26 22 20
21 22 23 24 20 21 23 22
24 21 23 22 24 21 22 24
20 22 21 23 26 21 22 23
24 21 23 20 20 21 25 22
20 22 21 21 23 22

FREQUENCY DISTRIBUTION TABLE OF THE WEIGHTS OF FIELD MICE

Weight (g)    Frequency
17             1
18             1
19            11
20            25
21            34
22            40
23            27
24            19
25            10
26             4
27             2
28             1
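A frequency table is already a compact description of the data, and a few summary values can be read straight from it. The sketch below types in the weights and frequencies from the table and prints a rough text version of the pattern; the `#` bars are just an illustration device.

```python
# Weight in grams -> number of mice with that weight (from the frequency table).
frequency = {
    17: 1, 18: 1, 19: 11, 20: 25, 21: 34, 22: 40,
    23: 27, 24: 19, 25: 10, 26: 4, 27: 2, 28: 1,
}

total_mice = sum(frequency.values())

# The mode is the weight with the highest frequency.
mode = max(frequency, key=frequency.get)

# Each weight counts once for every mouse that has it.
mean = sum(weight * count for weight, count in frequency.items()) / total_mice

for weight, count in frequency.items():
    print(f"{weight} g | {'#' * count}")

print(f"total = {total_mice}, mode = {mode} g, mean = {mean:.2f} g")
```

The printout piles up in the middle of the range and thins out toward both ends.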

FREQUENCY TABLE

Tells us that most mice have weights in the middle of the range, a few are lighter or heavier

The word distribution refers to a pattern of variation for a given variable

It is important to be aware of patterns, or distributions, that emerge when data are organized by frequency

The frequency distribution can be illustrated as a frequency histogram

FREQUENCY HISTOGRAM

X axis is units of measurement, in this example, weight in grams

Y axis is the frequency of a particular value

For example, 11 mice weighed 19 g

The values for these 11 mice are illustrated as a bar

Note that when the mouse data were collected, weights were recorded to the nearest gram, so a mouse recorded as 19 grams actually weighed between 18.5 g and 19.5 g.

Therefore the bar spans an interval of 1 gram

FIRST FOUR BARS

[Figure: the first four bars of the frequency histogram. X axis: weights in grams (17, 18, 19, 20). Y axis: frequency.]

CONSTRUCTING A FREQUENCY HISTOGRAM

Divide the range of the data into intervals

It is simplest to make each interval (class) the same width

There is no set rule as to how many intervals to have

For example, length data might be 0-9 cm, 10-19 cm, 20-29 cm and so on

Count the number of observations that are in each interval

Make a frequency table with each interval and the frequency of values in that interval

Label the axes of a graph with the intervals on the X axis and the frequency on the Y axis

Draw in bars where the height of a bar corresponds to the frequency of the value

Center the bars above the midpoint of the class interval

For example, if the interval is 0-9 cm, then the bar should be centered at 4.5 cm
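The construction steps can be sketched in Python as a text histogram. The lengths below are made-up whole-number values chosen only to illustrate the steps, and the interval width of 10 cm is an assumption.

```python
from collections import Counter

# Made-up lengths in cm, purely for illustration.
lengths_cm = [3, 7, 12, 15, 18, 21, 22, 24, 25, 27, 31, 33, 38, 44]
width = 10  # every class interval has the same width

# Integer division assigns each value to its interval: 0-9 -> 0, 10-19 -> 1, ...
counts = Counter(x // width for x in lengths_cm)

# One row per interval, including any empty intervals in between.
for k in range(max(counts) + 1):
    low = k * width
    high = low + width - 1
    midpoint = (low + high) / 2  # where the bar is centered, e.g. 4.5 for 0-9
    print(f"{low}-{high} cm (center {midpoint}) | {'#' * counts[k]}")
```

Each row of `#` marks plays the role of one bar: its length is the frequency for that interval, and the printed center is the midpoint the bar would sit above on a drawn graph.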

NORMAL FREQUENCY DISTRIBUTION

If the weights of very many lab mice were measured, we would likely have a frequency distribution that looks like a bell shape, also called the “normal distribution”

NORMAL DISTRIBUTION

[Figure: bell-shaped normal curve. X axis: weight. Y axis: frequency.]

NORMAL DISTRIBUTION

Very important

Examples:

Heights of humans
If you measure the same thing over and over, the measurements will have this distribution

CALCULATIONS AND GRAPHICAL METHODS

The two are related

The center of the peak of a normal curve is the mean, the median and the mode

Values are evenly spread out on either side of that high point

The width of the normal curve is related to the SD

The more dispersed the data, the higher the SD and the wider the normal curve

The exact relationship is in the text; we will not go into it this semester
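For the curious, Python's standard library can show the flavor of that relationship. The sketch below is optional and goes slightly beyond this course; the mean of 22 g and SD of 2 g are invented numbers, and the curve is an idealized normal distribution rather than real mouse data.

```python
from statistics import NormalDist

# Hypothetical normal curve: mean 22 g, SD 2 g (invented for illustration).
curve = NormalDist(mu=22, sigma=2)

def fraction_within(k):
    """Fraction of values lying within k SDs of the mean."""
    upper = curve.mean + k * curve.stdev
    lower = curve.mean - k * curve.stdev
    return curve.cdf(upper) - curve.cdf(lower)

within_one = fraction_within(1)
within_two = fraction_within(2)

print(f"within 1 SD of the mean: {within_one:.1%}")
print(f"within 2 SD of the mean: {within_two:.1%}")
```

For any normal curve, roughly 68% of values fall within one SD of the mean and roughly 95% within two, which is why a result more than two SDs from the mean tends to draw attention.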

EXAMPLE PROBLEM

A technician customarily performs a certain assay. The results of 8 typical assays are:

32.0 mg 28.9 mg 23.4 mg 30.7 mg 23.6 mg 21.5 mg 29.8 mg 27.4 mg

a. If the technician obtains a value of 18.1 mg, should he be concerned? Base your answer on estimation.

b. Perform statistical calculations to see if the answer is out of the range of two SDs.

ANSWER

a. The average appears to be in the mid-twenties, and the values hover within about ±5 mg of it. Therefore, 18.1 mg appears a bit low.

b. Mean = 27.16 mg, SD = 3.87 mg. The mean minus 2 SD is 19.4 mg, so 18.1 mg appears to be outside the range and should be investigated.
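The calculation in part b can be reproduced with Python's `statistics` module. This sketch uses the sample SD (dividing by n - 1), which is what gives the 3.87 mg quoted in the answer.

```python
import statistics

assays_mg = [32.0, 28.9, 23.4, 30.7, 23.6, 21.5, 29.8, 27.4]
new_value = 18.1

mean = statistics.mean(assays_mg)
sd = statistics.stdev(assays_mg)   # sample SD, n - 1 in the denominator

lower = mean - 2 * sd
upper = mean + 2 * sd
outside = not (lower <= new_value <= upper)

print(f"mean = {mean:.2f} mg, SD = {sd:.2f} mg")
print(f"two-SD range: {lower:.1f} mg to {upper:.1f} mg")
print(f"{new_value} mg outside the range? {outside}")
```

The lower limit works out to about 19.4 mg, so the 18.1 mg result falls below it.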
