Statistics

Statistics. Introduction When studying a process, we are interested in understanding sources of variation in our inputs and outputs. Descriptive statistical



Introduction

When studying a process, we are interested in understanding sources of variation in our inputs and outputs.

Descriptive statistical methods help:

· Understand patterns of variation in data, and

· Describe characteristics of a population.


Objectives

After completing this section, participants should be able to:

· Define and identify the three types of data: nominal, ordinal, and continuous;

· Construct and interpret histograms;

· Calculate and interpret the sample mean, standard deviation, variance, median, and range;

· Characterize distributions as skewed, symmetric, bimodal, multimodal, normal, and mound-shaped.


Introduction

When studying a process, we are interested in understanding sources of variation in our inputs and outputs.

Descriptive statistical methods help:

· Understand patterns of variation in data, and

· Describe characteristics of a population.

In this section, we address the following for continuous data:

· Graphical displays of data, and

· Numerical measures of location and spread.


Introduction

The methods used to display and analyze data depend on the type of data of interest.

Measurements can be classified into three types:

· Nominal – Measurements are unordered categories.

· Ordinal – Measurements are ordered categories.

· Continuous – The measurement of interest is in units of measure that, at least conceptually, follow a continuous scale.


Introduction

Another classification of data, used in distinguishing control charts, has two main categories:

· Attribute data: Counts and proportions resulting from nominal data.

· Variables data: Again, conceptually continuous measurements with a unit of measure.


Histograms

A Six Sigma team is charged with reducing delivery time to customers.

Delivery time is defined as the number of days between shipment of an order by the company and receipt of the order at the customer’s site.

Delivery time is obtained for 100 randomly chosen orders.

Note that we treat this data as continuous.


Histograms

The list of data values is not very informative. The 100 data values can be grouped in order to give more information on overall behavior.

This stem-and-leaf plot groups the data while retaining the original values.

Note that 36 deliveries occurred within 4 days, 27 took 5 to 9 days, etc.

What can you conclude about delivery times?


Histograms

The chart below, called a histogram, is a special kind of bar chart.

The height of each bar, given on the vertical axis, indicates the number of delivery times that fall within each interval, given on the horizontal axis.

The histogram respects the order that is implicit in the continuous data, and gives a cleaner picture of the data than does the stem-and-leaf plot.


Histograms

We say that the histogram gives a picture of the distribution of delivery times.

The continuous curve superimposed on the histogram gives a picture of the shape of the distribution.

This distribution has a long right tail.

We say that it is skewed to the right.


Histograms

We can use histograms to assess three characteristics of the distribution: centering, spread, and shape.

Where are the delivery times centered?

How much do they spread or vary?

What is the shape of the distribution?


Histograms

We can generate a histogram and discuss this distribution in terms of centering, spread, and shape.

Does the process appear to be meeting the specification limits (the blue vertical lines)?


Histograms

Common Histogram Shapes

Left Skewed: Data trails off to the left.

Symmetric: Data has approximately the same distribution on either side of the center.

Right Skewed: Data trails off to the right.


Histograms

More Histogram Shapes

Bi-modal or multi-modal: Data has more than one peak.

Uniform: Data is evenly distributed over its range.

Uni-modal: Data has one peak.


Histograms

Compare the centering and spread (variability) of these three distributions.


Histograms

Histograms provide many benefits. Histograms:

· Summarize the data.

· Allow one to assess centering, spread, and shape.

· Help to identify unusual patterns in data.

Histograms also have some limitations:

· Conclusions about the shape of the underlying distribution should not be drawn without a large enough data set (at least 75 randomly chosen data values; 100 data values are recommended).

· Individual data values are not shown.

· Improper bin sizes, as we will see on the following slide, can mask important data features.


Histograms

[Figure: three histograms of the same data, each drawn with a different bin width.]

9 bins of width 20.

18 bins of width 10: Too many bins? Do we see too much noise?

5 bins of width 40: Too few? Do we lose too much information?
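The effect of bin width can be explored with a short script. The following is a minimal sketch, not the procedure used to draw the slides: the data are simulated (hypothetical), and the helper name `bin_counts` and the interval starting at 70 are assumptions chosen only to mirror the three bin widths discussed above.

```python
import math
import random


def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each half-open bin [left, left + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the covered interval are ignored
            counts[index] += 1
    return counts


# Hypothetical data: 100 simulated measurements, NOT the data behind the slides.
random.seed(0)
data = [random.gauss(160, 35) for _ in range(100)]

for width in (10, 20, 40):
    n_bins = math.ceil(180 / width)  # enough bins to cover at least [70, 250)
    counts = bin_counts(data, start=70, width=width, n_bins=n_bins)
    print(f"{n_bins:>2} bins of width {width:>2}: {counts}")
```

Printing the counts side by side makes it easier to judge whether the narrow bins mostly show noise and whether the wide bins hide features such as a second peak.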


Measures of Location and Spread

Graphical displays are often supplemented with numerical measures that summarize the information in the data.

· Measures of centering or location include:

- Mean

- Median

- Mode

· Measures of spread or variability include:

- Variance

- Standard deviation

- Range


Measures of Location and Spread

In a study of pull-off force for bonded wires, given in foot-pounds, what can we conclude about the distribution of values?

What is a typical pull-off force?

How do the measurements vary about the center?


Measures of Location and Spread

The sample mean or average is the most important measure of centering.

The mean is the center of gravity or balancing point of a data set.

The sample mean, referred to as ‘X-bar’, is the average of all observations from a sample:

$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The sample mean is an estimate of the population mean, which is the average of all observations from a population.


Measures of Location and Spread

The sample mean for the pull-off force data:

$$\bar{X} = 13$$

Notice that the mean is denoted by a fulcrum to emphasize that it is the balancing point of the distribution of values.


Measures of Location and Spread

The sample median is the 50th percentile of the sample data.

Half of the data values lie below the median and half lie above the median.

The median is the middle value when the data are ordered.

The sample median is denoted $X_{0.50}$.

The sample median and sample mean are approximately the same if the distribution is symmetric.
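As a quick check of these two measures of centering, the sketch below computes the mean and median of the eight pull-off force values that appear in this section's variance calculation. It uses Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Pull-off force measurements (ft-lbs) from the worked example in this section.
pull_off = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

mean = statistics.mean(pull_off)      # sum of the values divided by n
median = statistics.median(pull_off)  # average of the two middle ordered values (n is even)

print(f"mean   = {mean:.2f} ft-lbs")
print(f"median = {median:.2f} ft-lbs")
```

For these data the two values agree, which is what one would expect when the distribution is roughly symmetric.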


Measures of Location and Spread

The sample mode is the most frequently occurring value in a dataset.

The mode is of little interest in itself.

Terms such as unimodal, bimodal and multimodal are of interest:

· A unimodal distribution has one peak.

· A bimodal distribution has two peaks.

· A multimodal distribution has two or more peaks.

Multimodal distributions are often indications that more than one underlying population or process is represented in the data.


Measures of Location and Spread

Example: Histogram of the Asphalt Content

[Figure: histogram of asphalt content; horizontal axis from 3.0 to 9.0.]

Possibly a mixture of data from four different batches of material?


Measures of Location and Spread

[Figure: four histograms, one per batch (Batch 1, Batch 2, Batch 3, Batch 4), each on a horizontal axis from 3 to 9.]


Measures of Location and Spread

The two most important measures of the variability or spread of sample data are the sample variance and sample standard deviation.

Sample Variance, denoted by $S^2$:

· $S^2$ is an estimate of the population variance, $\sigma^2$.

· $S^2$ = “average” squared distance between the data points and the sample mean.

Sample Standard Deviation, denoted by $S$:

· $S$ is an estimate of the population standard deviation, $\sigma$.

· $S$ = square root of “average” squared distance between the data points and the sample mean.


Measures of Location and Spread

The sample variance is the “average” squared distance between the data points and the sample mean:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$

The sample standard deviation is the square root of the sample variance:

$$S = \sqrt{S^2}$$

Note that the sample standard deviation is a value whose units are the original measurement units.


Measures of Location and Spread

Calculation of variance and standard deviation for the pull-off force data.

    i      Pull-off (x_i)    x_i - X-bar    (x_i - X-bar)^2
    1          12.6             -0.4             0.16
    2          12.9             -0.1             0.01
    3          13.4              0.4             0.16
    4          12.3             -0.7             0.49
    5          13.6              0.6             0.36
    6          13.5              0.5             0.25
    7          12.6             -0.4             0.16
    8          13.1              0.1             0.01
    Total     104.0              0.0             1.60

$$S^2 = \frac{1.60}{7} = 0.2286 \text{ ft-lbs}^2, \qquad S = \sqrt{0.2286} = 0.4781 \text{ ft-lbs}$$
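The same calculation can be reproduced step by step in code. This is a minimal sketch that follows the columns of the table (deviation, squared deviation, total) and divides by n - 1 as in the sample variance formula; the variable names are illustrative.

```python
import math

# Pull-off force measurements (ft-lbs) from the table above.
pull_off = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(pull_off)
x_bar = sum(pull_off) / n                           # sample mean
squared_devs = [(x - x_bar) ** 2 for x in pull_off]  # (x_i - X-bar)^2 column
s2 = sum(squared_devs) / (n - 1)                     # sample variance, divisor n - 1
s = math.sqrt(s2)                                    # sample standard deviation

print(f"sum of squared deviations = {sum(squared_devs):.2f}")
print(f"S^2 = {s2:.4f} ft-lbs^2")
print(f"S   = {s:.4f} ft-lbs")
```

Because the standard deviation is the square root of the variance, it is reported in the original measurement units (ft-lbs), while the variance is in squared units.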


Measures of Location and Spread

The sample range (R) is the difference between the largest observation and the smallest observation.

The sample range is the simplest measure of spread:

R = High value - Low value

This formula is often written as:

R = Xmax - Xmin

Example: For the pull-off force data,

R = Xmax - Xmin = 13.6 - 12.3 = 1.3 ft-lbs


Parameters and Statistics

We will start by defining some basic statistical terms used throughout this course.

A population is a set of all possible observations or units of interest.

· Some populations are finite (all parts in inventory today).

· Others are conceptual (all parts that can be produced by a machine at given settings).

A sample is a set of observations drawn from a population.

A random sample is a representative sample drawn from the population. Such a sample must be selected in a random manner so that each member of the population has an equal probability of being selected.
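For a finite population, drawing a simple random sample can be sketched in a few lines. The population below (500 part IDs) and the sample size of 25 are hypothetical values chosen for illustration; `random.Random.sample` draws without replacement, so every member has the same chance of being selected.

```python
import random

# Hypothetical finite population: 500 part IDs in inventory today.
population = [f"part-{i:03d}" for i in range(1, 501)]

rng = random.Random(42)  # fixed seed so the illustration is reproducible
sample = rng.sample(population, k=25)  # simple random sample without replacement

print(sample[:5])
```

A conceptual population (for example, all parts a machine could produce at given settings) cannot be listed this way, so in that case the randomness has to come from how and when units are collected.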


Parameters and Statistics

The population mean is the theoretical (unknown) average of all population measurements. For continuous data, the population mean is denoted by the Greek letter $\mu$ (mu).

The population standard deviation is the theoretical (unknown) standard deviation of a population. For continuous data, it is denoted by the Greek letter $\sigma$ (sigma).

The population variance is the theoretical variance of a population. For continuous data, it is denoted by $\sigma^2$ (sigma squared).

We virtually never know the true values of the population mean, standard deviation, or variance.


Parameters and Statistics

A parameter is a numerical value calculated from population data.

· The population mean, standard deviation, and variance are examples of parameters.

· Since we are virtually never able to compute parameters, they are theoretical quantities.

· Parameters are often represented by Greek letters: $\mu$, $\sigma$, and $\sigma^2$ are examples of this convention.

A statistic is a numerical value calculated from sample data.

· Examples are the sample mean ($\bar{X}$), the sample standard deviation ($S$), and the sample variance ($S^2$).

· These statistics are used to estimate the corresponding population parameters.


Parameters and Statistics

Population (unknown!): $\mu$, $\sigma$, $\sigma^2$, $p$

Sample (can calculate!): $\bar{X}$, $S$, $S^2$, $\hat{p}$


Parameters and Statistics

Other examples of parameters are:

· the median,

· the range, and

· any percentile of a population.

These are estimated by taking a random sample from a population, and calculating its median, range, and percentiles.

Of critical importance in estimating population parameters is the ability to draw a sample (often, but not always, a random sample) from the population.

In order to do this, the population of interest must be well-defined.


Distributions

The word distribution is used to describe the pattern formed by measurements.

For example, we discuss the distribution of cycle times or of errors.

In the case of a population of measurements, we talk about the theoretical distribution.

For example, cycle times from a sample might have the distribution given by the histogram on the next slide.

Keep in mind that the theoretical distribution is, in general, unknown. It is something that we try to estimate.


Distributions

The histogram might give cycle times for a sample of 100 orders.

The continuous curve that is overlaid on the histogram might represent the distribution of cycle times in the underlying population.

[Figure: histogram of cycle times (horizontal axis from 100 to 250) with a continuous curve overlaid.]


Distributions

Theoretical distributions are useful for two main reasons:

· Modeling data, and

· Providing a “yardstick” for sample statistics.

With this in mind, we will introduce several useful theoretical distributions:

· The normal (or Gaussian) distribution, which applies to continuous data,

· The binomial distribution, which applies to two-category nominal data, and

· The Poisson distribution, which applies to counts of occurrences.


The Normal Distribution

The normal distribution is the basis for many of the statistical techniques that we cover throughout the course.

There are many other distributions that are used for modeling continuous data.

However, the normal distribution is useful in terms of sample statistics as well as for modeling data.

The normal distribution has the classic bell-shape.

There are infinitely many normal distributions, each defined by a value for the mean, $\mu$, and one for the variance, $\sigma^2$.

If a quantity, call it X, has a normal distribution with mean $\mu$ and variance $\sigma^2$, we denote this by writing $X \sim N(\mu, \sigma^2)$.


The Normal Distribution

There are infinitely many possible normal distributions defined by values of the population parameters $\mu$ and $\sigma^2$.

Below, we see examples of normal curves with different means and variances.

[Figure: normal curves for N(15, 9) and N(25, 1).]


The Normal Distribution

Example: The distributions of characteristics of manufactured products are often normal.

Suppose that certain bonded wires have pull-off force measurements, X, that are normally distributed with mean 10 and variance 4. The distribution of X is denoted by X ~ N(10, 4). Note that $\sigma = 2$.

The shaded area shows P(X>13), where X represents the pull-off force.
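The shaded probability can be computed numerically. The sketch below standardizes 13 using the stated mean 10 and variance 4, then evaluates the upper-tail area with the complementary error function from Python's `math` module; the function name `normal_upper_tail` is illustrative.

```python
import math


def normal_upper_tail(x, mean, variance):
    """Return P(X > x) for X ~ N(mean, variance)."""
    z = (x - mean) / math.sqrt(variance)      # standardize: z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail area of the standard normal


p = normal_upper_tail(13, mean=10, variance=4)
print(f"P(X > 13) = {p:.4f}")  # z = 1.5, so the area is about 0.0668
```

In other words, under this model roughly 6.7% of wires would have a pull-off force above 13.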


The Normal Distribution

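[Figure: normal curve showing the area within one, two, and three standard deviations of the mean: 68.27% within $\mu \pm \sigma$, 95.45% within $\mu \pm 2\sigma$, and 99.73% within $\mu \pm 3\sigma$.]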


The Binomial Distribution

Another distribution that is extremely prevalent is called the binomial distribution.

Binomial data are data that result from a series of trials, where each trial results in only one of two possible values: pass or fail, success or failure, yes or no, etc.

To have a binomial distribution, three conditions must be met:

· The number of trials, denoted by n, is fixed in advance;

· The probability of obtaining a success (which is denoted by “p”) must be constant from trial to trial;

· The trials are independent (obtaining a success on one trial must not affect the likelihood of obtaining a success on another trial).


The Binomial Distribution

A binomial variable is the total number of successes in n trials where the previous conditions are satisfied.

The binomial distribution provides a model for many industrial situations.

The following are quantities that might well have binomial distributions:

· The number of parts produced with a particular type of defect;

· The number of orders for a given part that take in excess of 20 days to fill;

· The number of late deliveries of a certain type of shipment.


The Binomial Distribution

In the case of binomial data, we are usually interested in the proportion of successes in a series of trials (rather than the total number of successes).

Equivalently, we are interested in the probability p of a success.

For example, we may be interested in the proportion of parts with a defect, or the proportion of late deliveries.

The proportion of successes in a random sample drawn from a binomial distribution is denoted by $\hat{p}$.


The Binomial Distribution

Example:

Suppose that 100 records are randomly chosen from the data warehouse, and that 12 of these have a particular type of error.

Then an estimate of p, the proportion of records in the entire data warehouse that have this type of error, is given by:

$$\hat{p} = 12/100 = 0.12$$
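The estimate of p and the binomial model behind it can be sketched as follows. The probability function uses the standard binomial formula, C(n, k) p^k (1 - p)^(n - k), which the slides do not state explicitly; the function name `binomial_pmf` is illustrative, and plugging the estimate 0.12 in for the unknown p is an assumption made only for demonstration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Data warehouse example: 12 records with the error out of 100 sampled.
n, errors = 100, 12
p_hat = errors / n  # sample proportion, the estimate of p
print(f"p-hat = {p_hat:.2f}")

# Illustration only: treat p-hat as if it were the true p and ask how likely
# it would be to see at most 5 erroneous records in another sample of 100.
prob_at_most_5 = sum(binomial_pmf(k, n, p_hat) for k in range(6))
print(f"P(at most 5 errors | p = {p_hat}) = {prob_at_most_5:.4f}")
```

Note that this relies on the three binomial conditions holding: a fixed number of records, a constant error probability, and independence between records.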


The Poisson Distribution

Another distribution of interest is called the Poisson distribution.

The Poisson distribution is used to model the number of occurrences of an event that is relatively rare in some unit of time or space.

The following might be modeled by Poisson distributions:

· The number of stacking marks per month’s production of cups;

· The number of customer returns of a given type of product, reported weekly;

· The number of OSHA recordable injuries per 100,000 man hours;

· The number of defects in a large casting.


The Poisson Distribution

The parameter of interest for a Poisson distribution is the average number of occurrences in the unit of time or space.

Examples include the population mean of the number of errors of a specific type entering the data warehouse daily, or the population mean of the number of defects in large castings.

This theoretical mean is denoted by “c”, for “count”.

A sample can be used to estimate c. The estimate is simply the average of the counts in the sample.

For example, if the numbers of errors for five randomly chosen days are 8, 5, 6, 4, and 7, then c is estimated by

$$\hat{c} = 30/5 = 6$$
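The estimate of c, and what a Poisson model with that mean implies, can be sketched as below. The probability function uses the standard Poisson formula, e^(-c) c^k / k!, which the slides do not state explicitly; the names `poisson_pmf` and `c_estimate` are illustrative.

```python
import math


def poisson_pmf(k, c):
    """Probability of exactly k occurrences when the mean number of occurrences is c."""
    return math.exp(-c) * c**k / math.factorial(k)


# Error counts for five randomly chosen days, from the example above.
counts = [8, 5, 6, 4, 7]
c_estimate = sum(counts) / len(counts)  # average count, the estimate of c
print(f"estimated c = {c_estimate}")

# Illustration only: treating the estimate as the true mean, tabulate the
# modeled probability of seeing 0 through 3 errors in a day.
for k in range(4):
    print(f"P({k} errors) = {poisson_pmf(k, c_estimate):.4f}")
```

Five days is a small sample, so the estimate of c carries considerable uncertainty; the sketch only shows the mechanics.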


The Empirical Rule

The Empirical Rule provides an estimate of the proportion of data values falling within a certain distance of the mean.

The Empirical Rule states that, if a frequency distribution is approximately symmetric and mounded in shape, then:

· Approximately 68% of all values will fall within one standard deviation of the mean.

· Approximately 95% will fall within two standard deviations of the mean.

· Nearly 100% will fall within 3 standard deviations of the mean.

The Empirical Rule is derived from the probabilities associated with a normal random variable.
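The percentages behind the rule can be recovered directly from the normal distribution. For a normal random variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the sketch below evaluates it for k = 1, 2, 3 (the function name `within_k_sigma` is illustrative).

```python
import math


def within_k_sigma(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sigma(k):.2%}")
# within 1: 68.27%, within 2: 95.45%, within 3: 99.73%
```

These are the exact normal values that the rule rounds to 68%, 95%, and nearly 100%; for data that are only roughly symmetric and mound-shaped, the observed proportions will be approximate.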


The Empirical Rule

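We repeat a graph shown earlier.

[Figure: normal curve showing the area within one, two, and three standard deviations of the mean: 68.27% within $\mu \pm \sigma$, 95.45% within $\mu \pm 2\sigma$, and 99.73% within $\mu \pm 3\sigma$.]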