
Chapter 1 Introduction

• Individual: objects described by a set of data (people, animals, or things)

• Variable: Characteristic of an individual. It can take on different values for different individuals.

Examples: age, height, gender, favorite class, speed, moisture, etc.

Types of Variables

• Quantitative: numerical values, can be added, subtracted, averaged, etc.

– ________: takes on values which are spaced. That is, for two values of a discrete variable that are adjacent, there is no value that goes between them.

– ________: values are all numbers in a given interval. That is, for two values of a continuous variable that are adjacent, there is another value that can go between the two.

• Categorical: An individual is placed into one of several groups or categories. These groups or categories are not usually numerical.

Types of Variables

Examples:

Variable          Numeric: Discrete    Numeric: Continuous    Categorical
Length
Hours Enrolled
Major
Zip Code

Distribution of a Variable

• The distribution of a variable tells us the possible values for the variable and the probability that the variable takes these values.

• Two ways to describe a distribution

– Numerically

– Graphically

Categorical Variables

• Suppose we poll 46 people on an issue. How can we exhibit their response?

• Numerically:

– Counts

– Proportions

– Percentages

• Graphically:

– Frequency Tables

– Bar Charts

– Pie Charts

Categorical Variables

• Suppose we poll 46 people on an issue. How can we exhibit their response?

– Frequency Tables:

• counts (14 agree)

• proportions (14/46 = .304 agree)

• percents (30.4% agree)

VOTE

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  agree          14        30.4        30.4              30.4
       disagree       23        50.0        50.0              80.4
       undecid.        9        19.6        19.6             100.0
       Total          46       100.0       100.0
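As a rough sketch, the counts, proportions, percents, and cumulative percents in the VOTE table can be recomputed from the raw responses. The list of responses below is rebuilt from the counts on the slide, and the variable names are illustrative only.

```python
from collections import Counter

# Responses rebuilt from the slide's counts: 14 agree, 23 disagree, 9 undecided.
responses = ["agree"] * 14 + ["disagree"] * 23 + ["undecided"] * 9

counts = Counter(responses)
n = sum(counts.values())

cumulative = 0.0
rows = []
for category in ["agree", "disagree", "undecided"]:
    count = counts[category]
    proportion = count / n
    percent = 100 * proportion
    cumulative += percent
    rows.append((category, count, round(proportion, 3),
                 round(percent, 1), round(cumulative, 1)))

# Each row: (category, count, proportion, percent, cumulative percent)
for row in rows:
    print(row)
```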

Categorical Variables

• Suppose we poll 46 people on an issue. How can we exhibit their response?

– Bar Chart: can have counts, percents or proportions on vertical axis

Categorical Variables

• Suppose we poll 46 people on an issue. How can we exhibit their response?

– Pie Chart:

Examining a Distribution

• To describe a distribution we need 3 items:

– Shape: modes, symmetric, skewed

– Center: mean, median

– Spread: range, standard deviation, IQR

• Look for the overall pattern and for striking deviations

– Outlier: individual value that falls outside the overall pattern

Numeric Variable Distributions

Shape:

– Modes: Major peaks in the distribution

– Symmetric: The values smaller and larger than the midpoint are mirror images of each other

– Skewed to the right: Right tail is much longer than the left tail

– Skewed to the left: Left tail is much longer than the right tail

Center:

– Mean: The arithmetic average. Add up the numbers and divide by the number of observations.

– Median: List the data from smallest to largest. If there is an odd number of data values, the median is the middle one in the list. If there is an even number of data values, average the middle two in the list.

Numeric Variable Distributions

Spread:

Range: The difference between the largest and smallest values. (Max – Min)

Standard Deviation: Measures spread by looking at how far observations are from their mean.

The computational formula for the standard deviation is

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Interquartile Range (IQR): Distance between the first quartile (Q1) and the third quartile (Q3). IQR = Q3 – Q1

Q1 – 25% of the observations are less than Q1 and 75% are greater than Q1.

Q3 – 75% of the observations are less than Q3 and 25% are greater than Q3.
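A minimal sketch of the standard deviation calculation, step by step: find the mean, square each deviation from it, divide the total by n - 1, and take the square root. The eight data values are hypothetical and chosen only to keep the arithmetic easy to follow; the result is compared against `statistics.stdev` from the Python standard library.

```python
import math
import statistics

# Hypothetical values for illustration only; they are not from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Squared distance of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

print(mean, round(s, 3), data_range)
print(round(statistics.stdev(data), 3))  # library result for comparison
```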

Numeric Variable Distributions

• Example 1.5 on page 11 of the book shows how much 50 consecutive shoppers spent in a store. The data appear as follows:

$3.11 $18.30 $24.50 $36.30 $50.30

$8.88 $18.40 $25.10 $38.60 $52.70

$9.26 $19.20 $26.20 $39.10 $54.80

$10.80 $19.50 $26.20 $41.00 $59.00

$12.60 $19.50 $27.60 $42.90 $61.20

$13.70 $20.10 $28.00 $44.00 $70.30

$15.20 $20.50 $28.00 $44.60 $82.70

$15.60 $22.20 $28.30 $45.40 $85.70

$17.00 $23.00 $32.00 $46.60 $86.30

$17.30 $24.40 $34.90 $48.60 $93.30

Numerical Variables

• How can we describe the distribution of these 50 numbers?

– Numerically

• Center: Mean or Median

• Spread: Quartiles, Range, IQR, or Standard deviation

– Graphically

• Frequency Table

• Histogram

• Boxplot

• Stem and Leaf

• Normal Quantile Plot

Descriptive statistics

The descriptives box from SPSS gives the mean, median, variance, standard deviation, minimum, maximum, range, and IQR.

Descriptives

                                        Statistic    Std. Error
Mean                                     34.6550      3.0682
95% Confidence Interval for Mean
    Lower Bound                          28.4891
    Upper Bound                          40.8209
5% Trimmed Mean                          33.1929
Median                                   27.8000
Variance                                 470.704
Std. Deviation                           21.6957
Minimum                                  3.11
Maximum                                  93.3
Range                                    90.2
Interquartile Range                      26.7000
Skewness                                 1.104        .337
Kurtosis                                 .711         .662

Percentiles

• 50th percentile is also called the median – the middle data value if ordered smallest to largest

• 25th and 75th percentiles are also called the quartiles: Q1 and Q3 respectively – the middle data value of each half

Percentiles

Percentile                           5         10        25        50        75        90        95
Weighted Average (Definition 1)    9.0890   12.7100   19.0000   27.8000   45.7000   69.3900   85.9700
Tukey's Hinges                                         19.2000   27.8000   45.4000
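The mean, median, range, and Tukey's hinges in the SPSS output can be checked against the 50 shopper amounts with a short sketch like the one below. The helper `tukey_hinges` is a hypothetical name; it takes the median of each half of the ordered data, which is the "middle data value of each half" rule. The weighted-average percentiles (19.0 and 45.7) come from a different interpolation rule that is not reproduced here.

```python
import statistics

# The 50 shopper amounts from the slide, in increasing order.
spent = [
    3.11, 8.88, 9.26, 10.80, 12.60, 13.70, 15.20, 15.60, 17.00, 17.30,
    18.30, 18.40, 19.20, 19.50, 19.50, 20.10, 20.50, 22.20, 23.00, 24.40,
    24.50, 25.10, 26.20, 26.20, 27.60, 28.00, 28.00, 28.30, 32.00, 34.90,
    36.30, 38.60, 39.10, 41.00, 42.90, 44.00, 44.60, 45.40, 46.60, 48.60,
    50.30, 52.70, 54.80, 59.00, 61.20, 70.30, 82.70, 85.70, 86.30, 93.30,
]


def tukey_hinges(values):
    """Return (lower hinge, median, upper hinge).

    Each hinge is the median of one half of the ordered data; when the
    number of values is odd, the overall median belongs to both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


mean = statistics.mean(spent)
median = statistics.median(spent)
spread = max(spent) - min(spent)
hinges = tukey_hinges(spent)

print(round(mean, 4), round(median, 4), round(spread, 2))
print([round(h, 4) for h in hinges])
```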

Frequency Table

– Notice the amount spent is broken into categories or groups

– Recall, frequency tables can be used for categorical variables as well

Category     Count or Frequency    Percent
0 - 10               3              6.00%
10 - 20             12             24.00%
20 - 30             13             26.00%
30 - 40              5             10.00%
40 - 50              7             14.00%
50 - 60              4              8.00%
60 - 70              1              2.00%
70 - 80              1              2.00%
80 - 90              3              6.00%
90 - 100             1              2.00%
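One way to build this table is to assign each amount to a $10-wide interval and count. The sketch below assumes a value that lands exactly on a boundary goes into the higher interval; none of the 50 amounts actually sits on a boundary, so the choice does not affect these counts.

```python
from collections import Counter

# The 50 shopper amounts from the slide.
spent = [
    3.11, 8.88, 9.26, 10.80, 12.60, 13.70, 15.20, 15.60, 17.00, 17.30,
    18.30, 18.40, 19.20, 19.50, 19.50, 20.10, 20.50, 22.20, 23.00, 24.40,
    24.50, 25.10, 26.20, 26.20, 27.60, 28.00, 28.00, 28.30, 32.00, 34.90,
    36.30, 38.60, 39.10, 41.00, 42.90, 44.00, 44.60, 45.40, 46.60, 48.60,
    50.30, 52.70, 54.80, 59.00, 61.20, 70.30, 82.70, 85.70, 86.30, 93.30,
]

width = 10  # interval width in dollars

# Interval index k covers width*k up to (but not including) width*(k + 1).
bin_counts = Counter(int(x // width) for x in spent)

table = []
for k in range(10):
    count = bin_counts.get(k, 0)
    label = f"{k * width} - {(k + 1) * width}"
    table.append((label, count, 100 * count / len(spent)))

for label, count, percent in table:
    print(f"{label:<9} {count:>3} {percent:6.2f}%")
```

The same counts are the bar heights of the histogram for these data.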

Histogram

– Breaks the range of values of a variable into intervals (midpoint is displayed here)

– Displays only the count or percent of the observations that fall into each interval

[Figure: histogram of the amounts spent; the horizontal axis shows interval midpoints 5, 15, ..., 95 and the vertical axis shows counts from 0 to 14]

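Box Plot

Minimum, Q1, Median, Q3, and Maximum

These five numbers are called the ____________________

What are these points?

[Figure: boxplot of the 50 amounts (N = 50) on a vertical axis from -20 to 100; observations 48, 49, and 50 are marked as individual points]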

Stem and Leaf Plot

• Works best for smaller data sets

– Example 1.4 on pg 10

• Here are the numbers of homeruns that Babe Ruth hit in each of his 15 years with the New York Yankees from 1920-1934:

– 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
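A stem and leaf plot of these 15 values can be sketched as follows, using the tens digit as the stem and the ones digit as the leaf. The function name `stem_and_leaf` is illustrative, and the sketch assumes non-negative whole numbers.

```python
# Babe Ruth's home run counts from the slide.
home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    ordered = sorted(values)
    # Create every stem between the smallest and largest, even if empty.
    stems = {stem: [] for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1)}
    for value in ordered:
        stems[value // 10].append(value % 10)
    return stems


plot = stem_and_leaf(home_runs)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```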

Normal Quantile Plot

– Normal Quantile Plot: compares the distribution of the sample to the Normal Distribution.

– The straight line represents a normal distribution; compare the dots to the line.

– If the dots fall close to the line, then the data are approximately normally distributed.
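The coordinates behind a normal quantile plot can be sketched as below: each ordered data value is paired with the standard normal quantile for its position in the sample. The plotting position (i - 0.5)/n is one common convention and is an assumption here; SPSS and other packages may use a slightly different one. The data are the 15 home run counts from the stem and leaf example.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Pair each ordered value with a standard normal quantile.

    Uses the plotting position (i - 0.5) / n for the i-th smallest value,
    which is an assumed convention, not the only one in use.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(scores, ordered))


home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
points = normal_quantile_points(home_runs)

# Plotting data value against normal score gives the dots of the plot.
for score, value in points:
    print(f"{score:6.2f}  {value}")
```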

Describing Numeric Variable Distributions

• Now, we examine the appearance of other data:

– Modes are major peaks in the distribution

[Figure: two histograms. The first has two modes (bimodal); the second has one mode (unimodal).]

Describing Numeric Variable Distributions

• Now, we examine the appearance of other data:

– This example is called right skewed since the distribution has a long right tail.

– This is an example of a boxplot that is skewed to the _______.

[Figure: histogram of "data" (horizontal axis 4.00 to 16.00, vertical axis Count) beside a boxplot of DATA (N = 46) with observations 31, 35, 39, and 40 marked as individual points]

Describing Numeric Variable Distributions

• ________: observations that are unusually far from the bulk of the data.

• What are some possible explanations for outliers?

– The data point was recorded wrong.

– The data point wasn’t actually a member of the population we were trying to sample.

– We just happened to get an extreme value in our sample.

• The 1.5 x IQR Criterion for Outliers: Designate an observation a suspected outlier if it falls more than 1.5 x IQR below the first quartile or above the third quartile.

1.5*IQR Criterion Example

• Suppose you had the following data set:

-2, 15, 3, 7, 10, 21, 1, 5, 12, 8, 1, 35, 10

List data from smallest to largest:

Find Q1, Median, Q3, Min, and Max:

IQR = Q3 – Q1 = ______

1.5*IQR = _______

Q1 – 1.5*IQR = ________ (If less than this number, then outlier)

Q3 + 1.5*IQR = ________ (If more than this number, then outlier)

Are there any outliers in this data set?
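The steps of this exercise can be sketched in code. The sketch takes Q1 and Q3 as the medians of the lower and upper halves of the ordered data, leaving the overall median out of both halves when the number of values is odd. That is an assumed convention; software that uses another quartile rule will give slightly different fences. The function names are illustrative.

```python
import statistics

data = [-2, 15, 3, 7, 10, 21, 1, 5, 12, 8, 1, 35, 10]


def quartiles(values):
    """Return (Q1, median, Q3) using medians of the two halves.

    With an odd number of values the overall median is left out of
    both halves (an assumed convention).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return q1, statistics.median(ordered), q3


def suspected_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower_fence or x > upper_fence]
    return flagged, lower_fence, upper_fence


q1, median, q3 = quartiles(data)
flagged, lower_fence, upper_fence = suspected_outliers(data)
print(sorted(data))
print(q1, median, q3, lower_fence, upper_fence, flagged)
```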

Describing Numeric Variable Distributions

• Symmetry versus Skewness:

[Figure: three boxplots of DATA (N = 73, N = 41, and N = 48), each on a vertical axis from -10 to 20, with observations 16 and 17 marked as individual points]

__________ _________ ___________

[Figure: three histograms with Count on the vertical axis; the histogram for the N = 41 data is annotated Std. Dev = 3.68, Mean = 8.4, N = 41.00]

Mean versus Median:

• For a skewed distribution, the mean is farther out in the longer tail than is the median.

                                   Left Skewed       Symmetric                      Right Skewed
Mean and median                    mean < median     mean = median                  mean > median
To describe distributions use      Median and IQR    Mean and standard deviation    Median and IQR

[Figure: the three boxplots of DATA (N = 73, N = 41, and N = 48) and their histograms, labeled Left Skewed, Symmetric, and Right Skewed]

Strategy for Exploring Data on a Single Quantitative Variable

1) Always plot your data: make a graph, usually a stem and leaf plot or a histogram

2) Look for overall pattern and for outliers

3) Calculate an appropriate numerical summary to briefly describe center and spread

4) Sometimes the overall pattern of a large number of observations is so regular that it can be described by a smooth curve

Introducing the Normal Distribution

It is customary to describe a normal distribution in the following way: $N(\mu, \sigma^2)$

Properties of the Normal Distribution:

1) Symmetric, bell-shaped

2) Mean, μ and standard deviation, σ

3) Area under the curve is 1

The Normal Distribution

Normal distributions can take on many different means and standard deviations. Only the general bell shape must remain the same.

Here are some examples of normal distributions:

$N(0, 1)$ with $\mu = 0$; $N(3, 2^2)$ with $\mu = 3$; $N(-2, 0.5^2)$ with $\mu = -2$

[Figure: three bell curves centered at 0, 3, and -2]

Distribution Properties

• Introducing: The Standard Normal Distribution

Properties:

1. _________________

2. _________________

3. _________________

Distribution Properties

• Empirical Rule (The 68-95-99.7 Rule): If the distribution is normal, then

– Approximately 68% of the data falls within one standard deviation of the mean

– Approximately 95% of the data falls within two standard deviations of the mean

– Approximately 99.7% of the data falls within three standard deviations of the mean
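The three percentages in the rule can be checked with a short sketch that uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The area within k standard deviations of the mean is the cumulative probability at +k minus the cumulative probability at -k.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area under the standard normal curve between -k and +k.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {100 * area:.1f}%")
```

The same areas hold for any normal distribution, because measuring in standard deviations from the mean removes the effect of μ and σ.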

Distribution Properties

Empirical Rule

Percentiles of a Standard Normal Curve

Empirical Rule Example

• If the grades on an exam are normally distributed with a mean of 68 and a variance of 16, what grade do you have to make to be in the top 15% of the class?
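A sketch of this example, under the stated normal model: a variance of 16 means a standard deviation of 4. By the empirical rule about 68% of grades fall between 64 and 72, so about 16% fall above 72, which makes 72 a close approximation to the top-15% cutoff. The exact cutoff is the 85th percentile, computed here with `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

mean = 68
variance = 16
sd = math.sqrt(variance)  # standard deviation is the square root of the variance

# Empirical rule approximation: about 16% of grades lie above mean + 1 sd.
rule_of_thumb = mean + sd

# Exact 85th percentile of the normal model (top 15% lies above it).
grades = NormalDist(mu=mean, sigma=sd)
cutoff = grades.inv_cdf(0.85)

print(rule_of_thumb, round(cutoff, 2))
```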

Distribution Properties

• Shift Changes: adding or subtracting a number from each of the values.

[Figure: the same distribution centered at mean - c, mean, and mean + c]

Distribution Properties

• The mean, median, Q1, Q3, minimum, and maximum all shift when there is a shift change. The shift change, say c, is added to or subtracted from each of the statistics accordingly.

• The measures of spread (standard deviation, variance, IQR, and range) do not change when there is a shift change.

Distribution Properties

• Scale Changes: multiplying or dividing each of the values by a number.

[Figure: the distribution centered at mean, then at mean*c, then at mean/c]

Distribution Properties

• The mean, median, Q1, Q3, minimum, and maximum all change when there is a scale change unless they are zero. Each is multiplied or divided by the scale change c.

• The measures of spread (standard deviation, variance, IQR, and range) always change when there is a scale change. The standard deviation, IQR, and range are multiplied or divided by the scale change c. The variance is multiplied or divided by c².
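These rules can be checked numerically with a small sketch. The six weights, the shift of 10, and the scale factor of 2 are hypothetical values chosen for illustration; `summarize` is an illustrative helper name.

```python
import statistics


def summarize(values):
    """Collect a few measures of center and spread."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "variance": statistics.variance(values),
        "range": max(values) - min(values),
    }


weights = [170, 200, 230, 240, 280, 350]  # hypothetical data
shift = 10  # hypothetical shift change
scale = 2   # hypothetical scale change

original = summarize(weights)
shifted = summarize([x + shift for x in weights])
scaled = summarize([x * scale for x in weights])

# Shift: center moves by the shift, spread is unchanged.
# Scale: center and spread are multiplied, variance by scale squared.
for name in original:
    print(name, original[name], shifted[name], scaled[name])
```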

Shift Change Example

• Suppose we measure the weight of everyone on a football team and obtain the following statistics for a team report:

– Mean: 230 lbs. Median: 240 lbs.

– Std. Dev.: 50 lbs. Q1: 200 lbs., Q3: 280 lbs.

– Variance: 2500 sq. lbs. IQR: 80 lbs.

– Min.: 170 lbs. Range: 180 lbs.

– Max.: 350 lbs.

Shift Change Example

• Now suppose we found out the scale was 10 lbs. under so we need to add 10 lbs. to every weight. What would happen to each of the following statistics?

Original                After Shift Change
Mean: 230 lbs.          Mean: ________
Median: 240 lbs.        Median: ________
s: 50 lbs.              s: ________
Q1: 200 lbs.            Q1: ________
Q3: 280 lbs.            Q3: ________

Shift Change Example

• Now suppose we found out the scale was 10 lbs. under so we need to add 10 lbs. to every weight. What would happen to each of the following statistics?

Original                    After Shift Change
Variance: 2500 sq. lbs.     Variance: ________
IQR: 80 lbs.                IQR: ________
Min: 170 lbs.               Min: ________
Max: 350 lbs.               Max: ________
Range: 180 lbs.             Range: ________

Shift and Scale Change Example

• Further, suppose we found out that we are supposed to report the weights and statistics in kilograms, not lbs. (Remember, 1 lb ≈ 0.45 kilograms.) What would happen to each of the following statistics?

After Shift Change      After Shift and Scale Change
Mean: 240 lbs.          Mean: ______________
Median: 250 lbs.        Median: ______________
s: 50 lbs.              s: ______________
Q1: 210 lbs.            Q1: ______________
Q3: 290 lbs.            Q3: ______________

Shift and Scale Change Example

• Further, suppose we found out that we are supposed to report the weights and statistics in kilograms, not lbs. (Remember, 1 lb ≈ 0.45 kilograms.) What would happen to each of the following statistics?

After Shift Change          After Shift and Scale Change
Variance: 2500 sq. lbs.     Variance: ______________
IQR: 80 lbs.                IQR: ______________
Min: 180 lbs.               Min: ______________
Max: 360 lbs.               Max: ______________
Range: 180 lbs.             Range: ______________
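The shift and scale rules can also be applied directly to the team's summary statistics, without the raw weights. In the sketch below, `transform` is a hypothetical helper that assumes a positive scale factor, and the conversion factor is passed in as a parameter. The value used here is the exact definition of the pound, 0.45359237 kg, so results will differ from those obtained with a rounded factor.

```python
LB_TO_KG = 0.45359237  # exact definition of one pound in kilograms

# Summary statistics from the team report, in lbs (variance in sq. lbs).
original = {
    "mean": 230, "median": 240, "q1": 200, "q3": 280,
    "min": 170, "max": 350,
    "s": 50, "iqr": 80, "range": 180, "variance": 2500,
}

LOCATION = {"mean", "median", "q1", "q3", "min", "max"}


def transform(stats, shift=0.0, scale=1.0):
    """Apply x_new = shift + scale * x to summary statistics.

    Assumes scale > 0. Location statistics get both the shift and the
    scale; spread statistics get only the scale; the variance is
    multiplied by scale squared.
    """
    result = {}
    for name, value in stats.items():
        if name in LOCATION:
            result[name] = shift + scale * value
        elif name == "variance":
            result[name] = scale ** 2 * value
        else:
            result[name] = scale * value
    return result


after_shift = transform(original, shift=10)           # add 10 lbs
after_both = transform(after_shift, scale=LB_TO_KG)   # then convert to kg

for name in original:
    print(name, original[name], after_shift[name], round(after_both[name], 2))
```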

Linear Transformations

• If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, $s$ (or $\sigma$), and want to convert your data so you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and new standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need is to remember what shift and scale changes affect.

• In our linear transformation formula, $x_{new} = a + bx$:

– a is the shift change

– b is the scale change

• Standard deviations are only affected by scale changes, but means are affected by both shift and scale changes.

$$s_{new} = \text{scale} \cdot s \qquad \bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x}$$

Linear Transformation Example

• For example: $\bar{x} = 12$ and $s = 7$, but we want $\bar{x}_{new} = 25$ and $s_{new} = 10$.

$s_{new}$ = scale*s

10 = scale*7

scale = 10/7

scale = 1.43

• Substituting in: $\bar{x}_{new}$ = shift + scale*$\bar{x}$

25 = shift + 1.43*12

shift = 25 – 1.43*12

shift = 7.84

• So our linear transformation equation is: $x_{new}$ = 7.84 + 1.43*x
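The two steps of this example (solve for the scale from the standard deviations, then solve for the shift from the means) can be sketched as a small function. The name `linear_transformation` is illustrative. Carrying the unrounded scale 10/7 through gives a shift of about 7.86, while rounding the scale to 1.43 first gives the 7.84 shown above.

```python
def linear_transformation(mean, s, new_mean, new_s):
    """Return (shift, scale) so that x_new = shift + scale * x.

    The scale comes from the standard deviations; the shift is then
    whatever is needed to move the rescaled mean to the new mean.
    """
    scale = new_s / s
    shift = new_mean - scale * mean
    return shift, scale


shift, scale = linear_transformation(mean=12, s=7, new_mean=25, new_s=10)
print(round(scale, 4), round(shift, 4))

# Same calculation with the scale rounded to two decimals first.
rounded_scale = round(scale, 2)
rounded_shift = 25 - rounded_scale * 12
print(rounded_scale, round(rounded_shift, 2))
```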