
DePutron AP Statistics Unit I Packet: Exploring and Describing Data

PART I - Exploring and Describing Data Agenda

Day Number | Chapter | Topics to discuss | Homework Due Next Time

1 | Class Intro | The purpose of statistics, receive packet & syllabus, Day One Survey | Complete all tasks on TO DO list.

2 | Prelim | Intro to Stats experiment | Finish questions from the textbook

3 | Prelim | Simulating an experiment | Case closed, page 26

4 | Prelim | Identify population, variables, individuals | Identify population, variables, and individuals for given questions

Day One Survey: Please fill out the survey questions below. When you have finished, come visit me to record your email address in the school computer. You will then be sent this survey to input your answers tonight on SurveyMonkey.com.

Gender

Graduating Class Year

Left- or Right-Handed

Left- or Right-Eyed (look below!)

Shoe Size (no half-sizes, please)

# of Siblings

# of Aunts & Uncles

# of Cousins

Minutes to get to school

Measure in MILLIMETERS the thickness of textbook

Left-Eyed and Right-Eyed-ness

When you pick up a pencil or pen and write, what hand do you typically use? People are also left-eye dominant or right-eye dominant. Which one are you? Here’s how to find out:

Hold your hands in front of you like the picture. Find an object about 10-15 feet away. Make a small space to look through. Now, close your right eye, keeping your left eye open. Can you still see it? If so, then you are left-eye dominant. If you can’t see the object, open your right eye and close your left eye. Can you see it? If so, then you are right-eye dominant.

Write down one question you have from doing this activity in the space below:


TO DO BEFORE THE 2ND CLASS:

____ 1. Get a 3-ring binder with 8 dividers

____ 2. Loose leaf AND graph paper

____ 3. Get a graphing calculator (TI-84 recommended). Come see me, if necessary.

____ 4. Read the syllabus.

____ 5. Complete the Day One Survey on surveymonkey.com.

____ 6. Send me an email: [email protected], include:

1. First and last name

2. Questions from the syllabus

3. A statistical question that you are interested in answering this year (e.g., Does increased cell phone use lead to brain tumors? Does the number of AP courses a student takes increase his/her likelihood of going to college?)

Statistical Question:



Preliminary Chapter: Overview of the course

Bookwork

Define statistics.

Explain the difference between a population and a sample.

Pg. 11, P.2 (a)

Pg. 11, P.4 (a) (b) (c)

Pg. 11, P.5 (a)

Define individuals.

Define variable.

Pg. 21, P.12 (a) (b) (c) (d) (e) (f)


Tap Water vs. Bottled Water Experiment

Bottled water is becoming an increasingly popular alternative to ordinary tap water. But can people really tell the difference if they aren’t told which is which? Do you think you can tell the difference between bottled water and tap water? This activity will give you the chance to discover answers to both of these questions.

How would you design a study to determine if someone could tell the difference?

What are some things to keep in mind in conducting this study in order to validate the results? How will we account for these?

How will we decide in which order to present the cups to the student?

What would you expect?

Class Results

Record the number of correct and incorrect choices in the table below.

Correct | Incorrect | Proportion

Let’s assume that no one can really distinguish tap water from bottled water. So if a student guesses correctly, it is only due to luck. How many students would you expect to guess correctly? [no evidence]


Let’s assume that it is very easy to distinguish the bottled water from the tap water. How many students would you expect to guess correctly? [clear evidence]


How many correct answers would you need to see to be convinced that it is possible to tell the difference between tap water and bottled water? [possible evidence]


Did our class do better than blind guessing? Was our proportion enough higher than blind guessing to say that some students can distinguish tap water from bottled water?

Identify the individuals and variables in this experiment:


Simulating the Experiment

Now let’s simulate this experiment to see if random guessing could have produced a result as high as ours. How could we do this using a die?

Now record the number of correct guesses: _____

As a class, we will create a dot plot of these simulations using statistical software called Fathom. Copy our class’s dot plot below.

Dot Plot

What proportion of these simulations did as well or better than our class did in choosing the bottled water?
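A die is one way to simulate blind guessing; software is another. The sketch below is a minimal Python illustration, not the Fathom procedure used in class. It assumes each student must pick the bottled water out of three cups, so a pure guess is right with probability 1/3 (a die roll of 1 or 2). The class size of 21 and the observed count of 12 are made-up placeholders; swap in the class's own numbers.

```python
import random

def simulate_class(n_students, p_correct, rng):
    """Count how many of n_students guess correctly by luck alone."""
    return sum(rng.random() < p_correct for _ in range(n_students))

def proportion_as_good(observed, n_students, p_correct, n_sims=10_000, seed=1):
    """Fraction of simulated classes with at least `observed` correct guesses."""
    rng = random.Random(seed)
    hits = sum(
        simulate_class(n_students, p_correct, rng) >= observed
        for _ in range(n_sims)
    )
    return hits / n_sims

# Hypothetical inputs (assumptions, not class data): 21 students, 3 cups
# (chance of a lucky guess = 1/3), and 12 correct identifications observed.
N_STUDENTS = 21
P_GUESS = 1 / 3
OBSERVED = 12

estimate = proportion_as_good(OBSERVED, N_STUDENTS, P_GUESS)
print(f"Proportion of simulated classes with >= {OBSERVED} correct: {estimate:.3f}")
```

A small proportion suggests that guessing alone rarely produces a result as good as the observed one.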


Populations, Individuals, and Variables

In any study, we are collecting data (information) about individuals, or cases. Sometimes these individuals are also called experimental units, if we are conducting an experiment.

For each case, we record one or more variables: attributes or characteristics of each individual in a study. We typically have a research question about a population of interest.

Individuals or Cases: Who (or what) did we gather information from?

Variables: What values did we record for each case? Also,

Are the variables quantitative or categorical?

If quantitative, do we know the units of each variable?

How was each variable measured?

Population of interest in the study: What group does the researcher want to make conclusions

about?

Often, the group of individuals that were studied is only a subset of the population. We

call this group the sample.

In statistics, we often try to make predictions about a large population from a sample.

What is the research question?

Is there a suspicion / hypothesis being tested?

Are data being collected in order to learn more about a group?

Example from Yahoo Sports on the Los Angeles Lakers:

Each row contains data on a single

case (case #1, case #2, etc…)

The heading of each column is the

name of a variable (player, position,

bat, height, etc…)

Each variable has different possible

outcomes. (For weight: many

different outcomes. For position: PG,

SG, SF, etc.)

Can you determine who any of the players are based on these statistics??


Example 1:

Read the following article. Then, identify the individuals, the intended population, the variable, the research question, and any conclusions.

New York Times

AUGUST 22, 2011, 5:14 PM

Really? The Claim: Drinking Green Tea Can Help Lower Cholesterol

By ANAHAD O'CONNOR

THE FACTS

Green tea is thought to be an herbal panacea of sorts, believed by many to have a wide range of health

benefits. But whether it can actually produce measurable effects on cholesterol is a question that has

drawn much debate.

Advocates say green tea’s heart-healthy benefits are due in part to a large concentration of polyphenols,

which block the absorption of cholesterol in the gut. But skeptics argue that any beneficial effect would

be small, and the side effects from a few too many cups a day not worth it.

Numerous studies have delved into the matter, with mixed results. But this year a team of researchers

combined and analyzed data from more than a dozen previous trials to reach a more definitive answer.

The report, published in The American Journal of Clinical Nutrition, involved more than 1,100 people

and looked at studies in which the subjects were randomly assigned to drink either green tea or a placebo

daily for up to several months.

The researchers found that the subjects who received the green tea, on average, did see an effect on their

cholesterol, but it was minimal. Over all, their levels of LDL, or “bad” cholesterol, fell by 2.2

milligrams per deciliter, a change of roughly 2 percent. There was no effect on their levels of HDL, or

“good” cholesterol.

For some it may be worth a shot. But for others there could be side effects: A compound in green tea

called EGCG may interfere with medications like anticoagulants and the cancer drug bortezomib.

THE BOTTOM LINE

Studies have found that green tea may reduce levels of LDL cholesterol, but the effect appears minimal.

a) Who are the individuals, or cases observed in this study?

b) Can you determine the intended population?

c) What variable(s) were recorded on each case?

d) What is the research question of the study?

e) What conclusion was reached? Are there any caveats / concerns raised in the study?



Example 2. The data below came from a study from the US Census Bureau. The Bureau collected

information from every householder in the United States. The goals of the Census are to collect

accurate information about the US population, in order to make decisions about congressional

districts, allocating funds, and studying changes in the population.

a) Who are the individuals, or cases, in this study?

b) Identify the intended population in this study. Do you think the sample of individuals is

representative of the intended population? Why (not)?

c) Which of the following are variables collected on each individual? Justify.

i. Highest level of education completed by householder.

ii. Number of households

iii. Percent distribution by income level

iv. Income level

v. Less than 9th grade, some HS, …, bachelor’s degree or higher

vi. Under $10,000

d) What percentage of householders with less than a 9th grade education earned more than

$75,000?

e) What proportion of householders with less than a 9th grade education earned more than $75,000?

f) How many householders with less than a 9th grade education earned more than $75,000?

Homework Assignment: On pg. 30, read questions P.19-22. For each question, identify:

1. The individuals, or cases (the “who”).

2. The variables recorded for each case (the “what”).

3. The intended population of interest.

Bonus: Research your own article, print it out, and identify these same three important parts of the study.


Chapter 1: Displaying Distributions with Graphs

Good graphs are extremely powerful tools for displaying large quantities of complex data; they help turn the reams of information available today into knowledge. Unfortunately, some graphs deceive or mislead. This may happen because the designer chooses to give readers the impression of better performance or results than is actually the case. In other cases, the person who prepares the graph may want to be accurate and honest, but may mislead the reader through a poor choice of graph form or poor graph construction.

The following things are important to consider when looking at a graph:

1. Title

2. Labels on both axes of a line or bar chart and on all sections of a pie chart

3. Source of the data

4. Key to a pictograph

5. Uniform size of a symbol in a pictograph

6. Scale: Does it start with zero? If not, is there a break shown?

7. Scale: Are the numbers equally spaced?

1. Can you explain why these graphs might be misleading?

a. b. c.

2. What do you notice is different about the graphs below compared to the graphs from numbers

1a. and 1c.?

a. b.


Stem-and-Leaf Plots

A researcher chose a sample of 27 young adult women (ages 18-25) at a health clinic and recorded their resting pulse rates, in beats per minute. The pulse rates are displayed below in a stem-and-leaf plot (also called a stemplot):

The 8 | 8 you see in the top row represents a woman whose pulse rate was 88 beats per minute.

a) Why do you think there are two “6” stems, two “7” stems, and two “8” stems? Why not use one stem each?

b) Three young women in the sample were not placed on the stem-leaf plot. Their pulse rates

are 79, 63, and 51 beats per minute. Properly put them on the plot above.

c) On the left side of the stems, create a back-to-back stem plot for the pulse rates of this

sample of 20 older women who also had their pulse rates recorded. Here are the data, in

BPM:

51 57 63 52 57 65 71 83 70 86 62 62 65 60 57 73 58 64 67 71

d) Based on this display we can make comparisons between the pulse rates of older women

and younger women. SOCS [Shape, Outliers, Center, Spread]

A. Shape and Outliers: Discuss the overall shape and any outliers.

B. Center: Which group has a higher center? What does this mean in this study?

Which measure of center is most appropriate?

C. Spread /variability: Which group is more spread out? What does that mean in

this study?

e) Take your own pulse, record your data on the class stemplot (it should be on the board

already).
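To check a hand-drawn plot, the split-stem layout can also be produced by a short program. The sketch below is one possible Python version, using only the older women's pulse rates listed in part c). It prints stems on the left and leaves on the right (for the back-to-back plot the leaves would go on the left side instead), and it splits each stem into leaves 0-4 and 5-9. The function name `split_stemplot` is just an illustrative choice.

```python
from collections import defaultdict

older_bpm = [51, 57, 63, 52, 57, 65, 71, 83, 70, 86,
             62, 62, 65, 60, 57, 73, 58, 64, 67, 71]

def split_stemplot(values):
    """Return stemplot rows with each tens stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf >= 5)].append(leaf)   # False = low half, True = high half
    lo_key, hi_key = min(rows), max(rows)
    lines = []
    stem, high = lo_key
    while (stem, high) <= hi_key:
        # rows.get keeps empty stems in the plot so gaps stay visible
        leaves = "".join(str(d) for d in rows.get((stem, high), []))
        lines.append(f"{stem} | {leaves}")
        stem, high = (stem, True) if not high else (stem + 1, False)
    return lines

for line in split_stemplot(older_bpm):
    print(line)
```

Empty stems are printed on purpose: leaving them out would hide gaps in the distribution.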


Histograms: Construct a histogram of the distribution of your classmates’ pulse rates on the grid below.

Pulse Rates

1. What characteristics [SOCS] of the distribution are evident from the histogram?

2. Compared to the stem-and-leaf plot, what details does the histogram lack?

3. When would it be beneficial to use a histogram rather than a stem plot?

Notes regarding shape: A distribution is said to be skewed to the right if it extends further to the right than it does to the

left. (The tail extends to the right)

A distribution is said to be skewed to the left if it extends further to the left than it does to the

right. (The tail extends to the left)

A distribution is said to be symmetric if the right and left sides of the histogram are approximately

[use your judgment] mirror images of each other.


Practice Reading a Histogram Practice Example: Last year, a group of AP Statistics students measured how long they could

balance on their toes with their eyes closed. Results are in seconds.

a) About how many students performed this test? How do you know?

b) What does the bar in the middle of the histogram tell you, in context of the data?

c) Suppose Jamey wants to use this histogram to know the proportion of the participants that

balanced for more than a minute. How would you help Jamey?

d) Here’s another histogram of the same data. In what specific ways does this histogram improve

on the previous one?

e) Here’s a third histogram of the same data. Is this an improvement? Explain.


Roll until “doubles”

A game of chance is played in which two dice are rolled until “doubles” are rolled. A trial consists of a sequence of rolls terminating with a roll of “doubles”.

1. On average, how many times do you think you’ll have to roll two dice to get doubles?

2. Record the calculator procedure:

3. Construct a histogram of the number of rolls until “doubles” are rolled. Use the calculator to

simulate 30 plays of the game.

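Trial | Rolls | Trial | Rolls

1 | ____ | 16 | ____

2 | ____ | 17 | ____

3 | ____ | 18 | ____

4 | ____ | 19 | ____

5 | ____ | 20 | ____

6 | ____ | 21 | ____

7 | ____ | 22 | ____

8 | ____ | 23 | ____

9 | ____ | 24 | ____

10 | ____ | 25 | ____

11 | ____ | 26 | ____

12 | ____ | 27 | ____

13 | ____ | 28 | ____

14 | ____ | 29 | ____

15 | ____ | 30 | ____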


Construct the histogram on the grid below. Be sure to label and scale the axis and title the graph.

4. Describe the SOCS for this distribution.

5. Locate the mean and median for this distribution. Which is larger? Why?

6. Reconstruct your histogram in your calculator:

STAT>1:Edit…>L1> input your data > 2nd:STAT PLOT>1:PLOT1…ON>select the histogram picture,

confirm XList is set to L1 > ZOOM>9:ZoomStat

Go to your window screen and adjust the settings. Notice how the shape of your histogram changes.

Let’s play a game…

If you can roll the dice 6 times without rolling “doubles”, I will give you $1. However, if “doubles” are rolled on the first through 6th roll, you pay me $1…Who wants to play?
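The calculator simulation can be mirrored in a few lines of code. The sketch below is a Python stand-in, not the TI-84 procedure: it plays the game 30 times with a fixed, arbitrary seed, then prints two exact values for comparison, using the fact that 6 of the 36 equally likely outcomes of two dice are doubles.

```python
import random

def rolls_until_doubles(rng):
    """Roll two dice until both show the same face; return the number of rolls."""
    rolls = 0
    while True:
        rolls += 1
        if rng.randint(1, 6) == rng.randint(1, 6):
            return rolls

rng = random.Random(2024)          # arbitrary fixed seed so the run is repeatable
trials = [rolls_until_doubles(rng) for _ in range(30)]
print("30 simulated plays:", trials)
print("mean number of rolls:", sum(trials) / len(trials))

# Exact values for comparison: P(doubles on one roll) = 6/36 = 1/6.
p_doubles = 6 / 36
expected_rolls = 1 / p_doubles            # long-run average number of rolls
p_win_dollar = (1 - p_doubles) ** 6       # chance of 6 rolls with no doubles
print("expected rolls:", expected_rolls)
print("chance of 6 rolls with no doubles:", round(p_win_dollar, 3))
```

Comparing the simulated mean with the exact expected value shows how much 30 trials can vary from the long-run average.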


Make observations of the 3 graph types. For each type consider how it displays the

center and spread, how it captures individual data, the ease to create it, etc.

DOT PLOTS

STEMPLOTS

HISTOGRAMS

Examining Distributions

Statistical Language for describing the shape [overall pattern] of a distribution

SOCS – always discuss these 4 things for 1-variable data

Shape: Is the distribution unimodal, bimodal, or uniform? If unimodal: symmetric, right-skewed (positive skew), or left-skewed (negative skew)?

Modes: major peaks or clusters in a graph

*If bimodal – describe the center and spread of each group separately

Outliers: Identify any deviations / outliers from the big picture. Talk about them in context of the data.

Speculate/ explain why they are outliers.

Center: Identify the overall center, or middle, of the distribution. Put your comments in context of the

data.

Numerical Measures of center: mean, median, midrange, mode

Spread or Variability: Identify how spread out the data are from the center. Put your comments in

context of the data.

Numerical Measures of spread: standard deviation, IQR, range


Example 1: A large grocery store is interested in purchasing watermelons from local farms. They

collected a random sample of watermelons from each farm (labeled A-F), and weighed each

watermelon. They constructed dot plots of the weights of the watermelons from each sample. Dot

plots are below.

Discuss with your classmates: Do not do any computations. There can be alternate conclusions

for some of the questions.

i. On average, which farm(s) appear to produce the heaviest watermelons? The lightest watermelons?

ii. Which would you select as a grocer? Why?

iii. Overall, which farm(s) appear to have watermelon weights that are the least variable?

iv. Which farm has watermelons whose weights are most variable?

v. What distinguishes farm E from the other five farms?


vi. What distinguishes farms C and D, from Farms A, B, and F?

vii. Give a specific feature of farms C and D that’s different from farms A, B, and F.

viii. Why might the grocer prefer to buy from Farm F over the others? Any concerns?

Example 2: Match the following variables with the histograms and bar graphs given below.

Hint: Think about how each variable should behave. Where along the scale should values pile up?

(a) The SAT math scores of people in a college statistics class

(b) Harvard Westlake students’ response to “do you have your cell phone with you?”

(c) Number of siblings of individuals in a group of 100 HW seniors

(d) Amount paid for last lunch out by students in this class

(e) Gender breakdown of students in a college biology class

(f) Grades on an easy January exam in Algebra 2


[Dot plot (DayOneSurvey2009, school = "loomis"): guess_minus_actual, axis from -10 to 8]

NUMERICAL SUMMARIES OF CENTER AND SPREAD

Measures of Center

ARE YOU SMARTER THAN A MIDDLE SCHOOLER? MEAN, MEDIAN, RANGE CHALLENGE

1. Karl has eight people in his family. He wondered how many hours of TV each of them might have watched in a week if the mean of the eight values were 5 hours. Write down eight amounts of TV-watching that have a mean of five hours:

2. When Karl told his father what he had done, his father wondered how the values might change if the mean were five hours and the median were four hours. Write down eight values that have a mean of five hours and a median of four hours:

3. Karl’s mother challenged him to write down eight values with a mean of five hours, a median of four hours, and a range of seven hours. Write down eight possible amounts of TV-watching that have a mean of 5 hours, a median of 4 hours, and a range of 7 hours:

NOTES

Consider…

• What’s typical in the group?

• What value do the observations center/cluster themselves around?

• At which location is the data split into an upper half and a lower half?

Median: The median is the value for which half of the observations in the set are greater and half are less. The median will divide a histogram into equal areas.

To find the median:

1. Arrange the observations in increasing order.

2. If the number of observations is odd, the median is the middle value.

3. If the number of observations is even, the median is the average of the middle two.

To find the location of the median:

(n + 1) ÷ 2
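The steps for finding the median translate directly into code. The sketch below is one way to write them in Python, using the (n + 1) ÷ 2 location rule; the two short data lists are made-up examples, not survey data.

```python
def median(values):
    """Median via the steps above: sort, then use position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based location of the median
    if n % 2 == 1:
        return ordered[int(position) - 1]  # odd count: the middle value
    lower = ordered[int(position) - 1]     # value just below the half position
    upper = ordered[int(position)]         # value just above it
    return (lower + upper) / 2             # even count: average the middle two

print(median([3, 1, 4, 1, 5]))     # odd number of observations
print(median([3, 1, 4, 1, 5, 9]))  # even number of observations
```

When n is even, the location (n + 1) ÷ 2 ends in .5, which is the signal to average the two observations on either side of it.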

Median Example: Twenty-one students guessed the weight of their backpacks, and then weighed

their backpacks. The variable “guess-actual” records the guessed weight minus the actual weight.

Here’s a dot-plot:

1. Find and interpret the median of this

distribution in context.


Mean (x bar): The mean, x bar, is the computed average of the set of observations:

x bar = (x1 + x2 + … + xn) / n

or in sigma notation

x bar = (1/n) ∑xi

1. Using the data from our Day One Survey, find the median and mean of “minutes to get to school”.

2. Which measure of center is larger? Why?


[Dot plot (DayOneSurvey2009, school = "loomis"): guess_minus_actual, axis from -10 to 8]

Measures of Spread

1. Range = maximum – minimum

2. Interquartile Range (IQR) = Q3 – Q1. It tells you the width of the central 50% of the distribution.

Quartiles: The first quartile (Q1) is the value that 25% of the observations fall below. It is the MEDIAN OF THE LOWER HALF of the set of observations. The third quartile (Q3) is the value that 75% of the observations fall below. It is the MEDIAN OF THE UPPER HALF of the set of observations.

IQR is typically used to describe spread when the median is used to describe center. On the AP Exam, if you decide to match median and IQR, be sure to explain that your decision is based on the “norm.”

a) Is the IQR a measure of center, spread, or shape? Why?

b) Compute and interpret the IQR for the

“guessed minus actual” data.

c) Quartiles are like fences that divide the

data into four groups of equal size…

Where’s the second quartile, Q2?

Five number summary: min, Q1, median, Q3, max.
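The five number summary and the 1.5 * IQR outlier rule used in this packet can be sketched in Python as follows. One assumption to note: quartiles are computed as medians of the lower and upper halves, leaving out the middle value when the number of observations is odd. Other software may use slightly different quartile rules and give slightly different answers. The pulse data are the older women's rates from the stemplot activity.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """min, Q1, median, Q3, max using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # excludes the middle value when n is odd
    upper = ordered[(n + 1) // 2:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Older women's pulse rates (BPM) from the stemplot activity in this packet.
pulse = [51, 57, 63, 52, 57, 65, 71, 83, 70, 86,
         62, 62, 65, 60, 57, 73, 58, 64, 67, 71]
print(five_number_summary(pulse))
print(outliers(pulse))
```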

Describe the calculator procedure:

Outliers: An observation is called an outlier if it lies more than 1.5 * IQR above Q3 or below Q1.

3. Variance (s2): The variance is roughly the average of the squared differences between each observation and the mean. Also worded: “on average, how much does each observation vary from the mean”. Variance is measured in square units.

s2 = [(x1 – x bar)2 + (x2 – x bar)2 + … + (xn – x bar)2] / (n – 1)

or in sigma notation

s2 = 1/(n-1) ∑(xi – x bar)2

4. Standard deviation (s): The standard deviation is the square root of the variance. This measurement is in the same units as the original data.


s = √(1/(n-1) * ∑(xi – x bar)2)

Variance (s2) and standard deviation (s) are used to measure spread when the mean (x bar) is used to describe center.

When the distribution is approximately symmetric, the mean (x bar) and standard deviation (s) are generally used to summarize the distribution. If the distribution is skewed, a five number summary is generally used.

CHALLENGE

There are twenty players with golf handicaps between 6 and 32:

(6,6,8,9,10,11,13,15,15,16,18,19,21,23,26,27,29,30,31,32)

The problem is to divide the twenty players into four teams of five players each, with the lowest possible variance and standard deviation between the teams.


Common measures of center | Common measures of spread

Median: a value that separates data into top half, bottom half | Inter-quartile range: Q3 – Q1

Mean: Y bar = (y1 + y2 + … + yn) / n = (1/n) ∑yi | Standard deviation: s = √(∑(xi – x bar)2 / (n – 1))

Midrange = (min + max) / 2 | Range: max – min

There are many methods to numerically summarize center or spread. Any number computed from a set of data is called a statistic.

The standard deviation: The average distance of an individual value from the mean

But before we begin …

Example: The following histograms display quiz scores on a scale of 1-9 for two different statistics classes.

a) Between Classes F and G, which set of ratings exhibits more variability (greater spread)?

Explain.

b) One way to measure variability is to compare the “average distance of data from the center of

the distribution.”

Using this idea, order the ratings from least variable to most variable. Justify.


Summation Practice ∑

75, 76, 82, 93, 45, 68, 74, 82, 91, 98

Calculate (remember Order of Operations) and then double-check with the calculator [STAT > CALC > 1-Var Stats]

X values | X2 | (X-2)

∑X =

∑X2 =

(∑X)2 =

∑(X-2) =
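The four sums differ only in the order of operations, which is easy to see when each is written as one line of code. The sketch below is a Python check of the hand calculation (the variable names are illustrative); run it after working the sums by hand.

```python
scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]

sum_x = sum(scores)                          # ∑X: add the values
sum_x_squared = sum(x ** 2 for x in scores)  # ∑X2: square each value, then add
square_of_sum = sum(scores) ** 2             # (∑X)2: add first, then square the total
sum_x_minus_2 = sum(x - 2 for x in scores)   # ∑(X-2): subtract 2 from each, then add

print(sum_x, sum_x_squared, square_of_sum, sum_x_minus_2)
```

Note that ∑X2 and (∑X)2 are very different numbers, and that ∑(X-2) equals ∑X minus 2 for every observation.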

WARM-UP 2005B #1

Remember, each question is worth a total of 5 points.

1. The graph below displays the scores of 32 students on a recent exam. Scores on this exam ranged from 64 to 95 points.

6 | * *

6 | * *

7 | * * *

7 | * * * *

8 | * * * *

8 | * * * * * *

9 | * * * * * * *

9 | * * * *

(a) Describe the shape of this distribution.

(b) In order to motivate her students, the instructor of the class wants to report that, overall, the class’s performance on the exam was high. Which summary statistic, the mean or the median, should the instructor use to report that overall exam performance was high? Explain.


(c) The midrange is defined as (maximum + minimum)÷2. Compute this value. Is the midrange considered a measure of center or a measure of spread? Explain.

The Mean and Median

1. Go to www.whfreeman.com/tps3e : Statistical Applets : Mean and Median

2. Place two observations on the line by clicking below it. Why does only one arrow appear?

3. Place three observations on the line by clicking below it, two close together near the center of the line and one somewhat to the right of these two.

a. Pull the single rightmost observation out to the right. (Place the cursor on the point, hold down the mouse button, and drag the point.) How does the mean behave? How does the median behave? Explain briefly why each measure acts as it does.

b. Now drag the rightmost point to the left as far as you can. What happens to the mean? What happens to the median as you drag this point past the other two? (Watch carefully.)

4. Place five observations on the line by clicking below it.

a. Add one additional observation without changing the median. Where is your new point?

b. Use the applet to convince yourself that when you add yet another observation (there are now seven in all), the median does not change no matter where you put the seventh point. Explain why this must be true.



Example 1: New York Yankee Salaries

A histogram of the 2010 salaries for every player on the New York Yankees is displayed below. The mean salary and median salary are also marked.

Which marking is the median? Which one is the mean? Why?

Example 2: In which of the four histograms below are the mean and the median roughly equal?

In which is the mean considerably greater than the median? In which is the mean considerably

less than the median?

Summarize:

In a skew left distribution, the mean tends to be _________ the median.

In a skew right distribution, the mean tends to be __________ the median.

In a symmetric distribution, the mean tends to be __________ the median.


[Dot plot (Collection 2): salary (thousands), axis from 10 to 70]

[Dot plot: age, axis from 24 to 32]


Example 3: Observe the salaries for all the employees in this small ad agency:

Occupation | Salary (dollars per month)

Owner | $ 60,000

VP | $ 40,000

Senior Agent | $ 22,000

Senior Agent | $ 16,000

Senior Agent | $ 14,000

Senior Agent | $ 12,000

Senior Agent | $ 10,000

Junior Agent | $ 10,000

a) Here’s a dot plot of the distribution of monthly salaries. Without doing calculations, what would you call a typical, or “average” salary in this company?

b) Compute:

Median of salaries: Mean of salaries:

Midrange of salaries = (min + max) /2:

Which measure of center seems to coincide best with your answer in a)? The mean, median,

or midrange? Why don’t the other values work as well?

c) Suppose that the owner decides to double his own salary, and leave the rest unchanged.

Record the new values for each measure of center. Which measure of center stays the

same (MOST resistant to outliers)? Which one is LEAST resistant to outliers?

d) The Federal government collects 7.65% of each worker’s earnings for Social Security. If

you want to know the total amount collected from the agency for Social security in a month,

which measure of center should be used: mean, the median, or the midrange? Justify.
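The three measures of center in this example can be checked with a short script once the hand calculations are done. The sketch below uses Python's standard `statistics` module for the mean and median and defines the midrange directly; the helper names `midrange` and `centers` are illustrative. It also recomputes each measure after the owner doubles his salary, and shows that the monthly Social Security total depends only on the payroll total, which is n times the mean.

```python
from statistics import mean, median

salaries = [60_000, 40_000, 22_000, 16_000, 14_000, 12_000, 10_000, 10_000]

def midrange(values):
    return (min(values) + max(values)) / 2

def centers(values):
    return {"mean": mean(values), "median": median(values), "midrange": midrange(values)}

before = centers(salaries)
after = centers([2 * salaries[0]] + salaries[1:])   # owner doubles his own salary

for name in before:
    print(f"{name:>8}: {before[name]:>9,.0f} -> {after[name]:>9,.0f}")

# Total Social Security at 7.65% depends on the payroll total, i.e. n * mean.
total_tax = 0.0765 * len(salaries) * before["mean"]
print(f"total collected per month: ${total_tax:,.2f}")
```

Comparing the "before" and "after" columns shows which measures move when a single extreme value changes.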

MEASURE | DISADVANTAGES… | BEST TO USE WHEN…

Mean: average value | Sensitivity to the influence of a few extreme observations, NOT a resistant measure | Include every piece of collected data; data set is not highly skewed

Median: middle value | Does not incorporate every piece of data | Data set includes outliers, a resistant measure; looking to describe the “typical” value of a data set

Midpoint: informal middle value | Not very accurate of the entire set of data | A quick idea of the center

Computing Standard Deviation


Example: A researcher recorded the ages of a random sample of five first-year medical students at a local university. The mean of this sample is x bar = 26 years.

a) Individual ages vary from the mean age of 26 years. By how much, on average? Estimate.

To answer this question more mathematically, we can compute the standard deviation.

Standard deviation:

s = √(∑(xi – x bar)2 / (n – 1))

Interpretation: roughly, it’s a measure of the average distance of an individual from the mean of

the sample.

This formula is usually a little irritating to calculate by hand, but we’ll do it now to

make sure we understand it. Sum up the squared deviations.

b) To find the mean squared deviation, we’d divide by n, the sample size. In some situations, this works, but for samples, statisticians prefer to divide by n-1. Why? Excellent question. More on this later in the course.

“Average” squared deviation = s2 = ∑(xi – x bar)2 / (n – 1) = _______________

This value, symbolized by s2, is called the variance of the ages. It represents (roughly) the average squared deviation from the mean. The units of this number are “years squared.”

c) To put this back into “years,” we take the square root of the variance.

Standard deviation = s = √(∑(xi – x bar)2 / (n – 1)) = ______________

d) Which data value above contributed most to the value of the standard deviation? Why?

e) Which measure of spread is more resistant to outliers: the standard deviation or the IQR?

L1 = xi, the data | L2 = (xi – x bar), the deviations | L3 = (xi – x bar)2, the squared deviations

23 | |

24 | |

24 | |

27 | |

32 | |

∑(xi – x bar)2 =
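The table columns L1, L2, and L3 correspond to three steps that can be followed line by line in code. The sketch below is a Python version of the hand calculation for the five ages, dividing by n-1 as described above; use it to check your table after filling it in.

```python
from math import sqrt

ages = [23, 24, 24, 27, 32]                 # L1: the data
n = len(ages)
x_bar = sum(ages) / n                       # sample mean

deviations = [x - x_bar for x in ages]      # L2: xi - x bar
squared = [d ** 2 for d in deviations]      # L3: (xi - x bar) squared

sum_sq = sum(squared)                       # sum of squared deviations
variance = sum_sq / (n - 1)                 # s2, in years squared
std_dev = sqrt(variance)                    # s, back in years

print("mean:", x_bar)
print("deviations:", deviations)
print("squared deviations:", squared)
print("sum of squares:", sum_sq, "variance:", variance, "s:", round(std_dev, 2))
```

The deviations always add to zero, which is why they are squared before being summed.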


More questions to help understand the standard deviation (The middle school challenge –

but for measures of spread):

a) You have twelve quiz scores. Each score is an integer from 0 to 10. You are told that the

standard deviation of the scores equals zero. What is one possibility for your set of data? Are

there other possibilities? What MUST be true about the set of data you have?

b) Suppose the IQR for a set of twelve quiz scores = 0. Do all the scores have to be the same?

Justify.

c) Earlier, you saw how the std. deviation of a sample could be zero. Take four observations from

0 to 10. Make their standard deviation as LARGE as possible.
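Once you have a guess for part c), a brute-force search can confirm it. The sketch below is a Python check that tries every set of four values and keeps the one with the largest sample standard deviation. It assumes the observations are whole numbers from 0 to 10, as with the quiz scores in part a); the question itself does not say so.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Assumption: observations are whole numbers from 0 to 10.
candidates = combinations_with_replacement(range(11), 4)
best = max(candidates, key=stdev)            # set with the largest sample s
print(best, round(stdev(best), 3))
```

Look at where the winning values sit relative to their mean, and compare that with your answer to part a).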

Example: In Mrs. Bain’s running group (about 10 folks), the mean 5K time for all runners on their last run is about 25 minutes, with a standard deviation of about s = 3.6 minutes.

a) Mrs. Bain was the slowest person in her running group this weekend. She ran the 5k in 35

minutes. If she were removed from the data set, how would the STANDARD DEVIATION

change: (go up, go down, stay the same?) Justify.

b) How would the IQR change? Justify.

Summarizing numerical summaries: We’ve learned many different ways to summarize quantitative data:

mean: x bar

median: M

IQR = Q3 – Q1

std. deviation: s

RANGE = max – min

MIDRANGE = (max + min) / 2

 | MOST resistant to skewness / outliers | LEAST resistant to skewness / outliers

Measure of center | |

Measure of spread | |

Always plot your data before using numerical summaries.


When the data look roughly unimodal and symmetric, it’s fine to compare means and compare

standard deviations.

If you see skewness or outliers in data, you should compare medians and compare IQR’s.

Summarizing summary Statistics, part 2

Statistics are used as numerical indicators of:

Center - the “middle,” or “average” of a distribution

Spread – how widely dispersed are your data?

Location in a distribution - Where among the sample does a particular value fall? (Note: measures

of center are also measures of location.)

Shape - can indicate the presence of symmetry / skewness in a distribution.

Exercise: For each formula, determine / justify if it’s a measure of center, spread, location, or

shape (specifically, symmetry vs. skewness).

a. $\dfrac{Q_1 + Q_3}{2}$

b. 95th percentile − 5th percentile

c. mean − median

d. First ignore the highest two and lowest two observations in the sample. Then from the

rest, compute $\dfrac{\text{sum of remaining obs}}{\text{number of remaining obs}}$

e. 75th percentile

f. $\dfrac{\sum_{i=1}^{n} |x_i - \text{median}|}{n}$


Describing a distribution of a quantitative variable. Identify the overall center, or middle, of the distribution. Put your comments in context of the data.

Numerical Measures of center: mean, median, midrange

Identify how spread out the data are from the center. Put your comments in context of the data.

Numerical Measures of spread: standard deviation, IQR, range

Identify the shape of the distribution, and place your comments in context of the data.

Ways to describe: Unimodal, bimodal, or uniform? If unimodal, symmetric or skewed?

Identify any deviations / outliers from the big picture. Talk about them in context of the data.

Speculate/ explain why they are outliers.

Things to remember:

When comparing two groups, be sure you make actual comparisons, and not simply

“parallel” descriptions of each distribution. In other words, which group has a higher center?

Which group shows more variability?

When comparing centers or spreads, be sure you use the same measurement on each

group. In other words, don’t compare the median of group A with the mean of group B.

Compare the same measure of center in each group. Same goes with spreads. In other

words, make sure you compare “apples to apples, not apples to oranges.”

Always make comparisons in context of the data.



Box Plots or “Box-and-Whisker Plots”

Find each of the following for the distribution of Cost of Last Hair Cut (page 11):

Q1 =

Q3 =

IQR =

Five Number Summary =

a) Are there any outliers in the distribution of Cost of Last Hair Cut?

b) Complete the table to find variance and standard deviation:

Participant #     x     (x – x bar)     (x – x bar)²

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

∑

∑(xi – x bar)² =


c) Which individual contributed the most (aka which data points are furthest from

the mean)?

Calculate

Variance = s² = 1÷(n-1) • ∑(xi – x bar)² = ____________________

Standard Deviation = s = √(s²) = _________________________

d) Which would be more appropriate in describing the distribution of the Cost of Last Haircut – a five number summary or the mean and the median? Why?

e) What are the units of measurement for each:

Variance: ___________________

Standard Dev: _______________

f) Construct a boxplot for the Cost of Last Haircut using the grid as a guide:

g) Construct parallel boxplots for the Cost of Last Haircut for men and women using the grid as a guide:


Using the boxplots above, compare and contrast the distributions of the Cost of

Last Haircut for men and women (be sure to use descriptive language and support your responses with data):


Effective communication in comparing distributions:

When comparing the distributions of two groups, clear communication is key. Here are some examples of how to improve poor communication in a problem. Let’s examine the following responses from a problem about tomatoes!

1. INCLUDING SUFFICIENT CONTEXT.

WEAK: “Group C had a median of about 31, and group E’s median is higher, around 35.”

Issue: Our job is to describe the group of individuals in the study. Describing graphs without context doesn’t tell the reader what the numbers mean in this setting.

BETTER: “The control group showed an average increase of 31 inches, but the experimental group’s average increase was larger, about 35 inches.”

2. LACK OF COMPARISON / PARALLEL descriptions without comparison.

WEAK: “The control group showed an average increase of 31.2 inches, while the experimental group’s average was 31.7 inches.”

Issue: Is this a major difference or not? Are you claiming they’re pretty similar, or different? It’s unclear. The “while” doesn’t communicate your conclusion thoroughly.

BETTER: “The control group showed an average increase of 31.2 inches, but the experimental group’s average was slightly higher (31.7 inches).”

3. NOT COMPARING SIMILAR QUANTITIES to compare centers or spreads.

WEAK: “The control group showed less variability than the experimental group. The range of increases spanned 12 inches for the control group, but the standard deviation in the experimental group was 6.”

Issue: You are using different ways to measure spread in each group. An effective comparison of spreads uses the same measurement for both groups.

BETTER: “The control group showed less variability than the experimental group. The range of increases spanned 12 inches for the control group, but the range of increases for the experimental group was 16 inches.”

4. VAGUE / UNCLEAR terminology.

WEAK: “The control group’s increases are well-distributed evenly, but the experimental group is skewed.”

Issue: “Well distributed” might mean different things to different readers. For shapes, learn the terminology for each kind of shape.

BETTER: “The control group’s increases are symmetric, but the experimental group is skewed to the left.”


2002 Form B #5

At a school field day, 50 students and 50 faculty members each completed an

obstacle course. Descriptive statistics for the completion times (in minutes) for the two groups are shown below.

Students Faculty Members

Mean 9.90 12.09

Median 9.25 11.00

Minimum 3.75 4.50

Maximum 16.50 25.00

Lower quartile 6.75 8.75

Upper quartile 13.75 15.75

(a) Use the same scale to draw boxplots for the completion times for students and for faculty members.

(b) Write a few sentences comparing the variability of the two distributions.

(c) You have been asked to report on this event for the school newspaper. Write a

few sentences describing student and faculty performances in this competition for

the paper.

Linear Transformations

A linear transformation changes every value of the variable x into a new value xnew given by the equation: xnew = a + bx.

Original Data (x) Median Mean Range IQR St Dev Variance

3, 4, 6, 8, 12,15, 20

Add 4 to each value in the original data and complete the table: xnew = 4 + x Median Mean Range IQR St Dev Variance

7, 8, 10, 12, 16, 19, 24

What changed? What didn’t? Why (both mathematically and conceptually)?

Multiply each value in the original data by 3 and complete the table:

xnew = 3x Median Mean Range IQR St Dev Variance


9, 12, 18, 24, 36, 45, 60


Multiply each value in the original data by 2 and add 3 and complete the table:

xnew = 3 + 2x Median Mean Range IQR St Dev Variance

9, 11, 15, 19, 27, 33, 43


How is each summary statistic of x affected by the linear transformation xnew = a

+ bx?

Mediannew = IQRnew =

Meannew = St Devnew =

Rangenew = Variancenew =
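Once you have filled in the tables by hand, a short Python sketch like the one below can be used to check your work. It recomputes all six summaries for the original data and for xnew = 3 + 2x. The `iqr` helper is an assumption: it uses the "median of each half, leaving out the middle value" rule for quartiles, which may differ slightly from other software.

```python
from statistics import mean, median, stdev, variance

def iqr(values):
    """IQR using the median of each half (middle value left out when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return median(upper) - median(lower)

def summarize(values):
    """Return the six summary statistics used in the tables above."""
    return {
        "median": median(values),
        "mean": mean(values),
        "range": max(values) - min(values),
        "iqr": iqr(values),
        "sd": stdev(values),
        "variance": variance(values),
    }

original = [3, 4, 6, 8, 12, 15, 20]
a, b = 3, 2                                   # xnew = a + b*x
transformed = [a + b * x for x in original]

before = summarize(original)
after = summarize(transformed)
for name in before:
    print(f"{name:9s} before = {before[name]:8.3f}   after = {after[name]:8.3f}")
```

Changing `a` and `b` lets you test your general rules against the other two tables as well.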

Suppose a teacher gave a test for which x bar = 70 and s = 21. He wants to apply a linear transformation xnew = a + bx to “scale” the grades so that x barnew = 82

and snew = 7. Find a and b.

Which students are happy with the grade adjustment? Who is mad? Who wasn’t

affected?


2004 #1 1. A consumer advocate conducted a test of two popular gasoline additives, A and B.

There are claims that the use of either of these additives will increase gasoline mileage in cars. A random sample of 30 cars was selected. Each car was filled with gasoline and the

cars were run under the same driving conditions until the gas tanks were empty. The distance traveled was recorded for each car.

Additive A was randomly assigned to 15 of the cars and additive B was randomly assigned to the other 15 cars. The gas tank of each car was filled with gasoline and the assigned

additive. The cars were again run under the same driving conditions until the tanks were empty. The distance traveled was recorded and the difference in the distance with the additive minus the distance without the additive for each car was calculated.

The following table summarizes the calculated differences. Note that negative values

indicate less distance was traveled with the additive than without the additive.

Additive    Values Below Q1    Q1    Median    Q3    Values Above Q3

A           -10, -8, -2         1      3        4    5, 7, 9

B           -5, -3, -3         -2      1       25    35, 37, 40

(a) On the grid below, display boxplots (showing outliers, if any) of the differences of the two additives.

(b) Two ways that the effectiveness of a gasoline additive can be evaluated are by looking at either

the proportion of cars that have increased gas mileage when the additive is used in those cars, or

the mean increase in gas mileage when the additive is used in those cars.

i. Which additive, A or B, would you recommend if the goal is to increase gas mileage in the highest proportion of cars? Explain your choice.

ii. Which additive, A or B, would you recommend if the goal is to have the highest mean increase in gas mileage? Explain your choice.
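For the boxplots in part (a), outliers can be checked with the 1.5 × IQR criterion this packet uses. The sketch below is one possible way to do that check in Python. The function names are made up for illustration, and only the values outside the quartiles are tested, since a value between Q1 and Q3 can never be an outlier.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs: 1.5 IQRs below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Quartiles and tail values copied from the table in the problem.
additives = {
    "A": {"q1": 1, "q3": 4, "tails": [-10, -8, -2, 5, 7, 9]},
    "B": {"q1": -2, "q3": 25, "tails": [-5, -3, -3, 35, 37, 40]},
}

for name, d in additives.items():
    fences = outlier_fences(d["q1"], d["q3"])
    flagged = find_outliers(d["tails"], d["q1"], d["q3"])
    print(f"Additive {name}: fences = {fences}, outliers = {flagged}")
```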

Use Standardization to compare LOCATIONS


Suppose that UC Berkeley’s admissions office needs to compare scores of students

who take the SAT with those who take the ACT. Suppose that among the college’s applicants who take the SAT, scores have a mean of 2020 and a standard

deviation of 174. Further suppose that among the college’s applicants who take the ACT, scores have a mean of 29 and a standard deviation of 5.2.

(a) If applicant Bobby scored 2115 on the SAT, how many points above the SAT

mean did he score?

(b) If applicant Adeline scored 32 on the ACT, how many points above the mean

did she score?

(c) Is it sensible to conclude that since your answer to (a) is greater than your

answer to (b), Bobby outperformed Adeline on the admissions test? Explain.

(d) Determine how many standard deviations above the mean Bobby scored. To

do this: divide your answer to (a) by the standard deviation of the SAT scores. Then do the same for Adeline.

Once you have a sample’s mean and standard deviation, you can generalize the behavior of individual data values. Also, you can make comparisons of individual values from different distributions. One calculates a z-score, or standardized

score by subtracting the mean from the value of interest and then dividing by the standard deviation. These z-scores indicate how many standard deviations above

(or below) the mean a particular value falls. One should use z-scores only when working with mound-shaped distributions that are roughly symmetric.

(f) Which applicant has the higher z-score for his or her test score? In your own

words, explain who performed better.

(g) The mean SAT score in 1998 was 1013 (there were only two parts) with a standard deviation of 100 points. Mrs. Bain’s friend (Rowdy) scored 885. How did

Rowdy compare to Bobby and Adeline?


[Figure: dotplots of heights (cm) for college men and four-year-olds; horizontal axis from 0 to 220]


(h) Under what conditions does a z-score turn out to be negative?

Standardized Values and Z-Scores

If x is an observation from a distribution that has a known mean and standard

deviation, the standardized value of x is:

z = (x – x bar) ÷ standard deviation
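The formula above is easy to turn into a small helper. The sketch below is only an illustration (the function name `z_score` is not from the packet) and applies it to the three test-takers from the admissions example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

bobby = z_score(2115, 2020, 174)    # SAT applicants: mean 2020, SD 174
adeline = z_score(32, 29, 5.2)      # ACT applicants: mean 29, SD 5.2
rowdy = z_score(885, 1013, 100)     # 1998 SAT: mean 1013, SD 100

for name, z in (("Bobby", bobby), ("Adeline", adeline), ("Rowdy", rowdy)):
    print(f"{name}: z = {z:.2f}")
```

Because each z-score is measured in standard deviations rather than test points, the three results can be compared directly even though the tests use different scales.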

STANDARDIZATION

Scenario: Below are the distributions of heights (in centimeters) in a population of male high-

school students, and the heights in a population of male 4-year olds:

Mrs. Bain’s cousin Kyle is going to college. He’s very tall: 191 cm tall. Mrs. Bain also has a 4-year-old neighbor, named Joey. He is also very tall for his age: 112 cm. Mrs. Bain wants to know which one is taller, relative to their group of peers.

a) By looking at the displays of heights in each population, which one appears to be taller,

relative to their peer group? Provide evidence.

b) How many cm above the mean height for college guys is Kyle?

c) How many cm above the mean height for 4-year-olds is Joey?

d) Which group, college students or 4-year olds, shows more variability in their heights?

Evidence?

e) How many standard deviations above the mean height are Kyle and Joey? Show the

computations. Based on this measurement, which one is taller relative to his peers?

Height(cm) Mean Std. Dev’n

college 180 10

4-year 100 6


f) Consider the set of heights for this population of college men. The mean of all heights is 180

cm. Suppose that we subtract “180” from each height in the population. What new name could

we give to this transformed variable?

g) Now, suppose that after subtracting, we now divide each value by 10, the standard deviation.

How does the standard deviation change?

h) If we converted each college height into a z-score, what are the mean and standard deviation

of the z-scores? Shape?

Mean:________ SD:_______________ shape:__________

i) Take a guess: what happens when we convert the 4-year old heights into z-scores: What is the

new mean and SD? Shape?

Mean:_________ SD:_______________ shape:_________

j) How does standardizing change the (a) shape, (b) center, and (c) spread of a distribution?

k) Explain the purpose of converting two groups of data into z-scores.
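If you want to test your answers to (h) through (j), the sketch below standardizes a whole data set and then recomputes its mean and standard deviation. The heights are a made-up sample, not data from the packet, and the sample SD is used, so treat this as an illustration under those assumptions.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert every value to a z-score using the data's own mean and SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

heights_cm = [165, 172, 178, 180, 183, 188, 194]   # hypothetical sample
z = standardize(heights_cm)

print("z-scores:", [round(v, 2) for v in z])
print("mean of z-scores:", round(mean(z), 10))
print("SD of z-scores:", round(stdev(z), 10))
```

The z-scores stay in the same order as the original heights. Only the scale changes.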


Density Curves and Normal Distributions

Suppose we looked at an exam given to a large population of students. The histogram of these data looks like the graph to the left below. However, rather than show how many data values are in each bin, we now look at the bin height in terms of percentage. This histogram is symmetric, and the tails fall off from a single peak in the center (unimodal). There are no obvious outliers.

We now draw a smooth curve through the tops of the bins to get a better picture of the overall pattern of the data (right graph). This curve is an approximate mathematical model for the distribution. We will find that it is easier to work with this model than the actual histogram. We call this curve a density curve. Since the height of the histogram is in terms of percent, the total area under the histogram should be exactly 1 (as 100% of the scores are shown). In other words, if you added up the heights of all the bars, the total would equal 100%.

In the graph to the right, a density curve is

shown. It appears to have data values from 0

to 10 and is skewed right. The shaded area

under this density curve between 2 and 4 is .3109.

That is saying that 31.09% of the data lies

between 2 and 4.

DENSITY CURVE NEEDS TO KNOW

1. Any curve that lies on or above the x-axis and has a total area under it equal to 1 is considered a Density

Curve.

2. The MEDIAN of a density curve is called the “equal areas point”. The MEAN is called

the “balance point”. Why do you think they got those nicknames?


3. A Density Curve is used to describe the location of individual observations within a

distribution

4. Notation: because a Density Curve usually includes all the data of a population

(versus just a sample of the population), use Greek letters to denote the mean (µ: mu) and standard deviation (σ: sigma)

Good trivia night fact: English letters are used for statistics (x bar, s, s² – measures of a

sample) and Greek letters are used for parameters (σ, µ, etc. – measures of the

population)

5. SKEWNESS in a density curve affects the mean and median the same as with a

histogram:

Right skewed: mean ___ median

Left skewed: mean ___ median

Symmetric curve: mean ___ median


A measure of LOCATION: Percentiles

Let’s stick with College Entrance Exams. If you’ve

taken a close look at your SAT or PSAT scores, then

you may have noticed your percentile score. For

example, the 98th percentile is a score that separates

the bottom 98% of the cases from the top 2%.

The table to the right shows approximate percentiles

for scores on each section of the SAT in 2006. These

numbers describe the group of college-bound senior

males who graduated in the 2006-2007 school year.

How to read: A male college-bound student scoring

600 on critical reading scored higher than about

78% of his peers. Correspondingly, about 22% of his peers scored higher than he did.

a) Kiko’s writing score is higher than 90% of other

males like himself. Estimate his score. Explain your reasoning.

b) On which test are males’ scores the highest, on average? Provide specific evidence for

your decision.

c) Construct box plots comparing the scores of boys on the math and the writing section of the

SAT. Compare the distributions in context.


Density curves shaped like a bell are EVERYWHERE!

You may have noticed that the visual displays of some real world variables follow very similar patterns. This pattern is called a bell-curve or a Normal Distribution:

Grades for my regular statistics

class:

Or this histogram of SAT Writing exam scores for 2006:

1. With your group, come up with 3 additional types of

data sets that you would expect to be normally distributed.

2. With your group, come up with 3 types of data sets that you would expect to NOT be normally

distributed.

This shape tends to crop up frequently when looking at

data distributions. It is also referred to as a “bell

curve”. Under the right circumstances, we can use

Normal density curves to “idealize” the shape of

variables like those above. We capitalize Normal to

remind you that these curves are special. The more

“Normal” a distribution is, the closer the Normal model’s

predictions come to being perfectly true. No real distribution is

perfectly Normal, so these predictions are

approximations in most real-life settings.

Just like in any density curve, the area under the curve


between two points “a” and “b” corresponds to the proportion of the data that

falls between “a” and “b.”

What’s another word for proportion? ____________________

Recall: what does the entire area under any density curve equal? ______________

NOTATION alert: On a Normal density curve, we use GREEK letters to designate the mean and the standard deviation ~ because we are talking about a population.

Mean of normal model for a variable X = $\mu_X$

Std. deviation of X: $\sigma_X$

But with a data distribution, we use ROMAN letters (i.e. ENGLISH letters) to designate the mean and the standard deviation of a set of data:

Mean of data set for a variable X = $\bar{x}$

Std. deviation of X: $s_x$

For the graph at right, what is the

mean and standard deviation of this Normal Distribution?

Mean =

Standard Dev =

Does this curve represent an entire

population or just a sample? How can you tell?

For the graph at left, number the curves from widest to narrowest spread (1 = widest).

The 68 – 95 – 99.7% Rule: A

property of ANY Normal curve


A little something for our Academy of Finance kiddos…

The annual increase in value for each stock on the S&P 500 varies from stock to stock. Suppose

that in the year 2005, one economist (N. O. Wei) projected that the increase for all stocks

(measured in percentage points) will be normally distributed, with a mean of 5%, and a standard

deviation of 8%.

a) Sketch a normal density curve that can model the distribution of annual increases in

2005. Mark the values that correspond to the mean increase, and the increases that are

1, 2, and 3 SD away from the mean.

According to this economist’s predictions [sketch a normal density curve for each

justification]….

b) The top 2.5% of all stocks will show an increase of at least how much?

c) How many stocks will show a loss of at least 3% that year?

d) What percent of all stocks will show an increase from 13% to 21%?

e) Would it be likely to find a stock that gains more than 29% in a year?


f) We would expect the top 2.5% of all stocks to gain how much in a year?

g) Suppose we want to know what percentage of stocks would gain in value. Shade in the

region we’re looking for.

Understanding the STANDARD NORMAL, or Z-Model

Recall the purpose of standardized scores, or calculating z-scores. By

standardizing, we can rescale any set of data so its mean equals 0 and its standard deviation

equals 1. When a histogram of real-world data is unimodal and symmetric, then normal

curves often are used to describe the distribution shown in the histogram.

Facts about the normal model, each followed by how it applies to data that are approximately normal in shape:

Model: The mean equals 0 and the standard deviation is 1: $\mu = 0$, $\sigma = 1$.

Data: When you convert original data values to z-scores, the mean z-score will be 0 and the standard deviation will be 1: $\bar{z} = 0$, $s_z = 1$.

Model: About 68% of the area under the curve falls between -1 and 1.

Data: About 68% of the data will be within one standard deviation of the mean.

Model: About 95% of the area under the curve falls between -2 and 2.

Data: About 95% of the data will fall within 2 standard deviations of the mean.

Model: About 99.7% of the area under the curve falls between -3 and 3.

Data: Nearly all (99.7%) of the data will fall within 3 standard deviations of the mean.
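The three percentages above are rounded. As a quick check, the sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) to compute the area under the standard normal curve within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the standard normal (z) model

# Area between -k and k = (area to the left of k) - (area to the left of -k)
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} SD of the mean: {area:.4f}")
```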

Example 1: 4 runners run the New York marathon. We wish to determine which runner had a faster

time relative to his age group. Assume normality. Complete the chart.


Best time ______ 2nd best time______ 3rd best time _______ Worst time ________

Example 2: Suppose you have a sample of data that can be modeled well with a normal curve.

Use the 68-95-99.7 rule to answer the following:

a) About what percent of the sample has a positive z-score? Show with a sketch of the normal

density curve.

b) About what percent of the sample has z-scores greater than 1? Show with a sketch of the

normal density curve.

c) About what percent are more than 2 standard deviations below the mean? Show with a

sketch of the normal density curve.

d) Suppose that heights for all 17 year old males can be modeled well with a normal curve.

Suppose we know that the US Navy will reject any young male whose height is more than 2

SD below or 1 SD above the mean height. What percent of young men are eligible for

admission to the Navy? (Note: this is a fake example. The situation I described is not true.)

e) Approximately what percent of the sample has z-scores greater than 1.5?

Using the TI-84 for normal calculations


Suppose you want to find any area between a and b under a standard normal curve. TI COMMAND: 2nd DISTR normalcdf(a, b). Output = Area.

1) define variable of interest, x = ?

2) check if normal model appropriate

3) make a picture. Point out the GOAL in picture!

4) state goal, and re-phrase the goal in terms of z scores, use inequality signs where appropriate

5) use the TI-84 for calculation.

6) answer question in context

Men's heights are approximately normally distributed with a mean of 70 inches and a standard deviation of 2.6 inches.

Example 1: About what proportion of men are between 67 and 76 inches? Set up (pg. 142 in your book):

1) Define variable of interest. Let x = height.

2) Check if normal model appropriate. We’re told men’s heights are approximately normal, so we can use the normal model.

3) Make a picture. Point out the GOAL in picture!

4) State goal, and re-phrase the goal in terms of z scores. If 67 < height < 76, then -1.154 < z < 2.307.

5) Use the TI to do the calculation. Enter 2nd DISTR normalcdf(-1.154, 2.307) ENTER.

6) Answer question in context. Area = .865. About 86.5% of all men are between 67 and 76 inches tall.
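If you do not have a calculator handy, the same area can be found in Python. The sketch below defines a hypothetical helper named `normalcdf` that mimics the TI command using `NormalDist` from the standard `statistics` module. It is a stand-in for illustration, not the calculator's own routine.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under a normal curve between lower and upper (mimics the TI command)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

mu, sigma = 70, 2.6                 # men's heights, in inches

# Step 4: re-phrase the goal in terms of z-scores
z_low = (67 - mu) / sigma
z_high = (76 - mu) / sigma

# Step 5: area between the two z-scores on the standard normal curve
area_z = normalcdf(z_low, z_high)

# Same area found directly on the height scale, without converting to z
area_x = normalcdf(67, 76, mu=mu, sigma=sigma)

print(f"z-scores: {z_low:.3f} to {z_high:.3f}")
print(f"area = {area_z:.3f} (z scale), {area_x:.3f} (height scale)")
```

Both routes give the same area, which is the point of standardizing: the z-scores carry all the information needed.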

Example 2: What proportion of men are less than 6 feet tall?


1. Let X = height.

2. We already know a normal model is appropriate. Goal: find the proportion of men under 72 inches.

3. (picture)

4. If height < 72, then z < (72 − 70) ÷ 2.6, so z < .7692

5. To find the area on TI: 2nd DISTR normalcdf(-9999, .7692) ENTER. NOTE: Since the area we want has no lower limit, we substitute a sufficiently large negative number like -9999 for negative infinity.

Area = .7791

6. Conclusion: about 77.9% of all men are under 6 feet tall.

Example 3: Find the proportion of men who are taller than 64 inches. Set up the problem, draw the curve, show the steps, and use the TI-84 to find the proportion.

1) define variable of interest, x = ?

2) check if normal model appropriate

3) make a picture. Point out the GOAL in picture!

4) state goal, and re-phrase the goal in terms of z scores, use inequality signs where appropriate

5) use the TI-84 for calculation.

6) answer question in context

Answer: About 1.05% of the area lies under 64 inches, so about 98.95% of men are taller than 64 inches.

Flip it and reverse it! Suppose that you know the area under a normal curve to the left of some location, and you want to find the z-score of that location:

1) How is this different from what we were doing above?

COMMAND: 2nd DISTR invNorm(k), where k is the area to the left. The output will be the z-score we seek.

Example #4: If we assume the normal model, an observation at the 75th percentile has what z-score? Draw a picture indicating what we know, and what we want.

1) define variable of interest.

2) check if normal model appropriate.

3) make a picture. Point out the GOAL in picture!

4) state goal, and re-phrase the goal in terms of z scores.

5) use the TI-84 for calculation.

6) answer question in context

Example #5: The lowest 15% of all men are all under what height?

1) define variable of interest.

2) check if normal model appropriate.

3) make a picture. Point out the GOAL in picture!

4) state goal, and re-phrase the goal in terms of z scores.

5) use the TI-84 for calculation.

6) answer question in context

Example #6: Only 0.35% of the population of adult males are taller than Mr.

Bain. According to the normal model, how tall is Mr. Bain?


Answer: Mr. Bain is about 77 inches tall, or 6’ 5” .
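The invNorm step can also be sketched in Python with `NormalDist.inv_cdf` from the standard `statistics` module. The helper name `inv_norm` is made up for this illustration. It takes the area to the LEFT of the location, so for Example #6 the 0.35% in the upper tail must first be converted to a left-hand area.

```python
from statistics import NormalDist

def inv_norm(area_left, mu=0.0, sigma=1.0):
    """Value on a normal curve with the given area to its left (mimics invNorm)."""
    return NormalDist(mu, sigma).inv_cdf(area_left)

mu, sigma = 70, 2.6            # men's heights, in inches

area_above = 0.0035            # only 0.35% of men are taller than Mr. Bain
area_left = 1 - area_above     # so 99.65% of men are shorter

z = inv_norm(area_left)        # z-score at that location
height = mu + z * sigma        # un-standardize: x = mean + z * SD

print(f"z = {z:.3f}, height = {height:.1f} inches")
```

The last line reverses the z-score formula: multiply by the standard deviation, then add the mean.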

Example #7: Joan found that her average telephone call last month was 9.6 minutes with a standard

deviation of 2.4 minutes. Her telephone call usage was roughly normal. What percentage of her calls…

a. were less than 5 minutes? b. were less than 10 minutes?

c. were more than 15 minutes? d. were between 4 and 8 minutes?

Find the telephone call length that represents:

e. the 90th percentile f. the 98th percentile

g. the top 0.5 percent h. the bottom 1%

Example 8: To the right is a density curve for a distribution of values between 0 and 1.3. Use

geometry to determine the answers to these questions.

a. What percentage of the data lies below .6?

b. What percentage of the data lies above .5?

c. What percentage of the data lies between .3 and .9?

d. What percentage of the data is exactly .8?

e. What is the median of the data?

f. Draw the approximate position of the mean on the drawing.

More to think about:

a) In a large set of normally distributed data, Q1 and Q3 correspond to which z-scores? (Hint: if you get one of them, then you should be able to figure out the other one easily!)


b) We have declared a value an “outlier” if it falls more than 1.5 IQR’s below Q1 or 1.5 IQR’s above

Q3. In a large set of normally distributed data, what proportion of values would be declared

outliers by this criterion?

c) Suppose that in a population of professional basketball players in the USA, Mr. Bain only ranks in

the 45th percentile. If the distribution of basketball player heights is also normally distributed, and

has the same SD as the population of all men (2.6 inches), then what is the mean height for this

population of players?

d) Suppose that among a population of Olympic-level swimmers, 10% can swim 100 meters in

under 49 seconds, while 6% require 55 seconds or more to swim 100 meters. If it’s safe to assume

a normal model for the distribution for times, what are the mean and standard deviation of the

distribution of times?

Assessing Normality: How to assess whether a normal model is appropriate for data

If the problem set doesn’t explicitly state that the distribution is normal, it is necessary for you to

check for normality. For more information, see pages 148-154 in your textbook.

Method 1: Plot the data (this alone is not enough to establish Normality; you have to confirm with another

method). Construct a histogram or a stemplot. See if the graph is approximately…

unimodal, bell-shaped and symmetric about the mean

Method 2: Construct a Normal probability plot or Normal quantile plot. This type of plot displays how much each data value varies from what would be expected if the data were Normal.

a. This can be found on the TI-84.

b. This is a scatter plot of each data value versus its predicted z-score based on its order in the data set.

c. If the distribution is roughly Normal, then the plot should look like a diagonal straight

line.


By hand

1. Arrange the observed data values from smallest to largest. Record the percentile of the

data each value occupies.

2. Use the standard Normal distribution to find the z-scores at these percentiles.

3. Plot each data point x against the corresponding z. If the data distribution is close to

Normal, then the plotted points will lie close to some straight line.
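The three by-hand steps can be sketched in Python as follows. The percentile rule (i − 0.5)/n for the i-th smallest value is an assumption, since the packet does not say which rule to use and textbooks differ. The sample is made up for illustration.

```python
from statistics import NormalDist

def normal_plot_points(values):
    """Pair each sorted data value with the z-score expected at its percentile."""
    data = sorted(values)                      # step 1: order the data
    n = len(data)
    std_normal = NormalDist()                  # standard normal, mean 0 and SD 1
    points = []
    for i, x in enumerate(data, start=1):
        percentile = (i - 0.5) / n             # assumed percentile for the i-th value
        z = std_normal.inv_cdf(percentile)     # step 2: z-score at that percentile
        points.append((x, z))                  # step 3: the point to plot
    return points

sample = [61, 64, 66, 67, 68, 69, 70, 71, 73, 76]   # hypothetical data
points = normal_plot_points(sample)
for x, z in points:
    print(f"x = {x:3d}   expected z = {z:6.2f}")
```

Plotting these pairs on graph paper and checking whether they fall near a straight line completes the check.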

Method 3: Does the data follow the 68-95-99.7% rule? Check each standard deviation.

a. Compute the mean and SD of the data

b. Then see if about 68% of the data are within one SD of the mean, about 95% are within 2 SD

of the mean, and nearly all (99.7%) are within 3 SD of the mean.
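Method 3 is simple to automate. The sketch below counts what fraction of a data set falls within 1, 2, and 3 SD of its mean. The function name and the sample are hypothetical. With a sample this small, the fractions can only roughly match 68%, 95%, and 99.7%, so use the comparison as a guide rather than a strict test.

```python
from statistics import mean, stdev

def empirical_rule_check(values):
    """Fraction of the data within 1, 2, and 3 sample SDs of the sample mean."""
    m, s = mean(values), stdev(values)
    n = len(values)
    result = {}
    for k in (1, 2, 3):
        inside = sum(1 for v in values if abs(v - m) <= k * s)
        result[k] = inside / n
    return result

sample = [61, 64, 66, 67, 68, 69, 70, 71, 73, 76]   # hypothetical data
fractions = empirical_rule_check(sample)
targets = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (1, 2, 3):
    print(f"within {k} SD: {fractions[k]:.0%} of the data (Normal model: about {targets[k]:.1%})")
```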
