STT200 Chapter 1-6 KK AM


Chapter 1 Intro to Stats

Definitions

Data are observations (such as measurements, genders, survey responses) that have been collected.

Statistics is a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

Population is the complete collection of all elements (scores, people, measurements, and so on) to be studied. The collection is complete in the sense that it includes all subjects to be studied.

Parameter is a numerical measurement describing some characteristic of a population.

Census is the collection of data from every member of the population.

Sample is a subset of a population. An important activity in this class is to demonstrate how we can use sample data to form conclusions about populations. It is extremely critical to obtain sample data that are representative of the population from which the data are drawn.

A statistic is a numerical measurement describing some characteristic of a sample.

A survey is one of many tools that can be used for collecting data. A common goal of a survey is to collect data from a small part of a larger group so that we can learn something about the larger group.

Example: Identify a sample and population, a statistic and a parameter.

A Gallup Poll asked this of 1087 adults: “Do you have occasion to use alcoholic beverages such as liquor, wine, or beer, or are you a total abstainer?”

The 1087 survey subjects are a sample, while the population consists of the entire collection of all 302,682,345 adult Americans (or whatever is the exact number for now).


A parameter: the proportion of ALL adult Americans who use alcoholic beverages.

A statistic: the proportion of the surveyed Americans who use alcoholic beverages.

Sample data must be collected in an appropriate way, such as through a process of random selection.

Chapter 2 – Data

Data - recorded information together with its context.

Context - tells Who, What, When, Where, How and Why something was measured.

1. Who – individuals about whom data are collected (participants, respondents, subjects, experimental units, records, cases)

2. What – characteristics recorded about each individual (variables)

Variables:
   quantitative or numerical (measured in units)
   qualitative or categorical (labels)

3. When – time

4. Where – place

5. How – method of collecting data

6. Why – purpose of study


Classwork Ch 2

Homework tips: First complete your lecture notes, marking everything that is unclear. Read the textbook, paying attention to the examples, and get help with any parts that are still unclear. Then do the assigned homework problems, as listed in the Class Schedule.

Chapter 3 Categorical Data

Example: For each of 2201 people on the Titanic the following variables were recorded: Ticket Class (First, Second, Third, Crew), Survival (Dead or Alive), Age (Adult or Child) and Sex (Male or Female).

ONE VARIABLE

Who = people on Titanic
What = Ticket Class


Frequency Table = list of categories and counts or percentage of observations of each category.

Class     Count (frequency)   Relative frequency
First     325                 325/2201 = 0.1477 = 14.766%
Second    285
Third     706
Crew      885
Total     2201                100% = 1

Distribution of a variable is the list of possible values of the variable and the corresponding relative frequencies.

Graphical displays of a distribution of categorical data:

1. Bar chart
2. Pie chart

The area of a bar or a slice should correspond to the frequency of a category.

Class     Count (frequency)   % (relative frequency)
First     325                 14.8
Second    285                 12.9
Third     706                 32.1
Crew      885                 40.2
Total     2201                100

Displaying categorical (qualitative) data: The most common displays for categorical data are the bar graphs and pie graphs. I used EXCEL to build the bar graphs and a pie graph for the data above:

Constructing the pie graph: Convert each relative frequency into a central angle of the circle by multiplying it by 360°.

Central angle = relative frequency × 360°
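The frequency table and the central-angle rule can be checked with a few lines of Python. This is only a sketch: the counts are copied from the Titanic table in these notes, and the variable names are made up for illustration.

```python
# Titanic ticket-class counts, taken from the frequency table in these notes.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

total = sum(counts.values())  # 2201 people in all

# Relative frequency = count / total; central angle = relative frequency x 360 degrees.
relative = {cls: n / total for cls, n in counts.items()}
angles = {cls: rel * 360 for cls, rel in relative.items()}

for cls in counts:
    print(f"{cls:<7}{counts[cls]:>5}{relative[cls] * 100:>7.1f}%{angles[cls]:>8.1f} deg")
```

The relative frequencies add up to 1, so the central angles add up to the full 360°.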

[Figure: two bar charts by ticket class (First, Second, Third, Crew), one showing count (frequency) and one showing relative frequency]


Bar chart: [Figure: bar chart of counts by ticket class (First, Second, Third, Crew)]

Pie chart: [Figure: pie chart with slices First 15%, Second 13%, Third 32%, Crew 40%]

Classwork Chapter 3

Displaying TWO VARIABLES with a Contingency Table

Example: For each of 2201 people on the Titanic the following variables were recorded: Ticket Class (First, Second, Third, Crew) and Survival (Dead or Alive).

Who = people on Titanic
What = (1) Ticket Class and (2) Survival
Why = did the chance of surviving depend on ticket class?


Contingency Table

                       Ticket Class
Survival    First   Second   Third   Crew   Total
Alive         202      118     178    212     710
Dead          123      167     528    673    1491
Total         325      285     706    885    2201

Marginal distributions: the distribution of each variable on its own, given by the Total row (Ticket Class) and the Total column (Survival).

CONDITIONAL DISTRIBUTIONS

A distribution of one variable, given a value of another, is called a conditional distribution. It can be expressed as percentages of a column or percentages of a row.

Example: What percent of the survivors had First Class tickets? What percent of the passengers with First Class tickets survived? …

Segmented Bar Chart

Variables are independent if the conditional distribution for each category is the same as the corresponding marginal distribution.

Is the variable SURVIVAL independent of or dependent on TICKET CLASS?

Classwork Ch 3: 32
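The marginal and conditional distributions can be computed directly from the table counts. The following is a minimal sketch; the counts come from the contingency table in these notes, and the dictionary layout and names are illustrative choices.

```python
# Counts copied from the Titanic contingency table in these notes.
table = {
    "Alive": {"First": 202, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 123, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Marginal totals: one per row (survival) and one per column (ticket class).
row_totals = {status: sum(row.values()) for status, row in table.items()}
col_totals = {c: sum(table[s][c] for s in table) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distribution of survival: share of all people who lived.
overall_alive = row_totals["Alive"] / grand_total

# Conditional distribution of survival given ticket class (column percentages).
alive_given_class = {c: table["Alive"][c] / col_totals[c] for c in classes}

# Conditional distribution of ticket class given survival (row percentages).
class_given_alive = {c: table["Alive"][c] / row_totals["Alive"] for c in classes}

print(f"Overall survival rate: {overall_alive:.1%}")
for c in classes:
    print(f"{c:<7} survived: {alive_given_class[c]:.1%}   "
          f"share of survivors: {class_given_alive[c]:.1%}")
```

If survival were independent of ticket class, each class's survival rate would be close to the overall rate, so comparing the printed percentages is one way to approach the independence question.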

[Figure: segmented bar chart showing the percentage Alive and Dead (0% to 100%) within each ticket class: First, Second, Third, Crew]


Chapter 4 – Displaying and Summarizing Quantitative Data

Histogram

Example: Consider the following set of raw data, which are the high temperatures recorded for 30 consecutive days. Summarize the data by creating a frequency distribution of the temperatures.

Data Set - High Temperatures for 30 Days

90 85 89 90 83 89 90 89 85 89
87 87 84 81 82 83 86 86 90 82
81 82 83 84 89 85 86 85 81 89

A histogram is a special kind of bar graph in which the horizontal scale represents the classes of data values and the vertical scale represents the frequencies. There are no gaps between the bars, and the widths of the bars are usually equal.


The frequency distribution table:

Frequency Distribution for High Temperatures

Temperature   Tally     Frequency   Relative Frequency   Cumulative Frequency
81            ///       3           0.10                 3
82            ///       3                                6
83            ///       3                                9
84            //        2                                etc.
85            ////      4
86            ///       3
87            //        2
88                      0
89            //////    6
90            ////      4
Total:                  30          1

Displays of data are clearer when the data are first grouped into sets called classes or bins.

Making a histogram:

1. First, make the frequency table by slicing up the entire span of values covered by the quantitative variable into equal-width piles called classes or bins.

2. The bins and the counts in each bin give the distribution of the quantitative variable.

3. A histogram plots the bin counts as the heights of bars (like a bar chart).

4. A relative frequency histogram displays the percentage of cases in each bin instead of the count. In this way, relative frequency histograms have the total area of all bars equal to 100% (1 square unit, if you use fractions).

Class limits   Frequency   Relative Frequency
81-82          6           20.0%
83-84          5           16.7%
85-86          7           23.3%
87-88          …           …
89-90
Total:         30          100%
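The tallying and binning steps can also be done by computer. The sketch below counts the 30 temperatures exactly as they are listed in the data set; the bin width of 2 degrees and all names are assumptions chosen to match the class limits 81-82, 83-84, and so on.

```python
from collections import Counter

# The 30 daily high temperatures listed in the example above.
temps = [90, 85, 89, 90, 83, 89, 90, 89, 85, 89,
         87, 87, 84, 81, 82, 83, 86, 86, 90, 82,
         81, 82, 83, 84, 89, 85, 86, 85, 81, 89]

n = len(temps)
freq = Counter(temps)

# Ungrouped table: value, frequency, relative frequency, cumulative frequency.
cumulative = 0
for value in range(min(temps), max(temps) + 1):
    cumulative += freq[value]
    print(f"{value}  {freq[value]:>2}  {freq[value] / n:.2f}  {cumulative:>2}")

# Grouped table: equal-width bins of two degrees (81-82, 83-84, ...).
bin_width = 2  # assumption: same class width as the class-limits table
bins = Counter()
for t in temps:
    lower = min(temps) + ((t - min(temps)) // bin_width) * bin_width
    bins[(lower, lower + bin_width - 1)] += 1

for (low, high), count in sorted(bins.items()):
    print(f"{low}-{high}  {count:>2}  {count / n:.1%}")
```

Changing `bin_width` shows how the appearance of the distribution depends on the choice of bins.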

[Figure: histogram of the high temperatures; horizontal axis: classes 81-82, 83-84, 85-86, 87-88, 89-90; vertical axis: frequency, 0 to 10]

The histogram built on the Relative Frequency table has exactly the same shape.

Graphs convey information about the distribution of the data:

shape,
center,
spread,
possible outliers.

Example: More graphs

Given are test scores: 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94. Make a stem-and-leaf display and a dot-plot.

a) Make a Stem-and-leaf

Stem: Leaves
3
4
5
6
7
8
9

1. First, cut each data value into leading digits (“stems”) and trailing digits (“leaves”).

2. Use the stems to label the bins.


3. Use only one digit for each leaf—either round or truncate the data values to one decimal place after the stem.
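The three steps can be sketched in Python for the test scores. This assumes two-digit whole-number scores, so the tens digit is the stem and the ones digit is the leaf and no rounding is needed; the names are illustrative.

```python
from collections import defaultdict

scores = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
          62, 64, 68, 71, 72, 76, 80, 89, 94]

# Stem = tens digit, leaf = ones digit.
display = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)
    display[stem].append(leaf)

print("Stem: Leaves")
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>4}: {leaves}")
```

Looping over every stem from the smallest to the largest keeps empty stems in the display, which matters for seeing gaps in the distribution.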

b) Make a Dotplot: ___________________________________

Shapes of distributions:

Normal distribution: a symmetric, bell-shaped graph with “tails” on both sides. It plays a special role in statistics.

Other Shapes of distribution: Symmetric:

Uniform distribution

Bimodal:


Skewed distribution:

Skewed to the left

Skewed to the right

The outliers are the data values far away from the main group.

The shape of a histogram can be described by a smooth curve that roughly follows the tops of the bars. This can be done by eye or by numerical algorithms on the computer. The shape often follows certain patterns:

Symmetric (uniform, bell-shaped, etc.)

skewed to the right or left

outliers - unusual observations which do not “fit” the rest of the data

unimodal (one major peak) or bimodal (two major peaks)

large or small spread (is the histogram wide or narrow?)

Avoid Common Errors:

Don’t make a histogram of a categorical variable; bar charts or pie charts should be used for categorical data.

Don’t look for shape, center, and spread of a bar chart.

Choose the number of bins (a bin width) appropriate to the data.

Changing the bin width changes the appearance of the histogram.

Describing Distribution Numerically

Measures of the center:
• mean
• median

Example: Find the measures of center for the data: 45 46 49 35 76 80 89 94 37 61 62 64 68 56 57 57 59 71 72

Sorted Data:


35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94

Max = 94, Min = 35, n = 19

• mean = the average value = sum/n
Mean = (35+37+...+94)/19 = 1178/19 = 62

• median = the middle value (or the average of the two middle values)
Median = 61
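The same two measures of center can be checked with Python's standard `statistics` module. This is just a quick verification sketch using the unsorted scores from the example.

```python
import statistics

scores = [45, 46, 49, 35, 76, 80, 89, 94, 37, 61,
          62, 64, 68, 56, 57, 57, 59, 71, 72]

n = len(scores)
mean = sum(scores) / n               # sum / n
median = statistics.median(scores)   # sorts internally; middle value for odd n

print(f"n = {n}, sum = {sum(scores)}, mean = {mean}, median = {median}")
```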

Note: For skewed distributions the median is a better measure of the center than the mean.

Measures of the spread:
• range
• interquartile range (IQR)
• variance
• standard deviation

In a sample, the variance = "average" squared deviation from the mean:

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}, \qquad y - \bar{y} = \text{deviation}$$

Standard Deviation s = square root of the variance

Example: Find the variance and standard deviation for the data below.

3, 5, 6, 10     n = 4

First, find the mean:

$$\bar{y} = \frac{\sum y}{n} = \frac{3 + 5 + 6 + 10}{4}$$

Data   Deviation   Deviation^2
3
5
6
10

Variance: $s^2 =$

Standard Deviation: $s = \sqrt{s^2} =$
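The table of deviations for the data 3, 5, 6, 10 can be worked through in Python as a check on hand calculations. This is a sketch of the sample formulas (dividing by n - 1); the variable names are illustrative.

```python
import math
import statistics

data = [3, 5, 6, 10]
n = len(data)

mean = sum(data) / n
deviations = [y - mean for y in data]
squared = [d ** 2 for d in deviations]

# Sample variance divides by n - 1, not n.
variance = sum(squared) / (n - 1)
std_dev = math.sqrt(variance)

print("Data  Deviation  Deviation^2")
for y, d, d2 in zip(data, deviations, squared):
    print(f"{y:>4} {d:>10.1f} {d2:>12.1f}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")

# Cross-check against the standard library's sample formulas.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))
```

The deviations always add up to zero, which is why they are squared before averaging.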


Rounding rule: round to at least one more decimal place than the data; in this case, to the tenths.

Measures of Position: Percentiles

The pth percentile in a ranked set of data is the value that separates the p% of values that are smaller (or not greater) from the remaining (100 - p)% of values that are greater (or at least not smaller) than that value.

The median is the 50th percentile of a given set of data.

A calculator or computer program might give somewhat different answers!

Quartiles: “special” percentiles

Q1 (First or Lower Quartile) separates the bottom 25% of sorted values from the top 75%. It’s the 25th percentile.

Q2 (Median, Second Quartile) separates the bottom 50% of sorted values from the top 50%. It’s the 50th percentile.

Q3 (Third or Upper Quartile) separates the bottom 75% of sorted values from the top 25%. It’s the 75th percentile.

CLASSWORK Ch 4

d) standard deviation (use calculator)
e) variance
g) determine whether there are outliers
h) make a box plot (with outliers indicated, if any)


Chapter 5 Understanding and Comparing Distributions

One more display: a Boxplot

Data (sorted!): 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94

Five-Number Summary: Minimum, Lower Quartile, Median, Upper Quartile, Maximum

Min = 35, Median = 61, Max = 94

Q3 = Upper quartile = middle of upper half (include median if n is odd)
Q1 = Lower quartile = middle of lower half (include median if n is odd)

Upper half: 35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89 94]
Q3 = (71+72)/2 = 71.5

Lower half: [35 37 45 46 49 56 57 57 59 61] 62 64 68 71 72 76 80 89 94
Q1 = (49+56)/2 = 52.5

Five-number summary: Min = 35, Q1 = 52.5, Median = 61, Q3 = 71.5, Max = 94

Outliers = data values that lie beyond the fences

Interquartile range (IQR) = Q3 - Q1 = 71.5 - 52.5 = 19
Upper fence = Q3 + 1.5 × IQR = 71.5 + 1.5 × 19 = 100 (no outliers here)
Lower fence = Q1 - 1.5 × IQR = 52.5 - 1.5 × 19 = 24 (no outliers here)

(Extreme outliers are values beyond fences placed twice as far out, that is, 3 × IQR from the quartiles.)

Boxplot:
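The five-number summary and the fences can be reproduced with a short script. The sketch below follows the quartile rule used in these notes (halves include the median when n is odd); other software may use a different quartile convention and give slightly different values.

```python
import statistics

scores = sorted([35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
                 62, 64, 68, 71, 72, 76, 80, 89, 94])
n = len(scores)
half = n // 2

# Halves follow the rule in these notes: include the median when n is odd.
if n % 2 == 1:
    lower_half = scores[:half + 1]
    upper_half = scores[half:]
else:
    lower_half = scores[:half]
    upper_half = scores[half:]

q1 = statistics.median(lower_half)
med = statistics.median(scores)
q3 = statistics.median(upper_half)
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles; values outside them are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("Five-number summary:", scores[0], q1, med, q3, scores[-1])
print("IQR:", iqr, "fences:", lower_fence, upper_fence, "outliers:", outliers)
```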


Exercise: draw the histogram for the data above.

Try also the calculator:

Data: 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 94

Classwork Chapter 5:


An answer to a classwork problem:

The Summary Chapters 4-6

Always report the shape of a distribution, along with a center and a spread.

If the shape is skewed, report the median and IQR

If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.

If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

Don’t report too many decimal places.

Don’t round in the middle of a calculation.


Chapter 6 – Standardizing and Normal Model

We compare individual data values to their mean, relative to their standard deviation, using the following formula:

For a sample: $z = \frac{y - \bar{y}}{s}$

For the population: $z = \frac{y - \mu}{\sigma}$

We call the resulting values standardized values, denoted as z, or z-scores.

Standardized values have no units.

z-scores measure the distance of each data value from the mean in standard deviations.

A negative z-score tells us that the data value is below the mean, while a positive z-score tells us that the data value is above the mean.

Arithmetic operations on the data:

1. Shifting data

Adding (or subtracting) a constant to each value will increase (or decrease) the measures of position (mean, median, percentiles, max and min) by the same constant. The shape and the measures of spread (range, IQR, standard deviation) remain unchanged.

2. Rescaling data

When each value is multiplied (or divided) by a positive constant, all measures of position (mean, median, percentiles, max and min) and all measures of spread (range, IQR, standard deviation) are multiplied (or divided) by that constant.
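Both rules can be seen numerically on the test scores used earlier in these notes. In this sketch the constants 5 and 2 are hypothetical values picked only for illustration, and `summary` is a helper defined here, not a library function.

```python
import statistics

scores = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
          62, 64, 68, 71, 72, 76, 80, 89, 94]

def summary(values):
    """Return (mean, median, range, standard deviation) of a list."""
    return (statistics.mean(values), statistics.median(values),
            max(values) - min(values), statistics.stdev(values))

shift, scale = 5, 2  # hypothetical constants, chosen only for illustration

original = summary(scores)
shifted = summary([y + shift for y in scores])
rescaled = summary([y * scale for y in scores])

print("original:", original)
print("shifted: ", shifted)    # mean and median move by 5; range and sd unchanged
print("rescaled:", rescaled)   # all four are multiplied by 2
```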

Standardizing data into z-scores shifts the data by subtracting the mean and rescales the values by dividing by their standard deviation.

Standardizing into z-scores does not change the shape of the distribution.

Standardizing into z-scores changes the center by making the mean 0.

Standardizing into z-scores changes the spread by making the standard deviation 1.
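A short check of these properties, again using the test scores from the earlier chapters as sample data. This is a sketch with the sample formula (standard deviation dividing by n - 1).

```python
import statistics

scores = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
          62, 64, 68, 71, 72, 76, 80, 89, 94]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)   # sample standard deviation (divides by n - 1)

# z = (y - mean) / sd for every data value.
z_scores = [(y - mean) / sd for y in scores]

print(f"mean of z-scores: {statistics.mean(z_scores):.6f}")
print(f"sd of z-scores:   {statistics.stdev(z_scores):.6f}")
print(f"z-score of the largest value: {z_scores[-1]:.2f}")
```

Up to floating-point rounding, the z-scores have mean 0 and standard deviation 1, while their order and relative spacing (the shape) are the same as in the original scores.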


4. Hams. A specialty foods company sells “gourmet hams” by mail order. The hams vary in size from 4.15 to 7.45 pounds, with a mean weight of 6 pounds and standard deviation of 0.65 pounds. The quartiles and median weights are 5.6, 6.2, and 6.55 pounds.

a) Find the range and the IQR of the weights.

b) Do you think the distribution of the weights is symmetric or skewed? If skewed, which way? Why?

c) If these weights were expressed in ounces (1 pound = 16 ounces) what would the mean, standard deviation, quartiles, median, IQR, and range be?

d) When the company ships these hams, the box and packing materials add 30 ounces. What are the mean, standard deviation, quartiles, median, IQR, and range of weights of boxes shipped (in ounces)?

e) One customer made a special order of a 10-pound ham. Which of the summary statistics of part d might not change if that data value were added to the distribution?

14. Placement exams. An incoming freshman took her college's placement exams in French and mathematics. In French, she scored 82 and in math 86. The overall results on the French exam had a mean of 72 and a standard deviation of 8, while the mean math score was 68, with a standard deviation of 12. On which exam did she do better compared with the other freshmen?


68-95-99.7 Rule (the so-called Empirical Rule)

In a Normal model:
about 68% of the values fall within one standard deviation of the mean;
about 95% of the values fall within two standard deviations of the mean;
and about 99.7% (almost all!) of the values fall within three standard deviations of the mean.
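The three percentages can be verified against the standard Normal curve with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). This is only a numerical check, not part of the rule itself.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area under the standard Normal curve between z = -k and z = +k.
    within = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {within:.4f}")
```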

Notation

Normal Distribution: N(μ, σ), where μ = mean and σ = standard deviation

Standard Normal Distribution: N(0, 1)

Example: Suppose that we model SAT scores Y by the N(500, 100) distribution.

1. What percentage of SAT scores fall between 450 and 600?

z-scores of 450 and 600:
(450 - 500)/100 = -0.50
(600 - 500)/100 = 1.00

So, P(450 < Y < 600) = the area under the standard Normal curve between z = -0.5 and z = 1, that is, by the tables, 0.8413 - 0.3085 = 0.5328

Answer: 53.28%

TI-83: [2nd DISTR 2] normalcdf(450,600,500,100) =0.5328072
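For readers without a TI-83, the same area can be computed in Python. This sketch uses `statistics.NormalDist` from the standard library as a stand-in for normalcdf; the variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)   # the N(500, 100) model from the example

# P(450 < Y < 600) = area to the left of 600 minus area to the left of 450.
prob = sat.cdf(600) - sat.cdf(450)
print(f"P(450 < Y < 600) = {prob:.4f}")
```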


Question 2: For what value b are 10% of SAT scores greater than b?

Let z be the z-score of b. The area to the right of z must be 10%, so the area to the left is 90% (0.90). From Table Z, the z-score for which the area to the left is 0.90 is z = 1.28.

So, b is 1.28 standard deviations to the right of μ, that is

b = 100 × 1.28 + 500 = 628

Answer: b = 628

TI-83: [2nd DISTR 3] invNorm(.9,500,100) = 628.155
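The inverse calculation can be done in Python as well. In this sketch, `NormalDist.inv_cdf` from the standard library plays the role of invNorm; the names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)   # the N(500, 100) model for SAT scores

z = NormalDist().inv_cdf(0.90)   # z-score with 90% of the area to its left
b = sat.inv_cdf(0.90)            # same cutoff on the SAT scale: mu + z * sigma

print(f"z = {z:.2f}, b = {b:.3f}")
```

The small difference between 628 and 628.155 comes from rounding z to 1.28 when reading the table.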

28. IQ. Some IQ tests are standardized to a Normal model, with a mean of 100 and a standard deviation of 16.

a) Draw the model for these IQ scores. Clearly label it, showing what the 68–95–99.7 Rule predicts about the scores.

b) In what interval would you expect the central 95% of IQ scores to be found?

c) About what percent of people should have IQ scores above 116?

d) About what percent of people should have IQ scores between 68 and 84?

e) About what percent of people should have IQ scores above 132?


End of Part 1 Ch. 1-6. Do Homework!

REMEMBER:

Don’t use a Normal model when the distribution is not unimodal and symmetric.

Don’t use the mean and standard deviation when outliers are present—the mean and standard deviation can both be distorted by outliers.

Don’t round your results in the middle of a calculation.

Don’t worry about minor differences in results.