Graphs used for categorical/qualitative datatech132/3r6300.docx · Web viewA measure of variation is a measure of variability (i.e. spread or dispersion). The most commonly used measures

Describing and Interpreting Data /Descriptive Measures

This section presents concepts related to using and interpreting the following measures.

3r 6300 measures 1

Measures of Central Tendency

Mean

The mean is the most common measure of central tendency; it is what we commonly think of as the ‘average’ value. Calculate it by summing the values and dividing by the number of values (N for a population, n for a sample). Extremely large values in a data set will increase the value of the mean, and extremely low values will decrease it.

To calculate a weighted mean, first multiply each value by its weight (the cell frequency), then sum the products and divide by the total frequency.
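As a minimal sketch of the weighted-mean calculation, the Python below uses made-up scores and frequencies; the numbers and the function name `weighted_mean` are illustrative assumptions, not data from this handout.

```python
# Hypothetical frequency table: each value and how many times it occurs.
values = [70, 80, 90]
frequencies = [2, 5, 3]

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

print(weighted_mean(values, frequencies))  # (140 + 400 + 270) / 10 = 81.0
```

Each value counts in proportion to its frequency, so the result is pulled toward the value that occurs most often.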

Median

The median is the central point of the data. Half of the data has a lower numerical value than the median, and half has a higher numerical value. To find the median, arrange the data in order from smallest value to largest value, and then:

If there is an odd number of points, find the value that is in the center of the data.

If there is an even number of points, add the two middle values and divide by 2.

The median is not affected by extremely large or small values.

Mode

The mode is the data value that occurs with the greatest frequency. The mode is not affected by extreme values. There may be no mode or there may be more than one mode.

Extreme values affect the mean.

The median is not affected by extreme values so this measure is used when the data set contains extreme values.

For Discussion

Sales personnel for the Eastern Division report the following number of orders for the first quarter of the fiscal year.

Total Orders 160 175 190 215 230 240 290 290 320 350

The Sum = 2,460. Find and interpret the mean, median and mode.

If the 160 was entered in error and it is actually 60, what will happen to the mean value? What will happen to the median and the mode?
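One way to check the effect of the data-entry error is to compute the three measures for both versions of the data. This is a small sketch using Python's standard `statistics` module; the helper name `center` is an assumption made for illustration.

```python
import statistics

# Eastern Division orders from the discussion problem above.
orders = [160, 175, 190, 215, 230, 240, 290, 290, 320, 350]
# Same data with the first value replaced by the extreme low value 60.
orders_with_error = [60] + orders[1:]

def center(data):
    """Return (mean, median, list of modes) for a list of numbers."""
    return (statistics.mean(data),
            statistics.median(data),
            statistics.multimode(data))

print(center(orders))             # mean 246, median 235, mode 290
print(center(orders_with_error))  # mean falls to 236; median and mode do not move
```

Only the mean responds to the extreme value, which is why the median is preferred when a data set contains extreme values.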

Measures of Spread /Variation

A measure of variation is a measure of variability (i.e. spread or dispersion).

Measures of Variation

Range

Standard Deviation / Variance

Coefficient of Variation

The most commonly used measures of variability are the range and standard deviation.

The values of the range and standard deviation are never negative, and they increase as the spread of the data increases. If there is no variation, these measures equal 0.

Range

Subtract the smallest value from the largest, or report the smallest and largest values. Note that the range can be a misleading value because it depends only on the two most extreme observations.

(Levine)

Variance/Standard Deviation

The standard deviation measures the typical distance of the data values from their mean and is the most commonly used measure of variation.

The standard deviation is found by taking the square root of the variance. Because it is expressed in the same units as the data, it is more useful than the variance in reporting results, so it is the measure that is typically reported.

The standard deviation is affected by extreme values.

(Levine)

Note that the values for the standard deviation are different for a sample and a population.

For a sample, divide by n-1.

For a population, divide by N (the population size).

XL also has functions for these values: STDEV(array) and STDEVP(array)
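The difference between the two divisors can be seen by running both formulas on the same data. The sketch below uses Python's standard `statistics` module and the Eastern Division orders from the earlier discussion problem; treating the `p`-prefixed functions as the counterpart of STDEVP and the others as the counterpart of STDEV follows from their population and sample definitions.

```python
import statistics

# Eastern Division orders from the earlier discussion problem.
orders = [160, 175, 190, 215, 230, 240, 290, 290, 320, 350]

data_range = max(orders) - min(orders)   # largest value minus smallest value

pop_var = statistics.pvariance(orders)   # population formula: divide by N
pop_sd = statistics.pstdev(orders)       # population SD (compare STDEVP)
samp_var = statistics.variance(orders)   # sample formula: divide by n - 1
samp_sd = statistics.stdev(orders)       # sample SD (compare STDEV)

print("range:", data_range)
print("population:", pop_var, round(pop_sd, 2))
print("sample:", samp_var, round(samp_sd, 2))
```

The sample values are slightly larger because the same sum of squared deviations is divided by 9 instead of 10.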

All three data sets have the same mean. Note how the variation in the distribution of values changes the standard deviation.

A Note on Notation

Values that describe a sample are called statistics and are typically designated by Roman letters.

Values that describe a population are called parameters and are typically designated by Greek letters.

Some of these are shown in the following chart and the diagrams that follow.

Measure               Population parameter    Sample statistic
size                  N                       n
mean                  μ (mu)                  x̅ (x-bar)
variance              σ² (sigma squared)      s²
standard deviation    σ (sigma)               s

Population parameters are fixed numbers that describe the population; they are usually unknown.

Sample statistics describe a sample and are used to estimate the population parameters.

For Discussion

Sales personnel for the Eastern Division are being reviewed and they report the following number of orders for the first quarter of the fiscal year.

Find & interpret the range, variance and standard deviation. (Assume the population formula.)

Total Orders    X-µ     (X-µ)²
160             -86      7,396
175             -71      5,041
190             -56      3,136
215             -31        961
230             -16        256
240              -6         36
290              44      1,936
290              44      1,936
320              74      5,476
350             104     10,816

Sums: 2,460             36,990

Interpreting the Measures of Center and Spread

Coefficient of Variation

The coefficient of variation (CV) shows the variation of the data relative to the mean and is useful in comparing the variation of two data sets. Note that it is always expressed as a percentage.

CV = (SD / Mean) × 100%

For example, consider the test scores for two groups.

Group 1: Mean score = 80, SD = 5

CV = (5 / 80) × 100% ≈ 6%

Group 2: Mean score = 64, SD = 9

CV = (9 / 64) × 100% ≈ 14%

There is more variation in Group 2; it has a larger standard deviation and a higher coefficient of variation.

The z-score

The z-score is the number of standard deviations a data value is from the mean.

Z = (X − X̅) / S

Where:
X represents the data value
X̅ is the sample mean
S is the sample standard deviation

If a data point has a z-score that is less than -3 or greater than +3, it is considered to be an extreme value (an outlier).

Suppose the mean score on the test is 80 and the standard deviation is 5.

A grade of 70 has a z-score of -2. It is 2 standard deviations to the left of the mean. It is not an outlier.

A grade of 100 has a z-score of 4. It is 4 standard deviations to the right of the mean and is an outlier.
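The two calculations in this part of the section are short enough to sketch directly. The function names below are assumptions chosen for illustration, and the cutoff of 3 is the rule of thumb stated in the text.

```python
def coefficient_of_variation(mean, sd):
    """CV as a percentage: (SD / mean) * 100."""
    return sd / mean * 100

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(x, mean, sd, cutoff=3):
    """Flag values more than `cutoff` standard deviations from the mean."""
    return abs(z_score(x, mean, sd)) > cutoff

print(coefficient_of_variation(80, 5))              # Group 1: 6.25
print(coefficient_of_variation(64, 9))              # Group 2: 14.0625
print(z_score(70, 80, 5), is_outlier(70, 80, 5))    # -2.0 False
print(z_score(100, 80, 5), is_outlier(100, 80, 5))  # 4.0 True
```

Because the CV divides by the mean, it lets two groups with different average scores be compared on the same percentage scale.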

For Discussion

1. For the Eastern Division orders in the preceding discussion problem, find the coefficient of variation. Note what happens when the 160 is changed to 60.

Sales Orders 2: 60 175 190 215 230 240 290 290 320 350

Sum = 2,360

Find some z-values.

                  X      Mean    X - Mean    SD    z
Sales Orders 1    160    246     -86         64    -1.34
                  350    246     104         64     1.63
Sales Orders 2    60     236     -176        84    -2.10
                  350    236     114         84     1.36

2. You are reviewing two stocks and are interested in minimizing fluctuation. The following information is available. Which stock would provide the least fluctuation?

Stock                                 A     B
Mean Price/Share for the last year    60    10
Standard Deviation                    6     2

                      Sales Orders 1    Sales Orders 2
Mean                  246               236
Median                235               235
Mode                  290               290
Standard Deviation    64                84
Range                 190               290
CV                    26%               36%

The Empirical Rule

Apply this rule to interpret the measures when the data is symmetrical and bell-shaped. Approximately:

68% of the data values are within one standard deviation of the mean: µ ± 1σ

95% of the data values are within two standard deviations of the mean: µ ± 2σ

99.7% of the data values are within three standard deviations of the mean: µ ± 3σ

Example

About 68% of salaries are in the range 71.6 ± 10.7, that is, from 60.9 to 82.3 (salaries x $1000).
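The percentages in the Empirical Rule come from the normal (bell-shaped) curve, so they can be checked against it. The sketch below assumes the data follow a normal distribution and uses `statistics.NormalDist` from the Python standard library; the salary mean of 71.6 and standard deviation of 10.7 are taken from the example, while the function names are illustrative assumptions.

```python
from statistics import NormalDist

def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def interval(mean, sd, k):
    """Lower and upper bounds of mean ± k standard deviations, to one decimal."""
    return (round(mean - k * sd, 1), round(mean + k * sd, 1))

for k in (1, 2, 3):
    share = round(normal_coverage(k) * 100, 1)
    print(f"within {k} SD: about {share}% of values, range {interval(71.6, 10.7, k)}")
```

If a histogram of the data is clearly skewed, these shares no longer apply and the intervals should not be read this way.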



Chebyshev’s Inequality

Apply this method to interpret the measures when the data is skewed or when the shape of the distribution is unknown.

At least (1 − 1/k²) × 100% of the values will fall within k standard deviations of the mean (k > 1).

At least:

75% of the data values are within two standard deviations of the mean: µ ± 2σ

89% of the data values are within three standard deviations of the mean: µ ± 3σ
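The minimum share guaranteed by this inequality is 1 − 1/k² for any k greater than 1. A minimal sketch of that arithmetic follows; the function name `chebyshev_min_share` is an assumption for illustration.

```python
def chebyshev_min_share(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3):
    print(k, round(chebyshev_min_share(k) * 100, 1))  # at least this percent
```

The bound holds for any distribution shape, which is why it is weaker than the Empirical Rule: it promises at least 75% within two standard deviations where bell-shaped data would have about 95%.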

Measure of the Shape of a Distribution / Skewness & Kurtosis

Compare the mean and the median.

Symmetrical: mean = median

Left skewed: mean < median

Right skewed: mean > median

Review the Skewness coefficient produced in the table you found using: Data / Data Analysis / Descriptive Statistics. The following values were suggested by M. G. Bulmer, Principles of Statistics (Dover, 1979), for interpreting the Skewness coefficient.
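For readers who want to see how a skewness coefficient can be computed by hand, here is a sketch using the adjusted sample skewness formula, n / ((n−1)(n−2)) × Σ((x − x̅)/s)³. That this is the formula behind the spreadsheet output is an assumption; it does reproduce the skewness values quoted for the two sales-order data sets in the discussion items of this section.

```python
import statistics

def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

orders_1 = [160, 175, 190, 215, 230, 240, 290, 290, 320, 350]
orders_2 = [60, 175, 190, 215, 230, 240, 290, 290, 320, 350]

print(round(sample_skewness(orders_1), 3))  # small positive value: roughly symmetrical
print(round(sample_skewness(orders_2), 3))  # negative: the low value 60 pulls the left tail
```

Changing a single value from 160 to 60 is enough to flip the sign of the coefficient, so it is worth checking for data-entry errors before interpreting skewness.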

For Discussion

1. Evaluate the sales using the Empirical Rule. (Note: Skewness = 0.262)

Mean
1 SD
2 SD
3 SD

Mean - 1 SD
Mean + 1 SD

2. Evaluate the sales using Chebyshev’s Inequality. (Note: Skewness = -0.802)

Mean
1 SD
2 SD
3 SD




For Item 1

Mean    246
1 SD    64
2 SD    128
3 SD    192

Mean - 1 SD    182
Mean + 1 SD    310

For Item 2

Mean    236
1 SD    84
2 SD    168
3 SD    251

Mean - 1 SD    Can't say
Mean + 1 SD    Can't say

Mean - 3 SD    -15
Mean + 3 SD    487

3. Analysis of the Salary Experience Data gave the following results for employee years of related experience. Interpret.

4. Suppose a company advertises that – with use – the mean weight loss that is expected in two months is 12 pounds. Suppose you discover that the median loss is 3 lbs.

a. Is the weight loss skewed right or left? b. Which measure is of the most value in this situation?

In XL, use Data/Data Analysis/Descriptive Statistics to generate the table.

Measures of Relative Standing

Common measures of position (relative standing) within a data set include percentiles and quartiles.

Percentiles

A percentile is a location marker along a range of values. The 50th percentile is the median or middle number in the range of values.

If your percentile score on the GRE is 90 then you scored better than 90% of those taking the test, and you scored lower than 10% of those taking the test. Excel will find percentiles.

If you are the fourth tallest person in a group of 20, you are taller than 16 people and represent the eightieth percentile.
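The height example can be reproduced with a few lines of code. This sketch treats a percentile rank as the percentage of values strictly below a given value, which is the reading used in the examples above; the heights themselves are invented for illustration.

```python
def percent_below(values, x):
    """Percentage of the values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return below * 100 / len(values)

# Hypothetical heights (cm) for a group of 20 people, all different.
heights = list(range(150, 190, 2))   # 150, 152, ..., 188
fourth_tallest = sorted(heights)[-4]

print(percent_below(heights, fourth_tallest))  # 16 of 20 are shorter: 80.0
```

Spreadsheet percentile functions work in the other direction: they take a percentile and return the value at that position.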

In XL, use: PERCENTILE(A2:A16,0.5)

Be sure to put in the appropriate range of values and specify the percentile of interest.

Quartiles

Each quartile contains 25% of the total observations based on data that is ordered from smallest to largest.

First Quartile 0 - 25th Percentile of Range

Second Quartile 25th - 50th Percentile of Range

Third Quartile 50th - 75th Percentile of Range

Fourth Quartile 75th - 100th Percentile of Range

The Interquartile Range measures the spread in the middle 50% of the data and is the third quartile minus the first quartile.

The lower quartile point (Q1) is the same as the 25th percentile. 25% of the scores are lower and 75% of the scores are higher than the lower quartile.

The upper quartile point (Q3) is the same as the 75th percentile. 75% of the scores are lower and 25% of the scores are higher than the upper quartile.

The median (Q2) is the same as the 50th percentile.

IQR = Q3 – Q1

The IQR is a measure of variability that is not influenced by outliers or extreme values.

Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.
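To see what "resistant" means in practice, the sketch below computes the quartiles and IQR for the Eastern Division orders with and without the extreme value of 60. It uses `statistics.quantiles` from the Python standard library with `method="inclusive"`; other tools may use slightly different interpolation conventions, so their quartile values can differ a little.

```python
import statistics

orders_1 = [160, 175, 190, 215, 230, 240, 290, 290, 320, 350]
orders_2 = [60, 175, 190, 215, 230, 240, 290, 290, 320, 350]

def quartile_summary(data):
    """Return (Q1, Q2, Q3, IQR) using linear interpolation between data points."""
    # method="inclusive" treats the smallest value as the 0th percentile
    # and the largest value as the 100th percentile.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1

print(quartile_summary(orders_1))
print(quartile_summary(orders_2))  # same quartiles and IQR despite the low value
```

The standard deviation of the second data set is noticeably larger than that of the first, but the quartiles and the IQR are identical because only the lowest value changed.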

In XL, use: QUARTILE(Ax:Axx,1)

Insert the correct range and specify whether you want the lower quartile (1) or the upper quartile (3).

The following table shows quartile measures for the Salary Experience Data.

Salaries (x $1000)

The quartile values are used to construct a boxplot of the salaries.

Employee Salaries (x$1000)

25% of the values lie below the first quartile = 63.3.

Notice that the quartile values provide information regarding the symmetry of the data.

(Levine)

(Levine)

For Discussion

1. Interpret the employee salary boxplot. What can you determine from reviewing the graph?

2. John’s salary is $86,000 and it ranks at the 75th percentile. What does he know about the salaries in the organization?

This graphic also shows the relationship between the curve (use histogram) and the corresponding boxplot.

Note that XL does not generate a Boxplot; use an add-in like PHStat.

Note: If the data is right skewed, the distance from Q1 to Q2 is less than the distance from Q2 to Q3.

Measure of Relationships between Two Quantitative Variables

Correlation

Correlation (r) is used in describing the strength of the relationship between two (or more) variables.

r can vary from -1 (perfect negative correlation) to +1 (perfect positive correlation). A value of 0 means there is no linear correlation.

Correlation coefficients reflect whether the relationship between variables is:

1) positive (i.e. as one variable increases, the other variable increases), or

2) negative (i.e. as one variable increases, the other variable decreases).

A coefficient may also indicate that there is no relationship.

There are many different types of correlation coefficients and selection of the appropriate one depends on the variables. We will consider the Pearson Product-moment Correlation Coefficient, which assumes continuous quantitative data.

Borg and Gall, Educational Research from Longman Publishing, provide the following information for interpreting correlation coefficients.

Correlation coefficients ranging from 0.20 to 0.35 show a slight relationship between the variables; they are of little value in practical prediction situations.

With correlations around 0.50, crude group prediction may be achieved. In describing the relationship between two variables, correlations that are this low do not suggest a good relationship.

Correlation coefficients ranging from 0.65 to 0.85 make possible group predictions that are accurate enough for most purposes. Near the top of this correlation range, individual predictions can be made that are more accurate than would occur if no such selection procedure were used.

Correlation coefficients over 0.85 indicate a close relationship between the two variables.

It is important to understand that even a high correlation coefficient does not establish a cause and effect relationship. There may be other factors that relate to both of the variables.

Line of Best Fit and Other Considerations

It is always good to look at an XY scatter plot to see what you think about the relationship between the variables.

In comparing two variables, you can square the correlation coefficient to get the Coefficient of Determination (r²); this measure gives the percent of variation in the dependent variable that is ‘explained’ by the independent variable.

Excel will not only give you a correlation coefficient, but it will also give you the equation for the Least Square line which can be useful in describing the relationship between the two variables and in making predictions of the dependent variable from the independent variable. Note the slope of the line; it tells how much the y value changes for each unit change in x.

Note that in making predictions of y based on x, stay close to the data set in your selection of x; the function may not look the same outside of the given data range.
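The quantities discussed here (the correlation coefficient, the coefficient of determination, and the slope and intercept of the least-squares line) can all be computed from the same three sums. The sketch below works on a small invented data set; the numbers and function names are assumptions for illustration only.

```python
import math

def sums_of_squares(xs, ys):
    """Return mean_x, mean_y, Sxx, Syy and Sxy for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    _, _, sxx, syy, sxy = sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept of the line that minimizes squared vertical errors."""
    mean_x, mean_y, sxx, _, sxy = sums_of_squares(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical paired observations (x = independent, y = dependent).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
slope, intercept = least_squares_line(x, y)
print(round(r, 3), round(r ** 2, 3))         # correlation and coefficient of determination
print(round(slope, 3), round(intercept, 3))  # y changes by `slope` per unit change in x
```

Here r² is 0.6, so about 60% of the variation in y is 'explained' by x, and predictions should only be made for x values between 1 and 5.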

Sample Correlation Coefficients

In XL, use the function wizard to find the correlation coefficient:

CORREL(A2:A16,B2:B16) insert/highlight the correct range

For Discussion

1. Would you expect the correlation between engine size and gas mileage to be positive or negative? Why?

2. An analyst reviewed the closing prices for the Dow Jones Industrial Average (DJIA) and the Standard & Poor's (S&P) 500 Index over a 10 week period. The sample correlation coefficient between the DJIA and the S&P 500 index was found to be r = 0.927.

a. How would you classify the linear relationship between the variables?

b. For the week when the DJIA is high, what would be your expectation for the S&P index in that week?

3. The following plot shows the relationship between a test for employment (Score 1) and the results of a test given after training (Score 2). Interpret - Consider factors such as slope, coefficient of determination, and correlation.

[Figure: Scatter Plot of Test Scores (Score 2 by Score 1). Horizontal axis: Score 1; vertical axis: Score 2; both axes run from 70 to 100. Trend line: f(x) = 0.678 x + 29.19, R² = 0.535]

In XL, use: Insert/Scatter

Right-click on a point and choose: Add Trend Line, Display equation / Display r-squared.
