Page 1: Statistical Methods in Computer Science

Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 1

Statistical Methods in Computer Science

Descriptive Statistics

Data 1: Frequency Distributions

Ido Dagan

Page 2: Statistical Methods in Computer Science

Concrete Theory: Relates Variables to Each Other

Examples:

Mathematically accurate
  Memory = 2*sizeof(input) + 3
  Runtime = 500 + 30*sizeof(input) + 20

Asymptotically correct
  Memory = O(sizeof(input)) in the worst case
  Runtime = O(log(sizeof(input))) in the best case
  Accuracy is proportional to run-time

Qualitative
  User performance increases with reduced cognitive load
  The number of bugs discovered is monotonically decreasing, but positive, if the same programmer is used; otherwise, it increases

Page 3: Statistical Methods in Computer Science

Behavior Parameters/Variables (typical of Computer Science)

Hardware parameters
  CPU model and organization, cache organization, latencies in the system

System parameters
  Memory availability, usage
  CPU running time (sometimes approximated by wall-clock time)
  Communication bandwidth, usage

Program characteristics
  Requires floating-point, heavy disk usage, integer math, graphics
  Large heap, large stack, uses non-local information, ...

Page 4: Statistical Methods in Computer Science

Scales of Measurement

Nominal (also called categorical): No order, just labels
  e.g., “Algorithm Name”

Ordinal (also called rank): Order, but not numerical
  Difference between ranks is not necessarily the same
  e.g., ranks in a (hierarchical/military) organization

Interval: Difference between values has the same meaning everywhere
  e.g., temperature in Celsius (a rise of 10 degrees is the same everywhere)
  But 100C is not twice as hot as 50C, and 0C is not lack of heat

Ratio: Interval + fixed zero point
  e.g., robot position, memory usage, run-time

Page 5: Statistical Methods in Computer Science

Scale Hierarchy

Nominal < Ordinal < Interval < Ratio
  (Interval and Ratio together: “Numerical”)

Propositions that are true at some level are also true at the levels above it
  But not necessarily the other way around
  e.g., we can calculate the mean (average) value for numerical variables, but not for nominal and ordinal ones
  e.g., we can calculate the most frequent value for all variables

http://en.wikipedia.org/wiki/Levels_of_measurement
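The hierarchy can be illustrated with a small Python sketch. The measurements below are hypothetical values invented for the illustration, and `most_frequent` is a helper name introduced here, not something from the slides. The most frequent value is computed for both a nominal and a ratio variable, while the mean is computed only for the ratio variable.

```python
from collections import Counter
from statistics import mean

# Hypothetical measurements, invented for illustration only.
algorithm_names = ["quicksort", "mergesort", "quicksort", "heapsort"]  # nominal
run_times_ms = [12.0, 15.5, 11.0, 15.5]                                # ratio

def most_frequent(values):
    """Return the most frequent value; meaningful on every scale."""
    return Counter(values).most_common(1)[0][0]

mode_name = most_frequent(algorithm_names)  # fine: nominal data has a most frequent value
mode_time = most_frequent(run_times_ms)     # fine: so does ratio data
mean_time = mean(run_times_ms)              # only meaningful for numerical data
```

Taking a mean of `algorithm_names` would make no sense, which is the point of the "not necessarily the other way around" remark.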

Page 6: Statistical Methods in Computer Science

Variables

Discrete: Can take on only certain values: symbols, exact numbers
  For ordinal, interval and ratio scales, this means there will be gaps
  e.g., user satisfaction surveys, memory usage

Continuous: Can take on any value within its range: no gaps
  e.g., run-time, CPU temperature, robot velocity and position
  In practice: limited by measurement accuracy
  Up to the researcher to determine the needed accuracy

Page 7: Statistical Methods in Computer Science

Data

The collection of values that a variable X took during the measurement

Page 8: Statistical Methods in Computer Science

Describing Data

Our task: Describe the data we have collected
  Find ways to characterize it, represent it
  Find properties that are true of the data

Page 9: Statistical Methods in Computer Science

Data Distribution

The collection of data is called the sample distribution

We will investigate distributions:
  Find values that “best” represent a distribution
  Measure their dispersion, range, shape
  Identify extraordinary values in a distribution
  Find visual representations for a distribution

Remember the hierarchy: Nominal < Ordinal < Interval < Ratio
  Think about how the following techniques apply

Page 10: Statistical Methods in Computer Science


Frequency Distribution

Examine the frequency of values

f(x) = # of times variable took on value x.

Page 11: Statistical Methods in Computer Science

Frequency Distribution

Examine the frequency of values

f(x) = # of times variable took on value x.

Student  Grade        Grade  f
X1       60           82     2
X2       43           75     1
X3       57           60     2
X4       82           57     1
X5       75           43     1
X6       32           32     1
X7       82           Total  8
X8       60

Page 12: Statistical Methods in Computer Science

Frequency Distribution

Examine the frequency of values

f(x) = # of times variable took on value x.

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

Convention (Ordinal/Numerical): Sort by value

Grade  f
82     2
81     0
80     0
...    0
75     1
...    0
60     2
...    0
57     1
...    ...
Total  8
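As a minimal sketch, the frequency table above can be computed in Python with `collections.Counter`. The function name `frequency_distribution` is introduced here for illustration. Unlike the slide's convention, this version lists only the values that actually occur and omits zero-frequency rows.

```python
from collections import Counter

grades = [60, 43, 57, 82, 75, 32, 82, 60]  # students X1..X8 from the table

def frequency_distribution(values):
    """Map each observed value x to f(x), sorted from highest to lowest value."""
    counts = Counter(values)
    return {x: counts[x] for x in sorted(counts, reverse=True)}

f = frequency_distribution(grades)
for grade, count in f.items():
    print(grade, count)
print("Total", sum(f.values()))
```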

Page 13: Statistical Methods in Computer Science

Grouped Frequency Distributions

In ordinal/numerical variables, it is possible to group values together
  Create Grouped Frequency Distributions

Score  f      Score  f
96     1      78     4
95     0      77     2
94     0      76     1
93     0      75     1
92     0      74     0
91     1      73     1
90     1      72     2
89     3      71     0
88     2      70     3
87     2      69     1
86     6      68     2
85     2      67     1
84     2      66     0
83     1      65     0
82     3      64     0
81     2      63     0
80     2      62     1

Score   f
96-100  1
91-95   1
86-90   14
81-85   10
76-80   11
71-75   4
66-70   7
61-65   2

N = 50

Width (i) = 5

Page 14: Statistical Methods in Computer Science

Grouped Frequency Distributions

In ordinal/numerical variables, it is possible to group values together
  Create Grouped Frequency Distributions

The same scores, grouped with two different choices of interval limits (N = 50 and i = 5 in both):

Score   f       Score   f
95-99   1       96-100  1
90-94   2       91-95   1
85-89   15      86-90   14
80-84   10      81-85   10
75-79   10      76-80   11
70-74   6       71-75   4
65-69   4       66-70   7
60-64   2       61-65   2

Warning: Loss of Information
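The grouping step can be sketched in Python as below. The function name `group` and its parameters are introduced here for illustration. One assumption is labeled loudly: the ungrouped score table as shown has no rows for 79 and 61, so their frequencies (2 and 1) are inferred from the group totals and N = 50.

```python
# Ungrouped frequencies from the score table (score -> f); zero rows omitted.
# ASSUMPTION: the entries for 79 and 61 do not appear in the table as shown;
# they are inferred from the group totals and N = 50.
FREQ = {96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
        83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
        73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1}

def group(freq, lowest, width):
    """Sum frequencies into intervals of the given width, starting at `lowest`."""
    grouped = {}
    for score, f in freq.items():
        low = lowest + ((score - lowest) // width) * width
        key = (low, low + width - 1)          # apparent limits of the interval
        grouped[key] = grouped.get(key, 0) + f
    return dict(sorted(grouped.items(), reverse=True))

by_60 = group(FREQ, lowest=60, width=5)   # 60-64, 65-69, ...
by_61 = group(FREQ, lowest=61, width=5)   # 61-65, 66-70, ...
```

Both groupings keep N = 50 but give different interval counts, and neither lets us recover the individual scores, which is the loss of information the slide warns about.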

Page 15: Statistical Methods in Computer Science

Real and Apparent Limits

Continuous values are more difficult to divide into intervals
  A score of 95 falls within 95-99, not within 90-94
  But what about a temperature of 94.87? 94 < 94.87 < 95!

By convention, the real limits of a score are within ½ the measurement resolution
  If our resolution is 0.1, then the limits are within 0.05
  If our resolution is 100, then the limits are within 50

Note: we break the convention only in exceptional cases
  e.g., age: “I am 35” is true of [35.0 .. 36.0)

Page 16: Statistical Methods in Computer Science

Real/Apparent Limits

For example:

Resolution of 0.01: the interval 95..99 really covers values 94.995 to 99.005
  Apparent limits: 95..99
  Real limits: 94.995 to 99.005

Resolution of 10: 740-800 really covers values 735 to 805.
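A minimal sketch of the half-resolution convention in Python. The function name `real_limits` is introduced here for illustration, and `Decimal` is used only to keep values such as 94.995 exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

def real_limits(apparent_low, apparent_high, resolution):
    """Extend the apparent limits by half the measurement resolution on each side."""
    half = Decimal(str(resolution)) / 2
    return Decimal(str(apparent_low)) - half, Decimal(str(apparent_high)) + half

fine = real_limits(95, 99, 0.01)    # limits for a resolution of 0.01
coarse = real_limits(740, 800, 10)  # limits for a resolution of 10
```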

Page 17: Statistical Methods in Computer Science

Relative Frequency Distributions

A frequency count can be misleading
  Algorithm X was fastest on 60,000 trials: Is this good?
  100,000 people voted for candidate A: Is she the winner?

Relative frequency distributions: translate f into a percentage or ratio
  rel f (proportion) = f/N
  rel f (%) = 100 * f/N

Warning: Can be misleading if the magnitude of the counts is ignored
  50% of all test cases succeeded (with only two cases…)

Page 18: Statistical Methods in Computer Science

Relative Frequency Distributions

Example: f (%) = 100 * f/N

Score   f    f (%)
95-99   1    2
90-94   2    4
85-89   15   30
80-84   10   20
75-79   10   20
70-74   6    12
65-69   4    8
60-64   2    4

N =     50   100

i = 5
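The conversion from f to f (%) can be sketched as follows. The names `GROUPED` and `relative_frequencies` are introduced here for illustration, and the data is the grouped score table.

```python
# Grouped frequencies from the score table (interval -> f).
GROUPED = {"95-99": 1, "90-94": 2, "85-89": 15, "80-84": 10,
           "75-79": 10, "70-74": 6, "65-69": 4, "60-64": 2}

def relative_frequencies(freq):
    """Return rel f (%) = 100 * f / N for every interval."""
    n = sum(freq.values())
    return {interval: 100 * f / n for interval, f in freq.items()}

rel = relative_frequencies(GROUPED)
```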

Page 19: Statistical Methods in Computer Science

Cumulative Frequency Distribution

For ordinal/numerical variables

Where values are with respect to others: How many are below or above

Page 20: Statistical Methods in Computer Science

Cumulative Frequency Distribution

Based on the cumulative distribution, we can answer questions such as:
  What percentage of scores fall below 80?
  How many scores are below 95?
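As a sketch, the cumulative columns can be built with `itertools.accumulate`, using the grouped score frequencies (N = 50) listed from the lowest interval upward. The variable names are introduced here for illustration.

```python
from itertools import accumulate

# Grouped score frequencies, lowest interval first.
INTERVALS = ["60-64", "65-69", "70-74", "75-79", "80-84", "85-89", "90-94", "95-99"]
F = [2, 4, 6, 10, 10, 15, 2, 1]

cum_f = list(accumulate(F))                 # running total of f
n = cum_f[-1]                               # N, the total number of cases
cum_pct = [100 * c / n for c in cum_f]      # cumulative percentage
table = dict(zip(INTERVALS, zip(cum_f, cum_pct)))

pct_below_80 = table["75-79"][1]   # everything up to and including 75-79
count_below_95 = table["90-94"][0] # everything up to and including 90-94
```

With this data, 44% of the scores fall below 80 and 49 scores fall below 95.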

Page 21: Statistical Methods in Computer Science

Percentiles, Percentile Ranks

(Value of) Percentile X: Value for which X percent of values are lower
  e.g., baby height
  We use P_X to denote the Xth percentile, e.g., P_98 is in the range 90-94.

Percentile rank of value X: the percent of values that fall below X.
  e.g., the percentile rank of the interval 65-69 is 12.

Page 22: Statistical Methods in Computer Science

Computing Percentiles, Percentile Ranks

How do we compute percentiles and percentile ranks from grouped data?
  What is the score which defines the top 20% of scores? Is it between 84 and 85?

Score   f    f (%)
95-99   1    2
90-94   2    4
85-89   15   30
80-84   10   20
75-79   10   20
70-74   6    12
65-69   4    8
60-64   2    4

N =     50   100

Page 23: Statistical Methods in Computer Science

Computing Percentiles

We want to compute P_80. 80% of 50 cases = 40 cases.

We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).

Page 24: Statistical Methods in Computer Science

Computing Percentiles

We want to compute P_80. 80% of 50 cases = 40 cases.

We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).

i = 5

Score   cum f   cum %
95-99   50      100
90-94   49      98
85-89   47      94
80-84   32      64
75-79   22      44
70-74   12      24
65-69   6       12
60-64   2       4

Page 25: Statistical Methods in Computer Science

Computing Percentiles

We want to compute P_80. 80% of 50 cases = 40 cases.

We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).

We need 8 more.

i = 5

Score   cum f   cum %
95-99   50      100
90-94   49      98
85-89   47      94
80-84   32      64
75-79   22      44
70-74   12      24
65-69   6       12
60-64   2       4

Page 26: Statistical Methods in Computer Science

Computing Percentiles

We want to compute P_80. 80% of 50 cases = 40 cases.

We look under the cum f heading. 32 of the 40 scores are less than 84.5 (real limit).

We need 8 more.

The interval 85-89 contains 47-32 = 15 cases; its lower real limit is 84.5.

These are spread over a width of 5 (= 89.5-84.5).

Assume scores are evenly distributed within the interval:
  8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation)

P_80 = 84.5 + 2.67 = 87.17
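The interpolation steps above can be sketched in Python. The function name `percentile`, its parameters and the `resolution` default of 1 (integer scores) are assumptions introduced for the illustration; the data is the grouped score table.

```python
# (apparent low, apparent high, f), lowest interval first; data from the table.
INTERVALS = [(60, 64, 2), (65, 69, 4), (70, 74, 6), (75, 79, 10),
             (80, 84, 10), (85, 89, 15), (90, 94, 2), (95, 99, 1)]

def percentile(intervals, p, resolution=1):
    """Estimate the p-th percentile from grouped data by linear interpolation."""
    n = sum(f for _, _, f in intervals)
    target = p / 100 * n                  # number of cases that must lie below
    below = 0                             # cum f of the intervals already passed
    for low, high, f in intervals:
        if f > 0 and below + f >= target:
            lower_real = low - resolution / 2
            width = (high - low) + resolution
            # Assume the f cases are spread evenly over the interval.
            return lower_real + (target - below) / f * width
        below += f
    raise ValueError("p must be between 0 and 100")

p80 = percentile(INTERVALS, 80)   # 84.5 + 8/15 * 5
```

The result is only an estimate, since the even-spread assumption replaces the information lost by grouping.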

Page 27: Statistical Methods in Computer Science

Computing Percentile Ranks

We want to compute the percentile rank of 86
  Lies in the interval 85-89, real limits 84.5 to 89.5.
  86-84.5 = 1.5 score points. Width of interval = 5.
  Assuming a uniform spread of scores in the interval:
  1.5/5 = 0.3 ==> 30% of the scores in the interval (0.3*15 = 4.5)

Score   cum f   cum %
95-99   50      100
90-94   49      98
85-89   47      94
80-84   32      64
75-79   22      44
70-74   12      24
65-69   6       12
60-64   2       4

Page 28: Statistical Methods in Computer Science

Computing Percentile Ranks

We want to compute the percentile rank of 86
  Lies in the interval 85-89, real limits 84.5 to 89.5.
  86-84.5 = 1.5 score points. Width of interval = 5.
  1.5/5 = 0.3 ==> 30% of the scores in the interval (0.3*15 = 4.5)

So we have 32 scores up to 84.5
  4.5 scores from 84.5 to 86.
  Total: 4.5 + 32 = 36.5 scores.
  36.5/50 = 73%. This is the percentile rank of 86.

Score   cum f   cum %
95-99   50      100
90-94   49      98
85-89   47      94
80-84   32      64
75-79   22      44
70-74   12      24
65-69   6       12
60-64   2       4
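The same calculation, as a minimal Python sketch. The function name `percentile_rank` and the `resolution` default of 1 are assumptions introduced for the illustration; scores are assumed to be spread evenly within each interval, as on the slide.

```python
# (apparent low, apparent high, f), lowest interval first; data from the table.
INTERVALS = [(60, 64, 2), (65, 69, 4), (70, 74, 6), (75, 79, 10),
             (80, 84, 10), (85, 89, 15), (90, 94, 2), (95, 99, 1)]

def percentile_rank(intervals, x, resolution=1):
    """Estimate the percent of cases below x from grouped data."""
    n = sum(f for _, _, f in intervals)
    below = 0.0
    for low, high, f in intervals:
        lower_real = low - resolution / 2
        upper_real = high + resolution / 2
        if x >= upper_real:
            below += f                    # the whole interval lies below x
        elif x > lower_real:
            # Only a fraction of the interval lies below x.
            below += (x - lower_real) / (upper_real - lower_real) * f
    return 100 * below / n

rank_86 = percentile_rank(INTERVALS, 86)   # (32 + 0.3 * 15) / 50 = 73%
```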

Page 29: Statistical Methods in Computer Science


Frequency Distributions and Scales

Page 30: Statistical Methods in Computer Science

Displaying Frequency Distributions: Nominal Data

Page 31: Statistical Methods in Computer Science

Displaying Frequency Distributions: Ordinal/Numerical Data

Histogram

Score   f    f (%)
95-99   1    2
90-94   2    4
85-89   15   30
80-84   10   20
75-79   10   20
70-74   6    12
65-69   4    8
60-64   2    4

N =     50   100

[Figure: Scores Histogram. Bars for the intervals 60-64 through 95-99, with heights 2, 4, 6, 10, 10, 15, 2, 1; vertical axis from 0 to 16.]

Page 32: Statistical Methods in Computer Science

Displaying Frequency Distributions: Ordinal/Numerical Data

Histogram: Different Grouping

Score   f    f (%)
95-99   1    2
90-94   2    4
85-89   15   30
80-84   10   20
75-79   10   20
70-74   6    12
65-69   4    8
60-64   2    4

N =     50   100

[Figure: Scores Histogram. Bars for the intervals 60-69, 70-79, 80-89, 90-99, with heights 6, 16, 25, 3; vertical axis from 0 to 30.]
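The regrouping behind the second histogram can be sketched in Python. The helper names `merge_pairs` and `text_histogram` are introduced here for illustration, and the text rendering is only a stand-in for a real plotting library. Adjacent width-5 intervals are merged pairwise into width-10 intervals.

```python
# Grouped frequencies with i = 5, lowest interval first; data from the table.
F = {"60-64": 2, "65-69": 4, "70-74": 6, "75-79": 10,
     "80-84": 10, "85-89": 15, "90-94": 2, "95-99": 1}

def merge_pairs(freq):
    """Merge each pair of adjacent intervals into one wider interval."""
    items = list(freq.items())
    merged = {}
    for (low_label, f1), (high_label, f2) in zip(items[0::2], items[1::2]):
        label = low_label.split("-")[0] + "-" + high_label.split("-")[1]
        merged[label] = f1 + f2
    return merged

def text_histogram(freq):
    """Render one row of '#' marks per interval."""
    return "\n".join(f"{label:>6} | {'#' * f} {f}" for label, f in freq.items())

wide = merge_pairs(F)
print(text_histogram(wide))
```

The wider bins hide the dip at 90-94 and the peak at 85-89 that the finer grouping shows, so the choice of interval width changes the apparent shape.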

Page 33: Statistical Methods in Computer Science

Lying with Visuals

[Figure: Scores Histogram. Bars for the intervals 60-64 through 95-99, with heights 108, 110, 111, 112, 111, 110, 109, 108; vertical axis from 0 to 120.]

[Figure: Scores Histogram. The same values (108, 110, 111, 112, 111, 110, 109, 108), with the vertical axis running from 107 to 115.]

Page 34: Statistical Methods in Computer Science


Characteristics of Distributions

Shape, Central Tendency, Variability

Different Central Tendency

Different Variability