A Visual Guide for Describing Numbers

torturing numbers: a novice’s guide to descriptive statistics

Bandhu  P.  Das  

"If you torture the data long enough, it will confess"

– Ronald Harry Coase

@BPDas_

why  do  we  torture  numbers?

•  Describe the story

•  Find trends in data against variation

•  Determine if a sample represents a population

•  Draw conclusions about the story

a tool called ‘descriptive statistics’ is used

describing  numbers

25 people were asked what an average person pays in tax.

What do these numbers tell you?

£45,000   £3,700   £10,000   £2,000   £2,000  

£15,000   £3,000   £5,000   £3,700   £2,000  

£10,000   £2,000   £2,000   £3,700   £2,000  

£5,700   £2,000   £2,000   £3,700   £2,000  

£5,000   £2,000   £5,000   £2,000   £2,000  

Here is the same data ordered from greatest to least and weighted to show how many times each value occurs in the data set

[Figure: the values £45,000, £15,000, £10,000, £5,700, £5,000, £3,700, £3,000 and £2,000, each sized by how often it occurs]

•  Now what do the data tell you?

•  What is the average income?

BEWARE! The reported ‘average’ might depend on what you are meant to see. Which would you use?

MEAN (arithmetic average)

MEDIAN (middle value of the ordered data)

MODE (most frequent)

So, to really understand the data set you need more than just the ‘average’
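As a quick check, the three averages can be computed directly from the 25 tax values listed earlier. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median, mode

# The 25 survey responses from the earlier slide, in pounds.
taxes = [
    45000, 3700, 10000, 2000, 2000,
    15000, 3000, 5000, 3700, 2000,
    10000, 2000, 2000, 3700, 2000,
    5700, 2000, 2000, 3700, 2000,
    5000, 2000, 5000, 2000, 2000,
]

tax_mean = mean(taxes)      # arithmetic average
tax_median = median(taxes)  # middle value of the sorted data
tax_mode = mode(taxes)      # most frequent value

print(f"mean £{tax_mean:,}  median £{tax_median:,}  mode £{tax_mode:,}")
```

For this data set the mean is £5,700, the median £3,000 and the mode £2,000, so the reported 'average' nearly triples depending on which measure is chosen.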

spread  and  variability

You need to know the spread of the data

•  This histogram shows the ages of people that use a smart phone

•  Is it typical for 90 year olds to use a smart phone?

When the data form a symmetrical bell shape, the mean and median are the same, and you have a special situation called a ‘normal’ curve

On this symmetrical curve, the variability can be described using standard deviations (SD)

SD is a way to determine how far a data point is from the mean

You can now say that 90 year olds fall more than 2 SD from the mean, or that they make up less than 2.5% of the data set
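The "less than 2.5%" figure follows from the shape of the normal curve. A small sketch, assuming the ages really are normally distributed, uses `statistics.NormalDist` from the Python standard library to compute the fraction of values beyond 2 SD.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Fraction of a normal population lying more than 2 SD above the mean.
upper_tail = 1 - standard_normal.cdf(2)

# Fraction lying within 2 SD either side of the mean.
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"above +2 SD: {upper_tail:.1%}")
print(f"within ±2 SD: {within_two_sd:.1%}")
```

Roughly 95% of a normal population lies within 2 SD of the mean, leaving a little over 2% in each tail.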

If we collapse the whole data set to one bar, we can show the mean with some measure of variability (std dev, std error, etc.)

Without some indication of variability, you cannot effectively compare two data sets

Perhaps the best way to describe any data set is with five numbers: Minimum, Q1, Median, Q3, Maximum. This helps when comparing data sets, and when there are oddities called outliers.

[Figure: box plot marking Min, Q1, Median, Q3 and Max, with 25% of the data in each section and an outlier marked *]
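The five-number summary can be sketched for the 25 tax values from the earlier slide. The outlier check below uses the common 1.5 × IQR rule of thumb, which is an assumption added here for illustration and is not stated in the slides.

```python
from statistics import quantiles

# The 25 survey responses from the earlier slide, in pounds.
taxes = [
    45000, 3700, 10000, 2000, 2000,
    15000, 3000, 5000, 3700, 2000,
    10000, 2000, 2000, 3700, 2000,
    5700, 2000, 2000, 3700, 2000,
    5000, 2000, 5000, 2000, 2000,
]

# Quartile cut points: Q1, median, Q3.
q1, q2, q3 = quantiles(taxes, n=4)
five_numbers = (min(taxes), q1, q2, q3, max(taxes))

# Assumption: values beyond 1.5 x IQR from the quartiles count as outliers.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sorted(x for x in taxes if x < low_fence or x > high_fence)

print("min, Q1, median, Q3, max:", five_numbers)
print("outliers:", outliers)
```

Here the summary is £2,000, £2,000, £3,000, £5,000 and £45,000, and under the assumed rule the four largest responses are flagged as outliers.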

“79.48% of all statistics are made up on the spot.”

– John A. Paulos

a  sample  study

Researchers want to know which of three fertilisers produces the highest wheat yield in kg/plot

They design a study with three treatments and five replications for each treatment

3 Treatments (Fertilisers 1, 2 and 3)

5 Replicates

Could a nearby forest or river be a confounding variable?

Variables like soil type and other local influences may have unexpected impacts…

This is why a good study is randomised, to defeat potentially confounding variables
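One way to randomise such a study is to shuffle the treatment labels across the plots. The sketch below assumes 15 numbered plots (3 treatments × 5 replicates); the plot numbering and the fixed seed are hypothetical and only there to make the example reproducible.

```python
import random

treatments = ["Fertiliser 1", "Fertiliser 2", "Fertiliser 3"]
replicates = 5

# One label per plot: 3 treatments x 5 replicates = 15 plots.
assignments = [t for t in treatments for _ in range(replicates)]

# Hypothetical seed, fixed only so the example gives the same layout each run.
rng = random.Random(42)
rng.shuffle(assignments)

# Map each plot number to its randomly assigned treatment.
layout = {plot: treatment for plot, treatment in enumerate(assignments, start=1)}
for plot, treatment in layout.items():
    print(plot, treatment)
```

Because the order is random, a local influence such as a nearby river is unlikely to line up with any single fertiliser.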

Does the sample plot in our study represent all the wheat in all the world?

[Figure: a SAMPLE drawn from a POPULATION]

uncertainty

With all the unknown variables, there will always be a degree of uncertainty that our sample represents the population

That’s why the more samples we have, the more confident we are that our study represents the population
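The effect of sample size can be made concrete with the standard error of the mean, which is the sample standard deviation divided by the square root of the sample size. The sketch below uses a hypothetical standard deviation of 6.0 kg/plot purely for illustration.

```python
from math import sqrt


def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the mean for a sample of size n."""
    return sample_sd / sqrt(n)


# Hypothetical spread in kg/plot, chosen only to illustrate the trend.
sample_sd = 6.0
for n in (5, 20, 80):
    print(n, round(standard_error(sample_sd, n), 2))
```

Quadrupling the sample size halves the standard error, so confidence grows with more samples, but with diminishing returns.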

confidence

•  Any confidence level could be used, but 95% is often chosen

•  This means that if the study were repeated many times, about 95% of the intervals calculated would contain the true population value

•  BEWARE reports with no confidence interval

two ways to present data

Tables are the preferred way to show data, but graphs paint a quick, easy and seductive picture

Wheat yield (kg/plot):

Fertiliser 1   Fertiliser 2   Fertiliser 3
64.8           56.5           65.8
60.5           53.8           73.2
63.4           59.4           59.5
48.2           61.1           66.3
55.5           58.8           70.2

drawing conclusions

A presenter may want you to see a relationship between two variables

Fertiliser 3 appears to increase the average yield of wheat – but what kind of average is this? How big was the sample? Where is the indication of variability? Where is the confidence interval?
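These questions can be answered for the yield figures given earlier. The sketch below assumes the 15 yields read as three columns of five replicates, uses the arithmetic mean, and builds a 95% confidence interval from Student's t with 4 degrees of freedom (critical value 2.776, hard-coded because the Python standard library has no t-distribution). None of these choices is confirmed by the slides.

```python
from math import sqrt
from statistics import mean, stdev

# Yields in kg/plot, read as three columns of five replicates (an assumption).
yields = {
    "Fertiliser 1": [64.8, 60.5, 63.4, 48.2, 55.5],
    "Fertiliser 2": [56.5, 53.8, 59.4, 61.1, 58.8],
    "Fertiliser 3": [65.8, 73.2, 59.5, 66.3, 70.2],
}

# Two-sided 95% critical value of Student's t for n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776

summary = {}
for name, values in yields.items():
    n = len(values)
    m = mean(values)
    sd = stdev(values)  # sample standard deviation
    half_width = T_CRIT_DF4 * sd / sqrt(n)
    summary[name] = (m, sd, m - half_width, m + half_width)
    print(
        f"{name}: n={n} mean={m:.2f} SD={sd:.2f} "
        f"95% CI=({m - half_width:.1f}, {m + half_width:.1f})"
    )
```

With only five replicates per treatment the intervals are wide, and the interval for Fertiliser 1 overlaps the one for Fertiliser 3, so the apparent advantage deserves a formal test before any conclusion is drawn.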

Bad stats and presentation may lead to bad conclusions

[Figure label: 2 SD]

Correlation does not imply causation

The more firemen fighting a fire, the bigger the fire is observed to be. Therefore more firemen cause an increase in the size of a fire

Often, a presenter wants to lead you to a conclusion. Newspapers, TV and online articles should be scrutinised!

BEWARE:

•  “This is not a scientific poll…”

•  “These results may not be representative of the population”

•  “…based on a list of those that responded”

•  “Data showed a trend but was not statistically significant”

it’s  all  in  how  they  are  presented

Pies are for eating

It’s very hard to see differences

BEWARE CHARTJUNK!

Amusing graphics are nothing but distractions

Again, it’s very hard to see differences

BEWARE CHARTJUNK!

Here is the same population growth data shown on two scales. Which would you use to demonstrate rapid growth?

BEWARE tricky scales!
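The two scales used on the slide are not recoverable here, but one common trick is to start the vertical axis above zero. The sketch below assumes that case and uses hypothetical population figures to show how much a truncated axis exaggerates a change.

```python
def apparent_ratio(first: float, second: float, axis_start: float) -> float:
    """How many times taller the second bar looks than the first
    when the vertical axis starts at axis_start instead of zero."""
    return (second - axis_start) / (first - axis_start)


# Hypothetical population figures (millions), not taken from the slide.
year_1, year_2 = 100.0, 110.0

honest = apparent_ratio(year_1, year_2, axis_start=0)
truncated = apparent_ratio(year_1, year_2, axis_start=90)

print(f"axis from 0:  second bar looks {honest:.1f}x as tall")
print(f"axis from 90: second bar looks {truncated:.1f}x as tall")
```

The same 10% rise looks like a doubling once the axis starts at 90.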

BEWARE statements with no context. Here’s a made-up example:

Did you know that even speaking to someone that once smoked, DOUBLES your chance of getting cancer?! ;)

Your odds go from 0.000000001:1 to 0.000000002:1
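The missing context is the difference between relative and absolute change. A small sketch, reading the made-up odds as 0.000000001:1 before and 0.000000002:1 after, compares the two; the function and variable names are illustrative.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds of o:1 into a probability."""
    return odds / (1 + odds)


baseline_odds = 0.000000001  # made-up odds before, from the example
exposed_odds = 0.000000002   # made-up odds after

p_before = odds_to_probability(baseline_odds)
p_after = odds_to_probability(exposed_odds)

relative_change = p_after / p_before  # the headline "DOUBLES"
absolute_change = p_after - p_before  # the actual added risk

print(f"relative: x{relative_change:.2f}")
print(f"absolute: {absolute_change:.2e}")
```

The risk does double in relative terms, yet the absolute increase is about one in a billion.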

conclusion

Like any tool, stats can be misused (intentionally or unintentionally)

Maintain a healthy skepticism and question charts, tables and conclusions where insufficient information is provided

references

-  Larry Gonick and Woollcott Smith, The Cartoon Guide to Statistics (1993)

-  Darrell Huff, How to Lie with Statistics (1954)
