
Example sweavefunnelplot


1 Example of self-documenting data journalism notes

This is an example of using Sweave to combine code and output from the R statistical programming environment and the LaTeX document processing environment to generate a self-documenting script in which the actual code used to do stats and generate statistical graphics is displayed alongside the charts it directly produces.

1.1 Getting Started...

The aim is to try to replicate a graphic included by Ben Goldacre in his article DIY statistical analysis: experience the thrill of touching real data (see footnote 1).

> # The << echo = T >>= identifies an R code region;
> # echo=T means the code itself is shown in the output, along with what happens when it's run
> # In the code area, lines beginning with a # are comment lines and are not executed
>
> #First, we need to load in the XML library that contains the scraper function
> library(XML)
> #Now we scrape the table
> srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
> cancerdata=data.frame(
+   readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
> #The @ symbol on its own at the start of a line marks the end of a code block

The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N’th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).

We can inspect the data we’ve imported as follows:

> #Look at the whole table (the whole table is quite long,
> # so don't display it; comment out the command for now instead).
> #cancerdata
> #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
> #Look at the column headers
> names(cancerdata)
[1] "Area"       "Rate"       "Population" "Number"
> #Look at the first few rows (head() shows 6 by default)
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7
> #Look at the last few rows (tail() shows 6 by default)
> tail(cancerdata)
          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251
> #What sort of datatype is in the Number column?
> class(cancerdata$Number)
[1] "factor"

Footnote 1: http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

The last line, class(cancerdata$Number), identifies the data as type factor. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers. (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it’s identified as a category.) The columns can be converted as follows:

> #Convert the numerical columns to a numeric datatype
> cancerdata$Rate =
+   as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
> cancerdata$Population =
+   as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
> cancerdata$Number =
+   as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
> #Just check it worked...
> class(cancerdata$Number)
[1] "numeric"
> class(cancerdata$Rate)
[1] "numeric"
> class(cancerdata$Population)
[1] "numeric"
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7
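The conversion idiom above is easier to follow on a toy vector. The sketch below uses a made-up factor (deaths is a hypothetical stand-in, not the scraped column) to show why calling as.numeric() directly on a factor returns the internal category codes rather than the values, and why going via levels() recovers the numbers. Since R 4.0.0, data.frame() no longer turns strings into factors by default, so in a current R session the scraped columns may arrive as character instead; the as.numeric(as.character(x)) route shown last handles both cases.

```r
# Hypothetical stand-in for a scraped column: numbers stored as a factor
deaths <- factor(c("6", "7", "11", "136"))

# Calling as.numeric() directly returns the internal level codes, not the values
codes <- as.numeric(deaths)

# Look the codes up in the level labels, then convert the labels to numbers
from_levels <- as.numeric(levels(deaths)[as.integer(deaths)])

# Equivalent route that also works when the column is character, not factor
from_chars <- as.numeric(as.character(deaths))

print(codes)
print(from_levels)
print(from_chars)
```

The level labels are sorted as text, which is why the codes bear no relation to the size of the numbers they stand for.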

We can now plot the data as a simple scatterplot using the plot command (figure 1) or we can add a title to the graph and tweak the axis labels (figure 2).

The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn’t part of the standard R bundle, so you’ll need to install the package yourself if you haven’t already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies...). You can see the sort of chart ggplot creates out of the box in figure 3.
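If you prefer the console to the RStudio Packages tab, a rough equivalent is sketched below. It only checks whether ggplot2 is available and prints a hint if it is not; the variable name has_ggplot2 is just for illustration, and the install.packages() call is left commented out because it needs a network connection.

```r
# Check whether ggplot2 is installed without attaching it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (!has_ggplot2) {
  message("ggplot2 not found; install it with install.packages('ggplot2')")
  # install.packages("ggplot2")  # uncomment to install from CRAN
} else {
  library(ggplot2)
}
```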


> #Plot the Number of deaths by the Population

> plot(Number ~ Population, data=cancerdata)

[Scatter plot: Number (y-axis, 0 to 250) against Population (x-axis, 0 to 1200000)]

Figure 1: Vanilla scatter plot


> #Plot the Number of deaths by the Population.

> #Add in a title (main) and tweak the y-axis label (ylab).

> plot(Number ~ Population, data=cancerdata,

+ main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')

[Scatter plot titled "Bowel Cancer Occurrence by Population": Number of deaths (y-axis, 0 to 250) against Population (x-axis, 0 to 1200000)]

Figure 2: Vanilla scatter plot with a title and a relabelled y-axis


> require(ggplot2)

> #Plot the Number of deaths by the Population

> p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))

> print(p)

[ggplot2 scatter plot: Number (y-axis, 50 to 250) against Population (x-axis, 200000 to 1200000)]

Figure 3: A rather prettier plot


1.2 Generating the Funnel Plot

Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics-related Q&A: How to draw funnel plot using ggplot2 in R? (see footnote 2).

The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a few things:

1. work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.

2. change the range and width of samples used to create the curves

3. change the y-axis range.

You can see the result in figure 4.
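The rescaling in step 1 can be sanity-checked against the Number column: a rate per 100,000, divided by 100,000 and multiplied by the population, should give back roughly the number of deaths. The sketch below does this for two rows taken from the table shown earlier (Shetland Islands and Birmingham); the variable names are illustrative only.

```r
# Two rows copied from the scraped table (Rate is per 100,000 people)
rate_per_100k <- c(Shetland = 19.15, Birmingham = 19.78)
population    <- c(Shetland = 31332, Birmingham = 1268959)
reported      <- c(Shetland = 6,     Birmingham = 251)

# Rescale the rate to a proportion in the range 0..1
p <- rate_per_100k / 100000

# Multiplying the proportion by the population should recover the death count
implied_deaths <- p * population

print(p)
print(round(implied_deaths))
print(round(implied_deaths) == reported)
```

If the implied counts did not match the Number column, that would suggest the rate was on a different scale (per 1,000, say) and the divisor would need changing.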

Footnote 2: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210


> #TH: funnel plot code from:
> #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
> #TH: Use our cancerdata
> number=cancerdata$Population
> #TH: The rate is given as a 'per 100,000' value, so normalise it
> p=cancerdata$Rate/100000
> p.se <- sqrt((p*(1-p)) / (number))
> df <- data.frame(p, number, p.se, Area=cancerdata$Area)
> ## common effect (fixed effect model)
> p.fem <- weighted.mean(p, 1/p.se^2)
> ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
> #TH: I'm going to alter the spacing of the samples used to generate the curves
> number.seq <- seq(1000, max(number), 1000)
> number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
> ## draw plot
> #TH: note that we need to tweak the limits of the y-axis
> #linetype = 2 (dashed) is a fixed setting, not a data mapping, so it sits outside aes()
> fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
> #Automatically set the maximum y-axis value to be just a bit larger than the max data value
> fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
> #Label the outlier point
> fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
> print(fp)

[Funnel plot: Bowel cancer death rate (y-axis, 0.00000 to 0.00030) against Population (x-axis, 200000 to 1200000), with one labelled outlier: Glasgow City]

Figure 4: Funnel plot of bowel cancer death rate against population
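To see where the funnel shape comes from, it can help to compute the limits for a handful of population sizes by hand. The sketch below is a minimal base R version of the same calculation used for the curves above; the function name funnel_limits, the overall rate p0 and the population and observed values are all made up for illustration and are not taken from the scraped data.

```r
# Approximate binomial limits around a target proportion p0 for areas of size n.
# z = 1.96 gives roughly 95% limits, z = 3.29 roughly 99.9% limits.
funnel_limits <- function(p0, n, z) {
  se <- sqrt(p0 * (1 - p0) / n)
  data.frame(n = n, lower = p0 - z * se, upper = p0 + z * se)
}

p0 <- 0.00018                     # hypothetical overall rate (18 per 100,000)
n  <- c(30000, 300000, 1200000)   # hypothetical population sizes

lim95  <- funnel_limits(p0, n, z = 1.96)
lim999 <- funnel_limits(p0, n, z = 3.29)

# The band gets narrower as the population grows, which gives the funnel shape
width95 <- lim95$upper - lim95$lower
print(lim95)

# Hypothetical observed rates: the same rate of 0.0003 in a small and a large area
observed   <- c(0.00030, 0.00019, 0.00030)
outside999 <- observed < lim999$lower | observed > lim999$upper
print(data.frame(n, observed, outside999))
```

With these made-up numbers, a rate of 0.0003 sits inside the 99.9% band for the smallest area but outside it for the largest, which is the point of the funnel plot: whether a rate is surprising depends on the population behind it. A test like outside999 could also replace the hard-coded p>0.0003 threshold used to pick out the labelled point.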
