Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Exploring dataRote analysis vs. snooping
Spurious correlationsThere’s a whole website about this
What can you do?The best you can
• Identify scientific questions
• Distinguish between exploratory and confirmatory analysis
1
• Pre-register studies when possible
• Keep an exploration and analysis journal
• Explore predictors and responses separately at first
1 Individual variables
• Look at location and shape
• Maybe with different sets of grouping variables
• Contrasts
– Parametric vs. non-parametric
– Exploratory vs. diagnostic
– Data vs. inference
Means and standard errors
●
●
●
●
●●
●
●
10
100
A B C D E F G H
treatment
decr
ease
Means and standard deviations
2
●
●
●
●
●● ●
●
1
10
100
A B C D E F G H
treatment
decr
ease
Means and standard deviations
●●
●
●
●
● ●
●
0
50
100
A B C D E F G H
treatment
decr
ease
Bike example
3
0
50
100
150
200
Clear Cloudy Light Heavy
weather
rent
als
Standard errors
−100
0
100
200
Clear Cloudy Light Heavy
weather
rent
als
Standard errors
4
●
●
●
●
−100
0
100
200
Clear Cloudy Light Heavy
weather
rent
als
Standard deviations
●
●
●
●
−200
0
200
400
600
Clear Cloudy Light Heavy
weather
rent
als
Data shape
5
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●●●
●
●
●
●●
●
●
●●
●
●●
●
●●●●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
0
250
500
750
1000
Clear Cloudy Light Heavy
weather
rent
als
Data shape
0
250
500
750
1000
Clear Cloudy Light Heavy
weather
rent
als
Data shape
6
●
●●
●
●
●
●●●●
●●●●●●
●●
●●●●
●
●●
●
●●●
●
●
●
●●
●●●
●
●●
●●●
●●●
●
●●●●
●●●●
●●
●●
●●●●
●
●●
●●●●●●●●
●●●
●
●●
●●
●●●●
●
●●●●
●
●●
●●●
●●●●●
●
●●●●
●
●
●●
●●
●
●
●●
●●●●
●
●●●●
●●●
●
●
●●●●
●
●
●●
●●●●●
●●
●●
●●●
●●
●●
●●
●●●
●●●
●
●
●●
●●
●
●●●●
●
●●●
●●●
●●●●
●●
●
●●●
●●●●●●
●●
●●●●●●●●
●
●
●
●
●
●●●
● ●●●●●
●●
●
●
●●
●
●
●
●
●●
●●
●
●
●
●
●●●
●●
●
●
●●
●●●●●●●●●
●●●●●●●●●●●●●
●
●●●●●
●●
●●●
●●
●
●●●
●●●
●
●
●
●●●●●●●●●
●●
●
●●
●●
●
●●
●●●●●●●●●●●●●●●●●●●●●●●●
10
1000
Clear Cloudy Light Heavy
weather
rent
als
Data shape
10
1000
Clear Cloudy Light Heavy
weather
rent
als
Data shape and weight
7
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●●●
●
●
●
●●
●
●
●●
●
●●
●
●●●●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
0
250
500
750
1000
Clear Cloudy Light Heavy
weather
rent
als
Log scales
• In general:
– If your logged data span < 3 decades, use human-readable numbers (e.g., 10-5000kilotons per hectare)
– If not, just embrace “logs” (log10 particles per ul is from 3–8)
∗ But remember these are not physical values
• I love natural logs, but not as axis values
– Except to represent proportional difference!
2 Bivariate data
Banking
• Banking is a real thing
– Even though many examples are bogus
• Since the point is make patterns visually clear, trial-and-error is usually as good asalgorithm
– But it is worth considering
8
Sunspots
sunspots data
Year
Mon
thly
sun
spot
num
bers
1750 1800 1850 1900 1950 2000
050
100
150
200
250
sunspots data
Year
Mon
thly
sun
spot
num
bers
1750 1800 1850 1900 1950
−40
0−
200
020
040
060
0
Smoking data
●
●
●
●
●
●
●
●●
1
2
3
4
5
6
current smoker non−current smoker
smoke
fev
Smoking data
9
●●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
5
10
15
current smoker non−current smoker
smoke
age
Scatter plots
• Depending on how many data points you have, scatter plots may indicate relationshipsclearly
• They can often be improved with trend interpolations
– Interpolations may be particularly good for discrete responses (count or true-false)
Scatter plot
●●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
1
2
3
4
5
6
5 10 15
age
fev
10
Seeing the density better
1
2
3
4
5
6
5 10 15
age
fev
Seeing the density worse
1
2
3
4
5
6
5 10 15
age
fev
n1.001.251.501.752.00
Maybe fixed
11
1
2
3
4
5
6
5 10 15
age
fev
n1.001.251.501.752.00
A loess trend line
1
2
3
4
5
6
5 10 15
age
fev
n1.001.251.501.752.00
Two loess trend lines
12
1
2
3
4
5
6
5 10 15
age
fev
smokecurrent smokernon−current smoker
n1.001.251.501.752.00
Many loess trend lines
female male
5 10 15 5 10 15
1
2
3
4
5
6
age
fev
Theory of loess
• Local smoother (locally flat, linear or quadratic)
• Neighborhood size given by alpha
– Points in neighborhood are weighted by distance
• Check help function for loess
13
Robust methods
• Loess is local, but not robust
– Uses least squares, can respond strongly to outliers
• R is has a very flexible function called rlm to do robust fitting
– Not local
– But can be combined with splines
Fitting comparison
1
2
3
4
5
6
5 10 15
age
fev
loess
1
2
3
4
5
6
5 10 15
age
fev
rlm plus ns(3)
Density plots
• Contours
– use _density_2d() to fit a two-dimensional kernel to the density
• hexes
– use geom_hex to plot densities using hexes
– this can also be done using rectangles for data with more discrete values
14
Contours
1
2
3
4
4 8 12 16
age
fev
0.025
0.050
0.075
level
Contours
1
2
3
4
4 8 12 16
age
fev
0.025
0.050
0.075
level
Contours
15
1
2
3
4
4 8 12 16
age
fev
0.025
0.050
0.075
level
Hexes
1
2
3
4
5
5 10 15
age
fev
sexfemalemale
Hexes
16
2
4
6
5 10 15 20
age
fev
10
20
30count
Color principles
• Use clear gradients
• If zero has a physical meaning (like density), go in just one direction
– e.g., white to blue, white to red
– If the map contrasts with a background, zero should match the background
• If there’s a natural middle, you can use blue to white to red, or something similar
3 Multiple dimensions
• Three dimensional data is a lot like two-d with densities: contour plots are good
• Pairs plots: pairs, ggpairs
Pairs example
17
●
●
●●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●●
●
●●●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●●
● ●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●● ●
●
●●
●
●
●
●
●
●●●● ●
●● ●● ● ●●
●● ●
●●●
●●
●●
●
●●
●● ●●●● ●● ●●
● ●●●●
●●●●●
●●
● ●●
●●
●
●
●●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
● ●
●●
●
●●●
●
●
● ●●
●●●
●●
●
●
●●● ●
●
●
●
●
●●
●
●
●
●
●●
●●
●
●●●●
●●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●●
●
●●
●●
●●
●●
●●
●
●●●● ●
●●
●●●
●●●●
●
●●● ●●
●
●
●
●
● ●
●
●●●●
●
●●●● ●
●● ●
●●●
●
●●
●● ●●
●● ●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
● ●
●
●
●●●
●
●●
●●
●●●●
●
●
●
●●● ●
●
●
●
●
●
●
●●
●●●
●
●●
●●
●●
●
●●
●
●
● ●
●
●
●●●
●
●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●●
●
●
Corr:
−0.118
●● ●● ●
●●●● ● ●●
●● ●
●●●
●●
●●
●
●●
● ●●●●● ● ●●●● ●●●
●●● ●●
●
●●
● ●●
●●
●
●
●●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●●
●●
●
●●●
●
●
● ●●
●●●
●●
●
●
● ●●●
●
●
●
●
●●
●
●
●
●
●●
●●
●
● ●●
●
●●
●
●
●
●
●
●●
● ●
●●
●●
●
●
●
●
●●
●
●●
●●
●●
●●
●●
●
●● ●● ●
●●●●
●●●
●●●
●●● ●●
●
●
●
●
●●
●
●●●●
●
●●●● ●
●● ●
●●●
●
●●
●● ●●
●●●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●●
●
●
●●●
●
●●
●●
● ●●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●●●
●
●●
●●
●●
●
●●
●
●
●●
●
●
●● ●
●
●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●●
●
●
Corr:
0.872
Corr:
−0.428
●●●●●
●●●●●●●
●●●
●●● ●●
●
●
●
●
●●
●
●●●●
●
●●●●●●
●●●●●
●
●●
●●●●
●● ●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●● ●
●
●
●●
●
●
●●●
●
●●●●
●
●
●
●●●●
●
●
●
●
●
●
●●
●●●
●
●●
●●
●●
●
●●
●
●
● ●
●
●
●●●
●
●
●●
●
●●
●●
●●
●
●●
●
●
●
●
●●
●
●
Corr:
0.818
Corr:
−0.366
Corr:0.963
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length
Sepal.W
idthP
etal.LengthP
etal.Width
5 6 7 8 2.0 2.5 3.0 3.5 4.0 4.5 2 4 6 0.0 0.5 1.0 1.5 2.0 2.5
0.0
0.1
0.2
0.3
0.4
2.0
2.5
3.0
3.5
4.0
4.5
2
4
6
0.0
0.5
1.0
1.5
2.0
2.5
4 Multiple factors
• Use boxplots and violin plots
• Make use of facet_wrap and facetgrid
• Use different combinations (e.g., try plots with the same info, but different factors onthe axes vs. in the colors or the facets)
c©2018, Jonathan Dushoff and Ben Bolker. May be reproduced and distributed, with this notice, for non-commercial purposes only.
18