Exploring and Visualizing Data: Techniques for a clearer presentation of data

Brian Vegetabile, UCI Statistics PhD Student
November 17th, 2015

Outline

• A Case for Data Exploration & Visualization

• Exploring & Visualizing a Single Variable

• Comparing Distributions of Data

• The Iterative Process of Creating a Graphic

• Data & Image Sources


A Case for Considering Data Exploration and Visualization


A Case for Considering Data Visualization

• Graphics are a useful aid for presenting technical data in the sciences

• Sometimes, though, they are created without thought for how the reader will perceive them

• Misused graphics can cause both the analyst and the reader to miss vital information in the data

• As a reader, it is also your responsibility to look for inconsistencies between technical graphics and the conclusions drawn in the text


Example Graphics: Two Ways of Looking at Sunspots (1)

[Figure: Yearly Sunspot Totals; x-axis: Year, y-axis: Sunspot Numbers]

• Most standard graphics packages produce square plots

• A square plot 'squishes' this series, so information is lost to the reader

• It fails to communicate a key feature of the data


Example Graphics: Two Ways of Looking at Sunspots (1)

• Changing the aspect ratio (the graphic's width relative to its height) reveals information hidden in the previous sunspot graphic

• Choosing the aspect ratio this way is sometimes called “banking”

• Observe that sunspot numbers rise steeply to each maximum and then decline more gradually.

• Considering how a graphic is displayed can be instrumental in communicating as much information as possible to the reader
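
One way to make the choice of aspect ratio concrete is a median-absolute-slope rule: pick the height-to-width ratio so that the typical line segment is drawn at roughly 45 degrees. The sketch below illustrates that rule under the assumption that the x values are strictly increasing; it is not necessarily how the sunspot graphic was produced, and the function name `banked_aspect_ratio` and the sample values are hypothetical.

```python
from statistics import median

def banked_aspect_ratio(x, y):
    """Height/width ratio that draws the median segment at 45 degrees.

    Assumes the x values are strictly increasing and y is not constant.
    """
    x_range = max(x) - min(x)
    y_range = max(y) - min(y)
    # Absolute slope of each consecutive segment, in data units.
    slopes = [abs((y1 - y0) / (x1 - x0))
              for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])]
    # On the page, a data slope s is drawn with slope
    # s * (x_range / y_range) * (height / width); set the median to 1.
    return y_range / (x_range * median(slopes))

# A made-up series that swings quickly calls for a short, wide plot.
years = list(range(10))
values = [0, 40, 10, 50, 5, 45, 0, 40, 10, 50]
ratio = banked_aspect_ratio(years, values)
```

A ratio well below 1 means the plot should be much wider than it is tall, which matches the change made between the two sunspot graphics.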

[Figure: Yearly Sunspot Totals, drawn with a wide aspect ratio; x-axis: Year, y-axis: Sunspot Numbers]

Example Graphics: Perception of the difference in Curves

• Another way information can be lost in the graphing process is in judging the difference between curves

• The distance between the curves on the right appears to decrease greatly as the independent variable increases

[Figure: two curves; x-axis: Independent Variable, y-axis: Response Variable]

Example Graphics: Perception of the difference in Curves

• Once we add another graphic that captures the differences between the curves, we see that the difference is almost constant!

• Considering alternative presentations of your data is crucial not only for your own understanding of the data, but also for your readers'
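
A small numerical sketch of why the eye is misled: for two curves separated by a constant vertical offset, the shortest distance between them shrinks as the curves get steeper. The curves below are hypothetical (they are not the ones plotted on the slide), and the perpendicular distance is an approximation that assumes both axes use the same scale.

```python
import math

def lower(x):
    return 4 * x ** 2          # hypothetical lower curve

def upper(x):
    return lower(x) + 15       # constant vertical offset of 15

def slope(x):
    return 8 * x               # derivative shared by both curves

xs = [0, 1, 2, 3, 4, 5]
# What a difference plot shows: the vertical gap at each x.
vertical_gap = [upper(x) - lower(x) for x in xs]
# Approximate shortest (perpendicular) distance between the curves,
# which is closer to what the eye judges.
perpendicular_gap = [15 / math.sqrt(1 + slope(x) ** 2) for x in xs]
```

The vertical gap stays at 15 everywhere, while the perpendicular gap falls from 15 to well under 1, so plotting the difference directly removes the illusion.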

[Figure, two panels: (1) the two curves, x-axis: Independent Variable, y-axis: Response Variables; (2) the difference between the curves, x-axis: Independent Variable, y-axis: Difference in Response Variables]

Example Graphics: Space Shuttle Challenger Analysis (2)

• January 27, 1986: the night before the space shuttle Challenger accident

• A three-hour teleconference was held among people at Morton Thiokol, Marshall Space Flight Center, and Kennedy Space Center.

• The discussion focused on the forecast of a 31°F temperature for launch time the next morning, and the effect of low temperature on O-ring performance.

[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger; x-axis: Calculated Joint Temperature (F), y-axis: Number of Incidents]

Example Graphics: Space Shuttle Challenger Analysis (2)

• The engineers presented only the failures, not the successes

• Based on the U-shaped configuration of points, it was concluded that the historical data showed no evidence of a temperature effect.

[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger, failures only; x-axis: Calculated Joint Temperature (F), y-axis: Number of Incidents]

Example Graphics: Space Shuttle Challenger Analysis (2)

• Adding the successes to the graphic, we observe that the number of incidents depends on joint temperature

• The Rogers Commission concluded that "A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature"
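
The effect of leaving out the successes can be sketched with a few lines of code. The records below are made-up values chosen for illustration, not the actual flight history, and the helper `incident_rate` and the 65-degree split are hypothetical.

```python
# Made-up (temperature in F, number of incidents) pairs for illustration;
# these are NOT the actual shuttle flight records.
flights = [(53, 3), (57, 1), (58, 1), (63, 1), (70, 1), (70, 1), (75, 2),
           (66, 0), (67, 0), (67, 0), (68, 0), (69, 0), (70, 0), (72, 0),
           (73, 0), (76, 0), (76, 0), (78, 0), (79, 0), (80, 0), (81, 0)]

def incident_rate(records, low, high):
    """Share of flights with low <= temperature < high that had an incident."""
    in_band = [n for temp, n in records if low <= temp < high]
    return sum(1 for n in in_band if n > 0) / len(in_band)

failures_only = [(t, n) for t, n in flights if n > 0]

# With every flight included, the cold band stands out.
rate_cold = incident_rate(flights, 50, 65)
rate_warm = incident_rate(flights, 65, 85)
# With failures only, every band trivially has a rate of 1.0.
rate_cold_failures_only = incident_rate(failures_only, 50, 65)
rate_warm_failures_only = incident_rate(failures_only, 65, 85)
```

Without the zero-incident flights there is no denominator, so the cold and warm bands look the same; with them, the difference in rates is obvious.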

[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger, successes included; x-axis: Calculated Joint Temperature (F), y-axis: Number of Incidents]

Example graphic: Typical Graphic from Science

• Pick up any issue of Science Magazine and you’ll find graphics similar to the one on the right.

• “…Data are means ± SEM of seven to eight mice per genotype for (B) and six mice per genotype for (C). Statistical significance was analyzed by unpaired two-tailed t test. *P < 0.05”

• This graphic is confusing because it presents the data as a “bar chart” even though the data are not categorical.


Exploring and Describing the Distribution of a Single Continuous Variable - Variables of One Dimension


Visualization of a Single Continuous Variable

• Visualizing a single variable is helpful in understanding the distribution of the data.

• A plot reveals insights beyond summary tables.

• It lets you see the mean, median, mode, quantiles, etc.

• Many statistical tests assume a particular distribution for the process that generated the data

• For example, Student's t-test

• The following slides present techniques for assessing the distribution of a variable to aid in its summary

• Note: 100 randomly simulated points are used to illustrate these techniques


Dynamite Plots for a Single Variable

• Dynamite plots are rampant throughout the sciences.

• Shown is a dynamite plot of the simulated data

• It shows the mean as a measure of central tendency, with an error bar extending one standard deviation past the mean.

• These plots obscure major information that is hiding within the data!
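
To see how much a dynamite plot can hide, consider two made-up samples with very different shapes that produce exactly the same bar and error bar. This is a minimal sketch; the helper name `dynamite_summary` is hypothetical and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev

def dynamite_summary(data):
    """Return the two numbers a dynamite plot draws: bar height and error-bar top."""
    m = mean(data)
    return m, m + pstdev(data)

two_clusters = [-1, 1, -1, 1, -1, 1, -1, 1]   # every value is far from the mean
one_cluster = [-2, 0, 0, 0, 0, 0, 0, 2]       # most values sit at the mean

same_picture = dynamite_summary(two_clusters) == dynamite_summary(one_cluster)
```

Both samples have mean 0 and standard deviation 1, so their dynamite plots are identical even though one sample is bimodal and the other is concentrated at a single value.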

[Figure: Dynamite Plot for Distribution of Data; y-axis: Value]

Box & Whisker Plots

• To the right is a Box & Whisker plot of 100 simulated data points.

• Introduced by John Tukey as part of his exploratory data analysis toolkit

• Useful for beginning to understand the data, or to supplement another plot (dot plot or histogram)

• Some packages will also highlight any outliers
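
The quantities behind a box and whisker plot can be computed directly. The sketch below assumes the common convention of flagging points more than 1.5 interquartile ranges beyond the box as outliers; individual packages may use different quantile definitions or rules, and the function name `box_plot_stats` is hypothetical.

```python
from statistics import quantiles

def box_plot_stats(data):
    """Five-number summary plus points beyond 1.5 * IQR from the box."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"min": min(data), "q1": q1, "median": q2, "q3": q3,
            "max": max(data), "outliers": outliers}

# Made-up data with one extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

Here the value 50 lies above the upper fence and would be drawn as a separate point.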

[Figure: box and whisker plot, Distribution of a Variable; x-axis: x-value]

Dot plots

• Each data point is plotted along a line.

• Spread and distribution of points are now more obvious.

• Plotted with a measure of central tendency.

[Figure: dot plot, Distribution of a Variable; x-axis: x-value]

Dot plots

• Key Concept: Central Tendency

• A measure of central tendency is a central or typical value for a probability distribution.

• The graphic includes a red line marking the median

• The median is a more robust measure of central tendency than the mean: it is less influenced by skew and extreme values in the data.
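
A tiny worked example, using made-up numbers, shows the difference in stability between the two measures when one extreme value is introduced.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 100]      # same data with one extreme value

mean_shift = mean(skewed) - mean(symmetric)        # 22 - 3
median_shift = median(skewed) - median(symmetric)  # 3 - 3
```

The mean moves by 19 while the median does not move at all.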

[Figure: dot plot with a red line at the median, Distribution of a Variable; x-axis: x-value]

Dot plots - Adjusting the Alpha Level

• Adjusting the alpha level amounts to changing how transparent each data point is

• Adds a level of “depth” to the graphic

• The plot below has an alpha level set to 0.5

• Darker areas have more points than lighter areas
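
Why darker areas mean more points can be made precise. Assuming the renderer uses standard "over" alpha blending and every point has the same alpha, the opacity where k points overlap is 1 - (1 - alpha)^k. The function name below is hypothetical.

```python
def stacked_opacity(alpha, k):
    """Opacity where k points with the same alpha overlap ('over' blending)."""
    return 1 - (1 - alpha) ** k

# With alpha = 0.5, each additional overlapping point halves the remaining transparency.
opacities = [stacked_opacity(0.5, k) for k in (1, 2, 3, 4)]
```

With an alpha of 0.5 the plot saturates after only a few overlapping points, so a smaller alpha may be preferable for very dense data.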

[Figure: dot plot with alpha set to 0.5, Distribution of a Variable; x-axis: x-value]

Dot plots - Adding Jitter

• Adding jitter amounts to adding random noise to where each data point is drawn, so that overlapping points no longer hide one another

• Combined with an adjusted alpha level, jitter gives a better idea of the distribution of the data points
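
A minimal sketch of jitter for a one-dimensional dot plot: the data value is left untouched and a small random offset is drawn for the other axis. The function name `jitter`, the uniform noise, and the default width are assumptions made for illustration.

```python
import random

def jitter(values, width=0.1, seed=0):
    """Pair each value with a random offset in [-width, width] on the other axis."""
    rng = random.Random(seed)
    return [(v, rng.uniform(-width, width)) for v in values]

# Three identical values would overlap exactly without jitter.
points = jitter([0.5, 0.5, 0.5, 1.2], width=0.1)
```

Because the noise is applied only to the axis that carries no data, the plotted values themselves are not distorted.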

[Figure: dot plot with jitter and adjusted alpha, Distribution of a Variable; x-axis: x-value]

Histograms

• Histograms reveal even more information than the previous two plot types!

• The simulated data were actually multimodal

• Note: When using histograms it’s also necessary to consider bin width
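
The effect of bin width can be sketched with a hand-rolled histogram. The data and the helper `bin_counts` are hypothetical; bins are taken to start at multiples of the width, which is one of several possible conventions.

```python
import math
from collections import Counter

def bin_counts(data, width):
    """Count points per bin, where bin k covers [k * width, (k + 1) * width)."""
    return Counter(math.floor(v / width) for v in data)

data = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]   # two made-up clusters
narrow = bin_counts(data, 0.5)   # bins 0 and 2 are filled, bin 1 is empty
wide = bin_counts(data, 5)       # a single bin hides the two clusters
```

A width of 0.5 keeps the two clusters apart, while a width of 5 merges everything into one bar and the multimodality disappears.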

[Figure: histogram, Distribution of a Variable; x-axis: x-value, y-axis: Frequency]

Histograms - Comparing Bin Widths

[Figure: six histograms of the same data, titled Bin width ± 0.05, ± 0.1, ± 0.2, ± 0.5, ± 1, and ± 5; x-axis: x-value, y-axis: Frequency]

Combining plots

• Combining plots sometimes paints a clearer picture

• The combined graphic shows modality, the total number of points, and the five-number summary

[Figure: Utilizing Three Plots; y-axis: Frequency]

Quantile Plots - Normal QQ-Plot

• Quantile-Quantile Plots are both simple and powerful

• Many statistical tests assume that the data being tested were generated by a normal distribution.

• Normal QQ-plots compare the quantiles of a sample against the theoretical quantiles of a normal distribution
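
The points of a normal QQ-plot can be computed in a few lines. This sketch uses plotting positions (i - 0.5) / n, which is one common convention; plotting software may use a slightly different formula. The function name `normal_qq_points` is hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ-plot."""
    n = len(sample)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

pairs = normal_qq_points([3, 1, 2, 5, 4])
```

If the sample came from a normal distribution, the pairs fall close to a straight line; systematic curvature or steps indicate a departure from normality.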

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Quantile Plots - QQ-Plot

• What does the sampled data look like compared with a normal distribution?

• As expected, the multi-modal data does not compare well against the normal distribution.

• The QQ-plot is another tool for understanding the distributional characteristics of the observed data

[Figure: Normal Q-Q Plot of the simulated data; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Logarithmic Transformation of a Distribution

• Again, many tests assume that the data are normally distributed

• Many types of data, though, are not normally distributed on their original scale.

• It is sometimes necessary to transform the data to a new scale that preserves their order but on which they are closer to normally distributed

• Salaries and other non-negative, right-skewed data are often natural candidates for transformation
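
The two properties claimed for the log transformation, preserving order and reducing skew, can be checked on a small made-up sample. The moment-based `skewness` helper is a simple illustration, not a formal test of normality, and the values are hypothetical.

```python
import math

def skewness(data):
    """Moment-based skewness: zero for symmetric data, positive for a long right tail."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 4, 8, 16, 32, 64]          # made-up right-skewed values
logged = [math.log(v) for v in raw]

# The logarithm is increasing, so the ordering of the data is unchanged.
order_preserved = logged == sorted(logged)
```

The raw values have a long right tail, while their logarithms are evenly spaced and therefore symmetric. Note that the logarithm requires strictly positive values.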

[Figure, two histograms: 141 Major North American River Lengths, Obtained by USGS; (1) x-axis: River Length, y-axis: Frequency; (2) x-axis: log(River Length), y-axis: Frequency]

Visualizing the Distribution of a Single Continuous Variable - Variables with More Dimensions


Scatterplots

• Scatterplots are essentially the analog of dot plots in multiple dimensions

[Figure: scatterplot; x-axis: Dimension 1, y-axis: Dimension 2]

Scatterplots

• Similar to dot plots, adjusting alpha reveals a ‘depth’ of points

[Figure: four scatterplots of the same data with alpha set to 0.25, 0.5, 0.75, and 1; x-axis: Dimension 1, y-axis: Dimension 2]

Scatterplots with Histograms

• Scatterplots can be combined with additional plots, such as histograms of each dimension, to make the picture clearer

[Figure: scatterplot of Dimension 1 vs. Dimension 2 with a frequency histogram of each dimension along the margins]

More Dimensions: Pairs Plots

• As the number of dimensions grows, combining scatterplots and histograms in a pairs plot can be very effective

[Figure: pairs plot of Variables 1 through 4, with a frequency histogram for each variable and a scatterplot for each pair of variables]

Visualizing Categorical Variables


Categorical Data

• Categorical Data is often represented as a table of quantities.

• MLB National League East Rankings as of July 26th, 2015

Team                    Wins  Losses  Win Percentage
Washington Nationals      52      45       0.5360825
New York Mets             51      48       0.5151515
Atlanta Braves            46      52       0.4693878
Miami Marlins             41      58       0.4141414
Philadelphia Phillies     37      63       0.3700000
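
The percentage column follows directly from the win and loss counts. The sketch below recomputes it from the table; the dictionary layout and the helper name `win_percentage` are illustrative choices.

```python
# (wins, losses) per team, taken from the table above.
standings = {
    "Washington Nationals": (52, 45),
    "New York Mets": (51, 48),
    "Atlanta Braves": (46, 52),
    "Miami Marlins": (41, 58),
    "Philadelphia Phillies": (37, 63),
}

def win_percentage(wins, losses):
    """Fraction of games played that were won."""
    return wins / (wins + losses)

percentages = {team: round(win_percentage(w, l), 3)
               for team, (w, l) in standings.items()}
```

Rounding to three decimals is usually enough for display and makes the table easier to scan.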


Categorical Data - Pie Charts

• Many people interested in data visualization will tell you never to use pie charts…

• Pie charts are often used to show “Percent of the Whole”

• … but the relative scale between categories is often lost

[Figure: pie chart with slices for the Washington Nationals, New York Mets, Atlanta Braves, Miami Marlins, and Philadelphia Phillies]

Categorical Data - Bar Charts

• One remedy is to display the data as a bar chart.

• Relative win percentage is now clearer.

• The Nationals are doing much better than the Phillies.

[Figure: bar chart, NL East Win Percentage as of July 26th, 2015; one bar per team; y-axis: Win Percentage, from 0.0 to 1.0]

Categorical Variables - Dot and Line Charts

• Changing to a ‘dot and line’ plot yields more information: we see each team's wins relative to its losses across the division.
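
The coordinates for such a plot are simple to derive: losses are placed to the left of zero and wins to the right, with a line joining the two dots for each team. This is a sketch of the data preparation only, using the counts from the NL East table; the variable names are hypothetical.

```python
# (team, wins, losses) from the NL East table, July 26th, 2015
teams = [("Philadelphia Phillies", 37, 63),
         ("Miami Marlins", 41, 58),
         ("Atlanta Braves", 46, 52),
         ("New York Mets", 51, 48),
         ("Washington Nationals", 52, 45)]

# Each row gets two dots joined by a line: losses at -losses, wins at +wins.
rows = [(team, -losses, wins) for team, wins, losses in teams]
# The length of each line is the number of games played.
line_lengths = {team: right - left for team, left, right in rows}
```

Because each line spans from losses to wins, a line shifted to the right indicates a winning record and a line shifted to the left a losing one.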

[Figure: dot and line chart, NL East Standings as of July 26, 2015; one row per team (Philadelphia Phillies, Miami Marlins, Atlanta Braves, New York Mets, Washington Nationals); x-axis: Losses to the left of zero, Wins to the right]

Comparing Distributions


Comparing Distributions

• Often we are interested in comparing more than one distribution.

• The plots show 1000 simulated draws from 3 separate beta distributions

[Figure: three density histograms, Distribution 1, Distribution 2, and Distribution 3, each drawn on its own scale; x-axis: X, y-axis: Density]

Comparing Distributions - Common Scale

• Adjusting to a common scale for each distribution allows us to see relative spreads, relative centers, etc.
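
A common scale can be obtained by computing one set of axis limits from all samples together and reusing it for every panel. The sketch below simulates three samples for illustration; the beta parameters are hypothetical (the slides do not state the ones used), as is the helper `common_limits`.

```python
import random

def common_limits(*samples):
    """Smallest axis range that covers every sample."""
    return min(min(s) for s in samples), max(max(s) for s in samples)

rng = random.Random(1)
# Hypothetical beta parameters; the slides do not state the ones used.
d1 = [rng.betavariate(2, 20) for _ in range(1000)]
d2 = [rng.betavariate(5, 20) for _ in range(1000)]
d3 = [rng.betavariate(300, 150) for _ in range(1000)]

x_limits = common_limits(d1, d2, d3)
```

For beta-distributed data the natural common scale is the full interval from 0 to 1, but computing limits from the data works for any set of samples.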

[Figure: the three density histograms redrawn on a common x-axis from 0.0 to 1.0; x-axis: X, y-axis: Density]

Comparing Distributions - Common Plot

• Finally, moving to a common plot, we see how the densities compare with each other on common x and y scales

[Figure: All 3 Distributions on one plot; x-axis: X, y-axis: Density]

Comparing Distributions

• This can be even more dramatic in more dimensions

[Figure: two scatterplots on separate scales; Distribution 1 (x-axis: X1, y-axis: Y1) and Distribution 2 (x-axis: X2, y-axis: Y2)]

Comparing Distributions - Common Scale

• Common scales allow us to see the relative sizes of the distributions

[Figure: the two scatterplots redrawn on common scales; Distribution 1 (x-axis: X1, y-axis: Y1) and Distribution 2 (x-axis: X2, y-axis: Y2)]

Comparing Distributions - Common Plot

• With a common plot, we can see the distance between the two centers and assess the overlap

[Figure: Distribution 1 vs. Distribution 2 on one scatterplot; x-axis: X, y-axis: Y]

The Iterative Process of Creating a Graphic


Exploring data: Stepping through the Process

• The data were simulated for illustration, based on the following study:

• “Maternal exposure to childhood trauma is associated during pregnancy with placental-fetal stress physiology,” Biological Psychiatry (to appear) [3]

• Goal: Examine the hypothesis that intergenerational transmission may begin during intrauterine life via the effect of maternal childhood trauma exposure on placental-fetal stress physiology, specifically placental corticotrophin-releasing hormone (pCRH).

• We are interested in examining the effects of childhood trauma exposure on placental corticotrophin-releasing hormone production over gestational age.

• This simulated data will help demonstrate the iterative design process of a graphic


Describing the data

• The simulated data represent a “sociodemographically-diverse cohort of 88 pregnant women.”

• Placental CRH concentrations were quantified in maternal blood collected serially over the course of gestation.


What does the data look like?

• What does the relationship between pCRH and gestational age look like before taking treatment effects or individual effects into consideration?

• We are interested in understanding the general trend of pCRH across gestational age.

[Figure: Relationship between Gestational Age and pCRH; x-axis: Gestational Age, y-axis: pCRH]

Transforming the Response

• From the last plot, we notice an approximately exponential relationship

• It’s often of interest to see whether such a relationship is linear on a logarithmic scale, so that linear regression can be used

• We’ve plotted the transformed response, log(pCRH), to the right

• Notice that there is a clear linear relationship on this scale
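
The reasoning behind the transformation: if the response follows y = a * exp(b * x), then log(y) = log(a) + b * x, which is a straight line in x. The sketch below checks this on made-up, noiseless data; the values of a and b are hypothetical and the `fit_line` helper is a plain least-squares fit written for illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up noiseless response y = 5 * exp(0.15 * x), for illustration only.
weeks = list(range(15, 41))
response = [5 * math.exp(0.15 * w) for w in weeks]

# Fit on the log scale: the slope estimates b and the intercept estimates log(a).
intercept, slope = fit_line(weeks, [math.log(v) for v in response])
```

With real data the fit will not be exact, but a roughly straight scatter of log(y) against x is the signal that a linear model on the log scale is reasonable.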

[Figure: Relationship between Gestational Age and log(pCRH); x-axis: Gestational Age, y-axis: log(pCRH)]

Is there a difference between the groups?

• We can now begin to explore differences in the production of pCRH across gestational age between those who experienced childhood trauma and those who did not.

• … it doesn’t look like there’s much of a difference.

• Let’s investigate whether the variability in slopes differs between the two groups.

[Figure, two panels: Experienced Childhood Trauma and Did Not Experience Childhood Trauma; x-axis: Gestational Age, y-axis: log(pCRH)]

Do the individual trajectories vary between the groups?

• Connecting each individual's points into a trajectory lets us see whether the variability differs between the two groups

• … again, it doesn’t look like there’s much of a difference.

• It appears that there are 5 distinct collection phases across gestational age. What if we bin the observations by phase and investigate that way?

[Figure, two panels with individual trajectories drawn as lines: Experienced Childhood Trauma and Did Not Experience Childhood Trauma; x-axis: Gestational Age, y-axis: log(pCRH)]

Grouping by Week Clusters?

• We’ve grouped the observations by week cluster

• Now we can look at the distribution of points within each cluster.

• With the two groups plotted separately, it’s hard to tell whether there is a difference between them

• Let’s add them back to the same plot for a side-by-side comparison!
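
The grouping step can be sketched as follows. The five-week bins match the axis labels used in the graphics, but which edge of each bin is inclusive is an assumption, the observations are made up rather than taken from the study, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

LABELS = ["<20", "20-25", "25-30", "30-35", "35-40"]

def week_group(age):
    """Five-week bin for a gestational age; lower edges are inclusive (an assumption)."""
    if age < 20:
        return LABELS[0]
    return LABELS[min(int((age - 20) // 5) + 1, len(LABELS) - 1)]

def medians_by_group(records):
    """records: (gestational age, trauma flag, log pCRH) -> median per (bin, flag)."""
    grouped = defaultdict(list)
    for age, trauma, value in records:
        grouped[(week_group(age), trauma)].append(value)
    return {key: median(vals) for key, vals in grouped.items()}

# Made-up observations, not values from the study.
records = [(18, True, 3.0), (19, True, 3.5), (18, False, 3.75),
           (37, True, 6.5), (36, False, 7.0), (38, False, 7.5)]
result = medians_by_group(records)
```

Keying the result by both bin and trauma flag makes it straightforward to place the two groups next to each other within each bin.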

[Figure, two panels: Experienced Childhood Trauma and Did Not Experience Childhood Trauma; x-axis: Gestational Age Grouped Every Five Weeks, y-axis: log(pCRH)]

Side by Side Distributions

• Comparing the distributions at each ‘week’ group tells us a lot more

• We now see that the median pCRH for those who experienced childhood trauma is lower than for those who did not, across gestational age

• We also see that the difference between the medians gets smaller across gestational age, suggesting an interaction between gestational age and childhood trauma exposure.

• Now let’s tell the whole story!

[Figure: Comparison of log(pCRH) Across Trauma; x-axis: Gestational Age Grouped Every Five Weeks (<20, 20-25, 25-30, 30-35, 35-40), y-axis: log(pCRH); legend: Did Not Experience Trauma, Experienced Trauma]

Telling the Whole Story: A Completed Graphic

• We can now take the graphics that we’ve created through the exploratory phase and construct a combined graphic to tell the whole story

• The two leftmost graphics highlight the individual trajectories, while the last graphic captures how the relationship changes over time

[Figure, three panels: Experienced Childhood Trauma and Did Not Experience Childhood Trauma (each with x-axis: Gestational Age, y-axis: log(pCRH)), followed by Comparison of log(pCRH) Across Trauma (x-axis: Gestational Age Grouped Every Five Weeks, y-axis: log(pCRH); legend: Did Not Experience Trauma, Experienced Trauma)]

Outlining the General Strategy for the Creation of Graphics

• It’s necessary to explore your data to fully understand how it’s behaving

• The goal is to pack a large amount of quantitative information into a small region.

• Consider how a reader would perceive the graphic that you’ve presented.

• Combine graphics when needed to tell the entire story.

• Carefully study the domain area and understand when it is necessary to further investigate the data

• Graphing data should be an iterative, experimental process


Further Investigation

• Multidimensional Visualization techniques

• Visualizing Categorical Variables

• Visualization Techniques for combining Categorical and Continuous Variables

• Loess Smoothing for Scatter Plots

• Techniques for Time Series Data

• Techniques for Spatial Data

Texts/References

• Texts

• The Elements of Graphing Data - William S. Cleveland

• Visualizing Data - William S. Cleveland

• The Visual Display of Quantitative Information - Edward Tufte

• Articles

• Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Models - William Cleveland and Robert McGill

• Let’s Practice What We Preach: Turning Tables into Graphs - Andrew Gelman, Cristian Pasarica, and Rahul Dodhia


References

1. Cleveland, William S. (1994). The Elements of Graphing Data. Murray Hill, NJ: AT&T Bell Laboratories. Print.

2. Dalal, Siddhartha R., Fowlkes, Edward B., and Hoadley, Bruce (1989). Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. Journal of the American Statistical Association, 84:408, 945-957. DOI: 10.1080/01621459.1989.10478858

3. Moog, N.K., Buss, C., Entringer, S., Shahbaba, V., Gillen, D., Hobel, C.J., and Wadhwa, P.D. (2015). Maternal exposure to childhood trauma is associated during pregnancy with placental-fetal stress physiology. Biological Psychiatry (to appear).


Data & Image Sources

• Image - Flight Patterns - http://users.design.ucla.edu/~akoblin/work/faa/

• Data - Sunspots - WDC-SILSO, Royal Observatory of Belgium, Brussels (http://www.sidc.be/silso/datafiles)
