R: Learning By Example Visualization - Dave Armstrong ... · R: Learning By Example Visualization Dave Armstrong University of Wisconsin-Milwaukee Department of Political Science

R: Learning By ExampleVisualization

Dave ArmstrongUniversity of Wisconsin-Milwaukee

Department of Political Science

e: [email protected]: www.quantoid.net/teachicpsr/rbyexample

Contents

1 Brief Primer on Good Graphics 21.1 Graphical Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Graphics Philosophies 14

3 The Plot Function 153.1 getting familiar with the function . . . . . . . . . . . . . . . . . . . . . . 153.2 Default Plotting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 183.3 Controlling the Plotting Region . . . . . . . . . . . . . . . . . . . . . . . 213.4 Example of Building a Scatterplot . . . . . . . . . . . . . . . . . . . . . . 21

3.4.1 Adding a Legend . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4.2 Adding a Regression Line . . . . . . . . . . . . . . . . . . . . . . 273.4.3 Identifying Points in the Plot . . . . . . . . . . . . . . . . . . . . 29

3.5 Other Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 ggplots 334.1 Deducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.1.1 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.1.2 Bar Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Back to Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.3 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.4 Other Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.4.1 Histograms and Barplots . . . . . . . . . . . . . . . . . . . . . . . 404.4.2 Dotplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.5 Faceting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

1

www.quantoid.net/teachicpsr/rbyexample

5 Effects Package 495.1 Some Linear Model Examples . . . . . . . . . . . . . . . . . . . . . . . . 50

5.1.1 Non-linear Transformations in Linear Models . . . . . . . . . . . . 505.1.2 Choosing “other” values of Covariates . . . . . . . . . . . . . . . 525.1.3 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.2 GLM Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565.3 OL/MNL Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6 Maps 60

7 Interactivity 61

1 Brief Primer on Good Graphics

While we could define graphs in lots of different ways, Kosslyn (1994, 2) defines graphsas:

“a visual display that illustrates one or more relationships among numbers”

He goes on to define a graph as “successful” if:

“the pattern,trend or comparison it presents can be immediately apprehended.”

Graphs work by encoding quantitative and qualitative information with differentgraphical elements - points (plotting symbols), lines (patterns), colors, etc...

• Graphical perception is the visual decoding of this encoded information

• For our graphs to be successful, readers must be able to easily and efficiently decodedthe quantitative and qualitative information.

Shah (2002) identifies three components to the process of visual decoding

1. Identify important features of the display (what Cleveland calls “Detection”) →Characteristics of the display

2. Relate the visual feature to the phenomenon it represents in the observed world →Knowledge of graphs

3. Interpret the relationships interps of concepts being quantified → Knowledge ofcontent (i.e., concepts being discussed).

2

1.1 Graphical Perception

We can use lots of different elements to encode numeric information

• Angle

• Area

• Color hue

• Color saturation

• Density

• Length (distance)

• Position along a common scale

• Position along identical, non-alignedscales

• Slope

• Volume

Figure 1: Position

;;_ E T , T T L, ¢»_¢% Q $>i> t . %%%,$>_ %>¢»$;¢_$$i%_¢,Q ii% J;>£ ;;$I - %i<%$ ¢%>i¢i $i¢¥J» af IIATIL,t T T >$%$_»¢ I L ¢%i$ `$ii¥‘ 1V ‘ . Jr —’._

I - ¢i¢% *¢»¢% I 1 , I >$$$ %AA% »:»¢¢%»<> ~ i»» \~$%·— T »<—¢ ¢<$<~‘w s — * ¤ ` & · va. #·=’’ it · ‘ ’ " f *·z¤# A = ‘·—·’ V »*· a >»;*i$@9* ` ~ 2;*; · l » — ,, .>—,»` i»‘ ’ 1 "" " ‘ ° " " ` ’ " " `’i'‘ · `:%‘``,’` s<¤ · · ~ ¢`>»i»I A*%`‘4 2 THE ELEMENTS OF THE PARADIGM » » sATEEE E _ I1 » ‘_. `»’·’» `¢`T¢~< » ~ »`;iET Q ¥»¢ A ETEEETTT»»»jT T »;EiJi i ATETTET » Ti»E ~% ‘<<< Elementary Graphical-Perception Tasks IAiss —>eri, ' ' @ ``.' i UTable 4 1 h hzcal- erce tzon tasks that w,e»t,»eét . S OWS !I`l 8 8777877 Llfy gf P » »»»»s» PGI OITI1 to !COd! 111 OIÌ'I`l3t1OI1 fIÒII'I gI`3P $· BY EPP Y tO qlldfl l Cl IU8stss Ai ~ mf0rmat10r1 such as ViSCOS1ty, gross 11at10na1 pr0duCf, 111119, 11115 ofjtt ¥e:»“~* —.».» ,3 Qlnformatron and d1v0rce rate and not to cuiegvflcul 111f0I`1113t1011 SUCI1 38the tasks w1th several examples, all using mad!—up data.T0 V1SL13llY COII1p3I‘! the values Of tl1! 3 3 111 1gu1`! . WG C31'1iT»é I · . . 1 · th· th3k! ]11dgI11!1‘1tS of pOS1t101‘1S along a c0II1I‘IlOI1 SC3 8, 111 1S C3S! 8—ees~,<`¢ ' ’ h h th l. d thsatt hO1‘1ZO11tal scale. In F1gu1‘e 4.3 the grap 3S 1‘!! P311! S 311 8¢,ii» · · T 1 th ti, ~ h0r1z0nta1 scale lmes for the three are the same. 0 <ZOH1Pa1‘! va UGS 3»,l», · · ·31`! 011 tht? 831118 panel we can ]udge POS1t10nS 31011g 3 COII1I11011 SC3l!, butmss 1 cx-xss 2 CASE 3=` M tt ¤ 5 _r._c< , »ti» , ...................... .g ................ •_‘,» ir .................. . ........... . ..... . - ·····-··-··········-······ CB ''`'' • ` ' ' ' ·»¤:t;_,tttt.................... g................... g ....... gtatl ,. . . . ........... . . . . . . . . . . . . . . . '0lcci »l—7E . . • _,________________ • .................. •F -··--···--.... Q ..............,., g --·-·--.-·......... Qé' wl »`;» _,;*—t»—»LSC .,l,aita . Frgure 4.3 POSITION ALONG IDENTICAL, NONALIGNED SCALES. Values ,—`,'*On drfferent panels of thas graph can be compafé y ma I1 u men S 0»`‘eel,— ' ‘ ' ‘ • ·pcsntnons BIOUQ |dG|"|t|C8l, f'|OI'\&lIQI'I6d SCQIGS.,,lt

Figure 2: Length

.;=;=: ._,¤.i,;¢;..L,,;..` - ..V~ c .. ....> ..,.., - .\Vr`. _ _.._` . .. _i,, _,., . __to compare values on separate panels we must make judgments ofpositions on identical, but nonaligned scales.The vertical scale in Figure 4.4 shows statistics and two-tiered errorbars portraying 50% and 95% confidence intervals for the statistics. Tocompare the values of the statistics and the ends of the intervals we canmake judgments of positions along the vertical scale. The lengths of theconfidence intervals of one level of confidence, for example the 95%intervals, are measures of the precisions of the statistics; a longerinterval means less precision. To decode the precisions visually we canmake length judgments.In Figure 4.5 the relative values of the y measurements can beextracted by judgments of positions along the vertical scale, and the xs>< VALUEFigure 4.4 LENGTH. The lengths or the 95% confidence intervals, or ofthe 50% intervals, serve as measures of precision of the graphed statistics.To compare the magnitudes of either set ot precision measures, we can

3

Figure 3: Slope

gy 3 ·_,_

<<i<<

)__,·' _<n I

/ 1 • h QU ** sg- ·= ES ititsu t|'I¤;;t¤,¥a8 :\ n.ug; (g pz E5 n D r Vi Mii lO‘tOi( u IA Iu :5] n lp- Vgr. :2**. Ftm 8 ig, 2 'n tg E. _K. i iItA ljlh d E35 II S U

_

Figure 4: Angle

236 GRAPHICAL Pzncsprlou~ Ameasurements can be extracted by judgments along the horizontal scale.But we also can judge the local rate of change of y as a function of x by. judging the relative values of the slopes of the connecting lines; thej overall visual impression is an increasing trend in the local rate of. change as x increases.Figure 4.6 is a pie—churt. To extract the percentages visually we canmake angle comparisons. Figure 4.7 is a graph that shows threevariables: x, y, and z. The first two, x and y, are shown in the usualway by positions along the two scales, and z is portrayed by the areas ofthe circles. Thus to visually decode the values of z we must make areajudgments.42 Figure 4.8 is a scatterplot of two variables. One aspect of the datathat we can judge from the graph is the relative number of points perunit area in different regions of the plane. To extract this informationwe must judge the visual density of the points; for example, the densityof the point cloud appears to be greater in the middle than in theextremes in the upper right and lower left.iA

Figure 4.6 ANGLE. The values encoded by this pie chart can be decodedby angle judgments.

4

Figure 5: Area

»:··*— 5%<%%%%jiatu he,%%i C 10%¢\` P 110 d t L ~AA»% <<%Q%% t MV ra gy Han O1 @8h y· t- O1- 9 < %%%»% N~ 44A»‘¢AO n Q \,,,,\,: M , I• G t m¢À¢ Gf t I1 1V u Q¢¥¢»`‘i— 8 h 9 11 è S lg Tm 6 S, S. ___;_ _ ! O—_—· 1* a S N ·· » _~t..,~ ! I' t hr 1 MSed Ht Pe G h a 10 Gt n gl-¢`¢% h Ctr PI` E t H (A7- Ta J`¢ ¢»<»;¢Ai! 5A¢¥ Su ues al Op Pla WG fr Id bl MV % §` ?§; 5 `“‘` ~CC (,1 O I1 u () G ! ¢ Ti;Q . M _~~— ,~~—._—» S a 1 1 I ~ MS • O S · s inf ¥~;» » '%»%`—’· ~ 4 ,—ii% 2 1 1 Y 8 q I` a qu . %i—<% [J Y av Of t P l1' P 1 V0 ;¢<¥¢» t ! 3 O er 1I` I6 I ¢$i»<¢ i0 5 C t 5 ! S d 1.1 Q %A¥A% é Q!! O Pe V ‘ In %?AA: i¢>¢> » S I-1 n 1 g C _ O 1 n O (ggC ° 0]* tl 1 I > r >»*-Od 1 n ‘ V L1 CQ <¢;i%`’ b 3 G In SC· V»~ J1. $Q!¤Q H 1r <»¥i [xw ggk ..`, » V 1 V 5 I—.,. . 6 u' G • 8 ~ — :i»i M th Qt- 6 t fa ] I] ` {aw »i$$¢ E 10 I I- V W- ud Ce J `¢»v in 9 () 1 * ;%i »»¢` >n V ,=:~¢>$> ¢ V a 5 u g In a > ì`$¢’»i i S 8]] S- S S ·,J%·i “ Pt n O1 G L ‘ate 3 thyell hu ¢i`»» »go . at OW Q 3¢»r1C h 3 `¢A%ì i%;1< Lu al 11 fg %%%< < <:1 G >$<`__; v Var can W »~ À*;i “»Aj< 1abl W_’`. , .<j<%J·_gr IiJ»$;»; 1i<— y 4 2—`—n.i J T 9 '%%;%i$<‘ % f th in hisAi ° tm ° ca, gra X V 15fd C CIS Ph ALU_ an S 3 Sho % EQ ` be nd a W S* G th· Ve` IOde rd · 9 vd b ¤s ariay Gyeportr bles 2Oa ‘ ave ‘ Tlud d Wnts the re P0-C!| Fta SdreaS.

Figure 6: Density

{;~ ,>; .¤..;..,,_;L._<,A , .,,,. . .=_.J,_ i »_..;_vv»,»A,...L,..v.,;,. A ,.._\_ ..v, .. v,_`»..»“ 238MVarious schemes also exist for using hue to encode a quantitativevariable [1.4]. Color saturation refers to the intensity of a color; forexample, saturation decreases as we go from a deep red to a pink andfinally to a grey. Measurements of a quantitative variable can begraphed by defining a quantitative scale of saturation and thenencoding by saturations whose values are proportional to the values ofW the data [14]. no ‘‘‘jj Drs tanceé The elementary graphical—perception tasks are only one factor that. .... .we must consider in addressing graphical perception. Another factor is¢s. i i‘>.=s `—`*o¤ ° °3 ¤ ° °T <s O¤° t¤ ° O cgo § O S .Q ° Ogé <‘-P iitiLu · % <> ¤ ° o °D o O 8 9; O ¤ O O °° °°°<>?O $ °¤ 0 0 <>as~... ex X VALUEFigure 4.8 DENSITY. The relative number of points per unit area in Ydifferent regions of the graph can be decoded by density judgments.

5

Others

• Volume is used rarely to encode quantitative information in scientific graphs.

• Color hue (i.e., different colors) are often used to encode categorical (not quan-titative) information. Cleveland does not discuss this much (book published in1985).

• Color saturation (i.e., the intensity of the color) can be used to encode quantitativeinformation.

– Be careful about making graphs color-blind friendly

– Realize that many graphs will may appear in color on the screen, but may notbe printed in color

– See http://colorbrewer2.org for more advice

Further, our visual processing system is drawn to make comparisons within ratherthan between groups

• When observations are clustered together, our minds tend to think they belong tothe same group.

• Groups should be formed such that the intended comparisons are within, ratherthan between groups.

Cleveland identified a set of elementary perception tasks and the ordering (below fromeasiest to hardest) of their difficulty.

1. Position along a common scale

2. Position along identical, nonaligned scale

3. Length

4. Angle - Slope

5. Area

6. Volume

7. Color hue, color saturation, density

There are a number of other aspects of graphs that can make apprehension of thegraphs main point more or less difficult.

• Explanation - features of the display were not (sufficiently) explained

• Discrimination - features of the display were hard to distinguish from each otherbecause of the size or configuration.

6

http://colorbrewer2.org

• Construction - mistakes in the numbering/labeling of tick-marks, etc...

• Degraded Image - the reproduction of the image reduced the quality enough tomake it hard to read.

Below are some comparisons we can consider to try to understand how to make graphsappropriately.

Figure 7: Make the Data Stand Out: Wrong

$¥h¢¢ i 0 | *‘,;¥»g CLEAR VIS GN 25s In 1980 two Harvard anthropologists, Melvin Konhef andii ‘ Worthman, put forward a likely solution to the puzzle [80]- Theythat it was the very frequent nursing of infants by thelf mbthersQiiiring the first one to two years of life that produces the long 11‘1ter..interval. The nursing results in the secretion of the hbfmone QIsrelactin into the mother’s blood, which in turn reduces the fU11<!tion_Sthe gonads. This acts as a birth control mechanism.frequency of nursing and other activities of one !Kung women and heri‘ t~ `l aby. The open bars and tall vertical lines are nursing times; the Closedbars show times when the baby is sleeping; F means frettlihg; andslashed lines represent the time held by the mother with ¤1‘1‘0Ws forup and setting down. Al major problem with Figure 23 IS thatthe data do not stand out. It is hard to get a visual summefy sf theextent and variability of each activity and it is difficult to femernbeyWhich symbol goes with which activity, so that constant referring to thelegend is necessary. A minor problem with Figure 2,3 is that the arrowsfor picking up and setting down are superfluous. p0 10 00 00 00 00 00 E ...... p

oaoo » A _ ._n • . . . . . ... I T i n l A {Q‘°°° ’ r .,,. ·' i Vilil*;` *‘ *v*~ ‘ 0._‘; ia ·;·.—.¥ f '1 2 - I 144A-[ ij ,l.;,i._. 7_ _,1 l .l | I ` 0 .:..ist M ..,s *"°° ‘ J 5l Q- » . - - §""“‘“·‘rEEY - g _ ·;<· }'iFl *ili’ ,1600 F , y y l j is ~.t —A MM 7}). ’ ’ ' 0 ‘ · r 1 ~...A T lj I I __ ‘’‘·‘·il·— ~0·e - .0»·~ . 0 c.. . . ,,,. ,Figure 2.3 SUPERFLUITY AND STANDING OUT. The graph Shbws theactivities of a !Kung woman and her baby. The open bars and tell verticallines are nursing times; the closed bars show times when the baby is Ysleeping; F means fretting; and slashed lines are intervals when The baby isheld by the mother, with arrows for picking up and setting down. The datado not stand out on this graph. sFigure republished fromi[80]. Copyright 1980 by the AAAS.

Figure 8: Make the Data Stand Out: Right “¢¢‘—i‘i¢

<;<<$i<<&ii‘ '»i%

» ¤¢¥¢$<‘<$<i M

‘}$¢if$ s—

~—i I

Aiii m

‘<J¥¥ zz.&

E,%

A %

~»>1 I A

); W “ [3 m

_3_J~, V m

JQ`%

i @¥i$j$ Y1-: ‘&“ i%

... Eif)$%

Y E E

!A ¥ G

__i%

%—

iA$<< ;__

< ~ E 3?A

$.=. C O—

.2: ¤¤ |%

<<i Q E V

-· S<i<%

<i <:> [ Oi%

% ir m

Nii; $A

%¥¢ V

G [

» “ y ··gg ,__ E ‘M

q, ,___ XD

E 3 ,.5 5 Wcn ·=· · ~‘= -¤ <>iii > m

m ...

O C. F3 g é § .=,;

<$$<A Z r·· g '-- l

i‘; ·-—· C1 -; ·•-• O

‘gg Q

_ gg .. 2 g E =Q

§ § +:5 § E §‘ O

O~``‘ ’ `- eZ Eg E —

¤ -•-· 2 Q<%

$; V _=_ {*0* ,.; EE Q

Q ,

A __ U

, ¤> ¤Q

E 2 2 2E ;2=» iU

) > C) (O "‘

. _ D- ,-. •-

§ '—¤ Q

ZS ° -»-· ng$ L; U

E E .,, 5

% Lu ¤_ Q

) °...;> Q

-9V

) ° W °’

Q. cp ¤> C-

'¤’ c:·» ·=> 1 0 ~»5 Q

O 4

na Q 0) {

E cu.3; 3 _Q

Q g

% i ng F

§ 3 -,5< < Ù

5 AE Q

_ >

YSy}*R

7

Figure 9: Use Prominent Graphical Elements

`‘., ’i C —.,‘' i é ‘ é : ! ¥ V ‘`‘— 5 '’,` i V ~r~‘ _.r.‘, ` ` a@ii% i%%;% s %>%jA~ i A¤i> %;i; A»ÀA¥ * A<%% %;%~ A¢%% %`%\AJ %><J7¢A» @2 iig i¢<_¢ ‘J% >;i<_ %;%%_% X ; %i%%%A%\i;· _.—,».— i } ' Q ’‘,*‘·ix% ; \%<%% %i=i >;EiAiiA <AW¢ <%—A{»¢;‘&»\§AA{% `%,;< ;<; »ii% »%ii »ii!% A¥:%<> %;<%< %¢1;%Ji> <i\%; i~;% %—;~% »%:%%* ‘ `:—¥ »,¢r ’‘—* Q _—,_ ii i? * & i V »;»` ``—· * ¢~_ { f ``»,~· \—”’— 1 _‘i''»—.“ V ` `“ì$;¢ $i <i ;i% ¥‘A<\%;i''_;',`‘—'` *—.` XT _’'·.` 5 —_`’— Q {` * f `’‘’ 5**' * ( ;‘ I SV° 7 5' /'** ‘*s¢ 396 2 z(1705 S ° ° ‘f`l O 2 0 3 O 4 O 5 O

Nkmll Ohda

*

O A COI t 6 l (2v)WY 0 705c 0 00cL L w

V / • •0¤() 7l() •*•' 0*<>•* 0000

£ 0 0} 8 6RHg e 2 5 \/ISLJAL PRONIIIQEINICE Th 1e at d not t d 0L1t|ish d i iP bY P9""' SS QU fr¤ T!8iw’6 [25] Copwrght © 1983 N1acn M nJ i L n d

CLEAR vision 2Q¤· 7U =0 5 10 1 5 2U 250. 05 § lLD U- s< 0 °” cJ i .t~ 0. 75 ;m ¢“ pi MEALL DHUAR¤- 7¤ =Figure 2.6 visum. pnomiusuos. use visually prominent graphicalelements to show the data. Now the data from Figure 2.5 can ba See"- The&¥ ». X ¥ · ‘ l i ‘symbols that look like the spokes of a wheel represent multip 6 P0 "*tS· eachspoke is one observation.

Figure 10: Do Not Obscure Points

i% A$A! . »Y lGRAPH CONSTRUCTIONWhen plotting symbols are connected by lines, the symbols shouldbe prominent enough to prevent being obscured by the lines. InFigure 2.7 the data and their standard errors are inconspicuous, in partbecause of the connecting lines [17]. In Figure 2.8 visually prominent rfilled circles show the data. These large, bold plotting symbols makethe data amply visible and ensure that the connecting of one datum to. . . .the next by a straight line does not obscure the data. The connection isuseful since it helps us to track visually the movement of the valuesthrough time.The data in Figure 2.8 are from observations of nesting sites of baldeagles in northwestern Ontario [56]. The graph shows good news:After the ban on the use of DDT, the average number of young per sitebe an increasin . Ié No [-G’m <s> Ii <> zolj. ZA 1 (3) Na · [5 O 0 (3)[ Z Q TOO[ L 3 = W icqgumxggI - ¤*2 ° °° A 142 mM Na j »2 ¤•§ g » ca)9 20A J § O I(3> V" 0 30 60 so 120 150 M 180E "° 0.05 mM Ca5 ·20 10*5M ACS , »E _30 _ 142 mM Na (6)§ —¤¤ <1¤> Tg -70 "°’ —s10 20 so 40 50 60 70 so soTime (minutes)Figure 2.7 VISUAL PROMINENCE. The data on this graph do not standout because the graphical elements showing the observations and theirQi? standard errors are not prominent enough to prevent being obscured by theconnectnng Innes.Figure republished from [17]. Copyright 1983 by the AAAS.

_ c1.EAn vnsrou 31a pair ar scare lines rar each variable. Make the data region theinterior of the rectangle formed by the scale lines. Put tick markseutside of the data region.Data are frequently obscured by graphing them on top of scale gOne example is Figure 2.9 where points are graphed on top ofvertical scale line. The graph and data of Figure 2.9 are from an¤.< 1· 2"‘ill ta.. .< ·—·» r cn§¤¤ L Ug¤—t .` L ui¤¤E “~n. 4ieee isvo 1975 IQBUYEARFigure 2.e vnsuiu. Pnommsuca. The plotting symb0|S on this graph areprominent enough to prevent being obscured by the connectmg lines,

8

Figure 11: Two Scale Lines, Reference Line, Ticks Outside

V _-_ » CharlesLyman, Regina O’Brien, [89].In the experiment, the Turkishhamsters (Mesocricetus bmndti) percentages that`° • e i ~ mv J __ _ , ‘ `the hamsters spent hibernatrng. The goal of the experiment was to~» determine whether there is an association between the amount ofhibernation and the length of life; the hypothesis is that increasedhibernation causes increased life. Hamsters were chosen for the_ experiment since they can be raised in the laboratory and since theyhibernate fer long periods when exposed to the cold. Certain speciesofbats also hibernate for long periods in the cold but, as the experimentersPut it, "their lang life-span challenges the middle—aged investigator tosee the end of the experiment."iThe graph 111 F1gu1‘! 2.9 suggests that hibernation and lifetime areassociated; while this does not prove causality it does support thehypothesis. The graph also shows one deviant hamster that spent aIisoo · .` ‘ . · · `1200 ° ° _ . _ 'V ~ ° ° • ; ’ · • • ° ° ° °gs., p • • gl‘¤ · . Z · •$ 000 ` , · , , Z . . .400 ·<i

O a is 24 :-:2` Percentage of hibernationFigure 2.9 SCALE LINES AND THE DATA REGION. The data for zero.e hibernation are obscured by the left vertical scale line.ppg Figure republished from [89]. Copyright 1981 by the AAAS.

CLEAR visionlarge fraction of its life hibernating but nevertheless died at a YOLIUSage. Hibernation cannot save a hamster from all of the perils of lifeOne unfortunate aspect of Figure 2.9 is that the data for h&mSt0I`Swith zero hibernation are graphed on top of the vertical scale line. Thl-S; ! obscures the data to the point where it is hard to perceive just h0W V.many points there are. No data should be so obscured. One way toavoid this is shown in Figure 2.1OI The data region ——— the place where ithe symbols representing the data are allowed to be - is inthe inf!1‘101òf the rectangle formed by the scale lines. Now the values with Z!1‘0hibernation can be seen clearly.has ¢ °8 °° 0 8°¤ O0 2 ¤ ¤Z 0 0 0< · ¤ ° g ¤ ° ¤ O 8 ¤V Z 8 O 0F 2 0< 1 O O 0 B2 8 °P B 0 § 0 0 O 0 0 0 0 0 lel»r 0 O 0500 9 O aI 0 O O3 0‘ttt ° °Z Q2 ~‘l¤¢ U ‘ »t~ ii `0 10 20 BUPeracenr UF LIFETIME im HiseslwriomFigure 2.10 SCALE LINES AND Tl-le DATA maelom. use a pair or scalelines for each variable. Make the data region the interior of the rectangle?. . . . ·formed by the scale lmes. Put tick marks outside of the data region. Theformat prevents data from being obscured. Using two scale lines for eaGh ofthe two variables on this graph, instead of the more usual one, a lows easrerjudgment of the values of data on the top or on the right of the graph.

Figure 12: Do Not Clutter Data Region

` · `,_`_'the line; similarly, when there is just one horizontal scale line. thehorizontal scale values of data at the top aff? h·¤1`d!1` to 858658 than those ,the bottom. By using four scale lines, the graph treats the data in a Imore nearly equitable fashion.The four scale lines also provide a clearly defined region where Oureyes can search for data. With just tW0, data can be camouflaged b)'virtue of where they lie. This is trite for the data 1n Figure 2.12 [139]; itis easy to overlook the three points hidden in the upper left corner. InFigure 2.13 the graph has four scale lines and lihé ii'11‘!! P0intS 61`! 1n<>1`!prominent.Making the data region the entire interior of the rectangle formedby the scale lines means the plotting symbni ini ¤ data Pninlf !<>¤i<i justtouch a scale line. But just touching has inf? Pfiieniiai to !¤m0ufin8!‘points. So it is probably best to interpret "interior" as a rectangleslightly 1ns1de the scale line rectangle.D0 not clutter the data region. IAnother way to obscure data is to graph too much. It is alwaystempting to show everything that comes t0 mind Gn 8 Single g1‘¤Pn» butgraphing too much can result in less being seen and understood. Thiss b .·— 100 -‘ . · . · .iiio 0 I I I0 1 . 2 3 4 5Figure 2.12 SCALE LINES AND THE DATA FiEGlON. The three points anthe upper left are cemouflaged. iFigure republished from [139]. Copyright 1980 by the AAAS-``—” *

36 PRINCIPLES OF GRAPH CONSTRUCTION _is illustrated in Figure 2.14 [122]. The data are particle counts from anexciting scientific exploration: the passage of the Pioneer II spacecraftby Saturn. Inside the data region we have reference lines, a label,arrows, a key, symbols showing the data, tick marks, error bars, andsmooth curves. The graph is cluttered, with the result that it is hard tovisually disentangle what is graphed. It is unfortunate to have any of. these valuable data obscured.

1 • ·

P A. U ‘ •· •T ° ' · ° .n V 95 • ••i- • • .‘{LQcn A··· Q2 ° ·Z •Q: °_<-> B8 ° '0 1 2 2 4 5TIME <sEcoN¤s>ppi‘ Figure 2.13 SCALE LINES AND THE DATA REGION. The four scale linesprovide a clearly defined region for our eyes to look for data. Now, none ofthe data from Figure 2.12 are in danger of being overlooked.

9

Figure 13: Do Not Clutter Data Region II

cu-:An vnsiou 37The data are shown again in Figure 2.15. The clutter in the dataregion has been alleviated, in part, by removing the error bars. Itwould be prudent to convey accuracy for these data numerically ratherthan graphically; on a log scale the error bars decrease radically anddisappear from sight as the counts increase. (lt is possible that accuracyrs nearly constant on a scale of (counts/sec)é since count data of thissort tend to have a Poisson distribution. Thus accuracy might bes t g conveyed more readily on the square root scale rather than on the logscale.) Other removals have taken place. The plethora of tick marks onthe vertical scale has been reduced, as well as the number of tick marklabels on the top horizontal scale line. Also, the top horizontal scaleline is labeled in Figure 2.15, but not in Figure 2.14.sooo . i 1.».t V * ·{*****651 · to- s.: Mov Pnorous_ l ¤ 5.5-2i Mov moronsai‘ l l ¤ 5.6-2i Movmuc ALPHASloo l { l 1i‘“.~ = l /l* *2.s . ; 1 \ l l8 O l \ = l;i. S *·· l x° Ò i l l n0-l ‘ ll é lo.o» >— *nsoo easo sooo » mosapr. n, mrsFigure 2.14 CLUTTER. This gra his cluttered. The result is that different»..s. . . .p».,. graphical elements in the data region obscure one another.Figure republished from [122]. Copyright 1980 by the AAAS. A

38 PRINCIPLES OF GRAPH CQNSTRUCTIONThe clutter 1n the data reg1on also has been reduced by some/ alterations. The key is outside the data region, the label for rings isoutside the data region, the arrows Stwwing. values below 0.01Q A counts! sec in Figure 2.14 have been replaced by a separate panel, andV the wandering curves have been replaced by straight lines connectingI successive data points. These changes have reduced interferencebetween different elements of the graph and thus have reduced the. clutter.e Figure 2.16 [81] is also cluttered; the error bars interfere with oneF another so much that it is hard to see the values they portray. Onei • 1.e-5. 1 Mev morons .Y 5 ¤ 5. 5-21 Mev morons. ¤> 5. 5-21 MQV/NUC ALPHA5I DISTANCE -<5Arn.1RN moi 1>2. 5 5. o 5. 5 4. oRmcs Jmus Mm/ts encetxuus IM E ° ~· E E EG ¤ r ¤ ¢¤ ¤ ¤ · - ¥H D E , ? ? ~ I ’ »I ¤ ¤ ¤ ‘ " ¤1500 1555 15oo 155nNMEFigure 2.15 CLUTTER. Do not clutter the data region. The clutter ofA Figure 2.14 has been removed by alteration and excision. For example, thenumber of tick marks has been reduced. .

Figure 14: Consider Juxtaposing Instead of Superposing

i¥¥CLEAR vision 39solution is shown in Figure 2..17. In the top three panels the three datasets are juxtaposed and in the bottom panel they are Supe1‘p0S!d,· butwithout the error bars. The juxtaposition allows us to see clearly eachset of data and its error bars; the superposition allows us to compare thethree sets of data more effectively.D0 not overdc the number of tick marks.A large number of tick marks is usually superfluous. From 3 to 10tick marks are generally sufficient; this is just enough to give a broadsense of the measurement scale. Copious tick marks date back to a timewhen numerical values were communicated on graphs more than theyare today. In our high—tech age we have photocopies of tables,computer tapes, disk packs, and telecommunications networks totransfer data. Every aspect of a graph should serve an importantpurpose. Any superfluous aspects, such as unneeded tick m&1‘l<S, Shfmldbe eliminated to decrease visual clutter and thus increase the v1sualprominence of the most important element -—- the data.Figure 2.18 [113] has too many tick marks. The fiuéd Cl-1`<?1!S 5hOWthe number of bits of information (horizontal scale) 1¤ *fh! DNA- efvarious species when they emerged and the time of their emergence(vertical scale). The open circles show, in the same We)', UW b1tS efE 10 Q Control liner t; » F · r , Unselec tedli..t E . z8 4 / “ ·· —Oo 2 4 6 ( s 10 12 *4GeneraliwsFigure 2.16 oi.urrEi=i. This graph is also cluttered.Figure republished from [81]. Copyright 1980 by the AAAS-

°i __%.%M ~ ,A A AA k _V;% A <,%_ A yV>A 4..,A, _ AA A A A»~trt A. A __ A AA AAAAA _ A. M5 és U F¤‘RlN®lF*l...E$ (DF GRAPH CONSTRUCTIONs g iiU 4 8 1 2 2> AA A 15 .s ·AA%A ·AA 8 _{ ‘ i A EL . U . . I ...............................................................................................E ` It ‘ e 15 f ·gr :CUNTRD|.Yr » A I¤<AA ` > D ..; ................................................................................................1 FEDU . .> .......... , .... . .......... . ................................ . ...· . ·.·. . .·.·- · .·.·-·.-··· · ·I 5 ;- BLUEA—»' 1: .B — ·A REDgg; 1D ,..................A.......A.....................................A.AA.A........................0 4 8 12A Flgure 2.17 CLUTTER. The clutter of Flgure 2.16 has been ellmmated by. graphmg the data on juxtaposed panels. The bottom panel as mcluded sos A that the values of the three data sets can be more effectlvely compared.~A i agn-

10

Figure 15: Use (but don’t overuse) Reference Lines

CLEAR vision 43etephen Jay Gcuicts anaiysis of ti-ism [54], srs discussed in detail ingestion 4 of Chapter 3.) The vertical reference lines, which show timesef price increases, cross the entire graph and let us see what h8PP!n!dte weight exactly at the times of the price increases. Except for thecahange from 30¢ to 35¢, all price increases were accompanied b}' 8 SIZEincrease. 2 Aiett_ In Figure 2.21 only a marker is used to show the time of the firstcardiovascular care unit since the high precision of a reference line 1Snot needed. We can see the position clearly enough to perceive that 2Scimewhere after that point, the death rate for edrdicvascular d1S!8S!`spi‘ decreased more rapidly; a reference line is avoided since we want, 85always, to reduce the visual burden in the data region.ft,t 9 2 2 2 Q 2.‘`2 < I : : 1 ;CD F ? F 2 ?t. JM]as vo 75 so 85remiht , Figure 2.20 REFERENCE LINES. Use a reference line when the/'9 is animportant value that must be seen across the entire graph, but do NOT [Gt theline interfere with the data. The weight of the Hershey Bar is Qmphedagainst time. The vertical reference lines divide time up into price eD0¤h$$prices are shown just below the top vertical scale. The preciSi0¤ Gl thereference lines is needed to show us exactly where the price i¤¤"!&S°Soccur. ` p T

i ll data labels in the data region to interfere with the_ rlr data or to clutter the graph.Figure 2.22 shows the relationship between the average number of Ybad teeth in 11 and 12 year old children and the per capita sugarconsumption per year for 18 countries and the state of Hawaii [101].y When it is important to convey the names for the individual values of adata set, data labels in the data region are generally unavoidable. In sodoing we should attempt to reduce the visual prominence of the labelsso that they interfere as little as possible with our ability to assess theoverall pattern of the quantitative data. This has been done inj Figure 2.22 through the use of several methods: the plotting symbol isY; visually very different from the letters of the labels, the letters of thelabels are small, and when possible a label has been placed outside ofthe region formed by the point cloud rather than inside.• CARDIUVASCULAR r;AR¤1ovAsr;ul_Ar2?§ ¥ O OTHER ¤ARE UNIT5:;) U ... .....,.................. _ ...........................................................................................................%§g¤ Q;!!E::$ C) JE - i in .é%!i R -2¤195O isso 1970 issnFigure 2.21 REFERENCE LINES. Only a marker is used to show the timeof the first cardiovascular care unit since the high precision of a referenceline is not needed. l t

Figure 16: Labels if Necessary, but Don’t Obscure Data

. graph consists of Clitferellt data sets Wh!I'! each data set 1139 a name thatL we want to convey. This is illustrated in Figure 2.24, which shows lifeexpectancies for four groups of people: black fema1!S, lI>1aC1< m31!S, 0l white females and white males [129, p. 71]. Four data labels 111 the datacluttering the data region.Sometimes a key with the data labels 18 I1!!C1!Cl to 1!1!I1t11:}’ data $!t$»either because data labels in the data region would add too much clutteror because the values for each data set cannot be identified Wit1‘10`L1tusing different plOtt1I1g Sy1'I1bO1S fO17 t1`l! C11H!I`!U1T data Sets- A k!Y isused in Figure 2.25 for both reasons. On this graph the data labels arej long and the data region is already host to many things. Fl-11'th!I`m0I'!» akey is needed because there is no other convenient Way to allowidentification of the values below -2 logw (counts / sec), which areshown at the bottom of the graph.10,000 . L att. , .5,000 Dolphin`; Elcplwni •} /°Chimpanzee , aw lf° L")" _•100 Baboon 7,* 2 Brachiosaurus •gc 5g · Saurornithoid Ostrich , Diplodocus • ·g :: ·ia °"' All' t • Stegosa s. g 10.0 .cmw ‘g"‘ °' ` “"“È 5.0 _· •Opossum ~.s “ . B,.t ·!<><=l=¤<>¤¤rhE 1.0 _ • Vampire bat•1s 1_ 0*5 Mole • Goldfish !0.1 OHummingbird0.01 , » ‘—‘~ · A · A0.001 0.01 0.1 1 10 100 1,000 10,000 100,000Body mass in kilograms IFigure 2.23 DATA LABELS. The data labels interfere with ourassessment of the overall pattern of the quantitative data.Figure republished from The Dragons of Eden: Speculations on the Evolution of HumanIntelligence, by Carl Sagan, p. 39. Copyright © 1977 by Carl Sagan. Reprinted bypermission of Random House, Inc.

%%<%rr`r j *AAWL%;<%i i »

%<~i%V

»`’¥

A %A%<% »> < JI``V,%``“?jV` `<,“V<1V# —;~—

i%A%i Fe ” <»*:% % _“*§A ° 5-, 2 sN rT ea

A&i<% QI Ai—i»‘`$ ¢ i QH) ¢ A & Mi a 4,<»»!<i% [ ;I 'l`’ '< ;% uu- ; E e_\ WS _,_, _& :E 2 -\%<, ; S4 z<<%i < 1»%¥¢ ·AA -¥¥:— Y r%%;A»\%_ ; · ;V -$< i$ , I ’Yihi Y It ' CH S t S ISS r 2 ` 2 A O Enl rt I ,;LA AN @ LA is Su R· {A ND @ Y ND aiO ¤ ’ ¤ E 6 ‘ IP G T C 3 ii e C r$,ng»A V LE U VZ i S\ I l O iOVSQBI G N H EA fhl ILY» *r>)tL‘R2 N Sr rv y\ Blthéog C GA GE @ mal Zlis 4I,;¤¤-C O. RY RM QSA auou 5U · . Y »A S Z R O i 1M % ’ " Y% ny E ·· U- ER/VH Ot}’¤,,,1s*, Q K T? A, · Q¤ 5 6E 9*1 / At|‘V? s " 'ja a .’ · Zek} ~yn; aio 4gtl I n

Figure 17: Overlapping Symbols Must Be Distinguishable

i<<PRINCIPLES OF GRAPH CONSTRUCTION i» gShOI‘tCO1‘Il1I‘1gS 111 gIàPhS 1Il SC1E1'\CE and tEChI`1010gy WaS POOI V1Sl1a1d1SCI'1I'I11I1aII1011 of ThE d1ffEI'EI'IlC data SETS OH g1'aPhS !IDP10y1HgIn F1gure 2.29 [95] lt 1S d1ff1Cu1t to VISUHIIY chsentangle the sohdSql1aI‘ES, C1I`C1ES, and tI'1aI1g1ES, SLIC P O 1Ilg Sym O S a1È 111 gE11EI'av1sua11y s1m11ar, but 1n F1gure 2.29 the problem 1S exacerbated by the. symbols not be1ng cr1sp1y drawn. In F1gure 2.30 [50] the chfferent1200 Free °h °' ne1 000· 3; 800··— I Ch lo ra ml n eS‘*"° . ‘»I “ ;.A U nc h lo n n a ted200 .—· A .Ì..0 0.5 1 .0 1 .5 1.8if . I‘‘»gg Volume of water (laters)1..1 .1-1IFngure 2.28 OVERLAPPING PLOTTING SYMBOLS. Overlappmg plottlng; ; symbols must be wsually dlstlngulshable. On thas graph, because of exact— . · »2ix& eww,and near overlap, some of the data cannot be seen. Methods for combattmgoverlap are glven nn Chapter 3..3 Flgure repubhshed from [23]. Copyrlght 1980 by the AAAS.

11

Figure 18: Superposed Datasets Must be Distinguished

%ìJ%~ A<$‘i A><!—_~¢/A*<%‘—__.A%%_A%$;`—¢¢¢$$0 es¥)A¢AA I a%’~¥ A I',%,%A% nd 3 hm ’i ~i A u 3 P1 fd1 rw 8 tA 1 OEli Spy Yg H t iI) i ‘]] a1 % u i a ny c• 1 bl a W· ! ll 1 u d P O6 0 ; I q w n 1 Ss 'an Ch Il O · AA 3 11 In A AAAl Q I- QtI· a 65 -t Q E2 0 • ed 9 Y 1¤ Co %· h 10A{i 9 3 1A A ~ ¤¤ lO0.A8 •· 5 •

A~A ¤ ¢A • 1 { •• 2 0` A • 0¥A0 • L04 •A ~ A ;c2 O0A AAA A ;

D ° • C)OA¢<‘¥A A F-u8d ° 2 A °Ievg a IW lJ XS u U d j ‘ E 1 •~iA 2 p'e dA % n ’im S T-r 'S g ol q E 'mA 20P Ti i <) a t 9b · n e rw Ad f th‘ S h& I1 S OM a e Em § m 3 O S 15 CI th ‘I e 8t Gf eYr; u | al Va 0Qh r n d Si $O b nd ih mi r0b taY tr' IS na Ig SM 'm ¤ M ms en• •c r h 6n D t d 'rlé¢

12

1.2 Advice

Plotting Symbols:

• Use open plotting symbols (rather than filled in)

Legends:

• Legends should be comprehensive and informative - should make the graph standalone.

• Legends should:

– Describe everything that is graphed

– Draw attention to important aspects

– Describe conclusions.

Readers should not be left “guessing” about what the graphical elements mean.

Error Bars:

Error bars (or regions) should be clearly labeled so as to convey their meaning. Com-mon error bars are:

• ± 1 SD.

• ± 1 SE

• 95% Confidence Interval.

Axes (scales):

• Choose tick marks to include all (or nearly all) of the range of the data

• Choose scales so that the data fill up as much of the data region as possible

• Sometimes two different scales can be more informative

• Axes should be labeled (including units where applicable).

– Log and level scales (bottom and top scale lines)

– Differnet operationalization (e.g., age and year of birth).

• Use appropriate (i.e., the same) scales when comparing.

• Do not feel obliged to force scale lines to include zero.

• Use data transformations (e.g., logs) to improve resolution.

13

Aspect Ratio:

Cleveland found through experimentation, that 45° was the angle at which we canbest discern differences in slope. Thus, he developed a technique called “banking to 45°”that finds the optimal aspect ratio for graphs.

• If you’re using lattice, then using the option aspect= 'xy' will do the bankingfor you.

• If you’re using some different graphing tool (like R’s traditional graphics engine),then setting the aspect ratio to one will come close enough, usually.

2 Graphics Philosophies

R has four different graphics systems - base, grid, lattice and ggplot2, all with differentphilosophies about how graphs are to be made.

base The base system works like painting - you can sequentially add elements to a graph,but it is quite hard to take elements away (in fact, it is impossible). Layers can beadded until the graph conveys just what the user wants.

lattice The lattice system is based on Cleveland’s Trellis system developed at Bell Labsand is built on top of the grid graphics system. These are particularly good forgrouped data where multi-panel displays are needed/desired. These operate morelike setting up a track of dominoes when you were a kid. You line them all up andthen knock the first one down and all the others fall as you’ve arranged them. Ifyou messed up the arrangement, it isn’t going to be as cool/pretty/interesting asyou thought. Lattice graphs work the same way. All elements of the graph mustbe specified in a single call to the graphics command and if you don’t do it right,it will not be as cool/pretty/interesting as you want.

ggplot2 Hadley Wickham wrote ggplot2 and describes it in his 2009 book as an imple-mentation of Leeland Wilkinson’s “Grammar of graphics”. This builds a compre-hensive grammar- (read model-) based system for generating graphs.

grid The grid system was written by Paul Murrell and is described in his 2006 book,which is now in its second edition. This system, while very flexible, is not somethingthat users often interact with. We are much better off interacting with grid throughlattice.

The base graphics sytem, lattice and grid are all downloaded automatically when youdownload R. ggplot2 must be downloaded separately and all but the base system mustbe loaded when you want to use them with the library() command.

As you can imagine, these different systems offer quite different solutions to creatinghigh quality graphics. Depending on what exactly you’re trying to do, some things aremore difficult, or impossible, in one of these systems, but not in the other. We willcertainly see examples of this as the course progresses. I have had a change of heart

14

about these various systems recently. Originally, I did everything I possibly could inthe traditional graphics system and then only later moved to the lattice system. Thisprobably makes sense as the lattice system requires considerably more programming, butdoes lots more neat stuff. I still don’t use ggplot2 much, but recognize that it can alsodo quite powerful things.

We are going to spend our time today talking about the traditional graphics system.We will spend some time later talking about what lattice graphs can do and why wemight want to use them, but for now, it’s just the traditional graphics.

3 The Plot Function

For the most part, in the traditional graphics system, graphs are initially made with theplot function, though there are others, too. Then additional elements can be added asyou see fit.

3.1 getting familiar with the function

Let’s take a look at the help file for the plot command.

x: the coordinates of points in the plot. Alternatively, a

single plotting structure, function or _any R object with a

'plot' method_ can be provided.

y: the y coordinates of points in the plot, _optional_ if 'x' is

an appropriate structure.

...: Arguments to be passed to methods, such as graphical

parameters (see 'par'). Many methods will accept the

following arguments:

'type' what type of plot should be drawn. Possible types are

* '"p"' for *p*oints,

* '"l"' for *l*ines,

* '"b"' for *b*oth,

* '"c"' for the lines part alone of '"b"',

* '"o"' for both e*o*verplottedı,

* '"h"' for e*h*istogramı like (or ehigh-densityı)

vertical lines,

* '"s"' for stair *s*teps,

15

* '"S"' for other *s*teps, see eDetailsı below,

* '"n"' for no plotting.

All other 'type's give a warning or an error; using, e.g.,

'type = "punkte"' being equivalent to 'type = "p"' for S

compatibility.

'main' an overall title for the plot: see 'title'.

'sub' a sub title for the plot: see 'title'.

'xlab' a title for the x axis: see 'title'.

'ylab' a title for the y axis: see 'title'.

'asp' the y/x aspect ratio, see 'plot.window'.

Now, we can load the Duncan data again and see how the plotting function works. Thetwo following commands produce the same output within the plotting region, but havedifferent axis labels.

library(car)

plot(prestige ~ income, data=Duncan)

16

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0

income

pres

tige

plot(Duncan$income, Duncan$prestige)

There are a couple of interesting features here.

• You can either specify the plot with a formula as the first argument, y ~ x orwith x,y as the first two arguments. The data argument only exists when youuse the former method. Using the latter method, you have to provide the data indataset$variable format, unless the data are attached.

• Whatever names you have for x and y will be printed as the x- and y-labels. Thesecan be controlled with the xlab and ylab commands.

• The plotting symbols, by default are open circles. This can also be controlled withthe pch option (more on this later) .

17

Figure 19: Scatterplot of Income and Prestige from the Duncan data

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0

income

pres

tige

3.2 Default Plotting Methods

In R, there are default methods for plotting all types of variables. By default method Isimply mean that R looks at the context in which you’re asking for a graph and thenmakes what it thinks is a reasonable graph given the different types of data. All of thefigures in 20 were called simply by using the plot command. R figured out by the typesof variables being used what plot was most appropriate.

# univariate scatterplot

plot(Duncan$income)

# histogram

plot(Duncan$type)

# bivariate scatterplot

plot(income ~ education, data=Duncan)

# two factors mosaic-plot

Duncan$inc.cat <- cut(Duncan$income, 3)

par(las=2, mar=c(5,7,2,3))

plot(inc.cat ~ type, data=Duncan, xlab="", ylab="")

18

# boxplot

plot(income ~ type, data= Duncan, xlab="")

You try it

1. Load the R data file strikes_small.rda that was inyour zip file.

2. Look through the variables by using str andsearchVarLabels(strikes, '') (make sure theDAMisc package is loaded to use searchVarLabels).Using the data, make examples of all the plots above.

19

Figure 20: Default Plotting Methods

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

● ●

●

●

●

●

●

●

●

0 10 20 30 40

2040

6080

Index

Dun

can$

inco

me

(a) One Numeric - scatterplot (withindex as the x-axis)

bc prof wc

05

1015

20

(b) One factor - histogram

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

● ●

●

●

●

●

●

●

●

20 40 60 80 100

2040

6080

education

inco

me

(c) Two Numeric Variables - scatterplot

bc

prof wc

(6.93,31.7]

(31.7,56.3]

(56.3,81.1]

0.0

0.2

0.4

0.6

0.8

1.0

(d) Two Factors - mosaic plot

●

bc prof wc

2040

6080

inco

me

(e) One Numeric, One Factor - boxplot

20

3.3 Controlling the Plotting Region

There are a number of commands we can use to control the size and features of theplotting region.

• We can control the limits of the x- and y-axis with xlim and ylim, respectively.Here, the limits are specified with a vector of indicating the desired minimum andmaximum value of the axis.

plot(prestige ~ income, data=Duncan, xlim=c(0,100))

3.4 Example of Building a Scatterplot

In R, you can open a plotting window and set its dimensions without actually making thepoints appear in the space. You can do this by specifying type='n'. You can also removethe axes from the space, by issuing the command axes=F. Finally, you could remove anyaxis labels by adding the commands xlab='', ylab=''. The command, then, wouldlook something like this:

plot(prestige ~ income, data=Duncan, type="n", xlab='', ylab='', axes=F)

This will open a graphics window that is completely blank. One reason that you maywant to do this is to be able to control, more precisely, the elements of the graph. Let’stalk about adding some elements back in.

We can add the points by using the points command. You can see the arguments tothe points command by typing help(points). The first two arguments to the commandare x and y. You can also change the plotting symbol and the color.

• Plotting Symbols are governed by the pch argument. Below is a description of somecommon plotting symbols.

●1 2 3 4 5 6 7 8 9

●10 11 12

●13 14 15

●16 17 18

●19

●

20

• Color - Colors are controlled by the col command. (see help for rainbow, colors).The colors can be specified by col = 'black' or col = 'red'. Colors can alsobe specified with rgb() for red-green-blue saturation with optional transparencyparameter.

21

• Lines - there are six different line-types which should be specified with the corre-sponding number:

1 2 3 4 5 6

• Line-width, controlled with lwd defaults to 1, so numbers bigger than 1 are thickerthan the default and numbers smaller than 1 are thinner.

• Character Expansion is controlled by cex. This controls the size of the plottingsymbols. The default is 1. Numbers in the range (0,1) make plotting symbolssmaller than the default and values > 1 make the plotting symbols bigger thanthey would be otherwise.

There are lots of interesting things we can do with these. Let’s start by changing theplotting character and color of points.

plot(prestige ~ income, data=Duncan, pch=16, col="blue",

xlab='% males earnings > $3500/year',ylab='% of NORC raters indicating profession as good or excellent',main = 'Prestige versus Income\n (Duncan Data)')

22

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0

Prestige versus Income (Duncan Data)

% males earnings > $3500/year

% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent

Above, we simply plotted the points, so only two variables are involved here. What ifwe wanted to include a third variable? We could include information relating to occupa-tion type by coloring the points different for different types of occupations and perhapsusing different plotting symbols. To do this, we would have to specify the pch argumentdifferently, but could use the rest of the commands we specified above to make the plot.

plot(prestige ~ income, data=Duncan, pch=as.numeric(Duncan$type),


23

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent

Basically, what you are doing is you are specifying the plotting character for each point.We could also do this with colors.

cols <- c("blue", "black", "red")


col = cols[as.numeric(Duncan$type)],


24

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent

You’re doing this by allowing our vector of the three colors to be indexed by theoccupation type. Nifty, right!?

You try it

Using the strikes_small.rda object, do the following:1. Plot the log of strike volume (+1) against one of the

quantitative variables in the dataset.

2. Make a factor variable out of sdlab_rep such thatthere are three evenly-sized groups. Change the plot-ting symbols in the graph above based on the values ofthis variable.

3. Make the colors of the plotting symbols above a func-tion of the factor you created in the last step.

You can also encode quantitative information in color with colorRampPalette.

myRamp <- colorRampPalette(c("gray75", "gray25"))

cols <- myRamp(length(unique(Duncan$education)))

pchs <- c(15,16,17)

plot(prestige ~ income, data=Duncan,

pch=pchs[as.numeric(Duncan$type)],

col = cols[order(Duncan$education)],


25

●●

●

●

●

●●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent

Next, we can talk about adding elements to the plots.

3.4.1 Adding a Legend

We know what the points mean, but we can’t very well expect other people to know thisunless we tell them. There is a function called legend that allows us to make a legendand stick it in the plot. The legend command has a number of arguments that shouldbe specified.

• The first argument is the location of the plot. This can be specified in a number ofways. If you provide x and y coordinate values (with x=# and y=#), then this givesthe coordinates for the top-left corner of the box containing the legend informa-tion. Otherwise, you can specify the location with topleft, top, topright, right,bottomright, bottom, bottomleft, left and R will put the legend near the edgeof the plot in this position.

• The legend argument gives the text you want to be displayed for each point/lineyou’re describing.

• The point symbol and color are given by a vector to pch and col.

• If instead of points, you have lines, you can give the argument lty with the differentline types being used (more on this later).

• inset gives the fraction of the plot region between the edges of the legend and thebox around the plotting region (default is 0).

We can include a legend in our plot as follows:

26

cols <- c("blue", "black", "red")




legend("bottomright", c("Blue-collar", "Professional", "White-collar"),

pch=1:3, col=cols, inset=.01)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent

● Blue−collarProfessionalWhite−collar

You try it

Using the strikes_small.rda object, do the following:1. Add a legend to the plot you made in the previous

exercise

3.4.2 Adding a Regression Line

We can add a regression line to the previous graph with abline() which adds a line ofspecified slope and intercept to the plot.






27

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent


If we wanted to add a different regression line for each occupation type, we can do thatwith three different calls to abline().






abline(lm(prestige ~ income, data=Duncan,

subset=Duncan$type == "bc"), col=cols[1])


subset=Duncan$type == "prof"), col=cols[2])


subset=Duncan$type == "wc"), col=cols[3])

28

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent


3.4.3 Identifying Points in the Plot

Points in the plot can be identified with the function identify. The identify will printa label next to points that you click on giving an indication of which point exactly isplotted. The important arguments are:

• Location is given by an x and y variable. These should be the same x and y variablesyou used to make the plot.

• The labels, given with the label option will provide R with text to print by eachidentified point.

• n gives the number of points we want to identify.

Let’s try to identify the two most outlying points on our graph.

identify(x=Duncan$income, y=Duncan$prestige,

labels=rownames(Duncan), n=2)

You can see here that the minister has more prestige on average than other jobs withsimilar percentage of males making over $3500 and that the conductor (railroad) has lessprestige than its income would suggest.

29

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

20 40 60 80

020

4060

8010

0



% o

f NO

RC

rat

ers

indi

catin

g pr

ofes

sion

as

good

or

exce

llent


minister

conductor

You don’t have to set n to any number in particular, just remember to shut off theidentifying if you don’t set it. On a mac, just right-clicking in the plot will turn off theidentifying function. On Windows, you can right-click in the plot, then a dialog will popup asking if you want to stop or continue. Not surprisingly, if you want to stop identifying,just click “stop”. There is another function called locator() which will simply returnthe (x,y) coordinate of the point clicked in the plotting region.

You try it

Using the strikes_small.rda object, do the following:1. Identify points in the previous plot you made by coun-

try and year. Note, you will have to make a variablethat is country and year withpaste(strikes$abb, strikes$year, sep=:”)”

3.5 Other Plots

There are a few other plots that are both common and useful. I will talk about histograms,bar plots, density estimates and dot plots. Histograms can be done with the hist()

command:

hist(Duncan$prestige, nclass=10, xlab="Prestige", main="")

30

Prestige

Fre

quen

cy

0 20 40 60 80 100

02

46

8

If you wanted to impose a kernel density estimate over the bars, you could use lines()

with the density() command.

hist(Duncan$prestige, nclass=10, xlab="Prestige",

main="", freq=F, col="gray80")

lines(density(Duncan$prestige), col="red", lwd=1.5)

Prestige

Den

sity

0 20 40 60 80 100

0.00

00.

005

0.01

00.

015

0.02

0

31

You try it

Using the strikes_small.rda object, do the following:1. Make a histogram of strike volume and impose a kernel

density estimate.

Bar plots can be made first by aggregating a continuous variable over the values ofsome factor, then using barplot.

bp1 <- by(Duncan$prestige, list(Duncan$type), mean)

par(mar=c(4,10,2,2))

barplot(bp1, horiz=T, las=1)

bc

prof

wc

0 20 40 60 80

A dot-plot, often showing a set of estimates and confidence bounds arrayed relativeto a categorical variable, is relatively common these days. Let’s do this by first makingthe confidence interval of repression for each region. Then we can make the dot-plot.

library(gmodels)

ag <- aggregate(Duncan$prestige, list(Duncan$type), ci)

ag <- cbind(region = ag[,1], as.data.frame(ag$x))

names(ag) <- c("type", "est", "lower", "upper", "se")

ag

## type est lower upper se

## 1 bc 22.76190 14.54327 30.98054 3.939969

## 2 prof 80.44444 73.42991 87.45898 3.324717

## 3 wc 36.66667 24.29104 49.04230 4.814330

32

Now, we can make a dotplot:

ag <- ag[order(ag$est), ]

dotchart(ag$est, ag$type, xlim=c(10,90),

lcolor=NA, pch=16, xlab = "Prestige")

segments(ag$lower, 1:3, ag$upper, 1:3)

bc

wc

prof

●

●

●

20 40 60 80

Prestige

You try it

Using the strikes_small.rda object, do the following:1. Make a bar plot of average strike volume by country.

2. Make a dot plot of average strike volume by country.

4 ggplots

Personally, I prefer the lattice system because I think Cleveland’s theoretical work is mostcompelling. However, Deducer makes a very friendly interface to ggplot graphics, so we’lltalk more extensively about those. According to Wickham (2009, p. 3), “a statisticalgraphic is a mapping from data to aesthetic attributes (color, shape size) of geometricobjects (points, lines, bars).” The plot may contain transformations of the data thatinform the scale or the coordinate system. Finally, faceting, or grouping, can generatethe same plot for different subsets of the data. These components can be combined invarious ways to make graphics. The main elements of a ggplot are:

data/mappings The data that you want to represent visually.

33

geoms represent the elements you see on the plot (lines, points, shapes, etc...)

scales map the values of the data into an aesthetic space through colors, sizes or shapes.They also draw legends or axes that allow users to read original data values fromthe plot.

coord the coordinate system describes how coordinates are mapped to a two-dimensionalgraph.

facet describes how to subset the data.

There are two methods for making plots. The first qplot() makes an immediateplot, that is the output from this function is a plot. The ggplot() function sets theparameters for the plot, but geoms need to be added to display information. Here are acouple of examples:

library(car)

library(ggplot2)

data(Duncan)

qplot(income, prestige, data=Duncan, shape=type, color=type) +

theme_bw() + theme(aspect.ratio=1)+ scale_shape_manual(values=c(1,2,3))

g <- ggplot(data=Duncan, aes(x=income, y=prestige, shape=type, color=type))

g + geom_point() + theme_bw() + theme(aspect.ratio=1) + scale_shape_manual(values=c(1,2,3))

Figure 21: GGplots

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

(a) qplot

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

(b) ggplot + geom

4.1 Deducer

Now, I want to move us over to Deducer in the JGR GUI. These are the tasks I want usto complete.

34

4.1.1 Scatterplot

• Load the car and ggplot2 packages.

• Load the Duncan data from the car package.

• Make a scatterplot of prestige (y) by income (x).

• Subset by type (different shapes/colors).

• Add regression line by group.

• Make the same plot as above, but using facets for the different values of type.

4.1.2 Bar Graph

• Make a factor out of prestige such that it is in 4 roughly equally sized groups (i.e.,using quantiles).

• Make a bar graph of the new 4-group prestige variable.

• Subset the bar graph by type.

This will allow us the chance to A) investigate the various elements that can be addedtogether and B) see what the code looks like. Deducer will spit out the R code to producethe graph.

You try it

Load the strikes_small.rda object into Deducer1. Try to recreate some of the plots you made before using

Deducer

4.2 Back to Code

ggplot2 is Hadley Wickham’s creation based on theoretical work by Leland Wilkinson.The basic idea is that graphs are built in a model-based fashion, where pieces are addedor subtracted to or from existing plots. These “pieces” can be elements of the plot (e.g.,points, lines, etc...) or they can be other parameters that goven how the plot looks. Someexamples will help make sense of this.

4.3 Scatterplots

One way that we can make a scatterplot is by using the ggplot() command and thenadding points.

library(car)

library(ggplot2)

p <- ggplot(Duncan, aes(income, prestige))

p + geom_point(shape=1) + theme(aspect.ratio=1)

35

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

As you can see here, the default theme is to have a light gray background and filled-inblack points (which I changed with shape=1). To make the plot have a white background,we add a theme statement:

36

p + geom_point(shape=1) + theme_bw() +

theme(aspect.ratio=1)

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

Using the qplot() command, we can add colors, too.


theme_bw() + theme(aspect.ratio=1)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

37

If you don’t like the plotting symbols chosen for you, you can change them as well byadding another element to the plot.


theme_bw() + scale_shape_manual(values=c(1,2,3)) +


●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

To change the colors back to something more familiar, we would have to first make them,and then add them to the plot.



scale_color_manual(values=c("blue", "black", "red")) +


●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

38

We could also add a regression line with the following:



scale_color_manual(values=c("blue", "black", "red")) +

geom_smooth(method="lm", se=F) + theme(aspect.ratio=1)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

0

25

50

75

100

20 40 60 80income

pres

tige

type

● bc

prof

wc

Notice a couple of things here. First, because we defined the shapes and colors as depend-ing on a variable, the regression lines also depend on that variable. So, we don’t have toexplicitly say that we want different regression lines in any way other than making thepoints different shapes/colors relative to some variable. Also, the lines only plot over therange of the data. Adding the fullrange=T argument to geom_smooth() would changethat behavior to something more like abline().

In general, the parameters you can set with respect to colors, plotting symbols, etc...can be captured as below:

39

B.5 Justification 197

(a) All named colours in Luv space

blank

dashed

dotdash

dotted

longdash

solid

twodash

(b) Built-in line types

●

●

●

●

●

●

●

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

22

21

24

23

20

(c) R plotting symbols. Colour is black,and fill is blue. Symbol 25 (not shown)is symbol 24 rotated 180 degrees.

ABCD

c(0, 0)

ABCD

c(0.5, 0)

ABCD

c(1, 0)

ABCD

c(0, 0.5)

ABCD

c(0.5, 0.5)

ABCD

c(1, 0.5)

ABCD

c(0, 1)

ABCD

c(0.5, 1)

ABCD

c(1, 1)

(d) Horizontal and vertical justificationsettings.

Fig. B.1: Examples illustrating di�erent aesthetic settings.

4.4 Other Plots

You can also make the whole set of other plots with ggplot2 as well. Here are a fewexamples.

4.4.1 Histograms and Barplots

p <- ggplot(Duncan, aes(prestige))

p2 <- p + xlim(1,101) + geom_histogram(aes(y = ..density..),

color="black", fill="gray75", breaks=seq(1,101,by=10)) +


p2

40

0.000

0.005

0.010

0.015

0.020

0 25 50 75 100prestige

dens

ity

41

and now with density lines

p2 + geom_density()

0.000

0.005

0.010

0.015

0.020

0 25 50 75 100prestige

dens

ity

ag <- aggregate(Duncan$prestige, list(Duncan$type), mean)

names(ag) <- c("type", "prestige")

ggplot(ag, aes(x=type, y=prestige)) + geom_bar(stat="identity") +


0

20

40

60

80

bc prof wctype

pres

tige

42

4.4.2 Dotplot

library(gmodels)

ag <- aggregate(Duncan$prestige, list(Duncan$type), ci)

ag <- cbind(region = ag[,1], as.data.frame(ag$x))

names(ag) <- c("type", "est", "lower", "upper", "se")

ag <- ag[order(ag$est), ]

ag$type <- factor(1:nrow(ag), labels=levels(Duncan$type))

p <- ggplot(ag, aes(est, type))

p + geom_point() + xlab("Mean Prestige") + ylab("") +

geom_errorbarh(aes(xmin=lower, xmax=upper)) +


●

●

●

bc

prof

wc

25 50 75Mean Prestige

4.5 Faceting

ggplot2 is also really good with dependent data (like lattice).

p <- ggplot(Duncan, aes(income, prestige))

p2 <- p + geom_point(shape=1) +

facet_grid(.~ type) + theme_bw() +


p2

43

bc prof wc

●

●

●

●

●●

●

●●

●●

●●

●●

●

●

●●

●

●

● ●

●

●

●●

●●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

0

25

50

75

100

20 40 60 80 20 40 60 80 20 40 60 80income

pres

tige

Adding a regression line in each panel is also easy:

p2 + geom_smooth(method="lm", se=F, col="gray65", fullrange=T) +

geom_smooth(method="lm", se=F) + theme(aspect.ratio=1)

bc prof wc

●

●

●

●

●●

●

●●

●●

●●

●●

●

●

●●

●

●

● ●

●

●

●●

●●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

0

25

50

75

100

20 40 60 80 20 40 60 80 20 40 60 80income

pres

tige

Taking the same example as in the Lattice lecture, we could plot model predictionsusing ggplot2.

44

mod1 <- lm(prestige ~ income + education + type, data=Duncan)

mod2 <- lm(prestige ~ poly(income, 3) + education + type, data=Duncan)

newdat <- data.frame(

income = seq(min(Duncan$income), max(Duncan$income), length=25),

education = mean(Duncan$education),

type = "bc"

)

p1 <- as.data.frame(predict(mod1, newdata=newdat, interval="confidence"))

p2 <- as.data.frame(predict(mod2, newdata=newdat, interval="confidence"))

plot.dat <- as.data.frame(rbind(p1, p2))

plot.dat$model <- rep(c("Model 1", "Model 2"), each=25)

plot.dat$income <- rep(newdat$income, 2)

rownames(plot.dat) <- NULL

plot.dat1 <- plot.dat[which(plot.dat$model == "Model 1"), ]

mp <- ggplot(plot.dat, aes(x=income, y=fit), color=model) + theme_bw()

mp + geom_ribbon(aes(ymin=lwr, ymax=upr), fill=rgb(.75,.75,.75, .25, 1)) +

facet_grid(.~ model) + geom_line() + theme(aspect.ratio=1)

Model 1 Model 2

20

40

60

80

20 40 60 80 20 40 60 80income

fit

In the same plot, we could do something like this:

mp <- ggplot(plot.dat, aes(x=income, y=fit, fill=model)) + theme_bw()

mp + geom_ribbon(aes(ymin=lwr, ymax=upr, fill=model), alpha=.25) +

scale_fill_manual("Model", values=c("blue", "red")) +

geom_line(aes(color=model), show_guide=F) +

45

scale_colour_manual(values=c("blue", "red")) +


## Warning: ‘show guide‘ has been deprecated. Please use ‘show.legend‘ instead.

20

40

60

80

20 40 60 80income

fit

Model

Model 1

Model 2

46

We could also make the histograms of assets by region from the Ornstein data.

p <- ggplot(Ornstein, aes(x=log2(assets))) + theme_bw()

ra <- range(log2(Ornstein$assets))

p + geom_histogram(aes(y=..density..),

breaks=pretty(ra, 10), color="black", fill="gray75") +

facet_wrap(~nation) + geom_density() + theme(aspect.ratio=1)

CAN OTH

UK US

0.0

0.1

0.2

0.3

0.0

0.1

0.2

0.3

8 12 16 8 12 16log2(assets)

dens

ity

47

Side-by-side barplots can be accomplished in a lot the same way as in Lattice.

ag <- with(Chile, aggregate(income,

list(region, education), mean, na.rm=T))

names(ag) <- c("region", "education", "mean_income")

p <- ggplot(ag, aes(x=region, y=mean_income)) + theme_bw()

p + geom_bar(stat="identity") + facet_grid(.~ education) + theme(aspect.ratio=1)

P PS S

0

20000

40000

60000

80000

C M N S SA C M N S SA C M N S SAregion

mea

n_in

com

e

48

Dotplots by group can happen easily here, too.

ag2 <- with(Chile, aggregate(income,

list(region, education), ci, na.rm=T))

ag2 <- cbind(ag2[,1:2], ag2$x)

names(ag2) <- c("region", "education", "mean", "lower", "upper", "se")

ag2 <- ag2[order(ag2$education, ag2$mean), ]

ag2$region <- factor(ag2$region,

levels=ag2$region[which(ag2$education == "PS")])

p <- ggplot(ag2, aes(y=region, x=mean))

p + geom_point() + facet_grid(education ~ .) + theme_bw() +

geom_errorbarh(aes(xmin=lower, xmax=upper),

size=.25)

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

N

S

M

C

SA

N

S

M

C

SA

N

S

M

C

SA

PP

SS

25000 50000 75000 100000mean

regi

on

5 Effects Package

The effects package is one of the most versatile and useful pieces of software for assessingnon-linear/interactive effects in linear models or for assessing any effects in non-linearmodels.

• You can load the package by doing:

library(effects)

• The main command that does all of the work is effect() which has print and plotmethods

49

The function provides predicted values holding other variables at fixed values (along withconfidence intervals). The function can do this not only for individual terms, but for termsinteracted with others and for non-linear transformations, polynomials, and splines. Thegreat thing about this is that it is built on the lattice package and we can use what weknow about lattice to modify it.

5.1 Some Linear Model Examples

5.1.1 Non-linear Transformations in Linear Models

First we have to run a model that has effects we want to plot. We’ll start with a modelthat has a non-linear transformation.

library(car)

library(effects)

data(Prestige)

mod <- lm(prestige ~ log(income) +

education + type, data=Prestige)

eff <- effect("log(income)", mod, default.levels=100)

names(eff)

## [1] "term" "formula"

## [3] "response" "variables"

## [5] "fit" "x"

## [7] "x.all" "model.matrix"

## [9] "data" "discrepancy"

## [11] "offset" "residuals"

## [13] "partial.residuals.range" "x.var"

## [15] "vcov" "se"

## [17] "lower" "upper"

## [19] "confidence.level" "transformation"

## [21] "family"

Now, we can use the plotting method for effects to plot those predictions. Notice,that I specified two colors as colors. The first controls the solid mean-line, the secondcontrols the colors of the confidence bounds. Other graphical parameter arguments canbe controlled, but more difficultly with lty, lwd, ect...

plot(eff, colors=c("black", "black"))

50

income effect plot

income

pres

tige

30

35

40

45

50

55

60

65

5000 10000 15000 20000 25000

Elements of the graph can be updated with the update() function. Some elements ofthe graph need to be “hard-coded” in, that is they are not allowed to be manipulated incall to plot. An example will help clarify. What if I wanted to rotate the x-axis labels45 degrees. I should be able to provide this as an argument to plot:

> plot(eff, colors=c("black", "black"), scales=list(x=list(rot=45)))

Error in xyplot.formula(eval(parse(text = paste("fit ~", names(x)[1]))), :

formal argument "scales" matched by multiple actual arguments

Becaus the scales arguement is pre-set in the command, we cannot change it directly.However, we can alter it with update().

p1 <- plot(eff, colors=c("black", "black"))

update(p1, scale=list(x=list(rot=45)))

51

income effect plot

income

pres

tige

30

35

40

45

50

55

60

65

500

0

1000

0

1500

0

2000

0

2500

0

5.1.2 Choosing “other” values of Covariates

The effects package holds the values of other variables constant at “interesting” values.We can choose the values at which those other variables are held constant in a couple ofdifferent ways:

• Using typical will allow you to set the values of all variables to something mean-ingful.

eff2 <- effect("log(income)", mod,

default.levels=100, typical = median)

head(eff$model.matrix)

## (Intercept) log(income) education typeprof typewc

## 1 1 7.412160 10.7951 0.3163265 0.2346939

## 2 1 7.549965 10.7951 0.3163265 0.2346939

## 3 1 7.671060 10.7951 0.3163265 0.2346939

## 4 1 7.779061 10.7951 0.3163265 0.2346939

## 5 1 7.876527 10.7951 0.3163265 0.2346939

## 6 1 7.965332 10.7951 0.3163265 0.2346939

head(eff2$model.matrix)


## 1 1 7.412160 10.605 0.3163265 0.2346939

## 2 1 7.549965 10.605 0.3163265 0.2346939

52

## 3 1 7.671060 10.605 0.3163265 0.2346939

## 4 1 7.779061 10.605 0.3163265 0.2346939

## 5 1 7.876527 10.605 0.3163265 0.2346939

## 6 1 7.965332 10.605 0.3163265 0.2346939

What you’ll notice is that the type dummy variables are set to their in-sampleproportions, not as zeros or ones. This is problmeatic for some people, so we canmove on to the other way of setting covariate values.

• We can also provide a value for all or some of the columns of the model matrix. Wecan do this with the given.values argument to effect().

eff3 <- effect("log(income)", mod,

default.levels=100,

given.values=c("typeprof"=1,

"typewc" = 0))

head(eff3$model.matrix)


## 1 1 7.412160 10.7951 1 0

## 2 1 7.549965 10.7951 1 0

## 3 1 7.671060 10.7951 1 0

## 4 1 7.779061 10.7951 1 0

## 5 1 7.876527 10.7951 1 0

## 6 1 7.965332 10.7951 1 0

Note that these values have to have precisely the same names as the model matrix (theseare also the same names as the model coefficients).

coef(mod)


## -81.201867 10.487471 3.284486 6.750887 -1.439403

colnames(model.matrix(mod))

## [1] "(Intercept)" "log(income)" "education" "typeprof" "typewc"

All of the other variables will be included at typical values.

You try it

Using the strikes_small.rda object, do the following:1. Model the log of strike volume (+1) as a function of the

log of population density, unemployment and inflation

2. Plot the effect of the log of population density

53

5.1.3 Interactions

The effects package can also help us understand interaction effects.

mod4 <- lm(prestige ~ income*education + type, data=Prestige)

inc_eff <- effect("income*education", mod4,

given.values=c("typeprof"=1, "typewc"=1))

plot(inc_eff, as.table=T)

income*education effect plot

income

pres

tige

30

40

50

60

70

80

90

education

5000 10000 15000 20000 25000

education

5000 10000 15000 20000 25000

education

30

40

50

60

70

80

90

education

This provides us the effect of income as education takes on 10 different values. So thelines tell us how predicted values of prestige change with income for each of the differentvalues of income

inc_eff

##

## income*education effect

## education

## income 8 10 12 14

## 5000 35.44788 43.55465 51.66141 59.76818

## 10000 45.97000 51.97489 57.97978 63.98467

## 15000 56.49212 60.39513 64.29814 68.20115

## 20000 67.01424 68.81537 70.61651 72.41764

## 25000 77.53636 77.23562 76.93487 76.63412

We could plot the other side of the interaction - the effect of education as income takeson different values:

plot(inc_eff, x.var="education", as.table=T)

54

income*education effect plot

education

pres

tige

30

40

50

60

70

80

90

income

8 9 10 11 12 13 14

income income

8 9 10 11 12 13 14

income

30

40

50

60

70

80

90

income

The effects package also does “the right” thing with factors and continuous variablesinteracted:

mod5 <- lm(prestige ~ income*type + education, data=Prestige)

eff_cf <- effect("income*type", mod5)

pcf1 <- plot(eff_cf, layout=c(3,1))

pcf1

income*type effect plot

income

pres

tige

40

60

80

100

120

5000 10000 15000 20000 25000

: type bc

5000 10000 15000 20000 25000

: type prof

5000 10000 15000 20000 25000

: type wc

55

We can also plot the other side of the interaction, too:

eff_cf2 <- effect("income*type", mod5,

xlevels=list(income = quantile(Prestige$income, c(.1,.5,.9))))

pcf2 <- plot(eff_cf2, layout=c(3,1), x.var="type")

pcf2

You try it

Using the strikes_small.rda object, do the following:1. Model the log of strike volume (+1) as a function of

the log of population density, unemployment and infla-tion and the interaction of log population density andinflation.

2. Figure out what is going on in the interaction.

5.2 GLM Examples

The effects package is also good at plotting predicted probabilities (or other inverse-link transformations) from GLMs, for one variable holding all other variables constant atsome value. Let’s run a logit model for vote for a democrat from the 1992 ANES.

library(foreign)

nes <- read.dta("anes1992.dta")

mod <- glm(votedem ~ age + pid +

black + retnat, data=nes, family=binomial)

eff <- effect("pid", mod,

default.levels=25,

given.values = c(

"pid" = 4,

"black" = 0,

"retnatsame" = 0,

"retnatworse" = 1))

plot(eff)

56

pid effect plot

pid

vote

dem

0.2

0.4

0.6

0.8

1 2 3 4 5 6 7

Note that in the default version of the graph, we get what looks like a straight line.This is because the effect is plotted on the linear predictor and you see that we getunequal distances between tick-marks on the y-axis. We can change this behavior withrescale.axis=F.

plot(eff, rescale.axis=F)

## NOTE: the rescale.axis argument is deprecated; use type instead

pid effect plot

pid

vote

dem

0.2

0.4

0.6

0.8

1 2 3 4 5 6 7

57

5.3 OL/MNL Examples

You can plot the predictions from Ordered logit/probit and MNL as well. Here, you havea couple more options.

fh <- read.dta("fh2006_ss.dta")

library(MASS)

mod <- polr(freedomstat ~ log(pop) + gdppc10k + civ, data=fh)

summary(mod)

##

## Re-fitting to get Hessian

## Call:

## polr(formula = freedomstat ~ log(pop) + gdppc10k + civ, data = fh)

##

## Coefficients:

## Value Std. Error t value

## log(pop) -0.2556 0.09261 -2.7596

## gdppc10k 0.5394 0.24689 2.1848

## civIslamic -1.3684 0.52840 -2.5897

## civLatin American 1.4775 0.57628 2.5638

## civOrthodox 0.2807 0.66007 0.4253

## civOther 0.8716 0.49292 1.7681

## civWestern 4.1342 1.16519 3.5481

##

## Intercepts:

## Value Std. Error t value

## not free|partly free -2.5367 0.9004 -2.8172

## partly free|free -0.4965 0.8748 -0.5675

##

## Residual Deviance: 264.0614

## AIC: 282.0614

We can use the effects package to get fitted probabilities for continuous covariates. Forexample:

gdp.eff <- effect("gdppc10k", mod,

default.levels=100)

##

## Re-fitting to get Hessian

plot(gdp.eff, rug=F)

58

gdppc10k effect plot

gdppc10k

free

dom

stat

(pr

obab

ility

)

0.2

0.4

0.6

0.8

0 1 2 3 4 5

: freedomstat not free

0.2

0.4

0.6

0.8

: freedomstat partly free

0.2

0.4

0.6

0.8

: freedomstat free

You can also use the style='stacked' argument if you want to get something a bitdifferent:

plot(gdp.eff, xlab="GDP/capita (in $10000 US)",

ylab = "Pr(Freedom Status)", style="stacked",

colors=c("gray75", "gray50", "gray25"),

)

gdppc10k effect plot

GDP/capita (in $10000 US)

Pr(

Fre

edom

Sta

tus)

0.2

0.4

0.6

0.8

0 1 2 3 4 5

freepartly freenot free

These work the same way for the multinomial logit, too.

59

6 Maps

One of the real benefits on R is that it is relatively easy to make good looking maps.You’ll need a shape file (and the associated auxiliary files) and potentially some extradata you want to plot. Generally, when plotting states, countries, counties, or any sortof administrative boundary, you read the data in as SpatialPolygonsDataFrame. Allof the objects we have dealt with until now have been S3 objects. While this has manyand varied implications, the most important for us is that we can list the names of theelements in the object with names(object) and once we find an element (let’s call itE), we could extract that element with object$E. With S4 objects, the analog to names

is slotNames(object) and to extract element E, you would use object@E. So, below,where you see the code world@data, that is extracting the data frame associated withthe polygons. Every polygon (country here)has an associated line in the data frame.

library(rgdal)

library(ggplot2)

library(plyr)

world <- readOGR(".", layer="WorldCountries")

## OGR data source with driver: ESRI Shapefile

## Source: ".", layer: "WorldCountries"

## with 250 features

## It has 3 fields

wbmig <- read.csv("wbmig_R.csv")

wbmig <- na.omit(wbmig[,c("wbpropmig", "ccode")])

world@data <- join(world@data, wbmig, type="left")

## Joining by: ccode

q <- quantile(world@data$wbpropmig, c(0,.25,.5,.75,1), na.rm=T)

world@data$propmigcat <- cut(world@data$wbpropmig, breaks=q)

world@data$id <- world@data$placename

world.points <- fortify(world, region="id")

world.df <- join(world.points, world@data, by="id")

pl <- ggplot(world.df) +

aes(long,lat,group=group,fill=propmigcat) +

geom_polygon() +

geom_path(color="gray50", size=.1) +

coord_equal() +

scale_fill_brewer("", labels=c("Lowest Quartile", "2nd Quartile", "3rd Quartile",

"Highest Quartile"))

pl + theme(line = element_blank(),

axis.title.x = element_blank(),

60

axis.title.y = element_blank(),

axis.ticks = element_blank(),

axis.text = element_blank(),

legend.position = "bottom",

panel.background = element_blank())

R also has lots of spatial statistics routines. One simple thing that you might have todo is create an adjacency matrix. You could do this as follows:

library(spdep)

neighbors <- poly2nb(world)

adjmat <- nb2mat(neighbors, zero.policy=TRUE)

The adjacency matrix sums to 1 (so each entry is equal to 1#neighbors

) in the rows (unless

the polygon has no neighbors in which case it is all zeros). You can create a neighborhoodaverage with:

x <- ifelse(is.na(world@data$wbpropmig), 0, world@data$wbpropmig)

neighave <- adjmat %*% x

Now, the neighborhood average is a variable you could use in a model (to account forsome degree of spatial dependence).

7 Interactivity

There are a few different means of generating interactive graphics in R depending onwhat you want. The following are interactive graphical options that don’t require you towrite a function to do the graphing.

• identify and locator are simple, both in their implementation and capabilities.They make it easy to identify points in a plot. locator identifies any (x,y) coor-dinate you click with the mouse and identify uses the locator function to tryto identify plotted points close to where you clicked the mouse. They are alsoautomatically labeled.

61

• iPlots offers linking and brushing of lots of univariate and bivariate plots (includingscatterplots, histograms, bar plots). We will work through some of these examplesin the workshop, but interactivity is hard to describe and display in a static handoutlike this one.

• RGL offers real-time 3D visualization in R (where openGL does the heavy lifting).

After we talk about writing functions, we’ll talk about some function-based approachesto interactive graphics. These include:

• rpanel - a package that makes it easy to re-render graphics based on sliders, checkboxes, text, pull-down menus, etc... These are great tools for demonstration, butare hard to allow others arbitrary access. In that, they are not something you wouldhost on a server and have other people play with.

• shiny - a web-based interactive visualization environment for R. Shiny is designedfor getting interactive visualizations to the web. This is a competitor to the Tableausoftware. This is a great solution for small projects (i.e., those that you won’t needthousands of people to see and use, many of them concurrently). Thus, it is a greattool for academics.

• rCharts - an set of helper functions in R to produce D3.js graphics. These can beeasily hosted on a website, though they do require you to have some place to hostthem. The interactivity is often more pointed or targeted here than it is with ashiny app, but it scales a lot more inexpensively.

62

Documents

R: Learning By Example Visualization - Dave Armstrong ... · R: Learning By Example Visualization Dave Armstrong University of Wisconsin-Milwaukee Department of Political Science