I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, etc.
Data Visualization
class 5
Vivian Zhang | Scott Kostyshak
CTO @Supstat Inc | Data Scientist @Supstat Inc
Data Visualization http://nycdatascience.com/part4_en/
1 of 98 2/4/14, 7:31 AM
Data visualization
We will study the application of primary drawing functions and advanced drawing functions in R and will focus on understanding the methods of data exploration by visualization.
· Case study and exercise: Analyzing the NBA data with graphics
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Why use visualization?
Data visualization
A figure is worth a thousand words.

    data <- read.table('data/anscombe.txt', header = TRUE)
    data <- data[, -1]
    head(data)

      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04

Data visualization
Try calculating some summary statistics. First calculate the mean of each column, and then calculate the correlation coefficient for each of the four x-y pairs.

    colMeans(data)

     x1  x2  x3  x4  y1  y2  y3  y4
    9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5

    sapply(1:4, function(x) cor(data[, x], data[, x + 4]))

    [1] 0.816 0.816 0.816 0.817
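
The identical means and correlations hide very different shapes, which only a plot reveals. The following is a minimal sketch of that comparison. It assumes the built-in `anscombe` data frame from R's datasets package holds the same values as the course file `data/anscombe.txt`; the names `quartet`, `pair_means` and `pair_cors` are illustrative and not from the slides.

```r
# Built-in copy of Anscombe's quartet: columns x1..x4 and y1..y4
# (assumed to match the course file data/anscombe.txt)
quartet <- datasets::anscombe

# Summary statistics are nearly identical across the four x-y pairs
pair_means <- colMeans(quartet)
pair_cors <- sapply(1:4, function(i) {
  cor(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]])
})

# One scatter plot per pair, each with its least-squares line
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  plot(x, y, main = paste("Pair", i),
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  abline(lm(y ~ x), lty = 2)
}
par(old_par)  # restore the previous graphics settings
```

Using shared axis limits in all four panels makes the differences between the pairs easier to compare by eye.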
Some basic principles
1. Determine the target of visualization from the beginning
2. Understanding the characteristics of the data and the audience
3. Keep concise but give enough information

· Exploratory visualization
· Explanatory visualization

· Which variables are important and interesting
· Consider the role and background of the audience
· Select a proper mapping

Mapping elements of a graph:
1. Coordinate position
2. Line
3. Size
4. Color
5. Shape
6. Text
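
As a rough illustration of the six elements listed above, the sketch below maps columns of the built-in `mtcars` data onto each of them using base graphics. The choice of `mtcars` and of which column drives which element is an assumption made for this example, not something taken from the slides.

```r
cars_df <- mtcars

# Color: one color per number of cylinders (4, 6, 8)
cyl_levels <- sort(unique(cars_df$cyl))
palette_cyl <- c("steelblue", "orange", "red4")
point_col <- palette_cyl[match(cars_df$cyl, cyl_levels)]

# Shape: triangles for manual transmission, circles for automatic
point_pch <- ifelse(cars_df$am == 1, 17, 16)

# Size: horsepower rescaled to a point size between 0.8 and 2.8
hp_range <- range(cars_df$hp)
point_cex <- 0.8 + 2 * (cars_df$hp - hp_range[1]) / diff(hp_range)

# Coordinate position: weight on x, fuel economy on y
plot(cars_df$wt, cars_df$mpg,
     col = point_col, pch = point_pch, cex = point_cex,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Line: a smoothed trend through the points
lines(lowess(cars_df$wt, cars_df$mpg), lty = 2)

# Text: label the heaviest car
heaviest <- which.max(cars_df$wt)
text(cars_df$wt[heaviest], cars_df$mpg[heaviest],
     labels = rownames(cars_df)[heaviest], pos = 2, cex = 0.8)
```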
Visualization functions in R
Visualization functions in R
· base graphics
· lattice
· ggplot2

Elementary graphing functions

    plot(cars$dist ~ cars$speed)

Elementary graphing functions

    plot(cars$dist, type = 'l')

Elementary graphing functions

    plot(cars$dist, type = 'h')

Elementary graphing functions

    hist(cars$dist)

lattice package

    library(lattice)
    num <- sample(1:3, size = 50, replace = TRUE)
    barchart(table(num))

lattice package

    qqmath(rnorm(100))

lattice package

    stripplot(~ Sepal.Length | Species, data = iris, layout = c(1, 3))

lattice package

    densityplot(~ Sepal.Length, groups = Species, data = iris, plot.points = FALSE)

lattice package

    bwplot(Species ~ Sepal.Length, data = iris)

lattice package

    xyplot(Sepal.Width ~ Sepal.Length, groups = Species, data = iris)

lattice package

    splom(iris[1:4])

lattice package

    histogram(~ Sepal.Length | Species, data = iris, layout = c(1, 3))

Three-dimensional graphs in the lattice package

    library(plyr)
    func3d <- function(x, y) {
      sin(x^2/2 - y^2/4) * cos(2*x - exp(y))
    }
    vec1 <- vec2 <- seq(0, 2, length = 30)
    para <- expand.grid(x = vec1, y = vec2)
    result6 <- mdply(.data = para, .fun = func3d)

Three-dimensional graphs in the lattice package

    library(lattice)
    wireframe(V1 ~ x * y, data = result6,
              scales = list(arrows = FALSE),
              drape = TRUE, colorkey = FALSE)

ggplot package
Data, Mapping and Geom

    library(ggplot2)
    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point()
    print(p)

ggplot package
Observe the internal structure

    summary(p)

    data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
    mapping: x = cty, y = hwy
    faceting: facet_null()
    -----------------------------------
    geom_point: na.rm = FALSE
    stat_identity:
    position_identity: (width = NULL, height = NULL)

ggplot package
Add other data mappings

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy, colour = factor(year)))
    p <- p + geom_point()
    print(p)

ggplot package
Add a statistical transformation such as a smooth

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy, colour = factor(year)))
    p <- p + geom_smooth()
    print(p)

ggplot package
Add points and smooth lines on the plot layer

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(aes(colour = factor(year))) +
      geom_smooth()

ggplot package
Scale control

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(aes(colour = factor(year))) +
      geom_smooth() +
      scale_color_manual(values = c('blue2', 'red4'))

ggplot package
Facet control

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(aes(colour = factor(year))) +
      geom_smooth() +
      scale_color_manual(values = c('blue2', 'red4')) +
      facet_wrap(~ year, ncol = 1)

ggplot package
Polishing your plots for publication

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(aes(colour = class, size = displ),
                 alpha = 0.5, position = "jitter") +
      geom_smooth() +
      scale_size_continuous(range = c(4, 10)) +
      facet_wrap(~ year, ncol = 1) +
      # Set the plot title
      ggtitle('Vehicle model and fuel consumption') +
      # Set the axis and legend labels
      labs(y = 'Highway miles per gallon',
           x = 'Urban miles per gallon',
           size = 'Displacement',
           colour = 'Model')

ggplot exercise I
Change the coordinate system, for example with coord_flip(), coord_polar(), or coord_cartesian()

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(aes(colour = factor(year), size = displ),
                 alpha = 0.5, position = "jitter") +
      stat_smooth() +
      scale_color_manual(values = c('steelblue', 'red4')) +
      scale_size_continuous(range = c(4, 10))

The properties of a single variable
Histogram

    library(ggplot2)
    p <- ggplot(data = iris, aes(x = Sepal.Length)) +
      geom_histogram()
    print(p)

Histogram
We can customize the histogram as follows:

    p <- ggplot(iris, aes(x = Sepal.Length)) +
      geom_histogram(binwidth = 0.1,    # Set the bin width
                     fill = 'skyblue',  # Set the fill color
                     colour = 'black')  # Set the border color

Histograms plus density curve
The main role of the histogram is to show counts by bin and the shape of the distribution. The distribution of a sample is of central importance in traditional statistics. Another way to show the distribution of data is the kernel density estimate, a smooth curve estimated from the data. We can display the histogram and the density curve at the same time.

    p <- ggplot(iris, aes(x = Sepal.Length)) +
      geom_histogram(aes(y = ..density..),
                     fill = 'skyblue', color = 'black') +
      geom_density(color = 'black', linetype = 2, adjust = 2)

Density curve
Much like the bin width of a histogram, the adjust parameter controls how the density curve looks. We try different values to draw multiple density curves. The smaller the value, the more volatile and sensitive the curve is.

    p <- ggplot(iris, aes(x = Sepal.Length)) +
      geom_histogram(aes(y = ..density..),  # Note: map y to density instead of count
                     fill = 'gray60', color = 'gray') +
      geom_density(color = 'black', linetype = 1, adjust = 0.5) +
      geom_density(color = 'black', linetype = 2, adjust = 1) +
      geom_density(color = 'black', linetype = 3, adjust = 2)
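
To see what adjust does numerically, here is a small sketch using base R's `density()`, which multiplies its default bandwidth by `adjust`. It is shown with base graphics rather than ggplot2 so that it runs without extra packages; the object names (`d_half`, `d_one`, `d_two`, `bandwidths`) are made up for this example.

```r
x <- iris$Sepal.Length

# Same data, three bandwidth multipliers
d_half <- density(x, adjust = 0.5)
d_one  <- density(x, adjust = 1)
d_two  <- density(x, adjust = 2)

# The bandwidth actually used scales linearly with adjust
bandwidths <- c(half = d_half$bw, one = d_one$bw, two = d_two$bw)
print(bandwidths)

# Histogram on the density scale with the three curves on top
hist(x, freq = FALSE, breaks = 20, col = "gray90", border = "gray60",
     main = "", xlab = "Sepal.Length")
lines(d_half, lty = 1)
lines(d_one, lty = 2)
lines(d_two, lty = 3)
legend("topright", lty = 1:3, bty = "n",
       legend = c("adjust = 0.5", "adjust = 1", "adjust = 2"))
```

A smaller bandwidth follows local bumps in the data more closely, while a larger one smooths them away.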
Density curve
Density curves are also convenient for comparing different groups of data. For example, to compare the Sepal.Length distributions of the three iris species:

    p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
      geom_density(alpha = 0.5, color = 'gray')
    print(p)

Boxplot
In addition to histograms and density plots, we can also use boxplots to show the distribution of one-dimensional data. The boxplot is also convenient for comparing different groups of data.

    p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
      geom_boxplot()
    print(p)
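
For readers who want the numbers behind the boxes, the sketch below uses base R's `boxplot.stats()`, which returns the lower whisker, lower hinge, median, upper hinge and upper whisker, plus any points drawn as outliers. The names `by_species`, `box_stats` and `outliers` are illustrative only, and the base `boxplot()` call is an alternative to the ggplot2 version, not a replacement for it.

```r
# Split the measurements by species
by_species <- split(iris$Sepal.Length, iris$Species)

# Five numbers per species: lower whisker, lower hinge, median,
# upper hinge, upper whisker
box_stats <- lapply(by_species, function(v) boxplot.stats(v)$stats)

# Points beyond the whiskers, which a boxplot draws individually
outliers <- lapply(by_species, function(v) boxplot.stats(v)$out)

print(box_stats)
print(outliers)

# The same comparison drawn with base graphics
boxplot(Sepal.Length ~ Species, data = iris,
        col = c("skyblue", "orange", "gray70"))
```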
Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:

    p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
      geom_violin()
    print(p)

Violin plot plus points

    p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
      geom_violin(fill = 'gray', alpha = 0.5) +
      geom_dotplot(binaxis = "y", stackdir = "center")
    print(p)

Displaying compositions
Bar chart
The number of vehicles in each class in the mpg dataset

    p <- ggplot(mpg, aes(x = class)) +
      geom_bar()
    print(p)

Stacked bar chart
The number of vehicles in each class in the mpg dataset, broken down by year

    mpg$year <- factor(mpg$year)
    p <- ggplot(mpg, aes(x = class, fill = year)) +
      geom_bar(color = 'black')

Dodged bar chart
The same counts with the bars for each year placed side by side

    p <- ggplot(mpg, aes(x = class, fill = year)) +
      geom_bar(color = 'black', position = position_dodge())

Pie chart

    p <- ggplot(mpg, aes(x = factor(1), fill = factor(class))) +
      geom_bar(width = 1) +
      coord_polar(theta = "y")

Rose diagram
The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed and direction at a specific place.

    set.seed(1)
    # Randomly generate 100 wind directions, and divide them into 16 intervals.
    dir <- cut_interval(runif(100, 0, 360), n = 16)
    # Randomly generate 100 wind speeds, and divide them into 4 intensities.
    mag <- cut_interval(rgamma(100, 15), 4)
    sample <- data.frame(dir = dir, mag = mag)
    # Map wind direction to the x-axis, frequency to the y-axis and speed to the fill color, then switch to polar coordinates.
    p <- ggplot(sample, aes(x = dir, fill = mag)) +
      geom_bar() +
      coord_polar()

Mosaic Plot
Divide the data according to different variables, and then use rectangles of different sizes to represent different groups of data. Let's look at the gender breakdown of survivors:
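
The slide's own code and figure did not survive, so the following is only a sketch of one way to draw such a plot. It assumes the data in question is R's built-in `Titanic` table (the slide does not name its dataset) and uses base R's `mosaicplot()`; the name `sex_by_survival` is made up for this example.

```r
# Titanic is a 4-way table: Class x Sex x Age x Survived.
# Collapse it to a 2-way table of Sex by Survived.
sex_by_survival <- margin.table(Titanic, c(2, 4))
print(sex_by_survival)

# Tile areas are proportional to the cell counts
mosaicplot(sex_by_survival,
           color = c("gray70", "steelblue"),
           main = "Titanic: survival by sex")
```

The width of each column reflects how many passengers and crew were male or female, and the split within a column shows the share who survived.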
The proportion structure of continuous data

    data <- read.csv('data/soft_impact.csv', header = TRUE)
    library(reshape2)
    data.melt <- melt(data, id = 'Year')
    p <- ggplot(data.melt, aes(x = Year, y = value,
                               group = variable, fill = variable)) +
      geom_area(color = 'black', size = 0.3,
                position = position_fill()) +
      scale_fill_brewer()

The relationship between variables
Scatter diagram
Show the relationship between two variables with a scatter diagram.

    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point()
    print(p)

Scatter plot of multidimensional data

    mpg$year <- factor(mpg$year)
    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(color = year))
    print(p)

Scatter plot of multidimensional data
Represent different years with different shapes

    mpg$year <- factor(mpg$year)
    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(color = year, shape = year))
    print(p)

Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting. We can add some random jitter to solve this problem.

    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(color = year), alpha = 0.5, position = "jitter")
    print(p)

Scatter plot of multidimensional data
To show the trend in the scatter plot, we can draw a regression line.

    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(color = year), alpha = 0.5, position = "jitter") +
      geom_smooth(method = 'lm')
    print(p)

Scatter plot of multidimensional data
In addition to color, we can also use the size of each point to reflect another variable, such as engine displacement. Some refer to plots like this as "bubble charts".

    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(color = year, size = displ),
                 alpha = 0.5, position = "jitter") +
      geom_smooth(method = 'lm') +
      scale_size_continuous(range = c(4, 10))

Scatter plot of multidimensional data
Although we can show all the variables in one picture, we can also split it into multiple pictures to show the characteristics of different variables. This method is called grouping, conditioning, or faceting.

    p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
      geom_point(aes(colour = class, size = displ),
                 alpha = 0.5, position = "jitter") +
      geom_smooth() +
      scale_size_continuous(range = c(4, 10)) +
      facet_wrap(~ year, ncol = 1)

ggplot exercise II
· Make a scatter plot for the diamonds data
· Use transparency and small points; look into the size and alpha options in geom_point()
· Use a binned chart to observe the density of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))

Scatter plot of multidimensional data
The typical scatter plot is to show a relationship between two variables. When you want to look at many bivariate relationships at once, you can use a scatter plot matrix.
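
The code for this slide is missing from the source, so here is a minimal sketch of a scatter plot matrix using base R's `pairs()` on the four numeric columns of `iris`. The dataset choice, the colors and the names `species_cols` and `pair_cor` are assumptions for illustration; lattice's `splom()`, shown earlier, is another option.

```r
# One color per species, indexed by the factor's integer codes
species_cols <- c("steelblue", "orange", "red4")[as.integer(iris$Species)]

# Every pairwise scatter plot of the four measurements in one grid
pairs(iris[1:4], col = species_cols, pch = 19, cex = 0.6)

# The matching matrix of pairwise correlations, for reference
pair_cor <- cor(iris[1:4])
print(round(pair_cor, 2))
```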
Scatter plot of multidimensional data
Given many numerical variables, they can be shown together in one condensed display.
Change over time
Change over time
For visualization of time series data, the first step is looking at how the variable changes over time. For example, we'll look at a visualization of US unemployment data.

    # Subset of months to plot
    recent <- economics[440:470, ]
    # Color each bar by whether unemployment is below 8000 (thousands)
    fillcolor <- ifelse(recent$unemploy < 8000, 'steelblue', 'red4')
    p <- ggplot(recent, aes(x = date, y = unemploy)) +
      geom_bar(stat = 'identity', fill = fillcolor)

Change over time
For a time series with a small amount of data, we can use a bar graph, and show positive and negative values in different colors. For a large-scale time series, the bars become crowded, so lines and points can be used in place of bars.

    p <- ggplot(economics[300:470, ], aes(x = date, ymax = psavert, ymin = 0)) +
      geom_linerange(color = 'grey20', size = 0.5) +
      geom_point(aes(y = psavert), color = 'red4') +
      theme_bw()

Change over time
When the data is denser, we can use a line graph or an area chart to show the trend. Important time points or intervals can also be marked in the time series graph, such as marking the 1980s as a key period.

    # Flag the observations that fall in the 1980s
    in_1980s <- economics$date > as.Date('1980-01-01') &
      economics$date < as.Date('1990-01-01')
    fill.color <- ifelse(in_1980s, 'steelblue', 'red4')
    p <- ggplot(economics, aes(x = date, ymax = psavert, ymin = 0)) +
      geom_linerange(color = fill.color, size = 0.9) +
      # Draw the label once, above the highlighted decade
      annotate('text', x = as.Date('1985-01-01'), y = 13, label = '1980s') +
      theme_bw()

Geographic information visualization
Map
Two ways of drawing a map
· Download the geographic information data, then draw the geographical boundaries and identify areas and locations as needed
· Download bitmap data from Google Maps, then mark location and path information on the Google map

Map
world map

    library(ggplot2)
    world <- map_data("world")
    worldmap <- ggplot(world, aes(x = long, y = lat, group = group)) +
      geom_path(color = 'gray10', size = 0.3) +
      geom_point(x = 114, y = 30, size = 10, shape = '*') +
      scale_y_continuous(breaks = (-2:2) * 30) +
      scale_x_continuous(breaks = (-4:4) * 45) +
      coord_map("ortho", orientation = c(30, 120, 0)) +
      theme(panel.grid.major = element_line(colour = "gray50"),
            panel.background = element_rect(fill = "white"),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank())

map of the U.S.

    map <- map_data('state')
    arrests <- USArrests
    names(arrests) <- tolower(names(arrests))
    arrests$region <- tolower(rownames(USArrests))

    usmap <- ggplot(data = arrests) +
      geom_map(map = map, aes(map_id = region, fill = murder), color = 'gray40') +
      expand_limits(x = map$long, y = map$lat) +
      scale_fill_continuous(high = 'red2', low = 'white') +
      theme_bw() +
      theme(panel.grid.major = element_blank(),
            panel.background = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            legend.position = c(0.95, 0.28),
            legend.background = element_rect(fill = "white", colour = "white")) +
      coord_map('mercator')

Drawing a map of China based on a bitmap
Another method of drawing a map of China is to download a file containing bitmap data from Google or OpenStreetMap, and then overlay point and line elements on it with ggplot2. This file does not include latitude and longitude information; it is just a simple bitmap, for fast mapping.

    library(ggmap)
    library(XML)
    webpage <- 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
    tables <- readHTMLTable(webpage, stringsAsFactors = FALSE)
    raw <- tables[[6]]
    data <- raw[, c(1, 3, 4)]
    names(data) <- c('date', 'lan', 'lon')
    data$lan <- as.numeric(data$lan)
    data$lon <- as.numeric(data$lon)
    data$date <- as.Date(data$date, "%Y-%m-%d")
    # Read the map data from Google with the ggmap package, and mark the data above on the map.
    earthquake <- ggmap(get_googlemap(center = 'china', zoom = 4, maptype = 'terrain'),
                        extent = 'device') +
      geom_point(data = data, aes(x = lon, y = lan), colour = 'red', alpha = 0.7) +
      theme(legend.position = "none")

R and interactive visualization
googleVis is an R package providing an interface between R and the Google Visualization API. It allows the user to use the Google Visualization API for data visualization without the need to upload data.
We want to compare the development trajectories of a group of 20 countries over the past several years. To obtain the data, we selected three variables from the World Bank database, which reflect the changes in GDP, CO2 emissions and life expectancy between 2001 and 2009.

    library(googleVis)
    library(WDI)
    DF <- WDI(country=c("CN","RU","BR","ZA","IN",'DE','AU','CA','FR','IT','JP','MX','GB','US'
    M <- gvisMotionChart(DF, idvar = "country", timevar = "year",
                         xvar = 'EN.ATM.CO2E.KT', yvar = 'NY.GDP.MKTP.CD')
    plot(M)

Case study and exercise
Exercise III: Analyzing NBA data
· Calculate the seasonal winning rate, and draw a bar chart
· Calculate the seasonal winning rate at home and on the road, and draw a bar chart
· Using the seasonal scores of the home side, draw a set of four histograms
· Using the seasonal scores of the home side, draw the boxplots of five seasons
· Draw the boxplots of scores of all games for the home side and the opposing side
· Calculate the average and winning percentage for each opponent, and make a scatterplot to find the strong and the weak teams.