Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Stat 115 Lecture Notes 4Xiongzhi Chen
Washington State University
Contents2
Overview of data 2Course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: nycflights13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: weather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Mutating Joins 4Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Add airline name to flights2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Inner join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Outer Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Left outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Right outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Full outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and plances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Filtering Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Missing values 8Course contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Coerce an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Missing observations/values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Check for NA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Remove observatiosn with NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Complete and fill in NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Mising as if not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Fill in missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11License and session Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1
Overview of data
Course materials
• Chapters 10 of book “R for data science” by Wickham and Grolemun• R library: tidyverse, dplyr, nycflights13
NYC flight data: nycflights13
nycflights13 contains 5 tibbles
• airlines: full carrier name• airports: information about each airport• planes: information plane, identified by tailnum• weather: hourly weather at each NYC airport• flights: airplane scheduled departure and arrival times, tailnum, etc
NYC flight data: flights
# A tibble: 6 x 19year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>1 2013 1 1 517 515 2 8302 2013 1 1 533 529 4 8503 2013 1 1 542 540 2 9234 2013 1 1 544 545 -1 10045 2013 1 1 554 600 -6 8126 2013 1 1 554 558 -4 740# ... with 12 more variables: sched_arr_time <int>,# arr_delay <dbl>, carrier <chr>, flight <int>,# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
NYC flight data: airlines
# A tibble: 6 x 2carrier name<chr> <chr>
1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.
2
NYC flight data: airports
# A tibble: 6 x 8faa name lat lon alt tz dst tzone<chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
1 04G Lansdowne Airp~ 41.1 -80.6 1044 -5 A America/~2 06A Moton Field Mu~ 32.5 -85.7 264 -6 A America/~3 06C Schaumburg Reg~ 42.0 -88.1 801 -6 A America/~4 06N Randall Airport 41.4 -74.4 523 -5 A America/~5 09J Jekyll Island ~ 31.1 -81.4 11 -5 A America/~6 0A9 Elizabethton M~ 36.4 -82.2 1593 -5 A America/~
NYC flight data: planes
# A tibble: 6 x 9tailnum year type manufacturer model engines seats speed<chr> <int> <chr> <chr> <chr> <int> <int> <int>
1 N10156 2004 Fixe~ EMBRAER EMB-~ 2 55 NA2 N102UW 1998 Fixe~ AIRBUS INDU~ A320~ 2 182 NA3 N103US 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA4 N104UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA5 N10575 2002 Fixe~ EMBRAER EMB-~ 2 55 NA6 N105UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA# ... with 1 more variable: engine <chr>
NYC flight data: weather
# A tibble: 6 x 15origin year month day hour temp dewp humid wind_dir<chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 2702 EWR 2013 1 1 2 39.0 27.0 61.6 2503 EWR 2013 1 1 3 39.0 28.0 64.4 2404 EWR 2013 1 1 4 39.9 28.0 62.2 2505 EWR 2013 1 1 5 39.0 28.0 64.4 2606 EWR 2013 1 1 6 37.9 28.0 67.2 240# ... with 6 more variables: wind_speed <dbl>,# wind_gust <dbl>, precip <dbl>, pressure <dbl>,# visib <dbl>, time_hour <dttm>
NYC flight data
For nycflights13:
• flights connects to planes via a single variable, tailnum• flights connects to airlines through the carrier variable• flights connects to airports in two ways: via the origin and dest variables• flights connects to weather via origin (the location), and year, month, day, and hour (the time).
3
NYC flight data
Mutating Joins
Key
• a key is a variable (or set of variables) that uniquely identifies an observation• A primary key uniquely identifies an observation in its own table. For example, planes$tailnum
is a primary key because it uniquely identifies each plane in the planes table• A foreign key uniquely identifies an observation in another table. For example, flights$tailnum
is a foreign key because it appears in the flights table where it matches each flight to a unique plane.
Data: flights
> library(dplyr)> flights2 <- flights %>%+ select(year:day, hour, origin, dest, tailnum, carrier)> flights2# A tibble: 336,776 x 8
year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA
4
7 2013 1 1 6 EWR FLL N516JB B68 2013 1 1 6 LGA IAD N829AS EV9 2013 1 1 6 JFK MCO N593JB B6
10 2013 1 1 6 LGA ORD N3ALAA AA# ... with 336,766 more rows
Data: airlines
# A tibble: 6 x 2carrier name<chr> <chr>
1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.
Add airline name to flights2
> library(dplyr)> flights2 %>%+ select(-origin, -dest) %>%+ left_join(airlines, by = "carrier")# A tibble: 336,776 x 7
year month day hour tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr>
1 2013 1 1 5 N14228 UA United Air Lines In~2 2013 1 1 5 N24211 UA United Air Lines In~3 2013 1 1 5 N619AA AA American Airlines I~4 2013 1 1 5 N804JB B6 JetBlue Airways5 2013 1 1 6 N668DN DL Delta Air Lines Inc.6 2013 1 1 5 N39463 UA United Air Lines In~7 2013 1 1 6 N516JB B6 JetBlue Airways8 2013 1 1 6 N829AS EV ExpressJet Airlines~9 2013 1 1 6 N593JB B6 JetBlue Airways
10 2013 1 1 6 N3ALAA AA American Airlines I~# ... with 336,766 more rows
Inner join
inner_join: unmatched rows are not included in the result
5
y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")
The colored column represents the “key” variable: these are used tomatch the rows between the tables. The gray column represents the“value” column that is carried along for the ride. In these examplesI’ll show a single key variable and single value variable, but the ideageneralizes in a straightforward way to multiple keys and multiplevalues.
A join is a way of connecting each row in x to zero, one, or morerows in y. The following diagram shows each potential match as anintersection of a pair of lines:
(If you look closely, you might notice that we’ve switched the orderof the key and value columns in x. This is to emphasize that joinsmatch based on the key; the value is just carried along for the ride.)
In an actual join, matches will be indicated with dots. The numberof dots = the number of matches = the number of rows in the out‐put.
Inner JoinThe simplest type of join is the inner join. An inner join matchespairs of observations whenever their keys are equal:
180 | Chapter 10: Relational Data with dplyr
Outer Joins
outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
• A left join keeps all observations in x• A right join keeps all observations in y• A full join keeps all observations in x and y
Left outer joinGraphically, that looks like:
The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.
Another way to depict the different types of joins is with a Venn dia‐gram:
However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.
182 | Chapter 10: Relational Data with dplyr
Right outer join
Graphically, that looks like:
The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.
Another way to depict the different types of joins is with a Venn dia‐gram:
However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.
182 | Chapter 10: Relational Data with dplyr
6
Full outer join
Graphically, that looks like:
The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.
Another way to depict the different types of joins is with a Venn dia‐gram:
However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.
182 | Chapter 10: Relational Data with dplyr
Join flights and plances
joint flights and planes by tailnum> flights2 %>% left_join(planes, by = "tailnum")# A tibble: 336,776 x 16
year.x month day hour origin dest tailnum carrier year.y<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int>
1 2013 1 1 5 EWR IAH N14228 UA 19992 2013 1 1 5 LGA IAH N24211 UA 19983 2013 1 1 5 JFK MIA N619AA AA 19904 2013 1 1 5 JFK BQN N804JB B6 20125 2013 1 1 6 LGA ATL N668DN DL 19916 2013 1 1 5 EWR ORD N39463 UA 20127 2013 1 1 6 EWR FLL N516JB B6 20008 2013 1 1 6 LGA IAD N829AS EV 19989 2013 1 1 6 JFK MCO N593JB B6 2004
10 2013 1 1 6 LGA ORD N3ALAA AA NA# ... with 336,766 more rows, and 7 more variables:# type <chr>, manufacturer <chr>, model <chr>,# engines <int>, seats <int>, speed <int>, engine <chr>
Join flights and airports
joint flights and airports by matching dest in flights to faa in airports; the variables in x will beused in the output> flights2 %>%+ left_join(airports, c("dest" = "faa"))# A tibble: 336,776 x 15
year month day hour origin dest tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA Geor~2 2013 1 1 5 LGA IAH N24211 UA Geor~3 2013 1 1 5 JFK MIA N619AA AA Miam~4 2013 1 1 5 JFK BQN N804JB B6 <NA>5 2013 1 1 6 LGA ATL N668DN DL Hart~6 2013 1 1 5 EWR ORD N39463 UA Chic~
7
7 2013 1 1 6 EWR FLL N516JB B6 Fort~8 2013 1 1 6 LGA IAD N829AS EV Wash~9 2013 1 1 6 JFK MCO N593JB B6 Orla~
10 2013 1 1 6 LGA ORD N3ALAA AA Chic~# ... with 336,766 more rows, and 6 more variables: lat <dbl>,# lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>
Filtering Joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not thevariables. There are two types:
• semi_join(x, y) keeps all observations in x that have a match in y• anti_join(x, y) drops all observations in x that have a match in y
Missing values
Course contents
• Chapter 9 of book “R for data science” by Wickham and Grolemun• R packages: tidyr, dplyr and tidyverse
Check object types
After R imports data, make sure variables have correct object types
• str gives object type of each variable in a data frame or tibble• summary gives object type of each variable, together with 5-value summary for each numeric object• head gives object type for each variable in a data frame or tibble
Check object types
> summary(flights2)year month day
Min. :2013 Min. : 1.000 Min. : 1.001st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00Median :2013 Median : 7.000 Median :16.00Mean :2013 Mean : 6.549 Mean :15.713rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00Max. :2013 Max. :12.000 Max. :31.00
hour origin destMin. : 1.00 Length:336776 Length:3367761st Qu.: 9.00 Class :character Class :characterMedian :13.00 Mode :character Mode :characterMean :13.183rd Qu.:17.00Max. :23.00
tailnum carrier
8
Length:336776 Length:336776Class :character Class :characterMode :character Mode :character
> str(flights2)Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 8 variables:$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...$ month : int 1 1 1 1 1 1 1 1 1 1 ...$ day : int 1 1 1 1 1 1 1 1 1 1 ...$ hour : num 5 5 5 5 6 5 6 6 6 6 ...$ origin : chr "EWR" "LGA" "JFK" "JFK" ...$ dest : chr "IAH" "IAH" "MIA" "BQN" ...$ tailnum: chr "N14228" "N24211" "N619AA" "N804JB" ...$ carrier: chr "UA" "UA" "AA" "B6" ...
> head(flights2)# A tibble: 6 x 8
year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA
Coerce an object
Force to change the type of a object by as.objecType()> x = c("1","2","3")> x[1] "1" "2" "3"> as.numeric(x)[1] 1 2 3> as.factor(x)[1] 1 2 3Levels: 1 2 3> as.matrix(x,3,1)
[,1][1,] "1"[2,] "2"[3,] "3"> y = c("a","b")> as.numeric(y)[1] NA NA
Missing observations/values
• Missing explicitly, i.e., flagged with NA (,i.e., presence of an absence)
9
• Missing implicitly, i.e., simply not present in the data (i.e., absence of a presence)
Example
> library(tidyr)> stocks <- tibble(+ year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),+ qtr = c( 1, 2, 3, 4, 2, 3, 4),+ return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)+ )> stocks# A tibble: 7 x 3
year qtr return<dbl> <dbl> <dbl>
1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 2 0.926 2016 3 0.177 2016 4 2.66
Check for NA
Command: is.na() or any(is.na())> any(is.na(stocks))[1] TRUE> is.na(stocks)
year qtr return[1,] FALSE FALSE FALSE[2,] FALSE FALSE FALSE[3,] FALSE FALSE FALSE[4,] FALSE FALSE TRUE[5,] FALSE FALSE FALSE[6,] FALSE FALSE FALSE[7,] FALSE FALSE FALSE
Remove observatiosn with NAs
> stocks %>%+ spread(year, return) %>%+ gather(year, return, `2015`:`2016`, na.rm = TRUE)# A tibble: 6 x 3
qtr year return* <dbl> <chr> <dbl>1 1 2015 1.882 2 2015 0.593 3 2015 0.35
10
4 2 2016 0.925 3 2016 0.176 4 2016 2.66
Complete and fill in NAs
> stocks %>% complete(year, qtr)# A tibble: 8 x 3
year qtr return<dbl> <dbl> <dbl>
1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 1 NA6 2016 2 0.927 2016 3 0.178 2016 4 2.66
Mising as if not
scenario: missing values indicate that the previous value should be carried forward> library(tibble)> treatment <- tribble(+ ~ person, ~ treatment, ~response,+ "Derrick Whitmore", 1, 7,+ NA, 2, 10,+ NA, 3, 9,+ "Katherine Burke", 1, 4+ )> treatment# A tibble: 4 x 3
person treatment response<chr> <dbl> <dbl>
1 Derrick Whitmore 1 72 <NA> 2 103 <NA> 3 94 Katherine Burke 1 4
Fill in missing values
> treatment %>% fill(person)# A tibble: 4 x 3
person treatment response<chr> <dbl> <dbl>
1 Derrick Whitmore 1 72 Derrick Whitmore 2 10
11
3 Derrick Whitmore 3 94 Katherine Burke 1 4
License and session Information
License> sessionInfo()R version 3.5.0 (2018-04-23)Platform: x86_64-w64-mingw32/x64 (64-bit)Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:[1] LC_COLLATE=English_United States.1252[2] LC_CTYPE=English_United States.1252[3] LC_MONETARY=English_United States.1252[4] LC_NUMERIC=C[5] LC_TIME=English_United States.1252
attached base packages:[1] stats graphics grDevices utils datasets methods[7] base
other attached packages:[1] tidyr_0.8.2 dplyr_0.7.8 nycflights13_1.0.0[4] knitr_1.21
loaded via a namespace (and not attached):[1] Rcpp_1.0.0 rstudioapi_0.8 bindr_0.1.1[4] magrittr_1.5 tidyselect_0.2.5 R6_2.3.0[7] rlang_0.3.0.1 fansi_0.4.0 stringr_1.3.1
[10] tools_3.5.0 xfun_0.4 utf8_1.1.4[13] cli_1.0.1 htmltools_0.3.6 yaml_2.2.0[16] digest_0.6.18 assertthat_0.2.0 tibble_1.4.2[19] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5[22] glue_1.3.0 evaluate_0.12 rmarkdown_1.11[25] stringi_1.2.4 compiler_3.5.0 pillar_1.3.1[28] pkgconfig_2.0.2
12