
Analyzing Economic Data using R: Reading data in R

Sebastiano Manzan

BUS 4093H | Fall 2016


Unit 1: Reading data in R

- The first task in data analysis is to load/import a dataset into R
- We have already discussed how to upload a csv file using the RStudio menus

Loading from the command line

- The graphical interface provided by RStudio is very useful in exploratory analysis
- However, in most cases we want to make importing the data as automatic as possible
- Example: we are writing a script that performs an analysis that we might want to repeat for future updates of the dataset
- We can import a dataset either by typing in the console or by running a script file

    # First, set the directory with the setwd("directory name") command, then ...
    citibike <- "201605-citibike-tripdata.csv"
    library(readr)
    data <- read_csv(citibike)

1. The setwd() command tells R the location of the files to read and write
2. library(readr): loads the package readr
   - A package is a collection of functions that perform specific tasks
   - When you install R or start an R session, only a few basic packages are installed or loaded
   - Packages are available at the CRAN repository (https://cran.r-project.org/web/packages/) and are organized by task (https://cran.r-project.org/web/views/). This is what you need to do in case you need to use a package:
     - install.packages("name package"): only once
     - library(name package) or require(name package): every time you want to use functions from the package
     - update.packages(): to update packages to the latest available version
3. read_csv() is a function from package readr that is used to load/import csv files; type ?read_csv or help(read_csv) for information about its arguments and output

The built-in read.csv function

- An alternative approach is to use the R built-in functions to import data, among them the read.csv() function
- The read_csv function tries to solve two main problems with the read.csv function:
  1. Speed in loading the dataset (see next slide for a comparison)
  2. Guessing the data types for the columns that are imported

Speed comparison: read.csv vs read_csv

- The function proc.time() is useful to time the duration of a process; below we first store the starting time and, after the loading is complete, we calculate the difference between current time and start time

    start.csv <- proc.time()             # start clock to calculate time
    data <- read.csv(citibike)
    time.csv <- proc.time() - start.csv  # time elapsed

    start_csv <- proc.time()             # start clock to calculate time
    data <- read_csv(citibike)
    time_csv <- proc.time() - start_csv  # time elapsed

- The read.csv() function took 56 seconds while read_csv() took only 5 seconds
- The difference in speed is large and becomes even more remarkable when loading larger datasets
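The timing pattern above can be tried on any small file. The sketch below is only an illustration: the temporary file and the object names (tmp, toy, toy2) are made up, and it uses the base function system.time(), which wraps the same start/stop logic around a single expression.

```r
# Hypothetical toy file: 1000 rows written to a temporary csv
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000)), tmp, row.names = FALSE)

# proc.time() approach, as in the slide
start <- proc.time()
toy <- read.csv(tmp)
elapsed.manual <- (proc.time() - start)["elapsed"]   # seconds of wall-clock time

# system.time() does the bookkeeping for us
elapsed.auto <- system.time(toy2 <- read.csv(tmp))["elapsed"]
```

Both objects hold the elapsed time in seconds; on a file this small the values will be close to zero.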

fread from package data.table

- The data.table package provides another function for fast loading of datasets in R
- The function is called fread and it is used similarly to the other two (but it can do more things, as we will see later)

    library(data.table)
    data <- fread(citibike)

    Read 18.1% of 1212280 rows
    Read 55.3% of 1212280 rows
    Read 95.7% of 1212280 rows
    Read 1212280 rows and 15 (of 15) columns from 0.220 GB file in 00:00:05

- This function took 4 seconds to load the same file
- The function prints intermediate and final loading info that can be switched off by adding the argument showProgress = FALSE

Other approaches

- When loading large datasets we might encounter a memory problem
- R stores the data in RAM, and we might not be able to load the data when the memory is not large enough for the dataset at hand
- In this case, it is better to store the dataset in a SQL database, which allows us to load only parts of the dataset, such as some variables, and/or to subset the data before importing it into R
- Some resources on R, SQLite, and big data:
  - https://www.rstudio.com/resources/webinars/working-with-big-data-in-r/
  - https://www.r-bloggers.com/r-and-sqlite-part-1/
  - https://blog.rstudio.org/2014/10/25/rsqlite-1-0-0/

A larger dataset

- The College Scorecard is a dataset made available by the Department of Education
- The goal of the website is to provide the relevant information so that students heading to college make the "right" decision
- Data available at: https://collegescorecard.ed.gov/data/

- Information provided in the College Scorecard dataset:
  - College: private non-profit/private for-profit/public, tuition, number and type of degrees awarded, graduation rates, and admission rate among others
  - Student Demographics: number of UG students, race composition, first-generation students, average family income, and SAT scores among others
  - Financial Aid: percentage of Pell students, cumulative median debt, default and repayment rates among others
  - Earnings (of students that received financial aid): mean and median earnings

Reading the College Scorecard dataset

- The College Scorecard website used to provide a single large file covering several years, with a size of about 2 GB
- The file has 124,699 rows and 1,731 columns/variables
- Recently, they made available separate files for each year, which makes the task of loading the data easier
- I will use this larger file to compare the performance of the three functions in importing the dataset

    scorecard <- 'Scorecard.csv'
    start.csv <- proc.time()
    data <- read.csv(scorecard)
    end.csv <- proc.time() - start.csv

    library(readr)
    start_csv <- proc.time()
    data <- read_csv(scorecard)
    end_csv <- proc.time() - start_csv

    library(data.table)
    start.f <- proc.time()
    data <- fread(scorecard, showProgress = FALSE)
    end.f <- proc.time() - start.f

- The read.csv() function takes 359 seconds to load the file, while read_csv() took 81 seconds and fread() 80 seconds
- Relative to the read.csv() function, read_csv and fread are about 4.4 and 4.5 times faster in loading the 2 GB file
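The relative speeds follow directly from the three timings reported on this slide. A quick check of the arithmetic, with the reported seconds typed in by hand (the vector name secs is just for illustration):

```r
# Elapsed seconds reported in the slide for the 2 GB Scorecard file
secs <- c(read.csv = 359, read_csv = 81, fread = 80)

# How many times faster each function is relative to read.csv()
speedup <- secs["read.csv"] / secs
round(speedup, 1)
```

The ratios are 359/81 (about 4.4) for read_csv and 359/80 (about 4.5) for fread.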

Loading other formats

- Another built-in R function to read data is read.table(), for text files with delimited columns (e.g., tab-delimited txt files)
- If the argument sep is not specified, the default (sep="") is that columns are separated by white space (one or more spaces, tabs, or newlines)
- Other values for sep can be specified:
  - "," for comma separated columns (read.table essentially becomes read.csv)
  - "\t" for horizontal tabs
- Other arguments of the read.table() and read.csv() functions are:
  - header: whether the first row of the dataset provides column names; the default is TRUE for read.csv() and FALSE for read.table()
  - skip: the number of lines at the top of the file that should not be read by the function
  - nrows: if you want to specify the number of rows that should be imported
  - row.names and col.names: if you want to specify the row and column names of the dataset

read.table()
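A minimal sketch of the sep and header arguments, assuming a made-up tab-delimited file written to a temporary location (the file contents and the names tmp and tab are hypothetical):

```r
# Hypothetical tab-delimited file; note that the names contain spaces
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tscore",
             "Ann Lee\t3.5",
             "Bob Ray\t4.0"), tmp)

# sep = "\t" keeps "Ann Lee" in one column; header = TRUE must be set
# explicitly because the read.table() default is header = FALSE
tab <- read.table(tmp, sep = "\t", header = TRUE)
str(tab)
```

With the default white-space separator the names would be split at the space, which is why sep is stated explicitly here.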


Reading the Fama-French 3 factors data

- Ken French provides on his website files with the famous Fama-French 3 factors:
  - MKT: market return
  - SMB: Small Minus Big factor (return of a portfolio of small cap stocks minus the return of a portfolio of large cap stocks)
  - HML: High Minus Low factor (return of a portfolio of high Book-to-Market ratio stocks minus the return of low B-to-M ratio stocks)
- These factors measure risks to which investors can decide to be exposed in order to gain an excess return
- The returns of US equity portfolios (e.g., mutual funds) should be largely explained by these factors (although their exposure to each of these 3 factors might be different)
- Visit the page for details about the construction of the factors: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

    data <- read.table("F-F_Research_Data_Factors.txt", fill=TRUE)
    # we need to include the fill=TRUE argument since not all rows
    # have the same number of elements; without that argument, an error
    # message is generated
    head(data, 4)

          V1      V2    V3      V4   V5                V6       V7  V8
    1   This    file   was created   by CMPT_ME_BEME_RETS    using the
    2    The 1-month TBill  return   is              from Ibbotson and
    3 Mkt-RF     SMB   HML      RF
    4 192607    2.96 -2.30   -2.87 0.22
              V9  V10       V11
    1     201606 CRSP database.
    2 Associates Inc.
    3
    4

    data[1083:1086,]

             V1       V2               V3    V4   V5 V6 V7 V8 V9 V10 V11
    1083 201606    -0.04             0.24 -0.08 0.02
    1084 Annual Factors: January-December
    1085 Mkt-RF      SMB              HML    RF
    1086   1927    29.47            -2.46 -3.75 3.12

    tail(data, 3)

                V1    V2      V3     V4     V5 V6 V7 V8 V9 V10 V11
    1173      2014 11.70   -7.75  -3.17   0.02
    1174      2015  0.07   -4.24 -10.58   0.02
    1175 Copyright  2016 Kenneth     R. French

- We could skip the text lines at the top of the file (skip=3 in the call below), since the table with the data starts only after them
- If we are only interested in the monthly returns we could import the first 1080 rows of the table and discard the rest

    data <- read.table("F-F_Research_Data_Factors.txt",
                       skip=3, nrows=1080, header=TRUE)

           Mkt.RF  SMB   HML   RF
    192607   2.96 -2.3 -2.87 0.22
    192608   2.64 -1.4  4.19 0.25

           Mkt.RF   SMB   HML   RF
    201605   1.78 -0.28 -1.85 0.01
    201606  -0.04  0.24 -0.08 0.02
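The effect of skip, nrows, and header can be checked on a made-up miniature of this file layout. Everything in the sketch is hypothetical (the notes, the temporary file, and the numbers in the third and fourth data rows are invented); only the layout mimics the real file.

```r
# Made-up miniature: two note lines, a blank line, a header, monthly rows,
# and then an annual block that we do not want to import
tmp <- tempfile(fileext = ".txt")
writeLines(c("This file is a made-up miniature of the layout",
             "Second line of notes",
             "",
             "        Mkt-RF     SMB     HML      RF",
             "192607    2.96   -2.30   -2.87    0.22",
             "192608    2.64   -1.40    4.19    0.25",
             "192609    0.50   -1.00    0.10    0.23",
             "192610   -3.00   -0.10    0.70    0.32",
             "",
             " Annual Factors: January-December",
             "1927     29.47   -2.46   -3.75    3.12"), tmp)

# skip the 3 lines before the header and stop after the 4 monthly rows
ff <- read.table(tmp, skip = 3, nrows = 4, header = TRUE)
ff
```

Because the header line has one field fewer than the data lines, read.table() uses the first column (the dates) as row names, and the name Mkt-RF is converted to the syntactically valid Mkt.RF.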

Download and unzip from R?

- Can R visit the webpage for us, download the zip file, unzip it, and import the data file? Yes, R can!
- We will use the download function from the downloader package and unzip from the utils package (utils is one of the few default packages that come with R)

    library(downloader)
    file <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_TXT.zip"
    download(file, dest="F-F_Research_Data_Factors_TXT.zip")
    unzip(zipfile="F-F_Research_Data_Factors_TXT.zip")
    data <- read.table("F-F_Research_Data_Factors.txt",
                       skip=3, nrows=1080, header=TRUE)

           Mkt.RF  SMB   HML   RF
    192607   2.96 -2.3 -2.87 0.22
    192608   2.64 -1.4  4.19 0.25

           Mkt.RF   SMB   HML   RF
    201605   1.78 -0.28 -1.85 0.01
    201606  -0.04  0.24 -0.08 0.02

Excel files

- There are several packages that link R and Excel
- Some packages (readxl and xlsx) are useful to read and write Excel files
- Others allow you to create Excel workbooks with several sheets, write, and format among other things (package XLConnect)

    library(readxl)
    data <- read_excel("F-F_Research_Data_Factors.xlsx", sheet="month")
    library(xlsx)
    data <- read.xlsx("F-F_Research_Data_Factors.xlsx", sheetName="month")

         NA. Mkt.RF  SMB   HML   RF
    1 192607   2.96 -2.3 -2.87 0.22
    2 192608   2.64 -1.4  4.19 0.25

            NA. Mkt.RF   SMB   HML   RF
    1079 201605   1.78 -0.28 -1.85 0.01
    1080 201606  -0.04  0.24 -0.08 0.02

More about the fread function

- In addition to importing data very fast, the fread function has another convenient property: you can select the columns that you want to import directly through its select argument
- With the other functions, the simplest approach is to load the whole dataset and then select the columns that we will analyze (they can also skip columns at import, e.g. through colClasses in read.csv() or col_types in read_csv(), but less directly)
- This property is particularly useful when analyzing datasets that have hundreds of variables, of which we might only need a few
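For comparison, a base R sketch of importing only some columns with read.csv(): a column whose colClasses entry is the string "NULL" is skipped, while NA lets R guess the type. The file contents and the names tmp, keep, and small are made up for illustration.

```r
# Hypothetical csv with four columns, of which we want two
tmp <- tempfile(fileext = ".csv")
writeLines(c("INSTNM,CITY,ADM_RATE_ALL,EXTRA",
             "College A,Normal,0.54,1",
             "College B,Austin,0.72,2"), tmp)
keep <- c("INSTNM", "ADM_RATE_ALL")

# Read only the header line to learn the column names
header <- names(read.csv(tmp, nrows = 1))

# "NULL" = skip the column, NA = import it and guess its type
classes <- rep("NULL", length(header))
classes[header %in% keep] <- NA

small <- read.csv(tmp, colClasses = classes)
names(small)
```

This takes a few more steps than fread's select argument, which is the convenience the slide refers to.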

Back to the College Scorecard

- Documentation: https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
- Assume that we are interested in running a preliminary analysis by looking at the following variables:
  - INSTNM: institution name
  - CITY: city where the institution is located
  - ADM_RATE_ALL: admission rate across all campuses
  - SAT_AVG_ALL: average SAT score across all campuses
  - TUITFTE: net tuition revenue per full-time equivalent student
  - DISTANCEONLY: schools that are distance education-only
- Instead of the big file used earlier, I will import the file for 2012 (merged_2012_PP.csv released on 03/2016)

    scorecard <- 'merged_2012_PP.csv'
    library(data.table)
    data.sel <- fread(scorecard, showProgress = FALSE,
                      select=c("INSTNM", "CITY", "ADM_RATE_ALL",
                               "SAT_AVG_ALL", "TUITFTE", "DISTANCEONLY"))

- The whole dataset has 1729 variables but we only imported 6 variables
- This reduced the reading time from 4 seconds to 1 second

                                    INSTNM       CITY ADM_RATE_ALL SAT_AVG_ALL
    1:            Alabama A & M University     Normal       0.5438         847
    2: University of Alabama at Birmingham Birmingham       0.7223        1107
    3:                  Amridge University Montgomery         NULL        NULL
    4: University of Alabama in Huntsville Huntsville       0.7766        1162
    5:            Alabama State University Montgomery       0.4604         827
    6:           The University of Alabama Tuscaloosa       0.5308        1172
       DISTANCEONLY TUITFTE
    1:            0    6222
    2:            0    8181
    3:            0   11980
    4:            0    7998
    5:            0    7797
    6:            0   11610

                                          INSTNM          CITY ADM_RATE_ALL
    1:                     Excel Learning Center        Austin         NULL
    2: SAE Institute of Technology San Francisco San Francisco         NULL
    3:                        Strayer University   Bloomington         NULL
    4:                        Strayer University    Schaumburg         NULL
    5:                        Strayer University Downers Grove         NULL
    6:                        Strayer University        Aurora         NULL
       SAT_AVG_ALL DISTANCEONLY TUITFTE
    1:        NULL         NULL    NULL
    2:        NULL         NULL    NULL
    3:        NULL         NULL    NULL
    4:        NULL         NULL    NULL
    5:        NULL         NULL    NULL
    6:        NULL         NULL    NULL

NA for missing observations

- Notice that there are many NULL in the dataset to indicate missing observations
- In R missing observations are indicated by NA
- We need to tell R that NULL values in the dataset are NAs by setting the option na.strings="NULL" in the fread function

    data.sel <- fread(scorecard, showProgress = FALSE,
                      select=c("INSTNM", "CITY", "ADM_RATE_ALL",
                               "SAT_AVG_ALL", "TUITFTE", "DISTANCEONLY"),
                      na.strings="NULL")

                                    INSTNM       CITY ADM_RATE_ALL SAT_AVG_ALL
    1:            Alabama A & M University     Normal       0.5438         847
    2: University of Alabama at Birmingham Birmingham       0.7223        1107
    3:                  Amridge University Montgomery           NA          NA
       DISTANCEONLY TUITFTE
    1:            0    6222
    2:            0    8181
    3:            0   11980

                   INSTNM          CITY ADM_RATE_ALL SAT_AVG_ALL DISTANCEONLY
    1: Strayer University    Schaumburg           NA          NA           NA
    2: Strayer University Downers Grove           NA          NA           NA
    3: Strayer University        Aurora           NA          NA           NA
       TUITFTE
    1:      NA
    2:      NA
    3:      NA
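The same idea can be tried with the base function read.csv(), which also has a na.strings argument. The sketch below uses a made-up three-row file; the names tmp, raw, and clean are hypothetical.

```r
# Hypothetical file in which missing values are coded as the text NULL
tmp <- tempfile(fileext = ".csv")
writeLines(c("INSTNM,SAT_AVG_ALL",
             "College A,847",
             "College B,NULL",
             "College C,1107"), tmp)

raw   <- read.csv(tmp)                       # NULL is kept as text
clean <- read.csv(tmp, na.strings = "NULL")  # NULL is read as NA

class(clean$SAT_AVG_ALL)               # the column is now an integer
mean(clean$SAT_AVG_ALL, na.rm = TRUE)  # average of the non-missing values
```

Without na.strings the whole column is treated as text, so numerical summaries are not possible until the missing-value code is declared.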

Data types

- Once we have imported a dataset in R, the next task is to examine if the variables have been imported correctly
- We can use the View() command to open the dataset in the Data Viewer or we can print a snapshot of the data to the console using head() and tail()
- The function str() provides the structure of the dataset defined as the object class, the number of observations and variables, and, for each variable, the data type and a short list of values

    str(data.sel)

    Classes 'data.table' and 'data.frame': 7793 obs. of 6 variables:
     $ INSTNM      : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
     $ CITY        : chr "Normal" "Birmingham" "Montgomery" "Huntsville" ...
     $ ADM_RATE_ALL: num 0.544 0.722 NA 0.777 0.46 ...
     $ SAT_AVG_ALL : int 847 1107 NA 1162 827 1172 NA NA 1050 1217 ...
     $ DISTANCEONLY: int 0 0 0 0 0 0 0 0 0 0 ...
     $ TUITFTE     : int 6222 8181 11980 7998 7797 11610 1861 4707 6912 12302 ...
     - attr(*, ".internal.selfref")=<externalptr>

Data frames

- The data.sel object is defined as a data.frame (and data.table)
- A data frame is a matrix or table with each column representing a variable and each row representing a unit of observation; examples:
  - Scorecard: each row is a college/university
  - Citibike: each row represents a trip
  - Fama-French factors: each row is a month
- data.sel is a data frame with 6 columns (INSTNM, CITY, ADM_RATE_ALL, SAT_AVG_ALL, DISTANCEONLY, TUITFTE) and 7793 rows (HE institutions)
- Notice that the 6 variables are of different data types:
  - chr: character such as Alabama A & M University or Normal
  - num: numeric as 0.5438 or 0.7223
  - int: integer as 847 or 6222
- The Scorecard documentation uses different names for the data types: string for character, float for numeric, boolean for logical, and integer for ... integer

- The read functions automatically pick the data type for each column
- This is not an easy task and the functions often differ in the way they define the imported variables
- It is important that you examine and validate the data types assigned by the read functions against the dataset documentation and your own understanding of the nature of the variable
- Scorecard example: the fread function defines DISTANCEONLY as type integer, although the documentation designates it as boolean
- Distance-only programs are given a value of 1 and non distance-only programs a value of 0

- The data.sel data frame has 6 columns and we can refer to each of them using the $ sign: data.sel$DISTANCEONLY extracts the variable 'DISTANCEONLY'
- The chunk below takes the DISTANCEONLY column and defines its class to be logical:

    class(data.sel$DISTANCEONLY) <- "logical"
    str(data.sel)

    Classes 'data.table' and 'data.frame': 7793 obs. of 6 variables:
     $ INSTNM      : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
     $ CITY        : chr "Normal" "Birmingham" "Montgomery" "Huntsville" ...
     $ ADM_RATE_ALL: num 0.544 0.722 NA 0.777 0.46 ...
     $ SAT_AVG_ALL : int 847 1107 NA 1162 827 1172 NA NA 1050 1217 ...
     $ DISTANCEONLY: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
     $ TUITFTE     : int 6222 8181 11980 7998 7797 11610 1861 4707 6912 12302 ...
     - attr(*, ".internal.selfref")=<externalptr>

- Notice that now the structure of DISTANCEONLY is logi and the first few values are FALSE instead of 0
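A small stand-alone illustration of the conversion, using a made-up integer vector instead of the Scorecard column. It also shows as.logical(), a common alternative to assigning the class, which gives the same result here.

```r
# Hypothetical 0/1 indicator with one missing value
distance <- c(0L, 1L, NA, 0L)

# Option 1: set the class, as in the slide
via.class <- distance
class(via.class) <- "logical"

# Option 2: explicit conversion function
via.fun <- as.logical(distance)

via.fun                        # FALSE TRUE NA FALSE
identical(via.class, via.fun)  # both routes agree
```

In both cases 0 becomes FALSE, 1 becomes TRUE, and missing values stay NA.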

- The previous discussion used the fread() function to read the dataset; how would read_csv have categorized these 6 variables?
- With read_csv() the simplest approach is to import the complete file and then subset the data frame as shown below (columns can also be restricted at import through the col_types argument):

    library(readr)
    data.sel <- read_csv(scorecard, na = "NULL")
    data.sel <- data.sel[,c("INSTNM", "CITY", "ADM_RATE_ALL",
                            "SAT_AVG_ALL", "TUITFTE", "DISTANCEONLY")]
    str(data.sel)

    Classes 'tbl_df', 'tbl' and 'data.frame': 7793 obs. of 6 variables:
     $ INSTNM      : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
     $ CITY        : chr "Normal" "Birmingham" "Montgomery" "Huntsville" ...
     $ ADM_RATE_ALL: num 0.544 0.722 NA 0.777 0.46 ...
     $ SAT_AVG_ALL : int 847 1107 NA 1162 827 1172 NA NA 1050 1217 ...
     $ TUITFTE     : int 6222 8181 11980 7998 7797 11610 1861 4707 6912 12302 ...
     $ DISTANCEONLY: int 0 0 0 0 0 0 0 0 0 0 ...

- The read_csv function assigned the same data types as fread

Factors

- And if we used the read.csv() function?

    data.sel <- read.csv(scorecard, na.strings = "NULL")
    data.sel <- data.sel[,c("INSTNM", "CITY", "ADM_RATE_ALL",
                            "SAT_AVG_ALL", "TUITFTE", "DISTANCEONLY")]
    str(data.sel)

    'data.frame': 7793 obs. of 6 variables:
     $ INSTNM      : Factor w/ 7572 levels "A & W Healthcare Educators",..: 90 6773 230 6774 93 6514 1102 381 400 399 ...
     $ CITY        : Factor w/ 2541 levels "Aberdeen","Abilene",..: 1556 190 1430 1014 1430 2289 27 93 1430 99 ...
     $ ADM_RATE_ALL: num 0.544 0.722 NA 0.777 0.46 ...
     $ SAT_AVG_ALL : int 847 1107 NA 1162 827 1172 NA NA 1050 1217 ...
     $ TUITFTE     : int 6222 8181 11980 7998 7797 11610 1861 4707 6912 12302 ...
     $ DISTANCEONLY: int 0 0 0 0 0 0 0 0 0 0 ...

- INSTNM and CITY are of type Factor instead of character

- Type Factor is another data type that R uses for categorical variables, that is, those that take a finite set of values (that can be numeric, logical, or integer)
- In R versions before 4.0.0 the read.csv function by default interprets strings as factors; this can be avoided by adding stringsAsFactors = FALSE as an argument (which is the default since R 4.0.0)
- Factors are characterized by:
  - levels(data.sel$CITY): Aberdeen, Abilene, Abingdon ...
  - nlevels(data.sel$CITY): 2541
- Obviously, it does not make sense to have a categorical variable with 2541 categories
- A more appropriate candidate to be defined as a factor is the DISTANCEONLY variable, which has two clearly defined categories (0 vs 1, non-distance vs distance)
- Factors should be used when there are a small number of categories
- Defining a variable as a factor is convenient in statistical analysis

- Defining a variable as a factor is done with the factor() function

    data.sel$DISTANCEONLY <- factor(data.sel$DISTANCEONLY)
    levels(data.sel$DISTANCEONLY)

    [1] "0" "1"

- We can also change the labels of the levels:

    levels(data.sel$DISTANCEONLY) <- c("NO-DISTANCE","DISTANCE")
    levels(data.sel$DISTANCEONLY)

    [1] "NO-DISTANCE" "DISTANCE"

- We could have also defined the variable as a factor and changed the labels at the same time:

    data.sel$DISTANCEONLY <- factor(data.sel$DISTANCEONLY,
                                    labels=c("NO-DISTANCE","DISTANCE"))
    levels(data.sel$DISTANCEONLY)

    [1] "NO-DISTANCE" "DISTANCE"
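The same steps on a made-up 0/1 vector, so that they can be run without the Scorecard file. Stating the levels explicitly (an optional argument of factor()) makes the pairing between values and labels visible; the vector distance is hypothetical.

```r
# Hypothetical indicator: 1 = distance-only, 0 = not distance-only
distance <- c(0, 1, 0, 0, 1)

# levels gives the values in the data, labels the names to display
f <- factor(distance, levels = c(0, 1),
            labels = c("NO-DISTANCE", "DISTANCE"))

levels(f)      # the two category labels
nlevels(f)     # number of categories
table(f)       # count of observations in each category
as.integer(f)  # internal codes: 1 for the first level, 2 for the second
```

Note that the internal codes are 1 and 2, not the original 0 and 1, which matters if a factor is later converted back to numbers.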

Working with strings

- There are several functions that are useful when dealing with strings (chr) such as:
  - paste(): paste two or more strings, numeric, integer etc
  - nchar(): counts the number of characters in a string
  - substr(): subsets a string

    data.sel$CITY[1:2] # takes the first two elements of column CITY

    [1] "Normal" "Birmingham"

    nchar(data.sel$CITY[1:2]) # number of characters of each element

    [1] 6 10

    substr(data.sel$CITY[1:2], 1, 4) # sub-string from first to fourth character

    [1] "Norm" "Birm"

    paste(data.sel$INSTNM[1], " is located in ", data.sel$CITY[1],
          " and has an admission rate of ", round(100*data.sel$ADM_RATE_ALL[1]),
          "% and a net tuition per FTE of ", data.sel$TUITFTE[1], " dollars", sep="")

    [1] "Alabama A & M University is located in Normal and has an admission rate of 54% and a net tuition per FTE of 6222 dollars"

Data types in the Citibike dataset

    citi <- fread(citibike, showProgress = FALSE)

    Classes 'data.table' and 'data.frame': 1212280 obs. of 15 variables:
     $ tripduration           : chr "538" "224" "328" "1196" ...
     $ starttime              : chr "5/1/2016 00:00:03" "5/1/2016 00:00:04" "5/1/2016 00:00:14" "5/1/2016 00:00:20" ...
     $ stoptime               : chr "5/1/2016 00:09:02" "5/1/2016 00:03:49" "5/1/2016 00:05:43" "5/1/2016 00:20:17" ...
     $ start station id       : chr "536" "361" "301" "3141" ...
     $ start station name     : chr "1 Ave & E 30 St" "Allen St & Hester St" "E 2 St & Avenue B" "1 Ave & E 68 St" ...
     $ start station latitude : chr "40.74144387" "40.71605866" "40.72217444" "40.76500525" ...
     $ start station longitude: chr "-73.97536082" "-73.99190759" "-73.98368779" "-73.95818491" ...
     $ end station id         : chr "497" "340" "311" "237" ...
     $ end station name       : chr "E 17 St & Broadway" "Madison St & Clinton St" "Norfolk St & Broome St" "E 11 St & 2 Ave" ...
     $ end station latitude   : chr "40.73704984" "40.71269042" "40.7172274" "40.73047309" ...
     $ end station longitude  : chr "-73.99009296" "-73.98776323" "-73.98802084" "-73.98672378" ...
     $ bikeid                 : chr "23097" "23631" "23049" "19019" ...
     $ usertype               : chr "Subscriber" "Subscriber" "Subscriber" "Customer" ...
     $ birth year             : chr "1986" "1977" "1980" "" ...
     $ gender                 : chr "2" "1" "1" "0" ...
     - attr(*, ".internal.selfref")=<externalptr>

    citi <- read_csv(citibike)

    Classes 'tbl_df', 'tbl' and 'data.frame': 1212280 obs. of 15 variables:
     $ tripduration           : int 538 224 328 1196 753 511 362 1399 515 1477 ...
     $ starttime              : chr "5/1/2016 00:00:03" "5/1/2016 00:00:04" "5/1/2016 00:00:14" "5/1/2016 00:00:20" ...
     $ stoptime               : chr "5/1/2016 00:09:02" "5/1/2016 00:03:49" "5/1/2016 00:05:43" "5/1/2016 00:20:17" ...
     $ start station id       : int 536 361 301 3141 492 445 151 161 368 459 ...
     $ start station name     : chr "1 Ave & E 30 St" "Allen St & Hester St" "E 2 St & Avenue B" "1 Ave & E 68 St" ...
     $ start station latitude : num 40.7 40.7 40.7 40.8 40.8 ...
     $ start station longitude: num -74 -74 -74 -74 -74 ...
     $ end station id         : int 497 340 311 237 228 537 229 2022 334 445 ...
     $ end station name       : chr "E 17 St & Broadway" "Madison St & Clinton St" "Norfolk St & Broome St" "E 11 St & 2 Ave" ...
     $ end station latitude   : num 40.7 40.7 40.7 40.7 40.8 ...
     $ end station longitude  : num -74 -74 -74 -74 -74 ...
     $ bikeid                 : int 23097 23631 23049 19019 16437 20592 15681 16003 20515 20884 ...
     $ usertype               : chr "Subscriber" "Subscriber" "Subscriber" "Customer" ...
     $ birth year             : int 1986 1977 1980 NA 1981 1991 1986 1989 1998 1995 ...
     $ gender                 : int 2 1 1 0 1 1 1 1 1 1 ...
     - attr(*, "spec")=List of 2
      ..- attr(*, "class")= chr "col_spec"

- read_csv seems to have guessed the variable types better than fread, which classified all variables as chr
- The starttime and stoptime variables are dates and times, and we will discuss later the specifics of this particular type of variable
- The usertype and gender variables are good candidates to be defined as factors:

    table(citi$usertype)

      Customer Subscriber
        160014    1052266

    table(citi$gender)

         0      1      2
    178710 783723 249847

Other objects used in R

- We have discussed data frames, which are tables formed by variables that can be of different types (numeric, integer, logical, factors, strings)
- Other objects used in R are:
  - Vector: a sequence of values of a certain data type
  - Matrix: table with values all of the same data type (unlike data frames, whose columns can be of different types)
  - List: consists of a list of objects (data frames, matrices, vectors)

Vector

- A vector is a sequence of values (either chr, num, int, factor)
- Some R commands related to vectors:

    myvector <- vector(length=5)    # define an empty vector
    myvector[1] <- "Baruch"         # set the vector's 1st element
    myvector <- c("Baruch", "College", "is",
                  "number", "one")  # define a vector
    [1] "Baruch"  "College" "is"      "number"  "one"
    length(myvector)                # number of elements in the vector
    [1] 5
    class(myvector)                 # find the class of the vector
    [1] "character"
    myvector <- c("Baruch", "College", "is",
                  "number", 1)      # define a vector with num and chr
    [1] "Baruch"  "College" "is"      "number"  "1"
    class(myvector)
    [1] "character"
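The last example shows that a vector holds a single data type: when types are mixed, R converts all elements to the most general one. A few more cases, using made-up values:

```r
# Mixing types inside c() triggers automatic conversion
class(c(1L, 2L))     # "integer": all elements are integers
class(c(1L, 2.5))    # "numeric": the integer is converted to a number with decimals
class(c(1, "one"))   # "character": the number is converted to a string
class(c(TRUE, 2L))   # "integer": TRUE is converted to 1
```

The ordering is logical, then integer, then numeric, then character, so a single string is enough to turn a whole vector into text.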

Matrix

- A matrix is a table like a data frame, but it forces all variables to be of the same data type
- A matrix is useful to organize numerical data and use the tools of matrix algebra
- Some R commands related to matrices:

    mymatrix <- matrix(0, nrow=3, ncol=2)  # define an empty matrix
                                           # with three rows and 2 columns
    mymatrix[1,1] <- 1                     # assign values
    mymatrix[3,2] <- 6
    mymatrix
         [,1] [,2]
    [1,]    1    0
    [2,]    0    0
    [3,]    0    6
    dim(mymatrix)                          # dimensions of the matrix
    [1] 3 2
    c(nrow(mymatrix), ncol(mymatrix))
    [1] 3 2
    class(mymatrix)
    [1] "matrix"

- Let's define the data.mat matrix as a matrix composed of two quantitative variables of data.sel
- If you read the file using fread, the object is a data frame of class data.table, which has different ways of subsetting
- You can avoid this by adding data.table=FALSE in the fread function or ...
- ... setting the object to be a data frame with the as.data.frame() command

    data.mat <- as.matrix(data.sel[c("ADM_RATE_ALL", "SAT_AVG_ALL")])
    class(data.mat)
    [1] "matrix"
    class(data.mat[,1])
    [1] "numeric"
    class(data.mat[,2])
    [1] "numeric"

- Notice that now both columns are of type num whilst before SAT_AVG_ALL was of type int
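The change of type can be reproduced with a small made-up data frame (df, m, and mixed are hypothetical names). Because a matrix holds one data type, as.matrix() converts the integer column to match the numeric one, and any character column would turn the whole matrix into text.

```r
# Hypothetical data frame with one numeric and one integer column
df <- data.frame(rate = c(0.54, 0.72, NA),
                 sat  = c(847L, 1107L, NA))
class(df$sat)             # "integer" inside the data frame

m <- as.matrix(df[c("rate", "sat")])
typeof(m)                 # "double": both columns are now numeric
dim(m)                    # 3 rows and 2 columns

# With a character column everything becomes a string
mixed <- as.matrix(data.frame(n = 1:2, s = c("a", "b")))
typeof(mixed)             # "character"
```

This is why only the quantitative variables are passed to as.matrix() in the slide.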

apply

- apply is a command that applies a function to the rows or columns of a matrix
- Example: we want to calculate the average of the columns using the mean() function
- The function requires us to specify whether we want to apply the mean function to the rows (1) or columns (2)

    apply(data.mat, 2, "mean")
    ADM_RATE_ALL  SAT_AVG_ALL
              NA           NA

- The result is NA because when there are some NA values the mean function returns a missing value
- Adding the option na.rm=TRUE excludes the missing values from the average:

    apply(data.mat, 2, "mean", na.rm=TRUE)
    ADM_RATE_ALL  SAT_AVG_ALL
       0.6775375 1060.1038961
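A toy matrix makes the role of the second argument and of na.rm easy to verify by hand. The matrix m and its column names are made up; colMeans() is a base R shortcut for the column-average case.

```r
# Hypothetical 3 x 2 matrix with one missing value in column a
m <- matrix(c(1, NA, 3, 10, 20, 30), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

col.raw   <- apply(m, 2, mean)                # a is NA, b is 20
col.clean <- apply(m, 2, mean, na.rm = TRUE)  # a is 2 (average of 1 and 3), b is 20
row.sums  <- apply(m, 1, sum, na.rm = TRUE)   # one value per row: 11, 20, 33

colMeans(m, na.rm = TRUE)                     # same result as col.clean
```

Extra arguments such as na.rm = TRUE are passed by apply() to the function being applied.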

Lists

- A list is an object that stores objects (of the same or different type)
- To extract an element of a list we need to put [[ ]] instead of the [ ] that we use for vector, matrix, and data frame objects

    mylist <- vector("list", 3)                # create a list with 3 elements
    mylist[[1]] <- c("Baruch", "is", "great")  # set the elements of the list
    mylist[[2]] <- data.frame(height=c(10,20), weight=c(3,4))
    mylist[[3]] <- factor(c("ciao","hello","ciao","hello"))
    [[1]]
    [1] "Baruch" "is"     "great"

    [[2]]
      height weight
    1     10      3
    2     20      4

    [[3]]
    [1] ciao  hello ciao  hello
    Levels: ciao hello
    mylist[[1]][3] <- "fantastic"
    mylist[[1]]
    [1] "Baruch"    "is"        "fantastic"
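The difference between the two kinds of brackets can be seen on a small named list (the element names words and sizes are made up): single brackets return a shorter list, double brackets return the stored object itself.

```r
# Hypothetical list with two named elements
mylist <- list(words = c("Baruch", "is", "great"),
               sizes = data.frame(height = c(10, 20), weight = c(3, 4)))

class(mylist[1])            # "list": still a list, with one element
class(mylist[[1]])          # "character": the vector stored in the list

# Named elements can also be reached with $ or [["name"]]
mylist$words[3]             # "great"
mylist[["sizes"]]$height    # 10 20
```

Functions that expect a vector or a data frame therefore need the double-bracket (or $) form.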

- Example: we might want to use a list to store the Citibike data in different months or the Scorecard data in different years
- We can then run the statistical analysis on each element of the list that represents a month/year of the same dataset

    citibike.list <- vector('list',3)
    citibike.list[[1]] <- read_csv("201604-citibike-tripdata.csv")
    citibike.list[[2]] <- read_csv("201605-citibike-tripdata.csv")
    citibike.list[[3]] <- read_csv("201606-citibike-tripdata.csv")

- We can assign names to each element of the list:

    names(citibike.list) <- c("april","may","june")

- A list can also be created as follows:

    citibike.list <- list(april = read_csv("201604-citibike-tripdata.csv"),
                          may = read_csv("201605-citibike-tripdata.csv"),
                          june = read_csv("201606-citibike-tripdata.csv"))

lapply

- lapply works on lists the same way that the apply command works for matrices
- It applies a specified function to each element of a list
- Example: I want to find the number of rows of each data frame in the citibike.list

    citibike.ntrips <- lapply(citibike.list, nrow)
    $april
    [1] 1013149

    $may
    [1] 1212280

    $june
    [1] 1460318
    unlist(citibike.ntrips)
      april     may    june
    1013149 1212280 1460318
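A runnable miniature of the same pattern, with two tiny made-up data frames in place of the monthly Citibike files. It also shows sapply(), a base R relative of lapply that returns a vector directly when each result is a single number.

```r
# Hypothetical list of two small data frames
trips <- list(
  april = data.frame(usertype = c("Customer", "Subscriber", "Subscriber")),
  may   = data.frame(usertype = c("Subscriber", "Customer",
                                  "Subscriber", "Subscriber")))

as.list.result <- lapply(trips, nrow)   # a list: $april is 3, $may is 4
as.vec.result  <- sapply(trips, nrow)   # a named vector: april 3, may 4

unlist(as.list.result)                  # same as the sapply() result
```

lapply always returns a list, so unlist() (or sapply) is convenient when a simple vector of results is wanted.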

- If we put data frames in a list, we can then access their variables by referring to the item in the list that contains the variable
- For example, I want to run a table of the usertype for the month of May:

    table(citibike.list[["may"]]$usertype)

      Customer Subscriber
        160014    1052266

- We can also define our own functions to pass to lapply; the structure is: function(x) ..., where x is the list item and ... represents the operations that we want to do on each list item (in the example below: table(x$usertype))

    citibike.user <- lapply(citibike.list, function(x) table(x$usertype))

    $april

      Customer Subscriber
        130465     882684

    $may

      Customer Subscriber
        160014    1052266

    $june

      Customer Subscriber
        156832    1303486

    data.frame(matrix(unlist(citibike.user), nrow=3, ncol=2, byrow=TRUE,
                      dimnames=list(c("April", "May", "June"),
                                    c("Customer", "Subscriber"))))

          Customer Subscriber
    April   130465     882684
    May     160014    1052266
    June    156832    1303486
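The matrix(unlist(...)) step relies on every table having the same categories in the same order. A possible alternative, sketched here on made-up data, fixes the categories in advance and stacks the counts with do.call(rbind, ...); the helper count_types and the list trips are hypothetical names.

```r
# Hypothetical list of two small data frames
trips <- list(
  april = data.frame(usertype = c("Customer", "Subscriber", "Subscriber")),
  may   = data.frame(usertype = c("Subscriber", "Customer",
                                  "Subscriber", "Subscriber")))

# Count user types with a fixed set of categories, so that a category
# missing in one month is reported as 0 instead of being dropped
count_types <- function(x) {
  tab <- table(factor(x$usertype, levels = c("Customer", "Subscriber")))
  setNames(as.vector(tab), names(tab))
}

# One row per list element, one column per category
counts <- do.call(rbind, lapply(trips, count_types))
counts
```

The row names come from the names of the list and the column names from the categories, so no dimnames need to be typed by hand.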

Summary of functions and packages

    Functions                        Packages

    apply()        nlevels()         data.table
    as.matrix()    nrow()            downloader
    c()            paste()           readr
    class()        proc.time()       readxl
    data.frame()   read_csv()        xlsx
    dim()          read_excel()
    download()     read.csv()
    factor()       read.table()
    fread()        read.xlsx()
    head()         round()
    lapply()       str()
    length()       substr()
    levels()       table()
    list()         tail()
    matrix()       unlist()
    nchar()        unzip()
    ncol()         vector()
