EMPIRICAL PROJECT 9

CREDIT-EXCLUDED HOUSEHOLDS IN A DEVELOPING COUNTRY

LEARNING OBJECTIVES

In this project you will:

• identify credit constrained and credit-excluded households using survey information (Part 9.1)
• create dummy (indicator) variables (Part 9.1)
• compare characteristics of successful borrowers, discouraged borrowers, credit constrained households, and credit-excluded households (Part 9.1)
• explain why selection bias is an important issue (Part 9.1)
• analyse the characteristics of loans obtained by successful borrowers (Part 9.2).

Key concepts

• Concepts needed for this project: mean, standard deviation, range, percentile, correlation/correlation coefficient, confidence interval for the difference in means, dummy variable.
• Concepts introduced in this project: selection bias.


CORE PROJECTS

This empirical project is related to material in:

• Unit 9 (https://tinyco.re/9110648) of Economy, Society, and Public Policy
• Unit 10 (https://tinyco.re/6231423) of The Economy.

principal–agent relationship: This is an asymmetrical relationship in which one party (the principal) benefits from some action or attribute of the other party (the agent) about which the principal’s information is not sufficient to enforce in a complete contract. See also: incomplete contract. Also known as: principal–agent problem.

credit constrained: A description of individuals who are able to borrow only on unfavourable terms. See also: credit excluded.

credit excluded: A description of individuals who are unable to borrow on any terms. See also: credit constrained.

INTRODUCTION

Borrowing, either through formal or informal institutions, can help smooth consumption and also enable investment. However, lenders cannot observe how hard the borrower is working to repay the loan, and cannot make the loan contract conditional on such effort. The relationship between the lender and the borrower, like that between the employer and the employee studied in Project 8, is called a principal–agent problem. Section 9.12 (https://tinyco.re/2272226) of Economy, Society, and Public Policy compares the labour market and the credit market.

Lenders can partially mitigate or eliminate the risk of default by requiring collateral, which can be repossessed and sold to repay the loan if the borrower defaults. People who are unable to provide this collateral often have to borrow under more unfavourable conditions (higher interest rates) or may be refused a loan entirely. We call the former group credit constrained, and the latter group credit excluded.

Since the ability to borrow depends on a person’s wealth, such credit constraints and exclusion contribute to inequality, and some opportunities for economic growth are not realized. For example, a hard-working person without any assets may be refused a loan to start a business, which could contribute both to raising their income and to economic activity.

To design policies that help credit markets function better, we first need to look at how widely borrowing conditions and available sources of finance differ according to household characteristics. Sometimes borrowers who are excluded from formal credit markets can still obtain loans through other lenders such as relatives or friends, so it is important to consider these sources when classifying people as credit constrained or excluded.

In countries where formal credit markets are still developing, informal arrangements are important ways for communities to share resources and pool risks. For example, in Ethiopia, households may be part of a social support network called an ‘iddir’, a group of people who regularly pay cash into a common pool that is shared among group members who need it.

We will be looking at data from an Ethiopian household survey (the Ethiopian Socioeconomic Survey) to investigate the borrowing conditions that different types of households face. Aside from credit constrained and credit-excluded households, we will also look at a third group of households that are ‘discouraged borrowers’, meaning that they did not apply for a loan because they thought they would be refused.


DOWNLOAD THE CODE

You can download the code chunks used in this project from the online version of this project (https://tinyco.re/7090800).

Don’t forget to also download the data into your working directory by following the steps in this project.

EMPIRICAL PROJECT 9

WORKING IN R

GETTING STARTED IN R

For this project you will need the following packages:

• tidyverse, to help with data manipulation
• readxl, to import an Excel spreadsheet
• knitr, to format tables
• mosaic, to help create frequency tables.

If you need to install these packages, run the following code:

install.packages(c("readxl", "tidyverse", "knitr", "mosaic"))

You can import the libraries now, or when they are used in the R walk-throughs below.

library(readxl)
library(tidyverse)
library(knitr)
library(mosaic)

PART 9.1 HOUSEHOLDS THAT DID NOT GET A LOAN

Learning objectives for this part

• identify credit-constrained and credit-excluded households using survey information
• create dummy (indicator) variables
• compare characteristics of successful borrowers, discouraged borrowers, credit-constrained households, and credit-excluded households
• explain why selection bias is an important issue.

The Ethiopian Socioeconomic Survey (ESS) data was collected in 2013–14 from a nationally representative sample of households. Households were asked about topics such as their housing conditions, assets, and access to credit.


Download the ESS data and survey questionnaire:

• Download the ESS data (https://tinyco.re/6067633). The Excel file contains three tabs (‘Data dictionary’, ‘All households’, and ‘Got loan’). Read the ‘Data dictionary’ tab and make sure you know what each variable represents. For Part 9.1 we will use the data from the ‘All households’ tab. The ‘Got loan’ data will be used in Part 9.2.
• For the documentation, go to the data download site (https://tinyco.re/9757350). Click on the ‘Documentation’ tab in the middle of the page.
• Under the heading ‘Questionnaires’, download the PDF file called ‘2013–2014 Ethiopian Socioeconomic Survey, Household Questionnaire’ by clicking the ‘Download’ button on the right-hand side of the page. You may find it helpful to refer to Section 14 of the questionnaire for the exact questions asked about credit and saving.

R WALK-THROUGH 9.1

Importing data into R

Before importing the data, open it in Excel to look at its structure. You can see there are three tabs: ‘Data dictionary’, ‘All households’, and ‘Got loan’. We will import them into separate dataframes (DataDict, allHH, and gotL respectively). We import the ‘Data dictionary’ so that we do not have to return to the Excel spreadsheet.

Also note that there are a lot of empty cells, which is how missing data is coded in Excel (but not in R). The read_excel function treats empty cells as missing data by default; we also use the na = "NA" option so that any cell containing the text NA is recognized as missing data too.

library(tidyverse)
library(readxl)

# Set your working directory to the correct folder.
# Insert your file path for 'YOURFILEPATH'.
setwd("YOURFILEPATH")

allHH <- read_excel("Project 9 datafile.xlsx",
  sheet = "All households", na = "NA")

gotL <- read_excel("Project 9 datafile.xlsx",
  sheet = "Got loan", na = "NA")

DataDict <- read_excel("Project 9 datafile.xlsx",
  sheet = "Data dictionary", na = "NA")
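The idea of telling R which cell contents mean ‘missing’ can be tried without the Excel file. The sketch below is only an analogy: it uses base R’s read.csv (not read_excel) on a made-up three-row extract, and the column names are hypothetical.

```r
# Hypothetical three-row extract (not the ESS data).
# Row 2 holds the text "NA" and row 3 is blank in the loan_purpose column.
csv_text <- "household,loan_purpose
1,Expanding Business
2,NA
3,"

# na.strings lists every cell content that should be read as missing
toy <- read.csv(text = csv_text, na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

# TRUE marks the cells that R now treats as missing
is.na(toy$loan_purpose)
```

Both the blank cell and the cell containing the text NA should come back as missing, which is the behaviour we want from the import step.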

Now let’s look at the variable types for allHH and gotL. To see the variable definitions, use the command view(DataDict).

str(allHH)


## Classes 'tbl_df', 'tbl' and 'data.frame': 5262 obs. of 19 variables:
##  $ household_id2     : num 1.01e+16 1.01e+16 1.01e+16 1.01e+16 1.01e+16 ...
##  $ got_loan          : chr "No" "No" "No" "Yes" ...
##  $ rural             : chr "Rural" "Rural" "Rural" "Rural" ...
##  $ hhsize            : num 8 8 1 3 4 4 5 5 6 7 ...
##  $ region            : chr "Tigray" "Tigray" "Tigray" "Tigray" ...
##  $ gender            : chr "Female" "Male" "Female" "Female" ...
##  $ age               : num 78 37 78 30 71 28 37 32 51 43 ...
##  $ young_children    : num 4 5 0 1 1 3 2 3 3 2 ...
##  $ working_age_adults: num 3 3 0 2 1 1 3 2 3 6 ...
##  $ max_education     : num 3 0 0 4 8 4 9 6 4 5 ...
##  $ number_assets     : num 32 19 3 6 34 5 45 35 20 64 ...
##  $ loan_rejected     : chr "No" "No" "No" "No" ...
##  $ rejection_source1 : chr NA NA NA NA ...
##  $ rejection_source2 : chr NA NA NA NA ...
##  $ loan_purpose      : chr NA NA NA NA ...
##  $ loan_purpose_other: chr NA NA NA NA ...
##  $ did_not_apply     : chr "Did not apply" "Did not apply" "Did not apply" "Applied" ...
##  $ reason_not_apply1 : chr "Too Expensive" "Believe Would Be Refused" "Too Expensive" NA ...
##  $ reason_not_apply2 : chr "Fear Not Be Able To Pay" NA "Fear Not Be Able To Pay" NA ...

str(gotL)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1480 obs. of 21 variables:
##  $ household_id2      : num 1.01e+16 1.01e+16 1.01e+16 1.01e+16 1.01e+16 ...
##  $ got_loan           : chr "Yes" "Yes" "Yes" "Yes" ...
##  $ rural              : chr "Rural" "Rural" "Rural" "Rural" ...
##  $ hhsize             : num 3 4 5 9 7 7 8 7 10 8 ...
##  $ region             : chr "Tigray" "Tigray" "Tigray" "Tigray" ...
##  $ gender             : chr "Female" "Male" "Male" "Male" ...
##  $ age                : num 30 71 53 52 56 38 44 47 52 37 ...
##  $ young_children     : num 1 1 3 4 3 4 4 3 2 6 ...
##  $ working_age_adults : num 2 1 2 6 3 3 4 4 8 2 ...
##  $ max_education      : num 4 8 5 8 7 7 8 6 10 1 ...
##  $ number_assets      : num 6 34 14 16 9 14 24 15 22 10 ...
##  $ borrowed_from      : chr "Relative" "Relative" "Relative" "Other (specify)" ...
##  $ borrowed_from_other: chr NA NA NA "Cooperatives" ...
##  $ loan_purpose       : chr "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" "Purchase Agricultural Inputs for Food Crop" ...
##  $ loan_startmonth    : chr "March" "November" "November" "June" ...
##  $ loan_startyear     : num 2004 2006 2006 2005 2005 ...
##  $ loan_repaid        : chr "Yes" "Yes" "No" "No" ...
##  $ loan_endmonth      : chr NA NA "June" "March" ...
##  $ loan_endyear       : num NA NA 2006 2006 2006 ...
##  $ loan_amount        : num 1000 800 2500 2653 1600 ...
##  $ loan_interest      : num 300 0 1750 325 300 1750 290 300 300 268 ...

It is important to ensure that all variables we expect to be numerical (numbers) are coded as num, and in this case, they are. You can see that there are many variables that are coded as character (chr) variables because they are text (for example gender or region), but since we can use these variables to group data by category, we will use as.factor to change them into categorical (factor) variables for later use.

Instead of converting each character variable to a factor variable individually, say allHH$gender <- as.factor(allHH$gender), we use the piping operator (%>%) from the tidyverse package to do this step in one go. (For a more detailed introduction to piping, see the University of Manchester’s Econometric Computing Learning Resource (https://tinyco.re/5531433).)

We take allHH and use the mutate_if function, which applies the as.factor function to character variables only. Then we do the same for gotL.

allHH <- allHH %>% mutate_if(is.character, as.factor)

gotL <- gotL %>% mutate_if(is.character, as.factor)

Typing str(allHH) and str(gotL) will confirm that all character variables are now categorical variables.
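For readers who want to see what this conversion does without the pipe, here is a minimal base R sketch of the same idea. The data frame toy and its values are made up for illustration and are not part of the ESS data.

```r
# Hypothetical toy data standing in for allHH (not the ESS data)
toy <- data.frame(
  gender = c("Female", "Male", "Female"),
  hhsize = c(8, 3, 4),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

# gender should now be a factor, while hhsize stays numeric
str(toy)
```

The piped version in the walk-through is shorter, but the logic is the same: test each column, then apply as.factor only where the test is TRUE.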


1 The data is already in a format clean enough to use, so we will begin by summarizing the information in the ‘All households’ tab, starting with region and household characteristics.

(a) Create a table showing the proportion of households that lived in each region and area type, with region as the row variable and rural as the column variable. (For help on creating tables, see R walk-through 3.3 (page 59).)

(b) Use the gender variable to find what percentage of household heads were female.

(c) Create an appropriate summary table for the variables hhsize, gender, age, young_children, working_age_adults, max_education, and number_assets. (You may find it helpful to refer to R walk-through 2.7 (page 45) in Empirical Project 2 for one possible format to use.)

(d) Write a short paragraph describing the information in your tables for 1(c).

R WALK-THROUGH 9.2

Creating summary tables

In order to get the proportions of households living in large towns, small towns, or rural areas (encoded in the variable rural), we use the table function. The counts (number) of households in the respective regions and area types are contained in stab1. Running this table through the prop.table() function changes the values from counts to proportions.

The optional input (, 1) makes the proportions for each row add to one. (You could see what happens when you leave this option out, or if you choose the value 2.) To obtain detailed information, use the command ?prop.table.

# Control how many digits are printed
options(digits = 3)

stab1 <- table(allHH$region, allHH$rural)

prop.table(stab1, 1)

##
##                   Large town (urban)  Rural Small town (urban)
##   Addis Ababa                 1.0000 0.0000             0.0000
##   Afar                        0.0956 0.7574             0.1471
##   Amhara                      0.2176 0.6692             0.1132
##   Benshagul Gumuz             0.0000 0.9040             0.0960
##   Diredwa                     0.4685 0.5315             0.0000
##   Gambelia                    0.1154 0.8000             0.0846
##   Harari                      0.2727 0.7273             0.0000
##   Oromia                      0.2831 0.6070             0.1098
##   SNNP                        0.1826 0.7270             0.0905
##   Somalie                     0.1552 0.7552             0.0897
##   Tigray                      0.3670 0.5628             0.0701
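To see what the second input to prop.table changes, it can help to try all three variants on a tiny table. The counts below are invented for illustration; only the prop.table calls mirror the walk-through.

```r
# Hypothetical counts: rows are regions, columns are area types
counts <- as.table(matrix(
  c(10, 30, 20, 40), nrow = 2,
  dimnames = list(region = c("A", "B"), area = c("Rural", "Urban"))
))

by_row  <- prop.table(counts, 1)  # each row sums to one
by_col  <- prop.table(counts, 2)  # each column sums to one
overall <- prop.table(counts)     # all cells together sum to one

rowSums(by_row)
colSums(by_col)
sum(overall)
```

Row proportions answer ‘within a region, where do households live?’, while column proportions answer ‘within an area type, which regions do households come from?’. Question 1(a) asks for the first of these.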

Let’s use a similar approach to calculate the percentage of households with female heads (encoded in the variable gender).

stab2 <- table(allHH$gender)

prop.table(stab2)

##
## Female   Male
##  0.304  0.696

As shown, 30.4% of households have a female head.

We need to provide summary statistics for a range of variables. Most of these variables are numeric variables, but one, gender, is a factor variable. For the latter, we use the summary function. For the numeric variables we use the favstats function, which is part of the mosaic package. We could also use the summary function, but it does not provide the standard deviations that we need.

The summary statistics for the numeric variables are generated using the lapply (list apply) function. Inside the function we specify the subset of variables we are interested in, subset(allHH, select = sel_q), and specify the function we want to apply to these variables, namely the favstats function.


# Load the mosaic package
library(mosaic)

summary(allHH$gender)

## Female   Male   NA's
##   1599   3662      1

# Create a list of the numeric variable names
sel_q <- c('hhsize', 'age', 'young_children',
  'working_age_adults', 'max_education', 'number_assets')

lapply(subset(allHH, select = sel_q), favstats)

## $hhsize
##  min Q1 median Q3 max mean  sd    n missing
##    1  3      4  6  16 4.58 2.4 5260       2
##
## $age
##  min Q1 median Q3 max mean   sd    n missing
##    3 32     42 55  99 44.2 15.6 5253       9
##
## $young_children
##  min Q1 median Q3 max mean   sd    n missing
##    0  0      2  3  10 1.89 1.71 5262       0
##
## $working_age_adults
##  min Q1 median Q3 max mean   sd    n missing
##    0  2      2  3  10 2.58 1.52 5262       0
##
## $max_education
##  min Q1 median Q3 max mean   sd    n missing
##    0  2      6 10  30 7.53 7.28 5262       0
##
## $number_assets
##  min Q1 median Q3 max mean   sd    n missing
##    0  5      9 18 203 14.9 17.2 5262       0
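If the mosaic package is not available, a similar (smaller) summary can be put together in base R. This is a sketch on made-up numbers, not a replacement for favstats: it reports only the mean, standard deviation, and number of missing values.

```r
# Hypothetical toy data with some missing values (not the ESS data)
toy <- data.frame(hhsize = c(1, 3, 4, NA, 6),
                  age    = c(30, 42, 55, 28, NA))

# One row per variable: mean, standard deviation, and count of missing values.
# na.rm = TRUE drops missing values variable by variable.
stats <- t(sapply(toy, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    missing = sum(is.na(x)))
}))

stats
```

As in the favstats output, reporting the number of missing values next to each mean makes it clear how many households each statistic is based on.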

Now that we have an idea of what our data looks like, we will move on to identifying households that are potentially excluded from the credit market or are credit constrained. The former are households that find it impossible to borrow, and the latter are households that can only borrow on unfavourable terms (see Section 9.10 (https://tinyco.re/5630844) of Economy, Society, and Public Policy).


The variables in our dataset that are related to this issue are did_not_apply and loan_rejected. Later we will also look at the responses given in the variables reason_not_apply1 and reason_not_apply2.

2 Using the ‘All households’ dataset:

(a) Create a frequency table with did_not_apply as the row variable and loan_rejected as the column variable. Include all ‘NA’ as a separate row.

(b) Looking at these two variables, explain why some observations should be excluded and remove them from the dataset. Also remove all households with missing information for one or more of these variables. Of the non-excluded observations, what percentage of households applied for a loan over the past 12 months? Of those households, what percentage were successful?

(c) For the resulting categories in the frequency table, explain whether the households in that category can be described as credit constrained, credit excluded, or both.

R WALK-THROUGH 9.3

Making frequency tables for loan applications and outcomes

The easiest way to make a frequency table is to use the table function. Note that we nested the table() function in the addmargins function to obtain row and column totals.

By default, the table function excludes missing (NA) values. Consulting the help function (type ?table in the command window) will show that the option useNA = "always" includes these values.

stab3 <- addmargins(table(allHH$did_not_apply,
  allHH$loan_rejected,
  useNA = "always", dnn = c("Applied?", "Rejected?")))

stab3

##                Rejected?
## Applied?          No  Yes <NA>  Sum
##   Applied       1363  201    1 1565
##   Did not apply 3632   24    2 3658
##   <NA>            37    2    0   39
##   Sum           5032  227    3 5262


Here, we decide to exclude the 24 households that indicated that they did not apply for a loan, but also indicated that they were refused a loan. This is a contestable decision, as it results in excluding more than 10% of the households that indicated that they were refused a loan (24 out of 227). We shall also remove all observations that have missing data for either of these two questions.

Now we create the dataset with the non-missing data only (allHHc).

# Remove NAs in did_not_apply and loan_rejected
allHHc <- subset(allHH, !is.na(allHH$did_not_apply))
allHHc <- subset(allHHc, !is.na(allHHc$loan_rejected))

# Show the number of observations
nrow(allHHc)

## [1] 5220

At this stage we have dropped 42 observations. Now we delete the 24 observations that gave a nonsensical answer.

allHHc <- subset(allHHc,
  !(allHHc$loan_rejected == "Yes" &
    allHHc$did_not_apply == "Did not apply"))

# Show the number of observations
nrow(allHHc)

## [1] 5196
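The two cleaning steps can be rehearsed on a handful of invented rows before running them on the full dataset. The sketch below uses base R’s complete.cases instead of two separate is.na calls; the toy data frame is hypothetical.

```r
# Hypothetical answers from five households (not the ESS data)
toy <- data.frame(
  did_not_apply = c("Applied", "Did not apply", "Did not apply", NA, "Applied"),
  loan_rejected = c("No", "Yes", "No", "No", NA),
  stringsAsFactors = FALSE
)

# Step 1: keep rows that have an answer to both questions
clean <- toy[complete.cases(toy[, c("did_not_apply", "loan_rejected")]), ]

# Step 2: drop the contradictory answer (did not apply, yet was rejected)
contradictory <- clean$did_not_apply == "Did not apply" &
  clean$loan_rejected == "Yes"
clean <- clean[!contradictory, ]

nrow(toy)    # number of rows before cleaning
nrow(clean)  # number of rows after cleaning
```

Doing the missing-value step first matters here: once the NAs are gone, the logical condition in step 2 can never evaluate to NA, so the negation behaves as expected.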

We are left with 5,196 observations. Let’s recreate the frequency table, but this time adding the prop.table function to calculate proportions. (To obtain percentages, multiply the proportions by 100.)

stab4 <- addmargins(prop.table(table(allHHc$did_not_apply,
  allHHc$loan_rejected,
  dnn = c("Applied?", "Rejected?"))))

stab4

##                Rejected?
## Applied?            No    Yes    Sum
##   Applied       0.2623 0.0387 0.3010
##   Did not apply 0.6990 0.0000 0.6990
##   Sum           0.9613 0.0387 1.0000
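The two percentages asked for in Question 2(b) follow directly from the counts in the first frequency table, once the excluded rows are left out. The short calculation below is one way to check them; the variable names are chosen here for illustration.

```r
# Counts taken from the frequency table of applications and outcomes,
# after dropping missing and contradictory answers
successful <- 1363   # applied and not rejected
denied     <- 201    # applied and rejected
no_apply   <- 3632   # did not apply

total   <- successful + denied + no_apply   # 5196 households
applied <- successful + denied              # 1564 households

share_applied <- 100 * applied / total        # % of households that applied
share_success <- 100 * successful / applied   # % of applicants who succeeded

round(c(applied = share_applied, successful = share_success), 1)
```

Note that the success rate uses applicants, not all households, as its denominator. Dividing 1,363 by 5,196 instead would give the share of all households that obtained a loan, which is a different question.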

To create operational categories to use throughout this project, we will label households as either:

• ‘successful’: households that applied for a loan and were given the loan
• ‘denied’: households that applied but were not given the loan
• ‘did not apply’: households that did not apply for a loan.

You should note that the ‘denied’ households are only a subset of the credit-excluded households, as there will be households that are credit excluded and do not even apply for a loan. One could, for instance, reason that households who answered ‘Inadequate Collateral’ or ‘Do Not Know Any Lender’ are also likely to be credit excluded.

3 Using the subset of data from Question 2(b):

(a) Create a new variable called HH_status with the above categories.

(b) Create a new variable discouraged_borrower that takes the value 1 if the household did not apply for a loan because it believed that it would not receive a loan (answered ‘Believe Would Be Refused’ in reason_not_apply1 or reason_not_apply2). How many households (and what percentage) are discouraged borrowers? (Note: This is a fairly narrow definition of ‘discouraged’ and one could easily argue that other criteria should also be considered under this label.)

(c) Create a new variable credit_constrained that takes the value 1 (or yes) for households that gave a reason for not applying other than ‘NA’, ‘Other’, or ‘Have Adequate Farm’ in either of the two questions reason_not_apply1 or reason_not_apply2, and 0 otherwise. For example, a household that answers ‘Have Adequate Farm’ in reason_not_apply1 and ‘Do Not Know Any Lender’ in reason_not_apply2 would be classified as credit constrained, because one of its two answers is a qualifying reason. How many households (and what percentage) are credit constrained?

(d) Create a frequency table showing the most important reason for not applying for a loan, and another showing the second most important reason for not applying. What were the most common reasons for not applying?

Note that arguably other answers are also indicative of being credit constrained, so the criterion we use identifies only a subset of all households that are credit constrained. For example, one could include households that have been denied a loan, and it is also likely that some households that have been granted a loan are in fact credit constrained.


R WALK-THROUGH 9.4

Creating variables to classify households

Let’s first create the HH_status variable. We set the values of HH_status to "not applied", then use logical indexing to change all entries where households applied for a loan (allHHc$did_not_apply == "Applied") and were accepted (allHHc$loan_rejected == "No") to "successful", and all entries where households applied but were denied (allHHc$loan_rejected == "Yes") to "denied".

# This is the default category.
allHHc$HH_status <- "not applied"

allHHc$HH_status[allHHc$did_not_apply == "Applied" &
  allHHc$loan_rejected == "No"] <- "successful"

allHHc$HH_status[allHHc$did_not_apply == "Applied" &
  allHHc$loan_rejected == "Yes"] <- "denied"

# Change from character to factor variable
allHHc$HH_status <- factor(allHHc$HH_status)

Typing summary(allHHc$HH_status) should give you numbers that correspond to the frequency table from Question 2.
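The same three-way classification can also be written with nested ifelse calls instead of a default value followed by logical indexing. This is an alternative sketch on made-up rows, not the method used in the walk-through.

```r
# Hypothetical answers from three households (not the ESS data)
toy <- data.frame(
  did_not_apply = c("Applied", "Applied", "Did not apply"),
  loan_rejected = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Nested ifelse: label non-applicants first, then split applicants by outcome
toy$HH_status <- ifelse(
  toy$did_not_apply == "Did not apply", "not applied",
  ifelse(toy$loan_rejected == "Yes", "denied", "successful")
)

# Change from character to factor variable
toy$HH_status <- factor(toy$HH_status)

table(toy$HH_status)
```

Both approaches rely on the contradictory and missing answers having been removed first; otherwise some households would end up in the wrong category or as NA.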

Let’s continue by using the same steps to make the discouraged_borrower variable.

# This is the default category.
allHHc$discouraged_borrower <- "No"

allHHc$discouraged_borrower[allHHc$reason_not_apply1 ==
  "Believe Would Be Refused"] <- "Yes"

allHHc$discouraged_borrower[allHHc$reason_not_apply2 ==
  "Believe Would Be Refused"] <- "Yes"

# Change from character to factor variable
allHHc$discouraged_borrower <-
  factor(allHHc$discouraged_borrower)
summary(allHHc$discouraged_borrower)

##   No  Yes
## 4608  588

To make the credit_constrained variable, we use the levels function to check all the possible answers to the reason_not_apply1 variable. We store these answers in the object sel_ans.


sel_ans <- levels(allHHc$reason_not_apply1)
sel_ans

##  [1] "Believe Would Be Refused"  "Do Not Know Any Lender"
##  [3] "Do Not Like To Be In Debt" "Fear Not Be Able To Pay"
##  [5] "Have Adequate Farm"        "Inadequate Collateral"
##  [7] "No Farm or Business"       "Other (Specify)"
##  [9] "Too Expensive"             "Too Much Trouble"

Of these reasons, only reasons [5] and [8] do not lead to a conclusion that a household is credit constrained, so we remove them from sel_ans.

# Remove reasons 5 and 8
sel_ans <- sel_ans[-c(5, 8)]

# This is the default category, as households that did not
# provide any reasons are classified as not credit
# constrained.
allHHc$credit_constrained <- "No"

allHHc$credit_constrained[allHHc$reason_not_apply1
  %in% sel_ans] <- "Yes"

allHHc$credit_constrained[allHHc$reason_not_apply2
  %in% sel_ans] <- "Yes"

# Change from character to factor variable
allHHc$credit_constrained <-
  factor(allHHc$credit_constrained)
summary(allHHc$credit_constrained)

##   No  Yes
## 2184 3012

The use of %in% in the selection criterion allHHc$reason_not_apply1 %in% sel_ans is a very useful programming technique that you can use to select data according to a list of values/variables. In this case, sel_ans contains all the answers that we associate with a credit-constrained household.

allHHc$reason_not_apply1 %in% sel_ans gives an outcome of TRUE if the answer to reason_not_apply1 is one of the answers in sel_ans; the assignment then sets the value of credit_constrained to ‘Yes’ for those observations.
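A small example shows how %in% behaves, including for missing answers. The answers and the shortened sel_ans list below are invented for illustration.

```r
# Hypothetical answers; NA stands for a household that gave no reason
reason1 <- c("Too Expensive", "Have Adequate Farm", NA, "Other (Specify)")
reason2 <- c(NA, "Do Not Know Any Lender", NA, NA)

# Answers treated as signs of being credit constrained in this sketch
sel_ans <- c("Too Expensive", "Do Not Know Any Lender")

# %in% returns FALSE (not NA) for missing values,
# so households without an answer need no special handling
constrained <- ifelse(reason1 %in% sel_ans | reason2 %in% sel_ans,
                      "Yes", "No")
constrained
```

The second household is labelled ‘Yes’ because its second answer is in sel_ans, even though its first answer is not. This is what ‘in either of the two questions’ means in practice. Using == instead of %in% would return NA for the missing answers, which is one reason %in% is convenient here.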


stab4 <- addmargins(prop.table(table(
  allHHc$credit_constrained, allHHc$discouraged_borrower,
  useNA = "ifany", dnn = c(
    "Constrained?", "Discouraged?"))))

stab4

##             Discouraged?
## Constrained?    No   Yes   Sum
##          No  0.420 0.000 0.420
##          Yes 0.467 0.113 0.580
##          Sum 0.887 0.113 1.000

# Required for the use of the kable function
library(knitr)

print("Reasons not to apply 1")

## [1] "Reasons not to apply 1"

# 'kable' is optional but formats tables more neatly.
table5 <- prop.table(table(allHHc$reason_not_apply1))

kable(table5[rev(order(table5))])

Var1                         Freq
--------------------------  ------
Do Not Like To Be In Debt    0.191
Have Adequate Farm           0.185
Fear Not Be Able To Pay      0.170
Believe Would Be Refused     0.116
No Farm or Business          0.102
Do Not Know Any Lender       0.068
Too Expensive                0.049
Inadequate Collateral        0.045
Other (Specify)              0.041
Too Much Trouble             0.033


print("Reasons not to apply 2")

## [1] "Reasons not to apply 2"

table6 <- prop.table(table(allHHc$reason_not_apply2))

kable(table6[rev(order(table6))])

Var1                         Freq
--------------------------  ------
Fear Not Be Able To Pay      0.277
Do Not Like To Be In Debt    0.243
Inadequate Collateral        0.087
Believe Would Be Refused     0.087
Too Expensive                0.068
Do Not Know Any Lender       0.064
Too Much Trouble             0.062
Have Adequate Farm           0.050
No Farm or Business          0.040
Other (Specify)              0.024
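The ordering step table5[rev(order(table5))] can also be written with the sort function, which some readers may find easier to read. The sketch below uses six invented answers rather than the survey data.

```r
# Hypothetical reasons given by six households (not the ESS data)
reasons <- c("Too Expensive", "Too Much Trouble", "Too Expensive",
             "Other", "Too Expensive", "Too Much Trouble")

shares <- prop.table(table(reasons))

# sort() with decreasing = TRUE orders the table from largest to smallest share
sorted <- sort(shares, decreasing = TRUE)
round(sorted, 3)
```

Either way, sorting by share rather than alphabetically puts the most common reasons at the top, which is what Question 3(d) asks about.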

We will now analyse the stated reasons for wanting a loan, comparing those households that were successful (HH_status equal to ‘successful’) with those that were not successful (HH_status equal to ‘denied’).

4 For both groups, create one table showing the proportion of households for each loan purpose. You will realize that in the ‘All households’ dataset, the reason for all ‘successful’ loans is ‘Other’. For that reason, you should use the ‘Got loan’ dataset to retrieve the loan purpose for successful loans. Was the purpose of loans for denied and successful borrowers similar? (Hint: It may help to think about the broad categories of spending on consumption and investment.)

R WALK-THROUGH 9.5

Making frequency tables to compare proportions

Some of the data is in the allHH dataset, while the rest is in the gotL dataset, both of which we imported in R walk-through 9.1 (page 240). We will combine that information into one new dataset called loan_data, which we then use to produce the table.


sel_allHHc <- subset(allHHc, subset = (
  allHHc$HH_status %in% c("successful", "denied")))

# Removes the unused 'not applied' level
sel_allHHc <- droplevels(sel_allHHc)

prop.table(table(
  sel_allHHc$loan_purpose, sel_allHHc$HH_status,
  dnn = c("Loan Purpose", "Loan")), 2)

##                                             Loan
## Loan Purpose                                 denied successful
##   Business Start-up Capital                  0.2615     0.0000
##   Expanding Business                         0.1385     0.0000
##   Other (Specify)                            0.2718     1.0000
##   Purchase Agricultural Inputs for Food Crop 0.2103     0.0000
##   Purchase House/Lease Land                  0.0308     0.0000
##   Purchase Inputs for other Crops            0.0615     0.0000
##   Purchase Non-farm Inputs                   0.0256     0.0000

This reveals a particular feature of the data, namely that for successful borrowers, the allHHc dataset does not contain all the useful information, as every successful household has ‘Other (Specify)’ in the loan_purpose variable. There is more useful information on loan purpose in the gotL data, so we will extract the loan_purpose variable for unsuccessful households from the allHHc dataset, and the equivalent information for successful borrowers from the gotL dataset.

# Select unsuccessful households from allHHc
loan_no <- subset(allHHc, allHHc$HH_status == "denied",
  select = c("loan_purpose", "HH_status"))

# Select loan purpose for successful households from gotL
loan_yes <- subset(gotL, gotL$got_loan == "Yes",
  select = "loan_purpose")

loan_yes$HH_status <- "successful"


# Combine into one dataset
loan_data <- rbind(loan_no, loan_yes)

# Remove the unused 'not applied' level
loan_data <- droplevels(loan_data)

kable(prop.table(table(
  loan_data$loan_purpose, loan_data$HH_status,
  dnn = c("Loan Purpose", "Loan")), 2))

                                              denied   successful
-------------------------------------------  -------  -----------
Business Start-up Capital                      0.262        0.154
Expanding Business                             0.138        0.081
Other (Specify)                                0.272        0.027
Purchase Agricultural Inputs for Food Crop     0.210        0.300
Purchase House/Lease Land                      0.031        0.023
Purchase Inputs for other Crops                0.062        0.098
Purchase Non-farm Inputs                       0.026        0.115
For consumption and personal expenses          0.000        0.201

conditional mean: An average of a variable, taken over a subgroup of observations that satisfy certain conditions, rather than all observations.

5 Using the information in the ‘All households’ and ‘Got loan’ tabs, for ‘successful’ and ‘denied’ households:

(a) Create a table as shown in Figure 9.1 to compare the averages of the specified household characteristics.

(b) For each characteristic, explain how it may affect a household’s ability to get a loan (ceteris paribus).

(c) Looking at your table from Question 5(a), discuss whether you see this pattern in the data. (For example, are successful borrowers older/younger on average than denied borrowers?)

(d) Now try conditioning on the variable rural or region and discuss how (if at all) your results change.


R WALK-THROUGH 9.6

Calculating differences in household characteristicsHere we show how to get average characteristics conditional onHH_status using the mean function. With the mosaic package loaded

(as we have done in R walk-through 9.2 (page 243)), we can use theconditioning symbol | in the mean function to condition according toHH_status .

# Show the number of observations in each category
summary(allHHc$HH_status)

##      denied not applied  successful
##         201        3632        1363

# Mean household size conditional on credit status
mean(~hhsize | HH_status, data = allHHc, na.rm = TRUE)

##      denied not applied  successful
##        4.82        4.46        4.87

# Mean max_education (highest education in household), by credit status
mean(~max_education | HH_status, data = allHHc,
  na.rm = TRUE)

##      denied not applied  successful
##        8.00        7.62        7.26
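For readers who do not have mosaic loaded, the same kind of conditional mean can be sketched in base R with tapply. The data frame toy and its values are invented for illustration; only the column names follow the walk-through.

```r
# Hypothetical toy data (not the survey data)
toy <- data.frame(
  HH_status = factor(c("denied", "successful", "successful",
                       "not applied", "denied", "not applied")),
  hhsize = c(4, 6, 5, 3, NA, 5)
)

# Mean of hhsize within each level of HH_status, ignoring NAs
cond_means <- tapply(toy$hhsize, toy$HH_status, mean, na.rm = TRUE)
print(cond_means)
```

Each element of the result is the mean over one subgroup only, which is exactly what the conditioning symbol | requests in the mosaic version.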

Repeat the mean command above for all variables to complete the table.

If we want to repeat this analysis by also splitting the data according to rural or region, we can use the group option (which is only available when mosaic has been loaded).

Household characteristic          Successful   Denied
Age of household head
Highest education in household
Number of assets
Household size
Number of young children
Number of working-age adults

Figure 9.1 Characteristics of successful and denied borrowers.


# Show the number of observations in each category
table(allHHc$rural, allHHc$HH_status)

##
##                      denied not applied successful
##   Large town (urban)     56        1092        332
##   Rural                 128        2236        903
##   Small town (urban)     17         304        128

# Mean HH size, by rural and credit status variables
mean(~hhsize | HH_status, group = rural, data = allHHc,
  na.rm = TRUE)

##          denied.Large town (urban)     not applied.Large town (urban)
##                               3.57                               3.44
##      successful.Large town (urban)                       denied.Rural
##                               3.84                               5.47
##                  not applied.Rural                   successful.Rural
##                               4.98                               5.35
##          denied.Small town (urban)     not applied.Small town (urban)
##                               4.06                               4.28
##      successful.Small town (urban)                 Large town (urban)
##                               4.12                               3.54
##                              Rural                 Small town (urban)
##                               5.10                               4.23

# Mean number of working-age adults, by rural and credit
# status variables
mean(~working_age_adults | HH_status, group = rural,
  data = allHHc, na.rm = TRUE)


##          denied.Large town (urban)     not applied.Large town (urban)
##                               2.36                               2.30
##      successful.Large town (urban)                       denied.Rural
##                               2.42                               2.88
##                  not applied.Rural                   successful.Rural
##                               2.58                               2.91
##          denied.Small town (urban)     not applied.Small town (urban)
##                               3.18                               2.75
##      successful.Small town (urban)                 Large town (urban)
##                               2.54                               2.33
##                              Rural                 Small town (urban)
##                               2.68                               2.71

The same kind of table can also be obtained using the piping operator (%>%) from the tidyverse package. Here we split mean household size by region instead of rural.

stats4 <- allHHc %>%
  group_by(HH_status, region) %>%
  summarize(avg_hhsize = mean(hhsize, na.rm = TRUE)) %>%
  spread(HH_status, avg_hhsize) %>%
  print()

## # A tibble: 11 x 4
##    region          denied not applied successful
##    <fct>            <dbl>       <dbl>      <dbl>
##  1 Addis Ababa       2.50        4.00       3.84
##  2 Afar              7.00        4.97       5.69
##  3 Amhara            4.42        3.86       4.50
##  4 Benshagul Gumuz   5.00        4.70       5.41
##  5 Diredwa           4.57        4.04       4.28
##  6 Gambelia          5.22        5.04       5.50
##  7 Harari            3.75        5.02       3.74
##  8 Oromia            4.91        4.53       5.21
##  9 SNNP              5.20        4.88       5.09
## 10 Somalie           4.33        5.15       5.26
## 11 Tigray            4.62        4.01       4.78

To understand what the spread command does, run the above code without it and see the difference.
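The difference between the two layouts can also be sketched in base R. This is only an illustration on invented data: aggregate returns a "long" result (one row per combination of groups), while tapply with two grouping variables returns a "wide" one (one grouping variable in the rows, the other in the columns).

```r
# Hypothetical toy data (not the survey data)
toy <- data.frame(
  region = c("A", "A", "B", "B", "A", "B"),
  HH_status = c("denied", "successful", "denied",
                "successful", "successful", "denied"),
  hhsize = c(4, 6, 3, 5, 2, 5)
)

# Long format: one row per (region, HH_status) combination,
# similar to the result of group_by() + summarize() before spread()
long <- aggregate(hhsize ~ region + HH_status, data = toy, FUN = mean)
print(long)

# Wide format: regions in rows, credit status in columns,
# similar to the result after spread()
wide <- tapply(toy$hhsize, list(toy$region, toy$HH_status), mean)
print(wide)
```

Both objects hold the same four means; only the arrangement differs.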

6 Using Figure 9.1 (page 255), without conditioning on rural or region:

(a) Calculate the difference in means (‘successful’ borrowers minus ‘denied’ borrowers).

(b) Calculate the 95% confidence interval for the difference in means between the two subgroups (‘successful’ minus ‘denied’). (See Part 8.3 (page 230) of Empirical Project 8 for help on how to do this.)

(c) Plot a column chart showing the differences on the vertical axis (sorted from smallest to largest), and household characteristics on the horizontal axis. Add the confidence intervals from Question 6(b) to the chart.

(d) Interpret your findings.

R WALK-THROUGH 9.7

Calculating confidence intervals and adding them to a chart

To repeat the same set of calculations for a list of variables, we will use the piping operator (%>%). First we create a list of these variables (called sel_var).

sel_var <- c("age", "max_education", "number_assets",
  "hhsize", "young_children", "working_age_adults")

Now we use the age variable as an example.

stats5 <- allHHc %>%
  # Filters out the 'did not apply' cases
  filter(HH_status %in% c("denied", "successful")) %>%
  group_by(HH_status) %>%
  summarize(avg_ = mean(age, na.rm = TRUE),
    sd_ = sd(age, na.rm = TRUE),
    n_ = sum(!is.na(age))) %>%
  print()

## # A tibble: 2 x 4
##   HH_status   avg_   sd_    n_
##   <fct>      <dbl> <dbl> <int>
## 1 denied      41.2  12.9   201
## 2 successful  43.4  14.3  1361

Now we use the t.test function to calculate the difference in means between the successful borrowers (sel_success) and the denied borrowers (sel_denied), together with its confidence interval.

# Select the age variable (sel_var[1]) for successful and
# denied borrowers
sel_success <- unlist(allHHc[
  allHHc$HH_status == "successful", sel_var[1]])
sel_denied <- unlist(allHHc[
  allHHc$HH_status == "denied", sel_var[1]])

# The unlist function is needed to get data as a vector
# instead of a dataframe/tibble.
temp <- t.test(sel_success, sel_denied,
  conf.level = 0.95)

The output of this test provides us with the details required. Note that the conf.level = 0.95 option is actually not necessary here, as 0.95 is the default level.
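To see what t.test reports, the interval can be reproduced by hand. The sketch below uses two invented samples and assumes the default settings of t.test (unequal variances, that is, the Welch version of the test); it is an illustration, not part of the original walk-through.

```r
# Hypothetical toy samples (not the survey data)
x <- c(43, 51, 38, 47, 60, 35, 44, 52)   # e.g. ages, successful
y <- c(40, 36, 45, 39, 42)               # e.g. ages, denied

# Difference in means and its standard error (unequal variances)
d  <- mean(x) - mean(y)
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% confidence interval for the difference in means
ci_manual <- d + c(-1, 1) * qt(0.975, df) * se

# The same interval from t.test (Welch test is the default)
ci_ttest <- as.numeric(t.test(x, y)$conf.int)
print(rbind(ci_manual, ci_ttest))
```

The two rows printed at the end should agree, which shows that conf.int is the difference in means plus or minus a t critical value times the standard error.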

We will now do this for all required variables and save the difference in means and the confidence interval values in a dataframe so we can plot this information.

# Create the dataframe to save the data used for the chart
temp_plot <- data.frame(name = sel_var,
  dmean = NA, yhigh = NA, ylow = NA)

for (i in sel_var){
  sel_success <- unlist(allHHc[
    allHHc$HH_status == "successful", i])
  sel_denied <- unlist(allHHc[
    allHHc$HH_status == "denied", i])
  temp <- t.test(sel_success, sel_denied,
    conf.level = 0.95)
  # Mean difference
  temp_plot$dmean[temp_plot$name == i] <-
    temp$estimate[1] - temp$estimate[2]
  # Lower limit of the confidence interval
  temp_plot$ylow[temp_plot$name == i] <- temp$conf.int[1]
  # Upper limit of the confidence interval
  temp_plot$yhigh[temp_plot$name == i] <- temp$conf.int[2]
}


ggplot(temp_plot, aes(x = name, y = dmean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = ylow, ymax = yhigh),
    width = .2) +
  ylab("Difference in means") + xlab("Variable") +
  theme_bw() +
  ggtitle("Difference in HH characteristics
(successful and denied borrowers)")

Figure 9.2 Column chart showing difference in HH characteristics for successful and denied borrowers.
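Question 6(c) asks for the columns to be sorted from smallest to largest difference. One possible way to do this, sketched here on invented numbers, is to turn the variable names into a factor whose levels follow the size of the difference, since ggplot2 draws a discrete axis in the order of the factor levels. The data frame toy_plot is hypothetical and stands in for temp_plot.

```r
# Hypothetical differences in means (not computed from the survey)
toy_plot <- data.frame(
  name  = c("age", "hhsize", "number_assets"),
  dmean = c(2.2, 0.05, 1.4),
  stringsAsFactors = FALSE
)

# Turn name into a factor whose levels are ordered by dmean;
# a plot that uses name on a discrete axis then follows this order
toy_plot$name <- factor(toy_plot$name,
  levels = toy_plot$name[order(toy_plot$dmean)])
print(levels(toy_plot$name))
```

Applying the same step to the dataframe used for the chart, before calling ggplot, should give a sorted column chart.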

7 Using the information in the ‘All households’ dataset:

(a) Create a table similar to Figure 9.1 (page 255), but with additional columns for discouraged borrowers and credit-constrained households.

(b) Compare the means across the four groups and discuss any similarities/differences you observe (you do not need to do any formal calculations).


R WALK-THROUGH 9.8

Calculating conditional means

We are interested in the means of a range of variables for different subgroups. Two subgroups are mutually exclusive (HH_status == "successful" and HH_status == "denied"), while the others (credit_constrained == "Yes" and discouraged_borrower == "Yes") are partially overlapping subgroups of the data. Our strategy is to create a temporary dataframe (sel_allHHc) that only contains the relevant observations and the relevant variables. Then we can use the colMeans function to calculate the required means.

# List variables we are interested in
sel_var <- c("age", "max_education", "number_assets",
  "hhsize", "young_children", "working_age_adults")

sel_allHHc <- allHHc[allHHc$HH_status=="successful", sel_var]

paste("successful (n = ", nrow(sel_allHHc), ")")

## [1] "successful (n = 1363 )"

colMeans(sel_allHHc, na.rm = TRUE)

##                age      max_education      number_assets
##              43.37               7.26              15.88
##             hhsize     young_children working_age_adults
##               4.87               2.09               2.75

sel_allHHc <- allHHc[allHHc$HH_status == "denied", sel_var]

paste("denied (n = ", nrow(sel_allHHc), ")")

## [1] "denied (n = 201 )"

colMeans(sel_allHHc, na.rm = TRUE)

##                age      max_education      number_assets
##              41.21               8.00              14.46
##             hhsize     young_children working_age_adults
##               4.82               2.22               2.76

sel_allHHc <- allHHc[allHHc$discouraged_borrower == "Yes", sel_var]

paste("discouraged (n = ", nrow(sel_allHHc), ")")

## [1] "discouraged (n = 588 )"

colMeans(sel_allHHc, na.rm = TRUE)

##                age      max_education      number_assets
##              43.28               6.50              10.16
##             hhsize     young_children working_age_adults
##               4.65               2.03               2.49

sel_allHHc <- allHHc[allHHc$credit_constrained == "Yes", sel_var]

paste("constrained (n = ", nrow(sel_allHHc), ")")

## [1] "constrained (n = 3012 )"

colMeans(sel_allHHc, na.rm = TRUE)

##                age      max_education      number_assets
##              44.84               7.14              13.54
##             hhsize     young_children working_age_adults
##               4.44               1.81               2.49
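The four blocks above repeat one pattern, so they can be collected in a single loop. The sketch below uses invented data and a hypothetical list called groups. It also wraps each condition in which(), because in base R a logical row index containing NA returns a row of NAs, which would inflate nrow. Whether the survey variables contain such missing values is not stated here, so treat this as a precaution rather than a correction.

```r
# Hypothetical toy data (not the survey data)
toy <- data.frame(
  HH_status = c("successful", "denied", "not applied", "not applied"),
  discouraged_borrower = c("No", "No", "Yes", NA),
  age = c(40, 50, 30, 60),
  hhsize = c(4, 6, 5, 3)
)
vars <- c("age", "hhsize")

# One logical condition per subgroup; subgroups may overlap
groups <- list(
  successful  = toy$HH_status == "successful",
  denied      = toy$HH_status == "denied",
  discouraged = toy$discouraged_borrower == "Yes"
)

# which() drops NA conditions, so rows with an unknown status
# are not counted as members of the subgroup
means <- sapply(groups, function(cond) {
  colMeans(toy[which(cond), vars, drop = FALSE], na.rm = TRUE)
})
print(means)
```

The result has one column per subgroup and one row per variable, which matches the layout of the table requested in Question 7(a).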


selection bias: An issue that occurs when the sample or data observed is not representative of the population of interest. For example, individuals with certain characteristics may be more likely to be part of the sample observed (such as students being more likely than CEOs to participate in computer lab experiments).

A study on access to loans in Ethiopia (https://tinyco.re/9835958) looked at the relationship between loan amount and household characteristics. The authors needed to account for selection bias, because positive loan amounts are only observed for successful borrowers. If we only had data for successful borrowers, then our sample would not be representative of the population of interest (all households), so we would have to interpret our results with caution. In our case, we have information about all households, so we can compare observable characteristics to see whether successful borrowers are similar to other households.
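A small simulation can make the problem concrete. Everything in this sketch is assumed for illustration: the population is simulated, and the selection rule (only households with above-median assets get a loan) is deliberately stark and is not taken from the study.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical population: number of assets per household
assets <- rnorm(1000, mean = 14, sd = 5)

# Assumed selection rule: only households with above-median
# assets obtain a loan (a deliberately stark simplification)
got_loan <- assets > median(assets)

mean_all      <- mean(assets)            # population of interest
mean_observed <- mean(assets[got_loan])  # what a loans-only sample shows
print(c(all = mean_all, observed = mean_observed))
```

A researcher who only saw the households with loans would overstate average assets in the population, because the observed group was selected on the very characteristic being measured.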

8 Think of another example where there might be selection bias, in other words, where the data we observe is not representative of the population of interest.

PART 9.2 HOUSEHOLDS THAT GOT A LOAN

Learning objectives for this part

• analyse the characteristics of loans obtained by successful borrowers.

For households that successfully got a loan, we will look at:

• purpose of the loan
• duration of the loan(s)
• loan amount and interest rate charged
• who the household borrowed from.

We will also see if there are any relationships between these loan characteristics and household characteristics.

Now we will use the variables relating to the loan start and end dates to calculate the duration of the loan. Before using these variables, we need to check that the variable entries make sense. Some of this information could be recorded incorrectly (for example, the year is missing a digit, or the month is a number rather than a word).

1 Using the ‘Got loan’ dataset:

(a) Check the variables loan_startmonth, loan_startyear, loan_endmonth, and loan_endyear, and replace the entries that are recorded incorrectly with either the correct entry (if possible), or as blank (if not possible to infer the correct entry). (Note: Some entries are recorded as ‘Pagume’, which corresponds to early September in the Ethiopian calendar.)

(b) To calculate loan duration, combine the month and year variables into one date variable and format them as date variables.

(c) Some of the dates (months or years) are missing. Calculate the percentage of the data that is missing and explain whether you think missing data is a serious problem.

An article by the Institute for Work and Health explains selection bias in more detail, and why it is a problem encountered by all areas of research.


(d) Create a new variable containing the loan duration (end date minus start date), which will be measured in days.

(e) You will notice that some dates were recorded incorrectly, with the start date later than the end date. We could either treat these entries as missing or swap the start and end dates. Create two new variables for loan duration, one with all negative entries recorded as blank, and one with negative entries replaced as positive numbers.

(f) For this project we will define a long-term loan as lasting more than a year (365 days), which we will use in later questions. For this definition, use the loan_length variable that converts negative loan lengths to positive ones (see Question 1(e) above). Create an indicator variable called long_term that equals 1 if the loan was long term, and 0 otherwise. What percentage of loans were long term?

R WALK-THROUGH 9.9

Data cleaning and summarizing loan characteristics

We start by cleaning up the loan dates. We have information on start month and year as well as end month and year. Let’s look at these in turn. The structure of the dataframe (str(gotL)) indicates that the start and end year are numeric variables, but the months are factor variables with month names (for example ‘April’).

Let’s first look at the years by creating a scatterplot.

ggplot(gotL, aes(x = loan_endyear, y = loan_startyear)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan start and end year")


Figure 9.3 Scatterplot showing loan start and end year.

We can see that there are three observations that have very low (< 500) start or end year values, which does not make sense. We will replace these with ‘NA’, but leave the original data untouched and create a new dataset called gotLc, where the ‘c’ indicates cleaned data.

gotLc <- gotL

gotLc$loan_startyear[gotLc$loan_startyear < 500] <- NA
gotLc$loan_endyear[gotLc$loan_endyear < 500] <- NA

ggplot(gotLc, aes(x = loan_endyear, y = loan_startyear)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan start and end year")


Figure 9.4 Revised scatterplot showing loan start and end year without outliers.

In the top left corner, there is a loan with the start year (2006) after the end year (2003). Clearly this is incorrect, so we should remove this observation when analysing loan periods. However, we wait until we have combined the years with the months as there may be more observations with this issue.

Also, we can only see a small number of points because there are many identical observations (for example loan_startyear of 2006 and loan_endyear of 2006). To see these points you could replace geom_point with geom_jitter in the ggplot command above. Use ?geom_jitter to understand what this option does.

Now let’s look at the values in loan_startmonth.

summary(gotLc$loan_startmonth)

##     April    August  December  February   January      July      June
##        95        63       133       145       176       115       156
##     March       May  November   October September      NA's
##        85       141       115       106       146         4

There is no particular issue with the start months. What about the end months?


summary(gotLc$loan_endmonth)

##     April    August  December  February   January      July      June
##       136        53        49       170        52        34        94
##     March       May  November   October    Pagume September      NA's
##       155        89        33        39         5        37       534

Two things are noteworthy here: there are now many ‘NA’ entries, and some entries are recorded as ‘Pagume’. As described in the task, ‘Pagume’ can be approximated by September. Let’s recode that.

gotLc$loan_endmonth[gotLc$loan_endmonth == "Pagume"] <-
  "September"

Another call of summary(gotLc$loan_endmonth) would confirm that there are no observations with ‘Pagume’ left.

Now we want to calculate the length of the loan, in other words, the number of days between the start and end dates. As we only have months and not days, this will be an approximation. We will create a new variable combining months and years using the paste function, assuming that all loan start and end dates are on the first day of each month.

# We assume that all loan start and end dates are on the
# 1st of the month.
gotLc$loan_startdate <- paste("1",
  gotLc$loan_startmonth, gotLc$loan_startyear)

gotLc$loan_enddate <- paste("1",
  gotLc$loan_endmonth, gotLc$loan_endyear)

For observations with an unknown end date (recall we had more than 500 of these), we will code the date as missing. Currently these are recorded as ‘1 NA NA’. First we need to find which observations have missing data elements for either the loan_startdate or loan_enddate variable. R has very powerful tools to identify text patterns such as this, which you can learn about by searching the Internet for help (you could search for ‘R test whether character contains string’). For example, a useful short introduction (https://tinyco.re/6886629) to using such tools contains one solution to our problem.


See the first 20 observations of loan_enddate .

gotLc$loan_enddate[1:20]

##  [1] "1 NA NA"          "1 NA NA"          "1 June 2006"
##  [4] "1 March 2006"     "1 February 2006"  "1 September 2007"
##  [7] "1 March 2006"     "1 March 2006"     "1 March 2006"
## [10] "1 March 2006"     "1 February 2006"  "1 NA NA"
## [13] "1 August 2007"    "1 August 2007"    "1 February 2006"
## [16] "1 June 2007"      "1 November 2007"  "1 March 2006"
## [19] "1 April 2006"     "1 NA NA"

You can find NAs in observations 1, 2, 12, and 20. The command grep will identify these rows automatically.

# Identify the position of the observations that contain
# NAs
selNA <- grep('NA', gotLc$loan_enddate)
selNA[1:5]

## [1] 1 2 12 20 26

As you can see, grep identifies the first four instances correctly, and we can see that the next missing end date is in observation 26. Now we will replace all these observations as missing values and then convert the non-missing observations to dates, using the as.Date function.

gotLc$loan_enddate[selNA] <- NA

gotLc$loan_enddate <- as.Date(gotLc$loan_enddate,
  format = "%d %B %Y")

The option format = "%d %B %Y" specifies the format that dates are recorded as (for example ‘1 June 2006’), where %B stands for full months (view this page (https://tinyco.re/7793931) for examples of other date formatting options). Now we repeat the same steps for the start date.


# Identify the position of the observations that contain
# NAs
selNA <- grep('NA', gotLc$loan_startdate)

gotLc$loan_startdate[selNA] <- NA

gotLc$loan_startdate <- as.Date(gotLc$loan_startdate,
  format = "%d %B %Y")
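One caveat with %B is that it reads month names in the language of the current R session, so the conversion above presumes an English locale. An alternative, sketched below on invented entries, is to translate month names to numbers with R's built-in month.name vector and then parse a purely numeric date. The vectors month and year are hypothetical stand-ins for the survey variables.

```r
# Hypothetical month and year entries (not the survey data)
month <- c("June", "March", NA, "September")
year  <- c(2006, 2006, 2007, NA)

# month.name is R's built-in vector of English month names,
# so this lookup does not depend on the session locale
month_num <- match(month, month.name)

# Build "year-month-1" strings; incomplete entries contain "NA"
# and fail to parse, so as.Date returns NA for them
dates <- as.Date(paste(year, month_num, 1, sep = "-"),
  format = "%Y-%m-%d")
print(dates)
```

With this approach, entries with a missing month or year become NA in one step, without a separate search for the text ‘NA’.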

Let’s use the is.na function to find out the percentage of observations with missing values for start and/or end date. Here, we used the paste function to print the output as both a number and a percentage.

# Missing start dates
paste(sum(is.na(gotLc$loan_startdate)), "(",
  100 * round(mean(is.na(gotLc$loan_startdate)), 4), "% )")

## [1] "6 ( 0.41 % )"

# Missing end dates
paste(sum(is.na(gotLc$loan_enddate)), "(",
  100 * round(mean(is.na(gotLc$loan_enddate)), 4), "% )")

## [1] "535 ( 36.15 % )"

Now we will add a new variable indicating the length of the loan period. As R knows that loan_startdate and loan_enddate are dates, it recognizes automatically that the difference between two dates should be expressed as the number of days.

gotLc$loan_length <-
  gotLc$loan_enddate - gotLc$loan_startdate

Let’s look at the first few observations.

gotLc[1:5, c("loan_startdate","loan_enddate", "loan_length")]


## # A tibble: 5 x 3
##   loan_startdate loan_enddate loan_length
##   <date>         <date>       <time>
## 1 2004-03-01     NA           <NA>
## 2 2006-11-01     NA           <NA>
## 3 2006-11-01     2006-06-01   -153
## 4 2005-06-01     2006-03-01   273
## 5 2005-06-01     2006-02-01   245

Notice the following:

• Where either of the two dates is missing, the length is missing as well.
• Some loan lengths are negative (for example observation 3), because the recorded end date is before the start date. It could be that the two dates were switched when the data was entered into the system.

This is unfortunate, but is a common feature of real-life data, and you will have to be on the lookout for such occasions.

As required in Question 1, we will create two variants of the loan_length variable: one where we assign missing values to all observations that have negative loan_length, and one where we assume that the problem was the switching of start and end date, so we transform all loan lengths to positive values.

gotLc$loan_length_NA <- gotLc$loan_length

# Assign NA to negative loan lengths
gotLc$loan_length_NA[gotLc$loan_length_NA < 0] <- NA

gotLc$loan_length_abs <- abs(gotLc$loan_length)

Now we can create the long_term variable and look at the number of long-term loans.

gotLc$long_term <- (gotLc$loan_length_abs>365)

summary(gotLc$long_term)

##    Mode   FALSE    TRUE    NA's
## logical     728     215     537

We therefore have about 23% long-term loans (only looking at loans for which we do have date information).


2 Using the variables loan_amount and loan_interest:

(a) Create summary tables to summarize the distribution of loan amount (mean, standard deviation, maximum, and minimum): one using the loan amount, the other using the total amount to repay (loan amount + interest). Make sure to exclude the one observation previously identified as having an extremely high interest rate. Remember to give your tables meaningful titles. Describe any features of the data that you find interesting.

(b) As mentioned earlier, the interest rate is a borrowing condition that can vary widely across households. Here we will take the interest rate to be the interest paid as a percentage of the loan amount. Calculate the interest rate for each loan in the data. (Exclude observations where the interest paid is not recorded.)

(c) Check for extreme values (interest rates that are either very large or zero). You may also want to create a scatterplot (with interest rate on the vertical axis and loan amount on the horizontal axis) to help you identify extreme (atypical) observations. Exclude the observation with the most extreme interest rate from further calculations. What percentage of the loans are zero interest?

(d) Make summary tables of the mean, maximum, minimum, and quartiles of the loan amount and interest rate, calculating these measures separately for long-term and short-term loans. Compare the distributions of interest rates for short-term and long-term loans.

(e) Create a table showing the correlation between the interest rate and household characteristics (you may want to refer to Figure 8.4 (page 223) in Empirical Project 8 for an example). Interpreting the interest rate charged as a measure of default risk (inability to repay), explain whether the relationships implied by the coefficients are what you expected (for example, would you expect interest rates to be higher for households with fewer assets, more dependents, etc.).

R WALK-THROUGH 9.10

Making summary tables and calculating correlations

To make summary tables, we use the favstats function from the mosaic package.

# loan_amount
favstats(~loan_amount, data = gotLc)

##  min  Q1 median   Q3   max  mean     sd    n missing
##    1 400   1200 3490 3e+07 26896 783587 1479       1


# loan_amount + loan_interest
favstats(~(loan_amount + loan_interest), data = gotLc)

##  min  Q1 median   Q3      max  mean     sd    n missing
##   20 500   1400 3780 31260000 29223 827144 1445      35

It is best to look at loan amounts and interest rate graphically, for example in a scatterplot.

ggplot(gotLc, aes(x = loan_amount, y = loan_interest)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan amounts and interest payments")

Figure 9.5 Scatterplot showing loan amounts and interest payments.

One large loan (top right corner) dominates this graph. Let’s exclude observations with a loan amount larger than 200,000 from the graph.


ggplot(gotLc, aes(x = loan_amount, y = loan_interest)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  # Set horizontal axis limits
  xlim(0, 200000) +
  # Set vertical axis limits
  ylim(0, 30000) +
  theme_bw() +
  ggtitle("Loan amounts and interest payments")

Figure 9.6 Revised scatterplot showing loan amounts and interest payments without outliers.

Interestingly we can see many zero interest loans. Now we will calculate the interest rate as loan_interest / loan_amount.

gotLc$interest_rate <-
  gotLc$loan_interest / gotLc$loan_amount

favstats(~interest_rate, data = gotLc)

##  min Q1 median    Q3 max  mean   sd    n missing
##    0  0      0 0.167 200 0.257 5.26 1445      35

The maximum interest rate is 200 (in other words 20,000%), which does not make sense and could be due to a data entry error. Making another scatterplot can also identify extreme values for loan amounts.


ggplot(gotLc, aes(x = loan_amount, y = interest_rate)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  theme_bw() +
  ggtitle("Loan amounts and interest rates")

Figure 9.7 Scatterplot identifying extreme values for loan amounts.

Let’s make another scatterplot, excluding the observation with the extremely high interest rate and only looking at small loan amounts (< 1,000).

ggplot(subset(gotLc, interest_rate < 50),
  aes(x = loan_amount, y = interest_rate)) +
  geom_point(size = 2, shape = 23, fill = "blue") +
  # Set horizontal axis limits
  xlim(0, 1000) +
  # Set vertical axis limits
  ylim(0, 5) +
  theme_bw() +
  ggtitle("Loan amount and interest rates")


Figure 9.8 Scatterplot excluding extremely high interest rate and including only small loan amounts.

Again we can see that there are many zero interest loans. From the summary statistics above, we can see that the median interest rate is 0, which implies that at least 50% of loans have a zero interest rate. The following code calculates that percentage precisely.

temp_all <- gotLc$interest_rate[!is.na(gotLc$interest_rate)]

temp_0 <- temp_all[temp_all == 0]

paste("Percentage of zero interest rate loans: ",
  round(100 * length(temp_0) / length(temp_all), 2), "%")

## [1] "Percentage of zero interest rate loans: 50.52 %"
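The same percentage can be obtained more compactly by taking the mean of a logical comparison. The sketch below uses invented loans; the vector names mirror the survey variables but the values are hypothetical.

```r
# Hypothetical loans (not the survey data)
loan_amount   <- c(1000, 500, 2000, 400, 800)
loan_interest <- c(0, 50, NA, 0, 80)

# Interest paid as a fraction of the amount borrowed
interest_rate <- loan_interest / loan_amount

# The mean of a logical vector is the share of TRUE values;
# na.rm = TRUE leaves out loans with no recorded interest
pct_zero <- 100 * mean(interest_rate == 0, na.rm = TRUE)
print(pct_zero)
```

Here two of the four loans with a recorded interest payment have a zero rate, so the share is 50%.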

Now let’s calculate statistics conditional on whether a loan is long term or not. Before we do this, we will remove the observation with the very extreme interest rate (20,000%) from our ‘gotLc’ dataset (but not from the original ‘gotL’ dataset). That observation has a loan amount of 1 and an interest payment of 200, which is probably a data entry mistake. There is another extreme observation (with a loan amount of 30,000,000), but there is no indication that this observation is misrecorded as there is a significant interest payment for this loan.


gotLc <- subset(gotLc, interest_rate < 200)
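One property of subset is worth keeping in mind here: it keeps only rows for which the condition is TRUE, so rows where interest_rate is missing are dropped together with the extreme observation. The sketch below shows this on invented data, along with one possible way to keep the rows with missing values; toy and the object names are hypothetical.

```r
# Hypothetical loans (not the survey data)
toy <- data.frame(
  loan_amount   = c(1000, 500, 2000, 1),
  interest_rate = c(0, 0.1, NA, 200)
)

# subset() keeps rows where the condition is TRUE, so a row
# with a missing interest_rate is dropped along with the outlier
kept_subset <- subset(toy, interest_rate < 200)

# Keeping NA rows explicitly removes only the extreme value
kept_with_na <- toy[is.na(toy$interest_rate) |
                      toy$interest_rate < 200, ]

print(nrow(kept_subset))
print(nrow(kept_with_na))
```

Which behaviour is preferable depends on the analysis: for statistics on interest rates the missing rows contribute nothing anyway, but for later tables that use other variables of the same dataset (such as loan duration or source of finance) the dropped rows may matter.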

favstats(~interest_rate | long_term, data = gotLc)

##   long_term min Q1 median    Q3  max  mean    sd   n missing
## 1     FALSE   0  0  0.050 0.171 1.12 0.110 0.173 717       0
## 2      TRUE   0  0  0.141 0.245 2.24 0.189 0.268 211       0

Both the mean and median interest rate are higher for long-term loans. You can adapt the code above to calculate statistics for the loan_amount variable.

We now calculate correlations between interest rates and household characteristics. Below we use piping operations (%>%) to select the relevant data (as in Project 8). We store the correlation coefficients in a matrix (array of rows and columns) called M.

gotLc %>%
  # Only select observations with interest rate information
  subset(!is.na(interest_rate)) %>%
  select(age, max_education, number_assets, hhsize,
    young_children, working_age_adults, interest_rate) %>%
  cor(., use = "pairwise.complete.obs") -> M

M[, c("interest_rate")]

##                age      max_education      number_assets
##             0.0258            -0.0841            -0.0474
##             hhsize     young_children working_age_adults
##             0.1050             0.1022             0.0466
##      interest_rate
##             1.0000
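The option use = "pairwise.complete.obs" deserves a short illustration. In the sketch below (invented data, hypothetical names), each correlation is computed from the rows where both variables of that pair are observed, so different cells of the matrix can be based on different numbers of observations.

```r
# Hypothetical toy data (not the survey data)
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

# Each correlation uses all rows where both variables of
# that pair are observed
M_toy <- cor(toy, use = "pairwise.complete.obs")
print(round(M_toy, 3))
```

In this example the correlation between x and y uses four rows, while the correlation between x and z uses only three. Without the use option, any missing value would make the affected correlations NA.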

3 Now we will look at sources of finance and how they are related to loan characteristics.

(a) Create a table showing the proportion of loans (in terms of the column variable) with source of finance (borrowed_from) as the row variable and rural as the column variable. Make a similar table but with borrowed_from_other as the row variable instead. Does it look like rural households use different sources of finance from urban households? (Hint: It may help to think about sources of finance in terms of formal, informal, and other institutions such as microfinancers or NGOs.)

(b) For each of the variables below, create a table showing the average of that variable, with borrowed_from as the row variable and rural as the column variable. Comment on any similarities or differences between rows and columns that you find interesting, and suggest explanations for what you observe.

• duration of loan (using the variable in which negative durations were transformed to positive durations)
• loan amount
• interest rate

(c) Create a table showing the proportion of gender (in terms of the row variable) with borrowed_from as the row variable and rural as the column variable. Describe any relationships you observe between the gender of household head, the place where he/she lives, and the types of finance used.

(d) What other variables are currently not in our dataset but could also be important for our analysis in Questions 2 and 3?

R WALK-THROUGH 9.11

Creating summary tables of means

First we use the table function to create the table with the variable borrowed_from.

stab10 <- table(gotLc$borrowed_from, gotLc$rural)

addmargins(prop.table(stab10, 2), 1)

##
##                            Large town (urban)   Rural Small town (urban)
##   Bank (commercial)                   0.01881 0.00393            0.00000
##   Employer                            0.04075 0.00196            0.01031
##   Grocery/Local Merchant              0.08150 0.04711            0.10309
##   Microfinance Institution            0.19122 0.27969            0.26804
##   Money Lender (Katapila)             0.00313 0.04809            0.02062
##   NGO                                 0.01254 0.04711            0.05155
##   Neighbour                           0.10658 0.11580            0.07216
##   Other (specify)                     0.05643 0.12071            0.04124
##   Relative                            0.48589 0.31501            0.43299
##   Religious Institution               0.00313 0.02061            0.00000
##   Sum                                 1.00000 1.00000            1.00000

Note that in all settings, relatives are the most common source of loans (between about 32% and 49% of loans). To create the table with borrowed_from_other, substitute this variable name in the above command.
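The second argument of prop.table decides which totals the proportions are based on. A small sketch with invented counts (the table counts and its labels are hypothetical) shows that margin 2 divides each cell by its column total, which is why every column in the table above sums to 1.

```r
# Hypothetical counts (not the survey data)
counts <- matrix(c(10, 30, 60,
                    5, 15, 80),
                 nrow = 3,
                 dimnames = list(
                   source = c("Bank", "Neighbour", "Relative"),
                   area   = c("Urban", "Rural")))
counts <- as.table(counts)

# margin = 2: divide each cell by its column total
col_props <- prop.table(counts, 2)

# Append a row of column sums; each should equal 1
print(addmargins(col_props, 1))
```

Using margin 1 instead would give proportions within each row, which answers a different question (for a given source of finance, where do its borrowers live?).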

When creating a table with categorical (factor) variables in the rows and columns, but with the cells reporting a statistic based on a third variable such as the average duration of a loan (rather than counts or proportions), we use piping operations (%>%).

tab10 <- gotLc %>%
  group_by(borrowed_from, rural) %>%
  summarize(mean_duration =
    round(mean(loan_length_abs, na.rm = TRUE), 0)) %>%
  spread(rural, mean_duration) %>%
  print()

## # A tibble: 11 x 4
## # Groups: borrowed_from [11]
##    borrowed_from            `Large town (urban)` Rural  `Small town (urba~
##    <fct>                    <time>               <time> <time>
##  1 Bank (commercial)        1814                 619    <NA>
##  2 Employer                 602                  290    NaN
##  3 Grocery/Local Merchant   166                  259    176
##  4 Microfinance Institution 712                  411    510
##  5 Money Lender (Katapila)  365                  332    365
##  6 Neighbour                125                  187    296
##  7 NGO                      372                  395    236
##  8 Other (specify)          274                  372    806
##  9 Relative                 237                  217    393
## 10 Religious Institution    1461                 343    <NA>
## 11 <NA>                     1096                 289    151

To get the tables for the loan amount and interest rate, change thevariable name in the mean() calculation above.

ExtensionInvestigating sources of finance associated with zero interest loans

We previously saw that a large percentage of loans have a zero interest rate. Here we investigate whether particular sources of finance are responsible for these interest rates. The code we use is very similar to the code above, but instead of calculating the mean of a variable, we calculate the mean of a Boolean (true/false) variable, (interest_rate == 0). This will deliver the proportion of ‘true’ observations, in other words, loans where the interest rate was equal to zero.

tab10 <- gotLc %>%
  group_by(borrowed_from, rural) %>%
  summarize(prop_0_interest = mean((interest_rate == 0),
    na.rm = TRUE)) %>%
  spread(rural, prop_0_interest) %>%
  print()

## # A tibble: 11 x 4
## # Groups:   borrowed_from [11]
##    borrowed_from            `Large town (urban)`  Rural `Small town (urban)`
##    <fct>                                   <dbl>  <dbl>                <dbl>
##  1 Bank (commercial)                      0.     0.                  NA
##  2 Employer                               0.692  0.500                1.00
##  3 Grocery/Local Merchant                 1.00   0.812                1.00
##  4 Microfinance Institution               0.0820 0.0351               0.0385
##  5 Money Lender (Katapila)                0.     0.0204               0.500
##  6 NGO                                    0.250  0.0833               0.400
##  7 Neighbour                              1.00   0.763                1.00
##  8 Other (specify)                        0.222  0.171                0.
##  9 Relative                               0.961  0.819                0.976
## 10 Religious Institution                  1.00   0.190               NA
## 11 <NA>                                   0.     0.571                1.00

You can see that in both urban and rural settings, a high proportion of loans granted by local merchants, neighbours, and relatives are zero interest (possibly because these people have a close relationship with the borrower so there is a lower chance of default).
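To see why the mean of a Boolean variable gives a proportion, recall that R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch on an invented vector of interest rates (these numbers are not from the survey):

```r
# Made-up interest rates for six loans, one of them missing
interest_rate <- c(0, 0, 12.5, NA, 0, 30)

# Logical vector: TRUE where the rate is zero, NA where the rate is missing
is_zero <- interest_rate == 0

# TRUE counts as 1 and FALSE as 0, so the mean is the share of zero-interest
# loans among the loans with a recorded rate (na.rm = TRUE drops the NA)
prop_zero <- mean(is_zero, na.rm = TRUE)
print(prop_zero)  # 3 zero-interest loans out of 5 recorded rates: 0.6
```

Note that na.rm = TRUE changes the denominator: loans with a missing interest rate are excluded rather than counted as non-zero.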

We will use exactly the same technique to determine the proportion of loans that go to households with female heads.

tab11 <- gotLc %>%
  group_by(borrowed_from, rural) %>%
  summarize(prop_female = mean((gender == "Female"),
    na.rm = TRUE)) %>%
  spread(rural, prop_female) %>%
  print()

## # A tibble: 11 x 4
## # Groups:   borrowed_from [11]
##    borrowed_from            `Large town (urban)` Rural `Small town (urban)`
##    <fct>                                   <dbl> <dbl>                <dbl>
##  1 Bank (commercial)                       0.    0.500               NA
##  2 Employer                                0.308 0.                   0.
##  3 Grocery/Local Merchant                  0.385 0.188                0.500
##  4 Microfinance Institution                0.410 0.130                0.308
##  5 Money Lender (Katapila)                 0.    0.265                0.500
##  6 NGO                                     0.250 0.312                0.200
##  7 Neighbour                               0.353 0.254                0.429
##  8 Other (specify)                         0.389 0.187                0.250
##  9 Relative                                0.400 0.206                0.286
## 10 Religious Institution                   1.00  0.238               NA
## 11 <NA>                                    1.00  0.429                1.00

4 In this project we have looked at patterns in borrowing and access to credit, but we are not able to make any causal statements such as ‘changes in X will cause households to be credit constrained’ or ‘characteristic Y causes improved access to credit’. Outline a policy intervention that could help improve households’ access to loans, and how to design the implementation so you can assess the causal effects of this policy.
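One possible design, offered only as an illustration and not as the expected answer, is to assign the intervention to households at random and then compare outcomes using a confidence interval for the difference in means. The sketch below runs on simulated data: the sample size, the programme, and the loan probabilities are all assumptions chosen for the example, not survey results.

```r
set.seed(1)  # make the simulated example reproducible

n <- 200     # number of simulated households (an assumption)

# Randomly assign exactly half of the households to a hypothetical programme,
# so that treated and untreated groups are comparable on average
treated <- sample(rep(c(TRUE, FALSE), each = n / 2))

# Simulated outcome: 1 if the household obtained a loan.
# The probabilities 0.5 and 0.3 are invented for illustration.
got_loan <- rbinom(n, size = 1, prob = ifelse(treated, 0.5, 0.3))

# Difference in mean outcomes and its 95% confidence interval (Welch t-test)
result     <- t.test(got_loan[treated], got_loan[!treated])
diff_means <- mean(got_loan[treated]) - mean(got_loan[!treated])
ci         <- result$conf.int

print(diff_means)
print(ci)
```

Because assignment is random rather than chosen by households or lenders, a difference between the two groups cannot be attributed to selection bias, which is the limitation of the comparisons made earlier in this project.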
