Ugandan Farmers Research Project

8/10/2019 Ugandan Farmers Research Project


Mark Berberian, Murali Rajamani, Nick Xu

A Tale of Two Regions - or of the Variables that make each Region Unique

Abstract

We used various data mining techniques to analyze a high-dimensional data set to predict Ugandan farmers' willingness to pay for insurance (continuous), level of risk aversion (five-level classification), and whether they bought insurance in a simulation game (binary). We found that region (binary: whether the farmer was from Kapchorwa or Oyam) had a dominating effect. We also found that the degree to which farmers worried about factors such as HIV or injury, what crops were planted, and living conditions also affected these dependent variables, and some were highly correlated with region. Trees were useful for dealing with this multicollinearity and the need for interaction terms, and thus the random forest tended to have the best out-of-sample MSE.

1. Introduction

The Grameen Foundation and other organizations are working on a research project to learn about risk aversion and whether there is a market for insurance among farmers in Uganda. The motivation behind the survey is to identify which practices are associated with farmers who are more risk averse and thus more likely to buy insurance, so that microinsurance companies can target their marketing effectively. The survey included two economic games to measure risk tolerance. The coin game is used to determine risk aversion. Farmers are offered a choice between two coins to flip that have different variances of outcomes; for example, one coin pays $5 if heads and $5 if tails, and another pays $9 if heads and $2 if tails. There are five different coins, and the farmers are presented with different options until their top-choice coin is discovered. The dice game is used to determine whether the farmer would like to purchase insurance. Farmers roll four dice to determine their payout, and they can sacrifice a die as insurance against a bad outcome.
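To make the phrase "different variances of outcomes" concrete, the short sketch below computes the expected payout and variance of the two example coins, assuming each coin is fair. The labels "safe" and "risky" are ours, used only for illustration; the survey's five coins and their actual payouts are not reproduced here.

```python
from statistics import mean, pvariance

# Payouts (heads, tails) for the two example coins quoted in the text.
# The names "safe" and "risky" are illustrative labels, not survey terms.
coins = {
    "safe": (5, 5),
    "risky": (9, 2),
}

def coin_summary(payouts):
    """Return (expected payout, variance) for a fair coin with these payouts."""
    return mean(payouts), pvariance(payouts)

summary = {name: coin_summary(payouts) for name, payouts in coins.items()}
for name, (expected, variance) in summary.items():
    print(f"{name}: expected payout = {expected}, variance = {variance}")
```

The second coin has a slightly higher expected payout but a much larger variance, so a farmer who still prefers the first coin is revealing risk aversion.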

These games give us a good measure of someone's risk aversion, as we are looking at a revealed preference instead of a stated preference such as the Willingness to Pay (WTP) questions. There are several WTP-related questions in the survey: farmers were asked about various amounts, in descending and ascending order, for full insurance and half insurance. These sorts of willingness-to-pay scales are widely used in surveys and are believed to provide a more accurate estimate of WTP (e.g. Oster, Emily and Rebecca Thornton, Determinants of Technology Adoption, Journal of the European Economic Association, January 2, 2010). We then created a single WTP variable based on the numerous WTP-related questions.

Our goal is to create a model that predicts whether a farmer would buy insurance and to understand attitudes towards risk among Ugandan farmers. Our three dependent variables are: 1. Which coin they choose (multinomial), 2. Whether they bought insurance during the dice game (binary), and 3. Willingness to pay for insurance (continuous).

2. Data Cleaning and Visualization



The data started off as two data sets: one survey from Kapchorwa and another from Oyam. We stacked the data to create one dataset. We then made sure that the same farmer did not appear in multiple rows, as duplicates were an issue. We removed variables that had a high missing rate and deleted rows where we had an unrealistic response, like 99 children or chickens listed as the answer to what should be a numeric question. We decided to remove these farmers completely from the survey, as they made up less than 1% of our entire dataset and we thought that if they gave an unreasonable response to one question they were more likely to give misinformation for other questions.
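The cleaning steps described above (stack, de-duplicate, drop implausible rows) can be sketched as follows. This is a minimal illustration on invented records, not the project's actual code: the field names, the toy values, and the cutoff `MAX_PLAUSIBLE_CHILDREN` are all assumptions.

```python
# Toy records standing in for the two regional surveys (all values invented).
kapchorwa = [
    {"farmer_id": 1, "children": 4, "chickens": 10},
    {"farmer_id": 2, "children": 99, "chickens": 3},          # implausible count
]
oyam = [
    {"farmer_id": 3, "children": 2, "chickens": 6},
    {"farmer_id": 3, "children": 2, "chickens": 6},           # duplicate farmer
    {"farmer_id": 4, "children": 5, "chickens": "chickens"},  # text in a numeric field
]

MAX_PLAUSIBLE_CHILDREN = 30  # hypothetical cutoff, not taken from the paper

def is_plausible(row):
    """Keep rows whose numeric answers are numbers within a believable range."""
    numeric_ok = all(isinstance(row[key], (int, float)) for key in ("children", "chickens"))
    return numeric_ok and row["children"] <= MAX_PLAUSIBLE_CHILDREN

def clean(*surveys):
    """Stack the surveys, tag each row with its region, and drop bad rows."""
    stacked = [
        dict(row, region=region)
        for region, survey in enumerate(surveys, start=1)
        for row in survey
    ]
    seen, result = set(), []
    for row in stacked:
        if row["farmer_id"] in seen or not is_plausible(row):
            continue  # skip repeated farmers and unrealistic responses
        seen.add(row["farmer_id"])
        result.append(row)
    return result

cleaned = clean(kapchorwa, oyam)
print([(row["farmer_id"], row["region"]) for row in cleaned])
```

Dropping the whole row, rather than only the bad field, mirrors the choice in the text to exclude farmers who gave any unreasonable response.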



The majority of people have a willingness to pay under 25,000. Therefore any insurance product

would have to be priced in this range to capture a meaningful portion of the market.

A cross section of dice and coin game results shows no real pattern between particular coin/die combinations. The majority of people did purchase insurance in the dice game and chose the Alpha or Beta coins.

Comparing the major regions, it appears that Region 1 (Kapchorwa) has higher WTP than Region 2 (Oyam) and therefore might be a better target for marketing.



However, Region 2 was more likely to choose Alpha than Region 1. Since Alpha is more conservative, we interpreted this as Region 2 being more risk averse than Region 1. We need to investigate further whether this risk aversion translates to insurance purchase.

Somewhat contrary to the coin game, Region 2 purchased insurance at a slightly lower rate than Region 1, though the overwhelming effect is that most people purchased insurance in the game, suggesting that this test could be improved to show more separation.

The majority of farms are low in value, with a few outliers.

Region 1 appears to be wealthier than Region 2, which may partially explain the WTP difference.

Arules

(Full index of association rules in appendix)

Highlights

rule  lhs             rhs       support    confidence  lift
1     {DF5, HIV5}  => {INJ5}    0.1055261  0.6413662   3.8980949
2     {HIV5, INJ5} => {DF5}     0.1055261  0.8066826   3.4359100

People who are highly concerned about droughts and floods are also highly concerned about HIV and injury. This suggests that certain people are highly concerned with uncertainty, and if we are able to ascertain who is concerned about injury and/or HIV, it will also tell us who might want to buy insurance. For example, in rule number 2, given a new farmer who is extremely worried about HIV and injury, we are 80% sure that they will also worry highly about droughts/floods, and this is 3.4x higher than if we did not know about their HIV and injury concern level.
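The three numbers reported for each rule are support, confidence, and lift. The sketch below shows how they are defined, using a tiny invented set of transactions; only the final line uses figures from the text. Because lift is confidence divided by the baseline share of the right-hand side, the reported rule 2 implies that roughly 23% of farmers gave the DF5 response overall.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    s_both = support(transactions, set(lhs) | set(rhs))
    confidence = s_both / support(transactions, lhs)
    lift = confidence / support(transactions, rhs)
    return s_both, confidence, lift

# Invented toy transactions; each set holds one respondent's answers.
transactions = [
    {"HIV5", "INJ5", "DF5"},
    {"HIV5", "INJ5", "DF5"},
    {"HIV5", "INJ5"},
    {"DF5"},
    {"HIV5"},
    set(),
    set(),
    set(),
]

toy_support, toy_confidence, toy_lift = rule_metrics(transactions, {"HIV5", "INJ5"}, {"DF5"})
print(toy_support, toy_confidence, toy_lift)

# Baseline share of DF5 implied by the reported rule 2 (confidence / lift).
implied_df5_share = 0.8066826 / 3.4359100
print(round(implied_df5_share, 4))
```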

3 {13,19} => {16} 0.1008430 0.8682796 3.3874537

4 {19,6} => {16} 0.1286294 0.8273092 3.2276145

Crop associations dominate as well, suggesting that it is highly common to grow multiple crops. In this example sweet potato, sunflower, and sim-sim are grouped together.

47 {q32_2,q32_9} => {q32_12} 0.1045894 0.8438287 2.2924371

Question 32 asked people to pick options for what they would do if they suffered a drought or flood. Our hypothesis was that if people had very few options in a drought scenario they would be more inclined to buy insurance. However, there was no strong association between their emergency plan and the coin or dice results. Instead we found that people will do multiple things in case of a drought/flood. In this case, one of the most common strategies is to eat less, reduce expenditures, and sell off livestock.

K-means

We identified 20 demographic-related questions, such as the value of the farm, age, marital status, etc. We thought K-means would be a good way to reduce the number of variables that describe demographic information into a few clusters that we could use to describe demographics. Using K-means to generate three clusters, we got clusters that contained 938, 977 and 598 observations. For 4 clusters we got 404, 468, 1035 and 606 observations. For 5 clusters we got 466, 382, 1015, 55 and 595 observations. To decide the optimal number of clusters, we then looked at the centers and compared how well each set of clusters performed in a regression.
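For readers unfamiliar with the method, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It runs on six invented two-dimensional points, not the survey's 20 demographic variables, and the function name, seed, and iteration cap are our own choices.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments have stopped changing
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy groups (invented values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(members) for members in clusters))
print(sorted(centers))
```

Comparing cluster sizes and centers for several values of k, as the text does for 3, 4 and 5 clusters, is one common way to judge whether an extra cluster adds an interpretable group.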

K-means is a great way to deal with the collinearity of the many living conditions questions. It ended up grouping those who answered good into one cluster, average into another and poor into another. We can now talk about living conditions through our clusters instead of looking at each isolated effect of having electricity, roof quality, door quality, etc.

Using 3 clusters, we see that Cluster 1 is from Kapchorwa and has average living conditions, higher farm value and many children. Cluster 2 is from Oyam and has nice living conditions. Cluster 3 is from Oyam and has poor living conditions and low farm value. Adding a fourth cluster, we now see a group from Oyam that has good living conditions. What is interesting is that this group has the lowest farm value, and farming makes up the smallest share of its income compared to the other clusters. Therefore the nicer living conditions may be tied not to a higher farm value but to income outside of farming. When moving to 5 clusters, variation becomes dominated by having electricity, and the story is less clear, as there is still only one cluster with its center in Kapchorwa.

3. Game 1 - Coin choice

In this section we build several models to predict which coin (Alpha, Beta, Gamma, Delta, or Epsilon) was chosen by the farmer. Farmers who choose Alpha or Delta are more risk averse, and we would like to know what factors are associated with more risk-averse farmers. The models we will present are: 1. Multinomial logistic regression with a gamma-lasso penalty (mnlm), 2. Mnlm on principal components we identify using principal component analysis, 3. Classification tree, 4. Random Forest.

1. Multinomial logistic regression with a gamma-lasso penalty (mnlm)

As we are regressing on 143 variables, we want a strict penalty to avoid overfitting and to help identify the main drivers of farmers' coin choice. The largest loading we see is on region. If the farmer is in Oyam instead of Kapchorwa, the odds of choosing Alpha are e^(.37-0)=1.45 times the odds of choosing Beta. This is also true for Alpha vs Delta, Gamma and Epsilon. We also observe that a one standard deviation increase in number of children multiplies the odds of choosing Alpha over Epsilon by 1.04. Crops also have an interesting relationship with coin choice. If a farmer plants crops 4, 11, 18, or 20 we see an increase in the relative odds of choosing Alpha; if a farmer plants crop 3 we see a decrease in the relative odds of choosing Alpha. If a farmer plants crops 5 or 14 we see an increase in the relative odds of choosing Epsilon, while if a farmer plants crop 1 we see a decrease in the relative odds of choosing Epsilon. The largest crop loading is on crop 5: if a farmer is planting crop 5, the odds of choosing Epsilon increase 1.16 times relative to the odds of each other coin. For more on the impact of crops, please see the appendix. The effect of worrying about HIV is also fairly large. If a farmer reported a very high concern about HIV instead of no concern, then we would expect the odds of the farmer choosing Epsilon to be 1.13 times the odds of choosing Alpha and 1.06 times the odds of choosing Delta. It might be that someone who engages in riskier behavior is going to be more worried about HIV. In other words, our Alpha-type farmers might think that since they don't take risks they don't need to worry about HIV. Unlike HIV, for the questions about worrying about drought or injury, the farmers who reported higher worry were more likely to choose the more risk-averse coins.

Replacing the demographic variables we used to cluster with the clusters themselves in the model described above, we see that being in the Kapchorwa and average living conditions cluster (clus3) instead of the Oyam and good living conditions cluster (clus1) multiplies the relative odds of choosing Alpha by .97. Being in the poor living conditions in Oyam cluster (clus4) instead of clus1 multiplies the relative odds of choosing Alpha by 1.11. The Oyam and average living conditions cluster (clus2) did not have any non-zero loadings.
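The odds statements in this subsection all come from the same calculation: in a multinomial logit, the relative odds of one class versus another change by the exponential of the difference between the two classes' coefficients. The helper below is a small sketch of that arithmetic; the function name is ours, and the only figures used are the .37 and 0 region loadings quoted above for Alpha and Beta.

```python
import math

def relative_odds(coef_a, coef_b, delta=1.0):
    """Multiplicative change in the odds of class A versus class B
    when the predictor increases by `delta`."""
    return math.exp((coef_a - coef_b) * delta)

# Region loadings quoted in the text: .37 for Alpha and 0 for Beta.
oyam_alpha_vs_beta = relative_odds(0.37, 0.0)
print(round(oyam_alpha_vs_beta, 2))
```

A multiplier below 1, such as the .97 reported for clus3, corresponds to a negative coefficient difference.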

2. Principal Component Analysis

We ran a principal component analysis on our explanatory variables. PC1 has very high rotations for region and then crops 16, 6, 8, 23, and 7. This is interesting because when we ran K-means on the crops, high scores in 6, 7, 8 and 16 tended to be grouped together. These crops are also highly correlated with Region 2. For example, 90 percent of farmers in Oyam plant crop 6 (cassava) but only 18 percent of the farmers in Kapchorwa plant cassava. We therefore have high multicollinearity between crops and region, and we want to use PCA to help cope with this problem, which false discovery rate control or other methods would not deal with as well. PC2 seems to measure general worrying, as the four questions about worrying have the four largest rotations for PC2. It is interesting that worrying about HIV has a different relationship with coin choice than the other worrying questions, yet all four worry questions load on PC2 in the same direction: more worry means a higher PC2 score. The association rules also showed high lift and confidence for worries occurring together (HIV, flood/drought, and injury). Crops 20 and 12 also load highly on PC2, and they do tend to be in the same clusters. In the graph of PC1 vs PC2, Oyam is on the far right and Kapchorwa is on the far left, with high worries on top. Running the mnlm on PC1 and PC2, we find that for a 1 standard deviation move in the PC1/Region score towards Oyam, the odds of choosing Alpha increase e^(.32- -.23)=e^(.54)=1.7 times relative to the odds of choosing Epsilon. Farmers with high worry (PC2) scores are more likely to choose the extreme coins, Alpha or Epsilon, relative to the middle coins. This may be due to the HIV effect discussed earlier.

3. Classification tree



(pruned, unpruned trees are in appendix)

Due to the high multicollinearity and the non-linear nature of our classification problem, trees are a good tool for this analysis. We were unable to test all combinations of interaction terms due to computational limits, and trees take interactions into account while reducing dimensions. It appears that region is the dominant factor, followed by crop type (rice) and whether the farmer's friends have insurance.

4. Random Forest

In the random forest on coins, some of the most influential questions were: the amount they would have to repay in a drought, age, value per acre, and number of kilos of crops sold. Many of these variables seem to proxy a more general measure of how valuable the farmer's crop is, which acts as a first-order variable for coin choice. Curiously, the random forest doesn't have region as a first-order factor. Our other models may be overemphasizing the effect of region, which may decrease when interacted with other terms. We will rely on cross validation to pick the best out-of-sample model.

5. Cross Validation of Classification Models

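[Figure: Coin RF variable importance, measured by IncNodePurity (axis from 0e+00 to 1e+11). Variables as extracted from the plot: Region, q16, q11, q130, q143, q140, q29, q134, nq135_7, q152, q3, q26, WorryInj, q18, q52, q27, q56, q21, WorryDF, HIV, q13, q150, q57, q19, Farm_Val, Val_Acre, q22, q39, q43, q58.]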



Random Forest turns out to be the best model for out of sample prediction. Interaction terms are

significant (presented in conclusion) and therefore random forests are performing the best because they

capture this effect. Random forests also compensate for non-linearity and multicollinearity issues. We

hypothesize that our other models are overemphasizing Region, which may be captured in other variables.
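Out-of-sample comparison by cross validation can be sketched as below. This is a generic k-fold illustration on invented one-variable data with two simple stand-in models (a constant mean and a straight line); it is not the project's models or folds, and all names are our own.

```python
def k_fold_mse(xs, ys, fit, k=5):
    """Average squared prediction error on held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return total / n

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

# Invented data: a linear trend plus a small alternating disturbance.
xs = list(range(20))
ys = [2.0 * x + (1 if x % 2 else -1) for x in xs]

mse_mean = k_fold_mse(xs, ys, fit_mean)
mse_line = k_fold_mse(xs, ys, fit_line)
print(mse_mean, mse_line)
```

The model with the lower held-out MSE is preferred, which is the criterion the text applies when ranking the random forest against the other models.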

2. Willingness to Pay

In this next section we model the continuous variable willingness to pay for insurance, estimating an actual dollar amount for WTP. We will use: 1. Regression with a ridge penalty and regression with a lasso penalty, 2. Regression on our principal components, 3. Partial least squares, 4. Regression tree, 5. Random Forest.

1. Penalized Regression Ridge vs Lasso

We regressed WTP on our explanatory variables with a ridge penalty. Because the ridge penalty does not force loadings to zero, we then used a lasso penalty to more easily identify which factors were important for determining WTP. With our lasso-penalized regression we found that 19 out of 143 variables had non-zero loadings in the optimal model for predicting WTP. Crop 21 had the largest loading: if you planted crop 21 your WTP increased by $2,506. If you live in the Oyam region your WTP decreases by $7,183. Crops 8, 5, and 14 also had non-zero loadings. We also learned that for a 1 STD increase in how worried the farmer is about HIV we expect to see a $506 decrease in WTP for insurance.
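Why the lasso sets loadings exactly to zero while ridge does not can be seen in the simplest textbook case. The sketch below assumes an orthonormal design and the objective scalings noted in the comments; under those assumptions each penalized coefficient is a simple function of the unpenalized one. It illustrates the mechanism only and is not the fitting routine used in the project.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge solution for one coefficient under an orthonormal design,
    assuming the objective 0.5 * RSS + 0.5 * lam * sum(beta ** 2)."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso solution (soft-thresholding) under an orthonormal design,
    assuming the objective 0.5 * RSS + lam * sum(abs(beta))."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0

# Hypothetical unpenalized coefficients, chosen only for illustration.
for beta in (2.0, 0.3, -0.1):
    print(beta, ridge_shrink(beta, 0.5), lasso_shrink(beta, 0.5))
```

Ridge scales every coefficient towards zero without reaching it, whereas the lasso zeroes any coefficient smaller in magnitude than the penalty, which is what leaves a short list of non-zero loadings to interpret.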

We also used a regression with a ridge penalty to model WTP for insurance using our demographic clusters. There is no significant increase in WTP for being in the Oyam and average living conditions cluster (clus2) instead of the Oyam and good living conditions cluster (clus1) (p-val=.71). We expect a $2,660.67 decrease in WTP for being in the Kapchorwa and average living conditions cluster (clus3) instead of the Oyam and average living conditions cluster. We expect a $1,973.49 decrease in WTP for poor living conditions in Oyam (clus4) instead of average living conditions in Oyam (clus1). However, when we apply a false discovery rate of 15% we find that only the Kapchorwa cluster remains significant. This is consistent with our other results that region is the main driver of differences between the farmers.
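One standard way to apply a false discovery rate is the Benjamini-Hochberg procedure, sketched below. The text does not say which procedure was used, so treat this as an assumption. Of the p-values in the example, only the .71 for clus2 comes from the text; the values for clus3 and clus4 are invented placeholders.

```python
def benjamini_hochberg(p_values, q=0.15):
    """Return a list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    rejected = set(order[:cutoff_rank])
    return [i in rejected for i in range(m)]

# clus2's p-value is quoted in the text; the other two are hypothetical.
names = ["clus2", "clus3", "clus4"]
p_values = [0.71, 0.01, 0.20]
significant = dict(zip(names, benjamini_hochberg(p_values, q=0.15)))
print(significant)
```

With these placeholder values only clus3, the Kapchorwa cluster, survives the 15% threshold, which matches the pattern described above.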

2. Regression on our principal components



We used the same principal components described earlier. For WTP, our linear regression has the best BIC when we use 49 principal components. Consistent with our earlier findings, Oyam has lower WTP, and we learn that for a 1 STD increase in the amount a farmer worries (PC2) we expect a $181 increase in WTP.

3. Partial least squares

PLS did a nice job without too much over fit, details in appendix and sub section 6.

4. Regression Tree

The dominant effect is borrowing money and the source of that money (bank, relatives etc), followed by

district (which is one of the main effects that has persisted) and then the type of crop grown. For cotton

growers, the value of the crop (kilos sold last year, value of half the crop, etc) determines willingness to

pay.

5. Random Forest

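[Figure: regression tree for WTP. Splits as extracted: q43 (levels e, g, h), q3 (levels f, g), nq135_7 (level a), q58 < 110, q22 < 425, q57 < 1. Leaf values: 10620, 11440, 25730, 78740, 150000, 19270, 108700.]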



When looking at a random forest, the top variables pertain to leverage (how much you would have to pay in a drought, whether you've borrowed money and from whom, and how frequently you've borrowed), followed by value-of-crop effects. This seems to confirm the findings from the CART tree, and suggests a story where you have higher WTP for insurance if you are already levered and therefore concerned that you won't be able to cover your losses in the case of a drought or flood.

6. Cross Validation for Out of Sample Predictions

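[Figure: WTP RF variable importance, measured by IncNodePurity (axis from 0e+00 to 1e+11). Variables as extracted from the plot: Region, q16, q11, q130, q143, q140, q29, q134, nq135_7, q152, q3, q26, WorryInj, q18, q52, q27, q56, q21, WorryDF, HIV, q13, q150, q57, q19, Farm_Val, Val_Acre, q22, q39, q43, q58.]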



Random Forest again performs quite well for the aforementioned reasons, but in this case Lasso, Ridge, and PLS performed almost as well, as WTP is more linear.

3. Game 2: Dice, Buy Insurance (Yes/No)

In the first game with coins there were five different buckets we could classify our farmers into. In the dice game we have a binary outcome: whether the farmer bought insurance or not. In this model we will be estimating the odds of a farmer buying insurance. We will use: 1. Logistic regression with a gamma-lasso penalty, 2. PCR, 3. PLS, 4. CART and 5. Random Forest.

Logistic Regression

Because we are trying to predict a binary outcome, we will use logistic regression to predict odds, and

apply a gamma lasso penalty to help with dimension reduction. For a standard deviation increase in

worrying about drought or flood, the odds of buying insurance are 1.2x higher. For Region 2 the odds of

buying insurance are only 0.74 of the odds of buying insurance for Region 1.

PCR

3 is the optimal number of factors to minimize BIC. PCR does not do a very good job of separating

people who do and do not buy insurance, probably because people overwhelmingly chose to buy

insurance regardless of the PCR factors.

PLS

We first ran PLS with five zs because it seemed optimal when examining correlation. However, PLS is

prone to over fit, but we also tested using only one z which still performed worse than over tests. This

suggests that other factors like interactions, and non-linearity and missing interactions that may be driving

down PLS performance overall.

Trees

The tree is rather simplistic because people overwhelmingly chose to buy insurance in the dice game. District is the dominant effect, followed by door material (a proxy for being well off, as discovered in K-means). A good or average door is associated with buying insurance. We believe this points to a trend where having more crop or home value makes farmers more willing to buy insurance.

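[Figure: classification tree for the dice game. Splits as extracted: q3 (levels e, f, g), q12 (levels b, c). All three leaves predict True.]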



Strangely, the most influential variable in the random forest is q13 (age), which has not come up as significant in our other regressions, followed by the amount that would have to be paid in case of drought/flood. The next most important variables again proxy value (how many kilos last season, value per acre, and farm value).

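[Figure: Dice RF variable importance, measured by MeanDecreaseGini (axis from 0 to 25). Variables as extracted from the plot: q143, q137, q130, q27, q136, q11, q139, q138, q26, q56, q7, q142, q39, WorryInj, q141, HIV, q150, WorryDF, q18, q21, q16, q43, q29, Farm_Val, q57, q22, Val_Acre, q19, q58, q13.]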



While it was hard to tell from the MSE plots which model was the best, when we look at the ROC curves we see that the random forest almost never has a false positive or a false negative. One thing that we need to remember is that we are modeling probabilities of choosing to buy insurance in the dice game, and we do not know how that corresponds to buying insurance in real life. If we knew the cost of marketing to each farmer and what the return would be if they bought insurance, we could use the sensitivity and specificity to come up with a classification rule to maximize profit.
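The profit-maximizing classification rule mentioned above could be sketched as follows. Every number here is hypothetical: the predicted probabilities, the purchase outcomes, the cost per contact, and the profit per sale are invented, since the text notes that the real costs and returns are unknown.

```python
def best_threshold(probs, bought, cost_per_contact, profit_per_sale):
    """Try each predicted probability as a cutoff and return the
    (threshold, profit) pair with the highest profit."""
    best = None
    for threshold in sorted(set(probs)):
        outcomes = [b for p, b in zip(probs, bought) if p >= threshold]
        profit = profit_per_sale * sum(outcomes) - cost_per_contact * len(outcomes)
        if best is None or profit > best[1]:
            best = (threshold, profit)
    return best

# Hypothetical model outputs and observed purchases (1 = bought insurance).
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
bought = [1, 1, 0, 1, 0]

threshold, profit = best_threshold(probs, bought, cost_per_contact=1, profit_per_sale=3)
print(threshold, profit)
```

Lowering the cutoff raises sensitivity (more buyers are reached) at the price of lower specificity (more non-buyers are contacted), and the cost and return figures decide where that trade-off is most profitable.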

4. Conclusion

Region

Region came up repeatedly as one of the most important factors. However, we want to caveat this, as our random forest models often showed that region was not as important, and they held up very well out of sample. Region is strongly correlated with many other factors and captures a lot of other information, including types of crops planted and to some extent wealth. The tree models get around some of this multicollinearity. To adjust for this problem with region, we included interaction terms between region and all the other independent variables to try to isolate the main effect of region. In our multinomial model to predict coin choice, the main effect of region changed to zero for all coins, but the interactions were non-zero. We conclude that region alone does not determine coin choice but works strongly with the other variables. Now our largest loading is on the interaction between region and how worried the farmer is about drought/flood. Each main effect is zero; however, if the farmer is in Region 2, for each standard deviation increase in how much they worry about drought/flood we would expect the relative odds that they choose Alpha to be multiplied by 1.21. However, CV reveals that our MNLM with interactions has higher MSE than our no-interaction model (see appendix). The lasso penalty should be helping us with overfitting, so we believe that there should be a main effect of region, as out-of-sample MSE can be large due to either overfitting or omitted variables (over-penalization). In the dice gamma-lasso penalized regression we also see the main effect of region set to zero, but there the main effect of worrying about drought/flood is the largest and the interaction of region with drought/flood worrying is zero.
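The coin-choice result above, with zero main effects and a non-zero interaction, can be illustrated with a short calculation. The sketch assumes Region 2 is coded as 1 and Region 1 as 0, and it backs out the interaction coefficient as the natural log of the reported 1.21 multiplier; the variable names are ours.

```python
import math

# Main effects are zero, as reported; the interaction coefficient is
# implied by the 1.21 odds multiplier quoted in the text.
B_REGION = 0.0
B_WORRY = 0.0
B_INTERACTION = math.log(1.21)

def alpha_odds_multiplier(region2, worry_sd):
    """Multiplier on the relative odds of choosing Alpha.
    region2: 1 for Region 2, 0 for Region 1 (assumed coding).
    worry_sd: drought/flood worry in standard deviations above average."""
    log_odds = (
        B_REGION * region2
        + B_WORRY * worry_sd
        + B_INTERACTION * region2 * worry_sd
    )
    return math.exp(log_odds)

print(alpha_odds_multiplier(1, 1))  # Region 2, one SD more worry
print(alpha_odds_multiplier(0, 1))  # Region 1, one SD more worry
```

Under this model worry changes the odds only for Region 2 farmers, which is what it means for the effect to live entirely in the interaction term.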

Risk Aversion

We found that farmers who were highly concerned about other dangers such as HIV or injury tended to be

much more concerned about drought than the general population. We interpreted this is their general risk

aversion (principal component 2) and we observe through our principal component analysis and our lasso

regression that increase in general worrying tends to also increase WTP with HIV as an exception.

Wealth

Variables that proxy wealth, such as past crop value, farm value, and standard of living measures seem to

track to willingness to pay for insurance. Perhaps having more wealth makes farmers more sensitive to

loss. If you have a big farm, a drought is suddenly much more costly than if you had a tiny farm. And

perhaps if you have a better home that means you may have more collateral posted for any existing loans.

Existing Leverage

Farmers existing borrowing patterns as well as who they owed money to affected their insurance choices.

We believe that existing leverage makes people more risk adverse, because a drought that affects their

ability to pay this loan perhaps could mean losing collateral or having to pay interest for longer.

We would like to thank Karl Muth for letting us use the data for this project. Please do not discuss these

results outside the context of the data mining class. This project was inspired by Muth, Karl and Jennifer

Helgeson, Stochastic Environments as Measurement Tools, The Journal of Applied Economy, 2010.
