8/10/2019 Ugandan Farmers Research Project
Mark Berberian, Murali Rajamani, Nick Xu
A Tale of Two Regions - or of the Variables that make each Region Unique
Abstract
We used several data mining techniques to analyze a high-dimensional data set and predict three outcomes
for farmers in Uganda: willingness to pay for insurance (continuous), level of risk aversion (five-level
classification), and whether they bought insurance in a simulation game (binary). We found that region
(binary: whether the farmer was from Kapchorwa or Oyam) had a dominating effect. We also found that
the degree to which farmers worried about factors such as HIV or injury, which crops were planted, and
living conditions affected these dependent variables, and some of these were highly correlated with region.
Trees were useful for dealing with this multicollinearity and the need for interaction terms, and the random
forest therefore tended to have the best out-of-sample MSE.
1. Introduction
The Grameen Foundation and other organizations are working on a research project to learn about risk
aversion and whether there is a market for insurance for farmers in Uganda. The motivation behind the
survey is to identify which practices are associated with farmers who are more risk averse and thus more
likely to buy insurance, so that microinsurance companies can market to them effectively. The survey
included two economics games to measure risk tolerance. The coin game is used to determine risk
aversion. Farmers are offered a choice between two coins to flip that have different variances of outcomes,
e.g. one coin pays $5 if heads and $5 if tails and another coin pays $9 if heads and $2 if tails. There are
five different coins, and the farmers are presented with different options until their top-choice coin is
discovered. The dice game is used to determine whether the farmer would like to purchase insurance.
Farmers roll four dice to determine their payout, and they can sacrifice a die as insurance against a bad
outcome.
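To make "different variance of outcomes" concrete, the short sketch below computes the mean and variance of the two example coins. Only the ($5, $5) and ($9, $2) payoffs come from the survey description above; the function name is ours and a fair coin is assumed.

```python
def coin_stats(heads, tails, p_heads=0.5):
    """Return (mean, variance) of a coin paying `heads` or `tails`."""
    mean = p_heads * heads + (1 - p_heads) * tails
    var = p_heads * (heads - mean) ** 2 + (1 - p_heads) * (tails - mean) ** 2
    return mean, var

safe = coin_stats(5, 5)    # pays $5 either way
risky = coin_stats(9, 2)   # pays $9 on heads, $2 on tails

print(safe)   # (5.0, 0.0)
print(risky)  # (5.5, 12.25)
```

The second coin has a slightly higher expected payout but a much larger variance, so a farmer who still prefers the first coin is revealing risk aversion.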
These games give us a good measure of someone's risk aversion because we observe a revealed
preference rather than a stated preference such as the Willingness to Pay (WTP) questions. The survey
contains several WTP questions, in which farmers were asked about various amounts in descending
and ascending order for full insurance and half insurance. These sorts of willingness-to-pay scales are
widely used in surveys and are believed to provide a more accurate estimate of WTP (e.g. Oster, Emily
and Rebecca Thornton, Determinants of Technology Adoption, Journal of the European Economic
Association, January 2, 2010). We then created a single WTP variable based on the numerous
WTP-related questions.
Our goal is to create a model that predicts whether a farmer would buy insurance and to understand
attitudes towards risk among Ugandan farmers. Our three dependent variables are: 1. which coin they
chose (multinomial), 2. whether they bought insurance during the dice game (binary), and 3. willingness
to pay for insurance (continuous).
2. Data Cleaning and Visualization
The data started as two data sets: one survey from Kapchorwa and another from Oyam. We stacked
the data to create one dataset. We then made sure that the same farmer did not appear in multiple rows, as
duplicates were an issue. We removed variables that had a high missing rate and deleted rows where we
had an unrealistic response, such as 99 children, or "chickens" given as the answer to what should be a
numeric question. We decided to remove these farmers completely from the survey, as they made up less
than 1% of our entire dataset and we thought that if they gave an unreasonable response for one question
they were more likely to give misinformation for other questions.
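The cleaning steps above can be sketched as follows. This is an illustration only, not the code used for the project: the column names, the 50% missing-rate threshold, and the upper limit of 30 children are all hypothetical choices made for the example.

```python
def clean(rows, id_key="farmer_id", max_missing=0.5, limits=None):
    """Deduplicate farmers, drop sparse columns, drop implausible rows."""
    limits = limits or {}
    # 1. Keep the first row seen for each farmer id.
    seen, unique = set(), []
    for row in rows:
        if row[id_key] not in seen:
            seen.add(row[id_key])
            unique.append(row)
    # 2. Keep only columns whose missing rate is at most max_missing.
    columns = {k for row in unique for k in row}
    keep = {
        c for c in columns
        if sum(row.get(c) is None for row in unique) / len(unique) <= max_missing
    }
    # 3. Drop rows with a non-numeric or out-of-range answer to a numeric question.
    cleaned = []
    for row in unique:
        ok = True
        for col, upper in limits.items():
            value = row.get(col)
            if value is None:
                continue
            if not isinstance(value, (int, float)) or not 0 <= value <= upper:
                ok = False
        if ok:
            cleaned.append({k: v for k, v in row.items() if k in keep})
    return cleaned

# Made-up rows standing in for the two regional surveys.
kapchorwa = [
    {"farmer_id": 1, "children": 4, "notes": None},
    {"farmer_id": 1, "children": 4, "notes": None},   # duplicate farmer
    {"farmer_id": 2, "children": 99, "notes": None},  # implausible count
]
oyam = [
    {"farmer_id": 3, "children": "chickens", "notes": None},  # non-numeric answer
    {"farmer_id": 4, "children": 2, "notes": None},
]
stacked = kapchorwa + oyam  # stack the two surveys into one dataset
result = clean(stacked, limits={"children": 30})
print(result)  # [{'farmer_id': 1, 'children': 4}, {'farmer_id': 4, 'children': 2}]
```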
The majority of people have a willingness to pay under 25,000. Therefore any insurance product
would have to be priced in this range to capture a meaningful portion of the market.
A cross section of dice and coin game results shows no real pattern between particular coin/die
combinations. The majority of people did purchase insurance in the dice game and chose alpha or
beta coins.
Comparing the major regions, it appears that Region 1 (Kapchorwa) has higher WTP than Region
2 (Oyam) and therefore might be a better target for marketing.
However, Region 2 was more likely to choose Alpha than Region 1. Since Alpha is more
conservative, we interpreted this as Region 2 being more risk averse than Region 1. We need to
investigate further whether this risk aversion translates to insurance purchases.
Somewhat contrary to the coin game, Region 2 purchased insurance at a slightly lower rate
than Region 1, though the overwhelming effect is that most people purchased insurance in the
game, suggesting that this test could be improved to show more separation.
The majority of farms are low in value, with a few outliers.
Region 1 appears to be wealthier than Region 2, which may partially explain the WTP
difference.
Arules
(Full index of association rules in appendix)
Highlights (columns: rule number, rule, support, confidence, lift)
1 {DF5,HIV5} => {INJ5} 0.1055261 0.6413662 3.8980949
2 {HIV5,INJ5} => {DF5} 0.1055261 0.8066826 3.4359100
People who are highly concerned about droughts and floods are also highly concerned about HIV and
injury. This suggests that certain people are highly concerned with uncertainty in general, and if we can
ascertain who is concerned about injury and/or HIV, that will also tell us who might want to buy insurance.
For example, in rule number 2, given a new farmer who is extremely worried about HIV and injury, we are
about 80% sure that they also worry highly about droughts/floods, and this is 3.4x more likely than if we
didn't know about their HIV and injury concern level.
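For readers unfamiliar with the three numbers reported for each rule, the sketch below computes support, confidence, and lift from scratch. The transactions are made up for illustration and the function name is ours; only the item labels (HIV5, INJ5, DF5) are borrowed from the rules above.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)           # rows containing lhs
    n_rhs = sum(rhs <= t for t in transactions)           # rows containing rhs
    n_both = sum((lhs | rhs) <= t for t in transactions)  # rows containing both
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift

# Toy data (made up): each set holds one farmer's "very worried" flags.
farmers = [
    {"HIV5", "INJ5", "DF5"},
    {"HIV5", "INJ5", "DF5"},
    {"HIV5", "INJ5"},
    {"DF5"},
    set(), set(), set(), set(),
]
support, confidence, lift = rule_metrics(farmers, {"HIV5", "INJ5"}, {"DF5"})
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

Because lift is confidence divided by the overall frequency of the right-hand side, the figures reported for rule 2 (0.807 / 3.44) imply that roughly 23% of all farmers gave the highest drought/flood worry answer.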
3 {13,19} => {16} 0.1008430 0.8682796 3.3874537
4 {19,6} => {16} 0.1286294 0.8273092 3.2276145
Crop associations dominate as well, suggesting that it is very common to grow multiple crops. In this
example sweet potato, sunflower, and sim-sim are grouped together.
47 {q32_2,q32_9} => {q32_12} 0.1045894 0.8438287 2.2924371
Question 32 asked people to pick what they would do if they suffered a drought or flood. Our
hypothesis was that people with very few options in a drought scenario would be more inclined to
buy insurance. However, there was no strong association between their emergency plan and the coin or
dice results. Instead we found that people will do multiple things in case of a drought/flood. In this case
one of the most common strategies is to eat less, reduce expenditures, and sell off livestock.
K-means
We identified 20 demographic-related questions such as the value of the farm, age, marital status, etc. We
thought K-means would be a good way to reduce the number of variables that describe demographic
information into a few clusters that we could use to describe demographics. Using K-means to
generate three clusters, we got clusters that contained 938, 977 and 598 observations. For 4 clusters we
got 404, 468, 1035 and 606 observations. For 5 clusters we got 466, 382, 1015, 55 and 595 observations.
To decide the optimal number of clusters, we then looked at the centers and compared how well the
clusters performed in a regression.
K-means is a great way to deal with the collinearity of the many living conditions questions. It ended
up grouping those who answered good into one cluster, average into another and poor into another. We
can now talk about living conditions through our clusters instead of looking at each isolated effect of
having electricity, roof quality, door quality, etc.
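As a rough illustration of how K-means collapses several correlated living-condition answers into one cluster label, here is a minimal implementation of Lloyd's algorithm. It is a sketch, not the routine used for the project: the (roof, door) quality scores are made up, and the centers are initialized from the first k points purely to keep the example deterministic.

```python
def kmeans(points, k, iters=50):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    centers = list(points[:k])  # simplification: first k points as starting centers
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Made-up (roof quality, door quality) scores on a 1-5 scale.
points = [(5, 5), (3, 3), (1, 1), (5, 4), (3, 2), (1, 2), (4, 5), (2, 3), (2, 1)]
centers, clusters = kmeans(points, k=3)
sizes = [len(c) for c in clusters]
print(sizes)  # [3, 3, 3]
```

The three clusters here line up with "good", "average", and "poor" answers, which mirrors the grouping described above.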
Using 3 clusters, we see that Cluster 1 is from Kapchorwa and has average living conditions, higher farm
value and many children. Cluster 2 is from Oyam and has nice living conditions. Cluster 3 is from
Oyam and has poor living conditions and low farm value. Adding a fourth cluster, we now see a group
from Oyam that has good living conditions. What is interesting is that this group has the lowest farm value.
Farming also makes up the smallest share of their income compared to the other clusters. Therefore the
nicer living conditions may not be tied to a higher farm value but to income outside of farming.
When moving to 5 clusters, variation becomes dominated by having electricity, and the story is less clear
because there is still only one cluster with its center in Kapchorwa.
3. Game 1 - Coin choice
In this section we build several models to predict which coin (Alpha, Beta, Gamma, Delta, or
Epsilon) was chosen by the farmer. Farmers who choose Alpha or Delta are more risk averse, and we
would like to know what factors are associated with more risk averse farmers. The models we will
present are: 1. Multinomial logistic regression with a gamma-lasso penalty (mnlm), 2. Mnlm on principal
components we identify using principal component analysis, 3. Classification tree, 4. Random Forest.
1. Multinomial logistic regression with a gamma-lasso penalty (mnlm)
As we are regressing on 143 variables, we want a strict penalty to avoid overfitting and to help identify the
main drivers in classifying farmers' coin choice. The largest loading we see is on region. If the farmer is in
Oyam instead of Kapchorwa, the odds of choosing Alpha are e^(.37-0)=1.45 times the odds of choosing
Beta. This is also true for Alpha vs Delta, Gamma and Epsilon. We also observe that a one standard
deviation increase in the number of children multiplies the odds of choosing Alpha relative to Epsilon by
1.04. Crops also have an interesting relationship with coin choice. If a farmer plants crops 4, 11,
18, or 20 we see an increase in the relative odds of choosing Alpha; if a farmer plants crop 3 we see a
decrease in the relative odds of choosing Alpha. If a farmer plants crops 5 or 14 we see an increase in
the relative odds of choosing Epsilon, while if a farmer plants crop 1 we see a decrease in the relative
odds of choosing Epsilon. The largest crop loading is on crop 5: if a farmer is planting crop 5,
the odds of choosing Epsilon increase 1.16 times relative to the odds of each other coin. For more on the
impact of crops, please see the appendix. The effect of worrying about HIV is also fairly large. If a farmer
answered that they have a very high concern about HIV instead of no concern, then we would expect the
odds of the farmer choosing Epsilon to be 1.13 times the odds of choosing Alpha and 1.06 times the odds
of choosing Delta. It might be that someone who engages in more risky behavior is going to be more
worried about HIV. In other words, our Alpha-type farmers might think that since they don't take risks
they don't need to worry about HIV. Unlike HIV, for the questions about worrying about drought or
injury, the farmers who reported more worry were more likely to choose the more risk averse
coins.
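The odds statements in this subsection all come from exponentiating the difference between two classes' loadings on the same variable. A minimal sketch of that arithmetic, using the region loadings quoted above (0.37 for Alpha, 0 for Beta); the function name is ours:

```python
import math

def relative_odds(loading_a, loading_b, delta=1.0):
    """Multiplicative change in odds(class a) / odds(class b) when the
    predictor increases by `delta`, given each class's logit loading."""
    return math.exp((loading_a - loading_b) * delta)

# Region loading from the text: 0.37 for Alpha and 0 for Beta (Oyam vs Kapchorwa).
alpha_vs_beta = relative_odds(0.37, 0.0)
print(round(alpha_vs_beta, 2))  # 1.45
```

For a standardized predictor such as number of children, `delta=1.0` corresponds to a one standard deviation increase.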
Replacing the demographic variables we used for clustering with the clusters themselves in the model
described above, we see that being in the Kapchorwa and average living conditions cluster (clus3) instead
of the Oyam and good living conditions cluster (clus1) changes the relative odds of choosing Alpha by a
factor of 0.97. Being in the poor living conditions in Oyam cluster (clus4) instead of clus1 increases the
relative odds of choosing Alpha by a factor of 1.11. The Oyam and average living conditions cluster
(clus2) did not have any non-zero loadings.
2. Principal Component Analysis
We ran a principal component analysis on our explanatory variables. PC1 has very high rotations for
region and then crops 16, 6, 8, 23 and 7. This is very interesting because when we ran K-means on the
crops, high scores in 6, 7, 8 and 16 tended to be grouped together. These crops are also highly correlated
with Region 2. For example, 90 percent of farmers in Oyam plant crop 6 (cassava) but only 18 percent of
the farmers in Kapchorwa plant cassava. We therefore have high multicollinearity between crops and
region, and we want to use PCA to help cope with this problem, which a false discovery rate or other
methods would not deal with as well. PC2 seems to measure general worrying, as the four questions about
worrying have the four largest rotations for PC2. It is interesting that worrying about HIV
has a different relationship with coin choice as opposed to the other worrying questions, yet all four worry
questions load on PC2 in the same direction: more worry means a higher PC2 score. The association rules
also showed a high lift and confidence for the factors that people worry about (HIV, flood/drought, and
injury) occurring together. Crops 20 and 12 also load high on PC2 and they do tend to be in the same
clusters. You can see from the graph of PC1 vs PC2 that Oyam is on the far right and Kapchorwa is on the
far left. High worries are on top. When we run the mnlm on PC1 and PC2, we find that for a 1 standard
deviation change in the PC1/Region score towards Oyam, the odds of choosing Alpha increase to
e^(.32-(-.23))=e^(.54)=1.7 times the odds of choosing Epsilon. Farmers with high worry (PC2) scores are
more likely to choose the extreme coins Alpha or Epsilon relative to the middle coins. This may be due to
the HIV effect discussed earlier.
3. Classification tree
(pruned, unpruned trees are in appendix)
Because of the high multicollinearity and the non-linear nature of our classification model, trees are a
good tool for this problem. We were unable to test all combinations of interaction terms due to
computational limits, whereas trees take interactions into account while reducing dimensions. It appears
that region is the dominant factor, followed by crop type (rice) and whether the farmer's friends have
insurance.
4. Random Forest
In the random forest on coins, some of the most influential questions were: amount they would have to
repay in a drought, age, value per acre, and number of kilos of crops sold. Many of these variables seem
to proxy a more general notion of how much the farmer's crop is worth, as a first order variable on coin
choice. Curiously, the random forest doesn't have region as a first order factor. Our other models may be
overemphasizing the effect of region, which may decrease when interacted with other terms. We will rely
on cross validation to pick the best out-of-sample model.
6. Cross Validation Classification models
[Figure: Coin RF variable importance (IncNodePurity)]
Random Forest turns out to be the best model for out of sample prediction. Interaction terms are
significant (presented in conclusion) and therefore random forests are performing the best because they
capture this effect. Random forests also compensate for non-linearity and multicollinearity issues. We
hypothesize that our other models are overemphasizing Region, which may be captured in other variables.
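The model comparison above rests on out-of-sample error from cross validation. The sketch below shows the mechanics of k-fold cross validation with two deliberately simple stand-in models (a mean-only predictor and a one-variable least-squares line) on made-up data. None of the models or data from the project appear here, and folds are taken contiguously for simplicity; in practice the rows would usually be shuffled first.

```python
def kfold_mse(xs, ys, fit, k=5):
    """Average out-of-sample MSE of `fit` over k contiguous folds.
    `fit(train_x, train_y)` must return a function predicting y from x."""
    n = len(xs)
    fold_errors = []
    for f in range(k):
        test_idx = set(range(f * n // k, (f + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        errs = [(ys[i] - predict(xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

def fit_mean(train_x, train_y):
    """Baseline model: always predict the training mean."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y

def fit_line(train_x, train_y):
    """One-variable ordinary least squares."""
    mx = sum(train_x) / len(train_x)
    my = sum(train_y) / len(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

# Toy data (made up): y is roughly linear in x.
xs = list(range(20))
ys = [2 * x + (1 if x % 2 else -1) for x in xs]
mse_mean = kfold_mse(xs, ys, fit_mean)
mse_line = kfold_mse(xs, ys, fit_line)
print(mse_line < mse_mean)  # True
```

The model with the lower held-out MSE is the one preferred for out-of-sample prediction.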
2. Willingness to Pay
In this section we model the continuous variable willingness to pay for insurance. In this
model we are estimating an actual dollar amount for WTP. We will use: 1. Regression with a ridge
penalty and regression with a lasso penalty, 2. Regression on our principal components, 3. Partial least
squares, 4. Regression tree, 5. Random Forest.
1. Penalized Regression Ridge vs Lasso
We regressed WTP on our explanatory variables with a ridge penalty. The ridge penalty does not force
loadings to zero, so we then used a lasso penalty to more easily identify which factors were
important for determining WTP. With our lasso penalized regression we found that 19 out of 143
variables had non-zero loadings in the optimal model for predicting WTP. Crop 21 had the largest loading:
if you planted crop 21, your WTP increased by $2,506. If you live in the Oyam region, your WTP decreases
by $7,183. Crops 8, 5, and 14 also had non-zero loadings. We also learned that for a 1 STD increase in how
worried the farmer is about HIV we expect to see a $506 decrease in WTP for insurance.
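The reason the lasso yields exact zeros while ridge does not is easiest to see in the textbook special case of orthonormal predictors, where both penalized estimates are simple functions of the unpenalized (OLS) coefficients. The sketch below assumes that special case, with objectives (1/2)RSS + lam*|b| for lasso and (1/2)RSS + (lam/2)*b^2 for ridge. The coefficients are made up and are not the project's estimates.

```python
import math

def ridge_shrink(b_ols, lam):
    """Ridge under an orthonormal design: scale every coefficient toward zero."""
    return [b / (1 + lam) for b in b_ols]

def lasso_shrink(b_ols, lam):
    """Lasso under an orthonormal design: soft-threshold each coefficient."""
    return [math.copysign(max(abs(b) - lam, 0.0), b) for b in b_ols]

# Made-up OLS coefficients, for illustration only.
b_ols = [2.5, -7.2, 0.3, -0.1, 0.5]
ridge = ridge_shrink(b_ols, lam=0.5)
lasso = lasso_shrink(b_ols, lam=0.5)

n_ridge_nonzero = sum(b != 0 for b in ridge)
n_lasso_nonzero = sum(b != 0 for b in lasso)
print(n_ridge_nonzero, n_lasso_nonzero)  # 5 2
```

Ridge keeps all five coefficients (just smaller), while the lasso drops every coefficient whose magnitude is at most the penalty, which is why it is the more convenient tool for picking out a short list of variables.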
We also used a regression with a ridge penalty to model WTP for insurance using our demographic
clusters. There is no significant increase in WTP for being in the Oyam and average living conditions
cluster (clus2) instead of the Oyam and good living conditions cluster (clus1) (p-val=.71). We expect a
$2,660.67 decrease in WTP for being in the Kapchorwa and average living conditions cluster (clus3)
instead of the Oyam and average living conditions cluster. We expect a $1,973.49 decrease in WTP for
poor living conditions in Oyam (clus4) instead of average living conditions in Oyam (clus1). However,
when we apply a false discovery rate of 15%, we find that only the Kapchorwa cluster remains significant.
This is consistent with our other results that region is the main driver of differences between the farmers.
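One common way to apply a false discovery rate is the Benjamini-Hochberg procedure, sketched below. We are assuming this is the kind of procedure meant above; the text does not name one. Only the p-value of 0.71 for clus2 comes from the text; the other two p-values are hypothetical placeholders chosen for the example.

```python
def benjamini_hochberg(p_values, q=0.15):
    """Return a list of booleans: True where the test is kept at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= q * r / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    # Keep every test ranked at or below that cutoff.
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[i] = True
    return keep

# clus2's p-value is from the text; clus3 and clus4 are hypothetical.
p_values = {"clus2": 0.71, "clus3": 0.01, "clus4": 0.12}
kept = benjamini_hochberg(list(p_values.values()), q=0.15)
significant = [name for name, k in zip(p_values, kept) if k]
print(significant)  # ['clus3']
```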
2. Regression on our principal components
We used the same principal components described earlier. For WTP our linear regression has the
lowest BIC when we use 49 principal components. Consistent with our earlier findings, Oyam has lower
WTP, and we learn that for a 1 STD increase in the amount a farmer worries (PC2) we expect a $181
increase in WTP.
3. Partial least squares
PLS did a nice job without too much overfitting; details are in the appendix and in subsection 6.
4. Regression Tree
The dominant effect is borrowing money and the source of that money (bank, relatives etc), followed by
district (which is one of the main effects that has persisted) and then the type of crop grown. For cotton
growers, the value of the crop (kilos sold last year, value of half the crop, etc) determines willingness to
pay.
5. Random Forest
[Figure: regression tree for WTP, with splits on q43, q3, nq135_7, q58, q22 and q57]
When looking at the random forest, the top variables pertain to leverage (how much you would have to pay
in a drought, whether you've borrowed money and from whom, and how frequently you've borrowed),
followed by value-of-crop effects. This seems to confirm the findings from the CART tree and suggests a
story where you have higher WTP for insurance if you are already levered and therefore concerned you
won't be able to cover your losses in the case of a drought or flood.
6. Cross Validation for Out of Sample Predictions
[Figure: WTP RF variable importance (IncNodePurity)]
Random Forest again performs quite well for the aforementioned reasons, but in this case Lasso, Ridge,
and PLS performed almost as well, as WTP is more linear.
3. Game 2: Dice, Buy Insurance (Yes/No)
In the first game with coins there were five different buckets we could classify our farmers into. In the
dice game we have a binary outcome if the farmer bought insurance or not. In this model we will be
estimating the odds of a farmer buying insurance. We will: 1. Perform a logistic regression with a
gamma-lasso penalty, 2. PCR, 3. PLS, 4. CART and 5. Random Forest
Logistic Regression
Because we are trying to predict a binary outcome, we will use logistic regression to predict odds, and
apply a gamma lasso penalty to help with dimension reduction. For a standard deviation increase in
worrying about drought or flood, the odds of buying insurance are 1.2x higher. For Region 2 the odds of
buying insurance are only 0.74 of the odds of buying insurance for Region 1.
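Odds ratios can be hard to read directly, so the sketch below converts them into probabilities for a given starting point. The odds ratios (0.74 and 1.2) are the ones reported above; the 80% baseline purchase rate is a hypothetical figure chosen for the example, not a number from the data.

```python
def shift_probability(p_base, odds_ratio):
    """Apply an odds ratio to a baseline probability; return the new probability."""
    odds = p_base / (1 - p_base) * odds_ratio
    return odds / (1 + odds)

# Hypothetical baseline: 80% of Region 1 farmers buy insurance in the game.
p_region2 = shift_probability(0.80, 0.74)  # Region 2 odds are 0.74 times Region 1 odds
p_worried = shift_probability(0.80, 1.2)   # one SD more drought/flood worry

print(round(p_region2, 3))  # 0.747
print(round(p_worried, 3))  # 0.828
```

When most farmers already buy insurance, even a sizeable odds ratio moves the probability by only a few percentage points, which helps explain why the models struggle to separate buyers from non-buyers.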
PCR
3 is the optimal number of factors to minimize BIC. PCR does not do a very good job of separating
people who do and do not buy insurance, probably because people overwhelmingly chose to buy
insurance regardless of the PCR factors.
PLS
We first ran PLS with five z's because that seemed optimal when examining correlation. Because PLS is
prone to overfitting, we also tested using only one z, which still performed worse than the other models.
This suggests that other factors, such as non-linearity and missing interactions, may be driving
down PLS performance overall.
Trees
The tree is rather simplistic because people overwhelmingly chose to buy insurance in the dice game.
District is the dominant effect, followed by door material (a proxy for being well off, as discovered in
K-means). A good or average door predicts buying insurance. We believe this points to a trend where
having more crop or home value makes farmers more willing to buy insurance.
[Figure: classification tree for the dice game, splitting on q3 and then q12; every leaf predicts True]
In the random forest, strangely, the most influential variable is q13 (age), which has not come up as
significant in our other regressions, followed by the amount that would have to be paid in case of
drought/flood. The next most important variables again proxy value (how many kilos last season,
value per acre, and farm value).
[Figure: Dice RF variable importance (MeanDecreaseGini)]
While it was hard to tell from the MSE plots which model was the best, when we look at the ROC curves
we see that the random forest almost never has a false positive or a false negative. One thing we
need to remember is that we are modeling the probability of choosing to buy insurance in the dice game,
and we do not know how that corresponds to buying insurance in real life. If we knew the cost of
marketing to each farmer and what the return would be if they bought insurance, we could use the
sensitivity and specificity to come up with a classification rule that maximizes profit.
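The profit-maximizing classification rule described above could look like the sketch below: for each candidate cutoff on the predicted probability, we market to every farmer at or above it and compute the resulting profit, sensitivity, and specificity. The predicted probabilities, outcomes, marketing cost, and revenue per sale are all hypothetical, since, as noted, the real figures are not known.

```python
def best_threshold(probs, bought, cost, revenue):
    """Pick the probability cutoff that maximizes profit when we market to
    every farmer scoring at or above the cutoff.
    Returns (profit, cutoff, sensitivity, specificity)."""
    best = (float("-inf"), None, None, None)
    for cutoff in sorted(set(probs)):
        targeted = [b for p, b in zip(probs, bought) if p >= cutoff]
        tp = sum(targeted)             # buyers we marketed to
        fp = len(targeted) - tp        # non-buyers we marketed to
        fn = sum(bought) - tp          # buyers we missed
        tn = len(bought) - tp - fp - fn
        profit = revenue * tp - cost * len(targeted)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        if profit > best[0]:
            best = (profit, cutoff, sensitivity, specificity)
    return best

# Hypothetical model scores and outcomes (1 = bought insurance).
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
bought = [1, 1, 0, 1, 0, 0]
profit, cutoff, sensitivity, specificity = best_threshold(
    probs, bought, cost=2, revenue=5
)
print(profit, cutoff)  # 7 0.4
```

With these made-up numbers the best rule markets to everyone scoring at least 0.4, accepting one wasted contact in exchange for reaching every buyer.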
4. Conclusion
Region
Region came up repeatedly as one of the most important factors. However, we want to caveat this, as our
random forest models often showed that region was not as important, and those models held up very well
out of sample. Region is strongly correlated with many other factors and captures a lot of other
information, including types of crops planted and to some extent wealth. The tree models get around some
of this multicollinearity. To adjust for this problem with region, we included interaction terms between
region and all the other independent variables to hopefully isolate the main effect of region. In our
multinomial model to predict coin choice, the main effect of region changed to zero for all coins, but the
interactions were non-zero! We conclude that region alone does not determine coin choice but works
strongly with the other variables. Now our largest loading is on the interaction between region and how
worried the farmer is about drought/flood. Each main effect is zero; however, if the farmer is in Region 2,
for each standard deviation increase in how much they worry about drought/flood we would expect the
relative odds that they choose Alpha to be multiplied by 1.21. However, CV reveals that our MNLM with
interactions has higher MSE than our no-interaction model (see appendix). The lasso penalty should be
helping us with overfitting, so we believe that there should be a main effect of region, as out-of-sample
MSE can be large due to either overfitting or omitting variables (over-penalizing). While in the dice
gamma-lasso
penalized regression we also see the main effect of region set to zero, there the main effect of worrying
about drought/flood is the largest and the interaction of region with drought/flood worrying is zero.
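To make the interaction finding concrete, the sketch below builds region-by-variable interaction columns for a single farmer and applies the 1.21 per-standard-deviation figure quoted above for the coin model. The variable names (`region2`, `worry_df`) and the worry score of 2.0 are hypothetical; the calculation assumes, as stated in the text, that both main effects are zero so only the interaction matters.

```python
def add_region_interactions(row, region_key="region2"):
    """Return a copy of `row` with a region x variable product for each variable."""
    out = dict(row)
    for name, value in row.items():
        if name != region_key:
            out[region_key + ":" + name] = row[region_key] * value
    return out

def alpha_odds_multiplier(row, per_sd=1.21):
    """Multiplier on the relative odds of Alpha when only the
    region x drought/flood-worry interaction has a non-zero loading."""
    return per_sd ** row["region2:worry_df"]

# Two hypothetical farmers with the same standardized worry score (2 SD above the mean).
oyam = add_region_interactions({"region2": 1, "worry_df": 2.0})
kapchorwa = add_region_interactions({"region2": 0, "worry_df": 2.0})

print(round(alpha_odds_multiplier(oyam), 4))       # 1.4641
print(round(alpha_odds_multiplier(kapchorwa), 4))  # 1.0
```

The same worry score shifts the odds only for the Region 2 farmer, which is what it means for region to act through its interactions rather than through a main effect.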
Risk Aversion
We found that farmers who were highly concerned about other dangers such as HIV or injury tended to be
much more concerned about drought than the general population. We interpreted this as their general risk
aversion (principal component 2), and we observe through our principal component analysis and our lasso
regression that an increase in general worrying also tends to increase WTP, with HIV as an exception.
Wealth
Variables that proxy wealth, such as past crop value, farm value, and standard of living measures seem to
track to willingness to pay for insurance. Perhaps having more wealth makes farmers more sensitive to
loss. If you have a big farm, a drought is suddenly much more costly than if you had a tiny farm. And
perhaps if you have a better home that means you may have more collateral posted for any existing loans.
Existing Leverage
Farmers' existing borrowing patterns, as well as whom they owed money to, affected their insurance
choices. We believe that existing leverage makes people more risk averse, because a drought that affects
their ability to repay a loan could mean losing collateral or having to pay interest for longer.
We would like to thank Karl Muth for letting us use the data for this project. Please do not discuss these
results outside the context of the data mining class. This project was inspired by Muth, Karl and Jennifer
Helgeson, Stochastic Environments as Measurement Tools, The Journal of Applied Economy, 2010.