Ugandan Farmers Research Project


  • 8/10/2019 Ugandan Farmers Research Project


    Mark Berberian, Murali Rajamani, Nick Xu

    A Tale of Two Regions - or of the Variables that make each Region Unique

    Abstract

    We used several data mining techniques on a high-dimensional data set to predict three outcomes for
    farmers in Uganda: willingness to pay for insurance (continuous), level of risk aversion (five-level
    classification), and whether they bought insurance in a simulation game (binary). We found that region
    (binary: whether the farmer was from Kapchorwa or Oyam) had a dominating effect. The degree to which
    farmers worried about factors such as HIV or injury, which crops they planted, and their living
    conditions also affected these dependent variables, and some of these predictors were highly correlated
    with region. Trees were useful for handling this multicollinearity and the need for interaction terms,
    and the random forest therefore tended to have the best out-of-sample MSE.

    1. Introduction

    The Grameen Foundation and other organizations are working on a research project to learn about risk
    aversion and whether there is a market for insurance among farmers in Uganda. The motivation behind
    the survey is to identify which practices are associated with farmers who are more risk averse and thus
    more likely to buy insurance, so that microinsurance companies can target their marketing effectively.
    The survey included two economics games to measure risk tolerance. The coin game is used to determine
    risk aversion. Farmers are offered a choice between two coins whose outcomes have different variances;
    for example, one coin pays $5 on heads and $5 on tails, while another pays $9 on heads and $2 on tails.
    There are five different coins, and farmers are presented with different options until their top-choice
    coin is discovered. The dice game is used to determine whether the farmer would like to purchase
    insurance. Farmers roll four dice to determine their payout, and they can sacrifice a die as insurance
    against a bad outcome.
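
    To make the coin example concrete, the sketch below computes the expected payoff and variance of the
    two coins quoted above, assuming each coin is fair. Only these two coins are specified in the text;
    the labels "safe coin" and "risky coin" are ours, for illustration only.

```python
from statistics import mean, pvariance

# Payoffs (heads, tails) for the two example coins described in the text.
# The remaining three coins are not specified in the source, so they are omitted.
coins = {
    "safe coin": (5, 5),
    "risky coin": (9, 2),
}

def summarize(payoffs):
    """Return (expected payoff, variance) for a fair coin with these payoffs."""
    return mean(payoffs), pvariance(payoffs)

for name, payoffs in coins.items():
    expected, variance = summarize(payoffs)
    print(f"{name}: expected payoff = {expected}, variance = {variance}")
```

    The risky coin offers a slightly higher expected payoff at the price of a much larger variance, which
    is the trade-off the game uses to reveal risk aversion.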

    These games give us a good measure of someone's risk aversion because they capture revealed
    preference rather than a stated preference such as the Willingness to Pay (WTP) questions. The survey
    contains several WTP questions: farmers were asked about various amounts, in descending and ascending
    order, for full insurance and for half insurance. Willingness-to-pay scales of this sort are widely
    used in surveys and are believed to provide a more accurate estimate of WTP (e.g. Oster, Emily and
    Rebecca Thornton, Determinants of Technology Adoption, Journal of the European Economic Association,
    January 2, 2010). We then created a single WTP variable from the numerous WTP-related questions.

    Our goal is to build a model that predicts whether a farmer would buy insurance and to understand
    attitudes towards risk among Ugandan farmers. Our three dependent variables are: 1. which coin they
    chose (multinomial), 2. whether they bought insurance during the dice game (binary), and 3. willingness
    to pay for insurance (continuous).

    2. Data Cleaning and Visualization


    The data began as two data sets, one survey from Kapchorwa and another from Oyam, which we stacked
    into a single dataset. Because duplicates were an issue, we made sure that no farmer appeared in more
    than one row. We removed variables with a high missing rate and deleted rows containing an unrealistic
    response, such as 99 children, or chickens given as the answer to a numeric question. We removed
    these farmers from the data entirely because they made up less than 1% of the dataset and we thought
    that a farmer who gave an unreasonable response to one question was more likely to give misinformation
    on other questions.
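
    The cleaning steps described above can be sketched as follows. This is a minimal illustration, not
    the code used for the project: the field names (farmer_id, region, children), the toy rows, and the
    plausibility cutoff are all assumptions made for the example.

```python
# Hypothetical survey rows; the column names are illustrative, not the survey's.
kapchorwa = [
    {"farmer_id": 1, "region": "Kapchorwa", "children": 4},
    {"farmer_id": 2, "region": "Kapchorwa", "children": 99},      # implausible count
    {"farmer_id": 1, "region": "Kapchorwa", "children": 4},       # duplicate farmer
]
oyam = [
    {"farmer_id": 3, "region": "Oyam", "children": "chickens"},   # non-numeric answer
    {"farmer_id": 4, "region": "Oyam", "children": 2},
]

MAX_PLAUSIBLE_CHILDREN = 30  # assumed cutoff, not taken from the paper

def is_valid(row):
    """A row is valid if the numeric question has a plausible numeric answer."""
    value = row["children"]
    return isinstance(value, int) and 0 <= value <= MAX_PLAUSIBLE_CHILDREN

def clean(*surveys):
    """Stack the surveys, keep the first row per farmer, drop invalid rows."""
    seen, kept = set(), []
    for survey in surveys:
        for row in survey:
            key = (row["region"], row["farmer_id"])
            if key in seen:
                continue          # the same farmer already appeared
            seen.add(key)
            if is_valid(row):
                kept.append(row)  # the whole farmer is dropped if any answer is unrealistic
    return kept

cleaned = clean(kapchorwa, oyam)
print([row["farmer_id"] for row in cleaned])
```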


    The majority of people have a willingness to pay under 25,000. Any insurance product would therefore
    have to be priced in this range to capture a meaningful portion of the market.

    A cross section of dice and coin game results shows no real pattern among particular coin/die
    combinations. The majority of people purchased insurance in the dice game and chose the Alpha or
    Beta coin.

    Comparing the two regions, Region 1 (Kapchorwa) appears to have a higher WTP than Region 2 (Oyam)
    and might therefore be a better target for marketing.


    However, Region 2 was more likely to choose Alpha than Region 1. Since Alpha is more conservative,
    we interpreted this as Region 2 being more risk averse than Region 1. We need to investigate further
    whether this risk aversion translates into insurance purchases.

    Somewhat contrary to the coin game, Region 2 purchased insurance at a slightly lower rate than
    Region 1. The overwhelming effect, though, is that most people purchased insurance in the game,
    which suggests that this test could be improved to show more separation.

    The majority of farms are low in value, with a few outliers.

    Region 1 appears to be wealthier than Region 2, which may partially explain the WTP difference.

    Arules

    (Full index of association rules in appendix)

    Highlights

    rule  lhs             rhs       support    confidence  lift
    1     {DF5, HIV5}  => {INJ5}    0.1055261  0.6413662   3.8980949
    2     {HIV5, INJ5} => {DF5}     0.1055261  0.8066826   3.4359100

    People who are highly concerned about droughts and floods are also highly concerned about HIV and
    injury. This suggests that certain people are highly concerned with uncertainty in general, and if we
    can ascertain who is concerned about injury and/or HIV, that will also tell us who might want to buy
    insurance. For example, rule 2 says that if a new farmer is extremely worried about HIV and injury,
    we are about 80% sure that they will also worry highly about droughts and floods, which is 3.4 times
    as likely as it would be if we did not know their level of concern about HIV and injury.
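
    The three numbers reported for each rule are support, confidence, and lift. The sketch below shows
    how they are computed on a small made-up set of transactions; only the item names (DF5, HIV5, INJ5)
    come from the text, and the toy data are not the survey data.

```python
# Toy transactions: each set holds the "very worried" flags reported by one farmer.
transactions = [
    {"DF5", "HIV5", "INJ5"},
    {"DF5", "HIV5", "INJ5"},
    {"HIV5", "INJ5"},
    {"DF5"},
    {"DF5", "HIV5"},
    {"INJ5"},
    set(), set(), set(), set(),
]

def support(itemset):
    """Share of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    joint = support(lhs | rhs)
    confidence = joint / support(lhs)    # P(rhs | lhs)
    lift = confidence / support(rhs)     # P(rhs | lhs) / P(rhs)
    return joint, confidence, lift

rule_support, rule_confidence, rule_lift = rule_metrics({"HIV5", "INJ5"}, {"DF5"})

# Because lift = confidence / P(rhs), the figures reported for rule 2 imply the
# overall share of farmers who are very worried about drought/flood.
implied_share_df5 = 0.8066826 / 3.4359100
```

    Read this way, the reported confidence and lift for rule 2 imply that roughly 23% of all farmers
    gave the highest drought/flood worry rating, against about 80% among those very worried about both
    HIV and injury.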

    3 {13,19} => {16} 0.1008430 0.8682796 3.3874537

    4 {19,6} => {16} 0.1286294 0.8273092 3.2276145

    Crop associations are prominent as well, suggesting that it is very common to grow multiple crops. In
    this example sweet potato, sunflower, and sim-sim are grouped together.

    47 {q32_2,q32_9} => {q32_12} 0.1045894 0.8438287 2.2924371

    Question 32 asked people to pick what they would do if they suffered a drought or flood. Our
    hypothesis was that people with very few options in a drought scenario would be more inclined to
    buy insurance. However, there was no strong association between their emergency plan and the coin
    or dice results. Instead we found that people do multiple things in case of a drought or flood. In
    this case, one of the most common strategies is to eat less, reduce expenditures, and sell off livestock.

    K-means

    We identified 20 demographic questions, covering the value of the farm, age, marital status, etc. We
    thought K-means would be a good way to reduce the many variables that describe demographic
    information to a few clusters that we could use to describe demographics. Using K-means to
    generate three clusters, we got clusters containing 938, 977 and 598 observations. With four
    clusters we got 404, 468, 1035 and 606 observations, and with five clusters 466, 382, 1015, 55 and 595
    observations. To decide on the number of clusters to use, we then looked at the centers and compared
    how well each set of clusters performed in a regression.
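
    For readers unfamiliar with the method, here is a minimal sketch of the K-means procedure (Lloyd's
    algorithm) on made-up two-dimensional points. The data, the starting centers, and the function name
    are illustrative assumptions; the project itself clustered 20 demographic variables.

```python
def kmeans(points, initial_centers, max_iterations=100):
    """Plain Lloyd's algorithm on equal-length numeric tuples."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers, clusters

# Hypothetical standardized scores, e.g. (living conditions, farm value).
points = [
    (0.9, 1.0), (1.1, 1.2), (1.0, 0.8),
    (-1.0, -0.9), (-1.2, -1.1), (-0.8, -1.0),
    (1.0, -1.0), (1.2, -0.8), (0.8, -1.2),
]
centers, clusters = kmeans(points, [points[0], points[3], points[6]])
print([len(c) for c in clusters])  # cluster sizes, as reported in the text
print(centers)                     # centers, used to interpret each cluster
```

    Comparing cluster sizes and centers across several choices of k, as in the paragraph above, is one
    way to judge whether an extra cluster adds an interpretable group.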

    K-means is a good way to deal with the collinearity of the many living-conditions questions. It ended
    up grouping those who answered good into one cluster, average into another, and poor into a third. We
    can now talk about living conditions through our clusters instead of looking at the isolated effect
    of having electricity, roof quality, door quality, etc.

    Using three clusters, we see that Cluster 1 is from Kapchorwa and has average living conditions, higher
    farm value, and many children. Cluster 2 is from Oyam and has nice living conditions. Cluster 3 is from
    Oyam and has poor living conditions and low farm value. Adding a fourth cluster, we now see a group
    from Oyam that has good living conditions. Interestingly, this group has the lowest farm value, and
    farming makes up a smaller share of its income than in any other cluster. The nicer living conditions
    may therefore be tied not to a higher farm value but to income outside of farming. When moving to five
    clusters, the variation becomes dominated by having electricity, and the story is less clear because
    there is still only one cluster with its center in Kapchorwa.

    3. Game 1 - Coin choice

    In this section we build several models to predict which coin (Alpha, Beta, Gamma, Delta, or
    Epsilon) the farmer chose. Farmers who choose Alpha or Delta are more risk averse, and we
    would like to know what factors are associated with more risk-averse farmers. The models we
    present are: 1. multinomial logistic regression with a gamma-lasso penalty (mnlm), 2. mnlm on the
    principal components we identify using principal component analysis, 3. classification tree, and
    4. random forest.

    1. Multinomial logistic regression with a gamma-lasso penalty (mnlm)

    Because we are regressing on 143 variables, we want a strict penalty to avoid overfit and to help
    identify the main drivers of farmers' coin choice. The largest loading we see is on region. If the
    farmer is in Oyam instead of Kapchorwa, the odds of choosing Alpha are e^(.37-0) = 1.45 times the odds
    of choosing Beta. The same holds for Alpha versus Delta, Gamma, and Epsilon. We also observe that a
    one standard deviation increase in the number of children multiplies the odds of choosing Alpha
    relative to Epsilon by 1.04.

    Crops also have an interesting relationship with coin choice. If a farmer plants crops 4, 11, 18, or
    20 we see an increase in the relative odds of choosing Alpha; if a farmer plants crop 3 we see a
    decrease in the relative odds of choosing Alpha. If a farmer plants crop 5 or 14 we see an increase
    in the relative odds of choosing Epsilon, while if a farmer plants crop 1 we see a decrease in the
    relative odds of choosing Epsilon. The largest crop loading is on crop 5: if a farmer plants crop 5,
    the odds of choosing Epsilon increase 1.16 times relative to the odds of each other coin. For more on
    the impact of crops, please see the appendix.

    The effect of worrying about HIV is also fairly large. If a farmer reported a very high concern about
    HIV instead of no concern, then we would expect the odds of the farmer choosing Epsilon to be 1.13
    times the odds of choosing Alpha and 1.06 times the odds of choosing Delta. It might be that someone
    who engages in riskier behavior is going to be more worried about HIV. In other words, our Alpha-type
    farmers might think that since they don't take risks they don't need to worry about HIV. Unlike
    HIV, farmers who reported more worry about drought or injury were more likely to choose the more
    risk-averse coins.
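
    The odds multipliers quoted in this subsection come from exponentiating the difference between two
    class loadings. A minimal sketch of that arithmetic, using the region loadings quoted above (0.37 for
    Alpha, 0 for Beta); the function name is ours:

```python
import math

def odds_multiplier(loading_a, loading_b, change_in_x=1.0):
    """Factor by which the odds of class A versus class B change when x rises by change_in_x."""
    return math.exp((loading_a - loading_b) * change_in_x)

# Region loadings quoted in the text: 0.37 for Alpha, 0 for Beta.
alpha_vs_beta = odds_multiplier(0.37, 0.0)
print(round(alpha_vs_beta, 2))
```

    Because the lasso sets many loadings to exactly zero, a multiplier of 1 for a given pair of coins
    simply means the variable was dropped for both classes.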

    When we replace the demographic variables used for clustering with the clusters themselves in the
    model described above, we see that being in the Kapchorwa and average living conditions cluster
    (clus3) instead of the Oyam and good living conditions cluster (clus1) multiplies the relative odds
    of choosing Alpha by 0.97. Being in the poor living conditions in Oyam cluster (clus4) instead of
    clus1 multiplies the relative odds of choosing Alpha by 1.11. The Oyam and average living conditions
    cluster (clus2) did not have any non-zero loadings.

    2. Principal Component Analysis

    We ran a principal component analysis on our explanatory variables. PC1 has very high rotations for
    region and then for crops 16, 6, 8, 23, and 7. This is interesting because when we ran K-means on the
    crops, high scores in 6, 7, 8 and 16 tended to be grouped together. These crops are also highly
    correlated with Region 2. For example, 90 percent of farmers in Oyam plant crop 6 (cassava) but only
    18 percent of farmers in Kapchorwa do. We therefore have high multicollinearity between crops and
    region, and we want to use PCA to help cope with this problem, which false discovery rate control or
    other methods would not handle as well.

    PC2 seems to measure general worrying, as the four questions about worrying have the four largest
    rotations for PC2. It is interesting that worrying about HIV has a different relationship with coin
    choice than the other worry questions, yet all four worry questions load on PC2 in the same direction:
    more worry means a higher PC2 score. The association rules also showed high lift and confidence for
    worries occurring together (HIV, flood/drought, and injury). Crops 20 and 12 also load highly on PC2,
    and they do tend to be in the same clusters. In the graph of PC1 vs PC2, Oyam is on the far right and
    Kapchorwa is on the far left, with high worries at the top.

    Running the mnlm on PC1 and PC2, we find that a one standard deviation move in the PC1/Region score
    towards Oyam makes the odds of choosing Alpha e^(.32 - (-.23)) = e^(.54) = 1.7 times the odds of
    choosing Epsilon. Farmers with high worry (PC2) scores are more likely to choose the extreme coins,
    Alpha or Epsilon, relative to the middle coins. This may be due to the HIV effect discussed earlier.

    3. Classification tree


    (pruned, unpruned trees are in appendix)

    Given the high multicollinearity and the non-linear nature of our classification model, trees are
    a good tool for dealing with these issues. We were unable to test all combinations of interaction
    terms due to computational limits, and trees take interactions into account while reducing
    dimensions. It appears that region is the dominant factor, followed by crop type (rice) and whether
    the farmer's friends have insurance.

    4. Random Forest

    In the random forest on coins, some of the most influential questions were the amount they would
    have to repay in a drought, age, value per acre, and number of kilos of crops sold. Many of these
    variables seem to proxy for the overall value of the farmer's crop, which acts as a first-order
    variable on coin choice. Curiously, the random forest does not have region as a first-order factor.
    Our other models may be overemphasizing the effect of region, which may decrease when interacted with
    other terms. We will rely on cross validation to pick the best out-of-sample model.

    [Figure: Coin RF variable importance (IncNodePurity, axis from 0e+00 to 1e+11). Variables in extracted
    order: Region, q16, q11, q130, q143, q140, q29, q134, nq135_7, q152, q3, q26, WorryInj, q18, q52, q27,
    q56, q21, WorryDF, HIV, q13, q150, q57, q19, Farm_Val, Val_Acre, q22, q39, q43, q58.]

    6. Cross Validation Classification models


    Random Forest turns out to be the best model for out of sample prediction. Interaction terms are

    significant (presented in conclusion) and therefore random forests are performing the best because they

    capture this effect. Random forests also compensate for non-linearity and multicollinearity issues. We

    hypothesize that our other models are overemphasizing Region, which may be captured in other variables.

    4. Willingness to Pay

    In this next section we model the continuous variable willingness to pay for insurance, estimating an
    actual dollar amount for WTP. We will use: 1. regression with a ridge penalty and regression with a
    lasso penalty, 2. regression on our principal components, 3. partial least squares, 4. regression
    tree, and 5. random forest.

    1. Penalized Regression Ridge vs Lasso

    We regressed WTP on our explanatory variables with a ridge penalty. The ridge penalty does not force
    loadings to zero, so we then used a lasso penalty to identify more easily which factors were
    important for determining WTP. With the lasso-penalized regression, 19 of the 143 variables had
    non-zero loadings in the model that best predicts WTP. Crop 21 had the largest loading, in that
    planting crop 21 increased WTP by $2,506. Living in the Oyam region decreases WTP by $7,183.
    Crops 8, 5, and 14 also had non-zero loadings. We also learned that for a one standard deviation
    increase in how worried the farmer is about HIV, we expect a $506 decrease in WTP for insurance.
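
    The difference between the two penalties is easiest to see in the textbook special case of an
    orthonormal design, where both estimators have closed forms. The sketch below assumes that special
    case (it is not the project's actual fit), with the objectives (1/2)||y - Xb||^2 + (lam/2)||b||^2 for
    ridge and (1/2)||y - Xb||^2 + lam*||b||_1 for lasso. The starting coefficients are hypothetical.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding, small loadings become zero."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude

ols = [3.0, -1.5, 0.4, -0.2]   # hypothetical unpenalized loadings
lam = 0.5                      # hypothetical penalty weight

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

nonzero_ridge = sum(b != 0 for b in ridge)
nonzero_lasso = sum(b != 0 for b in lasso)
print(nonzero_ridge, nonzero_lasso)
```

    Ridge keeps every loading non-zero, while the lasso removes the two smallest, which is why the lasso
    fit is the easier one to read for variable selection.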

    We also used a regression with a ridge penalty to model WTP for insurance using our demographic
    clusters. There is no significant increase in WTP for being in the Oyam and average living conditions
    cluster (clus2) instead of the Oyam and good living conditions cluster (clus1) (p-val = .71). We
    expect a $2,660.67 decrease in WTP for being in the Kapchorwa and average living conditions cluster
    (clus3) instead of the Oyam and average living conditions cluster. We expect a $1,973.49 decrease in
    WTP for the poor living conditions in Oyam cluster (clus4) instead of clus1. However, when we apply a
    false discovery rate of 15%, only the Kapchorwa cluster remains significant. This is consistent
    with our other results that region is the main driver of differences between the farmers.
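
    A common way to apply a false discovery rate is the Benjamini-Hochberg procedure; the text does not
    say which procedure was used, so the sketch below is an assumption. Of the p-values in the example,
    only 0.71 (clus2) is reported in the text; the other two are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.15):
    """Return the indices of the hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under the BH line q * rank / m.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Index 0: clus2 (p = 0.71, from the text); indices 1 and 2: hypothetical p-values.
p_values = [0.71, 0.01, 0.12]
significant = benjamini_hochberg(p_values, q=0.15)
print(significant)
```

    With three tests, a p-value of 0.12 would pass a naive 15% cutoff but fails the rank-adjusted
    threshold, which illustrates how a coefficient can lose significance once the FDR is controlled.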

    2. Regression on our principal components


    We used the same principal components described earlier. For WTP, our linear regression is best by
    BIC when we use 49 principal components. Consistent with our earlier findings, Oyam has lower
    WTP, and we learn that for a one standard deviation increase in how much the farmer worries (PC2) we
    expect a $181 increase in WTP.
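
    Choosing the number of components by BIC can be sketched as below. The sketch uses the Gaussian
    linear model form BIC = n*ln(RSS/n) + k*ln(n), up to an additive constant; the sample size and the
    residual sums of squares are hypothetical, not values from the project.

```python
import math

def bic(rss, n, k):
    """Gaussian linear-model BIC, up to a constant: fit term plus complexity penalty."""
    return n * math.log(rss / n) + k * math.log(n)

n = 500                                                  # hypothetical sample size
rss_by_components = {1: 5200.0, 2: 4100.0, 3: 3900.0, 4: 3895.0}   # hypothetical fits

scores = {k: bic(rss, n, k) for k, rss in rss_by_components.items()}
best_k = min(scores, key=scores.get)   # lower BIC is better
print(best_k)
```

    Adding a fourth component barely reduces the residual sum of squares here, so the ln(n) penalty
    outweighs the gain and the smaller model is preferred.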

    3. Partial least squares

    PLS performed well without much overfit; details are in the appendix and in subsection 6.

    4. Regression Tree

    The dominant effect is borrowing money and the source of that money (bank, relatives, etc.), followed
    by district (one of the main effects that has persisted throughout) and then the type of crop grown.
    For cotton growers, the value of the crop (kilos sold last year, value of half the crop, etc.)
    determines willingness to pay.

    [Figure: WTP regression tree. Splits: q43 (levels e, g, h), q3 (levels f, g), nq135_7 (level a),
    q58 < 110, q22 < 425, q57 < 1. Leaf values: 10620, 11440, 25730, 78740, 150000, 19270, 108700.]

    5. Random Forest


    In the random forest, the top variables pertain to leverage (how much you would have to pay
    in a drought, whether you have borrowed money and from whom, and how frequently you have borrowed),
    followed by value-of-crop effects. This seems to confirm the findings from the CART tree and suggests
    a story in which you have a higher WTP for insurance if you are already levered and therefore
    concerned that you won't be able to cover your losses in the case of a drought or flood.

    [Figure: WTP RF variable importance (IncNodePurity, axis from 0e+00 to 1e+11). Variables in extracted
    order: Region, q16, q11, q130, q143, q140, q29, q134, nq135_7, q152, q3, q26, WorryInj, q18, q52, q27,
    q56, q21, WorryDF, HIV, q13, q150, q57, q19, Farm_Val, Val_Acre, q22, q39, q43, q58.]

    6. Cross Validation for Out of Sample Predictions


    Random Forest again performs quite well for the reasons given above, but in this case Lasso, Ridge,
    and PLS performed almost as well, because WTP is more linear.
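
    Out-of-sample MSE from k-fold cross validation can be sketched as follows. The two toy models (a
    constant mean and a one-variable least-squares line) and the synthetic data are stand-ins chosen to
    keep the example self-contained; they are not the models compared in the project.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    average = sum(ys) / len(ys)
    return lambda x: average

def fit_line(xs, ys):
    """One-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def cv_mse(fit, xs, ys, folds=5):
    """Average squared error on held-out folds."""
    errors = []
    for fold in range(folds):
        test_idx = set(range(fold, len(xs), folds))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors += [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
    return sum(errors) / len(errors)

xs = list(range(20))
ys = [2.0 * x + (1 if x % 2 else -1) for x in xs]   # synthetic, roughly linear outcome

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print(mse_mean, mse_line)
```

    When the outcome is close to linear, as in this toy case, a simple linear fit already captures most
    of the signal, which is consistent with the linear methods doing nearly as well as the forest on WTP.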

    5. Game 2: Dice, Buy Insurance (Yes/No)

    In the first game, with coins, there were five different buckets into which we could classify our
    farmers. In the dice game we have a binary outcome: whether or not the farmer bought insurance. In
    this model we estimate the odds of a farmer buying insurance. We will use: 1. logistic regression
    with a gamma-lasso penalty, 2. PCR, 3. PLS, 4. CART, and 5. random forest.

    Logistic Regression

    Because we are trying to predict a binary outcome, we use logistic regression to predict odds and
    apply a gamma-lasso penalty to help with dimension reduction. For a one standard deviation increase
    in worrying about drought or flood, the odds of buying insurance are 1.2 times as high. For Region 2,
    the odds of buying insurance are only 0.74 times the odds for Region 1.
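
    Odds multipliers are easier to interpret once converted to probabilities. The sketch below applies
    the two multipliers quoted above (1.2 and 0.74) to a baseline purchase probability; the 80% baseline
    is a hypothetical value chosen for illustration, not a figure from the data.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)

baseline = 0.80   # hypothetical purchase probability for a Region 1 farmer

region2_prob = apply_odds_ratio(baseline, 0.74)   # same farmer, but in Region 2
worried_prob = apply_odds_ratio(baseline, 1.2)    # one SD more drought/flood worry

print(round(region2_prob, 3), round(worried_prob, 3))
```

    Because most farmers bought insurance, even a sizeable odds multiplier moves the predicted
    probability by only a few percentage points.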

    PCR

    Three is the optimal number of factors to minimize BIC. PCR does not do a very good job of separating
    people who do and do not buy insurance, probably because people overwhelmingly chose to buy
    insurance regardless of the PCR factors.

    PLS

    We first ran PLS with five z's because that seemed optimal when examining correlation. However, PLS
    is prone to overfit, so we also tested using only one z, which still performed worse than our other
    tests. This suggests that other factors, such as non-linearity and missing interactions, may be
    driving down PLS performance overall.

    Trees

    The tree is rather simple because people overwhelmingly chose to buy insurance in the dice game.
    District is the dominant effect, followed by door material (a proxy for being well off, as discovered
    in K-means). A good or average door is associated with buying insurance. We believe this points to a
    trend in which having more crop or home value makes farmers more willing to buy insurance.

    [Figure: Dice game classification tree. Splits: q3 (levels e, f, g), q12 (levels b, c). All three
    leaves are labeled True.]


    Random Forest

    Strangely, the most influential variable is q13 (age), which has not come up as significant in our
    other regressions, followed by the amount that would have to be paid in case of drought or flood. The
    next most important variables again proxy value (how many kilos last season, value per acre, and
    farm value).

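    [Figure: Dice RF variable importance (MeanDecreaseGini, axis from 0 to 25). Variables in extracted
    order: q143, q137, q130, q27, q136, q11, q139, q138, q26, q56, q7, q142, q39, WorryInj, q141, HIV,
    q150, WorryDF, q18, q21, q16, q43, q29, Farm_Val, q57, q22, Val_Acre, q19, q58, q13.]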


    While it was hard to tell from the MSE plots which model was best, when we look at the ROC curves
    we see that the random forest almost never has a false positive or a false negative. One thing
    we need to remember is that we are modeling the probability of choosing to buy insurance in the dice
    game, and we do not know how that corresponds to buying insurance in real life. If we knew the cost of
    marketing to each farmer and the return if they bought insurance, we could use the sensitivity and
    specificity to come up with a classification rule that maximizes profit.
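
    The profit-maximizing classification rule mentioned above could look like the sketch below. The
    predicted probabilities, the outcomes, the revenue per buyer, and the cost per marketing contact are
    all hypothetical, since those figures were not available to us.

```python
def confusion(scores, labels, threshold):
    """Count true/false positives and negatives when marketing to scores >= threshold."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    tn = sum(s < threshold and not y for s, y in zip(scores, labels))
    return tp, fp, fn, tn

def sensitivity_specificity(scores, labels, threshold):
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)

def profit(scores, labels, threshold, revenue_per_buyer, cost_per_contact):
    """Revenue from contacted buyers minus the cost of every contact made."""
    tp, fp, _, _ = confusion(scores, labels, threshold)
    return tp * revenue_per_buyer - (tp + fp) * cost_per_contact

# Hypothetical predicted purchase probabilities and actual outcomes.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, True, False, False, False]

thresholds = [0.25, 0.50, 0.75]
best_threshold = max(
    thresholds,
    key=lambda t: profit(scores, labels, t, revenue_per_buyer=10.0, cost_per_contact=4.0),
)
print(best_threshold, sensitivity_specificity(scores, labels, best_threshold))
```

    A lower threshold raises sensitivity but wastes contacts on non-buyers, while a higher one saves
    cost but misses buyers; the best cutoff depends entirely on the assumed revenue and cost.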

    6. Conclusion

    Region

    Region came up repeatedly as one of the most important factors. However, we want to caveat this,
    because our random forest models often showed region to be less important and they held up very well
    out of sample. Region is strongly correlated with many other factors and captures a lot of other
    information, including the types of crops planted and, to some extent, wealth. The tree models get
    around some of this multicollinearity. To adjust for this problem, we included interaction terms
    between region and all the other independent variables, hoping to isolate the main effect of region.

    In our multinomial model to predict coin choice, the main effect of region changed to zero for all
    coins, but the interactions were non-zero. We conclude that region alone does not determine coin
    choice but works strongly with the other variables. Our largest loading is now on the interaction
    between region and how worried the farmer is about drought/flood. Each main effect is zero; however,
    if the farmer is in Region 2, for each standard deviation increase in how much they worry about
    drought/flood we would expect 1.21 times the relative odds of choosing Alpha. However, CV reveals that
    our mnlm with interactions has a higher MSE than our model without interactions (see appendix). The
    lasso penalty should be helping us with overfit, so we believe that there should be a main effect of
    region, since out-of-sample MSE can be large due to either overfit or omitted variables (over
    penalization). In the dice gamma-lasso penalized regression we also see the main effect of region set
    to zero, but there the main effect of worrying about drought/flood is the largest and the interaction
    of region with drought/flood worrying is zero.
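
    A minimal sketch of how region interaction terms can be built, and of what a zero main effect with
    a non-zero interaction implies. The variable names (region2, worry_df) are hypothetical, and the
    coefficient is simply the log of the 1.21 odds multiplier quoted above.

```python
import math

def add_region_interactions(row, region_key="region2"):
    """Return a copy of row with a region x variable product for every other variable."""
    expanded = dict(row)
    for name, value in row.items():
        if name != region_key:
            expanded[f"{region_key}:{name}"] = row[region_key] * value
    return expanded

# One standard deviation of extra drought/flood worry, in each region.
oyam_farmer = add_region_interactions({"region2": 1, "worry_df": 1.0})
kapchorwa_farmer = add_region_interactions({"region2": 0, "worry_df": 1.0})

interaction_coef = math.log(1.21)   # implied by the 1.21 multiplier in the text

def alpha_odds_multiplier(row):
    """Both main effects are zero, so only the interaction term moves the odds."""
    return math.exp(interaction_coef * row["region2:worry_df"])

print(alpha_odds_multiplier(oyam_farmer), alpha_odds_multiplier(kapchorwa_farmer))
```

    In this setup extra worry shifts the odds of choosing Alpha only for Region 2 farmers, which is the
    sense in which region works through the other variables rather than on its own.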

    Risk Aversion

    We found that farmers who were highly concerned about other dangers such as HIV or injury tended to be
    much more concerned about drought than the general population. We interpreted this as their general
    risk aversion (principal component 2), and we observe through our principal component analysis and our
    lasso regression that an increase in general worrying also tends to increase WTP, with HIV as an
    exception.

    Wealth

    Variables that proxy wealth, such as past crop value, farm value, and standard of living measures seem to

    track to willingness to pay for insurance. Perhaps having more wealth makes farmers more sensitive to

    loss. If you have a big farm, a drought is suddenly much more costly than if you had a tiny farm. And

    perhaps if you have a better home that means you may have more collateral posted for any existing loans.

    Existing Leverage

    Farmers' existing borrowing patterns, as well as whom they owed money to, affected their insurance
    choices. We believe that existing leverage makes people more risk averse, because a drought that
    affects their ability to repay a loan could mean losing collateral or having to pay interest for
    longer.

    We would like to thank Karl Muth for letting us use the data for this project. Please do not discuss these

    results outside the context of the data mining class. This project was inspired by Muth, Karl and Jennifer

    Helgeson, Stochastic Environments as Measurement Tools, The Journal of Applied Economy, 2010.