Business Intelligence Using Data Mining Bribe Payments For Land Registrations Submitted By: Hussain Boltwala 61210213 Karthik Vemparala 61210505 Naveen Kumar HS 61210144 Salman Siddiqui 61210626 Smita Chakravorty 61210558


BIDM – Bribing Behaviour for e-Governance Services

INTRODUCTION
PROBLEM STATEMENT
DATA PREPARATION AND VISUALIZATION
THE PREDICTION METHOD
    CLASSIFICATION TREES
    K-NEAREST NEIGHBOUR
    NAÏVE BAYES
CONCLUSION & FURTHER ANALYSIS

Introduction

The project is based on data collected over a period of time from customers who have used e-Governance services for the land registration process. The resulting framework will be useful for intermediaries, who can target customers based on demographic criteria. These intermediaries can charge a fee that is typically lower than the bribe paid and provide a convenient, fast service to the people who are most likely to pay bribes. This is similar to the freelance notaries outside court houses who charge customers a fee for guiding them through a legal process. The framework will also provide insights into customer behaviour and the effectiveness of e-Governance initiatives.

This project also analyses the relationship between customers who paid bribes and differentiating factors such as age, level of education and place that contribute significantly to the payment of bribes. Our analysis is based on land registration transactions carried out in Delhi, Haryana and Gujarat. Data was collected via a handwritten survey, with the people surveyed being interviewed. This resulted in a lot of misclassified data, and the group has endeavoured to clean and interpret as many data points as possible to ensure that a robust model is obtained.

Problem Statement

Predict whether a person availing the e-Governance service will pay a bribe of over INR 100.
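
The target in this problem statement is a binary label. A minimal sketch of how it could be derived, assuming each survey record carries a numeric bribe amount in rupees (the field name `bribe_amount` and the sample values are hypothetical and are not taken from the survey data):

```python
# Hypothetical records; `bribe_amount` is an assumed field name (INR, 0 = no bribe paid).
records = [
    {"id": 1, "bribe_amount": 0},
    {"id": 2, "bribe_amount": 100},
    {"id": 3, "bribe_amount": 150},
]

THRESHOLD_INR = 100  # threshold stated in the problem statement


def bribe_over_threshold(record, threshold=THRESHOLD_INR):
    """Return 1 if the record paid strictly more than the threshold, else 0."""
    return 1 if record["bribe_amount"] > threshold else 0


labels = [bribe_over_threshold(r) for r in records]
print(labels)
```

Note that a bribe of exactly Rs. 100 is labelled 0 here, since the problem statement asks for bribes of over INR 100.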

Data Preparation and Visualization

In order to better understand the key predictors of susceptibility to bribing behaviour, different metrics were analysed against whether a bribe was paid (categorical) and the amount of bribe paid (numerical). Some insights are presented below:

1 Below Rs.500

2 Rs. 500-1000

3 Rs.1000-2999

4 Rs.3000-4999

5 Rs.5000-6999

6 Rs.7000-9999

7 More than Rs.10,000

The amount of bribe paid by people in the higher income brackets (Rs. 7000-9999 and more than Rs. 10,000) is higher in both Delhi and Haryana. This could indicate that affluent people are generally targeted by officials.

Gujarat seemed to have the least prevalent bribing culture, whereas Delhi and Haryana fared badly on most markers. This is also shown in the bar chart below, which depicts the number of people who paid bribes (code 1, in pink) vs. those who did not (code 2, in blue). Gujarat has the largest number of non-bribe payers.

From the above plot, we see that most people in Delhi and Haryana have paid bribes between Rs 100-200.

In Delhi, the number of people who paid a bribe was higher among those who availed the services offered by the TCC less frequently. However, in Haryana, there is no information on the relationship between service availing frequency and bribing pattern. The total amount of bribe paid also increases if the services are availed less frequently, as seen from the bar graph below.

1 Once in 3 Months

2 Once in 6 Months

3 Once in a Year

4 Less than once a year

5 Others

In Delhi, more people paid a bribe on their first trip, but this number decreased as the number of trips to the TCC increased. Haryana does not follow any discernible pattern.

It may be that people who frequented the office at least once every 3 months and made more than one trip paid very little in bribes. This may indicate that people who have a high level of familiarity (and have perhaps built relationships with officials) do not pay much to get their work done. Alternatively, they may simply be unable or unwilling to pay a bribe and hence have to make more trips to avail the same services.

If we look at box plots of the age of an individual against whether s/he has paid a bribe greater than Rs. 100, we do not see any discernible pattern.

However, if we plot the bribe amount and classify it into different age brackets, we find that elderly people mostly end up paying bribes of less than Rs. 200.

1 Illiterate

2 Literate without Education

3 Below Primary

4 Primary

5 Middle

6 Matric/Secondary

7 Higher Secondary/Intermediate

8 Non-Technical Diploma

9 Technical Diploma

10 Graduate & Above

11 Others

The median amount of bribe paid across education levels remains between Rs. 100 and Rs. 150, with the sole exception of individuals who are “literate without education”. The amount of bribe paid by this group is higher.

The above plot indicates that people in semi-urban areas generally paid much more in bribes than those in either rural or urban areas.

It came as no surprise that larger pockets of land attracted relatively higher bribes.

Distance from the Land Registry office did not seem to play any significant role in the bribing patterns. Wage loss, however, did: the higher the loss of wages, the higher the bribe amount.

Whilst the total cost of availing the service was seen as an important aspect, it was ultimately ruled out since it included the total amount paid by the user, including the land registry charges.

Surprisingly, the amount of bribes was not closely tied to satisfaction levels, with Delhi and Haryana reporting the most data. This could indicate that bribing is considered a part of any government transaction and has no bearing on overall perceptions of satisfaction.

From a service provider’s perspective, the largest number of bribes paid were under Rs. 100. This group is not considered the target market, and only those people who would pay over Rs. 100 are considered in this study.

Also, most bribes were paid in order to expedite the process; it was thus logical to look at predictors that would cause the individual to spend more time at the land registry office.

Total Bribes Paid

The Prediction Method

Classification Trees

Since there are a lot of variables, we decided to run a classification tree to find out which predictor variables are the most relevant. These were wage loss, service charges, wait time, total payment, age, level of education, occupation, mode of travel, number of trips made to the TCC, travel time, and reason for bribe payment (this is largely to expedite the process).

Certain predictors above are not relevant for a prediction model. For example, the reason for bribe payment will not apply, as it will not be available at the time of prediction. Also, a person who has already paid a bribe may not want to avail the services of an intermediary. However, a person who has previously tried to avail the services but had a long wait time might be more inclined to use the services of a broker.
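
To illustrate how a classification tree ranks candidate predictors, the sketch below scores a split by the weighted Gini impurity of the two child nodes, where lower values mean a purer split. This is a generic illustration rather than the software used for this report, and the toy wait times and labels are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, impurity)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(
        ((t, split_impurity(values, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Invented toy data: wait time in minutes and whether Bribe > 100 (1) or not (0).
wait_time = [10, 20, 30, 60, 90, 120]
paid = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(wait_time, paid)
print(threshold, impurity)
```

A tree-building routine would repeat this search over every predictor and keep the split with the lowest impurity, which is how the more relevant predictors end up near the top of the tree.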

Figure: Full tree and pruned tree. Split variables shown in the tree: travel_mode, serv_charge, wait_time, total_paymen, expedite_pro, Occupation, wage_loss.

K-Nearest Neighbour

Running K-NN (k = 1) with the above predictor variables, we get an error rate of about 12% on the validation data and 11% on the test data.
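
For reference, a k-nearest-neighbour classifier assigns each new record the majority class among its k closest training records. The sketch below is a minimal pure-Python version using Euclidean distance. The feature values are invented and unscaled (real predictors would normally be normalised first), and this is not the tool that produced the figures reported here.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=1):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented toy records: (wait_time in minutes, no_of_trip).
train_X = [(10, 1), (15, 1), (20, 2), (90, 3), (100, 4), (120, 3)]
train_y = [0, 0, 0, 1, 1, 1]  # 1 = Bribe > 100

print(knn_predict(train_X, train_y, (95, 3), k=1))
print(knn_predict(train_X, train_y, (12, 1), k=3))
```

With k = 1, every training record is its own nearest neighbour, so scoring the training data generally gives no errors unless identical records carry different labels. The validation and test error rates are therefore the more informative figures.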

Variables
# Input Variables: 11
Input variables: Age, Lev_Education, Occupation, travel_mode, no_of_trip, travel_time, wait_time, wage_loss, serv_charge, expedite_proc, total_payment
Output variable: Bribe > 100

Training Data scoring - Summary Report (for k=1)

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               168     0
0               0       932

Error Report
Class      # Cases   # Errors   % Error
1          168       0          0.00
0          932       0          0.00
Overall    1100      0          0.00

Validation Data scoring - Summary Report (for k=1)

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               80      35
0               47      498

Error Report
Class      # Cases   # Errors   % Error
1          115       35         30.43
0          545       47         8.62
Overall    660       82         12.42

Test Data scoring - Summary Report (for k=1)

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               52      23
0               24      341

Error Report
Class      # Cases   # Errors   % Error
1          75        23         30.67
0          365       24         6.58
Overall    440       47         10.68


Naïve Bayes

The Naïve Bayes method resulted in a higher error rate than the K-NN method: approximately 17% on the validation data and 15% on the test data.

Variables
# Input Variables: 11
Input variables: Age, Lev_Education, Occupation, travel_mode, no_of_trip, travel_time, wait_time, wage_loss, serv_charge, expedite_proc, total_payment
Output variable: Bribe > 100

Prior class probabilities (according to relative occurrences in training data)

Class    Prob.
1        0.152727273   <-- Success Class
0        0.847272727
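
The prior class probabilities are simply the relative class frequencies in the training data. The short check below reproduces them from the training class counts given in the training data summary reports (168 records in class 1 and 932 in class 0); the variable names are illustrative.

```python
# Training class counts as reported in the training data summary (1 = Bribe > 100).
class_counts = {1: 168, 0: 932}

total = sum(class_counts.values())
priors = {cls: count / total for cls, count in class_counts.items()}

for cls, prob in priors.items():
    print(cls, round(prob, 9))
```

Naïve Bayes combines these priors with per-class conditional probabilities of each predictor value, treating the predictors as independent within a class.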

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               148     20
0               91      841

Error Report
Class      # Cases   # Errors   % Error
1          168       20         11.90
0          932       91         9.76
Overall    1100      111        10.09

Validation Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               79      36
0               74      471

Error Report
Class      # Cases   # Errors   % Error
1          115       36         31.30
0          545       74         13.58
Overall    660       110        16.67

Test Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               49      26
0               40      325

Error Report
Class      # Cases   # Errors   % Error
1          75        26         34.67
0          365       40         10.96
Overall    440       66         15.00


Conclusion & Further Analysis

When we started with the raw data, a tremendous amount of clean-up and classification was needed to make the data usable. We also had to define our goals clearly, keeping in mind the practicality and usefulness of the model we were building.

Initially, the idea was to estimate the amount of bribe a person would pay. However, a relatively small number of people had reportedly paid bribes, many of them very small amounts. It was therefore more useful to classify records that paid over a certain threshold (in our case, Rs. 100) and to create a model around this end goal, i.e. a categorical ‘Y’ of ‘Bribe > 100’.

Initial Benchmark    Records       %
# Paid Bribes        397 / 2200    18%
# Paid > 100         358 / 2200    16.3%
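
As a quick check, the benchmark percentages follow directly from the counts. The second part of the sketch works out the error rate of a naive rule that labels every record as not paying over Rs. 100. That rule is a common baseline for judging a classifier, but it is an added illustration here rather than something computed in the original analysis.

```python
total_records = 2200
paid_any_bribe = 397
paid_over_100 = 358

share_any = paid_any_bribe / total_records
share_over_100 = paid_over_100 / total_records
print(f"{share_any:.1%}", f"{share_over_100:.1%}")

# Naive rule (assumed baseline): predict class 0 (no bribe over Rs. 100) for everyone.
# It is wrong exactly for the records that did pay over Rs. 100.
naive_error = share_over_100
naive_accuracy = 1 - naive_error
print(f"{naive_error:.1%}", f"{naive_accuracy:.1%}")
```

Because the naive rule never predicts class 1, its sensitivity is zero, so it cannot pick out any prospective customers regardless of its overall accuracy.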

The results of the prediction models on the validation data are as follows:

Method Used    Error Rate   Accuracy   Sensitivity   Specificity
K-NN           12.4%        87.6%      69.6%         91.4%
Naïve Bayes    16.7%        83.3%      68.7%         86.4%
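
The summary figures for each method can be recomputed from the validation confusion matrices reported in the K-NN and Naïve Bayes sections. The sketch below does this for both models; the function and variable names are illustrative.

```python
def summarise(tp, fn, fp, tn):
    """Return error rate, accuracy, sensitivity and specificity for a 2x2 matrix."""
    total = tp + fn + fp + tn
    errors = fn + fp
    return {
        "error_rate": errors / total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual class 1 correctly identified
        "specificity": tn / (tn + fp),  # share of actual class 0 correctly identified
    }


# Validation confusion matrices from the report (class 1 = Bribe > 100).
validation = {
    "K-NN": dict(tp=80, fn=35, fp=47, tn=498),
    "Naive Bayes": dict(tp=79, fn=36, fp=74, tn=471),
}

results = {name: summarise(**cells) for name, cells in validation.items()}
for name, metrics in results.items():
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```

Sensitivity is the figure that matters most to an intermediary, since it measures how many of the people who actually paid over Rs. 100 the model would have flagged.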

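Both models therefore identify roughly 69% of the people who paid over Rs. 100, which the initial benchmark cannot do. In terms of overall error, K-NN (about 12%) improves on the 16.3% that would result from classifying every record as not paying over Rs. 100, whereas Naïve Bayes (about 17%) does not. K-NN yielded better results than Naïve Bayes on both the validation and the test data.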

The predictors of interest are listed below. Each could be estimated or determined by direct or indirect probing by the service provider, or is already known to him (e.g. the official service charge). The idea of this model is that the service provider approaches suitable prospects and finds out the relevant information for each parameter, mostly through a conversational strategy.

Predictor                             Method of Determination
Age                                   Estimated / indirectly determined from conversation
Level of Education                    Estimated / indirectly determined from conversation
Occupation                            Directly queried from prospect
Mode of Travel                        Determined from conversation
No. of Trips                          Determined from conversation
Travel Time                           Determined from the ‘Mode of Travel’ query
Wait Time                             If first trip – communicated to prospect based on general wait times for the type of service required
                                      If more than one trip – queried from the prospect herself
Wage Loss                             Determined from ‘Occupation’ query
Official Service Charge               Known to service provider – communicated to prospect
Desired Expediency                    Directly queried from prospect
Total Payment (charge) for services   Known to service provider – communicated to prospect

Using the above probes, a service provider should therefore be able to target prospective customers effectively.