
DATA MINING PROJECT: Fraud Detection using Data Mining

JUNE 7, 2015 NORTHWESTERN UNIVERSITY, MSIS 435

ALBERT KENNEDY


TABLE OF CONTENTS

Abstract

Introduction

Data Mining Applications

Data Mining Themes

CRISP-DM Methodology

Data Understanding

Data Preparation

Data Mining Algorithm

Experimental Results and Analysis

Conclusion

Future Work

References


Fraud Detection

1 ABSTRACT

The adoption of data mining can be valuable for many use cases, particularly for organizations that have a clear understanding of what can be done with their existing data. Many organizations do not appreciate the power and value of the data they already control. This paper discusses the benefits of using a structured process for making smarter decisions, not only to explain a data mining topic and its benefits, but also to address common problems in the fraud and identity sector that many businesses and individuals can take advantage of solving.

Fraud detection is more important today than ever before. With the e-commerce business growing rapidly and people having ever more access to important information, financial institutions need to be more aware of ways to detect possible fraudulent acts. We can achieve this goal with the use of data mining.

This document will showcase a typical problem with sample data taken from borrowers, containing information related to credit approval. A second, similar data set presents typical account holders and a common profile of related attributes that make up a "type" of customer. We will go into detail on the proper process for solving this problem using the following outline:

Defining the business problem

Collecting the data and enhancing that data

Choosing a model strategy and algorithm that fits the business need

Training the model on a training set, then testing that model

Evaluating the results of the model

Deciding to accept the model or make changes

Deploying the model into an actionable project
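The training-and-testing step in the outline above can be sketched generically. The snippet below is an illustrative holdout split only (the experiments later in this paper use Weka's built-in cross-validation instead); the row values are stand-ins, not real credit records.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly, then hold out a fraction for testing."""
    rows = rows[:]                        # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

rows = list(range(1000))                  # stand-in for 1,000 credit records
train, test = train_test_split(rows)
print(len(train), len(test))              # 700 300
```

Holding out unseen rows is what lets the evaluation phase later in the outline measure how the model behaves on data it was not trained on.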

The above outline is based on a framework that many data scientists use today called the "Cross Industry Standard Process for Data Mining" (CRISP-DM)1. This foundation is what will be used to analyze the fraud detection use case on both the German Credit fraud data and the Give Me Some Credit data set, comparing them using two separate techniques. These two datasets have been thoroughly cleansed and checked for correctness, and are free of any biased input that would skew the results in any direction. In order to succeed in building a fraud detection system, it is important to understand the data mining tools and applications used in the industry. Second, it is important to know the themes of data mining and, most importantly, the CRISP-DM methodology that will be used to support our business problem and the data mining design for a reliable fraud detection system.

1 For information related to CRISP-DM, see CRISP-DM 1.0, Step-by-Step Data Mining Guide, SPSS.


2 INTRODUCTION

In America and many other parts of the world, crime remains prevalent in today's society. As the government and the public continue to find ways to prevent unlawful acts and steer individuals away from them, people find alternative and inventive ways to be corrupt. Law enforcement has only become good at catching perpetrators after a particular wrongdoing has been either reported or caught in the act. We cannot control and stop every action that happens within a person's private setting before it happens. But what about helping to prevent those crimes that can be warned about, if not stopped, before they happen? These are the types of crime that businesses and individual victims can gain control over with the use of data analysis. The crimes of identity theft and fraudulent credit approval can be defended against.

The purpose of this study is to empower financial businesses and individuals with the ability to combat the potential theft of personal and company information, avoiding its misuse for another person's gain.

Description of problem: In today's fast-growing technological society, individuals are completing more and more online transactions, sending data to multiple businesses and using the same vital information to identify themselves.

An example of this could be applying for credit via an online store. So what information is needed for this? In many cases:

Full name

Telephone number or street address

Social security number (SSN)

The above information is all that is needed for a creditor to approve someone for a line of credit or an account under a person's name. There is an issue here: anyone who doesn't know you can obtain this information easily. The only harder piece of information to obtain is an SSN. The easiest way for someone to get a person's SSN is through work records: a person who handles administrative tasks for employees has access to it. This is a problem, because the telephone number and street address can be any number or address; the creditor cares about them only as a place to mail bills to. Typically a creditor will validate the address through mailed information; if a different phone number were given and validated with a phone call, that could also pass. We can use smarter ways to combat this, as some companies do with multiple levels of validation (a phone call, matching the current address, and identification).

Our objective: We can help fix these problems through the use of data mining, making sound business decisions in order to complete a credit transaction.

First, we identify the types of data needed to solve an identity fraud or similar act. We will use previous data from users of different credit accounts to do the analysis work on.

Then, we choose one or more data mining algorithms to test this data, using the outcome of the analysis to make a recommendation, prediction, classification, and/or description for a better business decision.


So, let's explore the many different data mining applications related to this problem.

3 DATA MINING APPLICATIONS

There are many related data mining applications that can be used for the purpose of detecting fraudulent activities. This is a growing field of study that many established organizations have pursued, completing in-depth research that, so far, exposes the issues and weaknesses more than it produces real-world working applications that actually identify the issue and combat the problem defensively.

A company called Morpho has a mission to be the market leader in security solutions and a pioneer in identification and detection systems. It delivers many products targeting government and national agencies, with dedicated tools and systems to safeguard sensitive information. Morpho completed a study using data mining in relation to identity fraud, as an application to prevent, or at least warn, businesses, government organizations, and individuals of possible fraudulent acts. In their paper "Fighting Identity Fraud with Data Mining," on the Safran product2, they describe a comprehensive fraud-proof process.

A second effort worth mentioning that used data mining methods for a similar study came from Federal Data Corporation and SAS Institute Inc. These two completed a thorough study, "Using Data Mining Techniques for Fraud Detection," carried out with the SAS Enterprise Miner software. The two use cases presented were 1) health care fraud detection and 2) purchase card fraud detection. Both have similar, if not the same, business problems and end goals. In the first case, FDC and SAS used decision trees to group all the nominal input values into smaller groups that in turn give a predictive target outcome. In the second case study, they used clustering as their modeling strategy. Their analysis included three clusters, illustrating how cluster analysis efficiently segments data into groups of similar cases.

The overall conclusions for both unveiled previously unknown patterns and regularities in their data.

4 DATA MINING THEMES

The study of data mining is best explained and organized into different themes. These areas are best described by the four core data mining tasks. According to "Introduction to Data Mining" by Pang-Ning Tan, these themes are covered by the four core tasks: predictive modeling, cluster analysis, association analysis, and anomaly detection.3

To briefly describe each theme of data mining, it is best to show examples. In detail, these themes are:

Classification

Clustering

Association

Anomaly detection

2 Product from Morpho Inc., Safran, "Fighting Identity Fraud with Data Mining." 3 See more information on themes in Assignment 1: Data Science Application, Kennedy, Albert.


First, predictive modeling is split into two types: 1) classification, "which is used for discrete target variables and 2) regression, which is used for continuous target variables" (Tan). The classification type is most commonly used to predict an outcome, the target variable, of a single action, whereas regression might analyze, for example, the amount a consumer spends monthly on an e-commerce website. These two task types (classification and regression) are what define predictive modeling.

The second theme, cluster analysis, "seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters" (Tan). A good example of this would be grouping customers by purchasing behavior. Doing this helps a clothing retail store define the types of customers it has and the purchases they make.

Lastly, there is the association analysis theme. This theme is used "to discover patterns that describe strongly associated features in the data" (Tan). The association theme is commonly used by retailers and grocery businesses for analysis. We can group like things together that share similarities based on related attributes and/or paired transactions from users.

These themes each have a purpose in helping solve particular data science problems, and by solving them, businesses can make decisions using these techniques. Understanding each definition is key to ensuring the right solution/theme is utilized for the correct problem. Once that understanding is complete, analysts can apply the right tool to the most appropriate case. In many cases, not just one theme may apply; multiple themes can be applied for better analysis and comparison, and so for the best results.

5 CRISP-DM METHODOLOGY

For many of these data mining techniques, we don't want to apply the wrong, or a less effective, solution. Luckily, there is a well-organized methodology that gives businesses the steps and processes to carry out these types of data mining projects: the "Cross Industry Standard Process for Data Mining," or CRISP-DM for short. CRISP-DM is a structured framework with hierarchical steps to follow, guiding a proper data mining problem and solution. CRISP-DM includes six phases:4

4 Information related to CRISP-DM, see CRISP-DM 1.0, Step-by-Step data mining guide, SPSS


Figure 5-1: CRISP-DM diagram

1) Business Understanding – this makes sense as an initial step. Any data mining problem has a business need and problem that must be understood. This stage "represents a part of the craft where the analysts' creativity plays a large role…the design team should think carefully about the use scenario" (Provost). In this stage, questions such as what needs to be done, and how, are asked.

2) Data Understanding – this phase is self-explanatory, yet it can take a lot of time. Making sure you, as an analyst, become knowledgeable about the data makes for an easier process. This phase enables you to become "familiar with the data, identify data quality problems, discover first insights into data, and/or detect interesting subsets to form hypotheses regarding hidden information" (SPSS).

3) Data Preparation – in order to make a good analysis, we need the data processed into the form best suited for the necessary model. Examples of this phase may include converting data into a simple tabular format, removing pointless attributes not relevant to the data problem, and/or converting a data file to a particular file format in order to operate in the chosen data mining tools.


4) Modeling – "The modeling stage is the primary place where data mining techniques are applied to the data" (Provost). Simply put, this is where the magic happens and the actual data mining craft and chosen algorithm(s) are put to work.

5) Evaluation – at the evaluation phase, we take time to assess the results and outcomes of the models that were built from our data. The most important aspect of this phase is gaining confidence in the model's outcome. We analyze the results and understand the outcome to ensure it is reliable and meets the original business problem's needs.

6) Deployment – now that the model has been created and tested, we can put this reliable model to use in a real-life production case. What can we do with it? "…the knowledge gained will need to be organized and presented in a way that the customer can use it" (SPSS). Depending on the business's need for the data mining model that was crafted, the deployment can be simple or complex: as simple as turning the results into a report for managers, or as involved as actually implementing the model in the business in need.

Within each phase, there is a set of tasks that the business helps generate, for both the data mining team and the business to complete, in order to work through the CRISP-DM cycle successfully.

6 DATA UNDERSTANDING

For the purpose of proving out the data mining algorithms used for the fraud detection case, we will examine two different data sets. The first is the German Credit fraud data from Dr. Hans Hofmann of the University of Hamburg in Germany. This particular data set has 1,000 instances of customer information for the purpose of understanding credit approval. There are 20 attributes used to describe each instance and its uniqueness.

German Credit Data Definition:

Variable Name | Description | Type
over_draft | Status of existing checking account | qualitative
credit_usage | Duration in months | numerical
credit_history | A30: no credits taken / all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account / other credits existing (not at this bank) | qualitative
purpose | Type of credit/loan needed (new car, used car, furniture/equipment, radio/tv, repairs, education, vacation, business, other) | qualitative
current_balance | Credit amount | numerical
Average_Credit_Balance | Savings account/bonds | qualitative
employment | Present employment since a date | qualitative
location | Installment rate as a percentage of disposable income | numerical
personal_status | Personal status and sex | qualitative
other_parties | Other debtors / guarantors | qualitative
residence_since | Present residence since a date | qualitative
property_magnitude | Property | qualitative
cc_age | Age count in months | numerical
other_payment_plans | Other installment plans | qualitative
housing | Housing type: rent, own, for free | qualitative
existing_Credits | Number of existing credits at this bank | numerical
job | Job type: unemployed/unskilled, skilled, management/self-employed, highly qualified | qualitative
num_dependents | Number of people liable to provide maintenance for | numerical
own_telephone | Telephone | qualitative
foreign_worker | Indicates Yes or No if a foreign worker | qualitative
class | The cost matrix class indicating whether the customer is Good or Bad | qualitative

The above is financial data, taken from the year 1994, of customers who applied for credit. The attributes used are of integer and categorical formats. This data set is best used for a classification data mining task.

The second data source comes from Kaggle's Give Me Some Credit competition. According to Kaggle, the purpose of this data set is to advance the "state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years." This dataset uses fewer attributes and inputs for describing the customers; however, our working copy of it has 4,000 instances.

Credit distress probability Data Definition:

Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (except real estate and installment debt like car loans) divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family, excluding themselves (spouse, children, etc.) | integer

The goal of this particular data set is to help build a model that helps borrowers make better financial decisions. This will be treated as a classification task type as well.

7 DATA PREPARATION

The process used for the German Credit fraud data was very simple. I decided to use the ARFF file format to organize it, since the data came from the UC Irvine Machine Learning Repository. I collected the attributes needed for the analysis and pasted them into a notepad application, with the @ symbols at the beginning to denote that these variables are the attributes, then pasted the raw data below the @data field. As long as the comma-separated raw data has the same number of values as the number of attributes given, the file will be accurate.

Here's a snapshot of what the inside of an ARFF file contains:

@relation german_credit

@attribute over_draft { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute credit_usage real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute current_balance real
@attribute Average_Credit_Balance { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute location real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute cc_age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}

@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad

A compiled version of this can be viewed here

http://weka.8497.n7.nabble.com/file/n23121/credit_fruad.arff
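Since the file is only valid when every @data row has exactly as many comma-separated values as declared attributes, that rule can be sanity-checked with a short script. This is an illustrative sketch (not part of the original workflow) using only the Python standard library; the embedded sample is a cut-down stand-in for the real file:

```python
import csv
import io

def validate_arff(text):
    """Check that every @data row has exactly as many values
    as the number of declared @attribute lines."""
    n_attributes = 0
    data_rows = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue                      # skip blanks and ARFF comments
        lowered = line.lower()
        if lowered.startswith('@attribute'):
            n_attributes += 1
        elif lowered.startswith('@data'):
            in_data = True
        elif in_data and not line.startswith('@'):
            data_rows.append(line)
    # ARFF nominal values are quoted with single quotes
    reader = csv.reader(io.StringIO('\n'.join(data_rows)), quotechar="'")
    bad = [i for i, row in enumerate(reader) if len(row) != n_attributes]
    return n_attributes, len(data_rows), bad

sample = """@relation demo
@attribute over_draft { '<0', 'no checking'}
@attribute credit_usage real
@attribute class { good, bad}
@data
'<0',6,good
'no checking',48,bad
"""
n_attrs, n_rows, bad_rows = validate_arff(sample)
print(n_attrs, n_rows, bad_rows)   # 3 2 []
```

An empty `bad_rows` list means every data row lines up with the attribute declarations, which is the condition described above for the file to be accurate.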

The second data source is a CSV file that required some modification. The original data set file had 150,000 instances, which was too large for the WEKA data mining tool's heap size to handle. I manually removed sections of the data set, dividing the total 150,000 rows into four sections and removing rows from each, so that I kept an evenly distributed 4,000 instances instead of only taking the top 4,000. This file was then saved as a cs-training.csv file for WEKA input.
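One simple way to get such an evenly distributed subsample is to keep every k-th row across the whole file, where k is the ratio of original to target size. This sketch is illustrative only (not the exact manual procedure used above), and the rows are toy stand-ins:

```python
def evenly_subsample(rows, target):
    """Keep `target` rows spread evenly across the whole list,
    instead of just truncating to the first `target` rows."""
    step = len(rows) / target            # e.g. 150000 / 4000 = 37.5
    return [rows[int(i * step)] for i in range(target)]

# Toy stand-in for the 150,000-row cs-training.csv file
all_rows = [[str(i), 'data'] for i in range(150_000)]
sample = evenly_subsample(all_rows, 4_000)

print(len(sample))                   # 4000
print(sample[0][0], sample[-1][0])   # first kept row, and one near the end
```

Because the kept rows are spread across the file, the subsample preserves whatever ordering-related structure the original data had far better than taking only the top of the file.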

For both files, I created data definitions to elaborate on the chosen attributes. The purpose of this is to explain which inputs are used and their purpose, for a clearer business understanding when we need to make a decision after reviewing the results.

8 DATA MINING ALGORITHM


German Credit fraud use case:

The first dataset, the German credit fraud data, was ideal for the decision tree algorithm. The decision tree is a very good first classification technique for this particular use case. Because it is such a common algorithm, there are many reasons it benefits this analysis. First, it is a relatively simple approach for classification-type data: it gives the ability to take sample data with known attributes and place the samples into categories. Second, it helps you visualize the workflow of how the data is broken down into sections where decisions are made. Last, you can determine a predictive outcome from the results.

Using a decision tree, we need to explain more in depth how this method works and its structure. When data is run through this type of analysis, it is structured into three kinds of nodes:

The root node holds the attribute used to ask the initial question of whether or not something is in a particular group. There can only be a single root node; the groups then branch off from it, much like a tree from its base.

From there grow the internal nodes, each heading a branch off the root node. The purpose of an internal node is to give information pertaining only to that group.

Lastly, there are the leaf nodes. Think of these as individual leaves providing answers to the internal nodes. There can be multiple leaves branching off an internal node. These finalize the answer for an item in a particular internal node's group.
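To make the choice of root node concrete, here is a small illustrative sketch (not the Weka model used later): it picks a root attribute for a toy credit table, with hypothetical values, by choosing the split with the lowest weighted Gini impurity, one common criterion for growing decision trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, attr):
    """Weighted Gini impurity after splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row['class'])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Toy data shaped like the German credit set (hypothetical values)
rows = [
    {'over_draft': '<0',          'housing': 'own',  'class': 'bad'},
    {'over_draft': '<0',          'housing': 'rent', 'class': 'bad'},
    {'over_draft': 'no checking', 'housing': 'own',  'class': 'good'},
    {'over_draft': 'no checking', 'housing': 'rent', 'class': 'good'},
    {'over_draft': '>=200',       'housing': 'own',  'class': 'good'},
    {'over_draft': '<0',          'housing': 'own',  'class': 'good'},
]

# The attribute with the lowest post-split impurity becomes the root node
root = min(['over_draft', 'housing'], key=lambda a: split_impurity(rows, a))
print(root)   # over_draft separates good/bad more cleanly than housing
```

The branches off the chosen root become the internal nodes, and groups that end up pure (all good or all bad) become leaf nodes.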


Figure 8-1 Decision tree

This simple method works each time for data with multiple attributes that contain classification types. For the purposes of our fraudulent credit problem, our root node answers the major question of who will overdraft or not. In this case, I would not rely on the derived nodes so much as on the confusion matrix.

The confusion matrix helps separate the actual classes from the predicted classes.5 We use the confusion matrix to separate the decisions made by the classifier, making explicit how one class is being confused for another; this way, each kind of error can be handled separately. We do this by looking at the true class items and the predicted class items in a matrix box.

TP = true positive, FN = false negative, FP = false positive, TN = true negative

                PREDICTED CLASS
                Yes      No
ACTUAL   Yes    a=TP     b=FN
CLASS    No     c=FP     d=TN

Figure 8.2 – Confusion Matrix

The goal here is to have the model obtain the highest possible accuracy rate or lowest error rate.
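The cells of the matrix, and the accuracy and error rates derived from them, can be computed directly. The sketch below is a generic illustration (not Weka's implementation), using a tiny hypothetical set of actual and predicted labels:

```python
def confusion_matrix(actual, predicted, positive='good'):
    """Count TP, FN, FP, TN for a binary classifier."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1                      # actual good, predicted good
        elif a == positive:
            fn += 1                      # actual good, predicted bad
        elif p == positive:
            fp += 1                      # actual bad, predicted good
        else:
            tn += 1                      # actual bad, predicted bad
    return tp, fn, fp, tn

# Tiny hypothetical example
actual    = ['good', 'good', 'good', 'bad', 'bad', 'bad']
predicted = ['good', 'good', 'bad',  'good', 'bad', 'bad']

tp, fn, fp, tn = confusion_matrix(actual, predicted)
total = tp + fn + fp + tn
accuracy  = (tp + tn) / total        # correctly classified instances
precision = tp / (tp + fp)           # how often a predicted 'good' is right
print(tp, fn, fp, tn)                # 2 1 1 2
```

Accuracy summarizes the whole matrix, while precision isolates how trustworthy the positive ('good') predictions are; both views are used when analyzing the Weka results below.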


9 EXPERIMENTAL RESULTS AND ANALYSIS

To test our fraudulent-activity data, the experiments were executed with two different data mining techniques: the decision tree and the Simple K-Means algorithms.

German Credit data use:

Examining the first data set, the German credit fraud data, we analyze it using the decision tree algorithm. The output below shows the summary of the results.

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 705 70.5 %

Incorrectly Classified Instances 295 29.5 %

Kappa statistic 0.2467

Mean absolute error 0.3467

Root mean squared error 0.4796

Relative absolute error 82.5233 %

Root relative squared error 104.6565 %

Total Number of Instances 1000

=== Detailed Accuracy By Class ===

5 See more information from Assignment 2: Marketing Campaign Effectiveness, Kennedy, Albert


TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.84 0.61 0.763 0.84 0.799 0.639 good

0.39 0.16 0.511 0.39 0.442 0.639 bad

Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.639

=== Confusion Matrix ===

a b <-- classified as

588 112 | a = good

183 117 | b = bad

What we are looking for here is a way to determine whether the model is good enough for use. Based on the results, we have 70% accuracy on correctly classified instances in the data set, which is decent justification. However, the ROC area is 0.64, which isn't bad, but is neither near perfect nor ideal. So let's consider the confusion matrix for deeper analysis. Remember from above that the confusion matrix is separated into four sections to help determine where our good and bad classifications fall:

a = true positives (actual good, classified good)

b = false negatives (actual good, classified bad)

Our matrix has an overwhelming 588 instances that fall under the true positive section, where the predicted and actual class are both "good". Taking the proportion of correct predictions within the predicted "good" column:

588 + 183 = 771 instances predicted good in total

588 (TP) / 771 (total) = 76% precision for the "good" class

To generate these results, the data mining tool used was Weka v3-6-12.

Give Me Some Credit use:

Using the data set from the Kaggle competition, we take a different approach to how we want to view the results. Initially, trying the decision tree algorithm on this dataset did not yield results solid enough for analysis purposes or to make any business sense. So it was best to apply a clustering algorithm type and view those results instead. Simple K-Means was the algorithm chosen for this.

kMeans

======

Number of iterations: 10

Within cluster sum of squared errors: 5429.211249046848

Missing values globally replaced with mean/mode

Cluster centroids:


Time taken to build model (full training data) : 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 401 ( 40%)

1 266 ( 27%)

2 333 ( 33%)

To help explain the results, it helps to define the K-means method and its use. K-means is a commonly used clustering algorithm, a "simple, iterative way to approach and divide a data set into a specific number of clusters" (Manning). The process runs through a dataset using "closeness" as the way to measure the distance between items; this measure is the Euclidean distance. A center point, the centroid, is created for each cluster, and each of the K clusters is defined by the items closest to its centroid: an item nearer to one centroid than to any other is assigned to that centroid's cluster.
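As a minimal illustrative sketch of that iteration (using made-up one-dimensional borrower ages, not the Kaggle data or Weka's implementation), assigning points to the nearest centroid and then moving each centroid to the mean of its assigned points looks like this:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D values: assign each point to the
    nearest centroid (Euclidean distance), then move each
    centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical borrower ages forming two obvious groups
ages = [22, 25, 27, 58, 61, 64]
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 70.0])
print(clusters)   # the younger and older ages end up in separate clusters
```

In higher dimensions the same loop applies, with the absolute difference replaced by the full Euclidean distance across all attributes.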

In the above analysis, we have three defined clusters, separated into groups based on the attributes their members share.

Cluster 0:

This cluster has the strongest pull of instances. The results would suggest that this cluster holds more individuals who have an existing paid credit history, happen to be in the younger age group (31), are female, and are most likely to be requesting credit for a NEW CAR.


Cluster 1:

This cluster has the weakest count of instances. The results would suggest that customers in this group are the oldest (age 40) and are seeking credit for USED CARS.

Cluster 2:

This cluster has a distribution of 33%. The results would suggest these are SINGLE MALE customers

seeking credit for RADIO/TV.

10 CONCLUSION

The completed analysis, drawn from two different datasets, yields two different ending business decisions. Visually inspecting the first dataset's decision tree results, I would initially conclude that this is not a valid enough test to stand as a solid choice for fraud detection. We have to remember that the goal is to address the problem where fraud is being committed: where customer information is misused to benefit another person by creating credit accounts. However, the confusion matrix does present some promising results to consider. What a financial institution can take from it is a starting point for prediction: it gives about a 76% chance that a "good" prediction is correct.

The second analysis is viewed from a different approach: not so much for prediction, but for a clear view of where customers land relative to the attributes tied to them. Using the Simple K-Means algorithm and picking 3 clusters, we can analyze a few important things:

1) where the most important group of customers is

2) the related attributes in comparison to other groups

3) the purpose of the line of credit

4) the instance count or distribution percentage of the group with the highest activity and means to apply for credit

With the clustering results, a business can answer the above questions ahead of time. We can pick and choose the determinant for our analysis; for our purpose, we wish to detect the reason for a line of credit.


11 FUTURE WORK


There are organizations already taking the necessary actions to benefit from findings like these, companies like Morpho with their Safran product. Here is a list of procedures and steps that should be considered in order for this study to become successful.

1) Ensure that the data scientists and analysts responsible for crafting these results use proper data mining practices and methodologies like CRISP-DM.

2) Take risks: I believe there are enough intelligent groups of people who understand the worth of, and are capable of, drafting similar fraud detection analyses. There needs to be more action in the deployment phase.

3) Implement a production system where this process is automated.

Based on the analysis and results, we have a plausible solution to a business problem with fraud arising from credit applications.


12 REFERENCES

Tan, Pang-Ning, and Michael Steinbach. Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005. Print (Ch. 1, Ch. 4).

Provost, Foster, and Fawcett, Tom. Data Science for Business. Sebastopol, CA, 2014. Print (Ch. 2, Ch. 7).

Morpho, Safran. Fighting Identity Fraud with Data Mining: Groundbreaking Means to Prevent Fraud in Identity Management Solutions. France. Print (p. 4 and p. 7).

Federal Data Corporation and SAS. Using Data Mining Techniques for Fraud Detection: Solving Business Problems Using SAS Enterprise Miner Software. Cary, NC. Print (p. 1, p. 15, and p. 20).

Hofmann, Hans, University of Hamburg. UCI Machine Learning Repository, CA, 2000. http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Kaggle, Give Me Some Credit, 2011. https://www.kaggle.com/c/GiveMeSomeCredit