Presentation at the SAS Analytics Conference 2013, London, UK. Presenter: Alejandro Correa Bahnsen
Copyright © 2013, SAS Institute Inc. All rights reserved. #analytics2013
Credit Card Fraud Detection: Why Theory Doesn't Adjust to Practice
Alejandro Correa Bahnsen, Luxembourg University
Andrés Gonzalez Montoya, Scotia Bank
Introduction
[Figure: Europe fraud evolution, Internet transactions (millions of euros), 2007 to 2012E; vertical axis from € 500 to € 800]
[Figure: US fraud evolution, online revenue lost due to fraud (billions of dollars), 2001 to 2012; vertical axis from $0 to $5.0]
• Fraud levels are increasing around the world
• Different technologies and legal requirements make fraud harder to control
• There is a need for advanced fraud detection systems
Agenda
• Introduction
• Transaction flow
• Database
• Evaluation of algorithms
• If-Then rules (Expert Rules)
• Financial measure
• Predictive modeling
• Logistic Regression
• Cost Sensitive Logistic Regression
Simplified transaction flow
[Diagram: transaction flow through the network with a fraud check ("Fraud??")]
Data
• Large European card processing company
• 2012 card present transactions
• 750,000 transactions
• 3,500 frauds
• 0.467% fraud rate
• 148,562 EUR lost due to fraud on test dataset
[Figure: timeline of 2012 from Jan to Dec, split into Train and Test periods]
Data
• Raw attributes

TRXID  Client ID  Date         Amount  Location  Type      Merchant Group  Fraud
1      1          2/1/12 6:00  580     Ger       Internet  Airlines        No
2      1          2/1/12 6:15  120     Eng       Present   Car Rent        No
3      2          2/1/12 8:20  12      Bel       Present   Hotel           Yes
4      1          3/1/12 4:15  60      Esp       ATM       ATM             No
5      2          3/1/12 9:18  8       Fra       Present   Retail          No
6      1          3/1/12 9:55  1210    Ita       Internet  Airlines        Yes

• Other attributes: age, country of residence, postal code, type of card
Data
• Derived attributes
– Combination of the following criteria:

By           Group              Last      Function
Client       None               hour      Count
Credit Card  Transaction Type   day       Sum(Amount)
             Merchant           week      Avg(Amount)
             Merchant Category  month
             Merchant Country   3 months

Trx ID  Client ID  Date         Amount  Location  Type      Merchant Group  Fraud  No. of Trx (same client, last 6 hours)  Sum (same client, last 7 days)
1       1          2/1/12 6:00  580     Ger       Internet  Airlines        No     0                                       0
2       1          2/1/12 6:15  120     Eng       Present   Car Renting     No     1                                       580
3       2          2/1/12 8:20  12      Bel       Present   Hotel           Yes    0                                       0
4       1          3/1/12 4:15  60      Esp       ATM       ATM             No     0                                       700
5       2          3/1/12 9:18  8       Fra       Present   Retail          No     0                                       12
6       1          3/1/12 9:55  1210    Ita       Internet  Airlines        Yes    1                                       760
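
The two derived columns in the example table can be reproduced with a small aggregation helper. This is a minimal sketch, not the authors' implementation: the function name `derive`, the dictionary keys, and the choice to exclude the current transaction from its own window are assumptions, and dates are read as day/month/year.

```python
from datetime import datetime, timedelta

# Toy transactions copied from the example table (dates read as day/month/year).
transactions = [
    {"trx_id": 1, "client_id": 1, "date": datetime(2012, 1, 2, 6, 0), "amount": 580},
    {"trx_id": 2, "client_id": 1, "date": datetime(2012, 1, 2, 6, 15), "amount": 120},
    {"trx_id": 3, "client_id": 2, "date": datetime(2012, 1, 2, 8, 20), "amount": 12},
    {"trx_id": 4, "client_id": 1, "date": datetime(2012, 1, 3, 4, 15), "amount": 60},
    {"trx_id": 5, "client_id": 2, "date": datetime(2012, 1, 3, 9, 18), "amount": 8},
    {"trx_id": 6, "client_id": 1, "date": datetime(2012, 1, 3, 9, 55), "amount": 1210},
]

def derive(trx, history, by, window, func):
    """Aggregate earlier transactions that share `by` with `trx` inside `window`.

    Hypothetical helper: `by` is the grouping key ("By" column), `window` the
    look-back period ("Last" column), `func` the aggregation ("Function" column).
    """
    start = trx["date"] - window
    past = [h["amount"] for h in history
            if h[by] == trx[by] and start <= h["date"] < trx["date"]]
    if func == "count":
        return len(past)
    if func == "sum":
        return sum(past)
    if func == "avg":
        return sum(past) / len(past) if past else 0.0
    raise ValueError(f"unknown function: {func}")

# (Trx ID, No. of Trx same client last 6 hours, Sum same client last 7 days)
features = [
    (t["trx_id"],
     derive(t, transactions, "client_id", timedelta(hours=6), "count"),
     derive(t, transactions, "client_id", timedelta(days=7), "sum"))
    for t in transactions
]
```

Other combinations from the criteria table follow by changing the arguments, for example a one-day window with `"avg"`; grouping by merchant would need a merchant key on each record.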
Evaluation
• Confusion matrix

                          True class (y_i)
                          Fraud (y_i = 1)  Legitimate (y_i = 0)
Predicted class (p_i)
  Fraud (p_i = 1)         TP               FP
  Legitimate (p_i = 0)    FN               TN

• $\text{Misclassification} = 1 - \frac{TP + TN}{TP + TN + FP + FN}$
• $\text{Recall} = \frac{TP}{TP + FN}$
• $\text{Precision} = \frac{TP}{TP + FP}$
• $\text{F-Score} = 2 \, \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
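
The four measures can be computed directly from the confusion-matrix counts. The sketch below is illustrative only; the function names and the toy labels are assumptions, and ratios with an empty denominator are returned as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(tp, fp, fn, tn):
    """Misclassification, recall, precision and F-Score from the four counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "misclassification": 1 - (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }

# Toy example: 3 frauds among 8 transactions.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
metrics = evaluate(*confusion_counts(y_true, y_pred))
```

None of these measures looks at the transaction amount, so a missed fraud of a few euros counts the same as a missed fraud of thousands.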
Fraud algorithms
• If-Then rules
• Predictive modeling
  • Logistic Regression
  • Decision Trees
  • Random Forest
  • Cost Sensitive Logistic Regression
If-Then rules (Expert rules)
• “Purpose is to use facts and rules, taken from the knowledge of many human experts, to help make decisions.”
• Example of rules
  • More than 4 ATM transactions in one hour?
  • More than 2 transactions in 5 minutes?
  • Magnetic stripe transaction then internet transaction?
If one or more rules are activated, then decline the transaction
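
The three example rules and the decline decision can be sketched as a list of predicate functions. Everything named here is hypothetical: the record keys, the type label `"Magnetic stripe"`, the assumption that `history` holds only earlier transactions, and the reading of "then" as "the client's immediately preceding transaction".

```python
from datetime import datetime, timedelta

def recent(trx, history, window):
    """Same-client transactions within `window` before `trx`, plus `trx` itself."""
    start = trx["date"] - window
    earlier = [h for h in history
               if h["client_id"] == trx["client_id"]
               and start <= h["date"] <= trx["date"]]
    return earlier + [trx]

def rule_atm_burst(trx, history):
    # More than 4 ATM transactions in one hour?
    atm = [t for t in recent(trx, history, timedelta(hours=1)) if t["type"] == "ATM"]
    return len(atm) > 4

def rule_rapid_fire(trx, history):
    # More than 2 transactions in 5 minutes?
    return len(recent(trx, history, timedelta(minutes=5))) > 2

def rule_stripe_then_internet(trx, history):
    # Magnetic stripe transaction then internet transaction?
    # Assumption: compare with the client's immediately preceding transaction.
    earlier = [h for h in history
               if h["client_id"] == trx["client_id"] and h["date"] <= trx["date"]]
    if trx["type"] != "Internet" or not earlier:
        return False
    return max(earlier, key=lambda h: h["date"])["type"] == "Magnetic stripe"

RULES = [rule_atm_burst, rule_rapid_fire, rule_stripe_then_internet]

def decide(trx, history):
    """Decline when one or more rules are activated, otherwise approve."""
    fired = [rule.__name__ for rule in RULES if rule(trx, history)]
    return ("decline" if fired else "approve"), fired

# Assumed: `history` contains only transactions that happened before the new one.
history = [
    {"client_id": 1, "date": datetime(2012, 1, 2, 10, 0), "type": "Present"},
    {"client_id": 1, "date": datetime(2012, 1, 2, 10, 2), "type": "Present"},
]
burst = {"client_id": 1, "date": datetime(2012, 1, 2, 10, 3), "type": "Present"}
quiet = {"client_id": 2, "date": datetime(2012, 1, 2, 10, 3), "type": "Present"}

decision, fired = decide(burst, history)
quiet_decision, quiet_fired = decide(quiet, history)
```

Returning the names of the activated rules alongside the decision keeps the outcome easy to interpret.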
If-Then rules (Expert rules)
• Problems with rules
  • New fraud patterns are not detected
  • Only simple rules can be created
• Advantages of rules
  • Easy to implement
  • Very easy to interpret
If-Then rules (Expert rules)
Results

Misclassification  Recall  Precision  F1-Score
1.04%              31%     17%        22%
Financial evaluation
• Motivation
  • False positives carry a different cost than false negatives
  • Frauds range from a few to thousands of euros (dollars, pounds, etc.)
There is a need for a real comparison measure
Financial evaluation
• Cost matrix

                          True class (y_i)
                          Fraud (y_i = 1)  Legitimate (y_i = 0)
Predicted class (p_i)
  Fraud (p_i = 1)         C_a              C_a
  Legitimate (p_i = 0)    Amt_i            0

where:
  C_a    Administrative costs
  Amt_i  Amount of transaction i

• Evaluation measure: total cost over all transactions, taken from the cost matrix
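
Reading the cost matrix entry by entry gives a total-cost measure: every transaction predicted as fraud costs the administrative amount, and every missed fraud costs the transaction amount. The sketch below assumes that reading; the function name and the administrative cost of 5 are hypothetical, since the value of C_a is not given here.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum the cost-matrix entry of every transaction.

    predicted fraud (TP or FP)             -> admin_cost (C_a)
    predicted legitimate, actually fraud   -> transaction amount (Amt_i)
    predicted legitimate, actually legit   -> 0
    """
    cost = 0.0
    for y, p, amt in zip(y_true, y_pred, amounts):
        if p == 1:
            cost += admin_cost
        elif y == 1:
            cost += amt
    return cost

# Toy data: two frauds (100 and 300) and two legitimate transactions.
y_true = [1, 0, 1, 0]
amounts = [100, 20, 300, 50]
ADMIN_COST = 5  # hypothetical value of C_a

cost_no_model = total_cost(y_true, [0, 0, 0, 0], amounts, ADMIN_COST)  # approve everything
cost_model = total_cost(y_true, [1, 1, 0, 0], amounts, ADMIN_COST)     # one fraud caught, one false alarm
```

With no model every prediction is "legitimate", so the cost is simply the sum of the fraud amounts, which matches how the 148,562 EUR of fraud losses on the test set serves as the "no model" baseline.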
If-Then rules
Results

Misclassification  Recall  Precision  F1-Score
1.04%              31%     17%        22%

Cost      Cost No Model
€ 95,520  € 148,562

148,562 EUR are the losses due to fraud in the test database (2 months)
Predictive modeling
Predictive modeling is the use of statistical and mathematical techniques to discover patterns in data in order to make predictions
[Figure: scatter plot of amount of transaction against number of transactions last day, with normal transactions and frauds marked; the last panel adds amount spent on internet last month]
Logistic Regression
• Model
• Cost Function
• Cost Matrix

                          True class (y_i)
                          Fraud (y_i = 1)  Legitimate (y_i = 0)
Predicted class (p_i)
  Fraud (p_i = 1)         0                1
  Legitimate (p_i = 0)    1                0
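
A minimal sketch of the model and cost function, assuming the standard logistic regression forms: the sigmoid model h_θ(x) = 1 / (1 + e^(−θᵀx)) and the average log-loss. The function names and the toy data are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """Estimated probability of fraud: h_theta(x) = sigmoid(theta . x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def log_loss(theta, X, y):
    """Average log-loss; a false positive and a false negative weigh the same."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(predict_proba(theta, xi), eps), 1 - eps)
        total -= yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total / len(y)

# Toy data: first feature is the intercept term.
X = [[1.0, 0.2], [1.0, 0.4], [1.0, 3.0], [1.0, 2.5]]
y = [0, 0, 1, 1]

loss_at_zero = log_loss([0.0, 0.0], X, y)   # every probability is 0.5
loss_fitted = log_loss([-3.0, 2.0], X, y)   # hand-picked theta that separates the toy data
```

The 0/1 cost matrix is what this loss reflects: both kinds of error are penalized equally, regardless of the transaction amount.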
Logistic Regression
Results

Misclassification  Recall  Precision  F1-Score
0.52%              0%      2%         0%

Cost       Cost No Model
€ 148,196  € 148,562

148,562 EUR are the losses due to fraud in the test database (2 months)
Logistic Regression
Sub-sampling procedure: select all the frauds and a random sample of the legitimate transactions.

Fraud percentage  0.467%   1%       5%      10%     20%     50%
Transactions      620,000  310,000  62,000  31,000  15,500  5,200
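
The sub-sampling procedure can be sketched as follows: keep every fraud and draw just enough legitimate transactions to reach a target fraud percentage. The function name, the `"fraud"` key, and the fixed seed are assumptions made for the illustration.

```python
import random

def subsample(transactions, target_fraud_rate, seed=0):
    """Keep all frauds plus a random sample of legitimate transactions.

    The number of legitimate transactions is chosen so that frauds make up
    `target_fraud_rate` of the result: n_legit = n_fraud * (1 - r) / r.
    """
    frauds = [t for t in transactions if t["fraud"] == 1]
    legit = [t for t in transactions if t["fraud"] == 0]
    n_legit = round(len(frauds) * (1 - target_fraud_rate) / target_fraud_rate)
    n_legit = min(n_legit, len(legit))  # cannot draw more than are available
    rng = random.Random(seed)           # fixed seed so the sample is reproducible
    sample = frauds + rng.sample(legit, n_legit)
    rng.shuffle(sample)
    return sample

# Toy data: 50 frauds among 10,000 transactions (0.5% fraud rate).
data = [{"id": i, "fraud": 1 if i < 50 else 0} for i in range(10_000)]
balanced = subsample(data, 0.10)
fraud_rate = sum(t["fraud"] for t in balanced) / len(balanced)
```

For instance, about 3,100 frauds at a 10% target give 31,000 transactions, the size listed for 10%. The price of a higher fraud percentage is that most legitimate transactions are discarded.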
Logistic Regression
Results: selecting the algorithm by cost

Training set  No Model   All        1%         5%         10%       20%       50%
Cost          € 148,562  € 148,196  € 142,510  € 112,103  € 79,838  € 65,870  € 46,530

[Figure: cost (axis € 0 to € 160,000) together with recall, precision, misclassification and F1-Score (axis 0% to 70%) for each training set]
Logistic Regression
• The best model selected using the traditional F1-Score does not give the best results in terms of cost
• The model selected by cost is trained using less than 1% of the database, meaning a lot of information is excluded
• The algorithm is trained to minimize the misclassification (approx.) but is then evaluated based on cost
• Why not train the algorithm to minimize the cost instead?
Cost Sensitive Logistic Regression
• Cost Matrix

                          True class (y_i)
                          Fraud (y_i = 1)  Legitimate (y_i = 0)
Predicted class (p_i)
  Fraud (p_i = 1)         C_a              C_a
  Legitimate (p_i = 0)    Amt_i            0

• Cost Function
• Objective
  Find θ that minimizes the cost function (Genetic Algorithms)
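
One way to read this slide is to replace the 0/1 predictions in the cost matrix with the model's probabilities, giving an expected cost per transaction, and then search for the θ that minimizes it. The sketch below assumes that form of the cost function. The search is a toy evolutionary loop standing in for the genetic algorithm, whose details are not given; all names, the administrative cost of 5, and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cs_cost(theta, X, y, amounts, admin_cost):
    """Assumed cost-sensitive cost: average expected cost under the cost matrix."""
    total = 0.0
    for xi, yi, amt in zip(X, y, amounts):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * (h * admin_cost + (1 - h) * amt) + (1 - yi) * h * admin_cost
    return total / len(y)

def evolve(cost, dim, generations=60, pop_size=30, sigma=0.5, seed=0):
    """Toy evolutionary search: keep the best half, refill with mutated blends."""
    rng = random.Random(seed)
    pop = [[0.0] * dim] + [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]  # elitism: the best candidates survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append([(u + v) / 2 + rng.gauss(0, sigma) for u, v in zip(a, b)])
        pop = parents + children
    return min(pop, key=cost)

# Toy data: intercept plus one feature; frauds carry the large amounts.
X = [[1.0, 0.2], [1.0, 0.4], [1.0, 3.0], [1.0, 0.3], [1.0, 2.5], [1.0, 0.1]]
y = [0, 0, 1, 0, 1, 0]
amounts = [20, 40, 300, 30, 250, 10]
ADMIN_COST = 5  # hypothetical value of C_a

objective = lambda theta: cs_cost(theta, X, y, amounts, ADMIN_COST)
best_theta = evolve(objective, dim=2)
cost_start = objective([0.0, 0.0])
cost_best = objective(best_theta)
```

Because the best candidates are carried over unchanged, the best cost never gets worse from one generation to the next.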
Cost Sensitive Logistic Regression
• Cost Function
• Gradient
• Hessian
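
As a hedged reconstruction, the three quantities can be written out under the assumption that the cost function is the expected cost implied by the cost matrix (C_a when a transaction is flagged, Amt_i when a fraud is missed), with h_i = h_θ(x_i) the logistic model. The gradient and Hessian below are derived from that assumed form and are not copied from the slide.

```latex
\begin{align*}
% Assumed cost function, read off the cost matrix:
J^c(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \big( h_i C_a + (1 - h_i)\, Amt_i \big) + (1 - y_i)\, h_i C_a \Big] \\
            &= \frac{1}{m} \sum_{i=1}^{m} \Big[ y_i\, Amt_i + h_i \big( C_a - y_i\, Amt_i \big) \Big] \\
% Gradient, using \partial h_i / \partial \theta_j = h_i (1 - h_i) x_{ij}:
\frac{\partial J^c}{\partial \theta_j} &= \frac{1}{m} \sum_{i=1}^{m} \big( C_a - y_i\, Amt_i \big)\, h_i (1 - h_i)\, x_{ij} \\
% Hessian, using \partial [h_i (1 - h_i)] / \partial \theta_k = h_i (1 - h_i)(1 - 2 h_i) x_{ik}:
\frac{\partial^2 J^c}{\partial \theta_j\, \partial \theta_k} &= \frac{1}{m} \sum_{i=1}^{m} \big( C_a - y_i\, Amt_i \big)\, h_i (1 - h_i)(1 - 2 h_i)\, x_{ij}\, x_{ik}
\end{align*}
```

The factor (C_a − y_i Amt_i)(1 − 2h_i) can take either sign, so under this assumed form the Hessian is not guaranteed to be positive semi-definite and the cost need not be convex, which may be one reason to prefer a global search such as a genetic algorithm.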
Cost Sensitive Logistic Regression
[Figure: cumulative distribution of transaction amount (0% to 100%) for legitimate and fraud transactions; marked amounts: €49, €370, €124, €196]
Cost sensitive Logistic Regression
Results

Training set  No Model   All       1%        5%        10%       20%       50%
Cost          € 148,562  € 31,174  € 37,785  € 66,245  € 67,264  € 73,772  € 85,724

[Figure: cost (axis € 0 to € 160,000) together with recall, precision and F1-Score (axis 0% to 100%) for each training set]
Cost sensitive Logistic Regression
Results

Model                               Cost
No Model                            € 148,562
If-Then rules                       € 95,520
Logistic Regression                 € 46,530
Cost Sensitive Logistic Regression  € 31,174
Decision Trees                      € 35,466
Random Forests                      € 34,203

[Figure: cost (axis € 0 to € 160,000) together with recall, precision and F1-Score (axis 0% to 80%) for each model]
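
To make the comparison concrete, the reported costs can be turned into a saving relative to the no-model baseline. This is only arithmetic on the chart values, read in the order of the model names; the `savings` measure and the variable names are introduced here for illustration.

```python
# Costs in EUR on the test set, as listed for each model.
costs = {
    "No Model": 148_562,
    "If-Then rules": 95_520,
    "Logistic Regression": 46_530,
    "Cost Sensitive Logistic Regression": 31_174,
    "Decision Trees": 35_466,
    "Random Forests": 34_203,
}

baseline = costs["No Model"]

# Fraction of the baseline fraud losses avoided by each model.
savings = {name: 1 - cost / baseline for name, cost in costs.items()}

# Model with the lowest cost.
best = min(costs, key=costs.get)

for name, share in sorted(savings.items(), key=lambda item: -item[1]):
    print(f"{name:<36} {share:6.1%}")
```

Ranking by cost puts Cost Sensitive Logistic Regression first, ahead of Random Forests and Decision Trees.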
Conclusion
• Selecting models based on traditional statistics does not give the best results in terms of cost
• Models should be evaluated taking into account the real financial costs of the application
• Algorithms should be developed to incorporate those financial costs
Contact information
Alejandro Correa Bahnsen
University of Luxembourg
Luxembourg
http://www.linkedin.com/in/albahnsen
http://www.slideshare.net/albahnsen
Thank You!!
Alejandro Correa Bahnsen, Andres Gonzalez Montoya