
© 2015 The MathWorks, Inc.

Fraud Detection with MATLAB

Ian McKenna, Ph.D.


Agenda

Introduction: Background on Fraud Detection

Challenges: Knowing your Risk

Overview of the MATLAB Solution

– Connect to financial data sources

– Calculate fraud indicators

– Classify funds with machine learning

– Generate reports & deploy applications

Questions & Answers


Fraud Detection

Detecting when people intentionally act secretly to deprive another of something of value

Types

– Returns Forensics

– Linguistic Based Cues

http://nakedshorts.typepad.com/files/madoff_fairfieldsentry3x.pdf


Types of Fraud

Corporate

– Financial statement falsification

Securities and commodities

– Hedge Fund returns manipulation

– Stock markets manipulation, regulation compliance

Healthcare

Mortgage

Identity theft (credit card)

Insurance

Mass marketing

Asset forfeiture/money laundering


Hedge Fund Returns Manipulation

More prone to fraud due to decreased regulation

– SEC stats indicate 1% misbehave

Scenarios

– Misbehavior: Hedge fund managers have some discretion in valuing illiquid investments. Academics have devised methods to analyze and flag potentially “manipulated” fund returns.

– Outright fraud: Quantitative screening and the use of dedicated algorithms can save a lot of time


Return-Based Analysis

The number of negative monthly returns is used to judge a manager’s performance

Attract investors by misreporting returns

Distortion possible for returns at manager’s discretion

– Illiquid assets, complex assets

E.g. a discontinuity in the distribution of returns exists at zero but disappears if returns are computed bimonthly

“Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud”. Bollen, Nicolas P.B. and Veronika K. Pool (2012) Review of Financial Studies 25, 2673-2702.
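The idea of a discontinuity at zero can be illustrated with a small Python sketch (not the MATLAB code from the talk, and not the exact test from the cited paper). It counts the returns in the histogram bin just below zero and compares that count with the average of the two neighbouring bins. The function name and the 0.5% bin width are assumptions chosen for illustration.

```python
def zero_discontinuity(returns, bin_width=0.005):
    """Compare the histogram bin just below zero with its two neighbours.

    returns   : monthly returns as decimals (0.01 means 1%)
    bin_width : assumed bin width; a real analysis would choose it from the data
    """
    def count(lo, hi):
        # Number of returns falling in the half-open interval [lo, hi)
        return sum(1 for r in returns if lo <= r < hi)

    left_neighbour = count(-2 * bin_width, -bin_width)
    just_below = count(-bin_width, 0.0)
    just_above = count(0.0, bin_width)

    # With a smooth distribution, the middle bin should hold roughly
    # the average of its two neighbours.
    expected = (left_neighbour + just_above) / 2.0
    return {
        "just_below": just_below,
        "expected": expected,
        "shortfall": expected - just_below,
    }
```

A large positive shortfall means that small losses are rarer than the neighbouring bins suggest, which is the pattern the slide describes. Whether a shortfall is statistically significant needs a proper test, which this sketch does not attempt.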


Returns Distribution Discontinuity


Benford’s Law

Frequency distribution of digits in many real-life sources of data:

– Electricity bills

– Street addresses

– Stock prices

– Population numbers

– Death rates

– Physical and mathematical constants

– Processes described by power laws
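Benford's Law states that the leading digit d (1 through 9) occurs with probability log10(1 + 1/d), so a leading 1 appears about 30% of the time. The Python sketch below (illustrative only, not the MATLAB code from the talk) computes these expected frequencies and extracts the first significant digit of a number. The function names are hypothetical.

```python
import math

def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """First significant digit of a non-zero number, ignoring sign and scale."""
    if x == 0:
        raise ValueError("zero has no first significant digit")
    # Scientific notation puts the first significant digit in position 0,
    # e.g. 0.0314 -> '3.140000000000000e-02'
    return int(f"{abs(x):.15e}"[0])
```

Tallying `first_digit` over a series of returns gives the observed frequencies to compare against `benford_expected()`.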


Stock Market Returns First Digit Frequency

Source: “Checking Financial Markets via Benford's Law”, Marco Corazza, Andrea Ellero, and Alberto Zorzi



Challenges in Fraud Detection

Cost/Economics

– Most cases not fraud

– Manual analysis

Data

– Huge data sets

– Complex data types

– Data integration

Change

– Evolutionary

– Secrecy in detection methods


Challenges Faced During Model Development

Each traditional approach is followed by its challenge:

Off-the-shelf software

– Inability to work with custom and complex data

In-house development with traditional languages

– Adapting requires long development times

Spreadsheets, Excel

– Limited data size

Combination of the above

– Inefficiencies in integration & automation


Computational Finance Workflow

[Workflow diagram. Stages: Access (Files, Databases, Datafeeds); Research and Quantify (Data Analysis & Visualization, Financial Modeling, Application Development); Share (Reporting, Applications, Production); Automate.]


The Desired Report

Three funds to analyze and report:

– Gateway Fund

– American Funds Growth Fund

– Fairfield Sentry (known fraudulent Madoff fund)



Implemented Methods – Returns Based

Returns distribution and discontinuity at 0

– Check for a discontinuity at 0 in the distribution of monthly returns

Low correlation with other assets

– Regress fund returns on a combination of style factors that maximizes the explanatory power of the analysis

Unconditional serial correlation

– Check whether monthly returns are serially correlated, i.e. correlated with their previous month's value. Managers investing in illiquid securities, which have no end-of-month quoted price, may smooth their returns relative to all available market information

Conditional serial correlation

– Using the optimal factor model constructed in “Low correlation with other assets”, check for serial correlation occurring especially after a down month (i.e. when a suspicious manager has the highest incentive to “catch up”)
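The unconditional serial correlation check can be sketched in a few lines of Python (illustrative only, not the MATLAB code from the talk). The function below computes the lag-1 sample autocorrelation of a return series. The conditional variant depends on the fitted factor model and is not shown. The function name is hypothetical.

```python
from statistics import mean

def lag1_autocorrelation(returns):
    """Lag-1 sample autocorrelation of a return series."""
    if len(returns) < 2:
        raise ValueError("need at least two observations")
    m = mean(returns)
    # Sum of squared deviations (proportional to the sample variance)
    denominator = sum((r - m) ** 2 for r in returns)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of each deviation with the previous month's deviation
    numerator = sum(
        (returns[t] - m) * (returns[t - 1] - m) for t in range(1, len(returns))
    )
    return numerator / denominator
```

A value well above zero means this month's return tends to resemble last month's, which is the signature of smoothed returns the slide refers to. Judging significance requires a standard error or a formal test on top of this statistic.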


Implemented Methods – Returns Based

Number of returns equal to 0

– Calculate the theoretical number of returns equal to 0, using the cumulative distribution function and binomial coefficients, for a time series with the same characteristics (average return and variance) as the fund. Then compare that number with the actual count.

Number of negative returns

– Calculate the theoretical number of negative returns as above. Then compare that number with the actual count.

Number of unique returns / length of identical recurring series

– Calculate the theoretical value of each pattern. Unique returns is the number of distinct values in the time series, and length of identical series is the number of consecutive observations that are identical. Then compare these theoretical values with the actual counts.
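As a rough Python sketch of the negative-returns check (not the MATLAB code from the talk), the function below assumes returns are normally distributed with the fund's own mean and standard deviation. The normality assumption and all function names are illustrative choices, not details from the slides. The normal CDF gives the per-month probability of a negative return, and binomial coefficients give the probability of seeing as few negatives as were observed.

```python
from math import comb
from statistics import NormalDist, mean, stdev

def negative_return_check(returns):
    """Observed vs. expected number of negative returns under a normal model."""
    n = len(returns)
    # Probability that a single return is negative, given the fund's
    # sample mean and standard deviation (normality is an assumption).
    p_neg = NormalDist(mean(returns), stdev(returns)).cdf(0.0)
    observed = sum(1 for r in returns if r < 0)
    expected = n * p_neg
    # Binomial probability of observing this many negatives or fewer
    p_value = sum(
        comb(n, k) * p_neg ** k * (1 - p_neg) ** (n - k)
        for k in range(observed + 1)
    )
    return observed, expected, p_value

def longest_identical_run(returns):
    """Length of the longest run of consecutive identical observations."""
    longest = current = 0
    previous = object()  # sentinel that equals no return value
    for r in returns:
        current = current + 1 if r == previous else 1
        longest = max(longest, current)
        previous = r
    return longest
```

A very small `p_value` indicates far fewer losing months than the fund's own mean and variance would imply. The number of unique returns is simply `len(set(returns))`.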


Implemented Methods – Returns Based

Sample distribution of the last digit

– Check whether the last digit of the returns is uniformly distributed, using a goodness-of-fit test

Sample distribution of the first digit

– Check whether the first digit of the returns follows Benford’s Law, using a goodness-of-fit test

Supervised classification methods

– Using machine learning tools (such as neural networks and classification methods), train a model to identify potential fraudsters. Input variables consist of all of the indicators described so far, computed for previously identified fraudulent and non-fraudulent funds. Apply the fitted model to a new fund to obtain its classification.
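The last-digit check can be sketched in Python as a chi-square goodness-of-fit statistic against a uniform distribution (illustrative only, not the MATLAB code from the talk). The sketch assumes returns are reported in percent and rounded to two decimals. That reporting convention and the function names are assumptions.

```python
def last_digit_counts(returns_pct, decimals=2):
    """Tally the last reported digit of each return (digits 0 through 9)."""
    counts = [0] * 10
    for r in returns_pct:
        # Format to the assumed reporting precision, then take the final digit
        digit = int(f"{abs(r):.{decimals}f}"[-1])
        counts[digit] += 1
    return counts

def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    n = sum(counts)
    if n == 0:
        raise ValueError("no observations")
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# 5% critical value of the chi-square distribution with 9 degrees of freedom
CHI2_CRIT_5PCT_DF9 = 16.919
```

If `chi_square_uniform(last_digit_counts(returns))` exceeds `CHI2_CRIT_5PCT_DF9`, uniformity of the last digit is rejected at the 5% level. The first-digit check works the same way, with Benford's probabilities as the expected frequencies and 8 degrees of freedom.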


Text Based Indicators

Idea from published research in criminal investigation

Hypothesis - deceptive senders display:

– Higher quantity

– Higher expressivity

– Higher informality

– Higher uncertainty

– Higher nonimmediacy

– Lower complexity

– Lower diversity

– Lower specificity

“Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication”. Lina Zhou (Department of Information Systems, University of Maryland, Baltimore County, MD, USA), Judee K. Burgoon, Jay F. Nunamaker, Jr., and Doug Twitchell (Center for the Management of Information, University of Arizona, Tucson, AZ, USA) (2004) Group Decision and Negotiation 13, 81–106.


Implemented Methods – Text Based

Measure of Complexity

– Average number of statements (average concepts per sentence)

– Average sentence length (average complexity of structures)

– Vocabulary complexity (average word length)

Measure of Uncertainty

– Average use of modifiers (number of adjectives/adverbs per sentence)

– Average references to others (number of “he”, “they”, …)

Measure of Expressivity

– Emotiveness (number of adjectives compared to nouns)

Measure of Diversity

– Lexical diversity (number of unique words)
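Several of these measures need no part-of-speech tagging and can be sketched directly in Python (illustrative only, not the MATLAB code from the talk). The tokenization rules, the pronoun list, and the function name below are assumptions. Modifier counts and emotiveness need a part-of-speech tagger and are not shown.

```python
import re

# Assumed list of third-person pronouns used to count references to others
OTHER_PRONOUNS = {"he", "she", "they", "him", "her", "them"}

def text_cues(text):
    """Compute simple linguistic cues that need no part-of-speech tagging."""
    # Naive sentence split on ., ! and ? (an assumption; abbreviations break it)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no words")
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_words": len(set(words)),
        # Ratio form of lexical diversity: unique words over total words
        "lexical_diversity": len(set(words)) / len(words),
        "other_refs_per_sentence":
            sum(w in OTHER_PRONOUNS for w in words) / len(sentences),
    }
```

Computing these cues for each fund's written material gives numbers that can be compared across funds, as in the comparison slides that follow.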


Classifying Words

Java POS Tagger

Reference online dictionary

Only a few lines of code


Comparison: American Growth Fund


Comparison: Madoff


MATLAB Solutions

Each traditional approach is followed by its challenge and the solution:

Off-the-shelf software

– Challenge: Inability to work with custom and complex data

– Solution: Flexible Modeling. Work with structured/unstructured

In-house development with traditional languages

– Challenge: Adapting requires long development times

– Solution: Rapid Prototyping. Advanced

Spreadsheets, Excel

– Challenge: Limited data size

– Solution: Work with Big Data Sets. Database/Hadoop

Combination of the above

– Challenge: Inefficiencies in integration & automation

– Solution: Easy to Integrate & Deploy. Automated reports, encrypted models


Financial Modeling Workflow

[Workflow diagram. Stages: Access (Files, Databases, Datafeeds); Research and Quantify (Data Analysis and Visualization, Financial Modeling, Application Development); Share (Reporting, Applications, Production). Product labels: MATLAB, Financial, Statistics & Machine Learning, Optimization, Financial Instruments, Econometrics, Parallel Computing, MATLAB Distributed Computing Server, Datafeed, Database, Spreadsheet Link EX, Trading, Report Generator, MATLAB Compiler, MATLAB Compiler SDK, Production Server.]


Q&A
