Fuzzy Matching in Fraud Analytics

Grant Brodie, President, Arbutus Software

Outline

What Is Fuzzy?

Causes

Effective Implementation

Application to Specific Products

Demonstration

Q&A

Why Is Fuzzy Important?

Big data

Too many transactions

User-entered data (web sites)

E-Commerce

Less manual oversight

What Is Fuzzy?

Subset of duplicates testing

Find specific keywords in text (FCPA, PCard)

Close, but not the same

Two reasonable definitions

Proximity

Looks similar

Proximity

Sorts close together

Characters

“Albert” vs. “Albertson”

Numbers

123,456.78 vs. 123,792.16

Dates

Jan 19, 2014 vs. Jan 20, 2014

Looks Similar

Characters

Microsoft vs. Wicrosoft

Numbers

127,894.63 vs. 12,894.63

Dates

Jan 13, 2014 vs. Jan 31, 2014

Traditional Approach to “Close”

Pronunciation based

Soundex

NYSIIS

Designed for names

Many false positives

Not useful for numbers or dates

Fuzzy Today

Based on physical string matching

Levenshtein (ACL)

Damerau-Levenshtein (Arbutus)

N-Gram

Jaro-Winkler

And many more…

Differences expressed as a “distance” or percentage

Quick Lesson: Damerau-Levenshtein

Min. # changes to make one string into another

Insert, delete, replace, transpose

‘123 Main Street’ vs. ‘123 Main St’ = 4

34567 vs. 34576 = 1 (Levenshtein: 2)

‘Rob’ vs. ‘Robert’ = 3

‘Gary’ vs. ‘Mary’ = 1

‘Gary’ vs. ‘gary’ = 1
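As a sketch of how this works (Python; illustrative, not either vendor's implementation), the restricted Damerau-Levenshtein distance (insert, delete, replace, transpose) can be computed with the classic dynamic program:

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    minimum number of inserts, deletes, replacements, and adjacent
    transpositions needed to turn `a` into `b`."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[len(a)][len(b)]
```

The slide's examples check out: dl_distance("34567", "34576") is 1, because a transposition counts as a single edit where plain Levenshtein needs 2, and the comparison is case sensitive, so "Gary" vs. "gary" is 1.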

Problems with String Matching

Very literal

Doesn’t apply any context

“John Smith” vs. “John  Smith” (1, extra blank)

“Smith John” vs. “Smith, John” (1)

“John Smith” vs. “john smith” (2)

México vs. Mexico (1)

“John Smith” vs. “john smith” scores the same (2) as “John Smith” vs. “John Hmitz”

What Do You Use?

Whatever your tool offers

Almost impossible to implement manually

VERY compute intensive

Causes

Accidental errors

Carelessness/mistyping

Transpositions

Blurry source

Punctuation

Extra blanks

1 vs. I, 0 vs. O (particularly with OCR)

Errors vs. Fraud

All of the causes were likely “errors”

Fraud uses intentional errors to mask activity

Obscure duplicates

Obscure relationships

Trick through similarity

Disparate systems make comparison even harder

Practical Issues

Generally hard to “target” fuzzy tests

Forced to use broad tests

Most findings will be errors

Even so, the finding is still valuable

Need a process to address errors found

“Our System Catches Duplicates”

Exact matches only

Strict application (e.g., company, vendor, invoice)

May only warn

Not all duplicates are payments

Most only test document numbers

Types of Duplicates

Names

Personal

Corporate

Addresses

Document numbers (e.g., invoice)

Contact information

Phone numbers

Emails

Issues

Very compute intensive (wait times)

Quadratic relationship (every record is compared with every other)

1,000x data = 1,000,000x more work

False positives

Ease of use

False Positives

Easily the most challenging aspect

Any time spent on a false positive is wasted

Can easily outnumber the true positives by 10, 100, 1000 to 1

If too many, can remove any cost effectiveness

How does this happen?

Only one way to get an exact match

Virtually unlimited ways to get close

False Positive Examples

Matching to “12345” with a single difference:

Missing (1245): 5, Transposition (12435): 4

Incorrect (12745): min 45 (175 if alpha, 1,000+ if any char)

Extra (123345): min 60 (200+ if alpha, 1,000+ if any char)

Hundreds/thousands of ways that differ by just 1

Not just errors, all close values

Exponentially more with a distance of 2

A bad actor counts on being one needle in this haystack
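These counts can be verified by brute force. A minimal Python sketch (function name is illustrative) enumerates every string one edit away from “12345”; counting distinct results rather than raw edits, a digits-only alphabet already yields 109 neighbors:

```python
def neighbors(s: str, alphabet: str) -> set[str]:
    """All distinct strings exactly one edit (delete, replace,
    insert, or adjacent transposition) away from `s`."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                        # deletions
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i + 1:])            # replacements
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])                    # insertions
    for i in range(len(s) - 1):
        if s[i] != s[i + 1]:
            out.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])  # transpositions
    out.discard(s)
    return out

# "12345" over digits: 5 deletions + 45 replacements
# + 55 distinct insertions + 4 transpositions = 109
```

With a 36-character alphanumeric alphabet the same count grows to 395, consistent with the slide's "hundreds" for a single distance of 1.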

How to Address the Issues

Data preparation

Utilize “context”

Use “tight” specifications

Choose software that meets needs

Rank your results

Choose Your Software

Has the capabilities you need

Can process your data volumes

Easy to implement

Easy to automate

ACL, Arbutus, IDEA, fraud-specific, non-audit tools

Data Preparation

Remove immaterial differences first (i.e., normalization)

Text manipulation

Upper case

Punctuation

Extra blanks

Foreign characters (México vs. Mexico, Québec vs. Quebec)
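For the foreign-character step, Unicode normalization handles this generically in Python (a sketch; the tools above use explicit Replace calls or a built-in normalization function instead):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose accented characters (NFD form) and drop the
    combining marks, so Mexico and Quebec lose their accents."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))
```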

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Eliminate “noise” words

Different by type of data

Address: Suite, Unit

Corporate name: Company, Co, Inc

Personal name: Mr, Ms, Dr, Prof

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Common misspellings/typos

Common vocabulary (chair vs. silla)

Different by data type

Avenue: Av, Ave, Aven, Avenu

First vs. 1st…

West vs. W…

Richard, Rick, Dick, Ricky, Rich

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Word order

“123 W Main St.” vs. “123 Main St. W”

Data Preparation: Result

Well-implemented data prep minimizes the need for fuzzy matching

Consider the two addresses:

“#200-1234 Main Street West”

“1234 W MAIN ST, Suite 200”

Levenshtein distance is 20

Applying data prep can make both strings identical

W ST MAIN 200 1234
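A minimal Python sketch of that whole pipeline (the noise list and vocabulary here are tiny illustrative stand-ins for a real substitution file): upper-case, replace punctuation with blanks, drop noise words, canonicalize vocabulary, then sort the words so word order stops mattering:

```python
import re

NOISE = {"SUITE", "UNIT"}              # words to omit (illustrative subset)
VOCAB = {"STREET": "ST", "WEST": "W"}  # variant -> canonical (illustrative subset)

def normalize(address: str) -> str:
    """Normalize an address so immaterial differences disappear."""
    words = re.sub(r"[^0-9A-Z]", " ", address.upper()).split()
    words = [VOCAB.get(w, w) for w in words if w not in NOISE]
    return " ".join(sorted(words))
```

With this, normalize("#200-1234 Main Street West") and normalize("1234 W MAIN ST, Suite 200") reduce to the same string, so a plain exact-duplicates test now catches the pair with no fuzzy matching at all.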

Text Manipulation: ACL

Create a computed field

Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler)

Punctuation: Include(field, “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), but…

Extra blanks (replace two blanks with one, repeatedly): Replace(Replace(field, “  ”, “ ”), “  ”, “ ”)…

Foreign characters: Replace(Replace(field, “É”, “E”), “Á”, “A”)…

Replace(Replace(Replace(Replace(Include(Upper(field), “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), “  ”, “ ”), “  ”, “ ”), “  ”, “ ”), “É”, “E”)…

In practice, many more replace calls

May break up into multiple fields for clarity

Text Manipulation: Arbutus

Create a computed field

Upper case: Upper(field)

Punctuation: Include(field, “ 0~9A~Z”), but…

Extra blanks: Compact(field)

Foreign characters: Replace(field, “É”, “E”, “Á”, “A”,…)

Replace(Compact(Include(Upper(field), “ 0~9A~Z”)), “É”, “E”,…)

May break up into multiple fields for clarity

Only for unusual situations (use Normalize function)

Eliminate “Noise” Words: ACL

Use “whole words”

Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD…”, F), but…

Omit(field, “INC”): CINCH INDUSTRIES becomes CH INDUSTRIES

Problem is, many noise words to eliminate—two solutions:

Long list

Alltrim(Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD ,CORPORATION ,CORP ,…”))

Sequential omits of a variable in a group

v_field=Omit(field…

v_field=Omit(v_field…

Common Vocabulary: ACL

Similar to noise words, only Replace instead of Omit

Use “whole words”

Replace(field+“ ”, “ROAD ”, “RD ”)

Otherwise, “BROADWAY” becomes “BRDWAY”

Don’t omit, as Peachtree Lane is not the same as Peachtree Court

Problem is, MANY vocabulary words to potentially normalize

USPS 400 street terms, 500+ male names, 700+ female names

Nested functions (with Replace instead of Omit)

Sequential replaces of a variable in a group

Word Order: ACL

No practical way to address this

Noise Words and Common Vocabulary: Arbutus

If you choose, the ACL syntax all works

Instead: Use Normalize() or SortNormalize()

Automatically implements ALL of the data prep described

(Upper case, punctuation, blanks, foreign, noise, vocabulary)

Normalize(address, “addr.txt”)

Normalize(“Suite 200-1234 Main Street West”, “addr.txt”) = “200 1234 MAIN ST W”

SortNormalize has the same syntax, but = “W ST MAIN 200 1234”

Normalize can use a separate vocabulary file (addr.txt)

Replaces or omits any word, on a “whole word” basis

User configurable and selectable, by data type

Noise Words and Common Vocabulary: Arbutus

Substitution file (addr.txt, for example)

FIRST 1ST

SEVENTH 7TH

AV AVE

AVENU AVE

AVENUE AVE

AVN AVE

PARKWAY PKWY

PARKWY PKWY

PKWAY PKWY

PKY PKWY

SUITE

UNIT

False Positive Reduction: Utilize Context

Data elements always have a “context”

Names or address: location (e.g., city, state, ZIP, country, etc.)

Documents: vendor, employee, etc.

Reference the similarities to minimize the ambiguity

Same state, city, similar address

“123 Main St.”, Springfield, IL/MA

Same vendor, date, amount, similar invoice number

Utilize Context: Application

ACL FUZZYDUP: Only supports one key field

Concatenate fields into a single expression/computed field

State+City+Address

Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno

Arbutus DUPLICATES: Supports multiple key fields

Specify each key separately

Last key can be fuzzy
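In a general-purpose language the multi-key pattern looks like this (a sketch; the field names vendor, amount, and invno are hypothetical): group on the exact keys, then run the fuzzy comparison only within each group, which also cuts the number of pairs to test:

```python
from itertools import combinations, groupby
from operator import itemgetter

def lev(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_dups(rows, max_diff=1):
    """Exact match on (vendor, amount); fuzzy match on invoice number."""
    key = itemgetter("vendor", "amount")
    hits = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        for r1, r2 in combinations(grp, 2):
            # distance 0 (identical invno) is assumed caught by the
            # ordinary duplicates test, so only near-misses are flagged
            if 0 < lev(r1["invno"], r2["invno"]) <= max_diff:
                hits.append((r1["invno"], r2["invno"]))
    return hits
```

Grouping first is also what makes the run tractable: the quadratic cost applies per group rather than to the whole file.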

False Positive Reduction: Use “Tight” Specs

Levenshtein distance 1, or 2 max

Looser specifications = more false positives

Avoid Soundex and similar approaches

There is no substitute for good data prep

False Positives: Rank Your Results

Order based on exposure

Size of item

Degree of inherent risk (cash)

Order based on degree of similarity

Distance (1 vs. 2)

Number of matching “same” elements

Execution: ACL

Separate menu item

Analyze/fuzzy duplicates

Choose your (concatenated) key

Choose diff. threshold (1 or 2)

Select other fields to use in investigation

Select the output table name

Be patient

Execution: Arbutus

Included with duplicates testing

Analyze/duplicates

Choose your key fields (any type)

Choose either near or similar processing

Choose max. difference (0, 1, or 2)

Select other fields to use in investigation

Select output location and name

“Similar” Processing: Arbutus

Specifically designed to work with document IDs

Uses Damerau-Levenshtein, but automatically pre-processes

Removes all blanks and punctuation, upper cases

Matches similar characters: O=0, I=1, 5=S, etc.

Works on all data types

127,894.63 vs. 12,894.63 (diff. 1)

I-12345 vs. 112345 (diff 0)

Particularly useful with OCR
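The same idea can be sketched in Python; the confusable map below covers only the slide's examples (the real set is larger and, per the slide, built in):

```python
# Fold commonly confused characters onto one representative
# (only the O=0, I=1, 5=S examples from the slide; illustrative)
CONFUSABLE = str.maketrans({"O": "0", "I": "1", "S": "5"})

def canon(value) -> str:
    """Canonicalize a document ID: upper-case, keep only letters
    and digits, then fold confusable characters together."""
    stripped = "".join(ch for ch in str(value).upper() if ch.isalnum())
    return stripped.translate(CONFUSABLE)
```

After canonicalization, "I-12345" and "112345" are identical (difference 0), and "127,894.63" vs. "12,894.63" reduce to strings one deletion apart (difference 1).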

“Similar” Processing: ACL

Not explicitly supported

Pre-process the data to create a computed field

Upper case

Include only numbers and letters (no blanks, punctuation)

Convert numbers and dates to strings (STRING() or DATE())

Use the FUZZYDUP command as in the past

Manual Duplicates Testing: ACL

Data prep is still important

LevDist(string1, string2 <, case sensitive>)

Case sensitive by default

Filter: LevDist(name1, name2, F) < 3

IsFuzzyDup(string1, string2, distance <, diff%> )

Automatically case insensitive

Filter: IsFuzzyDup(name1, name2, 2)

Can also be used as a join test

Manual Duplicates Testing: Arbutus

All case sensitive by default (assumes normalized inputs)

Difference(string1, string2 <, case sensitive>)

Filter: difference(name1, name2, F) < 3

Near(field1, field2, difference)

Filter: near(name1, name2, 2)

Applies to all data types

Char: Damerau-Levenshtein; numbers and dates: proximity (4799 vs 4803)

Similar(field, field2, difference)

Applies to all data types, always uses Damerau-Levenshtein

Char: prepared data; numbers and dates: 123,456 vs. 12,456

Find Specific Keywords in Text: ACL

Very common for purchase card reviews, FCPA

Use the Find function:

Filter: IF Find(“Exotic”, desc)

Multiple words: IF Find(“Exotic”, desc) OR Find(“IPad”, desc)…

Not case sensitive, not whole word

Create a Logical computed field (say “Exception”):

T IF Find(“Exotic”, desc)

T IF Find(“IPad”, desc)

F

Filter: IF Exception

Find Specific Keywords in Text: Arbutus

The Find function works the same as in ACL

Use the ListFind function instead:

Filter: IF ListFind(“exceptions.txt”, desc)

Simple text file

Easily maintained in Notepad

Unlimited entries

Supports an external reference file or an internal array

Like Find function, not case sensitive, not whole word
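A Python equivalent (names are illustrative): load one keyword per line from a plain text file and flag any description containing any of them, case-insensitive and not whole-word, matching the behavior described above:

```python
def load_keywords(path: str) -> list[str]:
    """One keyword per line, blank lines ignored; easy to keep in Notepad."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def list_find(keywords: list[str], text: str) -> bool:
    """True if any keyword occurs anywhere in text
    (case-insensitive substring match, like Find)."""
    haystack = text.upper()
    return any(k.upper() in haystack for k in keywords)
```

Usage: list_find(load_keywords("exceptions.txt"), desc) as a filter over each transaction description.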

Continuous Monitoring

Mostly errors

“Test” vs. “control”

Ownership of the process

May relate to frequency

Detective vs. Preventative

Entire presentation detective

Opportunity to run against documents before committing

Preventative almost certainly a “control”

Fuzzy Testing in Action

Demonstration
