Fuzzy Matching in Fraud Analytics

Grant Brodie, President, Arbutus Software

Outline

What Is Fuzzy?

Causes

Effective Implementation

Application to Specific Products

Demonstration

Q&A

Why Is Fuzzy Important?

Big data

Too many transactions

User-entered data (web sites)

E-Commerce

Less manual oversight

What Is Fuzzy?

Subset of duplicates testing

Find specific keywords in text (FCPA, PCard)

Close, but not the same

Two reasonable definitions

Proximity

Looks similar

Proximity

Sorts close together

Characters

“Albert” vs. “Albertson”

Numbers

123,456.78 vs. 123,792.16

Dates

Jan 19, 2014 vs. Jan 20, 2014

Looks Similar

Characters

Microsoft vs. Wicrosoft

Numbers

127,894.63 vs. 12,894.63

Dates

Jan 13, 2014 vs. Jan 31, 2014

Traditional Approach to “Close”

Pronunciation based

Soundex

NYSIIS

Designed for names

Many false positives

Not useful for numbers or dates

Fuzzy Today

Based on physical string matching

Levenshtein (ACL)

Damerau-Levenshtein (Arbutus)

N-Gram

Jaro-Winkler

And many more…

Differences expressed as a “distance” or percentage

Quick Lesson: Damerau-Levenshtein

Min. # changes to make one string into another

Insert, delete, replace, transpose

‘123 Main Street’ vs. ‘123 Main St’ = 4

34567 vs. 34576 = 1 (Levenshtein: 2)

‘Rob’ vs. ‘Robert’ = 3

‘Gary’ vs. ‘Mary’ = 1

‘Gary’ vs. ‘gary’ = 1
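As a sketch of how this works (Python; illustrative, not either vendor's implementation), the restricted Damerau-Levenshtein distance (insert, delete, replace, transpose) can be computed with the classic dynamic program:

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    minimum number of inserts, deletes, replacements, and adjacent
    transpositions needed to turn `a` into `b`."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[len(a)][len(b)]
```

The slide's examples check out: dl_distance("34567", "34576") is 1, because a transposition counts as a single edit where plain Levenshtein needs 2, and the comparison is case sensitive, so "Gary" vs. "gary" is 1.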

Problems with String Matching

Very literal

Doesn’t apply any context

“John Smith” vs. “John  Smith” (1, extra blank)

“Smith John” vs. “Smith, John” (1)

“John Smith” vs. “john smith” (2)

México vs. Mexico (1)

“John Smith” vs. “john smith” scores the same (2) as “John Smith” vs. “John Hmitz”

What Do You Use?

Whatever your tool offers

Almost impossible to implement manually

VERY compute intensive

Causes

Accidental errors

Carelessness/mistyping

Transpositions

Blurry source

Punctuation

Extra blanks

1 vs. I, 0 vs. O (particularly with OCR)

Errors vs. Fraud

All of the causes were likely “errors”

Fraud uses intentional errors to mask activity

Obscure duplicates

Obscure relationships

Trick through similarity

Disparate systems make comparison even harder

Practical Issues

Generally hard to “target” fuzzy tests

Forced to use broad tests

Most findings will be errors

Even so, the finding is still valuable

Need a process to address errors found

“Our System Catches Duplicates”

Exact matches only

Strict application (e.g., company, vendor, invoice)

May only warn

Not all duplicates are payments

Most only test document numbers

Types of Duplicates

Names

Personal

Corporate

Addresses

Document numbers (e.g., invoice)

Contact information

Phone numbers

Emails

Issues

Very compute intensive (wait times)

Quadratic relationship (every record is compared with every other)

1,000x data = 1,000,000x more work

False positives

Ease of use

False Positives

Easily the most challenging aspect

Any time spent on a false positive is wasted

Can easily outnumber the true positives by 10, 100, 1000 to 1

If too many, can remove any cost effectiveness

How does this happen?

Only one way to get an exact match

Virtually unlimited ways to get close

False Positive Examples

Matching to “12345” with a single difference:

Missing (1245): 5, Transposition (12435): 4

Incorrect (12745): min 45 (175 if alpha, 1,000+ if any char)

Extra (123345): min 60 (200+ if alpha, 1,000+ if any char)

Hundreds/thousands of ways that differ by just 1

Not just errors, all close values

Exponentially more with a distance of 2

A bad actor counts on being one needle in this haystack
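These counts can be verified by brute force. A minimal Python sketch (function name is illustrative) enumerates every string one edit away from “12345”; counting distinct results rather than raw edits, a digits-only alphabet already yields 109 neighbors:

```python
def neighbors(s: str, alphabet: str) -> set[str]:
    """All distinct strings exactly one edit (delete, replace,
    insert, or adjacent transposition) away from `s`."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                        # deletions
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i + 1:])            # replacements
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])                    # insertions
    for i in range(len(s) - 1):
        if s[i] != s[i + 1]:
            out.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])  # transpositions
    out.discard(s)
    return out

# "12345" over digits: 5 deletions + 45 replacements
# + 55 distinct insertions + 4 transpositions = 109
```

With a 36-character alphanumeric alphabet the same count grows to 395, consistent with the slide's "hundreds" for a single distance of 1.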

How to Address the Issues

Data preparation

Utilize “context”

Use “tight” specifications

Choose software that meets needs

Rank your results

Choose Your Software

Has the capabilities you need

Can process your data volumes

Easy to implement

Easy to automate

ACL, Arbutus, IDEA, fraud-specific, non-audit tools

Data Preparation

Remove immaterial differences first (i.e., normalization)

Text manipulation

Upper case

Punctuation

Extra blanks

Foreign characters (México vs. Mexico, Québec vs. Quebec)
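For the foreign-character step, Unicode normalization handles this generically in Python (a sketch; the tools above use explicit Replace calls or a built-in normalization function instead):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose accented characters (NFD form) and drop the
    combining marks, so Mexico and Quebec lose their accents."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))
```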

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Eliminate “noise” words

Different by type of data

Address: Suite, Unit

Corporate name: Company, Co, Inc

Personal name: Mr, Ms, Dr, Prof

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Common misspellings/typos

Common vocabulary (chair vs. silla)

Different by data type

Avenue: Av, Ave, Aven, Avenu

First vs. 1st…

West vs. W…

Richard, Rick, Dick, Ricky, Rich

Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Word order

“123 W Main St.” vs. “123 Main St. W”

Data Preparation: Result

Well-implemented data prep minimizes the need for fuzzy matching

Consider the two addresses:

“#200-1234 Main Street West”

“1234 W MAIN ST, Suite 200”

Levenshtein distance is 20

Applying data prep can make both strings identical

W ST MAIN 200 1234
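A minimal Python sketch of that whole pipeline (the noise list and vocabulary here are tiny illustrative stand-ins for a real substitution file): upper-case, replace punctuation with blanks, drop noise words, canonicalize vocabulary, then sort the words so word order stops mattering:

```python
import re

NOISE = {"SUITE", "UNIT"}              # words to omit (illustrative subset)
VOCAB = {"STREET": "ST", "WEST": "W"}  # variant -> canonical (illustrative subset)

def normalize(address: str) -> str:
    """Normalize an address so immaterial differences disappear."""
    words = re.sub(r"[^0-9A-Z]", " ", address.upper()).split()
    words = [VOCAB.get(w, w) for w in words if w not in NOISE]
    return " ".join(sorted(words))
```

With this, normalize("#200-1234 Main Street West") and normalize("1234 W MAIN ST, Suite 200") reduce to the same string, so a plain exact-duplicates test now catches the pair with no fuzzy matching at all.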

Text Manipulation: ACL

Create a computed field

Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler)

Punctuation: Include(field, “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), but…

Extra blanks (replace two blanks with one, repeatedly): Replace(Replace(field, “  ”, “ ”), “  ”, “ ”)…

Foreign characters: Replace(Replace(field, “É”, “E”), “Á”, “A”)…

Replace(Replace(Replace(Replace(Include(Upper(field), “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), “  ”, “ ”), “  ”, “ ”), “  ”, “ ”), “É”, “E”)…

In practice, many more replace calls

May break up into multiple fields for clarity

Text Manipulation: Arbutus

Create a computed field

Upper case: Upper(field)

Punctuation: Include(field, “ 0~9A~Z”), but…

Extra blanks: Compact(field)

Foreign characters: Replace(field, “É”, “E”, “Á”, “A”,…)

Replace(Compact(Include(Upper(field), “ 0~9A~Z”)), “É”, “E”,…)

May break up into multiple fields for clarity

Only for unusual situations (use Normalize function)

Eliminate “Noise” Words: ACL

Use “whole words”

Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD…”, F), but…

Omit(field, “INC”): CINCH INDUSTRIES becomes CH INDUSTRIES

Problem is, many noise words to eliminate—two solutions:

Long list

Alltrim(Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD ,CORPORATION ,CORP ,…”))

Sequential omits of a variable in a group

v_field=Omit(field…

v_field=Omit(v_field…

Common Vocabulary: ACL

Similar to noise words, only Replace instead of Omit

Use “whole words”

Replace(field+“ ”, “ROAD ”, “RD ”)

Otherwise, “BROADWAY” becomes “BRDWAY”

Don’t omit, as Peachtree Lane is not the same as Peachtree Court

Problem is, MANY vocabulary words to potentially normalize

USPS 400 street terms, 500+ male names, 700+ female names

Nested functions (with Replace instead of Omit)

Sequential replaces of a variable in a group

Word Order: ACL

No practical way to address this

Noise Words and Common Vocabulary: Arbutus

If you choose, the ACL syntax all works

Instead: Use Normalize() or SortNormalize()

Automatically implements ALL of the data prep described

(Upper case, punctuation, blanks, foreign, noise, vocabulary)

Normalize(address, “addr.txt”)

Normalize(“Suite 200-1234 Main Street West”, “addr.txt”) = “200 1234 MAIN ST W”

SortNormalize has the same syntax, but = “W ST MAIN 200 1234”

Normalize can use a separate vocabulary file (addr.txt)

Replaces or omits any word, on a “whole word” basis

User configurable and selectable, by data type

Noise Words and Common Vocabulary: Arbutus

Substitution file (addr.txt, for example)

FIRST 1ST

SEVENTH 7TH

AV AVE

AVENU AVE

AVENUE AVE

AVN AVE

PARKWAY PKWY

PARKWY PKWY

PKWAY PKWY

PKY PKWY

SUITE

UNIT

False Positive Reduction: Utilize Context

Data elements always have a “context”

Names or address: location (e.g., city, state, ZIP, country, etc.)

Documents: vendor, employee, etc.

Reference the similarities to minimize the ambiguity

Same state, city, similar address

“123 Main St.”, Springfield, IL/MA

Same vendor, date, amount, similar invoice number

Utilize Context: Application

ACL FUZZYDUP: Only supports one key field

Concatenate fields into a single expression/computed field

State+City+Address

Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno

Arbutus DUPLICATES: Supports multiple key fields

Specify each key separately

Last key can be fuzzy
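In a general-purpose language the multi-key pattern looks like this (a sketch; the field names vendor, amount, and invno are hypothetical): group on the exact keys, then run the fuzzy comparison only within each group, which also cuts the number of pairs to test:

```python
from itertools import combinations, groupby
from operator import itemgetter

def lev(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_dups(rows, max_diff=1):
    """Exact match on (vendor, amount); fuzzy match on invoice number."""
    key = itemgetter("vendor", "amount")
    hits = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        for r1, r2 in combinations(grp, 2):
            # distance 0 (identical invno) is assumed caught by the
            # ordinary duplicates test, so only near-misses are flagged
            if 0 < lev(r1["invno"], r2["invno"]) <= max_diff:
                hits.append((r1["invno"], r2["invno"]))
    return hits
```

Grouping first is also what makes the run tractable: the quadratic cost applies per group rather than to the whole file.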

False Positive Reduction: Use “Tight” Specs

Levenshtein distance 1, or 2 max

Looser specifications = more false positives

Avoid Soundex and similar approaches

There is no substitute for good data prep

False Positives: Rank Your Results

Order based on exposure

Size of item

Degree of inherent risk (cash)

Order based on degree of similarity

Distance (1 vs. 2)

Number of matching “same” elements

Execution: ACL

Separate menu item

Analyze/fuzzy duplicates

Choose your (concatenated) key

Choose diff. threshold (1 or 2)

Select other fields to use in investigation

Select the output table name

Be patient

Execution: Arbutus

Included with duplicates testing

Analyze/duplicates

Choose your key fields (any type)

Choose either near or similar processing

Choose max. difference (0, 1, or 2)

Select other fields to use in investigation

Select output location and name

“Similar” Processing: Arbutus

Specifically designed to work with document IDs

Uses Damerau-Levenshtein, but automatically pre-processes

Removes all blanks and punctuation, upper cases

Matches similar characters: O=0, I=1, 5=S, etc.

Works on all data types

127,894.63 vs. 12,894.63 (diff. 1)

I-12345 vs. 112345 (diff 0)

Particularly useful with OCR
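The same idea can be sketched in Python; the confusable map below covers only the slide's examples (the real set is larger and, per the slide, built in):

```python
# Fold commonly confused characters onto one representative
# (only the O=0, I=1, 5=S examples from the slide; illustrative)
CONFUSABLE = str.maketrans({"O": "0", "I": "1", "S": "5"})

def canon(value) -> str:
    """Canonicalize a document ID: upper-case, keep only letters
    and digits, then fold confusable characters together."""
    stripped = "".join(ch for ch in str(value).upper() if ch.isalnum())
    return stripped.translate(CONFUSABLE)
```

After canonicalization, "I-12345" and "112345" are identical (difference 0), and "127,894.63" vs. "12,894.63" reduce to strings one deletion apart (difference 1).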

“Similar” Processing: ACL

Not explicitly supported

Pre-process the data to create a computed field

Upper case

Include only numbers and letters (no blanks, punctuation)

Convert numbers and dates to strings (STRING() or DATE())

Use the FUZZYDUP command as in the past

Manual Duplicates Testing: ACL

Data prep is still important

LevDist(string1, string2 <, case sensitive>)

Case sensitive by default

Filter: LevDist(name1, name2, F) < 3

IsFuzzyDup(string1, string2, distance <, diff%> )

Automatically case insensitive

Filter: IsFuzzyDup(name1, name2, 2)

Can also be used as a join test

Manual Duplicates Testing: Arbutus

All case sensitive by default (assumes normalized inputs)

Difference(string1, string2 <, case sensitive>)

Filter: difference(name1, name2, F) < 3

Near(field1, field2, difference)

Filter: near(name1, name2, 2)

Applies to all data types

Char: Damerau-Levenshtein; numbers and dates: proximity (4799 vs 4803)

Similar(field, field2, difference)

Applies to all data types, always uses Damerau-Levenshtein

Char: prepared data; numbers and dates: 123,456 vs. 12,456

Find Specific Keywords in Text: ACL

Very common for purchase card reviews, FCPA

Use the Find function:

Filter: IF Find(“Exotic”, desc)

Multiple words: IF Find(“Exotic”, desc) OR Find(“IPad”, desc)…

Not case sensitive, not whole word

Create a Logical computed field (say “Exception”):

T IF Find(“Exotic”, desc)

T IF Find(“IPad”, desc)

F

Filter: IF Exception

Find Specific Keywords in Text: Arbutus

The Find function works the same as in ACL

Use the ListFind function instead:

Filter: IF ListFind(“exceptions.txt”, desc)

Simple text file

Easily maintained in Notepad

Unlimited entries

Supports an external reference file or an internal array

Like Find function, not case sensitive, not whole word
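A Python equivalent (names are illustrative): load one keyword per line from a plain text file and flag any description containing any of them, case-insensitive and not whole-word, matching the behavior described above:

```python
def load_keywords(path: str) -> list[str]:
    """One keyword per line, blank lines ignored; easy to keep in Notepad."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def list_find(keywords: list[str], text: str) -> bool:
    """True if any keyword occurs anywhere in text
    (case-insensitive substring match, like Find)."""
    haystack = text.upper()
    return any(k.upper() in haystack for k in keywords)
```

Usage: list_find(load_keywords("exceptions.txt"), desc) as a filter over each transaction description.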

Continuous Monitoring

Mostly errors

“Test” vs. “control”

Ownership of the process

May relate to frequency

Detective vs. Preventative

Entire presentation detective

Opportunity to run against documents before committing

Preventative almost certainly a “control”

Fuzzy Testing in Action

Demonstration
