
YM-RMWisdom15 final


Classify This

Analyzing and classifying e-commerce merchant websites for compliance with payment system brand protection programs using RapidMiner

Vladimir Mikhnovich, fraud analyst

© 2015


The problem

What is brand damage, and how do payment systems control it?

Illegal merchant activities (totally prohibited, a.k.a. ‘Deadly Sins’)

• Drugs

• Weapons

• Pornography

• …

High-risk merchants (require limitations / additional checks)

• Supplements

• Adult shops

• Brand replicas

• …

Payment systems and aggregators must check merchants to avoid high-risk / prohibited categories / fraud


What to comply with

Business Risk Assessment and Mitigation (BRAM)

Global Brand Protection Program (GBPP)


The task

As initially issued:

Regular (monthly/quarterly) scanning of big batches (tens of thousands) of merchant websites, identifying non-compliant and high-risk ones for further manual screening.

Total number of merchants using the Yandex.Money integrated payment solution: over 70,000.

In the first round, we need to check about 15,000 websites.


Key concerns (before we start)

Automate downloading of big batches of websites

• Ideally we want to download all 15,000 sites at once

• Speed doesn’t matter, but we must handle errors automatically

Automate classification of thousands of documents

• Ideally we want to classify everything at once

• Accuracy matters more than speed

Manual picking and labeling of sites for the training dataset

• And you thought it was easy to find & buy drugs online? Actually, it’s not…

Uncertainty of some categories

• Categories like ‘weapons’ or ‘adult’ are pretty straightforward…

• …while ‘replica’ or ‘magic’ are not

Define thresholds for classification results

• We do not want to manually re-check 50% of websites after automatic classification


The general approach

Obtain test dataset

Text mining extension

Classification model

Download websites

Apply model


Training process

Training dataset

• 11 categories

• 269 labeled sites

• 28,000 words

Text processing

• Extract text, tokenize, stem

• Build TF-IDF matrix

Model evaluation

• k-NN with cross-validation
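The evaluation step can be sketched in plain Python. This is only an illustration, not the RapidMiner process itself: cosine similarity as the distance measure, the toy vectors, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, examples, k):
    """Majority vote among the k labeled examples most similar to query."""
    nearest = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(examples, k):
    """Classify every example using all the others as training data."""
    hits = 0
    for i, (vector, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += knn_predict(vector, rest, k) == label
    return hits / len(examples)

# Toy TF-IDF-like vectors (made-up weights, two categories).
examples = [
    ({"casino": 1.0, "bet": 0.8}, "betting"),
    ({"bet": 1.0, "odds": 0.5}, "betting"),
    ({"casino": 0.7, "odds": 0.9}, "betting"),
    ({"rifle": 1.0, "ammo": 0.6}, "weapons"),
    ({"ammo": 1.0, "knife": 0.4}, "weapons"),
    ({"rifle": 0.5, "knife": 0.9}, "weapons"),
]
accuracy = leave_one_out_accuracy(examples, k=3)
```

Leave-one-out is the extreme case of cross-validation: with 269 labeled sites it means 269 rounds, each holding out exactly one site.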


TF-IDF metric

TF-IDF (term frequency–inverse document frequency): a numerical statistic that is intended to reflect how important a word is to a single document in a collection of documents.
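A minimal sketch of the statistic on a toy corpus, assuming the common variant tf × log(N / df). The RapidMiner text processing step may weight or normalize differently, and the documents and function name here are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "buy cheap pills online".split(),
    "buy hunting rifle online".split(),
    "buy flowers online with delivery".split(),
]
weights = tf_idf(docs)
```

Words that occur in every document ("buy", "online") get weight zero, while words specific to one site ("pills", "rifle") get the highest weights. That is what makes the matrix useful for telling categories apart.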


RapidMiner insights

Labeled list of domains

Loop URLs, wget and store to repository

Build TF-IDF matrix

k-NN cross-validation (leave one out)
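The download loop has to survive unreachable or broken sites. A hedged sketch of that idea in Python follows; the slides use wget inside a RapidMiner loop, so `urllib` is only a stand-in, and the function names and the stub fetcher are hypothetical.

```python
import urllib.request

def fetch_with_urllib(url, timeout=30):
    """Download one page; an assumed stand-in for the wget step."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_all(domains, fetch=fetch_with_urllib):
    """Fetch every domain, recording failures instead of stopping."""
    pages, failures = {}, {}
    for domain in domains:
        try:
            pages[domain] = fetch(f"http://{domain}/")
        except Exception as error:
            # Broad on purpose: one bad site must not abort the whole batch.
            failures[domain] = repr(error)
    return pages, failures

def stub_fetch(url):
    """Offline stub used only to demonstrate the error handling."""
    if "bad" in url:
        raise OSError("unreachable")
    return b"<html>ok</html>"

pages, failures = download_all(["good.example", "bad.example"], fetch=stub_fetch)
```

Keeping the failures in a separate dictionary makes it easy to retry or report them after the batch has finished.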


Example of confusion matrix

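accuracy: 88.46% +/- 31.95%

| | true adult | true drugs | true replica | true weapons | true normal guys | true betting exchange | true hourhotel | true magic | true spy | true supplements | true torrent | class precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pred. adult | 14 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 73.68% |
| pred. drugs | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.91% |
| pred. replica | 1 | 0 | 12 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 85.71% |
| pred. weapons | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. normal guys | 1 | 2 | 1 | 0 | 88 | 4 | 0 | 1 | 0 | 0 | 1 | 89.80% |
| pred. betting exchange | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 100.00% |
| pred. hourhotel | 0 | 0 | 0 | 0 | 2 | 0 | 8 | 0 | 0 | 0 | 0 | 80.00% |
| pred. magic | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 88.89% |
| pred. spy | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 | 0 | 0 | 90.00% |
| pred. supplements | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 100.00% |
| pred. torrent | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 66.67% |
| class recall | 82.35% | 76.92% | 92.31% | 100.00% | 88.89% | 69.23% | 100.00% | 88.89% | 100.00% | 100.00% | 80.00% | |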
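The summary figures follow directly from the counts. The sketch below copies the matrix from the slide and recomputes accuracy, class precision and class recall; the variable names are chosen for the example.

```python
labels = ["adult", "drugs", "replica", "weapons", "normal guys", "betting exchange",
          "hourhotel", "magic", "spy", "supplements", "torrent"]

# Rows are predicted classes, columns are true classes, as on the slide.
matrix = [
    [14, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [1, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 12, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 88, 4, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 4],
]

n = len(labels)
correct = sum(matrix[i][i] for i in range(n))   # diagonal = correct predictions
total = sum(sum(row) for row in matrix)
accuracy = correct / total

# Precision: correct predictions of a class / all predictions of that class (row sum).
precision = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(n)}
# Recall: correct predictions of a class / all true members of that class (column sum).
recall = {labels[j]: matrix[j][j] / sum(row[j] for row in matrix) for j in range(n)}
```

For example, "adult" was predicted 19 times and 14 of those were right (precision 73.68%), while 14 of the 17 truly adult sites were found (recall 82.35%).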


‘All-in-one’ test process

It took just ~1 hour to download and classify ~900 sites at once.

For test purposes an ‘all-in-one’ process has been implemented


Data structures and sizes

Text data size

• 1 site = a text file of 0.3 to 1+ megabytes

• The corpus totals 150-300 megabytes of text files

TF-IDF matrix examples

• Training data: 269 sites × 28,017 words (70 megabytes)

• Test data: 832 sites × 60,893 words (414 megabytes)
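A rough back-of-the-envelope estimate shows why these matrices get large. It assumes a dense matrix with one 8-byte value per cell; that storage layout is an assumption, not something the slides state, and the helper name is made up.

```python
BYTES_PER_CELL = 8  # assumption: one double-precision value per matrix cell

def dense_matrix_megabytes(n_sites, n_words, bytes_per_cell=BYTES_PER_CELL):
    """Size of a dense sites-by-words matrix in megabytes (10**6 bytes)."""
    return n_sites * n_words * bytes_per_cell / 1_000_000

training_mb = dense_matrix_megabytes(269, 28_017)   # about 60 MB
test_mb = dense_matrix_megabytes(832, 60_893)       # about 405 MB

# Hypothetical lower bound for all 15,000 sites at once, pretending the
# vocabulary stays at 60,893 words (in practice it would grow).
all_at_once_mb = dense_matrix_megabytes(15_000, 60_893)
```

Under this assumption the estimates are of the same order as the 70 and 414 megabytes reported above, and a single matrix for all 15,000 sites would need several gigabytes.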


Batching approach

Many attempts to classify thousands of sites at once were unsuccessful. The reason? Memory problems.

So we chose another approach to work around physical memory limits: batching. First we download the websites and divide them into batches of reasonable size (empirically, 200-300 sites is small enough for all matrices to fit in memory). Every batch is saved into a separate directory, and the batches are then analyzed in a loop.

Thousands of websites

Download and save

Loop batches

Classify every batch
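The batching idea can be sketched as follows. The site names, directory names and the `classify_batch` placeholder are hypothetical; in the real process each batch goes through the TF-IDF and k-NN steps.

```python
def make_batches(sites, batch_size):
    """Split the site list into consecutive batches of at most batch_size."""
    return [sites[i:i + batch_size] for i in range(0, len(sites), batch_size)]

def classify_batch(batch):
    """Hypothetical placeholder for the per-batch TF-IDF + k-NN step."""
    return {site: "unclassified" for site in batch}

sites = [f"site{i}.example" for i in range(15_000)]
batches = make_batches(sites, batch_size=300)

# One directory per batch (names are made up for the example).
directories = {f"batch_{number:02d}": batch
               for number, batch in enumerate(batches, start=1)}

results = {}
for name, batch in directories.items():
    # Process one batch at a time so that only its matrices need to fit in memory.
    results.update(classify_batch(batch))
```

With 15,000 sites and a batch size of 300 this yields 50 batches, so the classifier loops over 50 directories.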


RapidMiner insights

Batch numbers

Batch size

15,000 sites = 50 batches × 300 files = 50 directories to loop the classifier through


k-NN and thresholds

[Figure: an unknown site among labeled neighbors: adult, normal, normal, torrent, drugs, adult, normal, weapons]

k-NN is simple and (applied to text analysis) provides a measure of similarity of the text document to known categories.


k-NN and thresholds

Confidence columns: adult, drugs, replica, weapons, normal, betting, magic, spy, supplements, torrent (only the non-empty cells of each row are listed)

| site | confidences | prediction |
|---|---|---|
| eroshop.ru | 100% | adult |
| putana78.com | 100% | adult |
| IntimCity.nl | 81%, 19% | adult |
| kupialco.ru | 100% | drugs |
| mari-juana.net | 100% | drugs |
| kyritelnie-smesi.nl | 81%, 19% | drugs |
| 03market.ru | 19%, 0%, 81% | normal guys |
| 1-ocenka.ru | 20%, 40%, 40% | normal guys |
| 1c-interes.ru | 60%, 40% | normal guys |
| 1gb.ru | 100% | normal guys |
| 100-z.ru | 20%, 20%, 60% | spy |
| 100mile.ru | 20%, 80% | spy |
| 1belka.ru | 20%, 20%, 40%, 20% | spy |
| 100captains.ru | 20%, 80% | torrent |
| 1chef.ru | 100% | weapons |

With k = 5, each category gets a meaningful confidence value. Only high confidences (threshold >= 80%) are taken into account.
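One way to turn the k nearest neighbors into confidences and apply the threshold is sketched below. The neighbor weights and function names are assumptions made for the example; with equal weights and k = 5, confidences come in steps of 20%.

```python
from collections import defaultdict

def class_confidences(neighbors):
    """neighbors: list of (label, weight) pairs for the k nearest sites."""
    totals = defaultdict(float)
    for label, weight in neighbors:
        totals[label] += weight
    weight_sum = sum(totals.values())
    # Each class gets its share of the total vote weight.
    return {label: total / weight_sum for label, total in totals.items()}

def confident_prediction(neighbors, threshold=0.8):
    """Return (label, confidence); label is None when below the threshold."""
    confidences = class_confidences(neighbors)
    label, confidence = max(confidences.items(), key=lambda item: item[1])
    return (label, confidence) if confidence >= threshold else (None, confidence)

# Four of five neighbors agree: confidence 80%, which passes the threshold.
clear = confident_prediction([("adult", 1.0)] * 4 + [("normal guys", 1.0)])

# Only three of five agree: confidence 60%, so no prediction is kept.
unclear = confident_prediction([("spy", 1.0)] * 3 + [("adult", 1.0), ("normal guys", 1.0)])
```

Sites that fall below the threshold are simply not flagged, which keeps the volume of manual screening manageable.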


Finally, what’s in and out

On average, 8-10% of sites are assigned a high-confidence prediction (>= 80%) during classification and are then screened manually.

Big list of domains

| site | prediction | confidence |
|---|---|---|
| zdoroviak.ru | supplements | 100% |
| zscom.ru | spy | 100% |
| zwuk.ru | spy | 100% |
| zarekoy.ru | hourhotel | 100% |
| zastava-izhevsk.ru | weapons | 100% |
| zutera.ru | replica | 81% |
| zoombao.com | replica | 80% |
| zita-gita.ru | adult | 80% |
| zishop.ru | replica | 80% |
| zdorovoetelo100.ru | supplements | 80% |
| zen-shop.ru | drugs | 80% |


Performance and accuracy

For 100 randomly picked real-life websites:

• Downloading time: 7 minutes

• Processing & classification time: 30 seconds (0.3 sec per site)

• Cross-validation accuracy (without applying threshold): 89%

• High-risk sites classified as normal (false negatives): 1

• Correctly classified high-risk sites: 12 out of 13 (92%)

• Normal sites classified as high-risk (false positives): 0
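The rates follow from the reported counts. The short check below derives them; precision is a derived figure that the slide does not state explicitly, and the variable names are chosen for the example.

```python
true_positives = 12    # high-risk sites correctly flagged
false_negatives = 1    # high-risk sites classified as normal
false_positives = 0    # normal sites classified as high-risk

# Share of high-risk sites that were caught (12 out of 13).
recall = true_positives / (true_positives + false_negatives)
# Share of flagged sites that really were high-risk.
precision = true_positives / (true_positives + false_positives)
# 30 seconds of processing spread over 100 sites.
seconds_per_site = 30 / 100
```

Recall comes out at about 92%, matching the "12 out of 13" line, and with zero false positives every flagged site in this sample was a true hit.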


What’s next

Server deployment

• Process scheduling

• Automated reports

Improved accuracy

• Incremental model updates with new data

• Using n-grams

Automated scanning

• No need to input data manually

• Automatic parsing of referral domains


Thank you!

Vladimir Mikhnovich, fraud analyst

@kypexin
