Randomness and fraud

Michael Manapat (@mlmanapat), Stripe

- About Stripe
- Feature generation: fraudsters are “pseudorandom”
- Model training: “customized” random forests
- Model evaluation: counterfactual offline evaluation

What is Stripe?
“Full stack” for e-commerce:
- credit cards
- Checkout (APMs: Alipay, Bitcoin)
- fraud (beta)
- etc.

- Merchant fraud
- Transaction fraud

Fraudsters are “pseudorandom”

Example 1: “Random” e-mail addresses
[email protected] [email protected] ... [email protected] [email protected]

What features detect this kind of regularity?
- Distribution of letter/digit/period/domain frequencies + measures of distributional difference
- Log-likelihood ratio: good at low counts
- Difference in log-likelihood from a single model for the matrix vs. a model for each row

| | Digit | No digit |
|---|---|---|
| Sample (p) | 9 | 1 |
| Overall (q) | 200,000 | 200,000 |
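The log-likelihood ratio for the sample-vs-overall counts can be sketched as follows. This is a minimal illustration, not the code used at Stripe: it computes the standard G statistic, which is twice the difference in log-likelihood between a model fitted per row and a single model fitted to the whole matrix. The function name `log_likelihood_ratio` is an assumption made for this example.

```python
import math


def log_likelihood_ratio(table):
    """G statistic for a contingency table of counts.

    Equals 2 * (log-likelihood of one multinomial per row
                - log-likelihood of a single multinomial for the matrix).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # 0 * log(0) is taken to be 0
            # Expected count under the single shared model
            expected = row_totals[i] * col_totals[j] / total
            g += observed * math.log(observed / expected)
    return 2.0 * g


# Rows: sample (p) and overall (q); columns: digit, no digit
sample_vs_overall = [[9, 1], [200_000, 200_000]]
llr = log_likelihood_ratio(sample_vs_overall)
print(round(llr, 2))
```

A sample split 5/5 instead of 9/1 matches the overall distribution exactly and scores 0, so the statistic grows only as the sample departs from the baseline.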

Example 2: Distribution of user agents
- Transform so that it’s less “conditional”
- Get rid of the distribution entirely: (# distinct user agents) / (# distinct IPs)
@jvns @kelleyrivoire
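The ratio feature above could be computed roughly like this. The event field names (`user_agent`, `ip`) and the sample events are hypothetical, chosen only to make the sketch runnable.

```python
def user_agent_ip_ratio(events):
    """(# distinct user agents) / (# distinct IPs) over a group of events."""
    user_agents = {event["user_agent"] for event in events}
    ips = {event["ip"] for event in events}
    if not ips:
        return 0.0  # no events: define the ratio as 0
    return len(user_agents) / len(ips)


# Hypothetical events: three user agents seen from a single IP
events = [
    {"user_agent": "UA-1", "ip": "203.0.113.7"},
    {"user_agent": "UA-2", "ip": "203.0.113.7"},
    {"user_agent": "UA-3", "ip": "203.0.113.7"},
]
ratio = user_agent_ip_ratio(events)
```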

Distributed random forest learning

At each node, pick a feature X and a value v.
Splitting on X < v should minimize I(L) + I(R) - I(D), where L and R are the left and right subsets of the node’s data D.
I: “impurity” (“PLANET”)
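A single-feature version of this split search might look like the sketch below. Two assumptions are made here that the slide does not state: Gini is used as the impurity I, and the child impurities are weighted by the fraction of rows in each child, as is common in practice. The names `gini` and `best_split` are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys, impurity=gini):
    """Return (v, objective) minimizing weighted I(L) + I(R) - I(D) for X < v.

    Returns None when the feature has fewer than two distinct values.
    """
    n = len(ys)
    parent = impurity(ys)
    best = None
    for v in sorted(set(xs))[1:]:  # skip the minimum: X < min is an empty split
        left = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        objective = (
            len(left) / n * impurity(left)
            + len(right) / n * impurity(right)
            - parent
        )
        if best is None or objective < best[1]:
            best = (v, objective)
    return best


split = best_split([1, 2, 3, 4], [0, 0, 1, 1])
```

Passing a different `impurity` callable is the hook where a custom splitting criterion would go.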

Trained trees in Python with scikit, but...
- Our ETL pipeline runs on Hadoop and writes Parquet to HDFS
- Treatment of categorical variables is suboptimal (“x[1] <= 0.500”)
- No customization (impurity: “gini” or “entropy”)

“Brushfire” @avibryant @daniellesucher
- Implemented in Scala (Scalding)
- Distributed learning approach modeled on Google’s PLANET paper
- Native support for ordered/ordinal/categorical vars
- Highly customizable/modular (e.g., splitting function)

Customization
We don’t necessarily want to maximize impurity drop with each split

| X | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Y | 0 | 10 | 80 | 95 |

We have a “split budget” (after enough splits/tree levels we’ll run out of data)

We want to choose splits so we improve the ROC curve in the region of interest (even at the expense of total AUC)

[Figure: ROC curve with two annotated regions: “Want improvement here” and “Don’t care about improvement here”]

scikit (left) vs. brushfire (right)
Fixed FPR: +7 percentage points in recall in region of interest
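Comparing two models at a fixed false positive rate can be sketched as below. This is an illustrative helper, not the evaluation code behind the figures above: `recall_at_fpr` is a hypothetical name, labels are assumed to be 1 for fraud and 0 for good, higher scores mean "more likely fraud", and both classes are assumed to be present.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Best recall achievable by any score threshold with FPR <= max_fpr."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_recall = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is flagged as fraud
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        if false_pos / n_neg <= max_fpr:
            best_recall = max(best_recall, true_pos / n_pos)
    return best_recall


# Toy data, for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
strict = recall_at_fpr(scores, labels, max_fpr=0.0)
looser = recall_at_fpr(scores, labels, max_fpr=0.34)
```

Running the same function on two models' scores with the same `max_fpr` gives the recall difference in the region of interest.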

Brushfire to be open-sourced in the next month (Talk this weekend at PNW Scala)

Counterfactual offline evaluation
Li, Chen, Kleban, Gupta: “Counterfactual Estimation and Optimization of Click Metrics for Search Engines”

Every conversion results in some benefit b
Every chargeback results in some cost c
Margin = 30%, product costs $10
- Conversion: $10 - $7 (CGS) = $3
- Chargeback: -$7 (CGS) - $15 (fee) = -$22
The relative sizes of b and c determine tolerance for false positives and false negatives.
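The arithmetic for b and c can be written out as a small helper. The function and parameter names are assumptions made for this sketch; the numbers are the ones from the example above.

```python
def payoffs(price, margin, chargeback_fee):
    """Return (b, c): benefit of a conversion and cost of a chargeback."""
    cost_of_goods = price * (1 - margin)  # $7 in the example
    b = price - cost_of_goods             # revenue minus cost of goods
    c = cost_of_goods + chargeback_fee    # goods are lost and a fee is charged
    return b, c


b, c = payoffs(price=10.0, margin=0.30, chargeback_fee=15.0)
print(round(b, 2), round(c, 2))
```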

Train a model on charge history @ryw90

Historical total payoff: 3b - c

| # | Outcome | Payoff |
|---|---|---|
| 1 | Conversion | b |
| 2 | Conversion | b |
| 3 | Chargeback | -c |
| 4 | Conversion | b |

Evaluate it on charge history
Historical total payoff: 3b - c
Payoff with model: 2b

| # | Outcome | Payoff | Class | New Outcome | Payoff |
|---|---|---|---|---|---|
| 1 | Conversion | b | Good | Conversion (TN) | b |
| 2 | Conversion | b | Good | Conversion (TN) | b |
| 3 | Disputed | -c | Fraud | Blocked (TP) | 0 |
| 4 | Conversion | b | Fraud | Blocked (FP) | 0 |
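The two totals can be reproduced by replaying the four historical charges with and without the model's blocks. This is a sketch with hypothetical field names; `B` and `C` use the $3 and $22 figures from the margin example.

```python
B, C = 3.0, 22.0  # benefit per conversion, cost per chargeback

# The four historical charges, with the model's decision attached
charges = [
    {"outcome": "conversion", "blocked_by_model": False},
    {"outcome": "conversion", "blocked_by_model": False},
    {"outcome": "chargeback", "blocked_by_model": True},   # true positive
    {"outcome": "conversion", "blocked_by_model": True},   # false positive
]


def total_payoff(charges, b, c, apply_model):
    """Sum payoffs; a blocked charge contributes 0 when the model is applied."""
    total = 0.0
    for charge in charges:
        if apply_model and charge["blocked_by_model"]:
            continue
        total += b if charge["outcome"] == "conversion" else -c
    return total


historical = total_payoff(charges, B, C, apply_model=False)  # 3b - c
with_model = total_payoff(charges, B, C, apply_model=True)   # 2b
```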

Model evaluation possible because of charge log without interventions
Intervention better than no intervention if c - b > 0 (here, 2b > 3b - c)
In general: (odds of fraud) x (c/b) x (recall/FPR) > 1
What happens with the next model-building iteration?
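The general condition follows from comparing what blocking saves (c per true positive) with what it loses (b per false positive). A minimal check, with a hypothetical function name and made-up rates used only for illustration:

```python
def intervention_is_better(fraud_rate, b, c, recall, fpr):
    """True if (odds of fraud) x (c/b) x (recall/fpr) > 1.

    Assumes 0 < fraud_rate < 1, b > 0 and fpr > 0.
    """
    odds_of_fraud = fraud_rate / (1.0 - fraud_rate)
    return odds_of_fraud * (c / b) * (recall / fpr) > 1.0


# Illustrative numbers: 1% fraud, b = $3, c = $22
precise_model = intervention_is_better(0.01, 3.0, 22.0, recall=0.5, fpr=0.01)
sloppy_model = intervention_is_better(0.01, 3.0, 22.0, recall=0.5, fpr=0.5)
```

With the same recall, the model that blocks half of all good charges fails the test, while the one with a 1% false positive rate passes.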

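| # | Outcome | Payoff |
|---|---|---|
| 1 | Conversion | b |
| 2 | Conversion | b |
| 3 | Blocked | 0 |
| 4 | Blocked | 0 |

If a new model calls a blocked charge “good”, would it have been a conversion or a chargeback?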

Where does the new training data come from?
An A/B test would be complex/time-consuming

One answer: introduce randomness in policy

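| # | Score | Original action | P(Block) | Randomized action | Outcome | Payoff |
|---|---|---|---|---|---|---|
| 1 | 5 | Allow | 0.05 | Allow | Conversion | b |
| 2 | 20 | Allow | 0.10 | Allow | Conversion | b |
| 3 | 10 | Allow | 0.07 | Block | N/A | 0 |
| 4 | 50 | Block | 0.50 | Allow | Chargeback | -c |
| 5 | 65 | Block | 0.90 | Allow | Conversion | b |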

Log of scores/probabilities/actions
Evaluate performance of model on events where original action == randomized action
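A randomized policy that produces such a log might be sketched as follows. The propensity function here (a clipped linear map from a 0-100 score) and the threshold of 50 are assumptions for illustration only; the talk does not specify the function actually used.

```python
import random


def p_block(score, floor=0.05, ceiling=0.95):
    """Hypothetical propensity: score/100, clipped so both actions stay possible."""
    return min(ceiling, max(floor, score / 100.0))


def randomized_action(score, threshold=50, rng=random):
    """Pick an action at random and return the record that should be logged."""
    original = "block" if score >= threshold else "allow"
    prob_block = p_block(score)
    action = "block" if rng.random() < prob_block else "allow"
    return {
        "score": score,
        "original_action": original,
        "p_block": prob_block,
        "p_allow": 1.0 - prob_block,
        "randomized_action": action,
    }


rng = random.Random(0)  # seeded only to make the example repeatable
log = [randomized_action(score, rng=rng) for score in (5, 20, 10, 50, 65)]
```

Clipping the propensity away from 0 and 1 keeps every action observable, which the inverse-probability weighting used later depends on.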

| # | Score | Original action | P(Allow) | P(Block) | Randomized action | Outcome | Payoff |
|---|---|---|---|---|---|---|---|
| 1 | 5 | Allow | 0.95 | 0.05 | Allow | Conversion | b |
| 2 | 20 | Allow | 0.90 | 0.10 | Allow | Conversion | b |

...but weight by inverse of expected probability
Average payoff:

$$\frac{(1/0.95)\,b + (1/0.9)\,b}{(1/0.95) + (1/0.9)} = b$$

Intuition: If the action has a probability p and we see it in the log, there were ~1/p total such events

Similarly for the candidate model...

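| # | Score | Old model | P(Allow) | P(Block) | Randomized action | Outcome | Payoff | New model |
|---|---|---|---|---|---|---|---|---|
| 2 | 20 | Allow | 0.90 | 0.10 | Allow | Conversion | b | Allow |
| 4 | 50 | Block | 0.50 | 0.50 | Allow | Chargeback | -c | Allow |
| 5 | 65 | Block | 0.10 | 0.90 | Allow | Conversion | b | Allow |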

$$\frac{(1/0.9)\,b + (1/0.5)(-c) + (1/0.1)\,b}{(1/0.9) + (1/0.5) + (1/0.1)} \approx 0.85b - 0.15c$$
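The weighted average for a candidate model can be computed from the log as sketched below. This form is usually called a self-normalized inverse propensity estimator; the function and field names are hypothetical, and b = 3, c = 22 are plugged in from the margin example to get a number.

```python
B, C = 3.0, 22.0

# Rows 2, 4 and 5 of the log; "propensity" is the logged probability
# of the randomized action that was actually taken (Allow in all three).
log = [
    {"id": 2, "action": "allow", "propensity": 0.90, "payoff": B,  "old": "allow", "new": "allow"},
    {"id": 4, "action": "allow", "propensity": 0.50, "payoff": -C, "old": "block", "new": "allow"},
    {"id": 5, "action": "allow", "propensity": 0.10, "payoff": B,  "old": "block", "new": "allow"},
]


def weighted_average_payoff(log, policy):
    """Average payoff over events where the policy agrees with the logged action,
    each weighted by 1 / (probability of that action)."""
    numerator = 0.0
    denominator = 0.0
    for event in log:
        if event[policy] != event["action"]:
            continue  # the log says nothing about this policy's choice here
        weight = 1.0 / event["propensity"]
        numerator += weight * event["payoff"]
        denominator += weight
    return numerator / denominator if denominator else float("nan")


new_estimate = weighted_average_payoff(log, "new")  # about 0.85b - 0.15c
old_estimate = weighted_average_payoff(log, "old")  # only row 2 matches: b
```

Row 5 carries a weight of 10 because the logged Allow had only a 10% chance, so a few low-probability events can dominate the estimate. That is one reason more data is needed as the candidate diverges from the incumbent.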

- Compute the expected payoff offline (arbitrarily many “experiments”)
- Need more data as incumbent model/policy and candidate diverge
- Propensity function controls the “exploitation”-“exploration” tradeoff
- Keep merchant experience good (adds bias)

Technical issues:
- Propensity score function not generally a sigmoid
- Multiple actions
- Events must be IID

- Fraudsters generate randomness in non-random ways (LLR good at low counts)
- We can improve our random forest performance by biasing the training (get lift where you need it)
- Randomizing actions in production makes counterfactual evaluation easier (and faster)

Thanks
[email protected] @mlmanapat
Machine learning at Stripe:
- Avi Bryant @avibryant
- Chris Wu @chriswu_
- Dan Frank @danielhfrank
- Danielle Sucher @daniellesucher
- Julia Evans @jvns
- Kelley Rivoire @kelleyrivoire
- Ryan Wang @ryw90
