Anomaly Detection Using Isolation Forests


Minority Report: Using Anomaly Detection to Identify a Minority Class

David Gerster
Vice President, Data Science, BigML

Traditional “Predictive Modeling”

• The famous Iris data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?

[Photos: Iris setosa, Iris versicolor, Iris virginica]

[Scatter plot: Petal Width (cm) vs. Petal Length (cm). Iris setosa: red dots; Iris versicolor: green dots; Iris virginica: blue dots]

Congratulations! You just trained a model.

[Scatter plot: Petal Width (cm) vs. Petal Length (cm), with four new flowers]

Prediction: Iris setosa
Prediction: Iris versicolor
Prediction: Iris virginica
Prediction: Iris virginica

Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.

“Decision Tree”

50 red, 50 blue, 50 green
  Width <= 0.8? -> 50 red (leaf)
  Width > 0.8? -> 50 blue, 50 green
    Width > 1.75? -> 45 blue (leaf)
    Width <= 1.75? -> 5 blue, 50 green
      Length <= 5? -> 1 blue, 48 green (leaf)
      Length > 5? -> 4 blue, 2 green (leaf)

“Leaf Nodes”: 50 red; 45 blue; 1 blue, 48 green; 4 blue, 2 green
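
The splits in the tree above can be written out as a small function. This is a minimal sketch, assuming “Width” and “Length” mean petal width and petal length in cm, that red, green, and blue stand for setosa, versicolor, and virginica as in the plot legend, and that each leaf predicts its majority color. The function name and the four sample measurements are hypothetical, not taken from the slides.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Follow the three splits of the slide's decision tree to a leaf."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green (majority green)
    return "Iris virginica"         # leaf: 4 blue, 2 green (majority blue)


# Hypothetical (petal length, petal width) pairs, chosen only for illustration.
new_flowers = [(1.4, 0.2), (4.5, 1.5), (6.0, 2.2), (5.2, 1.6)]
predictions = [predict_species(length, width) for length, width in new_flowers]

for (length, width), species in zip(new_flowers, predictions):
    print(f"length={length} cm, width={width} cm -> Prediction: {species}")
```

Scoring a new flower is just a walk from the root to a leaf, which is why the trained tree is cheap to apply.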

Demo: Predictive Modeling

• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• Since we have labels, this is supervised learning

What if we don’t have labels?

• Can we still get insight into our data if we don’t know the colors of the dots?
• Enter anomaly detection
• Since we don’t have labels, this is unsupervised learning

10 lines are needed to isolate this data point (not anomalous)

Only 4 lines are needed to isolate this data point (highly anomalous)
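
The line-counting idea above can be sketched in a few lines of Python. This is a simplified illustration rather than a full isolation forest: it draws random axis-parallel cuts, follows only the side that contains the point of interest, and averages the number of cuts over many trials. The synthetic data, the function names, and the `max_cuts` cap are all assumptions made for this sketch.

```python
import random


def cuts_to_isolate(target, points, rng, max_cuts=50):
    """Count random axis-parallel cuts until `target` is alone in its region."""
    region, cuts = list(points), 0
    while len(region) > 1 and cuts < max_cuts:
        axis = rng.randrange(len(target))
        values = [p[axis] for p in region]
        cut = rng.uniform(min(values), max(values))
        # Keep only the points on the same side of the cut as the target.
        on_low_side = target[axis] < cut
        region = [p for p in region if (p[axis] < cut) == on_low_side]
        cuts += 1
    return cuts


def average_cuts(target, points, rng, trials=200):
    """Average the cut count over many random trials."""
    return sum(cuts_to_isolate(target, points, rng) for _ in range(trials)) / trials


rng = random.Random(0)
# Synthetic stand-in data: one dense cluster plus a single far-away point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
points = cluster + [outlier]
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # nearest the center

avg_outlier = average_cuts(outlier, points, rng)
avg_inlier = average_cuts(inlier, points, rng)
print(f"average cuts, far-away point: {avg_outlier:.1f}")
print(f"average cuts, central point:  {avg_inlier:.1f}")
```

A point that sits far from the rest is usually separated after only a few cuts, while a point in the middle of the cluster needs many more, which is the signal the anomaly score is built on.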

Demo: Anomaly Detection

• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
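
The four steps above can be sketched end to end on toy data. This is a rough illustration under several assumptions, not the BigML workflow from the demo: the rows are synthetic rather than the biopsy dataset, the “detector” is the simple cut-counting score, the `k` most anomalous rows are flagged to turn scores into labels, and the predictive model is a one-split decision stump. All names here are hypothetical.

```python
import random

rng = random.Random(7)


def cuts_to_isolate(target, points, max_cuts=50):
    """Count random axis-parallel cuts until `target` is alone in its region."""
    region, cuts = list(points), 0
    while len(region) > 1 and cuts < max_cuts:
        axis = rng.randrange(len(target))
        values = [p[axis] for p in region]
        cut = rng.uniform(min(values), max(values))
        on_low_side = target[axis] < cut
        region = [p for p in region if (p[axis] < cut) == on_low_side]
        cuts += 1
    return cuts


def average_cuts(target, points, trials=30):
    return sum(cuts_to_isolate(target, points) for _ in range(trials)) / trials


def fit_stump(rows, labels):
    """Find the single threshold rule that disagrees least with `labels`."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            for sign in (1, -1):
                errors = sum(
                    (sign * (r[feature] - threshold) > 0) != label
                    for r, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, sign)
    return best


# Step 1: unlabeled rows (synthetic stand-in data, NOT the biopsy dataset).
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
minority = [(6.0, -3.0), (7.0, 0.0), (8.0, 3.0), (9.0, -1.0), (7.5, 2.0)]
rows = majority + minority

# Step 2: score every row; fewer cuts on average means more anomalous.
avg_cuts = [average_cuts(row, rows) for row in rows]

# Step 3: turn scores into "labels"; flagging the k lowest averages is an assumption.
k = 5
flagged = set(sorted(range(len(rows)), key=avg_cuts.__getitem__)[:k])
labels = [i in flagged for i in range(len(rows))]

# Step 4: train a predictive model on the score-derived labels.
errors, feature, threshold, sign = fit_stump(rows, labels)
side = "above" if sign == 1 else "below"
print(f"Rule: flag rows with feature {feature} {side} {threshold:.2f}")
print(f"Disagreements with the anomaly labels: {errors} of {len(rows)}")
```

The resulting rule is a readable summary of what the detector found unusual, which is the point of training a model on the scores.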

Who Needs Labels?

What if we remove the malignant biopsies?

• If we remove the malignant biopsies from the dataset and do the whole process again …
• We find a similar result!

Minority Report

• This approach is well-suited for large unlabeled datasets, especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!

Free BigML subscription

• Use code “CERN” for a free 3-mo. BigML Pro subscription
• Handles datasets up to 4GB

The original “Isolation Forest” paper

Q and A

David Gerster
VP Data Science, BigML

gerster@bigml.com
