Anomaly Detection Using Isolation Forests

Minority Report: Using Anomaly Detection to Identify a Minority Class

David Gerster, Vice President, Data Science, BigML

Traditional “Predictive Modeling”

• The famous Iris data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?

[Figure: scatter plot of the Iris flowers, with Petal Length (cm) as an axis. Iris setosa: red dots; Iris versicolor: green dots; Iris virginica: blue dots.]

Congratulations! You just trained a model.

[Figure: the Iris plot, Petal Length (cm) axis, with four new flowers marked. Predictions: Iris setosa, Iris versicolor, Iris virginica, Iris virginica.]

Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.

“Decision Tree”

50 red, 50 blue, 50 green
    Width <= 0.8? 50 red (leaf node)
    Width > 0.8? 50 blue, 50 green
        Width > 1.75? 45 blue (leaf node)
        Width <= 1.75? 5 blue, 50 green
            Length <= 5? 1 blue, 48 green (leaf node)
            Length > 5? 4 blue, 2 green (leaf node)
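The splits in the decision tree above can be written out as plain code. This is a minimal sketch, not the model from the talk. It assumes that "Width" and "Length" mean petal width and petal length in cm, that red, green and blue are setosa, versicolor and virginica as in the plot legend, and that each leaf predicts its majority species. The function name and the four example flowers are hypothetical.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Follow the three splits from the slide; return the leaf's majority species."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"       # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"    # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"   # leaf: 1 blue, 48 green
    return "Iris virginica"        # leaf: 4 blue, 2 green


# Four hypothetical flowers as (petal length, petal width), invented for illustration.
new_flowers = [(1.4, 0.2), (4.5, 1.4), (6.0, 2.2), (5.4, 1.6)]
predictions = [predict_species(length, width) for length, width in new_flowers]
for flower, species in zip(new_flowers, predictions):
    print(flower, "->", species)
```

Scoring a new flower is just a walk from the root to a leaf. The mixed leaves (for example 4 blue, 2 green) are where the model can be wrong.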

Demo: Predictive Modeling

• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• Since we have labels, this is supervised learning

What if we don’t have labels?

• Can we still get insight into our data if we don’t know the colors of the dots?
• Enter anomaly detection
• Since we don’t have labels, this is unsupervised learning

10 lines are needed to isolate this data point (not anomalous)

Only 4 lines are needed to isolate this data point (highly anomalous)
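The idea of counting lines can be tried in a few lines of code. This is a rough sketch of the isolation idea under stated assumptions, not BigML's implementation and not the full Isolation Forest algorithm: it draws random axis-aligned cuts on the whole dataset, whereas the published method builds trees on subsamples and normalizes the path length into a score. The data, function names and parameters are invented for illustration.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random axis-aligned cuts needed until `point` is alone."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    spans = [(min(r[d] for r in data), max(r[d] for r in data))
             for d in range(len(point))]
    # Only features that still vary across the remaining rows can be cut.
    dims = [d for d, (lo, hi) in enumerate(spans) if lo < hi]
    if not dims:
        return depth
    d = rng.choice(dims)
    cut = rng.uniform(*spans[d])
    # Keep only the rows on the same side of the cut as the point.
    side = [r for r in data if (r[d] < cut) == (point[d] < cut)]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trials=200, seed=0):
    """Average the cut count over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trials)) / n_trials


# Synthetic data: one tight cluster plus a single far-away point.
rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
typical = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # row nearest the center

outlier_depth = mean_depth(outlier, data)
typical_depth = mean_depth(typical, data)
print(f"outlier: {outlier_depth:.1f} cuts, typical point: {typical_depth:.1f} cuts")
```

The far-away point should need noticeably fewer cuts on average than the point in the middle of the cluster, which is the signal the anomaly score is built on.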

Demo: Anomaly Detection

• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
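The four steps above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the data is invented rather than the biopsy dataset, the detector is a simplified isolation-depth score rather than BigML's anomaly detector, the cutoff of 8 flagged rows is arbitrary, and the "predictive model" is a single-rule decision stump rather than a full decision tree.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=30):
    """Number of random cuts needed before `point` is alone in its region."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    spans = [(min(r[d] for r in data), max(r[d] for r in data))
             for d in range(len(point))]
    dims = [d for d, (lo, hi) in enumerate(spans) if lo < hi]
    if not dims:
        return depth
    d = rng.choice(dims)
    cut = rng.uniform(*spans[d])
    # Keep only the rows on the same side of the cut as the point.
    side = [r for r in data if (r[d] < cut) == (point[d] < cut)]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def anomaly_scores(data, n_trials=50, seed=0):
    """Negated mean isolation depth: higher means easier to isolate."""
    rng = random.Random(seed)
    return [-sum(isolation_depth(p, data, rng) for _ in range(n_trials)) / n_trials
            for p in data]


def best_stump(data, labels):
    """Best single rule `row[feature] > threshold` for reproducing the labels."""
    best = None
    for d in range(len(data[0])):
        for threshold in sorted({r[d] for r in data}):
            errors = sum((r[d] > threshold) != y for r, y in zip(data, labels))
            if best is None or errors < best[0]:
                best = (errors, d, threshold)
    return best


# Step 1: unlabeled data (synthetic; 80 ordinary rows plus 8 unusual ones).
rng = random.Random(7)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(80)]
unusual = [(rng.uniform(6, 12), rng.uniform(-3, 3)) for _ in range(8)]
data = normal + unusual

# Step 2: score every row with the anomaly detector.
scores = anomaly_scores(data)

# Step 3: turn the scores into "labels" by flagging the top-scoring rows.
n_flag = 8  # assumed cutoff; the slides do not state one
ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
flagged = set(ranked[:n_flag])
labels = [i in flagged for i in range(len(data))]

# Step 4: train a predictive model on those "labels".
errors, feature, threshold = best_stump(data, labels)
print(f"rule: feature {feature} > {threshold:.2f} "
      f"({errors} of {len(data)} rows misclassified)")
```

The point of the last step is interpretability: the anomaly scores say which rows are unusual, and the model trained on them says which fields make them unusual.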

Who Needs Labels?

What if we remove the malignant biopsies?

• If we remove the malignant biopsies from the dataset and do the whole process again …
• We find a similar result!

Minority Report

• This approach is well-suited for large unlabeled datasets, especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …

• Doesn’t require you to know what you’re looking for!

Free BigML subscription

• Use code “CERN” for a free 3-month BigML Pro subscription
• Handles datasets up to 4 GB

The original “Isolation Forest” paper

Q and A

David Gerster, VP Data Science, BigML

gerster@bigml.com
