Outlier Detection Dalya Baron (Tel Aviv University) XXX Winter School, November 2018


Page 2: Dalya Baron (Tel Aviv University) XXX Winter School

What is an outlier?

• “Bad” object: artifacts, cosmic rays, bad reduction.

• Misclassified object: star classified as QSO, variable star classified as SN.

• Tail of a distribution: most luminous SN, fastest accreting BH.

• Unknown unknowns: completely new objects we did not know we should be looking for.

• In astronomy: processes which happen on shorter time scales.

Page 3: Dalya Baron (Tel Aviv University) XXX Winter School

How do we find outliers?

1. Stay as close as possible to the original dataset. Measured features will usually not help (see Baron & Poznanski 2017).

[Figure: axes are wavelength (Å) and number of objects vs. log [SII]/Halpha]


Page 5: Dalya Baron (Tel Aviv University) XXX Winter School

How do we find outliers?

2. Supervised learning-based outlier detection will uncover the outliers that “shout the strongest”.

[Figure: one-class SVM]

Additional supervised algorithms that can be used: RF, ANN, etc.
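A minimal sketch of the one-class SVM named on this slide, assuming scikit-learn's `OneClassSVM` and a toy two-feature training sample taken to be mostly normal objects. The data, the `nu` value, and the planted points are illustrative assumptions, not taken from the talk.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy training sample: objects assumed to be "normal" (hypothetical data).
X_train = rng.normal(0.0, 1.0, size=(500, 2))

# New objects: five drawn like the training set, plus two planted far away.
X_new = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 2)),
    [[6.0, 6.0], [-7.0, 5.0]],
])

# nu is an upper bound on the fraction of training objects treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

labels = model.predict(X_new)            # +1 = inlier, -1 = outlier
scores = model.decision_function(X_new)  # lower = further outside the learned region
```

The two planted objects lie far outside the region learned from the training sample, which is the sense in which this kind of detector finds the outliers that “shout the strongest”.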


Page 8: Dalya Baron (Tel Aviv University) XXX Winter School

How do we find outliers?

2. Supervised learning-based outlier detection will uncover the outliers that “shout the strongest” (see Baron & Poznanski 2017).

[Figure: SDSS J123308.44+262057.6 and SDSS model; flux [10⁻¹⁷ erg cm⁻² s⁻¹ Å⁻¹] vs. rest wavelength [Å], 4000 to 9000 Å, with a zoom on NaID]

Page 9: Dalya Baron (Tel Aviv University) XXX Winter School

Isolation Forests

A fast algorithm that does not require distance measurements.

Outliers are objects that are separated from the rest of the dataset higher in the tree, i.e., closer to the root, after fewer random splits.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
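A short sketch using the scikit-learn class linked above, on hypothetical data: a Gaussian blob with one planted object far from it. The dataset and parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical dataset: 1000 ordinary objects plus one planted far away.
X = rng.normal(0.0, 1.0, size=(1000, 2))
X = np.vstack([X, [[8.0, 8.0]]])

forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples: the lower the score, the fewer splits were needed to isolate
# the object, and the more anomalous it is.
scores = forest.score_samples(X)
planted_label = forest.predict([[8.0, 8.0]])[0]  # -1 = outlier, +1 = inlier

# Indices of the ten most anomalous objects, most anomalous first.
top10 = np.argsort(scores)[:10]
```

No pairwise distances are computed at any point; the score depends only on how many random axis-aligned splits it takes to isolate each object.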

Page 10: Dalya Baron (Tel Aviv University) XXX Winter School

Isolation Forests

Isolation Forests will also find the outliers that shout the most.

[Figure: feature 1 vs. feature 2]

Page 11: Dalya Baron (Tel Aviv University) XXX Winter School

Outlier detection on spectra using unsupervised RF

See: Baron & Poznanski 2017, Reis+18

Use unsupervised Random Forest to define distances between the objects in the sample. Define outliers as objects with a large average distance from all the rest, or from their ~250 nearest neighbors. Use tSNE to explore outlier clusters.
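The idea can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes the common construction in which a synthetic sample is built by shuffling each feature independently, a forest is trained to separate real from synthetic objects, and the similarity of two real objects is the fraction of trees in which they share a leaf. The function name `unsupervised_rf_distances` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def unsupervised_rf_distances(X, n_trees=200, seed=0):
    """Pairwise distances from a forest trained on real vs. synthetic data."""
    rng = np.random.default_rng(seed)
    # Synthetic sample: shuffle every feature column independently. This keeps
    # the marginal distributions but destroys correlations between features.
    X_syn = np.column_stack([rng.permutation(col) for col in X.T])
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.ones(len(X), dtype=int),
                            np.zeros(len(X_syn), dtype=int)])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)
    # Leaf index of every real object in every tree: shape (n_objects, n_trees).
    leaves = forest.apply(X)
    # Similarity = fraction of trees in which two objects land in the same leaf.
    similarity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return 1.0 - similarity


# Toy data (assumption): two tightly correlated features, plus one planted
# object that has ordinary feature values but breaks the correlation.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([x, x + 0.05 * rng.normal(size=300)])
X = np.vstack([X, [[-1.0, 1.0]]])

distances = unsupervised_rf_distances(X)
weirdness = distances.mean(axis=1)      # average distance to all objects
ranking = np.argsort(weirdness)[::-1]   # weirdest objects first
```

Restricting the average to each object's nearest neighbors, as the slide suggests, would mean sorting each row of `distances` and averaging only its smallest entries. The full pairwise comparison above needs memory proportional to the number of objects squared times the number of trees, so large samples would have to be processed in chunks.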

Page 13: Dalya Baron (Tel Aviv University) XXX Winter School

Outlier detection on other datasets using unsupervised RF?

The unsupervised Random Forest assumes a regular grid, and thus will work for spectra or extracted features. It will not work for images or time series, because the algorithm is not invariant to translations and rotations: the same signal shifted or rotated lands on different features.

[Figure: flux vs. time; galaxy 1, galaxy 2]

Possible solution: find a representation of the signal on a regular grid (e.g., FFT of time series).
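A small numerical illustration of why the FFT helps, assuming an evenly sampled toy light curve and a circular time shift (real light curves are often unevenly sampled, which this sketch ignores). The same flare at two different times looks completely different point by point, but its power spectrum is unchanged.

```python
import numpy as np

# Toy light curve (assumption): a single Gaussian flare on an even time grid.
t = np.arange(256)
signal = np.exp(-0.5 * ((t - 60) / 5.0) ** 2)   # flare peaking at t = 60
shifted = np.roll(signal, 100)                  # same flare, peaking at t = 160

# Power spectrum: squared magnitude of the Fourier coefficients.
power = np.abs(np.fft.rfft(signal)) ** 2
power_shifted = np.abs(np.fft.rfft(shifted)) ** 2

# Largest difference in the time domain vs. in the power spectrum.
pointwise_gap = np.max(np.abs(signal - shifted))
spectral_gap = np.max(np.abs(power - power_shifted))
```

A time shift only changes the phases of the Fourier coefficients, so `spectral_gap` is zero up to floating-point error while `pointwise_gap` is close to the full flare amplitude. Feeding `power` rather than `signal` to a feature-based method therefore removes the dependence on when the event happened.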


Page 16: Dalya Baron (Tel Aviv University) XXX Winter School

Outlier detection using Autoencoders

Autoencoders can represent images (CNN) and time series (RNN) well.

Use the latent space to look for, and examine, outliers.

[Figure: latent (compressed) space, with one object marked “outlier!” and another marked “large reconstruction error, also outlier?”]
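The two kinds of outlier on this slide can be illustrated with a deliberately simple stand-in: a linear autoencoder fitted in closed form with an SVD (equivalent to PCA), not the CNN or RNN autoencoders the slide refers to. All data and planted objects below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 20, 2   # objects, input dimension, latent dimension

# Toy data (assumption): objects lying close to a 2-D plane inside 20-D space.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal columns, d x k
X = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))

# Planted object A: on the plane, but far from everything else.
in_plane = 10.0 * basis[:, 0]
# Planted object B: off the plane, so the model cannot reconstruct it.
direction = rng.normal(size=d)
direction -= basis @ (basis.T @ direction)         # remove the in-plane part
off_plane = 3.0 * direction / np.linalg.norm(direction)
X = np.vstack([X, in_plane, off_plane])

# Linear autoencoder via SVD: encoder weights W, decoder weights W.T.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:k].T

Z = (X - mean) @ W        # latent (compressed) representation
X_rec = Z @ W.T + mean    # reconstruction

latent_dist = np.linalg.norm(Z - np.median(Z, axis=0), axis=1)
recon_error = np.linalg.norm(X - X_rec, axis=1)

latent_outlier = int(np.argmax(latent_dist))   # far away in latent space
recon_outlier = int(np.argmax(recon_error))    # poorly reconstructed
```

Object A is reconstructed well and stands out only in the latent space, while object B sits near the center of the latent space and stands out only through its reconstruction error. Checking both quantities catches both cases, which is one way to read the question mark on the slide.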

Page 17: Dalya Baron (Tel Aviv University) XXX Winter School

Questions?

Page 18: Dalya Baron (Tel Aviv University) XXX Winter School

Tutorials

Page 19: Dalya Baron (Tel Aviv University) XXX Winter School

Summary

• Machine learning algorithms are just a tool. Not every problem is appropriate for machine learning.

• Unsupervised (and most supervised) algorithms are not black boxes!!

• Don’t trust me.

• Make friends with computer scientists and engineers.

• Don’t be afraid to change off-the-shelf tools.

• Open questions:

  • Uncertainties in supervised and unsupervised algorithms.

  • Unsupervised: we need better cost functions.

Page 20: Dalya Baron (Tel Aviv University) XXX Winter School

Tell me more about your science!

Page 21: Dalya Baron (Tel Aviv University) XXX Winter School

Thanks! :)

Page 22: Dalya Baron (Tel Aviv University) XXX Winter School

Tutorials

• Clustering algorithms, dimensionality reduction, supervised learning and outlier detection: https://github.com/dalya/XXX_winter_school

• Unsupervised Random Forest and outlier detection: https://github.com/dalya/WeirdestGalaxies

• Outliers and tSNE in APOGEE: https://github.com/ireis/APOGEE_tSNE_nb

• Probabilistic Random Forest: https://github.com/ireis/PRF