Outlier analysis for Temporal Datasets

Location:

QuantUniversity Meetup

July 11th 2016

Boston MA


2016 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.QuantUniversity.com

sri@quantuniversity.com

Slides and Code available at: http://www.analyticscertificate.com/Anomaly/

• 6.30-7.15 – Anomaly Detection part II

• 7.15-8.00 - Azure ML Example

Agenda

- Analytics Advisory services

- Custom training programs

- Architecture assessments, advice and audits

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and Consultancy for Financial Analytics

• Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers (Shell, Firstfuel Software etc.)

• Regular Columnist for the Wilmott Magazine

• Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Sri Krishnamurthy, Founder and CEO

Quantitative Analytics and Big Data Analytics Onboarding

• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching the Analytics Certificate Program later in Fall

(MATLAB version also available)

• July

▫ 11th : QuantUniversity’s 2nd meetup (Topic : Quantitative methods, TBD)

• August

▫ 1st and 2nd : 2-day workshop on Anomaly Detection (registration and pricing details at www.analyticscertificate.com/Anomaly)

▫ 8th : QuantUniversity meetup

▫ 14-20th : ARPM in New York (www.arpm.co); QuantUniversity presenting on Model Risk on August 14th

▫ 18-21st : Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html

▫ Use promotional code SPEAKERREF to receive $200 discount on or before July 22nd

Events of Interest

• July

▫ Anomaly Detection Part II

• August

▫ Anomaly Detection Workshop

▫ Model Evaluation : Metrics, Scaling and Best Practices

• September

▫ What’s missing? Best practices in missing data analysis

QuantUniversity’s Summer workshop series

What is anomaly detection?

• Anomalies or outliers are data points that appear to deviate markedly from expected outputs.

• Anomaly detection is the process of finding patterns in data that don’t conform to a prior expected behavior.

• Fraud Detection

• Stock market

• E-commerce

Examples

Part 1: Summary

We have covered Anomaly detection

Introduction

• Definition of anomaly detection and its importance in energy systems

• Different types of anomaly detection methods: statistical, graphical and machine learning methods

Graphical approach

• Graphical methods consist of boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate outliers graphically

• The main assumption for applying graphical approaches is multivariate normality

• The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a multivariate distribution

Statistical approach

• Statistical hypothesis testing includes the chi-square test and Grubbs’ test

• Statistical methods may use either scores or p-values as thresholds to detect outliers

Machine learning approach

• Both supervised and unsupervised learning methods can be used for outlier detection

• Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment

• In the K-means clustering method, outliers are defined as points that don’t belong to any cluster, are far away from the cluster centroids, or form sparse clusters

Anomaly Detection Part II : Dealing with Temporal Data

• In time series datasets, the assumption of temporal continuity plays an important role in defining and detecting outliers.

• When analyzing a single time series, a lack of temporal continuity with immediate neighbors signals outliers. For example:

▫ A significant increase/decrease in value when compared with immediate neighboring values. Example: Stock charts

• When analyzing multidimensional time series streams, temporal continuity is much weaker. For example:

▫ Novel outliers that differ from aggregate trends. Example: Novel client traffic from a new location in Google Analytics

Point anomalies

• Points that are outside of the “normal” points

Contextual anomalies

• Time is a contextual attribute that determines the position of an instance on the entire sequence.

• A 145-point drop is not rare, but it is an anomaly if the drop happens in a period of 3 minutes

Ref: http://www.bloomberg.com/news/articles/2013-04-23/fake-report-erasing-136-billion-shows-market-s-fragility?cmpid=yhoo
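The contextual example above can be sketched in a few lines: the size of a drop only becomes anomalous once the time over which it happens is taken into account. This is a minimal illustration, not the presenter's code; the function name `max_drop_within`, the minute timestamps and the index values are all hypothetical, and the timestamps are assumed to be sorted in ascending order.

```python
def max_drop_within(times, values, window):
    """Largest fall from an earlier point to a later point that lies
    at most `window` time units after it (times must be ascending)."""
    worst = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if times[j] - times[i] > window:
                break  # later points are even further away in time
            worst = max(worst, values[i] - values[j])
    return worst

# Hypothetical minute-by-minute index levels (illustrative numbers only).
times = [0, 1, 2, 3, 4, 5, 6]
values = [14700, 14698, 14650, 14560, 14555, 14690, 14700]

drop = max_drop_within(times, values, window=3)
threshold = 100  # assumed cut-off for "too fast" within 3 minutes
print("max 3-minute drop:", drop, "anomalous:", drop >= threshold)
```

The same drop spread over a full trading day would not be flagged, because the window restricts the comparison to nearby points in time.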

Nuances in Time series analysis

• Time Series Analysis

▫ Numbers across time

▫ Example: Stock data

• Discrete sequences

▫ Labels across time

▫ Example: Log of client interactions

▫ http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp, smtp-mail, http-web

Collective anomalies

• Here, a collection of related data instances is anomalous with respect to the entire data set

Ref: http://krebsonsecurity.com/2010/10/pill-gang-used-microsofts-network-to-attack-krebsonsecurity-com/

Challenges

• Defining what is normal and what isn’t

• The notion of normal behavior keeps evolving

• The magnitude of the anomaly may be different

• Labels may not be available

• Noise may manifest as anomalies and it may be difficult to identify and remove.

Methods for Anomaly Detection

Univariate data

• Point outlier scenario:

• Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians, LOESS regression)

• Data mining methods (Multi layer perceptron)

• Outlier subsequences scenario:

• Windows based method

• Distance based method (PAA, SAX and HOT SAX)

Multivariate data

• Statistical methods:

• Cook’s distance

• Bonferroni’s test

• Distance based methods:

• Local Outlier Factors (LOF)

• Data mining methods:

• Clustering algorithms (Hierarchical and K-Means)

Methods for Anomaly Detection

Database time series univariate and multivariate

• Density approach for principal components

• Graphical methods:

• Bivariate and functional bag plots

• Bivariate and functional HDR box plots

• Clustering methods

• Euclidean, correlation, autocorrelation and Wavelet transform metrics

Censored survival data

• Statistical methods:

• Residual based algorithm

• Scoring algorithm

• Point Outliers

▫ Prediction models

▫ Profile Similarity-based approaches and Deviants

• Subsequence Outliers

▫ Discord discovery

Single Time Series – Sample approaches

• Input: A time series t

• Output: Outlier points in t

Prediction Models: Compute outlier scores as deviation from predicted value

• Median:

▫ Choose a window size k

▫ Compute the median over the window from t-k to t+k

• Mean:

▫ Choose a window size k

▫ Compute the mean over the window from t-k to t+k

Point outliers
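A minimal sketch of the window-based median predictor described above, using only the Python standard library. The function name `window_median_scores` and the sample series are hypothetical. One assumption is made that the slide does not state: the point being scored is left out of its own window, so that a large spike does not influence its own prediction.

```python
from statistics import median

def window_median_scores(series, k):
    """Score each point by its absolute deviation from the median of the
    k points on either side of it (the point itself is excluded)."""
    scores = []
    for t in range(len(series)):
        lo, hi = max(0, t - k), min(len(series), t + k + 1)
        neighbours = series[lo:t] + series[t + 1:hi]
        scores.append(abs(series[t] - median(neighbours)))
    return scores

# Hypothetical series with one obvious spike.
series = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8, 10.3, 10.1]
scores = window_median_scores(series, k=2)
top = max(range(len(series)), key=scores.__getitem__)
print("highest outlier score at index", top)
```

Replacing `median` with `statistics.mean` gives the mean variant; the median is less sensitive to other outliers that fall inside the same window.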

• ARIMA framework

Point outliers : Prediction Models

• Neural Networks▫ MLP predictor

Point outliers : Prediction Models

Figure: original data, fitted data and boundaries. Any data points that are beyond the boundaries are considered outliers.

• Create a Normal profile (Example: MLP/AR etc. ) and notion of variance

• Estimate the next point

• Compare realized value with the estimated point.

▫ If within band, normal

▫ Else, outlier

Point outliers : Profile Similarity-Based Approach
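The profile, estimate, compare loop above can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the talk: a trailing moving average stands in for the MLP/AR profile, the band is a multiple of the trailing standard deviation, and the names `flag_outliers`, `window` and `n_sigma` are hypothetical.

```python
from statistics import mean, stdev

def flag_outliers(series, window=5, n_sigma=3.0):
    """Walk forward through the series, predict each point from the trailing
    window, and flag it when it falls outside the prediction band."""
    flags = [False] * len(series)
    history = list(series[:window])           # initial "normal" profile
    for t in range(window, len(series)):
        predicted = mean(history)             # estimate the next point
        band = n_sigma * stdev(history)       # notion of variance
        if abs(series[t] - predicted) > band:
            flags[t] = True                   # outside the band: outlier
        else:
            history = history[1:] + [series[t]]  # normal: update the profile
    return flags

# Hypothetical series with a single jump at index 7.
series = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 14.0, 9.9, 10.2]
flags = flag_outliers(series)
print([i for i, is_outlier in enumerate(flags) if is_outlier])
```

Flagged points are deliberately kept out of the history so that one outlier does not widen the band and mask the next one. A side effect is that a genuine level shift keeps being flagged until the profile is rebuilt.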

• Find points in a given time series whose removal from the time series results in a more succinct representation of the data

Point outliers : Deviant Approach
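A toy sketch of the deviant idea, assuming the simplest possible representation: the whole series is summarized by its mean, and "succinctness" is measured by the sum of squared errors around that mean. Richer representations are possible; the helper names `sse` and `top_deviant` and the sample data are hypothetical.

```python
def sse(values):
    """Squared error of representing all values by their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def top_deviant(series):
    """Return (index, error reduction) for the point whose removal most
    reduces the representation error of the remaining series."""
    base = sse(series)
    gains = [base - sse(series[:i] + series[i + 1:]) for i in range(len(series))]
    best = max(range(len(series)), key=gains.__getitem__)
    return best, gains[best]

# Hypothetical series; the value 9.0 is the obvious candidate.
series = [5.0, 5.1, 4.9, 9.0, 5.0, 5.2]
index, gain = top_deviant(series)
print("deviant at index", index)
```

This brute-force version recomputes the error for every candidate, which is quadratic in the series length and only meant to make the definition concrete.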

• Input: A time series t

• Output: Outlier subsequences in t

• Problem: Given t and a subsequence length n, find the outlier subsequence D (the discord) that has the largest distance to its nearest non-overlapping match

• In particular, given two subsequences of length n denoted by A = (a1 . . . an) and B = (b1 . . . bn), the Euclidean distance between them can be computed as follows:

• $\mathrm{Dist}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

Subsequence outliers:
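The problem statement above translates directly into a brute-force search: compare every length-n subsequence with every non-overlapping one and keep the subsequence whose nearest match is farthest away. The sketch below is only meant to make the definition concrete; the names `euclidean` and `top_discord` and the synthetic series are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_discord(series, n):
    """Return (start, distance) of the length-n subsequence whose nearest
    non-overlapping match is farthest away."""
    starts = range(len(series) - n + 1)
    best_start, best_dist = None, -1.0
    for i in starts:
        nearest = math.inf
        for j in starts:
            if abs(i - j) >= n:  # skip overlapping (trivial) matches
                d = euclidean(series[i:i + n], series[j:j + n])
                nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_start, best_dist = i, nearest
    return best_start, best_dist

# Synthetic periodic series with two distorted points at positions 13 and 14.
series = [0, 1, 0, -1] * 6
series[13], series[14] = 3, 3

start, dist = top_discord(series, n=4)
print("top discord starts at index", start)
```

The double loop costs on the order of m² distance computations for a series of length m, which is why faster search orderings matter on long series.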

• The standard way of discretizing the time series: Symbolic Aggregate approXimation (SAX)

• The brute force solution is to consider all possible subsequences and compute the distance of each such subsequence to every other non-overlapping subsequence.

• Several optimizations

▫ HOT SAX (Keogh, E., Lin, J., Fu, A., HOT SAX: Efficiently finding the most unusual time series subsequence. Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM '05)

• Plotting the discords

Outlier subsequences (Distance based)
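For readers unfamiliar with the discretization step, here is a minimal sketch of SAX: z-normalize the series, average it over equal-width segments (PAA), then map each segment mean to a letter using breakpoints that split the standard normal distribution into equally probable regions. The function name `sax` is hypothetical, and two simplifying assumptions are made: the series length is a multiple of the number of segments, and the series is not constant.

```python
from bisect import bisect
from statistics import NormalDist, mean, pstdev

def sax(series, segments, alphabet="abc"):
    """Convert a numeric series to a SAX word with one letter per segment."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]            # z-normalization
    size = len(z) // segments
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(segments)]
    a = len(alphabet)
    # Breakpoints cut the standard normal into `a` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[bisect(breakpoints, v)] for v in paa)

# Hypothetical three-level series: low, middle, high.
word = sax([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], segments=3)
print(word)
```

Working on short symbolic words instead of raw values is what lets discord search heuristics order and prune candidate subsequences cheaply.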

The top discord, which has the largest distance, is at the 411th time series point.

Summary

We have covered Anomaly detection

Univariate data

• Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians and LOESS regression)

• Data mining methods (Multi layer perceptron)

• Outlier subsequences (Windows and distance based methods)

Multivariate data

• Cook’s distance

• Bonferroni’s test

• Local outlier factor (LOF)

• Hierarchical and K-means clustering outlier detection methods

Database time series

• Database time series definition

• Density approach for the first two principal component scores

• Bivariate and functional bag plots

• Bivariate and functional HDR box plots

• Clustering time series

Censored survival data

• Censored survival data definition

• Residual based algorithm

• Scoring algorithm

Register here: https://www.eventbrite.com/e/anomaly-detection-workshop-tickets-25910035614?ref=ebtnebtckt

Affiliate discount pricing for QuantUniversity Meetup members and Academics!

When: August 1st and 2nd

Where: 1 Roger St, Cambridge MA (IBM’s offices)

Time : 9-5.00pm

Slides, code and details about the Anomaly detection workshop at: http://www.analyticscertificate.com/Anomaly/

Thank you! Members & Sponsors!

Sri Krishnamurthy, CFA, CAP

Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Contact

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
