46
Copyright © 2016 Splunk Inc. Alex Cruise Sr. Dev. Manager/Architect, Splunk Fred Zhang Sr. Data Scientist, Splunk Machine Learning and Anomaly Detection in Splunk IT Service Intelligence

Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Copyright©2016Splunk Inc.

AlexCruiseSr.Dev.Manager/Architect,SplunkFredZhangSr.DataScientist,Splunk

MachineLearningandAnomalyDetectioninSplunk ITServiceIntelligence

Page 2: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Disclaimer

2

Duringthecourseofthispresentation,wemaymakeforwardlookingstatementsregardingfutureeventsortheexpectedperformanceofthecompany.Wecautionyouthatsuchstatementsreflectourcurrentexpectationsandestimatesbasedonfactorscurrentlyknowntousandthatactualeventsorresultscoulddiffermaterially.Forimportantfactorsthatmaycauseactualresultstodifferfromthose

containedinourforward-lookingstatements,pleasereviewourfilingswiththeSEC.Theforward-lookingstatementsmadeinthethispresentationarebeingmadeasofthetimeanddateofitslivepresentation.Ifreviewedafteritslivepresentation,thispresentationmaynotcontaincurrentoraccurateinformation.Wedonotassumeanyobligationtoupdateanyforwardlookingstatementswemaymake.Inaddition,anyinformationaboutourroadmapoutlinesourgeneralproductdirectionandissubjecttochangeatanytimewithoutnotice.Itisforinformationalpurposesonlyandshallnot,beincorporatedintoanycontractorothercommitment.Splunkundertakesnoobligationeithertodevelopthefeaturesor

functionalitydescribedortoincludeanysuchfeatureorfunctionalityinafuturerelease.

Page 3: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Agenda

Introductions/HistoryAxioms– ProblemDomainAxioms– SolutionDomainTimeSeriesFeatureEngineeringSpatialvs.TemporalAnalysisOtherApproachesMADServiceEngineeringITSIContext

3

Page 4: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Introductions/History

Keyteammembers– Shang– Mihai– Jacob– Iman– Touf

Presenters– Fred– Datascientist– Alex– Architect/DevManager

4

Page 5: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

5

THEUNIVERSEOFDATA

Time-seriesdata

Page 6: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

6

THEUNIVERSEOFDATA

ENHANCE!

Time-seriesdata

Page 7: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

7

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

IncreasingTimeà

x

Page 8: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

8

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)

IncreasingTimeà

x

regularinterval

Page 9: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– ProblemDomain

9

Detectinganomaliesinthisnarrowsubsetoftheuniverseofdata:TimeseriesNumericvariablesthatchangeovertime

Regular timeseriesThenewvaluesarriveonaregularinterval(e.g.everyfiveseconds)

Dense,RegulartimeseriesNewvaluesarefairlylikelytoarriveandnotbenull

IncreasingTimeà

x

regularinterval

fewgaps/nulls/NaNs

Page 10: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

10

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic

Page 11: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

11

Unsupervised– Nolabelledanomalies– What’snormalislearnedfromobservingthedataitself,notdefinedbyan

expertNon-ParametricRobustStreamingAdaptiveDomain-agnostic

Page 12: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

12

UnsupervisedNon-Parametric– Wemakenoassumptionsabouttheprobabilitydistributionofthevalues

(e.g.Gaussianorstationary)

RobustStreamingAdaptiveDomain-agnostic

Page 13: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

13

UnsupervisedNon-ParametricRobust– Outliersaredetectedasanomalies,butdon’tcausedistortionsinour

expectations

StreamingAdaptiveDomain-agnostic

Page 14: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

14

UnsupervisedNon-ParametricRobustStreaming– Noseparatetraining/testperiods– Anomaliesaredetectedandreportedin(near-)realtime

AdaptiveDomain-agnostic

Page 15: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

15

UnsupervisedNon-ParametricRobustStreamingAdaptive– Nostaticthresholds,discovernormalbehaviourpatternsautomatically– Adapttobehavioralchangeswithoutend-userfeedback– WhatwasnormallastweekmightbeworrisometodayDomain-agnostic

Page 16: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

16

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream

Memory/CPUusage

Page 17: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Axioms– SolutionDomain

17

UnsupervisedNon-ParametricRobustStreamingAdaptiveDomain-agnostic– Purelynumeric– Noinformationaboutunderlyingsubjectsorcausesofthebehaviourstream

Unicornspersecond

Page 18: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

18

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogo

TimeSeriesFeatureEngineering

Page 19: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

19

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblem

TimeSeriesFeatureEngineering

Page 20: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

20

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapabletradeoffsbetweendensity andprecision

TimeSeriesFeatureEngineering

Page 21: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

21

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!

TimeSeriesFeatureEngineering

Page 22: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

22

Ifyoualreadyhavedense,regular,numerictimeseries(aka“metrics”or“KPIs”)you’regoodtogoIfyouhavesomethingelse,nowyouhaveatimeseriesfeatureengineeringproblemThereareinescapable tradeoffsbetweendensity andprecisionIncreasedprecisionimpliessparsertimeseries– Alsoincreasedmemoryandbandwidthusage!

TSFErequiresdealingwithTime,Space andValues

TimeSeriesFeatureEngineering

Page 23: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

23

Time– Howfrequently donewvaluesarrive?– Howregularly donewvaluesarrive?– Howprecisely dowewanttobeabletorecordthetimewhenthe

measurementwastaken?ê Finertimeresolutionincreasessparsity:theprobabilitythatanyeventoccurredduringaparticulartimewindowisdecreased

SpaceValues

TimeSeriesFeatureEngineering

Page 24: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

24

TimeSpace- howprecisely dowewanttobeabletorelatetimeseriesbacktotheunderlyingeventstream?

ê Howmanydimensions?e.g.IPaddress,geo.coordinates,MIMEtype,HTTPresponsecode– Addingdimensionsincreasesprecision,butalsomagnifiesthelikelihoodofsparsity

ê Withinadimension,howprecisedoweneedtobe?– FullIPaddressor/24?Distinguish400,401,403,404orjust4xx?– Country,state/province,city,neighbourhood,building,…?– Extraprecisionincreasesthelikelihoodofsparsity

Values

TimeSeriesFeatureEngineering

Page 25: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

GettingDataIn

25

TimeSpaceValues– Howdowegenerateanumber?

ê Getanumericfieldas-is(i.e.a“gauge”)ê Incrementacounter

– Howdoweaggregatemultiplevalues?ê Min,max,mean,etc.

– Howshouldwehandlemissingvalues?ê ”Replacenullwithzero”onlymakessenseforsomethingweknowisacounterê “Takethepreviousvalue”mightmakesense

TimeSeriesFeatureEngineering

Page 26: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

26

Proprietary!Notopensourceoroff-the-shelf.Spatialandtemporalalgorithms– Whatdowemeanby“spatial”and“temporal”?– Completelyorthogonal,irreducibledistinction

ê Onecannotsubstitutefortheotherê Neitherisalwaysapplicabletoeverytimeseries

Page 27: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

27

Analyzeonetimeseriesatatime(embarrassinglyparallel)Alertingwhenpresentbehaviourissurprisingcomparedtopastbehaviour

TemporalAnalysis(aka“Trending”algorithm)

IncreasingTimeà

xnowß past

Page 28: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

28

Goodresultsonlywhenthereisahistoryofrecurringpatternsintheunderlyingeventstream– Notnecessarilyperiodic,justrecurring

Howmuchhistory?– Preliminary(usuallybad)resultsafter~2000points

ê e.g.1.5days at1-minuteresolution– Greatresultsaftera“fullperiod”hasbeenobserved(e.g.7days)– Moreisbetter!(modulomemory,storage…)

TrendingAlgorithmConstraints

Page 29: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

29

Comparepresent behaviourofmultiplemetrics

Spatial(“Cohesive”)Algorithm

IncreasingTimeà

x now

Page 30: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

30

Givenaset*oftimeseriesthatareexpected†tobehavesimilarly‡,detectwhenoneormoreofthemdepartsfromtheirpeers

*set>=3members

†expectedbyahumananalystorinterestingMLprocess

‡similarlyRoughlythesameshapeScaleandmagnitudeinvariant

CohesiveAlgorithmConstraints

Page 31: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

31

NoperiodicityrequiredHistoryimprovesscale/magnitudeinvariancePerformancereliesonsimilaritywithingroup– Whatifthegroupisn’tinherentlycohesive?

ê Lotsofalertsearlyonê Then,thealgorithmadaptstothechaosê Ifthegroupreturnstocohesion,thealgorithmwillautomaticallyadapttothe“newnormal”.

CohesiveAlgorithmCharacteristics

Page 32: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

32

Aclusterofserversperformingasimilarroleforthesameapplication,behindthesameloadbalancerAssumingtheloadbalancerisoperatingnominally,manyservermetricsshouldberoughlycorrelated,e.g.:– CPUusage(user,system,idle)– Diskusage(reads,writes,IOPS)– Networkusage(bandwidth,#activesockets)– Application-specificmetrics(requestshandledpersecond,500errors,

authenticationfailures,activesessions)

CohesiveAlgorithm:ExampleUseCase#1

Page 33: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MetricAnomalyDetectionAlgorithms

33

ImaginesomewindturbinesonthesamehillWecan’tpredictwinddirectionandspeedverywell(yet?)Butweexpecteveryturbineshouldberoughlycohesiveinseveralmetrics:– rotationspeed– powergenerationrate– vibration– direction

ê *actually,becausethisisaperiodicmetric(359° ≈1°),wedon’tsupportitwellrightnow

Ifanymetricforanyturbinedifferssignificantlyfromitspeers,weshouldbenotified,andmaybesendateamtoinvestigate

CohesiveAlgorithm:ExampleUseCase#2

Page 34: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Otherapproacheswehavetried

34

3-sigmaKolmogorov-SmirnovtestoverslidingwindowsTime-seriesforecastingmethods– Holt-Winters(previousversionofITSIADisbasedonitsnon-parametricversion)– ARIMA,etc

One-classSVMClusteringmethods– DBSCAN,K-means,etcVariousR,Pythonpackages

Page 35: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

35

MAD=“MetaforAnomalyDetection”

Page 36: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

36

MAD=“Metafor AnomalyDetection”

Page 37: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

37

MAD=“Metric AnomalyDetection”

Page 38: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

38

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

Page 39: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

39

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesSearchCommandProtocolv2(availablesinceSplunk6.3)– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Page 40: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

40

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Fast!

Page 41: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

MADServiceEngineering

41

MAD=“Metric AnomalyDetection”WritteninScala– usingAkkaforconcurrency

UsesnewChunkedExternalCommandfeatureofSplunk6.3– Runsforever,doesn’tgetrestartedevery50kevents– Receivesdatasoonafteritarrivesatanindexer,nopolling

Fast!Designedforgeneral-purposeuse,nocouplingtoITSIruntime

Page 42: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

42

ITSI-AD

ITSI2.3“Batman”(July2016)– ITSIAnomalyDetectionreplacedwithTrendingalgorithm

ITSI2.4“Catwoman”(.conf 2016)– Cohesivealgorithmadded– ComparesentitieswithinaKPI

Page 43: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

43

ITSI-AD

Page 44: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

44

ITSI-AD

Page 45: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

Howtogetit

45

ITSI-AD

Page 46: Machine Learning and Anomaly Detection in SplunkIT Service ......Machine Learning and Anomaly Detection in SplunkIT Service Intelligence. Disclaimer 2 ... Uses new Chunked External

THANKYOU