Fast Recovery + Statistical Anomaly Detection = Self-*
RADS/KATZ CATS Panel
June 2004 ROC Retreat
Outline
• Motivation & approach: complex systems of black boxes
• Measurements that respect black boxes
• Box-level micro-recovery cheap enough to survive false positives
• Differences from related efforts
• Early case studies
• Research agenda
Complex Systems of Black Boxes
“...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report)
• Build model of “acceptable” operating envelope by measurement & analysis
    • Control theory, statistical correlation, anomaly detection...
• Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep system in its acceptable operating envelope
    • “Increase the size of the DB connection pool” [Hellerstein et al]
    • “Reallocate one or more whole machines” [Lassettre et al]
    • “Rejuvenate/reboot one or more machines” [Trivedi, Fox, others]
    • “Shoot one of the blocked txns” [everyone]
    • “Induce memory pressure on other apps” [Waldspurger et al]
Differences from some existing problems
• intrusion detection (Hofmeyr et al 98, others)
    • Detections must be actionable in a way that is likely to improve system (sacrificing availability for safety is unacceptable)
• bug finding via anomaly detection (Engler, others)
    • Human-level monitoring/verification of detections not feasible, due to number of observations and short timescales for reaction
    • Can separate recovery from diagnosis/repair (don’t always need to know root cause to recover)
• modeling/predicting SLO violations (Hellerstein, Goldszmidt, others)
    • Labeled training set not necessarily available
Many other examples, but the point is...
• Statistical techniques identify “interesting” features and relationships from large datasets, but frequent tradeoff between detection rate (or detection time) and false positives
• Make “micro-recovery” so inexpensive that occasional false positives don’t matter
• Granularity of black box should match granularity of available external control mechanisms
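The tradeoff on this slide can be sketched with a back-of-the-envelope model. Everything below is an illustrative assumption rather than a measurement from the talk: the function name, the workload (one check per second, 0.01% of checks see a fault, a missed fault costs 10 minutes) and the two recovery costs. The two detector operating points borrow the detection and false-positive rates quoted on the path-based analysis slide, only as plausible values.

```python
def expected_lost_seconds(checks, fault_fraction, detection_rate,
                          false_positive_rate, recovery_seconds,
                          missed_fault_seconds):
    """Expected seconds of lost service over `checks` monitoring intervals."""
    faulty = checks * fault_fraction
    healthy = checks - faulty
    true_alarms = faulty * detection_rate
    false_alarms = healthy * false_positive_rate
    missed = faulty - true_alarms
    # Every alarm, true or false, triggers exactly one recovery action.
    return ((true_alarms + false_alarms) * recovery_seconds
            + missed * missed_fault_seconds)


# Assumed workload: one check per second for a day; a missed fault costs
# 10 minutes until a human steps in. These numbers are hypothetical.
WORKLOAD = dict(checks=86_400, fault_fraction=1e-4, missed_fault_seconds=600)
AGGRESSIVE = dict(detection_rate=0.80, false_positive_rate=0.018)
CONSERVATIVE = dict(detection_rate=0.40, false_positive_rate=0.002)

for label, recovery_seconds in [("micro-recovery", 0.3), ("full restart", 10.0)]:
    for name, detector in [("aggressive", AGGRESSIVE),
                           ("conservative", CONSERVATIVE)]:
        lost = expected_lost_seconds(recovery_seconds=recovery_seconds,
                                     **WORKLOAD, **detector)
        print(f"{label:15s} {name:13s} {lost:10.1f} s lost")
```

Under these assumed numbers the aggressive detector loses less service time only when recovery is cheap; with a 10-second restart the conservative detector comes out ahead.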
“Micro-recovery” to survive false positives
• Goal: provide “recovery management invariants”
• “Salubrious”: returns some part of system to known state
    • Reclaim resources (memory, DB conns, sockets, DHCP lease...)
    • Throw away corrupt transient state
    • Possibly set up to retry operation, if appropriate
• Safe: affects only performance, not correctness
• Non-disruptive: performance impact is “small”
• Predictable: impact and time-to-complete are stable
Observe, Analyze, Act: Not recovery, but continuous adaptation
Crash-Only Building Blocks
Columns: Subsystem / Control point / How realized / Statistical monitoring
• SSM (diskless session state store) [NSDI 04]
    Control point: whole-node fast reboot (doesn’t preserve state)
    How realized: quorum-like redundancy; relaxed consistency; repair cost spread over many operations
    Statistical monitoring: time series of state metrics (Tarzan)
• DStore (persistent hashtable) [in preparation]
    Control point: whole-node reboot (preserves state)
• JAGR (J2EE application server) [AMS 2003 & in prep.]
    Control point: microreboots of EJB’s
    How realized: modify appserver to undeploy/redeploy EJB’s and stall pending reqs
    Statistical monitoring: anomalous code paths and component interactions (probabilistic context-free grammar)
• Control points are safe, predictable, non-disruptive
• Crash-only design: shutdown=crash, recover=restart
• Makes state-management subsystems as easy to manage as stateless Web servers
Example: Managing DStore and SSM
• Rebooting is the only control mechanism
    • Has predictable effect and takes predictable time, regardless of what the process is doing
        • Like kill -9, “turning off” a VM, or pulling power cord
    • Intuition: the “infrastructure” supporting the power switch is simpler than the applications using it
    • Due to slight overprovisioning inherent in replication, rebooting can have minimal effect on throughput & latency
    • Relaxed consistency guarantees allow this to work
• Activity and state statistics collected per brick every second; any deviation => reboot brick
• Makes it as easy as managing a stateless server farm
• Backpressure at many design points prevents saturation
Design Lessons Learned So Far
• “A spectrum of cleaning operations” (Eric Anderson, HP Labs)
    • Consequence: as t [...], all problems will converge to “repair of corrupted persistent data”
• Trade “unnecessary” consistency for faster recovery
    • spread recovery actions out incrementally/lazily (read repair) rather than doing it all at once (log replay)
        • gives predictable return-to-service time and acceptable variation in performance after recovery
        • keeps data available for reads and writes throughout “recovery”
    • Use single phase ops to avoid coupling/locking and the issues they raise, and justify the cost in consistency
• It’s OK to say no (backpressure)
    • Several places our design got it wrong in SSM
    • But even those mistakes could have been worked around by guard timers
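The "read repair rather than log replay" lesson can be illustrated with a minimal sketch. This is not DStore's actual protocol: the class VersionedBrick, the function read_with_repair and the integer version numbers are hypothetical names chosen for the example. The point is only that a stale replica is fixed as a side effect of an ordinary read, so no separate recovery phase is needed.

```python
class VersionedBrick:
    """Hypothetical replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, version, value):
        current = self.data.get(key)
        # Keep only the newest version; older writes are ignored.
        if current is None or version > current[0]:
            self.data[key] = (version, value)


def read_with_repair(bricks, key):
    """Read from all replicas, return the newest value, repair stale copies."""
    replies = [(brick, brick.get(key)) for brick in bricks]
    known = [reply for _, reply in replies if reply is not None]
    if not known:
        return None
    newest = max(known, key=lambda reply: reply[0])
    for brick, reply in replies:
        if reply != newest:
            # Lazy repair: the stale or empty replica is fixed during the read.
            brick.put(key, *newest)
    return newest[1]


bricks = [VersionedBrick() for _ in range(3)]
for brick in bricks:
    brick.put("cart", 1, "old")
for brick in bricks[:2]:          # third brick missed this write (e.g. rebooting)
    brick.put("cart", 2, "new")
print(read_with_repair(bricks, "cart"), bricks[2].get("cart"))
```

Because repair work rides along with reads, its cost is spread over many operations and the data stays available throughout.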
Potential Limitations and Challenges
• Hard failures
• Configuration failures
    • Although similar approach has been used to troubleshoot those
• Corruption of persistent state
    • Data structure repair work (Rinard et al.) may be combinable with automatic inference (Lam et al.)
• Challenges
    • Stability and the “autopilot problem”
    • The base-rate fallacy
    • Multilevel learning
    • Online implementations of SLT techniques
    • Nonintrusive data collection and storage
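The base-rate fallacy listed above can be made concrete with Bayes' rule. The sketch below uses the 80% detection / 1.8% false-positive operating point quoted on the path-based analysis slide; the base rates (the fraction of observations that really are faulty) are assumptions picked for illustration, and the function name is hypothetical.

```python
def prob_fault_given_alarm(base_rate, detection_rate, false_positive_rate):
    """Bayes' rule: P(fault | alarm)."""
    true_alarm = detection_rate * base_rate
    false_alarm = false_positive_rate * (1.0 - base_rate)
    return true_alarm / (true_alarm + false_alarm)


# Assumed base rates; the detector rates come from the slide on path analysis.
for base_rate in (0.1, 0.01, 0.001):
    p = prob_fault_given_alarm(base_rate,
                               detection_rate=0.80,
                               false_positive_rate=0.018)
    print(f"base rate {base_rate:>6}: P(fault | alarm) = {p:.3f}")
```

With an assumed 0.1% base rate, fewer than 1 in 20 alarms points at a real fault, which is why the action triggered by an alarm has to be cheap.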
An Architecture for Observe, Analyze, Act
[Figure: architecture diagram. Labels: client requests, responses, datacenter boundary, application server, application component, collection, short-term store, long-term store, online algo., offline algo., recovery synthesis, observations from/to other datacenters, recovery actions to other datacenters]
• Separates systems concerns from algorithm development
• Programmable network elements provide extension of approach to other layers
• Consistent with technology trends
    • Explicit //ism in CPU usage
    • Lots of disk storage with limited bandwidth
Conclusion
“...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing)
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
Breakout sessions?
1. [James H] Reserve some resources to deal with problems (by filtering or pre-reservation)
2. [Joe H] How black is the black box? What “gray box” prior knowledge can you exploit (so you don’t ignore the obvious)?
3. [Joe H] Human role - can make statements about how system should act, so doesn’t have to be completely hands-off training. Similarly, during training, human can give feedback about what anomalies are actually relevant (labeling).
4. [Lakshmi] What kinds of apps is this intended to apply to? Where do ROC-like and OASIS-like apps differ?
5. [Mary Baker] People can learn to game the system -> randomness can be your friend. If behaviors have small number of modes, just have to look for behaviors in the “valleys”
Breakouts
1. 19 - “golden nuggets” to guide architecture, e.g., persistent identifiers for path-based analysis...what else?
2. 8 - act: what {safe,fast,predictable} behaviors of the system should we expose (other than, eg, rebooting)? Esp. those that contribute to security as well as dependability?
3. 11 - architectures for different types of stateful systems - what kinds of persistent/semi-persistent state need to be factored out of apps, and how to store it; interfaces, etc
4. 20 - Given your goal of “generic” techniques for distributed systems, how will you know when you’ve succeeded/how do you validate the techniques? (What are the “proof points” you can hand to others to convince them you’ve succeeded, including but not limited to metrics?) [Aaron/Dave] Metrics: How do you know you’re observing the right things? What benchmarks will be needed?
Open Mic
James Hamilton - The Security Economy
Conclusion
Toward “new science” in autonomic computing
“...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing)
The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation
Autonomic & Technology Trends
• CPU speed increases slowing down, need more explicit parallelism
    • Use extra CPU to collect and locally analyze data; exploit temporal locality
• Disk space is free (though bandwidth and disaster-recovery aren’t)
    • Can keep history of parallel as well as historical models for regression analysis, trending, etc.
• VM’s being used as unit of software distribution
    • Fault isolation
    • Opportunity for nonintrusive observation
    • Action that is independent of the hosted app
Data collection & monitoring
• Component frameworks allow for non-intrusive data collection without modifying the applications
    • Inter-EJB calls through runtime-managed level of indirection
    • Slightly coarser grain of analysis: restrictions on “legal” paths make it more likely we can spot anomalies
    • Aspect-oriented programming allows further monitoring without perturbing application logic
• Virtual machine monitors provide additional observation points
    • Already used by ASP’s, for load balancing, app migration, etc.
    • Transparent to applications and hosted OS’s
    • Likely to become the unit of software distribution (intra- and inter-cluster)
Optimizing for Specialized State Types
• Two single-key (“Berkeley DB”) get/set state stores
    • Used for user session state, application workflow state, persistent user profiles, merchandise catalogs, ...
• Replication to a set of N bricks provides durability
    • Write to subset, wait for subset, remember subset
    • DStore: state persists “forever” as long as N/2 bricks survive
    • SSM: If client loses cookie, state is lost; otherwise, persists for time t with probability p, where t, p = F(N, node MTBF)
• Recovery==restart, takes seconds or less
    • Efficacy doesn’t depend on whether replica is behaving correctly
    • SSM: node state not preserved (in-memory only)
    • DStore: node state preserved, read-repair fixes
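"Write to subset, wait for subset, remember subset" can be sketched roughly as follows. This is an illustrative toy in the spirit of SSM, not its real implementation: MemoryBrick, write, read and the parameters send_to / wait_for are hypothetical names, the calls are synchronous, and the returned list of brick names stands in for the write-set the client keeps in its cookie.

```python
import random


class MemoryBrick:
    """Hypothetical in-memory brick; a reboot loses all state (SSM-style)."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

    def put(self, key, value):
        if not self.alive:
            return False
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key) if self.alive else None

    def crash(self):
        self.alive = False

    def reboot(self):
        self.store.clear()       # node state is not preserved
        self.alive = True


def write(bricks, key, value, send_to, wait_for, rng=random):
    """Write to a random subset, wait for enough acks, return the write-set."""
    targets = rng.sample(bricks, send_to)      # randomness avoids hotspots
    acked = [b for b in targets if b.put(key, value)]
    if len(acked) < wait_for:
        # Backpressure: it is OK to say no and let the caller retry.
        raise RuntimeError("too few acks")
    return [b.name for b in acked[:wait_for]]  # remembered by the client


def read(bricks_by_name, write_set, key):
    """Any replica in the write-set is as good as any other."""
    for name in write_set:
        value = bricks_by_name[name].get(key)
        if value is not None:
            return value
    return None


bricks = [MemoryBrick(f"brick{i}") for i in range(5)]
by_name = {b.name: b for b in bricks}
cookie = write(bricks, "session42", {"cart": ["book"]}, send_to=3, wait_for=2)
by_name[cookie[0]].reboot()                    # one replica rebooted
print(read(by_name, cookie, "session42"))
```

No request depends on a specific brick, so rebooting any single member of the write-set leaves the value readable from the other one.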
Detection & recovery in SSM
• 9 “State” statistics collected once per second from each brick
• Tarzan time series analysis: keep N-length time series, discretize each data point
    • count relative frequencies of all substrings of length k or shorter
    • compare against peer bricks; reboot if at least 6 stats “anomalous”; works for aperiodic or irregular-period signals
• Remember! We are not SLT/ML researchers!
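The steps on this slide (discretize, count substring frequencies, compare against peers, reboot on 6 anomalous stats) can be sketched as below. This is a simplified stand-in, not the real Tarzan algorithm: equal-width binning, an L1 distance to the peers' average frequencies and the 0.5 threshold are all assumptions made for the example, as are the function names. At least one peer is assumed.

```python
from collections import Counter


def discretize(series, lo, hi, levels=4):
    """Map each value to a symbol 'a', 'b', ... by equal-width binning."""
    width = (hi - lo) / levels or 1.0
    symbols = []
    for x in series:
        idx = min(max(int((x - lo) / width), 0), levels - 1)
        symbols.append(chr(ord("a") + idx))
    return "".join(symbols)


def substring_freqs(symbols, k):
    """Relative frequencies of all substrings of length k or shorter."""
    counts = Counter()
    for length in range(1, k + 1):
        for i in range(len(symbols) - length + 1):
            counts[symbols[i:i + length]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def distance(p, q):
    """L1 distance between two frequency tables (an assumed metric)."""
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def anomalous_stats(brick, peers, k=3, threshold=0.5, levels=4):
    """brick and each peer: {stat_name: list of samples}. Needs >= 1 peer."""
    flagged = []
    for stat, series in brick.items():
        peer_series = [peer[stat] for peer in peers]
        values = list(series) + [x for s in peer_series for x in s]
        lo, hi = min(values), max(values)
        mine = substring_freqs(discretize(series, lo, hi, levels), k)
        theirs = [substring_freqs(discretize(s, lo, hi, levels), k)
                  for s in peer_series]
        keys = set().union(*theirs)
        avg = {s: sum(t.get(s, 0.0) for t in theirs) / len(theirs)
               for s in keys}
        if distance(mine, avg) > threshold:   # threshold is a "voodoo constant"
            flagged.append(stat)
    return flagged


def should_reboot(brick, peers, min_anomalous=6, **kwargs):
    """Reboot when at least `min_anomalous` statistics look anomalous."""
    return len(anomalous_stats(brick, peers, **kwargs)) >= min_anomalous
```

Comparing each brick with its peers, rather than with a fixed profile, relies on the assumption that most bricks are behaving correctly most of the time.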
Detection & recovery in DStore
• Metrics and algorithm comparable to those used in SSM
• We inject “fail-stutter” behavior by increasing request latency
    • Bottom case: more aggressive detection also results in 2 “unnecessary” reboots
    • But they don’t matter much
• Currently some voodoo constants for thresholds in both SSM and DStore
    • Trade-off of fast detection vs. false positives
What faults does this handle?
• Substantially all non-Byzantine faults we injected:
    • Node crash, hang/timeout/freeze
    • Fail-stutter: Network loss (drop up to 70% of packets randomly)
    • Periodic slowdown (eg from garbage collection)
    • Persistent slowdown (one node lags the others)
• Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
• All anomalies can be safely “coerced” to crash faults
    • If that turned out to be the wrong thing, it didn’t cost you much to try it
    • Human notified after threshold number of restarts
• These systems are “always recovering”
Path-based analysis + Microreboots
• Pinpoint captures execution paths through EJB’s as dynamic call trees (intra-method calls hidden)
    • Build probabilistic context-free grammar from these
    • Detect trees that correspond to very low probability parses
• Respond by micro-rebooting (uRB) suspected-faulty EJB’s
    • uRB takes 100’s of msecs, vs. whole-app restart (8-10 sec)
• Component interaction analysis currently finds 55-75% of failures
• Path shape analysis detects >90% of failures; but correctly localizes fewer
[Figure: detection rate vs. false positive rate]
    • Across all expts: 80% detection rate with 1.8% FP rate
    • Across 92% of expts: 40% detection rate with 0.2% FP rate
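A minimal sketch of the path-shape idea: learn production probabilities (parent component calls this tuple of children) from observed call trees, then flag trees whose log-probability is very low. This is not Pinpoint's implementation; the class PathGrammar, the (name, [subtrees]) tree encoding, the component names, the fixed probability for unseen rules and the threshold are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def productions(tree):
    """Yield (parent, child_names) for every node of a (name, [subtrees]) tree."""
    name, children = tree
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)


class PathGrammar:
    """Toy probabilistic grammar over call-tree shapes."""

    def __init__(self):
        self.rule_counts = defaultdict(Counter)

    def observe(self, tree):
        for parent, kids in productions(tree):
            self.rule_counts[parent][kids] += 1

    def log_prob(self, tree, unseen=1e-6):
        total = 0.0
        for parent, kids in productions(tree):
            counts = self.rule_counts.get(parent)
            seen = sum(counts.values()) if counts else 0
            # Rules never observed get a small assumed probability.
            p = counts[kids] / seen if seen and counts[kids] else unseen
            total += math.log(p)
        return total

    def is_anomalous(self, tree, threshold):
        return self.log_prob(tree) < threshold


# Hypothetical component names, for illustration only.
normal = ("Servlet", [("Cart", [("DB", [])]), ("Catalog", [("DB", [])])])
truncated = ("Servlet", [("Cart", [])])       # Cart never reached the DB

grammar = PathGrammar()
for _ in range(50):
    grammar.observe(normal)
print(grammar.log_prob(normal), grammar.log_prob(truncated))
```

The productions that contribute the lowest probabilities also hint at which component to micro-reboot, although the slide notes that localization is harder than detection.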
Crash-Only Design Lessons from SSM
• Eliminate coupling
    • No dependence on any specific brick, just on a subset of minimum size -- even at the granularity of individual requests
    • Not even across phases of an operation: single-phase nonblocking ops only => predictable amount of work/request
    • Use randomness to avoid deterministic worst cases and hotspots
    • We initially violated this guideline by using an off-the-shelf JMS implementation that was centralized
• Make parts interchangeable
    • Any replica in a write-set is as good as any other
    • Unlike erasure coding, only need 1 replica to survive
    • Cost is higher storage overhead, but we’re willing to pay that to get the self-* properties
Enterprise Service Workloads
Columns: Observation / Consequence
• Observation: Internet service workloads consist of large numbers of independent users
    Consequence: Large number of independent samples gives basis for success of statistical techniques
• Observation: Even a flaky service is doing mostly the right thing most of the time
    Consequence: Steady-state behavior can be extracted from normal operation
• Observation: Heavy traffic volume means most of the service is exercised in a relatively short time
    Consequence: Baseline model can be learned rapidly and updated in place periodically
3. We can continuously extract models from the production system orthogonally to the application
Building models through measurement
• Finding bugs using distributed assertion sampling [Liblit et al, 2003]
    • Instrument source code with assertions on pairs of variables (“features”)
    • Use sampling so that any given run of program exercises only a few assertions (to limit performance impact)
    • Use classification algorithm to identify which features are most predictive of faults (observed program crashes)
    • Goal: bug finding
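The sampling idea above can be illustrated with a toy. Nothing here reproduces the cited system: the program, its predicates, the sampling rate and all names are hypothetical, and the ranking uses a plain crash-rate difference as a stand-in for the classification algorithm the slide mentions.

```python
import random
from collections import defaultdict


class SampledRun:
    """Records a predicate only when a coin flip selects it for this run."""

    def __init__(self, rate, rng):
        self.rate, self.rng = rate, rng
        self.observed = {}          # predicate name -> was it ever true?

    def check(self, name, condition):
        if self.rng.random() < self.rate:
            self.observed[name] = self.observed.get(name, False) or bool(condition)


def toy_program(a, b, c, run):
    """Hypothetical program that 'crashes' exactly when a > b."""
    run.check("a > b", a > b)
    run.check("a < c", a < c)
    run.check("b == c", b == c)
    return a > b                    # True means the run crashed


def collect_reports(runs, rate, rng):
    reports = []
    for _ in range(runs):
        run = SampledRun(rate, rng)
        a, b, c = (rng.randrange(10) for _ in range(3))
        reports.append((run.observed, toy_program(a, b, c, run)))
    return reports


def rank_predicates(reports):
    """Score = P(crash | predicate true) - P(crash | predicate observed)."""
    true_runs = defaultdict(lambda: [0, 0])   # name -> [crashes, runs]
    seen_runs = defaultdict(lambda: [0, 0])
    for observed, crashed in reports:
        for name, was_true in observed.items():
            seen_runs[name][0] += crashed
            seen_runs[name][1] += 1
            if was_true:
                true_runs[name][0] += crashed
                true_runs[name][1] += 1
    scores = {}
    for name, (crashes, runs) in true_runs.items():
        seen_crashes, seen_total = seen_runs[name]
        scores[name] = crashes / runs - seen_crashes / seen_total
    return sorted(scores, key=scores.get, reverse=True)


reports = collect_reports(runs=2000, rate=0.3, rng=random.Random(0))
print(rank_predicates(reports))
```

Each run pays for only a fraction of the checks, yet pooling many runs still separates the predicate tied to the crash from the unrelated ones.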
JAGR: JBoss with Micro-reboots
• performability of RUBiS (goodput/sec vs. time)
    • vanilla JBoss w/manual restarting of app-server, vs. JAGR w/automatic recovery and micro-rebooting
• JAGR/RUBiS does 78% better than JBoss/RUBiS
• Maintains 20 req/sec, even in the face of faults
• Lower steady-state after recovery in first graph: class reloading, recompiling, etc., which is not necessary with micro-reboots
• Also used to fix memory leaks without rebooting whole appserver
Fast Recovery + Statistical Anomaly Detection = Self-*
Armando Fox and Emre Kiciman, Stanford University
Michael Jordan, Randy Katz, David Patterson, Ion Stoica, University of California, Berkeley
SoS Workshop, Bertinoro, Italy