NICTA Copyright 2012 From imagination to impact Dependable Operation Performance Management and Capacity Planning Under Continuous Changes April, 2014 Dr. Liming Zhu, Dr. Ingo Weber NICTA/UNSW http://slideshare.net/limingzhu

Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

DESCRIPTION

Talk at the http://www.cmga.org.au/ meet-up.

Modern large-scale applications experience sporadic changes due to operational activities such as upgrade, redeployment, on-demand scaling and interference from other simultaneous operations. This poses new challenges in system monitoring, capacity planning, performance management, error detection and diagnosis. For example, traditional anomaly-detection-based techniques are less effective during the “sporadic” operation period, as a wide range of legitimate changes confound the situation and make it difficult to establish a performance baseline for “normal” operation. The increasing frequency of these sporadic operations (e.g. due to continuous deployment) is exacerbating the problem. In this talk, we will introduce a number of ongoing research activities at NICTA addressing these issues. For example, we propose the Process Oriented Dependability (POD) approach, which explicitly models these sporadic operations as processes and uses the process context to filter logs, traverse fault trees and conduct adaptive monitoring.

Page 1: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Dependable Operation

Performance Management and Capacity Planning

Under Continuous Changes April, 2014

Dr. Liming Zhu, Dr. Ingo Weber

NICTA/UNSW

http://slideshare.net/limingzhu

Page 2: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

NICTA (National ICT Australia)

• Australia’s National Centre of Excellence in Information and Communication Technology

• Five Research Labs:
  – ATP: Australian Technology Park, Sydney
  – NRL: UNSW, Sydney
  – CRL: ANU, Canberra
  – VRL: Uni. Melbourne
  – QRL: Uni. Queensland and QUT

• 700 staff including 270 PhD students
• Budget: ~$90M/yr from Fed/State Gov and industry
• ~600 research papers/year, ~150 patents total

Page 3: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

NICTA: Research and Outcomes

Software Systems

Networks

Optimisation

Machine Learning

Computer Vision

Broadband and the Digital Economy

Infrastructure Transport and Logistics

Security and Environment

University Partners

Industry and Government Partners

Research Excellence Wealth Creation

Engineering and Technology Development

Page 4: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Software Systems Research Group (SSRG)

• Vision: Cost Effective Dependable Systems
• Two Major Activities
  – Trustworthy Systems: single systems
  – Dependable Cloud Computing: distributed systems
• Research history related to capacity planning
  – Revel8or/MDABench: capacity planning prototype
  – Spin-out: http://www.performance-assurance.com.au/
  – SPEC (spec.org) research group member
    • Cloud (elasticity) benchmarking
  – Keynote at ICPE 2013: “Supporting Operations Personnel Through Performance Engineering” by Len Bass

Page 5: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

New Challenge: Continuous Changes

• Significantly shorter release cycles
  – Continuous delivery/deployment: from releases every few months during scheduled downtime to releases every few hours at any time
    • Etsy.com: 25 full deployments per day at 10 commits per deploy
• Resource sharing
  – Multiple sporadic operations at all times
  – Scaling in/out, snapshot, migration, reconfiguration, rolling upgrade, cron-jobs, backup, recovery…
• Cloud uncertainty
  – Limited visibility and indirect control

• Cloud uncertainty – Limited visibility and indirect control

Demands continuous capacity planning and performance management

Page 6: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Sporadic Operation Example: Rolling Upgrade

- Have 100 servers in cloud with version 1 software

- Upgrade 10 servers at a time to version 2 software

- No downtime or redundancy cost

- Potentially takes a long time to complete, with errors occurring during the operation and interference from other operations
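The batching described above can be sketched as follows. This is a minimal illustration, not the tooling used in the talk: `rolling_upgrade`, the `upgrade` callback and the server names are hypothetical, and a real upgrade would also have to cope with the interfering operations mentioned in the last bullet.

```python
from typing import Callable, List


def rolling_upgrade(servers: List[str],
                    upgrade: Callable[[str], bool],
                    batch_size: int = 10) -> List[str]:
    """Upgrade servers batch by batch; return the servers whose upgrade failed."""
    failed = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        # Only the servers in this batch are touched in this round;
        # the remaining servers are left untouched and keep serving.
        for server in batch:
            if not upgrade(server):
                failed.append(server)
    return failed


# Hypothetical usage: 100 servers, every upgrade succeeds except server-42.
servers = [f"server-{n}" for n in range(100)]
failed = rolling_upgrade(servers, upgrade=lambda s: s != "server-42")
print(failed)
```

Collecting failures instead of aborting is a design choice of this sketch; an operator might equally decide to stop after the first failed batch.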

Page 7: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

System Monitoring During Rolling Upgrade

Page 8: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Our Approach

• Incorporating change-related knowledge into system management
  – Sporadic operation knowledge
    • Process-Oriented Dependability (POD): error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations
  – External event knowledge
    • Event-aware workload prediction

Page 9: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Process-Oriented Dependability (POD)

• Context
  – Large-scale web/enterprise operation in Cloud
  – Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors occurring during sporadic cloud operations
  – Scope: “sporadic operations” (not normal operation)
    • deployment, reconfiguration, (rolling) upgrade, rollback
    • DevOps related: continuous integration/deploy/delivery

Page 10: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Operation as Process

• Offline: treat an operation as a process
  – Process discovered automatically from logs/scripts
    • Clustering of log lines and process mining
  – Expected step outcomes specified as assertions
• Online: use process context
  – Process context: process/instance/step ids, expected states
  – Errors are detected by examining logs and monitoring data
    • Assertion evaluation using monitoring facilities or directly
    • Compliance checking against expected processes using logs
  – Detected errors are further diagnosed for (root) causes
    • Examining a fault tree to locate potential root causes
    • Performing more diagnostic tests and on-demand assertions

X. Xu, L. Zhu, et al. “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications”, 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.

Page 11: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Example: Rolling Upgrade Using Asgard

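[Figure: rolling upgrade using Asgard, split into Offline and Online parts. Elements: Operator, Asgard, Log data, Process Mining Service, Discovered Model. Relations: Controls, Generates, Read by, Outputs. Steps shown in the discovered model: Create Snapshot, Check AZs, Create instance from snapshot, Create AMI from instance, Evaluate AMI.]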

Page 12: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

POD-Detection: Error Detection

Error Detection Service has two methods for detecting errors:
• Assertion Checking
• Conformance Checking

Page 13: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Assertion Checking: how it works

Log line:

Assertions:

Page 14: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Assertion Checking: how it works

Log line:
• Remove ...

Assertions:
• i has been de-registered from ELB
• i has been removed from ASG
• there is 1 less instance of v1
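The idea of attaching assertions to a log line can be sketched as below. This is only an illustration under stated assumptions: `CloudState`, `REMOVE_ASSERTIONS` and `check_step` are hypothetical names, and the state snapshots stand in for whatever the monitoring facilities or cloud APIs would actually report.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class CloudState:
    """Hypothetical snapshot of monitored state (not a real AWS API object)."""
    elb_instances: Set[str] = field(default_factory=set)
    asg_instances: Set[str] = field(default_factory=set)
    v1_count: int = 0


# An assertion takes (state before, state after, instance id) and says whether it holds.
Assertion = Callable[[CloudState, CloudState, str], bool]

# Assertions evaluated once the "Remove ..." log line has been seen.
REMOVE_ASSERTIONS: Dict[str, Assertion] = {
    "i has been de-registered from ELB":
        lambda before, after, i: i not in after.elb_instances,
    "i has been removed from ASG":
        lambda before, after, i: i not in after.asg_instances,
    "there is 1 less instance of v1":
        lambda before, after, i: after.v1_count == before.v1_count - 1,
}


def check_step(before: CloudState, after: CloudState, instance_id: str,
               assertions: Dict[str, Assertion]) -> List[str]:
    """Return the names of the assertions that do not hold."""
    return [name for name, holds in assertions.items()
            if not holds(before, after, instance_id)]


# Hypothetical scenario: i-1 left the ELB but is still listed in the ASG.
before = CloudState({"i-1", "i-2"}, {"i-1", "i-2"}, v1_count=2)
after = CloudState({"i-2"}, {"i-1", "i-2"}, v1_count=1)
violations = check_step(before, after, "i-1", REMOVE_ASSERTIONS)
print(violations)
```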

Page 15: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...

Assertions:
• i successfully terminated

Page 16: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...
• Wait ...

Assertions:
• Next log line should appear within 17m35s (95th percentile)
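A timing assertion of this kind can be sketched as follows, assuming the threshold is derived from historical waiting times with a nearest-rank percentile. The function names and the sample history are hypothetical; the slides do not say how the 17m35s bound was computed.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def wait_timed_out(elapsed_s: float, history_s: List[float], pct: int = 95) -> bool:
    """True if the next log line is overdue relative to past observations."""
    return elapsed_s > percentile(history_s, pct)


# Hypothetical history of waiting times in seconds: 100, 200, ..., 2000.
history = [100.0 * n for n in range(1, 21)]
print(percentile(history, 95))
print(wait_timed_out(17 * 60 + 35, history))
```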

Page 17: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Assertion Checking: how it works

Log line:
• Remove ...
• Terminate ...
• Wait ...
• New instance ...

Assertions:
• i′ successfully launched

Page 18: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Conformance Checking: how it works

Log lines:

Page 19: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Conformance Checking: how it works

Log lines:
• Remove ...

Page 20: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...

Page 21: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...
• Wait ...

Page 22: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Conformance Checking: how it works

Log lines:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ...???
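The unexpected second “Terminate” can be caught by replaying the log against the expected process. The sketch below assumes a very simple model expressed as allowed successor steps; `MODEL`, the `START`/`END` markers and `first_deviation` are hypothetical, and the step order is taken from the assertion-checking slides rather than from an actual mined model.

```python
from typing import Dict, List, Optional, Set

# Hypothetical process model: for each step, the steps allowed to follow it.
MODEL: Dict[str, Set[str]] = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}


def first_deviation(steps: List[str],
                    model: Dict[str, Set[str]] = MODEL) -> Optional[int]:
    """Index of the first step the model does not allow, or None if the trace fits."""
    current = "START"
    for index, step in enumerate(steps):
        if step not in model.get(current, set()):
            return index
        current = step
    return None


trace = ["Remove", "Terminate", "Wait", "Terminate"]
print(first_deviation(trace))
```

Real conformance checking against a mined model also has to handle concurrency and skipped steps; a successor table is only the simplest case.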

Page 23: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

POD-Diagnosis: how it works

• Fault trees are built as a knowledge base

• On-demand diagnosis tests to locate the (root) causes

• Process context used for FT pruning
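One way to picture fault-tree pruning with process context is sketched below. It assumes each node is tagged with the process steps in which the fault can occur, so subtrees irrelevant to the failing step are skipped and their diagnostic tests never run. `FaultNode`, `diagnose` and the example tree are hypothetical and not taken from POD-Diagnosis itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    steps: Set[str]                             # process steps where this fault can occur
    test: Optional[Callable[[], bool]] = None   # on-demand test for leaves; True = fault confirmed
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node: FaultNode, step: str) -> List[str]:
    """Return confirmed root causes, skipping subtrees irrelevant to the failing step."""
    if step not in node.steps:
        return []  # pruned using process context; no test is executed
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes: List[str] = []
    for child in node.children:
        causes.extend(diagnose(child, step))
    return causes


tests_run: List[str] = []


def make_test(name: str, result: bool) -> Callable[[], bool]:
    """Build a fake diagnostic test that records that it was executed."""
    def run() -> bool:
        tests_run.append(name)
        return result
    return run


# Hypothetical tree; leaf names loosely follow the error examples in the evaluation.
tree = FaultNode("upgrade step failed", {"Remove", "New instance"}, children=[
    FaultNode("wrong launch configuration", {"New instance"}, make_test("launch config", True)),
    FaultNode("image unavailable", {"New instance"}, make_test("image", False)),
    FaultNode("ELB de-registration failed", {"Remove"}, make_test("elb", False)),
])
causes = diagnose(tree, "New instance")
print(causes, tests_run)
```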

Page 24: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Evaluation: POD-Detection/Diagnosis

• Experiments
  – Rolling upgrade of 100+ node cluster in AWS
    • Fault injection + confounding processes: random kill, scaling-in, ...
• Detected errors
  – Assertion checking: known errors and global errors
    • Examples: key management, launch configuration, images
  – Compliance checking: unknown errors
    • skipping activities or undone activities
• Timing and precision
  – Compared with Asgard/Mentoring internal mechanisms
    • Detected more errors earlier
  – Diagnosis: limited to known causes in FT
    • 95th percentile of diagnosis time less than 4s; accuracy ranges 80%~100%

Page 25: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Evaluation: POD-Detection/Diagnosis

Page 26: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Our Approach

• Incorporating change-related knowledge into system management
  – Sporadic operation knowledge
    • Process-Oriented Dependability: error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations
  – External event knowledge
    • Event-aware workload prediction

Page 27: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Alerting Management using Process Context

• Do not turn off alerts during sporadic operation
• Dynamically suppressing and annotating alerts using sporadic operation knowledge
  – CPU sensitive?
  – Network sensitive?
  – I/O sensitive?
  – Health checking sensitive?
• Benefits
  – Reduce false positives of alerts
  – Add context to system monitoring data for later capacity planning and performance tuning
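A minimal sketch of context-aware alert handling is shown below. It assumes a lookup table recording which metric classes each sporadic operation is expected to disturb; the table contents, the `Alert` class and `apply_context` are hypothetical. Alerts are never dropped: expected ones are marked as suppressed, and the rest are annotated with the operations running at the time.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical knowledge base: metric classes each sporadic operation is expected to affect.
SENSITIVITY: Dict[str, Set[str]] = {
    "rolling upgrade": {"cpu", "health_check"},
    "backup": {"io", "network"},
}


@dataclass
class Alert:
    metric: str
    message: str
    suppressed: bool = False
    annotation: Optional[str] = None


def apply_context(alert: Alert, running_operations: Set[str]) -> Alert:
    """Suppress alerts explained by a running operation; annotate the others."""
    for op in sorted(running_operations):
        if alert.metric in SENSITIVITY.get(op, set()):
            alert.suppressed = True
            alert.annotation = f"expected during {op}"
            return alert
    if running_operations:
        # Not explained by any operation: keep the alert, but record the context.
        alert.annotation = "raised during " + ", ".join(sorted(running_operations))
    return alert


cpu_alert = apply_context(Alert("cpu", "CPU > 90%"), {"rolling upgrade"})
io_alert = apply_context(Alert("io", "disk latency high"), {"rolling upgrade"})
print(cpu_alert)
print(io_alert)
```

The annotations remain attached to the monitoring data, which is what makes them usable later for capacity planning and performance tuning.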

Page 28: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Availability Analysis for Sporadic Operation

• Sporadic Operation’s Impact on Availability
  – Using Stochastic Reward Network (SRN)
  – Maintenance/Backup/Recovery operation
• Architecture has effect as well

Qinghua Lu, et al. “Incorporating uncertainty into in-cloud application deployment decisions for availability”, IEEE 6th International Conference on Cloud Computing, June 2013.

Page 29: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Page 30: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Availability Estimation for Different Deployment and Recovery Approaches

Page 31: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Event-Aware Workload Prediction

[Figure: event-aware workload prediction. Elements: Upcoming Event, Repository, Event Workload Model, Predict Workload, Workload Prediction.]

Matthew Sladescu, et al. “Event aware workload prediction: A study using auction events”, International Conference on Web Information Systems Engineering (WISE), 2012.

Page 32: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

[Figure: Predicting Workload. Component curves are combined (+, +, =) into a predicted workload curve. Axes: Time (min) vs. Bids/min. Marker: Time to Predict.]
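A minimal sketch of event-aware prediction follows. It assumes, and this is not confirmed by the slides, that the predicted load is a baseline plus the contribution of each upcoming event, looked up in a repository of per-event-type workload models. `EVENT_MODELS`, `predict_workload` and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical repository: event type -> extra bids/min at each minute after the event starts.
EVENT_MODELS: Dict[str, List[float]] = {
    "auction_close": [5.0, 20.0, 40.0],
}


def predict_workload(baseline: List[float],
                     upcoming: List[Tuple[str, int]],
                     models: Dict[str, List[float]] = EVENT_MODELS) -> List[float]:
    """Add each upcoming event's workload model onto the baseline, minute by minute."""
    prediction = list(baseline)
    for event_type, start_minute in upcoming:
        for offset, extra in enumerate(models[event_type]):
            minute = start_minute + offset
            if 0 <= minute < len(prediction):
                prediction[minute] += extra
    return prediction


# Two overlapping hypothetical auction events on a flat baseline of 10 bids/min.
baseline = [10.0] * 5
prediction = predict_workload(baseline, [("auction_close", 1), ("auction_close", 3)])
print(prediction)
```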

Page 33: Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

Summary

• System is undergoing continuous changes
  – Continuous deployment + Cloud uncertainty/visibility
• Use change-related knowledge in system mgt.
  – Sporadic operation knowledge
    • POD: Error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations
  – External event knowledge
    • Event-aware workload prediction
• We need industry help and collaboration
  – Logs, trials, case study and feedback

Book: http://www.ssrg.nicta.com.au/projects/devops_book/

Contact: {[email protected]}