Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

NICTA Copyright 2012 From imagination to impact

April 2014

Dr. Liming Zhu, Dr. Ingo Weber

NICTA/UNSW

http://slideshare.net/limingzhu



NICTA (National ICT Australia)

• Australia’s National Centre of Excellence in Information and Communication Technology

• Five Research Labs:
  – ATP: Australian Technology Park, Sydney
  – NRL: UNSW, Sydney
  – CRL: ANU, Canberra
  – VRL: Uni. Melbourne
  – QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Budget: ~$90M/yr from Fed/State Gov and industry
• ~600 research papers/year, ~150 patents total


NICTA: Research and Outcomes

[Diagram labels: Software Systems; Networks; Optimisation; Machine Learning; Computer Vision; Broadband and the Digital Economy; Infrastructure Transport and Logistics; Security and Environment; University Partners; Industry and Government Partners; Research Excellence; Wealth Creation; Engineering and Technology Development]


Software Systems Research Group (SSRG)

• Vision: Cost Effective Dependable Systems
• Two Major Activities
  – Trustworthy Systems: single systems
  – Dependable Cloud Computing: distributed systems
• Research history related to capacity planning
  – Revel8or/MDABench: capacity planning prototype
  – Spin-out: http://www.performance-assurance.com.au/
  – SPEC (spec.org) research group member
    • Cloud (elasticity) benchmarking
  – Keynote at ICPE 2013: “Supporting Operations Personnel Through Performance Engineering” by Len Bass



New Challenge: Continuous Changes

• Significantly shorter release cycles
  – Continuous delivery/deployment: releases move from every few months during scheduled downtime to every few hours at any time
    • Etsy.com: 25 full deployments per day at 10 commits per deploy
• Resource sharing
  – Multiple sporadic operations at all times
  – Scaling in/out, snapshot, migration, reconfiguration, rolling upgrade, cron jobs, backup, recovery…
• Cloud uncertainty
  – Limited visibility and indirect control

Demands continuous capacity planning and performance management


Sporadic Operation Example: Rolling Upgrade

- Have 100 servers in cloud with version 1 software

- Upgrade 10 servers at a time to version 2 software

- No downtime or redundancy cost

- Can take a long time to complete, with errors occurring during the operation and interference from other concurrent operations
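The batch-by-batch behaviour described above can be sketched as follows. This is a minimal illustration, not the tooling used in the talk: the function `rolling_upgrade`, the server names, and the `upgrade_one` callback are all hypothetical stand-ins for the real cloud API calls.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    servers: Dict[str, str],
    new_version: str,
    batch_size: int,
    upgrade_one: Callable[[str], bool],
) -> List[str]:
    """Upgrade servers batch by batch; return the names that failed."""
    failed: List[str] = []
    names = sorted(servers)
    for start in range(0, len(names), batch_size):
        # Only the servers in the current batch are taken out of service.
        batch = names[start:start + batch_size]
        for name in batch:
            if upgrade_one(name):
                servers[name] = new_version
            else:
                failed.append(name)
    return failed


# Hypothetical fleet: 100 servers on v1, upgraded 10 at a time.
fleet = {f"server-{i:03d}": "v1" for i in range(100)}
# Stand-in for the real upgrade step: pretend one server fails to upgrade.
failures = rolling_upgrade(fleet, "v2", 10, lambda name: name != "server-042")
upgraded = sum(1 for version in fleet.values() if version == "v2")
print(upgraded, failures)
```

Even in this toy version a single failed server leaves the fleet in a mixed-version state, which is the kind of error the rest of the talk aims to detect during the operation rather than after it.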


System Monitoring During Rolling Upgrade


Our Approach

• Incorporating change-related knowledge into system management
  – Sporadic operation knowledge
    • Process-Oriented Dependability (POD): error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations
  – External event knowledge
    • Event-aware workload prediction


Process-Oriented Dependability (POD)

• Context
  – Large-scale web/enterprise operation in Cloud
  – Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors occurring during sporadic cloud operations
  – Scope: “sporadic operations” (not normal operation)
    • Deployment, reconfiguration, (rolling) upgrade, rollback
    • DevOps related: continuous integration/deploy/delivery


Operation as Process

• Offline: treat an operation as a process
  – Process discovered automatically from logs/scripts
    • Clustering of log lines and process mining
  – Expected step outcomes specified as assertions
• Online: use process context
  – Process context: process/instance/step IDs, expected states
  – Errors are detected by examining logs and monitoring data
    • Assertion evaluation using monitoring facilities or directly
    • Conformance checking against expected processes using logs
  – Detected errors are further diagnosed for (root) causes
    • Examining a fault tree to locate potential root causes
    • Performing more diagnostic tests and on-demand assertions

X. Xu, L. Zhu, et al., “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
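As a rough illustration of the offline step, the sketch below groups log lines into activities by masking their variable parts and then records which activity directly follows which. It is a simplified stand-in for the clustering and process mining mentioned above, not the method from the cited paper; the log line wording, the instance ID pattern, and the names `to_activity` and `discover` are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def to_activity(log_line: str) -> str:
    """Collapse a log line to an activity name by masking variable parts."""
    # Mask instance IDs first, then any remaining numbers.
    masked = re.sub(r"\bi-[0-9a-f]+\b", "<instance>", log_line)
    return re.sub(r"\d+", "<n>", masked)


def discover(traces: List[List[str]]) -> Dict[str, Set[str]]:
    """Build a directly-follows relation from a list of log traces."""
    follows: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        activities = [to_activity(line) for line in trace]
        for current, following in zip(activities, activities[1:]):
            follows[current].add(following)
    return dict(follows)


# Hypothetical logs from two earlier runs of the same operation.
traces = [
    ["Remove instance i-0a1b2c from ELB", "Terminate instance i-0a1b2c",
     "Wait for new instance", "New instance i-0d4e5f is ready"],
    ["Remove instance i-111aaa from ELB", "Terminate instance i-111aaa",
     "Wait for new instance", "New instance i-222bbb is ready"],
]
model = discover(traces)
for activity, successors in model.items():
    print(activity, "->", sorted(successors))
```

Both runs collapse to the same four activities, so the discovered model is a single chain from Remove to New instance.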


Example: Rolling Upgrade Using Asgard

[Figure: offline and online views of a rolling upgrade using Asgard. Components: Operator, Asgard, Log data, Process Mining Service, Discovered Model. Edge labels: Controls, Generates, Read by, Outputs. Activities shown: Create Snapshot, Check AZs, Create instance from snapshot, Create AMI from instance, Evaluate AMI.]


POD-Detection: Error Detection

Error Detection Service has two methods for detecting errors:
• Assertion Checking
• Conformance Checking


Assertion Checking: how it works

Each new log line triggers the assertions for the step it reports:

• Log line: Remove ...
  – i has been de-registered from ELB
  – i has been removed from ASG
  – There is one fewer instance of v1
• Log line: Terminate ...
  – i successfully terminated
• Log line: Wait ...
  – Next log line should appear within 17m35s (95th percentile)
• Log line: New instance ...
  – i′ successfully launched
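A minimal sketch of this idea, assuming each step name maps to a set of named predicates over the observed cloud state. `CloudState`, `ASSERTIONS`, and `check` are hypothetical names; a real implementation would query the ELB and ASG through the cloud provider's API instead of an in-memory snapshot.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CloudState:
    """Hypothetical snapshot of what monitoring reports."""
    elb_members: set = field(default_factory=set)
    asg_members: set = field(default_factory=set)
    running: set = field(default_factory=set)


Assertion = Callable[[CloudState, str], bool]

# Assertions to evaluate after a log line for the given step is seen.
ASSERTIONS: Dict[str, Dict[str, Assertion]] = {
    "Remove": {
        "de-registered from ELB": lambda s, i: i not in s.elb_members,
        "removed from ASG": lambda s, i: i not in s.asg_members,
    },
    "Terminate": {
        "terminated": lambda s, i: i not in s.running,
    },
}


def check(step: str, instance: str, state: CloudState) -> List[str]:
    """Return the names of the assertions that fail after `step`."""
    return [
        name
        for name, holds in ASSERTIONS.get(step, {}).items()
        if not holds(state, instance)
    ]


# The log says "Remove", but the instance is still listed in the ASG.
state = CloudState(elb_members=set(), asg_members={"i-1"}, running={"i-1"})
violations = check("Remove", "i-1", state)
print(violations)
```

Keeping the assertions in a table keyed by step is what lets the log line select which checks to run, so only the checks relevant to the current step are evaluated.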


Conformance Checking: how it works

Log lines are compared with the expected process as they arrive:

• Remove ...
• Terminate ...
• Wait ...
• Terminate ... ??? (does not fit the expected process at this point)
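One simple way to express this check is to replay the observed activities against a model that lists which activities may follow each other. The sketch below assumes such a directly-follows model; `MODEL` and `first_violation` are hypothetical, and a production conformance checker would typically replay against a richer process model.

```python
from typing import Dict, List, Optional, Set

# Hypothetical process model for upgrading one instance at a time:
# each activity maps to the activities allowed to follow it.
MODEL: Dict[str, Set[str]] = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove"},  # loop back for the next instance
}


def first_violation(activities: List[str]) -> Optional[int]:
    """Return the index of the first non-conforming activity, or None."""
    previous = "START"
    for index, activity in enumerate(activities):
        if activity not in MODEL.get(previous, set()):
            return index
        previous = activity
    return None


observed = ["Remove", "Terminate", "Wait", "Terminate"]
print(first_violation(observed))
```

Because the check needs no assertion written in advance, it can flag errors nobody anticipated, such as a skipped or repeated step.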


POD-Diagnosis: how it works

• Fault trees are built as knowledge base

• On-demand diagnosis tests to locate the (root) causes

• Process context used for FT pruning
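The pruning idea can be sketched as a tree walk that skips any branch not applicable to the current process step and runs diagnostic tests only on the remaining leaves. Everything here is illustrative: `FaultNode`, `diagnose`, the fault names, and the fake `probe` tests are assumptions, not the knowledge base from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FaultNode:
    name: str
    # Process steps in which this fault can occur; None means any step.
    steps: Optional[set] = None
    # On-demand diagnostic test; returns True if the fault is confirmed.
    test: Optional[Callable[[], bool]] = None
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node: FaultNode, step: str) -> List[str]:
    """Return confirmed leaf causes, skipping branches irrelevant to `step`."""
    if node.steps is not None and step not in node.steps:
        return []  # pruned using process context
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes: List[str] = []
    for child in node.children:
        causes.extend(diagnose(child, step))
    return causes


ran: List[str] = []


def probe(name: str, result: bool) -> Callable[[], bool]:
    """Build a fake diagnostic test that records that it was run."""
    def run() -> bool:
        ran.append(name)
        return result
    return run


tree = FaultNode("upgrade step failed", children=[
    FaultNode("AMI unavailable", steps={"New instance"}, test=probe("ami", True)),
    FaultNode("wrong launch configuration", steps={"New instance"},
              test=probe("launch-config", False)),
    FaultNode("ELB de-registration failed", steps={"Remove"}, test=probe("elb", True)),
])
causes = diagnose(tree, "New instance")
print(causes, ran)
```

The ELB test never runs because the error was detected in the "New instance" step, which is how process context shortens diagnosis.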


Evaluation: POD-Detection/Diagnosis

• Experiments
  – Rolling upgrade of 100+ node cluster in AWS
    • Fault injection plus confounding processes: random kill, scaling in, ...
• Detected errors
  – Assertion checking: known errors and global errors
    • Examples: key management, launch configuration, images
  – Conformance checking: unknown errors
    • Skipped or undone activities
• Timing and precision
  – Compared with Asgard/monitoring internal mechanisms
    • Detected more errors earlier
  – Diagnosis: limited to known causes in the fault tree (FT)
    • 95th percentile diagnosis time under 4 s; accuracy ranges from 80% to 100%




Our Approach

• Incorporating change-related knowledge into system management
  – Sporadic operation knowledge
    • Process-Oriented Dependability: error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations



Alerting Management using Process Context

• Do not turn off alerts during sporadic operation
• Dynamically suppressing and annotating alerts using sporadic operation knowledge
  – CPU sensitive?
  – Network sensitive?
  – I/O sensitive?
  – Health checking sensitive?
• Benefits
  – Reduce false positives of alerts
  – Add context to system monitoring data for later capacity planning and performance tuning
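A small sketch of suppression and annotation, assuming a table that records which metrics each operation step is expected to disturb. The table contents, the `Alert` fields, and `apply_process_context` are hypothetical and only illustrate the mechanism.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical knowledge: which metrics each operation step is expected
# to disturb (CPU, network, I/O, health checking).
EXPECTED_IMPACT: Dict[str, Set[str]] = {
    "Terminate": {"health_check", "cpu"},
    "Snapshot": {"io"},
}


@dataclass
class Alert:
    metric: str
    message: str
    suppressed: bool = False
    context: Optional[str] = None


def apply_process_context(alert: Alert, active_step: Optional[str]) -> Alert:
    """Annotate an alert with the running step; suppress it if expected."""
    if active_step is None:
        return alert  # no sporadic operation running: leave untouched
    alert.context = f"raised during step '{active_step}'"
    if alert.metric in EXPECTED_IMPACT.get(active_step, set()):
        alert.suppressed = True
    return alert


expected = apply_process_context(Alert("health_check", "instance unhealthy"), "Terminate")
unexpected = apply_process_context(Alert("io", "disk latency high"), "Terminate")
print(expected.suppressed, unexpected.suppressed, unexpected.context)
```

Alerts that are not suppressed still carry the step annotation, so the monitoring data keeps its context for later capacity planning.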


Availability Analysis for Sporadic Operation

• Sporadic operation’s impact on availability
  – Using Stochastic Reward Network (SRN)
  – Maintenance/backup/recovery operation
• Architecture has effect as well

Qinghua Lu, et al., “Incorporating uncertainty into in-cloud application deployment decisions for availability,” IEEE 6th International Conference on Cloud Computing, June 2013.



Availability Estimation for Different Deployment and Recovery Approaches


Event-Aware Workload Prediction

[Diagram labels: Upcoming Event; Repository; Event Workload Model; Predict Workload; Workload Prediction]

Matthew Sladescu, et al., “Event aware workload prediction: A study using auction events,” International Conference on Web Information Systems Engineering (WISE), 2012.


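[Plot: Predicting Workload. Axes: Time (min) and Bids/min, with the time to predict marked. Individual curves are combined (+, +, =) into the predicted workload.]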
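One reading of the "+ + =" plot is that the workload of each upcoming event is taken from a stored event workload model and the per-event curves are summed. The sketch below follows that reading and should be treated as an assumption, not the method of the cited study; `AUCTION_PROFILE`, its values, and `predict` are invented for illustration.

```python
from typing import Dict, List

# Hypothetical event workload model for an auction: expected bids/min,
# keyed by the number of minutes before the auction closes.
AUCTION_PROFILE: Dict[int, float] = {3: 2.0, 2: 5.0, 1: 12.0, 0: 30.0}


def predict(horizon: int, closing_times: List[int], baseline: float = 0.0) -> List[float]:
    """Sum the profile of every upcoming auction on top of a baseline."""
    prediction = [baseline] * horizon
    for close in closing_times:
        for minutes_before, rate in AUCTION_PROFILE.items():
            minute = close - minutes_before
            if 0 <= minute < horizon:
                prediction[minute] += rate
    return prediction


# Two hypothetical auctions closing at minutes 3 and 5 of a 6-minute window.
forecast = predict(horizon=6, closing_times=[3, 5], baseline=1.0)
print(forecast)
```

The overlap at minutes 2 and 3 shows why knowing the event schedule matters: the combined peak is higher than either auction would produce alone.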


Summary

• System is undergoing continuous changes
  – Continuous deployment + cloud uncertainty and limited visibility
• Use change-related knowledge in system management
  – Sporadic operation knowledge
    • POD: error detection and diagnosis under continuous change
    • Alerting management using process context
    • Availability analysis for sporadic operations
• We need industry help and collaboration
  – Logs, trials, case studies and feedback

Book: http://www.ssrg.nicta.com.au/projects/devops_book/

Contact: {[email protected]}
