Ramkumar Natarajan & Manager - NFT Cognizant Technology Solutions Resilience Validation

Source: qaistc.com/2017/wp-content/uploads/2017/12/ramkumar_natarajan.pdf


Page 1

Ramkumar Natarajan & Manager - NFT

Cognizant Technology Solutions

Resilience Validation

Page 2

Abstract

With the growing complexity of application and infrastructure landscapes, the risk of failure is ever increasing. Building a robust, highly available and fault-tolerant software system is both a challenging task and an absolute necessity in today's world. Even a minute of downtime can cost millions of dollars.

Domino effects

What if? analysis

Myths & Facts

Major Incidents

Key tenets

Feature Groups

Page 3

Need for Resilience Validation

• $500,000: outage of 5 minutes. Overall Internet traffic dropped by 40%.

• $150 M: outage of 5 hours. About 1,000 flights cancelled.

• $81 M: outage of 12 hours. About 2,100 flights cancelled.

• $150 M: outage of 2.5 hours. Amazon S3 is used by around 148,213 websites.

Is our application resilient enough to handle unforeseen or unplanned events? Being prepared to maneuver through disruptive events has become an essential part of business success in the modern era.

Major Outages and Domino effects

Page 4

Business Impacts

Service Level Agreement Penalties

Lost Opportunities & Revenue losses.

Impact on reputation & quality.

Impact to Employee Productivity.

Impact to customers and strategic partners.

Outage costs are trending upward as the user base and system complexity expand exponentially.

Page 5

Myths & Facts

1. Myth: A Disaster Recovery / passive system will protect against all kinds of failures without any impact.
   Fact: Impact analysis needs to be done to identify any challenges or issues during failover.

2. Myth: Active nodes in a cluster can handle failures of other nodes / data centers effectively.
   Fact: Failover testing needs to be performed to validate the system's ability to handle usage and the burden that additional use may place on the system.

3. Myth: Resiliency testing is a laborious process and requires the availability of multidisciplinary team members.
   Fact: Just like other testing, a few of the steps, such as failure simulation, can be automated with little or no manual intervention.

4. Myth: Migrating to cloud makes applications resilient.
   Fact: Moving to cloud doesn't guarantee resiliency.

5. Myth: Off-the-shelf software products (like Oracle, Couchbase) are already tested and certified.
   Fact: Testing needs to be done in the client environment to find the limitations and the right configurations.

Page 6

Resilience Tenets: how should an application be?

Fault Tolerance: handle hardware & component failures gracefully.
Examples: rainy-day scenarios that introduce CPU and memory spikes, node & process failures.

Load Tolerance: manage load variations without any issues.
Examples: Black Friday and product release peak load events.

Latency Tolerance: manoeuvre network & downstream latencies efficiently.
Examples: network & downstream system latencies.

Data Tolerance: handle erroneous situations & data that is outside the range.
Examples: errors, exceptions & bulk data from downstream systems.

Page 7

Objectives of Resiliency Validation

Identify & fix Single Point of Failure.

Know the unknown – Engineering, Functional & Performance Flaws.

Enable resilient features & provide self healing abilities.

Reduce MTTD (Mean Time to Detect) & MTTR (Mean Time To Recovery).

Improve high availability (Fall back) & Scalability by embracing next-gen HA technologies.

Effort required to restore system health.

Resilience Feature Groups: High Availability, Scalability, Switchover, Dependency, DR, Recovery, Data Validation, CI
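The MTTD and MTTR objective above can be tracked with a small calculation over incident records. The sketch below is illustrative only: the `Incident` record and its field names are assumptions, and MTTR is measured here from the start of the failure (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    """One production incident; the field names are illustrative assumptions."""
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mttd_minutes(incidents):
    """Mean Time to Detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents):
    """Mean Time To Recovery: average of (recovered - started), in minutes.

    Assumption: recovery time is counted from the start of the failure.
    """
    return mean((i.recovered - i.started).total_seconds() / 60 for i in incidents)
```

Comparing these two numbers before and after a resiliency exercise gives a simple way to check whether detection and recovery are actually getting faster.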

Page 8

Application Resilience

Architecture & System Review

Incident Analysis

Near Neighbor Analysis

Data Analysis

FMEA Analysis

Vulnerability and prioritization matrix

Product Backlog

Resiliency Implementation

Resilience implementation and validation for different applications and streams were broadly categorized across three planes: Application, Platform & Operational Readiness.
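One common way to turn an FMEA analysis into a prioritization matrix is the Risk Priority Number (RPN), the product of severity, occurrence and detection scores. The slides do not say which scoring the team used, so the sketch below is an assumption for illustration: the 1 to 10 scales are the conventional FMEA ones, and the `FailureMode` and `prioritize` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """A candidate failure mode; scores use the conventional 1-10 FMEA scales."""
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)
```

The highest-RPN items would be the first candidates for the product backlog.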

Page 9

Validation - War gaming

War gaming is the process of simulating failures or production events and observing the system's behavior.

War gaming helps augment operational efficiency and stability through discovery, practice and teamwork.

The goal is to ensure that our customers experience the least possible impact during unanticipated production events.

It validates the fault, load, latency and data tolerance levels of the application.
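A war-gaming exercise boils down to injecting a failure and observing what happens. The following is a minimal sketch of that idea at the level of a single function call, not the tooling used in the engagement; `with_faults`, `observe` and `InjectedFault` are hypothetical names introduced for illustration.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_faults(func, error_rate=0.0, delay_seconds=0.0, rng=None):
    """Wrap func so calls are delayed and fail with the given probability."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if delay_seconds > 0:
            time.sleep(delay_seconds)      # simulate network or downstream latency
        if rng.random() < error_rate:      # simulate a component failure
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def observe(call, attempts):
    """Run call repeatedly and count how many attempts succeeded or failed."""
    outcome = {"ok": 0, "failed": 0}
    for _ in range(attempts):
        try:
            call()
            outcome["ok"] += 1
        except InjectedFault:
            outcome["failed"] += 1
    return outcome
```

Wrapping a downstream client with, say, `error_rate=0.2` and watching the counts gives a first indication of how the caller copes with partial failure.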

Page 10

Key Industry Trends

Hystrix is a fault tolerance library developed by Netflix to stop cascading failures and enable resilience in distributed systems.

Key Features

Fault Throttling

Circuit Breaker.

Fall back
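To make the circuit breaker and fallback features concrete, here is a minimal sketch of the pattern. It is not Hystrix's API (Hystrix is a Java library); the `CircuitBreaker` class, its parameters and the injectable `clock` are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()                   # open: skip the real call
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the failing dependency is not called at all, which is what keeps one slow or broken downstream system from dragging its callers down with it.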

Page 11

Cognizant Chaos Framework


Page 12

Success Stories

• YTD Incident 20% than 2015

• YTD Incident duration (MTTR) 42% than 2015

• Total outage by application platform reduced to 5% from 30%

• More than 10 outages due to downstream failures were avoided with Fault Throttling

• MTTR of the downstream system improved to 5-10 minutes, from 1-2 hours

[Chart: Defect Counts by category (Availability, Load, Latency, Data); series: Monkeys Executed, Defects Found]

outage

Self recovery

Client: Largest cable company and home Internet service provider in the United States.

Services: Cable television, Internet, telephone service and home security to both residential and commercial customers.

Page 13

Benefits

• Ensures continuous response to end customers.

• Resiliency testing avoids or reduces the domino effect.

• Meets Quality of Service (QoS) beyond BCP and DR.

• Cost of downtime is reduced.

• Shorter recovery time following a service disruption.

• Customer satisfaction is improved.

Page 14

Takeaways

Expect the unexpected.

Key tenets of resilient application.

Resiliency testing addresses not only downtime but also load, data and latency.

Testing and engineering approaches.

Failure simulations can be automated using Chaos Framework.

Resilience testing has moved from good-to-have to must-have.
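As a rough illustration of automated failure simulation, the sketch below runs randomly chosen experiments ("monkeys") grouped by tolerance category and tallies how many were executed and how many exposed a defect. It does not reflect the Cognizant Chaos Framework's actual interface, which the slides do not describe; `run_monkeys` and the experiment callables are hypothetical.

```python
import random
from collections import Counter


def run_monkeys(monkeys, rounds, seed=None):
    """Run randomly selected chaos experiments and tally the results.

    monkeys: list of (category, experiment) pairs, where experiment() returns
             True if the system coped with the injected failure.
    Returns two Counters keyed by category: experiments executed, defects found.
    """
    rng = random.Random(seed)  # a seed makes a run repeatable
    executed, defects = Counter(), Counter()
    for _ in range(rounds):
        category, experiment = rng.choice(monkeys)
        executed[category] += 1
        if not experiment():   # the system did not cope: record a defect
            defects[category] += 1
    return executed, defects
```

In practice each experiment would wrap a real failure injection (killing a process, adding latency, sending out-of-range data) followed by a health check.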

Page 15

References & Appendix

• Cognizant internal reference links & materials.

• https://github.com/Netflix/

• http://www.itproportal.com/2015/10/07/how-to-overcome-business-continuity-challenges/

• http://searchcloudcomputing.techtarget.com/definition/cloud-computing

• http://people.cs.uchicago.edu/~aelmore/papers/dasfaa.pdf

Page 16

Author Biography

Ramkumar Natarajan has 14+ years of experience, specializing in performance testing and engineering. Ram is part of the Cognizant QE&A NFT CoE and leads the resiliency testing initiatives. Ram holds a Master's degree in Computer Applications from Madras University. Earlier, he worked with Wipro Technologies and Ramco Systems.

Page 17

Questions & Answers

Page 18

Thank You!!!