Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps •...

Next Generation of DevOpsAIOps in Practice @Baidu

Xianping Qu & Jingjing Ha

About Baidu

Agenda

• History of Baidu SRE team• Next generation of DevOps – AIOps• Best practice based on AIOps• Future

Tools (2007-2009)

Dev OPQA

• DevOps– Deployment– Monitoring– Budgeting– Consulting– …

• Problems– Human labor

Systems (2009-2012)

Operation System

Dev OPQA

• Building operation systems– Service management system– Monitoring system– Deployment system– Traffic scheduling system– Naming service– …

• Problems– Human labor（GUI, configure）

Platforms (2012-2014)

Operation Service

Dev OPQAO

peration Platform

• Building operation platforms– API– Configurable– Executable– …

• Problems– Reusability– Scalability

Standardization (2014-)

• Building operation standards– Unified language– Unified method– Unified solution– …

• Problems– Need a brain

Services, IDCsClusters, Servers

AIOps (2014-)

• Intelligent Operation Platforms– Development framework– Big data– Algorithm

• Data mining, machine learning…

Development framework

User Code 1 User Code 2 User Code …

User Code Management

Develop

Operation Abstraction Layer

Interface

Driver Runtime Environment

Cloud/Paas Scheduler

Develop Tool-chain

Simulation

Operation Knowledge Database

product

Meta data Metric data Event data

app service instance

host IDC

person

network

Unified data

Data source

Dataproduction

process

mapping

service management model metaDB，TSDB，eventDB

mining

API & View

cleaning

feedback

throughput latency

cpu mem io

anomaly

root cause

change

remediation

raw data

authority, quota

controlling

bandwidth

resulting data

structuraldata

calculating

Management platforms Monitoring platforms Operation platforms

Solution

Deployment Management Incident Management

General Components and Tools（transmission、storage、scheduling…）

Operation Development Framework ( SDK, RE )

Ops Algorithm Development

Solution Development

Ops Platform Development

Anomaly detection

Traffic scheduling

Root cause analysis

Trend forecasting

Other data mining &machine learning

algorithms

Best practice based on AIOps• Incident Management

– Single cluster stop-loss by traffic shifting

• Deploy Management– Unattended deployment with automated checker

• Consulting– ChatBots do Consultation

When will a failure occur?• Infrastructure issue • Program defects

• Change exception • Dependent service unavailable

How to stop loss?

• Limited Failure in one cluster

– Deployment isolation

– Dependency decoupling

– Reduce global risk

When single cluster fails do perform traffic shifting

• Capacity redundancy

– Availability and cost trade-off

– N+M redundancy

– Service degradation

Two layer traffic shifting @Baidu

Data Center1

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Data Center2

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Data Center3

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

User set

Dependent Service cluster

Internet User set

shift traffic between user set and the edge node

shift traffic between front end and back end

BGW: Baidu Gate Way, layer-4 load balancer

BFE: Baidu Front End, layer-7 load balancer

Two layer traffic shifting @Baidu

• shift traffic between user set and edge node

– 10 minute to shift 80% traffic to the healthy edge node because

of DNS caching in the client side and ISP side

• shift traffic between front end and back end

– 10 second to shift 100% traffic to the healthy backend by

changing BFE’s routing configuration

Shift traffic between front end and back end

Concerns:• Service Capacity• Intranet bandwidth

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

User set

Internet User set

Scenarios：• Web service cluster • Dependent service cluster• Internal network switch

Data Center1 Data Center2 Data Center3

Shift traffic between user set and the edge node

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

User set

Internet User set

Concerns:• Bandwidth• BGW/BFE Capacity• Delay

Scenarios ：• BGW & BFE• External network switch

Shift traffic between user set and the back end

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

User set

Internet User set

Concerns:• Bandwidth• BGW/BFE Capacity• Delay• Service Capacity• Intranet bandwidth

Scenarios:：• Entire data center failure

Single cluster stop-loss before AIOPs

Manual decision making

Perception Judgment Execution

Scattered Scripts

• Depends on personal experience

• Partial information

Traditional monitoring

• Handcrafted anomaly detection

• Lots of False Negatives and False Positives

• Poor code quality• Low availability

• Critical moment is unreliable• Wrong or slow decision making • No real-time feedback leads

to more serious failures

Single cluster stop-loss after AIOPs

Automated decision making Standard framework

• Rely on Algorithm Platform • Global information

Intelligent monitoring

• Intelligent anomaly detection• Abnormal event stored in

• High precision and recall

• Development framework• Deployment framework

• High development efficiency• High availability

• Accurate and quick decision making

• Feedback control

Perception Judgment Execution

The architecture of single cluster stop-loss

Abnormalevent DB

Network monitoring

Service Health Control

BFE Configuration Management

ServiceDegradation

VIP Health Control

DNS ConfigurationManagement

Internal Traffic Load balancer

Capacity DB

Metric DB

Businessmonitoring

Applicationmonitoring

Execution

Judgment

Perception

Anomaly detectionAlgorithm

Traffic shifting

Algorithm

Algorithm Platform

Unattended deployment with automated checker

beginpre-check

pipe line

deploycheck

post-check

deploycheck

…confirm

Person Robot

confirm

Manual multiple metric dashboards check

Manual single anomaly event dashboard check

Automated API check and notify

Intelligent Anomaly detection

Check process optimization

ChatBots do consultation • Key points of building ChatBots

query example intention slots

xx模块从昨晚到现在有上线么？ Change query Module : xx；Time：从昨晚到现在

今天xx模块有全流量上线么？ Change query Module : xx； Time : 今天 ;Stage：全流量

• Change consultation scenario

1. Accumulate manually labeled query

2. Train an NLP model to understand the questions offline

3. Translate natural language questions into structured questions

4. Query operation knowledge database

5. Display results on SRE Service desk

Future

• Dynamicresourceallocation• Capacitymanagement• Identificationofperformanceproblems• …

quxianping@baidu.comhajingjing@baidu.com

Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps •...

Documents

Getting Acquainted with Moogsoft AIOps Platform › rs › 092-EGH-780 › images › ...Getting Acquainted with Moogsoft AIOps Artificial Intelligence for IT Operations (AIOps) is

Accelerate Incident Response with AIOps · For DevOps teams, SRE teams, and on-call teams, New Relic AI. is an AIOps solution to help manage ... Alertmanager , Splunk, Grafana, Amazon

Deep cloud observability and advanced AIOps are key to

Next Generation Platform-as-a-Service (NGPaaS) From DevOps … · 2019-07-05 · Next Generation Platform-as-a-Service (NGPaaS) From DevOps to Dev-for-Operations 3 Table of contents

How DevOps engineer can advance real DevOps...Influencing DevOps without Authority How "DevOps engineer" can advance real DevOps

Devops Devops Devops

Leveraging the power of AIOps, IT Operations elevates to a ...communications.rightstar.com/acton/attachment/3662/f-04dd/1/-/-/-/-/RightToThePoint...Leveraging the power of AIOps, IT

Everest Group-AIOps – IT Infrastructure Services for the ... · Leading-edge IT delivery models such as cloud, DevOps, as-a-service, and user environment virtualization have consequently

Deliver DevOps with the Next Generation of PaaS

An AIOps solution for proactive operations

AIOps深度案例分享 - elasticsearch · AIOps深度案例分享 Elastic Meetup 成都分享会 leiwang@orientsoft.cn &

eSynergy Andy Hawkins - Enabling DevOps through next generation configuration management

Whitepaper Escaping the Alert Vortex with AIOps...5 Whitepaper: Escaping the Alert Vortex with AIOps ©2020 Intellyx LLC. Augmented or applied intelligence? Newer definitions of AIOps

EMA Radar for AIOps: Q3 2020 A Guide for Investing in ...€¦ · 2 | EMA RADAR FOR AIOPS Q3 2020 INTRODUCTION VENDOR PROFILE: RESOLVE SYSTEMS USE CASE PERSPECTIVES 2020 IIDE PERFORMAE

AIOps done right - Dynatrace · ©2019 Dynatrace Dynatrace’s open AIOps solution enables truly autonomous cloud operations. In contrast to machine learning tools, Davis — Dynatrace’s

AIOps: The Next Generation of ITOA - Splunk-Blogs...In-house Applications (Java, Node, .Net, Go, etc.) COTS and Open Source Databases Middleware 3 Sample Application Interaction 4

The AIOps Story - Micro Focus · Top Analytic Priorities for AIOps are: AIOps Investments Should Support Multiple Use Cases Capturing Interdependencies to put AIOps Analytics into

Panduan Dasar AIOps - Data-to-Everything

AIOps for Enhanced Business Performance · foundation that distinguishes truly unifying AIOps platform investments from more niche experiments in machine learning should include the

AIOps Essentials Optimizing Digital Interactions with AIOps · 2019-08-24 · AIOps Essentials Optimizing Digital Interactions with AIOps ... exacting standards. In fact, a survey