Copyright©2015 NTT DOCOMO, INC. All rights reserved.

Page 1

After One Year of OpenStack Cloud Operation (NTT DOCOMO)

Ken Igarashi (NTT DOCOMO Inc.)

Asako Ishigaki (NTT Software)

Akihiro Motoki (NEC)

Page 2

Our Project

Page 3

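[Figure: project timeline]

Phases (figures in parentheses as shown on the slide):

• Scalable Test using 100 nodes (10)
• System Design (8)
• Recovery Tests (12)
• Racking and Cabling (14)
• 24/7 support (14)
• User Support (+x)

Time axis: 2014-6, 2014-8, 2014-11, 2015-2, 2015-5, 2015-8, 2015-11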

Page 4

o Team Rules (Culture)

Focusing on using OpenStack instead of developing OpenStack

Think how to use it.

Don’t think OpenStack can’t do XXXX.

Reducing Opex/Promoting Automation

Operation tools

• “Anything that a human needs to do more than twice must be automated.”

Reduce the number of operators through HA and self-healing.

Page 5

Operation

Page 6

o OpenStack Configuration (http://bit.ly/1DbJPUO)

Double redundancies for hardware

Triple redundancies for software

[Figure: system configuration. Components shown: Nova compute nodes hosting VMs; MySQL (Galera) with DB1, DB2, DB3, DB4 and an Arbitrator; OpenStack APIs; Zabbix; load balancers (LB); Neutron Agents; PXE, DNS, DHCP; MaaS; RabbitMQ]


Page 8

o Deployment

CMDB Registration

Page 9

o Choose playbooks for Ansible Dynamic Inventory

Ansible
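As a rough illustration of how CMDB registration can feed an Ansible dynamic inventory, the sketch below groups hosts by role and prints the JSON that Ansible reads from an inventory script called with `--list`. The CMDB records, role names, host names and addresses are hypothetical (the slides do not show the real schema); only the output layout (groups with `hosts`, plus `_meta.hostvars`) follows Ansible's inventory-script convention.

```python
import json

# Hypothetical CMDB export: the real CMDB schema is not shown in the slides.
CMDB_RECORDS = [
    {"hostname": "ctrl01", "role": "controller", "ip": "192.0.2.11"},
    {"hostname": "comp01", "role": "compute", "ip": "192.0.2.21"},
    {"hostname": "comp02", "role": "compute", "ip": "192.0.2.22"},
]


def build_inventory(records):
    """Group hosts by role in the JSON layout Ansible expects from `--list`."""
    inventory = {"_meta": {"hostvars": {}}}
    for rec in records:
        group = inventory.setdefault(rec["role"], {"hosts": []})
        group["hosts"].append(rec["hostname"])
        inventory["_meta"]["hostvars"][rec["hostname"]] = {"ansible_host": rec["ip"]}
    return inventory


if __name__ == "__main__":
    # Ansible runs the script with --list and parses the JSON on stdout.
    print(json.dumps(build_inventory(CMDB_RECORDS), indent=2))
```

With an inventory like this, a playbook can target a group such as `compute` without a hand-maintained hosts file.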

Page 10

o Deployments

Common: network, account, logging, Zabbix agent, drivers/firmware x 37

OpenStack: Nova, Swift, Neutron, ……. x 62

HA Configuration

[Figure: example deployment flow, “Install HDD Driver”. Labels shown: initial setup, update, compile, kernel, driver, firmware, filesystem, development environment]

Page 11

Monitoring System

Page 12

o Monitoring System

[Figure: monitoring system. Monitored targets: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Data Bases. Monitoring stack: Fluentd, Elasticsearch, Kibana, Zabbix. Coverage labels: “Weekday daytime” and “24h / 365d”]

Page 13

Monitoring Items and Self Healing

Category | Targets | Monitoring Items | Self Healing
General | Memory, CPU, Network, HDD | 1,970 | 25
OpenStack | VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Data Bases | 3,957 | 59

Page 14

o RabbitMQ

Configuration

3 node cluster

cluster_partition_handling, autoheal

Monitoring

Split Brain check:

• rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'

Port Check (5672, 25672)

Process Check

• beam.smp

• rabbitmq-server

At least one node running (1/3)

• {Openstack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3

• {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
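One way to read the two Zabbix expressions: `grpsum` adds up the latest check result (1 = up, 0 = down) across the RabbitMQ host group, and `count(#3,0,"eq")=3` holds when that sum was 0 for the last three samples, meaning no node at all was running. The sketch below reproduces that logic outside Zabbix. It is an interpretation of the expressions, and the node names and sample values are made up.

```python
def group_sum(sample):
    """Sum of per-node check results (1 = up, 0 = down), like Zabbix grpsum."""
    return sum(sample.values())


def all_nodes_down(history, checks=3):
    """True when the group sum was 0 for each of the last `checks` samples."""
    recent = history[-checks:]
    return len(recent) == checks and all(group_sum(s) == 0 for s in recent)


# Hypothetical samples of the port 25672 check on three nodes, oldest first.
history = [
    {"mq1": 1, "mq2": 1, "mq3": 1},
    {"mq1": 0, "mq2": 0, "mq3": 1},
    {"mq1": 0, "mq2": 0, "mq3": 0},
    {"mq1": 0, "mq2": 0, "mq3": 0},
    {"mq1": 0, "mq2": 0, "mq3": 0},
]
print(all_nodes_down(history))      # True
print(all_nodes_down(history[:4]))  # False
```

Requiring three consecutive zero samples avoids alerting on a single missed check while one node is still serving.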

Page 15

o MySQL

Configuration

4 Nodes + 1 Arbitrator

Monitoring

Cluster Check

• wsrep_local_recv_queue

• wsrep_local_send_queue

• wsrep_flow_control_paused

• wsrep_local_commits

[Figure: MySQL cluster behind a load balancer (LB) carrying R/W traffic, together with the Arbitrator]
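The cluster check above can be sketched as a threshold test over the `wsrep_*` status values. The threshold numbers below are illustrative assumptions, since the slides do not state the values used. `wsrep_local_commits` is a counter, so it is left out of the threshold table and would normally be watched as a rate.

```python
# Thresholds are illustrative assumptions; the slides do not state the values used.
THRESHOLDS = {
    "wsrep_local_recv_queue": 10,
    "wsrep_local_send_queue": 10,
    "wsrep_flow_control_paused": 0.1,
}


def check_cluster(status, thresholds=THRESHOLDS):
    """Return the names of wsrep status values that exceed their threshold."""
    problems = []
    for name, limit in thresholds.items():
        value = float(status.get(name, 0))
        if value > limit:
            problems.append(name)
    return problems


# Values as they might come back from SHOW GLOBAL STATUS LIKE 'wsrep_%'.
sample = {
    "wsrep_local_recv_queue": "42",
    "wsrep_local_send_queue": "0",
    "wsrep_flow_control_paused": "0.35",
    "wsrep_local_commits": "12345",
}
print(check_cluster(sample))
# ['wsrep_local_recv_queue', 'wsrep_flow_control_paused']
```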

Page 16

o MySQL Cluster

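[Figure: Galera replication flow between a Master and a Slave. The MySQL client sends a write to the Master; the write passes through the send_queue and Replication to the Slave’s recv_queue; each node commits to its own disk. The Master waits until it receives OK from replication before returning OK to the client.]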

Page 17

o MySQL Cluster Freeze

[Figure: the same replication flow as on the previous slide (Master, Slave, send_queue, recv_queue, commit to disk, “Wait until receive OK from replication”), with a problem (👿) marked in the flow]

• Disk Failure: 😀 (removed from cluster)

• Disk Speed Throttling: 😢


Page 19

Prohibited Actions while MySQL Cluster Freeze

○ Prohibit some self-healing actions

Do not restart some OpenStack processes

– neutron-plugin-openvswitch-agent

Do not reboot network nodes

– lose network reachability (can’t recreate network namespace)

All the VMs lose connections

Solved at Liberty?

Page 20

o Throttling happens during DB backup

Limit Backup Node

[Figure: the LB sends R/W traffic to the cluster; backups are limited to one node]

Backup Method

• LOCK TABLES FOR BACKUP (online)

1. Take from cluster (Donor/Desynced)

2. DB lock and do backup (FLUSH TABLES WITH READ LOCK)

3. Return to cluster (wsrep_desync=OFF)

– wsrep_local_recv_queue

– wsrep_local_commits
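The three numbered steps can be sketched as a small wrapper that desyncs the backup node, takes the lock, runs the backup, and always returns the node to the cluster. This is a sketch under assumptions, not the team's actual script: `execute` and `run_backup` are hypothetical hooks supplied by the caller, and `execute` must reuse a single database session because `FLUSH TABLES WITH READ LOCK` is held per session.

```python
def desynced_backup(execute, run_backup):
    """Run the three steps from the slide against one backup node.

    `execute` sends one SQL statement to that node and `run_backup` performs
    the dump; both are supplied by the caller (hypothetical hooks).
    """
    execute("SET GLOBAL wsrep_desync = ON")       # 1. desync the node (Donor/Desynced)
    try:
        execute("FLUSH TABLES WITH READ LOCK")    # 2. lock the DB and back up
        try:
            run_backup()
        finally:
            execute("UNLOCK TABLES")
    finally:
        execute("SET GLOBAL wsrep_desync = OFF")  # 3. return to the cluster


# Dry run: record the statements instead of sending them to a database.
statements = []
desynced_backup(statements.append, lambda: statements.append("-- backup runs here"))
print("\n".join(statements))
```

The `try`/`finally` nesting means a failed backup still releases the lock and turns `wsrep_desync` off, so the node is not left outside the cluster.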

Page 21

Log Analytics

Kibana

Page 22

Purpose of Log Analytics

(1) Detect critical system failures: we have to recover immediately.

(2) Detect malicious access: we need to notify users.

(3) Detect non-critical errors: better to be fixed as soon as possible.

(4) Find errors/warnings that have no service impact: we want to filter them out next time.

Page 23

Treasure Hunt in The Ocean of Logs

○ e.g. Logs of a day

Total: 100 GB, 80M lines

Sum of critical, error and warning logs: 200K lines

The meaningful logs are far fewer:

(1) 0 critical failures (2) 0 malicious accesses

(3) 6 non-critical failures (4) 6 ignorable failures

[Pie chart: Breakdown of Logs. Legend: Critical, Error, Warning, Info, Debug, Other. Values shown: 0%, 0%, 1%, 30%, 39%, 30%]

[Pie chart: breakdown by source. Legend: HW, OS, OpenStack backend, OpenStack, Operation tools. Values shown: 0%, 24%, 24%, 49%, 3%]

Page 24

Log Analytics Based on White/Black List

○ We analyze logs to enhance our black list and white list.

○ Logs found in our black list are sent to Zabbix.

[Figure: logs flow through the black list (to Zabbix) and the white list (to trash); the rest are analyzed in Kibana. Analysis results are added to both lists, so the lists expand and the logs left to analyze reduce.]

Page 25

How to Adopt Black/White List Using Fluentd

[Figure: Network, Control and Compute nodes run fluentd, and the LB and UTM send logs via rsyslog; all logs arrive at the Log Server (Fluentd). The Log Server feeds Elasticsearch, which Kibana refers to for graphs, and calls zabbix_sender to raise Zabbix alerts.]

On the Log Server, Fluentd:

• Adds an “ignorable” flag according to the white list

• Puts metadata to create graphs from the logs

• Notifies Zabbix according to the black list

Page 26

How to Adopt Black/White List Using Fluentd

Example: three log lines arrive at the Log Server (Fluentd):

1. syslog 10:01 crit: hardware failure
2. rsyslog 10:03 warn: IDS: from x.x.x.x
3. api.log 10:04 ERROR: invalid request format

Records after Fluentd adds metadata:

field | 1 | 2 | 3
path | syslog | rsyslog | api.log
timestamp | 10:01 | 10:03 | 10:04
severity | crit | warn | ERROR
item | - | ids | ignore
source_ip | - | x.x.x.x | -
message | hardware failure | IDS: from x.x.x.x | invalid request format

Outputs: “hardware failure” goes to Zabbix through zabbix_sender; the records go to Elasticsearch, which Kibana refers to for the IDS graph and the crit graph.
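To make the metadata step concrete, here is a minimal sketch that turns the three example lines from this slide into records with the same fields. The parsing rules (the line layout, the IDS pattern, and the single white-list entry) are assumptions chosen only to reproduce the example. The real pipeline runs inside Fluentd, not in Python.

```python
import re

# Hypothetical rules chosen only to reproduce the example on this slide.
WHITE_LIST = [re.compile(r"^invalid request format$")]
IDS_PATTERN = re.compile(r"^IDS: from (?P<ip>\S+)")
LINE_PATTERN = re.compile(r"^(?P<ts>\d{2}:\d{2}) (?P<sev>\w+): (?P<msg>.*)$")


def annotate(path, line):
    """Turn one raw log line into a record with the slide's metadata fields."""
    m = LINE_PATTERN.match(line)
    if m is None:
        raise ValueError("unexpected log format: %r" % line)
    record = {
        "path": path,
        "timestamp": m.group("ts"),
        "severity": m.group("sev"),
        "item": None,
        "source_ip": None,
        "message": m.group("msg"),
    }
    ids = IDS_PATTERN.match(record["message"])
    if ids:
        record["item"] = "ids"
        record["source_ip"] = ids.group("ip")
    elif any(p.search(record["message"]) for p in WHITE_LIST):
        record["item"] = "ignore"  # the "ignorable" flag from the white list
    return record


for path, line in [
    ("syslog", "10:01 crit: hardware failure"),
    ("rsyslog", "10:03 warn: IDS: from x.x.x.x"),
    ("api.log", "10:04 ERROR: invalid request format"),
]:
    print(annotate(path, line))
```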

Page 27

Example of Our White List (with Juno)

(1)

• Count response codes and understand the trend. That’s enough.

^keystonemiddleware\.auth_token \[\-\] Unable to find authentication token in headers$

(2)

• This ERROR means the user’s operation was denied due to quota.

• It has no impact on our system. Should it be an INFO log?

^nova\.api\.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota\..*$

(3)

• This WARNING is caused by the presence of SHUTOFF instances.

• It is a commonplace condition and needs to be ignored.

^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$
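The three white-list entries can be tried out directly as regular expressions. In the sketch below the yen signs from the slide are read as backslashes, and the sample messages are hypothetical lines written to fit the shape the patterns expect.

```python
import re

# The three white-list patterns from the slide, with the yen sign read as a backslash.
WHITE_LIST = [
    re.compile(r"^keystonemiddleware\.auth_token \[\-\] "
               r"Unable to find authentication token in headers$"),
    re.compile(r"^nova\.api\.openstack \[[^\]]*\] Caught error: "
               r"VolumeSizeExceedsAvailableQuota: "
               r"Requested volume or snapshot exceeds allowed Gigabytes quota\..*$"),
    re.compile(r"^nova.scheduler.host_manager \[[^\]]+\] "
               r"Host has more disk space than database expected .*$"),
]


def is_ignorable(message):
    """True if the message matches any white-list pattern."""
    return any(p.match(message) for p in WHITE_LIST)


# Hypothetical messages, written to match the shape the patterns expect.
messages = [
    "keystonemiddleware.auth_token [-] Unable to find authentication token in headers",
    "nova.scheduler.host_manager [req-1] Host has more disk space than database expected (10gb > 5gb)",
    "nova.compute.manager [req-2] Instance failed to spawn",
]
remaining = [m for m in messages if not is_ignorable(m)]
print(remaining)
# ['nova.compute.manager [req-2] Instance failed to spawn']
```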

Page 28

Effect of Our White List

○ We succeeded in reducing logs to be analyzed.

In other words, so many meaningless logs have high log-levels.

Without White List: 160K

With White List: 37

(reduced by 99.98%)

1 year ago: We couldn’t analyze all logs in a day

Today: We can analyze all logs in 2-3 hours a day!
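The 99.98% figure follows from the two counts on this slide. The quick check below assumes that “160K” means exactly 160,000 lines.

```python
before = 160_000  # "160K" on the slide, taken as exactly 160,000 (assumption)
after = 37

reduction = 1 - after / before
print(f"{reduction:.2%}")  # 99.98%
```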

Page 29

Example of Our Black List

(1) Warning alert

• This message indicates a disk problem on a Compute node.

^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$

(2) Information alert

• Corosync needs to clean up its resources.

^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$

(3) Information alert

• Full backup of MySQL failed once.

^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
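A black-list hit ends in a `zabbix_sender` call. The sketch below shows one possible way to build that command line from the three patterns on this slide. The yen signs are read as backslashes, and the pairing of alert levels with patterns follows the order in which they appear, which is an assumption. The Zabbix server name and the item key `log.blacklist.<level>` are hypothetical. The `-z`, `-s`, `-k` and `-o` options are the standard `zabbix_sender` flags for server, host, key and value.

```python
import re

# Black-list patterns from the slide, each paired with an alert level.
# "XXXXX" is the slide's own redaction and is kept as-is.
BLACK_LIST = [
    (re.compile(r"^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$"), "warning"),
    (re.compile(r"^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$"),
     "information"),
    (re.compile(r"^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$"),
     "information"),
]


def zabbix_sender_args(host, message, server="zabbix.example.com"):
    """Return a zabbix_sender command line for a black-listed message, else None.

    The server name and the item key "log.blacklist.<level>" are hypothetical.
    """
    for pattern, level in BLACK_LIST:
        if pattern.match(message):
            return ["zabbix_sender", "-z", server, "-s", host,
                    "-k", "log.blacklist." + level, "-o", message]
    return None


print(zabbix_sender_args("db01", "mysql_fullbackup[4242]: Failed to MySQL fullbackup at 03:00"))
print(zabbix_sender_args("db01", "mysql_fullbackup[4242]: finished"))  # None
```

Returning the argument list rather than running it keeps the sketch free of side effects; a caller could pass the list to `subprocess.run`.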

Page 30

What are operators doing behind the Cloud?

NEC : Takashi Torii

NTT DOCOMO : Ken Igarashi, Jun Ishii

Page 31

o Team Rules (Culture)

Focusing on using OpenStack instead of developing OpenStack

Reducing Opex

Operation tools

• “Anything that a human needs to do more than twice must be automated (scripting).”

[Figure: support structure. Dev/Ops Engineers (L3) and Operators (L1/2), with coverage labels “Weekday daytime” and “24h / 365d”, supported by Tools and a Knowledge Base]

Page 32

o Operation Tools

Deployments

Common: network, account, logging, Zabbix agent, drivers/firmware x 37

OpenStack: Nova, Swift, Neutron, ……. x 62

• HA Configuration

• Rolling updates

Operation x 31

Common: process restart, log correction

OpenStack Operation: usage, VM migration/backup, user add/delete/quota change

OpenStack Monitoring: health check tools

Page 33

Daily operation scheme

What does “Nicchoku” mean?

Page 34

o “Nicchoku” “日直”

The Nicchoku takes charge of the routine work of the day.

Operators take turns in the Nicchoku role every day.

In Japanese schools, the Nicchoku person does chores such as erasing the whiteboard and passing out handouts.

o In our operation, the Nicchoku person has to:

Check security updates

Check firmware updates

Check unhandled/unknown alerts

Check usage of resources

Analyze logs

o Without a plan, these tasks take almost a whole day!

Page 35

o The Nicchoku person carries out the daily routine using the following tools:

• Check security updates: “Nicchoku assistant”
• Check firmware updates: “Nicchoku assistant”
• Check unhandled/unknown alerts: “Nicchoku assistant”
• Check usage of resources: Zabbix graphs
• Analyze logs: Kibana dashboards

o Operators solve the issues that the Nicchoku person has found.

Early detection of problems

o In this session, we present the Nicchoku tools.

Page 36

o The assistant is a web crawler composed of Jenkins jobs and WordPress

o The assistant reports on 3 topics for the Nicchoku at 9:00 AM

Security updates

Firmware updates

Zabbix alerts

o It saves the time we would otherwise spend patrolling many web sites

Page 37

o Assistant reports the latest security advisories.

We don’t have to patrol each web site.

Ubuntu http://www.ubuntu.com/usn/trusty/

OpenStack http://security.openstack.org/

Page 38

o The assistant reports the latest updates of each firmware, such as LB, UTM, Storage, Server.

o Some appliances’ release pages are compatible with the assistant.

o The others can’t be parsed by the assistant yet; we still have to patrol their web sites.

Page 39

o It reports the Zabbix alerts from the last day.

o We can grasp all the alerts at a glance, including resolved problems which are not shown in Zabbix.

[Figure: report screenshot with two sections: “Alerts occurred in online” and “Alerts occurred in maintenance mode / Alerts which are already resolved”]
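The two report sections can be thought of as a simple partition of the day's alerts. The sketch below assumes each alert is a dictionary with `name`, `in_maintenance` and `resolved` fields; these field names are hypothetical and are not taken from the Zabbix API.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Split the day's alerts into the report's two sections."""
    sections = defaultdict(list)
    for alert in alerts:
        if alert["in_maintenance"] or alert["resolved"]:
            sections["maintenance_or_resolved"].append(alert["name"])
        else:
            sections["online"].append(alert["name"])
    return dict(sections)


# Hypothetical alerts from the last day.
alerts = [
    {"name": "disk usage high on comp01", "in_maintenance": False, "resolved": False},
    {"name": "nova-api down on ctrl01", "in_maintenance": False, "resolved": True},
    {"name": "link down on net02", "in_maintenance": True, "resolved": False},
]
print(group_alerts(alerts))
```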

Page 40

o Zabbix screens

o Check resources’ exponential increase/decrease

We can easily check all hardware resources

Page 41

o We can filter out known errors with Lucene queries.

o Ten or more queries are shared on our Wiki page.

o Our assistant sets these queries on the Kibana dashboards for us.
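A query that filters out known errors can be assembled mechanically from a list of known phrases. The sketch below builds a Lucene query string of the kind Kibana accepts. The field names `severity` and `message` and the phrases themselves are assumptions, not the team's shared queries.

```python
def exclude_known_errors(base_query, known_phrases, field="message"):
    """Build a Lucene query string that hides messages containing known phrases."""
    if not known_phrases:
        return base_query
    quoted = [
        '%s:"%s"' % (field, p.replace("\\", "\\\\").replace('"', '\\"'))
        for p in known_phrases
    ]
    return "(%s) AND NOT (%s)" % (base_query, " OR ".join(quoted))


query = exclude_known_errors(
    "severity:ERROR",
    ["Unable to find authentication token in headers",
     "exceeds allowed Gigabytes quota"],
)
print(query)
```

Keeping the phrases in a shared list means a newly understood error needs a one-line addition, not a hand-edited dashboard query.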

Page 42

o We’ve reduced the time required for Nicchoku work.

o Moreover, we’ve succeeded in outsourcing Nicchoku work since September.

Even operators who are not familiar with OpenStack logs just need to check the Nicchoku helper and the knowledge base.

[Bar chart: Required time to complete Nicchoku work (hours). April: 8, September: 4 (50% reduction)]

Knowledge base is our next session’s theme!

Page 43

Knowledge base operation and troubleshooting

Be an expert operator!

Page 44

o What makes operating OpenStack-based systems difficult?

o Many components

Not everyone is well versed in all components

o Each operator has their own specialty among the components

Need to complement each other’s knowledge

o Unifying operators’ skills is important for stable operation

When trouble happens, operators are required to solve all kinds/types of problems

Page 45

o We built a DevOps system around OpenStack with OSS

[Figure: OSS tools (“powered by” logos not preserved) grouped as Private Cloud, Log/Alert Management, Whole Project Management, and CI/CD]

Page 46

o Database of knowledge

The knowledge base is a plugin function of Redmine

o By concentrating all information about operation in one DB, we can easily cope with a problem that someone has already solved once

Avoid reinventing the wheel

Reduce the time to search on the internet

[Figure: the knowledge base gathers Hardware, Nova, Swift, tools usage, and know-how and experience]

We have stored over 1,200 knowledge entries!

Page 47

o Well-known troubles can be recovered quickly by anyone

Not only Nicchoku tasks but also the initial response to troubles is outsourced to non-expert operators

DevOps members can concentrate on essential troubles

o Reduce the time spent on user-support and CI/CD tasks

All schemes are written in the knowledge base

Operators only need to fill out template forms

o Through these knowledge base operations, everyone can go from non-expert to expert operator!

Page 48

o When something goes wrong in our private cloud system, Zabbix automatically detects the anomaly from SNMP traps, API health monitoring, and error/warning logs

Page 49

o When a well-known log raises an alert on our cloud servers, Zabbix automatically makes a link to the knowledge base

o Operators only need to access the knowledge base and ignore/solve the alert according to the knowledge base flow
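The alert-to-article link can be pictured as a lookup from alert text to a knowledge base entry. Everything specific in the sketch below is hypothetical: the base URL, the article IDs, and the matching rules are placeholders. The slides do not show how Zabbix builds the link.

```python
import re

# Hypothetical mapping from alert text to knowledge-base article IDs.
KB_BASE_URL = "https://redmine.example.com/knowledgebase/articles/"
KB_RULES = [
    (re.compile(r"hardware failure"), 101),
    (re.compile(r"Failed to MySQL fullbackup"), 202),
]


def knowledge_base_link(alert_message):
    """Return the article URL for a well-known alert, or None for an unknown one."""
    for pattern, article_id in KB_RULES:
        if pattern.search(alert_message):
            return KB_BASE_URL + str(article_id)
    return None


print(knowledge_base_link("mysql_fullbackup[4242]: Failed to MySQL fullbackup"))
print(knowledge_base_link("something never seen before"))  # None
```

An alert that returns `None` is an unknown alert, which is the kind the Nicchoku person checks.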

Page 50

o Almost every solution in the knowledge base has an Ansible playbook

o Reduce human error

o Express solutions in machine language rather than natural language

Playbooks (.yaml files) are designed for idempotent execution

Non-experts can learn troubleshooting by reading these playbooks
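Idempotence here means that running the same fix twice leaves the system in the same state as running it once. The sketch below shows the idea with a hypothetical helper, similar in spirit to how Ansible's `lineinfile` module reports “changed” on the first run and “ok” afterwards. It is not Ansible code, and the configuration lines are made up.

```python
def ensure_line(lines, wanted):
    """Return (new_lines, changed); appends `wanted` only if it is missing."""
    if wanted in lines:
        return list(lines), False
    return list(lines) + [wanted], True


config = ["[DEFAULT]", "debug = False"]
config, changed_first = ensure_line(config, "verbose = True")
config, changed_second = ensure_line(config, "verbose = True")
print(changed_first, changed_second)  # True False
print(config)
```

Because the second run changes nothing, an operator can safely re-run a fix when unsure whether it was already applied.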

Page 51

Operator Training

Our future plan

Page 52

User segmentation

○ Gap between “Superuser” and “Using Full Managed”.

Task | Superuser | Not Superuser | Full Managed Service
Design | User | Integrator | Servicer
Planning and Resource Management | User | User | Servicer
Operation | User | User | Servicer

How to organize the operation team?

“KnowledgeBase with non-specialized engineers” will be the solution!

Page 53

Concept of Operator Training

○ SHORTCUT to become OpenStack Operator

Common way (roles shown: User, Integrator, Operator):

What is OpenStack? → Studying Architecture → Install OpenStack → Use by Horizon → Use by API → How to design → How to operate

This Program (role shown: Operator):

What is OpenStack? → How to operate (aggregated in KnowledgeBase)

Page 54

Ecosystem around KnowledgeBase

[Figure: KnowledgeBase at the center, linked to Standardized Operation, Training, and Certification. Contributors: Vendor, Integrator, Developer, Super User. Arrow labels: “Provide Knowledge”, “Create and Brushing up”, “Provide vendor specific knowledge”, “Skill up operators”, “Motivate operators”, “user base increasing”]

Page 55

Thank you for listening!

Any questions?