Copyright©2015 NTT DOCOMO, INC. All rights reserved.
After One Year of OpenStack Cloud
Operation (NTT DOCOMO)
NTT DOCOMO Inc.
Ken Igarashi
NTT Software
Asako Ishigaki
NEC
Akihiro Motoki
Our Project
Scalable Test
using 100 nodes
(10)
System
Design
(8)
Recovery Tests
(12)
Racking and
Cabling
(14)
24/7 support
(14)
User Support
(+x)
2014-6 2014-8 2014-11 2015-2 2015-5 2015-8 2015-11
o Team Rules (Culture)
Focusing on using OpenStack instead of developing OpenStack
Think how to use it.
Don’t think OpenStack can’t do XXXX.
Reducing Opex/Promoting Automation
Operation tools
• “Anything that a human needs to do more than twice must be
automated.”
Reduce operators by HA and self healing.
Operation
o OpenStack Configuration(http://bit.ly/1DbJPUO)
Double redundancies for hardware
Triple redundancies for software
[Diagram: OpenStack APIs, Nova, Neutron Agents, RabbitMQ, MySQL (Galera) with DB1 to DB4 and an Arbitrator, load balancers (LB), Zabbix, MaaS, and PXE/DNS/DHCP, with guest VMs on top]
o Deployment
CMDB Registration
o Choose playbooks for Ansible Dynamic Inventory
Ansible
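A dynamic inventory is a program that prints host groups as JSON when Ansible calls it with --list. The slides do not show the CMDB schema, so the sketch below assumes hypothetical CMDB rows with hostname, role, and ip fields, and groups hosts by role so that playbooks can be chosen per group.

```python
import json

# Hypothetical CMDB rows; a real script would query the CMDB instead.
CMDB_ROWS = [
    {"hostname": "ctl01", "role": "control", "ip": "192.0.2.11"},
    {"hostname": "net01", "role": "network", "ip": "192.0.2.21"},
    {"hostname": "cmp01", "role": "compute", "ip": "192.0.2.31"},
    {"hostname": "cmp02", "role": "compute", "ip": "192.0.2.32"},
]


def build_inventory(rows):
    """Group hosts by role in the JSON layout Ansible reads from `--list`."""
    inventory = {"_meta": {"hostvars": {}}}
    for row in rows:
        group = inventory.setdefault(row["role"], {"hosts": []})
        group["hosts"].append(row["hostname"])
        # Per-host variables live under _meta.hostvars.
        inventory["_meta"]["hostvars"][row["hostname"]] = {"ansible_host": row["ip"]}
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(CMDB_ROWS), indent=2))
```

Because the output includes _meta.hostvars, Ansible does not need to call the script once per host.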
o Deployments
Common: network, account, logging, Zabbix agent, drivers/firmware x 37
OpenStack: Nova, Swift, Neutron, ……. x 62
HA Configuration
[Figure labels: compile, initial setup, update; kernel, driver, firmware, filesystem, development environment; install HDD driver]
Monitoring System
o Monitoring System
[Diagram: Fluentd, Elasticsearch, Kibana, and Zabbix monitor the VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, and databases. Labels: weekday daytime; 24h / 365d]
Monitoring targets: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, databases; memory, CPU, network, HDD

            Monitoring Items   Self Healing
General     1,970              25
OpenStack   3,957              59
o RabbitMQ
Configuration
3 node cluster
cluster_partition_handling, autoheal
Monitoring
Split Brain check:
• rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'
Port Check (5672, 25672)
Process Check
• beam.smp
• rabbitmq-server
At least one node running (1/3)
• {Openstack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3
• {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
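The per-node checks above can be sketched in a few lines. This is a minimal illustration, not the team's actual Zabbix item: it assumes that a healthy node prints only brackets (for example `[]` or `[[]]`) for the rabbitmqctl eval split-brain check, and the function names are hypothetical.

```python
import re
import socket

AMQP_PORT = 5672    # client connections
DIST_PORT = 25672   # inter-node (Erlang distribution) traffic


def has_partition(eval_output: str) -> bool:
    """True if the printed Erlang list names at least one partitioned node.

    Assumption: a healthy node prints only brackets, e.g. `[]` or `[[]]`.
    """
    return re.sub(r"[\[\]\s,]", "", eval_output) != ""


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def node_healthy(host: str, eval_output: str) -> bool:
    """Combine the port check and the split-brain check for one node."""
    ports_ok = all(port_open(host, p) for p in (AMQP_PORT, DIST_PORT))
    return ports_ok and not has_partition(eval_output)
```

The eval output is passed in as a string so the parsing can be exercised without a running broker.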
o MySQL
Configuration
4 Nodes + 1 Arbitrator
Monitoring
Cluster Check
• wsrep_local_recv_queue
• wsrep_local_send_queue
• wsrep_flow_control_paused
• wsrep_local_commits
[Diagram: the LB sends R/W traffic to the MySQL nodes; the Arbitrator is part of the cluster]
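One way to turn the four wsrep status variables into an alert is to compare two consecutive samples. The sketch below is an assumption-laden illustration: the thresholds are placeholders, the function name is hypothetical, and the input dictionaries are assumed to come from SHOW GLOBAL STATUS LIKE 'wsrep_%'.

```python
def galera_warnings(prev, curr, max_queue=10.0, max_paused=0.1):
    """Return the names of the status variables that look unhealthy.

    prev and curr are two samples of the wsrep status, as dicts of
    variable name to string value. Thresholds are illustrative only.
    """
    warnings = []
    # Growing queues mean a node cannot keep up with replication.
    for key in ("wsrep_local_recv_queue", "wsrep_local_send_queue"):
        if float(curr[key]) > max_queue:
            warnings.append(key)
    # Fraction of time replication was paused by flow control.
    if float(curr["wsrep_flow_control_paused"]) > max_paused:
        warnings.append("wsrep_flow_control_paused")
    # A commit counter that stops advancing suggests a frozen cluster
    # (or simply an idle one, so this check needs real write traffic).
    if int(curr["wsrep_local_commits"]) <= int(prev["wsrep_local_commits"]):
        warnings.append("wsrep_local_commits")
    return warnings
```

Comparing the commit counter between samples is what catches a freeze in which the node is still reachable but no transaction completes.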
o MySQL Cluster
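[Diagram: a MySQL client commits on the master; Galera replicates the change to the slave (send_queue on the sending side, recv_queue on the receiving side), and each node writes to its own disk]
The master waits until it receives OK from replication before it returns OK to the client.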
o MySQL Cluster Freeze
Because the commit waits until OK is received from replication, one slow node can stall the whole cluster. 👿
• Disk failure: 😀 (the node is removed from the cluster)
• Disk speed throttling: 😢
Prohibited Actions while MySQL Cluster Freeze
○ Prohibit some self-healing actions
Do not restart some OpenStack processes
– neutron-plugin-openvswitch-agent
Do not reboot network nodes
– lose network reachability (can’t recreate network namespace)
All the VMs lose connections
Solved at Liberty?
o Throttling happens during DB backup
Limit Backup Node
[Diagram: the LB sends R/W traffic to the cluster; backups are limited to one node]
Backup Method
LOCK TABLES FOR BACKUP (online)
1. Take the node out of the cluster (Donor/Desynced)
2. Lock the DB and do the backup (FLUSH TABLES WITH READ LOCK)
3. Return the node to the cluster (wsrep_desync=OFF)
– wsrep_local_recv_queue
– wsrep_local_commits
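The three backup steps only stay safe if the lock and the desync are always undone, even when the backup itself fails. The sketch below shows that ordering with a context manager. It is an illustration under assumptions: the cursor is any DB-API style object connected to the backup node, the lock is held in that same session, and RecordingCursor is a hypothetical stand-in used only to show the statement order.

```python
from contextlib import contextmanager


@contextmanager
def desynced_read_lock(cursor):
    """Desync the node, hold a global read lock, and always undo both."""
    cursor.execute("SET GLOBAL wsrep_desync = ON")        # step 1
    try:
        cursor.execute("FLUSH TABLES WITH READ LOCK")     # step 2
        try:
            yield cursor
        finally:
            cursor.execute("UNLOCK TABLES")
    finally:
        cursor.execute("SET GLOBAL wsrep_desync = OFF")   # step 3


class RecordingCursor:
    """Stand-in for a DB-API cursor; it only records the statements."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def run_backup(cursor, take_backup):
    """Run the caller-supplied backup function while the node is locked."""
    with desynced_read_lock(cursor):
        take_backup()
```

After step 3, the two listed status variables (wsrep_local_recv_queue and wsrep_local_commits) are the natural things to watch while the node catches up.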
Log Analytics
Kibana
Purpose of Log Analytics
(1) Detect critical system failures: we have to recover immediately
(2) Detect malicious access: we need to notify users
(3) Detect non-critical errors: better to be fixed as soon as possible
(4) Find errors/warnings that have no service impact: we want to filter them out next time
Treasure Hunt in The Ocean of Logs
○ e.g. logs of a day
Total:
100 GB, 80M lines
Sum of critical, error and warning logs:
200K lines
The meaningful logs are far fewer:
(1) 0 critical failures (2) 0 malicious accesses
(3) 6 non-critical failures (4) 6 ignorable failures
[Pie chart "Breakdown of Logs" by level: Critical, Error, Warning, Info, Debug, Other; values shown: 0%, 0%, 1%, 30%, 39%, 30%]
[Pie chart by source: HW, OS, OpenStack backend, OpenStack, Operation tools; values shown: 0%, 24%, 24%, 49%, 3%]
Log Analytics Based on White/Black List
○ We analyze logs to enhance our black list and white list.
○ Logs found in our black list are sent to Zabbix.
[Diagram: logs matching the black list go to Zabbix, logs matching the white list go to the trash, and the remaining logs are analyzed in Kibana. Findings are added to both lists, so the lists expand and the logs left to analyze reduce.]
How to Adopt Black/White List Using Fluentd
[Diagram: Network, Control, and Compute Nodes run fluentd; the LB and UTM send logs via rsyslog. Fluentd on the Log Server stores the logs in Elasticsearch and calls zabbix_sender. Kibana refers to Elasticsearch for graphs; Zabbix raises alerts.]
• Add “ignorable” flag according to white list
• Put metadata to create graphs from the logs
• Notify Zabbix according to black list
How to Adopt Black/White List Using Fluentd (example)
1. syslog 10:01 crit: hardware failure
2. rsyslog 10:03 warn: IDS: from x.x.x.x
3. api.log 10:04 ERROR: invalid request format

             1                  2                   3
path:        syslog             rsyslog             api.log
timestamp:   10:01              10:03               10:04
severity:    crit               warn                ERROR
item:        -                  ids                 ignore
source_ip:   -                  x.x.x.x             -
message:     hardware failure   IDS: from x.x.x.x   invalid request format

[Diagram: Fluentd on the Log Server sends "hardware failure" to Zabbix through zabbix_sender and stores the records in Elasticsearch; Kibana refers to them for the IDS graph and the crit graph]
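The record fields in this example (path, timestamp, severity, item, source_ip, message) can be derived from each raw line with a small parser. The sketch below mirrors the three sample lines; the regular expressions and the one-entry white list are assumptions made for illustration, not the actual Fluentd configuration.

```python
import re

# Assumed line layout: "<HH:MM> <severity>: <message>".
LINE = re.compile(r"^(?P<timestamp>\d{2}:\d{2}) (?P<severity>\w+): (?P<message>.*)$")
# Assumed IDS message layout: "IDS: from <source address>".
IDS = re.compile(r"^IDS: from (?P<source_ip>\S+)")
# Illustrative white list with a single known-harmless message.
WHITE_LIST = [re.compile(r"^invalid request format$")]


def enrich(path, line):
    """Turn one raw log line into a record with metadata, or None."""
    match = LINE.match(line)
    if match is None:
        return None
    record = {"path": path, **match.groupdict(), "item": None, "source_ip": None}
    ids = IDS.match(record["message"])
    if ids:
        # Metadata used to draw the IDS graph.
        record["item"] = "ids"
        record["source_ip"] = ids.group("source_ip")
    elif any(p.search(record["message"]) for p in WHITE_LIST):
        # White-listed messages are flagged rather than dropped.
        record["item"] = "ignore"
    return record
```

Flagging white-listed records instead of discarding them keeps them searchable while letting dashboards filter them out.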
Example of Our White List # with Juno
• Count response codes and understand the trend. That’s enough.
^keystonemiddleware\.auth_token \[\-\] Unable to find authentication token in headers$
• This ERROR means the user’s operation was denied due to quota.
• It has no impact on our system. Should it be an INFO log?
^nova\.api\.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota\..*$
• This WARNING is caused by the presence of SHUTOFF instances.
• It is a commonplace condition and needs to be ignored.
^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$
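The three white-list patterns can be applied as ordinary regular expressions. The sketch below compiles them unchanged and filters a batch of messages; the helper names are hypothetical, and any sample messages fed to it would be invented for illustration rather than taken from real logs.

```python
import re

# The three white-list patterns from the slide, copied as written.
WHITE_LIST = [
    re.compile(r"^keystonemiddleware\.auth_token \[\-\] "
               r"Unable to find authentication token in headers$"),
    re.compile(r"^nova\.api\.openstack \[[^\]]*\] Caught error: "
               r"VolumeSizeExceedsAvailableQuota: Requested volume or snapshot "
               r"exceeds allowed Gigabytes quota\..*$"),
    re.compile(r"^nova.scheduler.host_manager \[[^\]]+\] "
               r"Host has more disk space than database expected .*$"),
]


def is_ignorable(message):
    """True if the message matches any white-list pattern."""
    return any(pattern.match(message) for pattern in WHITE_LIST)


def needs_review(messages):
    """Keep only the messages a human still has to look at."""
    return [m for m in messages if not is_ignorable(m)]
```

Anything that survives needs_review is a candidate for either the white list or the black list after analysis.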
Effect of Our White List
○ We succeeded in reducing the logs to be analyzed.
In other words, many meaningless logs have high log levels.
Without White List: 160K
With White List: 37
(reduced by 99.98%)
1 year ago: we couldn’t analyze all logs in a day
Today: we can analyze all logs in 2-3 hours a day!
Example of Our Black List
• This message indicates a disk problem on a Compute node. (Warning alert)
^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$
• Corosync needs to clean up its resources. (Information alert)
^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$
• A full backup of MySQL failed once. (Information alert)
^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
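A black-list hit has to be turned into a Zabbix notification. The sketch below pairs two of the patterns with an alert level and builds a zabbix_sender command line using its -z, -s, -k, and -o options. The first pattern is omitted because part of it is masked in the slide, and the server name, host name, and item key used in any call are hypothetical.

```python
import re

# (alert level, pattern) pairs; levels follow the slide.
BLACK_LIST = [
    ("information", re.compile(
        r"^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$")),
    ("information", re.compile(
        r"^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$")),
]


def alert_level(message):
    """Return the alert level of the first matching pattern, or None."""
    for level, pattern in BLACK_LIST:
        if pattern.match(message):
            return level
    return None


def zabbix_sender_args(server, host, key, message):
    """Build (but do not run) a zabbix_sender command for one message."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", message]
```

Returning the argument list instead of executing it keeps the matching logic testable; a caller could pass the list to subprocess.run.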
What are operators doing
behind the Cloud?
NEC : Takashi Torii
NTT DOCOMO : Ken Igarashi
Jun Ishii
o Team Rules (Culture)
Focusing on using OpenStack instead of developing OpenStack
Reducing Opex
Operation tools
• “Anything that a human needs to do more than twice must be
automated (scripting).”
[Diagram: Dev/Ops Engineers (L3) work weekday daytime; Operators (L1/2) work 24h / 365d, supported by Tools and the Knowledge Base]
o Operation Tools
Deployments
Common: network, account, logging, Zabbix agent, drivers/firmware x 37
OpenStack: Nova, Swift, Neutron, ……. x 62
• HA Configuration
• Rolling updates
Operation x 31
Common: process restart, log correction
OpenStack Operation: usage, VM migration/backup, user add/delete/quota change
OpenStack Monitoring: health check tools
Daily operation scheme
What does “Nicchoku” mean?
o “Nicchoku” “日直”
The Nicchoku takes charge of the routine work of the day
Operators take turns in the Nicchoku role every day
In Japanese schools, the Nicchoku person does chores such as erasing the whiteboard
and passing out handouts.
o In our operation, the Nicchoku person has to :
Check security updates
Check firmware updates
Check unhandled/unknown alerts
Check usage of resources
Analyze logs
o It takes almost a day to do these tasks without any plan!
o The Nicchoku person carries out the daily routine using the following tools:
• Check security updates: “Nicchoku assistant”
• Check firmware updates: “Nicchoku assistant”
• Check unhandled/unknown alerts: “Nicchoku assistant”
• Check usage of resources: Zabbix graphs
• Analyze logs: Kibana dashboards
o Operators solve the issues that the Nicchoku person has found.
Early detection of problems
o In this session, we present the Nicchoku tools.
o The Nicchoku assistant is a web crawler composed of a Jenkins job and WordPress
o The assistant reports on 3 topics for the Nicchoku at 9:00 AM
Security updates
Firmware updates
Zabbix alerts
o It saves the time we used to spend patrolling many websites
o The assistant reports the latest security advisories.
We don’t have to patrol each website.
Ubuntu http://www.ubuntu.com/usn/trusty/
OpenStack http://security.openstack.org/
o The assistant reports the latest updates of each firmware, such as
LB, UTM, storage, and server.
o Some appliances’ release pages can already be parsed by the assistant.
o The others haven’t been parsed by the assistant yet; we still have
to patrol their websites.
o It reports the Zabbix alerts of the last day.
o We can grasp all the alerts at a glance, including resolved
problems which are no longer shown in Zabbix.
[Screenshot labels: alerts that occurred while online; alerts that occurred in maintenance mode / alerts that are already resolved]
o Zabbix screens
o Check for exponential increases/decreases in resource usage
We can easily check all hardware resources
o We can filter out known errors with Lucene queries.
o Ten or more queries are shared on our Wiki page.
o Our assistant sets these queries on the Kibana dashboards for us.
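A shared "filter out known errors" query can be generated rather than typed by hand. The sketch below builds a Lucene query string that excludes a list of known phrases. The field names severity and message follow the record example earlier in the deck, but whether the real index uses them is an assumption, and the function name is hypothetical.

```python
def exclude_known_errors(base_query, known_phrases, field="message"):
    """Wrap base_query so that records containing known phrases are hidden."""
    clauses = []
    for phrase in known_phrases:
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    if not clauses:
        return base_query
    return f"({base_query}) AND NOT ({' OR '.join(clauses)})"


if __name__ == "__main__":
    print(exclude_known_errors(
        "severity:ERROR",
        ["Unable to find authentication token in headers"],
    ))
```

Keeping the known phrases in one list makes it easy to regenerate every dashboard query when the list grows.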
o We’ve reduced the time required for Nicchoku work.
o Moreover, we’ve succeeded in outsourcing Nicchoku work
since September.
Even operators who are not familiar with OpenStack logs just need
to check the Nicchoku helper and the knowledge base.
[Bar chart: time required to complete Nicchoku work, in hours: April 8, September 4 (50% reduction)]
Knowledge base is
our next session’s theme!
Knowledge base operation and
troubleshooting
Be an expert
operator!
o What makes OpenStack-based systems difficult to operate?
o Many components
Not everyone is well versed in all components
o Each operator has their own specialty
Operators need to complement each other’s knowledge
o Unifying operators’ skills is important for stable operation
When trouble happens, operators are required to solve all kinds of
problems
o We built a DevOps system around OpenStack with OSS
[Diagram labels: Private Cloud; Log/Alert Management; Whole Project Management; CI/CD (the "powered by" tool logos are not preserved)]
o Database of knowledge
The knowledge base is a plugin function of Redmine
o By concentrating all operational information in one DB, we can
easily cope with a problem that someone has already solved once
Avoid reinventing the wheel
Reduce the time spent searching the internet
[Diagram: the knowledge base collects knowledge about hardware, Nova, Swift, …, tools usage, and know-how and experience]
We have stored over 1,200 knowledge entries!
o Well-known troubles are quickly recovered by anyone
Not only Nicchoku tasks but also the initial response to troubles is
outsourced to non-expert operators
DevOps members can concentrate on essential troubles
o Reduce the time spent on User-Support and CI/CD tasks
All procedures are written in the knowledge base
Operators only need to fill out template forms
o Through these knowledge base operations, everyone can grow from a
non-expert into an expert operator!
o When something goes wrong in our private cloud system,
Zabbix automatically detects the anomaly from SNMP
traps, API health monitoring, and error/warning logs
o When a well-known log raises an alert from our cloud servers, Zabbix
automatically adds a link to the knowledge base
o Operators only need to open the knowledge base and
ignore/solve the alert according to the knowledge base flow
o Almost all solutions in the knowledge base have an Ansible playbook
o Reduce human error
o Express the solution in machine language rather than natural
language
Playbooks (.yaml files) are written to be idempotent
Non-experts can learn troubleshooting by reading these playbooks
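Idempotence means that running the same fix twice leaves the system in the same state as running it once. The sketch below illustrates the idea in plain Python rather than in a playbook: a hypothetical ensure_line helper reports whether it changed anything, in the spirit of the changed flag that Ansible tasks report. The configuration line used in the demo is illustrative only.

```python
def ensure_line(lines, wanted):
    """Return (new_lines, changed), appending `wanted` only if it is missing."""
    if wanted in lines:
        return list(lines), False
    return list(lines) + [wanted], True


if __name__ == "__main__":
    config = ["# broker settings"]
    # First run: the line is missing, so it is added.
    config, changed = ensure_line(config, "cluster_partition_handling = autoheal")
    print(changed)
    # Second run: nothing to do, and the result is identical.
    config, changed = ensure_line(config, "cluster_partition_handling = autoheal")
    print(changed)
```

Because a second run is a no-op, an operator can safely rerun a recovery procedure when unsure whether it already completed.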
Operator Training
Our future plan
User segmentation
○ Gap between “Superuser” and “Using Full Managed”.

                                   Superuser   Not Superuser   Full Managed Service
Design                             User        Integrator      Servicer
Planning and Resource Management   User        User            Servicer
Operation                          User        User            Servicer

How to organize the operation team?
“KnowledgeBase with non-specialized engineers”
will be the solution!
Concept of Operator Training
○ SHORTCUT to become OpenStack Operator
Common way: What is OpenStack?, Studying Architecture, Install OpenStack, Use by Horizon, Use by API, How to design, How to operate (roles along the way: User, Integrator, Operator)
This Program: What is OpenStack?, How to operate (aggregated in KnowledgeBase), then Operator
Eco system around KnowledgeBase
[Diagram: vendors, integrators, developers, and super users provide knowledge (including vendor-specific knowledge) to create and brush up the KnowledgeBase. The KnowledgeBase supports standardized operation, training, and certification, which skill up and motivate operators while the user base increases.]
Thank you for listening!
Any questions?