Lessons learned running large real-world Docker environments

Alois Mayr, @mayralois

Oct 27th 2015, Dec 3rd 2015



Source: http://www.schoonoart.de/

What is a “large” environment?

Campfire stories

#1 – The Death Star of Service Dependencies

Load-balanced service

System-wide service dependencies

Reverse proxies are essential

App #1 depends on App #2

Where is this specified?

Unwanted dependencies break architecture
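One way to answer "Where is this specified?" is to keep an explicit list of allowed dependencies and compare it with the calls actually observed. The following is a minimal sketch under stated assumptions: the service names, the `DECLARED` mapping, and the observed call pairs are hypothetical, and how calls are collected (traces, reverse-proxy logs) is left open.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical declaration: service name -> services it is allowed to call.
DECLARED: Dict[str, Set[str]] = {
    "app1": {"app2"},
    "app2": set(),
}


def undeclared_dependencies(
    declared: Dict[str, Set[str]],
    observed: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return observed (caller, callee) pairs that are not declared."""
    violations = []
    for caller, callee in observed:
        # A caller with no entry is treated as having no allowed dependencies.
        if callee not in declared.get(caller, set()):
            violations.append((caller, callee))
    return violations


if __name__ == "__main__":
    # Hypothetical calls, e.g. extracted from traces or proxy logs.
    observed_calls = [("app1", "app2"), ("app2", "app1")]
    for caller, callee in undeclared_dependencies(DECLARED, observed_calls):
        print(f"undeclared dependency: {caller} -> {callee}")
```

Run as part of a build or a periodic job, a check like this turns an unwanted dependency into a visible failure instead of a surprise in production.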


Use proper versioning for services, APIs, and images
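For images, one small piece of "proper versioning" is refusing references that float, such as a missing tag or `latest`. This is a minimal sketch, not a full image-reference parser; the function name `is_pinned` and the example references are hypothetical.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the image reference names an explicit version.

    A reference counts as pinned when it carries a digest (contains '@')
    or a tag other than 'latest'.
    """
    if "@" in image_ref:
        return True
    # Only look for a tag after the last '/', so that a registry port
    # such as 'registry.example.com:5000/app' is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ["nginx", "nginx:latest", "nginx:1.9.7",
                "registry.example.com:5000/app"]:
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```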


#2 – The Network Retransmission Episode

Retransmissions

What was the problem?

• Hardware defect in a single network interface card (NIC)
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communications to other machines in the datacenter
• Still not sure about the exact defect on the NIC
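A defect like this shows up as a rising share of retransmitted TCP segments on the affected host. The sketch below computes that share from two snapshots of the `Tcp:` counters in the format of Linux's `/proc/net/snmp`. The sample snapshots and the function names are hypothetical; on a real host the text would be read from `/proc/net/snmp` at two points in time.

```python
from typing import Dict

# Hypothetical snapshots in the layout of the "Tcp:" lines of /proc/net/snmp.
SAMPLE_BEFORE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
    "Tcp: 1 200 120000 -1 10 5 0 0 3 1000 2000 40 0 1\n"
)
SAMPLE_AFTER = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
    "Tcp: 1 200 120000 -1 12 6 0 0 4 1800 3000 90 0 1\n"
)


def parse_tcp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the header/value pair of 'Tcp:' lines into a name -> value dict."""
    header = values = None
    for line in snmp_text.splitlines():
        if line.startswith("Tcp:"):
            fields = line.split()[1:]
            if header is None:
                header = fields      # first Tcp: line holds the column names
            else:
                values = fields      # second Tcp: line holds the counters
                break
    if header is None or values is None:
        raise ValueError("no Tcp section found")
    return {name: int(value) for name, value in zip(header, values)}


def retransmission_ratio(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Share of sent segments that were retransmissions between two snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    if sent <= 0:
        return 0.0
    return retransmitted / sent


if __name__ == "__main__":
    ratio = retransmission_ratio(
        parse_tcp_counters(SAMPLE_BEFORE), parse_tcp_counters(SAMPLE_AFTER)
    )
    print(f"retransmission ratio: {ratio:.2%}")
```

Because the NIC only misbehaved under heavy load, the ratio has to be tracked over time and per host; a single low-load reading would look healthy.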


Co-locate related containers. Check network infrastructure.


#3 – The Hungry Container Breakdown

Low disk space


What was the problem?

• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition on a single host ran out of space
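Log rotation at the application level is one way to bound how much a single app can write to a shared partition. The sketch below uses Python's standard `logging.handlers.RotatingFileHandler`; the logger name, file path, and size limits are hypothetical values chosen for illustration, and nothing here reflects how the apps in the talk were written.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_rotating_logger(name: str, path: str,
                         max_bytes: int = 1_000_000,
                         backups: int = 3) -> logging.Logger:
    """Create a logger whose file output is capped at (backups + 1) files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        log = make_rotating_logger(
            "demo-app", os.path.join(log_dir, "app.log"),
            max_bytes=200, backups=2,
        )
        for i in range(100):
            log.info("request %d handled", i)
        # Close handlers before the temporary directory is removed.
        for handler in list(log.handlers):
            handler.close()
            log.removeHandler(handler)
        print(sorted(os.listdir(log_dir)))
```

Rotation only caps disk usage per app; archiving and central log management are still needed to keep the history.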


How the problem evolved over time

• Container health checks failed
• Marathon terminated the task and rescheduled a new one
• Still no free space on /logs
• Termination and rescheduling repeated
• /var/lib/docker ran out of space
• Mesos slave unable to run Docker tasks


How the problem could have been avoided

• Use log management tools for app logs, e.g. Fluentd and Logstash: --log-driver=none|syslog
• Remove containers after they exit: --rm=true
• Run the Mesos slave with --docker_remove_delay=VALUE
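The Docker flags named above can be assembled into a `docker run` command line as follows. This is a minimal sketch: the helper name and the image reference `example/app:1.0` are hypothetical, and the helper deliberately accepts only the two log drivers mentioned on the slide, although Docker supports others.

```python
import shlex
from typing import List

# Only the drivers named on the slide; Docker itself offers more.
ALLOWED_LOG_DRIVERS = {"none", "syslog"}


def docker_run_command(image: str,
                       log_driver: str = "syslog",
                       remove_on_exit: bool = True) -> List[str]:
    """Build an argument list for `docker run` with logging and cleanup flags."""
    if log_driver not in ALLOWED_LOG_DRIVERS:
        raise ValueError(f"unsupported log driver in this sketch: {log_driver}")
    cmd = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        # Remove the container's filesystem when the container exits.
        cmd.append("--rm=true")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical image reference; the command is printed, not executed.
    print(shlex.join(docker_run_command("example/app:1.0")))
```

Building the argument list rather than a shell string keeps the flags explicit and makes it easy to pass the result to `subprocess.run` when the command should actually be executed.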


Use log management tools. Empty /var/lib/docker.


#4 – The Day Orchestration Stood Still

Queue and deployment methods are slow

What was the problem?

• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservices deployments
• Slowdown through ZooKeeper (zk) overload


How the problem could have been avoided

• The respective parameter (zk_max_versions) was not set to a proper limit: --zk_max_versions=20
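The effect of such a cap can be illustrated with a bounded history: once the limit is reached, every new version pushes out the oldest one, so stored state stops growing with each deployment. The sketch below is an illustration of the idea only, not Marathon's implementation; the class name and version strings are hypothetical.

```python
from collections import deque
from typing import List


class VersionHistory:
    """Keep at most `max_versions` application versions, dropping the oldest."""

    def __init__(self, max_versions: int = 20) -> None:
        # A deque with maxlen discards items from the left when it is full.
        self._versions = deque(maxlen=max_versions)

    def record(self, version: str) -> None:
        self._versions.append(version)

    def versions(self) -> List[str]:
        return list(self._versions)


if __name__ == "__main__":
    history = VersionHistory(max_versions=20)
    # Simulate frequent deployments of one application.
    for i in range(1000):
        history.record(f"v{i}")
    print(len(history.versions()), "versions kept")
```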


Track orchestration layer performance. Separate Mesos clusters.


#5 – The Mushroom Cloud Effect

Way too many components involved

820 BILLION dependencies!

What was the problem?

• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact to real users, only backend services affected
• Many components to take into account

[Figure: data model relating Service, Container, and Host, with multiplicities 1, 1..*, *, 1; figure labels: 174 / 3.4k and 22 / 13.3k]


Automation needed for problem analysis in large environments
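A very small form of such automation is to walk the service, container, and host relationships and rank hosts by how many unhealthy containers they carry. The sketch below assumes each container belongs to one service and runs on one host; the placement data and all names are hypothetical, and a real environment would need far richer signals than a simple count.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical topology: container id -> (service name, host name).
PLACEMENT: Dict[str, Tuple[str, str]] = {
    "c1": ("checkout", "host1"),
    "c2": ("search", "host1"),
    "c3": ("checkout", "host2"),
    "c4": ("catalog", "host3"),
}


def suspect_hosts(placement: Dict[str, Tuple[str, str]],
                  unhealthy: List[str]) -> List[Tuple[str, int]]:
    """Rank hosts by the number of unhealthy containers running on them."""
    counts = Counter(
        placement[container][1]
        for container in unhealthy
        if container in placement  # ignore containers with unknown placement
    )
    return counts.most_common()


if __name__ == "__main__":
    for host, count in suspect_hosts(PLACEMENT, ["c1", "c2", "c3"]):
        print(f"{host}: {count} unhealthy container(s)")
```

With thousands of containers and hosts, a ranking like this narrows the search to a few machines before anyone starts reading logs by hand.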


Free trial - https://ruxit.com/docker-monitoring/

Blog - https://blog.ruxit.com/

@ruxit

What lessons have you learned?