31
Keeping Pinterest Running 1 Joe Gordon 2 February 2016

Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Embed Size (px)

Citation preview

Page 1: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Keeping Pinterest Running

1

Joe Gordon2 February 2016

Page 2: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

What is Pinterest?

Page 3: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Software v. Service

Page 4: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Software v. Service

● Stable branches● Drivers and configurations● Support matrix● Dependency versions● Developers support their own service

○ On call rotation○ Aligns incentives○ Monitoring & alerting built in from day one

● Testing against production traffic

Page 5: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

SRE at Pinterest

Page 6: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments
Page 7: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments
Page 8: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments
Page 9: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

What do SREs focus on?

Page 10: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Operational Maturity

Page 11: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Operational Excellence

Page 12: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Operational Excellence

Page 13: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

VisibilityInsight into the system

Page 14: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Visibility

● Data Driven● Cornerstone for many things we do

○ Measure and enforce SLA (Service Level Agreement)○ Debug issues○ Capacity planning

● Time series data - TSDB● Metrics

○ System○ Service○ Dependencies○ Latencies

● Alerting● ELK stack for real time

log collection

Page 15: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Deployments

Page 16: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Deployment Requirements

● No impact to end user● Change history● Easy

Page 17: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Staging and Canary

Canary in a Coal mine. Rabbit in a Sarin gas plant

Page 18: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Canary vs Staging

Staging

Page 19: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Teletraandeploy system

Page 20: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Teletraan

● Rollback

● Hotfix

● Rolling deploy

● Staging and testing

● Visibility & Usability

Features

Page 21: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

TeletraanDesign

● client-server model

● PRE/POST-DOWNLOAD

● PRE/POST-RESTART

● RESTART

● RBAC

Page 22: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

TeletraanAdvanced Features

● Pause/Resume● Acceptance Testing● Auto Deploy● Autoscaling

Staging Canary Production

Test Test Test

Auto Promote Promote

Page 23: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

PostmortemsLearn from our mistakes

Page 24: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Postmortems

● Blameless● Incident Manager● Impact● Outage Type● Method of Detection● Timeline● Root Cause● Restoration Details● Actionable Items

Page 25: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Production Readiness ReviewPre-mortem?

Page 26: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Production Readiness Review

● Dependencies● Define an SLA● Alerting● Capacity planning● Testing● On call rotation● Decider to turn feature off if needed● Incremental launch plan● Rate limiting

Page 27: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Public CloudIssues

Page 28: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Public Cloud

● “If you get an InsufficientInstanceCapacity error when you try to launch an instance, AWS does not currently have enough available capacity to service your request.”

○ Cloud is not infinite○ Reserved instances○ Capacity planning

● RequestLimitExceeded: “The maximum request rate permitted by the Amazon EC2 APIs has been exceeded for your account.”

○ Includes DescribeInstances○ Use internal mirror (powered by elasticsearch)

● Noise Neighbors● Rightsizing● Ownership

Issues

Page 29: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Open Sourced Tools

Page 30: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Open Sourced Tools

● mysql_utils○ MySQL Management Tools for the Cloud

● thrift-tools○ thrift-tools is a library and a set of tools to introspect Apache Thrift traffic

● secor○ Secor is a service implementing Kafka log persistence

● pymemcache○ A comprehensive, fast, pure-Python memcached client

● pinrepo○ Artifact Repo

● TeletraanMore at: https://github.com/pinterest

Page 31: Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments

Pinterest Template 1.0