Mongo db ops mug pres

MongoDB Ops

This is not the title of my talk

Not This

Not This Either

HOW TO BE A DBA

Inspiration

CARFAX replica set architecture. People read this and don’t know how much of it applies to them. Is this a good architecture?

MongoDB Ops Database Resiliency as a Service

Risk Mitigation as a Database

That’s the title of my talk

Topics

•  Risk Mitigation
•  Proactive and Iterative Ops
•  MMS Tools
•  Discussion

I went to the National Building Museum’s exhibit “Designing for Disaster”

It was all about understanding threats and designing structures to withstand natural disasters.

This was on the wall and I loved it

This is what we do. We try to get the value on the left as close to zero as we can with the $$$ that we have.

Probability

•  Building Analogy
   –  Likelihood of problem

•  In IT systems
   –  Mean Time Between Failures (MTBF)
   –  Know your infrastructure
   –  Categorize failure scenarios

•  What we can do
   –  Proactively Monitor, Profile, Feedback
   –  Perform Root Cause Analysis

Vulnerability

•  Building Analogy
   –  People and assets in harm’s way

•  In IT systems
   –  Impact, Severity
   –  Mission criticality

•  What we can do
   –  Plan for the problem / exposure we actually have

Performance

•  Building Analogy
   –  Integrity of infrastructure during adverse events

•  In IT systems
   –  Failover with consistency
   –  Mean Time To Recovery (MTTR) (HA vs. DR)
   –  Performance (speed)

•  What we can do
   –  Ensure HA/DR plans actually accomplish resiliency goals
   –  Keep MTTRs low (ideally recovery is automatic)
   –  Actually test DR plans
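MTBF and MTTR combine into the standard steady-state availability figure, availability = MTBF / (MTBF + MTTR). The sketch below works that arithmetic through to show why keeping MTTR low matters. The failure and recovery numbers are hypothetical, chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per year implied by the availability figure."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability(mtbf_hours, mttr_hours)) * minutes_per_year


# Hypothetical numbers: one failure every 1000 hours on average.
manual = availability(1000, 4)          # 4-hour manual recovery
automatic = availability(1000, 1 / 60)  # 1-minute automatic failover
```

With the same MTBF, shrinking MTTR from hours to a minute is what moves availability; the hardware did not get any more reliable.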

Old School Ops

•  Make sure hardware is sized correctly

•  Make SQL more efficient, slowing down development

•  Hook up systems to my enterprise monitoring tools

•  Execute the S.O.P.’s someone handed me, if they were ever written in the first place

•  “It’s your first day … congratulations, you are now the expert”

New School Ops

•  Proactive
   –  Monitor Your App (“Knowing is half the battle”)
   –  Compare Expected vs. Actual

•  Iterative
   –  Include O&M from the beginning of ops planning
   –  Continuous Integration / Development
   –  Run Dev / Integration like production
   –  Automate Everything (Using Dev)
   –  As mission changes O&M also must change

Status and Profiling

•  Heartbeat and Status Services
   –  I’d require this as a Dev Ops job interview task

•  Low level tools
   –  mongostat
   –  system profiler
   –  oplog
   –  mtools

•  Plugins to various monitoring tools
   –  Nagios, SNMP, etc.
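A heartbeat service can be very small. The sketch below is one possible shape, not a prescribed design: `heartbeat`, `ping`, and `slow_ms` are names invented for this example, and the probe is passed in as a callable so the check itself does not depend on any driver.

```python
import time
from typing import Any, Callable, Dict


def heartbeat(ping: Callable[[], Any], slow_ms: float = 100.0) -> Dict[str, Any]:
    """Run one health probe and report status plus latency.

    `ping` is any zero-argument callable that raises on failure.
    `slow_ms` is an assumed threshold; tune it to what is normal for you.
    """
    started = time.monotonic()
    try:
        ping()
    except Exception as exc:  # any failure means the service is down
        return {"status": "down", "error": str(exc), "latency_ms": None}
    latency_ms = (time.monotonic() - started) * 1000.0
    status = "ok" if latency_ms <= slow_ms else "slow"
    return {"status": status, "error": None, "latency_ms": latency_ms}
```

With PyMongo, the probe could be something like `lambda: client.admin.command("ping")`, where `client` is an already-connected `MongoClient`.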

MongoDB Management Service (MMS)

•  Monitoring
•  Backup & Recovery
•  Automation

MMS Monitoring

[Diagram. Components: App, Data Tier, two agents, MMS (VLAN / Cloud), Java Container, Operator. Labels: HTTP/S, Pull, Push, Alerts, Dashboards.]

Monitoring Side Bar: MMS Schema

•  Time Series Data
•  Data collection bucketed
•  Data captured at minute intervals in hourly docs
•  Graphs rolled up for bigger time resolutions with aggregation queries
•  User queries never cause real-time aggregations
•  8 shards run global MMS Monitoring
   –  35k instances
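The bucketing idea (minute samples stored inside hourly documents) can be sketched as an upsert specification. This is an illustration of the pattern only, not the actual MMS schema: the field names `host`, `metric`, `hour`, `minutes`, `count`, and `sum` are assumptions, and the running totals assume each minute is written once.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def minute_sample_update(host: str, metric: str, ts: datetime,
                         value: float) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build an upsert (filter, update) pair that stores one per-minute
    sample inside the document for that host, metric, and hour."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    bucket_filter = {"host": host, "metric": metric, "hour": hour}
    update = {
        "$set": {f"minutes.{ts.minute}": value},  # slot 0..59 within the hour
        "$inc": {"count": 1, "sum": value},       # running totals for rollups
    }
    return bucket_filter, update
```

With PyMongo, the pair would be applied as `collection.update_one(bucket_filter, update, upsert=True)`. One document per hour means 60 samples cost one document, and a graph for a coarser resolution can read the pre-computed totals instead of every sample.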

MMS Backup

[Diagram. Components: App, Data Tier, two agents, mongos, MMS (VLAN / Cloud), Java Container, MMS Daemon, HEAD, Blockstore, Operator / Script. Labels: HTTP/S, Get .tar restore point.]

MMS Automation

[Diagram. Components: App, Data Tier, two agents, MMS (VLAN / Cloud), Java Container, Operator. Labels: HTTP/S, Edit Goal State, Apply Goal State.]

Things to Monitor

•  Determine what is normal

•  Failovers (Planned / Unplanned)
•  Recovering Hosts
•  Replication Lag
•  Connections
•  Oplog Window
•  Lock %

•  RUNNING OUT OF STORAGE!!!
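Replication lag and oplog window are worth watching together, because a secondary whose lag approaches the oplog window is at risk of falling off the oplog and needing a full resync. The sketch below computes lag from a document shaped like the output of the `replSetGetStatus` command (`members`, `name`, `stateStr`, `optimeDate`). The alert threshold of half the window is an assumption for illustration, not a recommendation.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def replication_lag_seconds(status: Dict[str, Any]) -> Dict[str, float]:
    """Seconds each SECONDARY trails the PRIMARY, keyed by member name."""
    members = status["members"]
    primaries = [m for m in members if m["stateStr"] == "PRIMARY"]
    if not primaries:
        raise RuntimeError("no primary in replica set status")
    primary_optime = primaries[0]["optimeDate"]
    return {
        m["name"]: (primary_optime - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lag_alerts(status: Dict[str, Any], oplog_window: timedelta,
               fraction: float = 0.5) -> List[str]:
    """Names of secondaries whose lag exceeds `fraction` of the oplog window."""
    limit = oplog_window.total_seconds() * fraction
    lags = replication_lag_seconds(status)
    return sorted(name for name, lag in lags.items() if lag > limit)
```

With PyMongo, the input could come from `client.admin.command("replSetGetStatus")`. Members in other states, such as RECOVERING, are left out here and would need their own alert.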

Things to know

•  Individual doc deletion is expensive
   –  Plan for deletion profile

•  BSON storage gets fragmented by updates
   –  Repair jobs can be run on secondaries

•  “Automate” Everything
   –  Until you’ve scripted something, you don’t know if it’s going to work
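One common way to plan a deletion profile is to partition data into one collection per time period and drop whole collections when they expire, rather than deleting documents one by one. The sketch below shows that idea only; it is not something the slides prescribe. The `events_` prefix, the daily granularity, and both function names are assumptions, and it expects every prefixed collection name to end in a `YYYYMMDD` date.

```python
from datetime import date, timedelta
from typing import Iterable, List


def collection_for(day: date, prefix: str = "events_") -> str:
    """Name of the collection that holds documents for `day`."""
    return f"{prefix}{day:%Y%m%d}"


def expired_collections(names: Iterable[str], today: date, keep_days: int,
                        prefix: str = "events_") -> List[str]:
    """Daily collections older than the retention period.

    YYYYMMDD suffixes sort lexicographically in date order, so a plain
    string comparison against the cutoff name is enough.
    """
    cutoff = collection_for(today - timedelta(days=keep_days), prefix)
    return sorted(n for n in names if n.startswith(prefix) and n < cutoff)
```

With PyMongo, each returned name could be passed to `db.drop_collection(name)`. Scripting the retention job like this also makes it something that can be run in Dev before production relies on it.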

Thanks! Discussion
