High Performance Infrastructure
David Mytton
Woop Japan!
Server Density Infrastructure
• 150 servers
• MySQL -> MongoDB
• 25TB data per month
• June 2009 - 4yrs

Picture is unrelated! Mmm, ice cream.
Performance
• Fast network

EC2 10 Gigabit Ethernet instance families:
- Cluster Compute
- High Memory Cluster
- Cluster GPU
- High I/O
- High Storage
- Network cards, VLAN separation
Performance
• Fast network

Workload: read/write?
Result set size
What is being stored?
- Read / write: adds to the replication oplog
- Images? Web pages? Tiny documents?
- What is being returned? Optimised to return certain fields?
Performance
• Fast network

Use                         Network throughput
Normal                      0-100Mb/s
Replication (initial sync)  Burst +100Mb/s
Replication (oplog)         0-100Mb/s
Backup                      Initial sync + oplog
Performance
• Fast network

Inter-DC LAN
- Latency
Performance
• Fast network

Inter-DC LAN
Cross USA: Washington, DC - San Jose, CA

Location        Ping RTT latency
Within USA      40-80ms
Trans-Atlantic  100ms
Trans-Pacific   150ms
Europe - Japan  300ms

Ping is low overhead. Important for replication.
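A quick way to sanity-check the RTT figures above from your own machines: real ping uses ICMP, which needs raw sockets (root), but timing a TCP handshake gives an unprivileged approximation. This is a small sketch; `tcp_rtt_ms` is a hypothetical helper name, not part of any library.

```python
import socket
import time

def tcp_rtt_ms(host, port=443, timeout=5.0):
    """Rough RTT estimate: time a TCP handshake to host:port.
    ICMP ping needs raw sockets; TCP connect time is a close,
    unprivileged stand-in."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # handshake done; close immediately
    return (time.monotonic() - start) * 1000.0
```

Running this from each data centre against the others gives the per-link latency that feeds into replication lag.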
Failover
• Replication
• Master/slave
  - One master accepts all writes
  - Many slaves staying up to date with the master
  - Can read from slaves
• Min 3 nodes
  - Minimum of 3 nodes to form a majority in case one goes down. All store data.
  - Odd number, otherwise there is no majority. An arbiter can provide the extra vote without storing data.
• Automatic failover
  - Drivers handle automatic failover. The first query after a failure will fail, which triggers a reconnect. You need to handle retries.
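Handling those retries can be as simple as a wrapper around the operation. A minimal sketch, with `with_retry` as a hypothetical helper: it is written generically here so it runs standalone, but with pymongo you would pass `retryable=(pymongo.errors.AutoReconnect,)`, the exception the driver raises on the first query after a failover.

```python
import time

def with_retry(op, retryable=(ConnectionError,), retries=3, delay=0.5):
    """Run op(), retrying if it raises a retryable error.
    With pymongo, pass retryable=(pymongo.errors.AutoReconnect,):
    the first query after a failover raises it and triggers a reconnect."""
    for attempt in range(retries):
        try:
            return op()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(delay * (2 ** attempt))  # back off before retrying
```

Usage would look like `with_retry(lambda: collection.find_one({"_id": 1}), retryable=(AutoReconnect,))`.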
Performance
• Replication lag
- The inter-DC RTT latencies above feed directly into replication lag.
Replication Lag
1. Reads: eventual consistency
2. Failover: slave behind
Eventual Consistency
• Stale data
  - Not what the user submitted?
• Inconsistent data
  - Doesn't reflect the truth
• Changing data
  - Could change on every page refresh

Use case      Needs consistency?
Graphs        No
User profile  Yes
Statistics    Depends
Alert config  Yes

Statistics - depends on when they're updated.
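The table above amounts to a per-use-case read policy. A minimal sketch of encoding it: the dictionary and `read_preference_for` are illustrative names, not an API; in pymongo the values map onto the driver's `read_preference` option (`ReadPreference.PRIMARY` vs `ReadPreference.SECONDARY_PREFERRED`).

```python
# Illustrative policy table mapping use cases to MongoDB read preferences:
# "primary" = consistent reads, "secondaryPreferred" = eventual
# consistency tolerated (reads may hit a lagging slave).
READ_PREFERENCE = {
    "graphs": "secondaryPreferred",      # stale data is fine
    "user_profile": "primary",           # must match what the user submitted
    "statistics": "secondaryPreferred",  # "depends" - ok if updated in batches
    "alert_config": "primary",           # alerts must act on the truth
}

def read_preference_for(use_case):
    # Default to primary: consistent reads when in doubt.
    return READ_PREFERENCE.get(use_case, "primary")
```

The useful property is the default: anything not explicitly marked as tolerating staleness reads from the master.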
Replication Lag
2. Failover: slave behind
- Failover can promote an out-of-date slave to master
- Old data. Rollback.
MongoDB WriteConcern
• Safe by default

>>> from pymongo import MongoClient
>>> connection = MongoClient(w=1)  # w accepts an int or a string such as "majority"

Value  Meaning
0      Unsafe (unacknowledged)
1      Primary
2      Primary + x1 secondary
3      Primary + x2 secondaries

wtimeout - how long (ms) to wait for the write concern to be satisfied before raising an exception; the write itself is not rolled back.
Performance
• Fast network
• More RAM
- No 32 bit. No High CPU instances. RAM, RAM, RAM.
- http://www.slideshare.net/jrosoff/mongodb-on-ec2-and-ebs
- http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html

Performance
- More RAM = expensive
x2 4GB RAM, 12 month prices
[Chart: cost vs speed trade-off for RAM, SSDs and spinning disk]
[Charts: Softlayer disk pricing; EC2 disk/RAM pricing - $2232/m, $2520/m, $43/m, $295/m]
Performance
SSD vs Spinning
- SSDs are better at buffered disk reads, sequential input and random I/O.
- However, CPU usage for SSDs is higher. This may be a driver issue, so it is worth testing your own hardware. Tests done using Bonnie.
Cloud?
• Elastic workloads
• Demand spikes
• Unknown requirements
Dedicated?
• Hardware replacement
• Managed/support
• Networking
Colo?
• Hardware spec/value
• Total cost
• Internal skills?
• More fun?!
•Build master (buildbot): VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
•Build slave (buildbot): VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
•Staging load balancer: VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
•Staging server 1: VM x2 CPU 2.0Ghz, 8GB RAM – $165/m
•Staging server 2: VM x1 CPU 2.0Ghz, 2GB RAM – $50/m
•Puppet master: VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
Total: $473/m
Colo experiment
• Dell 1U R415
• x2 8C AMD 2.8Ghz
• 32GB RAM
• Dual PSU, NIC
• x4 1TB SATA hot swappable
Colo: Networking
•10-50Mbps: £20-25/Mbps/m
•51-100Mbps: £15/Mbps/m
•100+Mbps: £13/Mbps/m
Colo: Metro
•100Mbps: £300/m
•1000Mbps: £750/m
Colo: Power
•£300-350/kWh/m
•4.5A = £520/m
•9A = £900/m
Tips: rand()
• Field names
  - Field names take up space in every document
• Covered indexes
  - Get everything from the index
• Collections / databases
  - Dropping collections is faster than remove()
  - Split use cases across databases to avoid locking
  - Put databases onto different disks / types, e.g. SSDs
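The field-names tip is easy to quantify: every document carries its own field names, so long names cost space in every record. A small sketch, using JSON length as a rough stand-in for BSON size (the documents and `doc_size` helper are illustrative):

```python
import json

# Field names are stored in every document, so long names add overhead
# to every record. JSON length is a rough proxy for BSON size here.
verbose = {"account_identifier": 42, "created_timestamp": 1234567890}
terse = {"a": 42, "c": 1234567890}

def doc_size(doc):
    return len(json.dumps(doc))

saving = doc_size(verbose) - doc_size(terse)  # bytes saved per document
```

A few dozen bytes per document is nothing, but across hundreds of millions of tiny documents it becomes real disk and RAM; the trade-off is readability.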
Backups
What is the use case?
- Fixing user errors?
- Point in time restore?
- Disaster recovery?

• Disaster recovery
  - What kind of disaster? Store backups offsite.
• Offsite
• Age
  - How long do you keep the backups for? How far do they go back? How recent are they?
• Restore time
  - Latency issue: the further away geographically, the slower the transfer.
  - Partition backups to get critical data restored first.
Restore time

david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11    100% 2047MB   6.8MB/s   05:01

- Needed to resync a database server across the US
- Took too long; the oplog was not large enough
- Fast internal network but slow internet
- 1d, 1h, 58m at 11.22MB/s
Backups
• Frequency
  - How often?
• Consistency
  - Backing up a cluster at the same time - data is moving around
• Verification
  - Can the backups be restored?
www.flickr.com/photos/daddo83/3406962115/
Monitoring
• System
  - Disk i/o: % util
  - Disk use: disk space usage

david@pan ~: df -a
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/sda1       156882796 148489776    423964 100% /
proc                    0         0         0    - /proc
none                    0         0         0    - /dev/pts
none              2097260         0   2097260   0% /dev/shm
none                    0         0         0    - /proc/sys/fs/binfmt_misc

david@pan ~: df -ah
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       150G  142G  415M 100% /
proc               0     0     0    - /proc
none               0     0     0    - /dev/pts
none            2.1G     0  2.1G   0% /dev/shm
none               0     0     0    - /proc/sys/fs/binfmt_misc

- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
Monitoring
• System
  - Swap
Monitoring
• Replication
  - Slave lag
  - State
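Slave lag can be computed from the replica set status: in pymongo the member documents come from `client.admin.command("replSetGetStatus")["members"]`, each with `name`, `stateStr` and `optimeDate` fields. A minimal sketch (the `slave_lag` helper is illustrative, and assumes those fields are present):

```python
from datetime import datetime, timedelta

def slave_lag(members):
    """Compute per-secondary replication lag from replSetGetStatus-style
    member documents (fields: name, stateStr, optimeDate)."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }
```

Alerting on the resulting timedeltas catches both the eventual-consistency problem (stale reads) and the failover problem (promoting a slave that is far behind).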
Monitoring tools
• Run yourself
  - Ganglia
- Server Density is the tool my company produces, but if you don't like it, want to run your own tools locally, or just want to try some others, that's fine.
- www.serverdensity.com
On-call
Dealing with humans
- Sharing out the responsibility
- Determining the level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments: real people at a screen at all times
- First responder: people at the end of a phone
On-call: 1) Ops engineer
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
On-call: 1) Ops engineer, 2) All engineers
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days, on a Tuesday
On-call: 1) Ops engineer, 2) All engineers, 3) Ops engineer
- Always have a secondary; this is always an ops engineer
- The thinking is that if the issue needs to be escalated, it's likely a bigger problem that needs additional systems expertise
On-call: 1) Ops engineer, 2) All engineers, 3) Ops engineer, 4) Others
- Support from design / frontend engineering
- Have to press a button to get them involved
Off-call
- Responders to an incident get the next 24 hours off-call
- Social issues to deal with
On-call: CEO
- I receive push notifications + e-mails for all outages
Uptime reporting
- Weekly internal report on G+
- Gives the entire company visibility into any incidents
- Allows us to discuss incidents to get to that 100% uptime
Social issues
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there's a sudden emergency: an accident? a family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
Dealing with outages
Expected
- Outages are going to happen, especially at the beginning
- Redundancy costs money
- What matters is how you deal with them
Communication: externally
- Telling people what is happening, frequently
- Dependent on audience: we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups, but don't provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Communication: internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but it simulates being in the situation room
- Faster than typing
Really test your vendors
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
Simulations
- Try to avoid unnecessary problems
- Do servers come back up from a reboot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at a later stage: startups are usually too focused on building things first