96
High Performance Infrastructure

High performance Infrastructure Oct 2013

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: High performance Infrastructure Oct 2013

High Performance Infrastructure

Page 2: High performance Infrastructure Oct 2013

David Mytton

Woop Japan!

Page 3: High performance Infrastructure Oct 2013
Page 4: High performance Infrastructure Oct 2013

Server Density Infrastructure

Page 5: High performance Infrastructure Oct 2013

•150 servers

Server Density Infrastructure

Page 6: High performance Infrastructure Oct 2013

• June 2009 - 4yrs

Server Density Infrastructure

•150 servers

Page 7: High performance Infrastructure Oct 2013

•MySQL -> MongoDB

• June 2009 - 4yrs

Server Density Infrastructure

•150 servers

Page 8: High performance Infrastructure Oct 2013

•MySQL -> MongoDB

•25TB data per month

• June 2009 - 4yrs

Server Density Infrastructure

•150 servers

Page 9: High performance Infrastructure Oct 2013

Picture is unrelated! Mmm, ice cream.

• Fast network

Performance

Page 10: High performance Infrastructure Oct 2013

• Fast network

Performance

EC2 10 Gigabit Ethernet- Cluster Compute- High Memory Cluster- Cluster GPU- High I/O- High Storage

- Network cards- VLAN separation

Page 11: High performance Infrastructure Oct 2013

• Fast network

Performance

Workload: Read/Write?

Result set size

What is being stored?

- Read / write: adds to replication oplog- Images? Web pages? Tiny documents?- What is being returned? Optimised to return certain fields?

Page 12: High performance Infrastructure Oct 2013

• Fast network

Performance

Use Network Throughput

Normal 0-100Mb/s

Replication (Initial Sync) Burst +100Mb/s

Replication (Oplog) 0-100Mb/s

Backup Initial Sync + Oplog

Page 13: High performance Infrastructure Oct 2013

• Fast network

Performance

Inter-DC LAN

- Latency

Page 14: High performance Infrastructure Oct 2013

• Fast network

Performance

Inter-DC LAN

Cross USA Washington, DC - San Jose, CA

Page 15: High performance Infrastructure Oct 2013

• Fast network

Performance

Location Ping RTT Latency

Within USA 40-80ms

Trans-Atlantic 100ms

Trans-Pacific 150ms

Europe - Japan 300ms

Ping - low overheadImportant for replication

Page 16: High performance Infrastructure Oct 2013

Failover

•Replication

Page 17: High performance Infrastructure Oct 2013

Failover

•Master/slave

•Replication

- One master accepts all writes- Many slaves staying up to date with master- Can read from slaves

Page 18: High performance Infrastructure Oct 2013

Failover

•Min 3 nodes

•Master/slave

•Replication

Minimum of 3 nodes to form a majority in case one goes down. All store data.Odd number otherwise != majorityArbiter

Page 19: High performance Infrastructure Oct 2013

Failover

•Min 3 nodes

•Master/slave

•Automatic failover

•Replication

Drivers handle automatic failover. First query after a failure will fail which will trigger a reconnect. Need to handle retries

Page 20: High performance Infrastructure Oct 2013

•Replication lag

Performance

Location Ping RTT Latency

Within USA 40-80ms

Trans-Atlantic 100ms

Trans-Pacific 150ms

Europe - Japan 300ms

- Replication lag

Page 21: High performance Infrastructure Oct 2013

Replication Lag

1. Reads: eventual consistency

Page 22: High performance Infrastructure Oct 2013

Replication Lag

1. Reads: eventual consistency

2. Failover: slave behind

Page 23: High performance Infrastructure Oct 2013

Eventual Consistency

Stale data

Not what the user submitted?

Page 24: High performance Infrastructure Oct 2013

Eventual Consistency

Stale data

Inconsistent data

Doesn’t reflect the truth

Page 25: High performance Infrastructure Oct 2013

Eventual Consistency

Stale data

Inconsistent data

Changing data

Could change on every page refresh

Page 26: High performance Infrastructure Oct 2013

Eventual Consistency

Use Case Needs consistency?

Graphs No

User profile Yes

Statistics Depends

Alert config Yes

Statistics - depends on when they’re updated

Page 27: High performance Infrastructure Oct 2013

Replication Lag

1. Reads: eventual consistency

2. Failover: slave behind

Page 28: High performance Infrastructure Oct 2013

Slave behind

Failover: out of date master

Old dataRollback

Page 29: High performance Infrastructure Oct 2013

• Safe by default

>>> from pymongo import MongoClient>>> connection = MongoClient(w=int/str)

Value Meaning

0 Unsafe

1 Primary

2 Primary + x1 secondary

3 Primary + x2 secondaries

MongoDB WriteConcern

wtimeout - wait for write before raising an exception

Page 30: High performance Infrastructure Oct 2013
Page 31: High performance Infrastructure Oct 2013

Picture is unrelated! Mmm, ice cream.

• Fast network

•More RAM

Performance

Page 32: High performance Infrastructure Oct 2013

http://www.slideshare.net/jrosoff/mongodb-on-ec2-and-ebs

No 32 bitNo High CPURAM RAM RAM.

Page 35: High performance Infrastructure Oct 2013

More RAM = expensive Performance

x2 4GB RAM 12 month Prices

Page 36: High performance Infrastructure Oct 2013

RAM

SSDs

Spinning disk

Cost Speed

Page 37: High performance Infrastructure Oct 2013

Softlayer disk pricing Performance

Page 38: High performance Infrastructure Oct 2013

EC2 disk/RAM pricing Performance

$2232/m

$2520/m

$43/m

$295/m

Page 39: High performance Infrastructure Oct 2013

SSD vs Spinning Performance

SSDs are better at buffered disk reads, sequential input and random i/o.

Page 40: High performance Infrastructure Oct 2013

SSD vs Spinning Performance

However, CPU usage for SSDs is higher. This may be a driver issue so worth testing your own hardware. Tests done using Bonnie.

Page 41: High performance Infrastructure Oct 2013

Cloud?

Page 42: High performance Infrastructure Oct 2013

Cloud?

•Elastic workloads

Page 43: High performance Infrastructure Oct 2013

Cloud?

•Elastic workloads

•Demand spikes

Page 44: High performance Infrastructure Oct 2013

Cloud?

•Elastic workloads

•Demand spikes

•Unknown requirements

Page 45: High performance Infrastructure Oct 2013

Dedicated?

Page 46: High performance Infrastructure Oct 2013

Dedicated?

•Hardware replacement

Page 47: High performance Infrastructure Oct 2013

Dedicated?

•Hardware replacement

•Managed/support

Page 48: High performance Infrastructure Oct 2013

Dedicated?

•Hardware replacement

•Managed/support

•Networking

Page 49: High performance Infrastructure Oct 2013

Colo?

Page 50: High performance Infrastructure Oct 2013

Colo?

•Hardware spec/value

Page 51: High performance Infrastructure Oct 2013

Colo?

•Hardware spec/value

•Total cost

Page 52: High performance Infrastructure Oct 2013

Colo?

•Hardware spec/value

•Total cost

•Internal skills?

Page 53: High performance Infrastructure Oct 2013

Colo?

•Hardware spec/value

•Total cost

•Internal skills?

•More fun?!

Page 54: High performance Infrastructure Oct 2013
Page 55: High performance Infrastructure Oct 2013

•Build master (buildbot): VM x2 CPU 2.0Ghz, 2GB RAM – $89/m

•Build slave (buildbot): VM x1 CPU 2.0Ghz, 1GB RAM – $40/m

•Staging load balancer: VM x1 CPU 2.0Ghz, 1GB RAM – $40/m

•Staging server 1: VM x2 CPU 2.0Ghz, 8GB RAM – $165/m

•Staging server 2: VM x1 CPU 2.0Ghz, 2GB RAM – $50/m

•Puppet master: VM x2 CPU 2.0Ghz, 2GB RAM – $89/m

Total: $473/m

Colo experiment

Page 56: High performance Infrastructure Oct 2013

Colo experiment

•Dell 1U R415

•x2 8C AMD 2.8Ghz

•32GB RAM

Page 57: High performance Infrastructure Oct 2013

Colo experiment

•Dell 1U R415

•x2 8C AMD 2.8Ghz

•32GB RAM

•Dual PSU, NIC

Page 58: High performance Infrastructure Oct 2013

Colo experiment

•Dell 1U R415

•x2 8C AMD 2.8Ghz

•32GB RAM

•Dual PSU, NIC

•x4 1TB SATA hot swappable

Page 59: High performance Infrastructure Oct 2013
Page 60: High performance Infrastructure Oct 2013

Colo: Networking

•10-50Mbps: £20-25/Mbps/m

•51-100Mbps: £15/Mbps/m

•100+Mbps: £13/Mbps/m

Page 61: High performance Infrastructure Oct 2013

Colo: Metro

•100Mbps: £300/m

•1000Mbps: £750/m

Page 62: High performance Infrastructure Oct 2013

Colo: Power

•£300-350/kWh/m

•4.5A = £520/m

•9A = £900/m

Page 63: High performance Infrastructure Oct 2013
Page 64: High performance Infrastructure Oct 2013

Tips: rand()

•Field names

-Field names take up space

Page 65: High performance Infrastructure Oct 2013

Tips: rand()

•Field names

•Covered indexes

- Get everything from the index

Page 66: High performance Infrastructure Oct 2013

Tips: rand()

•Field names

•Covered indexes

•Collections / databases

- Dropping collections faster than remove()- Split use cases across databases to avoid locking- Put databases onto different disks / types e.g. SSDs

Page 67: High performance Infrastructure Oct 2013

Backups

What is the use case?

Page 68: High performance Infrastructure Oct 2013

Backups

What is the use case?

Fixing user errors?

Point in time restore?

Disaster recovery?

Page 69: High performance Infrastructure Oct 2013

Backups

•Disaster recovery

Offsite

- What kind of disaster?- Store backups offsite

Page 70: High performance Infrastructure Oct 2013

Backups

Age

Offsite

•Disaster recovery

How log do you keep the backups for?How far do they go back?How recent are they?

Page 71: High performance Infrastructure Oct 2013

Backups

Age

Offsite

Restore time

•Disaster recovery

Latency issue - further away geographically, slower the transfer timePartition backups to get critical data restored first

Page 72: High performance Infrastructure Oct 2013

david@asriel ~: scp david@stelmaria:~/local/local.11 .local.11 100% 2047MB 6.8MB/s 05:01

Restore time

- Needed to resync a database server across the US- Take too long; oplog not large enough- Fast internal network but slow internet

Page 73: High performance Infrastructure Oct 2013

1d, 1h, 58m

11.22MB/s

Page 74: High performance Infrastructure Oct 2013

Backups

Frequency

Consistency

Verification

- How often?- Backing up cluster at the same time - data moving around- Can the backups be restored?

Page 75: High performance Infrastructure Oct 2013

www.flickr.com/photos/daddo83/3406962115/

Monitoring

•System

Disk i/o

Disk use

Disk i/o % utilDisk space usage

Page 76: High performance Infrastructure Oct 2013

david@pan ~: df -aFilesystem 1K-blocks Used Available Use% Mounted on/dev/sda1 156882796 148489776 423964 100% /proc 0 0 0 - /procnone 0 0 0 - /dev/ptsnone 2097260 0 2097260 0% /dev/shmnone 0 0 0 - /proc/sys/fs/binfmt_misc david@pan ~: df -ahFilesystem Size Used Avail Use% Mounted on/dev/sda1 150G 142G 415M 100% /proc 0 0 0 - /procnone 0 0 0 - /dev/ptsnone 2.1G 0 2.1G 0% /dev/shmnone 0 0 0 - /proc/sys/fs/binfmt_

- Needed to upgrade a machine- Resize = downtime- Resyncing finished just in time

Page 77: High performance Infrastructure Oct 2013

www.flickr.com/photos/daddo83/3406962115/

Monitoring

Disk i/o

Disk use

•System

Swap

Disk i/o % utilDisk space usage

Page 78: High performance Infrastructure Oct 2013

www.flickr.com/photos/daddo83/3406962115/

Monitoring

Slave lag

State

•Replication

Page 79: High performance Infrastructure Oct 2013

Monitoring tools

Run yourself

Ganglia

So Server Density is the tool my company produces but if you don’t like it, want to run your own tools locally or just want to try some others, then that’s fine.

Page 80: High performance Infrastructure Oct 2013

Monitoring tools

www.serverdensity.com

Page 81: High performance Infrastructure Oct 2013

On-call

Dealing with humans

- Sharing out the responsibility- Determining level of response: 24/7 real monitoring or first responder- 24/7 real monitoring for HA environments, real people at a screen at all times- First responder: people at the end of a phone

Page 82: High performance Infrastructure Oct 2013

On-call 1) Ops engineer

Dealing with humans

- During working hours our dedicated ops engineers take the first level- Avoids interrupting product engineers for initial fire fighting

Page 83: High performance Infrastructure Oct 2013

On-call 1) Ops engineer

2) All engineers

Dealing with humans

- Out of hours we rotate every engineer, product and ops- Rotation every 7 days on a Tuesday

Page 84: High performance Infrastructure Oct 2013

On-call 1) Ops engineer

2) All engineers

3) Ops engineer

Dealing with humans

- Always have a secondary- This is always an ops engineer- Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs additional systems expertise

Page 85: High performance Infrastructure Oct 2013

On-call 1) Ops engineer

2) All engineers

3) Ops engineer

4) Others

Dealing with humans

- Support from design / frontend engineering- Have to press a button to get them involved

Page 86: High performance Infrastructure Oct 2013

Off-call

Dealing with humans

- Responders to an incident get next 24 hours off-call- Social issues to deal with

Page 87: High performance Infrastructure Oct 2013

On-call CEO

Dealing with humans

- I receive push notifications + e-mails for all outages

Page 88: High performance Infrastructure Oct 2013

Uptime reporting

Dealing with humans

- Weekly internal report on G+- Gives visibility to entire company about any incidents- Allows us to discuss incidents to get to that 100% uptime

Page 89: High performance Infrastructure Oct 2013

Social issues

Dealing with humans

- How quickly can you get to a computer?- Are they out drinking on a Friday?- What happens if someone is ill?- What if there’s a sudden emergency: accident? family emergency?- Do they have enough phone battery?- Can you hear the ringtone?

Page 90: High performance Infrastructure Oct 2013

Backup responder

Dealing with humans

- Backup responder- Time out the initial responder- Escalate difficult problems- Essentially human redundancy: phone provider, geographic area, internet connectivity

Page 91: High performance Infrastructure Oct 2013

Expected

Dealing with outages

- Outages are going to happen, especially at the beginning- Costs money for redundancy- How you deal with them

Page 92: High performance Infrastructure Oct 2013

Dealing with outages

Externally

Communication

- Telling people what is happening- Frequently- Dependent on audience - we can go into more detail because our customers are techies- Github do a good job of providing incident writeups but don’t provide a good idea of what is happening right now- Generally Amazon and Heroku are good and go into more detail

Page 93: High performance Infrastructure Oct 2013

Communication

Dealing with outages

Internally

- Open Skype conferences between the responders- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room- Faster than typing

Page 94: High performance Infrastructure Oct 2013

Really test your vendors

Dealing with outages

- Shows up flaws in vendor support processes- Frustrating when waiting on someone else- You want as much information as possible- Major outage? Everyone will be calling them

Page 95: High performance Infrastructure Oct 2013

Simulations

Dealing with outages

- Try and avoid unncessary problems- Do servers come back up from boot?- Can hot spares handle the load?- Test failover: databases, HA firewalls- Regularly reboot servers- Wargames can happen at another stage: startups are usually too focused on building things first

Page 96: High performance Infrastructure Oct 2013

David Mytton

[email protected]

@davidmytton

Woop Japan!

blog.serverdensity.com