High Performance Infrastructure
David Mytton
Woop Japan!
Server Density Infrastructure
• 150 servers
• MySQL -> MongoDB
• 25TB data per month
• June 2009 - 4yrs

Picture is unrelated! Mmm, ice cream.
Performance
• Fast network

EC2 10 Gigabit Ethernet instance families:
- Cluster Compute
- High Memory Cluster
- Cluster GPU
- High I/O
- High Storage
- Network cards, VLAN separation
Performance
• Fast network

Workload: read/write?
Result set size
What is being stored?
- Read / write: adds to the replication oplog
- Images? Web pages? Tiny documents?
- What is being returned? Optimised to return certain fields?
Performance
• Fast network

Use                         Network throughput
Normal                      0-100Mb/s
Replication (initial sync)  Burst +100Mb/s
Replication (oplog)         0-100Mb/s
Backup                      Initial sync + oplog
Performance
• Fast network

Inter-DC LAN
- Latency
Performance
• Fast network

Inter-DC LAN
Cross USA: Washington, DC - San Jose, CA

Location        Ping RTT latency
Within USA      40-80ms
Trans-Atlantic  100ms
Trans-Pacific   150ms
Europe - Japan  300ms

Ping is low overhead. Important for replication.
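A quick way to sanity-check the RTT figures above from your own machines: real ping uses ICMP, which needs raw sockets (root), but timing a TCP handshake gives an unprivileged approximation. This is a small sketch; `tcp_rtt_ms` is a hypothetical helper name, not part of any library.

```python
import socket
import time

def tcp_rtt_ms(host, port=443, timeout=5.0):
    """Rough RTT estimate: time a TCP handshake to host:port.
    ICMP ping needs raw sockets; TCP connect time is a close,
    unprivileged stand-in."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # handshake done; close immediately
    return (time.monotonic() - start) * 1000.0
```

Running this from each data centre against the others gives the per-link latency that feeds into replication lag.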
Failover
• Replication
• Master/slave
  - One master accepts all writes
  - Many slaves staying up to date with the master
  - Can read from slaves
• Min 3 nodes
  - Minimum of 3 nodes to form a majority in case one goes down. All store data.
  - Odd number, otherwise there is no majority. An arbiter can provide the extra vote without storing data.
• Automatic failover
  - Drivers handle automatic failover. The first query after a failure will fail, which triggers a reconnect. You need to handle retries.
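Handling those retries can be as simple as a wrapper around the operation. A minimal sketch, with `with_retry` as a hypothetical helper: it is written generically here so it runs standalone, but with pymongo you would pass `retryable=(pymongo.errors.AutoReconnect,)`, the exception the driver raises on the first query after a failover.

```python
import time

def with_retry(op, retryable=(ConnectionError,), retries=3, delay=0.5):
    """Run op(), retrying if it raises a retryable error.
    With pymongo, pass retryable=(pymongo.errors.AutoReconnect,):
    the first query after a failover raises it and triggers a reconnect."""
    for attempt in range(retries):
        try:
            return op()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(delay * (2 ** attempt))  # back off before retrying
```

Usage would look like `with_retry(lambda: collection.find_one({"_id": 1}), retryable=(AutoReconnect,))`.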
Performance
• Replication lag
- The inter-DC RTT latencies above feed directly into replication lag.
Replication Lag
1. Reads: eventual consistency
2. Failover: slave behind
Eventual Consistency
• Stale data
  - Not what the user submitted?
• Inconsistent data
  - Doesn't reflect the truth
• Changing data
  - Could change on every page refresh

Use case      Needs consistency?
Graphs        No
User profile  Yes
Statistics    Depends
Alert config  Yes

Statistics - depends on when they're updated.
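The table above amounts to a per-use-case read policy. A minimal sketch of encoding it: the dictionary and `read_preference_for` are illustrative names, not an API; in pymongo the values map onto the driver's `read_preference` option (`ReadPreference.PRIMARY` vs `ReadPreference.SECONDARY_PREFERRED`).

```python
# Illustrative policy table mapping use cases to MongoDB read preferences:
# "primary" = consistent reads, "secondaryPreferred" = eventual
# consistency tolerated (reads may hit a lagging slave).
READ_PREFERENCE = {
    "graphs": "secondaryPreferred",      # stale data is fine
    "user_profile": "primary",           # must match what the user submitted
    "statistics": "secondaryPreferred",  # "depends" - ok if updated in batches
    "alert_config": "primary",           # alerts must act on the truth
}

def read_preference_for(use_case):
    # Default to primary: consistent reads when in doubt.
    return READ_PREFERENCE.get(use_case, "primary")
```

The useful property is the default: anything not explicitly marked as tolerating staleness reads from the master.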
Replication Lag
2. Failover: slave behind
- Failover can promote an out-of-date slave to master
- Old data. Rollback.
MongoDB WriteConcern
• Safe by default

>>> from pymongo import MongoClient
>>> connection = MongoClient(w=1)  # w accepts an int or a string such as "majority"

Value  Meaning
0      Unsafe (unacknowledged)
1      Primary
2      Primary + x1 secondary
3      Primary + x2 secondaries

wtimeout - how long (ms) to wait for the write concern to be satisfied before raising an exception; the write itself is not rolled back.
Performance
• Fast network
• More RAM
- No 32 bit. No High CPU instances. RAM, RAM, RAM.
- http://www.slideshare.net/jrosoff/mongodb-on-ec2-and-ebs
- http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html

Performance
- More RAM = expensive
x2 4GB RAM, 12 month prices
[Chart: cost vs speed trade-off for RAM, SSDs and spinning disk]
[Charts: Softlayer disk pricing; EC2 disk/RAM pricing - $2232/m, $2520/m, $43/m, $295/m]
Performance
SSD vs Spinning
- SSDs are better at buffered disk reads, sequential input and random I/O.
- However, CPU usage for SSDs is higher. This may be a driver issue, so it is worth testing your own hardware. Tests done using Bonnie.
Cloud?
• Elastic workloads
• Demand spikes
• Unknown requirements
Dedicated?
• Hardware replacement
• Managed/support
• Networking
Colo?
• Hardware spec/value
• Total cost
• Internal skills?
• More fun?!
•Build master (buildbot): VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
•Build slave (buildbot): VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
•Staging load balancer: VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
•Staging server 1: VM x2 CPU 2.0Ghz, 8GB RAM – $165/m
•Staging server 2: VM x1 CPU 2.0Ghz, 2GB RAM – $50/m
•Puppet master: VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
Total: $473/m
Colo experiment
• Dell 1U R415
• x2 8C AMD 2.8Ghz
• 32GB RAM
• Dual PSU, NIC
• x4 1TB SATA hot swappable
Colo: Networking
•10-50Mbps: £20-25/Mbps/m
•51-100Mbps: £15/Mbps/m
•100+Mbps: £13/Mbps/m
Colo: Metro
•100Mbps: £300/m
•1000Mbps: £750/m
Colo: Power
•£300-350/kWh/m
•4.5A = £520/m
•9A = £900/m
Tips: rand()
• Field names
  - Field names take up space in every document
• Covered indexes
  - Get everything from the index
• Collections / databases
  - Dropping collections is faster than remove()
  - Split use cases across databases to avoid locking
  - Put databases onto different disks / types, e.g. SSDs
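The field-names tip is easy to quantify: every document carries its own field names, so long names cost space in every record. A small sketch, using JSON length as a rough stand-in for BSON size (the documents and `doc_size` helper are illustrative):

```python
import json

# Field names are stored in every document, so long names add overhead
# to every record. JSON length is a rough proxy for BSON size here.
verbose = {"account_identifier": 42, "created_timestamp": 1234567890}
terse = {"a": 42, "c": 1234567890}

def doc_size(doc):
    return len(json.dumps(doc))

saving = doc_size(verbose) - doc_size(terse)  # bytes saved per document
```

A few dozen bytes per document is nothing, but across hundreds of millions of tiny documents it becomes real disk and RAM; the trade-off is readability.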
Backups
What is the use case?
- Fixing user errors?
- Point in time restore?
- Disaster recovery?

• Disaster recovery
  - What kind of disaster? Store backups offsite.
• Offsite
• Age
  - How long do you keep the backups for? How far do they go back? How recent are they?
• Restore time
  - Latency issue: the further away geographically, the slower the transfer.
  - Partition backups to get critical data restored first.
Restore time

david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11    100% 2047MB   6.8MB/s   05:01

- Needed to resync a database server across the US
- Took too long; the oplog was not large enough
- Fast internal network but slow internet
- 1d, 1h, 58m at 11.22MB/s
Backups
• Frequency
  - How often?
• Consistency
  - Backing up a cluster at the same time - data is moving around
• Verification
  - Can the backups be restored?
www.flickr.com/photos/daddo83/3406962115/
Monitoring
• System
  - Disk i/o: % util
  - Disk use: disk space usage

david@pan ~: df -a
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/sda1       156882796 148489776    423964 100% /
proc                    0         0         0    - /proc
none                    0         0         0    - /dev/pts
none              2097260         0   2097260   0% /dev/shm
none                    0         0         0    - /proc/sys/fs/binfmt_misc

david@pan ~: df -ah
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       150G  142G  415M 100% /
proc               0     0     0    - /proc
none               0     0     0    - /dev/pts
none            2.1G     0  2.1G   0% /dev/shm
none               0     0     0    - /proc/sys/fs/binfmt_misc

- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
Monitoring
• System
  - Swap
Monitoring
• Replication
  - Slave lag
  - State
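Slave lag can be computed from the replica set status: in pymongo the member documents come from `client.admin.command("replSetGetStatus")["members"]`, each with `name`, `stateStr` and `optimeDate` fields. A minimal sketch (the `slave_lag` helper is illustrative, and assumes those fields are present):

```python
from datetime import datetime, timedelta

def slave_lag(members):
    """Compute per-secondary replication lag from replSetGetStatus-style
    member documents (fields: name, stateStr, optimeDate)."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }
```

Alerting on the resulting timedeltas catches both the eventual-consistency problem (stale reads) and the failover problem (promoting a slave that is far behind).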
Monitoring tools
• Run yourself
  - Ganglia
- Server Density is the tool my company produces, but if you don't like it, want to run your own tools locally, or just want to try some others, that's fine.
- www.serverdensity.com
On-call
Dealing with humans
- Sharing out the responsibility
- Determining the level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments: real people at a screen at all times
- First responder: people at the end of a phone
On-call: 1) Ops engineer
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
On-call: 1) Ops engineer, 2) All engineers
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days, on a Tuesday
On-call: 1) Ops engineer, 2) All engineers, 3) Ops engineer
- Always have a secondary; this is always an ops engineer
- The thinking is that if the issue needs to be escalated, it's likely a bigger problem that needs additional systems expertise
On-call: 1) Ops engineer, 2) All engineers, 3) Ops engineer, 4) Others
- Support from design / frontend engineering
- Have to press a button to get them involved
Off-call
- Responders to an incident get the next 24 hours off-call
- Social issues to deal with
On-call: CEO
- I receive push notifications + e-mails for all outages
Uptime reporting
- Weekly internal report on G+
- Gives the entire company visibility into any incidents
- Allows us to discuss incidents to get to that 100% uptime
Social issues
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there's a sudden emergency: an accident? a family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
Dealing with outages
Expected
- Outages are going to happen, especially at the beginning
- Redundancy costs money
- What matters is how you deal with them
Communication: externally
- Telling people what is happening, frequently
- Dependent on audience: we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups, but don't provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Communication: internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but it simulates being in the situation room
- Faster than typing
Really test your vendors
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
Simulations
- Try to avoid unnecessary problems
- Do servers come back up from a reboot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at a later stage: startups are usually too focused on building things first