A Brief, Rapid History of Scaling Instagram
(with a tiny team)
Mike Krieger
QConSF 2013
30 million with 2 eng (2010-end 2012)
150 million with 6 eng (2012-now)
What I would have done differently
What tradeoffs you make when scaling with that size team
(if you can help it, have a bigger team)
Not perfect solutions — a decision-making process
Do the simplest thing first
Every infra moving part is another “thread” your team has to manage
Test & Monitor Everything
This talk
Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB
2010 2 guys on a pier
Mike iOS, Kevin Server
Early stack: Django + Apache mod_wsgi, Postgres, Redis, Gearman, Memcached, Nginx
If starting today: Django + uWSGI, Postgres, Redis, Celery, Memcached, HAProxy
Three months later
Server planning night before launch
Year 1: Scaling Up
Single server in LA
“What’s a load average?”
“Can we get another server?”
Doritos & Red Bull & Animal Crackers & Amazon EC2
Underwater on recruiting
2 total engineers
Scale "just enough" to get back to working on app
Every weekend was an accomplishment
“Infra is what happens when you’re busy making other plans”
—Ops Lennon
First bottleneck: disk IO on old Amazon EBS
At the time: ~400 IOPS max
Simple thing first
Vertical partitioning
Django DB Routers
Partitions
Media, Likes, Comments, Everything else
PG Replication to bootstrap nodes
Bought us some time
Almost no application logic changes (other than some primary keys)
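A minimal sketch of the kind of Django DB router vertical partitioning relies on: each partitioned app goes to its own database alias, everything else to the default. The app labels and aliases here are illustrative, not Instagram's actual names.

```python
# Illustrative vertical-partitioning router: route "media", "likes",
# and "comments" apps to dedicated Postgres instances.
PARTITIONS = {
    "media": "media_db",
    "likes": "likes_db",
    "comments": "comments_db",
}

class VerticalPartitionRouter:
    """Send each partitioned app to its own database alias."""

    def db_for_read(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow joins within the same database.
        return self.db_for_read(type(obj1)) == self.db_for_read(type(obj2))
```

Wired up via `DATABASE_ROUTERS` in settings; application code keeps issuing ordinary ORM queries, which is why almost no logic had to change.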
Today: SSD and provisioned IOPS get you way further
Vertical partitioning by data type
No easy migration story; mostly double-writing
Replicating + deleting often leaves fragmentation
Chaining replication = awesome
Scaling Memcached
Consistent hashing / ketama
Mind that hash function
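A small ketama-style hash ring sketch, assuming virtual nodes and an MD5 hash (real memcached clients implement this for you). The point of "mind that hash function": a well-spread hash like MD5 keeps keys balanced, and a key only moves when the node that owned it goes away.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 spreads keys evenly around the ring; a poor hash clusters them.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing with virtual nodes (ketama-style sketch)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}          # point on ring -> node name
        self.sorted_keys = []
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node gets `replicas` points on the ring for balance.
        for i in range(self.replicas):
            point = _hash(f"{node}:{i}")
            self.ring[point] = node
            bisect.insort(self.sorted_keys, point)

    def get(self, key):
        # Walk clockwise to the first node point at or after the key's hash.
        if not self.ring:
            return None
        idx = bisect.bisect(self.sorted_keys, _hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```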
Why not Redis for kv caching?
Config Management & Deployment
fabric + parallel git pull (sorry GitHub)
All AMI based snapshots for new instances
update_update_ami.sh
Should have done Chef earlier
Infra going from 10% time to 70%
Testing & monitoring kept concurrent fires to a minimum
Several ticking time bombs
Year 2: Scaling Out
Stateless, but plentiful
HAProxy (Dead node detection)
Connection limits everywhere
PGBouncer Homegrown Redis pool
Hard to track down kernel panics
Skip rabbit hole; use instance-status to detect and restart
Database Scale Out
Out of IO again (Pre SSDs)
Theory: partitioning and rebalancing are hard to get right, let DB take care of it
MongoDB (1.2 at the time)
Double write, shadow reads
Stressing about Primary Key
Data loss, segfaults
Could have made it work…
…but it would have been someone’s full time job
(and we still only had 3 people)
train + rapidly approaching cliff
Sharding in Postgres
QCon to the rescue
Similar approach to FB (infra foreshadowing?)
Logical partitioning, done at application level
Simplest thing; skipped abstractions & proxies
note to self: pick a power of 2 next time
Postgres "schemas"
database → schema → table → columns
(a schema is a namespace inside a database, not the table layout)
machineA: shard0 photos_by_user | shard1 photos_by_user | shard2 photos_by_user | shard3 photos_by_user
machineA’: replica of machineA with the same shards; when capacity runs out, the shards are split between the two machines
Still how we scale PG today
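A sketch of the application-level routing this implies, under assumed names: a fixed number of logical shards, each a Postgres schema (`shard0`..`shardN-1`), mapped onto a small, changeable set of physical machines. Shard count, host names, and the range map are illustrative.

```python
NUM_SHARDS = 4096  # fixed forever; pick a power of 2

# Which machine currently serves which contiguous block of logical shards.
SHARD_TO_HOST = {
    range(0, 2048): "machineA",
    range(2048, 4096): "machineA'",
}

def shard_for_user(user_id: int) -> int:
    return user_id % NUM_SHARDS

def location_for_user(user_id: int):
    """Return (host, schema) for a user's data."""
    shard = shard_for_user(user_id)
    for shard_range, host in SHARD_TO_HOST.items():
        if shard in shard_range:
            # queries then run as: SELECT ... FROM shardN.photos_by_user ...
            return host, f"shard{shard}"
    raise LookupError(shard)
```

Rebalancing means updating the range map to point some shard blocks at a new machine — no data-layer rehashing, no proxy tier.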
9.2 upgrade: bucardo to move schema by schema
Requirements
No extra moving parts
64 bits max
Time ordered
Contains the partition key
41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID
10 bits: auto-incrementing sequence, modulo 1024
This means we can generate 1024 IDs, per shard, per table, per millisecond
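The bit layout above can be sketched as follows. Instagram generated these inside Postgres itself (a per-schema sequence plus a PL/pgSQL function); this Python version just shows the arithmetic, and the custom epoch is an illustrative assumption.

```python
import time

EPOCH_MS = 1293840000000  # assumed custom epoch, e.g. 2011-01-01 UTC

def make_id(shard_id, seq, now_ms=None):
    """64-bit ID: 41 bits of millis since epoch, 13 bits shard, 10 bits sequence."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ts = (now_ms - EPOCH_MS) & ((1 << 41) - 1)
    return (ts << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

def shard_of(item_id):
    # The partition key is recoverable from the ID itself.
    return (item_id >> 10) & 0x1FFF
```

Because the timestamp occupies the high bits, IDs sort by creation time; because the shard ID is embedded, any row can be located from its ID alone.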
A new db is a full time commitment
Be thrifty with your existing tech
= minimize moving parts
Scaling configs/host discovery
ZooKeeper or DNS server?
No team to maintain
fab update_etc_hosts (generates, deploys)
Limited: dead host failover, etc
But zero additional infra, got the job done, easy to debug
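A sketch of the zero-extra-infra approach: render `/etc/hosts` from one source-of-truth mapping, then push it everywhere (the talk used a Fabric task for the deploy step). The role names and addresses here are made up.

```python
# Single source of truth for host discovery: role name -> IP.
ROLES = {
    "db-shard-a": "10.0.1.10",
    "db-shard-b": "10.0.1.11",
    "cache1": "10.0.2.10",
    "app1": "10.0.3.10",
}

def render_etc_hosts(roles):
    """Generate an /etc/hosts file body from the role mapping."""
    lines = ["127.0.0.1 localhost"]
    for name, ip in sorted(roles.items()):
        lines.append(f"{ip} {name}")
    return "\n".join(lines) + "\n"
```

Application code connects to `db-shard-a` by name; repointing a role means regenerating and redeploying one file, which is trivial to inspect when debugging.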
Munin: too coarse, too hard to add new stats
StatsD & Graphite
statsd.timer / statsd.incr
Step change in developer attitude towards stats
<5 min from wanting to measure, to having a graph
580 statsd counters, 164 statsd timers
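For flavor, a minimal StatsD client over UDP in the spirit of those calls. The wire format is standard StatsD ("name:value|c" for counters, "name:value|ms" for timers); the host, port, and metric names are assumptions. Fire-and-forget UDP is what makes adding a stat nearly free.

```python
import socket
import time

def counter_payload(name, count=1):
    return f"{name}:{count}|c"

def timer_payload(name, ms):
    return f"{name}:{ms}|ms"

class Statsd:
    """Tiny UDP StatsD client: no acks, no blocking, safe to sprinkle everywhere."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        self.sock.sendto(counter_payload(name, count).encode(), self.addr)

    def timer(self, name, ms):
        self.sock.sendto(timer_payload(name, ms).encode(), self.addr)

statsd = Statsd()
statsd.incr("feed.requests")
start = time.time()
# ... render the feed ...
statsd.timer("feed.render", int((time.time() - start) * 1000))
```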
Launched Android
(doubling all of our infra, most of which was now horizontally scalable)
Doubled active users in < 6 months
Finally, slowly, building up team
Year 3+: Stability, Video, FB
Scale tools to match team
Deployment & Config Management
Finally 100% on Chef
Simple thing first: knife and chef-solo
Every new hire learns Chef
Many rollouts a day
Continuous integration
But push still needs a driver
Humans are terrible distributed locking systems
Redis-enforced locks
Rollout / major config changes / live deployment tracking
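A sketch of a Redis-enforced deploy lock, assuming redis-py's `set(..., nx=True, ex=ttl)` semantics: one key, set only if absent, with a TTL so a crashed push driver can't wedge deploys forever. The key name and TTL are illustrative; `client` is anything with Redis's set/get/delete behavior.

```python
import uuid

LOCK_KEY = "deploy:lock"  # assumed key name

def acquire(client, holder=None, ttl=600):
    """Try to take the deploy lock; returns the holder token or None."""
    holder = holder or str(uuid.uuid4())
    # SET key value NX EX ttl -> truthy only if nobody holds the lock
    if client.set(LOCK_KEY, holder, nx=True, ex=ttl):
        return holder
    return None

def release(client, holder):
    # Only the holder may release. Good enough for a deploy lock,
    # though not race-free without a Lua check-and-delete.
    if client.get(LOCK_KEY) == holder:
        client.delete(LOCK_KEY)
```

Humans page each other badly; a single atomic `SET NX EX` settles "who is driving the push" with no extra moving parts.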
Extracting approach
Hit issue → develop manual approach → build tools to improve the manual / hands-on approach → replace manual with an automated system
Munin finally broke
Ganglia for graphing
Sensu for alerting (http://sensuapp.org)
StatsD/Graphite still chugging along
waittime: lightweight slow component tracking
s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
Feeds and Inboxes
In memory requirement
Every churned or inactive user
Inbox moved to Cassandra
1000:1 write/read
Prereq: having rbranson, ex-DataStax
C* cluster is 20% of the size of the Redis one
Main feed (timeline) still in Redis
Dynamic ramp-ups and config
Previously: required deploy
Refreshed every 30s
knobs.get(feature_name, default)
Uses
Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches
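The knobs pattern above can be sketched as follows, under assumed names: settings live in a shared store, every app server refreshes its local copy on a ~30-second cadence, and code reads `knobs.get(name, default)`. The JSON-blob storage and knob names here are illustrative; the point is that changing a value needs no deploy.

```python
import json
import time

REFRESH_SECS = 30  # assumed refresh cadence from the talk

class Knobs:
    """Locally cached dynamic config, refreshed from a shared store."""

    def __init__(self, fetch):
        self.fetch = fetch       # callable returning the raw config blob
        self.cache = {}
        self.fetched_at = 0.0

    def get(self, name, default=None):
        now = time.time()
        if now - self.fetched_at > REFRESH_SECS:
            try:
                self.cache = json.loads(self.fetch())
                self.fetched_at = now
            except Exception:
                pass  # keep serving the last good config
        return self.cache.get(name, default)

# usage: ramp-ups and killswitches read knobs at request time
knobs = Knobs(lambda: '{"feed_page_size": 20, "video_upload_enabled": 1}')
page_size = knobs.get("feed_page_size", 50)
```

Falling back to the default (and to the last good cache on fetch errors) means a broken config store degrades behavior instead of taking the site down.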
As more teams around FB contribute
Decouple deploy from feature rollout
Launch a top-10 video site on day 1, with a team of 6 engineers, in less than 2 months
Reuse what we know
Avoid magic middleware
Separate from main App servers
server-side transcoding
ZooKeeper ephemeral nodes for detection
(finally worth it / doable to deploy ZK)
Priority list for clients
Transcoding tier is completely stateless
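A sketch of worker discovery with ZooKeeper ephemeral nodes as described above: each transcoding worker registers an ephemeral node under a pool path; when a worker dies its session lapses and the node vanishes, so listing children always yields live workers. `zk` stands in for a kazoo `KazooClient`; the pool path is an assumption.

```python
POOL = "/transcode/workers"  # assumed ZooKeeper path

def register(zk, hostname):
    # ephemeral=True: ZooKeeper deletes the node when this
    # worker's session dies, so dead workers drop out automatically.
    zk.create(f"{POOL}/{hostname}", ephemeral=True, makepath=True)

def live_workers(zk):
    """Currently registered (i.e. alive) transcoding workers."""
    return sorted(zk.get_children(POOL))
```

Because the tier is stateless, clients just need this live list plus a priority order — no rebalancing, no persistent worker registry to keep in sync.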
statsd waterfall
holding area for debugging bad videos
5 million videos in the first day; 40 hours of video uploaded per hour
(other than perf improvements we’ve basically not touched it since launch)
Where can we skip a few years?
(at our own pace)
re.compile('f[o0][1l][o0]w')
Simplest thing did not last
Generic features + machine learning
Hadoop + Hive + Presto
"I wonder how they..."
Two-way exchange
2010 vintage infra
#1 impact: recruiting
Backend team: >10 people now
Do the simplest thing first
Every infra moving part is another “thread” your team has to manage
Test & Monitor Everything
Recruit way earlier than you'd think
Simple doesn't always imply hacky
Rocketship scaling has been (somewhat) democratized
Huge thanks to IG Eng Team