The Canadian Broadcasting Corporation is Canada's national public broadcaster. Our website, www.cbc.ca, is one of the largest and most visited in the country, delivering 700 million hits per day on an origin infrastructure composed of only six web servers. With the right combination of publishing methods, content delivery networks and fine-tuned caching rules, the CBC’s infrastructure has enough headroom to handle spikes of 40x normal traffic during major news events. How do you scale to almost infinite capacity when you can't predict the world’s events? It's impossible to prepare for the influx of visitors when a celebrity dies, a natural disaster occurs or other news breaks. Scaling for predictable events is easier, but although we know when the next Federal Election, Olympic Games or FIFA Cup is scheduled, these events present different challenges. Balancing the architecture for both scenarios is important.
How Do You Scale for Both Predictable and Unpredictable Events on Such a Large Scale?
Surge 2013
We’re going to talk about this:
Whitney Houston Death: February 11, 2012
… and this:
Without your site going down…
Who Am I?
• Team Lead of CBC.ca System Administration team.
• Been with CBC for over 11 years (since 2002).
• @blakecrosby
Let’s go back in time……way back
2010
2008
2007
2006
2005
2004
2003
“News stories must appear on the site as fast as possible!”
- Every Journalist at CBC
This architecture doesn’t work for news websites.
This was an important lesson for CBC
Breaking news traffic
It’s unpredictable and short-lived.
From 12k hits/s to 30k hits/s
Royal Baby: July 22, 2013
From 1 Gbps to 2.5 Gbps in ~7 minutes
Boston Marathon Bombing: April 15, 2013
From 1 Gbps to 14 Gbps in ~10 minutes.
Whitney Houston Death: February 11, 2012
Challenges we (or you) face
It is too expensive to build out infrastructure for traffic levels that are sustained for less than 1% of the year.
Content must be flexible to changing traffic conditions
We have valuable information that users need in a crisis.
“News stories must appear on the site as fast as possible!”
- Every Journalist at CBC
How we fixed this problem (back in 2003, remember?)
Save everything to disk.
Advantages
• Observes the principle of least surprise.
• Fast
• Takes advantage of OS and FS caches
• Easy to turn off certain site features.
Using SSIs (Server Side Includes)
• Primitive, but fast and secure.
• Can turn off site features or change look and feel by editing one file.
• All pages are updated instantly, without having to wait for pages to be republished.
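The effect of editing one shared include can be sketched in a few lines of Python. This is only an illustration: the `expand_includes` helper, the fragment path, and the in-memory `fragments` dict are hypothetical stand-ins for files on disk, and a real web server resolves the `<!--#include virtual="..." -->` directive itself at request time.

```python
import re

# Matches the SSI directive <!--#include virtual="/path" -->
INCLUDE_RE = re.compile(r'<!--#include virtual="([^"]+)"\s*-->')

def expand_includes(page: str, fragments: dict[str, str]) -> str:
    """Replace each include directive with the named fragment's content.

    `fragments` stands in for include files on disk (hypothetical);
    a missing fragment expands to an empty string.
    """
    return INCLUDE_RE.sub(lambda m: fragments.get(m.group(1), ""), page)

# A published page that pulls in a shared fragment (path is illustrative).
page = '<h1>Story</h1><!--#include virtual="/includes/comments.html" -->'

fragments = {"/includes/comments.html": "<div>comments widget</div>"}
normal = expand_includes(page, fragments)

# During a traffic spike, blank the one shared fragment. Every page that
# includes it loses the feature on its next request, with no republish.
fragments["/includes/comments.html"] = ""
degraded = expand_includes(page, fragments)

print(normal)
print(degraded)
```

Because the page on disk never changes, the switch takes effect for all pages at once and can be reversed just as quickly.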
Use a Content Delivery Network
Use Conditional GETs (If-Modified-Since)
Using Expiry and Validation
• Object has a TTL of 30 seconds.
• Object has a last modified time of Jan 1, 2013 00:00:00.
• Once the TTL has expired, the cache/CDN will check whether the object has been updated.
• If the object is unchanged, the origin will return "304 Not Modified" and the cache will reset the TTL and serve the object from its cache store.
• The 30-second TTL protects the origin from a deluge of If-Modified-Since requests.
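The expiry-and-validation cycle above can be simulated in Python. This is a minimal sketch, not the CDN's actual logic: the `Origin` and `EdgeCache` classes, the timestamps, and the one-request-per-second load are all assumptions made for illustration, and time is passed in explicitly so the run is deterministic.

```python
from dataclasses import dataclass

@dataclass
class Origin:
    body: bytes
    last_modified: int          # Unix timestamp of the last change
    full_responses: int = 0     # 200s served (body sent)
    not_modified: int = 0       # 304s served (headers only)

    def get(self, if_modified_since=None):
        """Return (status, body, last_modified), like a conditional GET."""
        if if_modified_since is not None and self.last_modified <= if_modified_since:
            self.not_modified += 1
            return 304, None, self.last_modified
        self.full_responses += 1
        return 200, self.body, self.last_modified

class EdgeCache:
    def __init__(self, origin, ttl=30):
        self.origin, self.ttl = origin, ttl
        self.body = None
        self.last_modified = None
        self.expires_at = 0

    def fetch(self, now):
        if self.body is not None and now < self.expires_at:
            return self.body                      # fresh: origin not contacted
        status, body, lm = self.origin.get(self.last_modified)
        if status == 200:                         # changed, or first fetch
            self.body, self.last_modified = body, lm
        self.expires_at = now + self.ttl          # 200 or 304: reset the TTL
        return self.body

origin = Origin(body=b"<html>story</html>", last_modified=1_000)
cache = EdgeCache(origin, ttl=30)

# One user request per second for two minutes (t = 2000 .. 2119).
for t in range(2_000, 2_120):
    cache.fetch(now=t)

print(origin.full_responses, origin.not_modified)
```

In this run the cache answers 120 requests while the origin sees four: one full 200 response and three 304s. A shorter TTL makes updates visible sooner at the cost of more validation requests.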
Use Last Mile Acceleration (GZIP Compression)
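To give a feel for what compression saves on the wire, here is a small Python sketch using the standard `gzip` module. The sample HTML and the `negotiate` helper are hypothetical, and the `Accept-Encoding` check is deliberately naive (it ignores quality values such as `gzip;q=0`).

```python
import gzip

# Hypothetical, repetitive HTML standing in for a story index page.
html = ("<li class='headline'><a href='/news/story'>Headline</a></li>\n" * 200).encode("utf-8")

def negotiate(body: bytes, accept_encoding: str):
    """Return (headers, payload), compressing only if the client lists gzip."""
    if "gzip" in accept_encoding.lower():
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return headers, gzip.compress(body)
    return {"Vary": "Accept-Encoding"}, body

headers, payload = negotiate(html, "gzip, deflate")
print(len(html), "bytes ->", len(payload), "bytes on the wire")
```

Real pages compress less dramatically than this repetitive sample, but text formats such as HTML, CSS and JavaScript generally shrink substantially.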
Use persistent HTTP connections
Use Appropriate Cache TTLs. Keep them simple!
Keep tunable options at the origin
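One way to read "simple TTLs, tunable at the origin" is a small rule table that the origin turns into `Cache-Control` headers, so a TTL change is a one-line edit on a server you control. The sketch below assumes that reading; the path prefixes, TTL values, and surge floor are illustrative and are not CBC's actual settings.

```python
# Hypothetical TTL table, in seconds, keyed by path prefix.
TTL_RULES = [
    ("/news/", 30),
    ("/static/", 86_400),
]
DEFAULT_TTL = 60
SURGE_FLOOR = 120   # assumed minimum TTL while a traffic surge is in progress

def cache_headers(path: str, surge: bool = False) -> dict[str, str]:
    """Build a Cache-Control header for a path; surge mode lengthens short TTLs."""
    ttl = next((t for prefix, t in TTL_RULES if path.startswith(prefix)), DEFAULT_TTL)
    if surge:
        ttl = max(ttl, SURGE_FLOOR)
    return {"Cache-Control": f"public, max-age={ttl}"}

print(cache_headers("/news/story.html"))
print(cache_headers("/news/story.html", surge=True))
```

Keeping the table short makes it easy to reason about what the CDN will do with any given URL, and the `surge` switch shows how raising TTLs during an event could be a single change at the origin.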
Move personalization to the client
Outcomes (Where we are now in 2013)
Outcomes
• 2003 to 2010 – No need to grow origin
• 2010 to today – 9 origin web servers
• HP DL360 G7
• Average 45-50% CPU utilization
• Capital cost for hardware? $15,000!
Our secret sauce (or how to serve 800M requests a day from 9 web servers)
Offload (Bandwidth)
Offload (Hits)
Scaling for Unpredictable Events
Checking when a file last changed is faster than delivering that file to a user.
Conditional GETs (304s) will save you.
Make sure users don’t have to search for content
Increase your TTLs
Turn off dynamic components
Scaling for predictable events
Predicting traffic levels is impossible
Some (loose) rules.
• Scheduled events don't peak as high as unpredictable ones.
• Scheduled events last longer, so the increase in traffic is spread out over hours, days, or weeks.
• Scheduled events are more "niche", unlike breaking news, where everyone wants to know what's going on.
• You might have to worry about 95/5 (95th-percentile) billing and bandwidth overages.
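The 95/5 concern in the last rule can be made concrete with a short Python sketch. It assumes the common convention of 5-minute samples over a 30-day month and the nearest-rank percentile method; actual contract terms vary, and the traffic figures are hypothetical.

```python
def billable_95th(samples_mbps):
    """Return the 95th-percentile sample (nearest-rank method).

    Sort the samples, ignore the top 5%, and bill on the highest
    remaining value.
    """
    ordered = sorted(samples_mbps)
    rank = (95 * len(ordered) + 99) // 100      # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]

SAMPLES_PER_DAY = 24 * 60 // 5                  # 288 five-minute samples
month = 30 * SAMPLES_PER_DAY                    # 8640 samples
free_burst = month - (95 * month + 99) // 100   # samples discarded as the top 5%

# Hypothetical traffic: 1000 Mbps baseline, bursts at 14000 Mbps.
baseline, burst = 1000, 14000

# A 2-hour breaking-news spike: 24 samples, well inside the discarded 5%.
short_spike = [baseline] * (month - 24) + [burst] * 24
# A 3-day scheduled event: 864 samples, more than the discarded 5%.
long_event = [baseline] * (month - 864) + [burst] * 864

print(free_burst * 5 / 60, "hours of burst are ignored per month")
print(billable_95th(short_spike), billable_95th(long_event))
```

Under these assumptions about 36 hours of burst per month are ignored, so the short spike is billed at the baseline rate while the three-day event is billed at its peak. That is why long scheduled events are the ones that risk overages.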
How do you scale for write operations?
We let someone else deal with that:
In Summary…
• Ensure your TTLs are appropriate
• Make sure your applications/content return Last-Modified headers.
• Don't be afraid to change your site to turn off components that aren't critical during high traffic periods.
• Keep tunables at the origin. This allows you to make changes quickly without waiting for CDN propagation.
• A CDN will not replace or fix bad origin infrastructure!
• Predicting the scale of a scheduled event is impossible. You will either overestimate or underestimate.
• Use previous traffic levels during unscheduled events as a high water mark.
• Don't be afraid to ask someone else (SaaS provider) to implement a feature that is not your core business/expertise.
Usenix Paper
http://tinyurl.com/lisa-paper
Thank You