How Do You Scale for Both Predictable and Unpredictable Events on Such a Large Scale?

Surge 2013

We’re going to talk about this:

Whitney Houston Death: February 11, 2012

… and this:

Without your site going down…

Who Am I?

• Team lead of the CBC.ca System Administration team.

• Been with CBC for over 11 years (since 2002).

• @blakecrosby

• [email protected] / [email protected]

Let’s go back in time… way back

2010

2008

2007

2006

2005

2004

2003

“News stories must appear on the site as fast as possible!”

- Every Journalist at CBC

This architecture doesn’t work for news websites.

This was an important lesson for CBC

Breaking news traffic: it’s unpredictable and short-lived.

From 12k hits/s to 30k hits/s

Royal Baby: July 22, 2013

From 1 Gbps to 2.5 Gbps in ~7 minutes.

Boston Marathon Bombing: April 15, 2013

From 1 Gbps to 14 Gbps in ~10 minutes.

Whitney Houston Death: February 11, 2012

Challenges we (or you) face

It is too expensive to build out infrastructure for traffic levels that are sustained for less than 1% of the year.

Content must adapt to changing traffic conditions.

We have valuable information that users need in a crisis.

“News stories must appear on the site as fast as possible!”

- Every Journalist at CBC

How we fixed this problem (back in 2003, remember?)

Save everything to disk.

Advantages

• Observes the principle of least surprise.

• Fast

• Takes advantage of OS and filesystem caches

• Easy to turn off certain site features.

Using SSIs (Server Side Includes)

• Primitive, but fast and secure.

• Can turn off site features or change look and feel by editing one file.

• All pages are updated instantly, without having to wait for pages to be republished.
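To make the "editing one file" point concrete, here is a minimal sketch of how an include directive behaves. It is not CBC's setup: the paths, the markup, and the `expand_includes` helper are hypothetical, and a real web server does this expansion itself at request time. The `<!--#include virtual="..." -->` syntax is the standard SSI directive.

```python
import re

# Matches the standard SSI include directive: <!--#include virtual="/path" -->
INCLUDE_RE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')


def expand_includes(page: str, fragments: dict) -> str:
    """Replace each include directive with the current contents of its fragment.

    `fragments` stands in for include files on disk (hypothetical paths).
    In this sketch an unknown fragment expands to an empty string.
    """
    return INCLUDE_RE.sub(lambda m: fragments.get(m.group(1), ""), page)


# Two already-published pages share one fragment (hypothetical paths and markup).
pages = {
    "/news/a.html": '<h1>Story A</h1><!--#include virtual="/includes/comments.html" -->',
    "/news/b.html": '<h1>Story B</h1><!--#include virtual="/includes/comments.html" -->',
}
fragments = {"/includes/comments.html": "<div>comments widget</div>"}

before = {path: expand_includes(body, fragments) for path, body in pages.items()}

# Turning the feature off means editing one file; no page is republished.
fragments["/includes/comments.html"] = ""
after = {path: expand_includes(body, fragments) for path, body in pages.items()}
```

Both pages drop the widget on their next request, because the pages on disk only hold the directive, not the widget markup.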

Use a Content Delivery Network

Use Conditional GETs (If-Modified-Since)

Using Expiry and Validation

• Object has a TTL of 30 seconds.

• Object has a last modified time of Jan 1, 2013 00:00:00

• Once the TTL has expired, the cache/CDN checks with the origin whether the object has been updated.

• If it has not, the origin returns "304 Not Modified", and the cache resets the TTL and serves the object from its cache store.

• The 30 second TTL protects the origin from a deluge of "If-Modified-Since" requests.
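The expiry-then-validation cycle above can be sketched as a tiny single-object cache. This is an illustration under assumptions, not how any particular CDN is implemented: `TtlCache`, `make_origin`, the integer timestamps, and the simulated clock are all hypothetical. Only the 30 second TTL and the Jan 1, 2013 modification time come from the slide.

```python
class TtlCache:
    """Single-object cache that revalidates with the origin once its TTL expires."""

    def __init__(self, origin, ttl=30.0):
        self.origin = origin  # callable(if_modified_since) -> (status, last_modified, body)
        self.ttl = ttl
        self.body = None
        self.last_modified = None
        self.expires_at = 0.0
        self.origin_requests = 0

    def get(self, now):
        if self.body is not None and now < self.expires_at:
            return self.body  # still fresh: the origin never sees this request
        self.origin_requests += 1
        status, last_modified, body = self.origin(self.last_modified)
        if status == 200:  # new or changed object: replace the stored copy
            self.body, self.last_modified = body, last_modified
        # On a 304 the stored copy is kept. Either way the TTL restarts.
        self.expires_at = now + self.ttl
        return self.body


def make_origin(last_modified, body):
    """Hypothetical origin: 304 if the cache's copy is current, else 200 with the body."""
    def origin(if_modified_since):
        if if_modified_since is not None and last_modified <= if_modified_since:
            return 304, last_modified, None
        return 200, last_modified, body
    return origin


# 1356998400 is Jan 1, 2013 00:00:00 UTC as a Unix timestamp.
cache = TtlCache(make_origin(last_modified=1356998400, body=b"front page"), ttl=30.0)

# Simulate one user request per second for two minutes.
served = [cache.get(now=float(second)) for second in range(120)]
```

In this simulation the 120 user requests turn into 4 origin requests (one full fetch, then one revalidation every 30 seconds), and the revalidations carry no body.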

Use Last Mile Acceleration (GZIP Compression)
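As a rough illustration of why compression helps on the last mile, the sketch below gzips a response body only when the client advertises support for it. The `maybe_compress` helper and the sample markup are hypothetical, and the header parsing is deliberately simplified (it ignores `q=0` weights); in practice the CDN or web server handles this negotiation.

```python
import gzip


def maybe_compress(body: bytes, accept_encoding: str):
    """Gzip the body only if the client's Accept-Encoding header lists gzip."""
    encodings = [token.split(";")[0].strip().lower()
                 for token in accept_encoding.split(",")]
    if "gzip" in encodings:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    return body, {}


# Repetitive markup, as on a typical headline listing (hypothetical content).
html = b"<li class='headline'>Breaking news</li>\n" * 500

compressed, gzip_headers = maybe_compress(html, "gzip, deflate")
plain, plain_headers = maybe_compress(html, "identity")
```

HTML is repetitive, so the compressed payload is a small fraction of the original; the exact ratio depends on the content.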

Use persistent HTTP connections

Use Appropriate Cache TTLs. Keep them simple!

Keep tunable options at the origin
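One way to read this slide: if the origin computes its own caching headers from a single settings object, raising TTLs during a surge is a one-line change at the origin. The sketch below assumes the CDN is configured to honor the origin's `Cache-Control` header, which depends on the CDN setup. The `SETTINGS` names and the 300 second surge value are hypothetical; only the 30 second TTL appears in this talk.

```python
# Hypothetical origin-side settings; only the 30 second TTL comes from the talk.
SETTINGS = {"ttl_seconds": 30, "surge_ttl_seconds": 300, "surge_mode": False}


def cache_headers(settings):
    """Build the caching header the origin attaches to every response."""
    if settings["surge_mode"]:
        ttl = settings["surge_ttl_seconds"]
    else:
        ttl = settings["ttl_seconds"]
    return {"Cache-Control": f"public, max-age={ttl}"}


normal = cache_headers(SETTINGS)

# During a traffic spike, flip one flag at the origin.
surge = cache_headers({**SETTINGS, "surge_mode": True})
```

The new value reaches caches as objects are revalidated, so no separate CDN configuration push is involved in this sketch.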

Move personalization to the client

Outcomes (Where we are now in 2013)

Outcomes

• 2003 to 2010 – No need to grow origin

• 2010 to today – 9 origin web servers

• HP DL360 G7

• Average 45-50% CPU utilization

• Capital cost for hardware? $15,000!

Our secret sauce. (or how to serve 800M requests a day from 9 webservers)

Offload (Bandwidth)

Offload (Hits)

Scaling for Unpredictable Events

Checking the last time a file has changed is faster than delivering that file to a user.

Conditional GETs (304s) will save you.
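A minimal sketch of the origin side of a conditional GET, to show why a 304 is cheap: the origin compares two timestamps and sends headers only. The `respond` function and the sample page are hypothetical; the date helpers are from Python's standard `email.utils` module.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(if_modified_since, last_modified, body):
    """Return (status, headers, payload) for a GET with an optional If-Modified-Since value."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    not_modified = False
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution, so compare whole seconds.
            not_modified = last_modified.replace(microsecond=0) <= since
        except (TypeError, ValueError):
            not_modified = False  # an unusable header is ignored
    if not_modified:
        return 304, headers, b""  # no body: only a timestamp check was needed
    return 200, headers, body


last_modified = datetime(2013, 1, 1, tzinfo=timezone.utc)
page = b"<html>...</html>" * 1000

first = respond(None, last_modified, page)
revalidation = respond(first[1]["Last-Modified"], last_modified, page)
stale = respond("Mon, 31 Dec 2012 00:00:00 GMT", last_modified, page)
```

The first response carries the full page, the revalidation carries an empty payload, and a client holding an older copy gets the full page again.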

Make sure users don’t have to search for content

Increase your TTLs

Turn off dynamic components

Scaling for predictable events

Predicting traffic levels is impossible

Some (loose) rules.

• Scheduled events don't peak as high as unpredictable ones.

• Scheduled events last longer, so the increase in traffic is spread out over hours, days, or weeks.

• Scheduled events are more "niche", unlike breaking news, where everyone wants to know what's going on.

• You might have to worry about 95/5 (95th percentile) billing and bandwidth overages.
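The 95/5 concern can be illustrated with a small calculation. Under 95th percentile billing, the top 5% of bandwidth samples in a billing period are discarded and the highest remaining sample is billed. The sketch assumes 5-minute samples, a 30-day month, and the nearest-rank method; providers differ on these details. The traffic levels and durations are hypothetical.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the top 5% of samples are discarded."""
    ordered = sorted(samples_mbps)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


SAMPLES_PER_DAY = 24 * 12            # one sample every 5 minutes
MONTH = 30 * SAMPLES_PER_DAY         # 8640 samples in a 30-day month


def month_with_event(event_days, baseline_mbps=1000, event_mbps=2500):
    """A month at a flat baseline with one elevated event (hypothetical traffic)."""
    event_samples = event_days * SAMPLES_PER_DAY
    return [event_mbps] * event_samples + [baseline_mbps] * (MONTH - event_samples)


short_spike = percentile_95(month_with_event(event_days=1))  # 288 elevated samples
long_event = percentile_95(month_with_event(event_days=3))   # 864 elevated samples
```

With these assumptions the top 5% is 432 samples, or 36 hours. The one-day spike falls entirely inside that allowance and the billed rate stays at the baseline, while the three-day event exceeds it and the billed rate rises to the event level. This is why long scheduled events are more likely to cause overages than short breaking-news spikes.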

How do you scale for write operations?

We let someone else deal with that:

In Summary…

• Ensure your TTLs are appropriate

• Make sure your applications/content return last modified headers.

• Don't be afraid to change your site to turn off components that aren't critical during high traffic periods.

• Keep tunables at the origin. This allows you to make changes quickly without waiting for CDN propagation.

• A CDN will not replace or fix bad origin infrastructure!

• Predicting the scale of a scheduled event is impossible. You will either overestimate or underestimate.

• Use previous traffic levels during unscheduled events as a high-water mark.

• Don't be afraid to ask someone else (SaaS provider) to implement a feature that is not your core business/expertise.

USENIX Paper

http://tinyurl.com/lisa-paper

Thank You

@[email protected]
