
Page 1: Scalable Web Architectures

Scalable Web Architectures

Common Patterns & Approaches

Cal Henderson

Page 2: Scalable Web Architectures


Hello

Page 3: Scalable Web Architectures


Flickr

• Large scale kitten-sharing

• Almost 5 years old
  – 4.5 since public launch

• Open source stack
  – An open attitude towards architecture

Page 4: Scalable Web Architectures


Flickr

• Large scale

• In a single second:
  – Serve 40,000 photos
  – Handle 100,000 cache operations
  – Process 130,000 database queries

Page 5: Scalable Web Architectures


Scalable Web Architectures?

What does scalable mean?

What’s an architecture?

An answer in 12 parts

Page 6: Scalable Web Architectures


1.

Scaling

Page 7: Scalable Web Architectures


Scalability – myths and lies

• What is scalability?

Page 8: Scalable Web Architectures


Scalability – myths and lies

• What is scalability not?

Page 9: Scalable Web Architectures


Scalability – myths and lies

• What is scalability not?
  – Raw Speed / Performance
  – HA / BCP
  – Technology X
  – Protocol Y

Page 10: Scalable Web Architectures


Scalability – myths and lies

• So what is scalability?

Page 11: Scalable Web Architectures


Scalability – myths and lies

• So what is scalability?
  – Traffic growth
  – Dataset growth
  – Maintainability

Page 12: Scalable Web Architectures


Today

• Two goals of application architecture:

Scale

HA

Page 13: Scalable Web Architectures


Today

• Three goals of application architecture:

Scale

HA

Performance

Page 14: Scalable Web Architectures


Scalability

• Two kinds:
  – Vertical (get bigger)
  – Horizontal (get more)

Page 15: Scalable Web Architectures


Big Irons

Sunfire E20k
  $450,000 – $2,500,000
  36x 1.8GHz processors

PowerEdge SC1435
  Dual-core 1.8GHz processor
  Around $1,500

Page 16: Scalable Web Architectures


Cost vs Cost

Page 17: Scalable Web Architectures


That’s OK

• Sometimes vertical scaling is right

• Buying a bigger box is quick (ish)

• Redesigning software is not

• Running out of MySQL performance?
  – Spend months on data federation
  – Or just buy a ton more RAM

Page 18: Scalable Web Architectures


The H & the V

• But we’ll mostly talk horizontal
  – Else this is going to be boring

Page 19: Scalable Web Architectures


2.

Architecture

Page 20: Scalable Web Architectures


Architectures then?

• The way the bits fit together

• What grows where

• The trade-offs between good/fast/cheap

Page 21: Scalable Web Architectures


LAMP

• We’re mostly talking about LAMP
  – Linux
  – Apache (or lighttpd)
  – MySQL (or Postgres)
  – PHP (or Perl, Python, Ruby)

• All open source
• All well supported
• All used in large operations
• (Same rules apply elsewhere)

Page 22: Scalable Web Architectures


Simple web apps

• A Web Application
  – Or “Web Site” in Web 1.0 terminology

[diagram: Interwebnet → App server → Database]

Page 23: Scalable Web Architectures


Simple web apps

• A Web Application
  – Or “Web Site” in Web 1.0 terminology

[diagram: Interwobnet → App server → Database, now with a Cache and a Storage array. AJAX!!!1]

Page 24: Scalable Web Architectures


App servers

• App servers scale in two ways:

Page 25: Scalable Web Architectures


App servers

• App servers scale in two ways:

– Really well

Page 26: Scalable Web Architectures


App servers

• App servers scale in two ways:

– Really well

– Quite badly

Page 27: Scalable Web Architectures


App servers

• Sessions!
  – (State)
  – Local sessions == bad
    • When they move == quite bad
  – Centralized sessions == good
  – No sessions at all == awesome!

Page 28: Scalable Web Architectures


Local sessions

• Stored on disk
  – PHP sessions

• Stored in memory
  – Shared memory block (APC)

• Bad!
  – Can’t move users
  – Can’t avoid hotspots
  – Not fault tolerant

Page 29: Scalable Web Architectures


Mobile local sessions

• Custom built
  – Store last session location in cookie
  – If we hit a different server, pull our session information across

• If your load balancer has sticky sessions, you can still get hotspots
  – Depends on volume – fewer, heavier users hurt more

Page 30: Scalable Web Architectures


Remote centralized sessions

• Store in a central database
  – Or an in-memory cache (sketch below)

• No porting around of session data
• No need for sticky sessions
• No hot spots

• Need to be able to scale the data store
  – But we’ve pushed the issue down the stack
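A minimal sketch of the centralized option, assuming the php-memcached extension and hypothetical cache-host addresses — PHP's built-in session handling can simply be pointed at the cache tier:

    <?php
    // hypothetical memcached hosts; any web server can now handle any request
    ini_set('session.save_handler', 'memcached');
    ini_set('session.save_path', '10.0.0.10:11211,10.0.0.11:11211');

    session_start();
    $_SESSION['user_id'] = 1234; // stored centrally, not on this web server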

Page 31: Scalable Web Architectures


No sessions

• Stash it all in a cookie!

• Sign it for safety:
    $data   = $user_id . '-' . $user_name;
    $time   = time();
    $sig    = sha1($secret . $time . $data);
    $cookie = base64_encode("$sig-$time-$data");
  – Timestamp means it’s simple to expire it
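The matching check on the way back in might look like this — a sketch, with check_cookie as a hypothetical helper (not from the slides):

    <?php
    function check_cookie($cookie, $secret, $max_age) {
        // $sig and $time never contain '-', so limit the split to 3 parts
        list($sig, $time, $data) = explode('-', base64_decode($cookie), 3);
        if (sha1($secret . $time . $data) !== $sig) return false; // tampered
        if (time() - $time > $max_age) return false;              // expired
        return $data; // "userid-username"
    }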

Page 32: Scalable Web Architectures


Super slim sessions

• If you need more than the cookie (login status, user id, username), then pull their account row from the DB
  – Or from the account cache

• None of the drawbacks of sessions
• Avoids the overhead of a query per page
  – Great for high-volume pages which need little personalization
  – Turns out you can stick quite a lot in a cookie too
  – Pack with base64 and it’s easy to delimit fields

Page 33: Scalable Web Architectures


App servers

• The Rasmus way
  – App server has ‘shared nothing’
  – Responsibility pushed down the stack

• Ooh, the stack

Page 34: Scalable Web Architectures


Trifle

Page 35: Scalable Web Architectures


Trifle

Sponge / Database

Jelly / Business Logic

Custard / Page Logic

Cream / Markup

Fruit / Presentation

Page 36: Scalable Web Architectures


Trifle

Sponge / Database

Jelly / Business Logic

Custard / Page Logic

Cream / Markup

Fruit / Presentation

Page 37: Scalable Web Architectures


App servers

Page 38: Scalable Web Architectures


App servers

Page 39: Scalable Web Architectures


App servers

Page 40: Scalable Web Architectures


Well, that was easy

• Scaling the web app server part is easy

• The rest is the trickier part
  – Database
  – Serving static content
  – Storing static content

Page 41: Scalable Web Architectures


The others

• Other services scale similarly to web apps
  – That is, horizontally

• The canonical examples:
  – Image conversion
  – Audio transcoding
  – Video transcoding
  – Web crawling
  – Compute!

Page 42: Scalable Web Architectures


Amazon

• Let’s talk about Amazon
  – S3 – Storage
  – EC2 – Compute! (XEN based)
  – SQS – Queueing

• All horizontal

• Cheap when small
  – Not cheap at scale

Page 43: Scalable Web Architectures


3.

Load Balancing

Page 44: Scalable Web Architectures


Load balancing

• If we have multiple nodes in a class, we need to balance between them

• Hardware or software

• Layer 4 or 7

Page 45: Scalable Web Architectures


Hardware LB

• A hardware appliance
  – Often a pair with heartbeats for HA

• Expensive!
  – But offers high performance
  – Easy to do > 1Gbps

• Many brands
  – Alteon, Cisco, NetScaler, Foundry, etc
  – L7 – web switches, content switches, etc

Page 46: Scalable Web Architectures


Software LB

• Just some software
  – Still needs hardware to run on
  – But can run on existing servers

• Harder to have HA
  – Often people stick hardware LBs in front
  – But Wackamole helps here

Page 47: Scalable Web Architectures


Software LB

• Lots of options
  – Pound
  – Perlbal
  – Apache with mod_proxy

• Wackamole with mod_backhand
  – http://backhand.org/wackamole/
  – http://backhand.org/mod_backhand/

Page 48: Scalable Web Architectures


Wackamole

Page 49: Scalable Web Architectures


Wackamole

Page 50: Scalable Web Architectures


The layers

• Layer 4
  – A ‘dumb’ balance

• Layer 7
  – A ‘smart’ balance

• OSI stack, routers, etc

Page 51: Scalable Web Architectures


4.

Queuing

Page 52: Scalable Web Architectures


Parallelizable == easy!

• If we can transcode/crawl in parallel, it’s easy
  – But think about queuing (sketch below)
  – And asynchronous systems
  – The web ain’t built for slow things
  – But still, a simple problem
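To illustrate pushing slow work off the web path, here's a hedged sketch using a hypothetical MySQL jobs table as the queue (the talk doesn't prescribe a particular queue; transcode() stands in for the slow work):

    <?php
    // web request: enqueue and return to the user immediately
    $db->query("INSERT INTO jobs (task, payload, status)
                VALUES ('transcode', '" . $db->real_escape_string($path) . "', 'new')");

    // worker daemon: process jobs asynchronously
    // (a real worker would claim jobs atomically to avoid double-processing)
    while (true) {
        $res = $db->query("SELECT id, payload FROM jobs WHERE status = 'new' LIMIT 1");
        if (!($job = $res->fetch_assoc())) { sleep(1); continue; }
        transcode($job['payload']); // the slow part, outside any web request
        $db->query("UPDATE jobs SET status = 'done' WHERE id = " . (int)$job['id']);
    }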

Page 53: Scalable Web Architectures


Synchronous systems

Page 54: Scalable Web Architectures


Asynchronous systems

Page 55: Scalable Web Architectures


Helps with peak periods

Page 56: Scalable Web Architectures


Synchronous systems

Page 57: Scalable Web Architectures


Asynchronous systems

Page 58: Scalable Web Architectures


Asynchronous systems

Page 59: Scalable Web Architectures


5.

Relational Data

Page 60: Scalable Web Architectures


Databases

• Unless we’re doing a lot of file serving, the database is the toughest part to scale

• If we can, best to avoid the issue altogether and just buy bigger hardware

• Dual-Quad Opteron/Intel64 systems with 16+GB of RAM can get you a long way

Page 61: Scalable Web Architectures


More read power

• Web apps typically have a read/write ratio of somewhere between 80/20 and 90/10

• If we can scale read capacity, we can solve a lot of situations

• MySQL replication!
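A sketch of how the app exploits that ratio once replication is in place — hostnames are hypothetical; writes go to the master, reads spread across the slaves:

    <?php
    $master = mysqli_connect('db-master', $db_user, $db_pass, 'app');

    // pick any slave for reads; replication means they all (eventually) have the data
    $slaves = array('db-slave1', 'db-slave2', 'db-slave3');
    $slave  = mysqli_connect($slaves[array_rand($slaves)], $db_user, $db_pass, 'app');

    $rows = $slave->query("SELECT * FROM photos WHERE user_id = 123");  // read
    $master->query("UPDATE photos SET views = views + 1 WHERE id = 9"); // write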

Page 62: Scalable Web Architectures


Master-Slave Replication

Page 63: Scalable Web Architectures


Master-Slave Replication

[diagram: the master takes reads and writes; the slaves take reads only]

Page 64: Scalable Web Architectures


Master-Slave Replication

Page 65: Scalable Web Architectures


Master-Slave Replication

Page 66: Scalable Web Architectures


Master-Slave Replication

Page 67: Scalable Web Architectures


Master-Slave Replication

Page 68: Scalable Web Architectures


Master-Slave Replication

Page 69: Scalable Web Architectures


Master-Slave Replication

Page 70: Scalable Web Architectures


Master-Slave Replication

Page 71: Scalable Web Architectures


Master-Slave Replication

Page 72: Scalable Web Architectures


6.

Caching

Page 73: Scalable Web Architectures


Caching

• Caching avoids needing to scale!
  – Or makes it cheaper

• Simple stuff
  – mod_perl / shared memory
  – MySQL query cache
    • Bad performance (in most cases)

• Invalidation is hard

Page 74: Scalable Web Architectures


Caching

• Getting more complicated…
  – Write-through cache
  – Write-back cache
  – Sideline cache

Page 75: Scalable Web Architectures


Write-through cache

Page 76: Scalable Web Architectures


Write-back cache

Page 77: Scalable Web Architectures


Sideline cache

Page 78: Scalable Web Architectures


Sideline cache

• Easy to implement
  – Just add app logic (sketch below)

• Need to manually invalidate cache
  – Well designed code makes it easy

• Memcached
  – From Danga (LiveJournal)
  – http://www.danga.com/memcached/
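In PHP with the memcached client, the sideline pattern is roughly this — a sketch; the key scheme and the *_db() helpers are hypothetical:

    <?php
    $cache = new Memcached();
    $cache->addServer('10.0.0.20', 11211);

    function get_photo($id, $cache) {
        $photo = $cache->get("photo:$id");     // try the cache first
        if ($photo === false) {
            $photo = get_photo_from_db($id);   // miss: fall through to the database
            $cache->set("photo:$id", $photo, 300);
        }
        return $photo;
    }

    function update_photo($id, $data, $cache) {
        update_photo_in_db($id, $data);
        $cache->delete("photo:$id"); // the manual invalidation step
    }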

Page 79: Scalable Web Architectures


Memcache schemes

• Layer 4
  – Good: Cache can be local on a machine
  – Bad: Invalidation gets more expensive with node count
  – Bad: Cache space wasted by duplicate objects

Page 80: Scalable Web Architectures


Memcache schemes

• Layer 7
  – Good: No wasted space
  – Good: linearly scaling invalidation
  – Bad: Multiple, remote connections
    • Can be avoided with a proxy layer
      – Gets more complicated
        » Last indentation level!
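The layer 7 idea is just hashing the key itself to pick a node — a naive sketch (real clients use consistent hashing so that adding a node doesn't remap nearly every key):

    <?php
    $nodes = array('10.0.0.20', '10.0.0.21', '10.0.0.22');

    function node_for_key($key, $nodes) {
        // every client picks the same node for a given key,
        // so each object lives in exactly one place - no duplicates
        return $nodes[abs(crc32($key)) % count($nodes)];
    }

    $server = node_for_key('photo:456', $nodes);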

Page 81: Scalable Web Architectures


7.

HA Data

Page 82: Scalable Web Architectures


But what about HA?

Page 83: Scalable Web Architectures


But what about HA?

Page 84: Scalable Web Architectures


SPOF!

• The key to HA is avoiding SPOFs
  – Identify
  – Eliminate

• Some stuff is hard to solve
  – Fix it further up the tree

• Dual DCs solves Router/Switch SPOF

Page 85: Scalable Web Architectures


Master-Master

Page 86: Scalable Web Architectures


Master-Master

• Either hot/warm or hot/hot

• Writes can go to either
  – But avoid collisions
  – No auto-inc columns for hot/hot
    • Bad for hot/warm too
    • Unless you have MySQL 5
      – But you can’t rely on the ordering!
  – Design schema/access to avoid collisions
    • Hashing users to servers

Page 87: Scalable Web Architectures


Rings

• Master-master is just a small ring
  – With 2 nodes

• Bigger rings are possible
  – But not a mesh!
  – Each slave may only have a single master
  – Unless you build some kind of manual replication

Page 88: Scalable Web Architectures


Rings

Page 89: Scalable Web Architectures


Rings

Page 90: Scalable Web Architectures


Dual trees

• Master-master is good for HA
  – But we can’t scale out the reads (or writes!)

• We often need to combine the read scaling with HA

• We can simply combine the two models

Page 91: Scalable Web Architectures


Dual trees

Page 92: Scalable Web Architectures


Cost models

• There’s a problem here
  – We need to always have 200% capacity to avoid a SPOF
    • 400% for dual sites!
  – This costs too much

• Solution is straightforward
  – Make sure clusters are bigger than 2

Page 93: Scalable Web Architectures


N+M

• N+M
  – N = nodes needed to run the system
  – M = nodes we can afford to lose

• Having M as big as N starts to suck
  – If we can make each node smaller, we can increase N while M stays constant
  – (We assume smaller nodes are cheaper)

Page 94: Scalable Web Architectures


1+1 = 200% hardware

Page 95: Scalable Web Architectures


3+1 = 133% hardware

Page 96: Scalable Web Architectures


Meshed masters

• Not possible with regular MySQL out-of-the-box today

• But there is hope!
  – NDB (MySQL Cluster) allows a mesh
  – Support for replication out to slaves in a coming version
    • RSN!

Page 97: Scalable Web Architectures


8.

Federation

Page 98: Scalable Web Architectures


Data federation

• At some point, you need more writes
  – This is tough
  – Each cluster of servers has limited write capacity

• Just add more clusters!

Page 99: Scalable Web Architectures


Simple things first

• Vertical partitioning
  – Divide tables into sets that never get joined
  – Split these sets onto different server clusters
  – Voila!

• Logical limits
  – When you run out of non-joining groups
  – When a single table grows too large

Page 100: Scalable Web Architectures


Data federation

• Split up large tables, organized by some primary object
  – Usually users

• Put all of a user’s data on one ‘cluster’
  – Or shard, or cell

• Have one central cluster for lookups
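A sketch of that lookup step, with a hypothetical user_shard table on the central cluster — the point is that the mapping is stored rather than computed, so users can later be moved:

    <?php
    function db_for_user($user_id, $lookup_db, $shard_hosts, $db_pass) {
        // one small (and heavily cacheable) lookup on the central cluster
        $res = $lookup_db->query(
            "SELECT shard_id FROM user_shard WHERE user_id = " . (int)$user_id);
        $row = $res->fetch_assoc();
        // all of this user's data lives on that one shard
        return mysqli_connect($shard_hosts[$row['shard_id']], 'app', $db_pass, 'app');
    }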

Page 101: Scalable Web Architectures


Data federation

Page 102: Scalable Web Architectures


Data federation

• Need more capacity?
  – Just add shards!
  – Don’t assign to shards based on user_id!

• For resource leveling as time goes on, we want to be able to move objects between shards
  – Maybe – not everyone does this
  – ‘Lockable’ objects

Page 103: Scalable Web Architectures


The wordpress.com approach

• Hash users into one of n buckets
  – Where n is a power of 2

• Put all the buckets on one server

• When you run out of capacity, split the buckets across two servers

• When you run out of capacity again, split the buckets across four servers

• Etc
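A sketch of the bucketing (bucket_map, a table of bucket → server, is hypothetical) — because n is a power of two, splitting means re-pointing half the buckets, with no per-user rehashing:

    <?php
    $n = 256; // number of buckets: a power of 2

    function bucket_for_user($username, $n) {
        return abs(crc32($username)) & ($n - 1); // stable bucket per user
    }

    // a capacity split = re-point half the bucket_map entries at a new server
    $server = $bucket_map[bucket_for_user('cal', $n)];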

Page 104: Scalable Web Architectures


Data federation

• Heterogeneous hardware is fine
  – Just give a larger/smaller proportion of objects depending on hardware

• Bigger/faster hardware for paying users
  – A common approach
  – Can also allocate faster app servers via magic cookies at the LB

Page 105: Scalable Web Architectures


Downsides

• Need to keep stuff in the right place
• App logic gets more complicated
• More clusters to manage
  – Backups, etc

• More database connections needed per page
  – Proxy can solve this, but complicated

• The dual table issue
  – Avoid walking the shards!

Page 106: Scalable Web Architectures


Bottom line

Data federation is how large applications are scaled

Page 107: Scalable Web Architectures


Bottom line

• It’s hard, but not impossible

• Good software design makes it easier
  – Abstraction!

• Master-master pairs for shards give us HA

• Master-master trees work for central cluster (many reads, few writes)

Page 108: Scalable Web Architectures


9.

Multi-site HA

Page 109: Scalable Web Architectures


Multiple Datacenters

• Having multiple datacenters is hard
  – Not just with MySQL

• Hot/warm with MySQL slaved setup
  – But manual (reconfig on failure)

• Hot/hot with master-master
  – But dangerous (each site has a SPOF)

• Hot/hot with sync/async manual replication
  – But tough (big engineering task)

Page 110: Scalable Web Architectures


Multiple Datacenters

Page 111: Scalable Web Architectures


GSLB

• Multiple sites need to be balanced
  – Global Server Load Balancing

• Easiest are AkaDNS-like services
  – Performance rotations
  – Balance rotations

Page 112: Scalable Web Architectures


10.

Serving Files

Page 113: Scalable Web Architectures


Serving lots of files

• Serving lots of files is not too tough
  – Just buy lots of machines and load balance!

• We’re IO bound – need more spindles!
  – But keeping many copies of data in sync is hard
  – And sometimes we have other per-request overhead (like auth)

Page 114: Scalable Web Architectures


Reverse proxy

Page 115: Scalable Web Architectures


Reverse proxy

• Serving out of memory is fast!
  – And our caching proxies can have disks too
  – Fast or otherwise

• More spindles is better

• We can parallelize it!
  – 50 cache servers gives us 50 times the serving rate of the origin server
  – Assuming the working set is small enough to fit in memory in the cache cluster

Page 116: Scalable Web Architectures


Invalidation

• Dealing with invalidation is tricky

• We can prod the cache servers directly to clear stuff out
  – Scales badly – need to clear asset from every server – doesn’t work well for 100 caches

Page 117: Scalable Web Architectures


Invalidation

• We can change the URLs of modified resources
  – And let the old ones drop out of cache naturally
  – Or poke them out, for sensitive data

• Good approach!
  – Avoids browser cache staleness
  – Hello Akamai (and other CDNs)
  – Read more:
    • http://www.thinkvitamin.com/features/webapps/serving-javascript-fast

Page 118: Scalable Web Architectures


CDN – Content Delivery Network

• Akamai, Savvis, Mirror Image Internet, etc

• Caches operated by other people
  – Already in-place
  – In lots of places

• GSLB/DNS balancing

Page 119: Scalable Web Architectures


Edge networks

[diagram: the origin server, before adding edge caches]

Page 120: Scalable Web Architectures


Edge networks

[diagram: the origin server surrounded by many edge caches]

Page 121: Scalable Web Architectures


CDN Models

• Simple model
  – You push content to them, they serve it

• Reverse proxy model
  – You publish content on an origin, they proxy and cache it

Page 122: Scalable Web Architectures


CDN Invalidation

• You don’t control the caches
  – Just like those awful ISP ones

• Once something is cached by a CDN, assume it can never change
  – Nothing can be deleted
  – Nothing can be modified

Page 123: Scalable Web Architectures


Versioning

• When you start to cache things, you need to care about versioning
  – Invalidation & Expiry
  – Naming & Sync

Page 124: Scalable Web Architectures


Cache Invalidation

• If you control the caches, invalidation is possible

• But remember ISP and client caches

• Remove deleted content explicitly
  – Avoid users finding old content
  – Save cache space

Page 125: Scalable Web Architectures


Cache versioning

• Simple rule of thumb:
  – If an item is modified, change its name (URL)

• This can be independent of the file system!

Page 126: Scalable Web Architectures


Virtual versioning

• Database indicates version 3 of file                                (Version 3)
• Web app writes version number into URL                              (example.com/foo_3.jpg)
• Request comes through cache and is cached with the versioned URL    (Cached: foo_3.jpg)
• mod_rewrite converts versioned URL to path                          (foo_3.jpg -> foo.jpg)
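A sketch of both halves, with hypothetical paths (the rewrite rule in the comment assumes photos live under one directory):

    <?php
    // app side: bake the current version from the database into the URL
    $url = "http://example.com/photos/{$photo_id}_{$version}.jpg";

    // Apache side, something like:
    //   RewriteEngine On
    //   RewriteRule ^/photos/(\d+)_\d+\.jpg$ /var/photos/$1.jpg [L]
    // foo_3.jpg and foo_4.jpg are then distinct cache entries,
    // but the same file on disk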

Page 127: Scalable Web Architectures


Reverse proxy

• Choices
  – L7 load balancer & Squid
    • http://www.squid-cache.org/
  – mod_proxy & mod_cache
    • http://www.apache.org/
  – Perlbal and Memcache?
    • http://www.danga.com/

Page 128: Scalable Web Architectures


More reverse proxy

• HAProxy

• Squid & CARP

• Varnish

Page 129: Scalable Web Architectures


High overhead serving

• What if you need to authenticate your asset serving?
  – Private photos
  – Private data
  – Subscriber-only files

• Two main approaches
  – Proxies w/ tokens
  – Path translation

Page 130: Scalable Web Architectures


Perlbal Re-proxying

• Perlbal can do redirection magic
  – Client sends request to Perlbal
  – Perlbal plugin verifies user credentials
    • token, cookies, whatever
    • tokens avoid data-store access
  – Perlbal goes to pick up the file from elsewhere
  – Transparent to user

Page 131: Scalable Web Architectures


Perlbal Re-proxying

Page 132: Scalable Web Architectures


Perlbal Re-proxying

• Doesn’t keep database around while serving

• Doesn’t keep app server around while serving

• User doesn’t find out how to access asset directly

Page 133: Scalable Web Architectures


Permission URLs

• But why bother!?

• If we bake the auth into the URL then it saves the auth step

• We can do the auth on the web app servers when creating HTML

• Just need some magic to translate to paths

• We don’t want paths to be guessable
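One way to build such a URL — a sketch using an HMAC over the path plus an expiry chosen at issue time (names and hosts are hypothetical):

    <?php
    function permission_url($path, $secret, $ttl) {
        $expires = time() + $ttl; // chosen at time of issue, not later
        $sig = hash_hmac('sha1', "$path|$expires", $secret);
        return "http://example.com{$path}?e={$expires}&s={$sig}";
    }

    // the file server recomputes the HMAC before serving: nothing guessable,
    // no database hit, and no per-token state to track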

Page 134: Scalable Web Architectures


Permission URLs

Page 135: Scalable Web Architectures


Permission URLs

(or mod_perl)

Page 136: Scalable Web Architectures


Permission URLs

• Downsides
  – URL gives permission for life
  – Unless you bake in tokens
    • Tokens tend to be non-expirable
      – We don’t want to track every token
        » Too much overhead
    • But can still expire
      – But you choose at time of issue, not later

• Upsides
  – It works
  – Scales very nicely

Page 137: Scalable Web Architectures


11.

Storing Files

Page 138: Scalable Web Architectures


Storing lots of files

• Storing files is easy!
  – Get a big disk
  – Get a bigger disk
  – Uh oh!

• Horizontal scaling is the key
  – Again

Page 139: Scalable Web Architectures


Connecting to storage

• NFS
  – Stateful == Sucks
  – Hard mounts vs Soft mounts, INTR

• SMB / CIFS / Samba
  – Turn off MSRPC & WINS (NetBIOS NS)
  – Stateful but degrades gracefully

• HTTP
  – Stateless == Yay!
  – Just use Apache

Page 140: Scalable Web Architectures


Multiple volumes

• Volumes are limited in total size
  – Except (in theory) under ZFS & others

• Sometimes we need multiple volumes for performance reasons
  – When using RAID with single/dual parity

• At some point, we need multiple volumes

Page 141: Scalable Web Architectures


Multiple volumes

Page 142: Scalable Web Architectures


Multiple hosts

• Further down the road, a single host will be too small

• Total throughput of machine becomes an issue

• Even physical space can start to matter

• So we need to be able to use multiple hosts

Page 143: Scalable Web Architectures


Multiple hosts

Page 144: Scalable Web Architectures


HA Storage

• HA is important for file assets too
  – We can back stuff up
  – But we tend to want hot redundancy

• RAID is good
  – RAID 5 is cheap, RAID 10 is fast

Page 145: Scalable Web Architectures


HA Storage

• But whole machines can fail

• So we stick assets on multiple machines

• In this case, we can ignore RAID
  – In failure case, we serve from alternative source
  – But need to weigh up the rebuild time and effort against the risk
  – Store more than 2 copies?

Page 146: Scalable Web Architectures


HA Storage

Page 147: Scalable Web Architectures


HA Storage

• But whole colos can fail

• So we stick assets in multiple colos

• Again, we can ignore RAID
  – Or even multi-machine per colo
  – But do we want to lose a whole colo because of single disk failure?
  – Redundancy is always a balance

Page 148: Scalable Web Architectures


Self repairing systems

• When something fails, repairing can be a pain
  – RAID rebuilds by itself, but machine replication doesn’t

• The big appliances self heal
  – NetApp, StorEdge, etc

• So does MogileFS (reaper)

Page 149: Scalable Web Architectures


Real Life

Case Studies

Page 150: Scalable Web Architectures


GFS – Google File System

• Developed by … Google

• Proprietary

• Everything we know about it is based on talks they’ve given

• Designed to store huge files for fast access

Page 151: Scalable Web Architectures


GFS – Google File System

• Single ‘Master’ node holds metadata
  – SPOF – Shadow master allows warm swap

• Grid of ‘chunkservers’
  – 64-bit filenames
  – 64MB file chunks

Page 152: Scalable Web Architectures


GFS – Google File System

[diagram: the Master node plus chunkservers holding replicated chunks 1(a), 1(b), 2(a)]

Page 153: Scalable Web Architectures


GFS – Google File System

• Client reads metadata from master then file parts from multiple chunkservers

• Designed for big files (>100MB)

• Master server allocates access leases

• Replication is automatic and self repairing
  – Synchronously for atomicity

Page 154: Scalable Web Architectures


GFS – Google File System

• Reading is fast (parallelizable)
  – But requires a lease

• Master server is required for all reads and writes

Page 155: Scalable Web Architectures


MogileFS – OMG Files

• Developed by Danga / SixApart

• Open source

• Designed for scalable web app storage

Page 156: Scalable Web Architectures


MogileFS – OMG Files

• Single metadata store (MySQL)
  – MySQL Cluster avoids SPOF

• Multiple ‘tracker’ nodes locate files

• Multiple ‘storage’ nodes store files

Page 157: Scalable Web Architectures


MogileFS – OMG Files

[diagram: multiple tracker nodes backed by a MySQL metadata store]

Page 158: Scalable Web Architectures


MogileFS – OMG Files

• Replication of file ‘classes’ happens transparently

• Storage nodes are not mirrored – replication is piecemeal

• Reading and writing go through trackers, but are performed directly upon storage nodes

Page 159: Scalable Web Architectures


Flickr File System

• Developed by Flickr

• Proprietary

• Designed for very large scalable web app storage

Page 160: Scalable Web Architectures


Flickr File System

• No metadata store
  – Deal with it yourself

• Multiple ‘StorageMaster’ nodes

• Multiple storage nodes with virtual volumes

Page 161: Scalable Web Architectures


Flickr File System

[diagram: multiple StorageMaster (SM) nodes in front of the storage nodes]

Page 162: Scalable Web Architectures


Flickr File System

• Metadata stored by app
  – Just a virtual volume number
  – App chooses a path

• Virtual nodes are mirrored
  – Locally and remotely

• Reading is done directly from nodes

Page 163: Scalable Web Architectures


Flickr File System

• StorageMaster nodes only used for write operations

• Reading and writing can scale separately

Page 164: Scalable Web Architectures


Amazon S3

• A big disk in the sky

• Multiple ‘buckets’

• Files have user-defined keys

• Data + metadata

Page 165: Scalable Web Architectures


Amazon S3

[diagram: your servers pushing files to Amazon]

Page 166: Scalable Web Architectures


Amazon S3

[diagram: your servers push files to Amazon; users fetch them from Amazon]

Page 167: Scalable Web Architectures


The cost

• Fixed price, by the GB

• Store: $0.15 per GB per month

• Serve: $0.20 per GB

Page 168: Scalable Web Architectures


The cost

[chart: S3 cost as usage grows]

Page 169: Scalable Web Architectures


The cost

[chart: S3 cost vs regular bandwidth cost]

Page 170: Scalable Web Architectures


End costs

• ~$2k to store 1TB for a year

• ~$63 a month for 1Mbps

• ~$65k a month for 1Gbps
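As a rough check on those numbers (assuming the prices above and a ~30-day month):

  Store: 1,000GB × $0.15/GB × 12 months = $1,800, plus transfer ≈ $2k/year for 1TB
  Serve: 1Mbps ≈ 0.125MB/s × ~2.6M seconds ≈ 320GB/month; × $0.20 ≈ $64/month
  Serve: 1Gbps is 1,000× that ≈ $64k/month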

Page 171: Scalable Web Architectures


12.

Field Work

Page 172: Scalable Web Architectures


Real world examples

• Flickr
  – Because I know it

• LiveJournal
  – Because everyone copies it

Page 173: Scalable Web Architectures


Flickr Architecture

Page 174: Scalable Web Architectures


Flickr Architecture

Page 175: Scalable Web Architectures


LiveJournal Architecture

Page 176: Scalable Web Architectures


LiveJournal Architecture

Page 177: Scalable Web Architectures


Buy my book!

Page 178: Scalable Web Architectures


Or buy Theo’s

Page 179: Scalable Web Architectures


The end!

Page 180: Scalable Web Architectures


Awesome!

These slides are available online:

iamcal.com/talks/