“Mega-services”: Scale and Consistency
A Lightning Tour, Now with Elastic Scaling!
Jeff Chase, Duke University


Page 1:

“Mega-services”: Scale and Consistency

A Lightning Tour
Now with Elastic Scaling!

Jeff Chase
Duke University

Page 2:

A service

[Diagram: a client sends a request to the service and receives a reply; inside the service, a Web Server, App Server, and DB Server are backed by a Store.]

Page 3:

The Steve Yegge rant, part 1: Products vs. Platforms

Selectively quoted/clarified from http://steverant.pen.io/, emphasis added. This is an internal Google memorandum that “escaped”. Yegge had moved to Google from Amazon. His goal was to promote service-oriented software structures within Google.

So one day Jeff Bezos [CEO of Amazon] issued a mandate....[to the developers in his company]:

His Big Mandate went something along these lines:

1) All teams will henceforth expose their data and functionality through service interfaces.

2) Teams must communicate with each other through these interfaces.

3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

Page 4:

4) It doesn't matter what technology they use. HTTP, Corba, PubSub, custom protocols -- doesn't matter. Bezos doesn't care.

5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

6) Anyone who doesn't do this will be fired.

7) Thank you; have a nice day!

The Steve Yegge rant, part 2: Products vs. Platforms

Page 5:

Managing overload

• What should we do when a service is in overload?
– Overload: service is close to saturation, leading to unacceptable response time.
– Work queues grow without bound, increasing memory consumption.

[Graph: delivered throughput X vs. offered load λ. Throughput rises with load until saturation at λmax; beyond that point (λ > λmax) it flattens or collapses.]

Page 6:

Options for overload

1. Thrashing
– Keep trying and hope things get better. Accept each request and inject it into the system. Then drop requests at random if some queue overflows its memory bound. Note: this leads to dropping requests after work has been invested, wasting work and reducing throughput (e.g., congestion collapse).

2. Admission control or load conditioning
– Reject requests as needed to keep the system healthy. Reject them early, before they incur processing costs. Choose your victims carefully, e.g., prefer “gold” customers, or reject the most expensive requests. (A minimal sketch follows below.)

3. Dynamic provisioning or elastic scaling
– E.g., acquire new capacity on the fly from a cloud provider, and shift load over to the new capacity.
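To make option 2 concrete, here is a minimal admission-control sketch (my illustration, not from the slides): requests are rejected up front when the pending-work queue nears a bound, with a little headroom reserved for “gold” customers. The class and parameter names are hypothetical.

```python
import queue

class AdmissionController:
    """Reject requests early, before they incur processing costs."""

    def __init__(self, max_queue=1000, gold_reserve=100):
        self.q = queue.Queue()
        self.max_queue = max_queue          # hard bound on admitted work
        self.gold_reserve = gold_reserve    # slots held back for "gold" customers

    def admit(self, request, is_gold=False):
        depth = self.q.qsize()
        # Ordinary requests are rejected once the queue nears its bound;
        # "gold" customers may use the reserved headroom.
        limit = self.max_queue if is_gold else self.max_queue - self.gold_reserve
        if depth >= limit:
            return False            # reject now: no work has been invested yet
        self.q.put(request)
        return True                 # admitted: a worker will dequeue and serve it
```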

Page 7:

Elastic scaling: “pay as you grow”

Page 8:

EC2 (Elastic Compute Cloud): the canonical public cloud

[Diagram: a client reaches a service running as a guest on a host at one or more cloud providers; the guest is instantiated from a virtual appliance image.]

Page 9:

[Diagram: the client reaches a service whose guest OS runs on a VMM over the physical platform.]

IaaS: Infrastructure as a Service

EC2 is a public IaaS cloud (fee-for-service).

Deployment of private clouds is growing rapidly w/ open IaaS cloud software.

Hosting performance and isolation are determined by the virtualization layer.

Virtual Machines (VM): VMware, KVM, etc.

Page 10:

Native virtual machines (VMs)

• Slide a hypervisor underneath the kernel.
– New OS/TCB layer: virtual machine monitor (VMM).
• Kernel and processes run in a virtual machine (VM).
– The VM “looks the same” to the OS as a physical machine.
– The VM is a sandboxed/isolated context for an entire OS.
• Can run multiple VM instances on a shared computer.

Page 11:

Thank you, VMware

Page 12:

Page 13:

Page 14:

Also in the news, the Snowden NSA leak for Hallowe’en 2013. Scary?

Page 15:

Scaling a service

[Diagram: incoming work arrives at a dispatcher, which spreads requests across a server cluster/farm/cloud/grid in a data center, built on a support substrate.]

Add servers or “bricks” for scale and robustness. Issues: state storage, server selection, request routing, etc.

Page 16:

Service scaling and bottlenecks

Scale up by adding capacity incrementally?

• “Just add bricks/blades/units/elements/cores”...but that presumes we can parallelize the workload.

• “Service workloads parallelize easily.”
– Many independent requests: spread requests across multiple units.
– Problem: some requests use shared data. Partition data into chunks and spread them across the units; be sure to read/write a common copy.
• Load must be evenly distributed, or else some unit saturates before the others (bottleneck).

A bottleneck limits throughput and/or may increase response time for some class of requests.

Page 17:

Coordination for service clusters

• How to assign data and functions among servers?
– To spread the work across an elastic server cluster?
• How to know which server is “in charge” of a given function or data object?
– E.g., to serialize reads/writes on each object, or otherwise ensure consistent behavior: each read sees the value stored by the most recent write (or at least some reasonable value).
• Goals: safe, robust, even, balanced, dynamic, etc.
• Two key techniques:
– Leases (leased locks)
– Hash-based adaptive data distributions (a consistent-hashing sketch follows below)
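One widely used hash-based adaptive distribution is consistent hashing: keys and servers are hashed onto a ring, each key is owned by the next server clockwise from it, and adding or removing a server moves only a small share of the keys. The sketch below is illustrative (names like HashRing are my own), not code from the lecture.

```python
import bisect
import hashlib

def _hash(s):
    # Map a string onto a large integer ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual nodes per server, for smoother balance
        self._ring = []           # sorted list of (point, node)
        for n in nodes:
            self.add(n)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for (p, n) in self._ring if n != node]

    def owner(self, key):
        # The key's owner is the first node clockwise from hash(key).
        points = [p for (p, _) in self._ring]
        i = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.owner("user:1234"))   # maps this key to one of the three servers
```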

Page 18:

What about failures?

• Systems fail. Here’s a reasonable set of assumptions about failure properties:
• Nodes/servers/replicas/bricks
– Fail-stop or fail-fast fault model
– Nodes either function correctly or remain silent
– A failed node may restart, or not
– A restarted node loses its memory state, and recovers its secondary (disk) state
– Note: nodes can also fail by behaving in unexpected ways, like sending false messages. These are called Byzantine failures.
• Network messages
– “delivered quickly most of the time” but may be dropped.
– Message source and content are safely known (e.g., crypto).

Page 19:

Challenge: data management

• Data volumes are growing enormously.

• Mega-services are “grounded” in data.

• How to scale the data tier?
– Scaling requires dynamic placement of data items across data servers, so we can grow the number of servers.

– Caching helps to reduce load on the data tier.

– Replication helps to survive failures and balance read/write load.

– E.g., alleviate hot-spots by spreading read load across multiple data servers.

– Caching and replication require careful update protocols to ensure that servers see a consistent view of the data.

Page 20:

Service-oriented architecture of Amazon’s platform

Dynamo is a scalable, replicated key-value store.

Page 21:

Key-value stores

• Many mega-services are built on key-value stores.
– Store variable-length content objects: think “tiny files” (value).
– Each object is named by a “key”, usually fixed-size.
– Key is also called a token: not to be confused with a crypto key! Although it may be a content hash (SHAx or MD5).
– Simple put/get interface with no offsets or transactions (yet); a toy sketch follows below.

– Goes back to literature on Distributed Data Structures [Gribble 1998] and Distributed Hash Tables (DHTs).

[image from Sean Rhea, opendht.org]
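As an illustration of the put/get interface (a toy, in-process sketch, not any particular store’s API): values are opaque blobs, keys are fixed-size tokens, and a content hash can serve as the key.

```python
import hashlib

class KVStore:
    """Toy sketch of the put/get interface: flat key space, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # no offsets, no transactions

    def get(self, key):
        return self._data.get(key)       # None if the key is absent

# The key is often a fixed-size token, e.g., a content hash of the value.
value = b"...object bytes..."
key = hashlib.sha256(value).hexdigest()
store = KVStore()
store.put(key, value)
assert store.get(key) == value
```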

Page 22:

• Data objects named in a “flat” key space (e.g., “serial numbers”)

• K-V is a simple and clean abstraction that admits a scalable, reliable implementation: a major focus of R&D.

• Is put/get sufficient to implement non-trivial apps?

Distributed hash table

[Diagram: a distributed application issues put(key, data) and get(key) → data against a distributed hash table spread across many nodes; a lookup service maps lookup(key) → node IP address. Image from Morris, Stoica, Shenker, et al.]

Key-value stores

Page 23:

Scalable key-value stores

• Can we build massively scalable key/value stores?
– Balance the load: distribute the keys across the nodes.
– Find the “right” server(s) for a given key.
– Adapt to change (growth and “churn”) efficiently and reliably.
– Bound the spread of each object.
• Warning: it’s a consensus problem!
• What is the consistency model for massive stores?
– Can we relax consistency for better scaling? Do we have to?

Page 24:

Memcached is a scalable in-memory key-value cache.

Page 25:

Scaling database access

• Many services are data-driven.
– Multi-tier services: the “lowest” layer is a data tier with an authoritative copy of service data.
• Data is stored in various stores or databases, some with an advanced query API.
– e.g., SQL
• Databases are hard to scale.
– Complex data: atomic, consistent, recoverable, durable (“ACID”).

[Diagram: web servers issue SQL queries over a query API to the database servers.]

SQL: Structured Query Language

Caches can help if much of the workload is simple reads.

Page 26:

Memcached

• “Memory caching daemon”

• It’s just a key/value store

• Scalable cluster service
– array of server nodes
– distribute requests among nodes
– how? distribute the key space
– scalable: just add nodes

• Memory-based

• LRU object replacement

• Many technical issues: multi-core server scaling, MxN communication, replacement, consistency.

[Diagram: web servers use a get/put API to the memcached servers and a SQL query API to the database servers.]
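A common way web servers use a memcached tier is the demand-fill “cache-aside” pattern: read from the cache, fall back to the database on a miss and populate the cache, and invalidate on writes. The sketch below is illustrative; the cache and db objects and their get/set/delete and fetch_row/update methods are assumed interfaces, not a specific client library.

```python
def get_user(user_id, cache, db, ttl=300):
    """Demand-fill ("cache-aside") read path: check the cache first, then the DB."""
    key = f"user:{user_id}"
    row = cache.get(key)
    if row is not None:
        return row                       # cache hit: no database load
    row = db.fetch_row("users", user_id) # cache miss: read the authoritative copy
    cache.set(key, row, ttl)             # populate with a TTL so stale entries expire
    return row

def update_user(user_id, fields, cache, db):
    """Write path: update the authoritative copy, then invalidate the cached copy."""
    db.update("users", user_id, fields)  # hypothetical DB helper
    cache.delete(f"user:{user_id}")      # next read re-fills from the database
```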

Page 27:

[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]

Page 28:

Consistency and coordination

Very loosely:

• Consistency: each read of a {key, file, object} sees the value stored by the most recent write.
– Or at least writes propagate to readers “eventually”.
– More generally: the service behaves as expected: changes to shared state are properly recorded and observed.
• Coordination: the roles and functions of each element are understood and properly adjusted to respond to changes (e.g., failures, growth, or rebalancing: “churn”).
– E.g., distribution of functions or data among the elements.

Page 29:

Example: mutual exclusion

• It is often necessary to grant some node/process the “right” to “own” some given data or function.

• Ownership rights often must be mutually exclusive.
– At most one owner at any given time.

• How to coordinate ownership?

• Warning: it’s a consensus problem!

Page 30:

One solution: lock service

[Diagram: clients A and B each send acquire to the lock service; the service grants the lock to one client at a time. Each holder performs x=x+1 on the shared variable and then releases, after which the other client is granted.]
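The x=x+1 update in the figure is only safe if every client brackets it with acquire/release. A minimal client-side sketch follows (lock_service and store are hypothetical interfaces, not code from the slides):

```python
def increment_x(lock_service, store):
    """Read-modify-write on shared x, serialized by the lock service."""
    lock_service.acquire("x")       # blocks until the lock is granted
    try:
        x = store.read("x")
        store.write("x", x + 1)     # no other holder can interleave here
    finally:
        lock_service.release("x")   # always release, even on error
```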

Page 31:

A lock service in the real world

[Diagram: A acquires the lock, is granted it, and begins x=x+1, but then fails (X) while holding the lock. B’s acquire request is never granted: with plain locks, B can wait forever.]

Page 32:

Solution: leases (leased locks)

• A lease is a grant of ownership or control for a limited time.

• The owner/holder can renew or extend the lease.

• If the owner fails, the lease expires and is free again.

• The lease might end early.
– lock service may recall or evict
– holder may release or relinquish
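A lease table on the service side can be sketched in a few lines (illustrative only; the names and the fixed 30-second term are my assumptions): a grant carries an expiration time, renewal pushes it forward, and an expired lease can be handed to a new holder even if the old one never released it.

```python
import time

class LeaseService:
    """Toy lease table: a grant of ownership or control for a limited time."""

    def __init__(self, term=30.0):
        self.term = term
        self.leases = {}                      # name -> (holder, expires_at)

    def acquire(self, name, holder):
        now = time.time()
        owner, expires = self.leases.get(name, (None, 0))
        if owner is None or expires <= now:   # free, or previous holder's lease expired
            self.leases[name] = (holder, now + self.term)
            return True
        return owner == holder                # current holder may re-acquire

    def renew(self, name, holder):
        owner, expires = self.leases.get(name, (None, 0))
        if owner == holder and expires > time.time():
            self.leases[name] = (holder, time.time() + self.term)
            return True
        return False                          # too late: the lease already expired

    def release(self, name, holder):
        if self.leases.get(name, (None, 0))[0] == holder:
            del self.leases[name]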

Page 33:

A lease service in the real world

[Diagram: A acquires the lease, is granted it, and begins x=x+1, but then fails (X) while holding it. When the lease expires, the service grants it to B, which performs x=x+1 and releases.]

Page 34:

Leases and time

• The lease holder and lease service must agree when a lease has expired.
– i.e., that its expiration time is in the past
– Even if they can’t communicate!
• We all have our clocks, but do they agree?
– synchronized clocks
• For leases, it is sufficient for the clocks to have a known bound ε on clock drift: |T(Ci) – T(Cj)| < ε.
– Build in slack time > ε into the lease protocols as a safety margin.
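The slack rule can be stated as two one-line checks (an illustrative sketch; EPSILON is the assumed drift bound, not a value from the slides): the holder stops using the lease a little before its nominal expiration, and the service waits a little after expiration before granting it to anyone else.

```python
EPSILON = 0.5   # assumed bound on clock drift between holder and service, in seconds

def holder_may_use(lease_expires_at, now):
    # The holder stops acting on the lease EPSILON before its nominal expiration...
    return now < lease_expires_at - EPSILON

def service_may_reclaim(lease_expires_at, now):
    # ...and the service waits EPSILON past expiration before granting it elsewhere,
    # so with drift below EPSILON the old and new holders never both act at once.
    return now > lease_expires_at + EPSILON
```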

Page 35:

OK, fine, but…

• What if A does not fail, but is instead isolated by a network partition?

Page 36:

A network partition

[Diagram: a crashed router splits the nodes into two disconnected subsets.]

A network partition is any event that blocks all message traffic between subsets of nodes.

Page 37:

Never two kings at once

[Diagram: A acquires and holds the lease but is cut off by the partition; it must not keep updating (x=x+1 ???) once the lease expires. After expiration the service grants the lease to B, which performs x=x+1 and releases.]

Page 38:

Lease example: network file cache consistency

Page 39:

[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]

Page 40:

Google File System (GFS)

Similar: Hadoop HDFS, p-NFS, many other parallel file systems. A master server stores metadata (names, file maps) and acts as lock server. Clients call the master to open a file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
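The control/data split described above might look like this from the client side (a rough sketch with hypothetical master and chunk-server stubs; the real GFS and HDFS protocols differ in detail): one small RPC to the master for metadata, then bulk reads go straight to the data servers.

```python
def read_file(master, path, offset, length):
    """Sketch of the control/data split: metadata from the master, bytes from data servers.

    `master`, `chunk_map`, and the chunk servers are hypothetical stubs used only
    to illustrate the flow."""
    handle, chunk_map = master.open(path)        # one small RPC to the master
    data = b""
    for chunk in chunk_map.chunks_in_range(offset, length):
        server = chunk.locations[0]              # any replica will do for a read
        # Bulk data transfer bypasses the master entirely.
        data += server.read(chunk.id, chunk.start, chunk.size)
    return data
```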

Page 41:

OK, fine, but…

• What if the manager/master itself fails?


We can replace it, but the nodes must agree on who the new master is: requires consensus.

Page 42:

The Answer

• Replicate the functions of the manager/master.
– Or other coordination service…
• Designate one of the replicas as a primary.
– Or master
• The other replicas are backup servers.
– Or standby or secondary

• If the primary fails, use a high-powered consensus algorithm to designate and initialize a new primary.

Page 43:

Consensus: abstraction

[Diagram: Step 1, Propose: processes P1, P2, P3 each propose a value (v1, v2, v3) over unreliable multicast. Step 2, Decide: after running the consensus algorithm, each process decides (d1, d2, d3). From Coulouris and Dollimore.]

Each P proposes a value to the others. All nonfaulty P agree on a value in a bounded time.

Page 44:

Fischer-Lynch-Paterson (1985)

• No consensus can be guaranteed in an asynchronous system in the presence of failures.

• Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.

• Consensus may occur recognizably, rarely or often.

Network partition: “split brain”

Page 45:

C-A-P: choose two

[Diagram: a triangle with vertices C (consistency), A (availability), and P (partition-resilience).]

CA: available, and consistent, unless there is a partition.

AP: a reachable replica provides service even in a partition, but may be inconsistent.

CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).

Dr. Eric Brewer

“CAP theorem”

Page 46:

Properties for Correct Consensus

• Termination: All correct processes eventually decide.

• Agreement: All correct processes select the same di.

– Or…(stronger) all processes that do decide select the same di, even if they later fail.

• Consensus “must be” both safe and live.

• FLP and CAP say that a consensus algorithm can be safe or live, but not both.

Page 47:

Now what?

• We have to build practical, scalable, efficient distributed systems that really work in the real world.

• But the theory says it is impossible to build reliable computer systems from unreliable components.

• So what are we to do?

Page 48:

http://research.microsoft.com/en-us/um/people/blampson/

Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT…..He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE’s Draper Prize in 2004.

Butler W. Lampson

Page 49:

[Lampson 1995]

Page 50:

Summary/preview

• Master coordinates, dictates consensus
– e.g., lock service
– Also called “primary”
• Remaining consensus problem: who is the master?
– Master itself might fail or be isolated by a network partition.
– Requires a high-powered distributed consensus algorithm (Paxos) to elect a new master.
– Paxos is safe but not live: in the worst case (multiple repeated failures) it might not terminate.

Page 51:

A Classic Paper

• ACM TOCS:
– Transactions on Computer Systems

• Submitted: 1990. Accepted: 1998

• Introduced:

Page 52:

Page 53:

???

Page 54:

A Paxos Round

[Diagram: a leader L self-appoints and exchanges messages with the other nodes N: 1a “Can I lead b?” (Propose), 1b “OK, but” (Promise), 2a “v?” (Accept), 2b “OK” (Ack), 3 “v!” (Commit). The leader waits for a majority after phase 1 and again after phase 2; nodes log their promises and accepts, after which the value is safe.]

Nodes may compete to serve as leader, and may interrupt one another’s rounds. It can take many rounds to reach consensus.
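The acceptor-side rules behind the Promise and Ack arrows are compact. Below is a single-decree sketch (my illustration; it omits stable logging, retries, and all liveness machinery), showing why a later leader can only re-propose a value that may already have been chosen.

```python
class Acceptor:
    """Single-decree Paxos acceptor: the rules behind the Promise/Ack arrows."""

    def __init__(self):
        self.promised = -1        # highest ballot promised so far (phase 1b)
        self.accepted = None      # (ballot, value) last accepted, if any
        # A real acceptor writes both to stable storage ("log") before replying.

    def on_prepare(self, ballot):                # 1a: "Can I lead b?"
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)    # 1b: "OK, but here is what I accepted"
        return ("nack", None)

    def on_accept(self, ballot, value):          # 2a: "v?"
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "ok"                          # 2b: "OK"
        return "nack"


def choose_value(promises, my_value):
    """Leader rule after a majority of promises: adopt the value from the
    highest-ballot accepted pair, if any; only then may it propose its own."""
    accepted = [a for (tag, a) in promises if tag == "promise" and a is not None]
    if accepted:
        return max(accepted, key=lambda ba: ba[0])[1]
    return my_value
```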

Page 55:

Consensus in Practice

• Lampson: “Since general consensus is expensive, practical systems reserve it for emergencies.”

– e.g., to select a primary/master, e.g., a lock server.

• Centrifuge, GFS master, Frangipani, etc.

• Google Chubby service (“Paxos Made Live”)

• Pick a primary with Paxos. Do it rarely; do it right.
– Primary holds a “master lease” with a timeout.

• Renew by consensus with primary as leader.

– Primary is “czar” as long as it holds the lease.

– Master lease expires? Fall back to Paxos.

– (Or BFT.)

Page 56:

Coordination services

• Build your cloud apps around a coordination service with consensus (Paxos) at its core.

• This service is a fundamental building block for consistent scalable services.
– Chubby (Google)

– Zookeeper (Yahoo!)

– Centrifuge (Microsoft)

Page 57:

Coordination and Consensus

• If the key to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes?

– data consistency
– update propagation
– mutual exclusion
– consistent global states
– failure notification
– group membership (views)
– group communication
– event delivery and ordering

Page 58:

We stopped here

• But I’ll leave these slides in.

• Hidden slides (e.g., on VMs and network file cache consistency) are not to be tested.

• Obviously we only peeked at some of these topics, but you should understand the problem of failures (lock service).

Page 59:

Caching in the Web

• Web “proxy” caches are servers that cache Web content.

• Reduce traffic to the origin server.

• Deployed by enterprises to reduce external network traffic to serve Web requests of their members.

• Also deployed by third-party companies that sell caching service to Web providers.
– Content Delivery/Distribution Network (CDN)

– Help Web providers serve their clients better.

– Help absorb unexpected load from “flash crowds”.

– Reduce Web server infrastructure costs.

Page 60:

Content Delivery Network (CDN)

Page 61:

DNS Caching

Cache freshness by TTL (“freshness date”); also used by early NFS and HTTP/Web.
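TTL-based freshness amounts to a timestamp comparison; a tiny sketch with illustrative names follows.

```python
import time

class CacheEntry:
    def __init__(self, value, ttl):
        self.value = value
        self.fresh_until = time.time() + ttl   # the "freshness date"

    def is_fresh(self):
        # Within the TTL the cache answers locally; after that it must
        # revalidate or refetch from the authoritative server.
        return time.time() < self.fresh_until
```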