Approaches to Clustering CS444I Internet Services Winter 00 © 1999-2000 Armando Fox [email protected]


Outline

Non-cluster approaches to bigness

Approaches to clustering

Cluster case studies
  Berkeley NOW/GLUnix
  SNS/TACC
  Microsoft Wolfpack Wolfpair


Approaches to Bigness

One Big Mongo Server

DNS Round Robin

Magic Routers (a/k/a L4/L5 load balancing)

Application-Level Replication

True Clustering (case studies)
  NOW/GLUnix: single system Unix image
  Microsoft Wolfpack: virtualize every service
  SNS/TACC: fixed Internet-service programming model


One Big Mongo Server

Example: AltaVista
  Scaling: What if you can’t get a server with enough main memory?
  Availability
  Growth path and cost

Advantages of one big mongo server?
  Many agencies now using their (old?) mainframes! (IBM 390, e.g.)
  Putting Web front end on legacy DB’s/apps
  What if application is (say) I/O bound?


DNS Round Robin

Benefits
  Software transparent all the way to network level
  Expand farm by updating DNS servers

Costs
  Coarse grain
  Ad hoc
  Effect of node failure
  Some apps can’t be easily replicated
    Database
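A minimal sketch of why round robin is coarse grained and what a node failure does to it. The `RoundRobinDNS` class, the `simulate` helper and the addresses are hypothetical; a real DNS server rotates address records and clients cache the answers, both of which this toy ignores.

```python
from collections import Counter
from itertools import cycle


class RoundRobinDNS:
    """Toy resolver that hands out server addresses in strict rotation."""

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self._rotation = cycle(self.addresses)

    def resolve(self):
        # The resolver knows nothing about load or health; it just rotates.
        return next(self._rotation)


def simulate(dns, requests, dead=()):
    """Return (requests served per node, requests sent to dead nodes)."""
    served, failed = Counter(), 0
    for _ in range(requests):
        addr = dns.resolve()
        if addr in dead:
            failed += 1          # the client was pointed at a dead node
        else:
            served[addr] += 1
    return served, failed


farm = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
served, failed = simulate(farm, 1000, dead={"10.0.0.3"})
print(dict(served), failed)
```

In this model a dead node keeps receiving its full share of requests until someone edits the address list, which is one way to read "effect of node failure" above.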


Approaches to True Clustering

NOW/GLUnix: single Unix system image

Microsoft Wolfpack: off-the-shelf support for commodity apps

SNS/TACC: fixed Internet-service programming model


NOW: GLUnix

Original goals:
  High availability through redundancy
  Load balancing, self-management
  Binary compatibility
  Both batch and parallel-job support

I.e., single system image for NOW users
  Cluster abstractions == Unix abstractions
  This is both good and bad…what’s missing?

For portability and rapid development, build on top of off-the-shelf OS (Solaris)


GLUnix Architecture

Master collects load, status, etc. info from daemons
  Repository of cluster state, centralized resource allocation
  Pros/cons of this approach?

Glib app library talks to GLUnix master as app proxy
  Signal catching, process mgmt, I/O redirection, etc.
  Death of daemon is treated as a SIGKILL by master

[Figure: GLUnix master (1 per cluster) communicating with a glud daemon on each NOW node]
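The centralized design can be sketched as follows. This is an illustration under stated assumptions, not the GLUnix interface: the `Master` class and its `report`, `run` and `daemon_died` methods are hypothetical names, and load is reduced to a single number per node.

```python
SIGKILL = 9  # conventional Unix signal number, used here only as a marker


class Master:
    """Toy central master: one per cluster, holds all cluster state."""

    def __init__(self):
        self.load = {}         # node name -> last reported load
        self.jobs = {}         # job id -> node the job was placed on
        self.exit_signal = {}  # job id -> signal recorded when the job ended

    def report(self, node, load):
        # Called on behalf of each node's daemon with fresh load info.
        self.load[node] = load

    def run(self, job_id):
        # Centralized allocation: place the job on the least-loaded node.
        node = min(self.load, key=self.load.get)
        self.jobs[job_id] = node
        self.load[node] += 1
        return node

    def daemon_died(self, node):
        # Losing a daemon is treated as if every job on that node got SIGKILL.
        self.load.pop(node, None)
        for job_id, where in list(self.jobs.items()):
            if where == node:
                del self.jobs[job_id]
                self.exit_signal[job_id] = SIGKILL


master = Master()
for node, load in [("now1", 3), ("now2", 1), ("now3", 2)]:
    master.report(node, load)
placed = master.run("make-1")
master.daemon_died(placed)
print(placed, master.exit_signal)
```

Every decision goes through one object, which makes allocation simple and also makes the master a single point of failure and a scaling bottleneck; that is the trade-off the "pros/cons" question points at.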


GLUnix Retrospective

Trends that changed the assumptions
  SMP’s have replaced MPP’s, and are tougher to compete with
  Kernels have become extensible

Final features vs. initial goals
  Tools: glurun, glumake (2nd most popular use of NOW!), glups/glukill, glustat, glureserve
  Remote execution--but not total transparency
  Load balancing/distribution--but not transparent migration/failover
  Redundancy for high availability--but not for the “GLUnix master” node


GLUnix Interesting Problems

Glumake and NFS “consistency”

Support for benchmark-style batch jobs
  Many instantiations, different parameters
  Embarrassingly parallel

Social considerations
  User-initiated unnecessary (malicious?) restarts
  Lack of migration: an obstacle to harnessing desktop idle cycles (why?)

Philosophy: Did GLUnix ask the right question?


Scalability Limits

Centralized resource management

TCP connections! (file descriptors)

Interconnect latency and bandwidth (HW level)
  Myrinet: ~10 usec latency, 640 Mbits/s throughput
  Ethernet: ~400 usec latency, 100 Mbits/s throughput
  ATM: ~600 usec latency, 78 Mbits/s throughput (ATM was the initial target of the NOW!)

Thoughts about the interconnect
  What’s more important, latency or bandwidth?
  Why else might we want a secondary interconnect?
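One way to approach the latency-versus-bandwidth question is a back-of-the-envelope model using the figures above. The model is an assumption (time = latency + size / bandwidth, with no protocol overhead or contention), and `transfer_time` and `break_even_bytes` are hypothetical helper names.

```python
# (latency in seconds, bandwidth in bits per second), taken from the slide
INTERCONNECTS = {
    "Myrinet": (10e-6, 640e6),
    "Ethernet": (400e-6, 100e6),
    "ATM": (600e-6, 78e6),
}


def transfer_time(latency_s, bandwidth_bps, message_bytes):
    """Time to deliver one message under the simple latency + size/bandwidth model."""
    return latency_s + message_bytes * 8 / bandwidth_bps


def break_even_bytes(latency_s, bandwidth_bps):
    """Message size at which time on the wire equals the fixed latency."""
    return latency_s * bandwidth_bps / 8


for name, (lat, bw) in INTERCONNECTS.items():
    small = transfer_time(lat, bw, 100)        # e.g. a short control message
    large = transfer_time(lat, bw, 1_000_000)  # e.g. a bulk data transfer
    print(f"{name:9s} 100 B: {small * 1e6:8.1f} usec   "
          f"1 MB: {large * 1e3:7.2f} msec   "
          f"break-even: {break_even_bytes(lat, bw):6.0f} B")
```

Messages smaller than the break-even size are dominated by latency; larger ones are dominated by bandwidth. Which regime matters depends on the traffic the cluster software actually generates.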


Microsoft Wolfpack

Goal: clustering support for “commodity” OS & apps (NT)
  Clustering DLL’s
  Limited support for existing applications

Elements of a Wolfpack cluster
  Cluster leader & quorum resource
  Other cluster members
  Failover managers
  Virtualized services


Wolfpack Operation

Cluster leader and quorum resource
  The quorum (cluster configuration DB) defines the cluster
  Quorum had better be robust/highly-available!
  Prevents “split brain” problem resulting from partitioning

Heartbeats used to obtain membership info

Services can be virtualized to run on one or more nodes, but sharing a single network name
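The roles of the quorum and the heartbeats can be illustrated with a small sketch. None of this is the Wolfpack API: `QuorumResource`, `Membership` and `may_host_services` are hypothetical, and time is passed in explicitly instead of being read from a clock.

```python
class QuorumResource:
    """Toy stand-in for the quorum: it has at most one owner at a time."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, node):
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False


class Membership:
    """Tracks the last heartbeat seen from each node."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def live(self, now):
        # A node is a member only while its heartbeats keep arriving.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)


def may_host_services(partition, quorum):
    """A partition keeps running services only if one of its nodes owns the quorum."""
    return any(quorum.try_acquire(node) for node in partition)


quorum = QuorumResource()
print(may_host_services(["a", "b"], quorum))  # this side wins the quorum
print(may_host_services(["c"], quorum))       # the other side must stand down
```

Because only one side of a partition can own the quorum, two halves of a split cluster cannot both believe they are "the" cluster.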


Wolfpack: Failover

Failover managers negotiate among themselves to determine when/where/whether to restart a failed service

Degenerate case: can restart legacy apps
  Cluster-aware DLL’s provided for writing your own apps
  No guarantees on integrity/consistency

Pfister:
  “…a means of simply providing transactional semantics for data, without necessarily having to buy an entire relational database in the bargain, would make it significantly easier for applications to be highly available in a cluster.”


TACC/SNS

Specialized cluster runtime to host Web-like workloads
  TACC: transformation, aggregation, caching and customization--elements of an Internet service
  Build apps from composable modules, Unix-pipeline-style

Goal: complete separation of *ility concerns from application logic
  Legacy code encapsulation, multiple language support
  Insulate programmers from nasty engineering
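A rough sketch of the four TACC elements composed pipeline-style. This is not the TACC programming interface; `compose`, `cached`, `customized` and the three string workers are hypothetical, and a real service would run each stage as a separate worker on the cluster.

```python
from functools import reduce


def compose(*workers):
    """Chain workers left to right, like stages in a Unix pipeline."""
    return lambda data: reduce(lambda acc, worker: worker(acc), workers, data)


def cached(worker):
    """Caching: remember results for inputs that were already processed."""
    store = {}

    def wrapper(data):
        if data not in store:
            store[data] = worker(data)
        return store[data]

    wrapper.store = store
    return wrapper


def customized(worker_factory, profiles):
    """Customization: build the worker from the requesting user's profile."""
    def run(user, data):
        return worker_factory(**profiles.get(user, {}))(data)
    return run


# Hypothetical workers operating on plain strings.
def strip_markup(text):   # transformation
    return text.replace("<b>", "").replace("</b>", "")


def truncate(limit=20):   # transformation, parameterized per user
    return lambda text: text[:limit]


def merge(pages):         # aggregation: many inputs, one output
    return " | ".join(pages)


clean = cached(strip_markup)
pipeline = customized(lambda **prefs: compose(clean, truncate(**prefs)),
                      {"alice": {"limit": 5}})
print(pipeline("alice", "<b>hello</b> world"))
print(merge([pipeline("bob", "<b>a</b>"), pipeline("bob", "<b>b</b>")]))
```

The workers contain only application logic; where they run and how they are restarted is left to the runtime, which is the separation the goal above describes.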


TACC Examples

HotBot search engine
  Query crawler’s DB
  Cache recent searches
  Customize UI/presentation

TranSend transformation proxy
  On-the-fly lossy compression of inline images (GIF, JPG, etc.)
  Cache original & transformed
  User specifies aggressiveness, “refinement” UI, etc.

[Figure: TACC module diagrams for the two examples; boxes labeled T, A, C, $, DB, html]


Cluster-Based TACC Server

Component replication for scaling and availability

High-bandwidth, low-latency interconnect

Incremental scaling: commodity PC’s

[Figure: cluster-based TACC server. Front Ends (FE), Caches ($), User Profile Database, Workers, Load Balancing & Fault Tolerance (LB/FT) and Administration Interface (GUI), attached to the Interconnect; other boxes labeled C, T, A, WWW]


“Starfish” Availability: LB Death

FE detects via broken pipe/timeout, restarts LB

New LB announces itself (multicast), contacted by workers, gradually rebuilds load tables

If partition heals, extra LB’s commit suicide

FE’s operate using cached LB info during failure

[Figure: three frames of the cluster diagram (FE, $, C, T, A, LB/FT on the Interconnect) illustrating LB/FT failure and restart]
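The front end's side of this recovery can be sketched as follows. The `FrontEnd` class, the `LoadBalancerDown` exception and the `FakeLB` stand-in are hypothetical; in the real system the failure shows up as a broken pipe or timeout and the new LB is a separate process announcing itself over multicast.

```python
class LoadBalancerDown(Exception):
    """Stands in for a broken pipe or timeout when talking to the LB."""


class FrontEnd:
    def __init__(self, query_lb, restart_lb):
        self.query_lb = query_lb      # callable returning {worker: load}
        self.restart_lb = restart_lb  # callable that starts a new LB
        self.cached_loads = {}
        self.restarts = 0

    def pick_worker(self):
        try:
            fresh = self.query_lb()
            # A freshly restarted LB may still have empty tables while it
            # rebuilds them, so only replace the cache with real data.
            if fresh:
                self.cached_loads = dict(fresh)
        except LoadBalancerDown:
            # LB death detected: restart it and keep serving from the cache.
            self.restart_lb()
            self.restarts += 1
        if not self.cached_loads:
            raise RuntimeError("no load information available yet")
        return min(self.cached_loads, key=self.cached_loads.get)


class FakeLB:
    """In-memory stand-in for the load balancer, for illustration only."""

    def __init__(self, loads):
        self.loads, self.alive = loads, True

    def query(self):
        if not self.alive:
            raise LoadBalancerDown()
        return self.loads

    def restart(self):
        self.alive, self.loads = True, {}  # a new LB starts with empty tables


lb = FakeLB({"w1": 2, "w2": 1})
fe = FrontEnd(lb.query, lb.restart)
print(fe.pick_worker())   # uses fresh load info
lb.alive = False
print(fe.pick_worker())   # LB is dead: restart it, answer from the cache
```

The cached table may be stale, so decisions made during the outage are slightly worse rather than impossible.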


SNS Availability Mechanisms

Soft state everywhere
  Multicast based announce/listen to refresh the state
  Idea stolen from multicast routing in the Internet!

Process peers watch each other
  Because of no hard state, “recovery” == “restart”
  Because of multicast level of indirection, don’t need a location directory for resources

Load balancing, hot updates, migration are “easy”
  Shoot down a worker, and it will recover
  Upgrade == install new software, shoot down old
  Mostly graceful degradation
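A minimal sketch of refreshable soft state with announce/listen. The `SoftStateTable` class is hypothetical, announcements are plain method calls rather than multicast packets, and time is passed in explicitly so the behavior is easy to follow.

```python
class SoftStateTable:
    """Entries live only as long as their owner keeps announcing them."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}  # name -> (value, time of last announcement)

    def announce(self, name, value, now):
        # Each announcement refreshes the entry; there is no explicit delete.
        self.entries[name] = (value, now)

    def expire(self, now):
        dead = [name for name, (_, seen) in self.entries.items()
                if now - seen > self.ttl]
        for name in dead:
            del self.entries[name]
        return dead

    def snapshot(self, now):
        self.expire(now)
        return {name: value for name, (value, _) in self.entries.items()}


table = SoftStateTable(ttl=3)
table.announce("worker-1", {"load": 2}, now=0)
table.announce("worker-2", {"load": 5}, now=0)
table.announce("worker-2", {"load": 4}, now=2)  # worker-2 keeps announcing
print(table.snapshot(now=4))                    # worker-1 has timed out

# A restarted listener starts empty and is rebuilt purely by listening.
restarted = SoftStateTable(ttl=3)
restarted.announce("worker-2", {"load": 4}, now=5)
print(restarted.snapshot(now=5))
```

Since the table is rebuilt from announcements alone, a crashed listener needs no recovery protocol: it is restarted and catches up within a few announcement periods.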


SNS Availability Mechanisms, cont’d.

Orthogonal mechanisms
  Composition without interfaces
  Example: Scalable Reliable Multicast (SRM) group state management with SNS
  Eliminates O(n^2) complexity of composing modules
  State space of failure mechanisms is easy to reason about
  What’s the cost?

More on orthogonal mechanisms later


Administering SNS

Multicast means monitor can run anywhere on cluster

Extensible via self-describing data structures and mobile code in Tcl


Comparing SNS & Wolfpack

Somewhat different targets

Quorum Resource <--> Load Balancer/FT manager
  But soft state, and cluster can (temporarily) function without it
  Better partition resilience

Failover
  Wolfpack Failover Manager slightly more flexible
  Neither system provides any integrity/consistency guarantees itself

Multicast heartbeats detect membership, failures, locations of things


What We Really Learned From TACC

Design for failure
  It will fail anyway
  End-to-end argument applied to high availability

Orthogonality is even better than layering
  Narrow interface vs. no interface
  A great way to manage system complexity
  The price of orthogonality
  Techniques: Refreshable soft state; watchdogs/timeouts; sandboxing

Software compatibility is hard, but valuable
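The watchdog/timeout technique, combined with "recovery == restart", can be sketched with a child process as a crude sandbox. This is only an illustration: `run_with_watchdog` is a hypothetical helper, and the worker is a snippet of Python run in a separate interpreter rather than a real service component.

```python
import subprocess
import sys


def run_with_watchdog(code, timeout_s, max_restarts=2):
    """Run `code` in a child Python process; restart it on timeout or crash.

    Returns (number of restarts needed, worker output), or None if the
    worker never succeeded within the restart budget.
    """
    for attempt in range(1 + max_restarts):
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child; just try again.
            continue
        if result.returncode == 0:
            return attempt, result.stdout.strip()
    return None


print(run_with_watchdog("print(6 * 7)", timeout_s=10))
print(run_with_watchdog("import sys; sys.exit(1)", timeout_s=10, max_restarts=1))
```

The watchdog never inspects why the worker failed. It needs only an exit status and a deadline, which is what keeps it independent of the application logic.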


Clusters Summary

Many approaches to clustering, software transparency, failure semantics
  An end-to-end problem that is often application-specific
  We’ll see this again at the application level in harvest vs. yield discussion

Internet workloads are a particularly good match for clusters
  What software support is needed to mate these two things?
  What new abstractions do we want for writing failure-tolerant applications in light of these techniques?
  What about Pfister’s comment about transactional semantics?
