ZooKeeper: Wait-free Coordination for Internet-scale Systems

"Because Coordinating Distributed Systems is a Zoo"

P. Hunt, M. Konar, F. Junqueira, B. Reed

July 10, 2012

Presented by: Leandro Lera Romero

How do we get organized?

Distributed applications require different forms of coordination:

- Configuration.
- Group membership.
- Leader election.

How do we get organized?

One option is to develop a dedicated service for each need:

- Amazon Simple Queue Service.
- The Akamai configuration management system.

Another alternative is to use a locking service for distributed systems.

We need something that gives us more flexibility.

He who waits, despairs

The goal is to design a distributed coordination service that:

- Can adapt to different problems.
- Lets developers implement their own primitives.
- Avoids locks and other blocking primitives.

In short, we need a coordination kernel that exposes a wait-free API.

The components of ZooKeeper

ZooKeeper is made up of the following components:

- Servers
- Clients
- Sessions
- Znodes
  - Regular
  - Ephemeral
- API
- Watches

The API

Some of the fundamental methods:

- create(path, data, flags)
- delete(path, version)
- exists(path, watch)
- getData(path, watch)
- setData(path, data, version)
- getChildren(path, watch)
- sync(path)
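To make the role of the version argument concrete, here is a minimal sketch of a znode tree as a single-process, in-memory toy. The class name ToyZK and its snake_case methods are assumptions made for illustration; this is not the real ZooKeeper client, and watches, flags and sync are left out.

```python
class ToyZK:
    """Toy in-memory model of a znode tree (hypothetical, not the real client)."""

    def __init__(self):
        self.nodes = {"/": (b"", 0)}  # path -> (data, version)

    def create(self, path, data):
        if path in self.nodes:
            raise KeyError("node exists: " + path)
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no parent: " + parent)
        self.nodes[path] = (data, 0)
        return path

    def exists(self, path):
        return path in self.nodes

    def get_data(self, path):
        return self.nodes[path]  # (data, version)

    def _check(self, path, version):
        current = self.nodes[path][1]
        # A version of -1 skips the check; any other value must match.
        if version != -1 and version != current:
            raise ValueError("bad version")
        return current

    def set_data(self, path, data, version):
        current = self._check(path, version)
        self.nodes[path] = (data, current + 1)

    def delete(self, path, version):
        self._check(path, version)
        del self.nodes[path]

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self.nodes
                 if p != "/" and p.startswith(prefix))
        return sorted(n for n in names if "/" not in n)
```

A conditional setData succeeds only when the caller passes the version it last read, which is how clients detect concurrent updates without holding a lock.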

ZooKeeper guarantees

- Linearizable writes: all requests that update ZooKeeper's state are serialized and respect precedence order.
- FIFO client order: all requests from a given client are executed in the order in which they were sent.
- Liveness: the service is available as long as a majority of servers is up and able to communicate.
- Durability: changes persist despite failures, provided a majority of servers eventually recovers.

Some primitives: Configuration

1. Define a znode Zc where the configuration is stored.
2. When processes start, they read the configuration with the watch flag set.
   - getData(Zc, TRUE)
3. If the configuration is modified, they go back to step 2.
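The steps above can be sketched with a toy model of Zc. The classes ConfigNode and Process are hypothetical names, and the model assumes that a watch fires only once, so each read has to register it again.

```python
class ConfigNode:
    """Toy stand-in for the znode Zc with one-shot watches (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.watchers = []

    def get_data(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)  # registered for a single notification
        return self.data

    def set_data(self, data):
        self.data = data
        # Watches fire once and are cleared before the callbacks run.
        fired, self.watchers = self.watchers, []
        for callback in fired:
            callback()


class Process:
    def __init__(self, zc):
        self.zc = zc
        self.config = None
        self.load()  # step 2: read the configuration on start-up

    def load(self):
        # Step 3: on notification, read again and re-register the watch.
        self.config = self.zc.get_data(watch=self.load)
```

Because the process re-reads inside the callback, it always ends up with the latest configuration, even if several updates happen in a row.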

Some primitives: Group Membership

1. Define a znode Zg that represents the group.
2. When processes start, they create an ephemeral znode, with a unique name, as a child of Zg.
   - create(Zg + "/", data, ephemeral | sequential)

Processes can obtain information about the group by listing the children of Zg.

If a process terminates or fails, the znode it created is deleted automatically.
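A minimal sketch of this recipe, using a toy in-memory model of Zg. The Group class, the "member-" prefix and the session identifiers are assumptions for illustration; in the real service the session ends through a timeout or an explicit close.

```python
class Group:
    """Toy model of the group znode Zg and its children (hypothetical)."""

    def __init__(self):
        self.counter = 0
        self.children = {}  # child name -> (owner session, data)

    def join(self, session, data):
        # sequential flag: a monotonically increasing counter makes the name unique
        name = f"member-{self.counter:010d}"
        self.counter += 1
        self.children[name] = (session, data)
        return name

    def members(self):
        # Equivalent to listing the children of Zg.
        return sorted(self.children)

    def session_closed(self, session):
        # ephemeral flag: znodes disappear when the session that created them ends
        self.children = {name: value for name, value in self.children.items()
                         if value[0] != session}
```

No process has to announce that it is leaving: membership follows directly from which sessions are still alive.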

Some primitives: Locks

1. Define a znode Zl that represents the lock.
2. To acquire the lock, processes try to create Zl.
   - create(Zl, data, ephemeral)
3. If the process managed to create the znode, it holds the lock.
4. If it did not, it waits for the znode to be deleted and tries again.
   - exists(Zl, TRUE)

This lock has a problem...

What happens if several processes are waiting for the lock? All of them are notified when Zl is deleted and all of them try to create it again, although only one can succeed (the herd effect).

Some primitives: Locks 2.0

1. Define a znode Zl that represents the lock.
2. To acquire the lock, the process creates a znode with Zl as its parent.
   - n = create(Zl + "/lock-", data, ephemeral | sequential)
3. It gets the list of children of Zl.
   - getChildren(Zl, FALSE)
4. If n has the lowest sequence number, the process holds the lock.
5. Otherwise, let p be the znode of the process that requested the lock immediately before n.
6. It waits for p to finish using the lock.
   - exists(p, TRUE)
7. It repeats from step 3 to make sure it holds the lock.
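Steps 3 to 5 boil down to a pure function over the list of children. The sketch below assumes child names of the form "lock-" followed by a sequence number; lock_state is a hypothetical helper, not part of any client library.

```python
def lock_state(children, mine):
    """Return (holds_lock, znode_to_watch) for the znode `mine`.

    `children` is the list returned by getChildren(Zl); every name is
    assumed to end in "-<sequence number>".
    """
    ordered = sorted(children, key=lambda name: int(name.rsplit("-", 1)[1]))
    index = ordered.index(mine)
    if index == 0:
        return True, None             # step 4: lowest sequence number holds the lock
    return False, ordered[index - 1]  # steps 5-6: watch only the immediate predecessor
```

Each znode is watched by at most one process, so releasing the lock wakes up a single waiter. Step 7 is still needed because the predecessor may disappear without having held the lock (for example, its session ended), in which case the function returns a new znode to watch.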

The ZooKeeper architecture

[Figure 4: The components of the ZooKeeper service. Components: Request Processor, Atomic Broadcast, Replicated Database. Labels: Write Request, Read Request, txn, Response.]

4 ZooKeeper Implementation

ZooKeeper provides high availability by replicating the ZooKeeper data on each server that composes the service. We assume that servers fail by crashing, and such faulty servers may later recover. Figure 4 shows the high-level components of the ZooKeeper service. Upon receiving a request, a server prepares it for execution (request processor). If such a request requires coordination among the servers (write requests), then they use an agreement protocol (an implementation of atomic broadcast), and finally servers commit changes to the ZooKeeper database fully replicated across all servers of the ensemble. In the case of read requests, a server simply reads the state of the local database and generates a response to the request.

The replicated database is an in-memory database containing the entire data tree. Each znode in the tree stores a maximum of 1MB of data by default, but this maximum value is a configuration parameter that can be changed in specific cases. For recoverability, we efficiently log updates to disk, and we force writes to be on the disk media before they are applied to the in-memory database. In fact, as Chubby [8], we keep a replay log (a write-ahead log, in our case) of committed operations and generate periodic snapshots of the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit its requests. As we noted earlier, read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

As part of the agreement protocol write requests are forwarded to a single server, called the leader¹. The rest of the ZooKeeper servers, called followers, receive message proposals consisting of state changes from the leader and agree upon state changes.

¹ Details of leaders and followers, as part of the agreement protocol, are out of the scope of this paper.

4.1 Request Processor

Since the messaging layer is atomic, we guarantee that the local replicas never diverge, although at any point in time some servers may have applied more transactions than others. Unlike the requests sent from clients, the transactions are idempotent. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms it into a transaction that captures this new state. The future state must be calculated because there may be outstanding transactions that have not yet been applied to the database. For example, if a client does a conditional setData and the version number in the request matches the future version number of the znode being updated, the service generates a setDataTXN that contains the new data, the new version number, and updated timestamps. If an error occurs, such as mismatched version numbers or the znode to be updated does not exist, an errorTXN is generated instead.

4.2 Atomic Broadcast

All requests that update ZooKeeper state are forwarded to the leader. The leader executes the request and broadcasts the change to the ZooKeeper state through Zab [24], an atomic broadcast protocol. The server that receives the client request responds to the client when it delivers the corresponding state change. Zab uses by default simple majority quorums to decide on a proposal, so Zab and thus ZooKeeper can only work if a majority of servers are correct (i.e., with 2f + 1 server we can tolerate f failures).

To achieve high throughput, ZooKeeper tries to keep the request processing pipeline full. It may have thousands of requests in different parts of the processing pipeline. Because state changes depend on the application of previous state changes, Zab provides stronger order guarantees than regular atomic broadcast. More specifically, Zab guarantees that changes broadcast by a leader are delivered in the order they were sent and all changes from previous leaders are delivered to an established leader before it broadcasts its own changes.

There are a few implementation details that simplify our implementation and give us excellent performance. We use TCP for our transport so message order is maintained by the network, which allows us to simplify our implementation. We use the leader chosen by Zab as the ZooKeeper leader, so that the same process that creates transactions also proposes them. We use the log to keep track of proposals as the write-ahead log for the in- [...]

Figure: The components of the service

The implementation: Request Processor

- Receives update requests from clients.
- Forwards the requests to the leader or, if it is the leader itself, computes the future state of the system to generate a transaction.
- Transactions are idempotent.

Figure: Leader and followers
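As a rough illustration of why transactions are idempotent while client requests are not, the sketch below turns a conditional setData into a transaction that stores absolute values. The names SetDataTxn, ErrorTxn, to_txn and apply_txn are hypothetical, and timestamps are omitted to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetDataTxn:
    path: str
    data: bytes
    version: int  # the version the znode will have AFTER the write


@dataclass(frozen=True)
class ErrorTxn:
    reason: str


def to_txn(future_state, path, data, expected_version):
    """Turn a conditional setData request into a transaction.

    `future_state` maps path -> (data, version) as it will be once all
    outstanding transactions have been applied.
    """
    if path not in future_state:
        return ErrorTxn("no node")
    current = future_state[path][1]
    if expected_version != current:
        return ErrorTxn("bad version")
    return SetDataTxn(path, data, current + 1)


def apply_txn(state, txn):
    # The transaction carries the resulting data and version, so applying
    # it a second time leaves the state unchanged.
    if isinstance(txn, SetDataTxn):
        state[txn.path] = (txn.data, txn.version)
```

Replaying the original request twice would fail the version check the second time; replaying the transaction twice is harmless, which is what recovery relies on.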

The implementation: Atomic Broadcast

- Servers receive transactions through an atomic broadcast protocol.
- The protocol uses a simple majority to decide whether to apply a transaction.
- With 2n + 1 servers, n failures are tolerated.
- The protocol guarantees the delivery order of transactions.
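The majority arithmetic can be checked with two small helper functions (hypothetical names, shown only as a worked example):

```python
def quorum_size(servers):
    # Smallest group that is a strict majority of the ensemble.
    return servers // 2 + 1


def tolerated_failures(servers):
    # Servers that can fail while a majority is still available.
    return servers - quorum_size(servers)
```

With 5 servers the quorum is 3 and 2 failures are tolerated. Adding a sixth server raises the quorum to 4 without tolerating more failures, which is why ensembles usually have an odd number of servers.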

The implementation: Replicated Database

- It is an in-memory database that holds the entire data structure.
- Fuzzy snapshots are taken periodically to speed up recovery after a server crash.
- After a crash, the missed transactions are re-sent.
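A snapshot is "fuzzy" because it is taken while updates keep arriving, so it may mix old and new values. The sketch below shows, under simplifying assumptions, why replaying idempotent transactions on top of it still produces a valid state. The function recover, the tuple layout of the log and the sample values are all hypothetical.

```python
def recover(snapshot, log, snapshot_start_zxid):
    """Rebuild the state from a fuzzy snapshot plus the transaction log.

    `log` is a list of (zxid, path, data, version) in zxid order. Every
    transaction from the point where the snapshot STARTED is replayed,
    even if the snapshot already reflects it.
    """
    state = dict(snapshot)
    for zxid, path, data, version in log:
        if zxid >= snapshot_start_zxid:
            # Absolute values: re-applying a transaction is harmless.
            state[path] = (data, version)
    return state


# The snapshot caught /foo after its last update but /goo before its update.
snapshot = {"/foo": ("f3", 3), "/goo": ("g1", 1)}
log = [(1, "/foo", "f2", 2), (2, "/goo", "g2", 2), (3, "/foo", "f3", 3)]
state = recover(snapshot, log, snapshot_start_zxid=1)
```

After the replay, state holds the last logged value for both znodes, even though the snapshot on its own never corresponded to a state the service actually went through.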

Interactions between client and server

- When a server processes an update request, it notifies all watches registered for that change.
- Writes are processed in order, with no other operations running concurrently.
- Reads are served locally, which gives high read performance.
- Responses are tagged with the id of the last transaction seen.
- Sessions have a timeout, used to monitor clients.

Tests and Results

Experiments were run to measure the throughput and latency of the system.

A cluster of 50 servers was used, with the following characteristics:

- Dual-core Xeon 2.1GHz processor
- 4GB of RAM
- Gigabit Ethernet
- Two SATA hard drives

Tests and Results: Throughput

To test throughput, they ran a benchmark with the system saturated.

The tests consisted of:

- 250 simulated clients.
- 100 simultaneous requests per client, a mix of reads and writes of 1K of data.
- 2000 requests in process per server.


[Figure: Throughput as the read/write ratio varies ("Throughput of saturated system"). x-axis: percentage of read requests (0 to 100); y-axis: operations per second (0 to 90,000); series: 3, 5, 7, 9 and 13 servers.]


To test throughput as failures occur, they ran the previous benchmark with 30% writes.


[Figure: Throughput when failures occur ("Time series with failures"). x-axis: seconds since start of series (0 to 300); y-axis: operations per second (0 to 70,000); marked events: 1, 2, 3, 4a, 4b, 4c, 5, 6.]

Tests and Results: Latency

To measure latency, a benchmark was run that created 50,000 znodes as follows:

1. Create a znode with 1K of data.
2. Issue an asynchronous delete.
3. Go back to 1.

Tests and Results: Latency

           Number of servers
Workers      3      5      7      9
1          776    748    758    711
10        2074   1832   1572   1540
20        2740   2336   1934   1890

Table: Requests processed per second

The throughput of a single worker indicates that the average latency per request is 1.2ms for 3 servers and 1.4ms for 9 servers.
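The latency figures follow from the single-worker row of the table. A small worked example, assuming the worker waits for each create to complete before issuing the next one, so that latency is the inverse of its throughput (mean_latency_ms is a hypothetical helper):

```python
def mean_latency_ms(requests_per_second):
    # One request in flight at a time: time per request = 1 / throughput.
    return 1000.0 / requests_per_second


latency_3_servers = mean_latency_ms(776)  # about 1.29 ms
latency_9_servers = mean_latency_ms(711)  # about 1.41 ms
```

These values are in line with the 1.2ms and 1.4ms quoted above.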

In summary... ZooKeeper

- Uses a wait-free approach to coordinate processes in distributed systems.
- Provides a general solution for different forms of coordination.
- Through local replicas and watches, it achieves high throughput in read-dominated workloads.
