43
MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic Server Programs Cristiano Giuffrida (VU), Călin Iorgulescu (EPFL) Andrew S. Tanenbaum (VU) December 11, 2014

Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

MUTABLE CHECKPOINT-RESTART

Automating Live Update For Generic

Server Programs

Cristiano Giuffrida (VU), Călin Iorgulescu (EPFL)

Andrew S. Tanenbaum (VU)

December 11, 2014

Page 2: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

MCR in a nutshell

• Generic live update framework

• Can handle whole-program updates

• Low engineering effort to adapt updates (~30 LOC / 1 KLOC)

• Low runtime performance overhead (~2%)

• Realistic update times (≤ 1 s) under load (with 100 connections)

• Combines checkpoint-restart with native initialization

December 11, 2014 2 / 32

Page 3: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 3 / 32

Page 4: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 4 / 32

Page 5: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Update without downtime

• Don’t stop running services!

• Don’t disrupt clients!

”If you think database patching is onerous, then try patching

a SCADA system that’s running a power plant.”

-- Kelly Jackson Higgins on the SCADA patch problem, 2013

December 11, 2014 5 / 32

Page 6: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

What are the existing options ?

• Rolling upgrades

• require redundant hardware

• may lead to inconsistencies between machines

• In-place live update

• replaces part of the running code

• limited in terms of update complexity

• Whole-program live update

• requires considerable annotation effort

December 11, 2014

Servers protected with Ksplice Uptrack:

100,000+ at more than 700 companies

Source: www.ksplice.com

6 / 32

Page 7: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

What are the existing options ?

• Rolling upgrades

• require redundant hardware

• may lead to inconsistencies between machines

• In-place live update

• replaces part of the running code

• limited in terms of update complexity

• Whole-program live update

• requires considerable annotation effort required

December 11, 2014

Servers protected with Ksplice Uptrack:

100,000+ at more than 700 companies

Source: www.ksplice.com

6 / 32

Page 8: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

What We Did: Mutable Checkpoint-Restart

• Focused on servers: high availability requirements & well-defined

structure

• Low engineering effort to adapt new programs

• Updating recreates entire process hierarchy

• individual processes natively initialize their state

• process state changes are transferred to the new version

• Tested on 4 real-world servers: Apache httpd, nginx, OpenSSH, vsftpd

December 11, 2014

Current

(V1)

New

(V2)

State Transfer

7 / 32

Page 9: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 8 / 32

Page 10: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A generic live update framework

• Support complex updates

• Don’t break OS invariants

• maintain consistent server state

• e.g., open connections must not

break

• Support generic C servers

• weakly typed & types change

• no object tracing

December 11, 2014 9

Page 11: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A generic live update framework

• Support complex updates

• Don’t break OS invariants

• maintain consistent server state

• e.g., open connections must not

break

• Support generic C servers

• weakly typed & types change

• no object tracing

• Quiescence Detection

• apply updates safely

December 11, 2014 9

Page 12: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A generic live update framework

• Support complex updates

• Don’t break OS invariants

• maintain consistent server state

• e.g., open connections must not

break

• Support generic C servers

• weakly typed & types change

• no object tracing

• Quiescence Detection

• apply updates safely

• Mutable Reinitialization

• populate long-lived objects

December 11, 2014 9

Page 13: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A generic live update framework

• Support complex updates

• Don’t break OS invariants

• maintain consistent server state

• e.g., open connections must not

break

• Support generic C servers

• weakly typed & types change

• no object tracing

• Quiescence Detection

• apply updates safely

• Mutable Reinitialization

• populate long-lived objects

• State Transfer

• hybrid garbage collector

approach

December 11, 2014 9

Page 14: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

The update process

1. An update request is received

December 11, 2014

Running

(V1)

server_V2

10 / 32

Page 15: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

The update process Quiescence

2. All tasks reach a quiescent point & suspend

December 11, 2014

Quiescent

(V1)

11 / 32

Page 16: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

The update process Mutable Reinitialization

3. The new version initializes

December 11, 2014

Quiescent

(V1)

Initializing

(V2)

Immutable objects

(e.g., sockets)

12 / 32

Page 17: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

The update process State Transfer

4. Modified process state is transferred

December 11, 2014

Quiescent

(V1)

Initialized

(V2)

Memory object

pairing

State transfer

13 / 32

Page 18: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

The update process

5. Control is transferred to V2

December 11, 2014

Running

(V2)

14 / 32

Page 19: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Preparing a program for MCR

December 11, 2014

Offline Profiling

server.c

Quiescence

Profiler

Quiescence

Points

LLVM Pass

(Static Type Analysis)

server.bc

LLVM Compiler

libmcr.a

GOLD Linker server

libmcr.so

Build time instrumentation Runtime

instrumentation

15 / 32

Page 20: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 16 / 32

Page 21: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A simple server example /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014 17 / 32

Page 22: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A simple server example /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

Server initialization code

• load config

• create sockets

• allocate memory

• spawn workers

17 / 32

Page 23: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A simple server example /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

Event processing loop

• accept new connections

• delegate to workers

17 / 32

Page 24: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

A simple server example /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

Global memory objects

• opaque objects

• well-defined objects

17 / 32

Page 25: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Quiescence Detection /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

Quiescence point

• long lived call stack

• underlying blocking call

• initialization already done

• Updates occur only at safe points

(quiescence points)

• Allows stopping/restarting tasks

safely

• Reduces expensive call stack

instrumentation

18 / 32

Page 26: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Mutable Reinitialization 1/2

• Rebuilding the entire execution state is notoriously hard!

• … but …

• Most long-lived state is populated during initialization

• Record immutable object creation events (startup log)

• e.g., creating a socket, opening a file

• Replay event result during initialization

• Let the initialization complete

December 11, 2014 19 / 32

Page 27: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Mutable Reinitialization 2/2

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

void server_init(struct conf_s *conf) { ... conf.sock = socket(...); ... }

20 / 32

Page 28: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Mutable Reinitialization 2/2

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014

void server_init(struct conf_s *conf) { ... conf.sock = socket(...); ... }

V1 V2

conf.sock = 5 conf.sock = 5

Native syscall Replayed event

from V1

20 / 32

Page 29: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

State Transfer /* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; }

December 11, 2014 21 / 32

Page 30: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

State Transfer

December 11, 2014

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; ...

V1

21 / 32

Page 31: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

State Transfer

December 11, 2014

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; ...

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; int new; } l_t; l_t list; ...

V1 V2

21 / 32

Page 32: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

State Transfer

December 11, 2014

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; ...

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; int new; } l_t; l_t list; ...

V1 V2

21 / 32

Page 33: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Violating Assumptions 1/2

• Unsupported immutable objects

• e.g., process IDs stored in opaque data structures, pointer

encoding

• Non-deterministic process model

• e.g., workers are spawned on-demand

• Non-replayed operations that affect initialization

• e.g., check if a PID file exists, then quit (Apache httpd)

December 11, 2014 22 / 32

Page 34: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Violating Assumptions 2/2

/* Auxiliary data structures. */ char b[8]; typedef struct list_s { int value; struct list_s *next; } l_t; l_t list; /* Startup configuration. */ struct conf_s *conf; /* Server implementation. */ int main() { server_init(&conf); while (1) { void *e = server_get_event(); server_handle_event(e, conf, b, &list); } return 0; } /* MCR annotations. */ MCR_ADD_OBJ_HANDLER(b, user_b_handler); MCR_ADD_REINIT_HANDLER(user_reinit_handler);

December 11, 2014

User defined handlers

• opaque object handlers

• custom initialization handlers

23 / 32

Page 35: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 24 / 32

Page 36: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

MCR with real world servers

• Tested on Apache httpd (2.2.23), nginx (0.8.54),

OpenSSH (1.1.0), vsftpd (3.5p1)

• How much engineering effort is needed ?

• Does MCR yield low performance overhead ?

• Does MCR yield reasonable update time ?

• How much memory does MCR use ?

December 11, 2014 25 / 32

Page 37: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

How much engineering effort is needed ?

# patches Changes

LOC

Annotations

LOC Overhead

Apache httpd 5 10844 383 3.4%

nginx 25 9681 357 3.5%

vsftpd 5 5830 103 1.7%

OpenSSH 5 14370 184 1.2%

December 11, 2014

Overview of custom handler code written to support updates.

• Ideally, annotations would be written directly by the maintainers

26 / 32

Page 38: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Does MCR yield low performance overhead ?

Static

Static +

Dynamic

Static +

Dynamic +

Quiescence

Overhead

Apache httpd 1.040 1.043 1.047 ~4.7%

nginx 1.000 1.000 1.000 ~0%

nginxreg 1.175 1.192 1.186 ~19%

vsftpd 1.027 1.028 1.028 ~2.8%

OpenSSH 0.999 1.001 1.001 ~0%

December 11, 2014

Normalized overhead of each instrumentation step.

• What about nginxreg ?

• Static instrumentation for custom region allocators

• reduces number of opaque objects

• adds runtime overhead

27 / 32

Page 39: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Does MCR yield reasonable update time ?

• Update = Quiescence + Initialization + State Transfer

• quiescence time <100ms

• initialization time < 50ms (overhead)

December 11, 2014

workload

independent

State transfer time depending on the number of open connections

28 / 32

Page 40: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

How much memory does MCR use ?

• Larger memory footprints

• Binary size increased by: 18.7% - 135.2%

• Resident set size increased by: 10% - 383.6% (avg. 188.5%)

• Why?

• memory object tracing metadata

• implementation is optimized for speed

• Favor annotation-less support over memory overhead

December 11, 2014 29 / 32

Page 41: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Outline

The Live Update Problem

Mutable Checkpoint-Restart

Overview

Architecture

Evaluation

Conclusion

December 11, 2014 30 / 32

Page 42: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Conclusion

• MCR combines checkpoint-restart with native initialization

• Generic live update framework for whole-program updates

• Low engineering effort to adapt updates (~30 LOC / 1 KLOC)

• Low runtime performance overhead (~2%)

• Realistic update times (≤ 1 s) under load (with 100 connections)

December 11, 2014 31 / 32

Page 43: Mutable checkpoint-restart Automating Live-update For Generic …giuffrida/papers/middleware-2014... · 2014-12-20 · MUTABLE CHECKPOINT-RESTART Automating Live Update For Generic

Thank you!

• Questions ?

December 11, 2014 32 / 32