Restartability Management in the Cisco Core Router CRS/NG
Stefan Schaeckeler (Cisco Systems, Inc.)
Ashwin Narasimha Murthy (Google, Inc.)


Table of Contents

• System Overview
• CRS/NG Restartability Overview: Problem Definition and High-Level Solution
• Concrete Example: Statistics Resource Manager Library
• Conclusion


System Overview

Core Router
Extremely complex System
• SW: 16 MLOC
• HW: several chassis, LCs (1 CPU, 5 NPUs, chips galore), RPs (1 CPU, chips galore), fabric cards, blade cards, …
Forms a distributed System
99.9...9% Uptime


System Overview

System Manager: restarts crashed Process
• HW bug
• SW bug

Process must maintain State (after Crash)

CRS/NG Approach
• Key data structures in shared memory
• Well-written algorithms guarantee consistency

CRS 1 CRS 3 CRS/NG (final name?)


CRS/NG Restartability Overview

CRS/NG runs Cisco IOS/XR
Cisco IOS/XR Abstraction Layer on Linux
• Sophisticated IPC
• Sophisticated shared memory API
  • Special malloc for shared memory
  • Static configuration file
    – Mapping identifiers to fixed virtual addresses
    – STATS_RESTART 0x50000000
  • (Re)attaching to shared memory via identifier
  • Previously allocated objects always available

…
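The (re)attach idea can be sketched with plain POSIX calls. This is only an illustration, not the IOS/XR shared memory API: the function names window_attach and window_detach, the window size, and the use of a file-backed mapping are assumptions made here. The address is passed to mmap as a hint; a real implementation that stores raw pointers inside the window would have to insist on the configured address.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE 4096   /* assumed size, one page is enough for the demo */

/* Attach (or re-attach) a file-backed shared window.
 * addr is only a placement hint here; it mirrors the slide's
 * "identifier -> fixed virtual address" mapping (e.g. 0x50000000). */
void *window_attach(const char *path, void *addr)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, WINDOW_SIZE) != 0) {   /* keeps existing contents */
        close(fd);
        return NULL;
    }
    void *p = mmap(addr, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                               /* mapping stays valid */
    return p == MAP_FAILED ? NULL : p;
}

/* Drop the mapping; the backing object and its contents remain. */
void window_detach(void *p)
{
    munmap(p, WINDOW_SIZE);
}
```

Data written through the first mapping is still there after a detach and a second attach, which is the property "previously allocated objects always available" relies on.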


CRS/NG Restartability Overview

Process requiring Restartability
• Key data structures in shared memory
• Careful algorithm design to avoid
  • Temporary inconsistencies: account1 := account1+X; account2 := account2-X;
  • Pointer operations (disconnection of linked lists)
  • Crashes during IPCs
  • Crashes before a return; (caller records success)
• Optional recovery phase
• Compromises are possible
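One way to handle the account1/account2 inconsistency together with an optional recovery phase is a small undo log kept next to the data. The sketch below is an assumption about how this could look, not the CRS/NG code: ledger_t, transfer and recover are hypothetical names, and the crash_after parameter only simulates a crash by returning early. On real shared memory the write order would also have to be enforced (for example with memory barriers), which is left out here.

```c
#include <assert.h>

/* All fields would live in shared memory so they survive a crash. */
typedef struct {
    long account1;
    long account2;
    int  pending;   /* 1 while a transfer is in flight */
    long saved1;    /* undo log: values before the transfer */
    long saved2;
} ledger_t;

/* account1 := account1 + x; account2 := account2 - x;
 * crash_after simulates a crash after the given step (0 = no crash). */
void transfer(ledger_t *l, long x, int crash_after)
{
    assert(!l->pending);        /* recover() must run first after a crash */

    l->saved1 = l->account1;    /* step 1: write undo log, then arm it */
    l->saved2 = l->account2;
    l->pending = 1;
    if (crash_after == 1) return;

    l->account1 += x;           /* step 2 */
    if (crash_after == 2) return;

    l->account2 -= x;           /* step 3 */
    if (crash_after == 3) return;

    l->pending = 0;             /* step 4: commit */
}

/* Optional recovery phase, run once after a restart. */
void recover(ledger_t *l)
{
    if (l->pending) {           /* roll back the half-done transfer */
        l->account1 = l->saved1;
        l->account2 = l->saved2;
        l->pending = 0;
    }
}
```

A crash at any simulated step leaves either the old state or an armed undo log, so running recover() after restart restores a consistent pair of balances.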


Concrete Example: Statistics Resource Manager Library

HW: Extremely simplified View on CRS/NG


Concrete Example: Statistics Resource Manager Library

SW: Somewhat simplified View on CRS/NG Statistics Manager


Concrete Example: Statistics Resource Manager Library

Client Application / Library crashes and restarts

Client Application: State is gone
• Stats pointers are lost
• Other state is lost

Stats Lib
• State is gone
• Stats pointers are lost

Solution for Stats Lib
• Keep freelists in shared memory
• Smart algorithm for keeping state consistent


Concrete Example: Statistics Resource Manager Library

Step 1: Keeping State in Shared Memory

    stats_cl_ctx_st *mstats_cl_bind (char *name) {
        char *shmem;              /* char *, so that byte-offset arithmetic is valid C */
        stats_cl_ctx_st *con;
        int i;

        /* open shmem at a predetermined address */
        shmem = shmwin_attach(SSE_STATS_RESTART_ADDRESS); /* cf. POSIX mmap with the MAP_FIXED flag */
        con = (stats_cl_ctx_st *)(shmem + name_to_offset(name));

        if (strcmp(con->name, name) != 0) {
            /* first bind: init "empty" context */
            for (i = 0; i <= max; i++)    /* freelist[0..max] */
                con->freelist[i] = NULL;
            con->mutex = 0;
            strcpy(con->name, name);
        } else {
            /* restart: do nothing, just return con */
        }
        return con;
    }
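The first-bind versus restart distinction can be tried out without the Cisco API. In the following sketch a static array stands in for the shared memory window, and ctx_t, demo_bind and name_to_slot are hypothetical stand-ins for stats_cl_ctx_st, mstats_cl_bind and name_to_offset, whose real definitions are not shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LISTS 4
#define NAME_LEN  16

typedef struct {
    char  name[NAME_LEN];
    int   mutex;
    void *freelist[MAX_LISTS];
} ctx_t;

/* Stand-in for the shared window: a static buffer outlives each call the
 * way the real window outlives a process restart. */
static ctx_t window[2];

/* Hypothetical name-to-slot mapping, for illustration only. */
static size_t name_to_slot(const char *name)
{
    return (unsigned char)name[0] % 2;
}

ctx_t *demo_bind(const char *name)
{
    assert(strlen(name) < NAME_LEN);
    ctx_t *con = &window[name_to_slot(name)];

    if (strcmp(con->name, name) != 0) {
        /* first bind: initialise an "empty" context */
        for (size_t i = 0; i < MAX_LISTS; i++)
            con->freelist[i] = NULL;
        con->mutex = 0;
        strcpy(con->name, name);
    }
    /* restart: the stored name matches, state is returned untouched */
    return con;
}
```

Binding twice under the same name returns the same context with its freelists intact, which is what a restarted client sees.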


Concrete Example: Statistics Resource Manager Library

Step 2a: Smart Algorithm - A pragmatic Approach (chosen for CRS/NG)

Few Concepts:

(Re-)moving nodes from freelist
• Worst case: a page is lost (bad?)

Requesting fresh page from server
• Worst case: page is lost (bad?)

Updating bitmap: mark some pointers as allocated - client does not pick them up
• Worst case: some pointers are lost (bad?)
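The bitmap case can be illustrated with a mark-first ordering. This is a sketch under assumptions, not the library's code: page_t and page_alloc are hypothetical names, the page size of 256 pointers is taken from the worst-case discussion, and crash_before_return merely simulates a crash between marking and handing out.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PAGE 256

/* One page of stats pointers; the bitmap would live in shared memory. */
typedef struct {
    uint8_t allocated[PTRS_PER_PAGE / 8];   /* one bit per pointer */
} page_t;

/* Allocate one slot. The bit is set BEFORE the index is handed out, so a
 * crash in between leaks the slot but can never hand it out twice.
 * Returns the slot index, or -1 if nothing was handed out. */
int page_alloc(page_t *p, int crash_before_return)
{
    for (int i = 0; i < PTRS_PER_PAGE; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(p->allocated[i / 8] & mask)) {
            p->allocated[i / 8] |= mask;    /* mark first */
            if (crash_before_return)
                return -1;                  /* client never picks it up */
            return i;                       /* then hand out */
        }
    }
    return -1;                              /* page exhausted */
}
```

After a simulated crash the next allocation skips the marked slot: one pointer is lost, but no pointer is ever given out twice.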


Concrete Example: Statistics Resource Manager Library

Discussion of worst Case Scenarios

A page (or a few Pointers within) is lost
• = 256 out of 8 million stats pointers in NPU memory: no big deal
• = 80 bytes out of several GB of CPU memory for node structure: no big deal

Client frees a Pointer from a lost Page
• Error Code is returned
• Client is irritated but has to ignore it

We never give out the same Pointer twice
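To put the first figure in proportion, reading "8 million" as 8 × 10^6, one lost page corresponds to the following fraction of the NPU's stats pointers:

```latex
\frac{256}{8 \times 10^{6}} = 3.2 \times 10^{-5} \approx 0.003\,\%
```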


Concrete Example: Statistics Resource Manager Library

Step 2b: Smart Algorithm - A perfect Approach

Complicated Algorithm / Very difficult Implementation
• Further pointers in shared memory
• Need to figure out where crashed and continue from there

Requirement: interacting Libraries and Processes must be "perfect" as well


Conclusion

Pragmatic Approach of CRS/NG
+ Easy to implement
+/− Crashes: worst Case: small Mem. Leak
+ No Run-time Performance Hit

Perfect Approach
− Very difficult to implement, error prone
+ Crashes: no Memory Leak
− Perhaps Run-time Performance Hit


Thank You
