60
LECTURE 6 Consistency and Replication

LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

LECTURE 6 Consistency and Replication

Page 2: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Replication and why is it hard

¨  Data needs to be replicated for reliability or to improve performance ¤  If a replica crashes, possible to continue to have access. ¤  Performance improves by accessing nearer data stores ¤  Aids scaling

¨  However, a key challenge is ensuring the consistency of data across replicas. ¤ Hard to realize in large distributed systems. `

Page 3: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Handling Failures

¨  Logical clocks enable all replicas to execute updates in same order ¤ But, cannot handle failure of any replica!

¨  To deal with crash failures we saw: ¤ Chandy-Lamport snapshot for coordinated checkpoint ¤ Log sources of non-determinism between checkpoints

3

Page 4: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Types of failures

¨  Crash failures ¤ Can resume with previously saved state

¨  Fail stop ¤ All state is lost upon failure ¤ Need to replicate state

4

Page 5: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Replication 5

Client Primary Backup

Backup

Backup

Page 6: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Replication

¨  How to handle primary failure? ¤ Promote one of the backups as the primary

¨  How to handle backup failure? ¤ Add another machine as a backup

6

Page 7: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Sync

¨  When should primary sync up with backups?

¨  What should be transferred when syncing?

7

Page 8: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Example: MapReduce Master

RegisterServer() { while (1) {

receive msg and parse addr

pick task to assign

mark task as assigned

respond with task assignment }

}

8

Primary & Backups in sync

When to sync

again?

Page 9: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Takeaways

¨  Cannot tolerate primary failure after update to its state is externally visible

¨  Corollary: Okay for primary to be out of sync with backup until change is externally visible ¤ External consistency

¨  Implications for when primary should sync with backups?

9

Page 10: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary-Backup Sync 10

C1

C2

P

B

Page 11: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Reads vs. Updates

¨  For operations that do not update state, primary need not consult backups ¤ When is this true?

¨  If backup is externally consistent with primary ¤ à If backup takes over as primary, it will generate

identical response as primary may have

11

Page 12: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

What to transfer in sync?

¨  Snapshot of primary’s state ¤ Slow ¤ When is this necessary? ¤ Necessary when bootstrapping new backup

¨  Every operation ¤ Why is this okay? ¤ Leverage determinism of state machine

12

Page 13: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Service Development

¨  Getting coordination right between primary and backups is tricky ¤ Easy to mess up

¨  Must make replication transparent to developer

13

Page 14: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

RSM with Logical Clocks 14

Replicated State Machine

Application

Updates

Ordered Updates

Replicated State Machine

Application

Updates

Server1 Server2

Page 15: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Transparent Primary Backup

¨  Application relies on library to keep primary and backups in sync ¤ Receive message from client ¤ Sync with backups before sending response to client

¨  Will this solution work? ¨  Does not capture non-determinism in execution

15

Page 16: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Virtual Machines

Operating System

Hardware

Applications

CPU Disk RAM

Process File system Virtual memory

Virtual Machine

Virtual Machine Monitor

16

Page 17: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary VM

Backup VM

Logging channel

Shared Disk

�!�,)� �� �*!� �� �&%��,)�+!&%�

@8AG4G<BA B9 94H?GGB?8E4AG .%F 9BE G;8 ( (�*!+� C?4G9BE@� 'HE 4CCEB46; <F F<@<?4E 5HG J8 ;4I8 @478 FB@89HA74@8AG4? 6;4A:8F 9BE C8E9BE@4A68 E84FBAF 4A7 <AI8FG<:4G87 4 AH@58E B9 78F<:A 4?G8EA4G<I8F� !A 477<G<BA J8 ;4I8;47 GB 78F<:A 4A7 <@C?8@8AG @4AL 477<G<BA4? 6B@CBA8AGF<A G;8 FLFG8@ 4A7 784? J<G; 4 AH@58E B9 CE46G<64? <FFH8FGB 5H<?7 4 6B@C?8G8 FLFG8@ G;4G <F 8�6<8AG 4A7 HF45?8 5L6HFGB@8EF EHAA<A: 8AG8ECE<F8 4CC?<64G<BAF� +<@<?4E GB @BFGBG;8E CE46G<64? FLFG8@F 7<F6HFF87 J8 BA?L 4GG8@CG GB 784?J<G; 94<?FGBC 94<?HE8F 1��3 J;<6; 4E8 F8EI8E 94<?HE8F G;4G 64A58 78G86G87 589BE8 G;8 94<?<A: F8EI8E 64HF8F 4A <A6BEE86G 8KG8EA4??L I<F<5?8 46G<BA�,;8 E8FG B9 G;8 C4C8E <F BE:4A<M87 4F 9B??BJF� �<EFG J8

78F6E<58 BHE 54F<6 78F<:A 4A7 78G4<? BHE 9HA74@8AG4? CEBGB6B?F G;4G 8AFHE8 G;4G AB 74G4 <F ?BFG <9 4 546>HC .% G4>8FBI8E 49G8E 4 CE<@4EL .% 94<?F� ,;8A J8 78F6E<58 <A 78G4<? @4AL B9 G;8 CE46G<64? <FFH8F G;4G @HFG 58 477E8FF87 GB5H<?7 4 EB5HFG 6B@C?8G8 4A7 4HGB@4G87 FLFG8@� /8 4?FB78F6E<58 F8I8E4? 78F<:A 6;B<68F G;4G 4E<F8 9BE <@C?8@8AG<A:94H?GGB?8E4AG .%F 4A7 7<F6HFF G;8 GE478B�F <A G;8F8 6;B<68F�&8KG J8 :<I8 C8E9BE@4A68 E8FH?GF 9BE BHE <@C?8@8AG4G<BA9BE FB@8 58A6;@4E>F 4A7 FB@8 E84? 8AG8ECE<F8 4CC?<64G<BAF��<A4??L J8 78F6E<58 E8?4G87 JBE> 4A7 6BA6?H78�

2. BASIC FT DESIGN�<:HE8 � F;BJF G;8 54F<6 F8GHC B9 BHE FLFG8@ 9BE 94H?G

GB?8E4AG .%F� �BE 4 :<I8A .% 9BE J;<6; J8 78F<E8 GB CEBI<7894H?G GB?8E4A68 �G;8 -.'*�.4 .%� J8 EHA 4 �!(1- .% BA4 7<�8E8AG C;LF<64? F8EI8E G;4G <F >8CG <A FLA6 4A7 8K86HG8F<78AG<64??L GB G;8 CE<@4EL I<EGH4? @46;<A8 G;BH:; J<G; 4F@4?? G<@8 ?4:� /8 F4L G;4G G;8 GJB .%F 4E8 <A 2'.01�) ),!(�/0#-� ,;8 I<EGH4? 7<F>F 9BE G;8 .%F 4E8 BA F;4E87 FGBE4:8�FH6; 4F 4 �<5E8 �;4AA8? BE <+�+! 7<F> 4EE4L� 4A7 G;8E89BE8 4668FF<5?8 GB G;8 CE<@4EL 4A7 546>HC .% 9BE <ACHG 4A7BHGCHG� �/8 J<?? 7<F6HFF 4 78F<:A <A J;<6; G;8 CE<@4EL 4A7546>HC .% ;4I8 F8C4E4G8 ABAF;4E87 I<EGH4? 7<F>F <A +86G<BA ����� 'A?L G;8 CE<@4EL .% 47I8EG<F8F <GF CE8F8A68 BAG;8 A8GJBE> FB 4?? A8GJBE> <ACHGF 6B@8 GB G;8 CE<@4EL .%�+<@<?4E?L 4?? BG;8E <ACHGF �FH6; 4F >8L5B4E7 4A7 @BHF8� :BBA?L GB G;8 CE<@4EL .%��?? <ACHG G;4G G;8 CE<@4EL .% E868<I8F <F F8AG GB G;8

546>HC .% I<4 4 A8GJBE> 6BAA86G<BA >ABJA 4F G;8 ),%%'+%!&�++#)� �BE F8EI8E JBE>?B47F G;8 7B@<A4AG <ACHG GE4�6<F A8GJBE> 4A7 7<F>� �77<G<BA4? <A9BE@4G<BA 4F 7<F6HFF8758?BJ <A +86G<BA ��� <F GE4AF@<GG87 4F A868FF4EL GB 8AFHE8G;4G G;8 546>HC .% 8K86HG8F ABA78G8E@<A<FG<6 BC8E4G<BAF<A G;8 F4@8 J4L 4F G;8 CE<@4EL .%� ,;8 E8FH?G <F G;4G G;8546>HC .% 4?J4LF 8K86HG8F <78AG<64??L GB G;8 CE<@4EL .%� BJ8I8E G;8 BHGCHGF B9 G;8 546>HC .% 4E8 7EBCC87 5LG;8 ;LC8EI<FBE FB BA?L G;8 CE<@4EL CEB7H68F 46GH4? BHGCHGFG;4G 4E8 E8GHEA87 GB 6?<8AGF� �F 78F6E<587 <A +86G<BA ��� G;8CE<@4EL 4A7 546>HC .% 9B??BJ 4 FC86<�6 CEBGB6B? <A6?H7<A:8KC?<6<G 46>ABJ?87:@8AGF 5L G;8 546>HC .% <A BE78E GB8AFHE8 G;4G AB 74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F�,B 78G86G <9 4 CE<@4EL BE 546>HC .% ;4F 94<?87 BHE FLF

G8@ HF8F 4 6B@5<A4G<BA B9 ;84EG584G<A: 58GJ88A G;8 E8?8I4AGF8EI8EF 4A7 @BA<GBE<A: B9 G;8 GE4�6 BA G;8 ?B::<A: 6;4AA8?�!A 477<G<BA J8 @HFG 8AFHE8 G;4G BA?L BA8 B9 G;8 CE<@4ELBE 546>HC .% G4>8F BI8E 8K86HG<BA 8I8A <9 G;8E8 <F 4 FC?<G5E4<A F<GH4G<BA J;8E8 G;8 CE<@4EL 4A7 546>HC F8EI8EF ;4I8?BFG 6B@@HA<64G<BA J<G; 846; BG;8E�!A G;8 9B??BJ<A: F86G<BAF J8 CEBI<78 @BE8 78G4<?F BA F8I

8E4? <@CBEG4AG 4E84F� !A +86G<BA ��� J8 :<I8 FB@8 78G4<?FBA G;8 78G8E@<A<FG<6 E8C?4L G86;AB?B:L G;4G 8AFHE8F G;4G CE<@4EL 4A7 546>HC .%F 4E8 >8CG <A FLA6 I<4 G;8 <A9BE@4G<BAF8AG BI8E G;8 ?B::<A: 6;4AA8?� !A +86G<BA ��� J8 78F6E<584 9HA74@8AG4? EH?8 B9 BHE �, CEBGB6B? G;4G 8AFHE8F G;4G AB74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F� !A +86G<BA ��� J8 78F6E<58BHE @8G;B7F 9BE 78G86G<A: 4A7 E8FCBA7<A: GB 4 94<?HE8 <A 46BEE86G 94F;<BA�

2.1 Deterministic Replay Implementation�F J8 ;4I8 @8AG<BA87 E8C?<64G<A: F8EI8E �BE .%� 8K8

6HG<BA 64A 58 @B78?87 4F G;8 E8C?<64G<BA B9 4 78G8E@<A<FG<6 FG4G8 @46;<A8� !9 GJB 78G8E@<A<FG<6 FG4G8 @46;<A8F 4E8FG4EG87 <A G;8 F4@8 <A<G<4? FG4G8 4A7 CEBI<787 G;8 8K46G F4@8<ACHGF <A G;8 F4@8 BE78E G;8A G;8L J<?? :B G;EBH:; G;8 F4@8F8DH8A68F B9 FG4G8F 4A7 CEB7H68 G;8 F4@8 BHGCHGF� � I<EGH4? @46;<A8 ;4F 4 5EB47 F8G B9 <ACHGF <A6?H7<A: <A6B@<A:A8GJBE> C46>8GF 7<F> E847F 4A7 <ACHG 9EB@ G;8 >8L5B4E74A7 @BHF8� &BA78G8E@<A<FG<6 8I8AGF �FH6; 4F I<EGH4? <AG8EEHCGF� 4A7 ABA78G8E@<A<FG<6 BC8E4G<BAF �FH6; 4F E847<A:G;8 6?B6> 6L6?8 6BHAG8E B9 G;8 CEB68FFBE� 4?FB 4�86G G;8 .%�FFG4G8� ,;<F CE8F8AGF G;E88 6;4??8A:8F 9BE E8C?<64G<A: 8K86HG<BA B9 4AL .% EHAA<A: 4AL BC8E4G<A: FLFG8@ 4A7 JBE>?B47���� 6BEE86G?L 64CGHE<A: 4?? G;8 <ACHG 4A7 ABA78G8E@<A<F@A868FF4EL GB 8AFHE8 78G8E@<A<FG<6 8K86HG<BA B9 4 546>HC I<EGH4? @46;<A8 ��� 6BEE86G?L 4CC?L<A: G;8 <ACHGF 4A7 ABA78G8E@<A<F@ GB G;8 546>HC I<EGH4? @46;<A8 4A7 ��� 7B<A:FB <A 4 @4AA8E G;4G 7B8FA�G 78:E478 C8E9BE@4A68� !A 477<G<BA @4AL 6B@C?8K BC8E4G<BAF <A K�� @<6EBCEB68FFBEF ;4I8HA78�A87 ;8A68 ABA78G8E@<A<FG<6 F<78 8�86GF� �4CGHE<A:G;8F8 HA78�A87 F<78 8�86GF 4A7 E8C?4L<A: G;8@ GB CEB7H68G;8 F4@8 FG4G8 CE8F8AGF 4A 477<G<BA4? 6;4??8A:8�.%J4E8 78G8E@<A<FG<6 E8C?4L 1��3 CEBI<78F 8K46G?L G;<F

9HA6G<BA4?<GL 9BE K�� I<EGH4? @46;<A8F BA G;8 .%J4E8 I+C;8E8C?4G9BE@� �8G8E@<A<FG<6 E8C?4L E86BE7F G;8 <ACHGF B9 4 .%4A7 4?? CBFF<5?8 ABA78G8E@<A<F@ 4FFB6<4G87 J<G; G;8 .%8K86HG<BA <A 4 FGE84@ B9 ?B: 8AGE<8F JE<GG8A GB 4 ?B: �?8� ,;8.% 8K86HG<BA @4L 58 8K46G?L E8C?4L87 ?4G8E 5L E847<A: G;8?B: 8AGE<8F 9EB@ G;8 �?8� �BE ABA78G8E@<A<FG<6 BC8E4G<BAFFH�6<8AG <A9BE@4G<BA <F ?B::87 GB 4??BJ G;8 BC8E4G<BA GB 58E8CEB7H687 J<G; G;8 F4@8 FG4G8 6;4A:8 4A7 BHGCHG� �BEABA78G8E@<A<FG<6 8I8AGF FH6; 4F G<@8E BE !' 6B@C?8G<BA <A

31

Primary VM

Backup VM

Logging channel

Shared Disk

�!�,)� �� �*!� �� �&%��,)�+!&%�

@8AG4G<BA B9 94H?GGB?8E4AG .%F 9BE G;8 ( (�*!+� C?4G9BE@� 'HE 4CCEB46; <F F<@<?4E 5HG J8 ;4I8 @478 FB@89HA74@8AG4? 6;4A:8F 9BE C8E9BE@4A68 E84FBAF 4A7 <AI8FG<:4G87 4 AH@58E B9 78F<:A 4?G8EA4G<I8F� !A 477<G<BA J8 ;4I8;47 GB 78F<:A 4A7 <@C?8@8AG @4AL 477<G<BA4? 6B@CBA8AGF<A G;8 FLFG8@ 4A7 784? J<G; 4 AH@58E B9 CE46G<64? <FFH8FGB 5H<?7 4 6B@C?8G8 FLFG8@ G;4G <F 8�6<8AG 4A7 HF45?8 5L6HFGB@8EF EHAA<A: 8AG8ECE<F8 4CC?<64G<BAF� +<@<?4E GB @BFGBG;8E CE46G<64? FLFG8@F 7<F6HFF87 J8 BA?L 4GG8@CG GB 784?J<G; 94<?FGBC 94<?HE8F 1��3 J;<6; 4E8 F8EI8E 94<?HE8F G;4G 64A58 78G86G87 589BE8 G;8 94<?<A: F8EI8E 64HF8F 4A <A6BEE86G 8KG8EA4??L I<F<5?8 46G<BA�,;8 E8FG B9 G;8 C4C8E <F BE:4A<M87 4F 9B??BJF� �<EFG J8

78F6E<58 BHE 54F<6 78F<:A 4A7 78G4<? BHE 9HA74@8AG4? CEBGB6B?F G;4G 8AFHE8 G;4G AB 74G4 <F ?BFG <9 4 546>HC .% G4>8FBI8E 49G8E 4 CE<@4EL .% 94<?F� ,;8A J8 78F6E<58 <A 78G4<? @4AL B9 G;8 CE46G<64? <FFH8F G;4G @HFG 58 477E8FF87 GB5H<?7 4 EB5HFG 6B@C?8G8 4A7 4HGB@4G87 FLFG8@� /8 4?FB78F6E<58 F8I8E4? 78F<:A 6;B<68F G;4G 4E<F8 9BE <@C?8@8AG<A:94H?GGB?8E4AG .%F 4A7 7<F6HFF G;8 GE478B�F <A G;8F8 6;B<68F�&8KG J8 :<I8 C8E9BE@4A68 E8FH?GF 9BE BHE <@C?8@8AG4G<BA9BE FB@8 58A6;@4E>F 4A7 FB@8 E84? 8AG8ECE<F8 4CC?<64G<BAF��<A4??L J8 78F6E<58 E8?4G87 JBE> 4A7 6BA6?H78�

2. BASIC FT DESIGN�<:HE8 � F;BJF G;8 54F<6 F8GHC B9 BHE FLFG8@ 9BE 94H?G

GB?8E4AG .%F� �BE 4 :<I8A .% 9BE J;<6; J8 78F<E8 GB CEBI<7894H?G GB?8E4A68 �G;8 -.'*�.4 .%� J8 EHA 4 �!(1- .% BA4 7<�8E8AG C;LF<64? F8EI8E G;4G <F >8CG <A FLA6 4A7 8K86HG8F<78AG<64??L GB G;8 CE<@4EL I<EGH4? @46;<A8 G;BH:; J<G; 4F@4?? G<@8 ?4:� /8 F4L G;4G G;8 GJB .%F 4E8 <A 2'.01�) ),!(�/0#-� ,;8 I<EGH4? 7<F>F 9BE G;8 .%F 4E8 BA F;4E87 FGBE4:8�FH6; 4F 4 �<5E8 �;4AA8? BE <+�+! 7<F> 4EE4L� 4A7 G;8E89BE8 4668FF<5?8 GB G;8 CE<@4EL 4A7 546>HC .% 9BE <ACHG 4A7BHGCHG� �/8 J<?? 7<F6HFF 4 78F<:A <A J;<6; G;8 CE<@4EL 4A7546>HC .% ;4I8 F8C4E4G8 ABAF;4E87 I<EGH4? 7<F>F <A +86G<BA ����� 'A?L G;8 CE<@4EL .% 47I8EG<F8F <GF CE8F8A68 BAG;8 A8GJBE> FB 4?? A8GJBE> <ACHGF 6B@8 GB G;8 CE<@4EL .%�+<@<?4E?L 4?? BG;8E <ACHGF �FH6; 4F >8L5B4E7 4A7 @BHF8� :BBA?L GB G;8 CE<@4EL .%��?? <ACHG G;4G G;8 CE<@4EL .% E868<I8F <F F8AG GB G;8

546>HC .% I<4 4 A8GJBE> 6BAA86G<BA >ABJA 4F G;8 ),%%'+%!&�++#)� �BE F8EI8E JBE>?B47F G;8 7B@<A4AG <ACHG GE4�6<F A8GJBE> 4A7 7<F>� �77<G<BA4? <A9BE@4G<BA 4F 7<F6HFF8758?BJ <A +86G<BA ��� <F GE4AF@<GG87 4F A868FF4EL GB 8AFHE8G;4G G;8 546>HC .% 8K86HG8F ABA78G8E@<A<FG<6 BC8E4G<BAF<A G;8 F4@8 J4L 4F G;8 CE<@4EL .%� ,;8 E8FH?G <F G;4G G;8546>HC .% 4?J4LF 8K86HG8F <78AG<64??L GB G;8 CE<@4EL .%� BJ8I8E G;8 BHGCHGF B9 G;8 546>HC .% 4E8 7EBCC87 5LG;8 ;LC8EI<FBE FB BA?L G;8 CE<@4EL CEB7H68F 46GH4? BHGCHGFG;4G 4E8 E8GHEA87 GB 6?<8AGF� �F 78F6E<587 <A +86G<BA ��� G;8CE<@4EL 4A7 546>HC .% 9B??BJ 4 FC86<�6 CEBGB6B? <A6?H7<A:8KC?<6<G 46>ABJ?87:@8AGF 5L G;8 546>HC .% <A BE78E GB8AFHE8 G;4G AB 74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F�,B 78G86G <9 4 CE<@4EL BE 546>HC .% ;4F 94<?87 BHE FLF

G8@ HF8F 4 6B@5<A4G<BA B9 ;84EG584G<A: 58GJ88A G;8 E8?8I4AGF8EI8EF 4A7 @BA<GBE<A: B9 G;8 GE4�6 BA G;8 ?B::<A: 6;4AA8?�!A 477<G<BA J8 @HFG 8AFHE8 G;4G BA?L BA8 B9 G;8 CE<@4ELBE 546>HC .% G4>8F BI8E 8K86HG<BA 8I8A <9 G;8E8 <F 4 FC?<G5E4<A F<GH4G<BA J;8E8 G;8 CE<@4EL 4A7 546>HC F8EI8EF ;4I8?BFG 6B@@HA<64G<BA J<G; 846; BG;8E�!A G;8 9B??BJ<A: F86G<BAF J8 CEBI<78 @BE8 78G4<?F BA F8I

8E4? <@CBEG4AG 4E84F� !A +86G<BA ��� J8 :<I8 FB@8 78G4<?FBA G;8 78G8E@<A<FG<6 E8C?4L G86;AB?B:L G;4G 8AFHE8F G;4G CE<@4EL 4A7 546>HC .%F 4E8 >8CG <A FLA6 I<4 G;8 <A9BE@4G<BAF8AG BI8E G;8 ?B::<A: 6;4AA8?� !A +86G<BA ��� J8 78F6E<584 9HA74@8AG4? EH?8 B9 BHE �, CEBGB6B? G;4G 8AFHE8F G;4G AB74G4 <F ?BFG <9 G;8 CE<@4EL 94<?F� !A +86G<BA ��� J8 78F6E<58BHE @8G;B7F 9BE 78G86G<A: 4A7 E8FCBA7<A: GB 4 94<?HE8 <A 46BEE86G 94F;<BA�

2.1 Deterministic Replay Implementation�F J8 ;4I8 @8AG<BA87 E8C?<64G<A: F8EI8E �BE .%� 8K8

6HG<BA 64A 58 @B78?87 4F G;8 E8C?<64G<BA B9 4 78G8E@<A<FG<6 FG4G8 @46;<A8� !9 GJB 78G8E@<A<FG<6 FG4G8 @46;<A8F 4E8FG4EG87 <A G;8 F4@8 <A<G<4? FG4G8 4A7 CEBI<787 G;8 8K46G F4@8<ACHGF <A G;8 F4@8 BE78E G;8A G;8L J<?? :B G;EBH:; G;8 F4@8F8DH8A68F B9 FG4G8F 4A7 CEB7H68 G;8 F4@8 BHGCHGF� � I<EGH4? @46;<A8 ;4F 4 5EB47 F8G B9 <ACHGF <A6?H7<A: <A6B@<A:A8GJBE> C46>8GF 7<F> E847F 4A7 <ACHG 9EB@ G;8 >8L5B4E74A7 @BHF8� &BA78G8E@<A<FG<6 8I8AGF �FH6; 4F I<EGH4? <AG8EEHCGF� 4A7 ABA78G8E@<A<FG<6 BC8E4G<BAF �FH6; 4F E847<A:G;8 6?B6> 6L6?8 6BHAG8E B9 G;8 CEB68FFBE� 4?FB 4�86G G;8 .%�FFG4G8� ,;<F CE8F8AGF G;E88 6;4??8A:8F 9BE E8C?<64G<A: 8K86HG<BA B9 4AL .% EHAA<A: 4AL BC8E4G<A: FLFG8@ 4A7 JBE>?B47���� 6BEE86G?L 64CGHE<A: 4?? G;8 <ACHG 4A7 ABA78G8E@<A<F@A868FF4EL GB 8AFHE8 78G8E@<A<FG<6 8K86HG<BA B9 4 546>HC I<EGH4? @46;<A8 ��� 6BEE86G?L 4CC?L<A: G;8 <ACHGF 4A7 ABA78G8E@<A<F@ GB G;8 546>HC I<EGH4? @46;<A8 4A7 ��� 7B<A:FB <A 4 @4AA8E G;4G 7B8FA�G 78:E478 C8E9BE@4A68� !A 477<G<BA @4AL 6B@C?8K BC8E4G<BAF <A K�� @<6EBCEB68FFBEF ;4I8HA78�A87 ;8A68 ABA78G8E@<A<FG<6 F<78 8�86GF� �4CGHE<A:G;8F8 HA78�A87 F<78 8�86GF 4A7 E8C?4L<A: G;8@ GB CEB7H68G;8 F4@8 FG4G8 CE8F8AGF 4A 477<G<BA4? 6;4??8A:8�.%J4E8 78G8E@<A<FG<6 E8C?4L 1��3 CEBI<78F 8K46G?L G;<F

9HA6G<BA4?<GL 9BE K�� I<EGH4? @46;<A8F BA G;8 .%J4E8 I+C;8E8C?4G9BE@� �8G8E@<A<FG<6 E8C?4L E86BE7F G;8 <ACHGF B9 4 .%4A7 4?? CBFF<5?8 ABA78G8E@<A<F@ 4FFB6<4G87 J<G; G;8 .%8K86HG<BA <A 4 FGE84@ B9 ?B: 8AGE<8F JE<GG8A GB 4 ?B: �?8� ,;8.% 8K86HG<BA @4L 58 8K46G?L E8C?4L87 ?4G8E 5L E847<A: G;8?B: 8AGE<8F 9EB@ G;8 �?8� �BE ABA78G8E@<A<FG<6 BC8E4G<BAFFH�6<8AG <A9BE@4G<BA <F ?B::87 GB 4??BJ G;8 BC8E4G<BA GB 58E8CEB7H687 J<G; G;8 F4@8 FG4G8 6;4A:8 4A7 BHGCHG� �BEABA78G8E@<A<FG<6 8I8AGF FH6; 4F G<@8E BE !' 6B@C?8G<BA <A

31

VMM-based Primary Backup

¨  Primary and backup execute on two virtual machines

¨  Primary logs inputs and outputs ¨  Backup applies inputs from log ¨  Primary waits for backup output

¨  Primary-backup monitor each other ¤ If primary fails, backup takes over

Page 18: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

¨  VMM at primary logs external inputs and causes of non-determinism

¨  Example: Log results of non-deterministic instructions ¤ e.g., timestamp counter read (RDTSC)

¨  Random number generation?

Log-based VM replication 18

Page 19: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

¨  VMM at primary sends log entries to VMM at backup over the logging channel

¨  Backup hypervisor replays log entries ¤  Stops backup VM at next input event or non-deterministic instruction

¤ Delivers same input as primary ¤ Delivers same result to non-deterministic instruction as primary

Log-based VM replication 19

Page 20: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

¨  VM inputs ¤  Incoming network packets ¤ Disk reads ¤ Keyboard and mouse events ¤ Clock timer interrupt events ¤ Results of non-deterministic instructions

¨  VM outputs ¤ Outgoing network packets ¤ Disk writes

¨  Why log outputs?

Virtual Machine I/O 20

Page 21: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Example Scenario

Primary

¨  Write input to log ¨  Apply input ¨  Produce output ¨  Fail!

Backup

¨  Take over as primary ¨  Read from log and

apply input ¨  Timer interrupt fires ¨  Produce output

Backup does not know timing of output relative to timer interrupt

Execution of interrupt handler may affect output

Page 22: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Optimizing Performance

¨  Primary must buffer output until ACK from backup ¤ Why? ¤ Slow from client perspective!

n  If replicas are not in close proximity.

¨  How to optimize performance? ¨  Pipeline sync with backup and execution at primary

22

Page 23: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

VMM-based Replication 23

C

P

B

Page 24: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Replication

¨  Promote one of the backups if primary fails ¨  Replace any failed backup

¨  When should primary sync with backups? ¤ Before making state change externally visible ¤ Primary and backups must be externally consistent

¨  What to sync? ¤ Entire state when bootstrapping new backup ¤ Thereafter, all sources of non-determinism

24

Page 25: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

RSM with Primary Backup 25

Operating System

Application

Operating System

Application

Server1 Server2

Virtual Machine Monitor

Virtual Machine Monitor

Hardware Hardware

Page 26: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Client perspective

¨  What does a worker need to know in order to register itself with replicated master?

¨  Needs to know which machine is primary

26

Page 27: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Replication

Client Primary Backup

Backup

Backup

27

Page 28: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Client perspective

¨  What does a worker need to know in order to register itself with replicated master?

¨  Needs to know which machine is primary

¨  Can primary be hard coded into client code? ¨  No, primary gets replaced when it fails

¨  How does client discover current primary?

28

Page 29: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary Backup Replication

Client Primary Backup

Backup

Backup

View service

29

Page 30: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

View service

¨  Maintains current membership of primary-backup service (called view) ¤ View number, primary, backup

¨  When does view service change view? ¨  When primary or any backup fails ¨  Periodically exchange heartbeat messages to

detect failures

¨  What if view service is down or not reachable?

30

Page 31: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Transitioning between views

¨  How to ensure new primary is up-to-date? ¤ Only promote a previous backup ¤ This is why view service needs to pick backups

¨  How does view service know if a backup is up-to-date?

¨  Two scenarios for ill-timed primary failure: ¤ Primary applies operation but fails before syncing with

backup ¤ Primary fails before new backup is initialized

31

Page 32: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Transitioning between views

¨  View service broadcasts view change to all

¨  Primary must ACK new view once backup is up-to-date

¨  Two implications: ¤ Liveness detection timeout > State transfer time ¤ Cannot change view if primary fails during sync

32

Page 33: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

View service

¨  View change has three steps: ¤ View service announces new view ¤ Primary syncs with new backup if there is one ¤ Primary acknowledges new view

¨  Stuck if primary fails in the midst of this process

33

Page 34: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Scalability of View service

¨  Does every client need to contact view service before any operation?

¨  Clients can cache view across operations

¨  When to invalidate cached view? ¨  Client invalidates cache when no response or

negative response from primary

34

Page 35: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Split Brain

Client

S1

S2

View service (1,S1, _) (2,S1,S2) (3,S2, _)

(1,S1, _) (2,S1,S2)

(2,S1,S2) (3,S2, _)

(2,S1,S2)

35

Page 36: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Avoiding Split Brain

¨  Primary must forward all operations to backups ¤ Goal: Get ACKs from backups that they too recognize

primary

¨  Why can’t backups be mistaken about who is primary? ¤ Only a backup can be promoted as primary

36

Page 37: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

View service

¨  Valid sequence of views: ¤  (1, S1, _) à (2, S1, S2) à (3, S1, S3) à (4, S3, S4) à (5, S4, _)

¨  Examples of invalid transitions between views? ¤  (1, S1, S2) à (2, S3, S4) ¤  (1, S1, S2) à (2, _, S2) ¤  (1, S1, _) à (2, S2, S1)

37

Page 38: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Summary of view service

¨  Monitors primary and backups to detect when to change view ¤ Can change only after primary has ACKed view ¤ Primary ACKs only after syncing with backups

¨  Clients cache view for scalability ¨  To address split brain, primary must check with

backup before serving client

38

Page 39: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

¨  One copy in SF (primary), one in NY (backup)

Replicating Bank Database

“Deposit $100”

“Pay 1% interest”

$1,000 $1,000

39

Page 40: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Primary-Backup Sync 40

C1

C2

P

B

“Deposit $100”

“Pay 1% interest”

$1,000

$1,000 $1,110

$1,111

Page 41: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Ordering of Updates

¨  All updates must be applied in the same order at all replicas

¨  External view: Total ordering of writes

¨  Primary effectively serializes all writes ¤  Order of events made known to replicas

41

Page 42: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Serving Reads

¨  Can backups serve reads? ¤ Assume no split brain

¨  What if primary’s state is ahead of backup? ¤ Updates to primary not yet externally visible ¤  Effect of read equivalent to if primary fails at this point

¨  What if backup’s state is ahead of primary? ¤ Different backups may not be in sync ¤  Primary may get replaced before it applies update

42

Page 43: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Reads: Primary vs. Backup 43

C1

P

C2

B1

“Deposit $100”

$1100 $1000

B2

Page 44: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Desired Properties

¨  All writes are totally ordered

¨  Once read returns particular value, all later reads should return that value or value of later write

¨  Once a write completes, all later reads should return value of that write or value of later write

44

Page 45: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Reads relative to Writes 45

C1

P

C2

B

“Deposit $100”

“Pay 1% interest”

$1100 $1111

Page 46: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Consistency Spectrum

¨  Consistency: What are the properties of externally visible effects?

46

Linearizability Sequential Causal Eventual

Consistency

Read-after- write

Ease of programming

Page 47: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Linearizability

¨  Total ordering of writes ¨  Read returns last completed write

¨  Single copy semantics ¤ Externally visible effects of writes and reads are

equivalent to if there existed a single copy

¨  Users oblivious to replication

47

Page 48: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Why weaken consistency?

¨  Shouldn’t we always strive for single copy semantics? ¤ Comes at the expense of lower performance

¨  Latency vs. consistency tradeoff

48

Page 49: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Consistency Spectrum 49

Linearizability Sequential Causal Eventual

Consistency

Read-after- write

Ease of programming

Latency

Page 50: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Sequential Consistency

¨  Results of any execution same as that when the operations by all processes on the data store were executed in some sequential order.

¨  And … the operations of each individual process appear in this sequence in the order specified by its program.

Page 51: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Example

¨  (a) is sequentially consistent. All processes see the same interleaving of the operations (x takes value b before a)

¨  (b) is not sequentially consistent. P3 sees x taking on value b before a while P4 sees the opposite.

Page 52: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Causal Consistency

¨  Order of causally related writes must be preserved in values returned to reads ¤  If W1 à W2, then if a read sees effect of W2, it must

see effect of W1

¨  Example: Facebook News Feed ¤ Okay to not see all completed posts ¤ But, if you see a comment, you must see the post on

which the comment is made ¨  Main utility: Lazy sync between replicas

52

Page 53: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Example

¨  The operations above do not result in sequential consistency (see reads of P3 and P4)

¨  But there is causal consistency. This is because the writes W1(x)c and W2(x)b are concurrent (there are no dependencies and thus no causal relationship).

¨  But note that W1(x)a and W2(x)b are causally related since R2(x)a may be the reason for W2(x)b.

Page 54: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Example

¨  (a) is not causally consistent. R3(x)b cannot follow R3(x)a since the the write of b to x is dependent on P2 reading x’s value to be a.

¨  (b) however is causally consistent since W1(x) a and W2(x) b are concurrent.

Page 55: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Linearizability example

¨  Every operation should appear to take effect instantaneously at some moment before its start and completion (sometime in the shaded area)

¨  Thus, the result is consistent regardless of ordering of operations

Page 56: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Linearizability with Locks

Client Replica 2

Replica 1

Replica 3

Lock service

56

Problems?

Client failures!

Page 57: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Lease

¨  Lock with timeout ¨  If lease holder fails, not a problem because lease

will expire

¨  How to pick lease timeout value? ¤ Short timeout à Client needs to renew lease ¤ Long timeout à Unnecessarily block operations

57

Page 58: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Discrepancy in Lease Validity

Client Replica 2

Replica 1

Replica 3

Lease service

58

Scenario in which lease server and client

differ about lease validity?

Page 59: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Discrepancy in Lease Validity

¨  Message that grants lease may have high delay

¨  Clock at lease holder and lease service may have different skew

¨  How to account for potential discrepancy?

59

Page 60: LECTURE 6 - cs.ucr.edukrish/CS253/Lecture6-CS253.pdfReplication and why is it hard ¨ Data needs to be replicated for reliability or to improve performance ¤ If a replica crashes,

Discrepancy in Lease Validity

Client Replica 2

Replica 1

Replica 3

Lease service

60

Replica must check with lease service to confirm

lease validity