
On the Integration of Real-Time and Fault-Tolerance in P2P Middleware


DESCRIPTION

PhD defense presentation. Jury: Paulo Veríssimo (FCUL), Rui Oliveira (UMinho), Priya Narasimhan (CMU), Luís Lopes (FCUP), Fernando Silva (FCUP), António Porto (FCUP)



Faculty of Sciences, University of Porto

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Rolando Martins

Scientific Advisors:

Luís Lopes, Faculty of Science - University of Porto

Fernando Silva, Faculty of Science - University of Porto

Rolando Martins On the Integration of RT & FT in P2P May 7, 2012 1


Target Systems

- EFACEC’s Oporto light-train deployment
  - 5 lines, 70 stations, trains multiplexed over 5 lines
  - 70+ computational nodes (peers), 200+ sensors, arbitrary topology
  - Traffic comprised of normal operations, critical events, alarms
  - Tight timing, e.g., 2s for end-to-end response time
- Deployments across cities/regions can be overwhelmingly large
- What is needed to support such systems?
  - Peer-to-peer (P2P) infrastructure that mirrors physical deployment
  - Combined real-time and fault-tolerance guarantees
  - Hierarchical abstraction (cells) to scale to large deployments


In Search of a Solution

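[Figure: Venn diagram of the FT, RT and P2P design spaces and their intersections (FT+P2P, RT+P2P, RT+FT, RT+FT+P2P). Systems placed in the diagram: video streaming, distributed storage, Pastry, CORBA RT FT, DDS, CORBA FT, Stheno.]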


Research Challenges and Opportunities

- Challenges
  - FT mechanisms consume additional resources
  - FT mechanisms add overhead (e.g., additional latency)
  - Different traffic types have different soft-RT requirements
  - Different traffic types may require different FT configurations
  - RT requirements must continue to be met even under faults
- Opportunities
  - P2P infrastructures have network-aware resilience
  - COTS operating systems have priority-based scheduling, multi-threading and resource-reservation mechanisms
  - Proven FT configuration options exist (replication styles)


Research Question

Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems?


Scope

- Non-Goals
  - Handling value faults and Byzantine faults
  - Formal specification and verification of the system
  - Support for hard real-time
  - Fully optimized implementation
  - Testing in production (not yet)
- Assumptions
  - Fault model: crash of a peer, message loss
  - Resource-reservation mechanisms are always available


Stheno: System Architecture


Stheno: Operating-System Interface

- Problem: Control and monitor resource usage from userspace
- Solution:
  - Leverage threads, priorities, /proc
  - Resource reservation
  - CPU partitioning
- Example:
  - Highly critical surveillance feed has reserved amount of CPU for processing
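
As an illustration of monitoring resource usage from userspace through /proc, the sketch below estimates overall CPU load from two samples of the aggregate `cpu` line of `/proc/stat` on Linux. It is not taken from the Stheno code base: the function names are hypothetical, and counting `iowait` as idle time is an assumption of this sketch.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(value) for value in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    # Assumption of this sketch: iowait is counted as idle time.
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    # guest and guest_nice are already accounted for in user and nice,
    # so only the first eight fields are summed.
    total = sum(ticks[:8])
    return idle, total


def cpu_load(before: str, after: str) -> float:
    """Fraction of non-idle CPU time between two samples of the 'cpu' line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 1.0 - (idle1 - idle0) / elapsed


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

A monitor would call `read_cpu_line()` twice, some interval apart, and pass both samples to `cpu_load`.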


Stheno: Support Framework

- Problem: Tasks have different RT requirements
- Solution:
  - Leverage threading policies
  - QoS Daemon
- Example:
  - Thread-per-Connection used for critical events in our target system to achieve low latency
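
The Thread-per-Connection policy named above can be sketched as follows: each accepted connection gets a dedicated thread, so a request on one connection never queues behind work on another. This is a minimal illustration with an echo handler, not the Stheno implementation; the function names are hypothetical, and assigning real-time priorities to the threads is omitted.

```python
import socket
import threading


def handle_connection(conn: socket.socket) -> None:
    """Serve one connection on its own thread: echo each chunk back."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection
                break
            conn.sendall(data)


def serve_thread_per_connection(listener: socket.socket, max_connections: int) -> None:
    """Accept up to max_connections clients, dedicating one thread to each."""
    workers = []
    for _ in range(max_connections):
        conn, _addr = listener.accept()
        worker = threading.Thread(target=handle_connection, args=(conn,), daemon=True)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
```

The trade-off is one thread per open connection, which favours latency over resource usage and therefore suits a small number of critical streams.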


Stheno: P2P Overlay and FT Configuration

- Problem: Tailor choice of P2P overlay and FT configuration to application needs
- Solution:
  - High-level API to support alternative overlays, e.g., P3, Pastry
  - Leverage proven replication styles, e.g., active, semi-active, passive
  - Configure replication properties, e.g., number and placement of replicas
  - Support service discovery
- Example:
  - P3 mirrors regional hierarchy of target system
  - Active replication for critical tasks needing instantaneous fail-over
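
To make the configurable replication properties concrete, the sketch below models a replication style plus a replica count and picks replica hosts from a list of candidate peers. All names (`ReplicationStyle`, `FTConfig`, `place_replicas`) are hypothetical, and the placement rule (first candidates other than the primary) is an assumption made for illustration; the slides do not describe the actual placement policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class ReplicationStyle(Enum):
    ACTIVE = "active"
    SEMI_ACTIVE = "semi-active"
    PASSIVE = "passive"


@dataclass(frozen=True)
class FTConfig:
    """Hypothetical fault-tolerance configuration for one service."""
    style: ReplicationStyle
    replicas: int

    def __post_init__(self) -> None:
        if self.replicas < 0:
            raise ValueError("replicas must be non-negative")


def place_replicas(config: FTConfig, primary: str, candidates: Sequence[str]) -> List[str]:
    """Pick replica hosts distinct from the primary, in candidate order."""
    hosts = [peer for peer in candidates if peer != primary]
    if len(hosts) < config.replicas:
        raise ValueError("not enough peers for the requested number of replicas")
    return hosts[:config.replicas]
```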


Stheno: Core

- Problem: Manage services with different RT and FT requirements
- Solution:
  - QoS daemon proxy
  - Service repository
  - Creator and coordinator of service instances and clients
  - Delegation of service discovery to the P2P layer
- Example:
  - Service repository could include RPC, streaming service, etc.
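
One way to picture a service repository that also creates service instances is a registry of factories keyed by service name. The sketch below is an assumption-laden illustration, not the Stheno API: `ServiceRepository`, `StreamingService` and their parameters are hypothetical names.

```python
from typing import Any, Callable, Dict


class ServiceRepository:
    """Maps service names to factories that build configured instances."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, factory: Callable[..., Any]) -> None:
        if name in self._factories:
            raise KeyError(f"service already registered: {name}")
        self._factories[name] = factory

    def create(self, name: str, **params: Any) -> Any:
        if name not in self._factories:
            raise KeyError(f"unknown service: {name}")
        return self._factories[name](**params)


class StreamingService:
    """Hypothetical streaming service; it only stores its configuration."""

    def __init__(self, source: str, frame_rate: int) -> None:
        self.source = source
        self.frame_rate = frame_rate
```

For example, after `repo.register("streaming", StreamingService)`, a call such as `repo.create("streaming", source="station-12", frame_rate=25)` returns a configured instance.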


Stheno: Application and Services

- Problem: Expose system functionalities and configuration options to the user
- Solution:
  - High-level APIs for querying and configuring different layers
- Example:
  - Create a video streaming service from light-train station and set the frame rate and replication style


Stheno: Interaction between Layers


Proof-of-Concept Prototype

- First prototype implementation in Java had more than 50k SLOC
- Current (unoptimized) prototype implementation in C/C++ with more than 60k SLOC
- P3 overlay plugin implementation
- CPU resource reservation
- Thread priorities: three classes corresponding to low, medium and high criticality
- Threading policies: Thread-per-Connection, Thread-per-Request, Leader-Followers
- Semi-active replication style


Empirical Evaluation

- Goals: To quantify
  - Overhead of fault-tolerance mechanisms with/without faults
  - Impact of background workload and faults on end-to-end latency
- Metrics:
  - End-to-end latency, jitter, recovery time
- Experimental setup:
  - 20 nodes, each quad-core AMD Phenom with 4GB RAM
  - 100 Mbit/s switch
- Experimental procedure:
  - Used a P3-based overlay, semi-active replication
  - Run of 1000 invocations
  - Fault-injection mid-way through each run
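
The metrics listed above could be computed from the per-invocation measurements of one run roughly as follows. This is a sketch under stated assumptions: the slides do not define jitter or recovery time precisely, so jitter is taken here as the population standard deviation of latency, and recovery time as the delay between fault injection and the first invocation that completes afterwards. The function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize_latencies(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Mean, maximum and jitter of the end-to-end latencies of one run."""
    if not latencies_ms:
        raise ValueError("no measurements")
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "max_ms": max(latencies_ms),
        # Assumption: jitter = population standard deviation of latency.
        "jitter_ms": statistics.pstdev(latencies_ms),
    }


def recovery_time_ms(fault_at_ms: float, completions_ms: Sequence[float]) -> float:
    """Delay between fault injection and the first completion after it."""
    later = [t for t in completions_ms if t > fault_at_ms]
    if not later:
        raise ValueError("no invocation completed after the fault")
    return min(later) - fault_at_ms
```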


End-to-End Latency Results

- 4 replicas, without resource reservation: max time of 1s/invocation
- 4 replicas, with resource reservation: max time of 1ms/invocation

[Figure: latency (ms, log scale) versus load (%), with curves for No FT, 1 replica, 2 replicas and 4 replicas. (a) Without resource reservation. (b) With resource reservation.]

- Stheno’s RT+FT support meets and exceeds target system requirements (2s end-to-end response time, even under a fault)


Fail-over Latency Results

- Without resource reservation: max fail-over time of 3s
- With resource reservation: max fail-over time of 30ms

[Figure: fail-over latency (ms, log scale) versus load (%), with curves for 1 replica, 2 replicas and 4 replicas. (a) Without resource reservation. (b) With resource reservation.]

- Stheno’s RT+FT provides low fail-over latency that meets target system requirements


Thesis Contributions

- Stheno, an RT+FT+P2P middleware
  - Motivated by the timing, reliability and physical deployment characteristics of our target systems
- To the best of our knowledge, Stheno is the first system that
  - Supports traffic types with different soft-RT requirements
  - Supports different FT configurations
  - Supports configurability at multiple levels: P2P, RT and FT
  - Continues to meet RT requirements even under faults
- Implementation of a proof-of-concept prototype
- Empirical evaluation demonstrates that
  - Stheno meets and exceeds target system requirements for end-to-end latency and fail-over latency


Thank You

Stheno, in Greek mythology, was the eldest of the three Gorgons. She was known to be the most independent and ferocious, having killed more men than both of her sisters combined. (Source: Wikipedia)

In many ways, Stheno represents the complexity of the problem that we set out to solve.


Publications

- Rolando Martins, Luís Lopes, and Fernando Silva. Lightweight Fault-Tolerance for Peer-to-Peer Middleware (full version). Technical Report DCC-2011-01, Department of Computer Science, Faculty of Sciences, University of Porto, 2011.
- Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. Lightweight Fault-Tolerance for Peer-to-Peer Middleware. In Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems (SRDS’10), pages 313-317, November 2010.
- Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. On the Impact of Fault-Tolerance Mechanisms in a Peer-to-Peer Middleware. Technical Report DCC-2010-02, Department of Computer Science, Faculty of Sciences, University of Porto, 2010.
- Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-to-Peer Middleware Platform for QoS and Soft Real-Time Computing. Technical Report DCC-2008-02, Department of Computer Science, Faculty of Sciences, University of Porto, 2008.
- Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-To-Peer Middleware Platform for Fault-Tolerant, QoS, Real-Time Computing. In Proceedings of the 2nd Workshop on Middleware-Application Interaction, part of DisCoTec 2008, pages 1-6, New York, NY, USA, June 2008. ACM.


Replication Groups Over Group Communications

(a) Semi-active (b) Passive


Resource Reservation Daemon


Multicore: Examples of CPU Partitioning.

(a) Quad-core partitioning. (b) Six-core partitioning.

(c) Eight-core partitioning.

- Core OS: Threads belonging to the operating system
- BE: Threads served by the SCHED_OTHER scheduling policy
- RT: Threads served by the SCHED_{FIFO,RR} scheduling policies
- Isolated RT: RT threads that are isolated from all other threads present in the system
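
The four partition classes above can be pictured as disjoint sets of core identifiers. The sketch below computes one such split for a given core count. The split rule (one core for the operating system, one for best-effort threads, the rest divided between RT and isolated RT) is an assumption chosen for illustration and is not the layout shown in the original figures; `partition_cores` is a hypothetical name.

```python
from typing import Dict, List


def partition_cores(n_cores: int) -> Dict[str, List[int]]:
    """Split core ids 0..n_cores-1 into the four partition classes."""
    if n_cores < 4:
        raise ValueError("this sketch needs at least one core per partition")
    remaining = list(range(2, n_cores))
    half = len(remaining) // 2
    return {
        "core_os": [0],                   # operating-system threads
        "be": [1],                        # best-effort threads (SCHED_OTHER)
        "rt": remaining[:half],           # SCHED_FIFO / SCHED_RR threads
        "isolated_rt": remaining[half:],  # RT threads isolated from all others
    }
```

On Linux, a plan like this could be applied by pinning threads with `os.sched_setaffinity`; requesting SCHED_FIFO or SCHED_RR through `os.sched_setscheduler` typically requires elevated privileges.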


RT Support: Object-to-Object interactions.

(a) Direct calling with different partitions.

(b) Direct calling within the same partition.

(c) Deferred calling with different partitions.


Threading Strategies


Minimizing Priority Inversion Through Traffic Demultiplexing



Putting It All Together


Putting It All Together (Continuation)


Execution Context/Execution Model (ECEM) Design Pattern


Comparison with other Middlewares (RPC)

[Figure: latency (µs, log scale) versus load (%), with curves for Stheno (No QoS), Stheno (QoS), ICE, TAO and RMI.]

- Our approach enables us to provide a latency of 200 µs even in the presence of a 95% CPU workload


Related Work

1. Decentralized scalability:
  - Licínio Oliveira, Luís Lopes, and Fernando Silva. P3: Parallel Peer to Peer - An Internet Parallel Programming Environment. In Workshop on Web Engineering & Peer-to-Peer Computing, part of Networking 2002, volume 2376 of Lecture Notes in Computer Science, pages 274-288. Springer-Verlag, May 2002.
  - A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In Proceedings of the 2nd ACM/IFIP/USENIX International Middleware Conference (Middleware’01), pages 329-350, November 2001.
2. Modular FT:
  - Tudor Dumitraş, Deepti Srivastava, and Priya Narasimhan. Architecting and Implementing Versatile Dependability. In Rogério de Lemos, Cristina Gacek, and Alexander Romanovsky, editors, Architecting Dependable Systems III, volume 3549 of Lecture Notes in Computer Science, pages 212-231. Springer Berlin / Heidelberg, 2005.
  - P. Bond, P. Barrett, A. Hilborne, Luís Rodrigues, D. Seaton, N. Speirs, and Paulo Veríssimo. The Delta-4 Extra Performance Architecture (XPA). 20th International Symposium on Fault-Tolerant Computing, pages 481-488, 1990.
3. Resource reservation + CPU partitioning:
  - Chen Lee, R. Rajkumar, and Cliff Mercer. Experiences with Processor Reservation and Dynamic QoS in Real-Time Mach. In Proceedings of Multimedia Japan, March 1996.
  - Luigi Palopoli, Tommaso Cucinotta, Luca Marzario, and Giuseppe Lipari. AQuoSA - Adaptive Quality of Service Architecture. Software: Practice and Experience, 39(1):1-31, April 2009.
4. Real-time support:
  - Priya Narasimhan, Tudor Dumitraş, Aaron Paulos, Soila Pertet, Carlos Reverte, Joseph Slember, and Deepti Srivastava. MEAD: Support for Real-Time Fault-Tolerant CORBA: Research Articles. Concurrency and Computation: Practice & Experience, 17(12):1527-1545, October 2005.
  - Douglas Schmidt, David Levine, and Sumedh Mungee. The Design of the TAO Real-Time Object Request Broker. Computer Communications, 21(4):294-324, 1998.
