[IEEE 2006 Innovations in Information Technology - Dubai, United Arab Emirates (2006.11.19-2006.11.21)] 2006 Innovations in Information Technology - A Novel Approach for Fault Tolerance

1-4244-0674-9/06/$20.00 ©2006 IEEE.

A Novel Approach for Fault Tolerance in MPLS Networks

Sahel Alouneh, Anjali Agarwal, Abdeslam En-Nouaary Department of Electrical and Computer Engineering, Concordia University, Canada

E-mails: {s_aloune, aagarwal, [email protected]}

Abstract

There has been current demand on Internet

Service Providers (ISPs) to provide Quality of Service (QoS) guarantees. Fault Tolerance is an important factor that needs to be considered to maintain the network survivability. It is the property of a system that continues to operate the network properly in the event of failure of some of its parts. There is no solution until now that can provide a recovery with no packet loss and delay at the same time. In addition, network resources such as bandwidth have to be significantly utilized. In this paper, we propose a novel approach for fault tolerance in MPLS networks that can easily handle single or multiple path failures. Our approach uses the (k, n) Threshold Sharing Algorithm with multi-path routing. Our approach guarantees to continue the network operation with no packet loss and recovery delay, and with reasonable network resource utilization. 1. Introduction

MPLS (Multi-Protocol Label Switching) [1] is an evolving technology that facilitates several problems in the Internet, such as routing performance, speed, and traffic engineering. MPLS provides mechanisms in IP backbones for explicit routing using Label Switched Paths (LSPs), encapsulating the IP packet in an MPLS packet. MPLS combines a label-swapping algorithm, similar to that used in ATM, with network layer routing. A label is a short, fixed-length identifier that is used to forward packets. In MPLS, the FEC (Forward Equivalence Class) assignment will only be done just once at the ingress router. The FEC to which the packet is assigned is encoded into a label. The packets are "labeled" before they are forwarded to their next hop. In this basic procedure (the single path LSP) all packets which belong to a particular FEC and which travel from a particular node will follow the

same path to the destination. MPLS core routers are called Label Switch Routers (LSRs). The Label Distribution Protocol (LDP) [2] is used to create labels and bind them to LSPs.

Fault Tolerance is the property of a system to continue operating properly in the event of failure of some of its parts. The recovery time that can be tolerated for a highly reliable service is in the order of seconds down to 10’s milliseconds. The current routing algorithms used in IP networks are robust and survivable. However, the amount of time that is taken to recover a failure can be significant which is in the order of several seconds to minutes, which causes unaccepted disruption to critical time communication [4]. MPLS network is very vulnerable to failures because of its connection oriented architecture. Path restoration or recovery mechanisms are used to reroute the traffic around a node/link failure in a LSP. 2. Recovery models

Most of the recovery approaches that have been proposed for MPLS network recovery belong to two protection models: fast restoration or pre-established protection and rerouting or dynamic protection. In fast restoration model the backup LSP is established and configured in advance, therefore bandwidth has to be reserved. In dynamic protection model the backup LSP is established after a failure occurs, and correspondingly bandwidth reservation is not applied until the failure occurs. The dynamic protection model may not be suitable for time sensitive applications because of its large recovery time [5] [10]. For this reason, in this paper we consider the fast rerouting protection recovery schemes because our main concern is to overcome the recovery delay and packet loss.

Huang et al. [10] propose a PSL (Path Switched LSR) oriented path protection mechanism that consists of three components: reverse notification tree, a hello protocol, and a lightweight notification transport protocol. This approach targets to minimize the delay

experienced by an FIS (Fault Indication Signal) message by building a fast and efficient notification tree structure and uses a lightweight transport mechanism for the notification message. Generally speaking, this approach uses 1:1 path protection (extendible to m: n protection) where the resources (bandwidth, buffers, processing) on the recovery path are reserved for specific application or shared with others. In [9], Haskin et al. introduced a simple method for setting an alternative LSP with the objective to provide a single failure protection for fast restoration. The traffic stream flowing through a working path from the ingress router towards the egress router is protected by an alternative path. In Figure 1, if LSR4 fails, the traffic in working path is rerouted along the backup path, LSR 3-2-1-5-6-7-8-9. The backup path is comprised of two segments. The first segment is established between PML (Protection Merging LSR) and the PIL (Protection Ingress LSR) in the reverse direction of the working path. The second segment is built between PIL and PML along an LSP that does not utilize any working path. Haskin Scheme has a lower packet loss rate compared to Huang’s scheme [11].

Figure 1. Path restoration example

The paper by Buddhikot et al. [7] addresses the problem of distributed routing of restoration paths, which can be defined as follows: given a request for a bandwidth guaranteed LSP between two nodes, find a primary LSP, and a set of backup LSPs that protect the links along the primary LSP. A routing algorithm that computes these paths must optimize the restoration latency and the amount of bandwidth used. The authors introduce the concept of “backtracking” to bind the restoration latency. In other words, it provides algorithms that offer a way to tradeoff bandwidth to meet a range of restoration latency requirements.

Seok et al. proposed in [14] a fault tolerant multi-path traffic engineering scheme for MPLS networks with the objective to effectively control the network resource utilization. The proposed scheme consists of

the maximally disjoint multi-path configuration and the traffic rerouting mechanism for fault recovery. The authors use the linear programming formulation to configure the maximally disjoint multi-path and the traffic rerouting solution. So when the statistical traffic demand is known between a source LSR and a destination LSR, then the traffic engineering can be applied with the following objectives: set all LSPs configuration in order to find maximally disjoint paths for each node pair, subject to minimization of the maximum of link utilization. When some link failures are detected, the proposed mechanism routes the traffic flowing on the failed LSPs into available LSPs.

The 1+1 “one plus one” protection discussed in [3, 4, and 10] can provide path recovery without packet loss or delay. However, the resources are dedicated for the recovery of the working traffic, and resources may not be used for anything else. The resources (bandwidth, buffers, and processing capacity) on the recovery path are fully reserved, and carry the same traffic as the working path. Selection between the traffic on the working and recovery paths is made at the path merge LSR (PML). More discussions on MPLS recovery approaches can be found in [12, 13, and15]. 3. Motivation

Recovery delay time and packet loss are main concern in the event of MPLS network failure for applications that are sensitive to delay and data loss. Until now there is no approach that comes up with a solution for an MPLS fault tolerance protection with no packet loss and delay with reasonable network resource utilization. In this paper we introduce a novel approach that guarantees to continue the MPLS network operation with no packet loss or delay and with reasonable network resource utilization in the event of single or multiple LSPs failure. The following section introduces our approach that is based on the Threshold Sharing Scheme [6] combined with multi-path routing. 4. Proposed protection scheme

In this paper, we propose to use the Threshold Sharing scheme (TSS). The idea behind the TSS is to divide a message into n pieces, called shadows or shares, such that any k of them can be used to reconstruct the original message. Using any number of shares less than k will not help to reconstruct the original message. Adi Shamir polynomial approach [6] will be used to show this concept. The Shamir’s (k,

n) threshold scheme is based on the Lagrange interpolation for polynomials. The polynomial function used is shown as follows: Let p be a prime number. A polynomial in the intermediate x over the finite field Zp is an expression of the form: f(x) = (ak-1xk-1 + … + a2x2 + a1x + a0) mod p (1) Using Lagrange linear interpolation the polynomial function can be represented as follows:

f(x)=∑=

k

jijy

1∏

≠≤≤ −−

jsks isij

is

xxxx

,1

(2)

where the coefficients a0 , … , ak-1 are unknown elements of Zp, and a0 = M is the original message. Since yij = f(xij), 1 ≤ j ≤ k, 1 ≤ i ≤ n, a subset of participants can obtain k linear equations in the k unknowns a0 , … , ak-1, where all arithmetic is done in Zp. If the equations are linearly independent, there will be a unique solution, and a0 will be revealed.

Example: let M be 11, then in a (3, 4) threshold sharing scheme, any three out of four shares can reconstruct M. First, a quadratic equation should be generated, where the coefficients a1 and a2 are chosen randomly (Note that random selection assumption is going to be modified in our model), and the p value should be greater than any coefficient value, in this case 13:

f(x) = (7x2 + 8x + 11) mod 13 The following shares are calculated for different values of x:

f(1) ≡ 0 (mod 13) f(3) ≡ 7 (mod 13) f(2) ≡ 3 (mod 13) f(5) ≡ 5 (mod 13)

For example if f(2), f(3), and f(5) are chosen, the following three equations are solved using Langrage Interpolation:

(a0 + a1 (2) + a2 (2)2) ≡ 3 (mod 13) (a0 + a1 (3) + a2 (3)2) ≡ 7 (mod 13) (a0 + a1 (5) + a2 (5)2) ≡ 5 (mod 13)

The solution to the equations is a2 = 7, a1 = 8, and a0 = 11 = M, and hence, the original M is reconstructed. 4.1. Integration of TSS with MPLS

The Threshold Sharing Scheme (TSS) discussed above is modified to integrate with MPLS networks. Figure 2 shows an architectural view that describes how the distribution and reconstruction processes are applied to an MPLS network.

When an IP packet enters an MPLS ingress router, a distributor process is used to generate the n share

messages/packets that will become the payloads for MPLS packets. These n MPLS packets are allocated over n maximally disjoint LSPs obtained using multi-path routing [8]. The term “maximally disjoint” LSPs is used since it might not be applicable in some situations to have an absolute link/node disjoint LSPs between two edge routers. The IP packet is divided into L bit blocks S0, S1, ..., Sm. The n shares are calculated using the n different xi values as agreed between a sender and a receiver. Values for n and k are also needed to be configured between the sender and receiver. A (3, 4) Threshold Sharing Scheme example applied on an IP packet is shown by Figure 3.

Figure 2. Distributor and reconstruction processes

Figure 3. An example of applying modified threshold sharing scheme into an IP packet

It is important to mention here that we consider all coefficients of the polynomial function to be part of packet division. Unlike the original definition used in the Threshold Sharing Scheme [9], a1 and a2 values are provided from each block (not by assigning random values). The number of coefficients of polynomial function shown in Figure 3 is three coefficients, and hence, each L bytes block is divided into three equal parts. These parts will be mapped to the polynomial’s three coefficients a0, a1, and a2.

The distributor and reconstruction processes are only applied to MPLS edge routers (ingress and egress routers) as they are only responsible for analyzing and processing the packets entering to the network. Therefore, there is no additional processing applied by our approach on MPLS core LSRs. There might be variable timing in receiving shares from different LSPs at the egress router, therefore the reconstruction process should set a time limit for received MPLS packet shares. The ordering of MPLS packets received at egress router is not important for reconstructing the original IP packet since both ingress and egress router will be using the same polynomial function, hence, the order of coefficients a0, a1, ..., ak-1 are already preserved by the polynomial function. 4.2. Bandwidth analysis

Assuming a (3, 4) Threshold Sharing scheme used and IP packet rate of 3 Kbytes size entering the MPLS network. Referring to Figure 3, the distributor process will create four shares/MPLS packets, each share size in this case is ≈ 1 Kbytes (i.e., 1024 bytes + 4 bytes for each MPLS packet header). The total size of four MPLS packet shares created is therefore equal to ≈ 4 Kbytes. In other words, the extra bandwidth needed for one extra share in this case is equal to 1/3 of the total original bandwidth. Generally, providing protection to LSP failure requires extra bandwidth in our approach as in any other protection mechanism. Equation 3 below shows the extra bandwidth required (Br in bytes) for any Threshold Sharing protection level:

Br = k

kn − (3)

Where P = Original IP packet size (in bytes). 4.3. Packet loss and delay considerations

All existing recovery approaches aim to minimize the recovery time when switching and rerouting the traffic to a backup LSP. Unlike other approaches, the recovery delay time and packet loss has nothing to do

with our approach due to the way the threshold sharing scheme works. In other words, for example if we use (3, 4) threshold sharing level, then the egress router’s reconstruction process should receive at least three out of four shares from any three LSPs. Therefore, if one failure occurs in one of the LSPs, this failure will not affect the reconstruction process at the egress router, and consequently no delay or packet loss occurs to reconstruct the original message. The concept of sending a fault notification message to the router that is responsible for rerouting the traffic to another backup path is not required in our approach. The restoration time that includes the failure detection time, notification time, recovery operation time, and traffic restoration time [3] is all absent in our approach.

The 1+1 path protection scheme discussed in [3] [4] [10] may not produce delay or packet loss. However our approach performs better in terms of bandwidth because that scheme requires 100% bandwidth reservation for every backup LSP. On the other hand, our scheme can easily support multiple LSP failures with reasonable additional bandwidth requirement. 4.4. MPLS requirements for multi-path routing

To provide multi-path routing, MPLS architecture at the ingress and egress sides should consider multiple FEC-Label binding. The FEC specification and label binding at both edge routers in our model deal with multiple labels for different LSPs having the same FEC. Packets having the same FEC follow a set of different LSP paths based on the protection level applied, and the available number of paths.

The LDP operations in MPLS should be modified to provide multi-path label distribution between LSRs. The basic operations define a “peer mapping” between two LSRs having the same FEC/label mapping. So some modifications to the Session Initialization State Transition in [2] have to be considered in order to provide the multi-path option which is out of the scope of this paper. The egress router should be able to map different labels for the same FEC and explicitly bind every label with upstream routers based on the number of LSPs required. Figure 4 shows the egress router receiving three labels L9, L10, L11 coming from three disjoint LSPs, all of them having the same FEC value “Y”. The core routers (LSRs) in every path will apply normal (or single) label binding. The ingress router also uses different label bindings for the same FEC “Y” that identifies the different

LSPs towards the egress router. An index of value x is assigned to each of these label bindings. The egress router correspondingly applies the same x index values to the shares received to reconstruct the original IP packet.

Figure 4. Multi-path label information base for

three disjoint LSPs

The Threshold Sharing Scheme used in our approach can support single LSP failure with at least (2, 3) threshold level. In other words, three LSPs are enough to support single LSP failure. The level of the (k, n) Threshold Sharing scheme can therefore be adaptive to the MPLS network topology. 5. Conclusion and future work

In this paper, we present a novel mechanism to provide fault tolerance in MPLS networks. A multi-path approach was used combined with Threshold Sharing Scheme to protect the network in case of LSPs failures. Our scheme guarantees the MPLS network to continue its operation without producing packet loss or delay, and with reasonable network resource utilization. Our proposed approach has not been yet implemented or simulated. The processing time for distribution and reconstruction processes in edge routers has to be examined. In addition, the Label Distribution Protocol (LDP) should be extended to support multi-path routing. 6. References [1] E. Rosen, A. Viswanathan, and R. Callon, “LDP Specification”, IETF, RFC 3036, 2001.

[2] L. Andersson, P. Doolan, N. Feldman, A. Fredette, and B. Thomas, “LDP Specification”, IETF, RFC 3036, 2001. [3] A. Autenrieh, “Recovery Time Analysis of Differentiated Resilience in MPLS”, Proceeding of DRCN, IEEE, Alberta, Canada, 2003, pp. 333- 340. [4] V. Sharma, and F. Hellstrand, “Framework for Multiprotocol Label Switching (MPLS)-based Recovery”, IETF, RFC 3469, Feb 2003. [5] L. Hundessa, J. D. Pascual, “Optimal and Guaranteed Alternative LSP for Multiple Failures”, Proceeding ICCCN 2004, Chicago, USA, Oct 2004, pp. 59- 64. [6] A. Shamir, “How to share a secret”, Communications of the ACM, vol.24, no.11, Nov 1979, pp. 612 – 613. [7] L. Buddhikot, M. Chekuri, “Routing bandwidth guaranteed paths with local restoration in label switched networks”, IEEE Journal on Selected Areas in Communications, Feb, 2005. pp. 437 – 449. [8] D. Sidhu, and R. Nair, “Finding Disjoint Paths in Networks”, Proceeding ACM SIGCOMM’91Symposium, Switzerland, 1991, pp. 43-51. [9] D. Haskin, “A method for setting an Alternative Label Switched Paths to Handle Fast Reroute”, Internet Draft, July 2001. [10] C. Huang, and V. Sharma, “Building Reliable MPLS Networks Using a Path Protection Mechanism”, IEEE Communication Magazine, March 2002, pp. 156 – 162. [11] G. Ahn, and W. Chun, “Simulater for MPLS path restoration and performance evaluation”, Joint 4th IEEE International Conference on High Speed Intelligent Internet Symposium, Korea, April 2001, pp.32-36. [12] D. Wang, and F. Ergun, “Path protection with pre-identification for MPLS networks”, proceedings of The Second International Conference on Quality of Service in Heterogeneous Wired/Wireless Networks, USA, 2005, pp. 8-16. [13] L. Ruan, and Z Liu, “Upstream Node Initiated Fast Restoration in MPLS Networks”, Proceeding ICC200/IEEE International Conference on Communications, Seol, 2005, pp. 959- 964. [14] Y. Seok, Y. Lee, and N. Choi, “Fault-tolerant Multipath Traffic Engineering for MPLS Networks”, IASTED International Conference on Communications, Internet, and Information Technology, USA, Nov 2003, pp. 91-101. [15] A. Virk, and R. Boutaba, “Economical protection in MPLS networks”, Computer Communications, Vol. 29, Issue 3, February 2006, pp. 402-408.

Documents

[IEEE 2006 Innovations in Information Technology - Dubai, United Arab Emirates (2006.11.19-2006.11.21)] 2006 Innovations in Information Technology - A Novel Approach for Fault Tolerance