
Page 1: Cascading Failures in  Infrastructure Networks

Cascading Failures in Infrastructure Networks

David Alderson

Ph.D. Candidate

Dept. of Management Science and Engineering

Stanford University

April 15, 2002

Advisors: William J. Perry, Nicholas Bambos

Page 2: Cascading Failures in  Infrastructure Networks

IPAM 4/15/2002 David Alderson

Outline

• Background and Motivation

• Union Pacific Case Study

• Conceptual Framework

• Modeling Cascading Failures

• Ongoing Work

Page 3: Cascading Failures in  Infrastructure Networks


Background

• Most of the systems we rely on in our daily lives are designed and built as networks
– Voice and data communications
– Transportation
– Energy distribution

• Large-scale disruption of such systems can be catastrophic because of our dependence on them

• Large-scale failures in these systems
– Have already happened
– Will continue to happen

Page 4: Cascading Failures in  Infrastructure Networks


Recent Examples

• Telecommunications
– ATM network outage: AT&T (February 2001)
– Frame Relay outage: AT&T (April 1998), MCI (August 1999)

• Transportation
– Union Pacific Service Crisis (May 1997 – December 1998)

• Electric Power
– Northeast Blackout (November 1965)
– Western Power Outage (August 1996)

• All of the above
– Baltimore Tunnel Accident (July 2001)

Page 5: Cascading Failures in  Infrastructure Networks


Public Policy

• U.S. Government interest from 1996 (and earlier)

• Most national infrastructure systems are privately owned and operated
– Misalignment between business imperatives (efficiency) and public interest (robustness)

• Previously independent networks are now tied together through a common information infrastructure

• Current policy efforts are directed toward building new public-private relationships
– Policy & Partnership (CIAO)
– Law Enforcement & Coordination (NIPC)
– Defining new roles (Homeland Security)

Page 6: Cascading Failures in  Infrastructure Networks


Research Questions

Broadly:

• Is there something about the network structure of these systems that contributes to their vulnerability?

More specifically:

• What is a cascading failure in the context of an infrastructure network?

• What are the mechanisms that cause it?

• What can be done to control it?

• Can we design networks that are robust to cascading failures?

• What are the implications for network-based businesses?

Page 7: Cascading Failures in  Infrastructure Networks


Outline

• Background and Motivation

• Union Pacific Case Study

• Conceptual Framework

• Modeling Cascading Failures

• Ongoing Work

Page 8: Cascading Failures in  Infrastructure Networks


Union Pacific Railroad

• Largest RR in North America
– Headquartered in Omaha, Nebraska
– 34,000 track miles (west of the Mississippi River)

• Transporting
– Coal, grain, cars, other manifest cargos
– 3rd-party traffic (e.g. Amtrak passenger trains)

• 24x7 Operations:
– 1,500+ trains in motion
– 300,000+ cars in system

• More than $10B in revenue annually

Page 9: Cascading Failures in  Infrastructure Networks


Union Pacific Railroad

• Four major resources constraining operations:
– Line capacity (# parallel tracks, speed restrictions, etc.)
– Terminal capacity (in/out tracks, yard capacity)
– Power (locomotives)
– Crew (train personnel, yard personnel)

• Ongoing control of operations is mainly by:
– Dispatchers
– Yardmasters
– Some centralized coordination, primarily through a predetermined transportation schedule

Page 10: Cascading Failures in  Infrastructure Networks


Union Pacific Railroad

• Sources of network disruptions:
– Weather (storms, floods, rock slides, tornados, hurricanes, etc.)
– Component failures (signal outages, broken wheels/rails, engine failures, etc.)
– Derailments (~1 per day on average)
– Minor incidents (e.g. crossing accidents)

• Evidence for system-wide failures
– 1997-1998 Service Crisis

• Fundamental operating challenge

Page 11: Cascading Failures in  Infrastructure Networks


UPRR Fundamental Challenge

Two conflicting drivers:

• Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources.

• An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.

Page 12: Cascading Failures in  Infrastructure Networks


Railroad Congestion

Congestion may be seen at several places within the railroad:

• Line segments

• Terminals

• Operating Regions

• The Entire Railroad Network

• (Probably not locomotives or crews)

Congestion is related to capacity.

Page 13: Cascading Failures in  Infrastructure Networks


UPRR Capacity Model Concepts

Factors Affecting Observed Performance:
• Dispatcher / Corridor Manager expertise
• On-line incidents / equipment failure
• Weather
• Temporary speed restrictions

[Figure: empirically derived relationship between line-segment velocity and volume (trains per day), illustrating the effect of forcing volume in excess of capacity]

Page 14: Cascading Failures in  Infrastructure Networks


Implications of Congestion

Concepts of traffic congestion are important for two key aspects of network operations:
– Capacity Planning and Management
– Service Restoration

In the presence of service interruptions, the objective of Service Restoration is to:
– Minimize the propagation across the network of any disturbance caused by a service interruption
– Minimize the time to recovery to fluid operations

Page 15: Cascading Failures in  Infrastructure Networks


Modeling Congestion

We can model congestion using standard models from transportation engineering.

Define the relationships between:

• Number of items in the system (Density)

• Average processing rate (Velocity)

• Input Rate

• Output Rate (Throughput)

Page 16: Cascading Failures in  Infrastructure Networks


Modeling Congestion

Velocity vs. Density: assume that velocity decreases (linearly) with the traffic density:

v(n) = K ( 1 − n/N )

where K is the free-flow velocity and N the maximum density.

Page 17: Cascading Failures in  Infrastructure Networks


Modeling Congestion

Throughput vs. Density: Throughput = Velocity · Density:

μ(n) = n v(n) = K ( n − n²/N )

Throughput is maximized at n = N/2, with value μ* = N/4 (for K = 1).
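The linear model above is easy to check numerically. A minimal sketch, with illustrative values K = 1 and N = 10 (not figures from the talk):

```python
# Linear congestion model: v(n) = K*(1 - n/N), mu(n) = n*v(n).
# K (free-flow velocity) and N (maximum density) are illustrative values.

def velocity(n, K=1.0, N=10.0):
    """Velocity decreases linearly with density n."""
    return K * (1.0 - n / N)

def throughput(n, K=1.0, N=10.0):
    """Throughput = velocity * density."""
    return n * velocity(n, K, N)

# Throughput peaks at n = N/2 with value K*N/4.
best_n = max(range(11), key=throughput)
```

A discrete search over n = 0…10 recovers the maximum at best_n = 5 = N/2, where the throughput equals N/4 = 2.5.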

Page 18: Cascading Failures in  Infrastructure Networks


Modeling Congestion

[Figure: velocity vs. density (falling from K at n = 0 to zero at n = N) alongside throughput vs. density (rising to its maximum μ* and returning to zero at n = N)]

Page 19: Cascading Failures in  Infrastructure Networks


Modeling Congestion

Let p represent the intensity of congestion onset:

v(n) = K ( 1 − (n/N)^p )

μ(n) = n v(n) = K n ( 1 − (n/N)^p )

[Figure: v(n) = 1 − (n/10)^p and μ(n) = n (1 − (n/10)^p) plotted over n = 0…10 for p = 0.1, 0.25, 0.5, 1, 2, 4, 10]
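A sketch of this one-parameter family, using K = 1 and N = 10 to match the plotted curves (the sample points are illustrative):

```python
# One-parameter congestion family: v(n) = K*(1 - (n/N)**p).
# Larger p keeps velocity near K until n approaches N, i.e. a sharper
# congestion onset. K = 1 and N = 10 match the plotted curves.

def velocity(n, p, K=1.0, N=10.0):
    return K * (1.0 - (n / N) ** p)

def throughput(n, p, K=1.0, N=10.0):
    return n * velocity(n, p, K, N)

# At half density, the p = 10 system still runs near free-flow velocity,
# while the linear (p = 1) system has already slowed to half speed.
half_speed_linear = velocity(5, p=1)
half_speed_sharp = velocity(5, p=10)
```

Comparing throughputs near n = N shows the same effect: high-p systems carry far more traffic right up to the point of collapse.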

Page 20: Cascading Failures in  Infrastructure Networks


Modeling Congestion

It is clear that in the limit p → ∞,

v(n) = K ( 1 − (n/N)^p )   becomes   v(n) = K for 0 ≤ n ≤ N, and 0 otherwise,

so that

μ(n) = n v(n) = K n ( 1 − (n/N)^p )   becomes   μ(n) = K n for 0 ≤ n ≤ N, and 0 otherwise.

[Figure: μ(n) = n (1 − (n/10)^p) for p = 0.1, 0.25, 0.5, 1, 2, 4, 10, approaching the ideal profile μ(n) = n as p grows]

Page 21: Cascading Failures in  Infrastructure Networks


Modeling Congestion

[Figure, repeated from earlier: velocity vs. density and throughput vs. density, with maximum throughput μ*]

Page 22: Cascading Failures in  Infrastructure Networks


UP Service Crisis

• Initiating Event
– 5/97 derailment at a critical train yard outside of Houston

• Additionally
– Loss of a BNSF route that was decommissioned for repairs
– Embargo at the Laredo interchange point to Mexico

• Complicating Factors
– UP/SP merger and transition to consolidated operations
– Hurricane Danny, fall 1997
– Record rains and floods (esp. Kansas) in 1998

• Operational Issues
– Tightly optimized transportation schedule
– Traditional service priorities

Page 23: Cascading Failures in  Infrastructure Networks


Union Pacific Railroad
Total System Inventory, December 1996 – November 1998

[Chart: weekly total system inventory (cars), on a scale of 280,000 to 370,000, December 1996 – November 1998, with the UP Service Crisis marked; congestion centered on the Houston-Gulf Coast, the Central Corridor (Kansas-Nebraska-Wyoming), and Southern California]

Source: UP Filings with Surface Transportation Board, September 1997 – December 1998

Page 24: Cascading Failures in  Infrastructure Networks


Case Study: Union Pacific

Completed Phase 1 of case study:

• Understanding of the factors affecting system capacity, system dynamics

• Investigation of the 1997-98 Service Crisis

• Project definition: detailed study of Sunset Route

• Data collection, preliminary analysis for the Sunset Route

Ongoing work:

• A detailed study of their specific network topology

• Development of real-time warning and analysis tools

Page 25: Cascading Failures in  Infrastructure Networks


Outline

• Background and Motivation

• Union Pacific Case Study

• Conceptual Framework

• Modeling Cascading Failures

• Ongoing Work

Page 26: Cascading Failures in  Infrastructure Networks


Basic Network Concepts

• Networks allow the sharing of distributed resources

• Resource use → resource load
– Total network usage = total network load

• Total network load is distributed among the components of the network
– Many networking problems are concerned with finding a "good" distribution of load

• Resource allocation ↔ load distribution

Page 27: Cascading Failures in  Infrastructure Networks


Infrastructure Networks

• Self-protection as an explicit design criterion

• Network components themselves are valuable
– Expensive
– Hard to replace
– Long lead times to obtain

• Willingness to sacrifice current system performance in exchange for future availability

• With protection as an objective, connectivity between neighboring nodes is
– Helpful
– Harmful

Page 28: Cascading Failures in  Infrastructure Networks


Cascading Failures

Cascading failures occur in networks where
– Individual network components can fail
– When a component fails, the natural dynamics of the system may induce the failure of other components

Network components can fail because of (initiating events):
– Accident
– Internal failure
– Attack

A cascading failure is not
– A single point of failure
– The occurrence of multiple concurrent failures
– The spread of a virus

Page 29: Cascading Failures in  Infrastructure Networks


Related Work

Cascading Failures:
– Electric Power: Parrilo et al. (1998), Thorp et al. (2001)
– Social Networks: Watts (1999)
– Public Policy: Little (2001)

Other network research initiatives:
– "Survivable Networks"
– "Fault-Tolerant Networks"

Large-Scale Vulnerability:
– Self-Organized Criticality: Bak (1987), many others
– Highly Optimized Tolerance: Carlson and Doyle (1999)
– Normal Accidents: Perrow (1999)
– Influence Models: Verghese et al. (2001)

Page 30: Cascading Failures in  Infrastructure Networks


Our Approach

• Cascading failures in the context of flow networks
– conservation of flow within the network

• Overloading a resource leads to degraded performance and eventual failure

• Network failures are not independent
– Flow allocation pattern → resource interdependence

• Focus on the dynamics of network operation and control

• Design for robustness (not protection)

Page 31: Cascading Failures in  Infrastructure Networks


Taxonomy of Network Flow Models

Each modeling approach is paired with its quantity of interest:

– Fluid Approximations: Time-Dependent Averages
– Static Flow Models: Long-Term Averages
– Diffusion Approximations: Averages & Variances
– Queueing Models: Probability Distributions
– Simulation Models: Event Sequences

The approaches range from coarse-grained to fine-grained models, and the relevant decisions range from Capacity Planning through Failure & Recovery to Ongoing Operation (Processing & Routing).

Reference: Janusz Filipiak

Page 32: Cascading Failures in  Infrastructure Networks


Time Scales in Network Operations

Relevant decisions, compared for computer routing vs. railroad transportation:

– Ongoing Operation (Processing & Routing): milliseconds to seconds vs. minutes to hours
– Failure & Recovery: minutes to hours vs. days to weeks
– Capacity Planning: days to weeks vs. months to years

Computer routing operates on short time scales; railroad transportation on long time scales.

Page 33: Cascading Failures in  Infrastructure Networks


What Are Network Dynamics?

Type of network dynamics and its underlying assumption:

– Dynamics ON networks: network topology is STATIC
– Dynamics OF networks: network topology is CHANGING

Failure & recovery involves both.

Page 34: Cascading Failures in  Infrastructure Networks


Network Flow Optimization

• Original work by Ford and Fulkerson (1956)

• One of the most studied areas for optimization

• Three main problem types– Shortest path problems– Maximum flow problems– Minimum cost flow problems

• Special interpretation for some of the most celebrated results in optimization theory

• Broad applicability to a variety of problems

Page 35: Cascading Failures in  Infrastructure Networks


Single Commodity Flow Problem

Notation:

N   set of nodes, indexed i = 1, 2, …, N
A   set of arcs, indexed j = 1, 2, …, M
d_i demand (supply) at node i
f_j flow along arc j
u_j capacity along arc j
A   node-arc incidence matrix, with entries

    a_ij = +1 if arc j enters node i, −1 if arc j exits node i, 0 otherwise

A set of flows f is feasible if it satisfies the constraints:

A_i f = d_i  for all i ∈ N   (flows balanced at node i, and supply/demand is satisfied)
0 ≤ f_j ≤ u_j  for all j ∈ A   (flow on arc j less than capacity)
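These constraints are simple to check directly. A minimal sketch on a hypothetical 3-node network, using the incidence convention above (+1 entering, −1 exiting, so a supply node carries negative d_i); all numbers are illustrative:

```python
# Feasibility check for the single-commodity flow constraints.
# Hypothetical 3-node network: node 0 supplies 3 units, node 2 demands 3.

arcs = [(0, 1), (1, 2), (0, 2)]   # arc j = (tail, head)
u = [2.0, 2.0, 1.0]               # capacities u_j
d = [-3.0, 0.0, 3.0]              # supply shows up as negative d_i

def feasible(f, arcs, u, d, tol=1e-9):
    # Capacity constraints: 0 <= f_j <= u_j.
    if any(fj < -tol or fj > uj + tol for fj, uj in zip(f, u)):
        return False
    # Balance constraints: (inflow - outflow) at node i must equal d_i.
    for i in range(len(d)):
        inflow = sum(f[j] for j, (_, head) in enumerate(arcs) if head == i)
        outflow = sum(f[j] for j, (tail, _) in enumerate(arcs) if tail == i)
        if abs(inflow - outflow - d[i]) > tol:
            return False
    return True
```

The flow f = [2, 2, 1] (two units via node 1, one direct) is feasible here; pushing three units down the first arc violates its capacity, and shipping only two units violates the balance at the source.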

Page 36: Cascading Failures in  Infrastructure Networks


Single Commodity Flow Problem

Feasible region, denoted F(δ):

A_i f = d_i  for all i ∈ N   (flows balanced at node i)
0 ≤ f_j ≤ u_j  for all j ∈ A   (flow on arc j feasible)

where d_i = δ if i = s, −δ if i = t, and 0 otherwise (s = source, t = sink).

Page 37: Cascading Failures in  Infrastructure Networks


Minimum Cost Problem

Let c_j = cost on arc j.

Minimize over f:   Σ_{j ∈ A} c_j f_j

subject to:

A_i f = d_i  for all i ∈ N   (flows balanced at node i)
0 ≤ f_j ≤ u_j  for all j ∈ A   (flow on arc j feasible)

where d_i = δ if i = s, −δ if i = t, and 0 otherwise.
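A brute-force sketch of the minimum cost problem on a hypothetical 4-node network. Real instances use LP or network-simplex solvers; the network, costs, and capacities below are purely illustrative:

```python
# Brute-force minimum cost flow on a hypothetical network 0 -> {1, 2} -> 3:
# enumerate all integer flows, keep the balanced ones, take the cheapest.
from itertools import product

arcs = [(0, 1), (0, 2), (1, 3), (2, 3)]   # arc j = (tail, head)
u = [2, 2, 2, 2]                          # capacities u_j
c = [1, 4, 1, 1]                          # per-unit costs c_j
delta = 2                                 # units to ship from node 0 to node 3

def balanced(f):
    # Transshipment nodes 1 and 2: inflow must equal outflow.
    for i in (1, 2):
        inflow = sum(f[j] for j, (_, head) in enumerate(arcs) if head == i)
        outflow = sum(f[j] for j, (tail, _) in enumerate(arcs) if tail == i)
        if inflow != outflow:
            return False
    # Source must send exactly delta units.
    return sum(f[j] for j, (tail, _) in enumerate(arcs) if tail == 0) == delta

best = min(
    (f for f in product(*(range(uj + 1) for uj in u)) if balanced(f)),
    key=lambda f: sum(cj * fj for cj, fj in zip(c, f)),
)
```

Both units take the cheap path 0→1→3, so best = (2, 0, 2, 0) at total cost 4; the alternative route through node 2 costs 5 per unit and carries nothing.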

Page 38: Cascading Failures in  Infrastructure Networks


Shortest Path Problem

Let the costs c_j correspond to "distances", and set δ = 1 with u_j = 1 on every arc.

Minimize over f:   Σ_{j ∈ A} c_j f_j

subject to:

A_i f = d_i  for all i ∈ N   (flows balanced at node i)
0 ≤ f_j ≤ u_j = 1  for all j ∈ A   (flow on arc j feasible)

where d_i = 1 if i = s, −1 if i = t, and 0 otherwise.

Page 39: Cascading Failures in  Infrastructure Networks


Maximum Flow Problem

Maximize over f:   δ

subject to:

A_i f = d_i  for all i ∈ N   (flows balanced at node i)
0 ≤ f_j ≤ u_j  for all j ∈ A   (flow on arc j feasible)

where d_i = δ if i = s, −δ if i = t, and 0 otherwise.
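A matching brute-force sketch of the maximum flow problem, again on a hypothetical network (production codes use augmenting-path or push-relabel algorithms):

```python
# Brute-force maximum flow on a hypothetical network 0 -> {1, 2} -> 3:
# maximize the amount shipped from source 0 to sink 3 subject to
# capacities and flow balance.
from itertools import product

arcs = [(0, 1), (0, 2), (1, 3), (2, 3)]   # arc j = (tail, head)
u = [2, 3, 1, 2]                          # capacities u_j

def shipped(f):
    """Units leaving the source if f is balanced, else None."""
    for i in (1, 2):                      # transshipment nodes
        inflow = sum(f[j] for j, (_, head) in enumerate(arcs) if head == i)
        outflow = sum(f[j] for j, (tail, _) in enumerate(arcs) if tail == i)
        if inflow != outflow:
            return None
    return sum(f[j] for j, (tail, _) in enumerate(arcs) if tail == 0)

amounts = (shipped(f) for f in product(*(range(uj + 1) for uj in u)))
max_flow = max(d for d in amounts if d is not None)
```

Here the bottleneck arcs into the sink (capacities 1 and 2) limit the maximum flow to 3, matching the max-flow/min-cut intuition.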

Page 40: Cascading Failures in  Infrastructure Networks


Network Optimization

Traditional Assumptions:
– Complete information
– Static network (capacities, demands, topology)
– Centralized decision maker

Solution obtained from global optimization algorithms.

Relevant issues:
– Computational (time) complexity
• Function of problem size (number of inputs)
• Based on worst-case data
– Parallelization (decomposition)
– Synchronization (global clock)

Page 41: Cascading Failures in  Infrastructure Networks


New Challenges

Most traditional assumptions no longer hold…

• Modern networks are inherently dynamic
– Connectivity fluctuates, components fail, growth is ad hoc
– Traffic demands/patterns constantly change

• Explosive growth → massive size scale

• Faster technology → shrinking time scale

• Operating decisions are made with incomplete, incorrect information

• Claim: A global approach based on static assumptions is no longer viable

Page 42: Cascading Failures in  Infrastructure Networks


Cascading Failures & Flow Networks

• In general, we assume that network failures result from violations of network constraints:
• Node feasibility (flow conservation)
• Arc feasibility (arc capacity)

• That is, failure ↔ infeasibility

• The network topology provides the means by which failures (infeasibilities) propagate

• In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed

Page 43: Cascading Failures in  Infrastructure Networks


Addressing New Challenges

• Extend traditional notions of network optimization to model cascading failures in flow networks
– Allow for node failures
– Include flow dynamics

• Consider solution approaches based on
– Decentralized control
– Local information

• Leverage ideas from dual problem formulation

• Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance

Page 44: Cascading Failures in  Infrastructure Networks


Dual Problem Formulation

Primal Problem:

Min  cᵀf
s.t. A f = d
     f ≥ 0
     f ≤ u

Dual Problem:

Max  λᵀd − μᵀu
s.t. λᵀA − μᵀ ≤ cᵀ
     λ unrestricted
     μ ≥ 0

• Dual variables λ, μ have an interpretation as prices at nodes and arcs

• Natural decomposition as a distributed problem

• e.g. nodes set prices based on local information

• Examples:
– Kelly, Low, and many others for TCP/IP congestion control
– Boyd and Xiao for dual decomposition of the SRRA problem
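Weak duality ties the two problems together: any dual-feasible prices bound any primal-feasible cost from below, with equality at the optimum. A single-arc numeric check (all values illustrative):

```python
# Weak-duality check for the primal/dual pair above on one arc from
# node 0 to node 1: cost 3, capacity 2, ship 1 unit.

c, u = [3.0], [2.0]
d = [-1.0, 1.0]            # (inflow - outflow) required at each node

f = [1.0]                  # primal feasible: 0 <= 1 <= 2, balance holds
lam = [0.0, 3.0]           # node prices (unrestricted in sign)
mu = [0.0]                 # arc prices (>= 0)
# Dual feasibility for the arc: lam[1] - lam[0] - mu[0] <= c[0], i.e. 3 <= 3.

primal = sum(cj * fj for cj, fj in zip(c, f))
dual = sum(l * di for l, di in zip(lam, d)) - sum(m * uj for m, uj in zip(mu, u))
```

Here primal = dual = 3, so these prices certify that f is optimal: the price gap of 3 across the arc exactly covers its cost.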

Page 45: Cascading Failures in  Infrastructure Networks


Outline

• Background and Motivation

• Union Pacific Case Study

• Conceptual Framework

• Modeling Cascading Failures

• Ongoing Work

Page 46: Cascading Failures in  Infrastructure Networks


Node Dynamics

• Consider each node as a simple input-output system running in discrete time, with arrivals a(k), load n(k), and output (performance) d(k)

• Let n(k) = flow being processed in interval k

• Node dynamics:  n(k+1) = n(k) + a(k) − d(k)

• Processing capacity μ(n); state-dependent output:  d(k) = μ(n(k)) for 0 ≤ n(k) ≤ N, and 0 otherwise

• The system is feasible for constant a(k) < μ*

• a(k) − d(k) indicates how n(k) is changing

• n* is the equilibrium point

• The node "fails" if n(k) > N
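These node dynamics can be simulated directly. A sketch using the K = 1 linear congestion model μ(n) = n(1 − n/N); N, the arrival rates, and the step count are illustrative:

```python
# Discrete-time node dynamics: n(k+1) = n(k) + a(k) - d(k), with
# state-dependent output d(k) = mu(n(k)) = n*(1 - n/N).

def simulate(a, N=10.0, steps=200):
    """Run with constant arrivals a; return final load, or None on failure."""
    n = 0.0
    for _ in range(steps):
        d = n * (1.0 - n / N) if 0.0 <= n <= N else 0.0
        n = n + a - d
        if n > N:
            return None            # node "fails": load exceeds N
    return n

# mu* = N/4 = 2.5 here. Below mu* the load settles at the stable
# equilibrium n* (smaller root of mu(n) = a); above mu* the node fails.
settled = simulate(a=1.6)          # converges to n* = 2
collapsed = simulate(a=3.0)        # congestion collapse
```

With a = 1.6 the roots of μ(n) = a are n = 2 and n = 8; the simulation settles at the smaller, stable root, while a = 3.0 > μ* drives the load past N.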

Page 47: Cascading Failures in  Infrastructure Networks


Network Dynamics

• The presence of an arc between adjacent nodes couples their behavior: the output d1(k) of node 1 becomes the arrival a2(k) at node 2

• Arc capacities limit both outgoing and incoming flow, so effectively a2(k) = min{ d1(k), u1(k) }

[Figure: two nodes in series, each with its own congestion curve d_i(n_i), connected by an arc of capacity u1(k)]

Page 48: Cascading Failures in  Infrastructure Networks


Network Dynamics

• The failure of one node can lead to the failure of another

• When a node fails, the capacity of its incoming arcs drops effectively to zero (u1 = 0)

• The upstream node loses the capacity of the arc

• In the absence of control, the upstream node fails too

Result: node failures propagate "upstream"…

Question: how will the network respond to perturbations?
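A sketch of this upstream propagation with two nodes in series; the μ(n) curve and all parameters are illustrative:

```python
# Node 1's output crosses an arc of capacity u1 into node 2. If node 2
# fails, u1 effectively drops to zero and, absent control, node 1 backs
# up and fails too: the failure propagates upstream.

def mu(n, N=10.0):
    return n * (1.0 - n / N) if 0.0 <= n <= N else 0.0

def simulate(a, u1=3.0, N=10.0, steps=300, kill_node2_at=100):
    n1 = n2 = 0.0
    fail1 = fail2 = False
    for k in range(steps):
        if k == kill_node2_at:
            fail2 = True                   # externally imposed failure
        cap = 0.0 if fail2 else u1         # failed node zeroes its incoming arc
        d1 = min(mu(n1, N), cap)           # node 1 output limited by the arc
        d2 = 0.0 if fail2 else mu(n2, N)
        n1, n2 = n1 + a - d1, n2 + d1 - d2
        fail1 = fail1 or n1 > N
        fail2 = fail2 or n2 > N
    return fail1, fail2
```

With a = 1.6 < μ* both nodes run stably until node 2 is killed at step 100; node 1's load then grows by a per step and it fails as well. With no kill, neither node fails.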

Page 49: Cascading Failures in  Infrastructure Networks


Network Robustness

Consider the behavior of the network in response to a perturbation to arc capacity:
1. Does the disturbance lead to a local failure?
2. Does the failure propagate?
3. How far does it propagate?

Measure the consequences in terms of:
– Size of the resulting failure island
– Loss of network throughput

Key factors:
– Flow processing sensitivity to congestion
– Network topology
– Local routing and flow control policies
– Time scales

Page 50: Cascading Failures in  Infrastructure Networks


Congestion Sensitivity

In many real network systems, components are sensitive to congestion.

[Figure: system performance vs. system load, with performance degrading once there is evidence of congestion]

• Using the aforementioned family of functions, we can tune the sensitivity to congestion

• Direct consequences for local dynamics, stability, and control

• Tradeoff between system efficiency and fragility

• Implications for local behavior

Page 51: Cascading Failures in  Infrastructure Networks


Qualitative Behavior

[Figure: output rate vs. total system load for a given input rate, showing a stable equilibrium at x1*, an unstable equilibrium at x2*, and congestion collapse beyond x2*]

Page 52: Cascading Failures in  Infrastructure Networks


Qualitative Behavior

[Figure: the same output-rate curve, partitioned into fluid processing, mild congestion, and severe congestion regions]

The system response to changes in the input rate is opposite in the fluid vs. congested regions.

Page 53: Cascading Failures in  Infrastructure Networks


Qualitative Behavior

[Figure: increasing the input rate shifts the equilibria from (x1*, x2*) to (y1*, y2*), shrinking the safety margin between the stable and unstable equilibria]

"Efficiency" results in "Fragility".

Page 54: Cascading Failures in  Infrastructure Networks


Ongoing Work

• Modeling behavior of flow networks
– Vulnerability to cascading failures
– Sensitivity to congestion

• Bringing together notions from network optimization, dynamical systems, and distributed control

• Exploring operating tradeoffs between
– efficiency and robustness
– global objectives vs. local behavior
– system performance vs. system vulnerability

• Collectively, these features provide a framework for study of real systems
– UPRR case study
– Computer networks

Page 55: Cascading Failures in  Infrastructure Networks


Future Directions

• Development of decision support tools to support real-time operations
– Warning systems
– Incident recovery

• Investigation of issues related to topology

• Notions from economics– Network complements and substitutes– Node cooperation and competition

Page 56: Cascading Failures in  Infrastructure Networks


Key Takeaways

• Large-scale failures happen
– Elements of vulnerability associated with connectivity
– But we are moving to connect everything together…

• Critical tradeoff for network-based businesses
– Business profitability from resource efficiency
– System robustness

• Two fundamental aspects to understanding large-scale failure behavior
– Networks
– Dynamics

• Relevance to a wide variety of applications

Page 57: Cascading Failures in  Infrastructure Networks

Thank You

[email protected]