The Intersection of Grids and Networks: Where the Rubber Hits the Road

William E. Johnston
ESnet Manager and Senior Scientist
Lawrence Berkeley National Laboratory


Objectives of this Talk

• How a production R&E network works

• Why some types of services needed by Grids / widely distributed computing environments are hard


Outline

• How do Networks Work?

• Role of the R&E Core Network

• ESnet as a Core Network
o ESnet Has Experienced Exponential Growth Since 1992
o ESnet is Monitored in Many Ways
o How Are Problems Detected and Resolved?

• Operating Science Mission Critical Infrastructure
o Disaster Recovery and Stability
o Recovery from Physical Attack / Failure
o Maintaining Science Mission Critical Infrastructure in the Face of Cyberattack

• Services that Grids need from the Network
o Public Key Infrastructure example

How Do Networks Work?

• Accessing a service, Grid or otherwise, such as a Web server, FTP server, etc., from a client computer and client application (e.g. a Web browser) involves
o Target host names
o Host addresses
o Service identification
o Routing

How Do Networks Work?

• When one types “google.com” into a Web browser to use the search engine, the following takes place
o The name “google.com” is resolved to an Internet address by the Domain Name System (DNS) – a hierarchical directory service
o The address is attached to a network packet (which carries the data – a Google search request in this case) which is then sent out of the computer into the network
o The first place that the packet reaches is a router that must decide how to get that packet to its destination (google.com)
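The name resolution step can be pictured as a walk down a tree of delegations. The sketch below is a toy model only, not real DNS: the zone table, the server names, and the address 192.0.2.10 (taken from a range reserved for documentation) are all made up for illustration.

```python
# Toy model of hierarchical name resolution. All zone data is hypothetical.
ZONES = {
    "root": {"com.": "com-server"},                  # root delegates "com."
    "com-server": {"google.com.": "google-server"},  # "com." delegates the domain
    "google-server": {"google.com.": "192.0.2.10"},  # authoritative answer
}


def _covers(zone, fqdn):
    """True if fqdn is the zone itself or lies below it (label boundary)."""
    return fqdn == zone or fqdn.endswith("." + zone)


def resolve(name, max_steps=10):
    """Follow referrals from the root until some server returns an address."""
    fqdn = name if name.endswith(".") else name + "."
    server = "root"
    asked = [server]
    for _ in range(max_steps):
        table = ZONES[server]
        matches = [zone for zone in table if _covers(zone, fqdn)]
        if not matches:
            raise LookupError(f"{server} has no information about {fqdn}")
        answer = table[max(matches, key=len)]  # most specific zone wins
        if answer in ZONES:                    # a referral to another server
            server = answer
            asked.append(server)
        else:                                  # an address: we are done
            return answer, asked
    raise LookupError("too many referrals")


address, servers_asked = resolve("google.com")
print(address, servers_asked)
```

Each server in the toy table knows only about its own part of the name space, which is the sense in which the directory is hierarchical.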

How Do Networks Work?

o In the Internet, routing is done “hot potato”
- Routers are in your site LANs and at your ISP, and each router typically communicates directly with several other routers
- The first router to receive your packet takes a quick look at the address and says, “If I send this packet to router B, that will probably take it closer to its destination.” So it sends it to B without further ado.
- Router B does the same thing, and so forth, until the packet reaches google.com
o What makes this work is routing protocols that exchange reachability information between all directly connected routers – “BGP” is the most common such protocol in WANs
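Hop-by-hop forwarding can be sketched in a few lines. In this minimal illustration every router holds only a next-hop table and no router knows the whole path. The router names and table contents are hypothetical; in a real network the tables would be filled in by a routing protocol such as BGP, and the hop limit here is an assumption standing in for a loop guard.

```python
# Each router knows only the next hop toward a destination.
# Router names and table contents are hypothetical.
NEXT_HOP = {
    "site-router": {"google.com": "isp-router"},
    "isp-router": {"google.com": "router-B"},
    "router-B": {"google.com": "google-gateway"},
    "google-gateway": {"google.com": "google.com"},
}


def forward(packet_dest, first_router, max_hops=16):
    """Pass the packet from router to router until it reaches packet_dest."""
    hops = [first_router]
    current = first_router
    while current != packet_dest:
        if len(hops) > max_hops:
            raise RuntimeError("hop limit exceeded (possible routing loop)")
        # The local decision: look up the next hop and hand the packet on.
        current = NEXT_HOP[current][packet_dest]
        hops.append(current)
    return hops


print(forward("google.com", "site-router"))
```

The path emerges from a chain of purely local decisions, which is why consistent reachability information between neighboring routers matters so much.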

How Do Networks Work?

• Once the packet reaches its destination (the computer called google.com) it must be delivered to the Google search engine, as opposed to the Google mail server that may be running on the same machine.
o This is accomplished with a service identifier that is put on the packet by the browser (the client side application)
- The service identifier says that this packet is to be delivered to the Web server on the destination system – on each system every server/service has a unique identifier called a “port number”
o So when someone says that the Blaster/Lovsan worm is attacking port 135 on the system called google.com, they mean that a worm program somewhere in the Internet is trying to gain access to the service at port 135 on google.com (usually to exploit a vulnerability).
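Delivery by port number amounts to a table lookup on the destination host. The sketch below assumes a hypothetical host that runs a Web server on port 80 and a mail server on port 25 (the conventional ports for HTTP and SMTP); the packet layout is invented for illustration.

```python
# Hypothetical services running on one host, keyed by port number.
SERVICES = {
    80: "web server",
    25: "mail server",
}


def deliver(packet):
    """Hand a packet to the service listening on its destination port."""
    port = packet["dst_port"]
    service = SERVICES.get(port)
    if service is None:
        return "rejected: nothing listening on port %d" % port
    return "delivered to %s" % service


print(deliver({"dst_port": 80, "data": "search request"}))
print(deliver({"dst_port": 135, "data": "worm probe"}))
```

A probe to port 135 only matters if some service is actually listening there, which is why unused services are a common target for filtering.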

Role of the R&E Core Network: Transit (Deliver Every Packet)

[Figure: a packet path between LBNL and Google, Inc., crossing site gateway and border routers, the ESnet core network (core routers), peering routers, and a big ISP (e.g. SprintLink)]

• Border/gateway routers implement separate site and network provider policy (including site firewall policy)

• Peering routers implement/enforce routing policy for each provider and provide cyberdefense

• Core routers focus on high-speed packet forwarding


What is ESnet

• ESnet is a large-scale, very high bandwidth network providing connectivity between DOE Science Labs and their science partners in the US, Europe, and Japan

• Essentially all of the national data traffic supporting US open science is carried by two networks – ESnet and Internet-2 / Abilene (which plays a similar role for the university community)

• ESnet is very different from commercial ISPs (Internet Service Providers) like Earthlink, AOL, etc.
o Most big ISPs provide small amounts of bandwidth to a large number of sites
o ESnet supplies very high bandwidth to a small number of sites

ESnet Connects DOE Facilities and Collaborators

[Figure: map of ESnet. The ESnet core ring is a Packet over SONET optical ring with hubs (SEA, SNV, ELP, CHI, NYC, ATL, DC; an ALB hub is also shown). The map marks 42 end user sites, ESnet hubs, peering points, the QWEST ATM and ESnet IP networks, and connections to Abilene.]

Sites shown: TWC, JGI, SNLL, LBNL, SLAC, YUCCA MT, BECHTEL, PNNL, LIGO, INEEL, LANL, SNLA, AlliedSignal, PANTEX, ARM, KCP, NOAA, OSTI, ORAU, SRS, ORNL, JLAB, PPPL, ANL-DC, INEEL-DC, ORAU-DC, LLNL/LANL-DC, MIT, ANL, BNL, FNAL, AMES, 4xLAB-DC, NERSC, NREL, LLNL, GA, DOE-ALB, SDSC, Japan, GTN&NNSA

Peering points shown: MAE-E, Starlight, Chi NAP, Fix-W, PAIX-W, MAE-W, NY-NAP, PAIX-E, Equinix, PNWG

Link types: International (high speed); OC192 (10 Gb/s optical); OC48 (2.5 Gb/s optical); Gigabit Ethernet (1 Gb/s); OC12 ATM (622 Mb/s); OC12; OC3 (155 Mb/s); T3 (45 Mb/s); T1-T3; T1 (1 Mb/s)

International connections: GEANT (Germany, France, Italy, UK, etc.); Sinet (Japan); Japan – Russia (BINP); CA*net4; CERN; MREN; Netherlands; Russia; StarTap; Taiwan (ASCC); Taiwan (TANet2); KDDI (Japan); France; Switzerland; Australia; Singaren

Site sponsorship: Office Of Science Sponsored (22); NNSA Sponsored (12); Joint Sponsored (3); Other Sponsored (NSF LIGO, NOAA); Laboratory Sponsored (6)

Current Architecture

[Figure: an ESnet site, where the site LAN and site IP router meet the ESnet IP router at the site – ESnet network policy demarcation (“DMZ”); the ESnet router connects to routers (RTR) at an ESnet hub, which are linked by 10GE to the ESnet core, an optical fiber ring]

Wave division multiplexing
• today typically 64 x 10 Gb/s optical channels per fiber
• channels (referred to as “lambdas”) are usually used in bi-directional pairs

Lambda channels are converted to electrical channels
• usually SONET data framing or Ethernet data framing
• can be clear digital channels (no framing – e.g. for digital HDTV)

A ring topology network is inherently reliable – all single point failures are mitigated by routing traffic in the other direction around the ring.
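The ring property can be checked with a small model. The sketch below uses the hub names from this talk, but the order of the hubs around the ring is an assumption made for illustration, and the path selection (shortest surviving direction) is a simplification rather than a description of how ESnet routers actually choose.

```python
# Hub names are from the talk; their order around the ring is assumed.
HUBS = ["SEA", "SNV", "ELP", "ATL", "DC", "NYC", "CHI"]


def _walk(src, dst, step, failed_link):
    """Walk the ring in one direction; return None if the break is crossed."""
    n = len(HUBS)
    i = HUBS.index(src)
    path = [src]
    while HUBS[i] != dst:
        j = (i + step) % n
        if failed_link is not None and {HUBS[i], HUBS[j]} == set(failed_link):
            return None
        path.append(HUBS[j])
        i = j
    return path


def ring_path(src, dst, failed_link=None):
    """Return the shortest hub-by-hub path that avoids one failed link."""
    both = (_walk(src, dst, 1, failed_link), _walk(src, dst, -1, failed_link))
    surviving = [p for p in both if p is not None]
    if not surviving:
        raise RuntimeError("no path left around the ring")
    return min(surviving, key=len)


print(ring_path("SNV", "CHI"))
print(ring_path("SNV", "CHI", failed_link=("SEA", "CHI")))
```

With one link broken, one of the two directions always survives, so every pair of hubs stays connected. A second simultaneous break can partition the ring.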

Peering – ESnet’s Logical Infrastructure – Connects the DOE Community With its Collaborators

[Figure: ESnet peering (connections to other networks), with university, international, commercial, and Abilene peerings marked at the hubs (SEA, SNV, CHI, NYC, ATL) and at exchange and access points including Starlight, MAE-E, MAE-W, NY-NAP, PAIX-E, PAIX-W, FIX-W, EQX-ASH, EQX-SJ, CHI NAP, MAX GPOP, PNW-GPOP, CalREN2, CENIC, SDSC, GA, LBNL, LANL TECHnet, and the Distributed 6TAP (19 peers). Other peer counts shown per point range from 1 to 39.]

International peers shown: GEANT (Germany, France, Italy, UK, etc.); SInet (Japan); KEK; Japan – Russia (BINP); CA*net4; CERN; MREN; Netherlands; Russia; StarTap; Taiwan (ASCC); Taiwan (TANet2); KDDI (Japan); France; Australia; Singaren; Japan

University peers shown: Abilene + 7 Universities

ESnet provides complete access to the Internet by managing the full complement of Global Internet routes (about 150,000) at 10 general/commercial peering points + high-speed peerings w/ Abilene and the international networks.

What is Peering?

• Peering points exchange routing information that says “which packets I can get closer to their destination”

• ESnet daily peering report (top 20 of about 100)

AS      routes   peer
1239    63384    SPRINTLINK
701     51685    UUNET-ALTERNET
209     47063    QWEST
3356    41440    LEVEL3
3561    35980    CABLE-WIRELESS
7018    28728    ATT-WORLDNET
2914    19723    VERIO
3549    17369    GLOBALCENTER
5511    8190     OPENTRANSIT
174     5492     COGENTCO
6461    5032     ABOVENET
7473    4429     SINGTEL
3491    3529     CAIS
11537   3327     ABILENE
5400    3321     BT
4323    2774     TWTELECOM
4200    2475     ALERON
6395    2408     BROADWING
2828    2383     XO
7132    1961     SBC

• This is a lot of work

(Callout on one entry of the report: peering with this outfit is not random, it carries routes that ESnet needs, e.g. to the Russian Backbone Net)
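One way to see why holding many routes matters: a router picks the most specific advertised prefix that contains the destination (longest-prefix matching, the standard IP forwarding rule, which the slide does not spell out). The sketch below uses Python's standard `ipaddress` module; the prefixes and peer names are hypothetical.

```python
import ipaddress

# Hypothetical routes learned from peers: (prefix, peer that advertised it).
ROUTES = [
    ("0.0.0.0/0", "default-peer"),
    ("194.0.0.0/8", "peer-A"),
    ("194.226.160.0/24", "peer-B"),
]


def best_route(dest):
    """Return the most specific (longest-prefix) route containing dest."""
    addr = ipaddress.ip_address(dest)
    candidates = [
        (ipaddress.ip_network(prefix), peer)
        for prefix, peer in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    network, peer = max(candidates, key=lambda item: item[0].prefixlen)
    return str(network), peer


print(best_route("194.226.160.10"))
print(best_route("8.8.8.8"))
```

A peer that advertises a specific prefix for an out-of-the-way destination wins over a broader route, so the packet is handed to the network that can actually get it closer.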

What is Peering?

• Why so many routes? So that when I want to get to someplace out of the ordinary, I can get there. For example:
http://www-sbras.nsc.ru/eng/sbras/copan/microel_main.html
(Technological Design Institute of Applied Microelectronics of SB RAS, 630090, Novosibirsk, Russia)

(Figure label: “Peering routers”)

Address                    Host name                              Network
134.55.209.5 (Start)       snv-lbl-oc48.es.net                    ESnet core
134.55.209.90              snvrt1-ge0-snvcr1.es.net               ESnet peering at Sunnyvale
63.218.6.65                pos3-0.cr01.sjo01.pccwbtn.net          AS3491 CAIS Internet
63.218.6.38                pos5-1.cr01.chc01.pccwbtn.net          AS3491 CAIS Internet
63.216.0.53                pos6-1.cr01.vna01.pccwbtn.net          AS3491 CAIS Internet
63.216.0.30                pos5-3.cr02.nyc02.pccwbtn.net          AS3491 CAIS Internet
63.218.12.37               pos6-0.cr01.ldn01.pccwbtn.net          AS3491 CAIS Internet
63.218.13.134              rbnet.pos4-1.cr01.ldn01.pccwbtn.net    AS3491->AS5568 (Russian Backbone Network) peering point
195.209.14.29              MSK-M9-RBNet-5.RBNet.ru                Russian Backbone Network
195.209.14.153             MSK-M9-RBNet-1.RBNet.ru                Russian Backbone Network
195.209.14.206             NSK-RBNet-2.RBNet.ru                   Russian Backbone Network
194.226.160.10 (Finish)    Novosibirsk-NSC-RBNet.nsc.ru           RBN to AS 5387 (NSCNET-2)

ESnet is Engineered to Move a Lot of Data

[Figure: ESnet Monthly Accepted Traffic, in TBytes/Month (axis from 0 to 300), by year from 1990 to 2004]

Annual growth in the past five years has increased from 1.7x annually to just over 2.0x annually.

ESnet is currently transporting about 250 terabytes/mo.
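To put the growth factors in perspective, the arithmetic below converts an annual growth factor into a doubling time and extrapolates the current monthly volume. The extrapolation assumes the growth factor stays constant, which is an illustrative assumption and not a forecast made in the talk.

```python
import math


def doubling_time_months(annual_factor):
    """Months needed for traffic to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)


def project(tb_per_month, annual_factor, years):
    """Monthly volume after `years` of constant growth (illustrative only)."""
    return tb_per_month * annual_factor ** years


print(round(doubling_time_months(1.7), 1))  # doubling time at 1.7x per year
print(round(doubling_time_months(2.0), 1))  # doubling time at 2.0x per year
print(project(250, 2.0, 3))                 # TB/month after 3 years at 2.0x
```

At 2.0x per year the volume doubles every 12 months, so 250 terabytes/mo. would reach 2000 terabytes/mo. after three years if that rate held.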

Who Generates Traffic, and Where Does it Go?

ESnet Inter-Sector Traffic Summary, Jan 2003 / Feb 2004 (1.7X overall traffic increase, 1.9X OSC increase)
(the international traffic is increasing due to BABAR at SLAC and the LHC tier 1 centers at FNAL and BNL)

[Figure: traffic flows between DOE sites, ESnet, and the peering points. Traffic coming into ESnet = green; traffic leaving ESnet = blue; traffic between sites is shown separately; % = of total ingress or egress traffic. Values as they appear in the figure:]
• DOE sites: 72/68% and 53/49%
• ESnet: ~25/18%
• Commercial: 21/14% and 14/12%
• R&E (mostly universities): 17/10% and 10/13%
• International: 9/26% and 4/6%
• Label on the peering point flows: DOE collaborator traffic, inc. data

• Note that more than 90% of the ESnet traffic is OSC traffic

• DOE is a net supplier of data because DOE facilities are used by universities and commercial entities, as well as by DOE researchers

• ESnet Appropriate Use Policy (AUP): all ESnet traffic must originate and/or terminate on an ESnet site (no transit traffic is allowed)

ESnet Top 20 Data Flows, 24 hrs., 2004-04-20

[Figure: bar chart of the top 20 flows, with a reference mark at 1 terabyte/day. Flows labeled in the chart:]
• Fermilab (US) → CERN
• SLAC (US) → IN2P3 (FR)
• SLAC (US) → INFN Padova (IT)
• Fermilab (US) → U. Chicago (US)
• CEBAF (US) → IN2P3 (FR)
• INFN Padova (IT) → SLAC (US)
• U. Toronto (CA) → Fermilab (US)
• DFN-WiN (DE) → SLAC (US)
• DOE Lab → DOE Lab
• DOE Lab → DOE Lab
• SLAC (US) → JANET (UK)
• Fermilab (US) → JANET (UK)
• Argonne (US) → Level3 (US)
• Argonne → SURFnet (NL)
• IN2P3 (FR) → SLAC (US)
• Fermilab (US) → INFN Padova (IT)

A small number of science users account for a significant fraction of all ESnet traffic

Top 50 Traffic Flows Monitoring – 24hr – 1 Int’l Peering Point

[Figure: the top 50 traffic flows over 24 hours at one international peering point]

• 10 flows > 100 GBy/day

• More than 50 flows > 10 GBy/day

Scalable Operation is Essential

• R&E networks typically operate with a small staff

• The key to everything that the network provides is scalability
o How do you manage a huge infrastructure with a small number of people?
o This issue dominates all others when looking at whether to support new services (e.g. Grid middleware)
- Can the service be structured so that its operational aspects do not scale as a function of the user population?
- If not, then it cannot be offered as a service

Scalable Operation is Essential

• The entire ESnet network is operated by fewer than 15 people

[Figure: ESnet staffing by group]
• Management, resource management, circuit accounting, group leads (4 FTE)
• Science Services (middleware and collaboration tools) (5 FTE)
• Core Engineering Group (5 FTE)
• Infrastructure (6 FTE)
• 7X24 On-Call Engineers (7 FTE)
• 7X24 Operations Desk (2-4 FTE)

• Automated, real-time monitoring of traffic levels and operating state of some 4400 network entities is the primary network operational and diagnosis tool

[Figure: monitoring displays for Network Configuration, OSPF Metrics (internal routing and connectivity), Performance, SecureNet, IBGP Mesh (WAN routing and connectivity), and Hardware Configuration]

How Are Problems Detected and Resolved?

[Figure: the ESnet map of sites, hubs, and links, with a hardware alarm marked at one location]

When a hardware alarm goes off here, the 24x7 operator is notified

ESnet is Monitored in Many Ways

[Figure: monitoring displays for ESnet configuration, Performance, SecureNet, OSPF Metrics, Hardware Configuration, and IBGP Mesh]


Drill Down into the Configuration DB to Operating Characteristics of Every Device

e.g. cooling air temperature for the router chassis air inlet, hot-point, and air exhaust for the ESnet gateway router at PNNL

Problem Resolution

• Let’s say that the diagnostics have pinpointed a bad module in a router rack in the ESnet hub in NYC

• Almost all high-end routers, and other equipment that ESnet uses, have multiple, redundant modules for all critical functions

• Failure of a module (e.g. a power supply or a control computer) can be corrected on-the-fly, without turning off the power or impacting the continued operation of the router

• Failed modules are typically replaced by a “smart hands” service at the hubs or sites
o One of the many essential scalability mechanisms


Drill Down into the Hardware Configuration DB for Every Wire Connection

[Figure: equipment rack detail at AOA, NYC Hub (one of the 10 Gb/s core optical ring sites)]

The Hub Configuration Database

• Equipment wiring detail for two modules at the AOA, NYC Hub

• This allows “smart hands” (e.g., Qwest personnel at the NYC site) to replace modules for ESnet

What Does this Equipment Actually Look Like?

[Figure: photograph, with picture detail, of an equipment rack at the NYC Hub, 32 Avenue of the Americas (one of the 10 Gb/s core optical ring sites)]

Typical Equipment of an ESnet Core Network Hub

ESnet core equipment @ Qwest 32 AofA HUB, NYC, NY (~$1.8M, list)

• Cisco 7206, AOA-AR1 (low speed links to MIT & PPPL) ($38,150 list)
• Juniper M20, AOA-PR1 (peering RTR) ($353,000 list)
• Juniper T320, AOA-CR1 (core router) ($1,133,000 list)
• Juniper OC192 Optical Ring Interface (the AOA end of the OC192 to CHI) ($195,000 list)
• Juniper OC48 Optical Ring Interface (the AOA end of the OC48 to DC-HUB) ($65,000 list)
• AOA Performance Tester ($4800 list)
• Qwest DS3 DCX
• DC / AC Converter ($2200 list)
• Lightwave Secure Terminal Server ($4800 list)
• Sentry power 48v 30/60 amp panel ($3900 list)
• Sentry power 48v 10/25 amp panel ($3350 list)


Operating Science Mission Critical Infrastructure

• ESnet is a visible and critical piece of DOE science infrastructure

o if ESnet fails, 10s of thousands of DOE and University users know it within minutes if not seconds

• Requires high reliability and high operational security in the systems that are integral to the operation and management of the network
o Secure and redundant mail and Web systems are central to the operation and security of ESnet
- trouble tickets are by email
- engineering communication by email
- engineering database interfaces are via Web
o Secure network access to Hub routers
o Backup secure telephone modem access to Hub equipment
o 24x7 help desk and 24x7 on-call network engineer

[email protected] (end-to-end problem resolution)

Disaster Recovery and Stability

• The network must be kept available even if, e.g., the West Coast is disabled by a massive earthquake, etc.

[Figure: map of the hubs (SEA, SNV, ALB, ELP, CHI, ATL, DC, NYC) with labels for LBNL, PPPL, BNL, AMES, and TWC, marking remote engineers, partial duplicate infrastructure, DNS, and duplicate infrastructure]

Engineers, 24x7 Network Operations Center, generator backed power
• Spectrum (net mgmt system)
• DNS (name – IP address translation)
• Eng database
• Load database
• Config database
• Public and private Web
• E-mail (server and archive)
• PKI cert. repository and revocation lists
• collaboratory authorization service

Duplicate Infrastructure
Currently deploying full replication of the NOC databases and servers and Science Services databases in the NYC Qwest carrier hub

Reliable operation of the network involves
• remote Network Operation Centers (3)
• replicated support infrastructure
• generator backed UPS power at all critical network and infrastructure locations
• high physical security for all equipment
• non-interruptible core - ESnet core operated without interruption through
o N. Calif. Power blackout of 2000
o the 9/11/2001 attacks, and
o the Sept., 2003 NE States power blackout

Recovery from Physical Attack / Core Ring Failure

[Figure: the ESnet backbone (optical fiber ring) linking hubs at New York (AOA), Chicago (CHI), Sunnyvale (SNV), Atlanta (ATL), Washington, DC (DC), and El Paso (ELP). Hubs hold the backbone routers and local loop connection points. A local loop (hub to local site) reaches the ESnet border router at a site, which connects across the DMZ to the site gateway router and the site LAN. A break in the ring is marked with an X, along with the normal traffic flow and the reversed traffic flow.]

• The Hubs have lots of connections (42 in all)

• We can route traffic either way around the ring, so any single failure in the ring is transparent to ESnet users

• The local loops are still single points of failure

Maintaining Science Mission Critical Infrastructure in the Face of Cyberattack

• A Phased Security Architecture is being implemented to protect the network and the ESnet sites

• The phased response ranges from blocking certain site traffic to a complete isolation of the network, which allows the sites to continue communicating among themselves in the face of the most virulent attacks
o Separates ESnet core routing functionality from external Internet connections by means of a “peering” router that can have a policy different from the core routers
o Provides a rate limited path to the external Internet that will ensure site-to-site communication during an external denial of service attack
o Provides “lifeline” connectivity for downloading of patches, exchange of e-mail and viewing web pages (i.e., e-mail, DNS, HTTP, HTTPS, SSH, etc.) with the external Internet prior to full isolation of the network
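The talk does not say how the rate limited path is implemented. As a generic illustration only, the sketch below uses a token bucket, a common rate limiting technique that is swapped in here as an assumption; the class name, units, and numbers are hypothetical.

```python
class TokenBucket:
    """Allow at most `rate` units per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens that can accumulate
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # time of the previous decision

    def allow(self, size, now):
        """Return True if a packet of `size` units may pass at time `now`."""
        # Refill for the elapsed time, never beyond the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the limit: drop (or queue) the packet


bucket = TokenBucket(rate=100, capacity=200)
print(bucket.allow(150, now=0.0))  # fits in the initial burst allowance
print(bucket.allow(100, now=0.0))  # exceeds what is left in the bucket
print(bucket.allow(100, now=1.0))  # one second of refill makes room again
```

Capping external traffic this way leaves the remaining capacity for site-to-site flows, which is the effect the phased architecture is after during a denial of service attack.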

Cyberattack Defense

[Figure: attack traffic arriving through peering routers and crossing the ESnet routers toward the border and gateway routers of LBNL and other labs, with an X marking each point where traffic can be filtered or cut off]

• Lab first response – filter incoming traffic at their ESnet gateway router

• ESnet first response – filters to assist a site

• ESnet second response – filter traffic from outside of ESnet

• ESnet third response – shut down the main peering paths and provide only limited bandwidth paths for specific “lifeline” services

Sapphire/Slammer worm infection created a Gb/s of traffic on the ESnet core until filters were put in place (both into and out of sites) to damp it out.

ESnet WAN Security and Cybersecurity

• Cybersecurity is a new dimension of ESnet security
o Security is now inherently a global problem
o As the entity with a global view of the network, ESnet has an important role in overall security

30 minutes after the Sapphire/Slammer worm was released, 75,000 hosts running Microsoft's SQL Server (port 1434) were infected.

(“The Spread of the Sapphire/Slammer Worm,” David Moore (CAIDA & UCSD CSE), Vern Paxson (ICIR & LBNL), Stefan Savage (UCSD CSE), Colleen Shannon (CAIDA), Stuart Staniford (Silicon Defense), Nicholas Weaver (Silicon Defense & UC Berkeley EECS), http://www.cs.berkeley.edu/~nweaver/sapphire, Jan., 2003)


ESnet and Cybersecurity

Sapphire/Slammer worm infection hits creating almost a full Gb/s (1000 megabit/sec.) traffic spike on the ESnet backbone

Outline

• Role of the R&E Transit Network

• ESnet is Driven by the Requirements of DOE Science

• Terminology – How Do Networks Work?

• How Does it Work? – ESnet as a Backbone Network
o ESnet Has Experienced Exponential Growth Since 1992
o ESnet is Monitored in Many Ways
o How Are Problems Detected and Resolved?

• Operating Science Mission Critical Infrastructure
o Disaster Recovery and Stability
o Recovery from Physical Attack / Failure
o Maintaining Science Mission Critical Infrastructure in the Face of Cyberattack

• Services that Grids need from the Network
o Public Key Infrastructure example

Network and Middleware Needs of DOE Science

August 13-15, 2002

Organized by Office of Science: Mary Anne Scott (Chair), Dave Bader, Steve Eckstrand, Marvin Frazier, Dale Koelling, Vicky White

Workshop Panel Chairs: Ray Bair and Deb Agarwal; Bill Johnston and Mike Wilde; Rick Stevens; Ian Foster and Dennis Gannon; Linda Winkler and Brian Tierney; Sandy Merola and Charlie Catlett

• Focused on science requirements that drive
o Advanced Network Infrastructure
o Middleware Research
o Network Research
o Network Governance Model

• The requirements for DOE science were developed by the OSC science community representing major DOE science disciplines
o Climate
o Spallation Neutron Source
o Macromolecular Crystallography
o High Energy Physics
o Magnetic Fusion Energy Sciences
o Chemical Sciences
o Bioinformatics

Available at www.es.net/#research

Grid Middleware Requirements (DOE Workshop)

• A DOE workshop examined science driven requirements for network and middleware and identified twelve high priority middleware services (see www.es.net/#research)

• Some of these services have a central management component and some do not

• Most of the services that have central management fit the criteria for ESnet support. These include, for example
o Production, federated RADIUS authentication service
o PKI federation services
o Virtual Organization Management services to manage organization membership, member attributes and privileges
o Long-term PKI key and proxy credential management
o End-to-end monitoring for Grid / distributed application debugging and tuning
o Some form of authorization service (e.g. based on RADIUS)
o Knowledge management services that have the characteristics of an ESnet service are also likely to be important (future)

Grid Middleware Services

• ESnet provides several “science services” – services that support the practice of science

• A number of such services have an organization like ESnet as the natural provider
o ESnet is trusted, persistent, and has a large (almost comprehensive within DOE) user base
o ESnet has the facilities to provide reliable access and high availability through assured network access to replicated services at geographically diverse locations
o However, a service must be scalable in the sense that as its user base grows, ESnet interaction with the users does not grow (otherwise it is not practical for a small organization like ESnet to operate)


Science Services: PKI Support for Grids

• Public Key Infrastructure supports cross-site, cross-organization, and international trust relationships that permit sharing computing and data resources and other Grid services

• The DOEGrids Certification Authority service, which provides X.509 identity certificates to support Grid authentication, is an example of this model
  o The service requires a highly trusted provider, and requires a high degree of availability
  o The service provider is a centralized agent for negotiating trust relationships, e.g. with European CAs
  o The service scales by adding site based or Virtual Organization based Registration Agents that interact directly with the users
  o See DOEGrids CA (www.doegrids.org)


Science Services: Public Key Infrastructure

• DOEGrids CA policies are tailored to science Grids
  o Digital identity certificates for people, hosts and services
  o Provides formal and verified trust management – an essential service for widely distributed heterogeneous collaboration, e.g. in the International High Energy Physics community

• This service was the basis of the first routine sharing of HEP computing resources between the US and Europe

• A second CA has recently been added, with a policy that supports secondary issuers that need to do bulk issuing of certificates with central private key management
  o NERSC will auto issue certs when accounts are set up – this constitutes an acceptable identity verification
  o A variant of this will also be set up to support security domain gateways such as Kerberos to X.509 (e.g. KX509) at FNAL


Science Services: Public Key Infrastructure

• The rapidly expanding customer base of this service will soon make it ESnet’s largest collaboration service by customer count

Registration Authorities
  o ANL
  o LBNL
  o ORNL
  o DOESG (DOE Science Grid)
  o ESG (Climate)
  o FNAL
  o PPDG (HEP)
  o Fusion Grid
  o iVDGL (NSF-DOE HEP collab.)
  o NERSC
  o PNNL


Grid Network Services Requirements (GGF, GHPN)

• Grid High Performance Networking Research Group, “Networking Issues of Grid Infrastructures” (draft-ggf-ghpn-netissues-3) – what networks should provide to Grids
  o High performance transport for bulk data transfer (over 1 Gb/s per flow)
  o Performance controllability to provide ad hoc quality of service and traffic isolation
  o Dynamic network resource allocation and reservation
  o High availability when expensive computing or visualization resources have been reserved
  o Security controllability to provide a trusted and efficient communication environment when required
  o Multicast to efficiently distribute data to a group of resources
  o How to integrate wireless networks and sensor networks in a Grid environment


Transport Services

• network tools available to build services
  o queue management
    - provide forwarding priorities different from best effort
    - e.g.
      – scavenger (discard if anything behind in the queue)
      – expedited forwarding (elevated priority queuing)
      – low latency forwarding (highest priority – ahead of all other traffic)
  o path management
    - tagged traffic can be managed separately from regular traffic
  o policing
    - limit the bandwidth of an incoming stream
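
The queue management classes listed above can be illustrated with a minimal sketch. It assumes a strict-priority policy with one FIFO per class, which is a simplification: in particular, scavenger traffic is modeled here only as the lowest priority class rather than being discarded. The class names follow the slide; `StrictPriorityScheduler` and the packet labels are hypothetical.

```python
from collections import deque

# Service classes from highest to lowest priority. The names follow the
# slide; the strict-priority policy itself is an illustrative assumption.
CLASSES = ["low_latency", "expedited", "best_effort", "scavenger"]


class StrictPriorityScheduler:
    """Hypothetical strict-priority queue set, one FIFO per class."""

    def __init__(self):
        self.queues = {name: deque() for name in CLASSES}

    def enqueue(self, service_class, packet):
        self.queues[service_class].append(packet)

    def dequeue(self):
        # Always serve the highest-priority non-empty queue first.
        for name in CLASSES:
            if self.queues[name]:
                return name, self.queues[name].popleft()
        return None


sched = StrictPriorityScheduler()
sched.enqueue("scavenger", "s1")
sched.enqueue("best_effort", "b1")
sched.enqueue("expedited", "e1")
sched.enqueue("low_latency", "l1")
sched.enqueue("best_effort", "b2")

# Drain the scheduler and record the order in which packets leave.
order = []
while (item := sched.dequeue()) is not None:
    order.append(item[1])
print(order)
```

Packets leave in class order regardless of arrival order, which is why elevated priority traffic has to be limited: under strict priority, enough of it starves everything below.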


Priority Service: Guaranteed Bandwidth

[Figure: bandwidth management model. A network pipe, with a bandwidth axis from 0 to 1000, is divided into bandwidth reserved for production, best effort traffic and bandwidth available for elevated priority traffic. The diagram shows user system1, user system2, two border routers, site A and site B, and a bandwidth broker that flags traffic from user system1 for expedited forwarding.]


Priority Service: Guaranteed Bandwidth

• What is wrong with this? (almost everything)

  o there may be several users that want all of the premium bandwidth at the same time
  o the user may send data into the high priority stream at a high enough bandwidth that it interferes with production traffic (and not even know it)
  o this is at least three independent networks, and probably more
  o a user that was a priority at site A may not be at site B

[Figure: the same diagram as the previous slide (user system1, user system2, two border routers, bandwidth broker, site A, site B), annotated with the problems listed above.]
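
The first problem, several users wanting all of the premium bandwidth at once, is an admission control problem. The following is a minimal sketch of how a broker might refuse requests that exceed a fixed premium pool. The class `BandwidthBroker`, its methods, the user names, and the 200 Mb/s pool size are all hypothetical and are not taken from any ESnet system.

```python
class BandwidthBroker:
    """Hypothetical broker that admits reservations against a fixed premium pool."""

    def __init__(self, premium_capacity_mbps):
        self.capacity = premium_capacity_mbps
        self.reservations = {}  # user -> reserved Mb/s

    def available(self):
        return self.capacity - sum(self.reservations.values())

    def request(self, user, mbps):
        # Admit only if the whole request fits in what is left of the pool.
        if mbps <= 0 or mbps > self.available():
            return False
        self.reservations[user] = self.reservations.get(user, 0) + mbps
        return True

    def release(self, user):
        self.reservations.pop(user, None)


broker = BandwidthBroker(premium_capacity_mbps=200)  # assumed pool size
first = broker.request("user_system1", 200)   # takes the whole pool
second = broker.request("user_system3", 200)  # must be refused
print(first, second, broker.available())

broker.release("user_system1")
third = broker.request("user_system3", 200)   # fits once the pool is free
print(third)
```

This covers only one network. As the slide notes, a real path crosses at least three independent networks, each of which would need to agree to the reservation.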


Priority Service: Guaranteed Bandwidth

• To address all of the issues is complex

[Figure: the full set of components needed between user system1 and user system2 across site A and site B: three resource managers, a policer, an authorization component, a shaper, a bandwidth broker, and an allocation manager.]


Priority Service

• So, practically, what can be done?

• With available tools, a small number of provisioned circuits can be provided
  o secure and end-to-end (system to system)
  o various Quality of Service possible, including minimum latency
  o a certain amount of route reliability (if redundant paths exist in the network)
  o end systems can manage these circuits as single high bandwidth paths or multiple lower bandwidth paths (with application level shapers)
  o non-interfering with production traffic, so aggressive protocols may be used


Priority Service: Guaranteed Bandwidth

• There will probably be service level agreements among transit networks allowing for a fixed amount of priority traffic – so the resource manager does minimal checking and no authorization

• Policing will be done, but only at the full bandwidth of the service agreement (for self protection)

• Allocation will probably be relatively static and ad hoc

[Figure: user system1 and user system2 across site A and site B, with three resource managers, a policer, an authorization component, and a bandwidth broker.]
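
Policing at the full bandwidth of a service agreement is commonly done with a token bucket. The sketch below is a hedged illustration of that general technique, not of ESnet's configuration; the class `TokenBucketPolicer`, the rate, and the burst size are assumptions chosen to keep the arithmetic easy to follow. Time is passed in explicitly so the example is deterministic.

```python
class TokenBucketPolicer:
    """Hypothetical single-rate policer: packets beyond the contracted rate are dropped."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0

    def allow(self, now, size_bytes):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


# Assumed agreement: 1000 bytes/s sustained, 1500-byte burst.
policer = TokenBucketPolicer(rate_bytes_per_s=1000, burst_bytes=1500)

# Five 1000-byte packets: three close together, then two after a pause.
decisions = [policer.allow(t, 1000) for t in (0.0, 0.1, 0.2, 1.2, 1.3)]
print(decisions)
```

The policer looks only at the aggregate rate and knows nothing about individual users, which matches the slide's point that transit networks would do minimal checking and no authorization.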


Grid Network Services Requirements (GGF, GHPN)

• Grid High Performance Networking Research Group, “Networking Issues of Grid Infrastructures” (draft-ggf-ghpn-netissues-3) – what networks should provide to Grids
  o High performance transport for bulk data transfer (over 1 Gb/s per flow)
  o Performance controllability to provide ad hoc quality of service and traffic isolation
  o Dynamic network resource allocation and reservation
  o High availability when expensive computing or visualization resources have been reserved
  o Security controllability to provide a trusted and efficient communication environment when required
  o Multicast to efficiently distribute data to a group of resources
  o Integrated wireless network and sensor networks in Grid environment


High Throughput Requirements

1) High average throughput
2) Advanced protocol capabilities available and usable at the end-systems
3) Lack of use of QoS parameters

Current issues

1) Low average throughput
2) Semantic gap between socket buffer interface and the protocol capabilities of TCP

Analyzed reasons

1a) End system bottleneck
1b) Protocol misconfigured
1c) Inefficient protocol
1d) Mixing of congestion control and error recovery
2a) TCP connection setup: blocking operations vs asynchronous
2b) Window scale option not accessible through the API

Available solutions

1a) Multiple TCP sessions
1b) Larger MTU
1c) ECN

Proposed alternatives

1) Alternatives to TCP (see DT-RG survey document)
2) OS by-pass and protocol off-loading
3) Overlays
4) End to end optical paths
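
A short calculation shows why the window scale option matters for the "over 1 Gb/s per flow" requirement. A single TCP flow can have at most one window of data in flight per round trip, and without window scaling the TCP window field is limited to 65535 bytes. The 100 ms round-trip time below is an assumed value for a long-haul path, not a measurement from the talk.

```python
def required_window_bytes(rate_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return rate_bits_per_s * rtt_s / 8


def max_throughput_bits_per_s(window_bytes, rtt_s):
    """Upper bound on one TCP flow's throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_s


RTT = 0.1  # seconds; an assumed long-haul round-trip time
UNSCALED_WINDOW = 65535  # largest TCP window without the window scale option

need = required_window_bytes(1e9, RTT)
cap = max_throughput_bits_per_s(UNSCALED_WINDOW, RTT)
print(f"window needed for 1 Gb/s: {need / 1e6:.1f} MB")
print(f"ceiling with an unscaled window: {cap / 1e6:.2f} Mb/s")
```

Under these assumptions the window must be roughly 12.5 MB, while an unscaled window caps the flow at a few Mb/s. This gap is one reason multiple parallel TCP sessions appear under "Available solutions".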


A New Architecture

• The essential requirements cannot be met with the current, telecom provided, hub and spoke architecture of ESnet

• The core ring has good capacity and resiliency against single point failures, but the point-to-point tail circuits are neither reliable nor scalable to the required bandwidth

[Figure: ESnet Core/Backbone ring with hubs at New York (AOA), Chicago (CHI), Sunnyvale (SNV), Atlanta (ATL), Washington, DC (DC), and El Paso (ELP), and attached DOE sites.]
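
The difference between ring resiliency and tail circuit fragility can be checked with a small connectivity test. This is a sketch under stated assumptions: the order of hubs around the ring is assumed for illustration, and `SITE` is a hypothetical DOE site attached to one hub by a single tail circuit. The code fails each link in turn and reports which failures partition the network.

```python
from collections import deque

# Hub order around the ring is assumed for illustration; "SITE" is a
# hypothetical DOE site attached to one hub by a single tail circuit.
RING = ["SNV", "CHI", "AOA", "DC", "ATL", "ELP"]
LINKS = [(RING[i], RING[(i + 1) % len(RING)]) for i in range(len(RING))]
TAIL = ("SITE", "CHI")


def connected(nodes, links):
    """Breadth-first search: True if every node is reachable from the first."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, todo = {nodes[0]}, deque([nodes[0]])
    while todo:
        for nxt in adj[todo.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen) == len(nodes)


nodes = RING + ["SITE"]
all_links = LINKS + [TAIL]

# Fail each link in turn and record which failures partition the network.
fatal = [link for link in all_links
         if not connected(nodes, [l for l in all_links if l != link])]
print(fatal)
```

Only the tail circuit shows up as a single point of failure; any one ring link can be lost without isolating a hub.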


A New Architecture

• A second backbone ring will multiply connect the MAN rings to protect against hub failure

[Figure: ESnet Core/Backbone with hubs at New York (AOA), Chicago (CHI), Sunnyvale (SNV), Atlanta (ATL), Washington, DC (DC), and El Paso (ELP), attached DOE sites, and connections to Europe and Asia-Pacific.]

• All OSC Labs will be able to participate in some variation of this new architecture in order to gain highly reliable and high capacity network access


Conclusions

• ESnet is an infrastructure that is critical to DOE’s science mission and that serves all of DOE

• Focused on the Office of Science Labs

• ESnet is working to meet the DOE mission science networking requirements with several new initiatives and a new architecture

• QoS is hard – but we have enough experience to do pilot studies (which ESnet is just about to start)

• Middleware services for large numbers of users are hard – but they can be provided if careful attention is paid to scaling