Monitoring and operational management in USLHCNet

Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa

CHEP09 - March 2009 Prague

Outline

MonALISA Framework

Architecture

Data handling

Automatic actions

USLHCNet

Network topology

Monitoring modules

Reliable monitoring & accounting

Alarms & triggers

Conclusions

The MonALISA Architecture

[Diagram: layered architecture, from top to bottom]

HL services: Regional or Global High Level Services, Repositories & Clients

Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients

MonALISA services and Agents: distributed system for gathering and analyzing information based on mobile agents: customized aggregation, triggers, actions

Network of JINI-Lookup Services (Secure & Public): distributed dynamic registration and discovery, based on a lease mechanism and remote events

Fully Distributed System with no Single Point of Failure
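
The lease mechanism mentioned above can be illustrated with a small sketch. It is not the JINI API and not MonALISA code: the class `LeaseRegistry`, its methods and the service names are hypothetical, and the clock is injectable only to make the behaviour easy to follow. The idea shown is that a registration stays visible only while its lease keeps being renewed, so a service that dies disappears from discovery without any explicit deregistration.

```python
import time


class LeaseRegistry:
    """Toy lookup service: an entry vanishes unless its lease is renewed."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self._expiry = {}  # service name -> time at which its lease expires

    def register(self, name):
        """Register a service, or renew the lease of a registered one."""
        self._expiry[name] = self.clock() + self.lease_seconds

    # Renewing a lease is the same operation as registering again.
    renew = register

    def discover(self):
        """Return the names of all services whose lease is still valid."""
        now = self.clock()
        # Drop expired entries so that dead services disappear on their own.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service would call `renew` periodically, at an interval shorter than the lease duration; clients only ever see what `discover` returns.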

MonALISA Service & Data Handling

[Diagram: components of a MonALISA service and its data paths]

Monitoring Modules: collect any type of information; dynamic (re)loading; push and pull

Filters / Triggers; Agents

Data Cache Service & DB; Data Store; Postgres

Lookup Services: registration and discovery

Web Service (WSDL, SOAP) for WS clients and services

Applications, clients or higher-level services: Predicates & Agents; Data (via ML Proxy); Configuration Control (SSL)

Two levels of decisions:

local (autonomous),

global (correlations).

Actions triggered by:

values above/below given thresholds,

absence/presence of values,

correlations between any values.

Action types:

alerts (email, instant messages, Atom feeds),

running an external command,

automatic chart annotations in the repository,

running custom code, such as securely ordering an ML service to (re)start a site service.
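
The three kinds of trigger conditions listed above (thresholds, absence of values, correlations) can be sketched as small predicates over the latest known values. This is only an illustration under stated assumptions: `Trigger`, `above`, `absent`, `both` and `evaluate` are hypothetical names, not the MonALISA actions framework, and the action here is any callable that receives the trigger name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Parameter name -> latest value; None or a missing key means "no value".
Values = Dict[str, Optional[float]]


@dataclass
class Trigger:
    name: str
    condition: Callable[[Values], bool]
    action: Callable[[str], None]  # e.g. send an alert or run a command


def above(param: str, threshold: float) -> Callable[[Values], bool]:
    """Threshold condition: the value exists and exceeds the threshold."""
    return lambda v: v.get(param) is not None and v[param] > threshold


def absent(param: str) -> Callable[[Values], bool]:
    """Absence condition: no value is currently known for the parameter."""
    return lambda v: v.get(param) is None


def both(a: Callable[[Values], bool], b: Callable[[Values], bool]):
    """Correlation of two conditions: holds only when both hold."""
    return lambda v: a(v) and b(v)


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    """Run the action of every trigger whose condition holds."""
    fired = []
    for t in triggers:
        if t.condition(values):
            t.action(t.name)
            fired.append(t.name)
    return fired
```

In this sketch a local decision would call `evaluate` with the values of a single service, while a global decision would call it with values merged from several services.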

[Diagram: Local and Global Decision Framework]

Sensors: traffic, jobs, hosts, apps; temperature, humidity, A/C power, …

Local decisions (ML Service): actions based on local information

Global decisions (Global ML Services): actions based on global information

Monitoring architecture in ALICE

[Diagram: ALICE monitoring architecture]

Components shown with ApMon: AliEn Job Agents, AliEn CE, AliEn SE, ClusterMonitor, AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL Servers, Castor Grid Scripts, API Services

Services and sites: MonALISA @Site, MonALISA @CERN, MonALISA, LCG Site, LCG Tools

MonALISA Repository: Aggregated Data, Long History DB, Alerts, Actions

Monitored parameters: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, Queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net In/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status

See Costin Grigoras’ poster (067): Automated agents for management and control of the ALICE Computing Grid

USLHCNet

USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN as well as Tier1s elsewhere in Europe and Asia.

Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.

The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits offer the functionality needed to develop efficient data transfer services with support for QoS and priorities.

Hybrid network: uses both Ciena CD and Force10 routers

4 transatlantic 10G links at the moment (6 links in the second part of this year)*

* See Harvey Newman's talk [502] from Monday: “Status and outlook of the HEP network”

USLHCNet ML weather map

Monitoring modules

We developed a set of monitoring modules for USLHCNet network devices:

Force10 (SNMP & sFlow)

Traffic per interface

sFlow traffic

Link status monitoring

Ciena Core Director (TL1, Transaction Language 1)

ETTP (Ethernet Termination Point) traffic

EFLOW (Ethernet Flow) traffic

OSRP (routing protocol) topology

Dynamic circuits inside the optical core of the network
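
As a rough illustration of the "traffic per interface" measurement, the sketch below turns two successive readings of an interface octet counter, of the kind exposed over SNMP, into a bit rate, allowing for one counter wrap between readings. The function name and parameters are hypothetical and this is not the code of the MonALISA modules; it only shows the arithmetic such a module has to perform.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Bit rate between two readings of a monotonically increasing counter.

    The counter is assumed to wrap to zero after 2**counter_bits - 1
    (32 or 64 bits for SNMP octet counters) and to have wrapped at most
    once between the two readings.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 1 << counter_bits
    # The modulo makes the difference correct across a single wrap.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s
```

On a 10G link a 32-bit octet counter can wrap in a few seconds, so polling such links usually relies on 64-bit counters; the `counter_bits` parameter covers both cases.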

USLHCNet monitoring

[Diagram: MonALISA services @GVA, @CHI, @NYC and @AMS monitoring the network devices via SNMP and TL1]

USLHCNet redundant monitoring

[Diagram: MonALISA services @GVA, @CHI, @NYC and @AMS]

Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository

Local and global filters

Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.

The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.

The link status is computed as a logical “AND” between both end points of a link. This also cross-checks the status reported by the hardware equipment.

We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
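
The link status rule above can be sketched as follows. The logical AND between the two end points is taken from the slide; how the redundant reports for one end point are merged is not stated there, so the `any` rule below is an assumption made for illustration, as are the function names.

```python
def endpoint_up(reports):
    """Merge redundant reports for one end point of a link.

    reports: list of True / False / None, one entry per monitoring
    service, where None means that the service delivered no data.
    """
    seen = [r for r in reports if r is not None]
    if not seen:
        return None  # no service delivered data for this end point
    # ASSUMPTION: one positive report is enough, so that a failed or
    # unreachable monitoring service does not mark the end point down.
    return any(seen)


def link_up(end_a_reports, end_b_reports):
    """Link status as the logical AND of both end points (None if unknown)."""
    a = endpoint_up(end_a_reports)
    b = endpoint_up(end_b_reports)
    if a is None or b is None:
        return None
    return a and b
```

Returning `None` when one end has no data keeps "unknown" distinct from "down", which matters when the result feeds alarms.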

USLHCNet: Precise measurements for the Operational Status on the WAN Link

Operations & management assisted by agent-based software

Used on the new CIENA equipment for network management

USLHCNet: Traffic on different segments

USLHCNet: Accounting for Integrated Traffic

USLHCNet: Ciena alarms monitoring

The Need for Planning and Scheduling for Large Data Transfers

[Charts: In Parallel vs. Sequential]

Performing the two reading tasks sequentially is 2.5x faster than performing them in parallel

Monitoring Optical Switches

Dynamic restoration of lightpath if a segment has problems

Controlling Optical Planes: Automatic Path Recovery

[Diagram: CERN (Geneva), CALTECH (Pasadena), Starlight, Manlan, USLHCNet, Internet2]

“Fiber cut” simulations

The traffic moves from one transatlantic line to the other one

FDT transfer (CERN - CALTECH) continues uninterrupted

TCP fully recovers in ~20 s

[Plot: FDT transfer, 200+ MBytes/sec from a 1U node, with 4 fiber cut emulations marked 1 to 4]

For more details, see Iosif Legrand’s poster (054): A High Performance Data Transfer Service

Conclusions

The MonALISA framework provides a flexible and reliable monitoring infrastructure

350+ installed services, 1.5M+ unique parameters, 25 kHz value updates

Truly distributed architecture with no single points of failure

Highly modular platform

Automatic decision-making capability at both local and global levels

USLHCNet provides a state-of-the-art hybrid network with support for circuit-oriented network services

Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime

We are investigating dynamic provisioning of circuits from collaborating agents

http://monalisa.caltech.edu

http://repository.uslhcnet.org