MonALISA capabilities for the LHCOPN
LHCOPN meeting, March 2010, London
MonALISA Team: Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan
USLHCNet Team: Harvey Newman, Artur Barczyk, Ramiro Voicu, Azher Mughal, Sandor Rozsa
Outline
MonALISA Framework
Architecture
Data handling
Automatic actions
USLHCNet
Network topology
Monitoring modules
Reliable monitoring & accounting
Alarms & triggers
Conclusions
Ramiro Voicu, LHCOPN London, March 2010
The MonALISA Architecture
Regional or Global High Level Services, Repositories & Clients
Secure and reliable communication
Dynamic load balancing
Scalability & Replication
AAA for Clients
Distributed Dynamic Registration and Discovery, based on a lease mechanism and remote events
Diagram layers: Network of JINI-Lookup Services (Secure & Public); MonALISA services; Proxies; HL services; Agents
Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions
Fully Distributed System with no Single Point of Failure
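As a rough illustration of the lease mechanism behind registration and discovery, the Python sketch below keeps a service visible only while it keeps renewing its lease. The class and method names (`LeaseRegistry`, `register`, `renew`, `discover`) and the explicit `now` argument are assumptions made for this example; they are not the JINI or MonALISA API.

```python
class LeaseRegistry:
    """Toy lookup service: entries vanish unless their lease is renewed."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._expiry = {}  # service name -> absolute expiry time

    def register(self, name, now):
        # Granting a lease: the entry is valid until now + lease_seconds.
        self._expiry[name] = now + self.lease_seconds

    def renew(self, name, now):
        # Only a still-valid lease can be renewed; otherwise re-register.
        if self._expiry.get(name, float("-inf")) <= now:
            self._expiry.pop(name, None)
            return False
        self._expiry[name] = now + self.lease_seconds
        return True

    def discover(self, now):
        # Drop expired leases, then return the services still alive.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that crashes simply stops renewing, so it drops out of `discover` after at most one lease period without any explicit deregistration.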
MonALISA Service & Data Handling
Data Store
Data Cache Service & DB
Configuration Control (SSL)
Predicates & Agents
Data (via ML Proxy)
Applications, Clients or Higher Level Services
WS Clients and service
WebService (WSDL, SOAP)
Lookup Service (Registration, Discovery)
Postgres
AGENTS
FILTERS / TRIGGERS
Monitoring Modules: collect any type of information
Dynamic (Re)Loading
Push and Pull
Local and Global Decision Framework
Two levels of decisions:
local (autonomous),
global (correlations).
Actions triggered by:
values above/below given thresholds,
absence/presence of values,
correlations between any values.
Action types:
alerts (emails/instant msg/atom feeds),
running an external command,
automatic chart annotations in the repository,
running custom code, like securely ordering a ML service to (re)start a site service.
Diagram: sensors (traffic, jobs, hosts, apps; temperature, humidity, A/C power, …) feed the ML services, which take local decisions (actions based on local information); global ML services take global decisions (actions based on global information).
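The three trigger conditions listed on this slide (thresholds, absence of values, correlations) can be sketched in a few lines of Python. This is only an illustration of the idea, not MonALISA code: the function names, parameter names and limits are all invented for the example.

```python
def threshold_trigger(value, low=None, high=None):
    """Fire when a value is below `low` or above `high`."""
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False


def absence_trigger(last_seen, now, max_silence):
    """Fire when no value has arrived for more than `max_silence` seconds."""
    return (now - last_seen) > max_silence


def evaluate(samples, now):
    """Return the list of alert messages for one monitored host.

    `samples` maps a parameter name to (timestamp, value); the parameter
    names and limits below are made up for the example.
    """
    alerts = []
    ts, temp = samples["temperature_c"]
    if absence_trigger(ts, now, max_silence=120):
        alerts.append("temperature: no data")
    elif threshold_trigger(temp, high=30.0):
        alerts.append("temperature: above threshold")
    # Correlation between two values: high load while traffic is near zero.
    _, load = samples["load"]
    _, traffic = samples["traffic_mbps"]
    if threshold_trigger(load, high=8.0) and threshold_trigger(traffic, low=1.0):
        alerts.append("load high while traffic is idle")
    return alerts
```

In this sketch each alert string stands in for one of the action types above (an email, an external command, a chart annotation); the same checks could run locally inside a service or globally over data merged from many services.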
USLHCNet
USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN as well as Tier1s elsewhere in Europe and Asia.
Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits make it possible to develop efficient data transfer services with support for QoS and priorities.
Hybrid network: uses both Ciena CD and Force10 routers
6 transatlantic 10G links at the moment
USLHCNet ML weather map
Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
Force10 (SNMP & sFlow)
Traffic per interface
sFlow traffic
Link status monitoring
Ciena Core Director (TL1 – Transaction Language 1)
ETTP (Ethernet Termination Point) traffic
EFLOW (Ethernet Flow) traffic
OSRP (routing protocol) topology
VCG Provisioned / Available Bandwidth
Dynamic circuits inside the optical core of the network
Ping module/MLPing trigger which sends alarms in case of packet loss
USLHCNet monitoring
Diagram: MonALISA services @GVA, @CHI, @NYC and @AMS; monitoring protocols shown: SNMP, TL1
USLHCNet redundant monitoring
Diagram: MonALISA services @GVA, @CHI, @NYC and @AMS
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.
Local and global filters
Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
The link status is computed as a logical “AND” between both end points of a link. This also cross-checks the status reported by the hardware equipment.
We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
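A minimal Python sketch of the link-status filter follows. The logical AND between the two end points comes from the slide; how the redundant reports for a single end point are merged is not stated, so the `any()` rule in `endpoint_up` is an assumption, and both function names are hypothetical.

```python
def endpoint_up(reports):
    """Merge redundant reports for one end point.

    ASSUMPTION: the end point counts as up if any monitoring service saw
    it up; the slides do not say how redundant reports are merged.
    """
    return any(reports)


def link_up(reports_a, reports_b):
    """Link status is the logical AND of the status at both end points."""
    return endpoint_up(reports_a) and endpoint_up(reports_b)
```

With this rule a link is reported down as soon as either end is seen down by every service watching it, while the loss of a single monitoring service does not by itself raise a false alarm.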
USLHCNet: Precise measurements for the Operational Status on the WAN Link
Operations & management assisted by agent-based software
Used on the new CIENA equipment used for network management
USLHCNet: ALL EFLOW traffic - last 2 months
USLHCNet: Accounting for Integrated Traffic
USLHCNet: Ciena alarms monitoring
Topology monitoring and discovery
NETWORKS
AS
ROUTERS
Real Time Topology Discovery & Display
Storage discovery in Alice
Map labels: France, Italy, USA, Russia, Nordic Countries
distance(IP, IP):
Same IP-class network
Common domain name
Same AS
Same country (+ function of RTT between the respective AS-es if known)
If distance between the AS-es is known, use it
Same continent
Far away
distance(IP, Set<IP>): Client's public IP to all known IPs for the storage
C. Grigoras (Alice) – ACAT 2010
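The tiered distance(IP, IP) rule can be sketched as a ranking function. The version below is a simplified illustration under several assumptions: the `Host` record and its fields are hypothetical, "same IP-class network" is taken to mean the same /24, the RTT and AS-to-AS distance refinements are left out, and distance(IP, Set<IP>) is assumed to be the minimum over the storage's known addresses.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def distance(a: Host, b: Host) -> int:
    """Rank how far apart two hosts are; a lower rank means closer."""
    net_a = ipaddress.ip_network(a.ip + "/24", strict=False)
    if ipaddress.ip_address(b.ip) in net_a:
        return 0  # same IP-class network (assumed here to mean same /24)
    if a.domain == b.domain:
        return 1  # common domain name
    if a.asn == b.asn:
        return 2  # same AS
    if a.country == b.country:
        return 3  # same country
    if a.continent == b.continent:
        return 4  # same continent
    return 5  # far away


def distance_to_storage(client: Host, replicas: Iterable[Host]) -> int:
    """Distance from a client to a storage known under several IPs.

    ASSUMPTION: the closest address decides.
    """
    return min(distance(client, r) for r in replicas)
```

Sorting the candidate storage elements by `distance_to_storage` would then give a client its closest storage first.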
FDT Bandwidth tests in Alice (E2E av bw)
Newer kernel, tuned TCP buffers
Default kernels, default TCP buffers
Different trends = different kernels
100 Mbps network card
1 Gbps network card
http://monalisa.cern.ch/FDT/
Conclusions
The MonALISA framework provides a flexible and reliable monitoring infrastructure
350+ installed services, 1.5M+ unique parameters, 25kHz value updates
Truly distributed architecture with no single points of failure
Highly modular platform
Automatic decision taking capability at both local and global levels
USLHCNet provides a hybrid network with support for circuit oriented network services
Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime (100% in the last 6 months)
We are investigating dynamic provisioning of circuits from collaborating agents
http://monalisa.caltech.edu
http://repository.uslhcnet.org
Monitoring Optical Switches
Dynamic restoration of lightpath if a segment has problems
Controlling Optical Planes
Automatic Path Recovery
Diagram nodes: CERN (Geneva), CALTECH (Pasadena), Starlight, Manlan, USLHCNet, Internet2
“Fiber cut” simulations:
The traffic moves from one transatlantic line to the other one
FDT transfer (CERN – CALTECH) continues uninterrupted
TCP fully recovers in ~ 20s
Plot: FDT transfer, 200+ MBytes/sec from a 1U node, with 4 fiber cut emulations (marked 1 to 4)
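The idea of moving traffic to a surviving path after a fiber cut can be illustrated with a small path search. This is a generic breadth-first search, not the agents' actual recovery logic, and the function name `find_path` and the link list used with it are hypothetical: any topology passed in is a toy example rather than the real network map.

```python
from collections import deque


def find_path(links, src, dst, failed=frozenset()):
    """Breadth-first search for a path that avoids failed segments."""
    graph = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # skip segments currently reported as cut
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no surviving path
```

Calling `find_path` again with the cut segment added to `failed` yields the alternative route, or `None` when the destination has become unreachable.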