Iosif Legrand, California Institute of Technology
Monitoring, Control and Optimization in Large Distributed Systems
ECSAC, August 2009, Veli Lošinj
Monitoring Distributed Systems
An essential part of managing large-scale, distributed data-processing facilities is a monitoring system able to track computing facilities, storage systems, networks, and a very large number of applications running on these systems in near real time.
The monitoring information gathered from all the subsystems is essential for design, debugging, accounting, and the development of "higher-level services" that provide decision support and some degree of automated decisions, and for maintaining and optimizing workflow in large-scale distributed systems.
Monitoring Information Is Necessary for System Design, Control, Optimization, Debugging and Accounting
[Diagram: near-real-time monitoring information feeds computing models (modeling & simulation), debugging, control and operational support (alarms), optimization algorithms, accounting, and the creation of resilient distributed systems.]
The LHC Data Grid Hierarchy: The Need for Distributed Computing (MONARC)
[Diagram: the experiment delivers ~PByte/sec to the online system, which sends ~150-1500 MBytes/sec to the Tier 0+1 CERN Center (PBs of disk; tape robot). CERN connects at 10-40 Gbps to the Tier 1 centers (FNAL, IN2P3, INFN, RAL), which hold tens of petabytes by ~2010 as physics data cache. Tier 1 centers connect at ~10 Gbps to the Tier 2 centers, Tier 2 at 1 to 10 Gbps to the Tier 3 institutes, and Tier 4 workstations attach at ~1-10 Gbps. In total: 11 Tier1 and 120+ Tier2 centers.]
Communication in Distributed Systems
Distributed Object Systems: CORBA, DCOM, RMI
[Diagram: an "IDL" compiler generates a stub for the client and a skeleton for the server; the client finds the service through a lookup service.]
- The stub is linked to the client: the client must know about the service from the beginning and needs the right stub for it.
- The server and the client code must be created together.
- Different types of dedicated protocols are used.
Distributed Object Systems: Web Services (WSDL/SOAP)
[Diagram: the client obtains the service interface (WSDL) from a lookup service and talks to the server over SOAP.]
- The client can dynamically generate the data structures and the interfaces for using remote objects, based on WSDL.
- Platform independent.
- Large overhead; based on stateless connections.
Mobile Code and Distributed Services
Acts as a true dynamic service and provides the necessary functionality to be used by any other services that require such information:
- a mechanism to dynamically discover all the "service units"
- remote event notification for changes in any system
- a lease mechanism for each registered unit
Based on Java dynamic code loading.
[Diagram: the service registers a proxy with the lookup service; the client downloads the proxy and uses it to access the service.]
Services can be used dynamically:
- Remote services: proxy == RMI stub
- Mobile agents: proxy == the entire service
- "Smart proxies": the proxy adjusts to the client
- Any protocol well suited to the application
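The proxy-and-lease pattern above can be sketched in a few lines. This toy Java sketch is illustrative only (the class and method names are my own, not the Jini or MonALISA API): services register a proxy under a lease, and clients discover it through a shared interface without linking against the implementation in advance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of a Jini-style lookup service: services register a proxy under
// a lease; clients discover the proxy and call it through a shared interface.
public class LookupSketch {

    // The only thing client and service share: a small service interface.
    public interface MonitoringUnit {
        double read(String parameter);
    }

    // A registration holds the downloadable proxy and its lease expiry time.
    public record Registration(MonitoringUnit proxy, long leaseExpiryMillis) {}

    private final Map<String, Registration> registry = new ConcurrentHashMap<>();

    // Service side: register a proxy with a lease that must be renewed.
    public void register(String name, MonitoringUnit proxy, long leaseMillis, long now) {
        registry.put(name, new Registration(proxy, now + leaseMillis));
    }

    // Client side: discover a live proxy; unrenewed leases make services vanish.
    public MonitoringUnit discover(String name, long now) {
        Registration r = registry.get(name);
        if (r == null || r.leaseExpiryMillis() < now) return null; // lease expired
        return r.proxy();
    }
}
```

A real Jini lookup service additionally ships the proxy's bytecode to the client (dynamic code loading), which is what lets the proxy be an RMI stub, a smart proxy, or the entire mobile service.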
The MonALISA Framework
MonALISA is a dynamic, distributed service system capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems.
The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing, agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.
The MonALISA Architecture
[Diagram, top to bottom:]
- Regional or global high-level services, repositories & clients
- Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
- MonALISA services: a distributed system for gathering and analyzing information, based on mobile agents; customized aggregation, triggers, actions
- Network of JINI lookup services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
A fully distributed system with no single point of failure.
MonALISA Service & Data Handling
[Diagram: monitoring modules (dynamic loading, push and pull) collect any type of information; filters/triggers and agents process it; the service keeps a data cache and a database store (e.g. Postgres). Applications, clients or higher-level services access the data through predicates & agents (via the ML proxy); WS clients use the web service (WSDL/SOAP); registration and discovery go through the lookup services; configuration control runs over SSL.]
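The "dynamic loading, push and pull" idea can be sketched as follows. This is an illustrative Java sketch, not the real MonALISA module API: modules are instantiated from a class name at runtime and polled, while instrumented applications push values directly into the cache.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the actual MonALISA API) of a pluggable
// monitoring module and a latest-value data cache.
public class ModuleSketch {

    public interface MonitoringModule {
        Map<String, Double> collect(); // one measurement cycle
    }

    private final Map<String, Double> cache = new HashMap<>();

    // Dynamic loading: instantiate a module from its class name at runtime.
    public MonitoringModule load(String className) throws Exception {
        return (MonitoringModule) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    // Pull: the service periodically polls the module for fresh values.
    public void pull(MonitoringModule m) { cache.putAll(m.collect()); }

    // Push: an instrumented application reports a value on its own.
    public void push(String param, double value) { cache.put(param, value); }

    public Double latest(String param) { return cache.get(param); }
}
```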
Registration / Discovery, Admin Access and AAA for Clients
[Diagram: MonALISA services register with the lookup services using signed certificates; clients (or other services) discover them through the same lookup services. Data, filters & agents flow through service proxy multiplexers. Admin access uses an SSL connection with trust keystores on both sides; AAA services handle client authentication for applications.]
Monitoring Grid Sites, Running Jobs, Network Traffic, and Connectivity
[Screenshots: topology, running jobs, and accounting views.]
Monitoring the Execution of Jobs and the Time Evolution
[Diagram: submit a job described as a DAG; split jobs (Job → Job1, Job2, Job3; Job3 → Job31, Job32) and follow the "lifelines" of each job over time.]
Monitoring CMS Jobs Worldwide
CMS is using MonALISA and ApMon to monitor all the production and analysis jobs. This information is then used in the CMS dashboard frontend.
- Collects monitoring data at rates of more than 1000 values per second
- Peaks of more than 1500 jobs reporting concurrently to the same server
- Service uptime > 150 days of continuous operation without any problems
- Collected ~4.5 × 10^9 monitoring values in the first half of 2009
- Loss of UDP messages (values) is less than 5 × 10^-6
[Chart: rate of collected monitoring values (up to ~1000/s) and total collected values (~5 × 10^9), April-June 2009; the computer that runs ML at CERN was replaced during this period.]
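The idea behind this ApMon-style instrumentation can be sketched as follows: applications serialize (cluster, node, parameter, value) tuples and fire them at a MonALISA service as UDP datagrams, accepting rare loss. The wire format and names below are invented for illustration; the real ApMon library uses its own binary (XDR) encoding.

```java
// Illustrative sketch of fire-and-forget UDP monitoring, ApMon-style.
// The "|"-separated text format is an assumption for illustration only.
public class ApMonSketch {

    // Encode one monitoring value into a datagram payload.
    public static byte[] encode(String cluster, String node, String param, double value) {
        return (cluster + "|" + node + "|" + param + "|" + value)
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    // Fire-and-forget send: UDP is lossy, which is acceptable for monitoring
    // (the CMS deployment above measured a loss rate below 5e-6).
    public static void send(java.net.InetAddress host, int port, byte[] payload)
            throws java.io.IOException {
        try (java.net.DatagramSocket s = new java.net.DatagramSocket()) {
            s.send(new java.net.DatagramPacket(payload, payload.length, host, port));
        }
    }
}
```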
Organize and structure the monitoring information.
Monitoring Architecture in ALICE
[Diagram: ApMon instruments the AliEn Job Agents, CE, SE, TQ, IS, Optimizers, Brokers, Cluster Monitors, MySQL servers, CastorGrid scripts and API services. A MonALISA service at each site and a MonALISA service at CERN collect this data, together with LCG tools at the LCG sites, and feed the MonALISA repository (aggregated data, long-history DB). Monitored parameters include rss, vsz, cputime, run time, job slots, free space, number of open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, sockets, migrated MBytes, active sessions and MyProxy status; the repository generates alerts and actions.]
ALICE: Global Views, Status & Jobs
http://pcalimonitor.cern.ch
ALICE: Job Status – History Plots
ALICE: Resource Usage Monitoring
- Cumulative parameters: CPU time & CPU KSI2K, wall time & wall KSI2K, read & written files, input & output traffic (xrootd)
- Running parameters: resident memory, virtual memory, open files, workdir size, disk usage, CPU usage
- Aggregated per site
ALICE: Job Agents Monitoring
- From the Job Agent itself: requesting job, installing packages, running job, done, error statuses
- From the Computing Element: available job slots, queued Job Agents, running Job Agents
Two levels of decisions:
local (autonomous),
global (correlations).
Actions triggered by:
values above/below given thresholds,
absence/presence of values,
correlations between any values.
Action types:
alerts (emails/instant msg/atom feeds),
running an external command,
automatic charts annotations in the repository,
running custom code, like securely ordering a ML service to (re)start a site service.
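The trigger types above can be sketched as predicates over incoming monitoring samples. This minimal Java sketch is illustrative, not the actual MonALISA trigger API (the `Sample` record and method names are assumptions):

```java
import java.util.List;

// Minimal sketch of the trigger logic described above: actions fire on
// values above/below a threshold or on the absence of fresh values.
public class TriggerSketch {

    public record Sample(String param, double value, long timestampMillis) {}

    // Threshold trigger: fire when the latest value exceeds the limit.
    public static boolean aboveThreshold(Sample s, double limit) {
        return s.value() > limit;
    }

    // Absence trigger: fire when no value arrived within the timeout window.
    public static boolean absent(List<Sample> history, long now, long timeoutMillis) {
        return history.isEmpty()
                || now - history.get(history.size() - 1).timestampMillis() > timeoutMillis;
    }
}
```

Correlations between values (the global-decision case) would combine several such predicates over samples from different services.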
Local and Global Decision Framework
[Diagram: sensors (traffic, jobs, hosts, apps; temperature, humidity, A/C power, ...) feed ML services, which take local, autonomous decisions based on local information; global ML services correlate the data and take global decisions based on global information.]
ALICE: Automatic Job Submission, Restarting Services
- The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage.
- The ALICE production jobs queue is kept full by the automatic submission. Trigger: threshold on the number of aliprod waiting jobs.
- Administrators are kept up to date on the services' status. Trigger: presence/absence of monitored information.
Automatic Actions in ALICE
ALICE is using the monitoring information to automatically:
- resubmit error jobs until a target completion percentage is reached
- submit new jobs when necessary (watching the task queue size for each service account): production jobs, and RAW data reconstruction jobs for each pass
- restart site services whenever tests of VoBox services fail but the central services are OK
- send email notifications / add chart annotations when a problem was not solved by a restart
- dynamically modify the DNS aliases of central services for efficient load balancing
Most of the actions are defined by configuration files of only a few lines.
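As a purely hypothetical illustration of such a few-line action definition (the real MonALISA configuration syntax is not shown in these slides, so every key name below is invented):

```
# Hypothetical example, NOT the real MonALISA syntax:
# restart the MySQL daemon when its virtual memory exceeds a threshold
trigger.name      = mysql-vsz
trigger.parameter = mysql_vsz_mb
trigger.condition = value > 3500
trigger.action    = restart-service mysqld
trigger.notify    = admin@example.org
```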
US LHCNet status report, Artur Barczyk, 04/16/2009
Together with ESnet, US LHCNet provides highly resilient data paths for the US Tier1s.
The USLHCnet
- Advanced standards: dynamic circuits
- CIENA Core Directors
- Mesh protection
- Equipment and link redundancy
- Also Internet2 and SINET3 (Japan)
Monitoring Links Availability: Very Reliable Information
[Chart: per-link availability for AMS-GVA (GEANT), CHI-NYC (Qwest), GVA-NYC (GC), GVA-NYC (Colt), CHI-GVA (Qwest), CHI-GVA (GC) and AMS-NYC (GC), color-coded in bands 0-95%, 95-97%, 97-98%, 98-99%, 99-100% and 100%; measured values range from 96.6% to 99.9%.]
Monitoring USLHCnet Topology
Topology & status & peering; real-time topology for L2 circuits.
USLHCnet: Traffic on Different Segments
USLHCnet: Accounting for Integrated Traffic
Alarms and Automatic Notifications for USLHCnet
The UltraLight Network
[Map of the UltraLight network, including BNL ESnet in/out.]
Monitoring Network Topology (L3), Latency, Routers
Real-time topology discovery & display of networks, AS and routers.
Available Bandwidth Measurements
Embedded Pathload module.
EVO: Real-Time Monitoring for Reflectors and the Quality of All Possible Connections
EVO: Creating a Dynamic, Global Minimum Spanning Tree to Optimize the Connectivity
Given a weighted connected graph G = (V, E) with n vertices and m edges, the goal is the spanning tree T minimizing w(T) = Σ_{(u,v)∈T} w(u,v). The quality of connectivity between any two reflectors is measured every second, and a minimum spanning tree with additional constraints is rebuilt in near real time.
The result is a resilient overlay network that optimizes real-time communication.
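The MST computation itself can be sketched with Prim's algorithm (one standard choice; the slides do not say which algorithm EVO uses). Reflectors are vertices, the measured link quality is the edge weight, and the overlay routes traffic along the tree minimizing w(T). The dense weight-matrix representation is an assumption for illustration.

```java
// Minimal Prim's-algorithm sketch of the MST computation described above.
public class MstSketch {

    // Dense symmetric weight matrix; w[i][j] = measured cost between i and j.
    public static double mstWeight(double[][] w) {
        int n = w.length;
        boolean[] inTree = new boolean[n];
        double[] best = new double[n];          // cheapest edge into the tree
        java.util.Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0;
        double total = 0;
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++)         // pick cheapest vertex not in tree
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++)         // relax edges out of u
                if (!inTree[v] && w[u][v] < best[v]) best[v] = w[u][v];
        }
        return total;
    }
}
```

Because the weights change every second, EVO recomputes the tree continuously in near real time.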
Dynamic MST to Optimize the Connectivity for Reflectors
Frequent measurements of RTT, jitter, traffic and lost packets; the MST is recreated in ~1 s in case of communication problems.
EVO: Optimize How Clients Connect to the System, for Best Performance and Load Balancing
FDT – Fast Data Transfer
FDT is an application for efficient data transfers.
Easy to use; written in Java and runs on all major platforms.
It is based on an asynchronous, multi-threaded system which uses the NIO library and is able to:
stream continuously a list of files
use independent threads to read and write on each physical device
transfer data in parallel on multiple TCP streams, when necessary
use appropriate size of buffers for disk IO and networking
resume a file transfer session
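One detail behind "transfer data in parallel on multiple TCP streams" is splitting the data into contiguous byte ranges, one per stream. This helper is an illustrative sketch; FDT's real internal partitioning is not shown in these slides.

```java
// Sketch: split a file of fileSize bytes into `streams` contiguous ranges.
public class PartitionSketch {

    public record Range(long offset, long length) {}

    public static Range[] partition(long fileSize, int streams) {
        Range[] out = new Range[streams];
        long base = fileSize / streams, rem = fileSize % streams, off = 0;
        for (int i = 0; i < streams; i++) {
            long len = base + (i < rem ? 1 : 0); // spread the remainder
            out[i] = new Range(off, len);
            off += len;
        }
        return out;
    }
}
```

Each range would then be read by its own thread and written to its own TCP channel, which also makes resuming a session natural: only unfinished ranges are re-sent.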
FDT – Fast Data Transfer
[Diagram: independent threads per device read the files into a pool of buffers (kernel space); the data crosses the network through data transfer sockets/channels into the receiver's pool of buffers, where threads restore the files from the buffers. A separate control connection handles authorization.]
FDT Features
The FDT architecture allows one to plug in external security APIs and use them for client authentication and authorization. Several security schemes are supported: IP filtering, SSH, GSI-SSH, Globus-GSI, SSL.
User-defined loadable modules for pre- and post-processing provide support for dedicated mass-storage systems, compression, etc.
FDT can be monitored and controlled dynamically by the MonALISA system.
FDT – Memory-to-Memory Tests in WAN
Test nodes: dual-core Intel Xeon CPUs @ 3.00 GHz, 4 GB RAM, 4 × 320 GB SATA disks, connected with 10 Gb/s Myricom NICs.
Results: ~9.0 Gb/s and ~9.4 Gb/s.
Disk-to-Disk Transfers in WAN
New York – Geneva (1U nodes with 4 disks): reads and writes on 4 SATA disks in parallel on each server; mean traffic ~210 MB/s, i.e. ~0.75 TB per hour.
CERN – Caltech (4U disk servers with 24 disks): reads and writes on two 12-port RAID controllers in parallel on each server; mean traffic ~545 MB/s, i.e. ~2 TB per hour.
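As a quick arithmetic check of the figures above (decimal units, 1 TB = 10^6 MB):

```java
// Convert a mean transfer rate in MB/s into TB per hour (decimal units).
public class RateSketch {
    public static double tbPerHour(double mbPerSec) {
        return mbPerSec * 3600 / 1_000_000.0;
    }
}
```

210 MB/s gives ~0.76 TB/hour and 545 MB/s gives ~1.96 TB/hour, matching the ~0.75 and ~2 TB figures quoted on the slide.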
Lustre read/write of ~320 MB/s between Florida and Caltech. Works with xrootd; interfaces to dCache using the dcap protocol.
Active Available Bandwidth Measurements Between All the ALICE Grid Sites
Active Available Bandwidth Measurements Between All the ALICE Grid Sites (2)
End-to-End Path Provisioning on Different Layers
[Diagram: between Site A and Site B, Layer 3 carries the default IP route, Layer 2 the VCAT and VLAN channels, and Layer 1 the optical path. Tasks: monitor the layout / set up circuits; monitor hosts & end-to-end paths / set end-host parameters; control transfers and bandwidth reservations; monitor interface traffic.]
Monitoring Optical Switches
Dynamic restoration of a lightpath if a segment has problems.
Monitoring the Topology and Optical Power on Fibers for Optical Circuits
Port power monitoring and controlling (Glimmerglass switch example).
"On-Demand", End-to-End Optical Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/". The MonALISA distributed service system monitors and controls the optical switches (via TL1, with a BOS agent); once the optical path is available, it configures the interfaces and starts the data transfer over the active light path, monitored in real time, with the regular IP path over the Internet as a fallback. A LISA agent on each end host sets up the network interfaces, TCP stack, kernel parameters and routes, telling the application e.g. "use eth1.2, ...".]
Creates an end-to-end path in < 1 s. Detects errors and automatically recreates the path in less than the TCP timeout.
Controlling Optical Planes: Automatic Path Recovery
[Diagram: light paths between CERN (Geneva) and Caltech (Pasadena) via Starlight, Manlan, USLHCnet and Internet2.]
"Fiber cut" simulations: the traffic moves from one transatlantic line to the other; the FDT transfer (CERN – Caltech) continues uninterrupted; TCP fully recovers in ~20 s.
[Chart: 4 fiber-cut emulations; 200+ MBytes/sec from a 1U node.]
"On-Demand" Dynamic Circuits: Channel and Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/"; after path or channel allocation, the interfaces are configured and the data transfer starts. Local VLANs are mapped to WAN channels or light paths, with the regular IP path as a fallback.]
Recommended: use two NICs, one for management and one for data, or bond two NICs to the same IP.
The Need for Planning and Scheduling for Large Data Transfers
[Chart: reading two data sets in parallel vs. sequentially; it is 2.5× faster to perform the two reading tasks sequentially.]
Dynamic Path Provisioning: Queueing and Scheduling
[Diagram: user requests go to scheduling, control and monitoring components, which interact with end-host agents and receive real-time feedback.]
- Channel allocation based on VO/priority [+ wait time, etc.]
- Create on demand an end-to-end path or channel and configure the end hosts
- Automatic recovery (rerouting) in case of errors
- Dynamic reallocation of throughput per channel, to manage priorities and control time to completion where needed
- Reallocate resources requested but not used
Dynamic Priority for FDT Transfers on Common Segments
[Chart: transfers with priorities 4, 2 and 8 sharing common segments.]
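One plausible reading of priority-based sharing on a common segment is proportional allocation: each transfer gets capacity × p_i / Σp. This policy is an assumption for illustration; the slides only state that transfers with priorities 4, 2 and 8 share the segments.

```java
// Sketch of priority-proportional bandwidth sharing on a common segment.
public class PrioritySketch {
    public static double[] shares(double capacity, int[] priorities) {
        int sum = 0;
        for (int p : priorities) sum += p;   // total priority weight
        double[] out = new double[priorities.length];
        for (int i = 0; i < priorities.length; i++)
            out[i] = capacity * priorities[i] / sum;
        return out;
    }
}
```

With a 1400 Mbps segment and priorities 4, 2 and 8, the shares would be 400, 200 and 800 Mbps, and they are rebalanced whenever a transfer joins or leaves.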
Bandwidth Challenge at SC2005: 151 Gb/s; ~500 TB total in 4 h.
FDT & MonALISA Used at SC2006
17.7 Gb/s disk-to-disk on a 10 Gb/s link used in both directions, from Florida to Caltech.
SC2006
[Charts: official BWC and hyper BWC results.]
SC2007: 80+ Gb/s disk to disk.
SC2008 – Bandwidth Challenge, Austin, TX
Using FDT & MonALISA. Storage-to-storage WAN transfers: bi-directional peak of 114 Gbps (sustained 110 Gbps) between Caltech, Michigan, CERN, FermiLab, Brazil, Korea, Estonia, Chicago, New York and Amsterdam.
SC2008 – Bandwidth Challenge, Austin, TX: CIENA – Caltech booths
FDT transfers over a Ciena OTU-4 standard link carrying a 100 Gbps payload (200 Gbps bidirectional) with forward error correction: a 10 × 10 Gbps multiplex over an 80 km fibre spool. Peak: 199.9 Gb/s; mean: 191 Gb/s. We transferred ~1 PB in ~12 hours.
The Eight Fallacies of Distributed Computing
It is fair to say that at the beginning of this project we underestimated some of the potential problems of developing a large distributed system in a WAN; indeed, the "eight fallacies of distributed computing" are very important lessons:
1) The network is reliable.
2) Latency is zero.
3) Bandwidth is infinite.
4) The network is secure.
5) Topology doesn't change.
6) There is one administrator.
7) Transport cost is zero.
8) The network is homogeneous.
MonALISA Usage
Major communities: ALICE, CMS, ATLAS, PANDA, EVO, LCG RUSSIA, OSG, MXG, RoEduNet, USLHCNET, ULTRALIGHT, Enlightened.
[Map: VRVS/EVO, ALICE, OSG and USLHCnet deployments.]
MonALISA today: running 24 × 7 at ~360 sites.
- Collecting ~1.5 million "persistent" parameters in real time, and 60 million "volatile" parameters per day
- Update rate of ~25,000 parameter updates/sec
- Monitoring >40,000 computers, >100 WAN links, >6,000 complete end-to-end network path measurements, and tens of thousands of grid jobs running concurrently
- Controls job summation, different central services for the grid, EVO topology, FDT, ...
- The MonALISA repository system serves ~6 million user requests per year.
Summary
Modeling and simulation are important for understanding how distributed systems scale and where their limits are, and they help to develop and test optimization algorithms.
The algorithms needed to create resilient distributed systems are not easy (handling the asymmetric information).
For data-intensive applications, it is important to move to a more synergetic relationship between the applications, the computing and storage facilities, and the NETWORK.
http://monalisa.caltech.edu