Iosif Legrand, California Institute of Technology
Monitoring, Control and Optimization in Large Distributed Systems
ECSAC, August 2009, Veli Lošinj
Monitoring Distributed Systems
An essential part of managing large-scale, distributed data-processing facilities is a monitoring system able to track computing facilities, storage systems, networks, and a very large number of applications running on these systems in near real time.
The monitoring information gathered from all the subsystems is essential for design, debugging, accounting, and the development of "higher-level services" that provide decision support and some degree of automated decisions, and for maintaining and optimizing workflow in large-scale distributed systems.
Monitoring Information Is Necessary for System Design, Control, Optimization, Debugging and Accounting
[Diagram: near-real-time monitoring information feeds computing models (modeling & simulation), debugging, control and operational support (alarms), optimization algorithms, accounting, and the creation of resilient distributed systems.]
The LHC Data Grid Hierarchy: The Need for Distributed Computing (MONARC)
[Diagram: the experiment delivers ~PByte/sec to the online system, which sends ~150-1500 MBytes/sec to the Tier 0+1 CERN Center (PBs of disk; tape robot). CERN connects at 10-40 Gbps to the Tier 1 centers (FNAL, IN2P3, INFN, RAL), which hold tens of petabytes by ~2010 as physics data cache. Tier 1 centers connect at ~10 Gbps to the Tier 2 centers, Tier 2 at 1 to 10 Gbps to the Tier 3 institutes, and Tier 4 workstations attach at ~1-10 Gbps. In total: 11 Tier1 and 120+ Tier2 centers.]
Communication in Distributed Systems
Distributed Object Systems: CORBA, DCOM, RMI
[Diagram: an "IDL" compiler generates a stub for the client and a skeleton for the server; the client finds the service through a lookup service.]
- The stub is linked to the client: the client must know about the service from the beginning and needs the right stub for it.
- The server and the client code must be created together.
- Different types of dedicated protocols are used.
Distributed Object Systems: Web Services (WSDL/SOAP)
[Diagram: the client obtains the service interface (WSDL) from a lookup service and talks to the server over SOAP.]
- The client can dynamically generate the data structures and the interfaces for using remote objects, based on WSDL.
- Platform independent.
- Large overhead; based on stateless connections.
Mobile Code and Distributed Services
Acts as a true dynamic service and provides the necessary functionality to be used by any other services that require such information:
- a mechanism to dynamically discover all the "service units"
- remote event notification for changes in any system
- a lease mechanism for each registered unit
Based on Java dynamic code loading.
[Diagram: the service registers a proxy with the lookup service; the client downloads the proxy and uses it to access the service.]
Services can be used dynamically:
- Remote services: proxy == RMI stub
- Mobile agents: proxy == the entire service
- "Smart proxies": the proxy adjusts to the client
- Any protocol well suited to the application
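The proxy-and-lease pattern above can be sketched in a few lines. This toy Java sketch is illustrative only (the class and method names are my own, not the Jini or MonALISA API): services register a proxy under a lease, and clients discover it through a shared interface without linking against the implementation in advance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of a Jini-style lookup service: services register a proxy under
// a lease; clients discover the proxy and call it through a shared interface.
public class LookupSketch {

    // The only thing client and service share: a small service interface.
    public interface MonitoringUnit {
        double read(String parameter);
    }

    // A registration holds the downloadable proxy and its lease expiry time.
    public record Registration(MonitoringUnit proxy, long leaseExpiryMillis) {}

    private final Map<String, Registration> registry = new ConcurrentHashMap<>();

    // Service side: register a proxy with a lease that must be renewed.
    public void register(String name, MonitoringUnit proxy, long leaseMillis, long now) {
        registry.put(name, new Registration(proxy, now + leaseMillis));
    }

    // Client side: discover a live proxy; unrenewed leases make services vanish.
    public MonitoringUnit discover(String name, long now) {
        Registration r = registry.get(name);
        if (r == null || r.leaseExpiryMillis() < now) return null; // lease expired
        return r.proxy();
    }
}
```

A real Jini lookup service additionally ships the proxy's bytecode to the client (dynamic code loading), which is what lets the proxy be an RMI stub, a smart proxy, or the entire mobile service.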
The MonALISA Framework
MonALISA is a dynamic, distributed service system capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems.
The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing, agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.
The MonALISA Architecture
[Diagram, top to bottom:]
- Regional or global high-level services, repositories & clients
- Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
- MonALISA services: a distributed system for gathering and analyzing information, based on mobile agents; customized aggregation, triggers, actions
- Network of JINI lookup services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
A fully distributed system with no single point of failure.
MonALISA Service & Data Handling
[Diagram: monitoring modules (dynamic loading, push and pull) collect any type of information; filters/triggers and agents process it; the service keeps a data cache and a database store (e.g. Postgres). Applications, clients or higher-level services access the data through predicates & agents (via the ML proxy); WS clients use the web service (WSDL/SOAP); registration and discovery go through the lookup services; configuration control runs over SSL.]
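The "dynamic loading, push and pull" idea can be sketched as follows. This is an illustrative Java sketch, not the real MonALISA module API: modules are instantiated from a class name at runtime and polled, while instrumented applications push values directly into the cache.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the actual MonALISA API) of a pluggable
// monitoring module and a latest-value data cache.
public class ModuleSketch {

    public interface MonitoringModule {
        Map<String, Double> collect(); // one measurement cycle
    }

    private final Map<String, Double> cache = new HashMap<>();

    // Dynamic loading: instantiate a module from its class name at runtime.
    public MonitoringModule load(String className) throws Exception {
        return (MonitoringModule) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    // Pull: the service periodically polls the module for fresh values.
    public void pull(MonitoringModule m) { cache.putAll(m.collect()); }

    // Push: an instrumented application reports a value on its own.
    public void push(String param, double value) { cache.put(param, value); }

    public Double latest(String param) { return cache.get(param); }
}
```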
Registration / Discovery, Admin Access and AAA for Clients
[Diagram: MonALISA services register with the lookup services using signed certificates; clients (or other services) discover them through the same lookup services. Data, filters & agents flow through service proxy multiplexers. Admin access uses an SSL connection with trust keystores on both sides; AAA services handle client authentication for applications.]
Monitoring Grid Sites, Running Jobs, Network Traffic, and Connectivity
[Screenshots: topology, running jobs, and accounting views.]
Monitoring the Execution of Jobs and the Time Evolution
[Diagram: submit a job described as a DAG; split jobs (Job → Job1, Job2, Job3; Job3 → Job31, Job32) and follow the "lifelines" of each job over time.]
Monitoring CMS Jobs Worldwide
CMS is using MonALISA and ApMon to monitor all the production and analysis jobs. This information is then used in the CMS dashboard frontend.
- Collects monitoring data at rates of more than 1000 values per second
- Peaks of more than 1500 jobs reporting concurrently to the same server
- Service uptime > 150 days of continuous operation without any problems
- Collected ~4.5 × 10^9 monitoring values in the first half of 2009
- Loss of UDP messages (values) is less than 5 × 10^-6
[Chart: rate of collected monitoring values (up to ~1000/s) and total collected values (~5 × 10^9), April-June 2009; the computer that runs ML at CERN was replaced during this period.]
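The idea behind this ApMon-style instrumentation can be sketched as follows: applications serialize (cluster, node, parameter, value) tuples and fire them at a MonALISA service as UDP datagrams, accepting rare loss. The wire format and names below are invented for illustration; the real ApMon library uses its own binary (XDR) encoding.

```java
// Illustrative sketch of fire-and-forget UDP monitoring, ApMon-style.
// The "|"-separated text format is an assumption for illustration only.
public class ApMonSketch {

    // Encode one monitoring value into a datagram payload.
    public static byte[] encode(String cluster, String node, String param, double value) {
        return (cluster + "|" + node + "|" + param + "|" + value)
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    // Fire-and-forget send: UDP is lossy, which is acceptable for monitoring
    // (the CMS deployment above measured a loss rate below 5e-6).
    public static void send(java.net.InetAddress host, int port, byte[] payload)
            throws java.io.IOException {
        try (java.net.DatagramSocket s = new java.net.DatagramSocket()) {
            s.send(new java.net.DatagramPacket(payload, payload.length, host, port));
        }
    }
}
```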
Organize and structure the monitoring information.
Monitoring Architecture in ALICE
[Diagram: ApMon instruments the AliEn Job Agents, CE, SE, TQ, IS, Optimizers, Brokers, Cluster Monitors, MySQL servers, CastorGrid scripts and API services. A MonALISA service at each site and a MonALISA service at CERN collect this data, together with LCG tools at the LCG sites, and feed the MonALISA repository (aggregated data, long-history DB). Monitored parameters include rss, vsz, cputime, run time, job slots, free space, number of open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, sockets, migrated MBytes, active sessions and MyProxy status; the repository generates alerts and actions.]
ALICE: Global Views, Status & Jobs
http://pcalimonitor.cern.ch
ALICE: Job Status – History Plots
ALICE: Resource Usage Monitoring
- Cumulative parameters: CPU time & CPU KSI2K, wall time & wall KSI2K, read & written files, input & output traffic (xrootd)
- Running parameters: resident memory, virtual memory, open files, workdir size, disk usage, CPU usage
- Aggregated per site
ALICE: Job Agents Monitoring
- From the Job Agent itself: requesting job, installing packages, running job, done, error statuses
- From the Computing Element: available job slots, queued Job Agents, running Job Agents
Two levels of decisions:
local (autonomous),
global (correlations).
Actions triggered by:
values above/below given thresholds,
absence/presence of values,
correlations between any values.
Action types:
alerts (emails/instant msg/atom feeds),
running an external command,
automatic charts annotations in the repository,
running custom code, like securely ordering a ML service to (re)start a site service.
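The trigger types above can be sketched as predicates over incoming monitoring samples. This minimal Java sketch is illustrative, not the actual MonALISA trigger API (the `Sample` record and method names are assumptions):

```java
import java.util.List;

// Minimal sketch of the trigger logic described above: actions fire on
// values above/below a threshold or on the absence of fresh values.
public class TriggerSketch {

    public record Sample(String param, double value, long timestampMillis) {}

    // Threshold trigger: fire when the latest value exceeds the limit.
    public static boolean aboveThreshold(Sample s, double limit) {
        return s.value() > limit;
    }

    // Absence trigger: fire when no value arrived within the timeout window.
    public static boolean absent(List<Sample> history, long now, long timeoutMillis) {
        return history.isEmpty()
                || now - history.get(history.size() - 1).timestampMillis() > timeoutMillis;
    }
}
```

Correlations between values (the global-decision case) would combine several such predicates over samples from different services.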
Local and Global Decision Framework
[Diagram: sensors (traffic, jobs, hosts, apps; temperature, humidity, A/C power, ...) feed ML services, which take local, autonomous decisions based on local information; global ML services correlate the data and take global decisions based on global information.]
ALICE: Automatic Job Submission, Restarting Services
- The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage.
- The ALICE production jobs queue is kept full by the automatic submission. Trigger: threshold on the number of aliprod waiting jobs.
- Administrators are kept up to date on the services' status. Trigger: presence/absence of monitored information.
Automatic Actions in ALICE
ALICE is using the monitoring information to automatically:
- resubmit error jobs until a target completion percentage is reached
- submit new jobs when necessary (watching the task queue size for each service account): production jobs, and RAW data reconstruction jobs for each pass
- restart site services whenever tests of VoBox services fail but the central services are OK
- send email notifications / add chart annotations when a problem was not solved by a restart
- dynamically modify the DNS aliases of central services for efficient load balancing
Most of the actions are defined by configuration files of only a few lines.
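As a purely hypothetical illustration of such a few-line action definition (the real MonALISA configuration syntax is not shown in these slides, so every key name below is invented):

```
# Hypothetical example, NOT the real MonALISA syntax:
# restart the MySQL daemon when its virtual memory exceeds a threshold
trigger.name      = mysql-vsz
trigger.parameter = mysql_vsz_mb
trigger.condition = value > 3500
trigger.action    = restart-service mysqld
trigger.notify    = admin@example.org
```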
US LHCNet status report, Artur Barczyk, 04/16/2009
Together with ESnet, US LHCNet provides highly resilient data paths for the US Tier1s.
The USLHCnet
- Advanced standards: dynamic circuits
- CIENA Core Directors
- Mesh protection
- Equipment and link redundancy
- Also Internet2 and SINET3 (Japan)
Monitoring Links Availability: Very Reliable Information
[Chart: per-link availability for AMS-GVA (GEANT), CHI-NYC (Qwest), GVA-NYC (GC), GVA-NYC (Colt), CHI-GVA (Qwest), CHI-GVA (GC) and AMS-NYC (GC), color-coded in bands 0-95%, 95-97%, 97-98%, 98-99%, 99-100% and 100%; measured values range from 96.6% to 99.9%.]
Monitoring USLHCnet Topology
Topology & status & peering; real-time topology for L2 circuits.
USLHCnet: Traffic on Different Segments
USLHCnet: Accounting for Integrated Traffic
Alarms and Automatic Notifications for USLHCnet
The UltraLight Network
[Map of the UltraLight network, including BNL ESnet in/out.]
Monitoring Network Topology (L3), Latency, Routers
Real-time topology discovery & display of networks, AS and routers.
Available Bandwidth Measurements
Embedded Pathload module.
EVO: Real-Time Monitoring for Reflectors and the Quality of All Possible Connections
EVO: Creating a Dynamic, Global Minimum Spanning Tree to Optimize the Connectivity
Given a weighted connected graph G = (V, E) with n vertices and m edges, the goal is the spanning tree T minimizing w(T) = Σ_{(u,v)∈T} w(u,v). The quality of connectivity between any two reflectors is measured every second, and a minimum spanning tree with additional constraints is rebuilt in near real time.
The result is a resilient overlay network that optimizes real-time communication.
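The MST computation itself can be sketched with Prim's algorithm (one standard choice; the slides do not say which algorithm EVO uses). Reflectors are vertices, the measured link quality is the edge weight, and the overlay routes traffic along the tree minimizing w(T). The dense weight-matrix representation is an assumption for illustration.

```java
// Minimal Prim's-algorithm sketch of the MST computation described above.
public class MstSketch {

    // Dense symmetric weight matrix; w[i][j] = measured cost between i and j.
    public static double mstWeight(double[][] w) {
        int n = w.length;
        boolean[] inTree = new boolean[n];
        double[] best = new double[n];          // cheapest edge into the tree
        java.util.Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0;
        double total = 0;
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++)         // pick cheapest vertex not in tree
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++)         // relax edges out of u
                if (!inTree[v] && w[u][v] < best[v]) best[v] = w[u][v];
        }
        return total;
    }
}
```

Because the weights change every second, EVO recomputes the tree continuously in near real time.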
Dynamic MST to Optimize the Connectivity for Reflectors
Frequent measurements of RTT, jitter, traffic and lost packets; the MST is recreated in ~1 s in case of communication problems.
EVO: Optimize How Clients Connect to the System, for Best Performance and Load Balancing
FDT – Fast Data Transfer
FDT is an application for efficient data transfers.
Easy to use; written in Java and runs on all major platforms.
It is based on an asynchronous, multi-threaded system which uses the NIO library and is able to:
stream continuously a list of files
use independent threads to read and write on each physical device
transfer data in parallel on multiple TCP streams, when necessary
use appropriate size of buffers for disk IO and networking
resume a file transfer session
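One detail behind "transfer data in parallel on multiple TCP streams" is splitting the data into contiguous byte ranges, one per stream. This helper is an illustrative sketch; FDT's real internal partitioning is not shown in these slides.

```java
// Sketch: split a file of fileSize bytes into `streams` contiguous ranges.
public class PartitionSketch {

    public record Range(long offset, long length) {}

    public static Range[] partition(long fileSize, int streams) {
        Range[] out = new Range[streams];
        long base = fileSize / streams, rem = fileSize % streams, off = 0;
        for (int i = 0; i < streams; i++) {
            long len = base + (i < rem ? 1 : 0); // spread the remainder
            out[i] = new Range(off, len);
            off += len;
        }
        return out;
    }
}
```

Each range would then be read by its own thread and written to its own TCP channel, which also makes resuming a session natural: only unfinished ranges are re-sent.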
FDT – Fast Data Transfer
[Diagram: independent threads per device read the files into a pool of buffers (kernel space); the data crosses the network through data transfer sockets/channels into the receiver's pool of buffers, where threads restore the files from the buffers. A separate control connection handles authorization.]
FDT Features
The FDT architecture allows one to plug in external security APIs and use them for client authentication and authorization. Several security schemes are supported: IP filtering, SSH, GSI-SSH, Globus-GSI, SSL.
User-defined loadable modules for pre- and post-processing provide support for dedicated mass-storage systems, compression, etc.
FDT can be monitored and controlled dynamically by the MonALISA system.
FDT – Memory-to-Memory Tests in WAN
Test nodes: dual-core Intel Xeon CPUs @ 3.00 GHz, 4 GB RAM, 4 × 320 GB SATA disks, connected with 10 Gb/s Myricom NICs.
Results: ~9.0 Gb/s and ~9.4 Gb/s.
Disk-to-Disk Transfers in WAN
New York – Geneva (1U nodes with 4 disks): reads and writes on 4 SATA disks in parallel on each server; mean traffic ~210 MB/s, i.e. ~0.75 TB per hour.
CERN – Caltech (4U disk servers with 24 disks): reads and writes on two 12-port RAID controllers in parallel on each server; mean traffic ~545 MB/s, i.e. ~2 TB per hour.
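As a quick arithmetic check of the figures above (decimal units, 1 TB = 10^6 MB):

```java
// Convert a mean transfer rate in MB/s into TB per hour (decimal units).
public class RateSketch {
    public static double tbPerHour(double mbPerSec) {
        return mbPerSec * 3600 / 1_000_000.0;
    }
}
```

210 MB/s gives ~0.76 TB/hour and 545 MB/s gives ~1.96 TB/hour, matching the ~0.75 and ~2 TB figures quoted on the slide.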
Lustre read/write of ~320 MB/s between Florida and Caltech. Works with xrootd; interfaces to dCache using the dcap protocol.
Active Available Bandwidth Measurements Between All the ALICE Grid Sites
Active Available Bandwidth Measurements Between All the ALICE Grid Sites (2)
End-to-End Path Provisioning on Different Layers
[Diagram: between Site A and Site B, Layer 3 carries the default IP route, Layer 2 the VCAT and VLAN channels, and Layer 1 the optical path. Tasks: monitor the layout / set up circuits; monitor hosts & end-to-end paths / set end-host parameters; control transfers and bandwidth reservations; monitor interface traffic.]
Monitoring Optical Switches
Dynamic restoration of a lightpath if a segment has problems.
Monitoring the Topology and Optical Power on Fibers for Optical Circuits
Port power monitoring and controlling (Glimmerglass switch example).
"On-Demand", End-to-End Optical Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/". The MonALISA distributed service system monitors and controls the optical switches (via TL1, with a BOS agent); once the optical path is available, it configures the interfaces and starts the data transfer over the active light path, monitored in real time, with the regular IP path over the Internet as a fallback. A LISA agent on each end host sets up the network interfaces, TCP stack, kernel parameters and routes, telling the application e.g. "use eth1.2, ...".]
Creates an end-to-end path in < 1 s. Detects errors and automatically recreates the path in less than the TCP timeout.
Controlling Optical Planes: Automatic Path Recovery
[Diagram: light paths between CERN (Geneva) and Caltech (Pasadena) via Starlight, Manlan, USLHCnet and Internet2.]
"Fiber cut" simulations: the traffic moves from one transatlantic line to the other; the FDT transfer (CERN – Caltech) continues uninterrupted; TCP fully recovers in ~20 s.
[Chart: 4 fiber-cut emulations; 200+ MBytes/sec from a 1U node.]
"On-Demand" Dynamic Circuits: Channel and Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/"; after path or channel allocation, the interfaces are configured and the data transfer starts. Local VLANs are mapped to WAN channels or light paths, with the regular IP path as a fallback.]
Recommended: use two NICs, one for management and one for data, or bond two NICs to the same IP.
The Need for Planning and Scheduling for Large Data Transfers
[Chart: reading two data sets in parallel vs. sequentially; it is 2.5× faster to perform the two reading tasks sequentially.]
Dynamic Path Provisioning: Queueing and Scheduling
[Diagram: user requests go to scheduling, control and monitoring components, which interact with end-host agents and receive real-time feedback.]
- Channel allocation based on VO/priority [+ wait time, etc.]
- Create on demand an end-to-end path or channel and configure the end hosts
- Automatic recovery (rerouting) in case of errors
- Dynamic reallocation of throughput per channel, to manage priorities and control time to completion where needed
- Reallocate resources requested but not used
Dynamic Priority for FDT Transfers on Common Segments
[Chart: transfers with priorities 4, 2 and 8 sharing common segments.]
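One plausible reading of priority-based sharing on a common segment is proportional allocation: each transfer gets capacity × p_i / Σp. This policy is an assumption for illustration; the slides only state that transfers with priorities 4, 2 and 8 share the segments.

```java
// Sketch of priority-proportional bandwidth sharing on a common segment.
public class PrioritySketch {
    public static double[] shares(double capacity, int[] priorities) {
        int sum = 0;
        for (int p : priorities) sum += p;   // total priority weight
        double[] out = new double[priorities.length];
        for (int i = 0; i < priorities.length; i++)
            out[i] = capacity * priorities[i] / sum;
        return out;
    }
}
```

With a 1400 Mbps segment and priorities 4, 2 and 8, the shares would be 400, 200 and 800 Mbps, and they are rebalanced whenever a transfer joins or leaves.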
Bandwidth Challenge at SC2005: 151 Gb/s; ~500 TB total in 4 h.
FDT & MonALISA Used at SC2006
17.7 Gb/s disk-to-disk on a 10 Gb/s link used in both directions, from Florida to Caltech.
SC2006
[Charts: official BWC and hyper BWC results.]
SC2007: 80+ Gb/s disk to disk.
SC2008 – Bandwidth Challenge, Austin, TX
Using FDT & MonALISA. Storage-to-storage WAN transfers: bi-directional peak of 114 Gbps (sustained 110 Gbps) between Caltech, Michigan, CERN, FermiLab, Brazil, Korea, Estonia, Chicago, New York and Amsterdam.
SC2008 – Bandwidth Challenge, Austin, TX: CIENA – Caltech booths
FDT transfers over a Ciena OTU-4 standard link carrying a 100 Gbps payload (200 Gbps bidirectional) with forward error correction: a 10 × 10 Gbps multiplex over an 80 km fibre spool. Peak: 199.9 Gb/s; mean: 191 Gb/s. We transferred ~1 PB in ~12 hours.
The Eight Fallacies of Distributed Computing
It is fair to say that at the beginning of this project we underestimated some of the potential problems of developing a large distributed system in a WAN; indeed, the "eight fallacies of distributed computing" are very important lessons:
1) The network is reliable.
2) Latency is zero.
3) Bandwidth is infinite.
4) The network is secure.
5) Topology doesn't change.
6) There is one administrator.
7) Transport cost is zero.
8) The network is homogeneous.
MonALISA Usage
Major communities: ALICE, CMS, ATLAS, PANDA, EVO, LCG RUSSIA, OSG, MXG, RoEduNet, USLHCNET, ULTRALIGHT, Enlightened.
[Map: VRVS/EVO, ALICE, OSG and USLHCnet deployments.]
MonALISA today: running 24 × 7 at ~360 sites.
- Collecting ~1.5 million "persistent" parameters in real time, and 60 million "volatile" parameters per day
- Update rate of ~25,000 parameter updates/sec
- Monitoring >40,000 computers, >100 WAN links, >6,000 complete end-to-end network path measurements, and tens of thousands of grid jobs running concurrently
- Controls job summation, different central services for the grid, EVO topology, FDT, ...
- The MonALISA repository system serves ~6 million user requests per year.
Summary
Modeling and simulation are important for understanding how distributed systems scale and where their limits are, and they help to develop and test optimization algorithms.
The algorithms needed to create resilient distributed systems are not easy (handling the asymmetric information).
For data-intensive applications, it is important to move to a more synergetic relationship between the applications, the computing and storage facilities, and the NETWORK.
http://monalisa.caltech.edu