22
http://monalisa.caltech.ed http://alien.cern.ch Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework Catalin Cirstoiu, Costin Grigoras, Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Latchezar Betev, Alexandru Costan, Iosif Legrand Iosif Legrand 25/06/2007 25/06/2007 HPDC 2007 Workshop on Grid Monitoring HPDC 2007 Workshop on Grid Monitoring Monterey, California Monterey, California

Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

  • Upload
    saber

  • View
    50

  • Download
    0

Embed Size (px)

DESCRIPTION

Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework. Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007 HPDC 2007 Workshop on Grid Monitoring Monterey, California. Contents. - PowerPoint PPT Presentation

Citation preview

Page 1: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

http://monalisa.caltech.edu

http://alien.cern.ch

Monitoring, Accounting and Automated Decision Support

for the ALICE Experiment Based on the MonALISA Framework

Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif LegrandAlexandru Costan, Iosif Legrand

25/06/200725/06/2007HPDC 2007 Workshop on Grid MonitoringHPDC 2007 Workshop on Grid Monitoring

Monterey, CaliforniaMonterey, California

Page 2: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 2

Contents Monitoring requirements MonALISA overview Application monitoring Monitoring architecture in AliEn

Jobs monitoring Traffic monitoring Services monitoring Nodes monitoring

Actions framework Feature snapshots

Page 3: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 3

Monitoring Requirements Global view of the entire distributed system

Least-intrusive As accurate as possible Best-effort data transport Minimizing the requirements for open ports

Providing Near real-time information Long-term history of aggregated data

On key parameters like System status Resource usage

Helping with Correlating events System debugging Generating reports

Taking automated actions based on the monitored data

Page 4: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 4

Data Store

MonALISA Overview MonALISA is a dynamic distributed framework Collects any type of information from different systems Aggregates and analyzes it in near-real time Provides support for automated control decisions and global

optimization of workflows in complex distributed systems.

Data CacheService & DB

Configuration Control (SSL)

Predicates & Agents

Data (via ML Proxy)

Applications Java Client(other service)

Agents Filters DataModules

WS Client(other service)

WebService

WSDLSOAP

LookupService

LookupService

RegistrationDiscovery

Postgres MySQL

Page 5: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 5

ML Discovery System & Services Hierarchical structure of loosely coupled services Independent & autonomous entities able to

Publish their existence Discover other available Jini-enabled services Use a dynamic set of proxies to cooperate with them

Network of JINI-LUSs Secure & Public

MonALISA services

Proxies

Clients, Repositories, HL services

Global Services orGlobal Services orClientsClients

Dynamic load balancing Dynamic load balancing Scalability & ReplicationScalability & ReplicationSecurity AAA for ClientsSecurity AAA for Clients

Distributed System Distributed System for gathering and for gathering and Analyzing InformationAnalyzing Information

Distributed Dynamic Distributed Dynamic Discovery-based on a lease Discovery-based on a lease Mechanism and RENMechanism and REN

Agents

Page 6: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 6

ApMon – Application Monitoring

MonALISAService

MonALISAService

ApMon

ApMon

APPLICATION

APPLICATION

MonitoringData

UDP/XDR

Mbps_out: 0.52 Status: reading

App. Monitoring

MB_inout: 562.4

ApMonConfig

parameter1: value parameter2: value

App. Monitoring

...

Time;IP;procIDMonitoring

Data

UDP/XDR

MonitoringData

UDP/XDR

load1: 0.24 processes: 97

System Monitoring

pages_in: 83

MonALISA

hostsConfig Servlet dynamic

reloading

ApMon configuration generated automatically by a servlet / CGI script

0

10

20

30

40

50

60

70

0 1000 2000 3000 4000 5000 6000

Messages per second

Mon

ALI

SA C

PU U

sage

(%)

No Lost Packages

Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services

High comm. performance Flexible Accounting Sys Mon

Page 7: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 7

Long HistoryDB

Monitoring architecture in AliEn

http://pcalimonitor.cern.ch:8889/LCG Tools

MonALISA @Site

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

MonALISA @CERN

MonALISA

LCG Site

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn TQ

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn IS

ApMon

AliEn Optimizers

ApMon

AliEn Brokers

ApMon

MySQLServers

ApMon

CastorGridScripts

ApMon

APIServices

MonaLisaMonaLisaRepositoryRepository

Aggregated Data

rss

vsz

cputime

run

time

jobslots

free

spac

e

nr. o

ffil

es

openfiles

Queued

JobAgents

cpuksi2k

jobstatus

disk

used

processes loadnet

In/o

utjobsstatus sockets

migratedmbytes

active

sessions

MyProxy

status

AlertsActions

Page 8: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 8

Job status monitoring Global summaries

For each/all conditions For each/all sites For each/all users Running & cumulative

Error status From job agents From central services Real-time map view Integrated pie charts History plots

Page 9: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 9

Real-time Map

Page 10: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 10

Integrated Pie Charts

Page 11: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 11

History Plots, Annotations

Page 12: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 12

Job Resource Usage Cumulative parameters

CPU Time & CPU KSI2K Wall time & Wall KSI2K Read & written files Input & output traffic (xrootd)

Running parameters Resident memory Virtual memory Open files Workdir size Disk usage CPU usage

Aggregated per site

Page 13: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 13

Job Network Traffic Based on the xrootd transfer

from every job Aggregated statistics for

Sites (incoming, outgoing, site to site, internal)

Storage Elements (incoming, outgoing)

Of Read and written files Transferred MB/s

Page 14: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 14

Individual job tracking Based on AliEn shell cmds.

top, ps, spy, jobinfo, masterjob Using the GUI ML Client

Status, resource usage, per job

Page 15: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 15

AliEn & LCG Services monitoring AliEn services

Periodically checked PID check + SOAP call Simple functional tests SE space usage Efficiency

LCG environment and tools Integrating the VoBOX tests previously run by ML within the SAM framework

Proxy lifetime, gsiscp, LCG CE/SE, Job submission, BDII, Local catalog, software area etc. Error messages in case of failure Efficiency ML Alerts are used for problems notification

.

Page 16: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 16

FTD/FTS Monitoring Status of the transfers Transfer rates Success/failures Efficiency via ARDA

Experiment Dashboard

Page 17: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 17

VOBox/Head node monitoring Machine parameters, real-time & history

Load, memory & swap usage, processes, sockets

Page 18: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 18

Actions framework Based on monitoring

information, actions can be taken in ML Service ML Repository

Actions can be triggered by Values above/below given

thresholds Absence/presence of values Correlation between multiple

values Possible actions types

Alerts e-mail Instant messaging RSS Feeds

External commands Event logging

ML ML RepositoryRepository

ML ServiceML Service

ML ServiceML Service

Actions based onActions based onglobal informationglobal information

Actions based onActions based onlocal informationlocal information

• Traffic• Jobs• Hosts• Apps

• Temperature• Humidity• A/C Power• …

SensorsSensors Local Local decisionsdecisions

Global Global decisionsdecisions

Page 19: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 19

Alerts and actions

MySQL daemon is automatically restartedwhen it runs out of memoryTrigger: threshold on VSZ memory usage

ALICE Production jobs queue is automaticallykept full by the automatic resubmissionTrigger: threshold on the number of aliprod waiting jobs

Administrators are kept up-to-date on the services’ statusTrigger: presence/absence of monitored information

Page 20: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 20

Fact figures Raw parameters when running 4K Jobs

Unique data series: 300K with frequency 1-15 minutes Message rate: 16K / minute

Site aggregated parameters Message rate: 2K / minute Bandwidth rate: 300Kbps

Repository DB Size: 70 GB ~ 700M records Data Reduction Schema

2 Months with 2 minutes bins 1 Year with 30 minutes bins ~Forever with 2 hours bins

Response time App -> ML Service – network speed ML Service -> ML Clients

Subscribed parameters – network speed One shot requests (history requests) – ~5 seconds

Repository dynamic history requests – ~300ms / page No incoming ports are required

Page 21: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 21

Summary The MonALISA framework is used as a primary

monitoring tool for the ALICE Grid since 2004 Presently the system is used for monitoring of all

(identified) services, jobs and network parameters necessary for the Grid operation and debugging

The add-on tools for automatic events notification allow for more efficient reaction to problems

The framework design and flexibility answers all requirements for a monitoring system

The accumulated information allows to construct and implement automated decision making algorithms, thus increasing further the efficiency of the Grid operations

Page 22: Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007

25/06/2007 [email protected] 22

Thank you!

Questions?

http://alien.cern.ch http://monalisa.caltech.edu