The new Fabric Management Tools in Production at CERN
Thorsten Kleinwort for CERN IT/FIO
HEPiX Autumn 2003, TRIUMF, Vancouver
Monday, October 20, 2003




20 October 2003 Thorsten Kleinwort

IT/FIO/FS

Contents

• Introduction to CERN’s Fabric Management: Concepts

• Framework for CERN’s Fabric Management: Tools

    • Configuration Mgmt
    • Software Mgmt
    • State Mgmt
    • Monitoring

Concepts: The Node

The Node is the manageable unit:
• Autonomous:
    • Local configuration files
    • Programs work locally
    • No external dependencies
    • No remote management scripts
• Adheres to LSB (Linux Standard Base):
    • Init scripts in /etc/init.d/ start daemons
    • Logfile directory /var/log, logrotate
    • Config directory /etc
    • (System) programs in /(s)bin/, /usr/(s)bin

Concepts: Node -> Cluster

• Same functionality of nodes -> cluster (but not necessarily same HW)
• Management tools enforce uniform setup
• Cluster size varies:
    • LXBATCH > 1000 nodes
    • LXPLUS ~ 70 nodes
    • LXMASTER (Batch master) = 2 nodes
• Critical servers replaced by service clusters with redundant nodes

Concepts: Principles

• Software installs/updates through RPM
• Configuration through one tool
• Configuration information through one interface
• Configuration information stored centrally
• Installation, configuration and maintenance automated, but steerable
• Reproducibility

Framework

[Figure: generic framework diagram. Components shown: node with Mon Agent, Cfg Agent and SW Agent; Monitoring Manager; Config Manager and Config Cache; SW Manager and SW Cache; Hardware Manager; State Manager.]

Framework

[Figure: framework diagram. Components shown: node with SW Agent, Cfg Agent and Mon Agent; CDB; CCM; Monitoring Manager; SW Manager and SW Cache; Hardware Manager; State Manager.]

Configuration (CDB & CCM)

CDB (Configuration Data Base):
• Development of EU Data Grid (WP4)
• CDB is the configuration data base
• Now ~ 1500 nodes, ~ 15 clusters
• ~ 3200 configuration templates to describe the nodes
• Creates one (XML) profile per node
• All information that is needed to install & run the nodes now included
• Currently 2 Linux versions: RH 7.3 & ES 2.1
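The step from templates to one XML profile per node can be pictured with a small sketch. It is only an illustration under stated assumptions: the templates are plain Python dicts rather than real CDB templates, and the element names (profile, cluster, hostname, os), the host name lxplus001 and the function build_profile are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for CDB templates: a cluster-wide template and a
# node-specific one that overrides the operating system version.
cluster_template = {"cluster": "LXPLUS", "os": "RH 7.3"}
node_template = {"hostname": "lxplus001", "os": "ES 2.1"}


def build_profile(*templates):
    """Merge templates in order (later ones win) into one XML profile."""
    merged = {}
    for template in templates:
        merged.update(template)
    profile = ET.Element("profile")
    for key in sorted(merged):
        ET.SubElement(profile, key).text = merged[key]
    return profile


profile = build_profile(cluster_template, node_template)
print(ET.tostring(profile, encoding="unicode"))
```

The point of the sketch is the shape of the data flow: many shared templates, one self-contained profile per node.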

CDB (cont’d)

Additional information to be added (merged from other sources):
• State information (-> SMS)
• Monitoring information (-> MSA)
• Vendor/Contract/Purchase information:
    • Need for encryption to store secure data

New, high-level interfaces are provided:
• “Add/Rename Node”
• Change node state

CDB (cont’d)

• Local caching on the node: CCM (Configuration Cache Manager):
    • In test phase, deployed on a few nodes
    • Runs a local daemon, which is notified when the node’s configuration information is modified
    • Avoids peaks on CDB web servers
• Besides XML profiles, new SQL interface:
    • Allows SQL queries on CDB
    • Needed for cross-machine view (e.g. “give me all nodes that belong to cluster X”)
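A cross-machine query of the kind quoted above might look like the following sketch. The table layout (a single nodes table with name and cluster columns), the host names and the helper nodes_in_cluster are assumptions made for illustration; the slides do not describe the real CDB schema, and an in-memory SQLite database stands in for it here.

```python
import sqlite3

# Hypothetical schema: one row per node, with the cluster it belongs to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, cluster TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO nodes (name, cluster) VALUES (?, ?)",
    [
        ("lxplus001", "LXPLUS"),
        ("lxplus002", "LXPLUS"),
        ("lxbatch001", "LXBATCH"),
    ],
)


def nodes_in_cluster(connection, cluster):
    """Return the sorted names of all nodes that belong to the given cluster."""
    rows = connection.execute(
        "SELECT name FROM nodes WHERE cluster = ? ORDER BY name", (cluster,)
    )
    return [name for (name,) in rows]


print(nodes_in_cluster(conn, "LXPLUS"))  # ['lxplus001', 'lxplus002']
```

A per-node XML profile answers "what is this node?"; a relational view answers "which nodes have this property?", which is why both interfaces are useful.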

Framework

[Figure: framework diagram. Components shown: node with SPMA, Cfg Agent and Mon Agent; CDB; CCM; Monitoring Manager; SWRep and SWRep Cache; Hardware Manager; State Manager.]

Software distribution (SPMA & SWRep)

SPMA (Software Package Management Agent):
• Development of EU Data Grid (WP4)
• The tool to install all software on the nodes
    • Uses RPM for SW distribution on Linux
    • Version for Solaris PKG package manager exists
• We install between 700 and 1000 RPMs per node
• Based on RPMT (enhancement of RPM)
• Crucial part of the framework

SPMA (cont’d)

• SPMA runs on every node (on demand)
• Can manage either a subset or all packages:
    • We manage all packages on all clusters but one, which is for development
    • Missing packages are added and
    • Unknown packages are removed
• Package list created from CDB, but SPMA is independent of CDB
• SPMA can roll back to earlier package versions
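The "missing packages are added, unknown packages are removed" behaviour amounts to comparing a desired package list with what is installed. The sketch below shows that comparison only; it is not SPMA's actual algorithm, and the function plan_changes, the package names and the version strings are hypothetical. Packages are modelled as (name, version) pairs so that several versions of one package can coexist and a rollback is just a desired list naming an older version.

```python
def plan_changes(desired, installed):
    """Compare two sets of (name, version) pairs.

    Returns (to_install, to_remove) as sorted lists of pairs.
    """
    to_install = sorted(desired - installed)  # missing packages are added
    to_remove = sorted(installed - desired)   # unknown packages are removed
    return to_install, to_remove


# Hypothetical example data.
desired = {("openssh", "3.6.1"), ("kernel", "2.4.20"), ("kernel", "2.4.21")}
installed = {("openssh", "3.5"), ("kernel", "2.4.20"), ("games", "1.0")}

to_install, to_remove = plan_changes(desired, installed)
print(to_install)  # [('kernel', '2.4.21'), ('openssh', '3.6.1')]
print(to_remove)   # [('games', '1.0'), ('openssh', '3.5')]
```

Because the plan depends only on the two lists, the same comparison works no matter where the desired list comes from, which fits the statement that SPMA is independent of CDB.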

SPMA & SWRep

SWRep (Software Repository):
• Client-server tool suite for storage of software packages
• Universal:
    • Linux RPM/Solaris PKG
    • Multiple versions: RH 7.3, RH ES 2.1, RH 10
• Management interface:
    • ACL mechanism to add packages
    • Package list automatically kept up-to-date in CDB

SPMA & SWRep (cont’d)

Addresses scalability:
• HTTP as SW distribution protocol
• Load-balanced server cluster
• SPMA run is randomly time-delayed within 10 minutes
• Pre-caching of SW packages on the node possible
• Currently installed on 1500 nodes
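The effect of the random start delay can be illustrated with a short simulation. This is a sketch under assumptions: the slides only say the run is delayed randomly within 10 minutes, so the uniform distribution, the function name start_delay and the fixed seed are choices made here for illustration.

```python
import random

MAX_DELAY_SECONDS = 10 * 60  # the 10-minute window mentioned above


def start_delay(rng=random):
    """Pick a random delay in seconds within the window (assumed uniform)."""
    return rng.uniform(0, MAX_DELAY_SECONDS)


# Simulate 1500 nodes that are all triggered at the same moment and count
# how many of them actually start in each minute of the window.
rng = random.Random(2003)  # fixed seed so that repeated runs agree
delays = [start_delay(rng) for _ in range(1500)]
starts_per_minute = [0] * 10
for delay in delays:
    starts_per_minute[min(int(delay // 60), 9)] += 1

print(starts_per_minute)
```

Instead of 1500 simultaneous requests, the servers see the load spread over the whole window, roughly 150 node starts per minute on average.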

Framework

[Figure: framework diagram. Components shown: node with SPMA, NCM and Mon Agent; CDB; CCM; Monitoring Manager; SWRep and SWRep Cache; Hardware Manager; State Manager.]

Configuration Tool (NCM)

NCM (Node Configuration Manager):
• Local configuration tool
• EU Data Grid (WP4) development
• First components have been (re-)written and are tested on production nodes
• Uses CDB for configuration information
• Has its first public release:
    • We have to transform all our SUE features into NCM components (~50)
    • Plan is to do this while migrating to next Linux release

Framework

[Figure: framework diagram. Components shown: node with SPMA, NCM and MSA; CDB; CCM; OraMon; SWRep and SWRep Cache; Hardware Manager; State Manager.]

Monitoring (MSA & OraMon)

LEMON (LHC Era Monitoring):
• EU Data Grid (WP4) development
• Client (MSA):
    • ~ 100 metrics are measured
    • Deployed on > 1500 nodes (more than currently managed by CDB)
    • Configuration to be put into CDB
• Server (OraMon):
    • ORACLE database as back end
    • Stores current values as well as history
    • User API (in C, PERL, PHP, TCL) in test phase

Framework

[Figure: framework diagram. Components shown: node with SPMA, NCM and MSA; CDB; CCM; OraMon; SWRep and SWRep Cache; HMS; SMS.]

State Management (SMS & HMS)

LEAF (LHC Era Automated Fabric):
• HMS (Hardware Management System), controls & tracks:
    • Node installation
    • Node move & reinstall (rename)
    • Node retirement
    • Node repairs (vendor calls)
• Remedy Workflow Application
• Will interface to CDB

HMS & SMS

SMS (State Management System):
• Allows node states to be set (in CDB)
• Validates state transitions
• Handles new machine arrivals (~400 in Nov)
• Uses SOAP to interface to CDB
• Working prototype
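Validating a state transition can be sketched as a lookup in a table of allowed moves. The state names (standby, production, maintenance), the table ALLOWED_TRANSITIONS and the function change_state are hypothetical; the slides do not list the real SMS states or rules, and the SOAP call that would record the new state in CDB is left out.

```python
# Hypothetical states and transitions; the real SMS rules are not given.
ALLOWED_TRANSITIONS = {
    "standby": {"production", "maintenance"},
    "production": {"maintenance"},
    "maintenance": {"production", "standby"},
}


def change_state(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current!r} -> {target!r} is not allowed")
    return target


state = change_state("standby", "production")
print(state)  # production
```

Keeping the rules in one table means that every tool that changes a node state goes through the same check.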

Tools:

[Figure: complete framework diagram. Components shown: node with SPMA, NCM and MSA; CDB; CCM; OraMon; SWRep and SWRep Cache; HMS; SMS. Caption: framework = QUATTOR + LEMON + LEAF.]

Tools: Examples

• Batch System LSF:
    • Upgrade 4.2 -> 5.1 on > 1000 nodes within 15 min, without stopping batch (with pre-caching)
• Kernel upgrade:
    • SPMA can handle multiple versions of the same package
    • This allows installation of the new kernel and the reboot into it to be separated in time
• Security upgrades:
    • All security upgrades are done by SPMA (~once a week):
        • SSH security upgrade
        • KDE upgrade (~400 MB per node)