
CERN - IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it

The LCG Distributed Database Infrastructure

Dirk Düllmann, CERN & LCG 3D
DESY Computing Seminar, 21 May 2007

Outline of the Talk

• Why databases and why distributed?
• LCG Distributed Database Project (LCG 3D)
• Building blocks for a scalable database service
  – database clusters
• Distribution techniques
  – Streams replication and distributed caching
• Database and replication monitoring and optimisation
• Experiment models and schedules
• Status and next steps

[Diagram (slide 3): conditions data flow. Detector subsystem configurations and conditions (Pixel, Strips, ECAL, HCAL, RPC, DT, ES, CSC, Trigger, DAQ, DCS), together with LHC data, the logbook, DDD, EMDB and calibration data, feed the Online Master Data Storage and the online subset of the Conditions DB at the experiment; after conditions formatting, they populate the offline reconstruction Conditions DB (offline subset, master copy) at the CERN Computer Centre, where the reconstruction conditions data set is created for Tier 0. Separate production, validation and development instances are maintained.]

Distributed Deployment of Databases (=3D)

• LCG initially provided an infrastructure for distributed access to file-based data and for file replication
• Physics applications (and grid services) depend crucially on relational databases for metadata and require similar services for databases
  – Physics applications and grid services use RDBMS
    • e.g. configuration, conditions, calibration, event tags, file and collection catalogues, production/transfer workflow
  – LCG sites already have experience in providing RDBMS services
• Goals for a common database project as part of LCG
  – increase the availability and scalability of LCG and experiment database components
  – allow applications to access databases in a consistent, location-independent way (see the sketch after this slide)
  – connect database services via data replication mechanisms
  – shared deployment and administration of this infrastructure during 24x7 operation
• Scope set by LCG PEB: databases online, offline and at LCG Tier sites
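To make the location-independence goal concrete, here is a minimal sketch (assuming Python with the cx_Oracle client): an application resolves a logical database name to a site-local physical connection string before connecting, so the same code runs unchanged at any tier. The lookup table, logical names and credentials below are hypothetical illustrations, not the real LCG/CORAL lookup mechanism.

    # Illustrative sketch only: resolve a logical database name to a
    # site-local physical connection before connecting. Names, credentials
    # and the dict-based lookup are hypothetical placeholders.
    import cx_Oracle  # assumes the Oracle client library is available

    # Hypothetical per-site mapping; the real infrastructure uses a
    # dedicated lookup layer rather than a hard-coded dict.
    LOCAL_SERVICE_MAP = {
        "conditions/ATLAS": ("atlas_reader", "secret", "t1-db.example.org/ATLR"),
        "lfc/LHCb":         ("lfc_reader",   "secret", "t1-db.example.org/LFCR"),
    }

    def connect(logical_name):
        """Open a connection via a logical, site-independent name."""
        user, password, dsn = LOCAL_SERVICE_MAP[logical_name]
        return cx_Oracle.connect(user, password, dsn)

    # Usage: the application never hard-codes which replica it talks to.
    # conn = connect("conditions/ATLAS")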

LCG 3D Service Architecture

• T0 – autonomous, reliable service
• Online DB – autonomous, reliable service
• T1 – DB backbone
  – all data replicated
  – reliable service
• T2 – local DB cache
  – subset of the data
  – only a local service
• Read-only access at Tier 1/2 (at least initially)
• Distribution techniques: Oracle Streams, http cache (Squid), cross-DB copy and MySQL/SQLite files

[Diagram legend: O = Oracle database (DB) cluster, S = Squid, F = FroNTier, M = MySQL/SQLite DB file.]

Database Services for Physics at CERN in 2002

• 24x7 service based on Oracle 9i / Solaris
• Most experiment databases hosted on
  – a 2-node (4-CPU) cluster on Solaris with 6 GB of RAM
  – 18 disks of 32 GB each
  – Veritas Volume Manager
• But also an increasing number of Linux disk servers for LHC and non-LHC data (COMPASS, HARP)
  – locally attached disks
  – spread all over the CERN computing centre
  – several different versions of Linux and Oracle

The LCG Distributed Database Infrastructure -

How to build a highly available and scalable DB service?

[email protected]

The LCG Distributed Database Infrastructure -

How to build a highly available and scalable DB service?

• Scaling Up

[email protected]

The LCG Distributed Database Infrastructure -

How to build a highly available and scalable DB service?

• Scaling Up

[email protected]

The LCG Distributed Database Infrastructure -

How to build a highly available and scalable DB service?

• Scaling Up

• Scaling out -> clustering

Storage Area Network

[email protected]

The LCG Distributed Database Infrastructure -

How to build a highly available and scalable DB service?

• Scaling Up

• Scaling out -> clustering

Storage Area Network

[email protected]

Chosen Architecture: Database Cluster on Linux

• Oracle Real Application Clusters (RAC) on commodity hardware
  – redundancy at all levels (CPU, storage, networking)
  – Oracle 10gR2 and Red Hat ES 4
  – Oracle ASM as volume manager

[Diagram: a RAC cluster (RAC1, nodes lxs5030, lxs5033, lxs5037, lxs5038) connected to the CERN LAN via a Gb Ethernet switch and to Infortrend storage through a SANbox 5200-A 2 Gb Fibre Channel switch.]

(A sketch of a client connection to such a cluster follows after this slide.)
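As a hedged illustration of how a client typically exploits such a RAC setup, the sketch below (Python with cx_Oracle, hypothetical host and service names) builds an Oracle Net connect descriptor listing several cluster nodes with client-side load balancing and connect-time failover enabled; this is not the actual CERN configuration.

    # Sketch: connect to a multi-node RAC through an Oracle Net descriptor
    # with client-side load balancing and connect-time failover.
    # Host names, service name and credentials are placeholders.
    import cx_Oracle

    nodes = ["rac-node1.example.org", "rac-node2.example.org"]  # hypothetical
    address_list = "".join(
        "(ADDRESS=(PROTOCOL=TCP)(HOST=%s)(PORT=1521))" % h for h in nodes
    )
    dsn = (
        "(DESCRIPTION="
        "(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on)" + address_list + ")"
        "(CONNECT_DATA=(SERVICE_NAME=physics_db.example.org)))"
    )

    conn = cx_Oracle.connect("app_user", "app_password", dsn)
    print(conn.version)  # server version, as a simple connectivity check
    conn.close()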

ASGC hardware configuration (slide: Hurng-Chun, ASGC)

• Four servers
  – CPU: Intel Pentium-D 830, 3.0 GHz
  – memory: 2 GB (ECC)
  – local disk: S-ATA2, 80 GB, 7200 rpm
  – Fibre Channel HBA: LSI 7102XP-LC (PCI-X) x 1
• SAN switch: Silkworm 3850, 16 ports
• Backend RAID subsystem: StorageTek B280
  – dual-channel, redundant controller
• Each RAC group shares 1.7 TB exported from the SAN
  – one RAC group for 3D, one RAC group for other LCG services

Beginning of 2007

• ~220 database CPUs
• ~440 GB of RAM (shared DB cache)
• ~1100 disks
• Some 15 DB clusters deployed
  – one production cluster per LHC experiment for offline applications
  – ATLAS online and COMPASS clusters
  – number of nodes varying from 4 to 8
• Several validation and test clusters
  – 1 or 2 per experiment (typically 2-node clusters)
  – some hardware allocated for internal use/tests

The LCG Distributed Database Infrastructure -

Application Validation and Optimisation• Database clusters can amplify effects of poor application

design - need scalability test during application release cycle

Development DB service Validation DB service Production DB service

• Significant fraction of DB administrator work• needs close collaboration with application developers

• Focus on application with large resource consumption:– File transfer systems (FTS, PhEDEx)– Grid catalogs (LFC)– Experiment Dashboards– Condition data (COOL)– Event collections (TAGS)

Evolving Database Hardware

• Need to continuously replace database h/w with next-generation CPUs and storage
  – while maintaining as few s/w configurations as possible: only a single Linux and Oracle version
• CPU side - continued performance increase via multi-core
  – recently tested dual quad-core CPUs with 16 GB of RAM
    • performance similar to a 5-node RAC built from the currently used hardware
  – multi-core works well for database servers
    • but implies a move to 64-bit Linux and Oracle
  – may run into memory bandwidth limitations with many cores
• Storage side - slower performance increase
  – sizing for I/O operations per second rather than just volume increase -> disk numbers increase
  – investigating higher-performance disks (e.g. Raptor) and other storage technologies (solid state disks)

Database Replication and Distributed Caching

(The following FroNTier material is taken from the CMS Frontier Report, 13 Sept 2006.)

FroNTier “Launchpad” software

• Squid caching proxy
  – load shared with round-robin DNS
  – configured in “accelerator mode”
  – peer-to-peer caching
  – “wide open frontier”*
• Tomcat - standard
• FroNTier servlet
  – distributed as a “war” file
    • unpack it in the Tomcat webapps dir
    • change 2 files if the name is different
  – one xml file describes the DB connection

[Diagram: round-robin DNS in front of server1, server2 and server3, each running Squid and a FroNTier servlet in Tomcat, in front of the DB.]

* In the past, registration was required so that the IP/mask could be added to the Access Control List (ACL) at CERN. It was recently decided to run in “wide-open” mode so installations can be tested without registration.

(A minimal client-side sketch of access through a Squid cache follows after this slide.)
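As a rough sketch of the caching idea only (not the actual FroNTier client protocol), the snippet below sends an ordinary HTTP request through a Squid proxy; identical requests from other clients at the same site can then be served from the cache instead of reaching the Tomcat/servlet/database backend. The proxy and server URLs are placeholders.

    # Sketch only: route an HTTP request through a Squid proxy, as a
    # FroNTier-style client would. URLs are placeholders; the real
    # FroNTier client encodes database queries into the request URL.
    import urllib.request

    proxy = urllib.request.ProxyHandler(
        {"http": "http://squid.example.org:3128"}  # hypothetical site-local Squid
    )
    opener = urllib.request.build_opener(proxy)

    url = "http://frontier.example.org:8000/Frontier/some-query"  # placeholder
    with opener.open(url, timeout=30) as response:
        payload = response.read()
        # A repeat of the same request would now typically be a Squid
        # cache hit and never touch the central database.
        print(len(payload), "bytes received")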

How to keep Databases up-to-date? Asynchronous Replication via Oracle Streams (slide: Eva Dafonte Perez, LCG 3D Status)

[Diagram, shown as an animation in the original slides: a change committed at the CERN source database, e.g. "insert into emp values (03, 'Joan', ...)", is captured from the redo logs as a logical change record (LCR), propagated through staging queues to the Tier 1 destination databases (RAL, CNAF, Sinica, IN2P3, FNAL, BNL), and applied there by an apply process. The three Streams stages are capture, propagation and apply.]
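To make the capture / propagation / apply terminology concrete, here is a toy, in-memory Python model of the flow of logical change records; it is purely a conceptual illustration of the data flow, not how Oracle Streams is implemented or configured.

    # Toy model of asynchronous, Streams-style replication: a change at the
    # source is captured as a logical change record (LCR), propagated to
    # per-site queues and applied at each destination. Purely illustrative.
    from queue import Queue

    destinations = {"RAL": Queue(), "CNAF": Queue(), "BNL": Queue()}
    replicas = {site: [] for site in destinations}      # destination "tables"

    def capture(statement, values):
        """Turn a committed change into an LCR (here: a simple dict)."""
        return {"statement": statement, "values": values}

    def propagate(lcr):
        """Queue the LCR for every destination (asynchronous in reality)."""
        for q in destinations.values():
            q.put(lcr)

    def apply_pending(site):
        """Apply all queued LCRs at one destination database."""
        q = destinations[site]
        while not q.empty():
            replicas[site].append(q.get()["values"])     # "execute" the change

    # The example change from the slide:
    propagate(capture("insert into emp values (:1, :2)", (3, "Joan")))
    for site in destinations:
        apply_pending(site)
    print(replicas)   # every site now holds the row (3, 'Joan')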

Database s/w Licenses and Support

• Tier 1 licenses acquired
  – experiment and grid service requests collected for all T1 sites
  – license payment and signed agreement forms received
• An LCG support ID has been created with Oracle
• Accounts for Oracle support (MetaLink) created for all T1 database contacts
  – access to Oracle s/w, patches, security upgrades and the problem database
  – allows problem reports to be filed directly with Oracle

Tier 0 setup and operational procedures

• The main Streams operations have been included in the Tier 0 DBA operations manual
• Automated alerts for database or Streams problems from the 3D monitoring
  – integrated with GGUS (grid user support)
  – so far: handling of Streams problems during working hours only
• A downstream capture setup has been installed as part of the planned Tier 0 service extension
  – the log-mining step runs on a separate box, offloading the source database
  – same h/w and s/w setup as the DB server nodes

Further Decoupling between Databases (slide: Eva Dafonte Perez, LCG 3D Status)

[Diagram: redo log files are copied from the source database (CERN RAC) to a downstream database at CERN, which runs one capture process and queue per target site (e.g. CNAF, FNAL) together with the propagation jobs to the destination sites.]

• Objectives
  – remove the impact of capture from the Tier 0 database
  – isolate the destination sites from each other
• Pairing a capture process + queue with each target site
  – requires a big Streams pool size
  – produces redundant events (multiplied by the number of queues)

Database Backup / Recovery with Streams

• Collected DB / Streams recovery scenarios
  – recovery after T1 data loss - OK
    • RAL recovered and re-synchronised
    • replication from CERN to CNAF continued unaffected
  – recovery after T0 data loss - OK
  – coordinated point-in-time recovery - OK
• Service procedure documented and validated with two Tier 1 sites and CERN
• Full recovery exercise with all sites scheduled for the 3D Workshop at CNAF, June 12-13
  – recover a Tier 1 database from tape
  – resynchronise the replication streams
  – while the T0 database is being populated

Streams Performance Tuning

• Recent focus: the Wide Area Network
  – remote sites are significantly affected by latency
    • e.g. ASGC with a 300 ms round-trip time
• Studied the TCP-level data flow between CERN and Taiwan
• Resulting optimisations (a sketch of the buffer-size reasoning follows after this slide)
  – increased TCP buffer size (OS level)
  – decreased frequency of acknowledgements between source and destination DB
• Total improvement of a factor of 10
  – now: 4000 logical change records / sec
  – a checklist for Tier 1 sites has been prepared
• Focus has now moved to LAN setup optimisation
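To illustrate why the TCP buffer size matters on a 300 ms round-trip link, the sketch below computes the bandwidth-delay product for an assumed 100 Mbit/s path and enlarges a socket's buffers accordingly; the numbers and the per-socket approach are illustrative, since the production change was applied through OS-level settings.

    # Sketch: bandwidth-delay product for a high-latency link, and how a
    # per-socket TCP buffer could be enlarged. Link speed is an assumed
    # example value; production tuning was done at the OS level.
    import socket

    rtt_s = 0.300                  # ASGC round-trip time from the slide
    link_bps = 100e6               # assumed 100 Mbit/s usable path (example)
    bdp_bytes = int(link_bps / 8 * rtt_s)
    print("bandwidth-delay product ~ %.1f MB" % (bdp_bytes / 1e6))
    # A default buffer of a few hundred kB caps throughput well below the
    # link capacity; buffers should be at least the bandwidth-delay product.

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    print("granted send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    sock.close()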

Experiment Database Activities & Plans

LHCb – COOL service model (slide: Andrea Valassi, Conditions Databases, Rimini, 7 May 2007; Marco Clemencic, 3D workshop, 13 Sep 2006)

• Two COOL (Oracle) servers at CERN - essentially one for online and one for offline
  – replication to the Tier 1s from the online database is a two-step replication
  – the online server is at the pit (managed by LHCb); the offline server is in the CC
• Tier 1 replicas: GRIDKA, RAL, IN2P3, CNAF, SARA, PIC

LFC Replication Testbed

[Diagram: population clients write through LFC read-write servers into the master DB on an LFC Oracle server at CERN; Oracle Streams replicate the data over the WAN to a replica DB on an LFC Oracle server at CNAF, from which an LFC read-only server serves read-only clients. Hosts shown: rls1r1.cern.ch, lxb0716.cern.ch and lxb0717.cern.ch at CERN; lfc-streams.cr.cnaf.infn.it and lfc-replica.cr.cnaf.infn.it at CNAF.]

File Catalogue Replication between CERN and CNAF

• LFC replication via Streams between CERN and CNAF has been in production since last November
  – requested by LHCb to provide read-only catalogue replicas
• Stable operation without major problems
  – several site interventions at CNAF have been performed
  – site restart and resynchronisation worked
  – the rate is low compared to conditions data
• In contact with LHCb about adding the remaining LHCb Tier 1 sites

ATLAS – COOL service model (slides: Andrea Valassi, Conditions Databases, Rimini, 7 May 2007; Sasha Vaniachine and Richard Hawkings, 3D workshop, 14 Sep 2006)

• COOL Oracle services at Tier 0 and ten Tier 1s
  – two COOL servers at CERN for online/offline (similar to LHCb)
    • the online database is within the ATLAS pit network (ATCN), but physically in the CC
  – in addition: Oracle at three ‘muon calibration centre’ Tier 2s
• Tier 1 replicas: GRIDKA, TAIWAN, RAL, IN2P3, CNAF, SARA, BNL, TRIUMF, PIC, Nordugrid

[Diagram: calibration updates from the online / PVSS / HLT farm in the ATLAS pit network reach the online Oracle DB via a gateway and a dedicated 10 Gbit link to the computer centre; Streams replication populates the offline master conditions DB, which in turn feeds Tier-0 SQLite replicas (1 file/run) and the Tier-1 replicas over the CERN public network and to the outside world.]

(A minimal sketch of reading such a per-run SQLite replica follows after this slide.)
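Since the ATLAS model distributes per-run SQLite replica files to Tier 0, here is a minimal sketch of inspecting one such file; the file name is hypothetical and no COOL schema is assumed, so the example only opens the file read-only and lists the tables it contains.

    # Sketch: open a per-run conditions SQLite replica read-only and list
    # its tables. The file name is hypothetical; no COOL schema is assumed.
    import sqlite3

    path = "conditions_run012345.db"   # hypothetical per-run replica file
    conn = sqlite3.connect("file:%s?mode=ro" % path, uri=True)
    try:
        cur = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        for (table_name,) in cur:
            print(table_name)
    finally:
        conn.close()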

ATLAS Muon Calibration Data Flow

[Picture: H. von der Schmitt]

3D Database Resource Request and Current Predictions (LCG 3D Project Status, slide 4)

• Request unchanged with respect to GDB Nov '05; predictions to be reviewed again after the initial CDC phase (e.g. May)
• Conditions Challenges (April - July): ATLAS 3 dual-CPU DB nodes / 0.3 TB usable, LHCb 2 nodes / 0.1 TB
  – ATLAS: COOL + TAGs; LHCb: COOL + LFC r/o replica
• Dress Rehearsals (July - November): ATLAS 3 nodes / 0.3 TB, LHCb 2 nodes / 0.1 TB
  – ATLAS: 4 GB on a 64-bit DB server; LHCb: 2 LFC r/o servers in place
  – expect resource upgrade: double storage and CPU review
• LHC Startup (from November): ATLAS 3+x nodes / 1.0 TB, LHCb 2+y nodes / 0.3 TB
• ATLAS COOL + TAGs storage predictions (tbc by ATLAS): 0.2 + 1.4 TB from June 2008, 0.5 + 3.7 TB in 2009, 0.8 + 6.0 TB in a nominal year

CMS – conditions data at CERN (slides: Andrea Valassi, Conditions Databases, Rimini, 7 May 2007; Vincenzo Innocente, CMS Software Week, April 2007)

• ORCON and ORCOFF conditions data are in POOL_ORA format
• An ORCON-ORCOFF Oracle Streams prototype has been set up (integration RAC); the production setup follows later in 2007

FroNTier Launchpad Setup at CERN (11 Oct 2006, slide: L. Lueking)

• 3 servers running FroNTier & Squid (worker nodes)
• Backend: Oracle Database 10gR2 (4-node RAC)
• Round-robin DNS provides load balancing and failover for clients on the WAN and the T0 farm

CMS Request Update (LCG 3D Project Status, slide 5)

Rough estimates of CMS DB resources, March 2007 through August 2007 (26 Jan 2007, CMS Req. Update and Plans):

• Online P5: 500 GB disk maximum, 20 concurrent users (peak), 10 Hz peak transaction rate
• Offline conditions, Tier-0 (CMSR): 500 GB (DB) + 100 GB per Squid (2-3 Squids); 10 (DB)* + 10 (Squid) concurrent users; 10 (DB)* + 10 (Squid) Hz peak
  – * including online-to-offline transfer
• Offline conditions, Tier-1 (each site): 100 GB per Squid (2-3 Squids/site); 10 users per Squid, >100 over all sites; 10 Hz per Squid, >100 Hz over all sites
• Offline DBS, Tier-0 (CMSR): 20 GB; 10 concurrent users (currently ~2); 10 Hz peak (currently ~5)
• No major change expected until August

Deployment Status and Next Steps

• CMS tested the FroNTier/Squid setup during CSA ’06
  – now some 30 Tier 1 and Tier 2 sites are connected
• ATLAS and LHCb moved to production mode in April
• All ten Tier 1 database sites are integrated into a single distributed database infrastructure
  – ASGC, BNL, CNAF, GridKA, IN2P3, SARA/NIKHEF, NDGF, PIC, RAL, TRIUMF
  – one of the largest distributed database setups worldwide
• Preparing for a Tier 1 replica scaling test with O(100) client nodes using the ATLAS offline framework ATHENA
• In summer: participate in the WLCG “Dress Rehearsal” tests

More information at

• WLCG 3D Project
  – http://lcg3d.cern.ch or
  – http://twiki.cern.ch/twiki/bin/view/PSSGroup/LCG3DWiki
• CERN Physics Database Service
  – http://phydb.web.cern.ch/phydb/
• WLCG Persistency Framework
  – http://pool.cern.ch
  – http://pool.cern.ch/coral
  – http://cool.cern.ch