CERN DB Services: Status, Activities, Announcements


CERN DB Services: Status, Activities, Announcements

Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014

Marcin Blaszczyk - IT-DB

3

Recap

• Last workshop: 16th Nov 2010 – at that time:
  • We were using Oracle 10.2.0.4
  • We were installing new hardware to replace RAC3 & RAC4:
    • RAC8 in “Safehost” for standbys
    • RAC9 for integration DBs
  • 11.2 evaluation was in progress
  • 10.2.0.5 upgrade was under planning
• Infrastructure for Physics DB Services:
  • Quad-core machines with 16GB of RAM
  • FC infrastructure for storage (~2500 disks)

4

Things have changed…

• Service evolution
  • RAC8 standby in Safehost installed
    • Performed in Q3 2010
    • To assure geographical separation for disaster recovery
  • New standby installations – one for each production DB
• 10.2.0.5 upgrade
  • Performed in Q1 2011

5

Oracle 11gR2

• SW upgrade + HW migration
  • Target version 11.2.0.3
  • Performed in Q1 2012
• HW migration
  • New HW installations (RAC10 & RAC11)
  • 8-core (16-thread) CPUs, 48GB of memory
  • Move from ASM to NAS
    • NetApp NAS storage
• Replication technology
  • Usage of Streams replication gradually reduced
  • Usage of Active Data Guard has grown

6

Offloading with ADG

• Offloading backups to ADG
  • Significantly reduces load on the primary
  • Removes the sequential I/O of a full backup
• Offloading queries to ADG (see the sketch below)
  • Transactional workload runs on the primary
  • Read-only workload can be moved to the ADG
  • Examples of workload on our ADGs: ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries
• ORA-1555 (snapshot too old)
  • Sporadic occurrences
  • Oracle bug – to be confirmed whether still present in 11.2.0.4
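Before read-only workload is routed to a standby, it is worth confirming that the standby really is running as Active Data Guard, i.e. open read-only with redo apply active. A minimal check using a standard Oracle view (nothing CERN-specific assumed):

  -- Run on the standby: an Active Data Guard instance reports
  -- DATABASE_ROLE = PHYSICAL STANDBY, OPEN_MODE = READ ONLY WITH APPLY
  SELECT database_role, open_mode
    FROM v$database;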

7

New Architecture with ADG

[Diagram: two deployment variants, both running in maximum performance mode with the primary database shipping redo to its standbys via Redo Transport]

1. Low-load ADG: the primary database plus a single Active Data Guard standby used both for users’ access and for disaster recovery.
2. Busy & critical ADG: the primary database plus two standbys – one Active Data Guard for users’ access and a separate one for disaster recovery.

• Disaster recovery
• Offloading read-only workload
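In either variant the standby trails the primary by the redo transport and apply time, which matters for reports offloaded to the users’ ADG. The lag can be inspected on the standby through a standard view – a small sketch:

  -- Run on the Active Data Guard standby
  SELECT name, value, time_computed
    FROM v$dataguard_stats
   WHERE name IN ('transport lag', 'apply lag');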

8

IT-DB Service on 11gR2

• IT-DB service much more stable
  • Workload has been stabilized
  • High loads and node reboots eliminated
• More powerful HW
• Offloading to ADG helps a lot
• 11g clusterware more stable
• Storage model benefited from using NAS
  • Single or multiple disk failures can no longer affect the DB service
• Faster and less vulnerable Streams replication

9

Preparation for Run 2

• Oracle SW
  • No single release is an obvious fit for the entire Run 2 period
  • New software versions to choose from: 11.2.0.4 vs 12.1.0.1
• New HW
  • 32-thread CPUs, 128/256GB memory
• New NetApp storage model
  • More SSD cache
  • Consolidated storage

10

Hardware upgrades in Q1 2014

• New servers and storage
  • Servers: more RAM, more CPU
    • 128GB of RAM (vs 48GB on current production machines)
  • Storage: more SSD cache
    • Newer NetApp model
    • Consolidated storage
• Refresh cycle of the OS and OS-related tooling
  • Puppet & RHEL 6
• Refresh cycle of our HW
  • New HW for production
  • Current production HW will be moved to standby

11

Software upgrades in Q1 2014

• Available Oracle releases
  • 11.2.0.4
  • 12.1.0.1
• Evolution – how to balance:
  • Stable services
  • Latest releases for bug fixes
  • Newest releases for new features
  • Fit with the LHC schedule

12

DBAs & workload validation

• What DBAs can do:
  • Test upgrades of integration and production databases
  • Share experience across user communities
  • Database CAPTURE and REPLAY testing with RAT (see the sketch below)
    • Capture workload from production and replay it on the upgraded DB
    • Useful to catch bugs and regressions
    • Unfortunately it cannot cover all edge cases
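As a rough illustration of that capture-and-replay flow with Oracle Real Application Testing – the capture name, replay name and DIRECTORY object below are hypothetical, not CERN’s actual setup:

  -- 1) On production: record the workload
  BEGIN
    DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
      name     => 'pre_upgrade_capture',  -- hypothetical capture name
      dir      => 'RAT_CAPTURE_DIR',      -- hypothetical DIRECTORY object
      duration => 3600);                  -- stop automatically after one hour
  END;
  /

  -- 2) On the upgraded test database: process and replay the capture
  EXEC DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'RAT_CAPTURE_DIR');
  EXEC DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name => 'upgrade_replay', replay_dir => 'RAT_CAPTURE_DIR');
  EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY;
  -- start the external wrc replay clients, then:
  EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;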

13

Validation by the users

• Validation by the application owners is very valuable to reduce risk:
  • Functional tests
  • Tests with ‘real world’ data sizes
  • Tests with concurrent workload
• The criticality depends:
  • On the complexity of the application
  • On how well the owners can test their SQL

14

Recent Changes: Q1-Q2 2014

• DB services for Experiments/WLCG
  • Target version 11.2.0.4
  • Exceptions – target 12c:
    • ATLARC
    • LHCBR
    • A few more IT-DB services
• Interventions took 2-5 hours of DB downtime
  • Depending on system complexity: standby infrastructure, number of nodes, etc.

15

Upgrade technique - overview

[Diagram: phased upgrade of a primary RAC database and its Data Guard RAC standby, with Redo Transport between them and read-write access on the primary maintained throughout. The recoverable sequence:]

1. Starting point: primary and standby both on Clusterware 11g+ with RDBMS 11.2.0.3, linked by Redo Transport.
2. Clusterware upgraded to 12c while the RDBMS stays at 11.2.0.3 and redo keeps shipping.
3. RDBMS upgraded from 11.2.0.3 to 11.2.0.4 – the only step requiring database downtime.
4. Upgrade complete.
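After an RDBMS upgrade step such as the 11.2.0.3 -> 11.2.0.4 move above, a standard sanity check is that every database component is valid at the new version – a minimal query against the data dictionary:

  -- Run after the upgrade: components should show
  -- STATUS = 'VALID' and VERSION = '11.2.0.4.0'
  SELECT comp_name, version, status
    FROM dba_registry
   ORDER BY comp_name;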

16

Phased approach to 12c

• Some DBs already on the 12.1 version
  • ATLARC, LHCBR
  • Smooth upgrade
  • No major issues discovered so far
• Following Oracle SW evolution, depending on:
  • Feedback on the next 12c releases (12.2)
  • Testing status
  • Possibility to schedule upgrades
• Next possible slot for upgrades to the 12c 1st patchset:
  • Technical stop Q4 2014/Q1 2015?
  • Candidates: offline DBs (ATLR, CMSR, LCGR…)

17

Monitoring & Security

• Monitoring
  • RacMon
  • EM12c
  • Strmmon
• Support level during LS1
  • Best effort
• Security
  • AuditMon
  • Firewall rules for external access
    • For ADCR in 2013
    • For ATLR in 2014

IT-DB Operations Report

ATLAS databases

• Production DBs: 12 nodes*, ~69 TB of data
  – ATONR: 2 nodes, ~8 TB
  – ADCR: 4 nodes, ~19.5 TB
  – ATLR: 3 nodes, ~20.5 TB
  – ATLARC: 2 nodes, ~17 TB
  – *ATLAS DASHBOARD (1 node of the WLCG database), ~4 TB
• Standby DBs: 14 nodes, ~75 TB of data
  – ATONR_ADG: 2 nodes; ATONR_DG: 2 nodes
  – ADCR_ADG: 4 nodes; ADCR_DG: 3 nodes
  – ATLR_DG: 3 nodes
• Integration DBs: 4 nodes, ~18 TB of data
  – INTR: 2 nodes, ~7.5 TB
  – INT8R: 2 nodes, ~9 TB
  – ATLASINT: 2 nodes, ~2 TB (will be consolidated with INT8R)
• Nearly 165 TB of space, 30 database servers
• 12* databases (11 RAC clusters + 1 dedicated RAC node*)

19

Replication for ATLAS - current status

20

Replication for ATLAS - plans

• Replication changes overview (see the sketch below):
  • PVSS
    • Read-only replica: Active Data Guard
  • COOL
    • Online -> Offline: GoldenGate
    • Offline -> Tier1s: GoldenGate
  • MUON
    • Streams will be stopped once ATLAS’ new solution for custom data movement is in place
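When the time comes to retire a Streams path, its capture and apply processes can be inspected and stopped through the standard Streams administration views and packages – a minimal sketch, with an illustrative process name:

  -- Current state of the Streams processes on this database
  SELECT capture_name, status FROM dba_capture;
  SELECT apply_name,   status FROM dba_apply;

  -- Stopping an apply process ('MUON_APPLY' is a made-up name)
  EXEC DBMS_APPLY_ADM.STOP_APPLY(apply_name => 'MUON_APPLY');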

21

Conclusions

• Focus on stability for DB services
• Software evolution
  • Critical services have just moved to 11.2.0.4
  • Long-term perspective: keep testing towards 12c
• HW evolution
• Technology evolution for replication
  • ADG & GoldenGate will fully replace Oracle Streams

22

Acknowledgements

• Work presented here on behalf of:
  • CERN Database Group

Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014

Thank you!
Marcin.Blaszczyk@cern.ch

24
