CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/ Marcin Blaszczyk, IT-DB [email protected] Atlas standby database tests February 2011

Atlas standby database tests February 2011


Page 1: Atlas standby database tests February 2011

CERN IT Department, CH-1211 Geneva 23, Switzerland
www.cern.ch/it

Marcin Blaszczyk, IT-DB
[email protected]

Atlas standby database tests
February 2011

Page 2: Atlas standby database tests February 2011


Outline

• Standby databases for ATLAS
• Failover and Switchover
• Test of standby switchover – February 17th 2011
• Conclusions

Page 3: Atlas standby database tests February 2011


Standby databases for ATLAS

• A standby database is a copy of the production database that can be used for disaster protection
  – Dedicated physical standby database for:
    • ATONR database
    • ATLR database
    • ADCR database

[Diagram labels: PRIMARY DATABASE, STANDBY DATABASE, Redo Transport, Read / Write Access]

Page 4: Atlas standby database tests February 2011


Standby databases for ATLAS

• All ATLAS standby databases:
  – Are installed on new hardware provisioned in 2010
    • Quad-core servers and high-capacity disks
      – This has increased resources on standby DBs compared to previous standby setups
      – Provides a good cost/performance compromise in case of a switchover operation
  – Are located in Safehost, outside the CERN campus
      – Reduces risk in case of disaster recovery
  – Use asynchronous transport mode (no influence on primary database performance)
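
The effect of asynchronous transport mode can be illustrated with a toy model. This is a minimal sketch, not Oracle Data Guard code: the class `ToyPrimary`, its methods and the record names are hypothetical. It only shows that a commit returns without waiting for the standby, and that redo not yet shipped when the primary fails is what a failover would lose.

```python
from collections import deque


class ToyPrimary:
    """Toy model of a primary shipping redo asynchronously (hypothetical, not Oracle code)."""

    def __init__(self):
        self.committed = []        # redo records committed on the primary
        self.in_flight = deque()   # committed but not yet shipped to the standby
        self.standby = []          # redo records received by the standby

    def commit(self, record):
        # Asynchronous mode: the commit returns at once and never waits for the standby.
        self.committed.append(record)
        self.in_flight.append(record)

    def ship(self, max_records):
        # Background shipping: move up to max_records to the standby, in commit order.
        for _ in range(min(max_records, len(self.in_flight))):
            self.standby.append(self.in_flight.popleft())

    def lost_on_failover(self):
        # Records the standby never received if the primary fails right now.
        return list(self.in_flight)


primary = ToyPrimary()
for i in range(5):
    primary.commit(f"txn-{i}")
primary.ship(max_records=3)
print("standby has:", primary.standby)
print("lost on failover:", primary.lost_on_failover())
```

In this model the primary never blocks on the standby, which is why the standby cannot slow it down; the price is the small window of unshipped records.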

Page 5: Atlas standby database tests February 2011


Failover (unscheduled – DB failure)

[Diagram labels: PRIMARY DATABASE, STANDBY DATABASE, Redo Transport, Read / Write Access (twice)]

Page 6: Atlas standby database tests February 2011


Switchover (scheduled)

[Diagram labels: PRIMARY DATABASE, STANDBY DATABASE, Redo Transport (twice), Read / Write Access (twice)]
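
The difference between the two role transitions on these slides can be sketched as a small state model. The names `Site`, `switchover` and `failover` are illustrative assumptions and do not correspond to any Oracle API; the sketch only captures which site holds which role after each operation.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    role: str  # "primary", "standby" or "failed"


def switchover(primary: Site, standby: Site) -> None:
    """Scheduled role swap: both sites stay available and exchange roles."""
    if primary.role != "primary" or standby.role != "standby":
        raise ValueError("switchover needs a healthy primary and a standby")
    primary.role, standby.role = "standby", "primary"


def failover(primary: Site, standby: Site) -> None:
    """Unscheduled transition: the primary is lost and the standby is promoted."""
    if standby.role != "standby":
        raise ValueError("failover needs a standby to promote")
    primary.role = "failed"
    standby.role = "primary"


original = Site("original hardware", "primary")
remote = Site("standby hardware", "standby")

switchover(original, remote)   # standby hardware now serves read/write access
switchover(remote, original)   # switch back to the original hardware
failover(original, remote)     # primary lost: the standby takes over
print(original, remote)
```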

Page 7: Atlas standby database tests February 2011


Why failover?

• Advantages:
  – Failover minimizes downtime: it is faster than a full database recovery from backups
  – No reconfiguration is needed for users and applications
• Real-life scenarios from other LHC experiments:
  – LHCb online database failover
    • August 2010
    • Reason: power cut in the LHCb pit
  – CMS offline database failover
    • March 2011
    • Reason: electrical issue with storage in the CC

Page 8: Atlas standby database tests February 2011


Switchover tests

• Scenario
  – Tests performed on the ATONR cluster to validate the disaster recovery scenario and infrastructure
  – Coordinated by Luca Canali (IT-DB), Gancho Dimitrov, Florbela Tique Aires Viegas (ATLAS), Rainer Bartoldus (ATLAS Online DB coordinator)
  – Performed during the technical stop on 17th of February 2011
• First phase:
  – The standby was opened in read-only mode for testing while the primary database was running
  – Several connectivity checks were performed for online systems
• Second phase:
  – Full switchover
  – All applications were successfully reconnected to the primary database while it was running on the standby hardware
  – Switch back to the original hardware

Page 9: Atlas standby database tests February 2011


Switchover tests

• General outcome:
  – Tests were successful; the switchover scenario has been tested and validated
  – The standby database worked fine, handling production load for around 2 hours after the switchover
  – We are able to do a switchover / failover in ~30 minutes
• Issues encountered during the test:
  – DNS local caching can cause some client-specific connectivity problems
    • An RDB manager restart solves this problem
  – Connection problems were encountered for COOL & CORAL when reconnecting after the switch back to the original hardware
    • We believe this was a one-off issue
    • Fixed with a service restart in that particular occurrence
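
A client-side check for the DNS caching issue could look like the following minimal sketch. The alias `db-alias.example.org`, the addresses and the helper name `alias_is_current` are hypothetical and not taken from the tests. The resolver is passed in as a parameter so the logic can be exercised without network access; by default it uses `socket.gethostbyname_ex` from the Python standard library.

```python
import socket
from typing import Callable, List


def resolve_with_socket(alias: str) -> List[str]:
    """Resolve an alias to IPv4 addresses as this client currently sees them."""
    _name, _aliases, addresses = socket.gethostbyname_ex(alias)
    return addresses


def alias_is_current(alias: str, expected: List[str],
                     resolver: Callable[[str], List[str]] = resolve_with_socket) -> bool:
    """Return True if the client resolves the alias to the expected addresses.

    A False result suggests a stale local DNS cache on this client.
    """
    return sorted(resolver(alias)) == sorted(expected)


# Hypothetical example: the alias was moved to the standby hardware at 192.0.2.20,
# but one client still holds the old address 192.0.2.10 in its cache.
stale_cache = {"db-alias.example.org": ["192.0.2.10"]}
fresh_cache = {"db-alias.example.org": ["192.0.2.20"]}

stale_ok = alias_is_current("db-alias.example.org", ["192.0.2.20"], stale_cache.__getitem__)
fresh_ok = alias_is_current("db-alias.example.org", ["192.0.2.20"], fresh_cache.__getitem__)
print("stale client ok:", stale_ok)
print("fresh client ok:", fresh_ok)
```

A client that fails this check would be a candidate for a restart or a cache flush before further investigation.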

Page 10: Atlas standby database tests February 2011


Conclusions

• In case the primary database is lost:
  – It is feasible to perform a failover in around 30 minutes
    • Determining that failover is the only and best option can be time consuming
  – Due to asynchronous transport mode, transaction loss is possible but limited to seconds
  – No reconfiguration on the client side is needed
  – Full database access is guaranteed immediately after switchover
    • Global connection descriptors use aliases instead of physical machine names (all changes are made at the DNS level)
    • DNS local caching can cause some client-specific connectivity problems; connection checks are needed after a failover / switchover
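
Why no client-side reconfiguration is needed can be illustrated with an alias-based connect descriptor. This is a sketch under assumptions: the alias `db-alias.example.org`, the service name `example_service` and port 1521 are hypothetical values that do not come from the slides; only the general Oracle Net connect descriptor layout is standard.

```python
def connect_descriptor(host_alias: str, service_name: str, port: int = 1521) -> str:
    """Build an Oracle Net style connect descriptor that names a DNS alias, not a machine."""
    return (
        "(DESCRIPTION="
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host_alias})(PORT={port}))"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


# Hypothetical names: clients only ever see the alias.
descriptor = connect_descriptor("db-alias.example.org", "example_service")
print(descriptor)
```

Because the descriptor names only the alias, repointing the alias in DNS after a switchover or failover redirects clients without touching their configuration.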

Page 11: Atlas standby database tests February 2011


Acknowledgements

• Luca Canali (IT-DB)

http://phydb.web.cern.ch/phydb/

Page 12: Atlas standby database tests February 2011


Q&A

Thank You!

Questions?

[email protected]