Emergency Database Failover: Impacts & Recovery Plan
Aaron Smallwood – ERCOT IT
Joel Mickey – ERCOT Market Operations



Emergency Database Failover

• Summary:
  – ERCOT conducted an emergency database failover on April 21st, 2008, following a hardware failure
  – While ERCOT does perform controlled database failovers monthly, this one was different due to the nature of the hardware failure
    • Normally, the database is ‘stopped’ at one site and then ‘started’ at the other in a controlled manner
    • In this case, the database ‘hung’, meaning that it became unresponsive and data could not be written to or read from the database
  – The impacts:
    • Transactions were prevented from updating downstream databases
    • The lack of transaction updates in downstream databases left a gap in transactional records (out of sync)
  – The affected extracts for April 21st through April 30th are listed in market notices for the incident
  – ERCOT considers this to be an isolated incident and not a systemic problem


Recovery Plan

• Goal:
  – Recover, from a restored copy of the production database, the transactions that are missing in downstream databases and are needed to perform price adjustment calculations
• Plan:
  – Build an environment identical to the production environment
    • Servers, storage, applications
  – Restore data to pre-crash state (4/21)
    • Over 20TB of data to restore from tape (in progress)
  – Using the restored environment and data, extract transactions missing from downstream databases and then roll forward all subsequent transactions
  – ERCOT Market Operations will then review the data for reasonableness and approve the data for reporting and settlement
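
The extract-and-roll-forward step can be illustrated with a minimal Python sketch. It is not ERCOT's actual tooling: the `Transaction` record, its fields, and the functions `find_missing` and `roll_forward` are hypothetical names chosen for illustration, and the downstream database is modeled as a plain dictionary keyed by transaction ID. The idea is simply to select the transactions present in the restored copy but absent downstream, then apply them in commit order.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout; the real schema is not described in the slides.
    txn_id: int
    committed_at: datetime
    payload: str


def find_missing(restored, downstream_ids):
    """Return restored transactions that are absent downstream, oldest first."""
    missing = [t for t in restored if t.txn_id not in downstream_ids]
    # Order by commit time (ties broken by ID) so they can be replayed in sequence.
    return sorted(missing, key=lambda t: (t.committed_at, t.txn_id))


def roll_forward(downstream, missing):
    """Apply missing transactions in the given order.

    Transactions already present are skipped, so re-running is harmless.
    Returns the number of transactions actually applied.
    """
    applied = 0
    for t in missing:
        if t.txn_id not in downstream:
            downstream[t.txn_id] = t
            applied += 1
    return applied
```

Making the apply step skip transactions that already exist downstream means the replay can be restarted safely if it is interrupted, which matters when the data is subsequently reviewed for reasonableness before settlement.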


Questions

• Actions to prevent future occurrences:
  – Nodal market databases will be on newer hardware with more fault tolerance and redundancy
  – Potential re-architecture of system integration between the databases
    • Lessons learned are being documented, but there is no plan yet
    • Resources are focused on the data recovery efforts
• Questions:
  – When will non-spinning reserve price adjustments for PRR 650 be completed?
    • When the transactional data has been restored, reviewed, and approved
  – What is the timeline?
    • The environment build is complete; we anticipate that the data restore from tape will be the task that takes the longest
    • We are estimating weeks, not months, to complete the plan
      – Unknowns include the amount of time needed to restore from tape and the quality of the data once it has been restored
    • Market notices will continue to be sent to indicate status