Emergency Database Failover: Impacts & Recovery Plan
Aaron Smallwood – ERCOT IT
Joel Mickey – ERCOT Market Operations
Emergency Database Failover
• Summary:
  – ERCOT conducted an emergency database failover on April 21st, 2008, following a hardware failure
  – While ERCOT does perform controlled database failovers monthly, this one was different due to the nature of the hardware failure
    • Normally, the database is ‘stopped’ at one site and then ‘started’ at the other in a controlled manner
    • In this case, the database ‘hung’, meaning that it became unresponsive and data could not be written to or read from the database
  – The impacts:
    • Transactions were prevented from updating downstream databases
    • The lack of transaction updates left a gap in the transactional records of the downstream databases (out of sync with production)
  – The affected extracts for April 21st through April 30th are listed in the market notices for the incident
  – ERCOT considers this to be an isolated incident and not a systemic problem
Recovery Plan
• Goal:
  – From a restored copy of the production database, recover the transactions that are missing from downstream databases and are needed to perform price adjustment calculations
• Plan:
  – Build an environment identical to the production environment
    • Servers, storage, applications
  – Restore data to the pre-crash state (4/21)
    • Over 20 TB of data to restore from tape (in progress)
  – Using the restored environment and data, extract the transactions missing from downstream databases and then roll forward all subsequent transactions
  – ERCOT Market Operations will then review the data for reasonableness and approve the data for reporting and settlement
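
The extract-and-roll-forward step in the plan can be illustrated with a minimal Python sketch. It assumes every transaction carries a unique ID and a commit timestamp; the record layout (`txn_id`, `committed_at`, `payload`) and the function names are hypothetical, chosen only for illustration, and do not reflect ERCOT's actual schema or tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout, assumed for this sketch only.
    txn_id: int
    committed_at: datetime
    payload: str


def find_missing(restored: Iterable[Transaction],
                 downstream_ids: Set[int]) -> List[Transaction]:
    """Return restored transactions absent downstream, in commit order."""
    missing = [t for t in restored if t.txn_id not in downstream_ids]
    # Sort by commit time (ID as tie-breaker) so replay order is deterministic.
    return sorted(missing, key=lambda t: (t.committed_at, t.txn_id))


def roll_forward(missing: Iterable[Transaction],
                 apply: Callable[[Transaction], None]) -> int:
    """Apply each missing transaction in the given order; return the count."""
    applied = 0
    for txn in missing:
        apply(txn)
        applied += 1
    return applied


# Example: transaction 1 reached the downstream database, 2 and 3 did not.
restored = [
    Transaction(3, datetime(2008, 4, 21, 9, 5), "c"),
    Transaction(1, datetime(2008, 4, 21, 9, 0), "a"),
    Transaction(2, datetime(2008, 4, 21, 9, 2), "b"),
]
downstream_ids = {1}

missing = find_missing(restored, downstream_ids)
replayed: List[Transaction] = []          # stand-in for the downstream database
count = roll_forward(missing, replayed.append)
print([t.txn_id for t in missing], count)
```

Replaying in commit order matters here because later transactions may depend on earlier ones. In practice the replayed data would still go through the reasonableness review described in the plan before being used for reporting and settlement.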
Questions
• Actions to prevent future occurrences:
  – Nodal market databases will be on newer hardware with more fault tolerance and redundancy
  – Potential re-architecture of system integration between the databases
    • Lessons learned are being documented, but there is no plan yet
    • Resources are focused on the data recovery efforts
• Questions:
  – When will non-spinning reserve price adjustments for PRR 650 be completed?
    • When the transactional data has been restored, reviewed, and approved
  – What is the timeline?
    • The environment build is complete; we anticipate the data restore from tape to be the task that takes the longest
    • We are estimating weeks, not months, to complete the plan
  – Unknowns include the amount of time needed to restore from tape and the quality of the data once it’s been restored
    • Market notices will continue to be sent to indicate status