Living with the Oracle Database Appliance

Simon Haslam, Veriton; Peter Moore, Simplyhealth

Simon Haslam Consultant, Veriton &

Technical Director of

Oracle s/w since 1995

Middleware & SOA

WebLogic, SOA, BPM

Peter Moore Principal Oracle DBA & MW Admin, Simplyhealth

Oracle s/w since 1988

Oracle DBA for 19 years

Database Administrator

Introduction & Background

ODA BM/VP & Sizing of Recovery Area

Hardware Maintenance (ASR & Disk Failures)

Patching

Miscellaneous

What is ODA? (in a slide!)

Two fast Intel compute nodes

Shared, direct attached storage array including flash

InfiniBand interconnect & 10Gb public networks

Management software (database & virtualisation)

Sold as a single product for $68k (list)

[Diagram: ODA hardware. Labels: Bulk Data HDD, Redo Logs, ODA Cache SSD, Compute Node, Compute Node HDD. Now with InfiniBand.]

Background

Started in 1872 ◦ Previously… HSA, BCWA, HealthSure, LHF, Remedi, Medisure, Denplan

Primary business areas ◦ Health Cash Plans ◦ Private Medical Insurance ◦ Dental Capitation ◦ Healthcare delivery

Over 3M customers / 20,000 companies

~1700 Employees

Core IT

Product / CRM / Finance Application

~1000 Users / 600 Active

3M Customer records

Java EE and PL/SQL

3rd Party communications platform

RAC (2TB main db), WebLogic, Reports

ZFS Appliance

Simplyhealth’s ODAs

[Diagram: Production and Test ODAs, with an ODA Base on each node. Labels: OLTP, Reporting standby, Comms, TTD container VM 1, TTD container VM 2, OLTP standby, Comms standby, Test, Reporting, APEX portal, RMAN OLTP, archive, RMAN standby OLTP, UAT Comms, UAT.]

ODA BM/VP & Sizing of Recovery Area

Virtualized Platform: databases

Database ◦ Each node has an “ODA Base” DomU ◦ Looks a lot like ODA BM: most admin is done from ODA Base

Nodes ◦ Run a special OVS image

Appliance Manager ◦ GUI when you first provision it ◦ oakcli tool

Node 0 - OVS

ODA Base (DomU) • Appliance Manager • Database(s) • Grid Infrastructure

Node 1 - OVS

ODA Base (DomU) • Appliance Manager • Database(s) • Grid Infrastructure

[Diagram: each node shows Dom0, Repo and Local storage; Shared Storage sits beneath both nodes]

Lots of room for app VMs like SOA

ODA BM or VP?

Simplyhealth chose ODA VP ◦ Initially driven by WebLogic

◦ Turned out to be good for test databases

If in doubt Simon recommends ODA VP: ◦ gives you more flexibility in future (app & probably database)

◦ only moderate extra operational complexity

Sizing of RECO

DATA is on outer part of hard disks, RECO on inner

Only set during initial provisioning

[Diagram: DATA and RECO regions on each disk for the two provisioning options. Default: “Local Backup”; alternative: “External Backup”.]

DATA:RECO Sizes

Disks are physically partitioned according to whether Local or External Backup was chosen

Same ratios for all ODA hardware versions and HIGH/NORMAL redundancy

“Local Backup”: DATA 43% (outer), RECO 57% (inner)

“External Backup”: DATA 86% (outer), RECO 14% (inner)

Usable Space Example: ODA X5-2, 1 shelf, NORMAL redundancy

“Local Backup”: DATA 12TB, RECO 16TB

“External Backup”: DATA 24TB, RECO 4TB

REDO 250GB

FLASH 750GB
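The usable-space figures follow directly from the DATA:RECO ratios quoted earlier. A minimal sketch of that arithmetic is below; the function name and the 28TB total (12 + 16, or 24 + 4) are illustrative assumptions taken from the slide numbers, not anything reported by the appliance itself.

```python
# Sketch only: split a usable capacity between DATA and RECO using the
# percentages quoted on the slides. The 28 TB total is inferred from the
# slide's own example (12 + 16, or 24 + 4) and is an assumption.

BACKUP_SPLITS = {
    "local": (43, 57),     # "Local Backup": DATA 43%, RECO 57%
    "external": (86, 14),  # "External Backup": DATA 86%, RECO 14%
}


def data_reco_split(usable_tb, backup_type):
    """Return (DATA, RECO) in whole TB for the chosen backup option."""
    data_pct, reco_pct = BACKUP_SPLITS[backup_type]
    data_tb = usable_tb * data_pct / 100
    reco_tb = usable_tb * reco_pct / 100
    return round(data_tb), round(reco_tb)


for kind in ("local", "external"):
    data_tb, reco_tb = data_reco_split(28, kind)
    print(f"{kind}: DATA {data_tb}TB RECO {reco_tb}TB")
```

Because the split is fixed at initial provisioning, running the numbers for both options before deployment is a cheap way to see how much DATA space the "Local Backup" default gives up.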

Hardware Maintenance (ASR & Disk Failures)

My Oracle Support Set up

Use a team MOS account + group email dist. list

Ensure MOS account has access to correct ODA CSI(s)

MOS

Oddity: you can only activate ASR on the ODA nodes so why this warning/button? (you don’t get this on ZFSSA)

ASR Set up

Stand-alone ASR on each ODA

Each server needs internet access to https://transport.oracle.com

oakcli configure asr

ASR Test

Option 1: Internal ASR ◦ Enter root password (x2) ◦ Enter MOS credentials

ASR Disk failure example

ASR Funnies

ASR raises one SR per disk… or none… or two…

Sometimes the first we knew that a disk had failed was when Oracle updated the SR ◦ New ODA plug-in for EM is expected to include hardware notifications

ASR Further Diagnostics

Our Disk History

We have 2 x dual shelf ODA X3-2s ◦ 16 SSD & 88 HDD ◦ Running for 1.5 years (1.35M HDD-hours)

Total of 6 HDDs have been replaced (i.e. 225k h MTBF) ◦ 5 predicted failures ◦ 1 real failure… bad experience with I/O waits though

No SSDs have failed
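The MTBF quoted above is simply accumulated HDD-hours divided by the number of replacements. A small sketch of that calculation follows; it takes the 1.35M HDD-hours figure from the slide as given rather than recomputing it, and it counts predicted failures as failures, which is an assumption about how the 225k figure was derived.

```python
# Sketch only: observed MTBF = accumulated device-hours / replacements.
# Both inputs are the figures quoted on the slide; the helper name is
# hypothetical.

def observed_mtbf_hours(device_hours, replacements):
    """Return the observed mean time between failures in hours."""
    if replacements <= 0:
        raise ValueError("need at least one replacement to estimate MTBF")
    return device_hours / replacements


hdd_hours = 1.35e6       # HDD-hours quoted on the slide
hdd_replacements = 6     # 5 predicted failures + 1 real failure

mtbf = observed_mtbf_hours(hdd_hours, hdd_replacements)
print(f"Observed HDD MTBF: {mtbf:,.0f} h")  # Observed HDD MTBF: 225,000 h
```

With zero SSD failures the same formula has no finite answer, which is why the helper refuses a replacement count of zero rather than returning a misleading number.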

Note: new ZFS SA disk arrived automatically next morning without sys admin knowing it had failed! (ODA should be more like this)

Disk Failure ‘Gotchas’

1 predicted failure fixed itself!

General fiddliness of replacing disks ◦ Firmware updating, getting new disks ONLINE, etc ◦ MOS 1435946.1 & 1496114.1

The replacement disk includes the courier details to collect the failed one… ◦ this is a European courier who will know nothing about it! ◦ we need the UK courier

Blinking yellow light doesn’t always work?!

Patching

Patching: It’s Really Good!

Vastly simplified process compared to DIY for full stack

Approx. quarterly ODA-only bundled patches ◦ includes PSU for databases (optional)

Oracle Support says you should be no more than 2 versions behind current

There’s probably a backlog of ODA customers on 2.10 (last 11g GI but CPU only to April 2014)

prep • Download & load to patch repositories on ODA nodes

INFRA • Update INFRA

GI • Update GI

db • (optional) Update database Oracle Homes & databases

Upgrade Example: ODA 2.10 to 12.1.2.2.0 (INFRA, GI, DB PSU)

11g → 12c CRS/ASM upgrade would probably have been a project pre-ODA

We only have a single 11.2.0.4.x Oracle Home ◦ some people have several, e.g. for different apps

prep • scp p20340774_121220_Linux-x86-64_[12]of2.zip • oakcli unpack -package p20340774… {for each zip, on each node} • oakcli update -patch 12.1.2.2.0 --verify

INFRA • oakcli update -patch 12.1.2.2.0 --infra

GI • oakcli update -patch 12.1.2.2.0 --gi

db • oakcli update -patch 12.1.2.2.0 --database
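The order of the steps matters: unpack and verify first, then INFRA, GI and finally the database homes. The sketch below only builds and prints that command sequence as a checklist; it does not run anything. The helper name and the `/tmp` staging directory are hypothetical, and the slide elides the exact path passed to the unpack step, so treat the printed paths as placeholders.

```python
# Sketch only: assemble the patch steps from the slides in order, as a
# printable checklist. Nothing is executed. The staging directory is an
# assumption; the slides do not say where the zip files were copied.

def patch_commands(version, zips, staging_dir="/tmp"):
    """Return the oakcli commands from the slides, in the order used."""
    # Unpack each bundle zip (the slides note: for each zip, on each node).
    cmds = [f"oakcli unpack -package {staging_dir}/{z}" for z in zips]
    # Verify before changing anything.
    cmds.append(f"oakcli update -patch {version} --verify")
    # Then INFRA, GI and (optionally) the database homes.
    for component in ("--infra", "--gi", "--database"):
        cmds.append(f"oakcli update -patch {version} {component}")
    return cmds


bundle = [f"p20340774_121220_Linux-x86-64_{n}of2.zip" for n in (1, 2)]
commands = patch_commands("12.1.2.2.0", bundle)
for step, cmd in enumerate(commands, start=1):
    print(f"{step}. {cmd}")
```

Keeping the sequence in one place makes it easier to reuse the same checklist for the test and production appliances and to note timings against each step.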

12c GI / 11g PSU Upgrade Timeline

--infra 2h 29min

--gi 1h 12min

--database 40min

App Prep. 1h

Restarting app etc

Lost 1h 10min

Elapsed outage for app ~6h

Supposed to be rolling? (all DBs shutdown)

Supposed to be rolling? Both nodes rebooted automatically

Databases were open for most of the day but we were never sure when they would be shut down… (our lack of experience of ODA patching?)

Possibly bug in shared repo upgrade
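Adding up the quoted durations shows roughly where the ~6h outage comes from. This is a sketch under two assumptions the slides do not confirm: that the phases ran back to back, and that both the application preparation and the "lost" time count towards the outage.

```python
# Sketch only: sum the durations quoted on the timeline slide.
# Assumes the phases were sequential and that app prep and the "lost"
# 1h 10min both fall inside the outage window.
from datetime import timedelta

OAKCLI_PHASES = {
    "--infra": timedelta(hours=2, minutes=29),
    "--gi": timedelta(hours=1, minutes=12),
    "--database": timedelta(minutes=40),
}
OTHER = {
    "app prep": timedelta(hours=1),
    "lost": timedelta(hours=1, minutes=10),
}


def fmt(td):
    """Format a timedelta as 'Xh YYmin'."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}min"


oakcli_total = sum(OAKCLI_PHASES.values(), timedelta())
overall = oakcli_total + sum(OTHER.values(), timedelta())

print("oakcli phases:", fmt(oakcli_total))  # oakcli phases: 4h 21min
print("with prep and lost time:", fmt(overall))  # with prep and lost time: 6h 31min
```

On these assumptions the oakcli steps account for about two thirds of the window, so any future rolling behaviour in INFRA or GI would have the largest effect on application downtime.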

What happened under the covers?

INFRA updates ◦ BIOS ◦ ILOM ◦ Firmware updated on all disks (except new ones) ◦ OVM 3.2.9

GI updates ◦ CRS 12.1.0.2.2 ◦ ASM 12.1.2.x.0 (i.e. inc Flex ASM) ◦ ODA Base to Oracle Linux 5.10 UEK2

Database PSU ◦ Oracle home to 11.2.0.4.5 (plus 12.1.0.2.2, 11.2.0.3.13 if we had them) ◦ Databases updated (some!)

…and probably much more!

DB Patch-Set Update

Choose which Oracle Home(s) to apply PSU to

Script loops through databases running in each updated home & runs catbundle.sql ◦ Recognises standbys - didn’t apply PSU (correctly) but still shut them down! Perhaps because they shared the home being patched? Possibly our fault!

Strange Error Messages

Some strange messages, but mostly harmless: ◦ Console: “An error occurred while restoring domain oakDom1: Error: not a valid guest state file: config size read”

But… 2 of us were watching everything very closely ◦ Probably better to just go for a long lunch instead!

Patching Wish List

Status/confidence ◦ more timestamps (for checking back later – test vs prod)

◦ a progress indicator for anything taking over ~3 min e.g. “INFO: Running prepatching on node 0” ~20 mins

Could firmware updates of disks (35 mins) be done in parallel?

Patching Wish List

Help us to understand which parts of process are rolling (could be different per ODA version) and how to minimise downtime ◦ Is INFRA ever rolling?

◦ GI rolling?

◦ DB rolling if using RAC or RAC One Node (RON)?

Patching Nirvana: Rolling Upgrades for Everything?!

Size of ODA X5-2 invites DB consolidation

Simplyhealth: Lack of rolling INFRA will drive all non-UAT databases off test ODA (very hard to test bundled patches on pre-prod/UAT)

O-box SOA Appliance: sold on the strength of its HA, so needs rolling updates below the WebLogic layer

Miscellaneous

NFS Storage for Databases

Oracle ZFS and NFS (e.g. NetApp) are supported ◦ See MOS 1445253.1: External Storage (read/write) Support

◦ Use files over NFS, not via ASM

Uses Direct NFS (dNFS): fast ◦ we have a 10 GbE network dedicated to storage

Not so self-contained so perhaps not “the ODA way”

An Innovative Approach for Test DBs

Requirement: ◦ To use DB EE NUP licences for test, when the 2 ODA bases are licensed by RAC processor

Solution: ◦ One large VM on each node with multiple Linux Containers ◦ Test databases within the containers use ZFS SA for storage

Suffers from lack of rolling upgrades for ODA INFRA

Technical Credit/Implementation: Mark Leeuw & Fabrizio Bordaccini

Backup & Disaster Recovery

Data Guard works well of course

ODA VP & ODA Base? ◦ In practice you need to rebuild

VMs running on ODA VP? ◦ Host level backup within VM ◦ ACFS Replication...?

Oracle White Paper: Backup and Recovery Best Practices for the Oracle Database Appliance (April 2014)

Management

Looking forward to trying the new EM 12c R4 ODA plug-in

Initial ODA VP imaging ◦ Why can’t ODA come with VP image?

◦ Speed of booting .ISO over ILOM if not local

Tips

Keep It Simple! ◦ Don’t stray too far from standard ODA design goals

◦ Custom databases running off vDisks will end in tears!

Don’t mess with BIOS! ◦ Simon’s don’t-do-this-at-home node eviction test

Summary

Choose Wisely!

ODA Bare Metal or Virtualized Platform

Internal or External Backup

Double (NORMAL) or Triple (HIGH) Mirrored

Hardware

ASR is useful

Disks – replacement process needs improvement

Patching

Probably the best feature of ODA

The gift that keeps on giving! ◦ Over lifetime of an ODA you might patch/upgrade 10 or more times

Oracle Database Appliance VP

It Just Works*™ *99%!

@simon_haslam @petercmoore
