Resilience in data management

2015 Davide P. Carioni

A case for Redundant Arrays of Inexpensive Disks

when: 1988

where: Chicago

who: D. A. Patterson, G. Gibson, and R. H. Katz

thesis: a top-performing mainframe disk drive can be beaten on performance by an array of inexpensive drives developed for the personal computer market.

abstract:

Redundant Arrays of Independent Disks

In a disk array, several independent disks are treated as a single, large, high-performance logical disk.

The data are striped across several disks accessed in parallel:

• high data transfer rate: large data accesses (heavy I/O op.)

• high I/O rate: small but frequent data accesses (light I/O op.)

• load balancing across the disks

Two orthogonal techniques:

• redundancy: to improve reliability

• data striping: to improve performance

Redundancy in an I/O operation (a simple example)

[Figure: an adapter handling the data ABCDEFGH]

Data striping in an I/O operation (a simple example)

[Figure: an adapter handling the data ABCDEFGH with 2-byte interleaving (stripe unit)]

Virtualization in an I/O operation (a simple example)

[Figure: an adapter presenting the data ABCDEFGH as a single large logical disk]

Data striping

striping: data are written sequentially, one unit at a time, across multiple disks according to a cyclic algorithm (round robin)

stripe unit: size of the unit of data that is written to a single disk

stripe width: number of disks considered by the striping algorithm (this does not necessarily coincide with the number of physical disks in the array, since there can be “hot spares”)

Performance gains:

• multiple independent I/O requests are executed in parallel by several disks, reducing the queue length (and queueing time) at each disk

• a single multiple-block I/O request is executed by multiple disks in parallel, increasing the transfer rate of that request
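
The round-robin placement described above can be sketched in a few lines of Python. This is only an illustration: the function names `locate_unit` and `stripe` are hypothetical, and the sample data reuses the ABCDEFGH example with a 2-byte stripe unit over an assumed stripe width of two disks.

```python
def locate_unit(unit_index: int, stripe_width: int) -> tuple[int, int]:
    """Map a logical stripe unit to (disk index, position on that disk)."""
    disk = unit_index % stripe_width       # round robin across the disks
    position = unit_index // stripe_width  # how many full stripes came before
    return disk, position


def stripe(data: bytes, stripe_unit: int, stripe_width: int) -> list[list[bytes]]:
    """Cut data into stripe units and deal them out to the disks in turn."""
    disks: list[list[bytes]] = [[] for _ in range(stripe_width)]
    for start in range(0, len(data), stripe_unit):
        disk, _ = locate_unit(start // stripe_unit, stripe_width)
        disks[disk].append(data[start:start + stripe_unit])
    return disks


layout = stripe(b"ABCDEFGH", stripe_unit=2, stripe_width=2)
print(layout)  # [[b'AB', b'EF'], [b'CD', b'GH']]
```

Consecutive units land on different disks, which is why a single large request can be served by several disks at once.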

Parallelism and reliability

The more physical disks in the array, the larger the size and performance gains, but … the larger the probability that a disk fails.

This is the main motivation for the introduction of redundancy.

Assuming independent failures, the probability of a failure in an array of 100 disks is about 100 times the probability of a failure of a single disk.

Redundancy: error-correcting codes (stored on disks different from the ones holding the data) are computed to tolerate data loss due to disk failures.

Performance: since write operations must also update the redundant information, they are slower than traditional writes.

« if a disk has a Mean Time To Failure (MTTF) of 200,000 hours (~23 years), an array of 100 disks will show an MTTF of 2,000 hours (~3 months) »
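
The figures in the quote follow from dividing the single-disk MTTF by the number of disks. The sketch below reproduces that arithmetic, assuming independent failures at a constant rate and no redundancy (so the first disk failure counts as an array failure). The function name `array_mttf` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF until the first disk failure in an array of identical disks.

    Assumes independent failures with a constant failure rate,
    so the failure rates of the individual disks simply add up.
    """
    return disk_mttf_hours / n_disks


single_disk_years = 200_000 / HOURS_PER_YEAR
array_hours = array_mttf(200_000, 100)
array_months = array_hours / (HOURS_PER_YEAR / 12)

print(round(single_disk_years, 1))  # 22.8
print(array_hours)                  # 2000.0
print(round(array_months, 1))       # 2.7
```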

Data reconstruction (a simple example)

data + data = checksum: 12 + 8 = 20

checksum − data = data: 20 − 12 = 8
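
The example with the values 12, 8 and 20 can be written as code. The sketch below first uses the additive checksum of the example, then the bitwise XOR parity that RAID implementations commonly use in its place (XOR is a substitution for illustration, not something the slide states). All function names are hypothetical.

```python
from functools import reduce
from operator import xor


def additive_checksum(blocks: list[int]) -> int:
    """Checksum as in the slide example: the plain sum of the data blocks."""
    return sum(blocks)


def rebuild_from_checksum(surviving: list[int], checksum: int) -> int:
    """Recover the single missing block by subtracting the survivors."""
    return checksum - sum(surviving)


def xor_parity(blocks: list[int]) -> int:
    """Parity block computed as the XOR of all data blocks."""
    return reduce(xor, blocks, 0)


def rebuild_from_parity(surviving: list[int], parity: int) -> int:
    """Recover the single missing block by XOR-ing parity with the survivors."""
    return reduce(xor, surviving, parity)


checksum = additive_checksum([12, 8])
print(checksum)                               # 20
print(rebuild_from_checksum([12], checksum))  # 8

parity = xor_parity([12, 8])
print(rebuild_from_parity([12], parity))      # 8
```

Either scheme recovers exactly one lost block per checksum. XOR has the practical advantage that the result never needs more bits than the data blocks.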

RAID standard levels

RAID 0 striping only

RAID 1 mirroring only

RAID 2 bit interleaving (not used)

RAID 3 byte interleaving - redundancy (parity disk)

RAID 4 block interleaving - redundancy (parity disk)

RAID 5 block interleaving - redundancy (distributed parity blocks), widely used

RAID 6 greater redundancy (tolerates up to two failed disks)

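RAID 0 and RAID 1

[Figures: disk layouts for RAID 0 and RAID 1]

RAID 3

[Figure: disk layout with parity labels AP(1-3), AP(4-6), BP(1-3), BP(4-6)]

RAID 4, RAID 5, RAID 6

[Figures: disk layouts for RAID 4, RAID 5 and RAID 6]

Nested levels

[Figure: a nested layout combining RAID 1 and RAID 0]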

Overview

RAID level   Utilization   Reliability   R/W performance   Rebuild performance
0            1             N/A           very good         good
1            0.5           excellent     very good/good    good
3            (n-1)/n       good          good/fair         fair
5            (n-1)/n       good          good/fair         poor
6            (n-2)/n       excellent     very good/poor    poor
1+0          0.5           excellent     very good/good    good
5+0          (n-1)/n       excellent     very good/good    fair
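
The utilization column can be turned into a small helper to compare levels for a given number of disks. This is a sketch that copies the formulas from the overview as they are. The function name `utilization` is hypothetical, and reading n for 5+0 as the number of disks in each RAID 5 sub-array is an assumption.

```python
def utilization(level: str, n: int) -> float:
    """Fraction of raw capacity available for data, per the overview table.

    n is the number of disks. For "5+0" it is read here as the number
    of disks in each RAID 5 sub-array (an assumption).
    """
    if level == "0":
        return 1.0
    if level in ("1", "1+0"):
        return 0.5
    if level in ("3", "5", "5+0"):
        return (n - 1) / n
    if level == "6":
        return (n - 2) / n
    raise ValueError(f"unknown RAID level: {level}")


for level in ("0", "1", "5", "6"):
    print(level, utilization(level, 8))
```

With eight disks, RAID 5 keeps 7/8 of the raw capacity and RAID 6 keeps 6/8, which is the price of tolerating a second failure.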

Nota Bene: RAID technology should not be regarded as a substitute for a suitable backup procedure.

Data Mirroring

Synchronous vs asynchronous mirroring

Synchronous mirroring: provides a consistent copy of a source disk on a target disk. Data is written synchronously to the target disk right after it is written to the source disk, so the copy is always up to date.

Asynchronous mirroring: provides a consistent copy of a source disk on a target disk. Data is written asynchronously to the target disk, so the copy is updated continuously but might not contain the last few updates if a disaster recovery operation is performed.

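Synchronous vs asynchronous mirroring (write sequence)

Synchronous: 1. write (to the primary), 2. write (primary to secondary), 3. ack (secondary to primary), 4. ack (primary to the writer)

Asynchronous: 1. write (to the primary), 2. ack (primary to the writer); separately, driven by a synchronization clock: a. write (primary to secondary), b. ack (secondary to primary)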
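
The difference between the two modes is the moment at which the writer receives its acknowledgement. The following is a minimal in-memory sketch; the class `Mirror` and its methods are hypothetical names, and `sync_tick` stands in for the synchronization clock.

```python
from collections import deque


class Mirror:
    """Toy primary/secondary pair kept in dictionaries (block -> data)."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: dict[int, bytes] = {}
        self.secondary: dict[int, bytes] = {}
        self.pending: deque[tuple[int, bytes]] = deque()

    def write(self, block: int, data: bytes) -> str:
        self.primary[block] = data
        if self.synchronous:
            # The secondary is updated before the writer gets its ack.
            self.secondary[block] = data
        else:
            # The ack is returned at once; the secondary catches up later.
            self.pending.append((block, data))
        return "ack"

    def sync_tick(self) -> None:
        """Ship queued updates to the secondary (asynchronous mode only)."""
        while self.pending:
            block, data = self.pending.popleft()
            self.secondary[block] = data


sync = Mirror(synchronous=True)
sync.write(0, b"x")
print(sync.secondary)   # {0: b'x'}

lazy = Mirror(synchronous=False)
lazy.write(0, b"x")
print(lazy.secondary)   # {} : a disaster now would lose this update
lazy.sync_tick()
print(lazy.secondary)   # {0: b'x'}
```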

Multipath

In computer storage, multipath I/O is a fault-tolerance technique that defines more than one physical path between the CPU in a computer system and its mass-storage devices through the buses, controllers, switches, and bridge devices connecting them.

Multipath software layers can also leverage the redundant paths to provide performance-enhancing features, including dynamic load balancing and trunking.

[Figure: two redundant paths, Path 1 and Path 2]
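
Failover and load balancing over redundant paths can be illustrated with a toy dispatcher. This is a sketch only and does not reflect the API of any real multipath driver: `MultipathDevice`, `PathFailed`, `healthy` and `broken` are hypothetical names, and each path is modelled as a plain Python callable.

```python
class PathFailed(Exception):
    """Raised by a path that cannot deliver the request."""


class MultipathDevice:
    """Sends each request down the next path, skipping paths that fail."""

    def __init__(self, paths) -> None:
        self.paths = list(paths)
        self.next_path = 0

    def submit(self, request: str) -> str:
        for attempt in range(len(self.paths)):
            index = (self.next_path + attempt) % len(self.paths)
            try:
                result = self.paths[index](request)
            except PathFailed:
                continue  # fail over to the next path
            # Round robin: start from the following path next time.
            self.next_path = (index + 1) % len(self.paths)
            return result
        raise PathFailed("all paths failed")


def healthy(name: str):
    return lambda request: f"{name} handled {request}"


def broken(request: str) -> str:
    raise PathFailed("link down")


balanced = MultipathDevice([healthy("path 1"), healthy("path 2")])
print(balanced.submit("read A"))  # path 1 handled read A
print(balanced.submit("read B"))  # path 2 handled read B

degraded = MultipathDevice([broken, healthy("path 2")])
print(degraded.submit("read C"))  # path 2 handled read C
```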

Backup

Backup technologies provide effective recovery options for systems subject to data loss from human error, hardware failure or major natural disasters. They are ideally suited for quick restoration of large amounts of lost information and can return complete systems to full operational capacity in a short period of time.

Two alternative techniques:

• Incremental backup: saves the data that has changed since the last backup of any kind.

  • PROs: fast backup, small space occupancy

  • CON: slow recovery

• Differential backup: saves the data that has changed since the last full backup.

  • PRO: fast recovery

  • CONs: slow backup, large space occupancy

Incremental vs differential backup (a simple example)

Backup frequency: daily

Full backup day: Sunday

[Figure: the daily backups of the week under the incremental and the differential scheme]
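
The trade-off between the two schemes can be made concrete with a small simulation. It assumes a full backup on Sunday followed by daily backups, as in the example. The file names, the set of days and the helper names `plan_backups` and `restore_chain` are made up for illustration.

```python
def plan_backups(changes_by_day: dict[str, set[str]], days: list[str]):
    """For each day after the full backup, list what each scheme saves."""
    incremental: dict[str, set[str]] = {}
    differential: dict[str, set[str]] = {}
    changed_since_full: set[str] = set()
    for day in days:
        changed_today = changes_by_day.get(day, set())
        changed_since_full |= changed_today
        incremental[day] = set(changed_today)        # since the last backup
        differential[day] = set(changed_since_full)  # since the last full backup
    return incremental, differential


def restore_chain(days: list[str], target_day: str, scheme: str) -> list[str]:
    """Backups that must be read to restore the state of target_day."""
    if scheme == "incremental":
        return ["full"] + days[: days.index(target_day) + 1]
    return ["full", target_day]


days = ["Mon", "Tue", "Wed"]  # days following the Sunday full backup
changes = {"Mon": {"a.txt"}, "Tue": {"b.txt"}, "Wed": {"a.txt", "c.txt"}}
incremental, differential = plan_backups(changes, days)

print(sorted(incremental["Wed"]))                 # ['a.txt', 'c.txt']
print(sorted(differential["Wed"]))                # ['a.txt', 'b.txt', 'c.txt']
print(restore_chain(days, "Wed", "incremental"))  # ['full', 'Mon', 'Tue', 'Wed']
print(restore_chain(days, "Wed", "differential")) # ['full', 'Wed']
```

The incremental scheme writes less each day but needs the whole chain to restore; the differential scheme writes more each day but restores from two backups.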

Point in time copy

A point in time copy is a logical image of the content of an associated base volume created at a specific moment. A snapshot image can be thought of as a restore point. Snapshot images are useful any time you need to be able to roll back to a known good data set at a specific point in time.

For example, before performing a risky operation on a volume, you can create a snapshot image to enable “undo” capability for the entire volume. A snapshot image is created almost instantaneously, and initially uses no disk space, because it stores only the incremental changes needed to roll the volume back to the point in time when the snapshot image was created.

Two alternative approaches:

• copy on write

• redirect on write

Copy on write

Task: modify C

1. Initial state: A B C D

2. The original C is copied to a separate area: A B C D, plus a saved copy of C

3. C is modified in place: A B Cʹ D, plus the saved copy of C

[Figure labels: volatile memory, modify]
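
A minimal sketch of copy on write, using a list of blocks as the volume: before a block is overwritten for the first time after the snapshot, its original content is copied aside. The class `CowVolume` and its methods are hypothetical names, not the interface of any storage product.

```python
class CowVolume:
    """Volume with one copy-on-write snapshot."""

    def __init__(self, blocks: list[str]) -> None:
        self.blocks = list(blocks)
        self.saved: dict[int, str] | None = None  # originals copied aside

    def take_snapshot(self) -> None:
        self.saved = {}  # empty at first: the snapshot uses no space yet

    def write(self, index: int, data: str) -> None:
        if self.saved is not None and index not in self.saved:
            self.saved[index] = self.blocks[index]  # copy the original first
        self.blocks[index] = data                   # then modify in place

    def snapshot_view(self) -> list[str]:
        if self.saved is None:
            raise ValueError("no snapshot has been taken")
        return [self.saved.get(i, block) for i, block in enumerate(self.blocks)]


volume = CowVolume(["A", "B", "C", "D"])
volume.take_snapshot()
volume.write(2, "C'")

print(volume.blocks)           # ['A', 'B', "C'", 'D']
print(volume.saved)            # {2: 'C'}
print(volume.snapshot_view())  # ['A', 'B', 'C', 'D']
```

Each first write to a block costs one extra copy, while reads of the live volume stay in their original location.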

Redirect on write

Task: modify C

1. Initial state: A B C D

2. The modified block Cʹ is written to a new location and the original C is left untouched: A B C D Cʹ

[Figure labels: volatile memory, modify]
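
A minimal sketch of redirect on write: the original block stays where it is, the new version goes to a fresh location, and only the live volume's block map is updated. The class `RowVolume` and its methods are hypothetical names chosen for illustration.

```python
class RowVolume:
    """Volume with one redirect-on-write snapshot."""

    def __init__(self, blocks: list[str]) -> None:
        self.store = list(blocks)                  # physical blocks
        self.live_map = list(range(len(blocks)))   # logical -> physical
        self.snapshot_map: list[int] | None = None

    def take_snapshot(self) -> None:
        self.snapshot_map = list(self.live_map)    # freeze the current mapping

    def write(self, index: int, data: str) -> None:
        shared = (self.snapshot_map is not None
                  and self.live_map[index] == self.snapshot_map[index])
        if shared:
            self.store.append(data)                # redirect to a new location
            self.live_map[index] = len(self.store) - 1
        else:
            self.store[self.live_map[index]] = data

    def live_view(self) -> list[str]:
        return [self.store[p] for p in self.live_map]

    def snapshot_view(self) -> list[str]:
        if self.snapshot_map is None:
            raise ValueError("no snapshot has been taken")
        return [self.store[p] for p in self.snapshot_map]


volume = RowVolume(["A", "B", "C", "D"])
volume.take_snapshot()
volume.write(2, "C'")

print(volume.store)            # ['A', 'B', 'C', 'D', "C'"]
print(volume.live_view())      # ['A', 'B', "C'", 'D']
print(volume.snapshot_view())  # ['A', 'B', 'C', 'D']
```

No data is copied at write time, but the live volume's blocks become scattered across the store.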

Archive
