StarFish: highly-available block storage. Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, Elizabeth Shriver. 2003 USENIX Annual Technical Conference. Presenter: D00922019 林敬棋

Page 1

StarFish: highly-available block storage

Eran Gabber

Jeff Fellin

Michael Flaster

Fengrui Gu

Bruce Hillyer

Wee Teck Ng

Banu Özden

Elizabeth Shriver

2003 USENIX Annual Technical Conference

Presenter: D00922019 林敬棋

Page 2

Introduction

Important data needs to be protected.
◦ Making replicas.

Replication on remote sites
◦ Reduces the amount of data lost in a failure.
◦ Decreases the time required to recover from a catastrophic site failure.

Page 3

StarFish

A highly-available, geographically-dispersed block storage system.
◦ Does not require expensive dedicated communication lines to all replicas to achieve high availability.
◦ Achieves good performance even during recovery from a replica failure.
◦ Single-owner access semantics.

Page 4

Architecture

StarFish consists of
◦ One Host Element (HE)
  Provides storage virtualization and a read cache.
◦ N Storage Elements (SEs)
  Q: write quorum size.
  Synchronous updates to a quorum of Q SEs, and asynchronous updates to the rest.
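The quorum rule above can be sketched as a small timing model. This is only an illustration, not the StarFish implementation: it assumes the HE sends each write to all N SEs at once and that the quorum is formed by whichever Q SEs acknowledge first. The names `quorum_write` and `WriteResult` and the delay values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WriteResult:
    commit_delay_ms: float    # when the HE can acknowledge the write to the host
    synchronous: List[str]    # SEs whose acks formed the quorum
    asynchronous: List[str]   # SEs that finish the update after the host ack


def quorum_write(se_delays_ms: Dict[str, float], q: int) -> WriteResult:
    """Model one write sent by the HE to all N SEs at time 0.

    Assumption (not from the slides): acks arrive in order of round-trip
    delay, and the first Q acks complete the write.
    """
    n = len(se_delays_ms)
    if not 1 <= q <= n:
        raise ValueError("quorum size must satisfy 1 <= Q <= N")
    # Order SEs by ack arrival time; ties are broken by name for determinism.
    arrival_order = sorted(se_delays_ms, key=lambda se: (se_delays_ms[se], se))
    quorum = arrival_order[:q]
    return WriteResult(
        commit_delay_ms=se_delays_ms[quorum[-1]],  # the Q-th ack completes the write
        synchronous=quorum,
        asynchronous=arrival_order[q:],
    )


if __name__ == "__main__":
    delays = {"local": 0.0, "near": 4.0, "far": 8.0}  # illustrative delays in ms
    for q in (1, 2, 3):
        print(q, quorum_write(delays, q))
```

In this model the write latency seen by the host is the Q-th smallest SE delay, so a slow SE outside the quorum does not delay the host.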

Page 5

Recommended Setup

N = 3, Q = 2

MAN: Metropolitan Area Network
WAN: Wide Area Network

Page 6

Another Deployment

Page 7

SE Recovery

Write log
◦ The HE keeps a circular buffer of recent writes.
◦ Each SE maintains a circular buffer of recent writes on a log disk.

Three types of recovery
◦ Quick recovery
◦ Replay recovery
◦ Full recovery
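A minimal sketch of how the two write logs could drive the choice among the three recovery types. The slide does not state the selection criteria, so the rule used here is an assumption: quick recovery when the missing writes are still in the HE's buffer, replay recovery when they are still in a peer SE's log, and full recovery otherwise. `WriteLog`, `choose_recovery`, and the sequence numbers are hypothetical names introduced for illustration.

```python
from collections import deque


class WriteLog:
    """Circular buffer that keeps only the most recent `capacity` writes."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._entries = deque(maxlen=capacity)

    def append(self, seq: int, block: int, data: bytes) -> None:
        self._entries.append((seq, block, data))

    def covers(self, first_missing_seq: int) -> bool:
        """True if every write from first_missing_seq onward is still logged.

        Assumes sequence numbers are appended contiguously and in order.
        """
        return bool(self._entries) and self._entries[0][0] <= first_missing_seq


def choose_recovery(last_applied_seq: int, latest_seq: int,
                    he_log: WriteLog, se_log: WriteLog) -> str:
    """Pick a recovery type for an SE that has applied writes up to last_applied_seq."""
    if last_applied_seq >= latest_seq:
        return "none"                      # SE is already up to date
    first_missing = last_applied_seq + 1
    if he_log.covers(first_missing):
        return "quick"                     # assumed: resend from the HE buffer
    if se_log.covers(first_missing):
        return "replay"                    # assumed: replay from a peer SE's log disk
    return "full"                          # assumed: copy the whole volume


if __name__ == "__main__":
    he_log, se_log = WriteLog(capacity=4), WriteLog(capacity=16)
    for seq in range(1, 21):               # 20 writes; logs keep only the newest
        he_log.append(seq, block=seq % 8, data=b"x")
        se_log.append(seq, block=seq % 8, data=b"x")
    for last in (20, 18, 10, 2):
        print(last, choose_recovery(last, 20, he_log, se_log))
```

The buffer capacities here are arbitrary; in practice they bound how long an SE can be disconnected before a more expensive recovery type is needed.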

Page 8

Availability and Reliability

Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second.

Similarly, the HE has Poisson failure and recovery processes with rates λhe and μhe.

Page 9

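Availability

The steady-state probability that at least Q SEs are available.

Derived from the standard machine repairman model.

$$A(Q,N) = \sum_{i=0}^{N-Q} \binom{N}{i} \frac{\rho^{i}}{(1+\rho)^{N}}, \qquad 0 \le \rho \le 1,\quad 1 \le Q \le N$$

where ρ = λ/μ.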
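The availability definition can be checked numerically. The sketch below assumes the machine repairman form in which each SE is independently down with probability ρ/(1+ρ), ρ = λ/μ, so that A(Q, N) is the probability that at most N − Q SEs are down. The function name `availability` and the value ρ = 1/99 (an SE that is up 99% of the time) are illustrative choices.

```python
from math import comb


def availability(q: int, n: int, rho: float) -> float:
    """A(Q, N): probability that at most N - Q of the N SEs are down.

    rho is the ratio of failure rate to recovery rate (lambda / mu).
    """
    if not 1 <= q <= n:
        raise ValueError("need 1 <= Q <= N")
    # Sum over i = number of SEs that are down, from 0 to N - Q.
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


if __name__ == "__main__":
    rho = 1 / 99  # assumed: each SE is available 99% of the time
    for q in (1, 2, 3):
        print(f"A({q}, 3) = {availability(q, 3, rho):.6f}")
```

With these inputs the three values come out near 0.999999, 0.999702, and 0.970299, which shows how quickly availability falls as Q approaches N.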

Page 10

Machine Repairman Model

Page 11

Availability (cont.)

Page 12

Availability (cont.)

X ★ 9: the number of 9s in an availability measure.

Much higher availability is achieved when N = 2Q + 1.

For fixed N, availability decreases with larger quorum size.
◦ Increasing quorum size trades off availability for reliability.
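To make the "number of 9s" measure concrete, the sketch below converts an availability value into nines as −log10(1 − A) and tabulates it for a few (N, Q) pairs. It assumes the same availability form as the model above (each SE independently down with probability ρ/(1+ρ)) and an illustrative ρ = 1/99. The helper names are hypothetical.

```python
from math import comb, inf, log10


def availability(q: int, n: int, rho: float) -> float:
    """Probability that at most N - Q of the N SEs are down (assumed model)."""
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def nines(a: float) -> float:
    """Number of 9s in an availability value, e.g. 0.999 -> about 3."""
    return inf if a >= 1.0 else -log10(1.0 - a)


if __name__ == "__main__":
    rho = 1 / 99  # assumed: each SE is available 99% of the time
    for n in (2, 3, 4, 5):
        row = ", ".join(
            f"Q={q}: {nines(availability(q, n, rho)):.1f}" for q in range(1, n + 1)
        )
        print(f"N={n}: {row}")
```

Each row shows the nines dropping as Q grows for a fixed N, which is the availability side of the trade-off with reliability.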

Page 13

Reliability

The probability of no data loss.

Reliability increases with larger Q.

Two approaches
◦ Make Q > floor(N/2) and require at least Q SEs to be available.
  Reduces availability and performance.
◦ Read-only consistency
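One way to read the first approach (an interpretation, not spelled out on the slide): the latest write reached some set of Q SEs, and if any Q SEs are available afterwards, a majority quorum guarantees the two sets overlap, so at least one available SE holds the latest write. The brute-force check below verifies that overlap property for small N. The function name is hypothetical.

```python
from itertools import combinations


def quorums_always_intersect(n: int, q: int) -> bool:
    """True if every pair of Q-subsets of N SEs shares at least one SE."""
    ses = range(n)
    return all(
        set(a) & set(b)                      # non-empty intersection is truthy
        for a in combinations(ses, q)
        for b in combinations(ses, q)
    )


if __name__ == "__main__":
    for n in (3, 4, 5):
        for q in range(1, n + 1):
            majority = q > n // 2
            print(f"N={n} Q={q} majority={majority} "
                  f"intersect={quorums_always_intersect(n, q)}")
```

For every pair printed, the overlap property holds exactly when Q > floor(N/2).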

Page 14

Read-only Consistency

Data remains available in read-only mode during failure.
◦ Read-only mode obviates the need for Q SEs to be available to handle updates.
◦ Increases availability

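$$A_{ReadOnly}(Q,N) = \sum_{i=0}^{N-1} \binom{N}{i} \frac{\rho^{i}}{(1+\rho)^{N}(1+\rho_{he})} + \sum_{i=0}^{Q-1} \binom{Q}{i} \frac{\rho_{he}\,\rho^{i}}{(1+\rho)^{Q}(1+\rho_{he})}$$

$$A_{ReadOnly}(Q,N) = \frac{A(1,N)}{1+\rho_{he}} + \frac{\rho_{he}\,A(1,Q)}{1+\rho_{he}}$$

where ρhe = λhe/μhe.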
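A short numerical sketch of read-only availability. It assumes the form A_ReadOnly(Q, N) = A(1, N)/(1+ρhe) + ρhe·A(1, Q)/(1+ρhe), where A(1, K) is the probability that at least one of K SEs is up. The function names and the parameter values are illustrative.

```python
from math import comb


def availability(q: int, n: int, rho: float) -> float:
    """Probability that at most N - Q of the N SEs are down (assumed model)."""
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def availability_read_only(q: int, n: int, rho: float, rho_he: float) -> float:
    """Read-only availability, weighting the HE-up and HE-down cases.

    HE up   (weight 1 / (1 + rho_he)):      any one of the N SEs suffices.
    HE down (weight rho_he / (1 + rho_he)): one of the Q quorum SEs is needed.
    """
    he_up = availability(1, n, rho)
    he_down = availability(1, q, rho)
    return (he_up + rho_he * he_down) / (1 + rho_he)


if __name__ == "__main__":
    rho = 1 / 99  # assumed SE failure-to-recovery ratio
    for rho_he in (0.0, 0.01, 0.1):
        row = ", ".join(
            f"Q={q}: {availability_read_only(q, 3, rho, rho_he):.6f}"
            for q in (1, 2, 3)
        )
        print(f"rho_he={rho_he}: {row}")
```

With ρhe = 0 the three columns are identical, and for ρhe > 0 the value rises with Q.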

Page 15

Availability with Read-only Consistency

Page 16

Observations

If ρhe = 0, availability is independent of Q.
◦ Can always recover from the HE.

If ρhe increases, availability increases with Q.

The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1.
◦ Diminishing gain after Q = 2.
◦ Suggests Q = 2 in a practical system.
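The diminishing gain can be seen with a short derivation. This is a sketch that assumes the read-only availability has the form A(1,N)/(1+ρhe) + ρhe·A(1,Q)/(1+ρhe), with A(1,Q) the probability that at least one of Q SEs is up; only the second term depends on Q.

$$A(1,Q) = 1 - \left(\frac{\rho}{1+\rho}\right)^{Q}$$

$$A_{ReadOnly}(Q+1,N) - A_{ReadOnly}(Q,N) = \frac{\rho_{he}}{1+\rho_{he}}\,\bigl[A(1,Q+1) - A(1,Q)\bigr] = \frac{\rho_{he}}{1+\rho_{he}} \cdot \frac{\rho^{Q}}{(1+\rho)^{Q+1}}$$

Under this assumption each additional quorum member contributes ρ/(1+ρ) times the gain of the previous one, so the step from Q = 1 to Q = 2 is the largest and the gain vanishes when ρhe = 0.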

Page 17

Implementation

Page 18

Performance Measurements

Compared with a direct-attached RAID unit.

Page 19

Settings

Different network delays
◦ 1, 2, 4, 8, 23, 36, 65 ms

Different bandwidth limitations
◦ 31, 51, 62, 93, 124 Mb/s

Benchmarks
◦ Micro-benchmarks
  Read hit
  Read miss
  Write
◦ PostMark

Page 20

Effects of network delays and HE cache size

Near SE delay: 4 ms; far SE delay: 8 ms
No cache misses if HE cache size = 400 MB

Page 21

Observation

A large HE cache improves performance.
◦ The HE can respond to more read requests without communicating with an SE.
  Does not change write requests.
◦ Especially beneficial when the local SE has significant delays.

With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE.
◦ It depends on the near SE.

Page 22

Normal Operation and placement of the far SE

1-8: 1, 2, 4, 8 ms
4-12: 4, 8, 12 ms
23-65: 23, 36, 65 ms
31-124: 31, 51, 62, 93, 124 Mb/s
Local SE delay: 0 ms
N = 3

Page 23

Normal Operation and placement of the far SE (cont.)

N = 3, 8 threads

Page 24

Normal Operation and placement of the far SE (cont.)

Page 25

Observation

Performance is influenced mostly by two parameters
◦ Write quorum size
◦ Delay to the SEs

StarFish can provide adequate performance when one of the SEs is placed in a remote location.
◦ At least 85% of the performance of a direct-attached RAID.

Page 26

Recovery

Performance degrades more during full recovery.

Page 27

Conclusion

The StarFish system reveals significant benefits from a third copy of the data at an intermediate distance.

A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than 99.9999% availability assuming individual Storage Element availability of 99%.