
Geographically distributed Swift clusters

Alistair Coles, Swift core developer

alistairncoles@gmail.com · irc: acoles

Overview

• What is Swift?
• Geographically distributed clusters
  • What? Why? How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • …enabled by fragment duplication and composite rings
• Summary

What is Swift?

• Object storage service
• REST API: Create, Read, Update, Delete
• Simple naming hierarchy
  • Objects belong to containers
  • Containers belong to accounts

Swift is durable

[Diagram: a proxy server consults the Ring and writes three replicas (0, 1, 2) of PUT a/c/o to different object servers; 3 replica policy]

• Multiple replicas of every object (or erasure coding)
• The Ring always tries to disperse replicas across different devices and servers
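The Ring's placement starts from a consistent hash of the object path. A minimal sketch of the partition lookup, assuming the real ring's md5-based scheme (the part power is illustrative, and Swift's hash path prefix/suffix salts from swift.conf are omitted for brevity):

```python
import hashlib
import struct

# Sketch of how Swift's ring maps an object path to a partition.
# A real cluster salts the path with hash prefix/suffix values from
# swift.conf; PART_POWER here is illustrative.
PART_POWER = 10                  # 2**10 = 1024 partitions
PART_SHIFT = 32 - PART_POWER

def get_partition(account, container, obj):
    path = "/%s/%s/%s" % (account, container, obj)
    digest = hashlib.md5(path.encode()).digest()
    # Top PART_POWER bits of the first 4 digest bytes select the partition
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT

part = get_partition("a", "c", "o")
assert 0 <= part < 2 ** PART_POWER
```

The partition then indexes into the ring's replica-to-device table, which is built so the assigned devices sit on different servers (and, later, regions).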

Swift is scalable

[Diagram: two proxy servers each consult the Ring; PUT a/c/o and PUT a2/c2/o2 land on different sets of object servers under a 3 replica policy]

• The Ring always tries to balance load across all devices
• No centralized services

Swift is highly available

[Diagram: PUT a/c/o @ t1 ('my old data') succeeds on two of three replicas; the missing replica receives an async update later; 3 replica policy]

• A write succeeds on a quorum of replicas
• Missing replicas are updated asynchronously
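The quorum rule above is simple majority arithmetic; a sketch (the function name is illustrative, but this mirrors Swift's behaviour for replicated policies):

```python
# For replicated policies a PUT succeeds once a majority of the
# replicas has been written; the rest are repaired asynchronously.
def quorum_size(replica_count):
    return replica_count // 2 + 1

assert quorum_size(3) == 2   # the 3 replica policy above needs 2 successes
assert quorum_size(4) == 3   # a 4 replica geo policy needs 3
```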

Swift is eventually consistent

[Diagram: PUT a/c/o @ t2 ('my new data') updates two replicas; a GET a/c/o @ t2 + ∂ via another proxy returns the stale 'my old data' from the replica still at t1, until the async update completes; 3 replica policy]

• A read may return stale data
• An async process makes the data eventually consistent
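Conflicts between replicas are resolved by timestamp: the newest write wins. A sketch with illustrative in-memory "replicas":

```python
# Each replica records the timestamp of the PUT that wrote it; the
# async consistency process keeps the copy with the newest timestamp.
replicas = [
    {"data": "my old data", "timestamp": 1.0},  # missed the overwrite
    {"data": "my new data", "timestamp": 2.0},
    {"data": "my new data", "timestamp": 2.0},
]
winner = max(replicas, key=lambda r: r["timestamp"])
assert winner["data"] == "my new data"
```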

Overview

• What is Swift?
• Geographically distributed clusters
  • What? Why? How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • …but there’s some new stuff to know about
• Deep dive: erasure coding, fragment duplication, composite rings
• Summary

Geographically distributed clusters

• What?
  • Multiple physical locations, typically connected by a high-latency/low-bandwidth WAN
  • Copies of data in each location
  • A single namespace
• Why?
  • Disaster recovery
  • Data locality
• Also known as “global clusters” or “multi-region Swift”

Geographically distributed Swift clusters

[Diagram: a proxy server writes four replicas (0–3) of PUT a/c/o under a 4 replica policy, two to servers in region1 and two across the WAN to region2]

• The Ring always tries to disperse replicas across different devices and servers …and regions

Geographically distributed clusters – disaster recovery

[Diagram: each region holds two of the four replicas, so a GET a/c/o can be served entirely from region1 even if region2 is unreachable; 4 replica policy]

• A 4-replica policy makes each region independently robust to a single device failure

Geographically distributed Swift clusters – data locality

[Diagram: GETs for a/c/o at t1, t2 and t3 are served from randomly chosen replicas, some across the WAN in the remote region; 4 replica policy]

• By default the proxy tries to balance read load by choosing random replicas

Read affinity – trade off load balancing for read performance

[Diagram: the region1 proxy has read_affinity -> region1 and the region2 proxy has read_affinity -> region2, so GETs at t1–t4 always read a replica in the local region; 4 replica policy]
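Read affinity is configured per proxy server. A minimal proxy-server.conf sketch (the region priorities are illustrative; lower values are preferred first):

```ini
[app:proxy-server]
# Sort replicas by affinity instead of shuffling them
sorting_method = affinity
# Prefer devices in ring region 1, then region 2
read_affinity = r1=100, r2=200
```

Each proxy would put its own local region first, e.g. the region2 proxy would use `read_affinity = r2=100, r1=200`.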

What if remote writes fail?

[Diagram: PUT a/c/o @ t1 under a 4 replica policy with the WAN down; the remote replicas land in temporarily misplaced locations in region1 and are moved to region2 asynchronously]

• Remote replicas are written to temporary locations.
• An async process moves them when the remote region is available.

Overwrite – replicas are eventually consistent

[Diagram: PUT a/c/o @ t1 ('my old data') replicates to both regions; the WAN fails at t2; PUT a/c/o @ t3 ('my new data') updates region1 plus temporary local copies; a GET a/c/o @ t4 via the region2 proxy returns the temporarily stale 'my old data' until the async move to the remote region completes]

What if remote writes are slow?

[Diagram: PUT a/c/o under a 4 replica policy, with two replicas written across the WAN to region2]

“But isn’t this a terrible idea? All my writes will be slowed down by requests to the remote region!”

Effect of remote region write time on PUTs (remote region servers artificially slowed)

[Chart: 10K-object PUT time (s) vs elapsed time (secs), for remote write latencies of 100ms, 200ms, 400ms, 600ms and 800ms, measured with post_quorum_timeout = 0.5]

• post_quorum_timeout puts an upper bound on the extra latency
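The bound shown in the chart comes from a proxy server option; a proxy-server.conf sketch (the value matches the one used in the measurement above):

```ini
[app:proxy-server]
# After a quorum of backend responses has been collected, wait at most
# this many seconds for the remaining (possibly remote) replicas
post_quorum_timeout = 0.5
```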

Write affinity – temporarily trade off dispersion for write performance

[Diagram: with write_affinity = region1, a PUT a/c/o writes all four replicas to region1; an async process later moves the extra copies to region2]

• Remote replicas are initially written to the local region.
• An async process moves them to the remote region.
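Write affinity is likewise a proxy server option; a proxy-server.conf sketch (the region is illustrative):

```ini
[app:proxy-server]
# Write all replicas to ring region 1 first; async replication moves
# the extra copies to their proper remote locations later
write_affinity = r1
# How many local candidate nodes to try before falling back to remote ones
write_affinity_node_count = 2 * replicas
```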

Write affinity performance improvement (remote region servers artificially slowed)

[Chart: 10K-object PUT time (s) vs elapsed time (secs); the write-affinity-enabled curve is compared against remote write latencies of 100ms, 200ms, 400ms, 600ms and 800ms]

Write affinity – data is always available

[Diagram: with write_affinity = region1, PUT a/c/o @ t1 writes all replicas locally; a GET a/c/o @ t1 + ∂ via the region2 proxy falls back across the WAN to region1 and still returns 'my data']

Write affinity – trade off consistency for write performance

[Diagram: PUT a/c/o @ t1 ('my data') replicates to both regions; with write_affinity = region1, a PUT a/c/o @ t2 ('my new data') initially lands only in region1; a GET a/c/o @ t2 + ∂ via the region2 proxy returns the temporarily stale 'my data' until the async move to the remote region completes]

Workload considerations

• Use write affinity with care! There is no free lunch: replicas have to be copied across the WAN eventually.
• Suitable workloads:
  • Moderate or bursty write rates
  • Non-immediate remote reads
  • E.g. replicating archive data across sites
• Unsuitable workloads:
  • Continuous high write rates: misplaced replicas will back up in the local region
  • Immediate remote reads: reads will fetch data over the WAN before the async move has happened

Overview

• What is Swift?
• Geographically distributed clusters
  • What? Why? How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • …but there’s some new stuff to know about
• Deep dive: erasure coding, fragment duplication, composite rings
• Summary

Erasure coding – same durability, less storage

[Diagram: data is encoded into fragments 0–5; any 4 unique fragments, e.g. 0, 1, 3 and 5, decode back into the data]

• Example: 4 data fragments + 2 parity fragments
• Any subset of 4 unique fragments is sufficient to reconstruct the data
• Storing all 6 fragments requires only 1.5x the size of the data
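The storage arithmetic on this slide is just (k + m) / k; a quick check (the function name is illustrative):

```python
# An EC k+m policy stores (k + m) / k times the data size;
# n-way replication stores n times the data size.
def ec_overhead(k, m):
    return (k + m) / k

assert ec_overhead(4, 2) == 1.5   # EC 4+2: 1.5x, vs 3x for 3 replicas
assert ec_overhead(4, 6) == 2.5   # EC 4+6, used later across 2 regions
```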

Erasure coding in Swift

[Diagram: under an EC 4+2 policy, the proxy server erasure codes the data and uses the Ring to disperse fragments 0–5 across different object servers]

• https://github.com/openstack/pyeclib – Python interface to liberasurecode
• https://github.com/openstack/liberasurecode – C library with pluggable erasure code backends
• 4+2 erasure coding requires approx. 50% of the storage of 3 replicas, with similar durability
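An erasure coded policy is declared in swift.conf; a sketch (the policy index and name are illustrative, and the available ec_type values depend on which liberasurecode backends are installed):

```ini
[storage-policy:2]
name = ec42
policy_type = erasure_coding
ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 4
ec_num_parity_fragments = 2
```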

Erasure coding across regions

[Diagram: an EC 4+2 policy disperses fragments 0–5 across two regions, three fragments per region]

• The Ring always tries to disperse fragments across different devices and servers …and regions

Erasure coding across regions

[Diagram: with the fragments of an EC 4+2 policy split three per region, neither region holds the 4 fragments needed to reconstruct the data on its own]

• With EC 4+2 we don’t have enough fragments in one region to reconstruct the data!

Erasure coding across regions requires more fragments

[Diagram: an EC 4+6 policy disperses fragments 0–9 across two regions, five per region, so each region can reconstruct the data]

• How about EC 4+6? It requires 2.5x the size of the data…
• …vs replication, which requires 4x the size of the data for similar durability

Erasure coding time increases with number of parity fragments

[Chart: relative compute time vs number of parity fragments (2–7), for 4 data fragments, the isa_l_rs_cauchy backend and a 40MB object]

EC duplication: more fragments, less compute

[Diagram: under an EC 4+1 policy plus duplication, the proxy encodes fragments 0–4 and writes one complete copy of the fragment set to each region]

• Each region has EC 4+1 fragments: 2.5x the size of the data…
• …vs replication, which requires 4x the size of the data for similar durability
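Fragment duplication can be sketched as a modulo mapping from ring position to fragment index (in swift.conf this corresponds to the ec_duplication_factor option; the sketch below is illustrative):

```python
# EC 4+1 gives 5 unique fragment indexes; with a duplication factor of 2
# the ring has 10 positions, and position i stores fragment i % 5, so a
# run of 5 positions holds one complete set of fragments.
EC_K, EC_M = 4, 1
UNIQUE_FRAGS = EC_K + EC_M
DUPLICATION_FACTOR = 2
total_positions = UNIQUE_FRAGS * DUPLICATION_FACTOR

frag_for_position = [i % UNIQUE_FRAGS for i in range(total_positions)]
assert frag_for_position == [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# Any EC_K of the 5 unique fragments in one region can reconstruct the data
assert len(set(frag_for_position[:UNIQUE_FRAGS])) == UNIQUE_FRAGS
```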

But duplicates must be correctly dispersed…

[Diagram: under an EC 4+1 policy plus duplication, careless placement leaves one region with fragments 1, 1, 2, 3, 3 and the other with 0, 0, 2, 4, 4]

• We don’t have enough unique fragments in either region to reconstruct the data!

Solution: EC duplication + composite rings

[Diagram: a PUT a/c/o under an EC 4+1 policy plus duplication writes a complete set of fragments 0–4 to each region]

• Each region has its own ‘component’ ring: this guarantees a set of unique fragments in each region.

Summary

• Swift enables you to build geographically distributed clusters
  • Good for disaster recovery and data locality
  • Tuning via options in the Swift proxy server: write_affinity, read_affinity
  • Understand your workloads!
• Erasure coded storage policies can also be geographically distributed (new in Swift 2.15.0)
  • Uses new features: EC fragment duplication and composite rings

Swift welcomes new users and contributors

• You can find us on freenode in #openstack-swift
• Project: https://launchpad.net/swift
• Docs: https://docs.openstack.org/swift
• Code: https://github.com/openstack/swift