Geographically distributed Swift clusters
Alistair Coles, Swift core developer
[email protected]: acoles
Overview
• What is Swift?
• Geographically distributed clusters
  • What?
  • Why?
  • How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • … enabled by fragment duplication and composite rings
• Summary
What is Swift?
• Object storage service
• REST API – Create, Read, Update, Delete
• Simple naming hierarchy
  • objects belong to containers
  • containers belong to accounts
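A minimal sketch of that API using the Python requests library; the storage URL and token are placeholders, since in practice you obtain them from your auth system (e.g. Keystone or tempauth):

    import requests

    storage_url = 'http://swift.example.com/v1/AUTH_test'  # hypothetical endpoint
    headers = {'X-Auth-Token': 'TOKEN'}                    # hypothetical token

    # Create: PUT container 'c' in the account, then object 'o' inside it
    requests.put(f'{storage_url}/c', headers=headers)
    requests.put(f'{storage_url}/c/o', headers=headers, data=b'my data')

    # Read it back
    resp = requests.get(f'{storage_url}/c/o', headers=headers)
    assert resp.content == b'my data'

    # Update is just another PUT; Delete removes the object
    requests.delete(f'{storage_url}/c/o', headers=headers)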
Swift is durable
[Diagram: a proxy server uses the Ring to PUT a/c/o under a 3-replica policy; replicas 0, 1 and 2 land on different servers]
• Multiple replicas of every object (or erasure coding)
• The Ring always tries to disperse replicas across different devices and servers
Swift is scalable
[Diagram: two proxy servers independently PUT a/c/o and a2/c2/o2 under a 3-replica policy; the Ring spreads the replicas across all servers]
• The Ring always tries to balance load across all devices
• No centralized services
Swift is highly available
[Diagram: PUT a/c/o @ t1 ('my old data') under a 3-replica policy; two replicas are written at t1 and the missing replica is filled in by an async update]
• Write succeeds on a quorum of replicas
• Missing replicas are updated asynchronously
Swift is eventually consistent
[Diagram: PUT a/c/o @ t2 ('my new data') updates two replicas; a GET a/c/o @ t2 + ∂ through another proxy returns 'my old data' from the replica still at t1 until the async update completes]
• Possibility of reading stale data
• Async process makes data eventually consistent
Overview
• What is Swift?
• Geographically distributed clusters
  • What?
  • Why?
  • How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • … but there's some new stuff to know about
• Deep dive: erasure coding, fragment duplication, composite rings
• Summary
Geographically distributed clusters
• What?
  • Multiple physical locations, typically connected by a high-latency/low-bandwidth WAN
  • Copies of data in each location
  • Single namespace
• Why?
  • Disaster recovery
  • Data locality
• Also known as "Global clusters" or "Multi-region Swift"
Geographically distributed Swift clusters
[Diagram: region1 and region2 connected by a WAN; PUT a/c/o under a 4-replica policy places replicas 0 and 1 in region1, and 2 and 3 in region2]
• The Ring always tries to disperse replicas across different devices and servers … and regions
Geographically distributed clusters – disaster recovery
[Diagram: under a 4-replica policy a GET a/c/o can be served in either region, even after a device failure]
• A 4-replica policy makes each region independently robust to a single device failure
Geographically distributed Swift clusters – data locality
[Diagram: GET a/c/o requests at t1, t2 and t3 are served by randomly chosen replicas, some of them across the WAN]
• By default the Ring will try to balance read load by choosing random replicas
Read affinity – trade off load balancing for read performance
[Diagram: the region1 proxy is configured with read_affinity -> region1 and the region2 proxy with read_affinity -> region2; GETs at t1 to t4 always read a replica in the local region]
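A minimal proxy-server.conf sketch of read affinity; the weights shown are illustrative (r1 denotes region 1, r1z1 region 1 zone 1, and lower values are preferred):

    [app:proxy-server]
    sorting_method = affinity
    # prefer region 1 for reads, and region 1 zone 1 most of all
    read_affinity = r1z1=100, r1=200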
What if remote writes fail?
[Diagram: PUT a/c/o @ t1 cannot reach region2, so replicas 2 and 3 are temporarily misplaced in region1; an async process moves them to region2 once it is reachable]
• Remote replicas are written to temporary locations
• Async process moves them when the remote region is available
Overwrite – replicas are eventually consistent
[Diagram: PUT a/c/o @ t1 ('my old data'); the WAN fails at t2; PUT a/c/o @ t3 ('my new data') updates only region1, so a GET a/c/o @ t4 through the region2 proxy returns 'my old data' from temporarily stale replicas until the async move completes]
What if remote writes are slow?
[Diagram: PUT a/c/o under a 4-replica policy must write replicas 2 and 3 across the WAN]
"But isn't this a terrible idea? All my writes will be slowed down by requests to the remote region!"
Effect of remote region write time on PUTs (remote region servers artificially slowed)
[Chart: 10K-object PUT time (s) over time, for remote write latencies of 100ms, 200ms, 400ms, 600ms and 800ms, with post_quorum_timeout = 0.5]
• post_quorum_timeout puts an upper bound on the extra latency
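The setting lives in proxy-server.conf; a minimal sketch using the value from the test above (0.5 seconds, which I believe is also the default):

    [app:proxy-server]
    # how long to wait for the remaining slow backends once a
    # quorum of responses has already been received
    post_quorum_timeout = 0.5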
Write affinity – temporarily trade off dispersion for write performance
[Diagram: with write_affinity = region1, PUT a/c/o initially writes all four replicas to region1; an async process later moves replicas 2 and 3 to region2]
• Remote replicas are initially written to the local region
• Async process moves them to the remote region
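A proxy-server.conf sketch of write affinity, assuming the local region is region 1; the node count shown is illustrative:

    [app:proxy-server]
    # write new objects to region 1 first
    write_affinity = r1
    # how many local nodes to try before spilling outside the
    # affinity region
    write_affinity_node_count = 2 * replicas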
Write affinity performance improvement (remote region servers artificially slowed)
[Chart: 10K-object PUT time (s) over time; with write affinity enabled, PUT times are unaffected by remote write latencies of 100ms to 800ms]
Write affinity – data is always available
[Diagram: PUT a/c/o @ t1 ('my data') with write_affinity = region1; a GET a/c/o @ t1 + ∂ through the region2 proxy falls back to reading from region1 across the WAN and returns 'my data']
Write affinity – trade off consistency for write performance
[Diagram: PUT a/c/o @ t1 ('my data'), then PUT a/c/o @ t2 ('my new data') with write_affinity = region1; until the async move completes, region2 holds temporarily stale replicas, so a GET a/c/o @ t2 + ∂ through the region2 proxy returns 'my data']
Workload considerations
• Use write affinity with care! "No free lunch" – replicas have to be copied across the WAN eventually
• Suitable workloads:
  • Moderate or bursty write rates
  • Non-immediate remote reads
  • E.g. replicating archive data across sites
• Unsuitable workloads:
  • Continuous high write rate – misplaced replicas will back up in the local region
  • Immediate remote reads – reads will fetch data over the WAN before the async move has happened
Overview
• What is Swift?
• Geographically distributed clusters
  • What?
  • Why?
  • How?
• Erasure coded geographically distributed clusters
  • Swift now supports these!
  • … but there's some new stuff to know about
• Deep dive: erasure coding, fragment duplication, composite rings
• Summary
Erasure coding – same durability, less storage
[Diagram: data is erasure coded into fragments 0 to 5; any 4 unique fragments decode back to the data]
• Example: 4 data fragments + 2 parity fragments
• Any subset of 4 unique fragments is sufficient to reconstruct the data, e.g. fragments 0, 1, 3 and 5
• Requires only 1.5 x the size of the data to store all fragments
Erasure coding in Swift
[Diagram: the proxy server erasure codes data into fragments 0 to 5 under an EC 4+2 policy and disperses them across servers]
• https://github.com/openstack/liberasurecode – C library with pluggable erasure code backends
• https://github.com/openstack/pyeclib – Python interface to liberasurecode
• 4+2 erasure coding requires approx. 50% of the storage of 3 replicas, with similar durability
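A minimal sketch of the encode/decode round trip using pyeclib's ECDriver; the ec_type shown is one of several pluggable backends:

    from pyeclib.ec_iface import ECDriver

    # 4 data + 2 parity fragments, as in the EC 4+2 policy above
    driver = ECDriver(k=4, m=2, ec_type='liberasurecode_rs_vand')

    data = b'my data' * 1000
    fragments = driver.encode(data)    # returns k + m = 6 fragments

    # any 4 unique fragments are sufficient to reconstruct the data
    assert driver.decode(fragments[:4]) == data
    assert driver.decode([fragments[0], fragments[1],
                          fragments[3], fragments[5]]) == data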
Erasure coding across regions
[Diagram: under an EC 4+2 policy, fragments 0 to 5 are dispersed across servers in region1 and region2]
• The Ring always tries to disperse fragments across different devices and servers … and regions
Erasure coding across regions
[Diagram: EC 4+2 leaves only three fragments in each region]
• With EC 4+2 we don't have enough fragments in one region to reconstruct the data!
Erasure coding across regions requires more fragments
[Diagram: an EC 4+6 policy places five fragments in each region, so either region alone can reconstruct the data]
• How about EC 4+6? Storing all 10 fragments requires (4+6)/4 = 2.5 x the size of the data
• vs. replication, which requires 4 x the size of the data for similar durability
Erasure coding time increases with the number of parity fragments
[Chart: relative compute time vs. number of parity fragments (2 to 7), for 4 data fragments, isa_l_rs_cauchy backend, 40MB object]
EC duplication: more fragments, less compute
[Diagram: an EC 4+1 policy plus duplication; the proxy encodes fragments 0 to 4 and writes two copies of each, so each region holds a full set of fragments 0 to 4]
• Each region has EC 4+1 fragments: storing both duplicates requires 2 x (4+1)/4 = 2.5 x the size of the data
• vs. replication, which requires 4 x the size of the data for similar durability
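A swift.conf sketch of such a policy; the policy index and name are illustrative, and ec_duplication_factor multiplies the fragment count:

    [storage-policy:2]
    name = ec41-duplicated
    policy_type = erasure_coding
    ec_type = liberasurecode_rs_vand
    ec_num_data_fragments = 4
    ec_num_parity_fragments = 1
    # store 2 copies of each fragment, one set per region
    ec_duplication_factor = 2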
But duplicates must be correctly dispersed…
[Diagram: with a single ring, duplicate fragments can land badly, e.g. one region holding two copies of some fragments and none of others]
• We don't have enough unique fragments in this region to reconstruct the data!
Solution: EC duplication + composite rings
[Diagram: PUT a/c/o under an EC 4+1 policy plus duplication; composite rings place a complete set of unique fragments 0 to 4 in each region]
• Each region has its own 'component' ring – this guarantees a set of unique fragments in each region
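A hedged sketch of composing two per-region component rings, assuming the compose_rings helper in swift.common.ring.composite_builder (the builder file names are hypothetical):

    from swift.common.ring import RingBuilder
    from swift.common.ring.composite_builder import compose_rings

    # one component builder per region, each balanced independently
    builders = [RingBuilder.load('region1.builder'),
                RingBuilder.load('region2.builder')]

    # compose them so each region holds a complete, unique set of fragments
    ring_data = compose_rings(builders)
    ring_data.save('object-2.ring.gz')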
Summary
• Swift enables you to build geographically distributed clusters
  • Good for disaster recovery and data locality
  • Tuning via options in the Swift proxy server: write_affinity, read_affinity
  • Understand your workloads!
• Erasure coded storage policies can also be geographically distributed¹
  • Uses new features: EC fragment duplication and composite rings
¹ new in Swift 2.15.0
Swift welcomes new users and contributors
• You can find us on freenode in #openstack-swift
• Project: https://launchpad.net/swift
• Docs: https://docs.openstack.org/swift
• Code: https://github.com/openstack/swift