Distributed Systems Concepts

Page 1: Distributed Systems Concepts

Intro to Distributed Systems Concepts

Jordan Halterman

Page 2: Distributed Systems Concepts

I N T R O D U C T I O N

2

Page 5: Distributed Systems Concepts

I N T R O D U C T I O N

5

• What is a distributed system?

• Anatomy of a distributed system

• Fallacies of distributed computing

Page 6: Distributed Systems Concepts

A collection of independent computers that appear to users as a single coherent system

W H A T I S A D I S T R I B U T E D S Y S T E M ?

6

Page 7: Distributed Systems Concepts

“A collection of independent computers that appear to the users of the system as a single computer”

— Andrew Tanenbaum

W H A T I S A D I S T R I B U T E D S Y S T E M ?

7

Page 8: Distributed Systems Concepts

“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done”

— Leslie Lamport

W H A T I S A D I S T R I B U T E D S Y S T E M ?

8

Page 9: Distributed Systems Concepts

• Scalability and fault tolerance

• Memory, disk, and CPU are finite resources

• Computers crash and networks fail

• Science hasn’t kept up with technological needs

W H A T I S A D I S T R I B U T E D S Y S T E M ?

9

Page 10: Distributed Systems Concepts

B U T D I S T R I B U T E D S Y S T E M S A R E H A R D !

1 0

Page 11: Distributed Systems Concepts

T H E T W O G E N E R A L S P R O B L E M

1 1

• Two generals on the opposite sides of a valley have to coordinate to decide when to attack

• Each general must be sure the other made the same decision

• Generals can only communicate through messages

• Messengers sent through the valley can be captured

Page 12: Distributed Systems Concepts

A N A T O M Y O F A D I S T R I B U T E D S Y S T E M

1 2

Page 13: Distributed Systems Concepts

Nodes

A N A T O M Y O F A D I S T R I B U T E D S Y S T E M

1 3

Page 14: Distributed Systems Concepts

Nodes Networks

A N A T O M Y O F A D I S T R I B U T E D S Y S T E M

1 4

Page 15: Distributed Systems Concepts

Nodes Networks Protocols

A N A T O M Y O F A D I S T R I B U T E D S Y S T E M

1 5

Page 16: Distributed Systems Concepts

• Each independent component of a distributed system is called a node

• Also known as a process, agent or actor

• Operations within a node are fast

• Communication between nodes is slow

• Operations generally occur in order

N O D E S

1 6

Page 17: Distributed Systems Concepts

S Y S T E M M O D E L

1 7

SYNCHRONOUS

• Bounded message delays

• Accurate global clock

• Easy to reason about

• You don’t have one

ASYNCHRONOUS

• Processes execute independently

• Unbounded message delays

• No global clock

• Difficult to reason about

• You have one

Page 18: Distributed Systems Concepts

• Nodes communicate via messages

• Examples: UDP, TCP, HTTP

M E S S A G E P A S S I N G

1 8

Page 19: Distributed Systems Concepts

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

1 9

Page 27: Distributed Systems Concepts

• The network is reliable

• Latency is zero

• Bandwidth is infinite

• The network is secure

• Topology doesn’t change

• There is one administrator

• Transport cost is zero

• The network is homogeneous

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

2 7

Page 28: Distributed Systems Concepts

FALLACY #1

THE NETWORK IS RELIABLE

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

2 8

Page 29: Distributed Systems Concepts

• On average 5.2 devices and 40.8 links fail per day in Microsoft data centers

• The majority of Google’s outages that lasted more than 30 seconds were due to network maintenance or connectivity issues

• If network hardware doesn’t fail, software will

• We cannot rely on the network to deliver our communications

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

2 9

Page 30: Distributed Systems Concepts

FALLACY #2

LATENCY IS ZERO

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 0

Page 31: Distributed Systems Concepts

• Latency is the time it takes for a signal to travel from one computer to another

• Latency is bounded below by the speed of light

• Even in a vacuum, light takes roughly 40 milliseconds to travel from New York to Paris and back

• The JVM executes billions of instructions per second

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 1
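The round-trip figure on this slide can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: the New York to Paris distance and the two-thirds-of-c speed for optical fibre are assumptions, not numbers from the slides.

```python
# Rough lower bound on round-trip latency imposed by the speed of light.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum

# Assumption: approximate great-circle distance between New York and Paris.
NEW_YORK_TO_PARIS_KM = 5_840.0


def round_trip_ms(distance_km: float,
                  speed_km_per_s: float = SPEED_OF_LIGHT_KM_PER_S) -> float:
    """Return the time in milliseconds for a signal to travel there and back."""
    return 2 * distance_km / speed_km_per_s * 1000


if __name__ == "__main__":
    vacuum = round_trip_ms(NEW_YORK_TO_PARIS_KM)
    # Assumption: light in optical fibre travels at roughly two thirds of c.
    fibre = round_trip_ms(NEW_YORK_TO_PARIS_KM, SPEED_OF_LIGHT_KM_PER_S * 2 / 3)
    print(f"vacuum: {vacuum:.1f} ms, fibre: {fibre:.1f} ms")
```

Under these assumptions the vacuum figure comes out just under 40 ms, and real networks add routing, queuing, and processing delays on top of it.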

Page 32: Distributed Systems Concepts

FALLACY #3

BANDWIDTH IS INFINITE

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 2

Page 33: Distributed Systems Concepts

• Bandwidth is roughly the amount of information that can be transmitted each second

• Networks are limited by hardware

• Applications are limited by software

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 3

Page 34: Distributed Systems Concepts

FALLACY #4

THE NETWORK IS SECURE

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 4

Page 35: Distributed Systems Concepts

• We see hacks of major corporations’ networks seemingly on a weekly basis

• In 2015, Foxglove Security discovered a major vulnerability in Java’s serialization framework

• Allowing remote access to friendly users opens systems up to unfriendly ones

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 5

Page 36: Distributed Systems Concepts

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 6

D A T A B R E A C H E S S I N C E 2 0 0 5

Page 37: Distributed Systems Concepts

FALLACY #5

TOPOLOGY DOESN’T CHANGE

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 7

Page 38: Distributed Systems Concepts

• Administrators add and remove servers from networks

• We cannot depend on machines always being in the same place

• Service discovery and routing layers solve this problem

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 8

Page 39: Distributed Systems Concepts

FALLACY #6

THERE IS ONE ADMINISTRATOR

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

3 9

Page 40: Distributed Systems Concepts

• Production systems are often maintained and managed by numerous people

• Multiple administrators may institute conflicting policies

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

4 0

Page 41: Distributed Systems Concepts

FALLACY #7

TRANSPORT COST IS ZERO

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

4 1

Page 42: Distributed Systems Concepts

• Local processing is cheap

• Network communication is expensive

• Latency and bandwidth ensure transport cost is never zero

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

4 2

Page 43: Distributed Systems Concepts

FALLACY #8

THE NETWORK IS HOMOGENEOUS

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

4 3

Page 44: Distributed Systems Concepts

• Applications must be designed to work in a variety of environments

• Wired networks

• Wireless networks

• Cellular networks

• Satellite networks

F A L L A C I E S O F D I S T R I B U T E D C O M P U T I N G

4 4

Page 45: Distributed Systems Concepts

C O N C E P T S

4 5

Page 48: Distributed Systems Concepts

C O N C E P T S

4 8

• Time in distributed systems

• Consistency in distributed systems

• Partitioning and replication

Page 49: Distributed Systems Concepts

SYSTEM | CONSISTENCY | AVAILABILITY | PARTITION TOLERANCE

ZOOKEEPER | STRONG | QUORUM | YES

DYNAMO | EVENTUALLY STRONG | HIGH | YES

MYSQL | STRONG | HIGH | NO

T H E C A P T H E O R E M

4 9

T R A D E O F F S I N D I S T R I B U T E D S Y S T E M S

Page 50: Distributed Systems Concepts

O R D E R I N D I S T R I B U T E D S Y S T E M S

5 0

Page 51: Distributed Systems Concepts

• Order is necessary to enforce causal relationships

• Two types of order in distributed systems

  • Partial order: order of dependent events

  • Total order: order of all events

• Single-threaded applications are totally ordered

O R D E R I N D I S T R I B U T E D S Y S T E M S

5 1

Page 52: Distributed Systems Concepts

T I M E I N D I S T R I B U T E D S Y S T E M S

5 2

Page 53: Distributed Systems Concepts

• Time can be used to enforce order

• Time can be used to enforce bounds on communications

• But time progresses independently in asynchronous systems

• Clocks suffer from clock drift

• Even NTP can only synchronize clocks to within a few milliseconds of each other

T I M E I N D I S T R I B U T E D S Y S T E M S

5 3

Page 54: Distributed Systems Concepts

T I M E I N D I S T R I B U T E D S Y S T E M S

5 4

Page 55: Distributed Systems Concepts

• “Time, Clocks, and the Ordering of Events in a Distributed System”

• Developed by Leslie Lamport in 1978

• One of the seminal papers in distributed systems

• Determines partial ordering of events in a distributed system

• Also referred to as logical clocks

T I M E I N D I S T R I B U T E D S Y S T E M S

5 5

L A M P O R T C L O C K S
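A minimal sketch of a Lamport clock, assuming the usual rules: increment the counter on every local event and on every send, and on receive jump past the larger of the local counter and the message timestamp. The class and method names are illustrative, not taken from the slides.

```python
class LamportClock:
    """Logical clock: a single counter per node, no wall-clock time involved."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Jump past the sender's timestamp, counting the receive as an event."""
        self.time = max(self.time, message_time) + 1
        return self.time


if __name__ == "__main__":
    a, b = LamportClock(), LamportClock()
    a.tick()                  # a local event on node a
    stamp = a.send()          # a sends a message to b
    print(b.receive(stamp))   # b's clock now exceeds the send timestamp
```

If event x happened before event y, x's timestamp is smaller than y's. The converse does not hold, so a smaller timestamp alone does not prove causality.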

Page 56: Distributed Systems Concepts

T I M E I N D I S T R I B U T E D S Y S T E M S

5 6

Page 57: Distributed Systems Concepts

• “Timestamps in Message Passing Systems That Preserve the Partial Ordering” - Colin J. Fidge

• “Virtual Time and Global States of Distributed Systems” - Friedemann Mattern

• Independently developed by two researchers in 1988

• Determines causal ordering of events in a distributed system

• Closely related to version vectors

T I M E I N D I S T R I B U T E D S Y S T E M S

5 7

V E C T O R C L O C K S
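A minimal sketch of vector clocks, assuming each clock is a mapping from node name to counter, with missing entries treated as zero. The function names are illustrative. On receive, a node would merge the incoming clock with its own and then increment its own entry.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum of two clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if every entry of a is <= b and at least one is strictly smaller."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither clock dominates the other (no causal relationship)."""
    keys = set(a) | set(b)
    return (any(a.get(k, 0) > b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))
```

Unlike a single Lamport counter, comparing two vector clocks can distinguish "happened before" from "concurrent".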

Page 58: Distributed Systems Concepts

T I M E I N D I S T R I B U T E D S Y S T E M S

5 8

Page 59: Distributed Systems Concepts

C O N S I S T E N C Y

5 9

Page 60: Distributed Systems Concepts

• Linearizability

• Sequential consistency

• Causal consistency

• Strong eventual consistency

• Eventual consistency

C O N S I S T E N C Y M O D E L S

6 0

Page 61: Distributed Systems Concepts

• Monotonic read consistency

• Monotonic write consistency

• Read-your-writes consistency

• Writes follow reads consistency

• Serializability

C O N S I S T E N C Y M O D E L S

6 1

M O R E C O N S I S T E N C Y M O D E L S

Page 62: Distributed Systems Concepts

P A R T I T I O N I N G

6 2

Page 63: Distributed Systems Concepts

• Split data across multiple machines

• Reduces the amount of data each node must handle

• Reduces the amount of network I/O for certain algorithms

P A R T I T I O N I N G

6 3
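One common way to split data across machines is hash partitioning. The sketch below is an illustration under assumptions, not a scheme named on the slide: `partition_for` and `assign` are hypothetical helpers, and SHA-256 is an arbitrary choice of stable hash.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign(keys: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group keys by the partition that would store them."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return dict(partitions)


if __name__ == "__main__":
    print(assign([f"user-{i}" for i in range(10)], num_partitions=3))
```

A drawback of plain modulo partitioning is that changing `num_partitions` remaps most keys, which is the problem consistent hashing addresses.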

Page 64: Distributed Systems Concepts

R E P L I C A T I O N

6 4

Page 65: Distributed Systems Concepts

• Sharing information to ensure consistency between redundant services

• Active replication — push

• Passive replication — pull

• Quorum-based

• Gossip

R E P L I C A T I O N

6 5

Page 66: Distributed Systems Concepts

R E P L I C A T I O N

6 6

S Y N C H R O N O U S

• Nodes updated between the request and response

• Consistency over performance

A S Y N C H R O N O U S

• State persisted locally and replicated after response

• Performance over consistency

Page 67: Distributed Systems Concepts

COLUMNS: PRIMARY-BACKUP | GOSSIP | 2PC | QUORUM

ROWS: CONSISTENCY | TRANSACTIONS | LATENCY | THROUGHPUT | DATA LOSS | READ ONLY

VALUES: EVENTUAL | STRONG | LOW | HIGH | FULL | HIGH | FULL | LOCAL | SOME | READ ONLY | LOW | MEDIUM | NONE | READ/WRITE

R E P L I C A T I O N

6 7

T R A D E O F F S I N D I S T R I B U T E D S Y S T E M S

Page 68: Distributed Systems Concepts

• Gossip is one of the simplest distributed communication algorithms

• Inspired by the gossip that takes place in human communication

• Each node periodically chooses a random set of neighbors with which to exchange information

• Information propagates through the system quickly

• Version vectors can be used to resolve conflicts in updates

R E P L I C A T I O N

6 8

G O S S I P
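The propagation behaviour described above can be illustrated with a small simulation. This is a simplified push-only sketch under assumptions: rounds are synchronous, only informed nodes push, every node can reach every other node, and `gossip_rounds` is a hypothetical helper rather than any library's API.

```python
import random
from typing import Set


def gossip_rounds(num_nodes: int, fanout: int, seed: int = 0,
                  max_rounds: int = 1000) -> int:
    """Count rounds until a rumour started at node 0 reaches every node.

    Returns `max_rounds` if the rumour has not spread everywhere by then.
    """
    rng = random.Random(seed)              # seeded so runs are repeatable
    informed: Set[int] = {0}
    peers_per_round = min(fanout, num_nodes - 1)
    rounds = 0
    while len(informed) < num_nodes and rounds < max_rounds:
        newly_informed: Set[int] = set()
        for node in informed:
            others = [n for n in range(num_nodes) if n != node]
            # Each informed node pushes the rumour to a random set of peers.
            newly_informed.update(rng.sample(others, peers_per_round))
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(gossip_rounds(num_nodes=64, fanout=2))
```

Because the informed set can grow by a multiple each round, the number of rounds tends to grow roughly logarithmically with the number of nodes.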

Page 69: Distributed Systems Concepts

C O N S I S T E N T H A S H I N G

6 9

Page 70: Distributed Systems Concepts

• Map each object to a point on the edge of a circle

• Map each machine to a pseudo-random point on the same circle

• To find the node on which an object is stored, find the location of the object on the edge of the circle and walk around the circle until the first node is found

C O N S I S T E N T H A S H I N G

7 0
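The three steps above can be sketched as a small hash ring. This is a minimal illustration under assumptions: MD5 is used only as a convenient stable hash, each machine gets a single point on the circle (no virtual nodes), hash collisions between machines are ignored, and `HashRing` is a hypothetical class name.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _point(value: str) -> int:
    """Map a string to a pseudo-random point on the circle."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str] = ()) -> None:
        self._points: List[int] = []      # sorted positions of machines
        self._owners: Dict[int, str] = {}  # position -> machine name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        point = _point(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def remove(self, node: str) -> None:
        point = _point(node)
        self._points.remove(point)
        del self._owners[point]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's point to the first machine."""
        if not self._points:
            raise LookupError("ring is empty")
        index = bisect.bisect_left(self._points, _point(key))
        # Wrap around to the first machine when the key is past the last one.
        return self._owners[self._points[index % len(self._points)]]
```

Removing a machine only moves the keys that machine owned. Every other key keeps its owner, which is the property that makes the scheme useful when machines come and go.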

Page 71: Distributed Systems Concepts

C O N S I S T E N T H A S H I N G

7 1

Page 72: Distributed Systems Concepts

F A I L U R E D E T E C T I O N

7 2

Page 73: Distributed Systems Concepts

• Failure detectors are characterized in terms of completeness and accuracy

• In a synchronous system, failure detection is solvable

• Certain problems are not solvable without failure detection in an asynchronous system

• A partitioned process is indistinguishable from a crashed process

• Thus reliable failure detection is impossible in an asynchronous system

• Failure detection is usually based on time

F A I L U R E D E T E C T I O N

7 3
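A time-based failure detector can be sketched as a heartbeat table with a timeout. This is a minimal sketch under assumptions: timestamps are passed in by the caller so the example stays deterministic, and `HeartbeatFailureDetector` is a hypothetical name.

```python
from typing import Dict, Set


class HeartbeatFailureDetector:
    """Suspect any node whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now`."""
        self._last_seen[node] = now

    def suspected(self, now: float) -> Set[str]:
        """Return the nodes that have been silent for longer than the timeout."""
        return {node for node, seen in self._last_seen.items()
                if now - seen > self.timeout}


if __name__ == "__main__":
    detector = HeartbeatFailureDetector(timeout=3.0)
    detector.heartbeat("a", now=0.0)
    detector.heartbeat("b", now=0.0)
    detector.heartbeat("a", now=4.0)
    print(detector.suspected(now=5.0))   # b has been silent for too long
    detector.heartbeat("b", now=5.5)     # b was only slow, not crashed
    print(detector.suspected(now=6.0))
```

The second half of the usage shows the accuracy problem: a slow or partitioned node is suspected even though it has not crashed, so the timeout trades detection speed against false suspicions.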

Page 74: Distributed Systems Concepts

L E A D E R E L E C T I O N

7 4

Page 75: Distributed Systems Concepts

L E A D E R E L E C T I O N

7 5

• The process of selecting a single node to coordinate a cluster

• Difficult to account for failures

• Electing a leader allows a single process to control a cluster

• Frequently used in consensus algorithms

• But a single leader can limit throughput

Page 76: Distributed Systems Concepts

L E A D E R E L E C T I O N

7 6

B U L L Y A L G O R I T H M
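A heavily simplified sketch of the bully algorithm, under assumptions: messages are delivered reliably and instantly, the `alive` set stands in for timeouts, and the election is followed along a single chain of hand-offs instead of running concurrently. The function name and parameters are illustrative.

```python
from typing import Iterable, Set


def bully_election(initiator: int, node_ids: Iterable[int], alive: Set[int]) -> int:
    """Return the id of the leader elected when `initiator` starts an election."""
    if initiator not in alive:
        raise ValueError("the initiating node must be alive")
    ids = sorted(node_ids)
    current = initiator
    while True:
        # ELECTION messages go to every node with a higher id; only live
        # nodes answer.
        responders = [n for n in ids if n > current and n in alive]
        if not responders:
            # Nobody with a higher id answered, so `current` announces itself.
            return current
        # A higher node that answers takes over the election. Following the
        # lowest responder shows the hand-off chain step by step.
        current = min(responders)


if __name__ == "__main__":
    # Node 5 was the leader and has crashed; node 2 notices and starts an election.
    print(bully_election(initiator=2, node_ids=range(1, 6), alive={1, 2, 3, 4}))
```

The live node with the highest id always wins, which is where the name comes from. Handling nodes that fail in the middle of an election is the hard part, and this sketch leaves it out.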

Page 77: Distributed Systems Concepts

C O N S E N S U S

7 7

Page 78: Distributed Systems Concepts

• Single-system view, shared state

• Key to building consistent storage systems

C O N S E N S U S

7 8

Page 79: Distributed Systems Concepts

• Agreement: all correct processes decide the same value

• Integrity: every correct process decides at most one value, and that value must have been proposed

• Termination: every correct process eventually decides some value

• Validity: if all correct processes propose the same value v, then all correct processes decide v

C O N S E N S U S

7 9
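The four properties can be checked mechanically against the outcome of a finished run. The sketch below is illustrative only: `check_consensus` is a hypothetical helper, both mappings are assumed to cover correct processes only, and termination is judged on a snapshot, so it cannot express "eventually".

```python
from typing import Dict, Hashable


def check_consensus(proposals: Dict[str, Hashable],
                    decisions: Dict[str, Hashable]) -> Dict[str, bool]:
    """Check a finished run against the four consensus properties.

    `proposals` maps each correct process to the value it proposed.
    `decisions` maps each process that decided to its single decided value.
    """
    proposed = set(proposals.values())
    decided = set(decisions.values())
    return {
        # All decisions are the same value.
        "agreement": len(decided) <= 1,
        # Only proposed values are decided (one decision per process is
        # enforced by the mapping itself).
        "integrity": decided <= proposed,
        # Every correct process has decided.
        "termination": set(proposals) <= set(decisions),
        # If everyone proposed the same value, that value is the only decision.
        "validity": len(proposed) != 1 or decided <= proposed,
    }


if __name__ == "__main__":
    print(check_consensus({"p1": "x", "p2": "y", "p3": "x"},
                          {"p1": "x", "p2": "x", "p3": "x"}))
```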

Page 80: Distributed Systems Concepts

• “Impossibility of Distributed Consensus with One Faulty Process” - Fischer, Lynch, and Paterson

• Commonly referred to as the FLP Impossibility Result

• No deterministic algorithm can guarantee consensus in an asynchronous system if even one process may crash

• In practice, consensus can be reached

C O N S E N S U S

8 0

Page 81: Distributed Systems Concepts

ZooKeeper Atomic Broadcast: “ZooKeeper: Wait-free Coordination for Internet-scale Systems” - Hunt, Konar et al.

Viewstamped Replication: “Viewstamped Replication” - Brian M. Oki and Barbara H. Liskov

Raft: “In Search of an Understandable Consensus Algorithm” - Diego Ongaro and John Ousterhout

C O N S E N S U S

8 1

Paxos: “The Part-Time Parliament” - Leslie Lamport

“Paxos Made Simple” - Leslie Lamport

Page 82: Distributed Systems Concepts

• Leader election

• Log replication

• Failure detection

• Log compaction

• Membership changes

C O N S E N S U S

8 2

Page 83: Distributed Systems Concepts

Distributed systems in practice

N E X T T I M E

8 3

Page 84: Distributed Systems Concepts

Q & A

Page 85: Distributed Systems Concepts

HaltermanJordan

T H A N K Y O U !