04/27/2011 DHT 1

ecs251 Spring 2011: Operating System
#5: Distributed Hash Table

Dr. S. Felix Wu

Computer Science Department

University of California, Davis

http://www.facebook.com/group.php?gid=29670204725

http://cyrus.cs.ucdavis.edu/~wu/ecs251



04/27/2011 DHT 2

GFS: Google File System

• “Failures” are the norm
• Multi-GB files are common
• Append rather than overwrite
  – Random writes are rare
• Can we relax the consistency?

04/27/2011 DHT 3

• Client translates file name and byte offset to chunk index.
• Sends request to master.
• Master replies with chunk handle and location of replicas.
• Client caches this info.
• Sends request to a close replica, specifying chunk handle and byte range.
• Requests to master are typically buffered.
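The first step above (byte offset to chunk index) is plain integer arithmetic. The sketch below assumes fixed 64 MB chunks, the size reported for the published GFS design; the slide itself does not state a chunk size, and `to_chunk_requests` is a hypothetical helper name.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: 64 MB chunks (not stated on the slide)


def to_chunk_requests(byte_offset, length):
    """Split a byte range into (chunk index, offset within chunk, length) pieces."""
    pieces = []
    while length > 0:
        index, within = divmod(byte_offset, CHUNK_SIZE)
        take = min(length, CHUNK_SIZE - within)  # stop at the chunk boundary
        pieces.append((index, within, take))
        byte_offset += take
        length -= take
    return pieces
```

A read that crosses a chunk boundary becomes two requests, each of which the client would resolve to a chunk handle through the master (or its cache).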

04/27/2011 DHT 4

The Master

• Maintains all file system metadata.
  – Namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
• Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state

04/27/2011 DHT 5

System Interactions

1. Client asks master for all replicas.
2. Master replies. Client caches.
3. Client pre-pushes data to all replicas.
4. After all replicas acknowledge, client sends write request to primary.
5. Primary forwards write request to all replicas.
6. Secondaries signal completion.
7. Primary replies to client. Errors handled by retrying.

• The master grants a chunk lease to a replica
• The replica holding the lease determines the order of updates to all replicas
• Lease
  – 60-second timeout
  – Can be extended indefinitely
  – Extension requests are piggybacked on heartbeat messages
  – After a timeout expires, the master can grant new leases

04/27/2011 6DHT

Snapshot

• A “snapshot” is a copy of a system at a moment in time.
  – When are snapshots useful?
  – Does “cp –r” generate snapshots?
• Handled using copy-on-write (COW).
  – First revoke all leases.
  – Then duplicate the metadata, but point to the same chunks.
  – When a client requests a write, the master allocates a new chunk handle.
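The copy-on-write steps can be sketched with a toy in-memory master. This is an illustration only: the class `Master`, its method names, and the use of per-chunk reference counts to detect shared chunks are assumptions for the sketch, and lease revocation and the actual chunk copy on the chunkservers are left out.

```python
import itertools


class Master:
    """Toy metadata store: file name -> chunk handles, with reference counts."""

    def __init__(self):
        self.files = {}
        self.refs = {}
        self._next_handle = itertools.count(1)

    def create(self, name, n_chunks):
        handles = [next(self._next_handle) for _ in range(n_chunks)]
        self.files[name] = handles
        for h in handles:
            self.refs[h] = 1

    def snapshot(self, src, dst):
        # Duplicate the metadata only; both files point to the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refs[h] += 1

    def handle_for_write(self, name, index):
        # A shared chunk gets a fresh handle before it may be written.
        h = self.files[name][index]
        if self.refs[h] > 1:
            self.refs[h] -= 1
            h = next(self._next_handle)
            self.refs[h] = 1
            self.files[name][index] = h
        return h
```

After a snapshot, the first write to a chunk allocates a new handle for the writer while the snapshot keeps the old one; later writes to the same chunk reuse the new handle.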

04/27/2011 7DHT

04/27/2011 DHT 8

HDFS Architecture

[Figure: Client, NameNode, SecondaryNameNode, and DataNodes. Labels: (1) filename; (2) BlckId, DataNodes; (3) Read data; Cluster Membership.]

• NameNode: maps a file to a file-id and a list of DataNodes
• DataNode: maps a block-id to a physical location on disk
• SecondaryNameNode: periodic merge of the transaction log

04/27/2011 DHT 9

Structured Peering

• Peer identity and routability
• Key/content assignment
  – Which identity owns what?
  – GFS/Napster: centralized index service
  – Skype/Kazaa: login-server & super peers
  – DNS: hierarchical DNS servers
• Two problems:
  (1) How to connect to the “topology”?
  (2) How to cope with failures/changes?

04/27/2011 DHT 10

DHT

• Most s-P2P systems are DHT-based.
• Distributed hash tables (DHTs)
  – decentralized lookup service of a hash table
  – (name, value) pairs stored in the DHT
  – any peer can efficiently retrieve the value associated with a given name
  – the mapping from names to values is distributed among peers

04/27/2011 DHT 11

HT as a search table (BitTorrent, Napster)

[Figure: a hash table indexed by key; content objects and peers are named by “160 bits” identifiers]

Information/content is distributed, and we need to know where it is:
• Where is this GFS chunk?
• Where is this piece of music?
• Is this BT piece available?
• What is the location of this type of content?
• What is the current IP address of this Skype user?

04/27/2011 DHT 12

DHT as a search table

Index key

???


04/27/2011 DHT 14

DHT segment ownership

Index key

???

04/27/2011 DHT 15

DHT

• Scalable
• Peer arrivals, departures, and failures
• Unstructured versus structured

04/27/2011 DHT 16

DHT (Name, Value)

How to utilize DHT to avoid Trackers in BitTorrent?

04/27/2011 DHT 17

DHT-based Tracker

[Figure: the index key for “FreeBSD 5.4 CD images” maps to a hash entry; the seed’s IP address is stored and retrieved under that key with PUT & GET]

• Whoever owns this hash entry is the tracker for the corresponding key!
• Publish the key on the class web site.

04/27/2011 DHT 18

Chord

• Given a key (content object), it maps the key onto a peer -- consistent hash
• Assigns keys to peers.
• Solves the problem of locating a key in a collection of distributed peers.
• Maintains routing information as peers join and leave the system

04/27/2011 DHT 19

Chord

• Consistent Hashing
• A Simple Key Lookup Algorithm
• Scalable Key Lookup Algorithm
• Node Joins and Stabilization
• Node Failures

04/27/2011 DHT 20

Consistent Hashing

• Consistent hash function assigns each peer and key an m-bit identifier (e.g., 140 bits).
• SHA-1 as a base hash function.
• A peer’s identifier is defined by hashing the peer’s IP address. (other possibilities?)
• A content identifier is produced by hashing the key:
  – ID(peer) = SHA-1(IP, Port)
  – ID(content) = SHA-1(related to the content object)
  – Application-dependent!
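A minimal sketch of the two identifier functions, using Python's `hashlib`. The string encoding of `(IP, Port)` and the choice to reduce the digest modulo 2^m when m is smaller than SHA-1's 160 bits are assumptions made for illustration; the slide leaves both open.

```python
import hashlib

M = 160  # SHA-1 produces 160 bits; a smaller m keeps only the low-order bits


def chord_id(data: bytes, m: int = M) -> int:
    """Hash arbitrary bytes to an m-bit identifier."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def peer_id(ip: str, port: int, m: int = M) -> int:
    # Assumed encoding of (IP, Port): the text "ip:port".
    return chord_id(f"{ip}:{port}".encode(), m)


def content_id(key: str, m: int = M) -> int:
    # What goes into the key is application-dependent.
    return chord_id(key.encode(), m)
```

Because peers and content share one identifier space, the same function serves both, which is what lets a content ID be compared directly against peer IDs on the ring.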

04/27/2011 DHT 21

Peer, Content

• In an m-bit identifier space, there are $2^m$ identifiers (for both peer and content).
• Which peer handles which content?

04/27/2011 DHT 22

Peer, Content

• In an m-bit identifier space, there are $2^m$ identifiers (for both peer and content).
• Which peer handles which content?
  – We will not have $2^m$ peers/contents!
  – Each peer might need to handle more than one content.
  – In that case, which peer has what?

04/27/2011 DHT 23

Consistent Hashing

• In an m-bit identifier space, there are $2^m$ identifiers.
• The identifiers form an identifier circle modulo $2^m$.
• The identifier ring is called the Chord ring.
• Content X is assigned to the first peer whose identifier is equal to or follows (the identifier of) X in the identifier space.
• This peer is the successor peer of key X, denoted by successor(X).

04/27/2011 DHT 24

Successor Peers

[Figure: identifier circle with identifiers 0 to 7; nodes at 0, 1, and 3; keys 1, 2, and 6]

successor(1) = 1
successor(2) = 3
successor(6) = 0
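The three successor values can be reproduced in a few lines. The sketch assumes a 3-bit identifier circle (identifiers 0 to 7) with peers 0, 1, and 3, which is what the values above imply; `successor` here is a centralized helper over a known peer list, not the distributed lookup.

```python
from bisect import bisect_left


def successor(key, peers, m):
    """First peer whose identifier is equal to or follows key on the circle."""
    ring = sorted(peers)
    i = bisect_left(ring, key % (2 ** m))
    return ring[i % len(ring)]  # wrap around past the largest identifier


peers = [0, 1, 3]
print([successor(k, peers, 3) for k in (1, 2, 6)])  # prints [1, 3, 0]
```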

04/27/2011 DHT 26

Join and Departure

• When a node N joins the network, certain contents previously assigned to N’s successor now become assigned to N.
• When node N leaves the network, all of its assigned contents are reassigned to N’s successor.
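Both rules amount to moving keys between a node and its successor. The sketch below keeps the whole ring in one dictionary (`store`, node id to set of keys), which is a simplification for illustration; the helper names are hypothetical and `depart` assumes at least two nodes remain on the ring.

```python
def in_interval(x, a, b, size):
    """True if x lies in the ring interval (a, b], walking clockwise."""
    if a == b:
        return True
    return 0 < (x - a) % size <= (b - a) % size


def join(store, n, size):
    """Node n takes over the keys in (predecessor, n] from its successor."""
    ring = sorted(store)
    succ = next((x for x in ring if x > n), ring[0])
    pred = next((x for x in reversed(ring) if x < n), ring[-1])
    moved = {k for k in store[succ] if in_interval(k, pred, n, size)}
    store[succ] -= moved
    store[n] = moved


def depart(store, n):
    """All of n's keys are reassigned to n's successor."""
    ring = sorted(store)
    succ = next((x for x in ring if x > n), ring[0])
    store[succ] |= store.pop(n)
```

With nodes 0, 1, 3 holding keys 6, 1, 2 on an 8-identifier circle, `join(store, 6, 8)` moves key 6 from node 0 to node 6, and `depart(store, 1)` then hands key 1 to node 3.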

04/27/2011 DHT 27

Join

[Figure: identifier circle (identifiers 0 to 7) showing the keys held by each node after a new node joins]

04/27/2011 DHT 28

Departure

[Figure: identifier circle (identifiers 0 to 7) showing the keys held by each node after a node departs]

04/27/2011 DHT 29

Join/Depart

What information must be maintained?

04/27/2011 DHT 30

Join/Depart

• What information must be maintained?
  – Pointer to successor(s)
  – Content itself (but application dependent)

04/27/2011 DHT 31

Tracker gone?

[Figure: the index key for “FreeBSD 5.4 CD images” maps to a hash entry; the seed’s IP address is stored and retrieved under that key with PUT & GET]

• Whoever owns this hash entry is the tracker for the corresponding key!
• Publish the key on the class web site.

04/27/2011 DHT 32

How to identify the tracker?

And, its IP address, of course?

04/27/2011 DHT 33

A Simple Key Lookup

• A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
• If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
• Queries for a given identifier could be passed around the circle via these successor pointers until they encounter the node that contains the key.

04/27/2011 DHT 34

A Simple Key Lookup

Pseudocode for finding the successor:

// ask node N to find the successor of id
N.find_successor(id)
  if (id ∈ (N, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
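A runnable version of the pseudocode, as a sketch. The ten node identifiers are an assumption borrowed from the usual Chord illustration (the slides name only N8, N14, N21, N32, and N38), and the optional `path` argument is added here just to record which nodes the query visits.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self


def in_right_closed(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past zero


def find_successor(node, ident, path=None):
    if path is not None:
        path.append(node.id)
    if in_right_closed(ident, node.id, node.successor.id):
        return node.successor
    # forward the query around the circle
    return find_successor(node.successor, ident, path)


def build_ring(ids):
    nodes = [Node(i) for i in sorted(ids)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.successor = b
    return {n.id: n for n in nodes}


ring = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])  # assumed node set
visited = []
owner = find_successor(ring[8], 54, visited)
print(owner.id, visited)  # prints 56 [8, 14, 21, 32, 38, 42, 48, 51]
```

The query from node 8 for key 54 touches every node between 8 and 51 before reaching the owner, which is the O(N) behavior the later slides improve on.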

04/27/2011 DHT 35

A Simple Key Lookup

The path taken by a query from node 8 for key 54:

04/27/2011 DHT 36

Successor

• Each active node MUST know the IP address of its successor!
  – N8 has to know that the next node on the ring is N14.
• Departure (of N14): N8 => N21
• But, how about failure or crash?

04/27/2011 DHT 37

Robustness

• Successor in R hops
  – N8 => N14, N21, N32, N38 (R=4)
  – Periodic pinging along the path to check liveness, and also to find out whether there are “new members” in between

04/27/2011 DHT 38

Is that good enough?

04/27/2011 DHT 39

Without Periodic Ping…??

Triggered only by dynamics (Join/Depart)!

04/27/2011 DHT 40

Complexity of the search

• Time/messages: O(N)
  – N: # of nodes on the ring
• Space: O(1)
  – We only need to remember R IP addresses
• Stabilization depends on the “period”.

04/27/2011 DHT 41

Scalable Key Location

To accelerate lookups, Chord maintains additional routing information.

This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.

04/27/2011 DHT 42

Finger Tables

• Each node N maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
• The ith entry in the table at node N contains the identity of the first node s that succeeds N by at least $2^{i-1}$ on the identifier circle:
  s = successor(N + $2^{i-1}$)
• s is called the ith finger of node N, denoted by N.finger(i)
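The definition can be checked by computing the tables directly. The sketch assumes a 3-bit identifier circle with nodes 0, 1, and 3; `successor` and `finger_table` are centralized helpers that see the whole node list, used here only to show the arithmetic.

```python
from bisect import bisect_left


def successor(ident, ring, m):
    """First node whose identifier is equal to or follows ident on the circle."""
    ring = sorted(ring)
    i = bisect_left(ring, ident % (2 ** m))
    return ring[i % len(ring)]


def finger_table(n, ring, m):
    """Return [(i, start, successor(start))] for i = 1..m at node n."""
    table = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        table.append((i, start, successor(start, ring, m)))
    return table


for node in (0, 1, 3):
    print(node, finger_table(node, [0, 1, 3], 3))
```

For node 0 this gives starts 1, 2, 4 with successors 1, 3, 0; the first entry is always the node's immediate successor.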

04/27/2011 DHT 43


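[Figure: identifier circle with identifiers 0 to 7 and nodes 0, 1, and 3, each annotated with its finger table and keys]

s = successor(n + $2^{i-1}$)

Node 0 (keys: 6)
  start          succ.
  0 + 2^0 = 1    1
  0 + 2^1 = 2    3
  0 + 2^2 = 4    0

Node 1 (keys: 1)
  start          succ.
  1 + 2^0 = 2    3
  1 + 2^1 = 3    3
  1 + 2^2 = 5    0

Node 3 (keys: 2)
  start          succ.
  3 + 2^0 = 4    0
  3 + 2^1 = 5    0
  3 + 2^2 = 7    0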

04/27/2011 DHT 44


A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.

The first finger of N is the immediate successor of N on the circle.

04/27/2011 DHT 45

Example query

The path of a query for key 54 starting at node 8:

Kademlia routing

04/27/2011 DHT 46

04/27/2011 DHT 47

Scalable Key Location

Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:

Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
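The halving intuition can be seen in a small simulation of finger-based lookup. This is a sketch under assumptions: a 6-bit identifier space, the ten-node ring commonly used to illustrate Chord, finger tables built centrally in `build`, and a `lookup` loop that stands in for the remote calls a real node would make.

```python
def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def build(ids, m):
    """Finger tables: fingers[n][i] = successor(n + 2^i), i = 0..m-1."""
    ring = sorted(ids)

    def succ(x):
        x %= 2 ** m
        return next((p for p in ring if p >= x), ring[0])

    return {n: [succ(n + 2 ** i) for i in range(m)] for n in ring}


def lookup(fingers, start, key):
    """Return (owner of key, list of nodes contacted along the way)."""
    n, path = start, [start]
    while True:
        nxt = fingers[n][0]  # immediate successor
        if key == nxt or between(key, n, nxt):
            return nxt, path
        for f in reversed(fingers[n]):  # closest preceding finger
            if between(f, n, key):
                n = f
                break
        else:
            n = nxt
        path.append(n)


fingers = build([1, 8, 14, 21, 32, 38, 42, 48, 51, 56], 6)
print(lookup(fingers, 8, 54))  # prints (56, [8, 42, 51])
```

Node 8 jumps straight to node 42 through its farthest finger, so the query contacts three nodes instead of the eight visited when only successor pointers are followed.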

04/27/2011 DHT 48

Complexity of the Search

• Time/messages: O(log N)
  – N: # of nodes on the ring
• Space: O(log N)
  – We need to remember R IP addresses
  – We need to remember log N fingers


04/27/2011 DHT 49

An Example

• M = 140 (identifier size), ring size is $2^{140}$
• N = $2^{16}$ (# of nodes)
• How many entries do we need to have for the Finger Table?

Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table. The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least $2^{i-1}$ on the identifier circle:

s = successor(n + $2^{i-1}$)

04/27/2011 DHT 50

Complexity of the Search

• Time/messages: O(M)
  – M: # of bits of the identifier
• Space: O(M)
  – We need to remember R IP addresses
  – We need to remember M fingers


04/27/2011 DHT 51

Structured Peering

• Peer identity and routability
  – $2^M$ identifiers, Finger Table routing
• Key/content assignment
  – Hashing
• Dynamics/Failures
  – Inconsistency??

04/27/2011 DHT 52

Joins and Stabilizations

• The most important thing is the successor pointer.
• Up-to-date successor pointers are sufficient to guarantee correctness of lookups; given them, the finger table can always be verified.
• Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.

04/27/2011 DHT 53

Node Joins – stabilize()

• Each time node N runs stabilize(), it asks its successor for the successor’s predecessor p, and decides whether p should be N’s successor instead.
• stabilize() notifies node N’s successor of N’s existence, giving the successor the chance to change its predecessor to N.
• The successor does this only if it knows of no closer predecessor than N.

04/27/2011 DHT 54

Node Joins – stabilize()

// called periodically. verifies N’s immediate
// successor, and tells the successor about N.
N.stabilize()
  x = successor.predecessor;
  if (x ∈ (N, successor))
    successor = x;
  successor.notify(N);

// N’ thinks it might be our predecessor.
N.notify(N’)
  if (predecessor is nil or N’ ∈ (predecessor, N))
    predecessor = N’;
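The two routines translate almost line for line into Python. In this sketch the nodes are local objects, so "asking the successor" is an attribute read rather than a network call, and the identifiers 10, 20, and 30 are arbitrary values chosen for the example.

```python
def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        # Verify the immediate successor, then tell it about ourselves.
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, other):
        # `other` thinks it might be our predecessor.
        if self.predecessor is None or between(other.id, self.predecessor.id, self.id):
            self.predecessor = other


# Two-node ring np=10, ns=30; node n=20 joins with ns as its successor.
np_, ns, n = Node(10), Node(30), Node(20)
np_.successor, ns.predecessor = ns, np_
ns.successor, np_.predecessor = np_, ns
n.successor = ns

n.stabilize()    # ns learns that n is its predecessor
np_.stabilize()  # np adopts n as successor and notifies n
print(np_.successor.id, n.predecessor.id, ns.predecessor.id)  # prints 20 10 20
```

One round of stabilize at the joining node and one at its predecessor is enough to splice the new node into both pointer chains.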

04/27/2011 DHT 55

Stabilization

[Figure: node n joins between np and ns. Before: succ(np) = ns, pred(ns) = np, and n’s predecessor is nil. After: succ(np) = n and pred(ns) = n.]

• n joins
  – predecessor = nil
  – n acquires ns as successor via some n’
• n runs stabilize
  – n notifies ns that n is its new predecessor
  – ns acquires n as its predecessor
• np runs stabilize
  – np asks ns for its predecessor (now n)
  – np acquires n as its successor
  – np notifies n
  – n will acquire np as its predecessor
• all predecessor and successor pointers are now correct
• fingers still need to be fixed, but old fingers will still work

04/27/2011 DHT 56

fix_fingers()

• Each node periodically calls fix_fingers() to make sure its finger table entries are correct.
• It is how new nodes initialize their finger tables.
• It is how existing nodes incorporate new nodes into their finger tables.

04/27/2011 DHT 57

Node Joins – fix_fingers()

// called periodically. refreshes finger table entries.
N.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(N + 2^(next-1));

// checks whether predecessor has failed.
N.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;

04/27/2011 DHT 59

Node Failures

• Key step in failure recovery is maintaining correct successor pointers
• To help achieve this, each node maintains a successor-list of its r nearest successors on the ring
• Successor lists are stabilized as follows:
  – Node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
  – If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
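The reconciliation rule is a small list operation. In the sketch below, successor lists are plain Python lists of node identifiers and liveness is modeled as membership in a set `alive`; both are simplifications, and the function names are hypothetical.

```python
def reconcile(s, successors_of_s, r):
    """n's new list: s prepended to s's list, truncated to r entries."""
    return ([s] + list(successors_of_s))[:r]


def first_live(successor_list, alive):
    """First entry of the successor list that is still alive, or None."""
    for s in successor_list:
        if s in alive:
            return s
    return None


# N8 with r = 4: its successor N14 reports the list [21, 32, 38, 42].
n8_list = reconcile(14, [21, 32, 38, 42], 4)
print(n8_list)                            # prints [14, 21, 32, 38]
print(first_live(n8_list, {21, 32, 38}))  # prints 21 (N14 has failed)
```

Truncating to r entries has the same effect as removing the last entry of a full list, and also covers the case where s's list is still shorter than r.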

04/27/2011 DHT 60

Chord – The Math

• Every node is responsible for about K/N keys (N nodes, K keys)
• When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node)
• Lookups need O(log N) messages
• To reestablish routing invariants and finger tables after a node joins or leaves, only $O(\log^2 N)$ messages are required

Structural Search

• Distributed, P2P
• Attributes about the nodes
• Nodes are connected via some structure (ring, grid, or hypergraph)
• Objective: Where is X?
  – X could be some content or a node identity

04/27/2011 DHT 61

10/26/2009 Davis Social Links 62

Kleinberg’s Basic Setting

p, q, r

• p: lattice distance between one node and all its local neighbors
• q: number of long-range contacts
• r: inverse probability $[d(u,v)]^{-r}$
  – What is the intuition about r?
  – What about r = 0?


Kleinberg’s Results

• A decentralized routing/search problem
  – For nodes s, t with known lattice coordinates, find a short path from s to t.
  – At any step, can only use local information.
  – Kleinberg suggests a simple greedy algorithm and analyzes it:
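A minimal sketch of greedy routing on an n × n lattice, assuming p = 1 (four local neighbors), q = 1 long-range contact per node drawn with probability proportional to $[d(u,v)]^{-r}$, and Manhattan distance. Contacts are sampled lazily when a node is first visited, which is an implementation shortcut for the sketch; all function names are hypothetical.

```python
import random


def dist(u, v):
    """Lattice (Manhattan) distance."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])


def long_range_contact(u, n, r, rng):
    """Pick one contact v != u with probability proportional to d(u, v)^-r."""
    others = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [dist(u, v) ** -r for v in others]
    return rng.choices(others, weights=weights, k=1)[0]


def greedy_route(s, t, n, r, rng):
    """Forward to whichever known contact is closest to the target."""
    path, u = [s], s
    while u != t:
        x, y = u
        local = [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < n and 0 <= y + dy < n]
        candidates = local + [long_range_contact(u, n, r, rng)]
        u = min(candidates, key=lambda v: dist(v, t))
        path.append(u)
    return path


path = greedy_route((0, 0), (9, 9), 10, 2, random.Random(0))
print(len(path) - 1, "hops")
```

Some local neighbor is always one step closer to the target, so every hop strictly reduces the distance and the route never takes more hops than the lattice distance; long-range contacts can only shorten it.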


Local Information

• Local contacts
• Coordinates of the target
• The locations and long-range contacts of all nodes that have come in contact with the message.


Results

• If r = 0, expected delivery time is at least $a_0 n^{2/3}$.
  – Lower bound
• If r = 2, p = q = 1: $a_2 (\log n)^2$
  – Martel/Nguyen’s newer results
• 0 <= r < 2: ~ $a_r n^{(2-r)/3}$
• r > 2: ~ $a_r n^{(r-2)/(r-1)}$


The Web

Social Network Analysis

“Structural relationships” as explanations:

• Network

• Formation

• Influence and collective actions


Social Network Analysis

1. Degree Centrality: The number of direct connections a node has. What really matters is where those connections lead and how they connect the otherwise unconnected.

2. Betweenness Centrality: A node with high betweenness has great influence over what flows in the network, indicating important links and single points of failure.

3. Closeness Centrality: Measures how close a node is to everyone else. The pattern of its direct and indirect ties lets such a node reach any other node in the network more quickly than anyone else; it has the shortest paths to all others.

4. Eigenvector Centrality: It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.


Small World Model

• Low Diameter
  – Logarithmic or poly-logarithmic in N
• “High” Cluster Coefficient
  – cluster coefficient: the portion of X’s neighbors directly connecting to one of X’s other neighbors


Cluster Coefficient

• Mesh network: $C_{cluster} = 1$
• Lattice Network (with degree K): $C_{cluster} = 0$
  – E.g., a linear line
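The two extreme values can be checked numerically. The sketch uses the common local definition, the fraction of pairs of X's neighbors that are directly linked to each other, which is an assumption about the definition intended here; graphs are plain adjacency dictionaries.

```python
from itertools import combinations


def clustering(adj, x):
    """Fraction of pairs of x's neighbors that are themselves linked."""
    nbrs = adj[x]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)


# Full mesh on 4 nodes: every pair of neighbors is linked.
mesh = {i: {j for j in range(4) if j != i} for i in range(4)}
# Ring of 6 nodes (a closed line, degree 2): neighbors are never linked.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

print(clustering(mesh, 0), clustering(ring, 0))  # prints 1.0 0.0
```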


Re-wiring (Watts/Strogatz)

Trade-off between D and $C_{cluster}$!

Structured/Clustered


Two Issues about Low Diameters

Why should there exist short chains of acquaintances linking together arbitrary pairs of strangers?

Why should arbitrary pairs of strangers be able to find the short chains of acquaintances that link them together?


Some Extensions

• Hierarchical Network Models
• Group Structure Models
• Constant Number of Out-Links

“Small World Phenomena and the Dynamics of Information” by J. Kleinberg, NIPS, 2001


Generation & Search

• There is a data structure behind and among all the social peers
  – Lattice, Tree, Group/Community
• The link probability depends on this “social data structure”
  – And it is used to generate the social network
• Searching may use “direct contacts” plus the knowledge about the social data structure


Hierarchical Network Models

• Representation
  – a complete b-ary tree, T
  – All social nodes are “leaves”
• Distance and Link Probability
  – $h(v,w)$ = the height of the least common ancestor of v and w in T
  – probability proportional to $f(h(v,w))$
  – normalization in probability: $\frac{f(h(v,w))}{\sum_{x \neq v} f(h(v,x))}$
  – out-degree in graph: $k = c \log^2 n$


The Critical Value

$\lim_{h \to \infty} \frac{f(h)}{b^{-\alpha' h}} = 0, \quad \forall \alpha' < \alpha$

$\lim_{h \to \infty} \frac{b^{-\alpha'' h}}{f(h)} = 0, \quad \forall \alpha'' > \alpha$

$f(h(v,w)) \sim b^{-\alpha h(v,w)}$
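To make the link distribution concrete, the sketch below computes $h(v,w)$ and the normalized link probabilities for $f(h) = b^{-\alpha h}$. It assumes the n leaves of the complete b-ary tree are numbered 0 to n-1 from left to right, so that integer division by b moves one level up; that numbering and the function names are illustrative choices.

```python
def lca_height(v, w, b):
    """Height of the least common ancestor of leaves v and w."""
    h = 0
    while v != w:
        v //= b  # move both leaves one level up the tree
        w //= b
        h += 1
    return h


def link_probabilities(v, n, b, alpha=1.0):
    """Probability that one out-link of v points to each other leaf w."""
    weights = {w: b ** (-alpha * lca_height(v, w, b))
               for w in range(n) if w != v}
    z = sum(weights.values())  # normalization over all x != v
    return {w: wt / z for w, wt in weights.items()}


probs = link_probabilities(0, 8, 2)
print(probs[1], probs[7])  # the sibling leaf is far likelier than a distant one
```

For b = 2 and n = 8 the normalizing sum is 1.5, so the sibling of leaf 0 receives probability 1/3 while each of the four leaves in the other half of the tree receives 1/12.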


Interpretation (1)

/Science/Computer_Science/Algorithms
/Arts/Music/Opera
/Science/Computer_Science/Machine_Learning


Interpretation (2)

• Target: “stock broker @ Boston, MA”
• Next hop:
  – “bishop @ Cambridge, MA”
  – “banker @ New York City, NY”


Results

$\alpha = 1 \Rightarrow O(\log n)$

Otherwise, no polylogarithmic search


How to Search in HNM??

• Link function: $f(h(v,w)) \sim b^{-h(v,w)}$
• Link probability: $\frac{f(h(v,w))}{\sum_{x \neq v} f(h(v,x))}$
• Distance: $h(v,w)$
• Out-degree: $k = c \log^2 n$


Useful Neighbor

$v \to t$, with $v, t \in T$

$\mathrm{commonAncestor}(v, t) = u$

$\mathrm{Height}(T') = i,\ u \in T',\ \mathrm{root}(T') = u$

$\mathrm{Height}(T'') = i - 1,\ t \in T'' \wedge v \notin T''$

Is “v” useful to reach “t”?

[Figure: leaves v and t in tree T]



[Figure: tree $T$ with leaves v and t; u is their least common ancestor and the root of subtree $T'$ (height i); $T''$ is the subtree of height i − 1 that contains t; w is a leaf in $T''$]

Useful Neighbor Recursively

[Figure: the same tree $T$ with subtrees $T'$ and $T''$; the construction is repeated from w toward t inside $T''$]


Search

• Find one “useful” neighbor in G as the next step
• What happens if there is NO useful neighbor?
• Expected steps to reach “t”.


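Probability to have 1 U.N.

$Z = \sum_{x \neq v} b^{-h(v,x)} = \sum_{j=1}^{\log n} (b-1)\, b^{j-1}\, b^{-j} \le \log n$

• Number of leaves in $T''$: $b^{i-1}$
• One leaf: an out-link reaches a given leaf of $T''$ with probability at least $\frac{b^{-i}}{\log n}$
• Any leaf of $T''$: $\frac{b^{i-1} \times b^{-i}}{\log n} = \frac{1}{b \log n}$
• All out-links miss $T''$: $\left(1 - \frac{1}{b \log n}\right)^{c \log^2 n} \le n^{-\theta}$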
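The last bound follows from the standard inequality $1 - x \le e^{-x}$. The short derivation below assumes natural logarithms throughout, in which case the exponent works out to $\theta = c/b$; with another logarithm base only the constant changes.

```latex
\left(1 - \frac{1}{b \ln n}\right)^{c \ln^2 n}
  \le \exp\!\left(-\frac{c \ln^2 n}{b \ln n}\right)
  = \exp\!\left(-\frac{c}{b} \ln n\right)
  = n^{-c/b}
```

So with $k = c \log^2 n$ out-links, the chance that none of them lands in $T''$ shrinks polynomially in n, and a larger constant c gives a larger exponent.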


HNM

• High probability to be useful
• How about “constant links”?


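Group Structures

• Property (1): R is a group; R′ is a strictly smaller subgroup

  $q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$

• Property (2): $R_1, R_2, R_3, \ldots$ all contain v, then

  $\forall i,\ (|R_i| \le q) \wedge (v \in R_i) \Rightarrow \left|\bigcup_i R_i\right| \le \beta q$

• q(v,w): minimum size of a group containing both v and w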


How to Search in Group Structure??

• Link function: $f(q(v,w)) \sim q(v,w)^{-\alpha}$
• Link probability: $\frac{f(q(v,w))}{\sum_{x \neq v} f(q(v,x))}$
• Distance: $q(v,w)$
• Out-degree: $k = c \log^2 n$


Idea

(v, t): R is the minimum-sized group containing both v and t.

With property (1):

$q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$

Then:

$\exists R' \Rightarrow (t \in R') \wedge (\lambda^2 |R| < |R'| < \lambda |R|)$

How to define “usefulness” of v?


Usefulness of v

(v, t): R is the minimum-sized group containing both v and t.

With property (1):

$q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$

Then:

$\exists R' \Rightarrow (t \in R') \wedge (\lambda^2 |R| < |R'| < \lambda |R|)$

$\exists x,\ (l(v, x) = 1) \wedge (x \in R')$



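HNM:

$Z = \sum_{x \neq v} b^{-h(v,x)} = \sum_{j=1}^{\log n} (b-1)\, b^{j-1}\, b^{-j} \le \log n$

• One leaf: $\frac{b^{-i}}{\log n}$; all $b^{i-1}$ leaves of $T''$: $\frac{b^{i-1} \times b^{-i}}{\log n} = \frac{1}{b \log n}$
• All out-links: $\left(1 - \frac{1}{b \log n}\right)^{c \log^2 n} \le n^{-\theta}$

Group structure:

$Z = \sum_{x \neq v} \frac{1}{q(v,x)} \le \sum_{j=1}^{\log_\beta n} \beta^{j+1}\, \beta^{-(j-1)} = \beta^2 \log_\beta n$

• All out-links: $\left(1 - \frac{\lambda^2}{\beta^2 \log_\beta n}\right)^{c \log^2 n} \le n^{-\theta}$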


Results

$\alpha = 1 \Rightarrow O(\log n)$

Otherwise, no polylogarithmic search


Fixed Number of Out-Links

Relax “t” to “a cluster of t”

[Figure: tree $T$ whose leaves are clusters (Cl); t and x share one cluster, v and w another]

$m = |L|$
$r = |\mathrm{Cluster}|$
$n = m \times r$

r: Resolution


Question #1

Why can’t we just treat “Cluster” as “Super Node” and go home (by applying the HNM results)?

[Figure: tree $T$ whose leaves are clusters (Cl); t and x share one cluster, v and w another]

$m = |L|$
$r = |\mathrm{Cluster}|$
$n = m \times r$


Not necessarily

[Figure: three clusters (Cl) containing the node pairs t and x, v and w, and p and q]


Probability

$f(h(v,w)) \sim (h(v,w) + 1)^{-2}\, b^{-h(v,w)}$

$Z \le 2r$


Question #2

For any out-link of v, what is the probability that the end point of the out-link is in the same cluster of v?


Answer

$(0 + 1)^{-2}\, b^{-0} = 1$

$\frac{1 \times r}{Z} \ge \frac{r}{2r} = \frac{1}{2}$


Results

If the resolution is polylogarithmic, then the search is polylogarithmic if $\alpha = 1$.


A “Similar” Process

[Figure: tree $T$ with subtrees $T'$ and $T''$ and nodes v, u, w, t]

Coloring the Links


Reading

“Small World Phenomena and the Dynamics of Information” by J. Kleinberg, NIPS, 2001

10/23/2007 P2P 106

10/23/2007 P2P 107

File Organization

[Figure: a file divided into pieces 1 to 4; each piece is divided into blocks; one piece is marked as an incomplete piece]

• Piece: 256KB
• Block: 16KB
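With the sizes above, each full piece holds 16 blocks. A small sketch of the arithmetic, assuming a non-empty file whose last piece may be short; `layout` is a hypothetical helper.

```python
PIECE = 256 * 1024  # piece size from the slide
BLOCK = 16 * 1024   # block size from the slide


def layout(file_size):
    """Return (pieces, blocks per full piece, blocks in the last piece)."""
    n_pieces = -(-file_size // PIECE)  # ceiling division
    last_piece = file_size - (n_pieces - 1) * PIECE
    return n_pieces, PIECE // BLOCK, -(-last_piece // BLOCK)


print(layout(4 * PIECE))      # prints (4, 16, 16)
print(layout(3 * PIECE + 1))  # prints (4, 16, 1)
```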

10/23/2007 P2P 108

Initialization

[Figure: user, web server, tracker, and peers (Peer 1, Peer 2, …, Peer 40)]

• User to web server: HTTP GET MYFILE.torrent
• Web server returns MYFILE.torrent
• Tracker URL: http://mytracker.com:6969/S3F5YHG6FEBFG5467HGF367F456JI9N5FF4E…
• User “registers” with the tracker and receives a list of peers:
    ID1   169.237.234.1:6881
    ID2   190.50.34.6:5692
    ID3   34.275.89.143:4545
    …
    ID50  231.456.31.95:6882

10/23/2007 P2P 109

Peer/Seed

[Figure: a peer/seed holding pieces 1 to 4]

10/23/2007 P2P 110

“On the Wire” Protocol

(Over TCP)

[Figure: local peer and remote peer exchange an ID/Infohash handshake and then BitFields (e.g., 1 0 0 1); each side starts with interested = 0, choked = 1]

Non-keepalive messages:
  0 – choke
  1 – unchoke
  2 – interested
  3 – not interested
  4 – have
  5 – bitfield
  6 – request
  7 – piece
  8 – cancel
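The message IDs map naturally onto an enum. The framing used below (a 4-byte big-endian length prefix, then a 1-byte message ID, then the payload, with a zero length meaning keep-alive) comes from the BitTorrent peer wire protocol specification rather than from the slide; `encode` and `decode` are hypothetical helper names.

```python
import struct
from enum import IntEnum


class Msg(IntEnum):
    CHOKE = 0
    UNCHOKE = 1
    INTERESTED = 2
    NOT_INTERESTED = 3
    HAVE = 4
    BITFIELD = 5
    REQUEST = 6
    PIECE = 7
    CANCEL = 8


KEEPALIVE = struct.pack(">I", 0)  # length prefix of zero, no ID


def encode(msg, payload=b""):
    """Frame a message as <length><id><payload>."""
    return struct.pack(">IB", 1 + len(payload), msg) + payload


def decode(data):
    """Parse one framed non-keepalive message into (Msg, payload)."""
    length, ident = struct.unpack(">IB", data[:5])
    return Msg(ident), data[5:4 + length]


have = encode(Msg.HAVE, struct.pack(">I", 7))  # "I now have piece 7"
print(decode(have))
```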

10/23/2007 P2P 111

Choking

• By default, every peer is “choked”
  – stop “uploading” to them, but the TCP connection is still there.
• Select 4~6 peers to “unchoke” ??
  – “Re-choke” every 30 seconds
  – How to decide?
• Optimistic Unchoking
  – What is this?

10/23/2007 P2P 112

“Interested”

A request for a piece (or its sub-pieces)

10/23/2007 P2P 113

Get a piece/block!!

• Download:
  – Which peer? (download from whom? Does it matter?)
  – Which piece?
• How about “upload”?
  – Which peer?
  – Which piece?

10/23/2007 P2P 114

Piece Selection

• Pipelining (5 requests)
• Strict Priority (incomplete pieces first)
• Rarest First

What is the problem?

10/23/2007 P2P 115

Rarest First

• Exchanging bitmaps with 20+ peers
  – Initial messages
  – “have” messages
• Array of buckets
  – The Ith bucket contains “pieces” with I known instances
  – Within the same bucket, the client will randomly select one piece.
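The bucket scheme can be sketched as follows. Peer bitmaps are modeled as sets of piece indices and `have` is the set of pieces the client already owns; these representations and the function name are assumptions for the illustration, not a client's actual data structures.

```python
import random
from collections import defaultdict


def rarest_first(peer_bitfields, have, rng=random):
    """Pick a missing piece with the fewest known instances among peers."""
    counts = defaultdict(int)
    for pieces in peer_bitfields:
        for p in pieces:
            if p not in have:
                counts[p] += 1
    if not counts:
        return None  # nothing left that any peer can offer
    buckets = defaultdict(list)  # bucket i: pieces with i known instances
    for p, c in counts.items():
        buckets[c].append(p)
    # Random choice within the lowest non-empty bucket.
    return rng.choice(sorted(buckets[min(buckets)]))


peers = [{0, 1, 2}, {0, 1}, {0}]
print(rarest_first(peers, have=set()))  # prints 2 (only one peer has it)
```

In a running client the counts would be updated incrementally from the initial bitfield and later “have” messages rather than recomputed on each call.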

10/23/2007 P2P 116

Piece Selection

• Pipelining (5 requests)
• Strict Priority
• 3 stages:
  – Random first piece
  – Rarest First
  – Endgame mode

10/23/2007 P2P 117

Piece Selection

• Piece (64K~1M), Sub-piece (16K)
  – Piece size: trade-off between performance and the size of the torrent file itself
  – A client might request different sub-pieces of the same piece from different peers.
• Strict Priority: sub-pieces and piece
• Rarest First
  – Exception: “random first”
  – Get the stuff out of Seed(s) as soon as possible.

10/23/2007 P2P 118

Get a piece/block!!

• Download:
  – Which peer?
  – Which piece?
• How about “upload”?
  – Which peer?
  – Which piece?

10/23/2007 P2P 119

Peer Selection

• Focus on Rate
• Upload to 4~6 peers
• Random Unchoke
• Global rate cap only

10/23/2007 P2P 120

BitTorrent: “Tit for Tat”

• Equivalent Retaliation (Game theory)
  – A peer will “initially” cooperate, then respond in kind to an opponent's previous action. If the opponent previously was cooperative, the agent is cooperative. If not, the agent is not.

10/23/2007 P2P 121

Choking

• By default, every peer is “choked”
  – stop “uploading” to them, but the TCP connection is still there.
• Select 4~6 peers to “unchoke” ??
  – Best “upload rates” and “interested”.
  – Upload to the unchoked ones and monitor the download rate for all the peers
  – “Re-choke” every 30 seconds
• Optimistic Unchoking (6+1)
  – Randomly select a choked peer to unchoke
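One re-choke round can be sketched as a sort plus a random pick. The sketch assumes `download_rate` holds the rate at which each peer has recently been sending to us, that only interested peers compete for slots, and that there are 4 regular slots; the function name and signature are hypothetical.

```python
import random


def choose_unchoked(download_rate, interested, slots=4, rng=random):
    """Return (regular unchokes, optimistic unchoke) for one re-choke round."""
    ranked = sorted(interested,
                    key=lambda p: download_rate.get(p, 0),
                    reverse=True)
    regular = ranked[:slots]  # peers that give us the best rates
    choked = ranked[slots:]
    # Optimistic unchoke: one randomly chosen peer that would stay choked.
    optimistic = rng.choice(choked) if choked else None
    return regular, optimistic


rates = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 5}
print(choose_unchoked(rates, set(rates)))
```

The optimistic slot is what lets a newcomer with no upload history receive data and begin reciprocating.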

10/23/2007 P2P 122

BitTorrent

• Fairness of download and upload between a pair of peers
• Every 10 seconds, estimate the download bandwidth from the other peer
  – Based on the performance estimate, decide whether to continue uploading to the other peer

10/23/2007 P2P 123

Properties

• Bigger “%” = better chance of being unchoked
• Bigger “%” ~= better UL and DL rates ?!

10/23/2007 P2P 124

Who to Unchoke?

[Figure: a peer/seed holding pieces 1 to 4]

10/23/2007 P2P 125

Seed unchoking

• old algorithm
  – unchoke the fastest peers (how?)
  – problem: fastest peers may monopolize seeds
• new algorithm
  – periodically sort all peers according to their last unchoke time
  – prefer the most recently unchoked peers; on a tie, prefer the fastest
  – (presumably) achieves equal spread of seed bandwidth


10/23/2007 P2P 127

Attacks to BT

???

10/23/2007 P2P 128

Attacks to BT

• Download only from the seeds
• Download only from the fastest peers
• Announcing false pieces
• Privacy -- (Torrent, source IP addresses)

10/23/2007 P2P 129

BitTorrent: Questions to ask

• Peer’s role (or SP’s role)
• Peer’s controllability and vulnerability
• Incentives to contribute
• Peer’s mobility and dynamics
• Scalability

10/23/2007 P2P 130

BitTorrent

• “Tit-for-Tat” incentive model within the same torrent
• Piece/Peer selection and choking
• The need for tracker and torrent file

10/23/2007 P2P 131

Client implementations

• mainline: written in Python; right now, the only one employing the new seed unchoking algorithm
• Azureus: the most popular, written in Java; implements a special protocol between clients (e.g. peers can exchange peer lists)
• other popular clients: ABC, BitComet, BitLord, BitTornado, μTorrent, Opera browser
• various non-standard extensions
  – retaliation mode: detect compromised/malicious peers
  – anti-snubbing: ignore a peer who ignores us
  – super seeding: seed masquerading as a leecher

10/23/2007 P2P 132

Resources

• Basic BitTorrent mechanisms
  – [Cohen, P2PECON’03]
• BitTorrent specification Wiki
  – http://wiki.theory.org/BitTorrentSpecification
• Measurement studies
  – [Izal et al., PAM’04], [Pouwelse et al., Delft TR 2004 and IPTPS’05], [Guo et al., IMC’05], and [Legout et al., INRIA-TR-2006]
• Theoretical analysis and modeling
  – [Qiu et al., SIGCOMM’04], and [Tian et al., Infocom’06]
• Simulations
  – [Bharambe et al., MSR-TR-2005]
• Sharing incentives and exploiting them
  – [Shneidman et al., PINS’04], [Jun et al., P2PECON’05], and [Liogkas et al., IPTPS’06]
