04/27/2011 DHT 1
ecs251 Spring 2011: Operating System
#5: Distributed Hash Table
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.facebook.com/group.php?gid=29670204725
http://cyrus.cs.ucdavis.edu/~wu/ecs251
04/27/2011 DHT 2
GFS: Google File System
“Failures” are the norm
Multi-GB files are common
Append rather than overwrite
– Random writes are rare
Can we relax the consistency?
04/27/2011 DHT 3
• Client translates file name and byte offset to chunk index.
• Sends request to master.
• Master replies with chunk handle and location of replicas.
• Client caches this info.
• Sends request to a close replica, specifying chunk handle and byte range.
• Requests to master are typically buffered.
04/27/2011 DHT 4
The Master
Maintains all file system metadata:
– namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state.
04/27/2011 DHT 5
System Interactions
1. Client asks master for all replicas.
2. Master replies. Client caches.
3. Client pre-pushes data to all replicas.
4. After all replicas acknowledge, client sends write request to primary.
5. Primary forwards write request to all replicas.
6. Secondaries signal completion.
7. Primary replies to client. Errors are handled by retrying.
The master grants a chunk lease to a replica.
The replica holding the lease determines the order of updates to all replicas.
Lease
– 60-second timeouts
– Can be extended indefinitely
– Extension requests are piggybacked on heartbeat messages
– After a timeout expires, the master can grant new leases
04/27/2011 6DHT
Snapshot
A “snapshot” is a copy of a system at a moment in time.
– When are snapshots useful?
– Does “cp -r” generate snapshots?
Handled using copy-on-write (COW).
– First revoke all leases.
– Then duplicate the metadata, but point to the same chunks.
– When a client requests a write, the master allocates a new chunk handle.
04/27/2011 7DHT
04/27/2011 DHT 8
HDFS Architecture
[Figure: HDFS components (Client, NameNode, SecondaryNameNode, DataNodes) with cluster membership links]
1. filename
2. BlockId, DataNodes
3. Read data
NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodic merge of the transaction log
04/27/2011 DHT 9
Structured Peering
Peer identity and routability
Key/content assignment
– Which identity owns what?
– GFS/Napster: centralized index service
– Skype/Kazaa: login server & super peers
– DNS: hierarchical DNS servers
Two problems:
(1) How to connect to the “topology”?
(2) How to prevent failures/changes?
04/27/2011 DHT 10
DHT
Most s-P2P systems are DHT-based.
Distributed hash tables (DHTs)
– decentralized lookup service of a hash table
– (name, value) pairs stored in the DHT
– any peer can efficiently retrieve the value associated with a given name
– the mapping from names to values is distributed among peers
04/27/2011 DHT 11
HT as a search table (BitTorrent, Napster)
Index key
Information/content is distributed, and we need to know where?
– Where is this GFS chunk?
– Where is this piece of music?
– Is this BT piece available?
– What is the location of this type of content?
– What is the current IP address of this Skype user?
Content Object/Peer naming
“160 bits”
04/27/2011 DHT 12
DHT as a search table
Index key
???
04/27/2011 DHT 14
DHT segment ownership
Index key
???
04/27/2011 DHT 15
DHT
Scalable
Peer arrivals, departures, and failures
Unstructured versus structured
04/27/2011 DHT 16
DHT (Name, Value)
How to utilize a DHT to avoid trackers in BitTorrent?
04/27/2011 DHT 17
DHT-based Tracker
Index key
Whoever owns this hash entry is the tracker for the corresponding key!
FreeBSD 5.4 CD images
Publish the key on the class web site.
Seed’s IP address
PUT & GET
04/27/2011 DHT 18
Chord
Given a key (content object), it maps the key onto a peer -- consistent hash
Assign keys to peers.
Solves the problem of locating a key in a collection of distributed peers.
Maintains routing information as peers join and leave the system.
04/27/2011 DHT 19
Chord
Consistent Hashing
A Simple Key Lookup Algorithm
Scalable Key Lookup Algorithm
Node Joins and Stabilization
Node Failures
04/27/2011 DHT 20
Consistent Hashing
Consistent hash function assigns each peer and key an m-bit identifier (e.g., 140 bits).
SHA-1 as a base hash function.
A peer’s identifier is defined by hashing the peer’s IP address. (other possibilities?)
A content identifier is produced by hashing the key:
– ID(peer) = SHA-1(IP, Port)
– ID(content) = SHA-1(related to the content object)
– Application-dependent!
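The two ID formulas above can be sketched in a few lines of Python. This is only an illustration: the "ip:port" string encoding and the helper names `chord_id`, `peer_id`, and `content_id` are assumptions, and reducing the 160-bit SHA-1 digest modulo 2^m is just one possible way to obtain a shorter m-bit identifier.

```python
import hashlib

SHA1_BITS = 160  # SHA-1 always produces a 160-bit digest


def chord_id(data: bytes, m: int = SHA1_BITS) -> int:
    """Hash bytes with SHA-1 and reduce the result to an m-bit identifier."""
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return digest % (2 ** m)


def peer_id(ip: str, port: int, m: int = SHA1_BITS) -> int:
    # Assumed encoding: the text "ip:port"; the slides only say SHA-1(IP, Port).
    return chord_id(f"{ip}:{port}".encode(), m)


def content_id(key: str, m: int = SHA1_BITS) -> int:
    # What goes into the hash is application-dependent; here it is a name string.
    return chord_id(key.encode(), m)


print(hex(peer_id("169.237.234.1", 6881)))
print(content_id("FreeBSD 5.4 CD images", m=3))  # a tiny 3-bit ring for examples
```

Because peers and content share one identifier space, the same function serves both; only the input differs.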
04/27/2011 DHT 21
Peer, Content
In an m-bit identifier space, there are $2^m$ identifiers (for both peer and content).
Which peer handles which content?
04/27/2011 DHT 22
Peer, Content
In an m-bit identifier space, there are $2^m$ identifiers (for both peer and content).
Which peer handles which contents?
– We will not have $2^m$ peers/contents!
– Each peer might need to handle more than one content.
– In that case, which peer has what?
04/27/2011 DHT 23
Consistent Hashing
In an m-bit identifier space, there are $2^m$ identifiers.
Identifiers are ordered on an identifier circle modulo $2^m$.
The identifier ring is called the Chord ring.
Content X is assigned to the first peer whose identifier is equal to or follows (the identifier of) X in the identifier space.
This peer is the successor peer of key X, denoted by successor(X).
04/27/2011 DHT 24
Successor Peers
[Figure: identifier circle (identifiers 0 to 7) marking nodes and the keys 1, 2, and 6]
successor(1) = 1
successor(2) = 3
successor(6) = 0
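The successor rule can be checked with a short sketch. It assumes a 3-bit ring with nodes 0, 1, and 3, which is the node set implied by the three successor values on this slide; the function name and the use of a sorted list with binary search are illustrative choices, not part of Chord itself.

```python
from bisect import bisect_left


def successor(key, node_ids, m):
    """Return the first node identifier equal to or following key on the ring."""
    ring = sorted(node_ids)
    index = bisect_left(ring, key % (2 ** m))
    # If key is larger than every node identifier, wrap around to the smallest.
    return ring[index % len(ring)]


nodes = [0, 1, 3]  # assumed node identifiers, m = 3
for key in (1, 2, 6):
    print(f"successor({key}) = {successor(key, nodes, 3)}")
```

This version needs a global view of all nodes, so it only illustrates the assignment rule; the lookup slides that follow are about finding the same answer without that global view.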
04/27/2011 DHT 26
Join and Departure
When a node N joins the network, certain contents previously assigned to N’s successor now become assigned to N.
When node N leaves the network, all of its assigned contents are reassigned to N’s successor.
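A small sketch of how key ownership moves on join and departure. The ring (3-bit identifiers, nodes 0, 1, 3, keys 1, 2, 6) follows the earlier successor example; the joining node 6, the departing node 1, and the helper name `assignment` are assumptions chosen for illustration.

```python
from bisect import bisect_left


def successor(key, node_ids, m):
    """First node identifier equal to or following key on the ring."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key % (2 ** m)) % len(ring)]


def assignment(keys, node_ids, m):
    """Map every node to the sorted list of keys it is responsible for."""
    table = {n: [] for n in node_ids}
    for k in sorted(keys):
        table[successor(k, node_ids, m)].append(k)
    return table


keys = {1, 2, 6}
before = assignment(keys, {0, 1, 3}, 3)         # node 0 holds key 6
after_join = assignment(keys, {0, 1, 3, 6}, 3)  # node 6 joins and takes key 6 from node 0
after_leave = assignment(keys, {0, 3, 6}, 3)    # node 1 leaves; key 1 moves to node 3
print(before, after_join, after_leave, sep="\n")
```

Only the keys between the new node and its predecessor move on a join, and only the departing node's keys move on a departure; every other assignment is untouched.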
04/27/2011 DHT 27
Join
[Figure: identifier circle (identifiers 0 to 7) showing how keys are reassigned when a node joins]
04/27/2011 DHT 28
Departure
[Figure: identifier circle (identifiers 0 to 7) showing how keys are reassigned when a node departs]
04/27/2011 DHT 30
Join/Depart
What information must be maintained?
– Pointer to successor(s)
– Content itself (but application dependent)
04/27/2011 DHT 31
Tracker gone?
Index key
Whoever owns this hash entry is the tracker for the corresponding key!
FreeBSD 5.4 CD images
Publish the key on the class web site.
Seed’s IP address
PUT & GET
04/27/2011 DHT 32
How to identify the tracker?
And, its IP address, of course?
04/27/2011 DHT 33
A Simple Key Lookup
A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
Queries for a given identifier could be passed around the circle via these successor pointers until they encounter the node that contains the key.
04/27/2011 DHT 34
A Simple Key Lookup
Pseudocode for finding the successor:
// ask node N to find the successor of id
N.find_successor(id)
  if (id ∈ (N, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
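The pseudocode can be turned into a runnable sketch. The ring below uses the node identifiers N8, N14, N21, N32, and N38 that appear on the later slides; the remaining identifiers (1, 42, 48, 51, 56) are assumptions added so that a query for key 54 has somewhere to land, and the `path` argument is only there to record the hops.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b], allowing wrap-around."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # the interval wraps past zero (or a == b: whole ring)


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self  # a lone node is its own successor

    def find_successor(self, ident, path=None):
        """Walk successor pointers until the owner of ident is found."""
        if path is not None:
            path.append(self.id)
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(ident, path)


def build_ring(ids):
    """Link nodes in identifier order and return them keyed by identifier."""
    nodes = [Node(i) for i in sorted(ids)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.successor = b
    return {n.id: n for n in nodes}


ring = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
path = []
owner = ring[8].find_successor(54, path)
print(path, "->", owner.id)
```

The query visits every node between the start and the owner, which is the O(N) cost that finger tables are meant to remove.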
04/27/2011 DHT 35
A Simple Key Lookup
The path taken by a query from node 8 for key 54:
04/27/2011 DHT 36
Successor
Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
Departure (of N14): N8 => N21
But how about failure or crash?
04/27/2011 DHT 37
Robustness
Successor in R hops
– N8 => N14, N21, N32, N38 (R=4)
– Periodic pinging along the path to check, and also to find out whether there are “new members” in between
04/27/2011 DHT 38
Is that good enough?
04/27/2011 DHT 39
Without Periodic Ping…??
Triggered only by dynamics (Join/Depart)!
04/27/2011 DHT 40
Complexity of the search
Time/messages: O(N)
– N: # of nodes on the ring
Space: O(1)
– We only need to remember R IP addresses
Stabilization depends on the “period”.
04/27/2011 DHT 41
Scalable Key Location
To accelerate lookups, Chord maintains additional routing information.
This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
04/27/2011 DHT 42
Finger Tables
Each node N maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
The i-th entry in the table at node N contains the identity of the first node s that succeeds N by at least $2^{i-1}$ on the identifier circle.
$s = \mathrm{successor}(N + 2^{i-1})$, with arithmetic modulo $2^m$.
s is called the i-th finger of node N, denoted by N.finger(i).
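The finger definition can be made concrete with a short sketch. It assumes the small 3-bit ring with nodes 0, 1, and 3 from the earlier successor example and computes fingers from a global node list, which a real node would not have; the function names are illustrative.

```python
from bisect import bisect_left


def successor(key, node_ids, m):
    """First node identifier equal to or following key on the ring."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key % (2 ** m)) % len(ring)]


def finger_table(n, node_ids, m):
    """Return [(start, successor(start))] for fingers i = 1..m of node n."""
    rows = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        rows.append((start, successor(start, node_ids, m)))
    return rows


nodes = [0, 1, 3]
for n in nodes:
    print(n, finger_table(n, nodes, 3))
```

The first row of each table is the node's immediate successor, and several fingers of one node may point to the same peer when the ring is sparsely populated.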
04/27/2011 DHT 43
Finger Tables
[Figure: identifier circle (identifiers 0 to 7) with nodes 0, 1, and 3 and their finger tables]
Node 0 (keys: 6)
  start        succ.
  0+2^0 = 1    1
  0+2^1 = 2    3
  0+2^2 = 4    0
Node 1 (keys: 1)
  start        succ.
  1+2^0 = 2    3
  1+2^1 = 3    3
  1+2^2 = 5    0
Node 3 (keys: 2)
  start        succ.
  3+2^0 = 4    0
  3+2^1 = 5    0
  3+2^2 = 7    0
$s = \mathrm{successor}(n + 2^{i-1})$
04/27/2011 DHT 44
Finger Tables
A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.
The first finger of N is the immediate successor of N on the circle.
04/27/2011 DHT 45
Example query
The path of a query for key 54 starting at node 8:
Kademlia routing
04/27/2011 DHT 46
04/27/2011 DHT 47
Scalable Key Location
Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:
Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
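A simulation sketch of the finger-based lookup may help make the halving argument concrete. It is not Chord's distributed implementation: all finger tables are computed from a global node list, and the 6-bit ring with nodes 1, 8, 14, 21, 32, 38, 42, 48, 51, 56 is an assumed example (only N8, N14, N21, N32, N38 are named in the slides).

```python
from bisect import bisect_left


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


class Ring:
    def __init__(self, node_ids, m):
        self.m = m
        self.ids = sorted(node_ids)
        # fingers[n][i] = successor(n + 2^i), so fingers[n][0] is n's successor
        self.fingers = {
            n: [self._succ((n + 2 ** i) % 2 ** m) for i in range(m)]
            for n in self.ids
        }

    def _succ(self, key):
        return self.ids[bisect_left(self.ids, key) % len(self.ids)]

    def closest_preceding_finger(self, n, key):
        """Farthest finger of n that still lies strictly before key."""
        for f in reversed(self.fingers[n]):
            if in_open(f, n, key):
                return f
        return n

    def lookup(self, start, key):
        """Return (owner of key, list of nodes contacted along the way)."""
        n, hops = start, [start]
        while not in_half_open(key, n, self.fingers[n][0]):
            n = self.closest_preceding_finger(n, key)
            hops.append(n)
        return self.fingers[n][0], hops


ring = Ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56], m=6)
owner, hops = ring.lookup(8, 54)
print(hops, "->", owner)
```

Each hop jumps to the finger closest to the key without overshooting it, so the remaining distance shrinks by at least half per step instead of one node per step.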
04/27/2011 DHT 48
Complexity of the Search
Time/messages: O(log N)
– N: # of nodes on the ring
Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N fingers
Stabilization depends on the “period”.
04/27/2011 DHT 49
An Example
M = 140 (identifier size), ring size is $2^{140}$
N = $2^{16}$ (# of nodes)
How many entries do we need to have for the Finger Table?
Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least $2^{i-1}$ on the identifier circle.
$s = \mathrm{successor}(n + 2^{i-1})$.
04/27/2011 DHT 50
Complexity of the Search
Time/messages: O(M)
– M: # of bits of the identifier
Space: O(M)
– We need to remember R IP addresses
– We need to remember M fingers
Stabilization depends on the “period”.
04/27/2011 DHT 51
Structured Peering
Peer identity and routability
– $2^M$ identifiers, Finger Table routing
Key/content assignment
– Hashing
Dynamics/Failures
– Inconsistency??
04/27/2011 DHT 52
Joins and Stabilizations
The most important thing is the successor pointer.
If the successor pointer is kept up to date, which is sufficient to guarantee correctness of lookups, then the finger table can always be verified.
Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.
04/27/2011 DHT 53
Node Joins – stabilize()
Each time node N runs stabilize(), it asks its successor for the successor’s predecessor p, and decides whether p should be N’s successor instead.
stabilize() notifies node N’s successor of N’s existence, giving the successor the chance to change its predecessor to N.
The successor does this only if it knows of no closer predecessor than N.
04/27/2011 DHT 54
Node Joins – stabilize()
// called periodically. verifies N’s immediate
// successor, and tells the successor about N.
N.stabilize()
  x = successor.predecessor;
  if (x ∈ (N, successor))
    successor = x;
  successor.notify(N);

// N’ thinks it might be our predecessor.
N.notify(N’)
  if (predecessor is nil or N’ ∈ (predecessor, N))
    predecessor = N’;
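The stabilize/notify pair can be exercised in a small single-process simulation. The node identifiers 8, 14, and 21 are assumptions, direct method calls stand in for the remote calls a real deployment would make, and the linear `find_successor` is used only to let the new node locate its successor.

```python
def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None  # nil until some node notifies us

    def find_successor(self, ident):
        n = self
        while not in_half_open(ident, n.id, n.successor.id):
            n = n.successor
        return n.successor

    def join(self, known):
        """Learn a successor through any node already on the ring."""
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        if self.predecessor is None or in_open(
            candidate.id, self.predecessor.id, self.id
        ):
            self.predecessor = candidate


# A two-node ring np=8, ns=21 that is already consistent.
np_, ns = Node(8), Node(21)
np_.successor, np_.predecessor = ns, ns
ns.successor, ns.predecessor = np_, np_

n = Node(14)
n.join(np_)      # n acquires ns as its successor
n.stabilize()    # ns learns that n is its predecessor
np_.stabilize()  # np adopts n as successor and notifies n
print(np_.successor.id, n.successor.id, ns.predecessor.id, n.predecessor.id)
```

After one round at n and one at np, all four pointers around the new node are correct, and no step ever required np and n to coordinate directly at join time.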
04/27/2011 DHT 55
Stabilization
[Figure: nodes np and ns on the ring, with succ(np) = ns and pred(ns) = np; node n joins between them]
n joins
– predecessor = nil
– n acquires ns as successor via some n’
n runs stabilize
– n notifies ns that n may be its new predecessor
– ns acquires n as its predecessor
np runs stabilize
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n will acquire np as its predecessor
all predecessor and successor pointers are now correct
fingers still need to be fixed, but old fingers will still work
[Figure: n starts with predecessor nil; after stabilization, pred(ns) = n and succ(np) = n]
04/27/2011 DHT 56
fix_fingers()
Each node periodically calls fix_fingers() to make sure its finger table entries are correct.
It is how new nodes initialize their finger tables.
It is how existing nodes incorporate new nodes into their finger tables.
04/27/2011 DHT 57
Node Joins – fix_fingers()
// called periodically. refreshes finger table entries.
N.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(N + 2^(next-1));

// checks whether predecessor has failed.
N.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
04/27/2011 DHT 59
Node Failures
The key step in failure recovery is maintaining correct successor pointers.
To help achieve this, each node maintains a successor list of its r nearest successors on the ring.
If node n notices that its successor has failed, it replaces it with the first live entry in the list.
Successor lists are stabilized as follows:
– Node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
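The two successor-list rules can be sketched as plain list operations. The example reuses N8's list (N14, N21, N32, N38, r = 4) from the robustness slide; the identifiers 42 and 48 and the `is_alive` callback are assumptions standing in for real peers and a real liveness check such as a ping.

```python
def reconcile(s, s_list, r):
    """Build n's list from successor s and s's own successor list.

    Prepending s and keeping r entries is the same as dropping the last
    entry of s's list whenever that list already holds r entries.
    """
    return ([s] + list(s_list))[:r]


def first_live(successor_list, is_alive):
    """Return the first entry that still responds, or None if all failed."""
    for node in successor_list:
        if is_alive(node):
            return node
    return None


# N8 reconciles with its successor N14, whose list is assumed to be [21, 32, 38, 42].
n8_list = reconcile(14, [21, 32, 38, 42], r=4)
print(n8_list)

# N14 fails: N8 falls back to the first live entry and reconciles with it.
failed = {14}
new_successor = first_live(n8_list, lambda node: node not in failed)
n8_list = reconcile(new_successor, [32, 38, 42, 48], r=4)
print(new_successor, n8_list)
```

With r entries, the ring stays connected unless all r consecutive successors fail at once.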
04/27/2011 DHT 60
Chord – The Math
Every node is responsible for about K/N keys (N nodes, K keys).
When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node).
Lookups need O(log N) messages.
To reestablish routing invariants and finger tables after a node joins or leaves, only $O(\log^2 N)$ messages are required.
Structural Search
Distributed, P2P
Attributes about the nodes
Nodes are connected via some structure (ring, grid, or hypergraph)
Objective: Where is X?
– X could be some content or a node identity
04/27/2011 DHT 61
10/26/2009 Davis Social Links 62
Kleinberg’s Basic setting
10/26/2009 Davis Social Links 63
p, q, r
p: lattice distance between one node and all its local neighbors
q: number of long-range contacts
r: inverse probability $[d(u,v)]^{-r}$
– What is the intuition about r?
– What about r = 0?
10/26/2009 Davis Social Links 64
Kleinberg’s results
A decentralized routing/search problem
– For nodes s, t with known lattice coordinates, find a short path from s to t.
– At any step, can only use local information.
– Kleinberg suggests a simple greedy algorithm and analyzes it:
10/26/2009 Davis Social Links 65
Local Information
Local contacts
Coordinate for the target
The locations and long-range contacts of all nodes that have come in contact with the message.
10/26/2009 Davis Social Links 66
Results
If r = 0, expected delivery time is at least $a_0 n^{2/3}$.
– Lower bound
If r = 2, p = q = 1: $a_2 (\log n)^2$
– Martel/Nguyen’s newer results
$0 \le r < 2$: ~ $a_r n^{(2-r)/3}$
$r > 2$: ~ $a_r n^{(r-2)/(r-1)}$
10/26/2009 Davis Social Links 67
The Web
Social Network Analysis
“Structural relationships” as explanations:
• Network
• Formation
• Influence and collective actions
10/26/2009 Davis Social Links 69
Social Network Analysis
1. Degree Centrality: The number of direct connections a node has. What really matters is where those connections lead to and how they connect the otherwise unconnected.
2. Betweenness Centrality: A node with high betweenness has great influence over what flows in the network, indicating important links and single points of failure.
3. Closeness Centrality: A measure of how close a node is to everyone else. The pattern of direct and indirect ties allows such nodes to reach any other node in the network more quickly than anyone else. They have the shortest paths to all others.
4. Eigenvector Centrality: It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
10/26/2009 Davis Social Links 70
Small World Model
Low Diameter
– Logarithmic or poly-logarithmic in N
“High” Cluster Coefficient
– cluster coefficient: the portion of X’s neighbors directly connecting to one of X’s other neighbors
10/26/2009 Davis Social Links 71
Cluster Coefficient
Mesh network: $C_{cluster} = 1$
Lattice Network (with degree K): $C_{cluster} = 0$
– E.g., a linear line
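The two extreme cases can be checked numerically. The sketch below uses the common local clustering coefficient (the fraction of pairs of a node's neighbors that are themselves linked), which is an assumption about the exact definition intended on the slide; the five-node graphs are made up for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, x):
    """Fraction of pairs of x's neighbors that are directly connected."""
    neighbors = sorted(adj[x])
    if len(neighbors) < 2:
        return 0.0  # no neighbor pairs to compare
    pairs = list(combinations(neighbors, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)


def average_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(clustering_coefficient(adj, x) for x in adj) / len(adj)


# Full mesh on 5 nodes: every pair of neighbors is connected.
mesh = {i: {j for j in range(5) if j != i} for i in range(5)}
# A line of 5 nodes: no two neighbors of a node are connected to each other.
line = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}

print(average_clustering(mesh), average_clustering(line))
```

A small-world graph sits between these extremes: it keeps a high coefficient like the mesh while having a short diameter.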
10/26/2009 Davis Social Links 72
Re-wiring (Watts/Strogatz)
Trade-off between D and $C_{cluster}$!
Structured/Clustered
10/26/2009 Davis Social Links 73
Two Issues about Low Diameters
Why should there exist short chains of acquaintances linking together arbitrary pairs of strangers?
Why should arbitrary pairs of strangers be able to find the short chains of acquaintances that link them together?
10/26/2009 Davis Social Links 74
Some Extensions
Hierarchical Network Models
Group Structure Models
Constant Number of Out-Links
“Small World Phenomena and the Dynamics of Information” by J. Kleinberg, NIPS, 2001
10/26/2009 Davis Social Links 75
Generation & Search
There is a data structure behind and among all the social peers
– Lattice, Tree, Group/Community
The link probability depends on this “social data structure”
– And, using it to generate the social network
Searching may use “direct contacts” plus the knowledge about the social data structure
10/26/2009 Davis Social Links 76
Hierarchical Network Models
Representation
– a complete b-ary tree, T
– All social nodes are “leaves”
Distance and Link Probability
– $h(v,w)$ = the height of the least common ancestor of v and w in T
– probability proportional to $f(h(v,w))$
– normalization in probability: $\frac{f(h(v,w))}{\sum_{x \ne v} f(h(v,x))}$
– out-degree in graph: $k = c \log^2 n$
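The distance and normalization above can be computed directly for a small tree. This is a sketch under stated assumptions: leaves are numbered 0 to n-1 from left to right, n is a power of b, and f(h) = b^(-h) is chosen as the example link function; the helper names are illustrative.

```python
def lca_height(v, w, b):
    """Height of the least common ancestor of leaves v and w in a complete b-ary tree."""
    height = 0
    while v != w:
        v //= b  # move both leaves up to their parents
        w //= b
        height += 1
    return height


def link_probabilities(v, n, b, f):
    """Normalized probability that a single out-link of v ends at each other leaf."""
    weights = {x: f(lca_height(v, x, b)) for x in range(n) if x != v}
    z = sum(weights.values())  # the normalization constant
    return {x: weight / z for x, weight in weights.items()}


# Binary tree with 8 leaves, f(h) = 2^(-h).
probs = link_probabilities(0, 8, 2, lambda h: 2.0 ** (-h))
for leaf, p in probs.items():
    print(leaf, lca_height(0, leaf, 2), round(p, 4))
```

Each level of the tree contributes the same total weight under this f, so v is as likely to link into its sibling leaf as into the entire far half of the tree.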
10/26/2009 Davis Social Links 77
The Critical Value
$\lim_{h \to \infty} \frac{f(h)}{b^{-\alpha' h}} = 0,\ \forall \alpha' < \alpha$
$\lim_{h \to \infty} \frac{b^{-\alpha'' h}}{f(h)} = 0,\ \forall \alpha'' > \alpha$
$f(h(v,w)) \sim b^{-\alpha h(v,w)}$
10/26/2009 Davis Social Links 78
Interpretation (1)
/Science/Computer_Science/Algorithms
/Arts/Music/Opera
/Science/Computer_Science/Machine_Learning
10/26/2009 Davis Social Links 79
Interpretation (2)
Target: “stock broker @ Boston, MA”
Next hop:
– “bishop @ Cambridge, MA”
– “banker @ New York City, NY”
10/26/2009 Davis Social Links 80
Results
$\alpha = 1 \Rightarrow O(\log n)$
Otherwise, no polylogarithmic search
10/26/2009 Davis Social Links 81
How to Search in HNM??
$f(h(v,w)) \sim b^{-h(v,w)}$
$\frac{f(h(v,w))}{\sum_{x \ne v} f(h(v,x))}$
$h(v,w)$
$k = c \log^2 n$
10/26/2009 Davis Social Links 82
Useful Neighbor
$v \to t$
$v, t \in T$
$\mathrm{commonAncestor}(v, t) = u$
$\mathrm{Height}(T') = i,\ u \in T',\ \mathrm{root}(T') = u$
$\mathrm{Height}(T'') = i - 1,\ t \in T'' \wedge v \notin T''$
Is “v” useful to reach “t”?
[Figure: tree T with leaves v and t]
10/26/2009 Davis Social Links 86
Useful Neighbor Recursively
$v \to t$
$v, t \in T$
$\mathrm{commonAncestor}(v, t) = u$
$\mathrm{Height}(T') = i,\ u \in T',\ \mathrm{root}(T') = u$
$\mathrm{Height}(T'') = i - 1,\ t \in T'' \wedge v \notin T''$
Is “v” useful to reach “t”?
[Figure: tree T with subtree T' rooted at u, subtree T'' containing w and t, and leaf v outside T'']
10/26/2009 Davis Social Links 87
Search
Find one “useful” neighbor in G as the next step
What happens if there is NO useful neighbor?
Expected steps to reach “t”.
10/26/2009 Davis Social Links 88
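Probability to have 1 U.N.
$Z = \sum_{x \ne v} b^{-h(v,x)} = \sum_{j=1}^{\log n} (b-1) b^{j-1} b^{-j} \le \log n$
One leaf: each of the $b^{i-1}$ leaves in $T''$ is linked with probability at least $\frac{b^{-i}}{\log n}$
$b^{i-1} \times \frac{b^{-i}}{\log n} = \frac{1}{b \log n}$
All out-links: $\left(1 - \frac{1}{b \log n}\right)^{c \log^2 n} \le n^{-\theta}$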
10/26/2009 Davis Social Links 89
HNM
High probability to be useful
How about “constant links”?
10/26/2009 Davis Social Links 90
Group Structures
R is a group; R' is a strictly smaller subgroup
$q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$
R1, R2, R3, … all contain v, then
$\forall i,\ (|R_i| \le q) \wedge (v \in R_i) \Rightarrow \left|\bigcup_i R_i\right| \le \beta q$
q(v,w): minimum size of a group containing both v and w
10/26/2009 Davis Social Links 91
How to Search in Group Structure??
$f(q(v,w)) \sim q(v,w)^{-\alpha}$
$\frac{f(q(v,w))}{\sum_{x \ne v} f(q(v,x))}$
$q(v,w)$
$k = c \log^2 n$
10/26/2009 Davis Social Links 92
Idea
(v, t): R is the minimum-sized group containing both v and t.
With property (1):
$q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$
Then:
$\exists R' \Rightarrow (t \in R') \wedge (\lambda^2 |R| < |R'| < \lambda |R|)$
How to define the “usefulness” of v?
10/26/2009 Davis Social Links 93
Usefulness of v
(v, t): R is the minimum-sized group containing both v and t.
With property (1):
$q = |R| \ge 2,\ v \in R \Rightarrow (v \in R' \subseteq R) \wedge (q = |R| > |R'| > \lambda q)$
Then:
$\exists R' \Rightarrow (t \in R') \wedge (\lambda^2 |R| < |R'| < \lambda |R|)$
$\exists x,\ (l(v, x) = 1) \wedge (x \in R')$
10/26/2009 Davis Social Links 95
Probability to have 1 U.N.
$Z = \sum_{x \ne v} \frac{1}{q(v,x)} \le \sum_{j=1}^{\log_\beta n} \beta^{j+1} \beta^{-(j-1)} = \beta^2 \log_\beta n$
$\left(1 - \frac{\lambda^2}{\beta^2 \log_\beta n}\right)^{c \log^2 n} \le n^{-\theta}$
10/26/2009 Davis Social Links 96
Results
$\alpha = 1 \Rightarrow O(\log n)$
Otherwise, no polylogarithmic search
10/26/2009 Davis Social Links 97
Fixed Number of Out-Links
Relax “t” to “a cluster of t”
[Figure: tree T over clusters (Cl) containing the nodes v, w, t, and x]
$m = |L|$
$r = |\mathrm{Cluster}|$
$n = m \times r$
r: Resolution
10/26/2009 Davis Social Links 98
Question #1
Why can’t we just treat “Cluster” as “Super Node” and we go home (by applying the HNM results)?
[Figure: tree T over clusters (Cl) containing the nodes v, w, t, and x]
$m = |L|$
$r = |\mathrm{Cluster}|$
$n = m \times r$
10/26/2009 Davis Social Links 99
Not necessarily
[Figure: three clusters (Cl) containing the nodes t, x, v, w, p, and q]
10/26/2009 Davis Social Links 100
Probability
$f(h(v,w)) \sim (h(v,w) + 1)^{-2} b^{-h(v,w)}$
$Z \le 2r$
10/26/2009 Davis Social Links 101
Question #2
For any out-link of v, what is the probability that the end point of the out-link is in the same cluster of v?
10/26/2009 Davis Social Links 102
Answer
$(0 + 1)^{-2} b^{-0} = 1$
$\frac{1 \times r}{Z} \ge \frac{r}{2r} = \frac{1}{2}$
10/26/2009 Davis Social Links 103
Results
If the resolution is polylogarithmic, then the search is polylogarithmic if alpha = 1.
10/26/2009 Davis Social Links 104
A “Similar” Process
[Figure: tree T with subtrees T' and T'' and the nodes u, v, w, and t]
Coloring the Links
10/26/2009 Davis Social Links 105
Reading
“Small World Phenomena and the Dynamics of Information” by J. Kleinberg, NIPS, 2001
10/23/2007 P2P 106
10/23/2007 P2P 107
File Organization
[Figure: a file divided into pieces 1 to 4, each piece divided into blocks, with one incomplete piece]
Piece: 256KB
Block: 16KB
10/23/2007 P2P 108
Initialization
[Figure: a user, a web server, a tracker, and peers (Peer 1, Peer 2, …, Peer 40)]
user → webserver: HTTP GET MYFILE.torrent
MYFILE.torrent: http://mytracker.com:6969/S3F5YHG6FEBFG5467HGF367F456JI9N5FF4E…
user → tracker: “register”
tracker → user: list of peers
ID1 169.237.234.1:6881
ID2 190.50.34.6:5692
ID3 34.275.89.143:4545
…
ID50 231.456.31.95:6882
10/23/2007 P2P 109
Peer/Seed
[Figure: pieces 1 to 4 held by a peer/seed]
10/23/2007 P2P 110
“On the Wire” Protocol
(Over TCP)
Local Peer / Remote Peer
– ID/Infohash handshake
– BitField sent by each side
– Initial state on each side: interested = 0, choked = 1
Non-keepalive messages:
0 – choke
1 – unchoke
2 – interested
3 – not interested
4 – have
5 – bitfield
6 – request
7 – piece
8 – cancel
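The message IDs above can be turned into bytes on the wire with a short sketch. The framing used here (a 4-byte big-endian length prefix, then a 1-byte message ID, then the payload, with a zero length meaning keep-alive) comes from the public BitTorrent protocol specification rather than from this slide, and the helper names are illustrative.

```python
import struct

MESSAGE_IDS = {
    "choke": 0, "unchoke": 1, "interested": 2, "not interested": 3,
    "have": 4, "bitfield": 5, "request": 6, "piece": 7, "cancel": 8,
}


def encode_message(name, payload=b""):
    """Frame a message as <4-byte length><1-byte id><payload>."""
    body = bytes([MESSAGE_IDS[name]]) + payload
    return struct.pack(">I", len(body)) + body


def decode_message(data):
    """Return (message name, payload) for one framed message."""
    (length,) = struct.unpack(">I", data[:4])
    if length == 0:
        return "keep-alive", b""
    message_id = data[4]
    name = next(k for k, v in MESSAGE_IDS.items() if v == message_id)
    return name, data[5:4 + length]


unchoke = encode_message("unchoke")
have = encode_message("have", struct.pack(">I", 7))  # "I now have piece 7"
print(unchoke.hex(), have.hex(), decode_message(have))
```

State-change messages such as choke and interested carry no payload, which is why each side can track the four flags (choked and interested, in both directions) with very little traffic.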
10/23/2007 P2P 111
Choking
By default, every peer is “choked”
– stop “uploading” to them, but the TCP connection is still there.
Select 4~6 peers to “unchoke” ??
– “Re-choke” every 30 seconds
– How to decide?
Optimistic Unchoking
– What is this?
10/23/2007 P2P 112
“Interested”
A request for a piece (or its sub-pieces)
10/23/2007 P2P 113
Get a piece/block!!
Download:
– Which peer? (download from whom? Does it matter?)
– Which piece?
How about “upload”?
– Which peer?
– Which piece?
10/23/2007 P2P 114
Piece Selection
Pipelining (5 requests)
Strict Priority (incomplete pieces first)
Rarest First
What is the problem?
10/23/2007 P2P 115
Rarest First
Exchanging bitmaps with 20+ peers
– Initial messages
– “have” messages
Array of buckets
– The I-th bucket contains “pieces” with I known instances
– Within the same bucket, the client will randomly select one piece.
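The bucket idea can be sketched as follows. The input format (a dictionary from piece index to the number of known instances, built from bitfield and "have" messages) and the function name are assumptions; pieces that no connected peer has are skipped because they cannot be requested.

```python
import random
from collections import defaultdict


def rarest_first(have_counts, needed, rng=random):
    """Pick a random piece from the lowest non-empty bucket of still-needed pieces."""
    buckets = defaultdict(list)  # buckets[i] holds pieces with i known instances
    for piece in needed:
        count = have_counts.get(piece, 0)
        if count > 0:
            buckets[count].append(piece)
    if not buckets:
        return None  # nothing we need is currently available
    return rng.choice(buckets[min(buckets)])


have_counts = {0: 3, 1: 1, 2: 2}  # piece index -> number of peers known to have it
print(rarest_first(have_counts, needed=[0, 1, 2]))
```

Choosing randomly inside a bucket keeps many clients from all requesting the same rare piece from the same peer at once.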
10/23/2007 P2P 116
Piece Selection
Pipelining (5 requests)
Strict Priority
3 stages:
– Random first piece
– Rarest First
– Endgame mode
10/23/2007 P2P 117
Piece Selection
Piece (64K~1M)
Sub-piece (16K)
– Piece size: trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers.
Strict Priority - sub-pieces and piece
Rarest First
– Exception: “random first”
– Get the stuff out of Seed(s) as soon as possible.
10/23/2007 P2P 118
Get a piece/block!!
Download:
– Which peer?
– Which piece?
How about “upload”?
– Which peer?
– Which piece?
10/23/2007 P2P 119
Peer Selection
Focus on Rate
Upload to 4~6 peers
Random Unchoke
Global rate cap only
10/23/2007 P2P 120
BitTorrent: “Tit for Tat”
Equivalent Retaliation (Game theory)
– A peer will “initially” cooperate, then respond in kind to an opponent's previous action. If the opponent previously was cooperative, the agent is cooperative. If not, the agent is not.
10/23/2007 P2P 121
Choking
By default, every peer is “choked”
– stop “uploading” to them, but the TCP connection is still there.
Select 4~6 peers to “unchoke” ??
– Best “upload rates” and “interested”.
– Upload to the unchoked ones and monitor the download rate for all the peers
– “Re-choke” every 30 seconds
Optimistic Unchoking (6+1)
– Randomly select a choked peer to unchoke
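One re-choke round might look like the sketch below. It is a simplification under stated assumptions: each peer is described by a single observed transfer rate and an "interested" flag, four regular slots are used, and the optimistic pick is drawn uniformly from every peer that did not get a regular slot; real clients apply further rules.

```python
import random


def choose_unchoked(peers, slots=4, rng=random):
    """Return (regular unchokes, optimistic unchoke) for one re-choke round.

    peers maps a peer name to (observed rate, interested flag).
    """
    interested = [p for p, (_, wants) in peers.items() if wants]
    # Regular slots go to the interested peers with the best rates.
    regular = sorted(interested, key=lambda p: peers[p][0], reverse=True)[:slots]
    # One extra slot goes to a random choked peer, whatever its rate.
    choked = [p for p in peers if p not in regular]
    optimistic = rng.choice(choked) if choked else None
    return regular, optimistic


peers = {
    "a": (50, True), "b": (10, True), "c": (90, False),
    "d": (70, True), "e": (30, True), "f": (20, True),
}
regular, optimistic = choose_unchoked(peers, slots=4, rng=random.Random(1))
print(regular, optimistic)
```

The optimistic slot is what lets a newcomer with no pieces, and therefore no rate to show, receive its first data and start reciprocating.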
10/23/2007 P2P 122
BitTorrent
Fairness of download and upload between a pair of peers
Every 10 seconds, estimate the download bandwidth from the other peer
– Based on the performance estimate, decide whether to continue uploading to the other peer
10/23/2007 P2P 123
Properties
Bigger “%” = better chance of being unchoked
Bigger “%” ~= better UL and DL rates ?!
10/23/2007 P2P 124
Who to Unchoke?
[Figure: Peer/Seed with pieces 1 to 4]
10/23/2007 P2P 125
Seed unchoking
old algorithm
– unchoke the fastest peers (how?)
– problem: fastest peers may monopolize seeds
new algorithm
– periodically sort all peers according to their last unchoke time
– prefer the most recently unchoked peers; on a tie, prefer the fastest
– (presumably) achieves equal spread of seed bandwidth
10/23/2007 P2P 127
Attacks on BT
???
10/23/2007 P2P 128
Attacks on BT
Download only from the seeds
Download only from fastest peers
Announcing false pieces
Privacy -- (Torrent, source IP addresses)
10/23/2007 P2P 129
BitTorrent: Questions to ask
Peer’s role (or SP’s role)
Peer’s controllability and vulnerability
Incentives to contribute
Peer’s mobility and dynamics
Scalability
10/23/2007 P2P 130
BitTorrent
“Tit-for-Tat” incentive model within the same torrent
Piece/Peer selection and choking
The need for tracker and torrent file
10/23/2007 P2P 131
Client implementations
mainline: written in Python; right now, the only one employing the new seed unchoking algorithm
Azureus: the most popular, written in Java; implements a special protocol between clients (e.g. peers can exchange peer lists)
other popular clients: ABC, BitComet, BitLord, BitTornado, μTorrent, Opera browser
various non-standard extensions
– retaliation mode: detect compromised/malicious peers
– anti-snubbing: ignore a peer who ignores us
– super seeding: seed masquerading as a leecher