
ALMA MATER STUDIORUM – UNIVERSITA’ DI BOLOGNA 

Introduction to Peer-to-Peer Systems

Ozalp Babaoglu

© Babaoglu 2012 Complex Systems

Outline

■ Introduction
■ Peer-to-Peer vs. Client / Server
■ Common Topologies
■ Data location
■ Churn
■ Security


Introduction

■ Peer-to-peer (P2P) systems have become extremely popular and contribute to vast amounts of Internet traffic

■ P2P basic definition:
  ■ A P2P system is a distributed collection of peer nodes
  ■ Each node is both a server and a client:
    ▴ May provide services to other peers
    ▴ May consume services from other peers

■ Very different from the client-server model


P2P History: 1969 - 1990

■ The origins:
  ■ In the beginning, all nodes in Arpanet/Internet were peers
  ■ Each node was capable of:
    ▴ Performing routing (locating machines)
    ▴ Accepting ftp connections (file sharing)
    ▴ Accepting telnet connections (distributed computation)


P2P History: 1999 - today

■ The advent of Napster:
  ■ Jan 1999: the first version of Napster was released by Shawn Fanning, a student at Northeastern University
  ■ July 1999: Napster Inc. founded
  ■ Feb 2001: Napster closed down


Napster

[Figure: a Napster Central Index Server connected to Napster Clients 1 to 4 and to "Your Computer". Your computer sends the query "song.mp3" to the index server, the answer is "Client 2", and song.mp3 is then copied directly from Client 2.]

- Client/Server search
- P2P download
- Napster is not “pure P2P”


Client/Server vs. Peer-to-Peer

Client/Server:

• Servers well connected to the “core” of the Internet
• Servers carry out critical tasks
• Clients only talk to servers

Peer-to-Peer:

• Only nodes located at the “periphery of the Internet”
• Tasks distributed across all nodes
• Clients talk to other clients


Example – Video sharing

Client-Server: YouTube

Advantages

• Client can disconnect after upload
• Uploader needs little bandwidth
• Other users can find the file easily (just use search on server webpage)

Disadvantages

• Server may refuse the file or remove it later (according to content policy)
• Whole system depends on the server (what if it is shut down like Napster?)
• Server storage and bandwidth are expensive!

[Figure: client-server video sharing, with one uploader and five downloaders]


Example – Video sharing

Peer-to-peer: BitTorrent

Advantages

• Does not depend on a central server
• Bandwidth shared across nodes (downloaders also act as uploaders)
• High scalability, low cost

Disadvantages

• Seeder must remain on-line to guarantee file availability
• Content is more difficult to find (downloaders must find .torrent file)
• Freeloaders cheat in order to download without uploading

[Figure: peer-to-peer video sharing, with one seeder and five downloaders]


P2P vs. client-server

Client-server

Asymmetric: client and servers carry out different tasks

Global knowledge: servers have a global view of the network

Centralization: communications and management are centralized

Single point of failure: a server failure brings down the system

Limited scalability: servers easily overloaded

Expensive: server storage and bandwidth capacity is not cheap

Peer-to-peer

Symmetric: no distinction between nodes; they are all peers

Local knowledge: nodes only know a small set of other nodes

Decentralization: no global knowledge, only local interactions

Robustness: several nodes may fail with little or no impact

High scalability: high aggregate capacity, load distribution

Low-cost: storage and bandwidth are contributed by users


P2P and Overlay Networks

■ Peer-to-Peer networks are usually “overlays”
  ■ Logical structures built on top of a physical routed communication infrastructure (IP) that creates the illusion of a completely-connected graph
■ Links based on logical relationships (“knows”) rather than physical connectivity

Overlay networks

[Figures: four views of the same set of nodes]

■ Physical network: “who has a communication link to whom”
■ Logical network: “who can communicate with whom”
■ Overlay network (ring): “who knows whom”
■ Overlay network (binary tree): “who knows whom”
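
As a rough illustration of “who knows whom”, a ring overlay and a binary-tree overlay can be written down as neighbor sets over the same peers. This is only a sketch: the peer names and the helpers `ring_overlay` and `binary_tree_overlay` are invented for this example and are not taken from any particular system.

```python
def ring_overlay(peers):
    """Each peer knows its predecessor and successor in list order."""
    n = len(peers)
    return {
        peers[i]: {peers[(i - 1) % n], peers[(i + 1) % n]}
        for i in range(n)
    }


def binary_tree_overlay(peers):
    """Peer i knows its parent and its children at positions 2i+1 and 2i+2."""
    links = {p: set() for p in peers}
    for i, parent in enumerate(peers):
        for child_index in (2 * i + 1, 2 * i + 2):
            if child_index < len(peers):
                child = peers[child_index]
                links[parent].add(child)
                links[child].add(parent)
    return links


# Hypothetical peer names, used only for illustration.
peers = ["A", "B", "C", "D", "E"]
ring = ring_overlay(peers)
tree = binary_tree_overlay(peers)
```

Both overlays sit on the same logical network; only the set of neighbors each peer keeps differs.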


P2P Environment

■ High latency, low bandwidth communication
■ Churn
  ■ Nodes may disconnect temporarily
  ■ New nodes are continuously joining the system, while others leave permanently
■ Security
  ■ P2P clients run on machines under the total control of their owners
  ■ Malicious users may try to bring down the system
■ Selfishness
  ■ Users may run hacked clients in order to avoid contributing resources


Why P2P?

■ Decentralization enables deployment of applications that are:
  ■ Highly available
  ■ Fault-tolerant
  ■ Self-organizing
  ■ Scalable
  ■ Difficult or impossible to shut down

■ This results in a “grassroots” approach and “democratization” of the Internet


P2P Problems

■ Overlay construction and maintenance
  ■ e.g., random, two-level, ring, etc.
■ Data location
  ■ locate a given data object among a large number of nodes
■ Data dissemination
  ■ propagate data in an efficient and robust manner
■ Per-node state
  ■ keep the amount of state per node small
■ Tolerance to churn
  ■ maintain system invariants (e.g., topology, data location, data availability) despite node arrivals and departures


P2P Applications

■ Sharing of content:
  ■ File sharing, content delivery networks
  ■ Gnutella, eMule, Akamai
■ Sharing of storage:
  ■ Distributed file systems
■ Sharing of CPU time:
  ■ Parallel computing, Grid
  ■ Seti@home, Folding@home, FightAids@home (typically not pure P2P)


P2P Topologies

■ Centralized
■ Hierarchical
■ Decentralized
■ Hybrid


Evaluating topologies

■ Manageability
  ■ How hard is it to keep working?
■ Information coherence
  ■ How authoritative is info?
■ Extensibility
  ■ How easy is it to grow?
■ Fault tolerance
  ■ How well can it handle failures?
■ Censorship
  ■ How hard is it to shut down?


Centralized

■ Manageable
  ■ System is all in one place
■ Coherent
  ■ Information is centralized
■ Extensible
  ■ No
■ Fault tolerance
  ■ Single point of failure
■ Censorship
  ■ Easy to shut down


Hierarchical

■ Manageable
  ■ Chain of authority
■ Coherent
  ■ Cache consistency
■ Extensible
  ■ Add more leaves, rebalance
■ Fault tolerance
  ■ Root is vulnerable
■ Censorship
  ■ Just shut down the root


Decentralized

■ Manageable
  ■ Difficult, many owners
■ Coherent
  ■ Difficult, unreliable peers
■ Extensible
  ■ Anyone can join in
■ Fault tolerance
  ■ Redundancy
■ Censorship
  ■ Difficult to shut down


Centralized + Decentralized

■ Manageable
  ■ Same as decentralized
■ Coherent
  ■ Better than decentralized
■ Extensible
  ■ Anyone can join in
■ Fault tolerance
  ■ Redundancy
■ Censorship
  ■ Difficult to shut down


Some Common Topologies

■ Flat unstructured: a node can connect to any other node
  ■ only constraint: maximum degree dmax
  ■ fast join procedure
  ■ good for data dissemination, bad for location
■ Two-level unstructured: nodes connect to a superpeer
  ■ superpeers form a small overlay
  ■ used for indexing and forwarding
  ■ high load on superpeers
■ Flat structured: constraints based on node ids
  ■ allows for efficient data location
  ■ constraints require long join and leave procedures


Data location: Flooding

■ Problem: find the set of nodes S that store a copy of object O
■ Flooding: send a search message to all nodes (first Gnutella protocol)
  ■ A search message contains either keywords or an object id
  ■ Advantages:
    ▴ simplicity
    ▴ no topology constraints
  ■ Disadvantages:
    ▴ high network overhead (huge traffic generated by each search request)
    ▴ flooding is stopped by a TTL (which produces a search horizon)
    ▴ only applicable to a small number of nodes
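
A minimal sketch of TTL-limited flooding may help make the search horizon concrete. It assumes an overlay given as an adjacency dictionary and models the flood as a breadth-first traversal; the function name `flood_search`, the example overlay and the way messages are counted are assumptions of this sketch, not the Gnutella protocol itself.

```python
from collections import deque


def flood_search(graph, source, holders, ttl):
    """Flood a search from `source`; return (holders found, messages sent).

    `graph` maps each node to its overlay neighbors, `holders` is the set
    of nodes that store the object, and `ttl` is the maximum hop count.
    """
    seen = {source}
    frontier = deque([(source, ttl)])
    found = set()
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if node in holders:
            found.add(node)
        if remaining == 0:
            continue  # the search horizon: the message is not forwarded further
        for neighbor in graph[node]:
            messages += 1  # every forwarded copy costs one message
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return found, messages


# Hypothetical overlay used only for illustration.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

print(flood_search(overlay, "A", {"E"}, ttl=2))  # (set(), 5): E is beyond the horizon
print(flood_search(overlay, "A", {"E"}, ttl=3))  # ({'E'}, 7)
```

Node E is three hops from A, so a search with TTL = 2 misses it even though the object exists, while raising the TTL finds it at the cost of more messages.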


Data location: Flooding

■ Flooding (cont.)
  ■ Flooding in a flat unstructured network:

[Figure: flooding in a flat unstructured overlay, showing the search horizon for TTL = 2 and an object (obj) stored outside it]

Objects that lie outside of the horizon are not found


Data location: Superpeers

■ Two-level overlay: use superpeers to track the locations of an object [Gnutella 2, BitTorrent]
  ■ Each node connects to a superpeer and advertises the list of objects it stores
  ■ Search requests are sent to the superpeer, which forwards them to other superpeers
  ■ Advantages:
    ▴ highly scalable
  ■ Disadvantages:
    ▴ superpeers must be reliable, powerful and well connected to the Internet (expensive)
    ▴ superpeers must maintain large state
    ▴ the system relies on a small number of superpeers
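
The advertise-and-search interaction can be sketched as below. The class `Superpeer` and its methods are hypothetical names chosen for this example; forwarding is limited to one hop between superpeers to keep the sketch short, which is an assumption rather than a property of any real system.

```python
from collections import defaultdict


class Superpeer:
    """Hypothetical superpeer keeping an index: object id -> peers storing it."""

    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)
        self.neighbors = []  # other superpeers in the small top-level overlay

    def advertise(self, peer, object_ids):
        """Called by an ordinary peer to publish the objects it stores."""
        for object_id in object_ids:
            self.index[object_id].add(peer)

    def search(self, object_id, forward=True):
        """Answer from the local index and, once, from neighboring superpeers."""
        holders = set(self.index.get(object_id, ()))
        if forward:
            for superpeer in self.neighbors:
                holders |= superpeer.search(object_id, forward=False)
        return holders


# Hypothetical example with two superpeers and two ordinary peers.
sp1, sp2 = Superpeer("sp1"), Superpeer("sp2")
sp1.neighbors.append(sp2)
sp2.neighbors.append(sp1)
sp1.advertise("peer1", ["song.mp3"])
sp2.advertise("peer2", ["song.mp3", "video.avi"])
```

The per-object index is the “large state” listed among the disadvantages: it grows with the number of objects advertised by the attached peers.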


Data location: Superpeers

■ Two-Level Overlay (cont.)

[Figure: two-level overlay showing a request for an object (obj) and the corresponding response]

A two-level overlay is a partially centralized system

In some systems, superpeers may be disconnected from one another (e.g., BitTorrent)


Data location: Key-Based Routing

■ Structured networks: use a routing algorithm that implements Key-Based Routing (KBR) [Chord, Pastry, Overnet, Kad, eMule]
■ KBR (also known as Distributed Hash Tables) works as follows:
  ■ nodes are (randomly) assigned unique node identifiers (nodeId)
  ■ given a key k, the node whose nodeId is numerically closest to k among all nodes in the network is known as the root of key k
  ■ given a key k, a KBR algorithm can route a message to the root of k in a small number of hops, usually O(log n)
  ■ the location of an object with id objectId is tracked by the root of k = objectId
  ■ thus, one can find the location of an object by routing a message to the root of k = objectId and querying the root for the location of the object
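
The notion of the root of a key can be illustrated with a few lines of code. The sketch below uses global knowledge of all nodeIds, which a real KBR system does not have (it reaches the root by routing), and it measures closeness as plain numeric distance without wrap-around; both are simplifying assumptions. The nodeIds are those of the Pastry routing example in these slides, and `root_of` and `tracker` are names made up for this sketch.

```python
def root_of(key, node_ids):
    """Return the nodeId numerically closest to `key` (ties go to the smaller id)."""
    return min(node_ids, key=lambda node_id: (abs(node_id - key), node_id))


# nodeIds from the Pastry routing example in these slides (16-bit id space).
node_ids = [0x04F2, 0x3A79, 0x5230, 0x620F, 0x85E0, 0x8821,
            0x8909, 0x8954, 0x8957, 0xAC78, 0xC52A, 0xE25A]

# The root tracks where copies of the object are stored.
object_id = 0x8955
tracker = {root_of(object_id, node_ids): {object_id: [0x620F, 0xC52A]}}

root = root_of(object_id, node_ids)
print(f"{root:04X}")  # 8954
print([f"{n:04X}" for n in tracker[root][object_id]])  # ['620F', 'C52A']
```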


Pastry

■ Pastry is a middleware substrate for wide-area peer-to-peer object location and routing

■ Any host that runs Pastry middleware and has proper credentials can participate in the overlay network


Pastry Characteristics

■ Each node in Pastry has a unique 128-bit nodeId
■ Keys (of items or messages) and nodeIds are in the same address space
■ Given a message with a key, Pastry routes the message to a live node whose nodeId is numerically closest to the key (its root)
■ The routing is done in O(log n) hops, where n is the number of nodes in the network
■ Takes into account network locality
  ■ Seeks to minimize the distance that messages travel


Design of Pastry

■ Node Identifiers
  ■ Assigned randomly
  ■ Uniformly distributed over some large ID space (e.g., 128 bits) through a secure hash (e.g., SHA1) of a unique parameter such as IP address or public key
  ■ As a consequence, nodes with adjacent nodeIds are diverse in geography, ownership, jurisdiction
  ■ nodeIds are thought of as a sequence of base-$2^b$ digits
    ▴ b=1, nodeIds are binary numbers
    ▴ b=4, nodeIds are hexadecimal numbers
    ▴ Typical value for b is 4
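
A possible way to derive a nodeId and split it into base-$2^b$ digits is sketched below. SHA-1 produces 160 bits, so the sketch keeps the 128 most significant bits; that truncation, the helper names `node_id` and `digits`, and the example IP address are assumptions made for illustration and may differ from what a real Pastry implementation does.

```python
import hashlib


def node_id(unique_param, bits=128):
    """Hash a unique parameter (bytes) and keep the `bits` most significant bits."""
    digest = hashlib.sha1(unique_param).digest()  # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") >> (160 - bits)


def digits(value, b, bits=128):
    """Split `value` into base-2**b digits, most significant first.

    Assumes `bits` is a multiple of `b`.
    """
    mask = (1 << b) - 1
    count = bits // b
    return [(value >> (b * i)) & mask for i in reversed(range(count))]


# 192.0.2.1 is a documentation address, used here only as an example input.
nid = node_id(b"192.0.2.1")
print(len(digits(nid, b=4)))            # 32 hexadecimal digits for a 128-bit id
print(digits(0x8955, b=4, bits=16))     # [8, 9, 5, 5]
print(digits(0b1011, b=1, bits=4))      # [1, 0, 1, 1]
```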


Design of Pastry

■ Routing characteristics
  ■ Route the key to the numerically closest nodeId
  ■ Done in approximately $\lceil \log_{2^b} N \rceil$ steps
    ▴ N is the number of nodes
  ■ Delivery of messages guaranteed unless $\lfloor L/2 \rfloor$ nodes with adjacent nodeIds fail simultaneously
    ▴ L is a configuration parameter with a typical value of $2^4$ or $2^5$


Pastry Node State

■ Routing table
  ■ $\lceil \log_{2^b} N \rceil$ rows
  ■ $2^b - 1$ entries in each row
  ■ Each of the entries at row x represents a node whose nodeId shares an x-digit prefix with the current node, but whose (x+1)-st digit has one of the $2^b - 1$ possible values other than the (x+1)-st digit of the current nodeId
  ■ Each entry in the routing table contains the IP address of one of potentially many nodes whose nodeIds have the appropriate prefix


Pastry Node State: Routing Table

■ Only the nodes which are likely to be close to the current node are selected (using an appropriate proximity metric)
■ If no node is suitable for the entry, it is left blank
■ Choice of b involves a trade-off
  ■ When b decreases, the number of hops to route a message increases
  ■ When b increases, a smaller portion of the routing table is populated


Pastry Node State: Routing Table

[Figure: routing table example, with two annotations: “used to find next hop with longer shared prefix” and “used to find the nodeid closest to a key that is close to the local nodeid”]

• In this example the routing table size is 4 x 15 = 60 entries, for a maximum network size of N = 65536 nodes.

• The average route length in this case is 4 hops.
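
The numbers in this example follow from the routing table dimensions ($\lceil \log_{2^b} N \rceil$ rows of $2^b - 1$ entries). A small sketch of the arithmetic, using integer operations to avoid floating-point rounding in the logarithm; the function name `routing_table_shape` is made up for this example.

```python
def routing_table_shape(n_nodes, b):
    """Return (rows, entries per row, total entries) for N nodes and base 2**b."""
    base = 2 ** b
    rows = 0
    capacity = 1
    while capacity < n_nodes:  # smallest rows such that base**rows >= n_nodes
        capacity *= base
        rows += 1
    return rows, base - 1, rows * (base - 1)


print(routing_table_shape(65536, b=4))  # (4, 15, 60), as in the example above
print(routing_table_shape(65536, b=2))  # (8, 3, 24): smaller table, more hops
print(routing_table_shape(65536, b=1))  # (16, 1, 16)
```

The three calls show the trade-off in the choice of b: a smaller b shrinks the table but increases the number of rows, and hence the number of routing steps.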


Pastry Node State: Leaf set

■ Contains the entries of the nodes numerically closest to the current node
  ■ The $\lfloor L/2 \rfloor$ nodes with the numerically closest larger nodeIds, and
  ■ The $\lfloor L/2 \rfloor$ nodes with the numerically closest smaller nodeIds
■ Typical value for L is $2^b$
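
As an illustration, the leaf set of a node can be computed from a list of nodeIds as follows. The sketch assumes global knowledge of all ids and ignores wrap-around at the ends of the id space; both are simplifications, and `leaf_set` is a hypothetical helper. The nodeIds are taken from the routing example in these slides.

```python
def leaf_set(current, all_ids, L):
    """Return (smaller, larger): the L//2 closest ids on each side of `current`."""
    half = L // 2
    smaller = sorted((n for n in all_ids if n < current), reverse=True)[:half]
    larger = sorted(n for n in all_ids if n > current)[:half]
    return smaller, larger


# nodeIds from the routing example in these slides (16-bit id space).
node_ids = [0x04F2, 0x3A79, 0x5230, 0x620F, 0x85E0, 0x8821,
            0x8909, 0x8954, 0x8957, 0xAC78, 0xC52A, 0xE25A]

smaller, larger = leaf_set(0x8954, node_ids, L=4)
print([f"{n:04X}" for n in smaller])  # ['8909', '8821']
print([f"{n:04X}" for n in larger])   # ['8957', 'AC78']
```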


Pastry Routing

■ A particular node receives the message containing a key
■ If the key falls within the range of nodeIds covered by the leaf set
  ■ Directly forward the message to the node whose nodeId is closest to the key
■ Otherwise use the routing table
  ■ The current node forwards the message to a node whose nodeId shares a prefix with the key that is at least one digit longer


Pastry Routing

■ If the appropriate entry in the routing table is empty or the associated node is not reachable
  ■ Among all the entries that the current node has, identify those that share with the key a prefix at least as long as the one shared by the current nodeId
  ■ Choose an entry whose nodeId is numerically closer to the message key than the current node’s id (such an entry must exist in the leaf set if the message hasn’t reached the destination)
  ■ Forward the message to that node


Pastry Routing

■ The routing converges because each step takes the message to a node that either
  ■ shares a longer prefix with the key than the local node, or
  ■ shares a prefix of the same length as the local node but is numerically closer to the key
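
The three routing cases (leaf set, routing table, fallback) can be put together in a simplified sketch. It works on 4-digit hexadecimal ids, ignores wrap-around of the id space and node failures, and uses a hand-written per-node state (`STATE`) that is hypothetical: it was chosen so that the route for key 8955 from node 04F2 matches the hops in the routing example of these slides. The function names are also assumptions of the sketch, not Pastry's API.

```python
def shared_prefix_len(a, b):
    """Number of leading digits that two equal-length id strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def distance(a, b):
    """Numeric distance between two hex ids (wrap-around ignored)."""
    return abs(int(a, 16) - int(b, 16))


def next_hop(current, key, leaf_set, routing_table):
    """Pick the next node for `key`; returns `current` when it is the root."""
    # Case 1: the key falls within the range covered by the leaf set.
    nearby = leaf_set + [current]
    low = min(int(n, 16) for n in nearby)
    high = max(int(n, 16) for n in nearby)
    if low <= int(key, 16) <= high:
        return min(nearby, key=lambda n: distance(n, key))
    # Case 2: a routing table entry that shares one more digit with the key.
    row = shared_prefix_len(current, key)
    entry = routing_table.get((row, key[row]))
    if entry is not None:
        return entry
    # Case 3: any known node with an equally long prefix that is closer to the key.
    known = leaf_set + list(routing_table.values())
    closer = [n for n in known
              if shared_prefix_len(n, key) >= row
              and distance(n, key) < distance(current, key)]
    return min(closer, key=lambda n: distance(n, key)) if closer else current


def route(source, key, state, max_hops=16):
    """Follow next_hop from `source` until a node keeps the message."""
    path = [source]
    for _ in range(max_hops):
        leaf_set, table = state[path[-1]]
        hop = next_hop(path[-1], key, leaf_set, table)
        if hop == path[-1]:
            break
        path.append(hop)
    return path


# Hypothetical state: node -> (leaf set, {(row, digit): nodeId}).
STATE = {
    "04F2": (["3A79"], {(0, "8"): "85E0"}),
    "85E0": (["620F", "8821"], {(1, "9"): "8909"}),
    "8909": (["8821", "8954"], {(2, "5"): "8957"}),
    "8957": (["8954", "AC78"], {}),
    "8954": (["8909", "8957"], {}),
}

print(route("04F2", "8955", STATE))  # ['04F2', '85E0', '8909', '8957', '8954']
```

Each hop either lengthens the prefix shared with the key (04F2, 85E0, 8909, 8957) or keeps it and moves numerically closer (8957 to 8954), which is the convergence argument stated above.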


Pastry Routing Example

Key-Based Routing [Pastry]

route(k=8955, msg), overlay address space [0000, FFFF]

Source nodeid: 04F2
k = 8955

Hop #   Hop id   Shared prefix length
0       04F2     0
1       85E0     1
2       8909     2
3       8957     3
4       8954     3 (root of k)

Object 8955 is tracked by node 8954, which knows of two copies stored at nodes 620F and C52A

[Figure: overlay with nodes 04F2, 3A79, 5230, 620F, 85E0, 8821, 8909, 8954, 8957, AC78, C52A and E25A; copies of obj 8955 are stored at nodes 620F and C52A]


Data location: KBR (cont.)

■ Structured networks (cont.)
  ■ Advantages:
    ▴ completely decentralized (no need for superpeers)
    ▴ routing algorithm achieves low hop count for large network sizes
  ■ Disadvantages:
    ▴ each object must be tracked by a different node
    ▴ objects are tracked by unreliable nodes (which may disconnect)
    ▴ keyword-based searches are more difficult to implement than with superpeers (because objects are located by their objectid)
    ▴ the overlay must be structured according to a given topology in order to achieve a low hop count
    ▴ routing tables must be updated every time a node joins or leaves the overlay


Data location - Loosely structured overlays

Loosely structured networks: use hints for the location of objects [Freenet]

• Nodes locate objects by sending search requests containing the objectid

• Requests are propagated using a technique similar to flooding

• Objects with similar identifiers are grouped on the same nodes

[Figure: overlay of nodes A to F with objects AE5J, AF02 and 5B20; a request for AE5J and the corresponding response are shown]


Data location - Loosely structured overlays

Loosely structured networks (cont.)

• A search response leaves routing hints on the path back to the source

• Hints are used when propagating future requests for similar object ids

[Figure: the same overlay of nodes A to F after the search; nodes hold hint tables (“AE5J: D, 5B20: E”, “AE5J: F”, “AE5J: B”) and a new request for AF02 is shown]
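
How a node might use its hints when forwarding a request for a similar object id can be sketched as follows. The slides do not define “similar”, so the sketch assumes that similarity is the length of the shared id prefix; that metric and the helper names are assumptions of this example and not Freenet's actual rule. The hint table is the one shown in the figure.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that two ids share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_next_hop(hints, object_id):
    """Forward toward the neighbor whose hinted id is most similar to `object_id`."""
    best_known = max(hints, key=lambda known: shared_prefix_len(known, object_id))
    return hints[best_known]


# Hint table from the figure: object id -> neighbor that led to it.
hints = {"AE5J": "D", "5B20": "E"}

print(pick_next_hop(hints, "AF02"))  # D, because AF02 is more similar to AE5J
print(pick_next_hop(hints, "5B99"))  # E
```

Since objects with similar identifiers are grouped on the same nodes, following the hint for AE5J is a reasonable guess for AF02, but it remains a guess: nothing guarantees that the object is found.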


Data location - Loosely structured overlays

■ Advantages:
  ■ no topology constraints, flat architecture
  ■ searches are more efficient than with plain flooding
■ Disadvantages:
  ■ does not support keyword-based searches
  ■ search requests have a TTL
  ■ contrary to structured overlays, loosely structured overlays do not guarantee a low number of hops, nor that the object will be found


Data location - Summary

The location schemes described previously can be classified along two dimensions: structure and decentralization

Degree of            Degree of structure
decentralization     low                              loose      high
partial              eMule, Gnutella 2, BitTorrent    -          -
total                Gnutella                         Freenet    Chord, Pastry


Effects of Churn

■ Churn can have several effects on a P2P system:
  ■ data objects may become unavailable if all replicas disconnect
  ■ routing tables may become inconsistent (e.g., entries may point to disconnected nodes)
  ■ the overlay may become partitioned if many nodes suddenly disconnect


Churn - Preventing Partitions

■ A naïve approach to preventing partitions is to increase the average node degree

■ Ring partitions can be avoided by keeping a list of successor nodes
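
The successor-list idea can be sketched on a small ring. Each node keeps its next r successors; the ring stays intact as long as every surviving node still has at least one live entry in its list. The ring size, the failure sets and the function names are assumptions chosen for illustration.

```python
def successor_lists(ring, r):
    """Map each node to its next r successors in ring order (assumes r < len(ring))."""
    n = len(ring)
    return {ring[i]: [ring[(i + k) % n] for k in range(1, r + 1)]
            for i in range(n)}


def first_live_successor(node, lists, alive):
    """Return the first successor of `node` that is still alive, or None."""
    for candidate in lists[node]:
        if candidate in alive:
            return candidate
    return None


def ring_survives(ring, r, failed):
    """True if every surviving node can still reach a live successor."""
    alive = {n for n in ring if n not in failed}
    lists = successor_lists(ring, r)
    return all(first_live_successor(n, lists, alive) is not None for n in alive)


ring = list(range(8))  # hypothetical ring of 8 nodes
print(ring_survives(ring, r=1, failed={1, 2}))  # False: node 0 loses its only successor
print(ring_survives(ring, r=2, failed={1, 2}))  # False: both successors of node 0 failed
print(ring_survives(ring, r=3, failed={1, 2}))  # True: node 0 skips ahead to node 3
```

With a list of r successors, the ring tolerates the simultaneous failure of up to r - 1 consecutive nodes.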


Churn Tolerance

■ Node arrivals and departures must not disrupt the normal behavior of the system
  ■ system invariants must be maintained
    ▴ connected overlay (i.e., no partitions), low average path length
    ▴ data objects accessible from anywhere in the network
  ■ two types of churn tolerance:
    ▴ dynamic recovery: ability to react to changes in the overlay to maintain system invariants (e.g., heal partitions)
    ▴ static resilience: ability to continue operating correctly before adaptation occurs (e.g., route messages through alternate paths)


Security

■ Security in P2P systems is hard to enforce:
  ■ Users have full control of their computers
  ■ Modified clients may not follow the standard protocol
  ■ Data may be corrupted
  ■ Private data stored on remote computers may be disclosed


Security - Weak identities

■ The user may leave the system and rejoin it with a new identity (different user id)

■ If an attack is detected, the attacker can re-enter the system with a new id

■ An attacker may create a large number of false identities (Sybil attack)

Example of Sybil attack:

[Figure: node A together with nodes S1 to S6]

- Nodes S1 to S6 are actually 6 instances of the P2P client running on the same machine
- The attacker can intercept all traffic coming from or going to node A


Security - Strong identities

■ The user cannot change its identity
■ Solution: use a centralized, trusted Certification Authority (CA)
  ■ Each new user must obtain an identity certificate
  ■ The certificate is digitally signed by the CA, whose public key is known by all users
  ■ A certificate cannot be forged (forging one would require the CA’s private key)
  ■ To prove their identity, users sign a message with their private key and attach the corresponding certificate signed by the CA
■ Strong identities prevent Sybil attacks
■ If an attacker is caught, it cannot easily rejoin the system


Security - Weak vs. Strong identities

■ Strong identities require a centralized CA
  ■ New nodes must contact the CA before joining the network:
    ▴ The CA response may be slow
    ▴ If the CA is unavailable, new nodes cannot join
  ■ The security of the system depends on the CA:
    ▴ The CA must correctly verify the identity of the requester
    ▴ The CA’s private key must be kept secret
■ Many P2P systems use weak identities
  ■ IP addresses already give some identity information
  ■ Some systems ensure anonymity
