Open Problems in Data-Sharing Peer-to-Peer Systems
Neil Daswani, Hector Garcia-Molina, Beverly Yang
Peer-To-Peer Systems
  Autonomous, large-scale, decentralized systems
  A large pool of resources
    Files, compute cycles
  Open performance and security challenges
Research problems
  Search
    Efficiency
    Expressiveness
    Quality of Service
  Security
    Availability
    Authenticity
    Anonymity
    Access Control
Search Mechanism
  Submit queries and receive results
    Keywords, SQL statements
  Defines the behavior of peers
    Topology: how peers are connected to each other
    Data placement: how data is distributed across the peers
    Message routing: how messages are propagated
System Requirements
  Expressiveness
    Query language should provide detailed description
    Key lookups not expressive enough
  Comprehensiveness
    Single result not sufficient for some systems
    All results required in some cases
  Autonomy
    Nodes should control their organization
Goals of Search Mechanism
  Maximize efficiency
    Light overhead, higher throughput
  Maximize Quality of Service
    Number of results
    Response time
  Robustness
    Stability in presence of failures
Expressiveness (1/2)
  Key lookup
  Keyword queries
    Partial search
    Efficient for certain types of files, e.g. music
  Ranked keyword
    Rank the results of keyword queries
    Global statistics required
    Collection and maintenance challenging
    “top k” results
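A minimal sketch of why ranked keyword search needs global statistics. The slides do not name a ranking function, so a TF-IDF style score is assumed here; `top_k` and its arguments are hypothetical names. The document-frequency table is the "global statistic" that a P2P system would have to collect and maintain across peers.

```python
import heapq
import math
from collections import Counter

def top_k(query_terms, docs, k):
    """Rank docs (name -> list of words) by a TF-IDF style score; return the k best names."""
    n_docs = len(docs)
    # Global statistic: number of documents that contain each term.
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in docs.items():
        tf = Counter(words)
        score = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]  # skip terms that appear in no document
        )
        if score > 0:
            scores[name] = score
    return heapq.nlargest(k, scores, key=scores.get)
```

In this centralized sketch `doc_freq` is computed in one pass; in a decentralized system each peer sees only its own documents, which is what makes collection and maintenance challenging.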
Expressiveness (2/2)
  Aggregates
    SUM, COUNT, MAX and MEDIAN
    E.g. COUNT nodes belonging to forth.gr domain
  SQL
    The most difficult query language
    Performance “hotspots” (PIER system)
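The COUNT example above can be sketched as partial aggregation: each peer counts locally and only the partial counts are combined. This is an illustrative assumption about how such a query might be evaluated, not a description of any particular system; `local_count` and `aggregate_count` are hypothetical names.

```python
def local_count(hostnames, domain):
    """Count the hostnames known to one peer that belong to the given domain."""
    suffix = "." + domain
    return sum(1 for h in hostnames if h == domain or h.endswith(suffix))

def aggregate_count(peers, domain):
    """Combine per-peer partial counts into a global COUNT."""
    return sum(local_count(hosts, domain) for hosts in peers)
```

SUM, COUNT and MAX can be combined from partial results like this; MEDIAN cannot be derived from per-peer medians alone, which is one reason aggregates differ in difficulty.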
Autonomy / Efficiency / Robustness
  Correlation between autonomy and efficiency
    Locate data with bounded cost (Chord)
    Small sets of nodes guaranteed to hold the answer
    Increased chance of finding results on random node
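A simplified sketch of the placement rule that lets a Chord-style system locate data with bounded cost: a key lives on the first node whose identifier is at or after the key's identifier on the ring. Finger tables and routing are omitted, and `RING_BITS`, `ring_id` and `successor` are illustrative names, not part of any real API.

```python
import bisect
import hashlib

RING_BITS = 16  # illustrative ring size; real deployments use a much larger identifier space

def ring_id(name):
    """Map a node address or file name onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)

def successor(node_ids, key_id):
    """Return the first node ID at or after key_id, wrapping around the ring."""
    ordered = sorted(node_ids)
    i = bisect.bisect_left(ordered, key_id)
    return ordered[i % len(ordered)]
```

Because the responsible node is fixed by the hash, peers give up control over which data they store: the autonomy cost the slide refers to.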
Tuning the autonomy / efficiency tradeoff
  Varying needs
    E.g. sensitive files should remain on the intranet
  Different systems for different purposes not always desirable
  SkipNet
    Specify a range of peers on which a document can be stored
    Single peer range: high autonomy
    All peers range: traditional P2P system
Autonomy and Robustness
  Viceroy network construction
    Low level of autonomy
    Reduced cost of maintaining structure => increased robustness and efficiency
  Distributed hash tables
    Logarithmic maintenance cost
  Super-peer redundancy
    Stricter topology => decreased autonomy => greater robustness
Quality of Service
  Number of results
    Tradeoff between number of results and cost
  BFS technique
    Send messages to “productive” nodes
    Depends on ad-hoc topology
  Concept-clustering
    Communicate according to “interest”
  “Satisfaction”
    True when a threshold of results found
    Important to partial-search systems
    Cost can be drastically reduced
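A minimal sketch of the "satisfaction" idea, assuming a breadth-first flood over an in-memory overlay graph. The function `search`, its arguments, and the way messages are counted are illustrative assumptions; a real system would forward messages asynchronously between peers.

```python
from collections import deque

def search(graph, files, start, keyword, threshold, max_hops):
    """Breadth-first flood from start; stop as soon as `threshold` results are found."""
    results = []
    messages = 0                      # number of query messages forwarded
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        results.extend(f for f in files.get(node, []) if keyword in f)
        if len(results) >= threshold:  # the query is "satisfied"
            break
        if hops == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                messages += 1
                queue.append((neighbor, hops + 1))
    return results[:threshold], messages
```

With a low threshold the flood stops early and forwards fewer messages, which is the cost reduction that matters for partial-search systems.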
Security
  Availability
    Bandwidth, CPU and file availability
  File Authenticity
    Which responses are authentic?
  Anonymity
    How can we hide our identity?
  Access Control
    Restrict accessibility
Availability
  Nodes should be always up
  DoS attacks
    Flooding a node with messages
    Malicious super-nodes in Gnutella
      Claims that the victim has all files requested
  Attack CPU availability
    Sending complex queries
  Attack file storage
    Submit bogus documents
  Attack quality-of-service
    Serve a file slowly
    Send a different file
Countermeasures
  Careful design of P2P protocols
    Gnutella is loosely constrained
    Back-door communication channels are prohibited
  Techniques for detecting failures
    High message overhead, complexity
    Assume pairwise connectivity
  Allocate storage proportionally to what a node contributes
  Hash trees to ensure a node is sending the correct data and at a reasonable rate
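A minimal sketch of the hash-tree countermeasure, assuming a binary tree over fixed file blocks with SHA-256 and duplication of the last hash on odd levels (both are assumptions; the slides do not specify a construction). A receiver that holds a trusted root can detect a peer that sends a different file. Checking the transfer rate is not covered here.

```python
import hashlib

def _h(data):
    """Hash helper: SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Compute the root of a binary hash tree over a list of byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Changing any single block changes the root, so a receiver can compare the recomputed root against the trusted one.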
File Authenticity
  Different from file integrity
    CRC, hashing, MACs, digital signatures
  Given a query, the authentic response has to be distinguished
  What does “authentic” mean?
Definition of “authentic”
  Oldest document
    The oldest submission is considered authentic
    Timestamping systems
  Expert-based
    Authoritative nodes keep track of signatures
    Susceptible to failures
    Offline digital signature schemes
  Voting-based
    Votes of many experts
    Experts may be humans
    Spoofing of votes, nodes and files
  Reputation-based
    Weight votes, some experts more trustworthy
    Maintenance, update and propagation of weights
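The difference between voting-based and reputation-based authenticity can be sketched as a weighted tally. This is an illustrative assumption, not a scheme from the slides; `pick_authentic` and its parameters are hypothetical names. With all weights equal it reduces to plain voting.

```python
from collections import defaultdict

def pick_authentic(votes, weights, default_weight=1.0):
    """votes: expert -> file hash voted authentic; weights: expert -> trust weight."""
    totals = defaultdict(float)
    for expert, file_hash in votes.items():
        totals[file_hash] += weights.get(expert, default_weight)
    if not totals:
        return None                   # nobody voted
    return max(totals, key=totals.get)
```

The sketch does nothing against spoofed votes, and it takes the weights as given; keeping them up to date and propagating them is the open problem listed above.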
Anonymity (1/2)
  Illegal trade of files vs. censorship resistance, freedom of speech, privacy protection
  Types of anonymity
    Author: which users created which documents
    Server: which nodes store a given document
    Reader: which users access which documents
    Document: which documents are stored at a given node
  Anonymity vs. efficiency
  Free Haven provides server anonymity, Freenet provides author anonymity
Anonymity (2/2)
  Achieve server anonymity through intermediate nodes
    Forwarding proxies
    Servers identified by nicknames
  Degradation of anonymity protocols under attacks
    Problem of collusion
  Free Haven and Crowds use forwarding proxies
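A rough sketch of proxy forwarding in the style of Crowds, assuming the commonly described rule: the initiator hands the request to a random member, and each member then forwards to another random member with some probability or else submits it to the server. The function name, parameters and in-memory path are illustrative assumptions, not the real protocol implementation.

```python
import random

def crowds_path(members, initiator, forward_prob, rng):
    """Return the list of members a request passes through before reaching the server."""
    # The initiator always forwards to a randomly chosen member (possibly itself).
    path = [initiator, rng.choice(members)]
    # Each subsequent member flips a biased coin: forward again or submit to the server.
    while rng.random() < forward_prob:
        path.append(rng.choice(members))
    return path
```

For example, `crowds_path(["a", "b", "c"], "a", 0.75, random.Random(0))` yields one possible path. Colluding members on the path can still observe their predecessor, which is the degradation under attack noted above.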
Access Control
  Restrict accessibility to documents
  P2P systems cannot enforce copyright laws
    Violation of copyright laws by users
    Lawsuits against companies that build P2P systems
  Limited utilization vs. free distribution