
Text Retrieval in Peer to Peer Systems

David Karger

MIT

Information Retrieval before P2P

The traditional approach

Information Retrieval

Most of our information base is text:
  academic journals
  books and encyclopedias
  news feeds
  world wide web pages
  email

How do we find what we need?

[Slide graphic contrasting “neat” and “messy” information sources]

The Classic IR Model

User has an information need
User formulates a textual query
System processes a corpus of documents
System extracts relevant documents
User refines the query

Metrics (sketch below):
  recall: % of the relevant documents that are retrieved
  precision: % of the retrieved documents that are relevant
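A minimal sketch of the two metrics in Python (the function name and set-based inputs are illustrative, not from the slides):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision and recall over sets of document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0  # fetch nothing: trivially precise
    recall = hits / len(relevant) if relevant else 1.0       # fetch everything: full recall
    return precision, recall

# Example: 3 of 4 retrieved docs are relevant, out of 6 relevant overall.
print(precision_recall({"d1", "d2", "d3", "d9"}, {"d1", "d2", "d3", "d4", "d5", "d6"}))
# (0.75, 0.5)
```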

Precision-Recall Tradeoff

[Slide graphic: precision vs. recall curve, both axes running to 100%. “Fetch Everything” gives full recall at low precision; “Fetch Nothing” gives full precision at zero recall. CIA, Library, and Web Search are marked as example operating points along the tradeoff.]

Specific Retrieval Algorithms

Define relevance:
  build a model of documents, meanings
  ignore computational cost
Implement efficiently:
  Preprocessing:
    TB corpora call for big-iron machines (or simulations)
  Interaction:
    after 1/2 second, the user notices the delay
    after 10 seconds, the user gives up (historical perspective; changed by the web)

Boolean Keyword Search

Q: “Do harsh winters affect steel production?”
Query: steel AND winter

Output:
  “Last WINTER, overproduction of STEEL led to...”
  “STEEL automobiles resist WINTER weather poorly.”
  “Boston must STEEL itself for another bad WINTER”
  “the Pittsburgh STEELers started WINTER training...”

Not output:
  “Cold weather caused increased metal prices as orders for radiators and automobiles picked up...”

Implementing Boolean Search

Typical query form: OR of ANDs
  handle each OR separately, aggregate
For the ANDs, use an inverted index (sketch below):
  per term, a list of the documents containing that term
  intersect the lists for the query terms
Basically a database join
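A minimal sketch of the inverted-index approach in Python, assuming documents are given as a dict of id-to-text (all names here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Per term, the set of documents containing that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Intersect the posting lists of the query terms (the database-join step)."""
    lists = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*lists) if lists else set()

# boolean_and(index, ["steel", "winter"]) returns every document containing
# both words, regardless of meaning -- hence the matches on the previous slide.
```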

Intersection Algorithms (as in DB)

Method 1: direct list merge
  linear work in the summed size of the lists
Method 2: examine candidates
  start with the shortest term list
  for each list entry, check for the other search terms
  linear in the smallest list
  good if at least one term is rare, but:
    requires a forward index (list of terms in each document)
    no gain if all search terms are common (“flying fish”)
(both methods sketched below)
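Both methods sketched in Python, under the assumption that posting lists are sorted lists of document IDs and a forward index maps each document to its term set (illustrative names):

```python
def merge_intersect(a: list[int], b: list[int]) -> list[int]:
    """Method 1: linear merge of two sorted posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def candidate_intersect(postings: dict[str, list[int]],
                        forward: dict[int, set[str]],
                        terms: list[str]) -> list[int]:
    """Method 2: scan the shortest list, check the other terms in the forward index."""
    rarest = min(terms, key=lambda t: len(postings.get(t, [])))
    others = [t for t in terms if t != rarest]
    return [d for d in postings.get(rarest, [])
            if all(t in forward.get(d, set()) for t in others)]
```

Method 2 does work linear in the shortest list, which pays off when at least one term is rare; with only common terms (“flying fish”) neither method saves much.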

Problems with Boolean Approach

Synonymy
  several words for the same thing
  if the author used a different one, the query won’t match
Polysemy
  one word can mean many things (“bank”)
  the query matches the wrong meanings
Harsh cutoffs (1 wrong keyword kills)
  user can’t type a descriptive paragraph...
Terms have uniform influence
  repeated occurrence counts the same as a single occurrence
  common terms treated the same as rare ones

Fixing Problems

Synonymy
  a thesaurus can add equivalent terms to the query
  increases recall, but lowers precision
  expensive to construct (requires semantics, so manual)
Polysemy
  use more query terms to disambiguate
  user might not know more terms
  increases precision, but lowers recall
Harsh cutoffs
  quorum system (maximize # of matching terms)
Uniformity?

Vector Space Model

A document is a vector with one coordinate per term
  0-1 for presence/absence of the term (quorum)
  real-valued to represent term “importance”:
    term frequency in the document increases the value
    term frequency in the corpus decreases the value
Dot product with the query vector measures similarity
Best known implementation: inverted index (sketch below)
  for each query term, list the documents containing it
  accumulate dot products
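A minimal sketch of vector-space scoring over an inverted index; the tf*idf weighting shown follows the slide’s rule (in-document frequency raises the weight, corpus frequency lowers it), but the exact formula is an assumption:

```python
import math
from collections import defaultdict

def vector_score(query_terms: list[str],
                 postings: dict[str, dict[str, int]],   # term -> {doc_id: term frequency}
                 n_docs: int) -> dict[str, float]:
    """Accumulate query-document dot products one posting list at a time."""
    scores = defaultdict(float)
    for term in query_terms:
        docs = postings.get(term, {})
        if not docs:
            continue
        idf = math.log(n_docs / len(docs))     # common terms count less
        for doc_id, tf in docs.items():
            scores[doc_id] += tf * idf         # repeated, rare terms count more
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```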

Vector Space Advantages

Smoother than Boolean search
  provides a ranking rather than a sharp cut-off
Tends to allow/encourage queries with many nonzero terms
  easy to “expand the query” with synonyms
  hopefully polysemes will “interfere constructively”
  may even add relevant documents to the query (100s or 1000s of terms)

P2P IR

Simulating big iron

Web Search Info From Google

Web queries:
  almost all queries are 2 terms only
  “Boolean vector space” model (tiny recall OK)
  Zipf distribution, so caching queries helps some
Corpus:
  3B pages, 10KB average size, 30TB total
  inverted index: roughly the same size
  fits in a “moderate” P2P system of 30K nodes (arithmetic below)
  but it must be partitioned. How?
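The back-of-the-envelope arithmetic behind “fits in a moderate P2P system”, using only the figures on this slide:

```python
pages, page_kb, nodes = 3e9, 10, 30_000
corpus_tb = pages * page_kb / 1e9      # 3B pages * 10KB ≈ 30 TB
index_tb = corpus_tb                   # inverted index roughly the same size
per_node_gb = index_tb * 1e3 / nodes   # ≈ 1 GB of index per node
print(corpus_tb, per_node_gb)          # 30.0 1.0
```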

Obvious: Partition Documents

Each node builds a full inverted index for its subset
Query quite tractable per node
Merge the results sent back from each node (sketch below)
Used by Google (in its data center) and Gnutella
Drawback: the query is broadcast to all nodes
  OK for the Google data center; bad for P2P
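A minimal sketch of document-partitioned search; `node.local_search` and the scored-hit objects are hypothetical placeholders for whatever each node’s local IR engine returns:

```python
def broadcast_search(nodes: list, query: list[str], k: int = 10) -> list:
    """Broadcast the query, then merge the per-node result lists at the issuer."""
    results = []
    for node in nodes:                         # every node sees every query
        results.extend(node.local_search(query))
    results.sort(key=lambda hit: hit.score, reverse=True)
    return results[:k]
```

The broadcast is exactly the drawback: fine inside one data center, expensive across a wide-area P2P overlay.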

Alternative: Partition Terms

One node owns a few terms of the inverted index
  each (term, posting-list) pair is stored in a distributed hash table, keyed by the term
Talk only to the nodes that own the query terms (sketch below)
  they return the desired inverted-index lists
  results are intersected at the query issuer
Drawback: huge inverted-index lists get transferred
  alternative: send the first term’s list to the second term’s node
  ships 1 (perhaps small) list instead of 2
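A minimal sketch of term-partitioned search, assuming a DHT object whose hypothetical `lookup(term)` returns the posting list stored at the node owning that term:

```python
def dht_search(dht, terms: list[str]) -> set[str]:
    """Fetch each query term's posting list from its owner, intersect locally."""
    lists = sorted((set(dht.lookup(t)) for t in terms), key=len)
    if not lists:
        return set()
    result = lists[0]
    for postings in lists[1:]:
        result &= postings                     # intersection at the query issuer
    return result
```

Every lookup may ship a huge posting list back; the slide’s alternative forwards the first (smaller) list to the second term’s node and intersects there instead.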

Avoiding Communication (Om Gnawali et al. @ MIT)

Build an inverted index on term pairs
  pre-answering all queries
Partition the pairs among nodes
  a search contacts one node
Problem: pre-computation cost
  a size-n document generates n² pairs
  each pair must be communicated
  each pair must be stored

Good Cases

Music search
  “document” is song title + author
  n is small, so the n² factor is unimportant
Document windows (sketch below)
  usually, good docs have the query terms “nearby”
  scan a window of length 5, take pairs within the window
  10 pairs/window, so 10n pairs per document
  so linear in corpus size, as before
Bundle pairs to ship over a sparse overlay
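A minimal sketch of the windowed pair generation (window length 5 as on the slide; the function name is illustrative):

```python
from itertools import combinations

def window_pairs(terms: list[str], w: int = 5) -> set[tuple[str, str]]:
    """Pairs of terms that co-occur within a sliding window of length w."""
    pairs = set()
    for i in range(max(1, len(terms) - w + 1)):
        for a, b in combinations(terms[i:i + w], 2):   # C(5,2) = 10 pairs per window
            pairs.add(tuple(sorted((a, b))))
    return pairs
```

Each pair becomes one key in the distributed pair index, and the total number of pairs grows roughly linearly (about 10n) rather than quadratically in document length.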

What About Vector Space?

Weighting terms is easy
But a search cannot be limited to the pair list
  however, we need only the highest-scored documents on individual terms
  so, pre-compute and store a small “winner list” per term (sketch below)
Vector space encourages many-term queries
  find pairs with a small intersection
  index triples, quadruples, etc.
  apply branch-and-bound techniques
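A minimal sketch of a per-term “winner list”: keep only the k highest-scored documents for each term (k is an illustrative cutoff, not from the slide):

```python
import heapq

def winner_list(term_scores: dict[str, float], k: int = 100) -> list[tuple[str, float]]:
    """term_scores: {doc_id: score of this term in that doc}; keep the k best."""
    return heapq.nlargest(k, term_scores.items(), key=lambda kv: kv[1])
```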

Google Pushback

No need for P2P
  more precisely: “keep the peers in our data center”
  exploit high local communication bandwidth
  economics support a large server farm
    more load? buy more servers
Main bottleneck: content-provider bandwidth
  limits the rate of crawl
  the Google index is often weeks out of date
  a distributed crawler won’t help

Google Pushback Pushback

P2P might help
  let each node build its own index
  ship the changes to Google
Potential applications:
  real-time index
  new-relevant-content notification
Problem: spam
  content providers will lie about index changes
  use the P2P system to spot-check?

Person-to-Person IR

New modalities

P2P: Systems Perspective

A distributed system has more resources:
  computation/storage
  reliability
We can exploit them, if we successfully hide:
  latency
  bandwidth
Goal: simulate reliable big iron
  solve traditional problems that need resources:
  file storage, factoring, database queries, IR

P2P: Social Perspective

Applications based on person-to-person interactions:
  messaging
  linking/community building (the web)
  reputation management (Mojo Nation)
  file-sharing collaborations (just now)
Need not run on top of a P2P network

The “Pathetic Fallacy” of P2P

The assumption that the network layer should mirror the social layer
  e.g. “peers should be nodes with similar interests”
Many such applications work fine on one (big, reliable) machine
  placement on a P2P system is “coincidental”
  on the other side of the “one big machine” abstraction
Breaching the abstraction has bad consequences
  peering with “friends” is unlikely to optimize efficiency or reliability

P2P Opportunity: Leverage Involvement of People

Each individual manipulates information
  in much more powerful, semantic ways than machines can achieve
Record that manipulation
Exploit it to help others do better retrieval

Link-based Retrieval

Simultaneous work:
  Kleinberg at IBM
  Brin/Page at Stanford/Google
People find “good” web pages and link to them
  so, a page with large in-degree is good
  refine: the target of many good nodes is good
Mathematically, a random-walk model (sketch below)
  page rank = stationary probability of the random walk
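A minimal sketch of page rank as the stationary distribution of a damped random walk; the damping factor 0.85 and the fixed iteration count are conventional illustrative choices, not from the slide:

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50) -> dict[str, float]:
    """links: page -> pages it links to; every page must appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}          # random-jump component
        for p, outs in links.items():
            targets = outs or pages                    # dangling page: jump anywhere
            share = d * rank[p] / len(targets)
            for q in targets:
                new[q] += share                        # walk follows an out-link
        rank = new
    return rank

# Example: pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```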

Applications

Search
  raise the relevance of high page-rank pages
  if lazy, limit the corpus to high page-rank pages
  anchor text is a better description than the page contents
Crawl
  page rank can be computed before the page is seen
  prioritize high page-rank pages for crawling
People add usable info no system could find on its own

P2P: Systems/Social Interactions

A distributed system has novel properties
Exploit them to enable novel capabilities
E.g., anonymity
  relies on the partition of control/knowledge
E.g., privacy
  allow limited access to my private information
  gain a (false, but important) sense of safety by keeping it on my machine

Expertise Networks

Haystack (Karger et al.), Shock (Adar et al.)
Route questions to the appropriate expert
  use text to describe knowledge
  based on human entry, or indexing of a human’s personal files
  experts might be unwilling to admit knowledge
A P2P framework can protect anonymity
  Shock achieves this by Gnutella-style query broadcast
  more efficient approach?

Other New Aspects

Personal information sharing
  unwilling to “publish” mail and documents to the world
  but might allow search and access in some cases
  keeping data and index on one’s own machine gives a (false) sense of security, privacy
Anonymity
  P2P provides strong anonymity primitives
  can be exploited, e.g., for “recommending” embarrassing content

Sample Application

Social: the “Secret Web”
  maintain links for use by the page-rank algorithm
  but the links are secret from most others
  need a random walk through the link path
  implement via recursive lookup
  censor-proof? spam-proof?

Semantics vs. Syntax

Clearly, using word meanings would help
Some systems try to implement semantics
  but this is a core AI problem, unsolved
  current attempts don’t scale to large corpora
  all current large systems are syntactic only
Idea: use the computational power of P2P
Idea: use humans to attach semantics

Conclusion: Two Approaches to P2P

Hide P2P (Partition to Partition)
  goal: the illusion of a single server
  we know how to do the task on a single server
  devise tools to achieve the same in a distributed system
  focus on surmounting drawbacks: systems
Exploit P2P (Person to Person)
  determine the new opportunities afforded by P2P
  perhaps impossible on a single server
  focus on new applications: AI? HCI?
