Algorithms for Information Retrieval Prologue

Algorithms for Information Retrieval

Prologue

References

Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.

A bunch of scientific papers available on the course site!!

Mining the Web: Discovering Knowledge from..., S. Chakrabarti, Morgan Kaufmann Publishers, 2003.

Much interest...

More than 85% of users arrive at a site from a search engine (SE)

Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...
Toolbar searches: 49.6% Google, 46.1% Yahoo,...

SE have an impact on: Web structure, knowledge and understanding, social behavior,...

...and on the market:
33% of users believe that “the results of a query are the best place to buy things”!!
Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)
Paid search: 65% Google, 25% Yahoo, 8% MSN,...
Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...

Goal of a Search Engine

Retrieve docs that are “relevant” for the user query

Doc: Word or PDF file, web page, email, blog, e-book,...

Query: “bag of words” paradigm

Relevant ?!?

...We face many difficulties, especially on the Web!!!
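
A minimal sketch of the “bag of words” paradigm for queries and docs. It assumes lowercasing and whitespace tokenization, which are illustrative choices and not taken from the slides; `shared_terms` is a hypothetical helper name. Word order is discarded and only term counts are kept:

```python
from collections import Counter

def bag_of_words(text):
    """Map each term to its count, ignoring word order.
    Assumption of this sketch: lowercase + whitespace tokenization."""
    return Counter(text.lower().split())

def shared_terms(query, doc):
    """Hypothetical helper: number of term occurrences common to query and doc."""
    q, d = bag_of_words(query), bag_of_words(doc)
    return sum((q & d).values())  # '&' keeps the minimum count of each term

print(shared_terms("nikon coolpix", "Nikon CoolPix price list"))
```

Counting shared terms is only a crude proxy: it says nothing about whether the doc is actually “relevant” to the user, which is exactly the difficulty raised on this slide.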

Web is huge and heterogeneous

Languages/Encodings
Hundreds of languages: 55 (Jul01)
Home pages:
In 1997: English 82%, the next 15 languages take 13%
In 2001: English 53%, the next 9 languages take 30%

Distributed authorship
Millions of people creating pages with their own style…
Not all have the purest motives in providing high-quality information: commercial motives drive “spamming”.

Extracting “significant data” is difficult !!

Web is highly dynamic [154 sites, 2004]

A “good” coverage of the indexed Web is difficult !!

[Figure: plot normalized w.r.t. the first week]

Web structure

User Queries

Query composition:
Short (2001: 2.54 terms on average; 80% have fewer than 3 terms)
Imprecise terms
78% of the queries are not modified

Query results:
85% of the users look at just one result page

User Needs

Informational – want to learn about something (~40%)
Example: Asthma

Navigational – want to go to a page (~25%)
Example: Alitalia

Transactional – want to do something (~35%)
Access a service (example: NY weather)
Downloads (example: Mars surface images)
Shop (example: Nikon CoolPix)

Evolution of Search Engines

First generation -- use only on-page, web-text data [1995-1997: AltaVista, Excite, Lycos, etc.]
Word frequency and language

Second generation -- use off-page, web-graph data [1998: Google, now everyone]
Link (or connectivity) analysis
Anchor-text (how people refer to a page)

Third generation -- answer “the need behind the query” [No winner yet !!]
Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data
Query mining

Fourth generation -- Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]
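
To make the step from on-page to off-page data concrete, here is a minimal sketch of the simplest form of link (connectivity) analysis: counting in-links over a toy web graph. The page names and the choice to ignore self-links are assumptions of this example; real engines use far more elaborate measures, and the slides do not specify one.

```python
from collections import defaultdict

def inlink_counts(links):
    """links: iterable of (source_page, target_page) pairs.
    Returns a dict mapping each linked-to page to its number of in-links."""
    counts = defaultdict(int)
    for src, dst in links:
        if src != dst:  # assumption of this sketch: self-links do not count
            counts[dst] += 1
    return dict(counts)

# Toy graph (hypothetical pages): a and b point to c, c points back to a.
toy_links = [("a", "c"), ("b", "c"), ("c", "a"), ("c", "c")]
ranking = sorted(inlink_counts(toy_links).items(), key=lambda kv: -kv[1])
print(ranking)
```

The point of the sketch is that the score of a page depends on what other pages do, not on the words the page itself contains.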

What is a search engine, nowadays?

Size of search engines [2005]

Google vs Yahoo: 20-30% sharing of results

Ranking: Google vs Yahoo!

Ranking: Google vs Google.cn

Not only Web Searches...

Clustering engines: Vivisimo, Snaket,...

Suggestions

Products

Local searches

News, Blogs,...

Directories

Deep web: Invisible-web.net, Completeplanet, Resource Discovery Network

“Vertical” search engines

About this course

This course is a mix of:
Smart algorithms & data structures
Data compression
IR tools: Data Projection, Clustering,...

Massive Data

The Nature 2/06 issue highlights trends in the sciences: “2020 – Future of computing”

Exponential growth of scientific data
Due to, e.g., large experiments, sensor networks, etc.
Nano-tech provides further opportunities

Paradigm shift: science will be about mining data

Computer science is paramount in all sciences

March 2006

Algorithm Inadequacy

Importance of scalability/efficiency → Algorithmics is a core computer science area

Traditional algorithmics: transform input to output using a simple machine model

Communities addressing inadequacies have emerged

You should be space/IO-aware programmers

I/O-conscious Algorithms

Disk access is 10^6 times slower than main memory access

Store/access data taking advantage of blocks

I/O-efficient algorithms:
Move as few disk blocks as possible to solve the given problem
Access nearby blocks to reduce the seek time

[Figure: disk with its magnetic surface, a track, the read/write arm, and the read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
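
A minimal sketch of why block-aware access matters. It assumes a toy model, not taken from the slides, in which the buffer holds a single block and every change of block costs one load; `count_block_loads` is a hypothetical helper name. The same 1000 items are read in two different orders:

```python
def count_block_loads(access_order, block_size):
    """Count block loads for a sequence of item indices,
    assuming a buffer that holds exactly one block at a time."""
    loads = 0
    current_block = None
    for i in access_order:
        block = i // block_size
        if block != current_block:  # item not in the buffered block: load it
            loads += 1
            current_block = block
    return loads

n, b = 1000, 100  # illustrative sizes: 1000 items, 100 items per block
sequential = list(range(n))
# Same items, but every access jumps to a different block.
strided = [j * b + k for k in range(b) for j in range(n // b)]

print(count_block_loads(sequential, b), count_block_loads(strided, b))
```

In this model the sequential scan touches each block once (n/b loads), while the strided order pays one load per item. With a 10^6 gap between disk and memory, that difference dominates the running time.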

Streaming Algorithms

Data arrive continuously, or we want to make only a FEW scans

Streaming algorithms:
Use few scans
Handle each element fast
Use small space

[Figure: disk with its magnetic surface, a track, the read/write arm, and the read/write head]
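
As one illustration of these three requirements, here is a sketch of the Misra-Gries frequent-items summary. It is a well-known streaming technique chosen for this example; the slides do not name a specific algorithm. It makes a single scan, keeps at most k-1 counters, and any item occurring more than n/k times in a stream of n elements is guaranteed to be among the surviving candidates:

```python
def misra_gries(stream, k):
    """Single pass over the stream with at most k-1 counters.
    Every item with frequency > n/k survives as a key of the result
    (the stored counts are lower bounds, not exact frequencies)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

candidates = misra_gries([1, 1, 2, 3, 1, 2, 1, 4, 1], k=3)
print(candidates)
```

The decrement step touches up to k-1 counters, but each decrement is paid for by an earlier increment, so the cost per element is constant on average. A second scan would be needed to turn the candidates into exact counts.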

Cache-Oblivious Algorithms

Unknown and/or changing devices

Block access is important on all levels of the memory hierarchy
But memory hierarchies are very diverse

Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
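
A classic example of this idea is recursive matrix transposition, sketched below under the assumption of a matrix stored as a list of rows. The code never mentions a block or cache size: it keeps halving the larger dimension, so at some recursion depth the sub-matrices fit in whatever block size each memory level happens to have. Python lists do not expose memory layout, so this only illustrates the recursion structure, not the actual cache behavior.

```python
def co_transpose(a, out, r0, r1, c0, c1):
    """Write the transpose of rows r0..r1-1, columns c0..c1-1 of `a` into `out`,
    recursively splitting the larger dimension in half (no block-size parameter)."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        out[c0][r0] = a[r0][c0]
    elif rows >= cols:
        mid = r0 + rows // 2
        co_transpose(a, out, r0, mid, c0, c1)
        co_transpose(a, out, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        co_transpose(a, out, r0, r1, c0, mid)
        co_transpose(a, out, r0, r1, mid, c1)

def transpose(a):
    """Convenience wrapper (hypothetical name): returns a new transposed matrix."""
    n = len(a)
    m = len(a[0]) if n > 0 else 0
    out = [[None] * n for _ in range(m)]
    co_transpose(a, out, 0, n, 0, m)
    return out

print(transpose([[1, 2, 3], [4, 5, 6]]))
```

A practical implementation would stop the recursion at a small constant-size base case to limit call overhead; that cutoff is a tuning constant, not a model parameter.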