Algorithms for Information Retrieval
Prologue
References
Managing Gigabytes. A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.
A set of scientific papers, available on the course site!!
Mining the Web: Discovering Knowledge from... S. Chakrabarti, Morgan Kaufmann Publishers, 2003.
More than 85% of users arrive at a site from a search engine (SE)
Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...
Toolbar searches: 49.6% Google, 46.1% Yahoo,...
SEs have an impact on: Web structure, knowledge and understanding, social behavior,...
...and on the market:
33% of users believe that “the results of a query are the best place to buy things”!!
Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)
Paid search: 65% Google, 25% Yahoo, 8% MSN,...
Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...
Much interest...
Retrieve docs that are “relevant” to the user query
Doc: Word or PDF file, web page, email, blog, e-book,...
Query: “bag of words” paradigm
Relevant ?!?
...We face many difficulties, especially on the Web!!!
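The “bag of words” paradigm above can be illustrated with a minimal sketch: a query and a document are each reduced to a multiset of terms, ignoring word order. The tokenizer, the `overlap_score` function and the three toy documents are assumptions made for illustration only; they are not taken from the course material, and counting shared terms is only the crudest possible notion of “relevant”.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on non-alphanumerics, and count each term."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, doc):
    """Number of distinct query terms that also occur in the document."""
    query_bag, doc_bag = bag_of_words(query), bag_of_words(doc)
    return sum(1 for term in query_bag if term in doc_bag)

# Hypothetical toy collection (not from the slides).
docs = {
    "d1": "Asthma symptoms and treatment",
    "d2": "Cheap flights to Rome",
    "d3": "Treatment of asthma in children",
}

# Rank documents by how many query terms they contain.
ranked = sorted(
    docs,
    key=lambda name: overlap_score("asthma treatment", docs[name]),
    reverse=True,
)
```

Word order is lost in this representation: "asthma treatment" and "treatment asthma" produce the same bag, which is one reason relevance is hard to pin down from the query alone.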
Goal of a Search Engine
Languages/Encodings
Hundreds of languages: 55 (Jul01)
Home pages:
In 1997: English 82%, the next 15 take 13%
In 2001: English 53%, the next 9 take 30%
Distributed authorship
Millions of people creating pages with their own style…
Not all have the purest motives in providing high-quality information: commercial motives drive “spamming”.
Web is huge and heterogeneous
Extracting “significant data” is difficult !!
Web is highly dynamic [154 sites, 2004]
A “good” coverage of the indexed Web is difficult !!
[Plot: values normalized w.r.t. the first week]
Web structure
User Queries
Query composition: Short
2001: 2.54 terms avg
80% have fewer than 3 terms
Imprecise terms
78% of the queries are not modified
Query results: 85% of the users look at just one result page
User Needs
Informational – want to learn about something (~40%), e.g. “Asthma”
Navigational – want to go to a page (~25%), e.g. “Alitalia”
Transactional – want to do something (~35%)
Access a service, e.g. “NY weather”
Downloads, e.g. “Mars surface images”
Shop, e.g. “Nikon CoolPix”
Evolution of Search Engines
First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
Word frequency and language
Second generation (1998: Google, now everyone) -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor-text (how people refer to a page)
Third generation (no winner yet!!) -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data sources
Click-through data
Query mining
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
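To make the “link (or connectivity) analysis” of second-generation engines more concrete, here is a minimal sketch of a PageRank-style power iteration over a tiny web graph. The slides do not name a specific algorithm at this point, so PageRank is swapped in as one well-known example; the damping factor, the iteration count and the three-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Score pages by link structure with a simple power iteration.

    links maps each page to the list of pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline score (random jump).
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # A page splits its damped score evenly among its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its score over all pages.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

In this toy graph page c ends up with the highest score because two pages point to it, while b, which nobody links to, keeps only the baseline score. None of this depends on the text of the pages, which is the point of off-page data.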
What is a search engine, nowadays?
Size of search engines [2005]
Google vs Yahoo: 20-30% sharing of results
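A figure such as “20-30% sharing of results” presupposes some way of comparing two result lists. The slides do not say how the sharing was measured; the sketch below assumes, purely for illustration, the Jaccard overlap of the top-k URLs returned by two engines. The function name and the sample URLs are hypothetical.

```python
def result_overlap(results_a, results_b, k=10):
    """Jaccard overlap of the top-k results of two engines (0.0 to 1.0)."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)

# Hypothetical result lists for the same query on two engines.
engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u4", "u5", "u6", "u7", "u8"]

# 2 shared URLs out of 8 distinct ones.
overlap = result_overlap(engine_a, engine_b, k=5)
```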
Ranking: Google vs Yahoo!
Ranking: Google vs Google.cn
Clustering engines: Vivisimo, SnakeT,...
Suggestions
Products
Local searches News, Blogs, ....
Not only Web Searches...
Directories
Deep web: Invisible-web.net, Completeplanet, Resource Discovery Network
“Vertical” search engines
About this course
This course is a mix of:
Smart algorithms & data structures
Data compression
IR tools: Data Projection, Clustering,...
Massive Data
Nature 2/06 issue highlights trends in the sciences: “2020 – Future of computing”
Exponential growth of scientific data
Due to e.g. large experiments, sensor networks, etc.
Nano-tech provides further opportunities
Paradigm shift: Science will be about mining data
Computer science paramount in all sciences
March 2006
Algorithm Inadequacy
Importance of scalability/efficiency
→ Algorithmics is a core computer science area
Traditional algorithmics: transform input to output using a simple machine model
Communities addressing these inadequacies have emerged
You should be space/IO-aware programmers
I/O-conscious Algorithms
Disk access is 10^6 times slower than main memory access
Store/access data taking advantage of blocks
I/O-efficient algorithms:
Move as few disk blocks as possible to solve the given problem
Access nearby blocks to reduce the seek time
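The effect of “taking advantage of blocks” can be sketched by counting block transfers for two access orders over the same data. The model below is an assumption made for illustration: items are laid out consecutively on disk, each block holds `BLOCK` items, and only one block fits in memory at a time. The sizes are hypothetical.

```python
def block_transfers(positions, block_size):
    """Count block loads when memory holds a single block at a time."""
    transfers = 0
    current_block = None
    for pos in positions:
        block = pos // block_size
        if block != current_block:
            # The item lives in a block that is not in memory: load it.
            transfers += 1
            current_block = block
    return transfers

N_ITEMS, BLOCK = 10_000, 100      # hypothetical sizes
ROWS = N_ITEMS // BLOCK

# Sequential scan: positions 0, 1, 2, ... stay inside a block as long as possible.
sequential_ios = block_transfers(range(N_ITEMS), BLOCK)

# Strided scan: jump by one full block at every step (column-by-column order).
strided = [row * BLOCK + col for col in range(BLOCK) for row in range(ROWS)]
strided_ios = block_transfers(strided, BLOCK)
```

Both scans touch every item exactly once, yet the sequential one needs N_ITEMS / BLOCK transfers while the strided one needs one transfer per item. With disk accesses that are orders of magnitude slower than memory accesses, this factor dominates the running time.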
[Figure: disk drive showing a track, the magnetic surface, the read/write arm and the read/write head]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Streaming Algorithms
Data arrive continuously, or we want to make FEW scans
Streaming algorithms:
Use few scans
Handle each element fast
Use small space
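As a minimal sketch of these three requirements, the Misra-Gries frequent-items algorithm makes one scan, does a small amount of work per element, and keeps at most k - 1 counters regardless of the stream length. It is chosen here as an illustrative example and is not named in the slides; the sample stream is hypothetical.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Every item occurring more than len(stream) / k times is guaranteed to be
    among the returned keys; the counts themselves are only lower bounds.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical query-log stream; "google" appears in more than half the entries.
stream = ["google", "yahoo", "google", "msn", "google", "ask",
          "google", "yahoo", "google", "google", "yahoo"]
candidates = misra_gries(stream, k=2)
```

With k = 2 a single counter is kept, so the summary can only confirm a strict majority item. A second scan over the data would be needed to obtain exact counts for the surviving candidates.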
Cache-Oblivious Algorithms
Unknown and/or changing devices
Block access is important on all levels of the memory hierarchy
But memory hierarchies are very diverse
Cache-oblivious algorithms:
Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels
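A classic illustration of this idea is recursive matrix transposition: the matrix is halved along its longer side until the pieces are tiny, so at some recursion depth the sub-problems fit in every cache level without the code ever mentioning a block or cache size. The sketch below is an assumption-laden illustration, not material from the slides. The cutoff of 4 only limits Python recursion overhead and is not tuned to any cache, and Python lists of lists are not contiguous in memory, so the code shows the access pattern rather than a real speed-up.

```python
def co_transpose(src, dst, r0, r1, c0, c1):
    """Write the transpose of src[r0:r1][c0:c1] into dst by recursive halving."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 4 and cols <= 4:
        # Tiny sub-matrix: transpose it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half.
        mid = r0 + rows // 2
        co_transpose(src, dst, r0, mid, c0, c1)
        co_transpose(src, dst, mid, r1, c0, c1)
    else:
        # Split the longer dimension (columns) in half.
        mid = c0 + cols // 2
        co_transpose(src, dst, r0, r1, c0, mid)
        co_transpose(src, dst, r0, r1, mid, c1)

# Hypothetical 7 x 5 input matrix and a 5 x 7 output buffer.
n_rows, n_cols = 7, 5
src = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
dst = [[0] * n_rows for _ in range(n_cols)]
co_transpose(src, dst, 0, n_rows, 0, n_cols)
```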