
http://comet.lehman.cuny.edu/jung/presentation/presentation.html

Introduction to Modern Information Retrieval and Search Engines, and Some Research Issues

Professor Gwang Jung

Department of Mathematics and Computer Science, Lehman College, CUNY

November 10, Fall 05


Outline

- Introduction to Information Retrieval
- Introduction to Search Engines (IR Systems for the Web)
- Search Engine Example: Google
- Brief Introduction to Semantic Web
- Useful Tools for IR System Building and Resources for Advanced Research
- Research Issues


Introduction to Information Retrieval


Information Age


IR in General

Information Retrieval in general deals with the retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).

A user query may be:
- Structured (e.g., a Boolean expression of keywords or terms)
- Unstructured (e.g., terms, a sentence, a document)

In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
- Efficiency with respect to: algorithms, query processing, data organization/structure
- Effectiveness with respect to: retrieval results


IR Systems


Formal Definition of IR System

IRS = (T, D, Q, F, R)
- T: set of index terms (terms)
- D: set of documents in a document database
- Q: set of user queries
- F: D x Q -> R (retrieval function)
- R: real numbers (RSV: Retrieval Status Value)

Relevance Judgment is given by users.
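As a minimal sketch of this definition (the names and the toy scoring rule below are illustrative, not from the slides), the retrieval function F assigns a real-valued RSV to each document-query pair:

```python
# Minimal sketch of the formal model IRS = (T, D, Q, F, R).
# All names and the scoring rule are illustrative, not from the original slides.

def retrieval_function(document_terms: set[str], query_terms: set[str]) -> float:
    """F: D x Q -> R. Here the RSV is simply the number of shared index terms."""
    return float(len(document_terms & query_terms))

# Rank a toy document database against a query by descending RSV.
D = {
    "d1": {"information", "retrieval", "index"},
    "d2": {"database", "transaction"},
}
q = {"information", "retrieval"}
ranking = sorted(D, key=lambda d: retrieval_function(D[d], q), reverse=True)
print(ranking)  # ['d1', 'd2']
```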


IRS versus DBMS

- Objects: DBMS = simple; IRS = complex (texts)
- Information request: DBMS = objective structure, specific; IRS = subjective structure, general and ambiguous
- Query language: DBMS = predicate calculus; IRS = how?
- Knowledge representation: DBMS = transparent; IRS = complex process
- Retrieval process: DBMS = deterministic; IRS = statistical (similarity)
- Major success criterion: DBMS = correctness; IRS = utility
- Query example: DBMS = "Select student name where student's GPA > 3.0"; IRS = "Select items where item's concept deals with process synchronization"


IR Systems Focus on Retrieval Effectiveness

The effective retrieval of relevant information depends on:
- User task: formulating an effective query for the information need
- Indexing: IR systems in general adopt index terms to represent documents and queries; indexing is the process of developing document representations by assigning index terms to documents (information items)
- Retrieval model (often called IR model) and the logical view of documents: the logical view (logical representation) of documents depends on the IR model


Indexing

The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).

- Descriptors = index terms = terms. Descriptors also lead users to participate in formulating information requests.
- Two types of index terms:
  - Objective: author name, publisher, date of publication
  - Subjective: keywords selected from the full text
- Two types of indexing methods:
  - Manual: performed by human experts (for very effective IR systems); may use an ontology
  - Automatic: performed by computer hardware and software


Indexing Aims (1)

- Recall: the proportion of relevant items (documents) retrieved.
  R = # of relevant items retrieved / total # of relevant items in the database
- Precision: the proportion of retrieved documents that are relevant.
  P = # of relevant items retrieved / total # of items retrieved
  (A short computation sketch follows this slide.)
- Effectiveness of indexing is mainly controlled by:
  - Term specificity: broader terms may retrieve both useful (relevant) and useless (non-relevant) items for the user; narrower (more specific) index terms favor precision at the expense of recall.
  - Index language (set of well-selected index terms), T = { index term t }:
    - Pre-specified (controlled): easy maintenance; poor adaptability
    - Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries
  - Synonymous terms can be added to T via a thesaurus, an e-dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
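A small illustration of the recall and precision measures defined above; the document identifiers are made up for the example:

```python
# Illustrative computation of recall and precision over sets of document IDs.
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """R = relevant items retrieved / total relevant items in the database."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved: set[str], relevant: set[str]) -> float:
    """P = relevant items retrieved / total items retrieved."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d1", "d3", "d7"}          # what the user judges relevant
print(recall(retrieved, relevant))     # 2/3 ~= 0.67
print(precision(retrieved, relevant))  # 2/4 = 0.5
```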


Indexing Aims (2)

Recall and precision values range from 0 to 1. Average users want both high recall and high precision; in practice, a compromise must be reached (a middle point).

[Figure: recall (R) versus precision (P) trade-off; both axes range from 0 to 1.0]


Steps for Indexing

- Objective attributes of a document are extracted (e.g., title, author, URL, structure).
- Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
- Case folding might be performed.
- Stemming might be used.
- Frequencies of non-function words are used to specify term importance. Term-frequency weighting fulfils only one of the indexing aims, i.e., recall.
- Terms that occur rarely in the document database can be used to distinguish the documents in which they occur from those in which they do not, which can improve precision.
- Document frequency: the number of documents in the collection in which a term tj ∈ T occurs. (A tf*idf weighting sketch follows this slide.)
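Since term frequency favours recall and document frequency (through its inverse) favours precision, the two are commonly combined in the tf*idf weighting scheme mentioned later in these slides. A minimal sketch with made-up documents:

```python
import math
from collections import Counter

# Hedged sketch of tf*idf weighting: term frequency times the log of the
# inverse document frequency. The documents below are made up.
docs = {
    "d1": "process synchronization in operating system design".split(),
    "d2": "database system design".split(),
    "d3": "computer science curriculum".split(),
}

def tf_idf(term: str, doc_id: str) -> float:
    tf = Counter(docs[doc_id])[term]                         # term frequency in this document
    df = sum(1 for words in docs.values() if term in words)  # document frequency in the collection
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(docs) / df)                           # rarer terms get a higher weight
    return tf * idf

print(tf_idf("system", "d1"))    # occurs in 2 of 3 documents: modest weight
print(tf_idf("database", "d2"))  # occurs in 1 of 3 documents: higher idf
```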


Inverted Index File

[Figure: inverted index entries. Each index term (system, computer, database, science) maps to its document frequency (df) and a postings list of (Dj, tfj) pairs, e.g., (D2, 4), (D5, 2), (D1, 3), (D7, 4).]

Optionally, the entries may also record the positions of the term within each document (positional postings).
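A minimal sketch of how such inverted index entries can be built; the documents and term frequencies here are made up and do not reproduce the figure exactly:

```python
from collections import defaultdict

# Build an inverted index: term -> document frequency (df) plus a postings
# list of (doc_id, term frequency) pairs. Illustrative data only.
docs = {
    "D1": ["database", "system", "design"],
    "D2": ["science", "of", "computer", "science"],
    "D5": ["computer", "system"],
}

postings = defaultdict(dict)          # term -> {doc_id: term frequency}
for doc_id, terms in docs.items():
    for term in terms:
        postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

for term, entry in postings.items():
    df = len(entry)                   # number of documents containing the term
    print(term, "df =", df, "postings =", sorted(entry.items()))
```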


Retrieval Models (1)

- Set-theoretic IR models: documents are represented by a set of terms. Well-known set-theoretic models:
  - Boolean IR model: the retrieval function is based on Boolean operations (and, or, not); the query is formulated in Boolean logic.
  - Fuzzy-set IR model: the retrieval function is based on fuzzy-set operations; the query is formulated in Boolean logic.
  - Rough-set IR model: various set operations have been examined; ad-hoc Boolean queries.
- Probabilistic IR model: mainly used for probabilistic index-term weighting; provides a mathematical framework for the well-known tf*idf indexing scheme.
- Language-model-based retrieval: infers the query concept from a document as the retrieval process.


Retrieval Models (2)

- Vector space model:
  - Queries and documents are represented as weighted vectors.
  - Vectors in the basis are called term vectors and are assumed to be semantically independent.
  - A document (or query) is represented as a linear combination of vectors in the generating set.
  - The retrieval function is based on the dot product or cosine measure between document and query vectors (see the sketch after this slide).
- Extended Boolean IR model:
  - Combines characteristics of the vector space IR model with properties of Boolean algebra.
  - The retrieval function is based on Euclidean distances in an n-dimensional vector space; distances are measured using p-norms, where 1 ≤ p ≤ ∞.
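A minimal sketch of the cosine retrieval function for the vector space model; the term weights below are illustrative tf*idf-style values, not taken from the slides:

```python
import math

# Cosine of the angle between a weighted document vector and a query vector,
# both expressed as sparse {term: weight} dictionaries.
def cosine(doc: dict[str, float], query: dict[str, float]) -> float:
    dot = sum(w * query.get(t, 0.0) for t, w in doc.items())
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"information": 0.8, "retrieval": 0.6, "web": 0.2}  # illustrative weights
query = {"information": 1.0, "retrieval": 1.0}
print(round(cosine(doc, query), 3))
```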


The Retrieval Process


The Retrieval Process in an IR System


Introduction to Search Engines (IR Systems for the Web)


World Wide Web History

- 1965: Hypertext. Ted Nelson developed the idea of hypertext in 1965.
- Late 1960s: Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
- Early 1970s: ARPANET was developed.
- 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP).
- 1989: WWW
  - Developed by Tim Berners-Lee and others in 1990 at CERN to organize research documents available on the Internet.
  - Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  - Developed the initial HTTP network protocol, URLs, HTML, and the first web server.


Search Engine (Web-based IR System) History

- By the late 1980s, many files were available by anonymous FTP.
- In 1990, Alan Emtage of McGill University developed Archie (short for "archives"):
  - Assembled lists of files available on many FTP servers.
  - Allowed regular-expression search of these file names.
- In 1993, Veronica and Jughead were developed to search the names of text files available through Gopher servers.
- In 1993, early web robots (spiders) were built to collect URLs: Wanderer, ALIWEB (Archie-Like Index of the WEB), and the WWW Worm (which indexed URLs and titles for regex search).
- In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.


Search Engine History (cont’d)

- In early 1994, Brian Pinkerton developed WebCrawler as a class project at the University of Washington (it eventually became part of Excite and AOL).
- A few months later, Michael "Fuzzy" Mauldin, a professor at CMU, developed Lycos with his graduate students:
  - First to use a standard IR system as developed for the DARPA Tipster project.
  - First to index a large set of pages.
- In late 1995, DEC developed AltaVista:
  - Used a large farm of Alpha machines to quickly process large numbers of queries.
  - Supported Boolean operators, phrases, and "reverse pointer" queries.
- In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University; it made use of link analysis to rank documents.


How do Web SE Work?

Search engines for the general web search a database of the full text of web pages selected from billions of web pages; searching is based on inverted index entries.

Search engine databases:
- Full-text documents are collected by a software robot (also called a softbot or spider), which navigates the web to collect pages.
- The web can be viewed as a graph structure. Navigation can be based on DFS (depth-first search), BFS (breadth-first search), or some combined navigation heuristic. How to detect cycles is a research issue (a toy BFS crawl with duplicate detection is sketched below).
- The indexer then builds inverted index entries and stores them in inverted files. If necessary, the inverted files may be compressed.
- Some types of pages and links are excluded from the search engine and form the invisible Web (which may be many times bigger than the visible Web).
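A toy sketch of breadth-first crawling over a hypothetical link graph; the visited set is the simplest way to avoid revisiting pages when the web graph contains cycles (a real crawler also needs URL normalization, politeness, scheduling, and more):

```python
from collections import deque

# Hypothetical page -> outgoing links graph (a real crawler fetches these).
link_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],   # B links back to A: a cycle
    "C": ["D"],
    "D": [],
}

def bfs_crawl(seed: str) -> list[str]:
    visited, order = {seed}, []
    frontier = deque([seed])
    while frontier:
        page = frontier.popleft()
        order.append(page)                 # "fetch and index" the page here
        for link in link_graph.get(page, []):
            if link not in visited:        # cycle / duplicate detection
                visited.add(link)
                frontier.append(link)
    return order

print(bfs_crawl("A"))  # ['A', 'B', 'C', 'D']
```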


Breadth-First Crawling

[Figure: breadth-first crawling expands the crawl frontier level by level from the seed pages.]


Depth-First Crawling

[Figure: depth-first crawling follows a chain of links as deep as possible before backtracking.]


Web Search Engine System Architecture


[Figure: web search engine architecture. A robot crawls Internet websites into temporary storage; a parser, stopper/stemmer, and indexer build the inverted files (which can be based on different physical data structures); the user interface passes queries to the retrieval mechanism, which matches them against the logical document representation (based on IR models).]


Distributed Architecture (example): Harvest (http://harvest.sourceforge.net/)
- A distributed web search engine that distributes the load among different machines.
- The indexer does not run on the same machine as the broker or the web server.


What Makes a SE Good?

Database of web documents:
- Size of the database
- Freshness (recency, up-to-dateness)
- Types of documents offered
- Retrieval speed

The search engine's capabilities:
- Search options
- Effectiveness of the retrieval mechanism
- Support for concept-based search (semantic web): concept-based search systems try to determine what you mean, not just what you say. Concept-based search often works better in theory than in practice, and concept-based indexing is a difficult task to perform.

Presentation of the results:
- Keywords highlighted in context
- A summary shown for each matching web page


Search Engine Example (Google)


Google

The most popular web search engine:
- Crawls the web (with robots) and stores a local cache of the pages found
- Builds a lexicon of common words; for each word it creates an index list of the pages containing it
- Also uses human-compiled information from the Open Directory
- Cached links let you see older versions of recently changed pages
- Link analysis system: the PageRank heuristic

Estimated size of the index:
- 580 million pages visited and recorded
- Link data is used to reach another 500 million pages (via the link analysis system)
- A recent estimate is around 4 billion pages (??)

Index refresh: updated monthly/weekly, or daily for popular pages.

Serves queries from three data centres (service replication); service updates are synchronized; two centres are on the West Coast of the US, one on the East Coast.


Google Founders
- Larry Page, co-founder & President, Products
- Sergey Brin, co-founder & President, Technology
- Both were PhD students at Stanford; the company went public last year (2004).

[Figure: search engine market share, 2001 to 2004, for Google, Yahoo!, MSN, Lycos, AltaVista, and AOL; vertical axis 0% to 50%. Source: WebSideStory]


Google Architecture Overview


Google Indexer

[Figure: indexer data structures, including term frequencies.]


Google Lexicon


Google Searcher


Google Features

- Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed. Other services also use link popularity, but none to the extent that Google does.
  - Traditional IR (lightly used)
  - Link popularity (heavily used): citation-importance ranking (the quality of the links pointing at a page)
- Relevancy: similarity between the query and a page, number of links, link quality, link content, and ranking boosts on text styles.
- PageRank: usage simulation and citation-importance ranking; a user navigates randomly, and the process is modelled by a Markov chain.


Collecting Links in Google

- Submission (web promotion): an Add URL page (you may not need to do a "deep" submit). The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
- Crawling and index depth: Google aims to refresh its index on a monthly basis. Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks. This text is associated with the pages a link points at, and it makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.


Google Guidelines for Web-submission


Deep SubmitPro


Link Analysis for Relevancy (1)

- Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project (http://www.almaden.ibm.com/cs/k53/clever.html).
- Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
- Number of links: all things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
- Link quality: numbers aren't everything. A single link from an important site might be worth more than many links from relatively unknown sites. Page importance is weighted: links from important pages are weighted higher.


Link Analysis for Relevancy (2)

- Link content: the text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
- Ranking boosts on text styles: the appearance of terms in bold text, in header text, or in a large font size is taken into account. None of these are dominant factors, but they do figure into the overall equation.


PageRank

Usage simulation & Citation importance ranking: Based on a model of a Web surfer who follows links and makes

occasional haphazard jumps, arriving at certain places more frequently than others.

User randomly navigates Jumps to random page with probability p Follows a random hyperlink from the page with probability 1-p Does not go back to a previously visited page by following a

previously traversed link backwards Google finds a type of universally important page intuitively

locations that are heavily visited in a random traversal of the Web's link structure.


PageRank Heuristics

The process is modelled by the following heuristic: the probability of being at each page is computed, with p set by the system.
- wj = PageRank of page j
- ni = number of outgoing links on page i
- m = number of nodes in G (the number of web pages in the collection)

wj = p / m + (1 - p) * Σ (wi / ni), summed over all pages i that link to page j (i.e., (i, j) ∈ G)
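A minimal sketch of iterating this equation on a tiny, made-up link graph (no dangling pages, and a fixed number of iterations instead of a convergence test):

```python
# Iterative computation of the PageRank equation above on a toy graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph G
pages = list(links)
m = len(pages)                                      # number of pages in the collection
p = 0.15                                            # random-jump probability set by the system

w = {page: 1.0 / m for page in pages}               # initial ranks
for _ in range(50):                                 # iterate until roughly converged
    new_w = {}
    for j in pages:
        incoming = sum(w[i] / len(links[i]) for i in pages if j in links[i])
        new_w[j] = p / m + (1 - p) * incoming       # wj = p/m + (1 - p) * sum(wi / ni)
    w = new_w

print({page: round(rank, 3) for page, rank in w.items()})
```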


PageRank Illustration

[Figure: page j receives rank contributions from the pages that link to it: wj = p/m + (1 - p)(w1/n1 + w2/n2 + w3/n3 + ...).]


Google Spamming

- Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques: it goes beyond the text on pages to decide how good they are. No links, low rank.
- A common spam idea: create a lot of new pages within a site that all link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
- "The (Evil) Genius of Comment Spammers," by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7



Topic Search http://www.google.com/options/index.html


Brief Introduction to Semantic Web


Machine Process-able Knowledge on the Web

- Unique identity of resources and objects: URIs
- Metadata annotations: data describing the content and meaning of resources. But everyone must speak the same language...
- Terminologies: shared and common vocabularies. But everyone must mean the same thing...
- Ontologies: a shared and common understanding of a domain; essential for the exchange and discovery of knowledge.
- Inference: apply the knowledge in the metadata and the ontology to create new metadata and new knowledge (see the sketch below).
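A toy illustration of the inference step, using a simplified version of the soccer example from the later slides; the data structures and relation names are made up for the sketch:

```python
# Asserted knowledge: a transitive partOf relation (ontology) and a metadata
# annotation on a resource. Inference derives facts not stated explicitly.
part_of = {"Blackburn": "UK", "Nottingham": "UK", "UK": "Europe"}   # ontology relation
metadata = {"Andy Cole": {"birthplace": "Nottingham"}}              # annotation on a resource

def located_in(place: str) -> list[str]:
    """Follow the transitive partOf relation upwards."""
    chain = []
    while place in part_of:
        place = part_of[place]
        chain.append(place)
    return chain

# Inferred knowledge: the birthplace lies in the UK and in Europe.
birthplace = metadata["Andy Cole"]["birthplace"]
print(birthplace, "->", located_in(birthplace))   # Nottingham -> ['UK', 'Europe']
```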


The Semantic Web


Ontologies: The Semantic Backbone


Language Tower in Semantic Web

Identity

Standard Syntax

Metadata annotations

Ontologies

Rules & Inference

Explanation

Attribution

Web Ontology Language (OWL) 1.0 Reference: http://www.w3.org/TR/owl-ref/


[Figure: example ontology. Classes include Person, Sport, Team-based Sport (participants > 1), Organisation, Club Sport, Sports Club, Soccer Club, and Country; instances include Soccer, Blackburn Rovers, Blackburn, the UK, and Europe, with Blackburn part of the UK and the UK part of Europe.]


[Figure: the ontology extended with Event, Competition, Tournament, Sports Tournament, and Soccer Tournament (e.g., the Worthington Cup), plus Sports Player and Soccer Player instances Andy Cole and Brad Friedal of Blackburn Rovers.]


[Figure: metadata about Andy Cole, a Person and Soccer Player of Blackburn Rovers, with birthplace Nottingham (part of the UK, which is part of Europe) and nationality UK.]


[Figure: metadata about Brad Friedal, a Person and Soccer Player of Blackburn Rovers, with birthplace Lakewood and nationality USA.]


Useful IR System Building Software and Resources


Lucene API (http://lucene.apache.org/)
- Pure Java (data abstraction, platform independence, reusable components)
- High-performance indexing; supports both incremental indexing and batch indexing
- Provides accurate and efficient searching mechanisms: complex queries based on Boolean and phrase queries, and queries over specific document fields; ranked searching with the highest-scoring results returned first
- Allows users to develop a variety of new applications: searchable email, CD-based documentation search, DBMS object ID management


http://www.getopt.org/luke/


www.egothor.org (supports EBIR)


http://nltk.sourceforge.net/


http://ciir.cs.umass.edu/research/indri/


http://www.summarization.com/


http://wordnet.princeton.edu/


http://protege.stanford.edu/


http://www.google.com/apis/


http://www.amazon.com/gp/browse.html/103-1065429-7111805?%5Fencoding=UTF8&node=3435361

Then click "Alexa Web Information Service 1.0 Released".


http://mg4j.dsi.unimi.it/


http://www.xapian.org/history.php (Probabilistic IR model)


IR research resources: http://www.searchtools.com/info/info-retrieval.html


http://www-db.stanford.edu/db_pages/projects.html


http://dbpubs.stanford.edu:8090/aux/index-en.html


http://citeseer.ist.psu.edu/


Web Challenges for the IR Research Community


Research Issues (1)

- The IR research field is interdisciplinary in nature.
- Traditionally focused on retrieval effectiveness:
  - Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models such as the INDRI system at UMass)
  - Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine learning techniques as knowledge-acquisition tools)
  - Knowledge- and semantics-richer retrieval approaches (e.g., RUBRIC rule-based IR, and some recent concept-based IR based on rules)
  - Information filtering based on user profiling
- Traditionally based on small text collections.
- Little work has been done on retrieval efficiency, although there are some reports (e.g., the use of parallel architectures for handling index files based on signature files).


Research Issues (2): Challenges

- Distributed data: documents spread over millions of different web servers.
- Volatile data: many documents change or disappear rapidly (e.g., dead links); information recency (up-to-dateness) matters.
- Large volume: billions of separate documents.
- Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
- Quality of data: no editorial control; false information, poor-quality writing, typos, etc. Large-scale knowledge- and semantics-rich retrieval applications are needed.
- Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.


Research Issues (3): Retrieval Effectiveness (all at large scale), with Efficiency in Mind

- Test the effectiveness of IR models with efficiency as an important consideration.
- Effective and efficient indexing for both documents and queries:
  - Natural language processing (some of it statistical)
  - Distributed incremental indexing
  - System and physical data structure/algorithm issues
  - Distributed brokering architectures for information recency
- Investigation of semantics-richer approaches: the semantic web and other rule-based approaches; effective and efficient knowledge indexing.
- Use of users' relevance feedback: automatic feedback acquisition; user profiling and information filtering; evaluation measures (predictable).
- Text summarization for better presentation.
- Text categorization (clustering) for topic search (e.g., the Yahoo subject directory, Google topic search).


Research Issues (4)

- Multimedia indexing:
  - IBM QBIC project (http://wwwqbic.almaden.ibm.com/)
  - Indexing tools for various media types (e.g., an image of a mountain with a lake covered by snow; SemCap)
- Develop a test bed for controllable experiments:
  - Internet emulator/simulator
  - Distributed IR subsystems
  - Appropriate performance measures (e.g., RB Precision)
- Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.