Intro to IR and SE, research issues 1
http://comet.lehman.cuny.edu/jung/presentation/presentation.html
Introduction to Modern Information Retrieval and Search Engines, and Some Research Issues
Professor Gwang Jung
Department of Mathematics and Computer Science, Lehman College, CUNY
November 10, 2005 (Fall semester)
Outline
Introduction to Information Retrieval
Introduction to Search Engines (IR Systems for the Web)
Search Engine Example: Google
Brief Introduction to Semantic Web
Useful Tools for IR System Building and Resources for Advanced Research
Research Issues
Introduction to Information Retrieval
Information Age
IR in General
Information Retrieval in general deals with the retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
A user query may be structured (e.g., a Boolean expression of keywords or terms) or unstructured (e.g., terms, a sentence, or a whole document).
In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
Efficiency with respect to: algorithms, query processing, data organization/structure
Effectiveness with respect to: retrieval results
IR Systems
Formal Definition of IR System
IRS = (T, D, Q, F, R)
T: set of index terms (terms)
D: set of documents in a document database
Q: set of user queries
F: D × Q → R (retrieval function)
R: set of real numbers (RSV: Retrieval Status Value)
Relevance Judgment is given by users.
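As an illustration, the tuple above can be sketched in Python. The term-overlap retrieval function used here is a hypothetical stand-in for F, chosen only to make the example concrete; it is not the definition's required function.

```python
# A minimal sketch of IRS = (T, D, Q, F, R).
# F here is an illustrative choice: the number of terms shared by
# a document and a query (returned as the RSV, a real number).

def F(d: set, q: set) -> float:
    """Retrieval function F: D x Q -> R (term-overlap count as RSV)."""
    return float(len(d & q))

# D: documents represented as sets of index terms drawn from T
D = [{"process", "synchronization", "semaphore"},
     {"database", "query", "index"}]

q = {"process", "synchronization"}          # a query, also a set of terms

# Rank documents by descending RSV
ranked = sorted(range(len(D)), key=lambda i: F(D[i], q), reverse=True)
print(ranked)
```

Relevance judgments remain external to this sketch, exactly as in the definition: the RSV orders documents, and users decide which are actually relevant.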
IRS versus DBMS

                          DBMS                            IRS
Objects                   Simple                          Complex (texts)
Information request       Objective structure, specific   Subjective structure, general and ambiguous
Query language            Predicate calculus              How?
Knowledge representation  Transparent                     Complex process
Retrieval process         Deterministic                   Statistical (similarity)
Major success criterion   Correctness                     Utility
Query example             Select student name where       Select items where item's concept deals
                          student's GPA > 3.0             with process synchronization
IR Systems Focus on Retrieval Effectiveness
The effective retrieval of relevant information depends on:
User task: formulating an effective query for the information need
Indexing: IR systems in general adopt index terms to represent documents and queries; indexing is the process of developing document representations by assigning index terms to documents (information items)
Retrieval model (often called the IR model) and the logical view of documents: the logical view (logical representation) of documents depends on the IR model
Indexing
The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
Descriptors = index terms = terms. Descriptors also help users formulate information requests.
Two types of index terms:
Objective: author name, publisher, date of publication
Subjective: keywords selected from the full text
Two types of indexing methods:
Manual: performed by human experts (for very effective IR systems); may use an ontology
Automatic: performed by computer hardware and software
Indexing Aims (1)
Recall: the proportion of relevant items (documents) retrieved.
R = (# of relevant items retrieved) / (total # of relevant items in the database)
Precision: the proportion of retrieved documents that are relevant.
P = (# of relevant items retrieved) / (total # of items retrieved)
Effectiveness of indexing is mainly controlled by:
Term specificity: broader terms may retrieve both useful (relevant) and useless (non-relevant) items for the user; narrower (specific) index terms favor precision at the expense of recall.
Index language (set of well-selected index terms), T = {index term t}:
Pre-specified (controlled): easy maintenance; poor adaptability
Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries. Synonymous terms can be added to T via a thesaurus, an e-dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
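Both measures can be computed directly from the set of retrieved items and the set of relevant items; the document IDs below are invented for illustration.

```python
# Recall and precision for one query, given the items the IR system
# retrieved and the items judged relevant by the user.

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of all relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}   # 3 relevant items exist in the database

print(recall(retrieved, relevant))     # 2/3
print(precision(retrieved, relevant))  # 2/4 = 0.5
```

Note how the trade-off appears directly: retrieving more items can only raise recall, but usually lowers precision.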
Indexing Aims (2)
Recall and precision values vary from 0 to 1. Users typically want both high recall and high precision; in practice, a compromise must be reached (a middle point).
[Plot: recall-precision trade-off curve; both R and P range from 0 to 1.0]
Steps for Indexing
Objective attributes of a document are extracted (e.g., title, author, URL, structure).
Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
Case folding might be performed. Stemming might be used.
Frequencies of non-function words are used to specify term importance.
Term-frequency weighting fulfils only one of the indexing aims, i.e., recall.
Terms that occur rarely in the document database may be used to distinguish the documents in which they occur from those in which they do not, which could improve precision.
Document frequency: the number of documents in the collection in which a term tj ∈ T occurs.
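The steps above can be sketched as a small pipeline; the stop-word list is an illustrative fragment, not a complete one, and the sample texts are invented.

```python
from collections import Counter

STOP_WORDS = {"of", "then", "this", "and", "the", "a", "is"}  # partial list

def index_terms(text: str) -> Counter:
    """Case-fold, drop stop words, and count term frequencies (tf)."""
    tokens = [t.lower() for t in text.split()]
    return Counter(t for t in tokens if t not in STOP_WORDS)

docs = ["The process of process synchronization",
        "This is a database index"]
tfs = [index_terms(d) for d in docs]

# Document frequency (df): number of documents containing each term
df = Counter()
for tf in tfs:
    df.update(tf.keys())

print(tfs[0]["process"])   # tf of "process" in document 0
print(df["process"])       # df of "process" across the collection
```

Stemming is omitted here; a real pipeline would apply a stemmer (e.g., Porter's algorithm) between case folding and counting.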
Inverted Index File
Index terms and their document frequencies (df): system (3), computer (2), database (4), science (1).
Each index term points to a list of inverted index entries (Dj, tfj), e.g., (D2, 4), (D5, 2), (D1, 3), (D7, 4).
Optionally, postings also record the positions of the term within each document.
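An inverted file with document frequencies and positional postings can be built as follows; the two sample documents and their IDs are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, tf, positions) entries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)       # positional postings
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {"D1": "computer science database systems",
        "D2": "database systems and database design"}
index = build_inverted_index(docs)

# df of a term = length of its entry list
print(len(index["database"]))   # appears in both D1 and D2
print(index["database"][1])     # entry for D2: tf 2, positions [0, 3]
```

The positional lists are what make phrase queries possible: two terms form a phrase in a document when their positions differ by one.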
Retrieval Models (1)
Set-theoretic IR models
Documents are represented by a set of terms. Well-known set-theoretic models:
Boolean IR model: the retrieval function is based on Boolean operations (e.g., and, or, not); the query is formulated in Boolean logic
Fuzzy-set IR model: the retrieval function is based on fuzzy-set operations; the query is formulated in Boolean logic
Rough-set IR model: various set operations have been examined; ad-hoc Boolean queries
Probabilistic IR model
Mainly used for probabilistic index-term weighting; provides a mathematical framework for the well-known tf*idf indexing scheme
Language-model based: infers the query concept from a document as the retrieval process
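The Boolean model above can be sketched with ordinary set operations over an inverted index; the postings sets and document IDs are invented for illustration.

```python
# Boolean retrieval: each term maps to the set of documents containing
# it, and a Boolean query combines those sets with and/or/not.

postings = {
    "process":         {"D1", "D3"},
    "synchronization": {"D1"},
    "database":        {"D2", "D3"},
}
all_docs = {"D1", "D2", "D3"}

# Query: process AND synchronization
and_result = postings["process"] & postings["synchronization"]

# Query: process AND NOT database (NOT = complement over the collection)
not_result = postings["process"] & (all_docs - postings["database"])

print(and_result, not_result)
```

The model's appeal is this exact-match simplicity; its weakness is that every answer is a flat set, with no ranking by degree of relevance.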
Retrieval Models (2)
Vector space model
Queries and documents are represented as weighted vectors. The basis vectors are called term vectors and are assumed to be semantically independent. A document (or query) is represented as a linear combination of vectors in the generating set. The retrieval function is based on the dot product or the cosine measure between document and query vectors.
Extended Boolean IR model
Combines characteristics of the vector space IR model with properties of Boolean algebra. The retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
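A minimal vector-space sketch, assuming tf*idf weights and cosine similarity as the retrieval function (both named in these slides); the toy documents and query are invented.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["database", "index", "query"],
        ["process", "synchronization", "process"]]
N = len(docs)

def tfidf(tokens):
    """Weight each distinct term by tf * log10(N / df)."""
    vec = {}
    for t in set(tokens):
        tf = tokens.count(t)
        df = sum(1 for d in docs if t in d)
        vec[t] = tf * math.log10(N / df) if df else 0.0
    return vec

doc_vecs = [tfidf(d) for d in docs]
query_vec = tfidf(["process", "synchronization"])

scores = [cosine(query_vec, dv) for dv in doc_vecs]
print(scores.index(max(scores)))   # best-matching document
```

Unlike the Boolean model, the output is a ranking: every document gets a real-valued score, so partial matches still surface.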
The Retrieval Process
The Retrieval Process in an IR System
Introduction to Search Engines (IR Systems for the Web)
World Wide Web History
1965 – Hypertext: Ted Nelson developed the idea of hypertext.
Late 1960s: Doug Engelbart invented the mouse and built the first implementation of hypertext at SRI.
Early 1970s: ARPANET was developed.
1982: Transmission Control Protocol (TCP) and Internet Protocol (IP).
1989-1990: WWW, developed by Tim Berners-Lee and others at CERN to organize research documents available on the Internet. Combined the idea of documents available by FTP with the idea of hypertext to link documents. Developed the initial HTTP network protocol, URLs, HTML, and the first web server.
Search Engine (Web-based IR System) History
By the late 1980s, many files were available by anonymous FTP.
In 1990, Alan Emtage of McGill Univ. developed Archie (short for "archives"), which assembled lists of files available on many FTP servers and allowed regular-expression search of these file names.
In 1993, Veronica and Jughead were developed to search the names of text files available through Gopher servers.
In 1993, early web robots (spiders) were built to collect URLs: Wanderer, ALIWEB (Archie-Like Index of the WEB), WWW Worm (indexed URLs and titles for regex search).
In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
Search Engine History (cont’d)
In early 1994, Brian Pinkerton developed WebCrawler as a class project at the University of Washington (it eventually became part of Excite and AOL).
A few months later, Michael "Fuzzy" Mauldin, a professor at CMU, developed Lycos with his graduate students. It was the first to use a standard IR system as developed for the DARPA Tipster project, and the first to index a large set of pages.
In late 1995, DEC developed AltaVista. It used a large farm of Alpha machines to quickly process large numbers of queries, and supported Boolean operators, phrases, and "reverse pointer" queries.
In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University; it made use of link analysis to rank documents.
How Do Web Search Engines Work?
Search engines for the general web search a database of the full text of web pages, selected from billions of web pages; searching is based on inverted index entries.
Search engine databases: full-text documents are collected by software robots (also called softbots or spiders), which navigate the web collecting pages. The web can be viewed as a graph structure, and the navigation can be based on DFS (depth-first search), BFS (breadth-first search), or some combined navigation heuristics. How to detect cycles? A research issue.
The indexer then builds inverted index entries and stores them in inverted files. If necessary, the inverted files may be compressed.
Some types of pages and links are excluded from the search engine; they form the invisible web (maybe many times bigger than the visible web).
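The breadth-first navigation with cycle detection can be sketched as a traversal over a link graph; the in-memory `link_graph` here stands in for fetching a page over HTTP and extracting its out-links, and the page names are invented.

```python
from collections import deque

# Stand-in for the web graph; a real robot would download each URL
# and parse out its hyperlinks instead of looking them up here.
link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],   # back-link creates a cycle
    "c.html": ["d.html"],
    "d.html": [],
}

def bfs_crawl(seed):
    """Breadth-first crawl; the visited set is the cycle detector."""
    visited = {seed}
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                 # "fetch and index" the page
        for link in link_graph.get(url, []):
            if link not in visited:       # skip already-seen URLs
                visited.add(link)
                frontier.append(link)
    return order

print(bfs_crawl("a.html"))
```

Switching the deque's popleft() to pop() turns the same code into a depth-first crawl; the visited set handles cycles identically in both orders.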
Breadth-First Crawling
[Diagram: a breadth-first crawl expands the frontier level by level from the seed pages]
Depth-First Crawling
[Diagram: a depth-first crawl follows a chain of links deep into the graph before backtracking]
Web Search Engine System Architecture
[Diagram: Internet websites → robot → temporary storage → parser → stopper/stemmer → indexer → inverted files (which can be based on different physical data structures); the user queries through an interface, and the retrieval mechanism matches the query against the logical document representation (based on IR models)]
Distributed Architecture (example): Harvest (http://harvest.sourceforge.net/)
A distributed web search engine: it distributes the load among different machines, and the indexer doesn't run on the same machine as the broker or web server.
What Makes a SE Good?
Database of web documents: size of the database, freshness (recency, up-to-dateness), types of documents offered, retrieval speed.
The search engine's capabilities: search options, effectiveness of the retrieval mechanism, support for concept-based search (semantic web).
Concept-based search systems try to determine what you mean, not just what you say. Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
Presentation of the results: keywords highlighted in context; showing a summary of each matching web page.
Search Engine Example (Google)
Google: the most popular web search engine
Crawls the web (by robots) and stores a local cache of found pages. Builds a lexicon of common words; for each word it creates an index list of the pages containing it. Also uses human-compiled information from the Open Directory. Cached links let you see older versions of recently changed pages.
Link analysis system: the PageRank heuristic.
Estimated size of index: 580 million pages visited and recorded; link data is used to reach another 500 million pages (via the link analysis system). A recent estimate is around 4 billion pages.
Index refresh: updated monthly, weekly, or daily for popular pages.
Serves queries from three data centers (service replication); service updates are synchronized. Two are on the West Coast of the US, one on the East Coast.
Google Founders
Larry Page, co-founder & President, Products
Sergey Brin, co-founder & President, Technology
Both were PhD students at Stanford. Google became a public company in 2004.
Market share
[Chart: search engine market share, 2001-2004, for Google, Yahoo!, MSN, Lycos, AltaVista, and AOL, on a 0-50% scale. Source: WebSideStory]
Google Architecture Overview
Google Indexer
term frequencies
Google Lexicon
Google Searcher
Google Features
Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed. Other services also use link popularity, but none to the extent that Google does.
Traditional IR (light use); link popularity (heavily used); citation-importance ranking (quality of the links pointing at a page).
Relevancy: similarity between the query and a page, number of links, link quality, link content, ranking boosts on text styles.
PageRank: usage simulation & citation-importance ranking; the user navigates randomly; the process is modelled by a Markov chain.
Collecting Links in Google
Submission (by web promotion): an Add URL page (you may not need to do a "deep" submit). The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
Crawling and index depth: Google aims to refresh its index on a monthly basis. Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks. This text is associated with the pages a link points at, and it makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.
Google Guidelines for Web-submission
Deep SubmitPro
Link Analysis for Relevancy (1)
Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project (http://www.almaden.ibm.com/cs/k53/clever.html).
Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
Number of links: all things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
Link quality: numbers aren't everything; a single link from an important site might be worth more than many links from relatively unknown sites. Page importance is weighted: links from important pages are weighted higher.
Link Analysis for Relevancy (2)
Link content: the text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
Ranking boosts on text styles: the appearance of terms in bold text, in header text, or in a large font size is taken into account. None of these are dominant factors, but they do figure into the overall equation.
PageRank
Usage simulation & citation-importance ranking: based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
The user navigates randomly: jumps to a random page with probability p; follows a random hyperlink from the current page with probability 1 - p; does not go back to a previously visited page by following a previously traversed link backwards.
Google thereby finds a type of universally important page: intuitively, locations that are heavily visited in a random traversal of the web's link structure.
PageRank Heuristics
The process is modelled by the following heuristic. The probability of being at each page is computed, with p set by the system:
wj = PageRank of page j
ni = number of outgoing links on page i
m = number of nodes in G (the number of web pages in the collection)

w_j = p/m + (1 - p) * Σ_{(i, j) ∈ G} w_i / n_i
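The heuristic can be computed by straightforward iteration of the PageRank equation, starting from a uniform distribution; the three-page link graph below is invented for illustration, and every page is assumed to have at least one outgoing link.

```python
# Iterative computation of
#   w_j = p/m + (1 - p) * sum over links (i -> j) of w_i / n_i
# links[i] lists the pages that page i points at.

def pagerank(links, p=0.15, iterations=50):
    m = len(links)
    w = {page: 1.0 / m for page in links}           # uniform start
    for _ in range(iterations):
        new_w = {page: p / m for page in links}     # random-jump term
        for i, outgoing in links.items():
            n_i = len(outgoing)
            for j in outgoing:
                new_w[j] += (1 - p) * w[i] / n_i    # link-following term
        w = new_w
    return w

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # the page collecting the most rank
```

Because every page has out-links here, the total rank stays 1 at each iteration; a production implementation must additionally handle dangling pages with no outgoing links.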
PageRank Illustration
[Diagram: pages 1, 2, and 3 link to page j; page j's rank combines the random-jump term p/m with the link-following term (1 - p)(w1/n1 + w2/n2 + w3/n3)]
Google Spamming
Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques: it goes beyond the text on pages to decide how good they are. No links, low rank.
A common spam idea: create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
"The (Evil) Genius of Comment Spammers," by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7
Topic Search http://www.google.com/options/index.html
Brief Introduction to Semantic Web
Machine-Processable Knowledge on the Web
Unique identity of resources and objects: URIs
Metadata annotations: data describing the content and meaning of resources. But everyone must speak the same language…
Terminologies: shared and common vocabularies. But everyone must mean the same thing…
Ontologies: shared and common understanding of a domain; essential for the exchange and discovery of knowledge
Inference: apply the knowledge in the metadata and the ontology to create new metadata and new knowledge
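Inference over metadata can be sketched with a toy triple store and a single subclass rule; the triples reuse this deck's soccer examples, and the rule is one of many a real reasoner would apply.

```python
# Given explicit "is-a" facts and a subclass hierarchy (the ontology),
# derive class memberships that were never stated directly.

subclass_of = {
    "Soccer Player": "Sports Player",
    "Sports Player": "Person",
}
facts = {("Andy Cole", "is-a", "Soccer Player")}

def infer(facts, subclass_of):
    """Apply the subclass rule until no new triples appear (fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(derived):
            if p == "is-a" and o in subclass_of:
                new = (s, "is-a", subclass_of[o])
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

print(("Andy Cole", "is-a", "Person") in infer(facts, subclass_of))
```

This is the point of the semantic web stack: a query for "Person" can now match a resource that was only ever annotated as "Soccer Player".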
The Semantic Web
Ontologies: The Semantic Backbone
Language Tower in Semantic Web
Identity
Standard Syntax
Metadata annotations
Ontologies
Rules & Inference
Explanation
Attribution
Web Ontology Language 1.0 Reference: http://www.w3.org/TR/owl-ref/
[Ontology diagram: a class hierarchy relating Person, Sport, Soccer, Team-based Sport (participants > 1), Organisation, Club Sport, Sports Club, and Soccer Club; the instance Blackburn Rovers is a Soccer Club located in Blackburn, UK, which is part of Europe]
[Ontology diagram: the Worthington Cup is a Soccer Tournament, which is a Sports Tournament, a Tournament, a Competition, and an Event; Andy Cole and Brad Friedel are Soccer Players (Sports Players) at Blackburn Rovers]
[Ontology diagram: Andy Cole, a Person and Soccer Player (Sports Player) at Blackburn Rovers, has nationality UK (part of Europe) and birthplace Nottingham]
[Ontology diagram: Brad Friedel, a Person and Soccer Player (Sports Player) at Blackburn Rovers, has nationality USA and birthplace Lakewood]
Useful IR System Building Software and Resources
Lucene API (http://lucene.apache.org/)
Pure Java (data abstraction, platform independence, reusable components)
High-performance indexing; supports both incremental indexing and batch indexing
Provides accurate and efficient searching mechanisms: complex queries based on Boolean and phrase queries, and queries with specific document fields; ranked searching, with the highest-scoring results returned first
Allows users to develop a variety of new applications: searchable email, CD-based documentation search, DBMS object-ID management
http://www.getopt.org/luke/
www.egothor.org (supports EBIR)
http://nltk.sourceforge.net/
http://ciir.cs.umass.edu/research/indri/
http://www.summarization.com/
http://wordnet.princeton.edu/
http://protege.stanford.edu/
http://www.google.com/apis/
http://www.amazon.com/gp/browse.html/103-1065429-7111805?%5Fencoding=UTF8&node=3435361
Then click Alexa Web Information Service 1.0 Released
http://mg4j.dsi.unimi.it/
http://www.xapian.org/history.php (Probabilistic IR model)
http://www.searchtools.com/info/info-retrieval.html
IR research resources
http://www-db.stanford.edu/db_pages/projects.html
http://dbpubs.stanford.edu:8090/aux/index-en.html
http://citeseer.ist.psu.edu/
Web Challenges for the IR Research Community
Research Issues (1)
The IR research field is interdisciplinary in nature.
Traditionally focused on retrieval effectiveness:
Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models; the INDRI system at UMass)
Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine-learning techniques as knowledge-acquisition tools)
Knowledge-rich and semantically richer retrieval approaches (e.g., RUBRIC rule-based IR; some recent concept-based IR based on rules)
Information filtering based on user profiling
Traditionally based on small sets of text collections.
Little work has been done on retrieval efficiency, although there are some reports (e.g., use of parallel architectures for handling index files based on signature files).
Research Issues (2): Challenges
Distributed data: documents spread over millions of different web servers.
Volatile data: many documents change or disappear rapidly (e.g., dead links); information recency (up-to-dateness).
Large volume: billions of separate documents.
Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
Quality of data: no editorial control; false information, poor-quality writing, typos, etc. Large-scale knowledge-rich and semantically rich retrieval applications are needed.
Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.
Research Issues (3)
Retrieval effectiveness (all at large scale) with efficiency in mind:
Test the effectiveness of IR models with efficiency as an important consideration
Effective and efficient indexing for both documents and queries: natural language processing (some statistical), distributed incremental indexing, system and physical data-structure/algorithm issues, distributed brokering architectures for information recency
Investigation of semantically richer approaches: the semantic web and other rule-based approaches; effective and efficient knowledge indexing
Use of users' relevance feedback: automatic feedback acquisition, user profiling and information filtering, evaluation measures (predictable)
Text summarization for better presentation
Text categorization (clustering) for topic search (e.g., Yahoo subject directory, Google topic search)
Research Issues (4)
Multimedia indexing: the IBM QBIC project (http://wwwqbic.almaden.ibm.com/); indexing tools for various media types (e.g., an image of a mountain with a lake covered by snow; SemCap)
Develop test beds for controllable experiments: an Internet emulator/simulator, distributed IR subsystems, appropriate performance measures (e.g., RB Precision)
Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.