
Intro to IR and SE, research issues 1

Introduction to Modern Information Retrieval and Search Engines

And Some Research Issues

Professor Gwang Jung

Department of Mathematics and Computer Science
Lehman College, CUNY

November 10, Fall 05


Introduction to Information Retrieval
Introduction to Search Engines (IR Systems for the Web)
Search Engine Example: Google
Brief Introduction to Semantic Web
Useful Tools for IR System Building and Resources for Advanced Research
Research Issues


Introduction to Information Retrieval

Information Age

IR in General

Information Retrieval in general deals with retrieval of structured, semi-structured and unstructured data (information items) in response to a user query (topic statement).

User query
  Structured (e.g., Boolean expression of keywords or terms)
  Unstructured (e.g., terms, sentence, document)

In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.

Efficiency with respect to:
  Algorithms, query processing, data organization/structure
Effectiveness with respect to:
  Retrieval results

IR Systems

Formal Definition of IR System

IRS = (T, D, Q, F, R)
  T: set of index terms (terms)
  D: set of documents in a document database
  Q: set of user queries
  F: D × Q → R (retrieval function)
  R: real numbers (RSV: Retrieval Status Value)

Relevance Judgment is given by users.


IRS versus DBMS



Aspect                     Simple (DBMS)                    Complex (texts)
Information request        Objective structure, specific    Subjective structure, general and ambiguous
Query language             Predicate calculus               How?
Knowledge representation   Transparent                      Complex process
Retrieval process          Deterministic                    Statistical (similarity)
Major success criteria     Correctness                      Utility

Query example
  DBMS: Select student name where student’s GPA > 3.0
  Texts: Select items where item’s concept deals with process synchronization


IR Systems Focus on Retrieval Effectiveness

The effective retrieval of relevant information depends on:
  User task (formulating an effective query for the information need)
  Indexing
    IR systems in general adopt index terms to represent documents and queries.
    Indexing is the process of developing document representations by assigning index terms to documents (information items).
  Retrieval model (often called IR model) and logical view of documents
    The logical view of documents (logical representation of documents) depends on the IR model.



Indexing: the process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).

Descriptors = index terms = terms
Descriptors also lead users to participate in formulating information requests.

Two types of index terms:
  Objective: author name, publisher, date of publication
  Subjective: keywords selected from full text

Two types of indexing methods:
  Manual: performed by human experts (for very effective IR systems); may use an ontology
  Automatic: performed by computer HW and SW


Indexing Aims (1)

Recall: the proportion of relevant items (documents) retrieved.
  R = (# of relevant items retrieved) / (total # of relevant items in the db)
Precision: the proportion of retrieved documents that are relevant.
  P = (# of relevant items retrieved) / (total # of items retrieved)

Effectiveness of indexing is mainly controlled by term specificity.
  Broader terms may retrieve both useful (relevant) and useless (non-relevant) info items for the user.
  Narrower (specific) index terms favor precision at the expense of recall.

Index language (set of well-selected index terms)
  T = { index term t }
  Pre-specified (controlled): easy maintenance; poor adaptability
  Uncontrolled (dynamic): expanded dynamically; taken freely from the texts to be indexed and from the users’ queries.
  Synonymous terms can be added to T by thesaurus, e-dictionary (e.g., WordNet), and/or knowledge base (e.g., ontology).


Indexing Aims (2)

Recall and precision values vary from 0 to 1.
Average users want to have high recall and high precision.
In practice, a compromise must be reached (middle point).
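The recall and precision ratios can be computed directly from two sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the sample document ids are illustrative assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant by the user
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 relevant in the db.
p, r = precision_recall(["D1", "D2", "D5", "D7"], ["D2", "D5", "D9"])
print(p, r)
```

Retrieving more documents can only keep recall the same or raise it, while precision usually drops, which is the compromise mentioned above.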






Steps for Indexing

Objective attributes of a document are extracted (e.g., title, author, URL, structure).

Grammatical function words (stop words) in general are not considered as index terms (e.g., of, then, this, and, etc.).

Case folding (case insensitivity) might be performed.
Stemming might be used.
The frequency of non-function words is used to specify term importance.
Term frequency weighting fulfils only one of the indexing aims, i.e., recall.
Terms that occur rarely in the document database may be used to distinguish the documents in which they occur from those in which they do not, which could improve precision.

Document frequency: the number of documents in the collection in which a term t_j ∈ T occurs.
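The indexing steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list is a tiny sample, `stem` is a crude suffix stripper standing in for a real stemmer, the three documents are invented, and the weight is the common tf * log(N / df) variant of tf*idf.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a much longer one.
STOP_WORDS = {"of", "then", "this", "and", "the", "a", "in", "is"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Case-fold, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


docs = {
    "D1": "Indexing of documents and indexing of queries",
    "D2": "The science of retrieval",
    "D3": "Retrieval models in science",
}

# Term frequency per document, document frequency per term.
tf = {d: Counter(index_terms(text)) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

# Rare terms get a larger idf, which is what helps precision.
idf = {t: math.log(len(docs) / n) for t, n in df.items()}
weights = {d: {t: f * idf[t] for t, f in counts.items()} for d, counts in tf.items()}
print(weights["D1"])
```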


Inverted Index File




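[Figure: inverted index file. Each index term, with its document frequency (df), points to a list of inverted index entries of the form (Dj, tfj). Example entries shown for the term "science": (D2, 4), (D5, 2), (D1, 3), (D7, 4).]

Optionally, each entry also holds postings (the positions of the term in a document).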
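An inverted index of this shape can be built with nested dictionaries. The sketch below is a hedged illustration: the function name `build_inverted_index`, the dictionary layout, and the sample documents are assumptions, and the input is taken to be already tokenized.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its df and a list of (doc_id, tf, positions) entries.

    docs maps a document id to its list of index terms.
    """
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            positions[term][doc_id].append(pos)

    index = {}
    for term, by_doc in positions.items():
        entries = [
            (doc_id, len(pos_list), pos_list)
            for doc_id, pos_list in sorted(by_doc.items())
        ]
        index[term] = {"df": len(entries), "entries": entries}
    return index


docs = {
    "D1": ["computer", "science", "and", "library", "science"],
    "D2": ["science", "fiction"],
    "D3": ["computer", "networks"],
}
index = build_inverted_index(docs)
print(index["science"])
```

Keeping the positions makes phrase queries possible, at the cost of a larger index file.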


Retrieval Models (1)

Set theoretic IR models
  Documents are represented by a set of terms.
  Well-known set theoretic models:
    Boolean IR model
      Retrieval function is based on Boolean operations (e.g., and, or, not).
      Query is formulated by Boolean logic.
    Fuzzy set IR model
      Retrieval function is based on fuzzy set operations.
      Query is formulated by Boolean logic.
    Rough set IR model
      Various set operations were examined.
      Ad-hoc Boolean query

Probabilistic IR model
  Mainly used for probabilistic index term weighting
  Provides a mathematical framework for the well-known tf*idf indexing scheme
  Language model based
    Infer query concept from a document as retrieval process
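For the Boolean model, the retrieval function reduces to set operations over the documents in which each term occurs. A minimal sketch, assuming a hypothetical in-memory index (the terms, document ids, and the helper `postings` are made up for illustration):

```python
# Hypothetical index: term -> set of ids of documents containing the term.
index = {
    "information": {"D1", "D2", "D4"},
    "retrieval": {"D1", "D4", "D5"},
    "database": {"D2", "D5"},
}
all_docs = {"D1", "D2", "D3", "D4", "D5"}


def postings(term):
    """Documents containing the term (empty set for unknown terms)."""
    return index.get(term, set())


# information AND retrieval AND NOT database
and_not = (postings("information") & postings("retrieval")) - postings("database")

# database OR retrieval
either = postings("database") | postings("retrieval")

# NOT information (complement against the whole collection)
negated = all_docs - postings("information")

print(sorted(and_not), sorted(either), sorted(negated))
```

Every document either satisfies the query or does not, so the Boolean model by itself produces no ranking.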


Retrieval Models (2)

Vector space model
  Queries and documents are represented as weighted vectors.
  Vectors in the basis are called term vectors and are assumed to be semantically independent.
  A document (query) is represented as a linear combination of vectors in the generating set.
  Retrieval function is based on the dot product or cosine measure between document and query vectors.

Extended Boolean IR model
  Combines characteristics of the vector space IR model with properties of Boolean algebra.
  Retrieval function is based on Euclidean distances in an n-dimensional vector space.
  Distances are measured by using p-norms, where 1 ≤ p ≤ ∞.
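Both retrieval functions can be sketched compactly. The code below is an illustration under assumptions: vectors are sparse dictionaries of term weights, the sample weights are invented, and `pnorm_or` / `pnorm_and` follow the usual p-norm formulation for queries with unweighted terms (function names are hypothetical).

```python
import math


def dot(u, v):
    """Dot product of two sparse term-weight vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def pnorm_or(weights, p):
    """Extended Boolean OR: distance from the all-zero point."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def pnorm_and(weights, p):
    """Extended Boolean AND: one minus the distance from the all-one point."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)


docs = {
    "D1": {"information": 0.8, "retrieval": 0.6},
    "D2": {"database": 1.0, "retrieval": 0.2},
}
query = {"information": 1.0, "retrieval": 1.0}

ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranking)
print(pnorm_or([0.8, 0.2], 2), pnorm_and([0.8, 0.2], 2))
```

With p = 1 the OR and AND scores coincide (both are the average weight); as p grows they move toward the maximum and minimum weight, that is, toward strict Boolean behaviour.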


The Retrieval Process

The Retrieval Process in IR System

Introduction to Search Engines (IR Systems for the Web)

World Wide Web History

1965: Hypertext
  Ted Nelson developed the idea of hypertext in 1965.
Late 1960’s
  Doug Engelbart invented the mouse and built the first implementation of hypertext at SRI.
Early 1970’s
  ARPANET was developed.
1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
1989: WWW
  Proposed by Tim Berners-Lee in 1989 and developed with others in 1990 at CERN to organize research documents available on the Internet.
  Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  Developed the initial HTTP network protocol, URLs, HTML, and the first web server.


Search Engine (Web-based IR System) History

By the late 1980’s many files were available by anonymous FTP.

In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”).
  Assembled lists of files available on many FTP servers.
  Allowed regular expression search of these file names.

In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

In 1993, early web robots (spiders) were built to collect URL’s:
  Wanderer
  ALIWEB (Archie-Like Index of the WEB)
  WWW Worm (indexed URL’s and titles for regex search)

In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.


Search Engine History (cont’d)

In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Washington (it eventually became part of Excite and AOL).

A few months later, Michael “Fuzzy” Mauldin, a professor at CMU, developed Lycos with his graduate students.
  First to use a standard IR system as developed for the DARPA Tipster project.
  First to index a large set of pages.

In late 1995, DEC developed AltaVista.
  Used a large farm of Alpha machines to quickly process large numbers of queries.
  Supported Boolean operators, phrases, and “reverse pointer” queries.

In 1998, Google was developed by graduate students Larry Page & Sergey Brin at Stanford U.
  Uses link analysis to rank documents.


How do Web SE Work?

Search engines for the general web
  Search a database of the full text of web pages selected from billions of Web pages.
  Searching is based on inverted index entries.

Search engine databases
  Full-text documents are collected by a software robot (also called softbot or spider).
  Robots navigate the web to collect pages.
    The Web can be viewed as a graph structure.
    The navigation can be based on DFS (Depth First Search), BFS (Breadth First Search), or some combined navigation heuristics.
    How to detect cycles? (research issue)
  The indexer then builds inverted index entries and stores them in inverted files.
  If necessary, the inverted files may be compressed.
  Some types of pages and links are excluded from the search engine; they form the invisible Web (maybe many times bigger than the visible Web).
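The two navigation orders, and the simplest way of coping with cycles, can be sketched on an in-memory link graph. This is a toy illustration: the graph `web`, the function `crawl`, and its `strategy` parameter are assumptions, and no pages are actually fetched. Cycles are handled by remembering which pages were already visited.

```python
from collections import deque


def crawl(graph, seed, strategy="bfs"):
    """Return the order in which pages are visited, starting from seed.

    graph maps a page to the list of pages it links to.
    strategy is "bfs" (frontier used as a queue) or "dfs" (frontier used as a stack).
    """
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in visited:
            continue  # already crawled: this is what breaks cycles
        visited.add(page)
        order.append(page)
        for link in graph.get(page, []):
            if link not in visited:
                frontier.append(link)
    return order


# Hypothetical link graph containing cycles (C -> A and D -> B).
web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": ["B"]}
print(crawl(web, "A", "bfs"))
print(crawl(web, "A", "dfs"))
```

A visited set works for a toy graph; at Web scale the same idea needs URL normalization and a compact structure for the set of seen URLs, which is part of why cycle detection is listed as a research issue.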

















Breadth-First Crawling







Depth-First Crawling











Web Search Engine System Architecture






[Figure: web search engine system architecture. Surviving labels: "Inverted Files (can be based on different physical data structures)" and "Logical Document Representation (based on IR Models)".]


Distributed Architecture (example) Harvest (

Distributed web search engine
  Distributes the load among different machines
  The indexer doesn't run on the same machine as the broker or web server


What Makes a SE Good?

Database of web documents
  Size of database
  Freshness (recency or up-to-datedness)
  Types of documents offered
  Retrieval speed

The search engine's capabilities
  Search options
  Effectiveness of the retrieval mechanism
  Support for concept-based search (semantic web)
    Concept-based search systems try to determine what you mean, not just what you say.
    Concept-based search often works better in theory than in practice.
    Concept-based indexing is a difficult task to perform.

Presentation of the results
  Keywords highlighted in context
  Showing a summary of the web pages that match

Search Engine Example (Google)

Google

The most popular web search engine:
  Crawls (by robots) the web, stores a local cache of found pages
  Builds a lexicon of common words
  For each word, creates an index list of pages containing it
  Also uses human-compiled information from the Open Directory
  Cached links let you see older versions of recently changed pages

Link analysis system: PageRank heuristic

Estimated size of index
  580 million pages visited and recorded
  Uses link data to get to another 500 million pages (by link analysis system)
  Recent estimation is around 4 billion pages (??)

Index refresh
  Updated monthly/weekly, or daily for popular pages

Serves queries from three data centres (service replication)
  Service updates are synchronized.
  Two on the West Coast of the US, one on the East Coast.


Google Founders

Larry Page, Co-founder & President, Products
Sergey Brin, Co-founder & President, Technology

PhD students at Stanford
Became a public company last year

[Figure: market share, 2001 to 2004. Source: WebSideStory]

Google Architecture Overview

Google Indexer

term frequencies

Google Lexicon

Google Searcher

Google Features

Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
  Other services also use link popularity, but none do to the extent that Google does.
  Traditional IR (LITE)
  Link popularity (HEAVILY USED)
  Citation importance ranking (quality of links pointing at it)

Relevancy
  Similarity between query and a page
  Number of links
  Link quality
  Link content
  Ranking boosts on text styles

PageRank
  Usage simulation & citation importance ranking
  User randomly navigates
  Process modelled by Markov chain


Collecting Links in Google

Submission (by Web promotion):
  Add URL page (may not need to do a "deep" submit)
  The best way to ensure that your site is indexed is to build links. The more other sites point at you, the more likely you will be crawled and ranked well.

Crawling and index depth:
  Aims to refresh its index on a monthly basis.
  Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks.
  This text is associated with the pages the link points at, and it makes it possible for Google to find matching pages even when these pages cannot themselves be indexed.

Google Guidelines for Web-submission

Deep SubmitPro

Link Analysis for Relevancy (1)

Inspired by CiteSeer (NEC International, Princeton, NJ) and the IBM Clever Project
  CiteSeer…..

Google ranks web pages based on the number, quality and content of links pointing at them (citations).

Number of links
  All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.

Link quality
  Numbers aren't everything. A single link from an important site might be worth more than many links from relatively unknown sites.
  Weights page importance: links from important pages are weighted higher.


Link Analysis for Relevancy (2)

Link content
  The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need to have many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.

Ranking boosts on text styles
  The appearance of terms in bold text, or in header text, or in a large font size is all taken into account. None of these are dominant factors, but they do figure into the overall equation.



Usage simulation & citation importance ranking:
  Based on a model of a Web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.

User randomly navigates
  Jumps to a random page with probability p
  Follows a random hyperlink from the page with probability 1 - p
  Does not go back to a previously visited page by following a previously traversed link backwards

Google finds a type of universally important page: intuitively, locations that are heavily visited in a random traversal of the Web's link structure.


PageRank Heuristics

Process modelled by the following heuristics
  The probability of being in each page is computed; p is set by the system.
  w_j = PageRank of page j
  n_i = number of outgoing links on page i
  m = number of nodes in G (the number of Web pages in the collection)

$$w_j = \frac{p}{m} + (1 - p) \sum_{(i,j) \in G,\ i \neq j} \frac{w_i}{n_i}$$
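The heuristic can be evaluated by repeated substitution until the values settle. The sketch below assumes the usual random-surfer update, w_j = p/m + (1 - p) * (sum of w_i / n_i over pages i linking to j), on a small invented graph in which every page has at least one outgoing link, no page links to itself, and every link target is itself a page in the graph. The function name `pagerank`, the default p = 0.15, and the iteration count are illustrative choices, not values from the slides.

```python
def pagerank(graph, p=0.15, iterations=100):
    """Iterate w_j = p/m + (1 - p) * sum(w_i / n_i) over links i -> j.

    graph maps each page to the list of pages it links to.
    Assumes no self-links, no pages without outgoing links, and that
    every link target is also a key of graph.
    """
    m = len(graph)
    w = {page: 1.0 / m for page in graph}  # start from a uniform distribution
    for _ in range(iterations):
        new_w = {page: p / m for page in graph}  # random-jump share
        for i, out_links in graph.items():
            share = (1 - p) * w[i] / len(out_links)  # n_i = len(out_links)
            for j in out_links:
                new_w[j] += share
        w = new_w
    return w


# Hypothetical three-page web: A links to B and C, B to C, C to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(ranks)
```

Page C collects links from both A and B and ends up with the highest value; B, which is reached only through half of A's links, ends up lowest.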







PageRank Illustration

[Figure: PageRank illustration. Only fragments of the labels survive: "(1- p)" and "n1 m".]









Google Spamming

Link popularity ranking system leaves it relatively immune to traditional spamming techniques.
  Goes beyond the text on pages to decide how good they are. No links, low rank.

Common spam idea
  Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading out these pages across a network of sites.

The (Evil) Genius of Comment Spammers, by Steven Johnson, WIRED 12.03

Topic Search

Brief Introduction to Semantic Web

Machine Process-able Knowledge on the Web

Unique identity of resources and objects: URI

Metadata annotations
  Data describing the content and meaning of resources
  But everyone must speak the same language…

Terminologies
  Shared and common vocabularies
  But everyone must mean the same thing…

Ontologies
  Shared and common understanding of a domain
  Essential for exchange and discovery of knowledge

Inference
  Apply the knowledge in the metadata and the ontology to create new metadata and new knowledge
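As a toy illustration of inference, new type statements about a resource can be derived from a class hierarchy. Everything in this sketch is an assumption made for the example: the class names, the two dictionaries, and the function `inferred_types`. A real Semantic Web application would express the same knowledge in RDF/OWL and use a reasoner.

```python
# Hypothetical ontology fragment: class -> its direct superclass.
subclass_of = {
    "SoccerPlayer": "SportsPlayer",
    "SoccerClub": "SportsClub",
}

# Hypothetical metadata annotations: individual -> asserted class.
instance_of = {
    "AndyCole": "SoccerPlayer",
    "BlackburnRovers": "SoccerClub",
}


def inferred_types(individual):
    """Return the asserted class plus every superclass reachable from it."""
    types = []
    cls = instance_of.get(individual)
    while cls is not None:
        types.append(cls)
        cls = subclass_of.get(cls)
    return types


# Only "AndyCole is a SoccerPlayer" was asserted; the second type is inferred.
print(inferred_types("AndyCole"))
```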

The Semantic Web

Ontologies: The Semantic Backbone

Language Tower in Semantic Web


[Figure: Semantic Web language tower. Surviving layer labels: Standard Syntax, Metadata annotations, Rules & Inference. Reference shown: Web Ontology Language 1.0 Reference.]





[Figure: example ontology. Surviving labels: participants >1, Team-based Sport, participants >1, Club Sport, Sports Club, Soccer Club, Blackburn Rovers.]

[Figure: example ontology, continued. Surviving labels: Sports Tournament, Worthington Cup, Soccer Tournament, Blackburn Rovers, Andy Cole, Soccer Player, Sports Player, Brad Friedel.]

[Figure: example ontology, continued. Surviving labels: Andy Cole, Soccer Player, Sports Player, Blackburn Rovers.]

[Figure: example ontology, continued. Surviving labels: Blackburn Rovers, Brad Friedel, Soccer Player, Sports Player.]






Useful IR System Building Software and Resources

Lucene API (
Pure Java (data abstraction, platform-independence, components

High-performance indexing
  Supports both incremental indexing and batch indexing

Provides accurate and efficient searching mechanisms
  Complex queries based on Boolean and phrase queries, and queries on specific document fields
  Ranked searching, with the highest score returned first

Allows users to develop a variety of new applications:
  Searchable email
  CD-based documentation search
  DBMS object ID management

(support EBIR)

Then click Alexa Web Information Service 1.0 Released

(Probabilistic IR model)

IR research resources

Web Challenges for IR Research Community

Research Issues (1)

IR research field is interdisciplinary in nature.

Traditionally focused on retrieval effectiveness
  Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models, INDRI system at UMASS)
  Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesaurus, document categorization/clustering through machine learning techniques as knowledge acquisition tools)
  Knowledge/semantically richer retrieval approaches (e.g., RUBRIC rule-based IR, some recent concept-based IR based on rules)
  Information filtering based on user profiling

Traditionally based on small text collections

Little work has been done on retrieval efficiency, although we have some reports (e.g., use of parallel architecture for handling index files based on signature files, etc.)


Research Issues (2)

Challenges
  Distributed data: documents spread over millions of different web servers.
  Volatile data: many documents change or disappear rapidly (e.g., dead links); information recency (up-to-datedness).
  Large volume: billions of separate documents.
  Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near) duplicate documents.
  Quality of data: no editorial control, false information, poor quality writing, typos, etc.
    Need large scale knowledge/semantic rich retrieval applications
  Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.


Research Issues (3)

Retrieval effectiveness (all in large scale) with efficiency in mind
  Test effectiveness of IR models with efficiency as an important consideration

Effective and efficient indexing for both documents and queries
  Natural language processing (some statistical)
  Distributed incremental indexing
  System and physical data structure/algorithm issues
  Distributed brokering architecture for information recency

Investigation of semantically richer approaches
  Semantic web, and other rule-based approaches
  Effective and efficient knowledge indexing

Use of users' relevance feedback
  Automatic feedback acquisition
  User profiling and information filtering
  Evaluation measures (predictable)

Text summarization for better presentation
Text categorization (clustering) for topic search (e.g., Yahoo subject directory, Google topic).


Research Issues (4)

Multimedia indexing
  IBM QBIC project (
  Indexing tools for various media types (e.g., an image of mountain with a lake covered by snow, SemCap)

Develop test bed for controllable experiments
  Internet emulator/simulator
  Distributed IR subsystems
  Appropriate performance measures (e.g., RB Precision)

Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency
