Intro to IR and SE, research issues 1
http://comet.lehman.cuny.edu/jung/presentation/presentation.html
Introduction to Modern Information Retrieval and Search Engines, and Some Research Issues
Professor Gwang Jung
Department of Mathematics and Computer Science, Lehman College, CUNY
November 10, 2005 (Fall semester)
Outline
Introduction to Information Retrieval
Introduction to Search Engines (IR Systems for the Web)
Search Engine Example: Google
Brief Introduction to Semantic Web
Useful Tools for IR System Building and Resources for Advanced Research
Research Issues
Introduction to Information Retrieval
Information Age
IR in General
Information Retrieval in general deals with the retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
A user query may be structured (e.g., a Boolean expression of keywords or terms) or unstructured (e.g., terms, a sentence, or a whole document).
In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
Efficiency with respect to: algorithms, query processing, data organization/structure
Effectiveness with respect to: retrieval results
IR Systems
Formal Definition of IR System
IRS = (T, D, Q, F, R)
T: set of index terms (terms)
D: set of documents in a document database
Q: set of user queries
F: D × Q → R (retrieval function)
R: set of real numbers (RSV: Retrieval Status Value)
Relevance Judgment is given by users.
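As an illustration, the tuple above can be sketched in Python. The term-overlap retrieval function used here is a hypothetical stand-in for F, chosen only to make the example concrete; it is not the definition's required function.

```python
# A minimal sketch of IRS = (T, D, Q, F, R).
# F here is an illustrative choice: the number of terms shared by
# a document and a query (returned as the RSV, a real number).

def F(d: set, q: set) -> float:
    """Retrieval function F: D x Q -> R (term-overlap count as RSV)."""
    return float(len(d & q))

# D: documents represented as sets of index terms drawn from T
D = [{"process", "synchronization", "semaphore"},
     {"database", "query", "index"}]

q = {"process", "synchronization"}          # a query, also a set of terms

# Rank documents by descending RSV
ranked = sorted(range(len(D)), key=lambda i: F(D[i], q), reverse=True)
print(ranked)
```

Relevance judgments remain external to this sketch, exactly as in the definition: the RSV orders documents, and users decide which are actually relevant.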
IRS versus DBMS

                          DBMS                            IRS
Objects                   Simple                          Complex (texts)
Information request       Objective structure, specific   Subjective structure, general and ambiguous
Query language            Predicate calculus              How?
Knowledge representation  Transparent                     Complex process
Retrieval process         Deterministic                   Statistical (similarity)
Major success criterion   Correctness                     Utility
Query example             Select student name where       Select items where item's concept deals
                          student's GPA > 3.0             with process synchronization
IR Systems Focus on Retrieval Effectiveness
The effective retrieval of relevant information depends on:
User task: formulating an effective query for the information need
Indexing: IR systems in general adopt index terms to represent documents and queries; indexing is the process of developing document representations by assigning index terms to documents (information items)
Retrieval model (often called the IR model) and the logical view of documents: the logical view (logical representation) of documents depends on the IR model
Indexing
The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
Descriptors = index terms = terms. Descriptors also help users formulate information requests.
Two types of index terms:
Objective: author name, publisher, date of publication
Subjective: keywords selected from the full text
Two types of indexing methods:
Manual: performed by human experts (for very effective IR systems); may use an ontology
Automatic: performed by computer hardware and software
Indexing Aims (1)
Recall: the proportion of relevant items (documents) retrieved.
R = (# of relevant items retrieved) / (total # of relevant items in the database)
Precision: the proportion of retrieved documents that are relevant.
P = (# of relevant items retrieved) / (total # of items retrieved)
Effectiveness of indexing is mainly controlled by:
Term specificity: broader terms may retrieve both useful (relevant) and useless (non-relevant) items for the user; narrower (specific) index terms favor precision at the expense of recall.
Index language (set of well-selected index terms), T = {index term t}:
Pre-specified (controlled): easy maintenance; poor adaptability
Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from users' queries. Synonymous terms can be added to T via a thesaurus, an e-dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
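Both measures can be computed directly from the set of retrieved items and the set of relevant items; the document IDs below are invented for illustration.

```python
# Recall and precision for one query, given the items the IR system
# retrieved and the items judged relevant by the user.

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of all relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}   # 3 relevant items exist in the database

print(recall(retrieved, relevant))     # 2/3
print(precision(retrieved, relevant))  # 2/4 = 0.5
```

Note how the trade-off appears directly: retrieving more items can only raise recall, but usually lowers precision.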
Indexing Aims (2)
Recall and precision values vary from 0 to 1. Users typically want both high recall and high precision; in practice, a compromise must be reached (a middle point).
[Plot: recall-precision trade-off curve; both R and P range from 0 to 1.0]
Steps for Indexing
Objective attributes of a document are extracted (e.g., title, author, URL, structure).
Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
Case folding might be performed. Stemming might be used.
Frequencies of non-function words are used to specify term importance.
Term-frequency weighting fulfils only one of the indexing aims, i.e., recall.
Terms that occur rarely in the document database may be used to distinguish the documents in which they occur from those in which they do not, which could improve precision.
Document frequency: the number of documents in the collection in which a term tj ∈ T occurs.
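The steps above can be sketched as a small pipeline; the stop-word list is an illustrative fragment, not a complete one, and the sample texts are invented.

```python
from collections import Counter

STOP_WORDS = {"of", "then", "this", "and", "the", "a", "is"}  # partial list

def index_terms(text: str) -> Counter:
    """Case-fold, drop stop words, and count term frequencies (tf)."""
    tokens = [t.lower() for t in text.split()]
    return Counter(t for t in tokens if t not in STOP_WORDS)

docs = ["The process of process synchronization",
        "This is a database index"]
tfs = [index_terms(d) for d in docs]

# Document frequency (df): number of documents containing each term
df = Counter()
for tf in tfs:
    df.update(tf.keys())

print(tfs[0]["process"])   # tf of "process" in document 0
print(df["process"])       # df of "process" across the collection
```

Stemming is omitted here; a real pipeline would apply a stemmer (e.g., Porter's algorithm) between case folding and counting.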
Inverted Index File
Index terms and their document frequencies (df): system (3), computer (2), database (4), science (1).
Each index term points to a list of inverted index entries (Dj, tfj), e.g., (D2, 4), (D5, 2), (D1, 3), (D7, 4).
Optionally, postings also record the positions of the term within each document.
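An inverted file with document frequencies and positional postings can be built as follows; the two sample documents and their IDs are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, tf, positions) entries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)       # positional postings
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {"D1": "computer science database systems",
        "D2": "database systems and database design"}
index = build_inverted_index(docs)

# df of a term = length of its entry list
print(len(index["database"]))   # appears in both D1 and D2
print(index["database"][1])     # entry for D2: tf 2, positions [0, 3]
```

The positional lists are what make phrase queries possible: two terms form a phrase in a document when their positions differ by one.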
Retrieval Models (1)
Set-theoretic IR models
Documents are represented by a set of terms. Well-known set-theoretic models:
Boolean IR model: the retrieval function is based on Boolean operations (e.g., and, or, not); the query is formulated in Boolean logic
Fuzzy-set IR model: the retrieval function is based on fuzzy-set operations; the query is formulated in Boolean logic
Rough-set IR model: various set operations have been examined; ad-hoc Boolean queries
Probabilistic IR model
Mainly used for probabilistic index-term weighting; provides a mathematical framework for the well-known tf*idf indexing scheme
Language-model based: infers the query concept from a document as the retrieval process
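The Boolean model above can be sketched with ordinary set operations over an inverted index; the postings sets and document IDs are invented for illustration.

```python
# Boolean retrieval: each term maps to the set of documents containing
# it, and a Boolean query combines those sets with and/or/not.

postings = {
    "process":         {"D1", "D3"},
    "synchronization": {"D1"},
    "database":        {"D2", "D3"},
}
all_docs = {"D1", "D2", "D3"}

# Query: process AND synchronization
and_result = postings["process"] & postings["synchronization"]

# Query: process AND NOT database (NOT = complement over the collection)
not_result = postings["process"] & (all_docs - postings["database"])

print(and_result, not_result)
```

The model's appeal is this exact-match simplicity; its weakness is that every answer is a flat set, with no ranking by degree of relevance.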
Retrieval Models (2)
Vector space model
Queries and documents are represented as weighted vectors. The basis vectors are called term vectors and are assumed to be semantically independent. A document (or query) is represented as a linear combination of vectors in the generating set. The retrieval function is based on the dot product or the cosine measure between document and query vectors.
Extended Boolean IR model
Combines characteristics of the vector space IR model with properties of Boolean algebra. The retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
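A minimal vector-space sketch, assuming tf*idf weights and cosine similarity as the retrieval function (both named in these slides); the toy documents and query are invented.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["database", "index", "query"],
        ["process", "synchronization", "process"]]
N = len(docs)

def tfidf(tokens):
    """Weight each distinct term by tf * log10(N / df)."""
    vec = {}
    for t in set(tokens):
        tf = tokens.count(t)
        df = sum(1 for d in docs if t in d)
        vec[t] = tf * math.log10(N / df) if df else 0.0
    return vec

doc_vecs = [tfidf(d) for d in docs]
query_vec = tfidf(["process", "synchronization"])

scores = [cosine(query_vec, dv) for dv in doc_vecs]
print(scores.index(max(scores)))   # best-matching document
```

Unlike the Boolean model, the output is a ranking: every document gets a real-valued score, so partial matches still surface.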
The Retrieval Process
The Retrieval Process in an IR System
Introduction to Search Engines (IR Systems for the Web)
World Wide Web History
1965 – Hypertext: Ted Nelson developed the idea of hypertext.
Late 1960s: Doug Engelbart invented the mouse and built the first implementation of hypertext at SRI.
Early 1970s: ARPANET was developed.
1982: Transmission Control Protocol (TCP) and Internet Protocol (IP).
1989-1990: WWW, developed by Tim Berners-Lee and others at CERN to organize research documents available on the Internet. Combined the idea of documents available by FTP with the idea of hypertext to link documents. Developed the initial HTTP network protocol, URLs, HTML, and the first web server.
Search Engine (Web-based IR System) History
By the late 1980s, many files were available by anonymous FTP.
In 1990, Alan Emtage of McGill Univ. developed Archie (short for "archives"), which assembled lists of files available on many FTP servers and allowed regular-expression search of these file names.
In 1993, Veronica and Jughead were developed to search the names of text files available through Gopher servers.
In 1993, early web robots (spiders) were built to collect URLs: Wanderer, ALIWEB (Archie-Like Index of the WEB), WWW Worm (indexed URLs and titles for regex search).
In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
Search Engine History (cont’d)
In early 1994, Brian Pinkerton developed WebCrawler as a class project at the University of Washington (it eventually became part of Excite and AOL).
A few months later, Michael "Fuzzy" Mauldin, a professor at CMU, developed Lycos with his graduate students. It was the first to use a standard IR system as developed for the DARPA Tipster project, and the first to index a large set of pages.
In late 1995, DEC developed AltaVista. It used a large farm of Alpha machines to quickly process large numbers of queries, and supported Boolean operators, phrases, and "reverse pointer" queries.
In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University; it made use of link analysis to rank documents.
How Do Web Search Engines Work?
Search engines for the general web search a database of the full text of web pages, selected from billions of web pages; searching is based on inverted index entries.
Search engine databases: full-text documents are collected by software robots (also called softbots or spiders), which navigate the web collecting pages. The web can be viewed as a graph structure, and the navigation can be based on DFS (depth-first search), BFS (breadth-first search), or some combined navigation heuristics. How to detect cycles? A research issue.
The indexer then builds inverted index entries and stores them in inverted files. If necessary, the inverted files may be compressed.
Some types of pages and links are excluded from the search engine; they form the invisible web (maybe many times bigger than the visible web).
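The breadth-first navigation with cycle detection can be sketched as a traversal over a link graph; the in-memory `link_graph` here stands in for fetching a page over HTTP and extracting its out-links, and the page names are invented.

```python
from collections import deque

# Stand-in for the web graph; a real robot would download each URL
# and parse out its hyperlinks instead of looking them up here.
link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],   # back-link creates a cycle
    "c.html": ["d.html"],
    "d.html": [],
}

def bfs_crawl(seed):
    """Breadth-first crawl; the visited set is the cycle detector."""
    visited = {seed}
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                 # "fetch and index" the page
        for link in link_graph.get(url, []):
            if link not in visited:       # skip already-seen URLs
                visited.add(link)
                frontier.append(link)
    return order

print(bfs_crawl("a.html"))
```

Switching the deque's popleft() to pop() turns the same code into a depth-first crawl; the visited set handles cycles identically in both orders.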
Breadth-First Crawling
[Diagram: a breadth-first crawl expands the frontier level by level from the seed pages]
Depth-First Crawling
[Diagram: a depth-first crawl follows a chain of links deep into the graph before backtracking]
Web Search Engine System Architecture
[Diagram: Internet websites → robot → temporary storage → parser → stopper/stemmer → indexer → inverted files (which can be based on different physical data structures); the user queries through an interface, and the retrieval mechanism matches the query against the logical document representation (based on IR models)]
Distributed Architecture (example): Harvest (http://harvest.sourceforge.net/)
A distributed web search engine: it distributes the load among different machines, and the indexer doesn't run on the same machine as the broker or web server.
What Makes a SE Good?
Database of web documents: size of the database, freshness (recency, up-to-dateness), types of documents offered, retrieval speed.
The search engine's capabilities: search options, effectiveness of the retrieval mechanism, support for concept-based search (semantic web).
Concept-based search systems try to determine what you mean, not just what you say. Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
Presentation of the results: keywords highlighted in context; showing a summary of each matching web page.
Search Engine Example (Google)
Google: the most popular web search engine
Crawls the web (by robots) and stores a local cache of found pages. Builds a lexicon of common words; for each word it creates an index list of the pages containing it. Also uses human-compiled information from the Open Directory. Cached links let you see older versions of recently changed pages.
Link analysis system: the PageRank heuristic.
Estimated size of index: 580 million pages visited and recorded; link data is used to reach another 500 million pages (via the link analysis system). A recent estimate is around 4 billion pages.
Index refresh: updated monthly, weekly, or daily for popular pages.
Serves queries from three data centers (service replication); service updates are synchronized. Two are on the West Coast of the US, one on the East Coast.
Google Founders
Larry Page, co-founder & President, Products
Sergey Brin, co-founder & President, Technology
Both were PhD students at Stanford. Google became a public company in 2004.
Market share
[Chart: search engine market share, 2001-2004, for Google, Yahoo!, MSN, Lycos, AltaVista, and AOL, on a 0-50% scale. Source: WebSideStory]
Google Architecture Overview
Google Indexer
term frequencies
Google Lexicon
Google Searcher
Google Features
Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed. Other services also use link popularity, but none to the extent that Google does.
Traditional IR (light use); link popularity (heavily used); citation-importance ranking (quality of the links pointing at a page).
Relevancy: similarity between the query and a page, number of links, link quality, link content, ranking boosts on text styles.
PageRank: usage simulation & citation-importance ranking; the user navigates randomly; the process is modelled by a Markov chain.
Collecting Links in Google
Submission (by web promotion): an Add URL page (you may not need to do a "deep" submit). The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
Crawling and index depth: Google aims to refresh its index on a monthly basis. Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks. This text is associated with the pages a link points at, and it makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.
Google Guidelines for Web-submission
Deep SubmitPro
Link Analysis for Relevancy (1)
Inspired by CiteSeer (NEC Research Institute, Princeton, NJ) and the IBM Clever project (http://www.almaden.ibm.com/cs/k53/clever.html).
Google ranks web pages based on the number, quality, and content of the links pointing at them (citations).
Number of links: all things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
Link quality: numbers aren't everything; a single link from an important site might be worth more than many links from relatively unknown sites. Page importance is weighted: links from important pages are weighted higher.
Link Analysis for Relevancy (2)
Link content: the text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
Ranking boosts on text styles: the appearance of terms in bold text, in header text, or in a large font size is taken into account. None of these are dominant factors, but they do figure into the overall equation.
PageRank
Usage simulation & citation-importance ranking: based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
The user navigates randomly: jumps to a random page with probability p; follows a random hyperlink from the current page with probability 1 - p; does not go back to a previously visited page by following a previously traversed link backwards.
Google thereby finds a type of universally important page: intuitively, locations that are heavily visited in a random traversal of the web's link structure.
PageRank Heuristics
The process is modelled by the following heuristic. The probability of being at each page is computed, with p set by the system:
wj = PageRank of page j
ni = number of outgoing links on page i
m = number of nodes in G (the number of web pages in the collection)

w_j = p/m + (1 - p) * Σ_{(i, j) ∈ G} w_i / n_i
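The heuristic can be computed by straightforward iteration of the PageRank equation, starting from a uniform distribution; the three-page link graph below is invented for illustration, and every page is assumed to have at least one outgoing link.

```python
# Iterative computation of
#   w_j = p/m + (1 - p) * sum over links (i -> j) of w_i / n_i
# links[i] lists the pages that page i points at.

def pagerank(links, p=0.15, iterations=50):
    m = len(links)
    w = {page: 1.0 / m for page in links}           # uniform start
    for _ in range(iterations):
        new_w = {page: p / m for page in links}     # random-jump term
        for i, outgoing in links.items():
            n_i = len(outgoing)
            for j in outgoing:
                new_w[j] += (1 - p) * w[i] / n_i    # link-following term
        w = new_w
    return w

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # the page collecting the most rank
```

Because every page has out-links here, the total rank stays 1 at each iteration; a production implementation must additionally handle dangling pages with no outgoing links.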
PageRank Illustration
[Diagram: pages 1, 2, and 3 link to page j; page j's rank combines the random-jump term p/m with the link-following term (1 - p)(w1/n1 + w2/n2 + w3/n3)]
Google Spamming
Google's link-popularity ranking system leaves it relatively immune to traditional spamming techniques: it goes beyond the text on pages to decide how good they are. No links, low rank.
A common spam idea: create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
"The (Evil) Genius of Comment Spammers," by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7
Topic Search http://www.google.com/options/index.html
Brief Introduction to Semantic Web
Machine-Processable Knowledge on the Web
Unique identity of resources and objects: URIs
Metadata annotations: data describing the content and meaning of resources. But everyone must speak the same language…
Terminologies: shared and common vocabularies. But everyone must mean the same thing…
Ontologies: shared and common understanding of a domain; essential for the exchange and discovery of knowledge
Inference: apply the knowledge in the metadata and the ontology to create new metadata and new knowledge
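Inference over metadata can be sketched with a toy triple store and a single subclass rule; the triples reuse this deck's soccer examples, and the rule is one of many a real reasoner would apply.

```python
# Given explicit "is-a" facts and a subclass hierarchy (the ontology),
# derive class memberships that were never stated directly.

subclass_of = {
    "Soccer Player": "Sports Player",
    "Sports Player": "Person",
}
facts = {("Andy Cole", "is-a", "Soccer Player")}

def infer(facts, subclass_of):
    """Apply the subclass rule until no new triples appear (fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(derived):
            if p == "is-a" and o in subclass_of:
                new = (s, "is-a", subclass_of[o])
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

print(("Andy Cole", "is-a", "Person") in infer(facts, subclass_of))
```

This is the point of the semantic web stack: a query for "Person" can now match a resource that was only ever annotated as "Soccer Player".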
The Semantic Web
Ontologies: The Semantic Backbone
Language Tower in Semantic Web
Identity
Standard Syntax
Metadata annotations
Ontologies
Rules & Inference
Explanation
Attribution
Web Ontology Language 1.0 Reference: http://www.w3.org/TR/owl-ref/
[Ontology diagram: a class hierarchy relating Person, Sport, Soccer, Team-based Sport (participants > 1), Organisation, Club Sport, Sports Club, and Soccer Club; the instance Blackburn Rovers is a Soccer Club located in Blackburn, UK, which is part of Europe]
[Ontology diagram: the Worthington Cup is a Soccer Tournament, which is a Sports Tournament, a Tournament, a Competition, and an Event; Andy Cole and Brad Friedel are Soccer Players (Sports Players) at Blackburn Rovers]
[Ontology diagram: Andy Cole, a Person and Soccer Player (Sports Player) at Blackburn Rovers, has nationality UK (part of Europe) and birthplace Nottingham]
[Ontology diagram: Brad Friedel, a Person and Soccer Player (Sports Player) at Blackburn Rovers, has nationality USA and birthplace Lakewood]
Useful IR System Building Software and Resources
Lucene API (http://lucene.apache.org/)
Pure Java (data abstraction, platform independence, reusable components)
High-performance indexing; supports both incremental indexing and batch indexing
Provides accurate and efficient searching mechanisms: complex queries based on Boolean and phrase queries, and queries with specific document fields; ranked searching, with the highest-scoring results returned first
Allows users to develop a variety of new applications: searchable email, CD-based documentation search, DBMS object-ID management
http://www.getopt.org/luke/
www.egothor.org (supports EBIR)
http://nltk.sourceforge.net/
http://ciir.cs.umass.edu/research/indri/
http://www.summarization.com/
http://wordnet.princeton.edu/
http://protege.stanford.edu/
http://www.google.com/apis/
http://www.amazon.com/gp/browse.html/103-1065429-7111805?%5Fencoding=UTF8&node=3435361
Then click Alexa Web Information Service 1.0 Released
http://mg4j.dsi.unimi.it/
http://www.xapian.org/history.php (Probabilistic IR model)
http://www.searchtools.com/info/info-retrieval.html
IR research resources
http://www-db.stanford.edu/db_pages/projects.html
http://dbpubs.stanford.edu:8090/aux/index-en.html
http://citeseer.ist.psu.edu/
Web Challenges for the IR Research Community
Research Issues (1)
The IR research field is interdisciplinary in nature.
Traditionally focused on retrieval effectiveness:
Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models; the INDRI system at UMass)
Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine-learning techniques as knowledge-acquisition tools)
Knowledge-rich and semantically richer retrieval approaches (e.g., RUBRIC rule-based IR; some recent concept-based IR based on rules)
Information filtering based on user profiling
Traditionally based on small sets of text collections.
Little work has been done on retrieval efficiency, although there are some reports (e.g., use of parallel architectures for handling index files based on signature files).
Research Issues (2): Challenges
Distributed data: documents spread over millions of different web servers.
Volatile data: many documents change or disappear rapidly (e.g., dead links); information recency (up-to-dateness).
Large volume: billions of separate documents.
Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
Quality of data: no editorial control; false information, poor-quality writing, typos, etc. Large-scale knowledge-rich and semantically rich retrieval applications are needed.
Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.
Research Issues (3)
Retrieval effectiveness (all at large scale) with efficiency in mind:
Test the effectiveness of IR models with efficiency as an important consideration
Effective and efficient indexing for both documents and queries: natural language processing (some statistical), distributed incremental indexing, system and physical data-structure/algorithm issues, distributed brokering architectures for information recency
Investigation of semantically richer approaches: the semantic web and other rule-based approaches; effective and efficient knowledge indexing
Use of users' relevance feedback: automatic feedback acquisition, user profiling and information filtering, evaluation measures (predictable)
Text summarization for better presentation
Text categorization (clustering) for topic search (e.g., Yahoo subject directory, Google topic search)
Research Issues (4)
Multimedia indexing: the IBM QBIC project (http://wwwqbic.almaden.ibm.com/); indexing tools for various media types (e.g., an image of a mountain with a lake covered by snow; SemCap)
Develop test beds for controllable experiments: an Internet emulator/simulator, distributed IR subsystems, appropriate performance measures (e.g., RB Precision)
Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.