33
1 Information Retrieval, Search, and Mining Introduction

1 Information Retrieval, Search, and Mining Introduction

Embed Size (px)

Citation preview

Page 1: 1 Information Retrieval, Search, and Mining Introduction

1

Information Retrieval, Search, and Mining

Introduction

Page 2: 1 Information Retrieval, Search, and Mining Introduction

2

Course Topics

• Information Retrieval and Web Search– Information Retrieval Models– Indexing, Compression, and Online Search– Ranking methods: link analysis.– Evaluation Methods

• Text Mining– Text Categorization and Clustering– Recommendation, and Information extraction

• Systems Support– Clustering for online servers and offline computation.

MapReduce/Hadoop. Caching. Data storage and communication.

– Crawling and document parsing.• Basic algorithms:

– Duplicate detection. SVD/Eigen computation. Constrained optimization

Page 3: 1 Information Retrieval, Search, and Mining Introduction

3

References

• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

• Others:– S. Chakrabarti. 2003. Mining the Web: Discovering Knowledge

from Hypertext Data. Morgan Kaufmann. – MG = Managing Gigabytes, by Witten, Moffat, and Bell.

MIR = Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto.

• Selected papers

Page 4: 1 Information Retrieval, Search, and Mining Introduction

4

Information Retrieval System

IRSystem

Query String

Documentcorpus

RankedDocuments

1. Doc12. Doc23. Doc3 . .

Page 5: 1 Information Retrieval, Search, and Mining Introduction

5

Information Retrieval(IR)

• The indexing and retrieval of textual documents.– Searching for pages on the World Wide Web is

challenging

• Concerned firstly with retrieving relevant documents to a query.

• Concerned secondly with retrieving from large sets of documents efficiently.

Page 6: 1 Information Retrieval, Search, and Mining Introduction

6

Relevance

• Relevance is a subjective judgment and may include:– Being on the proper subject.– Being timely (recent information).– Being authoritative (from a trusted source).– Satisfying the goals of the user and his/her

intended use of the information (information need).

Page 7: 1 Information Retrieval, Search, and Mining Introduction

7

Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.

• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Page 8: 1 Information Retrieval, Search, and Mining Introduction

8

Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.– “restaurant” vs. “café”– “PRC” vs. “China”

• May retrieve irrelevant documents that include ambiguous terms.– “bat” (baseball vs. mammal)– “Apple” (company vs. fruit)– “bit” (unit of data vs. act of eating)

Page 9: 1 Information Retrieval, Search, and Mining Introduction

9

Intelligent IR

• Taking into account the meaning of the words used.

• Taking into account the order of words in the query.

• Adapting to the user based on direct or indirect feedback.

• Taking into account the authority of the source.

Page 10: 1 Information Retrieval, Search, and Mining Introduction

10

IR System Architecture

TextDatabase

DatabaseManager

Indexing

Index

QueryOperations

Searching

RankingRanked

Docs

UserFeedback

Text Operations

User Interface

RetrievedDocs

UserNeed

Text

Query

Logical View

Inverted file

Page 11: 1 Information Retrieval, Search, and Mining Introduction

11

IR System Components

• Text Operations forms index words (tokens).– Stopword removal– Stemming

• Indexing constructs an inverted index of word to document pointers.

• Searching retrieves documents that contain a given query token from the inverted index.

• Ranking scores all retrieved documents according to a relevance metric.

Page 12: 1 Information Retrieval, Search, and Mining Introduction

12

IR System Components (continued)

• User Interface manages interaction with the user:– Query input and document output.– Relevance feedback.– Visualization of results.

• Query Operations transform the query to improve retrieval:– Query expansion using a thesaurus.– Query transformation using relevance feedback.

Page 13: 1 Information Retrieval, Search, and Mining Introduction

13

Web Search

• Application of IR to HTML documents on the World Wide Web.

• Differences:– Must assemble document corpus by spidering

the web.– Can exploit the structural layout information

in HTML (XML).– Documents change uncontrollably.– Can exploit the link structure of the web.

Page 14: 1 Information Retrieval, Search, and Mining Introduction

14

Web Search System

Query String

IRSystem

RankedDocuments

1. Page12. Page23. Page3 . .

Documentcorpus

Web Spider

Page 15: 1 Information Retrieval, Search, and Mining Introduction

15

Other IR-Related Tasks

• Automated document categorization• Information filtering (spam filtering)• Information routing• Automated document clustering• Recommending information or products• Information extraction• Information integration• Question answering• Advertisement placement

Page 16: 1 Information Retrieval, Search, and Mining Introduction

Topics: Text mining

• “Text mining” is a cover-all marketing term• A lot of what we’ve already talked about is actually

the bread and butter of text mining:– Text classification, clustering, and retrieval

• But we will focus in on some of the higher-level text applications:– Extracting document metadata– Topic tracking and new story detection– Cross document entity and event coreference– Text summarization– Question answering

Page 17: 1 Information Retrieval, Search, and Mining Introduction

Topics: Information extraction

• Getting semantic information out of textual data– Filling the fields of a database record

• E.g., looking at an events web page:– What is the name of the event?– What date/time is it?– How much does it cost to attend

• Other applications: resumes, health data, …• A limited but practical form of natural language

understanding

Page 18: 1 Information Retrieval, Search, and Mining Introduction

Topics: Recommendation systems

• Using statistics about the past actions of a group to give advice to an individual

• E.g., Amazon book suggestions or NetFlix movie suggestions

• A matrix problem: but now instead of words and documents, it’s users and “documents”

• What kinds of methods are used?• Why have recommendation systems become a

source of jokes on late night TV?– How might one build better ones?

Page 19: 1 Information Retrieval, Search, and Mining Introduction

Topics: XML search

• The nature of semi-structured data

• Tree models and XML

• Content-oriented XML retrieval

• Query languages and engines

Page 20: 1 Information Retrieval, Search, and Mining Introduction

20

History of IR

• 1960-70’s:– Initial exploration of text retrieval systems for

“small” corpora of scientific abstracts, and law and business documents.

– Development of the basic Boolean and vector-space models of retrieval.

– Prof. Salton and his students at Cornell University are the leading researchers in the area.

Page 21: 1 Information Retrieval, Search, and Mining Introduction

21

IR History Continued

• 1980’s:– Large document database systems, many run by

companies:• Lexis-Nexis

• Dialog

• MEDLINE

Page 22: 1 Information Retrieval, Search, and Mining Introduction

22

IR History Continued

• 1990’s:– Searching FTPable documents on the Internet

• Archie

• WAIS

– Searching the World Wide Web• Lycos

• Yahoo

• Altavista

Page 23: 1 Information Retrieval, Search, and Mining Introduction

23

IR History Continued

• 1990’s continued:– Organized Competitions

• NIST TREC

– Recommender Systems• Ringo

• Amazon

• NetPerceptions

– Automated Text Categorization & Clustering

Page 24: 1 Information Retrieval, Search, and Mining Introduction

24

Recent IR History

• 2000’s– Link analysis for Web Search

• Google• Inktomi • Teoma

– Feedback based engine:• DirectHit (Ask.com)

– Automated Information Extraction• Whizbang• Fetch• Burning Glass

– Question Answering• TREC Q/A track• Ask.com/Ask Jeeves

Page 25: 1 Information Retrieval, Search, and Mining Introduction

25

Recent IR History

• 2000’s continued:– Multimedia IR

• Image

• Video

• Audio and music

– Cross-Language IR– Document Summarization

Page 26: 1 Information Retrieval, Search, and Mining Introduction

26

Related Areas

• Database Management

• Information Management– Library and Information Science– Artificial Intelligence/Machine Learning– Natural Language Processing

• Large-scale systems– Operating systems/networking support– Parsing/compression/fast algorithms.– Fault tolerance/paralle+distributed systems

Page 27: 1 Information Retrieval, Search, and Mining Introduction

27

Database Management

• Structured data stored in relational tables rather than free-form text.

• Efficient processing of well-defined queries in a formal language (SQL).

• Semi-structured data, XML

Page 28: 1 Information Retrieval, Search, and Mining Introduction

28

Library and Information Science

• Human user aspects of information retrieval (human-computer interaction, user interface, visualization).

• Effective categorization of human knowledge.

• Citation analysis and bibliometrics (structure of information).

• Digital libraries

Page 29: 1 Information Retrieval, Search, and Mining Introduction

29

Natural Language Processing&IR

• Syntactic, semantic, and pragmatic analysis of natural language text and discourse.

• Analyze syntax (phrase structure) and semantics to retrieve based on meaning rather than keywords.

• Sense of an ambiguous word based on context (word sense disambiguation).

• Identifying specific pieces of information in a document (information extraction).

• Answering specific NL questions from document corpora.

Page 30: 1 Information Retrieval, Search, and Mining Introduction

30

Artificial Intelligence/Machine Learning

• Representation of knowledge, reasoning, and intelligent action.– Formalisms for representing knowledge– First-order Predicate Logic– Bayesian Networks

• Web ontologies and intelligent agents• Machine learning:

– Automated classification (supervised learning).– Clustering unlabeled examples (unsupervised

learning).

Page 31: 1 Information Retrieval, Search, and Mining Introduction

31

Machine Learning&IR

• Text Categorization– Automatic hierarchical classification .– Adaptive filtering/routing/recommending.– Automated spam filtering.

• Text Clustering– Clustering of IR query results.– Automatic formation of hierarchies

• Learning for Information Extraction• Text Mining

Page 32: 1 Information Retrieval, Search, and Mining Introduction

32

System Supports for Internet-Scale Data-Intensive Search/Mining

• System Challenges for Large-scale Processing, Fast Response, and High Availability.– Multiprocessing, Machine clustering,

multiprocessing. Networking– Thread/process/memory management. IO

• Middleware support.• Software engineering (e.g. debugging)• Data center management.

Page 33: 1 Information Retrieval, Search, and Mining Introduction

33

Fast Algorithms/Computing

• String comparison/manipulation • Duplicates.• Compression.• Graph/matrix computation

– Ranking matrix. Sparse matrices– Eigen computation and SVD

• Linear/nonlinear programming for optimization (e.g. used for SVM).