
©2012 Paula Matuszek

CSC 9010: Text Mining Applications: Information Retrieval

Dr. Paula Matuszek
[email protected]
[email protected]
(610) 647-9789


Knowledge

Knowledge is captured in large quantities and many forms. Much of that knowledge is in unstructured text:
– books, journals, papers
– letters
– web pages, blogs, tweets

Capturing knowledge in text is a very old process:
– accelerated greatly with the invention of the printing press
– and again with the invention of computers
– and again with the advent of the web

Thus the increasing importance of text mining!


A Question!

People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want. They have an information need.

Hopefully there is some information somewhere that will satisfy that need.

At its most general, information retrieval is the process of finding the information that meets that need.


Basic Information Retrieval

Simplest approach:
– Knowledge is organized into chunks (pages)
– Goal is to return appropriate chunks

Not a new problem, but some new solutions!
– Web search engines
– Text mining includes this process also
  – still dealing with lots of unstructured text
  – finding the appropriate “chunk” can be viewed as a classification problem


Search Engines

The goal of a search engine is to return appropriate chunks. The steps involved include:
– asking a question
– finding answers
– evaluating answers
– presenting answers

The value of a search engine depends on how well it does on all of these.


Asking a Question

A query reflects some information need. The query syntax needs to allow that information need to be expressed:
– Keywords
– Combining terms
  – Simple: “required”, NOT (+ and -)
  – Boolean expressions with and/or/not and nested parentheses
  – Variations: strings, NEAR, capitalization
– Simplest syntax that works
– Typically more acceptable if predictable

Another set of problems arises when the information isn’t text: graphics, music.


Finding the Information

The goal is to retrieve all relevant chunks. This is too time-consuming to do in real time, so search engines index pages.

Two basic approaches:
– Index and classify by hand
– Automate

For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue:
– stemming?
– stopwords?
– capitalization?


Indexing by Hand

Indexing by hand involves having a person look at information items and assign them to categories.
– Assumes a taxonomy of categories exists
– Each document can go into multiple categories
– Creates high-quality indices
– Expensive to create
– Supports hierarchical browsing for retrieval as well as search

Inter-rater reliability is an issue; it requires training and checking to get consistent category assignment.


Indexing By Hand

For focused collections, even very large ones, hand indexing is feasible:
– Medline
– ACM papers
– The NY Times hand-indexes all abstracts

For the web as a whole, it is not yet feasible. Evolving solutions:
– social bookmarking: delicious, reddit, digg
– hash tags: Twitter, Google+
– sometimes pre-structured, but more often completely freeform
– in some domains these become a folksonomy


Automated Indexing

Automated indexing involves parsing documents to pull out keywords and creating a table which links keywords to documents.
– Doesn’t have any predefined categories or keywords
– Can cover a much higher proportion of the information available
– Can update more quickly
– Much lower quality; therefore it is important to have some kind of relevance ranking


Or IE-Based

Good information extraction tools can be used to extract the important terms:
– using gazetteers and ontologies to identify terms
– using named entity and other rules to assign categories

I2EOnDemand is a good example.


Automating Search

Automating search always involves balancing factors:
– Recall vs. precision
  – Which is more important varies with the query and with coverage
– Speed, storage, completeness, timeliness
  – Query response needs to be fast
  – Documents searched need to be current
– Ease of use vs. power of queries
  – Full Boolean queries are very rich, but very confusing
  – Simplest is “and”ing together keywords: fast and straightforward


Search Engine Basics

A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents.
A searcher uses the hash table to retrieve documents based on words.
A ranking system decides the order in which to present the documents: their relevance.
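
To make the pipeline concrete, here is a minimal sketch of the spider/parser/indexer loop in Python, using only the standard library. The seed list, page limit, and regex-based parsing are illustrative assumptions; a real crawler also needs robots.txt politeness, deduplication, and error handling.

    import re
    import urllib.request
    from collections import defaultdict, deque

    def crawl_and_index(seed_urls, max_pages=10):
        index = defaultdict(set)      # indexer's hash table: word -> page URLs
        frontier = deque(seed_urls)   # pages the spider still has to visit
        seen = set()
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            text = re.sub(r"<[^>]+>", " ", html)             # parser: strip tags
            for word in re.findall(r"[a-z]+", text.lower()):
                index[word].add(url)                         # connect word to document
            frontier.extend(re.findall(r'href="(http[^"]+)"', html))  # follow links
        return index

With such a table, the searcher is just a lookup: index["retrieval"] returns every crawled page containing that word; ranking the results is a separate step.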


Selecting Relevant Documents

Assume:
– we already have a corpus of documents defined
– the goal is to return a subset of those documents
– individual documents have been separated into individual files

The remaining components must parse, index, find, and rank documents. The traditional approach is based on the words in the documents (it predates the web).


Extracting Lexical Features

Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
– in place (a problem for the www)

This is a standard lexical analysis problem, handled by a lexical analyser generator such as lex, or by tokenizers such as the NLTK and GATE tokenizers.
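
As a quick illustration, here is the NLTK tokenizer mentioned above turning a string of characters into tokens (this assumes NLTK is installed and its punkt tokenizer models have been downloaded):

    # pip install nltk, then nltk.download('punkt') once
    from nltk.tokenize import word_tokenize

    print(word_tokenize("Search engines index pages, don't they?"))
    # ['Search', 'engines', 'index', 'pages', ',', 'do', "n't", 'they', '?']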


Lexical Analyser

The basic idea is a finite state machine: triples of (input state, transition token, output state).
It must be very efficient; it gets used a LOT.

[State diagram: state 0 loops on blank; on A-Z it moves to state 1, which loops on A-Z; on blank or EOF it moves to state 2, the end of a token.]
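
A minimal sketch of that machine in Python, assuming the three states in the diagram: state 0 skips blanks, state 1 accumulates A-Z characters, and a blank or end-of-input (state 2) emits the finished token. A production lexer would be table-driven and handle the design issues listed below.

    def fsm_tokenize(text):
        tokens, current, state = [], [], 0
        for ch in text + " ":                    # trailing blank stands in for EOF
            if state == 0:                       # state 0: between tokens
                if ch.isalpha():
                    current.append(ch)           # 0 --letter--> 1
                    state = 1
            elif state == 1:                     # state 1: inside a token
                if ch.isalpha():
                    current.append(ch)           # 1 --letter--> 1
                else:
                    tokens.append("".join(current))  # 1 --blank/EOF--> 2: emit token
                    current, state = [], 0           # and restart at state 0
        return tokens

    print(fsm_tokenize("finite state machines are fast"))
    # ['finite', 'state', 'machines', 'are', 'fast']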


Design Issues for Lexical Analyser

Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?

Case:
– fold?

Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?


Lexical Analyser

The output of the lexical analyser is a string of tokens; the remaining operations are all on these tokens.
We have already thrown away some information. This makes processing more efficient, but limits the power of our search:
– we can’t distinguish “VITA” from “Vita”
– this can be somewhat remedied at the “relevance” step


Stemming

Stemming is additional processing at the token level that turns words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”

It decreases the total number of different tokens to be processed.
It decreases the precision of a search, but increases its recall.
NLTK, GATE, and other toolkits provide stemmers.
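
As a sketch, here is NLTK's Porter stemmer on the examples above. Note that a rule-based stemmer handles regular forms like "cars" and "walked", while an irregular plural like "children" needs a dictionary-based lemmatizer instead (here NLTK's WordNet lemmatizer, assuming the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["cars", "walked", "children"]])
    # ['car', 'walk', 'children'] -- the irregular plural is left untouched

    lemmatizer = WordNetLemmatizer()           # requires nltk.download('wordnet')
    print(lemmatizer.lemmatize("children"))    # 'child'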


Noise Words (Stop Words)

Function words that contribute little or nothing to meaning; very frequent words.
– If a word occurs in every document, it is not useful in choosing among documents
– However, we need to be careful, because this is corpus-dependent

Often implemented as a discrete list.
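
A minimal sketch of stopword filtering with NLTK's discrete English list (assuming nltk.download('stopwords') has been run); as noted above, in practice the list should be tuned to the corpus:

    from nltk.corpus import stopwords

    stops = set(stopwords.words("english"))    # a discrete, language-specific list
    tokens = ["the", "spider", "follows", "a", "link", "to", "the", "page"]
    print([t for t in tokens if t not in stops])
    # ['spider', 'follows', 'link', 'page']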


Example Corpora

We are assuming a fixed corpus. Some sample corpora:
– Medline abstracts
– Email (anyone’s email)
– the Reuters corpus
– the Brown corpus

Documents have textual fields and structured attributes:
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content


Structured Attributes for Medline

– PubMed ID
– Author
– Year
– Keywords
– Journal


Textual Fields for Medline

Abstract:
– reasonably complete, standard academic English
– captures the basic meaning of the document

Title:
– short, formalized
– captures the most critical part of the meaning
– a proxy for the abstract


Structured Fields for Email

– To, From, Cc, Bcc
– Dates
– Content type
– Status
– Content length
– Subject (partially)


Text Fields for Email

Subject:
– format is structured, content is arbitrary
– captures the most critical part of the content
– a proxy for the content, but may be inaccurate

Body of email:
– highly irregular, informal English
– the entire document, not a summary
– spelling and grammar irregularities
– structure and length vary


Indexing

We now have a tokenized, stemmed sequence of words. The next step is to parse each document, extracting index terms.
– Assume that each token is a word and we don’t want to recognize any more complex structures than single words.

When all documents are processed, create the index.


Basic Indexing Algorithm

For each document in the corpus:
– get the next token
– save the posting in a list
  – doc ID, frequency

For each token found in the corpus:
– calculate #docs and total frequency
– sort by frequency

This is the inverted index.
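
A minimal sketch of that algorithm, representing the corpus as a dict of token lists (an illustrative assumption about the input format):

    from collections import Counter, defaultdict

    def build_inverted_index(corpus):          # corpus: {doc_id: [token, ...]}
        postings = defaultdict(list)           # token -> [(doc ID, frequency), ...]
        for doc_id, tokens in corpus.items():
            for token, freq in Counter(tokens).items():
                postings[token].append((doc_id, freq))
        # summarize each token as (#docs, total frequency), sorted by frequency
        summary = {t: (len(p), sum(f for _, f in p)) for t, p in postings.items()}
        return postings, sorted(summary.items(), key=lambda kv: -kv[1][1])

    docs = {1: ["web", "search", "web"], 2: ["search", "engine"]}
    postings, summary = build_inverted_index(docs)
    print(postings["web"])     # [(1, 2)]
    print(summary)             # [('web', (1, 2)), ('search', (2, 2)), ('engine', (1, 1))]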


Fine Points

– Dynamic corpora require incremental algorithms
– Higher-resolution data (e.g., character position)
– Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
– Document-type-specific processing
  – in HTML, we want to ignore tags
  – in email, we may want to ignore quoted material


Choosing Keywords

We don’t necessarily want to index on every word:
– takes more space for the index
– takes more processing time
– may not improve our resolving power

How do we choose keywords?
– manually
– statistically

Exhaustivity vs. specificity.


Manually Choosing Keywords

Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants.
– “best” match
– captures new terms easily
– easiest for the person choosing keywords

Constrained vocabulary: hand-crafted ontologies.
– can include hierarchical and other relations
– more consistent
– easier for searching; a possible “magic bullet” search


Examples of Constrained Vocabularies

ACM Computing Classification System (www.acm.org/class/1998):
– H: Information Systems
  – H3: Information Storage and Retrieval
    – H3.3: Information Search and Retrieval
      » Clustering
      » Information Filtering
      » Query formulation
      » Relevance feedback
      » etc.

Medline Headings (www.nlm.nih.gov/mesh/MBrowser.html):
– L: Information Science
  – L01: Information Science
    – L01.700: Medical Informatics
      – L01.700.508: Medical Informatics Applications
        – L01.700.508.280: Information Storage and Retrieval
          » MedlinePlus [L01.700.508.280.730]


Automated Vocabulary Selection

Frequency: Zipf’s Law.
– In a natural language corpus, the frequency of a word is inversely proportional to its rank in a frequency table.

Within one corpus, words with middle frequencies are typically “best”.
– We have used this in NLTK classification, ignoring the most frequent terms in creating the BOW.

Document-oriented representation bias: lots of keywords per document.
Query-oriented representation bias: only the “most typical” words; assumes that we are comparing across documents.
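
A sketch of that middle-frequency heuristic: rank words by corpus frequency, then skip both the most frequent (stopword-like) head and the rare tail. The cutoff fractions are illustrative assumptions, not fixed rules:

    from collections import Counter

    def middle_frequency_vocab(tokens, head_cut=0.05, tail_cut=0.5):
        ranked = [w for w, _ in Counter(tokens).most_common()]  # Zipf-style ranking
        lo = int(len(ranked) * head_cut)       # drop the most frequent words
        hi = int(len(ranked) * tail_cut)       # drop the long tail of rare words
        return ranked[lo:hi]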


Choosing Keywords

“Best” depends on actual use; a word that occurs in only one document may be very good for retrieving that document, but it is not very effective overall.
Words which have no resolving power within a corpus may be the best choices across corpora.


Keyword Choice for WWW

We don’t have a fixed corpus of documents.
New terms appear fairly regularly, and are likely to be common search terms.
The queries people want to make are wide-ranging and unpredictable.
Therefore we can’t limit keywords, except possibly to eliminate stop words. Even stop words are language-dependent, so determine the language first.


Comparing and Ranking Documents

Once our search engine has retrieved a set of documents, we may want to:
– Rank them by relevance
  – Which are the best fit to my query?
  – This involves determining what the query is about and how well the document answers it
– Compare them
  – Show me more like this.
  – This involves determining what the document is about.


Determining Relevance by Keyword

The typical web query consists entirely of keywords.
Retrieval can be binary: present or absent.
More sophisticated is to look for a degree of relatedness: how much does this document reflect what the query is about?
Simple strategies (sketched in code below):
– How many times does the word occur in the document?
– How close is it to the head of the document?
– If there are multiple keywords, how close together are they?


Keywords for Relevance Ranking

Count: repetition is an indication of emphasis.
– very fast (usually in the index)
– a reasonable heuristic
– unduly influenced by document length
– can be "stuffed" by web designers

Position: lead paragraphs summarize content.
– requires more computation
– also a reasonable heuristic
– less influenced by document length
– harder to "stuff"; you can only have a few keywords near the beginning


Keywords for Relevance Ranking

Proximity for multiple keywords:
– requires even more computation
– obviously relevant only if we have multiple keywords
– the effectiveness of the heuristic varies with the information need; it is typically either excellent or not very helpful at all
– very hard to "stuff"

All keyword methods:
– are computationally simple and adequately fast
– are effective heuristics
– typically perform as well as in-depth natural language methods for standard search


Comparing Documents

"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what the document is about overall, which depends on the context of the query. We need to:
– characterize the entire content of this document
– discriminate between this document and others in the corpus

This is basically a document classification problem.


Describing an Entire Document

So what is a document about?
– TF*IDF: we can simply list keywords in order of their TF*IDF values.
– The document is about all of them to some degree: it is at some point in some vector space of meaning.
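
A minimal TF*IDF sketch under the usual definition, term frequency times inverse document frequency. The exact weighting (log scaling, smoothing) varies by system, so this is one common variant rather than the definitive formula:

    import math
    from collections import Counter

    def tf_idf(doc_tokens, corpus):                      # corpus: list of token lists
        N = len(corpus)
        scores = {}
        for term, tf in Counter(doc_tokens).items():
            df = sum(1 for doc in corpus if term in doc) # documents containing term
            scores[term] = tf * math.log(N / df)         # tf * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])

    corpus = [["web", "search", "web"], ["search", "engine"], ["text", "mining"]]
    print(tf_idf(corpus[0], corpus))
    # [('web', 2.197...), ('search', 0.405...)] -- 'web' best describes document 0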


Vector Space

Any corpus has a defined set of terms (the index). These terms define a knowledge space, and every document is somewhere in that knowledge space: it is or is not about each of those terms.
Consider each term as a vector. Then:
– we have an n-dimensional vector space
– where n is the number of terms (very large!)
– each document is a point in that vector space

The document’s position in this vector space can be treated as what the document is about.


Similarity Between Documents

How similar are two documents?
– Measures of association
  – How much do the feature sets overlap?
  – Simple matching coefficient: takes exclusions into account
– Cosine similarity
  – the similarity of the angle between the two document vectors
  – not sensitive to vector length

These are the same basic similarity ideas as in classification and clustering; a sketch of cosine similarity follows.
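
A small sketch of cosine similarity over raw term-frequency vectors (TF*IDF weights could be substituted): the cosine of the angle between the two document vectors, insensitive to document length:

    import math
    from collections import Counter

    def cosine_similarity(tokens_a, tokens_b):
        a, b = Counter(tokens_a), Counter(tokens_b)      # term-frequency vectors
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    print(cosine_similarity(["web", "search", "web"],
                            ["web", "search", "engine"]))   # ~0.775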


Additional Search Engine Issues

– Freshness: how often to revisit documents
– Eliminating duplicate documents
– Eliminating multiple documents from one site
– Providing good context
– Non-content-based features: citation graphs (the basis of PageRank)
– Search engine optimization


Beyond Simple Search

– Information extraction on queries to recognize some common patterns
  – airline flights
  – tracking #s
– Rich IE systems like I2E
– Taxonomy browsing


Beyond Unstructured Text

– “Improved” search
– Specific types of text
– Non-text search constraints
– Faceted search
– Searching non-text information assets
– Personalizing search


Improved Search

Modern search engines tweak your query a lot. Google, for instance, says it will normally:
– suggest spelling corrections and alternative spellings
– personalize your search by using information such as sites you’ve visited before
– include synonyms of your search terms
– find results that match terms similar to those in your query
– stem your search terms

(http://support.google.com/websearch/bin/answer.py?hl=en&p=g_verb&answer=1734130)


Domain of Documents

It may be desirable to limit search to specific types of documents. Google gives you, among other things:
– news
– books
– blogs
– recipes
– patents
– discussions

This may be based on the source, or it may be text mining (classification) at work :-)


Non-Text Constraints

We may know some things about documents that are not captured in the tokens.
Meta-information included in the document:
– date created and modified
– author, department
– keywords or tags

Information that can be determined by examining the document:
– language
– images included?
– reading level


Faceted Search

Faceted search: constrain search along several attributes.
A research topic in information retrieval, especially over the last ten years:
– the Flamenco project at Berkeley (flamenco.berkeley.edu)
– the CiteSeer project at Penn State (citeseer.ist.psu.edu)

Labeling with facets has the same issues as hand-indexing, except when you already have the information in a database somewhere.
It has become popular primarily for online commerce.


Faceted Search Examples

Search Amazon for “rug”:
– generally applicable facets: Department, Amazon Prime Eligible, Average Customer Review, Price, Discount, Availability
– object-specific facets: size, material, pattern, style, theme, color

Flamenco Fine Arts demo:
– http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/famuseum/Flamenco


Searching Non-Text Assets

Our information chunks might not be text:
– images, sounds, videos, maps

Search over images, sounds, and videos is often based on proxy information: captions, tags.
Information extraction is useful for maps.
This is still an active research area, typically at the intersection of information retrieval and machine learning (http://www.gwap.com/gwap/).


Personalized Search

The searches you have done in the past tell a lot about your information needs.
If a search engine has, and makes use of, that information, it can improve your searches:
– web page history
– cookies
– explicit sign-in

Relevance can be influenced by pages you’ve visited and click-throughs.
It may go beyond that to use information such as your blog posts, friends links, etc.
This can be a privacy issue.


Summary

– Information retrieval is the process of finding and providing to the user chunks of information which are relevant to some information need.
– Where the chunks are text, free-text search tools are the common approach.
– Text mining tools such as document classification and information extraction can improve the relevance of search results.
– This is not new with the web, but the web has had a massive impact on the area.
– It continues to evolve, rapidly.