1
T t P i O i
Lecture 2
Text Processing Overview
Lecture 2CS 510
Information Retrieval on the Internet
Text Processing PipelineDocuments “<p>Friends, Romans &
countrymen</p>”
Document Type ID& Text Extraction
Document Type ID& Text Extraction
Text String
Token Stream
“Friends, Romans & countrymen”
“Friends” “Romans” “countrymen”
TokenizationTokenization
Linguistic ModulesLinguistic Modules
2IR 2010
Terms
Postings
“friend” “roman” “countryman”
(term6, doc3); (term23, doc3); (term18, doc3)
Linguistic ModulesLinguistic Modules
Dictionary LookupDictionary Lookup
2
Text Processing Goals
• Enable high quality retrieval results• Represent document content and
characteristics efficiently– Storage space– Efficient matching of queries to documents
3IR 2010
Terminology
Token: Occurrence of a string of charactersToken Type: All tokens with the same charsTerm: Equivalence class of tokens
GoetzGötz gotzG tGotz
Recall vs. PrecisionIR 2010 4
3
Accessing Document Content
• Identify document types• Handle each appropriately
5IR 2010
Challenge Possible approachScanned documents OCR
Accessing Document Content
Compressed, encrypted, zipped files
Uncompress, decrypt, unzip
Word processed and other application-specific documents
Application-specific handler
Complex HTML pages: Parse HTML to extract
6
Complex HTML pages:•Frames•Dynamic content•Scripts embedded in HTML
Parse HTML to extract contentSite-specific wrapper?
IR 2010
5
(Part-of-Speech \(POS\) Tagging)endobj89 0 obj<< /S /GoTo /D (Outline7.8.56) >>endobj92 0 obj(Chunking and Parsing)endobj93 0 obj
Excerpt from a PDF file93 0 obj<< /S /GoTo /D (Outline7.9.64) >>endobj96 0 obj(Semantics)endobj97 0 obj<< /S /GoTo /D (Outline7.10.71) >>endobj100 0 obj
9
j(Pragmatics: Co-reference resolution)endobj101 0 obj<< /S /GoTo /D (Outline8) >>endobj104 0 obj(Performance Evaluation)endobj IR 2010
�ÐÏ à¡±� �á > þÿ � � � � 5 7 þÿÿÿ 4 ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á q` � � � � � � ø ¿ ß bjbjqPqP �� � �� � 8 : : ß � � � � � �ÿÿ ÿÿ ÿÿ ¤ Ü Ü Ü Ü Ü Ü Ü ð ” ” ” ” ð ä 2 À Ö
� � � � � � � � �� � � �Ö Ö Ö Ö Ö Ö c e e e e e e $ h ~ j ‰ � � � � Ü Ö Ö Ü Ö Ü Ö c Ö ´
� �H}A¦Æ ” ¬ " ß } ¦�c �́ � � � 0 ä ë x è Î
� � � � � �. è c è Ü c Ö Ö �
� �Ö Ö Ö Ö Ö ‰ ‰ ü� � �Ö Ö Ö ä Ö Ö Ö Ö ð ð ð ¤ ” ð
�ð ð ” ð ð ð Ü Ü Ü Ü Ü Ü ÿÿÿÿ � � HYPERLINK "http://www.ultraseek.com/support/docs/UltraseekAdmin/wwhelp/wwhimpl/js/html/starthelp.htm" � htt // lt k / t/d /Ult kAd i / h l / hi l/j /ht l/ t th l ht
Excerpt from a Word file
10
�� http://www.ultraseek.com/support/docs/UltraseekAdmin/wwhelp/wwhimpl/js/html/starthelp.ht�m
Importing from Ultraseek
Ultraseek collection configurations are text files named configuration that are located in the directory you specified during installation to hold search data. This section explains how to import these files so that Ultraseek can re-create the described collection.IR 2010
6
Identifying Document Structure
• To help extract textT l it t t f hi• To exploit structure for searching
• Examples– Complex web pages– Title, metadata, headings– Chapter, section, paragraph, sentence
11
• What is a sentence? How do you recognize one?– Email: header, body, attachment– Tables
IR 2010
C f S S
How to automatically recognize a sentence?
Washington D.C. is the capital of the U.S. and Salem O.R. is a state capital. The D.C. metro is a convenient subway. At clerk.house.gov you can find a listing of members of the U.S. congress including Peter A. DeFazio, John Conyers Jr. and Jim McDermott, who is also sometimes known as Dr. Jim McDermott because he has an M D degree
12
McDermott because he has an M.D. degree.
IR 2010
7
source recipient(s)date
subject
Document structure: email
Header
13
BodyIR 2010
Deciding What and How to Index
• HTML tags?URL? P titl ? J i t?• URL? Page title? Javascript?
• Ads?• Navigation bars?• Text in tables?• Text associated with images?
14
g• Maintain information about document structure?
– Index document parts as separate fields?
IR 2010
8
What should you index on this page?
How would you extract it from all
15
the other stuff?
IR 2010
Lexical analysis
• Parse the text into words– What is a word?
• Challenges:– Punctuation– Special characters
Hyphens
16
– Hyphens– Case– Digits
IR 2010
9
Lexical analysis: Punctuation• Most can just be eliminated• Periods
E d f t– End of sentence– Abbreviations and acronyms– “Dots” in URLs?
• Apostrophes– How much do they change the meaning?
• Grants versus Grant’sE t t h l ?
17
– Exact match only?– Eliminate all in both query and index?– Eliminate some and not others?– Retain and use some algorithm for matching?
• What do current search engines do?IR 2010
18IR 2010
10
Lexical analysis: Special Chars
• What to do with &, $, !?– Are they different if they occur alone vs. part
of a word?• Characters used in other languages
– Exact match only?– Automatically match common translations?
19
Automatically match common translations?• el nino and el niño?
• What do current web search engines do?
IR 2010
20IR 2010
11
Lexical analysis: Hyphens
• Usage often variesM t h i• May or may not change meaning– world wide, world-wide, worldwide– client server, client-server, client/server,
ClientServer– A-Boy, a boy
G ll fl diff f
21
• Generally want to conflate different forms that refer to the same concept
IR 2010
Hyphens in Index vs. Query
Handle differently in indexing and query• Indexing
world wide worldwide
world-wide world-wideworldwide worldwide
• Querying(world AND wide) OR
world-wide (world-wide) OR(worldwide)
IR 2010 22
12
Lexical analysis: Case• Upper case, lower case, mixed
– Case can be useful for distinguishing proper nouns and Case ca be use u o d st gu s g p ope ou s a dacronyms
– Most search engines appear to convert all text to lower or upper case
• Grant == grant on Google, Yahoo, MSN• Bush == bush on Google, Yahoo, MSN• Or do they?
– look further down in the Google results; order differs slightlyUl k ( i l f f b i /i
23
– Ultraseek (commercial software for website/intranet use)
• if query is lower case, matching is case-insensitive• If query has any upper case characters, matching is exact only
IR 2010
Lexical analysis: Digits
Some early sources suggest that digits ll t f lgenerally not very useful
What about addresses, phone numbers, numbers with significant meaning? (e.g. 911)
Major search engines do index digits
24IR 2010
13
Stemming• Purpose is to conflate inflexional variations on a
word that refer to the same conceptword that refer to the same concept– singular and plural forms of nouns– different tenses of the same verb– query rain cat dog should match “raining cats and
dogs”• Okay if stemmed form is not meaningful
U d ’t th t ti i d
25
– User doesn’t see or care that computing in query and computed in text were both stemmed to comput
IR 2010
Stemming strategies (1)• Algorithmic
– Affix removale o a• Usually only remove suffixes since prefixes often change
meaning of the word (e.g. do, undo)– Successor variety
• Finds boundaries between morphemes (word segments)• Morphemes are the smallest linguistic units with semantic
meaning (e.g. unreadable un-read-able)• Successor variety of a string is the number of different
characters that follow it in words in some body of text
26
characters that follow it in words in some body of text.– Text: read, readable, reading, reads, red, rope, ripe– R 3, RE 2, REA 1, READ 3, READA 1, READAB 1,
READABL 1, READABLE 1• Look for peak and plateau to break words
IR 2010
14
Stemming strategies (2)
• Algorithmic, cont.N grams– N-grams
• An N-gram is a sub-sequence of length n from a sequence • Uses character sequences within words• Clusters words based on number of shared N-grams• Language-independent
• Dictionary
27
– Table lookup• Mixed
– Use algorithm with table lookup for exceptions
IR 2010
Stemming
Porter Stemmer Outputa aabase abasabate abatabated abatabatement abatabbess abbess
28
abbey abbeiabide abidabides abidabjectly abjectli
IR 2010
15
Stemming
• Effect on retrieval is controversialS l i it i ll b t• Some claim it improves recall but may degrade precision
• Probably depends on many factors:– Language– Stemming method
29
– Document collection– Queries– Nature of user task
IR 2010
Stemming summaryAdvantagesSearcher doesn’t have to
ti i t th ’ t
DisadvantagesMay decrease precision
anticipate author’s exact usagetsunami can match:•What happens to a tsunami as it approaches land?•Tsunamis have been historically referred to as tidal
•Might want to match specific word•Might be okay for query tired to match other forms of to tire, but not automobile tireStemming can change meaning•bushing – an insulating liner in an
30
historically referred to as tidal waves because ...
May increase recallResults in more compact indexing vocabulary
g gopening through which conductors pass
•bush - A low shrub with many branches
Stemming algorithms imperfect –occasionally lead to odd resultsIR 2010
16
Lemmatization
Lemma = linguistic “base word”made makesaw see
Stemming is a syntactic processLemmatization is a semantic process
Req ires a look p table in generalRequires a look-up table in general
IR 2010 31
Token Normalization
Equivalence class of tokens term{Co-worker, co-worker, Coworker, coworker} coworker
How to implement?Implicit: Use rules, such as hyphen removal
– apply for indexing and querying– apply for indexing and querying– only index target term
IR 2010 32
17
Token Normalization 2
Explicit: index tokens separatelycoworker, co-worker each has a list
• Apply during indexingcoworker is indexed under coworker and co-worker
• Apply during queryApply during querycoworker (coworker OR co-worker)allows suppression: “coworker”
IR 2010 33
Stopwords
• Very common words are often thought not to be very good discriminators between documentsvery good discriminators between documents– Common words often convey little of what a
document is about– Articles, conjunctions, prepositions– Does the or of tell you what a document is about?
• Eliminating common words (stopwords) reduces
34
Eliminating common words (stopwords) reduces the size of the index
IR 2010
18
Stopwords
• Typically implemented as a set of words f filt i b th d t t t d ifor filtering both document text and queries
• Size of list may vary, e.g.– MEDLARS stoplist only 7 words
• and, an, by, from, of, the, with– Other published lists have 250, 471 words
35
Other published lists have 250, 471 words• May want a collection-specific stopword
list
IR 2010
Example stopword list applied to a documenta, about, an, and, are, as, by, for, from, had, have, he, his, him, in, into, of, on, or, that, the, this, to, was, with, were
Here were the servants of your adversary,And yours, close fighting ere I did approach:I drew to part them: in the instant cameThe fiery Tybalt, with his sword prepared,Which, as he breathed defiance to my ears,He swung about his head and cut the winds
36
He swung about his head and cut the winds,Who nothing hurt withal hiss'd him in scorn:While we were interchanging thrusts and blows,Came more and more and fought on part and part,Till the prince came, who parted either part.
IR 2010
19
Stopwords
• Common words may be important– Phrases
• Query for “to be or not to be”– Special meaning in context
• Vitamin A– Abbreviations and acronyms
37
• OR as abbreviation (ORegon) or acronym (Operating Room)
IR 2010
Phrases containing common words
• Index all words• Ignore stopwords in indexing but account for• Ignore stopwords in indexing, but account for
them in calculating postion– use position information for “significant” words “the cat in the hat”: cat at p, hat at p+3
– Accept risk of false matches, or– Filter matching docs by examining full text
38IR 2010
20
39IR 2010
to and be notto and be noteliminated in ad search
to and be doappear to be eliminated in
b h
40
web search
IR 2010
21
41IR 2010
Index term selection
A difficult balance– Conflating tokens risks bringing in irrelevant
results: co-op coop
– If you don’t, could miss relevant documents: co-worker vs. coworker
Might be possible to adjust the balance with
42
g p jweightsco-op to co-op is a “stronger” match than co-op to coop
IR 2010