Hercules Dalianis 1
Hercules Dalianis, DSV-SU-KTH
email:[email protected] 070-568 13 59 / 08-674 75 47
• Stemming, truncation, query expansion
• Multi-word queries and phrase search
• Spell checking
• Synonym search
• KWIC (Key Word In Context): extract of relevant context around search words
• Index size
• Truncation
  – förskol* => förskola, förskolor, förskolelärare (kindergarten, kindergartens, kindergarten teacher)
• Stemming + some rules
  – förskola => förskol (kindergarten stem)
  – förskolelärare => förskolelärar (kindergarten teacher stem)
• Query expansion: generate all inflections
  – förskola => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, etc.
• Truncation
  – förskol* => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, förskolelärare, förskoleläraren, förskolelärarens, förskolelärarna… (kindergarten AND kindergarten teacher)
• Stemming + some rules
  – förskola => förskol (kindergarten stem) hits on förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, but not on förskolelärare (kindergarten teacher)
• Query expansion: generate all correct inflections
  – förskola => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, etc.
• Query expansion has the same effect as stemming but is computationally more costly
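The three techniques above can be compared on the slide's own word forms. The following is a minimal sketch, not the stemmer used in the cited work: the suffix list, the minimum stem length and all function names are assumptions made for illustration.

```python
# Toy vocabulary taken from the slide examples.
VOCAB = [
    "förskola", "förskolan", "förskolans",
    "förskolor", "förskolorna", "förskolornas",
    "förskolelärare", "förskoleläraren",
]

# Hypothetical suffix list, longest first. A real Swedish stemmer has many more rules.
SUFFIXES = ["ornas", "orna", "ans", "or", "an", "a"]


def truncate_search(prefix, vocab):
    """Truncation: every word that starts with the prefix is a hit."""
    return [w for w in vocab if w.startswith(prefix)]


def stem(word):
    """Strip the first (longest) matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_search(query, vocab):
    """Stemming: a word is a hit when its stem equals the stem of the query."""
    target = stem(query)
    return [w for w in vocab if stem(w) == target]


def expand_query(query):
    """Query expansion: generate inflections by attaching each suffix to the stem."""
    base = stem(query)
    return [base + s for s in reversed(SUFFIXES)]


print(truncate_search("förskol", VOCAB))  # also matches the förskolelärare forms
print(stem_search("förskola", VOCAB))     # only the six förskola forms
print(expand_query("förskola"))           # six query terms instead of one
```

Truncation over-matches because it ignores word structure, while stemming and expansion reach the same six forms. Expansion pays for it with six query terms instead of one.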
• Truncation: Ski* hotel* Åre hits on => ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre
• Stemming: Ski hotel Åre => Ski hote Åre ("ski" and "hote" are stems) hits on => ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre
• Query expansion: Ski hotel Åre => ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre
• Query expansion has the same effect as stemming but is computationally more costly
• Stemming
  – Stem both the indexed file AND the query
  – Obtain an efficient match
• Query expansion
  – No stemming of the index file
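A minimal sketch of the stemming route, where the same stemmer is applied at indexing time and at query time so that each query term needs a single lookup. `toy_stem` is a hypothetical two-suffix English stemmer invented for this example, and the documents are made up from the ski hotel queries.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer: lowercase and strip a few English suffixes."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stemmed_index(docs):
    """Map each stem to the set of document ids that contain a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """AND search: stem every query term and intersect the posting sets."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


DOCS = {
    1: "skiing hotels Åre",
    2: "ski hotel Åre",
    3: "skis Hotel Åre",
    4: "summer hotel Visby",
}

index = build_stemmed_index(DOCS)
print(search(index, "Ski hotel Åre"))  # documents 1, 2 and 3
```

With query expansion the index would keep the surface forms, and the query would have to carry every inflection of every term.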
• Stop words are common, non-significant words
• English stop words are e.g. and, or, with, on, but, more, a, the…
• Swedish stop words are e.g. och, eller, på, men, under, en, bara…
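Stop word removal can be sketched as a simple filter at indexing time. The word list below contains only the English examples from the slide. A real list would be much longer, and the function name is an assumption.

```python
# Only the English examples listed on the slide; real stop word lists are longer.
ENGLISH_STOP_WORDS = {"and", "or", "with", "on", "but", "more", "a", "the"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace and drop every stop word."""
    return [token for token in text.lower().split() if token not in stop_words]


print(remove_stop_words("The hotel and the ski lift", ENGLISH_STOP_WORDS))
```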
• Remove stop words
  – Half of the words in the document collection will disappear
• Stemming
  – 1/3 of the words are stemmable
  – 2/3 of the words will be collapsed
  ⇒ 20 percent of the words will disappear
⇒ 30 percent left in the index?
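The estimate above can be checked with a little arithmetic. This sketch assumes that both shares are taken relative to the original collection and that "1/3 stemmable" and "2/3 collapsed" multiply. The slide does not state either assumption explicitly.

```python
from fractions import Fraction

stop_word_share = Fraction(1, 2)   # half of the words are stop words
stemmable_share = Fraction(1, 3)   # a third of the words can be stemmed
collapsed_share = Fraction(2, 3)   # two thirds of those collapse into another entry

# Assumption: the two stemming shares multiply.
removed_by_stemming = stemmable_share * collapsed_share

# Assumption: both reductions are relative to the original collection.
remaining = 1 - stop_word_share - removed_by_stemming

print(float(removed_by_stemming))  # about 0.22, rounded to 20 percent on the slide
print(float(remaining))            # about 0.28, rounded to 30 percent on the slide
```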
• Phrase search implies that one needs to keep the stop words
• Larger index: at least as large as the document collection
• Stemming will only remove 20 percent of the words
Recall = Number of found relevant documents / Total number of relevant documents
Precision = Number of found relevant documents / Total number of found documents
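The two definitions translate directly into set operations. A minimal sketch, with a made-up result set and relevance judgements:

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a set of found and a set of relevant documents."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant  # found documents that are also relevant
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents found, 3 relevant in total, 2 in common.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(precision, recall)
```

Here half of the found documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).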
• Stemming (Carlberger et al 2001)
  – bilverkstaden, bilverkstäder, etc. => bilverkstad (car shop / garage)
  – Bok => böcker, book => books
  – 15% better precision and 18% better recall when searching in Swedish
  – Other languages up to 30-50% better hits (except English)
webmaster, webbmaster, webbansvarig => webbmaster
Mord, mordet, mördare => mörd (Murder, the murder, murderer)
Tomlinson (2001) (Hummingbird Fulcrum): increased precision in search by using stemming
• German 43%
• Dutch 30%
• French 18%
• Italian 16%
• Spanish 12%
• English 12%
Tomlinson (2002) (Hummingbird Fulcrum): increased precision in search by using word splitting
Mobiltelefonbatteri => mobil telefon batteri
• Finnish 69% (word splitting)
• German 27% (word splitting)
• Spanish 8%
• Dutch 8%
• French 6%
• Italian 4% ?
• Swedish 4%
• English 2%
• Using the Inxight LinguistX tool (Xerox)
• The average query is about two words long: 1.8 or 2.3 words per query
• Longer queries give better answers
• Larger input field
• Many misspelled queries in search engines: at least 10 percent
• Spell checker => fuzzy matching
• 10 percent of all search queries were misspelled at the RSV web site (1 million search queries at Skatteverket, RSV's web site (Dalianis 2002))
• A Google press release (2002) says the same thing
• 10 percent of all search queries are misspelled at the SUNET web catalogue (Stolpe 2002)
• Euroling SiteSeeker logs also show 10-12.5 percent misspelled search queries out of 12 million search queries in total
• The spell checker Stava is used in Lexin (Skolverket), a web-based dictionary, e.g. a Swedish-English dictionary, but also dictionaries for large immigrant languages such as Finnish, Spanish, Greek, Turkish, Russian, Croatian and Albanian
• 7 million lookups per month, of which 33 percent are misspellings
• 3 lookups per second, of which one is misspelled
• (Can the many misspellings be due to small dictionaries? Not enough words to propose?)
• Persons who cannot spell correctly: dyslexics
• Slipping errors: slips on the keyboard
• Unsure about spelling, second-language users
• Compound splitting or erroneous compounding
  – Missbruksvård / missbruk vård
• Alternative spellings of words in the index
  – Names can be spelled in different ways (Eriksson, Erikson, Ericsson, Ericson, Erickson, Erixon, Eiriksson)
• Misspellings in the index
• The index is the dictionary
• All words in the index are treated as correct, even misspelled words
• If a search word is not present in the index, the spelling correction algorithm tries to find the word in the index with the smallest edit distance to the search word
• Keyboard distance as well
• Insertion
• Deletion
• Substitution
• Transposition
Covers 80 percent of all spelling errors
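An edit distance that counts exactly these four operations can be computed with dynamic programming. The sketch below uses the optimal string alignment variant and then looks up the closest word in the index. It is an illustration, not the algorithm of the spell checkers named in these slides. The function names and the `max_distance` threshold are assumptions, and keyboard distance is not modelled.

```python
def edit_distance(a, b):
    """Number of insertions, deletions, substitutions and adjacent
    transpositions needed to turn a into b (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word, index_words, max_distance=2):
    """Return the word itself if indexed, else the closest indexed word, else None."""
    if word in index_words:
        return word
    if not index_words:
        return None
    best = min(sorted(index_words), key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else None


INDEX = {"jämkning", "skilsmässa", "giftermål", "engångsskatt"}
print(suggest("jämnkning", INDEX))    # jämkning
print(suggest("skillsmässa", INDEX))  # skilsmässa
```

The linear scan over the index is fine for a small collection. A large index would need a candidate filter before computing distances.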
• At RSV's search engine with built-in spelling correction, 90 percent of the spelling errors were corrected
• 40 percent of the suggestions were compound splitting
• 30 percent were alternative spellings
• 22 percent were misspellings
• The document collection contained around 5 000 documents
• Compound splitting 40 percent
  – utrikestraktamente => traktamente utrikes
  – bilavgifter => avgifter bilar
  – expertskatt => expert skatt
  – skattejämkningsblankett => jämkningsblankett skattejämkning
• Alternative spellings 30 percent
  – kyrkskatt => kyrkoskatt
  – hempc => hem-pc
  – rotavdrag => rot-avdrag
  – arvsskifte => arvsskiftet
  – pharmasia => pharmacia
  – skattåterbäring => skatteåterbäring
• Spelling errors 22 percent
  – engångskatt => engångsskatt
  – giftemål => giftermål
  – jämnkning => jämkning
  – skillsmässa => skilsmässa
  – skiljsmässa => skilsmässa
  – skattejämnkning => skattejämkning
• Performing compound splitting automatically is computationally hard. Better to use linguistic methods
• Compound joining is easy to do automatically
  – rätt stavning => rättstavning (spelling correction, correct spelling)
  – text sammanfattning => textsammanfattning
• Compound splitting is more difficult
  – rättstavning => rätt stavning (spelling correction, correct spelling)
• Google, SiteSeeker
• Search with Google
  – Utrikestraktamente
  – Businessintelligence
• Compound splitting!
• Word splitter applied as a post-processor to queries with no answer
• 127 different compounds were split
• Word splitting gave 64 percent better hits
• Missbruksvård gave no hit, but vård (av) missbrukare gave hits (treatment of drug addicts)
• Nine Swedish public web sites
• 1.6 million queries
• 9.3 percent spelling errors
• 6 000 compounds with no answers
• Compound splitter (Sjöbergh & Kann 2004) improved results by 64 percent (Dalianis 2005)
Compound joining: 1 per thousand of all searches
• växthus effekten => växthuseffekten
• vinter däck => vinterdäck
• lämplighets intyg => lämplighetsintyg
• telefon nummer => telefonnummer
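Compound joining is easy to automate because the index itself can confirm a candidate. A minimal sketch, assuming the index is available as a set of lowercase words. The function name and the toy index are hypothetical.

```python
def join_compounds(query, index_words):
    """Join two adjacent query words when their concatenation is an indexed word."""
    tokens = query.split()
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            if joined.lower() in index_words:
                result.append(joined)
                i += 2  # both parts are consumed by the compound
                continue
        result.append(tokens[i])
        i += 1
    return " ".join(result)


INDEX = {"växthuseffekten", "vinterdäck", "lämplighetsintyg", "telefonnummer"}
print(join_compounds("vinter däck", INDEX))          # vinterdäck
print(join_compounds("billiga vinter däck", INDEX))  # billiga vinterdäck
```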
Compound splitting: 4 per thousand of all searches
• befolkningsutveckling => befolknings utveckling
• lotteritillstånd => lotteri tillstånd
• studentförsäkring => student försäkring
• skadeanmälningsblankett => skade anmälnings blankett
• missbruksvård => missbruks vård
• 2 percent on some web sites
• Split the compound in two parts
• The compound split retrieval should be within one sentence, e.g. a 29-word window
• NEAR and word position
• Use the genitive "s" as a compound split marker and remove the "s"
• The rightmost part should be the longest
• Treat proper nouns specially
• NEAR is a pseudo-Boolean operator
• NEAR gives higher ranking to documents where the query terms are close to each other
• Petra Hansson is better than Petra … Svensson and Åke … Hansson
• Search with spelling correction among 79 000 Swedish news texts increased precision and recall by 4 and 11.5 percent respectively (Sarr 2003).
• Search with Swedish stemming on 54 000 news texts increased precision and recall by 15 and 18 percent respectively (Carlberger et al 2001).
• KWIC (Key Word In Context): extract of relevant context around search terms
• The first search engines only presented the link address and maybe the first words of the indexed text.
• One needed to click on each document and investigate it!
• Text summarizer SweSum connected to Altavista in 1999!
• KWIC gives you an overview of what is in the documents: the KWIC list shows every hit with its key words marked up.
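A KWIC snippet can be sketched as a fixed window of words around each hit. The window size, the bracket markup and the function name are assumptions. The sketch matches surface forms only, which is why a stemmed index has to keep the original words to produce such snippets.

```python
def kwic(text, keyword, window=3):
    """Return one snippet per hit: `window` words on each side, keyword in brackets."""
    tokens = text.split()
    snippets = []
    for i, token in enumerate(tokens):
        # Ignore trailing punctuation and case when comparing.
        if token.lower().strip(".,;:!?") == keyword.lower():
            left = tokens[max(0, i - window): i]
            right = tokens[i + 1: i + 1 + window]
            snippets.append(" ".join(left + ["[" + token + "]"] + right))
    return snippets


TEXT = "The first search engines only presented the link address"
print(kwic(TEXT, "search", window=2))  # ['The first [search] engines only']
```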
– Google
– Altavista-Yahoo-Inktomi
– Alltheweb-Yahoo-Inktomi
– SiteSeeker
– And several others
• KWIC implies restoration of stemmed words
  – Index increase
• NEAR operator implies position search
  – Index increase
• Phrase search
  – Implies stop word restoration
• The index grows bigger than the document collection!
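The growth is easy to see in a positional index, which stores one entry per word occurrence, stop words included. The sketch below is a minimal illustration with hypothetical function names and toy documents, not the index format of any engine named in these slides.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; nothing is removed or stemmed."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents where the phrase words occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or terms[0] not in index:
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(
                term in index and start + k in index[term].get(doc_id, [])
                for k, term in enumerate(terms)
            ):
                hits.add(doc_id)
                break
    return hits


DOCS = {1: "to be or not to be", 2: "not to worry"}
index = build_positional_index(DOCS)
print(phrase_search(index, "not to be"))  # {1}

# One stored position per word occurrence: the index holds as many
# entries as the collection has words, before any other bookkeeping.
postings = sum(len(p) for by_doc in index.values() for p in by_doc.values())
print(postings)  # 9
```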
• Term expansion is neat
  – Bilverkstad => bilverkstad, bilreparation, garage, verkstad, bilar
• One does not really want to do this manually!
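Once a synonym table exists, applying it is trivial. The hard part is building the table. A minimal sketch, with a hand-written table that contains only the slide's example:

```python
# Hypothetical hand-built table; in practice one wants to derive it automatically.
SYNONYMS = {"bilverkstad": ["bilreparation", "garage", "verkstad", "bilar"]}


def expand_terms(query, synonyms):
    """Return each query term followed by its synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_terms("Bilverkstad", SYNONYMS))
```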
• Retriever - http://www.retriever-info.com/ => just click on login
• Free to use for schools and universities
• This worked previously; does it still work today?
• Search on "fordon" (vehicle) and get hits on bil (car)
• Synonyms and stemming!
• Approximation of Latent Semantic Indexing
• Faster and more efficient
• Parallel texts are identical texts but in different languages
• Parallel corpora are collections of many such texts, ~1 000
• Alignment methods propose translation candidates between languages
  – The lecture Hercules gave was difficult = Föreläsningen Hercules gav var svår.
• The lectures Hercules gave were difficult => Föreläsningarna Hercules gav var svåra.
• The lectures of Hercules are always difficult => Hercules lektioner är alltid svåra
Synonyms:
• Lecture(s) <=> föreläsning(arna), lektion(er)
• ~volvo => volvo car, cars
• ~volvo -volvo ~car -car => 240, vehicle, motor, racing, automotive, auto
• ~car => BMV, auto, motor, car
• ~car -car => automotive, motor vehicle, racing