
Source: people.dsv.su.se/~hercules/kurser/isbi/lectures/F4/ISBI-HD...


Hercules Dalianis 1

Hercules Dalianis, DSV-SU-KTH

email: [email protected]
phone: 070-568 13 59 / 08-674 75 47

•  Stemming, truncation, query expansion
•  Multi-word queries and phrase search
•  Spell checking
•  Synonym search
•  KWIC (Key Word In Context): extract of relevant context around search words
•  Index size

•  Truncation
   –  förskol* => förskola, förskolor, förskolelärare (kindergarten, kindergartens, kindergarten teacher)
•  Stemming + some rules
   –  förskola => förskol (stem of kindergarten)
   –  förskolelärare => förskolelärar (stem of kindergarten teacher)
•  Query expansion: generate all inflections
   –  förskola => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, etc.

•  Truncation
   –  förskol* => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, förskolelärare, förskoleläraren, förskolelärarens, förskolelärarna… (kindergarten AND kindergarten teacher)

•  Stemming + some rules
   –  förskola => förskol (stem of kindergarten) hits on förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas but not on förskolelärare (kindergarten teacher)

•  Query expansion: generate all correct inflections
   –  förskola => förskola, förskolan, förskolans, förskolor, förskolorna, förskolornas, etc.
•  Query expansion has the same effect as stemming but is computationally more costly
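The three techniques can be compared on the förskola example with a small Python sketch. The suffix list and the inflection endings below are hypothetical toy rules chosen only to reproduce the slide examples; they are not the rules of any real Swedish stemmer.

```python
# Toy vocabulary taken from the slide examples.
VOCAB = [
    "förskola", "förskolan", "förskolans",
    "förskolor", "förskolorna", "förskolornas",
    "förskolelärare",
]

# Hypothetical suffix list, longest first (assumption, not a real stemmer).
SUFFIXES = ["ornas", "orna", "ans", "or", "an", "a", "e"]

# Hypothetical endings for a noun whose base form ends in -a (assumption).
ENDINGS = ["a", "an", "ans", "or", "orna", "ornas"]


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def truncation_hits(pattern, vocab):
    """förskol* matches every word that starts with the given prefix."""
    prefix = pattern.rstrip("*")
    return [w for w in vocab if w.startswith(prefix)]


def stemming_hits(query, vocab):
    """Match words whose stem equals the stem of the query."""
    target = toy_stem(query)
    return [w for w in vocab if toy_stem(w) == target]


def expand_query(base_form):
    """Generate the inflections of a base form ending in -a."""
    root = base_form[:-1]
    return [root + ending for ending in ENDINGS]


def expansion_hits(base_form, vocab):
    """Match words that are one of the generated inflections."""
    variants = set(expand_query(base_form))
    return [w for w in vocab if w in variants]


print(truncation_hits("förskol*", VOCAB))   # also returns förskolelärare
print(stemming_hits("förskola", VOCAB))     # the six förskola forms only
print(expansion_hits("förskola", VOCAB))    # the same six forms
```

In this sketch truncation over-generates (it also returns the kindergarten teacher), while stemming and query expansion return the same six forms.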

•  Truncation
   –  Ski* hotel* Åre => hits on ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre

•  Stemming
   –  Ski hotel Åre => Ski hote Åre ("ski" and "hote" are stems) => hits on ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre

•  Query expansion
   –  Ski hotel Åre => ski hotel Åre, skiing hotel Åre, Ski hotels Åre, Skiing hotels Åre, skis Hotel Åre
•  Query expansion has the same effect as stemming but is computationally more costly

•  Stemming
   –  Stem both the indexed file AND the query
   –  Obtain an efficient match
•  Query expansion
   –  No stemming of the index file
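The difference between the two strategies can be sketched with a tiny inverted index in Python. The stem table, the variant table and the four documents are hypothetical and only mirror the ski hotel example; the point is that the stemmed index needs one lookup per query term, while the expanded query needs one lookup per generated variant.

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer (assumption).
TOY_STEMS = {"ski": "ski", "skis": "ski", "skiing": "ski",
             "hotel": "hotel", "hotels": "hotel"}

# Hypothetical inflection table used for query expansion (assumption).
VARIANTS = {"ski": ["ski", "skis", "skiing"], "hotel": ["hotel", "hotels"]}

DOCS = {
    1: "ski hotel Åre",
    2: "skiing hotels Åre",
    3: "skis Hotel Åre",
    4: "hotel Stockholm",
}


def stem(word):
    word = word.lower()
    return TOY_STEMS.get(word, word)


def build_index(docs, normalize):
    """Inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search_stemmed(index, query):
    """Stem the query and intersect one posting list per term."""
    postings = [index.get(stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


def search_expanded(index, query):
    """Expand each term to its variants, union them, then intersect."""
    result = None
    for term in query.split():
        hits = set()
        for variant in VARIANTS.get(term.lower(), [term.lower()]):
            hits |= index.get(variant, set())
        result = hits if result is None else result & hits
    return result or set()


stemmed_index = build_index(DOCS, stem)        # index and query are both stemmed
plain_index = build_index(DOCS, str.lower)     # index is left unstemmed

print(search_stemmed(stemmed_index, "Ski hotel Åre"))
print(search_expanded(plain_index, "Ski hotel Åre"))
```

Both searches return documents 1, 2 and 3 for this toy collection.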

•  Stop words are common, non-significant words
•  English stop words are e.g. and, or, with, on, but, more, a, the, ...
•  Swedish stop words are e.g. och, eller, på, men, under, en, bara, ...

•  Remove stop words
   –  Half of the words in the document collection will disappear
•  Stemming
   –  1/3 of the words are stemmable
   –  2/3 of those will be collapsed
   ⇒  20 percent of the words will disappear

⇒  30 percent left in the index?
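The arithmetic behind these estimates can be written out as a short Python sketch. It assumes that both reductions are measured as shares of the whole document collection, which the slide does not state explicitly (and it ends with a question mark itself).

```python
# Shares taken from the slide.
stop_word_share = 0.5     # half of the words are stop words
stemmable_share = 1 / 3   # a third of the words can be stemmed
collapsed_share = 2 / 3   # two thirds of the stemmable words collapse

# Assumption: both reductions are shares of the whole collection.
removed_by_stemming = stemmable_share * collapsed_share   # 2/9, about 20 percent
remaining = 1.0 - stop_word_share - removed_by_stemming   # 5/18, about 30 percent

print(f"removed by stemming: {removed_by_stemming:.2f}")
print(f"left in the index:   {remaining:.2f}")
```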

•  Phrase search implies that one needs to keep the stop words
•  Larger index: at least as large as the document collection
•  Stemming will only remove 20 percent of the words

Recall = Number of relevant documents found / Total number of relevant documents

Precision = Number of relevant documents found / Total number of documents found
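As a minimal Python sketch, both measures can be computed from two sets of document ids. The ids in the usage example are made up for illustration.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for a set of found and relevant doc ids."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant                      # relevant documents found
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents found, 5 relevant, 2 in common.
precision, recall = precision_recall(found=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(precision, recall)   # 0.5 0.4
```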

•  Stemming (Carlberger et al. 2001)
   –  bilverkstaden, bilverkstäder, etc. => bilverkstad (car repair shop / garage)
   –  bok => böcker, book => books
   –  15% better precision and 18% better recall when searching in Swedish
   –  Other languages up to 30-50% better hits (except English)


webmaster, webbmaster, webbansvarig => webbmaster

Mord, mordet, mördare => mörd (Murder, the murder, murderer)

Tomlinson (2001) (Hummingbird Fulcrum): increased precision in search by using stemming
•  German 43%
•  Dutch 30%
•  French 18%
•  Italian 16%
•  Spanish 12%
•  English 12%

Tomlinson (2002) (Hummingbird Fulcrum): increased precision in search by using word splitting
Mobiltelefonbatteri => mobil telefon batteri (mobile phone battery)
•  Finnish 69% (word splitting)
•  German 27% (word splitting)
•  Spanish 8%
•  Dutch 8%
•  French 6%
•  Italian 4% ?
•  Swedish 4%
•  English 2%
•  Using the Inxight LinguistX tool (Xerox)

•  Average query length is about two words: 1.8 or 2.3 words per query
•  Longer queries give better answers
•  Larger input field

•  Many misspelled queries in search engines: at least 10 percent
•  Spell checker => fuzzy matching

•  10 percent of all search queries were misspelled at the RSV web site (1 million search queries at Skatteverket, RSV's web site (Dalianis 2002))
•  A Google press release (2002) says the same thing
•  10 percent of all search queries are misspelled at the SUNET web catalogue (Stolpe 2002)
•  Euroling SiteSeeker logs also show 10-12.5 percent misspelled search queries, out of 12 million search queries in total

•  Stava, a spell checker, is used in Lexin (Skolverket), a web-based dictionary, e.g. a Swedish-English dictionary, but also covering large immigrant languages such as Finnish, Spanish, Greek, Turkish, Russian, Croatian, Albanian
•  7 million lookups per month, of which 33 percent are misspellings
•  3 lookups per second, of which one is misspelled
•  (Can the many misspellings be due to small dictionaries? Not enough words to propose?)

•  People who cannot spell correctly: dyslexics
•  Slips: slipping on the keyboard
•  Unsure about spelling: second-language users
•  Compound splitting or erroneous compounding
   –  Missbruksvård / missbruk vård
•  Alternative spellings of words in the index
   –  Names can be spelled in different ways (Eriksson, Erikson, Ericsson, Ericson, Erickson, Erixon, Eiriksson)
•  Misspellings in the index

•  The index is the dictionary
•  All words in the index are considered correct, even misspelled words
•  If a search word is not present in the index, the spelling correction algorithm tries to find the word in the index with the smallest edit distance to the search word
•  Keyboard distance as well

•  insertion
•  deletion
•  substitution
•  transposition

Covers 80 percent of all spelling errors
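An edit distance that counts exactly these four operations can be sketched in Python as below. This is the restricted Damerau-Levenshtein variant (optimal string alignment), chosen here for illustration; the slides do not say which variant the spelling corrector uses, and the small index in the usage example is hypothetical. Keyboard distance is not modelled.

```python
def edit_distance(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def closest_index_word(query, index_words):
    """Return the query if it is in the index, else the nearest index word."""
    if query in index_words:
        return query
    return min(index_words, key=lambda word: edit_distance(query, word))


# Hypothetical index built from the slide examples.
INDEX = {"pharmacia", "giftermål", "skilsmässa", "skatt"}
print(closest_index_word("pharmasia", INDEX))    # pharmacia
print(closest_index_word("skillsmässa", INDEX))  # skilsmässa
```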

•  At RSV's search engine with built-in spelling correction, 90 percent of the spelling errors were corrected
•  40 percent of the suggestions were compound splitting
•  30 percent were alternative spellings
•  22 percent were misspellings
•  The document collection contained around 5 000 documents

•  Compound splitting, 40 percent
   –  utrikestraktamente => traktamente utrikes
   –  bilavgifter => avgifter bilar
   –  expertskatt => expert skatt
   –  skattejämkningsblankett => jämkningsblankett skattejämkning

•  Alternative spellings, 30 percent
   –  kyrkskatt => kyrkoskatt
   –  hempc => hem-pc
   –  rotavdrag => rot-avdrag
   –  arvsskifte => arvsskiftet
   –  pharmasia => pharmacia
   –  skattåterbäring => skatteåterbäring

•  Spelling errors, 22 percent
   –  engångskatt => engångsskatt
   –  giftemål => giftermål
   –  jämnkning => jämkning
   –  skillsmässa => skilsmässa
   –  skiljsmässa => skilsmässa
   –  skattejämnkning => skattejämkning

•  Performing compound splitting automatically is computationally hard. Better to use linguistic methods
•  Compound joining is easy to do automatically
   –  rätt stavning => rättstavning (correct spelling => spelling correction)
   –  text sammanfattning => textsammanfattning (text summarization)
•  Compound splitting is more difficult
   –  rättstavning => rätt stavning (spelling correction => correct spelling)
•  Google, SiteSeeker

•  Search with Google
   –  Utrikestraktamente
   –  Businessintelligence
•  Compound splitting!

•  Word splitter applied as a post-processor to queries with no answer
•  127 different compounds were split
•  Word splitting gave 64 percent better hits
•  Missbruksvård gave no hits, but vård (av) missbrukare gave hits (treatment of drug addicts)

•  Nine Swedish public web sites
•  1.6 million queries
•  9.3 percent spelling errors
•  6 000 compounds with no answers
•  Compound splitter (Sjöbergh & Kann 2004) improved results by 64 percent (Dalianis 2005)

Compound joining: 1 per thousand of all searches
•  växthus effekten => växthuseffekten
•  vinter däck => vinterdäck
•  lämplighets intyg => lämplighetsintyg
•  telefon nummer => telefonnummer

Compound splitting: 4 per thousand of all searches
•  befolkningsutveckling => befolknings utveckling
•  lotteritillstånd => lotteri tillstånd
•  studentförsäkring => student försäkring
•  skadeanmälningsblankett => skade anmälnings blankett
•  missbruksvård => missbruks vård
•  2 percent on some web sites

•  Split the compound into two parts
•  The parts of the split compound should be retrieved within one sentence, e.g. a window of 29 words
•  NEAR and word position
•  Use the genitive “s” as a compound split marker and remove the “s”
•  The rightmost part should be the longest
•  Treat proper nouns specially
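A few of these heuristics can be sketched in Python as a lexicon-based splitter. The lexicon is a hypothetical stand-in for a real word list, and the function is only one possible reading of the rules (two parts, genitive "s" removed, longest right-hand part preferred); it is not the Sjöbergh & Kann splitter, and proper nouns are not handled.

```python
# Hypothetical lexicon of known words (assumption).
LEXICON = {"expert", "skatt", "befolkning", "utveckling", "missbruk", "vård"}


def split_compound(word, lexicon=LEXICON):
    """Split a compound into two known parts, or return None.

    A linking genitive "s" at the end of the left part is removed.
    Among several possible splits, the one with the longest
    right-hand part is returned.
    """
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if right not in lexicon:
            continue
        if left in lexicon:
            candidates.append((left, right))
        elif left.endswith("s") and left[:-1] in lexicon:
            candidates.append((left[:-1], right))   # drop the genitive "s"
    if not candidates:
        return None
    return max(candidates, key=lambda parts: len(parts[1]))


print(split_compound("expertskatt"))            # ('expert', 'skatt')
print(split_compound("befolkningsutveckling"))  # ('befolkning', 'utveckling')
print(split_compound("missbruksvård"))          # ('missbruk', 'vård')
```

The two parts would then be searched with a NEAR-style condition so that they occur within the same window of text.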

•  NEAR: pseudo-Boolean operator
•  NEAR gives higher ranking to query terms that are close to each other
•  "Petra Hansson" ranks better than "Petra ... Svensson and Åke ... Hansson"
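Proximity ranking of this kind relies on word positions. The Python sketch below computes the smallest distance between two query terms in each document and ranks documents by it; the two documents are made up from the Petra Hansson example, and the default window of 29 words is the sentence-sized window mentioned earlier in the lecture.

```python
def positions(tokens, term):
    """All positions at which term occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def min_distance(text, term_a, term_b):
    """Smallest word distance between the two terms, or None if one is missing."""
    tokens = text.lower().split()
    pos_a = positions(tokens, term_a.lower())
    pos_b = positions(tokens, term_b.lower())
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def near(text, term_a, term_b, window=29):
    """True if both terms occur within `window` words of each other."""
    distance = min_distance(text, term_a, term_b)
    return distance is not None and distance <= window


def rank_by_proximity(docs, term_a, term_b):
    """Document ids ordered by increasing distance between the terms."""
    scored = []
    for doc_id, text in docs.items():
        distance = min_distance(text, term_a, term_b)
        if distance is not None:
            scored.append((distance, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


# Hypothetical documents based on the slide example.
DOCS = {
    "d1": "Petra Svensson and Åke Hansson",
    "d2": "Petra Hansson",
}
print(rank_by_proximity(DOCS, "Petra", "Hansson"))   # ['d2', 'd1']
```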

•  Search with spelling correction among 79 000 Swedish news texts increased precision and recall by 4 and 11.5 percent respectively (Sarr 2003)
•  Search with Swedish stemming on 54 000 news texts increased precision and recall by 15 and 18 percent respectively (Carlberger et al. 2001)

•  KWIC (Key Word In Context): extract of relevant context around search terms
•  The first search engines only presented the link address and maybe the first words of the indexed text
•  One needed to click on each document and investigate it!
•  Text summarizer SweSum connected to Altavista in 1999!

•  KWIC gives an overview of what is in the documents: the KWIC list shows every hit with the keywords marked up
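A minimal KWIC extractor can be sketched in Python as follows. The window width and the square-bracket markup are arbitrary choices for illustration; a real search engine would typically highlight the keyword in HTML.

```python
def kwic(text, term, width=3):
    """Return one snippet per hit: `width` words of context on each side,
    with the keyword marked up in square brackets."""
    tokens = text.split()
    snippets = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?") == term.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            snippets.append(" ".join(left + ["[" + token + "]"] + right))
    return snippets


text = "the first search engines only presented the link address"
print(kwic(text, "search", width=2))   # ['the first [search] engines only']
```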

–  Google
–  Altavista-Yahoo-Inktomi
–  Alltheweb-Yahoo-Inktomi
–  SiteSeeker
–  And several others


•  KWIC implies restoration of stemmed words
   –  Index increase
•  NEAR operator implies position search
   –  Index increase
•  Phrase search
   –  Implies stop word restoration
•  Index grows bigger than the document collection!

•  Term expansion is neat
   –  Bilverkstad => bilverkstad, bilreparation, garage, verkstad, bilar (car repair shop => car repair shop, car repair, garage, workshop, cars)
•  One does not really want to do this manually!

•  Retriever - http://www.retriever-info.com/ => just click on login
•  Free to use for schools and universities
•  This worked previously; does it still work today?
•  Search on ”fordon” (vehicle) and get hits on bil (car)
•  Synonyms and stemming!

•  Approximation of Latent Semantic Indexing
•  Faster and more efficient

•  Parallel texts are identical texts but in different languages
•  Parallel corpora are lots of texts, ~ 1 000
•  Alignment methods propose translation candidates between languages
   –  The lecture Hercules gave was difficult = Föreläsningen Hercules gav var svår.


•  The lectures Hercules gave were difficult => Föreläsningarna Hercules gav var svåra.

•  The lectures of Hercules are always difficult => Hercules lektioner är alltid svåra

Synonyms: •  Lecture(s) <=> föreläsning(arna), lektion(er)


•  http://translate.google.com/translate_s?hl


•  ~volvo => volvo car, cars
•  ~volvo -volvo ~car -car => 240, vehicle, motor, racing, automotive, auto
•  ~car => BMW, auto, motor, car
•  ~car -car => automotive, motor vehicle, racing

•  Truncation
•  Stemming
•  Query expansion
•  Synonyms
•  Spell checking
•  NEAR
•  KWIC
•  Cross-language retrieval