
Page 1

The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines

Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović
Human Language Technology Group, University of Belgrade, Serbia

Page 2

Contents

  • Typical problems when retrieving documents using a web search engine
  • The lexical resources used
  • The system options
  • Technical implementation
  • Results and evaluation

Page 3

Typical problems when retrieving documents using a web search engine

Highly inflective language

  • Donošenjem odluke… Odluka o priređivanju igara… Ministarstvo donosi odluku…
    (By making a decision… A decision to organize games… The Ministry shall make a decision…)
  • Sastojci za 10 porcija: 3 glavice crnog luka, 1 šoljica ulja, 1/2 čaša belog vina, 1 čaša soka od paradajza
    (The ingredients for 10 portions: 3 onions, 1 cup of oil, ½ glass of white wine, 1 glass of tomato juice.)

Lexical realization of a concept

  • synonyms: beli luk ‘garlic’ → češnjak
  • hyponyms: muzički instrument ‘musical instrument’ → klavir ‘piano’, gitara ‘guitar’, etc.
  • derivations: Beograd → Beograđanin, Beograđanka, etc.
  • and other relations

Bilingual search

  • finding documents on the chosen subject in two languages, e.g. English and Serbian

Page 4

The lexical resources used

WS4QE (Work Station for Query Expansion) uses inflectional finite state transducers (FST), morphological dictionaries, Prolex and WordNets:

  • Morphological dictionaries: the Serbian morphological dictionary is in LADL format, with 117,000 lemmas and 1,400,000 different lexical words
  • FST: inflection of both simple and compound words, developed for the Unitex system, http://www-igm.univ-mlv.fr/~unitex
  • WordNets: the Serbian WN, conceived within the BalkaNet project, with 14,593 synsets, and the Princeton WN are used for query expansion with related words and for bilingual searches
  • Prolex: a multilingual database of proper names organized around a conceptual proper name that represents the same concept in different languages, http://www.cnrtl.fr/lexiques/prolex/
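Expansion with related words can be pictured as a lookup in a wordnet-like structure. The sketch below is only an illustration: the `TOY_WORDNET` dictionary and the `expand_with_related` function are hypothetical names, the entries reuse examples from the slides, and the layout is not the actual data model of the Serbian WN or of WS4QE.

```python
# Toy stand-in for a wordnet: each entry lists related words by relation type.
# The dictionary layout is an assumption made for illustration only.
TOY_WORDNET = {
    "beli luk": {"synonym": ["češnjak"]},
    "muzički instrument": {"hyponym": ["klavir", "gitara"]},
}


def expand_with_related(query, relations=("synonym",)):
    """Return the query followed by related words for the chosen relations."""
    entry = TOY_WORDNET.get(query, {})
    expanded = [query]
    for relation in relations:
        for word in entry.get(relation, []):
            if word not in expanded:
                expanded.append(word)
    return expanded


print(expand_with_related("beli luk"))
print(expand_with_related("muzički instrument", relations=("hyponym",)))
```

A query with no entry is returned unchanged, so the expansion step can always be applied safely before the search.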

Page 5

The lexical resources used

For the query beli luk, two FSTs for the components and one for the compound are used, producing only 12 instead of 216 possible combinations:

beli luk AND belim lukom AND beli lukovi AND belih lukova AND belima lukovima AND belim lukovima AND bele lukove AND bela luka AND beloga luka AND belog luka AND belome luku AND belom luku

thus preventing false retrievals such as:

  • ...posmatrano sa dna vidika, izgleda kao da iz širokih lukova belog mosta teče i razliva se ne samo zelena Drina…
    (Thus, from a bottom view, it appears that not only the green Drina flows and spills over from under the wide arches of the white bridge…)
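The reduction from all component combinations to the agreeing ones can be sketched as a filter over tagged forms. This is a minimal illustration, not the Unitex FST mechanism: the form tables are a small hand-picked subset with simplified (case, number) tags, and the function names are hypothetical, so the counts will not reproduce the 12 and 216 quoted above.

```python
from itertools import product

# Toy form tables: (surface form, case, number). Only a few forms are listed
# and the tags are simplified for illustration.
ADJ_FORMS = [
    ("beli", "nom", "sg"),
    ("belog", "gen", "sg"),
    ("belom", "loc", "sg"),
    ("belim", "ins", "sg"),
    ("belih", "gen", "pl"),
]
NOUN_FORMS = [
    ("luk", "nom", "sg"),
    ("luka", "gen", "sg"),
    ("luku", "loc", "sg"),
    ("lukom", "ins", "sg"),
    ("lukova", "gen", "pl"),
]


def all_combinations(adjectives, nouns):
    """Every adjective form paired with every noun form."""
    return [f"{a[0]} {n[0]}" for a, n in product(adjectives, nouns)]


def agreeing_combinations(adjectives, nouns):
    """Only pairs whose case and number agree, in fixed adjective-noun order."""
    return [
        f"{a_form} {n_form}"
        for (a_form, a_case, a_num), (n_form, n_case, n_num) in product(adjectives, nouns)
        if (a_case, a_num) == (n_case, n_num)
    ]


naive = all_combinations(ADJ_FORMS, NOUN_FORMS)
agreed = agreeing_combinations(ADJ_FORMS, NOUN_FORMS)
print(len(naive), len(agreed))
print(agreed)
```

Because each phrase keeps the adjective and noun adjacent and in agreement, a text containing lukova belog (as in the bridge example) does not match any of the generated phrases.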

Page 6

The system options

An improved query can be obtained through:

  • Alternate alphabet usage: štrajk ‘strike’ → штрајк
  • Inclusion of inflectional forms: štrajk ‘strike’ → štrajk, štrajka, štrajkovi, etc.
  • Addition of related words:
    štrajk ‘strike’ → obustava rada ‘work stoppage’
    solarni sistem ‘solar system’ → Merkur, Venera, Zemlja, Mars
    Engleska ‘England’ → Englez ‘Englishman’, Engleskinja ‘English woman’, and also Albion
  • Inflection of free phrases by predicting their syntactic structure
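The alternate alphabet option amounts to transliterating the query between the Serbian Latin and Cyrillic scripts. The sketch below shows one simple way to do the Latin to Cyrillic direction; the function name `to_cyrillic` is hypothetical and this is not the WS4QE implementation. It treats every lj, nj and dž as a digraph, which is wrong for the few words where the two letters belong to different morphemes (for example injekcija), so a real system would need an exception list.

```python
# Serbian Latin digraphs that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# One-to-one letter mapping for the remaining Serbian Latin letters.
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic, digraphs first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # spaces, digits, punctuation stay unchanged
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(to_cyrillic("štrajk"))
print(to_cyrillic("beli luk"))
```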

Page 7

Rule based procedure for inflection

  • Procedure for automatic inflection of compounds and phrases based on a set of rules
  • Rule design strategy: the result of expert knowledge of morphology and the analysis of existing manually created compound dictionaries
  • Experiments with various rule strategies are possible; the final strategy is the result of several iterations
  • The rule based strategy presently consists of 53 rules with a total of 1014 rule subtypes (rule parts)

Example of rule number 43, class NC_N6X

Class     Gramm. condition   Frequency
NC_N6X    _:fs1q__           3
          _:ms1q__           2
          _:ms1v__           2
          _:ns1q__           1
          _:fs1v__           0
          _:ns1v__           0

Additional conditions: (The first component is a noun) AND ((the second, third and fourth components are in the genitive) OR (the second word is a preposition and the third word agrees with it))

Page 8

Rule based procedure for inflection

<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="1">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2"/>
    <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2"/>
    <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2"/>
  </RuleType>
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
  <RulePart ID="1" Frequency="3" Example="princ na belom konju">
    <WordRP ID="1" GramCats="ms1v" />
  </RulePart>
  <RulePart ID="2" Frequency="2">
    <WordRP ID="1" GramCats="ms1q" />
  </RulePart>
  <RulePart ID="3" Frequency="2">
    <WordRP ID="1" GramCats="ns1q" />
  </RulePart>
  <RulePart ID="4" Frequency="1">
    <WordRP ID="1" GramCats="fs1q" />
  </RulePart>
  <RulePart ID="5" Frequency="0">
    <WordRP ID="1" GramCats="ns1v" />
  </RulePart>
  <RulePart ID="6" Frequency="0">
    <WordRP ID="1" GramCats="fs1v" />
  </RulePart>
</Rule>
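A rule in this format can be read with a standard XML parser. The sketch below is a minimal illustration under stated assumptions: it uses a shortened copy of rule 43, the function names `load_rule` and `pos_pattern_matches` are hypothetical, the POS tag A for an adjective is assumed, and the Condition attributes (GramCats, PrepAgr) are deliberately ignored, so it checks only the part-of-speech pattern and not agreement.

```python
import xml.etree.ElementTree as ET

# Shortened copy of rule 43, for illustration only.
RULE_XML = """
<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
  <RulePart ID="1" Frequency="3" Example="princ na belom konju">
    <WordRP ID="1" GramCats="ms1v" />
  </RulePart>
  <RulePart ID="2" Frequency="2">
    <WordRP ID="1" GramCats="ms1q" />
  </RulePart>
</Rule>
"""


def load_rule(xml_text):
    """Return (class name, POS patterns, rule parts sorted by frequency)."""
    root = ET.fromstring(xml_text.strip())
    rule_types = [
        [(w.get("POS"), w.get("Flex") == "true") for w in rt.findall("WordRT")]
        for rt in root.findall("RuleType")
    ]
    parts = sorted(
        (
            (int(p.get("Frequency")), p.find("WordRP").get("GramCats"))
            for p in root.findall("RulePart")
        ),
        reverse=True,
    )
    return root.get("CFLX"), rule_types, parts


def pos_pattern_matches(pattern, pos_tags):
    """True if the POS tags fit the pattern; '*' accepts any tag."""
    if len(pattern) != len(pos_tags):
        return False
    return all(pos == "*" or pos == tag for (pos, _), tag in zip(pattern, pos_tags))


cflx, rule_types, parts = load_rule(RULE_XML)
# "princ na belom konju" tagged as noun, preposition, adjective, noun.
print(cflx, pos_pattern_matches(rule_types[0], ["N", "PREP", "A", "N"]))
print(parts)
```

Sorting the rule parts by frequency reflects the idea that the most frequently observed grammatical condition is tried first; whether the actual procedure orders them this way is an assumption.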

Page 9

Rule based procedure for inflection

System evaluation on three separate sets of data that differ both in content and in structure:

  • compound toponyms (238)
  • formal names of professions (356)
  • search engine queries (728), taken from the log file of one of the Serbian professional journals

Evaluation indicated that the strategy can be integrated into the morphological query expansion mechanism for compounds and phrases which do not exist in the compounds dictionary.

Page 10

Technical implementation

The process

  • The developed web application receives the user query, uses the local web service WS4QE to expand it, and forwards the expanded query to the Google search engine using the Google AJAX Search API (which enables the embedding of Google searches into personal web pages or web applications)

Interface

  • Query expansion is implemented with different possibilities and levels of detail, so the web user can choose from several options, from simple query expansion to complex wordnet advanced search
  • Search results are displayed within our own web pages for different types of query expansions, depending on the resources and type of expansion
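The last step of the process, turning a set of expansion variants into one search request, can be sketched as follows. This is an assumption-laden illustration: `build_expanded_query` and `search_url` are hypothetical names, the variants are joined as quoted phrases with the OR operator, and the base URL is an ordinary search address chosen for the example. The sketch does not call the Google AJAX Search API or the WS4QE web service.

```python
from urllib.parse import urlencode


def build_expanded_query(variants):
    """Quote each variant and join them with OR, dropping duplicates."""
    unique = []
    for variant in variants:
        variant = variant.strip()
        if variant and variant not in unique:
            unique.append(variant)
    return " OR ".join(f'"{v}"' for v in unique)


def search_url(query, base="https://www.google.com/search"):
    """Build a GET URL carrying the query; the base URL is an assumption."""
    return f"{base}?{urlencode({'q': query})}"


variants = ["beli luk", "бели лук", "češnjak", "чешњак", "beli luk"]
query = build_expanded_query(variants)
print(query)
print(search_url(query))
```

Quoting each variant keeps multi-word forms such as beli luk together, which matters for compounds whose components are common words on their own.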

Page 11

Technical implementation

  • Web service WS4QE uses classes from .NET dll components developed within WS4LR (WorkStation for Lexical Resources)
  • WS4LR enables the usage of lexical resources for query expansion

[Figure: the components that make up the WS4LR system and their inter-relationships]

Page 12

Technical implementation

[Figure: WS4QE home page, Wordnet advanced search]

Compare

  • The query submitted directly to Google with only the initial string ‘beli luk’ returned a total of 54,900 documents
  • The query expanded with ‘бели лук’, ‘češnjak’ and ‘чешњак’ and then submitted by WS4QE to Google returned a total of 92,700 documents

Page 13

Results for expanded query

Compare

  • The query submitted directly to Google obtained 66,600 documents
  • The query expanded with a hypernym, in both alphabets, obtained 160,000 documents
  • Morphological expansion in two alphabets (without semantic expansion) obtained 285,000 documents
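The reported counts can be compared as growth factors relative to the unexpanded query. The short calculation below uses only the figures quoted in the two comparisons in these slides; the function name `growth_factor` is hypothetical. The counts indicate how many more documents are reached and say nothing about precision.

```python
def growth_factor(baseline, expanded):
    """How many times more documents the expanded query returned."""
    return expanded / baseline


# (baseline count, expanded count) pairs as quoted in the slides.
comparisons = {
    "beli luk + alphabet and synonym variants": (54_900, 92_700),
    "hypernym expansion, both alphabets": (66_600, 160_000),
    "morphological expansion, both alphabets": (66_600, 285_000),
}

factors = {
    label: round(growth_factor(base, expanded), 2)
    for label, (base, expanded) in comparisons.items()
}
for label, factor in factors.items():
    print(f"{label}: x{factor}")
```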

Page 14

Results for expanded query

Page 15

Conclusion

Formulation of queries

  • Queries often need to be ‘fine tuned’ in order to obtain an optimal balance between recall and precision

Approach

  • Lexical resources can aid the user by offering various possibilities of query expansion

Further endeavors

  1. We shall continue to develop our lexical resources
  2. We will strive to broaden the scope of tasks that can be solved with our tools