The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović
Human Language Technology Group, University of Belgrade, Serbia
HLT Group, University of Belgrade2
Contents
Typical problems when retrieving documents using a web search engine
The lexical resources used
The system options
Technical implementation
Results and evaluation
Typical problems when retrieving documents using a web search engine
Highly inflective language
Donošenjem odluke… / Odluka o priređivanju igara… / Ministarstvo donosi odluku…
(By making a decision… / A decision to organize games… / The Ministry shall make a decision…)
Sastojci za 10 porcija: 3 glavice crnog luka, 1 šoljica ulja, 1/2 čaša belog vina, 1 čaša soka od paradajza
(The ingredients for 10 portions: 3 onions, 1 cup of oil, ½ glass of white wine, 1 glass of tomato juice.)
Typical problems
Lexical realization of a concept
• synonyms: beli luk ‘garlic’ → češnjak
• hyponyms: muzički instrument ‘musical instrument’ → klavir ‘piano’, gitara ‘guitar’, etc.
• derivations: Beograd → Beograđanin, Beograđanka, etc.
• and other relations
Bilingual search in order to find documents on the chosen subject in two languages, e.g. English and Serbian.
The lexical resources used
Resources used by WS4QE (Work Station for Query Expansion): morphological dictionaries, inflectional finite state transducers (FST), WordNets, Prolex
Serbian morphological dictionary in LADL format: 117,000 lemmas with 1,400,000 different lexical words
FSTs for inflection of both simple and compound words, developed for the Unitex system: http://www-igm.univ-mlv.fr/~unitex
The Serbian WordNet (WN), conceived within the BalkaNet project, with 14,593 synsets, and the Princeton WN are used for query expansion with related words and for bilingual searches
Prolex: a multilingual database of proper names organized around a conceptual proper name that represents the same concept in different languages: http://www.cnrtl.fr/lexiques/prolex/
The lexical resources used
For the query beli luk, two FSTs for the components and one for the compound are used, producing only 12 instead of 216 possible combinations:
beli luk AND belim lukom AND beli lukovi AND belih lukova AND belima lukovima AND belim lukovima AND bele lukove AND bela luka AND beloga luka AND belog luka AND belome luku AND belom luku
thus preventing false retrievals such as:
• ...posmatrano sa dna vidika, izgleda kao da iz širokih lukova belog mosta teče i razliva se ne samo zelena Drina…
• (Thus, from a bottom view, it appears that not only the green Drina flows and spills over from under the wide arches of the white bridge…)
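The agreement filter behind this reduction can be sketched in Python. This is a minimal illustration, not the authors' FST implementation: the form lists are a small hand-picked subset of the forms above, and the (gender, number, case) tags and the function names are assumptions made for the example.

```python
from itertools import product

# Hypothetical, hand-picked subset of inflected forms, each tagged with
# (gender, number, case). A real morphological dictionary lists many more.
ADJ_FORMS = [
    ("beli", ("m", "s", 1)),
    ("belog", ("m", "s", 2)),
    ("bela", ("f", "s", 1)),
    ("belih", ("m", "p", 2)),
    ("bele", ("m", "p", 4)),
]
NOUN_FORMS = [
    ("luk", ("m", "s", 1)),
    ("luka", ("m", "s", 2)),
    ("lukova", ("m", "p", 2)),
    ("lukove", ("m", "p", 4)),
]


def all_combinations(adj_forms, noun_forms):
    """Pair every adjective form with every noun form, ignoring agreement."""
    return [f"{a} {n}" for (a, _), (n, _) in product(adj_forms, noun_forms)]


def agreeing_forms(adj_forms, noun_forms):
    """Keep only pairs whose gender, number and case tags are identical."""
    return [
        f"{a} {n}"
        for (a, a_tags), (n, n_tags) in product(adj_forms, noun_forms)
        if a_tags == n_tags
    ]


unfiltered = all_combinations(ADJ_FORMS, NOUN_FORMS)
filtered = agreeing_forms(ADJ_FORMS, NOUN_FORMS)
print(len(unfiltered), "combinations without agreement")
print(len(filtered), "combinations with agreement:", filtered)
```

With agreement enforced, a sequence such as lukova belog from the bridge example is never generated as a form of the compound, because the word order and the tags of its components do not match.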
The system options
An improved query is obtained through:
• Alternate alphabet usage: štrajk ‘strike’ → штрајк
• Inclusion of inflectional forms: štrajk ‘strike’ → štrajk, štrajka, štrajkovi, etc.
• Addition of related words:
  štrajk ‘strike’ → obustava rada ‘work stoppage’
  solarni sistem ‘solar system’ → Merkur, Venera, Zemlja, Mars
  Engleska ‘England’ → Englez ‘Englishman’, Engleskinja ‘English woman’, plus Albion
• Inflection of free phrases by predicting their syntactic structure
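The alternate alphabet option can be illustrated with a short Python sketch of Latin-to-Cyrillic transliteration for Serbian. It is an assumption-laden illustration rather than the WS4QE code: the function name `to_cyrillic` is hypothetical, and the sketch always treats lj, nj and dž as digraphs, which is wrong for the few words where these are two separate letters.

```python
# Serbian Latin digraphs that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# The remaining 27 letters map one to one.
SINGLE = dict(zip("abvgdđežzijklmnoprstćufhcčš",
                  "абвгдђежзијклмнопрстћуфхцчш"))


def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # spaces, digits, punctuation stay unchanged
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


for word in ["štrajk", "beli luk", "češnjak"]:
    print(word, "→", to_cyrillic(word))
```

Handling digraphs before single letters is the key design choice: checking n before nj would split one Cyrillic letter into two.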
Rule based procedure for inflection
Procedure for automatic inflection of compounds and phrases based on a set of rules
Rule design strategy: the result of expert knowledge of morphology and the analysis of existing manually created compound dictionaries
Experiments with various rule strategies are possible; the final strategy is the result of several iterations
The rule-based strategy presently consists of 53 rules with a total of 1014 rule subtypes (rule parts)
EXAMPLE OF RULE NUMBER 43, CLASS NC_N6X

Class: NC_N6X

Gramm. condition    Frequency
_:fs1q__            3
_:ms1q__            2
_:ms1v__            2
_:ns1q__            1
_:fs1v__            0
_:ns1v__            0

Additional conditions: (The first component is a noun) AND ((The second, the third and the fourth component are in the genitive) OR (The second word is a preposition and the third word agrees with it))
Rule based procedure for inflection

<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="1">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2" />
  </RuleType>
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
  <RulePart ID="1" Frequency="3" Example="princ na belom konju">
    <WordRP ID="1" GramCats="ms1v" />
  </RulePart>
  <RulePart ID="2" Frequency="2">
    <WordRP ID="1" GramCats="ms1q" />
  </RulePart>
  <RulePart ID="3" Frequency="2">
    <WordRP ID="1" GramCats="ns1q" />
  </RulePart>
  <RulePart ID="4" Frequency="1">
    <WordRP ID="1" GramCats="fs1q" />
  </RulePart>
  <RulePart ID="5" Frequency="0">
    <WordRP ID="1" GramCats="ns1v" />
  </RulePart>
  <RulePart ID="6" Frequency="0">
    <WordRP ID="1" GramCats="fs1v" />
  </RulePart>
</Rule>
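How such a rule might be read and matched against a tagged phrase can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: treating POS="*" as a wildcard is a guess at the rule semantics, the `Condition` attributes (GramCats, PrepAgr) are not evaluated, and the function name `candidate_rule_types` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the two RuleType elements of rule 43 are reproduced here.
RULE_XML = """
<Rule ID="43" CFLX="NC_N6X" Status="true">
  <RuleType ID="1">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="3" POS="*" Flex="false" Condition="GramCats,2" />
    <WordRT ID="4" POS="*" Flex="false" Condition="GramCats,2" />
  </RuleType>
  <RuleType ID="2">
    <WordRT ID="1" POS="N" Flex="true" />
    <WordRT ID="2" POS="PREP" Flex="false" />
    <WordRT ID="3" POS="*" Flex="false" Condition="PrepAgr,2" />
    <WordRT ID="4" POS="*" Flex="false" />
  </RuleType>
</Rule>
"""


def candidate_rule_types(rule_xml, pos_tags):
    """Return (RuleType ID, IDs of inflecting words) for every rule type
    whose POS pattern fits the phrase. Conditions are NOT checked here."""
    rule = ET.fromstring(rule_xml.strip())
    matches = []
    for rule_type in rule.findall("RuleType"):
        words = rule_type.findall("WordRT")
        if len(words) != len(pos_tags):
            continue
        # Assumption: POS="*" accepts any part of speech.
        if all(w.get("POS") in ("*", tag) for w, tag in zip(words, pos_tags)):
            inflecting = [w.get("ID") for w in words if w.get("Flex") == "true"]
            matches.append((rule_type.get("ID"), inflecting))
    return matches


# "princ na belom konju" tagged by hand as noun, preposition, adjective, noun.
print(candidate_rule_types(RULE_XML, ["N", "PREP", "A", "N"]))
```

Both rule types fit the example phrase on part of speech alone, which shows why the additional conditions are needed to choose between them. In either case only the first word inflects.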
Rule based procedure for inflection
System evaluation on three separate sets of data that differ both in content and in structure:
• compound toponyms (238)
• formal names of professions (356)
• search engine queries (728), from the log file of a Serbian professional journal
Evaluation indicated that the strategy can be integrated into the morphological query expansion mechanism for compounds and phrases that do not exist in the compound dictionary
Technical implementation
The Process
• The developed web application receives the user query, uses the local web service WS4QE to expand it, and forwards the expanded query to the Google search engine using the Google AJAX Search API (which enables the embedding of Google searches into personal web pages or web applications)
Interface
• Query expansion is implemented with different possibilities and levels of detail, so the web user can choose from several options, from simple query expansion to complex wordnet advanced search
• Search results are displayed within our own web pages for different types of query expansion, depending on the resources and type of expansion
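The step between expansion and forwarding can be sketched in Python as building one search string from the expanded variants. This is a hedged illustration, not the WS4QE code: joining quoted variants with OR is an assumption about how the expanded query is assembled, and the function name `build_expanded_query` is hypothetical. No search API is called.

```python
from urllib.parse import quote_plus


def build_expanded_query(variants):
    """Join unique, non-empty variants into one quoted OR query
    (assumed query syntax, for illustration only)."""
    unique = []
    for variant in variants:
        variant = variant.strip()
        if variant and variant not in unique:
            unique.append(variant)
    return " OR ".join(f'"{v}"' for v in unique)


# Variants for the garlic example: synonym plus both alphabets.
variants = ["beli luk", "бели лук", "češnjak", "чешњак", "beli luk"]
query = build_expanded_query(variants)
print(query)

# URL-encoded form, ready to be passed as a query parameter.
print(quote_plus(query))
```

Quoting each variant keeps multi-word terms such as beli luk together, and dropping duplicates keeps the query short when several expansion options yield the same form.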
Technical implementation
Web service WS4QE uses classes from .NET dll components developed within WS4LR (WorkStation for Lexical Resources)
WS4LR enables the usage of lexical resources for query expansion
The components that make up the WS4LR system and their inter-relationships
Technical implementation
WS4QE home page
• Wordnet advanced search
Compare
• The query submitted directly to Google with only the initial string ‘beli luk’ returned a total of 54,900 documents
• The query expanded with ‘бели лук’, ‘češnjak’ and ‘чешњак’ and then submitted by WS4QE to Google returned a total of 92,700 documents
Results for expanded query
Compare
• The query submitted directly to Google obtained 66,600 documents
• The expanded query with hypernym, in both alphabets, obtained 160,000 documents
• Morphological expansion in two alphabets (without semantic expansion) obtained 285,000 documents
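The relative gain behind these counts can be worked out with a few lines of Python. The figures are those reported above; the labels and the function name `growth_factor` are assumptions for the example, and a larger hit count says nothing by itself about precision.

```python
BASELINE = 66_600  # documents for the query submitted directly

EXPANDED = {
    "hypernym, both alphabets": 160_000,
    "morphological, both alphabets": 285_000,
}


def growth_factor(baseline, expanded):
    """How many times more documents the expanded query returned."""
    return round(expanded / baseline, 2)


factors = {label: growth_factor(BASELINE, count)
           for label, count in EXPANDED.items()}
for label, factor in factors.items():
    print(f"{label}: x{factor}")
```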
Conclusion
Formulation of queries: Queries often need to be ‘fine tuned’ in order to obtain an optimal balance between recall and precision
Approach: Lexical resources can aid the user by offering various possibilities of query expansion
Further endeavors:
1. We shall continue to develop our lexical resources
2. We will strive to broaden the scope of tasks that can be solved with our tools