Indexing with linguistic techniques in the classic Information Retrieval model
(La indexación con técnicas lingüísticas en el modelo clásico de Recuperación de Información)

Julio Gonzalo, Anselmo Peñas y Felisa Verdejo

Grupo de Procesamiento de Lenguaje Natural

Dpto. Lenguajes y Sistemas Informáticos

UNED

Jornadas de Tratamiento y Recuperación de la Información JOTRI 2002

Content

• Goal
• Morpho-syntactic ambiguity in IR
• Phrase indexing
• Conceptual indexing
• Conclusions

Goal

Indexing with automatic linguistic techniques within the classic IR model

[Diagram labels: Information need, Query formulation, Search engine, Docs., Document ranking, Refinement]

• POS tagging
• Phrase indexing
• WSD & Conceptual Indexing

Bad strategies or too much error in automatic processing?

IR-Semcor, hand-annotated test collection
• Lemmas and phrases
• Senses
• Synsets

Morpho-syntactic ambiguity in IR

Texts
...particle crosses the wall...
...canadian red cross...
...boat to cross mississippi river...

Plain
Query: cross
...particl cross the wall... (matches)
...canadian red cross... (matches)
...boat to cross mississippi river... (matches)

POS Tagged
Query: cross_N
...particl_N cross_V the_D wall_N...
...canadian_ADJ red_ADJ cross_N... (matches)
...boat_N to_TO cross_V mississippi_N river_N...
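The contrast on this slide can be sketched with a toy inverted index. This is only an illustration, assuming the stemmed and tagged texts shown on the slide; the document ids `d1` to `d3` and the helper `build_index` are hypothetical names, not part of the original experiments.

```python
from collections import defaultdict

def build_index(docs):
    """Map each indexing term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

# Stemmed texts from the slide, without and with POS tags (ids are hypothetical).
plain_docs = {
    "d1": ["particl", "cross", "the", "wall"],
    "d2": ["canadian", "red", "cross"],
    "d3": ["boat", "to", "cross", "mississippi", "river"],
}
tagged_docs = {
    "d1": ["particl_N", "cross_V", "the_D", "wall_N"],
    "d2": ["canadian_ADJ", "red_ADJ", "cross_N"],
    "d3": ["boat_N", "to_TO", "cross_V", "mississippi_N", "river_N"],
}

plain_index = build_index(plain_docs)
tagged_index = build_index(tagged_docs)

plain_hits = sorted(plain_index["cross"])      # every text contains "cross"
tagged_hits = sorted(tagged_index["cross_N"])  # only the noun reading
print(plain_hits)   # ['d1', 'd2', 'd3']
print(tagged_hits)  # ['d2']
```

With tagged terms the query competes against fewer documents, at the price of losing any match whose tag differs from the query's.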

Morpho-syntactic ambiguity in IR

Documents matched are ranked much higher (there are fewer competing documents)

Manual POS tagging misses relevant matches
• Query: ...talented baseball player... (talent_ADJ)
• Doc: ...top talents of the time... (talent_N)
• Missing match

Automatic tagging makes more mistakes, but they are not always correlated with a decrease in retrieval
• Query: summer_N shoes_N design_V (design_V)
• Doc: Italian_ADJ designed_V sandals_N (design_V)
• Match

Phrase indexing

Texts
...a guide for the fisher who...
...information on cat care...
...arboreal carnivorous called fisher cat...

Plain
Query: fisher
...a guide for the fisher who... (matches)
...arboreal carnivorous called fisher cat... (matches)

Phrase indexing
Query: fisher
...a guide for the fisher who... (matches)
...arboreal carnivorous called fisher_cat...
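A minimal sketch of the same example, assuming a tiny hand-made phrase lexicon. `PHRASES`, `tokenize` and `search` are hypothetical helpers introduced for illustration; the original system's phrase recognizer is not described on the slide.

```python
def tokenize(text, phrases=()):
    """Lowercase and split the text, joining each known phrase into one token."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append("_".join(phrase))
                i += n
                break
        else:
            # No phrase starts at this position: keep the single word.
            out.append(tokens[i])
            i += 1
    return out

texts = {
    "d1": "a guide for the fisher who",
    "d2": "information on cat care",
    "d3": "arboreal carnivorous called fisher cat",
}
PHRASES = [("fisher", "cat")]  # hypothetical phrase lexicon

def search(query, phrases):
    """Return the ids of the texts whose token list contains the query term."""
    return sorted(d for d, t in texts.items() if query in tokenize(t, phrases))

plain_hits = search("fisher", phrases=())
phrase_hits = search("fisher", phrases=PHRASES)
print(plain_hits)   # ['d1', 'd3']
print(phrase_hits)  # ['d1']
```

Once `fisher cat` is indexed as the single term `fisher_cat`, the query `fisher` no longer retrieves the text about the animal.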

Phrase indexing

Phrase indexing harms retrieval sometimes
• Query: Candidate in governor’s_race
• Doc: Opened his race for governor
• Missing match

Phrase meaning is highly compositional

Needs semantic distinction

Conceptual Indexing

This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999), depending on the WSD error rate

Query: spring

Texts
• ...spring... / ...muelle...
• ...spring... / ...fountain... / ...fuente...
• ...spring... / ...springtime... / ...primavera...

WSD

Conceptual Index
• n03114639
• n05727069
• n09151839
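The idea can be sketched as an index keyed by synset identifiers instead of words. This is a toy illustration: the document ids are hypothetical, and the pairing of each synset id from the slide with a particular sense (device, water source, season) is an assumption made only for the example.

```python
from collections import defaultdict

# Synset ids taken from the slide; which id stands for which sense is assumed.
DEVICE, SOURCE, SEASON = "n03114639", "n05727069", "n09151839"

# Hypothetical sense-tagged documents in English and Spanish: (word, synset id).
docs = {
    "en1": [("spring", DEVICE)],
    "es1": [("muelle", DEVICE)],
    "en2": [("fountain", SOURCE)],
    "es2": [("fuente", SOURCE)],
    "en3": [("springtime", SEASON)],
    "es3": [("primavera", SEASON)],
}

def build_conceptual_index(tagged_docs):
    """Index documents by synset id, ignoring the surface word and language."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for _word, synset in tokens:
            index[synset].add(doc_id)
    return index

index = build_conceptual_index(docs)

# Query "spring", disambiguated (by hand here) to the season sense.
season_hits = sorted(index[SEASON])
print(season_hits)  # ['en3', 'es3']
```

Because both the query and the documents are reduced to synsets, a single disambiguated query reaches synonyms and documents in another language, provided the WSD step picked the right synset.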

Word Sense Disambiguation

(Sanderson 1994) introduced fixed error rates in pseudo-word disambiguation
• banana → banana/education/toy/gun/forest → WSD → toy
to conclude (over the Reuters collection) that WSD must be above 90% accuracy

Reproduce Sanderson’s experiment (over IR-Semcor)

Compare precision in retrieval over synsets with WSD errors
• spring: annotated synset n07062238 {spring, springtime}; WSD output n04985670 {spring, hook} (error)
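Injecting a fixed error rate into a hand-tagged collection might look like the sketch below. It assumes a list of `(word, synset)` pairs and a table of candidate synsets per word; `add_wsd_errors` and its parameters are hypothetical names, and the exact procedure used in the experiments is not given on the slide.

```python
import random

def add_wsd_errors(tagged_tokens, candidates, error_rate, seed=0):
    """Replace the annotated synset with a wrong candidate at a fixed rate.

    tagged_tokens: list of (word, annotated_synset) pairs
    candidates:    dict mapping each word to all of its synsets
    Words with a single candidate synset cannot be mis-tagged and are kept.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    noisy = []
    for word, synset in tagged_tokens:
        wrong = [s for s in candidates.get(word, []) if s != synset]
        if wrong and rng.random() < error_rate:
            synset = rng.choice(wrong)
        noisy.append((word, synset))
    return noisy

# The two synsets for "spring" shown on the slide.
candidates = {"spring": ["n07062238", "n04985670"]}
gold = [("spring", "n07062238")] * 1000

for rate in (0.0, 0.1, 0.3):
    noisy = add_wsd_errors(gold, candidates, rate)
    errors = sum(1 for g, n in zip(gold, noisy) if g != n)
    print(rate, errors / len(gold))
```

Indexing the noisy output at several error rates and comparing precision against the error-free synset index gives the kind of curve the experiment compares.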

[Figure: Pseudo-words with no errors in WSD text]

[Figure: Synset indexing with no errors in WSD]

Conceptual Indexing

Although explicit disambiguation strategies applied to indexing
• POS tagging
• Phrase indexing
• Word Sense Disambiguation
don’t produce a significant improvement in IR,

conceptual indexing based on synsets
• Needs automatic WSD accuracy near the state of the art (60%)
• Permits Cross-Language Information Retrieval

Qualitative evaluation (Item search engine)
• Some unsolved challenges (mainly WSD)
• Users perceive a slower and less transparent system

Conclusions

Think of users
– Even an improvement of 10% wouldn’t change users’ perception
– Don’t subordinate NLP to the classic IR model
– Find new paradigms in Information Access
– At a higher level, closer to users
• Consider users’ tasks
• Consider users’ interaction


IR-Semcor test collection

– 254 hand-annotated documents in English
– 82 hand-annotated queries in English with ~6.8 relevant documents each

Example
The Fulton County Grand Jury investigates possible irregularities in Atlanta’s primary election

Lemmas and phrase annotation
The Fulton_County_Grand_Jury investigate possible irregularity in atlanta primary_election

Sense annotation
Fulton_County_Grand_Jury investigate2 possible2 irregularity1 atlanta1 primary_election1

Synset annotation (actually synset offsets or ILI-records)
Fulton_County_Grand_Jury v00441414 a00036893 n00412042 n5608324 n00103176

• v00441414: { investigate, carry_out_an_investigation_of }
• a00036893: { possible, potential }
• n00412042: { irregularity, abnormality }
• n5608324: { Atlanta, capital_of_Georgia }
• n00103176: { primary_election, primary }
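The three annotation layers can be read as three alternative sets of indexing terms for the same sentence. The sketch below assumes that the sense and synset lines align position by position, as their order on the slide suggests; the tuple layout and function names are hypothetical.

```python
# Annotated content words of the example sentence: (lemma, sense number, synset).
# None marks a layer the slide leaves unannotated for that token.
annotated = [
    ("Fulton_County_Grand_Jury", None, None),
    ("investigate", 2, "v00441414"),
    ("possible", 2, "a00036893"),
    ("irregularity", 1, "n00412042"),
    ("atlanta", 1, "n5608324"),
    ("primary_election", 1, "n00103176"),
]

def lemma_terms(tokens):
    """Indexing terms for the lemmas-and-phrases layer."""
    return [lemma for lemma, _sense, _synset in tokens]

def sense_terms(tokens):
    """Indexing terms for the sense layer: lemma plus sense number."""
    return [lemma if sense is None else f"{lemma}{sense}"
            for lemma, sense, _synset in tokens]

def synset_terms(tokens):
    """Indexing terms for the synset layer; untagged tokens keep their lemma."""
    return [lemma if synset is None else synset
            for lemma, _sense, synset in tokens]

print(sense_terms(annotated))
print(synset_terms(annotated))
```

Indexing the same collection three times, once per layer, is what allows the retrieval results for lemmas, senses and synsets to be compared on equal terms.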

IR-Semcor test collection

[Diagram: Semcor 1.5 (Doc 1, Doc 2, ..., Doc ~100); Semcor 1.6 (Doc 1, Doc 2, ..., Doc 83); IR-Semcor (Doc 1, Doc 2, ..., Doc 171, ..., Doc 254; Query 1, Query 2, ..., Query 82)]

Hand-annotated summaries only for chunked docs

Assume the summary of a text is relevant to all fragments of the original Semcor document
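The relevance assumption can be written down as a small mapping step. This sketch assumes, as the slide suggests, that each query is the summary of one Semcor document; all ids and the function name `expand_relevance` are hypothetical.

```python
def expand_relevance(summary_to_source, fragments_of):
    """Mark every fragment of the summarized document as relevant to the query."""
    qrels = {}
    for query_id, source_doc in summary_to_source.items():
        qrels[query_id] = set(fragments_of[source_doc])
    return qrels

# Hypothetical ids: query q1 summarizes one Semcor document,
# which was split into three IR-Semcor fragments.
summary_to_source = {"q1": "semcor_doc_1"}
fragments_of = {
    "semcor_doc_1": ["semcor_doc_1.1", "semcor_doc_1.2", "semcor_doc_1.3"],
}

qrels = expand_relevance(summary_to_source, fragments_of)
print(sorted(qrels["q1"]))
```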

• Textual representation: query is translated into the target language
• Conceptual representation: query and documents are compared at a conceptual level
• Selection of query language
• Selection of WSD strategy
• Selection of newspaper determines the target language
• Retrieved documents

Approaches

Natural Language Processing
Disambiguation
Conceptual indexing
Terminology
Controlled vocabularies indexing & browsing
String Processing
Free text indexing
Information Retrieval
Phrase indexing & browsing (Phind)
Keyphrase navigation (Phrasier)
Automatic Terminology Extraction
Terminology Retrieval & Term browsing (WTB)

Semantic distinction of compounds

II. Experiments in Lexical Ambiguity and Indexing

Automatic classification through WordNet
• Endocentric: one component is a hyperonym
• Appositional: all components are hyperonyms
• Exocentric: no component is a hyperonym

Types of lexical compounds
• Endocentric: purchasing department is_a department
• Appositional: aspirin powder is_a powder, is_a aspirin
• Exocentric: fisher cat
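The three-way classification can be sketched with a toy `is_a` table standing in for WordNet hypernymy. The table contents and the function `classify_compound` are assumptions for illustration; a real system would look the links up in WordNet.

```python
# Toy is_a relation standing in for WordNet hypernymy (assumed, not looked up).
IS_A = {
    "purchasing_department": {"department"},
    "aspirin_powder": {"aspirin", "powder"},
    "fisher_cat": {"carnivore"},
}

def classify_compound(compound):
    """Classify a compound by how many of its components are its hypernyms."""
    components = compound.split("_")
    hypernyms = IS_A.get(compound, set())
    matched = [c for c in components if c in hypernyms]
    if not matched:
        return "exocentric"      # no component is a hypernym
    if len(matched) == len(components):
        return "appositional"    # every component is a hypernym
    return "endocentric"         # some, but not all, components are hypernyms

for compound in IS_A:
    print(compound, classify_compound(compound))
```

Under this reading only exocentric compounds such as `fisher_cat` have a meaning that their parts do not predict, which is why they are singled out for phrase indexing.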

[Figure: plot with series Plain text, All compounds and Exocentric compounds; one axis from 0 to 50, the other from 10 to 100]