LINKING VERB PATTERN DICTIONARIES OF ENGLISH AND SPANISH

VÍT BAISA ~ SARA MOŽE ~ IRENE RENAU

Masaryk University, Brno, Czech Republic

University of Wolverhampton, United Kingdom

Pontificia Universidad Católica de Valparaíso, Chile

INTRODUCTION

• Verbs are complex

• AIM: methodology and tools for the creation of a multilingual corpus-driven lexical resource for verbs using manual and automatic procedures

• CPA-based monolingual pattern dictionaries

– What are they?

• New multilingual resource – researchers and language professionals?

• Preliminary study:

I. Manual linking task = gold standard dataset

II. Automatic linking task = algorithm; evaluated against the gold standard

CORPUS PATTERN ANALYSIS (CPA)

• Corpus Pattern Analysis (CPA) (Hanks, 2004)

– an empirical technique in Corpus Linguistics and Lexicography

– maps word meaning onto word use through lexical analysis of phraseological patterns and collocations

• Basis: Theory of Norms and Exploitations (TNE) (Hanks, 2013)

– ‘double helix’ – patterns of normal usage (‘norms’) vs. their ‘exploitations’

• ‘Pattern’ – semantically motivated syntagmatic pattern

– Syntax: SPOCA (Halliday, 1994)

– Semantics: typical nominal slot fillers, represented by Semantic Types (STs) – mnemonic semantic labels

– CPA shallow ontology (Ježek and Hanks, 2010) – approx. 250 STs; shared by several projects
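As a rough illustration of how such a pattern could be held in memory, the sketch below parses the double-bracket notation used in these slides (e.g. `[[Human]] admire [[Anything]]`) into a verb plus sets of Semantic Types. The class name `Pattern`, the helper names and the restriction to a subject and an optional direct object are assumptions made for this example; they are not the PDEV or Verbario data model.

```python
import re
from dataclasses import dataclass

# Matches one argument slot such as "[[Eventuality 1 | Human]]".
SLOT_RE = re.compile(r"\[\[(.*?)\]\]")


@dataclass(frozen=True)
class Pattern:
    """Hypothetical, simplified pattern: verb, subject STs, direct-object STs."""
    verb: str
    subject: frozenset
    direct_object: frozenset  # empty for patterns without an object


def parse_types(slot_text):
    """Turn 'Eventuality 1 | Human' into frozenset({'Eventuality', 'Human'})."""
    types = set()
    for alternative in slot_text.split("|"):
        # Drop a trailing role index such as the "1" in "Eventuality 1".
        name = re.sub(r"\s+\d+$", "", alternative.strip())
        if name:
            types.add(name)
    return frozenset(types)


def parse_pattern(text):
    """Parse '[[S types]] verb [[DO types]]' (the object slot is optional)."""
    slots = SLOT_RE.findall(text)
    if not slots:
        raise ValueError("pattern has no [[...]] slot: " + text)
    verb = " ".join(SLOT_RE.sub(" ", text).split())
    subject = parse_types(slots[0])
    direct_object = parse_types(slots[1]) if len(slots) > 1 else frozenset()
    return Pattern(verb, subject, direct_object)


if __name__ == "__main__":
    print(parse_pattern("[[Human]] admire [[Anything]]"))
    print(parse_pattern("[[Eventuality 1 | Human | Institution]] occasion [[Eventuality 2]]"))
```

Real CPA patterns also carry adverbials, complements, implicatures and usage percentages, none of which this sketch models.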

WHAT IS A PATTERN?

• PDEV: harvest

CPA PATTERN DICTIONARIES

• Pattern Dictionary of Italian Verbs (PDIV) – Elisabetta Ježek, Pavia

• Pattern Dictionary of English Verbs (PDEV)

– Public website: http://pdev.org.uk/

– Prof. Hanks, University of Wolverhampton; over 1,700 English verbs completed

– Procedure: corpus samples (250/500/1000 lines) from the BNC corpus (Leech, 1992);

• Sketch Engine – word sketches (Kilgarriff et al., 2014),

• CPA Editor (Baisa et al., 2015) and CPA shallow ontology (Ježek and Hanks, 2010)

• Implicatures; register, domain, idiom/phrasal verb labels; links to FrameNet (Ruppenhofer et al., 2010)

• Percentages for each pattern

• Pattern Dictionary of Spanish Verbs (PDSV)

– Public website: http://www.verbario.com/

– Verbario: Irene Renau, Pontificia Universidad Católica de Valparaíso

– 300 high-frequency Spanish verbs (currently only 100 publicly available online)

– Same methodology (CPA), guidelines, ontology, tools (SkE); but: Spanish Web Corpus

MANUAL LINKING: SP-EN PATTERN PAIRS

• Gold standard:

– 87 SP verbs with one or more EN equivalents (total: 126 EN verbs)

– Medium-frequency verbs, up to 15 patterns

– Manual cross-linguistic links between pattern pairs; semanto-syntactic similarity = tertium comparationis

– Linking procedure developed

– Dataset used in algorithm evaluation

• Issues – practical, theoretical

– Coverage: PDEV/PDSV are work-in-progress resources with different coverage; limited overlap

– Zero equivalence: cultural, social, cognitive, pragmatic reasons; idioms

INPUT: POTENTIALLY MATCHING EN PATTERN

1. Does it have the same basic syntactic structure as the SP pattern (i.e. SVO or SV [+no obj])?

NO -> OUTPUT: NO MATCH

YES -> go to 2

2. Do all semantic types in all obligatory syntactic slots match? E.g.:

EN: [[Human]] admire [[Anything]]

SP: [[Human]] admirar [[Anything]]

YES -> OUTPUT: PERFECT MATCH

NO -> go to 3

3. Do the two patterns share at least ONE semantic type in the same obligatory syntactic slot? For example:

EN: [[Eventuality 1 | Human | Institution]] occasion [[Eventuality 2]]

SP: [[Eventuality 1]] motivar [[Eventuality 2]]

YES -> OUTPUT: PARTIAL MATCH

NO -> go to 4

4. Are the two semantic types in the same obligatory syntactic position related to each other in terms of inheritance in the CPA ontology (up to two nodes), e.g. [[Eventuality]] (supertype) vs. [[Activity]] and [[Plan]] (subtypes):

EN: [[Eventuality 1 | Human]] spoil [[Eventuality 2]]

SP: [[Eventuality | Human]] estropear [[Activity | Plan]]

YES -> OUTPUT: PARTIAL MATCH

NO -> OUTPUT: NO MATCH
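The manual decision procedure above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the order in which the four questions are asked is one reading of the flowchart, patterns are reduced to a mapping from slot name to a set of Semantic Types, and `PARENT` is a hypothetical two-entry fragment standing in for the CPA ontology. The function names are invented for the example.

```python
# Hypothetical fragment of the type hierarchy (child -> parent), taken from the
# [[Eventuality]] / [[Activity]] / [[Plan]] example; the real CPA ontology
# has roughly 250 types and may place these nodes differently.
PARENT = {"Activity": "Eventuality", "Plan": "Eventuality"}
MAX_NODES = 2  # "up to two nodes" in the ontology


def ancestors(sem_type):
    """Return the chain of supertypes of sem_type, nearest first."""
    chain = []
    while sem_type in PARENT:
        sem_type = PARENT[sem_type]
        chain.append(sem_type)
    return chain


def related(a, b, max_nodes=MAX_NODES):
    """True if one type is a supertype of the other within max_nodes steps."""
    return b in ancestors(a)[:max_nodes] or a in ancestors(b)[:max_nodes]


def link(en, sp):
    """Classify an EN/SP pattern pair; each pattern maps slot -> set of STs."""
    # 1. Same basic structure: the same obligatory slots are filled (SV vs. SVO).
    if set(en) != set(sp):
        return "NO MATCH"
    # 2. All semantic types match in every slot.
    if all(en[slot] == sp[slot] for slot in en):
        return "PERFECT MATCH"
    # 3. At least one semantic type is shared in the same slot.
    if any(en[slot] & sp[slot] for slot in en):
        return "PARTIAL MATCH"
    # 4. Types in the same slot are related by inheritance.
    if any(related(a, b) for slot in en for a in en[slot] for b in sp[slot]):
        return "PARTIAL MATCH"
    return "NO MATCH"


if __name__ == "__main__":
    admire = {"S": {"Human"}, "DO": {"Anything"}}
    admirar = {"S": {"Human"}, "DO": {"Anything"}}
    occasion = {"S": {"Eventuality", "Human", "Institution"}, "DO": {"Eventuality"}}
    motivar = {"S": {"Eventuality"}, "DO": {"Eventuality"}}
    print(link(admire, admirar))    # PERFECT MATCH
    print(link(occasion, motivar))  # PARTIAL MATCH
```

Under this reading, the spoil/estropear pair is already a partial match at step 3, because the subject slots share types; step 4 only decides pairs with no shared type in any slot.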

AUTOMATIC PATTERN LINKING: ALGORITHM

• Heuristic-based algorithm: automatic linking suggestions

• Similarity score

– 490 SP patterns and their translations into EN (statistical EN-SP dictionary <-- parallel corpus)

– S, DO, IO comparison of STs

– Full match: 1 score pt (*Human = 0.5 pt); matching empty slots (e.g. DO): 0.5 pts

– CPA ontology: similarity score = 0.5^N, calculated from the distance (N) between the two STs in the CPA ontology tree

– Scores summed up, final score assigned to the pair, top ranking EN pattern = most likely candidate
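A minimal sketch of this scoring scheme follows. Several details are assumptions rather than the authors' implementation: the ontology score is read as 0.5 raised to the power N, N is counted as the number of edges between two types in the tree, a slot with several alternative STs takes its best-scoring pair, and `PARENT` is a hypothetical toy hierarchy, not the CPA ontology. The pattern labels in the usage example are invented.

```python
# Hypothetical toy hierarchy (child -> parent); NOT the real CPA ontology.
PARENT = {
    "Activity": "Eventuality",
    "Plan": "Eventuality",
    "Eventuality": "Anything",
    "Human": "Anything",
}
SLOTS = ("S", "DO", "IO")  # subject, direct object, indirect object


def path_to_root(sem_type):
    """List of nodes from sem_type up to the root of the toy tree."""
    path = [sem_type]
    while sem_type in PARENT:
        sem_type = PARENT[sem_type]
        path.append(sem_type)
    return path


def distance(a, b):
    """Number of edges between a and b, or None if they share no ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return steps_a + path_b.index(node)
    return None


def type_score(a, b):
    """1 pt for identical types (0.5 for Human), else 0.5 ** N by tree distance."""
    if a == b:
        return 0.5 if a == "Human" else 1.0
    n = distance(a, b)
    return 0.0 if n is None else 0.5 ** n


def slot_score(en_types, sp_types):
    """Score one slot; two matching empty slots earn 0.5 pts."""
    if not en_types and not sp_types:
        return 0.5
    if not en_types or not sp_types:
        return 0.0
    return max(type_score(a, b) for a in en_types for b in sp_types)


def pattern_score(en, sp):
    """Sum the slot scores over S, DO and IO."""
    return sum(slot_score(en.get(s, set()), sp.get(s, set())) for s in SLOTS)


def best_candidate(sp, en_patterns):
    """Return the label of the top-ranking EN pattern for one SP pattern."""
    return max(en_patterns, key=lambda label: pattern_score(en_patterns[label], sp))


if __name__ == "__main__":
    admirar = {"S": {"Human"}, "DO": {"Anything"}}
    candidates = {
        "admire (transitive)": {"S": {"Human"}, "DO": {"Anything"}},
        "admire (no object)": {"S": {"Human"}},
    }
    print(best_candidate(admirar, candidates))
```

Taking the maximum over alternative STs in a slot is only one possible design; averaging or summing the pairwise scores would rank candidates differently.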

• Evaluation

– 50 SP-EN verb pairs

– Excluded: SP patterns that cannot be matched against an EN pattern in the sample

– Final no. of candidate pattern pairs: 50 (gold standard)

– 40/50 suggested candidate pairs were correct (80% precision)
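The precision figure is the share of top-ranked suggestions that agree with the gold standard. The sketch below shows that computation on placeholder data; the identifiers `sp0`, `en0` and so on are invented and do not come from the dataset.

```python
def precision(suggested, gold):
    """Share of suggested SP -> EN links that also appear in the gold standard."""
    if not suggested:
        return 0.0
    correct = sum(1 for sp, en in suggested.items() if gold.get(sp) == en)
    return correct / len(suggested)


if __name__ == "__main__":
    # Placeholder data: 50 gold links, 40 of which the algorithm reproduces.
    gold = {f"sp{i}": f"en{i}" for i in range(50)}
    suggested = dict(gold)
    for i in range(40, 50):
        suggested[f"sp{i}"] = "wrong"
    print(f"{precision(suggested, gold):.0%}")
```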

CONCLUSION

• Future activities:

– Gold standard: more annotated data;

– Refine the linking procedure (fine-grained distinctions?; intralingual links)

– Algorithm: train, improve precision;

– Software adaptation: feature for adding cross-linguistic links to the dictionaries/databases.

REFERENCES

• Baisa, V., El Maarouf, I., Rychlý, P. & Rambousek, A. (2015). Software and data for Corpus Pattern Analysis. In Horák, A., Rychlý, P., and Rambousek, A. (eds.), Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno. Tribun EU. 75–86.

• Buyse, K. and Verlinde, S. (2013). Possible effects of free on line data driven lexicographic instruments on foreign language learning: The case of Linguee and the interactive language toolbox. In Procedia: Social and Behavioral Sciences, volume 95, pages 507–512. Elsevier BV.

• Fillmore, C. J. and Baker, C. (2010). A frames approach to semantic analysis. The Oxford Handbook of Linguistic Analysis, pages 313–339.

• Halliday, M. A. K. (1994). An introduction to Functional Grammar. Edward Arnold.

• Hanks, P. (2004). Corpus Pattern Analysis. In G. Williams & S. Vessier (Eds.), 11th Euralex International Congress. Proceedings. Lorient: Université de Bretagne-Sud, pp. 87–97.

• Hanks, P. (2013). Lexical Analysis: Norms and Exploitations. Cambridge, MA: MIT Press.

• Hlaváčková, D., Horák, A. (2005). Verbalex – new comprehensive lexicon of verb valencies for Czech. In Proceedings of the Slovko Conference.

• Ježek, E., & Hanks, P. (2010). What lexical sets tell us about conceptual categories. Lexis: E-journal in English lexicology. 4: Corpus Linguistics and the Lexicon. Université Lumière, Lyon. 7–22.

• Ježek, E., Magnini, B., Feltracco, A., Bianchini, A., and Popescu, O. (2014). T-pas: A resource of corpus-derived types predicate-argument structures for linguistic analysis and semantic processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 26–31.

• Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P. & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography 1(1): 7–36.

• Leech, G. (1992). 100 million words of English: the British National Corpus (BNC). Language Research 28(1): 1–13.

• Maarouf, I. E., Bradbury, J., and Hanks, P. (2014). PDEVlemon: a Linked Data implementation of the Pattern Dictionary of English Verbs based on the Lemon model. In Proceedings of the 3rd Workshop on Linked Data in Linguistics (LDL): Multilingual Knowledge Resources and Natural Language Processing at the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland.

• Navigli, R. & Ponzetto, S. (2012). BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence 193: 217–250.

• Nazar, R. & Renau, I. (2015). Ontology Population Using Corpus Statistics. In O. Papini, S. Benferhat, L. Garcia et al. (Eds.), Proceedings of the Joint Ontology Workshops 2015 co-located with the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina, July 25-27, 2015.

• Ruppenhofer, J., Ellsworth, M., Petruck, M. R., Johnson, C. R. & Scheffczyk, J. (2010). FrameNet II: Extended Theory and Practice. Berkeley, CA: ICSI.

• Vossen, P. (2002). WordNet, EuroWordNet and Global WordNet. Revue Française de Linguistique Appliquée 7(1): 27–38.

• Yong, H. & Peng, J. (1997). Bilingual lexicography from a communicative perspective. Amsterdam: John Benjamins.

USEFUL LINKS

• Pattern Dictionary of English Verbs

http://pdev.org.uk/

• VERBARIO (Pattern Dictionary of Spanish Verbs)

http://www.verbario.com/

• PDEV-LEMON

http://pdev.org.uk/PDEVLEMON.html