Finding Semantic Classes of Nouns for Hindi/Urdu Complex Predicates
Sebastian Sulger & Ashwini Vaidya
Universität Konstanz
ParGram Meeting, Spring 2014
The situation
Spoken and written Hindi/Urdu: heavy, productive use of complex predicates (CPs) across domains
Different types of CPs:
Aspectual V+V CPs: gIr pAr. ‘suddenly fall’ (lit. ‘fall fall’)
Permissive V+V CPs: jane de ‘let go’ (lit. ‘go give’)
N+V CPs: yad kAr ‘remember’ (lit. ‘memory do’)
In other languages:
take a bath (≈ ‘bathe’)
give a stir (≈ ‘stir’)
in Betracht ziehen ‘consider’ (lit. ‘in look-at pull’)
The challenges
General problem in deep and shallow parsing methods for Hindi/Urdu (and other South Asian languages): proper treatment of complex predicates
Automatic distinction of CPs from simplex verbs
Extraction of subcategorization frames
Semantic role labeling
Drawing semantic inferences
Research questions:
What existing resources may be employed to explore CP usage?
Can we confirm/reject existing theoretical hypotheses of N+V CPs?
How far can clustering algorithms take us?
... and how “good” are the resulting classes?
Hindi/Urdu Noun-Verb Complex Predicates
Contents
1 Hindi/Urdu Noun-Verb Complex Predicates
2 Corpus study
3 Evaluation
4 Semantic Classes
The construction
Combination of noun and light verb to form a single predicational unit
Noun contributes main predicational content (including argument(s)), light verb dictates case marking and expresses subtle lexical semantic differences
Highly productive constructions
[Ahmed and Butt, 2011]: proposal for different classes of N+V CPs based on a small case study of 45 nouns
           light verb
N+V type   kAr ‘do’   ho ‘be’   hu- ‘become’   analysis
class a    +          +         +              psych predications
class b    +          −         +              only agentive
class c    +          +         −              subject is not an undergoer
Table: Classes of nouns identified by [Ahmed and Butt, 2011]
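The table above can be read as a lookup from light-verb compatibility to noun class. The following is only a minimal sketch of that reading; the names `CLASS_PATTERNS` and `classify_noun` are hypothetical and not part of the original proposal.

```python
# Compatibility patterns copied from the table above, in the order
# (kAr 'do', ho 'be', hu- 'become'). The names are illustrative only.
CLASS_PATTERNS = {
    (True, True, True): "class a (psych predications)",
    (True, False, True): "class b (only agentive)",
    (True, True, False): "class c (subject is not an undergoer)",
}


def classify_noun(with_kar, with_ho, with_hu):
    """Return the class label for a compatibility pattern, or None if unlisted."""
    return CLASS_PATTERNS.get((with_kar, with_ho, with_hu))


# A noun that combines with all three light verbs falls into class a.
print(classify_noun(True, True, True))
```

Patterns that the table does not list return `None`, which is where an extended set of light verbs would call for additional classes.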
Class A: psych predications
Occur with all three light verbs examined by [Ahmed and Butt, 2011]
(1) a. lAr.ki=ne kAhani yad k-i
       girl.F.Sg=Erg story.F.Sg memory.F.Sg do-Perf.F.Sg
       ‘The girl remembered a/the story.’
       (lit. ‘The girl did memory of the story.’)
    b. lAr.ki=ko kAhani yad hE
       girl.F.Sg=Dat story.F.Sg memory.F.Sg be.Pres.3.Sg
       ‘The girl remembers/knows a/the story.’
       (lit. ‘Memory of the story is at the girl.’)
    c. lAr.ki=ko kAhani yad hu-i
       girl.F.Sg=Dat story.F.Sg memory.F.Sg be.Perf-F.Sg
       ‘The girl came to remember a/the story.’
       (lit. ‘Memory of the story became to be at the girl.’)
Class B: agentive CPs
Require an agentive (ergative-marked) subject and light verb kAr ‘do’
(2) a. bIlal=ne mAkan tAmir kI-ya
       Bilal.M.Sg=Erg house.M.Sg construction.F.Sg do-Perf.M.Sg
       ‘Bilal built a/the house.’
       (lit. ‘Bilal did construction of the house.’)
    b. * bIlal=ko mAkan tAmir hE
       Bilal.M.Sg=Dat house.M.Sg construction.F.Sg be.Pres.3.Sg
    c. * bIlal=ko mAkan tAmir hu-a
       Bilal.M.Sg=Dat house.M.Sg construction.F.Sg be.Perf-M.Sg
Class C: subject not an undergoer
Exclude the light verb hu- ‘become’
(3) a. bIlal=ne yih sArt. tAslim k-i
       Bilal.M.Sg=Erg this condition.F.Sg acceptance.M.Sg do-Perf.F.Sg
       ‘Bilal accepted this condition.’
       (lit. ‘Bilal did acceptance of this condition.’)
    b. bIlal=ko yih sArt. tAslim hE
       Bilal.M.Sg=Dat this condition.F.Sg acceptance.M.Sg be.Pres.3.Sg
       ‘Bilal accepted this condition.’
       (lit. ‘Acceptance of this condition was at Bilal.’)
    c. * bIlal=ko yih sArt. tAslim hu-a
       Bilal.M.Sg=Dat this condition.F.Sg acceptance.M.Sg be.Perf-M.Sg
And beyond ...
[Ahmed and Butt, 2011] looked at a set of three light verbs
Extending the set of light verbs brings up new questions
Nouns that occur with kAr ‘do’ and de ‘give’ (but exclude other light verbs)
(4) a. nadya=ne lAr.ki=ko pAramArs kI-ya
       Nadya.F.Sg=Erg girl.F.Sg=Acc advice.M.Sg do-Perf.M.Sg
       ‘Nadya advised the girl.’
       (lit. ‘Nadya did advice to the girl.’)
    b. nadya=ne lAr.ki=ko pAramArs dI-ya
       Nadya.F.Sg=Erg girl.F.Sg=Acc advice.M.Sg give-Perf.M.Sg
       ‘Nadya advised the girl.’
       (lit. ‘Nadya gave advice to the girl.’)
Nouns that occur with kAr ‘do’ only, not with de ‘give’
(5) a. bIlal=ne mAkan tAmir kI-ya
       Bilal.M.Sg=Erg house.M.Sg construction.F.Sg do-Perf.M.Sg
       ‘Bilal built a/the house.’
       (lit. ‘Bilal did construction of a/the house.’)
       [Ahmed and Butt, 2011, p. 3]
    b. * bIlal=ne mAkan tAmir dI-ya
       Bilal.M.Sg=Erg house.M.Sg construction.F.Sg give-Perf.M.Sg
Nouns that occur with le ‘take’ only, not with any other light verb
(6) a. nadya=ne lAr.ki=ko god lI-ya
       Nadya.F.Sg=Erg girl.F.Sg=Acc lap.F.Sg take-Perf.M.Sg
       ‘Nadya adopted the girl.’
       (lit. ‘Nadya took lap to the girl.’)
    b. * nadya=ne lAr.ki=ko god kI-ya
       Nadya.F.Sg=Erg girl.F.Sg=Acc lap.F.Sg do-Perf.M.Sg
    c. * nadya=ne lAr.ki=ko god dI-ya
       Nadya.F.Sg=Erg girl.F.Sg=Acc lap.F.Sg give-Perf.M.Sg
Goals of the investigation
How do the proposals by [Ahmed and Butt, 2011] hold up against a larger empirical basis (i.e., bigger corpora)?
Extend the set of light verbs
Apply different strategies of acquiring knowledge about CPs:
“Brute-force” statistical approach, based on bigram extraction, collocation analysis and clustering [Butt et al., 2012]
“Seed list” approach, using knowledge amassed from treebanks and clustering, followed by an evaluation of the clusters
Come up with semantic classes of nouns:
Members of a class will behave in a coherent way with respect to the light verbs they may occur with
Of great use for the Hindi/Urdu grammar: extend noun lexicon, define templates of N+V CPs
Corpus study
Methodology
In a recent corpus study on Hindi, we used the approach below:
1 Use a corpus of 17 million words harvested from the BBC Hindi website & Hindi Wikipedia
The corpus is POS-tagged and lemmatized using a state-of-the-art Hindi tagger [Reddy and Sharoff, 2011]
2 Look at a set of seven light verbs: kAr ‘do’, ho ‘be’, de ‘give’, le ‘take’, rAkh ‘put’, lAg ‘be attached’, a ‘come’ (the seven most frequently occurring light verbs)
3 Make use of the Hindi-Urdu Treebank (HUTB) [Bhatt et al., 2009]
Includes dependency annotation scheme
Employs label pof (for part of) to annotate complex predicates
Extract all items that are tagged as nouns and carry pof label
→ “Seed list” of nouns that we know take part in N-V CPs
4 Extract all bigrams which have one of the seven light verbs (their lemmas) on the right (frequency cutoff 10, to get rid of some spelling variation as well as marginal usages)
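Steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the function name `extract_bigrams`, the input format (one list of lemmas per sentence), the restriction of the left-hand item to seed-list nouns, and reading "cutoff 10" as "at least 10 occurrences" are all assumptions.

```python
from collections import Counter

# The seven light-verb lemmas from step 2.
LIGHT_VERBS = {"kAr", "ho", "de", "le", "rAkh", "lAg", "a"}


def extract_bigrams(sentences, seed_nouns, cutoff=10):
    """Count (noun, light verb) bigrams and drop those below the cutoff.

    `sentences` is assumed to be a list of lemma lists; `seed_nouns` is the
    seed list of nouns. Both input formats are assumptions of this sketch.
    """
    counts = Counter()
    for lemmas in sentences:
        # Pair each lemma with its right neighbour.
        for left, right in zip(lemmas, lemmas[1:]):
            if right in LIGHT_VERBS and left in seed_nouns:
                counts[(left, right)] += 1
    return {pair: n for pair, n in counts.items() if n >= cutoff}


# Toy input (invented for illustration): 12 sentences with yad kAr, 3 with tAmir kAr.
toy = [["lAr.ki", "kAhani", "yad", "kAr"]] * 12 + [["mAkan", "tAmir", "kAr"]] * 3
print(extract_bigrams(toy, {"yad", "tAmir"}))
```

On the toy input only the yad kAr bigram survives, since tAmir kAr falls below the cutoff.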
5 Compute relative frequencies of noun combined with light verbs (880 noun instances)
ID  noun                kAr     ho      de      le      rAkh    lAg       a
                        ‘do’    ‘be’    ‘give’  ‘take’  ‘put’   ‘attach’  ‘come’
1   tAnav ‘tension’     0.115   0.562   0.058   0.058   0.000   0.000     0.207
2   bhag ‘part’         0.149   0.365   0.119   0.253   0.000   0.000     0.115
3   ag ‘fire’           0.110   0.251   0.087   0.000   0.055   0.443     0.055
4   mAzuri ‘sanction’   0.000   0.000   0.757   0.243   0.000   0.000     0.000
5   dhava ‘attack’      1.000   0.000   0.000   0.000   0.000   0.000     0.000
6   krIpa ‘mercy’       0.409   0.486   0.000   0.000   0.105   0.000     0.000
Table: Relative frequencies of co-occurrence of nouns with light verbs
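Each row of the table is a noun's light-verb counts divided by the noun's total count over all seven light verbs, so a row sums to 1 (up to rounding). A minimal sketch, in which the function name and the toy counts are invented for illustration:

```python
# Column order as in the table above.
LIGHT_VERB_ORDER = ["kAr", "ho", "de", "le", "rAkh", "lAg", "a"]


def relative_frequencies(counts, light_verbs=LIGHT_VERB_ORDER):
    """Turn raw co-occurrence counts for one noun into relative frequencies."""
    total = sum(counts.get(verb, 0) for verb in light_verbs)
    if total == 0:
        # A noun that never occurs with a light verb gets an all-zero row.
        return [0.0] * len(light_verbs)
    return [counts.get(verb, 0) / total for verb in light_verbs]


# Hypothetical raw counts for one noun: 3 times with de, once with le.
row = relative_frequencies({"de": 3, "le": 1})
print(row)
```

The resulting seven-dimensional vectors are the feature representation that the clustering step operates on.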
6 Apply clustering algorithm to the data
Clustering the nouns based on their occurrence patterns with light verbs
k-means clustering
Problems: How good are resulting clusters? What value should we use for k?
→ How to evaluate?
We already know that our combinations (“seed list” nouns + light verbs) form legitimate CPs.
What we don’t know is how semantically coherent the clusters are.
We also don’t know which k is giving us the best (i.e. most expressive/semantically most coherent) clusters.
(But k = 8 seemed to be a good value during initial inspection.)
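The clustering step can be illustrated with a bare-bones k-means (Lloyd's algorithm) over the six example rows of the relative-frequency table. This is a sketch only: the slides do not say which implementation or initialisation was used, so taking the first k points as initial centroids is a simplifying assumption, and k = 3 is chosen here merely because six rows are too few for k = 8.

```python
import math

# Rows copied from the relative-frequency table; column order is
# kAr, ho, de, le, rAkh, lAg, a.
NOUNS = {
    "tAnav": [0.115, 0.562, 0.058, 0.058, 0.000, 0.000, 0.207],
    "bhag": [0.149, 0.365, 0.119, 0.253, 0.000, 0.000, 0.115],
    "ag": [0.110, 0.251, 0.087, 0.000, 0.055, 0.443, 0.055],
    "mAzuri": [0.000, 0.000, 0.757, 0.243, 0.000, 0.000, 0.000],
    "dhava": [1.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    "krIpa": [0.409, 0.486, 0.000, 0.000, 0.105, 0.000, 0.000],
}


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


names = list(NOUNS)
labels, _ = kmeans([NOUNS[n] for n in names], k=3)
for name, label in zip(names, labels):
    print(name, label)
```

Because the result of k-means depends on the initial centroids and on k, the clusters produced by such a sketch need an external evaluation before they can be interpreted.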
Evaluation
Preliminary evaluation using WordNet
Hindi WordNet publicly available [Bhattacharyya, 2010]
Follow the technique described by e.g. [Van de Cruys, 2006] for each k = 2, ..., 10
Extract synonyms, hypernyms and hyponyms for every word in a cluster
Choose cluster centroid: word with most semantic relations with every other word in cluster
Extract co-hyponyms, i.e. the hyponyms of the hypernyms (sisters in the ontology tree), for each centroid from WordNet (along with their synonyms, hypernyms and hyponyms)
Calculate precision for each cluster: count number of words that overlap with words in centroid’s relations & divide by number of words in cluster
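The precision computation described above can be sketched as follows. The WordNet lookups are replaced by two plain dictionaries with invented placeholder entries (`n1` to `n5`), since no Hindi WordNet API is assumed here; the function names and the decision to measure "most semantic relations" by counting related cluster members are likewise assumptions of this sketch.

```python
def choose_centroid(cluster, related):
    """Pick the word that is related to the most other words in the cluster."""
    def coverage(word):
        neighbours = related.get(word, set())
        return sum(1 for other in cluster if other != word and other in neighbours)
    return max(cluster, key=coverage)


def cluster_precision(cluster, related, co_hyponym_relations):
    """Share of cluster words found among the centroid's (extended) relations."""
    centroid = choose_centroid(cluster, related)
    # The centroid's own relations plus those reached via its co-hyponyms.
    neighbourhood = related.get(centroid, set()) | co_hyponym_relations.get(centroid, set())
    hits = sum(1 for word in cluster if word in neighbourhood)
    return hits / len(cluster)


# Invented placeholder data standing in for WordNet lookups.
cluster = ["n1", "n2", "n3", "n4"]
related = {"n1": {"n2"}, "n2": {"n1", "n3"}, "n3": {"n2"}, "n4": set()}
co_hyponym_relations = {"n2": {"n5"}}

print(choose_centroid(cluster, related))
print(cluster_precision(cluster, related, co_hyponym_relations))
```

In this sketch the centroid itself only counts as a hit if it appears in its own relation set, which is one possible reading of the procedure.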
k    Precision
2    0.0452
3    0.0371
4    0.0567
5    0.0811
6    0.0822
7    0.0798
8    0.0898
9    0.0740
10   0.082
Table: Evaluating cluster size using semantic relations in WordNet (low precision values because of small size of data given to the algorithm)
→ Result: most coherent clusters according to evaluation with k = 8
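Selecting k amounts to taking the maximum over the precision table. A minimal sketch with the values copied from the table (the variable names are illustrative):

```python
# Precision per number of clusters k, copied from the evaluation table.
precision_by_k = {
    2: 0.0452, 3: 0.0371, 4: 0.0567, 5: 0.0811, 6: 0.0822,
    7: 0.0798, 8: 0.0898, 9: 0.0740, 10: 0.082,
}

# The k with the highest precision is taken as the most coherent clustering.
best_k = max(precision_by_k, key=precision_by_k.get)
print(best_k)
```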
Semantic Classes
Overview
Preliminary overview of semantic classes of nouns (labels partly borrowed from [Ahmed, 2010]/[Ahmed and Butt, 2011]):
    Property                        Cluster size   Light verbs
1.  Change of state                 22             a ‘come’
2.  Mental state                    102            ho ‘be’
3.  Sending away                    110            de ‘give’
4.  Mental state/mental action      101            ho ‘be’; kAr ‘do’
5.  Action                          476            kAr ‘do’
6.  Sudden event                    10             lAg ‘attach’
7.  Short duration/durative         28             rAkh ‘keep’
8.  Ingestive/mental gain           31             le ‘take’
Table: Occurrences of light verb with semantic classes of nouns
Description of classes I
Class 1 (a ‘come’): events with a direction that has a beginning, path and end
bAdlav a ‘change come, change’
Class 2 (ho ‘be’): mental states/psych predicates, require dative subjects
dukh ho ‘sadness be, be sad’, khed ho ‘regret be, regret’
Class 3 (de ‘give’): events involve “transmissions” away from the sender/subject
sAndes de ‘message give, give a message’, hUkm de ‘order give, order someone’
Class 4 (ho ‘be’, kAr ‘do’): mental states/mental actions, subject case marking alternates between dative and ergative, depending on light verb
pyar ho ‘love be, love’
Description of classes II
Class 5 (kAr ‘do’): largest class, dynamic events/actions, take ergative subjects
dAstAkhAt kAr ‘signature do, sign’, fon kAr ‘phone do, call someone’, tExt kAr ‘text do, text someone’ (and other borrowings from English)
Class 6 (lAg ‘attach’): sudden events
jhAtka lAg ‘jolt attach, get jolted’, chot lAg ‘injury attach, get injured’
Class 7 (rAkh ‘keep’): durative, non-momentary events
tAllUkh rAkh ‘contact keep, keep in touch’, khAyal rAkh ‘care keep, take care’
Class 8 (le ‘take’): events involve “transmissions” to the receiver/subject, which is the endpoint of transmission
sApAth le ‘oath take, take an oath’, sAharA le ‘shelter take, take shelter’
Summary
Some nouns are heavily lexicalized towards a particular semantic configuration (i.e., compatible with a smaller subset of light verbs)
Others may occur with a wider range of light verbs
→ Use for grammar development?
Lexicon development
Can define templates, based on classification
Handle new coinages/borrowings, predict their usage
Future work:
Apply method to Urdu data
Refine/narrow down clusters (using more data/more features/more light verbs)
References I
Ahmed, T. (2010).
The interaction of light verbs and verb classes of Urdu.
In Interdisciplinary Workshop on Verbs - The Identification and Representation of Verb Features, Pisa.
Ahmed, T. and Butt, M. (2011).
Discovering Semantic Classes for Urdu N-V Complex Predicates.
In Proceedings of the International Conference on Computational Semantics (IWCS 2011).
Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D., and Xia, F. (2009).
A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu.
In Proceedings of the Third Linguistic Annotation Workshop, pages 186–189, Suntec, Singapore. Association for Computational Linguistics.
Bhattacharyya, P. (2010).
IndoWordNet.
In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10), pages 3785–3792.
Butt, M., Bögel, T., Hautli, A., Sulger, S., and Ahmed, T. (2012).
Identifying Urdu Complex Predication via Bigram Extraction.
In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India.
Lamprecht, A., Hautli, A., Rohrdantz, C., and Bögel, T. (2013).
A Visual Analytics System for Cluster Exploration.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 109–114, Sofia, Bulgaria. Association for Computational Linguistics.
References II
Reddy, S. and Sharoff, S. (2011).
Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources.
In Proceedings of the Fifth International Workshop On Cross Lingual Information Access, pages 11–19, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.
Van de Cruys, T. (2006).
Semantic Clustering in Dutch.
In Proceedings of the Sixteenth Computational Linguistics in Netherlands (CLIN), pages 17–32.