41
A Universal Phrase Tagset for Multilingual Treebanks Aaron Li-Feng Han 1,2 , Derek F. Wong 1 , Lidia S. Chao 1 , Yi Lu 1 , Liangye He 1 , and Liang Tian 1 1 NLP 2 CT lab, University of Macau 2 ILLC, University of Amsterdam Speaker: Mr. Zhiyang Teng / Haibo Zhang Chinese Academy of Science(CAS) CCL and NLP-NABD 2014 Wuhan, P.R.China

PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Embed Size (px)

DESCRIPTION

Many syntactic treebanks and parser toolkits are developed in the past twenty years, including dependency structure parsers and phrase structure parsers. For the phrase structure parsers, they usually utilize different phrase tagsets for different languages, which results in an inconvenience when conducting the multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs in the parsing models and even improve the parsing accuracy.

Citation preview

Page 1: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

A Universal Phrase Tagset for Multilingual Treebanks

Aaron Li-Feng Han1,2, Derek F. Wong1, Lidia S. Chao1, Yi Lu1, Liangye He1, and Liang Tian1

1 NLP2CT lab, University of Macau 2 ILLC, University of Amsterdam

Speaker: Mr. Zhiyang Teng / Haibo Zhang Chinese Academy of Science(CAS)

CCL and NLP-NABD 2014 Wuhan, P.R.China

Page 2: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Proposed tagset

Discussion

CONTENTMotivation

Page 3: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Many syntactic treebanks and parser toolkits are developed in the past twenty years

• Including dependency structure parsers and phrase structure parsers

• For the phrase structure parsers, they usually utilise different phrase tagsets for different languages

• To capture the characteristics of specific languages

• Phrase categories span from ten to twenty or even more

• Results in an inconvenience when conducting the multilingual research

Motivations

Page 4: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• To facilitate the research of multilingual tasks

• Could we make some bridges between these treebanks?

• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks

• Petrov et al. (2012) developed a universal part-of-speech (PoS) tagset

• Han et al. (2013) discussed the universal phrase tagset between French-English (bilingual)

• => Then, A Multilingual Universal Phrase Tagset? • and Mappings between existing phrase tags and universal ones?

Motivations

Page 5: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

5

Page 6: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• After look inside of some syntactic treebanks

• We designs a refined universal phrase tagset

• Use 9 common phrase categories • all with high appearance rates

• Conduct mappings between the phrase tagsets from the existing phrase structure treebanks and the universal ones

• Cover 25 treebanks and 21 languages

Proposed tagset

Page 7: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Refined universal phrase tagset • Noun phrase (NP),

• Verbal phrase (VP),

• Adjectival phrase (AJP),

• Adverbial phrase (AVP),

• Prepositional phrase (PP),

• Sentence or sub-sentence (S),

• Conjunction phrase (CONJP),

• Coordinated phrase (COP),

• Others (X) covering the list marker, interjection, URL, etc.

Proposed tagset

Page 8: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Cover 21 languages and 25 treebanks:

• Arabic, Catalan, Chinese, Danish, English, Estonian, French, German, Hebrew,Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Portuguese, Spanish, Swedish, Thai, Urdu, and Vietnamese

• Detailed mappings between existing phrase tags and the universal ones in following table

Proposed tagset

Page 9: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Mappings of 25 phrase structure treebanks

Proposed tagset

Page 10: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Mappings of 25 phrase structure treebanks

Proposed tagset

Page 11: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Mappings of 25 phrase structure treebanks

Proposed tagset

Page 12: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Mappings of 25 phrase structure treebanks

Proposed tagset

Page 13: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

13

Page 14: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• How to evaluate the effectiveness of the proposed works?

• - the universal phrase tagset

• - the mapping of existing tagsets

• Experimental design:

• - parsing accuracy testing

Experiment

Page 15: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Steps:

• 1. Run training and testing on original corpus, record the testing accuracy

• 2. Replace original phrase tags with universal ones resulting in ‘new corpus’

• 3. Re-run the training and testing on new corpus, record the new accuracy

• 4. Compare the changes of accuracy

Experiment

Page 16: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Evaluation criteria

• cost of training time (hours),

• size of the generated grammar (MB),

• parsing accuracy scores, • Labeled Precision (LPre), Labeled Recall (LRec), the

harmonic mean of precision and recall (F1), and exact match (Ex, whole sentence/tree)

Experiment

Page 17: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Calculation formula:

Experiment

Page 18: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Tested Languages:

• Five representative languages: Chinese (CN), English (EN), Portuguese (PT), French (FR), and German (DE)

Experiment

Page 19: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Based on Berkeley parser (Petrov and Klein, 2007):

• learn probabilistic context-free grammars (PCFGs) to assign a sequence of words the most likely parse tree

• introduce the hierarchical latent variable grammars to automatically learn a broad coverage grammar starting from a simple initial one

• The generated grammar is refined by hierarchical splitting, merging and smoothing

Experiment

Petrov, S., Klein, D.: Improved Inference for Unlexicalized Parsing. NAACL (2007)

Page 20: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Setting:

• The Berkeley parser generally gains the best testing result using the 6th smoothed grammar

• For a broad analysis of the experiments, we tune the parameters to learn the refined grammar by 7 times of splitting, merging and smoothing

• except 8 times for French treebank

Experiment

Page 21: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Hardware:

• The experiments are conducted on a server with the configuration stated in Table

Experiment

Page 22: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Chinese corpus:

• Penn Chinese Treebank (CTB-7) (Xue et al., 2005)

• standard splitting criteria for the training and testing data

• training documents contain CTB-7 files 0-to-2082

• development documents contain files 2083-to- 2242

• testing documents are files 2243-to-2447

Experiment

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus.Natural Language Engineering, 11(2), 207-238.

Page 23: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Experiment results:

• The highest precision, recall, F1 and exact score • 85.58 (85.06), 83.24 (83.01), 84.4 (83.99), and 25.33

(24.73) respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • (65.55 MB) and (94.79 hours) using the original tagset

almost doubles that (34.53 MB & 56.25 hours) of the universal one for the learning of 7th refined grammar

• Detailed learning scores in the figure of next page

Experiment

Page 24: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Page 25: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• English corpus:

• Wall Street Journal treebank (Bies et al., 1995)

• standard setting: • WSJ section 2-to-21 corpora are for training

• section 22 for developing

• section 23 for testing (Petrov et al., 2006)

Experiment

Ann Bies, Mark Ferguson, Karen Katz and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical paper.

Page 26: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Experiment results (similar with CN):

• The highest precision, recall, F1 and exact score • 91.45 (91.25), 91.19 (91.11) and 91.32 (91.18)

respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • 38.67 (51.64) hours and 30.72 (47.00) MB of memory

during the training process for the 7th refined grammar respectively on universal (original) tagset

• Detailed learning scores in the figure of next page

Experiment

Page 27: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Page 28: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Portuguese corpus:

• Bosque treebank subset of Floresta Virgem corpora (Afonso et al., 2002; Freitas et al., 2008)

• A size of 162,484 lexical units • 80 percent of the sentences for training

• 10 percent for developing

• another 10 percent for testing

• i.e. 7393, 939, and 957 sentences respectively !

Experiment

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos.2002. Florestasintá(c)tica: a treebank for Portuguese. In Proceedings of LREC 2002,pp.1698-1703.

Cláudia Freitas, Paulo Rocha andEckhard Bick.2008. FlorestaSintá(c)tica: Bigger, Thicker and Easier. In António Teixeira, Vera LúciaStrube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, pp. 216-219.

Page 29: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Experiment results (similar with CN):

• The highest precision, recall, F1 and exact score • 81.84 (81.44), 80.81 (80.27) and 81.32 (80.85)

respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • 3.69 (4.16) hours and 9.17 (10.02) MB of memory

during the training process for the 7th refined grammar respectively on universal (original) tagset

• Detailed learning scores in the figure of next page

Experiment

Page 30: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Page 31: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• German corpus:

• German Negra treebank (Skut et al., 1997)

• 355,096 tokens and 20,602 sentences German newspaper text with completely annotated syntactic structures

• 80 percent (16,482 sentences) of corpus for training

• 10 percent (2,060 sentences) for developing

• 10 percent (2,060 sentences) for testing.

Experiment

W. Skut, B.Krenn, T. Brants, and H. Uszkoreit. 1997. An annotation scheme for free word order languages. In Conference on ANLP.

Page 32: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Experiment results:

• The highest precision, recall, F1 and exact score • 81.35 (81.23), 81.03 (81.02), and 81.19 (81.12)

respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • similar on universal (original) tagset

• Detailed learning scores in the figure of next page

Experiment

Page 33: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Page 34: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• French corpus:

• Different with previous standard language treebanks, we build ourselves

• Extract 20,000 French sentences from Europarl corpora

• Parse the extracted French text using the Berkeley French grammar “fra_sm5.gr” (Petrov, 2009)

• parsing accuracy around 0.80, the parsed Euro-Fr corpus is for the training

• Developing and testing corpora • WMT12 and WMT13 French plain text, 3,003 and 3,000 sentences respectively,

which are parsed by the same parser

Experiment

Petrov, S.: Coarse-to-Fine Natural Language Processing. PHD thesis (2009)

Page 35: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• Experiment results:

• The highest precision, recall, F1 and exact score • 80.49 (80.34), 80.93 (80.96), and 80.71 (80.64)

respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • 13.66 (28.91) hours and 12.07 (16.66) MB of memory

during the training process for the 8th refined grammar respectively on universal (original) tagset

• Detailed learning scores in the figure of next page

Experiment

Page 36: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Experiment

Page 37: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

37

Page 38: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• 1. Differences with related work

• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks

• Han et al. (2013) first discussed the universal phrase tag set for syntactic treebanks. Differences between our work and Han’s:

• Han’s covers the mapping of French and English <=> We extend the mapping into 25 treebanks and 21 languages

• Han’s apples the universal tag set into MT evaluation (indirect examination) <=> We examine the effectiveness of the mapping directly on parsing tasks

• Han’s experiments on French-English MTE corpus <=> Our experiments on five representative languages: Chinese/English/French/German/Portuguese

Discussion

McDonald, R.; Nivre, J., Quirmbach-Brundage, Y., et al.: Universal Dependency Annotation for Multilingual Parsing. In: Proceedings of ACL (2013)

Han, A.L.-F; Wong, D.F., Chao, L.S., He, L., Li, S., Zhu, L.: Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. GSCL-2013. LNCS, vol. 8105, pp. 119–131. Springer, Heidelberg (2013)

Page 39: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• 2. Analysis of the performances

• The higher parsing accuracy performances are not only due to the less number of phrase tags we employed, since:

• the beginning stage parsing accuracies (e.g. the 1st and 2nd refined grammars) using the universal tags are even lower than the ones using original tags.

• e.g. the F-scores of the 1st refined grammar using universal tags vs original tags are 67.06 (lower) vs 70.84; however, the winner changed after 5th refinement

• we think the parsing accuracy is also related to the mapping quality

• The exact match scores sometimes (French/German) lower than the ones using original tags, which is a issue for future work

• Currently, universal tags usually gain a higher performance according to the best F-score

Discussion

Page 40: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

• 3. Future work

• We plan to evaluate the parsing experiments on more language treebanks

• Improve the Exact match score/ whole tree level

• Utilize the universal phrase tagset into other multilingual applications, e.g. Petrov et al. (2012)’s work.

Discussion

Petrov, S., Das, D., McDonald, R.: A Universal Part-of-Speech Tagset. In: Proceedings of the Eighth LREC (2012)

Page 41: PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

Cite this work:

Han, A.L.-F; Wong, D.F.; Chao, L.S.; Lu, Y.; He, L.; and Tian, L.: A Universal Phrase Tagset for Multilingual Treebanks. In M. Sun et al. (Eds.): CCL and NLP-NABD 2014, LNAI 8801, pp. 247–258, 2014. © Springer International Publishing Switzerland 2014