PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks

A Universal Phrase Tagset for Multilingual Treebanks

Aaron Li-Feng Han1,2, Derek F. Wong1, Lidia S. Chao1, Yi Lu1, Liangye He1, and Liang Tian1

1 NLP2CT lab, University of Macau 2 ILLC, University of Amsterdam

Speaker: Mr. Zhiyang Teng / Haibo Zhang Chinese Academy of Science(CAS)

CCL and NLP-NABD 2014 Wuhan, P.R.China

Experiment

Proposed tagset

Discussion

CONTENTMotivation

• Many syntactic treebanks and parser toolkits are developed in the past twenty years

• Including dependency structure parsers and phrase structure parsers

• For the phrase structure parsers, they usually utilise different phrase tagsets for different languages

• To capture the characteristics of specific languages

• Phrase categories span from ten to twenty or even more

• Results in an inconvenience when conducting the multilingual research

Motivations

• To facilitate the research of multilingual tasks

• Could we make some bridges between these treebanks?

• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks

• Petrov et al. (2012) developed a universal part-of-speech (PoS) tagset

• Han et al. (2013) discussed the universal phrase tagset between French-English (bilingual)

• => Then, A Multilingual Universal Phrase Tagset? • and Mappings between existing phrase tags and universal ones?

Motivations

5

• After look inside of some syntactic treebanks

• We designs a refined universal phrase tagset

• Use 9 common phrase categories • all with high appearance rates

• Conduct mappings between the phrase tagsets from the existing phrase structure treebanks and the universal ones

• Cover 25 treebanks and 21 languages

Proposed tagset

• Refined universal phrase tagset • Noun phrase (NP),

• Verbal phrase (VP),

• Adjectival phrase (AJP),

• Adverbial phrase (AVP),

• Prepositional phrase (PP),

• Sentence or sub-sentence (S),

• Conjunction phrase (CONJP),

• Coordinated phrase (COP),

• Others (X) covering the list marker, interjection, URL, etc.

Proposed tagset

• Cover 21 languages and 25 treebanks:

• Arabic, Catalan, Chinese, Danish, English, Estonian, French, German, Hebrew,Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Portuguese, Spanish, Swedish, Thai, Urdu, and Vietnamese

• Detailed mappings between existing phrase tags and the universal ones in following table

Proposed tagset

• Mappings of 25 phrase structure treebanks

Proposed tagset


Proposed tagset


Proposed tagset


Proposed tagset

13

• How to evaluate the effectiveness of the proposed works?

• - the universal phrase tagset

• - the mapping of existing tagsets

• Experimental design:

• - parsing accuracy testing

Experiment

• Steps:

• 1. Run training and testing on original corpus, record the testing accuracy

• 2. Replace original phrase tags with universal ones resulting in ‘new corpus’

• 3. Re-run the training and testing on new corpus, record the new accuracy

• 4. Compare the changes of accuracy

Experiment

• Evaluation criteria

• cost of training time (hours),

• size of the generated grammar (MB),

• parsing accuracy scores, • Labeled Precision (LPre), Labeled Recall (LRec), the

harmonic mean of precision and recall (F1), and exact match (Ex, whole sentence/tree)

Experiment

• Calculation formula:

Experiment

• Tested Languages:

• Five representative languages: Chinese (CN), English (EN), Portuguese (PT), French (FR), and German (DE)

Experiment

• Based on Berkeley parser (Petrov and Klein, 2007):

• learn probabilistic context-free grammars (PCFGs) to assign a sequence of words the most likely parse tree

• introduce the hierarchical latent variable grammars to automatically learn a broad coverage grammar starting from a simple initial one

• The generated grammar is refined by hierarchical splitting, merging and smoothing

Experiment

Petrov, S., Klein, D.: Improved Inference for Unlexicalized Parsing. NAACL (2007)

• Setting:

• The Berkeley parser generally gains the best testing result using the 6th smoothed grammar

• For a broad analysis of the experiments, we tune the parameters to learn the refined grammar by 7 times of splitting, merging and smoothing

• except 8 times for French treebank

Experiment

• Hardware:

• The experiments are conducted on a server with the configuration stated in Table

Experiment

• Chinese corpus:

• Penn Chinese Treebank (CTB-7) (Xue et al., 2005)

• standard splitting criteria for the training and testing data

• training documents contain CTB-7 files 0-to-2082

• development documents contain files 2083-to- 2242

• testing documents are files 2243-to-2447

Experiment

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus.Natural Language Engineering, 11(2), 207-238.

• Experiment results:

• The highest precision, recall, F1 and exact score • 85.58 (85.06), 83.24 (83.01), 84.4 (83.99), and 25.33

(24.73) respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • (65.55 MB) and (94.79 hours) using the original tagset

almost doubles that (34.53 MB & 56.25 hours) of the universal one for the learning of 7th refined grammar

• Detailed learning scores in the figure of next page

Experiment

Experiment

• English corpus:

• Wall Street Journal treebank (Bies et al., 1995)

• standard setting: • WSJ section 2-to-21 corpora are for training

• section 22 for developing

• section 23 for testing (Petrov et al., 2006)

Experiment

Ann Bies, Mark Ferguson, Karen Katz and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical paper.

• Experiment results (similar with CN):

• The highest precision, recall, F1 and exact score • 91.45 (91.25), 91.19 (91.11) and 91.32 (91.18)

respectively by using the universal phrase tagset (original tags)

• Grammar size and training time • 38.67 (51.64) hours and 30.72 (47.00) MB of memory

during the training process for the 7th refined grammar respectively on universal (original) tagset


Experiment

Experiment

• Portuguese corpus:

• Bosque treebank subset of Floresta Virgem corpora (Afonso et al., 2002; Freitas et al., 2008)

• A size of 162,484 lexical units • 80 percent of the sentences for training

• 10 percent for developing

• another 10 percent for testing

• i.e. 7393, 939, and 957 sentences respectively !

Experiment

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos.2002. Florestasintá(c)tica: a treebank for Portuguese. In Proceedings of LREC 2002,pp.1698-1703.

Cláudia Freitas, Paulo Rocha andEckhard Bick.2008. FlorestaSintá(c)tica: Bigger, Thicker and Easier. In António Teixeira, Vera LúciaStrube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, pp. 216-219.

• Experiment results (similar with CN):

• The highest precision, recall, F1 and exact score • 81.84 (81.44), 80.81 (80.27) and 81.32 (80.85)





Experiment

Experiment

• German corpus:

• German Negra treebank (Skut et al., 1997)

• 355,096 tokens and 20,602 sentences German newspaper text with completely annotated syntactic structures

• 80 percent (16,482 sentences) of corpus for training

• 10 percent (2,060 sentences) for developing

• 10 percent (2,060 sentences) for testing.

Experiment

W. Skut, B.Krenn, T. Brants, and H. Uszkoreit. 1997. An annotation scheme for free word order languages. In Conference on ANLP.


• The highest precision, recall, F1 and exact score • 81.35 (81.23), 81.03 (81.02), and 81.19 (81.12)


• Grammar size and training time • similar on universal (original) tagset


Experiment

Experiment

• French corpus:

• Different with previous standard language treebanks, we build ourselves

• Extract 20,000 French sentences from Europarl corpora

• Parse the extracted French text using the Berkeley French grammar “fra_sm5.gr” (Petrov, 2009)

• parsing accuracy around 0.80, the parsed Euro-Fr corpus is for the training

• Developing and testing corpora • WMT12 and WMT13 French plain text, 3,003 and 3,000 sentences respectively,

which are parsed by the same parser

Experiment

Petrov, S.: Coarse-to-Fine Natural Language Processing. PHD thesis (2009)


• The highest precision, recall, F1 and exact score • 80.49 (80.34), 80.93 (80.96), and 80.71 (80.64)





Experiment

Experiment

37

• 1. Differences with related work

• McDonald et al. (2013) designed a universal annotation approach for dependency treebanks

• Han et al. (2013) first discussed the universal phrase tag set for syntactic treebanks. Differences between our work and Han’s:

• Han’s covers the mapping of French and English <=> We extend the mapping into 25 treebanks and 21 languages

• Han’s apples the universal tag set into MT evaluation (indirect examination) <=> We examine the effectiveness of the mapping directly on parsing tasks

• Han’s experiments on French-English MTE corpus <=> Our experiments on five representative languages: Chinese/English/French/German/Portuguese

Discussion

McDonald, R.; Nivre, J., Quirmbach-Brundage, Y., et al.: Universal Dependency Annotation for Multilingual Parsing. In: Proceedings of ACL (2013)

Han, A.L.-F; Wong, D.F., Chao, L.S., He, L., Li, S., Zhu, L.: Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. GSCL-2013. LNCS, vol. 8105, pp. 119–131. Springer, Heidelberg (2013)

• 2. Analysis of the performances

• The higher parsing accuracy performances are not only due to the less number of phrase tags we employed, since:

• the beginning stage parsing accuracies (e.g. the 1st and 2nd refined grammars) using the universal tags are even lower than the ones using original tags.

• e.g. the F-scores of the 1st refined grammar using universal tags vs original tags are 67.06 (lower) vs 70.84; however, the winner changed after 5th refinement

• we think the parsing accuracy is also related to the mapping quality

• The exact match scores sometimes (French/German) lower than the ones using original tags, which is a issue for future work

• Currently, universal tags usually gain a higher performance according to the best F-score

Discussion

• 3. Future work

• We plan to evaluate the parsing experiments on more language treebanks

• Improve the Exact match score/ whole tree level

• Utilize the universal phrase tagset into other multilingual applications, e.g. Petrov et al. (2012)’s work.

Discussion

Petrov, S., Das, D., McDonald, R.: A Universal Part-of-Speech Tagset. In: Proceedings of the Eighth LREC (2012)

Cite this work:

Han, A.L.-F; Wong, D.F.; Chao, L.S.; Lu, Y.; He, L.; and Tian, L.: A Universal Phrase Tagset for Multilingual Treebanks. In M. Sun et al. (Eds.): CCL and NLP-NABD 2014, LNAI 8801, pp. 247–258, 2014. © Springer International Publishing Switzerland 2014

Education

PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks