Multilingual NLP
Inspired by Graham Neubig's CMU CS 11737, Fall 2020


Multilingual Pre-training

● Extend pre-training to multiple languages
● Pro: Can transfer information across languages
● Con: Limited model capacity
  ○ Curse of Multilinguality
    ■ Increases low-resource performance
    ■ Reduces high-resource performance


mBERT on 100 languages

● 110K WordPiece vocabulary
● Rules for handling specific languages like Chinese
  ○ Chinese, Japanese and Korean don't use whitespace
● Exponential Weighted Sampling (see the sketch below)
  ○ English vs. Icelandic: a ~1000x data ratio becomes roughly 100x after sampling
  ○ Exponentiate each language's sampling probability by 0.7 and re-normalize
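
A minimal sketch of this exponentiate-and-renormalize rule, assuming made-up per-language corpus sizes (only the 0.7 exponent comes from the slide):

```python
# Minimal sketch of exponential weighted sampling over languages.
# Corpus sizes below are invented for illustration.
import numpy as np

def sampling_probs(corpus_sizes, alpha=0.7):
    """Exponentiate the data distribution by `alpha` and re-normalize."""
    sizes = np.array(list(corpus_sizes.values()), dtype=float)
    p = sizes / sizes.sum()      # proportional-to-data distribution
    q = p ** alpha               # flattening down-weights high-resource languages
    return dict(zip(corpus_sizes.keys(), q / q.sum()))

# English with ~1000x more data than Icelandic ends up sampled only ~100x more.
print(sampling_probs({"en": 1_000_000, "is": 1_000}))
```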


mBERT monolingual performance

● mBERT vs. BERT on MNLI: 81.4 vs. 84.2
● mBERT vs. BERT-Chinese on XNLI: 74.2 vs. 77.2


SentencePiece

● WordPiece issues:
  ○ Needs pre-tokenized input: individual words to split ('hello', 'world', '.')
  ○ Does not work for all languages, e.g. Chinese
  ○ Not open-source
● SentencePiece (usage sketch below):
  ○ Works directly on raw sentences
  ○ Treats spaces as just another character to tokenize
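
A small usage sketch with the open-source sentencepiece package; the corpus file, model prefix, and vocabulary size are placeholders:

```python
# Sketch of training and using a SentencePiece model on raw sentences.
import sentencepiece as spm

# Train directly on raw text; no whitespace pre-tokenization is required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to raw sentences
    model_prefix="spm_demo",   # writes spm_demo.model / spm_demo.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
# Spaces are kept as the meta symbol "▁", i.e. just another character.
print(sp.encode("hello world.", out_type=str))  # e.g. ['▁hello', '▁world', '.']
```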


XLM, XLM-Roberta from Facebook

● XLM uses Translation Language Modeling (TLM)



● XLM uses
  ○ Wikipedia text for MLM
  ○ Supervised (parallel) translation data for TLM (see the sketch below)
● XLM-R scales it up to use CommonCrawl text
  ○ Does not use TLM or language-ids
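
To make TLM concrete, here is a rough, simplified sketch of how a translation pair can become one masked training example; the tokens, separator, and masking rate are illustrative, not XLM's exact preprocessing:

```python
# Rough sketch of a Translation Language Modeling (TLM) example:
# a parallel sentence pair is concatenated and tokens are masked on both
# sides, so the model can recover a masked word from either language.
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, mask="[MASK]"):
    pair = src_tokens + ["</s>"] + tgt_tokens      # concatenate the pair
    inputs, labels = [], []
    for tok in pair:
        if tok != "</s>" and random.random() < mask_prob:
            inputs.append(mask)
            labels.append(tok)       # loss is computed on masked positions
        else:
            inputs.append(tok)
            labels.append(None)      # no loss elsewhere
    return inputs, labels

print(make_tlm_example(["the", "cat", "sleeps"], ["billi", "soti", "hai"]))
```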


BART


mBART: Multi-Lingual BART

● Training on the CC25 corpus (usage sketch below)
  ○ A corpus of 25 languages
  ○ A subset of Common Crawl, a crawl of the internet
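
A hedged usage sketch for the mBART CC25 checkpoint via the transformers library; the model id and language codes follow that library's documentation, and in practice the pretrained denoiser would be fine-tuned on parallel data before generation is useful:

```python
# Sketch of loading mBART (trained on CC25) and generating with a target
# language code. Translation quality comes only after fine-tuning.
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="hi_IN", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

batch = tokenizer("जन्म के समय शरीर में 300 हड्डियाँ होती हैं।", return_tensors="pt")
generated = model.generate(
    **batch, decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```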


mT5

● Collect the mC4 corpus over 100 languages

● Train using a span-denoising objective (see the sketch below)

● Does not use language-ids
  ○ Results in “accidental translations”
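
A short sketch of the span-denoising setup with the google/mt5-small checkpoint, following the transformers documentation (the sentence is illustrative):

```python
# Span denoising: corrupted spans are replaced by sentinel tokens in the
# input, and the target reconstructs the dropped spans.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

inputs = tokenizer("The <extra_id_0> walks in <extra_id_1> park",
                   return_tensors="pt")
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss   # standard denoising loss
print(float(loss))
```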


Indian Languages

● MuRIL from Google: 16 IN languages and EN
  ○ MLM and TLM
  ○ Vocabulary of 197K
● Better than mBERT
● Available on TF Hub and Hugging Face (loading sketch below)



● IndicBERT from IITM: 12 IN and EN
  ○ Only MLM
  ○ Vocabulary of 200K
● Better than mBERT and XLM-R on IndicGLUE
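
A brief loading sketch for both checkpoints from the Hugging Face Hub; the model ids ("google/muril-base-cased", "ai4bharat/indic-bert") are the publicly listed ones, but verify them before relying on this:

```python
# Sketch: encode a Hindi sentence with MuRIL and IndicBERT.
from transformers import AutoModel, AutoTokenizer

for model_id in ["google/muril-base-cased", "ai4bharat/indic-bert"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    enc = tokenizer("जन्म के समय शरीर में 300 हड्डियाँ होती हैं।",
                    return_tensors="pt")
    print(model_id, model(**enc).last_hidden_state.shape)
```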


A4 Primer - Hindi and Tamil

Context (Hindi, translated):

The human skeleton is the internal structure of the body. At birth it is made up of 300 bones .... are strong. Unlike other animals, the human male's penis contains no bone.[2]

Question (Hindi, translated):

At birth, how many bones does an infant's body have?

Answer:

300


A4 Possible Ideas

● Start from multilingual models (see the sketch after this list)

● Utilise English QA datasets like SQuAD
  ○ Train on English + Hindi/Tamil
  ○ Translate from English to Hindi/Tamil
    ■ Context/Question
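
As a starting point, here is a hedged sketch of the "start from a multilingual model" idea: an XLM-R checkpoint fine-tuned on English SQuAD (the model id below is a community checkpoint, assumed rather than taken from the slides) answering a Hindi question zero-shot:

```python
# Zero-shot cross-lingual QA: an English-SQuAD-tuned multilingual model
# applied directly to a Hindi context/question pair.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

result = qa(
    question="जन्म के समय शरीर में कितनी हड्डियाँ होती हैं?",
    context="मानव कंकाल शरीर की आन्तरिक संरचना होती है। "
            "यह जन्म के समय लगभग 300 हड्डियों से बना होता है।",
)
print(result["answer"])   # ideally something like "300"
```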


Machine Translation


What is Machine Translation?


Massively Multilingual Machine Translation

● One model for translating between 100 languages!

● Goal: maintain the same performance on high-resource languages and improve it on low-resource languages

● M2M-100 from FAIR and MMNMT from Google (translation sketch below)
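
A translation sketch with the released M2M-100 (418M) checkpoint, following the transformers documentation; the Hindi example sentence is arbitrary:

```python
# Hindi -> English translation with M2M-100; the target language is set
# by forcing its BOS token at generation time.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "hi"
encoded = tokenizer("जीवन एक चॉकलेट के डिब्बे की तरह है।", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```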


Indian Languages

● The IIT-B English-Hindi Corpus is one benchmark

● Recently released: Samanantar (from IIT-M)
  ○ Largest parallel corpus for 11 Indian languages
  ○ Automatically mined from the web
  ○ An mT5 model trained on it outperforms Google Translate