The Challenge of Morphology Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers) Allkütulekefun


Page 1

The Challenge of Morphology

Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers)

Allkütulekefun

Page 2

The Challenge of Morphology

Mapudungun

Allkütu -le -ke -fu -n

Pages 3-7

The Challenge of Morphology

Mapudungun

Allkütu   -le      -ke         -fu     -n
Listen    -prog.   -habitual   -past   -indic.1sg

I used to listen

Tasks for Morphology
• Segment Words
• Map Morphemes onto Features

Page 8

The Challenge of Morphology

Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
• Learn these tasks
  – unsupervised
  – from data
  – for any language
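To make the two tasks concrete, here is a minimal sketch in Python that segments the example word and maps each morpheme onto its feature. The stem and suffix tables are copied by hand from the gloss on the earlier slides; the function names `segment` and `gloss` are illustrative assumptions, and a system of the kind the slides aim for would have to learn these tables from data rather than be given them.

```python
# Hand-written tables taken from the slide's gloss of "Allkütulekefun".
# Assumption: a learned system would discover these; here they are hard-coded.
SUFFIX_FEATURES = {
    "le": "prog.",
    "ke": "habitual",
    "fu": "past",
    "n": "indic.1sg",
}
STEMS = {"allkütu": "listen"}


def _split_suffixes(rest):
    """Split `rest` into known suffixes; return None if no full split exists."""
    if not rest:
        return []
    for suffix in SUFFIX_FEATURES:
        if rest.startswith(suffix):
            tail = _split_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None


def segment(word):
    """Task 1: split a word into a known stem followed by known suffixes."""
    word = word.lower()
    for stem in STEMS:
        if word.startswith(stem):
            suffixes = _split_suffixes(word[len(stem):])
            if suffixes is not None:
                return [stem] + suffixes
    return None


def gloss(morphemes):
    """Task 2: map each morpheme onto the feature it expresses."""
    stem, *suffixes = morphemes
    return [STEMS[stem]] + [SUFFIX_FEATURES[s] for s in suffixes]


morphemes = segment("Allkütulekefun")
print(morphemes)         # ['allkütu', 'le', 'ke', 'fu', 'n']
print(gloss(morphemes))  # ['listen', 'prog.', 'habitual', 'past', 'indic.1sg']
```

The lookup tables are the part that has to be learned; the segmentation and mapping steps themselves are simple once the tables exist.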

Page 9

Leverage the Natural Structure of Morphology

• Paradigm
  – Set of affixes that interchangeably attach to a set of stems
  – English Example
    • Regular Verbs: Ø.s.ing.ed
    • Regular Adj: Ø.er.est
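As a rough illustration of the paradigm definition, the sketch below models a paradigm as a set of suffixes paired with a set of stems and expands it into surface forms. The suffix sets come from the slide; the example stems (`walk`, `jump`, `tall`) and the `expand` helper are assumptions added for illustration.

```python
# A paradigm as a pair: interchangeable suffixes and the stems they attach to.
# The null suffix "Ø" is represented by the empty string.
REGULAR_VERBS = {
    "suffixes": ["", "s", "ing", "ed"],  # Ø.s.ing.ed from the slide
    "stems": ["walk", "jump"],           # illustrative stems, not from the slides
}
REGULAR_ADJECTIVES = {
    "suffixes": ["", "er", "est"],       # Ø.er.est from the slide
    "stems": ["tall"],                   # illustrative stem, not from the slides
}


def expand(paradigm):
    """Return every stem + suffix combination licensed by the paradigm."""
    return sorted(
        stem + suffix
        for stem in paradigm["stems"]
        for suffix in paradigm["suffixes"]
    )


print(expand(REGULAR_VERBS))
# ['jump', 'jumped', 'jumping', 'jumps', 'walk', 'walked', 'walking', 'walks']
print(expand(REGULAR_ADJECTIVES))
# ['tall', 'taller', 'tallest']
```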

Pages 10-17

Example Vocabulary

blame blamed blames roamed roaming roams solve solves solving

Suffixes | Stems
Ø.s | blame, solve
Ø.s.d | blame
s | blame, roam, solve
e.es | blam, solv

Page 18

Suffixes | Stems
e.es | blam, solv
e.ed | blam
es | blam, solv
Ø.s.d | blame
Ø.s | blame, solve
Ø | blame, blames, blamed, roams, roamed, roaming, solve, solves, solving
e.es.ed | blam
ed | blam, roam
d | blame, roame
Ø.d | blame
s.d | blame
s | blame, roam, solve
es.ed | blam
e | blam, solv
me.mes | bla
me.med | bla
mes | bla
me.mes.med | bla
med | bla, roa
mes.med | bla
me | bla
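The network above can be reproduced, at least for this toy vocabulary, by splitting every word at every position and grouping stems by the suffixes they share. The following Python sketch is one straightforward way to do that, not necessarily how the authors built their network; the names `stem_suffixes`, `schemes`, and `lookup`, and the `max_suffixes` cap, are assumptions. It also enumerates candidates the slide does not draw (for example, the stem `b` with suffix `lame`).

```python
from collections import defaultdict
from itertools import combinations

VOCAB = ["blame", "blamed", "blames", "roamed", "roaming",
         "roams", "solve", "solves", "solving"]
NULL = "Ø"  # the null suffix


def stem_suffixes(vocab):
    """Map every candidate (non-empty) stem to the suffixes it occurs with."""
    table = defaultdict(set)
    for word in vocab:
        for cut in range(1, len(word) + 1):
            table[word[:cut]].add(word[cut:] or NULL)
    return table


def schemes(vocab, max_suffixes=3):
    """Map each suffix set to all stems that occur with every suffix in it."""
    result = defaultdict(set)
    for stem, suffixes in stem_suffixes(vocab).items():
        for size in range(1, min(max_suffixes, len(suffixes)) + 1):
            for combo in combinations(sorted(suffixes), size):
                result[frozenset(combo)].add(stem)
    return result


def lookup(result, *suffixes):
    """Return the sorted stems of the scheme with exactly these suffixes."""
    return sorted(result.get(frozenset(suffixes), set()))


found = schemes(VOCAB)
print(lookup(found, NULL, "s"))       # ['blame', 'solve']
print(lookup(found, NULL, "s", "d"))  # ['blame']
print(lookup(found, "s"))             # ['blame', 'roam', 'solve']
print(lookup(found, "e", "es"))       # ['blam', 'solv']
```

Enumerating every subset of suffixes grows quickly, which is why the sketch caps the subset size; it is only meant for vocabularies of this size.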

Pages 19-22

Spanish Newswire Corpus
40,011 Tokens
6,975 Types

Each node of the network lists: Suffixes | Stem Type Count | Stems. The level of a node is its number of suffixes (Level 5 = 5 suffixes).

Level 5
a.as.o.os.tro | 1 | cas

Level 4
a.as.o.os | 43 | african, cas, jurídic, l, ... (Adjective Inflection Class)

Level 3
a.as.os | 50 | afectad, cas, jurídic, l, ...
a.as.o | 59 | cas, citad, jurídic, l, ...
a.o.os | 105 | impuest, indonesi, italian, jurídic, ...
as.o.os | 54 | cas, implicad, jurídic, l, ...

Level 2
a.as | 199 | huelg, incluid, industri, inundad, ...
a.os | 134 | impedid, impuest, indonesi, inundad, ...
as.os | 68 | cas, implicad, inundad, jurídic, ...
a.o | 214 | id, indi, indonesi, inmediat, ...
as.o | 85 | intern, jurídic, just, l, ...
o.os | 268 | human, implicad, indici, indocumentad, ...
a.tro | 2 | cas, cen

Level 1
a | 1237 | huelg, ib, id, iglesi, ...
as | 404 | huelg, huelguist, incluid, industri, ...
os | 534 | humorístic, human, hígad, impedid, ...
o | 1139 | hub, hug, human, huyend, ...
tro | 16 | catas, ce, cen, cua, ...

From the spurious suffix “tro”: a.as.o.os.tro, a.tro, tro

Basic Search Procedure (network axes: Increasing Suffix Count, Decreasing Stem Count)
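The slides name the search procedure and its two axes but do not spell out the steps. The Python sketch below is therefore a hypothetical greedy variant, not the authors' algorithm: starting from a one-suffix scheme, it repeatedly adds the suffix that keeps the most stems and stops when a step would keep less than a chosen fraction of them. The function names and the `min_ratio` threshold are assumptions, and the toy English vocabulary from the earlier slides stands in for the Spanish corpus.

```python
from collections import defaultdict

VOCAB = ["blame", "blamed", "blames", "roamed", "roaming",
         "roams", "solve", "solves", "solving"]
NULL = "Ø"  # the null suffix


def stem_suffixes(vocab):
    """Map every candidate (non-empty) stem to the suffixes it occurs with."""
    table = defaultdict(set)
    for word in vocab:
        for cut in range(1, len(word) + 1):
            table[word[:cut]].add(word[cut:] or NULL)
    return table


def stems_of(table, suffixes):
    """Stems that occur with every suffix in `suffixes`."""
    return {stem for stem, found in table.items() if suffixes <= found}


def search_up(table, seed, min_ratio=0.5):
    """Greedy upward walk (hypothetical): suffix count grows, stem count shrinks."""
    current = frozenset([seed])
    stems = stems_of(table, current)
    path = [(current, stems)]
    while True:
        # Only suffixes seen with a current stem can yield a non-empty parent.
        candidates = set().union(*(table[s] for s in stems)) - current
        best = None
        for suffix in sorted(candidates):
            parent = current | {suffix}
            parent_stems = stems_of(table, parent)
            if best is None or len(parent_stems) > len(best[1]):
                best = (parent, parent_stems)
        # Stop when nothing can be added or too many stems would be lost.
        if best is None or len(best[1]) < min_ratio * len(stems):
            return path
        current, stems = best
        path.append(best)


table = stem_suffixes(VOCAB)
for suffixes, stems in search_up(table, "s"):
    name = ".".join(sorted(suffixes, key=lambda s: (s != NULL, s)))
    print(name, len(stems), sorted(stems))
```

On the nine-word vocabulary this walks from `s` (3 stems) to `Ø.s` (2 stems) to the scheme the slides write as `Ø.s.d` (1 stem), matching the direction of the two axes.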

Page 23

Scaling Up

• Scaling Up
  – 1 Million word corpus
  – Network built on demand
• New Approach to Search
  – High Recall initial search
  – Weed the results to improve precision
• Results
  – Boost Recall of Suffixes in Spanish
    • from 0.5 to 0.8
  – But very low precision currently
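The slides report recall and precision of suffixes without stating how they were computed. A minimal sketch, assuming the usual set-based definitions against a hand-built gold suffix list, could look like the following; the gold and found sets are made-up illustrative values, not results from the slides.

```python
def precision_recall(found, gold):
    """Set-based precision and recall of discovered suffixes against a gold list."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Illustrative values only: three of four gold suffixes found, plus two spurious.
gold = {"a", "as", "o", "os"}
found = {"a", "as", "o", "tro", "ce"}
precision, recall = precision_recall(found, gold)
print(precision, recall)  # 0.6 0.75
```

A high-recall initial search followed by weeding corresponds to first growing `found` until most of `gold` is covered, then removing spurious entries such as `tro` to raise precision.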

Page 24

Top Examples of Selected Schemes

1 Million Words of Spanish

Suffixes | # of Suffixes | Part of Speech
Ø.s | 2 | Noun
a.as.o.os | 4 | Adjective
Ø.ba.ban.da.das.do.dos.n.ndo.r.ron.rse.rá.rán.ría.rían | 16 | Verb (-ar)
Ø.es | 2 | Noun
a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.e.en.ó | 18 | Verb (-ar)
Ø.a.emos.on.se.á.án.ía.ían | 9 | Verb (-ar/-er/-ir)
ones.ón | 2 | Nominalization
l.les | 2 | Noun

Page 25

Next Steps for Morphology Induction

• Clean the Selected Schemes
  – Current Work
• Convert Paradigms into a Segmenter
  – Soon
• Agglutinative sequences of suffixes
  – Soon
• Learn Mappings from Morphemes to Features
  – Future Goal