Developing affordable technologies for resource-poor languages

Ariadna Font Llitjós

Language Technologies Institute

Carnegie Mellon University

September 22, 2004

October 11, 2002 AMTA 2002 2

dot = language

Motivation

Resource-poor scenarios
- Indigenous communities have difficulty accessing crucial information that directly affects their lives (such as land laws, health warnings, etc.)
- Formalize a potentially endangered language

Affordable technologies, such as
- spell-checkers
- on-line dictionaries
- Machine Translation (MT) systems
- computer-assisted tutoring

AVENUE Partners

Language (status) | Country | Institutions
Mapudungun (in place) | Chile | Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
Quechua (started) | Peru | Ministry of Education
Iñupiaq (discussion) | US (Alaska) | Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
Siona (discussion) | Colombia | OAS-CICAD, Plante, Department of the Interior

Chile

Official Language: Spanish

Population: ~15 million

~1/2 million Mapuche people

Language: Mapudungun

Mapudungun for the Mapuche

What’s Machine Translation (MT)?

Japanese sentence → Swahili sentence

Speech to Speech MT

Why Machine Translation for resource-poor (indigenous) languages?

• Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers)

• Benefits include:
  – Better government access to indigenous communities (epidemics, crop failures, etc.)
  – Better participation of indigenous communities in information-rich activities (health care, education, government) without giving up their languages
  – Language preservation
  – Civilian and military applications (disaster relief)

MT for resource-poor languages: Challenges

• Minimal amount of parallel text (oral tradition)
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost

Machine Translation Pyramid

[Figure: MT pyramid. Levels: Interlingua, Transfer rules, Corpus-based methods. Steps: analysis, interpretation, generation. Example: "I saw you" → "Yo vi tú"]

AVENUE MT system overview

\spa Una mujer se quedó en casa
\map Kie domo mlewey ruka mew
\eng One woman stayed at home.

{VP,3}
VP::VP : [VP NP] -> [VP NP]
(
  (X1::Y1)
  (X2::Y2)
  ((x2 case) = acc)
  ((x0 obj) = x2)
  ((x0 agr) = (x1 agr))
  (y2 == (y0 obj))
  ((y0 tense) = (x0 tense))
  ((y0 agr) = (y1 agr))
)

V::V |: [stayed] -> [quedó]
(
  (X1::Y1)
  ((x0 form) = stay)
  ((x0 actform) = stayed)
  ((x0 tense) = past-pp)
  ((y0 agr pers) = 3)
  ((y0 agr num) = sg)
)

Interactive and Automatic Refinement of Translation Rules

Or: How to recycle corrections of MT output back into the MT system by adjusting and adapting the grammar and lexical rules

Error correction by non-expert bilingual users

Interactive elicitation of MT errors

Assumptions:
• non-expert bilingual users can reliably detect and minimally correct MT errors, given:
  – SL sentence (I saw you)
  – TL sentence (Yo vi tú)
  – word-to-word alignments (I-yo, saw-vi, you-tú)
  – (context)
• using an online GUI: the Translation Correction Tool (TCTool)

Goal:
• simplify the MT correction task maximally
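The information shown to the user (SL sentence, TL sentence, word-to-word alignments, optional context) can be pictured as a small record. The sketch below is illustrative only: the class name `CorrectionInstance` and its fields are assumptions made for this example and do not describe the TCTool's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CorrectionInstance:
    """Hypothetical container for what a bilingual user is shown."""
    sl: List[str]                      # source-language words
    tl: List[str]                      # MT output words
    alignments: List[Tuple[int, int]]  # (SL index, TL index) pairs
    context: Optional[str] = None      # optional surrounding text

    def aligned_pairs(self) -> List[Tuple[str, str]]:
        """Return the aligned (SL word, TL word) pairs."""
        return [(self.sl[i], self.tl[j]) for i, j in self.alignments]


# The example from the slide: I-yo, saw-vi, you-tú
example = CorrectionInstance(
    sl=["I", "saw", "you"],
    tl=["Yo", "vi", "tú"],
    alignments=[(0, 0), (1, 1), (2, 2)],
)
print(example.aligned_pairs())
```

Storing alignments as index pairs rather than word pairs keeps the record unambiguous when a word occurs more than once in a sentence.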

Translation Correction Tool

Actions:

SL + best TL picked by user

Changing “grande” into “gran”

Automatic Rule Refinement Framework

• Find best RR operations given a:
  • grammar (G),
  • lexicon (L),
  • (set of) source language sentence(s) (SL),
  • (set of) target language sentence(s) (TL),
  • its parse tree (P), and
  • minimal correction of TL (TL’)
  such that TQ2 > TQ1
• Which can also be expressed as:

  max TQ(TL | TL’, P, SL, RR(G, L))
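Read operationally, the expression asks for the refinement whose resulting translation scores highest against the user's correction. The sketch below is a toy illustration under stated assumptions: the slide does not define TQ, so `tq` here is a stand-in (position-wise word match against TL’), and the candidate outputs are hypothetical, built from the "el auto roja" example used elsewhere in this deck. The parse tree P and the SL sentence are left out of the toy score.

```python
from typing import Dict, List


def tq(tl: List[str], tl_corrected: List[str]) -> float:
    """Toy quality score: share of positions where TL matches TL'."""
    if not tl and not tl_corrected:
        return 1.0
    matches = sum(a == b for a, b in zip(tl, tl_corrected))
    return matches / max(len(tl), len(tl_corrected))


def best_refinement(outputs: Dict[str, List[str]],
                    tl_corrected: List[str]) -> str:
    """Pick the candidate refinement whose output maximizes the toy TQ."""
    return max(outputs, key=lambda name: tq(outputs[name], tl_corrected))


tl_corrected = "el auto rojo".split()

# Hypothetical outputs of the system before and after a candidate refinement.
outputs = {
    "no refinement": "el auto roja".split(),
    "hypothetical agreement constraint": "el auto rojo".split(),
}

best = best_refinement(outputs, tl_corrected)
print(best)
```

Any real scoring function could be swapped in for `tq`; the selection step itself stays a plain argmax over candidate refinements.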

Types of RR operations

• Grammar:
  – R0 → R0 + R1 [=R0’ + constr]    Cov[R0] → Cov[R0,R1]
  – R0 → R1 [=R0 + constr]    Cov[R0] → Cov[R1]
  – R0 → R1 [=R0 + constr= -] R2 [=R0’ + constr=c +]    Cov[R0] → Cov[R1,R2]

• Lexicon
  – Lex0 → Lex0 + Lex1 [=Lex0 + constr]
  – Lex0 → Lex1 [=Lex0 + constr]
  – Lex0 → Lex1 [Lex0 + TLword]
  – Lex1 (adding lexical item)
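The first two grammar operations differ in whether the original rule survives: one keeps R0 and adds a more constrained copy, the other replaces R0 with the constrained copy. A minimal sketch of that difference, assuming a rule can be modeled as a name plus a set of constraint strings; `Rule`, `bifurcate`, and `refine` are names invented for this example and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a name plus a set of constraint strings."""
    name: str
    constraints: FrozenSet[str] = frozenset()


def constrained_copy(rule: Rule, new_name: str, constraint: str) -> Rule:
    """Copy a rule and add one constraint to the copy."""
    return Rule(new_name, rule.constraints | {constraint})


def bifurcate(grammar: List[Rule], r0: Rule, constraint: str) -> List[Rule]:
    """R0 -> R0 + R1: keep R0 and add the constrained copy R1."""
    return grammar + [constrained_copy(r0, "R1", constraint)]


def refine(grammar: List[Rule], r0: Rule, constraint: str) -> List[Rule]:
    """R0 -> R1: replace R0 with the constrained copy R1."""
    r1 = constrained_copy(r0, "R1", constraint)
    return [r1 if r == r0 else r for r in grammar]


r0 = Rule("R0")
grammar = [r0]
agreement = "((y0 agr) = (y1 agr))"  # constraint borrowed from the VP rule

bifurcated = bifurcate(grammar, r0, agreement)
refined = refine(grammar, r0, agreement)
print([r.name for r in bifurcated])
print([r.name for r in refined])
```

Both functions return a new grammar and leave the input list untouched, so alternative refinements can be compared side by side.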

Questions & Discussion

Thanks!

Formalizing Error Information

Wi = error

Wi’ = correction

Wc = clue word

Example:

SL: the red car
TL: *el auto roja
TL’: el auto rojo

Wi = roja
Wi’ = rojo
Wc = auto
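Given TL and TL’, the error word Wi and its correction Wi’ can be located by comparing the two word sequences. The sketch below uses Python's standard `difflib` for this; the function name is hypothetical, and it only handles one-for-one word substitutions. Identifying the clue word Wc requires linguistic knowledge (here, that the adjective agrees with "auto") and is not attempted.

```python
import difflib
from typing import List, Tuple


def find_error_and_correction(tl: List[str],
                              tl_corrected: List[str]) -> List[Tuple[str, str]]:
    """Return (Wi, Wi') pairs for single-word substitutions between TL and TL'."""
    matcher = difflib.SequenceMatcher(a=tl, b=tl_corrected, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only spans where exactly one word was replaced by one word.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((tl[i1], tl_corrected[j1]))
    return pairs


tl = "el auto roja".split()
tl_corrected = "el auto rojo".split()
pairs = find_error_and_correction(tl, tl_corrected)
print(pairs)
```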

Finding Triggering Features

Once we have the user’s correction (Wi’), we can compare it with Wi at the feature level and find the triggering feature.

If the set of differing features is empty, we need to postulate a new binary feature.

Delta function:
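The slide's definition of the delta function did not survive extraction. One plausible reading, offered here only as an assumption, is the set of features on which Wi and Wi’ have different values. The feature names and values below are hypothetical lexicon entries chosen for illustration.

```python
from typing import Dict, Set


def delta(features_wi: Dict[str, str],
          features_wi_corrected: Dict[str, str]) -> Set[str]:
    """Assumed delta: names of features whose values differ between Wi and Wi'."""
    keys = set(features_wi) | set(features_wi_corrected)
    return {k for k in keys
            if features_wi.get(k) != features_wi_corrected.get(k)}


# Hypothetical feature structures for the "el auto roja" example.
roja = {"pos": "adj", "gender": "fem", "number": "sg"}
rojo = {"pos": "adj", "gender": "masc", "number": "sg"}
triggering = delta(roja, rojo)
print(triggering)

# Hypothetical case where the lexicon does not distinguish the two words:
# the set is empty, so a new binary feature would have to be postulated.
grande = {"pos": "adj", "number": "sg"}
gran = {"pos": "adj", "number": "sg"}
needs_new_feature = not delta(grande, gran)
print(needs_new_feature)
```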
