From Pāṇinian Sandhi to Finite State Calculus

Malcolm D. Hyman

Max Planck Institute for the History of Science, Berlin

First International Sanskrit Computational Linguistics Symposium, Paris, 2007 – p.1

Page 2: From Pan¯ .inian Sandhi to Finite State Calculusarchimedes.fas.harvard.edu/mdh/sandhi-slides.pdf · From Pan¯ .inian Sandhi to Finite State Calculus Malcolm D. Hyman Max Planck

Overview

1. Research context

2. An XML vocabulary for Pāṇinian rules

3. From Pāṇinian rules to an FST

4. Implications: remarks on linguistic description

Research context

Ongoing work on modeling components of Sanskrit grammar according to Pāṇinian principles:

nominal inflection

verbal inflection (using Dhātupāṭha)

stem formation (perfect stem, participial stems, ...)

morphophonology (sandhi)

Methodology

How closely to follow Pāṇini?

Practical concerns dictate an incremental approach.

We are obliged to interpret Pāṇini.

Research results concerning both Indian grammatical methods and facts of the Sanskrit language will emerge from computational studies.

Building blocks of an XML model

The rules model not only a Pāṇinian sūtra, but also its context and its interpretation.

An XML schema

A sound-based encoding (SLP1)

A regular expression dialect (PCREs)

The SLP1 encoding

Each entry gives Devanāgarī, IAST, SLP1.

Vowels:
अ a a    आ ā A    इ i i    ई ī I    उ u u    ऊ ū U
ऋ ṛ f    ॠ ṝ F    ऌ ḷ x    ॡ ḹ X
ए e e    ऐ ai E    ओ o o    औ au O

Stops and nasals:
क k k    ख kh K    ग g g    घ gh G    ङ ṅ N
च c c    छ ch C    ज j j    झ jh J    ञ ñ Y
ट ṭ w    ठ ṭh W    ड ḍ q    ढ ḍh Q    ण ṇ R
त t t    थ th T    द d d    ध dh D    न n n
प p p    फ ph P    ब b b    भ bh B    म m m

Semivowels, sibilants, and h:
य y y    र r r    ल l l    व v v
श ś S    ष ṣ z    स s s    ह h h

anusvāra = M; visarga = H
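The point of the encoding is that every sound becomes exactly one ASCII character, which keeps regular expressions over sounds simple. A minimal sketch of an IAST to SLP1 converter follows; the function name `to_slp1` and the greedy two-character matching are illustrative assumptions, not part of the slides, and the greedy match would misread a genuine t + h or a + i sequence.

```python
import unicodedata

# IAST -> SLP1 pairs taken from the table above.
_PAIRS = {
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "e": "e", "ai": "E", "o": "o", "au": "O",
    "k": "k", "kh": "K", "g": "g", "gh": "G", "ṅ": "N",
    "c": "c", "ch": "C", "j": "j", "jh": "J", "ñ": "Y",
    "ṭ": "w", "ṭh": "W", "ḍ": "q", "ḍh": "Q", "ṇ": "R",
    "t": "t", "th": "T", "d": "d", "dh": "D", "n": "n",
    "p": "p", "ph": "P", "b": "b", "bh": "B", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h",
    "ṃ": "M", "ḥ": "H",
}
# Normalize keys so lookups work regardless of how this file was saved.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def to_slp1(text):
    """Convert IAST text to SLP1, trying two-character units first."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        elif text[i] in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[text[i]])
            i += 1
        else:
            out.append(text[i])  # spaces, punctuation, unknown symbols
            i += 1
    return "".join(out)


print(to_slp1("saṃskṛtam"))
```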

The rule element

8.3.23 mo ’nusvāraḥ

<rule source="m" target="M"
      rcontext="[@(wb)][@(hal)]"
      ref="A.8.3.23"/>

(We may need more than one rule to express a sūtra.)
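To make the semantics of the rule element concrete, here is a rough sketch of how the rule for 8.3.23 could be applied with an ordinary regular expression. The macro values for `wb` (space or `#`) and `hal` (all SLP1 consonants) are assumptions, since the slides do not list them, and the right context is rendered as a lookahead so that it is checked but not consumed.

```python
import re

# Assumed macro values (not given in the slides).
WB = " #"                                   # word boundary symbols
HAL = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"   # consonants in SLP1

# source="m", target="M", rcontext="[@(wb)][@(hal)]"
RULE_8_3_23 = re.compile(f"m(?=[{WB}][{HAL}])")


def apply_8_3_23(s):
    """Replace word-final m with anusvara (M) before a consonant."""
    return RULE_8_3_23.sub("M", s)


print(apply_8_3_23("tam ca"))
```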

The macro element

We need some means for translating Pāṇini’s metalanguage, e.g. sound classes (pratyāhāras):

<macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
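One way to read the macro element is as a textual substitution performed before a rule pattern is compiled. The sketch below assumes that `@(name)` is replaced by the macro's `value`; the `expand` helper and the `wb` value are hypothetical, and only the `JaS` value is taken from the slide.

```python
import re

MACROS = {
    "JaS": "JBGQDjbgqd",  # value copied from the slide
    "wb": " #",           # assumed: space or '#'
}

MACRO_REF = re.compile(r"@\((\w+)\)")


def expand(pattern, macros):
    """Replace every @(name) reference with the macro's value."""
    return MACRO_REF.sub(lambda m: macros[m.group(1)], pattern)


print(expand("[@(wb)][@(JaS)]", MACROS))
```

An undefined macro name raises `KeyError`, which is probably the behavior one wants while a rule set is still being written.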

The mapping element

1.1.2 adeṅ guṇaḥ

<mapping name="guna" ref="A.1.1.2">
  <map from="@(a)" to="a"/>
  <map from="@(i)" to="e"/>
  <map from="@(u)" to="o"/>
  <map from="@(f)" to="a"/>
  <map from="@(x)" to="a"/>
</mapping>

The function element

<function name="gunate">
  <rule source="[@(a)@(i)@(u)]"
        target="%(guna($1))"/>
  <rule source="[@(f)@(x)]"
        target="%(guna($1))%(semivowel($1))"/>
</function>

Applying a function

6.1.87 ād guṇaḥ

<rule source="[@(a)][@(wb)]([@(ik)])"
      target="!(gunate($1))"
      ref="A.6.1.87"/>
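The interplay of mapping, function, and rule can be sketched in a few lines. This is an illustration under assumptions rather than the author's implementation: the `semivowel` mapping (f, F to r; x, X to l) and the expansion of the vowel macros to short and long variants are not shown in the slides.

```python
import re

# Mapping "guna" (A.1.1.2); each vowel macro is assumed to cover short and long.
GUNA = {"a": "a", "A": "a", "i": "e", "I": "e", "u": "o", "U": "o",
        "f": "a", "F": "a", "x": "a", "X": "a"}

# Assumed "semivowel" mapping; the slides reference it but do not show it.
SEMIVOWEL = {"f": "r", "F": "r", "x": "l", "X": "l"}


def gunate(vowel):
    """Function "gunate": guna grade, plus a semivowel after vocalic r or l."""
    return GUNA[vowel] + SEMIVOWEL.get(vowel, "")


# Rule A.6.1.87: a-vowel, word boundary, captured ik vowel.
RULE_6_1_87 = re.compile(r"[aA][ #]([iIuUfFxX])")


def apply_6_1_87(s):
    """Replace a + boundary + ik vowel with the guna of the ik vowel."""
    return RULE_6_1_87.sub(lambda m: gunate(m.group(1)), s)


print(apply_6_1_87("ca iti"))
```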

Implementing the modeled rules

The XML model captures some of the structure of Pāṇini’s grammar. But the obvious serial application of the rules is computationally inefficient.

The rules can be automatically translated into regular expressions for compilation into a finite state transducer using tools such as xfst (Xerox) or fsa (van Noord).

The relation between the underlying strings and the surface strings is a regular relation.

The replace operator

Rules may be translated into regular expressions employing the replace operator (Karttunen 1995).

(a|A)( | #)(a|A) → a
(a|A)( | #)(i|I) → e
(a|A)( | #)(u|U) → o
(a|A)( | #)(f|F) → ar
(a|A)( | #)(x|X) → al
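The automatic translation step can be pictured as a small generator that expands the sound classes and the gunate function into one replace expression per vowel. The sketch below only produces the rule strings in the notation of this slide; it does not call xfst or fsa, and the table names `VARNA` and `GUNATE` are made up for the example.

```python
# Vowel classes (short and long) and the result of gunate for each class.
VARNA = {"a": "aA", "i": "iI", "u": "uU", "f": "fF", "x": "xX"}
GUNATE = {"a": "a", "i": "e", "u": "o", "f": "ar", "x": "al"}

# Word-boundary alternation, copied verbatim from the slide.
WB = "( | #)"


def alt(chars):
    """Render a character class as a parenthesized alternation."""
    return "(" + "|".join(chars) + ")"


def replace_rules():
    """Return one replace expression per vowel class, in table order."""
    left = alt(VARNA["a"]) + WB
    return [f"{left}{alt(VARNA[v])} → {GUNATE[v]}" for v in VARNA]


for rule in replace_rules():
    print(rule)
```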

Context-dependent replacement

Documented algorithms exist for the translation of context-dependent replacements into FSTs (Mohri & Sproat 1996).

6.1.109 eṅaḥ padāntād ati

<rule source="a" target="'"
      lcontext="[@(eN)][@(wb)]"
      ref="6.1.109"/>

a → ' / (e|o)( | #) _

An FST for 6.1.109

6.1.109 eṅaḥ padāntād ati

[Figure: transducer with states s0, s1, s2; arc labels "e, o", "?", "space, #", and "?, a:'"]
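A three-state machine of this kind is small enough to write out by hand. The transition function below is one plausible reading of the diagram, not a verified copy of it: state 0 is the default, state 1 means "just saw e or o", and state 2 means "saw e or o followed by a word boundary", where an incoming a is rewritten as the avagraha.

```python
E_O = {"e", "o"}
BOUNDARY = {" ", "#"}


def step(state, ch):
    """Return (next_state, output_symbol) for one input symbol."""
    if ch in E_O:
        return 1, ch          # from any state: remember the e/o
    if state == 1 and ch in BOUNDARY:
        return 2, ch          # e/o followed by a word boundary
    if state == 2 and ch == "a":
        return 0, "'"         # the a:' arc
    return 0, ch              # any other symbol, copied unchanged


def transduce(s):
    """Run the transducer over an SLP1 string."""
    state, out = 0, []
    for ch in s:
        state, symbol = step(state, ch)
        out.append(symbol)
    return "".join(out)


print(transduce("te api"))
```

Each input symbol is handled in constant time, and no backtracking or lookahead is needed to check the left context.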

A composed FST for external sandhi

37 sūtras constitute core rules for external sandhi

XML: 48 rules, 61 macros, 16 mappings, 3 functions

compiled regular expressions are ~268 KB

composed transducer has 4,994 states, 417,814 arcs

Comparing two approaches

Serial application of rules:

FORM      SŪTRA
tat ca
tad ca    8.2.39
taj ca    8.4.40, 44
tac ca    8.4.55
tacca
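The serial derivation in the table can be imitated with a list of ordered substitutions. The patterns below are drastically simplified stand-ins that cover only this one example (they are not general statements of the sūtras), and the final joining step is an assumption about how the surface form tacca is written.

```python
import re

# (label, pattern, replacement); each step sees the output of the previous one.
STEPS = [
    ("8.2.39", r"t(?= )", "d"),          # word-final t is voiced
    ("8.4.40", r"d(?= [cCjJ])", "j"),    # dental becomes palatal before a palatal
    ("8.4.55", r"j(?= [cC])", "c"),      # devoicing before a voiceless stop
    ("", r"(?<=c) (?=c)", ""),           # words written together (no sutra cited)
]


def derive(form):
    """Return the list of (sutra, form) pairs, starting with the input."""
    history = [("", form)]
    for sutra, pattern, replacement in STEPS:
        form = re.sub(pattern, replacement, form)
        history.append((sutra, form))
    return history


for sutra, form in derive("tat ca"):
    print(f"{form:8}{sutra}")
```

Every step rescans the whole string, which is the inefficiency that a single composed transducer avoids.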

Comparing two approaches

A unique path through the transducer:

<t:t><a:a><t:c><" ":c><c:ε><a:a>
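Reading the path as a sequence of input:output pairs, the upper side spells the underlying string and the lower side the surface string. A short sketch, with ε represented as the empty string (a representation chosen here for convenience):

```python
EPSILON = ""

# The path from the slide as (upper, lower) symbol pairs.
PATH = [("t", "t"), ("a", "a"), ("t", "c"), (" ", "c"), ("c", EPSILON), ("a", "a")]


def upper(path):
    """Concatenate the input side of each arc."""
    return "".join(u for u, _ in path)


def lower(path):
    """Concatenate the output side of each arc."""
    return "".join(l for _, l in path)


print(repr(upper(PATH)), "->", repr(lower(PATH)))
```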

Limitations of segmentalism

Segments are atomic, and enumerating them limits linguistic generalization.

Features overlap segments. It was J. R. Firth’s insight that “some phonological properties are not uniquely ‘placed’ with respect to particular segments within a larger unit” (Anderson, 1985, 185).

Coarticulation “can be detected in almost every phoneme sequence in normal speech” (Goodglass, 1993, 62).

Positions of the Indian grammarians

Pāṇini moved beyond the vikāra system of earlier linguistic thinkers (Cardona 1965, 311).

Use of abbreviations (pratyāhāras) for sound classes and the principle of sāvarṇya (A. 1.1.50) emphasize featural analysis.

Segments contain subsegments (e.g. /ṛ/ contains r: MBh. 3.452.1 ff.).

Pitch is a property of the syllable (ṚPr. 3.9) or spreads to adjacent consonants (TPr. 1.43).

N-retroflexion in finite state modeling

Non-final /n/ is realized as ṇ after {ṛ, ṝ, r, ṣ} despite intervening vowels, semivowels, gutturals/velars, labials, or anusvāra.

<rule source="n" target="R"
      lcontext="[fFrz][#@(aw)@(ku)@(pu)M]*"
      rcontext=".*[@(ac)]"
      ref="8.4.1-2"/>
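The long-distance left context is what makes this rule interesting, and it can be tried out directly with a regular expression. In the sketch below the macro expansions for `aw`, `ku`, `pu`, and `ac` are assumptions based on the usual contents of those pratyāhāras; the left context is captured and copied back because Python's `re` module does not support variable-length lookbehind.

```python
import re

VOWELS = "aAiIuUfFxXeEoO"          # assumed value of @(ac)
AW = VOWELS + "hyvr"               # assumed value of @(aw)
KU = "kKgGN"                       # assumed value of @(ku)
PU = "pPbBm"                       # assumed value of @(pu)
INTERVENING = "#" + AW + KU + PU + "M"

# lcontext is group 1; rcontext ".*[@(ac)]" becomes a lookahead.
NATI = re.compile(f"([fFrz][{INTERVENING}]*)n(?=.*[{VOWELS}])")


def retroflex_n(s):
    """Apply n -> R repeatedly until the string stops changing."""
    previous = None
    while previous != s:
        previous = s
        s = NATI.sub(r"\1R", s)
    return s


print(retroflex_n("bfMhana"))
```

A dental between the trigger and the n blocks the change, as in raTena. Applied on its own to nizanna, this sketch yields nizaRna; it does not retroflex an n that directly follows another changed n.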

N-retroflexion examples

There is a regular relation between a set of underlying and surface strings that includes the following pairs:

UNDERLYING     SURFACE
bṛṃhana        bṛṃhaṇa        ‘making big/strong’
ārabhyamāna    ārabhyamāṇa    ‘being commenced’
niṣanna        niṣaṇṇa        ‘sitting’

A prosody of retroflexion

When R is projected onto the linear phonematic plane, ṇ occurs within its extension (Allen 1951, 943).

[Figure: the prosody R drawn over bṛṃhaṇa, ārabhyamāṇa, and niṣaṇṇa, beginning at ṛ, r, and ṣ respectively]

How to represent length?

/devāt/ ([+long] segment)
/devaːt/ (phoneme of length)
/devaat/ (two phonemes)

Autosegmental approaches to length

d e v a t

[DBL]

d e v a t

C V C V V C
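The second representation can be derived mechanically from an SLP1 string by giving long vowels two V slots. The sketch below is an illustration under assumptions: it follows the slide in giving e a single V slot, treats the diphthongs E and O as two slots, and labels every other symbol C.

```python
# Assumed classification of SLP1 vowels by number of V slots.
TWO_SLOTS = set("AIUFXEO")   # long vowels and the diphthongs ai, au
ONE_SLOT = set("aiufxeo")    # short vowels; e and o follow the slide's single V


def cv_skeleton(slp1):
    """Project an SLP1 string onto a CV tier."""
    slots = []
    for ch in slp1:
        if ch in TWO_SLOTS:
            slots += ["V", "V"]
        elif ch in ONE_SLOT:
            slots.append("V")
        else:
            slots.append("C")
    return " ".join(slots)


print(cv_skeleton("devAt"))
```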

Autosegmental implications

“stability” of suprasegmental units (Goldsmith 1976)

compensatory lengthening (Latin consul → cōsul; cf. epigraphic COS)

Swedish has complementary distribution of vocalic/consonantal length in rime of stressed syllables

long vowels are structurally parallel to diphthongs on the CV tier but not on the segmental tier

Length in Indian grammar

The Pāṇinian Śivasūtras specify only five basic vowels, not distinguishing between short or long (or pluta) vowels. Pāṇini characteristically refers to a-varṇa, etc., that is, the a vowel independent of its length (1.1.69).

The utility of linguistic descriptions

The virtue of particular linguistic descriptions is substantially relative to their purpose. Linear and non-linear descriptions each have advantages.

The Aṣṭādhyāyī is motivated by brevity and explanatory generality. Computational linguistics strives for efficiency and explicitness.
