
Phonotactic Structures in Swedish: A Data-Driven Approach

Felix Hultin

Department of Linguistics

Magister thesis 15 credits

Computational Linguistics

Spring 2017

Tutor: Mats Wirén

Examinator: Bernhard Wälchli

Reviewer: Robert Östling


Abstract

Ever since Bengt Sigurd laid out the first comprehensive description of Swedish phonotactics in 1965, it has been the main point of reference within the field. This thesis attempts a new approach by presenting a computational and statistical model of Swedish phonotactics, which can be built from any corpus of IPA phonetic script. The model is a weighted trie, represented as a finite state automaton, where states are phonemes linked by transitions in valid phoneme sequences, which adds the benefits of being probabilistic and expressible by regular languages. It was implemented using the Nordisk Språkteknologi (NST) pronunciation lexicon and was used to test a couple of rule sets defined by Sigurd concerning initial two-consonant clusters of phonemes and phoneme classes. The results largely agree with Sigurd's rules and illustrate the benefits of the model, in that it can effectively be used to pattern match against phonotactic information using a regular-expression-like syntax.

Keywords

Phonotactics, computational phonology, trie, finite automata, pattern matching, regular languages

Sammanfattning

Ända sedan Bengt Sigurd lade fram den första övergripande beskrivningen av svensk fonotax 1965, så har den varit den främsta referenspunkten inom fältet. Detta examensarbete försöker sig på en ny infallsvinkel genom att presentera en beräkningsbar och statistisk modell av svensk fonotax som kan byggas med en korpus av fonetisk skrift i IPA. Modellen är en viktad trie, representerad som en ändlig automat, vilket har fördelarna av att vara probabilistisk och kunna beskrivas av reguljära språk. Den implementerades med hjälp av uttalslexikonet från Nordisk Språkteknologi (NST) och användes för att testa ett par regelgrupper av initiala två-konsonant kluster av fonem och fonemklasser definierad av Sigurd. Resultaten stämmer till större del överens med Sigurds regler och visar på fördelarna hos modellen, i att den effektivt kan användas för att matcha mönster av fonotaktisk information med hjälp av en liknande syntax för reguljära uttryck.

Nyckelord

Fonotax, beräkningsbar fonologi, trie, ändlig automat, mönstermatchning, reguljära språk

Contents

1. Introduction
2. Background
    2.1. Phonology
        2.1.1. Phonemes in Swedish
        2.1.2. International Phonetic Alphabet
        2.1.3. Distinctive Features
        2.1.4. Phonotactics in Swedish
        2.1.5. Initial Sequences in Phonotactic Structures in Swedish
        2.1.6. Remarks
    2.2. Computational Theory
        2.2.1. Finite Automata
        2.2.2. Regular Languages and Pattern Matching
        2.2.3. Trie
        2.2.4. Computational Phonology
3. Aims and Research Questions
4. Method and Data
    4.1. Data
        4.1.1. Data Filtering
    4.2. A Trie Representation of Phonotactics
        4.2.1. Implementation of Phonotactic Model
        4.2.2. Extracting Information from the Phonotactic Model with Pattern Matching
    4.3. Visualizing the Phonotactic Model
    4.4. Using Search Patterns to Test Initial Consonant Cluster Rules
5. Results
    5.1. Initial Two Phoneme Consonant Clusters
    5.2. Initial Two Consonant Phoneme Class Clusters
    5.3. Results Summary
6. Discussion
    6.1. Method Discussion
    6.2. Results Discussion
7. Conclusions
A. The case of /pj/-
References


1. Introduction

Ever since Swedish phonotactics was first laid out by Bengt Sigurd in his doctoral thesis Phonotactic Structures in Swedish (Sigurd, 1965), the research area has been largely confined to the results of his more than 50-year-old endeavor. Indeed, in the latest accounts of Swedish phonotactics, such as Tomas Riad's book The Phonology of Swedish (Riad, 2013, ch. 12), Sigurd's work is still referred to as the main point of reference.

Meanwhile, in the area of computational phonology, the mathematical model of finite state automata has become essential for representing phonological observations, and has recently been coupled with statistical models to predict phonological information. On a different note, a vast digital lexicon of word entries by Språkteknologi Holding, including pronunciation data, was released to the public in 2011 by the Norwegian Språkbanken, giving access to a phonological resource not previously available.

In light of, and inspired by, these separate developments, this thesis will present a computational, data-driven, statistical model of Swedish phonotactics, which can be built from any corpus of phonetic script based on the International Phonetic Alphabet (IPA) (International Phonetic Association, 1999). The model is a weighted trie, represented as a probabilistic finite automaton, where states are phonemes linked by transitions in valid phoneme sequences, and where the weight of a transition represents the likelihood of one phoneme following another. I will investigate the model's computational benefits, especially in the context of phonotactic research, and, as a proof of concept, test some sample rules defined in Sigurd's thesis against the corresponding results generated by the model.
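To make the idea of a weighted trie concrete, the following is a minimal Python sketch, not the implementation used in this thesis. It assumes that each lexicon entry is already split into a list of phoneme symbols; the names `TrieNode`, `build_trie`, `transition_probability` and `initial_clusters`, as well as the four toy transcriptions, are hypothetical and chosen for illustration only. Each node counts how many entries pass through it, so the weight of a transition can be estimated as the child's count divided by the parent's count.

```python
class TrieNode:
    """One state of the trie: a phoneme reached via a specific prefix."""

    def __init__(self):
        self.children = {}  # phoneme symbol -> TrieNode
        self.count = 0      # number of entries whose prefix reaches this node


def build_trie(transcriptions):
    """Build a count-weighted trie from an iterable of phoneme lists."""
    root = TrieNode()
    for phonemes in transcriptions:
        root.count += 1
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, TrieNode())
            node.count += 1
    return root


def transition_probability(node, phoneme):
    """Relative frequency of `phoneme` following the prefix ending in `node`.

    The probabilities of all children may sum to less than 1; the remainder
    is the share of entries that end at `node`.
    """
    child = node.children.get(phoneme)
    return 0.0 if child is None else child.count / node.count


def initial_clusters(root, first_class, second_class):
    """Share of all entries that start with a phoneme from `first_class`
    followed by a phoneme from `second_class`."""
    result = {}
    for p1, n1 in root.children.items():
        if p1 not in first_class:
            continue
        for p2, n2 in n1.children.items():
            if p2 in second_class:
                result[(p1, p2)] = n2.count / root.count
    return result


# Toy data (assumption): simplified, made-up transcriptions, not NST entries.
toy_lexicon = [
    ["s", "p", "r", "o", "k"],
    ["s", "p", "e", "l"],
    ["s", "t", "e", "n"],
    ["t", "r", "e"],
]
root = build_trie(toy_lexicon)
consonants = {"s", "p", "t", "r", "k", "l", "n"}
clusters = initial_clusters(root, consonants, consonants)
```

In this toy lexicon three of the four entries begin with "s", so `transition_probability(root, "s")` is 0.75, and `clusters` holds the three initial two-consonant clusters that occur in the data together with their relative frequencies. Restricting both positions to a set of symbols is a simplified stand-in for the phoneme-class patterns that the model is meant to support.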

With this research, I hope to lay out and demonstrate the need for this type of computational model, which I will argue is an important infrastructure for data-driven research of Swedish phonotactics.


2. Background

This thesis draws on two areas: the linguistic research area of phonology, especially phonotactics in Swedish, and the computational and mathematical theories that will be used to compute a phonotactic model. In this section I will therefore cover both areas, in order to put the need for a computational, statistical model into perspective and to lay out the theory necessary for implementing it.

2.1. Phonology

Phonology is the study of how sounds are organized in natural languages. This stands in contrast to phonetics, which studies the physiological, aerodynamic and acoustic characteristics of speech sounds (Catford, 1988). Although the two disciplines depend on each other in many ways, it can generally be said that phonetics studies continuous aspects of sound, which phonology then organizes into the discrete systems of natural languages. This relation between the continuous and the discrete is important, as it will reappear as we get into the International Phon
