
Proceedings of VarDial, pages 110–119, Minneapolis, MN, June 7, 2019. © 2019 Association for Computational Linguistics


Toward a deep dialectological representation of Indo-Aryan

Chundra A. Cathcart
Department of Comparative Linguistics
University of Zurich
[email protected]

Abstract

This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within one such group, the Indo-Aryan subgroup of Indo-European. I draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. I show that a "deep" model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages and performs better than a "shallow" model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions), and outline future pathways for model development.

1 Introduction

At the risk of oversimplifying, quantitative models of language relationship fall into two broad categories. At a wide, family-level scale, phylogenetic methods adopted from computational biology have had success in shedding light on the histories of genetically related but significantly diversified speech varieties (Bouckaert et al., 2012). At a shallower level, the subfield of dialectometry has used a wide variety of chiefly distance-based methodologies to analyze variation among closely related dialects with similar lexical and typological profiles (Nerbonne and Heeringa, 2001), though this work also emphasizes the importance of hierarchical linguistic relationships and the use of abstract, historically meaningful features (Prokić and Nerbonne, 2008; Nerbonne, 2009). It is possible, however, that neither methodology is completely effective for language groups of intermediate size, particularly those where certain languages have remained in contact to an extent that blurs the phylogenetic signal, but have experienced great enough diversification that dialectometric approaches are not appropriate. This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within one such group, the Indo-Aryan subgroup of Indo-European.

Indo-Aryan presents many interesting puzzles. Although all modern Indo-Aryan (henceforth NIA) languages descend from Sanskrit or Old Indo-Aryan (henceforth OIA), their subgrouping and dialectal interrelationships remain somewhat poorly understood (for surveys of assorted problems, see Emeneau 1966; Masica 1991; Toulmin 2009; Smith 2017; Deo 2018). This is partly due to the fact that these languages have remained in contact with each other, and this admixture has complicated our understanding of the languages' history. Furthermore, while most NIA languages have likely gone through stages closely resembling attested Middle Indo-Aryan (MIA) languages such as Prakrit or Pali, no NIA language can be taken with any certainty to be a direct descendant of an attested MIA variety, further shrouding the historical picture of their development.

The primary goal of the work described in this paper is to build, or work towards building, a model of Indo-Aryan dialectology that incorporates realistic assumptions regarding historical linguistics and language change. I draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. I show that a "deep" model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages, and performs better than a "shallow" model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions). I provide a comparison with other metrics, and outline future pathways for model development.

2 Sound Change

The notion that sound change proceeds in a regular and systematic fashion is a cornerstone of the comparative method of historical linguistics. When we consider cognates such as Greek phero and Sanskrit bhara(mi) 'I carry', we observe regular sound correspondences (e.g., ph/bh) which allow us to formulate sound changes that have operated during the course of each language's development from their shared common ancestor. Under ideal circumstances, these are binary yes/no questions (e.g., Proto-Indo-European bh > Greek ph). At other times, there is some noise in the signal: for instance, OIA kṣ is realized as kh in most Romani words (e.g., akṣi- 'eye' > jakh), but also as ch (kṣurika- > churi 'knife'), according to Matras (2002: 41). This is undoubtedly due to relatively old language contact (namely lexical borrowing) between prehistoric Indo-Aryan dialects, as opposed to different conditioning environments which trigger a change kṣ > kh in some phonological contexts but kṣ > ch in others. The idea that Indo-Aryan speech varieties borrowed forms from one another on a large scale is well established (Turner, 1975 [1967]: 406), as is often the case in situations where closely related dialects have developed in close geographic proximity to one another (cf. Bloomfield, 1933: 461–495). An effective model of Indo-Aryan dialectology must be able to account for this sort of admixture. Phylogenetic methods and distance-based methods provide indirect information regarding language contact (e.g., in the form of uncertain tree topologies), but do not explicitly model intimate borrowing.

A number of studies have used mixed-membership models such as the Structure model (Pritchard et al., 2000) in order to explicitly model admixture between languages (Reesink et al., 2009; Syrjänen et al., 2016). Under this approach, individual languages receive their linguistic features from latent ancestral components with particular feature distributions. A key assumption of the Structure model is the relative invariance and stability of the features of interest (e.g., allele frequencies, linguistic properties). However, sound change is a highly recurrent process with many telescoped and intermediate changes, and it is not possible to treat sound changes that have operated as stable, highly conservative features.¹

Intermediate stages between OIA and NIA languages are key for capturing similarities in cross-linguistic behavior, and we require a model that teases apart dialect group-specific trends and language-level ones. Consider the following examples:

• Assamese x, the reflex of OIA ś, ṣ, s, is thought to develop from an intermediate ʃ (Kakati, 1941: 224). This isogloss would unite it with languages like Bengali, which show ʃ for OIA ś, ṣ, s.

• Some instances of NIA bh likely come from an earlier mh (Tedesco, 1965: 371; cf. Oberlies, 2005: 48).

• The Marathi change ch > s affects certain words containing MIA ch < OIA kṣ as well as OIA ch (Masica, 1991: 457); ch ~ kh < OIA kṣ variation is of importance to MIA and NIA dialectology (compare the Romani examples given above).

In all examples, a given NIA language shows the effects of chronologically deep behavior which serves as an isogloss uniting it with other NIA languages, but this trend is masked by subsequent language-specific changes.² Work on probabilistic reconstruction of proto-word forms explicitly appeals to intermediate chronological stages where linguistic data are unobserved (Bouchard-Côté et al., 2007); however, unlike the work cited, this paper does not assume a fixed phylogeny, and hence I cannot adopt many of the simplifying conventions that the authors use.

3 Data

I extracted all modern Indo-Aryan forms from Turner's (1962–1966) Comparative Dictionary of the Indo-Aryan Languages (henceforth CDIAL),³ along with the Old Indo-Aryan headwords (henceforth ETYMA) from which these reflexes descend.

¹Cathcart (to appear) circumvents this issue in a mixed-membership model of Indo-Aryan dialectology by considering only sound changes thought a priori in the literature to be relatively stable and of importance to dialectology.

²Some similar-looking sound changes can be shown to be chronologically shallow. For instance, the presence of s for original kh in Old Braj, taken by most scholars to represent a legitimate sound change and not just an orthographic idiosyncrasy, affects Persian loans such as saracu 'expense' ← Modern Persian xirc (McGregor, 1968: 125). This orthographic behavior is found in Old Gujarati as well (Baumann, 1975: 9). For further discussion of this issue, see Strnad 2013: 16ff.

³Available online at http://dsal.uchicago.edu/dictionaries/soas/.


Transcriptions of the data were normalized and converted to the International Phonetic Alphabet (IPA). Systematic morphological mismatches between OIA etyma and reflexes were accounted for, including stripping the endings from all verbs, since citation forms for OIA verbs are in the 3sg present while most NIA reflexes give the infinitive. I matched each dialect with corresponding languoids in Glottolog (Hammarström et al., 2017) containing geographic metadata, resulting in the merger of several dialects. I excluded cognate sets with fewer than 10 forms, yielding 33,231 modern Indo-Aryan forms. I preprocessed the data, first converting each segment into its respective sound class as described by List (2012), and subsequently aligning each converted OIA/NIA string pair via the Needleman-Wunsch algorithm, using the Expectation-Maximization method described by Jäger (2014), building off of work by Wieling et al. (2012). This yields alignments of the following type: e.g., OIA antra 'entrails' > Nepali an∅ro, where ∅ indicates a gap where the "cursor" advances for the OIA string but not the Nepali string. Gaps on the OIA side are ignored, yielding a one-to-many OIA-to-NIA alignment; this ensures that all aligned cognate sets are of the same length.
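To make the one-to-many alignment and the trigram inputs used below concrete, the following Python sketch illustrates the two steps just described. The helper names, the gap symbol ∅, and the boundary symbol # are illustrative assumptions for the example, not the preprocessing code used for the paper.

```python
def collapse_alignment(oia_aligned, nia_aligned, gap="∅"):
    """Fold OIA-side gaps into the preceding position, so that each OIA
    segment aligns with zero or more NIA segments (one-to-many alignment)."""
    pairs = []
    for o, n in zip(oia_aligned, nia_aligned):
        if o == gap and pairs:               # OIA gap: attach NIA segment to previous slot
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + [n])
        else:
            pairs.append((o, [] if n == gap else [n]))
    return pairs

def trigram_inputs(oia_segments, boundary="#"):
    """OIA input at each timepoint: the trigram centered there, with word
    boundaries padding the edges (boundary handling simplified here)."""
    padded = [boundary] + list(oia_segments) + [boundary]
    return [tuple(padded[t - 1:t + 2]) for t in range(1, len(padded) - 1)]

# e.g. OIA antra > Nepali an∅ro (the gap sits on the NIA side at the position of OIA t)
print(collapse_alignment(list("antra"), list("an∅ro")))
print(trigram_inputs(list("antra")))
```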

4 Model

The basic family of model this paper employs is a Bayesian mixture model, which assumes that each word in each language is generated by one of K latent dialect components. Like Structure (and similar methodologies like Latent Dirichlet Allocation), this model assumes that different elements in the same language can be generated by different dialect components. Unlike the most basic type of Structure model, which assumes a two-level data structure consisting of (1) languages and (2) the features they contain, our model assumes a three-level hierarchy where (1) languages contain (2) words which display the operation of different (3) sound changes; latent variable assignment happens at the word level.

I contrast the behavior of a DEEP model with that of a SHALLOW model. The deep model draws inspiration from Bayesian deep generative models (Ranganath et al., 2015), which incorporate intermediate latent variables which mimic the architecture of a neural network. This structure allows us to posit an intermediate representation between the sound patterns in the OIA etymon and the sound patterns in the NIA reflex, allowing the model to pick up on shared dialectal similarities between forms in languages, as opposed to language-specific idiosyncrasies. The shallow model, which serves as a baseline of sorts, conflates dialect group-level and language-level trends; it contains a flat representation of all of the sound changes taking place between a NIA word and its ancestral OIA etymon, and in this sense is halfway between a Structure model and a Naïve Bayes classifier (with a language-specific rather than global prior over component membership).

4.1 Shallow model

Here I describe the generative process for the shallow model, assuming W OIA etyma, L languages, K dialect components, I unique OIA inputs, O unique NIA outputs, and aligned OIA-NIA word pair lengths T_w, w ∈ {1, ..., W}. For each OIA etymon, an input x_{wt} at time point t ∈ {1, ..., T_w} consists of a trigram centered at the timepoint in question (e.g., ntr in OIA antra 'entrails'), and the NIA reflex's output y_{wlt} contains the segment(s) aligned with timepoint t (e.g., Nepali ∅). x_{wt}, t = 0 is the left word boundary, while x_{wt}, t = T_w + 1 is the right word boundary. Accordingly, sound change in the model can be viewed as a rewrite rule of the type A > B / C _ D. The model has the following parameters:

• Language-level weights over dialect components U_{lk}, l ∈ {1, ..., L}, k ∈ {1, ..., K}

• Dialect component-level weights over sound changes W_{kio}, k ∈ {1, ..., K}, i ∈ {1, ..., I}, o ∈ {1, ..., O}

The generative process is as follows:

For each OIA etymon x_w, w ∈ {1, ..., W}:

  For each language l ∈ {1, ..., L} in which the etymon survives, containing a reflex y_{wl}:

    Draw a dialect component assignment z_{wl} ~ Categorical(f(U_{l·}))

    For each time point t ∈ {1, ..., T_w}:

      Draw an NIA sound y_{wlt} ~ Categorical(f(W_{z_{wl}, x_{wt}, ·}))
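As a concrete illustration, the following Python sketch samples one reflex according to this generative process. The array shapes and helper names are assumptions made for the example, not taken from the released code.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def sample_reflex(x_w, l, U, W, seed=0):
    """Sample one NIA reflex for an OIA etymon in language l under the
    shallow model. x_w: list of trigram ids; U: (L, K) language-level
    weights; W: (K, I, O) component-level weights over sound changes."""
    rng = np.random.default_rng(seed)
    K = U.shape[1]
    z = rng.choice(K, p=softmax(U[l]))                        # dialect component assignment
    y = [rng.choice(W.shape[2], p=softmax(W[z, x_t])) for x_t in x_w]
    return z, y                                               # component id, NIA segment ids
```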


All weights in U and W are drawn from a Normal distribution with a mean of 0 and standard deviation of 10. f(·) represents the softmax function (throughout this paper), which transforms these weights to probability simplices. The generative process yields the following joint likelihood of the OIA etyma x and NIA reflexes y (with the discrete latent variables z marginalized out):

P(x, y \mid U, W) = \prod_{w=1}^{W} \prod_{l=1}^{L} \sum_{k=1}^{K} \left[ f(U_{l\cdot})_k \prod_{t=1}^{T_w} f(W_{k,x_{wt},\cdot})_{y_{wlt}} \right] \quad (1)

As readers will note, this model weights all sound changes equally, and makes no attempt to distinguish between dialectologically meaningful changes and noisy, idiosyncratic changes.
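For reference, a small sketch of how the marginal likelihood in Eq. (1) can be evaluated in log space (with a log-sum-exp over components for numerical stability). The data layout assumed here, trigram ids in x[w] and aligned output ids in y[w][l], is an assumption for the example rather than the paper's actual data structure.

```python
import numpy as np
from scipy.special import log_softmax, logsumexp

def log_lik_shallow(x, y, U, W):
    """Log of Eq. (1): for every attested (etymon, language) pair, log-sum-exp
    over dialect components of log f(U_l)_k plus the summed log probabilities
    f(W_{k, x_wt, .})_{y_wlt} over timepoints."""
    logU = log_softmax(U, axis=1)          # (L, K)
    logW = log_softmax(W, axis=2)          # (K, I, O)
    K = U.shape[1]
    total = 0.0
    for w, x_w in enumerate(x):
        for l, y_wl in enumerate(y[w]):
            if y_wl is None:               # etymon not attested in language l
                continue
            per_k = logU[l] + np.array(
                [sum(logW[k, x_t, y_t] for x_t, y_t in zip(x_w, y_wl))
                 for k in range(K)])
            total += logsumexp(per_k)
    return total
```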

4.2 Deep model

The deep model, like the shallow model, is a mixture model, and as such retains the language-level weights over dialect component membership U. It differs, however, from the shallow model, in which the likelihood of an OIA etymon and NIA reflex under a component assignment z = k is dependent on a flat representation of edit probabilities between OIA trigrams and NIA unigrams associated with dialect component k. Here, I attempt to add some depth to this representation of sound change by positing a hidden layer of dimension J between each x_{wt} and y_{wlt}. The goal here is to mimic a "noisy" reconstruction of an intermediate stage between OIA and NIA represented by dialect group k. This reconstruction is not an explicit, linguistically meaningful string (as in Bouchard-Côté et al., 2007, 2008, 2013); furthermore, it is re-generated for each individual reflex of each etymon and not shared across data points (such a model would introduce deeply nested dependencies between variables, and enumerating all possible reconstructions would be computationally infeasible).

For parsimony's sake, I employ a simple Recurrent Neural Network (RNN) architecture to capture rightward dependencies (Elman, 1990). Figure 1 gives a visual representation of the network unfolded in time. This model exchanges W, the dialect component-level weights over sound changes, for the following parameters:

• Dialect component-level weights governing hidden layer unit activations by OIA sounds, W^x_{kij}, k ∈ {1, ..., K}, i ∈ {1, ..., I}, j ∈ {1, ..., J}

• Dialect component-level weights governing hidden layer unit activations by previous hidden layers, W^h_{kij}, k ∈ {1, ..., K}, i ∈ {1, ..., J}, j ∈ {1, ..., J}

• Language-level weights governing NIA output activations by hidden layer units, W^y_{ljo}, l ∈ {1, ..., L}, j ∈ {1, ..., J}, o ∈ {1, ..., O}

For a given mixture component z = k, the activation of the hidden layer at time t, h_t, depends on two sets of parameters, each associated with component k: the weights W^x_{k,x_t,·} associated with the OIA input at time t, and W^h_k, the weights associated with the previous hidden layer h_{t−1}'s activations, for all t > 1. Given a hidden layer h_t, the weights W^y_l can be used to generate a probability distribution over possible outcomes in NIA language l. The forward pass of this network can be viewed as a generative process, denoted y_{wl} ~ RNN(x_w, W^x_k, W^h_k, W^y_l), under the parameters for component k and language l; under such a process, the likelihood of y_{wl} can be computed as follows:

P_{\mathrm{RNN}}(y_{wl} \mid x_w, W^x_k, W^h_k, W^y_l) = \prod_{t=1}^{T_w} f(h_t^\top W^y_l)_{y_{wlt}} \quad (2)

where

h_t = \begin{cases} f(W^x_{k,x_{wt},\cdot}) & \text{if } t = 1 \\ f(h_{t-1}^\top W^h_k \oplus W^x_{k,x_{wt},\cdot}) & \text{if } t > 1 \end{cases} \quad (3)
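A minimal Python sketch of Eqs. (2)–(3), computing the RNN likelihood of one reflex under a single component k and language l; here the ⊕ of Eq. (3) is read as elementwise addition, which is an assumption about the intended operation.

```python
import numpy as np
from scipy.special import softmax

def rnn_log_lik(x_w, y_wl, Wx_k, Wh_k, Wy_l):
    """Log of Eq. (2). Wx_k: (I, J) input-to-hidden weights for component k;
    Wh_k: (J, J) hidden-to-hidden weights; Wy_l: (J, O) hidden-to-output
    weights for language l. Hidden activations are softmaxed (Eq. 3), so
    they sum to one at every timepoint."""
    log_p, h = 0.0, None
    for x_t, y_t in zip(x_w, y_wl):
        pre = Wx_k[x_t].astype(float)
        if h is not None:                  # t > 1: add the previous hidden layer's projection
            pre = pre + h @ Wh_k
        h = softmax(pre)                   # Eq. (3)
        out = softmax(h @ Wy_l)            # distribution over NIA outcomes
        log_p += np.log(out[y_t])          # Eq. (2), accumulated in log space
    return log_p
```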

The generative process for this model is nearly identical to the process described in the previous sections; however, after the dialect component assignment (z_{wl} ~ Categorical(f(U_{l·}))) is drawn, the NIA string y_{wl} is sampled from RNN(x_w, W^x_{z_{wl}}, W^h_{z_{wl}}, W^y_l). The joint likelihood of the OIA etyma x and NIA reflexes y (with the discrete latent variables z marginalized out) is the following:

P(x, y \mid U, W^x, W^h, W^y) = \prod_{w=1}^{W} \prod_{l=1}^{L} \sum_{k=1}^{K} \left[ f(U_{l\cdot})_k \, P_{\mathrm{RNN}}(y_{wl} \mid x_w, W^x_k, W^h_k, W^y_l) \right] \quad (4)
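The per-reflex term of Eq. (4) then simply mixes the RNN likelihood over dialect components; a sketch reusing the rnn_log_lik function above, again with assumed array layouts:

```python
from scipy.special import log_softmax, logsumexp

def log_lik_deep_reflex(x_w, y_wl, l, U, Wx, Wh, Wy):
    """Log of one bracketed term of Eq. (4): log-sum-exp over components k of
    log f(U_l)_k + log P_RNN(y_wl | x_w, Wx_k, Wh_k, Wy_l)."""
    logU = log_softmax(U[l])
    per_k = [logU[k] + rnn_log_lik(x_w, y_wl, Wx[k], Wh[k], Wy[l])
             for k in range(len(logU))]
    return logsumexp(per_k)
```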


The same N(0, 10) prior as above is placed over U, W^x, W^h, W^y. J, the dimension of the hidden layer, is fixed at 100. This model bears some similarities to the mixture of RNNs described by Kim et al. (2018).

I have employed a simple RNN (rather than a more state-of-the-art architecture) for several reasons. The first is that I am interested in the consequences of expanding a flat mixture model to contain a simple, slightly deeper architecture. Additionally, I believe that the fact that the hidden layer of an RNN can be activated by a softmax function is more desirable from the perspective of representing sound change as a categorical or multinomial distribution, as all layer unit activations sum to one, as opposed to the situation with Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), which traditionally use sigmoid or hyperbolic tangent functions to activate the hidden layer. Furthermore, long-distance dependencies are not particularly widespread in Indo-Aryan sound change, lessening the need for more complex architectures. At the same time, the RNN is a crude approximation to the reality of language change. RNNs and related models draw a single arc between a hidden layer at time t and the corresponding output. It is perhaps not appropriate to envision this single dependency unless the dimensionality of the hidden layer is large enough to absorb potential contextual information that is crucial to sound change. To put it simply, emission probabilities in sound change are sharper than transitions common in most NLP applications (e.g., sentence prediction), and it may not be correct to envision y_t given h_{t′<t}, h_t as a function of an additive combination of weights, though in practice I find it too computationally costly to enumerate all possible value combinations of the hidden layer at multiple consecutive time points. This issue requires further exploration, and I employ what seems to be the most computationally tractable approach for the moment.

5 Results

I learn each model's MAP configuration using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of .1.⁴ I run the optimizer for 10,000 iterations over three random initializations, fitting the model on mini-batches of 100 data points, and monitor convergence by observing the trace of the log posterior (Figure 2).

⁴Code for all experiments can be found at https://github.com/chundrac/IA_dial/VarDial2019.
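A hypothetical sketch of what this MAP optimization loop might look like in PyTorch; the dimensions, the sample_minibatch helper, and the neg_log_posterior function (negative log likelihood plus the N(0, 10) log prior) are placeholders, not the released implementation.

```python
import torch

L, K, I, J, O = 67, 5, 2000, 100, 200           # placeholder dimensions, not from the paper
U  = torch.randn(L, K, requires_grad=True)
Wx = torch.randn(K, I, J, requires_grad=True)
Wh = torch.randn(K, J, J, requires_grad=True)
Wy = torch.randn(L, J, O, requires_grad=True)
opt = torch.optim.Adam([U, Wx, Wh, Wy], lr=0.1)

for step in range(10000):
    batch = sample_minibatch(data, size=100)            # hypothetical data helper
    loss = neg_log_posterior(batch, U, Wx, Wh, Wy)      # hypothetical objective
    opt.zero_grad()
    loss.backward()
    opt.step()                                          # track loss to monitor convergence
```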

[Figure 1: RNN representation unfolded in time. Hidden layers depend on OIA inputs x_1, ..., x_{T_w} and previous hidden layers (for t > 1); NIA outputs y_1, ..., y_{T_w} depend on hidden layers. Hidden layer activations are dependent on dialect component-specific parameters, while activations of the output layer are dependent on individual NIA language-specific parameters.]


The flat model fails to pick up on any major differences between languages, finding virtually identical posterior values of f(U_l), the language-level distribution over dialect component membership, for all l ∈ {1, ..., L}. According to the MAP configuration, each language draws forms from the same dialect group with > .99 probability, essentially undergoing a sort of "component collapse" that latent variable models sometimes encounter (Bowman et al., 2015; Dinh and Dumoulin, 2016). It is likely that bundling together sound change features leads to component-level distributions over sound changes with high entropy that are virtually indistinguishable from one another.⁵ While this particular result is disappointing in the lack of information it provides, I observe some properties of our models' posterior values in order to diagnose problems that can be addressed in future work (discussed below).

The deep model, on the other hand, infers highly divergent language-level posterior distributions over cluster membership. Since these distributions are not identical across initializations due to the label-switching problem, I compute the Jensen-Shannon divergence between the language-level posterior distributions over cluster membership for each pair of languages in our sample, for each initialization. I then average these divergences across initializations.

⁵I made several attempts to run this model with different specifications, including different prior distributions, but achieved the same result.


[Figure 2: Log posteriors for shallow model (left) and deep model (right) for 10,000 iterations over three random initializations.]

These averaged divergences are then scaled to three dimensions using multidimensional scaling. Figure 3 gives a visualization of these transformed values via the red-green-blue color vector, plotted on a map; languages with similar component distributions display similar colors. With a few exceptions (that may be artifacts of the fact that certain languages have only a small number of data points associated with them), a noticeable divide can be seen between languages of the main Indo-Aryan speech region on one hand and languages of northwestern South Asia (dark blue), the Dardic languages of Northern Pakistan, and the Pahari languages of the Indian Himalayas, though this division is not clear-cut. Romani and other Indo-Aryan varieties spoken outside of South Asia show affiliation with multiple groups. While Romani dialects are thought to have a close genetic affinity with Hindi and other Central Indic languages, Romani was likely in contact with languages of northwestern South Asia during the course of its speakers' journey out of South Asia (Hamp, 1987; Matras, 2002). However, this impressionistic evaluation is by no means a confirmation that the deep model has picked up on linguistically meaningful differences between speech varieties. In the following sections, some comparison and evaluation metrics and checks are deployed in order to assess the quality of these models' behavior.
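The following Python sketch illustrates the divergence-averaging and scaling steps described in this and the preceding paragraph; variable names and data layout are assumptions, and note that scipy's jensenshannon returns the square root of the divergence, hence the squaring.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

def averaged_divergences(posteriors):
    """posteriors: one (L, K) matrix of language-level component probabilities
    per initialization; returns the (L, L) Jensen-Shannon divergences averaged
    across initializations."""
    L = posteriors[0].shape[0]
    D = np.zeros((L, L))
    for P in posteriors:
        for a in range(L):
            for b in range(L):
                D[a, b] += jensenshannon(P[a], P[b]) ** 2
    return D / len(posteriors)

def to_rgb(D):
    """Scale the averaged divergences to three dimensions with multidimensional
    scaling and rescale each dimension to [0, 1] for use as an RGB color."""
    X = MDS(n_components=3, dissimilarity="precomputed").fit_transform(D)
    return (X - X.min(0)) / (X.max(0) - X.min(0))
```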

5.1 Entropy of distributions

I measure the average entropy of the model's posterior distributions in order to gauge the extent to which the models are able to learn sparse, informative distributions over sound changes, hidden state activations, or other parameters concerning transitions through the model architecture. Normalized entropy is used in order to make entropies of distributions of different dimension comparable; a distribution's entropy can be normalized by dividing by its maximum possible entropy.
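A short sketch of this normalization (entropy divided by the log of the number of outcomes), which is all that is assumed in the entropy figures reported below:

```python
import numpy as np

def normalized_entropy(p):
    """Entropy of a categorical distribution divided by its maximum possible
    entropy, log(dimension), so distributions of different size are comparable."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    if n < 2:
        return 0.0
    nz = p[p > 0]                        # 0 * log 0 treated as 0
    return float(-(nz * np.log(nz)).sum() / np.log(n))
```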

As mentioned above, our data set consists of OIA trigrams and the NIA segment corresponding to the second segment in the trigram, representing rewrite rules operating between OIA and the NIA languages in our sample. It is often the case that more than one NIA reflex is attested for a given OIA trigram. As such, the sound changes that have operated in an NIA language can be represented as a collection of categorical distributions, each summing to one. I calculate the average of the normalized entropies of these sound change distributions as a baseline against which to compare entropy values for the models' parameters. The pooled average of the normalized entropies across all languages is .11, while the average of averages for each language is .63.

For the shallow model, the parameter of interest is f(W), the dialect component-level collection of distributions over sound changes, the mean normalized entropy of which, averaged across initializations but pooled across components within each initialization, is .91 (raw values range from .003 to 1). For the deep model, the average entropy of the dialect-level distributions over hidden-layer activations f(W^x) is only slightly lower, at .86 (raw values range from close to 0 to 1).

For each k ∈ {1, ..., K}, I compute the forward pass of RNN(x_w, W^x_k, W^h_k, W^y_l) for each etymon w and each language l in which the etymon survives, using the inferred values for W^x_k, W^h_k, W^y_l, and compute the entropy of each f(h_t^⊤ W^y_l), yielding an average of .74 (raw values range from close to 0 to 1). While these values are still very high, it is clear that the inclusion of a hidden layer has learned sparser, potentially more meaningful distributions than the flat approach, and that increasing the dimensionality of the hidden layer will likely bring about even sparser, more meaningful distributions. The entropies cited here are considerably higher than the average entropy of languages' sound change distributions, but the latter distributions do little to tell us about the internal clustering of the languages.

5.2 Comparison with other linguistic distance metrics

Here I compare the cluster membership inferred by this paper's models against other measures of linguistic distance. Each method yields a pairwise inter-language distance metric, which can be compared against a non-linguistic measure. I measure the correlation between each linguistic distance measure and both great-circle geographic distance and patristic distance according to the Glottolog phylogeny, using Spearman's ρ.


[Figure 3: Dialect group makeup of languages in sample under deep model, plotted by longitude and latitude; points are labeled with Glottolog languoid codes.]


5.2.1 Levenshtein distance

Borin et al. (2014) measure the normalized Levenshtein distances (i.e., the edit distance between two strings divided by the length of the longer string) between words for the same concept in pairs of Indo-Aryan languages, and find that average normalized Levenshtein distance correlates significantly with patristic distances in the Ethnologue tree. This paper's dataset is not organized by semantic meaning, so for comparability I measure the average normalized Levenshtein distance between cognates in pairs of Indo-Aryan languages, which picks up on phonological divergence between dialects, as opposed to both phonological and lexical divergence.
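A sketch of the normalized Levenshtein distance (LDN) computation, averaged over the cognates shared by a pair of languages; the dictionary layout mapping etymon ids to segment lists is an assumption for the example.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two segment lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ldn(a, b):
    """Edit distance divided by the length of the longer string."""
    return levenshtein(a, b) / max(len(a), len(b))

def average_ldn(forms_a, forms_b):
    """Average LDN over reflexes of the same OIA etymon attested in both languages."""
    shared = forms_a.keys() & forms_b.keys()
    return sum(ldn(forms_a[w], forms_b[w]) for w in shared) / len(shared)
```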

5.2.2 Jensen-Shannon divergence

Each language in our dataset attests one or more (due to language contact, analogy, etc.) outcomes for a given OIA trigram, yielding a collection of sound change distributions as described above. For each pair of languages, I compute the Jensen-Shannon divergence between sound change distributions for all OIA trigrams that are continued in both languages, and average these values. This gives a measure of pairwise average diachronic phonological divergence between languages.
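A corresponding sketch for this measure, again with an assumed data layout (each language mapping a trigram id to a probability vector over NIA outcomes); as above, scipy's jensenshannon distance is squared to recover the divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def avg_sound_change_jsd(dists_a, dists_b):
    """Average Jensen-Shannon divergence between two languages' sound-change
    distributions, over all OIA trigrams continued in both languages."""
    shared = dists_a.keys() & dists_b.keys()
    return float(np.mean([jensenshannon(dists_a[c], dists_b[c]) ** 2
                          for c in shared]))
```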

5.2.3 LSTM Autoencoder

Rama and Çöltekin (2016) and Rama et al. (2017) develop an LSTM-based method for representing the phonological structure of individual word forms across closely related speech varieties. Each string is fed to a unidirectional or bidirectional LSTM autoencoder, which learns a continuous latent multidimensional representation of the sequence. This embedding is then used to reconstruct the input sequence. The latent values in the embedding provide information that can be used to compute dissimilarity (in the form of cosine or Euclidean distance) between strings or across speech varieties (by averaging the latent values for all strings in each dialect or language). I use the bidirectional LSTM autoencoder described in the work cited in order to learn an 8-dimensional latent representation for all NIA forms in the dataset, training the model over 20 epochs on batches of 32 data points, using the Adam optimizer to minimize the categorical cross-entropy between the input sequence and the NIA reconstruction predicted by the model. I use the learned model parameters to generate a latent representation for each form. The latent representations are averaged across forms within each language, and pairwise linguistic Euclidean distances are computed between each averaged representation.
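For illustration, a rough PyTorch sketch of the kind of bidirectional LSTM autoencoder described by Rama and Çöltekin (2016); apart from the 8-dimensional latent layer, all hyperparameters and structural details here are assumptions rather than their implementation.

```python
import torch
import torch.nn as nn

class BiLSTMAutoencoder(nn.Module):
    def __init__(self, n_symbols, emb_dim=16, latent_dim=8):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, emb_dim)
        self.encoder = nn.LSTM(emb_dim, latent_dim, bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * latent_dim, latent_dim)
        self.decoder = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, n_symbols)

    def forward(self, seq):                          # seq: (batch, T) symbol ids
        emb = self.embed(seq)
        _, (h, _) = self.encoder(emb)                # h: (2, batch, latent_dim)
        z = self.to_latent(torch.cat([h[0], h[1]], dim=-1))   # 8-dim latent embedding
        dec_in = z.unsqueeze(1).expand(-1, seq.size(1), -1)   # latent fed at each step
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), z                  # logits for cross-entropy; embedding

# Training would minimize cross-entropy between seq and the reconstruction
# (e.g., with Adam, 20 epochs, batches of 32, as described above); the per-form
# embeddings z are then averaged within each language.
```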


              Geographic   Genetic
Shallow JSD     −0.01       −0.03
Deep JSD         0.147*      0.008
LDN              0.346*      0.013
Raw JSD          0.302*     −0.051*
LSTM AE          0.158*     −0.068*
LSTM ED          0.084*      0.0001

Table 1: Spearman's ρ values for correlations between each linguistic distance metric (JSD = Jensen-Shannon Divergence, LDN = Levenshtein Distance Normalized, AE = Autoencoder, ED = Encoder-Decoder) and geographic and genetic distance. Asterisks represent significant correlations.


5.2.4 LSTM Encoder-Decoder

For the sake of completeness, I use an LSTM encoder-decoder to learn a continuous representation for every OIA-NIA string pair. This model is very similar to the LSTM autoencoder, except that it takes an OIA input and reconstructs an NIA output, instead of taking an NIA form as input and reconstructing the same string. I train the model as described above.

5.3 Correlations

Table 1 gives correlation coefficients (Spearman's ρ) between linguistic distance metrics and non-linguistic distance metrics. In general, correlations with Glottolog patristic distance are quite poor. This is surprising for Levenshtein Distance Normalized, given the high correlation with patristic distance reported by Borin et al. (2014). Given that the authors measured Levenshtein distance between identical concepts in pairs of languages, and not cognates, as I do here, it is possible that lexical divergence carries a stronger genetic signal than phonological divergence, at least in the context of Indo-Aryan (it is worth noting that I did not balance the tree as described by the authors; it is not clear that this would have yielded any improvement). On the other hand, the Levenshtein distance measured in this paper correlates significantly with great circle distance, indicating a strong geographic signal. Average Jensen-Shannon divergence between pairs of languages' sound change distributions shows a strong association with geographic distance as well.

Divergences/distances based on the deep model, the LSTM Autoencoder, and the LSTM Encoder-Decoder show significant correlations with geospatial distance, albeit lower ones. It is not entirely clear what accounts for this disparity. Intuitively, we expect more shallow chronological features to correlate with geographic distance. It is possible that the LSTM and RNN architectures are picking up on chronologically deeper information and show a low geographic signal for this reason, though this highly provisional idea is not borne out by any genetic signal.

It is not clear how to assess the meaning of these correlations at this stage. Nevertheless, deep architectures provide an interesting direction for future research into sound change and language contact, as they have the potential to disaggregate a great deal of information regarding interacting forces in language change that is censored when raw distance measures are computed directly from the data.

6 Outlook

This paper explored the consequences of adding hidden layers to models of dialectology where the languages have experienced too much contact for phylogenetic models to be appropriate, but have diversified to the extent that traditional dialectometric approaches are not applicable. While the model requires some refinement, its results point in a promising direction. Modifying prior distributions could potentially produce more informative results, as could tweaking hyperparameters of the learning algorithms employed. Additionally, it is likely that the model will benefit from hidden layers of higher dimension J, as well as bidirectional approaches, and despite the misgivings regarding LSTMs and GRUs stated above, future work will probably benefit from incorporating these and related architectures (e.g., attention). Additionally, the models used in this paper assumed discrete latent variables, attempting to be faithful to the traditional historical linguistic notion of intimate borrowing between discrete dialect groups. However, continuous-space models may provide a more flexible framework for addressing the questions asked in this paper (cf. Murawaki, 2015).

This paper provides a new way of looking at dialectology and linguistic affiliation; with refinement and expansion, it is hoped that this and related models can further our understanding of the history of the Indo-Aryan speech community, and can generalize to new linguistic scenarios. It is hoped that methodologies of this sort can join forces with similar tools designed to investigate the interaction of regularly conditioned sound change and chronologically deep language contact in individual languages' histories.

References

George Baumann. 1975. Drei Jaina-Gedichte in Alt-Gujarātī: Edition, Übersetzung, Grammatik und Glossar. Franz Steiner, Wiesbaden.

Leonard Bloomfield. 1933. Language. Holt, Rinehart and Winston, New York.

Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3137–3144.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110:4224–4229.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague. Association for Computational Linguistics.

Alexandre Bouchard-Côté, Percy S. Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems, pages 169–176.

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. Proceedings of the Twentieth Conference on Computational Natural Language Learning (CoNLL).

Chundra Cathcart. to appear. A probabilistic assessment of the Indo-Aryan Inner-Outer Hypothesis. Journal of Historical Linguistics.

Ashwini Deo. 2018. Dialects in the Indo-Aryan landscape. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 535–546. John Wiley & Sons, Oxford.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf.

Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Murray B. Emeneau. 1966. The dialects of Old-Indo-Aryan. In Jaan Puhvel, editor, Ancient Indo-European dialects, pages 123–138. University of California Press, Berkeley.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog 3.3. Max Planck Institute for the Science of Human History.

Eric P. Hamp. 1987. On the sibilants of Romani. Indo-Iranian Journal, 30(2):103–106.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill.

Banikanta Kakati. 1941. Assamese, its formation and development. Government of Assam, Gauhati.

Yoon Kim, Sam Wiseman, and Alexander M. Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Johann-Mattis List. 2012. SCA: Phonetic alignment based on sound classes. In M. Slavkovik and D. Lassiter, editors, New directions in logic, language and computation, pages 32–51. Springer, Berlin, Heidelberg.

Colin P. Masica. 1991. The Indo-Aryan languages. Cambridge University Press, Cambridge.

Yaron Matras. 2002. Romani – A Linguistic Introduction. Cambridge University Press, Cambridge.

R. S. McGregor. 1968. The language of Indrajit of Orchā. Cambridge University Press, Cambridge.

Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

John Nerbonne and Wilbert Heeringa. 2001. Computational comparison and classification of dialects. Dialectologia et Geolinguistica, 9:69–83.

Thomas Oberlies. 2005. A historical grammar of Hindi. Leykam, Graz.

Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Jelena Prokić and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1-2):153–172.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.

Taraka Rama, Çağrı Çöltekin, and Pavel Sofroniev. 2017. Computational analysis of Gondi dialects. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. 2015. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771.

Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7:e1000241.

Caley Smith. 2017. The dialectology of Indic. In Jared Klein, Brian Joseph, and Matthias Fritz, editors, Handbook of Comparative and Historical Indo-European Linguistics, pages 417–447. De Gruyter, Berlin/Boston.

Jaroslav Strnad. 2013. Morphology and Syntax of Old Hindi. Brill, Leiden.

Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.

Paul Tedesco. 1965. Turner's Comparative Dictionary of the Indo-Aryan Languages. Journal of the American Oriental Society, 85:368–383.

Matthew Toulmin. 2009. From linguistic to sociolinguistic reconstruction: the Kamta historical subgroup of Indo-Aryan. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University, Canberra.

Ralph L. Turner. 1962–1966. A comparative dictionary of Indo-Aryan languages. Oxford University Press, London.

Ralph L. Turner. 1975 [1967]. Geminates after long vowel in Indo-Aryan. In R. L. Turner: Collected Papers 1912–1973, pages 405–415. Oxford University Press, London.

Martijn Wieling, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314.

Page 2: Toward a deep dialectological representation of Indo-Aryanweb.science.mq.edu.au/~smalmasi/vardial6/pdf/W19-1411.pdfAryan subgroup of Indo-European. I draw upon admixture models and

111

rics and outline future pathways for model devel-opment

2 Sound Change

The notion that sound change proceeds in a reg-ular and systematic fashion is a cornerstone ofthe comparative method of historical linguisticsWhen we consider cognates such as Greek pheroand Sanskrit bhara(mi) lsquoI carryrsquo we observe reg-ular sound correspondences (eg phbh) which al-low us to formulate sound changes that have oper-ated during the course of each languagersquos develop-ment from their shared common ancestor Underideal circumstances these are binary yesno ques-tions (eg Proto-Indo-European bh gt Greek ph)At other times there is some noise in the signalfor instance OIA ks is realized as kh in most Ro-mani words (eg aks i- lsquoeyersquo gt jakh) but also asch (ks urika- gt churi lsquoknifersquo) according to Ma-tras (2002 41) This is undoubtedly due to rel-atively old language contact (namely lexical bor-rowing) between prehistoric Indo-Aryan dialectsas opposed to different conditioning environmentswhich trigger a change ks gt kh in some phono-logical contexts but ks gt ch in others The ideathat Indo-Aryan speech varieties borrowed formsfrom one another on a large scale is well estab-lished (Turner 1975 [1967] 406) as is often thecase in situations where closely related dialectshave developed in close geographic proximity toone another (cf Bloomfield 1933 461ndash495) Aneffective model of Indo-Aryan dialectology mustbe able to account this sort of admixture Phylo-genetic methods and distance-based methods pro-vide indirect information regarding language con-tact (eg in the form of uncertain tree topologies)but do not explicitly model intimate borrowing

A number of studies have used mixed-membership models such as the Structure model(Pritchard et al 2000) in order to explicitly modeladmixture between languages (Reesink et al2009 Syrjanen et al 2016) Under this approachindividual languages receive their linguistic fea-tures from latent ancestral components with par-ticular feature distributions A key assumption ofthe Structure model is the relative invariance andstability of the features of interest (eg allele fre-quencies linguistic properties) However soundchange is a highly recurrent process with manytelescoped and intermediate changes and it is notpossible to treat sound changes that have operated

as stable highly conservative features1

Intermediate stages between OIA and NIAlanguages are key for capturing similarities incross-linguistic behavior and we require a modelthat teases apart dialect group-specific trends andlanguage-level ones Consider the following ex-amples

bull Assamese x the reflex of OIA s s s is thought to develop from intermediate s(Kakati 1941 224) This isogloss wouldunite it with languages like Bengali whichshow S for OIA s s s

bull Some instances of NIA bh likely come froman earlier mh (Tedesco 1965 371 Oberlies2005 48) (cf Oberlies 200548)

bull The Marathi change ch gt s affects certainwords containing MIA ch lt OIA ks as wellas OIA ch (Masica 1991 457) ch sim kh ltOIA ks variation is of importance to MIA andNIA dialectology (compare the Romani ex-amples given above)

In all examples a given NIA language shows theeffects of chronologically deep behavior whichserves as an isogloss uniting it with other NIAlanguages but this trend is masked by subse-quent language-specific changes2 Work on proba-bilistic reconstruction of proto-word forms explic-itly appeals to intermediate chronological stageswhere linguistic data are unobserved (Bouchard-Cote et al 2007) however unlike the work citedthis paper does not assume a fixed phylogeny andhence I cannot adopt many of the simplifying con-ventions that the authors use

3 Data

I extracted all modern Indo-Aryan forms fromTurnerrsquos (1962ndash1966) Comparative Dictionary ofthe Indo-Aryan Languages (henceforth CDIAL)3

1Cathcart (to appear) circumvents this issue in a mixed-membership model of Indo-Aryan dialectology by consider-ing only sound changes thought a priori in the literature to berelatively stable and of importance to dialectology

2Some similar-looking sound changes can be shown to bechronologically shallow For instance the presence of s fororiginal kh in Old Braj taken by most scholars to represent alegitimate sound change and not just an orthographic idiosyn-crasy affects Persian loans such as s aracu lsquoexpensersquolarrMod-ern Persian xirc (McGregor 1968 125) This orthographicbehavior is found in Old Gujarati as well (Baumann 19759) For further discussion of this issue see Strnad 2013 16ff

3Available online at httpdsaluchicagoedudictionariessoas

112

along with the Old Indo-Aryan headwords (hence-forth ETYMA) from which these reflexes descendTranscriptions of the data were normalized andconverted to the International Phonetic Alphabet(IPA) Systematic morphological mismatches be-tween OIA etyma and reflexes were accountedfor including stripping the endings from all verbssince citation forms for OIA verbs are in the 3sgpresent while most NIA reflexes give the infini-tive I matched each dialect with correspond-ing languoids in Glottolog (Hammarstrom et al2017) containing geographic metadata resultingin the merger of several dialects I excluded cog-nate sets with fewer than 10 forms yielding 33231modern Indo-Aryan forms I preprocessed thedata first converting each segment into its respec-tive sound class as described by List (2012) andsubsequently aligning each converted OIANIAstring pair via the Needleman-Wunsch algorithmusing the Expectation-Maximization method de-scribed by Jager (2014) building off of work byWieling et al (2012) This yields alignments ofthe following type eg OIA antra lsquoentrailsrsquo gtNepali anemptyro where empty indicates a gap wherethe ldquocursorrdquo advances for the OIA string but notthe Nepali string Gaps on the OIA side are ig-nored yielding a one-to-many OIA-to-NIA align-ment this ensures that all aligned cognate sets areof the same length

4 Model

The basic family of model this paper employs is aBayesian mixture model which assumes that eachword in each language is generated by one ofK la-tent dialect components Like Structure (and sim-ilar methodologies like Latent Dirichlet Alloca-tion) this model assumes that different elementsin the same language can be generated by differ-ent dialect components Unlike the most basictype of Structure model which assumes a two-level data structure consisting of (1) languages andthe (2) features they contain our model assumes athree-level hierarchy where (1) languages contain(2) words which display the operation of differ-ent (3) sound changes latent variable assignmenthappens at the word level

I contrast the behavior of a DEEP model withthat of a SHALLOW model The deep model drawsinspiration from Bayesian deep generative mod-els (Ranganath et al 2015) which incorporateintermediate latent variables which mimic the ar-

chitecture of a neural network This structure al-lows us to posit an intermediate representationbetween the sound patterns in the OIA etymonand the sound patterns in the NIA reflex allow-ing the model to pick up on shared dialectal sim-ilarities between forms in languages as opposedto language-specific idiosyncrasies The shal-low model which serves as a baseline of sortsconflates dialect group-level and language-leveltrends it contains a flat representation of all of thesound changes taking place between a NIA wordand its ancestral OIA etymon and in this sense ishalfway between a Structure model and a NaıveBayes classifier (with a language-specific ratherthan global prior over component membership)

41 Shallow model

Here I describe the generative process for theshallow model assuming W OIA etyma L lan-guages K dialect components I unique OIA in-puts O unique NIA outputs and aligned OIA-NIA word pair lengths Tw w isin 1 WFor each OIA etymon an input xwt at time pointt isin 1 Tw consists of a trigram centered atthe timepoint in question (eg ntr in OIA antralsquoentrailsrsquo) and the NIA reflexrsquos output ywlt con-tains the segment(s) aligned with timepoint t (egNepali empty) xwt t = 0 is the left word boundarywhile xwt t = Tw + 1 is the right word bound-ary Accordingly sound change in the model canbe viewed as a rewrite rule of the type A gt B C

D The model has the following parameters

bull Language-level weights over dialect compo-nents Ulk l isin 1 L k isin 1 K

bull Dialect component-level weights over soundchanges Wkio k isin 1 K i isin1 I o isin 1 O

The generative process is as follows

For each OIA etymon xw isin 1 W

For each language l isin 1 L in whichthe etymon survives containing a reflexywl

Draw a dialect component assignmentzwl sim Categorical(f(Ulmiddot))

For each time point t isin 1 TwDraw a NIA sound ywlt sim

Categorical(f(Wzwlxwtmiddot))

113

All weights in U and W are drawn from a Normaldistribution with a mean of 0 and standard devi-ation of 10 f(middot) represents the softmax function(throughout this paper) which transforms theseweights to probability simplices The generativeprocess yields the following joint log likelihood ofthe OIA etyma x and NIA reflexes y (with the dis-crete latent variables z marginalized out

P (xy|UW ) =

Wprodw=1

Lprodl=1

Ksumk=1

[f(Ulk)

Twprodt=1

f(Wkxwltywlt)

](1)

As readers will note this model weights allsound changes equally and makes no attempt todistinguish between dialectologically meaningfulchanges and noisy idiosyncratic changes

42 Deep modelThe deep model like the shallow model is a mix-ture model and as such retains the language-levelweights over dialect component membership U However unlike the shallow model in which thelikelihood of an OIA etymon and NIA reflex un-der a component assignment z = k is depen-dent on a flat representation of edit probabilitiesbetween OIA trigrams and NIA unigrams associ-ated with dialect component k Here I attemptto add some depth to this representation of soundchange by positing a hidden layer of dimension Jbetween each xwt and ywlt The goal here is tomimic a ldquonoisyrdquo reconstruction of an intermediatestage between OIA and NIA represented by dialectgroup k This reconstruction is not an explicitlinguistically meaningful string (as in Bouchard-Cote et al 2007 2008 2013) furthermore it isre-generated for each individual reflex of each et-ymon and not shared across data points (such amodel would introduce deeply nested dependen-cies between variables and enumerating all possi-ble reconstructions would be computationally in-feasible)

For parsimonyrsquos sake I employ a simple Recur-rent Neural Network (RNN) architecture to cap-ture rightward dependencies (Elman 1990) Fig-ure 1 gives a visual representation of the net-work unfolded in time This model exchangesW the dialect component-level weights over soundchanges for the following parameters

bull Dialect component-level weights governinghidden layer unit activations by OIA sounds

W xkij k isin 1 K i isin 1 I j isin1 J

bull Dialect component-level weights governinghidden layer unit activations by previous hid-den layers W h

kij k isin 1 K i isin1 J j isin 1 J

bull Language-level weights governing NIA out-put activations by hidden layer unitsW y

ljo l isin 1 L j isin 1 J o isin1 O

For a given mixture component z = k the activa-tion of the hidden layer at time t ht depends ontwo sets of parameters each associated with com-ponent k the weightsW x

kxxt middot associated with the

OIA input at time t and W hk the weights asso-

ciated with the previous hidden layer htminus1rsquos acti-vations for all t gt 1 Given a hidden layer htthe weights W l can be used to generate a proba-bility distribution over possible outcomes in NIAlanguage l The forward pass of this networkcan be viewed as a generative process denotedywt sim RNN(xwlW

xk W

hk W

l) under the pa-rameters for component k and language l undersuch a process the likelihood of ywl can be com-puted as follows

PRNN(ywl|xwWxk W

hk W

l) =

Twprodt=1

f(hgtt W

l)ywlt (2)

where

ht =

f(W x

kxwtmiddot) if t = 1

f(hgttminus1Wh oplusW x

kxwtmiddot) if t gt 1(3)

The generative process for this model is nearlyidentical to the process described in the previ-ous sections however after the dialect compo-nent assignment (zwl sim Categorical(f(Ulmiddot)))is drawn the NIA string ywl is sampled fromRNN(xwW

xzwl

W hzwl

W l) The joint log likeli-hood of the OIA etyma x and NIA reflexes y (withthe discrete latent variables z marginalized out isthe following

P (xy|UW xW hW y) =Wprodw=1

Lprodl=1

Ksumk=1

[f(Ulk)PRNN(ywl|xwW

xk W

hk W

l)] (4)

114

The same N(0, 10) prior as above is placed over U, W^x, W^h, W^y; J, the dimension of the hidden layer, is fixed at 100. This model bears some similarities to the mixture of RNNs described by Kim et al. (2018).

I have employed a simple RNN (rather than a more state-of-the-art architecture) for several reasons. The first is that I am interested in the consequences of expanding a flat mixture model to contain a simple, slightly deeper architecture. Additionally, I believe that the fact that the hidden layer of an RNN can be activated by a softmax function is more desirable from the perspective of representing sound change as a categorical or multinomial distribution, as all layer unit activations sum to one, as opposed to the situation with Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), which traditionally use sigmoid or hyperbolic tangent functions to activate the hidden layer. Furthermore, long-distance dependencies are not particularly widespread in Indo-Aryan sound change, lessening the need for more complex architectures. At the same time, the RNN is a crude approximation to the reality of language change. RNNs and related models draw a single arc between a hidden layer at time t and the corresponding output. It is perhaps not appropriate to envision this single dependency unless the dimensionality of the hidden layer is large enough to absorb potential contextual information that is crucial to sound change. To put it simply, emission probabilities in sound change are sharper than the transitions common in most NLP applications (e.g., sentence prediction), and it may not be correct to envision y_t given h_{t'<t}, h_t as a function of an additive combination of weights, though in practice I find it too computationally costly to enumerate all possible value combinations of the hidden layer at multiple consecutive time points. This issue requires further exploration, and I employ what seems to be the most computationally tractable approach for the moment.

5 Results

I learn each model's MAP configuration using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1.4. I run the optimizer for 10,000 iterations over three random initializations, fitting the model on mini-batches of 100 data points, and monitor convergence by observing the trace of the log posterior (Figure 2).

4 Code for all experiments can be found at https://github.com/chundrac/IA_dial/VarDial2019.
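As a rough illustration of what such a MAP optimization loop can look like (a hedged sketch, not the released code; log_mixture_likelihood and all other names are placeholders), consider:

import torch

def map_training_step(batch, params, log_mixture_likelihood, optimizer, prior_sd=10.0):
    """One Adam step on the negative log posterior: the mixture log likelihood of a
    mini-batch (components marginalized out inside log_mixture_likelihood) plus the
    N(0, 10) log prior over all weight matrices (up to an additive constant)."""
    optimizer.zero_grad()
    log_lik = log_mixture_likelihood(batch, params)      # sums over the mini-batch
    log_prior = sum(-(p ** 2).sum() / (2 * prior_sd ** 2) for p in params)
    loss = -(log_lik + log_prior)
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(params, lr=...), looped over 10,000 iterations
# with mini-batches of 100 data points, tracking the log posterior for convergence.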

[Figure 1: RNN representation unfolded in time (nodes x_{t−1}, x_t, x_{t+1}; h_{t−1}, h_t, h_{t+1}; y_{t−1}, y_t, y_{t+1}). Hidden layers depend on OIA inputs x_1, ..., x_{T_w} and on previous hidden layers (for t > 1); NIA outputs y_1, ..., y_{T_w} depend on hidden layers. Hidden layer activations are dependent on dialect component-specific parameters, while activations of the output layer are dependent on individual NIA language-specific parameters.]


The flat model fails to pick up on any major differences between languages, finding virtually identical posterior values of f(U_l), the language-level distribution over dialect component membership, for all l ∈ {1, ..., L}. According to the MAP configuration, each language draws forms from the same dialect group with > 0.99 probability, essentially undergoing a sort of “component collapse” that latent variable models sometimes encounter (Bowman et al., 2015; Dinh and Dumoulin, 2016). It is likely that bundling together sound change features leads to component-level distributions over sound changes with high entropy that are virtually indistinguishable from one another.5 While this particular result is disappointing in the lack of information it provides, I observe some properties of our models' posterior values in order to diagnose problems that can be addressed in future work (discussed below).

The deep model, on the other hand, infers highly divergent language-level posterior distributions over cluster membership. Since these distributions are not identical across initializations due to the label-switching problem, I compute the Jensen-Shannon divergence between the language-level posterior distributions over cluster membership for each pair of languages in our sample, for each initialization; I then average these divergences across initializations.

5 I made several attempts to run this model with different specifications, including different prior distributions, but achieved the same result.


[Figure 2: Log posteriors for the shallow model (left) and the deep model (right) over 10,000 iterations, for three random initializations.]

These averaged divergences are then scaled to three dimensions using multidimensional scaling. Figure 3 gives a visualization of these transformed values via the red-green-blue color vector, plotted on a map; languages with similar component distributions display similar colors. With a few exceptions (which may be artifacts of the fact that certain languages have only a small number of data points associated with them), a noticeable divide can be seen between languages of the main Indo-Aryan speech region, on the one hand, and languages of northwestern South Asia (dark blue), the Dardic languages of Northern Pakistan, and the Pahari languages of the Indian Himalayas, on the other, though this division is not clear-cut. Romani and other Indo-Aryan varieties spoken outside of South Asia show affiliation with multiple groups. While Romani dialects are thought to have a close genetic affinity with Hindi and other Central Indic languages, Romani was likely in contact with languages of northwestern South Asia during the course of its speakers' journey out of South Asia (Hamp, 1987; Matras, 2002). However, this impressionistic evaluation is by no means a confirmation that the deep model has picked up on linguistically meaningful differences between speech varieties. In the following sections, some comparison and evaluation metrics and checks are deployed in order to assess the quality of these models' behavior.
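A sketch of this post-processing pipeline is given below, assuming the language-level component distributions have been extracted for each initialization; scipy and scikit-learn are used here as one possible implementation, and this is not the paper's released code.

import numpy as np
from scipy.spatial.distance import jensenshannon   # returns the JS *distance* (sqrt of the divergence)
from sklearn.manifold import MDS

def rgb_from_component_distributions(posteriors):
    """posteriors: array of shape (n_inits, n_langs, K), holding f(U_l.) for each
    language under each random initialization. Returns one RGB triple per language."""
    n_inits, n_langs, _ = posteriors.shape
    D = np.zeros((n_langs, n_langs))
    for p in posteriors:
        for a in range(n_langs):
            for b in range(n_langs):
                D[a, b] += jensenshannon(p[a], p[b]) ** 2   # squared distance = divergence
    D /= n_inits                                            # average across initializations
    coords = MDS(n_components=3, dissimilarity="precomputed").fit_transform(D)
    coords -= coords.min(axis=0)                            # rescale each dimension to [0, 1]
    return coords / coords.max(axis=0)                      # -> red, green, blue channels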

5.1 Entropy of distributions

I measure the average entropy of the models' posterior distributions in order to gauge the extent to which the models are able to learn sparse, informative distributions over sound changes, hidden state activations, or other parameters concerning transitions through the model architecture. Normalized entropy is used in order to make entropies of distributions of different dimension comparable; a distribution's entropy can be normalized by dividing it by its maximum possible entropy.
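As a small illustration, the normalized entropy of a categorical distribution can be computed as follows (a sketch; the epsilon guard is an implementation detail assumed here):

import numpy as np

def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution p, divided by its maximum
    possible entropy log(len(p)), so that distributions of different dimension
    are comparable (values lie in [0, 1])."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum() / np.log(len(p)))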

As mentioned above, our data set consists of OIA trigrams and the NIA segment corresponding to the second segment in the trigram, representing rewrite rules operating between OIA and the NIA languages in our sample. It is often the case that more than one NIA reflex is attested for a given OIA trigram. As such, the sound changes that have operated in an NIA language can be represented as a collection of categorical distributions, each summing to one. I calculate the average of the normalized entropies of these sound change distributions as a baseline against which to compare entropy values for the models' parameters. The pooled average of the normalized entropies across all languages is 0.11, while the average of averages for each language is 0.63.

For the shallow model, the parameter of interest is f(W), the dialect component-level collection of distributions over sound changes; its mean normalized entropy, averaged across initializations but pooled across components within each initialization, is 0.91 (raw values range from 0.003 to 1). For the deep model, the average entropy of the dialect-level distributions over hidden-layer activations, f(W^x), is only slightly lower, at 0.86 (raw values range from close to 0 to 1).

For each k ∈ {1, ..., K}, I compute the forward pass of RNN(x_w, W^x_k, W^h_k, W^y_l) for each etymon w and each language l in which the etymon survives, using the inferred values for W^x_k, W^h_k, W^y_l, and compute the entropy of each f(h_t^⊤ W^y_l), yielding an average of 0.74 (raw values range from close to 0 to 1). While these values are still very high, it is clear that the inclusion of a hidden layer has led to sparser, potentially more meaningful distributions than the flat approach, and increasing the dimensionality of the hidden layer will likely bring about even sparser, more meaningful distributions. The entropies cited here are considerably higher than the average entropy of languages' sound change distributions, but the latter distributions do little to tell us about the internal clustering of the languages.

5.2 Comparison with other linguistic distance metrics

Here I compare the cluster membership inferred by this paper's models against other measures of linguistic distance. Each method yields a pairwise inter-language distance metric, which can be compared against a non-linguistic measure. I measure the correlation between each linguistic distance measure and two non-linguistic measures, great-circle geographic distance and patristic distance according to the Glottolog phylogeny, using Spearman's ρ.
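The comparison can be sketched as follows (illustrative, not the released code): the upper triangles of two pairwise inter-language distance matrices are rank-correlated with scipy.

import numpy as np
from scipy.stats import spearmanr

def distance_correlation(ling_dist, other_dist):
    """Spearman's rho between two pairwise inter-language distance matrices,
    computed over the upper triangle (each unordered language pair counted once)."""
    iu = np.triu_indices_from(ling_dist, k=1)
    rho, p = spearmanr(ling_dist[iu], other_dist[iu])
    return rho, p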


[Figure 3: Dialect group makeup of languages in the sample under the deep model. Map plotted by longitude and latitude, with each language labeled by its Glottolog code and colored according to its dialect-component distribution.]


5.2.1 Levenshtein distance

Borin et al. (2014) measure the normalized Levenshtein distance (i.e., the edit distance between two strings divided by the length of the longer string) between words for the same concept in pairs of Indo-Aryan languages, and find that average normalized Levenshtein distance correlates significantly with patristic distances in the Ethnologue tree. This paper's dataset is not organized by semantic meaning, so for comparability I measure the average normalized Levenshtein distance between cognates in pairs of Indo-Aryan languages, which picks up on phonological divergence between dialects, as opposed to both phonological and lexical divergence.
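A minimal sketch of the normalized Levenshtein distance (LDN) used here, operating on segment sequences:

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two segment sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def ldn(a, b):
    """Levenshtein distance normalized by the length of the longer string."""
    longer = max(len(a), len(b))
    return levenshtein(a, b) / longer if longer else 0.0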

5.2.2 Jensen-Shannon divergence

Each language in our dataset attests one or more outcomes for a given OIA trigram (due to language contact, analogy, etc.), yielding a collection of sound change distributions, as described above. For each pair of languages, I compute the Jensen-Shannon divergence between the sound change distributions for all OIA trigrams that are continued in both languages, and average these values. This gives a measure of pairwise average diachronic phonological divergence between languages.

5.2.3 LSTM Autoencoder

Rama and Çöltekin (2016) and Rama et al. (2017) develop an LSTM-based method for representing the phonological structure of individual word forms across closely related speech varieties. Each string is fed to a unidirectional or bidirectional LSTM autoencoder, which learns a continuous latent multidimensional representation of the sequence; this embedding is then used to reconstruct the input sequence. The latent values in the embedding provide information that can be used to compute dissimilarity (in the form of cosine or Euclidean distance) between strings or across speech varieties (by averaging the latent values for all strings in each dialect or language). I use the bidirectional LSTM autoencoder described in the work cited in order to learn an 8-dimensional latent representation for all NIA forms in the dataset, training the model over 20 epochs on batches of 32 data points, using the Adam optimizer to minimize the categorical cross-entropy between the input sequence and the NIA reconstruction predicted by the model. I use the learned model parameters to generate a latent representation for each form. The latent representations are averaged across forms within each language, and pairwise inter-language Euclidean distances are computed between the averaged representations.
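The following is a compressed PyTorch sketch of a bidirectional LSTM autoencoder of the kind described by Rama and Çöltekin (2016), not their implementation; the latent dimension follows the setting above (8), and all module names, hidden sizes, and the decoding scheme are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMAutoencoder(nn.Module):
    """Encode a padded batch of integer-coded segment sequences into an 8-dimensional
    latent vector, then decode that vector back into a distribution over segments at
    each position; trained with categorical cross-entropy against the input."""
    def __init__(self, n_segments, emb_dim=32, hidden_dim=64, latent_dim=8):
        super().__init__()
        self.emb = nn.Embedding(n_segments, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * hidden_dim, latent_dim)
        self.decoder = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_segments)

    def forward(self, seqs):                       # seqs: (batch, T) integer-coded segments
        enc, _ = self.encoder(self.emb(seqs))      # (batch, T, 2 * hidden_dim)
        z = self.to_latent(enc[:, -1])             # (batch, 8): latent representation
        dec_in = z.unsqueeze(1).expand(-1, seqs.size(1), -1)   # feed z at every time step
        dec, _ = self.decoder(dec_in)
        return self.out(dec), z                    # logits for cross-entropy, plus latent z

# training sketch: Adam, 20 epochs, batches of 32;
# loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), seqs)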


              Geographic    Genetic
Shallow JSD     −0.01        −0.03
Deep JSD         0.147*       0.008
LDN              0.346*       0.013
Raw JSD          0.302*      −0.051*
LSTM AE          0.158*      −0.068*
LSTM ED          0.084*       0.0001

Table 1: Spearman's ρ values for correlations between each linguistic distance metric (JSD = Jensen-Shannon Divergence, LDN = Levenshtein Distance Normalized, AE = Autoencoder, ED = Encoder-Decoder) and geographic and genetic distance. Asterisks represent significant correlations.


5.2.4 LSTM Encoder-Decoder

For the sake of completeness, I use an LSTM encoder-decoder to learn a continuous representation for every OIA-NIA string pair. This model is very similar to the LSTM autoencoder, except that it takes an OIA input and reconstructs an NIA output, instead of taking an NIA form as input and reconstructing the same string. I train the model as described above.

5.3 Correlations

Table 1 gives correlation coefficients (Spearman's ρ) between linguistic distance metrics and non-linguistic distance metrics. In general, correlations with Glottolog patristic distance are quite poor. This is surprising for Levenshtein Distance Normalized, given the high correlation with patristic distance reported by Borin et al. (2014). Given that the authors measured Levenshtein distance between identical concepts in pairs of languages, and not between cognates, as I do here, it is possible that lexical divergence carries a stronger genetic signal than phonological divergence, at least in the context of Indo-Aryan (it is worth noting that I did not balance the tree as described by the authors; it is not clear that this would have yielded any improvement). On the other hand, the Levenshtein distance measured in this paper correlates significantly with great-circle distance, indicating a strong geographic signal. Average Jensen-Shannon divergence between pairs of languages' sound change distributions shows a strong association with geographic distance as well.

Divergences/distances based on the deep model, the LSTM autoencoder, and the LSTM encoder-decoder show significant correlations with geospatial distance, albeit lower ones. It is not entirely clear what accounts for this disparity. Intuitively, we expect chronologically shallower features to correlate with geographic distance. It is possible that the LSTM and RNN architectures are picking up on chronologically deeper information and show a low geographic signal for this reason, though this highly provisional idea is not borne out by any genetic signal.

It is not clear how to assess the meaning of these correlations at this stage. Nevertheless, deep architectures provide an interesting direction for future research into sound change and language contact, as they have the potential to disaggregate a great deal of information regarding interacting forces in language change that is censored when raw distance measures are computed directly from the data.

6 Outlook

This paper explored the consequences of adding hidden layers to models of dialectology where the languages have experienced too much contact for phylogenetic models to be appropriate, but have diversified to the extent that traditional dialectometric approaches are not applicable. While the model requires some refinement, its results point in a promising direction. Modifying prior distributions could potentially produce more informative results, as could tweaking hyperparameters of the learning algorithms employed. Additionally, it is likely that the model will benefit from hidden layers of higher dimension J, as well as from bidirectional approaches, and despite the misgivings regarding LSTMs and GRUs stated above, future work will probably benefit from incorporating these and related architectures (e.g., attention). Additionally, the models used in this paper assumed discrete latent variables, attempting to be faithful to the traditional historical linguistic notion of intimate borrowing between discrete dialect groups. However, continuous-space models may provide a more flexible framework for addressing the questions asked in this paper (cf. Murawaki, 2015).

This paper provides a new way of looking at dialectology and linguistic affiliation; with refinement and expansion, it is hoped that this and related models can further our understanding of the history of the Indo-Aryan speech community and can generalize to new linguistic scenarios.


It is hoped that methodologies of this sort can join forces with similar tools designed to investigate the interaction of regularly conditioned sound change and chronologically deep language contact in individual languages' histories.

References

George Baumann. 1975. Drei Jaina-Gedichte in Alt-Gujarātī. Edition, Übersetzung, Grammatik und Glossar. Franz Steiner, Wiesbaden.

Leonard Bloomfield. 1933. Language. Holt, Rinehart and Winston, New York.

Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3137–3144.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110:4224–4229.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague. Association for Computational Linguistics.

Alexandre Bouchard-Côté, Percy S. Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems, pages 169–176.

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. Proceedings of the Twentieth Conference on Computational Natural Language Learning (CoNLL).

Chundra Cathcart. To appear. A probabilistic assessment of the Indo-Aryan Inner-Outer Hypothesis. Journal of Historical Linguistics.

Ashwini Deo. 2018. Dialects in the Indo-Aryan landscape. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 535–546. John Wiley & Sons, Oxford.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf.

Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Murray B. Emeneau. 1966. The dialects of Old Indo-Aryan. In Jaan Puhvel, editor, Ancient Indo-European Dialects, pages 123–138. University of California Press, Berkeley.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog 3.3. Max Planck Institute for the Science of Human History.

Eric P. Hamp. 1987. On the sibilants of Romani. Indo-Iranian Journal, 30(2):103–106.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill.

Banikanta Kakati. 1941. Assamese, its Formation and Development. Government of Assam, Gauhati.

Yoon Kim, Sam Wiseman, and Alexander M. Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Johann-Mattis List. 2012. SCA: Phonetic alignment based on sound classes. In M. Slavkovik and D. Lassiter, editors, New Directions in Logic, Language and Computation, pages 32–51. Springer, Berlin and Heidelberg.

Colin P. Masica. 1991. The Indo-Aryan Languages. Cambridge University Press, Cambridge.

Yaron Matras. 2002. Romani – A Linguistic Introduction. Cambridge University Press, Cambridge.

R. S. McGregor. 1968. The Language of Indrajit of Orchā. Cambridge University Press, Cambridge.

Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

John Nerbonne and Wilbert Heeringa. 2001. Computational comparison and classification of dialects. Dialectologia et Geolinguistica, 9:69–83.


Thomas Oberlies. 2005. A Historical Grammar of Hindi. Leykam, Graz.

Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Jelena Prokić and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1-2):153–172.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.

Taraka Rama, Çağrı Çöltekin, and Pavel Sofroniev. 2017. Computational analysis of Gondi dialects. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. 2015. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771.

Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7:e1000241.

Caley Smith. 2017. The dialectology of Indic. In Jared Klein, Brian Joseph, and Matthias Fritz, editors, Handbook of Comparative and Historical Indo-European Linguistics, pages 417–447. De Gruyter, Berlin and Boston.

Jaroslav Strnad. 2013. Morphology and Syntax of Old Hindi. Brill, Leiden.

Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.

Paul Tedesco. 1965. Turner's Comparative Dictionary of the Indo-Aryan Languages. Journal of the American Oriental Society, 85:368–383.

Matthew Toulmin. 2009. From Linguistic to Sociolinguistic Reconstruction: The Kamta Historical Subgroup of Indo-Aryan. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University, Canberra.

Ralph L. Turner. 1962–1966. A Comparative Dictionary of Indo-Aryan Languages. Oxford University Press, London.

Ralph L. Turner. 1975 [1967]. Geminates after long vowel in Indo-Aryan. In R. L. Turner: Collected Papers 1912–1973, pages 405–415. Oxford University Press, London.

Martijn Wieling, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314.

Page 3: Toward a deep dialectological representation of Indo-Aryanweb.science.mq.edu.au/~smalmasi/vardial6/pdf/W19-1411.pdfAryan subgroup of Indo-European. I draw upon admixture models and

112

along with the Old Indo-Aryan headwords (hence-forth ETYMA) from which these reflexes descendTranscriptions of the data were normalized andconverted to the International Phonetic Alphabet(IPA) Systematic morphological mismatches be-tween OIA etyma and reflexes were accountedfor including stripping the endings from all verbssince citation forms for OIA verbs are in the 3sgpresent while most NIA reflexes give the infini-tive I matched each dialect with correspond-ing languoids in Glottolog (Hammarstrom et al2017) containing geographic metadata resultingin the merger of several dialects I excluded cog-nate sets with fewer than 10 forms yielding 33231modern Indo-Aryan forms I preprocessed thedata first converting each segment into its respec-tive sound class as described by List (2012) andsubsequently aligning each converted OIANIAstring pair via the Needleman-Wunsch algorithmusing the Expectation-Maximization method de-scribed by Jager (2014) building off of work byWieling et al (2012) This yields alignments ofthe following type eg OIA antra lsquoentrailsrsquo gtNepali anemptyro where empty indicates a gap wherethe ldquocursorrdquo advances for the OIA string but notthe Nepali string Gaps on the OIA side are ig-nored yielding a one-to-many OIA-to-NIA align-ment this ensures that all aligned cognate sets areof the same length

4 Model

The basic family of model this paper employs is aBayesian mixture model which assumes that eachword in each language is generated by one ofK la-tent dialect components Like Structure (and sim-ilar methodologies like Latent Dirichlet Alloca-tion) this model assumes that different elementsin the same language can be generated by differ-ent dialect components Unlike the most basictype of Structure model which assumes a two-level data structure consisting of (1) languages andthe (2) features they contain our model assumes athree-level hierarchy where (1) languages contain(2) words which display the operation of differ-ent (3) sound changes latent variable assignmenthappens at the word level

I contrast the behavior of a DEEP model withthat of a SHALLOW model The deep model drawsinspiration from Bayesian deep generative mod-els (Ranganath et al 2015) which incorporateintermediate latent variables which mimic the ar-

chitecture of a neural network This structure al-lows us to posit an intermediate representationbetween the sound patterns in the OIA etymonand the sound patterns in the NIA reflex allow-ing the model to pick up on shared dialectal sim-ilarities between forms in languages as opposedto language-specific idiosyncrasies The shal-low model which serves as a baseline of sortsconflates dialect group-level and language-leveltrends it contains a flat representation of all of thesound changes taking place between a NIA wordand its ancestral OIA etymon and in this sense ishalfway between a Structure model and a NaıveBayes classifier (with a language-specific ratherthan global prior over component membership)

41 Shallow model

Here I describe the generative process for theshallow model assuming W OIA etyma L lan-guages K dialect components I unique OIA in-puts O unique NIA outputs and aligned OIA-NIA word pair lengths Tw w isin 1 WFor each OIA etymon an input xwt at time pointt isin 1 Tw consists of a trigram centered atthe timepoint in question (eg ntr in OIA antralsquoentrailsrsquo) and the NIA reflexrsquos output ywlt con-tains the segment(s) aligned with timepoint t (egNepali empty) xwt t = 0 is the left word boundarywhile xwt t = Tw + 1 is the right word bound-ary Accordingly sound change in the model canbe viewed as a rewrite rule of the type A gt B C

D The model has the following parameters

bull Language-level weights over dialect compo-nents Ulk l isin 1 L k isin 1 K

bull Dialect component-level weights over soundchanges Wkio k isin 1 K i isin1 I o isin 1 O

The generative process is as follows

For each OIA etymon xw isin 1 W

For each language l isin 1 L in whichthe etymon survives containing a reflexywl

Draw a dialect component assignmentzwl sim Categorical(f(Ulmiddot))

For each time point t isin 1 TwDraw a NIA sound ywlt sim

Categorical(f(Wzwlxwtmiddot))

113

All weights in U and W are drawn from a Normaldistribution with a mean of 0 and standard devi-ation of 10 f(middot) represents the softmax function(throughout this paper) which transforms theseweights to probability simplices The generativeprocess yields the following joint log likelihood ofthe OIA etyma x and NIA reflexes y (with the dis-crete latent variables z marginalized out

P (xy|UW ) =

Wprodw=1

Lprodl=1

Ksumk=1

[f(Ulk)

Twprodt=1

f(Wkxwltywlt)

](1)

As readers will note this model weights allsound changes equally and makes no attempt todistinguish between dialectologically meaningfulchanges and noisy idiosyncratic changes

42 Deep modelThe deep model like the shallow model is a mix-ture model and as such retains the language-levelweights over dialect component membership U However unlike the shallow model in which thelikelihood of an OIA etymon and NIA reflex un-der a component assignment z = k is depen-dent on a flat representation of edit probabilitiesbetween OIA trigrams and NIA unigrams associ-ated with dialect component k Here I attemptto add some depth to this representation of soundchange by positing a hidden layer of dimension Jbetween each xwt and ywlt The goal here is tomimic a ldquonoisyrdquo reconstruction of an intermediatestage between OIA and NIA represented by dialectgroup k This reconstruction is not an explicitlinguistically meaningful string (as in Bouchard-Cote et al 2007 2008 2013) furthermore it isre-generated for each individual reflex of each et-ymon and not shared across data points (such amodel would introduce deeply nested dependen-cies between variables and enumerating all possi-ble reconstructions would be computationally in-feasible)

For parsimonyrsquos sake I employ a simple Recur-rent Neural Network (RNN) architecture to cap-ture rightward dependencies (Elman 1990) Fig-ure 1 gives a visual representation of the net-work unfolded in time This model exchangesW the dialect component-level weights over soundchanges for the following parameters

bull Dialect component-level weights governinghidden layer unit activations by OIA sounds

W xkij k isin 1 K i isin 1 I j isin1 J

bull Dialect component-level weights governinghidden layer unit activations by previous hid-den layers W h

kij k isin 1 K i isin1 J j isin 1 J

bull Language-level weights governing NIA out-put activations by hidden layer unitsW y

ljo l isin 1 L j isin 1 J o isin1 O

For a given mixture component z = k the activa-tion of the hidden layer at time t ht depends ontwo sets of parameters each associated with com-ponent k the weightsW x

kxxt middot associated with the

OIA input at time t and W hk the weights asso-

ciated with the previous hidden layer htminus1rsquos acti-vations for all t gt 1 Given a hidden layer htthe weights W l can be used to generate a proba-bility distribution over possible outcomes in NIAlanguage l The forward pass of this networkcan be viewed as a generative process denotedywt sim RNN(xwlW

xk W

hk W

l) under the pa-rameters for component k and language l undersuch a process the likelihood of ywl can be com-puted as follows

PRNN(ywl|xwWxk W

hk W

l) =

Twprodt=1

f(hgtt W

l)ywlt (2)

where

ht =

f(W x

kxwtmiddot) if t = 1

f(hgttminus1Wh oplusW x

kxwtmiddot) if t gt 1(3)

The generative process for this model is nearlyidentical to the process described in the previ-ous sections however after the dialect compo-nent assignment (zwl sim Categorical(f(Ulmiddot)))is drawn the NIA string ywl is sampled fromRNN(xwW

xzwl

W hzwl

W l) The joint log likeli-hood of the OIA etyma x and NIA reflexes y (withthe discrete latent variables z marginalized out isthe following

P (xy|UW xW hW y) =Wprodw=1

Lprodl=1

Ksumk=1

[f(Ulk)PRNN(ywl|xwW

xk W

hk W

l)] (4)

114

The same N (0 10) prior as above is placed overUW xW hW y J the dimension of the hiddenlayer is fixed at 100 This model bears some sim-ilarities to the mixture of RNNs described by Kimet al (2018)

I have employed a simple RNN (rather than amore state-of-the art architecture) for several rea-sons The first is that I am interested in the conse-quences of expanding a flat mixture model to con-tain a simple slightly deeper architecture Addi-tionally I believe that the fact that the hidden layerof an RNN can be activated by a softmax functionis more desirable from the perspective of repre-senting sound change as a categorical or multi-nomial distribution as all layer unit activationssum to one as opposed to the situation with LongShort-Term Memory (LSTM) and Gated Recur-rent Units (GRU) which traditionally use sigmoidor hyperbolic tangent functions to activate the hid-den layer Furthermore long-distance dependen-cies are not particularly widespread in Indo-Aryansound change lessening the need for more com-plex architectures At the same time the RNNis a crude approximation to the reality of lan-guage change RNNs and related models draw asingle arc between a hidden layer at time t andthe corresponding output It is perhaps not ap-propriate to envision this single dependency un-less the dimensionality of the hidden layer is largeenough to absorb potential contextual informationthat is crucial to sound change To put it sim-ply emission probabilities in sound change aresharper than transitions common in most NLP ap-plications (eg sentence prediction) and it maynot be correct to envision yt given htprimeltt ht as afunction of an additive combination of weightsthough in practice I find it too computationallycostly to enumerate all possible value combina-tions the hidden layer at multiple consecutive timepoints This issue requires further exploration andI employ what seems to be the most computation-ally tractable approach for the moment

5 Results

I learn each modelrsquos MAP configuration using theAdam optimizer (Kingma and Ba 2015) with alearning rate of 14 I run the optimizer for 10000iterations over three random initializations fittingthe model on mini-batches of 100 data points and

4Code for all experiments can be found at httpsgithubcomchundracIA_dialVarDial2019

xtminus1 xt xt+1

htminus1 ht ht+1

ytminus1 yt yt+1

Figure 1 RNN representation unfolded in time hid-den layers depend on OIA inputs x1 xTw and previ-ous hidden layers (for t gt 1) NIA outputs y1 yTw

depend on hidden layers Hidden layer activations aredependent on dialect component-specific parameterswhile activations of the output layer are dependent onindividual NIA language-specific parameters

monitor convergence by observing the trace of thelog posterior (Figure 2)

The flat model fails to pick up on any majordifferences between languages finding virtuallyidentical posterior values of f(Ul) the language-level distribution over dialect component member-ship for all l isin 1 L According to theMAP configuration each language draws formsfrom the same dialect group with gt 99 proba-bility essentially undergoing a sort of ldquocompo-nent collapserdquo that latent variable models some-times encounter (Bowman et al 2015 Dinh andDumoulin 2016) It is likely that bundling to-gether sound change features leads to component-level distributions over sound changes with highentropy that are virtually indistinguishable fromone another5 While this particular result is dis-appointing in the lack of information it provides Iobserve some properties of our modelsrsquo posteriorvalues in order to diagnose problems that can beaddressed in future work (discussed below)

The deep model on the other hand infershighly divergent language-level posterior distri-butions over cluster membership Since thesedistributions are not identical across initializa-tions due to the label-switching problem I com-pute the Jensen-Shannon divergence between thelanguage-level posterior distributions over clustermembership for each pair of languages in our sam-ple for each initialization I then average these di-vergences across initializations These averaged

5I made several attempts to run this model with differ-ent specifications including different prior distributions butachieved the same result

115

0 2000 4000 6000 8000 10000

1048

1046

1044

1042

1040

1038

10361e8

0 2000 4000 6000 8000 10000

426

424

422

420

418

416

414

412

1e7

Figure 2 Log posteriors for shallow model (left) anddeep model (right) for 10000 iterations over three ran-dom initializations

divergences are then scaled to three dimensionsusing multidimensional scaling Figure 3 gives avisualization of these transformed values via thered-green-blue color vector plotted on a map lan-guages with similar component distributions dis-play similar colors With a few exceptions (thatmay be artifacts of the fact that certain languageshave only a small number of data points associ-ated with them) a noticieable divide can be seenbetween languages of the main Indo-Aryan speechregion on one hand and languages of northwest-ern South Asia (dark blue) the Dardic languagesof Northern Pakistan and the Pahari languagesof the Indian Himalayas though this division isnot clear cut Romani and other Indo-Aryan va-rieties spoken outside of South Asia show affil-iation with multiple groups While Romani di-alects are thought to have a close genetic affin-ity with Hindi and other Central Indic languagesit was likely in contact with languages of north-west South Asian during the course of its speak-ersrsquo journey out of South Asia (Hamp 1987 Ma-tras 2002) However this impressionistic evalua-tion is by no means a confirmation that the deepmodel has picked up on linguistically meaningfuldifferences between speech varieties In the fol-lowing sections some comparison and evaluationmetrics and checks are deployed in order to assessthe quality of these modelsrsquo behavior

51 Entropy of distributions

I measure the average entropy of the modelrsquos pos-terior distributions in order to gauge the extent towhich the models are able to learn sparse informa-tive distributions over sound changes hidden stateactivations or other parameters concerning transi-tions through the model architecture Normalizedentropy is used in order to make entropies of distri-butions of different dimension comparable a dis-tributionrsquos entropy can be normalized by dividingby its maximum possible entropy

As mentioned above our data set consists ofOIA trigrams and the NIA segment correspondingto the second segment in the trigram representingrewrite rules operating between OIA and the NIAlanguages in our sample It is often the case thatmore than one NIA reflex is attested for a givenOIA trigram As such the sound changes that haveoperated in an NIA language can be representedas a collection of categorical distributions eachsumming to one I calculate the average of thenormalized entropies of these sound change dis-tributions as a baseline against which to compareentropy values for the modelsrsquo parameters Thepooled average of the normalized entropies acrossall languages is 11 while the average of averagesfor each language is 063

For the shallow model the parameter of interestis f(V ) the dialect component-level collection ofdistributions over sound changes the mean nor-malized entropy of which averaged across initial-izations but pooled across components within eachinitialization is 091 (raw values range from 0003to 1) For the deep model the average entropyof the dialect-level distributions over hidden-layeractivations f(W x) is only slightly lower at 086(raw values range from close to 0 to 1)

For each k isin 1 K I compute the for-ward pass of RNN(xwlW

xk W

hk W

l) for eachetymon w and each language l in which theetymon survives using the inferred values forW x

k Whk W

l and compute the entropy of eachf(hgtt W

l) yielding an average of 74 (raw val-ues range from close to 0 to 1) While these val-ues are still very high it is clear that the inclu-sion of a hidden layer has learned sparser poten-tially more meaningful distributions than the flatapproach and that increasing the dimensionalityof the hidden layer will likely bring about evensparser more meaningful distributions The en-tropies cited here are considerably higher than theaverage entropy of languagesrsquo sound change dis-tributions but the latter distributions do little to tellus about the internal clustering of the languages

52 Comparison with other linguisticdistance metrics

Here I compare the cluster membership inferredby this paperrsquos models against other measures oflinguistic distance Each method yields a pairwiseinter-language distance metric which can be com-pared against a non-linguistic measure I measure

116

assa1263awad1243

bagh1251

balk1252

beng1280

bhad1241 bhat1263

bhoj1244braj1242

brok1247

carp1235

cham1307

chil1275

chur1258

dhiv1236

dogr1250

doma1258

doma1260

garh1243

gawa1247

gran1245

guja1252

halb1244

hind1269

indu1241

jaun1243

kach1277

kala1372 kala1373

kalo1256

kang1280

kash1277

khet1238

khow1242 kohi1248

konk1267

kull1236

kuma1273

loma1235

maga1260

maha1287

mait1250

mara1378

marw1260

nepa1254

nort2665

nort2666

oriy1255

paha1251pang1282

panj1256

phal1254

savi1242

sera1259

shin1264

shum1235

sind1272

sinh1246

sint1235

sirm1239

sout2671

sout2672

tira1253

torw1241

vlax1238

wels1246

west2386

wota1240

0

20

40

60

0 25 50 75long

lat

Figure 3 Dialect group makeup of languages in sample under deep model

the correlation between each linguistic distancemeasure as well as great circle geographic distanceand patristic distance according to the Glottologphylogeny using Spearmanrsquos ρ

521 Levenshtein distanceBorin et al (2014) measure the normalized Lev-enshtein distances (ie the edit distance betweentwo strings divided by the length of the longerstring) between words for the same concept inpairs of Indo-Aryan languages and find that av-erage normalized Levenshtein distance correlatessignificantly with patristic distances in the Ethno-logue tree This paperrsquos dataset is not organized bysemantic meaning so for comparability I measurethe average normalized Levenshtein distance be-tween cognates in pairs of Indo-Aryan languageswhich picks up on phonological divergence be-tween dialects as opposed to both phonologicaland lexical divergence

522 Jensen-Shannon divergenceEach language in our dataset attests one or more(due to language contact analogy etc) outcomesfor a given OIA trigram yielding a collection ofsound change distributions as described aboveFor each pair of languages I compute the Jensen-Shannon divergence between sound change distri-butions for all OIA trigrams that are continued inboth languages and average these values This

gives a measure of pairwise average diachronicphonological divergence between languages

523 LSTM AutoencoderRama and Coltekin (2016) and Rama et al (2017)develop an LSTM-based method for represent-ing the phonological structure of individual wordforms across closely related speech varieties Eachstring is fed to a unidirectional or bidirectionalLSTM autoencoder which learns a continuouslatent multidimensional representation of the se-quence This embedding is then used to recon-struct the input sequence The latent values in theembedding provide information that can be usedto compute dissimilarity (in the form of cosineor Euclidean distance) between strings or acrossspeech varieties (by averaging the latent values forall strings in each dialect or language) I use thebidirectional LSTM Autoencoder described in thework cited in order to learn an 8-dimensional la-tent representation for all NIA forms in the datasettraining the model over 20 epochs on batches of 32data points using the Adam optimizer to minimizethe categorical cross-entropy between the input se-quence and the NIA reconstruction predicted bythe model I use the learned model parameters togenerate a latent representation for each form Thelatent representations are averaged across formswithin each language and pairwise linguistic Eu-clidean distances are computed between each av-

117

Geographic GeneticShallow JSD minus001 minus003Deep JSD 0147lowast 0008

LDN 0346lowast 0013Raw JSD 0302lowast minus0051lowastLSTM AE 0158lowast minus0068lowastLSTM ED 0084lowast 00001

Table 1 Spearmanrsquos ρ values for correlations betweeneach linguistic distance metric (JSD = Jensen-ShannonDivergence LDN = Levenshtein Distance NormalizedAE = Autoencoder ED = Encoder-Decoder) and geo-graphic and genetic distance Asterisks represent sig-nificant correlations

eraged representation

524 LSTM Encoder-DecoderFor the sake of completeness I use an LSTMencoder-decoder to learn a continuous representa-tion for every OIA-NIA string pair This modelis very similar to the LSTM autoencoder exceptthat it takes an OIA input and reconstructs an NIAoutput instead of taking an NIA form as input andreconstructing the same string I train the modelas described above

53 Correlations

Table 1 gives correlation coefficients (Spearmanrsquosρ) between linguistic distance metrics and non-linguistic distance metrics In general correlationswith Glottolog patristic distance are quite poorThis is surprising for Levenshtein Distance Nor-malized given the high correlation with patristicdistance reported by Borin et al (2014) Giventhat the authors measured Levenshtein distancebetween identical concepts in pairs of languagesand not cognates as I do here it is possible thatlexical divergence carries a stronger genetic sig-nal than phonological divergence at least in thecontext of Indo-Aryan (it is worth noting that Idid not balance the tree as described by the au-thors it is not clear that this would have yieldedany improvement) On the other hand the Lev-enshtein distance measured in this paper corre-lates significantly with great circle distance indi-cating a strong geographic signal Average Jensen-Shannon divergence between pairs of languagesrsquosound change distributions shows a strong associ-ation with geographic distance as well

Divergencedistances based on the deepmodel the LSTM Autoencoder and the LSTM

Encoder-Decoder show significant correlationswith geospatial distance albeit lower ones It isnot entirely clear what accounts for this disparityIntuitively we expect more shallow chronologicalfeatures to correlate with geographic distance Itis possible that the LSTM and RNN architecturesare picking up on chronologically deeper infor-mation and show a low geographic signal for thisreason though this highly provisional idea is notborne out by any genetic signal

It is not clear how to assess the meaning ofthese correlations at this stage Nevertheless deeparchitectures provide an interesting direction forfuture research into sound change and languagecontact as they have the potential to disaggregatea great deal of information regarding interactingforces in language change that is censored whenraw distance measures are computed directly fromthe data

6 Outlook

This paper explored the consequences of addinghidden layers to models of dialectology where thelanguages have experienced too much contact forphylogenetic models to be appropriate but havediversified to the extent that traditional dialecto-metric approaches are not applicable While themodel requires some refinement its results pointin a promising direction Modifying prior distribu-tions could potentially produce more informativeresults as could tweaking hyperparameters of thelearning algorithms employed Additionally it islikely that the model will benefit from hidden lay-ers of higher dimension J as well as bidirectionalapproaches and despite the misgivings regard-ing LSTM and GRUs stated above future workwill probably benefit from incorporating these andrelated architectures (eg attention) Addition-ally the models used in this paper assumed dis-crete latent variables attempting to be faithful tothe traditional historical linguistic notion of inti-mate borrowing between discrete dialect groupsHowever continuous-space models may provide amore flexible framework for addressing the ques-tions asked in this paper (cf Murawaki 2015)

This paper provides a new way of looking atdialectology and linguistic affiliation with refine-ment and expansion it is hoped that this and re-lated models can further our understanding of thehistory of the Indo-Aryan speech community andcan generalize to new linguistic scenarios It is

118

hoped that methodologies of this sort can joinforces with similar tools designed to investigateinteraction of regularly conditioned sound changeand chronologically deep language contact in in-dividual languagesrsquo histories

ReferencesGeorge Baumann 1975 Drei Jaina-Gedichte in Alt-

Gujaratı Edition Ubersetzung Grammatik undGlossar Franz Steiner Wiesbaden

Leonard Bloomfield 1933 Language Holt Rinehartand Winston New York

Lars Borin Anju Saxena Taraka Rama and BernardComrie 2014 Linguistic landscaping of southasia using digital language resources Genetic vsareal linguistics In Ninth International Conferenceon Language Resources and Evaluation (LRECrsquo14)pages 3137ndash3144

Alexandre Bouchard-Cote David Hall Thomas LGriffiths and Dan Klein 2013 Automated recon-struction of ancient languages using probabilisticmodels of sound change Proceedings of the Na-tional Academy of Sciences 1104224ndash4229

Alexandre Bouchard-Cote Percy Liang Thomas Grif-fiths and Dan Klein 2007 A probabilistic approachto diachronic phonology In Proceedings of the 2007Joint Conference on Empirical Methods in NaturalLanguage Processing and Computational NaturalLanguage Learning (EMNLP-CoNLL) pages 887ndash896 Prague Association for Computational Lin-guistics

Alexandre Bouchard-Cote Percy S Liang Dan Kleinand Thomas L Griffiths 2008 A probabilistic ap-proach to language change In Advances in NeuralInformation Processing Systems pages 169ndash176

R Bouckaert P Lemey M Dunn S J GreenhillA V Alekseyenko A J Drummond R D GrayM A Suchard and Q D Atkinson 2012 Mappingthe origins and expansion of the Indo-European lan-guage family Science 337(6097)957ndash960

Samuel R Bowman Luke Vilnis Oriol Vinyals An-drew M Dai Rafal Jozefowicz and Samy Ben-gio 2015 Generating sentences from a continu-ous space Proceedings of the Twentieth Confer-ence on Computational Natural Language Learning(CoNLL)

Chundra Cathcart to appear A probabilistic assess-ment of the Indo-Aryan Inner-Outer HypothesisJournal of Historical Linguistics

Ashwini Deo 2018 Dialects in the Indo-Aryan land-scape In Charles Boberg John Nerbonne and Do-minic Watt editors The Handbook of Dialectologypages 535ndash546 John Wiley amp Sons Oxford

Laurent Dinh and Vincent Dumoulin 2016 Train-ing neural Bayesian nets httpwwwiroumontrealcabengioycifarNCAP2014-summerschoolslidesLaurent_dinh_cifar_presentationpdf

Jeffrey Elman 1990 Finding structure in time Cogni-tive Science 14(2)179ndash211

Murray B Emeneau 1966 The dialects of Old-Indo-Aryan In Jaan Puhvel editor Ancient Indo-European dialects pages 123ndash138 University ofCalifornia Press Berkeley

Harald Hammarstrom Robert Forkel and MartinHaspelmath 2017 Glottolog 33 Max Planck In-stitute for the Science of Human History

Eric P Hamp 1987 On the sibilants of romani Indo-Iranian Journal 30(2)103ndash106

Gerhard Jager 2014 Phylogenetic inference fromword lists using weighted alignment with empiri-cally determined weights In Quantifying LanguageDynamics pages 155ndash204 Brill

Banikanta Kakati 1941 Assamese its formation anddevelopment Government of Assam Gauhati

Yoon Kim Sam Wiseman and Alexander M Rush2018 A tutorial on deep latent variable models ofnatural language arXiv preprint arXiv181206834

Diederik P Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In InternationalConference on Learning Representations (ICLR)

Johann-Mattis List 2012 SCA Phonetic alignmentbased on sound classes In M Slavkovik and D Las-siter editors New directions in logic language andcomputation pages 32ndash51 Springer Berlin Heidel-berg

Colin P Masica 1991 The Indo-Aryan languagesCambridge University Press Cambridge

Yaron Matras 2002 Romani ndash A Linguistic Introduc-tion Cambridge University Press Cambridge

R S McGregor 1968 The language of Indrajit of Or-cha Cambridge University Press Cambridge

Yugo Murawaki 2015 Continuous space representa-tions of linguistic typology and their application tophylogenetic inference In Proceedings of the 2015Conference of the North American Chapter of theAssociation for Computational Linguistics HumanLanguage Technologies pages 324ndash334

John Nerbonne 2009 Data-driven dialectology Lan-guage and Linguistics Compass 3(1)175ndash198

John Nerbonne and Wilbert Heeringa 2001 Computa-tional comparison and classification of dialects Di-alectologia et Geolinguistica 969ndash83

119

Thomas Oberlies 2005 A historical grammar ofHindi Leykam Graz

Jonathan K Pritchard Matthew Stephens and Pe-ter Donnelly 2000 Inference of population struc-ture using multilocus genotype data Genetics155(2)945ndash959

Jelena Prokic and John Nerbonne 2008 Recognisinggroups among dialects International journal of hu-manities and arts computing 2(1-2)153ndash172

Taraka Rama and Cagrı Coltekin 2016 Lstm autoen-coders for dialect analysis In Proceedings of theThird Workshop on NLP for Similar Languages Va-rieties and Dialects (VarDial3) pages 25ndash32

Taraka Rama Cagrı Coltekin and Pavel Sofroniev2017 Computational analysis of gondi dialectsIn Proceedings of the Fourth Workshop on NLPfor Similar Languages Varieties and Dialects (Var-Dial) pages 26ndash35

Rajesh Ranganath Linpeng Tang Laurent Charlin andDavid Blei 2015 Deep exponential families InArtificial Intelligence and Statistics pages 762ndash771

Ger Reesink Ruth Singer and Michael Dunn 2009Explaining the linguistic diversity of Sahul usingpopulation models PLoS Biology 7e1000241

Caley Smith 2017 The dialectology of Indic InJared Klein Brian Joseph and Matthias Fritz edi-tors Handbook of Comparative and Historical Indo-European Linguistics pages 417ndash447 De GruyterBerlin Boston

Jaroslav Strnad 2013 Morphology and Syntax of OldHindi Brill Leiden

Kaj Syrjanen Terhi Honkola Jyri Lehtinen AnttiLeino and Outi Vesakoski 2016 Applying popu-lation genetic approaches within languages Finnishdialects as linguistic populations Language Dy-namics and Change 6235ndash283

Paul Tedesco 1965 Turnerrsquos Comparative Dictionaryof the Indo-Aryan Languages Journal of the Amer-ican Oriental Society 85368ndash383

Matthew Toulmin 2009 From linguistic to sociolin-guistic reconstruction the Kamta historical sub-group of Indo-Aryan Pacific Linguistics ResearchSchool of Pacific and Asian Studies The AustralianNational University Canberra

Ralph L Turner 1962ndash1966 A comparative dictionaryof Indo-Aryan languages Oxford University PressLondon

Ralph L Turner 1975 [1967] Geminates after longvowel in Indo-aryan In RL Turner Collected Pa-pers 1912ndash1973 pages 405ndash415 Oxford UniversityPress London

Martijn Wieling Eliza Margaretha and John Ner-bonne 2012 Inducing a measure of phonetic simi-larity from pronunciation variation Journal of Pho-netics 40(2)307ndash314

Page 4: Toward a deep dialectological representation of Indo-Aryanweb.science.mq.edu.au/~smalmasi/vardial6/pdf/W19-1411.pdfAryan subgroup of Indo-European. I draw upon admixture models and

113

All weights in U and W are drawn from a Normaldistribution with a mean of 0 and standard devi-ation of 10 f(middot) represents the softmax function(throughout this paper) which transforms theseweights to probability simplices The generativeprocess yields the following joint log likelihood ofthe OIA etyma x and NIA reflexes y (with the dis-crete latent variables z marginalized out

P (xy|UW ) =

Wprodw=1

Lprodl=1

Ksumk=1

[f(Ulk)

Twprodt=1

f(Wkxwltywlt)

](1)

As readers will note this model weights allsound changes equally and makes no attempt todistinguish between dialectologically meaningfulchanges and noisy idiosyncratic changes

42 Deep modelThe deep model like the shallow model is a mix-ture model and as such retains the language-levelweights over dialect component membership U However unlike the shallow model in which thelikelihood of an OIA etymon and NIA reflex un-der a component assignment z = k is depen-dent on a flat representation of edit probabilitiesbetween OIA trigrams and NIA unigrams associ-ated with dialect component k Here I attemptto add some depth to this representation of soundchange by positing a hidden layer of dimension Jbetween each xwt and ywlt The goal here is tomimic a ldquonoisyrdquo reconstruction of an intermediatestage between OIA and NIA represented by dialectgroup k This reconstruction is not an explicitlinguistically meaningful string (as in Bouchard-Cote et al 2007 2008 2013) furthermore it isre-generated for each individual reflex of each et-ymon and not shared across data points (such amodel would introduce deeply nested dependen-cies between variables and enumerating all possi-ble reconstructions would be computationally in-feasible)

For parsimonyrsquos sake I employ a simple Recur-rent Neural Network (RNN) architecture to cap-ture rightward dependencies (Elman 1990) Fig-ure 1 gives a visual representation of the net-work unfolded in time This model exchangesW the dialect component-level weights over soundchanges for the following parameters

• Dialect component-level weights governing hidden layer unit activations by OIA sounds, W^x_{kij}, k ∈ {1, ..., K}, i ∈ {1, ..., I}, j ∈ {1, ..., J};

• Dialect component-level weights governing hidden layer unit activations by previous hidden layers, W^h_{kij}, k ∈ {1, ..., K}, i ∈ {1, ..., J}, j ∈ {1, ..., J};

• Language-level weights governing NIA output activations by hidden layer units, W^y_{ljo}, l ∈ {1, ..., L}, j ∈ {1, ..., J}, o ∈ {1, ..., O}.

For a given mixture component z = k, the activation of the hidden layer at time t, h_t, depends on two sets of parameters, each associated with component k: the weights W^x_{k, x_{wt}, ·} associated with the OIA input at time t, and W^h_k, the weights associated with the previous hidden layer h_{t−1}'s activations, for all t > 1. Given a hidden layer h_t, the weights W^y_l can be used to generate a probability distribution over possible outcomes in NIA language l. The forward pass of this network can be viewed as a generative process, denoted y_{wl} ∼ RNN(x_w, W^x_k, W^h_k, W^y_l), under the parameters for component k and language l; under such a process, the likelihood of y_{wl} can be computed as follows:

P_{\mathrm{RNN}}(y_{wl} \mid x_w, W^x_k, W^h_k, W^y_l) = \prod_{t=1}^{T_w} f(h_t^\top W^y_l)_{y_{wlt}} \qquad (2)

where

h_t = \begin{cases} f(W^x_{k, x_{wt}, \cdot}) & \text{if } t = 1 \\ f(h_{t-1}^\top W^h_k \oplus W^x_{k, x_{wt}, \cdot}) & \text{if } t > 1 \end{cases} \qquad (3)

The generative process for this model is nearly identical to the process described in the previous sections; however, after the dialect component assignment (z_{wl} ∼ Categorical(f(U_{l·}))) is drawn, the NIA string y_{wl} is sampled from RNN(x_w, W^x_{z_{wl}}, W^h_{z_{wl}}, W^y_l). The joint likelihood of the OIA etyma x and NIA reflexes y (with the discrete latent variables z marginalized out) is the following:

P(\mathbf{x}, \mathbf{y} \mid U, W^x, W^h, W^y) = \prod_{w=1}^{W} \prod_{l=1}^{L} \sum_{k=1}^{K} \left[ f(U_{lk}) \, P_{\mathrm{RNN}}(y_{wl} \mid x_w, W^x_k, W^h_k, W^y_l) \right] \qquad (4)
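The following sketch, again hypothetical rather than the released implementation, mirrors equations (2)-(4) with a softmax-activated Elman-style recurrence; reading the ⊕ in (3) as elementwise addition is an assumption, and all dimensions are toy values.

```python
# A hypothetical sketch of eqs. (2)-(4): per-component softmax-activated RNNs with
# language-specific output weights; treating the oplus in eq. (3) as elementwise
# addition is an assumption, and all dimensions are toy values.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

L, K, I, J, O = 5, 3, 20, 100, 15
rng = np.random.default_rng(1)
U  = rng.normal(0, 10, size=(L, K))       # language-level component weights
Wx = rng.normal(0, 10, size=(K, I, J))    # OIA sound -> hidden unit weights (per component)
Wh = rng.normal(0, 10, size=(K, J, J))    # hidden -> hidden weights (per component)
Wy = rng.normal(0, 10, size=(L, J, O))    # hidden -> NIA output weights (per language)

def p_rnn(x_w, y_wl, k, l):
    """P_RNN(y_wl | x_w, W^x_k, W^h_k, W^y_l), eq. (2)."""
    lik, h = 1.0, None
    for t, (x, y) in enumerate(zip(x_w, y_wl)):
        if t == 0:
            h = softmax(Wx[k, x])                 # eq. (3), t = 1
        else:
            h = softmax(h @ Wh[k] + Wx[k, x])     # eq. (3), t > 1
        lik *= softmax(h @ Wy[l])[y]              # output distribution over NIA segments
    return lik

def reflex_likelihood(x_w, y_wl, l):
    """Eq. (4): mixture over dialect components."""
    mix = softmax(U[l])
    return sum(mix[k] * p_rnn(x_w, y_wl, k, l) for k in range(K))

print(reflex_likelihood([3, 7, 1], [2, 2, 9], l=0))
```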


The same N(0, 10) prior as above is placed over U, W^x, W^h, W^y; J, the dimension of the hidden layer, is fixed at 100. This model bears some similarities to the mixture of RNNs described by Kim et al. (2018).

I have employed a simple RNN (rather than a more state-of-the-art architecture) for several reasons. The first is that I am interested in the consequences of expanding a flat mixture model to contain a simple, slightly deeper architecture. Additionally, the hidden layer of an RNN can be activated by a softmax function, which is more desirable from the perspective of representing sound change as a categorical or multinomial distribution, since all layer unit activations sum to one; this is not the case with Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), which traditionally use sigmoid or hyperbolic tangent functions to activate the hidden layer. Furthermore, long-distance dependencies are not particularly widespread in Indo-Aryan sound change, lessening the need for more complex architectures. At the same time, the RNN is a crude approximation to the reality of language change. RNNs and related models draw a single arc between a hidden layer at time t and the corresponding output. It is perhaps not appropriate to envision this single dependency unless the dimensionality of the hidden layer is large enough to absorb potential contextual information that is crucial to sound change. To put it simply, emission probabilities in sound change are sharper than the transitions common in most NLP applications (e.g., sentence prediction), and it may not be correct to envision y_t given h_{t′<t}, h_t as a function of an additive combination of weights, though in practice I find it too computationally costly to enumerate all possible value combinations of the hidden layer at multiple consecutive time points. This issue requires further exploration, and I employ what seems to be the most computationally tractable approach for the moment.

5 Results

I learn each model's MAP configuration using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of .1.⁴ I run the optimizer for 10,000 iterations over three random initializations, fitting the model on mini-batches of 100 data points, and monitor convergence by observing the trace of the log posterior (Figure 2).

4 Code for all experiments can be found at https://github.com/chundrac/IA_dial/VarDial2019.

Figure 1: RNN representation unfolded in time; hidden layers depend on OIA inputs x_1, ..., x_{T_w} and previous hidden layers (for t > 1); NIA outputs y_1, ..., y_{T_w} depend on hidden layers. Hidden layer activations are dependent on dialect component-specific parameters, while activations of the output layer are dependent on individual NIA language-specific parameters.

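For illustration, a hedged PyTorch sketch of the MAP objective for the shallow model is given below; the log-sum-exp marginalization, the toy batch, and all dimensions are assumptions about implementation detail, not the released code.

```python
# A hedged PyTorch sketch of MAP estimation for the shallow model: negative
# marginal log likelihood plus the N(0, 10) prior, optimized with Adam on a
# toy mini-batch. The log-sum-exp marginalization is an implementation assumption.
import torch

K, L, I, O = 3, 5, 20, 15
U = torch.randn(L, K, requires_grad=True)
W = torch.randn(K, I, O, requires_grad=True)
opt = torch.optim.Adam([U, W], lr=0.1)

def neg_log_posterior(batch):
    # batch: list of (oia_trigram_ids, nia_segment_ids, language_id)
    loss = torch.tensor(0.0)
    for x_w, y_wl, l in batch:
        log_mix = torch.log_softmax(U[l], dim=-1)             # log f(U_l.)
        log_emit = torch.log_softmax(W, dim=-1)               # log f(W_k..)
        per_k = log_mix + sum(log_emit[:, x, y] for x, y in zip(x_w, y_wl))
        loss = loss - torch.logsumexp(per_k, dim=-1)          # marginalize z
    log_prior = -((U ** 2).sum() + (W ** 2).sum()) / (2 * 10.0 ** 2)
    return loss - log_prior

batch = [([3, 7, 1], [2, 2, 9], 0), ([4, 4, 2], [1, 0, 0], 2)]  # toy data points
for step in range(100):
    opt.zero_grad()
    neg_log_posterior(batch).backward()
    opt.step()
```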

The flat model fails to pick up on any major differences between languages, finding virtually identical posterior values of f(U_l), the language-level distribution over dialect component membership, for all l ∈ {1, ..., L}. According to the MAP configuration, each language draws forms from the same dialect group with > .99 probability, essentially undergoing a sort of “component collapse” that latent variable models sometimes encounter (Bowman et al., 2015; Dinh and Dumoulin, 2016). It is likely that bundling together sound change features leads to component-level distributions over sound changes with high entropy that are virtually indistinguishable from one another.⁵ While this particular result is disappointing in the lack of information it provides, I observe some properties of our models' posterior values in order to diagnose problems that can be addressed in future work (discussed below).

The deep model, on the other hand, infers highly divergent language-level posterior distributions over cluster membership. Since these distributions are not identical across initializations due to the label-switching problem, I compute the Jensen-Shannon divergence between the language-level posterior distributions over cluster membership for each pair of languages in our sample for each initialization; I then average these divergences across initializations. These averaged

5 I made several attempts to run this model with different specifications, including different prior distributions, but achieved the same result.


Figure 2: Log posteriors for shallow model (left) and deep model (right) for 10,000 iterations over three random initializations.

divergences are then scaled to three dimensions using multidimensional scaling. Figure 3 gives a visualization of these transformed values via the red-green-blue color vector plotted on a map; languages with similar component distributions display similar colors. With a few exceptions (which may be artifacts of the fact that certain languages have only a small number of data points associated with them), a noticeable divide can be seen between languages of the main Indo-Aryan speech region on one hand and, on the other, languages of northwestern South Asia (dark blue), the Dardic languages of Northern Pakistan, and the Pahari languages of the Indian Himalayas, though this division is not clear-cut. Romani and other Indo-Aryan varieties spoken outside of South Asia show affiliation with multiple groups. While Romani dialects are thought to have a close genetic affinity with Hindi and other Central Indic languages, Romani was likely in contact with languages of northwestern South Asia during the course of its speakers' journey out of South Asia (Hamp, 1987; Matras, 2002). However, this impressionistic evaluation is by no means a confirmation that the deep model has picked up on linguistically meaningful differences between speech varieties. In the following sections, some comparison and evaluation metrics and checks are deployed in order to assess the quality of these models' behavior.
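A sketch of this visualization pipeline, assuming SciPy's Jensen-Shannon distance and scikit-learn's MDS (the paper does not name its tooling), might look as follows.

```python
# A sketch of the visualization pipeline: average pairwise Jensen-Shannon
# divergences across initializations, scale to 3-D with MDS, read as RGB.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

def pairwise_jsd(dists):
    """dists: (n_languages, K) posterior component probabilities per language."""
    n = dists.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(dists[i], dists[j]) ** 2  # squared distance = divergence
    return D

# one (n_languages, K) matrix per random initialization (toy values here)
inits = [np.random.default_rng(s).dirichlet(np.ones(3), size=10) for s in range(3)]
D = np.mean([pairwise_jsd(d) for d in inits], axis=0)

coords = MDS(n_components=3, dissimilarity="precomputed", random_state=0).fit_transform(D)
rgb = (coords - coords.min(0)) / (coords.max(0) - coords.min(0))  # one color per language
```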

5.1 Entropy of distributions

I measure the average entropy of the models' posterior distributions in order to gauge the extent to which the models are able to learn sparse, informative distributions over sound changes, hidden state activations, or other parameters concerning transitions through the model architecture. Normalized entropy is used in order to make entropies of distributions of different dimension comparable; a distribution's entropy can be normalized by dividing by its maximum possible entropy.

As mentioned above, our data set consists of OIA trigrams and the NIA segment corresponding to the second segment in the trigram, representing rewrite rules operating between OIA and the NIA languages in our sample. It is often the case that more than one NIA reflex is attested for a given OIA trigram. As such, the sound changes that have operated in an NIA language can be represented as a collection of categorical distributions, each summing to one. I calculate the average of the normalized entropies of these sound change distributions as a baseline against which to compare entropy values for the models' parameters. The pooled average of the normalized entropies across all languages is 0.11, while the average of averages for each language is 0.063.
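As an illustration of this baseline, the sketch below builds per-trigram sound change distributions from hypothetical (OIA trigram, NIA reflex) pairs for one language and averages their normalized entropies; the data format and the toy forms are assumptions.

```python
# An illustrative sketch of the baseline: per-trigram sound change distributions
# for one language, built from hypothetical (OIA trigram, NIA reflex) pairs, and
# their average normalized entropy (entropy divided by its maximum, log K).
import numpy as np
from collections import Counter, defaultdict

def avg_normalized_entropy(pairs):
    counts = defaultdict(Counter)
    for trigram, reflex in pairs:
        counts[trigram][reflex] += 1
    ents = []
    for c in counts.values():
        p = np.array(list(c.values()), dtype=float)
        p /= p.sum()
        if p.size > 1:
            ents.append(float(-(p * np.log(p)).sum() / np.log(p.size)))
        else:
            ents.append(0.0)          # a single attested reflex: zero entropy
    return float(np.mean(ents))

print(avg_normalized_entropy([("#ta", "t"), ("#ta", "t"), ("#ta", "d"), ("ata", "d")]))
```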

For the shallow model, the parameter of interest is f(W), the dialect component-level collection of distributions over sound changes, the mean normalized entropy of which, averaged across initializations but pooled across components within each initialization, is 0.91 (raw values range from 0.003 to 1). For the deep model, the average entropy of the dialect-level distributions over hidden-layer activations, f(W^x), is only slightly lower, at 0.86 (raw values range from close to 0 to 1).

For each k ∈ {1, ..., K}, I compute the forward pass of RNN(x_w, W^x_k, W^h_k, W^y_l) for each etymon w and each language l in which the etymon survives, using the inferred values for W^x_k, W^h_k, W^y_l, and compute the entropy of each f(h_t^⊤ W^y_l), yielding an average of 0.74 (raw values range from close to 0 to 1). While these values are still very high, it is clear that the inclusion of a hidden layer has led to sparser, potentially more meaningful distributions than the flat approach, and that increasing the dimensionality of the hidden layer will likely bring about even sparser, more meaningful distributions. The entropies cited here are considerably higher than the average entropy of languages' sound change distributions, but the latter distributions do little to tell us about the internal clustering of the languages.

5.2 Comparison with other linguistic distance metrics

Here I compare the cluster membership inferred by this paper's models against other measures of linguistic distance. Each method yields a pairwise inter-language distance metric, which can be compared against a non-linguistic measure.


Figure 3: Dialect group makeup of languages in sample under deep model (languages are plotted by longitude and latitude and labeled with Glottolog codes).

I measure the correlation between each linguistic distance measure and both great-circle geographic distance and patristic distance according to the Glottolog phylogeny, using Spearman's ρ.
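A minimal sketch of this comparison for the geographic case, assuming a haversine great-circle distance and SciPy's Spearman correlation over the upper triangles of the matrices, is given below; the patristic comparison would substitute tree distances for the geographic matrix.

```python
# A minimal sketch of the geographic comparison: Spearman's rho between the upper
# triangles of a linguistic distance matrix and a great-circle distance matrix.
import numpy as np
from scipy.stats import spearmanr

def great_circle_km(lat1, lon1, lat2, lon2, r=6371.0):
    p1, p2 = np.radians(lat1), np.radians(lat2)
    a = (np.sin((p2 - p1) / 2) ** 2
         + np.cos(p1) * np.cos(p2) * np.sin(np.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

def correlate_with_geography(ling_dist, coords):
    """ling_dist: (n, n) linguistic distances; coords: (n, 2) latitude/longitude pairs."""
    n = ling_dist.shape[0]
    geo = np.array([[great_circle_km(*coords[i], *coords[j]) for j in range(n)]
                    for i in range(n)])
    iu = np.triu_indices(n, k=1)
    return spearmanr(ling_dist[iu], geo[iu])
```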

5.2.1 Levenshtein distance

Borin et al. (2014) measure the normalized Levenshtein distances (i.e., the edit distance between two strings divided by the length of the longer string) between words for the same concept in pairs of Indo-Aryan languages, and find that average normalized Levenshtein distance correlates significantly with patristic distances in the Ethnologue tree. This paper's dataset is not organized by semantic meaning, so for comparability I measure the average normalized Levenshtein distance between cognates in pairs of Indo-Aryan languages, which picks up on phonological divergence between dialects, as opposed to both phonological and lexical divergence.
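A minimal sketch of this measure (edit distance divided by the length of the longer string, averaged over cognate pairs) follows; segmentation and input format are assumptions.

```python
# A minimal sketch of normalized Levenshtein distance, averaged over cognate pairs.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ldn(a, b):
    longer = max(len(a), len(b))
    return levenshtein(a, b) / longer if longer else 0.0

def average_ldn(cognate_pairs):
    """cognate_pairs: list of (form_in_language_1, form_in_language_2) for shared etyma."""
    return sum(ldn(a, b) for a, b in cognate_pairs) / len(cognate_pairs)
```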

5.2.2 Jensen-Shannon divergence

Each language in our dataset attests one or more outcomes (due to language contact, analogy, etc.) for a given OIA trigram, yielding a collection of sound change distributions, as described above. For each pair of languages, I compute the Jensen-Shannon divergence between sound change distributions for all OIA trigrams that are continued in both languages, and average these values. This gives a measure of pairwise average diachronic phonological divergence between languages.
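A hedged sketch of this computation, assuming each language's sound changes are stored as reflex counts per OIA trigram, is given below.

```python
# A hedged sketch of the per-language-pair measure, assuming each language's sound
# changes are stored as reflex counts per OIA trigram (a dict of count dicts).
import numpy as np
from scipy.spatial.distance import jensenshannon

def language_pair_jsd(sc1, sc2):
    """sc1, sc2: {oia_trigram: {nia_reflex: count}} for two languages."""
    shared = set(sc1) & set(sc2)
    divs = []
    for tri in shared:
        reflexes = sorted(set(sc1[tri]) | set(sc2[tri]))
        p = np.array([sc1[tri].get(r, 0) for r in reflexes], dtype=float)
        q = np.array([sc2[tri].get(r, 0) for r in reflexes], dtype=float)
        divs.append(jensenshannon(p / p.sum(), q / q.sum()) ** 2)
    return float(np.mean(divs)) if divs else float("nan")
```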

5.2.3 LSTM Autoencoder

Rama and Çöltekin (2016) and Rama et al. (2017) develop an LSTM-based method for representing the phonological structure of individual word forms across closely related speech varieties. Each string is fed to a unidirectional or bidirectional LSTM autoencoder, which learns a continuous latent multidimensional representation of the sequence. This embedding is then used to reconstruct the input sequence. The latent values in the embedding provide information that can be used to compute dissimilarity (in the form of cosine or Euclidean distance) between strings or across speech varieties (by averaging the latent values for all strings in each dialect or language). I use the bidirectional LSTM autoencoder described in the work cited in order to learn an 8-dimensional latent representation for all NIA forms in the dataset, training the model over 20 epochs on batches of 32 data points, using the Adam optimizer to minimize the categorical cross-entropy between the input sequence and the NIA reconstruction predicted by the model. I use the learned model parameters to generate a latent representation for each form. The latent representations are averaged across forms within each language, and pairwise linguistic Euclidean distances are computed between each averaged representation.
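The sketch below, written with Keras, is one way such a bidirectional LSTM autoencoder with an 8-dimensional latent representation could be set up; it is not the cited authors' implementation, and layer sizes other than the latent dimension, the symbol inventory, and the toy input are assumptions.

```python
# A hedged Keras sketch of a bidirectional LSTM autoencoder with an 8-dimensional
# latent state, trained with Adam and categorical cross-entropy on toy data.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

max_len, n_symbols, latent_dim = 12, 40, 8

inputs = keras.Input(shape=(max_len,), dtype="int32")
emb = layers.Embedding(n_symbols, 16)(inputs)
latent = layers.Bidirectional(layers.LSTM(latent_dim // 2))(emb)    # 8-dim embedding
repeated = layers.RepeatVector(max_len)(latent)
decoded = layers.LSTM(32, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(n_symbols, activation="softmax"))(decoded)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, latent)
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

X = np.random.randint(0, n_symbols, size=(500, max_len))             # toy integer-coded forms
autoencoder.fit(X, X[..., None], epochs=20, batch_size=32, verbose=0)

form_embeddings = encoder.predict(X)                  # 8-dim representation per form
language_embedding = form_embeddings.mean(axis=0)     # average across a language's forms
```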


              Geographic   Genetic
Shallow JSD     −0.01       −0.03
Deep JSD         0.147*      0.008
LDN              0.346*      0.013
Raw JSD          0.302*     −0.051*
LSTM AE          0.158*     −0.068*
LSTM ED          0.084*      0.0001

Table 1: Spearman's ρ values for correlations between each linguistic distance metric (JSD = Jensen-Shannon Divergence, LDN = Levenshtein Distance Normalized, AE = Autoencoder, ED = Encoder-Decoder) and geographic and genetic distance. Asterisks represent significant correlations.


5.2.4 LSTM Encoder-Decoder

For the sake of completeness, I use an LSTM encoder-decoder to learn a continuous representation for every OIA-NIA string pair. This model is very similar to the LSTM autoencoder, except that it takes an OIA input and reconstructs an NIA output, instead of taking an NIA form as input and reconstructing the same string. I train the model as described above.

5.3 Correlations

Table 1 gives correlation coefficients (Spearman's ρ) between linguistic distance metrics and non-linguistic distance metrics. In general, correlations with Glottolog patristic distance are quite poor. This is surprising for Levenshtein Distance Normalized, given the high correlation with patristic distance reported by Borin et al. (2014). Given that the authors measured Levenshtein distance between identical concepts in pairs of languages, and not cognates as I do here, it is possible that lexical divergence carries a stronger genetic signal than phonological divergence, at least in the context of Indo-Aryan (it is worth noting that I did not balance the tree as described by the authors; it is not clear that this would have yielded any improvement). On the other hand, the Levenshtein distance measured in this paper correlates significantly with great-circle distance, indicating a strong geographic signal. Average Jensen-Shannon divergence between pairs of languages' sound change distributions shows a strong association with geographic distance as well.

Divergences/distances based on the deep model, the LSTM Autoencoder, and the LSTM Encoder-Decoder show significant correlations with geospatial distance, albeit lower ones. It is not entirely clear what accounts for this disparity. Intuitively, we expect chronologically shallower features to correlate with geographic distance. It is possible that the LSTM and RNN architectures are picking up on chronologically deeper information and show a low geographic signal for this reason, though this highly provisional idea is not borne out by any genetic signal.

It is not clear how to assess the meaning of these correlations at this stage. Nevertheless, deep architectures provide an interesting direction for future research into sound change and language contact, as they have the potential to disaggregate a great deal of information regarding interacting forces in language change that is censored when raw distance measures are computed directly from the data.

6 Outlook

This paper explored the consequences of adding hidden layers to models of dialectology for groups where the languages have experienced too much contact for phylogenetic models to be appropriate, but have diversified to the extent that traditional dialectometric approaches are not applicable. While the model requires some refinement, its results point in a promising direction. Modifying prior distributions could potentially produce more informative results, as could tweaking hyperparameters of the learning algorithms employed. Additionally, it is likely that the model will benefit from hidden layers of higher dimension J, as well as from bidirectional approaches; despite the misgivings regarding LSTMs and GRUs stated above, future work will probably benefit from incorporating these and related architectures (e.g., attention). Additionally, the models used in this paper assumed discrete latent variables, attempting to be faithful to the traditional historical linguistic notion of intimate borrowing between discrete dialect groups. However, continuous-space models may provide a more flexible framework for addressing the questions asked in this paper (cf. Murawaki, 2015).

This paper provides a new way of looking at dialectology and linguistic affiliation; with refinement and expansion, it is hoped that this and related models can further our understanding of the history of the Indo-Aryan speech community and can generalize to new linguistic scenarios.


It is hoped that methodologies of this sort can join forces with similar tools designed to investigate the interaction of regularly conditioned sound change and chronologically deep language contact in individual languages' histories.

References

George Baumann. 1975. Drei Jaina-Gedichte in Alt-Gujarātī: Edition, Übersetzung, Grammatik und Glossar. Franz Steiner, Wiesbaden.

Leonard Bloomfield. 1933. Language. Holt, Rinehart and Winston, New York.

Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3137–3144.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110:4224–4229.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague. Association for Computational Linguistics.

Alexandre Bouchard-Côté, Percy S. Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems, pages 169–176.

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. Proceedings of the Twentieth Conference on Computational Natural Language Learning (CoNLL).

Chundra Cathcart. To appear. A probabilistic assessment of the Indo-Aryan Inner-Outer Hypothesis. Journal of Historical Linguistics.

Ashwini Deo. 2018. Dialects in the Indo-Aryan landscape. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 535–546. John Wiley & Sons, Oxford.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf

Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Murray B. Emeneau. 1966. The dialects of Old-Indo-Aryan. In Jaan Puhvel, editor, Ancient Indo-European dialects, pages 123–138. University of California Press, Berkeley.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog 3.3. Max Planck Institute for the Science of Human History.

Eric P. Hamp. 1987. On the sibilants of Romani. Indo-Iranian Journal, 30(2):103–106.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill.

Banikanta Kakati. 1941. Assamese, its formation and development. Government of Assam, Gauhati.

Yoon Kim, Sam Wiseman, and Alexander M. Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Johann-Mattis List. 2012. SCA: Phonetic alignment based on sound classes. In M. Slavkovik and D. Lassiter, editors, New directions in logic, language and computation, pages 32–51. Springer, Berlin, Heidelberg.

Colin P. Masica. 1991. The Indo-Aryan languages. Cambridge University Press, Cambridge.

Yaron Matras. 2002. Romani – A Linguistic Introduction. Cambridge University Press, Cambridge.

R. S. McGregor. 1968. The language of Indrajit of Orchā. Cambridge University Press, Cambridge.

Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

John Nerbonne and Wilbert Heeringa. 2001. Computational comparison and classification of dialects. Dialectologia et Geolinguistica, 9:69–83.

Thomas Oberlies. 2005. A historical grammar of Hindi. Leykam, Graz.

Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Jelena Prokić and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1-2):153–172.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.

Taraka Rama, Çağrı Çöltekin, and Pavel Sofroniev. 2017. Computational analysis of Gondi dialects. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. 2015. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771.

Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7:e1000241.

Caley Smith. 2017. The dialectology of Indic. In Jared Klein, Brian Joseph, and Matthias Fritz, editors, Handbook of Comparative and Historical Indo-European Linguistics, pages 417–447. De Gruyter, Berlin, Boston.

Jaroslav Strnad. 2013. Morphology and Syntax of Old Hindi. Brill, Leiden.

Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.

Paul Tedesco. 1965. Turner's Comparative Dictionary of the Indo-Aryan Languages. Journal of the American Oriental Society, 85:368–383.

Matthew Toulmin. 2009. From linguistic to sociolinguistic reconstruction: the Kamta historical subgroup of Indo-Aryan. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University, Canberra.

Ralph L. Turner. 1962–1966. A comparative dictionary of Indo-Aryan languages. Oxford University Press, London.

Ralph L. Turner. 1975 [1967]. Geminates after long vowel in Indo-Aryan. In R.L. Turner: Collected Papers 1912–1973, pages 405–415. Oxford University Press, London.

Martijn Wieling, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314.

Page 5: Toward a deep dialectological representation of Indo-Aryanweb.science.mq.edu.au/~smalmasi/vardial6/pdf/W19-1411.pdfAryan subgroup of Indo-European. I draw upon admixture models and

114

The same N (0 10) prior as above is placed overUW xW hW y J the dimension of the hiddenlayer is fixed at 100 This model bears some sim-ilarities to the mixture of RNNs described by Kimet al (2018)

I have employed a simple RNN (rather than amore state-of-the art architecture) for several rea-sons The first is that I am interested in the conse-quences of expanding a flat mixture model to con-tain a simple slightly deeper architecture Addi-tionally I believe that the fact that the hidden layerof an RNN can be activated by a softmax functionis more desirable from the perspective of repre-senting sound change as a categorical or multi-nomial distribution as all layer unit activationssum to one as opposed to the situation with LongShort-Term Memory (LSTM) and Gated Recur-rent Units (GRU) which traditionally use sigmoidor hyperbolic tangent functions to activate the hid-den layer Furthermore long-distance dependen-cies are not particularly widespread in Indo-Aryansound change lessening the need for more com-plex architectures At the same time the RNNis a crude approximation to the reality of lan-guage change RNNs and related models draw asingle arc between a hidden layer at time t andthe corresponding output It is perhaps not ap-propriate to envision this single dependency un-less the dimensionality of the hidden layer is largeenough to absorb potential contextual informationthat is crucial to sound change To put it sim-ply emission probabilities in sound change aresharper than transitions common in most NLP ap-plications (eg sentence prediction) and it maynot be correct to envision yt given htprimeltt ht as afunction of an additive combination of weightsthough in practice I find it too computationallycostly to enumerate all possible value combina-tions the hidden layer at multiple consecutive timepoints This issue requires further exploration andI employ what seems to be the most computation-ally tractable approach for the moment

5 Results

I learn each modelrsquos MAP configuration using theAdam optimizer (Kingma and Ba 2015) with alearning rate of 14 I run the optimizer for 10000iterations over three random initializations fittingthe model on mini-batches of 100 data points and

4Code for all experiments can be found at httpsgithubcomchundracIA_dialVarDial2019

xtminus1 xt xt+1

htminus1 ht ht+1

ytminus1 yt yt+1

Figure 1 RNN representation unfolded in time hid-den layers depend on OIA inputs x1 xTw and previ-ous hidden layers (for t gt 1) NIA outputs y1 yTw

depend on hidden layers Hidden layer activations aredependent on dialect component-specific parameterswhile activations of the output layer are dependent onindividual NIA language-specific parameters

monitor convergence by observing the trace of thelog posterior (Figure 2)

The flat model fails to pick up on any majordifferences between languages finding virtuallyidentical posterior values of f(Ul) the language-level distribution over dialect component member-ship for all l isin 1 L According to theMAP configuration each language draws formsfrom the same dialect group with gt 99 proba-bility essentially undergoing a sort of ldquocompo-nent collapserdquo that latent variable models some-times encounter (Bowman et al 2015 Dinh andDumoulin 2016) It is likely that bundling to-gether sound change features leads to component-level distributions over sound changes with highentropy that are virtually indistinguishable fromone another5 While this particular result is dis-appointing in the lack of information it provides Iobserve some properties of our modelsrsquo posteriorvalues in order to diagnose problems that can beaddressed in future work (discussed below)

The deep model on the other hand infershighly divergent language-level posterior distri-butions over cluster membership Since thesedistributions are not identical across initializa-tions due to the label-switching problem I com-pute the Jensen-Shannon divergence between thelanguage-level posterior distributions over clustermembership for each pair of languages in our sam-ple for each initialization I then average these di-vergences across initializations These averaged

5I made several attempts to run this model with differ-ent specifications including different prior distributions butachieved the same result

115

0 2000 4000 6000 8000 10000

1048

1046

1044

1042

1040

1038

10361e8

0 2000 4000 6000 8000 10000

426

424

422

420

418

416

414

412

1e7

Figure 2 Log posteriors for shallow model (left) anddeep model (right) for 10000 iterations over three ran-dom initializations

divergences are then scaled to three dimensionsusing multidimensional scaling Figure 3 gives avisualization of these transformed values via thered-green-blue color vector plotted on a map lan-guages with similar component distributions dis-play similar colors With a few exceptions (thatmay be artifacts of the fact that certain languageshave only a small number of data points associ-ated with them) a noticieable divide can be seenbetween languages of the main Indo-Aryan speechregion on one hand and languages of northwest-ern South Asia (dark blue) the Dardic languagesof Northern Pakistan and the Pahari languagesof the Indian Himalayas though this division isnot clear cut Romani and other Indo-Aryan va-rieties spoken outside of South Asia show affil-iation with multiple groups While Romani di-alects are thought to have a close genetic affin-ity with Hindi and other Central Indic languagesit was likely in contact with languages of north-west South Asian during the course of its speak-ersrsquo journey out of South Asia (Hamp 1987 Ma-tras 2002) However this impressionistic evalua-tion is by no means a confirmation that the deepmodel has picked up on linguistically meaningfuldifferences between speech varieties In the fol-lowing sections some comparison and evaluationmetrics and checks are deployed in order to assessthe quality of these modelsrsquo behavior

51 Entropy of distributions

I measure the average entropy of the modelrsquos pos-terior distributions in order to gauge the extent towhich the models are able to learn sparse informa-tive distributions over sound changes hidden stateactivations or other parameters concerning transi-tions through the model architecture Normalizedentropy is used in order to make entropies of distri-butions of different dimension comparable a dis-tributionrsquos entropy can be normalized by dividingby its maximum possible entropy

As mentioned above our data set consists ofOIA trigrams and the NIA segment correspondingto the second segment in the trigram representingrewrite rules operating between OIA and the NIAlanguages in our sample It is often the case thatmore than one NIA reflex is attested for a givenOIA trigram As such the sound changes that haveoperated in an NIA language can be representedas a collection of categorical distributions eachsumming to one I calculate the average of thenormalized entropies of these sound change dis-tributions as a baseline against which to compareentropy values for the modelsrsquo parameters Thepooled average of the normalized entropies acrossall languages is 11 while the average of averagesfor each language is 063

For the shallow model the parameter of interestis f(V ) the dialect component-level collection ofdistributions over sound changes the mean nor-malized entropy of which averaged across initial-izations but pooled across components within eachinitialization is 091 (raw values range from 0003to 1) For the deep model the average entropyof the dialect-level distributions over hidden-layeractivations f(W x) is only slightly lower at 086(raw values range from close to 0 to 1)

For each k isin 1 K I compute the for-ward pass of RNN(xwlW

xk W

hk W

l) for eachetymon w and each language l in which theetymon survives using the inferred values forW x

k Whk W

l and compute the entropy of eachf(hgtt W

l) yielding an average of 74 (raw val-ues range from close to 0 to 1) While these val-ues are still very high it is clear that the inclu-sion of a hidden layer has learned sparser poten-tially more meaningful distributions than the flatapproach and that increasing the dimensionalityof the hidden layer will likely bring about evensparser more meaningful distributions The en-tropies cited here are considerably higher than theaverage entropy of languagesrsquo sound change dis-tributions but the latter distributions do little to tellus about the internal clustering of the languages

52 Comparison with other linguisticdistance metrics

Here I compare the cluster membership inferredby this paperrsquos models against other measures oflinguistic distance Each method yields a pairwiseinter-language distance metric which can be com-pared against a non-linguistic measure I measure

116

assa1263awad1243

bagh1251

balk1252

beng1280

bhad1241 bhat1263

bhoj1244braj1242

brok1247

carp1235

cham1307

chil1275

chur1258

dhiv1236

dogr1250

doma1258

doma1260

garh1243

gawa1247

gran1245

guja1252

halb1244

hind1269

indu1241

jaun1243

kach1277

kala1372 kala1373

kalo1256

kang1280

kash1277

khet1238

khow1242 kohi1248

konk1267

kull1236

kuma1273

loma1235

maga1260

maha1287

mait1250

mara1378

marw1260

nepa1254

nort2665

nort2666

oriy1255

paha1251pang1282

panj1256

phal1254

savi1242

sera1259

shin1264

shum1235

sind1272

sinh1246

sint1235

sirm1239

sout2671

sout2672

tira1253

torw1241

vlax1238

wels1246

west2386

wota1240

0

20

40

60

0 25 50 75long

lat

Figure 3 Dialect group makeup of languages in sample under deep model

the correlation between each linguistic distancemeasure as well as great circle geographic distanceand patristic distance according to the Glottologphylogeny using Spearmanrsquos ρ

521 Levenshtein distanceBorin et al (2014) measure the normalized Lev-enshtein distances (ie the edit distance betweentwo strings divided by the length of the longerstring) between words for the same concept inpairs of Indo-Aryan languages and find that av-erage normalized Levenshtein distance correlatessignificantly with patristic distances in the Ethno-logue tree This paperrsquos dataset is not organized bysemantic meaning so for comparability I measurethe average normalized Levenshtein distance be-tween cognates in pairs of Indo-Aryan languageswhich picks up on phonological divergence be-tween dialects as opposed to both phonologicaland lexical divergence

522 Jensen-Shannon divergenceEach language in our dataset attests one or more(due to language contact analogy etc) outcomesfor a given OIA trigram yielding a collection ofsound change distributions as described aboveFor each pair of languages I compute the Jensen-Shannon divergence between sound change distri-butions for all OIA trigrams that are continued inboth languages and average these values This

gives a measure of pairwise average diachronicphonological divergence between languages

523 LSTM AutoencoderRama and Coltekin (2016) and Rama et al (2017)develop an LSTM-based method for represent-ing the phonological structure of individual wordforms across closely related speech varieties Eachstring is fed to a unidirectional or bidirectionalLSTM autoencoder which learns a continuouslatent multidimensional representation of the se-quence This embedding is then used to recon-struct the input sequence The latent values in theembedding provide information that can be usedto compute dissimilarity (in the form of cosineor Euclidean distance) between strings or acrossspeech varieties (by averaging the latent values forall strings in each dialect or language) I use thebidirectional LSTM Autoencoder described in thework cited in order to learn an 8-dimensional la-tent representation for all NIA forms in the datasettraining the model over 20 epochs on batches of 32data points using the Adam optimizer to minimizethe categorical cross-entropy between the input se-quence and the NIA reconstruction predicted bythe model I use the learned model parameters togenerate a latent representation for each form Thelatent representations are averaged across formswithin each language and pairwise linguistic Eu-clidean distances are computed between each av-

117

Geographic GeneticShallow JSD minus001 minus003Deep JSD 0147lowast 0008

LDN 0346lowast 0013Raw JSD 0302lowast minus0051lowastLSTM AE 0158lowast minus0068lowastLSTM ED 0084lowast 00001

Table 1 Spearmanrsquos ρ values for correlations betweeneach linguistic distance metric (JSD = Jensen-ShannonDivergence LDN = Levenshtein Distance NormalizedAE = Autoencoder ED = Encoder-Decoder) and geo-graphic and genetic distance Asterisks represent sig-nificant correlations

eraged representation

524 LSTM Encoder-DecoderFor the sake of completeness I use an LSTMencoder-decoder to learn a continuous representa-tion for every OIA-NIA string pair This modelis very similar to the LSTM autoencoder exceptthat it takes an OIA input and reconstructs an NIAoutput instead of taking an NIA form as input andreconstructing the same string I train the modelas described above

53 Correlations

Table 1 gives correlation coefficients (Spearmanrsquosρ) between linguistic distance metrics and non-linguistic distance metrics In general correlationswith Glottolog patristic distance are quite poorThis is surprising for Levenshtein Distance Nor-malized given the high correlation with patristicdistance reported by Borin et al (2014) Giventhat the authors measured Levenshtein distancebetween identical concepts in pairs of languagesand not cognates as I do here it is possible thatlexical divergence carries a stronger genetic sig-nal than phonological divergence at least in thecontext of Indo-Aryan (it is worth noting that Idid not balance the tree as described by the au-thors it is not clear that this would have yieldedany improvement) On the other hand the Lev-enshtein distance measured in this paper corre-lates significantly with great circle distance indi-cating a strong geographic signal Average Jensen-Shannon divergence between pairs of languagesrsquosound change distributions shows a strong associ-ation with geographic distance as well

Divergencedistances based on the deepmodel the LSTM Autoencoder and the LSTM

Encoder-Decoder show significant correlationswith geospatial distance albeit lower ones It isnot entirely clear what accounts for this disparityIntuitively we expect more shallow chronologicalfeatures to correlate with geographic distance Itis possible that the LSTM and RNN architecturesare picking up on chronologically deeper infor-mation and show a low geographic signal for thisreason though this highly provisional idea is notborne out by any genetic signal

It is not clear how to assess the meaning ofthese correlations at this stage Nevertheless deeparchitectures provide an interesting direction forfuture research into sound change and languagecontact as they have the potential to disaggregatea great deal of information regarding interactingforces in language change that is censored whenraw distance measures are computed directly fromthe data

6 Outlook

This paper explored the consequences of addinghidden layers to models of dialectology where thelanguages have experienced too much contact forphylogenetic models to be appropriate but havediversified to the extent that traditional dialecto-metric approaches are not applicable While themodel requires some refinement its results pointin a promising direction Modifying prior distribu-tions could potentially produce more informativeresults as could tweaking hyperparameters of thelearning algorithms employed Additionally it islikely that the model will benefit from hidden lay-ers of higher dimension J as well as bidirectionalapproaches and despite the misgivings regard-ing LSTM and GRUs stated above future workwill probably benefit from incorporating these andrelated architectures (eg attention) Addition-ally the models used in this paper assumed dis-crete latent variables attempting to be faithful tothe traditional historical linguistic notion of inti-mate borrowing between discrete dialect groupsHowever continuous-space models may provide amore flexible framework for addressing the ques-tions asked in this paper (cf Murawaki 2015)

This paper provides a new way of looking atdialectology and linguistic affiliation with refine-ment and expansion it is hoped that this and re-lated models can further our understanding of thehistory of the Indo-Aryan speech community andcan generalize to new linguistic scenarios It is

118

hoped that methodologies of this sort can joinforces with similar tools designed to investigateinteraction of regularly conditioned sound changeand chronologically deep language contact in in-dividual languagesrsquo histories

ReferencesGeorge Baumann 1975 Drei Jaina-Gedichte in Alt-

Gujaratı Edition Ubersetzung Grammatik undGlossar Franz Steiner Wiesbaden

Leonard Bloomfield 1933 Language Holt Rinehartand Winston New York

Lars Borin Anju Saxena Taraka Rama and BernardComrie 2014 Linguistic landscaping of southasia using digital language resources Genetic vsareal linguistics In Ninth International Conferenceon Language Resources and Evaluation (LRECrsquo14)pages 3137ndash3144

Alexandre Bouchard-Cote David Hall Thomas LGriffiths and Dan Klein 2013 Automated recon-struction of ancient languages using probabilisticmodels of sound change Proceedings of the Na-tional Academy of Sciences 1104224ndash4229

Alexandre Bouchard-Cote Percy Liang Thomas Grif-fiths and Dan Klein 2007 A probabilistic approachto diachronic phonology In Proceedings of the 2007Joint Conference on Empirical Methods in NaturalLanguage Processing and Computational NaturalLanguage Learning (EMNLP-CoNLL) pages 887ndash896 Prague Association for Computational Lin-guistics

Alexandre Bouchard-Cote Percy S Liang Dan Kleinand Thomas L Griffiths 2008 A probabilistic ap-proach to language change In Advances in NeuralInformation Processing Systems pages 169ndash176

R Bouckaert P Lemey M Dunn S J GreenhillA V Alekseyenko A J Drummond R D GrayM A Suchard and Q D Atkinson 2012 Mappingthe origins and expansion of the Indo-European lan-guage family Science 337(6097)957ndash960

Samuel R Bowman Luke Vilnis Oriol Vinyals An-drew M Dai Rafal Jozefowicz and Samy Ben-gio 2015 Generating sentences from a continu-ous space Proceedings of the Twentieth Confer-ence on Computational Natural Language Learning(CoNLL)

Chundra Cathcart to appear A probabilistic assess-ment of the Indo-Aryan Inner-Outer HypothesisJournal of Historical Linguistics

Ashwini Deo 2018 Dialects in the Indo-Aryan land-scape In Charles Boberg John Nerbonne and Do-minic Watt editors The Handbook of Dialectologypages 535ndash546 John Wiley amp Sons Oxford

Laurent Dinh and Vincent Dumoulin 2016 Train-ing neural Bayesian nets httpwwwiroumontrealcabengioycifarNCAP2014-summerschoolslidesLaurent_dinh_cifar_presentationpdf

Jeffrey Elman 1990 Finding structure in time Cogni-tive Science 14(2)179ndash211

Murray B Emeneau 1966 The dialects of Old-Indo-Aryan In Jaan Puhvel editor Ancient Indo-European dialects pages 123ndash138 University ofCalifornia Press Berkeley

Harald Hammarstrom Robert Forkel and MartinHaspelmath 2017 Glottolog 33 Max Planck In-stitute for the Science of Human History

Eric P Hamp 1987 On the sibilants of romani Indo-Iranian Journal 30(2)103ndash106

Gerhard Jager 2014 Phylogenetic inference fromword lists using weighted alignment with empiri-cally determined weights In Quantifying LanguageDynamics pages 155ndash204 Brill

Banikanta Kakati 1941 Assamese its formation anddevelopment Government of Assam Gauhati

Yoon Kim Sam Wiseman and Alexander M Rush2018 A tutorial on deep latent variable models ofnatural language arXiv preprint arXiv181206834

Diederik P Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In InternationalConference on Learning Representations (ICLR)

Johann-Mattis List 2012 SCA Phonetic alignmentbased on sound classes In M Slavkovik and D Las-siter editors New directions in logic language andcomputation pages 32ndash51 Springer Berlin Heidel-berg

Colin P Masica 1991 The Indo-Aryan languagesCambridge University Press Cambridge

Yaron Matras 2002 Romani ndash A Linguistic Introduc-tion Cambridge University Press Cambridge

R S McGregor 1968 The language of Indrajit of Or-cha Cambridge University Press Cambridge

Yugo Murawaki 2015 Continuous space representa-tions of linguistic typology and their application tophylogenetic inference In Proceedings of the 2015Conference of the North American Chapter of theAssociation for Computational Linguistics HumanLanguage Technologies pages 324ndash334

John Nerbonne 2009 Data-driven dialectology Lan-guage and Linguistics Compass 3(1)175ndash198

John Nerbonne and Wilbert Heeringa 2001 Computa-tional comparison and classification of dialects Di-alectologia et Geolinguistica 969ndash83

119

Thomas Oberlies 2005 A historical grammar ofHindi Leykam Graz

Jonathan K Pritchard Matthew Stephens and Pe-ter Donnelly 2000 Inference of population struc-ture using multilocus genotype data Genetics155(2)945ndash959

Jelena Prokic and John Nerbonne 2008 Recognisinggroups among dialects International journal of hu-manities and arts computing 2(1-2)153ndash172

Taraka Rama and Cagrı Coltekin 2016 Lstm autoen-coders for dialect analysis In Proceedings of theThird Workshop on NLP for Similar Languages Va-rieties and Dialects (VarDial3) pages 25ndash32

Taraka Rama Cagrı Coltekin and Pavel Sofroniev2017 Computational analysis of gondi dialectsIn Proceedings of the Fourth Workshop on NLPfor Similar Languages Varieties and Dialects (Var-Dial) pages 26ndash35

Rajesh Ranganath Linpeng Tang Laurent Charlin andDavid Blei 2015 Deep exponential families InArtificial Intelligence and Statistics pages 762ndash771

Ger Reesink Ruth Singer and Michael Dunn 2009Explaining the linguistic diversity of Sahul usingpopulation models PLoS Biology 7e1000241

Caley Smith 2017 The dialectology of Indic InJared Klein Brian Joseph and Matthias Fritz edi-tors Handbook of Comparative and Historical Indo-European Linguistics pages 417ndash447 De GruyterBerlin Boston

Jaroslav Strnad 2013 Morphology and Syntax of OldHindi Brill Leiden

Kaj Syrjanen Terhi Honkola Jyri Lehtinen AnttiLeino and Outi Vesakoski 2016 Applying popu-lation genetic approaches within languages Finnishdialects as linguistic populations Language Dy-namics and Change 6235ndash283

Paul Tedesco 1965 Turnerrsquos Comparative Dictionaryof the Indo-Aryan Languages Journal of the Amer-ican Oriental Society 85368ndash383

Matthew Toulmin 2009 From linguistic to sociolin-guistic reconstruction the Kamta historical sub-group of Indo-Aryan Pacific Linguistics ResearchSchool of Pacific and Asian Studies The AustralianNational University Canberra

Ralph L Turner 1962ndash1966 A comparative dictionaryof Indo-Aryan languages Oxford University PressLondon

Ralph L Turner 1975 [1967] Geminates after longvowel in Indo-aryan In RL Turner Collected Pa-pers 1912ndash1973 pages 405ndash415 Oxford UniversityPress London

Martijn Wieling Eliza Margaretha and John Ner-bonne 2012 Inducing a measure of phonetic simi-larity from pronunciation variation Journal of Pho-netics 40(2)307ndash314

Page 6: Toward a deep dialectological representation of Indo-Aryanweb.science.mq.edu.au/~smalmasi/vardial6/pdf/W19-1411.pdfAryan subgroup of Indo-European. I draw upon admixture models and

115

0 2000 4000 6000 8000 10000

1048

1046

1044

1042

1040

1038

10361e8

0 2000 4000 6000 8000 10000

426

424

422

420

418

416

414

412

1e7

Figure 2 Log posteriors for shallow model (left) anddeep model (right) for 10000 iterations over three ran-dom initializations

divergences are then scaled to three dimensionsusing multidimensional scaling Figure 3 gives avisualization of these transformed values via thered-green-blue color vector plotted on a map lan-guages with similar component distributions dis-play similar colors With a few exceptions (thatmay be artifacts of the fact that certain languageshave only a small number of data points associ-ated with them) a noticieable divide can be seenbetween languages of the main Indo-Aryan speechregion on one hand and languages of northwest-ern South Asia (dark blue) the Dardic languagesof Northern Pakistan and the Pahari languagesof the Indian Himalayas though this division isnot clear cut Romani and other Indo-Aryan va-rieties spoken outside of South Asia show affil-iation with multiple groups While Romani di-alects are thought to have a close genetic affin-ity with Hindi and other Central Indic languagesit was likely in contact with languages of north-west South Asian during the course of its speak-ersrsquo journey out of South Asia (Hamp 1987 Ma-tras 2002) However this impressionistic evalua-tion is by no means a confirmation that the deepmodel has picked up on linguistically meaningfuldifferences between speech varieties In the fol-lowing sections some comparison and evaluationmetrics and checks are deployed in order to assessthe quality of these modelsrsquo behavior

51 Entropy of distributions

I measure the average entropy of the modelrsquos pos-terior distributions in order to gauge the extent towhich the models are able to learn sparse informa-tive distributions over sound changes hidden stateactivations or other parameters concerning transi-tions through the model architecture Normalizedentropy is used in order to make entropies of distri-butions of different dimension comparable a dis-tributionrsquos entropy can be normalized by dividingby its maximum possible entropy

As mentioned above our data set consists ofOIA trigrams and the NIA segment correspondingto the second segment in the trigram representingrewrite rules operating between OIA and the NIAlanguages in our sample It is often the case thatmore than one NIA reflex is attested for a givenOIA trigram As such the sound changes that haveoperated in an NIA language can be representedas a collection of categorical distributions eachsumming to one I calculate the average of thenormalized entropies of these sound change dis-tributions as a baseline against which to compareentropy values for the modelsrsquo parameters Thepooled average of the normalized entropies acrossall languages is 11 while the average of averagesfor each language is 063

For the shallow model the parameter of interestis f(V ) the dialect component-level collection ofdistributions over sound changes the mean nor-malized entropy of which averaged across initial-izations but pooled across components within eachinitialization is 091 (raw values range from 0003to 1) For the deep model the average entropyof the dialect-level distributions over hidden-layeractivations f(W x) is only slightly lower at 086(raw values range from close to 0 to 1)

For each k isin 1 K I compute the for-ward pass of RNN(xwlW

xk W

hk W

l) for eachetymon w and each language l in which theetymon survives using the inferred values forW x

k Whk W

l and compute the entropy of eachf(hgtt W

l) yielding an average of 74 (raw val-ues range from close to 0 to 1) While these val-ues are still very high it is clear that the inclu-sion of a hidden layer has learned sparser poten-tially more meaningful distributions than the flatapproach and that increasing the dimensionalityof the hidden layer will likely bring about evensparser more meaningful distributions The en-tropies cited here are considerably higher than theaverage entropy of languagesrsquo sound change dis-tributions but the latter distributions do little to tellus about the internal clustering of the languages

52 Comparison with other linguisticdistance metrics

Here I compare the cluster membership inferredby this paperrsquos models against other measures oflinguistic distance Each method yields a pairwiseinter-language distance metric which can be com-pared against a non-linguistic measure I measure

116

assa1263awad1243

bagh1251

balk1252

beng1280

bhad1241 bhat1263

bhoj1244braj1242

brok1247

carp1235

cham1307

chil1275

chur1258

dhiv1236

dogr1250

doma1258

doma1260

garh1243

gawa1247

gran1245

guja1252

halb1244

hind1269

indu1241

jaun1243

kach1277

kala1372 kala1373

kalo1256

kang1280

kash1277

khet1238

khow1242 kohi1248

konk1267

kull1236

kuma1273

loma1235

maga1260

maha1287

mait1250

mara1378

marw1260

nepa1254

nort2665

nort2666

oriy1255

paha1251pang1282

panj1256

phal1254

savi1242

sera1259

shin1264

shum1235

sind1272

sinh1246

sint1235

sirm1239

sout2671

sout2672

tira1253

torw1241

vlax1238

wels1246

west2386

wota1240

0

20

40

60

0 25 50 75long

lat

Figure 3 Dialect group makeup of languages in sample under deep model

the correlation between each linguistic distancemeasure as well as great circle geographic distanceand patristic distance according to the Glottologphylogeny using Spearmanrsquos ρ

521 Levenshtein distanceBorin et al (2014) measure the normalized Lev-enshtein distances (ie the edit distance betweentwo strings divided by the length of the longerstring) between words for the same concept inpairs of Indo-Aryan languages and find that av-erage normalized Levenshtein distance correlatessignificantly with patristic distances in the Ethno-logue tree This paperrsquos dataset is not organized bysemantic meaning so for comparability I measurethe average normalized Levenshtein distance be-tween cognates in pairs of Indo-Aryan languageswhich picks up on phonological divergence be-tween dialects as opposed to both phonologicaland lexical divergence

522 Jensen-Shannon divergenceEach language in our dataset attests one or more(due to language contact analogy etc) outcomesfor a given OIA trigram yielding a collection ofsound change distributions as described aboveFor each pair of languages I compute the Jensen-Shannon divergence between sound change distri-butions for all OIA trigrams that are continued inboth languages and average these values This

gives a measure of pairwise average diachronicphonological divergence between languages

5.2.3 LSTM Autoencoder

Rama and Çöltekin (2016) and Rama et al. (2017) develop an LSTM-based method for representing the phonological structure of individual word forms across closely related speech varieties. Each string is fed to a unidirectional or bidirectional LSTM autoencoder, which learns a continuous latent multidimensional representation of the sequence. This embedding is then used to reconstruct the input sequence. The latent values in the embedding provide information that can be used to compute dissimilarity (in the form of cosine or Euclidean distance) between strings or across speech varieties (by averaging the latent values for all strings in each dialect or language). I use the bidirectional LSTM autoencoder described in the work cited in order to learn an 8-dimensional latent representation for all NIA forms in the dataset, training the model over 20 epochs on batches of 32 data points, using the Adam optimizer to minimize the categorical cross-entropy between the input sequence and the NIA reconstruction predicted by the model. I use the learned model parameters to generate a latent representation for each form. The latent representations are averaged across forms within each language, and pairwise linguistic Euclidean distances are computed between each averaged representation.

             Geographic    Genetic
Shallow JSD  -0.01         -0.03
Deep JSD      0.147*        0.008
LDN           0.346*        0.013
Raw JSD       0.302*       -0.051*
LSTM AE       0.158*       -0.068*
LSTM ED       0.084*        0.0001

Table 1: Spearman's ρ values for correlations between each linguistic distance metric (JSD = Jensen-Shannon Divergence, LDN = Levenshtein Distance Normalized, AE = Autoencoder, ED = Encoder-Decoder) and geographic and genetic distance. Asterisks represent significant correlations.
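A minimal Keras sketch of a bidirectional LSTM autoencoder along these lines; this is not the cited implementation, and the vocabulary size, padding length, embedding size, and the way the two directions are combined into an 8-dimensional code are all assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB = 40     # number of segment types in the NIA transcriptions (assumed)
MAXLEN = 12    # padded form length (assumed)
LATENT = 8     # latent dimensionality, as in Section 5.2.3

inp = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, 16)(inp)
# bidirectional encoder; forward and backward final states are concatenated
h = layers.Bidirectional(layers.LSTM(LATENT))(x)
# project the concatenated states down to the 8-dimensional latent code
z = layers.Dense(LATENT, name="latent")(h)
# decoder reconstructs the input string from the latent code
d = layers.RepeatVector(MAXLEN)(z)
d = layers.LSTM(32, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(d)

autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# X: integer-encoded NIA forms, shape (n_forms, MAXLEN); random placeholder here
X = np.random.randint(1, VOCAB, size=(500, MAXLEN))
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# per-form latent vectors; averaging these within each language and taking
# pairwise Euclidean distances yields a language-level distance measure
encoder = models.Model(inp, z)
form_vectors = encoder.predict(X, verbose=0)
```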

5.2.4 LSTM Encoder-Decoder

For the sake of completeness, I use an LSTM encoder-decoder to learn a continuous representation for every OIA-NIA string pair. This model is very similar to the LSTM autoencoder, except that it takes an OIA input and reconstructs an NIA output, instead of taking an NIA form as input and reconstructing the same string. I train the model as described above.
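A hedged sketch of such an encoder-decoder in Keras, using teacher forcing; the shared segment inventory, hidden sizes, and the way a continuous representation is read off for each OIA-NIA pair are assumptions rather than details taken from the cited setup.

```python
from tensorflow.keras import layers, models

VOCAB = 40    # shared OIA/NIA segment inventory size (assumed)
LATENT = 8

src = layers.Input(shape=(None,))   # integer-encoded OIA form
tgt = layers.Input(shape=(None,))   # integer-encoded NIA form, shifted right
emb = layers.Embedding(VOCAB, 16)

# the encoder reads the OIA string; its final states initialize the decoder
_, h, c = layers.LSTM(LATENT, return_state=True)(emb(src))
dec = layers.LSTM(LATENT, return_sequences=True)(emb(tgt), initial_state=[h, c])
out = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(dec)

seq2seq = models.Model([src, tgt], out)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# training mirrors the autoencoder above: 20 epochs, batches of 32, Adam;
# how a per-pair continuous representation is then extracted is not shown here
```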

5.3 Correlations

Table 1 gives correlation coefficients (Spearman's ρ) between linguistic distance metrics and non-linguistic distance metrics. In general, correlations with Glottolog patristic distance are quite poor. This is surprising for Levenshtein Distance Normalized, given the high correlation with patristic distance reported by Borin et al. (2014). Given that the authors measured Levenshtein distance between identical concepts in pairs of languages, and not cognates, as I do here, it is possible that lexical divergence carries a stronger genetic signal than phonological divergence, at least in the context of Indo-Aryan (it is worth noting that I did not balance the tree as described by the authors; it is not clear that this would have yielded any improvement). On the other hand, the Levenshtein distance measured in this paper correlates significantly with great-circle distance, indicating a strong geographic signal. Average Jensen-Shannon divergence between pairs of languages' sound change distributions shows a strong association with geographic distance as well.

Divergences/distances based on the deep model, the LSTM Autoencoder, and the LSTM Encoder-Decoder show significant correlations with geospatial distance, albeit lower ones. It is not entirely clear what accounts for this disparity. Intuitively, we expect more shallow chronological features to correlate with geographic distance. It is possible that the LSTM and RNN architectures are picking up on chronologically deeper information and show a low geographic signal for this reason, though this highly provisional idea is not borne out by any genetic signal.

It is not clear how to assess the meaning of these correlations at this stage. Nevertheless, deep architectures provide an interesting direction for future research into sound change and language contact, as they have the potential to disaggregate a great deal of information regarding interacting forces in language change that is censored when raw distance measures are computed directly from the data.

6 Outlook

This paper explored the consequences of adding hidden layers to models of dialectology where the languages have experienced too much contact for phylogenetic models to be appropriate, but have diversified to the extent that traditional dialectometric approaches are not applicable. While the model requires some refinement, its results point in a promising direction. Modifying prior distributions could potentially produce more informative results, as could tweaking hyperparameters of the learning algorithms employed. Additionally, it is likely that the model will benefit from hidden layers of higher dimension J, as well as bidirectional approaches, and despite the misgivings regarding LSTMs and GRUs stated above, future work will probably benefit from incorporating these and related architectures (e.g., attention). Additionally, the models used in this paper assumed discrete latent variables, attempting to be faithful to the traditional historical-linguistic notion of intimate borrowing between discrete dialect groups. However, continuous-space models may provide a more flexible framework for addressing the questions asked in this paper (cf. Murawaki, 2015).

This paper provides a new way of looking at dialectology and linguistic affiliation; with refinement and expansion, it is hoped that this and related models can further our understanding of the history of the Indo-Aryan speech community and can generalize to new linguistic scenarios. It is hoped that methodologies of this sort can join forces with similar tools designed to investigate the interaction of regularly conditioned sound change and chronologically deep language contact in individual languages' histories.

References

George Baumann. 1975. Drei Jaina-Gedichte in Alt-Gujarātī: Edition, Übersetzung, Grammatik und Glossar. Franz Steiner, Wiesbaden.

Leonard Bloomfield. 1933. Language. Holt, Rinehart and Winston, New York.

Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3137–3144.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110:4224–4229.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague. Association for Computational Linguistics.

Alexandre Bouchard-Côté, Percy S. Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems, pages 169–176.

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. Proceedings of the Twentieth Conference on Computational Natural Language Learning (CoNLL).

Chundra Cathcart. To appear. A probabilistic assessment of the Indo-Aryan Inner-Outer Hypothesis. Journal of Historical Linguistics.

Ashwini Deo. 2018. Dialects in the Indo-Aryan landscape. In Charles Boberg, John Nerbonne, and Dominic Watt, editors, The Handbook of Dialectology, pages 535–546. John Wiley & Sons, Oxford.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf.

Jeffrey Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Murray B. Emeneau. 1966. The dialects of Old Indo-Aryan. In Jaan Puhvel, editor, Ancient Indo-European Dialects, pages 123–138. University of California Press, Berkeley.

Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2017. Glottolog 3.3. Max Planck Institute for the Science of Human History.

Eric P. Hamp. 1987. On the sibilants of Romani. Indo-Iranian Journal, 30(2):103–106.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill.

Banikanta Kakati. 1941. Assamese, its formation and development. Government of Assam, Gauhati.

Yoon Kim, Sam Wiseman, and Alexander M. Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Johann-Mattis List. 2012. SCA: Phonetic alignment based on sound classes. In M. Slavkovik and D. Lassiter, editors, New Directions in Logic, Language and Computation, pages 32–51. Springer, Berlin/Heidelberg.

Colin P. Masica. 1991. The Indo-Aryan Languages. Cambridge University Press, Cambridge.

Yaron Matras. 2002. Romani – A Linguistic Introduction. Cambridge University Press, Cambridge.

R. S. McGregor. 1968. The Language of Indrajit of Orcha. Cambridge University Press, Cambridge.

Yugo Murawaki. 2015. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

John Nerbonne and Wilbert Heeringa. 2001. Computational comparison and classification of dialects. Dialectologia et Geolinguistica, 9:69–83.

Thomas Oberlies. 2005. A Historical Grammar of Hindi. Leykam, Graz.

Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Jelena Prokić and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1-2):153–172.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.

Taraka Rama, Çağrı Çöltekin, and Pavel Sofroniev. 2017. Computational analysis of Gondi dialects. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 26–35.

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. 2015. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771.

Ger Reesink, Ruth Singer, and Michael Dunn. 2009. Explaining the linguistic diversity of Sahul using population models. PLoS Biology, 7:e1000241.

Caley Smith. 2017. The dialectology of Indic. In Jared Klein, Brian Joseph, and Matthias Fritz, editors, Handbook of Comparative and Historical Indo-European Linguistics, pages 417–447. De Gruyter, Berlin/Boston.

Jaroslav Strnad. 2013. Morphology and Syntax of Old Hindi. Brill, Leiden.

Kaj Syrjänen, Terhi Honkola, Jyri Lehtinen, Antti Leino, and Outi Vesakoski. 2016. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change, 6:235–283.

Paul Tedesco. 1965. Turner's Comparative Dictionary of the Indo-Aryan Languages. Journal of the American Oriental Society, 85:368–383.

Matthew Toulmin. 2009. From Linguistic to Sociolinguistic Reconstruction: The Kamta Historical Subgroup of Indo-Aryan. Pacific Linguistics, Research School of Pacific and Asian Studies, The Australian National University, Canberra.

Ralph L. Turner. 1962–1966. A Comparative Dictionary of Indo-Aryan Languages. Oxford University Press, London.

Ralph L. Turner. 1975 [1967]. Geminates after long vowel in Indo-Aryan. In R. L. Turner, Collected Papers 1912–1973, pages 405–415. Oxford University Press, London.

Martijn Wieling, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2):307–314.
