1
Do collocaon-based networks share non-trivial stascal paerns with dependency- based networks? Paerns of dependency-based networks (Ferrer i Cancho et al. 2004). Small world effect: compactness, correlated with low average shortest path length and high clustering coeffi- cient (Was and Strogatz 1998). Disassortave mixing: negave assortavity. Collocaon-based networks share these paerns. Results for disassortave mixing are flo- ang and language-specific. Results for small world effect are more coherent. Loss of non-configuraonality influences word order. Does it have a clear correlate in collo- caon-based networks? Predicon: because modifiers can collocate less freely in the late variety, small world effect should weaken. Contradicted by the results. Why? Pragmacs, have influence on word order, too. Data sparseness, due to treebank smal- lness and stylisc choices, provide an insufficient range of lemmas for representaveness. Non-configuraonality in Collocaon-based Networks of Ancient Greek and Lan Edoardo Maria Pon, Università degli Studi di Pavia [email protected] Ancient Indo-European languages displayed and progressively lost non-configuraonality. Non-configuraonal dependent-marking languages are characterized by free word order, disconnuous constuents, and high frequency of null anaphora (Luraghi 2010, Baker 2001). Aim of the work: diachronic survey over disconnuous constuents in Ancient Greek and Lan. In example (1): genive noun and adjecve modifiers detached from the respecve head noun. NON-CONFIGURATIONALITY Data from the dependency treebank PROIEL (Pragmac Resources in Old Indo-Europe- an Languages) (Haug & Jøhndal 2008). Treebank: a corpus of texts annotated at the syntacc layer. Dependency: an oriented relaon between two words labelled by funcon. I created four sub-treebanks levelled off to the same token amount, as a combinaon of two parameters, language and diachronic variety. See Table 1. Ancient Greek Lan Classical Late Herodotus, Histories, 440-429 B.C. New Testament, 49-150 ca. A.D. Caesar, The Gallic War, 58-50 B.C. Cicero, Leers to Acus, 68-43 B.C. Jerome, Vulgate, 382-413 A.D. Table 1: Texts composing the treebanks for Classical Greek, Late Greek, Classical Lan and Late Lan. Network: a graph consisng verces and edges. Induced from a treebank by posing an equivalence between verces and word properes, edges and word relaons. Networks with lemma as property and dependency as relaon, in Figure 1, mirror the global structure of the language they represent (Ferrer i Cancho et al. 2004). Unfortunately, they dim its collocaonal side, and stand inadequate alone. Networks with linear precedence as relaon complement them. Verces can be either lemmas as in Figure 2 or deprels disambiguated by part-of-speech as in Figure 3. These three methods were applied to the four sub-treebanks to induce four networks. NETWORK INDUCTION Figure 1 - A lemma-based dependency network from the Ancient Greek treebank Figure 2 - A lemma-based collocaon network from the Ancient Greek treebank Figure 3 - A dependency-relaon-based collocaon network from the Ancient Greek treebank Extracted topological indices are framed in Table 2 for Ancient Greek and Table 3 for Lan. Average in/out degree = Average number of in and out edges for each vertex. Assortavity = Tendency of a vertex to link with others with a similar degree. Average shortest path length = Average of minimum number of edges to connect any pair of verces (Solé et al. 2010). Clustering coefficient = Probability that edges of any three verces form a triangle. METRICS AND RESULTS network nodes edges in/out- degree clustering coefficient shortest path length assortavity GC-lemma-dependency GC-lemma-collocaon GL-lemma-dependency GL-lemma-collocaon 5480 34613 6,316 0,197 3,547 -0,169 34673 6,303 20789 6,836 20882 6,822 0,346 0,269 0,433 3,872 3,096 3,421 -0,2 -0,196 -0,244 5501 3041 3061 network nodes edges in/out- degree clustering coefficient shortest path length assortavity GC-deprel-collocaon GL-deprel-collocaon LC-deprel-collocaon LL-deprel-collocaon 133 3621 27,226 0,696 1,889 -0,338 2855 21,305 3988 28,899 2943 21,64 0,599 0,642 0,621 2,042 2,021 1,917 -0,315 -0,288 -0,296 134 138 136 network nodes edges in/out- degree clustering coefficient shortest path length assortavity LC-lemma-dependency LC-lemma-collocaon LL-lemma-dependency LL-lemma-collocaon 4933 34709 7,036 0,174 3,326 -0,243 39794 8,015 21719 6,981 25294 8,123 0,241 0,23 0,362 3,007 3,309 3,112 -0,221 -0,228 -0,22 4965 3111 3114 Table 2: Results of the network analysis of the Ancient Greek treebanks. Table 3: Results of the network analysis of the Lan treebanks. DISCUSSION Data sparseness alleviated through collocaon-based networks with deprels disambiguated by part of speech as verces. The number of verces and edges is now stable, and results in Table 4 fully abide by the expected small world effect re- ducon. In-depth evaluaon: behavior of noun and adjecve modifiers, labelled atr_N and atr_A, in Figures 4.a-d. Closeness centrality = Inverse of the sum of the length of the shortest paths to connect a vertex with all the others. Betweenness centrality = Times a shortest path between two other verces passes through a given one. As for results in Table 5, both in-degree and out-degree decrease, since leſt and right contexts of a word are impoverished. Clustering coefficient and average path length follow the general trend. Finally, both the metrics for closeness drop. IN-DEPTH EVALUATION REFERENCES Figure 4.a Subnetwork for atr_A and atr_N in Classical Greek Figure 4.c Classical Lan Figure 4.b Late Greek Figure 4.d Late Lan Table 4: Values of the metrics for dependency-relaon-based collocaon networks. name in- degree out- degree clustering coefficient shortest path length closeness centrality betweenness centrality GC atr_N GL atr_N GC atr_A GL atr_A LC atr_ N LL atr_N LC atr_A LL atr_A 55 69 0,448 1,477 0,677 0,014 61 0,434 74 0,413 64 0,365 1,543 1,438 1,528 0,655 0,648 0,695 0,005 0,02 0,019 31 72 51 73 49 90 57 82 67 87 62 0,385 0,366 0,334 0,386 1,416 1,504 1,372 1,55 0,706 0,665 0,729 0,645 0,026 0,015 0,048 0,021 Table 5: In-depth analysis for aribuve adjecves atr_A and modifier nouns atr_N in the four networks. Baker, Mark. 2001. Configuraonality and polysynthesis. In M. Haspelmath, E. König, W. Oesterreicher and W. Raible (eds.), Language Typology and Language Universals. An Internaonal Handbook. Berlin, New York: Mouton de Gruyter, vol. 2, pp. 1433–1441. Ferrer i Cancho, Ramon, Ricard V. Solé and Rein-hard Köhler. 2004. Paerns in syntacc dependency networks. Physical Review E69, 051915(8). Haug, Dag T. T. and Marius L. Jøhndal. 2008. Creang a Parallel Treebank of the Old Indo-European Bible Translaons. In C. Sporleder and K. Ribarov (eds.), Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34. Herman, József. 1985. La disparion de la déclinaison lane et l’évoluon du syntagme nominal in Syntaxe et lan. In C. Touraer (ed.), Syntaxe et lan. Actes du IIe Congrès Internaonal de Linguisque Lane. Aix-en-Pro vence: Université de Provence-Laffie, pp. 345–355. Hewson, John and Vít Bubeník. 2006. From Case to Adposion: the Development of Configuraonal Syntax in Indo-European Languages. Current Issues in Linguisc Theory, 280. Amsterdam, Philadelphia: John Benjamins. Luraghi, Silvia. 2010. The Rise (and Possible Downfall) of Configuraonality. In S. Luraghi and V. Bubeník (eds.), Connuum companion to historical linguiscs. Bloomsbury Publishing, pp. 212-229. Solé, Ricard V., Bernat Corominas-Murtra, Sergi Valverde and Luc Steels. 2010. Language networks: Their structure, funcon, and evoluon. Complexity, 15(6):20-26. Was, Duncan J. and Steven H. Strogatz. 1998. Collecve dynamics of ‘small-world’ networks. Nature, 393:440-442. DEPENDENCY TREEBANK In the classical variety of Ancient Greek, noun phrases, contrary to adposional phrases, preserved disconnuity. They became rarer in the late variety (Hewson & Bubenik 2006). In Lan, the frequency of hyperbata dropped from 20 percent of NPs in classical texts to 4 percent in Caena Trimalchionis (Herman 1985:352). (1) Arm-a vir-um=que can-o, Troi-ae qu-i prim-us ab or-is Itali-am, fat-o profug-us, Lavini-a=que veni-t litor-a weapon(N)-ACC.PL man(M)-ACC.SG=and sing-PRS.1SG Troy(F)-GEN REL-NOM.SG.M first-NOM.SG.M from shore(N)-ABL.PL Italy(F).ACC desny(N)-ABL.SG fugive-NOM.SG.M Lavinian-ACC.PL.N=and come.PRF-3SG strand(N)-ACC.PL ‘I sing the arms and the man, who, exiled by desny, first came from the Trojan shores to Italy and to the Lavinian strand.’ Virgil, Aeneid I 1–3

Non-configurationality in Collocation-based Networks of ...indoeuropean.wdfiles.com/local--files/posters/Ponti_Pavia2015_final.pdfNon-configurationality in Collocation-based Networks

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Non-configurationality in Collocation-based Networks of ...indoeuropean.wdfiles.com/local--files/posters/Ponti_Pavia2015_final.pdfNon-configurationality in Collocation-based Networks

Do collocation-based networks share non-trivial statistical patterns with dependency-based networks?Patterns of dependency-based networks (Ferrer i Cancho et al. 2004). Small world effect: compactness, correlated with low average shortest path length and high clustering coeffi-cient (Watts and Strogatz 1998). Disassortative mixing: negative assortativity. Collocation-based networks share these patterns. Results for disassortative mixing are flo-ating and language-specific. Results for small world effect are more coherent.

Loss of non-configurationality influences word order. Does it have a clear correlate in collo-cation-based networks? Prediction: because modifiers can collocate less freely in the late variety, small world effect should weaken. Contradicted by the results. Why?Pragmatics, have influence on word order, too. Data sparseness, due to treebank smal-lness and stylistic choices, provide an insufficient range of lemmas for representativeness.

Non-configurationality in Collocation-based Networks of Ancient Greek and Latin

Edoardo Maria Ponti, Università degli Studi di Pavia [email protected]

Ancient Indo-European languages displayed and progressively lost non-configurationality.Non-configurational dependent-marking languages are characterized by free word order, discontinuous constituents, and high frequency of null anaphora (Luraghi 2010, Baker 2001).Aim of the work: diachronic survey over discontinuous constituents in Ancient Greek and Latin.In example (1): genitive noun and adjective modifiers detached from the respective head noun.

NON-CONFIGURATIONALITY

Data from the dependency treebank PROIEL (Pragmatic Resources in Old Indo-Europe-an Languages) (Haug & Jøhndal 2008). Treebank: a corpus of texts annotated at the syntactic layer. Dependency: an oriented relation between two words labelled by function. I created four sub-treebanks levelled off to the same token amount, as a combination of two parameters, language and diachronic variety. See Table 1.

Ancient Greek

Latin

Classical Late

Herodotus, Histories, 440-429 B.C. New Testament, 49-150 ca. A.D.

Caesar, The Gallic War, 58-50 B.C.Cicero, Letters to Atticus, 68-43 B.C.

Jerome, Vulgate, 382-413 A.D.

Table 1: Texts composing the treebanks for Classical Greek, Late Greek, Classical Latin and Late Latin.

Network: a graph consisting vertices and edges. Induced from a treebank by positing an equivalence between vertices and word properties, edges and word relations. Networks with lemma as property and dependency as relation, in Figure 1, mirror the global structure of the language they represent (Ferrer i Cancho et al. 2004).Unfortunately, they dim its collocational side, and stand inadequate alone.Networks with linear precedence as relation complement them. Vertices can be either lemmas as in Figure 2 or deprels disambiguated by part-of-speech as in Figure 3.These three methods were applied to the four sub-treebanks to induce four networks.

NETWORK INDUCTION

Figure 1 - A lemma-based dependency network from the Ancient Greek treebank

Figure 2 - A lemma-based collocation network from the Ancient Greek treebank

Figure 3 - A dependency-relation-based collocation network from the Ancient Greek treebank

Extracted topological indices are framed in Table 2 for Ancient Greek and Table 3 for Latin.

Average in/out degree = Average number of in and out edges for each vertex.Assortativity = Tendency of a vertex to link with others with a similar degree. Average shortest path length = Average of minimum number of edges to connect any pair of vertices (Solé et al. 2010).Clustering coefficient = Probability that edges of any three vertices form a triangle.

METRICS AND RESULTS

network nodes edgesin/out-degree

clustering coefficient

shortest path length assortativity

GC-lemma-dependencyGC-lemma-collocationGL-lemma-dependencyGL-lemma-collocation

5480 34613 6,316 0,197 3,547 -0,16934673 6,30320789 6,83620882 6,822

0,3460,2690,433 3,872

3,0963,421

-0,2-0,196-0,244

550130413061

network nodes edgesin/out-degree

clustering coefficient

shortest path length assortativity

GC-deprel-collocationGL-deprel-collocationLC-deprel-collocationLL-deprel-collocation

133 3621 27,226 0,696 1,889 -0,3382855 21,3053988 28,8992943 21,64

0,5990,6420,621 2,042

2,0211,917

-0,315-0,288-0,296

134138136

network nodes edgesin/out-degree

clustering coefficient

shortest path length assortativity

LC-lemma-dependencyLC-lemma-collocationLL-lemma-dependencyLL-lemma-collocation

4933 34709 7,036 0,174 3,326 -0,24339794 8,01521719 6,98125294 8,123

0,2410,230,362 3,007

3,3093,112

-0,221-0,228-0,22

496531113114

Table 2: Results of the network analysis of the Ancient Greek treebanks.

Table 3: Results of the network analysis of the Latin treebanks.

DISCUSSION

Data sparseness alleviated through collocation-based networks with deprels disambiguated by part of speech as vertices. The number of vertices and edges is now stable, and results in Table 4 fully abide by the expected small world effect re-duction.

In-depth evaluation: behavior of noun and adjective modifiers, labelled atr_N and atr_A, in Figures 4.a-d.Closeness centrality = Inverse of the sum of the length of the shortest paths to connect a vertex with all the others. Betweenness centrality = Times a shortest path between two other vertices passes through a given one. As for results in Table 5, both in-degree and out-degree decrease, since left and right contexts of a word are impoverished.Clustering coefficient and average path length follow the general trend. Finally, both the metrics for closeness drop.

IN-DEPTH EVALUATION

REFERENCES

Figure 4.a Subnetwork for atr_A and atr_N in Classical Greek

Figure 4.c Classical LatinFigure 4.b Late Greek Figure 4.d Late Latin

Table 4: Values of the metrics for dependency-relation-based collocation networks.

name in-degree

out-degree

clusteringcoefficient

shortest path length

closenesscentrality

betweennesscentrality

GC atr_NGL atr_NGC atr_AGL atr_ALC atr_ NLL atr_NLC atr_ALL atr_A

55 69 0,448 1,477 0,6770,014

61 0,43474 0,41364 0,365

1,5431,4381,528 0,655

0,6480,695

0,0050,020,019

31725173499057

82678762

0,3850,3660,3340,386

1,4161,5041,3721,55

0,7060,6650,7290,645

0,0260,0150,0480,021

Table 5: In-depth analysis for attributive adjectives atr_A and modifier nouns atr_N in the four networks.

Baker, Mark. 2001. Configurationality and polysynthesis. In M. Haspelmath, E. König, W. Oesterreicher and W. Raible (eds.), Language Typology and Language Universals. An International Handbook. Berlin, New York: Mouton de Gruyter, vol. 2, pp. 1433–1441.Ferrer i Cancho, Ramon, Ricard V. Solé and Rein-hard Köhler. 2004. Patterns in syntactic dependency networks. Physical Review E69, 051915(8).Haug, Dag T. T. and Marius L. Jøhndal. 2008. Creating a Parallel Treebank of the Old Indo-European Bible Translations. In C. Sporleder and K. Ribarov (eds.), Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34.Herman, József. 1985. La disparition de la déclinaison latine et l’évolution du syntagme nominal in Syntaxe et latin. In C. Touratier (ed.), Syntaxe et latin. Actes du IIe Congrès International de Linguistique Latine. Aix-en-Pro vence: Université de Provence-Laffitte, pp. 345–355.Hewson, John and Vít Bubeník. 2006. From Case to Adposition: the Development of Configurational Syntax in Indo-European Languages. Current Issues in Linguistic Theory, 280. Amsterdam, Philadelphia: John Benjamins.Luraghi, Silvia. 2010. The Rise (and Possible Downfall) of Configurationality. In S. Luraghi and V. Bubeník (eds.), Continuum companion to historical linguistics. Bloomsbury Publishing, pp. 212-229.Solé, Ricard V., Bernat Corominas-Murtra, Sergi Valverde and Luc Steels. 2010. Language networks: Their structure, function, and evolution. Complexity, 15(6):20-26.Watts, Duncan J. and Steven H. Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature, 393:440-442.

DEPENDENCY TREEBANK

In the classical variety of Ancient Greek, noun phrases, contrary to adpositional phrases, preserved discontinuity. They became rarer in the late variety (Hewson & Bubenik 2006). In Latin, the frequency of hyperbata dropped from 20 percent of NPs in classical texts to 4 percent in Caena Trimalchionis (Herman 1985:352).

(1) Arm-a vir-um=que can-o, Troi-ae qu-i prim-us ab or-is Itali-am, fat-o profug-us, Lavini-a=que veni-t litor-a weapon(N)-ACC.PL man(M)-ACC.SG=and sing-PRS.1SG Troy(F)-GEN REL-NOM.SG.M first-NOM.SG.M from shore(N)-ABL.PL Italy(F).ACC destiny(N)-ABL.SG fugitive-NOM.SG.M Lavinian-ACC.PL.N=and come.PRF-3SG strand(N)-ACC.PL ‘I sing the arms and the man, who, exiled by destiny, first came from the Trojan shores to Italy and to the Lavinian strand.’ Virgil, Aeneid I 1–3