Universality in Syntactic Dependency Networks

Ramón Ferrer, Ricard V. Solé, Reinhard Köhler

SFI WORKING PAPER: 2003-06-042

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

www.santafe.edu

SANTA FE INSTITUTE

Universality in syntactic dependency networks




Universality in syntactic dependency networks

Ramón Ferrer i Cancho,1,∗ Ricard V. Solé,1,2 and Reinhard Köhler3

1ICREA-Complex Systems Lab, Universitat Pompeu Fabra, Dr. Aiguader 80, 08003 Barcelona, Spain
2Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA

3Universitat Trier, FB2/LDV, D-54268 Trier, Germany
(Dated: June 11, 2003)

Many languages are spoken on Earth. Despite their diversity, many robust language universals are known to exist. All languages share syntax, i.e. the ability to combine words to form sentences. The origin of this trait is an open debate. Most linguistic universals are defined in a way that strictly confines them to a linguistic context. This is not the case for the previously unreported potential syntactic universals presented here. Using recent developments from the statistical physics of complex networks, we show that different syntactic dependency networks (from Czech, German and Romanian) share many non-trivial statistical patterns, such as the small world phenomenon, scaling in the distribution of degrees and disassortative mixing. Such previously unreported features of syntax organization are not a trivial consequence of the structure of sentences, but an emergent trait at the global scale. Our results strongly suggest that existing languages might belong to the same universality class in the sense in which it is defined in physics.

PACS numbers: 89.75.-k, 89.20.-a
Keywords: complex networks, small-world, scaling, human language, linguistic universals.

I. INTRODUCTION

There is no agreement about the number of languages spoken on Earth, but estimates are in the range from 3,000 to 10,000 [1]. World languages exhibit a vast array of structural similarities and differences. Empirically working scholars follow two major strategies for deepening their understanding of the human language faculty and of language diversity. One is concerned with finding linguistic universals, i.e. properties common to all languages. The other is concerned with the differences among languages, which is the aim of typology.

The search for linguistic universals has to face a basic problem. On the one hand, if a property is very general then it is likely to be satisfied by all languages, but it is also likely to carry little information (e.g. all languages have vowels). On the other hand, if a property is more specific, there is a high risk that it is only satisfied by a limited set of languages. Another general problem of most linguistic universals found so far [2–4] is that they are not easily portable to other disciplines or to a more general framework where they can be compared and further understood. If a linguistic universal is defined in terms of the position of a certain type of word in a sentence [2], tentatively no comparison can be made with non-linguistic systems. In contrast, when studying the universal distribution of word frequencies, the so-called Zipf's law for word frequencies [5], it has been hypothesized that the underlying process might be essentially the same as the one behind solid ice melting into liquid water [6].

In this context, a recent study has shown that the presence of scaling in word frequency distributions might be a natural result of an optimization process involving a sudden phase transition [7]. In other words, scaling laws in human language would be the result of universal phenomena such as those familiar to statistical physicists. Interestingly, Zipf's-law-like distributions appear in many non-linguistic domains [8–11]. Zipf's law is portable.

∗Corresponding author, email: [email protected]

Common language universals are just empirical generalizations resulting from inductive studies. Linguists in that field are well aware of this fact, and of the fact that universals have to be explained themselves (which means they are not laws but general observations together with inductively formulated hypotheses). In contrast, physics has its own understanding of universal laws, where the macroscopic regularities of different systems are explained by the same basic mechanism. Critical systems, for example, are grouped into universality classes that only depend on dimensionality, the symmetry of the order parameter and the symmetry and range of interactions [12–14]. Statistical physics thus provides a well-defined understanding of universality that is largely system independent. This is certainly not the common view in linguistics [3]. Empirical evidence supports the possibility that a large number of systems arising in disparate disciplines such as physics, biology and economics might share some key properties involving their large-scale organization [14]. One of the most remarkable of these universal laws is related to scale invariance, i.e. the presence of a hierarchical organization that repeats itself at very different scales. Using tools from statistical physics, we present evidence for previously unreported syntactic universals that are both portable and tentatively specific enough.

Most linguistic research is done in the domain of descriptive approaches (e.g. Chomsky's standard and extended theories [15]). But descriptions are not explanations. Linguistic explanation is not possible without

FIG. 1: A. The syntactic structure of a simple sentence. Here words define the nodes in a graph and the binary relations (arcs) represent syntactic dependencies. Here we assume arcs go from a modifier to its head. B. Mapping the syntactic dependency structure of the sentence in A onto a global syntactic dependency network.

the construction of a linguistic theory containing universal language laws [16]. The aim of the present paper is to investigate potential syntactic universals, which in turn may provide clues for understanding the origins of language. Since syntax is a crucial feature of human language uniqueness [17, 18], we will focus on syntactic universals. Different non-excluding positions are taken for explaining linguistic universals; to cite some examples, an underlying universal grammar [15], genetic encoding [19, 20] or functional constraints [21–23]. Syntax involves a set of rules for combining words into phrases and sentences. Such rules ultimately define explicit syntactic relations among words that can be directly mapped into a graph capturing most of the global features of the underlying rules. Such a network-based approach has provided new insights into semantic networks [24–27]. Capturing global syntactic information using a network has been attempted before. The global structure of word interactions in short contexts within sentences has been studied [28, 29]. Although about 70% of syntactic relationships take place at a distance lower than or equal to 2 [30], such early work both lacks a linguistically precise definition of link and fails to capture the characteristic long-distance correlations of words in sentences [31]. A precise definition of syntactic link is thus required. In this paper we study the architecture of syntactic graphs and show that they display small world patterns, scale-free structure, a well-defined hierarchical organization and disassortative mixing [32–34]. Three different European languages will be used, all of which share similar statistical features. The paper is organized as follows. The three datasets are presented together with a brief definition of the procedure used for building the networks in Section II. The key measures used in this study are presented in Section III, with the basic results reported in Section IV. A comparison between sentence-level patterns and global patterns is presented in Section V. A general discussion and summary is given in Section VI.

II. THE SYNTACTIC DEPENDENCY NETWORK

The networks analyzed here have been defined according to the dependency grammar formalism. Dependency grammar is a family of grammatical formalisms [35–37] which share the assumption that syntactic structure consists of lexical nodes (representing words) and binary relations (dependencies) linking them. This formalism thus naturally defines a network structure. In this approximation, a dependency relation connects a pair of words. Most links are directed, and the arc usually goes from the head word to its modifier. In some cases, such as coordination, there is no clear direction [38]. Since such cases are rather uncommon, we will hereafter assume links have a direction and assign an arbitrary direction to the undirected cases. Syntactic relations are thus binary, usually directed and sometimes typed in order to distinguish different kinds of dependency (Fig. 1 A).

We define a syntactic dependency network as a set of n words V = {si}, (i = 1, ..., n) and an adjacency matrix A = {aij}. si can be a modifier word of the head sj in a sentence if aij = 1 (aij = 0 otherwise). Here, we assume arcs go from a modifier to its head. The syntactic dependency structure of a sentence can be seen as a subset of all possible syntactic links contained in a global network (Fig. 1 B). More precisely, the structure of a sentence is a subgraph (a tree) of the global network that is induced by the words in the sentence [39].
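The mapping just described can be sketched in a few lines. The snippet below is an illustration under our own assumptions, not the authors' code: the helper name build_global_network and the toy sentences are invented. It aggregates per-sentence dependency arcs, given as (modifier, head) pairs, into one global directed graph, discarding loops and keeping the largest weakly connected component, as described later in this section.

```python
# Sketch (not the authors' code): aggregating sentence dependency trees
# into one global syntactic dependency network.
import networkx as nx

def build_global_network(sentences):
    """a_ij = 1 (an edge) if word i modified word j in at least one sentence.

    Loops (arcs from a word to itself) are discarded, and the analysis is
    restricted to the largest weakly connected component.
    """
    g = nx.DiGraph()
    for arcs in sentences:
        if len(arcs) < 1:          # skip sentences with fewer than two words
            continue
        for modifier, head in arcs:
            if modifier != head:   # reject loops
                g.add_edge(modifier, head)
    largest = max(nx.weakly_connected_components(g), key=len)
    return g.subgraph(largest).copy()

sentences = [
    [("John", "has"), ("apples", "has")],     # the Fig. 1 example
    [("Mary", "eats"), ("apples", "eats")],   # an invented second sentence
]
net = build_global_network(sentences)
print(net.number_of_nodes(), net.number_of_edges())
```

The two sentences share the word "apples", so their trees merge into a single five-node component, exactly the kind of overlap that turns isolated sentence trees into a global network.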

Different measures can be defined on A, allowing us to test for the presence of certain interesting features such as the small-world effect [40] and scale invariance [41]. Such measures can also be used for finding similarities and differences among different networks (see Section III).

The common formal property of dependency representations (compared to other syntactic representations) is the lack of explicit encoding for phrases, as in the phrase structure formalism [31] and later developments [15]. Dependency grammar regards phrases as emergent patterns of syntactic dependency interactions. Statistical studies of phrase-structure-based grammars have been performed and reveal that the properties of syntactic constructs map to only a few distributions [42, 43], suggesting a reduced set of principles behind syntactic structures.

We studied three global syntactic dependency networks from three European languages: Czech, German and Romanian. Because of the reduced availability of data, the language set is unintentionally restricted to the Slavic, Germanic and Italic families. These languages are not intended to be representative of every family. We are not taking the common inductive position in the study of linguistic universals. We mention the families these languages belong to in order to show how distant they are. They are probably not distant enough for standard methods in linguistics, but distant enough for our concerns here. Syntactic dependency networks were built by collecting all words and syntactic dependency links appearing in three corpora (a corpus is a collection of sentences). aij = 1 if an arc from the i-th word to the j-th word has appeared in a sentence at least once, and aij = 0 otherwise. Punctuation marks and loops (arcs from a word to itself) were rejected in all three corpora. The study was performed on the largest connected component of the networks. Sentences with less than two words were rejected.

The corpora analyzed here are:

1. The Czech corpus was The Czech National Corpus 1.0, containing 562,820 words and 31,701 sentences. Many sentence structures are incomplete in this corpus (i.e. they have less than n − 1 links, where n is the length of the sentence). The proportion of links provided with regard to the theoretical maximum is about 0.65. The structure of sentences was determined by linguists by hand.

2. The Romanian corpus was formed by all sample sentences on the Dependency Grammar Annotator website [62]. It contains 21,275 words and 2,340 sentences. The syntactic annotation was performed by hand.

3. The German corpus is The NeGra Corpus 1.0. It contains 153,007 words and 10,027 sentences. The formalism used is based on phrase structure grammar. Nonetheless, for certain constituents, the head word is indicated. Only the head-modifier links between words at the same level of the derivation tree were collected. The syntactic annotation was performed automatically. The proportion of links provided with regard to the theoretical maximum is about 0.16.

The German corpus is the most sparse of them. It is important to notice that while the missing links in the German corpus obey no clear regularity, the links missing from the Czech corpus mostly involve function words, especially prepositions, which the annotators did not link because they treated them as grammatical markers. The missing links are thus those corresponding to the most connected word types in the remaining corpora.

III. SMALL WORLDS, BETWEENNESS, HIERARCHY AND DISASSORTATIVE MIXING IN SYNTAX GRAPHS

In order to properly define novel universals in syntax, we need to consider several statistical measures. These measures allow us to categorize networks in terms of:

1. Small world structure. Two key quantities allow us to characterize the global organization of a complex network. The first is the so-called average path length D, defined as D = 〈Dmin(i, j)〉 over all pairs (si, sj) in the network, where Dmin(i, j) indicates the length of the shortest path between two nodes.


FIG. 2: Shortest path length distributions N(D) for the three syntactic networks analyzed here. The symbols correspond to Romanian (circles), Czech (triangles) and German (squares), respectively. The three distributions are peaked around an average distance of D ≈ 3.5 degrees of separation. The expected distribution for a Poissonian graph with the same average distance is also shown (filled triangles).

The second measure is the so-called clustering coefficient C, defined as the probability that two vertices (e.g. words) that are neighbors of a given vertex are neighbors of each other. C is easily defined from the adjacency matrix as:

C = \left\langle \frac{2}{k_i(k_i-1)} \sum_{j=1}^{n} a_{ij} \left[ \sum_{k \in \Gamma_i} a_{jk} \right] \right\rangle \quad (1)

Erdos-Renyi graphs have a binomial degree distribution that can be approximated by a Poissonian distribution [32–34]. Erdos-Renyi graphs with an average degree 〈k〉 are such that Crandom ≈ 〈k〉/(n − 1), and the path length follows [44]:

D_{random} \approx \frac{\log n}{\log \langle k \rangle} \quad (2)

It is said that a network exhibits the small-world phenomenon when D ≈ Drandom [40]. The key difference between an Erdos-Renyi graph and a real network is often C ≫ Crandom [32–34].

2. Heterogeneity. A different type of characterization of the statistical properties of a complex network is given by the degree distribution P(k). Although the degree distribution of Erdos-Renyi graphs is Poissonian, most complex networks are actually characterized by highly heterogeneous distributions: they can be described by a degree distribution P(k) ∼ k^{−γ}φ(k/kc), where φ(k/kc) introduces a cut-off at some characteristic scale kc. The simplest test of scale invariance is thus performed by looking at P(k), the probability that a vertex has degree k, often obeying [32–34]

P(k) \sim k^{-\gamma}

3. Hierarchical organization. Some scaling properties indicate the presence of hierarchical organization and modularity in complex networks. When studying C(k), i.e. the clustering coefficient as a function of the degree k, certain networks have been shown to behave as [45, 46]

C(k) \sim k^{-\theta} \quad (3)

with θ ≈ 1 [45]. Hierarchical patterns are especially important here, since the tree-like structures derived from the analysis of sentence structure strongly suggest a hierarchy.

4. Betweenness centrality. While many real networks exhibit scaling in their degree distributions, the value of the exponent γ is not universal. The betweenness centrality distribution is significantly less variable and provides an elegant way of classifying networks [47]. The betweenness centrality of a vertex v, g(v), is a measure of the number of minimum distance paths running through v, defined as [47]

g(v) = \sum_{i \neq j} \frac{G_v(i,j)}{G(i,j)}

where G_v(i,j) is the number of shortest pathways between i and j running through v and G(i,j) = \sum_v G_v(i,j). Many real networks obey

P(g) \sim g^{-\eta}

where P(g) is the proportion of vertices whose betweenness centrality is g. The betweenness centrality was calculated using [48].

5. Assortativeness.

A network is said to show assortative mixing if the nodes in the network that have many connections tend to be connected to other nodes with many connections. A network is said to show disassortative mixing if the highly connected nodes tend to be connected to nodes with few connections. The Pearson correlation coefficient Γ defined in [49] measures the type of mixing, with Γ > 0 for assortative mixing and Γ < 0 for disassortative mixing. Such a correlation coefficient can be defined as

\Gamma = \frac{c \sum_i j_i k_i - \left[ c \sum_i \frac{1}{2}(j_i + k_i) \right]^2}{c \sum_i \frac{1}{2}(j_i^2 + k_i^2) - \left[ c \sum_i \frac{1}{2}(j_i + k_i) \right]^2} \quad (4)

where ji and ki are the degrees of the vertices at the ends of the i-th edge, with i = 1, ..., m, c = 1/m and m being the number of edges. Disassortative mixing (Γ < 0) is shared by the Internet, the World-Wide Web, protein interactions, neural networks and food webs. In contrast, different kinds of social relationships are assortative (Γ > 0) [49, 50].
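All the measures above are available in standard network-analysis libraries. The sketch below, an illustration under our own assumptions rather than the authors' pipeline, computes D, C and Γ with networkx on a synthetic scale-free graph standing in for a corpus network, together with the Erdos-Renyi expectations Crandom and Drandom from the formulas above.

```python
import math
import networkx as nx

# Stand-in scale-free graph; a real analysis would load the corpus network.
g = nx.barabasi_albert_graph(2000, 3, seed=1)

D = nx.average_shortest_path_length(g)          # average path length
C = nx.average_clustering(g)                    # clustering coefficient
Gamma = nx.degree_assortativity_coefficient(g)  # Pearson mixing coefficient

# Erdos-Renyi expectations for the same size and mean degree.
n = g.number_of_nodes()
k_mean = 2 * g.number_of_edges() / n
C_random = k_mean / (n - 1)
D_random = math.log(n) / math.log(k_mean)       # Eq. (2)

print(f"D={D:.2f} (random {D_random:.2f}), C={C:.4f} "
      f"(random {C_random:.4f}), Gamma={Gamma:.3f}")
```

Comparing the measured C and D against C_random and D_random is exactly the small-world test applied in Section IV.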

FIG. 3: Cumulative degree distributions for the three corpora (Czech, German and Romanian). The proportions of vertices whose input degree and output degree is k are shown. The plots are computed using the cumulative distributions P_>(k) = \sum_{j \geq k} P(j). The fitted slopes are −0.99 (Czech), −1.37 (German) and −1.2 (Romanian) for the IN-distributions, and −0.98, −1.09 and −1.2 respectively for the OUT-distributions. The arrows in the upper plots indicate the deviation from the scaling behavior in the Czech corpus (see Section IV).

IV. RESULTS

The first relevant result of our study is the presence of small world structure in the syntax graphs. As shown by our analysis (see Table I for a summary), syntactic networks show D ≈ 3.5 degrees of separation. The values of D and C are very similar for Czech and Romanian. A certain degree of variation for German can be attributed to the fact that it is the most sparse dataset. Thus, D is overestimated and C is underestimated. Nonetheless, all networks have D close to Drandom, which is the hallmark of the small-world phenomenon [40]. The fact that C ≫ Crandom (Table I) indicates that the organization of syntactic networks strongly differs from that of Erdos-Renyi graphs. Additionally, we have also studied the frequency of short path lengths for the three networks. As shown in Figure 2, the three distributions are actually very similar, thus suggesting a common pattern of organization. When we compare the observed distributions to the expectation from a random Poissonian graph (indicated by filled triangles), they strongly differ. Although the average value is the same, syntactic networks are much more narrowly distributed. This was observed early on in the analysis of the World Wide Web [51].

The second result concerns the presence of scaling in


FIG. 4: Left: C(k), the clustering coefficient versus degree k, for the three corpora (Czech, German and Romanian). In all three panels the scaling relation C(k) ∼ k^{−1} is shown for comparison. Right: the corresponding cumulative P(g), the proportion of vertices whose betweenness centrality is g; the fitted slopes are −0.91 (Czech), −1.10 (German) and −1.10 (Romanian).

their degree distributions. The scaling exponents are summarized in Table I. For the undirected graph, we have found that the networks are scale-free with γ ≈ 2.2. Additionally, Fig. 3 shows P(k) for input and output degrees (see Table I for the specific values observed). With the exception of the Czech corpus, they display well-defined scale-free distributions. The Czech dataset departs from the power law for k > 10². Thus, highly connected words appear underestimated in this case, consistently with the limitations of this corpus discussed in Section II. These power laws fully confirm the presence of scaling at all levels of language organization.
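The scaling analysis can be reproduced in outline. The sketch below is an illustration on a synthetic Barabási–Albert graph, an assumption standing in for the corpus networks: it builds the cumulative distribution P_>(k) used in Fig. 3 and estimates γ from the log-log slope, using the fact that the cumulative slope is 1 − γ.

```python
import numpy as np
import networkx as nx

g = nx.barabasi_albert_graph(5000, 2, seed=7)   # stand-in graph, gamma near 3
degrees = np.array([d for _, d in g.degree()])

# cumulative distribution P_>(k): fraction of vertices with degree >= k
ks = np.arange(degrees.min(), degrees.max() + 1)
cum = np.array([(degrees >= k).mean() for k in ks])

# least-squares slope on log-log axes over the scaling region
mask = (ks >= 3) & (cum > 0)
slope, _ = np.polyfit(np.log(ks[mask]), np.log(cum[mask]), 1)
gamma_hat = 1 - slope
print(f"estimated gamma ~ {gamma_hat:.2f}")
```

Working with the cumulative distribution rather than P(k) itself smooths the noisy tail, which is why Fig. 3 plots P_>(k).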

Complex networks display hierarchical structure [45]. Fig. 4 (left column) shows the distribution of clustering coefficients C(k) against degree for the different corpora. We observe skewed distributions of C(k) (which are not power laws), as in other systems displaying hierarchical organization, such as the World Wide Web (see Fig. 3(c) in [46]).

In order to measure to what extent the syntactic dependency degree k of a word is related to its frequency f, we calculated the average value of f versus k and found a power dependence of the form

f \sim k^{\zeta} \quad (5)

where ζ ≈ 1 (Table I) indicates a linear relationship (Fig. 5). The highest value of ζ, found for German, can be attributed to the sparseness of the German corpus.
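The relation f ∼ k^ζ can be checked with a simple binning-by-degree computation. The sketch below uses synthetic data: word_degree and word_freq are hypothetical stand-ins for corpus-derived values, constructed here so that frequency is roughly proportional to degree, and the fitted slope recovers ζ ≈ 1.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
# hypothetical degrees with a heavy-tailed distribution
word_degree = {w: int(k) for w, k in enumerate(rng.zipf(2.2, 3000)) if k < 1000}
# hypothetical frequencies proportional to degree, with multiplicative noise
word_freq = {w: k * rng.lognormal(0.0, 0.3) for w, k in word_degree.items()}

# average frequency of the words having degree k, as in Fig. 5
by_k = defaultdict(list)
for w, k in word_degree.items():
    by_k[k].append(word_freq[w])
ks = np.array(sorted(by_k))
f_mean = np.array([np.mean(by_k[k]) for k in ks])

# slope of log <f> versus log k estimates zeta
zeta, _ = np.polyfit(np.log(ks), np.log(f_mean), 1)
print(f"estimated zeta ~ {zeta:.2f}")
```

On real data the same grouping by degree, averaging, and log-log fit would be applied to the observed word frequencies.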

Highly connected words tend not to be interconnected among themselves. Since degree and frequency are positively correlated (Eq. 5 and Fig. 5), one easily concludes, as a visual examination will reveal, that the most connected words are function words (i.e. prepositions, articles, determiners, ...). Disassortative mixing (Γ < 0) tells us that function words tend to avoid linking to each other. This consistently explains why the Czech corpus has a value of Γ clearly greater than that of the remaining languages. We already mentioned in Section II that most of the missing links in the Czech corpus are those involving function words such as prepositions, which are in turn the words responsible for the tendency of highly connected words to avoid linking to each other. Γ is thus overestimated in the Czech network.

The scaling exponent γ is somewhat variable, but the scaling exponents obtained for the betweenness centrality measure are much more narrowly constrained (Table I). Betweenness centrality has actually been used as a powerful characterization of complex networks in terms of universality classes [47]. Although again the Czech corpus deviates from the other two (in an expected way), those two corpora display a remarkably similar P(g) distribution, with η = 2.1. It is worth mentioning that the fits are very accurate and give an exponent that seems to be different from those reported in most complex networks analyzed so far. In previous studies of P(g), the betweenness centrality distribution, only two exponents have been found, η ≈ 2.2 and η = 2.0, respectively defining the class I and class II universality classes [47]. Protein interaction networks, metabolic networks of eukaryotes and bacteria and co-authorship networks belong to class I, whereas the Internet, the World-Wide Web and the metabolic networks of archaea belong to class II. The behavior of P(g) (Fig. 4), with a scaling domain of exponent about 2.1, suggests human language could belong to a third class of networks.

The behavior of C(k) (Fig. 4, left) differs from the independence from vertex degree found in Poissonian networks and certain scale-free network models [46]. Such behavior of C(k) is also different from Eq. 3 with θ = 1, which is clearly found in synonymy networks and suggested for actor networks and metabolic networks [45]. In contrast, such behavior is similar to that of the World Wide Web and the Internet at the Autonomous System level [46]. The similar shape of C(k) in the three syntactic dependency networks suggests all languages belong to the same universality class.

Besides word co-occurrence networks and the syntactic dependency networks presented here, other types of linguistic networks have been studied, though heterogeneously. Networks where nodes are words or concepts and links are semantic relations are known to show C ≫ Crandom with d ≈ drandom and power distributions of degrees with an exponent γ ∈ [3, 3.5] [24–27]. For Roget's Thesaurus, assortative mixing (Γ = 0.157) is found [34]. In contrast, syntactic dependency networks have γ ∈ [2.11, 2.29] and disassortative mixing (Table I), suggesting networks of semantic relations have exponents belonging to a different universality class. Further work, including more precise measures, such as the exponent of

FIG. 5: Average word frequency f of words having degree k for the three corpora (Czech, German and Romanian). Dashed lines indicate the slope of f ∼ k, in agreement with the real series.

P(g), should be carried out for semantic networks.

V. GLOBAL VERSUS SENTENCE-LEVEL PATTERNS

We have mentioned that there is a high risk that very general linguistic universals carry no information. Similarly, one may argue that the regularities encountered here are not significant unless it is shown that they are not a trivial consequence of some pattern already present in the syntactic structure of isolated sentences. In order to dismiss such a possibility, we define dglobal and dsentence as the normalized vertex-vertex distances of the global dependency network and of a sentence dependency network. The normalized average vertex-vertex distance is defined here as

d = \frac{D - 1}{D_{max} - 1}

where D_{max} = (n + 1)/3 is the maximum average distance of a connected network with n nodes [54]. Similarly, we define Cglobal and Csentence for the clustering coefficient and Γglobal and Γsentence for the Pearson correlation coefficient. The clustering coefficient of any syntactic dependency structure is Csentence = 0, since the syntactic dependency structure is defined with no cycles [35]. We find Cglobal ≫ Csentence and dglobal ≪ dsentence (Table II). Γsentence is clearly different from Γglobal, although disassortative mixing is found in both cases.

Besides, one may think that the global degree distribution is scale-free because the degree distribution of the syntactic dependency structure of a sentence is already scale-free. However, Psentence(k), the probability that the degree of a word in a sentence is k, is not a power function of k (Fig. 6). Actually, the data points suggest an exponential fit. As a consequence, we conclude that scaling in P(k) and a significantly high C are features emerging at the macroscopic scale. The global patterns discussed above are emergent features that only show up at the global level.

VI. DISCUSSION

We have presented in this paper a study of the statistical patterns of organization displayed by three different corpora. The study reveals that, as occurs at other levels of language organization [24–27], scaling is widespread. The analysis shows that syntax is a small world and displays a well-defined (and potentially universal) global structure. These features can be properly quantified and have been shown to be rather homogeneous. In standard linguistics, no one can speak of a linguistic universal before hundreds of languages have been investigated according to a sophisticated system of criteria for the selection of the languages to study, covering every known type of language family in a balanced proportion. Here, the context is different: we have the backup of universality in physics. Different statistical patterns of complex networks are the result of very general mechanisms. To cite some examples, the preferential attachment principle generates scale-free networks [41, 55], as does a conflict between vertex-vertex distance and link density minimization [56]. Randomness in the way vertices are linked is a source of small-worldness [40]. 'How universal is the small-world pattern in syntactic dependency networks?' is a question more related to how randomly words are linked than to how sufficient the language sample examined here is. This allows us to conclude that a new class of potential language universals can be defined on quantitative grounds. We have taken the position of formulating the most specific hypotheses allowed by our currently available data. Our findings do not exclude investigating more languages in order to validate our hypotheses.

Understanding the origins of syntax implies understanding what is essential in human language. Recent studies have explored this question by using mathematical models inspired by evolutionary dynamics [57–59]. However, the study of the origins of language is usually


Czech German Romanian Software graph Proteome a

n 33336 6789 5563 1993 1846< k > 13.4 4.6 5.1 5.0 2.0

C 0.1 0.02 0.09 0.17 2.2× 10−2

Crandom 4 · 10−4 6 · 10−6 9.2 · 10−4 2× 10−3 1.5× 10−3

D 3.5 3.8 3.4 4.85 7.14Drandom 4 5.7 5.2 4.72 9.0

Γ −0.06 −0.18 −0.2 −0.08 −0.16γ 2.29± 0.09 2.23± 0.02 2.19± 0.02 2.85± 0.11 2.5 (kc ∼ 20)γin 1.99± 0.01 2.37± 0.02 2.2± 0.01 - -γout 1.98± 0.01 2.09± 0.01 2.2± 0.01 - -η 1.91± 0.007 2.1± 0.005 2.1± 0.005 2.0 2.2θ Skewed Skewed Skewed Skewed 1.0ζ 1.03± 0.02 1.18± 0.01 1.06± 0.02 - -

^a Data available from http://www.nd.edu/~networks/database/protein/bo.dat.gz

TABLE I: A summary of the basic features that characterize the potential universal features exhibited by the three syntactic dependency networks analyzed here. n is the number of vertices of the network, <k> is the average degree, C is the clustering coefficient, C_random is the value of C for an Erdős-Rényi network, D is the average minimum vertex-vertex distance, D_random is the value of D for an Erdős-Rényi graph, Γ is the Pearson correlation coefficient, and γ, γ_in, γ_out are, respectively, the exponents of the undirected degree distribution, the in-degree distribution, and the out-degree distribution. η, θ and ζ are, respectively, the exponents of the betweenness centrality distribution, of clustering versus degree, and of frequency versus degree. Two further examples of complex networks are shown: a technological graph (a software network analyzed in [52]) and a biological web (the protein interaction map of yeast [53]). Here 'skewed' indicates that the distribution C(k) decays with k but not necessarily following a power law.
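As an illustration of how two of the rows of Table I are obtained, the following sketch computes the clustering coefficient C and the Erdős-Rényi baseline C_random ≈ <k>/n. The four-vertex toy graph is our own hypothetical example, not data from the table; it merely shows why C can greatly exceed C_random, as it does for all five networks above.

```python
from itertools import combinations

def clustering(adj):
    """Average local clustering coefficient (Watts & Strogatz 1998):
    for each vertex, the fraction of its neighbour pairs that are
    themselves linked, averaged over all vertices."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with < 2 neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# hypothetical toy graph: a triangle plus one pendant vertex
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

n = len(adj)
mean_k = sum(len(s) for s in adj.values()) / n
print(clustering(adj))   # 7/12 ≈ 0.583 for this toy graph
print(mean_k / n)        # Erdős-Rényi expectation C_random ≈ <k>/n = 0.5
```

In the toy graph C and C_random happen to be close because the graph is tiny; in the networks of Table I, where n is in the thousands, <k>/n is vanishingly small while C stays of order 0.1, which is the small-world signature.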

              Czech         Romanian      German
d_global      2.3 × 10^-4   1.3 × 10^-3   1.2 × 10^-3
<d_sentence>  0.88          0.75          0.83
C_global      0.1           0.09          0.02
<C_sentence>  0             0             0
Γ_global      -0.06         -0.2          -0.18
<Γ_sentence>  -0.4          -0.51         -0.64

TABLE II: Summary of global versus sentence network traits. d_global, C_global and Γ_global are, respectively, the normalized average vertex-vertex distance, the clustering coefficient and the Pearson correlation coefficient of a given global syntactic dependency network. d_sentence, C_sentence and Γ_sentence are the same measures for a given sentence syntactic dependency network. 〈x〉 stands for the average value of the variable x over the syntactic dependency networks where x is defined; <Γ_sentence> is calculated over all the sentences where Γ_sentence is defined.
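The negative values of <Γ_sentence> in Table II can be made concrete with a toy sentence. The sketch below (our own; the star-shaped dependency tree is hypothetical) computes the Pearson correlation between the degrees at the two ends of every edge, in the spirit of Newman's assortativity coefficient [49]; a negative value means disassortative mixing, i.e. hubs link preferentially to low-degree vertices.

```python
def degree_correlation(edges):
    """Pearson correlation Γ between the degrees at the two endpoints
    of every undirected edge (Newman 2002). Γ < 0: disassortative."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    # list each edge in both orientations so the two endpoint samples
    # have identical mean and variance
    xs = [deg[a] for a, b in edges] + [deg[b] for a, b in edges]
    ys = [deg[b] for a, b in edges] + [deg[a] for a, b in edges]
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    return cov / var  # undefined (division by zero) for regular graphs

# hypothetical star-shaped dependency tree: one head, three dependents
print(degree_correlation([(0, 1), (0, 2), (0, 3)]))  # -> -1.0
```

A star is maximally disassortative (Γ = -1): every link joins the hub to a leaf, which matches the intuition that in short dependency trees the head concentrates the links, driving <Γ_sentence> strongly negative.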

dissociated from the quantitative analysis of real syntactic structures. General statistical regularities that human language obeys at different scales are known [42, 43, 60]. The statistical patterns reported here could serve as validation of existing formal approaches to the origins of syntax. What is reported here is especially suitable for recent approaches to the origins of language [57–59], since it reduces syntax to pairwise relationships between words.

Linguists can decide not to consider certain word types as vertices in the syntactic dependency structure. For instance, the annotators of the Czech corpus decided that prepositions are not vertices. We have seen that, as a consequence, different statistical regularities are distorted: disassortative mixing almost disappears and degree distributions become truncated. If the degree distribution is truncated, describing it requires more complex functions. If simplicity is a desirable property, syntactic descriptions should consider prepositions and similar word types as words in the strict sense. Annotators should be aware of the consequences that their decisions about the local structure of sentences have for global statistical patterns.

Syntactic dependency networks do not imply recursion, which is regarded as a crucial trait of the language faculty [18]. Nonetheless, different non-trivial traits that recursion needs have been quantified:

• Disassortative mixing tells us that labour is divided in human language. Linking words tend to avoid connections among themselves.

• Hierarchical organization tells us that syntactic dependency networks not only define the syntactically correct links (if a certain context freedom is assumed) but also a top-down hierarchical organization that is the basis of phrase structure formalisms such as X-bar [61].


FIG. 6: Cumulative P_sentence(k) for Czech (circles), German (squares) and Romanian (diamonds), shown in linear-log (a) and log-log (b) plots, indicating an exponential-like decay. P_sentence(k) is the probability that a word has degree k in the syntactic dependency structure of a sentence. Notice that P≥(1) is less than 1 for Czech and German, since the sentence dependency trees are not complete. If P_sentence were a power function, a straight line would appear in log-log scale. The German corpus is so sparse that its appearance is dubious. Statistics are shown for L*, the typical sentence length. We have L* = 12 for Czech and German and L* = 6 for Romanian. The average value of P_sentence(k) over all sentence lengths is not used, since it can be misleading, as [30] shows in a similar context.

• Small-worldness is a necessary condition for recursion. If mental navigation [27] in the syntactic dependency structure cannot be performed reasonably fast, recursion cannot take place. Pressures for fast vocal communication are known to exist [22, 23].
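The navigation speed invoked in this last point corresponds to the average vertex-vertex distance D of Table I. A minimal sketch of how D is measured, via breadth-first search from every vertex, follows; the four-word chain graph is our own hypothetical example.

```python
from collections import deque

def mean_distance(adj):
    """Average shortest-path length over all ordered vertex pairs,
    computed by breadth-first search from each vertex. Small-world
    networks keep this value close to log n."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        for v, d in dist.items():
            if v != src:
                total += d
                pairs += 1
    return total / pairs

# hypothetical chain graph 0-1-2-3 (a worst case for navigation)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(mean_distance(adj))  # -> 5/3 for the 4-vertex chain
```

On chains D grows linearly with n, whereas the syntactic dependency networks of Table I keep D between 3.4 and 3.8 despite having thousands of vertices, which is what makes fast mental navigation plausible.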

An interesting prospect of our work is that explaining certain linguistic universals may also explain other network patterns outside the linguistic context, without loss of generality. We have seen, for the non-trivial properties analyzed here, that human languages are likely to belong to the same universality class. Such a class is a novel way of understanding the internal coherence and essential similarity of world languages. In contrast, when compared with other systems, human languages exhibit both unique and shared features.

Acknowledgments

We thank Ludmila Uhrlirova for the opportunity to analyze the Czech corpus for the present study. RFC acknowledges a grant from the Generalitat de Catalunya (FI/2000-00393). This work has also been supported by grant BFM 2001-2154 and by the Santa Fe Institute (RVS).

[1] D. Crystal, The Cambridge Encyclopedia of Language (Cambridge University Press, Cambridge, UK, 1997).
[2] J. Greenberg, in Universals of Language, edited by J. Greenberg (MIT Press, Cambridge, 1968).
[3] W. Croft, Typology and Universals (Cambridge University Press, Cambridge, 1990).
[4] J. H. Greenberg, Language Universals: with Special Reference to Feature Hierarchies (Mouton, 1966).
[5] G. K. Zipf, Human Behaviour and the Principle of Least Effort. An Introduction to Human Ecology (Hafner reprint, New York, 1972); 1st edition: Addison-Wesley, Cambridge, MA, 1949.
[6] J. Binney, N. Dowrick, A. Fisher, and M. Newman, The Theory of Critical Phenomena. An Introduction to the Renormalization Group (Oxford University Press, New York, 1992).
[7] R. Ferrer i Cancho and R. V. Solé, Proc. Natl. Acad. Sci. USA 100, 788 (2003).
[8] J. J. Ramsden and J. Vohradský, Phys. Rev. E 58, 7777 (1998).
[9] C. Furusawa and K. Kaneko, Phys. Rev. Lett. 90, 088102 (2003).
[10] J. D. Burgos, BioSystems 39, 19 (1996).
[11] J. D. Burgos and P. Moreno-Tovar, BioSystems 39, 227 (1996).
[12] P. M. Chaikin and T. C. Lubensky, Principles of Condensed Matter Physics (Cambridge University Press, Cambridge, 1995).
[13] H. E. Stanley, L. Amaral, S. Buldyrev, A. Goldberger, S. Havlin, H. Leschhorn, P. Maas, H. A. Makse, C.-K. Peng, M. Salinger, et al., Physica A 231, 20 (1996).
[14] H. Stanley, L. Amaral, P. Gopikrishnan, P. C. Ivanov, T. H. Keitt, and V. Plerou, Physica A 281, 60 (2000).
[15] J. Uriagereka, Rhyme and Reason. An Introduction to Minimalist Syntax (The MIT Press, Cambridge, Massachusetts, 1998).
[16] R. Köhler, Theor. Linguist. 14, 241 (1987).
[17] P. Lieberman, Uniquely Human: The Evolution of Speech, Thought and Selfless Behavior (Harvard University Press, Cambridge, MA, 1991).
[18] M. D. Hauser, N. Chomsky, and W. T. Fitch, Science 298, 1569 (2002).
[19] S. Pinker, The Language Instinct (HarperCollins, New York, 1996).
[20] S. Pinker and P. Bloom, Behav. Brain Sci. 13, 707 (1990).
[21] J. A. Hawkins, A Performance Theory of Order and Constituency (Cambridge University Press, New York, 1994).
[22] J. A. Hawkins, in Innateness and Function in Language Universals, edited by J. A. Hawkins and M. Gell-Mann (Addison-Wesley, Redwood, CA, 1992), pp. 87–120.
[23] P. Lieberman, Uniquely Human: The Evolution of Speech, Thought and Selfless Behavior (Harvard University Press, Cambridge, MA, 1991).
[24] A. E. Motter, A. P. S. de Moura, Y.-C. Lai, and P. Dasgupta, Phys. Rev. E 65, 065102 (2002).
[25] M. Sigman and G. A. Cecchi, Proc. Natl. Acad. Sci. USA 99, 1742 (2002).
[26] M. Steyvers and J. Tenenbaum, submitted (2001), http://web.mit.edu/cocosci/Papers/smallworlds.pdf.
[27] O. Kinouchi, A. S. Martinez, G. F. Lima, G. M. Lourenço, and S. Risau-Gusmán, Physica A 315, 665 (2002).
[28] R. Ferrer i Cancho and R. V. Solé, Proc. R. Soc. Lond. B 268, 2261 (2001); Santa Fe Institute Working Paper 01-03-016.
[29] S. N. Dorogovtsev and J. F. F. Mendes, Proc. R. Soc. Lond. B 268, 2595 (2001).
[30] R. Ferrer i Cancho, J. Quantitative Linguistics (2002), submitted.
[31] N. Chomsky, Syntactic Structures (Mouton, 1957).
[32] A.-L. Barabási and R. Albert, Rev. Mod. Phys. 74, 47 (2002).
[33] S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).
[34] M. E. J. Newman, SIAM Review 45, 167 (2003).
[35] I. Mel'čuk, Dependency Syntax: Theory and Practice (SUNY Press, 1988).
[36] R. Hudson, Word Grammar (Blackwell, Oxford, 1984).
[37] D. Sleator and D. Temperley, Tech. Rep., Carnegie Mellon University (1991).
[38] I. Mel'čuk, in International Encyclopedia of the Social and Behavioral Sciences, edited by N. J. Smelser and P. B. Baltes (Pergamon, Oxford, 2002), pp. 8336–8344.
[39] B. Bollobás, Modern Graph Theory, Graduate Texts in Mathematics (Springer, New York, 1998).
[40] D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
[41] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[42] R. Köhler, J. Quantitative Linguistics 6, 46 (1999).
[43] R. Köhler and G. Altmann, J. Quantitative Linguistics 7, 189 (2000).
[44] M. E. J. Newman, J. Stat. Phys. 101, 819 (2000).
[45] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A.-L. Barabási, Science 297, 1551 (2002).
[46] E. Ravasz and A.-L. Barabási, Phys. Rev. E 67, 026112 (2003).
[47] K.-I. Goh, E. Oh, H. Jeong, B. Kahng, and D. Kim, Proc. Natl. Acad. Sci. USA 99, 12583 (2002).
[48] U. Brandes, Journal of Mathematical Sociology 25, 163 (2001).
[49] M. E. J. Newman, Phys. Rev. Lett. 89, 208701 (2002).
[50] M. E. J. Newman, Phys. Rev. E 67, 026126 (2003).
[51] L. A. Adamic, in Proceedings of the ECDL'99 Conference, LNCS 1696 (Springer, 1999), pp. 443–452.
[52] S. Valverde, R. Ferrer i Cancho, and R. V. Solé, Europhys. Lett. 60, 512 (2002).
[53] H. Jeong, S. Mason, A.-L. Barabási, and Z. N. Oltvai, Nature 411, 41 (2001).
[54] R. Ferrer i Cancho and R. V. Solé, in Statistical Physics of Complex Networks, Lecture Notes in Physics (Springer, Berlin, 2003).
[55] S. N. Dorogovtsev and J. F. F. Mendes, Evolution of Networks. From Biological Nets to the Internet and WWW (Oxford University Press, Oxford, 2003).
[56] R. Ferrer i Cancho and R. V. Solé, Optimization in Complex Networks (Springer, Berlin, 2003).
[57] M. A. Nowak and D. C. Krakauer, Proc. Natl. Acad. Sci. USA 96, 8028 (1999).
[58] M. A. Nowak, J. B. Plotkin, and V. A. Jansen, Nature 404, 495 (2000).
[59] M. A. Nowak, Phil. Trans. R. Soc. Lond. B 355, 1615 (2000).
[60] L. Hřebíček, Quantitative Linguistics 56 (1995).
[61] D. Bickerton, Language and Species (Chicago University Press, 1990).
[62] http://phobos.cs.unibuc.ro/roric/DGA/dga.html