Kinship structures create persistent channels for language transmission

Lansing et al.

Supporting Information (SI)

Genetic data. The sample dataset comprises 982 men from the eastern Indonesian islands of Sumba (n = 505) and Timor (n = 477), with associated genetic, cultural and linguistic information. Of the 25 villages in this study, 14 are on Sumba and 11 on Timor. All Sumba villages are patrilocal. Of the 11 Timor villages, two follow patrilocal kinship practice (Fatukati and Umaklaran), while the remainder are matrilocal. The diversity of mtDNA and Y chromosome haplogroups found in each sampled village is shown in Fig. S1. The genetic trees were constructed using the method described below, which is freely available as an open source R application at https://carpntrees.shinyapps.io/shinyapp/ [last accessed August 2017].

[Figure S1: maps of Sumba and Timor with one labeled pie chart per village. Panels: Sumba mtDNA Haplogroup, Timor mtDNA Haplogroup, Sumba Y-STR Haplogroup, Timor Y-STR Haplogroup. Pie-chart scale legend: Sample Size (24, 41, 61).]

Fig. S1. Genetic diversity on Sumba and Timor. Diversity of mtDNA haplogroups (top) and Y chromosome haplogroups (bottom) in villages on Sumba (left) and Timor (right). Separate map symbols indicate patrilocal and matrilocal villages. Pie charts, scaled by sample size, show the distribution of haplogroups within each village.


mtDNA tree and distances. A tree of female lineages was constructed using mitochondrial DNA, with a combination of HVS-I sequences (540 base pairs) and SNP-defined haplogroup information. While the sequence data show recent variation between individuals, the haplogroups (determined by hierarchically genotyped SNPs) permit greater resolution of deeper parts of the phylogeny. By combining both information sources, a phylogenetic tree can be built that represents the true phylogeny far better than one generated from a single data source alone.

The tree building approach involved several steps. First, a tree was built using the haplogroup of each individual, with the topology constrained to match the consensus haplogroup tree from the Phylotree project [1]. This tree contains leaves representing individuals that share the same haplogroup. These leaves were then further refined using a neighbor-joining algorithm on the HVS-I sequences, using the most closely related haplogroup as an outgroup. Thus, we iteratively generate a final tree in which the deep nodes are determined by SNP-defined haplogroups and recent nodes by HVS-I variation. All individuals with the same haplogroup and HVS-I sequence were initially collapsed to speed up the calculations, and were separated at the tips once tree reconstruction was complete. The final tree is shown in Fig. S2. Subtrees (e.g., for patrilocal villages on Timor) were extracted from this master tree.

The pairwise distance matrix used for IM model fitting and language transmission analysis was built using the number of pairwise differences in HVS-I sequences between each pair of individuals.
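The pairwise-difference matrix described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names and the toy sequences are hypothetical, and it assumes the HVS-I sequences are already aligned to equal length.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count the aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def distance_matrix(sequences):
    """Build a symmetric dict-of-dicts matrix of pairwise differences."""
    ids = list(sequences)
    matrix = {i: {j: 0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):
        d = pairwise_differences(sequences[i], sequences[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy data; real HVS-I sequences are 540 base pairs long.
toy = {"ind1": "ACGTACGT", "ind2": "ACGTACGA", "ind3": "TCGTACGA"}
dist = distance_matrix(toy)
```

The matrix is stored symmetrically so that either individual of a pair can be used as the row key.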

[Figure S2: time-calibrated mtDNA trees with tips labeled by sample ID. Panel A legend: Sumba mtDNA Haplogroup. Panel B legend: Timor mtDNA Haplogroup. Time axes: years ago, in steps of 10k.]

Fig. S2. mtDNA phylogeny. The female lineages of sampled men from (A) Sumba and (B) Timor, colored according to their mtDNA haplogroup.


Y chromosome tree and distances. The tree of male lineages was reconstructed using both Y chromosome haplogroup information and Y-STR markers. Using a similar approach as for mtDNA, both information sources were combined by first building a haplogroup tree, and then refining recent lineage branching using Y-STR variation. The haplogroup tree was constrained to the topology of the Y-DNA Haplogroup Tree defined by the International Society of Genetic Genealogy [2], with improvements as per Karafet et al. [3] to incorporate new markers specific to this geographical region. To refine relationships at the tips, Bruvo’s distance [4] was calculated between each pair of individuals based on Y-STR variation. Leaves were then further refined using a neighbor-joining algorithm on the Bruvo distance of the Y-STRs. As before, all genetically identical individuals were initially collapsed and later separated out after the overall tree structure had been determined. The final tree is shown in Fig. S3. Individual population trees were extracted from this master tree.

The pairwise distance matrix used for IM model fitting and language transmission analysis was built using the Bruvo distance [4] between all pairs of individuals, as defined by their Y-STR diversity.
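As a rough illustration of the distance used here, the sketch below computes a Bruvo-style distance for haploid Y-STR profiles. It assumes each locus is a single-copy marker recorded as a whole number of repeats, uses the per-locus geometric form 1 - 2^(-|x|) for a difference of x repeats, and averages over loci. The function name and the toy profiles are hypothetical, and missing data or multi-copy markers are not handled.

```python
def bruvo_haploid(repeats_a, repeats_b):
    """Bruvo-style distance between two haploid STR profiles.

    Each profile is a sequence of repeat counts, one per locus, in the
    same locus order. The result lies in [0, 1).
    """
    if len(repeats_a) != len(repeats_b):
        raise ValueError("profiles must cover the same loci")
    if not repeats_a:
        raise ValueError("profiles must contain at least one locus")
    # A difference of x repeats at a locus contributes 1 - 2^(-|x|).
    per_locus = [1.0 - 2.0 ** (-abs(a - b)) for a, b in zip(repeats_a, repeats_b)]
    return sum(per_locus) / len(per_locus)


# Hypothetical two-locus profiles.
d_same = bruvo_haploid([14, 22], [14, 22])
d_one_step = bruvo_haploid([14, 22], [15, 22])
```

Under this form a one-repeat difference counts for 0.5 at its locus, and larger differences approach 1 without reaching it.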

[Figure S3: time-calibrated Y chromosome trees with tips labeled by sample ID. Panel A (Sumba) legend: Sumba Y-STR Haplogroup. Panel B: Timor. Time axes: years ago.]

Fig. S3. Y chromosome phylogeny. The male lineages of sampled men from (A) Sumba and (B) Timor, colored according to their Y chromosome haplogroup.


Molecular dating. Previously dated haplogroups with confidence intervals were used to calibrate the trees. For the mtDNA phylogeny, three time points were used, as determined from whole mtDNA genome sequences by Fu et al. [5]. For the Y chromosome phylogeny, five time points were used, taken from Karafet et al. [3]. Molecular dating of both trees using these calibration points was performed using the chronos function in the R package ape [6].

Language data and tree. The linguistic data were organized and classified according to the principles of the traditional comparative method. As an independent check, the resulting language phylogenies for Sumba and Timor were compared against the language relationships given in Ethnologue [7]. The language phylogenies were then calibrated using penalized likelihood and a relaxed substitution rate model to estimate internal branching times [8]. The language tree for Sumba (Fig. S4A) was constructed by setting the root to 4,085 years before present, the date inferred by Xu et al. [9] for the arrival of Austronesian speakers on Sumba. The language tree for Timor (Fig. S4B) was constructed using two much less well supported priors: i) the Timorese Austronesian languages were set to split 5,000 years ago; and ii) the languages spoken on Timor were assumed to have a coalescent date no earlier than 10,000 years ago. Note that the depth of the Austronesian/non-Austronesian split is not known, and is purposely given an arbitrary date. The coalescence age of this external branch has little effect on the analyses.

[Figure S4: dated language trees. Panel A (Sumba) tips: L870 Bukambero; L708, L709 Wejewa; L1492 Wanokaka; L153 Praibakul; L535 Wanokaka; L1493 Anakalang; L1212 Kambera; L164 Rindi; L1497 Mamboro; L1490 Kodi; L1496 Lamboya; L539 Loli. Node ages: 4.09, 3.63, 3.18, 2.72, 2.27, 1.82, 1.36, 1.21, 0.91 and 0.45 kya. Panel B (Timor) tips: L800 Betun; L863 Tetun; L803 Kemak; L802 Dawanta; L801 Bunak. Node ages: 10, 5, 3.88 and 2.49 kya.]

Fig. S4. Language phylogenies for (A) Sumba and (B) Timor. Marked nodes indicate calibration points.

Isolation with Migration (IM) model. An Isolation with Migration model was used to capture the impact of sex-biased movements on genetic diversity [10]. The IM model describes a single panmictic population of size $aN$ that splits into $n$ subpopulations of size $N$ at time $2N\tau$ generations in the past. Migration occurs at a rate $m$ between subpopulations. To assess the effects of postmarital residence practices on mtDNA and the Y chromosome, we distinguish between the migration rates of women ($m_{female}$, for mtDNA) and men ($m_{male}$, for the Y chromosome), and run the model separately for each group. Both runs have the same values of $N$, $n$, $a$, $\tau$ and mutation rate $\mu$, but can have different migration rates $m_{female}$ and $m_{male}$. For example, matrilocal kinship systems lead to greater migration of men than women, such that $m_{male} > m_{female}$. The converse is true for patrilocal systems. A cultural preference for endogamy is reflected by low migration rates for both $m_{female}$ and $m_{male}$.

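Using a scaled (haploid) migration rate $M = 2Nm$ and $\theta = 2N\mu$, with mutation rate $\mu$ in mutations per generation for the locus, the probability that the number of nucleotide differences $S_j$ between a pair of individuals is $k$ is given by

$$
P(S_j = k) = \sum_{r=1}^{2} A_{jr} \left( \frac{\lambda_r \theta^k}{(\lambda_r + \theta)^{k+1}} \left( 1 - e^{-\tau(\lambda_r + \theta)} \sum_{l=0}^{k} \frac{(\lambda_r + \theta)^l \tau^l}{l!} \right) + e^{-\tau(\lambda_r + \theta)} \frac{(a\theta)^k}{(1 + a\theta)^{k+1}} \sum_{l=0}^{k} \left( \frac{1}{a} + \theta \right)^l \frac{\tau^l}{l!} \right) \tag{1}
$$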

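where $j = 0$ and $j = 1$ correspond to the cases of two samples being from the same or different villages, respectively. The remaining sub-equations are defined as

$$
\lambda_1 = \frac{nM + n - 1 - \sqrt{D}}{2(n - 1)}, \qquad \lambda_2 = \frac{nM + n - 1 + \sqrt{D}}{2(n - 1)}, \tag{2}
$$

$$
D = (nM + n - 1)^2 - 4(n - 1)M, \tag{3}
$$

$$
A_{01} = \frac{\lambda_2 - 1}{\lambda_2 - \lambda_1}, \qquad A_{02} = \frac{1 - \lambda_1}{\lambda_2 - \lambda_1}, \qquad A_{11} = \frac{\lambda_2}{\lambda_2 - \lambda_1}, \qquad A_{12} = \frac{-\lambda_1}{\lambda_2 - \lambda_1}. \tag{4}
$$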
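To make the IM equations concrete, the following Python sketch evaluates the pairwise-difference probability of Eq. 1 using the quantities in Eqs. 2 to 4. The function names and the example parameter values are hypothetical and are not the fitted values; the only numerical choice made here is to accumulate the truncated exponential sums term by term, which avoids very large factorials.

```python
import math


def _truncated_exp_sum(x, k):
    """Return sum_{l=0}^{k} x**l / l!, accumulated term by term."""
    term, total = 1.0, 1.0
    for l in range(1, k + 1):
        term *= x / l
        total += term
    return total


def im_pairwise_pmf(k, j, M, theta, n, tau, a):
    """P(S_j = k): j = 0 for the same village, j = 1 for different villages."""
    D = (n * M + n - 1) ** 2 - 4 * (n - 1) * M                 # Eq. 3
    lam1 = (n * M + n - 1 - math.sqrt(D)) / (2 * (n - 1))      # Eq. 2
    lam2 = (n * M + n - 1 + math.sqrt(D)) / (2 * (n - 1))
    gap = lam2 - lam1
    if j == 0:                                                 # Eq. 4
        weights = ((lam2 - 1) / gap, (1 - lam1) / gap)
    else:
        weights = (lam2 / gap, -lam1 / gap)
    total = 0.0
    for A_jr, lam in zip(weights, (lam1, lam2)):
        rate = lam + theta
        decay = math.exp(-tau * rate)
        # Coalescence after the split (more recent than tau).
        recent = (lam * theta ** k / rate ** (k + 1)
                  * (1 - decay * _truncated_exp_sum(rate * tau, k)))
        # Coalescence in the ancestral population of size aN.
        ancestral = (decay * (a * theta) ** k / (1 + a * theta) ** (k + 1)
                     * _truncated_exp_sum((1 / a + theta) * tau, k))
        total += A_jr * (recent + ancestral)
    return total


# Hypothetical parameter values, chosen only to exercise the function.
params = dict(M=2.0, theta=1.0, n=10, tau=1.0, a=2.0)
p_within = [im_pairwise_pmf(k, 0, **params) for k in range(150)]
p_between = [im_pairwise_pmf(k, 1, **params) for k in range(150)]
```

A quick consistency check is that each distribution sums to one over $k$, which follows from $A_{j1} + A_{j2} = 1$.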

This model can be used to make genetic predictions regarding the distribution of genetic differences in mtDNA and in the Y chromosome given sex-biased migration. To do so, we evaluate the equations twice, applying a migration rate of $M = M_{male} = 2Nm_{male}$ to explore Y chromosome differences and $M = M_{female} = 2Nm_{female}$ to explore mtDNA differences, while keeping other parameters constant. We take this approach when exploring the phase space of the model (see Fig. S5).

It is important to be aware that such a model is a considerable simplification of the complex migration patterns generated by kinship systems. For instance, it assumes that i) movements of mtDNA and Y chromosomes are independent; ii) the effective population size $N$ and demographic history, as described by $n$, $\tau$ and $a$, of the mtDNA and Y chromosome are the same (in practice, $N$ is often lower for men due to their higher reproductive variance); and iii) standard coalescent assumptions hold, such as a relatively small sample size compared to the total population size, and exchangeability among sampled individuals. A further important assumption is that the measurement of pairwise differences between two individuals in a sample is independent of the number of pairwise differences between other individuals in the sample. This final point is relevant to our Approximate Bayesian Computation (ABC) modeling especially, and is elaborated on below.

Despite these limitations, this non-equilibrium model is complex enough to incorporate many demographic events that are expected to be reflected in the data. It closely fits the distribution of pairwise differences for distant relatives observed in the Sumba and Timor mtDNA and Y chromosome data sets (see main text Fig. 3), as well as differences between these loci at smaller genetic distances. We are especially interested in the quality of fitting at smaller genetic distances because these better reflect the impact of kinship systems over recent genetic history. While we do not suggest that the fitted parameters correspond directly to real-world properties of these populations, they do capture patterns that provide a qualitative indication of whether matrilocality or patrilocality is a likely cause of the observed data.

Parameter space of the IM model within and between villages. The mtDNA and Y chromosome pairwise distances within and between subpopulations are shown in Fig. S5, with other parameters $N = 200$, $n = 100$, $\tau = 5.0$ and $a = 2.0$, and $m_{male}$ and $m_{female}$ scaled between 0 and 0.05. The Kullback-Leibler divergence for between-village comparisons (Fig. S5B) is almost the same for all migration rate values. Differences in the quality of the model fit, as quantified by the Kullback-Leibler divergence, only become apparent when looking at within-village comparisons (Fig. S5A). In other words, it is only easy to observe kinship systems through differences in mtDNA and Y chromosome pairwise distances within subpopulations. This is an important future consideration when designing methods to detect the signature of kinship practices, and is leveraged in our ABC method.
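The symmetrized Kullback-Leibler divergence used to compare the two pairwise-distance distributions, KL(Y|mtDNA) + KL(mtDNA|Y), can be sketched as follows. The function names, the small constant `eps` (added here only to guard against taking the logarithm of zero) and the toy histograms are assumptions of this illustration, not details from the analysis.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same bins")
    # Bins with p_i = 0 contribute nothing to the sum.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Sum of the two directed divergences, KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical normalized histograms of pairwise distances.
y_hist = [0.50, 0.30, 0.15, 0.05]
mt_hist = [0.10, 0.20, 0.30, 0.40]
score = symmetric_kl(y_hist, mt_hist)
```

The symmetrized form does not depend on which locus is treated as the reference distribution.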

Fitting the IM model using Approximate Bayesian Computation (ABC). ABC is a method to assess how strongly the observed data support different parameter values in a model. The pairwise Bruvo distances [4] of the Y chromosome (values in range [0,1]) were first scaled such that the average observed distance was the same as that of the mtDNA pairwise differences. Given that the time depths of the Y chromosome and mtDNA trees are broadly similar for our samples, we consider this operation equivalent to ‘scaling’ the mutation rate of the Y chromosome and mtDNA such that, for the purposes of modeling, they can be considered equal (µ = per site mutation rate of 1.67×10−7 × generation time of 25 years × 540 mtDNA sites = 0.002214). The distributions of mtDNA pairwise distances and scaled Y chromosome Bruvo’s distances are similar. After obtaining the observed distributions, we proceeded with the ABC iterations.

We use full coalescent simulations in our ABC model fitting, rather than simply sampling from the theoretical distribution of pairwise distances (Eq. (1)). This is slower but allows us to properly account for correlations in pairwise distances between samples (e.g., see [11]). The coalescent simulations were performed using the coalescent program msms [12] and implemented so as to replicate the IM model explored in Wilkinson-Herbots (2008) and described above.
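The rescaling step and the mutation-rate product can be written out as a short sketch. The helper name and the toy distances are hypothetical. Multiplying the three printed factors gives roughly 0.00225, close to but not identical with the quoted 0.002214; the sketch simply reports the product of the figures as printed.

```python
def rescale_to_mean(values, target_mean):
    """Multiply all values by one factor so that their mean equals target_mean."""
    current_mean = sum(values) / len(values)
    if current_mean == 0:
        raise ValueError("cannot rescale values whose mean is zero")
    factor = target_mean / current_mean
    return [v * factor for v in values]


# Hypothetical Bruvo distances (range [0, 1]) and mtDNA pairwise differences.
bruvo = [0.10, 0.25, 0.40, 0.65]
mtdna = [3, 7, 9, 13]
mtdna_mean = sum(mtdna) / len(mtdna)
scaled_bruvo = rescale_to_mean(bruvo, mtdna_mean)

# Locus-wide mutation rate per generation from the printed factors.
per_site_rate = 1.67e-7   # per-site rate as printed in the text
generation_time = 25      # years per generation
n_sites = 540             # HVS-I sites
mu = per_site_rate * generation_time * n_sites
```

Because a single multiplicative factor is applied, the relative ordering of the Y chromosome distances is unchanged by the rescaling.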

In the simple form of ABC used, an iteration consists of i) proposing a combination of parameter values according to the parameter prior distributions; ii) using msms to simulate samples given these parameter values, both for the female (with migration rate $m_{female}$) and male (with migration rate $m_{male}$) lineages; iii) calculating the average pairwise distance between samples, using only within-village comparisons (which are more informative regarding locality, see Fig. S5); and iv) calculating the distance metric (Kullback-Leibler divergence) by summing the KL-divergence of the simulated and observed mtDNA and Y chromosome distance distributions. For $N$, $n$ and the migration rates we used uniform priors with bounds $70 \le N \le 80$, $15 \le n \le 300$, $0 \le m_{male} \le 0.25$, $0 \le m_{female} \le 0.25$. For $\tau$ and $a$, we used lognormal distributions with mean 1 and standard deviation 1.25, allowing us to sample large values of the derived parameters $T_0 = 2.0 \times 25 \times N \times \tau$ and $N_{ancestral} = a \times N$ while weighting sampling toward more plausible smaller parameter values. The number of individuals sampled from each village in the model was the same as the number of individuals in our data. This procedure was iterated three million times, keeping the parameters that generate a model output with the lowest distance from the data as being representative of the posterior distributions. The ABC posterior distributions of the IM model used for main text Fig. 3 are shown in Figs. S6 and S7. The prior distributions of $T_0$ and $N_{ancestral}$ are calculated from the uniform and lognormal priors of the other parameters.

Fig. S5. Migration rate phase space. (A) Kullback-Leibler divergence of pairwise distance distributions (KL(Y|mtDNA) + KL(mtDNA|Y)) within subpopulations. (B) Kullback-Leibler divergence of pairwise distance distributions between subpopulations. (C) Example plots of mtDNA (blue) and Y chromosome (red) pairwise distance distributions within (solid lines) and between (dashed lines) subpopulations, focusing on the low-difference regime.

Fig. S6. Posterior distributions of IM model fitting for Sumba data, based on 300 best-fitting parameter values from three million samples of the prior distributions.

Fig. S7. Posterior distributions of IM model fitting for Timor data, based on 300 best-fitting parameter values from three million samples of the prior distributions.
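The rejection scheme described for the ABC fitting (draw from the priors, simulate, score, keep the best-fitting draws) can be outlined as below. This is a sketch under stated assumptions: `toy_distance` is a purely hypothetical stand-in for the msms simulation and KL scoring, the lognormal "mean 1 and standard deviation 1.25" are interpreted as log-scale parameters, $n$ is drawn as an integer, and the iteration counts are far smaller than the three million used in the analysis.

```python
import random


def sample_prior(rng):
    """Draw one parameter combination from the stated priors."""
    N = rng.uniform(70, 80)
    n = rng.randint(15, 300)                # assumed integer-valued
    m_male = rng.uniform(0.0, 0.25)
    m_female = rng.uniform(0.0, 0.25)
    tau = rng.lognormvariate(1.0, 1.25)     # assumed log-scale parameters
    a = rng.lognormvariate(1.0, 1.25)
    return {
        "N": N, "n": n, "m_male": m_male, "m_female": m_female,
        "tau": tau, "a": a,
        "T0": 2.0 * 25 * N * tau,           # derived: split time in years
        "N_ancestral": a * N,               # derived: ancestral size
    }


def abc_rejection(distance_fn, n_iter, n_keep, seed=0):
    """Keep the n_keep prior draws with the smallest distance to the data."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_iter):
        params = sample_prior(rng)
        scored.append((distance_fn(params, rng), params))
    scored.sort(key=lambda pair: pair[0])
    return [params for _, params in scored[:n_keep]]


def toy_distance(params, rng):
    """Hypothetical stand-in for 'simulate with msms, then sum KL divergences'."""
    return abs(params["m_male"] - 0.20) + abs(params["m_female"] - 0.05)


posterior = abc_rejection(toy_distance, n_iter=2000, n_keep=20)
```

In a real run, `distance_fn` would simulate both lineages for the proposed parameters and return the summed KL divergence against the observed within-village distributions.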

Kinship and language transmission in matrilocal Timor. In Fig. 4 of the main text, we show the genetic distances of the villages in Timor, including the 9 matrilocal villages and 2 patrilocal villages. In Fig. S8, only the genetic distances of the matrilocal villages on Timor are shown. The changes are minor when compared to Fig. 4B, D: the ‘p’ cluster does not disappear, and considering only matrilocal villages results in only a slight difference in the kernel density of pairwise genetic distances. We interpret this result as reflecting the relatively small number of pairwise patrilocal comparisons within the total sample. The presence of this weak ‘p’ cluster may be related to the high proportion of effectively ambilocal marriages in the closely integrated villages of the centuries-old Wehali village cluster (see Therik [13], but also noted in our informal surveys) with weak representation of Y-chromosome-specific drift in a minority of samples.

[Figure S8: kernel density plots of Y distance (i, j) against mtDNA distance (i, j), both axes 0 to 15. Panel A: Timor matrilocal, all pairs. Panel B: Timor matrilocal pairs with same language, with clusters labeled p, m and an.]

Fig. S8. Genetic distances in matrilocal villages of Timor, between all pairs of individuals (A) and only individuals who speak a common language (B). Considering only matrilocal villages in Timor does not significantly change pairwise genetic distances, and still shows matrilocality (m), ambi- or neolocality (an), with a small patrilocal cluster (p) related to a high proportion of effectively ambilocal marriages.

Analysis of linguistic and genetic distances. In Fig. S9, we compare genetic distances along both matrilines and patrilines with linguistic distances, $1 - d_l(i, j)$, defined as

$$
1 - d_l(i, j) = 1 - \cos(l_i, l_j), \tag{5}
$$

where $l_i$ is the language vector of individual $i$ and $l_j$ is the language vector of individual $j$. A pair of individuals who speak the same set of languages will have $1 - d_l(i, j) = 0$. If they speak at least one language in common, but one or both of them speak other languages as well, then $0 < 1 - d_l(i, j) < 1$. If they do not speak any language in common, $1 - d_l(i, j) = 1$.
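The three cases of Eq. 5 can be checked with a small Python sketch. It assumes each language vector is a binary indicator over a fixed, ordered list of languages; the function name, the ordering of the Timor languages and the example speakers are hypothetical.

```python
import math


def language_distance(l_i, l_j):
    """Return 1 - cos(l_i, l_j) for two language indicator vectors."""
    dot = sum(x * y for x, y in zip(l_i, l_j))
    norm_i = math.sqrt(sum(x * x for x in l_i))
    norm_j = math.sqrt(sum(y * y for y in l_j))
    if norm_i == 0 or norm_j == 0:
        raise ValueError("each individual must speak at least one language")
    return 1.0 - dot / (norm_i * norm_j)


# Assumed column order: Bunak, Betun, Dawanta, Kemak, Upper Tetun.
speaker_a = [1, 0, 0, 0, 1]   # Bunak and Upper Tetun
speaker_b = [1, 0, 0, 0, 1]   # same set as speaker_a
speaker_c = [0, 0, 0, 0, 1]   # Upper Tetun only
speaker_d = [0, 0, 1, 0, 0]   # Dawanta only

same_set = language_distance(speaker_a, speaker_b)    # identical sets
partial = language_distance(speaker_a, speaker_c)     # one shared language
disjoint = language_distance(speaker_a, speaker_d)    # no shared language
```

With binary vectors the partial case depends only on the number of shared languages and on how many languages each individual speaks.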

On Sumba, we see the signal of monolinguality as pairs of individuals that either share all of their languages or do notshare a language at all (Fig. S9A-B). The few pairs of individuals that speak the same language form a cluster with a mtDNAgenetic distance of ~8 (Fig. S9A). On their Y chromosome (Fig. S9B), pair of individuals that speak the same languages areeither closely related (~0) or distantly related to some degree (~5–10).

Multilinguality is evident in Timor (Fig. S9C-D) where there are pairs of individuals that share some, but not all, of theirlanguages. Most individuals speak the same languages (for instance, nearly everyone speaks Upper Tetun) and they are closelyrelated on their mtDNA (Fig. S9C), but have only distant relatedness on the Y chromosome (Fig. S9D).

Co-phylogeny of genes and languages. The co-phylogeny method has often been applied in functional ecology to find links between species traits and environmental variables, thus explaining observed biological processes in ecosystems [14]. More recently, similar methods have been applied in the context of host-parasite co-evolution [15] and cultural traits [16]. Languages are analogous to parasites carried by their human speakers, whose genes evolve in a parallel process. Language traits evolve as the environmental and genetic conditions of the speech communities evolve. Using the co-phylogeny framework, we can test the significance of a global hypothesis of co-evolution between genes and language [15].

[Figure S9: four kernel-density panels. (A) Sumba, language distance (i, j) against mtDNA distance (i, j); (B) Sumba, language distance (i, j) against Y distance (i, j); (C) Timor, language distance (i, j) against mtDNA distance (i, j); (D) Timor, language distance (i, j) against Y distance (i, j). Genetic distance axes run from 0 to 15 and language distance axes from 0 to 1.]

Fig. S9. Linguistic and genetic distances. The kernel density of the linguistic and genetic distances is shown for all pairs of individuals. (A) On Sumba, where villages are monolingual, only a few pairs with mtDNA distance ~8 speak the same language. (B) For the Y chromosome on Sumba, pairs of individuals who speak the same language form two clusters of close (~0) and slightly distant (~5–10) relatedness. (C) On Timor, there is a high probability of finding pairs of individuals that are closely related both on their mtDNA and in terms of the languages they speak. (D) Most pairs of individuals on Timor have a distant degree of relatedness on their Y chromosome, but speak a similar set of languages.

Co-phylogenetic analysis requires the specification of a binary association link matrix that indicates which of the locally spoken languages each sampled individual speaks (e.g., for Timor: Bunak, Betun, Dawanta, Kemak and/or Upper Tetun; for Sumba: languages spoken in each monolingual village). On Timor, language fluency was assessed for each participant on a three-point ordinal scale (‘Understand’, ‘Understand some’, ‘Do not understand’) during a survey administered to each participant. The first two categories were then coded as a positive indication of ability with the corresponding language for the purposes of the cophylogenetic analysis.
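The coding of survey responses into a binary link matrix can be sketched as follows. This is an illustration only: the participant identifiers, the dictionary layout and the treatment of a missing answer as ‘Do not understand’ are assumptions made for the example, not details given in the survey description.

```python
import numpy as np

LANGUAGES = ["Bunak", "Betun", "Dawanta", "Kemak", "Upper Tetun"]
# The first two categories of the three-point scale count as speaking the language.
POSITIVE = {"Understand", "Understand some"}

def build_link_matrix(responses, languages=LANGUAGES):
    """Return (individuals, LG) with languages as rows and individuals as columns."""
    individuals = sorted(responses)
    lg = np.zeros((len(languages), len(individuals)), dtype=int)
    for col, person in enumerate(individuals):
        for row, language in enumerate(languages):
            # Assumption: an unanswered language is treated as 'Do not understand'.
            answer = responses[person].get(language, "Do not understand")
            lg[row, col] = int(answer in POSITIVE)
    return individuals, lg

# Hypothetical survey responses for two participants.
responses = {
    "p1": {"Upper Tetun": "Understand", "Bunak": "Understand some"},
    "p2": {"Upper Tetun": "Understand", "Kemak": "Do not understand"},
}
individuals, LG = build_link_matrix(responses)
```

Each column of `LG` is then the language vector of one individual, in the same orientation (languages by genes) as the link matrix used in the ParaFit statistic.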

Statistical test for gene-language coevolution. To test whether a significant association exists between the evolution of genes and languages, we employ statistics that are functions of the genetic and language phylogenies, together with their association links (i.e., the list of which individuals speak which languages) [15]. In this statistical test, we examine the significance of congruence, if any, between the gene and language trees. If genes and languages occupy corresponding positions in both phylogenetic trees with a significant degree of congruence, we reject the global null hypothesis that their evolution has been independent.

ParaFit statistic. Evaluating the global hypothesis requires three pieces of information: i) the gene phylogeny G, ii) the language phylogeny L, and iii) the association between genes and languages LG.

The cophylogeny matrices $L$ and $G$ (Fig. S10) represent the principal coordinates of the linguistic and genetic distances of individuals along their respective trees. The association matrix $LG$ is a binary link matrix specifying the languages spoken by each individual. To calculate the congruence of the gene and language trees, we define the fourth-corner statistics matrix $D$ [15] as

\[ D = G \, LG' \, L. \tag{6} \]

From $D$, we define the global association statistic ParaFitGlobal, used to test the gene-language coevolution hypothesis, as the sum of squares of the elements $d_{ij}$ of $D$:

\[ \mathit{ParaFitGlobal} = \operatorname{trace}(D'D) = \sum_{i,j} d_{ij}^2. \tag{7} \]
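Eqs. 6 and 7 amount to two matrix products and a sum of squares. The sketch below assumes the matrix orientations drawn in Fig. S10 (G: gene-tree principal coordinates by genes; LG: languages by genes; L: languages by language-tree principal coordinates). The random inputs are placeholders; in practice G and L would come from a principal coordinates analysis of the tree distances.

```python
import numpy as np

def parafit_global(G, LG, L):
    """Return (ParaFitGlobal, D) with D = G @ LG.T @ L as in Eqs. 6 and 7."""
    D = G @ LG.T @ L                      # gene coordinates x language coordinates
    return float(np.trace(D.T @ D)), D    # trace(D'D) equals the sum of d_ij^2

# Placeholder inputs (hypothetical sizes): 6 individuals, 3 languages,
# 2 principal coordinates retained for each tree.
rng = np.random.default_rng(0)
n_genes, n_langs = 6, 3
G = rng.normal(size=(2, n_genes))
L = rng.normal(size=(n_langs, 2))
LG = rng.integers(0, 2, size=(n_langs, n_genes))   # who speaks what
stat, D = parafit_global(G, LG, L)
```

The trace form and the element-wise sum of squares give the same number, which is a convenient sanity check on the matrix orientations.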

[Figure S10: schematic of the four matrices. L: principal coordinates of the language tree (languages by principal coordinates); LG: language-gene association, binary data of who speaks what (languages by genes); G: principal coordinates of the gene tree (principal coordinates by genes); D: fourth-corner statistics (gene tree principal coordinates by language tree principal coordinates).]

Fig. S10. Fourth-corner matrix. Statistics describing gene-language associations.

Permutation Model. To test whether the association between genes and languages is significant, we calculate ParaFitGlobal on the original data, as well as on the permuted data. Permutation is performed by shuffling the columns of LG as shown in Fig. S11. Randomizing the columns is akin to shuffling the gene labels in the link matrix, which removes the association, but preserves the extent and degree of multilinguality.

The strength of the observed association between genes and languages is then expressed as a Z score. ParaFitGlobal is computed for each permuted realization of the data, giving the set of values $\{\mathit{ParaFitGlobal}_{perm}\}$, and we calculate the number of standard deviations by which $\mathit{ParaFitGlobal}_{data}$ (i.e., the ParaFitGlobal value of the original data) lies away from the mean of $\{\mathit{ParaFitGlobal}_{perm}\}$. The Z score is taken over 25 randomizations $r$, each with 999 permutations $p$ of $\mathit{ParaFitGlobal}_{perm}$:

\[ Z = \frac{\mathit{ParaFitGlobal}_{data} - \dfrac{\sum_{r,p} \mathit{ParaFitGlobal}_{perm}^{r,p}}{r + p}}{s\left(\{\mathit{ParaFitGlobal}_{perm}^{r,p}\}\right)}, \tag{8} \]

where $s(\cdot)$ is the standard deviation of the permuted values.
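The permutation test can be sketched as below. This is a minimal illustration under stated assumptions: all permuted values are pooled across randomizations, the ordinary sample mean and sample standard deviation are used (following the prose description of the Z score), and the toy matrices and function names are hypothetical.

```python
import numpy as np

def parafit_global(G, LG, L):
    """Sum of squared entries of D = G @ LG.T @ L."""
    D = G @ LG.T @ L
    return float(np.sum(D ** 2))

def permutation_z(G, LG, L, n_rand=25, n_perm=999, seed=0):
    """Z score of the observed statistic against column-shuffled link matrices."""
    rng = np.random.default_rng(seed)
    observed = parafit_global(G, LG, L)
    permuted = np.empty(n_rand * n_perm)
    for k in range(permuted.size):
        cols = rng.permutation(LG.shape[1])          # shuffle the gene labels
        permuted[k] = parafit_global(G, LG[:, cols], L)
    z = (observed - permuted.mean()) / permuted.std(ddof=1)
    return z, permuted

# Toy inputs (hypothetical): 3 languages, 8 individuals, 2 coordinates per tree.
rng = np.random.default_rng(1)
G = rng.normal(size=(2, 8))
L = rng.normal(size=(3, 2))
LG = np.array([[1, 1, 1, 0, 0, 0, 0, 0],
               [0, 0, 0, 1, 1, 1, 0, 0],
               [0, 0, 0, 0, 0, 0, 1, 1]])
z, permuted = permutation_z(G, LG, L, n_rand=5, n_perm=99)
```

Because whole columns are moved, each individual's set of languages stays intact and only its attachment to a position in the gene tree is randomized, which is why the number of multilingual individuals is unchanged by the shuffle.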

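[Figure S11: the LG matrix (rows: languages; columns: genes) shown with its original column order g1, g2, g3, ..., gn and with a shuffled column order g3, g1, gn, ..., g2.]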

Fig. S11. Permutation model. Columns of the LG matrix are shuffled to remove associations, while preserving the number of multilingual individuals and the number of languages they speak.


Language switching on genetic trees. To estimate host switching probabilities, we assume that both gene and language trees are accurately reconstructed, and seek to generate a one-to-one mapping of branching points in the gene and language trees, thus describing how the languages were transmitted and ultimately leading to the current language(s) spoken at each leaf of the gene tree today. A plausible model shown in Fig. S12 predicts the languages spoken at each branch of the gene tree at generation t. This stochastic simulation is run forward in time over the trees using two rules. First, where the language tree branches into two daughter languages, all genetic clades that speak the ancestral language are randomly assigned to one of the daughter languages. Second, in each generation, there is a probability α that a given clade switches to a new language chosen from among the languages that exist in the population at that time. That is, a proportion of lineages α switches to a new language at each generation. These two rules provide a simple model of language diversification and host switching (Fig. S12) that is sufficient to reconstruct patterns of language sharing observed in the present.

[Figure S12: three panels on a time axis labeled t0 to t6. (A) Unannotated gene tree with tips G1, G2 and G3; (B) language tree with tips L1, L2 and L3 and ancestral nodes L*12 and L*123; (C) simulated language-annotated gene tree with tips G1, G2 and G3.]

Fig. S12. Language switching in a gene tree conditioned on a language tree. Shown is (A) an unannotated gene tree, (B) a language tree and (C) an example of a simulated language-annotated gene tree. Language diversifies i) during language branching ($L^*_{123}$ splits into $L^*_{12}$ and $L_3$ at $t_2$), and ii) during host switching, as in the branches leading to G2 (language switch at $t_4$) and G3 (language switch at $t_5$).

This plausible stochastic model of language transmission along the branches of the gene tree was run forward in time, starting from the gene tree root ($t_0$ in Fig. S12C). The following process was performed at each generation of 25 years: i) determine the genetic clades and languages at generation $t$ (e.g., two genetic clades and languages $L^*_{12}$ and $L_3$ at $t_2$ in Fig. S12B-C); ii) if a language branches at generation $t$, each genetic branch that carried the ancestral language at $t - 1$ is randomly assigned to one of the two new languages (e.g., at $t_2$ in Fig. S12C, $L^*_{12}$ is assigned to the first branch and $L_3$ to the second branch); iii) each genetic branch then switches to a different language within the current pool of languages with probability α (e.g., genetic branch G2 switches to language $L_3$ at $t_4$).
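Steps ii) and iii) of this per-generation process can be sketched for the monolingual case as follows. The data structures are assumptions made for the example: clades are held in a dictionary, language branching events at generation t are passed in as a mapping from the ancestral language to its two daughters, and the pool of languages existing at t is supplied by the caller. Splitting of genetic clades along the gene tree (step i) is assumed to be handled outside this function.

```python
import random

def step_generation(clade_language, splits, pool, alpha, rng):
    """Advance every genetic clade by one generation.

    clade_language: dict mapping clade -> language carried at t - 1
    splits: dict mapping an ancestral language -> (daughter_a, daughter_b) at t
    pool: languages that exist at generation t
    alpha: per-generation probability (0 to 1) that a clade switches language
    """
    updated = {}
    for clade, lang in clade_language.items():
        # Step ii: a clade carrying a branching language gets a random daughter.
        if lang in splits:
            lang = rng.choice(splits[lang])
        # Step iii: with probability alpha, switch to a different language in the pool.
        if rng.random() < alpha:
            others = [candidate for candidate in pool if candidate != lang]
            if others:
                lang = rng.choice(others)
        updated[clade] = lang
    return updated

rng = random.Random(1)
# Language branching without switching (alpha = 0): L*123 splits into L*12 and L3.
after_split = step_generation({"G1": "L123", "G2": "L123"},
                              {"L123": ("L12", "L3")}, ["L12", "L3"], 0.0, rng)
# Certain switching (alpha = 1) with two languages: every clade swaps language.
swapped = step_generation({"G1": "L12", "G2": "L3"}, {}, ["L12", "L3"], 1.0, rng)
```

Calling `step_generation` once per 25-year generation from the root to the present yields one simulated language-annotated gene tree of the kind shown in Fig. S12C.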

This simulation process was repeated for many gene-language simulations across the range α = [0, 100]%, where α = 0 indicates no language switching and α = 100% means all lineages switch their language every generation. The average Z for each α was then compared with the Z of the observed data to find the degree of language switching that best fits the data. Variants of this gene-language co-evolution model yield qualitatively similar results.

Multilinguality in the model of language switching between genetic clades. The model of language switching between genetic clades as shown in Fig. S12 was repeated with simulations across a range of values of the language switching rate α. From these, we obtained a distribution of Z scores and identified the range of α values that yield Z scores similar to those seen in the observed data. This simulation model is sufficient to generate plausible patterns of language sharing in Sumba, where individuals are monolingual. However, multilinguality is common in Timor, and we therefore modified the model such that individuals have a list of languages. When a new language is introduced (i.e., an ancestral language branches into daughter languages), or language switching occurs (as determined by α), there is now a fixed probability β that the new language is appended to the language list rather than replacing the language list. We are primarily interested in fitting the language switch rate α, and so only a limited range of β values were explored. In main text Fig. 5, we take β = α. Other β values yield qualitatively similar results, with all values leading to Z score convergence at α < 5%.
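The append-or-replace rule governed by β can be sketched as a single update applied whenever a lineage receives a new language, whether through language branching or through switching. The function name and the list representation are illustrative assumptions.

```python
import random

def acquire_language(languages, new_language, beta, rng):
    """Append new_language with probability beta; otherwise replace the whole list."""
    if rng.random() < beta:
        # Append, avoiding a duplicate entry if the language is already spoken.
        if new_language in languages:
            return list(languages)
        return list(languages) + [new_language]
    return [new_language]

rng = random.Random(0)
appended = acquire_language(["Kemak"], "Upper Tetun", beta=1.0, rng=rng)
replaced = acquire_language(["Kemak"], "Upper Tetun", beta=0.0, rng=rng)
```

With β = 0 the update reduces to the monolingual model used for Sumba, while larger β values let language lists grow, which is the behaviour needed to reproduce multilingual individuals on Timor.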

Alternative model of language switching between genetic clades. A plausible alternative stochastic model of language transmission along the branches of the gene tree, run forward in time (starting from the root of the gene tree), involves the following process, which is performed at each generation t (with an interval of 25 years):

1. Set parameters – determine the genetic clades and languages at generation t;

2. Cospeciation – if a language branches at generation t, each genetic branch that carried the ancestral language at t − 1 is randomly assigned one of the two new daughter languages. Genetic branches associated with a language that does not branch inherit the language(s) spoken by that branch at time t − 1; and

3. Language switch – each genetic branch then switches to a different language within the current pool of languages with probability α.

In the case of multilinguality (as on Timor), every individual has a list of languages and the set of languages spoken by an individual is determined by these additional rules:

4. Parent language replaced by daughter language during cospeciation – whenever language branching occurs, for every genetic branch that spoke the ancestral language at t − 1, one of the daughter languages replaces the ancestral language in the current list of languages of the genetic branch; and

5. Additional language during switch – when a genetic branch switches to a language not in its language list at t − 1, it retains the other languages it already spoke at t − 1.

For this alternative model, language switching rates α converge to 1–5% per generation for all trees examined, as shown in Fig. S13. Under this alternative model, the ‘half life’ of a language on a lineage is 325–1,700 years.
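How the half-life range was derived is not stated. One reading that reproduces the quoted figures, offered here as an assumption only, treats retention of a language as geometric decay with a per-generation switching probability α, counts whole generations for which at least half of the lineages are expected to retain their language, and multiplies by the 25-year generation interval.

```python
import math

def half_life_years(alpha, generation_years=25):
    """Whole generations n with (1 - alpha)**n >= 0.5, converted to years."""
    generations = math.floor(math.log(0.5) / math.log(1.0 - alpha))
    return generations * generation_years

fast = half_life_years(0.05)   # alpha = 5% per generation -> 325 years
slow = half_life_years(0.01)   # alpha = 1% per generation -> 1700 years
```

For α = 5%, ln(0.5)/ln(0.95) is about 13.5, so 13 whole generations give 325 years; for α = 1%, ln(0.5)/ln(0.99) is about 69.0 (just under 69), so 68 whole generations give 1,700 years.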

The difference between this alternative model and the model used in main text Fig. 5 lies in steps 4 and 5. In the model used in the main text, the new language is appended with a probability β; in this alternative model, i) the chosen new daughter language always replaces the ancestral language in step 4; and ii) the language chosen during switching in step 5 is always appended to the language list of the genetic branch. That is, in the main model, β is allowed to vary, while in this alternative model, β is always 0 during language tree branching, and β is always 1 during language switching.

Fig. S13. Language switching rates under the alternative model. The Z score measure of association between gene and language phylogenies on Sumba and Timor is shown for different language switching rates α under the alternative model. All cases independently converge and abruptly lose gene-language associations, behaving similarly to randomized cases, when α ≈ 5%.

Probability of shared gene-language heritage. The probability that a pair of individuals $(i, j)$ share a common language $l$ given that they belong to the same genetic clade $g$ at generation $t$ is given by

\[ P(i_l, j_l \mid i_g(t), j_g(t)) = \frac{P(i_l, j_l \cap i_g(t), j_g(t))}{P(i_g(t), j_g(t))} = \frac{\sum_{g=1}^{c_t} \left( \sum_{i, j \neq i} \mathbf{1}_{\cos(l_{i_g}, l_{j_g}) > 0} \right)}{\sum_{g=1}^{c_t} n_g (n_g - 1)/2}, \tag{9} \]

where $c_t$ is the number of genetic clades in generation $t$ (each generation lasting 25 years), $l_{i_g}$ and $l_{j_g}$ are the language vectors of individuals $i$ and $j$ belonging to $g$, $\mathbf{1}_{\cos(l_{i_g}, l_{j_g}) > 0}$ is 1 if $i$ and $j$ share a language and 0 otherwise, and $n_g$ is the number of individuals in $g$.
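Eq. 9 can be evaluated for one generation as sketched below. The inputs are hypothetical: clade membership at generation t is given as a dictionary, and language vectors are binary. The sketch counts each unordered pair once, so that the numerator is on the same footing as the $n_g(n_g - 1)/2$ pairs in the denominator; this pairing convention is an assumption of the example.

```python
from itertools import combinations
import numpy as np

def shared_language_probability(clades, language_vectors):
    """Fraction of within-clade pairs that share at least one language.

    clades: dict mapping clade -> list of individual ids at generation t
    language_vectors: dict mapping individual id -> binary language vector
    """
    shared = 0
    total = 0
    for members in clades.values():
        for i, j in combinations(members, 2):     # n_g (n_g - 1) / 2 pairs per clade
            total += 1
            # For binary vectors, cos(l_i, l_j) > 0 exactly when the dot product is > 0.
            if np.dot(language_vectors[i], language_vectors[j]) > 0:
                shared += 1
    return shared / total if total else float("nan")

clades = {"g1": ["a", "b", "c"], "g2": ["d", "e"]}
vectors = {"a": [1, 0], "b": [1, 1], "c": [0, 1], "d": [1, 0], "e": [0, 1]}
p_shared = shared_language_probability(clades, vectors)
```

In this toy case, pairs (a, b) and (b, c) share a language while (a, c) and (d, e) do not, so two of the four within-clade pairs share a language.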

Probability of shared gene-language heritage in idealized scenarios. In Fig. 2 of the main text, we show $P(i_l, j_l \mid i_g(t), j_g(t))$ for the randomly permuted case. Here, we present alternative hypothesis modeling by estimating language distributions using the language switching model (shown in Fig. S12) for two idealized scenarios:

1. The language-switching rate is very high (α = 1), such that every lineage has the opportunity to switch its language every generation; and

2. No language-switching occurs (α = 0), such that language sharing is purely generated by divergence in the language tree.

Considering mtDNA, the α = 0 scenario is consistent with rigid matrilocality, while the α = 1 scenario is consistent with (very) frequent female movement, as might accompany patrilocality. The converse is true for the Y chromosome, with α = 0 corresponding to rigid patrilocality. Running these scenarios for our genetic and language trees (both mtDNA and the Y chromosome) allows us to make language sharing predictions in the case of extremely sex-biased migrations.

[Figure S14: four panels, (A) Sumba mtDNA, (B) Sumba Y, (C) Timor mtDNA and (D) Timor Y. x-axis: years ago (0 to 70k); y-axis: P(i_l, j_l | i_g, j_g). Legend: data, random, matrilocal, patrilocal.]

Fig. S14. Language sharing in the mtDNA and Y phylogenies of (A-B) Sumba and (C-D) Timor. The probability is shown of sharing a language l given that both individuals in a pair are in the same genetic clade g at a given time in the past. Solid blue lines represent the observed metric, with shaded blue bands indicating the result of random permutations of the linguistic data. Idealized scenarios are shown for strict matrilocality (yellow) and patrilocality (violet). Probabilities in the Sumba dataset approach the strict patrilocal case for mtDNA (all times) and the Y chromosome (until ~20 kya). In Timor, matrilocal probabilities are higher than chance for mtDNA, and almost never greater than chance for the Y chromosome.

For patrilocal Sumba, the observed mtDNA pattern is relatively close to the patrilocal model prediction throughout the time period explored, and clearly departs from the matrilocal model. The observed Y chromosome pattern is very distant from the matrilocal prediction and generally approaches the idealized patrilocal model, until about 20 kya. These results support our primary conclusions; not only is $P(i_l, j_l \mid i_g, j_g)$ greater overall for the Y chromosome in the patrilocal Sumba data, but this even begins to approach the predictions of an idealized, very strict patrilocal model.

The situation is more complicated for Timor because most individuals are multilingual. The higher prevalence of multilinguality in central Timor that we observed probably reflects recent history (notably forced migrations over the last couple of generations). In central Timor, there have been population movements caused by warfare over the last century, potentially contributing to the mixture of languages.

To capture the idealized matrilocal and patrilocal patterns, we now need to additionally consider the implications of such rigid mating systems on multilinguality. To illustrate one method of calculating $P(i_l, j_l \mid i_g, j_g)$ while taking multilinguality into account, we ran the same model used for Sumba, but with an additional probability parameter β that the new language is appended to the language list rather than replacing it (see the ‘Multilinguality in the model of language switching between genetic clades’ section above). For both the strict matrilocality and patrilocality cases, we used β = 0.2 when language switching occurs. When a new language is introduced (i.e., language tree branching), we used β = 0.7 for mtDNA and β = 0.96 for the Y chromosome. These β values were chosen as they represent the transition points below which the resulting language distribution and shared probabilities closely fit the Timor data.

Note that we are in no way claiming that this method is necessarily the right way to handle multilinguality, and instead present this analysis as an exploratory study only. Multilinguality is an extremely complex question and largely beyond the scope of this manuscript. Nevertheless, these analyses clarify how the $P(i_l, j_l \mid i_g, j_g)$ findings for Timor relate to alternative idealized scenarios. Here, we find that for mtDNA, the pattern on Timor is greater than random chance, analogous to rigid matrilocality. For the Y chromosome, the pattern on Timor is almost never greater than random chance, similar to strict matrilocality, which is also always below random chance.

1. Van Oven M, Kayser M (2009) Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human Mutation 30(2).
2. ISOGG (2017) International Society of Genetic Genealogy (http://www.isogg.org/tree, accessed 10-24-2016).
3. Karafet TM et al. (2008) New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Research pp. 830–838.
4. Bruvo R, Michiels NK, D’Souza TG, Schulenburg H (2004) A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology 13(7):2101–2106.
5. Fu Q et al. (2013) A revised timescale for human evolution based on ancient mitochondrial genomes. Current Biology 23(7):553–559.
6. Paradis E, Claude J, Strimmer K (2004) APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20(2):289.
7. Lewis PM, Simons GF, Fennig CD (2013) Ethnologue: Languages of the world.
8. Paradis E (2013) Molecular dating of phylogenies by likelihood methods: a comparison of models and a new information criterion. Molecular Phylogenetics and Evolution 67(2):436–444.
9. Xu S et al. (2012) Genetic dating indicates that the Asian–Papuan admixture through eastern Indonesia corresponds to the Austronesian expansion. Proceedings of the National Academy of Sciences 109(12):4574–4579.
10. Wilkinson-Herbots HM (2008) The distribution of the coalescence time and the number of pairwise nucleotide differences in the “Isolation with Migration” model. Theoretical Population Biology 73:277–288.
11. Slatkin M, Hudson RR (1991) Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129(2):555–562.
12. Ewing G, Hermisson J (2010) MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26(16):2064–2065.
13. Therik T (2004) Wehali – The Female Land: Traditions of a Timorese Ritual Centre. (Pandanus Books: ANU Research School of Pacific and Asian Studies, Canberra, Australia).
14. Legendre P, Desdevises Y, Bazin E, Page RDM (2002) A statistical test for host–parasite coevolution. Systematic Biology 51:217–234.
15. Dray S, Legendre P (2008) Testing the species traits–environment relationships: The fourth-corner problem revisited. Ecology 89:3400–3412.
16. Tehrani JJ, Collard M, Shennan SJ (2010) The cophylogeny of populations and cultures: reconstructing the evolution of Iranian tribal craft traditions using trees and jungles. Philosophical Transactions of the Royal Society B: Biological Sciences 365(1559):3865–3874.

