Supplementary Figure 1

Country distribution of GME samples and designation of geographical subregions.

GME samples collected across 20 countries and territories from the GME. Pie size corresponds to the number of samples from each country, and each pie shows the proportion of samples filtered because of quality control and relationship status (Online Methods). Geographical subregions are colored to show the sets of grouped countries. Some non-uniformity of sampling was inevitable owing to the inaccessibility of some populations. Map downloaded from http://www.presentationmagazine.com/ then colored.

Nature Genetics: doi:10.1038/ng.3592


Supplementary Figure 2

Unbiased genetic clustering demonstrates shorter genetic distance between samples from proximal geographical subregions.

Dendrogram of unbiased genetic clustering correlated with geographical subregion designation. Exome sequencing was performed on 2,497 samples from the Greater Middle East Consortium, including 1,111 GME samples as well as samples from Africa, East Asia, Europe, the Americas, Oceania, and unknown regions. Calculated identity-by-state (IBS) distances between samples represent the number of non-identical positions. Concordance between recruitment location and IBS clustering was observed for all GME subregions. Some intermixing was evident, suggesting recent migration events.
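
As a rough illustration of the distance described above, the following Python sketch counts non-identical positions between two samples. The genotype coding (alternate-allele counts 0, 1, 2, with -1 for a missing call), the choice to skip missing calls, and the function name `ibs_distance` are assumptions made for this example; the caption does not specify them.

```python
from typing import Sequence


def ibs_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count positions at which two genotype vectors are not identical.

    Assumed coding (not from the source): 0, 1, 2 = alternate-allele count,
    any negative value = missing call. Positions with a missing call in
    either sample are skipped.
    """
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same positions")
    return sum(1 for x, y in zip(a, b) if x >= 0 and y >= 0 and x != y)


# Hypothetical genotypes for two samples at five positions.
sample_1 = [0, 1, 2, 2, 0]
sample_2 = [0, 2, 2, -1, 1]
print(ibs_distance(sample_1, sample_2))  # prints 2
```

A matrix of such pairwise distances could then be passed to a hierarchical clustering routine to draw a dendrogram; the clustering method used for the figure is not stated in the caption.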


Supplementary Figure 3

ADMIXTURE cross-validation.

(a) Cross-validation errors for the ADMIXTURE results shown in Supplementary Figure 1. Analysis with k = 6 gave the lowest cross-validation error. (b) Cross-validation errors for GME and 1000 Genomes Project samples.


Supplementary Figure 4

Unsupervised ADMIXTURE analysis of GME populations shows genetic history.

Results of ADMIXTURE analysis for LD-filtered variants for 1,111 GME samples across the six geographical subregions. Eleven iterations of k were run, from 2 to 12, to optimize clustering. Each vertical bar represents a single individual. The y axis shows the estimated proportion of the genome assigned to each ancestral cluster. Samples are grouped by subregion and organized from west (left) to east (right), showing trends of overlap. Substantial substructure was apparent throughout much of the GME, but three apparent ‘sources’ of ancestral populations stem from the NWA (yellow), AP (red), and PP (green) subregions.


Supplementary Figure 5

Introgression analysis of GME and 1000 Genomes Project exome samples shows consistent Neanderthal introgression in all GME, European, and East Asian samples except for NWA.

(a) Individuals from the 1000 Genomes Project reference populations and GME subregions were projected onto the first two principal components calculated from Neanderthal, chimpanzee, and Denisovan genomes. PC1 separates ancient human populations from chimpanzee, and PC2 separates the Neanderthal and Denisovan populations. When human samples were projected onto these principal components, they clustered near the center of these three species. Arrows are drawn from the center of the sub-Saharan African populations to each of the ancestral human and chimpanzee points. The sub-Saharan African populations represent a control group, where only limited Neanderthal and Denisovan introgression should be present. (b) Magnified view of panel a showing the dispersal of human populations within these two principal components. Samples are colored on the basis of continental origin, and subpopulations are labeled to indicate the center of each population. African populations were separate from the remaining populations, which lay along the Neanderthal vector from this adjusted origin. Most populations were tightly clustered, with only the TP and NWA populations showing clear separation, suggesting a common time point of introgression among the clustered populations. The NWA samples had less introgression than the other GME populations.


Supplementary Figure 6

Heat map of pairwise FST values among all 1000 Genomes Project and GME populations identifies three clusters with a low degree of differentiation.

Top right, Wright’s fixation index; bottom left, standard error values. Populations are ordered on the basis of geographical location. Three distinct clusters of close populations (shown as a blue gradient) are evident: 1000 Genomes Project Africa (LWK and YRI); 1000 Genomes Project Europe (FIN, CEU, and TSI) together with the GME subregions (NWA, NEA, AP, SD, TP, and PP); and 1000 Genomes Project East Asia (JPT, CHS, and CHB). Among global populations, the GME and European populations were more closely related than any other two continental regions. The greatest distance between any two populations was estimated as 0.212 for YRI and JPT. As populations became more distant, standard error values increased but remained small for all comparisons.
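
For readers unfamiliar with the fixation index, the sketch below computes the textbook form F_ST = (H_T - H_S) / H_T for a single biallelic site in two equally weighted populations. It is only an illustration: the estimator actually used for the heat map, how sites were combined, and how standard errors were obtained are not given in this caption, and the function name `wright_fst` and the example frequencies are hypothetical.

```python
def wright_fst(p1: float, p2: float) -> float:
    """Textbook Wright's F_ST at one biallelic site for two populations.

    p1, p2: allele frequencies of the same allele in each population.
    Assumes equal population weights (an assumption for this sketch).
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled (total) population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the two subpopulations.
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_t == 0.0:
        return 0.0  # site is monomorphic overall; no differentiation to measure
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies, not taken from the study.
print(wright_fst(0.5, 0.5))  # identical frequencies give 0.0
print(wright_fst(1.0, 0.0))  # fixed differences give 1.0
```

Values near 0 indicate little differentiation between two populations and values near 1 indicate nearly fixed differences, which is the scale on which the reported 0.212 for YRI and JPT should be read.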


Supplementary Figure 7

Principal-component analysis on GME and 1000 Genomes Project populations showed that PC3 and PC4 explained inter-GME variance.

Plots comparing all combinations of PC1, PC2, PC3, and PC4 and percentages of variance explained. GME populations are color-coded by geographical regions. PC1 (39.03%) and PC2 (31.38%) together accounted for the majority of variation in the data and were associated with separating Africans and East Asians from other samples, respectively. PC3 and PC4 separated GME and European populations along north–south and east–west axes, respectively. AP was the most distant cluster from the 1000 Genomes Project reference populations, showing the greatest separation along PC3. Both of the North African populations tended to cluster closer to the sub-Saharan African cluster, whereas PP and TP trended toward the East Asian cluster.


Supplementary Figure 8

Reported consanguineous marriage rates were many fold higher in the GME than in other continental populations.

Clinical survey results were aggregated to estimate regional averages of the consanguineous marriage rate. Weighted averages, taking sample size into account, were calculated across all studies falling within a given region. The highest rates of consanguineous marriage were documented in PP and AP.
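
The sample-size weighting described above can be sketched as follows. The function name `weighted_rate` and the two example studies are hypothetical and serve only to show the arithmetic; they are not survey results from the paper.

```python
from typing import Iterable, Tuple


def weighted_rate(studies: Iterable[Tuple[float, int]]) -> float:
    """Sample-size-weighted mean of per-study rates.

    studies: (rate, sample_size) pairs for all studies within one region.
    """
    studies = list(studies)
    total_n = sum(n for _, n in studies)
    if total_n <= 0:
        raise ValueError("total sample size must be positive")
    return sum(rate * n for rate, n in studies) / total_n


# Two made-up studies: 20% of 100 couples and 50% of 300 couples.
regional_average = weighted_rate([(0.20, 100), (0.50, 300)])
print(round(regional_average, 3))  # prints 0.425
```

Weighting by sample size means a large survey contributes proportionally more to the regional estimate than a small one, rather than each study counting equally.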


Supplementary Figure 9

GME samples carried longer and rarer runs of homozygosity than 1000 Genomes Project populations.

(a) Cumulative proportion of total ROH length by bin for African, East Asian, European, and GME populations. African populations had the shortest accumulation of ROH spans, whereas GME populations showed the longest despite the limited influence of bottlenecks. (b) Distribution of total ROH length (in Mb) for all 1000 Genomes Project and GME populations. Wider distributions were evident for the GME populations owing to heterogeneity in long ROHs. (c) The total number of exomic bases found in ROHs binned by frequency in each population. GME ROHs tended to be unique in comparison to 1000 Genomes Project populations.


Supplementary Figure 10

Identity-by-state distance comparing human and chimpanzee reference genomes showed burden bias associated with hg19 corrected using estimated ancestral alleles.

(a) Homozygous and heterozygous variant counts shown for samples using hg19 (left) and PanTro2 (right) as the reference genomes. PanTro2 alleles demonstrated a linear relationship between populations, arguing for no burden difference. (b) IBS distance to the reference for chimpanzee genomes PanTro2 and PanTro4 (x axis) versus human hg19 (y axis). Human populations stratify by IBS distance using the hg19 reference genome. With chimpanzee ancestral variants, populations were equidistant from the chimpanzee reference genome.


Supplementary Figure 11

Correction of PolyPhen-2 predictions for derived variants resolved missense burden bias.

(a) The proportions of derived (Der) and ancestral (Anc) variants falling into each PolyPhen-2 class (B, benign; P, possibly damaging; D, probably damaging), across 14 allele frequency bins. The bias was apparent in the absence of possibly damaging and probably damaging calls for derived variants across nearly all bins. This bias can misrepresent results when comparing populations. (b) The same proportions after correction of derived variant PolyPhen-2 classes (Online Methods). Derived variant classes reflect the distributions of the ancestral variants. The x axis shows derived allele frequency bins, with parentheses and square brackets designating exclusion and inclusion, respectively.


Supplementary Figure 12

Mean derived allele frequencies for GME and 1000 Genomes Project populations across seven functional and deleteriousness variant classes suggested equivalent selective pressure.

(a) Calculated mean DAFs and standard errors for GME and 1000 Genomes Project populations. Variants were separated by functional class (noncoding, synonymous, nonsynonymous, and LOF) and corrected PolyPhen-2 deleteriousness class (benign, possibly damaging, and probably damaging). Populations are ordered as indicated on the right. No significant difference between populations was found for any variant class. (b) Mean DAF comparison for the X chromosome. Large error bars for some classes reflect limited ascertainment of variants within those classes.


Supplementary Figure 13

Comparison of allele frequency estimates from Exome Variant Server European-American and African-American populations showed poor correlation.

Comparison of the distribution of estimated allele frequencies for shared variants from two populations, EA and AA, showed poor correlation (Pearson’s r = 0.1147). Hexagonal bins are colored according to the abundance of variants falling within each region. The linear regression line (blue) and identity line (black) are shown.
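
For reference, Pearson's r between two vectors of allele frequency estimates can be computed as in this minimal sketch. The function name `pearson_r` and the example frequencies are hypothetical; the reported value of 0.1147 comes from the study's own data, which are not reproduced here.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up allele frequencies for the same variants in two populations.
freq_pop_a = [0.01, 0.05, 0.20, 0.40]
freq_pop_b = [0.30, 0.02, 0.10, 0.05]
print(pearson_r(freq_pop_a, freq_pop_b))
```

A value close to 0 means that a variant's frequency in one population says little about its frequency in the other, which is why the regression line in the figure departs from the identity line.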


Supplementary Table 1. Distribution of samples across countries and geographic regions, before filtering.

Geographic Region    Country/Territory    Individuals
Arabian Peninsula    Kuwait                        45
                     Oman                          19
                     Qatar                          8
                     Saudi Arabia                 184
                     UAE                           44
                     Yemen                          8
Northeast Africa     Egypt                        614
                     Libya                         59
Northwest Africa     Algeria                       13
                     Morocco                       96
                     Tunisia                        4
Persia and Pakistan  Iran                          87
                     Pakistan                     136
Syrian Desert        Iraq                          10
                     Israel                        13
                     Jordan                        40
                     Lebanon                        5
                     West Bank                      9
                     Syria                          6
Turkish Peninsula    Turkey                       249
Total                                            1649


Supplementary Table 2. The majority of loss-of-function (LOF) variants had allele frequency <0.5%, and co-occurrence within a gene was rare. The functional class of predicted LOF variants and the counts for three allele frequency bins (<0.5%, 0.5%<x<2%, >2%) are shown.

                        <0.5%           0.5%<x<2%       >2%             Total
Functional class        Indels   SNPs   Indels   SNPs   Indels   SNPs   Variants  Genes
Frameshift Deletion       1426      0      276      0      279      0       1981   1670
Frameshift Insertion       762      0      319      0      461      0       1542   1338
Stopgain                    48   5129        7    795        8    306       6293   4648
Stoploss                     4    174        1     51        2     68        300    282
Exonic near-splice           0     72        0     25        3     35        135    127
Splice Junction             20      0        6      1       10      3         40     40


Supplementary Table 4. Number of predicted high-impact variants passing segregation filters and allele count filters. Four datasets were used as a healthy background to reduce the number of candidates: HSP, using only healthy members of HSP families; With GME, using the GME variome to estimate allele frequencies; With EA, using the EVS EA population; and With AA, using the EVS AA population as background. The column labels 0, 1, 2, and 3 refer to the number of allele shares for that fraction of variants.

                HSP only                        With GME AF                     With EA AF                      With AA AF
Family Members       0     1     2     3 total       0     1     2     3 total       0     1     2     3 total       0     1     2     3 total
1242         1      51    23    29    14   117       9     1     3     0    13      22     1     4     0    27      17     1     3     0    21
1314         1      50    41    26    23   140      11     1     1     0    13      14     1     1     0    16      12     1     1     0    14
1549         1      74    30    23    16   143      12     3     3     1    19      33     3     3     1    40      29     2     2     1    34
1829         1      55    28    18    18   119       6     2     0     1     9      17     2     0     1    20      16     2     1     1    20
709          1      80    38    40    25   183      31     5     4     5    45      48     6     4     5    63      39     6     5     5    55
789          1      35    17    20    14    86      12     1     1     3    17      16     1     1     3    21      15     1     1     3    20
882          1      44    21    16    22   103       9     4     1     1    15      19     4     1     1    25      17     2     1     1    21
Mean              55.6  28.3  24.6  18.9   127    12.9  2.43  1.86  1.57  18.7    24.1  2.57     2  1.57  30.3    20.7  2.14     2  1.57  26.4

1098         2      19    10     9    12    50       2     1     0     0     3       3     1     0     0     4       5     1     0     0     6
1241         2      12    10     7     5    34       2     3     0     1     6       8     3     0     1    12       6     3     0     1    10
1290         2      19    12     3    10    44       1     0     0     1     2       3     0     0     1     4       1     0     0     1     2
1349         2       9     7     9    13    38       1     0     0     2     3       2     0     0     2     4       3     0     0     2     5
1526         2      18     8     7     3    36       9     1     0     1    11      15     1     0     1    17       8     1     0     1    10
1598         2      29    10     9    10    58       2     0     2     0     4       8     0     2     0    10       5     1     2     0     8
1611         2      45    19    22    13    99      13     2     5     1    21      23     2     5     2    32      15     2     6     2    25
1675         2      24    13     5    10    52       2     0     0     2     4       7     0     0     2     9      10     1     0     2    13
1800         2      21     4     4     7    36       5     1     0     1     7       7     1     0     1     9       9     1     0     1    11
659          2      36    17    10    14    77      11     2     1     2    16      19     2     1     2    24      15     4     1     2    22
738          2       9     5    13     9    36       1     0     1     1     3       5     1     1     1     8       5     0     1     1     7
787          2       6     4     4     3    17       1     0     0     1     2       1     0     0     1     2       1     0     0     1     2
803          2      13     5     7     6    31       3     0     0     0     3       4     0     0     0     4       2     0     0     0     2
910          2      32    20    14    14    80       9     1     3     1    14      16     2     3     1    22       8     3     3     1    15
Mean              20.1  9.71  8.14     9    47    4.57  0.57  0.71  1.14     7    8.43  0.86  0.71  1.14  11.1    7.14  1.29  0.71  1.14  10.3

786          3      11     5     3     3    22       4     1     0     0     5       4     1     0     0     5       5     1     0     0     6
