Fine-scale genomic analyses of admixed individuals reveal unrecognized genetic ancestry components in Argentina

Similarly to other populations across the Americas, Argentinean populations trace their genetic ancestry back to African, European and Native American ancestors, reflecting a complex demographic history with multiple migration and admixture events in pre- and post-colonial times. However, little is known about the sub-continental origins of these three main ancestries. We present new high-throughput genotyping data for 87 admixed individuals across Argentina. These data were combined with previously published data for admixed individuals in the region and then compared to different reference panels specifically built to perform population structure analyses at a sub-continental level. Concerning the Native American ancestry, we identified four Native American components segregating in modern Argentinean populations. Three of them are also found in modern South American populations and are specifically represented in the Central Andes, Central Chile/Patagonia, and Subtropical and Tropical Forests geographic areas. The fourth component might be specific to the Central Western region of Argentina, and it is not well represented in any genomic data from the literature. As for the European and African ancestries, we confirmed previous results about origins from Southern Europe, Western and Central Western Africa, and we provide evidence for the presence of Northern European and Eastern African ancestries.


The first systematic investigations of human genetic variation in Argentina focused on a limited number of markers, either uniparental (mtDNA, Y-STRs, Y-SNP; [1-10] [...]

Although the departure harbor is a poor proxy to infer the actual slaves' origins [25,26], the African uniparental lineages observed in Argentina [...]

[...] Table), the algorithm allows estimating the proportions of European, Native American and African ancestry. For K=4 to K=7 (S2C-F Figs), the model detects sub-continental ancestries. At K=8 (Fig 2; S3 Table), the European ancestry is divided into Northern and Southern components (dark and light blue, respectively), while African ancestry is composed of Westernmost African (green), Gulf of Guinea (light green) and Bantu-influenced (dark green) components. [...]

From the Admixture results for K=8 (Fig 2), we observed that the European ancestry of Argentinean samples is divided into Southern and Northern components, the former being the most abundant. The low proportion of African ancestry in Argentinean samples makes it difficult to interpret its sub-continental origins from analyses within a global context. Surprisingly, all three components of Native American ancestry are present in most Argentinean samples (Fig 2). They exhibit intermediate proportions of the different Native ancestries, suggesting either a mixture between these three ancestry components or an underrepresentation of Native American diversity in the reference panels currently available. Such a mixture pattern is not observed in other South American countries. Indeed, the Native American ancestry of Peruvian, Chilean and Colombian admixed samples is mainly represented by CAN, CCP and STF, respectively. This is consistent with the geographical area where the admixed individuals have been sampled, and with the genetic ancestry of the indigenous communities from each country. [...] Table).

Sub-continental ancestry components in Argentina
The genetic legacy of European migration in Argentina

We used DS4, a combination of the masked genotype data for admixed individuals with a set of European individuals carefully selected to be representative of the genetic diversity in their sampling area [45,46]. From the PCA, we observed that most Argentinean individuals cluster with Iberians and Italians (Fig 3) [...]

[...] individuals. Moreover, this analysis also showed that some individuals exhibit smaller, yet important, proportions of Eastern African ancestry, particularly in Northern Argentina (Fig 4). Although the high missing genotype rate in masked data for admixed individuals could bias PCA and Admixture results, the results obtained by both methods are highly consistent for admixed individuals (S11 Fig).

[...] individuals were assigned to a fourth cluster (S4 Table). The remaining 26 individuals were removed from further analyses because their group assignment was not consistent across the three clustering approaches. We acknowledge that these groups are culturally, ethnically and linguistically heterogeneous. However, we

Different African ancestry components in Argentina
argue that analyzing such groups built from genetic similarities may provide interesting insights into the evolutionary mechanisms that shaped the Native American ancestry in South America.

Most individuals from Calingasta (located in the Northwest Monte and Thistle of the Prepuna ecoregion; San Juan Province) and from Santiago de Chile were assigned to the fourth group. The genealogical record for the Calingasta individuals attests to a local origin of their direct ancestors up to two generations ago, and they have mtDNA sub-haplogroups predominant in the Cuyo region (S1 Table; [...]

[...] and any other cluster is not lower than for other comparisons (Fig 7B). The lowest FST was observed between STF and CAN, probably because STF encompasses the Northern Andes region (Fig 7B). Moreover, the distribution of 1 - f3(YRI; Ind1, Ind2) between pairs of individuals from different groups (S17 Fig) is an additional argument to discard a scenario of mixture (mentioned before as scenario 1).

Furthermore, f4 analyses showed that (i) CAN has no particular genetic affinity with any component relative to the others; (ii) STF is closer to CAN than to CCP and CWA; and (iii) CWA and CCP exhibit higher genetic affinity with each other than with CAN or STF (Fig 7C; S5B Table). However, a neighbor-joining analysis [58] from distances of the form 1/f3(YRI; X, Y) suggests that CAN is more closely related to CCP and CWA than to STF (Fig 7D; S5C Table).
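As an illustration of how distances of the form 1/f3(YRI; X, Y) could be assembled before a neighbor-joining step, the following is a minimal sketch only. It assumes per-SNP allele frequencies are already available for the outgroup and for each group, and it uses the naive f3 estimator (mean product of frequency differences) without the finite-sample bias correction that dedicated software applies. The function names and the input layout are hypothetical and are not taken from the study.

```python
import numpy as np


def outgroup_f3(p_out, p_x, p_y):
    """Naive outgroup f3: mean over SNPs of (p_out - p_x) * (p_out - p_y).

    Assumption: inputs are allele-frequency vectors over the same SNPs.
    No bias correction for sample size is applied in this sketch.
    """
    p_out, p_x, p_y = (np.asarray(p, dtype=float) for p in (p_out, p_x, p_y))
    return float(np.mean((p_out - p_x) * (p_out - p_y)))


def inverse_f3_distances(p_out, freqs):
    """Build a symmetric matrix of 1/f3(outgroup; X, Y) for all group pairs.

    freqs: dict mapping a group name to its allele-frequency vector
    (hypothetical layout). The diagonal is left at zero.
    """
    names = list(freqs)
    n = len(names)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            f3 = outgroup_f3(p_out, freqs[names[i]], freqs[names[j]])
            dist[i, j] = dist[j, i] = 1.0 / f3
    return names, dist
```

Larger shared drift since the split from the outgroup gives a larger f3 and therefore a smaller distance, which is why the inverse is a natural input for a distance-based tree.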
When comparing the genetic affinity of a given component with the different ancient groups using either the f3-outgroup or the f4 statistics (S19F Fig and S5D,E Tables), we identified that CAN tends to exhibit greater genetic affinity with ancient Andean populations than with other ancient groups (S19A, E Figs) [...]

Then, we evaluated the relationship of the time depth of the ancient samples from either the Andes or the Southern Cone with their genetic affinity to the modern components of Native American ancestry (Fig 8). We observed a statistically significant relationship between the age of the ancient Southern Cone samples and their genetic affinity with CCP and CWA. This means that the older the ancient sample from the Southern Cone, the lower the shared drift with CCP and CWA. On the other hand, no statistically significant relationship was identified for STF and CAN (P = 0.523 and P = 0.596, respectively; Fig 8A). These patterns could be due to a relationship between geography and the age of the ancient samples because the most recent samples are concentrated in the Southern tip of the subcontinent (Fig 5A). Moreover, the number of SNPs with genotype data tends to decrease with the age of the ancient samples due to DNA damage, thus inducing a potential bias towards significant positive correlations. To simultaneously correct for these two putative confounding effects, we repeated the analyses using a correction for the ancient sample age (the residuals of the linear regression between the age of the ancient samples and their geographic coordinates) and a correction for genetic affinity estimates (the residuals of the linear regression between f3 and the number of SNPs used to estimate it). This correction strengthened the relationship described for CCP and CWA (Fig 8C). It also allowed us to identify significant relationships for STF and CAN.
On the other hand, CAN is the only modern Native American component that exhibits a significant relationship between its genetic affinity with ancient Andean samples and their age (Fig 8B). This pattern holds after correction for geography (Fig 8D). Repeating the same analyses using f4 statistics, we reached the same conclusions (S20 Fig). Using another setting of the f4 statistics (S5F Table; [...]

In order to get insights into the past genetic influence among the four components since their divergence, we applied a last f4-statistics analysis (S5G Table; [...]

[...] has been generated misrepresents CWA since its early divergence with CCP, as well as the common ancestors specific to these two components.
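The residual-based correction used for the age and genetic affinity analyses can be sketched as follows. This is an illustrative reading of the text, not the code used in the study: it assumes that sample age is regressed on the geographic coordinates, that f3 is regressed on the number of SNPs, that both fits are ordinary least squares with an intercept, and that the relationship is then summarized with a Pearson correlation of the two residual vectors. All function and argument names are hypothetical.

```python
import numpy as np


def ols_residuals(y, predictors):
    """Residuals of an ordinary least-squares fit of y on predictors.

    An intercept column is added automatically. `predictors` may be a
    1-D vector (one covariate) or a 2-D array (one column per covariate).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(predictors, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def corrected_age_affinity(age, coords, f3, n_snps):
    """Correlate age and f3 after removing the two putative confounders.

    age:    age of each ancient sample
    coords: (n_samples, 2) array of geographic coordinates
    f3:     outgroup-f3 affinity of each ancient sample to one component
    n_snps: number of SNPs used to estimate each f3 value
    """
    age_res = ols_residuals(age, coords)   # age corrected for geography
    f3_res = ols_residuals(f3, n_snps)     # affinity corrected for SNP count
    r = float(np.corrcoef(age_res, f3_res)[0, 1])
    return age_res, f3_res, r
```

Working on residuals keeps only the part of each variable that the confounder does not explain, so a correlation that survives this step is harder to attribute to sampling geography or to uneven SNP coverage.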

We genotyped 94 individuals with the Axiom LAT1 array (Affymetrix) from 24 localities and 17 provinces across Argentina (Fig 1). These samples were selected [...] genotyping Quality Controls (S1 Table).

Most of the genotype data processing was performed using in-house scripts in R [...] We thus built different dataset arrangements (named DS<n>) that we analyzed throughout this work (S2 Table). [...] (Fig 2, S2 Fig and S3 Table).

[...] as well as Native American individuals identified through the Admixture procedure described before. We used 1 Expectation-Maximization iteration (-e 1), updating the reference panel in this process (--reanalyze-reference). We used a CRF spacing size and a random forest window size of 0.2 cM (-c 0.2 and -r 0.2). We used a node size of 5 (-n 5). We set the number of generations since admixture to 11 (-G 11), considering the estimates from [31]. The forward-backward output was then interpreted to assign each allele to the ancestry exhibiting the highest posterior probability, provided that it was greater than 0.9. Otherwise, the allele ancestry

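The assignment rule applied to the forward-backward output (keep the most probable ancestry only when its posterior probability is greater than 0.9) can be sketched as below. This is a minimal illustration under stated assumptions: the posteriors are taken to be available as an array whose last axis indexes the candidate ancestries, and alleles that do not pass the threshold are given a placeholder label. The treatment of those alleles is cut off in the text, so the `unassigned` sentinel is purely hypothetical, as are the function and argument names.

```python
import numpy as np


def call_allele_ancestry(posteriors, threshold=0.9, unassigned=-1):
    """Assign each allele to its most probable ancestry if confident enough.

    posteriors: array of shape (..., n_ancestries) holding posterior
                probabilities per allele (hypothetical layout).
    threshold:  the call is kept only if the top posterior is strictly
                greater than this value (0.9 in the text).
    unassigned: placeholder label for alleles below the threshold
                (an assumption of this sketch).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    best = posteriors.argmax(axis=-1)      # index of the most probable ancestry
    best_prob = posteriors.max(axis=-1)    # its posterior probability
    return np.where(best_prob > threshold, best, unassigned)
```

For example, an allele with posteriors (0.95, 0.03, 0.02) is assigned to the first ancestry, whereas one with posteriors (0.5, 0.4, 0.1) receives the placeholder label.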
In order to analyze the ancestry-specific population structure we masked the data, i.e. for each individual, we assigned a missing genotype to any position for which at least one of the two alleles was not assigned to the relevant ancestry. In other words, to study ancestry A, we kept for each individual the regions exhibiting ancestry A on both haplotypes (ditypes), as illustrated in S5 Fig.

European ancestry specific population structure

To study European ancestry specific population structure, we analyzed together masked data for this ancestry for Colombian individuals from 1KG and individuals from DS2P and DS3P, excluding individuals from Chilean Native American communities [37]. These data were merged with a set of reference individuals with European ancestry [46], which is a subset of the POPRES dataset [45]. We call this data set DS4. We removed individuals with less than 30% of SNPs with the ancestry ditypes (--mind 0.7 with plink 1.9). We also removed SNPs with more than 50% of missing genotypes (--geno 0.5 with plink 1.9). Thus, DS4 contains 132 modern [...]

We report the PCA results summarized into a 2-dimensional space by applying Multidimensional Scaling on a weighted Euclidean distance matrix for the first N PCs. We weighted each PC by the proportion of variance it explains. We selected the N most informative PCs according to the Elbow method on the proportion of explained variance. Admixture [43] was run with K ranging from 2 to 10 with a cross-validation procedure.

African ancestry specific population structure

To study African ancestry specific population structure, we analyzed together masked data for this ancestry for individuals from DS2P and DS3P. These data were merged with a compilation of reference individuals with African ancestry from [42,48-50].
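The ditype masking and the missingness filters applied to the masked data in these ancestry-specific analyses can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: genotypes are taken as allele counts in an individuals-by-SNPs array, local ancestry is given as one label per haplotype and SNP, missing genotypes are encoded with a hypothetical sentinel, and individuals are filtered before SNPs (the order is an assumption). The thresholds play the role of the --mind and --geno values quoted in the text.

```python
import numpy as np

MISSING = -1  # hypothetical code for a missing (masked) genotype


def mask_to_ditypes(genotypes, anc_hap1, anc_hap2, target):
    """Keep a genotype only where both haplotypes carry the target ancestry.

    genotypes: (n_individuals, n_snps) allele counts (0, 1 or 2)
    anc_hap1, anc_hap2: ancestry label of each haplotype at each SNP
    target: label of the ancestry under study
    """
    genotypes = np.asarray(genotypes)
    ditype = (np.asarray(anc_hap1) == target) & (np.asarray(anc_hap2) == target)
    return np.where(ditype, genotypes, MISSING)


def filter_by_missingness(masked, mind, geno):
    """Drop individuals, then SNPs, whose missing rate exceeds a threshold.

    mind: maximum missing rate tolerated per individual
    geno: maximum missing rate tolerated per SNP
    Returns the filtered matrix and the two boolean keep-masks.
    """
    masked = np.asarray(masked)
    keep_ind = (masked == MISSING).mean(axis=1) <= mind
    kept = masked[keep_ind]
    keep_snp = (kept == MISSING).mean(axis=0) <= geno
    return kept[:, keep_snp], keep_ind, keep_snp
```

With a per-individual threshold of 0.7, an individual is retained only if at least 30% of its SNPs are ditypes for the ancestry of interest, which matches the criterion stated for DS4.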
We removed African individuals with less than 99% of African ancestry when comparing them to the 2504 individuals from 1KGP (Admixture with K=7 minimizing cross-validation score). We thus reduced the African reference to 1685 individuals. We call DS5 the data set containing both the masked data for admixed South American individuals and the African reference individuals. We removed SNPs with more than 10% of missing genotypes (--geno 0.1 with plink 1.9), and individuals with less than 5% of the ancestry ditypes (--mind 0.95 with plink 1.9). Thus, DS5 contains 26 modern Argentinean individuals (all from the present study), and 12 individuals from Lima (9). DS5 consisted of 137,136 SNPs, of which 128,086 remained after LD-pruning (--indep-pairwise 50 5 0.5 flag in plink2). PCA and Admixture were performed as for the European ancestry specific population structure analyses (described before).

Native American ancestry specific population structure

To study Native American ancestry specific population structure, we analyzed together masked data for this ancestry for individuals from DS2P and DS3P. This