May 11, 2014

Ancient Thracians, Ötzi and the origins of modern Europeans (another point of view)

A recent study has sequenced the DNA of an ancient Thracian woman but, for some reason, instead of comparing her with modern Bulgarians and the like, the authors have written a study that is mostly about Ötzi "the Iceman" and does not include a single Bulgarian sample.

Martin Sikora et al., Population Genomic Analysis of Ancient and Modern Genomes Yields New Insights into the Genetic Ancestry of the Tyrolean Iceman and the Genetic Structure of Europe. PLoS Genetics 2014. Open access [doi:10.1371/journal.pgen.1004353]


Genome sequencing of the 5,300-year-old mummy of the Tyrolean Iceman, found in 1991 on a glacier near the border of Italy and Austria, has yielded new insights into his origin and relationship to modern European populations. A key finding of that study was an apparent recent common ancestry with individuals from Sardinia, based largely on the Y chromosome haplogroup and common autosomal SNP variation. Here, we compiled and analyzed genomic datasets from both modern and ancient Europeans, including genome sequence data from over 400 Sardinians and two ancient Thracians from Bulgaria, to investigate this result in greater detail and determine its implications for the genetic structure of Neolithic Europe. Using whole-genome sequencing data, we confirm that the Iceman is, indeed, most closely related to Sardinians. Furthermore, we show that this relationship extends to other individuals from cultural contexts associated with the spread of agriculture during the Neolithic transition, in contrast to individuals from a hunter-gatherer context. We hypothesize that this genetic affinity of ancient samples from different parts of Europe with Sardinians represents a common genetic component that was geographically widespread across Europe during the Neolithic, likely related to migrations and population expansions associated with the spread of agriculture.

Please notice that, as the authors acknowledge, the DNA of the second Thracian individual, K8, may be contaminated:
the DNA damage pattern of this individual does not appear to be typical of ancient samples (Table S4 in [15]), indicating a potentially higher level of modern DNA contamination.

This does not seem to dissuade them from using it in the analyses.

Figure 1. Geographic origin of ancient samples and ADMIXTURE results.
(A) Map of Europe indicating the discovery sites for each of the ancient samples used in this study. (B) Ancestral population clusters inferred using ADMIXTURE on the HGDP dataset, for k = 6 ancestral clusters. The width of the bars of the ancient samples was increased to aid visualization.

Notice that the researchers did not attempt to model moderns on ancients (which would seem logical from the viewpoint of purported ancestry but would be incomplete for lack of a sufficiently large ancient sample), nor did they allow the ancient samples to "float freely" in the analysis. Instead they decided to force them into modern parameters, which is still valid because it indicates greater or lesser affinity to the various modern populations studied (among which there's not a single Balkan sample, oddly enough).
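
To make the idea of "forcing" a sample into modern parameters more concrete, here is a minimal sketch of how an individual's ancestry proportions can be estimated while the cluster allele frequencies stay fixed. It is not the authors' pipeline: the function name `project_ancestry`, the simulated frequencies and the 50% missing-data rate are all assumptions made for illustration, and the update rule is the standard EM step for a binomial mixture of this kind.

```python
import numpy as np

def project_ancestry(genotypes, cluster_freqs, n_iter=500):
    """Estimate one individual's ancestry proportions with the cluster
    allele frequencies held fixed (EM updates for a binomial mixture)."""
    g = np.asarray(genotypes, dtype=float)
    covered = ~np.isnan(g)                      # skip SNPs with no data
    g = g[covered]
    p = np.clip(np.asarray(cluster_freqs, dtype=float)[:, covered], 1e-6, 1 - 1e-6)
    q = np.full(p.shape[0], 1.0 / p.shape[0])   # start from equal proportions
    for _ in range(n_iter):
        mix = q @ p                             # individual's expected allele frequency
        # E-step: share of each allele copy attributed to each cluster
        alt_share = q[:, None] * p / mix
        ref_share = q[:, None] * (1 - p) / (1 - mix)
        # M-step: average those shares over all 2 * n_snps allele copies
        q = (alt_share * g + ref_share * (2 - g)).sum(axis=1) / (2 * g.size)
    return q

# Toy data (all values hypothetical): 3 "modern" clusters, 50,000 SNPs.
rng = np.random.default_rng(0)
cluster_freqs = rng.uniform(0.05, 0.95, size=(3, 50_000))
true_q = np.array([0.6, 0.3, 0.1])
genotypes = rng.binomial(2, true_q @ cluster_freqs).astype(float)
genotypes[rng.random(genotypes.size) < 0.5] = np.nan  # mimic patchy ancient coverage

q_hat = project_ancestry(genotypes, cluster_freqs)
print(np.round(q_hat, 3))
```

In this setup the projected individual can only be described as a mix of the components that the reference populations already define; it has no way of creating a component of its own.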

We can see that:
  • Epipaleolithic Iberian Braña 1 approximates the French structure but is somewhat "more Basque" than these. 
  • Neolithic Pitted Ware semi-forager Ajv70 (Gotland) approximates the Orcadians very well.
  • Neolithic Megalithic/Funnelbeaker Gok4 (Southern Sweden) approximates North Italians. 
  • Chalcolithic North Italian Ötzi (Iceman) is close to Sardinians but not quite the same ("more Basque" again).
  • Iron Age Thracian commoner P 192-1 approximates Tuscans.
  • I would ignore princely Thracian K8 because of the aforementioned contamination issues.

For completeness, I'm also including here fig. S1, which shows the ADMIXTURE runs for K=2 to K=8:

Fig S1- ADMIXTURE results for HGDP. Panels show the results for ADMIXTURE runs for k = 2 to k = 8 ancestral clusters on the HGDP individuals, and the corresponding cluster proportions inferred for the ancient samples.

Notice (see fig. S7) that K=3-5 are quite poor fits and should therefore be ignored as meaningless. From K=6 onwards the scores improve slightly for all the ancient samples. However, it must be said that K=2 is in general the best fit for most European populations, and most of the improvement in error score at higher K is due to a better approximation to West Asian samples.

In most cases Basques have the lowest or one of the lowest error scores (except at K=5, where Basques are portrayed as a Russian-Sardinian mix, which is clearly a confounding artifact). Sardinians also have very low error scores but only from K=5 onwards, when the Sardinian component becomes apparent. The Iceman has very low error scores for all K values, while the Thracian samples have the highest ones, maybe owing to the lack of Balkan samples.

For me these error results suggest that the ancients are fine being just "unspecific Europeans" (K=2 blue), while the low error scores for Basques and Sardinians surely underline that these are about the only modern populations which can be explained as a simple Paleolithic-Neolithic mix, without need of a third, Indo-European, extra ancestry.

They also projected the ancient samples onto PCA plots of modern European populations:

Fig S2 - PCA results for HGDP. Panels show the results for PCA on the HGDP individuals for subsets of SNPs with data in the respective ancient sample. Each point represents an individual, with plot symbol and color indicating population of origin. The position of the ancient samples was inferred by projecting onto the PC space calculated using the modern samples only.

For some odd reason the PCAs are different in each case, even if the samples are the same (only moderns are used; the ancients are "projected" and should not affect the result). I have no explanation for this and I reckon I'm tempted to write to the authors asking about this unexpected complexity, which seems to be a product of the projection itself altering the graph.
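
One possible contributor, going by the caption of fig. S2, is that each panel was computed on a different subset of SNPs (those with data in the respective ancient sample). The sketch below uses simulated genotypes only; the population sizes, drift values and coverage rate are assumptions chosen for illustration. It fits a PCA on the same "moderns" twice, each time restricted to the SNPs covered in a different "ancient" sample, and then projects that ancient sample onto the fitted axes.

```python
import numpy as np

def fit_pca(reference, n_axes=2):
    """PCA axes computed from the reference (modern) genotypes only."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_axes]

def project(samples, mean, axes):
    """Coordinates of samples on previously fitted axes."""
    return (samples - mean) @ axes.T

# Toy data (hypothetical): three modern populations of 20 individuals, 6,000 SNPs.
rng = np.random.default_rng(1)
n_snps = 6_000
ancestral = rng.uniform(0.1, 0.9, n_snps)
drifts = (0.02, 0.05, 0.15)   # the third population is the most divergent
pop_freqs = [np.clip(ancestral + rng.normal(0, d, n_snps), 0.01, 0.99) for d in drifts]
moderns = np.vstack(
    [rng.binomial(2, f, size=(20, n_snps)) for f in pop_freqs]
).astype(float)

panels = {}
for name in ("ancient_1", "ancient_2"):
    covered = rng.random(n_snps) < 1 / 3        # SNPs with data in this ancient sample
    ancient = rng.binomial(2, pop_freqs[0][covered]).astype(float)
    mean, axes = fit_pca(moderns[:, covered])   # the PCA itself uses the moderns only
    panels[name] = {
        "moderns": project(moderns[:, covered], mean, axes),
        "ancient": project(ancient[None, :], mean, axes)[0],
    }

pc1_a = panels["ancient_1"]["moderns"][:, 0]
pc1_b = panels["ancient_2"]["moderns"][:, 0]
pc1_corr = np.corrcoef(pc1_a, pc1_b)[0, 1]
print("PC1 correlation between panels:", round(pc1_corr, 3))
print("identical coordinates:", np.allclose(pc1_a, pc1_b))
```

In a toy case like this the two panels keep the same broad structure, but the coordinates of the moderns are not identical and an axis may come out mirrored, since the sign of a principal component is arbitrary. Whether this accounts for the differences in the published figure is not something the simulation can settle.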

Whatever the case, the projection of the ancient samples follows in general terms the patterns noted above for the ADMIXTURE graph:
  • La Braña 1 projects between French and Basques.
  • Ajv70 projects onto Orcadians, tending also towards France.
  • Ötzi projects between Sardinians and Italians.
  • Gok4 projects with North Italians but not far from Basques.
  • P 192-1 doesn't seem too akin to any specific modern population, although some French, Basques and Tuscans do approximate her.

These results may be frustrating for those already too accustomed to the previous analyses of ancient autosomal DNA but we must not forget that, because of its huge size and complexity, autosomal DNA requires statistical analysis, which is highly susceptible to variations in sampling strategy in particular, as well as to other factors that are not always well understood. Hence different points of view are generally complementary rather than outright contradictory.

Of some interest is also this TreeMix graph of modern populations and Ötzi:

Figure 3. Results of TreeMix analysis of the Iceman with 1000G/Sardinia.
Shown are maximum-likelihood trees and the matrices of pairwise residuals (inset) for a model allowing (A) m = 0 and (B) m = 3 mixture events. Large positive values in the residual matrix indicate a poor fit for the respective pair of populations. Edges representing mixture events are colored according to weight of the inferred edge.

Notable are the low-level African admixture arrow at the root of the Euro-Mediterranean branch (the so-called "Basal Eurasian" element in Lazaridis 2013) and the East Asian component in Finns. Sizable admixture from the West Eurasian root is also apparent among Tuscans. Once these admixture axes are allowed for, the topology of the European tree changes significantly, showing a main split between Eastern Europeans (Finns) and Western/Southern ones.

Other similar trees are available in fig. S6.

No extra Neanderthal admixture in Ötzi

Contrary to some previous rough estimates, Ötzi does not appear to differ from other Europeans in Neanderthal ancestry at all. See figs. S9 and S10.


  1. Nah, forget it, La Brana doesn't cluster there.

    I just spoke to someone at the Reich lab about this issue, because it shows up in PCA done with Eigenstrat as well.

    They call this problem "shrinkage", because the PCA space is shrunk for the projected samples relative to the reference samples, and there's no automatic fix for it yet, but there might be soon.

    You'll find all the technical details here...
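
The shrinkage effect described in this comment can be reproduced on simulated data. The sketch below is only an illustration under assumed values (two populations, 5,000 SNPs, five held-out individuals), not the Eigenstrat procedure itself: the held-out members of a population are projected onto axes computed without them and their average PC1 position is compared with that of their in-sample relatives.

```python
import numpy as np

# Toy data (hypothetical): two populations drifted from a common ancestor.
rng = np.random.default_rng(2)
n_snps = 5_000
ancestral = rng.uniform(0.2, 0.8, n_snps)
freq_a = np.clip(ancestral + rng.normal(0, 0.08, n_snps), 0.01, 0.99)
freq_b = np.clip(ancestral + rng.normal(0, 0.08, n_snps), 0.01, 0.99)

pop_a = rng.binomial(2, freq_a, size=(20, n_snps)).astype(float)
pop_b = rng.binomial(2, freq_b, size=(20, n_snps)).astype(float)

# Hold out 5 individuals of population A; the PCA never sees them.
held_out, ref_a = pop_a[:5], pop_a[5:]
reference = np.vstack([ref_a, pop_b])

mean = reference.mean(axis=0)
_, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
pc1 = vt[0]

ref_a_scores = (ref_a - mean) @ pc1        # in-sample members of A
held_out_scores = (held_out - mean) @ pc1  # projected members of A

# Ratio of average PC1 positions; the arbitrary sign of the axis cancels out.
shrinkage = held_out_scores.mean() / ref_a_scores.mean()
print("projected / in-sample PC1 position:", round(shrinkage, 2))
```

A ratio below 1 means that the projected individuals land closer to the origin than members of the very same population that took part in defining the axes, which is the bias being discussed here.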

    1. I understand (now) the shrinkage problem at a basic level, thanks for the explanation and link about it. I guess that's why the various PCAs look different from each other in spite of having the same samples.

      However it does not explain why the results in the PCAs are so similar to the results in the ADMIXTURE graph for the ancient samples.

      So I understand that, given this dataset, those are the results. Two different tests were used that produce nearly the same results.

      Of course, given other datasets, the results may vary, something that is very important, I'd say fundamental, to understand.

      The characteristics of this dataset (with no representation of Central-North Europeans) are similar in many aspects to those of the Lazaridis one; however, Lazaridis had a substantial Balkan and West Asian sample (missing here in the PCA). It is much more different from the Skoglund dataset, which oversampled Northern Europe.

    2. For the record, something I just commented at your blog and that seems more important than Northern Europe oversampling: West Asian presence or absence. Self-quote:

      The main difference between this dataset (in the PCA only) and the others (Lazaridis, Skoglund, Dasakali) is the absence of a West Asian sample. Logically the West Asia vs Europe difference affects the PCA in ways that a Europe-only dataset (and more so a Europe-minus-Balkans one, as this is) does not experience. As normal PCAs have only two dimensions, this may be decisive: West Asia vs. Europe invariably takes up one of the axes (the other is typically SW Europe vs the rest).

      In Europe-only datasets, instead, the SW Europe pole typically becomes dual, with one axis representing Sardinians vs Russians and the other Adygei vs Basques.

      In brief: when West Asians are thrown in, Sardinians and Basques get closer and ancient foragers appear exotic (hyper-Nordic); when they are removed, Sardinians and Basques diverge much more clearly and ancient foragers appear as more standard Europeans.

      Which is "more correct"? Maybe both in different ways. I really do not know, but it's a sampling strategy effect, no doubt.

    3. No, the ADMIXTURE results are biased as well. From my blog...

      "They didn't run the ancient samples in the same ADMIXTURE analysis as the modern samples, but instead used allele frequencies sourced from the modern samples to test the ancient samples.

      This was a problem because it changed the conditions under which the modern and ancient samples were tested, and resulted in much less precise outcomes for the ancient samples. In effect, this was the ADMIXTURE version of PCA projection bias.

      There are two ways around this: a) run the ancient samples together with the modern samples, or b) source the allele frequencies from a subset of modern samples, and then use them to test the ancient samples as well as the rest of the modern samples. Then you can actually compare the modern samples to the ancient samples."

    4. The ADMIXTURE results are "projected", so to say: just as in the PCA, the ancients are forced to define themselves in terms of modern patterns. This is just as good or as bad as projection onto the PCA but it has no other effect: not "biased", just ancients cannot create their own components, but neither do they in the Lazaridis global ADMIXTURE graph, which I decided to ignore because of its cumbersomeness.

      It'd be interesting to do the opposite: forcing moderns to define themselves as ancients + "ancestral" (say Mbuti), or ancients + a couple of West Asian samples of 1-5 individuals and another such sample from South Morocco (to control for the "Aterian" element, better than YRI). But nobody is doing it, and Lazaridis did not do that either. Even if single "oddball" individuals, as these ancient samples are, are allowed to "float" freely in the ADMIXTURE algorithm, they weigh only 1, so they won't affect the components except at very high K values of the kind nobody ever processes, unless modern populations are also restricted to individual samples (which may be interesting to do).

      You can't just go around saying "it's biased", "it's junk", just because you already fell for another (equally valid) statistical interpretation. It's not that way: autosomal DNA is like linguistics: not exactly rocket science. The statistical tools used to understand it are very powerful, yes, but are not univocal in their results. There's usually another viewpoint just around the corner (different sample, different details of the method) and this one is one of those.

      Whatever the case, when Bra1 is compared to these samples it clearly falls between French and Basques, exactly what happens in the PCA (different methods, somewhat different datasets but similar results). The same happens with Ajv70, only that in his case it goes towards Orkney-France.

      "There are two ways around this: a) run the ancient samples together with the modern samples"...

      Irrelevant (unless you massively reduce the number of modern individuals): their n=1 weight prevents them from showing their specificity.

      "or b) source the allele frequencies from a subset of modern samples, and then use them to test the ancient samples as well as the rest of the modern samples"...

      That's the zombies' approach, which is the biased method, because you selectively decide what is a valid component and what is not. For example Dienekes decided to prune the Basque component but maintained the Sardinian one, which he renamed "Mediterranean", even if Sardinians have less diversity than Basques (never mind the Finns). If you prune Basque specificity, you also need to prune that of other more endogamous populations, something Dienekes ideologically chose not to do. So his "calculator" is the biased thing here and a major reason I strongly distrust "zombie" approaches. That said, other bloggers like the one of the Harappa Ancestry Project don't seem to have this kind of prejudiced bias, so it's still possible that a correct "zombie" approach to Europe may work: it would need to include the Basque "zombie" though.

    5. PS- Notice that the error scores for Basques are markedly highest at K=5, precisely when the Sardinian component shows up but the Basque component is still hidden. Therefore this choice of "zombies" (Dienekes-like) is clearly an error.

  2. Oversampling of North Europeans, primarily Swedes, in the Skoglund 2014 PCA is of little consequence because the primary dimension is defined by the "southernmost" Yemeni vs the "northernmost" Finn (Ancients are of course projected). The secondary dimension is a lack of "Sardinian/Southwest European-ness" which is defined by Swedish Saami at the other end. Yemenis are as distant from Sardinians along this dimension as Finns are. One sample is enough to define a primary or secondary PCA dimension if it's divergent enough, which is best seen in the MA-1 or La Braña PCAs of West Eurasia Davidski did.

    By the way, in which PCA there was a Basques-Adygei PC2? Lazaridis had one with a Basques-Maltese dimension, but that's not quite the same thing.
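
The claim in this comment that a single sufficiently divergent sample can define a PCA dimension is easy to check on simulated genotypes. The sketch below is a toy under assumed values (50 individuals from one population, one outlier with unrelated allele frequencies, 10,000 SNPs) and says nothing about the real Skoglund or Davidski datasets.

```python
import numpy as np

# Toy data (hypothetical): one homogeneous group plus a single divergent sample.
rng = np.random.default_rng(3)
n_snps = 10_000
group_freqs = rng.uniform(0.2, 0.8, n_snps)
outlier_freqs = rng.uniform(0.2, 0.8, n_snps)   # unrelated to the group's frequencies

group = rng.binomial(2, group_freqs, size=(50, n_snps)).astype(float)
outlier = rng.binomial(2, outlier_freqs).astype(float)
genotypes = np.vstack([group, outlier])          # the outlier is the last row

centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1_scores = u[:, 0] * s[0]

# Share of PC1's total sum of squares contributed by the single outlier.
outlier_share = pc1_scores[-1] ** 2 / np.sum(pc1_scores ** 2)
print("outlier share of PC1:", round(outlier_share, 2))
```

When the share is close to 1, PC1 is essentially "outlier vs everybody else", regardless of how many individuals the other group has.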

    1. To the first issue, I responded to myself above (and at Davidski's blog): it seems that the key is not so much the European sample but the presence or absence of West Asians. It dawned on me after the first comment, when I compared all the datasets of the various PCAs of recent autosomal aDNA European studies.

      Apparently West Asian specifics blur the differences between Basques and Sardinians, something that does not happen in a Europe-only dataset.

      "... in which PCA there was a Basques-Adygei PC2?"

      In this study's PCAs: graphs "Ajv70", "Iceman" and "Gok4" specifically. It's not always PC2, because the exact distribution changes because of "shrinkage", so for example in the Braña 1 PCA, the Adygei take PC1 vs both Basques and Russians, while PC2 is both Adygei and Russians vs Sardinians. I've seen that before anyhow: once I recognized the pattern it looked familiar, although I don't recall exactly where.

      If I'm correct, with a Europe-only sample similar to this one the resulting "E-W/N-S" cross pattern should happen most of the time.

      "Lazaridis had one with a Basques-Maltese dimension, but that's not quite the same thing".

      True, but the North Caucasians had been removed along with West Asians (while Balkan samples were present). Whatever the case, it seems to be a similar "anti-Basque" E-W axis, in which both Sardinians/Italians and NE Europeans score in between (neutral).

      On the other hand, the ancient foragers projected somewhat differently: closest to what in this dataset would be the Orcadians. But the WHGs (Loschbour, Braña) still tended more to the Basque pole than the Scandinavian HGs (Motala).

      I wonder to what extent Balkan and South Italian samples act in a similar way to West Asians or not. Solving this issue would require doing a lot of different PCAs with different samples, not necessarily large (I say this because of processing limitations), and I would encourage those with the technical capabilities to do them.

  3. Here, I forgot I had this online.

    Scroll down to the bottom. The first result for PL1 was achieved with ADMIXTURE, and the second with allele frequencies using a calculator. Check out the difference in the spread of the main components.

    1. You obviously used the components of unsupervised test 1 on the very same sample in the supervised test 2, so the results are identical (or so it seems on cursory look). Now try with a substantially different sample, for example Spaniards, Basques and Mbuti: the unsupervised results will almost certainly generate a Mbuti, a Basque and a Spaniard component, while the supervised run may cause almost any result (although most probably Basques and Spaniards will show up as Orcadians, no idea about the Mbuti).

  4. Re: Ötzi and archaic DNA - I wish Hawks would go back and correct himself. His blog post claims the data was about to be published, but it looks like that didn't pan out.

    From that PCA it looks like Ötzi is actually on the low end of archaic admixture for Eurasian populations. I wonder if that means the "basal Eurasian" component was not admixed with Neanderthals/Denisovans.

    1. Logically, the "Basal Eurasian" (or more properly "African") element in the EEF vector (to which Ötzi belongs) should be at the very least less admixed with Neanderthals than the Eurasian element, so he tends somewhat towards the African cluster, as do other Europeans.

      In the Neanderthal admixture study in North Africa, it was apparent that, except for the problematic Tunisians, the Neanderthal component was smaller than in Europe (about half that of CEU, which makes perfect sense) and particularly smaller in Southern Morocco, which is the population that, IMO, best retains the Aterian aboriginal element.

    2. Looking at it again, I think that would imply a much larger migration than what's shown in these results though, wouldn't it? The African element in EEF doesn't even register in those TreeMix graphs - it only shows up in the residuals. The migration edge shown is actually Eurasian admixture into LWK and not the other way around.

      Note the residuals in Figure S8 btw. The African residuals in La Brana are roughly the same size as the African residuals in Gok4 and Ötzi. The largest African residual is actually found in P192-1 (which actually makes a lot of sense). Ajv70 has no African residual and K8 has the next smallest (both of which make sense).

    3. In the right side tree there is a clear African admixture axis affecting the root of the "Mediterranean" branch, what are you talking about?

      " The migration edge shown is actually Eurasian admixture into LWK and not the other way around".

      No it's not. Some TreeMix results in the supp. material have both a YRI→EuroMed admixture axis and a WEA→LWK one, but I thought that too complicated for the entry, so I did not show those complex supp. material trees, as I suspected they would induce more confusion than clarity and most of the extra stuff was extraneous to Europe anyhow. In any case the main one with 3 admixture axes is unmistakably Africa→Europe and the only notable residuals are fully extra-European.

      Also similar axes have shown up in previous studies. It's nothing really new. Another thing is how exactly to interpret it.

    4. You're right, I'm looking at the wrong part of the paper/wrong graph. From the paper:

      "The last edge added corresponds to a mixture of an Iceman-related population and the Bantu-speaking Luhya (LWK) from Eastern Africa (w = 0.03). The LWK have previously been reported as showing a signal of gene flow of possible Neolithic Middle Eastern or European origin [29], which would be consistent with the observed signal (see also [30], [31])"

      From zooming in closer I see you're right about the direction though. 3B is the three migration edge graph. Here's what they say about the Africa->Southern Europe edge:

      "The first inferred edge corresponds to sub-Saharan African admixture in Southern European populations (edge weight w = 0.027), which is consistent with previous estimates of 1%–3% sub-Saharan African ancestry in those populations via North African gene flow [25], [26]."

      Supposing Ötzi is representative of the EEF population, would a 1-3% migration edge really be enough to explain him placing where he does for Neanderthal admixture?

      You may want to take a look at if you haven't read it already. That's the article they cite re: the African admixture at the "Mediterranean" branch. They suggest two separate sources based on IBD sharing - one from Tunisia/North Africa, and another from the Near East. Both are notably in their Basque sample too.

    5. "Supposing Ötzi is representative of the EEF population, would a 1-3% migration edge really be enough to explain him placing where he does for Neanderthal admixture?"

      Probably. Ötzi in Lazaridis is pretty much standard EEF, not really different from Stuttgart (the EEF reference genome in that study) in any case.

      The Botigué study is interesting indeed but notice that directionality of admixture is not determined. This, considering that Iberian influence in NW Africa is almost certainly much greater than the opposite, is an important issue. For example, compare the ADMIXTURE graph in the study with the one I produced in 2011, in which Spaniards consistently show up as a single homogeneous population (with at most minor North African admixture at K=3-5), while North Africans show up as a complex ancestry population in which the Iberian component is quite important (10-35%). In mtDNA at least 25% of North African matrilineal ancestry seems Iberian.

      "Both are notably in their Basque sample too".

      Only in the ADMIXTURE graph (fig. 1), which is less important (as it's dependent on sample and the, usually lacking, optimal K-depth). In the IBD maps (fig. 2), Basques generally score low for both, similar to Dutch or Southern Brits. Not as low as the extremes (Finns for the North African element; Norwegians, Scots and Irish for the West Asian one) but right in the next category. Interestingly Basques score lower than French and Germans in the North African IBD element (and largely also in the West Asian element).

      That's an interesting counterpoint to the recent aDNA studies because IBD is very important in order to determine "recent" gene flow. Thanks for the mention.

    6. Correction: in the African IBD Basques score like French or Germans, not Dutch/Brits. I could not discern the shade of yellow well at first.

    7. "Both are notably in their Basque sample too."

      Sorry, meant to write both are at notably low levels in their Basque sample. Glad you enjoyed the article.

  5. If the EEF is virtually the same in number in the Pais_Vasco, South_french, Bergamo and Bulgarians then clearly it's basically the same people that migrated into these areas, or is the EEF-WHG-ANE paper a lot of rubbish?

    1. There's no significant ANE deviation in the Basque Country nor Gascony, while there is in Italy instead. Hence the WHG element in Italians and other Europeans has two likely origins: genuine WHG from Magdalenian Europe (which did not include Italy incidentally) and pseudo-WHG from Eastern Europe arriving with ANE (i.e. the Indo-European component).

      Without more aDNA data it is very difficult to estimate the exact IE element but my rough estimate is that Bergamese are pretty much Neolithic Italians and Tuscans have some 7% of IE-like element instead (not much in any case). I dedicated some time to try to estimate this but the ancient DNA references are so limited that it is impossible to know as of now with any certainty, so I did not publish my results.

      Just for reference, I suspect that the IE (or Uralic) influx was quite strong in East-Central Europe (Czechs, Hungarians, Croats) but much smaller elsewhere (between 35% and 0%). I consider that much of the extra ANE in Northern Europe is aboriginal rather than IE in any case (judging by Motala and the fact that Hamburgian and descendant cultures are not Magdalenian). But anyhow a lot is a guessing game as of now because of the lack of sufficient aDNA sequences, so take it with lots of salt and spices please.

      "... or is the EEF-WHG-ANE paper a lot of rubbish?"

      Not at all. It's pioneering. It just cannot provide all the exact answers with its limited aDNA dataset, but the authors have truly altered the paradigm in the right direction. Mostly we need Eastern European and Balkan aDNA now but all will help (diverse ancient West Asian DNA would also help a lot, of course, and North African too to a lesser extent).

