March 12, 2013

Autosomal DNA of NE Europeans

A paper of some interest is available these days at the Public Library of Science:

Andrey V. Khrunin et al., A Genome-Wide Analysis of Populations from European Russia Reveals a New Pole of Genetic Diversity in Northern Europe. PLoS ONE 2013. Open access [doi:10.1371/journal.pone.0058552]


Several studies examined the fine-scale structure of human genetic variation in Europe. However, the European sets analyzed represent mainly northern, western, central, and southern Europe. Here, we report an analysis of approximately 166,000 single nucleotide polymorphisms in populations from eastern (northeastern) Europe: four Russian populations from European Russia, and three populations from the northernmost Finno-Ugric ethnicities (Veps and two contrast groups of Komi people). These were compared with several reference European samples, including Finns, Estonians, Latvians, Poles, Czechs, Germans, and Italians. The results obtained demonstrated genetic heterogeneity of populations living in the region studied. Russians from the central part of European Russia (Tver, Murom, and Kursk) exhibited similarities with populations from central–eastern Europe, and were distant from Russian sample from the northern Russia (Mezen district, Archangelsk region). Komi samples, especially Izhemski Komi, were significantly different from all other populations studied. These can be considered as a second pole of genetic diversity in northern Europe (in addition to the pole, occupied by Finns), as they had a distinct ancestry component. Russians from Mezen and the Finnic-speaking Veps were positioned between the two poles, but differed from each other in the proportions of Komi and Finnic ancestries. In general, our data provides a more complete genetic map of Europe accounting for the diversity in its most eastern (northeastern) populations.

I'm not too sure how to analyze this paper. On the one hand, some data is missing, especially from the ADMIXTURE analysis (FST distances between components), and for some reason the Chinese control was removed entirely from further analysis, which makes it very difficult to estimate, for example, whether and how much East Asian admixture exists in these NE European populations. On the other hand, nearly all Finno-Ugrian peoples (as well as the Mezen Russians, who are genetically Finno-Ugrian too) are highly endogamous, which almost invariably distorts ADMIXTURE analysis by creating many localized components of dubious relevance.

The ADMIXTURE analysis was presented, as often happens and quite incorrectly, only for values below the cross-validation optimum, which in this case is at least known: K=6 and K=7 (very similar lowest values):

Figure 4. ADMIXTURE clustering of individuals from the populations studied.
Results obtained at K = 2 to 5 are shown. Each individual is represented by a vertical line composed of colored segments, in which each segment represents the proportion of an individual’s ancestry derived from one of the K ancestral populations. Individuals are grouped by population (labeled on the bottom of the graph). In addition to populations used in principal component analysis, a Chinese sample (Han Chinese from Beijing [22]) was included. The results at K = 5 are also accompanied by average ancestral proportions by population (*). Population designations are the same as in Figure 1.
[From fig. 1:] Key: Komi_Izh – Izhemski Komi, Komi_Pr – Priluzski Komi, Rus_Tv – Russians from Tver, Rus_Ku – Russians from Kursk, Rus_Mu – Russians from Murom, Rus_Me – Russians from Mezen, Finns_He – Finns from Helsinki, Finns_Ku – Finns from Kuusamo, Rus_HGDP – Russians from the Human Genome Diversity Panel.
At least in the supplemental materials we find the missing K-values:

Figure S4. Results of ADMIXTURE clustering at K = 6 to 8. The number of populations and their order are the same as at Figure 4.
[Note: per fig. S5, the optimal K-values are K=6 and K=7]
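For readers unfamiliar with how the "optimal" K is read off a cross-validation curve, here is a minimal sketch. The error values are invented placeholders, not the numbers plotted in fig. S5, and the function name and tolerance are my own assumptions; the only point is that two K values can tie for the lowest error within a small margin.

```python
def near_optimal_k(cv_errors, tolerance=0.001):
    """Return the K values whose CV error lies within `tolerance` of the minimum."""
    best = min(cv_errors.values())
    return sorted(k for k, err in cv_errors.items() if err - best <= tolerance)

# Placeholder cross-validation errors (hypothetical, NOT taken from the paper).
cv_errors = {2: 0.560, 3: 0.556, 4: 0.553, 5: 0.551, 6: 0.5495, 7: 0.5497, 8: 0.552}

print(near_optimal_k(cv_errors))
```

With these made-up values the function returns both K=6 and K=7, which is the kind of near-tie described in the note above.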

Something that may catch your attention is the relatively high value of the Chinese component in Italians (Tuscans, judging by the locator map). This anomalous effect (unheard of in other studies) may well arise because a West Asian control is clearly missing and Italians have relatively high West Asian affinity, being otherwise relatively isolated within this Northern European sample. 

Notice also how every single endogamous Finno-Ugric population forms its own cluster: a generic Finno-Ugrian component at K=3 (red), a distinction between the Komi and the Finnic component at K=4 (red and purple), then at K=5 we get a mini-break with a more general North/South Europe distinction showing up (yellow and blue components), but at "optimal" K=6 and K=7, we still see other localized components forming: first Komi_Pr (brown) and then the Vepsian one (grey). So out of seven "optimal" components (K=7), four are local, corresponding to highly endogamous populations. 

But I'm running a bit ahead of myself, admittedly. Endogamy is measured here through runs of homozygosity (ROH): nROH for the number of runs and cROH for their cumulative length:

Table 2. Summary of ROH statistics of 16 European populations.

We can see here that large and relatively cosmopolitan populations like Germans and Italians have low ROH values. Czechs and Central Russians come next, with Poles already showing a slightly higher endogamy index. Latvians and Estonians are still relatively low, but Northern Finno-Ugrian peoples (including Mezen Russians) deviate a lot, with values (in the non-asterisk columns) that are at best almost double those for Estonians and, at worst, six times higher.
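To make the two ROH statistics concrete, here is a rough sketch of how such per-population averages could be computed. I am assuming that nROH is the mean number of runs per individual and cROH the mean cumulative run length per individual; the function name and the toy run lengths (in Mb) are invented for illustration and have nothing to do with the values in Table 2.

```python
def roh_summary(runs_by_individual):
    """Average ROH count and average cumulative ROH length per individual.

    runs_by_individual: one list of ROH lengths (in Mb) per individual.
    Assumed definitions: nROH = mean number of runs, cROH = mean summed length.
    """
    n = len(runs_by_individual)
    n_roh = sum(len(runs) for runs in runs_by_individual) / n
    c_roh = sum(sum(runs) for runs in runs_by_individual) / n
    return n_roh, c_roh

# Invented toy data: three individuals per "population" (hypothetical).
outbred = [[1.2, 1.5], [1.1], [1.3, 1.0, 1.4]]
isolated = [[1.2, 4.5, 2.0, 8.1], [3.3, 1.9, 6.0], [2.2, 5.5, 1.6, 7.4, 1.1]]

for name, data in (("outbred", outbred), ("isolated", isolated)):
    n_roh, c_roh = roh_summary(data)
    print(f"{name:9s} nROH = {n_roh:.1f}  cROH = {c_roh:.1f} Mb")
```

In the toy data the isolated group has both more runs and longer ones, which is the pattern the table shows for the far northern populations.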

So in this particular case, and quite exceptionally, I'd say that K=2 or K=3 are the most realistic K-values, in spite of scoring quite poorly in the cross-validation test. Of course, the N-S European distinction shown at K=5 is also real and not caused by any "effect", but otherwise the clusters showing up correspond to extreme drift caused by isolation and endogamy and therefore only tell us about that peculiarity of the European Far North. 

K=2 is surely the most informative level for East Asian genetic influence, except for the already mentioned Italian anomaly (which may also affect Central Europeans to a lesser extent). However, because this study is so limited in this respect, I'd encourage the development of more informative studies, which could for example consider the FST distances between components, always informative, and/or use other population sampling strategies that better capture this aspect.

After all, this is a study focused on Russia, even if along the way it has also produced some valuable information for much of NE Europe.

Figure 3. Principal component analysis of the combined autosomal genotypic data of individuals from Russia and seven European countries (Finnland, Estonia, Latvia, Poland, Czech Republic, Germany [5] and Italia [22]).
The first two PCs are shown. The color legend for the predefined population labels is indicated within the plot. Population designations are the same as in Figure 1.

Appendix: Finno-Ugrian peoples/languages map by Marting/Nug (anti-copyright):


  1. As I have mentioned elsewhere, while drift is of course very important here, I don't think it alone can explain the PC results. The main reasons being,

    (i) PC1 (or, more accurately, a component slightly slanted from top left to bottom right) also distinguishes Italians from Central Europeans (CE), and, e.g., Czechs from Balts. That surely is not due to drift, alone, but signifies a major South-North gradient that likely is related to the varying West Asian, Mediterranean, and Northern Mesolithic admixtures. In other words, Latvians are not (at all) heavily admixed Finns who forgot to speak that language, and Finns were already extreme in that respect "when they got started", in addition to having drifted.

    (ii) Komi and Finns share some of this behavior with respect to Central Europeans. In other words, Komi started out somewhere where the Estonians are, in the 2-D PC plot -- not in CE.

    (iii) Komi and Finns share a language subgroup and are currently neighbors. Their genetic distance, in addition to drift, may thus also be explained by different admixture histories before they (more recently) became close neighbors (again).

    (iv) Unlike Finns, Komi live close to and intermarry with Mansi and Nenets. I strongly suspect that part of the second (top right to bottom left tilted) dimension in the PC diagram is admixture with N / NE Ural people (not necessarily recent, extant "Uralic" - but also those who have been displaced). There really hasn't been enough time since the Permic-Finnish split to explain the huge genetic distance between them, except considering a different admixture history.

    (v) I think it would have been highly instructive to see the Nenets and Mansi on the PC plot - even if just projected onto it (i.e., without contributing to the components, proper).

    1. I'm not at all in disagreement with you. In fact I do agree with nearly all you say above. But for the sake of clarity:

      When I said that K=2 (or maybe K=3) is most informative, it was with the caveat that the general European N-S cline apparent only at K=5 is also something real, not any "effect" (because it does show up in many other studies and has no relation with any endogamy distortion, at least not that I can discern). However, because of the sampling strategy, with so many members of the highly endogamous/isolated Finno-Ugric populations of the far North, it only shows up relatively late in the sequence, so at least for the issue of comparison with East Asia, I prefer a lower K value.

      "(iii) Komi and Finns share a language subgroup and are currently neighbors".

      Not if you mean Finns from Finland: they are not neighbors but are rather quite separated by geography and intermediate populations like Northern Russians. I'm adding a map of Finno-Ugric languages to the main entry, so everyone can easily locate them.

      "I strongly suspect that part of the second (top right to bottom left tilted) dimension in the PC diagram is admixture with N / NE Ural people"...

      The eigenvector diagram essentially reflects K=2 (EV1) and K=3 (EV1 and EV2). The second (vertical) dimension appears to indicate the relatively strong divergence between Finno-Ugric peoples, caused at least partly by strong drift within a context of very low population numbers and great isolation. In other words, these peoples seem to have sped up genetic divergence because of their low numbers, which is an interesting factoid to consider in population genetics in general.

      As for the Nenets, I have not yet performed a direct genetic comparison with East Asians but their Fst values re. other Europeans are quite extreme, suggesting that they should be very close to East Asians like the Chinese.

  2. Yes, I meant Finnish speaking people, not just Finns in Finland - most importantly, including those of Karelia and the Veps. "Neighbors" is of course, relatively speaking.

    My main gripe is not at all directed at you, but at the notion that drift can explain the first few PC components in a sufficiently large study. We don't see that with Sardinians, Irish, Orcadians, or Jews, for example - and for a reason.

    Drift is of course important, also for PC2, here - but when properly done, the matrix elements of the analysis are weighted with a function of the allele frequency (e.g., EIGENSOFT; Patterson, Price, and Reich, 2006). This strongly emphasizes rare SNPs over SNPs that are mostly missing in a particular, small group, only (i.e., due to drift). You can see that in (the slightly tilted) PC1, because otherwise, as I mentioned, it would not be the main S-N differentiator. SNPs that Finns lost due to subsequent drift actually dilute the S-N differentiation (moves Northern and Central Europeans closer to Italians), which, however, is obviously not the case.

    At K=4, it looks as though Central Europeans have a good chunk of admixture with what is modal in Komi and Finns, respectively. At K=5 it becomes clear that this is in fact Baltic admixture - which (i) makes much more sense, and (ii) are known, true SNP signatures - not just the lack thereof due to drift. So, up to Estonians the (tilted) PC1 is by a vast majority due to characteristic SNPs - not the lack thereof, and thus not due to drift away from a large original population.

    I am quite confident that similarly, much of the genetic distance of Komi is a set of unique SNPs, rather than a lack of them - in this case, incorporated via admixture. As you mentioned, Nenets (and surely also Mansi) are heavily removed from extant Europeans, and so likely also were the original people who were displaced/ incorporated during either-side Uralic expansion.

    The main question I posed is: is this simply a N Asian signature, or perhaps an ancient NE Uralic element that is distinct, and thus of great importance in our understanding of Mesolithic and Paleolithic population structure.

    1. "My main gripe is not at all directed at you, but at the notion that drift can explain the first few PC components in a sufficiently large study. We don't see that with Sardinians, Irish, Orcadians, or Jews, for example - and for a reason".

      I think we do see it. I doubt that the "Sardinian" component is as distinct as it's often posited but rather the product of isolation. However it's very possible that Sardinia alone had even more population than all the Far North together until recent times (just a guess but sounds plausible). Orcadians again are another "polar" population and that's probably caused by their isolation (even if they had "recent" Norwegian immigration probably it's still a "small Iceland" so to say). Ashkenazi and Moroccan Jews gave me problems a year ago when I tried to compare them to West Asian populations: both produced their own distinct components very early. It did not happen with Sephardites however, which seem to lack the marked bottleneck of the other Western Jews.

      Even a historically larger and more connected population as Basques was discarded by Dienekes in his on grounds of alleged endogamy. Here I must say that Basques gave me no problem (ref): they produce their own component but only at high K-values but the Sardinian component does look suspect to me (because it shows up very early) and no other population appears with it as very dominant (Basques are second highest but the haploid relation between both populations is very weak). Has anybody tried to look up at the autosomal genetics of Europe without Sardinians?

      And using only large "cosmopolitan" populations? That would be a test for the extreme weight of small peoples like Lithuanians and Sardinians. At the very least the number of samples should be apportioned by population to some extent: one sample per million people represented, more or less. Otherwise small peripheral populations weigh too much and seem to distort things.

      I may try some day.

      "At K=4, it looks as though Central Europeans have a good chunk of admixture with what is modal in Komi and Finns".

      The correct term is not "admixture" but affinity. The program ADMIXTURE actually looks for similarities and cannot establish any real admixture on its own. That requires qualified and careful analysis and even that may not be enough.

      So Central Europeans and Russians (same cluster) show affinity with Finno-Ugric peoples but that is normal because these peoples have admixture with mainstream Central-East Europeans (and not so much the other way around). You see this better at K=5 and above, when the influence of the Finno-Ugric components on Central Europeans and most Russians becomes residual noise, as they can finally show better their own genetic personality (yellow component + blue component).

      What K=3 and K=4 say is just this: given four polarities (China, Italy, North Finland and Komi 1), how akin to each one are those Central and other Eastern Europeans? The real issue is that, before unveiling a Central-Eastern European component (K=5 yellow), ADMIXTURE with this sample finds other poles in these remote Finno-Ugric peoples, which are overrepresented (by sample choice) and very distinct because of their isolation/endogamy.

      That's how I see it.

    2. Maju,

      I was talking about the principal component analysis - not admixture. We do not see Sardinians, Irish, Orcadians, Icelanders, or Jews define the first couple of eigenvalues in that type of analysis in a reasonably wide study. Those components persist even removing such populations.

      Some people look at studies like the one I cite below and say: "Look! It's all drift! PC1 is due to Italians drifting from CEs and Balts drifting from CEs, and then there are Finns drifting from CEs in PC2!"

      Of course, we know that is not true. Drift explains everything and nothing. Ignoring Neanderthals, I could state non-Africans are just people that drifted away from Africans, West-Asians and Europeans are just people that drifted away from South Asians, and Europeans are just people who drifted away from West Asians.

      But we know that hidden in all of this is a complex pre-history of European and extreme West Asian homogeneity, but with an important gradient during the Gravettian, drift and changes during several distinct LGM refugia, neolithic and bronze age migrations from the same general areas (emphasizing the northern Levant, western Anatolia, and the CE Balkans), and very little thereafter from the outside.

      We have a chance to resolve the above by combining ancient DNA studies with models of population genetics within Europe - but not if the prevailing notion is: "Oh well, it's all just drift" (which we know to be untrue for many reasons, including the uniparental markers Kristiina mentions).

    3. Sardinians do. If you get only European samples, Sardinians often define one of two clusters at K=2.

      Also in my example (linked above), Ashkenazim and Moroccan Jews took over 2/4 clusters at K=4 (I don't recall how they behaved at lower values). I had to scrap them off for that very reason.

      I'm saying "drift" (or more precisely "endogamy", "inbreeding") because there are other reasons that support that conclusion in this case: the ROH table. It is very clear that they are highly endogamous populations which do not behave normally in the ADMIXTURE analysis and distort everything. It happens (for example the Tunisian Berbers of Henn - she admitted that they were clearly too inbred to be informative after using them twice), so it is a lesson we must take into account, and very seriously so.

      If you want to mess around with the word "drift", then I'd rather scrap it and use "endogamy" or "inbreeding", because, while drift is of course involved and intensified by these phenomena, it is not the underlying cause: the isolation of a small-sized population is.

    4. Maju,

      Not to beat a dead horse, but I was talking about the PC analysis - not ADMIXTURE. It is my understanding (but perhaps I am wrong?) that properly weighted PC analysis is much less sensitive to drift / endogamy. I mean, it is constructed such as to find and use the SNPs that are most meaningful to properly structure (find the highest correlation in) all populations in the study.

      And again, I don't doubt that drift is very important in this context - I just think that the PC analysis indicates that both the Finnish and Komi component directions are "real" in the sense that they point to an ancient connection rather than to some idiosyncratic drift of run-of-the-mill northerners over the past thousand years, or so.

    5. If you look at the substance of the eigenvectors they are identical to K=3 and K=4 components (after excluding the Chinese). EV1 shows the Italian vs. Finno-Ugric duality of K=3 (blue vs red) and EV2 shows the Komi-Izh vs Finns-Ku polarity (red vs purple at K=4).

      They are the same thing in essence, just shown differently. The results are nearly identical. I don't have any reason to believe, as you suggest, that PC analyses are "much less sensitive to drift / endogamy". I never heard of that before and never noticed it in the many, many PC/EV graphs I have seen in my life: they behave almost identically to ADMIXTURE/STRUCTURE analysis, just with much less depth (2D, sometimes 3D constraint), which is a clear limitation. What ADMIXTURE-like analysis does is to add more and more dimensions until, if done properly, an optimal clustering level is reached. But both seem equally sensitive to the distorting effects of highly endogamous populations - this AFAIK only depends on sampling strategy, which is the human factor. An astute analysis will seek to minimize these anomalies, if not in one analysis A, then in a complementary analysis B, by properly balancing the numbers of each of the populations studied in order to get the most realistic result and minimize endogamy-caused artifacts. A shallow one will not, and that is a failure.

      Whatever the case this issue needs more research, that we do agree on.

    6. Now that I think of it, a reason why you may believe that PCAs are "much less sensitive to drift / endogamy" could be that, normally, the two main PCs are not described by one such endogamy-caused polarity (as would be most commonly also with ADMIXTURE K=3 clusters, depending on sampling strategy always) but this is not any rule, just an accident caused by, in most cases, having ample and diverse samples, which define these poles/components on more realistic grounds.

    7. Yes, having a good number of external populations is important - and there were eight, here, which is not bad: Italians, Germans, Czechs, Poles, Latvians, and three Southern Russians.

      PC analysis also calculates higher components - but many don't bother showing the results. I'd like to see at least PC3, best in a 3-D plot or as color or with "height markers." Then one can at least tell if there is useful additional information.

    8. It's not the number of populations as much as the number of individuals in each population and the distinctiveness of their genetic pool. Also, Europeans are quite homogeneous, which does not help here: all but Italians (to some extent) are very similar (Poles, Czechs, Germans, mainstream Russians... all look about the same in the genetic analysis), both in ADMIXTURE and in the EV graph. Instead the pools of Finno-Ugric ancestry peoples are all very characteristic and that is, in this case, because of endogamy: a distorting effect hard to compensate for with such large numbers.

      "PC analysis also calculates higher components"...

      Sure but it's roughly the equivalent of deeper K-values in ADMIXTURE. I can tell you that the EV3 in this case shows a polarity defined by Central-East Europe, the yellow component (deduced from the ADMIXTURE data of course).

    9. Maju,

      This is my interpretation of EIGENSOFT; hopefully this is at least roughly correct:

      The matrix elements are weighted with the function w(j) = 1/sqrt[p(j)*(1 -p(j))], where j is the SNP marker index, and p(j) is the allele frequency calculated as 1/2 of the average for that site. For autosomal data, a site can have the values 0, 1 (only one copy) or 2 (both copies carry the SNP). In the end, the square of the Matrix elements (after projection onto the eigenvectors) enters, so, roughly speaking, w^2 = 1/[p(j)*(1 -p(j))] is more relevant. Now, this is where the number of populations comes in: say, you have 10 different populations, and for simplicity, all populations are fairly homogenous. The most extreme scenarios are that only one carries the marker, and only once - or that all but one population carries the marker, but that one still once. Then p = 1/20 or 19/20 and w^2 ~20 in both cases - that is, the calculation is heavily weighted in favor of such scenarios. On the flip side, a case in which SNPs are all over the place in the different populations, say 3 have the marker twice, 4 once, and 3 don't have it, p = 1/2 and w^2 = 4, i.e., a factor 5 difference in weight. This weight difference keeps increasing with the number of (different) populations included, and increases with the number of individuals if the populations are not homogenous. So, if 1000 individuals are studied, in the extreme case, the weight can be 2000!
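The arithmetic in this paragraph is easy to check. The sketch below simply evaluates the weight formula as stated in the comment for the scenarios it mentions; the function names are mine, and this is not code from EIGENSOFT itself.

```python
from math import sqrt

def weight(p):
    """Per-SNP weight w(j) = 1 / sqrt(p(j) * (1 - p(j))), as stated in the comment."""
    return 1.0 / sqrt(p * (1.0 - p))

def weight_sq(p):
    """Squared weight, the quantity said to matter after projection."""
    return 1.0 / (p * (1.0 - p))

# Scenarios from the comment; 10 diploid samples give 20 allele copies.
scenarios = [
    ("one copy among 20", 1 / 20),
    ("all but one copy among 20", 19 / 20),
    ("evenly spread (p = 1/2)", 1 / 2),
    ("one copy among 1000 individuals", 1 / 2000),
]
for label, p in scenarios:
    print(f"{label:32s} p = {p:.4f}  w^2 = {weight_sq(p):8.1f}")
```

The weight depends only on the product p(1 - p), so it is symmetric around p = 1/2: a marker at frequency p gets the same weight as one at 1 - p, and the minimum (w^2 = 4) sits exactly at p = 1/2.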

      And those scenarios with the high weight are what you would expect from an ancient contribution: an SNP left over but long gone in all comparison populations (individuals), or a rare new mutation, or not being part of a sweep. On the flip side, heavily drifted populations largely have randomly lost markers, or have an unusual build-up of decently common (but not extremely rare) markers (if they were rare, then that is of significance even without subsequent drift). Both such cases do not yield a high weight. In general, drift occurs at a rate sqrt[p(j)*(1 -p(j))] - that is, it occurs most swiftly at sites that receive the lowest weight in EIGENSOFT.

      This is what I base my claim on that properly weighted PC analysis is fairly insensitive to drift, but extremely sensitive to ancient markers or the lack of participation in sweeps and the like.
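The drift rate mentioned above can also be put into numbers. This is a sketch under a textbook assumption that is not spelled out in the comment: an idealized Wright-Fisher population of N diploid individuals, where the allele frequency change over one generation has standard deviation sqrt(p(1 - p) / (2N)). The function names are hypothetical.

```python
from math import sqrt

def drift_sd(p, n_diploid):
    """SD of the one-generation allele frequency change (Wright-Fisher assumption)."""
    return sqrt(p * (1.0 - p) / (2 * n_diploid))

def scaled_drift_sd(p, n_diploid):
    """The same SD after dividing by sqrt(p(1 - p)), i.e. after applying the weight."""
    return drift_sd(p, n_diploid) / sqrt(p * (1.0 - p))

for p in (0.05, 0.2, 0.5):
    print(f"p = {p:.2f}  raw SD = {drift_sd(p, 100):.4f}  "
          f"scaled SD = {scaled_drift_sd(p, 100):.4f}")
```

Under this assumption the raw change is largest at intermediate frequencies, while the scaled change comes out as 1/sqrt(2N) at every frequency. That is one way to read the weighting, though it does not by itself settle how much of a given PC is due to drift.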

    10. I read your detailed comment this morning and have been chewing on it all day, but I don't see anything in it that would prevent a highly endogamous set of similar populations, such as the dramatically oversampled Finno-Ugrics in this set, from easily dominating the vectors/PCs (or, equally, the components in ADMIXTURE): having low genetic diversity just weighs a lot in creating a vector.

      I still do not quite understand that equation, because for various values I get:
      · p(j)=0.9 → w(j)=3.33
      · p(j)=0.5 → w(j)=2.00
      · p(j)=0.2 → w(j)=2.50

      How can a relatively rare marker with a frequency of 20% weigh more than a marker with a 50% frequency? Is this what you meant to explain above? I think so. Freaky algorithm this one! I don't find this correct, because intermediate frequencies are penalized and that seems to distort the matter even more in favor of "freak" populations with extreme values.

      "And those scenarios with the high weight are what you would expect from an ancient contribution"...

      Not really. Any kind of "bottleneck" process can cause it. Actually ancient contributions should be more diluted in general (because of the "balming effect" of recent cosmopolitanism in most populations) while more recent bottlenecks, such as those caused by recent endogamous drift, are necessarily enhanced.

      So it's actually doing the exact opposite of what you expect. Maybe I should look more frequently at the algorithm definitions instead of just the results (I admit that integers scare me).

      ... "heavily drifted populations largely have randomly lost markers, or have an unusual build-up of decently common (but not extremely rare) markers (if they were rare, then that is of significance even without subsequent drift)"...

      On the contrary: they will accumulate a collection of rarities because their whole genome tends to converge at a highly accelerated pace towards variants that are rare in other populations. This is very obvious, for example, in the Sámi haploid DNA, concentrated (fixated) in a few lineages shared with nobody else. Exactly the same happens in the autosomal DNA, just SNP by SNP: p(j) values close to 1 or 0 will be the rule, not the exception, among such heavily drifted populations (this of course also has effects in concentrating rare illnesses, as is known among Ashkenazim, for example, even if selection weighs against them to some extent).
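The claim that small isolated populations move toward p(j) values near 0 or 1 can be illustrated with the standard expectation for heterozygosity loss under drift. This is a sketch assuming an idealized Wright-Fisher population of constant size N with no mutation, migration or selection, in which expected heterozygosity shrinks by a factor (1 - 1/(2N)) each generation; the population sizes and the number of generations are arbitrary examples.

```python
def expected_heterozygosity(h0, n_diploid, generations):
    """Expected heterozygosity after drift: H_t = H_0 * (1 - 1/(2N))**t."""
    return h0 * (1.0 - 1.0 / (2 * n_diploid)) ** generations

h0 = 0.5  # heterozygosity 2p(1 - p) of a SNP starting at p = 0.5

# Arbitrary example sizes, from a tiny isolate to a large population.
for n in (50, 500, 5000):
    h = expected_heterozygosity(h0, n, 100)
    print(f"N = {n:5d}  expected heterozygosity after 100 generations: {h:.3f}")
```

Lower expected heterozygosity means more SNPs sitting near fixation or loss, so the smallest population in the example ends up with frequencies much closer to 0 or 1 than the largest one over the same time span.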

  3. As for our previous discussion on the origin of western Uralic N1c, am I right that this study does not support the mixture of these people with the mainstream Chinese but rather a more Central Asian origin of N1c, as in the centroid map found here at

    1. That study is alright on the frequency maps. The "centroids" are not calculated in any way that may indicate origin but surely only on frequency. Therefore they are pretty much meaningless because they do not take phylogenetic hierarchy into account, so, if one highly derived subclade has expanded much more than its relatives, it distorts everything.

  4. On the paternal side, Komi are a mixture of Uralic N1c (29%) and N1b1 (18%), but they possess a high amount of steppe haplogroups R1a (33%), R1b (16%) and a little bit of I (5%). On the maternal side, they have very little East Asian Mtdna, Z (1.6%), D (1.6%) and A (1.6%). The Komi Zyryan (Northern branch) western Eurasian haplogroups are: H (33.9%), U4 (24.2%), U5 (9.7%), J (9.7%) and T (16.1%; of which T1 3.2%) and in addition, they have a little bit of K, U8 and W. For me, this combination resembles the Sargat culture in Western Siberia, but as Komi have, compared to ancient Sargat people, much less eastern Mtdna haplogroups (A,C,Z), they could represent a more western culture in the western Ural area. Dienekes has also commented on the Sargat culture: as he notes on his blog, in this study this ancient Mtdna is connected with the Ugric branch of Uralic, i.e. Khanty, Mansi and Ugric people. However, I would say that the Komi people are a mixture of Uralic and Kurgan people.

    As for the Finns, they must have a different history. On the paternal side, the Finns have 61% of N1c, 23% of I1, 10% of R1a and 2% of R1b, 2% of K and 1% of E3b (according to the old nomenclature). As far as I have understood, both N1c and I1 have, for the most part, developed locally, but R1a has diverse origins. On the maternal side, the Finnish haplogroups are X 1.3%, H 13.9%, H1 12.6%, H2 5%, H3 2.5%, H5 5.1%, I 3.8%, J1 2.6%, T 2.5%, K 2.5%, U4 2.5%, U5a 5.1%, U5b 14%, V 5.1%, W 10.1% and Z 2.5%. These haplogroups are not derived from Swedish subclades. Hg H and U5a are surely old in Finland, as they have been detected in an old burial site in North Western Russia already 7500 years ago (H was possibly Eastern H2a). If we want to see a bronze age or later connection between Finnish and Baltic haplogroups, good candidates are I1, J, including J1b, T2, W. In Finland there are quite high frequencies of both Mtdna I and W. As the frequency of X is higher in Finland than in Sweden or Baltic area, I would say that it has arrived in Finland from Russia, having looked at the haplogroup maps here

    In my opinion, the Finns do not have a Scandinavian origin maternally or paternally and it would also be somewhat unnatural, since there is a sea between Finland and Sweden, but the Eastern route is open. Compared to Finns, the Swedes have a lot of T (connected to R1a?) and K (connected to ydna R1b?).

  5. With this last remark I wanted to challenge the comment that Finns are intruders who intermarried with native Scandinavians!

    1. You may well be right. We cannot ignore the impact of Swedish rule and the Viking era but overall your claim seems to make sense.

      In fact I'm rather inclined to think that it is Swedes, especially, who have some Finnic and other Eastern European admixture. The first is somewhat logical, as Northern Scandinavia was and still is to some extent Finnic (Sámi but maybe also other groups now vanished?), but the latter requires looking at the Chalcolithic Age, when cultures related to Dniepr-Don (Ukraine/South Russia Neolithic, of local Paleolithic roots), known as Pitted Ware (Neolithic semi-foragers), expanded towards the Baltic and also across it to Swedish coasts. This route, with some variations (through East Germany and Poland - not coastal), was then followed by the Kurgan Peoples (Indoeuropeans), who were also Eastern Europeans by origin (Samara Valley originally, later in complex interaction with the Dniepr-Don substrate at Seredny-Stog II culture, and fully "Kurganized" in the Yamnaya and Catacombs periods that followed). Denmark was more densely populated, so the impact was less dramatic, but Sweden was more impacted by these two waves (Norway is rather intermediate).

    2. Kristiina,

      If you look at PC studies that emphasize Northern Europe, like, e.g., Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, et al. (2009) Genetic Structure of Europeans: A View from the North–East. PLoS One 4: e5472. doi: 10.1371/journal.pone.0005472, you can see that there is a huge gradient within Finns that smoothly joins Swedes. Yes, Northern Swedes of course have Finnish admixture, but some Finns fall right in-between (and so do some Estonians). Again, not everything is simply drift of the most extreme populations.

      There were some old "continuity" models w/r the Uralic languages, but all modern analyses come to the conclusion that Uralic languages in that general area are relatively recent "intruders." And genetically, they all seem to carry Finnish or Permian, i.e., Uralic signatures. Now, be my guest to argue that Finnish genetics is native Finnish, while only Permian is native Uralic. ;)

    3. But, Eurologist, look at the PC graph: Finns only slightly tend, in some cases, towards Sweden, while Swedes have a general clear deviation towards Finns and in some cases they are much more Finnish than Swedish, genetically speaking.

      Also from the viewpoint of archaeology or material prehistory, while Southern Scandinavia was populated from mainland Europe since the Epipaleolithic, there's absolutely nothing that suggests those peoples migrating far northwards, much less towards Finland (i.e. not before the Viking Era).

      Finns stand much closer in that graph to Estonians, and in PC1 (which weighs 2x PC2) they are clearly closer to Russians and other East Europeans, only appearing closer to Swedes in PC2 (half that weight). As I said before, Swedes are not even a realistic comparison because they are themselves a mix of mainstream Central-North Europeans and Eastern Europeans (Pitted Ware + Kurgans).
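
      To make the weighting argument concrete, here is a minimal sketch of a distance in PC space where PC1 counts twice as much as PC2. The coordinates are hypothetical (they are NOT read off the Nelis et al. plot), and applying the weights to the squared axis differences is just one reasonable assumption about how to combine the two axes.

```python
import math

# Hypothetical PC coordinates (PC1, PC2), invented for illustration only.
coords = {
    "Finns": (-3.0, 2.0),
    "Estonians": (-1.5, 0.5),
    "Russians": (-1.8, -0.5),
    "Swedes": (0.5, 1.6),
}

# Assumed relative weights of the axes: PC1 counts twice as much as PC2.
WEIGHTS = (2.0, 1.0)


def weighted_distance(a, b, weights=WEIGHTS):
    """Euclidean distance with each squared axis difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def ranked_neighbours(name):
    """Other populations sorted from nearest to farthest by weighted distance."""
    others = [pop for pop in coords if pop != name]
    return sorted(others, key=lambda pop: weighted_distance(coords[name], coords[pop]))


def nearest_on_pc2_only(name):
    """Nearest population if only the PC2 coordinate is considered."""
    others = [pop for pop in coords if pop != name]
    return min(others, key=lambda pop: abs(coords[name][1] - coords[pop][1]))


if __name__ == "__main__":
    print("Weighted ranking for Finns:", ranked_neighbours("Finns"))
    print("Nearest on PC2 alone:", nearest_on_pc2_only("Finns"))
```

      With these made-up numbers, a population can look nearest on PC2 alone and still end up farthest once the heavier PC1 axis is included, which is the kind of effect described above.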

      More genetic data: in table 2, Finns do not appear either particularly close to Swedes but rather (Southern Finns) to (1) Estonians and (2) Poland (only 3. Sweden), and (Northern Finns) to (1) Poland and (2) Hungary (and then again 3. Sweden). Their relation with Sweden may well be caused more by Swedish Finnic admixture than vice-versa anyhow.

      They do not look "Russian" either (distances are not large but larger than to Poles and other peoples from East-Central Europe). So I can only imagine that the earliest substrate may be rather from the East-Central European Paleolithic, etc.

    4. Not sure what to say - perhaps mentally try to rotate the axes for better seeing what is going on (while compensating for the fact that the two scales are different)?

      And Swedes and Estonians are admixed with Finns, but Finns are not admixed with them? What's the semantic difference I don't get, here? Half of the Helsinki Finns are closer to Swedes (or Estonians) than they are to extreme Finns. How does that argue against them being admixed with Swedes and Estonians?

      We know for certain that agricultural Swedes settled along the Baltic coast of Finland. Some still speak Swedish, there. Why is it so difficult to imagine that they intermixed?

      The Estonians, on the other hand, prove my point: namely, that there was a genuine, genetically distinct Finnish population that intermixed with both Swedish and Baltic populations.

      The main question to me is: was this originally a local Uralic population (probably not), or a local, extreme NE population that got elite-dominated by a Uralic-speaking population before becoming known as Finns (possible, IMO).

    5. "And Swedes and Estonians are admixed with Finns, but Finns are not admixed with them?"

      First of all, some Finns do show a deviation towards Swedes, but it surely corresponds to recent times (Swedish occupation, Viking Era, admixture with Finland-Swedes). Second, I don't think that Estonians and Finns are admixed with each other (at least not notably so) but rather that they share a common origin, at least partly. Third, Swedes are probably admixed with Finnic peoples other than Finns proper, more like Sámi.

      "What's the semantic difference I don't get, here?"

      Well, it's the same as saying that Mexicans are (natives) admixed with Spaniards but Spaniards are not (significantly) admixed with Mexicans. There's a difference and a quite clear one.

      "Half of the Helsinki Finns are closer to Swedes (or Estonians) than they are to extreme Finns".

      That's because Northern Finns seem to be not just less cosmopolitan but also a different (albeit related) population re. Southern Finns. Also "drift" (or endogamy) plays a role here: Southern Finns were for sure all the time a larger population and a more cosmopolitan one than Northern ones.

      "We know for certain that agricultural Swedes settled along the Baltic coast of Finland. Some still speak Swedish, there. Why is it so difficult to imagine that they intermixed?"

      I'm not rejecting that but that seems obvious only in a small fraction of Helsinki Finns who are divergent (and not too much) towards Swedes. It's anyhow likely that the impact of this recent admixture is more obvious in Finland Swedes than in Finns proper.

      "The main question to me is: was this originally a local Uralic population (probably not), or a local, extreme NE population that got elite-dominated by a Uralic-speaking population before becoming known as Finns (possible, IMO)".

      I suspect that the first (Epipaleolithic) substrate population of Finland was of Central-Eastern European origins but was later assimilated by Combed Pottery peoples arriving from further East, who spoke (proto-)Finnic and exerted a patrilocal centrality of some sort. Probably the "elite dominance" of this second wave was determined by "Reindeer Neolithic" (i.e. a pastoralist economy).

  6. Okay, Finns proper may have a small amount of recent Swedish admixture, but it is probably Finland Swedes who have more. The haplogroup frequencies I listed above, however, represent the whole of Finland and not Helsinki Finns!

    Anyway, I do not know how this pool of Helsinki Finns was gathered in this Khrunin et al. research, but if there is a bigger share of Finland Swedes in this Helsinki component, then of course the difference between Kuusamo Finns and Finns proper is smaller than in this study.

  7. Eurologist, you say that half of the Helsinki Finns are closer to Swedes (or Estonians) than they are to extreme Finns. Where do you get that, as I can't find Swedes in this research?

  8. Okay, it is the Nelis et al. research! It seems that Swedes, Estonians and Russians are on the same plot while Helsinki Finns and Lithuanians are on the same plot and Kuusamo Finns stand apart. During the Bronze Age (and later?) there were close contacts between Southern Finland and the Baltic area and there was surely admixture. Perhaps there were also very close contacts between Russians, Estonians and Swedes, and these contacts were not unidirectional: initially there may have been more contacts and admixture from East to West and later on more from West to East.

  9. The biggest haplogroups in Kainuu, Northern Finland, are the following: H 37%, V 9%, U 36%, W 10% and M 6% (in the whole country: H 39.3%, V 5.1%, U 27.9%, W 10.1%, M 2.5%). Kainuu Finns seem to be admixed with Saami people, as Finnish Saami have 37.7% of V and 15.9% of Z and D5. It is noteworthy that there is no difference in the amount of W and, in fact, the share of the older haplogroup U is bigger in Kainuu than in the South. On the other hand, the share of H is bigger in the South, as is expected, since the frequencies of H are 41.1% in Estonia, 35.2% in Latvia and 45.6% in Sweden (according to Lappalainen et al.).

    1. Kristiina,

      Thanks for listing that.

      Do you suspect that Saami have less admixture with ancient local northerners (vs. Uralic Reindeer herders) than Finns (proper) in Finland?

      Of course there could be drift at work, again, but the percentages of V and Z are astonishing.

  10. Eastern haplogroups, D and Z1a, were found at a 3,500-year-old burial site in the extreme North (Bol'shoy Oleni Ostrov). Also U4a1, U5a and U5a1, T and C and C5 were found. I would say that Saami people combine recent geneflow from the Eastern Arctic area with an old Northern European population (Mtdna U5b and ydna I1, + ydna N1c). They lack all the Neolithic or Southern haplogroups that Finns have, but they have a lot of Mtdna V, and I do not quite understand the route of this haplogroup.

  11. As haplogroup H is very important around the Baltic Sea, I took a careful look at the median-joining networks found here:
    Fig. 3
    Fig. 4

    The root of H* seems well anchored in Carelia. The route of H2 (age estimate 11,000) to Finland seems to go through Eastern Slavs and the entrance is again through Carelia. It seems instead that Western H3 (age estimate 16,000) arrived in Finland through the Baltic region. Also the route of H5 seems to be Turkey - Eastern Slavs – Finland. Haplogroup H1 (age estimate 23,800) seems to be well anchored in Carelia, but looking at the map, it seems that it might have arrived in Finland from the East (even from Volga-Ural area where it has a high frequency).

    It is highly interesting that the first study says that ”in contrast to that found in Europeans, sub-Hgs H6 and H8 among Central Asian/Altaian populations are characterized by distinctly divergent haplotypes. This finding may reflect a long-time separation of Asian and European H6 and H8 mtDNA pools and/or an earlier expansion of H6 in the eastern part of its present range. Indeed, the coalescence age of H6 in Central Asians is very deep—40,400 years.”

    As for Sweden, H2 may have arrived in Sweden through Poland - Germany (these countries are not included in the study; in the map there is only a Lithuanian circle). Based on Fig 4, you would say that H3 arrived in Sweden through Latvia, but the route through Germany is probably more plausible, as the haplogroup came from Iberia. H5 is quite rare in Sweden, and based on the map you would say that it came to Sweden from the Baltic region (Latvia-Estonia). H1 probably arrived in Sweden from the South, and also H1a and H1b, and in fact, these two sub-clusters are not found in Finland! In the median-joining network, I really was not able to find any case where an H haplotype arrived in Finland from Sweden.

    As for the old U5a, I am not sure how to read the map. I am wondering if U5a is derived from U5b and if U5b has come to Finland through the Baltic region (Latvia). Anyway, the Saami motif U5b1b1 seems to have taken the Eastern route to Lapland through Carelia.

    I also studied carefully the instances of R1a in Finland and noticed that, for the most part, it arrived in Finland from the East, possibly through Carelia. This applies to Central European R1a1a1g1. This applies also to Carpathian-Russian R1a1a1b1a2. R1a1a1b1a2 - Central Eastern European branch, Southern Baltic type is found also in Northern Finland. R1a1a1b1a2d East Slavic I is found only in Carelia. The only Scandinavian R1a cluster is R1a1a1b1a3 - Scandinavian branch and it is also found in Russia.

    According to Wikipedia, ”the Corded Ware culture (in Middle Europe ca. 2900–2450/2350 cal. BC), alternatively characterized as the Battle Axe culture or Single Grave culture, is an enormous European archaeological horizon that begins in the late Neolithic (Stone Age), flourishes through the Copper Age and culminates in the early Bronze Age. Corded Ware ceramic forms in single graves develop earlier in Poland than in western and southern Central Europe, already 3000 BC.” The Corded ware culture arrives in Southern and South Western Finland from North Western Poland and the Baltic region in c. 2500 BC. By looking at the haplogroup routes above, you would say that at least part of ydna R1a must have arrived in Finland during the Corded Ware period through the Baltic area. I suggest that mtdna W arrived also from Poland during that period, as its highest frequencies in Europe are in Poland, Latvia and Lithuania. Also the majority of mtdna H3, I and T might have taken that route to Finland.

    1. What you mention about a rather sharp distinction in the mtDNA pools of Finns and Swedes is very interesting and agrees with my interpretation of the autosomal and archaeological data in my reply to Eurologist above.

      As for the difficult-to-understand origins of V among Sámi, etc. (which you mention in other comments), we must bear in mind that bottlenecks, even in relatively recent times (very small numbers and extreme environmental conditions), seem to have shaped that genetic pool very intensely, with highly randomized results. This applies to all the Far North but maybe more to the Sámi than anyone else.

  12. As for my understanding of the Saami roots, forget the "recent" geneflow from the Eastern Arctic area. If this Eastern geneflow (D and Z1a) is from Bol'shoy Oleni Ostrov, dated 3,500 uncal. yBP, it is not very recent! In this study of ancient Mtdna in North Western Russia, it is however pointed out that these Bol'shoy Oleni Ostrov people were not the ancestors of modern Saami people.

  13. I completely forgot the Saami R1a! Finnish Saami have only 4.5% of R1a, but Swedish Saami have 20% and Kola Saami 21.7%. Swedish average is 17%. Northern Norwegians have 27.1% and the Southernmost Norwegians only 13.2%, but I don't know their type and I have no idea whether they form a cluster of their own.

    Now I am wondering if Mtdna V, which has extreme frequencies among Saami, is somehow connected to an old type of R1a. According to Family Tree search, there is an old Pan-European R1a1a M198 that is found also in Northern Norway. For example, Poles have a lot of pre-V Mtdna.

    Then, I noticed that the only R1a sub-cluster that might be old enough to be connected with Corded Ware culture in Finland, is R1a1a1b1a2 - Central Eastern European branch, Southern Baltic type. According to Family Tree, it forms a special Finnish cluster.

  14. As regards Figure 3, I understand that we have the "Average Europeans" on the left in the positive values.
    What surprises me is that Uralic people do not seem to add up to a real group. Logically Komi people are closest to unmixed Finno-Ugric people, so it would seem that the Balto-Finnic people + "Arctic" pseudo-Russians are another population that was Uralicized.
    Is this a possible scenario in light of data?

    1. Sorry, I meant on the right!

    2. If you read the discussion, beginning with the main entry, I have been arguing that endogamy/bottlenecks are distorting everything here, especially because Finno-Ugrics are oversampled, so they begin monopolizing most K levels (or eigenvector positions, same logic) very early on. This adds a level of cryptic difficulty to interpretation.

      So you have to take into account:
      1. Sample size matters (a lot) because every individual sample weighs the same
      2. Distinctiveness or affinity are not absolute values but are often enhanced by endogamy and/or ancient bottlenecks, both factors which seem to be very important in the Far North, where population densities were always necessarily very low.

      So in fact Finno-Ugric peoples do cluster with each other at eigenvector 1: they are all far to the left. Same with ADMIXTURE's K=3: they are all very high in the red component. But, because of their intrinsic peculiarities and the very large size of their sample, they monopolize eigenvector 2 and its ADMIXTURE equivalent at K=4.
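
      The sample size point can be illustrated with a toy example. This is only a sketch under stated assumptions: the two-dimensional "genetic space", the group positions and the sample counts are all invented, and the helper names `toy_sample` and `leading_axis` are hypothetical. It is a plain principal-axis calculation, not what ADMIXTURE or any PCA package actually runs on genotype data.

```python
import math


def toy_sample(n_isolate):
    """Two mainstream groups split along x, plus a drifted isolate offset along y."""
    west = [(-1.0, 0.0)] * 20
    east = [(1.0, 0.0)] * 20
    isolate = [(0.0, 2.0)] * n_isolate  # distinctiveness exaggerated by drift
    return west + east + isolate


def leading_axis(points):
    """Return (unit vector of the first principal axis, share of variance on it)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Every individual contributes equally to the covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of the 2x2 matrix [[sxx, sxy], [sxy, syy]].
    top = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Orientation of the matching eigenvector (major axis of the scatter).
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle)), top / (sxx + syy)


# Same populations, same positions; only the isolate's sample count changes.
few_axis, few_share = leading_axis(toy_sample(4))
many_axis, many_share = leading_axis(toy_sample(40))

if __name__ == "__main__":
    print("4 isolate samples  -> first axis", few_axis, "share", round(few_share, 2))
    print("40 isolate samples -> first axis", many_axis, "share", round(many_share, 2))
```

      In this toy setup nothing about the populations themselves changes between the two runs. With few isolate samples the first axis follows the mainstream west-east split, while with many it swings to "isolate versus everyone else", simply because each extra individual adds its own weight to the covariance.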

