December 26, 2011

Playing around with ADMIXTURE

I decided to gift myself this Saturnalia with a basic knowledge of how to use the ADMIXTURE program. It is not easy, but with the help of Razib's instructions, a good dose of patience and some computer savvy I managed to get something done yesterday, even if not exactly what I wanted.

First of all I cleaned up the population file, removing all populations that have no apparent relation to West Eurasia and also a bunch of tiny minorities like Druzes, Bedouins, etc., which tend to be rather non-informative. I still retained a number of populations from all around Europe, several North Africans, even more West Asians and Caucasians, and also some peoples from Central Asia and Siberia. I made two errors, however: I removed most NW European representatives by taking out both the CEU (Utah Euroamericans) and North European samples, and I accidentally retained two Caucasian Jewish populations.

Good enough for a draft, not good enough for the strategy I had in mind. I went all the way down to K=7 but I will show here only one panel, and only because it offers a perspective that my second attempt, today, did not achieve so neatly (different strategy, different results): to show a clear cut of the European and West Asian components:

example from a previous run: Europe - West Asia duality

We can see here four components:
  • Red: West Asian
  • Purple: European
  • Green: North African
  • Cyan: Siberian
North African genetic influence in Europe is almost trivial and concentrated in Iberia and the Balkans, although this influence is more apparent in West Asia. Siberian influence is also minor, except among the Chuvash and, to a much lesser extent, Russians and other East Europeans.

However West Asian influence is more important and concentrates in the Balkans and Italy. North Caucasian peoples are clearly West Asians genetically speaking, even if they technically live in Europe. In turn European genetic influence outside the subcontinent is concentrated along the Northern African coast, Asia Minor and Cyprus.

I'd say that the West Asian (red) component correlates quite strictly with the extent of demic replacement in the Neolithic (although, naturally, the demic wave would have been each generation more European and less West Asian).


Today's strategy

Today I decided to be more methodical and also to reduce population numbers in order to speed up the process. I decided to keep only one North African and one Siberian population (Moroccans and Selkups) and to cut down considerably the West Asian and Caucasian array of samples (I retained Palestinians, Kurds, Turks and Georgians). I retained all non-Caucasus European populations, including the omissions of the previous day: CEU and North Europeans.

However I cut all samples to 10 members. Actually Belarus (only 9 available) and another, unidentified, sample (by error) have just 9, but that should not affect the results. I hesitated over retaining higher numbers for larger populations like North Europeans, Russians, French and Spaniards, but at the last moment I chose not to (next time I will probably take 20 of each instead of just 10). In any case the smaller number of samples allowed me to go faster with the runs and reach deeper levels quite easily.
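The pruning just described (keep only the chosen populations, cap each one at 10 individuals) can be sketched as below. This is only an illustration, not the procedure actually followed: the sample tuples, the population labels and the file name keep.txt are hypothetical, and the random subsampling is an assumption, since nothing above says how the 10 were picked. The output is a two-column list (family ID, individual ID) of the kind that PLINK's --keep option takes.

```python
import random
from collections import defaultdict


def build_keep_list(samples, wanted_pops, per_pop=10, seed=0):
    """Pick at most `per_pop` individuals from each wanted population.

    `samples` is an iterable of (family_id, individual_id, population)
    tuples; this layout is hypothetical, adapt it to your own sample list.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_pop = defaultdict(list)
    for fid, iid, pop in samples:
        if pop in wanted_pops:
            by_pop[pop].append((fid, iid))

    keep = []
    for pop in sorted(by_pop):
        members = by_pop[pop]
        if len(members) > per_pop:
            members = rng.sample(members, per_pop)
        keep.extend(members)  # populations with fewer members are kept whole
    return keep


def write_keep_file(keep, path):
    """Write one 'family_id<TAB>individual_id' line per retained sample."""
    with open(path, "w") as handle:
        for fid, iid in keep:
            handle.write(f"{fid}\t{iid}\n")


# Made-up example: 15 Basques, 9 Belarusians and 1 unwanted Druze sample.
samples = (
    [(f"B{i}", f"B{i}", "Basque") for i in range(15)]
    + [(f"Y{i}", f"Y{i}", "Belarus") for i in range(9)]
    + [("D0", "D0", "Druze")]
)
keep = build_keep_list(samples, {"Basque", "Belarus"})
print(len(keep))  # 10 Basques + 9 Belarusians
```

Calling write_keep_file(keep, "keep.txt") would then produce the list on disk.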

And I went on with the runs, getting this:


... and this:


The color code is a bit crazy and absolutely un-cool but I have managed to figure out that it gives red to pop0 and then similarly spaced hues until blue or magenta. I'd prefer it if the program were able to keep the same color for each comparable component but that seems to require human intervention (dyeing).
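The coloring behavior guessed at above (red for pop0, then evenly spaced hues up to blue or magenta) can be imitated with a few lines. This is a sketch of that guess, not the plotting code actually used; the function name and the choice of magenta as the last hue are assumptions.

```python
import colorsys


def component_colors(k, max_hue=5 / 6):
    """Return k hex colors with hues evenly spaced from red (hue 0) to max_hue.

    max_hue=5/6 ends at magenta; max_hue=2/3 would end at blue.
    """
    colors = []
    for i in range(k):
        hue = 0.0 if k == 1 else max_hue * i / (k - 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


palette = component_colors(4)
print(palette)
```

Because the hues depend only on the component index, the same ancestral component gets a different color whenever its index changes between runs, which is why keeping colors comparable across K values needs a manual label-to-color mapping.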

I decided that it was best to spend my time putting them side by side as above (also human intervention).


Points of interest


K=3 

As in the previous trial, the first detached populations were North Africans (Moroccans) and Siberians (Selkups). Nothing unexpected. The Siberian component is clearly more distant than the North African one from the main component (European in this case, because the West Asian specificity is masked between Europe and North Africa once the samples have been reduced).

Fst (components):
  • Siberian-Berber 0.131
  • Siberian-European 0.112
  • Berber-European 0.054
It's clear (and consistent across runs) that North Africans (Berber for short) are much closer to Europeans than Siberian natives are (including the partly European Selkups). West Asians generally stay 50-50 between the European and North African components (their specificity has not yet been unveiled because of the smaller than usual sample size).
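For reference, the three K=3 distances listed above can be arranged into a symmetric matrix and the closest pair picked out mechanically. The values are the ones quoted in the list; the function name and the labels used as keys are just illustrative choices.

```python
def fst_matrix(pairs):
    """Build a symmetric distance matrix from {(label_a, label_b): fst}."""
    labels = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(labels)}
    size = len(labels)
    matrix = [[0.0] * size for _ in range(size)]
    for (a, b), value in pairs.items():
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value  # Fst between components is symmetric
    return labels, matrix


# Fst between components at K=3, as listed in the text.
K3_FST = {
    ("Siberian", "Berber"): 0.131,
    ("Siberian", "European"): 0.112,
    ("Berber", "European"): 0.054,
}

labels, matrix = fst_matrix(K3_FST)
closest_pair = min(K3_FST, key=K3_FST.get)

print(labels)
for row in matrix:
    print(row)
print(closest_pair)
```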

I did not run K=2 but I imagine that it'd result in Selkups vs the rest, meaning East Asians vs West Eurasians overall.

I could express the distances in a neutral form pop0, pop1 as the program does but I think it's more confusing (I get confused myself), so maybe better to use a label and hope it is a good choice. 

Most Fst distances are in the 0.040-0.070 range. I won't emphasize them.


K=4

The division of Europe into two components takes place at this stage. I decided to label them NE European and SW European because the latter is too influential in NW Europe and too low in the Balkans to be merely "South" (more presence among Northern Europeans than in Romania or Turkey), even if the NE component is more of a general presence. I wonder where they come from: are they the product of a duality in the early colonization of Europe, something like Aurignacian vs Gravettian, or what? In any case both seem equally European and not originated outside the subcontinent. They are persistent across runs.


K=5

The West Asian specificity shows up, with focus in Georgia. West Asians finally stop looking like a mere amalgam of Europeans and North Africans and display their unique personality.

I insist that this is a mere effect of the sampling strategy: more West Asian samples would have caused this specificity to show up earlier in the runs (K=4) but, maybe more importantly, the European difference would have been the one eclipsed by the West Asian component. I actually have one example from yesterday's exercise:

counter-example from a previous run

Here Europeans and West Asians appear all mostly Green, which is primarily the West Asian component (and not the European one yet). While some North African affinity persists, this has nothing to do with the 50-50 eclipse of West Asian specificity that we can see in the main exercise.

This is a good example of why we must be wary of the exactitude of the components produced by these algorithms: differences in sampling strategy and depth of analysis may often show or hide critical insights.


K=6 - Slovenian Neanderthals or what?!

From this level of analysis onward we get a small and quite puzzling new component that exists almost only in Slovenes and is not even dominant among them. Usually you don't get such a minor component, much less one that shows up again and again at several K depths. It is also just the third European-specific component, what the heck?!

The explanation may be that it is extremely distant from all the rest, so even if small it had little choice but surfacing. 

The Fst distances of the Slovenian odd component are extreme: 0.312, 0.233, 0.241, 0.284, 0.239 with each of the other components. By comparison, the largest distance of the Selkup component is just 0.155, while the largest distance I got between World populations in an ad-hoc K=3 run was 0.195. 

So this component, whatever it means, is significantly more distant from everything else in the region than continental populations are from each other. I can only think of massive local Neanderthal admixture, but I know this is so weird and unlikely that a mere algorithm error is probably the truth.
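The comparison can be checked with trivial arithmetic. The numbers below are the ones quoted above; the variable names are just labels for this sketch.

```python
# Fst of the odd Slovenian component against each of the other five components.
slovenian_fst = [0.312, 0.233, 0.241, 0.284, 0.239]
selkup_max = 0.155  # largest distance of the Selkup (Siberian) component
world_max = 0.195   # largest distance in the ad-hoc worldwide K=3 run

mean_slovenian = sum(slovenian_fst) / len(slovenian_fst)
smallest_slovenian = min(slovenian_fst)

print(f"mean Slovenian Fst: {mean_slovenian:.4f}")
print(f"smallest Slovenian Fst: {smallest_slovenian:.3f}")
print("every Slovenian distance exceeds the worldwide maximum:",
      smallest_slovenian > world_max)
```

Even the smallest of the five distances is above the largest worldwide one, which is what makes the component look so anomalous.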

If you have any idea... I welcome it.


K=7 

New component: Palestinian!

K=8

An Orcadian component shows up (but vanishes at K=10).

K=9

A lesser Kurdish component shows up but it does not have the weird Fst distances of the Slovenian one, in spite of the first sight similitude.

K=10

The Orcadian and Kurdish components vanish (might they resurface in further runs? I never ran them). Instead Chuvash, Basque and distinct Sardinian-specific components show up.

I stopped here because it was taking longer and longer (some 50 mins for just this last run) and my patience is limited (especially when I have no clear goal).

This is the detailed spreadsheet snapshot of the exact distribution of the components at K=10:

And the K=10 detail:


Mini update: the K5 detail, which is in a sense a simplified display of the same general scheme of things: showing the two main European components, one West Asian (Caucasus) component, the North African and the Siberian components:




Many doubts

The toy seems curious and I did at least manage to make it work at the basics. But I'd like to know:
  1. How to sort populations so they show up in some logical order, like all Moroccan samples side by side and such.
  2. Can I command Plink to retain populations instead of just remove them? 
  3. Where can I get other samples? I'm particularly interested in samples of SW Europe but really whatever will do: I'll follow the candy bait, I reckon. 
  4. How can I make the results show individual instead of whole-population bars?
  5. How can I get the data (cross-validation?) that indicates whether a run is likely to be meaningful or not?
  6. Etc. (surely a lot remains in the ink jar - I just forgot)
Thanks in advance.
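Some of these doubts lend themselves to a small sketch. Assuming that the .Q output has one row per individual in the same order as the .fam file, individuals can be sorted by population label (question 1) and kept as individual rows instead of population averages (question 4). For question 5, ADMIXTURE has a --cv flag that reports a cross-validation error for each K; the log line format parsed below ("CV error (K=3): 0.58") is written from memory of the manual and should be treated as an assumption, as should every value and label in the example, all of which are made up.

```python
import re


def sort_by_population(q_rows, populations):
    """Pair each individual's ancestry fractions with its population label
    and return the pairs sorted by label (the sort is stable, so the original
    order is preserved within each population)."""
    if len(q_rows) != len(populations):
        raise ValueError("need exactly one population label per Q row")
    order = sorted(range(len(q_rows)), key=lambda i: populations[i])
    return [(populations[i], q_rows[i]) for i in order]


def parse_cv_errors(log_text):
    """Extract {K: cross-validation error} from lines such as
    'CV error (K=3): 0.58' (assumed log format)."""
    pattern = re.compile(r"CV error \(K=(\d+)\): ([0-9]*\.?[0-9]+)")
    return {int(k): float(err) for k, err in pattern.findall(log_text)}


# Made-up K=2 fractions for three individuals, in .fam order.
q_rows = [[0.90, 0.10], [0.20, 0.80], [0.85, 0.15]]
populations = ["Moroccan", "Selkup", "Moroccan"]
sorted_rows = sort_by_population(q_rows, populations)
for pop, fractions in sorted_rows:
    print(pop, fractions)

# Made-up log excerpt collected from two runs.
log_text = "CV error (K=2): 0.61\nCV error (K=3): 0.58\n"
errors = parse_cv_errors(log_text)
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

A lower cross-validation error is usually read as a better supported K, so comparing the values across runs gives a rough idea of which depths are worth interpreting.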


Update (Dec 28): Fst distances

Table of Fst genetic distances at K=10:


I marked with red stars the extreme (>0.2) Fst distances of the Slovene component, with orange ones those in the highest quintile (after removing the Slovene oddity), which are all from the Siberian component, and with green ones the lowest-quintile Fst distances.

I also made an Euler diagram sketching Fst genetic distances between the various West Eurasian components:


Where Fst distances in the lowest quintile (after removing the Slovene oddity; <0.084) are shown with continuous lines and the second quintile (0.084-0.107) are shown with dotted lines. (Note: image corrected from first posted version, which had an error).

I think it gives an interesting impression of the possible relations between the various components, in which the NE Euro and Caucasian components (and to a slightly lesser extent the Basque one) seem pivotal, almost as if all the other West Eurasian components were peripheral outgrowths. The short Fst distance between the NE Euro and Caucasus (or Highland West Asia) components already showed up in some of the analyses of Dienekes, raising some eyebrows, at least mine. However, as he does not use the smaller components, some of the correlations, notably that the Basque component is also in that pivotal zone, were not apparent at the time.

PS- highly tentative reconstruction of pop. history (excluding the Slovene odd component), based on average Fst (Fst(core)) towards the "core" Caucasus/NE Euro components:
  1. Fst(core)=0.125 - Divergence of Siberian/East Asian component (0.110 Chinese/CEU per Wikipedia): Eurasian expansion after the OoA.
  2. Fst(core)=0.102-0.100 - Divergence of Sardinian (?) and North African components: Dabban industries?
  3. Fst(core)=0.091 - Divergence of SW European component (Aurignacian?)
  4. Fst(core)=0.084 - Divergence of the Palestinian component
  5. Fst(core)=0.079 - Chuvash component
  6. Fst(core)=0.065 - Basque component
  7. Fst=0.060 - Caucasus and NE Euro divergence

A rough estimate of the possible Caucasus/NE Euro divergence timing (by comparing the Fst values with those of presumably Aurignacoid divergences) would place it c. 24 to 30 Ka ago (depending on which date is used for the Aurignacoid divergence, 40 or 44 Ka ago, and on which component is taken as reference, the SW Euro or the North African one). So I'd dare say that the Basque, Caucasian and NE Euro components appear to have split ways (with all reservations) in the Gravettian period.

(Not sure how well it fits but this kind of maths would place the Siberian/East Asian divergence c. 55-60 Ka ago, a bit too recently IMO and the odd Slovenian component's divergence, if real, c. 110 Ka ago, weirdly old but H. sapiens rather than Neanderthal).
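The arithmetic behind these rough dates can be sketched as follows, assuming a simple linear scaling of Fst with time against a reference divergence. That scaling is an assumption (the text does not spell out the formula), although it does reproduce the figures quoted; the function name is hypothetical and the North African reference uses the lower value of the 0.100-0.102 range listed above.

```python
def scaled_divergence(fst, ref_fst, ref_age_ka):
    """Scale a reference divergence age linearly by the ratio of Fst values."""
    return fst / ref_fst * ref_age_ka


# Reference Fst(core) values and candidate Aurignacoid dates from the text.
references = {"SW European": 0.091, "North African": 0.100}
ages_ka = (40, 44)

# Caucasus / NE Euro divergence (Fst = 0.060).
caucasus_ne = [
    scaled_divergence(0.060, ref_fst, age)
    for ref_fst in references.values()
    for age in ages_ka
]
print(f"Caucasus/NE Euro: {min(caucasus_ne):.1f} to {max(caucasus_ne):.1f} Ka")

# Siberian / East Asian divergence (Fst(core) = 0.125), SW European reference.
siberian = [scaled_divergence(0.125, 0.091, age) for age in ages_ka]
print(f"Siberian: {min(siberian):.1f} to {max(siberian):.1f} Ka")
```

Under this reading the Caucasus/NE Euro split lands between roughly 24 and 29 Ka and the Siberian one between roughly 55 and 60 Ka, in line with the estimates given.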

43 comments:

  1. 1) i have give you R code on how to view individual pops. i have your email address, so i'll send it to you soon if you haven't found someone else

    2) I'm particularly interested in samples of SW Europe but really whatever will do: I'll follow the candy bait, I reckon. ask dienekes for his IBS data set for the spaniards. i did, and he sent it.

    have fun!

  2. Maju: don't you find odd some things?

    For example: why do Basques and other very specific populations show low % or absence of most components except a pair or three?

    Let's take the "NorthEast" component, and you find values of 0.233 in Spaniards, 0.160 in Tuscans and 0.378 in French, but 0.013 in Basques and 0 in Sardinians.

    The SouthWest component is absent in Basques! How can this be possible?

    There are other examples, but what does this mean? Maybe that these small, specific and mostly isolated (not sure) populations are less cosmopolitan? Or maybe because some components are being lost because all individuals are closely related, or it's because the sample is very small?

  3. "why do Basques and other very specific populations show low % or absence of most components except a pair or three?"

    Because they are populations without (or with very low) "Neolithic" admixture most likely: the components that they manifest are always European components.

    "Let's take the "NorthEast" component, and you find values of 0.233 in Spaniards, 0.160 in Tuscans and 0.378 in French, but 0.013 in Basques and 0 in Sardinians".

    Yes but that's because, in the case of Basques, their specificity had just manifested, not allowing for anything else (all Basques sampled for genetic studies are pure-blooded Basques, which is not necessarily the case for other populations). But at K=9 Basques are still something like 70% NE Euro and 30% SW Euro.

    This change is very typical: the algorithm goes around finding the "populations" (by genetic affinity) which stand out. Whether a bell rings or not depends on how many people fit and how distinct the component is, I understand. So Basques are, with this sample, not sufficiently distinct until K=10 (would be K=6 if we'd only considered Europeans maybe), showing up as more similar to those in the NE Euro category and then those in the SW Euro one (and very little to the rest).

    Then, banzai! They show up as a distinct self-contained population and all the other affinities become negligible in comparison. Or rather, the Basque component begins showing up in other populations: 25% in Spaniards, 16% in French, 16% in Utah Euroamericans, 14% in North Europeans, 13% among North Italians...

    Sardinians are a different case, because from the smallest K levels they show up as almost exclusively members of the SW European component and in the end they reveal a new distinct Sardinia-only component in addition to that one. One that is shared especially with Eastern Europeans and Italians.

    I know of only one thing shared between those two regions: (epi-)Gravettian continuity beyond the LGM. But I'm just hunching here.

    Another possible link, now that I think of it, might be Y-DNA I. No idea.

    The problem with these toys is that they can give you a headache if you take them too seriously. But if you don't use them others will and will define the gameboard in their own terms alone.

    "or it's because the sample is very small?"

    The overall sample is of 218 individuals. Only 10 of them (<5%) are Basque (or of any other individual sample: I equalized all sample numbers to 10 or 9 in two cases).

    So yeah, the populations/components that stand out are particularly distinct (isolate, inbreeding, whatever term you wish to use: it's relative): what is not distinct does not show up, and the more distinct, the easier it is to show up early on in the runs (the cases of the Selkups and Moroccans are much more striking, right? While the Selkup distinctiveness belongs to the quite fundamental West-East Eurasian polarity, the Moroccan one is specifically North African distinctiveness, isolation, uniqueness).

    Basques or Chuvash or even Palestinians are not so noticeable, not so obviously distinct, so they show up later. But they do show up eventually because they are not "like everybody else" either.

    You may want to look at this study of bovine genetics: they go extremely deep and can even discern almost every single breed at K=47 (but not all!), but there are still differences in the level at which each breed shows up, etc. For example at K=2 there is a very basic distinction between taurine and zebuine breeds but then it gets complicated.

  4. In all those tests, Basques have been shown to mostly lack any "Neolithic" affinity through West Asian proxy. And as Maju points it out, as K increases, it is possible to "isolate" a Basque component. I wonder what it may mean to know that on average, the French are 16% "Basque" : just shared genetic data from Paleolithic times before both populations got to diverge ?

    I can send you my genetic data if you wish to or if you know how to use it (I'm clueless about that) : it'd be probably the only Pyrenean Gascon sample you can find on the Internet ...

    There were interesting results with my data : like the Basques, I somehow lack "Neolithic" components (though it's still bigger than the Basque average) and my SW/NE European components are like reverted as opposed to the Basques.

  5. In truth, Heraus, I do not feel confident at all that I can integrate your sample into a wider sample (yet). In Razib's instructions there's something about that but I have not tried it (for all my interest in genetics I have not got myself tested yet at all: I 'believe' in population genetics and not private genetics).

    This does not mean that your sample is not potentially interesting (even if I'd like more an array of "Southern French" ones: we know something about Gascons but almost nothing about Occitans, Perigordins, Poitevins, Lyonais, etc.) I'll keep your offer in mind in any case, specially if I manage to get a wider SW European sample.

    "the French are 16% "Basque""

    These seem to be French from Paris (so says Razib), so they are like not the French that interest us the most but almost like Belgians or English in fact. There is a huge void of data in Southern France, one that should be filled, because that's like 90% of the Paleolithic Franco-Cantabrian region and, especially if it's true that there is (at least some) genetic continuity since Paleolithic, then knowing about it is crucial to know about European genetics in general.

  6. "Because they are populations without (or with very low) "Neolithic" admixture most likely: the components that they manifest are always European components. "

    SouthWest and NorthEast are "Neolithic"? I thought these farmers came from the Middle East.

    "The overall sample is of 218 individuals. Only 10 of them (<5%) are Basque (or of any other individual sample: I equalized all sample numbers to 10 or 9 in two cases)."

    I think that's not a lot... it's possible these 10 persons are very closely related or their ancestors were in the past.

    We just don't know if these different components represent different populations being isolated from others since millennia or just similarities for any reason.

    For example, the Moroccan component is closer to the European one than the Siberian is. That's not expected if this component represent a deep African ancestry in present day Moroccans: we know quite well Europeans are more closely related to Siberians than to Africans. Also: how can we discern between admixture and common ancestry?

    Another thing: when one component appears, it HAS TO be specific for one population, like the Basque and Sardinian ones? How can we detect components common to all populations but present at low %?

    "There is a huge void of data in Southern France, one that should be filled, because that's like 90% of the Paleolithic Franco-Cantabrian region and, specially if it's true that there is (at least some) genetic continuity since Paleolithic, then knowing about it it's crucial to know about European genetics in general."

    We can't conclude if there's Paleolithic continuity or discontinuity in SW France without data from Paleolithic SW French. The same with the presumed "neolithic" components. We can only make theories, which may vary a lot depending on the reliability someone gives to molecular clocks.

  7. @Heraus: on second thought, send me your data and I'll see what I can do with it. I know that something you want to check and that you won't get with Dienekes' methods is how "Basque" your DNA is and so on, so I'll try to help you with that and, meanwhile, I can also figure out how to insert new samples in the list, which may be useful in the future.

  8. "SouthWest and NorthEast are "Neolithic"?"

    No, that's not what I meant at all. Basques show up as a mix of NE and SW components (and nothing else that matters) until K=9, only in K=10 they show up as distinct.

    This last is because of relative isolation but NE and SW components seem to be aboriginal European, pre-Neolithic.

    "I think that's not a lot... it's possible these 10 persons are very closely related or their ancestors were in the past".

    I do not have clear info on the background of the sample but I understand it is not the case at all (it'd be methodologically wrong to allow such closely related people in a sample). Also different samples of Basques tend to behave the same: this indicates relative isolation of Basques re. their neighbors. It's probable that it also implies a "buffer zone" around the remaining Basque-speaking areas where the Basque component gradually increases (the Spanish sample -where from?- is 25% but maybe a sample from La Rioja is 80%, I can't say, similarly the French sample is 16% but surely one from Gascony would be 60% or 80% or even 100% Basque) and the "foreign" admixture gradually decreases. This is something that has not really been researched but I can imagine it to be the case because it's not like there is a wall separating Basques from non-Basques, it's more a gradually dampening buffer of geographical and cultural distance.

    For example I'd expect Heraus, who is 100% Gascon from Bearn (very close to the remaining Basque-speaking zone and a zone with very high Basque-like phenotype, looks), to score quite high in "Basqueness", while someone from Seville would probably score rather low instead.

    "We just don't know if these different components represent different populations being isolated from others since millenia or just similarities for any reason".

    The similarity must mean certain level of isolation at some point, otherwise it'd be so diluted that it'd become undetectable. The components indicate "stocks" but what these "stocks" mean exactly is not so easy to apprehend.

    "For example, the Moroccan component is closer to the European one than the Siberian is. That's not expected if this component represent a deep African ancestry in present day Moroccans"...

    Good point. Clearly the North African component detected in these runs, even if more consolidated towards the South of Morocco, seems to represent the main West Eurasian stock and not others that may be there. IMO, discerning this would require an analysis specifically oriented to discern North Africans (i.e. all 7 North African samples plus three or four controls: Iberia, Levant, East Africa and West Africa). It was not my intent to study North Africa here but Europe, so I did the opposite in fact: reduce the North African samples to the minimal expression. But some other day I will probably try to analyze them in detail.

    "we know quite well Europeans are more closely related to Siberians than to Africans".

    Not so much in fact, repeatedly some global runs (but not others) give slightly greater distances for the East Asian macro-stock (incl. Siberians) probably because West Eurasians (incl. Europeans) have small levels of more recent African admixture mediated via Arabia and North Africa. I have already mentioned on occasion that this is very suggestive of a quite fast Eurasian scatter after the OoA, not allowing for the similitude between West and East Eurasians to consolidate almost at all, being both almost equally distant and distinct African-derived branches.

    In any case when we say "Africans" in these contexts we almost always mean Tropical and Southern Africans, not Africans of mostly Eurasian stock, as are North Africans. This is normally implicitly understood.

    ...

  9. ...

    "Also: how can we discern between admixture and common ancestry?"

    Uh? Not sure what you have in mind but admixture, such as we can find in African Americans or Mexicans shows up very obviously as the admixed population being part A and part B, even if A and B are small samples and the mixed population is very large. At K=2 Mexicans show up as an irregular but mostly horizontal divide of two components, components that can easily be found (by including the appropriate controls) to be European and Native American. Even if there are 10 Europeans, 10 Native Americans and 200 Mexicans, the result is invariably the same: the two components are clearly found.

    So what does this tell us of the NE and SW components in Europeans? It suggests that there were once two populations of which the closest to pure survivors are Lithuanians and Sardinians respectively. These two populations admixed intensely. There may be other possible answers? Unsure but it's a likely interpretation because the two components have shown to be very stubborn so far.

    "Another thing: when one component appears, it HAS TO be specific for one population, like the Basque and Sardinian ones?"

    No but specific components have more difficulty to show up early on (too low numbers). Sometimes "mystery" components (not dominant anywhere) do appear but this may also depend on the algorithm used (some algorithms may discourage this, I can't say).

    In general it seems to me that the likelihood of a component to show up depends on a simple function of numbers and distinctiveness: the more distinct and the more people sharing it, the easier it will show up. Distinctiveness is relatively intrinsic (although it also depends on what you compare with) but numbers are highly dependent on samples. Hence the sampling strategy can change a lot the results, even if unwillingly.

    "How can we detect components common to all populations but present at low %?"

    Running lots of Ks, but that costs a lot of time (each higher run takes about double the time of the previous one, imposing some practical limits). Also components are not "real" things but approximations. Let's say that we are applying the same process to appearance in a classroom or work site: we classify people depending on gender, skin, hair and eye color, hair texture, wearing glasses, height, etc. Maybe there is only one person who is 2m tall, platinum blond with blue eyes. This person is very distinctive but, if the numbers are high enough, he won't show up easily because he'd fall in other categories. For example, height and gender may conspire to form the K=2 components, one including mostly short people and mostly women (some tall women and some short men too) and the other including mostly tall people and mostly men (some short men and some tall women too). Then maybe color and hair texture conspire to divide between blonds and brunettes (I'm imagining a mostly Euro group) but the hyper-tall quasi-albino guy would still be hidden in wider groupings, etc.

    That's it: it's not rocket science with absolute values, just a statistical method to find affinity groups in complex sets of information (lots of SNPs).

    ...

  10. ...

    "We can't conclude if there's Paleolithic continuity or discontinuity in SW France withouth data from Paleolithic SW French".

    Only true up to a point. I think that the data of modern populations quite clearly shows a duality between West Asians and Europeans, which precludes anything other than Paleolithic continuity in Europe (with less important immigrant admixture). It would be another matter if someone proposed a Neolithic discontinuity mostly generated by European stock, for example the NE component. But so far I have never read about any such hypothesis, nor can I imagine how it could be.

    "We can only make theories"...

    In science everything is a theory. Evolution? Theory! Gravity? Theory! General Relativity? Theory! Quantum Mechanics? Theory! A well demonstrated consolidated theory is as good as a fact. From Wikipedia:

    "... in modern science the term "theory", or "scientific theory" is generally understood to refer to a proposed explanation of empirical phenomena, made in a way consistent with scientific method".

    "A common distinction made in science is between theories and hypotheses. Hypotheses are individual empirically testable conjectures, while theories are collections of hypotheses that are logically linked together into a coherent explanation of some aspect of reality and which have individually or jointly received some empirical support".

    A theory is something very respectable, a hypothesis less so: although it can well be the seed of a respectable theory, it is still unproven or even untested at all.

    Of course all the empirical evidence would be welcome but so far the only SW European pre-Neolithic people studied are not "French" but Portuguese (Chandler 2005) and they happened to be quite within the modern genetic pool expectations. It'd be better to have more evidence but that's what we have as of now.

    "... which may vary a lot depending on the reliability someone gives to molecular clocks".

    And on which interpretation of the molecular clock hypothesis is used (I have my own: I count from the root). Sadly the molecular clock remains an untested hypothesis as of today and has been strongly criticized even by people who were strong defenders until recently, like Dienekes.

  11. Maju,

    First congrats to trying this out! I wished I had the time to do so.

    On North Africans, at the low k-values you have of course a group that includes SW Asians (Arabia, Levant). So, collectively, it is less surprising they are much closer to "Europeans" than North/Central/East Asians are.

    Like you, I have always interpreted the identifiable European components (N and S, or NE and SW and SC, depending on k) as pre-neolithic. Likewise, it has always struck me that the N or NE component is so clearly different, that my only explanation is that we have a different LGM refuge, here. Besides the known but small Moravia, the place clearly must have been somewhere between the northern Balkans and the NW Black sea.
    Even if it takes an overnight run, I'd run a much larger population sample - it looks to me like the Slovenian result is spurious due to the small sample size.

  12. Hello Maju


    I'm willing to send you my raw data - I'm northwestern Portuguese - I can also send you four Andalusians and plus I can also convince other Iberians to join this project.

  13. "I wished I had the time to do so".

    It doesn't take so much time, really. Although it does take computer time. If I can help you in anything, just tell me.

    A helpful tip may be to begin with Razib's step 3 (Plink), so you can narrow your sample and get faster results. While the program uses your computer for a while, it does not clutter it and you can use it for Internet browsing, email and such (probably not to play videogames, though).

    "Like you, I have always interpreted the identifiable European components (N and S, or NE and SW and SC, depending on k) as pre-neolithic".

    Yes, I can't, especially after that preliminary K=4 snapshot, consider the two main European components (or the only one when they show up as one) as anything other than pre-Neolithic. Unless we accept replacement levels of 99.99%, which is not credible.

    "Likewise, it has always struck me that the N or NE component is so clearly different, that my only explanation is that we have a different LGM refuge, here. Besides the known but small Moravia, the place clearly must have been somewhere between the northern Balkans and the NW Black sea".

    I'd say that this component looks more Eastern-centered in fact, getting the highest scores in Lithuania and Belarus. However these areas got colonized from two different regions: Central Europe first and Eastern Europe later. It is the same as with Sardinia: it is hard to figure out what they represent, if Italians, Iberians or what?

    An easier possibility could be to imagine that Europe was initially colonized by two distinct, albeit culturally connected, waves. For example a proto-Aurignacian people (SW component) and a true Aurignacian one (NE component), or maybe Aurignacian and Gravettian. However I reckon that it's hard to imagine that the LGM compaction did not have a homogenizing effect.

    But maybe it did not. Or not enough. Can't say.

    "Even if it takes an overnight run, I'd run a much larger population sample - it looks to me like the Slovenian result is spurious due to the small sample size".

    Standard sample sizes are not much larger in fact. Slovenians are 23 in total for example. I chose 10 because the smallest sample (Belorussians) are just 9, however we could do without Belorussians.

    I can imagine it can be spurious (it's the easiest explanation) but I find it hard to imagine a spurious result being so persistent across runs. Also 10 is not such a small sample, unless they are all cousins - and even then it should not produce such a high Fst distance. It's very anomalous.

    I'll see how it performs in future analyses. I'll probably do one with only Europeans but right now I admittedly have North Africans on my mind: the dataset includes a rather good North African sample, with eight distinct samples plus Egypt and all kinds of outgroups like the Fulani.

  14. Eduardo: there is no "project" at the moment. I'm just learning the ropes so far. I know that there is interest in having a Western or SW European approach to population genetics (it's a blank) but I don't really feel ready to lead such an endeavor.

    At the moment I'm totally clumsy when dealing with this stuff and in general I do not have much time, nor energy... I accepted, as you can see, Heraus' offer on an experimental basis but I don't feel like starting to "hoard" a genetic databank of sorts, especially not without any clear goal. Feel free to send me your data without any commitment on my side other than doing my best to respect your privacy, but I'd rather prefer that if anyone is to send me any data, he/she does so with his/her own and not anybody else's (not just for possible legal reasons but certainly I do have ethical issues about using people's genetic data without clear consent - I understand that all available gene banks do have such consents and do respect privacy).

    I've asked Dienekes (somewhat awkwardly) for his IBS data set. Actually the dataset is public but he did make it separate from a larger one. However my greatest interest is in getting my hands on Southern French datasets, because I have the strong intuition that they could be of great help in explaining European (and certainly SW European) genetic structure and history.

  15. Updated with some meditation on the genetic distances (Fst) at K=10.

  16. Maju,

    I will be quite busy the next 8-10 weeks, but if you have something that is computationally intensive, but otherwise ready to go (compiles easily on preferably Linux or also on Windoze), I might have the computational power to run something large for you.

    E-mail me.

  17. Well, not at the moment. My strategy is to cut and focus on the area of interest instead of uselessly running all those global comparisons again and again.

    The only thing that might be of interest is doing more and more K-levels, as each one takes about double the time of the previous one (so reaching K=16 or so may take several days for that level alone with my computer).

    But right now I'm more interested in learning the ropes, sincerely. There's a lot of things I do not know how to do.

  18. Maju: Well, it seems you're an expert with these kind of programs, so you know much better than me their pros and limitations.

    However I think it's highly speculative to conclude that this component may be "Aurignacian" this one "Gravettian", etc. It's just my opinion.

    I've been thinking about the Slovenian odd component. My thoughts are: it's not Neanderthal because if it were we'd expect Fst distances of >0.5 at least, because Fst distances between San and Papuans are 0.36 or so.

    It seems Slovenians have a rather high % of haplogroup E1b (30%); maybe this deep African lineage explains why it appears so distant relative to other Eurasian components, though it's odd it's so low in Turks, Moroccans and Georgians... they're fairly high E1b.

  19. "it seems you're an expert with these kind of programs"

    I'm no expert but I am quite familiar with what they do. I think you are going to the other extreme: if some would like to read too much, you want to deny the tool all validity. Neither this nor that. Take the results and speculations around them with a pinch of salt (especially until you get second opinions and the like) but not necessarily just ignore them: they say something, even if it's not any exact science.

    "However I think it's highly speculative to conclude that this component may be "Aurignacian" this one "Gravettian", etc."

    Agreed. I'm just taking the risk and diving into some of that speculation with all the reservations this kind of exercise deserves. My speculation (in the post-stamente to the update) has some internal logic but it's far from proven, just a suggested path to explore further maybe in the future.

    "Fst distances between San and Papuans are 0.36"

    Really? What's the source of that info, may we know? It could make sense and would suggest that the Slovene component (if it holds up) could be from the times of Skhul/Qafzeh or something like that. It's all very feeble at the moment but there is a chance that there is "genetic gold" here.

    "It seems Slovenians have a rather high % of haplogroup E1b (30%)".

    Shouldn't be. Slovenes are high in R1a and I, Albanians and Greeks are that high in E1b. But well, North Africans are much higher in that haplogroup and they do not show any such extreme divergence... apparently.

    I'm still expecting it to be random noise in any case, a fluke.

  20. "if some would like to read too much, you want to deny the tool all validity. Neither this nor that. Take the results and speculations around them with a pinch of salt (especially until you get second opinions and the like) but not necessarily just ignore them: they say something, even if it's not any exact science."

    No, I don't deny they give us much interesting information, such as the relationship between different populations, level of admixture, distance between two components... they're a useful tool albeit with some limitations, just like every tool.

    "Really? What's the source of that info, may we know? It could make sense and would suggest that the Slovene component (if it holds up) could be from the times of Skhul/Qafzeh or something like that. It's all very feeble at the moment but there is a chance that there is "genetic gold" here."

    From Dienekes:

    "The maximum Fst in humans is between the Palaeoafrican ancestral population (Pygmies and San) and the Papuan one at 0.346, with a close second, that between Palaeoafricans and Amerindians (0.333).

    However, the maximum distance also corresponds to distance between extant populations: guided by this analysis, I carried out a separate ADMIXTURE run using Papuans and Mbuti Pygmies from the HGDP set, arriving at Fst=0.377"

    http://dienekes.blogspot.com/2010/12/human-genetic-variation-first.html

    However, one has to make an observation: the Slovene component is closer to the European ones than to the "Selkups"! If it was indeed a component from an archaic human source, we'd expect the same distance for all modern humans. Alternatively, maybe the program made the Fst distance shorter for Europeans because the Selkups lack this component. Who knows!

    "I'm still expecting it to be random noise in any case, a fluke."

    Yes, but if you did all OK with the program... if this one is a fluke only because it shows Fst distances a bit greater, what about the others?

  21. "Yes, but if you did all OK with the program... if this one is a fluke only because it shows Fst distances a bit greater, what about the others?"

    But the proper thing is to do the same or similar analysis again several times with maybe different random seeds (something I still don't know how to alter, although ideally it should not be a problem), different numbers for the samples, etc. If the component keeps reappearing, then it'd be confirmed, if it does not, then it was a random error, a fluke. But all that takes time and in some cases knowledge I do not have as of now.
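    For what it's worth, a minimal sketch of such repeat runs, assuming that ADMIXTURE's -s option sets the random seed and using a hypothetical input file name (sample.bed). It only builds the command lines and does not run anything:

```python
import shlex


def admixture_commands(bed_file, k, seeds):
    """Build one ADMIXTURE command line per random seed (nothing is executed)."""
    commands = []
    for seed in seeds:
        # Assumption: "-s" sets the random seed; the positional arguments
        # are the input file and the number of components K.
        commands.append(["admixture", "-s", str(seed), bed_file, str(k)])
    return commands


if __name__ == "__main__":
    # "sample.bed" and the seed values are hypothetical placeholders.
    for cmd in admixture_commands("sample.bed", 10, [1, 2, 3]):
        print(shlex.join(cmd))
```

    Each run would probably need its own working directory, since the output files appear to be named after the input file and K only, so a later run could overwrite an earlier one.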

  22. "However, one has to make an observation: the Slovene component is closer to the European ones than to the "Selkups"! If it was indeed a component from an archaic human source, we'd expect the same distance for all modern humans".

    In raw theory yes, in practice this Fst is an estimate, no matter how mathematically formulated. Your example from Dienekes only confirms that: why would Melanesians or Native Americans be more distant than other groups from Pygmies (not San), shouldn't all be equally distant? That's probably because relative isolation and inbreeding also accentuate the distance.

    "... they're a useful tool albeit with some limitations, just like every tool".

    That's it. I won't be the one who claims otherwise.

  23. "In raw theory yes, in practice this Fst is an estimate, no matter how mathematically formulated. Your example from Dienekes only confirms that: why would Melanesians or Native Americans be more distant than other groups from Pygmies (not San), shouldn't all be equally distant? That's probably because relative isolation and inbreeding also accentuate the distance. "

    That's it!! So, if we have two initially very different and isolated populations, like Europeans and A. Aboriginals, and they mix with each other, the program would surely detect 2 main components, yet the Fst distance between the two may be smaller than, say, the Fst distance between an Amerind component and the European component, even though we know from other studies that Amerinds are slightly more closely related to Europeans than Aboriginals are.

  24. Really we can't say that Amerinds are closer to Europeans than Aborigines. Some markers (Y-DNA) are but others (blood groups or hair traits for example) are not (Aborigines often have blood type A and also blond and curly/wavy textured hair and to my eye, which I know is arguable, they do resemble Europeans more than any other "race", much more than Amerindians usually do in any case).

    The problem with Australian Aboriginals is that, in spite of the interest they arouse, it is very difficult to study their genetics. So we have to judge most of the time by proxy: Melanesians.

  25. "Really we can't say that Amerinds are closer to Europeans than Aborigines. Some markers (Y-DNA) are but others (blood groups or hair traits for example) are not (Aborigines often have blood type A and also blond and curly/wavy textured hair and to my eye, which I know is arguable, they do resemble Europeans more than any other "race", much more than Amerindians usually do in any case)."

    Is there any study that has found any similarity between Europeans and A. aborigines? Not to my knowledge, though I recognize very few aborigines have been tested to date. Are hair traits a reliable source? Europeans for example, can have curly, wavy or straight hair.

    As for looks, some of them resemble Caucasoids, but others look rather archaic: huge noses, massive jaws, pronounced superciliary arches...

  26. "The problem with Australian Aboriginals is that, in spite of the interest they arouse, it is very difficult to study their genetics. So we have to judge most of the time by proxy: Melanesians".

    Unfortunately Melanesians are not a good proxy for Australian Aborigines. They look fairly different from each other and their haplogroups are reasonably separate.

    "(Aborigines often have blood type A and also blond and curly/wavy textured hair and to my eye, which I know is arguable, they do resemble Europeans more than any other 'race'"

    Certainly more like Europeans than Melanesians are.

    "this kind of maths would place the Siberian/East Asian divergence c. 55-60 Ka ago, a bit too recently IMO"

    Not too recently if the route to East Asia was via Siberia. Several haplogroups in East Asia obviously derive from SE Asia or India but others less obviously so. And haplogroups are only part of the story.

    "and the odd Slovenian component's divergence, if real, c. 110 Ka ago, weirdly old but H. sapiens rather than Neanderthal"

    Fits with your suggestion:

    "would suggest that the Slovene component (if it holds up) could be from the times of Skhul/Qafzeh or something like that"

    Quite possible. We would expect it unlikely that the 'Skhul/Qafzeh' population was confined to Palestine. It could easily have reached the Balkans and Anatolia, to later become isolated in the mountains.

  27. @Maju,

    Did you remove the outliers from the North African populations? There are recently Sub-Saharan admixed individuals among them, which sometimes make the Northwest African cluster a bit unstable and divergent. Try removing all those who have much more Sub-Saharan than the North African average.
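    A minimal sketch of that filtering step, assuming the Sub-Saharan fraction of each individual has already been read from the ADMIXTURE output. The sample IDs, the fractions and the "twice the average" cutoff are all hypothetical:

```python
from statistics import mean


def flag_outliers(ids, proportions, factor=2.0):
    """Return the ids whose proportion exceeds `factor` times the group average."""
    avg = mean(proportions)
    return [i for i, p in zip(ids, proportions) if p > factor * avg]


if __name__ == "__main__":
    # Hypothetical Sub-Saharan fractions for five individuals of one population.
    ids = ["NA1", "NA2", "NA3", "NA4", "NA5"]
    ssa = [0.10, 0.12, 0.08, 0.45, 0.11]
    print(flag_outliers(ids, ssa))
```

    The flagged IDs could then be written to a list and excluded in Plink before re-running the analysis.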

  28. @Neanderthalerin:

    "Is there any study that has found any similarity between Europeans and A. aborigines?"

    Not that I know of, but while some seem to emphasize the differences, I often see striking similarities. But there is no genetic marker (other than blood group A, which might have evolved several times) connecting West Eurasians with Australian Aborigines, so any connection should be from the time of the Great Eurasian Expansion (or OoA chapter II). But in any case they do not look more distant to me (in phenotype) than, say, Chinese, sincerely - their most distinct traits being the often broad and flat nose and the angularity of the skull. I extend that claim of vague "Caucasoidness" to many Melanesian and Negrito individuals as well. They almost never look as "Mongoloid" as they look "Caucasoid", even if in an odd or slightly off manner.

    I mean look at this guy: don't you know people from, say, Seville, who look a lot like him? Others have broader heads more like people have towards Northern Europe. And then you have just way too many who are totally Gypsy (Roma). What about this one? He could be my neighbor and I would not notice.

    So I have this hypothesis that this Caucaso-Australoid phenotype continuum was dominant early on among Eurasians. We even find it in Japan, with the pre-admixture Ainu and even in China, where the Zhoukoudian skulls look a lot Ainu-like, and not at all "Mongoloid" yet.

  29. @Awi:

    I can't do what you say with the current knowledge I have, sorry.

    All I can say is be patient with me or try doing it yourself (I'd help all I can).

  30. You are very brave for diving into the statistics stuff. I find it very interesting, but am not sure what it means yet.

    My personal opinion is that the earliest modern humans in Asia resemble "Caucasoids" in certain superficial respects, or more correctly stated, "Caucasoids" retained certain features that were more common in early moderns. Some Australoids retained similar features, but that doesn't mean that Australoids and Caucasoids are more closely related, but instead that they may have both retained some of the same original characteristics of early moderns in Asia.

    It is fun to speculate, but I am hesitant in drawing conclusions about paleolithic genetics from modern DNA data. The more ancient DNA we test, the more complicated the picture becomes (particularly with respect to YDNA). The presence of G2a (of which I am member) in neolithic Europe was a surprise.

  31. I wouldn't like to be read too narrowly in my speculation about the Caucaso-Australoid phenotype continuum. It's just an impression I have and it's wide open to all kinds of necessary shadings, adjustments and whatever else.

    "I am hesitant in drawing conclusions about paleolithic genetics from modern DNA data".

    I can understand that but we need some (consistent, not wild) speculation, otherwise we may never explore anything at all and would be wasting the potential of all this information. Maybe tomorrow there is new data that makes me change my mind on something important... but maybe tomorrow we will be dead, so while I do not mind waiting when there is no other choice, I also feel the need to lay out working hypotheses now - and that, and nothing else, is what I'm doing.

  32. "The presence of G2a (of which I am member) in neolithic Europe was a surprise".

    Actually it should not be. I've always felt that the smaller "Eastern" haplogroups (G2a, E-V13, J2, etc.) are most likely Neolithic (and not Classical Greek or who-knows-what that some have argued with no foundation other than their romantic imagination). What is a surprise is that we still haven't hit R1b, but this can well be because the admixture with the natives happened mostly at a later moment (??)

  33. I agree with you about the need to form hypotheses; it is essential to science. Your blog is one of the few I feel worth bookmarking and reading on a regular basis. I love the maps.

    I haven't posted before (been reading for a while), so I am just giving you an idea of where I am starting from.

    Your chart of Fst distances is interesting, I guess I would have expected the Sardinian, Southwest European and Basque samples to be closer to each other than to samples from the Caucasus. I guess I still find this statistics stuff a bit confusing.

  34. "I mean look at this guy: don't you know people from, say, Seville, who look a lot like him? Others have broader heads more like people have towards Northern Europe. And then you have just way too many who are totally Gypsy (Roma). What about this one? He could be my neighbor and I would not notice. "

    Maybe a little... it's true some look Caucasian. I think that's because Australia wasn't colonized in just one wave, but perhaps two, three or more. Maybe one of these waves was very akin to Europeans. The genome of an Australian Aborigine doesn't show any special relationship with W. Eurasians. Rather, the authors said they represent a very old population that had some admixture with the ancestors of present day Asians and Native Americans.

  35. "Your blog is one of the few I feel worth bookmarking and reading on a regular basis. I love the maps".

    Thanks a lot. It is one of the best compliments I have ever read. :)

    "I guess I still find this statistics stuff a bit confusing".

    Admittedly me too. But still it's something we have to dive into: sometimes it's perplexing, sometimes revealing.

  36. @Neandertalerin: I have no reason to think that Australia was colonized in different waves (spaced in any way I can discern via mtDNA or whatever other evidence). There may have been some fluctuations in phenotype (judging by some skulls) but all within Australia itself.

  37. "I have no reason to think that Australia was colonized in different waves"

    Don't you think that Y-DNA C4 could be the mark of a first group and the others (more "recent" Y-DNA lineages) could account for a more recent wave (responsible for the spreading of a more classic "Eurasian" look so to speak).

    I'm no specialist of this, but I remember reading that there were at least 2 (I think 3) waves of settlement in Australia, in a (granted, old and not very renowned) encyclopaedia.

  38. Open to speculation, Waggg.

    I tend to judge this stuff mostly based on mtDNA (because it's somewhat easier to read "the entrails", so to say) and the two main Aboriginal mtDNA haplogroups (S and O) are not just sisters but also separated by small genetic distance (one extra mutation for the smallest of both clades: O). Then there is some other stuff but it's mostly tiny sublineages or Papuan-derived (also very localized, it seems).

    So my impression is of one main entry, maybe with some smallish secondary entries, mostly from Papua. The entries from Papua should be recognized by genetic affinity. I'm not sure which are the non-C4 lineages among Australian Aborigines you mention but I can only imagine that they are all Melanesian in affinity (MNOPS, C2).

    Even if Papuans might be more akin to Europeans in some traits (Y-DNA MNOPS, mtDNA R, prominent noses) they are less akin in the traits I mentioned (like hair color and texture, etc.) and IMO even if there were two waves in Papua, they would not have been widely separated either.

    We are talking in any case, I understand, about how the "proto-Eurasians" were at the time of their divergence or scatter. I cannot detect any relevant secondary wave that is not a mere extension or aftershock of the first one.

  39. @Etyopis: again I find myself against a programming barrier. I did the Plink sequence you suggested (for the PC graph) and downloaded GNU Octave but this one is a command line interface which seems anything but easy to use. Probably very useful if you're familiar with it but useless for me as of now. :(

  40. Maju, I'll email you some of the code I have written for Octave to process some of the data from Plink/ADMIXTURE, with explanations, after the New Year festivities are over.
    Happy New Year!

  41. Ok, thanks in advance. Happy new year to you as well.

  42. "So I have this hypothesis that this Caucaso-Australoid phenotype continuum was dominant early on among Eurasians".

    Quite possibly so.

    "We even find it in Japan, with the pre-admixture Ainu and even in China, where the Zhoukoudian skulls look a lot Ainu-like, and not at all 'Mongoloid' yet".

    Hints at the single geographic spread that many claim to see. The spread of the Mongoloid phenotype is the product of a later expansion.

    "Maybe a little... it's true some look Caucasian [don't you know people from, say, Seville, who look a lot like him? Others have broader heads more like people have towards Northern Europe. And then you have just way too many who are totally Gypsy (Roma). What about this one? He could be my neighbor and I would not notice]".

    A couple look as though they are mixed with European. The first was the most like inland unmixed Aborigines. The girl definitely didn't look 'purebred'.

    "I have no reason to think that Australia was colonized in different waves (spaced in any way I can discern via mtDNA or whatever other evidence)".

    Not really correct. Aborigines to the north look more 'Papuan' than do the inland ones. And haplogroup evidence can easily be interpreted as demonstrating at least two waves. The second being the 'Papuan' one carrying mt-DNA M and Y-DNA K. The first arrivals carried only mt-DNA N and Y-DNA C. As:

    "Don't you think that Y-DNA C4 could be the mark of a first group and the others (more 'recent' Y-DNA lineages) could account for a more recent wave (responsible for the spreading of a more classic 'Eurasian' look so to speak)".

    Except that I would place the first arrivals as being more 'Eurasian' while the later arrivals would be the 'Papuans'.

    "the two main Aboriginal mtDNA haplogroups (S and O) are not just sisters but also separated by small genetic distance (one extra mutation for the smallest of both clades: O)".

    But M haplogroups are by no means uncommon in Australia. M14, M15 and M42, a representative of M42'74. M74 is found in South China.

    "I'm not sure which are the non-C4 lineages among Australian Aborigines you mention but I can only imagine that they are all Melanesian in affinity (MNOPS, C2)".

    Basically correct, but no C2. Just K, mostly K2 I think. As you say, basically a Melanesian haplogroup.

    "Even if Papuans might be more akin to Europeans in some traits"

    Australian Aborigines are far more 'European-looking' than are Papuans.


Please, be reasonably respectful when making comments. I do not tolerate in particular sexism, racism nor homophobia. Personal attacks, manipulation and trolling are also very much unwelcome here. The author reserves the right to delete any abusive comment.

Preliminary comment moderation is... ON (sorry, too many trolls).