I decided to gift myself this Saturnalia the basic knowledge of how to use the ADMIXTURE program. It is not easy, but with the help of Razib's instructions, a good dose of patience and some computer savvy I managed yesterday to get something done, even if not exactly what I wanted.
First of all I removed from the population file all populations that have no apparent relation to West Eurasia, and also a bunch of tiny minorities like Druzes, Bedouins, etc., which tend to be rather non-informative. I still retained a number of populations from all around Europe, several North Africans, even more West Asians and Caucasians, and then also some peoples from Central Asia and Siberia. I made two errors however: I removed most NW European representatives by taking out both the CEU (Utah Euroamericans) and North European samples, and I accidentally retained two Caucasian Jewish populations.
Good enough for a draft, not good enough for the strategy I had in mind. I went all the way down to K=7 but I will show here only one panel, and only because it offers a perspective that my second attempt, today, did not achieve so neatly (different strategy, different results): to show a clear cut of the European and West Asian components:
(Figure: example from a previous run: Europe - West Asia duality)
We can see here four components:
- Red: West Asian
- Purple: European
- Green: North African
- Cyan: Siberian
North African genetic influence in Europe is almost trivial and concentrated in Iberia and the Balkans, although this influence is more apparent in West Asia. Siberian influence is also minor, except among the Chuvash and, to a much lesser extent, Russians and other East Europeans.
However West Asian influence is more important and concentrates in the Balkans and Italy. North Caucasian peoples are clearly West Asians genetically speaking, even if they technically live in Europe. In turn European genetic influence outside the subcontinent is concentrated along the Northern African coast, Asia Minor and Cyprus.
I'd say that the West Asian (red) component correlates quite strictly with the extent of demic replacement in the Neolithic (although, naturally, the demic wave would have been each generation more European and less West Asian).
Today's strategy
Today I decided to be more methodical and also to reduce population numbers in order to speed up the process. I decided to keep only one North African and one Siberian population (Moroccans and Selkups) and to cut down a lot the array of West Asian and Caucasian samples (I retained Palestinians, Kurds, Turks and Georgians). I retained all non-Caucasus European populations, including the omissions of the previous day: CEU and North Europeans.
However I cut all samples to 10 members. Actually Belarus (which only has 9) and, by error, another sample that I have not identified have just 9, but that should not affect the results. I hesitated over retaining higher numbers for larger populations like North Europeans, Russians, French and Spaniards, but at the last moment I chose not to (next time I will probably take 20 of each instead of just 10). In any case the smaller number of samples allowed me to go faster with the runs and reach deeper levels quite easily.
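A minimal sketch of that kind of per-population cut, for anyone wanting to repeat it. It assumes each individual comes as a (population, individual ID) pair, for instance the first two columns of a PLINK .fam file where the family ID holds the population label; that layout, the function name and the example IDs are illustrative assumptions, not what was actually used for these runs.

```python
import random
from collections import defaultdict

def subsample(rows, n=10, seed=0):
    """Keep at most n individuals per population.

    rows: list of (population, individual_id) pairs (assumed layout).
    Populations with n or fewer members are kept whole.
    """
    by_pop = defaultdict(list)
    for pop, iid in rows:
        by_pop[pop].append(iid)
    rng = random.Random(seed)  # fixed seed so the cut is reproducible
    kept = []
    for pop in sorted(by_pop):
        ids = by_pop[pop]
        chosen = ids if len(ids) <= n else rng.sample(ids, n)
        kept.extend((pop, iid) for iid in sorted(chosen))
    return kept

# Hypothetical input: a population with only 9 members and a larger one.
rows = [("Belarus", f"B{i}") for i in range(9)] + [("French", f"F{i}") for i in range(28)]
kept = subsample(rows, n=10)
```

Raising `n` to 20 would reproduce the larger cut considered for next time, and small populations such as the 9 Belarusians simply stay as they are.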
And I went on with the runs, getting this:
... and this:
The color code is a bit crazy and absolutely un-cool, but I have managed to figure out that it gives red to pop0 and then evenly spaced hues until blue or magenta. I would prefer the program to keep the same color for each comparable component, but that seems to require human intervention (dyeing).
I decided that it was best to spend my time putting them side by side as above (also human intervention).
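The color behaviour described above can be imitated with a short sketch. It assumes the hues are spread evenly around the HSV color wheel starting at red, which is only a guess at what the plotting script does; the second function shows one way of doing the "dyeing" by hand, with a label-to-color palette that is purely hypothetical.

```python
import colorsys

def component_colors(k):
    """Return k hex colors with evenly spaced hues, red going to pop0."""
    colors = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

def fixed_colors(labels, palette):
    """Look colors up by component label so each keeps its color across runs."""
    return [palette[label] for label in labels]

auto = component_colors(4)  # colors assigned by position (pop0, pop1, ...)

# Hypothetical hand-made palette: the label decides the color, not the position.
palette = {"European": "#800080", "West Asian": "#ff0000"}
manual = fixed_colors(["West Asian", "European"], palette)
```

With position-based hues the same component changes color whenever its index changes between runs, which is why the panels need re-coloring by hand to be comparable.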
Points of interest
K=3
As in the previous trial, the first detached populations were North Africans (Moroccans) and Siberians (Selkups). Nothing unexpected. The Siberian component is clearly more distant than the North African one from the main component (European in this case, because the West Asian specificity is masked between Europe and North Africa once the samples have been reduced).
Fst (components):
- Siberian-Berber 0.131
- Siberian-European 0.112
- Berber-European 0.054
It's clear (and consistent across runs) that North Africans (Berber for short) are much closer to Europeans than Siberian natives are (including the partly European Selkups). West Asians generally stay 50-50 between the European and North African components (their specificity has not yet been unveiled because of the effect of their sample size, smaller than usual).
I did not run K=2 but I imagine that it'd result in Selkups vs the rest, meaning East Asians vs West Eurasians overall.
I could express the distances in a neutral form pop0, pop1 as the program does but I think it's more confusing (I get confused myself), so maybe better to use a label and hope it is a good choice.
Most Fst distances are in the 0.040-0.070 range. I won't emphasize them.
K=4
The division of Europe into two components takes place at this stage. I decided to label them NE European and SW European because the latter is too influential in NW Europe and too weak in the Balkans to be merely "South" (it has more presence among Northern Europeans than in Romania or Turkey), even if the NE component is more of a general presence. I wonder where they come from: are they the product of a duality in the early colonization of Europe, something like Aurignacian vs Gravettian, or what? In any case both seem equally European and not to have originated outside the subcontinent. They are persistent across runs.
K=5
The West Asian specificity shows up, with its focus in Georgia. West Asians finally stop looking like a mere amalgam of Europeans and North Africans and display their unique personality.
I insist that this is a mere effect of the sampling strategy: more West Asian samples would have caused this specificity to show up earlier in the runs (K=4) but, maybe more importantly, the European difference would have been the one eclipsed by the West Asian component. I actually have one example from yesterday's exercise:
(Figure: counter-example from a previous run)
Here Europeans and West Asians appear all mostly Green, which is primarily the West Asian component (and not the European one yet). While some North African affinity persists, this has nothing to do with the 50-50 eclipse of West Asian specificity that we can see in the main exercise.
This is a good example of why we must be wary of the exactitude of the components produced by these algorithms: differences in sampling strategy and depth of analysis may often show or hide critical insight.
K=6 - Slovenian Neanderthals or what?!
From this level of analysis onward we get a small and quite puzzling new component that exists almost only in Slovenes and is not even dominant among them. Usually you don't get such a lesser component, much less one that shows up again and again at several K depths. It is also just the third European-specific component, what the heck?!
The explanation may be that it is extremely distant from all the rest, so even if small it had little choice but to surface.
The Fst distances of the Slovenian odd component are extreme: 0.312, 0.233, 0.241, 0.284, 0.239 with each of the other components. By comparison, the largest distance of the Selkup component is just 0.155, while the largest distance I got between World populations in an ad-hoc K=3 run was 0.195.
So this component, whatever it means, is significantly more distant from everything else in the region than continental populations are from each other. I can only think of massive local Neanderthal admixture, but I know this is so weird and unlikely that a mere algorithm error is probably the truth.
If you have any idea... I welcome it.
K=7
New component: Palestinian!
K=8
An Orcadian component shows up (but vanishes at K=10).
K=9
A lesser Kurdish component shows up, but it does not have the weird Fst distances of the Slovenian one, in spite of the similarity at first sight.
K=10
The Orcadian and Kurdish components vanish (may they resurface in further runs? - I never ran them). Instead Chuvash, Basque and distinct Sardinian-specific components show up.
I stopped here because it was taking longer and longer (some 50 mins for just this last run) and my patience is limited (especially when I have no clear goal).
This is the detailed spreadsheet snapshot of the exact distribution of the components at K=10:
And the K=10 detail:
Mini update: the K5 detail, which is in a sense a simplified display of the same general scheme of things: showing the two main European components, one West Asian (Caucasus) component, the North African and the Siberian components:
Many doubts
The toy seems curious and I did at least manage to make it work at the basics. But I'd like to know:
- How to sort populations so they show up in some logical order, like all Moroccan samples side by side and such.
- Can I command Plink to retain populations instead of just remove them?
- Where can I get other samples? I'm particularly interested in samples of SW Europe but really whatever will do: I'll follow the candy bait, I reckon.
- How can I make the results show individual instead of whole-population bars?
- How can I get the data (cross-ref-validation?) that indicates when the likelihood of meaning of a run is low or high.
- Etc. (surely a lot remains in the ink jar - I just forgot)
Thanks in advance.
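As a starting point for the first two questions, here is a minimal sketch. It assumes that the family ID column of the .fam file holds the population label and that the rows of ADMIXTURE's .Q output follow the same order as the .fam file; the function names and the tiny example data are made up for illustration.

```python
def keep_list(fam_rows, wanted):
    """Build 'FID IID' lines for the individuals of the wanted populations.

    fam_rows: (family_id, individual_id) pairs in .fam order, with the
    family ID assumed to be the population label.
    """
    return [f"{fid} {iid}" for fid, iid in fam_rows if fid in wanted]

def sort_by_population(fam_rows, q_rows, order):
    """Pair each individual with its row of component fractions and sort
    the pairs by a chosen population order (stable within a population)."""
    rank = {pop: i for i, pop in enumerate(order)}
    return sorted(zip(fam_rows, q_rows), key=lambda pair: rank[pair[0][0]])

# Made-up example: three individuals, two components.
fam_rows = [("Moroccan", "M1"), ("Selkup", "S1"), ("Moroccan", "M2")]
q_rows = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]

keep_lines = keep_list(fam_rows, {"Moroccan"})
ordered = sort_by_population(fam_rows, q_rows, ["Moroccan", "Selkup"])
```

A list like `keep_lines`, saved to a text file, is the kind of input PLINK's `--keep` option takes (it retains the listed individuals instead of removing them). Since every row of the sorted output is one individual, plotting those rows without averaging them per population gives individual bars. For the cross-validation question, ADMIXTURE has a `--cv` flag that reports a cross-validation error for each K.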
Update (Dec 28): Fst distances
Table of Fst genetic distances at K=10:
I marked with red stars the extreme (>0.2) Fst distances of the Slovene component, with orange ones those in the highest quintile (after removing the Slovene oddity), which are all from the Siberian component, and with green ones the Fst distances in the lowest quintile.
I also made an Euler diagram sketching Fst genetic distances between the various West Eurasian components:
Where Fst distances in the lowest quintile (after removing the Slovene oddity; <0.084) are shown with continuous lines and the second quintile (0.084-0.107) are shown with dotted lines. (Note: image corrected from first posted version, which had an error).
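The line-style rule of the diagram can be written down as a tiny sketch. The two cut-offs are the ones quoted above; how the exact boundary values are treated, and the function name, are assumptions.

```python
def line_style(fst, solid_below=0.084, dotted_up_to=0.107):
    """Return the line drawn between two components for a given Fst.

    Lowest quintile (< 0.084): continuous line.
    Second quintile (0.084-0.107): dotted line.
    Anything larger: no line (None).
    """
    if fst < solid_below:
        return "continuous"
    if fst <= dotted_up_to:
        return "dotted"
    return None

# 0.060 is the NE Euro-Caucasus distance; the other two values are
# only illustrative inputs for the remaining cases.
styles = [line_style(f) for f in (0.060, 0.091, 0.125)]
```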
I think it gives an interesting impression of the possible relations between the various components, in which the NE Euro and Caucasian components (and to a slightly lesser extent the Basque one) seem pivotal, almost as if all the other West Eurasian components were peripheral outgrowths. The short Fst distance between the NE Euro and Caucasus (or Highland West Asia) components already showed up in some of the analyses of Dienekes, raising some eyebrows, at least mine. However, as he does not use the smaller components, some of the correlations, notably that the Basque component is also in that pivotal zone, were not apparent at the time.
PS- highly tentative reconstruction of pop. history (excluding the Slovene odd component), based on average Fst (Fst(core)) towards the "core" Caucasus/NE Euro components:
- Fst(core)=0.125 - Divergence of Siberian/East Asian component (0.110 Chinese/CEU per Wikipedia): Eurasian expansion after the OoA.
- Fst(core)=0.102-0.100 - Divergence of Sardinian (?) and North African components: Dabban industries?
- Fst(core)=0.091 - Divergence of SW European component (Aurignacian?)
- Fst(core)=0.084 - Divergence of the Palestinian component
- Fst(core)=0.079 - Chuvash component
- Fst(core)=0.065 - Basque component
- Fst=0.060 - Caucasus and NE Euro divergence
A rough estimate of the possible Caucasus/NE Euro divergence timing (by comparing the Fst values with those of presumably Aurignacoid divergences) would place it c. 24 to 30 Ka ago (depending on what value is used for the Aurignacoid divergence, 40 or 44 Ka ago, and on which component is taken as reference, the SW Euro or the North African one). So I'd dare say that the Basque, Caucasian and NE Euro components appear to have split ways (with all reservations) in the Gravettian period.
(Not sure how well it fits but this kind of maths would place the Siberian/East Asian divergence c. 55-60 Ka ago, a bit too recently IMO and the odd Slovenian component's divergence, if real, c. 110 Ka ago, weirdly old but H. sapiens rather than Neanderthal).
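The arithmetic behind these figures can be sketched as a plain proportionality: a divergence age is taken to scale linearly with Fst, calibrated on a reference component of assumed age. That linear rule is my reading of "this kind of maths", and using 0.100 as the North African value (from the 0.102-0.100 pair above) is also an assumption.

```python
def scaled_age(fst, ref_fst, ref_age_ka):
    """Age in Ka, assuming age is proportional to Fst (a crude rule)."""
    return ref_age_ka * fst / ref_fst

ref_fsts = {"SW European": 0.091, "North African": 0.100}  # Fst(core) values
ref_ages = (40, 44)                                        # Aurignacoid dates, Ka

# Caucasus / NE Euro divergence (Fst = 0.060) under each calibration.
core_split = [scaled_age(0.060, f, t) for f in ref_fsts.values() for t in ref_ages]

# Siberian / East Asian divergence (Fst(core) = 0.125), SW European calibration.
siberian = [scaled_age(0.125, ref_fsts["SW European"], t) for t in ref_ages]

print([round(a, 1) for a in core_split])
print([round(a, 1) for a in siberian])
```

Under these assumptions the Caucasus/NE Euro split falls between about 24 and 29 Ka, and the Siberian one between about 55 and 60 Ka, in line with the ranges given above.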