December 29, 2011

North African genetics through the prism of ADMIXTURE

I believe that with this exercise, which took me just a morning's time, I'm walking a path that has not been explored before: analyzing the autosomal genetics of North Africans in their own right, without being part of a larger context, be it African, West Eurasian or global. At least I'm not aware of any such paper, nor of any self-research exercise in the blogosphere.

That said, I did include five exogenous samples in the study, in order to estimate possible external influences. These are: 10 Fulani, 10 Mandenka, 10 Ethiopians, 10 Saudi Arabs and 10 Spaniards. I did not alter the diverse HGDP North African samples (including two different Egyptian samples), except for two things: I removed the Moroccan Jews altogether and cut the Mozabite sample to 10 individuals, because of suspicion that their alleged isolation might distort the larger analysis.

More or less as I expected, at K=10, which was my preliminary goal, each of the exogenous ethnicities described one distinct component, while the other five components were North African specific.

What I did not expect at all was that Tunisians would show up as distinctive as they did (see below). I wonder if there is something special in that sample or if the measure applies to all (or most) Tunisians. Very strange and unexpected in any case.

In the end, concerned that I might be missing something of relevance, I made two more runs and one of them struck "genetic gold", it seems to me: a small South Moroccan component very distant from everything else, which might well be a remnant of the Aterian period or something like that.

Method: I used a fraction (as described in the previous lines) of the global HGDP sample following the method explained at Gene Expression to operate ADMIXTURE and associated programs (Plink, R). 
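For anyone wanting to repeat the series of runs, here is a minimal sketch of how the command lines for K=2 to K=12 could be generated. It only builds and prints the commands, it does not run anything. The file name `northafrica.bed` is a placeholder of mine, not the file actually used, and `--cv` is ADMIXTURE's cross-validation switch, which the runs described here did not necessarily use.

```python
def admixture_commands(bed_file, k_values, cross_validate=True):
    """Build one ADMIXTURE command (as an argument list) per value of K."""
    commands = []
    for k in k_values:
        cmd = ["admixture"]
        if cross_validate:
            cmd.append("--cv")  # asks ADMIXTURE to report a cross-validation error
        cmd += [bed_file, str(k)]
        commands.append(cmd)
    return commands


# "northafrica.bed" is a hypothetical file name used only for illustration.
commands = admixture_commands("northafrica.bed", range(2, 13))
for cmd in commands:
    print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` if the whole series should run unattended.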


K=2 - Without surprises: Tropical African vs. West Eurasian components.

K=3 - Big surprise: the first North African specific component is concentrated in Tunisia, not Morocco, not Mozabites... but Tunisians, uh?

K=4 - As another North African specific component (red, most common among Sahrawis, then Moroccans) shows up, the Tunisian component (green) retracts, so to say, to the Tunisian borders.

K=5 - Not happy with one, the algorithm finds a second Tunisian component, also restricted to that country. I'm as perplexed as you may be.

K=6 - West Asian (Saudi Arab) and European (Spaniard) components diverge.

K=7 - Second (non-Tunisian) NW African component shows up. This one (turquoise) is most concentrated among Mozabites. 

K=8 - A Fulani-specific component shows up. Intriguingly it is almost equidistant by Fst measure from the Mandenka and the Sahrawi components (0.105 and 0.115 respectively). All the North African specific components are much closer to West Eurasian ones than to the Mandenka component, so this might suggest a very old kind of trans-Saharan admixture, then homogenized in a single component. 

K=9 - Not happy with one, the Fulani show a second component in a row. This one is neatly Tropical African: very distant from all, and only somewhat close to the Mandenka component and the other Fulani component, but with Fst values of 0.163 and 0.173, which is still very distant. I imagine that this has to do with the Fulani L1b mtDNA lineage but never mind because the component will vanish again as we move on.

K=10 - A Morocco-centered component shows up here (green), also found in Algerians and Libyans. A distinct Ethiopian-specific component is also defined (influencing Egypt and Libya significantly and to much lesser extent also NW Africa).

K=11 - A small and very interesting component exclusive of South Morocco shows up. 

(Note: at K=12 there is a third Tunisian component, go figure!, but I don't think that is informative at all so it's not shown). 

Note: A reader suggested that some North Africans in these samples are heavily admixed with Tropical Africans, distorting the results in that aspect. I can't say but, if I manage to get the program variant working that should show individual instead of population bars, then we will find out.

Fst Distances at K=11:

Please notice that the South Moroccan component is extremely distant from all the others (Eurasians and Africans alike). I will speculate (as I had done before seeing this) that this component, now almost entirely restricted to Southern Morocco and heavily admixed, is a residue of the Aterian period and is related to a vaguely "Khoisanid" or equally vaguely "Mongoloid" phenotype found in the region.

Component apportions (numerical) at K=11:

Detail of K=11 graph:

December 26, 2011

Playing around with ADMIXTURE

I decided to gift myself this Saturnalia with the basic knowledge of how to use the ADMIXTURE program. It is not easy but with the help of Razib's instructions, a good dose of patience and some computer savvy I managed yesterday to get something done, even if not exactly what I wanted.

First of all I cleaned up the population file, removing all populations that have no apparent relation with West Eurasia and also a bunch of tiny minorities like Druze, Bedouins, etc., which tend to be rather non-informative. I still retained a number of populations from all around Europe, several North Africans, even more West Asians and Caucasians and then also some peoples from Central Asia and Siberia. I made two errors however: I removed most NW European representatives by taking out both the CEU (Utah Euroamericans) and North European samples, and I accidentally retained two Caucasian Jewish populations.

Good enough for a draft, not good enough for the strategy I had in mind. I went all the way down to K=7 but I will show here only one panel, and only because it offers a perspective that my second attempt, today, did not achieve so neatly (different strategy, different results): to show a clear cut of the European and West Asian components:

example from a previous run: Europe - West Asia duality

We can see here four components:
  • Red: West Asian
  • Purple: European
  • Green: North African
  • Cyan: Siberian
North African genetic influence in Europe is almost trivial and concentrated in Iberia and the Balkans, although this influence is more apparent in West Asia. Siberian influence is also minor, except among the Chuvash and, to a much lesser extent, Russians and other East Europeans.

However West Asian influence is more important and concentrates in the Balkans and Italy. North Caucasian peoples are clearly West Asians genetically speaking, even if they technically live in Europe. In turn European genetic influence outside the subcontinent is concentrated along the Northern African coast, Asia Minor and Cyprus.

I'd say that the West Asian (red) component correlates quite strictly with the extent of demic replacement in the Neolithic (although, naturally, the demic wave would have been each generation more European and less West Asian).

Today's strategy

Today I decided to be more methodical and also to reduce population numbers in order to speed up the process. I decided to keep only one North African and one Siberian population (Moroccans and Selkups) and to greatly reduce the West Asian and Caucasian array of samples (I retained: Palestinians, Kurds, Turks and Georgians). I retained all non-Caucasus European populations, including the omissions of the previous day: CEU and North Europeans.

However I cut all samples to 10 members. Actually the Belarusians and one other sample (I don't know which one, by error) have just 9, but that should not affect the results. I hesitated about retaining higher numbers for larger populations like North Europeans, Russians, French and Spaniards but at the last moment I chose not to (next time I will probably include 20 of each instead of just 10). In any case the smaller sample sizes allowed me to go faster with the runs and reach deeper levels quite easily.

And I went on with the runs, getting this:

... and this:

The color code is a bit crazy and absolutely un-cool but I have managed to figure out that it gives red to pop0 and then similarly spaced hues until blue or magenta. I'd prefer the program to keep the same color for each comparable component but that seems to require human intervention (dyeing).
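The coloring behavior described above can be imitated, and the "dyeing" automated, with a small sketch. This is Python rather than the R plotting script, and it assumes the plot simply spaces hues evenly from red towards magenta, which is only my reading of the output. The function names are hypothetical, and the labels are the ones used for the four components in the earlier panel.

```python
import colorsys


def spaced_hues(k, max_hue=5 / 6):
    """Return k hex colors with evenly spaced hues, from red (0) up to max_hue."""
    colors = []
    for i in range(k):
        hue = max_hue * i / max(k - 1, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


def stable_palette(all_labels):
    """Give each component label one fixed color, independent of its pop index."""
    ordered = sorted(set(all_labels))
    return dict(zip(ordered, spaced_hues(len(ordered))))


# Labels decided once, by hand, for every component expected across the runs.
palette = stable_palette(["West Asian", "European", "North African", "Siberian"])
for label, color in palette.items():
    print(label, color)
```

The labeling itself still has to be done by a human for each run (deciding which popN is "European" and so on), but once labeled, every panel can reuse the same palette.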

I decided that it was best to spend my time putting them side by side as above (also human intervention).

Points of interest


As in the previous trial, the first detached populations were North Africans (Moroccans) and Siberians (Selkups). Nothing unexpected. The Siberian component is clearly more distant than the North African one from the main component (European in this case, because the West Asian specificity is masked between Europe and North Africa once the samples have been reduced).

Fst (components):
  • Siberian-Berber 0.131
  • Siberian-European 0.112
  • Berber-European 0.054
It's clear (and consistent across runs) that North Africans (Berber for short) are much closer to Europeans than Siberian natives are (including the partly European Selkups). West Asians generally stay 50-50 between the European and North African components (their specificity has not yet been unveiled because of the smaller than usual sample size).
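To make the comparison explicit, the three distances listed above can be put in a small lookup and compared directly. This is only a sketch for checking the arithmetic; the labels are the ones chosen in this post, not the program's own pop0/pop1 names.

```python
# Pairwise Fst between components, as listed above.
fst = {
    frozenset(["Siberian", "Berber"]): 0.131,
    frozenset(["Siberian", "European"]): 0.112,
    frozenset(["Berber", "European"]): 0.054,
}


def distance(a, b):
    """Symmetric lookup: distance(a, b) == distance(b, a)."""
    return 0.0 if a == b else fst[frozenset([a, b])]


closest_pair = min(fst, key=fst.get)
ratio = distance("Siberian", "European") / distance("Berber", "European")

print("closest pair:", sorted(closest_pair))
print("Siberian-European is %.1f times the Berber-European distance" % ratio)
```

So, taking the values at face value, the Siberian component is roughly twice as far from the European one as the Berber component is.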

I did not run K=2 but I imagine that it'd result in Selkups vs the rest, meaning East Asians vs West Eurasians overall.

I could express the distances in a neutral form pop0, pop1 as the program does but I think it's more confusing (I get confused myself), so maybe better to use a label and hope it is a good choice. 

Most Fst distances are in the 0.040-0.070 range. I won't emphasize them.


The division of Europe into two components takes place at this stage. I decided to label them NE European and SW European because the latter is too influential in NW Europe and too low in the Balkans to be merely "South" (more presence among Northern Europeans than in Romania or Turkey), even if the NE component is more of a general presence. I wonder where they come from: are they the product of a duality in the early colonization of Europe, something like Aurignacian vs Gravettian, or what? In any case both seem equally European and not originated outside the subcontinent. They are persistent across runs.


The West Asian specificity shows up, with focus in Georgia. West Asians finally stop looking like a mere amalgam of Europeans and North Africans and display their unique personality.

I insist that this is a mere effect of the sampling strategy: more West Asian samples would have caused this specificity to show up earlier in the runs (K=4) but, maybe more importantly, the European difference would have been the one eclipsed by the West Asian component. I actually have one example from yesterday's exercise:

counter-example from a previous run

Here Europeans and West Asians appear all mostly Green, which is primarily the West Asian component (and not the European one yet). While some North African affinity persists, this has nothing to do with the 50-50 eclipse of West Asian specificity that we can see in the main exercise.

This is a good example of why we must be wary of the apparent exactitude of the components produced by these algorithms: differences in sampling strategy and depth of analysis may often show or hide critical insights.

K=6 - Slovenian Neanderthals or what?!

From this level of analysis on we get a small and quite puzzling new component that exists almost only in Slovenes and is not even dominant among them. Usually you don't get such a minor component, much less one that shows up again and again at several K-depths. It is also just the third European-specific component, what the heck?!

The explanation may be that it is extremely distant from all the rest, so even if small it had little choice but to surface. 

The Fst distances of the Slovenian odd component are extreme: 0.312, 0.233, 0.241, 0.284, 0.239 with each of the other components. By comparison, the largest distance of the Selkup component is just 0.155, while the largest distance I got between World populations in an ad-hoc K=3 run was 0.195. 
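The comparison can be checked with a few lines, using only the figures quoted in the paragraph above:

```python
# Fst of the odd Slovenian component against each of the other five components.
slovenian = [0.312, 0.233, 0.241, 0.284, 0.239]
selkup_max = 0.155    # largest distance of the Selkup (Siberian) component
world_k3_max = 0.195  # largest distance in the ad-hoc worldwide K=3 run

mean_slovenian = sum(slovenian) / len(slovenian)
all_above_world = all(d > world_k3_max for d in slovenian)

print("smallest Slovenian distance: %.3f" % min(slovenian))
print("mean Slovenian distance:     %.3f" % mean_slovenian)
print("every distance exceeds the worldwide maximum:", all_above_world)
```

Even the smallest of the five distances (0.233) is larger than the largest worldwide value (0.195).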

So this component, whatever it means, is significantly more distant from everything else in the region than continental populations are from each other. I can only think of massive local Neanderthal admixture but I know this is so weird and unlikely that a mere algorithm error is probably the truth. 

If you have any idea... I welcome it.


New component: Palestinian!


An Orcadian component shows up (but vanishes at K=10).


A lesser Kurdish component shows up but it does not have the weird Fst distances of the Slovenian one, in spite of the similarity at first sight.


The Orcadian and Kurdish components vanish (might they resurface in further runs? - I never ran them). Instead Chuvash, Basque and a distinct Sardinian-specific component show up. 

I stopped here because it was taking longer and longer (some 50 mins for just this last run) and my patience is limited (especially when I have no clear goal).

This is the detailed spreadsheet snapshot of the exact distribution of the components at K=10:

And the K=10 detail:

Mini update: the K5 detail, which is in a sense a simplified display of the same general scheme of things: showing the two main European components, one West Asian (Caucasus) component, the North African and the Siberian components:

Many doubts

The toy seems curious and I did at least manage to make it work at the basics. But I'd like to know:
  1. How to sort populations so they show up in some logical order, like all Moroccan samples side by side and such.
  2. Can I command Plink to retain populations instead of just remove them? 
  3. Where can I get other samples? I'm particularly interested in samples of SW Europe but really whatever will do: I'll follow the candy bait, I reckon. 
  4. How can I make the results show individual instead of whole-population bars?
  5. How can I get the data (cross-validation?) that indicates whether the likelihood of a run being meaningful is low or high?
  6. Etc. (surely a lot remains in the ink jar - I just forgot)
Thanks in advance.
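On questions 1 and 4, a minimal sketch of the idea, assuming one already has a population label per individual and that individual's row of ancestry fractions (ADMIXTURE's .Q output has one row per individual and one column per component). The labels and fractions below are made-up toy values, not results, and the function names are hypothetical.

```python
from collections import defaultdict


def sort_by_population(labels, q_rows):
    """Group individuals so that each population sits side by side.

    sorted() is stable, so individuals keep their original order
    within each population.
    """
    return sorted(zip(labels, q_rows), key=lambda pair: pair[0])


def population_means(labels, q_rows):
    """Average the individual rows of each population (whole-population bars)."""
    sums = {}
    counts = defaultdict(int)
    for label, row in zip(labels, q_rows):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Toy values for illustration only.
labels = ["Moroccan", "Selkup", "Moroccan"]
q_rows = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

individual_bars = sort_by_population(labels, q_rows)
population_bars = population_means(labels, q_rows)
print(individual_bars)
print(population_bars)
```

Plotting the sorted rows one by one gives individual bars; plotting the means gives one bar per population.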

Update (Dec 28): Fst distances

Table of Fst genetic distances at K=10:

I marked with red stars the extreme (>0.2) Fst distances of the Slovene component, with orange ones those in the highest quintile (after removing the Slovene oddity), which are all from the Siberian component, and with green ones the Fst distances in the lowest quintile.

I also made an Euler diagram sketching Fst genetic distances between the various West Eurasian components:

Where Fst distances in the lowest quintile (after removing the Slovene oddity; <0.084) are shown with continuous lines and the second quintile (0.084-0.107) are shown with dotted lines. (Note: image corrected from first posted version, which had an error).

I think it gives an interesting impression of the possible relations between the various components, in which the NE Euro and Caucasian components (and to a slightly lesser extent the Basque one) seem pivotal, almost as if all the other West Eurasian components were peripheral outgrowths. The short Fst distance between the NE Euro and Caucasus (or Highland West Asia) components already showed up in some of Dienekes' analyses, raising some eyebrows, at least mine. However, as he does not use the smaller components, some of the correlations, notably that the Basque component is also in that pivotal zone, were not apparent at the time.

PS- highly tentative reconstruction of pop. history (excluding the Slovene odd component), based on average Fst (Fst(core)) towards the "core" Caucasus/NE Euro components:
  1. Fst(core)=0.125 - Divergence of Siberian/East Asian component (0.110 Chinese/CEU per Wikipedia): Eurasian expansion after the OoA.
  2. Fst(core)=0.102-0.100 - Divergence of Sardinian (?) and North African components: Dabban industries?
  3. Fst(core)=0.091 - Divergence of SW European component (Aurignacian?)
  4. Fst(core)=0.084 - Divergence of the Palestinian component
  5. Fst(core)=0.079 - Chuvash component
  6. Fst(core)=0.065 - Basque component
  7. Fst=0.060 - Caucasus and NE Euro divergence

A rough estimate of the possible Caucasus/NE euro divergence timing (by comparing the Fst values with those of presumably Aurignacoid divergences) would place it c. 24 to 30 Ka ago (depending on what values are used for the Aurignacoid divergence: 40 or 44 Ka ago and of which component is considered the SW Euro or the North African one). So I'd dare say that Basque, Caucasian and NE Euro components appear to have split ways (with all reservations) in the Gravettian period.

(Not sure how well it fits but this kind of maths would place the Siberian/East Asian divergence c. 55-60 Ka ago, a bit too recently IMO and the odd Slovenian component's divergence, if real, c. 110 Ka ago, weirdly old but H. sapiens rather than Neanderthal).
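The arithmetic behind these rough dates can be sketched as a simple proportional scaling: take a reference divergence of assumed age and scale that age by the ratio of Fst values. That Fst grows linearly with time is itself a strong simplification, and taking 0.100 as the North African Fst(core) value from the list above is my assumption.

```python
def scaled_age(fst, ref_fst, ref_age_ka):
    """Age (in Ka) obtained by scaling a reference age by the ratio of Fst values."""
    return ref_age_ka * fst / ref_fst


# Candidate "Aurignacoid" references: Fst(core) of the SW European component,
# and 0.100 taken (as an assumption) for the North African component.
reference_fst = {"SW European": 0.091, "North African": 0.100}
reference_ages_ka = (40, 44)

caucasus_ne_euro = [
    scaled_age(0.060, ref, age)
    for ref in reference_fst.values()
    for age in reference_ages_ka
]
siberian = [scaled_age(0.125, 0.091, age) for age in reference_ages_ka]

print("Caucasus/NE Euro split: %.0f-%.0f Ka" % (min(caucasus_ne_euro), max(caucasus_ne_euro)))
print("Siberian/East Asian split: %.0f-%.0f Ka" % (min(siberian), max(siberian)))
```

With these inputs the Caucasus/NE Euro split falls between about 24 and 29 Ka, and the Siberian one between about 55 and 60 Ka when the SW European component is used as the reference.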

Echoes from the Past (Dec 26)

Before the year is over, here is a bunch of stuff I wanted to mention:

Lower and Middle Paleolithic

Humans may have originated near rivers - Technology & science - Science - LiveScience - neither savanna nor jungle, beach (river banks) was the favored ecosystem even for good old Ardi, it seems.

Pileta de Prehistoria: 180 prehistoric sites located around Atapuerca[es] - not just Neanderthal ones: a bit of everything (located just outside Burgos city, Atapuerca is a key pass between the Upper Ebro basin and the Northern Iberian Plateau, which must have played an ecological and socio-political role always, and hence attracted people towards it).

Upper Paleolithic and Epipaleolithic 

The boulders at lake Huron were to trap the reindeer (caribou)
Remains found of the culture which inhabited Northern Chile 11,000 years ago - Terrae Antiqvae[es] - they exploited a quartz deposit for their tools in the middle of Atacama desert, which then was probably quite milder.

Simultaneous ice melt in Antarctic and Arctic at the end of the last Ice Age.

Neolithic and Chalcolithic

El Neolítico en Europa: una simulación del proceso | Neolítico de la Península Ibérica - Iberian Neolithic - exposition and criticism in Spanish language of yet another paper simulating the Neolithic 'colonization' (Lemmen 2011).

Metal Ages and Historical periods

Iruina blog: doubts about the ability of the Basque Autonomous Police to  analyze the Iruña-Veleia pieces[es] - the Spanish Guardia Civil police force already declared themselves unable to do the tests. The defense asks to send the remains to one of the few international laboratories able to do the tests and has even offered to pay the cost of it. Also at Diario de Noticias de Alava[es].

An intimate look at ancient Rome - a journey through the hygienic practices of Ancient Rome.

Scientists unlock the mystery surrounding a tale of shaggy dogs - Native Americans used dog hair for textiles (among other components).

The Archaeology News Network: Real Mayan apocalypse may have been their own fault - overexploitation of the jungle biome caused desertification.


Some of these open access papers surely deserve a deeper look... I did not have the time or energy for that however.

BBC News - Liking a lie-in in people's genes, researchers say - long sleeping is a genetic need: tell your boss next time you are late. I am among those who need to sleep 9-10 hours per day (normally) though I have also met people who only sleep 4-5 hours.

The Spittoon » Find Your Inner Neanderthal (I retract what I said before: the results are coherent, even if Africans still get too much too often I guess that's part of the margin of error. However there is another "free online" genetic test that is misleading).

Biology and psychology

Of mice and men, a common cortical connection - a nice comparison to better understand brain regions. To the right: F/M: frontal/motor cortex, S1: primary somatosensory cortex, A1: primary auditory cortex and V1: primary visual cortex. Mice have a much more developed somatosensory cortex (surely related to whiskers, smell, etc.) but a much less developed frontal/motor cortex (related to willpower and rationality).

Brain Scans Reveal Difference Between Neanderthals and Us | LiveScience - something about the sense of smell, not too clear.

Primates are more resilient than other animals to environmental ups and downs - diversification and flexibility is the key to long-term success.

December 23, 2011

Battle of Andagoste (Kuartango, Basque Country) year 38 BCE

Caliga showing the nails
Or should I say Caristian territory, year one of the Hispanic Era?

Actually it was probably the year before or maybe even two years earlier: 40-39 BCE according to the best estimates, but the fact is that a battle took place in the municipal territory of Kuartango, not far from Vitoria and Bilbao and the ancient city of Veleia, where many Basque and Vulgar Latin short inscriptions have been found. 

I was totally oblivious to this historical-era archaeological episode until I read about it yesterday at Iruina[es] blog and then at Euskonews[es] (dated in 2006). It seems that some 1500 Roman legionaries (1200-1800, known by the size of the defensive castrum erected) were attacked and defeated by local tribal troops in what was the prelude to the Cantabrian Wars, a decade later. 

The hill of the battle looks so peaceful now
While this prelude is poorly known to historians it must have been of some importance because Octavian (Augustus) declared the year 38 BCE as the first one of the Aera Hispanica, a chronology that was used in all Iberia (i.e. Hispania before the name was monopolized by the state of Spain) until the Late Middle Ages when the more cosmopolitan Christian calendar began to be used instead. And he did so because of the victories that he allegedly attained against the tribes of the North, campaigns of which little is known.

In that year of 38 BCE it is known that M.V. Agrippa quelled an uprising by the Aquitani (Northern Basques and proto-Gascons). We can, I guess, speculate whether this battle was caused by (or was even the cause of) the campaigns of Octavian in the South or that of his commander Agrippa in the North, but I can only imagine that this is a trivial distinction and that both are one and the same crushing pressure of the Roman Empire against the Basque tribes overall.

Blue: Celtic tribes, Red: pre-Indoeuropean and hence presumably Vasconic tribes

The battle brings to question the myth of Southern Basques being submitted by Rome only or mostly by pacts and agreements. This myth is mostly based on the fact that Pompey camped at what is now Pamplona (Pompaelo) while his rival Sertorius (a supporter of Marius) did in Huesca (Osca), at the final showdown of the Sertorian War. However at that time the Romans battled with many diverse and circumstantial allies, alliances that might have been eroded by the time of this battle. Alternatively, different tribes may have held different relations with Rome and what applies (maybe) to the Vascones needs not apply to the Caristii or other tribes of North Iberia and Aquitania.

In any case the reconstructed battle depicts a siege of a Roman military camp (castrum), maybe erected for the occasion, following the pattern of this image provided by Iruina blog:

The nails (clavos) are more than 600 used for caligae (military footwear), indicating where Romans lost a sandal and maybe their lives. The coins (monedas), weapons (armas) and slingshot ammunition (proyectiles de honda) may help give an impression of the details of the battle in and around the central fortification (núcleo) and also tell archaeologists when the battle took place (for example these large shoe-nails were replaced by smaller ones in the Roman legions a few years later, the coins also allow for a quite precise estimate...) The overall estimate seems to be 38 BCE (+/-3 years) but the most exact claims are for the 40-38 BCE period in fact.

Rough chronology of the Roman takeover of the Cantabro-Aquitanian or proto-Basque area:
  • 80-72 BCE: Sertorian War. Pompey making camp at a Vasconic town in 75 BCE is considered the foundation of Pompaelo (Pamplona)
  • 56-51 BCE: conquest of Aquitania (within the context of the wider conquest of Gaul by Julius Caesar)
  • 40-38 BCE: approximate date of the Battle of Andagoste
  • 38 BCE: Agrippa defeats an Aquitanian uprising
  • 38 BCE: decreed by Octavian to be the first year of the Aera Hispanica
  • 29 BCE: Octavius proclaims World Peace (closes the gates of Janus) for the first time
  • 29-19 BCE: Cantabrian Wars
  • 27 BCE: Octavius becomes Augustus (standard beginning of the Roman Empire, previously known as Roman Republic). 
  • 23 BCE: Octavius proclaims World Peace for the second time
  • 13 BCE: Octavius proclaims World Peace for the third time after a final campaign against the Alpine tribes (17-15 BCE)

December 12, 2011

On the origin of mitochondrial macro-haplogroup N

The notion that the migration of Homo sapiens out of Africa had to pivot around West Asia has been deeply entrenched in our minds, partly because of geographical common sense, partly because of Eurocentrism, partly maybe because of the Judeo-Christian-Muslim religious background of most influential researchers historically... 

However in the last years this idea has been challenged by the coastal migration theory that proposes a migration mostly along the coasts of the Indian Ocean rather than through the interior of Asia. This theory was first outlined by population geneticists, who needed to explain the facts of haplogroup distribution in Eurasia, not at all more diverse towards the West, as we could expect from the classical models pivoting around the Fertile Crescent, but rather towards the East and very specially in South Asia. Later it has been also corroborated, with lesser shadings maybe, by archaeologists who have sought material support in Arabia and India and found it.

While the origin of mitochondrial macro-haplogroup M in South Asia is seldom contested, that of its "sister" N is seldom agreed upon. The reason is that it is distributed somewhat evenly through all Eurasia, Australasia and even America.

This map, from the Metspalu 2005 paper (open access), illustrates the issue and how even renowned geneticists were, not long ago, unsure of where to place the urheimat of the haplogroup:

The phylogeny has anyhow been refined in these six and a half years and you may notice that Australasia is not even included in the map, although it does play an important role, being surely more important than West Eurasia. In any case the map is illustrative of this state of confusion. Confusion that I will try (once again and hopefully for good) to dispel in this article.

The facts of mtDNA N

Macro-haplogroup N has 15 acknowledged basal haplogroups scattered through all Eurasia and Aboriginal Australia. They have diverse numerical importance but what matters to me here is how many mutations (coding region transitions, to be more precise) they are downstream of the N node. Why? Because this is surely indicative of the timing of their respective expansions in relation with N as such. 

Looking at this measure we find the following classes of N sub-haplogroups:
  • Elder daughters: one coding region mutation downstream of N: N1'5, N9, N11, S and R. Notice that among these R holds a special place, not for any phylogenetic reason but because it has a scatter as wide as that of her mother N, suggestive of a very early coalescence and some sort of association between both expansions. 
  • Two mutations downstream of N: N10 and O.
  • Four mutations downstream of N: N2 (incl. W), A and X.
  • Extremely long stems, rare clades without any known node under N: N8, N13, N14, N21, N22.
This distinction is not very important but I always keep it in mind anyway, because it implies that the various classes of subhaplogroups expanded at different moments after the N node. Notably there is a "pause" at the place of the third mutation and then after the fourth. So we can well imagine the expansion of N as a double explosion: first the first two categories and then the third and maybe the fourth.
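The classes listed above can be tallied in a few lines, which also makes the "pause" visible as a missing step. The grouping is exactly the one given in the list; only the variable names are mine.

```python
# Basal subhaplogroups of N, keyed by coding region mutations downstream of N.
steps_below_n = {
    1: ["N1'5", "N9", "N11", "S", "R"],
    2: ["N10", "O"],
    4: ["N2", "A", "X"],
}
# Rare clades on extremely long stems, with no known node under N.
long_stems = ["N8", "N13", "N14", "N21", "N22"]

total = sum(len(clades) for clades in steps_below_n.values()) + len(long_stems)
empty_steps = [
    step for step in range(1, max(steps_below_n) + 1) if step not in steps_below_n
]

print("basal haplogroups:", total)
print("mutational steps with no basal clade:", empty_steps)
```

The tally gives the 15 basal haplogroups, with no clade branching at exactly three mutations.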

Representing each haplogroup as a dot, where they might have coalesced (often a hunch within the local region), the result is as follows:

1.- Estimated coalescence of basal subhaplogroups of N

The size of the dots represents only the "class", that is: how many mutational steps they are under N, the larger the closer they are and the earlier they must have coalesced (according to the laws of probability). The peculiar macro-haplogroup R (whose approx coalescence location was estimated in the past and I will not explain here) has been painted of a lighter blue and given a slightly larger size. 

I have also outlined the cloud of N expansion at mutational steps 1 and 2 (no difference), which are followed by an apparent pause at mutational step 3, as mentioned above. The cloud has been pushed northwards a bit in East Asia in order to avoid disputes on where exactly N9 coalesced (it does not make much of a difference if you prefer Beijing over Shanghai for this clade's coalescence in the end).

Notice that this N cloud is almost identical to what the M cloud would be (not shown but look here for a reference if you wish). Whether they were simultaneous or, as I think, N coalesced and expanded a bit after M did, their geography was the same: South Asia, East Asia and Australasia without distinctions. This T-shaped region (with the East on top) was the homeland of the first Eurasian (or more properly non-African) population of Homo sapiens (except those who remained in Arabia, which are another story).

The geographic origin of N

Alright, I have described the scatter of N subhaplogroups and the most likely sequence of the expansion but my main purpose here is to estimate the origin, the urheimat of N: where did the N matriarch, the ultimate matrilineal ancestor of all N people today, live?

I apply the statistical principle by which the derived basal haplogroups should tend to remain not too far away from the common origin, the most far-flung ones being exceptions and never the rule. It does make sense, right?

Hence if we can estimate the centroid of the geometry described by the 15 haplogroups, we will have found the origin of N - or at least a raw estimate of it. There are several methods to estimate centroids but I chose to use the geometric one. In fact, for simplicity, I divided the subhaplogroups in three sets of five (so they all weight the same) and estimated their centroids by geometric decomposition. Then I estimated the centroid of the resulting triangle.
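A minimal sketch of that procedure, assuming points on a flat map and plain averaging of coordinates. The 15 coordinates below are arbitrary placeholders, not the locations plotted on the map.

```python
def centroid(points):
    """Plain average of (x, y) points on a flat map."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# 15 placeholder points standing in for the 15 basal haplogroups.
points = [(float(10 * i), float((i % 4) * 10)) for i in range(15)]

# Three sets of five, so that every haplogroup weighs the same.
groups = [points[0:5], points[5:10], points[10:15]]
triangle = [centroid(group) for group in groups]

estimate = centroid(triangle)   # centroid of the resulting triangle
overall = centroid(points)      # direct average of all 15 points

print("centroid of the triangle:", estimate)
print("average of all 15 points:", overall)
```

Because the three sets have the same size, the centroid of the triangle coincides with the plain average of all 15 points; the split into sets is only a convenience for working it out by hand on a map.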

If I am correct the raw centroid of N is at the lower Mekong:

2.- Possible origins of mtDNA N (blue flowers): A - 'raw' geometric centroid, B - corrected against directionality.

I have argued on occasion that, in order to compensate for the directionality of the expansion, a correction can be applied to the geometric centroid or raw estimate of the origin. This correction should pull the origin towards the parent node, in this case L3 in East Africa (estimated here). How much? Maybe 1/4, maybe 1/3... this step, even if probably very reasonable, is a guess and not rocket science. Here I chose to use 1/4 and then look for the closest coast, which is that of Bengal - alternatively I can use a crooked line that follows the geography and get the same result (even less ambiguously Bengal again).

Had I chosen a 1/3 value for the correction, it would fall in a more central part of India; with 1/5, surely in Burma. We can't be sure of where exactly that happened but we can be more than reasonably sure that it was between India and Cambodia. 
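The correction amounts to moving the raw centroid a fraction of the way along the straight line towards the parent node. Here is a sketch with very rough (longitude, latitude) placeholders for the lower Mekong and for East Africa; they are my own approximations, not the points on the map, and the straight line ignores both the globe's curvature and the step of looking for the closest coast.

```python
def pull_towards(origin, parent, fraction):
    """Move `origin` the given fraction of the way towards `parent`."""
    return tuple(o + fraction * (p - o) for o, p in zip(origin, parent))


# Rough placeholder coordinates (longitude, latitude), for illustration only.
raw_centroid = (105.0, 10.0)  # about the lower Mekong
parent_l3 = (38.0, 8.0)       # about East Africa

corrected = {
    label: pull_towards(raw_centroid, parent_l3, fraction)
    for label, fraction in (("1/5", 1 / 5), ("1/4", 1 / 4), ("1/3", 1 / 3))
}
for label, (lon, lat) in corrected.items():
    print("correction %s -> longitude %.1f, latitude %.1f" % (label, lon, lat))
```

The larger the fraction, the further west the estimate lands, which is why the choice between 1/5, 1/4 and 1/3 moves it between Burma, Bengal and central India.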

And nowhere else: not in West Asia, not in Altai... thanks for the suggestions but I have heard that before... many times... always without a single piece of evidence nor well-reasoned backing of any sort. 

The data says otherwise: around the Bay of Bengal or even further East maybe. 

Getting R into the picture

I have said before (and it is obvious to anyone interested in population genetics) that mtDNA R is peculiar. While it is not different phylogenetically from other subclades of N, which are separated by just one coding region mutation, its geographic distribution is very different, because R, like its mother N, is everywhere.

In order to show it more clearly, I drew approximate origins of all basal R-subclades (in lighter blue). The size of the circles follows the same logic as those of N above, representing only the distance from the mother node (R in this case, which means one step further downstream in relation to N), and hence a probable order of coalescence:

3.- Scatter of N (deep blue) and R (cyan) subhaplogroups. The flower indicates the possible common origin.

The scatter of R fits very curiously within that of N(xR). They do not overlap too much and it looks at first sight as if R could have pushed other N lineages around to the margins of the common expansion cloud. However this does not seem to happen with M, so maybe another explanation is needed, like undifferentiated N and R traveling together, mostly under the leadership of the latter and causing different founder effects in different locations.

Whatever the case it is worth a good meditation, because it is possible that both haplogroups (mother N and daughter R) coalesced in rapid succession in a single region (Bengal probably). 

December 9, 2011

Autosomal genetics of South Asia in the wider context

Estonian geneticist Mait Metspalu has in the past performed leading research on the genetic pool of South Asia, so crucial to understanding not just the subcontinental populations but all of Eurasia as a matter of fact. Again he and his team provide us with valuable material to understand this region and its wider continental context:

The authors added 142 samples from India to pre-existing catalogs and found that:
30% of SNPs found in Indian populations were not seen in HapMap populations and that compared to these populations (including Africans) some Indian populations displayed higher levels of genetic variation, whereas some others showed unexpectedly low diversity.
This reinforces the generally acknowledged notion that India hosts very large, albeit largely untapped, genetic diversity.

Nothing really new in the wider picture but it is always worth recalling the basics (principal component analysis of Eurasians):

Supp. Fig. 12

Supp. Fig. 2 (part)

The Pakistan-India (ANI-ASI) duality

It may get a bit more interesting when they analyze the equivalent of Reich's Ancestral North/South Indian components (ANI & ASI), around which much speculation (sometimes quite wild) has built up. 

These two components are apparent in the PC analysis (PC2 and PC4) but maybe more clearly in the ADMIXTURE cluster analysis. The authors decided to use K=8 where I would have used K=13 (preferred by the combination of both check algorithms shown at Supp. Fig. 4 b and c) but the result is only different (for this purpose) in the inclusion or not of Caucasian populations in the ANI-equivalent component (k5 in the maps below).

Iranians are always included, as are Central Asians, although less emphatically at K=13 than at K=8, as the affinity splits between the Baloch (ANI) component and the Caucasus-specific one. However Russians do not show any Caucasus-specific affinity and show instead strong influence of the ANI component, which seems to correlate well with Y-DNA R1a, especially once the Caucasus affinity is detached at K=13.

Whatever the case at K=8:

The authors do in fact make an effort to discern whether the Baloch-ANI component could represent the much discussed Indo-European (or Aryan) invasion (hardly doubted in the linguistic plane but not clearly supported in the genetic one). They conclude however that the arrival of the ANI component in South Asia should be much older, at least 12,500 years old, that is: clearly pre-Neolithic - and in any case not related to the Indo-Aryan invasion.

Barely outlined South Asian internal structure

It is interesting that at deeper K levels (K=18) a Gujarat-centered component (middle green), distinct from the two mentioned so far, appears and takes a dominant role in most populations, particularly displacing the Baloch (light green) component:

Cut from Supp. Fig. 4a
I would like to encourage transcending the limitations of the chosen K=8 level of analysis and diving into the K=18 analysis found in the Supplemental Figures' PDF (fig. 4). As said before, the optimal level of analysis seems to be K=13 or maybe K=12, rather than the chosen one of K=8. Above K=10 in any case. However many of the improvements of greater resolution take place outside of South Asia, so for most purposes there is no difference (other than the inclusion or exclusion of the Caucasus' populations in the ANI bloc).

Something else that I miss here is a regional, South Asian specific (maybe with the inclusion of some West Asian and SE Asian controls), analysis. It may have offered interesting insights but it is just outlined, with just four South-Asian-specific components at K=18: more than enough for the pan-Eurasian analysis but surely quite limited to discern the details of population structure in South Asia alone. 

Diabetes-related allele

One of the most specific findings of this survey is the detection of a group of alleles (at genes DOK5, CLOCK) that have apparently been selected for in South Asians but that have become harmful as diet and lifestyles change today, favoring type 2 diabetes.

December 2, 2011

Problems demonstrating positive selection

This is a very interesting paper not because it offers any striking new discovery but because it casts doubt on previous ones and highlights the difficulties in demonstrating beyond reasonable doubt the functional selection in genes allegedly involved in pigmentation, which is one of the clearest differential adaptations among humans.

Importantly, most of the discoveries challenged were made by the same team, which is an outstanding example of scientific commitment and self-criticism. Casting doubts and reducing certainty may not be what wins a Nobel Prize but it is in fact how Science advances.

Abstract (provisional)


Numerous genome-wide scans conducted by genotyping previously-ascertained single nucleotide polymorphisms (SNPs) have provided candidate signatures of positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequence data. Four genes, OCA2, TYRP1, DCT and KITLG, implicated in human skin color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7-6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians and South Asians.


Applying all commonly-used allele frequency distribution neutrality test statistics to the newly generated sequence data provided conflicting results in respect of evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. The application of Markov Chain Monte Carlo Approximate Bayesian Computation to these sequence data using a simple forward simulator revealed broad posterior distributions of the selective parameters for all four genes providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection as previously detected with sequence-based and some haplotype-based tests.


Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data. 

December 1, 2011

The Nubian techno-complex of Dhofar: yet more evidence for an early migration out-of-Africa via Arabia

Jeffrey Rose and colleagues gift us with a beautifully written and delightfully detailed open access study on a culture of the Middle Paleolithic of Arabia: the Nubian techno-complex of Dhofar: 

I strongly recommend reading this paper in full: it really deserves your attention.

The Nubian Complex: extension and origins

The Nubian techno-complex is a facies of the pan-African Middle Stone Age macro-culture (MSA for short), which is roughly equivalent in timeline to the Middle Paleolithic of Europe (and, as techno-culture, to Mousterian in this other context). It is a facies mostly concentrated in North Sudan and Upper Egypt (with the occasional Ethiopian site) and, as we now learn, in Dhofar (Oman).

Fig. 1 Nubian Complex occurrences

In Africa:
Late Nubian Complex assemblages have been found in stratigraphic succession overlying early Nubian Complex horizons at Sodmein Cave [11] and Taramsa Hill 1 [21] in Egypt; in both cases separated by a chronological hiatus. The early Nubian Complex roughly corresponds to early MIS 5, while numerical ages for the late Nubian Complex in northeast Africa fall in the latter half of MIS 5.
In Arabia:
For the time being, the apparent distribution of Nubian Levallois technology in Arabia is limited to the Nejd plateau and, perhaps, Hadramaut valley (Fig. 1). Archaeological surveys in central/northern Oman have not produced any evidence of Nubian Complex occupation [66], [68], nor have Nubian Complex occurrences yet been found in eastern [22], [69]–[71], central, or northern Arabia [72]–[74].
Fig. 10 Dhofar Nubian Complex' points
Note that the authors' concept of the Nejd plateau does not correspond with that of Wikipedia, as they are obviously talking about the sites in highland Dhofar and not anywhere in Saudi Arabia (see map below).

The authors express their expectation that eventually other sites will be found within drainage systems along the western coast and hinterlands of central Arabia, linking Nubia with South Arabia. However it is also possible, I'd say, that the actual link is via the Horn of Africa, especially as Arabia has been quite extensively combed in recent years.

The Nubian techno-complex in Sudan appears to have evolved locally:
Taking into account its distinct, regionally-specific characteristics, Marks [2] notes that the Nubian Complex has no exogenous source and, therefore, probably derives from a local Nilotic tradition rooted in the late Middle Pleistocene (~200–128 ka). This supposition is supported by the early Nubian Complex assemblage at Sai Island, northern Sudan, which overlies a Lupemban occupation layer dated to between ~180 and 150 ka.
The oldest known Lupemban culture is dated to c. 300 Ka ago in Kenya and Tanzania.

The authors reject the presence of Nubian Complex tools claimed in the past for the Levant (Levantine Mousterian) and Persian Gulf (Jebel Barakah).

Prior to this work:
The first hint of the Nubian Complex extending into southern Arabia was documented by Inizan and Ortlieb [31], who illustrate three cores from Wadi Muqqah in western Hadramaut, Yemen, with Nubian Type 1 and Type 2 technological features. More recently, Crassard [32] presents a handful of Levallois point cores exhibiting Nubian Type 1 preparation from Wadi Wa'shah, central Hadramaut, Yemen.

Time frame and ecology: the wet MIS 5

Marine Isotope Stage 5, the time frame of the Nubian Complex, is a warm period between c. 130 and 74 thousand years ago that corresponds very roughly with the Abbassia Pluvial, when the arid region of the Sahara and Arabia was considerably more welcoming.

Fig. 3 Dhofar ecological zones and place names mentioned in text.

MIS 5 is divided in the following substages (figures are Ka ago and may vary a bit depending on source):
  • MIS 5a - 84-74 (wet)
  • MIS 5b - 92-84 (?)
  • MIS 5c - 105-92 (wet)
  • MIS 5d - 115-105 (?)
  • MIS 5e - 130-115 (very wet and warm: Eemian interglacial)

MIS 5 was followed by MIS 4, a cold and dry period triggered by the Toba caldera explosion (supervolcano).

As regards Dhofar:
... the monsoon increased in intensity during three intervals within MIS 5. Among these humid episodes, the last interglacial (sub-stage 5e; 128–120 ka) appears to represent the most significant wet phase within the entire Late Pleistocene, with rainfall surpassing all subsequent pluvials [42], [43]. Later, less substantial humid episodes associated with sub-stages 5c (110–100 ka) and 5a (90–74 ka) are also attested to in the palaeoenvironmental record. Uncertainties remain concerning the extent to which the climate deteriorated in the intervening sub-stages 5d (120–110 ka) and 5b (100–90 ka).
The increased humidity provided water security to the whole region and is also correlated with plant and animal migration from Africa, which the authors think must almost necessarily have made humans participants in this overall biological outpouring.

Out of Africa: the alternative routes

Dhofar mountains in monsoon season
The authors discard the Levantine route because of the techno-cultural isolation of the Skhul-Qafzeh group.

They acknowledge the conceptual debt to population genetics for unveiling the probable Arabian route Out of Africa, with particular mention of Behar 2008, who points to the possibility (which I have re-elaborated by my own means but on his data) of mtDNA L3'4'6 (and I'd say also L0) having left very indicative remnants in the Arabian Peninsula. However they make unnecessary conceptual contortions in order to adapt archaeological knowledge to the molecular clock pseudo-science when it must be the other way around, if anything. No need.

In any case, and this is very important, they describe two different cultural groups in interglacial Arabia:
... we surmise that at least two technologically (hence culturally) differentiated groups were present at this time: Nubian Levallois in southern Arabia and centripetal preferential Levallois with bifacial tools in northern/eastern Arabia.

They also suggest that, after the arid MIS 4 parenthesis, South Arabia experienced another mildly wet period during MIS 3 (since c. 60 Ka ago), which would have enabled:
... north-south demographic exchange between ~60–50 ka. South Arabian populations may have spread to the north at this time, taking with them a Nubian-derived Levallois technology based on elongated point production struck from bidirectional Levallois cores, which is notably the hallmark of the Middle-Upper Palaeolithic transition in the Levant [105], [106].
But the whole Persian Gulf and Arabian Sea area, not to mention East Asia, remains to be fitted in (archaeologically speaking) if we are to understand this period's colonization of West Asia from the East (according to the genetic data).

See also:

In this blog:

In external sites:

Update (Jan 11): I have received a copy of a related paper dealing with the relations of Hadramaut tools in the context of global Levallois technique. It is however too technical and inconclusive for me to discuss separately. Yet I do not see it being published anywhere online (PPV or open source or whatever), so I am just uploading it online (for a year) so you can download and read it yourself: