April 10, 2012

Claim that Japanese are 60-72% Neolithic

Jomon clay head
An open access letter claims that modern Japanese have about 2/3 Neolithic ancestry (except Ryukyuans, who'd be 2/3 Paleolithic instead).

The explanation is however not really clear to me and, looking at their own data, I can't easily accept such conclusions:

Only the yellow component (at K=4), almost totally absent in the Ryukyuans and dominant among North Chinese and Koreans, the likely parent populations of the Yayoi farmers, can be taken as informative about the contribution of such immigrants to Japan. The exact proportions are not detailed anywhere in the letter, but the component seems to be c. 35% among Koreans (KR-KR) and Shanghai Chinese (CN-SH) and slightly above 20% among North Han (NHan, CHB).

By comparison, mainland Japanese (Japanese, JPT, JP-ML) show c. 10% in most cases. If the parent proto-Yayoi population were Koreans, then Japanese would have less than 1/3 Yayoi ancestry, while if the proto-Yayoi are equated to Northern Han instead, the result would be at most 50%.
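The arithmetic behind those two bounds can be sketched as follows. This is only a naive ratio, and it assumes the yellow component entered Japan exclusively through the Yayoi and that the assumed source population carried it at today's frequency. The percentages are the ones eyeballed above, not figures published in the letter, and `yayoi_upper_bound` is a name made up for this illustration.

```python
def yayoi_upper_bound(target_share, source_share):
    """Naive admixture estimate: share of the marker component in the
    target population divided by its share in the assumed source."""
    if source_share <= 0:
        raise ValueError("source_share must be positive")
    # The ratio cannot meaningfully exceed 100% ancestry.
    return min(target_share / source_share, 1.0)


# Approximate yellow-component shares read off the K=4 plot (assumptions).
japanese = 0.10
sources = {
    "Koreans (KR-KR)": 0.35,
    "North Han (NHan, CHB)": 0.20,  # "slightly above 20%", so this is a ceiling
}

for name, share in sources.items():
    print(f"{name}: {yayoi_upper_bound(japanese, share):.0%}")
```

With a Korean-like source the ratio comes out under 1/3, and with a North Han-like source it is about one half, which is where the two bounds come from.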

In the case of Ryukyuans, the Yayoi input would be negligible, almost zero. They'd be almost 100% Jomon, assuming this concept applies to the Ryukyu islands at all.

In truth I do not know what to think of this article, other than that it seems confusing and inconsistent with its own data.


  1. It is quite possible that the proto-Yayoi were admixed themselves. One can perhaps equate the roughly similar amounts of red and yellow components in N Chinese or Koreans to ancient people related to today's Sinitic and Altaic speakers, respectively. The S Chinese have much less of the yellow component as expected, while the Ryukyuans have none. The latter might have acquired the red component alone through gene flow from Taiwan or adjacent southern China.

  2. Components are affinity indexes, nothing else. In some clear-cut cases they do represent ancestral populations but not always.

    In this case I find the orange ("red") component (in K=4) to be a generic East Asian affinity component, unless you can reasonably argue that there was more migration from South China (where the component is strongest) to Japan than from North China/Korea.

    Of course, East Asian Neolithic does seem to have originated in South China c. 7000 BCE, but by the time farming arrived in Japan c. 300 BCE that was no longer relevant: all of China (except maybe the West and parts of Manchuria) was Neolithic by c. 6000 BCE, and Korea at least since c. 3500 BCE.

    The analysis would also likely benefit from greater depth in terms of K values. It is indeed rare that such a shallow level is best; normally the best levels are in the "teens" K values for what I have seen.

  3. "normally the best levels are in the "teens" K values for what I have seen."

    It depends on the nature of the dataset: its relative homogeneity or heterogeneity, the number of samples, the geographical spread and so forth. ADMIXTURE, for example, provides a cross-validation error procedure for choosing an appropriate K value, which I discussed here.

    STRUCTURE also has applicable methods for determining an optimized K value, which Tishkoff (2009) discusses in the supplemental material:

    “The maximum K value was determined on the basis of: (1) the K value at which the likelihood distribution reached a maximum and began to plateau or decrease; (2) high stability of clustering patterns between runs (the primary mode was observed in at least 60% of the 25 runs) and; (3) from the Kmax value at which Kmax + 1 no longer refines the clusters (i.e. Kmax + 1 no longer splits the cluster distinguished at Kmax).”
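    As a rough illustration of the cross-validation approach, here is a minimal sketch that picks the K with the lowest CV error from ADMIXTURE log output. It assumes the runs were made with the `--cv` option and that each log contains a line of the form `CV error (K=4): 0.53710`. The numbers in `sample_log` are invented for the example and do not come from the letter's dataset; the function names are likewise just for illustration.

```python
import re

# Matches lines such as "CV error (K=4): 0.53710" in ADMIXTURE --cv logs.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Return a {K: cv_error} mapping from concatenated log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Invented values, for illustration only.
sample_log = """\
CV error (K=2): 0.5510
CV error (K=3): 0.5402
CV error (K=4): 0.5371
CV error (K=5): 0.5389
"""

errors = parse_cv_errors(sample_log)
print(best_k(errors))
```

    In practice one would concatenate the logs of all runs (one per K) and also look at how flat the error curve is around the minimum, rather than trusting a single lowest value.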

    1. Indeed (and thanks for the clarification), but I have no means to check these for the present case. I'm speaking from other cases where I have seen explicit cross-validation data (where the best K was typically 13 and the like).

      Here the dataset is quasi-continental in character (it includes much of East Asia) so I would really expect it to still split meaningfully at higher K values.

      We could actually test it with the 1000 Genomes dataset and ADMIXTURE; I'm just not feeling like working these days.

