July 2, 2014

Sino-Basque is not for real

Unmistakable evidence: beret-wearing Chinese!
(humorously borrowed from Zubia-Qiao blog,
which is about real Basque-China relations)
Linguistic speculation haunts us and today I stumbled on this paper, which has an interesting introduction but ends up claiming the extremely unlikely Sino-Caucasian family (including Basque and what-not):

Murray Gell-Mann, Ilia Peiros & George Starostin, Distant Language Relationships: The Current Perspective. Available at academia.edu.

I admit I have been skeptical of the Sino-Caucasian hypothesis since I once tried to learn some Chinese and was surprised at how little this language actually resembles Basque. A random African or Australian language is probably no more different from Basque than Chinese is, or so I thought, without having performed until now any formal test of the hypothesis.

There are a lot of reasons: the general skepticism of most linguists but also the lack of any apparent archaeological or meaningful genetic relationship since maybe 60 Ka ago (or, if Sino-Tibetan is related to Amerind and other Native American languages, since c. 45 Ka ago at the latest).

But the hypothesis continues to have some currency and today I finally decided to test it following the Swadesh-100 method suggested in the paper. The result:

Sino-Tibetan/Basque/English Swadesh-100 comparison (open office ODT format, similar to Excel - if anyone has a problem, please ask and I will upload an Excel version of it). 

Conclusion: Basque is not more related to Sino-Tibetan (either Mandarin or Burmese) than English is. If anything, the opposite is true, although the low level of plausible cognates for both languages (5-7%) seems to be merely stochastic noise, or maybe in some cases wanderworts. Of course, the exact number of similar words (possible cognates) depends on how permissive one is, but the pattern is so similar for the three possible pairings that, if there is any relationship at all, it must include English and therefore Indoeuropean.
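For anyone who prefers a script to a spreadsheet, the tally behind those percentages can be sketched roughly as follows. This is only an illustration, not the procedure used for the spreadsheet: the function names, the pair labels "A", "B", "C" and the item identifiers are all made up, the flagged items are placeholders rather than my actual judgements, and deciding which words look alike remains a manual, subjective step.

```python
def similarity_percent(flagged, list_size=100, excluded=frozenset()):
    """Percentage of compared list items flagged as look-alikes for one pair.

    flagged   -- items judged (by hand) to be plausible cognates
    list_size -- number of items in the word list (100 for Swadesh-100)
    excluded  -- items left out of the comparison altogether, assumed to be
                 part of the list (e.g. nursery words)
    """
    excluded = set(excluded)
    counted = set(flagged) - excluded
    compared = list_size - len(excluded)
    if compared <= 0:
        raise ValueError("nothing left to compare after exclusions")
    return 100.0 * len(counted) / compared


def compare_pairings(flagged_by_pair, **kwargs):
    """Apply similarity_percent to every language pairing."""
    return {
        pair: similarity_percent(items, **kwargs)
        for pair, items in flagged_by_pair.items()
    }


# Placeholder judgements for three hypothetical languages A, B and C.
# These are NOT the values from the spreadsheet.
placeholder = {
    ("A", "B"): {"item_03", "item_17", "item_42", "item_58", "item_90"},
    ("A", "C"): {"item_03", "item_21", "item_42", "item_66", "item_71",
                 "item_90"},
    ("B", "C"): {"item_08", "item_17", "item_21", "item_35", "item_58",
                 "item_66", "item_99"},
}

rates = compare_pairings(placeholder)
for pair, pct in rates.items():
    print(f"{pair[0]}-{pair[1]}: {pct:.1f}% plausible cognates")

# If all pairings sit in the same narrow band, none of them stands out
# as "more related" than the others.
spread = max(rates.values()) - min(rates.values())
print(f"spread between pairings: {spread:.1f} percentage points")
```

The design is deliberately dumb: it only counts, so the result is exactly as good (or as permissive) as the hand-made judgements fed into it.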

Check it yourself, of course.


  1. Excellent original work! I've always been skeptical myself for many of the same reasons. But, your effort is as solid as any that have been done to investigate the issue and has excellent linguistic precedents.

    As you note, the time depth of any possible connection is simply too great. Simply put, if a macro-language family were older than 20,000 years, comparable data points tell us that there is no way we could observe its internal relationships from modern linguistic data or even the oldest available proto-languages. The oldest recognized language families have a time depth of relatedness for their living members of probably less than 8,000 years (making some educated guesses about Sino-Tibetan and Afro-Asiatic that are perhaps not universally shared).

    For comparison purposes, we know from genetic data and archaeology that essentially the entire population of pre-Columbian South Americans and Meso-Americans, and a supermajority share of North Americans derive from a first significant wave of modern human migration ca. 14kya +/- a few thousand years, that the first wave American founding population wasn't very big at all, and that it very likely spent a few thousand years isolated from the rest of the world in Beringia with subpopulations interacting with each other. It is possible that the first wave American founding population wasn't monolingual, but it is very doubtful that there were more than two to four languages in that population by the time it migrated from Beringia to North America, and the modal possibility is that there was just one language for that population. It is also likely that, if there were multiple languages in the first wave founding population of the Americas, they were part of the same language family. So, Greenberg's macro-linguistic Amerind family, which treats all non-Inuit, non-Na-Dene Native American languages as a single family deriving from a common source, is very likely to be true.

    But the amount of linguistic diversity that existed at the time that Columbus and subsequent European colonists arrived in the 15th century, about 14,500 years later, is so great that it is impossible from linguistic data alone to make any meaningful effort to reconstruct a proto-language. The Americas have 730 living indigenous languages in 135 language families (of which only 2 families, with 46 languages, are not Amerind macrofamily members), and South America alone probably had another 1150 indigenous languages in 1500 CE, grouped into at least 90 extinct families (treating isolates as language families of their own). http://washparkprophet.blogspot.com/2010/04/wasted.html

    Even with some bias in that particular discipline towards splitting, and with a lack of connections in some cases being due to insufficient scholarly resources to see connections that do exist, there is just no way to piece together a very meaningful linguistic family tree or make any meaningful assertions about an Amerind proto-language. Amerind languages are extremely diverse. I am not aware of any WALS category in which Amerind languages don't show the full global range of diversity (except perhaps the absence of clicks or labial-velars as full-fledged phonemes, which are restricted to African languages). Even if there was a common non-African language family at the time of Out of Africa, or even 50,000 years ago at the dawn of the UP, any trace of it would be long gone by now.

    1. "As you note, the time depth of any possible connection is simply too great."

      Actually that was not my point at all. My point is that English looks at least as "Sino-Tibetan" as Basque, so if Sino-Caucasian does exist it must include English and hence Indoeuropean. Time estimates are here, as in genetics, slippery guesstimates. For Gell-Mann et al., for example, <10% "cognates" (coincidences) still implies relatedness at some time older than 7 Ka. 10% in the 100-word Swadesh list is a mere 10 words (father and mother included, which are pseudo-cognate in almost all languages of Earth - nursery words which are reinvented regularly out of babies' babbling - I ignored them in my test, of course). My hypothesis is that <10% of "cognates" can be found in any two random languages and that this is therefore not a valid category. I believe that professional linguists have also demonstrated that time and again anyhow.

      "essentially the entire population of pre-Columbian South Americans and Meso-Americans, and a supermajority share of North Americans derive from a first significant wave of modern human migration ca. 14kya +/- a few thousand years"...

      Actually more like 17 Ka, maybe even older. There are several sites in North America dated to 17-16 Ka BP and one in South America to 15 Ka BP. Some archaeologists argue insistently for an older date (>20Ka) in some Brazilian site but it's viewed with skepticism for lack of other supportive evidence.

      Amerindian is a great example anyhow of a macro-family that must exist (on genetic and archaeological grounds) but is nearly invisible in linguistic parameters just because of far too long a divergence time. I fully agree on this.

    2. Yeah, I feel that this article becomes really uninformative when it starts throwing out dates.

      I do think that this sort of mass comparison may be able to identify a lot of areal features that may otherwise be hidden, but it fails at distinguishing between areal features and a genetic relationship, which is what this paper claims. Even some of the supposed connections between the Americas and Eurasia could be wanderworts. Microblade technology and archery managed to cross the Bering Strait so I don't think it's implausible that words did.

  2. After reading the article itself, a few more observations. You can't fault G. Starostin for sticking up for the pet ideas of his dad, who died too young, but the article estimates the time depth of Sino-Caucasian at 10kya, with Sino-Tibetan and Caucasian each at 6 kya, and Borean, which includes two of these "C" level families, at 15-20kya. Yet there is simply nothing in the archaeological record to support a Sino-Caucasian connection in the time frame from 10kya, when the two families are suggested to have split from a proto-language, to 6kya, before the splits within each family into its daughter languages. The Fertile Crescent Neolithic and the Chinese Neolithic were independent events involving the domestication of native wild plants and animals ca. 10kya to 9kya with no borrowed crops or technologies (with the possible exception of dog domestication which dates to ca. 30kya). There is no uniparental genetic overlap between populations of the Fertile Crescent Neolithic and the Chinese Neolithic with TMRCA ca. 10-6kya. The earliest record of trade and exchange of technologies between these regions dates to roughly the Bronze Age. Defining decorative styles like Venus figurines of early Neolithic Europe are absent from China and vice versa. East Asia developed pottery long before the Neolithic, while in the Fertile Crescent the Neolithic preceded the development of pottery. Early West Eurasian and early East Eurasian pottery styles, moreover, show no sign of cultural exchange.

    Greenberg's proposed macro-families have mostly been enduring, in part, because superficial comparisons that he did not himself rigorously establish had, or have since developed, plausible corroboration from physical anthropology, genetics, archaeology, geographic coherence and/or small sets of linguistic features so distinctive that these litmus tests were powerful because languages in other families didn't share them at all and the groupings produced by these litmus tests made sense.

    Starostin's macrolinguistic families, like Sino-Caucasian, lack any of these reassurances, but the kind of human migration necessary to produce mass language transmission in the pre-literate, pre-horse, pre-camel, pre-wheeled vehicle, pre-long range maritime travel world of 10-6 kya that lacked even multi-city regional political units, could not possibly have happened without leaving traces that archaeologists, population geneticists or others could see today.

    1. It's not my role to be anyone's personal psychoanalyst, much less if we have never met, so George Starostin's issues with his father's figure are not my concern. My concern is that certain most unlikely ideas, which can easily be proven wrong, are being pushed time and again, causing confusion on zero grounds.

      I'm not just unable to see any particular similitude between Basque and Sino-Tibetan but also between Basque and NE Caucasian (NW Caucasian is too difficult to test), certainly not any similitude that can't be equally attributed to Indoeuropean. I also did some testing in this regard in the past, you can find it here.

      "East Asia developed pottery long before the Neolithic, while in the Fertile Crescent the Neolithic preceded the development of pottery."

      Indeed. Kristiina mentioned some months ago that the oldest Western pottery belongs to proto-Uralic populations who probably brought the concept with them from the East. It's plausible that the Neolithic populations borrowed the concept from them somehow, although unclear.

      On the other hand Dolni Vestonice (Gravettian) has many broken terracotta figurines which can well be considered a proto-pottery of sorts, even if their purpose was apparently to "wrongly" cook them in order to make them explode in some sort of "ritual" primitive fireworks.

      I agree that there is zero likelihood of any Holocene language expansion across Siberia influencing either the West or the East beyond the Sub-Arctic climate specialization zone (taiga-plus). However wanderworts are perfectly possible and may well be a confounding factor. These wanderworts may be related to Indoeuropean, Uralic, Altaic and even the mysterious Megalithic flow Eastwards. And of course to Neolithic contacts.

    2. "I agree that there is zero likelihood of any Holocene language expansion across Siberia influencing either the West or the East beyond the Sub-Arctic climate specialization zone (taiga-plus)."

      Depending on the origin of Uralic languages (which I don't think is settled) Magyar would be a candidate for this.

    3. Magyar originated in the taiga or not far from it, being closely related to other Ugric languages. Obviously I meant the Uralic family in general when I said "the Sub-Arctic climate specialization zone (taiga-plus)". It is quite apparent that those latitudes were exceptional, although this does not mean that whatever happened in them did not, to some extent, extend beyond their core climatic area now and then. The presence of mtDNA C in Neolithic Ukraine is another and even better example of this exceptional "overextension".

  3. That article is not bad at all. It contains a nice summary of the work being done to compare world language families using the comparative phonological method. The inclusion of Basque in the Sino-Caucasian language family was only mentioned in the article and was not given much weight. I do not have a personal opinion on this, but if you want to judge Starostin's and his colleagues' work, go to this page and click the link "Sino-Caucasian etymology".

    I checked a few proposed cognate words, and noticed that there is more often a link between North Caucasian, Yeniseian and Basque and less often between Basque and Sino-Tibetan. IMO, these common words, if they are real, should belong to extinct North Eurasian languages rather than to any southern language strata.

    However, if you have time, you could add Tibetan in your comparison. My understanding is that Sino-Tibetan languages spread from north to south, so a comparison with a northern variety might be better than a comparison with a southern variety. I think that Amdo language would be the best choice but I do not know if Amdo Swadesh list is available on Internet.

    1. I know the website and I don't think it is useful in any sense because ONLY those families pre-determined in Starostian dogma to be in one or another "macro-family" are compared. So for example, the second word: abere (listed as *abele, which is purely theoretical and something I disagree with, but anyhow) is an obvious cognate of Sumerian "áb" (cow), being the most famous (and one of a few) Basque-Sumerian cognates, which led romantic revivalists to the simplistic assumption of Basque being related to Sumerian and what-not. But Sumerian is not there for comparison, only "proto-SC" bVɫV (100% based on "North Caucasian" *bü̆ɫV), which sounds a lot like "bull" (but English is not being compared either) and which actually means "bull" or "ram" and seldom "cattle" (generic; cow is "behi", bull is "zezen") as happens in Basque.

      So it's useless except as "holy text for the already convinced" or maybe casual browsing of the database, particularly for Caucasian vocabulary (but you must look for the actual words, not the "proto-words", often just junk).

      "... if you have time, you could add Tibetan in your comparison".

      I don't have time: I'm an amateur in dire poverty and linguistics is just a side interest to me. But it does outrage me that linguists with a salary waste not just their work time but ours on such nonsense.

      "My understanding is that Sino-Tibetan languages spread from north to south, so a comparison with a northern variety might be better than a comparison with a southern variety".

      I doubt that the N>S assumption is correct, more like from central inland China into both North and South directions probably. Whatever the case Mandarin is already a Northern ST language (the northernmost one in fact). I took one TB and one Sinitic language because ST is actually contested (plausible but not 100% certain) but it seems a waste of time to include more languages: if the result with two languages is so extremely negative, it is 99.99% certain that it will be the same with whatever other languages.

      "I do not know if Amdo Swadesh list is available on Internet."

      Probably not, or just fragmentarily so. Another reason for my choice of languages was that they are fully documented in the Wiktionary Swadesh lists' appendix.

  4. I think that this work is interesting in this respect: http://halshs.archives-ouvertes.fr/docs/00/10/43/11/PDF/2005_Festschrift_Chirkova_Baima.pdf

  5. I agree with you on many points. It is probably a waste of time to compare Basque and Amdo directly. Nevertheless, I do appreciate the Tower of Babel website and find it very useful. There is a huge amount of words in a nice package, available through different search options. I agree with you about "abere", and I have noticed that the proposed macrofamily cognates are often not so good, but the database is still useful, and many connections may be true, although in reality cognate words often overlap with words from other proposed language families.

  6. Victor Mair's journal published a bunch of work on cross-fertilization between Sino-Tibetan sphere and Indo-European world (mainly Iranian and Tocharian I think). Including key concepts like Dao, Tengri, magi, etc.


    Not Basque-Chinese, but later stuff. Btw, I've seen it claimed that the archetypally "non-western" religious form of the shaman at least got its name from India. Sramana, someone chanting Vedic formulas.

    Reverse influences from Tungusic cultures have also been pointed out by scholars of shamanism like Mircea Eliade. "World Tree" is basically Siberian and probably copied in parts of Northern Europe and Middle East. Tree of Knowledge, Irminsul, etc.

  7. Sometimes, I disagree with you, dear Maju.
    But on this issue, I can only agree with you, with an outpour of laughter. :)

