Phylogeny of Circumpolar Language Isolates

One of the most perplexing anomalies in uncovering our deep past is the incongruence between the population history suggested by physical anthropology (cranial, craniodental, molecular) on the one hand, and linguistic variation on the other. For a long time it was thought that populations gain and lose languages too rapidly for languages to contain a significant population history signal. But that turns out to be largely wrong. Recent population history as reflected in languages largely tracks that revealed by molecular anthropology. There are certainly cases where the population histories suggested by genes and tongues are incongruent. That happens when populations adopt the language of their rulers or elites. But such cases are exceptional. By and large, the population history of the Holocene, i.e., the past ten thousand years or so, is reflected well in the distribution of languages around the world. This is a period marked by explosive population growth and mass migrations triggered by the Neolithic and Secondary Products revolutions.

At the end of the Pleistocene, the world was loosely packed with tens if not hundreds of thousands of small populations that practiced a hunter-gatherer way of life. This was a highly differentiated and deconcentrated world, in the sense that the major linguistic groups accounted for a small fraction of the total world population of Homo sapiens. (Also, right around this time, the last of the archaic populations were dying off or being absorbed.) As a few populations settled down and domesticated plants and animals, their numbers began to grow at an unprecedented rate. Simply put, the fertility rate of settled agriculturalists and pastoralists was dramatically higher than that of hunter-gatherers everywhere, for the simple reason that hunter-gatherers had to keep moving whilst carrying their babies. Moreover, Neolithic populations could support dramatically greater population densities than hunter-gatherers: a single band of hunter-gatherers requires thousands of square kilometers to subsist. Furthermore, living as they did with animals, including domesticates and sundry commensals, Neolithic populations sported pathogen packages that were lethal to hunter-gatherers, who had little resistance to Neolithic diseases. What happened to the native populations of the New World in the Columbian era was no one-off. That story has been repeated many times over on every continent on the planet throughout our deep history.

So most of the contemporary world’s populations, as well as the populations found by rapacious Europeans in the ethnographic present (c. 1500), got where they were in the relatively recent past. What we find everywhere is considerable population stratification, with multiple layers of Neolithic populations having annihilated, absorbed, or pushed into marginal environments the autochthonous populations who had occupied the area during the Pleistocene. This is what happened across Eurasia and Africa during the Holocene; and, of course, this is what happened in the ethnographic present in Australia, New Zealand, and the Americas. Many profound implications follow from these discoveries that have a bearing on a lot of conversations in anthropology, prehistory, and paleoanthropology.

First, the big three, or five, so-called races were not in fact ‘continental races’ or allopatric subspecies as racialists had imagined. Rather they were small demes who happened to grow much more rapidly than others around them. The ‘Mongoloid’ Sino-Tibetan speakers emerged from their northern home to overrun much of eastern Eurasia; the ‘Negroid’ populations similarly displaced very many populations in sub-Saharan Africa; as did the ‘Caucasoids’ in western Eurasia. There were other, equally dramatic, population pulses before these in continental Africa and Eurasia, as well as elsewhere. Perhaps the most dramatic was the sea-borne expansion of the Austronesian speakers, who went as far west as Madagascar and as far south as New Zealand, besides reaching every habitable island in the vastness of the Pacific Ocean, which covers about a third of the planet’s surface.


Second, the population expansions following the Neolithic and Secondary Products revolutions dramatically polarized the world: whereas before it was populated by small populations, the largest populations increasingly came to predominate. Of the 102 language families in the Ruhlen database, the 2 biggest families (Sino-Tibetan and Indo-European) account for 84.1 percent of the world’s human population; the 5 biggest for 93.0 percent; and the 10 biggest for 99.0 percent. Put another way, more than 90 percent of the world’s language families account for 1 percent of the world’s population. Some 70 families have fewer than a hundred thousand speakers and are at risk of extinction; together their total share of the world population is 0.0089 percent.


Each of the language families emerged from a single language of a specific deme at the end of the Pleistocene. It is a fair bet that the vast majority of the languages spoken then have already vanished. Our world is considerably more homogeneous than the world at the end of the Pleistocene.


There is, in fact, an easy way to see the vanished world: the populations that were erased from the face of the earth by the relentless expansion of Neolithic peoples. We simply look at the worldwide distribution of languages that belong to families with fewer than a million speakers. See next figure. The great erasure is visible in the blankness of most of Eurasia and Africa. Most of the smaller families survive in Sahul (Australia and Papua New Guinea), the Americas, and the wilderness of the Kalahari desert in Africa. Otherwise, we find one isolate in Europe (Basque), one in Tibet, one on the Andaman Islands, and a small number in the circumpolar region.


For my project with Chomsky, I need to understand the phylogeny of the language families. This is very hard for two reasons. First, as discussed above, we need to rewind the clock, so to speak, and undo the Neolithic population pulses to get at the world as it was at the end of the Pleistocene. It simply doesn’t cut it to relocate the big families at their points of origin, although we have been able to identify those by now. We must imagine a world teeming with perhaps thousands, if not tens of thousands, of linguistic groups populating the vast blank spaces we see in the world map above. Our world is haunted by the ghosts of vanished populations.

Second, there is a more technical challenge in phylogenetic systematics. Put simply, the challenge is that the number of phylogenetic trees grows super-exponentially with the number of taxa (here, languages or language families). Giddy Landan and others recently published a new algorithm for rooting phylogenetic trees. Among other very valuable advice, for which I am very grateful, he suggested that the instability I was finding was probably due to the large number of taxa I was trying to work with simultaneously. Despite my doctorate in mathematics, I had failed to grasp the full magnitude of the challenge. The number of possible trees grows very, very fast. Counting rooted trees with labeled tips, multifurcations included, there are 236 possible trees with 5 taxa; with 10, there are 282 million; with 20, we have more than 20 orders of magnitude to consider. Whatever the algorithms are doing, they cannot possibly locate the global optimum in that gargantuan possibility space.

Source: Felsenstein (1983), Statistical Inference of Phylogenies.
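To make the growth concrete, here is a minimal sketch that reproduces the tree counts quoted above. It assumes the figures refer to rooted trees with labeled tips in which multifurcations are allowed, and it uses Felsenstein's recurrence on the number of interior nodes; the function names are my own, chosen for illustration. The strictly bifurcating count, (2n-3)!!, is included for comparison.

```python
def count_all_rooted_trees(n_tips):
    """Count rooted trees on n labeled tips, multifurcations allowed.

    Uses the recurrence T(n, m) = (n + m - 2) * T(n-1, m-1) + m * T(n-1, m),
    where m is the number of interior nodes, starting from T(2, 1) = 1.
    """
    if n_tips < 2:
        return 1
    row = {1: 1}  # two tips: a single interior node (the root)
    for n in range(3, n_tips + 1):
        new_row = {}
        for m in range(1, n):
            new_row[m] = (n + m - 2) * row.get(m - 1, 0) + m * row.get(m, 0)
        row = new_row
    return sum(row.values())


def count_rooted_bifurcating(n_tips):
    """Count rooted, strictly bifurcating trees on n labeled tips: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_all_rooted_trees(n), count_rooted_bifurcating(n))
```

For 5 and 10 taxa the first function returns 236 and 282,137,824, the figures quoted in the text. Even the smaller bifurcating-only count passes twenty orders of magnitude by 20 taxa, which is why exhaustive search is out of the question.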

What all this means is that I need to (a) pay much more attention to the computational challenge, and (b) pursue a more piecemeal strategy. If we can’t solve the problem in one go, perhaps we can defeat the enemy in detail. With that in mind, I want to take a first pass at understanding the stability properties of the algorithms at my disposal. We make a first pass by looking at the phylogeny of circumpolar language isolates.

Strictly speaking, not all of these are language isolates in the technical sense. But, of course, relatively speaking they are. The most widely spoken of these sixteen languages has 47,800 speakers. See next table.


As the name suggests, these circumpolar populations are located around the Arctic Circle. The Saami, who used to be called Lapps, live in Scandinavia. European anthropologists were fascinated by them. We want to understand how they relate to other circumpolar peoples. The Yeniseian are located in the middle of northern Siberia (see map below), while the Yukaghir are an isolate in northeastern Siberia. Speakers of the Chukotko-Kamchatkan family are also located there. Eskimo-Aleut speakers can be found at either end of the North American continent, as well as in the very northeastern extremity of Eurasia and in Greenland.


In order to test the algorithms, I looked at the phylogenies obtained separately for the Chukotko-Kamchatkan and Eskimo-Aleut families. Then I used the same algorithm on all the circumpolar isolates together and checked that the relevant subtrees were identical to the ones obtained separately. It does indeed check out. I use the cosine metric to obtain pairwise distances between the languages, and then use the Saitou and Nei (1987) sequential neighbor joining algorithm to obtain unrooted trees. Finally, I use the MAD algorithm of Landan et al. (2019) to root the trees. We can see that the procedure is able to cleanly separate the four families. Moreover, the branching order makes perfect sense in light of the circumpolar geography: as we go from top to bottom, we are going in the easterly direction.

[Figure: circumpolar_phylogeny.png]
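For readers who want to see what the neighbor joining step does, here is a minimal sketch in pure Python. It is not the code used for the figure: it returns only the topology as nested tuples, omits branch lengths, and does not attempt MAD rooting. The function names and the five-taxon toy distance matrix are assumptions made for illustration.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree (nested tuples) from a symmetric distance matrix.

    names: list of distinct leaf labels; dist: square list of lists.
    Follows the Saitou-Nei procedure with the usual Q criterion.
    """
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = (a, b)  # the joined pair becomes a new internal node
        d[new] = {new: 0.0}
        for c in nodes:
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)


def leaf_sets(tree):
    """Return the set of leaves found under every internal node."""
    found = []

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


if __name__ == "__main__":
    # Hypothetical toy distances, not data from this post.
    toy_names = ["a", "b", "c", "d", "e"]
    toy_dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(neighbor_joining(toy_names, toy_dist))
```

The subtree check described above amounts to comparing the leaf sets recovered for a family on its own with the leaf sets recovered when all the taxa are run together.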

The separability of the families and the intelligibility of the branching order give us good confidence in our procedure. We can also formally test whether our choice of metric makes sense. This is important because Creanza et al. use the Hamming metric, which we found to be inferior in all our investigations. There is also a theoretical reason to be skeptical of the Hamming metric, since it treats languages with different phoneme inventory sizes unevenly. In contrast, the cosine metric, which is equivalent to the size of the shared phoneme inventory relative to the inventory sizes, is naturally standardized.
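The difference between the two metrics is easy to see on a toy case. The sketch below assumes each language is represented as a set of phonemes (equivalently, a binary presence/absence vector), so that cosine similarity is the number of shared phonemes divided by the geometric mean of the two inventory sizes. The inventories and function names are made up for illustration and are not taken from the data used here.

```python
from math import sqrt


def cosine_distance(inv_a, inv_b):
    """One minus cosine similarity of two binary phoneme inventories."""
    a, b = set(inv_a), set(inv_b)
    if not a or not b:
        raise ValueError("inventories must be non-empty")
    return 1.0 - len(a & b) / sqrt(len(a) * len(b))


def hamming_distance(inv_a, inv_b):
    """Number of phonemes present in exactly one of the two inventories."""
    return len(set(inv_a) ^ set(inv_b))


if __name__ == "__main__":
    # Hypothetical inventories: a small one, and a large one that contains it.
    small = {"p", "t", "k", "a", "i", "u"}
    large = small | {"x%d" % i for i in range(18)}
    # A third inventory of the same small size with nothing in common.
    disjoint = {"b", "d", "g", "e", "o", "y"}

    print(hamming_distance(small, large), cosine_distance(small, large))
    print(hamming_distance(small, disjoint), cosine_distance(small, disjoint))
```

In this toy case the Hamming count puts the small inventory farther from the large inventory that contains it (18) than from an inventory with which it shares nothing (12), purely because of the size difference. The cosine distance orders them the other way (0.5 against 1.0).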

In order to formally test the performance of the two metrics, we use the Mantel test. That is, we test whether phonemic distance is correlated with geodesic distance for our two metrics. We compute geodesic distance via the haversine formula. We find that, whether we use Pearson’s or Spearman’s coefficient, the null is rejected at the 5 percent level by the Mantel test for the cosine metric but not for the Hamming metric. Again, this gives us good confidence that our procedure is not producing spurious results.

Mantel tests: phonemic vs. geodesic distance.

Metric     Pearson r   P         Spearman rho   P
Cosine     0.4380      0.0043    0.4260         0.0044
Hamming    0.2875      0.0757    0.3157         0.0528
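For reference, the two ingredients of this test can be sketched in a few lines of pure Python. This is an illustrative version, not the code that produced the table: it assumes a one-sided permutation test with Pearson's coefficient, a spherical Earth of radius 6371 km, and function names of my own choosing. A Spearman variant would rank-transform the pairwise distances before correlating them.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabeling rows/columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel(dist_a, dist_b, n_perm=999, seed=0):
    """One-sided Mantel test: returns (observed r, permutation p-value)."""
    identity = list(range(len(dist_a)))
    flat_b = upper_triangle(dist_b, identity)
    r_obs = pearson(upper_triangle(dist_a, identity), flat_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute the labels of one matrix only
        if pearson(upper_triangle(dist_a, perm), flat_b) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)
```

The labels of one matrix are permuted jointly across rows and columns because the entries of a distance matrix are not independent observations, so an ordinary significance test on the correlation would not be valid.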

There is a lot more to be done on this before I can take a serious crack at the question I really want to investigate. I’ll report back when I have more work to show.


