# Phonemic Phylograms for Subaltern Languages

The picture of population history that emerges from physical anthropology (molecular, craniometric, craniodental, etc.) is consistent with an Out-of-Africa model of the ethnographic and historic present. All non-African populations descend from anatomically modern (Homo sapiens sensu stricto) founder populations that dispersed from Africa 130-40ka. The evidence from paleontology (fossil molecular, craniometric, craniodental, etc.) complicates this picture. There were evidently successful interspecific families (if Sapiens, Neanderthals, Denisovans, etc., are regarded as species) or interracial families (if they are regarded as allopatric subspecies): the DNA of non-African people can only contain sequences acquired from them if their issue survived to procreate. Put another way, not all lineages of the other taxa encountered by Sapiens can have vanished without issue, for otherwise their signature could not still be found in contemporary populations. [So these were at best allospecies.]

The picture is complicated further when we look at the archaeological evidence. The evidence is inconsistent with Out-of-Africa in a thick sense, namely that Sapiens replaced Neanderthals sensu lato because they were behaviorally modern and the latter were not. There is no evidence of modern behavior in the assemblages associated with anatomically modern humans within and without Africa for tens of thousands of years after the “speciation” of Sapiens and the subsequent Out-of-Africa dispersals. The onset of modern behavior is late, staggered and impossible to reconcile with the simple Human Revolution story laid out by Klein, Mellars, Stringer, Gamble, Tattersall and others from the late 1980s onwards. (Although Gamble seems to have since changed his mind.)

A further source of evidence is linguistic. Cavalli-Sforza and others began to show in the 1990s that linguistic phylograms resembled those derived from physical anthropology. Atkinson remade the case in the last decade. That case has since been debunked by, among others, Creanza et al. (2015). Phylograms extracted from linguistic data are not consistent with phylograms obtained from physical anthropology. Why should that be? Something very interesting is going on with this disconnect.

The problem with phonemic data is not that the population history signal is confounded by unstable rates of innovation, as we shall see. The problem is rather that the Holocene Filter confounds the Pleistocene population history signal. Most people on the planet today speak languages in a small handful of families (Indo-European, Sino-Tibetan, Bantu, Nilo-Saharan, Dravidian, Austronesian, etc.) that underwent massive expansions in geographic range and population size during the Holocene. The ethnogenesis of these families was triggered by the agricultural (9ka) and pastoral (5ka) revolutions. Contemporary populations in Eurasia, Africa and elsewhere, as well as those of the ethnographic present, are descended from very recent migrants (the men more so than the women since the migrations were always sex-biased) superimposed over yet older strata of Holocene expanders. Underneath these massive boulders are ancient populations of hunter-gatherers like the San, the Andamanese and thousands of others who survive as isolates in deserts, on islands, in dense forests and mountain redoubts, and suchlike. If we are interested in Pleistocene history, we need to isolate and study the substratum of hunter-gatherer populations of the ethnographic present. Here we make such an attempt.

In order to recover Pleistocene population history we have to figure out a way of controlling for the Holocene filter. I think there is a simple method that can work if we have a large enough phonemic database. Populations (roughly identified as the ancestors of the speakers of language families) that underwent Holocene expansions can be expected to be large today for that very reason. This means that if we throw out all the big language families, our sample will become more representative of the ancient substrata.

We examine the phonemic data collated by Creanza et al. (2015) from the Ethnologue Database. After throwing out language families with speaker populations larger than that of the Hmong-Mien family (who were largely overrun and driven out of China and into the shatter-belt of Indochina by the Han) and New World populations (who are known to have reached the New World after the Last Glacial Maximum), we obtain a sample of 103 languages that linguists have classified into 19 families. These have a good claim to be direct descendants of languages spoken by Pleistocene populations in the Old World and Sahul before the Mesolithic (New World) and Holocene expansions. Figure 1 displays the latitude and longitude of these languages.
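The filtering step can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Language` record, the `filter_substratum` helper, the family names and the speaker counts are all hypothetical placeholders, not the actual database fields or figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    iso: str         # placeholder code, not a real ISO 639-3 entry
    family: str      # placeholder family label
    new_world: bool  # True if the language is spoken in the Americas


def filter_substratum(languages, family_speakers, threshold_family):
    """Drop New World languages and families larger than the threshold family."""
    cutoff = family_speakers[threshold_family]
    return [
        lang
        for lang in languages
        if not lang.new_world and family_speakers[lang.family] <= cutoff
    ]


# Made-up speaker counts per family, for illustration only.
family_speakers = {
    "BigExpander": 1_000_000,
    "ThresholdFamily": 10_000,  # plays the role of the cutoff family
    "SmallIsolate": 500,
    "NewWorldSmall": 300,
}
languages = [
    Language("aaa", "BigExpander", False),
    Language("bbb", "ThresholdFamily", False),
    Language("ccc", "SmallIsolate", False),
    Language("ddd", "NewWorldSmall", True),
]

kept = filter_substratum(languages, family_speakers, "ThresholdFamily")
print([lang.iso for lang in kept])
```

The cutoff family itself is kept, since only families larger than it are discarded.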

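The dataset consists of Boolean variables denoting the presence or absence of 728 phonemes (vowels and consonants). Phonemic distance between two languages is the number of phonemes present in one inventory but absent from the other (the Hamming metric), so languages that share more of their inventory are closer.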
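As a minimal sketch of that distance, assuming each language is stored as an equal-length list of 0/1 flags (the five-phoneme toy inventories and the function names below are illustrative, not taken from the database):

```python
def hamming(a, b):
    """Number of positions at which two presence/absence vectors differ."""
    if len(a) != len(b):
        raise ValueError("inventories must cover the same phoneme list")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(inventories):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(inventories)
    return [[hamming(inventories[i], inventories[j]) for j in range(n)]
            for i in range(n)]


# Toy inventories over five hypothetical phonemes.
toy = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
D = distance_matrix(toy)
print(D)
```

Note that joint absences count as agreement under this metric, which matters when most of the 728 columns are zero for any given language.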

We begin by testing whether isolation-by-distance explains pairwise phonemic distance. The appropriate test is a Mantel test comparing geographic distance (computed by the Haversine formula using known waypoints) and phonemic distance. We obtain a robust test statistic equal to 0.585. The probability of observing this value by chance is less than one in a million ($p<10^{-6}$). At the level of language phyla (Creanza et al. report the highest language phylum for each language, not family per se), the Mantel test statistic is a still robust 0.350 and statistically significant ($p=0.041$). So phonemic distance displays the same isolation-by-distance as physical anthropology distances. We can thus be confident that phonemic distance contains a population history signal.
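For readers who want to try the test, here is a rough sketch of a Mantel permutation test on top of Haversine distances. It makes several assumptions that may not match the analysis above: distances are direct great-circle arcs rather than routes through waypoints, the statistic is a plain Pearson correlation, the p-value is one-sided, and the four sites on the equator are toy data.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(m, order):
    """Flatten the upper triangle of m after relabelling rows/columns by order."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): observed correlation and one-sided permutation p-value."""
    identity = list(range(len(a)))
    b_flat = upper_triangle(b, identity)
    observed = pearson(upper_triangle(a, identity), b_flat)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute the labels of matrix a only
        if pearson(upper_triangle(a, perm), b_flat) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: four sites on the equator, phonemic distance proportional to geography.
sites = [(0.0, 0.0), (0.0, 10.0), (0.0, 20.0), (0.0, 30.0)]
geo = [[haversine_km(*p, *q) for q in sites] for p in sites]
phon = [[0.01 * g for g in row] for row in geo]

r, p = mantel(geo, phon)
print(round(r, 3), p)
```

With only four sites there are just 24 possible relabellings, so the p-value here is not informative; the smallest p-value a permutation test can report is 1/(permutations + 1).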

Figure 2 presents the phylogram (lineage tree) obtained at the level of language phyla as the languages are classified today. We derive the phylogenetic tree using a sequential neighbor-joining algorithm.
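For illustration, here is a compact sketch of the standard neighbor-joining procedure (Saitou and Nei), which may differ in detail from the implementation used for the figures. The function name, the Newick output format and the five-taxon toy matrix are choices made for this sketch; ties are broken by taking the first minimal pair.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining; return it as a Newick string."""
    nodes = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        totals = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d_new = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[new][c] = d_new
                d[c][new] = d_new
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    # The final edge is written on the first subtree by convention.
    return f"({a}:{d[a][b]:g},{b});"


# Toy additive distance matrix over five hypothetical taxa.
taxa = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
newick = neighbor_joining(taxa, toy_dist)
print(newick)
```

Because the toy matrix is additive, path lengths in the resulting tree reproduce the input distances; real phonemic distances will not be additive, so the tree is only an approximation of them.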

What stands out is the isolation of the Hmong, the Khoisan and the Indian Aborigines (whose location places them in Jim Corbett National Park in the foothills of the Himalayas). Interestingly, the Australians, with their unusual languages (they don’t have word order), are placed next to the northeastern Siberians, the Chukotko-Kamchatkan language family. Andamanese is closest to West Papuan, suggesting that this subaltern population was once widespread across the southern dispersal route.

Can we recover the language classification (i.e., the “families”) by looking at phonemic distance at the level of languages? Figure 3 displays the language-level phylogram. For each language, we display the family, the ISO code for the language, and the latitude and longitude in parentheses.

The first and most reassuring thing to note is that the language families as identified by linguists are largely placed together. The Australians have a complicated structure but they are classified together (the bottom half of the phylogram above the Hmong). Ditto the Hmong and the Khoisan. The fact that recognized phyla are generally classified together is very strong evidence of a population history signal in phonemic data.

The big anomaly is the classification of New Guinea languages. Unlike Australian, Khoisan, and Hmong languages, the Trans-New Guinea phylum does not cluster together. Rather, there seem to be multiple meaningful clusters of languages that have been classified as falling within the Trans-New Guinea phylum. This is quite possibly due to the fact that New Guinea served as the shatter-belt par excellence in our deep history. What this phylogram suggests is that scholars may have misclassified New Guinea languages, in particular by not recognizing enough language families on this extraordinarily diverse island.

In Figure 4 we have folded up the tree to reveal the underlying large-scale relationships together with the great anomalies. The collapsed subtrees and branch nodes are labeled in blue. Those that are not labeled “Branch #” are subtrees with many languages in the same family underneath. The anomalies are interesting. Why is the language of Indian aborigines close to Khoisan languages in Africa? Why is one Andamanese language classified with Sahul languages and another with Basque in Europe and the Hmong in China? Even more intriguing is the phylogenetic affinity (surely spurious) between the Hmong and Australian families. So some of these anomalies are probably random. But others may correspond to actual population history. Recall that the speakers of the predecessors of these languages probably occupied much larger areas than they do at present. The most striking feature, of course, is the phonemic incoherence of the languages folded into the Trans-New Guinea phylum. Above all else, what it attests to is the sheer variation in New Guinea. Could the island have served as a shatter-belt in the Pleistocene?

This is a work in progress. My goal is to integrate the information from physical anthropology, paleontology, archaeology and linguistics to tell a more compelling history of the Pleistocene than has so far been on offer. Bear with me.

## 6 thoughts on “Phonemic Phylograms for Subaltern Languages”

1. Some of the visualizations would baffle even the smartest people. The biggest problem with the analysis is a lack of conceptual clarity to begin with–that is, the ‘phoneme’. Begging the question like this does no good to any argument you might want to make. What if the phoneme doesn’t exist? At best it is just an abstraction from formal linguistics for creating sets of ‘related sounds’, with the relationships often being dubious and arbitrary. Then what have you just done here? Spent a lot of time using a non-existent psycholinguistic object as the basis for an elaborate argument about unobtainable historical linguistic phenomena–at best.

2. Phoneme is just a fancy name for consonants and vowels. I don’t even know what it means to suggest that they don’t exist. The database is the result of the hard work of generations of linguists who have compiled all the consonants and vowels in more than 2000 languages. Pairwise distances in this space contain a very strong population history signal, as is obvious from the Mantel test statistic. See the excellent work at https://www.pnas.org/content/112/5/1265.

1. If that is the case, why don’t you just call them vowels and consonants? But then I might call you out and ask, by what criteria do you determine that?

The problem is you are counting and analyzing sound categories that might simply be the fictions of the compilers. Also, it should be obvious you are not talking about actual sounds, actual vowels or consonants. Instead, you are trying to refer to some sort of controlling mental category that belongs to the speakers of a given language. In which case, it is important to remember that the categories of linguists are just models, and often the models are totally off.

I liked your initial analysis a lot. But the stats look like complete nonsense. And the sad thing is, I think that is where you think you nailed it.

1. It is not fiction but fact that there are 44 phonemes in English. These things are extraordinarily stable. There may be observer errors for rare languages, as when an incompetent linguist misunderstands the language she is studying. But these have to be vanishingly rare since the entire database is based on peer-reviewed work. It can’t be so easily dismissed. Tell you what, I’ll look at the Indo-European family in the next dispatch. There it’ll be easier to visually check that the quantitative approach is not giving us meaningless results.

2. Hope you get a chance to see the new post on Indo-European phylogeny. This really works!

3. Ouch. I didn’t nail anything! This is an initial exploration. A lot more work is obviously required to infer Pleistocene population history!