Balance Sheet Capacity and the Price of Crude

I’ve written before about the macrofinancial importance of broker-dealers (a.k.a. Wall Street banks). I emphasized the key role played by dealers in the so-called shadow banking system and have shown that fluctuations in balance sheet capacity explain the cross-section of stock excess returns. I have also argued for a monetary-financial explanation of the commodities rout. In this post, I will show that fluctuations in dealer balance sheet capacity also explain fluctuations in the price of crude.

The evidence can be read off Figure 1. Recessions are shown as dark bands. The top-left plot shows the real price of crude for reference. The spikes in the 1970s correspond to the oil price shocks in 1973 and 1979. Note the price collapse in 1986 and the price shock that attended the Iraqi occupation of Kuwait (the spike in the 1990 recession). Note also the extraordinary run-up in the price of crude during the 2000s boom and the return of China-driven triple digit prices after the great recession. Finally, note the dramatic oil price collapse in 2014 due to the US fracking revolution. We know that much of the fluctuation in the oil price was a result of geopolitical, supply-side and exogenous demand-side factors. My claim is that much of the rest is driven by the excess elasticity of the financial intermediary sector.


Figure 1. Source: Haver Analytics, author’s calculations.

Specifically, I show that fluctuations in the balance sheet capacity of US securities broker-dealers predict fluctuations in the oil price. We define balance sheet capacity as the log of the ratio of aggregate financial assets of broker-dealers to the aggregate financial assets of US households. We stochastically detrend the quarterly series by subtracting the trailing 4-quarter moving average from the original series. The plot on the top-right displays the stochastically detrended balance sheet capacity. We will show that it predicts 1-quarter ahead excess returns on crude.
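To make the construction concrete, here is a minimal sketch of the capacity measure and the detrending step. The function names and the toy numbers are illustrative assumptions, not the actual data or code behind Figure 1. In particular, I assume the trailing 4-quarter window includes the current quarter, which the description above leaves open.

```python
import math

def balance_sheet_capacity(dealer_assets, household_assets):
    """Log of the ratio of dealer to household financial assets, per quarter."""
    return [math.log(d / h) for d, h in zip(dealer_assets, household_assets)]

def detrend(series, window=4):
    """Subtract the trailing moving average from each observation.

    Assumption: the trailing window ends at (and includes) the current quarter.
    Quarters without a full window are returned as None.
    """
    out = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)
        else:
            trailing = series[t + 1 - window : t + 1]
            out.append(series[t] - sum(trailing) / window)
    return out

# Made-up numbers purely for illustration (not the Haver data).
dealers = [1.0, 2.0, 4.0, 8.0, 16.0]
households = [1.0, 1.0, 1.0, 1.0, 1.0]
shocks = detrend(balance_sheet_capacity(dealers, households))
```

The first three quarters have no full window and come back as None; from the fourth quarter on, each value is the deviation of log capacity from its own recent average.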

We run 30-quarter rolling regressions of the form,

{R^{crude}_{t+1}=\alpha+\beta\times capacity_{t}+\varepsilon_{t+1}}, \qquad (1)

where {R^{crude}_{t+1}} is the return on Brent in quarter {t+1} in excess of the risk-free rate and {capacity_{t}} is the shock to balance sheet capacity in quarter {t}. We must take care in interpreting rolling regressions because, instead of the two parameters suggested by equation (1), we are in effect running 183 regressions, each with its own parameters.
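A rolling version of equation (1) can be sketched as follows. This is a plain least-squares illustration under my own naming (`ols`, `rolling_predictive_ols`), not the code used for the figures, and the input series are synthetic.

```python
def ols(x, y):
    """Simple least squares of y on x; returns (alpha, beta, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    ss_res = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return alpha, beta, 1.0 - ss_res / ss_tot

def rolling_predictive_ols(capacity, excess_return, window=30):
    """Regress excess_return[t+1] on capacity[t] over each rolling window."""
    x = capacity[:-1]        # predictor observed in quarter t
    y = excess_return[1:]    # return realized in quarter t+1
    return [ols(x[end - window:end], y[end - window:end])
            for end in range(window, len(x) + 1)]

# Synthetic data in which next-quarter returns equal 2 - 3 * capacity exactly.
capacity = [(i % 7) / 10 for i in range(40)]
excess_return = [0.0] + [2 - 3 * c for c in capacity[:-1]]
fits = rolling_predictive_ols(capacity, excess_return)
```

Each element of `fits` is one window's intercept, slope and share of variation explained, which is why the results have to be read as a sequence of separate regressions rather than as one estimate.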

The plot on the bottom right displays the percentage of variation explained in each predictive regression. We see that balance sheet capacity became a significant predictor of the price of crude in the mid-1980s. Its predictive capability diminished in the mid-1990s, before reaching new heights in the 2000s. The period 1999-2007 was the heyday of financially-driven fluctuations in the price of crude. That relationship collapsed in the second quarter of 2007. During the financial crisis and the period of postcrisis financial repression, the relationship disappeared entirely. It only recovers at the very end of our sample in 2016.

The bottom-left plot in Figure 1 displays a signed measure of the influence of balance sheet capacity on the price of crude. We display the product of the slope coefficient in equation (1) with one minus its p-value. This measure kills three birds with one stone. We can (a) keep track of the sign of the slope coefficients (to see whether or not it reverses direction too much), (b) get an additional handle on the time-variation of the strength of the predictive relationship, and (c) control the noise by attenuating the slope coefficients in inverse proportion to their statistical significance. Note that we have reversed the direction of the Y axis in the plot on the bottom-left.
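The signed measure is simple enough to write down directly. In this sketch the slopes and p-values are hypothetical inputs; in practice they would come from each rolling window's regression output.

```python
def signed_influence(slope, p_value):
    """Slope attenuated by its statistical significance: slope * (1 - p)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return slope * (1.0 - p_value)

# Hypothetical (slope, p-value) pairs from three rolling windows.
windows = [(-2.0, 0.05), (-2.0, 0.80), (0.5, 0.50)]
measure = [signed_influence(b, p) for b, p in windows]
```

The first two windows share the same slope, but the insignificant one is shrunk most of the way toward zero, while the sign of each slope is preserved.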

The slope and significance metric tells a story that is very similar to the one told by the percentage of variation explained. Moreover, we can see that the relationship is economically large and negative. The interpretation is that positive shocks to balance sheet capacity compress the risk premium embedded in the price of crude. When balance sheet capacity is plentiful, risk arbitrageurs (speculators who make risky bets) bid away expected excess returns. Conversely, when balance sheet capacity is scarce, risk arbitrageurs are constrained in the amount of leverage they can obtain from their dealers and are therefore compelled to leave expected excess returns on the table.

The main result above—that dealer balance sheet growth predicts returns on crude oil—was originally obtained by Erkko Etula for his doctoral dissertation at Harvard. 


The Near-Unipolar World Reconsidered


Figure 1. Countries rescaled by the number of people earning more than $200 a day in 2002. Source: WorldMapper.Org.

This is an ongoing conversation with Ted Fertik.

Thanks for the link man. Tooze (2014) was an amazing read! I want to talk about two things. First, I am going to shamelessly insist that I was right about the role of near-unipolarity in Tooze’s schema. Second, I want to talk about how near-unipolarity relates to the history of the twentieth century. All quotes are from Tooze (2014) unless otherwise specified.

“In the wake of World War I think the stakes were higher.” Why were they higher? “What was at stake was a new global order under the sign of what has been variously referred to as ultraimperialism, American hegemony, or Empire”; that Churchill described as “the pyramids of peace” (quoted in The Deluge). [Emphasis mine.]

The “central challenge facing the German political elite” was the “sheer scale of twentieth-century Anglo-American economic predominance.” Tooze shows that the interwar order was one of unabashed Anglo-American cohegemony. The “main question” of the international politics of the interwar era is “how to understand the insurgency against the order.” More pertinently, the question facing the Germans was should they “conform and assimilate themselves to its power” or “mount an insurgency against it”?

“We must view that struggle as more asymmetric, and thus as an expression of the combined and uneven development of the international system…” [Emphasis mine.]

“Neither the international relations of the interwar period, nor World War II itself are well-described by models…derived from the more truly multipolar world of the late nineteenth century.”

I contended that the world from the close of the nineteenth century to the rise of China in the 2000s was secretly near-unipolar. I presented GDP numbers and argued that GDP was a good enough measure to detect near-unipolarity. But I also have strong historical reasons to think carefully about near-unipolarity—as the quotes from Tooze above suggest.

When I say near-unipolar, I mean that there is an especially strong state in the system such that no state could hope to prevail against it in a war or an extended rivalry; that there is no doubt about the identity of the strongest state in the system; and that when statesmen evaluate great power war and great power military alliances they have to care a great deal about the unipole’s position: computations on the outcome of great power war and confrontation premised on the unipole’s disinterest have to be thrown out of the window if the unipole weighs in the balance.

Note again that this is a weak definition. It just means that there is a football in a pile of tennis balls. The unipole may not even have a standing army. It may or may not exercise influence abroad. A lesser great power may run the maritime world and lesser great powers may worry much more about each other (especially their strong neighbours) than the unipole. In fact, if the unipole is insular and isolationist, it may not cause the other great powers any headaches at all. Indeed, they may even make fun of its extant weakness.

However, in a near-unipolar world, such disdain is contingent on the foreign policy of the unipole. Were the unipole to mobilize its war potential and be willing to use force on the world stage, the lesser great powers would have to eat their insulting words. Moreover, lesser great powers threatened by each other can be expected to try to secure the protection of the unipole. An alliance with the unipole is, after all, very useful given the rule of force in world affairs. The unipole may therefore get pulled into other people’s fights despite itself. Even insularity and isolationism thus do not completely thwart the gravitational pull exerted by the unipole.

One could write a convincing history of the twentieth century in this frame of reference. The philosophy of history that such a work requires is almost insultingly straightforward. The basic fact of near-unipolarity serves as the single explanatory variable. That is, the twentieth century as the story of the clarification of the real balance of forces. Or history catching up with the secret topology of the world.

In this frame of reference, the outcomes of the main great power confrontations of the twentieth century—World War I, World War II, and the Cold War—were more or less known in advance. The game had, in fact, been rigged from the get-go.

What explains the British surrender of naval preponderance in the Western Hemisphere in 1900? What explains the results of 1918? What explains the Washington Naval Conference of 1922? The stability of the interwar European order in the 1920s? The breakdown of that order and the turn to radicalism in 1931? The startling fact that not the winner but the power that basically sat out the Second World War dictated the postwar order? The outright capitulation of the second ranked power in the so-called bipolar world in 1989? All these questions have a single answer: The fact of the asymmetric size of the football.

Is it possible to construct a tighter, more parsimonious narrative frame? Is it not, then, a quite compelling frame of reference?


Tooze, Adam. “The Sense of a Vacuum.” Historical Materialism 22.3-4 (2014): 351-370.


The Geopolitics of the French Election


If populism prevails in France, it would have a much more dramatic impact on geopolitical affairs than the victory of populism in the offshore powers.

The immediate geopolitical impact of Brexit is now clear. Britain’s unilateral decision to withdraw has unified the continent against the perfidious Albion. Little England has, in effect, been forced into splendid isolation from the continent. Going forward, Britain will not have a seat at the European table.

Across the pond, Trump pulled off the greatest bait-and-switch in US political history. All promises of economic nationalism and isolationism have been shelved. Instead, the political high tide of the GOP is being mined in the service of plutocratic interests. While the Bannon-Sessions-Miller wing remains committed to constructing an ethnic security state—and is worryingly empowered to do so—the foreign economic and security policies of the United States are back in the hands of the Blob. Despite expectations to the contrary, Liberal Hegemonism is alive and well in the United States. American populism seems to have been tamed at least in so far as US foreign policy is concerned.

The consensus on the impact of a Marine Le Pen victory is that it would spell the demise of the European project. In particular, it would mean the end of the euro. But there is a perfectly feasible alternative scenario that may obtain if she wins. In that scenario, the French withdrawal will leave an even more unified and compact EU; one that would look more and more like a German Delian league.

I will argue that the second scenario is more likely than the first and that it would reconfigure European geopolitics in important and foreseeable ways. But first, how did we get here?

During the 1950s and 1960s, the core of the world economy was tripolar. Global industrial production was dominated by national champions of the United States, Japan and Germany (more generally, western Europe). Northern labor had a quasi-monopoly on Northern knowhow. More precisely, national labor pools had a quasi-monopoly on the knowhow of national industrial champions. Within this context, domestic bargains between labor and capital along the lines of the Treaty of Detroit enabled broad-based growth in the core of the world economy.

The Western economic miracle of the early postwar era came to an end as a result of the Japanese onslaught. Japan was able to combine its relatively low wages with high productivity growth to dramatically swell its shares of the global product market; helped along by the container revolution of the late 1960s that enhanced the integration of global product markets. Unable to compete with the Japanese, western firms tried in vain to increase the growth rate of their productivity. The western world slid into a deep stagflation crisis during the 1970s that prepared the ground for the neoliberal counterrevolution whose main agenda was to put finance firmly back in the saddle and tear up the Treaty of Detroit. That alone would’ve been sufficient to guarantee the rise of plutocracy, precarity and wage polarization. But even more momentous developments were afoot that undermined the geoeconomic foundations of broad-based prosperity in the center countries even more thoroughly.

The 1980s witnessed the telecommunication and intermodal transportation revolution whereby transportation and communication costs collapsed enough to split the atom of national champions. The result was what Richard Baldwin calls ‘the second unbundling’ of global production whereby managers in the headquarter economies (US, Germany, Japan) trained cheap foreign labor within a day’s flying distance of headquarters to create Factory North America, Factory Europe and Factory Asia. This unified national labor markets at the regional level even as global product markets integrated further at the global level.

The addition of hundreds of millions of Chinese workers to Factory Asia created a tremendous imbalance between capital and labor. The result was even greater worker insecurity, wage polarization, and intensification of plutocracy. At the same time, the reemergence of global finance unleashed the financial cycle that also whipsawed market society with bubbles and financial crises.

The consequence of these global-macro fluctuations and structural changes was tremendous trauma in western market societies. This trauma manifested itself as the rise of populism and the destruction of the political center.

Back to geopolitics. A Le Pen victory in France cannot be ruled out with any degree of certainty. I claimed that if she wins, France and England would likely face a virtual German Delian league. The reason is twofold. First, European states are extraordinarily exposed to the risk of a dramatic unraveling of Factory Europe. A breakup of the eurozone would result in the effective repeal of deep integration on the continent and therefore a wholesale disruption of European value chains. In order to forestall such a catastrophic scenario, other states are likely to stick together. Second, while the renationalization of market society might be a viable strategy for medium-weights like the UK and France, it is decidedly not a viable strategy for either the small rich northern nations of the European core or the poorer nations on Europe’s southern and eastern periphery. Neither the Nordics and the Low Countries on the one hand, nor eastern European nations like Poland and the Czech Republic on the other, have any possibility of maintaining their prosperity after renationalization. Basically, the depth and breadth of skill-sets in every country, except maybe Germany and perhaps Italy, provide an insufficient basis to compete in global product markets. They can put up tariff walls to protect domestic industry. But then the size of their national markets would sharply limit their firms’ economies of scale. That’s what doomed the import-substitution strategies of countless developing nations.

Le Pen dreams of an independent France that can stand up to Germany. But the reality is far more sobering. The harsh truth is that, after the second unbundling, without combining the knowhow of the North with the cheap labor of the South you can no longer be truly competitive in global product markets. This is even true of the United States. Renationalization is a recipe for geoeconomic irrelevance. Isolationist Britain and France will not become third world states, but they will be marginalized; both in Europe and in the global marketplace.

So what happens if I’m right and the UK and France face a German Delian league on the continent? France and the United Kingdom are independent nuclear powers and the main politico-military actors in Europe. They are essential partners for the United States. If and when they withdraw into isolationism, European security will rest on German shoulders. Le Pen has already declared her intention to reorient Franco-Russian relations, away from deterrence in alliance with the western bloc and toward bilateral cooperation. But the security of the Baltics, the Nordics, and central and eastern Europe depends on more than US engagement. It requires a European great power partner. What this means is that the German Delian league would have to obtain its own conventional and nuclear deterrent. Pressures in this direction are already building as a result of growing doubts about the US commitment to defend Europe. They will intensify with the French exit, if it obtains.

One can think more systematically about the geopolitical implications of populism in England and France through the theory of regional security complexes (RSCs). Buzan and Wæver described the European great power RSC as a security community (a territorial cluster of states for whom war amongst each other is unthinkable) of great powers protected by one global power and threatened by another. This configuration is unlikely to last. The question is, How will it be transformed?

In the Policy Tensor’s view, the answer is that the British and French exits correspond to a major structural transformation of the European great power RSC. In particular, there is a strong potential for the emergence of a new, dominant security actor on the scene; namely, the German Delian league. Whether or not security competition reemerges in western Europe will then depend on whether the secondary powers (France and the UK) exacerbate the Russian threat to the league’s security. A possible withdrawal of the American pacifier—which is no longer unthinkable either—will make it considerably more likely. But whether or not security competition reemerges, we’re looking at a major transformation of the European RSC.


Silicon Valley’s Visions of Absolute Power


Omnipotence is in front of us, almost within our reach…

Yuval Noah Harari

The word “disrupt” only appears thrice in Yuval Noah Harari’s Homo Deus: A Brief History of Tomorrow. That fact cannot save the book from being thrown into the Silicon Valley Kool-Aid wastebasket.

Harari is an entertaining writer. There are plenty of anecdotes that stoke the imagination. There is the one about vampire bats loaning blood to each other. Then there’s the memorable quip from Woody Allen: Asked if he hoped to live forever through the silver screen, Allen replied, “I don’t want to achieve immortality through my work. I want to achieve it by not dying.” The book is littered with such clever yarns interspersed with sweeping, evidence-free claims. Many begin with “to the best of our knowledge” or some version thereof. Like this zinger: “To the best of our knowledge, cats are able to imagine only things that actually exist in the world, like mice.” Umm, no, we don’t know that. Such fraudulent claims about scientific knowledge plague the book and undermine the author’s credibility. And they just don’t stop coming.

“To the best of our scientific understanding, the universe is a blind and purposeless process, full of sound and fury but signifying nothing.” How does one even pose this question scientifically?

“To the best of our knowledge” behaviorally modern humans’ decisive advantage over others was that they could exercise “flexible cooperation with countless number of strangers.” Unfortunately for the theory, modern humans eliminated their competitors well before any large-scale organization. During the Great Leap Forward—what’s technically called the Upper Paleolithic Revolution when we spread across the globe and eliminated all competition—mankind lived in small bands. There was virtually no “cooperation with countless strangers.” The reason why we prevailed everywhere and against every foe was because we had language, which allowed for unprecedented coordination within small bands. Harari seems completely unaware of the role of language in the ascent of modern humans. He claims that as people “spread into different lands and climates they lost touch with one another…” Umm, how exactly were modern humans in touch with each other across the vast expanse of Africa?

“To the best of our scientific understanding, determinism and randomness have divided the entire cake between them, leaving not a crumb for ‘freedom’…. Free will exists only in the imaginary stories we humans have invented.” Here, Harari takes one of the hardest open problems and pretends that science has an answer. The truth is much more sobering. Not only is there no scientific consensus on the matter of free will and consciousness, it would be disturbing if there were, since we have failed to develop the conceptual framework to attack the problem in the first place.

“According to the theory of evolution, all the choices animals make – whether of residence, food or mates – reflect their genetic code.… [I]f an animal freely chooses what to eat and with whom to mate, then natural selection is left with nothing to work with.” Nonsense. The theory of evolution, whether in the original or in its modern formulations, is entirely compatible with free will. Natural selection operates statistically and inter-generationally over populations, not on specific individuals. It leaves ample room for free will.

There are eleven chapters in the book. All the sweeping generalizations and hand-waving of the first ten chapters are merely a prelude to the final chapter. Here, Harari goes on the hard sell.

Dataism considers living organisms to be mere “biochemical algorithms” and “promises to provide the scientific holy grail that has eluded us for centuries: a single overarching theory that unifies all scientific disciplines….”

“You may not agree with the idea that organisms are algorithms” but “you should know that this is current scientific dogma…”

“Science is converging on an all-encompassing dogma, which says that organisms are algorithms, and life is data processing.”

“…capitalism won the Cold War because distributed data processing works better than centralized data processing, at least in periods of accelerating technological changes.”

“When Columbus first hooked up the Eurasian net to the American net, only a few bits of data could cross the ocean each year…”

“Intelligence is decoupling from consciousness” and “non-conscious but highly intelligent algorithms may soon know us better than we know ourselves.”

No, the current scientific dogma isn’t that organisms are algorithms. Nor is science converging on an all-encompassing dogma that says that life is data processing. Lack of incentives for innovation in the Warsaw Pact played a greater role in the outcome of the Cold War than the information-gathering deficiencies of centralized planning. When Columbus first “hooked up the Eurasian net to the American net,” much more than a few bits of data crossed the ocean. For instance, the epidemiological unification of the two worlds annihilated much of the New World population in short order.

There are more fundamental issues with Dataism, or more accurately, Data Supremacism. First, data is simply not enough. Without theory, it is impossible to make inferences from data, big or small. Think of the turkey. All year long, the turkey thinks that the human would feed and take care of it. Indeed, every day the evidence keeps piling up that humans want to protect the turkey. Then comes Thanksgiving.

Second, the data itself is not independent of reference frames. This is manifest in modern physics; in particular, in both relativity and quantum physics. What we observe critically depends on our choice of reference frame. For instance, if Alice and Bob measure a spatially-separated (more precisely, spacelike separated) pair of entangled particles, their observations may or may not be correlated depending on the axes onto which they project the quantum state. This is not an issue of decoherence. It is in principle impossible to extract information stored in a qubit without knowledge of the right reference frame. To go a step further, Kent (1999) has shown that observers can mask their communication from an eavesdropper (called Eve, obviously) if she doesn’t share their reference frame. Even more damningly, reference frames are a form of unspeakable information—information that, unlike other classical information, cannot be encoded into bits to be stored on media and transmitted on data links.
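The dependence on measurement axes can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the pair is in the singlet state and that Alice and Bob both measure spin along axes in the x-z plane; neither detail is specified above, and the function names are my own.

```python
import math

def spin_observable(theta):
    """Spin measurement along an axis at angle theta in the x-z plane."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [s, -c]]

def kron(a, b):
    """Kronecker product of two 2x2 matrices, giving a 4x4 matrix."""
    return [[a[i][k] * b[j][l] for k in range(2) for l in range(2)]
            for i in range(2) for j in range(2)]

def singlet_correlation(theta_alice, theta_bob):
    """Expected product of Alice's and Bob's +/-1 outcomes on a singlet pair."""
    r = 1.0 / math.sqrt(2.0)
    psi = [0.0, r, -r, 0.0]  # (|01> - |10>) / sqrt(2)
    op = kron(spin_observable(theta_alice), spin_observable(theta_bob))
    return sum(psi[i] * op[i][j] * psi[j] for i in range(4) for j in range(4))

aligned = singlet_correlation(0.0, 0.0)
orthogonal = singlet_correlation(0.0, math.pi / 2)
```

With aligned axes the outcomes are perfectly anticorrelated; with orthogonal axes the correlation vanishes. Since the result depends only on the relative angle, observers who do not share a frame cannot know which regime they are in.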

Third and most importantly, we do not have the luxury of assuming that an open problem will be solved at all, much less that it will be solved by a particular approach within a specific time-frame. This is a major source of radical uncertainty that is never going to go away. Think about cancer research. Big data and powerful new data science tools make the researchers’ jobs easier. But they cannot guarantee their success.

The main contribution of my doctoral thesis was solving the problem of reference frame alignment for observers trying to communicate in the vicinity of a black hole. The problem has no general solution. I exploited the locally-measurable symmetries of the spacetime to solve the problem. Observers located in the vicinity of a black hole can use my solution to communicate. If they don’t know my solution or don’t want to use it, they need to discover another solution that works. They cannot communicate otherwise. This is just one of countless examples where data plays at best a secondary role in solving concrete problems.

Empirical data is clearly very important for solving scientific, technical, economic, social, and psychological problems. But data is never enough. Much more is needed. Specifically, solving an open problem often requires a reformulation of the problem. That is, it often requires an entirely new theory. We don’t know yet if AI will ever be able to make the leap from calculator to theoretician. We cannot simply assume that they will be able to do so. They may run into insurmountable problems for which no solution may ever be found. However, if and when they do, there is no reason why humans should not be able to comprehend an AI’s theories. More powerful theories turn out to be simpler after all. And if and when that happens, the Policy Tensor for one would welcome our AI overlords.

Harari makes a big fuss about algorithms knowing you better than yourself. “Liberalism will collapse the day the system knows me better than I know myself.” Well, my weighing machine “knows” my weight better than I do. What difference does it make if an AI could tell me I really and truly have a 92 percent chance of having a successful marriage with Mary and only 42 percent with Jane? Assuming that the AI knows me better than I do, why would I treat it any differently from my BMI calculator that insists that I am testing the upper bound of normality? After all, I agree that the BMI calculator is more accurate than my subjective judgment about my fitness, just as the AI is about my love life.

Artificial Intelligence without consciousness is just a really fancy weighing machine. And data science is just a fancy version of simple linear regression. Why would Liberalism collapse if Silicon Valley delivers on its promises on AI? Won’t we double-down on the right to choose precisely because we can calibrate our choices much better?

If AIs gain consciousness, on the other hand, all bets are off. Whether as an existential threat or as a beneficial disruption, the arrival of the first Super AI will be an inflection point in human history. The arrival of advanced aliens poses similar risks to human civilization.

If you are interested in the potential of AI, you’re better off reading Nick Bostrom’s Superintelligence: Paths, Dangers, Strategies. If you are curious about scientific progress and our technological future in deep time as well as the primacy of theory, you should read David Deutsch’s The Beginning of Infinity: Explanations That Transform the World. If you are more interested in the unification of the sciences, look no further than Peter Watson’s Convergence: The Idea at the Heart of Science. (Although I do recommend Watson’s The Modern Mind, The German Genius, and The Great Divide more and in that order.) Finally, for the limits of scientific and technical advance, see John D. Barrow’s Impossibility: The Limits of Science and the Science of Limits.

Silicon Valley’s Kool-Aid encompasses long-term visions of both techno-utopias and techno-dystopias. The unifying fantasy is that, in the long run, technological advance will endow man and/or AI with absolute power. In the utopias, men become gods and mankind conquers the galaxy; and in much more ambitious versions, the entire universe itself. (It would be orders of magnitude harder to reach other galaxies than other stars.) In the more common dystopias, man won’t be able to compete with AI, or the elite will but the commoners won’t (this is Harari’s version). In either case, the Valley’s Kool-Aid is that technology will revolutionize human life and endow some—depending on the narrative: Silicon Valley, tech firms, AIs, the rich, all humans, or AI and humans—with god-like powers. Needless to say, this technology will come out of Silicon Valley.

In reality, a small oligopoly of what Farhad Manjoo calls the Frightful Five (Facebook, Google, Apple, Microsoft and Amazon) have cornered unprecedented market power; and stashed their oligopolistic supernormal profits overseas, just to rub it in your face. Apple alone has an untaxed $216 billion parked offshore. Far from obeying the motto “data wants to be free,” these oligopolistic firms hoard your data and sell it to the highest bidder. The dream of tech start-ups is no longer a unicorn IPO. Rather, it is a buyout by one of the oligopolists. If you are a truly successful firm in the Valley, you have either benefited from network externalities (like the Frightful Five which are all platforms with natural economies of scale), or you have managed to shed costs onto the shoulders of people who would’ve hitherto been your employees or customers (like Airbnb, Uber and so on). Silicon Valley is, in fact, more neoliberal than Wall Street. While the Street has managed to shed risks and costs to the state, the Valley has managed to shed risks and costs to employees and customers. That’s basically the Valley’s business model.

Alongside its hoard of financial resources, the Valley has also cornered an impressive amount of goodwill in the popular consciousness. Who does not admire Google and Apple? This goodwill is the result of the industry’s actual accomplishments; some of them genuine, some thrust upon them by fate. In the popular imaginary, the Valley is the source of innovation and dynamism; to be celebrated not decried. Yet, the concentration of power in the industry has started to worry the best informed. If mass technological unemployment does come to pass, the Valley should not be surprised to find itself a pariah and a target of virulent populism, in the manner of Wall Street in 2009.


Causal Inference from Linear Models

For the past few decades, empirical research has shunned all talk of causation. Scholars use their causal intuitions but they only ever talk about correlation. Smoking is “associated with” cancer, being overweight is “correlated with” higher morbidity rates, college education is the strongest “correlate of” Trump’s vote gains over Romney, and so on and so forth. Empirical researchers don’t like to use causal language because they think that causal concepts are not well-defined. It is a hegemonic postulate of modern statistics and econometrics that all falsifiable claims can be stated in the language of modern probability. Any talk of causation is frowned upon because causal claims simply cannot be cast in the language of probability. For instance, there is no way to state in the language of probability that smoking causes cancer, that the tides are caused by the moon or that rain causes the lawn to get wet.

Unfortunately, or rather fortunately, the hegemonic postulate happens to be untrue. Recent developments in causality—a sub-discipline of philosophy—by Judea Pearl and others, have made it possible to talk about causality with mathematical precision and use causal models in practice. We’ll come back to causal inference and show how to do it in practice after a brief digression on theory.

Theories isolate a portion of reality for study. When we say that Nature is intelligible, we mean that it is possible to discover Nature’s mechanisms theoretically (and perhaps empirically). For instance, the tilting of the earth on its axis is the cause of the seasons. It’s why the northern and southern hemispheres have opposite seasons. We don’t know that from perfect correlation of the tilting and the seasons because correlation does not imply causation (and in any case they are not perfectly correlated). We could, of course, be wrong, but we think that this is a ‘good theory’ in the sense that it is parsimonious and hard-to-vary—it is impossible to fiddle with the theory without destroying it. [This argument is due to David Deutsch.] In fact, we find this theory so compelling that we don’t even subject it to empirical falsification.

Yes, it is impossible to derive causal inference from the data with absolute certainty. This is because, without theory, causal inference from data is impossible, and theories on their part can only ever be falsified; never proven. Causal inference from data is only possible if the data analyst is willing to entertain theories. The strongest causal claims a scholar can possibly make take the form: “Researchers who accept the qualitative premises of my theory are compelled by the data to accept the quantitative conclusion that the causal effect of X on Y is such and such.”

We can talk about causality with mathematical precision because, under fairly mild regularity conditions, any consistent set of causal claims can be represented faithfully as causal diagrams which are well-defined mathematical objects. A causal diagram is a directed graph with a node for every variable and directed edges or arrows denoting causal influence from one variable to another, e.g., {X\longrightarrow Y} which says that Y is caused by X where, say, X is smoking and Y is lung cancer.

The closest thing to causal analysis in contemporary social science is the structural equation model. In order to illustrate the graphical method for causal inference, we’ll restrict attention to a particularly simple class of structural equation models, that of linear models. The results hold for nonlinear and even nonparametric models. We’ll work only with linear models not only because they are ubiquitous but also for pedagogical reasons. Our goal is to teach rank-and-file researchers how to use the graphical method to draw causal inferences from data. We’ll show when and how structural linear models can be identified. In particular, you’ll learn which variables you should and shouldn’t control for in order to isolate the causal effect of X on Y. For someone with basic undergraduate level training in statistics and probability it should take no more than a day’s work. So bring out your pencil and notebook.

A note on attribution: What follows is largely from Judea Pearl’s work on causal inference. Some of the results may be due to other scholars. There is a lot more to causal inference than what you will encounter below. Again, my goal here is purely pedagogical. I want you, a rank-and-file researcher, to start using this method as soon as you are done with the exercises at the end of this lecture. (Yes, I’m going to assign you homework!)

Consider the simple linear model,

{\large Y := \beta X + \varepsilon }

where {\varepsilon} is a standard normal random variable independent of X. This equation is structural in the sense that Y is a deterministic function of X and {\varepsilon} but neither X nor {\varepsilon} is a function of Y. In other words, we assume that Nature chooses X and {\varepsilon} independently, and Y takes values in obedience to the mathematical law above. This is why we use the asymmetric symbol “:=” instead of the symmetric “=” for structural equations.

We can embed this structural model into the simplest causal graph {X\longrightarrow Y}, where the arrow indicates the causal influence of X on Y. We have suppressed the dependence of Y on the error {\varepsilon}. The full graph reads {X\longrightarrow Y \dashleftarrow\varepsilon}, where the dashed arrow denotes the influence of unobserved variables captured by our error term. The path coefficient associated with the link {X\longrightarrow Y} is {\beta}, the structural parameter of the simple linear model. A structural model is said to be identified if the structural parameters can in principle be estimated from the joint distribution of the observed variables. We will show presently that under our assumptions the model is indeed identified and the path coefficient {\beta} is equal to the slope of the regression equation,

{\large \beta = r_{YX} = \rho_{YX}\,\sigma_{Y}/\sigma_{X} }

where {\rho_{YX}} is the correlation between X and Y and {\sigma_{X}} and {\sigma_{Y}} are the standard deviations of X and Y respectively. {r_{YX}} can be estimated from sample data with the usual techniques, say, ordinary least squares (OLS).

What allows straightforward identification in the base case is the assumption that X and {\varepsilon} are independent. If X and {\varepsilon} are dependent then the model cannot be identified. Why? Because in this case there is spurious correlation between X and Y that propagates along the “backdoor path” {X\dashleftarrow\varepsilon\dashrightarrow Y}. See Figure 1.


Figure 1. Identification of the simple linear model.
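The base case and its failure can be checked with a quick simulation. What follows is a minimal sketch, not part of the original argument: the sample size, the value of {\beta}, the helper name ols_slope and the hidden factor u used to make X and the error dependent are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0  # hypothetical sample size and path coefficient


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Case 1: X and the error are independent, so the regression slope recovers beta.
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = beta * x + eps
slope_clean = ols_slope(x, y)
# The same number written as correlation times the ratio of standard deviations.
rho_form = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)

# Case 2: a hidden factor u (an assumption of this sketch) enters both X and
# the error, which opens the backdoor path and biases the slope.
u = rng.normal(size=n)
x_c = u + rng.normal(size=n)
y_c = beta * x_c + (u + rng.normal(size=n))
slope_confounded = ols_slope(x_c, y_c)

print(slope_clean, rho_form, slope_confounded)
```

In the second case the population slope is {\beta} plus cov(X, error)/var(X), which with these particular choices is 2 + 1/2, so the regression overstates the causal effect no matter how large the sample.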

Here’s what we can do if X and {\varepsilon} are dependent. We simply find another observed variable that is a causal “parent” of X (i.e., {Z\longrightarrow X} ) but independent of {\varepsilon}. Then we can use it as an instrumental variable to identify the model. This is because there is no backdoor path between Y and Z (which identifies {\alpha\beta} ) and X and Z (which identifies {\alpha}). See Figure 2.


Figure 2. Identification with an instrumental variable.
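The instrumental variable argument can also be checked numerically: the slope of Y on Z estimates {\alpha\beta}, the slope of X on Z estimates {\alpha}, and their ratio recovers {\beta}. This is only a sketch under assumed values; the coefficients, the hidden factor u and the helper name ols_slope are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta = 200_000, 1.5, 2.0  # hypothetical path coefficients


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


z = rng.normal(size=n)                   # instrument: a parent of X, independent of the error
u = rng.normal(size=n)                   # hidden factor that makes X and the error dependent
x = alpha * z + u + rng.normal(size=n)
y = beta * x + u + rng.normal(size=n)

naive = ols_slope(x, y)                  # contaminated by the backdoor path through u
first_stage = ols_slope(z, x)            # estimates alpha
reduced_form = ols_slope(z, y)           # estimates alpha * beta
iv_estimate = reduced_form / first_stage

print(naive, first_stage, reduced_form, iv_estimate)
```

The naive slope stays away from 2 however large n gets, while the ratio settles on it.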

In that case, {\beta} is given by the instrumental variable formula,

{\large \beta = r_{YZ}/r_{XZ} }


More generally, in order to identify the causal influence of X on Y in a graph G, we need to block all spurious correlation between X and Y. This can be achieved by controlling for the right set of covariates (or controls) Z. We’ll come to that presently. First, some graph terminology.

A directed graph is a set of vertices together with arrows between them (some of which may be bidirected). A path is simply a sequence of connected links, e.g., {i\dashrightarrow m\leftrightarrow j\dashleftarrow k} is a path between i and k. A directed path is one where every node has arrows that point in one direction, e.g., {i\longrightarrow j\leftrightarrow m\longrightarrow k} is a directed path from i to k. A directed acyclic graph is a directed graph that does not admit closed directed paths. That is, a directed graph is acyclic if there are no directed paths from a node back to itself.

A causal subgraph of the form {i\longrightarrow m\longrightarrow j} is called a chain and corresponds to a mediating or intervening variable m between i and j. A subgraph of the form {i\longleftarrow m\longrightarrow j} is called a fork, and denotes a situation where the variables i and j have a common cause m. A subgraph of the form {i\longrightarrow m\longleftarrow j} is called an inverted fork and corresponds to a common effect. In a chain {i\longrightarrow m\longrightarrow j} or a fork {i\longleftarrow m\longrightarrow j}, i and j are marginally dependent but conditionally independent (where we condition on m). In an inverted fork {i\longrightarrow m\longleftarrow j} on the other hand, i and j are marginally independent but conditionally dependent (once we condition on m). We use family connections to talk in shorthand about directed graphs. In the graph {i\longrightarrow j}, i is the parent and j is the child. The descendants of i are all nodes that can be reached by a directed path starting at i. Similarly, the ancestors of j are all nodes from which j can be reached by directed paths.
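The three patterns are easy to see in simulated data. The sketch below assumes unit path coefficients and standard normal errors (my choices, purely for illustration), and measures conditional dependence by the correlation of residuals after projecting each variable on m; partial_corr is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000


def partial_corr(a, b, m):
    """Correlation between a and b after projecting each on m (with an intercept)."""
    design = np.column_stack([np.ones_like(m), m])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


# Chain: i -> m -> j
i = rng.normal(size=n)
m = i + rng.normal(size=n)
j = m + rng.normal(size=n)
chain = (np.corrcoef(i, j)[0, 1], partial_corr(i, j, m))

# Fork: i <- m -> j
m = rng.normal(size=n)
i = m + rng.normal(size=n)
j = m + rng.normal(size=n)
fork = (np.corrcoef(i, j)[0, 1], partial_corr(i, j, m))

# Inverted fork: i -> m <- j
i = rng.normal(size=n)
j = rng.normal(size=n)
m = i + j + rng.normal(size=n)
inverted_fork = (np.corrcoef(i, j)[0, 1], partial_corr(i, j, m))

for name, (marginal, conditional) in [("chain", chain), ("fork", fork),
                                      ("inverted fork", inverted_fork)]:
    print(f"{name}: marginal {marginal:.2f}, given m {conditional:.2f}")
```

For the chain and the fork the marginal correlation is sizeable and the conditional one is close to zero; for the inverted fork it is the other way around.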

Definition (Blocking). A path p is blocked by a set of nodes Z if and only if p contains at least one arrow-emitting node that is in Z or p contains at least one inverted fork that is outside Z and has no descendant in Z. A set of nodes Z is said to block X from Y, written {(X\perp Y |Z)_{G}}, if Z blocks every path from X to Y.

The logic of the definition is that the removal of the set of nodes Z completely stops the flow of information from Y to X. Consider all paths between X and Y. No information passes through an inverted fork {i \longrightarrow m\longleftarrow j} so you can ignore the paths that contain inverted forks. Likewise, no information passes through a path without an arrow-emitting node so those can also be ignored. The rest of the paths are “live” and we must choose a set of nodes Z whose removal would block the flow of all information between X and Y along these paths. Note that whether Z blocks X from Y in a causal graph G can be decided by visual inspection when the number of covariates is small, say fewer than a dozen. If the number of covariates is large, as in many machine learning applications, a simple algorithm can do the job.
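One such algorithm, in its most naive form, simply enumerates every path between X and Y and applies the definition of blocking to each. The sketch below is an illustration rather than an efficient or canonical implementation: it assumes the graph is given as a list of (parent, child) pairs, that any unobserved common cause is drawn as an explicit node instead of a bidirected arc, and that the graph is small enough for brute-force enumeration. All function names are my own.

```python
def descendants(children, node):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(edges, x, y):
    """Every path from x to y that ignores arrow direction and repeats no node."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == y:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([x])


def path_blocked(edges, path, z):
    """Apply the blocking definition to one path and one conditioning set z."""
    edge_set = set(edges)
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        inverted_fork = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if inverted_fork:
            # Blocks unless the middle node or one of its descendants is in z.
            if mid not in z and not (descendants(children, mid) & z):
                return True
        elif mid in z:
            # An arrow-emitting node (chain or fork) that belongs to z.
            return True
    return False


def blocks(edges, x, y, z):
    """True if z blocks every path between x and y."""
    z = set(z)
    return all(path_blocked(edges, p, z) for p in simple_paths(edges, x, y))


chain = [("i", "m"), ("m", "j")]
inverted_fork = [("i", "m"), ("j", "m"), ("m", "d")]
print(blocks(chain, "i", "j", {"m"}))          # True: the middle of a chain blocks it
print(blocks(inverted_fork, "i", "j", set()))  # True: an inverted fork blocks by itself
print(blocks(inverted_fork, "i", "j", {"d"}))  # False: conditioning on a descendant unblocks
```

Note that a direct link between X and Y has no interior node, so nothing can block it. This is why the theorems below first delete arrows from the graph before asking whether Z blocks X from Y.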

If Z blocks X from Y in a causal graph G, then X is independent of Y given Z. That is, if Z blocks X from Y then X|Z and Y |Z are independent random variables. We can use this property to figure out precisely which covariates we ought to control for in order to isolate the causal effect of X on Y in a given structural model.

Theorem 1 (Covariate selection criteria for direct effect). Let G be any directed acyclic graph in which {\beta} is the path coefficient of the link {X\longrightarrow Y}, and let {G_{\beta}} be the graph obtained by deleting the link {X\longrightarrow Y}. If there exists a set of variables Z such that no descendant of Y belongs to Z and Z blocks X from Y in {G_{\beta}}, then {\beta} is identifiable and equal to the regression coefficient {r_{YX\cdot Z}}. Conversely, if Z does not satisfy these conditions, then {r_{YX\cdot Z}} is not a consistent estimand of {\beta}.

Theorem 1 says that the direct effect of X on Y can be identified if and only if we have a set of covariates Z that blocks all paths, confounding as well as causal, between X and Y except for the direct path {X\longrightarrow Y}. The path coefficient is then equal to the partial regression coefficient of X in the multivariate regression of Y on X and Z,

{Y =\alpha_1Z_1+\cdots+\alpha_kZ_k+\beta X+\varepsilon.}

The above equation can, of course, be estimated by OLS. Theorem 1 does not say that the model as a whole is identified. In fact, the path coefficients associated with the links {Z_{i}\longrightarrow Y} that the multivariate regression above suggests are not guaranteed to be identified. The regression model would be fully identified if Y is also independent of {Z_{i}} given {\{(Z_{j})_{j\ne i}, X\}} in {G_{i}} for all {i=1,\dots,k}.

What if you wanted to know the total effect of X on Y ? That is, the combined effect of X on Y both through the direct channel (i.e., the path coefficient {\beta}) and through indirect channels, e.g., {X\longrightarrow W\longrightarrow Y} ? The following theorem provides the solution.

Theorem 2 (Covariate selection criteria for total effect). Let G be any directed acyclic graph. The total effect of X on Y is identifiable if there exists a set of nodes Z such that no member of Z is a descendant of X and Z blocks X from Y in the subgraph formed by deleting from G all arrows emanating from X. The total effect of X on Y is then given by {r_{YX\cdot Z}}.

Theorem 2 ensures that, after adjustment for Z, the variables X and Y are not associated through confounding paths, which means that the regression coefficient {r_{YX\cdot Z}} is equal to the total effect. Note the difference between the two criteria. For the direct effect, we delete the link {X\longrightarrow Y} and find a set of nodes that blocks all other paths between X and Y. For the total effect, we delete all arrows emanating from X because we do not want to block any indirect causal path from X to Y.
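To see the two criteria side by side, here is a sketch on a made-up graph with a confounder Z (with {Z\longrightarrow X} and {Z\longrightarrow Y}), a direct link {X\longrightarrow Y} and an indirect channel {X\longrightarrow W\longrightarrow Y}. The graph, the coefficient values and the helper coef_on_first are assumptions of this example. Theorem 1 says to control for both W and Z to get the direct effect; Theorem 2 says to control for Z alone to get the total effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, a, b = 1.0, 0.5, 2.0  # hypothetical coefficients on X -> Y, X -> W, W -> Y

z = rng.normal(size=n)                       # confounder: Z -> X and Z -> Y
x = 0.8 * z + rng.normal(size=n)
w = a * x + rng.normal(size=n)               # mediator on the indirect channel
y = beta * x + b * w + 1.2 * z + rng.normal(size=n)


def coef_on_first(target, *regressors):
    """OLS coefficient on the first regressor, with an intercept included."""
    design = np.column_stack(regressors + (np.ones(n),))
    return np.linalg.lstsq(design, target, rcond=None)[0][0]


direct = coef_on_first(y, x, w, z)   # {W, Z} blocks X from Y once X -> Y is deleted
total = coef_on_first(y, x, z)       # {Z} blocks X from Y once all arrows out of X are deleted
unadjusted = coef_on_first(y, x)     # no controls: confounded by Z

print(direct, total, unadjusted)
```

With these values the direct effect is 1 and the total effect is 1 + 0.5 × 2 = 2. Controlling for W when you want the total effect would be a mistake, since W is a descendant of X and sits on a causal path.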

Theorem 1 is Theorem 5.3.1 and Theorem 2 is Theorem 5.3.2 in the second edition of Judea Pearl’s book, Causality: Models, Reasoning, and Inference, where the proofs may also be found. These theorems are of extraordinary importance for empirical research. Instead of the ad-hoc and informal methods currently used by empirical researchers to choose covariates, they provide mathematically precise criteria for covariate selection. The next few examples show how to use these criteria for a variety of causal graphs.

Figure 3 shows a simple case (top left) {Z\longrightarrow X\longrightarrow Y} where the errors of Z and Y are correlated. We obtain identification by repeated application of Theorem 1. Specifically, Z blocks X from Y in the graph obtained from deleting the link {X\longrightarrow Y} (top right). Thus, {\alpha} is identified. Similarly, Y blocks Z from X in the graph obtained from deleting the link {Z\longrightarrow X} (bottom right). Thus, {\beta} is identified.


Figure 3. Identification when a parent of X is correlated with Y.

Figure 4 shows a case where an unobserved disturbance term influences both X and Y. Here, the presence of the intervening variable Z allows for the identification of all the path coefficients. I’ve written the structural equation on the top right and checked the premises of Theorem 1 at the bottom left. Note that the path coefficient of {U\dashrightarrow X} is known to be 1 in accordance with the structural equation for X. Hence, the total effect of X on Y equals {\alpha\beta+\gamma}.


Figure 4. Model identification with an unobserved common cause.

Figure 5 presents a more complicated case where the direct effect can be identified but not the total effect. The identification of {\delta} is impossible because X and Z are spuriously correlated and there is no instrumental or intervening variable available.


Figure 5. A more complicated case where only partial identification is possible.

If you have reached this far, I hope you have acquired a basic grasp of the graphical methods presented in this lecture. You probably feel that you still don’t really know it. This always happens when we learn a new technique or method. The only way to move from “I sorta know what this is about” to “I understand how to do this” is to sit down and work out a few examples. If you do the exercises in the homework below, you will be ready to use this powerful arsenal for live projects. Good luck!


  1. Epidemiologists argued in the early postwar period that smoking causes cancer. Big Tobacco countered that both smoking and cancer are correlated with genotype (unobserved), and hence, the effect of smoking on cancer cannot be identified. Show Big Tobacco’s argument in a directed graph. What happens if we have an intervening variable between smoking and cancer that is not causally related to genotype? Say, the accumulation of tar in lungs? What would the causal diagram look like? Prove that it is then possible to identify the causal effect of smoking on cancer. Provide an expression for the path coefficient between smoking and cancer.
  2. Obtain a thousand simulations each of two independent standard normal random variables X and Y. Set Z=X+Y. Check that X and Y are uncorrelated. Check that X|Z and Y|Z are correlated. Ask yourself if it is a good idea to control for a variable without thinking the causal relations through.
  3. Obtain a thousand simulations each of three independent standard normal random variables {u,\nu,\varepsilon}. Let {X=u+\nu} and {Y=u+\varepsilon}. Create scatter plots to check that X and Y are marginally dependent but conditionally independent (conditional on u). That is, X|u and Y|u are uncorrelated. Project Y on X using OLS. Check that the slope is significant. Then project Y on X and u. Check that the slope coefficient for X is no longer significant. Should you or should you not control for u?
  4. Using the graphical rules of causal inference, show that the causal effect of X on Y can be identified in each of the seven graphs shown in Figure 6.
  5. Using the graphical rules of causal inference, show that the causal effect of X on Y cannot be identified in each of the eight graphs in Figure 7. Provide an intuitive reason for the failure in each case.

    Figure 6. Graphs where the causal effect of X on Y can be identified.


    Figure 7. Graphs where the causal effect of X on Y cannot be identified.

    P.S. I just discovered that there is a book on this very topic, Stanley A. Mulaik’s Linear Causal Modeling with Structural Equations (2009).


Regional Polarization and Trump’s Electoral Performance

Tom Edsall suggested that I look at the regional socioeconomic correlates of Trump’s electoral performance. Why that didn’t cross my mind before I know not. But here goes. 

Political polarization in the United States means that overwhelmingly the best predictor of a major party presidential candidate’s electoral performance is the performance of the previous candidate of the party. This was clearly the case in this election. [All data in this post is at the county level. The socioeconomic data is from GeoFRED while the vote count is from here.]


In what follows, therefore, we will look at the correlates of Trump’s performance relative to Mitt Romney’s in 2012. This is the cleanest way to control for partisan polarization. We’re going to examine the socioeconomic indicators of counties where Trump gained vote share compared to Romney.

Specifically, we will divide the counties into six buckets: Blowout, where Trump’s vote share was more than 5 percent below Romney’s; Major Loss, where Trump’s vote share was between 5 and 2.5 percent below Romney’s; Moderate Loss, where his vote share was between 2.5 percent below and at par with Romney’s; Moderate Gain, where Trump increased the GOP’s share by less than 2.5 percent; Major Gain, where he increased it by between 2.5 and 5 percent; and finally, Landslide, where Trump gained more than 5 percent relative to Romney.
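For concreteness, the bucketing rule can be sketched in a few lines. This is an illustration, not the code used for the charts: the function name, the reading of the cut-offs as differences in vote share, and the assignment of counties that fall exactly on a boundary are my assumptions.

```python
from bisect import bisect_right

# Cut-offs for Trump's vote share minus Romney's, in the same units as the text.
CUTS = [-5.0, -2.5, 0.0, 2.5, 5.0]
LABELS = ["Blowout", "Major Loss", "Moderate Loss",
          "Moderate Gain", "Major Gain", "Landslide"]


def bucket(trump_share, romney_share):
    """Return the bucket for a county given the two vote shares.

    Assumption: a county sitting exactly on a cut-off goes to the bucket
    above it (for example, a swing of exactly 0 counts as Moderate Gain).
    """
    swing = trump_share - romney_share
    return LABELS[bisect_right(CUTS, swing)]


print(bucket(40.0, 46.0))  # Blowout
print(bucket(55.0, 54.0))  # Moderate Gain
print(bucket(67.0, 60.0))  # Landslide
```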

More sophisticated strategies are certainly possible. But this strategy will allow us to visualize the data cleanly.

We begin with the number of counties. This chart is no surprise to anyone who watched the results on election night. A lot more of the map was colored red than in 2012. There was a major swing in a large number of counties.


But most such counties are very sparsely populated. The most populous counties actually went for Clinton at higher rates than they had gone for Obama in 2012. These two charts illustrate the GOP’s astonishing geographic advantage.


Let’s move on to socioeconomic variables. The next two charts show the median household income and per capita incomes averaged over all the counties in each of the six buckets. Both paint a consistent picture: Trump did worse than Romney in a typical affluent county, but did better than him in poorer counties. But neither was a strong correlate of Trump’s performance. Median household income and per capita income explain only 13 percent and 10 percent of the variation in Trump’s performance relative to Romney respectively.


The percentage of college graduates, on the other hand, is a very strong predictor. It explains 35 percent of the variation in Trump’s relative performance. High school diploma rate is, however, a poor predictor. Still, counties where Trump did worse than Romney typically had higher percentages of people with high school diplomas.


Trump did better than Romney in counties where poverty and unemployment rates are relatively high, although the gradient is not constant.


Similarly, Trump did well in counties where the proportion of people relying on food stamps is high.


But his performance was uncorrelated with crime rates. On the other hand, it was correlated with the youth idleness rate (the percentage of 16-19 year olds who are neither studying nor working).


Similarly, counties where Trump improved on Romney’s performance had higher percentages of families with children that are single parent households.


Finally, Trump did worse than Romney in counties with positive net migration rates and he did better in counties with negative net migration rates. This is the only dynamic variable we have in the dataset. (The others are snapshots and do not tell how things are changing in the counties.) It is therefore very interesting to find a clean correlation between net migration rates and Trump’s relative performance. The upshot is that Trump did well in places that are hemorrhaging people.


A consistent picture emerges from all these charts. Trump got to the White House by outperforming Mitt Romney in counties that are less educated, have lower incomes and higher poverty rates, where a greater proportion of people rely on food stamps, where many young adults are idle and children are growing up in broken homes. This is the America that is getting left behind. People are quite literally leaving these counties for greener pastures.

We have yet to tackle the why of it all. Why has America become so regionally polarized? Is it global trade? Automation? Skill-biased technological change? The neoliberal policy consensus? The political economy of Washington, DC? A fairly coherent narrative can be constructed along any of these threads. It is much harder to evaluate their relative importance. And even harder to devise meaningful policy solutions.

While we quietly thank our stars that Trump is getting tamed by adult supervision, we cannot go back to ignoring fly-over country. For we now know quite well what happens when we do.






Zones of Poverty and Affluence in America

In Bobos in Paradise, David Brooks popularized the notion of Latte Towns: “upscale liberal communities, often in magnificent natural settings, often university-based, that have become the gestation centers for America’s new upscale culture.” Charles Murray, in Coming Apart, compiles a list of superzips where the affluent and the educated are concentrated:


Superzips in the United States. Source: Data by Charles Murray, compiled by Gavin Rehkemper.

On the other side of the great divide, we know about endemic poverty in Appalachia and, of course, the Deep South. Much of the doomed cohort analyzed by Case and Deaton is concentrated in these poverty belts.

Combined and uneven development has left America regionally polarized. This affects the politics of the nation and the country’s cohesiveness as a society. To better understand the challenges, it is important to map the regional polarization of America.

Before we come to the maps, a basic question needs to be considered. The affluent are concentrated in the superzips and the poor in the poverty belts, but what about the rest? Surely, the bulk of the population lives neither in zones of grinding poverty nor in zones of mass affluence. Are the rest of these zones homogeneous? Or is there internal structure in the middling bulk of America?

In order to answer this question, I looked at county-level socioeconomic data from GeoFRED. I wanted to see if the counties sorted themselves out into natural clusters. It turns out that there are four basic clusters of counties: Affluent, Middle America, Near-Poor, and Poor. These four clusters differ systematically from each other. Moreover, no matter which subset of socioeconomic indicators you use to do the sorting, you obtain very nearly the same clusters.
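To give a feel for what sorting counties into natural clusters involves, here is a sketch using plain k-means on standardized indicators. I am not claiming this is the exact procedure behind the table below; the algorithm choice, the function name and the synthetic data (200 made-up counties scattered around four hypothetical profiles of college share and poverty rate) are assumptions for illustration only.

```python
import numpy as np


def kmeans(data, k, n_iter=50):
    """Plain Lloyd's algorithm on standardized columns; returns one label per row."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    # Deterministic farthest-point start: begin at row 0, then repeatedly
    # add the row farthest from the centres chosen so far.
    centres = [z[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((z - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(z[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        sq_dist = ((z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = z[labels == j].mean(axis=0)
    return labels


# Synthetic stand-in for county data: (college graduates %, poverty rate %).
rng = np.random.default_rng(4)
profiles = np.array([[12, 26], [16, 18], [23, 12], [37, 9]], dtype=float)
data = np.vstack([p + rng.normal(scale=0.5, size=(50, 2)) for p in profiles])

labels = kmeans(data, 4)
print(np.bincount(labels, minlength=4))  # number of synthetic counties per cluster
```

With real county data the clusters would of course be less tidy than in this toy example, and the number of clusters would itself have to be chosen.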


The Geography of Class in America

Indicator                         Poor    Near-Poor    Middle America    Affluent
College Graduates                  12%          16%               23%         37%
Some College                       20%          25%               33%         47%
High School Graduates              75%          83%               89%         91%
Median Household Income ($)     34,302       42,787            52,800      73,170
Per Capita Income ($)           31,107       36,226            45,010      64,218
Unemployment Rate                   7%           6%                4%          4%
Single Parent Households           41%          34%               28%         25%
Inequality (ratio)                 16%          13%               12%         13%
Poverty Rate                       26%          18%               12%          9%
Subprime Rate                      37%          29%               22%         20%
Youth Idleness Rate                13%          10%                6%          5%
Food Stamps                        27%          17%               10%          7%
Crime Rate (per thousand)           10            8                 6           6
Population (millions)             23.0         78.5             134.0        80.7
Population share (sample)           7%          25%               42%         26%
No. of counties                    582        1,177             1,077         231

Source: GeoFRED, author’s calculations.

Only 231 out of 3,067 counties can be classified as affluent. But they contain 81 million people, or a quarter of the US population. The median household income in these counties is 73,170. In affluent counties, 91 percent of adults have a high school diploma and 37 percent have college degrees. The poverty rate is 9 percent and only 7 percent of residents rely on food stamps. About a quarter of the families with children are single parent households. Only 5 percent of young adults aged 16-19 are neither studying nor working. The crime rate is low and the unemployment rate is below the national average.

Some 582 out of 3,067 counties can be classified as poor. They are home to 23 million people, or 7 percent of the US population. The median household income is 34,302; less than half that of the affluent counties. A quarter of adult residents in these counties lack a high school diploma and only 12 percent have college degrees. More than a quarter of residents fall below the poverty line and 27 percent rely on food stamps for survival. Some 41 percent of families with children are single parent households and 13 percent of young adults are neither studying nor working. The crime rate is high and the unemployment rate is above the national average.

The vast bulk of US counties, 74 percent, are neither affluent nor poor. They contain 212 million people, almost exactly two-thirds of the US population. Of these 2,254 counties, 1,177 are near-poor. They are home to 78 million people, or 25 percent of the population. On almost any socioeconomic indicator, these counties are closer to the poor counties than the affluent ones.

Finally, there are 1,077 moderately affluent counties in Middle America. This is where the middling bulk of the US population—42 percent—lives. They are home to 134 million people, which is more than the population of Japan or Mexico. There is a significant gap in incomes and college graduation rates between moderately affluent and affluent counties. But on other socioeconomic indicators, they are not far apart.

Although affluent counties are sprinkled throughout the country, coastal United States is home to all multi-county clusters of mass affluence. A vast zone of affluence stretches across the northeastern seaboard, from the suburbs of DC all the way up to Vermont.

Eastern zone of affluence

Inside this eastern zone of affluence there are two major clusters. One is centered around New York City. It is the richest, most populous cluster of counties in the United States. The City’s per capita income is nearly a hundred and sixty thousand dollars.

NYC zone of affluence

The second is centered on Washington, DC. The two suburban counties of Fairfax and Prince William are brown because GeoFRED does not have data on them. Both are easily affluent. According to the 2010 census, the median household incomes of Fairfax and Prince William counties were 105,416 and 91,098 respectively.

DC zone of affluence

The Western zone of affluence is centered on San Francisco and comparable in affluence to the DC area. It obeys the same distance decay law that characterizes the eastern zones of affluence: The closer one gets to the leading city the more affluent the area. Note that Marin County has a higher per capita income than San Francisco itself. Both have per capita incomes in six figures—a property shared by only 13 counties in the entire United States.

Western zone of affluence

On to the other side of the ledger. There are some counties in the Western United States with high poverty rates. But these counties are sparsely populated. Because they are geographically large, national maps provide a misleading picture to the naked eye. The exception is the cluster of high poverty rate counties in Arizona and New Mexico. At the center of the cluster of three dark-hued counties that visually dominate the map is Apache County, Arizona. (The narrow strip that runs north-south along the Arizona-New Mexico border.) Only 10 percent of Apache residents have a college degree; 26 percent don’t even have a high school diploma. Some 37 percent of residents are below the poverty line and rely on food stamps. Per capita income in the county is just shy of thirty thousand dollars. Nearly half the families with children are single parent households. An astonishing 55 percent of county residents have a credit score below 660, meaning that they are considered subprime.

Western poverty

Big multi-county clusters of widespread poverty are concentrated in the southeastern United States. There is a vast poverty belt stretching across the Deep South and another big cluster in Appalachia. You can walk a thousand miles from Texas to the eastern seaboard—say from Marion County, TX, to McIntosh County, GA—without stepping foot in any county with a poverty rate below 20 percent.

Eastern poverty

Kentucky has its own zone of wrenching poverty centered at Owsley County. In the map it is the one in the northern cluster of dark counties (where the poverty rate is more than 30 percent) that is surrounded on all sides by other dark counties. Here, 38 percent of the residents fall below the poverty line. The median household income is a mere 23,047. Only 11 percent of adults are college graduates and 41 percent lack a high school diploma. An astounding 55 percent of county residents rely on food stamps.

We have only scratched the surface of regional socioeconomic polarization in the United States. I will report again when I have more substantial results.