# Causal Inference from Linear Models

For the past few decades, empirical research has shunned all talk of causation. Scholars use their causal intuitions but they only ever talk about correlation. Smoking is “associated with” cancer, being overweight is “correlated with” higher morbidity rates, college education is the strongest “correlate of” Trump’s vote gains over Romney, and so on and so forth. Empirical researchers don’t like to use causal language because they think that causal concepts are not well-defined. It is a hegemonic postulate of modern statistics and econometrics that all falsifiable claims can be stated in the language of modern probability. Any talk of causation is frowned upon because causal claims simply cannot be cast in the language of probability. For instance, there is no way to state in the language of probability that smoking causes cancer, that the tides are caused by the moon or that rain causes the lawn to get wet.

Unfortunately, or rather fortunately, the hegemonic postulate happens to be untrue. Recent developments in causality—a sub-discipline of philosophy—by Judea Pearl and others, have made it possible to talk about causality with mathematical precision and use causal models in practice. We’ll come back to causal inference and show how to do it in practice after a brief digression on theory.

Theories isolate a portion of reality for study. When we say that Nature is intelligible, we mean that it is possible to discover Nature’s mechanisms theoretically (and perhaps empirically). For instance, the tilting of the earth on its axis is the cause of the seasons. It’s why the northern and southern hemispheres have opposite seasons. We don’t know that from perfect correlation of the tilting and the seasons because correlation does not imply causation (and in any case they are not perfectly correlated). We could, of course, be wrong, but we think that this is a ‘good theory’ in the sense that it is parsimonious and hard-to-vary—it is impossible to fiddle with the theory without destroying it. [This argument is due to David Deutsch.] In fact, we find this theory so compelling that we don’t even subject it to empirical falsification.

Yes, it is impossible to derive causal inference from the data with absolute certainty. This is because, without theory, causal inference from data is impossible, and theories on their part can only ever be falsified; never proven. Causal inference from data is only possible if the data analyst is willing to entertain theories. The strongest causal claims a scholar can possibly make take the form: “Researchers who accept the qualitative premises of my theory are compelled by the data to accept the quantitative conclusion that the causal effect of X on Y is such and such.”

We can talk about causality with mathematical precision because, under fairly mild regularity conditions, any consistent set of causal claims can be represented faithfully as causal diagrams which are well-defined mathematical objects. A causal diagram is a directed graph with a node for every variable and directed edges or arrows denoting causal influence from one variable to another, e.g., ${X\longrightarrow Y}$ which says that Y is caused by X where, say, X is smoking and Y is lung cancer.

The closest thing to causal analysis in contemporary social science is structural equation modeling. In order to illustrate the graphical method for causal inference, we’ll restrict attention to a particularly simple class of structural equation models, that of linear models. The results hold for nonlinear and even nonparametric models. We’ll work only with linear models not only because they are ubiquitous but also for pedagogical reasons. Our goal is to teach rank-and-file researchers how to use the graphical method to draw causal inferences from data. We’ll show when and how structural linear models can be identified. In particular, you’ll learn which variables you should and shouldn’t control for in order to isolate the causal effect of X on Y. For someone with basic undergraduate-level training in statistics and probability it should take no more than a day’s work. So bring out your pencil and notebook.

A note on attribution: What follows is largely from Judea Pearl’s work on causal inference. Some of the results may be due to other scholars. There is a lot more to causal inference than what you will encounter below. Again, my goal here is purely pedagogical. I want you, a rank-and-file researcher, to start using this method as soon as you are done with the exercises at the end of this lecture. (Yes, I’m going to assign you homework!)

Consider the simple linear model,

${\large Y := \beta X + \varepsilon }$

where ${\varepsilon}$ is a standard normal random variable independent of X. This equation is structural in the sense that Y is a deterministic function of X and ${\varepsilon}$ but neither X nor ${\varepsilon}$ is a function of Y. In other words, we assume that Nature chooses X and ${\varepsilon}$ independently, and Y takes values in obedience to the mathematical law above. This is why we use the asymmetric symbol “:=” instead of the symmetric “=” for structural equations.

We can embed this structural model into the simplest causal graph ${X\longrightarrow Y}$, where the arrow indicates the causal influence of X on Y. We have suppressed the dependence of Y on the error ${\varepsilon}$. The full graph reads ${X\longrightarrow Y \dashleftarrow\varepsilon}$, where the dashed arrow denotes the influence of unobserved variables captured by our error term. The path coefficient associated with the link ${X\longrightarrow Y}$ is ${\beta}$, the structural parameter of the simple linear model. A structural model is said to be identified if the structural parameters can in principle be estimated from the joint distribution of the observed variables. We will show presently that under our assumptions the model is indeed identified and the path coefficient ${\beta}$ is equal to the slope of the regression equation,

${\beta=r_{YX}=\rho_{YX}\sigma_{Y}/\sigma_X}$,

where ${\rho_{YX}}$ is the correlation between X and Y and ${\sigma_{X}}$ and ${\sigma_{Y}}$ are the standard deviations of X and Y respectively.  ${r_{YX}}$ can be estimated from sample data with the usual techniques, say, ordinary least squares (OLS).
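The identification claim follows from one line of algebra: because X and ${\varepsilon}$ are independent, ${\mathrm{Cov}(X,Y)=\beta\,\mathrm{Var}(X)+\mathrm{Cov}(X,\varepsilon)=\beta\,\mathrm{Var}(X)}$, so ${\beta=\mathrm{Cov}(X,Y)/\mathrm{Var}(X)=r_{YX}}$. Here is a minimal simulation sketch of the same fact in Python with NumPy. The value ${\beta=2}$, the scale of X, the sample size and the seed are illustrative assumptions of mine, not part of the model.

```python
import numpy as np

def simulate_simple_model(beta=2.0, n=100_000, seed=0):
    """Draw (X, Y) from Y := beta * X + eps, with eps independent of X."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=1.5, size=n)   # Nature chooses X ...
    eps = rng.normal(size=n)            # ... and eps, independently of X
    y = beta * x + eps                  # Y obeys the structural equation
    return x, y

def regression_slope(y, x):
    """OLS slope of y on x, i.e. Cov(x, y) / Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

x, y = simulate_simple_model()
slope = regression_slope(y, x)

# The same slope written as rho * sigma_Y / sigma_X.
rho = np.corrcoef(x, y)[0, 1]
slope_from_rho = rho * y.std(ddof=1) / x.std(ddof=1)

print(slope, slope_from_rho)  # both should be close to the assumed beta = 2
```

The two printed numbers agree up to floating-point error because they are the same quantity written two ways; both are close to ${\beta}$ only because X and ${\varepsilon}$ were drawn independently.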

What allows straightforward identification in the base case is the assumption that X and ${\varepsilon}$ are independent. If X and ${\varepsilon}$ are dependent then ${\beta}$ cannot be identified from the regression of Y on X alone. Why? Because in this case there is spurious correlation between X and Y that propagates along the “backdoor path” ${X\dashleftarrow\varepsilon\dashrightarrow Y}$. See Figure 1.

Here’s what we can do if X and ${\varepsilon}$ are dependent. We simply find another observed variable Z that is a causal “parent” of X (i.e., ${Z\longrightarrow X}$, with path coefficient ${\alpha}$) but independent of ${\varepsilon}$. Then we can use it as an instrumental variable to identify the model. This is because there is no backdoor path between Z and Y (so the regression of Y on Z identifies the product ${\alpha\beta}$) nor between Z and X (so the regression of X on Z identifies ${\alpha}$). See Figure 2.

In that case, ${\beta}$  is given by the instrumental variable formula,

${\beta=r_{YZ}/r_{XZ}}$.
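To see why the ratio works, note that ${\mathrm{Cov}(Z,Y)=\beta\,\mathrm{Cov}(Z,X)+\mathrm{Cov}(Z,\varepsilon)=\beta\,\mathrm{Cov}(Z,X)}$ when Z is independent of ${\varepsilon}$; dividing both sides by ${\mathrm{Var}(Z)}$ gives ${r_{YZ}=\beta\, r_{XZ}}$. The following sketch checks this on simulated data. The unobserved common cause `u`, the coefficient values and the sample size are all assumptions I made up for illustration.

```python
import numpy as np

def slope(y, x):
    """OLS slope of y on x, i.e. Cov(x, y) / Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate_confounded(alpha=0.8, beta=2.0, n=200_000, seed=1):
    """X and eps share an unobserved cause u; Z moves X but is independent of eps."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                   # unobserved common cause (assumed)
    z = rng.normal(size=n)                   # instrument, independent of u
    x = alpha * z + u + rng.normal(size=n)   # Z -> X with path coefficient alpha
    eps = u + rng.normal(size=n)             # error term, dependent on X through u
    y = beta * x + eps                       # structural equation for Y
    return z, x, y

z, x, y = simulate_confounded()
naive = slope(y, x)               # contaminated by the backdoor path through u
iv = slope(y, z) / slope(x, z)    # r_YZ / r_XZ

print(naive, iv)  # naive overshoots the assumed beta = 2; iv lands close to it
```

The naive slope is biased upward here because `u` pushes X and Y in the same direction; with a different sign on `u` in either equation the bias would go the other way.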

More generally, in order to identify the causal influence of X on Y in a graph G, we need to block all spurious correlation between X and Y. This can be achieved by controlling for the right set of covariates (or controls) Z. We’ll come to that presently. First, some graph terminology.

A directed graph is a set of vertices together with arrows between them (some of which may be bidirected). A path is simply a sequence of connected links, e.g., ${i\dashrightarrow m\leftrightarrow j\dashleftarrow k}$ is a path between i and k. A directed path is one in which every arrow points in the same direction, e.g., ${i\longrightarrow j\longrightarrow m\longrightarrow k}$ is a directed path from i to k. A directed acyclic graph is a directed graph that does not admit closed directed paths. That is, a directed graph is acyclic if there are no directed paths from a node back to itself.

A causal subgraph of the form ${i\longrightarrow m\longrightarrow j}$ is called a chain and corresponds to a mediating or intervening variable m between i and j. A subgraph of the form ${i\longleftarrow m\longrightarrow j}$ is called a fork, and denotes a situation where the variables i and j have a common cause m. A subgraph of the form ${i\longrightarrow m\longleftarrow j}$ is called an inverted fork and corresponds to a common effect. In a chain ${i\longrightarrow m\longrightarrow j}$ or a fork ${i\longleftarrow m\longrightarrow j}$, i and j are marginally dependent but conditionally independent (where we condition on m). In an inverted fork ${i\longrightarrow m\longleftarrow j}$ on the other hand, i and j are marginally independent but conditionally dependent (once we condition on m). We use family connections to talk in shorthand about directed graphs. In the graph ${i\longrightarrow j}$, i is the parent and j is the child. The descendants of i are all nodes that can be reached by a directed path starting at i. Similarly, the ancestors of j are all nodes from which j can be reached by directed paths.

Definition (Blocking). A path p is blocked by a set of nodes Z if and only if p contains at least one arrow-emitting node that is in Z or p contains at least one inverted fork that is outside Z and has no descendant in Z. A set of nodes Z is said to block X from Y, written ${(X\perp Y |Z)_{G}}$, if Z blocks every path from X to Y.

The logic of the definition is that conditioning on the set of nodes Z completely stops the flow of information between X and Y. Consider all paths between X and Y. No information passes through an inverted fork ${i \longrightarrow m\longleftarrow j}$ unless we condition on m or on one of its descendants, so paths that contain an inverted fork are already blocked and can be ignored, provided Z stays clear of m and its descendants. Likewise, a path whose intermediate nodes emit no arrows is made up entirely of inverted forks and can be ignored for the same reason. The rest of the paths are “live” and we must choose a set of nodes Z that blocks the flow of all information between X and Y along these paths. Note that whether Z blocks X from Y in a causal graph G can be decided by visual inspection when the number of covariates is small, say fewer than a dozen. If the number of covariates is large, as in many machine learning applications, a simple algorithm can do the job.
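For small graphs the blocking definition can be checked mechanically by brute force. The sketch below enumerates every path between two nodes and applies the two clauses of the definition to each one. It is my own illustrative implementation, not a standard library routine; the function names are hypothetical, it assumes a directed acyclic graph given as a list of `(parent, child)` pairs, and a dashed or bidirected arc has to be spelled out as an explicit unobserved node with two outgoing arrows. Path enumeration is exponential in the worst case, so this is only meant for graphs you could also check by eye.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by a directed path."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def simple_paths(edges, x, y):
    """Yield every path from x to y, ignoring arrow direction, with no repeated node."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == y:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from walk(path + [n])

    yield from walk([x])

def path_is_blocked(edges, path, z):
    """Apply the blocking definition to one path, given the conditioning set z."""
    for i, m, j in zip(path, path[1:], path[2:]):
        if (i, m) in edges and (j, m) in edges:
            # Inverted fork at m: blocks unless m or one of its descendants is in z.
            if m not in z and not (descendants(edges, m) & z):
                return True
        elif m in z:
            # m emits an arrow along the path (chain or fork) and is in z.
            return True
    return False

def blocks(edges, x, y, z):
    """True if the set z blocks every path between x and y."""
    edges, z = set(edges), set(z)
    return all(path_is_blocked(edges, p, z) for p in simple_paths(edges, x, y))

# Hypothetical example: Z is a common cause of X and Y, and W is a common effect.
edges = [("Z", "X"), ("Z", "Y"), ("X", "W"), ("Y", "W")]
print(blocks(edges, "X", "Y", {"Z"}))       # True
print(blocks(edges, "X", "Y", {"Z", "W"}))  # False
print(blocks(edges, "X", "Y", set()))       # False
```

In the example, conditioning on the common cause Z closes the fork, while adding the common effect W to the conditioning set reopens a path through the inverted fork.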

If Z blocks X from Y in a causal graph G, then X is independent of Y given Z. That is, if Z blocks X from Y then X|Z and Y|Z are independent random variables. We can use this property to figure out precisely which covariates we ought to control for in order to isolate the causal effect of X on Y in a given structural model.

Theorem 1 (Covariate selection criteria for direct effect). Let G be any directed acyclic graph in which ${\beta}$ is the path coefficient of the link ${X\longrightarrow Y}$, and let ${G_{\beta}}$ be the graph obtained by deleting the link ${X\longrightarrow Y}$. If there exists a set of variables Z such that no descendant of Y belongs to Z and Z blocks X from Y in ${G_{\beta}}$, then ${\beta}$ is identifiable and equal to the regression coefficient ${r_{YX\cdot Z}}$. Conversely, if Z does not satisfy these conditions, then ${r_{YX\cdot Z}}$ is not a consistent estimand of ${\beta}$.

Theorem 1 says that the direct effect of X on Y can be identified by regression whenever we have a set of covariates Z, none of them a descendant of Y, that blocks all paths, confounding as well as causal, between X and Y except for the direct path ${X\longrightarrow Y}$. The path coefficient is then equal to the partial regression coefficient of X in the multiple regression of Y on X and Z,

${Y =\alpha_1Z_1+\cdots+\alpha_kZ_k+\beta X+\varepsilon.}$

The above equation can, of course, be estimated by OLS. Theorem 1 does not say that the model as a whole is identified. In fact, the path coefficients associated with the links ${Z_{i}\longrightarrow Y}$ that the multiple regression above suggests are not guaranteed to be identified. The regression model would be fully identified if Y is also independent of ${Z_{i}}$ given ${\{(Z_{j})_{j\ne i}, X\}}$ in ${G_{i}}$, the graph obtained by deleting the link ${Z_{i}\longrightarrow Y}$, for all ${i=1,\dots,k}$.

What if you wanted to know the total effect of X on Y? That is, the combined effect of X on Y both through the direct channel (i.e., the path coefficient ${\beta}$) and through indirect channels, e.g., ${X\longrightarrow W\longrightarrow Y}$? The following theorem provides the solution.

Theorem 2 (Covariate selection criteria for total effect). Let G be any directed acyclic graph. The total effect of X on Y is identifiable if there exists a set of nodes Z such that no member of Z is a descendant of X and Z blocks X from Y in the subgraph formed by deleting from G all arrows emanating from X. The total effect of X on Y is then given by ${r_{YX\cdot Z}}$.

Theorem 2 ensures that, after adjustment for Z, the variables X and Y are not associated through confounding paths, which means that the regression coefficient ${r_{YX\cdot Z}}$ is equal to the total effect. Note the difference between the two criteria. For the direct effect, we delete the link ${X\longrightarrow Y}$ and find a set of nodes that blocks all other paths between X and Y. For the total effect, we delete all arrows emanating from X because we do not want to block any indirect causal path of X to Y.
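The difference between the two criteria is easy to see in a simulation. The sketch below uses a made-up graph with a confounder (${Z\longrightarrow X}$, ${Z\longrightarrow Y}$), a mediator (${X\longrightarrow W\longrightarrow Y}$) and a direct link ${X\longrightarrow Y}$. All coefficients are assumptions chosen for illustration: the direct effect is 1.5 and the total effect is ${1.5+0.5\times 0.6=1.8}$. By Theorem 1 the set {Z, W} qualifies for the direct effect; by Theorem 2 the set {Z} qualifies for the total effect, while W does not because it is a descendant of X.

```python
import numpy as np

def partial_slope(y, x, controls):
    """Coefficient on x in the OLS regression of y on x, the controls and a constant."""
    design = np.column_stack([x, *controls, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]

def simulate(beta=1.5, n=200_000, seed=2):
    """Assumed graph: Z -> X, Z -> Y, X -> W -> Y, and the direct link X -> Y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                                  # confounder
    x = 0.7 * z + rng.normal(size=n)                        # Z -> X
    w = 0.5 * x + rng.normal(size=n)                        # X -> W (mediator)
    y = beta * x + 0.6 * w + 0.9 * z + rng.normal(size=n)   # X, W, Z -> Y
    return z, x, w, y

z, x, w, y = simulate()
direct = partial_slope(y, x, [z, w])   # Theorem 1: control for Z and W
total = partial_slope(y, x, [z])       # Theorem 2: control for Z only
naive = partial_slope(y, x, [])        # no controls: confounded by Z

print(direct, total, naive)  # roughly 1.5, 1.8, and something larger than both
```

The same regression routine gives three different answers depending on the controls, and only the graph tells you which answer corresponds to which causal question.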

Theorem 1 is Theorem 5.3.1 and Theorem 2 is Theorem 5.3.2 in the second edition of Judea Pearl’s book, Causality: Models, Reasoning, and Inference, where the proofs may also be found. These theorems are of extraordinary importance for empirical research. Instead of the ad-hoc and informal methods currently used by empirical researchers to choose covariates, they provide mathematically precise criteria for covariate selection. The next few examples show how to use these criteria for a variety of causal graphs.

Figure 3 shows a simple case (top left) ${Z\longrightarrow X\longrightarrow Y}$ where the errors of Z and Y are correlated. We obtain identification by repeated application of Theorem 1. Specifically, Z blocks X from Y in the graph obtained from deleting the link ${X\longrightarrow Y}$ (top right). Thus, ${\alpha}$ is identified. Similarly, the empty set blocks Z from X in the graph obtained from deleting the link ${Z\longrightarrow X}$ (bottom right): the only remaining path between Z and X runs through the inverted fork at Y, which stays blocked so long as we do not condition on Y. Thus, ${\beta}$ is identified.

Figure 4 shows a case where an unobserved disturbance term influences both X and Y. Here, the presence of the intervening variable Z allows for the identification of all the path coefficients. I’ve written the structural equation on the top right and checked the premises of Theorem 1 at the bottom left. Note that the path coefficient of ${U\dashrightarrow X}$ is known to be 1 in accordance with the structural equation for X. Hence, the total effect of X on Y equals ${\alpha\beta+\gamma}$.

Figure 5 presents a more complicated case where the direct effect can be identified but not the total effect. The identification of ${\delta}$ is impossible because X and Z are spuriously correlated and there is no instrumental variable or intervening variable available.

If you have reached this far, I hope you have acquired a basic grasp of the graphical methods presented in this lecture. You probably feel that you still don’t really know it. This always happens when we learn a new technique or method. The only way to move from “I sorta know what this is about” to “I understand how to do this” is to sit down and work out a few examples. If you do the exercises in the homework below, you will be ready to use this powerful arsenal for live projects. Good luck!

Homework

1. Epidemiologists argued in the early postwar period that smoking causes cancer. Big Tobacco countered that both smoking and cancer are correlated with genotype (unobserved), and hence, the effect of smoking on cancer cannot be identified. Show Big Tobacco’s argument in a directed graph. What happens if we have an intervening variable between smoking and cancer that is not causally related to genotype? Say, the accumulation of tar in lungs? What would the causal diagram look like? Prove that it is then possible to identify the causal effect of smoking on cancer. Provide an expression for the path coefficient between smoking and cancer.
2. Obtain a thousand simulations each of two independent standard normal random variables X and Y. Set Z=X+Y. Check that X and Y are uncorrelated. Check that X|Z and Y|Z are correlated. Ask yourself if it is a good idea to control for a variable without thinking the causal relations through.
3. Obtain a thousand simulations each of three independent standard normal random variables ${u,\nu,\varepsilon}$. Let ${X=u+\nu}$ and ${Y=u+\varepsilon}$. Create scatter plots to check that X and Y are marginally dependent but conditionally independent (conditional on u). That is, X|u and Y|u are uncorrelated. Project Y on X using OLS. Check that the slope is significant. Then project Y on X and u. Check that the slope coefficient for X is no longer significant. Should you or should you not control for u?
4. Using the graphical rules of causal inference, show that the causal effect of X on Y can be identified in each of the seven graphs shown in Figure 6.
5. Using the graphical rules of causal inference, show that the causal effect of X on Y cannot be identified in each of the eight graphs in Figure 7. Provide an intuitive reason for the failure in each case.

P.S. I just discovered that there is a book on this very topic, Stanley A. Mulaik’s Linear Causal Modeling with Structural Equations (2009).