I was recently hired to teach a short course on quantitative methods to a firm that contracts with the Federal government. It was a good experience for me on a number of levels. First, it required me to go back through all my notes and "re-learn" a lot of methodological approaches that I was taught but don't frequently employ. Second, despite minoring in methodology in grad school, I'd never taught a methods course. Third, I'd never taught a course in this format -- roughly 32 hours of instruction over 4 days -- at all. Plus I got paid. So a good experience.
It also means that a lot of issues related to data and statistical modeling are at the front of my mind, so when I saw posts from Matt Yglesias and Noah Smith on possibilities and problems for economists looking to do empirical work using quantitative methods, it sparked a few responses.
First, both Yglesias and Smith are correct that it isn't enough to just mine data. Yglesias notes that, under the arbitrary but common decision rule for inference that social scientists employ -- generally without thinking, because of a convention that has nothing to do with social inquiry and is neither scientific nor especially rigorous -- we're willing to be wrong 5% of the time.
But I do worry sometimes that social sciences are becoming an arena in which number crunching sometimes trumps sound analysis. Given a nice big dataset and a good computer, you can come up with any number of correlations that hold up at a 95 percent confidence interval, about 1 in 20 of which will be completely spurious. But those spurious ones might be the most interesting findings in the batch, so you end up publishing them!

This doesn't concern me that much in general, but Yglesias' next point concerns me quite a lot:
[T]here's also the problem of Milton Friedman's thermostat. Take a room with a furnace that's regulated by a really good thermostat. Your data is going to show that the amount of fuel burned by the furnace is uncorrelated with the temperature in the room. Thus you'll discover that burning fossil fuels doesn't cause heat. Oooops!

What this problem illustrates has nothing to do with data or with statistical analysis per se. After all, Friedman could come to the same wrong conclusion simply by sitting in his room and looking at each month's power bill. Some months the bill goes up, some months the bill goes down, but the room always feels the same. Friedman could then, without any statistical analysis at all, come to the same erroneous conclusion about the relationship between fossil fuel consumption and temperature inside his room.
The problem is in the model. Friedman's example suggests that a univariate model -- burning fossil fuels causes temperature changes -- is insufficient when there are other variables -- the thermostat, the temperature outside the room -- involved in the causal process. Similarly, an additive model will give false results if the relationships between variables are conditioned by the presence or absence of other variables. The difficulty is that, contrary to the assumption underlying the most commonly-used statistical models, we do not have perfect information regarding the data-generating process. Therefore we will never know if we have included the correct variables in our model. There is no statistical test for this.
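Friedman's thermostat is easy to reproduce in simulation. The sketch below is a minimal numpy illustration with made-up numbers: the thermostat burns exactly enough fuel to offset heat loss from outside, so fuel consumption shows essentially zero correlation with room temperature -- even though the causal link from fuel to heat is built directly into the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
setpoint = 20.0                              # thermostat target (deg C)
outside = rng.normal(5, 8, n)                # outside temperature varies widely

# A (nearly) perfect thermostat burns just enough fuel to offset heat loss,
# so fuel tracks the *outside* temperature, not the room's.
fuel = np.clip(setpoint - outside, 0, None)  # fuel burned each period
room = setpoint + rng.normal(0, 0.1, n)      # room stays near the setpoint

# Fuel causes heat, yet the naive fuel-room correlation is ~0.
r = np.corrcoef(fuel, room)[0, 1]
print(round(r, 3))
```

The correlation vanishes because the thermostat makes fuel a deterministic function of the outside temperature while holding the room constant -- exactly the omitted-variable problem: the causal story is only recoverable if the model includes what the thermostat is responding to.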
There are many approaches to overcoming this problem. The inductive approach encourages gathering as much data as possible and mining it for correlations. Yes, some of them will be spurious, but most won't be. And, contrary to the impression Yglesias gives, there are methods short of randomized controlled experiments that can help us determine which correlations are spurious and which are robust. None of them are perfect, but all of them are better than chucking out statistical analysis entirely.
The data mining approach is fundamentally atheoretical. As Yglesias notes, sometimes we'll get very robust correlations that have nothing to do with causality. The canonical (fake) example of this is the finding that ice cream consumption causes crime, but there are many real-world examples of it, particularly in epidemiology research. Social scientists, however, try to get around this problem by developing theory about the phenomena of interest, deriving hypotheses from this theory, and then using statistical (or other) methods to evaluate whether the hypotheses can be eliminated. Thus the model is determined by theory: if we were trying to explain fluctuations in crime rates and had no theoretical reason to include ice cream consumption, then we wouldn't include it in our model. Or, to bring this back to Yglesias, if we think that the temperature of the room is caused by fossil fuel consumption as regulated by a thermostat, then we would include both the fuel consumption and the thermostat in our model. Typically the way this is done is by estimating the average size of the effect of variable x on outcome y, while allowing for "random" -- i.e. unmodeled and therefore unexplained -- variance in y, which we call the "residual" and denote e.
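To make this concrete, here is a hedged numpy sketch of the room (all parameters invented for illustration; I stand in for "the thermostat" with the outside temperature it responds to, and give the thermostat a little noise so fuel has some variation of its own). Regressing temperature on fuel alone recovers almost no effect; the theory-driven model that also includes the confounder recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
setpoint = 20.0
outside = rng.normal(5, 8, n)

# Structural (true) model: each unit of fuel raises room temp by 0.5 degrees.
u = rng.normal(0, 1, n)                    # thermostat imprecision
fuel = 2 * (setpoint - outside) + u        # the thermostat's control rule
room = outside + 0.5 * fuel + rng.normal(0, 0.5, n)

# Naive univariate fit: the slope on fuel is attenuated toward zero.
b_naive = np.polyfit(fuel, room, 1)[0]

# Theory-driven model: include the variable the thermostat responds to.
X = np.column_stack([np.ones(n), fuel, outside])
b_full = np.linalg.lstsq(X, room, rcond=None)[0]

print(round(b_naive, 3), round(b_full[1], 2))
```

The naive slope is nearly zero (Friedman's thermostat again), while the coefficient on fuel in the full model is close to the true 0.5 -- the data were the same; only the model changed.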
This brings us to Smith's post. Much development in econometrics has concerned e. Without getting too wonky, if there are systematic processes operating within e then it is not random and our model has major problems. So methodologists have spent a lot of time trying to "fix" e, by which I mean making it resemble a "random" process even when it isn't. Fixing e is fine if what we're primarily interested in is predicting the central tendency of y. But simply fixing e is not fine at all if what we're interested in is understanding what causes y.
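A quick way to see what "systematic processes operating within e" means: leave a relevant, correlated variable out of a regression, and the residual both carries that variable's signal and biases the coefficient you did estimate. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(0, 1, n)               # relevant variable left out of the model
x = 0.7 * z + rng.normal(0, 1, n)     # included regressor, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(0, 1, n)

# Fit y on x alone; z is relegated to the "residual" e.
b = np.polyfit(x, y, 1)
e = y - np.polyval(b, x)

# e is not random noise -- it still carries a systematic process (z) ...
r_ez = np.corrcoef(e, z)[0, 1]
# ... and the estimated effect of x is biased away from the true value 1.0.
print(round(b[0], 2), round(r_ez, 2))
```

No amount of massaging e's distribution changes either fact; only adding z to the model does.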
This gets into Smith's criticism about economics. Often, economists find some factor that they either can't measure or can't explain but that nevertheless seems to have an important causal impact on outcomes like economic growth or the wealth of nations. Frequently they'll give a name to this -- such as "total factor productivity" -- despite the fact that they don't really know what it is:
This residual represents how productive capital and labor are, which is why we call it "total factor productivity" (TFP). What determines TFP? It could be "human capital". It could be technology. It could be institutions like property rights, corporate governance, etc. It could be government inputs like roads, bridges, and schools. It could be taxes and regulations. It could be land and natural resources. It could be some complicated function of a country's position in global supply chains. It could be a country's terms of trade. It could be transport costs and urban agglomeration. It could be culture. It could be inborn racial superpowers. It could be God, Buddha, Cthulhu, or the Flying Spaghetti Monster. It could be an ironic joke by the vast artificial intelligences that govern the computer simulation that generates our "reality", putting their metaphorical thumb on the scales because they are bored underpaid research assistants with nothing better to do.

A similar thing happens with "institutions" in economics. Again, the issue is not with the data. It's with the model. More specifically, it's with the theory from which the model is derived. Put more precisely, any approach that focuses on fixing e rather than modeling the processes that are happening within it has no theory. When we "label the residual", as Smith puts it, we might think we are doing theory because we now have things to say about "institutions" or "total factor productivity", but really we are just defining the problem away using new terminology. In fact what we are doing is omitting relevant variables from our analysis and then concluding that we've solved the problem!
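Labeling the residual is a purely mechanical step. In the standard growth-accounting setup -- a Cobb-Douglas production function Y = A * K^alpha * L^(1-alpha) with an assumed capital share alpha -- "TFP" is literally whatever output is left over after capital and labor are accounted for. A sketch with made-up numbers (these are illustrative values, not real country statistics):

```python
import numpy as np

# Hypothetical country data -- illustrative numbers only.
Y = np.array([1200.0, 2500.0, 800.0])    # output
K = np.array([3000.0, 9000.0, 1500.0])   # capital stock
L = np.array([50.0, 80.0, 40.0])         # labor force
alpha = 0.3                              # assumed capital share

# "Total factor productivity" is whatever the measured inputs can't account for:
log_tfp = np.log(Y) - alpha * np.log(K) - (1 - alpha) * np.log(L)
tfp = np.exp(log_tfp)
print(np.round(tfp, 2))
```

By construction, tfp * K^alpha * L^(1-alpha) reproduces Y exactly -- which is the point: computing and naming the residual explains none of the variation it contains.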
There are many problems with this, some of which Smith notes. For one thing this can lead to "semantic bias" (Smith's term), which in turn can lead us to all sorts of wrong intuitions about the way the world works. Political science is far from immune to these same problems*, but I think the problems are even bigger in economics, particularly when economists shift from academic exercise to policy prescription. Quite often economists neglect politics entirely -- i.e. they leave it in e -- when modeling economic outcomes. This is, as I put it once, like re-arranging the furniture for a party with no invited guests... perhaps interesting academically, but not helpful for much else. It's also very likely to lead to a biased model.
The solution is not, as Friedman seemingly suggests, to just stop doing statistical analysis. It's to do better statistical analysis, in conjunction with better theory and better operationalizations of key variables. It's to try to improve the model rather than fix the residual, and to specify exactly what we do and don't know.
*The most common instance of this phenomenon is to use "regime type" as an explanatory variable in comparative studies. Sometimes this might make sense, but most commonly it's just dumped into a model with little theoretical justification and almost no interpretive value.