Wednesday, August 8, 2012

It's Not Just the Data; It's Also the Model

I was recently hired to teach a short course on quantitative methods to a firm that contracts with the Federal government. It was a good experience for me on a number of levels. First, it required me to go back through all my notes and "re-learn" a lot of methodological approaches that I was taught but don't frequently employ. Second, despite minoring in methodology at grad school I'd never taught a methods course. Third, I'd never taught a course in this format -- roughly 32 hours of instruction over 4 days -- at all. Plus I got paid. So a good experience.

It also means that a lot of issues related to data and statistical modeling are in the front of my mind, so when I saw posts from Matt Yglesias and Noah Smith on possibilities and problems for economists looking to do empirical work using quantitative methods it sparked a few responses.

First, both Yglesias and Smith are correct that it isn't enough to just mine data. Yglesias mentions that, under the arbitrary but common decision rule for inference that social scientists employ -- generally without thinking, because of a convention that has nothing to do with social inquiry and is neither scientific nor especially rigorous -- we're willing to be wrong 5% of the time.

But I do worry sometimes that social sciences are becoming an arena in which number crunching sometimes trumps sound analysis. Given a nice big dataset and a good computer, you can come up with any number of correlations that hold up at a 95 percent confidence interval, about 1 in 20 of which will be completely spurious. But those spurious ones might be the most interesting findings in the batch, so you end up publishing them! 
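Yglesias' 1-in-20 arithmetic is easy to check with a quick simulation (a sketch in Python with invented data; nothing here comes from his post). Generate a batch of mutually independent variables, so that every relationship among them is spurious by construction, and see how many "significant" correlations fall out anyway:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 200

# 200 mutually independent series: any "significant" correlation is spurious
data = rng.normal(size=(n_obs, n_vars))
x = data[:, 0]

# Test the other 199 series against x at the conventional 5% level
pvals = np.array([pearsonr(x, data[:, j])[1] for j in range(1, n_vars)])
print(f"share 'significant' at p < .05: {np.mean(pvals < 0.05):.3f}")
```

Roughly 5% of the tests come back "significant" even though, by construction, no relationship exists.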
This doesn't concern me that much in general, but Yglesias' next point concerns me quite a lot:
[T]here's also the problem of Milton Friedman's thermostat. Take a room with a furnace that's regulated by a really good thermostat. Your data is going to show that the amount of fuel burned by the furnace is uncorrelated with the temperature in the room. Thus you'll discover that burning fossil fuels doesn't cause heat. Oooops!
What this problem illustrates has nothing to do with data or with statistical analysis per se. After all, Friedman could come to the same wrong conclusion simply by sitting in his room and looking at each month's power bill. Some months the bill goes up, some months the bill goes down, but the room always feels the same. Friedman could then, without any statistical analysis at all, come to the same erroneous conclusion about the relationship between fossil fuel consumption and temperature inside his room.
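Here is a minimal sketch of the thermostat story (Python, invented numbers): the thermostat burns more fuel on colder days and holds the room near its target, so fuel burned and room temperature come out uncorrelated even though the fuel is doing all the work.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

outside = rng.normal(5.0, 10.0, n)       # outdoor temperature, degrees C
target = 20.0

# A good (not perfect) thermostat: more fuel on colder days, plus a little slop
fuel = np.clip(target - outside, 0.0, None) + rng.normal(0.0, 0.5, n)
room = target + rng.normal(0.0, 0.3, n)  # the room is held near the target

print(np.corrcoef(fuel, room)[0, 1])     # near zero: "fuel doesn't cause heat"?
print(np.corrcoef(fuel, outside)[0, 1])  # strongly negative: the real driver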

The problem is in the model. Friedman's example suggests that a univariate model -- burning fossil fuels causes temperature changes -- is insufficient when there are other variables -- the thermostat, the temperature outside the room -- involved in the causal process. Similarly, an additive model will give false results if the relationships between variables are conditioned by the presence or absence of other variables. The difficulty is that, contrary to the assumption underlying the most commonly-used statistical models, we do not have perfect information regarding the data-generating process. Therefore we will never know if we have included the correct variables in our model. There is no statistical test for this.

There are many approaches to overcoming this problem. The inductive approach encourages gathering as much data as possible and mining it to look for correlations. Yes, some of them will be spurious but most won't be. And, contrary to the impression Yglesias gives, there are methods short of randomized controlled experiments that can help us determine which correlations are spurious and which are robust. None of them are perfect but all of them are better than chucking out statistical analysis entirely.

The data mining approach is fundamentally atheoretical. As Yglesias notes, sometimes we'll get very robust correlations that have nothing to do with causality. The canonical (fake) example of this is the finding that ice cream consumption causes crime, but there are many real-world examples of it, particularly in epidemiology research. Social scientists, however, try to get around this problem by developing theory about the phenomena of interest, deriving hypotheses from this theory, and then using statistical (or other) methods to evaluate whether the hypotheses can be eliminated. Thus the model is determined by theory: if we were trying to explain fluctuations in crime rates and had no theoretical reason to include ice cream consumption in our model, then we wouldn't include it in our model. Or, to bring this back to Yglesias, if we think that the temperature of the room is caused by fossil fuel consumption as regulated by a thermostat, then we would include both the fuel consumption and the thermostat in our model. Typically the way this is done is by estimating the average size of the effect of variable x on outcome y, while allowing for "random" -- i.e. unmodeled and therefore unexplained -- variance in y, which we call the "residual" and denote e.
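The ice cream example can be sketched the same way (Python, invented data): heat drives both ice cream consumption and crime, so an atheoretical bivariate look finds a strong "effect" of ice cream on crime that disappears once theory tells us heat belongs in the model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

heat = rng.normal(size=n)                    # the real cause (the confounder)
ice_cream = heat + rng.normal(0.0, 0.5, n)   # heat drives ice cream sales
crime = heat + rng.normal(0.0, 0.5, n)       # heat drives crime; ice cream doesn't

# Atheoretical bivariate look: a strong "effect" of ice cream on crime
print(np.corrcoef(ice_cream, crime)[0, 1])   # large and positive

# Theory says heat belongs in the model; include it and the effect vanishes
X = np.column_stack([np.ones(n), ice_cream, heat])
beta, *_ = np.linalg.lstsq(X, crime, rcond=None)
print(beta[1])   # coefficient on ice cream, now near zero
```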

This brings us to Smith's post. Much development in econometrics has concerned e. Without getting too wonky, if there are systematic processes operating within e then it is not random and our model has major problems. So methodologists have spent a lot of time trying to "fix" e, by which I mean making it resemble a "random" process even when it isn't. Fixing e is fine if what we're primarily interested in is predicting the central tendency of y. But simply fixing e is not fine at all if what we're interested in is understanding what causes y.
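Here is a sketch of what a "systematic process operating within e" looks like (Python, invented data): omit a relevant variable from the model and its influence doesn't vanish, it just gets relabeled as residual.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

x = rng.normal(size=n)
z = rng.normal(size=n)                        # relevant variable we fail to model
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(0.0, 0.5, n)

# Fit y on x alone; z gets swept into the residual e
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

# e averages zero, but it is not "random": it tracks the omitted variable
print(np.corrcoef(e, z)[0, 1])   # strongly positive -- structure left in e
```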

This gets into Smith's criticism about economics. Often, economists find some factor that they either can't measure or can't explain but that nevertheless seems to have an important causal impact on outcomes like economic growth or the wealth of nations. Frequently they'll give a name to this -- such as "total factor productivity" -- despite the fact that they don't really know what it is:
This residual represents how productive capital and labor are, which is why we call it "total factor productivity" (TFP). What determines TFP? It could be "human capital". It could be technology. It could be institutions like property rights, corporate governance, etc. It could be government inputs like roads, bridges, and schools. It could be taxes and regulations. It could be land and natural resources. It could be some complicated function of a country's position in global supply chains. It could be a country's terms of trade. It could be transport costs and urban agglomeration. It could be culture. It could be inborn racial superpowers. It could be God, Buddha, Cthulhu, or the Flying Spaghetti Monster. It could be an ironic joke by the vast artificial intelligences that govern the computer simulation that generates our "reality", putting their metaphorical thumb on the scales because they are bored underpaid research assistants with nothing better to do.
A similar thing happens with "institutions" in economics. Again, the issue is not with the data. It's with the model. More specifically, it's with the theory from which the model is derived. Put more precisely, any approach that focuses on fixing e rather than modeling the processes that are happening within it has no theory. When we "label the residual", as Smith puts it, we might think we are doing theory because we now have things to say about "institutions" or "total factor productivity", but really we are just defining the problem away using new terminology. In fact what we are doing is omitting relevant variables from our analysis and then concluding that we've solved the problem!

There are many problems with this, some of which Smith notes. For one thing this can lead to "semantic bias" (Smith's term), which in turn can lead us to all sorts of wrong intuitions about the way the world works. Political science is far from immune to these same problems*, but I think the problems are even bigger in economics, particularly when economists shift from academic exercise to policy prescription. Quite often economists neglect politics entirely -- i.e. they leave them in e -- when modeling economic outcomes. This is, as I put it once, like re-arranging the furniture for a party with no invited guests... perhaps interesting academically, but not helpful for much else. It's also very likely to lead to a biased model.

The solution is not, as Friedman seemingly suggests, to just stop doing statistical analysis. It's to do better statistical analysis, in conjunction with better theory and better operationalizations of key variables. It's to try to improve the model rather than fix the residual, and to specify exactly what we do and don't know.

*The most common instance of this phenomenon is to use "regime type" as an explanatory variable in comparative studies. Sometimes this might make sense, but most commonly it's just dumped into a model with little theoretical justification and almost no interpretive value.


Phil Arena said...

Well said.

Jonathan Kropko said...

I enjoyed your post, and I'm glad you found teaching methods to be an intellectual exercise. I always do.

There are a few points here which I think deserve some clarification.

First, I do think Milton Friedman's thermostat is a good example. Here, the room temperature doesn't change. But it is wrong to conclude that burning fuel has 0 correlation with the room temperature. In fact, the correlation doesn't exist, since the variance of room temperature is 0, and one must divide by 0 to get the correlation. The point here is not that additional controls are needed, but that constants cannot be analyzed with this methodology, even when complicated causal processes underlie the constant.
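The divide-by-zero point is easy to see directly (a Python sketch, not from the comment): with a perfect thermostat the room temperature series is constant, and the sample correlation is undefined rather than zero.

```python
import numpy as np

room = np.full(100, 20.0)    # a perfect thermostat: zero variance in temperature
fuel = np.random.default_rng(4).normal(15.0, 5.0, 100)

with np.errstate(invalid="ignore"):   # suppress the 0/0 warning
    r = np.corrcoef(fuel, room)[0, 1]
print(r)   # nan -- the correlation is undefined, not zero
```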

Second, "systematic processes within e" are only a problem if those processes are correlated with the Xs in the model, and therefore bias the coefficients. If e is nonrandom, but independent of the Xs, and distributed in a way such that e is equally likely to be positive or negative, and more likely to be close to 0 than farther away from it, then the coefficient estimates are probably fine. Independence from X is generally the big assumption.
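That distinction is easy to demonstrate (a Python sketch, invented data): an error term that is decidedly non-normal and "systematic-looking", but independent of x, still leaves the OLS slope unbiased.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 500

betas = []
for _ in range(reps):
    x = rng.normal(size=n)
    t = rng.uniform(0.0, 2.0 * np.pi, n)
    e = np.sin(3.0 * t)          # non-normal, "systematic-looking", independent of x
    y = 2.0 * x + e              # true slope is 2.0
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    betas.append(b[1])

print(np.mean(betas))   # close to the true slope of 2.0 despite the weird error
```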

Finally, labeling the residual is problematic for two reasons. You mention omitted variables, and this is only an issue if those variables correlate with the Xs, and therefore obscure relationships. In the canonical example, ice cream causes crime, and the heat is in e, but heat correlates with ice cream consumption, which hijacks the effect of heat. I'm more troubled by a second problem, that labeling the residual assumes that the remainder of the variation in Y is known to the researcher. That's arrogant, unscientific, and leaves no room for the important idea that social processes (or economics) are in part naturally random.

Thanks for the post, a really interesting read.

Kindred Winecoff said...

I was waiting for a methodologist to pop in and point out the weak points in the post. I knew there were probably a few. To your specific points:

1. Yes, but again this is a failure of modeling and not of method. If our dependent variable is "temperature in the room" and "amount of fuel burnt" is NOT one of our independent variables, then we will be wrong. The best approach is not to throw up one's hands, but to create a better model. Perhaps, instead of trying to explain the temperature of the room, we try to explain the difference between the temperature in the room and the temperature outside it. Or compare two rooms: one which is heated by burning fuel (as regulated by a thermostat) and one that is not heated.
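That first re-specification can be sketched quickly (Python, invented numbers): a thermostat burns more fuel on colder days and holds the room at its target, so fuel is uncorrelated with room temperature; but switch the dependent variable to the indoor/outdoor gap and the effect of fuel reappears.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000

outside = rng.normal(5.0, 10.0, n)
fuel = np.clip(20.0 - outside, 0.0, None) + rng.normal(0.0, 0.5, n)
room = 20.0 + rng.normal(0.0, 0.3, n)    # thermostat holds the room near 20

print(np.corrcoef(fuel, room)[0, 1])     # near zero: the original puzzle
gap = room - outside                     # re-specified dependent variable
print(np.corrcoef(fuel, gap)[0, 1])      # strongly positive: fuel does cause heat
```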

2. I was trying to keep the post relatively clean of jargon, so I sacrificed some nuance. So I left out the technical discussion of heteroskedasticity and multicollinearity and when they produce bias and when they produce inefficiency. I felt fine doing that for practical reasons: when (in the social science world) is it ever the case that systematic processes in e are not correlated with the Xs? In my experience, if there are dynamics in the model, and there are dynamics in e, then it is overwhelmingly likely that there will be a relationship between the two. Additionally, I would say that our model has big problems if there is systematic variation in e even if there is no relationship between e and X. We're clearly not explaining Y as well as we could.

The bigger point I was trying to make is that we should try to pull systematic variation out of e and model it whenever we can. Which ties into:

3. I agree completely. Though if you look at the comments on Noah Smith's post, plenty of economists are defending the practice as perfectly fine.

Brian Urlacher said...

I had the same reaction to Yglesias' caricature of social science. Thanks for a clear response putting the issue in larger perspective.
