Thursday, December 26, 2013

Correlation and Causation (It's Kind of a Big Deal)

My sister sent me a link to an article titled "Earn more money, when you have more sex, study says (seriously!)". Being a math teacher herself, she had some doubts about the article's claims. She linked me to the study on which the article was based and passed along her thoughts on its methods:
Is it me, or are the guy's statistical interpretations (from the journal article) not all correct? Sketchy claims include: 

"For both sexes, in Panel I, we observe that a one standard deviation increase in sexual activity increases hourly wages by 3.2%, other things being equal." 

"The importance of the sexual activity variable can also be assessed by the fact that if we regress a single wage equation without the sexual activity variable, the R^2 is 0.821, while if we consider the sexual activity variable (as in Table 5), the R^2 is 0.842. In other words, the wage estimation becomes more precise if we consider the sexual activity variable." 

Um, R^2 always goes up when you add an additional variable, that's why we use adjusted R^2?

Not to mention the implied causation claim. I am not surprised at the correlation since there's a documented positive correlation between being married and having higher income, but really "Earn more money when you have more sex"?
My sister hit the nail on the head with the R-squared thing. To quote the stats textbook I keep in my cubicle for the purpose of making my coworkers think I'm smart, "in practice, the best model found by the R-squared criterion will rarely be the model with the largest R-squared." Adding explanatory variables to a regression equation will never decrease R-squared, and in practice it almost always nudges it up, whether or not those variables are truly significant. Adjusted R-squared penalizes each additional explanatory term, so it rises only when a new variable improves the fit by more than chance alone would be expected to. That often makes it a better way to determine which variables are worth retaining in a regression equation, and it helps temper the impulse to cram in as many explanatory variables as possible. I would not interpret the study's jump in R-squared from 0.821 to 0.842 as conclusive evidence that the sexual activity variable is worth retaining in the final regression equation.
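To see the R-squared point in action, here is a rough sketch in plain Python. Everything in it is made up for illustration: the data are simulated, the `noise` column is unrelated to `y` by construction, and `solve` and `r_squared` are small helper functions I wrote for this example, not anything from the study. It fits an ordinary least squares model with and without the junk variable and reports both R-squared and adjusted R-squared, using the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1) for n observations and p explanatory variables.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def r_squared(columns, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    n, p = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1  # number of coefficients, including the intercept
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj


random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]       # a real predictor
noise = [random.gauss(0, 1) for _ in range(n)]   # junk: unrelated to y
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

r2_small, adj_small = r_squared([x], y)
r2_big, adj_big = r_squared([x, noise], y)

print(f"x only:     R^2 = {r2_small:.4f}, adjusted R^2 = {adj_small:.4f}")
print(f"x + noise:  R^2 = {r2_big:.4f}, adjusted R^2 = {adj_big:.4f}")
```

R-squared for the bigger model can't come out lower than for the smaller one, even though the extra column is pure noise. Whether adjusted R-squared falls depends on the random draw: it rises only when the new variable's t-statistic is larger than 1 in absolute value, which a junk variable will manage some of the time but not reliably.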

The causation claim raised my suspicions as well. It's very difficult, often impossible, to prove causation in an observational study like this one, because there are a million confounding variables out there. "Correlation doesn't prove causation" has become the go-to comeback favored by science fans the world over. It's an easy shot to take at almost any of these attention-grabbing "studies show" headlines. It's so easy it could feel like a cheat code for sounding smart, if it weren't so often true and applicable.
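If you'd like to watch a confounder manufacture a correlation out of thin air, here is a small simulation sketch. The setup is entirely hypothetical and has nothing to do with the study's actual data: a hidden factor `z` drives both `x` and `y`, and neither `x` nor `y` has any effect on the other. The helpers `pearson` and `residuals` are my own, written just for this example.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(target, predictor):
    """What is left of target after a simple linear regression on predictor."""
    n = len(target)
    mean_t, mean_p = sum(target) / n, sum(predictor) / n
    slope = (sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
             / sum((p - mean_p) ** 2 for p in predictor))
    return [t - mean_t - slope * (p - mean_p) for p, t in zip(predictor, target)]


random.seed(1)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]    # hidden confounder
x = [zi + random.gauss(0, 1) for zi in z]     # driven by z, not by y
y = [zi + random.gauss(0, 1) for zi in z]     # driven by z, not by x

raw = pearson(x, y)
# Strip out the part of x and y explained by z, then correlate what remains.
controlled = pearson(residuals(x, z), residuals(y, z))

print(f"correlation of x and y:               {raw:.3f}")
print(f"correlation after controlling for z:  {controlled:.3f}")
```

In this setup the raw correlation should land near 0.5, and it should shrink to roughly zero once `z` is accounted for. The catch in a real observational study is that you can only control for the confounders you thought to measure.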

This excellent needlepoint design is sadly sold out on Etsy.