Friday, April 5, 2013

Residuals: What the hell are they, and why are they important?

Lady luck has blessed me recently with some AP Stats students in need of tutoring, and it's been wonderful getting to tutor in my area of expertise! Not that I dislike helping students with geometry proofs or trigonometric calculations; it's just nice to work with my favorite flavor of mathematics for a change. The perennial "when am I going to use this in real life?" question is a lot easier to answer when it's asked about statistics.

One of the questions that my students have been asking is why they should care about residuals. It's a valid question! High school math classes don't typically have time to cover the reasoning behind their subjects with much depth, so the only thing most high school students know about residuals is "it's bad if the residuals have a pattern." They don't know why it's bad, or what exactly the residuals are, and they're often confused as to why this is the one time in their stats class when they want to see a scatterplot with no correlation at all. Truthfully, I didn't really understand much about the importance of residuals until late in my college career.

To explore residuals, let's use a real-life data set. I do market research for radio stations, and we send out online surveys every week asking radio listeners to rate short song clips. Different radio stations ask for different survey sizes-- we have some that request 40 respondents a week, and a few that request as many as 100 respondents a week. We disqualify respondents who don't follow directions, in an attempt to weed out the survey-takers who aren't really paying attention to what they're doing. It stands to reason that the larger the survey size, the more people we'd end up removing for quality control. And that relationship seems to hold true (data taken from last week's cycle of surveys):



It looks like there's a positive relationship between sample size and number of respondents removed for quality control. At this point, a lot of my students would know to find the least-squares regression line. But not by hand-- calculating a least-squares regression by hand is a soul-crushing endeavor. Maybe do it once on a small dataset, just to understand how it works. But then switch over to the calculator and never go back. The least-squares regression line for our data looks like this:

That equation in the corner tells us that yes, there is a positive relationship between sample size and QC removals! The R-squared value represents the strength of the relationship. Essentially, it means that about 48% of the variation in the number of QC removals is attributable to the size of the survey those respondents came from. The other 52% of the variation is left unexplained-- ideally, it's nothing but random error. Not a great fit, but hey, real-life data is messy.
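(If you're curious what your calculator is doing behind the scenes, here's a rough Python sketch of the same kind of fit. The numbers below are made-up stand-ins, not the actual survey figures:)

```python
import numpy as np

# Stand-in data: (sample size, QC removals) for each survey.
# These are placeholder numbers, not the real survey data.
x = np.array([40, 40, 50, 60, 70, 80, 86, 100])
y = np.array([3, 5, 4, 7, 2, 10, 18, 12])

slope, intercept = np.polyfit(x, y, 1)   # least-squares line: y-hat = slope*x + intercept

y_hat = slope * x + intercept            # predicted QC removals at each sample size
ss_res = np.sum((y - y_hat) ** 2)        # variation left over after the fit
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in QC removals
r_squared = 1 - ss_res / ss_tot          # fraction of variation the line explains

print(f"y-hat = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.2f}")
```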

Now, the residuals! Here's a picture of the same graph, this time highlighting how far above or below the line each data point falls:


That's what a residual is.

A data point's residual represents the difference between what we predicted and what we observed. For example, look at that 70-person survey up there, in the middle of the graph near the bottom. At a sample size of 70, we predict (based on our regression line) that we'd have to remove about 8 respondents for quality control. But we actually only removed 2 respondents from that survey. We ended up with 6 fewer QC removals than we expected, so that data point has a residual of -6.

Or, look at our 86-person survey, near the top. Our regression equation predicts that an 86-person survey would rack up about 11 QC removals, but we actually had to remove 18 people from that survey. We removed 7 more respondents than we expected to, so that data point has a residual of +7.
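In code, a residual is just observed minus predicted. Here's a tiny sketch that reproduces those two examples-- the slope and intercept below are rough estimates consistent with the predictions above (about 8 removals at 70 people, about 11 at 86), not the exact fitted equation:

```python
# Rough coefficients consistent with the predictions above
# (about 8 removals predicted at n = 70, about 11 at n = 86);
# the exact values would come from the regression output.
slope, intercept = 0.19, -5.2

def residual(sample_size, observed_removals):
    """Residual = observed - predicted."""
    predicted = slope * sample_size + intercept
    return observed_removals - predicted

print(round(residual(70, 2)))    # -6: fewer removals than the line predicted
print(round(residual(86, 18)))   # 7: more removals than the line predicted
```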

You've probably guessed by now that if your data points have bigger residuals, your regression line has less predictive power. That's why we want to find the line that best fits our data-- we want to minimize our residuals, so that each data point winds up as close to our predictive line as possible. There's a problem, though. Data points falling below the line have negative residuals, and data points falling above the line have positive residuals. Mathematically, we can't just minimize the sum of all the residuals, because the positives and the negatives cancel each other out. Blast!

Aha, but what's a great way to make the negative residuals positive, while keeping the positive residuals positive as well? Square everything, of course!


Multiplying each residual by itself nicely eliminates our problem with negative values, while preserving the relative sizes of the residuals. We can represent it graphically by turning each residual on our graph into a square whose side length equals the absolute value of the residual (the squares in the picture might not be perfectly square-- bear with me, I've only got MS Paint to work with). The areas of the squares are all positive, so the sum of their areas is positive as well, and that's something we can minimize to find the best-fitting line for our data.
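You can watch that minimization at work in a few lines of Python. Take the least-squares line and nudge it in any direction-- the sum of the squared residuals only gets bigger (stand-in data again):

```python
import numpy as np

# Same placeholder data as before, not the real survey numbers.
x = np.array([40, 40, 50, 60, 70, 80, 86, 100])
y = np.array([3, 5, 4, 7, 2, 10, 18, 12])

def sum_of_squares(slope, intercept):
    """Total area of the residual squares for a candidate line."""
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

best_slope, best_intercept = np.polyfit(x, y, 1)

print(sum_of_squares(best_slope, best_intercept))         # the minimum
print(sum_of_squares(best_slope + 0.05, best_intercept))  # nudged: bigger
print(sum_of_squares(best_slope, best_intercept - 2))     # nudged: bigger
```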

That's why we call it a least-squares regression.

You're welcome.


"Okay, okay," you say. "I understand what a residual is, but I still don't understand why we graph them on their own, or why they shouldn't have a correlation." Hold your hoofbeasts, I'm getting to that.

The whole reason I went to the trouble of gathering data on survey size versus number of QC disqualifications is that I wanted to identify which stations had abnormally high numbers of junk respondents. I knew that just because our 100-person surveys had lots of disqualifications didn't necessarily mean they were abnormal; more people overall means more people who get disqualified, no matter what. I wanted to know if there were certain station formats that tended to have positive residuals, and some that tended to have negative residuals. In other words, which kinds of radio stations have more QC disqualifications than we could reasonably expect given their sample size?



One of the most common ways to display residuals is to place the x-axis in the center of the graph, then plot each point at its original x-value, with its residual as the y-value. We want to display the same data, but in such a way that the original "tilt" of the positive or negative relationship is nullified. If you can picture it, it's like taking the least-squares regression line and laying it horizontally on the x-axis, keeping all the data points the same vertical distance from the line as before.
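Building that plot programmatically is just a matter of plotting each x-value against its residual instead of its y-value. A minimal matplotlib sketch, still on the placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the survey numbers.
x = np.array([40, 40, 50, 60, 70, 80, 86, 100])
y = np.array([3, 5, 4, 7, 2, 10, 18, 12])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)   # vertical distance from the line

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")            # the regression line, laid flat
plt.xlabel("Sample size")
plt.ylabel("Residual (observed - predicted QC removals)")
plt.show()
```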

Remember how the R-squared value told us that 48% of our variation could be accounted for by the regression relationship? A residual plot shows all the variation that's left over once we remove the regression relationship. What we should see here is the remaining 52%-- pure random error-- and that's why you don't want to see a pattern in your residual plot. If there's a pattern in your residuals, it means there's some kind of relationship in your data that you didn't account for with your original least-squares regression. Maybe the relationship is non-linear and you should try a quadratic regression instead? Maybe you need to fit the log values of your data instead of the raw values? But what to do when your residuals display correlated behavior is another blog post entirely.
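If you do spot a curve or a fan shape, the usual fix is a different model. Here's a quick sketch of those two escape hatches (same stand-in data as above):

```python
import numpy as np

# Placeholder data, as in the earlier sketches.
x = np.array([40, 40, 50, 60, 70, 80, 86, 100])
y = np.array([3, 5, 4, 7, 2, 10, 18, 12])

# Residuals that bow up or down suggest curvature: try a quadratic fit.
a, b, c = np.polyfit(x, y, 2)             # y-hat = a*x^2 + b*x + c

# Residuals that fan out as x grows suggest fitting the log of y instead.
log_slope, log_intercept = np.polyfit(x, np.log(y), 1)
```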

Overall, there's no real pattern in my residual plot, even after I added some colors to represent station formats. (Protip: adjusting the color or shape of your data points is a great way to display interesting qualitative information about quantitative bivariate data!) No individual format is really the worst offender here. If, say, all the CHR points fell below the axis, we might conclude that listeners of top 40 radio are less likely than others to get disqualified for noncompliance with survey instructions. Similarly, if all the Country stations fell above the line, we could say that country fans are more likely than others to disobey simple instructions and get kicked out of surveys. (I rag on the country fans too much. They're really very pleasant people!)
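Color-coding by format takes only one extra loop in matplotlib. The format labels below are hypothetical-- this just shows the mechanics:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; the format labels are invented for illustration.
x = np.array([40, 40, 50, 60, 70, 80, 86, 100])
y = np.array([3, 5, 4, 7, 2, 10, 18, 12])
formats = np.array(["CHR", "Country", "AC", "CHR", "Country", "AC", "CHR", "Country"])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

for fmt in np.unique(formats):
    mask = formats == fmt
    plt.scatter(x[mask], residuals[mask], label=fmt)  # one color per format

plt.axhline(0, linestyle="--")
plt.legend(title="Station format")
plt.show()
```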

So what have we learned about sample sizes and quality control issues? The size of the survey has a positive effect on the number of folks we'll remove for quality control, but there doesn't seem to be a second, lurking variable when we check the residuals. Also, station format does not seem to affect the number of QC removals when we adjust for survey size.

We've also learned that we calculate residuals by measuring the distance between our observations and our regression line, but that we can only fit that regression line by minimizing the squared values of those very residuals. Stats is like that sometimes.


