# Data need to be normally-distributed, and other myths of linear regression

There are four basic assumptions of linear regression. These are:

1. the mean of the response variable is a linear function of the explanatory variable(s)*;
2. the residuals are normally distributed with mean of zero;
3. the variance of the residuals is the same for all values of the explanatory variables; and
4. the residuals should be independent of each other.

Let’s look at those assumptions in more detail.

1. Linearity

The linearity assumption is perhaps the easiest to consider, and seemingly the best understood. For each unit increase in the explanatory variable, the mean of the response variable changes by the same amount, regardless of the value of the explanatory variable. In other words, the mean of the response variable (the line, which is fitted to the data, shown as dots) changes at a constant rate across the whole range of the explanatory variable. This is the basis of the linearity assumption of linear regression.
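The statement above can be written as a short worked equation. The symbols here are my own labels rather than notation from the post: $y$ is the response variable, $x$ the explanatory variable, $\alpha$ the intercept and $\beta$ the slope.

$$
\mathrm{E}[y \mid x] = \alpha + \beta x
\quad\Longrightarrow\quad
\mathrm{E}[y \mid x + 1] - \mathrm{E}[y \mid x]
= \bigl(\alpha + \beta (x + 1)\bigr) - \bigl(\alpha + \beta x\bigr)
= \beta
$$

The difference does not depend on $x$, which is exactly what "the same amount, regardless of the value of the explanatory variable" means.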

2. Normality

Some users think (erroneously) that the normal distribution assumption of linear regression applies to their data. They might plot their response variable as a histogram and examine whether it differs from a normal distribution. Others assume that the explanatory variable must be normally distributed. Neither is required. The normality assumption relates to the distribution of the residuals. These are assumed to be normally distributed, and the regression line is fitted to the data such that the mean of the residuals is zero.

What are the residuals, you ask? These are the values that measure departure of the data from the regression line. The residuals are the differences between the data and the regression line (red bars in upper figure). The residuals deviate around a value of zero in linear regression (lower figure). It is these residuals that should be normally distributed.
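As a minimal sketch of that definition, the snippet below fits a straight line by ordinary least squares and computes the residuals. The ten data points are hypothetical (they are not the data plotted in this post), and the variable names are mine.

```python
# Hypothetical data (not the data plotted in this post): ten evenly spaced
# x values and made-up responses.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.4, 4.8, 6.5, 6.9, 8.2, 9.4, 9.7, 11.3]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares estimates for a single explanatory variable.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed response minus the value on the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

print([round(r, 2) for r in residuals])
# With an intercept in the model, the residuals average to zero
# (up to floating-point rounding), so this prints True.
print(abs(mean_residual) < 1e-9)
```

Some residuals are positive and some are negative, and a least-squares fit with an intercept forces them to average zero.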

To examine whether the residuals are normally distributed, we can compare them to what would be expected under a normal distribution. This can be done in a variety of ways. We could bin the residuals into classes and examine a histogram, or construct a kernel density plot: does it look like a normal distribution? We could construct QQ plots. Or we could calculate the skewness and kurtosis of the residuals to check whether the values are close to those expected of a normal distribution.

With only 10 data points, I won’t do those checks for this example data set. But my point is that we need to check normality of the residuals, not the raw data. You can see in the above example that both the explanatory and response variables are far from normally distributed – they are much closer to a uniform distribution (in fact the explanatory variable conforms exactly to a uniform distribution).
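To make the distinction concrete, here is a rough sketch using simulated data (an assumption of mine, not the example in this post). The explanatory variable is evenly spaced, so the response is close to uniform, while the noise around the line is normal. A normal distribution has skewness 0 and excess kurtosis 0; a uniform distribution has excess kurtosis of -1.2. The function names are hypothetical helpers.

```python
import random


def skewness_and_excess_kurtosis(values):
    """Simple moment-based estimates of skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def ols_residuals(x, y):
    """Residuals from a least-squares straight line with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Simulated data: evenly spaced x, normal noise around a straight line.
n = 2000
x = [10 * i / (n - 1) for i in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]

residuals = ols_residuals(x, y)

skew_y, kurt_y = skewness_and_excess_kurtosis(y)
skew_r, kurt_r = skewness_and_excess_kurtosis(residuals)

print("response:  skewness %.2f, excess kurtosis %.2f" % (skew_y, kurt_y))
print("residuals: skewness %.2f, excess kurtosis %.2f" % (skew_r, kurt_r))
```

The response variable should look far from normal (excess kurtosis near that of a uniform distribution), while the residuals should look close to normal. That is the situation in which linear regression is perfectly appropriate.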

3. Equal variance

Linear regression assumes that the variance of the residuals is the same regardless of the value of the response or explanatory variables – the issue of homoscedasticity. If the variance of the residuals varies, they are said to be heteroscedastic. The residuals in our example are not obviously heteroscedastic. If they were, they might look more like this. The residuals in this example are clearly heteroscedastic, violating one of the assumptions of linear regression; the data vary more widely around the regression line for larger values of the explanatory variable. In the previous example, the variation in the residuals was more similar across the range of the data.
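One crude way to put a number on this is to split the residuals at the median of the explanatory variable and compare their spread in the two halves. The sketch below does that for two simulated data sets (my assumption, not the data in this post): one with constant noise and one where the noise grows with the explanatory variable. The helper names are hypothetical, and this is an informal check rather than a formal test.

```python
import random


def ols_residuals(x, y):
    """Residuals from a least-squares straight line with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def spread_ratio(x, residuals):
    """Mean squared residual in the larger-x half divided by the smaller-x half."""
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = [r * r for _, r in pairs[:half]]
    high = [r * r for _, r in pairs[half:]]
    return (sum(high) / len(high)) / (sum(low) / len(low))


rng = random.Random(2)  # fixed seed so the sketch is repeatable

n = 400
x = [1 + 9 * i / (n - 1) for i in range(n)]

# Same straight line in both cases; only the noise differs.
y_equal = [1 + 2 * xi + rng.gauss(0, 1.0) for xi in x]         # constant spread
y_growing = [1 + 2 * xi + rng.gauss(0, 0.3 * xi) for xi in x]  # spread grows with x

ratio_equal = spread_ratio(x, ols_residuals(x, y_equal))
ratio_growing = spread_ratio(x, ols_residuals(x, y_growing))

print("constant noise: ratio %.2f" % ratio_equal)
print("growing noise:  ratio %.2f" % ratio_growing)
```

A ratio near 1 is consistent with equal variance; a ratio well above 1 says the residuals fan out as the explanatory variable increases.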

4. Independence

The final assumption is that the residuals should be independent of each other. In particular, it is worth checking for serial correlation. Serial correlation is evident if the residuals stay positive, or stay negative, over long runs of consecutive data points. In our first example, the residuals seem to randomly switch between positive and negative values – there are not disproportionately long runs of positive or negative values.

In contrast, if we examine the human population growth rate over the period 1965 to 2015, we see that there are extended periods when the observed growth rate is above the fitted trend line (the residuals are positive), and extended periods when it is below (the residuals are negative). The residuals for human population growth rate over this period are therefore serially correlated.

A key point about independence in linear regression is that the values of the response variable themselves are not independent of each other; in fact, they change approximately linearly! Indeed, this is related to the first assumption that I listed, such that the values of the response variable for adjacent data points are similar. It is the residuals that must vary independently of each other.
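A simple way to see the difference is the lag-1 autocorrelation: the correlation between each value and the one that follows it. The sketch below uses simulated data (my assumption; it is not the population growth data mentioned above), with one series whose noise is independent and one whose noise carries over from one point to the next. The helper names and the carry-over value of 0.8 are hypothetical choices for illustration.

```python
import random


def lag1_autocorrelation(values):
    """Correlation between consecutive values in a sequence."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


def ols_residuals(x, y):
    """Residuals from a least-squares straight line with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


rng = random.Random(3)  # fixed seed so the sketch is repeatable

n = 500
x = list(range(n))

# Independent noise: each value is drawn without reference to the last.
independent_noise = [rng.gauss(0, 1) for _ in range(n)]

# Serially correlated noise: each value keeps 0.8 of the previous one.
correlated_noise = []
previous = 0.0
for _ in range(n):
    previous = 0.8 * previous + rng.gauss(0, 1)
    correlated_noise.append(previous)

y_independent = [5 + 0.05 * xi + e for xi, e in zip(x, independent_noise)]
y_correlated = [5 + 0.05 * xi + e for xi, e in zip(x, correlated_noise)]

r_response = lag1_autocorrelation(y_independent)
r_resid_independent = lag1_autocorrelation(ols_residuals(x, y_independent))
r_resid_correlated = lag1_autocorrelation(ols_residuals(x, y_correlated))

print("response values:                %.2f" % r_response)
print("residuals, independent noise:   %.2f" % r_resid_independent)
print("residuals, correlated noise:    %.2f" % r_resid_correlated)
```

The response values are strongly correlated with their neighbours simply because of the trend, and that is fine. What matters is the residuals: near zero in the first case, clearly positive in the second.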

So, those are the four basic assumptions of linear regression. If you don’t think your data conform to these assumptions, then it is possible to fit models that relax these assumptions, or at least make different assumptions. We can:

1. fit non-linear models;
2. assume distributions other than the normal for the residuals;
3. model changes in the variance of the residuals; or
4. model correlation in the residuals.

All these things, and more, are possible.

* To keep things simple, I will only discuss simple linear regression in which there is a single explanatory variable.

This entry was posted in Environmental Modelling by Michael McCarthy.