Data need to be normally distributed, and other myths of linear regression

There are four basic assumptions of linear regression. These are:

  1. the mean of the response variable is a linear function of the explanatory variable(s)*;
  2. the residuals are normally distributed with a mean of zero;
  3. the variance of the residuals is the same for all values of the explanatory variables; and
  4. the residuals are independent of each other.

Let’s look at those assumptions in more detail.

1. Linearity

The linearity assumption is perhaps the easiest to consider, and seemingly the best understood. For each unit increase in the explanatory variable, the mean of the response variable increases by the same amount, regardless of the value of the explanatory variable.

[Figure: LinearRegression_Linear]

The mean of the response variable (the line fitted to the data, which are shown as dots) increases at the same rate, regardless of the value of the explanatory variable. This is the basis of the linearity assumption of linear regression.
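
To make this concrete, here is a minimal sketch in R, using simulated data rather than the data in the figure: the fitted slope estimates the constant amount by which the mean of the response changes for each unit increase in the explanatory variable.

    # Simulated example: 10 points with a linear mean and normal noise
    set.seed(42)
    x <- 1:10
    y <- 2 + 0.5 * x + rnorm(10)

    fit <- lm(y ~ x)  # simple linear regression
    coef(fit)         # intercept and slope: change in mean response per unit of x

    plot(x, y)
    abline(fit)       # the fitted line is the estimated mean of the response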

2. Normality

Some users think (erroneously) that the normal distribution assumption of linear regression applies to their data. They might plot their response variable as a histogram and examine whether it differs from a normal distribution. Others assume that the explanatory variable must be normally distributed. Neither is required. The normality assumption relates to the distribution of the residuals, which is assumed to be normal; the regression line is fitted to the data such that the mean of the residuals is zero.

What are the residuals, you ask? These are the values that measure departure of the data from the regression line.

[Figure: LinearRegression_Residuals]

The residuals are the differences between the data and the regression line (red bars in the upper figure). In linear regression, the residuals vary around zero (lower figure). It is these residuals that should be normally distributed.

To examine whether the residuals are normally distributed, we can compare them to what would be expected under a normal distribution. This can be done in a variety of ways. We could bin the values into classes and examine a histogram, or construct a kernel density plot – does it look like a normal distribution? We could construct QQ plots. Or we could calculate the skewness and kurtosis of the distribution to check whether the values are close to those expected of a normal distribution.

With only 10 data points, I won’t do those checks for this example data set. But my point is that we need to check normality of the residuals, not the raw data. You can see in the above example that both the explanatory and response variables are far from normally distributed – they are much closer to a uniform distribution (in fact the explanatory variable conforms exactly to a uniform distribution).
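
As a sketch of those checks in R – again with simulated data, so the numbers are illustrative only – we can inspect the residuals of a fitted model directly:

    # Fit a model to simulated data and extract its residuals
    set.seed(42)
    x <- 1:10
    y <- 2 + 0.5 * x + rnorm(10)
    fit <- lm(y ~ x)
    r <- residuals(fit)

    hist(r)               # histogram: does it look roughly normal?
    qqnorm(r); qqline(r)  # QQ plot: points near the line suggest normality

    z <- (r - mean(r)) / sd(r)  # standardised residuals
    c(skewness = mean(z^3),     # about 0 for a normal distribution
      kurtosis = mean(z^4))     # about 3 for a normal distribution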

3. Equal variance

Linear regression assumes that the variance of the residuals is the same regardless of the value of the response or explanatory variables – this is the assumption of homoscedasticity. If the variance of the residuals varies, they are said to be heteroscedastic. The residuals in our example are not obviously heteroscedastic. If they were, they might look more like this.

[Figure: LinearRegression_Hetero]

The residuals in this example are clearly heteroscedastic, violating one of the assumptions of linear regression; the data vary more widely around the regression line for larger values of the explanatory variable. In the previous example, the variation in the residuals was more similar across the range of the data.
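
A common informal check is to plot the residuals against the fitted values and look for fanning. Here is a sketch in R with simulated data; the commented line shows one formal alternative, the Breusch-Pagan test from the lmtest package (assuming that package is installed):

    set.seed(42)
    x <- 1:10
    y <- 2 + 0.5 * x + rnorm(10)
    fit <- lm(y ~ x)

    plot(fitted(fit), residuals(fit))  # a fan or funnel shape suggests heteroscedasticity
    abline(h = 0, lty = 2)

    # lmtest::bptest(fit)  # Breusch-Pagan test for heteroscedasticity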

4. Independence

The final assumption is that the residuals should be independent of each other. In particular, it is worth checking for serial correlation. Serial correlation is evident if the residuals tend to remain positive or negative for extended runs. In our first example, the residuals seem to switch randomly between positive and negative values – there are no disproportionately long runs of positive or negative values.

In contrast, if we examine the human population growth rate over the period 1965 to 2015, we see that there are extended time periods where the observed growth rate is above the fitted line, and then extended periods when it is below.

[Figure: LinearRegression_AutoCorrelation]

Human population growth rate over the period 1965 to 2015 is serially correlated – there are extended periods when the residuals are positive (data are above the trend line), and extended periods when they are negative (data are below the trend line).

A key point about independence in linear regression is that it is not the values of the response variable that must be independent – indeed, they change in an approximately linear way, so, as implied by the first assumption, the values of the response variable for adjacent data points are similar. It is the residuals that must vary independently of each other.
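
Base R's acf() function gives a quick visual check for serial correlation in the residuals. A sketch with simulated (independent) data follows; the commented line is the Durbin-Watson test from the lmtest package, assuming that package is installed:

    set.seed(42)
    x <- 1:10
    y <- 2 + 0.5 * x + rnorm(10)
    fit <- lm(y ~ x)

    acf(residuals(fit))  # spikes beyond the dashed bands suggest serial correlation

    # lmtest::dwtest(fit)  # Durbin-Watson test for first-order autocorrelation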

So, those are the four basic assumptions of linear regression. If you don’t think your data conform to these assumptions, then it is possible to fit models that relax these assumptions, or at least make different assumptions. We can:

  1. fit non-linear models;
  2. assume distributions other than the normal for the residuals;
  3. model changes in the variance of the residuals;
  4. or model correlation in the residuals.

All these things, and more, are possible.
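
As one illustration (a sketch, not a recipe), the gls() function in the nlme package can relax the equal-variance and independence assumptions: varPower() lets the residual variance change with the fitted mean, and corAR1() allows first-order autocorrelation between successive residuals, which assumes the data are ordered in time. With as few points as the simulated data here, a real analysis would want more data.

    library(nlme)

    # Simulated data stand in for a real data set
    set.seed(42)
    x <- 1:10
    y <- 2 + 0.5 * x + rnorm(10)

    fit_gls <- gls(y ~ x,
                   weights = varPower(),    # residual variance as a power of the mean
                   correlation = corAR1())  # AR1 correlation between successive residuals
    summary(fit_gls)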

 

* To keep things simple, I will only discuss simple linear regression in which there is a single explanatory variable.

Vacation jobs with GHD

GHD is advertising vacation jobs for “undergraduate students at the end of their third year” in the coming summer. I’d imagine that these positions would also suit honours and Masters students who are interested in working in environmental consulting. Several students in the University of Melbourne’s masters programs have been offered vacation employment with GHD in previous years, and they have often then joined GHD via its graduate recruitment program. I understand that applications close on 31 August. For information from GHD, see here:

http://ghd.com.au/global/careers-1/students/

Nuclear energy for biodiversity conservation

We are going to kick off our subject Graduate Seminar: Environmental Science by discussing this recent paper by Barry Brook and Corey Bradshaw:

Key role for nuclear energy in global biodiversity conservation

Here is the abstract:

Modern society uses massive amounts of energy. Usage rises as population and affluence increase, and energy production and use often have an impact on biodiversity or natural areas. To avoid a business-as-usual dependence on coal, oil, and gas over the coming decades, society must map out a future energy mix that incorporates alternative sources. This exercise can lead to radically different opinions on what a sustainable energy portfolio might entail, so an objective assessment of the relative costs and benefits of different energy sources is required. We evaluated the land use, emissions, climate, and cost implications of 3 published but divergent storylines for future energy production, none of which was optimal for all environmental and economic indicators. Using multicriteria decision-making analysis, we ranked 7 major electricity-generation sources (coal, gas, nuclear, biomass, hydro, wind, and solar) based on costs and benefits and tested the sensitivity of the rankings to biases stemming from contrasting philosophical ideals. Irrespective of weightings, nuclear and wind energy had the highest benefit-to-cost ratio. Although the environmental movement has historically rejected the nuclear energy option, new-generation reactor technologies that fully recycle waste and incorporate passive safety systems might resolve their concerns and ought to be more widely understood. Because there is no perfect energy source however, conservation professionals ultimately need to take an evidence-based approach to consider carefully the integrated effects of energy mixes on biodiversity conservation. Trade-offs and compromises are inevitable and require advocating energy mixes that minimize net environmental damage. Society cannot afford to risk wholesale failure to address energy-related biodiversity impacts because of preconceived notions and ideals.

The full paper is here. Have any counter-arguments to this piece been published? Do any such arguments exist? What is the evidence to support these counter-arguments? I look forward to the discussion.

Environmental modelling has little to do with green fashion

A model in the environment. Rarely do people venture into forests attired like this. While unrealistic, the model is used for a purpose. Similarly, environmental models are unrealistic, but are designed that way for a reason (image from treehugger.com).

Environmental modelling has nothing to do with green fashion. Or perhaps it does, but only obliquely – fashion models are usually stylised versions of reality (think make-up, airbrushing, clothing that might not befit the conditions, etc.). Somewhat similarly, environmental models are also stylised versions of reality.

A model of the Vasa, a Swedish warship from the 1620s (image from www.modelships.de).

A conceptual model of the carbon cycle of an African savannah (from Williams et al. Carbon Balance and Management 2007 2:3).

[Figure: DNA_orbit_animated]

DNA does not look exactly like this, but the model helps to understand DNA’s structure. (GIF by Zephyris at the English language Wikipedia).

Environmental models, as with other models, have their own particular purpose. A model of a Swedish warship wouldn’t battle a real Polish fleet. DNA does not look exactly like its model, yet the model helps to understand and communicate its structure. A model of the carbon cycle of an African savannah can help understand the main components of the system and how they are linked, but it is not the real cycle.

Perhaps the most naive criticism of an environmental model is that it is unrealistic. Such a comment is naive because models are meant to be unrealistic. You might think that more realistic models are always better. If so, you would be wrong, because models are designed to be imperfect descriptions of reality.

Hanna Kokko makes this point – that models should be somewhat unrealistic – with an analogy to maps. Imagine you are lost in a forest. A map, a model of reality, would help you find your way home. You don’t need a perfect model of reality; a perfect model would be identical to reality itself, and you have more than enough reality staring you in the face. In fact, it is reality, so complex and cumbersome, that obscures the way home. To navigate efficiently, you need a sufficiently simple model.

If you were lost in a forest, you could find your way home with a map that subscribes to Hanna Kokko’s approach to modelling. If it were too detailed, it would look like the forest itself. If it were too superficial, you might be none the wiser about where you were in the world. A good map would strike the appropriate balance between complexity and simplicity for the task (adapted from Hanna Kokko’s book Modelling for Field Biologists).

However, the model must not be too simple. To paraphrase Einstein, models should be as simple as possible, but no simpler. That is the crux of modelling – a modeller must find the balance between complexity and simplicity for the task at hand.

Another reason we need environmental models is that we often cannot afford to experiment with environmental systems, in case our experiments have unintended consequences. Let’s return to the Swedish warship, the Vasa. Built over two years for King Gustavus Adolphus, who wanted an impressive vessel packed with guns, it was launched in 1628 to great fanfare.

However, the balance between carrying many heavy guns above the waterline and remaining fast and manoeuvrable was delicate – a little too delicate, as it turned out. Less than a mile from its dock, and with a couple of puffs of breeze, the ship leaned over, submerging its lower gun ports. Filling with water, the Vasa promptly, and ignominiously, sank.

A model of the warship – a physical model in the time of the Vasa, or a mathematical model in the modern age – would have been sufficient to help assess the ship’s stability prior to it being built.

The Vasa in real life, salvaged from the waters off Stockholm and now housed in the Vasa Museum (image from the Vasa Museum).

Think of the ship as the world’s environment. Would we want to test different options on the real environment, or test those options on models? With only one Vasa, and with only one world, it might often be prudent to assess options with models first, before implementing them in reality.

Detectability

We’re looking at detectability this week in Environmental Monitoring & Audit. Here are some relevant links:

1. First, check out Guru and Jose’s video explaining why detectability is important in species distribution models (there are also some bloopers).

2. Then we have Georgia’s post about setting minimum survey effort requirements to detect a species at a site.

3. Another post by Georgia, about her trait-based model of detection.

4. And finally, a paper showing that Georgia’s time to detection model can efficiently estimate detectability.

And if you want more about detectability, check out a few posts of mine.

Some statistics to get started

The subject Environmental Monitoring and Audit starts today. We’ll be delving into some statistics, so my introductory chapter on statistical inference for an upcoming book might be useful.

And we’ll be using R, so if you need a quick introduction, check out Liz Martin’s blog.

Edit: And if you want some more information about double sampling (from Angus’ lecture today), please read this blog post.