Regarding gender bias in ecology, it is worth watching this video featuring Professors Emma Johnston and Mark Burgman. Emma Johnston also has a recent article in The Conversation.
Regarding reproducibility, read this, this, and this.
We also discussed topics that we want to cover in the class, things that we like about The University of Melbourne, and things we don’t like. We’ll post about those things shortly.
Let’s look at those assumptions in more detail.
1. Linearity
The linearity assumption is perhaps the easiest to consider, and seemingly the best understood. For each unit increase in the explanatory variable, the mean of the response variable increases by the same amount, regardless of the value of the explanatory variable.
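To make that concrete, here is a minimal sketch with simulated data (the numbers are hypothetical, not from the example in this post): the mean of the response rises by a fixed amount – the slope – for each unit increase in the explanatory variable, and fitting a straight line recovers that slope.

```python
import numpy as np

# Simulated data obeying the linearity assumption (hypothetical values):
# the mean of y increases by 3 units for every unit increase in x.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)                 # explanatory variable
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)   # response with normal noise

slope, intercept = np.polyfit(x, y, 1)
# The fitted slope estimates that constant per-unit increase (about 3 here).
```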
2. Normality
Some users think (erroneously) that the normal distribution assumption of linear regression applies to their raw data. They might plot their response variable as a histogram and examine whether it differs from a normal distribution. Others assume that the explanatory variable must be normally distributed. Neither is required. The normality assumption relates to the distribution of the residuals: these are assumed to be normally distributed, and the regression line is fitted to the data such that the mean of the residuals is zero.
What are the residuals, you ask? They are the values that measure the departure of each data point from the regression line.
To examine whether the residuals are normally distributed, we can compare them to what would be expected. This can be done in a variety of ways. We could inspect the residuals by binning the values into classes and examining a histogram, or by constructing a kernel density plot – does it look like a normal distribution? We could construct a QQ plot. Or we could calculate the skewness and kurtosis of the distribution to check whether the values are close to those expected of a normal distribution.
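As a sketch of that last check, the snippet below fits a regression to simulated data (hypothetical values, not the 10-point example in this post), extracts the residuals, and computes their skewness and excess kurtosis directly; both should be near zero if the residuals are normally distributed.

```python
import numpy as np

# Hypothetical simulated data with genuinely normal errors.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, 200)

# Fit the regression and compute the residuals: data minus fitted line.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Standardize, then compute skewness and excess kurtosis;
# a normal distribution has both close to zero.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z**3)
excess_kurtosis = np.mean(z**4) - 3.0
```

Note that the mean of the residuals is zero by construction when the line is fitted by least squares with an intercept.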
With only 10 data points, I won’t do those checks for this example data set. But my point is that we need to check normality of the residuals, not the raw data. You can see in the above example that both the explanatory and response variables are far from normally distributed – they are much closer to a uniform distribution (in fact the explanatory variable conforms exactly to a uniform distribution).
3. Equal variance
Linear regression assumes that the variance of the residuals is the same regardless of the value of the response or explanatory variables – the issue of homoscedasticity. If the variance of the residuals varies, they are said to be heteroscedastic. The residuals in our example are not obviously heteroscedastic. If they were, they might look more like this.
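A quick way to sketch the contrast is to simulate residuals with constant spread and residuals whose spread grows with the explanatory variable, then correlate the absolute residuals with that variable (a crude informal check; the data here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 300)

# Homoscedastic residuals: constant standard deviation.
resid_homo = rng.normal(0, 1.0, 300)
# Heteroscedastic residuals: standard deviation increases with x.
resid_hetero = rng.normal(0, 0.2 + 0.3 * x)

# Crude check: correlation of |residual| with x should be near zero
# when the variance is constant, and clearly positive when it grows.
r_homo = np.corrcoef(x, np.abs(resid_homo))[0, 1]
r_hetero = np.corrcoef(x, np.abs(resid_hetero))[0, 1]
```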
4. Independence
The final assumption is that the residuals should be independent of each other. In particular, it is worth checking for serial correlation. Correlation is evident if the residuals have patterns where they remain positive or negative. In our first example, the residuals seem to randomly switch between positive and negative values – there are not disproportionately long runs of positive or negative values.
In contrast, if we examine the human population growth rate over the period 1965 to 2015, we see that there are extended time periods where the observed growth rate is above the fitted line, and then extended periods when it is below.
A key point about independence in linear regression is that the values of the response variable themselves are not independent – in fact, there is an approximate linear trend! Indeed, this is related to the first assumption that I listed: the values of the response variable for adjacent data points are similar. But the residuals must vary independently of each other.
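One simple numerical check for serial correlation is the lag-1 autocorrelation of the residuals: the correlation between the residual series and itself shifted by one step. The sketch below (hypothetical simulated series, not the population-growth data) compares independent residuals with a first-order autoregressive series, which has the long runs of like-signed values described above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Independent residuals: lag-1 autocorrelation near zero.
indep = rng.normal(0, 1, n)

# Serially correlated (AR(1)) residuals: each value depends on the last.
ar1 = np.empty(n)
ar1[0] = rng.normal()
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + rng.normal(0, 1)

def lag1_autocorr(r):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(r[:-1], r[1:])[0, 1]

rho_indep = lag1_autocorr(indep)
rho_ar1 = lag1_autocorr(ar1)
```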
So, those are the four basic assumptions of linear regression. If you don’t think your data conform to these assumptions, then it is possible to fit models that relax these assumptions, or at least make different assumptions. We can, for example:
- transform the variables, or fit curves rather than straight lines;
- use models that assume distributions other than the normal for the residuals (e.g., generalized linear models);
- model the variance of the residuals as a function of other variables; or
- model correlation between the residuals (e.g., with autoregressive error structures).
All these things, and more, are possible.
* To keep things simple, I will only discuss simple linear regression in which there is a single explanatory variable.
Here is the abstract:
Modern society uses massive amounts of energy. Usage rises as population and affluence increase, and energy production and use often have an impact on biodiversity or natural areas. To avoid a business-as-usual dependence on coal, oil, and gas over the coming decades, society must map out a future energy mix that incorporates alternative sources. This exercise can lead to radically different opinions on what a sustainable energy portfolio might entail, so an objective assessment of the relative costs and benefits of different energy sources is required. We evaluated the land use, emissions, climate, and cost implications of 3 published but divergent storylines for future energy production, none of which was optimal for all environmental and economic indicators. Using multicriteria decision-making analysis, we ranked 7 major electricity-generation sources (coal, gas, nuclear, biomass, hydro, wind, and solar) based on costs and benefits and tested the sensitivity of the rankings to biases stemming from contrasting philosophical ideals. Irrespective of weightings, nuclear and wind energy had the highest benefit-to-cost ratio. Although the environmental movement has historically rejected the nuclear energy option, new-generation reactor technologies that fully recycle waste and incorporate passive safety systems might resolve their concerns and ought to be more widely understood. Because there is no perfect energy source however, conservation professionals ultimately need to take an evidence-based approach to consider carefully the integrated effects of energy mixes on biodiversity conservation. Trade-offs and compromises are inevitable and require advocating energy mixes that minimize net environmental damage. Society cannot afford to risk wholesale failure to address energy-related biodiversity impacts because of preconceived notions and ideals.
The full paper is here. Have any counter arguments to this piece been published? Do any such arguments exist? What is the evidence to support these counter arguments? I look forward to the discussion.
Environmental modelling has nothing to do with green fashion. Or perhaps it does, but only obliquely – fashion models are usually stylised versions of reality (think make-up, air brushing, clothing that might not befit the conditions, etc). Somewhat similarly, environmental models are also stylised versions of reality.
Environmental models, as with other models, have their own particular purpose. A model of a Swedish warship wouldn’t battle a real Polish fleet. DNA does not look exactly like its model, yet the model helps to understand and communicate its structure. A model of the carbon cycle of an African savannah can help understand the main components of the system and how they are linked, but it is not the real cycle.
Perhaps the most naive criticism of an environmental model is that it is unrealistic. Such a comment is naive because models are meant to be unrealistic. You might think that more realistic models are always better. If so, you would be wrong, because models are designed to be imperfect descriptions of reality.
Hanna Kokko makes this point that models should be somewhat unrealistic with an analogy to maps. Imagine you are lost in the forest. A map, a model of reality, would help find your way home. You don’t need a perfect model of reality. A perfect model of reality would be identical to reality itself, and you have more than enough reality staring you in the face. In fact, it is the reality, so complex and cumbersome, that obscures the way home. To navigate efficiently, you’d need a sufficiently simple model.
However, the model must not be too simple. To paraphrase Einstein, models should be as simple as possible, but no simpler. That is the crux of modelling – a modeller must find the balance between complexity and simplicity for the task at hand.
Another reason we need environmental models is that we often cannot afford to experiment with environmental systems in case our experiments have unintended consequences. Let’s return to the Swedish warship: the Vasa. Built over two years for King Gustavus Adolphus, who wanted an impressive vessel packed with many guns, it launched in 1628 to great fanfare.
However, the balance between lots of heavy guns above the waterline, and a fast and manoeuvrable warship was delicate – a little too delicate as it turned out. Less than a mile from its dock and with a couple of puffs of breeze, the ship leaned over, submerging its lower gun ports. Filling with water, the Vasa promptly, and ignominiously, sank.
A model of the warship – a physical model in the time of the Vasa, or a mathematical model in the modern age – would have been sufficient to help assess the ship’s stability prior to it being built.
Think of the ship as the world’s environment. Would we want to test different options on the real environment, or test those options on models? With only one Vasa, and with only one world, it might often be prudent to assess options with models first, before implementing them in reality.
1. First, check out Guru and Jose’s video explaining why detectability is important in species distribution models (there’s also some bloopers).
2. Then we have Georgia’s post about setting minimum survey effort requirements to detect a species at a site.
3. Another by Georgia about her trait-based model of detection.
4. And finally, a paper showing that Georgia’s time to detection model can efficiently estimate detectability.
And if you want more about detectability, check out a few posts of mine.
And we’ll be using R, so if you need a quick introduction, check out Liz Martin’s blog.
Edit: And if you want some more information about double sampling (from Angus’ lecture today), please read this blog post.
These cool changes can be dramatic. Temperatures can drop from around 40 °C to 25 °C within an hour – a drop of 25–30 °F for those working in Fahrenheit. Maximums can differ by more than 20 °C from one day to the next. Cool changes often arrive in Melbourne after many days of sweltering heat; you can almost hear the city of 4 million sigh.
Predicting the timing of summer cool changes is important for various reasons regarding public safety, including bushfire management. The winds before and after a cool change are often strong, so bushfires can be extremely intense at this time. The worst fire events in Victoria are typically associated with these wind changes. Fires that might have spread along quite narrow fronts under north-westerlies can have massive fronts when the wind switches to the south west. The Kilmore East fire of 7 February 2009 (“Black Saturday”) is one example.
If you need to know when a change will occur, you should ask a weather forecaster. Weather systems in Melbourne typically move from west to east, and cold fronts that bring the change certainly match this pattern. While weather forecasters use models of atmospheric dynamics to predict the passage of these cold fronts, most of us don’t have access to the necessary computer power, data and expertise to solve the equations required to analyze these models.
So what should we do if we want to DIY? Thanks to the Australian Bureau of Meteorology (BoM), we can access data for a range of weather stations to the west of Melbourne. These weather stations record wind direction and temperature, and the BoM displays these data on its website every half hour, and sometimes more frequently. So we can watch the cool change approach.
But can we do more? If we wanted to model the passage of a cold front to predict the timing of a wind change, how might we do that without the aid of numerical weather forecasting?
Let’s overlay a model of a cold front at Aireys Inlet on the map of weather stations. Cold fronts are usually aligned at an approximate 45 degree angle. Imagine it sweeping from west to east. What would be the simplest model for this cold front? Well, we might represent the cold front as a straight line and have it progressing at a constant speed to the east. Let’s assume the cold front is currently at Aireys Inlet (dark line), and we are interested in predicting where it will be at some time in the future (grey line).
This model has two parameters that we need to estimate. We need to know the slope of the cold front and its speed. Thinking of the model in this way helps us realise how it might be wrong – the cold front might not be a straight line (it might be curved), and it might not move at a constant velocity (it might change speed or direction). For example, a curved front slipping away to the south east might take longer to arrive than anticipated.
Bearing these simplifications in mind, we will plough on with our simple model, and leave more realistic ones to the experts. We can define the model geometrically. Think of the location of Aireys Inlet as being the origin of an x-y graph, so Aireys Inlet has coordinates (0, 0). Melbourne is approximately 76 km east of Aireys Inlet and 72 km north, so Melbourne has coordinates (76, 72). We can define the coordinates of all the other weather stations (and all other locations) in a similar way. A negative value for the x-value of the coordinate indicates that the site is to the west of Aireys Inlet and a negative y-value indicates the site is to the south of Aireys Inlet.
When the front is at Aireys Inlet, the equation defining its location is y = −bx (with b, a positive number, defining the backward slope of the front). If the front is moving eastward at a speed of v km/hour, then after t hours, the front will be vt kilometres to the east. So, the equation defining the location of the front at some other time is y = −b(x − vt).
The location and time in this equation is relative to a reference location; in this case I chose Aireys Inlet. So a negative value for time t indicates the passage of the front at a particular location prior to it arriving at Aireys Inlet.
We can manipulate the equation y = −b(x − vt) to determine the time of arrival of the front for any location x and y by solving for t. Thus:
x − vt = −y/b
−vt = −y/b − x
t = y/(bv) + x/v
This tells us that the time of arrival of the front at a particular location depends on the coordinates of the location (x, y), and the speed (v) and slope (b) of the front. So to determine the arrival time, we must estimate the two parameters b and v. If the front is at Aireys Inlet, then it will have passed at least some of the other weather stations, so we will know when it arrived at those locations. Therefore, we can fit the observed times and locations of the passage of the front to the equation t = y/(bv) + x/v to estimate b and v.
A simple way to estimate b and v is to construct the model as a linear regression. Manipulating the equation (by dividing both sides by x), we have:
t/x = (y/x)/(bv) + 1/v,
in which the variable t/x changes linearly with the variable y/x, with slope 1/(bv) and intercept 1/v.
This is simply a linear regression of the form Y = mX + c, based on the transformed variables Y = t/x and X = y/x. The speed and slope of the front are defined by the regression coefficients, and are v = 1/c and b = c/m.
Let’s apply that to some data on the passage of a cold front. Melbournians might remember the front that arrived on 17 January 2014 after a few days with maximums above 40°C. I’m sure tennis players in the Australian Open remember it – seeing Snoopy anyone?
Here are the recorded times of the passage of the cold front at weather stations prior to its arrival at Aireys Inlet. The column t is the number of hours relative to arrival at Aireys Inlet. For example, the front arrived at Mount Gellibrand 15 minutes (0.25 hours) prior to its arrival at Aireys Inlet.
| Location | x (km) | y (km) | Time | t (hours) | y/x | t/x |
|---|---|---|---|---|---|---|
| Port Fairy | −162.77 | 0.99 | 10:46 | −2.77 | −0.00611 | 0.01700 |
| Warrnambool | −144.09 | 13.07 | 11:10 | −2.37 | −0.09072 | 0.01643 |
| Hamilton | −182.00 | 82.39 | 11:48 | −1.73 | −0.45269 | 0.00952 |
| Cape Otway | −48.93 | −46.16 | 12:03 | −1.48 | 0.94350 | 0.03032 |
| Mortlake | −117.20 | 38.83 | 12:09 | −1.38 | −0.33131 | 0.01180 |
| Westmere | −104.02 | 79.46 | 13:08 | −0.40 | −0.76391 | 0.00385 |
| Mount Gellibrand | −27.07 | 24.66 | 13:17 | −0.25 | −0.91092 | 0.00924 |
| Aireys Inlet | 0.00 | 0.00 | 13:32 | 0.00 | – | – |
The linear regression of t/x versus y/x yields m = 0.0133 and c = 0.0171. Therefore, v = 58.5 km/hour and b = 1.28. The value of v means the front was estimated to be moving eastward at 58.5 km/hour, and the value of b implies it was approximately aligned at an angle of tan^{−1}(1.28) = 52° above the horizontal (b = 1 would imply an angle of 45°).
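The fit can be reproduced with a few lines of code, using the y/x and t/x values from the table above for the seven stations west of Aireys Inlet (any language would do; this sketch uses Python).

```python
import numpy as np

# y/x and t/x for the seven stations west of Aireys Inlet (from the table).
yx = np.array([-0.00611, -0.09072, -0.45269, 0.94350,
               -0.33131, -0.76391, -0.91092])
tx = np.array([0.01700, 0.01643, 0.00952, 0.03032,
               0.01180, 0.00385, 0.00924])

# Least-squares regression of t/x against y/x: slope m and intercept c.
m, c = np.polyfit(yx, tx, 1)

v = 1 / c   # eastward speed of the front (km/hour)
b = c / m   # slope of the front

print(round(m, 4), round(c, 4), round(v, 1), round(b, 2))  # → 0.0133 0.0171 58.5 1.28
```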
Using those parameters, the time at which the front is expected to arrive at a location with coordinates (x, y) is t = 0.0133y + 0.0171x (relative to the time it arrived at Aireys Inlet). Different fronts will have different alignments and move at different speeds, so these parameters only apply to the passage of this particular front.
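Applying that equation is a one-liner. The sketch below (the helper function name is mine, not from the post) predicts the arrival at Melbourne, roughly 76 km east and 72 km north of Aireys Inlet.

```python
# Predicted arrival time (hours after Aireys Inlet) at a location (x, y),
# using the coefficients fitted for this particular front.
def arrival_time(x, y, m=0.0133, c=0.0171):
    return m * y + c * x

# Melbourne: approximately 76 km east and 72 km north of Aireys Inlet.
t_melb = arrival_time(76.14, 72.0)  # about 2.3 hours after Aireys Inlet
```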
But let’s look at the regression relationship more closely; it has some interesting attributes. Firstly, the relationship is approximately linear, although clearly imperfect. The approximate linearity might encourage us to have some faith in our rather bold assumptions.
Also, one of the points, corresponding to Cape Otway, has a potentially large influence on the regression. Being to the right of the other data, it has “high leverage”; the regression line will tend to always pass quite close to that point.
Whether that high leverage is important will depend on where we wish to make predictions. It turns out that Melbourne is located very close to that point. Now, that might seem surprising at first because, compared to Cape Otway, Melbourne is in the opposite direction from Aireys Inlet. In fact, that is why Cape Otway and Melbourne have similar values for y/x (the “x-value” of the regression model) – the two locations are in opposite directions from Aireys Inlet.
This dependence of the regression on when the front reaches Cape Otway means we can simplify the model considerably. We can use t/x for Cape Otway to predict t/x for Melbourne because the two locations have very similar values of y/x. For Cape Otway, x = −48.93, and for Melbourne, x = 76.14. If the front arrived at Cape Otway (relative to Aireys Inlet) at time t_{CO}, then the time it arrives at Melbourne, t_{M}, is predicted from the expected dependence:
t_{CO} / −48.93 = t_{M} / 76.14.
Thus, t_{M} = −t_{CO} 76.14/48.93 = −1.56t_{CO}.
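In code, the simplified rule is a single multiplication (the function name is mine; the x-coordinates are those given above). For the 17 January front, which reached Cape Otway 1.48 hours before Aireys Inlet, t_{CO} = −1.48.

```python
# Simplified rule: t_M = -(76.14 / 48.93) * t_CO, i.e. roughly -1.56 * t_CO.
def melbourne_arrival(t_cape_otway):
    return -(76.14 / 48.93) * t_cape_otway

# 17 Jan 2014: t_CO = -1.48 hours (Cape Otway before Aireys Inlet), so
# Melbourne is predicted about 2.3 hours after the front passed Aireys Inlet.
t_melb = melbourne_arrival(-1.48)
```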
That is, the time it takes for the front to arrive in Melbourne from Aireys Inlet is approximately the time it takes the front to travel between Cape Otway and Aireys Inlet multiplied by 1.56. The accuracy of this method can be assessed by comparing it to data on the passage of two fronts (17 Jan 2014 and 28 Jan 2014).
On 17 January, the front took 1.48 hours to travel between Cape Otway and Aireys Inlet, so our simplified model predicts the front’s arrival in Melbourne 2.3 hours after it passed through Aireys Inlet. The observed time was 3.2 hours, so the front took about 55 minutes longer than predicted. Thus, the data point for Melbourne is above that of Cape Otway.
On 28 January, the front took 1.56 hours to travel between Cape Otway and Aireys Inlet, so our simplified model predicts the front’s arrival in Melbourne 2.1 hours after it passed through Aireys Inlet. The observed time was 1.7 hours, so the front arrived about 25 minutes sooner than predicted. Thus, the data point for Melbourne is below that of Cape Otway.
Interestingly, errors in the predictions could have been anticipated once the front arrived in Geelong. Because the front on 17 January took longer than predicted (by the regression) to arrive in Geelong, it seems to have travelled slower than anticipated. In contrast, the front on 28 January arrived in Geelong earlier than predicted, so its passage might have accelerated.
The simplification t_{M} = −1.56t_{CO} only works for predicting arrival of the front at Melbourne. If you want to predict the passage of the front at other locations, you might need to do the linear regression (or better still, ask a numerical weather forecaster).