It shouldn’t come as a surprise to anyone with a basic understanding of evolution that it is a temporal (and also spatial concept). Time is a fundamental aspect of the process of evolution by natural selection, and without it evolution wouldn’t exist. But time is also a fickle thing, and although it remains constant (let’s not delve into that issue here) not all things experience it in the same way.
For anyone who has had to study geography at some point in their education, you’d likely be familiar with the idea of river courses drawn on a map. They’re so important, in fact, that they are often the delimiting factor in the edges of countries, states or other political units. Water is a fundamental requirement of all forms of life and the riverways that scatter the globe underpin the maintenance, structure and accumulation of a large swathe of biodiversity.
Further to this, we can expand the site-frequency spectrum to compare across populations. Instead of having a simple 1-dimensional frequency distribution, for a pair of populations we can have a grid. This grid specifies how often a particular allele occurs at a certain frequency in Population A and at a different frequency in Population B. This can also be visualised quite easily, albeit as a heatmap instead. We refer to this as the 2-dimensional SFS (2DSFS).
The same concept can be expanded to even more populations, although this gets harder to represent visually. Essentially, we end up with a set of different matrices which describe the frequency of certain alleles across all of our populations, merging them together into the joint SFS. For example, a joint SFS of 4 populations would consist of 6 (4 x 4 total comparisons – 4 self-comparisons, then halved to remove duplicate comparisons) 2D SFSs all combined together. To make sense of this, check out the diagrammatic tables below.
The different forms of the SFS
Which alleles we choose to use within our SFS is particularly important. If we don’t have a lot of information about the genomics or evolutionary history of our study species, we might choose to use the minor allele frequency (MAF). Given that SNPs tend to be biallelic, for any given locus we could have Allele A or Allele B. The MAF chooses the least frequent of these two within the dataset and uses that in the summary SFS: since the other allele’s frequency would just be 2N – the frequency of the other allele, it’s not included in the summary. An SFS made of the MAF is also referred to as the folded SFS.
Alternatively, if we know some things about the genetic history of our study species, we might be able to divide Allele A and Allele B into derived or ancestral alleles. Since SNPs often occur as mutations at a single site in the DNA, one allele at the given site is the new mutation (the derived allele) whilst the other is the ‘original’ (the ancestral allele). Typically, we would use the derived allele frequency to construct the SFS, since under coalescent theory we’re trying to simulate that mutation event. An SFS made of the derived alleles only is also referred to as the unfolded SFS.
Applications of the SFS
How can we use the SFS? Well, it can moreorless be used as a summary of genetic variation for many types of coalescent-based analyses. This means we can make inferences of demographic history (see here for more detailed explanation of that) without simulating large and complex genetic sequences and instead use the SFS. Comparing our observed SFS to a simulated scenario of a bottleneck and comparing the expected SFS allows us to estimate the likelihood of that scenario.
The SFS can even be used to detect alleles under natural selection. For strongly selected parts of the genome, alleles should occur at either high (if positively selected) or low (if negatively selected) frequency, with a deficit of more intermediate frequencies.
Adding to the analytical toolbox
The SFS is just one of many tools we can use to investigate the demographic history of populations and species. Using a combination of genomic technologies, coalescent theory and more robust analytical methods, the SFS appears to be poised to tackle more nuanced and complex questions of the evolutionary history of life on Earth.
Australia is renowned for its unique diversity of species, and likewise for the diversity of ecosystems across the island continent. Although many would typically associate Australia with the golden sandy beaches, palm trees and warm weather of the tropical east coast, other ecosystems also hold both beautiful and interesting characteristics. Even the regions that might typically seem the dullest – the temperate zones in the southern portion of the continent – themselves hold unique stories of the bizarre and wonderful environmental history of Australia.
The two temperate zones
Within Australia, the temperate zone is actually separated into two very distinct and separate regions. In the far south-western corner of the continent is the southwest Western Australia temperate zone, which spans a significant portion. In the southern eastern corner, the unnamed temperate zone spans from the region surrounding Adelaide at its westernmost point, expanding to the east and encompassing Tasmanian and Victoria before shifting northward into NSW. This temperate zones gradually develops into the sub-tropical and tropical climates of more northern latitudes in Queensland and across to Darwin.
The divide separating these two regions might be familiar to some readers – the Nullarbor Plain. Not just a particularly good location for fossils and mineral ores, the Nullarbor Plain is an almost perfectly flat arid expanse that stretches from the western edge of South Australia to the temperate zone of the southwest. As the name suggests, the plain is totally devoid of any significant forestry, owing to the lack of available water on the surface. This plain is a relatively ancient geological structure, and finished forming somewhere between 14 and 16 million years ago when tectonic uplift pushed a large limestone block upwards to the surface of the crust, forming an effective drain for standing water with the aridification of the continent. Thus, despite being relatively similar bioclimatically, the two temperate zones of Australia have been disconnected for ages and boast very different histories and biota.
The hotspot of the southwest
The southwest temperate zone – commonly referred to as southwest Western Australia (SWWA) – is an island-like bioregion. Isolated from the rest of the temperate Australia, it is remarkably geologically simple, with little topographic variation (only the Darling Scarp that separates the lower coast from the higher elevation of the Darling Plateau), generally minor river systems and low levels of soil nutrients. One key factor determining complexity in the SWWA environment is the isolation of high rainfall habitats within the broader temperate region – think of islands with an island.
Contrastingly, the temperate region in the south-east of the continent is much more complex. For one, the topography of the zone is much more variable: there are a number of prominent mountain chains (such as the extended Great Dividing Range), lowland basins (such as the expansive Murray-Darling Basin) and variable valley and river systems. Similarly, the climate varies significantly within this temperate region, with the more northern parts featuring more subtropical climatic conditions with wetter and hotter summers than the southern end. There is also a general trend of increasing rainfall and lower temperatures along the highlands of the southeast portion of the region, and dry, semi-arid conditions in the western lowland region.
A complicated history
The south-east temperate zone is not only variable now, but has undergone some drastic environmental changes over history. Massive shifts in geology, climate and sea-levels have particularly altered the nature of the area. Even volcanic events have been present at some time in the past.
One key hydrological shift that massively altered the region was the paleo-megalake Bungunnia. Not just a list of adjectives, Bungunnia was exactly as it’s described: a historically massive lake that spread across a huge area prior to its demise ~1-2 million years ago. At its largest size, Lake Bungunnia reached an area of over 50,000 km2, spreading from its westernmost point near the current Murray mouth although to halfway across Victoria. Initially forming due to a tectonic uplift event along the coastal edge of the Murray-Darling Basin ~3.2 million years ago, damming the ancestral Murray River (which historically outlet into the ocean much further east than today). Over the next few million years, the size of the lake fluctuated significantly with climatic conditions, with wetter periods causing the lake to overfill and burst its bank. With every burst, the lake shrank in size, until a final break ~700,000 years ago when the ‘dam’ broke and the full lake drained.
Another change in the historic environment readers may be more familiar with is the land-bridge that used to connect Tasmania to the mainland. Dubbed the Bassian Isthmus, this land-bridge appeared at various points in history of reduced sea-levels (i.e. during glacial periods in Pleistocene cycle), predominantly connecting via the still-above-water Flinders and Cape Barren Islands. However, at lower sea-levels, the land bridge spread as far west as King Island: central to this block of land was a large lake dubbed the Bass Lake (creative). The Bassian Isthmus played a critical role in the migration of many of the native fauna of Tasmania (likely including the Indigenous peoples of the now-island), and its submergence and isolation leads to some distinctive differences between Tasmanian and mainland biota. Today, the historic presence of the Bassian Isthmus has left a distinctive mark on the genetic make-up of many species native to the southeast of Australia, including dolphins, frogs, freshwater fishes and invertebrates.
Don’t underestimate the temperates
Although tropical regions get most of the hype for being hotspots of biodiversity, the temperate zones of Australia similarly boast high diversity, unique species and document a complex environmental history. Studying how the biota and environment of the temperate regions has changed over millennia is critical to predicting the future effects of climatic change across large ecosystems.
This is based on the idea that for genes that are not related to traits under selection (either positively or negatively), new mutations should be acquired and lost under predominantly random patterns. Although this accumulation of mutations is influenced to some degree by alternate factors such as population size, the overall average of a genome should give a picture that largely discounts natural selection. But is this true? Is the genome truly neutral if averaged?
First, let’s take a look at what we mean by neutral or not. For genes that are not under selection, alleles should be maintained at approximately balanced frequencies and all non-adaptive genes across the genome should have relatively similar distribution of frequencies. While natural selection is one obvious way allele frequencies can be altered (either favourably or detrimentally), other factors can play a role.
The extent of this linkage effect depends on a number of other factors such as ploidy (the number of copies of a chromosome a species has), the size of the population and the strength of selection around the central locus. The presence of linkage and its impact on the distribution of genetic diversity (LD) has been well documented within evolutionary and ecological genetic literature. The more pressing question is one of extent: how much of the genome has been impacted by linkage? Is any of the genome unaffected by the process?
Although I avoid having a strong stance here (if you’re an evolutionary geneticist yourself, I will allow you to draw your own conclusions), it is my belief that the model of neutral theory – and the methods that rely upon it – are still fundamental to our understanding of evolution. Although it may present itself as a more conservative way to identify adaptation within the genome, and cannot account for the effect of the above processes, neutral theory undoubtedly presents itself as a direct and well-implemented strategy to understand adaptation and demography.
A recurring analytical method, both within The G-CAT and the broader ecological genetic literature, is based on coalescent theory. This is based on the mathematical notion that mutations within genes (leading to new alleles) can be traced backwards in time, to the point where the mutation initially occurred. Given that this is a retrospective, instead of describing these mutation moments as ‘divergence’ events (as would be typical for phylogenetics), these appear as moments where mutations come back together i.e. coalesce.
From a mathematical perspective, the coalescent model is actually (relatively) simple. If we sampled a single gene from two different individuals (for simplicity’s sake, we’ll say they are haploid and only have one copy per gene), we can statistically measure the probability of these alleles merging back in time (coalescing) at any given generation. This is the same probability that the two samples share an ancestor (think of a much, much shorter version of sharing an evolutionary ancestor with a chimpanzee).
Normally, if we were trying to pick the parents of our two samples, the number of potential parents would be the size of the ancestral population (since any individual in the previous generation has equal probability of being their parent). But from a genetic perspective, this is based on the genetic (effective) population size (Ne), multiplied by 2 as each individual carries two copies per gene (one paternal and one maternal). Therefore, the number of potential parents is 2Ne.
Although this might seem mathematically complicated, the coalescent model provides us with a scenario of how we would expect different mutations to coalesce back in time if those idealistic scenarios are true. However, biology is rarely convenient and it’s unlikely that our study populations follow these patterns perfectly. By studying how our empirical data varies from the expectations, however, allows us to infer some interesting things about the history of populations and species.
This makes sense from theoretical perspective as well, since strong genetic bottlenecks means that most alleles are lost. Thus, the alleles that we do have are much more likely to coalesce shortly after the bottleneck, with very few alleles that coalesce before the bottleneck event. These alleles are ones that have managed to survive the purge of the bottleneck, and are often few compared to the overarching patterns across the genome.
In a similar vein, the coalescent can also be used to test how long ago the two contemporary populations diverged. Similar to gene flow, this is often included as an additional parameter on top of the coalescent model in terms of the number of generations ago. To convert this to a meaningful time estimate (e.g. in terms of thousands or millions of years ago), we need to include a mutation rate (the number of mutations per base pair of sequence per generation) and a generation time for the study species (how many years apart different generations are: for humans, we would typically say ~20-30 years).
While each of these individual concepts may seem (depending on how well you handle maths!) relatively simple, one critical issue is the interactive nature of the different factors. Gene flow, divergence time and population size changes will all simultaneously impact the distribution and frequency of alleles and thus the coalescent method. Because of this, we often use complex programs to employ the coalescent which tests and balances the relative contributions of each of these factors to some extent. Although the coalescent is a complex beast, improvements in the methodology and the programs that use it will continue to improve our ability to infer evolutionary history with coalescent theory.
To expand on this, we’re going to look at a few different models of how the spatial distribution of populations influences their divergence, and particularly how these factor into different processes of speciation.
What comes first, ecological or genetic divergence?
The order of these two processes have been in debate for some time, and different aspects of species and the environment can influence how (or if) these processes occur.
Different spatial models of speciation
Generally, when we consider the spatial models for speciation we divide these into distinct categories based on the physical distance of populations from one another. Although there is naturally a lot of grey area (as there is with almost everything in biological science), these broad concepts help us to define and determine how speciation is occurring in the wild.
A step closer in bringing populations geographically together in speciation is “parapatry” and “peripatry”. Parapatric populations are often geographically close together but not overlapping: generally, the edges of their distributions are touching but do not overlap one another. A good analogy would be to think of countries that share a common border. Parapatry can occur when a species is distributed across a broad area, but some form of narrow barrier cleaves the distribution in two: this can be the case across particular environmental gradients where two extremes are preferred over the middle.
This can be tricky to visualise, so let’s invent an example. Say we have a tropical island, which is occupied by one bird species. This bird prefers to eat the large native fruit of the island, although there is another fruit tree which produces smaller fruits. However, there’s only so much space and eventually there are too many birds for the number of large fruit trees available. So, some birds are pushed to eat the smaller fruit, and adapt to a different diet, changing physiology over time to better acquire their new food and obtain nutrients. This shift in ecological niche causes the two populations to become genetically separated as small-fruit-eating-birds interact more with other small-fruit-eating-birds than large-fruit-eating-birds. Over time, these divergences in genetics and ecology causes the two populations to form reproductively isolated species despite occupying the same island.
As you can see, the processes and context driving speciation are complex to unravel and many factors play a role in the transition from population to species. Understanding the factors that drive the formation of new species is critical to understanding not just how evolution works, but also in how new diversity is generated and maintained across the globe (and how that might change in the future).
One particular distinction we need to make early here is the difference between allele frequency and allele identity. In these analyses, often we are working with the same alleles (i.e. particular variants) across our populations, it’s just that each of these populations may possess these particular alleles in different frequencies. For example, one population may have an allele (let’s call it Allele A) very rarely – maybe only 10% of individuals in that population possess it – but in another population it’s very common and perhaps 80% of individuals have it. This is a different level of differentiation than comparing how different alleles mutate (as in the coalescent) or how these mutations accumulate over time (like in many phylogenetic-based analyses).
Fixed differences are sometimes used as a type of diagnostic trait for species. This means that each ‘species’ has genetic variants that are not shared at all with its closest relative species, and that these variants are so strongly under selection that there is no diversity at those loci. Often, fixed differences are considered a level above populations that differ by allelic frequency only as these alleles are considered ‘diagnostic’ for each species.
To distinguish between the two, we often use the overall frequency of alleles in a population as a basis for determining how likely two individuals share an allele by random chance. If alleles which are relatively rare in the overall population are shared by two individuals, we expect that this similarity is due to family structure rather than population history. By factoring this into our relatedness estimates we can get a more accurate overview of how likely two individuals are to be related using genetic information.
The wild world of allele frequency
Despite appearances, this is just a brief foray into the many applications of allele frequency data in evolution, ecology and conservation studies. There are a plethora of different programs and methods that can utilise this information to address a variety of scientific questions and refine our investigations.
This is partly where the concept of a ‘model’ comes into it: it’s much easier to pick a particular species to study as a target, and use the information from it to apply to other scenarios. Most people would be familiar with the concept based on medical research: the ‘lab rat’ (or mouse). The common house mouse (Mus musculus) and the brown rat (Rattus norvegicus) are some of the most widely used models for understanding the impact of particular biochemical compounds on physiology and are often used as the testing phase of medical developments before human trials.