The reality of neutrality

The neutral theory 

Many, many times within The G-CAT we’ve discussed the difference between neutral and selective processes, DNA markers and their applications in our studies of evolution, conservation and ecology. The idea that many parts of the genome evolve under a seemingly random pattern – largely dictated by genome-wide genetic drift rather than the specific force of natural selection – underpins many demographic and adaptive (in outlier tests) analyses.

This is based on the idea that for genes that are not related to traits under selection (either positively or negatively), new mutations should be acquired and lost under predominantly random patterns. Although this accumulation of mutations is influenced to some degree by alternate factors such as population size, the overall average of a genome should give a picture that largely discounts natural selection. But is this true? Is the genome truly neutral if averaged?

Non-neutrality

First, let’s take a look at what we mean by neutral or not. For genes that are not under selection, alleles should be maintained at approximately balanced frequencies and all non-adaptive genes across the genome should have relatively similar distribution of frequencies. While natural selection is one obvious way allele frequencies can be altered (either favourably or detrimentally), other factors can play a role.

As stated above, population sizes have a strong impact on allele frequencies. This is because smaller populations are more at risk of losing rarer alleles due to random deaths (see previous posts for a more thorough discussion of this). Additionally, genes which are physically close to other genes which are under selection may themselves appear to be under selection due to linkage disequilibrium (often shortened to ‘LD’). This is because physically close genes are more likely to be inherited together, thus selective genes can ‘pull’ neighbours with them to alter their allele frequencies.

Linkage disequilibrium figure
An example of how linkage disequilibrium can alter allele frequency of ‘neutral’ parts of the genome as well. In this example, only one part of this section of the genome is selected for: the green gene. Because of this positive selection, the frequency of a particular allele at this gene increases (the blue graph): however, nearby parts of the genome also increase in frequency due to their proximity to this selected gene, which decreases with distance. The extent of this effect determines the size of the ‘linkage block’ (see below).

Why might ‘neutral’ models not be neutral?

The assumption that the vast majority of the genome evolves under neutral patterns has long underpinned many concepts of population and evolutionary genetics. But it’s never been all that clear exactly how much of the genome is actually evolving neutrally or adaptively. How far natural selection reaches beyond a single gene under selection depends on a few different factors: let’s take a look at a few of them.

Linked selection

As described above, physically close genes (i.e. located near one another on a chromosome) often share some impacts of selection due to reduced recombination that occurs at that part of the genome. In this case, even alleles that are not adaptive (or maladaptive) may have altered frequencies simply due to their proximity to a gene that is under selection (either positive or negative).

Recombination blocks and linkage figure
A (perhaps familiar) example of the interaction between recombination (the breaking and mixing of different genes across chromosomes) and linkage disequilibrium. In this example, we have 5 different copies of a part of the genome (different coloured sequences), which we randomly ‘break’ into separate fragments (breaks indicated by the dashed lines). If we focus on a particular base in the sequence (the yellow A) and count the number of times a particular base pair is on the same fragment, we can see how physically close bases are more likely to be coinherited than further ones (bottom column graph). This makes mathematical sense: if two bases are further apart, you’re more likely to have a break that separates them. This is the very basic underpinning of linkage and recombination, and the size of the region where bases are likely to be coinherited is called the ‘linkage block’.
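The logic of this caption can be put into rough numbers. The snippet below is a minimal sketch, assuming exactly one break that lands uniformly at random in one of the gaps between adjacent bases (a deliberate simplification, since real recombination rates vary along the genome). The function name `prob_coinherited` is hypothetical and only used for illustration.

```python
def prob_coinherited(distance: int, seq_length: int) -> float:
    """Probability that two bases `distance` positions apart stay together.

    Assumes exactly one break, placed uniformly at random in one of the
    (seq_length - 1) gaps between adjacent bases.
    """
    if seq_length < 2 or not 0 <= distance < seq_length:
        raise ValueError("need seq_length >= 2 and 0 <= distance < seq_length")
    gaps = seq_length - 1
    # The pair is only separated if the break falls in one of the
    # `distance` gaps that lie between the two bases.
    return 1 - distance / gaps


# Closer bases are coinherited more often than distant ones.
for d in (1, 5, 10, 19):
    print(d, round(prob_coinherited(d, seq_length=20), 3))
```

Under these assumptions the probability of coinheritance falls off linearly with distance, which is the same qualitative pattern as the column graph in the figure.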

Under these circumstances, for a region of a certain distance (dubbed the ‘linkage block’) around a gene under selection, the genome will not truly evolve neutrally. Although this is simplest to visualise as physically linked sections of the genome (i.e. adjacent), linked genes do not necessarily have to be next to one another, just linked somehow. For example, they may be different parts of a single protein pathway.

The extent of this linkage effect depends on a number of other factors such as ploidy (the number of copies of a chromosome a species has), the size of the population and the strength of selection around the central locus. The presence of linkage and its impact on the distribution of genetic diversity (LD) has been well documented within evolutionary and ecological genetic literature. The more pressing question is one of extent: how much of the genome has been impacted by linkage? Is any of the genome unaffected by the process?

Background selection

One example of linked selection commonly used to explain the proliferation of non-neutral evolution within the genome is ‘background selection’. Put simply, background selection is the purging of alleles due to negative selection on a linked gene. Sometimes, background selection is expanded to include any forms of linked selection.

Background selection figure .jpg
A cartoonish example of how background selection affects neighbouring sections of the genome. In this example, we have 4 genes (A, B, C and D) with interspersing neutral ‘non-gene’ sections. The allele for Gene B is strongly selected against by natural selection (depicted here as the Banhammer of Selection). However, the Banhammer is not very precise, and when decreasing the frequency of this maladaptive Gene B allele it also knocks down the neighbouring non-gene sections. Despite themselves not being maladaptive, their allele frequencies are decreased due to physical linkage to Gene B.

Under the broader definition of background selection, the process can be divided into two categories based on the impact of the linkage. As above, one scenario is the purging of neutral alleles (and therefore a reduction in genetic diversity) through their association with a deleterious gene nearby. Contrastingly, some neutral alleles may be preserved by association with a positively selected adaptive gene: this is often referred to as ‘genetic hitchhiking’ (which I’ve always thought was kind of an amusing phrase…).

Genetic hitchhiking picture.jpg
Definitely not how genetic hitchhiking works.

The presence of background selection – particularly under the ‘maladaptive’ scenario – is often used as an explanation for the ‘paradox of variation’. This paradox was described by evolutionary biologist Richard Lewontin, who noted that despite massive differences in population sizes across the many different species on Earth, the total amount of ‘neutral’ genetic variation does not change significantly. In fact, he observed no clear (direct) relationship between population size and neutral variation. Many years after this observation, the influence of background selection and genetic hitchhiking on the distribution of genomic diversity helps to explain how the amount of neutral genomic variation is ‘managed’, and why it doesn’t vary excessively across biota.

What does it mean if neutrality is dead?

These findings have significant implications for our understanding of the process of evolution, and how we can detect adaptation within the genome. In light of this research, there has been heated discussion about whether neutral theory is ‘dead’ or still a useful concept.

Genome wide allele frequency figure.jpg
A vague summary of how a large portion of the genome might not actually be neutral. In this section of the genome, we have neutral (blue), maladaptive (red) and adaptive (green) elements. Natural selection either favours, disfavours, or is ambivalent about each of these sections alone. However, there is significant ‘spill-over’ around regions of positively or negatively selected sections, which causes the allele frequency of even the neutral sections to fluctuate widely. The blue dotted line represents this: when the line is above the genome, allele frequency is increased; when it is below, it is decreased. As we travel along this section of the genome, you may notice it is rarely ever in the middle (the so-called ‘neutral’ allele frequency, in line with the genome).

Although I avoid having a strong stance here (if you’re an evolutionary geneticist yourself, I will allow you to draw your own conclusions), it is my belief that the model of neutral theory – and the methods that rely upon it – are still fundamental to our understanding of evolution. Although it may present itself as a more conservative way to identify adaptation within the genome, and cannot account for the effect of the above processes, neutral theory undoubtedly presents itself as a direct and well-implemented strategy to understand adaptation and demography.

Bringing alleles back together: applications of coalescent theory

Coalescent theory

A recurring analytical method, both within The G-CAT and the broader ecological genetic literature, is based on coalescent theory. This is based on the mathematical notion that mutations within genes (leading to new alleles) can be traced backwards in time, to the point where the mutation initially occurred. Given that this is a retrospective, instead of describing these mutation moments as ‘divergence’ events (as would be typical for phylogenetics), these appear as moments where mutations come back together i.e. coalesce.

There are a number of applications of coalescent theory, and it is a particularly fitting approach for understanding the demographic (neutral) history of populations and species.

Mathematics of the coalescent

Before we can explore the multitude of applications of the coalescent, we need to understand the fundamental underlying model. The initial coalescent model was described in the 1980s, and has since been built upon by a number of different ecologists, geneticists and mathematicians. John Kingman is often credited with formulating the original coalescent model, and Kingman’s coalescent is considered the most basic, primal form of the coalescent model.

From a mathematical perspective, the coalescent model is actually (relatively) simple. If we sampled a single gene from two different individuals (for simplicity’s sake, we’ll say they are haploid and only have one copy per gene), we can statistically measure the probability of these alleles merging back in time (coalescing) at any given generation. This is the same probability that the two samples share an ancestor (think of a much, much shorter version of sharing an evolutionary ancestor with a chimpanzee).

Normally, if we were trying to pick the parents of our two samples, the number of potential parents would be the size of the ancestral population (since any individual in the previous generation has equal probability of being their parent). But from a genetic perspective, this is based on the genetic (effective) population size (Ne), multiplied by 2 as each individual carries two copies per gene (one paternal and one maternal). Therefore, the number of potential parents is 2Ne.

Constant Ne and coalescent prob
A graph of the probability of a coalescent event (i.e. two alleles sharing an ancestor) in the immediately preceding generation (i.e. parents) relative to the size of the population. As one might expect, with larger population sizes there is a lower chance of sharing an ancestor in the immediately prior generation, as the pool of ‘potential parents’ increases.

If we have an idealised population, with large Ne, random mating and no natural selection on our alleles, the probability that their ancestor is in the immediately prior generation (i.e. they share a parent) is 1/(2Ne). Conversely, the probability they don’t share a parent is 1 − 1/(2Ne). If we add a temporal component (i.e. number of generations), we can expand this to give the probability that our alleles first coalesce exactly t generations ago: (1 − 1/(2Ne))^(t − 1) × 1/(2Ne).
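To see how this probability behaves, here is a minimal sketch of the formula above in Python. The function names (`prob_coalesce_at`, `prob_coalesced_by`) and the example values of Ne are hypothetical choices for illustration, and the same idealised assumptions (constant Ne, random mating, no selection) apply.

```python
def prob_coalesce_at(t: int, ne: float) -> float:
    """P(two alleles first share an ancestor exactly t generations ago)."""
    if t < 1 or ne < 1:
        raise ValueError("need t >= 1 and ne >= 1")
    p = 1 / (2 * ne)  # chance of coalescing in any single generation
    # No coalescence for the first t - 1 generations, then coalescence.
    return (1 - p) ** (t - 1) * p


def prob_coalesced_by(t: int, ne: float) -> float:
    """P(two alleles have coalesced within the last t generations)."""
    return sum(prob_coalesce_at(g, ne) for g in range(1, t + 1))


# Smaller populations coalesce sooner than larger ones.
for ne in (10, 100, 1000):
    print(ne, prob_coalesce_at(1, ne), round(prob_coalesced_by(50, ne), 3))
```

This is a geometric distribution, so under these assumptions the average waiting time for two alleles to coalesce is 2Ne generations.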

Variable Ne and coalescent probs
The probability of two alleles sharing a coalescent event back in time under different population sizes. Similar to above, there is a higher probability of an earlier coalescent event in smaller populations as the reduced number of ancestors means that alleles are more likely to ‘share’ an ancestor. However, the probability of first coalescing in any particular generation consistently decreases the further back in time we go, under all population size scenarios.

Although this might seem mathematically complicated, the coalescent model provides us with a scenario of how we would expect different mutations to coalesce back in time if those idealised assumptions are true. However, biology is rarely convenient and it’s unlikely that our study populations follow these patterns perfectly. Studying how our empirical data varies from these expectations, however, allows us to infer some interesting things about the history of populations and species.

Testing changes in Ne and bottlenecks

One of the more common applications of the coalescent is in determining historical changes in the effective population size of species, particularly in trying to detect genetic bottleneck events. This is based on the idea that alleles are likely to coalesce at different rates under scenarios of genetic bottlenecks, as the reduced number of individuals (and also genetic diversity) associated with bottlenecks changes the frequency of alleles and coalescence rates.

For a set of k different alleles, the rate of coalescence is determined as k(k – 1)/(4Ne). Thus, the coalescence rate is intrinsically linked to the effective population size, Ne. During genetic bottlenecks, the severely reduced Ne gives the appearance of the coalescence rate speeding up. This is because genetic drift culls many alleles during the bottleneck event, so only a few (usually common) alleles make it through, with new mutations arising and spreading from these surviving alleles after the bottleneck. This can be a little hard to think of, so the diagram below demonstrates how this appears.

Bottleneck test figure.jpg
A diagram of how the coalescent can be used to detect bottlenecks in a single population (centre). In this example, we have a contemporary population in which we are tracing the coalescence of two main alleles (red and green, respectively). Each circle represents a single individual (we are assuming only one allele per individual for simplicity, but for most animals there are up to two). Looking forward in time, you’ll notice that some red alleles go extinct just before the bottleneck: they are lost during the reduction in Ne. Because of this, if we measure the rate of coalescence (right), it is much higher during the bottleneck than before or after it. Another way this could be visualised is to generate gene trees for the alleles (left): populations that underwent a bottleneck will typically have many shorter branches and a long root, as many branches will be ‘lost’ by extinction (the dashed lines, which are not normally seen in a tree).

This makes sense from a theoretical perspective as well, since strong genetic bottlenecks mean that most alleles are lost. Thus, the alleles that we do have are much more likely to coalesce shortly after the bottleneck, with very few alleles that coalesce before the bottleneck event. These alleles are ones that have managed to survive the purge of the bottleneck, and are often few compared to the overarching patterns across the genome.
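To put rough numbers on the rate formula above, here is a minimal sketch comparing a large population with a bottlenecked one. The function names and the values of k and Ne are hypothetical, and the formula is only a good approximation when k is much smaller than Ne.

```python
def coalescence_rate(k: int, ne: float) -> float:
    """Per-generation rate at which any pair among k alleles coalesces."""
    if k < 2 or ne <= 0:
        raise ValueError("need k >= 2 and ne > 0")
    return k * (k - 1) / (4 * ne)


def expected_wait(k: int, ne: float) -> float:
    """Approximate number of generations until the next coalescent event."""
    return 1 / coalescence_rate(k, ne)


k = 10  # number of sampled alleles (illustrative)
for label, ne in (("before bottleneck", 1000), ("during bottleneck", 50)):
    print(label, coalescence_rate(k, ne), round(expected_wait(k, ne), 1))
```

With these made-up values, the rate during the bottleneck is 20 times higher than before it, simply because Ne is 20 times smaller.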

Testing migration (gene flow) across lineages

Another demographic factor we may wish to test is whether gene flow has occurred across our populations historically. Although there are plenty of allele frequency methods that can estimate contemporary gene flow (i.e. within a few generations), coalescent analyses can detect patterns of gene flow reaching further back in time.

In simple terms, this is based on the idea that if gene flow has occurred across populations, then some alleles will have been transferred from one population to another. Because of this, we would expect that transferred alleles coalesce with alleles of the source population more recently than the divergence time of the two populations. Thus, models that include a migration rate often add it as a parameter specifying the probability that any given allele coalesces with an allele in another population or species (the backwards version of a migration or introgression event). Again, this might be difficult to conceptualise so there’s a handy diagram below.

Migration rate test figure
A similar model of coalescence as above, but testing for migration rate (gene flow) in two recently diverged populations (right). In this example, when we trace two alleles (red and green) back in time, we notice that some individuals in Population 1 coalesce more recently with individuals of Population 2 than with other individuals of Population 1 (e.g. for the red allele), and vice versa for the green allele. This can also be represented with gene trees (left), with dashed lines representing individuals from Population 2 and whole lines representing individuals from Population 1. This incomplete split between the two populations is the result of migration transferring genes from one population to the other after their initial divergence (also called ‘introgression’ when it occurs between species).

Testing divergence time

In a similar vein, the coalescent can also be used to test how long ago two contemporary populations diverged. Similar to gene flow, this is often included as an additional parameter on top of the coalescent model, expressed as a number of generations ago. To convert this to a meaningful time estimate (e.g. in terms of thousands or millions of years ago), we need to include a mutation rate (the number of mutations per base pair of sequence per generation) and a generation time for the study species (how many years apart different generations are: for humans, we would typically say ~20-30 years).
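As a rough illustration of this conversion, here is a minimal sketch. It assumes the simple relationship that per-site divergence between two lineages equals 2 × mutation rate × generations since the split (both lineages accumulate mutations, and ancestral variation is ignored), which is one common simplification and not something stated in this post. The function names and all input values are hypothetical.

```python
def divergence_in_generations(per_site_divergence: float,
                              mutation_rate: float) -> float:
    """Generations since the split, assuming divergence = 2 * mu * T."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return per_site_divergence / (2 * mutation_rate)


def generations_to_years(generations: float,
                         generation_time_years: float) -> float:
    """Convert a time in generations into years."""
    return generations * generation_time_years


# Illustrative values only: 0.2% sequence divergence, a mutation rate of
# 1e-8 per base pair per generation, and a 25-year generation time.
gens = divergence_in_generations(0.002, 1e-8)
print(round(gens), round(generations_to_years(gens, 25)))
```

Because the final estimate scales directly with both the mutation rate and the generation time, uncertainty in either one carries straight through to the divergence time in years.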

Divergence time test figure.jpg
An example of using the coalescent to test the divergence time between two populations, this time using three different alleles (red, green and yellow). Tracing back the coalescence of each allele reveals different times (in terms of which generation the coalescence occurs in) depending on the allele (right). As above, we can look at this through gene trees (left), showing variation in how far back the two populations (again indicated with bold and dashed lines respectively) split. The blue box indicates the range of times (i.e. a confidence interval) around which divergence occurred: with many more alleles, this can be refined by using an ‘average’ and later related to time in years with a generation time.

 

The basic model of testing divergence time with the coalescent is relatively simple, and not all that different to phylogenetic methods. Where in phylogenetics we relate the length of the different branches in the tree to the amount of time that has occurred since the divergence of those branches, with the coalescent we base these on coalescent events, with more coalescent events occurring around the time of divergence. One important difference in the two methods is that coalescent events might not directly coincide with divergence time (in fact, we expect many do not) as some alleles will separate prior to divergence, and some will lag behind and start to diverge after the divergence event.

The complex nature of the coalescent

While each of these individual concepts may seem (depending on how well you handle maths!) relatively simple, one critical issue is the interactive nature of the different factors. Gene flow, divergence time and population size changes will all simultaneously impact the distribution and frequency of alleles and thus the coalescent method. Because of this, we often use complex programs to employ the coalescent, which test and balance the relative contributions of each of these factors to some extent. Although the coalescent is a complex beast, improvements in the methodology and the programs that use it will continue to improve our ability to infer evolutionary history with coalescent theory.

What’s the (allele) frequency, Kenneth?

Allele frequency

A number of times before on The G-CAT, we’ve discussed the idea of using the frequency of different genetic variants (alleles) within a particular population or species to test a number of different questions about evolution, ecology and conservation. These are all based on the central notion that certain forces of nature will alter the distribution and frequency of alleles within and across populations, and that these patterns are somewhat predictable in how they change.

One particular distinction we need to make early here is the difference between allele frequency and allele identity. In these analyses, often we are working with the same alleles (i.e. particular variants) across our populations, it’s just that each of these populations may possess these particular alleles in different frequencies. For example, one population may have an allele (let’s call it Allele A) very rarely – maybe only 10% of individuals in that population possess it – but in another population it’s very common and perhaps 80% of individuals have it. This is a different level of differentiation than comparing how different alleles mutate (as in the coalescent) or how these mutations accumulate over time (like in many phylogenetic-based analyses).

Allele freq vs identity figure.jpg
An example of the difference between allele frequency and identity. In this example (and many of the figures that follow in this post), the circles denote different populations, within which there are individuals which possess either an A allele (blue) or a B allele. Left: If we compare Populations 1 and 2, we can see that they both have A and B alleles. However, these alleles vary in their frequency within each population, with an equal balance of A and B in Pop 1 and a much higher frequency of B in Pop 2. Right: However, when we compare Pop 3 and 4, we can see that not only do they vary in frequencies, they vary in the presence of alleles, with one allele in each population but not the other.
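The frequency side of this distinction is easy to compute by hand or in code. Below is a minimal sketch, assuming one allele per individual for simplicity; the function name `allele_frequencies` and the sample data are made up to loosely mirror the figure.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return the proportion of each allele in a sample of alleles."""
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sampled allele")
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical samples: Pop 1 and Pop 2 share alleles at different
# frequencies, while Pop 3 and Pop 4 do not share any alleles at all.
populations = {
    "Pop 1": "AAAABBBB",
    "Pop 2": "ABBBBBBB",
    "Pop 3": "AAAAAAAA",
    "Pop 4": "BBBBBBBB",
}
for name, sample in populations.items():
    print(name, allele_frequencies(sample))
```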

Non-adaptive (neutral) uses

Testing neutral structure

Arguably one of the most standard uses of allele frequency data is the determination of population structure, one which more avid The G-CAT readers will be familiar with. This is based on the idea that populations that are isolated from one another are less likely to share alleles (and thus have similar frequencies of those alleles) than populations that are connected. This is because gene flow across two populations helps to homogenise the frequency of alleles within those populations, by either diluting common alleles or spreading rarer ones (in general). There are a number of programs that use allele frequency data to assess population structure, but one of the most common ones is STRUCTURE.

Gene flow homogeneity figure
An example of how gene flow across populations homogenises allele frequencies. We start with two initial populations (Pop 1 and Pop 2 from above), which have very different allele frequencies. Hybridising individuals across the two populations means some alleles move from Pop 1 and Pop 2 into the hybrid population: which alleles move is random (the smaller circles). Because of this, the resultant hybrid population has an allele frequency somewhere in between the two source populations: think of it like mixing red and blue cordial and getting a purple drink.
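The cordial analogy can be sketched as a simple calculation. The example below assumes a deterministic two-population model in which each population swaps a fixed fraction `m` of its alleles with the other every generation, and it ignores drift and selection entirely. The function name `exchange_migrants` and the starting frequencies are hypothetical.

```python
def exchange_migrants(p1: float, p2: float, m: float, generations: int):
    """Allele frequencies in two populations after symmetric gene flow.

    p1, p2: starting frequency of the same allele in each population.
    m: fraction of each population replaced by migrants per generation.
    """
    if not 0 <= m <= 0.5:
        raise ValueError("m should be between 0 and 0.5")
    for _ in range(generations):
        # Each population keeps (1 - m) of its own alleles and receives
        # a fraction m from the other population.
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
    return p1, p2


for gens in (0, 1, 5, 20):
    p1, p2 = exchange_migrants(0.9, 0.1, m=0.1, generations=gens)
    print(gens, round(p1, 3), round(p2, 3))
```

Under these assumptions the gap between the two frequencies shrinks by a constant factor every generation, so both populations drift towards the same ‘purple’ intermediate value.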

 

Simple YPP structure figure.jpg
An example of a Structure plot which long-term The G-CAT readers may be familiar with. This is taken from Brauer et al. (2013), where the authors studied the population structure of the Yarra pygmy perch. Each small column represents a single individual, with the colours representing how well the alleles of that individual fit a particular genetic population (each population has one colour). The numbers and broader columns refer to different ‘localities’ (different from populations) where individuals were sourced. This shows clear strong population structure across the 4 main groups, except for in Locality 6 where there is a mixture of Eastern and Merri/Curdies alleles.

Determining genetic bottlenecks and demographic change

Other neutral aspects of population identity and history can be studied using allele frequency data. One big component of understanding population history in particular is determining how the population size has changed over time, and relating this to bottleneck events or expansion periods. Although there are a number of different approaches to this, which span many types of analyses (e.g. also coalescent methods), allele frequency data is particularly suited to determining changes in the recent past (hundreds of generations, as opposed to thousands of generations ago). This is because we expect that, during a bottleneck event, it is statistically more likely for rare alleles (i.e. those with low frequency) in the population to be lost due to strong genetic drift: because of this, the population coming out of the bottleneck event should have an excess of more frequent alleles compared to a non-bottlenecked population. We can determine if this is the case with tests such as the heterozygosity excess, M-ratio or mode shift tests.

Genetic drift and allele freq figure
A diagram of how allele frequencies change in genetic bottlenecks due to genetic drift. Left: Large circles again denote a population (although across different sequential times), with smaller circles denoting which alleles survive into the next generation (indicated by the coloured arrows). We start with an initial ‘large’ population of 8, which is reduced down to 4 and 2 in respective future times. Each time the population contracts, only a select number of alleles (or individuals) ‘survive’: assuming no natural selection is in process, this is totally random from the available gene pool. Right: We can see that over time, the frequencies of alleles A and B shift dramatically, leading to the ‘extinction’ of Allele B due to genetic drift. This is because it is the less frequent allele of the two, and in the smaller population size has much less chance of randomly ‘surviving’ the purge of the genetic bottleneck.
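Why rare alleles are the first to go can be shown with a one-line probability. This is a minimal sketch, assuming the surviving alleles are drawn independently at random from the previous gene pool (sampling with replacement, no selection). The function name `prob_allele_lost` and the frequencies used are hypothetical.

```python
def prob_allele_lost(freq: float, n_copies: int) -> float:
    """Chance an allele at frequency `freq` is absent from n_copies draws."""
    if not 0 <= freq <= 1 or n_copies < 0:
        raise ValueError("need 0 <= freq <= 1 and n_copies >= 0")
    # Every one of the n_copies surviving alleles must be a different allele.
    return (1 - freq) ** n_copies


# A rare allele (5%) versus a common one (50%) as the population shrinks.
for n in (8, 4, 2):
    print(n, round(prob_allele_lost(0.05, n), 3),
          round(prob_allele_lost(0.5, n), 3))
```

In every case the rare allele is far more likely to vanish than the common one, and the risk for both climbs as the number of survivors falls.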

Adaptive (selective) uses

Testing different types of selection

We’ve also previously discussed how different types of natural selection can alter the distribution of allele frequency within a population. There are a number of different predictions we can make based on the selective force and the overall population. For understanding particular alleles that are under strong selective pressure (i.e. are either strongly adaptive or maladaptive), we often test for alleles which have a frequency that strongly deviates from the ‘neutral’ background pattern of the population. These are called ‘outlier loci’, and the fact that their frequency differs so much from the average across the genome is attributed to natural selection placing strong pressure on either maintaining or removing that allele.

Other selective tests are based on the idea of correlating the frequency of alleles with a particular selective environmental pressure, such as temperature or precipitation. In this case, we expect that alleles under selection will vary in relation to the environmental variable. For example, if a particular allele confers a selective benefit under hotter temperatures, we would expect that allele to be more common in populations that occur in hotter climates and rarer in populations that occur in colder climates. This is referred to as a ‘genotype-environment association test’ and is a good way to detect polygenic selection (i.e. when multiple genes contribute to a change in a single phenotypic trait).

Genotype by environment figure.jpg
An example of how the frequency of alleles might vary under natural selection in correlation to the environment. In this example, the blue allele A is adaptive and under positive selection in the more intense environment, and thus increases in frequency at higher values. Contrastingly, the red allele B is maladaptive in these environments and decreases in frequency. For comparison, the black allele shows how the frequency of a neutral (non-adaptive or maladaptive) allele doesn’t vary with the environment, as it plays no role in natural selection.
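The pattern in this figure boils down to a correlation between allele frequency and an environmental variable. Here is a minimal sketch using a plain Pearson correlation; real genotype-environment association methods also have to account for neutral population structure, which this example ignores. The function name `pearson_r`, the temperatures and the allele frequencies are all made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2)")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: mean temperature at five populations and the
# frequency of three alleles in each of those populations.
temperature = [10, 15, 20, 25, 30]
adaptive_a = [0.1, 0.3, 0.5, 0.7, 0.9]       # rises with temperature
maladaptive_b = [0.9, 0.7, 0.5, 0.3, 0.1]    # falls with temperature
neutral = [0.52, 0.48, 0.50, 0.48, 0.52]     # no trend with temperature

for name, freqs in (("A", adaptive_a), ("B", maladaptive_b), ("neutral", neutral)):
    print(name, round(pearson_r(temperature, freqs), 3))
```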

Taxonomic (species identity) uses

At one end of the spectrum of allele frequencies, we can also test for what we call ‘fixed differences’ between populations. An allele is considered ‘fixed’ if it is the only allele for that locus in the population (i.e. has a frequency of 1), whilst the alternative allele (which may exist in other populations) has a frequency of 0. Expanding on this, ‘fixed differences’ occur when one population has Allele A fixed and another population has Allele B fixed: thus, the two populations have as different allele frequencies (for that one locus, anyway) as possible.

Fixed differences are sometimes used as a type of diagnostic trait for species. This means that each ‘species’ has genetic variants that are not shared at all with its closest relative species, and that these variants are so strongly under selection that there is no diversity at those loci. Often, fixed differences are considered a level above populations that differ by allelic frequency only as these alleles are considered ‘diagnostic’ for each species.

Fixed differences figure.jpg
An example of the difference between fixed differences and allelic frequency differences. In this example, we have 5 cats from 3 different species, sequencing a particular target gene. Within this gene, there are three possible alleles: T, A or G respectively. You’ll quickly notice that one of these alleles is both unique to Species A and is present in all cats of that species (i.e. is fixed). This is a fixed difference between Species A and the other two. The other two alleles, however, are present in both Species B and C, and thus are not fixed differences even if they have different frequencies.
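Checking for a fixed difference at a single locus is simple to express in code. The sketch below treats two groups as having a fixed difference when they share no alleles at all, which matches the figure (where one species is fixed and the others are not); the stricter version in the text also requires each group to be fixed for its own allele. The function names and the allele letters assigned to each species are hypothetical and not read off the figure.

```python
def is_fixed(alleles) -> bool:
    """True if the sample contains only a single allele."""
    return len(set(alleles)) == 1


def has_fixed_difference(group_a, group_b) -> bool:
    """True if the two groups share no alleles at this locus."""
    return set(group_a).isdisjoint(group_b)


# Made-up alleles for 5 cats per species at one locus.
species_a = ["T", "T", "T", "T", "T"]
species_b = ["A", "A", "G", "A", "G"]
species_c = ["G", "G", "A", "G", "G"]

print(is_fixed(species_a))                         # one allele only
print(has_fixed_difference(species_a, species_b))  # no shared alleles
print(has_fixed_difference(species_b, species_c))  # shared A and G
```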

Intrapopulation (relatedness) uses

Allele frequency-based methods are even used in determining relatedness between individuals. While it might seem intuitive to just check whether individuals share the same alleles (and are thus related), it can be hard to distinguish between whether they are genetically similar due to direct inheritance or whether the entire population is just ‘naturally’ similar, especially at a particular locus. This is the distinction between ‘identical-by-descent’, where alleles that are similar across individuals have recently been inherited from a similar ancestor (e.g. a parent or grandparent) or ‘identical-by-state’, where alleles are similar just by chance. The latter doesn’t contribute or determine relatedness as all individuals (whether they are directly related or not) within a population may be similar.

To distinguish between the two, we often use the overall frequency of alleles in a population as a basis for determining how likely two individuals share an allele by random chance. If alleles which are relatively rare in the overall population are shared by two individuals, we expect that this similarity is due to family structure rather than population history. By factoring this into our relatedness estimates we can get a more accurate overview of how likely two individuals are to be related using genetic information.
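The role of population-wide frequency here can be shown with a back-of-the-envelope calculation. This is a minimal sketch, assuming one allele per individual and two unrelated individuals drawn at random from the population; it is not how any particular relatedness program works. The function name `chance_shared_by_state` and the frequencies are hypothetical.

```python
def chance_shared_by_state(freq: float) -> float:
    """Chance two unrelated individuals both carry an allele by chance alone."""
    if not 0 <= freq <= 1:
        raise ValueError("freq must be between 0 and 1")
    # Each individual independently carries the allele with probability freq.
    return freq ** 2


for label, freq in (("rare allele", 0.05), ("common allele", 0.8)):
    print(label, round(chance_shared_by_state(freq), 4))
```

With these made-up frequencies, two unrelated individuals share the common allele by chance most of the time but share the rare one only very occasionally, which is why a shared rare allele is much stronger evidence of a recent common ancestor.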

The wild world of allele frequency

Despite appearances, this is just a brief foray into the many applications of allele frequency data in evolution, ecology and conservation studies. There are a plethora of different programs and methods that can utilise this information to address a variety of scientific questions and refine our investigations.

Hotter and colder: how historic glacial cycles have shaped modern diversity

A tale as old as time

Since evolution is a constant process, occurring over both temporal and spatial scales, the impact of evolutionary history for current and future species cannot be overstated. The various forces of evolution through natural selection have strong, lasting impacts on the evolution of organisms, which is exemplified within the genetic make-up of all species. Phylogeography is the domain of research which intrinsically links this genetic information to historical selective environment (and changes) to understand historic distributions, evolutionary history, and even identify biodiversity hotspots.

The Ice Age(s)

Although there are a huge number of both historic and contemporary climatic factors that have influenced the evolution of species, one particularly important time period is referred to as the Pleistocene glacial cycles. The Pleistocene epoch spans from ~2.6 million years ago until ~12,000 years ago, and is a time of significant changes in the evolution of many species still around today (particularly for vertebrates). This is because the Pleistocene largely consisted of several successive glacial periods: at times, the climate was significantly cooler, glaciers were more widespread and sea-levels were lower (due to the deeper freezing of water around the poles). These periods were then followed by ‘interglacial periods’, where much of the globe warmed, ice caps melted and sea-levels rose. Sometimes, this natural pattern is argued to explain 100% of recent climate change: don’t be fooled, however, as Pleistocene cycles were never as dramatic or irreversible as modern, anthropogenically-driven climate change.

Annotated glacial cycles.jpg
The general pattern of glacial and interglacial periods over the last 1 million years, adapted from Oceanbites.

The glacial cycles of the Pleistocene had a number of impacts on a plethora of species on Earth. For many of these species, these glacial-interglacial periods resulted in what we call ‘glacial refugia’ and ‘interglacial expansion’: at the peak of glacial periods, many species’ distributions contracted to small patches of suitable habitat, like tiny islands in a freezing ocean. As the globe warmed during interglacial periods, these habitats started to spread and with them the inhabiting species. While it’s expected that this likely happened many times throughout the Pleistocene, the most clearly observed cycle would be the most recent one: referred to as the Last Glacial Maximum (LGM), at ~21,000 years ago. Thus, a quick dive into the literature shows that it is rife with phylogeographic examples of expansions and contractions related to the LGM.

glacial refugia example figure.jpg
An example of how phylogeographic analysis can find glacial refugia in species, in this case the montane caddisfly Thremma gallicum from Macher et al. (2017). The colours refer to the two datasets they used (blue = ddRADseq; red = mtDNA) and the arrows demonstrate migration pathways in the interglacial period following the LGM.

The glacial impact on genetic diversity

Why does any of this matter? Didn’t it all happen in the past? Well, that leads us back to the original point in this post: forces of evolution leave distinct impacts on the genetic architecture of species. In regards to glacial refugia, a clear pattern is often observed: populations occurring approximately in line with the refugia have maintained greater genetic diversity over time, whilst those in more unstable or unsuitable regions show much more reduced genetic diversity. And this makes sense: many of those populations likely went extinct during glaciation, and only within the last 20,000 or so years have been recolonised from nearby refugia. Accounting for genetic drift due to founder effect, it’s easy to see how this would cause genetic diversity to plummet.
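As a back-of-the-envelope illustration of how a small founding group erodes diversity, the sketch below uses a standard textbook result that is not stated in this post: in an idealised, randomly mating diploid population of constant size N, expected heterozygosity shrinks by a factor of (1 - 1/(2N)) each generation. The population sizes and starting heterozygosity are hypothetical.

```python
def expected_heterozygosity(h0, n_individuals, generations):
    """Expected heterozygosity after a number of generations of drift
    in an idealised diploid population of constant size."""
    return h0 * (1 - 1 / (2 * n_individuals)) ** generations


# Hypothetical comparison: a refugial population vs a small founder group
start = 0.5
for label, size in [("refugium (N = 1000)", 1000), ("founders (N = 10)", 10)]:
    after = expected_heterozygosity(start, size, generations=50)
    print(f"{label}: {after:.3f}")
```

Under these assumptions the founder group loses most of its heterozygosity within 50 generations, while the large refugial population barely changes.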

Case study: the charismatic cheetah

And this loss of genetic diversity isn’t just a hypothetical, or an interesting note in evolution. It can have dire impacts for the survivability of species. Take for example, the very charismatic cheetah. Like many large, apex predator species, the cheetah in the modern day is endangered and at risk of extinction to a variety of threats, and although many of these are linked to modern activity (such as being killed to protect farms or habitat clearing), some of these go back much further in history.

Believe it or not, the cheetah as a species actually originated from an ancestor in the Americas: they’re closely related to other American big cats such as the puma/cougar. During the Miocene (5 – 8 million years ago), however, the ancestor of the modern cheetah migrated a very long way to Africa, diverging from its shared ancestor with jaguarundi and cougars. Subsequent migrations into Africa and Asia (where only the Iranian subspecies remains) during the Pleistocene, dated at ~100,000 and ~12,000 years ago, have been shown through whole genome analysis to have resulted in significant reductions in the genetic diversity of the cheetah. This timing correlates with the extinction of the cheetah and puma within North America, and the worldwide extinction of many large mammals including mammoths, dire wolves and sabre-tooth tigers.

cheetah bottleneck.jpg
The demographic history of the African cheetah population, based on whole genomes in Dobrynin et al. (2015). In this figure, ‘Eastern’ refers to a Tanzanian population whilst ‘southern’ refers to a Namibian population (and as such doesn’t depict bottlenecks elsewhere in the cheetah e.g. Iran). The initial population underwent a severe genetic bottleneck ~12,000 years ago, likely due to glaciation.

What does this mean for the cheetah? Well, the cheetah has one of the lowest amounts of genetic variation for any living mammal. It’s even lower than the Tasmanian Devil, a species with such notoriously low genetic diversity that a rampant face cancer (Devil Facial Tumour Disease) is transmissible simply because their immune system can’t recognise the transferred cancer cells as being different to the host animal. Similarly, for the cheetah, it’s possible to do reciprocal skin transplants without the likelihood of organ rejection simply because their immune system is incapable of determining the difference between foreign and host tissue cells.

cheetah diversity 2.jpg
Examples of the incredibly low genetic diversity in cheetah, both from Dobrynin et al. (2015). A) shows the relative level of genetic diversity in cheetah compared to many other species, being lower than Tasmanian Devils and significantly lower than humans and domestic cats. D) shows the overall variation across the genome of a domestic cat (top), the inbred Abyssinian cat (middle) and the cheetah (bottom). Highly variable regions are indicated in red, whilst low variability regions are indicated in green. As you can see, the entirety of the cheetah genome has incredibly low genetic variation, even compared to another cat species considered to have low genetic variation (the Abyssinian).

Inference for the future

Understanding the impact of the historic environment on the evolution and genetic diversity of living species is not just important for understanding how species became what they are today. It also helps us understand how species might change in the future, by providing the natural experimental evidence of evolution in a changing climate.

 

Fantastic Genes and Where to Find Them

The genetics of adaptation

Adaptation and evolution by natural selection remains one of the most significant research questions in many disciplines of biology, and this is undoubtedly true for molecular ecology. While traditional evolutionary studies have been based on the physiological aspects of organisms and how this relates to their evolution, such as how these traits improve their fitness, the genetic component of adaptation is still somewhat elusive for many species and traits.

Hunting for adaptive genes in the genome

We’ve previously looked at the two main categories of genetic variation: neutral and adaptive. Although we’ve focused predominantly on the neutral components of the genome, and the types of questions about demographic history, geographic influences and the effect of genetic drift, they cannot tell us (directly) about the process of adaptation and natural selective changes in species. To look at this area, we’d have to focus on adaptive variation instead; that is, genes (or other related genetic markers) which directly influence the ability of a species to adapt and evolve. These are directly under natural selection, either positively (‘selected for’) or negatively (‘selected against’).

Given how complex organisms, the environment and genomes can be, it can be difficult to determine exactly what is a real (i.e. strong) selective pressure, how this is influenced by the physical characteristics of the organism (the ‘phenotype’) and which genes are fundamental to the process (the ‘genotype’). Even determining the relevant genes can be difficult; how do we find the needle-like adaptive genes in a genomic haystack?

Magnifying glass figure
If only it were this easy.

There’s a variety of different methods we can use to find adaptive genetic variation, each with particular drawbacks and strengths. Many of these are based on tests of the frequency of alleles, rather than on the exact genetic changes themselves; adaptation works more often by favouring one variant over another rather than completely removing the less-adaptive variant (this would be called ‘fixation’). So measuring the frequency of different alleles is a central component of many analyses.

FST outlier tests

One of the most classical examples is called an ‘FST outlier test’. This can be a bit complicated without understanding what FST actually measures: in short, it’s a statistical measure of ‘population differentiation due to genetic structure’. The FST value between two populations describes how genetically similar they are to one another. An FST value of 1 implies that the two populations are as genetically different as they could possibly be, whilst an FST value of 0 implies that they are genetically identical populations.
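For a feel of how the 0 and 1 extremes come about, here is a minimal sketch of one common textbook formulation, FST = (HT - HS) / HT, for a single locus with two alleles in two equally sized populations. The formulation and the function names are assumptions for illustration; the programs used in practice apply more careful estimators that correct for sample size.

```python
def heterozygosity(p):
    """Expected heterozygosity at a two-allele locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_populations(p1, p2):
    """FST at one biallelic locus for two equally weighted populations."""
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2  # mean within-population
    h_t = heterozygosity((p1 + p2) / 2)                  # pooled ('total')
    if h_t == 0:
        return 0.0  # locus is invariant everywhere; treated as 0 here by convention
    return (h_t - h_s) / h_t


print(fst_two_populations(0.3, 0.3))  # same frequency in both populations
print(fst_two_populations(1.0, 0.0))  # alternative alleles fixed in each population
print(fst_two_populations(0.6, 0.4))  # a modest frequency difference
```

Identical frequencies give an FST of 0, and a fixed difference gives an FST of 1, matching the two extremes described above.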

Generally, FST reflects neutral genetic structure: it gives a background of how different two populations are, on average. However, if we know what the average amount of genetic differentiation should be for a neutral DNA marker, then we would predict that adaptive markers are significantly different. This is because a gene under selection should be more directly pushed towards or away from one variant (allele) than another, and much more strongly than the neutral variation would predict. Thus, we might assume that alleles which are far more or less frequent than the average pattern are under selection. This is the basis of the FST outlier test; by comparing two or more populations (using FST), and looking at the distribution of allele frequencies, we can pick out a few alleles that vary from the average pattern and suggest that they are under selection (i.e. are adaptive).
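The logic of picking out loci that stray from the average pattern can be sketched very crudely as follows. This is not how dedicated outlier programs work (they build an explicit neutral expectation); it simply flags loci whose FST sits more than a chosen number of standard deviations above the mean. The per-locus values, the locus names and the cut-off of two standard deviations are all hypothetical, and only unusually high values are flagged.

```python
from statistics import mean, stdev


def flag_high_outliers(fst_by_locus, n_sd=2.0):
    """Return the loci whose FST exceeds mean + n_sd * standard deviation."""
    values = list(fst_by_locus.values())
    cutoff = mean(values) + n_sd * stdev(values)
    return sorted(locus for locus, value in fst_by_locus.items() if value > cutoff)


# Hypothetical per-locus FST values between two populations
fst_by_locus = {
    "locus_1": 0.04, "locus_2": 0.05, "locus_3": 0.06,
    "locus_4": 0.05, "locus_5": 0.04, "locus_6": 0.06,
    "locus_7": 0.05, "locus_8": 0.05, "locus_9": 0.05,
    "colour": 0.60,
}

print(flag_high_outliers(fst_by_locus))
```

A flagged locus is only a candidate: nothing in this calculation can tell whether selection or chance produced the difference.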

There are a few significant drawbacks for FST outlier tests. One of the biggest is that genetic drift can also produce a large number of outliers; in a small population, for example, one allele might be fixed (has a frequency of 1, with no alternative allele in the population) simply because there is not enough diversity or population size to sustain more alleles. Even if this particular allele was extremely detrimental, it’d still appear to be favoured by natural selection just because of drift.

Drift leading to outliers diagram
An example of genetic drift leading to outliers, featuring our friends the cat population. Top row: Two cat populations, one small (left; n = 5) and one large (middle, n = 12) show little genetic differentiation between them (right; each triangle represents a single gene or locus; the ‘colour’ gene is marked in green). The average (‘neutral’) pattern of differentiation is shown by the dashed line. Much like in our original example, one cat in the small population is horrifically struck by lightning and dies (RIP again). Now when we compare the frequency of the alleles of the two populations (bottom), we see that (because a green cat died), the ‘colour’ locus has shifted away from the general trend (right) and is now an outlier. Thus, genetic drift in the ‘colour’ gene gives the illusion of a locus under selection (even though natural selection didn’t cause the change, since colour does not relate to how likely a cat is to be struck by lightning).

Secondly, the cut-off for a ‘significant’ vs. ‘relatively different but possibly not under selection’ can be a bit arbitrary; some genes that are under weak selection can go undetected. Furthermore, recent studies have shown a growing appreciation for polygenic adaptation, where tiny changes in allele frequencies of many different genes combine together to cause strong evolutionary changes. For example, despite the clear heritable nature of height (tall people often have tall children), there is no clear ‘height’ gene: instead, it appears that hundreds of genes are potentially very minor height contributors.

Polygenic height figure final
In this example, we have one tall parent (top) who produces two offspring; one who is tall (left) and one who isn’t (right). In order to understand what genetic factors are contributing to their height differences, we compare their genetics (right; each dot represents a single locus). Although there aren’t any particular loci that look massively different between the two, the cumulative effect of tiny differences (the green triangles) together make one person taller than the other. There are no clear outliers, but many (poly) different genes (genic) acting together.

Genotype-environment associations

To overcome these biases, sometimes we might take a more methodological approach called ‘genotype-environment association’ (GEA). This analysis differs in that we select what we think our selective pressures are: often environmental characteristics such as rainfall, temperature, habitat type or altitude. We then take two types of measures per individual organism: the genotype, through DNA sequencing, and the relevant environmental values for that organism’s location. We repeat this over the full distribution of the species, taking a good number of samples per population and making sure we capture the full variation in the environment. Then we perform a correlation-type analysis, which seeks to see if there’s a connection or trend between any particular alleles and any environmental variables. The most relevant variables are often pulled out of the environmental dataset and focused on to reduce noise in the data.
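The ‘correlation-type analysis’ can be illustrated with a plain Pearson correlation between an environmental variable and the allele frequency at each locus. This is a deliberately simplified sketch: published genotype-environment methods are more elaborate, and every number and name below (rainfall values, locus names, frequencies) is invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical annual rainfall (mm) at five sampled populations
rainfall = [200, 400, 600, 800, 1000]

# Hypothetical allele frequencies at two loci in those same populations
allele_freqs = {
    "locus_tracks_rainfall": [0.10, 0.30, 0.45, 0.70, 0.90],
    "locus_unrelated": [0.50, 0.40, 0.55, 0.45, 0.50],
}

for locus, freqs in allele_freqs.items():
    print(locus, round(pearson(rainfall, freqs), 2))
```

A locus whose allele frequency rises steadily with rainfall gives a correlation near 1, making it a candidate for further scrutiny rather than proof of adaptation.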

The main benefit of GEA over FST outlier tests is that it’s unlikely to be as strongly influenced by genetic drift. Unless (coincidentally) populations are drifting at the same genes in the same pattern as the environment, the analysis is unlikely to falsely pick it up. However, it can still be confounded by neutral population structure; if one population randomly has a lot of unique alleles or variation, and also occurs in a somewhat unique environment, it can bias the correlation. Furthermore, GEA is limited by the accuracy and relevance of the environmental variables chosen; if we pick only a few, or miss the most important ones for the species, we won’t be able to detect a large number of very relevant (and likely very selective) genes. This is a universal problem in model-based approaches and not just limited to GEA analysis.

New spells to find adaptive genes?

It seems likely that with increasing datasets and better analytical platforms, many more types of analysis will be developed to delve deeper into the adaptive aspects of the genome. With whole-genome sequencing starting to become a reality for non-model species, better annotation of current genomes and a steadily increasing database of functional genes, the ability of researchers to investigate evolution and adaptation at the genomic level is also increasing.

Evolution and the space-time continuum

Evolution travelling in time

As I’ve mentioned a few times before, evolution is a constant force that changes and flows over time. While sometimes it’s more convenient to think of evolution as a series of rather discrete events (a species pops up here, a population separates here, etc.), it’s really a more continual process. The context and strength of evolutionary forces, such as natural selection, change as species and the environment they inhabit also change. This is important to remember in evolutionary studies because although we might think of more recent and immediate causes of the evolutionary changes we see, they might actually reflect much more historic patterns. For example, extremely low contemporary levels of genetic diversity in cheetah are likely largely due to a severe reduction in their numbers during the last ice age, ~12 thousand years ago (that’s not to say that modern human issues haven’t also been seriously detrimental to them). Similarly, we can see how the low genetic diversity of a small population colonising a new area can have long term effects on their genetic variation: this is called ‘founder effect’. Because of this, we often have to consider the temporal aspect of a species’ evolution.

Founder effect diagram
An example of founder effect. Each circle represents a single organism; the different colours are an indicator of how much genetic diversity that individual possesses (more colours = more variation). We start with a single population; one (A) or two (B) individuals go on a vacation and decide to stay on a new island. Even after the population has become established and grows over time, it takes a long time for new diversity to arise. This is because of the small original population size and genetic diversity; this is called founder effect. The more genetic diversity in the settled population (e.g. B vs A), the faster new diversity arises and the weaker the founder effect.

Evolution travelling across space

If the environmental context of species and populations is also important for determining the evolutionary pathways of organisms, then we must also consider the spatial context. Because of this, we also need to look at where evolution is happening in the world; what kinds of geographic, climatic, hydrological or geological patterns are shaping and influencing the evolution of species? These patterns can influence both neutral and adaptive processes by shaping exactly how populations or species exist in nature; how connected they are, how many populations they can sustain, how large those populations can sustainably become, and what kinds of selective pressures those populations are under.

Allopatry diagram
An example of how the environment (in this case, geology) can have both neutral and adaptive effects. Let’s say we start with one big population of cats (N = 9; A), which is distributed over a single large area (the green box). However, a sudden geological event causes a mountain range to uplift, splitting the population in two (B). Because of the reduced population size and the (likely) randomness of which individuals are on each side, we expect some impact of genetic drift. Thus, this is the neutral influence. Over time, these two separated regions might change climatically (C), with one becoming much more arid and dry (right) and the other more wet and shady (left). Because of the difference of the selective environment, the two populations might adapt differently. This is the adaptive influence. 

Evolution along the space-time continuum

Given that the environment also changes over time (and can do so very rapidly, as we’ve seen recently), the interaction of the spatial and temporal aspects of evolution is critical in understanding the true evolutionary history of species. As we know, the selective environment is what determines what is, and isn’t, adaptive (or maladaptive), so we can easily imagine how a change in the environment could push changes in species. Even from a neutral perspective, geography is important to consider since it can directly determine which populations are or aren’t connected, how many populations there are in total or how big populations can sustainably get. It’s always important to consider how evolution travels along the space-time continuum.

Genetics TARDIS
“Postgraduate Student Who” doesn’t quite have the same ring to it, unfortunately.

Phylogeography

The field of evolutionary science most concerned with these two factors and how they influence evolution is known as ‘phylogeography’, which I’ve briefly mentioned in previous posts. In essence, phylogeographers are interested in how the general environment (e.g. geology, hydrology, climate, etc) has influenced the distribution of genealogical lineages. That’s a bit of a mouthful and seems a bit complicated, but the genealogical part is important; phylogeography has a keen basis in evolutionary genetics theory and analysis, and explicitly uses genetic data to test patterns of historic evolution. Simply testing the association between broad species or populations and their environment, without the genetic background, falls under the umbrella field of ‘biogeography’. Semantics, but important.

Birds phylogeo
Some example phylogeographic models created by Zamudio et al. (2016). For each model, there’s a demonstrated relationship between genealogical lineages (left) and the geographic patterns (right), with the colours of the birds indicating some trait (let’s pretend they’re actually super colourful, as birds are). As you can see, depending on which model you look at, you will see a different evolutionary pattern; for example, one model shows that specific lineages which are geographically isolated from one another each evolved their own colour. This contrasts with another model, in which each colour appears to have evolved once in each region based on the genetic history.

For phylogeography, the genetic history of populations or species gives the more accurate overview of their history; it allows us to test when populations or species became separated, which were most closely related, and whether patterns are similar or different across other taxonomic groups. Predominantly, phylogeography is based on neutral genetic variation, as using adaptive variation can confound the patterns we are testing. Additionally, since neutral variation changes over time in a generally predictable, mathematical format (see this post to see what I mean), we can make testable models of various phylogeographic patterns and see how well our genetic data makes sense under each model. For example, we could make a couple different models of how many historic populations there were and see which one makes the most sense for our data (with a statistical basis, of course). This wouldn’t work with genes under selection since they (by their nature) wouldn’t fit a standard ‘neutral’ model.

Coalescent
If it looks mathematically complicated, it’s because it is. This is an example of the coalescent from Brito & Edwards, 2008: a method that maps genes back in time (the different lines) to see where the different variants meet at a common ancestor. These genes are nested within the history of the species as a whole (the ‘tubes’), with many different variables accounted for in the model.

That said, there are plenty of interesting scientific questions within phylogeography that look at exploring the adaptive variation of historic populations or species and how this has influenced their evolution. Although this can’t inherently be built into the same models as the neutral patterns, looking at candidate genes that we think are important for evolution and seeing how their distributions and patterns relate to the overall phylogeographic history of the species is one way of investigating historic adaptive evolution. For example, we might track changes in adaptive genes by seeing which populations have which variants of the gene and referring to our phylogeographic history to see how and when these variants arose. This can help us understand how phylogeographic patterns have influenced the adaptive evolution of different populations or species, or inversely, how adaptive traits might have influenced the geographic distribution of species or populations.

Where did you come from and where will you go?

Phylogeographic studies can tell us a lot about the history of a species, and particularly how that relates to the history of the Earth. All organisms share an intimate relationship with their environment, both over time and space, and keeping this in mind is key for understanding the true evolutionary history of life on Earth.

 

Drifting or driving: directionality in evolution

How random is evolution?

Often, we like to think of evolution fairly anthropomorphically; as if natural selection actively decides what is, and what isn’t, best for the evolution of a species (or population). Of course, there’s not some explicit Evolution God who decrees how a species should evolve, and in reality, evolution reflects a more probabilistic system. Traits that give a species a better chance of reproducing or surviving, and can be inherited by the offspring, will over time become more and more dominant within the species; contrastingly, traits that do the opposite will be ‘weeded out’ of the gene pool as maladaptive organisms die off or are outcompeted by more ‘fit’ individuals. The fitness value of a trait can be determined from how much the frequency of that trait varies over time.

So, if natural selection is just probabilistic, does this mean evolution is totally random? Is it just that traits are selected based on what just happens to survive and reproduce in nature, or are there more direct mechanisms involved? Well, it turns out both processes are important to some degree. But to get into it, we have to explain the difference between genetic drift and natural selection (we’re assuming here that our particular trait is genetically determined).  

Allele frequency over time diagram
The (statistical) overview of natural selection. In this example, we have two different traits in a population; the blue X and the red O. Our starting population is 20 individuals (N), with 10 of each trait (a 1:1 ratio, or 50% frequency of each). We’re going to assume that, because the blue X is favoured by natural selection, it doubles in number each generation (i.e. one individual with the blue X has two offspring with one blue X each). The red O is neither here nor there and is stable over time (one red O produces one red O in the next generation). So, going from Gen 1 to Gen 2, we have twice as many blue Xs (Nt) as we did previously, changing the overall frequency of the traits (highlighted in yellow). Because populations probably don’t exponentially increase every generation, we’ll cut it back down to our original total of 20, but at the same ratios (Np). Over time, we can see that the population gradually accumulates more blue Xs relative to red Os, and by Gen 5 the red O is extinct. Thus, the blue X has evolved!
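The arithmetic in this example can be written as a simple recursion on the frequency of the favoured trait: each generation, every blue X leaves two offspring and every red O leaves one, and the population is then rescaled to its original size. The sketch below tracks frequencies only and does not round to whole individuals, so the exact counts may differ from the figure; the function and parameter names are hypothetical.

```python
def selection_trajectory(start_freq, offspring_favoured, offspring_other, generations):
    """Frequency of the favoured trait in each generation, given the number
    of offspring left by each type of individual."""
    freqs = [start_freq]
    p = start_freq
    for _ in range(generations - 1):
        favoured = p * offspring_favoured
        other = (1 - p) * offspring_other
        p = favoured / (favoured + other)  # rescale back to a constant population
        freqs.append(p)
    return freqs


# Blue X starts at 50% and doubles each generation; red O just replaces itself
population_size = 20
for gen, freq in enumerate(selection_trajectory(0.5, 2, 1, generations=5), start=1):
    print(f"Gen {gen}: blue X frequency = {freq:.3f} "
          f"(~{freq * population_size:.1f} of {population_size})")
```

Without rounding, the red O becomes rare but never quite reaches a frequency of zero; in a real population of 20, whole-number individuals and chance would finish the job.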

When we consider the genetic variation within a species to be our focal trait, we can tell that different parts of the genome might be more related with natural selection than others. This makes sense; some mutations in the genome will directly change a trait (like fur colour) which might have a selective benefit or detriment, while others might not change anything physically or change traits that are neither here-nor-there under natural selection (like nose shape in people, for example). We can distinguish between these two by talking about adaptive or neutral variation; adaptive variation has a direct link to natural selection whilst neutral variation is predominantly the product of genetic drift. Depending on our research questions, we might focus on one type of variation over the other, but both are important components of evolution as a whole.

Genetic drift

Genetic drift is considered the random, selectively ‘neutral’ changes in the frequencies of different traits (alleles) over time, due to completely random effects such as random mutations or random loss of alleles. This results in the neutral variation we can observe in the gene pool of the species. Changes in allele frequencies can happen due to entirely stochastic events. If, by chance, all of the individuals with the blue fur variant of a gene are struck by lightning and die, the blue fur allele would end up with a frequency of 0 i.e. go extinct. That’s not to say the blue fur ‘predisposed’ the individuals to be struck by lightning (we assume here, anyway), so it’s not like it was ‘targeted against’ by natural selection (see the bottom figure for this example).

Because neutral variation appears under a totally random, probabilistic model, the mathematical basis of it (such as the rate at which mutations appear) has been well documented and is the foundation of many of the statistical aspects of molecular ecology. Much of our ability to detect which genes are under selection is by seeing how much the frequencies of alleles of that gene vary from the neutral model: if one allele is way more frequent than you’d expect by random genetic drift, then you’d say that it’s likely being ‘pushed’ by something: natural selection.

Manhattan plot example
A Manhattan plot, which measures the level of genetic differentiation between two different groups across the genome. The x-axis shows the length of the genome, in this example colour-coded by the specific chromosome of the sequence, while the y-axis shows the level of differentiation between the two groups being studied. The dots represent certain spots (loci, singular locus) in the genome, with the level of differentiation (Fst) measured for that locus in one group vs that locus in the other group. The dotted line represents the ‘average differentiation’: i.e. how different you’d expect the two groups to be by chance. Anything above that line is significantly different between the two groups, either because of drift or natural selection. This plot has been slightly adapted from Axelsson et al. (2013), who were studying domestication in dogs by comparing the genetic architecture of wild wolves versus domestic dogs. In this example we can see that certain regions of the genome are clearly different between dogs and wolves (circled); when the authors looked at the genes within those blocks, they found that many were related to behavioural changes (nervous system), competitive breeding (sperm-egg recognition) and interestingly, starch digestion. This last category suggests that adaptation to an omnivorous diet (likely human food waste) was key in the domestication process.

Natural selection

Contrastingly to genetic drift, natural selection is when particular traits are directly favoured (or unfavoured) in the environmental context of the population; natural selection is very specific to both the actual trait and how the trait works. A trait is only selected for if it conveys some kind of fitness benefit to the individual; in evolutionary genetics terms, this means it allows the individual to have more offspring or to survive better (usually).

While this might be true for a trait in a certain environment, in another it might be irrelevant or even have the reverse effect. Let’s again consider white fur as our trait under selection. In an arctic environment, white fur might be selected for because it helps the animal to camouflage against the snow to avoid predators or catch prey (and therefore increase survivability). However, in a dense rainforest, white fur would stand out starkly against the shadowy greenery of the foliage and thus make the animal a target, making it more likely to be taken by a predator or avoided by prey (thus decreasing survivability). Thus, fitness is very context-specific.

Who wins? Drift or selection?

So, which is mightier, the pen (drift) or the sword (selection)? Well, it depends on a large number of different factors such as mutation rate, the importance of the trait under selection, and even the size of the population. This last one might seem a little different to the other two, but it’s critically important to which process governs the evolution of the species.

In very small populations, we expect genetic drift to be the stronger process. Natural selection is often comparatively weaker because small populations have less genetic variation for it to act upon; there are fewer gene variants to choose from that might be more beneficial than others. In severe cases, many of the traits are probably very maladaptive, but there’s just no better variant to be selected for; look at the plethora of physiological problems in the cheetah for some examples.

Genetic drift, however, doesn’t really care if there’s “good” or “bad” variation, since it’s totally random. That said, it tends to be stronger in smaller populations because a small, random change in the number or frequency of alleles can have a huge effect on the overall gene pool. Let’s say you have 5 cats in your species; they’re nearly extinct, and probably have very low genetic diversity. If one cat suddenly dies, you’ve lost 20% of your species (and up to that percentage of your genetic variation). However, if you had 500 cats in your species, and one died, you’d lose only 0.2% of your species (and at most that share of your genetic variation), and the gene pool would barely even notice. The same applies to random mutations, or if one unlucky cat doesn’t get to breed because it can’t find a mate, or any other random, non-selective reason. One way we can think of this is as ‘random error’ in evolution; even a perfectly adapted organism might not pass on its genes if it is really unlucky. A bigger sample size (i.e. more individuals) means this will have less impact on the total dataset (i.e. the species), though.
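
To make the ‘random error’ idea concrete, here is a rough sketch of drift as a simulation. It assumes a very simplified model (often called a Wright-Fisher model): every cat carries a single copy of the gene, the population size never changes, and each cat in the next generation picks its allele at random from the current gene pool, with no allele favoured. The function names, the starting frequency of 40% and the number of generations are all choices made for this illustration.

```python
import random


def simulate_drift(n_individuals, start_freq, generations, rng):
    """Track one allele's frequency under pure drift (no selection)."""
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Each individual in the next generation inherits the allele with
        # probability equal to its current frequency; nothing is favoured.
        copies = sum(1 for _ in range(n_individuals) if rng.random() < freq)
        freq = copies / n_individuals
        history.append(freq)
    return history


def fraction_fixed_or_lost(n_individuals, replicates=100, generations=50, seed=1):
    """Share of replicate populations where the allele ended at 0% or 100%."""
    rng = random.Random(seed)
    finished = 0
    for _ in range(replicates):
        final_freq = simulate_drift(n_individuals, 0.4, generations, rng)[-1]
        if final_freq in (0.0, 1.0):
            finished += 1
    return finished / replicates


small = fraction_fixed_or_lost(5)    # the 5-cat species
large = fraction_fixed_or_lost(500)  # the 500-cat species
print(f"5 cats:   allele fixed or lost in {small:.0%} of runs")
print(f"500 cats: allele fixed or lost in {large:.0%} of runs")
```

In the 5-cat species, chance alone almost always pushes the allele to either 0% or 100% within 50 generations, while in the 500-cat species the frequency usually just wobbles around where it started.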

Drift in small pops
The effect of genetic drift on small populations. In this example, we have two very similar populations of cats, each with three different alleles (black, blue and green) in similar frequencies across the populations. The major difference is the size of the population; the left is much smaller (5 cats) compared to the right (20 cats). If one cat randomly dies from a bolt of lightning (RIP), and assuming that the colour of the cat has no effect on the likelihood of being struck by lightning (i.e. is not under natural selection), then the outcome of this event is entirely due to genetic drift. In this case, the left population has lost 1/5th of its population size and 1/3rd of its total genetic diversity thanks to the death of the genetically unique blue cat (he will be missed), whereas the right population has lost only 1/20th of its size, with no change in total diversity (it’ll recover).
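
The arithmetic in this figure can be checked with a few lines of code. This is only a sketch: the exact number of cats of each colour is assumed here (the caption only says the frequencies are similar), each cat is treated as carrying a single allele, and ‘genetic diversity’ is measured simply as the number of different alleles present.

```python
from fractions import Fraction


def loss_after_death(population, victim_index):
    """Return (share of population lost, share of distinct alleles lost)."""
    survivors = population[:victim_index] + population[victim_index + 1:]
    size_lost = Fraction(len(population) - len(survivors), len(population))
    alleles_before = set(population)
    alleles_after = set(survivors)
    diversity_lost = Fraction(len(alleles_before - alleles_after), len(alleles_before))
    return size_lost, diversity_lost


# Assumed colour counts: 40% black, 40% green and 20% blue in both populations.
small_population = ["black", "black", "green", "green", "blue"]
large_population = ["black"] * 8 + ["green"] * 8 + ["blue"] * 4

# Lightning strikes one blue cat in each population.
small_loss = loss_after_death(small_population, small_population.index("blue"))
large_loss = loss_after_death(large_population, large_population.index("blue"))

print(f"Small population: {small_loss[0]} of its size, {small_loss[1]} of its alleles")
print(f"Large population: {large_loss[0]} of its size, {large_loss[1]} of its alleles")
```

The same unlucky event removes an entire allele from the small population, but the large population still has other blue cats to carry it on.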

Both genetic drift and natural selection are important components of evolution, and together they shape the overall patterns of evolution for any given species on the planet. The two processes can even feed into one another; random mutations (drift) might become the genetic basis of new selective traits (natural selection) if the environment changes to suit the new variation. Ignoring one in favour of the other would therefore fail to capture the full breadth of the processes which ultimately shape the evolution of all species on Earth, and thus the formation of the diversity of life.