UnConservation Genetics: tools for managing invasive species

Conservation genetics

Naturally, all species play their role in the balancing and functioning of ecosystems across the globe (even the ones we might not like all that much, personally). Persistence or extinction of ecologically important species is a critical component of the overall health and stability of an ecosystem, and thus our aim as conservation scientists is to attempt to use whatever tools we have at our disposal to conserve species. One of the most central themes in conservation ecology (and to The G-CAT, of course) is the notion that genetic information can be used to better our conservation management approaches. This usually involves understanding the genetic history and identity of our target threatened species from which we can best plan for their future. This can take the form of genetic-informed relatedness estimates for breeding programs; identifying important populations and those at risk of local extinction; or identifying evolutionarily-important new species which might hold unique adaptations that could allow them to persist in an ever-changing future.

Applications of conservation genetics.jpg
Just a few applications of genetic information in conservation management, such as in breeding programs and pedigrees (left), identifying new/cryptic species (centre) and identifying and maintaining populations and their structure (right).

The Invaders

Contrastingly, sometimes we might also use genetic information to do the exact opposite. While so many species on Earth are at risk (or have already passed over the precipice) of extinction, some have gone rogue with our intervention. These are, of course, invasive species: pests that have been introduced into new environments and, by their prolific nature, start to throw out the balance of the ecosystem. Australians will be familiar with no shortage of relevant invasive species, the most notable of which is the cane toad, Rhinella marina. However, there is a plethora of invasive species which range from notably prolific (such as the cane toad) to the seemingly mundane (such as the blackbird): so how can we possibly deal with the number and diversity of pests?

Table of invasive species in Australia
A table of some of the most prolific mammalian invasive species in Australia, including when they were first introduced and why, and their (relatively) recently estimated population sizes. Source: Wikipedia (and studies referenced therein). Some estimated numbers might not reflect current sizes as they were obtained from studies over the last 10 years.

Tools for invasive species management

There are a number of tools at our disposal for dealing with invasive species. These range from chemical controls (like pesticides), to biological controls and more recently to targeted genetic methods. Let’s take a quick foray into some of these different methods and their applications to pest control.

Types of control tools for invasive species
Some of the broad categories of invasive species control. For any given pest species, such as the cane toad (top), we might choose to use a particular set of methods to reduce their numbers. These can include biological controls (such as the ladybird, for aphid populations (left)); chemical controls such as pesticides; or even genetic engineering technologies.

Biological controls

One of the most traditional methods of pest control is biological control. A biological control is, in simple terms, a species that can be introduced to an afflicted area to control the population of an invasive species. Usually, this is based on some form of natural co-evolution or hierarchy: species which naturally predate upon, infect or otherwise displace the pest in question are preferred. The basis of this choice is that nature, and evolution by natural selection, often creates a near-perfect machine adapted for handling the exact problem.

Biological controls can have very mixed results. In some cases, they can be relatively effective, such as the introduction of the moth Cactoblastis cactorum into Australia to control the invasive prickly pear. The moth lays eggs exclusively within the tissue of the prickly pear, and the resultant caterpillars ravage the plant. The caterpillars have not been associated with any secondary diet items, suggesting the control method has been very selective and precise.

Moth biological control flow chart
The broad life cycle of the cactus moth and how it controls the invasive prickly pear in Australia. The ravenous caterpillar larvae of the moth are effective at decimating prickly pears, whilst the moth’s specificity to this host means there is limited impact on other plant species.

In contrast, bad biological controls can lead to ecological disasters. As mentioned above, the introduction of the cane toad into Australia has been widely regarded as the origin of one of the worst invasive pests in the nation’s history. Initially, cane toads were brought over in the 1930s to predate on the (native) cane beetle, which was causing significant damage to sugar cane plantations in the tropical north. Not overly effective at actually dealing with the problem they were supposed to deal with, the cane toad rapidly spread across the northern portion of the continent. Native species that attempt to predate on the cane toad often die from its defensive toxin, causing massive ecological damage to the system.

The potential secondary impact of biological controls, and the degree of unpredictability in how they will respond to a new environment (and how native species will also respond to their introduction), leads conservationists to develop new, more specific techniques. In similar ways, viral and bacterial-based controls have had limited success (although they are still often proposed in conservation management, such as the planned carp herpesvirus release).

Genetic controls?

It is clear that more targeted and narrow techniques are required to effectively control pest species. At a more micro level, individual genes could be used to manage species: this is not the first way genetic modification has been proposed to deal with problem organisms. Genetic methods have been employed for years in crop farming, where crops are engineered to produce ‘natural’ pesticides or insecticides. In a similar vein, it has been proposed that genetic modification could be a useful tool for dealing with invasive pests and their native victims.

Gene drives

One targeted, genetic-based method that has shown great promise is the gene drive. Following some of the theory behind genetic engineering, gene drives are targeted suites of genes (or alleles) which, by their own selfish nature, propagate through a population at a much higher rate than other alternative genes. In conjunction with other DNA modification methods, which can create fatal or sterilising genetic variants, gene drives present the opportunity to allow the natural breeding of an invasive species to spread the detrimental modified gene.

Gene drive diagram
An example of how gene drives are being proposed to tackle malaria. In this figure, the pink mosquito at the top has been genetically engineered using CRISPR to possess two important genetic elements: a genetic variant which causes the mosquito to be unable to produce eggs or bite (the pink gene), and a linked selfish genetic element (the gene drive itself; the plus) which makes this detrimental allele spread more rapidly than by standard inheritance. Sources: Nature and The Australian Academy of Science.
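To make the "spreads more rapidly than by standard inheritance" idea a little more concrete, here is a minimal Python sketch of how a drive allele's frequency might change from one generation to the next under random mating. It is a toy model built on assumptions that do not come from the sources above: the function names and the `conversion` parameter are hypothetical, heterozygotes are assumed to pass on the drive allele with probability (1 + conversion) / 2, and any fitness cost of the sterilising or lethal payload is ignored entirely. Setting `conversion=0.0` recovers ordinary Mendelian inheritance.

```python
def next_frequency(p, conversion=0.0):
    """Drive allele frequency after one generation of random mating.

    p          -- current frequency of the drive allele (0 to 1)
    conversion -- assumed probability that the drive copies itself onto the
                  other chromosome in a heterozygote (0 = Mendelian)
    Fitness costs are deliberately ignored in this sketch.
    """
    q = 1.0 - p
    homozygotes = p * p            # always transmit the drive allele
    heterozygotes = 2.0 * p * q    # transmit it more than half the time
    transmission = (1.0 + conversion) / 2.0
    return homozygotes + heterozygotes * transmission


def trajectory(p0, generations, conversion=0.0):
    """Frequencies from generation 0 to `generations`, inclusive."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], conversion))
    return freqs


if __name__ == "__main__":
    mendelian = trajectory(0.01, 12, conversion=0.0)
    drive = trajectory(0.01, 12, conversion=0.95)
    for gen, (m, d) in enumerate(zip(mendelian, drive)):
        print(f"gen {gen:2d}  Mendelian {m:.3f}  drive {d:.3f}")
```

In this toy model, an allele without drive simply stays at its starting frequency, while the drive allele climbs every generation. A real drive carrying a harmful payload would be expected to spread more slowly than this sketch suggests, or potentially not at all.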

Although a relatively new, and untested, technique, gene drive technology has already been proposed as a method to address some of the prolific invasive mammals of New Zealand. Naturally, there are a number of limitations and reservations for the method; similar to biological control, there is concern for secondary impact on other species that interact with the invasive host. Hybridisation between invasive and native species would cause the gene drive to be spread to native species, counteracting the conservation efforts to save natives. For example, a gene drive could not reasonably be proposed to deal with feral wild dogs in Australia without massively impacting the ‘native’ dingo.

Genes for non-genetic methods

Genetic information, more broadly, can also be useful for pest species management without necessarily directly feeding into genetic engineering methods. The various population genetic methods that we’ve explored over a number of different posts can also be applied in informing management. For example, understanding how populations are structured, and the sizes and demographic histories of these populations, may help us to predict how they will respond in the future and best focus our efforts where they are most effective. By including analysis of their adaptive history and responses, we may start to unravel exactly what makes a species a good invader and how to best predict future susceptibility of an environment to invasion.

Table of genetic information applications
A comprehensive table of the different ways genetic information could be applied in broader invasive species management programs, from Rollins et al. (2006). This paper specifically relates to pest management within Western Australia but the concepts listed here apply broadly. Many of these concepts we have discussed previously in a conservation management context as well.
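As a small illustration of what "understanding how populations are structured" can look like in practice, the sketch below computes a simple version of F_ST, a widely used summary of how different the allele frequencies of several populations are. F_ST is not named in this post, so treat it as my choice of example; the function name and the input frequencies are hypothetical, and the calculation assumes a single biallelic locus with equally weighted populations.

```python
def fst(allele_freqs):
    """Simple F_ST for one biallelic locus.

    allele_freqs -- frequency of the same allele in each population
    Uses F_ST = (H_T - H_S) / H_T, where H_S is the mean expected
    heterozygosity within populations and H_T is the expected
    heterozygosity of the pooled (averaged) frequency.
    """
    if not allele_freqs:
        raise ValueError("need at least one population")
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # allele fixed or absent everywhere: no structure to measure
    h_within = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / len(allele_freqs)
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst([0.5, 0.5]))   # identical populations
    print(fst([0.2, 0.8]))   # moderately different populations
    print(fst([0.0, 1.0]))   # completely different populations
```

Values near zero suggest the pest populations are effectively one interbreeding unit, while values near one suggest isolated populations that might be managed (or eradicated) separately. Real analyses use many loci and account for sample sizes, which this sketch does not.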

The better we understand invasive species and populations from a genetic perspective, the more informed our management efforts can be and the more likely we are to be able to adequately address the problem.

Managing invasive pest species

The impact of human settlement into new environments extends far beyond our direct influences. With our arrival, particularly in the last few hundred years, human migration has been an effective conduit for the spread of ecologically-disastrous species which undermine the health and stability of ecosystems around the globe. As such, it is our responsibility to Earth to attempt to address our problems: new genetic techniques are but one growing avenue by which we might be able to remove these invasive pests.

The human race(s)? Perspectives from genetics

The genetic testing of race

In one form or another, you may have been (unfortunately) exposed to the notion of ‘testing for someone’s race using genetics.’ In one sense, this is part of the motivation and platform of ‘23andMe’, which maps the genetic variants across the human genome back to likely origin populations to determine the relative ancestry of a person. In a much darker sense, the connection between genetic identity and race is the basis of eugenics, by suggesting genetic “purity” (this concept is utter nonsense, for reference) of a population as justification for some racist hierarchy. Typically, this is associated with Hitler’s Nazism, but more subversive versions of this association still exist in the world: for Australian readers, most notably when the far-right conservative minor party One Nation suggested that people claiming to be Indigenous should be subjected to genetic testing to verify their race.

DNA Ancestry map.jpg
A simplified overview of how DNA Ancestry methods work, by associating particular genetic variants within your genome to likely regions of origin. Note the geographic imprecision in the method on the map on the right, as well as the clear gaps. Source: Ancestry blog.

The biological concept of a ‘race’

Beyond the apparent ethical and moral objections to the invasive nature of demanding genetic testing for Indigenous peoples, a crucial question is one of feasibility: even if you decided to genetically test for race, is this possible? It might come as a surprise to non-geneticists that actually, from a genetic perspective, race is not a particularly stable concept.

The notion of races based on genetics has been a highly controversial topic throughout the development of genetic theory and research. Even recently, James Watson (as in of Watson & Crick, who were credited with the discovery of the structure of DNA) was stripped of several titles (including Chancellor Emeritus) following some controversial (and scientifically invalid) comments on the nature of race, genetics and intelligence. Comfortingly, the vast majority of the scientific community opposed his viewpoints on the matter, and in fact it has long been held that a ‘genetic race’ is not a scientifically stable concept.

James Watson.jpg
James Watson himself. I bet Rosalind Franklin never said anything like this… Source: Wikipedia.

You might ask: why is that? There are perceivable differences in the various peoples of the world, surely some of those could be related to both a ‘race’ and a ‘genetic identity’, right? Well, the issue is primarily due to the lack of identifiability of genetic variants that can be associated with a race. Decades of research in genetic variation across the global human population indicates that, due to the massive size of the human population and levels of genetic variation, it is functionally impossible to pinpoint genetic variants that uniquely identify a ‘race’. Human genetic variation is such a beautiful spectrum of alleles that it becomes impossible to reliably determine where one end of the spectrum ends or begins, or to identify a strict number of ‘races’ within the kaleidoscope of the human genome.

How does this relate to 23andMe?

How does this relate to your ‘23andMe’ results? Well, chances are that some genetic variants might be able to be traced back to a particular region (e.g. Europe, somewhere). But naturally, there’s a significant number of limitations to this kind of inference; notably, that we don’t have reliable references from ancient history to draw upon very often. This, combined with the fact that humans have mixed among ourselves (and even with other species) for millennia, means that tracing back individual alleles is exceedingly difficult.

Genetic variation and non-identifiability of race figure
A diagram of exactly why identifying a genetic basis for race is impossible in humans. A) The ‘idealised’ version of race; people are easily classified by their genetic identity, with some variation within each classification (in this case, race) but still distinctiveness between them. B) The reality of human genetic variation, which makes it exceedingly difficult to make any robust or solid boundaries between groups of people due to the sheer amount of variation. Source: Harvard University blog.

This is especially difficult for people who might have fewer sequenced ancestors or relatives; without the reference for genetic variation, it can be even harder to trace their genetic ancestry. Such is the case for Indigenous Australians, for whom there is a distinct lack of available genetic data (especially compared to European-descended Australians).

The non-genetic components

The genetic non-identifiability of race is but one aspect which contradicts the rationality of genetic race testing. As we discussed in the previous post on The G-CAT, the connection between genetic underpinning and physicality is not always clear or linear. The role of the environment in the expression of genetic variation, as well as the general influence of environment on aspects such as behaviour, philosophy, and culture, means that more than the genome contributes to a person’s identity. For any given person, how they express and identify themselves is often more strongly associated with their non-genetic traits such as beliefs and culture.

genetic vs cultural inheritance.jpg
A comparison of genetic vs. cultural inheritance, which demonstrates (as an example) how other factors (in this case, other people) influence the passing on of cultural traits. Remember that this is but one aspect of the factors that determine culture and identity, and equally (probably more) complex networks exist for other influences such as environment and development. Source: Creanza et al. (2017), PNAS.

These factors cannot reliably be tested under a genetic framework. While there may be some influence of genes on how a person’s psychology develops, it is unlikely to be able to predict the lifestyle, culture and complete identity of said person. For Indigenous Australians, this has been confounded by the corruption and disruption of their identity through the Stolen Generation. As a result, many Indigenous descendants may not appear (from a genetic point of view) to be purely Indigenous but their identity and culture as an Indigenous person is valid. To suggest that their genetic ancestry more strongly determines their identity than anything else is not only naïve from a scientific perspective, but nothing short of a horrific simplification and degradation of those seeking to reclaim their identity and culture.

The non-identifiability of genetic race

The science of genetics overwhelmingly suggests that there is no fundamental genetic underpinning of ‘race’ that can be reliably used. Furthermore, the impact of non-genetic factors on determining the more important aspects of personal identity, such as culture, tradition and beliefs, demonstrates that attempting to delineate people into subcategories by genetic identity is an unreliable method. Instead, genetic research and biological history fully acknowledge and embrace the diversity of the global human population. As it stands, the phrase ‘human race’ might be the most biologically-sound classification of people: we are all the same.

Crossing the Wires: why ‘genetic hardwiring’ is not the whole story

The age-old folly of ‘nature vs. nurture’

It should come as no surprise to any reader of The G-CAT that I’m a firm believer against the false dichotomy (and yes, I really do love that phrase) of “nature versus nurture.” Primarily, this is because the phrase gives the impression of some kind of counteracting balance between intrinsic (i.e. usually genetic) and extrinsic (i.e. usually environmental) factors and how they play a role in behaviour, ecology and evolution. While both are undoubtedly critical for adaptation by natural selection, posing this as a black-and-white split removes the possibility of interactive traits.

We know readily that fitness, the measure by which adaptation or maladaptation can be quantified, is the product of both the adaptive value of a certain trait and the environmental conditions said trait occurs in. A trait that might confer strong fitness in one environment may be very, very unfit in another. A classic example is fur colour in mammals: in a snowy environment, a white coat provides camouflage for predators and prey alike; in a rainforest environment, it’s like wearing one of those fluoro-coloured safety vests construction workers wear.

Genetics and environment interactions figure.jpg
The real Circle of Life. Not only do genes and the environment interact with one another, but genes may interact with other genes and environments may be complex and multi-faceted.

Genetically-encoded traits

In the “nature versus nurture” context, the ‘nature’ traits are often inherently assumed to be genetic. This is because genetic traits are intrinsic as a fundamental aspect of life, inheritable (and thus can be passed on and undergo evolution by natural selection) and define the important physiological traits that provide (or prevent) adaptation. Of course, not all of the genome encodes phenotypic traits at all, and even less of it relates to diagnosable and relevant traits for natural selection to act upon. In addition, there is a bit of an assumption that many physiological or behavioural traits are ‘hardwired’: that is, despite any influence of environment, genes will always produce a certain phenotype.

Adaptation from genetic variation.jpg
A very simplified example of adaptation from genetic variation. In this example, we have two different alleles of a single gene (orange and blue). Natural selection favours the blue allele so over time it increases in frequency. The difference between these two alleles is at least one base pair of DNA sequence; this often arises by mutation processes.
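The "increases in frequency over time" part of this example can be sketched in a few lines of Python. This is a minimal haploid selection model, assuming each allele has a fixed relative fitness; the function names and the fitness values are hypothetical and chosen only for illustration.

```python
def select(p_blue, w_blue, w_orange):
    """Frequency of the blue allele after one generation of selection.

    p_blue   -- current frequency of the blue allele (0 to 1)
    w_blue   -- assumed relative fitness of blue carriers
    w_orange -- assumed relative fitness of orange carriers
    """
    mean_fitness = p_blue * w_blue + (1.0 - p_blue) * w_orange
    if mean_fitness <= 0.0:
        raise ValueError("mean fitness must be positive")
    return p_blue * w_blue / mean_fitness


def selection_trajectory(p0, w_blue, w_orange, generations):
    """Blue allele frequencies from generation 0 to `generations`."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(select(freqs[-1], w_blue, w_orange))
    return freqs


if __name__ == "__main__":
    # Blue carriers leave 10% more offspring than orange carriers (assumed).
    for gen, p in enumerate(selection_trajectory(0.05, 1.1, 1.0, 50)):
        if gen % 10 == 0:
            print(f"generation {gen:2d}: blue allele frequency {p:.3f}")
```

Swapping the two fitness values mimics moving the same population into an environment where orange is favoured instead, which is one simple way of seeing why fitness depends on the environment as well as the gene.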

Despite how important the underlying genes are for the formation of proteins and definition of physiology, they are not omnipotent in that regard. In fact, many other factors can influence how genetic traits relate to phenotypic traits: we’ve discussed a number of these in minor detail previously. An example includes interactions across different genes: these can be due to physiological traits encoded by the cumulative presence and nature of many loci (as in quantitative trait loci and polygenic adaptation). Alternatively, one gene may translate to multiple different physiological characters if it shows pleiotropy.

Differential expression

One non-direct way genetic information can impact on the phenotype of an organism is through something we’ve briefly discussed before known as differential expression. This is based on the notion that different environmental pressures may affect the expression (that is, how a gene is translated into a protein) in alternative ways. This is a fundamental underpinning of what we call phenotypic plasticity: the concept that despite having the exact same (or very similar) genes and alleles, two clonal individuals can vary in different traits. This is related to the example of genetically-identical twins which are not necessarily physically identical; this could be due to environmental constraints on growth, behaviour or personality.

Brauer DE figure_cropped
An example of differential expression in wild populations of southern pygmy perch, courtesy of Brauer et al. (2017). In this figure, each column represents a single individual fish, with the phylogenetic tree and coloured boxes at the top indicating the different populations. Each row represents a different gene (this is a subset of 50 from a much larger dataset). The colour of each cell indicates whether the expression of that gene is expressed more (red) or less (blue) than average. As you can see, the different populations can clearly be seen within their expression profiles, with certain genes expressing more or less in certain populations.
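For readers wondering how "more or less than average" might be turned into a coloured cell, here is a minimal Python sketch. It assumes that "average" means the mean expression of that gene across all individuals, and that values are scaled by the standard deviation (a z-score); this is a common convention for expression heat maps, not a description of the pipeline actually used by Brauer et al. The function names and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def row_z_scores(expression):
    """Scale one gene's expression values (one per individual) to z-scores.

    Needs at least two individuals; a gene with no variation gets all zeros.
    """
    centre = mean(expression)
    spread = stdev(expression)
    if spread == 0:
        return [0.0 for _ in expression]
    return [(value - centre) / spread for value in expression]


def cell_colour(z):
    """Map a z-score to the colour scheme described in the caption."""
    if z > 0:
        return "red"    # expressed more than average
    if z < 0:
        return "blue"   # expressed less than average
    return "white"      # exactly average


if __name__ == "__main__":
    # Made-up expression of one gene in six fish from two populations.
    gene = [12.0, 11.5, 12.5, 4.0, 3.5, 4.5]
    print([cell_colour(z) for z in row_z_scores(gene)])
```

With the made-up numbers above, the first three fish come out red and the last three blue, which is the kind of population-level block pattern the figure shows.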

From an evolutionary perspective, the ability to translate a single gene into multiple phenotypic traits has a strong advantage. It allows adaptation to new, novel environments without waiting for natural selection to favour adaptive mutations (or for new, adaptive alleles to become available from new mutation events). This might be a fundamental trait that determines which species can become invasive pests, for instance: the ability to establish and thrive in environments very different to their native habitat allows introduced species to quickly proliferate and spread. Even for species which we might not consider ‘invasive’ (i.e. they have naturally spread to new environments), phenotypic plasticity might allow them to very rapidly adapt and evolve into new ecological niches and could even underpin the early stages of the speciation process.

Epigenetics

Related to this alternative expression of genes is another relatively recent concept: that of epigenetics. In epigenetics, the expression and function of genes is controlled by chemical additions to the DNA which can make gene expression easier or more difficult, effectively promoting or silencing genes. Generally, the specific chemicals that are attached to the DNA are relatively (but not always) predictable in their effects: for example, the addition of a methyl group to the sequence is generally associated with the repression of the gene underlying it. How and where these epigenetic markers are added may in turn be affected by environmental conditions, creating a direct conduit between environmental (‘nurture’) and intrinsic genetic (‘nature’) aspects of evolution.

Epigenetic_mechanisms.jpg
A diagram of different epigenetic factors and the mechanisms by which they control gene expression. Source: Wikipedia.

Typically, these epigenetic ‘marks’ (chemical additions to the DNA) are erased and reset during fertilisation: the epigenetic marks on the parental gametes are removed, and new marks are made on the fertilised embryo. However, it has been shown that this removal process is not 100% effective, and in fact some marks are clearly passed down from parent to offspring. This means that these marks are heritable, and could allow them to evolve similarly to full DNA mutations.

The discovery of epigenetic markers and their influence on gene expression has opened up the possibility of understanding heritable traits which don’t appear to be clearly determined by genetics alone. For example, research into epigenetics suggests that heritable major depressive disorder (MDD) may be controlled by the expression of genes, rather than from specific alleles or genetic variants themselves. This is likely true for a number of traits for which the association to genotype is not entirely clear.

Epigenetic adaptation?

From an evolutionary standpoint again, epigenetics can similarly influence the ‘bang for a buck’ of particular genes. Being able to translate a single gene into many different forms, and for this to be linked to environmental conditions, allows organisms to adapt to a variety of new circumstances without the need for specific adaptive genes to be available. Following this logic, epigenetic variation might be critically important for species with naturally (or unnaturally) low genetic diversity to adapt into the future and survive in an ever-changing world. Thus, epigenetic information might paint a more optimistic outlook for the future: although genetic variation is, without a doubt, one of the most fundamental aspects of adaptability, even horrendously genetically depleted populations and species might still be able to be saved with the right epigenetic diversity.

Epigenetic cats example
A relatively simplified example of adaptation from epigenetic variation. In this example, we have a species of cat; the ‘default’ cat has non-tufted ears and an orange coat. These two traits are controlled by the expression of Genes A and B, respectively: in the top cat, neither gene is expressed. However, when this cat is placed into different environments, the different genes are “switched on” by epigenetic factors (the green markers). In a rainforest environment, the dark foliage makes darker coat colour more adaptive; switching on Gene B allows this to happen. Conversely, in a desert environment switching on Gene A causes the cat to develop tufts on its ears, which makes it more effective at hunting prey hiding in the sands. Note that in both circumstances, the underlying genetic sequence (indicated by the colours in the DNA) is identical: only the expression of those genes changes.

 

Epigenetic research, especially from an ecological/evolutionary perspective, is a very new field. Our understanding of how epigenetic factors translate into adaptability, the relative performance of epigenetic vs. genetic diversity in driving adaptability, and how limited heritability plays a role in adaptation is still in its infancy. As with many avenues of research, further studies in different contexts, experiments and scopes will reveal further this exciting new aspect of evolutionary and conservation genetics. In short: watch this space! And remember, ‘nature is nurture’ (and vice versa)!

Pressing Ctrl-Z on Life with De-extinction

Note: For some clear, interesting presentations on the topic of de-extinction, and where some of the information for this post comes from, check out this list of TED talks.

The current conservation crisis

The stark reality of conservation in the modern era epitomises the crisis discipline that so often is used to describe it: species are disappearing at an unprecedented rate, and despite our best efforts it appears that they will continue to do so. The magnitude and complexity of our impacts on the environment effectively decimate entire ecosystems (and indeed, the entire biosphere). It is thus our responsibility as ‘custodians of the planet’ (although if I had a choice, I would have sacked us as CEOs of this whole business) to attempt to prevent further extinction of our planet’s biodiversity.

Human CEO example
“….shit.”

If you’re even remotely familiar with this blog, then you would have been exposed to a number of different techniques, practices and outcomes of conservation research and its disparate sub-disciplines (e.g. population genetics, community ecology, etc.). Given the limited resources available to conserve an overwhelming number of endangered species, we attempt to prioritise our efforts towards those most in need, although there is a strong taxonomic bias underpinning them.

At least from a genetic perspective, this sometimes involves trying to understand the nature and potential of adaptation from genetic variation (as a predictor of future adaptability). Or using genetic information to inform captive breeding programs, to allow us to boost population numbers with minimal risk of inbreeding depression. Or perhaps allowing us to describe new, unidentified species which require their own set of targeted management recommendations and political legislation.

Genetic rescue

Yet another example of the use of genetics in conservation management, and one that we have previously discussed on The G-CAT, is the concept of ‘genetic rescue’. This involves actively adding new genetic material from other populations into our captive breeding programs to supplement the amount of genetic variation available for future (or even current) adaptation. While there traditionally has been some debate about the risk of outbreeding depression, genetic rescue has been shown to be an effective method for prolonging the survival of at-risk populations.

super-gene-genetic-rescue-e1549973268851.jpg
How my overactive imagination pictures ‘genetic rescue’.

There’s one catch (well, a few really) with genetic rescue: namely, that one must have other populations to ‘outbreed’ with in order to add genetic variation to the captive population. But what happens if we’re too late? What if there are no other populations to supplement with, or those other populations are also too genetically depauperate to use for genetic rescue?

Believe it or not, sometimes it’s not too late to save species, even after they have gone extinct. Which brings us from this (lengthy) introduction to this week’s topic: de-extinction. Yes, we’re literally (okay, maybe not) going to raise the dead.

Necroconservaticon
Your textbook guide to de-extinction. Now banned in 47 countries.

Backbreeding: resurrection by hybridisation

You might wonder how (or even if!) this is possible. And to be frank, it’s extraordinarily difficult. However, it has to a degree been done before, in very specific circumstances. One scenario is based on breeding a species back into existence: sometimes we refer to this as ‘backbreeding’.

This practice really only applies in a few select scenarios. One requirement for backbreeding to be possible is that hybridisation across species has to have occurred in the past, and generally to a substantial scale. This is important as it allows the genetic variation which defines one of those species to live on within the genome of its sister species even when the original ‘host’ species goes extinct. That might make absolutely zero sense as it stands, so let’s dive into this with a case study.

I’m sure you’ll recognise (at the very least, in name) these handsome fellows below: the Galápagos tortoise. They were a pinnacle in Charles Darwin’s research into the process of evolution by natural selection, and can live for so long that until recently there had been living individuals which would have been able to remember him (assuming, you know, memory loss is not a thing in tortoises. I can’t even remember what I had for dinner two days ago, to be fair). As remarkable as they are, Galápagos tortoises actually comprise 15 different species, which can be primarily determined by the shape of their shells and the islands they inhabit.

Galapagos island and tortoises
A map of the Galápagos archipelago and tortoise species, with extinct species indicated by symbology. Lonesome George was the last known living member of the Pinta Island tortoise, C. abingdonii for reference. Source: Wikipedia.

One of these species, Chelonoidis elephantopus, also known as the Floreana tortoise after their home island, went extinct over 150 years ago, likely due to hunting and trade. However, before they all died, some individuals were transported to another island (ironically, likely by mariners) and did the dirty with another species of tortoise: C. becki. Because of this, some of the genetic material of the extinct Floreana tortoise introgressed into the genome of the still-living C. becki. In an effort to restore an iconic species, scientists from a number of institutions attempted to do what sounds like science-fiction: breed the extinct tortoise back to life.

By carefully managing and selectively breeding captive individuals, progressive future generations of the captive population can gradually include more and more of the original extinct C. elephantopus genetic sequence within their genomes. While a 100% resurrection might not be fully possible, by the end of the process individuals with a progressively higher proportion of the original Floreana tortoise genome will be born. Although maybe not a perfect replica, this ‘revived’ species is much more likely to serve a similar ecological role to the now-extinct species, and thus contribute to ecosystem stability. To this day, this is one of the closest attempts at reviving a long-dead species.

Is full de-extinction possible?

When you saw the title for this post, you were probably expecting some Jurassic Park level ‘dinosaurs walking on Earth again’ information. I know I did when I first heard the term de-extinction. Unfortunately, contemporary de-extinction practices are not that far advanced just yet, although there have been some solid attempts. Experiments that take the genomic DNA from the nucleus of a dead animal and clone it within the egg of a living member of that species have effectively cloned an animal back from the dead. This method, however, is currently limited to animals that have died recently, as the DNA degrades beyond use over time.

The same methods have been attempted for some extinct animals, which went extinct relatively recently. Experiments involving the Pyrenean ibex (bucardo) were successful in generating an embryo, but not sustaining a living organism. The bucardo died 10 minutes after birth due to a critical lung condition, as an example.

The challenges and ethics of de-extinction

One might expect that as genomic technologies improve, particularly the genome-editing methods made possible by CRISPR/Cas9, we might one day be able to truly resurrect an extinct species. But this leads to very strongly debated topics of ethics and morality of de-extinction. If we can bring a species back from the dead, should we? What are the unexpected impacts of its revival? How will we prevent history from repeating itself, and the species simply going extinct again? In a rapidly changing world, how can we account for the differences in environment between when the species was alive and now?

Deextinction via necromancy figure
The Chaotic Neutral (?) approach to de-extinction.

There is no clear, simple answer to many of these questions. We are only scratching the surface of the possibility of de-extinction, and I expect that this debate will only accelerate with the research. One thing remains eternally true, though: it is still the distinct responsibility of humanity to prevent more extinctions in the future. Handling the growing climate change problem and the collapse of ecosystems remains a top priority for conservation science, and without a solution there will be no stable planet on which to de-extinct species.

de-extinction meme
You bet we’re gonna make a meme months after it’s gone out of popularity.

The ‘other’ allele frequency: applications of the site frequency spectrum

The site-frequency spectrum

In order to simplify our absolutely massive genomic datasets down to something more computationally feasible for modelling techniques, we often reduce it to some form of summary statistic. These are various aspects of the genomic data that can summarise the variation or distribution of alleles within the dataset without requiring the entire genetic sequence of all of our samples.

One very effective summary statistic that we might choose to use is the site-frequency spectrum (aka the allele frequency spectrum). Not to be confused with other measures of allele frequency which we’ve discussed before (like Fst), the site-frequency spectrum (abbreviated to SFS) is essentially a histogram of how frequent certain alleles are within our dataset. To do this, the SFS classifies each allele into a certain category based on how common it is, tallying up the number of alleles that occur at that frequency. The total number of categories would be the maximum number of possible alleles: for organisms with two copies of every chromosome (‘diploids’, including humans), this means that there are double the number of samples included. For example, a dataset comprised of genomic sequence for 5 people would have 10 different frequency bins.
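To make the tallying concrete, here is a minimal sketch in Python of how an SFS could be built from genotype data. The function name, the input layout (one list of per-individual allele copy numbers per site) and the toy data are all assumptions made for illustration, not the format of any particular program. Note that with 5 diploid individuals there are ten possible non-zero counts (1 to 10), plus a zero class for sites where the allele is absent.

```python
from collections import Counter

def site_frequency_spectrum(genotypes, ploidy=2):
    """Tally how many sites carry the counted allele 0, 1, ..., ploidy*N times.

    genotypes: one list per site, holding each individual's number of copies
    of the counted allele (0, 1 or 2 for a diploid). Hypothetical layout.
    """
    n_individuals = len(genotypes[0])
    max_copies = ploidy * n_individuals
    # Sum the copies across individuals at each site, then count the sums.
    counts = Counter(sum(site) for site in genotypes)
    # Index i of the result = number of sites where the allele occurs i times.
    return [counts.get(i, 0) for i in range(max_copies + 1)]

# Hypothetical toy data: 6 sites scored in 5 diploid individuals.
toy_genotypes = [
    [0, 0, 0, 0, 0],  # allele absent (non-variable site)
    [1, 0, 0, 0, 0],  # singleton
    [0, 0, 1, 0, 0],  # singleton
    [1, 1, 0, 0, 0],  # doubleton (two heterozygotes)
    [2, 0, 0, 0, 0],  # doubleton (one homozygote)
    [2, 1, 1, 0, 0],  # allele occurs 4 times
]
sfs = site_frequency_spectrum(toy_genotypes)
print(sfs)  # [1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0]
```

The two doubleton sites land in the same bin even though the copies are spread across individuals differently, which is exactly the information the SFS throws away.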

For one population

The SFS for a single population – called the 1-dimensional SFS – is very easy to visualise as a concept. In essence, it’s just a frequency distribution of all the alleles within our dataset. Generally, the distribution follows an exponential shape, with many more rare (e.g. ‘singletons’) alleles than there are common ones. However, the exact shape of the SFS is determined by the history of the population, and like other analyses under coalescent theory we can use our understanding of the interaction between demographic history and current genetic variation to study past events.

1DSFS example.jpg
An example of the 1DSFS for a single population, taken from a real dataset from my PhD. Left: the full site-frequency spectrum, counting how many alleles (y-axis) occur a certain number of times (categories of the x-axis) within the population. In this example, as in most species, the vast majority of our DNA sequence is non-variable (frequency = 0). Given the huge disparity in number of non-variable sites, we often select on the variable ones (and even then, often discard the 1 category to remove potential sequencing errors) and get a graph more like the right. Right: the ‘realistic’ 1DSFS for the population, showing a general exponential decline (the blue trendline) for the more frequent classes. This is pretty standard for an SFS. ‘Singleton’ and ‘doubleton’ are alternative names for ‘alleles which occur once’ and ‘alleles which occur twice’ in an SFS.

Expanding the SFS to multiple populations

Further to this, we can expand the site-frequency spectrum to compare across populations. Instead of having a simple 1-dimensional frequency distribution, for a pair of populations we can have a grid. This grid specifies how often a particular allele occurs at a certain frequency in Population A and at a different frequency in Population B. This can also be visualised quite easily, albeit as a heatmap instead. We refer to this as the 2-dimensional SFS (2DSFS).

2dsfs example
An example of a 2DSFS, also taken from my PhD research. In this example, we are comparing Population A, containing 5 individuals (as diploid, 2 x 5 = max. of 10 occurrences of an allele) with Population B, containing 4 individuals. Each row denotes the frequency at which a certain allele occurs in Population B whilst the columns indicate the frequency a certain allele occurs in Population A. Each cell therefore indicates the number of alleles that occur at the exact frequency of the corresponding row and column. For example, the first cell (highlighted in green) indicates the number of alleles which are not found in either Population A or Population B (this dataset is a subsample from a larger one). The yellow cell indicates the number of alleles which occur 4 times in Population B and also 4 times in Population A. This could mean that in one of those Populations 4 individuals have one copy of that allele each, or two individuals have two copies of that allele, or that one has two copies and two have one copy. The exact composition of how the alleles are spread across samples within each population doesn’t matter to the overall SFS.
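As a rough sketch of how such a grid could be filled in, the Python below takes the number of copies of each allele in Population A and in Population B and increments the matching cell. The function name and the per-site counts are hypothetical; the layout (rows for Population B, columns for Population A) simply mirrors the figure description above.

```python
def joint_sfs_2d(counts_a, counts_b, n_a, n_b, ploidy=2):
    """Build a 2D SFS grid: rows = allele count in B, columns = allele count in A."""
    rows = ploidy * n_b + 1  # 0..8 copies for 4 diploid individuals
    cols = ploidy * n_a + 1  # 0..10 copies for 5 diploid individuals
    grid = [[0] * cols for _ in range(rows)]
    for a, b in zip(counts_a, counts_b):
        grid[b][a] += 1  # one more allele seen b times in B and a times in A
    return grid

# Hypothetical allele counts at 5 sites (same site order in both lists).
counts_a = [0, 4, 4, 1, 10]
counts_b = [0, 4, 4, 0, 8]
grid = joint_sfs_2d(counts_a, counts_b, n_a=5, n_b=4)

print(grid[0][0])  # alleles absent from both populations
print(grid[4][4])  # alleles found 4 times in B and 4 times in A
```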

The same concept can be expanded to even more populations, although this gets harder to represent visually. Essentially, we end up with a set of different matrices which describe the frequency of certain alleles across all of our populations, merging them together into the joint SFS. For example, a joint SFS of 4 populations would consist of 6 (4 x 4 total comparisons – 4 self-comparisons, then halved to remove duplicate comparisons) 2D SFSs all combined together. To make sense of this, check out the diagrammatic tables below.

populations for jsfs
A summary of the different combinations of 2DSFSs that make up a joint SFS matrix. In this example we have 4 different populations (as described in the above text). Red cells denote comparisons between a population and itself – which is effectively redundant. Green cells contain the actual 2D comparisons that would be used to build the joint SFS: the blue cells show the same comparisons but in mirrored order, and are thus redundant as well.
annotated jsfs heatmap
Expanding the above jSFS matrix to the actual data, this matrix demonstrates how the matrix is actually a collection of multiple 2DSFSs. In this matrix, one particular cell demonstrates the number of alleles which occur at frequency x in one population and frequency y in another. For example, if we took the cell in the third row from the top and the fourth column from the left, we would be looking at the number of alleles which occur twice in Population B and three times in Population A. The colour of this cell is more or less orange, indicating that ~50 alleles occur at this combination of frequencies. As you may notice, many population pairs show similar patterns, except for the Population C vs Population D comparison.
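The count of 6 pairwise comparisons for 4 populations can be checked with a few lines of Python. This is only a sketch of the arithmetic described above, using placeholder population names.

```python
from itertools import combinations

populations = ["A", "B", "C", "D"]  # placeholder names

# Every unordered pair of distinct populations: no self-comparisons,
# and (A, B) is treated as the same comparison as (B, A).
pairs = list(combinations(populations, 2))

n = len(populations)
# Same arithmetic as in the text: (n x n total - n self-comparisons) / 2.
n_pairs_by_formula = (n * n - n) // 2

print(pairs)
print(len(pairs), n_pairs_by_formula)
```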

The different forms of the SFS

Which alleles we choose to use within our SFS is particularly important. If we don’t have a lot of information about the genomics or evolutionary history of our study species, we might choose to use the minor allele frequency (MAF). Given that SNPs tend to be biallelic, for any given locus we could have Allele A or Allele B. The MAF chooses the least frequent of these two within the dataset and uses that in the summary SFS: since the other allele’s frequency would just be 2N minus the frequency of the minor allele, it’s not included in the summary. An SFS made of the MAF is also referred to as the folded SFS.

Alternatively, if we know some things about the genetic history of our study species, we might be able to divide Allele A and Allele B into derived or ancestral alleles. Since SNPs often occur as mutations at a single site in the DNA, one allele at the given site is the new mutation (the derived allele) whilst the other is the ‘original’ (the ancestral allele). Typically, we would use the derived allele frequency to construct the SFS, since under coalescent theory we’re trying to simulate that mutation event. An SFS made of the derived alleles only is also referred to as the unfolded SFS.
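The relationship between the two forms can be sketched in Python: folding an unfolded SFS just adds together the bins for i copies and 2N - i copies, since either could be the minor allele once we stop tracking which one is derived. The function name and example counts below are made up for illustration.

```python
def fold_sfs(unfolded):
    """Fold a derived-allele SFS (index = derived allele count, 0..2N)
    into a minor-allele SFS (index = minor allele count, 0..N)."""
    total = len(unfolded) - 1  # 2N, the number of sampled chromosomes
    folded = []
    for i in range(total // 2 + 1):
        j = total - i  # the complementary count, 2N - i
        if i == j:
            folded.append(unfolded[i])  # middle bin has no partner
        else:
            folded.append(unfolded[i] + unfolded[j])
    return folded

# Hypothetical unfolded SFS for 2 diploid individuals (2N = 4).
unfolded = [50, 12, 6, 4, 1]
folded = fold_sfs(unfolded)
print(folded)  # [51, 16, 6]
```

Folding never changes the total number of sites; it only discards the ancestral/derived labels.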

Applications of the SFS

How can we use the SFS? Well, it can more or less be used as a summary of genetic variation for many types of coalescent-based analyses. This means we can make inferences of demographic history (see here for more detailed explanation of that) without simulating large and complex genetic sequences and instead use the SFS. Comparing our observed SFS to the SFS expected under a simulated scenario (such as a bottleneck) allows us to estimate the likelihood of that scenario.
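One common way to score how well a scenario fits is a multinomial log-likelihood of the observed counts given the expected proportions. The Python below is a minimal sketch of that idea and not the implementation of any specific program. It assumes the standard neutral, constant-size expectation (the number of sites with i derived copies is proportional to 1/i) and uses a deliberately crude ‘flat’ spectrum as a stand-in for an alternative scenario; the observed counts are invented.

```python
import math

def neutral_expected_proportions(n_chromosomes):
    """Constant-size neutral expectation: sites with i derived copies
    (i = 1 .. n-1) occur in proportion to 1/i."""
    weights = [1.0 / i for i in range(1, n_chromosomes)]
    total = sum(weights)
    return [w / total for w in weights]

def sfs_log_likelihood(observed, expected_proportions):
    """Multinomial log-likelihood of the observed SFS counts
    (the constant multinomial coefficient is dropped)."""
    return sum(o * math.log(p)
               for o, p in zip(observed, expected_proportions) if o > 0)

# Hypothetical observed SFS for variable sites only (1..5 copies, 2N = 6).
observed = [60, 30, 20, 15, 12]

neutral = neutral_expected_proportions(6)
flat = [0.2] * 5  # crude stand-in for a flattened spectrum

ll_neutral = sfs_log_likelihood(observed, neutral)
ll_flat = sfs_log_likelihood(observed, flat)
print(ll_neutral > ll_flat)  # True: these toy counts follow the 1/i shape
```

In a real analysis the expected proportions would come from simulating each candidate demographic scenario rather than from a hand-written list.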

For example, we would predict that, under a scenario of a recent genetic bottleneck in a population, alleles which are rare in the population will be disproportionately lost due to genetic drift. Because of this, the overall shape of the SFS will shift to the right dramatically, leaving a clear genetic signal of the bottleneck. This works under the same theoretical background as coalescent tests for bottlenecks.

SFS shift from bottleneck example.jpg
A representative example of how a bottleneck causes a shift in the SFS, based on a figure from a previous post on the coalescent. Centre: the diagram of alleles through time, with rarer variants (yellow and navy) being lost during the bottleneck but more common variants surviving (red). Left: this trend is reflected in the coalescent trees for these alleles, with red crosses indicating the complete loss of that allele. Right: the SFS from before (in red) and after (in blue) the bottleneck event for the alleles depicted. Before the bottleneck, variants are spread in the usual exponential shape: afterwards, however, a disproportionate loss of the rarer variants causes the distribution to flatten. Typically, the SFS would be built from more alleles than shown here, and extend much further.

Contrastingly, a large or growing population will have a larger number of rare (i.e. unique) alleles from the sudden growth and increase in genetic variation. Thus, opposite to the bottleneck the SFS distribution will be biased towards the left end of the spectrum, with an excess of low-frequency variants.

SFS shift from expansion example.jpg
A similar diagram as above, but this time with an expansion event rather than a bottleneck. The expansion of the population, and subsequent increase in Ne, means that fewer alleles are lost to genetic drift, so more new (and thus rare) mutations persist in the population. This is shown by both the coalescent tree (left) and a shift in the SFS (right).

The SFS can even be used to detect alleles under natural selection. For strongly selected parts of the genome, alleles should occur at either high (if positively selected) or low (if negatively selected) frequency, with a deficit of more intermediate frequencies.

Adding to the analytical toolbox

The SFS is just one of many tools we can use to investigate the demographic history of populations and species. Using a combination of genomic technologies, coalescent theory and more robust analytical methods, the SFS appears to be poised to tackle more nuanced and complex questions of the evolutionary history of life on Earth.

The reality of neutrality

The neutral theory 

Many, many times within The G-CAT we’ve discussed the difference between neutral and selective processes, DNA markers and their applications in our studies of evolution, conservation and ecology. The idea that many parts of the genome evolve under a seemingly random pattern – largely dictated by genome-wide genetic drift rather than the specific force of natural selection – underpins many demographic and adaptive (in outlier tests) analyses.

This is based on the idea that for genes that are not related to traits under selection (either positively or negatively), new mutations should be acquired and lost under predominantly random patterns. Although this accumulation of mutations is influenced to some degree by alternate factors such as population size, the overall average of a genome should give a picture that largely discounts natural selection. But is this true? Is the genome truly neutral if averaged?

Non-neutrality

First, let’s take a look at what we mean by neutral or not. For genes that are not under selection, alleles should be maintained at approximately balanced frequencies and all non-adaptive genes across the genome should have relatively similar distribution of frequencies. While natural selection is one obvious way allele frequencies can be altered (either favourably or detrimentally), other factors can play a role.

As stated above, population sizes have a strong impact on allele frequencies. This is because smaller populations are more at risk of losing rarer alleles due to random deaths (see previous posts for a more thorough discussion of this). Additionally, genes which are physically close to other genes which are under selection may themselves appear to be under selection due to linkage disequilibrium (often shortened to ‘LD’). This is because physically close genes are more likely to be inherited together, thus selective genes can ‘pull’ neighbours with them to alter their allele frequencies.

Linkage disequilibrium figure
An example of how linkage disequilibrium can alter allele frequency of ‘neutral’ parts of the genome as well. In this example, only one part of this section of the genome is selected for: the green gene. Because of this positive selection, the frequency of a particular allele at this gene increases (the blue graph): however, nearby parts of the genome also increase in frequency due to their proximity to this selected gene, which decreases with distance. The extent of this effect determines the size of the ‘linkage block’ (see below).

Why might ‘neutral’ models not be neutral?

The assumption that the vast majority of the genome evolves under neutral patterns has long underpinned many concepts of population and evolutionary genetics. But it’s never been all that clear exactly how much of the genome is actually evolving neutrally or adaptively. How far natural selection reaches beyond a single gene under selection depends on a few different factors: let’s take a look at a few of them.

Linked selection

As described above, physically close genes (i.e. located near one another on a chromosome) often share some impacts of selection due to reduced recombination that occurs at that part of the genome. In this case, even alleles that are not adaptive (or maladaptive) may have altered frequencies simply due to their proximity to a gene that is under selection (either positive or negative).

Recombination blocks and linkage figure
A (perhaps familiar) example of the interaction between recombination (the breaking and mixing of different genes across chromosomes) and linkage disequilibrium. In this example, we have 5 different copies of a part of the genome (different coloured sequences), which we randomly ‘break’ into separate fragments (breaks indicated by the dashed lines). If we focus on a particular base in the sequence (the yellow A) and count the number of times a particular base pair is on the same fragment, we can see how physically close bases are more likely to be coinherited than further ones (bottom column graph). This makes mathematical sense: if two bases are further apart, you’re more likely to have a break that separates them. This is the very basic underpinning of linkage and recombination, and the size of the region where bases are likely to be coinherited is called the ‘linkage block’.
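The counting exercise in the figure can be mimicked with a small simulation. This Python sketch assumes one random break per copy of the sequence, placed uniformly between any two neighbouring bases; the sequence length, focal position and number of copies are arbitrary choices for illustration rather than values from the figure.

```python
import random

def coinheritance_by_distance(seq_length, focal, n_copies, rng):
    """For each position, the fraction of copies in which it stays on the
    same fragment as the focal base after one random break per copy."""
    together = [0] * seq_length
    for _ in range(n_copies):
        # A break at b falls between positions b-1 and b (b = 1..seq_length-1).
        b = rng.randrange(1, seq_length)
        for pos in range(seq_length):
            # Two bases stay together if they sit on the same side of the break.
            if (pos < b) == (focal < b):
                together[pos] += 1
    return [t / n_copies for t in together]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
fractions = coinheritance_by_distance(seq_length=21, focal=10,
                                      n_copies=5000, rng=rng)

for distance in (1, 5, 10):
    print(distance, round(fractions[10 + distance], 2))
```

Under these assumptions the chance of staying together falls off linearly with distance (roughly 1 - d/20 here), which is the same downward trend as the column graph in the figure.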

Under these circumstances, for a region of a certain distance (dubbed the ‘linkage block’) around a gene under selection, the genome will not truly evolve neutrally. Although this is simplest to visualise as physically linked sections of the genome (i.e. adjacent), linked genes do not necessarily have to be next to one another, just linked somehow. For example, they may be different parts of a single protein pathway.

The extent of this linkage effect depends on a number of other factors such as ploidy (the number of copies of a chromosome a species has), the size of the population and the strength of selection around the central locus. The presence of linkage and its impact on the distribution of genetic diversity (LD) has been well documented within evolutionary and ecological genetic literature. The more pressing question is one of extent: how much of the genome has been impacted by linkage? Is any of the genome unaffected by the process?

Background selection

One example of linked selection commonly used to explain the proliferation of non-neutral evolution within the genome is ‘background selection’. Put simply, background selection is the purging of alleles due to negative selection on a linked gene. Sometimes, background selection is expanded to include any forms of linked selection.

Background selection figure .jpg
A cartoonish example of how background selection affects neighbouring sections of the genome. In this example, we have 4 genes (A, B, C and D) with interspersing neutral ‘non-gene’ sections. The allele for Gene B is strongly selected against by natural selection (depicted here as the Banhammer of Selection). However, the Banhammer is not very precise, and when decreasing the frequency of this maladaptive Gene B allele it also knocks down the neighbouring non-gene sections. Despite themselves not being maladaptive, their allele frequencies are decreased due to physical linkage to Gene B.

Under this broader definition of background selection, the process can be divided into two categories based on the impact of the linkage. As above, one scenario is the purging of neutral alleles (and therefore reduction in genetic diversity) as they are associated with a deleterious, maladaptive gene nearby. Contrastingly, some neutral alleles may be preserved by association with a positively selected adaptive gene: this is often referred to as ‘genetic hitchhiking’ (which I’ve always thought was kind of an amusing phrase…).

Genetic hitchhiking picture.jpg
Definitely not how genetic hitchhiking works.

The presence of background selection – particularly under the ‘maladaptive’ scenario – is often used as a counter-argument to the ‘paradox in variation’. This paradox was described by evolutionary biologist Richard Lewontin, who noted that despite massive differences in population sizes across the many different species on Earth, the total amount of ‘neutral’ genetic variation does not change significantly. In fact, he observed no clear relationship (directly) between population size and neutral variation. Many years after this observation, the influence of background selection and genetic hitchhiking on the distribution of genomic diversity helps to explain how the amount of neutral genomic variation is ‘managed’, and why it doesn’t vary excessively across biota.

What does it mean if neutrality is dead?

These findings have significant implications for our understanding of the process of evolution, and how we can detect adaptation within the genome. In light of this research, there has been heated discussion about whether or not neutral theory is ‘dead’, or a useful concept.

Genome wide allele frequency figure.jpg
A vague summary of how a large portion of the genome might not actually be neutral. In this section of the genome, we have neutral (blue), maladaptive (red) and adaptive (green) elements. Natural selection either favours, disfavours, or is ambivalent about each of these sections alone. However, there is significant ‘spill-over’ around regions of positively or negatively selected sections, which causes the allele frequency of even the neutral sections to fluctuate widely. The blue dotted line represents this: when the line is above the genome, allele frequency is increased; when it is below it is decreased. As we travel along this section of the genome, you may notice it is rarely ever in the middle (the so-called ‘neutral’ allele frequency, in line with the genome).

Although I avoid having a strong stance here (if you’re an evolutionary geneticist yourself, I will allow you to draw your own conclusions), it is my belief that the model of neutral theory – and the methods that rely upon it – are still fundamental to our understanding of evolution. Although it may present itself as a more conservative way to identify adaptation within the genome, and cannot account for the effect of the above processes, neutral theory undoubtedly presents itself as a direct and well-implemented strategy to understand adaptation and demography.

Bringing alleles back together: applications of coalescent theory

Coalescent theory

A recurring analytical method, both within The G-CAT and the broader ecological genetic literature, is based on coalescent theory. This is based on the mathematical notion that mutations within genes (leading to new alleles) can be traced backwards in time, to the point where the mutation initially occurred. Given that this is a retrospective, instead of describing these mutation moments as ‘divergence’ events (as would be typical for phylogenetics), these appear as moments where mutations come back together i.e. coalesce.

There are a number of applications of coalescent theory, and it is a particularly fitting process for understanding the demographic (neutral) history of populations and species.

Mathematics of the coalescent

Before we can explore the multitude of applications of the coalescent, we need to understand the fundamental underlying model. The initial coalescent model was described in the 1980s, built upon by a number of different ecologists, geneticists and mathematicians. However, John Kingman is often credited with the formation of the original coalescent model, and Kingman’s coalescent is considered the most basic, primal form of the coalescent model.

From a mathematical perspective, the coalescent model is actually (relatively) simple. If we sampled a single gene from two different individuals (for simplicity’s sake, we’ll say they are haploid and only have one copy per gene), we can statistically measure the probability of these alleles merging back in time (coalescing) at any given generation. This is the same probability that the two samples share an ancestor (think of a much, much shorter version of sharing an evolutionary ancestor with a chimpanzee).

Normally, if we were trying to pick the parents of our two samples, the number of potential parents would be the size of the ancestral population (since any individual in the previous generation has equal probability of being their parent). But from a genetic perspective, this is based on the genetic (effective) population size (Ne), multiplied by 2 as each individual carries two copies per gene (one paternal and one maternal). Therefore, the number of potential parents is 2Ne.

Constant Ne and coalescent prob
A graph of the probability of a coalescent event (i.e. two alleles sharing an ancestor) in the immediately preceding generation (i.e. parents) relative to the size of the population. As one might expect, with larger population sizes there is a low chance of sharing an ancestor in the immediately prior generation, as the pool of ‘potential parents’ increases.

If we have an idealistic population, with large Ne, random mating and no natural selection on our alleles, the probability that their ancestor is in this immediate generation prior (i.e. that they share a parent) is 1/(2Ne). Conversely, the probability they don’t share a parent is 1 − 1/(2Ne). If we add a temporal component (i.e. the number of generations, t), the probability that our alleles first coalesce exactly t generations back is (1 − 1/(2Ne))^(t−1) × 1/(2Ne): they fail to coalesce in each of the first t − 1 generations, then coalesce in generation t.

Variable Ne and coalescent probs
The probability of two alleles sharing a coalescent event back in time under different population sizes. Similar to above, there is a higher probability of an earlier coalescent event in smaller populations as the reduced number of ancestors means that alleles are more likely to ‘share’ an ancestor. However, over time this pattern consistently decreases under all population size scenarios.
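For anyone who prefers to see the numbers, the probability of two alleles first coalescing exactly t generations back, (1 − 1/(2Ne))^(t−1) × 1/(2Ne), is easy to evaluate in Python. This is just a sketch of that formula with arbitrary population sizes; the function name is made up.

```python
def prob_coalesce_at(t, ne):
    """Probability that two alleles first share an ancestor exactly
    t generations back, given effective population size ne."""
    p = 1.0 / (2 * ne)  # chance of sharing a parent in any one generation
    return (1.0 - p) ** (t - 1) * p

# Arbitrary example sizes: smaller populations coalesce sooner.
for ne in (50, 500, 5000):
    print(ne, prob_coalesce_at(1, ne), prob_coalesce_at(100, ne))

# The probabilities over all generations add up to 1 (to rounding error),
# and the average waiting time works out to 2Ne generations.
ne = 100
total = sum(prob_coalesce_at(t, ne) for t in range(1, 10001))
mean_wait = sum(t * prob_coalesce_at(t, ne) for t in range(1, 10001))
print(round(total, 6), round(mean_wait, 2))
```

This is a geometric distribution, which is why the curve in the figure always declines with time, only more slowly for larger Ne.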

Although this might seem mathematically complicated, the coalescent model provides us with a scenario of how we would expect different mutations to coalesce back in time if those idealistic scenarios are true. However, biology is rarely convenient and it’s unlikely that our study populations follow these patterns perfectly. Studying how our empirical data varies from these expectations, however, allows us to infer some interesting things about the history of populations and species.

Testing changes in Ne and bottlenecks

One of the more common applications of the coalescent is in determining historical changes in the effective population size of species, particularly in trying to detect genetic bottleneck events. This is based on the idea that alleles are likely to coalesce at different rates under scenarios of genetic bottlenecks, as the reduced number of individuals (and also genetic diversity) associated with bottlenecks changes the frequency of alleles and coalescence rates.

For a set of k different alleles, the rate of coalescence is determined as k(k – 1)/(4Ne). Thus, the coalescence rate is intrinsically linked to the effective population size, Ne (and so to the number of genetic variants available). During genetic bottlenecks, the severely reduced Ne gives the appearance of the coalescence rate speeding up. This is because genetic drift culls many alleles during the bottleneck, so only a few (usually common) alleles make it through, with the mutation and spread of these alleles occurring after the bottleneck. This can be a little hard to think of, so the diagram below demonstrates how this appears.

Bottleneck test figure.jpg
A diagram of how the coalescent can be used to detect bottlenecks in a single population (centre). In this example, we have a contemporary population in which we are tracing the coalescence of two main alleles (red and green, respectively). Each circle represents a single individual (we are assuming only one allele per individual for simplicity, but for most animals there are up to two). Looking forward in time, you’ll notice that some red alleles go extinct just before the bottleneck: they are lost during the reduction in Ne. Because of this, if we measure the rate of coalescence (right), it is much higher during the bottleneck than before or after it. Another way this could be visualised is to generate gene trees for the alleles (left): populations that underwent a bottleneck will typically have many shorter branches and a long root, as many branches will be ‘lost’ by extinction (the dashed lines, which are not normally seen in a tree).
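To see how strongly Ne drives the coalescence rate, here is a small Python sketch of the k(k − 1)/(4Ne) rate with invented population sizes for ‘before’ and ‘during’ a bottleneck. The function names and numbers are illustrative assumptions only.

```python
def coalescence_rate(k, ne):
    """Per-generation rate at which any pair among k alleles coalesces."""
    return k * (k - 1) / (4 * ne)

def expected_wait(k, ne):
    """Expected number of generations until the next coalescent event."""
    return 1 / coalescence_rate(k, ne)

k = 10                 # number of sampled alleles (arbitrary)
ne_before = 1000       # hypothetical Ne before the bottleneck
ne_bottleneck = 100    # hypothetical Ne during the bottleneck

rate_before = coalescence_rate(k, ne_before)
rate_bottleneck = coalescence_rate(k, ne_bottleneck)

print(rate_before, rate_bottleneck)
print(expected_wait(k, ne_before), expected_wait(k, ne_bottleneck))
```

A tenfold drop in Ne gives a tenfold jump in the coalescence rate, which is the spike shown on the right of the figure. For k = 2 the rate reduces to 1/(2Ne), matching the two-allele case described earlier.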

This makes sense from a theoretical perspective as well, since strong genetic bottlenecks mean that most alleles are lost. Thus, the alleles that we do have are much more likely to coalesce shortly after the bottleneck, with very few alleles that coalesce before the bottleneck event. These alleles are ones that have managed to survive the purge of the bottleneck, and are often few compared to the overarching patterns across the genome.

Testing migration (gene flow) across lineages

Another demographic factor we may wish to test is whether gene flow has occurred across our populations historically. Although there are plenty of allele frequency methods that can estimate contemporary gene flow (i.e. within a few generations), coalescent analyses can detect patterns of gene flow reaching further back in time.

In simple terms, this is based on the idea that if gene flow has occurred across populations, then some alleles will have been transferred from one population to another. Because of this, we would expect that transferred alleles coalesce with alleles of the source population more recently than the divergence time of the two populations. Thus, models that include a migration rate often add it as a parameter specifying the probability that any given allele coalesces with an allele in another population or species (the backwards version of a migration or introgression event). Again, this might be difficult to conceptualise so there’s a handy diagram below.

Migration rate test figure
A similar model of coalescence as above, but testing for migration rate (gene flow) in two recently diverged populations (right). In this example, when we trace two alleles (red and green) back in time, we notice that some individuals in Population 1 coalesce more recently with individuals of Population 2 than other individuals of Population 1 (e.g. for the red allele), and vice versa for the green allele. This can also be represented with gene trees (left), with dashed lines representing individuals from Population 2 and whole lines representing individuals from Population 1. This incomplete split between the two populations is the result of migration transferring genes from one population to the other after their initial divergence (also called ‘introgression’ or ‘horizontal gene transfer’).

Testing divergence time

In a similar vein, the coalescent can also be used to test how long ago the two contemporary populations diverged. Similar to gene flow, this is often included as an additional parameter on top of the coalescent model in terms of the number of generations ago. To convert this to a meaningful time estimate (e.g. in terms of thousands or millions of years ago), we need to include a mutation rate (the number of mutations per base pair of sequence per generation) and a generation time for the study species (how many years apart different generations are: for humans, we would typically say ~20-30 years).
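As a rough sketch of that conversion: the snippet below assumes the analysis reports divergence scaled by the mutation rate (expected mutations per base pair), which is one common convention but not the only one. The function name and all of the numbers are hypothetical and only there for illustration.

```python
def divergence_in_years(tau, mutation_rate, generation_time):
    """Convert a scaled divergence estimate into generations and years.

    tau             -- divergence in expected mutations per base pair (assumed input)
    mutation_rate   -- mutations per base pair per generation
    generation_time -- years per generation
    """
    generations = tau / mutation_rate
    years = generations * generation_time
    return generations, years


# Hypothetical values, chosen only for illustration
generations, years = divergence_in_years(tau=0.002, mutation_rate=1e-8, generation_time=25)
print(f"{generations:,.0f} generations, or about {years:,.0f} years")
```

Note how directly the answer depends on the two conversion factors: doubling the assumed mutation rate halves the estimated age, which is why uncertainty in these values carries straight through to the final date.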

Divergence time test figure.jpg
An example of using the coalescent to test the divergence time between two populations, this time using three different alleles (red, green and yellow). Tracing back the coalescence of each allele reveals different times (in terms of which generation the coalescence occurs in) depending on the allele (right). As above, we can look at this through gene trees (left), showing variation in how far back the two populations (again indicated with bold and dashed lines respectively) split. The blue box indicates the range of times (i.e. a confidence interval) around which divergence occurred: with many more alleles, this can be more refined by using an ‘average’ and later related to time in years with a generation time.

 

The basic model of testing divergence time with the coalescent is relatively simple, and not all that different to phylogenetic methods. Where in phylogenetics we relate the length of the different branches in the tree to the amount of time that has occurred since the divergence of those branches, with the coalescent we base these on coalescent events, with more coalescent events occurring around the time of divergence. One important difference in the two methods is that coalescent events might not directly coincide with divergence time (in fact, we expect many do not) as some alleles will separate prior to divergence, and some will lag behind and start to diverge after the divergence event.

The complex nature of the coalescent

While each of these individual concepts may seem (depending on how well you handle maths!) relatively simple, one critical issue is the interactive nature of the different factors. Gene flow, divergence time and population size changes will all simultaneously impact the distribution and frequency of alleles and thus the coalescent method. Because of this, we often employ the coalescent through complex programs which test and balance the relative contributions of each of these factors to some extent. Although the coalescent is a complex beast, improvements in the methodology and the programs that use it will continue to improve our ability to infer evolutionary history with coalescent theory.

What’s the (allele) frequency, Kenneth?

Allele frequency

A number of times before on The G-CAT, we’ve discussed the idea of using the frequency of different genetic variants (alleles) within a particular population or species to test a number of different questions about evolution, ecology and conservation. These are all based on the central notion that certain forces of nature will alter the distribution and frequency of alleles within and across populations, and that these patterns are somewhat predictable in how they change.

One particular distinction we need to make early here is the difference between allele frequency and allele identity. In these analyses, often we are working with the same alleles (i.e. particular variants) across our populations, it’s just that each of these populations may possess these particular alleles in different frequencies. For example, one population may have an allele (let’s call it Allele A) very rarely – maybe only 10% of individuals in that population possess it – but in another population it’s very common and perhaps 80% of individuals have it. This is a different level of differentiation than comparing how different alleles mutate (as in the coalescent) or how these mutations accumulate over time (like in many phylogenetic-based analyses).

Allele freq vs identity figure.jpg
An example of the difference between allele frequency and identity. In this example (and many of the figures that follow in this post), the circles denote different populations, within which there are individuals which possess either an A gene (blue) or a B gene. Left: If we compare Populations 1 and 2, we can see that they both have A and B alleles. However, these alleles vary in their frequency within each population, with an equal balance of A and B in Pop 1 and a much higher frequency of B in Pop 2. Right: However, when we compare Pop 3 and 4, we can see that not only do they vary in frequencies, they vary in the presence of alleles, with one allele in each population but not the other.

Non-adaptive (neutral) uses

Testing neutral structure

Arguably one of the most standard uses of allele frequency data is the determination of population structure, one which more avid The G-CAT readers will be familiar with. This is based on the idea that populations that are isolated from one another are less likely to share alleles (and thus have similar frequencies of those alleles) than populations that are connected. This is because gene flow across two populations helps to homogenise the frequency of alleles within those populations, by either diluting common alleles or spreading rarer ones (in general). There are a number of programs that use allele frequency data to assess population structure, but one of the most common ones is STRUCTURE.

Gene flow homogeneity figure
An example of how gene flow across populations homogenises allele frequencies. We start with two initial populations (Pop 1 and Pop 2, as above), which have very different allele frequencies. Hybridising individuals across the two populations means some alleles move from Pop 1 and Pop 2 into the hybrid population: which alleles move is random (the smaller circles). Because of this, the resultant hybrid population has an allele frequency somewhere in between the two source populations: think of it like mixing red and blue cordial and getting a purple drink.
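The cordial-mixing idea can be sketched with a few lines of arithmetic. This is a deliberately simplified toy model (symmetric migration at a hypothetical rate m, with no drift or selection), not how programs like STRUCTURE work; the function name and starting frequencies are made up for illustration.

```python
def mix_populations(p1, p2, m, generations):
    """Track the frequency of one allele in two populations exchanging migrants.

    Each generation a fraction m of each population is replaced by migrants
    from the other (symmetric migration, no drift or selection).
    """
    history = [(p1, p2)]
    for _ in range(generations):
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
        history.append((p1, p2))
    return history


# Hypothetical starting point: the allele is common in Pop 1 and rare in Pop 2
for gen, (p1, p2) in enumerate(mix_populations(0.9, 0.1, m=0.05, generations=30)):
    if gen % 10 == 0:
        print(f"generation {gen:2d}: Pop 1 = {p1:.3f}, Pop 2 = {p2:.3f}")
```

Under these assumptions the gap between the two populations shrinks every generation while the average frequency stays put, which is the homogenising effect described above.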

 

Simple YPP structure figure.jpg
An example of a Structure plot which long-term The G-CAT readers may be familiar with. This is taken from Brauer et al. (2013), where the authors studied the population structure of the Yarra pygmy perch. Each small column represents a single individual, with the colours representing how well the alleles of that individual fit a particular genetic population (each population has one colour). The numbers and broader columns refer to different ‘localities’ (different from populations) where individuals were sourced. This shows clear strong population structure across the 4 main groups, except for in Locality 6 where there is a mixture of Eastern and Merri/Curdies alleles.

Determining genetic bottlenecks and demographic change

Other neutral aspects of population identity and history can be studied using allele frequency data. One big component of understanding population history in particular is determining how the population size has changed over time, and relating this to bottleneck events or expansion periods. Although there are a number of different approaches to this, which span many types of analyses (e.g. also coalescent methods), allele frequency data is particularly suited to determining changes in the recent past (hundreds of generations, as opposed to thousands of generations ago). This is because we expect that, during a bottleneck event, it is statistically more likely for rare alleles (i.e. those with low frequency) in the population to be lost due to strong genetic drift: because of this, the population coming out of the bottleneck event should have an excess of more frequent alleles compared to a non-bottlenecked population. We can determine if this is the case with tests such as the heterozygosity excess, M-ratio or mode shift tests.

Genetic drift and allele freq figure
A diagram of how allele frequencies change in genetic bottlenecks due to genetic drift. Left: Large circles again denote a population (although across different sequential times), with smaller circles denoting which alleles survive into the next generation (indicated by the coloured arrows). We start with an initial ‘large’ population of 8, which is reduced down to 4 and 2 in respective future times. Each time the population contracts, only a select number of alleles (or individuals) ‘survive’: assuming no natural selection is in process, this is totally random from the available gene pool. Right: We can see that over time, the frequencies of alleles A and B shift dramatically, leading to the ‘extinction’ of Allele B due to genetic drift. This is because it is the less frequent allele of the two, and in the smaller population size has much less chance of randomly ‘surviving’ the purge of the genetic bottleneck.
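To see why rare alleles are the first casualties of a bottleneck, here is a minimal simulation sketch. It assumes simple Wright-Fisher style random sampling of gene copies with no selection or mutation; the function names, population sizes and starting frequency are all hypothetical, and this is not one of the formal bottleneck tests mentioned above.

```python
import random


def allele_lost(p0, n_copies, generations, rng):
    """Return True if an allele starting at frequency p0 drifts to loss.

    Wright-Fisher style sampling: every generation, n_copies gene copies are
    drawn at random (with replacement) from the previous generation.
    """
    p = p0
    for _ in range(generations):
        count = sum(rng.random() < p for _ in range(n_copies))
        p = count / n_copies
        if p == 0.0:
            return True   # allele has gone 'extinct'
        if p == 1.0:
            return False  # allele has become fixed instead
    return False


def loss_rate(p0, n_copies, generations, replicates=200, seed=1):
    """Proportion of replicate populations in which the allele is lost."""
    rng = random.Random(seed)
    lost = sum(allele_lost(p0, n_copies, generations, rng) for _ in range(replicates))
    return lost / replicates


# Hypothetical scenario: a rare allele (10%) in a bottlenecked vs. a large population
print("bottleneck (20 copies):", loss_rate(0.1, n_copies=20, generations=10))
print("large pop (500 copies):", loss_rate(0.1, n_copies=500, generations=10))
```

Swapping the starting frequency from 0.1 to 0.9 in the bottlenecked population shows the other half of the argument: the common allele almost always makes it through.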

Adaptive (selective) uses

Testing different types of selection

We’ve also previously discussed how different types of natural selection can alter the distribution of allele frequency within a population. There are a number of different predictions we can make based on the selective force and the overall population. For understanding particular alleles that are under strong selective pressure (i.e. are either strongly adaptive or maladaptive), we often test for alleles which have a frequency that strongly deviates from the ‘neutral’ background pattern of the population. These are called ‘outlier loci’, and the fact that their frequency differs so much from the average across the genome is attributed to natural selection placing strong pressure on either maintaining or removing that allele.
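As a toy illustration of the outlier idea, the sketch below measures how differentiated each locus is between two populations and flags loci that sit far above the genome-wide background. The simple F_ST formula (equal sample sizes assumed), the two-standard-deviation cutoff and the allele frequencies are all assumptions made for this example; real outlier scans rely on dedicated programs and proper null models.

```python
from statistics import mean, pstdev


def locus_fst(p1, p2):
    """Simple two-population F_ST for one biallelic locus (equal sample sizes assumed)."""
    p_bar = (p1 + p2) / 2
    if p_bar in (0.0, 1.0):
        return 0.0  # the same allele is fixed in both populations
    variance = ((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2) / 2
    return variance / (p_bar * (1 - p_bar))


def outlier_loci(freqs, n_sd=2.0):
    """Flag loci whose F_ST sits more than n_sd standard deviations above the mean."""
    fst = {locus: locus_fst(p1, p2) for locus, (p1, p2) in freqs.items()}
    values = list(fst.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return [locus for locus, value in fst.items() if value > cutoff]


# Hypothetical allele frequencies (Pop 1, Pop 2) at ten loci
freqs = {
    "locus_1": (0.50, 0.52), "locus_2": (0.30, 0.33), "locus_3": (0.70, 0.68),
    "locus_4": (0.40, 0.42), "locus_5": (0.60, 0.57), "locus_6": (0.20, 0.22),
    "locus_7": (0.80, 0.79), "locus_8": (0.45, 0.47), "locus_9": (0.55, 0.52),
    "locus_10": (0.10, 0.90),
}
print(outlier_loci(freqs))
```

Only the last locus, where the two populations have almost opposite frequencies, stands out from the near-identical background.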

Other selective tests are based on the idea of correlating the frequency of alleles with a particular selective environmental pressure, such as temperature or precipitation. In this case, we expect that alleles under selection will vary in relation to the environmental variable. For example, if a particular allele confers a selective benefit under hotter temperatures, we would expect that allele to be more common in populations that occur in hotter climates and rarer in populations that occur in colder climates. This is referred to as a ‘genotype-environment association test’ and is a good way to detect polygenic selection (i.e. when multiple genes contribute to a change in a single phenotypic trait).

Genotype by environment figure.jpg
An example of how the frequency of alleles might vary under natural selection in correlation to the environment. In this example, the blue allele A is adaptive and under positive selection in the more intense environment, and thus increases in frequency at higher values. Contrastingly, the red allele B is maladaptive in these environments and decreases in frequency. For comparison, the black allele shows how the frequency of a neutral (non-adaptive or maladaptive) allele doesn’t vary with the environment, as it plays no role in natural selection.
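The pattern in the figure can be mimicked with a plain correlation. This is only a bare-bones sketch: the temperatures and allele frequencies are invented, and real genotype-environment association methods also account for neutral population structure, which a simple Pearson correlation does not.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no detectable association
    return cov / sqrt(var_x * var_y)


# Hypothetical mean temperature (degrees C) at five sampled populations
temperature = [10, 15, 20, 25, 30]

# Hypothetical allele frequencies in those same five populations
allele_freqs = {
    "allele_A": [0.10, 0.25, 0.55, 0.70, 0.90],  # adaptive in hotter sites
    "allele_B": [0.90, 0.70, 0.50, 0.30, 0.10],  # maladaptive in hotter sites
    "neutral": [0.51, 0.49, 0.50, 0.49, 0.51],   # unrelated to temperature
}

for name, freqs in allele_freqs.items():
    print(f"{name}: r = {pearson_r(temperature, freqs):+.2f}")
```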

Taxonomic (species identity) uses

At one end of the spectrum of allele frequencies, we can also test for what we call ‘fixed differences’ between populations. An allele is considered ‘fixed’ if it is the only allele for that locus in the population (i.e. has a frequency of 1), whilst the alternative allele (which may exist in other populations) has a frequency of 0. Expanding on this, ‘fixed differences’ occur when one population has Allele A fixed and another population has Allele B fixed: thus, the two populations have as different allele frequencies (for that one locus, anyway) as possible.

Fixed differences are sometimes used as a type of diagnostic trait for species. This means that each ‘species’ has genetic variants that are not shared at all with its closest relative species, and that these variants are so strongly under selection that there is no diversity at those loci. Often, fixed differences are considered a level above populations that differ by allelic frequency only as these alleles are considered ‘diagnostic’ for each species.

Fixed differences figure.jpg
An example of the difference between fixed differences and allelic frequency differences. In this example, we have 5 cats from 3 different species, sequencing a particular target gene. Within this gene, there are three possible alleles: T, A or G respectively. You’ll quickly notice that the T allele is both unique to Species A and is present in all cats of that species (i.e. is fixed). This is a fixed difference between Species A and the other two. Alleles A and G, however, are present in both Species B and C, and thus are not fixed differences even if they have different frequencies.
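Checking for fixed differences is simple enough to sketch in code. The genotypes below are invented to loosely mirror the cat example (one allele per cat), and the check treats any locus where two groups share no alleles at all as a fixed difference, which is an assumption made here to cover both the strict definition above and the figure’s example.

```python
def fixed_allele(alleles):
    """Return the allele if the population carries only one, otherwise None."""
    unique = set(alleles)
    return next(iter(unique)) if len(unique) == 1 else None


def is_fixed_difference(pop_a, pop_b):
    """True if the two populations share no alleles at this locus."""
    return set(pop_a).isdisjoint(pop_b)


# Hypothetical alleles at one locus, one entry per sampled cat
species = {
    "Species A": ["T", "T", "T", "T", "T"],
    "Species B": ["A", "A", "G", "A", "G"],
    "Species C": ["G", "G", "A", "G", "G"],
}

print(fixed_allele(species["Species A"]))
print(is_fixed_difference(species["Species A"], species["Species B"]))
print(is_fixed_difference(species["Species B"], species["Species C"]))
```

Species B and C differ in how common each allele is, but because they share both alleles the last check fails: a frequency difference, not a fixed one.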

Intrapopulation (relatedness) uses

Allele frequency-based methods are even used in determining relatedness between individuals. While it might seem intuitive to just check whether individuals share the same alleles (and are thus related), it can be hard to distinguish between whether they are genetically similar due to direct inheritance or whether the entire population is just ‘naturally’ similar, especially at a particular locus. This is the distinction between ‘identical-by-descent’, where alleles that are similar across individuals have recently been inherited from a shared ancestor (e.g. a parent or grandparent), and ‘identical-by-state’, where alleles are similar just by chance. The latter doesn’t contribute to or determine relatedness, as all individuals (whether they are directly related or not) within a population may be similar.

To distinguish between the two, we often use the overall frequency of alleles in a population as a basis for determining how likely two individuals share an allele by random chance. If alleles which are relatively rare in the overall population are shared by two individuals, we expect that this similarity is due to family structure rather than population history. By factoring this into our relatedness estimates we can get a more accurate overview of how likely two individuals are to be related using genetic information.
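A quick back-of-the-envelope version of this logic: if gene copies are drawn at random from the population, the chance that two of them are the same allele depends only on that allele’s frequency. The sketch below assumes random mating and unrelated individuals, and the frequencies are made up; actual relatedness estimators are considerably more involved.

```python
def chance_of_sharing(freq):
    """Probability that two gene copies drawn at random from the population
    are both this allele (identical-by-state by chance alone)."""
    return freq ** 2


def expected_ibs(freqs):
    """Chance that two random gene copies match at a locus, summed over all alleles."""
    return sum(chance_of_sharing(f) for f in freqs)


# Hypothetical frequencies for a rare and a common allele
for label, freq in [("rare allele", 0.05), ("common allele", 0.80)]:
    print(f"{label} (frequency {freq}): chance match = {chance_of_sharing(freq):.4f}")

# Hypothetical locus with three alleles
print(f"any match at this locus: {expected_ibs([0.05, 0.15, 0.80]):.4f}")
```

Two individuals both carrying the rare allele is unlikely to be a coincidence, so it counts as much stronger evidence of shared family history than both carrying the common one.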

The wild world of allele frequency

Despite appearances, this is just a brief foray into the many applications of allele frequency data in evolution, ecology and conservation studies. There are a plethora of different programs and methods that can utilise this information to address a variety of scientific questions and refine our investigations.

You’re perfect, you’re beautiful, you look like a model (species)

What is a ‘model’?

There are quite literally millions of species on Earth, ranging from the smallest of microbes to the largest of mammals. In fact, there are so many that we don’t actually have a good count on the sheer number of species and can only estimate it based on the species we actually know about. Unsurprisingly, then, the number of species vastly outweighs the number of people that research them, especially considering the sheer volumes of different aspects of species, evolution, conservation and their changes we could possibly study.

Species on Earth estimate figure
Some estimations on the number of eukaryotic species (i.e. not including things like bacteria), with the number of known species in blue and the predicted number of total species on Earth in purple. Source: Census of Marine Life.

This is partly where the concept of a ‘model’ comes into it: it’s much easier to pick a particular species to study as a target, and use the information from it to apply to other scenarios. Most people would be familiar with the concept based on medical research: the ‘lab rat’ (or mouse). The common house mouse (Mus musculus) and the brown rat (Rattus norvegicus) are some of the most widely used models for understanding the impact of particular biochemical compounds on physiology and are often used as the testing phase of medical developments before human trials.

So, why are mice used as a ‘model’? What actually constitutes a ‘model’, rather than just a ‘relatively-well-researched species’? Well, there are a number of traits that might make certain species ideal subjects for understanding key concepts in evolution, biology, medicine and ecology. For example, mice are often used in medical research given their (relatively) similar genetic, physiological and behavioural characteristics to humans. They’re also relatively short-lived and readily breed, making them ideal to observe the more long-term effects of medical drugs or intergenerational impacts. Other species used as models primarily in medicine include nematodes (Caenorhabditis elegans), pigs (Sus scrofa domesticus), and guinea pigs (Cavia porcellus).

The diversity of models

There are a wide variety and number of different model species, based on the type of research most relevant to them (and how well it can be applied to other species). Even with evolution and conservation-based research, which can often focus on more obscure or cryptic species, there are several key species that have widely been applied as models for our understanding of the evolutionary process. Let’s take a look at a few examples for evolution and conservation.

Drosophila

It would be remiss of me to not mention one of the most significant contributors to our understanding of the genetic underpinning of adaptation and speciation, the humble fruit fly (Drosophila melanogaster, among other species). The ability to rapidly produce new generations (with large numbers of offspring and very short generation time), a small fully-sequenced genome, and physiological variation mean that observing both phenotypic and genotypic changes over generations due to ‘natural’ (or ‘experimental’) selection is possible. In fact, Drosophila spp. were key in demonstrating the formation of a new species under laboratory conditions, providing empirical evidence for the process of natural selection leading to speciation (despite some creationist claims that this has never happened).

Drosophila speciation experiment
A simplified summary of the speciation experiment in Drosophila, starting with a single species and resulting in two reproductively isolated species based on mating and food preference. Source: Ilmari Karonen, adapted from here.

Darwin’s finches

The original model of evolution could be argued to be Darwin’s finches, as they formed part of the empirical basis of Charles Darwin’s work on the theory of evolution by natural selection. This is because the different species demonstrate very distinct and obvious changes in morphology related to a particular diet (e.g. the physiological consequences of natural selection), spread across an archipelago in a clear demonstration of a natural experiment. Thus, they remain the original example of adaptive radiation and are fundamental components of the theory of evolution by natural selection. However, surprisingly, Darwin’s finches are somewhat overshadowed in modern research by other species in terms of the amount of available data.

Darwin's finches drawings
Some of Darwin’s early drawings of the morphological differences in Galapagos finch beaks, which led to the formulation of the theory of evolution by natural selection.

Zebra finches

Even as far as birds go, one species clearly outshines the rest in terms of research. The zebra finch is one of the most highly researched vertebrate species, particularly as a model of song learning and behaviour in birds but also as a genetic model. The zebra finch was the second bird to ever have its full genome sequenced (the first being the chicken), and its genome remains one of the more detailed and annotated among birds. Because of this, the zebra finch genome is often used as a reference for other studies on the genetics of bird species, especially when trying to understand the function of genetic changes or genes under selection.

Zebra finches.jpg
A pair of (very cute) model zebra finches. Source: Michael Lawton via Smithsonian.com.

 

Fishes

Fish are (perhaps surprisingly) also relatively well researched in terms of evolutionary studies, largely due to their ancient origins and highly diverse nature, with many different species across the globe. They also often demonstrate very rapid and strong bouts of divergence, such as the cichlid fish species of African lakes which demonstrate how new species can rapidly form when introduced to new and variable environments. The cichlids have become the poster child of adaptive radiation in fishes much in the same way that Darwin’s finches highlighted this trend in birds. Another group of fish species used as a model for similar aspects of speciation, adaptive divergence and rapid evolutionary change are the three-spine and nine-spine stickleback species, which inhabit a variety of marine, estuarine and freshwater environments. Thus, studies on the genetic changes across these different morphotypes are key to understanding how adaptation to new environments occurs in nature (particularly the relatively common transition into different water types in fishes).

cichlid diversity figure
The sheer diversity of species and form makes African cichlids an ideal model for testing hypotheses and theories about the process of evolution and adaptive radiation. Figure sourced from Brawand et al. (2014) in Nature.

Zebrafish

More similar to the medical context of lab rats is the zebrafish (ironically, zebras themselves are not considered a model species). Zebrafish are often used as models for understanding embryology and the development of the body in early formation given the rapid speed at which embryonic development occurs and the transparent body of embryos (which makes it easier to detect morphological changes during embryogenesis).

Zebrafish embryo
The transparent nature of zebrafish embryos makes them ideal for studying the development of organisms in early stages. Source: yourgenome.org.

Using information from model species for non-models

While the relevance of information collected from model species to other non-model species depends on the similarity in traits of the two species, our understanding of broad concepts such as evolutionary process, biochemical pathways and physiological developments has significantly improved due to model species. Applying theories and concepts from better understood organisms to less researched ones allows us to produce better research much faster by cutting out some of the initial investigative work on the underlying processes. Thus, model species remain fundamental to medical advancement and evolutionary theory.

That said, in an ideal world all species would have the same level of research and resources as our model species. In this sense, we must continue to strive to understand and research the diversity of life on Earth, to better understand the world in which we live. Full genomes are progressively being sequenced for more and more species, and there are a number of excellent projects that are aiming to sequence at least one genome for all species of different taxonomic groups (e.g. birds, bats, fish). As the data improves for our non-model species, our understanding of evolution, conservation management and medical research will similarly improve.

Lost in a forest of (gene) trees

Using genetics to understand species history

The idea of using the genetic sequences of living organisms to understand the evolutionary history of species is a concept much repeated on The G-CAT. And it’s a fundamental one in phylogenetics, taxonomy and evolutionary biology. Often, we try to analyse the genetic differences between individuals, populations and species in a tree-like manner, with close tips being similar and more distantly separated branches being more divergent. However, this runs on one very key assumption; that the patterns we observe in our study genes matches the overall patterns of species evolution. But this isn’t always true, and before we can delve into that we have to understand the difference between a ‘gene tree’ and a ‘species tree’.

A gene tree or a species tree?

Our typical view of a phylogenetic tree is actually one of a ‘gene tree’, where we analyse how a particular gene (or set of genes) has changed over time between different individuals (within and across populations or species) based on our understanding of mutation and common ancestry.

However, a phylogenetic tree based on a single gene only demonstrates the history of that gene. What we assume in most cases is that the history of that gene matches the history of the species: that branches in the genetic tree mirror when different splits in species occurred throughout history.

The easiest way to conceptualise gene trees and species trees is to think of individual gene trees that are nested within an overarching species tree. In this sense, individual gene trees can vary from one another (substantially, even) but by looking at the overall trends of many genes we can see how the genome of the species has changed over time.

Gene tree incongruence figure
A (potentially familiar) depiction of individual gene trees (coloured lines) within the broader species tree (defined by the black boundaries). As you might be able to tell, the branching patterns of the different genes are not the same, and don’t always match the overarching species tree.

Gene tree incongruence

Different genes may have different patterns for a number of reasons. Changes in the genetic sequences of organisms over time don’t happen equally across the entire genome, and very specific parts of the genome can evolve in entirely different directions, or at entirely different rates, than the rest of the genome. Let’s take a look at a few ways we could have conflicting gene trees in our studies.

Incomplete lineage sorting

One of the most prolific, but more complicated, ways gene trees can vary from their overarching species tree is due to what we call ‘incomplete lineage sorting’. This is based on the idea that species and the genes that define them are constantly evolving over time, and that because of this different genes are at different stages of divergence between populations and species. If we imagine a set of three related populations which have all descended from a single ancestral population, we can start to see how incomplete lineage sorting could occur. Our ancestral population likely has some genetic diversity, containing multiple alleles of the same locus. In a true phylogenetic tree, we would expect these different alleles to ‘sort’ into the different descendent populations, such that one population might have one of the alleles, a second the other, and so on, without them sharing the different alleles between them.

If this separation into new populations has been recent, or if gene flow has occurred between the populations since this event, then we might find that each descendent population has a mixture of the different alleles, and that not enough time has passed to clearly separate the populations. For lineages to sort fully, enough time needs to pass for new mutations to arise and for genetic drift to push each population towards different alleles: if the split is too recent, it can be hard to accurately distinguish between populations. This can be difficult to interpret (see below figure for a visualisation of this), but there’s a great description of incomplete lineage sorting here.

ILS_adaptedfigure
A demonstration of incomplete lineage sorting, generously adapted from a talk by fellow MELFU postdocs Dr Yuma (Jonathon) Sandoval-Castillo and Dr Catherine Attard. On the left is a depiction of a single gene coalescent tree over time: circles represent a single individual at a particular point in time (row) with the colours representing different alleles of that same gene. The tree shows how new mutations occur (colour changes along the branches) and spread throughout the descendent populations. In this example, we have three recently separated species, with a good number of different alleles. However, when we study these alleles in tree form (the phylogeny on the right), we see that the branches themselves don’t correlate well with the boundaries of the species. For example, the teal allele found within Species C is actually more similar to Species B alleles (purple and blue) than to any other Species C alleles, based on the order and patterns of these mutations.
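How often should we expect this kind of mismatch? For the simplest case of three species, coalescent theory gives a standard result (not something covered in this post, so treat it as outside background): a gene tree matches the species tree with probability 1 - (2/3)e^(-T), where T is the time between the two species splits measured in units of 2N generations. The sketch below just evaluates that formula; the population size and branch lengths are hypothetical, and it assumes no gene flow after the splits.

```python
from math import exp


def gene_tree_probabilities(generations, pop_size):
    """Three-species case: chance a gene tree matches or conflicts with the species tree.

    generations -- length of the internal branch (time between the two splits)
    pop_size    -- diploid effective population size along that branch
    """
    t = generations / (2 * pop_size)      # branch length in coalescent units
    discordant_each = exp(-t) / 3         # each of the two conflicting topologies
    concordant = 1 - 2 * discordant_each  # the topology matching the species tree
    return concordant, discordant_each


# Hypothetical population of 10,000 with increasingly long gaps between splits
for gens in (1_000, 20_000, 100_000):
    match, conflict = gene_tree_probabilities(gens, pop_size=10_000)
    print(f"{gens:>7,} generations: match = {match:.2f}, each conflicting tree = {conflict:.2f}")
```

When the splits happen in quick succession (or the population is very large), the matching tree is barely more likely than either conflicting one, which is exactly the ‘not enough time has passed’ situation described above.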

Hybridisation and horizontal transfer

Another way individual genes may become incongruent with other genes is through another phenomenon we’ve discussed before: hybridisation (or more specifically, introgression). When two individuals from different species breed together to form a ‘hybrid’, they join together what was once two separate gene pools. Thus, the hybrid offspring has (if it’s a first generation hybrid, anyway) 50% of genes from Species A and 50% of genes from Species B. In terms of our phylogenetic analysis, if we picked one gene randomly from the hybrid, we have a 50% chance of picking a gene that reflects the evolutionary history of Species A, and a 50% chance of picking a gene that reflects the evolutionary history of Species B. This would change how our outputs look significantly: if we pick a Species A gene, our ‘hybrid’ will look (genetically) very, very similar to Species A. If we pick a Species B gene, our ‘hybrid’ will look like a Species B individual instead. Naturally, this can really stuff up our interpretations of species boundaries, distributions and identities.

Hybridisation_figure
An example of hybridisation leading to gene tree incongruence with our favourite colourful fish. A) We have a hybridisation event between a red fish (Species A) and a green fish (Species B), resulting in a hybrid species (‘Species’ H). The red fish genome is indicated by the yellow DNA, the green fish genome by the blue DNA, and the hybrid orange fish has a mixture of these two. B) If we sampled one set of genes in the hybrid, we might select a gene that originated from the red fish, showing that the hybrid is identical (or very similar) to Species A. D) Conversely, if we sampled a gene originating from the green fish, the resultant phylogeny might show that the hybrid is the same as Species B. C) If we consider these two patterns in combination, we see the true pattern of species formation, which is not a clear dichotomous tree but rather a mixture of the two sets of trees.

Paralogous genes

More confusingly, we can even have events where a single gene duplicates within a genome. This is relatively rare, although it can have huge effects: for example, salmon have massive genomes as the entire thing was duplicated! Each version of the gene can take on very different forms, functions, and evolve in entirely different ways. We call these duplicated variants paralogous genes: genes that look the same (in terms of sequence), but are totally different genes.

This can have a profound impact as paralogous genes are difficult to detect: if there has been a gene duplication early in the evolutionary history of our phylogenetic tree, then many (or all) of our study samples will have two copies of said gene. Since they look similar in sequence, there’s every possibility that we pick Variant 1 in some species and Variant 2 in other species. Being unable to tell them apart, we can have some very weird and abstract results within our tree. Most importantly, different samples with the same duplicated variant will seem more similar to one another (e.g. to have evolved from a common ancestor more recently) than they will to any sample of the other variant (even if they came from the exact same species)!

Paralogy_figure.jpg
An example of how paralogous genes can confound species trees. We start with a single (purple) gene: at a particular point in time, this gene duplicates into a red and a blue form. Each of these genes then evolves and spreads into four separate descendent species (A, B, C and D) but not in entirely the same way. However, since both the red and blue genetic sequences are similar, if we took a single gene from each species we might (somewhat randomly) sequence either the red or the blue copy. The different phylogenetic trees on the right demonstrate how different combinations of red and blue genes give very different patterns, since all blue copies will be more related to other blue genes than to the red gene of the same species. E.g. a blue A and a blue C are more similar than a blue A and a red A.

Overcoming incongruence with genomics

Although a tricky conundrum in phylogenetics and evolutionary genetics broadly, gene tree incongruence can largely be overcome by using more loci. As the random changes at any one locus have a smaller effect on the larger total set of loci, the general and broad patterns of evolutionary history can become clearer. Indeed, understanding how many loci are affected by what kind of process can itself become informative: large numbers of introgressed loci can indicate whether hybridisation was recent, strong, or biased towards one species over another, for example. As with many things, the genomic era appears poised to address the many analytical issues and complexities of working with genetic data.
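The benefit of more loci can be sketched with a small simulation. This is a toy model under loud assumptions: three species, loci that are independent of one another, a hypothetical 50% chance that any one gene tree matches the species tree, and a simple ‘most common tree wins’ rule (ties go to whichever tree was sampled first). Real species-tree methods are more sophisticated, and majority voting is known to be unreliable in some scenarios with more species.

```python
import random
from collections import Counter


def majority_topology(n_loci, p_concordant, rng):
    """Sample one gene-tree topology per locus and return the most common one."""
    p_other = (1 - p_concordant) / 2
    topologies = ["species_tree", "conflict_1", "conflict_2"]
    draws = rng.choices(topologies, weights=[p_concordant, p_other, p_other], k=n_loci)
    return Counter(draws).most_common(1)[0][0]


def recovery_rate(n_loci, p_concordant=0.5, replicates=500, seed=1):
    """Proportion of simulated datasets where the majority tree is the species tree."""
    rng = random.Random(seed)
    hits = sum(
        majority_topology(n_loci, p_concordant, rng) == "species_tree"
        for _ in range(replicates)
    )
    return hits / replicates


for n_loci in (1, 10, 100):
    print(f"{n_loci:>3} loci: species tree recovered in {recovery_rate(n_loci):.0%} of datasets")
```

With a single locus we are essentially flipping a coin; with a hundred, the conflicting trees get outvoted in nearly every simulated dataset.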