Scanning for causes: an introduction to genome-wide association studies

Understanding genetic determinants

You’ve probably been exposed to one news headline or another in the recent past (let’s say the last 5 years) that reads something like “SCIENTISTS DISCOVER GENES THAT CAUSE (X).” X, of course, varies massively based on the study itself (and sometimes the bastardisation of said study by media): it can include medical conditions such as cancer, autism or congenital diseases; behavioural traits, such as sexual preferences; or broad physical traits, such as the classic problem of the heritability of height. Unsurprisingly, you may think that trying to find the genes responsible for some traits should be either a) super easy, or b) super hard, depending on your own philosophical preference or the trait in question. So how do these studies come about, anyway?


What’s yours is mine: evolution by adaptive introgression

Gene flow and introgression

Genetic variation remains key not only to understanding the process and history of evolution, but also to allowing evolution to continue into the future. This is the basis of the concept of ‘evolutionary potential’ – the available variation within a population or species which may enable them to adapt to new environmental stressors as they occur. With the looming threat of contemporary climate change and environmental transformations by humanity, predicting and supporting evolutionary potential across the diversity of life is critical for conserving the stability of our biosphere.


Products of their time: the impact of demographic history on evolution

Demographic history

Many things in life are the product of their history, and nothing exemplifies this better than evolution. Given the often-gradual nature of evolution by natural selection, environmental stressors and factors operating on long-term scales (i.e. over thousands or millions of years) can have major impacts on evolutionary changes across the diversity of biota. While many of these are specific to the characteristics of the target organism (i.e. are related to adaptive traits), non-adaptive (neutral) traits are also critically important in driving the path of evolution.


Islands of speciation and speciation on islands

The concept of a species

We’ve spent some time before discussing the nature of the term ‘species’ and what it means in reality. Of course, answers to questions in biology are always more complicated than we wish they might be, and despite the common nomenclature of the word ‘species’ the underlying definition is convoluted and variable.


Evolutionary clocks out of sync

Evolutionary time

It shouldn’t come as a surprise to anyone with a basic understanding of evolution that it is a temporal (and also spatial) concept. Time is a fundamental aspect of the process of evolution by natural selection, and without it evolution wouldn’t exist. But time is also a fickle thing, and although it remains constant (let’s not delve into that issue here) not all things experience it in the same way.


Genes in parallel

Adaptation from genetic variation

One of the central themes of this blog, and indeed of evolutionary biology as a whole, is the notion that adaptation is often underpinned by genes. Genetic variation acts as the basis for natural selection to favour or disfavour traits: while this is directly through phenotypic traits (e.g. fur colour, morphology, behaviour), these traits are typically determined by a genetic component. In the early stages of adaptation, evolution can often be observed by changes in the frequency of genetic variants (alleles) within a species or population over time as natural selection acts, gradually leading to the observable (and sometimes dramatic) change in species over time.


UnConservation Genetics: tools for managing invasive species

Conservation genetics

Naturally, all species play their role in the balancing and functioning of ecosystems across the globe (even the ones we might not like all that much, personally). Persistence or extinction of ecologically important species is a critical component of the overall health and stability of an ecosystem, and thus our aim as conservation scientists is to attempt to use whatever tools we have at our disposal to conserve species. One of the most central themes in conservation ecology (and to The G-CAT, of course) is the notion that genetic information can be used to better our conservation management approaches. This usually involves understanding the genetic history and identity of our target threatened species from which we can best plan for their future. This can take the form of genetic-informed relatedness estimates for breeding programs; identifying important populations and those at risk of local extinction; or identifying evolutionarily-important new species which might hold unique adaptations that could allow them to persist in an ever-changing future.

Applications of conservation genetics
Just a few applications of genetic information in conservation management, such as in breeding programs and pedigrees (left), identifying new/cryptic species (centre) and identifying and maintaining populations and their structure (right).

The Invaders

Conversely, sometimes we might also use genetic information to do the exact opposite. While so many species on Earth are at risk (or have already passed over the precipice) of extinction, some have gone rogue with our intervention. These are, of course, invasive species: pests that have been introduced into new environments and, by their prolific nature, start to throw out the balance of the ecosystem. Australians will be familiar with no shortage of invasive species, the most notable of which is the cane toad, Rhinella marina. However, there is a plethora of invasive species, ranging from the notably prolific (such as the cane toad) to the seemingly mundane (such as the blackbird): so how can we possibly deal with the sheer number and variety of pests?

Table of invasive species in Australia
A table of some of the most prolific mammalian invasive species in Australia, including when they were first introduced and why, and their (relatively) recently estimated population sizes. Source: Wikipedia (and studies referenced therein). Some estimated numbers might not reflect current sizes as they were obtained from studies over the last 10 years.

Tools for invasive species management

There are a number of tools at our disposal for dealing with invasive species. These range from chemical controls (like pesticides), to biological controls and more recently to targeted genetic methods. Let’s take a quick foray into some of these different methods and their applications to pest control.

Types of control tools for invasive species
Some of the broad categories of invasive species control. For any given pest species, such as the cane toad (top), we might choose to use a particular set of methods to reduce their numbers. These can include biological controls (such as the ladybird, for aphid populations (left)); chemical controls such as pesticides; or even genetic engineering technologies.

Biological controls

One of the most traditional methods of pest control is biological control. A biological control is, in simple terms, a species that can be introduced to an afflicted area to control the population of an invasive species. Usually, this is based on some form of natural co-evolution or hierarchy: species which naturally predate upon, infect or otherwise displace the pest in question are preferred. The basis of this choice is that nature, and evolution by natural selection, often creates a near-perfect machine adapted for handling the exact problem.

Biological controls can have very mixed results. In some cases, they can be relatively effective, such as the introduction of the moth Cactoblastis cactorum into Australia to control the invasive prickly pear. The moth lays eggs exclusively within the tissue of the prickly pear, and the resultant caterpillars ravage the plant. No secondary diet items have been recorded for the caterpillars, suggesting the control method has been very selective and precise.

Moth biological control flow chart
The broad life cycle of the cactus moth and how it controls the invasive prickly pear in Australia. The ravenous caterpillar larvae of the moth are effective at decimating prickly pears, whilst the moth’s specificity to this host means there is limited impact on other plant species.

In contrast, bad biological controls can lead to ecological disasters. As mentioned above, the introduction of the cane toad into Australia has been widely regarded as the origin of one of the worst invasive pests in the nation’s history. Initially, cane toads were brought over in the 1930s to predate on the (native) cane beetle, which was causing significant damage to sugar cane plantations in the tropical north. Not overly effective at actually dealing with the problem it was supposed to solve, the cane toad rapidly spread across the northern portion of the continent. Native species that attempt to predate on the cane toad often die from its defensive toxin, causing massive ecological damage to the system.

The potential secondary impact of biological controls, and the degree of unpredictability in how they will respond to a new environment (and how native species will also respond to their introduction), lead conservationists to develop new, more specific techniques. In similar ways, viral and bacterial-based controls have had limited success (although they are still often proposed in conservation management, such as the planned carp herpesvirus release).

Genetic controls?

It is clear that more targeted and narrow techniques are required to effectively control pest species. At a more micro level, individual genes could be used to manage species: this is not the first way genetic modification has been proposed to deal with problem organisms. Genetic methods have been employed for years in crop farming through genetic engineering of genes to produce ‘natural’ pesticides or insecticides. In a similar vein, it has been proposed that genetic modification could be a useful tool for dealing with invasive pests and their native victims.

Gene drives

One targeted, genetic-based method that has shown great promise is the gene drive. Following some of the theory behind genetic engineering, gene drives are targeted suites of genes (or alleles) which, by their own selfish nature, propagate through a population at a much higher rate than alternative genes. In conjunction with other DNA modification methods, which can create fatal or sterilising genetic variants, gene drives present the opportunity to allow the natural breeding of an invasive species to spread the detrimental modified gene.

Gene drive diagram
An example of how gene drives are being proposed to tackle malaria. In this figure, the pink mosquito at the top has been genetically engineered using CRISPR to possess two important genetic elements: a genetic variant which causes the mosquito to be unable to produce eggs or bite (the pink gene), and a linked selfish genetic element (the gene drive itself; the plus) which makes this detrimental allele spread more rapidly than by standard inheritance. Sources: Nature and The Australian Academy of Science.
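To make the idea of "spreading more rapidly than by standard inheritance" a little more concrete, here is a minimal sketch in Python. It is a toy model of my own rather than anything taken from the studies above, and it rests on several assumptions: mating is random, the drive allele carries no fitness cost (which a real sterilising drive certainly would), and heterozygotes pass on the drive allele with probability (1 + homing) / 2, where the hypothetical `homing` parameter is the efficiency of the selfish element.

```python
def drive_frequency(p0, homing, generations):
    """Track the frequency of a drive allele over time.

    Assumptions (illustrative only): random mating, no fitness cost,
    and heterozygotes transmit the drive allele with probability
    (1 + homing) / 2. With homing = 0 this is ordinary Mendelian
    inheritance.
    """
    transmit = (1 + homing) / 2
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Drive homozygotes always transmit the allele;
        # heterozygotes transmit it with probability `transmit`.
        p = p * p + 2 * p * (1 - p) * transmit
        freqs.append(p)
    return freqs


mendelian = drive_frequency(0.01, homing=0.0, generations=12)
drive = drive_frequency(0.01, homing=0.9, generations=12)

for gen, (m, d) in enumerate(zip(mendelian, drive)):
    print(f"generation {gen:2d}: Mendelian {m:.3f}, drive {d:.3f}")
```

Under these assumptions the Mendelian allele stays at its starting frequency, while the drive allele sweeps towards fixation within a handful of generations. Adding a fitness cost for carriers would slow or even stall that spread.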

Although a relatively new, and untested, technique, gene drive technology has already been proposed as a method to address some of the prolific invasive mammals of New Zealand. Naturally, there are a number of limitations and reservations for the method; similar to biological control, there is concern for secondary impact on other species that interact with the invasive host. Hybridisation between invasive and native species could cause the gene drive to spread to native species, counteracting the conservation efforts to save natives. For example, a gene drive could not reasonably be proposed to deal with feral wild dogs in Australia without massively impacting the ‘native’ dingo.

Genes for non-genetic methods

Genetic information, more broadly, can also be useful for pest species management without necessarily directly feeding into genetic engineering methods. The various population genetic methods that we’ve explored over a number of different posts can also be applied in informing management. For example, understanding how populations are structured, and the sizes and demographic histories of these populations, may help us to predict how they will respond in the future and best focus our efforts where they are most effective. By including analysis of their adaptive history and responses, we may start to unravel exactly what makes a species a good invader and how to best predict future susceptibility of an environment to invasion.

Table of genetic information applications
A comprehensive table of the different ways genetic information could be applied in broader invasive species management programs, from Rollins et al. (2006). This paper specifically relates to pest management within Western Australia but the concepts listed here apply broadly. Many of these concepts we have discussed previously in a conservation management context as well.

The better we understand invasive species and populations from a genetic perspective, the more informed our management efforts can be and the more likely we are to be able to adequately address the problem.

Managing invasive pest species

The impact of human settlement in new environments extends far beyond our direct influences. With our arrival, particularly in the last few hundred years, human migration has been an effective conduit for the spread of ecologically-disastrous species which undermine the health and stability of ecosystems around the globe. As such, it is our responsibility to Earth to attempt to address our problems: new genetic techniques are but one growing avenue by which we might be able to remove these invasive pests.

Crossing the Wires: why ‘genetic hardwiring’ is not the whole story

The age-old folly of ‘nature vs. nurture’

It should come as no surprise to any reader of The G-CAT that I’m a firm opponent of the false dichotomy (and yes, I really do love that phrase) of “nature versus nurture.” Primarily, this is because the phrase gives the impression of some kind of counteracting balance between intrinsic (i.e. usually genetic) and extrinsic (i.e. usually environmental) factors and how they play a role in behaviour, ecology and evolution. While both are undoubtedly critical for adaptation by natural selection, posing this as a black-and-white split removes the possibility of interactive traits.

We know readily that fitness, the measure by which adaptation or maladaptation can be quantified, is the product of both the adaptive value of a certain trait and the environmental conditions said trait occurs in. A trait that might confer strong fitness in one environment may be very, very unfit in another. A classic example is fur colour in mammals: in a snowy environment, a white coat provides camouflage for predators and prey alike; in a rainforest environment, it’s like wearing one of those fluoro-coloured safety vests construction workers wear.

Genetics and environment interactions figure
The real Circle of Life. Not only do genes and the environment interact with one another, but genes may interact with other genes and environments may be complex and multi-faceted.

Genetically-encoded traits

In the “nature versus nurture” context, the ‘nature’ traits are often inherently assumed to be genetic. This is because genetic traits are intrinsic as a fundamental aspect of life, inheritable (and thus can be passed on and undergo evolution by natural selection) and define the important physiological traits that provide (or prevent) adaptation. Of course, not all of the genome encodes phenotypic traits, and even less of it relates to diagnosable and relevant traits for natural selection to act upon. In addition, there is a bit of an assumption that many physiological or behavioural traits are ‘hardwired’: that is, despite any influence of environment, genes will always produce a certain phenotype.

Adaptation from genetic variation
A very simplified example of adaptation from genetic variation. In this example, we have two different alleles of a single gene (orange and blue). Natural selection favours the blue allele so over time it increases in frequency. The difference between these two alleles is at least one base pair of DNA sequence; this often arises by mutation processes.
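The increase in frequency of a favoured allele can be sketched with a few lines of Python. This is a deliberately simple toy model under my own assumptions, not a description of any particular study: the population is effectively infinite (no genetic drift), each individual carries one copy of the gene (haploid), and the blue allele has a relative fitness of 1 + s compared to 1 for the orange allele. The function name and the value of `s` are hypothetical.

```python
def select(p0, s, generations):
    """Frequency of a favoured allele under simple haploid selection.

    Assumptions (illustrative only): infinite population, no mutation
    or drift, favoured allele has relative fitness 1 + s and the
    alternative allele has relative fitness 1.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Mean fitness of the population this generation.
        mean_fitness = p * (1 + s) + (1 - p)
        # The favoured allele's share of the next generation.
        p = p * (1 + s) / mean_fitness
        freqs.append(p)
    return freqs


blue = select(p0=0.1, s=0.1, generations=50)
for gen in range(0, 51, 10):
    print(f"generation {gen:2d}: blue allele frequency {blue[gen]:.3f}")
```

Even with a modest advantage, the blue allele climbs steadily generation after generation, which is the gradual shift the figure above is illustrating.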

Despite how important the underlying genes are for the formation of proteins and definition of physiology, they are not omnipotent in that regard. In fact, many other factors can influence how genetic traits relate to phenotypic traits: we’ve discussed a number of these in minor detail previously. An example includes interactions across different genes: these can be due to physiological traits encoded by the cumulative presence and nature of many loci (as in quantitative trait loci and polygenic adaptation). Alternatively, one gene may translate to multiple different physiological characters if it shows pleiotropy.

Differential expression

One indirect way genetic information can impact on the phenotype of an organism is through something we’ve briefly discussed before known as differential expression. This is based on the notion that different environmental pressures may affect the expression (that is, how a gene is translated into a protein) in alternative ways. This is a fundamental underpinning of what we call phenotypic plasticity: the concept that despite having the exact same (or very similar) genes and alleles, two clonal individuals can vary in different traits. This is related to the example of genetically-identical twins which are not necessarily physically identical; this could be due to environmental constraints on growth, behaviour or personality.

Brauer DE figure_cropped
An example of differential expression in wild populations of southern pygmy perch, courtesy of Brauer et al. (2017). In this figure, each column represents a single individual fish, with the phylogenetic tree and coloured boxes at the top indicating the different populations. Each row represents a different gene (this is a subset of 50 from a much larger dataset). The colour of each cell indicates whether that gene is expressed more (red) or less (blue) than average. As you can see, the different populations can clearly be distinguished by their expression profiles, with certain genes expressing more or less in certain populations.

From an evolutionary perspective, the ability to translate a single gene into multiple phenotypic traits confers a strong advantage. It allows adaptation to novel environments without waiting for natural selection to favour adaptive mutations (or for new, adaptive alleles to become available from new mutation events). This might be a fundamental trait that determines which species can become invasive pests, for instance: the ability to establish and thrive in environments very different to their native habitat allows introduced species to quickly proliferate and spread. Even for species which we might not consider ‘invasive’ (i.e. they have naturally spread to new environments), phenotypic plasticity might allow them to very rapidly adapt and evolve into new ecological niches and could even underpin the early stages of the speciation process.

Epigenetics

Related to this alternative expression of genes is another relatively recent concept: that of epigenetics. In epigenetics, the expression and function of genes are controlled by chemical additions to the DNA which can make gene expression easier or more difficult, effectively promoting or silencing genes. Generally, the specific chemicals that are attached to the DNA are relatively (but not always) predictable in their effects: for example, the addition of a methyl group to the sequence is generally associated with the repression of the gene underlying it. How and where these epigenetic markers occur may in turn be affected by environmental conditions, creating a direct conduit between environmental (‘nurture’) and intrinsic genetic (‘nature’) aspects of evolution.

Epigenetic mechanisms
A diagram of different epigenetic factors and the mechanisms by which they control gene expression. Source: Wikipedia.

Typically, these epigenetic ‘marks’ (chemical additions to the DNA) are erased and reset during fertilisation: the epigenetic marks on the parental gametes are removed, and new marks are made on the fertilised embryo. However, it has been shown that this removal process is not 100% effective, and in fact some marks are clearly passed down from parent to offspring. This means that these marks are heritable, which could allow epigenetic traits to evolve similarly to full DNA mutations.

The discovery of epigenetic markers and their influence on gene expression has opened up the possibility of understanding heritable traits which don’t appear to be clearly determined by genetics alone. For example, research into epigenetics suggests that heritable major depressive disorder (MDD) may be controlled by the expression of genes, rather than by specific alleles or genetic variants themselves. This is likely true for a number of traits for which the association to genotype is not entirely clear.

Epigenetic adaptation?

From an evolutionary standpoint again, epigenetics can similarly influence the ‘bang for a buck’ of particular genes. Being able to translate a single gene into many different forms, and for this to be linked to environmental conditions, allows organisms to adapt to a variety of new circumstances without the need for specific adaptive genes to be available. Following this logic, epigenetic variation might be critically important for species with naturally (or unnaturally) low genetic diversity to adapt into the future and survive in an ever-changing world. Thus, epigenetic information might paint a more optimistic outlook for the future: although genetic variation is, without a doubt, one of the most fundamental aspects of adaptability, even horrendously genetically depleted populations and species might still be able to be saved with the right epigenetic diversity.

Epigenetic cats example
A relatively simplified example of adaptation from epigenetic variation. In this example, we have a species of cat; the ‘default’ cat has non-tufted ears and an orange coat. These two traits are controlled by the expression of Genes A and B, respectively: in the top cat, neither gene is expressed. However, when this cat is placed into different environments, the different genes are “switched on” by epigenetic factors (the green markers). In a rainforest environment, the dark foliage makes darker coat colour more adaptive; switching on Gene B allows this to happen. Conversely, in a desert environment switching on Gene A causes the cat to develop tufts on its ears, which makes it more effective at hunting prey hiding in the sands. Note that in both circumstances, the underlying genetic sequence (indicated by the colours in the DNA) is identical: only the expression of those genes changes.

 

Epigenetic research, especially from an ecological/evolutionary perspective, is a very new field. Our understanding of how epigenetic factors translate into adaptability, the relative performance of epigenetic vs. genetic diversity in driving adaptability, and how limited heritability plays a role in adaptation is currently limited. As with many avenues of research, further studies in different contexts, experiments and scopes will reveal further this exciting new aspect of evolutionary and conservation genetics. In short: watch this space! And remember, ‘nature is nurture’ (and vice versa)!

Pressing Ctrl-Z on Life with De-extinction

Note: For some clear, interesting presentations on the topic of de-extinction, and where some of the information for this post comes from, check out this list of TED talks.

The current conservation crisis

The stark reality of conservation in the modern era epitomises the ‘crisis discipline’ label that is so often used to describe it: species are disappearing at an unprecedented rate, and despite our best efforts it appears that they will continue to do so. The magnitude and complexity of our impacts on the environment effectively decimate entire ecosystems (and indeed, the entire biosphere). It is thus our responsibility as ‘custodians of the planet’ (although if I had a choice, I would have sacked us as CEOs of this whole business) to attempt to prevent further extinction of our planet’s biodiversity.

Human CEO example
“….shit.”

If you’re even remotely familiar with this blog, then you would have been exposed to a number of different techniques, practices and outcomes of conservation research and its disparate sub-disciplines (e.g. population genetics, community ecology, etc.). Given the limited resources available to conserve an overwhelming number of endangered species, we attempt to prioritise our efforts towards those most in need, although there is a strong taxonomic bias underpinning them.

At least from a genetic perspective, this sometimes involves trying to understand the nature and potential of adaptation from genetic variation (as a predictor of future adaptability). Or using genetic information to inform captive breeding programs, to allow us to boost population numbers with minimal risk of inbreeding depression. Or perhaps allowing us to describe new, unidentified species which require their own set of targeted management recommendations and political legislation.

Genetic rescue

Yet another example of the use of genetics in conservation management, and one that we have previously discussed on The G-CAT, is the concept of ‘genetic rescue’. This involves actively adding new genetic material from other populations into our captive breeding programs to supplement the amount of genetic variation available for future (or even current) adaptation. While there traditionally has been some debate about the risk of outbreeding depression, genetic rescue has been shown to be an effective method for prolonging the survival of at-risk populations.

How my overactive imagination pictures ‘genetic rescue’.

There’s one catch (well, a few really) with genetic rescue: namely, that one must have other populations to ‘outbreed’ with in order to add genetic variation to the captive population. But what happens if we’re too late? What if there are no other populations to supplement with, or those other populations are also too genetically depauperate to use for genetic rescue?

Believe it or not, sometimes it’s not too late to save species, even after they have gone extinct. Which brings us from this (lengthy) introduction to this week’s topic: de-extinction. Yes, we’re literally (okay, maybe not) going to raise the dead.

Necroconservaticon
Your textbook guide to de-extinction. Now banned in 47 countries.

Backbreeding: resurrection by hybridisation

You might wonder how (or even if!) this is possible. And to be frank, it’s extraordinarily difficult. However, it has to a degree been done before, in very specific circumstances. One scenario is based on breeding a species back into existence: sometimes we refer to this as ‘backbreeding’.

This practice really only applies in a few select scenarios. One requirement for backbreeding to be possible is that hybridisation across species has to have occurred in the past, and generally on a substantial scale. This is important as it allows the genetic variation which defines one of those species to live on within the genome of its sister species even when the original ‘host’ species goes extinct. That might make absolutely zero sense as it stands, so let’s dive into this with a case study.

I’m sure you’ll recognise (at the very least, in name) these handsome fellows below: the Galápagos tortoise. They were a pinnacle in Charles Darwin’s research into the process of evolution by natural selection, and can live for so long that until recently there had been living individuals which would have been able to remember him (assuming, you know, memory loss is not a thing in tortoises. I can’t even remember what I had for dinner two days ago, to be fair). As remarkable as they are, Galápagos tortoises actually comprise 15 different species, which can be primarily determined by the shape of their shells and the islands they inhabit.

Galapagos island and tortoises
A map of the Galápagos archipelago and tortoise species, with extinct species indicated by symbology. For reference, Lonesome George was the last known living member of the Pinta Island tortoise, C. abingdonii. Source: Wikipedia.

One of these species, Chelonoidis elephantopus, also known as the Floreana tortoise after its home island, went extinct over 150 years ago, likely due to hunting and trade. However, before they all died, some individuals were transported to another island (ironically, likely by mariners) and did the dirty with another species of tortoise: C. becki. Because of this, some of the genetic material of the extinct Floreana tortoise introgressed into the genome of the still-living C. becki. In an effort to restore an iconic species, scientists from a number of institutions attempted to do what sounds like science-fiction: breed the extinct tortoise back to life.

By carefully managing and selectively breeding captive individuals, successive generations of the captive population can gradually include more and more of the original extinct C. elephantopus genetic sequence within their genomes. While a 100% resurrection might not be fully possible, individuals born over the course of the process will carry a progressively higher proportion of the original Floreana tortoise genome. Although maybe not a perfect replica, this ‘revived’ species is much more likely to serve a similar ecological role to the now-extinct species, and thus contribute to ecosystem stability. To this day, this is one of the closest attempts at reviving a long-dead species.

Is full de-extinction possible?

When you saw the title for this post, you were probably expecting some Jurassic Park level ‘dinosaurs walking on Earth again’ information. I know I did when I first heard the term de-extinction. Unfortunately, contemporary de-extinction practices are not that far advanced just yet, although there have been some solid attempts. Experiments that take the genomic DNA from the nucleus of a dead animal and clone it within the egg of another living member of that species have effectively cloned an animal back from the dead. This method, however, is currently limited to animals that have died recently, as the DNA degrades beyond use over time.

The same methods have been attempted for some animals which went extinct relatively recently. For example, experiments involving the Pyrenean ibex (bucardo) were successful in generating an embryo, but not in sustaining a living organism: the cloned bucardo died 10 minutes after birth due to a critical lung condition.

The challenges and ethics of de-extinction

One might expect that as genomic technologies improve, particularly methods facilitated by the genome editing enabled by CRISPR/Cas9, we might one day be able to truly resurrect an extinct species. But this leads to very strongly debated questions about the ethics and morality of de-extinction. If we can bring a species back from the dead, should we? What are the unexpected impacts of its revival? How will we prevent history from repeating itself, and the species simply going extinct again? In a rapidly changing world, how can we account for the differences in environment between when the species was alive and now?

Deextinction via necromancy figure
The Chaotic Neutral (?) approach to de-extinction.

There is no clear, simple answer to many of these questions. We are only scratching the surface of the possibility of de-extinction, and I expect that this debate will only accelerate with the research. One thing remains eternally true, though: it is still the distinct responsibility of humanity to prevent more extinctions in the future. Handling the growing climate change problem and the collapse of ecosystems remains a top priority for conservation science, and without a solution there will be no stable planet on which to de-extinct species.

de-extinction meme
You bet we’re gonna make a meme months after it’s gone out of popularity.

The ‘other’ allele frequency: applications of the site frequency spectrum

The site-frequency spectrum

In order to simplify our absolutely massive genomic datasets down to something more computationally feasible for modelling techniques, we often reduce them to some form of summary statistic. These statistics capture aspects of the genomic data that summarise the variation or distribution of alleles within the dataset without requiring the entire genetic sequence of all of our samples.

One very effective summary statistic that we might choose to use is the site-frequency spectrum (aka the allele frequency spectrum). Not to be confused with other measures based on allele frequency which we’ve discussed before (like Fst), the site-frequency spectrum (abbreviated to SFS) is essentially a histogram of how frequent certain alleles are within our dataset. To do this, the SFS classifies each allele into a certain category based on how many times it occurs, tallying up the number of alleles that occur at that frequency. The total number of categories is set by the maximum number of copies of an allele that could occur: for organisms with two copies of every chromosome (‘diploids’, including humans), this is double the number of samples included. For example, a dataset comprising genomic sequence for 5 people would have 10 different frequency bins (one for each possible count from 1 to 10), plus a zero category for alleles that are absent.
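To make the tallying concrete, here is a minimal sketch in Python. It assumes each site is stored as a list of per-individual allele counts (0, 1 or 2 copies in a diploid); the function name and the toy genotypes are made up for illustration and don’t come from any particular package or real dataset.

```python
def site_frequency_spectrum(genotypes, n_individuals):
    """Tally how many sites carry the counted allele 0, 1, ..., 2N times.

    genotypes: one list per site, holding the number of copies (0, 1 or 2)
    of the counted allele in each diploid individual.
    """
    n_chromosomes = 2 * n_individuals
    sfs = [0] * (n_chromosomes + 1)  # index = number of copies in the sample
    for site in genotypes:
        if len(site) != n_individuals:
            raise ValueError("every site needs one genotype per individual")
        sfs[sum(site)] += 1
    return sfs


# Hypothetical toy data: 6 sites genotyped in 5 diploid individuals.
toy_genotypes = [
    [0, 0, 0, 0, 0],  # allele absent (non-variable site)
    [1, 0, 0, 0, 0],  # singleton
    [0, 0, 1, 0, 0],  # singleton
    [1, 1, 0, 0, 0],  # doubleton (two heterozygotes)
    [2, 0, 0, 0, 0],  # doubleton (one homozygote)
    [2, 2, 1, 0, 0],  # five copies
]
toy_sfs = site_frequency_spectrum(toy_genotypes, n_individuals=5)
print(toy_sfs)  # [1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 0]
```

Note that the two doubleton sites land in the same bin even though the copies are spread across individuals differently: only the total count matters.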

For one population

The SFS for a single population, called the 1-dimensional SFS, is very easy to visualise as a concept. In essence, it’s just a frequency distribution of all the alleles within our dataset. Generally, the distribution follows a roughly exponential shape, with many more rare alleles (e.g. ‘singletons’) than common ones. However, the exact shape of the SFS is determined by the history of the population, and like other analyses under coalescent theory we can use our understanding of the interaction between demographic history and current genetic variation to study past events.
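As a rough baseline for that shape: under the simplest textbook assumptions (the standard neutral model with a constant population size), the expected number of sites where the derived allele occurs i times in a sample is θ/i, where θ is the population-scaled mutation rate. The sketch below just evaluates that formula; the value of θ used is an arbitrary assumption for illustration, not an estimate from any dataset.

```python
def expected_neutral_sfs(n_chromosomes, theta):
    """Expected unfolded SFS (classes 1 .. n-1) under the standard neutral model.

    Class i is the expected number of sites where the derived allele
    appears i times in a sample of n_chromosomes chromosomes.
    """
    return [theta / i for i in range(1, n_chromosomes)]


# 5 diploid individuals = 10 chromosomes; theta = 10 is a made-up value.
neutral = expected_neutral_sfs(n_chromosomes=10, theta=10.0)
for copies, expected_sites in enumerate(neutral, start=1):
    print(copies, round(expected_sites, 2))
```

Singletons are expected to be twice as common as doubletons, three times as common as tripletons, and so on. Departures from this baseline are what carry the signal of demographic history.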

1DSFS example.jpg
An example of the 1DSFS for a single population, taken from a real dataset from my PhD. Left: the full site-frequency spectrum, counting how many alleles (y-axis) occur a certain number of times (categories of the x-axis) within the population. In this example, as in most species, the vast majority of our DNA sequence is non-variable (frequency = 0). Given the huge disparity in the number of non-variable sites, we often select only the variable ones (and even then, often discard the 1 category to remove potential sequencing errors) and get a graph more like the one on the right. Right: the ‘realistic’ 1DSFS for the population, showing a general exponential decline (the blue trendline) for the more frequent classes. This is pretty standard for an SFS. ‘Singleton’ and ‘doubleton’ are alternative names for ‘alleles which occur once’ and ‘alleles which occur twice’ in an SFS.

Expanding the SFS to multiple populations

Further to this, we can expand the site-frequency spectrum to compare across populations. Instead of having a simple 1-dimensional frequency distribution, for a pair of populations we can have a grid. This grid specifies how often a particular allele occurs at a certain frequency in Population A and at a given (possibly different) frequency in Population B. This can also be visualised quite easily, albeit as a heatmap instead. We refer to this as the 2-dimensional SFS (2DSFS).

2dsfs example
An example of a 2DSFS, also taken from my PhD research. In this example, we are comparing Population A, containing 5 individuals (as diploid, 2 x 5 = max. of 10 occurrences of an allele) with Population B, containing 4 individuals. Each row denotes the frequency at which a certain allele occurs in Population B, whilst the columns indicate the frequency at which a certain allele occurs in Population A. Each cell therefore indicates the number of alleles that occur at the exact frequency of the corresponding row and column. For example, the first cell (highlighted in green) indicates the number of alleles which are not found in either Population A or Population B (this dataset is a subsample from a larger one). The yellow cell indicates the number of alleles which occur 4 times in Population B and also 4 times in Population A. This could mean that in one of those populations 4 individuals have one copy of that allele each, or two individuals have two copies of that allele, or that one has two copies and two have one copy. The exact composition of how the alleles are spread across samples within each population doesn’t matter to the overall SFS.
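The grid described above can be sketched in a few lines of Python. This assumes we already have, for every site, the number of copies of the allele in each population; the function name and the toy counts are hypothetical, chosen only to mirror the 5-individual and 4-individual populations in the example.

```python
def two_dimensional_sfs(counts_a, counts_b, n_a, n_b):
    """Build a 2DSFS grid with rows = count in Population B, columns = count in A.

    counts_a[i] and counts_b[i] are the number of copies of the allele at
    site i in each population; n_a and n_b are numbers of diploid individuals.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both populations must be scored at the same sites")
    grid = [[0] * (2 * n_a + 1) for _ in range(2 * n_b + 1)]
    for a, b in zip(counts_a, counts_b):
        grid[b][a] += 1
    return grid


# Hypothetical allele counts at 5 sites (Population A: 5 diploids, B: 4 diploids).
counts_a = [0, 4, 4, 1, 0]
counts_b = [0, 4, 4, 0, 2]
grid = two_dimensional_sfs(counts_a, counts_b, n_a=5, n_b=4)

print(grid[0][0])  # sites where the allele is absent from both populations
print(grid[4][4])  # sites with 4 copies in B and 4 copies in A
```

Plotting `grid` as a heatmap would give the kind of figure shown above.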

The same concept can be expanded to even more populations, although this gets harder to represent visually. Essentially, we end up with a set of different matrices which describe the frequency of certain alleles across all of our populations, merging them together into the joint SFS. For example, a joint SFS of 4 populations would consist of 6 (4 x 4 total comparisons – 4 self-comparisons, then halved to remove duplicate comparisons) 2D SFSs all combined together. To make sense of this, check out the diagrammatic tables below.

populations for jsfs
A summary of the different combinations of 2DSFSs that make up a joint SFS matrix. In this example we have 4 different populations (as described in the above text). Red cells denote comparisons between a population and itself – which is effectively redundant. Green cells contain the actual 2D comparisons that would be used to build the joint SFS: the blue cells show the same comparisons but in mirrored order, and are thus redundant as well.
annotated jsfs heatmap
Expanding the above jSFS matrix to the actual data, this figure demonstrates how the matrix is actually a collection of multiple 2DSFSs. Within it, each cell gives the number of alleles which occur at frequency x in one population and frequency y in another. For example, if we took the cell in the third row from the top and the fourth column from the left, we would be looking at the number of alleles which occur twice in Population B and three times in Population A. The colour of this cell is more or less orange, indicating that ~50 alleles occur at this combination of frequencies. As you may notice, many population pairs show similar patterns, except for the Population C vs Population D comparison.
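The arithmetic behind those 6 comparisons is easy to check with a short Python sketch. The population labels are simply the ones used in the figures above.

```python
from itertools import combinations

populations = ["A", "B", "C", "D"]

# Every unordered pair of different populations: this drops the
# self-comparisons and the mirrored duplicates in one step.
pairs = list(combinations(populations, 2))

n = len(populations)
print(len(pairs))        # 6
print((n * n - n) // 2)  # same count: (4 x 4 - 4) / 2 = 6
print(pairs)
```

In general, n populations give n(n - 1)/2 pairwise 2DSFSs, so the number of comparisons grows quickly as populations are added.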

The different forms of the SFS

Which alleles we choose to use within our SFS is particularly important. If we don’t have a lot of information about the genomics or evolutionary history of our study species, we might choose to use the minor allele frequency (MAF). Given that SNPs tend to be biallelic, for any given locus we could have Allele A or Allele B. The MAF chooses the less frequent of these two within the dataset and uses that in the summary SFS: since the count of the more common allele is simply 2N minus the count of the minor allele, it adds no extra information and is not included in the summary. An SFS made of the MAF is also referred to as the folded SFS.

Alternatively, if we know some things about the genetic history of our study species, we might be able to divide Allele A and Allele B into derived or ancestral alleles. Since SNPs often occur as mutations at a single site in the DNA, one allele at the given site is the new mutation (the derived allele) whilst the other is the ‘original’ (the ancestral allele). Typically, we would use the derived allele frequency to construct the SFS, since under coalescent theory we’re trying to simulate that mutation event. An SFS made of the derived alleles only is also referred to as the unfolded SFS.
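The relationship between the two forms can be shown with a small sketch: folding an unfolded SFS just means adding together the classes for i copies and 2N - i copies, since without knowing which allele is ancestral those two cases look the same. The function name and the toy counts below are made up for illustration.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS (indices 0 .. 2N) into a minor-allele SFS.

    Class i of the folded spectrum combines derived-allele counts i and
    2N - i; the middle class (i == 2N - i) is counted only once.
    """
    n_chromosomes = len(unfolded) - 1
    folded = []
    for i in range(n_chromosomes // 2 + 1):
        mirror = n_chromosomes - i
        if i == mirror:
            folded.append(unfolded[i])
        else:
            folded.append(unfolded[i] + unfolded[mirror])
    return folded


# Hypothetical unfolded SFS for 2 diploid individuals (4 chromosomes).
unfolded = [10, 5, 3, 2, 1]
print(fold_sfs(unfolded))  # [11, 7, 3]
```

Folding throws away information (we can no longer tell a rare new mutation from a nearly fixed one), which is why the unfolded SFS is preferred when ancestral states are known.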

Applications of the SFS

How can we use the SFS? Well, it can more or less be used as a summary of genetic variation for many types of coalescent-based analyses. This means we can make inferences of demographic history (see here for more detailed explanation of that) without simulating large and complex genetic sequences and instead use the SFS. By simulating a particular scenario (say, a bottleneck) and comparing the SFS expected under that scenario to our observed SFS, we can estimate the likelihood of that scenario.
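One common way to score that comparison is a multinomial (composite) log-likelihood: turn the expected SFS into proportions and ask how probable the observed counts are under them. The sketch below is a simplified illustration of that idea rather than the exact calculation used by any particular program, and both the observed counts and the two candidate expected spectra are invented numbers.

```python
import math


def sfs_log_likelihood(observed, expected):
    """Multinomial log-likelihood of an observed SFS given an expected SFS.

    The constant multinomial coefficient is dropped, since it is identical
    for every scenario compared against the same observed data.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected SFS must have the same classes")
    total_expected = sum(expected)
    log_lik = 0.0
    for obs, exp in zip(observed, expected):
        if obs == 0:
            continue  # empty classes contribute nothing
        if exp <= 0:
            return float("-inf")  # scenario says this class is impossible
        log_lik += obs * math.log(exp / total_expected)
    return log_lik


# Made-up observed counts for frequency classes 1 to 5.
observed = [50, 26, 16, 12, 10]

# Two hypothetical scenarios: a steep 1/i decline versus a flat spectrum.
steep = [1 / i for i in range(1, 6)]
flat = [1.0] * 5

ll_steep = sfs_log_likelihood(observed, steep)
ll_flat = sfs_log_likelihood(observed, flat)
print(ll_steep > ll_flat)  # True: the steep scenario fits these counts better
```

In a real analysis the expected spectra would come from coalescent simulations or analytical approximations under each demographic model, and the model parameters would be tuned to maximise this likelihood.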

For example, we would predict that under a scenario of a recent genetic bottleneck, alleles which are rare in the population will be disproportionately lost due to genetic drift. Because of this, the overall shape of the SFS will shift to the right dramatically, leaving a clear genetic signal of the bottleneck. This works under the same theoretical background as coalescent tests for bottlenecks.

SFS shift from bottleneck example.jpg
A representative example of how a bottleneck causes a shift in the SFS, based on a figure from a previous post on the coalescent. Centre: the diagram of alleles through time, with rarer variants (yellow and navy) being lost during the bottleneck but more common variants surviving (red). Left: this trend is reflected in the coalescent trees for these alleles, with red crosses indicating the complete loss of that allele. Right: the SFS from before (in red) and after (in blue) the bottleneck event for the alleles depicted. Before the bottleneck, variants are spread in the usual exponential shape: afterwards, however, a disproportionate loss of the rarer variants causes the distribution to flatten. Typically, the SFS would be built from more alleles than shown here, and extend much further.

In contrast, a large or growing population will have a larger number of rare (i.e. unique) alleles from the sudden growth and increase in genetic variation. Thus, opposite to the bottleneck, the SFS distribution will be biased towards the left end of the spectrum, with an excess of low-frequency variants.

SFS shift from expansion example.jpg
A similar diagram to the one above, but this time with an expansion event rather than a bottleneck. The expansion of the population, and subsequent increase in Ne, allows more new mutations to arise and persist (as fewer alleles are lost to genetic drift), causing more new (and thus rare) alleles to appear. This is shown by both the coalescent tree (left) and a shift in the SFS (right).

The SFS can even be used to detect alleles under natural selection. For strongly selected parts of the genome, alleles should occur at either high (if positively selected) or low (if negatively selected) frequency, with a deficit of more intermediate frequencies.

Adding to the analytical toolbox

The SFS is just one of many tools we can use to investigate the demographic history of populations and species. Using a combination of genomic technologies, coalescent theory and more robust analytical methods, the SFS appears to be poised to tackle more nuanced and complex questions of the evolutionary history of life on Earth.