Building Blueprints – How to assemble a genome

The utility of a reference genome

In the 18 years since the completion of the Human Genome Project, the practicality of assembling full genomes for a wide range of taxa beyond ourselves has only improved. While model taxa systems have achieved genomes before many others, it is now possible for whole genomes to be assembled for a range of non-model organisms as well. But how do we assemble the genome of a species for the very first time (often de novo – literally “from the new”)? What can we do with this genome? Why is it so useful? Let’s delve into the process and outcomes of genome assembly a little more.


Incomplete lineage sorting through Pachinko – a visual analogy

Reconstructing evolutionary history

Unravelling the evolutionary history of organisms – one of the main goals of phylogenetic research – remains a challenging prospect due to a number of theoretical and analytical aspects. Particularly, trying to reconstruct evolutionary patterns based on current genetic data (the most common way phylogenetic trees are estimated) is prone to the erroneous influence of some secondary factors. One of these is referred to as ‘incomplete lineage sorting’, which can have a major effect on how phylogenetic relationships are estimated and the statistical confidence we may have around these patterns. Today, we’re going to take a look at incomplete lineage sorting (shortened to ILS for brevity herein) using a game-based analogy – a Pachinko machine. Or, if you’d rather, the same general analogy also works for those creepy clown carnival games, but I prefer the less frightening alternative.


The G-CAT in 2020

The new year and decade

It’s been a few minutes (okay, several weeks) since the last post here on The G-CAT. Naturally, over that time I’ve spent holidays with both my own family and my partner’s family. Hopefully, you’ve enjoyed your own Christmas/New Year/Other-Non-Denominational-Celebrations break (and for us Aussies, that you’ve managed to avoid much of the devastation of the recent bushfire epidemic).

Because of this period of time (and a few other more pressing deadlines I had for the start of the new year), I haven’t prepared a new post in some time. However, I’d like to take this time to address how the nature of this blog might change over the next year or so (and into the future).

A new schedule

For those of you who keep more up-to-date with my academic progress, you’ll be aware that this year is the final year of my PhD. As it stands, I’m due to submit my thesis in August of this year (which feels much, much sooner than it really is). Similarly, for anyone who has ever interacted with a PhD student in the final year of their studies, you’ll also be aware that this can be a time of high stress, stacking deadlines and the overall impending doom of D-Day (thesis submission).

In light of all of this, I have decided to move away from the more predictable fortnightly post routine in favour of a more organic timetable. This will likely mean a fairly significant reduction in the frequency of blog posts, whereby I will post as a topic comes into my mind or when it appears relevant to other parts of my studies (e.g. in reading for writing manuscripts, etc.). I have also been weighing this decision for some time as a way to balance the quality of the posts I write: in some circumstances, I feel like the consistent deadline of once per fortnight causes some posts to suffer a little as I rush to produce something at least in the vicinity of every second Wednesday.

The future of The G-CAT

It is not my intention to completely abandon this project: The G-CAT is something that I have invested a fair amount of time and inspiration into, and it provides a solid avenue for science communication. As always, my inbox (on whichever platform you choose) is wide open for suggestions on topics of discussion. I’m looking forward to a more organic schedule that will allow me to properly explore and expand on the topics of interest whilst maintaining a healthy balance of PhD progression and down-time.

Here’s to 2020!

The ‘other’ allele frequency: applications of the site frequency spectrum

The site-frequency spectrum

In order to simplify our absolutely massive genomic datasets down to something more computationally feasible for modelling techniques, we often reduce it to some form of summary statistic. These are various aspects of the genomic data that can summarise the variation or distribution of alleles within the dataset without requiring the entire genetic sequence of all of our samples.

One very effective summary statistic that we might choose to use is the site-frequency spectrum (aka the allele frequency spectrum). Not to be confused with other measures of allele frequency which we’ve discussed before (like Fst), the site-frequency spectrum (abbreviated to SFS) is essentially a histogram of how frequent certain alleles are within our dataset. To do this, the SFS classifies each allele into a certain category based on how common it is, tallying up the number of alleles that occur at that frequency. The total number of categories would be the maximum number of possible copies of an allele: for organisms with two copies of every chromosome (‘diploids’, including humans), this is double the number of samples included. For example, a dataset comprised of genomic sequence for 5 people would have 10 different frequency bins.
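As a rough illustration of the tallying described above, here is a minimal Python sketch. The function name `site_frequency_spectrum` and the toy genotype data are made up for this example, and it assumes each diploid genotype has already been coded as the number of copies (0, 1 or 2) of the allele being counted. Note that the returned list also includes a leading entry for alleles seen zero times, so 5 people give 11 entries: the zero class plus the 10 frequency bins mentioned above.

```python
from collections import Counter


def site_frequency_spectrum(genotypes):
    """Tally how many sites carry the counted allele 0, 1, ..., 2N times.

    `genotypes` is a list of sites. Each site is a list with one entry per
    diploid individual, giving the number of copies (0, 1 or 2) of the
    counted allele carried by that individual.
    """
    n_individuals = len(genotypes[0])
    max_copies = 2 * n_individuals  # diploids: two copies per individual
    counts = Counter(sum(site) for site in genotypes)
    return [counts.get(k, 0) for k in range(max_copies + 1)]


# Hypothetical toy data: 6 sites scored in 5 diploid individuals.
toy_genotypes = [
    [0, 0, 0, 0, 0],  # allele absent (non-variable site)
    [1, 0, 0, 0, 0],  # singleton
    [0, 0, 1, 0, 0],  # singleton
    [1, 1, 0, 0, 0],  # doubleton spread over two individuals
    [2, 0, 0, 0, 0],  # doubleton carried by one individual
    [2, 2, 1, 0, 0],  # allele occurs 5 times
]

sfs = site_frequency_spectrum(toy_genotypes)
print(sfs)
```

The two doubleton sites land in the same bin even though the copies are distributed differently across individuals, since only the total count per site matters.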

For one population

The SFS for a single population – called the 1-dimensional SFS – is very easy to visualise as a concept. In essence, it’s just a frequency distribution of all the alleles within our dataset. Generally, the distribution follows an exponential shape, with many more rare (e.g. ‘singletons’) alleles than there are common ones. However, the exact shape of the SFS is determined by the history of the population, and like other analyses under coalescent theory we can use our understanding of the interaction between demographic history and current genetic variation to study past events.

1DSFS example.jpg
An example of the 1DSFS for a single population, taken from a real dataset from my PhD. Left: the full site-frequency spectrum, counting how many alleles (y-axis) occur a certain number of times (categories of the x-axis) within the population. In this example, as in most species, the vast majority of our DNA sequence is non-variable (frequency = 0). Given the huge disparity in number of non-variable sites, we often select only the variable ones (and even then, often discard the 1 category to remove potential sequencing errors) and get a graph more like the right. Right: the ‘realistic’ 1DSFS for the population, showing a general exponential decline (the blue trendline) for the more frequent classes. This is pretty standard for an SFS. ‘Singleton’ and ‘doubleton’ are alternative names for ‘alleles which occur once’ and ‘alleles which occur twice’ in an SFS.

Expanding the SFS to multiple populations

Further to this, we can expand the site-frequency spectrum to compare across populations. Instead of having a simple 1-dimensional frequency distribution, for a pair of populations we can have a grid. This grid specifies how often a particular allele occurs at a certain frequency in Population A and at a different frequency in Population B. This can also be visualised quite easily, albeit as a heatmap instead. We refer to this as the 2-dimensional SFS (2DSFS).

2dsfs example
An example of a 2DSFS, also taken from my PhD research. In this example, we are comparing Population A, containing 5 individuals (as diploid, 2 x 5 = max. of 10 occurrences of an allele) with Population B, containing 4 individuals. Each row denotes the frequency at which a certain allele occurs in Population B whilst the columns indicate the frequency at which a certain allele occurs in Population A. Each cell therefore indicates the number of alleles that occur at the exact frequency of the corresponding row and column. For example, the first cell (highlighted in green) indicates the number of alleles which are not found in either Population A or Population B (this dataset is a subsample from a larger one). The yellow cell indicates the number of alleles which occur 4 times in Population B and also 4 times in Population A. This could mean that in one of those Populations 4 individuals have one copy of that allele each, or two individuals have two copies of that allele, or that one has two copies and two have one copy. The exact composition of how the alleles are spread across samples within each population doesn’t matter to the overall SFS.
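To make the grid layout concrete, here is a small Python sketch of how a 2DSFS could be tallied. The function `joint_sfs_2d` and the toy genotypes are hypothetical, and the sketch assumes genotypes are coded as copy counts (0, 1 or 2) per diploid individual, with the same sites scored in both populations. It follows the orientation of the figure above: rows are counts in Population B and columns are counts in Population A.

```python
def joint_sfs_2d(pop_a, pop_b):
    """Build a 2D SFS grid: rows = allele count in B, columns = count in A.

    `pop_a` and `pop_b` are lists of sites (same sites, same order); each
    site is a list of per-individual copy counts (0, 1 or 2).
    """
    max_a = 2 * len(pop_a[0])
    max_b = 2 * len(pop_b[0])
    grid = [[0] * (max_a + 1) for _ in range(max_b + 1)]
    for site_a, site_b in zip(pop_a, pop_b):
        grid[sum(site_b)][sum(site_a)] += 1
    return grid


# Hypothetical toy data: 4 sites, 5 individuals in A and 4 in B.
pop_a = [
    [0, 0, 0, 0, 0],  # absent in A
    [1, 1, 1, 1, 0],  # 4 copies: four heterozygotes
    [2, 2, 0, 0, 0],  # 4 copies: two homozygotes
    [1, 0, 0, 0, 0],  # 1 copy
]
pop_b = [
    [0, 0, 0, 0],  # absent in B
    [2, 2, 0, 0],  # 4 copies
    [1, 1, 1, 1],  # 4 copies
    [0, 0, 0, 0],  # absent in B
]

grid = joint_sfs_2d(pop_a, pop_b)
print(grid[0][0])  # sites where the allele is absent from both populations
print(grid[4][4])  # sites where the allele occurs 4 times in each population
```

Both of the middle sites fall into the same cell (4 copies in B, 4 copies in A) despite different mixes of heterozygotes and homozygotes, matching the point made in the caption.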

The same concept can be expanded to even more populations, although this gets harder to represent visually. Essentially, we end up with a set of different matrices which describe the frequency of certain alleles across all of our populations, merging them together into the joint SFS. For example, a joint SFS of 4 populations would consist of 6 (4 x 4 total comparisons – 4 self-comparisons, then halved to remove duplicate comparisons) 2D SFSs all combined together. To make sense of this, check out the diagrammatic tables below.

populations for jsfs
A summary of the different combinations of 2DSFSs that make up a joint SFS matrix. In this example we have 4 different populations (as described in the above text). Red cells denote comparisons between a population and itself – which is effectively redundant. Green cells contain the actual 2D comparisons that would be used to build the joint SFS: the blue cells show the same comparisons but in mirrored order, and are thus redundant as well.
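The count of 6 pairwise comparisons can be checked with a few lines of Python. This is only a sketch, with population labels A to D chosen for illustration: it lists every unordered pair of distinct populations (the green cells) and confirms the number matches the "total minus self-comparisons, then halved" arithmetic.

```python
from itertools import combinations

populations = ["A", "B", "C", "D"]  # illustrative labels only

# Unordered pairs of distinct populations: one 2DSFS is needed for each.
pairs = list(combinations(populations, 2))

n = len(populations)
# n x n total comparisons, minus n self-comparisons, halved for mirrored duplicates.
expected_pairs = (n * n - n) // 2

print(pairs)
print(len(pairs), expected_pairs)
```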

annotated jsfs heatmap
Expanding the above jSFS matrix to the actual data, this matrix demonstrates how the matrix is actually a collection of multiple 2DSFSs. In this matrix, one particular cell demonstrates the number of alleles which occur at frequency x in one population and frequency y in another. For example, if we took the cell in the third row from the top and the fourth column from the left, we would be looking at the number of alleles which occur twice in Population B and three times in Population A. The colour of this cell is more or less orange, indicating that ~50 alleles occur at this combination of frequencies. As you may notice, many population pairs show similar patterns, except for the Population C vs Population D comparison.

The different forms of the SFS

Which alleles we choose to use within our SFS is particularly important. If we don’t have a lot of information about the genomics or evolutionary history of our study species, we might choose to use the minor allele frequency (MAF). Given that SNPs tend to be biallelic, for any given locus we could have Allele A or Allele B. The MAF chooses the least frequent of these two within the dataset and uses that in the summary SFS: since the other (major) allele’s frequency would just be 2N minus the frequency of the minor allele, it’s not included in the summary. An SFS made of the MAF is also referred to as the folded SFS.

Alternatively, if we know some things about the genetic history of our study species, we might be able to divide Allele A and Allele B into derived or ancestral alleles. Since SNPs often occur as mutations at a single site in the DNA, one allele at the given site is the new mutation (the derived allele) whilst the other is the ‘original’ (the ancestral allele). Typically, we would use the derived allele frequency to construct the SFS, since under coalescent theory we’re trying to simulate that mutation event. An SFS made of the derived alleles only is also referred to as the unfolded SFS.
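The relationship between the two forms can be sketched in Python. Assuming an unfolded SFS stored as a list indexed by derived allele count (0 to 2N), folding simply adds each bin to its mirror image (count k with count 2N minus k), because a derived allele seen 2N minus k times means the other allele at that site is seen k times. The function name `fold_sfs` and the numbers below are invented for the example.

```python
def fold_sfs(unfolded):
    """Fold an SFS indexed by allele count 0..2N into minor allele counts 0..N."""
    total = len(unfolded) - 1  # 2N, the maximum possible allele count
    folded = []
    for k in range(total // 2 + 1):
        mirror = total - k
        if mirror == k:
            # The middle bin (both alleles equally common) has no mirror partner.
            folded.append(unfolded[k])
        else:
            folded.append(unfolded[k] + unfolded[mirror])
    return folded


# Hypothetical unfolded SFS for 3 diploid individuals (2N = 6).
unfolded = [0, 8, 4, 2, 1, 1, 0]
folded = fold_sfs(unfolded)
print(folded)
```

Going the other way (unfolding) is not possible from the folded SFS alone, which is why knowing the ancestral state of each allele is needed to build the unfolded version.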

Applications of the SFS

How can we use the SFS? Well, it can more or less be used as a summary of genetic variation for many types of coalescent-based analyses. This means we can make inferences of demographic history (see here for more detailed explanation of that) without simulating large and complex genetic sequences and instead use the SFS. For instance, by simulating a scenario such as a bottleneck and comparing the SFS expected under that scenario to our observed SFS, we can estimate the likelihood of that scenario.
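One common way to score how well a scenario fits is to treat the observed SFS as a multinomial sample from the expected SFS. The Python sketch below is an illustration under that assumption rather than the method of any particular program: the function `sfs_log_likelihood` and all of the numbers are hypothetical, and in a real analysis the expected spectra would come from coalescent simulations of each scenario. The constant term of the multinomial is dropped because it is the same for every scenario being compared.

```python
import math


def sfs_log_likelihood(observed, expected):
    """Multinomial log-likelihood of observed SFS counts given an expected SFS.

    Both lists cover the same frequency bins. The expected SFS is rescaled
    to proportions, so only its shape matters, not its total.
    """
    total = sum(expected)
    loglik = 0.0
    for obs, exp in zip(observed, expected):
        if obs == 0:
            continue  # empty bins contribute nothing
        if exp <= 0:
            return float("-inf")  # scenario cannot produce this bin at all
        loglik += obs * math.log(exp / total)
    return loglik


# Hypothetical variable-site counts for allele counts 1 to 5.
observed = [20, 18, 15, 12, 10]

# Hypothetical expected spectra under two competing scenarios.
expected_steep = [60, 30, 20, 15, 12]  # many rare alleles
expected_flat = [25, 22, 20, 17, 16]   # rare alleles depleted

ll_steep = sfs_log_likelihood(observed, expected_steep)
ll_flat = sfs_log_likelihood(observed, expected_flat)
print(ll_steep, ll_flat)
```

Here the flatter spectrum gives the higher (less negative) log-likelihood, so for this made-up dataset the scenario with fewer rare alleles would be the better supported of the two.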

For example, we would predict that, under a scenario of a recent genetic bottleneck in a population, alleles which are rare in the population will be disproportionately lost due to genetic drift. Because of this, the overall shape of the SFS will shift to the right dramatically, leaving a clear genetic signal of the bottleneck. This works under the same theoretical background as coalescent tests for bottlenecks.

SFS shift from bottleneck example.jpg
A representative example of how a bottleneck causes a shift in the SFS, based on a figure from a previous post on the coalescent. Centre: the diagram of alleles through time, with rarer variants (yellow and navy) being lost during the bottleneck but more common variants surviving (red). Left: this trend is reflected in the coalescent trees for these alleles, with red crosses indicating the complete loss of that allele. Right: the SFS from before (in red) and after (in blue) the bottleneck event for the alleles depicted. Before the bottleneck, variants are spread in the usual exponential shape: afterwards, however, a disproportionate loss of the rarer variants causes the distribution to flatten. Typically, the SFS would be built from more alleles than shown here, and extend much further.

Contrastingly, a large or growing population will have a larger number of rare (i.e. unique) alleles from the sudden growth and increase in genetic variation. Thus, opposite to the bottleneck the SFS distribution will be biased towards the left end of the spectrum, with an excess of low-frequency variants.

SFS shift from expansion example.jpg
A similar diagram as above, but this time with an expansion event rather than a bottleneck. The expansion of the population, and subsequent increase in Ne, facilitates the accumulation of new alleles through mutation (and reduces the loss of alleles from drift), causing more new (and thus rare) alleles to appear. This is shown by both the coalescent tree (left) and a shift in the SFS (right).

The SFS can even be used to detect alleles under natural selection. For strongly selected parts of the genome, alleles should occur at either high (if positively selected) or low (if negatively selected) frequency, with a deficit of more intermediate frequencies.

Adding to the analytical toolbox

The SFS is just one of many tools we can use to investigate the demographic history of populations and species. Using a combination of genomic technologies, coalescent theory and more robust analytical methods, the SFS appears to be poised to tackle more nuanced and complex questions of the evolutionary history of life on Earth.

Notes from the Field: Octoroks

Scientific name

Octorokus infletus

Meaning: Octorokus from [octorok] in Hylian; infletus from [inflate] in Latin.

Translation: inflating octorok; all varieties use an inflatable air sac derived from the swim bladder to float and scan the horizon.

Varieties

Octorokus infletus hydros [aquatic morphotype]

Octorokus infletus petram [mountain morphotype]

Octorokus infletus silva [forest morphotype]

Octorokus infletus arctus [snow morphotype]

Octorokus infletus imitor [deceptive morphotype]

All octoroks.jpg
The various morphotypes of inflating octoroks. A: The water octorok, considered the morphotype closest to the ancestral physiology of the species. B: The forest octorok, with grass camouflage. C: The deceptive octorok, which has replaced its tufted vegetation with a glittering chest as bait. D: The mountainous octorok, with rock camouflage. E: The snow octorok, with tundra grass camouflage.

Common name

Variable octorok

Taxonomic status

Kingdom Animalia; Phylum Mollusca; Class Cephalopoda; Order Octopoda; Family Octopodidae; Genus Octorokus; Species infletus

Conservation status

Least Concern

Distribution

The species is found throughout all major habitat regions of Hyrule, with localised morphotypes found within specific habitats. The only major region where the variable octorok is not found is within the Gerudo Desert, suggesting some remnant dependency on standing water.

Octorok distribution.jpg
The region of Hyrule, with the distribution of octoroks in blue. The only major region where they are not found is the Gerudo Desert in the bottom left.

Habitat

Habitat choice depends on the physiology of the morphotype; so long as the environment allows the octorok to blend in, it is highly likely there are many around (i.e. unseen).

Behaviour and ecology

The variable octorok is arguably one of the most diverse species within modern Hyrule, exhibiting a large number of different morphotypic forms and occurring in almost all major habitat zones. Historical data suggests that the water octorok (Octorokus infletus hydros) is the most ancestral morphotype, with ancient literature frequently referring to them as sea-bearing or river-traversing organisms. Estimates from the literature suggest that their adaptation to land-based living is a recent evolutionary step which facilitated rapid morphological radiation of the lineage.

Several physiological characteristics unite the variable morphological forms of the octorok into a single identifiable species. Other than the typical body structure of an octopod (eight legs, largely soft body with an elongated mantle region), the primary diagnostic trait of the octorok is the presence of a large ‘balloon’ within the top of the mantle. This appears to be derived from the swim bladder of the ancestral octorok, which has shifted to the cranial region. The octorok can inflate this balloon using air pumped through the gills, filling it and lifting the octorok into the air. All morphotypes use this to scan the surrounding region for prey items, and will attack people if aggravated.

inflated octorok
A water morphotype octorok with balloon inflated.

Diets of the octorok vary depending on the morphotype and the ecological habitat; adaptation to different ecological niches is facilitated by a diverse and generalist diet.

Demography

Although limited information is available on the amount of gene flow and population connectivity between different morphotypes, by sheer numbers alone it would appear the variable octorok is highly abundant. Some records of interactions between morphotypes (such as at the water’s edge within forested areas) imply that the different types are not reproductively isolated and can form hybrids: how this impacts resultant hybrid morphotypes and development is unknown. However, given the propensity of morphotypes to be largely limited to their adaptive habitats, it would seem reasonable to assume that some level of population structure is present across types.

Adaptive traits

The variable octorok appears remarkably diverse in physiology, although the recent nature of their divergence and the observed interactions between morphological types suggest that they are not reproductively isolated. Whether these differences are the result of phenotypic plasticity (with environmental pressures driving the associated physiological changes in different environments) or are genetically coded at early stages of development is unknown due to the cryptic nature of octorok spawning.

All octoroks employ strong behavioural and physiological traits for camouflage and ambush predation. Vegetation is usually placed on the top of the cranium of all morphotypes, with the exact species of plant used dependent on the environment (e.g. forest morphotypes will use grasses or ferns, whilst mountain morphotypes will use rocky boulders). The octorok will then dig beneath the surface until just the vegetation is showing, effectively blending in with the environment and only occasionally choosing to surface by using the balloon. Whether this behaviour is passed down genetically or taught from parents is unclear.

Management actions

Few management actions are recommended for this highly abundant species. However, further research is needed to better understand the highly variable nature and the process of evolution underpinning their diverse morphology. Whether morphotypes are genetically hardwired by inheritance of determinant genes, or instead arise from alterations in gene expression caused by the environmental context of octoroks (i.e. phenotypic plasticity), provides an intriguing avenue of insight into the evolution of Hylian fauna.

Nevertheless, the transition from the marine environment onto the terrestrial landscape appears to be a significant stepping stone in the radiation of morphological structures within the species. How this has been facilitated by the genetic architecture of the octorok is a mystery.