As you may have gathered, The G-CAT has been significantly less active in this our most Cursed year. There are a number of reasons for that – not just the overall disaster that has been world events – including the fact that this was the last year of my PhD. I’m delighted to announce that now, after ~3.5 years of hard work, I am officially Dr. Buckley (not Dr. G-CAT, as I may have led you to believe)!
The idea of using the genetic sequences of living organisms to understand the evolutionary history of species is a concept much repeated on The G-CAT. And it’s a fundamental one in phylogenetics, taxonomy and evolutionary biology. Often, we try to analyse the genetic differences between individuals, populations and species in a tree-like manner, with close tips being similar and more distantly separated branches being more divergent. However, this runs on one very key assumption: that the patterns we observe in our study genes match the overall patterns of species evolution. But this isn’t always true, and before we can delve into that we have to understand the difference between a ‘gene tree’ and a ‘species tree’.
However, a phylogenetic tree based on a single gene only demonstrates the history of that gene. What we assume in most cases is that the history of that gene matches the history of the species: that branches in the genetic tree mirror when different splits in species occurred throughout history.
The easiest way to conceptualise gene trees and species trees is to think of individual gene trees that are nested within an overarching species tree. In this sense, individual gene trees can vary from one another (substantially, even) but by looking at the overall trends of many genes we can see how the genome of the species has changed over time.
One of the most prolific, but more complicated, ways gene trees can vary from their overarching species tree is due to what we call ‘incomplete lineage sorting’. This is based on the idea that species and the genes that define them are constantly evolving over time, and that, because of this, different genes are at different stages of divergence between populations and species. If we imagine a set of three related populations which have all descended from a single ancestral population, we can start to see how incomplete lineage sorting could occur. Our ancestral population likely has some genetic diversity, containing multiple alleles of the same locus. In a true phylogenetic tree, we would expect these different alleles to ‘sort’ into the different descendant populations, such that one population might have one of the alleles, a second the other, and so on, without them sharing the different alleles between them.
If this separation into new populations has been recent, or if gene flow has occurred between the populations since this event, then we might find that each descendant population has a mixture of the different alleles, and that not enough time has passed to clearly separate the populations. For alleles to sort cleanly, enough time needs to pass for new mutations to occur and for genetic drift to push different populations towards different alleles: if the split is too recent, then it can be hard to accurately distinguish between populations. This can be difficult to interpret (see below figure for a visualisation of this), but there’s a great description of incomplete lineage sorting here.
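For a rough sense of how often incomplete lineage sorting should scramble a single gene, here is a minimal Python sketch. It assumes the textbook three-population coalescent model with no gene flow, which is a simplification this post doesn’t spell out: populations A and B are each other’s closest relatives, C is the outlier, and t is the time between the two splits measured in coalescent units (generations divided by twice the effective population size, for a diploid). The function names and example values are made up for illustration.

```python
import math
import random


def discordance_probability(t):
    """Chance that a gene tree disagrees with the species tree ((A,B),C).

    t is the length of the branch between the two splits, in coalescent
    units. This is the standard three-taxon coalescent result.
    """
    return (2.0 / 3.0) * math.exp(-t)


def simulate_gene_tree(t, rng):
    """Simulate one locus and return which pair of lineages joins first."""
    # The A and B lineages coalesce at rate 1 while alone in their shared
    # ancestor; if that happens before time t, the gene tree matches.
    if rng.expovariate(1.0) < t:
        return "AB"
    # Otherwise all three lineages reach the deeper ancestor unsorted,
    # and any of the three pairs is equally likely to join first.
    return rng.choice(["AB", "AC", "BC"])


def simulated_discordance(t, n_loci, seed=1):
    """Fraction of simulated loci whose tree is not ((A,B),C)."""
    rng = random.Random(seed)
    mismatches = sum(simulate_gene_tree(t, rng) != "AB" for _ in range(n_loci))
    return mismatches / n_loci


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 3.0):
        print(t, round(discordance_probability(t), 3),
              round(simulated_discordance(t, 20000), 3))
```

Under these assumptions, two splits that happen almost back to back (t close to 0) leave about two in three genes telling the wrong story, while a long gap between splits makes discordance vanishingly rare.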
Hybridisation and horizontal transfer
Another way individual genes may become incongruent with other genes is through another phenomenon we’ve discussed before: hybridisation (or more specifically, introgression). When two individuals from different species breed together to form a ‘hybrid’, they join together what were once two separate gene pools. Thus, the hybrid offspring has (if it’s a first generation hybrid, anyway) 50% of genes from Species A and 50% of genes from Species B. In terms of our phylogenetic analysis, if we picked one gene randomly from the hybrid, we have a 50% chance of picking a gene that reflects the evolutionary history of Species A, and a 50% chance of picking a gene that reflects the evolutionary history of Species B. This would change how our outputs look significantly: if we pick a Species A gene, our ‘hybrid’ will look (genetically) very, very similar to Species A. If we pick a Species B gene, our ‘hybrid’ will look like a Species B individual instead. Naturally, this can really stuff up our interpretations of species boundaries, distributions and identities.
This can have a profound impact as paralogous genes are difficult to detect: if there has been a gene duplication early in the evolutionary history of our phylogenetic tree, then many (or all) of our study samples will have two copies of said gene. Since they look similar in sequence, there’s every possibility that we pick Variant 1 in some species and Variant 2 in other species. Being unable to tell them apart, we can have some very weird and abstract results within our tree. Most importantly, different samples with the same duplicated variant will seem more similar to one another (i.e. appear to have evolved from a common ancestor more recently) than they will to any sample of the other variant (even if they came from the exact same species)!
Overcoming incongruence with genomics
Although a tricky conundrum in phylogenetics and evolutionary genetics broadly, gene tree incongruence can largely be overcome by using more loci. As the random changes of any one locus have a smaller effect on the larger total set of loci, the general and broad patterns of evolutionary history can become clearer. Indeed, understanding how many loci are affected by what kind of process can itself become informative: large numbers of introgressed loci can indicate whether hybridisation was recent, strong, or biased towards one species over another, for example. As with many things, the genomic era appears poised to address the many analytical issues and complexities of working with genetic data.
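To see why more loci help, here is a toy Python simulation. It is only a sketch under assumptions of my own choosing: three populations with true relationships ((A,B),C), independent loci, no gene flow, and a deliberately crude ‘plurality vote’ across gene trees standing in for the far more sophisticated methods real phylogenomic programs use. All function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate_gene_tree(t, rng):
    """Return the first-coalescing pair for one locus.

    t is the time between the two population splits in coalescent units.
    """
    if rng.expovariate(1.0) < t:
        return "AB"  # sorted in time: gene tree matches the species tree
    return rng.choice(["AB", "AC", "BC"])  # unsorted: any pair can join first


def plurality_topology(t, n_loci, rng):
    """Most common gene tree across n_loci independent loci.

    Ties go to whichever topology was seen first.
    """
    counts = Counter(simulate_gene_tree(t, rng) for _ in range(n_loci))
    return counts.most_common(1)[0][0]


def proportion_correct(t, n_loci, n_replicates=500, seed=1):
    """How often the plurality vote recovers the true pairing (A with B)."""
    rng = random.Random(seed)
    hits = sum(plurality_topology(t, n_loci, rng) == "AB"
               for _ in range(n_replicates))
    return hits / n_replicates


if __name__ == "__main__":
    for n_loci in (1, 5, 21, 51):
        print(n_loci, proportion_correct(0.5, n_loci))
```

With a short gap between splits (t = 0.5 here), a single locus often points to the wrong relationship, but the vote across a few dozen loci almost always lands on the true one. That is the ‘random changes of any one locus matter less’ idea in miniature.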
Given the strong influence of genetic identity on the process and outcomes of the speciation process, it seems a natural connection to use genetic information to study speciation and species identities. There is a plethora of genetics-based tools we can use to investigate how speciation occurs (both the evolutionary processes and the external influences that drive it). One clear way to test whether two populations of a particular species are actually two different species is to investigate genes related to reproductive isolation: if the genetic differences demonstrate reproductive incompatibilities across the two populations, then there is strong evidence that they are separate species (at least under the Biological Species Concept; see Part One for why!). But this type of analysis requires several tools: 1) knowledge of the specific genes related to reproduction (e.g. formation of sperm and eggs, genital morphology, etc.), 2) the complete and annotated genome of the species (to be able to find and analyse the right genes properly) and 3) a good amount of data for the populations in question. As you can imagine, for people working on non-model species (i.e. ones that haven’t had the same history and detail of research as, say, humans and mice), this can be problematic. So, instead, we can use other genetic information to investigate and suggest patterns and processes related to the formation of new species.
Is reproductive isolation naturally selected for or just a consequence?
A fundamental aspect of studies of speciation is a “chicken or the egg”-type paradigm: does natural selection directly select for rapid reproductive isolation, preventing interbreeding; or does reproductive isolation arise as a secondary consequence of general adaptive differences, over a long history of evolution? This might be a confusing distinction, so we’ll dive into it a little more.
The reproductive incompatibility of two populations (thus making them species) is often intrinsically linked to the genetic make-up of those two species. Some conflicts in the genetics of Population 1 and Population 2 may mean that a hybrid having half Population 1 genes and half Population 2 genes will have serious fitness problems (such as sterility or developmental problems). Dramatic genetic differences, particularly a difference in the number of chromosomes between the two sources, are a significant component of reproductive isolation and are usually to blame for sterile hybrids such as ligers, zorses and mules.
We can study the process of speciation in the natural world without focussing on the ‘reproductive isolation’ element of species identity as well. For many species, we are unlikely to have the detail (such as an annotated genome and known functions of genes related to reproduction) required to study speciation at this level in any case. Instead, we might choose to focus on the different factors that are currently influencing the process of speciation, such as how the environmental, demographic or adaptive contexts of populations play a role in the formation of new species. Many of these questions fall within the domain of phylogeography; particularly, how the historical environment has shaped the diversity of populations and species today.
Although these can help answer some questions related to speciation, new tools are constantly needed to provide a clearer picture of the process. Understanding how and why new species are formed is a critical aspect of understanding the world’s biodiversity. How can we predict if a population will speciate at some point? What environmental factors are most important for driving the formation of new species? How stable are species identities, really? These questions (and many more) remain elusive for a wide variety of life on Earth.
This is Part 1 of a four-part miniseries on the process of speciation: how we get new species, how we can see this in action, and the end results of the process. This week, we’ll start with a seemingly obvious question: what is a species?
The definition of a ‘species’
‘Species’ are a human definition of the diversity of life. When we talk about the diversity of life, and the myriad of creatures and plants on Earth, we often talk about species diversity. This might seem glaringly obvious, but there’s one key issue: what is a species, anyway? While we might like to think of them as discrete and obvious groups (a dog is definitely not the same species as a cat, for example), the concept of a singular “species” is actually the result of human categorisation.
In reality, the diversity of life is spread across a huge spectrum of differentiation: from things which are closely related but still different to us (like chimps), to more different again (other mammals), to hardly relatable at all (bacteria and plants). So, what is the cut-off for calling something a species, and not a different genus, family, or kingdom? Or alternatively, at what point do we call a specific sub-group of a species a sub-species, or another species entirely?
This might seem like a simple question: we look at two things, and they look different, so they must be different species, right? Well, of course, nature is never simple, and the line between “different” and “not different” is very blurry. Here’s an example: consider that you knew nothing about the history, behaviour or genetics of dogs. If you simply looked at all the different breeds of dogs on Earth, you might suggest that there are hundreds of species of domestic dogs. That seems a little excessive though, right? In fact, the domestic dog, Eurasian wolf, and the Australian dingo are all the same species (but different subspecies, along with about 38 others…but that’s another issue altogether).
For example, a horse and zebra can breed to produce a zorse; however, zorses are fundamentally infertile (due to the different number of chromosomes between a horse and a zebra) and thus a horse is a different species to a zebra. In contrast, a German Shepherd and a chihuahua can breed and make a hybrid mutt, so they are the same species.
To try and account for the issues with the BSC, taxonomists try to push for the usage of “integrative taxonomy”. This means that species should be defined by multiple different agreeing concepts, such as reproductive isolation, genetic differentiation, behavioural differences, and/or ecological traits. The more traits that can separate the two, the greater support there is for the species to be separated: if they disagree, then more information is needed to determine exactly whether or not they should be called different species. Debates about taxonomy are ongoing and are likely going to be relevant for years to come, but form critical components of understanding biodiversity, patterns of evolution, and creating effective conservation legislation to protect endangered or threatened species (for whichever groups we decide are species).
As regular readers of The G-CAT are likely aware, my first ever scientific paper was published this week. The paper is largely the results of my Honours research (with some extra analysis tacked on) on the phylogenomics (the same as phylogenetics, but with genomic data) and biogeographic history of a group of small, endemic freshwater fishes known as the pygmy perch. There are a number of different messages in the paper related to biogeography, taxonomy and conservation, and I am really quite proud of the work.
To my honest surprise, the paper has received a decent amount of media attention following its release. Nearly all of these have focused on the biogeographic results and interpretations of the paper, which is arguably the largest component of the paper. In these media releases, the articles are often opened with “…despite the odds, new research has shown how a tiny fish managed to find its way across the arid Australian continent – more than once.” So how did they manage it? These are tiny fish, and there’s a very large desert area right in the middle of Australia, so how did they make it all the way across? And more than once?!
The Great (southern) Southern Land
To understand the results, we first have to take a look at the context for the research question. There are seven officially named species of pygmy perches (‘named’ is an important characteristic here…but we’ll go into the details of that in another post), which are found in the temperate parts of Australia. Of these, three are found within southwest Western Australia, in Australia’s only globally recognised biodiversity hotspot, and the remaining four are found throughout eastern Australia (ranging from eastern South Australia to Tasmania and up to lower Queensland). These two regions are separated by arid desert regions, including the large expanse of the Nullarbor Plain.
As one might expect, the formation of the Nullarbor Plain was a huge barrier for many species, especially those that depend on regular accessible water for survival. In many species of both plants and animals, we see in their phylogenetic history a clear separation of eastern and western groups around this time; once widely distributed species became fragmented by the plain and diverged from one another. We would most certainly expect this to be true of pygmy perch.
This is where the real difference between everything else and pygmy perch happens. For most species, we see only one east and west split in their phylogenetic tree, associated with the Nullarbor Plain; before that, their ancestors were likely distributed across the entire southern continent and were one continuous unit.
Not for pygmy perch, though. Our phylogenetic patterns show that there were multiple splits between eastern and western ancestral pygmy perch. We can see this visually within the phylogenetic tree; some western species of pygmy perches are more closely related, from an evolutionary perspective, to eastern species of pygmy perches than they are to other western species. This could imply a couple of different things: either some species came about by migration from east to west (or vice versa), and this happened at least twice, or two different ancestral pygmy perches were distributed across all of southern Australia and each split east-west at some point in time. These two hypotheses are called “multiple invasion” and “geographic paralogy”, respectively.
So, which is it? We delved deeper into this using a type of analysis called ‘ancestral clade reconstruction’. This tries to guess the likely distributions of species ancestors using different models and statistical analysis. Our results found that the earliest east-west split was due to the fragmentation of a widespread ancestor ~20 million years ago, and that a migration event, facilitated by changing waterways from the Nullarbor Plain, pushed some eastern pygmy perches to the west to form the second group of western species. We argue for more than one migration across Australia since the initial ancestor of pygmy perches must have expanded from some point (either east or west) to encompass the entirety of southern Australia.
So why do we see this for pygmy perch and no other species? Well, that’s the real mystery; out of all of the aquatic species found in southeast and southwest Australia, pygmy perch are one of the worst at migrating. They’re very picky about habitat, small, and don’t often migrate far unless pushed (by, say, a flood). It is possible that unrecorded extinct species of pygmy perch might help to clarify this a little, but the chances of finding a preserved fish fossil (let alone for a fish less than 8cm in size!) are extremely low. We can really only theorise about how they managed to migrate.
What does this mean for pygmy perches?
Nearly all species of pygmy perch are threatened or worse in the conservation legislation; there have been many conservation efforts to try and save the worst-off species from extinction. Pygmy perches provide a unique insight into the history of the Australian climate and may be a key in unlocking some of the mysteries of what our land was like so long ago. Every species is important for conservation and even those small, hard-to-notice creatures that we might forget about play a role in our environmental history.
We’ve talked previously on The G-CAT about how the genetic underpinning of certain evolutionary traits can change in different directions depending on the selective pressure it is under. Particularly, we can see how the frequency of different alleles might change in one direction or another, or stabilise somewhere in the middle, depending on its encoded trait. But thinking bigger picture than just the genetics of one trait, we can actually see that evolution as an entire process works rather similarly.
The classic view of the direction of evolution is based on divergent evolution. This is simply the idea that a particular species possesses some ancestral trait. The species (or population) then splits into two (for one reason or another), and each one of these resultant species and populations evolves in a different way to the other. Over time, this means that their traits are changing in different directions, but ultimately originate from the same ancestral source.
Evidence for divergent evolution is rife throughout nature, and is a fundamental component of all of our understanding of evolution. Divergent evolution means that, by comparing similar traits in two species (called homologous traits), we can trace back species histories to common ancestors. Some impressive examples of this exist in nature, such as the number of neck bones in most mammalian species. Humans have the same number of neck bones as giraffes; thus, we can suggest that the ancestor of both species (and all mammals) probably had a similar number of neck bones. It’s just that the giraffe lineage evolved longer bones whereas other lineages did not.
A more dramatic (and potentially obvious) example of convergent evolution would be wings and the power of flight. Despite the fact that butterflies, bees, birds and bats all have wings and can fly, most of them are pretty unrelated to one another. It seems much more likely that flight evolved independently multiple times, rather than that the other 99% of species sharing the same ancestor lost the capacity for flight.
Parallel evolution is an interesting field of research for a few reasons. Firstly, it provides a scenario in which we can more rigorously test expectations and outcomes of evolution in a particular environment. For example, if we find traits that are parallel in a whole bunch of fish species in a particular region, we can start to look at how that particular environment drives evolution across all fish species, as opposed to single-species case studies.
So, where is evolution going for nature? Well, the answer is probably all over the place, but steered by the current environmental circumstances. Predicting the evolutionary impacts of particular environmental change (e.g. climate change) is exceedingly difficult but a critical component of understanding the process of evolution and the future of species. Evolution continually surprises us with creative solutions to complex problems and I have no doubt new mysteries will continue to be thrown at us as we delve deeper.
All of these questions can be addressed with a combination of genetic, environmental and ecological information across a variety of timescales. However, the overall field of biogeography (and phylogeography as a derivative of it) has traditionally been largely rooted in a strong yet changing theoretical basis. The earliest discussions and discoveries related to biogeography as a field of science date back to the 18th Century, and to Carl Linnaeus (to whom we owe our binomial classification system) and Alexander von Humboldt. These scientists (and undoubtedly many others of that era) were among the first to notice how organisms in similar climates (e.g. Australia, South Africa and South America) showed similar physical characteristics despite being so distantly separated (both in their groups and geographic distance). The communities of these regions also appeared to be highly similar. So how could this be possible over such huge distances?
Dispersal or vicariance?
Two main explanations for these patterns are possible: dispersal and vicariance. As one might expect, dispersal denotes that an ancestral species was distributed in one of these places (referred to as the ‘centre of origin’) before it migrated and inhabited the other places. Contrastingly, vicariance suggests that the ancestral species was distributed everywhere originally, covering all contemporary ranges within it. However, changes in geography, climate or the formation of other barriers caused the range of the ancestor to fragment, with each fragmented group evolving into its own distinct species (or group of species).
In initial biogeographic science, dispersal was the most heavily favoured explanation. At the time, there was no clear mechanism by which organisms could be present all over the globe without some form of dispersal: it was generally believed that the world was a static, unmoving system. Dispersal was well supported by some biological evidence such as the diversification of Darwin’s finches across the Galápagos archipelago. Thus, this concept was supported through the proposals of a number of prominent scientists such as Charles Darwin and A.R. Wallace. For others, however, the distance required for dispersal (such as across entire oceans) seemed implausible and biologically unrealistic.
A paradigm shift in biogeography
Two particular developments in theory are credited with a paradigm shift in the field: cladistics and plate tectonics. Cladistics simply involved using shared biological characteristics to reconstruct the evolutionary relationships of species (think phylogenetics, but using physical traits instead of genetic sequence). Just as important, however, was plate tectonic theory, which provided a clear way for organisms to spread across the planet. Understanding that, deep in the past, all continents had been directly connected to one another provides a convenient explanation for how species groups spread. Instead of requiring species to travel across entire oceans, continental drift meant that one widespread and ancient ancestor on the historic supercontinent (Pangaea; or subsequently Gondwana and Laurasia) could become fragmented. It only required that groups were very old, but not necessarily very dispersive.
From these advances in theory, cladistic vicariance biogeography was born. The field rapidly overtook dispersal as the most likely explanation for biogeographic patterns across the globe by not only providing a clear mechanism to explain these but also an analytical framework to test questions relating to these patterns. Further developments into the analytical backbone of cladistic vicariance allowed for more nuanced questions of biogeography to be asked, although these still fundamentally ignored the role of potential dispersals in explaining species’ distributions.
Modern philosophy of biogeography
So, what is the current state of the field? Well, the more we research biogeographic patterns with better data (such as with genomics) the more we realise just how complicated the history of life on Earth can be. Complex modelling (such as Bayesian methods) allows us to more explicitly test the impact of Earth history events on our study species, and can provide a more detailed overview of the evolutionary history of the species (such as by directly estimating times of divergence, amount of dispersal, extent of range shifts).
From a theoretical perspective, the consistency of patterns of groups is always in question and exactly what determines what species occurs where is still somewhat debatable. However, the greater number of types of data we can now include (such as geological, paleontological, climatic, hydrological, genetic…the list goes on!) allows us to paint a better picture of life on Earth. By combining information about what we know happened on Earth, with what we know has happened to species, we can start to make links between Earth history and species history to better understand how (or if) these events have shaped evolution.
Understanding the evolutionary history of species can be a complicated matter, both from theoretical and analytical perspectives. Although phylogenetics addresses many questions about evolutionary history, there are a number of limitations we need to consider in our interpretations.
One of these limitations we often want to explore in better detail is the estimation of the divergence times within the phylogeny; we want to know exactly when two evolutionary lineages (be they genera, species or populations) separated from one another. This is particularly important if we want to relate these divergences to Earth history and environmental factors to better understand the driving forces behind evolution and speciation. A traditional phylogenetic tree, however, won’t show this: the tree is scaled in terms of the genetic differences between the different samples in the tree. The rate of genetic differentiation is not always a linear relationship with time and definitely doesn’t appear to be universal.
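One (of several) reasons genetic difference doesn’t scale linearly with time is that the same site in a sequence can mutate more than once, so the differences we can count start to lag behind the changes that actually happened. As a rough illustration, here is a small Python sketch of the Jukes-Cantor (1969) correction, one of the simplest substitution models. Using it here is my own choice of example; it assumes all bases change into one another at the same rate, and the function name is made up.

```python
import math


def jc69_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    p is the observed proportion of sites that differ between two aligned
    sequences. The correction is undefined once p reaches 0.75, the point
    at which two sequences look as different as random ones.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30, 0.60):
        print(p, round(jc69_distance(p), 3))
```

For very similar sequences the corrected distance is close to the raw proportion of differences, but the gap widens quickly as sequences become more divergent, which is part of why raw differences make a poor clock on their own.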
How do we do it?
There are a number of parameters that are required for estimating divergence times from a phylogenetic tree. These can be summarised into two distinct categories: the tree model and the substitution model.
The first one of these is relatively easy to explain; it describes the exact relationship of the different samples in our dataset (i.e. the phylogenetic tree). Naturally, this includes the topology of the tree (which determines which divergence times can be estimated in the first place). However, there is another very important factor in the process: the lengths of the branches within the phylogenetic tree. Branch lengths are related to the amount of genetic differentiation between the different tips of the tree. The longer the branch, the more genetic differentiation that must have accumulated (and usually also meaning that more time has passed from one end of the branch to the other). Even two phylogenetic trees with identical topology can give very different results if they vary in their branch lengths (see the above Figure).
However, at least one other important component is necessary to turn divergence time estimates into absolute, objective times. An external factor with an attached date is needed to calibrate the relative branch divergences; this can be in the form of the determined mutation rate for all of the branches of the tree or by dating at least one node in the tree using additional information. These help to anchor either the mutation rate along the branches or the absolute date of at least one node in the tree (with the rest estimated relative to this point). The second method often involves placing a time constraint on a particular node of the tree based on prior information about the biogeography of the species (for example, we might know one species likely diverged from another after a mountain range formed: the age of the mountain range would be our constraint). Alternatively, we might include a fossil in the phylogeny which has been radiometrically dated and place an absolute age on that instead.
With regard to the former method, mutation rates describe how fast genetic differentiation accumulates as evolution occurs along the branch. Although mutations gradually accumulate over time, the rate at which they occur can depend on a variety of factors (even including the environment of the organism). Even within the genome of a single organism, there can be variation in the mutation rate: genes, for example, often gain mutations slower than non-coding regions.
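The arithmetic behind both kinds of calibration can be sketched in a few lines of Python. This is a bare-bones illustration that assumes a strict molecular clock (one constant rate across the whole tree), which, as noted above, real data often violate. The function names and all of the numbers are hypothetical.

```python
def rate_from_calibration(pairwise_distance, calibration_age_years):
    """Rate (substitutions per site per year, per lineage) implied by a
    pair of samples whose split has a known age.

    The factor of 2 is there because both lineages have been accumulating
    changes independently since they split.
    """
    return pairwise_distance / (2.0 * calibration_age_years)


def divergence_time(pairwise_distance, rate_per_site_per_year):
    """Years since two samples split, given a per-lineage rate."""
    return pairwise_distance / (2.0 * rate_per_site_per_year)


if __name__ == "__main__":
    # Hypothetical calibration: two species separated by a mountain range
    # that formed 2 million years ago differ at 4% of sites.
    rate = rate_from_calibration(0.04, 2_000_000)

    # Use that rate to date an uncalibrated pair that differs at 2% of sites.
    print(divergence_time(0.02, rate))
```

In this made-up case the uncalibrated pair comes out at roughly one million years old. Supplying a mutation rate directly (the first method) simply skips the first function and goes straight to the second.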
All of these components are combined into various analytical frameworks or programs, each of which handles the data in different ways. Many of these are Bayesian model-based analyses, which in short generate hypothetical models of evolutionary history and divergence times for the phylogeny and test how well they fit the data provided (i.e. the phylogenetic tree). The algorithm then alters some aspect(s) of the model, tests whether this fits the data better than the previous model, and repeats this for potentially millions of simulations to get the best model. Although models are typically a simplification of reality, they are a much more tractable approach to estimating divergence times (as well as a number of other types of evolutionary genetics analyses which incorporate modelling).
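The ‘propose a change, check the fit, repeat’ loop can be sketched with a toy Metropolis sampler in Python. To be clear, this is not how any particular dating program works; it is a minimal illustration under assumptions I’ve chosen for simplicity: a single pair of sequences, a known and constant per-lineage mutation rate, a Poisson model for the number of differences, and a flat prior on the divergence time. Every name and number is hypothetical.

```python
import math
import random


def log_posterior(t, n_diffs, n_sites, rate, t_max):
    """Log posterior (up to a constant) of divergence time t in years."""
    if not 0.0 < t < t_max:
        return -math.inf  # outside the flat prior
    expected = 2.0 * rate * t * n_sites  # both lineages accumulate changes
    return n_diffs * math.log(expected) - expected  # Poisson log-likelihood


def sample_divergence_time(n_diffs, n_sites, rate, t_max=1e7,
                           n_iter=20000, burn_in=2000, step=5e4, seed=1):
    """Return (mean, lower, upper) of the sampled divergence times."""
    rng = random.Random(seed)
    current = t_max / 2.0  # arbitrary starting guess
    current_lp = log_posterior(current, n_diffs, n_sites, rate, t_max)
    samples = []
    for i in range(n_iter):
        # Alter the model a little...
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, n_diffs, n_sites, rate, t_max)
        # ...and keep the change if it fits better (or, sometimes, if worse).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    samples.sort()
    mean = sum(samples) / len(samples)
    lower = samples[int(0.025 * len(samples))]
    upper = samples[int(0.975 * len(samples))]
    return mean, lower, upper


if __name__ == "__main__":
    # Hypothetical data: 200 differences across 10,000 sites,
    # with a rate of 1e-8 substitutions per site per year per lineage.
    print(sample_divergence_time(200, 10_000, 1e-8))
```

The sampler reports a range of plausible ages rather than a single number, which is where the confidence intervals attached to divergence time estimates come from. Real analyses juggle the tree, the substitution model and the rates all at once rather than a single parameter.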
Despite the developments in the analytical basis of estimating divergence times in the last few decades, there are still a number of limitations inherent in the process. Many of these relate to the assumptions of the underlying model (such as the correct and accurate phylogenetic tree and the correct estimations of evolutionary rate) used to build the analysis and generate simulations. In the case of calibrations, it is also critical that they are correctly dated based on independent methods: inaccurate radiometric dating of a fossil, for example, could throw out all of the estimations in the entire tree. That said, these factors are intrinsic to any phylogenetic analysis and regularly considered by evolutionary biologists in the interpretations and discussions of results (such as by including confidence intervals of estimations to demonstrate accuracy).
Understanding the temporal aspects of evolution and being able to relate them to a real estimate of age is a difficult affair, but an important component of many evolutionary studies. Obtaining good estimates of the timing of divergence of populations and species through molecular dating is but one aspect in building the picture of the history of all organisms, including (and especially) humans.