An identity crisis: using genomics to determine species identities

This is the fourth (and final) part of the miniseries on the genetics and process of speciation. To start from Part One, click here.

In last week’s post, we looked at how we can use genetic tools to understand and study the process of speciation, and particularly the transition from populations to species along the speciation continuum. Following on from that, the question of “how many species do I have?” can be further examined using genetic data. Sometimes, it’s entirely necessary to look at this question using genetics (and genomics).

Cryptic species

A concept that I’ve mentioned briefly previously is that of ‘cryptic species’. These are species which are identifiable by their large genetic differences, but appear the same based on morphological, behavioural or ecological characteristics. Cryptic species often arise when a single species has become fragmented into several different populations which have been isolated from one another for a long time. Although they may diverge genetically, this doesn’t necessarily translate to changes in their morphology, ecology or behaviour, particularly if these are strongly selected for under similar environmental conditions. Thus, we need to use genetic methods to be able to detect and understand these species, as well as later classify and describe them.

Cryptic species fish
An example of cryptic species. All four fish in this figure are morphologically identical to one another, but they differ in their underlying genetic variation (indicated by the different colours of DNA). Thus, from looking at these fish alone we would not perceive any differences, but their genetic make-up might suggest that there are more than one species…
Cryptic species heatmap example
The level of genetic differentiation between the fish in the above example. The phylogenies on the left and top of the figure demonstrate the evolutionary relationships of these four fish. The matrix shows a heatmap of the level of differences between different pairwise comparisons of all four fish: red squares indicate zero genetic differences (such as when comparing a fish to itself; the middle diagonal) whilst yellow squares indicate increasingly higher levels of genetic differentiation (with bright yellow = all differences). By comparing the different fish together, we can see that Fish 1 and 2, and Fish 3 and 4, are relatively genetically similar to one another (red-deep orange). However, other comparisons show high levels of genetic difference (e.g. 1 vs 3 and 1 vs 4). Based on this information, we might suggest that Fish 1 and 2 belong to one cryptic species (A) and Fish 3 and 4 belong to a second cryptic species (B).
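To make the heatmap idea a little more concrete, here is a minimal sketch of how pairwise differences like these could be calculated. The four sequences are invented purely for illustration (they are not real fish data), and the measure used is the simplest one available: the proportion of aligned sites that differ between two sequences.

```python
from itertools import combinations

# Hypothetical, already-aligned sequences for the four fish (made up for this example).
sequences = {
    "Fish 1": "ATGGCACTAAGCCTTCTAAT",
    "Fish 2": "ATGGCACTAAGCCTTCTGAT",
    "Fish 3": "ATGACGCTTAGTCTCCTAAT",
    "Fish 4": "ATGACGCTTAGTCTCCTGAT",
}


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# One entry per pairwise comparison, like the off-diagonal squares of the heatmap.
distances = {
    (a, b): p_distance(sequences[a], sequences[b])
    for a, b in combinations(sequences, 2)
}

# Print the comparisons from most similar to most different.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a} vs {b}: {d:.2f}")
```

With these made-up sequences, the Fish 1 vs 2 and Fish 3 vs 4 comparisons come out far smaller than any comparison across the two pairs, which is the same pattern the heatmap shows.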

Genetic tools to study species: the ‘Barcode of Life’

A classically employed method that uses DNA to detect and determine species is referred to as the ‘Barcode of Life’. This uses a very specific fragment of DNA from the mitochondria of the cell: the cytochrome c oxidase I gene, CO1. The standard barcode region of this gene is 648 base pairs long and is found pretty well universally across animals: this, and the fact that CO1 tends to vary little within a species while still differing between species, make it an ideal candidate for easily testing the identity of new species. Additionally, mitochondrial DNA tends to be a bit more resilient than its nuclear counterpart; thus, small or degraded tissue samples can still be sequenced for CO1, making it amenable to wildlife forensics cases. Generally, two sequences will be considered as belonging to different species if they are a certain percentage different from one another.

Annotated mitogenome
The full (annotated) mitochondrial genome of humans, with the different genes within it labelled. The CO1 gene is labelled with the red arrow (sometimes also referred to as COX1) whilst blue arrows point to other genes often used in phylogenetic or taxonomic studies, depending on the group or species in question.

Despite the apparent benefits of CO1, there are of course a few drawbacks. Most of these revolve around the mitochondrial genome itself. Firstly, because mitochondria are passed on from mother to offspring (and not at all from the father), the mitochondrial genome reflects the genetic history of only one sex of the species. Secondly, the actual cut-off for species using CO1 barcoding is highly contentious and possibly not as universal as previously suggested. Levels of CO1 sequence divergence between species that have been previously determined to be separate (through other means) have varied anywhere from 2% to 12%. The actual relationship between CO1 sequence divergence and species identity is not all that clear.
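A quick sketch shows why the choice of cut-off matters so much. The rule below (call two sequences different species if their divergence meets or exceeds a cut-off) is a deliberately simplified stand-in for barcoding practice, and both the 5% divergence value and the cut-offs are hypothetical numbers chosen from within the range mentioned above.

```python
def classify(divergence_percent, cutoff_percent):
    """Label a pair of sequences under a simple divergence cut-off rule."""
    if divergence_percent >= cutoff_percent:
        return "different species"
    return "same species"


# Hypothetical pair of CO1 sequences that differ at 5% of sites.
observed_divergence = 5.0

# Try several hypothetical cut-offs on the same pair of sequences.
results = {
    cutoff: classify(observed_divergence, cutoff)
    for cutoff in (2.0, 5.0, 8.0, 12.0)
}

for cutoff, label in results.items():
    print(f"cut-off {cutoff}%: {label}")
```

The same pair of sequences is called two species under the lower cut-offs and one species under the higher ones, so the answer depends as much on the threshold as on the data.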

Gene tree – species tree incongruences

One particularly confounding aspect of defining species based on a single gene, and of using phylogeny-based methods, is that the history of that gene might not actually be reflective of the history of the species. This can be a little confusing to think about but essentially leads to what we call “gene tree – species tree incongruence”. Different evolutionary events cause different effects on the underlying genetic diversity of a species (or group of species): while these may be predictable from the genetic sequence, different parts of the genome might not be equally affected by the same exact process.

A classic example of this is hybridisation. If we have two initial species, which then hybridise with one another, we expect our resultant hybrids to be approximately made of 50% Species A DNA and 50% Species B DNA (if this is the first generation of hybrids formed; it gets a little more complicated further down the track). This means that, within the DNA sequence of the hybrid, 50% of it will reflect the history of Species A and the other 50% will reflect the history of Species B, which could differ dramatically. If we randomly sample a single gene in the hybrid, we will have no idea if that gene belongs to the genealogy of Species A or Species B, and thus we might make incorrect inferences about the history of the hybrid species.

Gene tree incongruence figure
A diagram of gene tree – species tree incongruence. Each individual coloured line represents a single gene as we trace it back through time; these are mostly bound within the limits of species divergences (the black borders). For many genes (such as the blue ones), the genes resemble the pattern of species divergences very well, albeit with some minor differences in how long ago the splits happened (at the top of the branches). However, the red genes contrast with this pattern, with clear movement across species (from and into B): this represents genes that have been transferred by hybridisation. The green line represents a gene affected by what we call incomplete lineage sorting; that is, we cannot trace it back far enough to determine exactly how/when it initially diverged and so there are still two separate green lines at the very top of the figure. You can think of each line as a separate phylogenetic tree, with the overarching species tree as the average pattern of all of the genes.

There are a number of other processes that could similarly alter our interpretations of evolutionary history based on analysing the genetic make-up of the species. The best way to handle this is simply to sample more genes: this way, the effect of variation of evolutionary history in individual genes is likely to be overpowered by the average over the entire gene pool. We interpret this as a set of individual gene trees contained within a species tree: although one gene might vary from another, the overall picture is clearer when considering all genes together.
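As a rough illustration of why sampling more genes helps, the toy simulation below draws gene trees for three species. It assumes (purely for the sake of the example) that each gene matches the true species tree 60% of the time and otherwise shows one of the two alternative topologies at random. Picking the most common topology is a simple stand-in for the much more sophisticated models that real species tree programs use, and there are known situations where the most common gene tree is not the species tree.

```python
import random
from collections import Counter

SPECIES_TREE = "((A,B),C)"
ALTERNATIVES = ["((A,C),B)", "((B,C),A)"]


def sample_gene_trees(n_genes, p_concordant, rng):
    """Draw one topology per gene: the species tree with probability
    p_concordant, otherwise one of the two alternatives at random."""
    trees = []
    for _ in range(n_genes):
        if rng.random() < p_concordant:
            trees.append(SPECIES_TREE)
        else:
            trees.append(rng.choice(ALTERNATIVES))
    return trees


def most_common_topology(trees):
    """Return the topology seen in the largest number of genes."""
    return Counter(trees).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the run is repeatable

# A single gene can easily point to the wrong tree...
single = sample_gene_trees(1, 0.6, rng)
# ...but the discordant genes get outvoted once we sample many.
many = sample_gene_trees(1000, 0.6, rng)

print("single gene says:", single[0])
print("1000 genes say:  ", most_common_topology(many))
print(Counter(many))
```

Any one gene has a 40% chance of misleading us in this set-up, but across a thousand genes the true topology is by far the most common.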

Species delimitation

In earlier posts on The G-CAT, I’ve discussed the biogeographical patterns unveiled by my Honours research. Another key component of that paper involved using statistical modelling to determine whether cryptic species were present within the pygmy perches. I didn’t exactly elaborate on that in that section (mostly for simplicity), but this type of analysis is referred to as ‘species delimitation’. To try and simplify complicated analyses, species delimitation methods evaluate possible numbers and combinations of species within a particular dataset and provide a statistical value for which configuration of species is most supported. One program that employs species delimitation is Bayesian Phylogenetics and Phylogeography (BPP): to do this, it uses a plethora of information from the genetics of the individuals within the dataset. This includes how long ago the different populations/species separated; which populations/species are most related to one another; and a pre-set minimum number of species (BPP will try to combine these in estimations, but not split them, due to computational constraints). This all sounds very complex (and to a degree it is), but this allows the program to give you a statistical value for what is a species and what isn’t based on the genetics and statistical modelling.

Vittata cryptic species
The cryptic species of pygmy perches identified within my research paper. This represents part of the main phylogenetic tree result, with the estimates of divergence times from other analyses included. The pictures indicate the morphology of the different ‘species’: Nannoperca pygmaea is morphologically different to the other species of Nannoperca vittata. Species delimitation analysis suggested all four of these were genetically independent species; at the very least, it is clear that there must be at least 2 species of Nannoperca vittata since one is more closely related to N. pygmaea than to the other N. vittata species. Photo credits: N. vittata = Chris Lamin; N. pygmaea = David Morgan.

The end result of a BPP run is usually reported as a species tree (i.e. a phylogenetic tree describing species relationships) and statistical support for the delimitation of species (0-1 for each species). Because of the way the statistical component of BPP works, it has been found to give extremely high support for species identities. This has been criticised as BPP can, at times, provide high statistical support for genetically isolated lineages (i.e. divergent populations) which are not actually species.

Improving species identities with integrative taxonomy

Due to this particular drawback, and the often complex nature of species identity, using solely genetic information such as species delimitation to define species is extremely rare. Instead, we use a combination of different analytical techniques, which can include genetic-based evaluations, to more robustly assign and describe species. In my own paper example, we suggested that up to three ‘species’ of N. vittata that were determined as cryptic species by BPP could potentially exist, pending further analyses. We did not describe or name any of the species, as this would require a deeper delve into the exact nature and identity of these species.

As genetic data and analytical techniques improve into the future, it seems likely that our ability to detect and determine species boundaries will also improve. However, the additional support provided by alternative aspects such as ecology, behaviour and morphology will undoubtedly be useful in the progress of taxonomy.

Using the ‘blueprint of life’: an introduction to DNA markers

What is a ‘molecular marker’?

As we’ve previously discussed within The G-CAT, information from the DNA of organisms can be used in a variety of ways to study evolution and ecology, inform conservation management, and understand the diversity of life on Earth. We’ve also had a look at the general background of the DNA itself, and some of the different parts of the genome. What we haven’t discussed yet is how we use the DNA sequence in these studies; most importantly, which part of the genome to use.

The genome of most organisms is massive. The size of the genome varies depending on the organism, with one of the smallest recorded genomes belonging to a bacterium (Carsonella ruddii), consisting of 160,000 bases. There is a bit of debate about the largest recorded genome, but one contender (the ‘canopy plant’, Paris japonica) has a genome stretching 150 billion base pairs long! The human genome sits in the middle at around 3 billion bases long. Naturally, it would be incredibly difficult to obtain the sequence of the whole genome of many organisms (particularly 20 – 30 years ago, due to technological limitations in the sequencing process), so we usually pick a specific region of the genome instead. The exact region (or type of region) we use is referred to as a ‘molecular marker’.

How do we choose a good marker?

The marker we pick is incredibly important: this is often based on how much variation we need to observe across groups. For example, if we want to study differences between individuals, say in a pedigree analysis, we need to pick a section of the DNA that will show differences between individuals; it will need to mutate fairly rapidly to be useful. If it mutates too slowly, all individuals will look identical genetically and we won’t have learnt anything new at all.

On the flipside, if we want to study evolution at a larger scale (say, between species, or groups of species) we would need to use a marker that evolves much more slowly. Using a rapidly mutating section of DNA would effectively give a tonne of ‘white noise’; it’d be impossible to pick what is a genetic difference at the species level (i.e. one species is different to another at that base) vs. at the individual level (i.e. one or many individuals within the species are different). Thus, we tend to use much slower mutating markers for deeper evolutionary history.

Evol spectrum
The spectrum of evolutionary history, with evolutionary splits between major animal groups on the left, to splits between species in the middle, to splits between individuals within a family tree on the right. The effectiveness of a marker for a particular part of the spectrum depends on its mutation rate. The original figure was taken from a landmark paper by Avise (1994), considered one of the forefathers of molecular ecology.

Think of it like comparing cats and dogs. If we wanted to compare different cats to one another (say different breeds) we could use hair length or coat colour as a useful trait. Since some breeds have different coat characteristics, and these don’t vary as much within the breed as across breeds, we can easily tell a long haired cat from a short haired cat. However, if we tried to use coat colour and length to compare cats and dogs we’d be stumped, because both species have lots of variation in these traits within their species. Some cats have coat length more similar to some dogs than to other cats, for example; so they’re not good characteristics to separate the two animal species (we might use muzzle shape, or body shape instead). If we substitute each of these traits with a particular marker, then we can see that some markers are better for some comparisons but not good for others.


The most traditional molecular markers are referred to as ‘allozymes’; instead of comparing actual genetic sequences (something that was not readily possible early in the field), variations in the shape of proteins (i.e. the amino acids of the protein, not the code underlying it) were compared between species. Changes in proteins occur very rarely as natural selection tends to push against randomly changing protein structure, since the shape of a protein is critical to its function. Because of this, allozymes were only really effective for studying very broad comparisons (mainly across species or species groups); the exact protein used depends on the study organism. Allozymes are generally considered outdated in the field nowadays.

With the development of technologies that allowed us to actually determine the DNA code of genes, molecular ecology moved into comparing actual sequences across individuals. However, early sequencing technology could generally only accurately determine small sections of DNA at a time, so particular markers capitalising on this were developed. Many of these are still used due to their cost-effectiveness and general ease of analysis.


For comparing closely related individuals (within a pedigree, or a population), markers called ‘microsatellites’ are widely used. These are small sections of the genome which have repetitive DNA codes; usually, the same two or three base pairs (one ‘motif’) are repeated a number of times in a row (the ‘repeat number’). While the motifs themselves rarely get mutations, the number of repeated motifs mutates very rapidly. This is because the protein that copies DNA is not perfect, and often ‘slips up’, adding or cutting off a repeat from the microsatellite sequence. Thus, differences in the repeat number of microsatellites accumulate pretty quickly, to the point where you can determine the parents of an individual with them.
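To show what ‘repeat number’ means in practice, here is a small sketch that counts how many times a motif is repeated back-to-back within a sequence. The two alleles and their flanking bases are invented for illustration; real microsatellite genotyping is usually done by measuring fragment lengths rather than by reading sequences like this.

```python
import re


def repeat_number(sequence, motif):
    """Largest number of back-to-back copies of motif found in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles: the same flanking bases around a different number of 'CA' repeats.
allele_1 = "GGT" + "CA" * 6 + "TTG"
allele_2 = "GGT" + "CA" * 8 + "TTG"

print(repeat_number(allele_1, "CA"))
print(repeat_number(allele_2, "CA"))
```

The two alleles share the same motif and flanking sequence and differ only in their repeat number (6 vs 8), which is exactly the kind of difference that microsatellite studies score.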

The general (and simplified) structure of a microsatellite marker. 

Microsatellites are often used in comparisons across closely related individuals, such as within pedigrees or within populations. While they are relatively easy to obtain, one drawback is that you need to have some understanding of the exact microsatellite you wish to analyse before you start; you need to make a specific ‘primer’ sequence to be able to get the right marker, as some may not be informative in particular species or comparisons. Many researchers choose to use 10-20 different microsatellite markers together in these types of studies, such as in human parentage analyses.

Microsatellites are useful for parentage analysis. Our previous guest contestants are here to discuss ‘Who is the father?!’ in Maury-like fashion. The results are in, and using 4 microsatellites (1-4) and looking at the number of repeats in each of those, we can see that contestant 2 is undoubtedly the father! I’ll be honest, I have no idea if this is how Maury works, but I think it would work.
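The logic behind this kind of paternity test can be sketched in a few lines. Every individual carries two alleles (two repeat numbers) at each microsatellite, and a child must have received one from each parent. The genotypes below are entirely made up, and the check ignores mutation and genotyping error, which real parentage software accounts for with likelihood-based methods.

```python
def compatible(child, mother, candidate):
    """True if, at every locus, the child could have inherited one allele
    from the mother and the other allele from the candidate father."""
    for locus, (c1, c2) in child.items():
        m = mother[locus]
        f = candidate[locus]
        if not ((c1 in m and c2 in f) or (c2 in m and c1 in f)):
            return False
    return True


# Hypothetical repeat numbers at four microsatellite loci (1-4).
child = {1: (10, 12), 2: (7, 9), 3: (15, 15), 4: (20, 22)}
mother = {1: (10, 11), 2: (7, 7), 3: (15, 16), 4: (20, 21)}
contestants = {
    "Contestant 1": {1: (12, 13), 2: (8, 8), 3: (15, 17), 4: (22, 23)},
    "Contestant 2": {1: (12, 14), 2: (9, 10), 3: (14, 15), 4: (22, 24)},
    "Contestant 3": {1: (11, 13), 2: (9, 9), 3: (16, 16), 4: (22, 22)},
}

# Keep only the contestants who are not excluded at any locus.
possible_fathers = [
    name
    for name, genotype in contestants.items()
    if compatible(child, mother, genotype)
]
print(possible_fathers)
```

In this invented data set only Contestant 2 can supply the child’s non-maternal allele at all four loci, so the other two are excluded. The more loci you add, the less likely it is that an unrelated individual matches by chance.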

Mitochondrial DNA

For deeper comparisons, however, microsatellites mutate far too rapidly to be effective. Instead, we can choose to use the DNA of the mitochondria. You may remember the mitochondria as ‘the powerhouse of the cell’; while this is true, they also have a lot of other unique properties. The mitochondrion was actually (a very, very, very long time ago) a separate bacteria-like organism which became symbiotically embedded within another cell. Because of this, and despite a couple billion years of evolution since that time, the mitochondrion actually has its own genome separate to that of the ‘host’ (like the standard human genome). The full mitochondrial genome consists of around 37 different genes, most of which don’t code for any proteins involved directly in evolution; as such, natural selection doesn’t affect them as much as other genes. The most commonly used mitochondrial genes are the cytochrome b gene (cytb for short) and the cytochrome c oxidase 1 (CO1) gene.

The mitochondrial genome evolves relatively rapidly (but not nearly as fast as microsatellites) and is found in pretty much every plant and animal on the planet. Because of these traits, it’s often used as a way of diagnosing species through the ‘Barcode of Life’ project (using cytb and CO1). It’s very widely used within species-level studies, to the point where we can even use the relatively consistent mutation rate of the mitochondrial genome to estimate how long ago different species separated in evolution.
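The ‘molecular clock’ idea in that last sentence boils down to some simple arithmetic. The sketch below assumes a strictly constant rate and uses a made-up rate of 0.01 substitutions per site per million years along each lineage; actual rates differ between genes and groups of organisms and need to be calibrated (e.g. with fossils or dated geological events).

```python
def divergence_time_myr(distance, rate_per_lineage_per_myr):
    """Estimate the time since two lineages split, in millions of years.

    distance: proportion of sites that differ between the two sequences.
    rate_per_lineage_per_myr: substitutions per site per million years
        along a single lineage (a hypothetical value in this example).

    The factor of 2 is there because both lineages have been
    accumulating changes independently since the split.
    """
    return distance / (2 * rate_per_lineage_per_myr)


# Two hypothetical species whose sequences differ at 4% of sites.
estimate = divergence_time_myr(0.04, 0.01)
print(f"Estimated split: about {estimate:.1f} million years ago")
```

Under those assumptions, a 4% difference between two species would put their split at roughly 2 million years ago.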

Not entirely how the Barcode of Life works, but close enough, right?

Other markers?

There are plenty of other genetic markers that are used within molecular ecology, with some focusing on only the exons or introns of genes, or other repetitive sequences. However, microsatellites and mitochondrial genes are among the most widely used in evolution and conservation studies.

While these markers have been very useful in building the foundations of molecular ecology as a scientific field, developments in sequencing technology, analytical methods and evolutionary theory have pushed our ability to use DNA to understand evolution and conservation even further, particularly the development of sequencing machines which can process much larger amounts of DNA. This has pushed genetics into the age of ‘genomics’: while this sounds like a massively technical difference, it’s really just about the difference in the size of the data we can use. Obviously, this has many other benefits for the kinds of questions we can ask about evolution, conservation and ecology.

Genomics has massively expanded in recent years, and the types, quantity and quality of data are diverse. Stay tuned because next week, we’ll start to delve into the modern world of genomics!