A simplified guide to genomic vulnerability

Predicting the future for biodiversity

Conservation biology is frequently referred to as a “crisis discipline“, a status which doesn’t appear to be changing any time soon. Like any response to a crisis, biologists of all walks of life operate under a prioritisation scheme – how can our finite resources be best utilised to save as much biodiversity as possible? This approach requires some knowledge of both current vulnerability and future threat – we need to focus our efforts on those populations and species which are most at-risk of extinction in the near (often immediate) future.

Genomic vulnerability

From an evolutionary perspective, this approach involves understanding which populations will be face the greatest adaptive pressure in the future. This can be thought of as a product of two interacting components – one, the current adaptive potential of the given population, and two, the future environmental change the population is expected to experience. Trying to tie these aspects into a single quantifiable trait is the basis of “genomic vulnerability”, a term which was first popularised by Bay et al. 2018 in their landmark paper on the American yellow warbler in Science. Understandably, the paper generated significant attention from the scientific community which saw the broad applicability of the method. Since then, a number of other papers have applied and expanded upon the method to include additional factors (such as land-use changes and migration ability) or using other biological systems.

A generalised flowchart of the process of calculating genomic vulnerability scores. We’ll go over this step by step below, but it essentially involves combining genetic and environmental data to predict how the frequency of adaptive alleles will change under future climatic conditions (and how much strain that will put on the populations in the system).

For this post, we’re going to take a brief foray into the process of calculating genomic vulnerability scores. In essence, this involves identifying loci (SNPs) which are associated with environmental variation, and then projecting these onto future climate conditions. Genomic vulnerability can be summarised as the amount of adaptive genetic change in a population required to keep up with climate change. This summary has been significantly simplified (particularly statistically) for clarity – just bear in mind that there are often multiple other pre-processing and validation steps to account for confounding factors and provide more robust inferences.

The Data

Genomic vulnerability requires two main initial inputs – genetic data, in the form of allele frequencies per population, and environmental data where each population occurs. For this post, I used a combination of simulated genetic data and real (empirical) environmental data for the following analyses.

Simulated genetic data

I simulated genetic data for ten populations of a species (I’ll let your own taxonomic biases dictate what you’d like to pretend the species is), each consisting of ten individuals (so 100 total). Using the R package scrime I randomly generate genotypes for 5,000 loci (SNPs) from a normal distribution. That is, for a given SNP each individual was assigned one of two possible alleles (no invariant sites or triallelic loci). Because of this sampling strategy, we can effectively rule out population structure as a confounding factor, but it’s important to remember that the data are not entirely realistic for this reason. Ordinarily, there would be a number of data filtering steps and prior knowledge about the study system to make sure the genetic data is robust.

From this genotypes, I then calculated the frequency of these alleles per population using the dartR package, which acts as the input for the genotype-environment analysis below.

Frequency distributions (histograms) of 5,000 alleles across each of ten populations. As you can see, allele frequencies were generated using a random normal distribution (which is not often reflected in real populations).

Empirical environmental data

For the environmental data, I chose four different environment variables from across southeast Australia, obtained from the CHELSA climatic database. These come from a larger global database of 19 bioclimatic variables related to temperature and precipitation, which are widely used to estimate species distributions and summarise environmental characteristics driving selection. I chose to use annual mean temperature (Bio1), isothermality (Bio3), total annual precipitation (Bio12) and the seasonality of precipitation (Bio15). I decided on this particular region and these particular variables for a few different reasons:

  1. I already had the data available from my own PhD work, saving me some processing time.
  2. From that work, I already know that these variables are not strongly correlated (R > 0.8) with one another, an issue which could impact on genotype-environment associations below.
  3. The region has notable climatic variation, especially in terms of temperature and climate, which makes it ideal for this approach (especially considering the genetic data is randomly generated and thus not likely to innately show differences between populations).
  4. It’s relatively easy to obtain future projections of the same variables for the same region from the CHELSA database (this is an important factor further down).
Maps demonstrating the bioclimatic data used here. On each map, points and numbers indicate the location of the populations. All of the environmental data was extracted from the CHELSA database. A: Bio1- annual mean temperature. B: Bio3 – isothermality, calculated as the mean diurnal range / annual temperature range (i.e. how much the temperature fluctuates over the day relative to seasonal changes). C: Bio12 – total annual precipitation. D: Bio15 – precipitation seasonality (i.e. standard deviation of monthly precipitation as percentage of annual mean precipitation).

For each of the populations created above, I randomly generated coordinates for their position in southeast Australia (on land only, of course) using the sampleRandom function of the raster R package. This serves to randomly associate the simulated genetic variation with local environmental conditions (without my own bias of manually choosing where to place the populations). Overall, however, they appear to be nicely spread throughout the region.

Genotype-environment association (GEA)

The next major step in the pipeline involves combining these two sources of data to identify genetic variants associated with environment variation – this falls under the scope of “genotype-environment association” (or GEA for short). In essence, GEA approaches seek to detect alleles whose frequency is statistically correlated with environmental variation across the populations. There are a number of statistical approaches to detect this associations, accounting for various confounding factors or types of associations (e.g. linear, exponential, multivariate). One commonly used statistical approach is redundancy analysis (RDA), which is particularly effective at detecting relationships involving multiple alleles and environmental factors simultaneously. RDAs allow for the regression of multiple response variables (allele frequencies) on multiple explanatory variables (environmental variables) simultaneously, seeking to identify which SNPs have allele frequencies which can be linearly explained by components of the environment.

I used an RDA approach based on the AlleleShift R package to identify which of the 5,000 loci showed significant associations with the four environmental variables. I also used the RDA approach described by my colleague Dr. Chris Brauer here, which obtained similar results. I focused on the AlleleShift results as it was much simpler to translate this RDA for future climatic conditions (see below).

The allele frequencies of most populations were well captured by the RDA model, with strong correlation between RDA-predicted and real allele frequencies (R2 > 0.93), expect for Population 1 (R2 = 0.22, mostly driven by one outlier SNP). In total, 72 of the 5,000 loci (1.44%) showed statistically significant associations – however, most of these were only weakly correlated with one of the environmental variables (see histograms below). As is commonly done in GEA approaches, I focus on only those loci with “decent” (R > 0.3) association with at least one environmental variable – this reduced the number to 10 putatively adaptive loci.

Linear relationships of 10 environmentally-associated alleles and their most related environmental variable. Since RDA is a multivariate approach, it’s important to remember that any given SNP may be correlated with more than one environmental variable. In the right column, histograms show how strongly correlated the 72 associated SNPs are with their most relevant variable – only those on the right side of the dark vertical line (>0.3) are plotted in the left-side (and are the focus for the remaining analyses). On the left, points indicate the position of the different populations, with the solid line and highlighted region showing the linear relationship of the environment and allele frequency (shaded area = 95% confidence interval). Each line/shading and colour indicates the relationship of a single SNP with a single environmental variable – for example, only one SNP is significantly associated with Bio1 but two are associated with Bio3 in opposing directions (light blue positively, dark blue negatively).

Within the most strongly associated SNPs, it appears that many of them are associated with precipitation seasonality (Bio15), although this might be due to the fact that most populations cluster at either low or high values (i.e. the full range of seasonality is not well sampled across the populations).

The future environment

Next, we need to find a projected environment which can be used to translate these RDA relationships and predict allele frequency changes. Given the unpredictability in future climate change, largely linked to how effectively we control global emissions, we tend to model future climate based on a range of projected scenarios. This may also include using different global circulation models (GCMs), although for simplicity here I use only the BCC-CSM2-MR from the CMIP6 project. This particular model is frequently used in similar types of studies.

These future scenarios take the form of “shared socio-economic pathways” (SSPs) or “representative concentration pathways” (RCPs), which seek to model possible future climatic conditions depending on how well global carbon emissions are controlled. These pathways range from (relatively) mild to severe, and have their own timelines of global climatic change. To represent the range of potential future climates, I obtained the same four environmental variables as above projected under relatively mild (RCP2.6), moderate (RCP4.5) and severe (RCP8.5) climate change using data freely available in the CHELSA database.

The relative difference in each of the environmental variables across the region under the three chosen RCPs. Histograms represent the distribution of relative changes in environmental values (i.e. how much a given pixel in the environmental raster will increase/decrease), with the black vertical line indicating the number of pixels that will not change. For example, nearly all pixels will increase in mean annual temperature, but there is a mixture of increases and decreases in precipitation seasonality.

As one might expect, all populations across the region are expected to experience higher mean annual temperatures and lower isothermality (below). However, the other two variables do not change predictably across all populations, with most not showing a huge change in annual precipitation but with a mixture of increasing and decreasing seasonality.

Projections of the future environmental changes populations are expected to experience under a severe climate change scenario (RCP8.5). As you can see, all populations are predicted to experience higher annual temperatures (Bio1) and lower isothermality (Bio3), but the changes in precipitation seasonality are inconsistent across populations (Bio15).

Future allele frequency changes

Next, we use the combination of current allele frequencies per population, the projected future environmental conditions each population is expected to experience, and the relationship (RDA) of the two to estimate how allele frequencies for these adaptive loci are likely to shift over time. This assumes that the relationship of allele frequencies remains constant over time (which it may not), that adaptation to the changing environment is genetic (it may not be), and focuses only on our subset of loci (other important loci might not have been detected, or important environmental factors may not have been included).

This projection is repeated for each locus (of the 10 identified above), per population, and under the three different RCP scenarios. The simplest way of demonstrating this projection would be to assume a total linear relationship of only one locus and one environmental variable at a time, and use the slope of that relationship to project future allele frequencies (as done below). However, the more statistically robust way to do this is to use the RDA model itself for the projection (remembering that this accounts for multivariate relationships). To do this, I predicted future allele frequencies using AlleleShift (count.pred and freq.pred functions) and the established RDA.

A very simplified example of how future allele frequencies are projected (here, for RCP8.5 only). This example demonstrates the above GEA relationships for one example SNP per environmental variable (for visual clarity). Arrows indicate how allele frequencies could be predicted to change for each population – movement along the x-axis is calculated by comparing the current and future environment for each population and the movement along the y-axis is calculated based on the slope of the linear GEA relationship (hence why the arrows are parallel to the dashed lines). In reality, calculating future allele frequencies is done based on the RDA model itself (which accounts for multivariate associations and doesn’t necessarily assume a linear relationship).

Genomic vulnerability scores

The last step is relatively simple compared to the previous one – all we need to obtain a ‘genomic vulnerability score’ is to find a way to summarise all of these predicted allele frequency changes per population. This can be done as simply as an average across all adaptive loci (in this case, all 10 GEA loci) or the sum of all adaptive loci. You could attribute additional penalties here as well – for example, if a population is expected to have an allele frequency change from 0 to 0.5 under climate change, this would require a new variant to be introduced (by mutation or migration). This presents its own challenges, and is more difficult than a population that has to change from 0.2 to 0.7 (same distance), so you might choose to add a value associated with this. For here though, I’ve chosen to just summarise genomic vulnerability as the total sum of projected allele frequency changes.

A heatmap of the genomic vulnerability scores (summarised here as the sum of predicted GEA allele frequency changes) per population under each of the three emissions scenarios. As you can see, Population 2 is consistently more vulnerable than any of the others, possibly driven by large changes in local isothermality (Bio3 – see above). However, Populations 1 and 10 appear (relatively) safe under future climate conditions.

Closing statement

Genomic vulnerability scores are a convenient measure of predicting the susceptibility of populations to environmental change, and may make significant strides towards helping identify the most at-risk populations of a species. However, the method is relatively simple and is regularly being expanded upon to include additional sources of information. Over time, I suspect genomic vulnerability (in one form or another) will become a staple in conservation genomics research for a diverse array of threatened species: hopefully, this post has been an informative introduction for those seeking to explore the method.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s