Paleobuddy overview

Bruno do Rosario Petrucci

paleobuddy is an R package for species birth-death simulation, complete with the possibility of generating phylogenetic trees and fossil records from the results. The package offers unprecedented flexibility in the choice of speciation, extinction, and fossil sampling rates, as we will showcase in this vignette.

One of the biggest reason to write and publish a simulator is to effectively test rate estimation methods with scenarios whose true dynamics are known—others might include model adequacy tests and the study of scenarios without an analytical solution. This leads to an intuitive overview of the package: in this vignette, we will first generate a couple of useful scenarios to show use cases of the package. Then, we will discuss the choice of rates further, detailing the customization capabilities of both the birth-death and sampling functions. Finally, we will conclude going over the shortcomings of the package, including the features we plan to implement in the future. As new versions are released, this document will be updated to reflect the newest features.

First we do some setup.

# importing the package functions
library(paleobuddy)

Constant rate birth-death

Let us try the simplest possible birth-death scenario - constant speciation and extinction rates.

bd.sim is the birth-death simulator in the package. We use it to generate a group.

# we set a seed so the results are reproducible
set.seed(1)

# set the necessary parameters
# initial number of species
n0 <- 1

# speciation rate - approx. 1 speciation event every 4my
# we are trying to create a big phylogeny so phytools can function better
lambda <- 0.25

# extinction rate - approx. 1 extinction event every 10my
mu <- 0.15

# maximum simulation time - species that die after this are considered extant
tMax <- 50

# run the simulation
sim <- bd.sim(n0, lambda, mu, tMax)

# take a look at the way the result is organized
sim
## 
## Birth-death simulation object with 51 species and 11 extant species
## 
## Details for some species:
## 
## Extinction times (NA means extant)
## [1] 42.12238 43.38139 45.41614 40.75707 14.60010 37.18177
## 
## 
## Speciation times 
## [1] 50.00000 46.97927 46.39645 45.83726 44.09299 39.14258
## 
## 
## Species parents (NA for initial)
## [1] NA  1  1  1  1  5
## 
## 
## Species status (extinct or extant)
## [1] "extinct" "extinct" "extinct" "extinct" "extinct" "extinct"
## 
## 
## For more details on vector y, try sim$y, with y one of
## TE TS PAR EXTANT

The output of bd.sim is a sim object, a class made up of named vectors that is organized as follows

We can use the function draw.sim to visualize the longevity of species in this simulation.

# draw simulation
draw.sim(sim, showLabel = FALSE)

Species are drawn in order of speciation time by default, though that can be altered. We omit species labels since for a high number of species that can get unruly.

Using this sim object, we can generate a phylogenetic tree using make.phylo.

# there are currently not many customization options for phylogenies
phy <- make.phylo(sim)

# plot it with APE - hide tip labels since there are a lot so it looks cluttered
ape::plot.phylo(phy, show.tip.label = FALSE)
ape::axisPhylo()

# plot the molecular phylogeny
ape::plot.phylo(ape::drop.fossil(phy), show.tip.label = FALSE)
ape::axisPhylo()

From here, we could run this phylogeny through a number of inference software in the field, so as to test their accuracy and robustness. Of course this would require more trees, and larger trees, to control for stochasticity.

For illustration, we create a simulation with more than 500 species, with 253 extant species.

# set a seed
set.seed(3)

# create simulation
# note nExtant, defining we want 200 or more extant species at the end
sim <- bd.sim(n0, lambda, mu, tMax, nExtant = c(200, Inf))

# check the number of extant species
paste0("Number of species alive at the end of the simulation: ", 
       sum(sim$EXTANT))
## [1] "Number of species alive at the end of the simulation: 253"

And we could of course get a molecular phylogeny from it.

# might look a bit cluttered
ape::plot.phylo(ape::drop.fossil(make.phylo(sim)), show.tip.label = FALSE)
ape::axisPhylo()

Time-dependent speciation and extinction

One of the pluses of paleobuddy is that we can generate both fossil records and phylogenies in independent processes, both coming from the same underlying birth-death simulations. We will here use the fossil record generating functions of paleobuddy to generate a fossil record and prepare the output for use with PyRate (Silvestro et al 2014) and Foote’s Per Capita method (Foote 2000), as an example of a workflow using paleobuddy to test inference methods.

As before, start with a simulation

# we set a seed so the results are reproducible
set.seed(5)

# set the necessary parameters
# initial number of species
n0 <- 1

# speciation rate - it can be any function of time!
lambda <- function(t) {
  0.1 + 0.001*t
}

# extinction rate - also can be any function of time
mu <- function(t) {
  0.03 * exp(-0.01*t)
}

# maximum simulation time - species that die after this are considered extant
tMax <- 50

# run the simulation
sim <- bd.sim(n0, lambda, mu, tMax)

# check the resulting clade out
ape::plot.phylo(make.phylo(sim), show.tip.label = FALSE)
ape::axisPhylo()

A lot of species! We can then create a fossil record from this group.

# again set a seed
set.seed(1)

# set the sampling rate
# using a simple case - there will be on average T occurrences,
# per species, where T is the species duration
rho <- 1

# bins - used to represent the uncertainty in fossil occurrence times
bins <- seq(tMax, 0, -1)
# this is a simple 1my bin vector, but one could use the GSA timescale
# or something random, etc

# run the sampling simulation only for the first 10 species for brevity's sake
# returnAll = TRUE makes it so the occurrences are returned as binned as well
# (e.g. an occurrence at time 42.34 is returned as between 42 and 41)
fossils <- suppressMessages(sample.clade(sim = sim, rho = rho, 
                                      tMax = tMax, S = 1:10, 
                                      bins = bins, returnAll = TRUE))
# suppressing messages - the message is to inform the user how
# many speciesleft no fossils. In this case, it was 0

# take a look at how the output is organized
head(fossils)
##   Species Extant    SampT MaxT MinT
## 1      t1  FALSE 49.24482   50   49
## 2      t1  FALSE 48.06318   49   48
## 3      t1  FALSE 47.91747   48   47
## 4      t1  FALSE 47.77767   48   47
## 5      t1  FALSE 47.34160   48   47
## 6      t1  FALSE 44.44664   45   44

The output of sample.clade is a data frame organized as follows

We can visualize the fossil record of this group using draw.sim as well.

# take only first 5 species with head
simHead <- head(sim, 10)

# draw longevities with fossil time points
suppressMessages(draw.sim(simHead, fossils = fossils))

We can also visualize the fossil ranges if we take away the SampT column, which is used by default if the data frame has it.

# draw longevities with fossil time ranges
suppressMessages(draw.sim(simHead, fossils = fossils[, -3]))