Abstract

A scalable and fast method for estimating joint Species Distribution Models (jSDMs) for big community data, including eDNA data. The package estimates a full (i.e. non-latent) jSDM with different response functions (including the traditional multivariate probit model). As described in Pichler & Hartig (2021) doi:10.1111/2041-210X.13687, scalability is achieved by using a Monte Carlo approximation of the joint likelihood implemented via ‘PyTorch’ and ‘reticulate’, which can be run on CPUs or GPUs. The package includes support for different response families, the ability to account for spatial autocorrelation, and the option to fit responses via deep neural networks instead of a standard linear predictor. In addition, you can perform variation partitioning (VP) / ANOVA on the fitted models for environmental, spatial, and co-association model components, and further subdivide these three components per species and sites (internal metacommunity structure, see Leibold et al., doi:10.1111/oik.08618).

- Getting started
- Working with sjSDM
- Advanced topics
- Installation problems

The sjSDM package provides R functions for estimating so-called joint species distribution models.

A jSDM is a GLMM that models a multivariate (i.e. a many-species) response to the environment, space and a covariance term that models conditional (on the other terms) correlations between the outputs (i.e. species). In other research domains, the covariance component is also known as the multivariate probit model.

A big challenge in jSDM implementation is computational speed. The goal of the sjSDM (which stands for “scalable joint species distribution models”) is to make jSDM computations fast and scalable. Unlike many other packages, which use a latent-variable approximation to make estimating jSDMs faster, sjSDM fits a full covariance matrix in the likelihood, which is, however, numerically approximated via simulations. The method is described in Pichler & Hartig (2021) A new joint species distribution model for faster and more accurate inference of species associations from big community data, https://www.doi.org/10.1111/2041-210X.13687.

The core code of sjSDM is implemented in Python / PyTorch, which is then wrapped into an R package. In principle, you can also use it stand-alone under Python (see instructions below). Note: for both the R and the python package, python >= 3.9 and pytorch must be installed (more details below). However, for most users, it will be more convenient to use sjSDM via the sjSDM R package, which also provides a large number of downstream functionalities.

To get citation info for sjSDM when you use it for your reseach, type

```
citation("sjSDM")
#> To cite sjSDM in publications use:
#>
#> Pichler, M. and Hartig, F. (2021), A new joint species distribution model for faster and more accurate inference of species associations from big
#> community data. Methods in Ecology and Evolution. Accepted Author Manuscript. https://doi.org/10.1111/2041-210X.13687
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {A new joint species distribution model for faster and more accurate inference of species associations from big community data},
#> author = {Maximilian Pichler and Florian Hartig},
#> journal = {Methods in Ecology and Evolution},
#> year = {2021},
#> doi = {10.1111/2041-210X.13687},
#> }
```

sjSDM is distributed via CRAN. For most users, it will be best to install the package from CRAN

Depencies for the package can be installed before or after installing the package. Detailed explanations of the can be found below. Very briefly, the dependencies can be automatically installed from within R:

For advanced users: if you want to install the current (development) version from this repository, run

`devtools::install_github("https://github.com/TheoreticalEcology/s-jSDM", subdir = "sjSDM", ref = "master")`

dependencies should be installed as above. If the installation fails,
check out the help of `?install_sjSDM`

, ?installation_help,
and `vignette("Dependencies", package = "sjSDM")`

.

- Try
`install_sjSDM()`

- New session, if no ‘PyTorch not found’ appears it should work,
otherwise see
`?installation_help`

- If do not get the pkg to run, create an issue issue tracker or write an email to maximilian.pichler at ur.de

We start with a dataset about eucalypt communities (the dataset is from Pollock et a., 2014)

```
library(sjSDM)
set.seed(42)
Env = eucalypts$env # environment
PA = eucalypts$PA # presence absence
Coords = eucalypts$lat_lon # coordinates
```

Prepare data:

- scale numerical predictors/covariates

```
Env$Rockiness = scale(Env$Rockiness)
Env$PPTann = scale(Env$PPTann)
Env$cvTemp = scale(Env$cvTemp)
Env$T0 = scale(Env$T0)
Coords = scale(Coords)
```

The model is fit by the function `sjSDM()`

. You have to
provide all predictors and response as matrices or data frames. Here, we
specify all three predictor components of an SDM:

- An environmental component, which essentially fits an SDM for each species. The “.” tells the model to use all predictors that are not otherwise used
- A spatial model account for spatial autocorrelation and spatial trends/effects (trend surface model)
- The covariance component is automatically added. If you don’t want
this, you can set
`biotic = bioticStruct(diag = TRUE)`

, which would result in a standard MSDM.

```
model <- sjSDM(Y = PA,
env = linear(data = Env, formula = ~.),
spatial = linear(data = Coords, formula = ~0+latitude*longitude),
family=binomial("probit"),
se = TRUE,
verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

The “linear” expression used in the environmental and spatial component means that you essentially fit a linear regression structure for the respective component. Alternatively, you could also fit a neural network (see later).

The option `se = TRUE`

tells the model to calculate CIs
and p-values. Note that these are approximated based on Wald CIs, which
may have a certain amount of error for small data and if regularization
is applied.

To get an overall summary of the model, type

```
summary(model)
#> Family: binomial
#>
#> LogLik: -242.1242
#> Regularization loss: 0
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

This output will show you the fitted environmental, spatial and covariance parameters (provided that those components were specified). Implemented S3 functions for a model object include

The environmental effects are displayed in the summary() table. To get a visual plot, you can use

The spatial effects are displayed in the `summary()`

table. Currently, there are not additional options implemented to
visualize the spatial effects

The species-species associations are displayed in the
`summary()`

table. You can extract and visualize them as
follows:

```
association = getCor(model)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

Note that these associations are to be interpreted as correlations in species presence or abundance after accounting for environmental and spatial effects.

ANOVA or variation partitioning means that we want to calculate how much variation (=signal) is explained by the different model components.

The sjSDM package includes several functions for this purpose.

The first option is a global ANOVA, which will calculate the amount of variation explained by the three components (environment, associations, and space):

```
an = anova(model, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

The resulting object is of class sjSDManova, and you can use the standard summary and plot function on this object:

```
summary(an)
#> Analysis of Deviance Table
#>
#> Deviance Residual deviance R2 Nagelkerke R2 McFadden
#> Abiotic 673.26304 3002.96075 0.77008 0.1654
#> Associations 169.13782 3507.08597 0.30878 0.0416
#> Spatial 150.96228 3525.26151 0.28080 0.0371
#> Shared Abiotic+Associations 142.48582 3533.73797 0.26736 0.0350
#> Shared Abiotic+Spatial 1531.78234 2144.44145 0.96472 0.3763
#> Shared Spatial+Associations 11.24359 3664.98020 0.02425 0.0028
#> Shared Abiotic+Associations+Spatial -1361.34035 5037.56415 -18.53795 -0.3345
#> Full 1317.53455 2358.68924 0.94368 0.3237
```

We see that the environmental component explains the most variance in the data, followed by the biotic association component. As in any ANOVA, there are unique and shared fractions for each component. The shared fractions depict variation that can be explained from either of the components (e.g. due to collinearity).

Different ANOVA strategies deal differently with these shared
fractions. For example, in type II/III ANOVA, the shared fractions are
usually discarded. The default here is to display all fractions.
Alternatively (more on this see `?summary.sjSDManova`

):

`summary(an, fractions = "discard")`

discards shared fractions, as in a type II/III ANOVA`summary(an, fractions = "proportional")`

distributes shared fractions to the unique fractions proportional to the unique fractions`summary(an, fractions = "equal")`

distributes shared fractions evenly to the unique fractions

Note that in VP, some fractions can get negative. For more discussion on this and how to interpret it see here and here.

An interesting feature of jSDMs is that we can further partition the global VP discussed just before to sites and species. This idea was first presented in

Leibold, M. A., Rudolph, F. J., Blanchet, F. G., De Meester, L.,
Gravel, D., Hartig, F., ... & Chase, J. M. (2022). The internal
structure of metacommunities. *Oikos*, *2022*(1).

We have implemented this feature call the calculation for this “internal metacommunity structure” via

The resulting object contains R2 values for each of the three
components (Env, Space, Associations) for each species and site. Note
the comments in `?internalStructure`

, especially about how to
deal with negative values.

The results can be plotted via

The ternary plots report the relative importance of the components for sites and species. Leaning to one of the corners means that this corner is mot important (e.g. here Species and Sites lean more to the environmental corner).

A attractive feature of the internal metacommunity structure
calculations is that we can now regress the partial R^{2} of
sites and species against predictors, to see how the strength of the
three assembly components (Environment, Space, Associations) changes
with these predictors.

The default plot will plot the assembly processes per plot against environmental and spatial distinctiveness, and richness:

However, you can use your own predictors, and also switch between
looking at sites or species as a response. For more on this, see
`?plotAssemblyEffects`

.

To predict with an sjSDM object, use

```
pred = predict(model, newdata = Env, SP = Coords)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

The standard predictions are marginal, i.e. assume that we have no information about what species exist on a site we predict to.

As jSDMs fit correlations between species, predictions of jSDM improve if we can provide information for some of the species on a new site. This is known as a conditional prediction. As an example, let’s assume that the first 6 species in this data are unknown, whereas the P/A of the rest is known to us.

```
New_PA = PA
New_PA[, 1:6] = NA
head(New_PA)
#> ALA ARE BAX CAM GON MEL OBL OVA WIL ALP VIM ARO.SAB
#> [1,] NA NA NA NA NA NA 0 0 1 1 0 0
#> [2,] NA NA NA NA NA NA 1 0 1 1 0 0
#> [3,] NA NA NA NA NA NA 0 0 1 1 0 0
#> [4,] NA NA NA NA NA NA 0 0 1 0 0 0
#> [5,] NA NA NA NA NA NA 1 0 0 0 0 0
#> [6,] NA NA NA NA NA NA 0 0 1 1 0 0
```

We can then predict occurrences for the first 6 species, conditioning on the rest of the species (7:12):

```
pred = predict(model, newdata = Env, SP = Coords, Y = New_PA)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

```
head(pred)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.06919945 0.0009476454 0.1910016 0.0018830881 0.0128013078 0.0024387540
#> [2,] 0.11315208 0.0009214621 0.1472399 0.0002079794 0.0072220865 0.0003615931
#> [3,] 0.04488569 0.0006793896 0.2183437 0.0011189727 0.0177170230 0.0009391013
#> [4,] 0.03075938 0.0009504823 0.6358477 0.0013088832 0.0062728245 0.0042722428
#> [5,] 0.03657349 0.0002785174 0.7242049 0.0016383200 0.0003361727 0.0132508982
#> [6,] 0.06529714 0.0008642071 0.2183885 0.0005441471 0.0231352945 0.0005368480
```

Note that the predict function returns only predictions for species with NA in Y

sjSDM supports other responses than presence-absence data: Simulate non-presence-absence data:

```
com = simulate_SDM(env = 3L, species = 5L, sites = 100L,
link = "identical", response = "count", verbose = FALSE)
#> Error in simulate_SDM(env = 3L, species = 5L, sites = 100L, link = "identical", : unused argument (verbose = FALSE)
```

jSDMs account for correlation between species within communities (sites), in real datasets, however, communities (sites) are often also correlated (== spatial autocorrelation). Usually conditional autoregressive (CAR) models are used to account for the spatial autocorrelation in the residuals, which we, however, do not support yet. A similar approach is to condition the model on space, which we can do by using space as predictors.

Let’s first simulate test data:

- Simulate jSDM without a link (normal response)

```
com = simulate_SDM(env = 3L, species = 5L, sites = 100L,
link = "identical", response = "identical")
X = com$env_weights
Y = com$response
```

- add spatial residuals (create coordinates and use spatial distance matrix to draw autocorrelated residuals for each species)

```
XYcoords = matrix(rnorm(200), 100, 2)+2
WW = as.matrix(dist(XYcoords))
spatialResiduals = mvtnorm::rmvnorm( 5L, sigma = exp(-WW))
```

- Finish test data

There are three options to condition our model on space:

The idea of the trend surface model is to use the spatial coordinates within a polynom:

Sometimes a linear model and a polynom is not flexible enough to account for space. We can use a “simple” DNN for space to condition our linear environmental model on the space:

sjSDM supports l1 (lasso) and l2 (ridge) regularization: * alpha is the weighting between lasso and ridge * alpha = 0.0 corresponds to pure lasso * alpha = 1.0 corresponds to pure ridge

We can do the same for the species associations:

```
model = sjSDM(Y = com$response,
env = linear(data = com$env_weights,
formula = ~0+ I(X1^2),
lambda = 0.5),
biotic = bioticStruct(lambda =0.1),
iter = 50L, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

```
com = simulate_SDM(env = 3L, species = 5L, sites = 100L)
X = com$env_weights
Y = com$response
# three fully connected layers with relu as activation function
model = sjSDM(Y = Y,
env = DNN(data = X,
formula = ~.,
hidden = c(10L, 10L, 10L),
activation = "relu"),
iter = 50L, se = TRUE, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

```
summary(model)
#> Family: binomial
#>
#> LogLik: -242.1242
#> Regularization loss: 0
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
```

The methods for sjSDM() work also for the non-linear model:

```
association = getCor(model) # species association matrix
pred = predict(model) # predict on fitted data
pred = predict(model, newdata = X) # predict on new data
```

Extract and set weights of model:

Plot the training history:

The `sjSDM::install_sjSDM()`

function can install
automatically all necessary ‘python’ dependencies but it can fail
sometimes because of individual system settings or if other
‘python’/‘conda’ installations get into the way.

A few notes before you start with the installation (skip this point if you do not know conda):

- existing ‘conda’ installations: make sure you have the latest ‘conda’ or ‘miniconda’ version and remove unnecessary ‘conda’ installations.
- existing ‘conda’ or ‘virtualenv’ environments (skip this point if you do not know ‘conda’): we currently enforce the usage of a specific environment called ‘r-sjsdm’,so if you want use a custom environment it should be named ‘r-sjsdm’

Sometimes the automatic ‘miniconda’ installation (via
`sjSDM::install_sjSDM()`

).doesn’t work because of white
spaces in the user’s name. But you can easily download and install
‘conda’ on your own:

Download and install the latest ‘conda’ version

Afterwards run:

Reload the package and run the example, if this doesn’t work:

- Restart RStudio
- Install manually pytorch, see the following section

Download and install the latest ‘conda’ version

Open the command window (cmd.exe - hit windows key + r and write cmd)

Run in cmd.exe:

```
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch torchvision torchaudio cpuonly -c pytorch # cpu
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia #gpu
python -m pip install pyro-ppl torch_optimizer madgrad
```

Restart R, try the example, and if it does not work:

- Restart RStudio
- See the ‘Help and bugs’ section

Run in R:

Restart R try to run the example, if this doesn’t work:

- Restart RStudio
- Install manually ‘PyTorch’, see the following section

We strongly advise to use a ‘conda’ environment but a virtual environment should also work. The only requirement is that it is named ‘r-sjsdm’

Download and install the latest ‘conda’ version

Open your terminal and run:

```
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch torchvision torchaudio cpuonly -c pytorch # cpu
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia #gpu
python -m pip install pyro-ppl torch_optimizer madgrad
```

Restart R try to run the example, if this doesn’t work:

- Restart RStudio
- See the ‘Help and bugs’ section

Run in R:

Restart R try to run the example, if this doesn’t work:

- Restart RStudio
- Install manually ‘PyTorch’, see the following section

We strongly advise to use a ‘conda’ environment but a virtual environment should also work. The only requirement is that it is named ‘r-sjsdm’

Download and install the latest conda conda version

Open your terminal and run:

```
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch::pytorch torchvision torchaudio -c pytorch
python -m pip install pyro-ppl torch_optimizer madgrad
```

Restart R, try the example, if it does not work:

- Restart RStudio
- See the ‘Help and bugs’ section

- ask for help via the issue
tracker. Please post the output of
`sjSDM::install_diagnostic()`

as a quote. - report bugs with a reproducible example via the issue tracker