The **TDApplied** package provides a wide variety of
tools for performing powerful applied analyses of multiple persistence
diagrams, and in order to make these analyses more practical a number of
computational speedups have been built-in. These speedups involve other
programming languages (Python and C++), parallel computing and intuitive
tricks, and result in significant performance gains (compared to other
similar R packages). In this vignette we will describe the methods that
are used to make **TDApplied** a highly practical and
scalable package for applied topological data analysis in R, as well as
benchmarking **TDApplied** functions against suitable
counterparts. All benchmarking was carried out on a Windows 10 64-bit
machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz processor
with 8 cores and 64GB of RAM.

When performing multiple independent computations, parallelization
can increase speed by performing several computations concurrently.
**TDApplied** utilizes parallelization in three ways to
achieve significant performance gains:

- Calculating distance and Gram matrices in parallel – these matrices
  are the backbone, and limiting runtime factors, of all
  **TDApplied** machine learning and inference methods.
- Carrying out the bootstrap procedure (to find significant
  topological features) in parallel – each bootstrap iteration
  involves an independent distance calculation.
- Determining the loss-function value in the permutation test
  procedure, where distances are calculated between each pair of
  diagrams in the same (permuted) group.

Parallelization is performed in an operating-system agnostic manner,
ensuring that these speedups are available to all
**TDApplied** users.
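As a rough illustration of the first use case, the sketch below fills
a Fisher information metric distance matrix in parallel using the base
**parallel** package. This is only a minimal sketch of the general
technique, not **TDApplied**'s internal implementation – in practice
the package's `distance_matrix` function handles this for you:

```
library(parallel)

# a small list of example persistence diagrams to compare
diagrams <- lapply(X = 1:5,FUN = function(X){
  return(TDAstats::calculate_homology(TDA::circleUnif(50),dim = 0,threshold = 2))
})

# each row of `pairs` indexes one independent distance computation
pairs <- t(combn(length(diagrams),2))

cl <- makeCluster(2) # PSOCK clusters work on all operating systems
clusterExport(cl,"diagrams")
dists <- parApply(cl,pairs,1,function(p){
  return(TDApplied::diagram_distance(diagrams[[p[[1]]]],diagrams[[p[[2]]]],
                                     dim = 0,distance = "fisher",sigma = 1))
})
stopCluster(cl)

# assemble the symmetric distance matrix from the parallel results
D <- matrix(0,nrow = length(diagrams),ncol = length(diagrams))
D[pairs] <- dists
D[pairs[,2:1]] <- dists
```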

The Fisher information metric (Le and Yamada
2018) unlocks the door to a number of **TDApplied**
machine learning and inference functions via the persistence Fisher
kernel. However, the computational complexity of calculating this metric
is quadratic in the number of points in its input diagrams (Le and Yamada 2018), making calculations using
diagrams with many points prohibitive. To solve this issue, a fast
approximation has been described for the Fisher information metric using
the Fast Gauss Transform (Morariu et al.
2008). This approximation reduces the runtime complexity of the
distance calculation to be linear – a huge performance gain with large
persistence diagrams. The implementation of the Fast Gauss Transform at
https://github.com/vmorariu/figtree was copied into
**TDApplied** with only slight modifications to interface
with Rcpp (Eddelbuettel and Francois
2011), providing the approximation functionality. Note that the
“epsilon” parameter in the original C++ code is called `rho` in
**TDApplied**’s implementation in order to avoid confusion about error
bounds – epsilon usually denotes an error bound (as in the original
code), but `rho` is not a bound on the approximation error of the
metric itself, only on certain subcalculations.
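For example, the approximation can be requested in `diagram_distance`
by supplying the `rho` parameter. A minimal sketch, assuming the
**TDA** and **TDAstats** packages are available to create the input
diagrams:

```
library(TDApplied)

# persistence diagrams of a sampled circle and sphere
D1 <- TDAstats::calculate_homology(TDA::circleUnif(100),dim = 1,threshold = 2)
D2 <- TDAstats::calculate_homology(TDA::sphereUnif(100,d = 2),dim = 1,threshold = 2)

# exact Fisher information metric, quadratic in the number of points
diagram_distance(D1,D2,dim = 1,distance = "fisher",sigma = 1)

# fast approximation via the Fast Gauss Transform, linear in the
# number of points; rho bounds the error of certain subcalculations
diagram_distance(D1,D2,dim = 1,distance = "fisher",sigma = 1,rho = 1e-6)
```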

To illustrate how significant this speedup can be, we sampled unit spheres and tori (with inner tube radius 0.25 and major radius 0.75) with varying numbers of points (100, 200, …, 1000, with 10 iterations at each number of points), calculated their persistence diagrams and benchmarked the calculation of their Fisher information metric distances in dimensions 0, 1 and 2, with and without approximation. The plot below shows the results (plotting the mean runtime with a 95% confidence interval, which was too small to show for the approximation):

A linear model of the runtime ratio (the exact mean runtime divided by the approximation mean runtime) regressed onto the number of points in the shapes found a highly significant positive slope of about 0.57. This means that for every additional 100 points in the shapes the runtime ratio increases by about 57 – larger inputs generally lead to greater runtime savings from the approximation.

Unfortunately this approximation cannot currently be run in parallel
in **TDApplied** functions, so in cases where the number of
points in the persistence diagrams is small or the number of available
cores is large we may consider using the exact calculation instead.
However, functions are provided in the
“exec/parallel_with_approximation.R” script which can be loaded into
your environment to compute distance and Gram matrices in parallel with
approximation (and these matrices can be input into other
**TDApplied** functions directly – see the following
section).

As a demonstration we will create ten persistence diagrams from circles (100 points sampled on each) and compute their Fisher information metric distance matrix in parallel, with and without approximation:

```
# create 10 persistence diagrams from circles, each with 100 sampled points
g <- lapply(X = 1:10,FUN = function(X){
  return(TDAstats::calculate_homology(TDA::circleUnif(100),dim = 0,threshold = 2))
})
# calculate the exact and approximate distance matrices
d_exact <- distance_matrix(g,distance = "fisher",sigma = 1)
d_approx <- parallel_approx_distance_mat(g,rho = 1e-6)
```

The timing difference was impressive – the exact distance matrix was calculated in about 44.8s, whereas the approximate distance matrix was calculated in 1.8s! Moreover, the maximum percent difference between the two matrices was only about 0.9%. When we repeated these calculations with twenty diagrams the timing was 180s for the exact calculation compared to 3.7s with the approximation, and when we used twenty diagrams with each circle having 200 sampled points the timings were 174s and 2.4s for the exact and approximate calculations respectively.

These simulations indicate that the functions
`parallel_approx_distance_mat` and `parallel_approx_gram_mat` in the
exec directory can unlock exceptional performance increases when
calculating distance/Gram matrices, and can be particularly useful
when combined with the speedup documented in the following section.
However, we have found that these parallelized approximation functions
can sometimes cause (seemingly non-reproducible, perhaps related to
either parallel processing or memory allocation) fatal errors,
requiring the user to manually restart the R session. These functions
should be used carefully, and perhaps not in large loops which may get
interrupted. A future update to **TDApplied** should resolve this
issue.

Redundancy can be a huge computational strain when performing
multiple related and slow calculations. Applied topological data
analysis unfortunately falls victim to this problem since machine
learning and inference methods are built on calculating (potentially the
same) distance/Gram matrices. In order to circumvent this issue
**TDApplied** machine learning and inference functions can
accept as input precomputed distance/Gram matrices. Therefore, if a
number of analyses are being carried out with the same persistence
diagrams and with the same distance/kernel parameter choices (as is
often desired) then the distance/Gram matrices can each be computed once
and reused across functions.

To illustrate how this works in practice we will use the
`generate_TDApplied_vignette_data` function (defined below) to
generate nine persistence diagrams and analyze them with
multidimensional scaling (Cox and Cox 2008), kernel k-means (Dhillon,
Guan, and Kulis 2004) and kernel principal components analysis
(Scholkopf, Smola, and Muller 1998):

```
generate_TDApplied_vignette_data <- function(num_D1,num_D2,num_D3){
  # num_D1 is the number of desired copies of D1, and likewise
  # for num_D2 and num_D3
  # create data
  D1 <- data.frame(dimension = c(0),birth = c(2),death = c(3))
  D2 <- data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
  D3 <- data.frame(dimension = c(0),birth = c(0),death = c(0.5))
  # make noisy copies
  noisy_copies <- lapply(X = 1:(num_D1 + num_D2 + num_D3),FUN = function(X){
    # i stores the number of the data frame to make copies of:
    # i = 1 is for D1, i = 2 is for D2 and i = 3 is for D3
    i <- 1
    if(X > num_D1 & X <= num_D1 + num_D2)
    {
      i <- 2
    }
    if(X > num_D1 + num_D2)
    {
      i <- 3
    }
    # store correct data in noisy_copy
    noisy_copy <- get(paste0("D",i))
    # add Gaussian noise to birth and death values
    n <- nrow(noisy_copy)
    noisy_copy$dimension <- as.numeric(as.character(noisy_copy$dimension))
    noisy_copy$birth <- noisy_copy$birth + stats::rnorm(n = n,mean = 0,sd = 0.05)
    noisy_copy$death <- noisy_copy$death + stats::rnorm(n = n,mean = 0,sd = 0.05)
    # make any birth values which are less than 0 equal 0
    noisy_copy[which(noisy_copy$birth < 0),2] <- 0
    # make any birth values which are greater than their death values
    # equal their death values
    noisy_copy[which(noisy_copy$birth > noisy_copy$death),2] <-
      noisy_copy[which(noisy_copy$birth > noisy_copy$death),3]
    return(noisy_copy)
  })
  # return list containing num_D1 noisy copies of D1, then
  # num_D2 noisy copies of D2, and finally num_D3 noisy copies
  # of D3
  return(noisy_copies)
}
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
# calculate MDS embedding
mds <- diagram_mds(diagrams = g,k = 2,dim = 0,sigma = 1.5,distance = "fisher")
# calculate kmeans clusters with 3 centers
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 2,sigma = 1.5)
# calculate kpca embedding
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 1.5,features = 2)
```

The time taken to run the MDS, k-means and PCA lines was about 2.53s. Noting that the distance/Gram matrices had shared parameters (i.e. the distance matrix was calculated with the Fisher information metric and the values of t and sigma were all shared), we then repeated the analysis by pre-computing one distance (and Gram) matrix and using these in all three analyses:

```
# calculate the Fisher information metric distance matrix once
D <- distance_matrix(diagrams = g,dim = 0,sigma = 1.5,distance = "fisher")
# convert to a Gram matrix with t = 2, i.e. K = exp(-t*D)
K <- exp(-2*D)
class(K) <- "kernelMatrix"
# calculate MDS embedding from the precomputed distance matrix
mds <- diagram_mds(D = D,k = 2)
# calculate kmeans clusters with 3 centers from the precomputed Gram matrix
clust <- diagram_kkmeans(diagrams = g,K = K,centers = 3,dim = 0,t = 2,sigma = 1.5)
# calculate kpca embedding from the precomputed Gram matrix
pca <- diagram_kpca(diagrams = g,K = K,dim = 0,t = 2,sigma = 1.5,features = 2)
```

The new runtime (including calculating the distance and Gram matrices) was about 0.59s, over four times faster than the original. We then repeated the analyses using 300 persistence diagrams, i.e. with

```
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(100,100,100)
```

The savings persisted at this larger scale – without precomputed matrices the time taken was about 121.8s, and with precomputed matrices it was about 39.69s. We recommend using precomputed matrices whenever performing multiple analyses of the same persistence diagrams with shared distance/kernel parameters.

## The `permutation_test` Function

Another significant source of redundancy can be found in the
`permutation_test` function – in each calculation of the loss
function, a distance value is computed between each pair of diagrams
in the same (possibly permuted) group. However, the same pair of
diagrams will often appear together in multiple permuted groups,
meaning distances would be needlessly recalculated. To solve this
problem the `permutation_test` function creates an initially trivial
distance matrix (with all entries -1) between all persistence diagrams
across all groups, updates its entries when new distance calculations
are performed (in parallel, as discussed earlier) and retrieves
already computed values whenever possible. It is also possible to
input a precomputed distance matrix to this function. However, for
standalone usage, depending on the group sizes and number of
permutations, not every pair of diagrams may appear together in some
permuted group, so the implemented speedup avoids redundancy without
calculating unnecessary distances.
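The following sketch illustrates this caching idea in isolation – it
is an illustration of the technique, not **TDApplied**'s actual
internals, and assumes `diagrams` is a list of persistence diagrams:

```
# initialize a trivial distance matrix with -1 sentinel entries
n <- length(diagrams)
D <- matrix(-1,nrow = n,ncol = n)
diag(D) <- 0

# compute a distance only if it has not already been cached
cached_distance <- function(i,j){
  if(D[i,j] == -1){
    d <- TDApplied::diagram_distance(diagrams[[i]],diagrams[[j]],
                                     dim = 0,distance = "wasserstein")
    # store the value in both symmetric entries of the cache
    D[i,j] <<- d
    D[j,i] <<- d
  }
  return(D[i,j])
}
```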

In order to properly situate **TDApplied** in the
landscape of software for topological data analysis, we will compare the
speed of its calculations to similar calculations from other packages.
In the following sections we will benchmark:

- Persistent (co)homology calculations with **TDApplied**’s `PyH`,
  **TDAstats**’ (Wadhwa et al. 2019) `calculate_homology` and
  **rgudhi**’s `compute_persistence`.
- Wasserstein distances between persistence diagrams with
  **TDApplied**’s `diagram_distance` and **TDA**’s (Fasy et al. 2021)
  `wasserstein`.
- Wasserstein distances between persistence diagrams with
  **TDApplied**’s `diagram_distance` and the **persim** Python
  module’s (https://persim.scikit-tda.org/en/latest/) `wasserstein`.
- Fisher information metric distances between persistence diagrams
  with **TDApplied**’s `diagram_distance` and **rgudhi**’s (Stamm
  2023) `PersistenceFisherDistance`.

The script that was used to perform benchmarking (and plotting the
results) is available in the `exec` directory of this package; it uses
`PyH` in certain cases and thus requires Python. A simple error check
is included for the installation of the reticulate package, but the
script will throw an error if reticulate is not properly connected
with Python. In order to perform the benchmarking against
**rgudhi**, **rgudhi** must be installed explicitly (again requiring
configuration with Python) as it is not even a suggested package for
**TDApplied**’s installation. In all cases, benchmarking followed a
similar procedure, involving sampling data from simple shapes (unit
circles, unit spheres and tori with inner tube radius 0.25 and major
radius 0.75) with various numbers of rows, and performing 10
benchmarking iterations at each number of rows. The mean and standard
deviation of run time for the two functions were then calculated at
each number of rows.

The benchmarking results are displayed graphically in the following
three subsections. On top of comparing raw run time of the various
functions, we also compared the scalability of the functions by
computing the runtime ratios (i.e. quotient of runtimes in the two
packages) of the functions and regressing the ratios onto the number of
points in the input shapes. Overall we found that
**TDApplied**’s functions are faster and scale better than
their R counterparts, and scale similarly to Python counterparts. These
results indicate that **TDApplied** is a powerful and
efficient tool for applied topological data analysis in R.

## `PyH` Against the `calculate_homology` and `compute_persistence` Functions

The long calculation time of persistence diagrams is likely a large
contributing factor to the slow adoption of topological data analysis
for applied data science. Much research has been carried out in order
to speed up these calculations, but the current state-of-the-art is
the persistent cohomology algorithm (Silva and Vejdemo-Johansson
2011). In R, **TDAstats**’ `calculate_homology` function is the
fastest option for persistence diagram calculations (Somasundaram et
al. 2021), being a wrapper for the **ripser** (Bauer 2015) persistent
cohomology engine. **TDApplied**’s `PyH` function is a Python wrapper
for the same engine, and **rgudhi** also provides a Python cohomology
engine via the C++ library **GUDHI** (Lacombe et al. 2019) with its
`compute_persistence` function. Note that **rgudhi**’s function
required a number of lines of code to use (i.e. to calculate a
persistence diagram across all desired dimensions from a dataset) and
we benchmarked all of these lines (see the benchmarking script in the
exec directory for details). We benchmarked all three functions’ run
time on circles, spheres and tori. The results were as follows:

As we can see, the run time of `compute_persistence` was the fastest,
followed by `PyH` and finally `calculate_homology`. We then used
linear models to analyze how the three functions scale compared to
each other for all three shapes, regressing the ratio of runtimes onto
the number of points in the data set. For the models dividing the
runtime of `calculate_homology` by that of `PyH`, the intercepts were
significant and positive but the slopes were positive and either
barely significant or not significant (at the \(\alpha = 0.05\)
level). This suggests that the runtimes of the two functions scale
similarly, but there was a constant speed increase of `PyH` which
differed by shape (about 1.06 times faster for circles, 1.75 times
faster for tori and 3 times faster for spheres). The models dividing
the runtime of `PyH` by that of `compute_persistence` yielded similar
results – roughly constant multiplicative runtime decreases with
**rgudhi**. Overall, if Python is available to a **TDApplied** user
then the `PyH` function may provide speedups compared to
**TDAstats**’ `calculate_homology` function, but in that case
**rgudhi**’s `compute_persistence` is the fastest option at the
expense of more lines of code (computing a persistence diagram of just
dimension 0 as a data frame takes 5 lines of code).
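For reference, a minimal `PyH` call might look like the following – a
sketch, assuming Python is configured with reticulate and the ripser
module is installed:

```
library(TDApplied)

# import the ripser module once (requires a working Python installation)
ripser <- import_ripser()

# compute the persistence diagram of a sampled circle up to dimension 1
X <- TDA::circleUnif(100)
diag <- PyH(X,maxdim = 1,thresh = 2,ripser = ripser)
```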

## The `diagram_distance` and `wasserstein` Functions

Computing Wasserstein (or bottleneck) distances between persistence
diagrams is a key feature of some of the main topological data
analysis software packages in R and Python. However, these
calculations can be very expensive, rendering practical applications
of topological data analysis nearly unfeasible. Since **TDAstats**
has implemented an unconventional distance calculation (see the
package vignette “Comparing Distance Calculations” for details), we
will benchmark **TDApplied**’s `diagram_distance` function against
**TDA**’s `wasserstein` function on spheres and tori, calculating
their distance in dimensions 0, 1 and 2 and recording the total time.
The results were as follows (the 95% confidence interval was too small
to be plotted for the `diagram_distance` function):

A linear model, regressing the ratio of **TDA**’s runtime divided by
**TDApplied**’s runtime onto the number of points in the data set,
found a significant positive coefficient of the number of points. This
suggests that `diagram_distance` scales better than `wasserstein`, and
the model estimated a 95x speed up for 1000 data points. These results
suggest that distance calculations with **TDApplied** are faster and
more scalable, making the applications of statistics and machine
learning with persistence diagrams more feasible in R. This is why the
**TDA** distance calculation was not used in the **TDApplied**
package.
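A comparison like the one benchmarked here can be reproduced in a few
lines. The following sketch (assuming the **TDA** and **TDAstats**
packages are installed) sums `diagram_distance`’s 2-wasserstein
distances across dimensions 0, 1 and 2 for a sampled torus and sphere:

```
library(TDApplied)

# sample a torus and a sphere and compute their persistence diagrams
torus <- TDA::torusUnif(100,a = 0.25,c = 0.75)
sphere <- TDA::sphereUnif(100,d = 2)
diag_torus <- TDAstats::calculate_homology(torus,dim = 2,threshold = 2)
diag_sphere <- TDAstats::calculate_homology(sphere,dim = 2,threshold = 2)

# total 2-wasserstein distance across dimensions 0, 1 and 2
sum(sapply(X = 0:2,FUN = function(d){
  return(diagram_distance(diag_torus,diag_sphere,dim = d,p = 2,
                          distance = "wasserstein"))
}))
```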

## The `diagram_distance` Function Against **persim**’s `wasserstein` Function

While the functionality of Python packages for topological data
analysis is out of scope for an R package, in order to fully situate
**TDApplied** in the landscape of topological data analysis software
we will benchmark the `diagram_distance` function against its
counterpart from the **scikit-TDA** collection of libraries, namely
the `wasserstein` function from the **persim** Python module. The R
package reticulate (Ushey, Allaire, and Tang 2022) was used to carry
out this benchmarking, via installing, importing and using the
**persim** module. This benchmarking procedure also used spheres and
tori, calculating distances in dimensions 0, 1 and 2, and the results
were as follows (confidence intervals for the **persim** package were
too small to be plotted):
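To give a flavour of the Python side, the **persim** `wasserstein`
function can be called from R through reticulate roughly as follows –
a sketch, assuming persim is installed in the active Python
environment, with two small hypothetical diagrams as input:

```
library(reticulate)

# import the persim module (install with py_install("persim") if needed)
persim <- import("persim")

# persistence diagrams as two-column (birth, death) matrices
d1 <- matrix(c(2,3),ncol = 2,byrow = TRUE)
d2 <- matrix(c(2,3.3,
               0,0.5),ncol = 2,byrow = TRUE)

# wasserstein distance between the two diagrams
persim$wasserstein(d1,d2)
```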