The TDApplied package provides a wide variety of tools for performing powerful applied analyses of multiple persistence diagrams, and in order to make these analyses more practical a number of computational speedups have been built-in. These speedups involve other programming languages (Python and C++), parallel computing and intuitive tricks, and result in significant performance gains (compared to other similar R packages). In this vignette we will describe the methods that are used to make TDApplied a highly practical and scalable package for applied topological data analysis in R, as well as benchmarking TDApplied functions against suitable counterparts. All benchmarking was carried out on a Windows 10 64-bit machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz processor with 8 cores and 64GB of RAM.
When performing multiple independent computations, parallelization can increase speed by performing several computations concurrently. TDApplied utilizes parallelization in three ways to achieve significant performance gains:
Parallelization is performed in an operating-system agnostic manner, ensuring that these speedups are available to all TDApplied users.
The Fisher information metric (Le and Yamada
2018) unlocks the door to a number of TDApplied
machine learning and inference functions via the persistence Fisher
kernel. However, the computational complexity of calculating this metric
is quadratic in the number of points in its input diagrams (Le and Yamada 2018), making calculations using
diagrams with many points prohibitive. To solve this issue, a fast
approximation has been described for the Fisher information metric using
the Fast Gauss Transform (Morariu et al.
2008). This approximation reduces the runtime complexity of the
distance calculation to be linear – a huge performance gain with large
persistence diagrams. The implementation of the Fast Gauss Transform at
https://github.com/vmorariu/figtree was copied into
TDApplied with only slight modifications to interface
with Rcpp (Eddelbuettel and Francois
2011), providing the approximation functionality. Note that the
“epsilon” parameter in the original C++ code is called rho
in TDApplied’s implementation in order to avoid
confusion about error bounds – epsilon usually denotes an error bound
(like in the original code). However, rho
is not itself a
bound on the approximation error of the metric itself, but rather for
some subcalculations.
To illustrate how significant this speedup can be, we sampled unit spheres and tori (with inner tube radius 0.25 and major radius 0.75) with different numbers of points 100,200,…,1000 (with 10 iterations at each number of points), calculated their persistence diagrams and benchmarked calculating their Fisher information metric in dimensions 0, 1 and 2, with and without approximation. The plot below shows the results (plotting the mean runtime with a 95% confidence interval, which was too small to show for the approximation):
A linear model of the runtime ratio (division of the exact vs. approximation mean runtimes) regressed onto the number of points in the shapes found a highly significant positive slope of about 0.57. This means that for every additional 100 points in the shapes the runtime ratio increases by about 57 – larger inputs generally lead to greater runtime savings.
Unfortunately this approximation cannot currently be run in parallel in TDApplied functions, so in cases where the number of points in the persistence diagrams is small or the number of available cores is large we may consider using the exact calculation instead. However, functions are provided in the “exec/parallel_with_approximation.R” script which can be loaded into your environment to compute distance and Gram matrices in parallel with approximation (and these matrices can be input into other TDApplied functions directly – see the following section).
As a demonstration we will create ten persistence diagrams from circles (100 points points sampled on each) and compute their Fisher information metric distance matrix in parallel with and without approximation:
# create 20 diagrams from circles
g <- lapply(X = 1:10,FUN = function(X){
return(TDAstats::calculate_homology(TDA::circleUnif(100),dim = 0,threshold = 2))
})
# calculate distance matrices
d_exact <- distance_matrix(g,distance = "fisher",sigma = 1)
d_approx <- parallel_approx_distance_mat(g,rho = 1e-6)
The timing difference was impressive – the exact distance matrix was calculated in about 44.8s, whereas the approximate distance matrix was calculated in 1.8s! Moreover, the maximum percent difference between the two matrices was only about 0.9%. When we repeated these calculations with twenty diagrams the timing was 180s for the exact calculation compared to 3.7s with the approximation, and when we used twenty diagrams with each circle having 200 sampled points the timings were 174s and 2.4s for the exact and approximate calculations respectively.
These simulations indicate that the functions
parallel_approx_distance_matrix
and
parallel_approx_gram_matrix
in the exec directory can
unlock exceptional performance increases in calculating distance/Gram
matrices, and can be particularly useful when combined with the speedup
documented in the following section. However, we have found that these
parallelized approximation functions can sometimes cause (seemingly
non-reproducible, perhaps related to either parallel processing or
memory allocation) fatal errors, requiring the user to manually restart
the R session. This function should be used carefully, and perhaps not
in large loops which may get interrupted. A future update to
TDApplied should resolve this issue.
Redundancy can be a huge computational strain when performing multiple related and slow calculations. Applied topological data analysis unfortunately falls victim to this problem since machine learning and inference methods are built on calculating (potentially the same) distance/Gram matrices. In order to circumvent this issue TDApplied machine learning and inference functions can accept as input precomputed distance/Gram matrices. Therefore, if a number of analyses are being carried out with the same persistence diagrams and with the same distance/kernel parameter choices (as is often desired) then the distance/Gram matrices can each be computed once and reused across functions.
To illustrate how this works in practice we will use the
generate_TDApplied_vignette_data
function to generate nine
persistence diagrams and analyze them with multidimensional scaling
(Cox and Cox 2008) ,kernel k-means (Dhillon, Guan, and Kulis 2004) and kernel
principal components analysis (Scholkopf, Smola,
and Muller 1998):
generate_TDApplied_vignette_data <- function(num_D1,num_D2,num_D3){
# num_D1 is the number of desired copies of D1, and likewise
# for num_D2 and num_D3
# create data
D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))
# make noisy copies
noisy_copies <- lapply(X = 1:(num_D1 + num_D2 + num_D3),FUN = function(X){
# i stores the number of the data frame to make copies of:
# i = 1 is for D1, i = 2 is for D2 and i = 3 is for D3
i <- 1
if(X > num_D1 & X <= num_D1 + num_D2)
{
i <- 2
}
if(X > num_D1 + num_D2)
{
i <- 3
}
# store correct data in noisy_copy
noisy_copy <- get(paste0("D",i))
# add Gaussian noise to birth and death values
n <- nrow(noisy_copy)
noisy_copy$dimension <- as.numeric(as.character(noisy_copy$dimension))
noisy_copy$birth <- noisy_copy$birth + stats::rnorm(n = n,mean = 0,sd = 0.05)
noisy_copy$death <- noisy_copy$death + stats::rnorm(n = n,mean = 0,sd = 0.05)
# make any birth values which are less than 0 equal 0
noisy_copy[which(noisy_copy$birth < 0),2] <- 0
# make any birth values which are greater than their death values equal their death values
noisy_copy[which(noisy_copy$birth > noisy_copy$death),2] <-
noisy_copy[which(noisy_copy$birth > noisy_copy$death),3]
return(noisy_copy)
})
# return list containing num_D1 noisy copies of D1, then
# num_D2 noisy copies of D2, and finally num_D3 noisy copies
# of D3
return(noisy_copies)
}
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
# calculate MDS embedding
mds <- diagram_mds(diagrams = g,k = 2,dim = 0,sigma = 1.5,distance = "fisher")
# calculate kmeans clusters with 3 centers
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 2,sigma = 1.5)
# calculate kpca embedding
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 1.5,features = 2)
The time taken to run the MDS, k-means and PCA lines was about 2.53s. Noting that the distance/Gram matrices had shared parameters (i.e. the distance matrix was calculated with the Fisher information metric and the values of t and sigma were all shared), we then repeated the analysis by pre-computing one distance (and Gram) matrix and using these in all three analyses:
D <- distance_matrix(diagrams = g,dim = 0,sigma = 1.5,distance = "fisher")
K <- exp(-2*D)
class(K) <- "kernelMatrix"
# calculate MDS embedding
mds <- diagram_mds(D = D,k = 2)
# calculate kmeans clusters with 3 centers
clust <- diagram_kkmeans(diagrams = g,K = K,centers = 3,dim = 0,t = 2,sigma = 1.5)
# calculate kpca embedding
pca <- diagram_kpca(diagrams = g,K = K,dim = 0,t = 2,sigma = 1.5,features = 2)
The new runtime (including calculating the distance and Gram matrices) was about 0.59s, over three times faster than the original. We then repeated the analyses using 300 persistence diagrams, i.e. with
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(100,100,100)
The timing scaled proportionally – without using precomputed matrices the time taken was about 121.8s and with precomputed matrices the time taken was about 39.69s. We recommend using precomputed matrices whenever performing multiple analyses of the same persistence diagrams with shared distance/kernel parameters.
permutation_test
Another significant source of redundancy can be found in the
permutation_test
function – in each calculation of the loss
function, a distance value is computed between each pair of diagrams in
the same (possibly permuted) group. However, diagrams will often appear
in the same permuted group meaning distances would be needlessly
recalculated. To solve this problem the permutation_test
function creates an initially trivial distance matrix (with entries -1)
between all persistence diagrams across all groups, updates its entries
when new distance calculations are performed (in parallel as discussed
earlier) and retrieves already computed values whenever possible. It is
possible to input precomputed distance matrices to this function.
However, for standalone usage depending on the group sizes and number of
permutations not every pair of diagrams may appear in some permuted
group together, so the implemented speedup avoids redundancy without
calculating unnecessary distances.
In order to properly situate TDApplied in the landscape of software for topological data analysis, we will compare the speed of its calculations to similar calculations from other packages. In the following sections we will benchmark:
PyH
,
TDAstats’ (Wadhwa et al.
2019) calculate_homology
and
rgudhi’s compute_persistence
.diagram_distance
and
TDA’s (Fasy et al. 2021)
wasserstein
.diagram_distance
and the
persim Python module’s (https://persim.scikit-tda.org/en/latest/)
wasserstein
.diagram_distance
and
rgudhi’s (Stamm 2023)
PersistenceFisherDistance
.The script that was used to perform benchmarking (and plotting the
results) is available in the exec
directory of this
package, using PyH
in certain cases and thus requiring
Python. A simple error check is included for the installation of the
reticulate package, but the script will throw an error if reticulate is
not properly connected with Python. In order to perform the benchmarking
against rgudhi, rgudhi must be
installed explicitly (again requiring configuration with Python) as it
is not even a suggested package for TDApplied
installation. In all cases, benchmarking followed a similar procedure,
involving sampling data from simple shapes (unit circles, unit spheres
and tori with inner tube radius 0.25 and major radius 0.75) with various
number of rows, and performing 10 benchmarking iterations at each number
of rows. The mean and standard deviation of run time for the two
functions were then calculated at each number of rows.
The benchmarking results are displayed graphically in the following three subsections. On top of comparing raw run time of the various functions, we also compared the scalability of the functions by computing the runtime ratios (i.e. quotient of runtimes in the two packages) of the functions and regressing the ratios onto the number of points in the input shapes. Overall we found that TDApplied’s functions are faster and scale better than their R counterparts, and scale similarly to Python counterparts. These results indicate that TDApplied is a powerful and efficient tool for applied topological data analysis in R.
PyH
Against TDAstats’
calculate_homology
Function and rgudhi’s
compute_persistence
FunctionThe long calculation time of persistence diagrams is likely a large
contributing factor to the slow adoption of topological data analysis
for applied data science. Much research has been carried out in order to
speed up these calculations, but the current state-of-the-art is the
persistent cohomology algorithm (Silva and
Vejdemo-Johansson 2011). In R, the TDAstats’
calculate_homology
function is the fastest option for
persistence diagram calculations (Somasundaram et
al. 2021), being a wrapper for the ripser (Bauer 2015) persistent cohomology engine.
TDApplied’s PyH
function is a Python
wrapper for the same engine, and rgudhi also provides a
Python cohomology engine via the C++ library GUDHI
(Lacombe et al. 2019) with its
compute_persistence
function. Note that
rgudhi’s function required a number of lines of code to
use (i.e. to calculate a persistence diagram across all desired
dimensions from a dataset) and we benchmarked all of these lines (see
the benchmarking script in the exec directory for details). We
benchmarked all three function’s run time on circles, spheres and tori.
The results were as follows:
As we can see the run time of compute_persistence
was
fastest, followed by PyH
and finally
calculate_homology
. We then used linear models to analyze
how the three functions scale compared to each other for all three
shapes, regressing the ratio of runtimes onto the number of points in
the data set. For the models dividing the runtime of
calculate_homology
by that of PyH
, the
intercepts were significant and positive but the slopes were positive
and either barely significant or not significant (at the \(\alpha = 0.05\) level). This suggests that
the runtime of the two functions scale similarly, but there was a
constant speed increase of PyH
which differed by shape
(about 1.06 times faster for circles, 1.75 times faster for tori and 3
times faster for spheres). The models for dividing the runtime of
PyH
by that of compute_persistence
yielded
similar results - roughly constant multiplicative runtime decreases with
rgudhi. Overall, if Python is available to a
TDApplied user then the PyH
function may
provide speedups compared to the TDAstats function
calculate_homology, but in that case
rgudhi’s compute_persistence
is the
fastest option at the expense of more lines of code (computing a
persistence diagram of just dimension 0 as a data frame takes 5 lines of
code).
diagram_distance
and TDA
wasserstein
FunctionsComputing wasserstein (or bottleneck) distances between persistence
diagrams is a key feature of some of the main topological data analysis
software packages in R and Python. However, these calculations can be
very expensive, rendering practical applications of topological data
analysis nearly unfeasible. Since TDAstats has
implemented an unconventional distance calculation (see the package
vignette “Comparing Distance Calculations” for details), we will
benchmark TDApplied’s diagram_distance
function against the TDA wasserstein
function on spheres and tori, calculating their distance in dimensions
0, 1 and 2 and recording the total time. The results were as follows
(the 95% confidence interval was too small to be plotted for the
diagram_distance
function):
A linear model, regressing the ratio of TDA’s
runtime divided by TDApplied’s runtime onto the number
of points in the data set, found a significant positive coefficient of
the number of points. This suggests that diagram_distance
scales better than wasserstein
, and the model estimated a
95x speed up for 1000 data points. These results suggest that distance
calculations with TDApplied are faster and more
scalable, making the applications of statistics and machine learning
with persistence diagrams more feasible in R. This is why the
TDA distance calculation was not used in the
TDApplied package.
diagram_distance
Function against persim’s
wasserstein
FunctionsWhile the functionality of Python packages for topological data
analysis packages are out of the scope for an R package, in order to
fully situate TDApplied in the landscape of topological
data analysis software we will benchmark the
diagram_distance
function against its counterpart from the
scikit-TDA collection of libraries, namely the
wasserstein
function from the persim
Python module. The R package reticulate (Ushey,
Allaire, and Tang 2022) was used to carry out this benchmarking,
via installing, importing and using the persim module.
This benchmarking procedure also used spheres and tori, calculating
distances in dimensions 0, 1 and 2, and the results were as follows
(confidence intervals for the persim package were too
small to be plotted):