Introduction

The TDApplied package provides a wide variety of tools for performing powerful applied analyses of multiple persistence diagrams, and in order to make these analyses more practical a number of computational speedups have been built-in. These speedups involve other programming languages (Python and C++), parallel computing and intuitive tricks, and result in significant performance gains (compared to other similar R packages). In this vignette we will describe the methods that are used to make TDApplied a highly practical and scalable package for applied topological data analysis in R, as well as benchmarking TDApplied functions against suitable counterparts. All benchmarking was carried out on a Windows 10 64-bit machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz processor with 8 cores and 64GB of RAM.

Speedups

Parallelization

When performing multiple independent computations, parallelization can increase speed by performing several computations concurrently. TDApplied utilizes parallelization in three ways to achieve significant performance gains:

  1. Calculating distance and Gram matrices in parallel – these matrices are the backbone, and limiting runtime factors, of all TDApplied machine learning and inference methods.
  2. Carrying out the bootstrap procedure (to find significant topological features) in parallel – each bootstrap iteration involves an independent distance calculation.
  3. Determining the loss-function value in the permutation test procedure, where distances are calculated between each pair of diagrams in the same (permuted) group.

Parallelization is performed in an operating-system agnostic manner, ensuring that these speedups are available to all TDApplied users.

Fisher Information Metric Approximation

The Fisher information metric (Le and Yamada 2018) unlocks the door to a number of TDApplied machine learning and inference functions via the persistence Fisher kernel. However, the computational complexity of calculating this metric is quadratic in the number of points in its input diagrams (Le and Yamada 2018), making calculations using diagrams with many points prohibitive. To solve this issue, a fast approximation has been described for the Fisher information metric using the Fast Gauss Transform (Morariu et al. 2008). This approximation reduces the runtime complexity of the distance calculation to be linear – a huge performance gain with large persistence diagrams. The implementation of the Fast Gauss Transform at https://github.com/vmorariu/figtree was copied into TDApplied with only slight modifications to interface with Rcpp (Eddelbuettel and Francois 2011), providing the approximation functionality. Note that the “epsilon” parameter in the original C++ code is called rho in TDApplied’s implementation in order to avoid confusion about error bounds – epsilon usually denotes an error bound (like in the original code). However, rho is not itself a bound on the approximation error of the metric itself, but rather for some subcalculations.

To illustrate how significant this speedup can be, we sampled unit spheres and tori (with inner tube radius 0.25 and major radius 0.75) with different numbers of points 100,200,…,1000 (with 10 iterations at each number of points), calculated their persistence diagrams and benchmarked calculating their Fisher information metric in dimensions 0, 1 and 2, with and without approximation. The plot below shows the results (plotting the mean runtime with a 95% confidence interval, which was too small to show for the approximation):

A linear model of the runtime ratio (division of the exact vs. approximation mean runtimes) regressed onto the number of points in the shapes found a highly significant positive slope of about 0.57. This means that for every additional 100 points in the shapes the runtime ratio increases by about 57 – larger inputs generally lead to greater runtime savings.

Unfortunately this approximation cannot currently be run in parallel in TDApplied functions, so in cases where the number of points in the persistence diagrams is small or the number of available cores is large we may consider using the exact calculation instead. However, functions are provided in the “exec/parallel_with_approximation.R” script which can be loaded into your environment to compute distance and Gram matrices in parallel with approximation (and these matrices can be input into other TDApplied functions directly – see the following section).

As a demonstration we will create ten persistence diagrams from circles (100 points points sampled on each) and compute their Fisher information metric distance matrix in parallel with and without approximation:

# create 20 diagrams from circles
g <- lapply(X = 1:10,FUN = function(X){
  
  return(TDAstats::calculate_homology(TDA::circleUnif(100),dim = 0,threshold = 2))
  
})

# calculate distance matrices
d_exact <- distance_matrix(g,distance = "fisher",sigma = 1)
d_approx <- parallel_approx_distance_mat(g,rho = 1e-6)

The timing difference was impressive – the exact distance matrix was calculated in about 44.8s, whereas the approximate distance matrix was calculated in 1.8s! Moreover, the maximum percent difference between the two matrices was only about 0.9%. When we repeated these calculations with twenty diagrams the timing was 180s for the exact calculation compared to 3.7s with the approximation, and when we used twenty diagrams with each circle having 200 sampled points the timings were 174s and 2.4s for the exact and approximate calculations respectively.

These simulations indicate that the functions parallel_approx_distance_matrix and parallel_approx_gram_matrix in the exec directory can unlock exceptional performance increases in calculating distance/Gram matrices, and can be particularly useful when combined with the speedup documented in the following section. However, we have found that these parallelized approximation functions can sometimes cause (seemingly non-reproducible, perhaps related to either parallel processing or memory allocation) fatal errors, requiring the user to manually restart the R session. This function should be used carefully, and perhaps not in large loops which may get interrupted. A future update to TDApplied should resolve this issue.

Using Pre-Computed Matrices

Redundancy can be a huge computational strain when performing multiple related and slow calculations. Applied topological data analysis unfortunately falls victim to this problem since machine learning and inference methods are built on calculating (potentially the same) distance/Gram matrices. In order to circumvent this issue TDApplied machine learning and inference functions can accept as input precomputed distance/Gram matrices. Therefore, if a number of analyses are being carried out with the same persistence diagrams and with the same distance/kernel parameter choices (as is often desired) then the distance/Gram matrices can each be computed once and reused across functions.

To illustrate how this works in practice we will use the generate_TDApplied_vignette_data function to generate nine persistence diagrams and analyze them with multidimensional scaling (Cox and Cox 2008) ,kernel k-means (Dhillon, Guan, and Kulis 2004) and kernel principal components analysis (Scholkopf, Smola, and Muller 1998):

generate_TDApplied_vignette_data <- function(num_D1,num_D2,num_D3){
  
  # num_D1 is the number of desired copies of D1, and likewise
  # for num_D2 and num_D3
  
  # create data
  D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
  D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
  D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))
  
  # make noisy copies
  noisy_copies <- lapply(X = 1:(num_D1 + num_D2 + num_D3),FUN = function(X){
    
    # i stores the number of the data frame to make copies of:
    # i = 1 is for D1, i = 2 is for D2 and i = 3 is for D3
    i <- 1
    if(X > num_D1 & X <= num_D1 + num_D2)
    {
      i <- 2
    }
    if(X > num_D1 + num_D2)
    {
      i <- 3
    }
    # store correct data in noisy_copy
    noisy_copy <- get(paste0("D",i))
    
    # add Gaussian noise to birth and death values
    n <- nrow(noisy_copy)
    noisy_copy$dimension <- as.numeric(as.character(noisy_copy$dimension))
    noisy_copy$birth <- noisy_copy$birth + stats::rnorm(n = n,mean = 0,sd = 0.05)
    noisy_copy$death <- noisy_copy$death + stats::rnorm(n = n,mean = 0,sd = 0.05)
    
    # make any birth values which are less than 0 equal 0
    noisy_copy[which(noisy_copy$birth < 0),2] <- 0
    
    # make any birth values which are greater than their death values equal their death values
    noisy_copy[which(noisy_copy$birth > noisy_copy$death),2] <- 
      noisy_copy[which(noisy_copy$birth > noisy_copy$death),3]
    return(noisy_copy)
    
  })
  
  # return list containing num_D1 noisy copies of D1, then
  # num_D2 noisy copies of D2, and finally num_D3 noisy copies
  # of D3
  return(noisy_copies)
  
}

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate MDS embedding
mds <- diagram_mds(diagrams = g,k = 2,dim = 0,sigma = 1.5,distance = "fisher")
                              
# calculate kmeans clusters with 3 centers
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 2,sigma = 1.5)

# calculate kpca embedding
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 1.5,features = 2)

The time taken to run the MDS, k-means and PCA lines was about 2.53s. Noting that the distance/Gram matrices had shared parameters (i.e. the distance matrix was calculated with the Fisher information metric and the values of t and sigma were all shared), we then repeated the analysis by pre-computing one distance (and Gram) matrix and using these in all three analyses:

D <- distance_matrix(diagrams = g,dim = 0,sigma = 1.5,distance = "fisher")
K <- exp(-2*D)
class(K) <- "kernelMatrix"

# calculate MDS embedding
mds <- diagram_mds(D = D,k = 2)

# calculate kmeans clusters with 3 centers
clust <- diagram_kkmeans(diagrams = g,K = K,centers = 3,dim = 0,t = 2,sigma = 1.5)

# calculate kpca embedding
pca <- diagram_kpca(diagrams = g,K = K,dim = 0,t = 2,sigma = 1.5,features = 2)

The new runtime (including calculating the distance and Gram matrices) was about 0.59s, over three times faster than the original. We then repeated the analyses using 300 persistence diagrams, i.e. with

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(100,100,100)

The timing scaled proportionally – without using precomputed matrices the time taken was about 121.8s and with precomputed matrices the time taken was about 39.69s. We recommend using precomputed matrices whenever performing multiple analyses of the same persistence diagrams with shared distance/kernel parameters.

Storing Calculations in permutation_test

Another significant source of redundancy can be found in the permutation_test function – in each calculation of the loss function, a distance value is computed between each pair of diagrams in the same (possibly permuted) group. However, diagrams will often appear in the same permuted group meaning distances would be needlessly recalculated. To solve this problem the permutation_test function creates an initially trivial distance matrix (with entries -1) between all persistence diagrams across all groups, updates its entries when new distance calculations are performed (in parallel as discussed earlier) and retrieves already computed values whenever possible. It is possible to input precomputed distance matrices to this function. However, for standalone usage depending on the group sizes and number of permutations not every pair of diagrams may appear in some permuted group together, so the implemented speedup avoids redundancy without calculating unnecessary distances.

Benchmarking Against Similar Packages

In order to properly situate TDApplied in the landscape of software for topological data analysis, we will compare the speed of its calculations to similar calculations from other packages. In the following sections we will benchmark:

  1. Persistent (co)homology calculations with TDApplied’s PyH, TDAstats(Wadhwa et al. 2019) calculate_homology and rgudhi’s compute_persistence.
  2. Wasserstein distances between persistence diagrams with TDApplied’s diagram_distance and TDA’s (Fasy et al. 2021) wasserstein.
  3. Wasserstein distances between persistence diagrams with TDApplied’s diagram_distance and the persim Python module’s (https://persim.scikit-tda.org/en/latest/) wasserstein.
  4. Fisher information metric distances between persistence diagrams with TDApplied’s diagram_distance and rgudhi’s (Stamm 2023) PersistenceFisherDistance.

The script that was used to perform benchmarking (and plotting the results) is available in the exec directory of this package, using PyH in certain cases and thus requiring Python. A simple error check is included for the installation of the reticulate package, but the script will throw an error if reticulate is not properly connected with Python. In order to perform the benchmarking against rgudhi, rgudhi must be installed explicitly (again requiring configuration with Python) as it is not even a suggested package for TDApplied installation. In all cases, benchmarking followed a similar procedure, involving sampling data from simple shapes (unit circles, unit spheres and tori with inner tube radius 0.25 and major radius 0.75) with various number of rows, and performing 10 benchmarking iterations at each number of rows. The mean and standard deviation of run time for the two functions were then calculated at each number of rows.

The benchmarking results are displayed graphically in the following three subsections. On top of comparing raw run time of the various functions, we also compared the scalability of the functions by computing the runtime ratios (i.e. quotient of runtimes in the two packages) of the functions and regressing the ratios onto the number of points in the input shapes. Overall we found that TDApplied’s functions are faster and scale better than their R counterparts, and scale similarly to Python counterparts. These results indicate that TDApplied is a powerful and efficient tool for applied topological data analysis in R.

Benchmarking PyH Against TDAstatscalculate_homology Function and rgudhi’s compute_persistence Function

The long calculation time of persistence diagrams is likely a large contributing factor to the slow adoption of topological data analysis for applied data science. Much research has been carried out in order to speed up these calculations, but the current state-of-the-art is the persistent cohomology algorithm (Silva and Vejdemo-Johansson 2011). In R, the TDAstatscalculate_homology function is the fastest option for persistence diagram calculations (Somasundaram et al. 2021), being a wrapper for the ripser (Bauer 2015) persistent cohomology engine. TDApplied’s PyH function is a Python wrapper for the same engine, and rgudhi also provides a Python cohomology engine via the C++ library GUDHI (Lacombe et al. 2019) with its compute_persistence function. Note that rgudhi’s function required a number of lines of code to use (i.e. to calculate a persistence diagram across all desired dimensions from a dataset) and we benchmarked all of these lines (see the benchmarking script in the exec directory for details). We benchmarked all three function’s run time on circles, spheres and tori. The results were as follows:

As we can see the run time of compute_persistence was fastest, followed by PyH and finally calculate_homology. We then used linear models to analyze how the three functions scale compared to each other for all three shapes, regressing the ratio of runtimes onto the number of points in the data set. For the models dividing the runtime of calculate_homology by that of PyH, the intercepts were significant and positive but the slopes were positive and either barely significant or not significant (at the \(\alpha = 0.05\) level). This suggests that the runtime of the two functions scale similarly, but there was a constant speed increase of PyH which differed by shape (about 1.06 times faster for circles, 1.75 times faster for tori and 3 times faster for spheres). The models for dividing the runtime of PyH by that of compute_persistence yielded similar results - roughly constant multiplicative runtime decreases with rgudhi. Overall, if Python is available to a TDApplied user then the PyH function may provide speedups compared to the TDAstats function calculate_homology, but in that case rgudhi’s compute_persistence is the fastest option at the expense of more lines of code (computing a persistence diagram of just dimension 0 as a data frame takes 5 lines of code).

Benchmarking the TDApplied diagram_distance and TDA wasserstein Functions

Computing wasserstein (or bottleneck) distances between persistence diagrams is a key feature of some of the main topological data analysis software packages in R and Python. However, these calculations can be very expensive, rendering practical applications of topological data analysis nearly unfeasible. Since TDAstats has implemented an unconventional distance calculation (see the package vignette “Comparing Distance Calculations” for details), we will benchmark TDApplied’s diagram_distance function against the TDA wasserstein function on spheres and tori, calculating their distance in dimensions 0, 1 and 2 and recording the total time. The results were as follows (the 95% confidence interval was too small to be plotted for the diagram_distance function):

A linear model, regressing the ratio of TDA’s runtime divided by TDApplied’s runtime onto the number of points in the data set, found a significant positive coefficient of the number of points. This suggests that diagram_distance scales better than wasserstein, and the model estimated a 95x speed up for 1000 data points. These results suggest that distance calculations with TDApplied are faster and more scalable, making the applications of statistics and machine learning with persistence diagrams more feasible in R. This is why the TDA distance calculation was not used in the TDApplied package.

Benchmarking the TDApplied diagram_distance Function against persim’s wasserstein Functions

While the functionality of Python packages for topological data analysis packages are out of the scope for an R package, in order to fully situate TDApplied in the landscape of topological data analysis software we will benchmark the diagram_distance function against its counterpart from the scikit-TDA collection of libraries, namely the wasserstein function from the persim Python module. The R package reticulate (Ushey, Allaire, and Tang 2022) was used to carry out this benchmarking, via installing, importing and using the persim module. This benchmarking procedure also used spheres and tori, calculating distances in dimensions 0, 1 and 2, and the results were as follows (confidence intervals for the persim package were too small to be plotted):