Correlation Types


This vignette can be cited as:

citation("correlation")

Different Methods for Correlations

Correlation tests are arguably among the most commonly used statistical procedures and serve as a basis for many applications, such as exploratory data analysis, structural modeling, and data engineering. In this context, we present correlation, a toolbox for the R language (R Core Team 2019) and part of the easystats collection, focused on correlation analysis. Its goal is to be lightweight and easy to use, and to allow the computation of many different kinds of correlations, such as:

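Pearson's correlation: the covariance of the two variables divided by the product of their standard deviations.

\[r_{xy} = \frac{cov(x,y)}{SD_x \times SD_y}\]

Spearman's rank correlation: Pearson's correlation computed on the ranks of the two variables.

\[r_{s_{xy}} = \frac{cov(rank_x, rank_y)}{SD(rank_x) \times SD(rank_y)}\]

Kendall's rank correlation: based on the agreement in ordering (concordance or discordance) of every pair of observations.

\[\tau_{xy} = \frac{2}{n(n-1)}\sum_{i<j}sign(x_i - x_j) \times sign(y_i - y_j)\]

Partial correlation: the correlation between x and y after removing the linear effect of z from both, that is, the correlation between the residuals of x and y regressed on z.

\[r_{xy.z} = r_{e_{x.z},e_{y.z}}\]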
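The four formulas above can be checked by hand in base R. The following is a minimal sketch on simulated data (the variables `x`, `y` and `z` and the seed are illustrative assumptions, not part of the package); it compares each hand-computed value with `stats::cor()`. With continuous data there are no ties, so the Kendall formula above coincides with what `cor(method = "kendall")` returns.

```r
set.seed(42)
n <- 50
z <- rnorm(n)
x <- z + rnorm(n)
y <- 0.5 * x + z + rnorm(n)

# Pearson: covariance divided by the product of the standard deviations
r_pearson <- cov(x, y) / (sd(x) * sd(y))

# Spearman: the same formula applied to the ranks
r_spearman <- cov(rank(x), rank(y)) / (sd(rank(x)) * sd(rank(y)))

# Kendall: sum of sign products over all pairs i < j, scaled by 2 / (n(n - 1))
pairs <- combn(n, 2)
i <- pairs[1, ]
j <- pairs[2, ]
tau <- 2 / (n * (n - 1)) * sum(sign(x[i] - x[j]) * sign(y[i] - y[j]))

# Partial: correlation between the residuals of x ~ z and y ~ z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Compare with the built-in estimators
all.equal(r_pearson, cor(x, y))
all.equal(r_spearman, cor(x, y, method = "spearman"))
all.equal(tau, cor(x, y, method = "kendall"))
```

The partial correlation obtained from the residuals also agrees with the usual closed form based on the three pairwise Pearson correlations.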

Comparison

We will compute different types of correlations on generated data with different link strengths and link types.

Let’s first load the required libraries for this analysis.

library(correlation)
library(bayestestR)
library(see)
library(ggplot2)
library(datawizard)
library(poorman)

Utility functions

generate_results <- function(r, n = 100, transformation = "none") {
  data <- bayestestR::simulate_correlation(round(n), r = r)

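  if (transformation != "none") {
    # Build an R expression from the prefix, e.g. "exp(" becomes "exp(data$V2)";
    # prefixes containing "(" need a closing parenthesis.
    var <- if (grepl("(", transformation, fixed = TRUE)) "data$V2)" else "data$V2"
    transformation <- paste0(transformation, var)
    data$V2 <- eval(parse(text = transformation))
  }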

  out <- data.frame(n = n, transformation = transformation, r = r)

  out$Pearson <- cor_test(data, "V1", "V2", method = "pearson")$r
  out$Spearman <- cor_test(data, "V1", "V2", method = "spearman")$rho
  out$Kendall <- cor_test(data, "V1", "V2", method = "kendall")$tau
  out$Biweight <- cor_test(data, "V1", "V2", method = "biweight")$r
  out$Distance <- cor_test(data, "V1", "V2", method = "distance")$r

  out
}
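The transformation argument of `generate_results()` is a prefix string that is completed into an R expression and then evaluated. The sketch below isolates that step in base R so it can be inspected without simulating data. The helper name `build_expression()` and the toy `data` frame are hypothetical and only serve the illustration.

```r
# Hypothetical helper mirroring the string-building step of generate_results()
build_expression <- function(transformation) {
  # Prefixes that open a parenthesis need it closed after the variable name
  closing <- if (grepl("(", transformation, fixed = TRUE)) "data$V2)" else "data$V2"
  paste0(transformation, closing)
}

# Toy data standing in for the simulated data frame
data <- data.frame(V2 = c(-1, 0, 2))

build_expression("exp(")    # "exp(data$V2)"
build_expression("1/")      # "1/data$V2"
build_expression("cos(2*")  # "cos(2*data$V2)"

# Evaluating the completed expression applies the transformation to data$V2
eval(parse(text = build_expression("abs(")))  # 1 0 2
```

Because the expression is evaluated as text, any prefix passed this way must refer to the variable as `data$V2`, which is why the longer prefixes in the comparison below spell it out.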

Effect of Relationship Type

data <- data.frame()
for (r in seq(0, 0.999, length.out = 200)) {
  for (n in 100) {
    for (transformation in c(
      "none",
      "exp(",
      "log10(1+max(abs(data$V2))+",
      "1/",
      "tan(",
      "sin(",
      "cos(",
      "cos(2*",
      "abs(",
      "data$V2*",
      "data$V2*data$V2*",
      "ifelse(data$V2>0, 1, 0)*("
    )) {
      data <- rbind(data, generate_results(r, n, transformation = transformation))
    }
  }
}

data %>%
  datawizard::reshape_longer(
    select = -c("n", "r", "transformation"),
    names_to = "Type",
    values_to = "Estimation"
  ) %>%
  mutate(Type = factor(Type, levels = c("Pearson", "Spearman", "Kendall", "Biweight", "Distance"))) %>%
  ggplot(aes(x = r, y = Estimation, fill = Type)) +
  geom_smooth(aes(color = Type), method = "loess", alpha = 0, na.rm = TRUE) +
  geom_vline(aes(xintercept = 0.5), linetype = "dashed") +
  geom_hline(aes(yintercept = 0.5), linetype = "dashed") +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  see::theme_modern() +
  scale_color_flat_d(palette = "rainbow") +
  scale_fill_flat_d(palette = "rainbow") +
  facet_wrap(~transformation)

model <- data %>%
  datawizard::reshape_longer(
    select = -c("n", "r", "transformation"),
    names_to = "Type",
    values_to = "Estimation"
  ) %>%
  lm(r ~ Type / Estimation, data = .) %>%
  parameters::parameters()

arrange(model[6:10, ], desc(Coefficient))
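The formula `r ~ Type / Estimation` nests `Estimation` within `Type`, which yields one slope per correlation type. The sketch below uses a toy data set with two made-up types (`A` and `B`, an assumption for illustration) to show how the coefficients are laid out. With five types, the same layout gives one intercept, four type offsets and five slopes, which is why rows 6 to 10 are selected above.

```r
set.seed(1)

# Toy data: two hypothetical types with true slopes 1 and 2
toy <- data.frame(
  Type = rep(c("A", "B"), each = 20),
  Estimation = runif(40)
)
toy$r <- ifelse(toy$Type == "A", 1, 2) * toy$Estimation + rnorm(40, sd = 0.01)

# Type / Estimation expands to Type + Type:Estimation
fit <- lm(r ~ Type / Estimation, data = toy)

names(coef(fit))
# "(Intercept)" "TypeB" "TypeA:Estimation" "TypeB:Estimation"

# The last two coefficients are the per-type slopes
coef(fit)[c("TypeA:Estimation", "TypeB:Estimation")]
```

A slope close to 1 means the estimate tracks the simulated r closely for that type; sorting the slopes is one way to rank the methods.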

As we can see, distance correlation captures the strength of the relationship even when it is severely non-linear.
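To illustrate why, here is a minimal base-R sketch of the classical sample distance correlation from Székely, Rizzo, and Bakirov (2007), applied to a purely quadratic relationship. The function name `dcor_manual()` is hypothetical, and this uncorrected estimator is an assumption made for illustration: it may differ from what `cor_test(method = "distance")` computes.

```r
# Hypothetical helper: classical (uncorrected) sample distance correlation
dcor_manual <- function(x, y) {
  double_center <- function(v) {
    d <- as.matrix(dist(v))  # pairwise absolute distances
    # Subtract row and column means, then add back the grand mean
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- double_center(x)
  B <- double_center(y)
  dcov2 <- mean(A * B)  # squared sample distance covariance
  sqrt(dcov2 / sqrt(mean(A * A) * mean(B * B)))
}

# A symmetric parabola: y depends entirely on x, but not linearly
x <- seq(-1, 1, length.out = 101)
y <- x^2

cor(x, y)          # essentially zero: Pearson misses the dependence
dcor_manual(x, y)  # clearly above zero
dcor_manual(x, x)  # 1 for identical variables
```

Pearson's coefficient only measures linear association, so the two halves of the parabola cancel out, whereas the distance-based estimator compares all pairwise distances and still detects the dependence.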

References

Bishara, Anthony J, and James B Hittner. 2017. “Confidence Intervals for Correlations When Data Are Not Normal.” Behavior Research Methods 49 (1): 294–309. https://doi.org/10.3758/s13428-016-0702-8.
Fieller, Edgar C, Herman O Hartley, and Egon S Pearson. 1957. “Tests for Rank Correlation Coefficients. I.” Biometrika 44 (3/4): 470–81. https://doi.org/10.1093/biomet/48.1-2.29.
Langfelder, Peter, and Steve Horvath. 2012. “Fast R Functions for Robust Correlations and Hierarchical Clustering.” Journal of Statistical Software 46 (11). https://www.jstatsoft.org/v46/i11/.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Székely, Gábor J, and Maria L Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics 3 (4): 1236–65.
Székely, Gábor J, Maria L Rizzo, and Nail K Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics 35 (6): 2769–94.