1.1 Introduction

Cohorts are often a key component in studies that use the OMOP CDM. They may be created to represent various clinical events of interest and we often use cohorts in combination, whether it is to identify outcomes among people with an exposure of interest, report baseline comorbidites among a certain study population, or for many other possible reasons.

Cohorts have a particular format in the OMOP CDM, which we can see for two cohort tables created by the mockPatientProfiles() function provided by PatientProfiles, which mimics a database in the OMOP CDM format. We can see the first cohort table contains 2 cohorts while the second contains 3 cohorts.

library(CDMConnector)
library(PatientProfiles)
library(duckdb)
library(dplyr)
library(ggplot2)

cdm <- mockPatientProfiles(numberIndividuals = 1000)

cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 2, 1, 3, 1, 1, 1, 1, 2, 3, 2, 3, 2, 2, 2, 2, 2…
## $ subject_id           <int> 415, 110, 378, 325, 920, 206, 806, 392, 299, 862,…
## $ cohort_start_date    <date> 1918-05-29, 1973-08-07, 1971-12-19, 1960-09-18, …
## $ cohort_end_date      <date> 1929-12-11, 1993-02-11, 1990-02-19, 1962-02-03, …
settings(cdm$cohort1)
## # A tibble: 3 × 2
##   cohort_definition_id cohort_name
##                  <int> <chr>      
## 1                    1 cohort_1   
## 2                    2 cohort_2   
## 3                    3 cohort_3
cdm$cohort2 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 2, 3, 1, 3, 2, 2, 1, 1, 2, 3, 1, 2, 2, 3, 2, 1…
## $ subject_id           <int> 718, 670, 304, 241, 332, 962, 1, 247, 768, 536, 7…
## $ cohort_start_date    <date> 1965-02-18, 1911-12-07, 2005-07-01, 1959-10-09, …
## $ cohort_end_date      <date> 1968-01-14, 1914-09-10, 2011-06-06, 1963-07-17, …
settings(cdm$cohort2)
## # A tibble: 3 × 2
##   cohort_definition_id cohort_name
##                  <int> <chr>      
## 1                    1 cohort_1   
## 2                    2 cohort_2   
## 3                    3 cohort_3

1.2 Identifying cohort intersections

PatientProfiles provides four functions for identifying cohort intersections (the presence of an individual in two cohorts). The first addCohortIntersectFlag() adds a flag of whether someone appeared in the other cohort during a time window. The second, addCohortIntersectCount(), counts the number of times someone appeared in the other cohort in the window. A third, addCohortIntersectDate(), adds the date when the intersection occurred. And the fourth, addCohortIntersectDays(), adds the number of days to the intersection.

We can see each of these below. Note that they add variables to our cohort table of interest, and identify intersections over a given window. As we can see, if our target cohort table contains multiple cohorts then by default these functions will add one new variable per cohort.

Let’s start by adding flag and count variables using a window of 180 days before to 180 days after the cohort start date in our table of interest. By default the cohort start date of our cohort of interest will be used as the index date, with the cohort start to cohort end date of the target cohort then used to check for an intersection.

cdm$cohort1 %>%
  addCohortIntersectFlag(
    indexDate = "cohort_start_date",
    targetCohortTable = "cohort2",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(-180, 180))
  ) |>
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 2, 3, 1, 2, 1, 1, 2, 3, 3, 3, 3, 1, 3, 3…
## $ subject_id           <int> 110, 378, 325, 299, 916, 878, 178, 332, 408, 648,…
## $ cohort_start_date    <date> 1973-08-07, 1971-12-19, 1960-09-18, 1984-05-05, …
## $ cohort_end_date      <date> 1993-02-11, 1990-02-19, 1962-02-03, 1994-12-04, …
## $ cohort_3_m180_to_180 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1…
## $ cohort_1_m180_to_180 <dbl> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0…
## $ cohort_2_m180_to_180 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0…
cdm$cohort1 %>%
  addCohortIntersectCount(
    indexDate = "cohort_start_date",
    targetCohortTable = "cohort2",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(-180, 180))
  ) |>
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 2, 3, 1, 2, 1, 1, 2, 3, 3, 3, 3, 1, 3, 3…
## $ subject_id           <int> 110, 378, 325, 299, 916, 878, 178, 332, 408, 648,…
## $ cohort_start_date    <date> 1973-08-07, 1971-12-19, 1960-09-18, 1984-05-05, …
## $ cohort_end_date      <date> 1993-02-11, 1990-02-19, 1962-02-03, 1994-12-04, …
## $ cohort_1_m180_to_180 <dbl> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0…
## $ cohort_2_m180_to_180 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0…
## $ cohort_3_m180_to_180 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1…

Next we can add the date of the intersection and the days to the intersection. When identifying these variables we use only one date in our target table, which by default will be the cohort start date. In addition by default the first intersection that occurs within our window will be used.

cdm$cohort1 %>%
  addCohortIntersectDate(
    indexDate = "cohort_start_date",
    targetCohortTable = "cohort2",
    targetDate = "cohort_start_date",
    window = list(c(-180, 180)),
    order = "first"
  ) |>
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 3, 2, 2, 2, 2, 2, 1…
## $ subject_id           <int> 415, 110, 378, 920, 206, 806, 392, 299, 862, 273,…
## $ cohort_start_date    <date> 1918-05-29, 1973-08-07, 1971-12-19, 1966-01-06, …
## $ cohort_end_date      <date> 1929-12-11, 1993-02-11, 1990-02-19, 1970-02-21, …
## $ cohort_3_m180_to_180 <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ cohort_2_m180_to_180 <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ cohort_1_m180_to_180 <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
cdm$cohort1 %>%
  addCohortIntersectDays(
    indexDate = "cohort_start_date",
    targetCohortTable = "cohort2",
    targetDate = "cohort_start_date",
    window = list(c(-180, 180)),
    order = "first"
  ) |>
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 3, 2, 2, 2, 2, 2, 1…
## $ subject_id           <int> 415, 110, 378, 920, 206, 806, 392, 299, 862, 273,…
## $ cohort_start_date    <date> 1918-05-29, 1973-08-07, 1971-12-19, 1966-01-06, …
## $ cohort_end_date      <date> 1929-12-11, 1993-02-11, 1990-02-19, 1970-02-21, …
## $ cohort_3_m180_to_180 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cohort_2_m180_to_180 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cohort_1_m180_to_180 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

1.3 Options for identifying cohort intersection

To consider the impact of the different options we can choose when identifying cohort intersections let´s consider a toy example with a single patient with common cold (diagnosed on the 1st February 2020 and ending on the 15th February 2020). This person had two records for aspirin, one ending shortly before their start date for common cold and the other starting during their record for common cold.

common_cold <- dplyr::tibble(
  cohort_definition_id = 1,
  subject_id = 1,
  cohort_start_date = as.Date("2020-02-01"),
  cohort_end_date = as.Date("2020-02-15")
)

aspirin <- dplyr::tibble(
  cohort_definition_id = c(1, 1),
  subject_id = c(1, 1),
  cohort_start_date = as.Date(c("2020-01-01", "2020-02-10")),
  cohort_end_date = as.Date(c("2020-01-28", "2020-03-15"))
)

We can visualise what this person’s timeline looks like.

bind_rows(
  common_cold |> mutate(cohort = "common cold"),
  aspirin |> mutate(cohort = "aspirin")
) |>
  mutate(record = as.character(row_number())) |>
  ggplot() +
  geom_segment(
    aes(
      x = cohort_start_date,
      y = cohort,
      xend = cohort_end_date,
      yend = cohort, col = cohort, fill = cohort
    ),
    size = 4.5, alpha = .5
  ) +
  geom_point(aes(x = cohort_start_date, y = cohort, color = cohort), size = 4) +
  geom_point(aes(x = cohort_end_date, y = cohort, color = cohort), size = 4) +
  ylab("") +
  xlab("") +
  theme_minimal() +
  theme(legend.position = "none")

Whether we consider there to be a cohort intersection between the common cold and aspirin cohorts will depend on what options we choose. To see this let’s first create a cdm reference containing our example.

cdm <- mockPatientProfiles(
  cohort1 = common_cold,
  cohort2 = aspirin,
  numberIndividuals = 2
)

If we consider the intersection relative to the cohort start date for common cold with a window of 0 to 0 (ie only the index date) then no intersection will be identified as the individual did not have an ongoing record for aspirin on that date.

cdm$cohort1 %>%
  addCohortIntersectFlag(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(0, 0)),
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_0_to_0      <dbl> 0

We could, however, change the index date to cohort end date in which case an intersection would be found.

cdm$cohort1 %>%
  addCohortIntersectFlag(
    targetCohortTable = "cohort2",
    indexDate = "cohort_end_date",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(0, 0))
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_0_to_0      <dbl> 1

Or we could also extend the window to include more time before or after which in both cases would lead to cohort intersections being found.

cdm$cohort1 %>%
  addCohortIntersectFlag(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(-90, 90)),
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_m90_to_90   <dbl> 1

With a window of 90 days before to 90 days after cohort start, the person would have a count of two cohort intersections.

cdm$cohort1 %>%
  addCohortIntersectCount(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetStartDate = "cohort_start_date",
    targetEndDate = "cohort_end_date",
    window = list(c(-90, 90)),
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_m90_to_90   <dbl> 2

With this same window, if we add the first cohort intersect date we will get the start date of the first record of aspirin.

cdm$cohort1 %>%
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "first"
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_m90_to_90   <date> 2020-01-01

But if we instead set order to last, we get the start date of the second record of aspirin.

cdm$cohort1 %>%
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "last"
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_m90_to_90   <date> 2020-02-10

1.4 Naming conventions for new variables

One last option relates to the naming convention used to for the new variables.

cdm$cohort1 %>%
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "last",
    nameStyle = "{cohort_name}_{window_name}"
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ cohort_1_m90_to_90   <date> 2020-02-10

We can instead choose a specific name (but this will only work if only one new variable will be added, otherwise we will get an error to avoid duplicate names).

cdm$cohort1 %>%
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "last",
    nameStyle = "my_new_variable"
  ) |>
  glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <dbl> 1
## $ subject_id           <dbl> 1
## $ cohort_start_date    <date> 2020-02-01
## $ cohort_end_date      <date> 2020-02-15
## $ my_new_variable      <date> 2020-02-10

In the other direction we could also include the estimate type in the name. This will be useful, for example, if we’re adding multiple different types of intersection values.

cdm$cohort1 |>
  addCohortIntersectDate(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "last",
    nameStyle = "{cohort_name}_{window_name}_{value}"
  ) |>
  addCohortIntersectDays(
    targetCohortTable = "cohort2",
    indexDate = "cohort_start_date",
    targetDate = "cohort_start_date",
    window = list(c(-90, 90)),
    order = "last",
    nameStyle = "{cohort_name}_{window_name}_{value}"
  ) |>
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id    <dbl> 1
## $ subject_id              <dbl> 1
## $ cohort_start_date       <date> 2020-02-01
## $ cohort_end_date         <date> 2020-02-15
## $ cohort_1_m90_to_90_date <date> 2020-02-10
## $ cohort_1_m90_to_90_days <dbl> 9