Minimize the difference between two data frames based on t-statistic.

This function generates random subsets of df1 and selects the one that minimizes the difference from df2 over a specified set of columns, as measured by the t-statistic. Authored by Avants and Chat-GPT 3.5.

match_cohort_pair(
  df1,
  df2,
  cols,
  sample_size,
  num_iterations = 1000,
  restrict_df1 = 0.05,
  option = "optimal",
  verbose = TRUE
)

Arguments

df1: Data frame to be subsetted.
df2: Data frame used as a reference for comparison.
cols: Vector of column names used for matching.
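sample_size: Number of rows to sample from df1.
num_iterations: Number of random subsets to generate. Defaults to 1000.
restrict_df1: Numeric lower quantile used to restrict df1, based on the values of the first column in cols, so that its range matches that of df2. Defaults to 0.05.
option: Matching strategy, either "random" or "optimal". Defaults to "optimal".
verbose: Logical. Defaults to TRUE.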
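
The restrict_df1 description leaves the exact rule open. The sketch below shows one possible reading, offered as an assumption rather than the function's actual behavior: rows of df1 are kept only when the first matching column lies between the q and 1 - q quantiles of the same column in df2. The helper sketch_restrict and the upper bound 1 - q are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): one possible reading of
# restrict_df1. Keeps rows of df1 whose `col` value lies inside the
# [q, 1 - q] quantile range of the same column in df2.
sketch_restrict <- function(df1, df2, col, q = 0.05) {
  bounds <- unname(quantile(df2[[col]], probs = c(q, 1 - q), na.rm = TRUE))
  keep <- !is.na(df1[[col]]) & df1[[col]] >= bounds[1] & df1[[col]] <= bounds[2]
  df1[keep, , drop = FALSE]
}

set.seed(1)
wide <- data.frame(A = rnorm(200, sd = 3))   # broader range than the reference
narrow <- data.frame(A = rnorm(100))         # reference data frame
trimmed <- sketch_restrict(wide, narrow, "A")
```

Under this reading, trimmed contains only the rows of wide whose A values overlap the central range of narrow.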

Value

Row names of the subset of df1 that minimizes the difference from df2 in terms of the t-statistic.

Examples

set.seed(123)
df1 <- data.frame(A = rnorm(100), B = factor(sample(1:3, 100, replace = TRUE)), C = rnorm(100))
df2 <- data.frame(A = rnorm(50), B = factor(sample(1:3, 50, replace = TRUE)), C = rnorm(50))
matching_cols <- c("A", "B")
# matched_subset <- match_cohort_pair(df1, df2, matching_cols, sample_size = 50)
# print(matched_subset)
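
The random-subset idea in the description can be illustrated with a minimal sketch. This is not the package's implementation: the helper sketch_random_match is hypothetical, it assumes numeric matching columns only, and it scores each candidate subset by the sum of absolute Welch t-statistics across the columns, which is one plausible choice among several.

```r
# Hypothetical helper (not part of the package): random-search matching.
# Draws `num_iterations` random subsets of df1 and keeps the one whose
# summed absolute t-statistic against df2 is smallest.
# Assumes every column in `cols` is numeric.
sketch_random_match <- function(df1, df2, cols, sample_size,
                                num_iterations = 1000) {
  stopifnot(sample_size <= nrow(df1),
            all(cols %in% names(df1)),
            all(cols %in% names(df2)))
  best_score <- Inf
  best_rows <- NULL
  for (i in seq_len(num_iterations)) {
    idx <- sample(nrow(df1), sample_size)  # sample rows without replacement
    # Sum of absolute Welch t-statistics over the matching columns
    score <- sum(vapply(cols, function(cc) {
      abs(unname(t.test(df1[idx, cc], df2[[cc]])$statistic))
    }, numeric(1)))
    if (score < best_score) {
      best_score <- score
      best_rows <- rownames(df1)[idx]
    }
  }
  best_rows
}

set.seed(123)
d1 <- data.frame(A = rnorm(100), C = rnorm(100))
d2 <- data.frame(A = rnorm(50, mean = 0.5), C = rnorm(50))
rows <- sketch_random_match(d1, d2, c("A", "C"),
                            sample_size = 40, num_iterations = 200)
matched <- d1[rows, ]  # row names index back into the original data frame
```

As with the documented return value, the sketch returns row names, so the matched subset is recovered by indexing the original data frame.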