Generates random subsets of one data frame (df1) and selects the subset that minimizes the difference with another data frame (df2) on a specified set of columns, as measured by the t-statistic. Authored by Avants and ChatGPT 3.5.

match_cohort_pair(
  df1,
  df2,
  cols,
  sample_size,
  num_iterations = 1000,
  restrict_df1 = 0.05,
  option = "optimal",
  verbose = TRUE
)

Arguments

df1

Data frame to be subsetted.

df2

Data frame used as a reference for comparison.

cols

Vector of column names used for matching.

sample_size

Number of rows to sample from df1.

num_iterations

Number of random subsets to generate.

restrict_df1

Numeric. Lower quantile used to restrict df1, based on the values of the first column in cols, so that its range matches that of df2. The default is 0.05.

option

Character string, either "random" or "optimal". The default is "optimal".

verbose

Logical. If TRUE (the default), additional output is printed.

Value

Row names of the subset of df1 that minimizes the difference with df2 in terms of the t-statistic.

Examples

set.seed(123)
df1 <- data.frame(A = rnorm(100), B = factor(sample(1:3, 100, replace = TRUE)), C = rnorm(100))
df2 <- data.frame(A = rnorm(50), B = factor(sample(1:3, 50, replace = TRUE)), C = rnorm(50))
matching_cols <- c("A", "B")
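# sample_size has no default, so it must be supplied; 50 is an arbitrary choice here
# matched_subset <- match_cohort_pair(df1, df2, matching_cols, sample_size = 50)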
# print(matched_subset)
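
The random-search idea in the description can be illustrated with a minimal base R sketch. This is not the package's implementation. The helper name random_match_sketch and the data frames pool and ref are hypothetical, and scoring a candidate by the sum of absolute Welch t-statistics over the matching columns is an assumption. The sketch handles numeric columns only and ignores restrict_df1 and option.

```r
# Hypothetical helper, not part of the package: illustrates random search only.
random_match_sketch <- function(df1, df2, cols, sample_size, num_iterations = 1000) {
  best_rows <- NULL
  best_score <- Inf
  for (i in seq_len(num_iterations)) {
    # Draw a candidate subset of df1 rows without replacement
    idx <- sample(nrow(df1), sample_size)
    # Assumed score: sum of absolute Welch t-statistics over the numeric columns
    score <- sum(vapply(cols, function(cn) {
      abs(unname(t.test(df1[idx, cn], df2[[cn]])$statistic))
    }, numeric(1)))
    # Keep the candidate with the smallest score seen so far
    if (score < best_score) {
      best_score <- score
      best_rows <- rownames(df1)[idx]
    }
  }
  best_rows
}

set.seed(123)
ref  <- data.frame(A = rnorm(50), C = rnorm(50))
pool <- data.frame(A = rnorm(200, mean = 0.5), C = rnorm(200))
rows <- random_match_sketch(pool, ref, c("A", "C"),
                            sample_size = 50, num_iterations = 200)
length(rows)
```

The returned value is a character vector of row names, so the matched subset can be recovered with pool[rows, ]. More iterations give the search more candidates to choose from, at a proportional cost in run time.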