Train subtype for multivariate data — trainSubtypeClusterMulti • subtyper

This is the training module for subtype definition based on a matrix of measurements. Gaussian mixture clustering is provided via the ClusterR package; the method argument also lists kmeans and medoid, with kmeans as the default. One may want to train on a specific subgroup only, e.g. patients.

trainSubtypeClusterMulti(
  mxdfin,
  measureColumns,
  method = "kmeans",
  desiredk,
  maxk,
  groupVariable,
  group,
  frobNormThresh = 0.01,
  trainTestRatio = 0,
  distance_metric = NULL,
  flexweights = NULL,
  flexgroup = NULL,
  groupFun = NULL
)

Arguments

mxdfin: Input data frame
measureColumns: vector defining the data columns to be used for clustering. These methods may be sensitive to scaling, so the user may want to scale the columns beforehand.
method: string specifying the clustering method: GMM, kmeans, or medoid (default "kmeans")
desiredk: number of subtypes
maxk: maximum number of subtypes
groupVariable: name of the column that defines the group to use for training.
group: string defining a subgroup on which to train
frobNormThresh: fractional value less than 1 (default 0.01) giving the threshold on the relative change in reconstruction error from the previous iteration, 1 - F_cur / F_prev, where F is the reconstruction error measured by the Frobenius norm. This threshold determines the optimal number of clusters. For GMM clustering.
trainTestRatio: train/test split ratio used when finding the optimal number of clusters. For GMM clustering. If zero (the default), the data are not split. Otherwise, the reconstruction error is computed on the test data only.
distance_metric: see medoid methods in ClusterR
flexweights: optional weights
flexgroup: optional group
groupFun: optional function name to use in group-guided clustering e.g. minSumClusters
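
The frobNormThresh rule can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: the error values in frobErrors are hypothetical, and the stopping rule (keep the last k before the relative improvement drops below the threshold) is one plausible reading of the description above.

```r
# Hypothetical reconstruction errors (Frobenius norms) for k = 1, ..., 6.
# These numbers are made up for illustration only.
frobErrors <- c(100, 60, 45, 44.8, 44.7, 44.65)
frobNormThresh <- 0.01

# Relative change from the previous k: 1 - F_cur / F_prev.
# Element i compares k = i + 1 against k = i.
relChange <- 1 - frobErrors[-1] / frobErrors[-length(frobErrors)]

# Assumed rule: find the first step whose improvement is below the
# threshold and keep the k from before that step.
belowThresh <- which(relChange < frobNormThresh)
chosenK <- if (length(belowThresh) > 0) belowThresh[1] else length(frobErrors)
print(chosenK)
```

For an actual data matrix X and its reconstruction Xhat, the Frobenius norm of the residual can be computed in base R with norm(X - Xhat, type = "F").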

Value

the clustering object

Author

Avants BB

Examples

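library(subtyper)
mydf <- generateSubtyperData(100)
# select the measurement columns whose names contain "Random"
rbfnames <- names(mydf)[grep("Random", names(mydf))]
# train with the default method, considering at most 4 subtypes
gmmcl <- trainSubtypeClusterMulti(mydf, rbfnames, maxk = 4)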
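
Because the measureColumns description notes that these methods may be sensitive to scaling, it can help to standardize the columns first. The sketch below uses base R's scale() on a hypothetical data frame (demoDf and its column names are assumptions made for illustration, not output of the package). The final call is left commented out since it requires subtyper to be installed.

```r
# Hypothetical data frame with two measurement columns on very different scales.
set.seed(1)
demoDf <- data.frame(
  RandomA = rnorm(50, mean = 0, sd = 1),
  RandomB = rnorm(50, mean = 100, sd = 25)
)
measureCols <- c("RandomA", "RandomB")

# Center each column to mean 0 and scale to standard deviation 1.
demoDf[, measureCols] <- scale(demoDf[, measureCols])

# Check the result: means near 0, standard deviations near 1.
print(round(colMeans(demoDf[, measureCols]), 6))
print(round(apply(demoDf[, measureCols], 2, sd), 6))

# Assumed usage with the scaled columns (not run here; requires subtyper):
# cl <- trainSubtypeClusterMulti(demoDf, measureCols, maxk = 4)
```

Whether standardization is appropriate depends on the data; columns that already share a meaningful common unit may be better left unscaled.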