This is the training module for subtype definition based on a data matrix. It supports clustering based on Gaussian mixtures via the ClusterR package; the method argument also accepts kmeans and medoid. Training can be restricted to a specific subgroup, e.g. patients.

trainSubtypeClusterMulti(
  mxdfin,
  measureColumns,
  method = "kmeans",
  desiredk,
  maxk,
  groupVariable,
  group,
  frobNormThresh = 0.01,
  trainTestRatio = 0,
  distance_metric = NULL,
  flexweights = NULL,
  flexgroup = NULL,
  groupFun = NULL
)

Arguments

mxdfin

Input data frame

measureColumns

vector defining the data columns to be used for clustering. Note that these methods may be sensitive to scaling, so the user may want to scale the columns before training.

method

string naming the clustering method: GMM, kmeans, or medoid

desiredk

number of subtypes

maxk

maximum number of subtypes

groupVariable

name of the column that defines the group to use for training.

group

string defining a subgroup on which to train

frobNormThresh

fractional value less than 1: the threshold on the relative change in reconstruction error (measured by the Frobenius norm) from the previous iteration, 1 - F_cur / F_prev, that determines the optimal number of clusters. For GMM clustering.

trainTestRatio

Training/testing split ratio used when finding the optimal number of clusters. For GMM clustering. If zero, the data are not split. Otherwise, the reconstruction error is computed on the test data only.

distance_metric

distance metric for medoid clustering; see the medoid methods in ClusterR

flexweights

optional weights

flexgroup

optional group

groupFun

optional function name to use in group-guided clustering e.g. minSumClusters
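
The frobNormThresh criterion described above can be illustrated with a minimal sketch. This is not the package's implementation: it substitutes base R kmeans for the GMM fit, uses simulated toy data, and assumes a stopping rule (pick the first k at which the improvement 1 - F_cur / F_prev drops below the threshold). The names reconError, relChange and chosenK are hypothetical.

```r
# Toy data: three well-separated 2-D clusters (illustrative only).
set.seed(1)
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
X <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Reconstruction error: Frobenius norm of the residual after replacing
# each row by the center of its assigned cluster.
# NOTE: stats::kmeans is a stand-in here, not the GMM used by the package.
reconError <- function(X, k) {
  fit <- kmeans(X, centers = k, nstart = 10)
  norm(X - fit$centers[fit$cluster, , drop = FALSE], type = "F")
}

maxk <- 6
frobNormThresh <- 0.01
errs <- sapply(1:maxk, function(k) reconError(X, k))

# relChange[j] is the improvement when going from j to j + 1 clusters:
# 1 - F_cur / F_prev.
relChange <- 1 - errs[-1] / errs[-length(errs)]

# Assumed stopping rule: first k whose next step improves by less than
# the threshold; fall back to maxk if no step is that small.
below <- which(relChange < frobNormThresh)
chosenK <- if (length(below) > 0) below[1] else maxk
```

With a nonzero trainTestRatio, the same error would presumably be evaluated on held-out rows rather than on the training rows; that variant is not shown here.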

Value

the clustering object

Author

Avants BB

Examples

mydf = generateSubtyperData( 100 )
rbfnames = names(mydf)[grep("Random",names(mydf))]
gmmcl = trainSubtypeClusterMulti( mydf, rbfnames, maxk=4 )
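
Because the clustering methods may be sensitive to scaling, it can help to standardize the measurement columns before training. The following is a minimal sketch using base R only; the data frame toy and its columns DX, volume and score are made up for illustration, and the commented call at the end only shows how the scaled columns might be passed along.

```r
# Hypothetical toy data with measurements on very different scales.
set.seed(2)
toy <- data.frame(
  DX = rep(c("patient", "control"), each = 20),
  volume = rnorm(40, mean = 1500, sd = 200),
  score = rnorm(40, mean = 10, sd = 2)
)
measureColumns <- c("volume", "score")

# Center each measurement column to mean 0 and scale to unit standard deviation.
toyScaled <- toy
toyScaled[, measureColumns] <- scale(toy[, measureColumns])

# The scaled frame could then be used for training on a subgroup, e.g.:
# cl <- trainSubtypeClusterMulti(toyScaled, measureColumns, maxk = 4,
#                                groupVariable = "DX", group = "patient")
```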