Gaussian Mixture Model (GMM)

hanaml.GaussianMixture is an R wrapper for the SAP HANA PAL Gaussian Mixture Model (GMM) algorithm.

hanaml.GaussianMixture(
  data = NULL,
  key = NULL,
  features = NULL,
  n.components = NULL,
  init.param = NULL,
  init.centers = NULL,
  covariance.type = NULL,
  shared.covariance = NULL,
  thread.ratio = NULL,
  max.iter = NULL,
  category.weight = NULL,
  categorical.variable = NULL,
  error.tol = NULL,
  regularization = NULL,
  random.seed = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key columns of `data`.
n.components	`integer, optional` Number of groups. Mandatory when init.param is not 'manual'.
init.param	`character` Specifies the initialization mode: `"farthest.first.traversal"`: The initial centers are chosen by the farthest-first traversal algorithm. `"manual"`: The initial centers are the init.centers supplied by the user. `"random.means"`: The initial centers are randomly weighted means of all the data. `"k.means++"`: The initial centers are chosen using the k-means++ approach.
init.centers	`vector of integers/characters, optional` Specifies the rows of `data` to use as initial centers, identified by their values in the ID column. For example, to use the rows with ID 1, 5, and 9 as centers, set init.centers = c(1, 5, 9). Mandatory when init.param is 'manual'.
covariance.type	`character, optional` Specifies the type of covariance matrices in the model: `"full"`: use full covariance matrices. `"diag"`: use diagonal covariance matrices. `"tied.diag"`: use diagonal covariance matrices with all equal diagonal entries. Defaults to "full".
shared.covariance	`logical, optional` All clusters share the same covariance matrix if TRUE. Defaults to FALSE.
thread.ratio	`double, optional` Controls the proportion of available threads that this function can use. The value ranges from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
max.iter	`integer, optional` Specifies the maximum number of iterations for the EM algorithm. Defaults to 100.
category.weight	`double, optional` Represents the weight of category attributes. Defaults to 0.707.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical. By default, the treatment depends on the column type: "VARCHAR" and "NVARCHAR" columns are categorical; "INTEGER" and "DOUBLE" columns are continuous. Valid only for columns of "INTEGER" type; ignored otherwise. No default value.
error.tol	`double, optional` Convergence threshold for exiting iterations. Defaults to 1e-5.
regularization	`double, optional` Regularization added to the diagonal of the covariance matrices to ensure that they are positive definite. Defaults to 1e-6.
random.seed	`integer, optional` Indicates the seed used to initialize the random number generator: `0`: Uses the system time. `Not 0`: Uses the specified seed. Defaults to 0.
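
The covariance.type and regularization arguments can be easier to picture with concrete matrices. The following base R sketch only illustrates the shape of each covariance type and the effect of adding a small value to the diagonal; it does not call PAL, and the way PAL actually estimates the entries may differ. The sample values and the helper name is_pos_def are assumptions made for illustration.

```r
# Two continuous features shaped like X1 and X2 in the Examples section,
# plus a constant third column that makes the covariance matrix singular.
x <- cbind(X1 = c(0.10, 0.11, 0.10, 0.11, 0.12),
           X2 = c(0.10, 0.10, 0.11, 0.11, 0.11),
           X3 = c(1, 1, 1, 1, 1))

full_cov <- cov(x)                               # "full": any symmetric matrix
diag_cov <- diag(diag(full_cov))                 # "diag": off-diagonal entries zeroed
tied_cov <- diag(mean(diag(full_cov)), ncol(x))  # "tied.diag": one shared variance

# Hypothetical helper: TRUE when a Cholesky factorization succeeds,
# which is the case only for positive-definite matrices.
is_pos_def <- function(m) {
  tryCatch({
    chol(m)
    TRUE
  }, error = function(e) FALSE)
}

regularization <- 1e-6                           # documented default value
reg_cov <- full_cov + diag(regularization, ncol(x))

print(is_pos_def(full_cov))  # constant column: zero variance on the diagonal
print(is_pos_def(reg_cov))   # same matrix after regularizing the diagonal
```

The constant column stands in for any feature with no spread inside a cluster, which is the situation where an unregularized covariance matrix cannot be inverted.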

Value

Returns a "GaussianMixture" object with the following values:

labels : DataFrame
Label assigned to each sample.
model : DataFrame
Model content.
stats : DataFrame
Statistic value.
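
As a minimal sketch, the three values can be pulled into local data.frames in one step. The helper name collect_gmm_results is hypothetical, and the code assumes that model and stats expose the same Collect() method that the Examples section uses on labels.

```r
# Hypothetical helper: collect every result table of a fitted model.
# Assumes each component has a Collect() method, as gmm$labels does
# in the Examples section.
collect_gmm_results <- function(gmm) {
  parts <- c("labels", "model", "stats")
  results <- lapply(parts, function(name) gmm[[name]]$Collect())
  names(results) <- parts
  results
}

# Usage with a fitted model (requires a live SAP HANA connection):
# local <- collect_gmm_results(gmm)
# head(local$labels)
```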

Examples

Input DataFrame data:

> data$Collect()
ID  X1   X2   X3
0  0.10  0.10  1
1  0.11  0.10  1
2  0.10  0.11  1
3  0.11  0.11  1
4  0.12  0.11  1

Call the function:

> gmm <- hanaml.GaussianMixture(data = data,
                                key = "ID",
                                n.components = 2,
                                init.param = "k.means++",
                                covariance.type = "full",
                                shared.covariance = TRUE,
                                thread.ratio = 0,
                                max.iter = 100,
                                category.weight = 0.707,
                                error.tol = 2.5,
                                regularization = 2.5,
                                random.seed = 5)

Output:

 > gmm$labels$Collect()
    ID  CLUSTER_ID  PROBABILITY
 1   0           0            1
 2   1           0            1
 3   2           0            0
 4   3           0            0
 5   4           0            0
 6   0           1            0
 7   1           1            0
 8   2           1            1
 9   3           1            1
 10  4           1            1
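
Manual initialization follows the same pattern as the call above. The sketch below wraps it in a hypothetical helper, fit_gmm_manual, and assumes the key column is named "ID" as in the input shown earlier. n.components is left out because the Arguments section marks it as mandatory only when init.param is not 'manual'.

```r
# Hypothetical helper wrapping the documented call for manual initialization.
# `data` is a DataFrame with an "ID" key column; `center.ids` holds the ID
# values of the rows to use as initial centers.
fit_gmm_manual <- function(data, center.ids) {
  stopifnot(length(center.ids) >= 1)
  hanaml.GaussianMixture(data = data,
                         key = "ID",
                         init.param = "manual",
                         init.centers = center.ids)
}

# Usage (requires the package that provides hanaml.GaussianMixture and a
# live SAP HANA connection); rows with ID 0 and 4 become the initial centers:
# gmm.manual <- fit_gmm_manual(data, c(0, 4))
# gmm.manual$labels$Collect()
```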
