hanaml.GaussianMixture {hana.ml.r}    R Documentation

Gaussian Mixture Model (GMM)

Description

hanaml.GaussianMixture is an R wrapper for the PAL Gaussian Mixture Model (GMM).

Usage

hanaml.GaussianMixture(conn.context,
                       data = NULL,
                       key = NULL,
                       features = NULL,
                       n.components = NULL,
                       init.param = NULL,
                       init.centers = NULL,
                       covariance.type = NULL,
                       shared.covariance = NULL,
                       thread.ratio = NULL,
                       max.iter = NULL,
                       category.weight = NULL,
                       categorical.variable = NULL,
                       error.tol = NULL,
                       regularization = NULL,
                       random.seed = NULL)

Arguments

conn.context

ConnectionContext
Database connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-ID columns.

n.components

integer, optional
Number of groups.
Mandatory when init.param is not 'manual'.

init.param

character
Specifies the initialization mode:

  • 'farthest.first.traversal': The initial centers are given by the farthest-first traversal algorithm.

  • 'manual': The initial centers are the init.centers given by the user.

  • 'random.means': The initial centers are the means of all the data that are randomly weighted.

  • 'k.means++': The initial centers are given using the k-means++ approach.
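
To make the 'farthest.first.traversal' mode more concrete, here is a minimal base R sketch of the general farthest-first idea. It is not the PAL implementation: the function name farthest.first and the choice of the first row as the starting point are assumptions made only for illustration.

```r
# Illustrative farthest-first traversal (hypothetical helper, not part of hana.ml.r).
# Returns 1-based row indices of the selected centers.
farthest.first <- function(x, k, start = 1L) {
  x <- as.matrix(x)
  centers <- integer(k)
  centers[1] <- start                       # assumed starting point
  # squared distance from each row to its nearest selected center
  d2 <- rowSums(sweep(x, 2, x[start, ])^2)
  for (j in seq_len(k)[-1]) {
    centers[j] <- which.max(d2)             # pick the row farthest from all centers
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[centers[j], ])^2))
  }
  centers
}

x <- rbind(c(0, 0), c(1, 0), c(10, 0), c(5, 0))
idx <- farthest.first(x, 3)
idx
```

Each new center is the row with the largest distance to the centers already chosen, which tends to spread the initial centers across the data.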

init.centers

integer, optional
Specifies the rows to be used as initial centers, identified by their sequence numbers in the data table (starting from 0). For example, to select the rows with sequence numbers 1, 5, and 9 as centers, set init.centers = c(1, 5, 9).
Mandatory when init.param is 'manual'.
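
Because the sequence numbers start from 0 while R indexes from 1, positions picked in a local R copy of the data need to be shifted by one. The short sketch below shows the conversion; the variable r.rows is hypothetical, and it assumes the local row order matches the order of the data table.

```r
# Rows picked by 1-based position in a local copy of the data (hypothetical selection)
r.rows <- c(2, 6, 10)

# Sequence numbers for init.centers start from 0, so subtract one
init.centers <- r.rows - 1
init.centers
```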

covariance.type

character, optional
Specifies the type of covariance matrices in the model:

  • 'full': use full covariance matrices.

  • 'diag': use diagonal covariance matrices.

  • 'tied.diag': use diagonal covariance matrices with all equal diagonal entries.

Defaults to 'full'.
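
The three covariance structures can be illustrated in base R on local data. This is only a sketch of what each option constrains, assuming a single component and simulated data; it does not reproduce how PAL estimates the matrices.

```r
set.seed(42)
# Two simulated features with different spreads (illustrative data only)
x <- cbind(rnorm(50, sd = 1), rnorm(50, sd = 3))

# 'full': unrestricted symmetric covariance matrix
full <- cov(x)

# 'diag': keep the per-feature variances, set all covariances to zero
diag.cov <- diag(diag(full))

# 'tied.diag': one shared variance on every diagonal entry
tied.diag <- diag(mean(diag(full)), ncol(x))

full
diag.cov
tied.diag
```

For d features, these forms have d(d+1)/2, d, and 1 free parameters per covariance matrix respectively, so the more restricted types are cheaper to fit and less prone to overfitting on small data.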

shared.covariance

logical, optional
All clusters share the same covariance matrix if TRUE.
Defaults to FALSE.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Defaults to 0.

max.iter

integer, optional
Specifies the maximum number of iterations for the EM algorithm.
Defaults to 100.

category.weight

double, optional
Weight applied to categorical attributes.
Defaults to 0.707.

categorical.variable

character or list of characters, optional
Names of the columns in the data table to be treated as categorical variables.
No default value.

error.tol

double, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.
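
The interplay of max.iter and error.tol can be seen in a small one-dimensional EM loop written in base R. This is a sketch under assumptions: the helper em.gmm.1d is hypothetical, and stopping on the change in log-likelihood is one common convergence criterion, not necessarily the one PAL applies.

```r
# Illustrative 1-D EM for a Gaussian mixture (hypothetical helper, not PAL code).
em.gmm.1d <- function(x, mu, max.iter = 100, error.tol = 1e-6) {
  k <- length(mu)
  sigma2 <- rep(var(x), k)     # start every component with the overall variance
  w <- rep(1 / k, k)           # equal mixing weights
  loglik.old <- -Inf
  for (iter in seq_len(max.iter)) {
    # E-step: weighted density of each point under each component (n x k matrix)
    dens <- sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sqrt(sigma2[j])))
    loglik <- sum(log(rowSums(dens)))
    # Stop once the log-likelihood changes by less than error.tol (assumed criterion)
    if (abs(loglik - loglik.old) < error.tol) break
    loglik.old <- loglik
    resp <- dens / rowSums(dens)              # responsibilities
    # M-step: update weights, means, and variances
    nk <- colSums(resp)
    w <- nk / length(x)
    mu <- colSums(resp * x) / nk
    sigma2 <- colSums(resp * outer(x, mu, "-")^2) / nk
  }
  list(mu = mu, sigma2 = sigma2, weights = w, iterations = iter, loglik = loglik)
}

set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 10))
fit <- em.gmm.1d(x, mu = c(-1, 11))
fit$mu
fit$iterations
```

A looser error.tol ends the loop earlier at the cost of a less refined fit, while max.iter caps the run time when the tolerance is never reached.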

regularization

double, optional
Regularization value added to the diagonal of the covariance matrices to ensure that they are positive definite.
Defaults to 1e-6.
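
The effect of this kind of regularization can be checked locally in base R. The sketch below uses made-up data with two perfectly collinear columns, so the sample covariance matrix is singular; adding a small value to the diagonal lifts the smallest eigenvalue above zero. It illustrates the idea only and is not PAL code.

```r
# Second column is exactly twice the first, so the covariance matrix is singular
x <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
s <- cov(x)

# Smallest eigenvalue is numerically zero: s is not positive definite
min.eig <- min(eigen(s, symmetric = TRUE)$values)

# Add the regularization value to the diagonal
s.reg <- s + diag(1e-6, ncol(s))
min.eig.reg <- min(eigen(s.reg, symmetric = TRUE)$values)

min.eig
min.eig.reg

# The Cholesky factorization, which requires positive definiteness, now succeeds
r <- chol(s.reg)
```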

random.seed

integer, optional
Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Defaults to 0.

Format

R6Class object

Value

See Also

predict.GaussianMixture

Examples

## Not run: 
Input DataFrame data:
 ID  X1   X2   X3
 0  0.10  0.10  1
 1  0.11  0.10  1
 2  0.10  0.11  1
 3  0.11  0.11  1
 4  0.12  0.11  1

Model training, which returns a "GaussianMixture" object gmm:
> gmm <- hanaml.GaussianMixture(conn.context = conn,
                                data = data,
                                key = "ID",
                                n.components = 2,
                                init.param = 'k.means++',
                                covariance.type = 'full',
                                shared.covariance = TRUE,
                                thread.ratio = 0,
                                max.iter = 100,
                                category.weight = 0.707,
                                error.tol = 2.5,
                                regularization = 2.5,
                                random.seed = 5)

Expected output:
> gmm$labels$Collect()
      ID  CLUSTER_ID  PROBABILITY
       0     0            1
       1     0            1
       2     0            0
       3     0            0
       4     0            0
       0     1            0
       1     1            0
       2     1            1
       3     1            1
       4     1            1

## End(Not run)
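
The labels table in the example is in long format, with one row per ID and cluster. As a follow-up sketch in base R, the snippet below rebuilds the expected output shown above as a local data frame and derives one hard cluster assignment per ID by picking the cluster with the highest probability. It assumes the collected result is an ordinary data frame with these three columns.

```r
# Local copy of the expected output from the example above
labels <- data.frame(
  ID          = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
  CLUSTER_ID  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  PROBABILITY = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Reshape to a matrix with one row per ID and one column per cluster
prob <- with(labels, tapply(PROBABILITY, list(ID, CLUSTER_ID), sum))

# Pick the most probable cluster for each ID (first one wins on ties)
assigned <- colnames(prob)[max.col(prob, ties.method = "first")]
data.frame(ID = rownames(prob), CLUSTER_ID = assigned)
```

Here IDs 0 and 1 fall into cluster 0 and IDs 2 to 4 into cluster 1, matching the probabilities listed in the expected output.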

[Package hana.ml.r version 1.0.8 Index]