GaussianMixture

class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)

Gaussian Mixture Model (GMM) is a probabilistic model used for modeling data points that are assumed to be generated from a mixture of Gaussian distributions. It is a parametric model that represents the probability distribution of the data as a weighted sum of multiple Gaussian distributions, also known as components or clusters.
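As a hedged illustration of the "weighted sum" described above, the following sketch evaluates a one-dimensional mixture density in plain Python. The names normal_pdf and mixture_pdf and the component values are assumptions made for this example; they are not part of hana_ml or PAL.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of component densities; the weights are expected to sum to 1."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

# Two illustrative components (hypothetical values, not taken from PAL).
weights = [0.5, 0.5]
means = [0.0, 10.0]
variances = [1.0, 1.0]

density_at_zero = mixture_pdf(0.0, weights, means, variances)
density_at_ten = mixture_pdf(10.0, weights, means, variances)
```

Because the two components have equal weight and variance, the density is the same at both component means.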

Parameters:
init_param{'farthest_first_traversal', 'manual', 'random_means', 'kmeans++'}

Specifies the initialization mode.

  • 'farthest_first_traversal': The farthest-first traversal algorithm provides the initial cluster centers.

  • 'manual': User-provided values (init_centers) serve as initial centers.

  • 'random_means': Initial centers become the weighted means of randomly chosen data points.

  • 'kmeans++': Initial centers are determined by the k-means++ method.

n_componentsint, optional

Specifies the number of Gaussian distributions.

This parameter becomes mandatory when init_param is not 'manual'.

init_centerslist of int/str

List of row identifiers in data that are to be used as initial centers.

This parameter becomes mandatory when init_param is 'manual'.
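The dependency between init_param, n_components and init_centers described above can be sketched as a small check. The helper check_init_args is hypothetical and only mirrors the documented rules; hana_ml performs its own validation, which may differ in detail.

```python
ALLOWED_INIT_PARAMS = {'farthest_first_traversal', 'manual', 'random_means', 'kmeans++'}

def check_init_args(init_param, n_components=None, init_centers=None):
    """Return True if the arguments satisfy the documented rules, else raise ValueError."""
    if init_param not in ALLOWED_INIT_PARAMS:
        raise ValueError("unknown init_param: %r" % (init_param,))
    if init_param == 'manual':
        # 'manual' takes its centers from user-provided row identifiers.
        if not init_centers:
            raise ValueError("init_centers is required when init_param is 'manual'")
    elif n_components is None:
        # Every other mode needs the number of Gaussian distributions.
        raise ValueError("n_components is required when init_param is not 'manual'")
    return True
```

For example, check_init_args('manual', init_centers=[0, 12]) passes, while check_init_args('kmeans++') raises ValueError because n_components is missing.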

covariance_type{'full', 'diag', 'tied_diag'}, optional

Specifies the type of covariance matrices to be utilized in the model.

  • 'full': Utilizes full covariance matrices.

  • 'diag': Implements diagonal covariance matrices.

  • 'tied_diag': Applies diagonal covariance matrices with equal diagonal elements.

Defaults to 'full'.
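To make the three structures concrete, the sketch below reduces a covariance matrix (a list of lists) to each form. The helper as_covariance_type is hypothetical, and averaging the diagonal for 'tied_diag' is a choice made for this illustration only; PAL estimates the constrained matrices directly during training.

```python
def as_covariance_type(cov, covariance_type):
    """Return a copy of the square matrix cov restricted to the named structure."""
    n = len(cov)
    if covariance_type == 'full':
        # All entries are kept, including off-diagonal covariances.
        return [row[:] for row in cov]
    if covariance_type == 'diag':
        # Off-diagonal entries are zeroed; each feature keeps its own variance.
        return [[cov[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
    if covariance_type == 'tied_diag':
        # Diagonal matrix whose diagonal elements are all equal.
        shared = sum(cov[i][i] for i in range(n)) / n
        return [[shared if i == j else 0.0 for j in range(n)] for i in range(n)]
    raise ValueError("unknown covariance_type: %r" % (covariance_type,))

example_cov = [[2.0, 0.5], [0.5, 4.0]]
diag_cov = as_covariance_type(example_cov, 'diag')
tied_cov = as_covariance_type(example_cov, 'tied_diag')
```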

shared_covariancebool, optional

If set to True, all clusters will share the same covariance matrix.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads that can be used.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

max_iterint, optional

Specifies the maximum number of iterations of the EM algorithm.

Defaults to 100.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

category_weightfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

error_tolfloat, optional

Defines the error tolerance, serving as a termination condition for the algorithm.

Defaults to 1e-5.
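How max_iter and error_tol bound an EM run can be sketched with a one-dimensional mixture in plain Python. This is an illustration under stated assumptions, not PAL's implementation: the source does not say which quantity error_tol is compared against, so the sketch assumes the absolute change in total log-likelihood between iterations. The function em_1d and its data are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_1d(xs, init_means, max_iter=100, error_tol=1e-5, regularization=1e-6):
    """Fit a 1-D Gaussian mixture; returns (weights, means, variances, iterations)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # E-step: responsibilities of each component and total log-likelihood.
        resp = []
        ll = 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # Assumed stopping rule: change in log-likelihood below error_tol.
        if prev_ll is not None and abs(ll - prev_ll) < error_tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = (sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
                            + regularization)
    return weights, means, variances, iterations

# Two well-separated groups, loosely modelled on column X1 of the example data.
xs = [0.10, 0.11, 0.12, 0.13, 10.10, 10.11, 10.12, 10.13]
weights, means, variances, iterations = em_1d(xs, init_means=[0.0, 10.0])
```

On well-separated data such as this, the loop stops on the tolerance check long before max_iter is reached; a smaller error_tol or harder data would require more iterations.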

regularizationfloat, optional

Represents the regularization factor added to the diagonal elements of covariance matrices to guarantee their positive-definiteness.

Defaults to 1e-6.
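The effect of adding a small factor to the diagonal can be shown on a 2x2 matrix. The helpers regularize and det2 are hypothetical names used only for this sketch.

```python
def regularize(cov, regularization=1e-6):
    """Return a copy of cov with the regularization factor added to each diagonal element."""
    n = len(cov)
    return [[cov[i][j] + (regularization if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Two perfectly correlated features give a singular covariance matrix.
singular = [[1.0, 1.0], [1.0, 1.0]]
fixed = regularize(singular)
```

Here det2(singular) is zero, so the matrix cannot be inverted, while det2(fixed) is strictly positive. Together with the positive diagonal, this makes the regularized 2x2 symmetric matrix positive definite.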

random_seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: The system time is used as the seed.

  • Not 0: The user-defined value is used as the seed.

Defaults to 0.

Examples

Input dataframe df:

>>> df.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Create a GMM instance:

>>> gmm = GaussianMixture(init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)

Perform fit():

>>> gmm.fit(data=df, key='ID')

Output:

>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID  PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9   10           0          1.0
10  11           0          1.0
11  12           0          1.0
12  13           0          1.0
13  14           0          0.0
>>> gmm.stats_.collect()
         STAT_NAME     STAT_VALUE
0   log-likelihood        11.7199
1              aic      -504.5536
2              bic      -480.3900
>>> gmm.model_.collect()
   ROW_INDEX    CLUSTER_ID           MODEL_CONTENT
0          0            -1           {"Algorithm":"GMM","Metadata":{"DataP...
1          1             0           {"GuassModel":{"covariance":[22.18895...
2          2             1           {"GuassModel":{"covariance":[22.19450...

Attributes:
model_DataFrame

Trained model content.

labels_DataFrame

Cluster membership probabilities for each data point.

stats_DataFrame

Statistics of the fitted model, such as log-likelihood, AIC and BIC.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, categorical_variable])

Perform GMM clustering on input dataset.

fit_predict(data, key[, features, ...])

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, categorical_variable=None)

Perform GMM clustering on input dataset.

Parameters:
dataDataFrame

Data to be clustered.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

Parameters:
dataDataFrame

Data to be clustered.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns:
DataFrame

Cluster membership probabilities.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns:
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
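The difference between the training output and the predict() result described above can be sketched as follows. The wrapper fit_then_predict is a hypothetical helper; it assumes train_df and new_df are hana_ml DataFrames backed by a live SAP HANA connection and that both contain the same feature columns. The parameter values are illustrative only.

```python
def fit_then_predict(train_df, new_df, key='ID'):
    """Fit a GMM on train_df, then assign clusters to the rows of new_df."""
    # Imported lazily so this sketch can be loaded without hana_ml installed.
    from hana_ml.algorithms.pal.mixture import GaussianMixture

    gmm = GaussianMixture(init_param='kmeans++', n_components=2, random_seed=1)
    gmm.fit(data=train_df, key=key)

    # labels_ holds the ID, CLUSTER_ID and PROBABILITY columns for the training rows.
    train_labels = gmm.labels_

    # predict() returns the ID, CLUSTER_ID and DISTANCE columns for the new rows.
    assignments = gmm.predict(data=new_df, key=key)
    return train_labels, assignments
```

Both return values are hana_ml DataFrames; calling collect() on either one fetches the rows for inspection.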

create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is a DataFrame, it has the following structure:

  • NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the value corresponding to each NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
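A dict-form state can be checked for the four required keys before it is passed to set_model_state(). The helper missing_state_keys and the placeholder values are hypothetical; real values come from a previously created model state.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')

def missing_state_keys(state):
    """Return the required keys that are absent from the state dict."""
    return [name for name in REQUIRED_STATE_KEYS if name not in state]

# Placeholder values only; substitute the details of an existing state.
example_state = {
    'STATE_ID': '<state id>',
    'HINT': '<hint>',
    'HOST': '<host>',
    'PORT': '<port>',
}

missing = missing_state_keys(example_state)
```

An empty list means all four keys are present; otherwise the list names what still has to be supplied.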

Inherited Methods from PALBase

In addition to the methods described above, the GaussianMixture class inherits methods from the PALBase class. Please refer to PAL Base for more details.