GaussianMixture

class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)

Representation of a Gaussian mixture model probability distribution.

Parameters:
init_param{'farthest_first_traversal','manual','random_means','kmeans++'}

Specifies the initialization mode.

  • farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.

  • manual: The initial centers are the rows that the user specifies in init_centers.

  • random_means: The initial centers are randomly weighted means of all the data.

  • kmeans++: The initial centers are given using the k-means++ approach.

n_componentsint

Specifies the number of Gaussian distributions.

Mandatory when init_param is not 'manual'.

init_centerslist of integers/strings

Specifies the rows of data to be used as initial centers by providing their IDs in data.

Mandatory when init_param is 'manual'.
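The two "Mandatory when" rules above can be checked before a model is constructed. The following is a minimal sketch in plain Python; the helper check_init_args is hypothetical and not part of hana_ml, and hana_ml itself may report these errors differently.

```python
VALID_INIT_PARAMS = ('farthest_first_traversal', 'manual',
                     'random_means', 'kmeans++')


def check_init_args(init_param, n_components=None, init_centers=None):
    """Hypothetical helper: apply the documented 'mandatory' rules and
    return the keyword arguments relevant to the chosen init mode."""
    if init_param not in VALID_INIT_PARAMS:
        raise ValueError("unknown init_param: %r" % (init_param,))
    if init_param == 'manual':
        # Manual mode: the initial centers are rows of data, given by ID.
        if not init_centers:
            raise ValueError(
                "init_centers is mandatory when init_param is 'manual'")
        return {'init_param': init_param,
                'init_centers': list(init_centers)}
    # Every other mode needs the number of Gaussian distributions.
    if n_components is None:
        raise ValueError(
            "n_components is mandatory when init_param is not 'manual'")
    return {'init_param': init_param, 'n_components': n_components}
```

The returned dictionary could then be passed on as keyword arguments, for example GaussianMixture(**check_init_args('manual', init_centers=[0, 12])).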

covariance_type{'full', 'diag', 'tied_diag'}, optional

Specifies the type of covariance matrices in the model.

  • full: use full covariance matrices.

  • diag: use diagonal covariance matrices.

  • tied_diag: use diagonal covariance matrices with all equal diagonal entries.

Defaults to 'full'.
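To make the three structures concrete, the sketch below restricts a small matrix to each covariance_type. It is an illustration in plain Python only: the function constrain_covariance is hypothetical, and averaging the diagonal for 'tied_diag' is an assumption made for the example, not a statement about how PAL estimates the shared value.

```python
def constrain_covariance(cov, covariance_type='full'):
    """Return a copy of the square matrix `cov` (list of lists) that has
    the structure implied by `covariance_type`."""
    n = len(cov)
    if covariance_type == 'full':
        # No restriction: keep every entry.
        return [list(row) for row in cov]
    if covariance_type == 'diag':
        # Keep the diagonal, zero out all off-diagonal entries.
        return [[cov[i][j] if i == j else 0.0 for j in range(n)]
                for i in range(n)]
    if covariance_type == 'tied_diag':
        # Assumption for illustration: use the mean of the diagonal as
        # the single value shared by all diagonal entries.
        shared = sum(cov[i][i] for i in range(n)) / n
        return [[shared if i == j else 0.0 for j in range(n)]
                for i in range(n)]
    raise ValueError("unknown covariance_type: %r" % (covariance_type,))
```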

shared_covariancebool, optional

All clusters share the same covariance matrix if True.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads that can be used.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.
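The rules for thread_ratio can be read as a small mapping from the ratio to a thread count. The sketch below is a hedged illustration in plain Python: the function threads_for_ratio is hypothetical, rounding up for fractional results is an assumption, and None stands for "PAL decides heuristically".

```python
import math


def threads_for_ratio(thread_ratio, available_threads):
    """Hypothetical reading of the thread_ratio rules.

    Returns a thread count, or None when the value is outside [0, 1]
    and the choice is left to PAL's heuristic."""
    if thread_ratio < 0 or thread_ratio > 1:
        return None
    if thread_ratio == 0:
        return 1  # 0 indicates a single thread
    # Assumption: round up, and never exceed the available threads.
    wanted = math.ceil(thread_ratio * available_threads)
    return max(1, min(available_threads, wanted))
```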

max_iterint, optional

Specifies the maximum number of iterations for the EM algorithm.

Defaults to 100.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.
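As an illustration of this rule, the sketch below splits the INTEGER columns of a table into categorical and continuous ones. It is plain Python and does not call hana_ml; the function split_integer_columns and the column_types mapping are hypothetical names introduced for the example.

```python
def split_integer_columns(column_types, categorical_variable=None):
    """Split INTEGER columns into (categorical, continuous) lists.

    column_types: dict mapping column name to SQL type name.
    categorical_variable: str or list of str, as in the parameter above.
    """
    if categorical_variable is None:
        chosen = []
    elif isinstance(categorical_variable, str):
        chosen = [categorical_variable]  # accept a single column name
    else:
        chosen = list(categorical_variable)
    integer_cols = [c for c, t in column_types.items() if t == 'INTEGER']
    categorical = [c for c in integer_cols if c in chosen]
    continuous = [c for c in integer_cols if c not in chosen]
    return categorical, continuous
```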

category_weightfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

error_tolfloat, optional

Specifies the error tolerance, which is the stop condition.

Defaults to 1e-5.
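The interplay of max_iter and error_tol can be seen in a toy EM loop. The following is a minimal one-dimensional sketch in plain Python, not PAL's implementation: the function em_1d is hypothetical, and stopping when the change in log-likelihood falls below error_tol is an assumption, since the exact stop criterion is not stated here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(
        2.0 * math.pi * var)


def em_1d(xs, means, max_iter=100, error_tol=1e-5, regularization=1e-6):
    """Toy EM for a 1-D Gaussian mixture.

    Returns (weights, means, variances, iterations_used)."""
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    prev_ll = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: membership probabilities of each point, plus the
        # log-likelihood under the current parameters.
        resp = []
        ll = 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj
            variances[j] += regularization  # keep the variance positive
        # Assumed stop condition: log-likelihood change below error_tol.
        if prev_ll is not None and abs(ll - prev_ll) < error_tol:
            break
        prev_ll = ll
    return weights, means, variances, iteration
```

With well-separated data the loop stops long before max_iter is reached; max_iter only acts as a safety limit when the tolerance is never met.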

regularizationfloat, optional

Regularization value added to the diagonal of the covariance matrices to ensure that they are positive-definite.

Defaults to 1e-6.
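The effect of the regularization value can be sketched on a singular 2x2 matrix. This is an illustration in plain Python with hypothetical helper names (regularize, det2); it shows only the "add to the diagonal" step described above.

```python
def regularize(cov, regularization=1e-6):
    """Return a copy of `cov` with `regularization` added to its diagonal."""
    n = len(cov)
    return [[cov[i][j] + (regularization if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Two perfectly correlated features give a singular covariance matrix.
singular = [[1.0, 1.0], [1.0, 1.0]]
fixed = regularize(singular)
```

Here det2(singular) is zero, while det2(fixed) is small but positive, so the regularized matrix can be inverted.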

random_seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.
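The seed convention can be mimicked with Python's own random module. This sketch is an analogy only: make_rng is a hypothetical helper, and PAL uses its own random number generator on the database side.

```python
import random
import time


def make_rng(random_seed=0):
    """0 seeds from the system time; any other value is used as given."""
    seed = int(time.time()) if random_seed == 0 else random_seed
    return random.Random(seed)
```

A non-zero seed therefore makes repeated runs reproducible, while 0 generally gives a different initialization on each run.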

Examples

Input dataframe df1 for training:

>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Creating the GMM instance:

>>> gmm = GaussianMixture(init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)

Performing fit() on the given dataframe:

>>> gmm.fit(data=df1, key='ID')

Expected output:

>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID  PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9   10           0          1.0
10  11           0          1.0
11  12           0          1.0
12  13           0          1.0
13  14           0          0.0
>>> gmm.stats_.collect()
         STAT_NAME     STAT_VALUE
0   log-likelihood        11.7199
1              aic      -504.5536
2              bic      -480.3900
>>> gmm.model_.collect()
   ROW_INDEX    CLUSTER_ID           MODEL_CONTENT
0          0            -1           {"Algorithm":"GMM","Metadata":{"DataP...
1          1             0           {"GuassModel":{"covariance":[22.18895...
2          2             1           {"GuassModel":{"covariance":[22.19450...

Attributes:
model_DataFrame

Trained model content.

labels_DataFrame

Cluster membership probabilities for each data point.

stats_DataFrame

Statistics of the trained model, such as the log-likelihood, AIC and BIC.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, categorical_variable])

Perform GMM clustering on input dataset.

fit_predict(data, key[, features, ...])

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, categorical_variable=None)

Perform GMM clustering on input dataset.

Parameters:
dataDataFrame

Data to be clustered.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.
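The defaulting rules for key and features can be summarized as follows. This is a minimal sketch in plain Python, assuming the column names and the index column are already known; resolve_key_and_features is a hypothetical helper, not a hana_ml function.

```python
def resolve_key_and_features(columns, key=None, index=None, features=None):
    """Apply the documented defaults for `key` and `features`.

    columns: list of all column names in data.
    index: name of the index column of data, or None if it is not set.
    """
    if key is None:
        if index is None:
            raise ValueError(
                "key must be provided when data has no index column")
        key = index  # fall back to the index column
    if features is None:
        # All columns except the ID column are taken as features.
        features = [c for c in columns if c != key]
    return key, list(features)
```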

fit_predict(data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

Parameters:
dataDataFrame

Data to be clustered.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns:
DataFrame

Cluster membership probabilities.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns:
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.

create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100). It must contain STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000). The values corresponding to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
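A state given as a dict can be checked for the required names before it is passed to set_model_state(). The sketch below is plain Python with hypothetical helpers (state_from_rows, missing_state_keys); the values used in the usage note are placeholders, not real state information.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def state_from_rows(rows):
    """Build a state dict from (NAME, VALUE) pairs, mirroring the
    two-column DataFrame layout described above."""
    return {name: value for name, value in rows}


def missing_state_keys(state):
    """Return the required names that are absent from `state`."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]
```

For example, missing_state_keys(state_from_rows([('STATE_ID', 'placeholder-id'), ('HOST', 'placeholder-host')])) lists HINT and PORT, which would have to be supplied before the state can be used.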

Inherited Methods from PALBase

Besides the methods mentioned above, the GaussianMixture class also inherits methods from the PALBase class. Please refer to PAL Base for more details.