GaussianMixture
- class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)
Gaussian Mixture Model (GMM) is a probabilistic model used for modeling data points that are assumed to be generated from a mixture of Gaussian distributions. It is a parametric model that represents the probability distribution of the data as a weighted sum of multiple Gaussian distributions, also known as components or clusters.
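The "weighted sum" in the description above can be made concrete with a minimal one-dimensional sketch in plain Python. It illustrates the mixture density and the per-component membership probabilities only; it does not use hana_ml or PAL, and the function names and sample values are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities.

    The weights are expected to be non-negative and to sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the point x."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Two well-separated components (illustrative values, not taken from PAL).
weights = [0.5, 0.5]
means = [0.0, 10.0]
variances = [1.0, 1.0]

# A point near the first mean is attributed almost entirely to component 0.
resp = responsibilities(0.1, weights, means, variances)
```

The responsibilities always sum to 1, which is why they can be read as cluster membership probabilities.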
- Parameters:
- init_param{'farthest_first_traversal', 'manual', 'random_means', 'kmeans++'}
Specifies the initialization mode.
'farthest_first_traversal': The farthest-first traversal algorithm provides the initial cluster centers.
'manual': User-provided values (init_centers) serve as initial centers.
'random_means': Initial centers become the weighted means of randomly chosen data points.
'kmeans++': Initial centers are determined by the k-means++ method.
- n_componentsint, optional
Specifies the number of Gaussian distributions.
This parameter becomes mandatory when init_param is not 'manual'.
- init_centerslist of int/str
List of row identifiers in data that are to be used as initial centers.
This parameter becomes mandatory when init_param is 'manual'.
- covariance_type{'full', 'diag', 'tied_diag'}, optional
Specifies the type of covariance matrices to be utilized in the model.
'full': Utilizes full covariance matrices.
'diag': Implements diagonal covariance matrices.
'tied_diag': Applies diagonal covariance matrices with equal diagonal elements.
Defaults to 'full'.
- shared_covariancebool, optional
If set to True, all clusters will share the same covariance matrix.
Defaults to False.
- thread_ratiofloat, optional
Specifies the ratio of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. A value outside this range is ignored, and the number of threads is then determined heuristically.
Defaults to 0.
- max_iterint, optional
Specifies the maximum number of iterations of the EM algorithm.
Defaults to 100.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- category_weightfloat, optional
Represents the weight of categorical attributes.
Defaults to 0.707.
- error_tolfloat, optional
Defines the error tolerance, serving as a termination condition for the algorithm.
Defaults to 1e-5.
- regularizationfloat, optional
Represents the regularization factor added to the diagonal elements of covariance matrices to guarantee their positive-definiteness.
Defaults to 1e-6.
- random_seedint, optional
Indicates the seed used to initialize the random number generator:
0: The system time is used as the seed.
Not 0: The user-defined value is used as the seed.
Defaults to 0.
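The effect of covariance_type and regularization can be sketched in plain Python. The example below shows the three covariance structures and the regularization term added to the diagonal. It is an illustration under stated assumptions, not PAL's implementation: the helper names are hypothetical, and deriving the 'tied_diag' value by averaging the per-dimension variances is an assumption made for this sketch.

```python
def sample_covariance(points):
    """Full sample covariance matrix (maximum-likelihood form, divides by n)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(d)]
            for i in range(d)]


def constrain(cov, covariance_type="full", regularization=1e-6):
    """Hypothetical helper: impose a covariance structure, then regularize."""
    d = len(cov)
    if covariance_type == "full":
        out = [row[:] for row in cov]
    elif covariance_type == "diag":
        # Keep the variances, drop all off-diagonal terms.
        out = [[cov[i][j] if i == j else 0.0 for j in range(d)]
               for i in range(d)]
    elif covariance_type == "tied_diag":
        # One shared value on the whole diagonal (assumed: mean variance).
        shared = sum(cov[i][i] for i in range(d)) / d
        out = [[shared if i == j else 0.0 for j in range(d)]
               for i in range(d)]
    else:
        raise ValueError("unknown covariance_type: %r" % (covariance_type,))
    # Add the regularization factor to the diagonal elements.
    for i in range(d):
        out[i][i] += regularization
    return out


# Perfectly collinear points: the unregularized full covariance is singular.
cov = sample_covariance([(0.0, 0.0), (2.0, 4.0), (4.0, 8.0)])
full = constrain(cov, "full", regularization=1e-6)
determinant = full[0][0] * full[1][1] - full[0][1] * full[1][0]
```

With the collinear sample above, the determinant is zero before regularization and strictly positive afterwards, which is the purpose the regularization parameter describes.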
Examples
Input DataFrame df:
>>> df.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
...
22  22  10.13  10.14   2
23  23  10.14  10.13   2
Create a GMM instance:
>>> gmm = GaussianMixture(init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)
Perform fit():
>>> gmm.fit(data=df, key='ID')
Output:
>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID  PROBABILITY
0    0           0          0.0
1    1           0          0.0
...
12  13           0          1.0
13  14           0          0.0
>>> gmm.stats_.collect()
        STAT_NAME  STAT_VALUE
0  log-likelihood     11.7199
1             aic   -504.5536
2             bic   -480.3900
>>> gmm.model_.collect()
   ROW_INDEX  CLUSTER_ID                            MODEL_CONTENT
0          0          -1  {"Algorithm":"GMM","Metadata":{"DataP...
1          1           0  {"GuassModel":{"covariance":[22.18895...
2          2           1  {"GuassModel":{"covariance":[22.19450...
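The constructor rule that n_components is mandatory unless init_param is 'manual', and init_centers is mandatory when it is, can be checked before a model is built. The sketch below is a hypothetical convenience wrapper, not part of hana_ml: gmm_init_kwargs and fit_gmm are names invented for illustration, and fit_gmm assumes that hana_ml is installed and that df is an existing hana_ml DataFrame on a live SAP HANA connection.

```python
INIT_MODES = {'farthest_first_traversal', 'manual', 'random_means', 'kmeans++'}


def gmm_init_kwargs(init_param, n_components=None, init_centers=None, **extra):
    """Hypothetical helper: validate the documented rule, return keyword arguments."""
    if init_param not in INIT_MODES:
        raise ValueError("unsupported init_param: %r" % (init_param,))
    kwargs = {'init_param': init_param}
    if init_param == 'manual':
        if not init_centers:
            raise ValueError("init_centers is mandatory when init_param is 'manual'")
        kwargs['init_centers'] = list(init_centers)
    elif n_components is None:
        raise ValueError("n_components is mandatory when init_param is not 'manual'")
    if n_components is not None:
        kwargs['n_components'] = n_components
    kwargs.update(extra)  # e.g. max_iter, random_seed, covariance_type
    return kwargs


def fit_gmm(df, key, **kwargs):
    """Fit a GaussianMixture on a hana_ml DataFrame and return the fitted object.

    Not executed here: requires hana_ml and a SAP HANA connection.
    """
    # Imported lazily so that the validation helper works without hana_ml.
    from hana_ml.algorithms.pal.mixture import GaussianMixture
    gmm = GaussianMixture(**kwargs)
    gmm.fit(data=df, key=key)
    return gmm


# Manual initialization: rows with ID 0 and 22 serve as the initial centers.
manual_kwargs = gmm_init_kwargs('manual', init_centers=[0, 22])
```

A call such as fit_gmm(df, 'ID', **manual_kwargs) would then run the fit, after which labels_, stats_ and model_ are available on the returned object.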
- Attributes:
- model_DataFrame
Model content.
- labels_DataFrame
Cluster membership probabilities for each data point.
- stats_DataFrame
Statistics.
Methods
create_model_state
([model, function, ...])Create PAL model state.
delete_model_state
([state])Delete PAL model state.
fit
(data[, key, features, categorical_variable])Perform GMM clustering on input dataset.
fit_predict
(data, key[, features, ...])Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
predict
(data[, key, features])Assign clusters to data based on a fitted model.
set_model_state
(state)Set the model state by state information.
- fit(data, key=None, features=None, categorical_variable=None)
Perform GMM clustering on input dataset.
- Parameters:
- dataDataFrame
Data to be clustered.
- keystr, optional
Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- featuresa list of str, optional
List of strings specifying feature columns.
If a list of features is not given, all the columns except the ID column are taken as features.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
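The defaults described for features and categorical_variable can be mirrored on the client side, for example to inspect which columns a call would use. This is a plain-Python sketch of the documented behaviour only; resolve_features and as_list are hypothetical helpers, not hana_ml functions.

```python
def resolve_features(columns, key, features=None):
    """Return the feature columns that a call would use.

    If features is not given, every column except the ID column is a feature.
    """
    if key not in columns:
        raise ValueError("key column %r not found" % (key,))
    if features is None:
        return [c for c in columns if c != key]
    missing = [f for f in features if f not in columns]
    if missing:
        raise ValueError("unknown feature columns: %r" % (missing,))
    return list(features)


def as_list(categorical_variable):
    """Normalize 'str or a list of str' (or None) to a list of column names."""
    if categorical_variable is None:
        return []
    if isinstance(categorical_variable, str):
        return [categorical_variable]
    return list(categorical_variable)


# Column names taken from the example DataFrame in this page.
default_features = resolve_features(['ID', 'X1', 'X2', 'X3'], key='ID')
```

Here default_features contains X1, X2 and X3, matching the rule that all non-ID columns are used when features is omitted.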
- fit_predict(data, key, features=None, categorical_variable=None)
Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
- Parameters:
- dataDataFrame
Data to be clustered.
- keystr, optional
Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- featuresa list of str, optional
List of strings specifying feature columns.
If a list of features is not given, all the columns except the ID column are taken as features.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- DataFrame
Cluster membership probabilities.
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters:
- dataDataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- keystr, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns:
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcnameint or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
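When the state is passed as a dict, the required keys can be checked before calling set_model_state. The sketch below is a hypothetical pre-check in plain Python; missing_state_keys is not a hana_ml function, and the values in the sample dict are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values for illustration only.
state = {
    'STATE_ID': '<state id>',
    'HINT': '<hint>',
    'HOST': '<host>',
    'PORT': '<port>',
}
missing = missing_state_keys(state)
```

An empty result means the dict has every key listed above; the values themselves still have to come from a state that was actually created on the server.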
Inherited Methods from PALBase
In addition to the methods described above, the GaussianMixture class inherits methods from the PALBase class. Refer to PAL Base for more details.