GaussianMixture
- class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)
Representation of a Gaussian mixture model probability distribution.
- Parameters:
- init_param{'farthest_first_traversal','manual','random_means','kmeans++'}
Specifies the initialization mode.
farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.
manual: The initial centers are the init_centers given by user.
random_means: The initial centers are the means of all the data that are randomly weighted.
kmeans++: The initial centers are given using the k-means++ approach.
- n_componentsint
Specifies the number of Gaussian distributions.
Mandatory when init_param is not 'manual'.
- init_centerslist of integers/strings
Specifies the rows of data to be used as initial centers by providing their IDs in data.
Mandatory when init_param is 'manual'.
- covariance_type{'full', 'diag', 'tied_diag'}, optional
Specifies the type of covariance matrices in the model.
full: use full covariance matrices.
diag: use diagonal covariance matrices.
tied_diag: use diagonal covariance matrices with all equal diagonal entries.
Defaults to 'full'.
- shared_covariancebool, optional
All clusters share the same covariance matrix if True.
Defaults to False.
- thread_ratiofloat, optional
Controls the proportion of available threads that can be used.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- max_iterint, optional
Specifies the maximum number of iterations for the EM algorithm.
Defaults to 100.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- category_weightfloat, optional
Represents the weight of category attributes.
Defaults to 0.707.
- error_tolfloat, optional
Specifies the error tolerance, which is the stop condition.
Defaults to 1e-5.
- regularizationfloat, optional
Regularization value added to the diagonal of the covariance matrices to ensure that they are positive-definite.
Defaults to 1e-6.
- random_seedint, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
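The 'farthest_first_traversal' mode can be pictured with a generic, pure-Python sketch of the textbook algorithm. This is not PAL's implementation: the function name farthest_first_centers, the choice of the first point as seed, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math

def farthest_first_centers(points, k, first=0):
    """Pick k initial centers by farthest-first traversal.

    points: list of equal-length numeric tuples.
    first:  index of the seed point (an assumption; the source does not
            say how the first center is chosen).
    Returns the indices of the chosen points.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [first]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # The next center is the point farthest from all chosen centers.
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return chosen

# Two well-separated groups, loosely modelled on the X1/X2 columns of df1.
pts = [(0.10, 0.10), (0.11, 0.10), (0.14, 0.13), (10.10, 10.10), (10.14, 10.13)]
centers = farthest_first_centers(pts, k=2)
print(centers)
```

With two well-separated groups, the second center lands in the group that does not contain the seed, which is why this mode tends to spread the initial centers across the data.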
Examples
Input dataframe df1 for training:
>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2
Creating the GMM instance:
>>> gmm = GaussianMixture(init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)
Performing fit() on the given dataframe:
>>> gmm.fit(data=df1, key='ID')
Expected output:
>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID  PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9   10           0          1.0
10  11           0          1.0
11  12           0          1.0
12  13           0          1.0
13  14           0          0.0
>>> gmm.stats_.collect()
        STAT_NAME  STAT_VALUE
0  log-likelihood     11.7199
1             aic   -504.5536
2             bic   -480.3900
>>> gmm.model_.collect()
   ROW_INDEX  CLUSTER_ID                             MODEL_CONTENT
0          0          -1  {"Algorithm":"GMM","Metadata":{"DataP...
1          1           0  {"GuassModel":{"covariance":[22.18895...
2          2           1  {"GuassModel":{"covariance":[22.19450...
- Attributes:
- model_DataFrame
Trained model content.
- labels_DataFrame
Cluster membership probabilities for each data point.
- stats_DataFrame
Statistics.
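In a Gaussian mixture, a membership probability is conventionally the posterior probability of a component given the data point. The sketch below computes it for a one-dimensional, two-component mixture. It assumes this standard definition applies to labels_; the source does not spell out how PAL derives the PROBABILITY column. The helper names and the fitted parameters are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for one observation x."""
    # Weighted density of x under each component, then normalise to sum to 1.
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical fitted parameters: two equally weighted components.
resp = responsibilities(0.12, weights=[0.5, 0.5], means=[0.0, 1.0],
                        variances=[0.25, 0.25])
# Hard assignment: the component with the largest posterior probability.
cluster_id = max(range(len(resp)), key=resp.__getitem__)
print(cluster_id, resp)
```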
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, categorical_variable]): Perform GMM clustering on input dataset.
fit_predict(data, key[, features, ...]): Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
predict(data[, key, features]): Assign clusters to data based on a fitted model.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, categorical_variable=None)
Perform GMM clustering on input dataset.
- Parameters:
- dataDataFrame
Data to be clustered.
- keystr, optional
Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- featureslist of str, optional
List of strings specifying feature columns.
If a list of features is not given, all the columns except the ID column are taken as features.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
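The documented default for features (every column except the ID column) can be pictured with a small helper. resolve_features is a hypothetical name used only for illustration; it is not part of hana_ml and does not reproduce its internal code. The column list mirrors df1 from the Examples section.

```python
def resolve_features(columns, key, features=None):
    """Return the feature columns implied by the documented default."""
    if features is not None:
        # An explicit list always wins.
        return list(features)
    # Otherwise every column except the ID column is a feature.
    return [c for c in columns if c != key]

cols = ["ID", "X1", "X2", "X3"]
default_features = resolve_features(cols, key="ID")
explicit_features = resolve_features(cols, key="ID", features=["X1", "X2"])
print(default_features, explicit_features)
```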
- fit_predict(data, key, features=None, categorical_variable=None)
Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
- Parameters:
- dataDataFrame
Data to be clustered.
- keystr, optional
Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- featureslist of str, optional
List of strings specifying feature columns.
If a list of features is not given, all the columns except the ID column are taken as features.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- Returns:
- DataFrame
Cluster membership probabilities.
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters:
- dataDataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- keystr, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns:
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcnameint or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, deletes the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
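Before passing a dict to set_model_state, it can help to confirm that the four documented keys are present. The check below is a minimal sketch: missing_state_keys is a hypothetical helper, not part of hana_ml, and the values shown are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")

def missing_state_keys(state):
    """Return the documented keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]

# Placeholder values only; real ones come from an existing PAL model state.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>",
         "HOST": "<host>", "PORT": "<port>"}
incomplete = {"STATE_ID": "<state-id>"}

print(missing_state_keys(state))       # no keys missing
print(missing_state_keys(incomplete))  # lists the absent keys
```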
Inherited Methods from PALBase
Besides the methods described above, the GaussianMixture class also inherits methods from the PALBase class. Please refer to PAL Base for more details.