ConstrainedClustering
- class hana_ml.algorithms.pal.clustering.ConstrainedClustering(n_clusters: int, encoder_hidden_dims: str = None, embedding_dim: int = None, normalization: int = None, seed: int = None, pretrain_learning_rate: float = None, pretrain_epochs: int = None, pretrain_batch_size: int = None, thread_ratio: float = None, gamma: float = None, ml_penalty: float = None, cl_penalty: float = None, theta: float = None, learning_rate: float = None, max_epochs: int = None, batch_size: int = None, update_interval: int = None, ml_batch_size: int = None, cl_batch_size: int = None, triplet_batch_size: int = None, ml_update_interval: int = None, cl_update_interval: int = None, triplet_update_interval: int = None, tolerance: float = None, verbose: int = None)
Constraints are additional information that guide the clustering process to produce results more in line with specific requirements or prior knowledge.
Pairwise Constraints: A must-link constraint specifies that two data points should be in the same cluster; a cannot-link constraint specifies that two data points should not be in the same cluster.
Triplet Constraints: Given an anchor instance a, a positive instance p, and a negative instance n, the constraint indicates that a is more similar to p than to n.
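For concreteness, the two constraint formats can be sketched as plain tables (pandas only; the record IDs and column names below are made up for illustration):

```python
import pandas as pd

# Pairwise constraints: first column is the constraint type
# (1 = must-link, -1 = cannot-link), then the two record IDs.
pairwise = pd.DataFrame(
    [[ 1,  1,  30],   # records 1 and 30 should end up in the same cluster
     [-1,  1, 130],   # records 1 and 130 should end up in different clusters
     [-1, 80, 130]],
    columns=["TYPE", "ID1", "ID2"])

# Triplet constraints: anchor, positive, negative record IDs --
# the anchor should be closer to the positive than to the negative.
triplet = pd.DataFrame(
    [[ 5,  7,  90],
     [12, 14, 101]],
    columns=["ANCHOR", "POSITIVE", "NEGATIVE"])
```

In practice such tables are uploaded to the database (e.g. via create_dataframe_from_pandas, as in the Examples section below) and passed to fit or fit_predict as the constraints argument.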
- Parameters:
- n_clusters : int
The number of clusters for constrained clustering.
The valid range for this parameter is from 2 to the number of records in the input data.
- encoder_hidden_dims : str, optional
Specifies the hidden layer sizes of the encoder.
Defaults to '8, 16'.
- embedding_dim : int, optional
Specifies the dimension of the latent space.
Defaults to 3.
- normalization : int, optional
Specifies whether to use normalization.
Defaults to 1.
- seed : int, optional
Specifies the seed for the random number generator. The system time is used when 0 is specified.
Defaults to 0.
- pretrain_learning_rate : float, optional
Specifies the learning rate of the pretraining stage.
Defaults to 0.01.
- pretrain_epochs : int, optional
Specifies the number of pretraining epochs.
Defaults to 10.
- pretrain_batch_size : int, optional
Specifies the number of training samples in a batch.
Defaults to 16.
- thread_ratio : float, optional
Specifies the ratio of the total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 1.0.
- gamma : float, optional
Specifies the degree of distortion of the latent space.
Defaults to 0.1.
- ml_penalty : float, optional
Specifies the penalty for must-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 0.1.
- cl_penalty : float, optional
Specifies the penalty for cannot-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.0.
- theta : float, optional
Specifies the margin in the triplet loss.
Only valid when constraint_type is 'triplet'.
Defaults to 0.1.
- learning_rate : float, optional
Specifies the learning rate.
Defaults to 0.01.
- max_epochs : int, optional
Specifies the maximum number of training epochs.
Defaults to 5.
- batch_size : int, optional
Specifies the number of training samples in a batch.
Defaults to 16.
- update_interval : int, optional
Specifies the frequency of updating the target distribution.
Defaults to 1.
- ml_batch_size : int, optional
Specifies the number of must-link constraints in a batch.
Only valid when constraint_type is 'pairwise'.
Defaults to 16.
- cl_batch_size : int, optional
Specifies the number of cannot-link constraints in a batch.
Only valid when constraint_type is 'pairwise'.
Defaults to 16.
- triplet_batch_size : int, optional
Specifies the number of triplet constraints in a batch.
Only valid when constraint_type is 'triplet'.
Defaults to 16.
- ml_update_interval : int, optional
Specifies the frequency of training with must-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.
- cl_update_interval : int, optional
Specifies the frequency of training with cannot-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.
- triplet_update_interval : int, optional
Specifies the frequency of training with triplet constraints.
Only valid when constraint_type is 'triplet'.
Defaults to 1.
- tolerance : float, optional
Specifies the stopping threshold.
Defaults to 0.001.
- verbose : int, optional
Specifies the verbosity of the log.
Defaults to 0.
- Attributes:
- labels_ : DataFrame
DataFrame that holds the cluster labels.
- model_ : DataFrame
Model.
- training_log_ : DataFrame
Training log.
- statistics_ : DataFrame
Statistics.
Methods
fit(data, constraint_type, constraints[, ...])
    Fit the model to the training dataset.
fit_predict(data, constraint_type, constraints)
    Given data, perform constrained clustering and return the corresponding cluster labels.
predict([data, key, features])
    Given data, perform constrained clustering and return the corresponding cluster labels.
Examples
>>> from hana_ml.algorithms.pal.clustering import ConstrainedClustering
>>> import numpy as np
>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> constrained_clustering = ConstrainedClustering(n_clusters=3,
...                                               encoder_hidden_dims='4',
...                                               embedding_dim=3,
...                                               seed=1,
...                                               pretrain_learning_rate=0.01,
...                                               pretrain_epochs=350,
...                                               learning_rate=0.01,
...                                               max_epochs=200,
...                                               update_interval=1)
>>> constraints_data_structure = {'TYPE': 'INTEGER', 'ID1': 'INTEGER', 'ID2': 'INTEGER'}
>>> constraints_data = np.array([[ 1,  1,  30],
...                              [-1,  1, 130],
...                              [-1, 80, 130]])
>>> constraints_df = create_dataframe_from_pandas(
...     conn,
...     pd.DataFrame(constraints_data, columns=list(constraints_data_structure.keys())),
...     'CONSTRAINTS_TBL',
...     force=True,
...     table_structure=constraints_data_structure)
>>> labels = constrained_clustering.fit_predict(
...     data=iris_df,
...     constraint_type='pairwise',
...     constraints=constraints_df,
...     key='ID',
...     features=['SEPALLENGTHCM', 'SEPALWIDTHCM',
...               'PETALLENGTHCM', 'PETALWIDTHCM'])
- fit(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)
Fit the model to the training dataset.
- Parameters:
- dataDataFrame
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
- constraint_type{'pairwise', 'triplet'}
Specifies the type of constraints:
'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.
- constraintsDataFrame
Constraints data for pairwise constraints, expected to be structured as follows:
1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.
2nd column : Instance 1 ID.
3rd column : Instance 2 ID.
Constraints data for triplet constraints, expected to be structured as follows:
1st column : Anchor instance ID.
2nd column : Positive instance ID.
3rd column : Negative instance ID.
- key : str, optional
Name of ID column.
Mandatory if data is not indexed, or indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- pre_model : DataFrame, optional
DataFrame containing the pre-model data, expected to be structured as follows:
1st column : Indicates the ID of the row.
2nd column : Model content.
Defaults to None.
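Because malformed constraint rows are easy to produce, a small client-side sanity check before uploading the constraints table can save a round trip. The helper below is a sketch, not part of hana_ml; it assumes the pairwise layout documented above (a type column restricted to 1/-1, and both record IDs present among the data's key values):

```python
import pandas as pd

def check_pairwise_constraints(constraints: pd.DataFrame, valid_ids: set) -> list:
    """Return a list of human-readable problems found in a pairwise
    constraints table whose first three columns are (type, id1, id2)."""
    problems = []
    type_col, id1_col, id2_col = constraints.columns[:3]
    # Constraint type must be 1 (must-link) or -1 (cannot-link).
    bad_types = constraints[~constraints[type_col].isin([1, -1])]
    for idx in bad_types.index:
        problems.append(f"row {idx}: invalid constraint type")
    # Both referenced records must exist in the input data.
    for col in (id1_col, id2_col):
        missing = constraints[~constraints[col].isin(valid_ids)]
        for idx in missing.index:
            problems.append(f"row {idx}: unknown record ID in {col}")
    # A constraint relating a record to itself is meaningless.
    self_links = constraints[constraints[id1_col] == constraints[id2_col]]
    for idx in self_links.index:
        problems.append(f"row {idx}: identical record IDs")
    return problems

# Example: the last row has an invalid type (2) and links a record to itself.
constraints = pd.DataFrame(
    [[1, 1, 30], [-1, 1, 130], [2, 80, 80]],
    columns=["TYPE", "ID1", "ID2"])
issues = check_pairwise_constraints(constraints, valid_ids=set(range(1, 151)))
```

A passing check (an empty list) does not guarantee the constraints are satisfiable, only that they are well-formed.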
- fit_predict(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)
Given data, perform constrained clustering and return the corresponding cluster labels.
- Parameters:
- dataDataFrame
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
- constraint_type{'pairwise', 'triplet'}
Specifies the type of constraints:
'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.
- constraintsDataFrame
Constraints data for pairwise constraints, expected to be structured as follows:
1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.
2nd column : Instance 1 ID.
3rd column : Instance 2 ID.
Constraints data for triplet constraints, expected to be structured as follows:
1st column : Anchor instance ID.
2nd column : Positive instance ID.
3rd column : Negative instance ID.
- key : str, optional
Name of ID column.
Mandatory if data is not indexed, or indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- pre_model : DataFrame, optional
DataFrame containing the pre-model data, expected to be structured as follows:
1st column : Indicates the ID of the row.
2nd column : Model content.
Defaults to None.
- Returns:
- DataFrame
The cluster labels of all records in data.
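The returned object is a hana_ml DataFrame; calling its collect() method brings the labels into pandas for local inspection. The snippet below sketches that post-processing step on a made-up pandas frame standing in for the collected result (the column names ID and CLUSTER_ID are illustrative assumptions, not guaranteed by the API):

```python
import pandas as pd

# Stand-in for labels.collect(): record IDs with assigned cluster labels.
collected = pd.DataFrame(
    {"ID": [1, 2, 3, 4, 5, 6],
     "CLUSTER_ID": [0, 0, 1, 2, 1, 0]})

# Cluster sizes, useful for spotting empty or degenerate clusters.
cluster_sizes = collected["CLUSTER_ID"].value_counts().sort_index()
```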
- predict(data: DataFrame = None, key: str = None, features: List[str] = None)
Given data, perform constrained clustering and return the corresponding cluster labels.
- Parameters:
- dataDataFrame
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
Defaults to None.
- key : str, optional
Name of ID column.
Mandatory if data is not indexed, or indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- Returns:
- DataFrame
The cluster labels of all records in data.
Inherited Methods from PALBase
Besides the methods mentioned above, the ConstrainedClustering class also inherits methods from the PALBase class; please refer to PAL Base for more details.