ConstrainedClustering
- class hana_ml.algorithms.pal.clustering.ConstrainedClustering(n_clusters: int, encoder_hidden_dims: str = None, embedding_dim: int = None, normalization: int = None, seed: int = None, pretrain_learning_rate: float = None, pretrain_epochs: int = None, pretrain_batch_size: int = None, thread_ratio: float = None, gamma: float = None, ml_penalty: float = None, cl_penalty: float = None, theta: float = None, learning_rate: float = None, max_epochs: int = None, batch_size: int = None, update_interval: int = None, ml_batch_size: int = None, cl_batch_size: int = None, triplet_batch_size: int = None, ml_update_interval: int = None, cl_update_interval: int = None, triplet_update_interval: int = None, tolerance: float = None, verbose: int = None)
Constrained clustering uses constraints, additional information that guides the clustering process, to produce results more in line with specific requirements or prior knowledge.
Pairwise Constraints: A must-link constraint specifies that two data points should be in the same cluster. A cannot-link constraint specifies that two data points should not be in the same cluster.
Triplet Constraints: Given an anchor instance a, a positive instance p, and a negative instance n, the constraint indicates that a is more similar to p than to n.
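To make the triplet constraint concrete, the sketch below computes a common margin-based triplet loss on already-embedded points. It is an illustration only: the function names, the squared Euclidean distance, and the exact form of the loss are assumptions and are not taken from the PAL implementation. The theta argument plays the role of the margin that the theta parameter describes.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_margin_loss(anchor, positive, negative, theta=0.1):
    """Hinge-style triplet loss (assumed form, not confirmed for PAL).

    The loss is zero once the anchor is closer to the positive instance
    than to the negative instance by at least the margin theta.
    """
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + theta)


# Satisfied triplet: the anchor sits next to the positive instance.
satisfied = triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0])
# Violated triplet: the anchor sits next to the negative instance.
violated = triplet_margin_loss([0.0, 0.0], [2.0, 0.0], [0.1, 0.0])
```

A triplet that is already respected by a wide margin contributes nothing, while a violated one contributes in proportion to how badly it is violated.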
- Parameters
- n_clusters : int
The number of clusters for constrained clustering.
The valid range for this parameter is from 2 to the number of records in the input data.
- encoder_hidden_dims : str, optional
Specifies the hidden layer sizes of the encoder.
Defaults to '8, 16'.
- embedding_dim : int, optional
Specifies the dimension of the latent space.
Defaults to 3.
- normalization : int, optional
Specifies whether to use normalization.
Defaults to 1.
- seed : int, optional
Specifies the seed for the random number generator. The system time is used when 0 is specified.
Defaults to 0.
- pretrain_learning_rate : float, optional
Specifies the learning rate of the pretraining stage.
Defaults to 0.01.
- pretrain_epochs : int, optional
Specifies the number of pretraining epochs.
Defaults to 10.
- pretrain_batch_size : int, optional
Specifies the number of training samples in a batch during the pretraining stage.
Defaults to 16.
- thread_ratio : float, optional
Specifies the ratio of the total number of threads that this function can use. The valid range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.
Defaults to 1.0.
- gamma : float, optional
Specifies the degree to which the latent space is distorted.
Defaults to 0.1.
- ml_penalty : float, optional
Specifies the penalty for must-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 0.1.
- cl_penalty : float, optional
Specifies the penalty for cannot-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.0.
- theta : float, optional
Specifies the margin in the triplet loss.
Only valid when constraint_type is 'triplet'.
Defaults to 0.1.
- learning_rate : float, optional
Specifies the learning rate.
Defaults to 0.01.
- max_epochs : int, optional
Specifies the maximum number of training epochs.
Defaults to 5.
- batch_size : int, optional
Specifies the number of training samples in a batch.
Defaults to 16.
- update_interval : int, optional
Specifies the frequency of updating the target distribution.
Defaults to 1.
- ml_batch_size : int, optional
Specifies the number of must-link constraints in a batch.
Only valid when constraint_type is 'pairwise'.
Defaults to 16.
- cl_batch_size : int, optional
Specifies the number of cannot-link constraints in a batch.
Only valid when constraint_type is 'pairwise'.
Defaults to 16.
- triplet_batch_size : int, optional
Specifies the number of triplet constraints in a batch.
Only valid when constraint_type is 'triplet'.
Defaults to 16.
- ml_update_interval : int, optional
Specifies the frequency of training with must-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.
- cl_update_interval : int, optional
Specifies the frequency of training with cannot-link constraints.
Only valid when constraint_type is 'pairwise'.
Defaults to 1.
- triplet_update_interval : int, optional
Specifies the frequency of training with triplet constraints.
Only valid when constraint_type is 'triplet'.
Defaults to 1.
- tolerance : float, optional
Specifies the stopping threshold.
Defaults to 0.001.
- verbose : int, optional
Specifies the verbosity of the log.
Defaults to 0.
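Several parameters above are marked as valid for only one constraint type. The sketch below groups them accordingly and reports which explicitly set parameters would not apply to a given constraint_type. The helper inapplicable_params and the two set names are hypothetical and not part of hana_ml; the source does not state how the library itself treats such parameters.

```python
# Parameter groups taken from the "Only valid when constraint_type is ..."
# notes in the parameter list above.
PAIRWISE_ONLY = {"ml_penalty", "cl_penalty", "ml_batch_size", "cl_batch_size",
                 "ml_update_interval", "cl_update_interval"}
TRIPLET_ONLY = {"theta", "triplet_batch_size", "triplet_update_interval"}


def inapplicable_params(constraint_type, params):
    """Return the explicitly set parameters that do not apply to constraint_type."""
    if constraint_type == "pairwise":
        blocked = TRIPLET_ONLY
    elif constraint_type == "triplet":
        blocked = PAIRWISE_ONLY
    else:
        raise ValueError("constraint_type must be 'pairwise' or 'triplet'")
    # Parameters left as None are treated as "not set" and never reported.
    return sorted(name for name, value in params.items()
                  if name in blocked and value is not None)


settings = {"n_clusters": 3, "theta": 0.2, "ml_penalty": 0.5, "cl_penalty": None}
unused_for_pairwise = inapplicable_params("pairwise", settings)
unused_for_triplet = inapplicable_params("triplet", settings)
```

A check like this can be run on a settings dictionary before constructing the estimator, to catch values that were tuned for the other constraint type.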
- Attributes
- labels_ : DataFrame
DataFrame that holds the cluster labels.
- model_ : DataFrame
Model.
- training_log_ : DataFrame
Training log.
- statistics_ : DataFrame
Statistics.
Methods
fit(data, constraint_type, constraints[, ...]): Fit the model to the training dataset.
fit_predict(data, constraint_type, constraints): Given data, perform constrained clustering and return the corresponding cluster labels.
predict([data, key, features]): Given data, perform constrained clustering and return the corresponding cluster labels.
Examples
>>> # Assumes an existing connection conn and a hana_ml DataFrame iris_df
>>> # with the ID and feature columns used below.
>>> from hana_ml.algorithms.pal.clustering import ConstrainedClustering
>>> constrained_clustering = ConstrainedClustering(
...     n_clusters=3,
...     encoder_hidden_dims='4',
...     embedding_dim=3,
...     seed=1,
...     pretrain_learning_rate=0.01,
...     pretrain_epochs=350,
...     learning_rate=0.01,
...     max_epochs=200,
...     update_interval=1)
>>> import numpy as np
>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> # Column names and SQL types of the pairwise constraints table.
>>> constraints_data_structure = {'TYPE': 'INTEGER', 'ID1': 'INTEGER', 'ID2': 'INTEGER'}
>>> # Each row is (type, ID1, ID2): 1 is a must-link, -1 is a cannot-link.
>>> constraints_data = np.array([
...     [1, 1, 30],
...     [-1, 1, 130],
...     [-1, 80, 130]
... ])
>>> constraints_df = create_dataframe_from_pandas(
...     conn,
...     pd.DataFrame(constraints_data, columns=list(constraints_data_structure.keys())),
...     'CONSTRAINTS_TBL',
...     force=True,
...     table_structure=constraints_data_structure)
>>> labels = constrained_clustering.fit_predict(
...     data=iris_df,
...     constraint_type='pairwise',
...     constraints=constraints_df,
...     key='ID',
...     features=['SEPALLENGTHCM', 'SEPALWIDTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
- fit(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)
Fit the model to the training dataset.
- Parameters
- data : DataFrame
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
- constraint_type : {'pairwise', 'triplet'}
Specifies the type of constraints:
'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.
- constraints : DataFrame
Constraints data for pairwise constraints, expected to be structured as follows:
1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.
2nd column : Instance 1 ID.
3rd column : Instance 2 ID.
Constraints data for triplet constraints, expected to be structured as follows:
1st column : Anchor instance ID.
2nd column : Positive instance ID.
3rd column : Negative instance ID.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or is indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- pre_model : DataFrame, optional
DataFrame containing the pre-model data, expected to be structured as follows:
1st column : Indicates the ID of the row.
2nd column : Model content.
Defaults to None.
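Before loading a pairwise constraints table, it can help to check on the client side that the rows are valid and mutually consistent. The sketch below works on plain (type, ID1, ID2) tuples in the column order described above and uses a small union-find structure to detect cannot-link pairs that the must-link rows force into the same group. The function find_conflicts is a hypothetical helper, not part of hana_ml, and the source does not say whether PAL performs a similar check.

```python
def find_conflicts(constraints):
    """Check (type, id1, id2) rows: 1 = must-link, -1 = cannot-link.

    Returns the cannot-link pairs whose two instances are forced into the
    same group by the must-link rows. Raises ValueError on any other type.
    """
    parent = {}

    def find(x):
        # Return the representative of x's must-link group.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for kind, id1, id2 in constraints:
        if kind not in (1, -1):
            raise ValueError(f"invalid constraint type: {kind!r}")
        if kind == 1:
            parent[find(id1)] = find(id2)  # merge the two must-link groups

    return [(id1, id2) for kind, id1, id2 in constraints
            if kind == -1 and find(id1) == find(id2)]


# The three rows used in the Examples section are mutually consistent.
consistent = find_conflicts([(1, 1, 30), (-1, 1, 130), (-1, 80, 130)])
# Must-links 1-30 and 30-80 put 1 and 80 together, so (-1, 1, 80) conflicts.
conflicting = find_conflicts([(1, 1, 30), (1, 30, 80), (-1, 1, 80)])
```

An empty result means no cannot-link row contradicts the transitive closure of the must-link rows.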
- fit_predict(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)
Given data, perform constrained clustering and return the corresponding cluster labels.
- Parameters
- data : DataFrame
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
- constraint_type : {'pairwise', 'triplet'}
Specifies the type of constraints:
'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.
- constraints : DataFrame
Constraints data for pairwise constraints, expected to be structured as follows:
1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.
2nd column : Instance 1 ID.
3rd column : Instance 2 ID.
Constraints data for triplet constraints, expected to be structured as follows:
1st column : Anchor instance ID.
2nd column : Positive instance ID.
3rd column : Negative instance ID.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or is indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- pre_model : DataFrame, optional
DataFrame containing the pre-model data, expected to be structured as follows:
1st column : Indicates the ID of the row.
2nd column : Model content.
Defaults to None.
- Returns
- DataFrame
The cluster labels of all records in data.
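The must-link and cannot-link constraints are weighted by penalties (ml_penalty, cl_penalty), which suggests that the returned labels may not honor every constraint. The sketch below measures the fraction of pairwise constraints that a labeling respects. It assumes the labels have already been brought to the client as a plain dictionary from record ID to cluster ID; the function pairwise_satisfaction and the sample labels are hypothetical and do not come from hana_ml or from a real run.

```python
def pairwise_satisfaction(labels, constraints):
    """Fraction of pairwise constraints respected by a labeling.

    labels maps record ID -> cluster ID; constraints holds (type, id1, id2)
    rows with 1 for must-link and -1 for cannot-link.
    """
    if not constraints:
        return 1.0
    satisfied = 0
    for kind, id1, id2 in constraints:
        same_cluster = labels[id1] == labels[id2]
        if (kind == 1 and same_cluster) or (kind == -1 and not same_cluster):
            satisfied += 1
    return satisfied / len(constraints)


# Hypothetical labels for four records, not real output of fit_predict.
example_labels = {1: 0, 30: 0, 80: 1, 130: 1}
example_constraints = [(1, 1, 30), (-1, 1, 130), (-1, 80, 130)]
score = pairwise_satisfaction(example_labels, example_constraints)
```

In this made-up labeling, the must-link and the first cannot-link are respected, while records 80 and 130 share a cluster despite their cannot-link, so two of the three constraints hold.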
- predict(data: DataFrame = None, key: str = None, features: List[str] = None)
Given data, perform constrained clustering and return the corresponding cluster labels.
- Parameters
- data : DataFrame, optional
DataFrame containing the input data, expected to be structured as follows:
1st column : Record ID.
other columns : Attribute data.
Defaults to None.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or is indexed by multiple columns.
Defaults to the index column of data if there is one.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns of data.
- Returns
- DataFrame
The cluster labels of all records in data.