ConstrainedClustering

class hana_ml.algorithms.pal.clustering.ConstrainedClustering(n_clusters: int, encoder_hidden_dims: str = None, embedding_dim: int = None, normalization: int = None, seed: int = None, pretrain_learning_rate: float = None, pretrain_epochs: int = None, pretrain_batch_size: int = None, thread_ratio: float = None, gamma: float = None, ml_penalty: float = None, cl_penalty: float = None, theta: float = None, learning_rate: float = None, max_epochs: int = None, batch_size: int = None, update_interval: int = None, ml_batch_size: int = None, cl_batch_size: int = None, triplet_batch_size: int = None, ml_update_interval: int = None, cl_update_interval: int = None, triplet_update_interval: int = None, tolerance: float = None, verbose: int = None)

Constraints are additional information that guides the clustering process toward results consistent with specific requirements or prior knowledge.

Pairwise Constraints: Must-Link constraints specify that two data points should be in the same cluster. Cannot-Link constraints indicate that two data points should not be in the same cluster.

Triplet Constraints: Given an anchor instance a, a positive instance p, and a negative instance n, the constraint indicates that a is more similar to p than to n.
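To make the two constraint families concrete, the following minimal sketch checks a finished cluster assignment against them. It is illustrative only and does not call PAL: the helper names `pairwise_violations` and `triplet_satisfied` are hypothetical, and Euclidean distance stands in for the similarity measure, which the description above does not specify.

```python
import math


def pairwise_violations(labels, constraints):
    """Return the pairwise constraints that a cluster assignment breaks.

    labels: dict mapping record ID -> cluster label.
    constraints: iterable of (kind, id1, id2); kind 1 is must-link,
    kind -1 is cannot-link.
    """
    broken = []
    for kind, id1, id2 in constraints:
        same_cluster = labels[id1] == labels[id2]
        if (kind == 1 and not same_cluster) or (kind == -1 and same_cluster):
            broken.append((kind, id1, id2))
    return broken


def triplet_satisfied(points, anchor, positive, negative):
    """True if the anchor is closer to the positive than to the negative.

    Assumption: similarity is measured by Euclidean distance.
    """
    d_pos = math.dist(points[anchor], points[positive])
    d_neg = math.dist(points[anchor], points[negative])
    return d_pos < d_neg


# Toy assignment: records 80 and 130 share a cluster despite a cannot-link.
labels = {1: 0, 30: 0, 80: 1, 130: 1}
constraints = [(1, 1, 30), (-1, 1, 130), (-1, 80, 130)]
print(pairwise_violations(labels, constraints))

# Toy coordinates: record 1 is near record 30 and far from record 130.
points = {1: (0.0, 0.0), 30: (0.1, 0.0), 130: (2.0, 2.0)}
print(triplet_satisfied(points, anchor=1, positive=30, negative=130))
```

A check like this can be run on the returned labels to see how many constraints the clustering actually honored.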

Parameters

n_clusters : int

The number of clusters for constrained clustering.

The valid range for this parameter is from 2 to the number of records in the input data.

encoder_hidden_dims : str, optional

Specifies the hidden layer sizes of the encoder as a comma-separated string.

Defaults to '8, 16'.

embedding_dim : int, optional

Specifies the dimension of the latent space.

Defaults to 3.

normalization : int, optional

Specifies whether to use normalization.

Defaults to 1.

seed : int, optional

Specifies the seed for the random number generator. If 0 is specified, the system time is used.

Defaults to 0.

pretrain_learning_rate : float, optional

Specifies the learning rate of the pretraining stage.

Defaults to 0.01.

pretrain_epochs : int, optional

Specifies the number of pretraining epochs.

Defaults to 10.

pretrain_batch_size : int, optional

Specifies the number of training samples in a batch during the pretraining stage.

Defaults to 16.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 1.0.

gamma : float, optional

Specifies the degree to which the latent space is distorted.

Defaults to 0.1.

ml_penalty : float, optional

Specifies the penalty for must-link constraints.

Only valid when constraint_type is 'pairwise'.

Defaults to 0.1.

cl_penalty : float, optional

Specifies the penalty for cannot-link constraints.

Only valid when constraint_type is 'pairwise'.

Defaults to 1.0.

theta : float, optional

Specifies the margin in the triplet loss.

Only valid when constraint_type is 'triplet'.

Defaults to 0.1.

learning_rate : float, optional

Specifies the learning rate.

Defaults to 0.01.

max_epochs : int, optional

Specifies the maximum number of training epochs.

Defaults to 5.

batch_size : int, optional

Specifies the number of training samples in a batch.

Defaults to 16.

update_interval : int, optional

Specifies the frequency of updating the target distribution.

Defaults to 1.

ml_batch_size : int, optional

Specifies the number of must-link constraints in a batch.

Only valid when constraint_type is 'pairwise'.

Defaults to 16.

cl_batch_size : int, optional

Specifies the number of cannot-link constraints in a batch.

Only valid when constraint_type is 'pairwise'.

Defaults to 16.

triplet_batch_size : int, optional

Specifies the number of triplet constraints in a batch.

Only valid when constraint_type is 'triplet'.

Defaults to 16.

ml_update_interval : int, optional

Specifies the frequency of training with must-link constraints.

Only valid when constraint_type is 'pairwise'.

Defaults to 1.

cl_update_interval : int, optional

Specifies the frequency of training with cannot-link constraints.

Only valid when constraint_type is 'pairwise'.

Defaults to 1.

triplet_update_interval : int, optional

Specifies the frequency of training with triplet constraints.

Only valid when constraint_type is 'triplet'.

Defaults to 1.

tolerance : float, optional

Specifies the stopping threshold.

Defaults to 0.001.

verbose : int, optional

Specifies the verbosity of the log.

Defaults to 0.
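The parameter list describes `theta` only as the margin in the triplet loss. As a rough illustration, a commonly used hinge-style triplet loss is sketched below. Whether PAL uses exactly this form, or Euclidean distance in the latent space, is an assumption, and `triplet_margin_loss` is a hypothetical helper rather than part of hana_ml.

```python
import math


def triplet_margin_loss(anchor, positive, negative, theta=0.1):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + theta).

    Assumption: a generic textbook formulation with Euclidean distance,
    not necessarily the loss implemented by PAL.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + theta)


# The negative is farther than the positive by more than theta: no penalty.
print(triplet_margin_loss((0.0, 0.0), (1.0, 0.0), (3.0, 0.0)))

# The negative is only slightly farther than the positive: a penalty remains.
print(triplet_margin_loss((0.0, 0.0), (1.0, 0.0), (1.05, 0.0)))
```

Under this formulation, a larger `theta` demands a wider gap between the anchor-to-positive and anchor-to-negative distances before a triplet stops contributing to the loss.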

Attributes

labels_ : DataFrame

DataFrame that holds the cluster labels.

model_ : DataFrame

Model.

training_log_ : DataFrame

Training log.

statistics_ : DataFrame

Statistics.

Methods

`fit`(data, constraint_type, constraints[, ...])	Fit the model to the training dataset.
`fit_predict`(data, constraint_type, constraints)	Given data, perform constrained clustering and return the corresponding cluster labels.
`predict`([data, key, features])	Given data, assign cluster labels using the trained model.

Examples

>>> # Assumes `conn` is an open hana_ml ConnectionContext and `iris_df` is a
>>> # hana_ml DataFrame holding the Iris data with an 'ID' key column.
>>> import numpy as np
>>> import pandas as pd
>>> from hana_ml.algorithms.pal.clustering import ConstrainedClustering
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> constrained_clustering = ConstrainedClustering(n_clusters=3,
...                                                encoder_hidden_dims='4',
...                                                embedding_dim=3,
...                                                seed=1,
...                                                pretrain_learning_rate=0.01,
...                                                pretrain_epochs=350,
...                                                learning_rate=0.01,
...                                                max_epochs=200,
...                                                update_interval=1)
>>> # Pairwise constraints: TYPE is 1 (must-link) or -1 (cannot-link).
>>> constraints_data_structure = {'TYPE': 'INTEGER', 'ID1': 'INTEGER', 'ID2': 'INTEGER'}
>>> constraints_data = np.array([
...     [1, 1, 30],      # records 1 and 30 should share a cluster
...     [-1, 1, 130],    # records 1 and 130 should be in different clusters
...     [-1, 80, 130]    # records 80 and 130 should be in different clusters
... ])
>>> constraints_df = create_dataframe_from_pandas(conn,
...                                               pd.DataFrame(constraints_data, columns=list(constraints_data_structure.keys())),
...                                               'CONSTRAINTS_TBL',
...                                               force=True,
...                                               table_structure=constraints_data_structure)
>>> labels = constrained_clustering.fit_predict(data=iris_df,
...                                             constraint_type='pairwise',
...                                             constraints=constraints_df,
...                                             key='ID',
...                                             features=['SEPALLENGTHCM', 'SEPALWIDTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
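The example above covers pairwise constraints only. A triplet variant might look like the sketch below, which reuses the calls shown in the example. The column names in `TRIPLET_STRUCTURE`, the table name, and both helper functions are hypothetical: the documentation fixes only the column order (anchor, positive, negative). The second function needs a live SAP HANA connection, so it is defined here but not executed.

```python
# Hypothetical column names; only the order (anchor, positive, negative)
# is taken from the documentation.
TRIPLET_STRUCTURE = {
    'ANCHOR_ID': 'INTEGER',
    'POSITIVE_ID': 'INTEGER',
    'NEGATIVE_ID': 'INTEGER',
}


def triplet_records(triplets):
    """Turn (anchor, positive, negative) ID tuples into column-keyed records."""
    columns = list(TRIPLET_STRUCTURE)
    return [dict(zip(columns, (int(v) for v in triplet))) for triplet in triplets]


def fit_predict_with_triplets(conn, data_df, triplets, n_clusters=3):
    """Upload triplet constraints and cluster `data_df`.

    Assumes `conn` is an open hana_ml ConnectionContext and `data_df` is a
    hana_ml DataFrame with an 'ID' key column.
    """
    # Imported lazily so the helpers above work without hana_ml installed.
    import pandas as pd
    from hana_ml.algorithms.pal.clustering import ConstrainedClustering
    from hana_ml.dataframe import create_dataframe_from_pandas

    constraints_df = create_dataframe_from_pandas(
        conn,
        pd.DataFrame(triplet_records(triplets)),
        'TRIPLET_CONSTRAINTS_TBL',  # hypothetical table name
        force=True,
        table_structure=TRIPLET_STRUCTURE)
    model = ConstrainedClustering(n_clusters=n_clusters, theta=0.1, seed=1)
    return model.fit_predict(data=data_df,
                             constraint_type='triplet',
                             constraints=constraints_df,
                             key='ID')


# Record 1 is declared more similar to record 30 than to record 130.
print(triplet_records([(1, 30, 130)]))
```

With a connection available, the call would be `fit_predict_with_triplets(conn, iris_df, [(1, 30, 130)])`.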

fit(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)

Fit the model to the training dataset.

Parameters

data : DataFrame

DataFrame containing the input data, expected to be structured as follows:

1st column : Record ID.
other columns : Attribute data.

constraint_type : {'pairwise', 'triplet'}

Specifies the type of constraints:

'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.

constraints : DataFrame

Constraints data for pairwise constraints, expected to be structured as follows:

1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.

2nd column : Instance 1 ID.

3rd column : Instance 2 ID.

Constraints data for triplet constraints, expected to be structured as follows:

1st column : Anchor instance ID.

2nd column : Positive instance ID.

3rd column : Negative instance ID.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed or is indexed by multiple columns.

Defaults to the index column of data if there is one.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

pre_model : DataFrame, optional

DataFrame containing the pre-model data, expected to be structured as follows:

1st column : Indicates the ID of the row.

2nd column : Model content.

Defaults to None.
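Because the pairwise layout accepts only 1 and -1 in its first column, it can be worth checking constraint rows locally before uploading them. The sketch below is one way to do that in plain Python. `check_pairwise_constraints` is a hypothetical helper, and the rule that every referenced ID must exist in data is an assumption rather than something stated above.

```python
def check_pairwise_constraints(rows, known_ids):
    """Return a list of problems found in (type, id1, id2) constraint rows."""
    known = set(known_ids)
    problems = []
    for i, (kind, id1, id2) in enumerate(rows):
        # Documented rule: only 1 (must-link) and -1 (cannot-link) are valid.
        if kind not in (1, -1):
            problems.append(f"row {i}: type {kind!r} is not 1 or -1")
        # Assumption: both IDs should refer to records present in data.
        for record_id in (id1, id2):
            if record_id not in known:
                problems.append(f"row {i}: ID {record_id!r} not found in data")
    return problems


rows = [(1, 1, 30), (-1, 1, 130), (0, 80, 999)]
for problem in check_pairwise_constraints(rows, known_ids=range(1, 151)):
    print(problem)
```

An empty result means the rows match the documented layout under these assumptions.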

fit_predict(data: DataFrame, constraint_type: str, constraints: DataFrame, key: str = None, features: List[str] = None, pre_model: DataFrame = None)

Given data, perform constrained clustering and return the corresponding cluster labels.

Parameters

data : DataFrame

DataFrame containing the input data, expected to be structured as follows:

1st column : Record ID.
other columns : Attribute data.

constraint_type : {'pairwise', 'triplet'}

Specifies the type of constraints:

'pairwise' : Pairwise Constraints.
'triplet' : Triplet Constraints.

constraints : DataFrame

Constraints data for pairwise constraints, expected to be structured as follows:

1st column : Pairwise constraint type. Only the values 1 and -1 are considered valid, with 1 representing a must-link and -1 indicating a cannot-link.

2nd column : Instance 1 ID.

3rd column : Instance 2 ID.

Constraints data for triplet constraints, expected to be structured as follows:

1st column : Anchor instance ID.

2nd column : Positive instance ID.

3rd column : Negative instance ID.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed or is indexed by multiple columns.

Defaults to the index column of data if there is one.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

pre_model : DataFrame, optional

DataFrame containing the pre-model data, expected to be structured as follows:

1st column : Indicates the ID of the row.

2nd column : Model content.

Defaults to None.

Returns

DataFrame: The cluster labels of all records in data.

predict(data: DataFrame = None, key: str = None, features: List[str] = None)

Given data, assign cluster labels using the trained model.

Parameters

data : DataFrame, optional

DataFrame containing the input data, expected to be structured as follows:

1st column : Record ID.
other columns : Attribute data.

Defaults to None.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed or is indexed by multiple columns.

Defaults to the index column of data if there is one.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

Returns

DataFrame: The cluster labels of all records in data.
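Since predict takes no constraints, a plausible sequence is to call fit with constraints first and predict afterwards. The sketch below shows that call order using only the keyword arguments documented above. `fit_then_predict` and the `_RecordingModel` stand-in are hypothetical and exist only so the sequence can be dry-run without SAP HANA; in practice `model` would be a ConstrainedClustering instance and the data arguments would be hana_ml DataFrames.

```python
def fit_then_predict(model, train_df, constraints_df, new_df, key='ID'):
    """Fit with pairwise constraints, then label `new_df` with the same model."""
    model.fit(data=train_df,
              constraint_type='pairwise',
              constraints=constraints_df,
              key=key)
    return model.predict(data=new_df, key=key)


class _RecordingModel:
    """Stand-in that records calls; it is not part of hana_ml."""

    def __init__(self):
        self.calls = []

    def fit(self, **kwargs):
        self.calls.append(('fit', sorted(kwargs)))

    def predict(self, **kwargs):
        self.calls.append(('predict', sorted(kwargs)))
        return 'labels'


# Dry run with placeholder strings in place of hana_ml DataFrames.
stub = _RecordingModel()
result = fit_then_predict(stub, 'train', 'constraints', 'new')
print(stub.calls)
```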

Inherited Methods from PALBase

In addition to the methods described above, the ConstrainedClustering class inherits methods from the PALBase class. Refer to PAL Base for more details.