AgglomerateHierarchicalClustering

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

This algorithm is a widely used clustering method that finds natural groups within a set of data. The idea is to organize the data into a hierarchy, or binary tree, of subgroups. Hierarchical clustering can be either agglomerative or divisive, depending on the direction of the hierarchical decomposition. The implementation in PAL follows the agglomerative approach, which merges clusters with a bottom-up strategy: initially, each data point forms its own cluster, and the algorithm then greedily merges the two least dissimilar clusters at each step to form ever larger clusters.
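
The bottom-up merging idea can be sketched in a few lines of plain Python (an illustrative toy, not the PAL implementation; single linkage and squared Euclidean distance are assumptions made only for this sketch):

>>> def toy_agglomerate(points, n_clusters=2):
...     # Start with every point in its own cluster.
...     clusters = [[i] for i in range(len(points))]
...     def dist(a, b):
...         # Squared Euclidean distance (assumption for this sketch).
...         return sum((x - y) ** 2 for x, y in zip(a, b))
...     def linkage(c1, c2):
...         # Single ('nearest neighbor') linkage: smallest pairwise distance.
...         return min(dist(points[i], points[j]) for i in c1 for j in c2)
...     while len(clusters) > n_clusters:
...         # Greedily merge the two closest clusters.
...         i, j = min(((i, j) for i in range(len(clusters))
...                     for j in range(i + 1, len(clusters))),
...                    key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
...         clusters[i] = clusters[i] + clusters[j]
...         del clusters[j]
...     return clusters
>>> toy_agglomerate([(0.5, 0.5), (1.5, 0.5), (15.5, 15.5), (16.5, 16.5)], n_clusters=2)
[[0, 1], [2, 3]]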

Parameters
n_clustersint, optional

Number of clusters to form. The value ranges from 1 to the initial number of input data points.

Defaults to 1.

affinity{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower', 'precomputed'}, optional

Ways to compute the distance between two points.

Note

  • (1) For jaccard distance, non-zero input values are treated as 1 and zero values as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10), where M11 counts the attributes that are 1 in both points, M01 those that are 0 in the first point and 1 in the second, and M10 those that are 1 in the first point and 0 in the second. A worked example follows this parameter's description.

  • (2) Only the gower distance supports categorical attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.

Defaults to 'squared euclidean'.
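
For the jaccard distance described in the note above, a short worked example in plain Python (independent of PAL): two rows are first binarized (non-zero becomes 1), then M11, M01, and M10 are counted and plugged into the formula.

>>> a = [0.5, 0.0, 2.0, 0.0]    # binarized: 1, 0, 1, 0
>>> b = [1.0, 3.0, 0.0, 0.0]    # binarized: 1, 1, 0, 0
>>> xa = [1 if v != 0 else 0 for v in a]
>>> xb = [1 if v != 0 else 0 for v in b]
>>> m11 = sum(1 for u, v in zip(xa, xb) if u == 1 and v == 1)
>>> m01 = sum(1 for u, v in zip(xa, xb) if u == 0 and v == 1)
>>> m10 = sum(1 for u, v in zip(xa, xb) if u == 1 and v == 0)
>>> round((m01 + m10) / (m11 + m01 + m10), 4)
0.6667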

linkage{ 'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional

Linkage type between two clusters; an illustration of how several of these options differ follows this parameter's description.

  • 'nearest neighbor' : single linkage.

  • 'furthest neighbor' : complete linkage.

  • 'group average' : UPGMA.

  • 'weighted average' : WPGMA.

  • 'centroid clustering'.

  • 'median clustering'.

  • 'ward'.

Defaults to 'centroid clustering'.

Note

For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.
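
To illustrate how several linkage options differ (plain Python, independent of PAL): given the pairwise distances between the members of two clusters, 'nearest neighbor' uses the minimum, 'furthest neighbor' the maximum, and 'group average' the mean; the remaining options are based on cluster centroids, medians, and the Ward criterion, respectively.

>>> # Hypothetical pairwise distances between members of two clusters.
>>> pairwise = [1.0, 2.0, 4.0, 5.0]
>>> min(pairwise)                   # 'nearest neighbor' (single linkage)
1.0
>>> max(pairwise)                   # 'furthest neighbor' (complete linkage)
5.0
>>> sum(pairwise) / len(pairwise)   # 'group average' (UPGMA)
3.0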

thread_ratiofloat, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimensionfloat, optional

Specifies the order p of the Minkowski distance. The value must be no less than 1.

Only valid when affinity is 'minkowski'.

Defaults to 3.
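
For reference, with distance_dimension p the Minkowski distance between two points is the p-th root of the sum of absolute coordinate differences raised to the power p (p = 2 gives the Euclidean and p = 1 the Manhattan distance). A plain-Python sketch, independent of PAL:

>>> def minkowski(x, y, p=3.0):
...     # p corresponds to distance_dimension (must be >= 1).
...     return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
>>> round(minkowski([0.5, 0.5], [1.5, 1.5], p=3.0), 4)
1.2599
>>> round(minkowski([0.5, 0.5], [1.5, 1.5], p=2.0), 4)   # Euclidean
1.4142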

normalizationstr, optional

Specifies the type of normalization applied; a sketch of the corresponding formulas follows this parameter's description.

  • 'no': No normalization

  • 'z-score': Z-score standardization

  • 'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].

  • 'min-max': Standard min-max normalization, transforming to new range [0, 1].

Valid only when affinity is not 'precomputed'.

Defaults to 'no'.
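
A plain-Python sketch of the corresponding formulas on a toy column (independent of PAL; the z-score here uses the population standard deviation, and PAL's exact conventions, e.g. for constant columns, may differ):

>>> col = [2.0, 4.0, 6.0, 8.0]
>>> mean = sum(col) / len(col)
>>> std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
>>> [round((v - mean) / std, 4) for v in col]                   # 'z-score'
[-1.3416, -0.4472, 0.4472, 1.3416]
>>> lo, hi = min(col), max(col)
>>> [round((v - lo) / (hi - lo), 4) for v in col]               # 'min-max', range [0, 1]
[0.0, 0.3333, 0.6667, 1.0]
>>> [round(2 * (v - lo) / (hi - lo) - 1, 4) for v in col]       # 'zero-centred-min-max', range [-1, 1]
[-1.0, -0.3333, 0.3333, 1.0]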

category_weightsfloat, optional

Represents the weight of categorical columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
   POINT       X1     X2    X3
0      0      0.5    0.5     1
1      1      1.5    0.5     2
2      2      1.5    1.5     2
3      3      0.5    1.5     2
4      4      1.1    1.2     2
5      5      0.5   15.5     2
6      6      1.5   15.5     3
7      7      1.5   16.5     3
8      8      0.5   16.5     3
9      9      1.2   16.1     3
10    10     15.5   15.5     3
11    11     16.5   15.5     4
12    12     16.5   16.5     4
13    13     15.5   16.5     4
14    14     15.6   16.2     4
15    15     15.5    0.5     4
16    16     16.5    0.5     1
17    17     16.5    1.5     1
18    18     15.5    1.5     1
19    19     15.7    1.6     1

Create an AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
...          n_clusters=4,
...          affinity='gower',
...          linkage='weighted average',
...          thread_ratio=None,
...          distance_dimension=3,
...          normalization='no',
...          category_weights=0.1)

Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
   STAGE  LEFT_POINT   RIGHT_POINT    DISTANCE
0      1          18           19       0.0187
1      2          13           14       0.0250
2      3           7            9       0.0437
>>> hc.labels_.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1
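
The labels can also be obtained in a single call with fit_predict, which fits the model and returns the label DataFrame (the content should match hc.labels_ above):

>>> labels = hc.fit_predict(data=df, key='POINT',
...                         categorical_variable=['X3'])
>>> labels.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1
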
Attributes
combine_process_DataFrame

Structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: same data type as the ID column of the input table, named LEFT_ + the ID column name. One of the two clusters combined in this stage, identified by its row number in the input data table. After combining, the new cluster takes the name of the left one.

  • 3rd column: same data type as the ID column of the input table, named RIGHT_ + the ID column name. The other cluster combined in the same stage, identified by its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

labels_DataFrame

Labels assigned to each sample, structured as follows:

  • 1st column: named after the ID column of the input data (or after the first column of the input DataFrame when affinity is 'precomputed'), record ID.

  • 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

fit(data[, key, features, categorical_variable])

Fit the model when given the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between first point & second point.
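
For illustration, a precomputed distance table for three points could be built from a pandas DataFrame and passed directly to fit. This is a hedged sketch: the column names LEFT_ID/RIGHT_ID/DISTANCE, the table name, and the connection object conn are placeholders chosen for this example only; any three columns in the order described above should work.

>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> dist_pd = pd.DataFrame({'LEFT_ID':  [0, 0, 1],
...                         'RIGHT_ID': [1, 2, 2],
...                         'DISTANCE': [0.5, 4.2, 3.9]})
>>> dist_df = create_dataframe_from_pandas(conn, dist_pd,
...                                        table_name='PRECOMPUTED_DIST',
...                                        force=True)
>>> hc_pre = AgglomerateHierarchicalClustering(n_clusters=2, affinity='precomputed')
>>> hc_pre.fit(data=dist_df)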

keystr, optional

Name of ID column in data.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column and affinity is not set as 'precomputed' in initialization, key must be provided explicitly.

Valid only when affinity is not 'precomputed' in initialization.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' columns are treated as categorical variables, and 'INTEGER' or 'DOUBLE' columns as continuous variables.

Valid only when affinity is not 'precomputed' in initialization.

Defaults to None.

fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between first point & second point.

keystr, optional

Name of ID column in data.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column and affinity is not set as 'precomputed' in initialization, key must be provided explicitly.

Valid only when affinity is not 'precomputed' in initialization.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' columns are treated as categorical variables, and 'INTEGER' or 'DOUBLE' columns as continuous variables.

Valid only when affinity is not 'precomputed' in initialization.

Defaults to None.

Returns
DataFrame

Labels of each point.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.
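
A minimal sketch of accessing these properties once a model has been fitted (the generated SQLScript text itself is not shown here):

>>> print(hc.fit_hdbprocedure)        # SQLScript procedure generated for fit
>>> print(hc.predict_hdbprocedure)    # SQLScript procedure generated for predict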

Inherited Methods from PALBase

Besides the methods mentioned above, the AgglomerateHierarchicalClustering class also inherits methods from the PALBase class; please refer to PAL Base for more details.