
class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Agglomerate Hierarchical Clustering is a widely used clustering method that finds natural groups within a set of data. The idea is to organize the data into a hierarchy, that is, a binary tree of subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition.

The implementation in HANA PAL follows the agglomerate approach, which merges clusters with a bottom-up strategy. Initially, each data point is considered its own cluster. The algorithm then greedily and iteratively merges the two clusters that are closest under the dissimilarity measure, forming a larger cluster at each step.
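The bottom-up strategy described above can be sketched in a few lines of plain Python. This is an illustration only, not the PAL implementation: the function name `agglomerate`, the use of Euclidean distance, and the choice of single linkage are assumptions made for the sketch.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters=1):
    """Greedy bottom-up merging with single linkage (illustrative only)."""
    # Start with every point in its own cluster, keyed by its row number.
    clusters = {i: [i] for i in range(len(points))}
    history = []  # one (stage, left, right, distance) tuple per merge
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a, b in combinations(sorted(clusters), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Merge the right cluster into the left one, which keeps its name.
        clusters[a].extend(clusters.pop(b))
        history.append((len(history) + 1, a, b, d))
    return clusters, history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
clusters, history = agglomerate(points, n_clusters=2)
print(clusters)  # {0: [0, 1], 2: [2, 3]}
print(history)   # [(1, 0, 1, 1.0), (2, 2, 3, 1.5)]
```

The sketch recomputes every pairwise cluster distance at each stage, which is simple to read but far slower than a production implementation.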

n_clusters : int, optional

Number of clusters to produce. Must be between 1 and the number of input data points.

Defaults to 1.

affinity : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower', 'precomputed'}, optional

Determines the method for calculating the distance between two points.


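  • (1) For jaccard distance, non-zero input data is treated as 1 and zero input data is treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10), where M11 is the number of attributes that are 1 in both points, and M01 and M10 are the numbers of attributes that are 1 in only one of the two points.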

  • (2) Only gower distance supports category attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.

Defaults to 'squared euclidean'.
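The jaccard rule in note (1) can be checked by hand with a small helper. This is a plain-Python sketch, not a PAL call; the function name `jaccard_distance` and the handling of two all-zero inputs are assumptions, since the source does not say what happens in that case.

```python
def jaccard_distance(x, y):
    """Jaccard distance after binarising inputs: non-zero -> 1, zero -> 0."""
    m11 = m10 = m01 = 0
    for a, b in zip(x, y):
        a, b = a != 0, b != 0
        if a and b:
            m11 += 1   # attribute is 1 in both points
        elif a:
            m10 += 1   # attribute is 1 only in x
        elif b:
            m01 += 1   # attribute is 1 only in y
    total = m11 + m10 + m01
    # Two all-zero vectors leave the ratio undefined; returning 0.0 is
    # a choice made for this sketch only.
    return (m01 + m10) / total if total else 0.0


print(jaccard_distance([1, 0, 2, 0], [0, 0, 3, 1]))  # (1 + 1) / (1 + 1 + 1)
```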

linkage : {'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional

Linkage type between two clusters.

  • 'nearest neighbor' : single linkage.

  • 'furthest neighbor' : complete linkage.

  • 'group average' : UPGMA.

  • 'weighted average' : WPGMA.

  • 'centroid clustering'.

  • 'median clustering'.

  • 'ward'.

Defaults to 'centroid clustering'.


For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.

thread_ratio : float, optional

Fraction of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, in which case the number of threads is determined heuristically.

Defaults to 0.

distance_dimension : float, optional

Distance dimension (the exponent) of the Minkowski distance. The value should be no less than 1.

Only valid when affinity is 'minkowski'.

Defaults to 3.
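To make the role of this parameter concrete, the sketch below computes a Minkowski distance in plain Python. The function name `minkowski` and its argument `p` are hypothetical; `p` stands in for distance_dimension and is not part of the hana_ml API.

```python
def minkowski(x, y, p=3.0):
    """Minkowski distance of order p between two equal-length points."""
    if p < 1:
        raise ValueError("p must be no less than 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance: 7.0
print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance, approximately 5.0
print(minkowski((0, 0), (3, 4)))       # order 3, the cube root of 91
```

With p=1 the measure reduces to the manhattan distance and with p=2 to the euclidean distance, which is why the parameter only matters when affinity is 'minkowski'.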

normalization : str, optional

Specifies the type of normalization applied.

  • 'no': No normalization

  • 'z-score': Z-score standardization

  • 'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].

  • 'min-max': Standard min-max normalization, transforming to new range [0, 1].

Valid only when affinity is not 'precomputed'.

Defaults to 'no'.
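The two min-max variants differ only in their target range. The following plain-Python sketch shows the arithmetic for a single column; the helper names are hypothetical, and how PAL treats a constant column is not stated in the source, so the sketch simply rejects it.

```python
def min_max(values):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant columns have no spread to rescale (sketch-only choice).
        raise ValueError("cannot rescale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


def zero_centred_min_max(values):
    """Rescale a column linearly to the range [-1, 1]."""
    return [2.0 * s - 1.0 for s in min_max(values)]


column = [0.0, 5.0, 10.0]
print(min_max(column))               # [0.0, 0.5, 1.0]
print(zero_centred_min_max(column))  # [-1.0, 0.0, 1.0]
```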

category_weights : float, optional

Weight applied to the categorical columns.

Defaults to 1.


>>> df.collect()
   POINT       X1     X2    X3
0      0      0.5    0.5     1
1      1      1.5    0.5     2
18    18     15.5    1.5     1
19    19     15.7    1.6     1

Create an AgglomerateHierarchicalClustering instance:

>>> from hana_ml.algorithms.pal.clustering import AgglomerateHierarchicalClustering
>>> hc = AgglomerateHierarchicalClustering(
...     linkage='weighted average',
...     category_weights=0.1)

Perform fit():

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
   STAGE  LEFT_POINT  RIGHT_POINT     DISTANCE
0      1          18           19       0.0187
1      2          13           14       0.0250
2      3           7            9       0.0437
>>> hc.labels_.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1

combine_process_ : DataFrame

Merge history of the clustering, structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: same data type as the ID column of the input table, named LEFT_ + name of the ID column. One of the two clusters combined in this stage, named after its row number in the input data table. After combining, the new cluster is named after the left one.

  • 3rd column: same data type as the ID column of the input table, named RIGHT_ + name of the ID column. The other cluster combined in the same stage, named after its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.


labels_ : DataFrame

Label assigned to each sample, structured as follows:

  • 1st column: record ID, named after the ID column of the input data (or after the first column of the input DataFrame when affinity is 'precomputed').

  • 2nd column: CLUSTER_ID, cluster number after applying the agglomerate hierarchical clustering algorithm.
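The merge history and the labels are closely related. As a rough sketch, assuming the history is available as plain (stage, left, right, distance) tuples and that a merged cluster keeps the name of its left member, labels for a given number of clusters can be recovered by replaying the merges in stage order. The function name `labels_from_merges` and the 0-based renumbering of clusters are choices made for this sketch and need not match the CLUSTER_ID values PAL assigns.

```python
def labels_from_merges(ids, merges, n_clusters):
    """Replay merge stages until only n_clusters clusters remain."""
    # Each point starts as its own cluster, named after its ID.
    owner = {i: i for i in ids}
    remaining = len(ids)
    for stage, left, right, distance in sorted(merges):
        if remaining <= n_clusters:
            break
        # The merged cluster keeps the name of the left one.
        for point, cluster in owner.items():
            if cluster == right:
                owner[point] = left
        remaining -= 1
    # Renumber surviving cluster names as 0, 1, 2, ... (arbitrary numbering).
    renumber = {name: k for k, name in enumerate(sorted(set(owner.values())))}
    return {point: renumber[cluster] for point, cluster in owner.items()}


merges = [(1, 0, 1, 1.0), (2, 2, 3, 1.5), (3, 0, 2, 6.4)]
print(labels_from_merges([0, 1, 2, 3], merges, n_clusters=2))
# {0: 0, 1: 0, 2: 1, 3: 1}
```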


fit(data[, key, features, categorical_variable])

Fit the model to the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model to the training dataset.


data : DataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between the first point and the second point.
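The three-column layout above can be produced from raw coordinates before loading the rows into a table. The sketch below builds such rows in plain Python using Euclidean distance; the helper name `pairwise_rows` is hypothetical, and listing each unordered pair exactly once is an assumption, since the source does not say whether both orderings of a pair are expected.

```python
from itertools import combinations
from math import dist


def pairwise_rows(points):
    """Build (first ID, second ID, distance) rows from an ID -> coordinates dict."""
    return [(a, b, dist(points[a], points[b]))
            for a, b in combinations(sorted(points), 2)]


points = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 1.0)}
for row in pairwise_rows(points):
    print(row)
```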

key : str, optional

Name of the ID column in data. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column and affinity is not 'precomputed', key must be provided.

Valid only when affinity is not 'precomputed' in initialization.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variable : str or list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.


data : DataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between the first point and the second point.

key : str, optional

Name of the ID column in data. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column and affinity is not 'precomputed', key must be provided.

Valid only when affinity is not 'precomputed' in initialization.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variable : str or list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.


Returns a DataFrame containing the label of each point.

Inherited Methods from PALBase

Besides the methods described above, the AgglomerateHierarchicalClustering class also inherits methods from the PALBase class. Please refer to PAL Base for more details.