AgglomerateHierarchicalClustering
- class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)
This algorithm is a widely used clustering method that finds natural groups within a set of data by organizing it into a hierarchy, or binary tree, of subgroups. Hierarchical clustering can be either agglomerative or divisive, depending on the direction of the hierarchical decomposition. The implementation in PAL follows the agglomerative approach, which merges clusters with a bottom-up strategy: initially, each data point is treated as its own cluster, and the algorithm greedily and iteratively merges the two least dissimilar clusters to form a larger one.
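As a toy illustration of this bottom-up merge strategy (plain Python, independent of PAL and hana_ml; it uses single linkage with squared Euclidean distance, and all names are hypothetical):

```python
from itertools import combinations

def agglomerate(points, n_clusters):
    """Toy bottom-up agglomerative clustering: start with one cluster per
    point, then greedily merge the two closest clusters (single linkage,
    squared Euclidean distance) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def linkage(c1, c2):
        # single linkage: distance between the two nearest members
        return min(sq_dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # find the pair of clusters with minimal dissimilarity
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters

# two well-separated groups collapse into two clusters
pts = [(0.0, 0.0), (0.5, 0.5), (10.0, 10.0), (10.5, 10.5)]
print(sorted(sorted(c) for c in agglomerate(pts, 2)))
```

The PAL implementation differs in scale and in its choice of linkage and affinity, but the control flow — repeated greedy merging of the closest pair — is the same.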
- Parameters
- n_clustersint, optional
Number of clusters formed by the agglomerate hierarchical clustering algorithm. Value range: between 1 and the number of input data points.
Defaults to 1.
- affinity{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower', 'precomputed'}, optional
Ways to compute the distance between two points.
Note
(1) For jaccard distance, non-zero input values are treated as 1 and zero values as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10), where Mab is the number of attributes equal to a in the first point and b in the second.
(2) Only gower distance supports category attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.
Defaults to 'squared euclidean'.
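The jaccard behaviour noted above (binarizing non-zero inputs before counting matches) can be reproduced with a short sketch; the helper name is hypothetical, not part of the hana_ml API:

```python
def jaccard_distance(u, v):
    """Jaccard distance as defined above: non-zero values are treated as 1,
    zeros as 0; distance = (M01 + M10) / (M11 + M01 + M10)."""
    bu = [1 if x != 0 else 0 for x in u]
    bv = [1 if x != 0 else 0 for x in v]
    m11 = sum(1 for a, b in zip(bu, bv) if a == 1 and b == 1)
    m01 = sum(1 for a, b in zip(bu, bv) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(bu, bv) if a == 1 and b == 0)
    return (m01 + m10) / (m11 + m01 + m10)

# binarized: [0,1,1,0] vs [0,1,0,1] -> M11=1, M01=1, M10=1 -> 2/3
print(jaccard_distance([0, 2.5, 3.0, 0], [0, 1.0, 0, 4.2]))
```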
- linkage{ 'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional
Linkage type between two clusters.
'nearest neighbor' : single linkage.
'furthest neighbor' : complete linkage.
'group average' : UPGMA.
'weighted average' : WPGMA.
'centroid clustering'.
'median clustering'.
'ward'.
Defaults to 'centroid clustering'.
Note
For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.
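To make the first three linkage types concrete, a small sketch comparing single ('nearest neighbor'), complete ('furthest neighbor'), and group-average (UPGMA) linkage between two clusters, using squared Euclidean distance (plain-Python illustration; function names are hypothetical):

```python
def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(c1, c2):
    # 'nearest neighbor': distance between the two closest members
    return min(sq_euclidean(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    # 'furthest neighbor': distance between the two farthest members
    return max(sq_euclidean(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    # 'group average' (UPGMA): mean of all pairwise distances
    d = [sq_euclidean(a, b) for a in c1 for b in c2]
    return sum(d) / len(d)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(c1, c2))    # 4.0, from (1,0) to (3,0)
print(complete_linkage(c1, c2))  # 25.0, from (0,0) to (5,0)
print(average_linkage(c1, c2))   # 13.5, mean of {9, 25, 4, 16}
```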
- thread_ratiofloat, optional
Specifies the ratio of total number of threads that can be used by this function.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- distance_dimensionfloat, optional
Power parameter of the Minkowski distance. The value should be no less than 1.
Only valid when affinity is 'minkowski'.
Defaults to 3.
- normalizationstr, optional
Specifies the type of normalization applied.
'no': No normalization
'z-score': Z-score standardization
'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].
'min-max': Standard min-max normalization, transforming to new range [0, 1].
Valid only when affinity is not 'precomputed'.
Defaults to 'no'.
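The normalization options can be sketched per column as follows. This is a plain-Python illustration of the formulas, not the PAL implementation; in particular, the exact zero-centred min-max formula is not documented here, so 2·(x − min)/(max − min) − 1 is an assumption that matches the stated [-1, 1] target range:

```python
def z_score(col):
    # 'z-score': (x - mean) / population standard deviation
    n = len(col)
    mean = sum(col) / n
    std = (sum((x - mean) ** 2 for x in col) / n) ** 0.5
    return [(x - mean) / std for x in col]

def min_max(col):
    # 'min-max': rescale to [0, 1]
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]

def zero_centred_min_max(col):
    # 'zero-centred-min-max': rescale to [-1, 1]
    # (assumed formula: 2*(x - min)/(max - min) - 1)
    lo, hi = min(col), max(col)
    return [2 * (x - lo) / (hi - lo) - 1 for x in col]

col = [1.0, 2.0, 3.0, 4.0]
print(min_max(col))                # endpoints map to 0 and 1
print(zero_centred_min_max(col))   # endpoints map to -1 and 1
```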
- category_weightsfloat, optional
Represents the weight of category columns.
Defaults to 1.
Examples
Input dataframe df for clustering:
>>> df.collect()
    POINT    X1    X2  X3
0       0   0.5   0.5   1
1       1   1.5   0.5   2
2       2   1.5   1.5   2
3       3   0.5   1.5   2
4       4   1.1   1.2   2
5       5   0.5  15.5   2
6       6   1.5  15.5   3
7       7   1.5  16.5   3
8       8   0.5  16.5   3
9       9   1.2  16.1   3
10     10  15.5  15.5   3
11     11  16.5  15.5   4
12     12  16.5  16.5   4
13     13  15.5  16.5   4
14     14  15.6  16.2   4
15     15  15.5   0.5   4
16     16  16.5   0.5   1
17     17  16.5   1.5   1
18     18  15.5   1.5   1
19     19  15.7   1.6   1
Create an AgglomerateHierarchicalClustering instance:
>>> hc = AgglomerateHierarchicalClustering(n_clusters=4,
...                                        affinity='Gower',
...                                        linkage='weighted average',
...                                        thread_ratio=None,
...                                        distance_dimension=3,
...                                        normalization='no',
...                                        category_weights=0.1)
Perform fit on the given data:
>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])
Expected output:
>>> hc.combine_process_.collect().head(3)
   STAGE  LEFT_POINT  RIGHT_POINT  DISTANCE
0      1          18           19    0.0187
1      2          13           14    0.0250
2      3           7            9    0.0437
>>> hc.labels_.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1
- Attributes
- combine_process_DataFrame
Structured as follows:
1st column: int, STAGE, cluster stage.
2nd column: same data type as the ID column of the input table, named LEFT_ + the ID column name. One of the two clusters combined in this stage, identified by its row number in the input data table. After combining, the new cluster is named after the left one.
3rd column: same data type as the ID column of the input table, named RIGHT_ + the ID column name. The other cluster combined in the same stage, identified by its row number in the input data table.
4th column: float, DISTANCE. Distance between the two combined clusters.
- labels_DataFrame
Label assigned to each sample, structured as follows:
1st column: named after the ID column of the input data (or after the first column of the input DataFrame when affinity is 'precomputed'), record ID.
2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.
Methods
fit(data[, key, features, categorical_variable])
Fit the model when given the training dataset.
fit_predict(data[, key, features, ...])
Fit with the dataset and return the labels.
- fit(data, key=None, features=None, categorical_variable=None)
Fit the model when given the training dataset.
- Parameters
- dataDataFrame
DataFrame containing the data.
If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:
1st column: ID of the first point.
2nd column: ID of the second point.
3rd column: Precomputed distance between the first point and the second point.
- keystr, optional
Name of ID column in data.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set and affinity is not 'precomputed' in initialization, key must be provided.
Valid only when affinity is not 'precomputed' in initialization.
- featuresa list of str, optional
Names of the features columns.
If features is not provided, it defaults to all non-key columns.
Valid only when affinity is not 'precomputed' in initialization.
- categorical_variablestr or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.
Valid only when affinity is not 'precomputed' in initialization.
Defaults to None.
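When affinity is 'precomputed', the input is a three-column distance table rather than a feature table. A plain-Python sketch of building such a table from pairwise squared Euclidean distances (the point IDs and values here are illustrative; in practice the rows would then be uploaded to SAP HANA, e.g. via hana_ml.dataframe.create_dataframe_from_pandas, before calling fit):

```python
from itertools import combinations

# hypothetical points: ID -> coordinates
points = {0: (0.5, 0.5), 1: (1.5, 0.5), 2: (15.5, 16.5)}

# Build the three-column structure described above:
# (ID of first point, ID of second point, precomputed distance) --
# here the distance is squared Euclidean.
rows = [(i, j, sum((x - y) ** 2 for x, y in zip(a, b)))
        for (i, a), (j, b) in combinations(points.items(), 2)]
print(rows)
```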
- fit_predict(data, key=None, features=None, categorical_variable=None)
Fit with the dataset and return the labels.
- Parameters
- dataDataFrame
DataFrame containing the data.
If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:
1st column: ID of the first point.
2nd column: ID of the second point.
3rd column: Precomputed distance between the first point and the second point.
- keystr, optional
Name of ID column in data.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set and affinity is not 'precomputed' in initialization, key must be provided.
Valid only when affinity is not 'precomputed' in initialization.
- featuresa list of str, optional
Names of the features columns.
If features is not provided, it defaults to all non-key columns.
Valid only when affinity is not 'precomputed' in initialization.
- categorical_variablestr or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.
Valid only when affinity is not 'precomputed' in initialization.
Defaults to None.
- Returns
- DataFrame
Label of each point.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.