Agglomerate Hierarchical Clustering — hanaml.AgglomerateHierarchical • hana.ml.r

hanaml.AgglomerateHierarchical is an R wrapper for the SAP HANA PAL Agglomerate Hierarchical Clustering algorithm.

hanaml.AgglomerateHierarchical(
  data,
  key,
  features = NULL,
  n.clusters = NULL,
  affinity = NULL,
  linkage = NULL,
  thread.ratio = NULL,
  distance.dimension = NULL,
  normalization = NULL,
  category.weights = NULL,
  categorical.variable = NULL
)

Arguments

data

DataFrame
DataFrame containing the data for agglomerate hierarchical clustering.
If affinity is "precomputed", data must instead describe the pairwise affinity between points, structured as follows:

1st column: ID of the first point.
2nd column: ID of the second point.
3rd column: Precomputed distance between the first and second points.
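
As a rough illustration of this three-column layout, the following base R sketch flattens a pairwise distance matrix into long format. The toy points and the column names ID1, ID2 and DISTANCE are hypothetical. The sketch lists each unordered pair once; whether PAL expects each pair once or in both orders is not stated here. The resulting table would still have to be uploaded to SAP HANA before it could be passed as data.

```r
# Hypothetical toy data: four points with two numeric features.
pts <- data.frame(ID = 0:3,
                  X1 = c(0.5, 1.5, 1.5, 0.5),
                  X2 = c(0.5, 0.5, 1.5, 1.5))

# Pairwise Euclidean distances as a full symmetric matrix.
d <- as.matrix(dist(pts[, c("X1", "X2")], method = "euclidean"))

# Keep each unordered pair once (upper triangle) and flatten to long format.
idx <- which(upper.tri(d), arr.ind = TRUE)
precomputed <- data.frame(
  ID1      = pts$ID[idx[, "row"]],   # 1st column: ID of the first point
  ID2      = pts$ID[idx[, "col"]],   # 2nd column: ID of the second point
  DISTANCE = d[idx]                  # 3rd column: precomputed distance
)
print(precomputed)
```

With 4 points the table has 4 * 3 / 2 = 6 rows, one per pair.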

key

character, optional
Specifies the name of the ID column in data.
Mandatory and valid only when affinity is not "precomputed".

features

character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key columns of data.
Valid only when affinity is not "precomputed".

n.clusters

integer, optional
Number of clusters to produce.
Value range: between 1 and the number of input data points.
Defaults to 1.

affinity

character, optional
Ways to compute the distance between two points:

"manhattan"
"euclidean"
"minkowski"
"chebyshev"
"cosine"
"pearson.correlation"
"squared.euclidean"
"jaccard"
"gower"
"precomputed"

Note that:
(1) For "jaccard", non-zero input data will be treated as 1, and zero input data will be treated as 0.
jaccard distance = (M01 + M10) / (M11 + M01 + M10)
where M11 is the number of attributes that are 1 in both points, and M01 and M10 are the numbers of attributes that are 1 in only one of the two points.
(2) Only "gower" supports categorical attributes.
(3) When linkage is "centroid.clustering", "median.clustering", or "ward", affinity must be set to "squared.euclidean".
Defaults to "squared.euclidean".
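
The Jaccard rule in note (1) can be sketched in base R as follows. This is only an illustration of the formula, assuming the usual reading of M11, M01 and M10 as counts over binarized attributes; the function name jaccard_distance and the sample vectors are hypothetical, and PAL's handling of edge cases (for example two all-zero points, where the denominator is 0) is not covered.

```r
# Jaccard distance between two numeric vectors:
# non-zero entries are treated as 1, zero entries as 0.
jaccard_distance <- function(a, b) {
  a <- a != 0
  b <- b != 0
  m11 <- sum(a & b)     # attributes that are 1 in both points
  m10 <- sum(a & !b)    # 1 in the first point only
  m01 <- sum(!a & b)    # 1 in the second point only
  (m01 + m10) / (m11 + m01 + m10)
}

x <- c(2, 0, 5, 0, 1)
y <- c(1, 3, 0, 0, 7)
jaccard_distance(x, y)   # M11 = 2, M10 = 1, M01 = 1, so (1 + 1) / (2 + 1 + 1) = 0.5
```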

linkage

character, optional
Linkage type between two clusters.

"nearest.neighbor": single linkage.
"furthest.neighbor": complete linkage.
"group.average": UPGMA.
"weighted.average": WPGMA.
"centroid.clustering"
"median.clustering"
"ward"

When linkage is "centroid.clustering", "median.clustering", or "ward", affinity must be set to "squared.euclidean".
Defaults to "centroid.clustering".
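
For readers who want to experiment locally, the pairing of centroid linkage with squared Euclidean distance can be sketched with base R's stats::hclust, which documents the same requirement for its "centroid" and "median" methods. This is an analogy only: it does not call PAL, the toy data are hypothetical, and the results are not guaranteed to match hanaml.AgglomerateHierarchical.

```r
# Hypothetical toy data: two well-separated groups of three points each.
pts <- data.frame(
  X1 = c(0.5, 1.5, 1.5, 15.5, 16.5, 16.5),
  X2 = c(0.5, 0.5, 1.5, 0.5, 0.5, 1.5)
)

# Squared Euclidean distances, as expected by centroid linkage.
d2 <- dist(pts, method = "euclidean")^2

# Centroid linkage on the squared distances, then cut the tree into 2 clusters.
# Other hclust methods: "single", "complete", "average" (UPGMA),
# "mcquitty" (WPGMA), "median", "ward.D", "ward.D2".
tree <- hclust(d2, method = "centroid")
labels <- cutree(tree, k = 2)
print(labels)
```

Here k plays the role that n.clusters plays in the wrapper: the merge tree is built once and then cut at the requested number of clusters.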

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

distance.dimension

double, optional
Can be set only when affinity is "minkowski". The value must be no less than 1.
Defaults to 3.
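
The following base R sketch shows how such a parameter affects the distance, assuming distance.dimension plays the role of the exponent p in the standard Minkowski distance (this page does not state the formula PAL uses). The helper minkowski_distance is hypothetical; the two points are rows 0 and 2 of the example data further down this page.

```r
# Minkowski distance with exponent p (p >= 1).
minkowski_distance <- function(a, b, p = 3) {
  stopifnot(p >= 1)
  sum(abs(a - b)^p)^(1 / p)
}

p0 <- c(0.5, 0.5, 1)   # POINT 0: X1, X2, X3
p2 <- c(1.5, 1.5, 2)   # POINT 2: X1, X2, X3

minkowski_distance(p0, p2, p = 1)   # same as the Manhattan distance
minkowski_distance(p0, p2, p = 2)   # same as the Euclidean distance
minkowski_distance(p0, p2, p = 3)   # exponent matching the documented default of 3

# Cross-check against stats::dist, which takes the same exponent as `p`.
check <- dist(rbind(p0, p2), method = "minkowski", p = 3)
print(check)
```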

normalization

character, optional
Normalization type:

"no": no normalization.
"z.score": z-score standardization.
"symmetric.min.max": rescales to the range -1 to 1.
"min.max": rescales to the range 0 to 1.

Valid only when affinity is not "precomputed".
Defaults to "no".
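
The three non-trivial options can be sketched in base R using the textbook formulas. These are assumptions rather than PAL's documented definitions; in particular, sd() below is the sample standard deviation, and PAL may use a different convention. The input is the X1 column of the first five rows of the example data on this page.

```r
x <- c(0.5, 1.5, 1.5, 0.5, 1.1)   # X1 values of POINT 0 to 4

# "z.score": subtract the mean, divide by the standard deviation.
z_score <- (x - mean(x)) / sd(x)

# "min.max": rescale linearly to the range 0 to 1.
min_max <- (x - min(x)) / (max(x) - min(x))

# "symmetric.min.max": rescale linearly to the range -1 to 1.
symmetric_min_max <- 2 * min_max - 1

print(data.frame(x, z_score, min_max, symmetric_min_max))
```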

category.weights

double, optional
Weight applied to categorical columns.
Defaults to 1.

categorical.variable

character or a list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
Effective only when affinity is not "precomputed".
No default value.

Value

An "AgglomerateHierarchical" object with the following attributes:

labels : DataFrame
Label of each point, structured as follows:
- 1st column: ID (in input table) data type, ID, record ID.
- 2nd column: int, CLUSTER_ID, the range is from 0 to n.clusters - 1.
comb.process : DataFrame
Record of the combining process, structured as follows:
- 1st column: int, STAGE, cluster stage.
- 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name, one of the two clusters combined in this stage, named by its row number in the input data table. After combining, the new cluster is named after the left one.
- 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, the other cluster combined in the same stage, named by its row number in the input data table.
- 4th column: float, DISTANCE. Distance between the two combined clusters.

Examples

Input DataFrame data:


> data$Collect()
   POINT   X1  X2 X3
0      0  0.5 0.5  1
1      1  1.5 0.5  2
2      2  1.5 1.5  2
3      3  0.5 1.5  2
4      4  1.1 1.2  2
......
16    16 16.5 0.5  1
17    17 16.5 1.5  1
18    18 15.5 1.5  1
19    19 15.7 1.6  1

Call the function:


> AH <- hanaml.AgglomerateHierarchical(data = data,
                                       key = "POINT",
                                       n.clusters = 4,
                                       affinity = "squared.euclidean",
                                       linkage = "centroid.clustering",
                                       thread.ratio = 0,
                                       distance.dimension = 3,
                                       normalization = "no",
                                       category.weights = 0.1)

Output:


> AH$comb.process$Collect()
   STAGE LEFT_POINT RIGHT_POINT DISTANCE
1      1         18          19   0.0187
2      2         13          14    0.025
3      3          7           9   0.0437
4      4          2           4   0.0438
......
16    16         15          16   0.1085
17    17          0          15   1.0381
18    18          5          10   1.0425
19    19          0           5   1.5146

> AH$labels$Collect()
   POINT CLUSTER_ID
1      0          1
2      1          1
3      2          1
4      3          1
......
17    16          4
18    17          4
19    18          4
20    19          4