R: Agglomerate Hierarchical Clustering

hanaml.AgglomerateHierarchical {hana.ml.r}

R Documentation

Agglomerate Hierarchical Clustering

Description

hanaml.AgglomerateHierarchical is an R wrapper for the PAL Agglomerate Hierarchical Clustering algorithm.

Usage

hanaml.AgglomerateHierarchical(conn.context,
                               data,
                               key,
                               features = NULL,
                               n.clusters = NULL,
                               affinity = NULL,
                               linkage = NULL,
                               thread.ratio = NULL,
                               distance.dimension = NULL,
                               normalization = NULL,
                               category.weights = NULL,
                               categorical.variable = NULL)

Arguments

`conn.context`	`ConnectionContext` Connection to the SAP HANA System
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of ID column.
`features`	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-ID columns.
`n.clusters`	`integer, optional` Number of clusters remaining after the agglomerate hierarchical clustering algorithm has run. Value range: between 1 and the number of input data points.
`affinity`	`character, optional` Method used to compute the distance between two points: `'manhattan'`, `'euclidean'`, `'minkowski'`, `'chebyshev'`, `'cosine'`, `'pearson.correlation'`, `'squared.euclidean'`, `'jaccard'`, `'gower'`. Note that (1) for `'jaccard'`, non-zero input data is treated as 1 and zero input data is treated as 0; jaccard distance = (M01 + M10) / (M11 + M01 + M10), where M11 counts the positions at which both points are 1, and M01 and M10 count the positions at which only one of them is 1. (2) Only `'gower'` supports categorical attributes. When linkage is `'centroid.clustering'`, `'median.clustering'`, or `'ward'`, this parameter must be set to `'squared.euclidean'`. Defaults to `'squared.euclidean'`.
`linkage`	`character, optional` Linkage type between two clusters. `'nearest.neighbor'`: single linkage. `'furthest.neighbor'`: complete linkage. `'group.average'`: UPGMA. `'weighted.average'`: WPGMA. `'centroid.clustering'`. `'median.clustering'`. `'ward'`. Note that only the `'gower'` affinity supports categorical attributes. When linkage is `'centroid.clustering'`, `'median.clustering'`, or `'ward'`, affinity must be set to `'squared.euclidean'`. Defaults to `'centroid.clustering'`.
`thread.ratio`	`double, optional` Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use. Defaults to 0.
`distance.dimension`	`double, optional` Distance dimension. Only valid when affinity is `'minkowski'`. The value must be no less than 1. Defaults to 3.
`normalization`	`character, optional` Normalization type. `"no"`: no normalization. `"z.score"`: z-score standardization. `"symmetric.min.max"`: rescales to the range -1 to 1. `"min.max"`: rescales to the range 0 to 1. Defaults to `"no"`.
`category.weights`	`double, optional` Represents the weight of category columns. Defaults to 1.
`categorical.variable`	`character or list of character, optional` Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' and 'NVARCHAR' columns are treated as categorical, and 'INTEGER' and 'DOUBLE' columns are treated as continuous. No default value.
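
The distance and normalization options above reduce to short formulas. The following base R sketch is only an illustration of those formulas and is not part of hana.ml.r; the function names are hypothetical. It assumes that distance.dimension is the exponent of the Minkowski distance and that z-score standardization uses the sample standard deviation, neither of which is confirmed by this page.

```r
# Jaccard distance per the affinity note: non-zero values count as 1, zeros as 0.
jaccard_distance <- function(a, b) {
  a <- a != 0
  b <- b != 0
  m11 <- sum(a & b)    # positions where both are non-zero
  m10 <- sum(a & !b)   # positions where only a is non-zero
  m01 <- sum(!a & b)   # positions where only b is non-zero
  (m01 + m10) / (m11 + m01 + m10)  # NaN if both vectors are all zero
}

# Minkowski distance; p is assumed to correspond to distance.dimension.
minkowski_distance <- function(a, b, p = 3) {
  stopifnot(p >= 1)
  sum(abs(a - b)^p)^(1 / p)
}

# Column rescaling for the normalization options. Assumption: z.score uses the
# sample standard deviation (R's sd()); PAL may define it differently.
normalize_column <- function(x, type = c("no", "z.score", "min.max",
                                         "symmetric.min.max")) {
  type <- match.arg(type)
  switch(type,
    "no" = x,
    "z.score" = (x - mean(x)) / sd(x),
    "min.max" = (x - min(x)) / (max(x) - min(x)),
    "symmetric.min.max" = 2 * (x - min(x)) / (max(x) - min(x)) - 1
  )
}

x <- c(2.5, 0, 7, 0)
y <- c(1, 3, 0, 0)
jaccard_distance(x, y)                       # M11 = 1, M10 = 1, M01 = 1, so 2/3
minkowski_distance(c(0, 0), c(1, 1), p = 3)  # 2^(1/3)
normalize_column(c(1, 2, 4), "min.max")      # 0, 1/3, 1
```

A constant column makes the min-max denominators zero; the sketch does not guard against that case.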

Format

R6Class object.

Value

labels : DataFrame
Label of each point, structured as follows:
- 1st column: ID (in input table) data type, ID, record ID.
- 2nd column: int, CLUSTER_ID, the range is from 0 to n.clusters - 1.
comb.process : DataFrame
Merge history, structured as follows:
- 1st column: int, STAGE, cluster stage.
- 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named as its row number in the input data table. After the combining, the new cluster is named after the left one.
- 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, The other cluster to be combined in the same combine stage, named as its row number in the input data table.
- 4th column: float, DISTANCE. Distance between the two combined clusters.
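
The two result tables are related: replaying the first (number of points - n.clusters) stages of comb.process reproduces the grouping in labels. The base R sketch below illustrates this with the LEFT_POINT and RIGHT_POINT values from the Examples section, typed in by hand rather than fetched from SAP HANA. The function name is hypothetical, and the group numbers it assigns follow order of first appearance, which need not match the CLUSTER_ID numbering that PAL returns.

```r
# Replay merge stages: the merged cluster keeps the name of the left cluster.
groups_from_merges <- function(left, right, ids, n.clusters) {
  stopifnot(length(left) == length(right),
            n.clusters >= 1, n.clusters <= length(ids))
  cluster_name <- ids                      # every point starts as its own cluster
  n_merges <- length(ids) - n.clusters     # stages needed to reach n.clusters
  for (i in seq_len(n_merges)) {
    # all members of the right cluster are renamed to the left cluster
    cluster_name[cluster_name == right[i]] <- left[i]
  }
  # renumber the surviving cluster names as 1, 2, ... by first appearance
  match(cluster_name, unique(cluster_name))
}

# LEFT_POINT / RIGHT_POINT columns of the example merge table, in STAGE order
left  <- c(18, 13, 7, 2, 2, 17, 6, 11, 11, 16, 6, 1, 0, 5, 10, 15, 0, 5, 0)
right <- c(19, 14, 9, 4, 3, 18, 7, 12, 13, 17, 8, 2, 1, 6, 11, 16, 15, 10, 5)

groups <- groups_from_merges(left, right, ids = 0:19, n.clusters = 4)
split(0:19, groups)   # points 0-4, 5-9, 10-14 and 15-19 form the four groups
```

Stopping earlier or later in the merge table gives the grouping for a different number of clusters without rerunning the algorithm.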

Examples

## Not run: 
 Input DataFrame data:

> data$collect()
 POINT        X1    X2    X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1

Create Agglomerate Hierarchical Clustering instance:

> AgglomerateHierarchical <-
     hanaml.AgglomerateHierarchical(conn.context = conn,
                                    data = data,
                                    key = "POINT",
                                    n.clusters = 4,
                                    affinity = 'squared.euclidean',
                                    linkage = 'centroid.clustering',
                                    thread.ratio = 0,
                                    distance.dimension = 3,
                                    normalization = "no",
                                    category.weights = 0.1)

Expected output:
> AgglomerateHierarchical$comb.process.tbl$collect()
        STAGE  LEFT_POINT RIGHT_POINT  DISTANCE
  1       1          18      19        0.0187
  2       2          13      14        0.025
  3       3          7       9         0.0437
  4       4          2       4         0.0438
  5       5          2       3         0.0594
  6       6          17      18        0.0594
  7       7          6       7         0.0594
  8       8          11      12        0.0625
  9       9          11      13        0.0906
  10     10          16      17        0.0922
  11     11          6       8         0.0953
  12     12          1       2         0.0953
  13     13          0       1         0.1727
  14     14          5       6         0.1727
  15     15          10      11        0.175
  16     16          15      16        0.1085
  17     17          0       15        1.0381
  18     18          5       10        1.0425
  19     19          0        5        1.5146


> AgglomerateHierarchical$labels$collect()
       POINT    CLUSTER_ID
  1      0          1
  2      1          1
  3      2          1
  4      3          1
  5      4          1
  6      5          2
  7      6          2
  8      7          2
  9      8          2
  10     9          2
  11    10          3
  12    11          3
  13    12          3
  14    13          3
  15    14          3
  16    15          4
  17    16          4
  18    17          4
  19    18          4
  20    19          4

## End(Not run)
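
For a quick local sanity check without a SAP HANA connection, the same grouping can be approximated with base R's stats::hclust, which also offers centroid linkage on squared Euclidean distances. This is a sketch using a different implementation, not the PAL algorithm: the merge heights it reports are not expected to match the DISTANCE column above, and only the resulting grouping is compared. The data frame is the example input typed in by hand.

```r
# Example input, entered locally (no ConnectionContext needed)
pts <- data.frame(
  X1 = c(0.5, 1.5, 1.5, 0.5, 1.1, 0.5, 1.5, 1.5, 0.5, 1.2,
         15.5, 16.5, 16.5, 15.5, 15.6, 15.5, 16.5, 16.5, 15.5, 15.7),
  X2 = c(0.5, 0.5, 1.5, 1.5, 1.2, 15.5, 15.5, 16.5, 16.5, 16.1,
         15.5, 15.5, 16.5, 16.5, 16.2, 0.5, 0.5, 1.5, 1.5, 1.6),
  X3 = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1),
  row.names = 0:19
)

# Centroid linkage is meant to be used with squared Euclidean distances
hc <- hclust(dist(pts)^2, method = "centroid")

# Cut the tree into four clusters and list the members of each
local_groups <- cutree(hc, k = 4)
split(rownames(pts), local_groups)
```

Points 0-4, 5-9, 10-14 and 15-19 should each end up together, because the four groups are far apart in X1 and X2 compared with the spread inside each group.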

[Package hana.ml.r version 1.0.8 Index]