hanaml.Kmedoid {hana.ml.r}

R Documentation

K-Medoids

Description

hanaml.Kmedoid is an R wrapper for the PAL K-Medoids algorithm.

Usage

hanaml.Kmedoid(conn.context,
               data,
               key,
               features = NULL,
               n.clusters,
               init = NULL,
               max.iter = NULL,
               tol = NULL,
               thread.ratio = NULL,
               distance.level = NULL,
               minkowski.power = NULL,
               category.weights = NULL,
               normalization = NULL,
               categorical.variable = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of the ID column.
`features`	`character, optional` Names of the feature columns.
`n.clusters`	`integer` Number of groups.
`init`	`character, optional` Controls how the initial centers are selected: `'first_k'`: the first k observations. `'replace'`: random with replacement. `'no_replace'`: random without replacement. `'patent'`: the initialization method described in US patent 6,882,998 B1. Defaults to 'patent'.
`max.iter`	`integer, optional` Maximum number of iterations. Defaults to 100.
`tol`	`double, optional` Convergence threshold for exiting iterations. Defaults to 1.0e-6.
`thread.ratio`	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Defaults to 0.
`distance.level`	`character, optional` Specifies how to compute the distance between an item and a cluster center. Valid options are 'manhattan', 'euclidean', 'minkowski', 'chebyshev', and 'cosine'. Defaults to 'euclidean'.
`minkowski.power`	`double, optional` When Minkowski distance is used, this parameter controls the value of power. Only valid when `distance.level` is 'minkowski'. Defaults to 3.0.
`category.weights`	`double, optional` Weight applied to categorical attributes. Defaults to 0.707.
`normalization`	`character, optional` Normalization type: `'no'`: no normalization. `'l1.norm'`: for each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|. `'min.max'`: for each column C, take the min and max value of C, then set C[i] = (C[i] - min)/(max - min). Defaults to 'no'.
`categorical.variable`	`character or list of characters, optional` Names of the columns in the data table to treat as categorical variables. No default value.
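
The `distance.level`, `minkowski.power`, and `normalization` options can be illustrated locally in base R. The following is a minimal sketch and not the PAL implementation: the helper names `point_distance`, `l1_normalize`, and `min_max_normalize` are hypothetical, and cosine distance is assumed here to be one minus the cosine similarity.

```r
# Illustrative helpers only: they mirror the formulas described in the
# argument list and are NOT part of hana.ml.r or PAL.

# Distance between two numeric vectors for a given distance level.
point_distance <- function(x, y, level = "euclidean", power = 3) {
  d <- abs(x - y)
  switch(level,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^power)^(1 / power),
    chebyshev = max(d),
    # Assumption: cosine distance taken as 1 - cosine similarity.
    cosine    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unknown distance level: ", level)
  )
}

# 'l1.norm': divide each row by the sum of the absolute values in that row.
l1_normalize <- function(m) {
  m / rowSums(abs(m))
}

# 'min.max': rescale each column to the range [0, 1].
# A constant column (max == min) would produce NaN in this sketch.
min_max_normalize <- function(m) {
  apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

a <- c(0.5, 0.5)
b <- c(1.5, 1.5)
point_distance(a, b, "euclidean")  # sqrt(2)
point_distance(a, b, "manhattan")  # 2
point_distance(a, b, "chebyshev")  # 1

m <- rbind(c(1, 3), c(2, 6), c(4, 0))
l1_normalize(m)       # absolute values in each row sum to 1
min_max_normalize(m)  # each column spans 0 to 1
```

The vectors `a` and `b` are the numeric features of the observations with ID 0 and ID 2 in the example data, so the Euclidean result corresponds to the distance between those two points when their categorical values agree.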

Format

R6Class object.

Details

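The K-Medoids clustering algorithm partitions n observations into K clusters by assigning each observation to its nearest cluster center. Unlike K-Means, which uses the mean of a cluster as its center, K-Medoids uses a medoid, an actual observation from the data, as the center of each cluster. This makes K-Medoids more robust to noise and outliers than K-Means.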
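
The idea can be sketched in a few lines of base R. This is a simplified illustration that alternates between an assignment step and a medoid update step, handles numeric features with Euclidean distance only, and starts from the first k rows. It is not the PAL implementation, and the function name `simple_kmedoids` is hypothetical.

```r
# Minimal K-Medoids sketch: numeric features, Euclidean distance only.
# NOT the PAL implementation; results need not match hanaml.Kmedoid.
simple_kmedoids <- function(x, k, max_iter = 100) {
  d <- as.matrix(dist(as.matrix(x)))   # pairwise Euclidean distances
  medoids <- seq_len(k)                # start from the first k rows
  for (iter in seq_len(max_iter)) {
    # Assignment step: each observation joins its nearest medoid.
    labels <- unname(apply(d[, medoids, drop = FALSE], 1, which.min))
    # Update step: within each cluster, pick the member with the smallest
    # total distance to the other members.
    new_medoids <- vapply(seq_len(k), function(j) {
      members <- which(labels == j)
      if (length(members) == 0L) return(medoids[j])  # keep an empty cluster's medoid
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    }, integer(1))
    if (all(new_medoids == medoids)) break  # medoids unchanged: converged
    medoids <- new_medoids
  }
  list(medoids = medoids, labels = labels)
}

# Numeric columns V000 and V002 of the data set used in the Examples section.
x <- data.frame(
  V000 = c(0.5, 1.5, 1.5, 0.5, 1.1, 0.5, 1.5, 1.5, 0.5, 1.2,
           15.5, 16.5, 16.5, 15.5, 15.6, 15.5, 16.5, 16.5, 15.5, 15.7),
  V002 = c(0.5, 0.5, 1.5, 1.5, 1.2, 15.5, 15.5, 16.5, 16.5, 16.1,
           15.5, 15.5, 16.5, 16.5, 16.2, 0.5, 0.5, 1.5, 1.5, 1.6)
)
res <- simple_kmedoids(x, k = 4)
res$medoids          # row indices of the observations chosen as medoids
table(res$labels)    # number of observations per cluster
```

Because the sketch ignores the categorical column V001, the medoids it selects may differ from those returned by PAL even when the cluster memberships agree.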

Value

labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.

Examples

## Not run: 
>data$Collect()
     ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

>kmed <- hanaml.Kmedoid(conn.context = conn,
                        data = data,
                        key = 'ID',
                        n.clusters = 4,
                        init = 'first_k',
                        max.iter = 100,
                        tol = 1.0E-6,
                        thread.ratio = 0.3,
                        distance.level = 'euclidean',
                        category.weights = 0.5)

>kmed$cluster.centers$Collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

>kmed$labels$Collect()
   ID CLUSTER_ID  DISTANCE
1   0          0 1.4142136
2   1          0 1.0000000
3   2          0 0.0000000
4   3          0 1.0000000
5   4          0 1.2071068
6   5          3 1.4142136
7   6          3 1.0000000
8   7          3 0.0000000
9   8          3 1.0000000
10  9          3 1.2071068
11 10          2 1.0000000
12 11          2 1.4142136
13 12          2 1.0000000
14 13          2 0.0000000
15 14          2 1.0233345
16 15          1 1.0000000
17 16          1 1.4142136
18 17          1 1.0000000
19 18          1 0.0000000
20 19          1 0.9307136

## End(Not run)
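
Once collected, the labels are an ordinary R data frame and can be summarized locally. The sketch below rebuilds the `kmed$labels$Collect()` output printed above by hand, so it runs without a HANA connection; the variable names `labels` and `medoid_ids` are illustrative.

```r
# Local copy of the kmed$labels$Collect() output shown in the example.
labels <- data.frame(
  ID = 0:19,
  CLUSTER_ID = rep(c(0L, 3L, 2L, 1L), each = 5),
  DISTANCE = c(1.4142136, 1.0000000, 0.0000000, 1.0000000, 1.2071068,
               1.4142136, 1.0000000, 0.0000000, 1.0000000, 1.2071068,
               1.0000000, 1.4142136, 1.0000000, 0.0000000, 1.0233345,
               1.0000000, 1.4142136, 1.0000000, 0.0000000, 0.9307136)
)

# Number of observations per cluster.
table(labels$CLUSTER_ID)

# A medoid is itself an observation, so the rows with distance 0
# identify the medoid of each cluster.
medoid_ids <- labels$ID[labels$DISTANCE == 0]
medoid_ids  # 2 7 13 18

# Mean distance to the medoid per cluster, a rough measure of compactness.
aggregate(DISTANCE ~ CLUSTER_ID, data = labels, FUN = mean)
```

The four IDs with distance 0 correspond to the rows reported in `kmed$cluster.centers`, for instance ID 2 with values (1.5, A, 1.5) for cluster 0.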

[Package hana.ml.r version 1.0.8 Index]