hanaml.Kmedoid {hana.ml.r}    R Documentation

K-Medoids

Description

hanaml.Kmedoid is an R wrapper for the PAL K-Medoids algorithm.

Usage

hanaml.Kmedoid(conn.context,
               data,
               key,
               features = NULL,
               n.clusters,
               init = NULL,
               max.iter = NULL,
               tol = NULL,
               thread.ratio = NULL,
               distance.level = NULL,
               minkowski.power = NULL,
               category.weights = NULL,
               normalization = NULL,
               categorical.variable = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character, optional
Names of the feature columns.

n.clusters

integer
Number of groups.

init

character, optional
Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patented method of selecting the initial centers (US 6,882,998 B1).


Defaults to 'patent'.

max.iter

integer, optional
Maximum number of iterations.
Defaults to 100.

tol

double, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Defaults to 0.

distance.level

character, optional
Specifies how to compute the distance between an item and a cluster center. Valid options are 'manhattan', 'euclidean', 'minkowski', 'chebyshev', and 'cosine'.
Defaults to 'euclidean'.

minkowski.power

double, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance.level is 'minkowski'.
Defaults to 3.0.
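
The distance options above can be illustrated in base R. This is a minimal sketch, not PAL's implementation: pairwise_distance is a hypothetical helper, it handles numeric vectors only, and it assumes that cosine distance means one minus cosine similarity.

```r
# Hypothetical helper (not part of hana.ml.r): distance between two
# numeric vectors under the metrics named by distance.level.
pairwise_distance <- function(x, y, level = "euclidean", power = 3.0) {
  d <- abs(x - y)
  switch(level,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^power)^(1 / power),   # power plays the role of minkowski.power
    chebyshev = max(d),
    # Assumption: cosine distance = 1 - cosine similarity.
    cosine    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unknown distance level: ", level)
  )
}

x <- c(0.5, 0.5)
y <- c(1.5, 1.5)
d.manhattan <- pairwise_distance(x, y, "manhattan")
d.euclidean <- pairwise_distance(x, y, "euclidean")
d.minkowski <- pairwise_distance(x, y, "minkowski", power = 3.0)
d.chebyshev <- pairwise_distance(x, y, "chebyshev")
```

With power = 2 the Minkowski form reduces to the Euclidean distance, and with power = 1 to the Manhattan distance.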

category.weights

double, optional
Weight of categorical attributes.
Defaults to 0.707.

normalization

character, optional
Normalization type:

  • 'no': no normalization.

  • 'l1.norm': for each point X (x1,x2,...,xn), the normalized value is X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min.max': for each column C, compute the min and max of C, then set C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
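
The two normalization formulas above can be sketched in base R on a plain numeric matrix. The helper names l1_norm and min_max are hypothetical and only mirror the formulas in the text; they are not functions of hana.ml.r and say nothing about how PAL treats edge cases.

```r
# Row-wise L1 normalization: divide each point by S = |x1| + ... + |xn|.
# Assumes no row is all zeros (S would be 0).
l1_norm <- function(m) {
  s <- rowSums(abs(m))
  m / s   # recycling down the columns divides row i by s[i]
}

# Column-wise min-max scaling: C[i] = (C[i] - min) / (max - min).
# Assumes no column is constant (max - min would be 0).
min_max <- function(m) {
  apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

m <- matrix(c(1, 3, 2, 6), nrow = 2)   # rows are the points (1, 2) and (3, 6)
m.l1 <- l1_norm(m)                     # every row now has L1 norm 1
m.mm <- min_max(m)                     # every column now spans [0, 1]
```

Note the difference in direction: the L1 formula rescales each observation (row), while the min-max formula rescales each feature (column).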

categorical.variable

character or list of characters, optional
Names of the columns in the data table to treat as categorical variables.
No default value.

Format

R6Class object.

Details

The K-Medoids clustering algorithm partitions n observations into K clusters by assigning each observation to its nearest cluster center. Each cluster center is a medoid, that is, an actual observation from the data set. Because the centers are observations rather than means, K-Medoids is more robust to noise and outliers than K-Means.
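
The idea can be sketched locally in base R. This is a simplified alternating heuristic (assign points, then re-pick medoids), written only to illustrate the description above; it is not the PAL procedure. simple_kmedoids is a hypothetical helper, it supports numeric data with Euclidean distance only, it starts from the first k rows, and it assumes the rows are distinct so that no cluster becomes empty.

```r
# Hypothetical helper: alternating k-medoids on a numeric matrix.
simple_kmedoids <- function(x, k, max.iter = 100) {
  d <- as.matrix(dist(x))    # pairwise Euclidean distances
  medoids <- seq_len(k)      # start from the first k observations
  for (iter in seq_len(max.iter)) {
    # Assign every observation to its nearest medoid.
    labels <- unname(apply(d[, medoids, drop = FALSE], 1, which.min))
    # Within each cluster, the new medoid is the member with the smallest
    # total distance to the other members.
    new.medoids <- vapply(seq_len(k), function(j) {
      members <- which(labels == j)
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    }, integer(1))
    if (identical(new.medoids, medoids)) break   # medoids stopped changing
    medoids <- new.medoids
  }
  list(medoids = medoids, labels = labels)
}

# Two well-separated squares of four points each.
x <- rbind(c(0.5, 0.5), c(1.5, 0.5), c(1.5, 1.5), c(0.5, 1.5),
           c(15.5, 15.5), c(16.5, 15.5), c(16.5, 16.5), c(15.5, 16.5))
res <- simple_kmedoids(x, k = 2)
res$labels    # rows 1-4 share one label, rows 5-8 the other
```

The returned medoids are row indices of x, which reflects the defining property of the method: every center is one of the input observations.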

Value

An R6Class object. As used in the Examples section, it exposes the following attributes:

cluster.centers

DataFrame
Cluster centers (medoids), with a CLUSTER_ID column followed by the feature columns.

labels

DataFrame
Cluster assignment for each observation, with columns ID, CLUSTER_ID, and DISTANCE.

Examples

## Not run: 
> data$Collect()
     ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

> kmed <- hanaml.Kmedoid(conn.context = conn,
                         data = data,
                         key = 'ID',
                         n.clusters = 4,
                         init = 'first_k',
                         max.iter = 100,
                         tol = 1.0E-6,
                         thread.ratio = 0.3,
                         distance.level = 'euclidean',
                         category.weights = 0.5)

> kmed$cluster.centers$Collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

> kmed$labels$Collect()
   ID CLUSTER_ID  DISTANCE
1   0          0 1.4142136
2   1          0 1.0000000
3   2          0 0.0000000
4   3          0 1.0000000
5   4          0 1.2071068
6   5          3 1.4142136
7   6          3 1.0000000
8   7          3 0.0000000
9   8          3 1.0000000
10  9          3 1.2071068
11 10          2 1.0000000
12 11          2 1.4142136
13 12          2 1.0000000
14 13          2 0.0000000
15 14          2 1.0233345
16 15          1 1.0000000
17 16          1 1.4142136
18 17          1 1.0000000
19 18          1 0.0000000
20 19          1 0.9307136

## End(Not run)
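
Some of the DISTANCE values in the output above can be cross-checked locally without a HANA connection. The sketch below recomputes the Euclidean distance for three rows of cluster 0 whose V001 category matches that of the medoid (row ID 2), using the numeric columns V000 and V002 copied from the example. Rows whose category differs from the medoid's carry an additional categorical term that this sketch does not model.

```r
# Numeric coordinates (V000, V002) copied from the example data.
medoid <- c(1.5, 1.5)                  # cluster 0 center, row ID 2
points <- rbind(id0 = c(0.5, 0.5),
                id1 = c(1.5, 0.5),
                id3 = c(0.5, 1.5))

# Subtract the medoid from every row, then take the Euclidean norm per row.
dists <- sqrt(rowSums(sweep(points, 2, medoid)^2))
round(dists, 7)
```

The results, sqrt(2) for ID 0 and 1 for IDs 1 and 3, agree with the reported DISTANCE values 1.4142136, 1.0000000, and 1.0000000.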

[Package hana.ml.r version 1.0.8 Index]