hanaml.Kmedian {hana.ml.r}    R Documentation

K-Medians

Description

hanaml.Kmedian is an R wrapper for the PAL K-Medians algorithm.

Usage

hanaml.Kmedian(conn.context,
               data,
               key,
               features = NULL,
               n.clusters,
               init = NULL,
               max.iter = NULL,
               tol = NULL,
               thread.ratio = NULL,
               distance.level = NULL,
               minkowski.power = NULL,
               category.weights = NULL,
               normalization = NULL,
               categorical.variable = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.

n.clusters

integer
Number of groups.

init

character, optional
Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random selection with replacement.

  • 'no_replace': Random selection without replacement.

  • 'patent': Patented method for selecting the initial centers (US 6,882,998 B1).


Defaults to 'patent'.
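The first three options can be illustrated with a small base R sketch. This is not PAL's implementation: the helper name pick.initial.centers is hypothetical, the data are a toy matrix, and the 'patent' method is not reproduced.

```r
# Hypothetical helper: choose k rows of x as initial centers.
pick.initial.centers <- function(x, k,
                                 init = c("first_k", "replace", "no_replace")) {
  init <- match.arg(init)
  idx <- switch(init,
    first_k    = seq_len(k),                           # first k observations
    replace    = sample(nrow(x), k, replace = TRUE),   # a row may be drawn twice
    no_replace = sample(nrow(x), k, replace = FALSE))  # k distinct rows
  x[idx, , drop = FALSE]
}

# Toy numeric data (five distinct rows).
x <- cbind(V000 = c(0.5, 1.5, 1.5, 0.5, 1.1),
           V002 = c(0.5, 0.5, 1.5, 1.5, 1.2))
set.seed(1)
first   <- pick.initial.centers(x, 2, "first_k")
with.r  <- pick.initial.centers(x, 2, "replace")
without <- pick.initial.centers(x, 2, "no_replace")
```

With 'replace', two initial centers can coincide, which is one reason to prefer 'no_replace' when sampling at random.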

max.iter

integer, optional
Maximum number of iterations.
Defaults to 100.

tol

double, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Defaults to 0.

distance.level

character, optional
Specifies how to compute the distance between an item and the cluster center. Valid options are 'manhattan', 'euclidean', 'minkowski', 'chebyshev', and 'cosine'.
Defaults to 'euclidean'.

minkowski.power

double, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance.level is 'minkowski'.
Defaults to 3.0.
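For orientation, the distance.level options correspond to the usual textbook formulas on numeric features. The base R sketch below is an illustration only: point.distance is a hypothetical helper, categorical attributes are ignored, and taking cosine distance as 1 minus cosine similarity is an assumption about the convention.

```r
# Hypothetical helper: distance between a point x and a center y
# (numeric vectors of equal length).
point.distance <- function(x, y, level = "euclidean", power = 3.0) {
  d <- abs(x - y)
  switch(level,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^power)^(1 / power),
    chebyshev = max(d),
    # Assumption: cosine distance taken as 1 - cosine similarity.
    cosine    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unknown distance level: ", level))
}

p   <- c(0.5, 0.5)  # numeric features of ID 0 in the Examples
ctr <- c(1.1, 1.2)  # numeric features of cluster center 0 in the Examples
point.distance(p, ctr, "euclidean")  # about 0.9219544, the DISTANCE listed for ID 0
point.distance(p, ctr, "minkowski", power = 3.0)
```

Setting power to 1 or 2 in the Minkowski formula reproduces the Manhattan and Euclidean distances, respectively.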

category.weights

double, optional
Weight applied to categorical attributes.
Defaults to 0.707.

normalization

character, optional
Normalization type:

  • 'no': no normalization.

  • l1.norm: for each point X (x1,x2,...,xn), the normalized value is X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...+|xn|.

  • min.max: for each column C, take the minimum and maximum of C, then set C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
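The two formulas above can be sketched in base R as follows. The helper names l1.normalize and min.max.normalize are hypothetical and the matrix is toy data; this only illustrates the arithmetic, not how PAL applies it.

```r
# Hypothetical helpers illustrating the two normalization formulas.

# L1 norm: divide each row by S, the sum of its absolute values.
l1.normalize <- function(x) x / rowSums(abs(x))

# Min-max: rescale each column to [0, 1].
# Note: a constant column would divide by zero here.
min.max.normalize <- function(x) {
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

x <- cbind(V000 = c(0.5, 1.5, 15.5),
           V002 = c(0.5, 1.5, 16.5))
l1 <- l1.normalize(x)       # absolute values in each row sum to 1
mm <- min.max.normalize(x)  # each column runs from 0 to 1
```

The L1 variant works per observation (row), while min-max works per feature (column), so the two options change the geometry of the data in different ways.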

categorical.variable

character or list of characters, optional
Names of the columns in the data to treat as categorical variables.
No default value.

Format

R6Class object.

Details

The K-Medians clustering algorithm partitions n observations into K clusters, assigning each observation to its nearest cluster center. Cluster centers are computed as the per-feature medians of the observations in each cluster.
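The iteration can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the PAL implementation: kmedians.sketch is a hypothetical name, only numeric features are handled, initialization takes the first k rows, distance is Euclidean, and an empty cluster simply keeps its previous center.

```r
# Hypothetical base R sketch of K-Medians on a numeric matrix.
kmedians.sketch <- function(x, k, max.iter = 100, tol = 1e-6) {
  centers <- x[seq_len(k), , drop = FALSE]  # 'first_k'-style initialization
  labels <- integer(nrow(x))
  for (iter in seq_len(max.iter)) {
    # Euclidean distance from every observation (rows) to every center (columns).
    d <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })
    labels <- max.col(-d, ties.method = "first")  # index of the nearest center
    new.centers <- centers
    for (j in seq_len(k)) {
      members <- x[labels == j, , drop = FALSE]
      # Per-feature median of the members; an empty cluster keeps its center.
      if (nrow(members) > 0) new.centers[j, ] <- apply(members, 2, median)
    }
    shift <- max(abs(new.centers - centers))
    centers <- new.centers
    if (shift < tol) break  # centers stopped moving
  }
  list(centers = centers, labels = labels)
}

# Toy data: two well-separated groups, interleaved row by row.
x <- rbind(c(0.5, 0.5), c(15.5, 15.5), c(1.5, 0.5),
           c(16.5, 15.5), c(1.1, 1.2), c(15.6, 16.2))
fit <- kmedians.sketch(x, 2)
fit$labels   # 1 2 1 2 1 2
fit$centers  # rows (1.1, 0.5) and (15.6, 15.5)
```

Because the median is taken feature by feature, a center such as (1.1, 0.5) need not coincide with any observation in the data.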

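Value

An R6Class object. As the Examples show, its cluster.centers and labels fields hold DataFrames with the cluster centers and the per-row cluster assignments.

Examples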

## Not run: 
Input DataFrame data:
> data$Collect()
     ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

> kmedian <- hanaml.Kmedian(conn.context = conn,
                           data = data,
                           key = "ID",
                           n.clusters = 4,
                           init = 'first_k',
                           max.iter = 100,
                           tol = 1.0E-6,
                           thread.ratio = 0.3,
                           distance.level = 'euclidean',
                           category.weights = 0.5)

Expected output:
> kmedian$cluster.centers$Collect()
    CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1

> kmedian$labels$Collect()
    ID CLUSTER_ID  DISTANCE
1   0          0 0.9219544
2   1          0 0.8062258
3   2          0 0.5000000
4   3          0 0.6708204
5   4          0 0.7071068
6   5          3 0.9219544
7   6          3 0.6708204
8   7          3 0.5000000
9   8          3 0.8062258
10  9          3 0.7071068
11 10          2 0.7071068
12 11          2 1.1401754
13 12          2 0.9486833
14 13          2 0.3162278
15 14          2 0.7071068
16 15          1 1.0198039
17 16          1 1.2806248
18 17          1 0.8000000
19 18          1 0.2000000
20 19          1 0.8071068

## End(Not run)
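The expected output above can be cross-checked without a HANA connection: each numeric entry in cluster.centers is the median of that feature over the rows assigned to the cluster. A minimal base R sketch, with values copied from the input and labels tables (the variable names are illustrative):

```r
# Numeric features copied from the input table (IDs 0 to 19).
v000 <- c(0.5, 1.5, 1.5, 0.5, 1.1, 0.5, 1.5, 1.5, 0.5, 1.2,
          15.5, 16.5, 16.5, 15.5, 15.6, 15.5, 16.5, 16.5, 15.5, 15.7)
v002 <- c(0.5, 0.5, 1.5, 1.5, 1.2, 15.5, 15.5, 16.5, 16.5, 16.1,
          15.5, 15.5, 16.5, 16.5, 16.2, 0.5, 0.5, 1.5, 1.5, 1.6)
# CLUSTER_ID per row, copied from the labels table.
cluster <- rep(c(0, 3, 2, 1), each = 5)

# Per-cluster, per-feature medians, one row per CLUSTER_ID.
centers <- aggregate(data.frame(V000 = v000, V002 = v002),
                     by = list(CLUSTER_ID = cluster), FUN = median)
centers
```

The resulting V000 and V002 columns agree with the cluster.centers table in the expected output.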

[Package hana.ml.r version 1.0.8 Index]