hanaml.KMeans.Rd
hanaml.KMeans is an R wrapper for the SAP HANA PAL K-means and accelerated K-means algorithms.
hanaml.KMeans(
data = NULL,
key = NULL,
features = NULL,
n.clusters = NULL,
n.clusters.min = NULL,
n.clusters.max = NULL,
init = NULL,
max.iter = NULL,
tol = NULL,
thread.ratio = NULL,
distance.level = NULL,
minkowski.power = NULL,
category.weights = NULL,
normalization = NULL,
categorical.variable = NULL,
memory.mode = NULL,
accelerated = FALSE,
use.fast.library = NULL,
use.float = NULL
)
data: DataFrame
DataFrame containing the data.
key: character
Name of the ID column.
features: character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.
n.clusters: integer, optional
Number of groups.
Note: If this parameter is not specified, you must specify the range of k using
n.clusters.min and n.clusters.max. Then the algorithm will iterate through
the range and return the k with the highest slight silhouette.
No default value.
n.clusters.min: integer, optional
Lower boundary of the clustering range.
Note: You must specify either an exact value or a range for k. If both are specified,
the exact value will be used.
No default value.
n.clusters.max: integer, optional
Upper boundary of the clustering range.
Note: You must specify either an exact value or a range for k. If both are specified,
the exact value will be used.
No default value.
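As a rough illustration of the range search described above, the following base R sketch tries each k in a range and keeps the one with the highest mean silhouette. It uses stats::kmeans and the standard silhouette definition as stand-ins; PAL's slight silhouette may be computed differently, and mean_silhouette and choose_k are hypothetical helpers that are not part of hanaml.

```r
# Mean silhouette width of a clustering (standard definition, used here
# only as a stand-in for PAL's slight silhouette).
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  n <- nrow(x)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) next  # singleton cluster: silhouette left at 0
    a <- mean(d[i, own])  # mean distance to the point's own cluster
    b <- min(sapply(setdiff(unique(cluster), cluster[i]),
                    function(cl) mean(d[i, cluster == cl])))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Try every k in [k.min, k.max] and return the k with the best score.
choose_k <- function(x, k.min, k.max) {
  stopifnot(k.min >= 2, k.max >= k.min)  # silhouette needs at least 2 clusters
  ks <- k.min:k.max
  scores <- sapply(ks, function(k) {
    fit <- kmeans(x, centers = k, nstart = 10)
    mean_silhouette(x, fit$cluster)
  })
  ks[which.max(scores)]
}

# Synthetic data: three well-separated groups of 20 points each.
set.seed(42)
blob <- function(cx, cy, n = 20) cbind(rnorm(n, cx, 0.3), rnorm(n, cy, 0.3))
x <- rbind(blob(0, 0), blob(10, 10), blob(0, 10))
best_k <- choose_k(x, k.min = 2, k.max = 5)
```

With hanaml.KMeans itself, the same search is requested by leaving n.clusters unset and passing n.clusters.min and n.clusters.max.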
init: character, optional
Controls how the initial centers are selected:
"first.k"
: First k observations.
"replace"
: Random with replacement.
"no.replace"
: Random without replacement.
"patent"
: Patented method for selecting the initial centers (US 6,882,998 B1).
Defaults to "patent".
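The three simple init options can be pictured with a small base R sketch. pick_initial_centers is a hypothetical helper written for illustration only; it is not how PAL selects centers internally, and the "patent" method is not reproduced here.

```r
# Hypothetical helper: choose k rows of a numeric matrix as initial centers.
pick_initial_centers <- function(x, k,
                                 init = c("first.k", "replace", "no.replace")) {
  init <- match.arg(init)
  n <- nrow(x)
  idx <- switch(init,
    first.k    = seq_len(k),                        # first k observations
    replace    = sample.int(n, k, replace = TRUE),  # the same row may be drawn twice
    no.replace = sample.int(n, k, replace = FALSE)  # k distinct rows
  )
  x[idx, , drop = FALSE]
}

# Five two-dimensional points.
x <- matrix(c(0.5, 1.5, 1.5, 0.5, 1.1,
              0.5, 0.5, 1.5, 1.5, 1.2), ncol = 2)

set.seed(1)
centers_first <- pick_initial_centers(x, 2, "first.k")
centers_norep <- pick_initial_centers(x, 2, "no.replace")
```

Sampling with replacement can pick the same observation twice, which gives two identical starting centers; "no.replace" avoids that.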
max.iter: integer, optional
Maximum number of iterations.
Defaults to 100.
tol: double, optional
Convergence threshold for exiting iterations.
Only valid when accelerated is FALSE.
Defaults to 1.0e-6.
thread.ratio: double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range are ignored.
Defaults to 0.
distance.level: character, optional
Specifies how to compute the distance between the item and the cluster center.
"manhattan"
"euclidean"
"minkowski"
"chebyshev"
"cosine"
: Valid only when accelerated is FALSE.
Defaults to "euclidean".
minkowski.power: double, optional
When Minkowski distance is used, this parameter controls the value of the power.
Only valid when distance.level is "minkowski".
Defaults to 3.0.
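For reference, the distance options can be written out in a few lines of base R. This is a sketch using the textbook definitions, with cosine distance taken as 1 minus the cosine similarity; PAL's exact formulas are not stated here and may differ. point_distance is a hypothetical helper, not a hanaml function.

```r
# Hypothetical helper: distance between two numeric vectors a and b.
point_distance <- function(a, b,
                           level = c("euclidean", "manhattan", "minkowski",
                                     "chebyshev", "cosine"),
                           power = 3) {
  level <- match.arg(level)
  d <- abs(a - b)
  switch(level,
    manhattan = sum(d),                    # sum of absolute differences
    euclidean = sqrt(sum(d^2)),            # straight-line distance
    minkowski = sum(d^power)^(1 / power),  # generalizes the two above
    chebyshev = max(d),                    # largest single difference
    cosine    = 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  )
}

a <- c(0, 0)
b <- c(3, 4)
d_euclidean <- point_distance(a, b, "euclidean")
d_manhattan <- point_distance(a, b, "manhattan")
d_chebyshev <- point_distance(a, b, "chebyshev")
d_minkowski <- point_distance(a, b, "minkowski", power = 2)  # same as euclidean
d_cosine    <- point_distance(c(1, 0), c(0, 1), "cosine")    # orthogonal vectors
```

Setting power to 1 or 2 in the Minkowski case reproduces the Manhattan and Euclidean distances, which is why minkowski.power only matters when distance.level is "minkowski".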
category.weights: double, optional
Represents the weight of category attributes.
Defaults to 0.707.
normalization: character, optional
Specifies the normalization type:
'no'
: no normalization.
'l1.norm'
: For each point
X = (x1,x2,...,xn),
the normalized value will be
X' = (x1/S,x2/S,...,xn/S),
where S = |x1|+|x2|+...+|xn|.
'min.max'
: For each column C, get the min and max value of C,
and then C[i] = (C[i]-min)/(max-min).
Defaults to "no".
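The two normalization formulas above translate directly into base R. The sketch below only illustrates the arithmetic on a local matrix; l1_normalize and min_max_normalize are hypothetical helpers, and in hanaml the normalization runs inside PAL.

```r
# 'l1.norm': divide each row (point) by the sum of its absolute values.
l1_normalize <- function(x) x / rowSums(abs(x))

# 'min.max': rescale each column to [0, 1].
# A constant column would give 0/0 here; that case is not handled.
min_max_normalize <- function(x) {
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

# Two points: (1, 2) and (3, 6).
x <- matrix(c(1, 3, 2, 6), nrow = 2)

x_l1 <- l1_normalize(x)       # each row now has absolute values summing to 1
x_mm <- min_max_normalize(x)  # each column now spans 0 to 1
```

Note that 'l1.norm' works per point (row) while 'min.max' works per feature (column), so the two options change the data in different directions.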
categorical.variable: character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the input column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
memory.mode: character, optional
Indicates the memory mode the algorithm uses.
"auto"
: Chosen by the algorithm.
"optimized.speed"
: Prioritizes speed.
"optimized.space"
: Prioritizes saving memory.
Only valid when accelerated is TRUE. Defaults to "auto".
accelerated: logical, optional
Indicates whether or not to accelerate the calculation process.
Defaults to FALSE.
use.fast.library: logical, optional
Uses vectorized accelerated operations when set to TRUE.
Not valid when accelerated is TRUE.
Defaults to FALSE.
use.float: logical, optional
If TRUE, use float; if FALSE, use double.
Only valid when use.fast.library is TRUE. Not valid when accelerated is TRUE.
Defaults to TRUE.
An R6 object of class "KMeans" with the following attributes and methods:
Attributes
labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.
model : DataFrame
Model content.
statistics : DataFrame
Statistic value.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> km <- hanaml.KMeans(data=df, key=df$columns[[1]])
> km$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model
is assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> km <- hanaml.KMeans(data=df, key=df$columns[[1]])
> km$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> km$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state is cleaned up and the associated memory is released.
Input DataFrame data:
> data$Collect()
ID V000 V001 V002
1 0 0.5 A 0.5
2 1 1.5 A 0.5
3 2 1.5 A 1.5
4 3 0.5 A 1.5
5 4 1.1 B 1.2
6 5 0.5 B 15.5
7 6 1.5 B 15.5
8 7 1.5 B 16.5
9 8 0.5 B 16.5
10 9 1.2 C 16.1
11 10 15.5 C 15.5
12 11 16.5 C 15.5
13 12 16.5 C 16.5
14 13 15.5 C 16.5
15 14 15.6 D 16.2
16 15 15.5 D 0.5
17 16 16.5 D 0.5
18 17 16.5 D 1.5
19 18 15.5 D 1.5
20 19 15.7 A 1.6
Call the function:
> km <- hanaml.KMeans(data = data,
                      key = "ID",
                      features = NULL,
                      n.clusters = 4,
                      init = "first.k",
                      max.iter = 100,
                      tol = 1.0E-6,
                      thread.ratio = 0.2,
                      distance.level = "euclidean",
                      category.weights = 0.5)
Output:
> km$labels$Collect()
ID CLUSTER_ID DISTANCE SLIGHT_SILHOUETTE
1 0 0 0.891088 0.944370
2 1 0 0.863917 0.942478
3 2 0 0.806252 0.946288
4 3 0 0.835684 0.944942
5 4 0 0.744571 0.950234
6 5 3 0.891088 0.940733