hanaml.UnifiedClustering.Rd
hanaml.UnifiedClustering is an R wrapper for SAP HANA PAL Unified Clustering.
hanaml.UnifiedClustering(
data = NULL,
func = NULL,
key = NULL,
features = NULL,
massive = FALSE,
group.key = NULL,
group.params = NULL,
...
)
data: DataFrame
DataFrame containing the data.
func: character
The clustering functionality to use. Valid options are:
"AgglomerateHierarchicalClustering"
"DBSCAN"
"GaussianMixture"
"AcceleratedKMeans"
"KMeans"
"KMedians"
"KMedoids"
"SOM"
"AffinityPropagation"
key: character, optional
Name of the ID column.
Defaults to the first column if not provided.
features: character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key columns of data.
massive: logical, optional
Specifies whether or not to use massive mode.
In massive mode, parameters can be set either through
group.params (please see the example below) or through the original parameters.
Original parameters apply to all groups. However, if any parameter is defined for a group in group.params,
none of the original parameter settings apply to that group.
An example is as follows:
udbscan <- hanaml.UnifiedClustering(data = df.fit,
                                    group.key = "GROUP_ID",
                                    func = 'DBSCAN',
                                    thread.ratio = 1.0,
                                    key = 'ID',
                                    massive = TRUE,
                                    group.params = list(
                                      'Group_1' = list(metric = 'manhattan')))
In this example, because metric='manhattan' is set in group.params for Group_1,
the thread.ratio=1.0 setting does not apply to Group_1.
Defaults to FALSE.
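The override rule described above can be sketched in plain R. This is a minimal illustration only: resolve.params is a hypothetical helper, not a function of the package, and it models just the documented behavior that a group with its own entry in group.params ignores the original parameters.

```r
# Hypothetical helper (not part of the package): returns the parameters
# that would be in effect for one group under the documented rule.
resolve.params <- function(group, original = list(), group.params = list()) {
  own <- group.params[[group]]      # NULL when the group has no entry
  if (is.null(own)) original else own
}

original     <- list(thread.ratio = 1.0)
group.params <- list(Group_1 = list(metric = "manhattan"))

# Group_1 has its own settings, so thread.ratio is not applied to it.
resolve.params("Group_1", original, group.params)
# Group_2 (an assumed second group) has none, so the original parameters apply.
resolve.params("Group_2", original, group.params)
```

Under this reading, a setting that every group should share has to be repeated in each group's own list once that group appears in group.params.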
group.key: character, optional
Name of the group key column. Its data type can be INT or NVARCHAR/VARCHAR.
If the data type is INT, only parameters set in group.params are valid.
This parameter is only valid when massive is TRUE.
Defaults to the first column of data if group.key is not provided.
group.params: list, optional
If massive mode is activated (massive=TRUE),
the input data is divided into groups, and different parameters can be applied to each group.
An example is as follows:
udbscan <- hanaml.UnifiedClustering(data = df.fit,
                                    group.key = "GROUP_ID",
                                    func = 'DBSCAN',
                                    thread.ratio = 1.0,
                                    key = 'ID',
                                    massive = TRUE,
                                    group.params = list(
                                      'Group_1' = list(metric = 'manhattan')))
res <- predict(model = udbscan,
               data = df.predict,
               group.key = "GROUP_ID",
               key = 'ID')
Valid only when massive is TRUE and defaults to NULL.
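When there are many groups, the nested list expected by group.params can be built programmatically instead of typed by hand. The sketch below uses only base R. The second group name and the 'euclidean' value are assumptions made for illustration, and whether a given value is accepted depends on the functionality chosen in func.

```r
# Group names and per-group metric values (illustrative assumptions).
groups  <- c("Group_1", "Group_2")
metrics <- c("manhattan", "euclidean")

# Build list(Group_1 = list(metric = ...), Group_2 = list(metric = ...)).
group.params <- setNames(
  lapply(metrics, function(m) list(metric = m)),
  groups
)

str(group.params)
```

The result has the same shape as the hand-written list in the example above and could be passed as the group.params argument.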
...: Specifies other parameters for training a clustering model with the functionality
specified in func.
Please see the documentation of the corresponding functionalities for more detail.
hanaml.AgglomerateHierarchical,
hanaml.DBSCAN,
hanaml.GaussianMixture,
hanaml.KMeans,
hanaml.KMedian,
hanaml.KMedoid,
hanaml.SOM,
hanaml.AffinityPropagation
Returns a "hanaml.UnifiedClustering" object with the following attributes and methods:
labels: DataFrame
DATA_ID
- ID column in the input data.
CLUSTER_ID
- The assigned cluster ID.
DISTANCE
- Distance between a given point and the cluster center (k-means), the
nearest core object (DBSCAN), or the weight vector (SOM); or the probability
of a given point belonging to the corresponding cluster (GMM).
SLIGHT_SILHOUETTE
- Estimated value (slight silhouette).
centers: DataFrame
CLUSTER_ID
VARIABLE_NAME
- The name of variable.
VALUE
- The value of variable.
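The centers table is in long format, with one row per cluster and variable. Once it has been collected into a local data.frame, it can be pivoted to one row per cluster with base R. The values below are made up for illustration, and the sketch assumes that VALUE is numeric, which may not hold for categorical variables.

```r
# Made-up centers in the documented layout (not real model output).
centers <- data.frame(
  CLUSTER_ID    = c(0, 0, 1, 1),
  VARIABLE_NAME = c("V000", "V002", "V000", "V002"),
  VALUE         = c(1.0, 1.0, 16.0, 1.0),
  stringsAsFactors = FALSE
)

# Pivot to a matrix: rows are clusters, columns are variables.
# Each (cluster, variable) pair occurs once, so taking the first value is safe.
wide <- tapply(centers$VALUE,
               list(centers$CLUSTER_ID, centers$VARIABLE_NAME),
               function(v) v[1])

wide
```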
model: DataFrame
ROW_INDEX
- Model row index.
PART_INDEX
- Specifically for GMM's CLUSTER_ID.
MODEL_CONTENT
- Model content.
statistics: DataFrame
STAT_NAME
- Statistics name.
STAT_VALUE
- Statistics value.
optimal.param: DataFrame
PARM_NAME
- Parameter name.
INT_VALUE
- Integer value.
DOUBLE_VALUE
- Double value.
STRING_VALUE
- Character value.
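Because each parameter value is stored in one of three typed columns, it can be convenient to collapse them into a single column after collecting the table locally. The sketch below assumes that the unused value columns of a row are NA. The parameter names and values are placeholders, not real output.

```r
# Placeholder rows in the documented optimal.param layout (made-up content).
optimal.param <- data.frame(
  PARM_NAME    = c("PARAM_A", "PARAM_B", "PARAM_C"),
  INT_VALUE    = c(4L, NA, NA),
  DOUBLE_VALUE = c(NA, 0.5, NA),
  STRING_VALUE = c(NA, NA, "some.value"),
  stringsAsFactors = FALSE
)

# Take the first non-missing value of each row, converted to character.
optimal.param$VALUE <- with(optimal.param,
  ifelse(!is.na(INT_VALUE), as.character(INT_VALUE),
  ifelse(!is.na(DOUBLE_VALUE), as.character(DOUBLE_VALUE), STRING_VALUE)))

optimal.param[, c("PARM_NAME", "VALUE")]
```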
error.msg: DataFrame
Error messages. Only valid when massive is TRUE.
The training data:
> data.fit$Collect()
ID V000 V001 V002
1 0 0.5 A 0.5
2 1 1.5 A 0.5
3 2 1.5 A 1.5
4 3 0.5 A 1.5
5 4 1.1 B 1.2
6 5 0.5 B 15.5
7 6 1.5 B 15.5
8 7 1.5 B 16.5
9 8 0.5 B 16.5
10 9 1.2 C 16.1
11 10 15.5 C 15.5
12 11 16.5 C 15.5
13 12 16.5 C 16.5
14 13 15.5 C 16.5
15 14 15.6 D 16.2
16 15 15.5 D 0.5
17 16 16.5 D 0.5
18 17 16.5 D 1.5
19 18 15.5 D 1.5
20 19 15.7 A 1.6
Create a UnifiedClustering model for KMeans:
ukmeans <- hanaml.UnifiedClustering(data = data.fit,
                                    func = "KMeans",
                                    n.clusters = 4,
                                    init = "first.k",
                                    max.iter = 100,
                                    tol = 1.0E-6,
                                    thread.ratio = 1.0,
                                    distance.level = "Euclidean",
                                    category.weights = 0.5)
Check the labels:
> ukmeans$labels$Collect()
ID CLUSTER_ID DISTANCE SLIGHT_SILHOUETTE
1 0 0 0.891088 0.944370
2 1 0 0.863917 0.942478
3 2 0 0.806252 0.946288
4 3 0 0.835684 0.944942
......
17 16 1 0.976885 0.939386
18 17 1 0.818178 0.945878
19 18 1 0.722799 0.952170
20 19 1 1.102342 0.925679
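Once the labels have been collected into a local data.frame, base R is enough to summarize them. The sketch below copies only the eight rows printed above (the elided rows are not included), with the last column named as in the labels attribute description.

```r
# The eight label rows shown in the output above, as a local data.frame.
labels <- data.frame(
  ID         = c(0, 1, 2, 3, 16, 17, 18, 19),
  CLUSTER_ID = c(0, 0, 0, 0, 1, 1, 1, 1),
  DISTANCE   = c(0.891088, 0.863917, 0.806252, 0.835684,
                 0.976885, 0.818178, 0.722799, 1.102342),
  SLIGHT_SILHOUETTE = c(0.944370, 0.942478, 0.946288, 0.944942,
                        0.939386, 0.945878, 0.952170, 0.925679)
)

# Number of points per cluster among the rows shown.
sizes <- table(labels$CLUSTER_ID)

# Mean slight silhouette per cluster among the rows shown.
mean.sil <- tapply(labels$SLIGHT_SILHOUETTE, labels$CLUSTER_ID, mean)

sizes
mean.sil
```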