Affinity Propagation — hanaml.AffinityPropagation • hana.ml.r

hanaml.AffinityPropagation is an R wrapper for the SAP HANA PAL Affinity Propagation algorithm.

hanaml.AffinityPropagation(
  data,
  key,
  features = NULL,
  affinity,
  n.clusters,
  max.iter = NULL,
  convergence.iter = NULL,
  damping = NULL,
  preference = NULL,
  seed.ratio = NULL,
  times = NULL,
  minkowski.power = NULL,
  thread.ratio = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

affinity

character
Method used to compute the distance between two points. Valid options:

'manhattan'
'euclidean'
'minkowski'
'chebyshev'
'standardized.euclidean'
'cosine'

No default value as it is mandatory.
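
As a rough illustration of what these options measure, the following base R sketch computes several of the listed distances for two points using the textbook formulas. The point values and the Minkowski power are hypothetical, the cosine distance is assumed to be 1 minus the cosine similarity, and PAL's exact definitions may differ in detail. 'standardized.euclidean' is left out because it depends on per-feature variances of the whole data set.

```r
# Two illustrative points (hypothetical values).
x <- c(0.10, 0.10)
y <- c(10.13, 10.12)
p <- 3  # Minkowski power, chosen only for illustration

d.manhattan <- sum(abs(x - y))
d.euclidean <- sqrt(sum((x - y)^2))
d.minkowski <- sum(abs(x - y)^p)^(1 / p)
d.chebyshev <- max(abs(x - y))
# Assumption: cosine distance = 1 - cosine similarity.
d.cosine <- 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Cross-check against base R's dist(), which names Chebyshev "maximum".
m <- rbind(x, y)
stopifnot(
  isTRUE(all.equal(d.manhattan, as.numeric(dist(m, method = "manhattan")))),
  isTRUE(all.equal(d.euclidean, as.numeric(dist(m, method = "euclidean")))),
  isTRUE(all.equal(d.minkowski, as.numeric(dist(m, method = "minkowski", p = p)))),
  isTRUE(all.equal(d.chebyshev, as.numeric(dist(m, method = "maximum"))))
)
```

With a Minkowski power of 1 the Minkowski distance reduces to the Manhattan distance, and with a power of 2 to the Euclidean distance.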

n.clusters

integer

0: The Affinity Propagation cluster result is not adjusted.
Non-zero integer: If Affinity Propagation produces more than n.clusters clusters, PAL merges them so that the final number of clusters equals n.clusters.

max.iter

integer, optional
Maximum number of iterations.
Defaults to 500.

convergence.iter

integer, optional
The algorithm stops when the clusters remain unchanged for the specified number of iterations.
Defaults to 100.

damping

double, optional
Controls the updating velocity. Value range: (0, 1).
Defaults to 0.9.
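
In the textbook formulation of Affinity Propagation, damping blends each freshly computed message with its previous value, so a larger damping slows the updates. The base R sketch below shows that rule. The function name damped.update is hypothetical, and it is an assumption that PAL applies damping in the same way.

```r
# Textbook damped update:
#   new message = damping * old message + (1 - damping) * proposed message
# damped.update is a hypothetical helper, not part of hana.ml.r.
damped.update <- function(old, proposed, damping = 0.9) {
  stopifnot(damping > 0, damping < 1)
  damping * old + (1 - damping) * proposed
}

old <- c(0, 0, 0)
proposed <- c(1, 2, 3)

# With damping = 0.9 each message moves 10% of the way to its proposed value.
slow <- damped.update(old, proposed, damping = 0.9)
# With damping = 0.5 each message moves halfway.
fast <- damped.update(old, proposed, damping = 0.5)
```

Higher damping generally trades speed for stability: messages oscillate less but need more iterations to settle.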

preference

double, optional
Determines the preference. Value range: [0,1].
Defaults to 0.5.

seed.ratio

double, optional
Selects a portion of the input data (seed.ratio * data_number rows) as the seed, where data_number is the number of rows in the input data. Value range: (0,1]. If seed.ratio is 1, all the input data is used as the seed.
Defaults to 1.
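
To make the seed size concrete, the base R sketch below draws a seed subset for a hypothetical input of 24 rows with seed.ratio = 0.5. How PAL rounds the seed size and which rows it picks are not stated here, so the rounding and the uniform random sampling are assumptions.

```r
set.seed(1)  # only to make this illustration reproducible

data.number <- 24L  # hypothetical number of input rows
seed.ratio <- 0.5

# Assumption: the seed size is seed.ratio * data_number, rounded, at least 1.
n.seed <- max(1L, as.integer(round(seed.ratio * data.number)))

# Assumption: seed rows are drawn uniformly at random without replacement.
seed.rows <- sort(sample.int(data.number, n.seed))
```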

times

integer, optional
The number of sampling rounds. Only valid when seed.ratio is less than 1.
Defaults to 1.

minkowski.power

integer, optional
The power of the Minkowski distance. Only valid when affinity is 'minkowski'.
Defaults to 3.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

Value

An "AffinityPropagation" object with the following attributes:

labels : DataFrame
Label assigned to each sample, structured as follows:
- ID : record ID.
- CLUSTER_ID : the range is from 0 to n.clusters - 1.
statistics : DataFrame
Statistic value, structured as follows:
- STAT_NAME : Statistic name.
- STAT_VALUE : Statistic value.

Examples

Input DataFrame data:


> data$Collect()
    ID     V1     V2
1    1   0.10   0.10
2    2   0.11   0.10
3    3   0.10   0.11
4    4   0.11   0.11
5    5   0.12   0.11
6    6   0.11   0.12
21  21  10.13  10.12
22  22  10.13  10.13
23  23  10.13  10.14
24  24  10.14  10.13

Call the function:


> ap <- hanaml.AffinityPropagation(data = data,
                                   key = "ID",
                                   affinity = "euclidean",
                                   n.clusters = 0L,
                                   max.iter = 500L,
                                   convergence.iter = 100L,
                                   damping = 0.9,
                                   preference = 0.5,
                                   times = 1L,
                                   seed.ratio = 1,
                                   minkowski.power = 0,
                                   thread.ratio = 0)

Output:


> ap$labels$Collect()
    ID  CLUSTER_ID
1    1           0
2    2           0
3    3           0
4    4           0
5    5           0
6    6           0
......
22  22           1
23  23           1
24  24           1
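
Once the labels have been collected, ordinary R tools can summarise them. The sketch below rebuilds only the rows printed above as a local data.frame (the elided rows are left out) and counts the members of each cluster. It assumes that the collected result behaves like a regular data.frame with columns ID and CLUSTER_ID.

```r
# Rows copied from the printed output; the rows hidden behind "......" are omitted.
labels <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 5L, 6L, 22L, 23L, 24L),
  CLUSTER_ID = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)
)

# Number of listed records per cluster.
cluster.sizes <- table(labels$CLUSTER_ID)

# Record IDs grouped by cluster, as a named list.
ids.by.cluster <- split(labels$ID, labels$CLUSTER_ID)
```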