Spectral clustering is a clustering algorithm rooted in graph theory and is widely used in practice.

hanaml.SpectralClustering(
  data,
  key,
  features = NULL,
  n.clusters = NULL,
  n.components = NULL,
  gamma = NULL,
  affinity = NULL,
  n.neighbors = NULL,
  cut = NULL,
  eigen.tol = NULL,
  krylov.dim = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  category.weights = NULL,
  max.iter = NULL,
  init = NULL,
  tol = NULL,
  thread.ratio = NULL
)

Arguments

n.clusters

integer
The number of clusters for spectral clustering.
The valid range for this parameter is from 2 to the number of records in the input data.

n.components

integer, optional
The number of eigenvectors used for spectral embedding. Defaults to the value of n.clusters.

gamma

numeric, optional
RBF kernel coefficient \(\gamma\) used when constructing the affinity matrix from a distance metric d, with edge weights given by \(e^{-\gamma d^2}\).
Defaults to 1.0.

affinity

character, optional
Specifies the type of graph used to construct the affinity matrix. Valid options include:

  • "knn": binary affinity matrix constructed from the graph of k-nearest-neighbors(knn).

  • "mutual-knn": binary affinity matrix constructed from the graph of mutual k-nearest-neighbors(mutual-knn).

  • "fully-connected": affinity matrix constructed from fully-connected graph, with weights defined by RBF kernel coefficients.

Defaults to "fully-connected".

n.neighbors

integer, optional
The number of neighbors to use when constructing the affinity matrix with a nearest-neighbors method.
Valid only when affinity is "knn" or "mutual-knn".
Defaults to 10.

cut

character, optional
Specifies the method to cut the graph.

  • "ratio-cut": Ratio-Cut.

  • "n-cut": Normalized-Cut.

Defaults to "ratio-cut".

eigen.tol

numeric, optional
Stopping criterion for eigendecomposition of the Laplacian matrix.
Defaults to 1e-10.

krylov.dim

integer, optional
Specifies the dimension of the Krylov subspaces used in the eigenvalue decomposition. In general, this parameter controls the convergence speed of the algorithm: a larger krylov.dim typically means faster convergence, but also greater memory use and more matrix operations per iteration.
Defaults to \(2 \cdot n.components\).
NOTE: This parameter must satisfy
n.components < krylov.dim <= the number of training records.
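
For example, the constraint can be expressed as a plain-R check (with hypothetical local variables n.components and krylov.dim holding the chosen values):

stopifnot(n.components < krylov.dim, krylov.dim <= nrow(data$Collect()))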

distance.level

character, optional
Specifies the method for computing the distance between data records and cluster centers:

  • "manhattan": Manhattan distance.

  • "euclidean": Euclidean distance.

  • "minkowski": Minkowski distance.

  • "chebyshev": Chebyshev distance.

  • "cosine": Cosine distance.

Defaults to "euclidean".

minkowski.power

numeric, optional
Specifies the power parameter in Minkowski distance.
Valid only when distance.level is "minkowski".
Defaults to 3.0.
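
For illustration (plain R with hypothetical vectors x and y): the Minkowski distance with power \(p\) is \((\sum_i |x_i - y_i|^p)^{1/p}\); \(p = 1\) gives the Manhattan distance and \(p = 2\) the Euclidean distance.

x <- c(0.5, 0.5); y <- c(1.5, 1.5)
p <- 3.0                                         # corresponds to minkowski.power
sum(abs(x - y)^p)^(1 / p)                        # Minkowski distance with power 3
dist(rbind(x, y), method = "minkowski", p = 3)   # the same value via base R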

max.iter

integer, optional
Maximum number of iterations for the K-Means algorithm.
Defaults to 100.

init

c("first_k", "replace", "no_replace", "patent"), optional
Controls how the initial cluster centers are selected for the K-Means algorithm:

  • "first_k": First k observations.

  • "replace": Random with replacement.

  • "no_replace": Random without replacement.

  • "patent": Patent of selecting the init center (US 6,882,998 B1).

Defaults to "patent".

tol

numeric, optional
Specifies the exit threshold for K-Means iterations.
Defaults to 1e-6.

category.weights

numeric, optional
Represents the weight of categorical attributes.
Defaults to 0.707.

Value

List of DataFrames:

  • labels: DataFrame containing the cluster labels of the input data.

  • statistics: DataFrame containing the related statistics for spectral clustering.

Details

The main idea of spectral clustering is to treat the data as points in space that can be connected by edges: the edge weight between two points that are far apart is low, while the edge weight between two nearby points is high. The graph formed by all data points is then cut so that the total edge weight between different subgraphs is as low as possible, while the total edge weight within each subgraph is as high as possible, which achieves the clustering.
In practice, the algorithm performs a low-dimensional embedding of the affinity matrix between samples, followed by K-Means clustering of the components of the eigenvectors in the low-dimensional space.
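
A minimal, self-contained sketch of this generic pipeline in plain R (for illustration only, not the PAL implementation; it assumes a fully-connected RBF affinity and a normalized-cut style Laplacian):

spectral.sketch <- function(X, n.clusters, gamma = 1.0) {
  d <- as.matrix(dist(X))                        # pairwise Euclidean distances
  W <- exp(-gamma * d^2)                         # fully-connected RBF affinity matrix
  D.inv.sqrt <- diag(1 / sqrt(rowSums(W)))       # D^{-1/2}, with D the diagonal degree matrix
  L.sym <- diag(nrow(W)) - D.inv.sqrt %*% W %*% D.inv.sqrt  # normalized Laplacian
  ev <- eigen(L.sym, symmetric = TRUE)           # eigenvalues come back in decreasing order
  U <- ev$vectors[, seq(ncol(W), by = -1, length.out = n.clusters)]  # eigenvectors of the
                                                 #   n.clusters smallest eigenvalues
  U <- U / sqrt(rowSums(U^2))                    # row-normalize the spectral embedding
  kmeans(U, centers = n.clusters, nstart = 10)$cluster  # K-Means in the embedded space
}

Applied to the example data below, such a sketch is expected to recover the same four groups as hanaml.SpectralClustering, up to a permutation of the cluster labels.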

Examples

Input data:


> data$Collect()
   ID   X1   X2
1   0  0.5  0.5
2   1  1.5  0.5
3   2  1.5  1.5
4   3  0.5  1.5
5   4  1.1  1.2
6   5  0.5 15.5
7   6  1.5 15.5
8   7  1.5 16.5
9   8  0.5 16.5
10  9  1.2 16.1
11 10 15.5 15.5
12 11 16.5 15.5
13 12 16.5 16.5
14 13 15.5 16.5
15 14 15.6 16.2
16 15 15.5  0.5
17 16 16.5  0.5
18 17 16.5  1.5
19 18 15.5  1.5
20 19 15.7  1.6

Apply spectral clustering:

spc <- hanaml.SpectralClustering(data=data,
                                 key="ID",
                                 thread.ratio=0.2,
                                 n.clusters=4,
                                 n.neighbors=4,
                                 init="patent",
                                 distance.level="euclidean",
                                 max.iter=100,
                                 tol=1e-6,
                                 category.weights=0.5)

Check the cluster labels:


> spc$labels$Collect()
   ID CLUSTER_ID
1   0          0
2   1          0
3   2          0
4   3          0
5   4          0
6   5          1
7   6          1
8   7          1
9   8          1
10  9          1
11 10          2
12 11          2
13 12          2
14 13          2
15 14          2
16 15          3
17 16          3
18 17          3
19 18          3
20 19          3
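
The related statistics can be inspected in the same way, e.g. via spc$statistics$Collect().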