SpectralClustering

class hana_ml.algorithms.pal.clustering.SpectralClustering(n_clusters, n_components=None, gamma=None, affinity=None, n_neighbors=None, cut=None, eigen_tol=None, krylov_dim=None, distance_level=None, minkowski_power=None, category_weights=None, max_iter=None, init=None, tol=None, onehot_min_frequency=None, onehot_max_categories=None)

Spectral clustering is an algorithm that evolved from graph theory and is widely used for clustering. Its main idea is to treat all data records as points in space that are connected by edges. The edge weight between two points is low when they are far apart and high when they are close together. The graph formed by all data points is then cut so that the sum of edge weights between different subgraphs is as low as possible, while the sum of edge weights within each subgraph is as high as possible, which achieves the purpose of clustering.

The algorithm performs a low-dimensional embedding of the affinity matrix between samples, followed by k-means clustering of the eigenvector components in the low-dimensional space.
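
The two stages described above (spectral embedding, then k-means) can be sketched in plain NumPy. This is an illustrative sketch only and not the PAL implementation: the helper names rbf_affinity, spectral_embedding and kmeans are hypothetical, the affinity is the fully connected RBF graph, the Laplacian is the unnormalized one, and the k-means initialization is a simple farthest-point heuristic chosen only to keep the example deterministic.

```python
import numpy as np


def rbf_affinity(X, gamma=1.0):
    """Fully connected affinity matrix with weights exp(-gamma * d^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def spectral_embedding(W, n_components):
    """Eigenvectors of the unnormalized Laplacian L = D - W
    that belong to the smallest eigenvalues."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return vecs[:, :n_components]


def kmeans(Z, k, max_iter=100, tol=1e-6):
    """Plain Lloyd iterations with a deterministic farthest-point start
    (an assumption of this sketch, not one of the PAL init options)."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(max_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return labels


# Two tight groups of five points each, three units apart.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([offsets, offsets + np.array([3.0, 0.0])])

W = rbf_affinity(X, gamma=1.0)
Z = spectral_embedding(W, n_components=2)
labels = kmeans(Z, k=2)
print(labels)
```

In this toy data set the weights inside each group are close to 1 while the weights between groups are about exp(-9), so the embedding places the two groups far apart and k-means separates them.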

Parameters:
n_clusters : int

The number of clusters for spectral clustering.

The valid range for this parameter is from 2 to the number of records in the input data.

n_components : int, optional

The number of eigenvectors used for spectral embedding.

Defaults to the value of n_clusters.

gamma : float, optional

The RBF kernel coefficient \(\gamma\) used in constructing the affinity matrix with distance metric \(d\), where the weight is \(\exp(-\gamma d^2)\).

Defaults to 1.0.

affinity : str, optional

Specifies the type of graph used to construct the affinity matrix. Valid options include:

  • 'knn' : binary affinity matrix constructed from the graph of k-nearest neighbors (knn).

  • 'mutual-knn' : binary affinity matrix constructed from the graph of mutual k-nearest neighbors (mutual-knn).

  • 'fully-connected' : affinity matrix constructed from the fully connected graph, with weights defined by the RBF kernel.

Defaults to 'fully-connected'.
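
The difference between the 'knn' and 'mutual-knn' graphs can be illustrated with a short NumPy sketch. It assumes the common textbook definitions: the knn graph has an edge if either point is among the other's nearest neighbors, and the mutual-knn graph has an edge only if both are. Whether PAL symmetrizes in exactly this way is an assumption, and knn_graphs is a hypothetical helper, not part of hana_ml.

```python
import numpy as np


def knn_graphs(X, n_neighbors):
    """Return binary (knn, mutual_knn) affinity matrices for the rows of X."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]

    # directed[i, j] is True when j is one of the nearest neighbors of i
    directed = np.zeros((n, n), dtype=bool)
    directed[np.arange(n)[:, None], idx] = True

    knn = directed | directed.T     # edge if either direction holds
    mutual = directed & directed.T  # edge only if both directions hold
    return knn.astype(int), mutual.astype(int)


X = np.array([[0.0], [1.0], [2.5], [10.0]])
knn, mutual = knn_graphs(X, n_neighbors=1)
print(knn)
print(mutual)
```

With one neighbor per point, the outlying point at 10.0 is still attached in the knn graph (its nearest neighbor is 2.5) but is isolated in the mutual-knn graph, since no other point selects it as a nearest neighbor.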

n_neighbors : int, optional

The number of neighbors to use when constructing the affinity matrix using the nearest-neighbors method.

Valid only when affinity is 'knn' or 'mutual-knn'.

Defaults to 10.

cut : str, optional

Specifies the method to cut the graph.

  • 'ratio-cut' : Ratio-Cut.

  • 'n-cut' : Normalized-Cut.

Defaults to 'ratio-cut'.
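
The two cut criteria can be compared on a small graph. The sketch below uses the standard definitions, in which Ratio-Cut divides the weight leaving each cluster by the cluster size and Normalized-Cut divides it by the cluster volume (the sum of node degrees). Conventions in the literature differ by a constant factor, and the exact objective that PAL optimizes is an assumption here; cut_objectives is a hypothetical helper.

```python
import numpy as np


def cut_objectives(W, labels):
    """Return (ratio_cut, n_cut) of a partition of the graph with affinity W."""
    labels = np.asarray(labels)
    degree = W.sum(axis=1)
    ratio_cut = 0.0
    n_cut = 0.0
    for c in np.unique(labels):
        inside = labels == c
        # total weight of edges leaving cluster c
        cut_weight = W[inside][:, ~inside].sum()
        ratio_cut += cut_weight / inside.sum()       # normalize by cluster size
        n_cut += cut_weight / degree[inside].sum()   # normalize by cluster volume
    return ratio_cut, n_cut


# Nodes 0-1 and 2-3 are strongly connected; the two pairs are weakly linked.
W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])

good = cut_objectives(W, [0, 0, 1, 1])  # cuts only the weak edges
bad = cut_objectives(W, [0, 1, 0, 1])   # cuts the strong edges
print(good, bad)
```

Both criteria are much smaller for the partition that cuts only the weak edges, which is the partition a spectral method is meant to recover.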

eigen_tol : float, optional

The stopping criterion for eigendecomposition of the Laplacian matrix.

Defaults to 1e-10.

krylov_dim : int, optional

Specifies the dimension of the Krylov subspaces used in eigenvalue decomposition. In general, this parameter controls the convergence speed of the algorithm. A larger krylov_dim typically means faster convergence, but it may also result in greater memory use and more matrix operations in each iteration.

Defaults to 2 * n_components.

Note

This parameter must satisfy

n_components < krylov_dim \(\leq\) the number of training records.
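
The documented ranges and defaults for n_clusters, n_components and krylov_dim can be checked before fitting. The function below is a hypothetical helper written for illustration; it is not part of hana_ml, and PAL performs its own validation.

```python
def check_spectral_params(n_records, n_clusters, n_components=None, krylov_dim=None):
    """Apply the documented defaults and constraints; return (n_components, krylov_dim)."""
    if not 2 <= n_clusters <= n_records:
        raise ValueError("n_clusters must be between 2 and the number of records")
    if n_components is None:
        n_components = n_clusters      # documented default
    if krylov_dim is None:
        krylov_dim = 2 * n_components  # documented default
    if not n_components < krylov_dim <= n_records:
        raise ValueError(
            "krylov_dim must satisfy n_components < krylov_dim <= number of records"
        )
    return n_components, krylov_dim


print(check_spectral_params(n_records=100, n_clusters=4))
```

Note that on very small inputs the default krylov_dim can exceed the number of records (for example, 6 records with n_clusters=4 gives a default of 8), in which case a smaller value has to be supplied explicitly.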

distance_level : str, optional

Specifies the method for computing the distance between data records and cluster centers:

  • 'manhattan' : Manhattan distance.

  • 'euclidean' : Euclidean distance.

  • 'minkowski' : Minkowski distance.

  • 'chebyshev' : Chebyshev distance.

  • 'cosine' : Cosine distance.

Defaults to 'euclidean'.
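
For reference, the five metrics can be written out in a few lines of NumPy. This sketch uses the usual textbook definitions, with cosine distance taken as one minus the cosine similarity; the exact formulas used inside PAL are assumed to match, and the function distance is a hypothetical helper.

```python
import numpy as np


def distance(x, y, distance_level="euclidean", minkowski_power=3.0):
    """Distance between two numeric vectors for a given distance_level."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = np.abs(x - y)
    if distance_level == "manhattan":
        return float(diff.sum())
    if distance_level == "euclidean":
        return float(np.sqrt((diff ** 2).sum()))
    if distance_level == "minkowski":
        return float((diff ** minkowski_power).sum() ** (1.0 / minkowski_power))
    if distance_level == "chebyshev":
        return float(diff.max())
    if distance_level == "cosine":
        # undefined for zero vectors; callers must pass non-zero inputs
        cos_sim = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
        return float(1.0 - cos_sim)
    raise ValueError(f"unknown distance_level: {distance_level!r}")


a, b = [0.0, 0.0], [3.0, 4.0]
for level in ("manhattan", "euclidean", "minkowski", "chebyshev"):
    print(level, distance(a, b, level))
print("cosine", distance([1.0, 0.0], [0.0, 1.0], "cosine"))
```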

minkowski_power : float, optional

Specifies the power parameter in Minkowski distance.

Valid only when distance_level is 'minkowski'.

Defaults to 3.0.

category_weights : float, optional

Represents the weight of categorical attributes.

Defaults to 0.707.

max_iter : int, optional

Maximum number of iterations for the K-Means algorithm.

Defaults to 100.

init : {'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected in the K-Means algorithm:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to 'patent'.
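
The three non-patented strategies are simple enough to sketch with NumPy. The helper initial_centers and its seed argument are hypothetical and only illustrate the sampling behavior implied by the option names; the 'patent' method is not reproduced here.

```python
import numpy as np


def initial_centers(X, k, init="first_k", seed=0):
    """Pick k rows of X as initial centers for the given init option."""
    rng = np.random.default_rng(seed)
    n = len(X)
    if init == "first_k":
        idx = np.arange(k)                          # first k observations
    elif init == "replace":
        idx = rng.choice(n, size=k, replace=True)   # the same row may repeat
    elif init == "no_replace":
        idx = rng.choice(n, size=k, replace=False)  # all rows distinct
    else:
        raise ValueError(f"init option not covered by this sketch: {init!r}")
    return X[idx]


X = np.arange(20, dtype=float).reshape(10, 2)
print(initial_centers(X, 3, "first_k"))
print(initial_centers(X, 3, "no_replace"))
```

Sampling with replacement can select the same observation more than once, which yields duplicate starting centers; sampling without replacement avoids this.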

tol : float, optional

Specifies the exit threshold for K-Means iterations.

Defaults to 1e-6.

onehot_min_frequency : int, optional

Specifies the minimum frequency below which a category will be considered infrequent.

Defaults to 1.

onehot_max_categories : int, optional

Specifies an upper limit to the number of output features for each input feature. It includes the feature that combines infrequent categories.

Defaults to 0.

Examples

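>>> from hana_ml.algorithms.pal.clustering import SpectralClustering
>>> # df is assumed to be an existing hana_ml DataFrame holding the input data
>>> spc = SpectralClustering(n_clusters=4, n_neighbors=4,
...                          init='patent', distance_level='euclidean',
...                          max_iter=100, tol=1e-6, category_weights=0.5)
>>> labels = spc.fit_predict(data=df, thread_ratio=0.2)
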
Attributes:
labels_ : DataFrame

DataFrame that holds the cluster labels.

stats_ : DataFrame

Statistics.

Methods

fit(data[, key, features, thread_ratio])

Fit the model to the training dataset.

fit_predict(data[, key, features, thread_ratio])

Given data, perform spectral clustering and return the corresponding cluster labels.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

fit(data, key=None, features=None, thread_ratio=None)

Fit the model to the training dataset.

Parameters:
data : DataFrame

DataFrame containing the input data.

key : str, optional

Name of ID column.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratio : float, optional

Specifies the ratio of available threads to use, from 0 to 1. A value of 0 means that a single thread is used, while 1 means that all currently available threads are used. Values outside this range are ignored, in which case the number of threads is determined heuristically.

Defaults to 0.

fit_predict(data, key=None, features=None, thread_ratio=None)

Given data, perform spectral clustering and return the corresponding cluster labels.

Parameters:
data : DataFrame

DataFrame containing the input data.

key : str, optional

Name of ID column in data.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratio : float, optional

Specifies the ratio of available threads to use, from 0 to 1. A value of 0 means that a single thread is used, while 1 means that all currently available threads are used. Values outside this range are ignored, in which case the number of threads is determined heuristically.

Defaults to 0.

Returns:
DataFrame

The cluster labels of all records in data.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

Inherited Methods from PALBase

In addition to the methods described above, the SpectralClustering class inherits methods from the PALBase class. Refer to PAL Base for more details.