Precomputed Distance Matrix as input data in UnifiedClustering

UnifiedClustering can run some clustering algorithms on a pre-computed distance matrix instead of raw feature data. The matrix must be supplied in upper or lower triangular form: its expanded shape is N samples * N samples, and each pair of points must appear only once, as either (i, j) or (j, i), with a single distance value.
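As a rough illustration of the triangular requirement, the sketch below flattens a full symmetric N * N matrix into upper triangular (left, right, distance) rows, so that every pair appears exactly once. The helper name to_upper_triangular and the sample labels are hypothetical, and leaving out the diagonal (i, i) entries is an assumption rather than something this section specifies.

```python
def to_upper_triangular(labels, matrix):
    """Return (left, right, distance) rows for every pair with i < j.

    Hypothetical helper: the diagonal is skipped, which is an assumption.
    """
    n = len(labels)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must have shape N x N")
    rows = []
    for i in range(n):
        for j in range(i + 1, n):
            # A pair may only carry one distance, so (i, j) and (j, i) must agree.
            if matrix[i][j] != matrix[j][i]:
                raise ValueError(f"asymmetric distance for pair ({i}, {j})")
            rows.append((labels[i], labels[j], float(matrix[i][j])))
    return rows


labels = ["A", "B", "C"]
full = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
pairs = to_upper_triangular(labels, full)
print(pairs)
```

For N samples this yields N * (N - 1) / 2 rows, one per unordered pair.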

Currently, a pre-computed distance matrix is supported only for K-Medoids, and massive mode does not support this feature. To use a precomputed distance matrix as input data in fit() and predict(), pass an input DataFrame with the following structure:

Input DataFrame Structure

  • 1st column: INTEGER, VARCHAR, or NVARCHAR, Left Point.

  • 2nd column: INTEGER, VARCHAR, or NVARCHAR, Right Point. The type must be the same as the Left Point type.

  • 3rd column: DOUBLE, Distance.
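The three-column layout above can be checked locally before the rows are loaded into a DataFrame. The following is a minimal sketch in plain Python, not part of hana-ml: check_distance_rows is a hypothetical helper, and mapping INTEGER to int, VARCHAR/NVARCHAR to str, and DOUBLE to float is an assumption made for illustration.

```python
def check_distance_rows(rows):
    """Validate (left, right, distance) rows; return the number of rows checked."""
    seen = set()
    point_type = None
    for left, right, dist in rows:
        # Left and right points must share one type (int or str here).
        if type(left) not in (int, str) or type(left) is not type(right):
            raise TypeError(f"point types differ or are unsupported: {left!r}, {right!r}")
        if point_type is None:
            point_type = type(left)
        elif type(left) is not point_type:
            raise TypeError("all rows must use the same point type")
        if not isinstance(dist, float):
            raise TypeError(f"distance must be a float: {dist!r}")
        # Each unordered pair may appear only once, as (i, j) or (j, i).
        pair = frozenset((left, right))
        if pair in seen:
            raise ValueError(f"pair listed more than once: {left!r}, {right!r}")
        seen.add(pair)
    return len(rows)


rows = [(1, 2, 0.5), (1, 3, 1.5), (2, 3, 2.0)]
checked = check_distance_rows(rows)
print(checked)
```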

The following K-Medoids parameters apply when a precomputed distance matrix is used as input data:

Parameters

n_clusters : int

Number of groups.

tol : float, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

init : {'first_k', 'replace', 'no_replace'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

Defaults to 'first_k'.
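The three init options can be pictured with the short sketch below. It only illustrates the selection rules in plain Python; it is not how PAL implements them, and the function initial_centers and its arguments are hypothetical.

```python
import random


def initial_centers(n_samples, k, init="first_k", seed=None):
    """Pick k initial center indices out of n_samples observations."""
    rng = random.Random(seed)
    indices = range(n_samples)
    if init == "first_k":
        # First k observations, in input order.
        return list(indices[:k])
    if init == "replace":
        # Random with replacement: the same index may be drawn twice.
        return rng.choices(indices, k=k)
    if init == "no_replace":
        # Random without replacement: all drawn indices are distinct.
        return rng.sample(indices, k)
    raise ValueError(f"unknown init option: {init!r}")


first = initial_centers(10, 3, init="first_k")
with_repl = initial_centers(10, 3, init="replace", seed=1)
without_repl = initial_centers(10, 3, init="no_replace", seed=1)
print(first, with_repl, without_repl)
```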

random_seed : int, optional

Indicates the seed used to initialize the random number generator. It can be set to 0 or a positive value.

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Defaults to -1.

max_iter : int, optional

Maximum number of iterations.

Defaults to 100.

thread_ratio : float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

precalculated_distance : bool, optional

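Specifies whether the input data is a pre-computed distance matrix:

  • False: the input data is not a pre-computed distance matrix.

  • True: the input data is a pre-computed distance matrix.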
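Putting the parameters together, a configuration for this mode might look like the sketch below. The dictionary uses the parameter names and defaults listed above with an illustrative n_clusters of 3; the helper check_params is hypothetical. The commented-out constructor call is an assumption about typical hana-ml usage and requires a live SAP HANA connection, so it is not executed here.

```python
kmedoids_params = {
    "n_clusters": 3,                 # illustrative value
    "tol": 1.0e-6,                   # documented default
    "init": "first_k",               # documented default
    "max_iter": 100,                 # documented default
    "thread_ratio": 0.0,             # documented default: single thread
    "precalculated_distance": True,  # input is a distance matrix
}


def check_params(params):
    """Hypothetical sanity check against the ranges described in this section."""
    if not isinstance(params["n_clusters"], int) or params["n_clusters"] < 1:
        raise ValueError("n_clusters must be a positive integer")
    if params["init"] not in ("first_k", "replace", "no_replace"):
        raise ValueError("init must be 'first_k', 'replace', or 'no_replace'")
    if params["max_iter"] < 1:
        raise ValueError("max_iter must be at least 1")
    if not isinstance(params["precalculated_distance"], bool):
        raise TypeError("precalculated_distance must be a bool")
    return True


params_ok = check_params(kmedoids_params)
print(params_ok)

# Assumed usage, not run here (distance_df is a hypothetical HANA DataFrame
# with the three-column structure described earlier):
# uc = UnifiedClustering(func='KMedoids', **kmedoids_params)
# uc.fit(data=distance_df)
```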