trustworthiness

hana_ml.algorithms.pal.decomposition.trustworthiness(data, embedding, distance_level=None, minkowski_power=None, embedded_distance_level=None, embedded_minkowski_power=None, distance_method=None, embedded_knn_method=None, max_neighbors_trustworthiness=None, thread_ratio=None)

Calculate the trustworthiness of the embedding.

Parameters
data : DataFrame

Input data.

embedding : DataFrame

Embedded data.

distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

The distance metric used in the original high-dimensional space. The following distance levels are available:

  • 'manhattan' : Manhattan distance

  • 'euclidean' : Euclidean distance

  • 'minkowski' : Minkowski distance

  • 'chebyshev' : Chebyshev distance

  • 'standardized_euclidean' : Standardized Euclidean distance

  • 'cosine' : Cosine distance

Defaults to 'euclidean'.
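
For intuition, the distance levels listed above can be sketched in plain Python. This is only an illustration of the textbook definitions, not PAL's implementation. The helper name `distance`, its `variances` argument, and the convention that cosine distance is one minus cosine similarity are assumptions made for this sketch.

```python
import math


def distance(x, y, level="euclidean", power=3.0, variances=None):
    """Textbook distance between two equal-length vectors (illustrative only)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if level == "manhattan":
        return sum(diffs)
    if level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if level == "minkowski":
        # `power` plays the role of minkowski_power.
        return sum(d ** power for d in diffs) ** (1.0 / power)
    if level == "chebyshev":
        return max(diffs)
    if level == "standardized_euclidean":
        # Each squared difference is divided by that feature's variance,
        # which the caller must supply in this sketch.
        return math.sqrt(sum(d * d / v for d, v in zip(diffs, variances)))
    if level == "cosine":
        # Assumed convention: 1 - cosine similarity.
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return 1.0 - dot / (norm_x * norm_y)
    raise ValueError(f"unknown distance level: {level!r}")
```

With the points (0, 0) and (3, 4), the Manhattan, Euclidean and Chebyshev distances come out as 7, 5 and 4 respectively, which shows how much the choice of level can change neighbor rankings.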

minkowski_power : float, optional

The power parameter of the Minkowski distance used in the original high-dimensional space. It is only used if distance_level is set to 'minkowski'.

Defaults to 3.0.

embedded_minkowski_power : float, optional

The power parameter of the Minkowski distance used in the embedded space. It is only used if embedded_distance_level is set to 'minkowski'.

Defaults to 3.0.

distance_method : {'brute_force', 'matrix_enabled'}, optional

The method for calculating distances in the original high-dimensional space when calculating trustworthiness. The following methods are available:

  • 'brute_force' : Use formula to calculate distances

  • 'matrix_enabled' : Matrix-enabled calculation

Defaults to knn_method.

embedded_knn_method : {'brute_force', 'matrix_enabled', 'kd_tree'}, optional

The method used to compute the k-nearest neighbors of the embedded data when calculating trustworthiness. The following methods are available:

  • 'brute_force' : Brute Force searching

  • 'matrix_enabled' : Matrix-enabled searching

  • 'kd_tree' : KD-Tree searching

Defaults to 'brute_force'.

max_neighbors_trustworthiness : int, optional

The maximum number of neighbors to consider when calculating trustworthiness.

Defaults to min(15, int(2*(N+1)/3 - 1e-8)), where N is the number of data points.
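
The documented default can be evaluated directly for a given number of data points. The sketch below assumes that "2(N+1)/3" denotes the product 2 * (N + 1) / 3. The function name `default_max_neighbors` is hypothetical and is not part of hana_ml.

```python
def default_max_neighbors(n_points):
    """Evaluate the documented default for max_neighbors_trustworthiness."""
    # Assumption: "2(N+1)/3" in the documentation means 2 * (N + 1) / 3.
    # Subtracting 1e-8 before truncation makes exact multiples round down,
    # so the result stays strictly below 2 * (N + 1) / 3.
    return min(15, int(2 * (n_points + 1) / 3 - 1e-8))
```

Under this reading, small data sets get a default below 15 (for example, 3 when N is 5), while any data set with 22 or more points gets the cap of 15.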

thread_ratio : float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 uses a single thread, while a value of 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 1.0.

Returns
DataFrame

Trustworthiness of the embedding.

Examples

>>> from hana_ml.algorithms.pal.decomposition import UMAP, trustworthiness
>>> umap = UMAP(n_neighbors=5, n_components=2,
...             knn_method='brute_force', init='random', min_dist=0.1,
...             distance_method='brute_force', embedded_knn_method='brute_force', seed=12345)
>>> embedding = umap.fit_transform(data=df, key='ID', features=['X1', 'X2', 'X3'])
>>> res = trustworthiness(data=df, embedding=embedding,
...                       distance_level='euclidean', distance_method='brute_force',
...                       embedded_knn_method='brute_force', max_neighbors_trustworthiness=5)
>>> res.collect()
    NEIGHBORS  TRUSTWORTHINESS
0          1         1.000000
1          2         0.952381
2          3         1.000000
3          4         0.962963
4          5         0.877778
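
To make the TRUSTWORTHINESS column easier to interpret, here is a minimal pure-Python sketch of the commonly used trustworthiness definition (Venna and Kaski), evaluated for a single neighbor count k. It assumes Euclidean distance in both spaces and ties broken by point index. Whether PAL uses exactly this normalization and tie handling is an assumption, and the function name `trustworthiness_reference` is hypothetical.

```python
import math


def trustworthiness_reference(data, embedding, k):
    """Trustworthiness for one neighbor count k (textbook formula, illustrative only).

    `data` and `embedding` are equal-length sequences of coordinate tuples.
    The normalization requires 2 * n - 3 * k - 1 > 0.
    """
    n = len(data)
    penalty = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Rank every other point by distance in the original space (1 = nearest).
        by_original = sorted(others, key=lambda j: math.dist(data[i], data[j]))
        rank = {j: r for r, j in enumerate(by_original, start=1)}
        # The k nearest neighbors of point i in the embedded space.
        by_embedded = sorted(others, key=lambda j: math.dist(embedding[i], embedding[j]))
        for j in by_embedded[:k]:
            # Only embedded neighbors ranked beyond k in the original space are penalized.
            penalty += max(0, rank[j] - k)
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

A value of 1.0 means that every point's k nearest neighbors in the embedding are also among its k nearest neighbors in the original space. Lower values indicate that the embedding pulls in points that were far apart originally.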