pairwise_distances

hana_ml.algorithms.pal.stats.pairwise_distances(data, ref_data, thread_ratio=None, categorical_variable=None, string_variable=None, distance_metric=None, minkow_p=None, category_weights=None, string_weights=None, variable_weight=None)

Computes pairwise distances between two datasets.

Parameters
data : DataFrame

Input data for computing pairwise distances.

This DataFrame must be structured as follows:

  • 1st column : ID column, of type INTEGER, BIGINT, VARCHAR, or NVARCHAR.

  • Other columns : VARCHAR, NVARCHAR, INTEGER, DOUBLE, REAL_VECTOR, or DECIMAL(p, s).

ref_data : DataFrame

Reference data for computing pairwise distances.

This DataFrame must be structured as follows:

  • 1st column : ID column, of type INTEGER, BIGINT, VARCHAR, or NVARCHAR.

  • Other columns : VARCHAR, NVARCHAR, INTEGER, DOUBLE, REAL_VECTOR, or DECIMAL(p, s). These columns must have the same structure as in data.

thread_ratio : float, optional

Specifies the ratio of threads used to execute the algorithm.

The value must be between 0 and 1.

Defaults to 1.

categorical_variable : str or a list of str, optional

Specifies the categorical columns in data and ref_data. The columns specified in categorical_variable will be treated as categorical variables when calculating distances.

string_variable : str or a list of str, optional

Specifies the string columns in data and ref_data. The columns specified in string_variable will be treated as text variables when calculating distances.

distance_metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Specifies the distance metric used to compute distances.

Defaults to 'euclidean'.

  • 'manhattan': Manhattan distance.

  • 'euclidean': Euclidean distance.

  • 'minkowski': Minkowski distance.

  • 'chebyshev': Chebyshev distance.

minkow_p : int, optional

The parameter p for Minkowski distance. This parameter is valid only when distance_metric is set to 'minkowski'.

Defaults to 3.
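
As a rough illustration of the four metrics and of the role of p, the sketch below applies the textbook definitions to two numeric vectors in plain Python. It does not call PAL, and it does not attempt to reproduce how PAL handles categorical columns, string columns, or weights; the function names are hypothetical and exist only for this example.

```python
import math


def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, p=3):
    # Generalisation with exponent p; p=1 gives Manhattan, p=2 gives Euclidean.
    # The default p=3 mirrors the documented default of minkow_p.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 0.0], [3.0, 4.0]
print(manhattan(x, y))   # 7.0
print(euclidean(x, y))   # 5.0
print(chebyshev(x, y))   # 4.0
print(minkowski(x, y))   # cube root of 3**3 + 4**3 = 91
```

With p = 1 and p = 2 the Minkowski form coincides with the Manhattan and Euclidean distances, which is why minkow_p only matters when distance_metric is 'minkowski'.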

category_weights : float, optional

Weight of categorical variables. The value must be greater than or equal to 0.

Defaults to 0.707.

string_weights : float, optional

Weight of text variables. The value must be greater than or equal to 0.

Defaults to 1.0.

variable_weight : dict, optional

A Python dictionary that contains the weights of variables. Each key is a variable name and each value is the weight of that variable.

If a variable is not specified in variable_weight, it will be assigned a default weight of 1.0.

Example for illustration: {'var1':0.5, 'var2':2.0}.
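
The defaulting rule can be pictured with a small plain-Python sketch. The helper resolve_variable_weights and the column name 'var3' are hypothetical and only mimic the documented behaviour (unspecified variables get weight 1.0); how PAL applies the weights inside the distance computation is not shown here.

```python
def resolve_variable_weights(columns, variable_weight=None, default=1.0):
    """Return a weight for every column, falling back to `default`.

    Hypothetical helper for illustration only; it is not part of hana_ml.
    """
    variable_weight = variable_weight or {}
    return {col: float(variable_weight.get(col, default)) for col in columns}


weights = resolve_variable_weights(
    ['var1', 'var2', 'var3'],
    {'var1': 0.5, 'var2': 2.0},
)
print(weights)  # {'var1': 0.5, 'var2': 2.0, 'var3': 1.0}
```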

Returns
DataFrame

Pairwise distances, with the following columns:

  • ID

  • REF_ID

  • DISTANCE
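
To picture the shape of this result, the plain-Python sketch below builds (ID, REF_ID, DISTANCE) rows for two tiny numeric datasets. It assumes one row per combination of a data record and a reference record and uses the Euclidean distance; both points are assumptions made for illustration, and pairwise_rows is a hypothetical helper, not a hana_ml function.

```python
import math


def pairwise_rows(data, ref_data):
    """Build (ID, REF_ID, DISTANCE) rows for every data/reference pair.

    `data` and `ref_data` are lists of (id, feature_vector) tuples.
    """
    rows = []
    for id_, x in data:
        for ref_id, y in ref_data:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
            rows.append((id_, ref_id, dist))
    return rows


data = [(1, [0.0, 0.0]), (2, [3.0, 4.0])]
ref_data = [(10, [0.0, 0.0]), (11, [6.0, 8.0])]

for row in pairwise_rows(data, ref_data):
    print(row)
# (1, 10, 0.0)
# (1, 11, 10.0)
# (2, 10, 5.0)
# (2, 11, 5.0)
```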

DataFrame

Statistics, with the following columns:

  • STAT_NAME

  • STAT_VALUE
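
A call might look like the following sketch. It assumes an already opened hana_ml ConnectionContext, that the two results are returned together as a tuple, and that the tables exist with the structure described above; the wrapper run_pairwise_distances and the table names in the usage comment are hypothetical.

```python
def run_pairwise_distances(conn, data_table, ref_table, **options):
    """Load two tables as hana_ml DataFrames and compute pairwise distances.

    `conn` is assumed to be an open hana_ml.dataframe.ConnectionContext.
    `options` are forwarded unchanged, e.g. distance_metric or minkow_p.
    """
    # Imported here so that this sketch can be loaded without a HANA client.
    from hana_ml.algorithms.pal.stats import pairwise_distances

    data = conn.table(data_table)
    ref_data = conn.table(ref_table)
    return pairwise_distances(data=data, ref_data=ref_data, **options)


# Usage (requires a live SAP HANA connection; table names are placeholders):
# distances, stats = run_pairwise_distances(
#     conn, 'MY_DATA', 'MY_REF_DATA',
#     distance_metric='minkowski', minkow_p=3, thread_ratio=0.5,
# )
```

Keeping the keyword arguments as a pass-through means the wrapper adds no defaults of its own, so the defaults documented above still apply.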