KDE

class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)

Performs kernel density estimation (KDE), which plays a role analogous to a histogram while avoiding the histogram's defects.

Parameters:
thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Default to 0.0.

leaf_size : int, optional

Number of samples in a KD tree or Ball tree leaf node.

Only valid when algorithm is 'kd-tree' or 'ball-tree'.

Default to 30.

kernel : {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional

Kernel function type.

Default to 'gaussian'.
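As a rough illustration of how the six kernel choices differ, the sketch below evaluates common textbook kernel profiles at a scaled distance u = d / h, where h is the bandwidth. The function name `kernel_profile` is hypothetical, the profiles are unnormalized, and nothing here is taken from the PAL implementation, whose exact definitions and normalization constants are not stated on this page.

```python
import math

# Textbook, unnormalized kernel profiles as functions of u = distance / bandwidth.
# These are assumptions for illustration, not PAL's definitions.
_UNBOUNDED = {
    "gaussian": lambda u: math.exp(-0.5 * u * u),
    "exponential": lambda u: math.exp(-u),
}
# Compact-support kernels are zero once u reaches 1.
_COMPACT = {
    "tophat": lambda u: 1.0,
    "epanechnikov": lambda u: 1.0 - u * u,
    "linear": lambda u: 1.0 - u,
    "cosine": lambda u: math.cos(0.5 * math.pi * u),
}


def kernel_profile(name, u):
    """Return the unnormalized kernel value at scaled distance u >= 0."""
    if u < 0:
        raise ValueError("scaled distance must be non-negative")
    if name in _UNBOUNDED:
        return _UNBOUNDED[name](u)
    if name in _COMPACT:
        return _COMPACT[name](u) if u < 1.0 else 0.0
    raise ValueError(f"unknown kernel: {name!r}")
```

The practical difference is the support: 'gaussian' and 'exponential' give every training point some weight, while the other four ignore points farther away than one bandwidth.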

method : {'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)

Searching method.

Default to 'brute_force'.

algorithm : {'brute-force', 'kd-tree', 'ball-tree'}, optional

Specifies the searching method.

Default to 'brute-force'.

bandwidth : float, optional

Bandwidth used during density calculation.

0 means the bandwidth is determined by the internal optimizer; any other value is used as the user-supplied bandwidth.

Only valid when data is one dimensional.

Default to 0.

distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Specifies the distance metric used between the training data points and each test data point.

Default to 'euclidean'.

minkowski_power : float, optional

When you use the Minkowski distance, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Default to 3.0.
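To make the relationship between distance_level and minkowski_power concrete, here is a minimal sketch of the four metrics using their standard mathematical definitions. The helper name `distance` and its signature are hypothetical and are not part of hana_ml; the default power of 3.0 simply mirrors the default documented above.

```python
import math


def distance(a, b, level="euclidean", minkowski_power=3.0):
    """Distance between two equal-length points under the named metric."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if level == "manhattan":
        return sum(diffs)
    if level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if level == "minkowski":
        # The power is only consulted for this metric.
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if level == "chebyshev":
        return max(diffs)
    raise ValueError(f"unknown distance level: {level!r}")
```

Minkowski distance with power 1 reduces to Manhattan and with power 2 to Euclidean, which is why the power only matters when 'minkowski' is selected.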

rtol : float, optional

The desired relative tolerance of the result.

A larger tolerance generally leads to faster execution.

Default to 1e-8.

atol : float, optional

The desired absolute tolerance of the result.

A larger tolerance generally leads to faster execution.

Default to 0.

stat_info : bool, optional
  • False: The STATISTIC table is empty.

  • True: Statistic information is displayed in the STATISTIC table.

Only valid when parameter selection is not specified.

resampling_method : {'loocv'}, optional

Specifies the resampling method for model evaluation or parameter selection. Only 'loocv' is permitted.

Must be set together with evaluation_metric.

No default value.

evaluation_metric : {'nll'}, optional

Specifies the evaluation metric for model evaluation or parameter selection. Only 'nll' is supported.

No default value.

search_strategy : {'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

repeat_times : int, optional

Specifies the number of times resampling is repeated.

Default to 1.

random_state : int, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Default to 0.

bandwidth_values : list, optional

Specifies values of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.

bandwidth_range : list, optional

Specifies ranges of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.
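The idea behind combining resampling_method='loocv' with evaluation_metric='nll' can be sketched in plain Python: each candidate bandwidth is scored by the negative log-likelihood of every point under a density fitted to all the other points, and the lowest score wins. This is a conceptual sketch for one-dimensional data with a Gaussian kernel, reading 'nll' as negative log-likelihood; the helper names `loo_nll` and `select_bandwidth` are hypothetical, and the code does not reproduce how PAL performs the search.

```python
import math


def loo_nll(xs, h):
    """Leave-one-out negative log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    n = len(xs)
    norm = (n - 1) * h * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, x in enumerate(xs):
        # Density at x estimated from every sample except x itself.
        kernel_sum = sum(
            math.exp(-0.5 * ((x - y) / h) ** 2)
            for j, y in enumerate(xs)
            if j != i
        )
        if kernel_sum == 0.0:
            # The bandwidth is so small that the held-out point gets zero density.
            return math.inf
        total -= math.log(kernel_sum / norm)
    return total


def select_bandwidth(xs, candidates):
    """Grid search: return the candidate bandwidth with the lowest LOOCV NLL."""
    return min(candidates, key=lambda h: loo_nll(xs, h))
```

Leaving each point out matters: without it, the score would always favor the smallest bandwidth, because every point would sit exactly under its own kernel.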

Examples

Data used for fitting a kernel density function:

>>> df_train.collect()
   ID        X1        X2
0   0 -0.425770 -1.396130
1   1  0.884100  1.381493
2   2  0.134126 -0.032224
3   3  0.845504  2.867921
4   4  0.288441  1.513337
5   5 -0.666785  1.244980
6   6 -2.102968 -1.428327
7   7  0.769902 -0.473007
8   8  0.210291  0.328431
9   9  0.482323 -0.437962

Data used for density value prediction:

>>> df_pred.collect()
   ID        X1        X2
0   0 -2.102968 -1.428327
1   1 -2.102968  0.719797
2   2 -2.102968  2.867921
3   3 -0.609434 -1.428327
4   4 -0.609434  0.719797
5   5 -0.609434  2.867921
6   6  0.884100 -1.428327
7   7  0.884100  0.719797
8   8  0.884100  2.867921

Construct KDE instance:

>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)

Fit a kernel density function:

>>> kde.fit(data=df_train, key='ID')

Perform density prediction and check the results:

>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
   ID  DENSITY_VALUE
0   0      -3.324821
1   1      -5.733966
2   2      -8.372878
3   3      -3.123223
4   4      -2.772520
5   5      -4.852817
6   6      -3.469782
7   7      -2.556680
8   8      -3.198531
>>> stats.collect()
   TEST_ID                            FITTING_IDS
0        0  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
1        1  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
2        2  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
3        3  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
4        4  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
5        5  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
6        6  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
7        7  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
8        8  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}

Attributes:
stats_ : DataFrame

Statistical info for model evaluation. Available only when model evaluation/parameter selection is triggered.

optim_param_ : DataFrame

Optimal parameters selected. Available only when model evaluation/parameter selection is triggered.

Methods

fit(data, key[, features])

If parameter selection / model evaluation is enabled, perform it.

predict(data, key[, features])

Apply kernel density analysis.

fit(data, key, features=None)

If parameter selection / model evaluation is enabled, perform it. Otherwise, just set the training data set.

Parameters:
data : DataFrame

DataFrame containing the training data for density estimation.

key : str

Name of the ID column.

features : str or list of str, optional

Names of the feature columns in the DataFrame.

Defaults to all non-key columns.

Attributes:
_training_data : DataFrame

The training data for kernel density function fitting.

predict(data, key, features=None)

Apply kernel density analysis.

Parameters:
data : DataFrame

DataFrame containing the data points for density prediction.

key : str

Column of IDs of the data points for density prediction.

features : list of str, optional

Names of the feature columns.

Defaults to all non-key columns.

Returns:
DataFrames
  • Result data table, i.e. predicted log-density values on all points in data.

  • Statistics information table which reflects the support of prediction points over all training points.
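Because predict returns log-density values rather than densities, negative numbers such as those in the example output are expected. The sketch below shows one way a log-density can be computed for a Gaussian kernel with a single bandwidth shared across all dimensions. It is an illustration under those assumptions only: the helper name `gaussian_log_density` is hypothetical, and the code is not claimed to reproduce the values PAL returns.

```python
import math


def gaussian_log_density(train, point, h):
    """Log of a Gaussian-kernel density estimate at `point` with shared bandwidth h."""
    n, d = len(train), len(point)
    # Exponent of each kernel term: -||point - row||^2 / (2 h^2).
    exponents = [
        -0.5 * sum((p - t) ** 2 for p, t in zip(point, row)) / (h * h)
        for row in train
    ]
    # Log-sum-exp keeps the result finite when the point is far from all rows.
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    # Normalization: n kernels, each integrating to (h * sqrt(2 pi)) ** d.
    log_norm = math.log(n) + d * (math.log(h) + 0.5 * math.log(2.0 * math.pi))
    return log_sum - log_norm
```

Working in log space avoids underflow: a query point far from every training point would have a density that rounds to zero, while its log-density stays a finite, very negative number.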

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the KDE class also inherits methods from the PALBase class. Please refer to PAL Base for more details.