KDE

class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)

Performs kernel density estimation (KDE), which plays a role analogous to a histogram while avoiding its defects.

Parameters:

thread_ratio : float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 0.0.

leaf_size : int, optional

Number of samples in a KD tree or Ball tree leaf node.

Only valid when algorithm is 'kd-tree' or 'ball-tree'.

Defaults to 30.

kernel : {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional

Kernel function type.

Defaults to 'gaussian'.
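As a rough illustration of what the kernel names usually refer to, the sketch below evaluates the common textbook (unnormalized) profile of each kernel at a scaled distance u = distance / bandwidth. The names `KERNELS` and `kernel_profile` are hypothetical, and the exact scaling and normalization that PAL applies are assumptions not confirmed by this page.

```python
import math

# Unnormalized textbook kernel profiles, evaluated at u = distance / bandwidth.
# Assumption: PAL's kernels follow these common shapes; its exact
# normalization constants are not documented on this page.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u),
    "tophat": lambda u: 1.0 if abs(u) < 1.0 else 0.0,
    "epanechnikov": lambda u: max(0.0, 1.0 - u * u),
    "exponential": lambda u: math.exp(-abs(u)),
    "linear": lambda u: max(0.0, 1.0 - abs(u)),
    "cosine": lambda u: math.cos(0.5 * math.pi * u) if abs(u) < 1.0 else 0.0,
}


def kernel_profile(name, u):
    """Evaluate the named kernel profile at the scaled distance u."""
    return KERNELS[name](u)
```

The 'gaussian' and 'exponential' profiles have unbounded support, whereas the other four vanish once the scaled distance reaches 1.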

method : {'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)

Searching method.

Defaults to 'brute_force'.

algorithm : {'brute-force', 'kd-tree', 'ball-tree'}, optional

Specifies the searching method.

Defaults to 'brute-force'.

bandwidth : float, optional

Bandwidth used during density calculation. A value of 0 means the bandwidth is determined by the internal optimizer; otherwise the value supplied by the user is used. Only valid when the data is one-dimensional.

Defaults to 0.

distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Specifies how the distance between the training data and a test data point is computed.

Defaults to 'euclidean'.

minkowski_power : float, optional

Controls the power used in the Minkowski distance. Only valid when distance_level is 'minkowski'.

Defaults to 3.0.
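The four distance_level options correspond to standard distance definitions. The following is a minimal pure-Python sketch of those definitions, not PAL's implementation; the function name `distance` is hypothetical, and only the option names and the default power of 3.0 are taken from the parameter descriptions above.

```python
def distance(x, y, level="euclidean", minkowski_power=3.0):
    """Distance between two equal-length points under the given metric."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if level == "manhattan":
        return sum(diffs)
    if level == "euclidean":
        return sum(d * d for d in diffs) ** 0.5
    if level == "minkowski":
        # Generalizes the two metrics above: power 1 is Manhattan, power 2 is Euclidean.
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if level == "chebyshev":
        # Largest coordinate difference (the limit of Minkowski as the power grows).
        return max(diffs)
    raise ValueError("unknown distance level: %r" % level)
```

For instance, `distance([0, 0], [3, 4], level="manhattan")` sums the coordinate differences, while `level="chebyshev"` keeps only the largest one.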

rtol : float, optional

The desired relative tolerance of the result. A larger tolerance generally leads to faster execution.

Defaults to 1e-8.

atol : float, optional

The desired absolute tolerance of the result. A larger tolerance generally leads to faster execution.

Defaults to 0.
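To make the difference between the two tolerances concrete, the sketch below checks an approximate value against an exact one using a combined criterion, |approx - exact| <= atol + rtol * |exact|. This combined form is an assumption borrowed from common numerical practice; this page does not state how PAL combines rtol and atol, and the helper name `within_tolerance` is hypothetical.

```python
def within_tolerance(approx, exact, atol=0.0, rtol=1e-8):
    """Return True if approx matches exact under the assumed combined criterion.

    Assumption: the error budget is atol + rtol * |exact|; PAL's actual
    rule is not documented here.
    """
    return abs(approx - exact) <= atol + rtol * abs(exact)
```

With the defaults above, only the relative term contributes, so the permitted error scales with the magnitude of the density value; a non-zero atol adds a fixed allowance on top of that.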

stat_info : bool, optional

False: The STATISTIC table is empty.
True: Statistic information is displayed in the STATISTIC table.

Only valid when parameter selection is not specified.

resampling_method : {'loocv'}, optional

Specifies the resampling method for model evaluation / parameter selection. Only 'loocv' is permitted. Note that evaluation_metric must be set together with this parameter.

No default value.

evaluation_metric : {'nll'}, optional

Specifies the evaluation metric for model evaluation / parameter selection. Only 'nll' is supported.

No default value.

search_strategy : {'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

repeat_times : int, optional

Specifies the number of times resampling is repeated.

Defaults to 1.

random_state : int, optional

Specifies the seed for random number generation. The system time is used when 0 is specified.

Defaults to 0.

bandwidth_values : list, optional

Specifies the candidate values of bandwidth to select from.

Only valid when parameter selection is enabled.

bandwidth_range : list, optional

Specifies the range of bandwidth to select from.

Only valid when parameter selection is enabled.
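The parameter-selection options above ('loocv' resampling, 'nll' metric, 'grid' search over candidate bandwidths) can be pictured with the following pure-Python sketch. It is an illustration under stated assumptions, not PAL's implementation: it assumes one-dimensional data and a Gaussian kernel, it reads 'nll' as negative log-likelihood, and the names `loo_nll` and `select_bandwidth` are hypothetical.

```python
import math


def loo_nll(samples, bandwidth):
    """Mean negative log-likelihood under leave-one-out cross-validation.

    Each sample is scored by a Gaussian KDE built from all other samples.
    """
    n = len(samples)
    norm = (n - 1) * bandwidth * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, x in enumerate(samples):
        dens = sum(
            math.exp(-0.5 * ((x - y) / bandwidth) ** 2)
            for j, y in enumerate(samples)
            if j != i
        ) / norm
        if dens <= 0.0:
            # The held-out point has no support: this bandwidth is unusable.
            return math.inf
        total -= math.log(dens)
    return total / n


def select_bandwidth(samples, candidates):
    """Grid search: return the candidate with the lowest leave-one-out NLL."""
    return min(candidates, key=lambda h: loo_nll(samples, h))
```

A very small bandwidth leaves held-out points without support and a very large one flattens the estimate, so both are penalized by the metric and an intermediate candidate tends to win.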

Attributes:

stats_ : DataFrame

Statistics. Available only when model evaluation / parameter selection is triggered.

optim_param_ : DataFrame

Optimal parameters selected. Available only when model evaluation / parameter selection is triggered.

Methods

`fit`(data, key[, features])	If parameter selection / model evaluation is enabled, perform it.
`predict`(data, key[, features])	Apply kernel density analysis.

Examples

>>> from hana_ml.algorithms.pal.kernel_density import KDE
>>> kde = KDE(leaf_size=10, algorithm='kd-tree', bandwidth=0.68129, stat_info=True)
>>> kde.fit(data=df_train, key='ID')
>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
>>> stats.collect()

fit(data, key, features=None)

If parameter selection / model evaluation is enabled, perform it. Otherwise, this method only sets the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data from which the density distribution is estimated.

key : str

Name of the ID column.

features : str or list of str, optional

Names of the feature columns in the DataFrame.

Defaults to all non-key columns.

Attributes:

_training_data : DataFrame

The training data for kernel density function fitting.

predict(data, key, features=None)

Apply kernel density analysis.

Parameters:

data : DataFrame

DataFrame containing the data points for density prediction.

key : str

Column of IDs of the data points for density prediction.

features : list of str, optional

Names of the feature columns.

Defaults to all non-key columns.

Returns:

DataFrames

Result data table, i.e. predicted log-density values on all points in data.
Statistics information table, which reflects the support of prediction points over all training points.
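To show what a "predicted log-density value" means, the sketch below evaluates a brute-force Gaussian KDE on one-dimensional data in pure Python. It is only an illustration of the quantity being returned, not PAL's implementation; the function name `gaussian_log_density` is hypothetical, and the Gaussian kernel with its usual normalization is an assumption.

```python
import math


def gaussian_log_density(train, points, bandwidth):
    """Log of the Gaussian kernel density estimate at each query point.

    train: training values (1-D), points: query values, bandwidth: kernel width.
    """
    n = len(train)
    log_norm = math.log(n * bandwidth * math.sqrt(2.0 * math.pi))
    result = []
    for x in points:
        s = sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)
        # A query point too far from every training point has zero density.
        result.append(math.log(s) - log_norm if s > 0.0 else -math.inf)
    return result
```

Query points close to many training points receive a higher log-density, and points far from all of them receive a strongly negative value.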

Inherited Methods from PALBase

Besides the methods mentioned above, the KDE class also inherits methods from the PALBase class. Please refer to PAL Base for more details.