KDE
- class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)
Performs kernel density estimation, which is analogous to a histogram but avoids its defects.
- Parameters:
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Default to 0.0.
- leaf_sizeint, optional
Number of samples in a KD tree or Ball tree leaf node.
Only valid when algorithm is 'kd-tree' or 'ball-tree'.
Default to 30.
- kernel{'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional
Kernel function type.
Default to 'gaussian'.
- method{'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)
Searching method.
Default to 'brute_force'.
- algorithm{'brute-force', 'kd-tree', 'ball-tree'}, optional
Specifies the searching method.
Default to 'brute-force'.
- bandwidthfloat, optional
Bandwidth used during density calculation. A value of 0 means the bandwidth is determined by the internal optimizer; any other value is used as provided by the user. Only valid when the data is one-dimensional.
Default to 0.
- distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional
Specifies how the distance between the training data and a test data point is computed.
Default to 'euclidean'.
- minkowski_powerfloat, optional
Value of the power used by the Minkowski distance. Only valid when distance_level is 'minkowski'.
Default to 3.0.
- rtolfloat, optional
The desired relative tolerance of the result. A larger tolerance generally leads to faster execution.
Default to 1e-8.
- atolfloat, optional
The desired absolute tolerance of the result. A larger tolerance generally leads to faster execution.
Default to 0.
- stat_infobool, optional
False: The STATISTIC table is empty.
True: Statistic information is displayed in the STATISTIC table.
Only valid when parameter selection is not specified.
- resampling_method{'loocv'}, optional
Specifies the resampling method for model evaluation / parameter selection. Only 'loocv' is permitted. Note that evaluation_metric must be set together with this parameter.
No default value.
- evaluation_metric{'nll'}, optional
Specifies the evaluation metric for model evaluation / parameter selection. Only 'nll' is supported.
No default value.
- search_strategy{'grid', 'random'}, optional
Specifies the method to activate parameter selection.
No default value.
- repeat_timesint, optional
Specifies the number of repeat times for resampling.
Default to 1.
- random_stateint, optional
Specifies the seed for random generation. Use system time when 0 is specified.
Default to 0.
- bandwidth_valueslist, optional
Specifies values of parameter bandwidth to be selected.
Only valid when parameter selection is enabled.
- bandwidth_rangelist, optional
Specifies ranges of parameter bandwidth to be selected.
Only valid when parameter selection is enabled.
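For orientation, the distance_level options correspond to the standard textbook distance definitions. The sketch below is plain Python, not PAL code, and the function names are hypothetical; it only illustrates how minkowski_power generalizes the 'manhattan' (power 1) and 'euclidean' (power 2) cases, with 'chebyshev' as the limiting case.

```python
def minkowski_distance(x, y, power):
    """Minkowski distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** power for a, b in zip(x, y)) ** (1.0 / power)


def chebyshev_distance(x, y):
    """Chebyshev distance: the largest coordinate-wise absolute difference."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return max(abs(a - b) for a, b in zip(x, y))


p, q = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(p, q, 1)    # power 1
euclidean = minkowski_distance(p, q, 2)    # power 2
cubic = minkowski_distance(p, q, 3.0)      # power 3.0, the documented default
chebyshev = chebyshev_distance(p, q)       # limit as the power grows
```

For a fixed pair of points the distance shrinks as the power grows, so here manhattan >= euclidean >= cubic >= chebyshev.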
Examples
>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)
>>> kde.fit(data=df_train, key='ID')
>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
>>> stats.collect()
- Attributes:
- stats_DataFrame
Statistics. Available only when model evaluation / parameter selection is triggered.
- optim_param_DataFrame
Optimal parameters selected. Available only when model evaluation / parameter selection is triggered.
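As a hedged illustration of when these attributes are populated, the sketch below wraps a parameter-selection run in a hypothetical helper, select_bandwidth. It uses only the constructor parameters and attributes documented on this page. The key column name 'ID' and the candidate list are assumptions, and the combination of options has not been verified against a live SAP HANA system.

```python
def select_bandwidth(df_train, candidates):
    """Hypothetical helper: run KDE bandwidth selection on a hana_ml DataFrame.

    df_train is assumed to be a hana_ml DataFrame with an 'ID' key column;
    candidates is a list of bandwidth values to try.
    """
    # Imported lazily so this sketch can be defined without hana_ml installed.
    from hana_ml.algorithms.pal.kernel_density import KDE

    kde = KDE(resampling_method='loocv',    # the only permitted resampling method
              evaluation_metric='nll',      # must be set together with resampling_method
              search_strategy='grid',
              bandwidth_values=candidates)
    kde.fit(data=df_train, key='ID')        # parameter selection happens in fit
    # Both attributes are documented as available only after selection is triggered.
    return kde.optim_param_.collect(), kde.stats_.collect()
```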
Methods
fit(data, key[, features]): If parameter selection / model evaluation is enabled, perform it.
predict(data, key[, features]): Apply kernel density analysis.
- fit(data, key, features=None)
If parameter selection / model evaluation is enabled, perform it. Otherwise, only set the training dataset.
- Parameters:
- dataDataFrame
DataFrame containing the training data from which the density distribution is estimated.
- keystr
Name of the ID column.
- featuresstr/list of str, optional
Name of the feature columns in the dataframe.
Defaults to all non-key columns.
- Attributes:
- _training_dataDataFrame
The training data for kernel density function fitting.
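To make the 'loocv' and 'nll' options more concrete, the following plain-Python sketch scores candidate bandwidths by leave-one-out negative log-likelihood for a one-dimensional Gaussian kernel. It shows the general technique only. PAL's exact definition and normalization may differ, and all names and sample values here are illustrative assumptions.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def loocv_nll(samples, bandwidth):
    """Mean leave-one-out negative log-likelihood of a 1-D Gaussian KDE."""
    n = len(samples)
    total = 0.0
    for i, x in enumerate(samples):
        # Density at x estimated from every sample except x itself.
        density = sum(
            gaussian_kernel((x - other) / bandwidth)
            for j, other in enumerate(samples) if j != i
        ) / ((n - 1) * bandwidth)
        total -= math.log(density)
    return total / n


samples = [1.0, 1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.1]   # made-up data
candidates = [0.05, 0.2, 0.5, 1.0, 5.0]              # grid of bandwidths
scores = {h: loocv_nll(samples, h) for h in candidates}
best_bandwidth = min(scores, key=scores.get)          # lowest NLL wins
```

A very small bandwidth is penalized because each held-out point gets almost no density from its neighbours, while a very large bandwidth is penalized because the estimate becomes too flat.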
- predict(data, key, features=None)
Apply kernel density analysis.
- Parameters:
- dataDataFrame
DataFrame containing the data points for density prediction.
- keystr
Column of IDs of the data points for density prediction.
- featuresa list of str, optional
Names of the feature columns.
Defaults to all non-key columns.
- Returns:
- DataFrame
Result data table, i.e. predicted log-density values on all points in data.
Statistics information table which reflects the support of prediction points over all training points.
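As a rough picture of what a predicted log-density value means, here is a minimal plain-Python sketch for the one-dimensional Gaussian case. It is not the PAL implementation. The function name and data are hypothetical, and the tree-based search and tolerance settings (atol, rtol) are ignored.

```python
import math


def gaussian_log_density(train, query, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate at a single query point."""
    n = len(train)
    # Sum of unnormalized Gaussian kernel values centered on each training point.
    total = sum(math.exp(-0.5 * ((query - x) / bandwidth) ** 2) for x in train)
    density = total / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return math.log(density)


train = [-1.0, 0.0, 1.0]   # made-up training points
log_density_center = gaussian_log_density(train, 0.0, bandwidth=1.0)
log_density_tail = gaussian_log_density(train, 5.0, bandwidth=1.0)
```

A query point near the training data receives a higher log-density than one far away in the tail.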
Inherited Methods from PALBase
Besides the methods mentioned above, the KDE class also inherits methods from the PALBase class. Please refer to PAL Base for more details.