KDE
- class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)
Performs kernel density estimation, which serves a purpose similar to a histogram while avoiding its drawbacks.
- Parameters:
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.0.
- leaf_size : int, optional
Number of samples in a KD tree or Ball tree leaf node.
Only valid when algorithm is 'kd-tree' or 'ball-tree'.
Defaults to 30.
- kernel : {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional
Kernel function type.
Defaults to 'gaussian'.
- method : {'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)
Searching method.
Defaults to 'brute_force'.
- algorithm : {'brute-force', 'kd-tree', 'ball-tree'}, optional
Specifies the searching method.
Defaults to 'brute-force'.
- bandwidth : float, optional
Bandwidth used during density calculation. A value of 0 means the bandwidth is determined by the internal optimizer; any other value is used as supplied by the user. Only valid when the data is one-dimensional.
Defaults to 0.
- distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional
Specifies the metric used to compute the distance between the training data and a test data point.
Defaults to 'euclidean'.
- minkowski_power : float, optional
Controls the power used in the Minkowski distance.
Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
- rtol : float, optional
The desired relative tolerance of the result. A larger tolerance generally leads to faster execution.
Defaults to 1e-8.
- atol : float, optional
The desired absolute tolerance of the result. A larger tolerance generally leads to faster execution.
Defaults to 0.
- stat_info : bool, optional
Controls whether statistic information is written to the STATISTIC table.
False: the STATISTIC table is empty.
True: statistic information is displayed in the STATISTIC table.
Only valid when parameter selection is not specified.
- resampling_method : {'loocv'}, optional
Specifies the resampling method for model evaluation / parameter selection. Only 'loocv' is permitted.
Note that evaluation_metric must be set together with this parameter.
No default value.
- evaluation_metric : {'nll'}, optional
Specifies the evaluation metric for model evaluation / parameter selection. Only 'nll' is supported.
No default value.
- search_strategy : {'grid', 'random'}, optional
Specifies the search method used for parameter selection. Setting this parameter activates parameter selection.
No default value.
- repeat_times : int, optional
Specifies the number of times resampling is repeated.
Defaults to 1.
- random_state : int, optional
Specifies the seed for random number generation. The system time is used when 0 is specified.
Defaults to 0.
- bandwidth_values : list, optional
Specifies the candidate values of the parameter bandwidth to select from.
Only valid when parameter selection is enabled.
- bandwidth_range : list, optional
Specifies the range of the parameter bandwidth to select from.
Only valid when parameter selection is enabled.
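As an illustration of what the kernel and distance_level options refer to, the following is a minimal pure-Python sketch of the common textbook forms of these kernels and distance metrics. It is an assumption that PAL uses the same definitions. The helper names kernel_profile and distance are hypothetical and not part of hana_ml, and the kernel profiles are left unnormalized.

```python
import math

def kernel_profile(name, u):
    """Unnormalized kernel value at scaled distance u = distance / bandwidth.

    Textbook forms only; normalizing constants are omitted on purpose.
    """
    u = abs(u)
    if name == 'gaussian':
        return math.exp(-0.5 * u * u)
    if name == 'tophat':
        return 1.0 if u < 1.0 else 0.0
    if name == 'epanechnikov':
        return 1.0 - u * u if u < 1.0 else 0.0
    if name == 'exponential':
        return math.exp(-u)
    if name == 'linear':
        return 1.0 - u if u < 1.0 else 0.0
    if name == 'cosine':
        return math.cos(0.5 * math.pi * u) if u < 1.0 else 0.0
    raise ValueError(f"unknown kernel: {name}")

def distance(x, y, level='euclidean', minkowski_power=3.0):
    """Distance between two points given as equal-length sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if level == 'manhattan':
        return sum(diffs)
    if level == 'euclidean':
        return math.sqrt(sum(d * d for d in diffs))
    if level == 'minkowski':
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if level == 'chebyshev':
        return max(diffs)
    raise ValueError(f"unknown distance level: {level}")
```

For example, distance((0, 0), (3, 4)) gives 5.0 under the default Euclidean metric, while level='chebyshev' gives 4.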
Examples
>>> from hana_ml.algorithms.pal.kernel_density import KDE
>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)
>>> kde.fit(data=df_train, key='ID')
>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
>>> stats.collect()
- Attributes:
- stats_ : DataFrame
Statistics. Available only when model evaluation / parameter selection is triggered.
- optim_param_ : DataFrame
Optimal parameters selected. Available only when model evaluation / parameter selection is triggered.
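Since these attributes are populated only when parameter selection is triggered, the sketch below shows one way the constructor arguments documented above might be combined to enable it. The parameter names come from the class signature; the concrete values, the helper name fit_with_selection, and the df_train argument (assumed to be a hana_ml DataFrame with an 'ID' column) are illustrative assumptions. Running the function requires hana_ml and a live SAP HANA connection.

```python
# Illustrative values only; the parameter names are taken from the KDE signature.
selection_kwargs = dict(
    algorithm='kd-tree',
    leaf_size=10,
    resampling_method='loocv',    # the only permitted resampling method
    evaluation_metric='nll',      # must be set together with resampling_method
    search_strategy='grid',       # activates parameter selection
    bandwidth_values=[0.5, 1.0, 2.0],
)

def fit_with_selection(df_train, key='ID'):
    """Fit a KDE with bandwidth selection and return the selection results.

    df_train is assumed to be a hana_ml DataFrame (hypothetical input).
    """
    # Imported lazily so the module can be loaded without hana_ml installed.
    from hana_ml.algorithms.pal.kernel_density import KDE

    kde = KDE(**selection_kwargs)
    kde.fit(data=df_train, key=key)
    return kde.optim_param_, kde.stats_
```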
Methods
- fit(data, key[, features]): If parameter selection / model evaluation is enabled, perform it.
- get_model_metrics(): Get the model metrics.
- get_score_metrics(): Get the score metrics.
- predict(data, key[, features]): Apply kernel density analysis.
- fit(data, key, features=None)
If parameter selection / model evaluation is enabled, perform it. Otherwise, simply set the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the data from which the density distribution is estimated.
- key : str
Name of the ID column.
- features : str or list of str, optional
Names of the feature columns in the DataFrame.
Defaults to all non-key columns.
- Attributes:
- _training_data : DataFrame
The training data for kernel density function fitting.
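To make the parameter selection step more concrete, here is a minimal pure-Python sketch of choosing a bandwidth by leave-one-out cross-validation ('loocv') with negative log-likelihood ('nll') over a grid of candidates. It assumes one-dimensional data and a Gaussian kernel, and it is a conceptual illustration rather than PAL's actual implementation. The function names and sample data are hypothetical.

```python
import math

def gaussian_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def loocv_nll(samples, bandwidth):
    """Mean negative log-likelihood of each point under a KDE built from the others."""
    n = len(samples)
    total = 0.0
    for i, x in enumerate(samples):
        # Density at x estimated from every sample except x itself.
        density = sum(gaussian_pdf((x - s) / bandwidth)
                      for j, s in enumerate(samples) if j != i)
        density /= (n - 1) * bandwidth
        total -= math.log(density)
    return total / n

def select_bandwidth(samples, candidates):
    """Grid search: return the candidate with the lowest leave-one-out NLL."""
    return min(candidates, key=lambda h: loocv_nll(samples, h))

samples = [-1.2, -0.7, -0.1, 0.0, 0.3, 0.9, 1.4]
candidates = [0.05, 0.5, 5.0]
best = select_bandwidth(samples, candidates)
```

In this toy data set the very small bandwidth is penalized heavily because isolated points receive almost no density from their neighbours, and the very large bandwidth oversmooths, so the middle candidate is selected.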
- predict(data, key, features=None)
Apply kernel density analysis.
- Parameters:
- data : DataFrame
DataFrame containing the data points for which the density is predicted.
- key : str
Name of the column holding the IDs of the data points for density prediction.
- features : list of str, optional
Names of the feature columns.
Defaults to all non-key columns.
- Returns:
- DataFrame
Result table, i.e. the predicted log-density values for all points in data.
- DataFrame
Statistics table, which reflects the support of the prediction points over all training points.
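As a rough illustration of what a predicted log-density value represents, the sketch below evaluates the log of a one-dimensional Gaussian kernel density estimate at a set of query points. It is a conceptual example under those assumptions, not PAL's implementation, and the function name log_density is hypothetical. The log-sum-exp form keeps the result finite even for query points far from all training points.

```python
import math

def log_density(train, query, bandwidth):
    """Log of a 1-D Gaussian KDE evaluated at each query point."""
    # log of the normalizing constant n * h * sqrt(2 * pi)
    log_norm = math.log(len(train) * bandwidth * math.sqrt(2.0 * math.pi))
    results = []
    for x in query:
        exponents = [-0.5 * ((x - t) / bandwidth) ** 2 for t in train]
        peak = max(exponents)
        # log-sum-exp: factor out the largest term to avoid underflow
        log_sum = peak + math.log(sum(math.exp(e - peak) for e in exponents))
        results.append(log_sum - log_norm)
    return results
```

With a single training point at 0.0 and bandwidth 1.0, the value at 0.0 is the log of the standard normal density at its mode, that is, -0.5 * log(2 * pi).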
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the KDE class also inherits methods from the PALBase class. Please refer to PAL Base for more details.