KDE
- class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)
Performs kernel density estimation, which serves a purpose analogous to a histogram while avoiding the histogram's defects.
- Parameters
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Default to 0.0.
- leaf_size : int, optional
Number of samples in a KD tree or Ball tree leaf node.
Only valid when algorithm is 'kd-tree' or 'ball-tree'.
Default to 30.
- kernel : {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional
Kernel function type.
Default to 'gaussian'.
- method : {'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)
Searching method.
Default to 'brute_force'.
- algorithm : {'brute-force', 'kd-tree', 'ball-tree'}, optional
Specifies the searching method.
Default to 'brute-force'.
- bandwidth : float, optional
Bandwidth used during density calculation.
A value of 0 means the bandwidth is determined by the internal optimizer; any other value is used as supplied by the user.
Only valid when data is one dimensional.
Default to 0.
- distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional
Distance metric used between the training data and a test data point.
Default to 'euclidean'.
- minkowski_power : float, optional
Controls the power used in the Minkowski distance.
Only valid when distance_level is 'minkowski'.
Default to 3.0.
- rtol : float, optional
The desired relative tolerance of the result.
A larger tolerance generally leads to faster execution.
Default to 1e-8.
- atol : float, optional
The desired absolute tolerance of the result.
A larger tolerance generally leads to faster execution.
Default to 0.
- stat_info : bool, optional
False: The STATISTIC table is empty.
True: Statistic information is displayed in the STATISTIC table.
Only valid when parameter selection is not specified.
- resampling_method : {'loocv'}, optional
Specifies the resampling method for model evaluation or parameter selection. Only 'loocv' is permitted.
evaluation_metric must be set together with this parameter.
No default value.
- evaluation_metric : {'nll'}, optional
Specifies the evaluation metric for model evaluation or parameter selection. Only 'nll' is supported.
No default value.
- search_strategy : {'grid', 'random'}, optional
Specifies the method to activate parameter selection.
No default value.
- repeat_times : int, optional
Specifies the number of repeat times for resampling.
Default to 1.
- random_state : int, optional
Specifies the seed for random number generation. The system time is used when 0 is specified.
Default to 0.
- bandwidth_values : list, optional
Specifies values of parameter bandwidth to be selected.
Only valid when parameter selection is enabled.
- bandwidth_range : list, optional
Specifies ranges of parameter bandwidth to be selected.
Only valid when parameter selection is enabled.
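The kernel, distance_level and minkowski_power options can be illustrated in plain Python. The sketch below is not the PAL implementation: the names minkowski, distance and KERNELS are hypothetical, the kernel profiles are the common textbook forms without normalizing constants, and PAL's exact definitions may differ.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def distance(a, b, level="euclidean", power=3.0):
    """Distance between points a and b for a given distance_level name."""
    if level == "manhattan":
        return minkowski(a, b, 1)
    if level == "euclidean":
        return minkowski(a, b, 2)
    if level == "minkowski":
        return minkowski(a, b, power)
    if level == "chebyshev":
        return max(abs(x - y) for x, y in zip(a, b))
    raise ValueError(f"unknown distance level: {level!r}")


# Unnormalized kernel profiles as functions of u = distance / bandwidth (u >= 0).
# Assumed textbook forms; not taken from the PAL documentation.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u),
    "tophat": lambda u: 1.0 if u < 1.0 else 0.0,
    "epanechnikov": lambda u: max(0.0, 1.0 - u * u),
    "exponential": lambda u: math.exp(-u),
    "linear": lambda u: max(0.0, 1.0 - u),
    "cosine": lambda u: math.cos(0.5 * math.pi * u) if u < 1.0 else 0.0,
}

a, b = (0.0, 0.0), (3.0, 4.0)
for level in ("manhattan", "euclidean", "minkowski", "chebyshev"):
    print(level, distance(a, b, level))
```

Manhattan, Euclidean and Minkowski differ only in the exponent, which is why minkowski_power matters only when distance_level is 'minkowski'.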
Examples
Data used for fitting a kernel density function:
>>> df_train.collect()
   ID        X1        X2
0   0 -0.425770 -1.396130
1   1  0.884100  1.381493
2   2  0.134126 -0.032224
3   3  0.845504  2.867921
4   4  0.288441  1.513337
5   5 -0.666785  1.244980
6   6 -2.102968 -1.428327
7   7  0.769902 -0.473007
8   8  0.210291  0.328431
9   9  0.482323 -0.437962
Data used for density value prediction:
>>> df_pred.collect()
   ID        X1        X2
0   0 -2.102968 -1.428327
1   1 -2.102968  0.719797
2   2 -2.102968  2.867921
3   3 -0.609434 -1.428327
4   4 -0.609434  0.719797
5   5 -0.609434  2.867921
6   6  0.884100 -1.428327
7   7  0.884100  0.719797
8   8  0.884100  2.867921
Construct KDE instance:
>>> from hana_ml.algorithms.pal.kernel_density import KDE
>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)
Fit a kernel density function:
>>> kde.fit(data=df_train, key='ID')
Perform density prediction and check the results:
>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
   ID  DENSITY_VALUE
0   0      -3.324821
1   1      -5.733966
2   2      -8.372878
3   3      -3.123223
4   4      -2.772520
5   5      -4.852817
6   6      -3.469782
7   7      -2.556680
8   8      -3.198531
>>> stats.collect()
   TEST_ID                            FITTING_IDS
0        0  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
1        1  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
2        2  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
3        3  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
4        4  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
5        5  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
6        6  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
7        7  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
8        8  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
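To make the DENSITY_VALUE column easier to interpret, here is a minimal brute-force sketch in plain Python. It is not the PAL implementation: the function name gaussian_log_density is hypothetical, and the sketch assumes a Gaussian kernel normalized as a d-dimensional Gaussian, Euclidean distance, and the natural logarithm. Under those assumptions, the value for the first prediction point should land close to the -3.324821 shown for ID 0.

```python
import math


def gaussian_log_density(point, training, bandwidth):
    """Log of the Gaussian kernel density estimate at `point` (brute force)."""
    dim = len(point)
    # Normalizing constant of a dim-dimensional Gaussian with std = bandwidth.
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for row in training:
        sq_dist = sum((p - t) ** 2 for p, t in zip(point, row))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return math.log(total / (len(training) * norm))


# (X1, X2) columns of df_train from the example.
train = [
    (-0.425770, -1.396130), (0.884100, 1.381493), (0.134126, -0.032224),
    (0.845504, 2.867921), (0.288441, 1.513337), (-0.666785, 1.244980),
    (-2.102968, -1.428327), (0.769902, -0.473007), (0.210291, 0.328431),
    (0.482323, -0.437962),
]

# First row of df_pred, with the bandwidth passed to KDE in the example.
log_density = gaussian_log_density((-2.102968, -1.428327), train, 0.68129)
print(round(log_density, 4))
```

The loop visits every training point, which mirrors the idea behind the brute-force search option; the tree-based options exist to avoid that full scan.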
- Attributes
- stats_ : DataFrame
Statistical info for model evaluation. Available only when model evaluation/parameter selection is triggered.
- optim_param_ : DataFrame
Optimal parameters selected. Available only when model evaluation/parameter selection is triggered.
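Parameter selection with resampling_method='loocv' and evaluation_metric='nll' can be pictured as scoring each candidate bandwidth by its leave-one-out negative log-likelihood and keeping the lowest score. The following is a rough one-dimensional sketch of that idea, not PAL's actual procedure; the function names, the sample values and the candidate list are all made up for illustration.

```python
import math


def loo_nll(samples, bandwidth):
    """Leave-one-out negative log-likelihood of a 1-D Gaussian KDE."""
    n = len(samples)
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    nll = 0.0
    for i, x in enumerate(samples):
        # Density at x estimated from every sample except x itself.
        total = sum(
            math.exp(-0.5 * ((x - other) / bandwidth) ** 2)
            for j, other in enumerate(samples)
            if j != i
        )
        nll -= math.log(total / ((n - 1) * norm))
    return nll


def select_bandwidth(samples, candidates):
    """Return the candidate bandwidth with the lowest leave-one-out NLL."""
    return min(candidates, key=lambda h: loo_nll(samples, h))


# Illustrative data and grid (hypothetical values).
samples = [-2.1, -0.67, -0.43, 0.13, 0.21, 0.29, 0.48, 0.77, 0.85, 0.88]
candidates = [0.1, 0.3, 0.5, 1.0, 2.0]

best = select_bandwidth(samples, candidates)
print(best, loo_nll(samples, best))
```

Leaving each point out of its own density estimate is what keeps the score from always favoring the smallest bandwidth.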
Methods
fit(data, key[, features]): If parameter selection / model evaluation is enabled, perform it.
predict(data, key[, features]): Apply kernel density analysis.
- fit(data, key, features=None)
If parameter selection / model evaluation is enabled, perform it. Otherwise, just set the training data set.
- Parameters
- data : DataFrame
DataFrame containing the data used to fit the density distribution.
- key : str
Name of the ID column.
- features : str/list of str, optional
Names of the feature columns in the DataFrame.
Defaults to all non-key columns.
- Attributes
- _training_data : DataFrame
The training data for kernel density function fitting.
- predict(data, key, features=None)
Apply kernel density analysis.
- Parameters
- data : DataFrame
DataFrame containing the data points for density prediction.
- key : str
Column of IDs of the data points for density prediction.
- features : list of str, optional
Names of the feature columns.
Defaults to all non-key columns.
- Returns
- DataFrame
Result data table, i.e. predicted log-density values on all points in data.
Statistics information table which reflects the support of prediction points over all training points.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.