KBinsDiscretizer
- class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)
Bin continuous data into a number of intervals and perform local smoothing.
Note
The data type of the output value is the same as that of the input value. If the original data is of type INTEGER, the smoothed output is therefore converted to an integer, which may not be the result you expect.
Cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
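One way to follow this advice is to cast the feature column on the hana_ml DataFrame before fitting. The sketch below assumes that DataFrame.cast(cols, new_type) is available in your hana_ml version and accepts a list of column names together with the type name 'DOUBLE'; the helper name cast_features_to_double is made up for illustration.

```python
def cast_features_to_double(df, features):
    """Return `df` with the given feature columns cast to DOUBLE.

    `df` is expected to be a hana_ml DataFrame. This relies on
    DataFrame.cast(cols, new_type), which is an assumption here and is
    not confirmed by this page.
    """
    return df.cast(list(features), 'DOUBLE')
```

With the example data on this page, the call would be cast_features_to_double(df1, ['DATA']) before binning.fit(...).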
- Parameters
- strategy: {'uniform_number', 'uniform_size', 'quantile', 'sd'}
Specifies the binning method, valid options include:
'uniform_number': Equal widths based on the number of bins.
'uniform_size': Equal widths based on the bin size.
'quantile': Equal number of records per bin.
'sd': Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
- smoothing: {'means', 'medians', 'boundaries'}
Specifies the smoothing method, valid options include:
'means': Each value within a bin is replaced by the average of all the values belonging to the same bin.
'medians': Each value in a bin is replaced by the median of all the values belonging to the same bin.
'boundaries': The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When a value is equally distant from both boundaries, it is replaced by the front boundary value.
Values used for smoothing are not re-calculated during transform.
- n_bins: int, optional
The number of bins.
Only valid when strategy is 'uniform_number' or 'quantile'.
Defaults to 2.
- bin_size: int, optional
The interval width of each bin.
Only valid when strategy is 'uniform_size'.
Defaults to 10.
- n_sd: int, optional
The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean.
Only valid when strategy is 'sd'.
Defaults to 1.
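The 'sd' strategy and the three smoothing options described above can be sketched in plain Python. This is an illustration of the descriptions only, not the PAL implementation. The function names sd_bin_index and smooth are hypothetical. The caller supplies the mean and standard deviation, since this page does not say which standard deviation estimate PAL uses. Boundary handling, and reading "front boundary" as the bin minimum, are assumptions.

```python
import math
from statistics import mean, median


def sd_bin_index(value, mu, sigma, n_sd=1):
    """Return a 1-based bin index for the 'sd' strategy (2*n_sd + 1 bins).

    Left to right: a tail bin below mu - n_sd*sigma, one-sigma-wide bins,
    a center bin covering [mu - sigma, mu + sigma], more one-sigma-wide
    bins, and a tail bin above mu + n_sd*sigma.
    Assumption: a value exactly on a boundary goes to the bin nearer the mean.
    """
    center = n_sd + 1
    if sigma == 0:
        return center
    z = (value - mu) / sigma
    if abs(z) <= 1:
        return center
    # Number of bins away from the center, capped at the tail bin.
    steps = min(math.ceil(abs(z)) - 1, n_sd)
    return center + steps if z > 0 else center - steps


def smooth(values, method):
    """Smooth the values of ONE bin with 'means', 'medians' or 'boundaries'."""
    if method == 'means':
        return [mean(values)] * len(values)
    if method == 'medians':
        return [median(values)] * len(values)
    if method == 'boundaries':
        low, high = min(values), max(values)
        # Assumption: on a tie the "front" boundary is the minimum.
        return [low if (v - low) <= (high - v) else high for v in values]
    raise ValueError("method must be 'means', 'medians' or 'boundaries'")
```

For example, smooth([10.0, 12.0, 14.0], 'boundaries') maps the middle value to 10.0 under the tie assumption, and sd_bin_index(2.5, 0.0, 1.0, n_sd=2) falls in the rightmost of five bins.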
Examples
Input DataFrame df1:
>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0
Creating a KBinsDiscretizer instance:
>>> binning = KBinsDiscretizer(strategy='uniform_size', smoothing='means', bin_size=10)
Performing fit on the given DataFrame:
>>> binning.fit(data=df1, key='ID')
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000
Input DataFrame df2 for transforming:
>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0
Performing transform on the given DataFrame:
>>> result = binning.transform(data=df2, key='ID')
Output:
>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000
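The smoothed values in the example can be checked by hand. The sketch below does not call PAL: it copies the DATA and BIN_INDEX columns from the outputs above, computes one mean per bin from the fit data, and reuses those means for the transform data. Treating BIN_INDEX -1 as "no matching bin, value left unchanged" is an interpretation of the 67.0 row, not something this page states.

```python
from collections import defaultdict
from statistics import mean

# DATA and BIN_INDEX copied from the fit output above (IDs 0..10).
fit_data = [6.0, 12.0, 13.0, 15.0, 10.0, 23.0, 24.0, 30.0, 32.0, 25.0, 38.0]
fit_bins = [1, 2, 2, 2, 1, 3, 3, 3, 4, 3, 4]

# 'means' smoothing: one mean per bin, computed once from the fit data.
members = defaultdict(list)
for value, bin_index in zip(fit_data, fit_bins):
    members[bin_index].append(value)
bin_means = {b: mean(vals) for b, vals in members.items()}

# Transform reuses the stored means. BIN_INDEX values are copied from the
# transform output above; rows with bin -1 keep their original value.
new_data = [6.0, 67.0, 4.0, 12.0, -2.0, 40.0]
new_bins = [1, -1, 1, 2, 1, 4]
smoothed = [bin_means.get(b, v) for v, b in zip(new_data, new_bins)]

print({b: round(m, 6) for b, m in sorted(bin_means.items())})
# {1: 8.0, 2: 13.333333, 3: 25.5, 4: 35.0}
print([round(v, 6) for v in smoothed])
# [8.0, 67.0, 8.0, 13.333333, 8.0, 35.0]
```

The second printed list matches the DATA column of the transform output, which is consistent with the statement that smoothing values are not re-calculated during transform.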
- Attributes
- result_: DataFrame
Binned dataset from fit and fit_transform methods.
- model_: DataFrame
Binning model content.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features]): Bin input data into a number of intervals and smooth.
fit_transform(data[, key, features]): Fit with the dataset and return the results.
set_model_state(state): Set the model state by state information.
transform(data[, key, features]): Bin data based on the previous binning model.
- fit(data, key=None, features=None)
Bin input data into a number of intervals and smooth.
- Parameters
- data: DataFrame
DataFrame to be discretized.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
- Fitted object.
- fit_transform(data, key=None, features=None)
Fit with the dataset and return the results.
- Parameters
- data: DataFrame
DataFrame to be binned.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
- DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data's ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data's feature column.
- transform(data, key=None, features=None)
Bin data based on the previous binning model.
- Parameters
- data: DataFrame
DataFrame to be binned.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
- DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data's ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data's feature column.
- create_model_state(model=None, function=None, pal_funcname='PAL_BINNING_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function name of the classification algorithm.
Valid only for UnifiedClassification and UnifiedRegression.
Defaults to self.real_func.
- pal_funcname: int or str, optional
PAL function name.
Defaults to 'PAL_BINNING_ASSIGNMENT'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
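As a small illustration of the dict form, the sketch below checks that a state dict carries the four required keys before it is passed to set_model_state. The helper missing_state_keys and the placeholder values are hypothetical; real values come from a previously created model state.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def missing_state_keys(state):
    """Return the required keys absent from a state dict, in a fixed order."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values only, shown to illustrate the expected shape.
example_state = {
    'STATE_ID': '<state id>',
    'HINT': '<hint>',
    'HOST': '<host>',
    'PORT': '<port>',
}
```

An empty list from missing_state_keys(example_state) means the dict has the shape described above.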
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.