KBinsDiscretizer
- class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)
Bin continuous data into a number of intervals and perform local smoothing.
Note
The data type of the output value is the same as that of the input value. If the original data is of type INTEGER, the smoothed output is therefore converted to an integer, which may not be the result you expect.
To avoid this, cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
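The effect described in this note can be illustrated without a database. The following is a plain-Python sketch, not hana_ml code; it assumes the integer conversion behaves like truncation (modeled with `int()`), which the note above does not specify.

```python
# Illustration only: a fractional smoothed value is lost in an integer column.
bin_values = [1, 2, 4]                          # three INTEGER readings in one bin
mean_value = sum(bin_values) / len(bin_values)  # 2.333...

# Assumption: the integer conversion is modeled with Python's int(), which
# truncates; the exact rounding rule applied by PAL is not stated above.
as_integer = int(mean_value)   # 2, the fractional part is gone

# With a floating-point (DOUBLE-like) type the smoothed value is kept as-is.
as_double = float(mean_value)  # 2.333...
```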
- Parameters:
- strategy: {'uniform_number', 'uniform_size', 'quantile', 'sd'}
Specifies the binning method. Valid options include:
'uniform_number': Equal widths based on the number of bins.
'uniform_size': Equal widths based on the bin width.
'quantile': Equal number of records per bin.
'sd': Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
- smoothing: {'means', 'medians', 'boundaries'}
Specifies the smoothing method. Valid options include:
'means': Each value in a bin is replaced by the average of all the values belonging to the same bin.
'medians': Each value in a bin is replaced by the median of all the values belonging to the same bin.
'boundaries': The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When a value is equidistant from both boundaries, it is replaced by the front boundary value.
Values used for smoothing are not re-calculated during transform.
- n_bins: int, optional
The number of bins.
Only valid when strategy is 'uniform_number' or 'quantile'.
Defaults to 2.
- bin_size: int, optional
The interval width of each bin.
Only valid when strategy is 'uniform_size'.
Defaults to 10.
- n_sd: int, optional
The leftmost bin contains all values more than n_sd standard deviations below the mean, and the rightmost bin contains all values more than n_sd standard deviations above the mean.
Only valid when strategy is 'sd'.
Defaults to 1.
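To make the interaction between binning and smoothing concrete, here is a minimal plain-Python sketch of the 'uniform_number' strategy combined with the three smoothing methods. It is not the PAL implementation. The helper names `uniform_number_bins` and `smooth` are hypothetical, and two conventions are assumptions: the maximum value is placed in the last bin, and the "front" boundary is taken to be the bin minimum.

```python
from statistics import mean, median


def uniform_number_bins(values, n_bins=2):
    """Assign each value to one of n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # Assumption: the maximum value is placed in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def smooth(values, bin_index, smoothing="means"):
    """Replace each value by a statistic of the bin it belongs to."""
    groups = {}
    for v, b in zip(values, bin_index):
        groups.setdefault(b, []).append(v)

    smoothed = []
    for v, b in zip(values, bin_index):
        members = groups[b]
        if smoothing == "means":
            smoothed.append(mean(members))
        elif smoothing == "medians":
            smoothed.append(median(members))
        elif smoothing == "boundaries":
            lo, hi = min(members), max(members)
            # Assumption: on a tie, the "front" boundary is the bin minimum.
            smoothed.append(lo if (v - lo) <= (hi - v) else hi)
        else:
            raise ValueError(f"unknown smoothing: {smoothing!r}")
    return smoothed


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
bin_index = uniform_number_bins(values, n_bins=2)  # [0, 0, 0, 1, 1, 1]

by_means = smooth(values, bin_index, "means")            # [2.0, 2.0, 2.0, 11.0, 11.0, 11.0]
by_boundaries = smooth(values, bin_index, "boundaries")  # [1.0, 1.0, 3.0, 10.0, 10.0, 12.0]
```

In the 'boundaries' result, 2.0 and 11.0 are equidistant from both boundaries of their bins and therefore map to the lower one under the stated assumption.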
Examples
>>> binning = KBinsDiscretizer(strategy='uniform_size', smoothing='means', bin_size=10)
>>> binning.fit(data=df_train, key='ID')
>>> res = binning.transform(data=df_transform, key='ID')
>>> res.collect()
- Attributes:
- result_: DataFrame
Binned dataset produced by the fit and fit_transform methods.
- model_: DataFrame
Model content.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features]): Bin input data into a number of intervals and smooth.
fit_transform(data[, key, features]): Fit with the dataset and return the results.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
set_model_state(state): Set the model state by state information.
transform(data[, key, features]): Bin data based on the previous binning model.
- fit(data, key=None, features=None)
Bin input data into a number of intervals and smooth.
- Parameters:
- data: DataFrame
DataFrame to be discretized.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns:
- A fitted object of class "KBinsDiscretizer".
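The defaulting rule for features, which applies equally to fit, fit_transform and transform, can be sketched as follows. `resolve_feature` is a hypothetical helper written for illustration; it is not part of hana_ml, and the error messages are invented.

```python
def resolve_feature(columns, key, features=None):
    """Return the single feature column implied by the documented rules."""
    if features is not None:
        # Only one feature is supported, so the list must have one element.
        if len(features) != 1:
            raise ValueError("features must contain exactly one column name")
        return features[0]

    # features omitted: data must have exactly one non-ID column.
    non_id = [c for c in columns if c != key]
    if len(non_id) != 1:
        raise ValueError("data must have exactly 1 non-ID column")
    return non_id[0]


default_feature = resolve_feature(["ID", "PRICE"], key="ID")                     # 'PRICE'
explicit_feature = resolve_feature(["ID", "A", "B"], key="ID", features=["B"])   # 'B'
```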
- fit_transform(data, key=None, features=None)
Fit with the dataset and return the results.
- Parameters:
- data: DataFrame
DataFrame to be binned.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns:
- DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data's ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data's feature column.
- transform(data, key=None, features=None)
Bin data based on the previous binning model.
- Parameters:
- data: DataFrame
DataFrame to be binned.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
Since the underlying PAL_BINNING_ASSIGNMENT function only supports one feature, this list can only contain one element.
If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns:
- DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data's ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data's feature column.
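The statement that values used for smoothing are not re-calculated during transform can be pictured with a toy model. The class below is a plain-Python stand-in, not the hana_ml class: `ToyBinner` is a hypothetical name, bins are assumed to start at 0 and span `bin_size` units each, and only 'means' smoothing is modeled.

```python
from statistics import mean


class ToyBinner:
    """Toy stand-in: fixed-width bins with 'means' smoothing."""

    def __init__(self, bin_size=10):
        self.bin_size = bin_size
        self.bin_means_ = {}

    def _index(self, value):
        # Assumption: bin k covers [k * bin_size, (k + 1) * bin_size).
        return int(value // self.bin_size)

    def fit(self, values):
        groups = {}
        for v in values:
            groups.setdefault(self._index(v), []).append(v)
        # The per-bin means are computed once, from the training data.
        self.bin_means_ = {b: mean(members) for b, members in groups.items()}
        return self

    def transform(self, values):
        # Reuses the stored means; nothing is recomputed from `values`.
        # Toy choice: a bin never seen during fit yields None.
        return [(self._index(v), self.bin_means_.get(self._index(v))) for v in values]


toy = ToyBinner(bin_size=10).fit([1.0, 5.0, 9.0, 12.0, 18.0])
result = toy.transform([2.0, 19.0])  # [(0, 5.0), (1, 15.0)]
```

The new values 2.0 and 19.0 receive the training means 5.0 and 15.0 of their bins, not statistics of the data passed to `transform`.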
- create_model_state(model=None, function=None, pal_funcname='PAL_BINNING_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function: str, optional
Specifies the function name of the classification algorithm.
Valid only for UnifiedClassification and UnifiedRegression.
Defaults to self.real_func.
- pal_funcname: int or str, optional
PAL function name.
Defaults to 'PAL_BINNING_ASSIGNMENT'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
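As a minimal sketch of the dict form, the snippet below checks that a state dictionary carries the four required keys before it is passed on. `missing_state_keys` is a hypothetical helper, and the values in `example_state` are placeholders, not real state information.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict, sorted."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values only; real values come from an existing model state.
example_state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

complete = missing_state_keys(example_state)          # []
incomplete = missing_state_keys({"STATE_ID": "<id>"})  # ['HINT', 'HOST', 'PORT']
```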
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the KBinsDiscretizer class also inherits methods from the PALBase class. Please refer to PAL Base for more details.