KBinsDiscretizer

class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)

Bin continuous data into a number of intervals and perform local smoothing.

Note

The output value has the same data type as the input value. If the original column is INTEGER, the smoothed values are therefore converted to integers, which may not be the result you expect.

To avoid this, cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
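The cast can be applied to the input DataFrame before fitting. The following is a minimal sketch, assuming the input is a hana_ml DataFrame that exposes `cast(cols, new_type)`; the helper name `to_double` is hypothetical and not part of the library.

```python
def to_double(df, features):
    """Return a DataFrame whose feature columns are cast to DOUBLE.

    Assumption: df is a hana_ml DataFrame whose cast(cols, new_type)
    method returns a new DataFrame.
    """
    # Accept a single column name as well as a list of names.
    cols = [features] if isinstance(features, str) else list(features)
    return df.cast(cols, 'DOUBLE')
```

For example, `binning.fit(data=to_double(df1, 'DATA'), key='ID')` would fit on the DOUBLE version of the column.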

Parameters:

strategy: {'uniform_number', 'uniform_size', 'quantile', 'sd'}

Specifies the binning method. Valid options include:

'uniform_number': Equal widths based on the number of bins.

'uniform_size': Equal widths based on the bin size.

'quantile': Equal number of records per bin.

'sd': Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
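The 'sd' description can be read as placing bin edges at whole multiples of the standard deviation around the mean, skipping the mean itself so that the center bin spans two standard deviations. The sketch below follows that reading only; it is not the PAL implementation. The function name `sd_bin_edges` is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def sd_bin_edges(values, n_sd=1):
    """Return the interior bin edges for the 'sd' strategy as described above.

    The outermost bins are open-ended, so n_sd yields 2 * n_sd edges
    and 2 * n_sd + 1 bins.
    """
    if n_sd < 1:
        raise ValueError("n_sd must be at least 1")
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    # Offsets -n_sd..-1 and 1..n_sd; 0 is skipped so the center bin
    # covers everything within one standard deviation of the mean.
    offsets = [k for k in range(-n_sd, n_sd + 1) if k != 0]
    return [mu + k * sigma for k in offsets]
```

With `n_sd=1` this gives two edges and therefore three bins: below, within, and above one standard deviation of the mean.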

smoothing: {'means', 'medians', 'boundaries'}

Specifies the smoothing method. Valid options include:

'means': Each value within a bin is replaced by the average of all the values belonging to the same bin.

'medians': Each value in a bin is replaced by the median of all the values belonging to the same bin.

'boundaries': The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When a value is equidistant from both boundaries, it is replaced by the front boundary value.

Values used for smoothing are not re-calculated during transform.
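The three smoothing methods can be illustrated on the values of a single bin. This is a plain-Python sketch of the descriptions above, not the PAL implementation; the function name `smooth_bin` is hypothetical, and treating the "front" boundary as the minimum of the bin is an assumption.

```python
from statistics import mean, median

def smooth_bin(values, smoothing):
    """Return the smoothed values for one bin."""
    if smoothing == 'means':
        m = mean(values)
        return [m] * len(values)
    if smoothing == 'medians':
        m = median(values)
        return [m] * len(values)
    if smoothing == 'boundaries':
        lo, hi = min(values), max(values)
        # Ties go to the lower boundary (assumed meaning of "front").
        return [lo if v - lo <= hi - v else hi for v in values]
    raise ValueError(f"unknown smoothing: {smoothing!r}")
```

For the bin containing 12.0, 13.0 and 15.0 in the example further down, 'means' replaces every value by their average, which matches the 13.333333 shown in the fit output.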

n_bins: int, optional

The number of bins.

Only valid when strategy is 'uniform_number' or 'quantile'.

Defaults to 2.

bin_size: int, optional

The interval width of each bin.

Only valid when strategy is 'uniform_size'.

Defaults to 10.

n_sd: int, optional

The leftmost bin contains all values more than n_sd standard deviations below the mean, and the rightmost bin contains all values more than n_sd standard deviations above the mean.

Only valid when strategy is 'sd'.

Defaults to 1.
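Each of n_bins, bin_size and n_sd is only valid for particular strategies. The sketch below encodes that mapping in a hypothetical helper, `binning_kwargs`, which builds constructor keyword arguments and rejects mismatched parameters. It is stricter than the documentation requires; how the library itself treats a mismatched parameter is not stated here.

```python
# Which size parameter each strategy accepts, per the parameter list above.
STRATEGY_PARAM = {
    'uniform_number': 'n_bins',
    'quantile': 'n_bins',
    'uniform_size': 'bin_size',
    'sd': 'n_sd',
}

def binning_kwargs(strategy, smoothing, **size_params):
    """Build keyword arguments for KBinsDiscretizer, rejecting size
    parameters that are not valid for the chosen strategy."""
    if strategy not in STRATEGY_PARAM:
        raise ValueError(f"unknown strategy: {strategy!r}")
    allowed = STRATEGY_PARAM[strategy]
    extra = sorted(set(size_params) - {allowed})
    if extra:
        raise ValueError(f"{extra} not valid when strategy is {strategy!r}")
    return {'strategy': strategy, 'smoothing': smoothing, **size_params}
```

The result can be unpacked into the constructor, for example `KBinsDiscretizer(**binning_kwargs('uniform_size', 'means', bin_size=10))`.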

Examples

Input DataFrame df1:

>>> df1.collect()
 ID  DATA
  0   6.0
  1  12.0
  2  13.0
  3  15.0
  4  10.0
  5  23.0
  6  24.0
  7  30.0
  8  32.0
  9  25.0
 10  38.0

Creating a KBinsDiscretizer instance:

>>> binning = KBinsDiscretizer(strategy='uniform_size', smoothing='means', bin_size=10)

Performing fit on the given DataFrame:

>>> binning.fit(data=df1, key='ID')

Output:

>>> binning.result_.collect()
 ID  BIN_INDEX       DATA
  0          1   8.000000
  1          2  13.333333
  2          2  13.333333
  3          2  13.333333
  4          1   8.000000
  5          3  25.500000
  6          3  25.500000
  7          3  25.500000
  8          4  35.000000
  9          3  25.500000
 10          4  35.000000

Input DataFrame df2 for transforming:

>>> df2.collect()
 ID  DATA
  0   6.0
  1  67.0
  2   4.0
  3  12.0
  4  -2.0
  5  40.0

Performing transform on the given DataFrame:

>>> result = binning.transform(data=df2, key='ID')

Output:

>>> result.collect()
 ID  BIN_INDEX       DATA
  0          1   8.000000
  1         -1  67.000000
  2          1   8.000000
  3          2  13.333333
  4          1   8.000000
  5          4  35.000000
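In the transform output, the row with ID 1 (value 67.0) has BIN_INDEX -1 and keeps its original value, which suggests it was not assigned to any bin of the fitted model. The sketch below separates such rows after the result has been collected. It assumes the rows are available as plain dictionaries (for example via `result.collect().to_dict('records')` on the collected pandas DataFrame); the helper name `split_unassigned` is hypothetical.

```python
def split_unassigned(rows):
    """Split result rows into those with a bin and those with BIN_INDEX -1."""
    assigned = [r for r in rows if r['BIN_INDEX'] != -1]
    unassigned = [r for r in rows if r['BIN_INDEX'] == -1]
    return assigned, unassigned

# Rows copied from the transform output above.
rows = [
    {'ID': 0, 'BIN_INDEX': 1, 'DATA': 8.0},
    {'ID': 1, 'BIN_INDEX': -1, 'DATA': 67.0},
    {'ID': 2, 'BIN_INDEX': 1, 'DATA': 8.0},
    {'ID': 3, 'BIN_INDEX': 2, 'DATA': 13.333333},
    {'ID': 4, 'BIN_INDEX': 1, 'DATA': 8.0},
    {'ID': 5, 'BIN_INDEX': 4, 'DATA': 35.0},
]
assigned, unassigned = split_unassigned(rows)
print([r['ID'] for r in unassigned])  # prints [1]
```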

Attributes:

result_ (DataFrame): Binned dataset from fit and fit_transform methods.
model_ (DataFrame): Binning model content.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features])	Bin input data into a number of intervals and smooth it.
`fit_transform`(data[, key, features])	Fit with the dataset and return the results.
`set_model_state`(state)	Set the model state by state information.
`transform`(data[, key, features])	Bin data based on the previous binning model.

fit(data, key=None, features=None)

Bin input data into a number of intervals and smooth it.

Parameters:

data: DataFrame

DataFrame to be discretized.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.
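The rules for features can be summarized as a small resolution routine. This is an illustrative sketch of the documented behavior, not the library's code; the function name `resolve_feature` and its `columns` argument (the list of all column names in data) are hypothetical.

```python
def resolve_feature(columns, key, features=None):
    """Return the name of the single feature column to be binned."""
    if features is None:
        # Default: the only column that is not the ID column.
        candidates = [c for c in columns if c != key]
        if len(candidates) != 1:
            raise ValueError(
                "data must have exactly 1 non-ID column when features is not provided")
        return candidates[0]
    if len(features) != 1:
        raise ValueError("features can only contain one element")
    return features[0]
```

For the example DataFrame df1 with columns ID and DATA and `key='ID'`, the resolved feature is DATA.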

Returns:

Fitted object.

fit_transform(data, key=None, features=None)

Fit with the dataset and return the results.

Parameters:

data: DataFrame

DataFrame to be binned.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns:

DataFrame

Binned result, structured as follows:

DATA_ID column: with same name and type as data's ID column.

BIN_INDEX: type INTEGER, assigned bin index.

BINNING_DATA column: smoothed value, with same name and type as data's feature column.

transform(data, key=None, features=None)

Bin data based on the previous binning model.

Parameters:

data: DataFrame

DataFrame to be binned.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

Since the underlying PAL_BINNING_ASSIGNMENT procedure only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns:

DataFrame

Binned result, structured as follows:

DATA_ID column: with same name and type as data's ID column.

BIN_INDEX: type INTEGER, assigned bin index.

BINNING_DATA column: smoothed value, with same name and type as data's feature column.

create_model_state(model=None, function=None, pal_funcname='PAL_BINNING_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:

model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function name of the classification algorithm.

Valid only for UnifiedClassification and UnifiedRegression.

Defaults to self.real_func.

pal_funcname: int or str, optional

PAL function name.

Defaults to 'PAL_BINNING_ASSIGNMENT'.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
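A dict passed as state can be checked for the required keys before calling set_model_state. The sketch below is illustrative only; the helper name `missing_state_keys` is hypothetical and the values are placeholders, since real values come from an existing model state.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')

def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]

# Placeholder values only, not a usable state.
state = {
    'STATE_ID': '<state id>',
    'HINT': '<hint>',
    'HOST': '<host>',
    'PORT': '<port>',
}
```

An empty list from `missing_state_keys(state)` means all four required keys are present.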

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state: DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods listed above, the KBinsDiscretizer class also inherits methods from the PALBase class. Please refer to PAL Base for more details.