Discretize

class hana_ml.algorithms.pal.preprocessing.Discretize(strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)

An enhanced binning function that can be applied to a table with multiple columns. It partitions the table rows into segments called bins, then applies a smoothing method within each bin, column by column.

Parameters:

strategy : {'uniform_number', 'uniform_size', 'quantile', 'sd'}

Binning methods:

'uniform_number': equal-width bins, with the width derived from the number of bins.

'uniform_size': equal-width bins, with the width given by the bin size.

'quantile': an equal number of records per bin.

'sd': bin boundaries derived from the mean and standard deviation.

n_bins : int, optional

The number of bins.

Required and only valid when strategy is 'uniform_number' or 'quantile'.

Defaults to 2.

bin_size : float, optional

Specifies the width of each bin.

Required and only valid when strategy is 'uniform_size'.

Defaults to 10.

n_sd : int, optional

Specifies the number of standard deviations on each side of the mean.

For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning.

Required and only valid when strategy is 'sd'.

smoothing : {'no', 'bin_means', 'bin_medians', 'bin_boundaries'}, optional

Specifies the default smoothing method for all non-categorical columns.

Defaults to 'bin_means'.

save_model : bool, optional

Indicates whether the model is saved.

Defaults to True.
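The binning strategies above can be pictured with a few lines of plain Python that run without a HANA connection. This is only a sketch of the idea, not PAL's implementation: the function names are hypothetical, and details such as where the first bin starts for 'uniform_size', how boundary values are assigned, and whether the standard deviation is the population or the sample variant are assumptions.

```python
import statistics


def uniform_number_bins(values, n_bins):
    """'uniform_number': split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [1] * len(values)
    # The maximum value is clamped into the last bin; bins are numbered from 1.
    return [min(int((v - lo) / width), n_bins - 1) + 1 for v in values]


def uniform_size_bins(values, bin_size):
    """'uniform_size': bins of fixed width, counted from the minimum (assumption)."""
    lo = min(values)
    return [int((v - lo) // bin_size) + 1 for v in values]


def quantile_bins(values, n_bins):
    """'quantile': roughly the same number of records in every bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins


def sd_bounds(values, n_sd):
    """'sd': lower/upper bound at mean -/+ n_sd standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD: an assumption
    return mean - n_sd * sd, mean + n_sd * sd


values = [10.0, 10.4, 40.0, 40.4, 90.0, 90.4]
print(uniform_number_bins(values, n_bins=3))
print(uniform_size_bins(values, bin_size=10.0))
print(quantile_bins(values, n_bins=3))
print(sd_bounds(values, n_sd=2))
```

Note how 'uniform_size' leaves empty bins between the clusters, while 'uniform_number' and 'quantile' always produce exactly n_bins bin indices for this sample.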

Examples

Original data:

>>> df.collect()
    ID  ATT1   ATT2  ATT3 ATT4
     1  10.0  100.0   1.0    A
     2  10.1  101.0   1.0    A
     3  10.2  100.0   1.0    A
     4  10.4  103.0   1.0    A
     5  10.3  100.0   1.0    A
     6  40.0  400.0   4.0    C
     7  40.1  402.0   4.0    B
     8  40.2  400.0   4.0    B
     9  40.4  402.0   4.0    B
    10  40.3  400.0   4.0    A
    11  90.0  900.0   2.0    C
    12  90.1  903.0   1.0    B
    13  90.2  901.0   2.0    B
    14  90.4  900.0   1.0    B
    15  90.3  900.0   1.0    B

Construct a Discretize instance:

>>> bin = Discretize(strategy='uniform_number',
                     n_bins=3, smoothing='bin_medians')

Train the model on the training data:

>>> bin.fit(train_data, binning_variable='ATT1', col_smoothing=[('ATT2', 'bin_means')],
            categorical_variable='ATT3', key=None, features=None)

>>> bin.assign_.collect()
    ID  BIN_INDEX
     1          1
     2          1
     3          1
     4          1
     5          1
     6          2
     7          2
     8          2
     9          2
    10          2
    11          3
    12          3
    13          3
    14          3
    15          3
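Once rows have a bin index, the smoothing options can be pictured as a per-bin replacement of each value. The sketch below is plain Python under stated assumptions, not PAL's implementation: the helper name `smooth` is hypothetical, 'bin_boundaries' is taken to mean snapping each value to the nearer of the bin's minimum and maximum, and ties go to the lower boundary. It uses the ATT2 values of the first ten rows of the original data together with the bin indices shown above.

```python
import statistics
from collections import defaultdict


def smooth(values, bin_index, method):
    """Return a smoothed copy of `values`, processed bin by bin."""
    groups = defaultdict(list)
    for v, b in zip(values, bin_index):
        groups[b].append(v)

    out = []
    for v, b in zip(values, bin_index):
        members = groups[b]
        if method == "no":
            out.append(v)
        elif method == "bin_means":
            out.append(statistics.fmean(members))
        elif method == "bin_medians":
            out.append(statistics.median(members))
        elif method == "bin_boundaries":
            lo, hi = min(members), max(members)
            # Snap to the nearer boundary; ties go to the lower one (assumption).
            out.append(lo if v - lo <= hi - v else hi)
        else:
            raise ValueError(f"unknown smoothing method: {method!r}")
    return out


att2 = [100.0, 101.0, 100.0, 103.0, 100.0, 400.0, 402.0, 400.0, 402.0, 400.0]
bins = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

for method in ("no", "bin_means", "bin_medians", "bin_boundaries"):
    print(method, smooth(att2, bins, method))
```

The bins themselves come from the binning variable (ATT1 in the example); the other numeric columns are only smoothed inside those bins.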

Apply the model to new data:

>>> # keep the bin assignment, the second of the three results listed under Returns
>>> _, res, _ = bin.predict(predict_data)

>>> res.collect()
    ID  BIN_INDEX
     1          1
     2          1
     3          1
     4          1
     5          3
     6          3
     7          2

Attributes:

result_ : DataFrame

Discretization results, structured as follows:

ID: ID column, named as in the input DataFrame.
FEATURES: feature columns, smoothed within each bin.

assign_ : DataFrame

Assignment results, structured as follows:

ID: data ID, named as in the input DataFrame.
BIN_INDEX: bin index.

model_ : DataFrame

Model results, structured as follows:

ROW_INDEX: row index.
MODEL_CONTENT: model content.

stats_ : DataFrame

Statistics, structured as follows:

STAT_NAME: statistic name.
STAT_VALUE: statistic value.

Methods

`fit`(data, binning_variable[, key, features, ...])	Fit a Discretize model.
`fit_transform`(data, binning_variable[, key, ...])	Learn a discretization configuration (model) from the input data, then discretize the data under that configuration.
`predict`(data)	Discretize new data using a fitted Discretize model.
`transform`(data)	Discretize data using a fitted Discretize model.

fit(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Fit a Discretize model.

Parameters:

data : DataFrame

DataFrame that contains the training data.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it defaults to the first column of data.

features : str or list of str, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column are treated as feature columns.

binning_variable : str

Name of the attribute to which the binning operation is applied.

The variable must be of a numeric data type.

col_smoothing : list of tuples, optional

Specifies a smoothing method per column, as (column name, method) pairs; these override the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variable : str or list of str, optional

Specifies the column(s) to be treated as categorical even though their data type is int.

No default value.

Returns:

Fitted object.
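The interplay between the default smoothing method and col_smoothing can be sketched as a simple lookup: each non-categorical column gets its override if one is listed, otherwise the default. The helper below is hypothetical and written in plain Python for illustration; it is not part of hana_ml, and the validation it performs is an assumption about reasonable behaviour rather than a description of what the library checks.

```python
VALID_METHODS = {"no", "bin_means", "bin_medians", "bin_boundaries"}


def resolve_smoothing(columns, default="bin_means", col_smoothing=None,
                      categorical=()):
    """Map each non-categorical column to the smoothing method it would use."""
    overrides = dict(col_smoothing or [])
    unknown = set(overrides) - set(columns)
    if unknown:
        raise ValueError(f"col_smoothing names unknown columns: {sorted(unknown)}")

    plan = {}
    for col in columns:
        if col in categorical:
            continue  # smoothing only applies to non-categorical columns
        method = overrides.get(col, default)
        if method not in VALID_METHODS:
            raise ValueError(f"invalid smoothing method for {col}: {method!r}")
        plan[col] = method
    return plan


# Mirrors the settings of the example: default 'bin_medians', ATT2 overridden,
# ATT3 declared categorical, ATT4 categorical because it holds strings.
plan = resolve_smoothing(
    ["ATT1", "ATT2", "ATT3", "ATT4"],
    default="bin_medians",
    col_smoothing=[("ATT2", "bin_means")],
    categorical={"ATT3", "ATT4"},
)
print(plan)
```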

predict(data)

Discretize new data using a fitted Discretize model.

Parameters:

data : DataFrame

DataFrame containing the data to discretize.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

transform(data)

Discretize data using a fitted Discretize model.

Parameters:

data : DataFrame

DataFrame containing the data to discretize.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

fit_transform(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Learn a discretization configuration (model) from the input data, then discretize the data under that configuration.

Parameters:

data : DataFrame

DataFrame that contains the training data.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

features : str or list of str, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column are treated as feature columns.

binning_variable : str

Name of the attribute to which the binning operation is applied.

The variable must be of a numeric data type.

col_smoothing : list of tuples, optional

Specifies a smoothing method per column, as (column name, method) pairs; these override the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variable : str or list of str, optional

Specifies the column(s) to be treated as categorical even though their data type is int.

No default value.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

In addition to the methods described above, the Discretize class inherits methods from the PALBase class. Refer to PAL Base for details.