Discretize

class hana_ml.algorithms.pal.preprocessing.Discretize(strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)

An enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments called bins, then applies a smoothing method within each bin of each column.

Parameters:

strategy : {'uniform_number', 'uniform_size', 'quantile', 'sd'}

Binning methods:

'uniform_number': equal-width bins, with the width determined by the number of bins.

'uniform_size': equal-width bins, with the width given directly by the bin size.

'quantile': equal number of records per bin.

'sd': bin boundaries derived from the mean and standard deviation.
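
The first three strategies can be illustrated with a minimal pure-Python sketch. It is not PAL's implementation: the helper names are hypothetical, and details such as how edge values and uneven remainders are handled are assumptions made for illustration.

```python
def uniform_number_edges(values, n_bins):
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def uniform_size_edges(values, bin_size):
    """Step up from min in increments of bin_size until max is covered."""
    lo, hi = min(values), max(values)
    edges = [lo]
    while edges[-1] < hi:
        edges.append(edges[-1] + bin_size)
    return edges


def quantile_bins(values, n_bins):
    """Sort the values and cut them into n_bins groups of near-equal count.

    Assumption: when the count is not divisible by n_bins, the leftover
    records go to the first bins.
    """
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


data = [0, 2, 3, 4, 10, 11, 12, 20, 30]
print(uniform_number_edges(data, 3))  # [0.0, 10.0, 20.0, 30.0]
print(uniform_size_edges(data, 8))    # [0, 8, 16, 24, 32]
print(quantile_bins(data, 3))         # [[0, 2, 3], [4, 10, 11], [12, 20, 30]]
```

Note how 'uniform_number' fixes the bin count and derives the width, while 'uniform_size' fixes the width and lets the bin count follow from the data range.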

n_bins : int, optional

Number of bins.

Required and only valid when strategy is set to 'uniform_number' or 'quantile'.

Defaults to 2.

bin_size : float, optional

Specifies the width of each bin.

Required and only valid when strategy is set to 'uniform_size'.

Defaults to 10.

n_sd : int, optional

Specifies the number of standard deviations on each side of the mean.

For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning.

Required and only valid when strategy is set as 'sd'.
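
As a rough illustration of the n_sd description, the sketch below computes candidate boundaries around the mean. Only the outer bounds (mean +/- n_sd * standard deviation) come from the text above; the interior cut points at each whole multiple, the use of the population standard deviation, and the helper name `sd_boundaries` are assumptions.

```python
import statistics


def sd_boundaries(values, n_sd):
    """Return boundaries at mean + k * sd for k = -n_sd, ..., n_sd."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; PAL may use another estimator.
    sd = statistics.pstdev(values)
    return [mean + k * sd for k in range(-n_sd, n_sd + 1)]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population sd 2
print(sd_boundaries(data, 2))     # [1.0, 3.0, 5.0, 7.0, 9.0]
```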

smoothing : {'no', 'bin_means', 'bin_medians', 'bin_boundaries'}, optional

Specifies the default smoothing method for all non-categorical columns.

Defaults to 'no'.
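
The smoothing options follow the usual textbook definitions, which the sketch below applies to the values of a single bin. It is an illustration rather than PAL's code: the function name `smooth_bin` is hypothetical, and the tie-breaking rule for 'bin_boundaries' (ties go to the lower boundary) is an assumption.

```python
import statistics


def smooth_bin(values, method):
    """Smooth the values that fall into one bin."""
    if method == "no":
        return list(values)
    if method == "bin_means":
        # Every value is replaced by the mean of the bin.
        return [statistics.mean(values)] * len(values)
    if method == "bin_medians":
        # Every value is replaced by the median of the bin.
        return [statistics.median(values)] * len(values)
    if method == "bin_boundaries":
        # Every value is replaced by the closer of the bin's min and max.
        lo, hi = min(values), max(values)
        return [lo if v - lo <= hi - v else hi for v in values]
    raise ValueError(f"unknown smoothing method: {method!r}")


one_bin = [4, 8, 9, 15]
print(smooth_bin(one_bin, "bin_means"))       # [9, 9, 9, 9]
print(smooth_bin(one_bin, "bin_medians"))     # [8.5, 8.5, 8.5, 8.5]
print(smooth_bin(one_bin, "bin_boundaries"))  # [4, 4, 4, 15]
```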

save_model : bool, optional

Indicates whether the model is saved.

Defaults to True.
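
Because each size parameter is only valid for certain strategies, it can help to check a configuration before constructing the object. The helper below is a hypothetical sketch built solely from the parameter descriptions above; it is not part of hana_ml, and the library's own validation may behave differently.

```python
# Size parameter used by each strategy, per the parameter descriptions.
STRATEGY_PARAM = {
    "uniform_number": "n_bins",
    "quantile": "n_bins",
    "uniform_size": "bin_size",
    "sd": "n_sd",
}


def ignored_params(strategy, **given):
    """Return the supplied size parameters that do not apply to strategy."""
    if strategy not in STRATEGY_PARAM:
        raise ValueError(f"unknown strategy: {strategy!r}")
    relevant = STRATEGY_PARAM[strategy]
    return sorted(
        name for name, value in given.items()
        if value is not None and name != relevant
    )


print(ignored_params("sd", n_bins=3, n_sd=2))  # ['n_bins']
print(ignored_params("quantile", n_bins=4))    # []
```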

Examples

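>>> from hana_ml.algorithms.pal.preprocessing import Discretize
>>> # df and predict_data are hana_ml DataFrames prepared beforehand
>>> bin = Discretize(strategy='uniform_number', n_bins=3, smoothing='bin_medians')
>>> bin.fit(data=df, binning_variable='ATT1',
...         col_smoothing=[('ATT2', 'bin_means')],
...         categorical_variable='ATT3')
>>> bin.assign_.collect()
>>> res = bin.predict(data=predict_data)
>>> res.collect()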

Attributes:

result_ (DataFrame): Discretization results.
assign_ (DataFrame): Assignment results.
model_ (DataFrame): Model content.
stats_ (DataFrame): Statistics.

Methods

`fit`(data, binning_variable[, key, features, ...])	Fitting a Discretize model.
`fit_transform`(data, binning_variable[, key, ...])	Learn a discretization configuration (model) from input data and then discretize the data under that configuration.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`predict`(data)	Discretizing new data using a generated Discretize model.
`transform`(data)	Data discretization using generated Discretize models.

fit(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Fitting a Discretize model.

Parameters:

data : DataFrame

DataFrame that contains the training data.

binning_variable : str

Name of the attribute to which the binning operation is applied.

The variable must be of a numeric data type.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it defaults to the first column of data.
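
The fallback rule for key can be written out as a small sketch. The helper `resolve_key` and its arguments are hypothetical stand-ins for a DataFrame's column list and index; the code only restates the rule described above and does not reflect how hana_ml implements it.

```python
def resolve_key(columns, index=None, key=None):
    """Pick the ID column following the documented fallback order."""
    if key is not None:
        return key  # An explicit key always wins.
    if isinstance(index, str):
        return index  # Indexed by a single column.
    if isinstance(index, (list, tuple)) and len(index) == 1:
        return index[0]  # A one-element index also counts as a single column.
    return columns[0]  # Otherwise fall back to the first column.


cols = ["ID", "ATT1", "ATT2"]
print(resolve_key(cols, key="ATT2"))            # ATT2
print(resolve_key(cols, index="ATT1"))          # ATT1
print(resolve_key(cols, index=["ID", "ATT1"]))  # ID
print(resolve_key(cols))                        # ID
```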

features : str/ListofStrings, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column are treated as feature columns.

col_smoothing : ListofTuples, optional

Specifies a smoothing method for individual columns as (column name, method) pairs, overriding the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variable : str/ListofStrings, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:

A fitted object of class "Discretize".

predict(data)

Discretizing new data using a generated Discretize model.

Parameters:

data (DataFrame): DataFrame containing the data to discretize.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

transform(data)

Data discretization using generated Discretize models.

Parameters:

data (DataFrame): DataFrame containing the data to discretize.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

fit_transform(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Learn a discretization configuration (model) from input data and then discretize the data under that configuration.

Parameters:

data : DataFrame

DataFrame that contains the training data.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

features : str/ListofStrings, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column are treated as feature columns.

binning_variable : str

Name of the attribute to which the binning operation is applied.

The variable must be of a numeric data type.

col_smoothing : ListofTuples, optional

Specifies a smoothing method for individual columns as (column name, method) pairs, overriding the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variable : str/ListofStrings, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:

DataFrame

Discretization result
Bin assignment
Statistics

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

Besides the methods described above, the Discretize class also inherits methods from the PALBase class. Refer to PAL Base for more details.