Discretize
- class hana_ml.algorithms.pal.preprocessing.Discretize(strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)
An enhanced version of the binning function that can be applied to tables with multiple columns. It partitions table rows into segments called bins, then applies a smoothing method within each bin of each column.
- Parameters:
- strategy : {'uniform_number', 'uniform_size', 'quantile', 'sd'}
Binning methods:
'uniform_number': equal widths based on the number of bins.
'uniform_size': equal widths based on the bin width.
'quantile': equal number of records per bin.
'sd': bin boundaries based on the mean and standard deviation.
- n_bins : int, optional
Number of bins.
Required and only valid when strategy is set to 'uniform_number' or 'quantile'.
Defaults to 2.
- bin_size : float, optional
Specifies the bin width.
Required and only valid when strategy is set to 'uniform_size'.
Defaults to 10.
- n_sd : int, optional
Specifies the number of standard deviations on each side of the mean.
For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning.
Required and only valid when strategy is set to 'sd'.
- smoothing : {'no', 'bin_means', 'bin_medians', 'bin_boundaries'}, optional
Specifies the default smoothing method for all non-categorical columns.
Defaults to 'no'.
- save_model : bool, optional
Indicates whether the model is saved.
Defaults to True.
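The four binning strategies listed above can be pictured with a small pure-Python sketch. This is only an illustration of the general techniques, not the PAL implementation: the helper names are hypothetical, and the treatment of the last uniform-size bin, of uneven quantile groups, and the choice of population standard deviation for 'sd' are all assumptions.

```python
import statistics


def uniform_number_edges(values, n_bins):
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def uniform_size_edges(values, bin_size):
    """Start at the minimum and step by a fixed width until the maximum is covered."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    lo, hi = min(values), max(values)
    edges = [lo]
    while edges[-1] < hi:
        edges.append(edges[-1] + bin_size)
    return edges


def quantile_groups(values, n_bins):
    """Split the sorted values into n_bins groups with near-equal record counts."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    groups, start = [], 0
    for i in range(n_bins):
        # Assumption: leftover records go to the first groups.
        stop = start + base + (1 if i < extra else 0)
        groups.append(ordered[start:stop])
        start = stop
    return groups


def sd_edges(values, n_sd):
    """Boundaries at mean + k * sd for k = -n_sd, ..., n_sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [mean + k * sd for k in range(-n_sd, n_sd + 1)]


values = [0, 2, 4, 6, 8, 10, 12]
print(uniform_number_edges(values, 3))
print(uniform_size_edges(values, 5))
print(quantile_groups(values, 3))
print(sd_edges(values, 1))
```

The sketch makes the trade-off visible: 'uniform_number' and 'uniform_size' fix the bin width, 'quantile' fixes the record count per bin, and 'sd' ties the boundaries to the spread of the data.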
Examples
>>> from hana_ml.algorithms.pal.preprocessing import Discretize
>>> bin = Discretize(strategy='uniform_number', n_bins=3, smoothing='bin_medians')
>>> bin.fit(data=df, binning_variable='ATT1', col_smoothing=[('ATT2', 'bin_means')], categorical_variable='ATT3')
>>> bin.assign_.collect()
>>> res = bin.predict(data=predict_data)
>>> res.collect()
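To show what the smoothing options mean conceptually, here is a minimal sketch that smooths the values of a single bin. It is an illustration of the textbook techniques, not PAL's code: `smooth_bin` is a hypothetical helper, and sending ties to the lower boundary under 'bin_boundaries' is an assumption.

```python
import statistics


def smooth_bin(values, method):
    """Return the smoothed values for the records that fall into one bin."""
    if method == "no":
        return list(values)
    if method == "bin_means":
        # Every value is replaced by the mean of its bin.
        return [statistics.mean(values)] * len(values)
    if method == "bin_medians":
        # Every value is replaced by the median of its bin.
        return [statistics.median(values)] * len(values)
    if method == "bin_boundaries":
        # Every value is replaced by the closer of the bin's min and max.
        lo, hi = min(values), max(values)
        return [lo if v - lo <= hi - v else hi for v in values]
    raise ValueError(f"unknown smoothing method: {method!r}")


one_bin = [4, 8, 9, 15]
for method in ("no", "bin_means", "bin_medians", "bin_boundaries"):
    print(method, smooth_bin(one_bin, method))
```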
- Attributes:
- result_ : DataFrame
Discretization results.
- assign_ : DataFrame
Assignment results.
- model_ : DataFrame
Model content.
- stats_ : DataFrame
Statistics.
Methods
fit(data, binning_variable[, key, features, ...]): Fit a Discretize model.
fit_transform(data, binning_variable[, key, ...]): Learn a discretization configuration (model) from the input data, then discretize that data under the configuration.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
predict(data): Discretize new data using a fitted Discretize model.
transform(data): Discretize data using a fitted Discretize model.
- fit(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
Fitting a Discretize model.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- binning_variable : str
Name of the attribute to which the binning operation is applied.
The variable must be of a numeric data type.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it defaults to the first column of data.
- features : str or list of str, optional
Names of the feature columns to be considered in the model.
If not specified, all columns except the key column are treated as feature columns.
- col_smoothing : list of tuples, optional
Specifies a column name and its smoothing method, which overrides the default smoothing method for that column.
For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]
Only applies to non-categorical attributes.
No default value.
- categorical_variable : str or list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "Discretize".
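The defaulting rules for key and features described above can be written out as a short sketch. This is a hypothetical helper that only restates the documented rules on plain Python lists; it is not part of hana_ml, and representing the index as a string (single column), a list (several columns), or None is an assumption.

```python
def resolve_key_and_features(columns, index=None, key=None, features=None):
    """Apply the documented defaults for key and features.

    columns  -- list of column names in the data
    index    -- str for a single index column, list for several, None if unindexed
    """
    if key is None:
        if isinstance(index, str):
            # Data indexed by a single column: use that column as the key.
            key = index
        else:
            # Otherwise fall back to the first column.
            key = columns[0]
    if features is None:
        # All columns except the key column are feature columns.
        features = [c for c in columns if c != key]
    elif isinstance(features, str):
        features = [features]
    return key, list(features)


cols = ["ID", "ATT1", "ATT2", "ATT3"]
print(resolve_key_and_features(cols))
print(resolve_key_and_features(cols, features="ATT1"))
```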
- predict(data)
Discretize new data using a fitted Discretize model.
- Parameters:
- data : DataFrame
DataFrame containing the data to be discretized.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
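Conceptually, discretizing new data means looking up each new value in the bin boundaries learned during fitting. The sketch below shows one way to do that lookup with the standard library. `assign_bin` is a hypothetical helper; the left-closed intervals and the clamping of out-of-range values to the first or last bin are assumptions, not documented PAL behavior.

```python
from bisect import bisect_right


def assign_bin(value, edges):
    """Return the 1-based bin index of value for sorted bin edges.

    Bins are treated as [edges[i], edges[i + 1]); values outside the
    learned range are clamped to the first or last bin (assumption).
    """
    n_bins = len(edges) - 1
    idx = bisect_right(edges, value)  # number of edges <= value
    return min(max(idx, 1), n_bins)


learned_edges = [0, 4, 8, 12]  # e.g. three equal-width bins from training data
for v in (-5, 0, 3.9, 4, 12, 100):
    print(v, assign_bin(v, learned_edges))
```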
- transform(data)
Discretize data using a fitted Discretize model.
- Parameters:
- data : DataFrame
DataFrame containing the data to be discretized.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
- fit_transform(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
Learn a discretization configuration (model) from the input data, then discretize that data under the configuration.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or list of str, optional
Names of the feature columns to be considered in the model.
If not specified, all columns except the key column are treated as feature columns.
- binning_variable : str
Name of the attribute to which the binning operation is applied.
The variable must be of a numeric data type.
- col_smoothing : list of tuples, optional
Specifies a column name and its smoothing method, which overrides the default smoothing method for that column.
For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]
Only applies to non-categorical attributes.
No default value.
- categorical_variable : str or list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
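How the default smoothing method, col_smoothing, and categorical_variable interact can be summarized in a few lines. The helper below is hypothetical and only restates the documented rules (a per-column entry overrides the default, and categorical columns are not smoothed); it does not call hana_ml.

```python
def resolve_smoothing(columns, default="no", col_smoothing=None, categorical_variable=None):
    """Map each non-categorical column to the smoothing method that applies to it."""
    if isinstance(categorical_variable, str):
        categorical = {categorical_variable}
    else:
        categorical = set(categorical_variable or [])
    overrides = dict(col_smoothing or [])
    plan = {}
    for col in columns:
        if col in categorical:
            continue  # smoothing only applies to non-categorical columns
        plan[col] = overrides.get(col, default)
    return plan


plan = resolve_smoothing(
    ["ATT1", "ATT2", "ATT3"],
    default="bin_medians",
    col_smoothing=[("ATT2", "bin_means")],
    categorical_variable="ATT3",
)
print(plan)
```

With these inputs, ATT1 keeps the default method, ATT2 takes its override, and ATT3 is left out because it is categorical.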
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods described above, the Discretize class also inherits methods from the PALBase class. Refer to PAL Base for more details.