Discretize
- class hana_ml.algorithms.pal.preprocessing.Discretize(strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)
An enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into segments called bins, then applies a smoothing method within each bin of each column.
- Parameters:
- strategy : {'uniform_number', 'uniform_size', 'quantile', 'sd'}
Binning methods:
'uniform_number': equal widths based on the number of bins.
'uniform_size': equal widths based on the bin width.
'quantile': equal number of records per bin.
'sd': bin boundaries based on the mean and standard deviation.
- n_bins : int, optional
Number of bins.
Required and only valid when strategy is set to 'uniform_number' or 'quantile'.
Defaults to 2.
- bin_size : float, optional
Specifies the bin width.
Required and only valid when strategy is set to 'uniform_size'.
Defaults to 10.
- n_sd : int, optional
Specifies the number of standard deviations on each side of the mean.
For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning.
Required and only valid when strategy is set to 'sd'.
- smoothing : {'no', 'bin_means', 'bin_medians', 'bin_boundaries'}, optional
Specifies the default smoothing method for all non-categorical columns.
Defaults to 'bin_means'.
- save_model : bool, optional
Indicates whether the model is saved.
Defaults to True.
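The four binning strategies can be made concrete with a small sketch. The code below is a plain-Python illustration of the descriptions above, not the PAL implementation: the helper names `bin_edges` and `assign_bin` are hypothetical, and details such as how boundary values are treated, which quantile rule is used, and whether 'sd' relies on the population or sample standard deviation are assumptions.

```python
import bisect
import math
import statistics


def bin_edges(values, strategy, n_bins=2, bin_size=10.0, n_sd=1):
    """Return illustrative bin boundaries for one numeric column."""
    lo, hi = min(values), max(values)
    if strategy == "uniform_number":
        # n_bins bins of equal width spanning [lo, hi].
        width = (hi - lo) / n_bins
        return [lo + i * width for i in range(n_bins + 1)]
    if strategy == "uniform_size":
        # Bins of fixed width bin_size, starting at lo, until hi is covered.
        count = max(1, math.ceil((hi - lo) / bin_size))
        return [lo + i * bin_size for i in range(count + 1)]
    if strategy == "quantile":
        # Cut points chosen so each bin holds roughly the same number of records.
        cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
        return [lo] + cuts + [hi]
    if strategy == "sd":
        # Boundaries at mean + k * sd for k = -n_sd..n_sd
        # (population standard deviation is an assumption).
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        return [mean + k * sd for k in range(-n_sd, n_sd + 1)]
    raise ValueError(f"unknown strategy: {strategy!r}")


def assign_bin(value, edges):
    """Return a 1-based bin index; values on an interior edge go to the upper bin."""
    return bisect.bisect_right(edges[1:-1], value) + 1


# ATT1 column from the Examples section.
att1 = [10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3,
        90.0, 90.1, 90.2, 90.4, 90.3]
edges = bin_edges(att1, "uniform_number", n_bins=3)
print([assign_bin(v, edges) for v in att1])
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
```

With three equal-width bins the boundaries fall near 36.8 and 63.6, so the three clusters in ATT1 land in separate bins. Results from PAL may differ for values that sit exactly on a boundary.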
Examples
Original data:
>>> df.collect()
    ID  ATT1   ATT2  ATT3 ATT4
0    1  10.0  100.0   1.0    A
1    2  10.1  101.0   1.0    A
2    3  10.2  100.0   1.0    A
3    4  10.4  103.0   1.0    A
4    5  10.3  100.0   1.0    A
5    6  40.0  400.0   4.0    C
6    7  40.1  402.0   4.0    B
7    8  40.2  400.0   4.0    B
8    9  40.4  402.0   4.0    B
9   10  40.3  400.0   4.0    A
10  11  90.0  900.0   2.0    C
11  12  90.1  903.0   1.0    B
12  13  90.2  901.0   2.0    B
13  14  90.4  900.0   1.0    B
14  15  90.3  900.0   1.0    B
Construct a Discretize instance:
>>> bin = Discretize(strategy='uniform_number', n_bins=3, smoothing='bin_medians')
Train the model with the training data:
>>> bin.fit(train_data, binning_variable='ATT1', col_smoothing=[('ATT2', 'bin_means')], categorical_variable='ATT3', key=None, features=None)
>>> bin.assign_.collect()
    ID  BIN_INDEX
0    1          1
1    2          1
2    3          1
3    4          1
4    5          1
5    6          2
6    7          2
7    8          2
8    9          2
9   10          2
10  11          3
11  12          3
12  13          3
13  14          3
14  15          3
Apply the model to new data:
>>> bin.predict(predict_data)
>>> res.collect()
   ID  BIN_INDEX
0   1          1
1   2          1
2   3          1
3   4          1
4   5          3
5   6          3
6   7          2
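To show what the smoothing options do inside a single bin, here is a minimal plain-Python sketch. It follows the common textbook definitions (replace by the bin mean, the bin median, or the nearer bin boundary) and is not taken from the PAL implementation. The helper name `smooth_bin` is hypothetical, and using the smallest and largest observed value as the bin boundaries, with ties going to the lower one, is an assumption.

```python
import statistics


def smooth_bin(values, method):
    """Smooth the values that fell into one bin (illustration only)."""
    if method == "no":
        # Leave the values unchanged.
        return list(values)
    if method == "bin_means":
        mean = statistics.mean(values)
        return [mean for _ in values]
    if method == "bin_medians":
        median = statistics.median(values)
        return [median for _ in values]
    if method == "bin_boundaries":
        # Assumption: the boundaries are the observed min and max of the bin;
        # each value moves to the nearer one, ties go to the lower boundary.
        lo, hi = min(values), max(values)
        return [lo if v - lo <= hi - v else hi for v in values]
    raise ValueError(f"unknown smoothing method: {method!r}")


# ATT2 values of the rows assigned to bin 1 in the example above.
att2_bin1 = [100.0, 101.0, 100.0, 103.0, 100.0]
print(smooth_bin(att2_bin1, "bin_medians"))
# [100.0, 100.0, 100.0, 100.0, 100.0]
print(smooth_bin(att2_bin1, "bin_boundaries"))
# [100.0, 100.0, 100.0, 103.0, 100.0]
```

With 'bin_means' the same five values would each become 100.8, the mean of the bin.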
- Attributes:
- result_ : DataFrame
Discretization results, structured as follows:
ID: name as shown in the input DataFrame.
FEATURES: data smoothed within each bin.
- assign_ : DataFrame
Assignment results, structured as follows:
ID: data ID, name as shown in the input DataFrame.
BIN_INDEX: bin index.
- model_ : DataFrame
Model results, structured as follows:
ROW_INDEX: row index.
MODEL_CONTENT: model contents.
- stats_ : DataFrame
Statistics results, structured as follows:
STAT_NAME: statistic name.
STAT_VALUE: statistic value.
Methods
fit(data, binning_variable[, key, features, ...]): Fitting a Discretize model.
fit_transform(data, binning_variable[, key, ...]): Learn a discretization configuration (model) from input data and then discretize it under that configuration.
predict(data): Discretizing new data using a generated Discretize model.
transform(data): Data discretization using generated Discretize models.
- fit(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
Fitting a Discretize model.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it defaults to the first column of data.
- features : str or list of str, optional
Names of the feature columns to be considered in the model.
If not specified, all columns except the key column are treated as feature columns.
- binning_variable : str
Name of the attribute to which the binning operation is applied.
The attribute must be of a numeric type.
- col_smoothing : list of tuples, optional
Specifies a smoothing method for individual columns as (column name, method) tuples, overriding the default smoothing method.
For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]
Only applies to non-categorical attributes.
No default value.
- categorical_variable : str or list of str, optional
Names of columns that should be treated as categorical even though their data type is int.
No default value.
- Returns:
- Fitted object.
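How the default smoothing, col_smoothing and categorical_variable interact can be sketched as a small lookup. This is an illustration of the parameter descriptions above, not code from hana_ml: the helper name `effective_smoothing` is hypothetical, and representing "no smoothing applied" for categorical columns with None is an assumption.

```python
def effective_smoothing(columns, default="bin_means",
                        col_smoothing=None, categorical_variable=None):
    """Map each column to the smoothing method that would apply to it."""
    # Accept a single column name or a list of names.
    if isinstance(categorical_variable, str):
        categorical_variable = [categorical_variable]
    categorical = set(categorical_variable or [])
    # Per-column overrides given as (column name, method) tuples.
    overrides = dict(col_smoothing or [])
    plan = {}
    for col in columns:
        if col in categorical:
            # Smoothing only applies to non-categorical columns.
            plan[col] = None
        else:
            # An override wins over the default smoothing method.
            plan[col] = overrides.get(col, default)
    return plan


# Settings taken from the example: default 'bin_medians',
# 'bin_means' for ATT2, and ATT3 declared categorical.
plan = effective_smoothing(
    ["ATT1", "ATT2", "ATT3"],
    default="bin_medians",
    col_smoothing=[("ATT2", "bin_means")],
    categorical_variable="ATT3",
)
print(plan)
# {'ATT1': 'bin_medians', 'ATT2': 'bin_means', 'ATT3': None}
```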
- predict(data)
Discretizes new data using a generated Discretize model.
- Parameters:
- data : DataFrame
DataFrame containing the data to be discretized.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
- transform(data)
Discretizes data using a generated Discretize model.
- Parameters:
- data : DataFrame
DataFrame containing the data to be discretized.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
- fit_transform(data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
Learns a discretization configuration (model) from the input data and then discretizes that data under the learned configuration.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or list of str, optional
Names of the feature columns to be considered in the model.
If not specified, all columns except the key column are treated as feature columns.
- binning_variable : str
Name of the attribute to which the binning operation is applied.
The attribute must be of a numeric type.
- col_smoothing : list of tuples, optional
Specifies a smoothing method for individual columns as (column name, method) tuples, overriding the default smoothing method.
For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]
Only applies to non-categorical attributes.
No default value.
- categorical_variable : str or list of str, optional
Names of columns that should be treated as categorical even though their data type is int.
No default value.
- Returns:
- DataFrame
Discretization result
Bin assignment
Statistics
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
Besides the methods described above, the Discretize class also inherits methods from the PALBase class. Refer to PAL Base for more details.