FeatureNormalizer
- class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(method=None, z_score_method=None, new_max=None, new_min=None, thread_ratio=None, division_by_zero_handler=None)
Normalize a DataFrame. In real-world scenarios, collected continuous attributes are usually distributed over different ranges. It is common practice to scale the data well so that data mining algorithms such as neural networks, nearest neighbor classification and clustering give more reliable results.
Note
The data type of the output value is the same as that of the input value. If the original data is of type INTEGER, the output value is therefore converted to an integer, which is usually not the result you expect.
For example, if we use the min-max method to normalize the list [1, 2, 3, 4] with new_min = 0 and new_max = 1.0, we expect the result [0, 0.33, 0.67, 1], but the actual output is [0, 0, 0, 1] because the output data type must match the input data type.
Therefore, cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
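The effect described in this note can be reproduced without a database connection. The following is a minimal pure-Python sketch, not the PAL implementation: the helper `min_max` is hypothetical, and truncation with `int()` is only assumed to mimic how a fractional value is stored in an INTEGER column.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max scale a list of numbers into [new_min, new_max] (illustrative helper)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


raw = [1, 2, 3, 4]

# Result when the column can hold fractional values (e.g. DOUBLE).
scaled = min_max(raw)

# Assumption: an INTEGER output column drops the fractional part.
as_integer = [int(v) for v in scaled]

print([round(v, 2) for v in scaled])
print(as_integer)

# With hana_ml, the cast would typically be applied to the DataFrame before
# fitting, for example df.cast(['X1'], 'DOUBLE'), where 'X1' is a
# hypothetical feature column name.
```

Only the first and last values survive the integer conversion, which is why the cast to DOUBLE has to happen before normalization rather than after it.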
- Parameters:
- method : {'min-max', 'z-score', 'decimal'}
Scaling methods:
'min-max': Min-max normalization.
'z-score': Z-Score normalization.
'decimal': Decimal scaling normalization.
- z_score_method : {'mean-standard', 'mean-mean', 'median-median'} or dict, optional
If z_score_method is not a dict, it is only valid when method is 'z-score'.
'mean-standard': Mean-Standard deviation
'mean-mean': Mean-Mean deviation
'median-median': Median-Median absolute deviation
If z_score_method is a dict, it specifies the columns for different methods.
- new_max : float, optional
The new maximum value for min-max normalization.
Only valid when method is 'min-max'.
- new_min : float, optional
The new minimum value for min-max normalization.
Only valid when method is 'min-max'.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- division_by_zero_handler : bool or str, optional
Specifies the system behavior when a division by zero is encountered.
False or 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.
True or 'abort': Throws an error when encountering a division by zero.
Defaults to True.
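The two division_by_zero_handler settings can be pictured with a small pure-Python sketch. It rests on an assumption: that a constant column under min-max scaling (maximum equal to minimum) is a case that triggers the division by zero. The function `scale_min_max` is hypothetical and is not part of hana_ml.

```python
def scale_min_max(values, new_min=0.0, new_max=1.0,
                  division_by_zero_handler="abort"):
    """Illustrative min-max scaling with an 'ignore'/'abort' switch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column makes the denominator (hi - lo) zero.
        if division_by_zero_handler in (False, "ignore"):
            return list(values)  # column is returned unscaled
        raise ZeroDivisionError("constant column cannot be min-max scaled")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


constant = [5.0, 5.0, 5.0]

# 'ignore': the column passes through unchanged.
ignored = scale_min_max(constant, division_by_zero_handler="ignore")

# 'abort': an error is raised instead.
try:
    scale_min_max(constant, division_by_zero_handler="abort")
    aborted = False
except ZeroDivisionError:
    aborted = True

print(ignored, aborted)
```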
Examples
>>> from hana_ml.algorithms.pal.preprocessing import FeatureNormalizer
>>> fn = FeatureNormalizer(method="min-max", new_max=1.0, new_min=0.0)
>>> fn.fit(data=df_train, key='ID')
>>> res = fn.transform(data=df_transform, key='ID')
>>> res.collect()
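For readers who want to see what the three values of method and the three values of z_score_method compute, the following sketch uses the common textbook formulas in plain Python. It is not taken from the PAL source. In particular, the use of the sample standard deviation for 'mean-standard' and the restriction of decimal scaling to a non-negative power of ten are assumptions, and all function names are hypothetical.

```python
import statistics


def min_max(values, new_min, new_max):
    """Map [min, max] of the data linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values, z_score_method="mean-standard"):
    """Center and scale the data using one of three center/spread pairs."""
    if z_score_method == "mean-standard":
        center = statistics.mean(values)
        spread = statistics.stdev(values)  # assumption: sample std deviation
    elif z_score_method == "mean-mean":
        center = statistics.mean(values)
        spread = statistics.mean(abs(v - center) for v in values)
    elif z_score_method == "median-median":
        center = statistics.median(values)
        spread = statistics.median(abs(v - center) for v in values)
    else:
        raise ValueError(f"unknown z_score_method: {z_score_method!r}")
    return [(v - center) / spread for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [2.0, 4.0, 6.0, 8.0]
print(min_max(data, 0.0, 1.0))
print(z_score(data, "mean-mean"))
print(z_score(data, "median-median"))
print(decimal_scaling(data))
```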
- Attributes:
- result_ : DataFrame
Scaled dataset from fit and fit_transform methods.
- model_ : DataFrame
Model content.
Methods
fit(data[, key, features]): Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
fit_transform(data[, key, features]): Fit with the dataset and return the results.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
transform(data[, key, features, ...]): Scales data based on the previous scaling model.
- fit(data, key=None, features=None)
Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
- Parameters:
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns:
- A fitted object of class "FeatureNormalizer".
- fit_transform(data, key=None, features=None)
Fit with the dataset and return the results.
- Parameters:
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns:
- DataFrame
Normalized result, with the same structure as data.
- transform(data, key=None, features=None, thread_ratio=None, division_by_zero_handler=None)
Scales data based on the previous scaling model.
- Parameters:
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- division_by_zero_handler : str, optional
Specifies the system behavior when a division by zero is encountered while scaling data using a fitted model.
'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.
'abort': Throws an error when encountering a division by zero.
Defaults to 'abort'.
- Returns:
- DataFrame
Normalized result, with the same structure as data.
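The split between fit and transform can be illustrated with a toy stand-in: fit records the scaling parameters of the training data, and transform reuses them on new data. The class `MinMaxSketch` below is hypothetical and only mirrors the idea for min-max scaling on a single list of values. It is not how hana_ml stores its model.

```python
class MinMaxSketch:
    """Toy fit/transform pair for min-max scaling of one column."""

    def __init__(self, new_min=0.0, new_max=1.0):
        self.new_min = new_min
        self.new_max = new_max

    def fit(self, values):
        # Remember the range of the training data as the "model".
        self.lo_ = min(values)
        self.hi_ = max(values)
        return self

    def transform(self, values):
        # Apply the stored range; new data is not re-inspected.
        span = self.hi_ - self.lo_
        width = self.new_max - self.new_min
        return [(v - self.lo_) / span * width + self.new_min for v in values]


scaler = MinMaxSketch().fit([10.0, 20.0, 30.0])

# 40.0 lies outside the fitted range, so its scaled value exceeds new_max.
out_of_range = scaler.transform([15.0, 40.0])
print(out_of_range)
```

Because the parameters come from the fitted data only, values in the transformed data can fall outside [new_min, new_max] when they lie outside the range seen during fitting.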
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the FeatureNormalizer class also inherits methods from the PALBase class. Please refer to PAL Base for more details.