FeatureNormalizer
- class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None, division_by_zero_handler=None)
Normalize a DataFrame. In real-world scenarios, collected continuous attributes are usually distributed over different ranges. It is common practice to scale the data well so that data mining algorithms such as neural networks, nearest-neighbor classification, and clustering give more reliable results.
Note
Note that the data type of the output value is the same as that of the input value. Therefore, if the data type of the original data is INTEGER, the output value will be converted to an integer instead of the result you expect.
For example, if we use the min-max method to normalize the list [1, 2, 3, 4] with new_min = 0 and new_max = 1.0, the expected result is [0, 0.33, 0.67, 1], but the actual output is [0, 0, 0, 1] because the output data type must match the input data type.
Therefore, cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
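The effect described in this note can be reproduced outside of HANA. The following is a minimal pure-Python sketch, not the PAL implementation: the helper name min_max_scale is hypothetical, and converting to an integer by truncation is an assumption made only to mirror the [0, 0, 0, 1] output quoted above.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Plain min-max scaling of a list of numbers (illustrative helper)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Scaling computed in floating point, as one would expect for DOUBLE input.
as_double = min_max_scale([1, 2, 3, 4])

# Assumption: an INTEGER output column drops the fractional part.
as_integer = [int(v) for v in as_double]

print(as_integer)  # [0, 0, 0, 1]
```

The intermediate values 1/3 and 2/3 are lost once the result is stored as an integer, which is why the input columns should be DOUBLE.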
- Parameters
- method : {'min-max', 'z-score', 'decimal'}
Scaling methods:
'min-max': Min-max normalization.
'z-score': Z-Score normalization.
'decimal': Decimal scaling normalization.
- z_score_method : {'mean-standard', 'mean-mean', 'median-median'}, optional
Only valid when method is 'z-score'.
'mean-standard': Mean-Standard deviation
'mean-mean': Mean-Mean deviation
'median-median': Median-Median absolute deviation
- new_max : float, optional
The new maximum value for min-max normalization.
Only valid when method is 'min-max'.
- new_min : float, optional
The new minimum value for min-max normalization.
Only valid when method is 'min-max'.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- division_by_zero_handler : bool or str, optional
False or 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.
True or 'abort': Throws an error when encountering a division by zero.
Defaults to True.
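To make the method and z_score_method options concrete, the sketch below implements the textbook forms of the three scaling methods in pure Python. It is an illustration under stated assumptions, not PAL's code: the function names are hypothetical, the population standard deviation is assumed for 'mean-standard' (PAL may use a different estimator), and decimal scaling is assumed to divide by the smallest power of ten that brings every absolute value below 1.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Map the observed [min, max] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values, z_score_method="mean-standard"):
    """Center and scale by one of three center/spread pairs."""
    if z_score_method == "mean-standard":
        center = statistics.fmean(values)
        spread = statistics.pstdev(values)  # assumption: population std dev
    elif z_score_method == "mean-mean":
        center = statistics.fmean(values)
        spread = statistics.fmean([abs(v - center) for v in values])
    elif z_score_method == "median-median":
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
    else:
        raise ValueError(f"unknown z_score_method: {z_score_method!r}")
    return [(v - center) / spread for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|x|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score([1, 2, 3, 4, 5], "median-median"))  # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(decimal_scaling([120, -45, 7]))             # [0.12, -0.045, 0.007]
```

None of these helpers guard against a zero spread or zero range; that case is what division_by_zero_handler addresses.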
Examples
Input DataFrame df1:
>>> df1.head(4).collect()
   ID    X1    X2
0   0   6.0   9.0
1   1  12.1   8.3
2   2  13.5  15.3
3   3  15.4  18.7
Creating a FeatureNormalizer instance:
>>> fn = FeatureNormalizer(method="min-max", new_max=1.0, new_min=0.0)
Performing fit on given DataFrame:
>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
   ID        X1        X2
0   0  0.000000  0.033175
1   1  0.186544  0.000000
2   2  0.229358  0.331754
3   3  0.287462  0.492891
Input DataFrame for transforming:
>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0
Performing transform on given DataFrame:
>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
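The negative values in the transformed output follow from transform reusing the range learned during fit rather than recomputing it from the new data. The pure-Python sketch below reproduces the S_X1 column under one assumption: the fitted X1 minimum and maximum are taken to be 6.0 and 38.7, values inferred from the printed fit output (for example, (12.1 - 6.0) / 32.7 ≈ 0.186544) and not stated explicitly in this page. The helper name is hypothetical.

```python
def transform_min_max(column, fitted_min, fitted_max, new_min=0.0, new_max=1.0):
    """Scale new data with a range recorded at fit time (illustrative helper)."""
    span = fitted_max - fitted_min
    return [(v - fitted_min) / span * (new_max - new_min) + new_min for v in column]

# Assumed fitted range for X1, inferred from the printed fit result.
X1_MIN, X1_MAX = 6.0, 38.7

s_x1 = [6.0, 6.0, 4.0, 1.0, 9.0, 4.0]
scaled = [round(v, 6) for v in transform_min_max(s_x1, X1_MIN, X1_MAX)]

print(scaled)  # [0.0, 0.0, -0.061162, -0.152905, 0.091743, -0.061162]
```

Values below the fitted minimum map below new_min, so transformed data is not guaranteed to stay inside [new_min, new_max].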
- Attributes
- result_ : DataFrame
Scaled dataset from fit and fit_transform methods.
- model_ : DataFrame
Trained model content.
Methods
fit(data[, key, features]): Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
fit_transform(data[, key, features]): Fit with the dataset and return the results.
transform(data[, key, features, ...]): Scales data based on the previous scaling model.
- fit(data, key=None, features=None)
Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
- Parameters
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns
- Fitted object.
- fit_transform(data, key=None, features=None)
Fit with the dataset and return the results.
- Parameters
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns
- DataFrame
Normalized result, with the same structure as data.
- transform(data, key=None, features=None, thread_ratio=None, division_by_zero_handler=None)
Scales data based on the previous scaling model.
- Parameters
- data : DataFrame
DataFrame to be normalized.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- thread_ratio : float, optional
Controls the proportion of available threads to use for scaling data.
The value range is from 0 to 1, where 0 indicates a single thread, 1 indicates up to all available threads, and values between 0 and 1 indicate the percentage of available threads.
If the specified value is outside [0, 1], then the number of threads to use is heuristically determined.
Defaults to 0.
- division_by_zero_handler : str, optional
Specifies the system behavior when a division by zero is encountered while scaling data using a fitted model.
'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.
'abort': Throws an error when encountering a division by zero.
Defaults to 'abort'.
- Returns
- DataFrame
Normalized result, with the same structure as data.
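The two division_by_zero_handler settings can be pictured with a small pure-Python sketch. It assumes the division by zero comes from a column with zero range under min-max scaling, which is one plausible trigger rather than a documented list of PAL's conditions; the function name scale_column and the exception type are hypothetical.

```python
def scale_column(values, handler="abort", new_min=0.0, new_max=1.0):
    """Min-max scale one column, mimicking the 'ignore' / 'abort' choices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Zero range: (v - lo) / (hi - lo) would divide by zero.
        if handler == "ignore":
            return list(values)  # column is left unscaled
        raise ZeroDivisionError("column has zero range and cannot be scaled")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

constant = [5.0, 5.0, 5.0]
print(scale_column(constant, handler="ignore"))  # [5.0, 5.0, 5.0]
```

With handler="abort", the same call raises an error instead of returning the unscaled column, so the caller learns about the degenerate column immediately.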
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.