FeatureNormalizer

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(method=None, z_score_method=None, new_max=None, new_min=None, thread_ratio=None, division_by_zero_handler=None)

Normalizes a DataFrame. In real-world scenarios, collected continuous attributes are usually distributed over different ranges. It is common practice to scale the data so that data mining algorithms such as neural networks, nearest neighbor classification, and clustering give more reliable results.

Note

The data type of the output is the same as that of the input. Therefore, if the original data is of type INTEGER, the normalized values are converted to integers rather than kept as the fractional results you might expect.

For example, suppose we normalize the list [1, 2, 3, 4] with the min-max method, setting new_min = 0 and new_max = 1.0. The expected result is [0, 0.33, 0.67, 1], but the actual output is [0, 0, 0, 1] because the output data type must match the input data type.

Therefore, cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
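The effect described in this note can be reproduced outside HANA with a minimal sketch. The helper `min_max` is a hypothetical illustration and not part of hana_ml, and the `int()` call is an assumption that stands in for storing the result in an INTEGER column.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Plain min-max scaling of a list of numbers (illustrative helper)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


raw = [1, 2, 3, 4]

# Result when the output column can hold fractional values (DOUBLE-like).
as_double = min_max(raw)

# Assumption: an INTEGER output column drops the fractional part, which is
# consistent with the [0, 0, 0, 1] result described in the note.
as_integer = [int(x) for x in as_double]

print([round(x, 2) for x in as_double])
print(as_integer)
```

In hana_ml the cast is usually applied to the DataFrame before fitting, for example with the DataFrame's `cast` method; check the exact signature in your installed version.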

Parameters
method : {'min-max', 'z-score', 'decimal'}

Scaling methods:

  • 'min-max': Min-max normalization.

  • 'z-score': Z-Score normalization.

  • 'decimal': Decimal scaling normalization.
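As a rough guide to what the three methods compute, the following sketch applies textbook versions of each to a single column. The function names are hypothetical, the z-score uses the population standard deviation as an assumption, and the decimal scaling rule (divide by the smallest power of ten that brings every absolute value below 1) is the common textbook definition, which may differ in detail from PAL.

```python
import statistics


def min_max(xs, new_min=0.0, new_max=1.0):
    # Map [min, max] of the column linearly onto [new_min, new_max].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]


def z_score(xs):
    # Center on the mean and divide by the standard deviation.
    # Assumption: population standard deviation (pstdev).
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def decimal_scaling(xs):
    # Divide by the smallest power of ten that makes every |x| < 1.
    peak = max(abs(x) for x in xs)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]


column = [6.0, 12.1, 13.5, 15.4]
scaled_min_max = min_max(column)
scaled_z = z_score(column)
scaled_decimal = decimal_scaling(column)
```

Min-max bounds the output to a chosen interval, z-score produces values centered on zero, and decimal scaling only shifts the decimal point.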

z_score_method : {'mean-standard', 'mean-mean', 'median-median'} or dict, optional

If z_score_method is not a dict, it is only valid when method is 'z-score'.

  • 'mean-standard': Mean-Standard deviation

  • 'mean-mean': Mean-Mean deviation

  • 'median-median': Median-Median absolute deviation

If z_score_method is a dict, it specifies the columns for different methods.
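The three z-score variants differ only in the center and spread statistics they use. The sketch below is an illustration under assumptions: the function `z_score` is hypothetical, 'mean-mean' is read as mean absolute deviation around the mean, and the standard deviation is the population form. PAL's exact definitions may differ.

```python
import statistics


def z_score(xs, variant="mean-standard"):
    """Return (x - center) / spread for the chosen variant (illustrative)."""
    if variant == "mean-standard":
        center = statistics.mean(xs)
        spread = statistics.pstdev(xs)  # assumption: population std dev
    elif variant == "mean-mean":
        center = statistics.mean(xs)
        spread = statistics.mean([abs(x - center) for x in xs])
    elif variant == "median-median":
        center = statistics.median(xs)
        spread = statistics.median([abs(x - center) for x in xs])
    else:
        raise ValueError("unknown variant: %r" % (variant,))
    return [(x - center) / spread for x in xs]


# One large outlier shows how the robust variant behaves differently.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
by_median = z_score(sample, "median-median")
by_mean_dev = z_score(sample, "mean-mean")
by_std = z_score(sample, "mean-standard")
```

With the outlier present, the median-based variant leaves the four ordinary values evenly spaced around zero, while the mean-based variants are pulled toward the outlier.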

new_max : float, optional

The new maximum value for min-max normalization.

Only valid when method is 'min-max'.

new_min : float, optional

The new minimum value for min-max normalization.

Only valid when method is 'min-max'.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

division_by_zero_handler : bool or str, optional

Specifies the system behavior when a division by zero is encountered during scaling.

  • False or 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.

  • True or 'abort': Throws an error when encountering a division by zero.

Defaults to True.
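A division by zero typically arises when a column is constant, for example when its maximum equals its minimum under min-max scaling. The following sketch mimics the two documented behaviors; `min_max_column` is a hypothetical helper, and the exact conditions under which PAL reports a division by zero are an assumption here.

```python
def min_max_column(xs, new_min=0.0, new_max=1.0, division_by_zero_handler="abort"):
    """Min-max scale one column, mimicking the documented handler options."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # Constant column: (x - lo) / (hi - lo) would divide by zero.
        if division_by_zero_handler is False or division_by_zero_handler == "ignore":
            return list(xs)  # column is left unscaled
        raise ZeroDivisionError("column has zero range and cannot be scaled")
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]


constant = [5.0, 5.0, 5.0]

# 'ignore' returns the column unchanged.
ignored = min_max_column(constant, division_by_zero_handler="ignore")

# 'abort' (the default) raises an error instead.
try:
    min_max_column(constant)
    aborted = False
except ZeroDivisionError:
    aborted = True
```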

Examples

Input DataFrame df1:

>>> df1.head(4).collect()
    ID    X1    X2
0    0   6.0   9.0
1    1  12.1   8.3
2    2  13.5  15.3
3    3  15.4  18.7

Creating a FeatureNormalizer instance:

>>> fn = FeatureNormalizer(method="min-max", new_max=1.0, new_min=0.0)

Performing fit on given DataFrame:

>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
    ID        X1        X2
0    0  0.000000  0.033175
1    1  0.186544  0.000000
2    2  0.229358  0.331754
3    3  0.287462  0.492891

Input DataFrame for transforming:

>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0

Performing transform on given DataFrame:

>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
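The negative values arise because transform reuses the statistics learned during fit instead of recomputing them on the new data. The sketch below reproduces the S_X1 column by hand. The fitted minimum and maximum for X1 (6.0 and 38.7) are not printed by the library; they are inferred here from the displayed fit output and should be treated as an assumption. `apply_min_max` is a hypothetical helper.

```python
# Inferred from the fit output: 6.0 maps to 0.0 and 12.1 maps to 0.186544,
# which implies a fitted range of about 32.7 for X1 (assumption).
x1_min, x1_max = 6.0, 38.7


def apply_min_max(x, lo, hi, new_min=0.0, new_max=1.0):
    # Scale a single value using previously fitted min and max.
    return (x - lo) / (hi - lo) * (new_max - new_min) + new_min


s_x1 = [6.0, 6.0, 4.0, 1.0, 9.0, 4.0]
scaled = [round(apply_min_max(x, x1_min, x1_max), 6) for x in s_x1]
print(scaled)
```

Any value below the fitted minimum maps below new_min, and any value above the fitted maximum maps above new_max.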

Attributes
result_ : DataFrame

Scaled dataset from fit and fit_transform methods.

model_ : DataFrame

Trained model content.

Methods

fit(data[, key, features])

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

fit_transform(data[, key, features])

Fit with the dataset and return the results.

transform(data[, key, features, ...])

Scales data based on the previous scaling model.

fit(data, key=None, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
Fitted object.

fit_transform(data, key=None, features=None)

Fit with the dataset and return the results.

Parameters
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Normalized result, with the same structure as data.

transform(data, key=None, features=None, thread_ratio=None, division_by_zero_handler=None)

Scales data based on the previous scaling model.

Parameters
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

thread_ratio : float, optional

Controls the proportion of available threads to use for scaling data.

The value range is from 0 to 1, where 0 indicates a single thread, 1 indicates up to all available threads, and values between 0 and 1 indicate the percentage of available threads.

If the specified value is outside [0, 1], then the number of threads to use is heuristically determined.

Defaults to 0.

division_by_zero_handler : str, optional

Specifies the system behavior when division-by-zero is encountered when scaling data using a fitted model.

  • 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.

  • 'abort': Throws an error when encountering a division by zero.

Defaults to 'abort'.

Returns
DataFrame

Normalized result, with the same structure as data.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the FeatureNormalizer class also inherits methods from the PALBase class. Please refer to PAL Base for more details.