FeatureNormalizer

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(method=None, z_score_method=None, new_max=None, new_min=None, thread_ratio=None, division_by_zero_handler=None)

Normalize a DataFrame. In real-world scenarios, the collected continuous attributes are usually distributed within different ranges. It is common practice to scale the data so that data mining algorithms such as neural networks, nearest-neighbor classification, and clustering can give more reliable results.

Note

The data type of the output value is the same as that of the input value. Therefore, if the original data is of type INTEGER, the output is truncated to an integer rather than the fractional result you might expect.

For example, if we use the min-max method to normalize the list [1, 2, 3, 4] with new_min = 0 and new_max = 1.0, the expected result is [0, 0.33, 0.66, 1], but the actual output is [0, 0, 0, 1] because the output data type must be consistent with the input data type.

Therefore, please cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
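
For instance, the DataFrame.cast() method of hana_ml can be used to convert integer feature columns before fitting; a minimal sketch, assuming df_train holds INTEGER feature columns named 'X1' and 'X2' (hypothetical names):

>>> # Cast the INTEGER feature columns to DOUBLE so the scaled
>>> # output keeps its fractional part.
>>> df_train = df_train.cast(['X1', 'X2'], 'DOUBLE')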

Parameters:
method : {'min-max', 'z-score', 'decimal'}

Scaling methods:

  • 'min-max': Min-max normalization.

  • 'z-score': Z-Score normalization.

  • 'decimal': Decimal scaling normalization.

z_score_method : {'mean-standard', 'mean-mean', 'median-median'} or dict, optional

If z_score_method is not a dict, it is only valid when method is 'z-score'.

  • 'mean-standard': Mean-Standard deviation

  • 'mean-mean': Mean-Mean deviation

  • 'median-median': Median-Median absolute deviation

If z_score_method is a dict, it maps each column name to the method applied to that column (see the second sketch in the Examples section below).

new_max : float, optional

The new maximum value for min-max normalization.

Only valid when method is 'min-max'.

new_min : float, optional

The new minimum value for min-max normalization.

Only valid when method is 'min-max'.

thread_ratio : float, optional

Specifies the ratio of available threads to use, in the range [0, 1]. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

division_by_zero_handler : bool or str, optional

Specifies the system behavior when division by zero is encountered while scaling the data.

  • False or 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.

  • True or 'abort': Throws an error when encountering a division by zero.

Defaults to True.

Examples

>>> fn = FeatureNormalizer(method="min-max", new_max=1.0, new_min=0.0)
>>> fn.fit(data=df_train, key='ID')
>>> res = fn.transform(data=df_transform, key='ID')
>>> res.collect()
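
A further sketch (the feature column names 'X1' and 'X2' are hypothetical): per the parameter descriptions above, a dict passed to z_score_method is assumed to assign a different z-score variant to each column, and division_by_zero_handler='ignore' leaves constant columns unscaled.

>>> fn = FeatureNormalizer(z_score_method={'X1': 'mean-standard',
...                                        'X2': 'median-median'},
...                        division_by_zero_handler='ignore')
>>> fn.fit(data=df_train, key='ID')
>>> res = fn.transform(data=df_transform, key='ID')
>>> res.collect()
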
Attributes:
result_ : DataFrame

Scaled dataset from fit and fit_transform methods.

model_ : DataFrame

Model content.

Methods

fit(data[, key, features])

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

fit_transform(data[, key, features])

Fit with the dataset and return the results.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

transform(data[, key, features, ...])

Scales data based on the previous scaling model.

fit(data, key=None, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
A fitted object of class "FeatureNormalizer".
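
For illustration, a fit call restricted to explicitly named feature columns might look as follows (the column names 'X1' and 'X2' are hypothetical):

>>> fn = FeatureNormalizer(method='z-score', z_score_method='mean-standard')
>>> fn.fit(data=df_train, key='ID', features=['X1', 'X2'])
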
fit_transform(data, key=None, features=None)

Fit with the dataset and return the results.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame

Normalized result, with the same structure as data.
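
A minimal usage sketch, assuming df_train is keyed by an 'ID' column:

>>> fn = FeatureNormalizer(method='min-max', new_max=1.0, new_min=0.0)
>>> res = fn.fit_transform(data=df_train, key='ID')
>>> res.collect()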

transform(data, key=None, features=None, thread_ratio=None, division_by_zero_handler=None)

Scales data based on the previous scaling model.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

thread_ratio : float, optional

Specifies the ratio of available threads to use, in the range [0, 1]. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

division_by_zero_handler : str, optional

Specifies the system behavior when division by zero is encountered while scaling data using a fitted model.

  • 'ignore': Ignores the column when encountering a division by zero, so the column is not scaled.

  • 'abort': Throws an error when encountering a division by zero.

Defaults to 'abort'.

Returns:
DataFrame

Normalized result, with the same structure as data.
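
For example, after fitting, new data could be scaled using all available threads while leaving constant columns unscaled; a sketch, where df_new is a hypothetical DataFrame with the same structure as the training data:

>>> res = fn.transform(data=df_new, key='ID',
...                    thread_ratio=1.0,
...                    division_by_zero_handler='ignore')
>>> res.collect()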

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.
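
Both getters return DataFrames, so the results can be fetched to the client with collect(), e.g.:

>>> fn.get_model_metrics().collect()
>>> fn.get_score_metrics().collect()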

Inherited Methods from PALBase

Besides the methods mentioned above, the FeatureNormalizer class also inherits methods from the PALBase class. Please refer to PAL Base for more details.