PowerTransform

class hana_ml.algorithms.pal.preprocessing.PowerTransform(method=None, standardize=None, max_iter=None, tol=None, interval=None, interval_hint=None, thread_ratio=None)

This class implements a Python interface for the power transform algorithm in PAL. Power Transform is a family of non-linear transformation methods that can stabilize the variance of data, minimize its skewness, and make its distribution approximately Gaussian.

Power Transform supports Box-Cox transformation and Yeo-Johnson transformation. Both transformations are monotonically increasing functions with a single hyperparameter, denoted \(\lambda\).

Box-Cox transformation is restricted to positive data only, with the mathematical formula as follows:

\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0\\ \ln{(x)}, & \text{ if }\lambda = 0 \end{cases}\end{split}\]

where \(x^{(\lambda)}\) represents the transformed value. In contrast, the Yeo-Johnson transformation can be applied to any real data, with the following mathematical formula:

\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{(x+1)^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0, x\geq 0\\ \ln{(x+1)}, &\text{ if }\lambda = 0, x\geq 0\\ -\frac{(1-x)^{(2-\lambda)} - 1}{2-\lambda}, &\text{ if }\lambda\neq 2, x < 0\\ -\ln{(1-x)}, &\text{ if }\lambda = 2, x < 0 \end{cases}\end{split}\]
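The two piecewise definitions above can be written out directly in plain Python. This is only an illustrative sketch for single values: the function names `box_cox` and `yeo_johnson` are assumptions made for this example and are not part of hana_ml, and PAL's own implementation runs inside SAP HANA.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value (illustrative helper)."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for strictly positive data only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one real value (illustrative helper)."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative inputs use the mirrored branch with exponent 2 - lam.
    if lam == 2:
        return -math.log(1.0 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)
```

With \(\lambda = 1\) the Yeo-Johnson transform leaves every value unchanged, which makes it a convenient sanity check for the four branches.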

For a given collection of data, the hyperparameter \(\lambda\) can be estimated by maximizing the log-likelihood function.
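As a rough illustration of this estimation step, the sketch below evaluates the standard Yeo-Johnson profile log-likelihood (Gaussian likelihood of the transformed values plus the Jacobian term) on a grid of candidate \(\lambda\) values and keeps the best one. The helper names and the grid search are assumptions chosen for clarity; the search strategy PAL actually uses is not described here.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one real value (illustrative helper)."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log(1.0 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def yeo_johnson_log_likelihood(data, lam):
    """Profile log-likelihood of lam, up to an additive constant."""
    n = len(data)
    transformed = [yeo_johnson(x, lam) for x in data]
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    # Jacobian term: (lam - 1) * sum(sign(x) * ln(|x| + 1)).
    jacobian = sum(math.copysign(math.log1p(abs(x)), x) for x in data)
    return -0.5 * n * math.log(variance) + (lam - 1.0) * jacobian


def estimate_lambda(data, interval=(-2.0, 2.0), steps=400):
    """Pick the grid point in the interval with the highest log-likelihood."""
    lo, hi = interval
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda lam: yeo_johnson_log_likelihood(data, lam))
```

For data that is already symmetric around zero the estimate stays close to \(\lambda = 1\) (no reshaping needed), while right-skewed data pushes the estimate below 1.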

Parameters:

method: {'boxcox', 'yeojohnson'}, optional

Specifies the type of power transformation.

'boxcox': Box-Cox transformation.
'yeojohnson': Yeo-Johnson transformation.

Defaults to 'yeojohnson'.

standardize: bool, optional

Specifies whether or not to standardize the result of the power transformation before it is output.

Defaults to True.

max_iter: int, optional

Specifies the maximum number of iterations for fitting the power parameter \(\lambda\).

If convergence is not reached after the specified number of iterations, an error is raised. In this case, increase the value and retry.

Defaults to 500.

tol: float, optional

Specifies the absolute tolerance that controls the accuracy of the fitted parameter \(\lambda\). The value should be positive but less than 1.

Defaults to 1e-11.

interval: list, optional

Specifies the global search interval for power parameter \(\lambda\) in a list of two numbers: 1st number for interval start, and 2nd number for interval end. A natural restriction is that interval start should be less than interval end.

Defaults to [-2.0, 2.0].

interval_hint: bool, optional

Specifies whether or not to use the specified interval as a hint. If True, the specified interval is only used as the initial search range, and the final power parameter \(\lambda\) may fall outside it.

Defaults to False (meaning that the final \(\lambda\) must fall within the specified interval).

thread_ratio: float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 1.0.
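The roles of interval, tol and max_iter can be pictured with a generic bounded one-dimensional search. The sketch below uses golden-section search as a stand-in; it is an assumption made purely for illustration, since the optimizer PAL uses internally is not documented here, and `maximize_on_interval` is a hypothetical helper. The behavior of interval_hint=True (letting the result leave the interval) is not modeled.

```python
import math


def maximize_on_interval(objective, interval=(-2.0, 2.0), tol=1e-11, max_iter=500):
    """Golden-section search for the maximizer of a unimodal objective."""
    lo, hi = interval
    if not lo < hi:
        raise ValueError("interval start must be less than interval end")
    if not 0.0 < tol < 1.0:
        raise ValueError("tol must be positive and less than 1")
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(max_iter):
        # Stop once the bracket is narrower than the absolute tolerance.
        if hi - lo <= tol:
            return (lo + hi) / 2.0
        left = hi - inv_phi * (hi - lo)
        right = lo + inv_phi * (hi - lo)
        if objective(left) < objective(right):
            lo = left   # the maximizer lies in [left, hi]
        else:
            hi = right  # the maximizer lies in [lo, right]
    raise RuntimeError("no convergence within max_iter iterations; increase max_iter")
```

In this sketch the bracket shrinks by a constant factor per iteration, so a tighter tol needs more iterations, and a max_iter that is too small ends in an error rather than an inaccurate result.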

Examples

>>> from hana_ml.algorithms.pal.preprocessing import PowerTransform
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data=df, key='ID')
>>> res.collect()
>>> pt.model_.collect()

Attributes:

model_: DataFrame

Model content.

result_: DataFrame

DataFrame containing the transformed result of the data fed to the fit() function.

Available and non-empty only after calling the fit_transform() function.

Methods

`fit`(data[, key, features, feature_interval, ...])	Fit the data to be transformed, obtaining a separate transformation parameter for each feature.
`fit_transform`(data[, key, features, ...])	Fit the data to be transformed, obtaining a separate transformation parameter for each feature, then apply the transformation to the training data and return the result.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`transform`(data[, key, features, inverse, ...])	Data transformation based on trained power transform parameters.

fit(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)

Fit the data to be transformed, obtaining a separate transformation parameter for each feature.

Parameters:

data: DataFrame

DataFrame containing the data to apply power transform.

key: str, optional

Specifies the name of the ID column in data.

Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).

features: str or a list of str, optional

Specifies the names of the columns in data to apply power transform to.

All columns in features must be of numerical data type.

Defaults to all non-key columns of data if not provided.

feature_interval: dict, optional

Specifies the power parameter search intervals for features in a dictionary format. For each key-value pair in feature_interval, key is the feature name, and value is the specified interval.

For example, if data contains a feature 'X1' that needs to be transformed, and we want the power parameter to be searched within the range [-3, 3] for 'X1', then the key-value pair 'X1': [-3, 3] should be specified within feature_interval. This overrides the global interval specified by the parameter interval in class initialization.

If not provided, the power parameter will be searched within a global interval specified by parameter interval in class initialization.

feature_interval_hint: dict, optional

Specifies whether or not to use the specified intervals as hint for different features.

For each key-value pair in feature_interval_hint, the key is the feature name, and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global hint choice specified by the parameter interval_hint in class initialization.

If not provided, the value of the interval_hint parameter in class initialization will be applied to all features.

Returns:

A fitted object of class "PowerTransform".

Examples

>>> pt = PowerTransform(method='yeojohnson')
>>> pt.fit(data=df, key='ID', features=['X1', 'X2', 'X3'],
...        feature_interval={'X1': [-3, 3], 'X2': [-4, 4]},
...        feature_interval_hint={'X1': True, 'X3': True})
>>> pt.model_.collect()
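The precedence between the per-feature dictionaries and the constructor-level settings can be sketched in plain Python. `resolve_search_settings` is a hypothetical helper written for this illustration and is not part of hana_ml; it only mirrors the override rules described for feature_interval and feature_interval_hint.

```python
def resolve_search_settings(features, interval=(-2.0, 2.0), interval_hint=False,
                            feature_interval=None, feature_interval_hint=None):
    """Return the effective search interval and hint flag for each feature."""
    feature_interval = feature_interval or {}
    feature_interval_hint = feature_interval_hint or {}
    settings = {}
    for name in features:
        # A per-feature entry wins; otherwise fall back to the global value.
        lo, hi = feature_interval.get(name, interval)
        if not lo < hi:
            raise ValueError(f"interval start must be less than interval end for {name!r}")
        settings[name] = {
            "interval": (lo, hi),
            "interval_hint": bool(feature_interval_hint.get(name, interval_hint)),
        }
    return settings


settings = resolve_search_settings(
    ["X1", "X2", "X3"],
    feature_interval={"X1": [-3, 3], "X2": [-4, 4]},
    feature_interval_hint={"X1": True, "X3": True},
)
```

Here 'X3' has no entry in feature_interval, so it keeps the global interval, and 'X2' has no entry in feature_interval_hint, so it keeps the global hint setting.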

fit_transform(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)

Fit the data to be transformed, obtaining a separate transformation parameter for each feature, then apply the transformation to the training data and return the result.

Parameters:

data: DataFrame

DataFrame containing the data to apply power transform.

key: str, optional

Specifies the name of the ID column in data.

Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).

features: str or a list of str, optional

Specifies the names of the columns in data to apply power transform to.

All columns in features must be of numerical data type.

Defaults to all non-key columns of data if not provided.

feature_interval: dict, optional

Specifies the power parameter search intervals for features in a dictionary format. For each key-value pair in feature_interval, key is the feature name, and value is the specified interval.

For example, if data contains a feature 'X1' that needs to be transformed, and we want the power parameter to be searched within the range [-3, 3] for 'X1', then the key-value pair 'X1': [-3, 3] should be specified within feature_interval. This overrides the global interval specified by the parameter interval in class initialization.

If not provided, the power parameter will be searched within a global interval specified by parameter interval in class initialization.

feature_interval_hint: dict, optional

Specifies whether or not to use the specified intervals as hint for different features.

For each key-value pair in feature_interval_hint, the key is the feature name, and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global hint choice specified by the parameter interval_hint in class initialization.

If not provided, the value of the interval_hint parameter in class initialization will be applied to all features.

Returns:

DataFrame: The transformed result of features selected from data.

Examples

>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data=df, key='ID')
>>> res.collect()
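When standardize is True, the transformed values are standardized before being returned. A minimal sketch of that post-processing step is shown below, assuming that standardization means rescaling each feature to zero mean and unit variance and that the population standard deviation is used; both points are assumptions, and `standardize` here is a hypothetical helper rather than a hana_ml function.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


# Stand-in for one feature's power-transformed values.
transformed = [0.0, 0.69, 1.10, 1.39, 1.61]
standardized = standardize(transformed)
```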

transform(data, key=None, features=None, inverse=False, thread_ratio=None)

Data transformation based on trained power transform parameters.

Parameters:

data: DataFrame

Input data.

key: str, optional

The ID column.

Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.

features: str, optional

Specifies the features to which the power transform is applied.

Defaults to all non-key columns if not provided.

inverse: bool, optional

Specifies whether or not to apply inverse power transform.

False : apply forward power transformation, i.e. transform from raw feature data to Gaussian-like data.
True : apply inverse power transformation, i.e. transform from Gaussian-like data to raw feature data.

Defaults to False.

thread_ratio: float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns:

DataFrame

Transformed values, structured as follows:

ID, the same type as key in data, row ID.
STRING_CONTENT, type VARCHAR, transformed features in JSON format.

Examples

>>> pt = PowerTransform(method='yeojohnson')
>>> pt.fit(data=df_train, key='ID')
>>> res = pt.transform(data=df_transform, key='ID')
>>> res.collect()
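To make the inverse option more concrete, the sketch below inverts the Yeo-Johnson formula for a known \(\lambda\) by solving each branch for \(x\). The helper names are assumptions for this illustration and are not part of hana_ml. The sketch works on un-standardized values; if the forward transform was standardized, that scaling would have to be undone first.

```python
import math


def yeo_johnson(x, lam):
    """Forward Yeo-Johnson transform of one real value (illustrative helper)."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log(1.0 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def inverse_yeo_johnson(y, lam):
    """Recover the raw value x from a transformed value y (illustrative helper)."""
    # The transform is increasing and maps 0 to 0, so the sign of y
    # identifies which branch produced it.
    if y >= 0:
        if lam == 0:
            return math.exp(y) - 1.0
        return (lam * y + 1.0) ** (1.0 / lam) - 1.0
    if lam == 2:
        return 1.0 - math.exp(-y)
    return 1.0 - (1.0 - (2.0 - lam) * y) ** (1.0 / (2.0 - lam))
```

Applying `inverse_yeo_johnson` to the output of `yeo_johnson` with the same \(\lambda\) returns the original value up to floating-point rounding.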

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

Besides the methods mentioned above, the PowerTransform class also inherits methods from the PALBase class. Please refer to PAL Base for more details.