PowerTransform
- class hana_ml.algorithms.pal.preprocessing.PowerTransform(method=None, standardize=None, max_iter=None, tol=None, interval=None, interval_hint=None, thread_ratio=None)
This class implements a Python interface for the power transform algorithm in PAL. Power Transform is a family of non-linear transformation methods that can stabilize the variance of data, minimize its skewness, and make its distribution approximately Gaussian.
Power Transform supports Box-Cox transformation and Yeo-Johnson transformation. Both transformations are monotonically increasing functions with one hyperparameter, denoted \(\lambda\).
Box-Cox transformation is restricted to positive data only, with the mathematical formula given as follows:
\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0\\ \ln{(x)}, & \text{ if }\lambda = 0 \end{cases}\end{split}\]
where \(x^{(\lambda)}\) represents the value after transformation. In contrast, the Yeo-Johnson transformation can be applied to any real data, with the following mathematical formula:
\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{(x+1)^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0, x\geq 0\\ \ln{(x+1)}, &\text{ if }\lambda = 0, x\geq 0\\ -\frac{(1-x)^{(2-\lambda)} - 1}{2-\lambda}, &\text{ if }\lambda\neq 2, x < 0\\ -\ln{(1-x)}, &\text{ if }\lambda = 2, x < 0 \end{cases}\end{split}\]
For a given collection of data, the hyperparameter \(\lambda\) can be estimated by maximizing the log-likelihood function.
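The two formulas above can be written out directly in a few lines of plain Python. This is only an illustrative sketch of the mathematics for a single value; the function names `boxcox` and `yeojohnson` are chosen here for illustration and are not part of hana_ml, and PAL's actual implementation runs inside SAP HANA.

```python
import math


def boxcox(x: float, lam: float) -> float:
    """Box-Cox transform of one value; defined for x > 0 only."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeojohnson(x: float, lam: float) -> float:
    """Yeo-Johnson transform of one value; defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)  # ln(x + 1)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)  # -ln(1 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)
```

With \(\lambda = 1\), `yeojohnson` returns its input unchanged, which is a quick sanity check on both branches of the formula.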
- Parameters:
- method: {'boxcox', 'yeojohnson'}, optional
Specifies the type of power transformation.
'boxcox': Box-Cox transformation.
'yeojohnson': Yeo-Johnson transformation.
Defaults to 'yeojohnson'.
- standardize: bool, optional
Specifies whether or not to standardize the result of the power transformation before it is output.
Defaults to True.
- max_iter: int, optional
Specifies the maximum number of iterations for fitting the power parameter \(\lambda\).
If convergence is not reached after the specified number of iterations, an error is raised. In this case, increase max_iter and try again.
Defaults to 500.
- tol: float, optional
Specifies the absolute tolerance that controls the accuracy of the fitted parameter \(\lambda\). The value should be positive but less than 1.
Defaults to 1e-11.
- interval: list, optional
Specifies the global search interval for the power parameter \(\lambda\) as a list of two numbers: the first is the interval start and the second is the interval end. The interval start must be less than the interval end.
Defaults to [-2.0, 2.0].
- interval_hint: bool, optional
Specifies whether or not to use the specified interval as a hint. If True, the specified interval is only used as the initial search range, and the final power parameter \(\lambda\) may fall outside it.
Defaults to False, meaning that the final \(\lambda\) must fall within the specified interval.
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.0.
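To make the roles of interval, tol and max_iter concrete, the following pure-Python sketch estimates \(\lambda\) for the Box-Cox case by maximizing the profile log-likelihood over a bounded interval. It is an illustration under assumptions, not PAL's algorithm: golden-section search is used here as a stand-in optimizer (the optimizer PAL uses is not stated in this page), the log-likelihood is assumed to be unimodal on the interval, and the names `boxcox_loglik` and `fit_lambda` are hypothetical.

```python
import math


def boxcox_loglik(data, lam):
    """Box-Cox profile log-likelihood (up to an additive constant)."""
    n = len(data)
    logs = [math.log(x) for x in data]  # requires x > 0
    # expm1 keeps (x**lam - 1) / lam accurate when lam is close to 0.
    y = [lx if lam == 0 else math.expm1(lam * lx) / lam for lx in logs]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # assumes non-constant data
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(logs)


def fit_lambda(data, interval=(-2.0, 2.0), tol=1e-11, max_iter=500):
    """Golden-section search for the lambda that maximizes the likelihood."""
    lo, hi = interval
    if not lo < hi:
        raise ValueError("interval start must be less than interval end")
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(max_iter):
        if hi - lo < tol:
            return 0.5 * (lo + hi)
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        # Keep the sub-interval that still contains the larger value.
        if boxcox_loglik(data, a) > boxcox_loglik(data, b):
            hi = b
        else:
            lo = a
    raise RuntimeError("no convergence; increase max_iter")
```

In this sketch the result always stays inside the given interval, which corresponds to interval_hint=False; a hint-style search would need to be allowed to expand the bracket.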
Examples
>>> from hana_ml.algorithms.pal.preprocessing import PowerTransform
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data=df, key='ID')
>>> res.collect()
>>> pt.model_.collect()
- Attributes:
- model_: DataFrame
Model content.
- result_: DataFrame
DataFrame containing the transformed result of the data fed to the fit() function.
Available and non-empty only after calling the fit_transform() function.
Methods
- fit(data[, key, features, feature_interval, ...]): Fit the data to obtain a transformation parameter for each feature.
- fit_transform(data[, key, features, ...]): Fit the data to obtain a transformation parameter for each feature, then apply the transformation to the training data and return the result.
- get_model_metrics(): Get the model metrics.
- get_score_metrics(): Get the score metrics.
- transform(data[, key, features, inverse, ...]): Transform data based on the trained power transform parameters.
- fit(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)
Fit the data to obtain a transformation parameter for each feature.
- Parameters:
- data: DataFrame
DataFrame containing the data to apply power transform to.
- key: str, optional
Specifies the name of the ID column in data.
Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).
- features: str or a list of str, optional
Specifies the names of the columns in data to apply power transform to.
All columns in features must be of numerical data type.
Defaults to all non-key columns of data if not provided.
- feature_interval: dict, optional
Specifies the power parameter search intervals for features in a dictionary format. For each key-value pair in feature_interval, the key is the feature name and the value is the specified interval.
For example, if data contains a feature 'X1' that needs to be transformed, and the power parameter should be searched within the range [-3, 3] for 'X1', then the key-value pair 'X1': [-3, 3] should be specified within feature_interval. This overrides the global interval specified by the parameter interval in class initialization.
If not provided, the power parameter is searched within the global interval specified by the parameter interval in class initialization.
- feature_interval_hint: dict, optional
Specifies whether or not to use the specified intervals as hints for different features.
For each key-value pair in feature_interval_hint, the key is the feature name and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global hint choice specified by the parameter interval_hint in class initialization.
If not provided, the value of the interval_hint parameter in class initialization is applied to all features.
- Returns:
- A fitted object of class "PowerTransform".
Examples
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit(data=df, key='ID', features=['X1', 'X2', 'X3'],
...              feature_interval={'X1': [-3, 3], 'X2': [-4, 4]},
...              feature_interval_hint={'X1': True, 'X3': True})
>>> res.model_.collect()
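The override rules described for feature_interval and feature_interval_hint can be summarized with a small stand-alone sketch. It only mirrors the documented semantics (a per-feature entry wins, otherwise the global value from class initialization applies); the helper name `resolve_search_settings` is hypothetical and does not reflect how hana_ml passes these settings to PAL.

```python
def resolve_search_settings(features, interval=(-2.0, 2.0), interval_hint=False,
                            feature_interval=None, feature_interval_hint=None):
    """Return the effective search interval and hint flag for each feature."""
    feature_interval = feature_interval or {}
    feature_interval_hint = feature_interval_hint or {}
    settings = {}
    for name in features:
        # A per-feature entry overrides the global setting.
        lo, hi = feature_interval.get(name, interval)
        if not lo < hi:
            raise ValueError(f"invalid interval for feature {name!r}")
        settings[name] = {
            "interval": (lo, hi),
            "hint": feature_interval_hint.get(name, interval_hint),
        }
    return settings


effective = resolve_search_settings(
    ["X1", "X2", "X3"],
    feature_interval={"X1": [-3, 3], "X2": [-4, 4]},
    feature_interval_hint={"X1": True, "X3": True},
)
```

Here 'X3' has no entry in feature_interval, so it falls back to the global interval, while 'X2' has no hint entry and therefore uses the global hint value.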
- fit_transform(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)
Fit the data to obtain a transformation parameter for each feature, then apply the transformation to the training data and return the result.
- Parameters:
- data: DataFrame
DataFrame containing the data to apply power transform to.
- key: str, optional
Specifies the name of the ID column in data.
Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).
- features: str or a list of str, optional
Specifies the names of the columns in data to apply power transform to.
All columns in features must be of numerical data type.
Defaults to all non-key columns of data if not provided.
- feature_interval: dict, optional
Specifies the power parameter search intervals for features in a dictionary format. For each key-value pair in feature_interval, the key is the feature name and the value is the specified interval.
For example, if data contains a feature 'X1' that needs to be transformed, and the power parameter should be searched within the range [-3, 3] for 'X1', then the key-value pair 'X1': [-3, 3] should be specified within feature_interval. This overrides the global interval specified by the parameter interval in class initialization.
If not provided, the power parameter is searched within the global interval specified by the parameter interval in class initialization.
- feature_interval_hint: dict, optional
Specifies whether or not to use the specified intervals as hints for different features.
For each key-value pair in feature_interval_hint, the key is the feature name and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global hint choice specified by the parameter interval_hint in class initialization.
If not provided, the value of the interval_hint parameter in class initialization is applied to all features.
- Returns:
- DataFrame
The transformed result of features selected from data.
Examples
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data=df, key='ID')
>>> res.collect()
- transform(data, key=None, features=None, inverse=False, thread_ratio=None)
Data transformation based on trained power transform parameters.
- Parameters:
- data: DataFrame
Input data.
- key: str, optional
The ID column.
Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.
- features: str, optional
Specifies the features to apply power transform to.
Defaults to all non-key columns if not provided.
- inverse: bool, optional
Specifies whether or not to apply inverse power transform.
False : apply forward power transformation, i.e. transform from raw feature data to Gaussian-like data.
True : apply inverse power transformation, i.e. transform from Gaussian-like data to raw feature data.
Defaults to False.
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.0.
- Returns:
- DataFrame
Transformed values, structured as follows:
ID, the same type as key in data, row ID.
STRING_CONTENT, type VARCHAR, transformed features in JSON format.
Examples
>>> pt = PowerTransform(method='yeojohnson')
>>> pt.fit(data=df_train, key='ID')
>>> res = pt.transform(data=df_transform, key='ID')
>>> res.collect()
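What inverse=True means mathematically can be sketched for the Yeo-Johnson case by inverting each branch of the formula given at the top of this page. This is a minimal illustration for a single value with a known \(\lambda\); it assumes no standardization was applied, and the function names `yeojohnson` and `yeojohnson_inverse` are hypothetical rather than hana_ml APIs.

```python
import math


def yeojohnson(x: float, lam: float) -> float:
    """Forward Yeo-Johnson transform of one value."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def yeojohnson_inverse(y: float, lam: float) -> float:
    """Inverse Yeo-Johnson transform; the forward map preserves sign,
    so the sign of y selects the branch."""
    if y >= 0:
        if lam == 0:
            return math.expm1(y)
        return (lam * y + 1.0) ** (1.0 / lam) - 1.0
    if lam == 2:
        return -math.expm1(-y)
    return 1.0 - (1.0 - (2.0 - lam) * y) ** (1.0 / (2.0 - lam))


# Round trip: inverse(forward(x)) should recover x.
recovered = yeojohnson_inverse(yeojohnson(-3.0, 1.5), 1.5)
```

If the fitted model standardized its output, an inverse would additionally have to undo that scaling before applying the branch formulas above.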
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the PowerTransform class also inherits methods from the PALBase class. Please refer to PAL Base for more details.