PowerTransform

class hana_ml.algorithms.pal.preprocessing.PowerTransform(method=None, standardize=None, max_iter=None, tol=None, interval=None, interval_hint=None, thread_ratio=None)

This class implements a Python interface for the power transform algorithm in PAL. Power Transform is a family of non-linear transformation methods that can stabilize the variance of data, minimize its skewness, and make its distribution approximately Gaussian.

Power Transform supports Box-Cox transformation and Yeo-Johnson transformation. Both transformations are monotonically increasing functions with one hyper-parameter, nominally \(\lambda\).

Box-Cox transformation is restricted to positive data only, with the mathematical formula given as follows:

\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0\\ \ln{(x)}, & \text{ if }\lambda = 0 \end{cases}\end{split}\]

where \(x^{(\lambda)}\) represents the value after transform. In contrast, the Yeo-Johnson transformation can be applied to any real data, with the following mathematical formula:

\[\begin{split}x^{(\lambda)} = \begin{cases} \frac{(x+1)^{\lambda} - 1}{\lambda}, &\text{ if }\lambda\neq 0, x\geq 0\\ \ln{(x+1)}, &\text{ if }\lambda = 0, x\geq 0\\ -\frac{(1-x)^{(2-\lambda)} - 1}{2-\lambda}, &\text{ if }\lambda\neq 2, x < 0\\ -\ln{(1-x)}, &\text{ if }\lambda = 2, x < 0 \end{cases}\end{split}\]
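The two formulas above can be written out as plain Python for a single value. This is only an illustrative sketch using the standard library; the function names `box_cox` and `yeo_johnson` are hypothetical and are not part of hana_ml or PAL.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value x (illustrative helper)."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive data only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one real value x (illustrative helper)."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative branch: the exponent is 2 - lambda, with a special case at lambda = 2.
    if lam == 2:
        return -math.log(1.0 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)
```

Note that with \(\lambda = 1\) the Yeo-Johnson formula reduces to the identity on both branches, while Box-Cox with \(\lambda = 1\) shifts the data by -1.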

For a given collection of data, the hyperparameter \(\lambda\) can be estimated by maximizing the log-likelihood function.
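As a rough illustration of likelihood-based estimation, the sketch below evaluates the commonly used Box-Cox profile log-likelihood (under a normality assumption on the transformed data) on a grid of candidate values and keeps the best one. The grid search, the function names, and the `steps` argument are assumptions made for this example; the optimizer PAL actually uses is not described here.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def box_cox_log_likelihood(data, lam):
    """Profile log-likelihood of lam, up to an additive constant.

    Assumes data is strictly positive and not constant
    (a constant sample would give zero variance).
    """
    n = len(data)
    y = [box_cox(x, lam) for x in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    # Variance term plus the log-Jacobian of the transform.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in data)


def estimate_lambda(data, interval=(-2.0, 2.0), steps=401):
    """Pick the grid point inside `interval` with the highest log-likelihood."""
    lo, hi = interval
    candidates = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda lam: box_cox_log_likelihood(data, lam))
```

A grid is the simplest way to show the idea; a real implementation would more likely use a one-dimensional numerical optimizer with an iteration limit and a tolerance.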

Parameters:
method: {'boxcox', 'yeojohnson'}, optional

Specifies the type of power transformation.

  • 'boxcox': Box-Cox transformation.

  • 'yeojohnson': Yeo-Johnson transformation.

Defaults to 'yeojohnson'.

standardize: bool, optional

Specifies whether or not to standardize the result of the power transformation before returning it.

Defaults to True.

max_iter: int, optional

Specifies the maximum number of iterations for fitting the power parameter \(\lambda\).

If convergence is not reached within the specified number of iterations, an error is raised. In that case, increase max_iter and retry.

Defaults to 500.

tol: float, optional

Specifies the absolute tolerance that controls the accuracy of the fitted parameter \(\lambda\). The value should be positive but less than 1.

Defaults to 1e-11.

interval: list, optional

Specifies the global search interval for the power parameter \(\lambda\) as a list of two numbers: the interval start followed by the interval end. The interval start must be less than the interval end.

Defaults to [-2.0, 2.0].

interval_hint: bool, optional

Specifies whether or not to use the specified interval as a hint. If True, the specified interval is used only as the initial search range, and the final power parameter \(\lambda\) may fall outside it.

Defaults to False (meaning that the final \(\lambda\) must fall within the specified interval).

thread_ratio: float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored, and the number of threads to use is heuristically determined.

Defaults to 1.0.
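To make the effect of standardize concrete, the sketch below rescales a list of already transformed values to zero mean and unit variance. It is an assumption that the population standard deviation is used; this page does not state which estimator PAL applies, and the helper name `standardize` is hypothetical.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit (population) variance.

    Assumes the values are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]
```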

Examples

>>> data.collect()
    ID        X1          X2          X3
 0   0  6.991913    4.670970    4.477528
 1   1  4.160349    5.646341    5.472923
 2   2  6.255790    6.144606    7.999252
 3   3  5.420097    6.904864    6.962577
 4   4  5.793422    4.517881    4.953931
 5   5  6.830040    5.478249    5.714642
 6   6  4.471869    5.240600    4.191433
 7   7  3.759174    4.691744    3.820419
 8   8  4.908710    4.600478    5.902754
 9   9  4.904309    5.684147    4.954790
10  10  5.150544    7.554681    4.943971
11  11  6.821817    4.796386    4.470521
12  12  6.612000    4.299182    6.578032
13  13  6.485878    4.968078    5.259654
14  14  4.930266    4.193548    6.453566
15  15  6.920819    4.605443    6.267915
16  16  4.693543    9.056507    6.368288
17  17  5.502248    5.067995    4.516716
18  18  4.985586    5.942696    6.111872
19  19  4.621056    4.940330    5.064032

Create a 'PowerTransform' instance pt and apply it to the dataset above:

>>> from hana_ml.algorithms.pal.preprocessing import PowerTransform
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data, key='ID')

View the transformation result and check the model content:

>>> res.collect()
    ID  TRANSFORMED_X1  TRANSFORMED_X2  TRANSFORMED_X3
 0   0        1.409486       -0.704236       -1.024967
 1   1       -1.394199        0.452884        0.057642
 2   2        0.760291        0.869854        1.990395
 3   3       -0.037030        1.360660        1.302520
 4   4        0.327867       -0.943591       -0.475069
 5   5        1.270545        0.290057        0.285688
 6   6       -1.038098        0.037018       -1.388643
 7   7       -1.876366       -0.673235       -1.904550
 8   8       -0.562559       -0.812014        0.455319
 9   9       -0.567222        0.487824       -0.474133
10  10       -0.310117        1.680686       -0.485927
11  11        1.263432       -0.522116       -1.033553
12  12        1.080083       -1.322180        1.014463
13  13        0.968113       -0.291169       -0.153736
14  14       -0.539759       -1.522405        0.916789
15  15        1.348716       -0.804290        0.766759
16  16       -0.793501        2.197417        0.848533
17  17        0.044553       -0.165700       -0.977230
18  18       -0.481517        0.711676        0.636425
19  19       -0.872717       -0.327141       -0.356725
>>> pt.model_.collect()
    ROW_ID  MODEL_CONTENT
0        1  {"version":1.0,"method":"yeo-johnson","model-c...
Attributes:
model_: DataFrame

DataFrame containing the model info for power transformation, structured as follows:

  • 1st column: ROW_ID

  • 2nd column: MODEL_CONTENT

Available only after the class instance has been fitted.

result_: DataFrame

DataFrame containing the transformed result of the data fed to the fit() function.

Available and non-empty only after calling the fit_transform() function.

Methods

fit(data[, key, features, feature_interval, ...])

Fits the data to be transformed, estimating a separate transformation parameter for each feature.

fit_transform(data[, key, features, ...])

Fits the data to be transformed, estimating a separate transformation parameter for each feature, then applies the transformation to the training data and returns the result.

transform(data[, key, features, inverse, ...])

Data transformation based on trained power transform parameters.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

fit(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)

Fits the data to be transformed, estimating a separate transformation parameter for each feature.

Parameters:
data: DataFrame

DataFrame containing the data to apply the power transform to.

key: str, optional

Specifies the name of the ID column in data.

Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).

features: str or a list of str, optional

Specifies the names of the columns in data to apply the power transform to.

All columns in features must be of numerical data type.

Defaults to all non-key columns of data if not provided.

feature_interval: dict, optional

Specifies the power parameter search intervals for individual features as a dictionary. In each key-value pair of feature_interval, the key is the feature name and the value is the search interval for that feature.

For example, if data contains a feature 'X1' that needs to be transformed, and the power parameter for 'X1' should be searched within [-3, 3], then the key-value pair 'X1': [-3, 3] should be included in feature_interval. This overrides the global interval specified by the parameter interval in class initialization.

If not provided, the power parameter is searched within the global interval specified by the parameter interval in class initialization.

feature_interval_hint: dict, optional

Specifies, per feature, whether or not to use the specified interval as a hint.

In each key-value pair of feature_interval_hint, the key is the feature name and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global choice specified by the parameter interval_hint in class initialization.

If not provided, the value of the interval_hint parameter in class initialization is applied to all features.
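The override rule for the per-feature dictionaries can be pictured with a small standalone helper. The function `resolve_search_settings` is hypothetical and only mirrors the behavior described above: each feature takes its own entry when one is given and otherwise falls back to the class-level interval and interval_hint. It does not call hana_ml.

```python
def resolve_search_settings(features, interval, interval_hint,
                            feature_interval=None, feature_interval_hint=None):
    """Return {feature: (search_interval, use_as_hint)} after applying overrides."""
    feature_interval = feature_interval or {}
    feature_interval_hint = feature_interval_hint or {}
    return {
        name: (
            # Per-feature interval wins over the global one.
            feature_interval.get(name, interval),
            # Per-feature hint flag wins over the global one.
            feature_interval_hint.get(name, interval_hint),
        )
        for name in features
    }


settings = resolve_search_settings(
    features=['X1', 'X2', 'X3'],
    interval=[-2.0, 2.0],
    interval_hint=False,
    feature_interval={'X1': [-3, 3], 'X2': [-4, 4]},
    feature_interval_hint={'X1': True, 'X3': True},
)
```

Here 'X3' has no entry in feature_interval, so it keeps the global interval, while 'X2' has no entry in feature_interval_hint, so it keeps the global hint setting.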

Returns:
Self

A fitted object of class 'PowerTransform'.

Examples

>>> data.columns
['ID', 'X1', 'X2', 'X3']
>>> pt = PowerTransform(method='yeojohnson')
>>> pt.fit(data, key='ID', features=['X1', 'X2', 'X3'],
...        feature_interval={'X1':[-3, 3], 'X2':[-4, 4]},
...        feature_interval_hint={'X1': True, 'X3': True})
<hana_ml.algorithms.pal.preprocessing.PowerTransform at xxxxxxxxxx>
fit_transform(data, key=None, features=None, feature_interval=None, feature_interval_hint=None)

Fits the data to be transformed, estimating a separate transformation parameter for each feature, then applies the transformation to the training data and returns the result.

Parameters:
data: DataFrame

DataFrame containing the data to apply the power transform to.

key: str, optional

Specifies the name of the ID column in data.

Defaults to the index of data if data is indexed by a single column; otherwise key must be specified explicitly (i.e. it is mandatory).

features: str or a list of str, optional

Specifies the names of the columns in data to apply the power transform to.

All columns in features must be of numerical data type.

Defaults to all non-key columns of data if not provided.

feature_interval: dict, optional

Specifies the power parameter search intervals for individual features as a dictionary. In each key-value pair of feature_interval, the key is the feature name and the value is the search interval for that feature.

For example, if data contains a feature 'X1' that needs to be transformed, and the power parameter for 'X1' should be searched within [-3, 3], then the key-value pair 'X1': [-3, 3] should be included in feature_interval. This overrides the global interval specified by the parameter interval in class initialization.

If not provided, the power parameter is searched within the global interval specified by the parameter interval in class initialization.

feature_interval_hint: dict, optional

Specifies, per feature, whether or not to use the specified interval as a hint.

In each key-value pair of feature_interval_hint, the key is the feature name and the value is a bool indicating whether or not to use the specified interval as a hint for that feature. This overrides the global choice specified by the parameter interval_hint in class initialization.

If not provided, the value of the interval_hint parameter in class initialization is applied to all features.

Returns:
DataFrame

The transformed result of features selected from data.

Examples

>>> data.columns
['ID', 'X1', 'X2', 'X3']
>>> pt = PowerTransform(method='yeojohnson')
>>> res = pt.fit_transform(data, key='ID')
>>> res.collect()
    ID  TRANSFORMED_X1  TRANSFORMED_X2  TRANSFORMED_X3
.   ..             ...             ...             ...
transform(data, key=None, features=None, inverse=False, thread_ratio=None)

Data transformation based on trained power transform parameters.

Parameters:
data: DataFrame

Input data.

key: str, optional

The ID column.

Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.

features: str, optional

Specifies the features to apply the power transform to.

Defaults to all non-key columns if not provided.

inverse: bool, optional

Specifies whether or not to apply inverse power transform.

  • False : apply forward power transformation, i.e. transform from raw feature data to Gaussian-like data.

  • True : apply inverse power transformation, i.e. transform from Gaussian-like data to raw feature data.

Defaults to False.

thread_ratio: float, optional

Specifies the ratio of available threads to use.

  • 0: single thread.

  • 0~1: percentage.

  • Others: heuristically determined.

Defaults to 1.0.
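For intuition about the inverse option, the sketch below inverts the Yeo-Johnson formula for a single value and pairs it with the forward transform for a round-trip check. The function names are hypothetical and nothing here calls hana_ml. It ignores standardization, so for standardized output one would first have to undo the scaling, which is an assumption about the order of operations rather than documented behavior.

```python
import math


def yeo_johnson(x, lam):
    """Forward Yeo-Johnson transform of one real value."""
    if x >= 0:
        return math.log(x + 1.0) if lam == 0 else ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log(1.0 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def inverse_yeo_johnson(y, lam):
    """Map a transformed value y back to the raw scale.

    The forward transform preserves sign, so the sign of y selects the branch.
    Assumes y is a value the forward transform can actually produce.
    """
    if y >= 0:
        if lam == 0:
            return math.exp(y) - 1.0
        return (lam * y + 1.0) ** (1.0 / lam) - 1.0
    if lam == 2:
        return 1.0 - math.exp(-y)
    return 1.0 - (1.0 - (2.0 - lam) * y) ** (1.0 / (2.0 - lam))
```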

Returns:
DataFrame

Transformed result, structured as follows:

  • ID, the same type as key in data, row ID.

  • STRING_CONTENT, type VARCHAR, transformed features in JSON format.

Examples

>>> data.columns
['ID', 'X1', 'X2', 'X3']
>>> test_data.columns
['ID', 'X1', 'X2', 'X3']
>>> pt = PowerTransform(method='yeojohnson')
>>> pt.fit(data, key='ID')
<hana_ml.algorithms.pal.preprocessing.PowerTransform at xxxxxxxxxx>
>>> res = pt.transform(test_data, key='ID')
>>> res.collect()
    ID  TRANSFORMED_X1  TRANSFORMED_X2  TRANSFORMED_X3
.   ..             ...             ...             ...

Inherited Methods from PALBase

Besides the methods mentioned above, the PowerTransform class also inherits methods from the PALBase class. Please refer to PAL Base for more details.