condition_index

hana_ml.algorithms.pal.stats.condition_index(data, key=None, col=None, scaling=True, include_intercept=True, thread_ratio=None)

The condition index is used to detect collinearity among independent variables that are later used as predictors in a multiple linear regression model.
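For orientation, the quantity is commonly defined as follows. The symbols are notation introduced here, not names from the API: \lambda_i denotes the eigenvalue of the i-th principal component (the EIGENVALUE column of the result), \eta_i its condition index, and \kappa the condition number. The values in the Examples section are consistent with this definition up to rounding.

```latex
\eta_i = \sqrt{\frac{\lambda_{\max}}{\lambda_i}},
\qquad
\kappa = \max_i \eta_i
```

A large \eta_i indicates a near-linear dependency among the predictors along component i.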

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Defaults to the first column.

col : str or a list of str, optional

Name(s) of the feature column(s) to be processed.

If not given, it defaults to all non-ID columns.

scaling : bool, optional

Specifies whether the input data are scaled to have unit variance before the analysis.

False: No

True: Yes

Defaults to True.

include_intercept : bool, optional

Specifies whether the algorithm includes the intercept term in the calculation.

False: No

True: Yes

Defaults to True.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside the range are ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns:

DataFrames

First DataFrame: condition index results, structured as follows:

COMPONENT_ID, principal component ID.

EIGENVALUE, eigenvalue.

CONDITION_INDEX, condition index.

FEATURES, variance decomposition proportion for each variable (one column per feature, named after the feature).

INTERCEPT, variance decomposition proportion for the intercept term.

Second DataFrame: collinearity statistics, structured as follows. This DataFrame is empty if no collinearity problem has been detected.

STAT_NAME, name of the statistic: the condition number, or the name of a variable involved in the collinearity problem.

STAT_VALUE, value of the corresponding statistic.

Examples

Original data:

>>> df.collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
4   5  14.0  54.0  24.0  46.0

Apply the condition index function:

>>> from hana_ml.algorithms.pal.stats import condition_index
>>> res, stats = condition_index(df, key='ID', scaling=True,
                                 include_intercept=True, thread_ratio=0.1)
>>> res.collect()
  COMPONENT_ID  EIGENVALUE  CONDITION_INDEX        X1        X2        X3        X4  INTERCEPT
0       Comp_1   19.966688         1.000000  0.000012  0.000002  0.000010  0.000003   0.000002
1       Comp_2    0.020736        31.030738  0.008776  0.000210  0.031063  0.001251   0.000907
2       Comp_3    0.012260        40.355748  0.053472  0.002571  0.005315  0.000639   0.002710
3       Comp_4    0.000230       294.940696  0.205666  0.015224  0.006579  0.931121   0.246862
4       Comp_5    0.000086       480.735654  0.732074  0.981993  0.957034  0.066986   0.749518
>>> stats.collect()
          STAT_NAME  STAT_VALUE
0  CONDITION_NUMBER  480.735654
1                X1    0.732074
2                X2    0.981993
3                X3    0.957034
4         INTERCEPT    0.749518
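
The two outputs above can be cross-checked by hand. The following is a minimal sketch in plain Python, with the numbers copied from the printed results rather than fetched from a database. It assumes that the condition index is the square root of the largest eigenvalue divided by each eigenvalue, and that a variable counts as involved in the collinearity when its variance decomposition proportion in the last component exceeds 0.5. That 0.5 cut-off is a common rule of thumb assumed here for illustration; it is not stated on this page.

```python
import math

# Values copied from the res.collect() output above.
eigenvalues = [19.966688, 0.020736, 0.012260, 0.000230, 0.000086]
reported_ci = [1.000000, 31.030738, 40.355748, 294.940696, 480.735654]

# Condition index per component: sqrt(largest eigenvalue / eigenvalue).
lam_max = max(eigenvalues)
recomputed_ci = [math.sqrt(lam_max / lam) for lam in eigenvalues]

# Variance decomposition proportions of Comp_5, the component with the
# largest condition index.
comp_5 = {
    "X1": 0.732074,
    "X2": 0.981993,
    "X3": 0.957034,
    "X4": 0.066986,
    "INTERCEPT": 0.749518,
}

# Assumed rule of thumb (not taken from this page): a proportion above 0.5
# marks a variable as involved in the collinearity.
PROPORTION_THRESHOLD = 0.5
flagged = [name for name, p in comp_5.items() if p > PROPORTION_THRESHOLD]

for got, want in zip(recomputed_ci, reported_ci):
    print(f"recomputed={got:10.3f}  reported={want:10.3f}")
print(flagged)  # ['X1', 'X2', 'X3', 'INTERCEPT']
```

The recomputed indices differ slightly from the reported ones for the smallest eigenvalues because the printed eigenvalues are rounded to six decimals. The flagged names coincide with the variables listed in stats, and X4 (proportion 0.066986) is the only one left out.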