
hana_ml.algorithms.pal.stats.condition_index(data, key=None, col=None, scaling=True, include_intercept=True, thread_ratio=None)

Detects collinearity problems among the independent variables that are later used as predictors in a multiple linear regression model.


data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Defaults to the first column.

col : str or a list of str, optional

Name(s) of the feature column(s) to be processed.

If not given, it defaults to all non-ID columns.

scaling : bool, optional

Specifies whether the input data are scaled to have unit variance before the analysis.

  • False: No

  • True: Yes

Defaults to True.

include_intercept : bool, optional

Specifies whether the intercept term is included in the calculation.

  • False: No

  • True: Yes

Defaults to True.

thread_ratio : float, optional

Specifies the proportion of available threads to use, from 0 to 1. A value of 0 means a single thread, while 1 means all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.


First DataFrame: condition index results, structured as follows:

  • COMPONENT_ID, principal component ID.

  • EIGENVALUE, eigenvalue.

  • CONDITION_INDEX, condition index.

  • FEATURES, variance decomposition proportion for each variable (one column per feature).

  • INTERCEPT, variance decomposition proportion for the intercept term.

Second DataFrame: statistics results, structured as follows. This DataFrame is empty if no collinearity problem has been detected.

  • STAT_NAME, name of the statistic: the condition number, or the name of a variable involved in the collinearity problem.

  • STAT_VALUE, value of the corresponding statistic.
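
To make the result columns more concrete, the following is a minimal NumPy sketch of the classical collinearity diagnostics (eigenvalues, condition indices and variance decomposition proportions). It is not the PAL implementation. The function name `condition_index_sketch` is hypothetical, and scaling every column to unit Euclidean length is an assumption; a different common column norm would only rescale the eigenvalues and leave the condition indices and proportions unchanged.

```python
import numpy as np


def condition_index_sketch(X, scaling=True, include_intercept=True):
    """Illustrative collinearity diagnostics; hypothetical helper, not PAL code."""
    X = np.asarray(X, dtype=float)
    if include_intercept:
        # Represent the intercept term by a column of ones.
        X = np.column_stack([X, np.ones(X.shape[0])])
    if scaling:
        # Assumption: scale each column to unit Euclidean length.
        X = X / np.linalg.norm(X, axis=0)

    # Singular values of X are the square roots of the eigenvalues of X^T X.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    eigenvalues = s ** 2
    # Condition index of a component: largest singular value over its own.
    cond_index = s.max() / s

    # phi[k, j] = v_kj^2 / s_j^2 is the share of component j in the
    # variance of coefficient k, before normalization.
    phi = (vt.T / s) ** 2
    # Normalize per variable, then put components on the rows.
    proportions = (phi / phi.sum(axis=1, keepdims=True)).T
    return eigenvalues, cond_index, proportions
```

Each row of `proportions` corresponds to one component and each column to one variable (with the intercept last when `include_intercept` is True), so every column sums to 1.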


Original data:

>>> df.collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
4   5  14.0  54.0  24.0  46.0

Apply the condition index function:

>>> from hana_ml.algorithms.pal.stats import condition_index
>>> res, stats = condition_index(data=df, key='ID', scaling=True,
                                 include_intercept=True, thread_ratio=0.1)
>>> res.collect()
   COMPONENT_ID  EIGENVALUE  CONDITION_INDEX        X1        X2        X3        X4  INTERCEPT
0       Comp_1   19.966688         1.000000  0.000012  0.000002  0.000010  0.000003   0.000002
1       Comp_2    0.020736        31.030738  0.008776  0.000210  0.031063  0.001251   0.000907
2       Comp_3    0.012260        40.355748  0.053472  0.002571  0.005315  0.000639   0.002710
3       Comp_4    0.000230       294.940696  0.205666  0.015224  0.006579  0.931121   0.246862
4       Comp_5    0.000086       480.735654  0.732074  0.981993  0.957034  0.066986   0.749518
>>> stats.collect()
          STAT_NAME  STAT_VALUE
0  CONDITION_NUMBER  480.735654
1                X1    0.732074
2                X2    0.981993
3                X3    0.957034
4         INTERCEPT    0.749518
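
The relationship between the two outputs can be sketched with plain pandas. The snippet below rebuilds the first result by hand from the values printed above; the column labels are assumed from the Returns description and from matching the proportions against the second output. The thresholds are common rules of thumb, not values confirmed for PAL, and looking only at the component with the largest condition index is likewise an assumption.

```python
import pandas as pd

# Values copied from the res.collect() output above; in practice this
# would be the collected result itself.
res_pd = pd.DataFrame({
    "COMPONENT_ID": ["Comp_1", "Comp_2", "Comp_3", "Comp_4", "Comp_5"],
    "EIGENVALUE": [19.966688, 0.020736, 0.012260, 0.000230, 0.000086],
    "CONDITION_INDEX": [1.000000, 31.030738, 40.355748, 294.940696, 480.735654],
    "X1": [0.000012, 0.008776, 0.053472, 0.205666, 0.732074],
    "X2": [0.000002, 0.000210, 0.002571, 0.015224, 0.981993],
    "X3": [0.000010, 0.031063, 0.005315, 0.006579, 0.957034],
    "X4": [0.000003, 0.001251, 0.000639, 0.931121, 0.066986],
    "INTERCEPT": [0.000002, 0.000907, 0.002710, 0.246862, 0.749518],
})

CI_THRESHOLD = 30.0    # assumed rule of thumb for a harmful condition index
PROP_THRESHOLD = 0.5   # assumed rule of thumb for an involved variable

# The condition number is the largest condition index.
worst = res_pd.loc[res_pd["CONDITION_INDEX"].idxmax()]
condition_number = worst["CONDITION_INDEX"]

# Keep only the proportion columns of that component.
props = worst.drop(["COMPONENT_ID", "EIGENVALUE", "CONDITION_INDEX"]).astype(float)
involved = props[props > PROP_THRESHOLD]

if condition_number > CI_THRESHOLD:
    print("Condition number:", condition_number)
    print(involved)
```

With the values above, this selects X1, X2, X3 and INTERCEPT for the component whose condition index is 480.735654, which agrees with the rows shown in the second output.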