condition_index

hana_ml.algorithms.pal.stats.condition_index(data, key=None, col=None, scaling=True, include_intercept=True, thread_ratio=None)

Detects collinearity among the independent variables that will later be used as predictors in a multiple linear regression model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Defaults to the first column.

col : str or a list of str, optional

Name(s) of the feature column(s) to be processed.

If not given, it defaults to all non-ID columns.

scaling : bool, optional

Specifies whether the input data are scaled to have unit variance before the analysis.

  • False: No

  • True: Yes

Defaults to True.

include_intercept : bool, optional

Specifies whether the algorithm includes the intercept term in the calculation.

  • False: No

  • True: Yes

Defaults to True.

thread_ratio : float, optional

Specifies the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

Returns:
DataFrames

Condition index results, structured as follows:

  • COMPONENT_ID, principal component ID.

  • EIGENVALUE, eigenvalue.

  • CONDITION_INDEX, condition index.

  • FEATURES, variance decomposition proportion for each variable (one column per feature, named after that feature).

  • INTERCEPT, variance decomposition proportion for the intercept term.

Distinct values results (this second DataFrame is empty if no collinearity problem is detected), structured as follows:

  • STAT_NAME, name of the statistic: the condition number, or the name of a variable involved in the collinearity problem.

  • STAT_VALUE, value of the corresponding statistic.
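The columns of the first DataFrame correspond to the classical collinearity diagnostics (eigenvalues, condition indices, and variance decomposition proportions). The following NumPy sketch shows one common way to compute them. It is not PAL's implementation: the helper name `condition_index_sketch` is hypothetical, and the choice to scale every column (including the intercept) to unit Euclidean length is an assumption, so eigenvalues may differ from PAL's by a constant factor.

```python
import numpy as np


def condition_index_sketch(X, scaling=True, include_intercept=True):
    """Hypothetical helper: textbook collinearity diagnostics, not PAL's code."""
    X = np.asarray(X, dtype=float)
    if include_intercept:
        # Append a constant column so the intercept takes part in the analysis.
        X = np.column_stack([X, np.ones(X.shape[0])])
    if scaling:
        # Assumption: scale each column to unit Euclidean length.
        X = X / np.linalg.norm(X, axis=0)

    # X = U diag(s) V^T; the eigenvalues of X^T X are s**2 (sorted descending).
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    eigenvalues = s ** 2
    cond_index = s[0] / s  # sqrt(lambda_max / lambda_k)

    # phi[j, k] = v_jk^2 / lambda_k; normalising over k gives the share of
    # variable j's coefficient variance attributed to component k.
    phi = (vt.T ** 2) / eigenvalues
    proportions = (phi / phi.sum(axis=1, keepdims=True)).T  # rows = components
    return eigenvalues, cond_index, proportions


# Feature columns X1..X4 from the example on this page.
data = [
    [12.0, 52.0, 20.0, 44.0],
    [12.0, 57.0, 25.0, 45.0],
    [12.0, 54.0, 21.0, 45.0],
    [13.0, 52.0, 21.0, 46.0],
    [14.0, 54.0, 24.0, 46.0],
]
eigenvalues, cond_index, proportions = condition_index_sketch(data)
print(cond_index)   # one condition index per component, first is 1.0
print(proportions)  # each column (variable) sums to 1 across components
```

Using the singular value decomposition instead of forming X^T X explicitly avoids squaring the condition number of an already ill-conditioned matrix.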

Examples

Original data:

>>> df.collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
4   5  14.0  54.0  24.0  46.0

Apply the condition index function:

>>> from hana_ml.algorithms.pal.stats import condition_index
>>> res, stats = condition_index(data=df, key='ID', scaling=True,
...                              include_intercept=True, thread_ratio=0.1)
>>> res.collect()
  COMPONENT_ID  EIGENVALUE  CONDITION_INDEX        X1        X2        X3        X4  INTERCEPT
0       Comp_1   19.966688         1.000000  0.000012  0.000002  0.000010  0.000003   0.000002
1       Comp_2    0.020736        31.030738  0.008776  0.000210  0.031063  0.001251   0.000907
2       Comp_3    0.012260        40.355748  0.053472  0.002571  0.005315  0.000639   0.002710
3       Comp_4    0.000230       294.940696  0.205666  0.015224  0.006579  0.931121   0.246862
4       Comp_5    0.000086       480.735654  0.732074  0.981993  0.957034  0.066986   0.749518
>>> stats.collect()
          STAT_NAME  STAT_VALUE
0  CONDITION_NUMBER  480.735654
1                X1    0.732074
2                X2    0.981993
3                X3    0.957034
4         INTERCEPT    0.749518
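To read the first result table, a common rule of thumb is to look for components with a large condition index on which two or more variables have a high variance decomposition proportion. The sketch below applies that rule to the rows shown above. The helper `flag_collinearity` is hypothetical, and the thresholds (30 and 0.5) are assumed conventional values, not parameters documented for this function. The rows are typed in by hand here; with a live connection, `res.collect().to_dict('records')` would yield a list of the same shape.

```python
# Rows copied from the res.collect() output above (EIGENVALUE omitted).
rows = [
    {"COMPONENT_ID": "Comp_1", "CONDITION_INDEX": 1.000000,
     "X1": 0.000012, "X2": 0.000002, "X3": 0.000010, "X4": 0.000003, "INTERCEPT": 0.000002},
    {"COMPONENT_ID": "Comp_2", "CONDITION_INDEX": 31.030738,
     "X1": 0.008776, "X2": 0.000210, "X3": 0.031063, "X4": 0.001251, "INTERCEPT": 0.000907},
    {"COMPONENT_ID": "Comp_3", "CONDITION_INDEX": 40.355748,
     "X1": 0.053472, "X2": 0.002571, "X3": 0.005315, "X4": 0.000639, "INTERCEPT": 0.002710},
    {"COMPONENT_ID": "Comp_4", "CONDITION_INDEX": 294.940696,
     "X1": 0.205666, "X2": 0.015224, "X3": 0.006579, "X4": 0.931121, "INTERCEPT": 0.246862},
    {"COMPONENT_ID": "Comp_5", "CONDITION_INDEX": 480.735654,
     "X1": 0.732074, "X2": 0.981993, "X3": 0.957034, "X4": 0.066986, "INTERCEPT": 0.749518},
]


def flag_collinearity(rows, ci_threshold=30.0, prop_threshold=0.5):
    """Hypothetical helper; both thresholds are assumed rules of thumb."""
    non_variable_columns = {"COMPONENT_ID", "EIGENVALUE", "CONDITION_INDEX"}
    flagged = {}
    for row in rows:
        if row["CONDITION_INDEX"] <= ci_threshold:
            continue
        # Variables whose variance is dominated by this component.
        involved = [name for name, value in row.items()
                    if name not in non_variable_columns and value > prop_threshold]
        # A near-dependency needs at least two variables.
        if len(involved) >= 2:
            flagged[row["COMPONENT_ID"]] = involved
    return flagged


flagged = flag_collinearity(rows)
print(flagged)  # {'Comp_5': ['X1', 'X2', 'X3', 'INTERCEPT']}
```

Under these assumed thresholds, the flagged variables coincide with the ones listed in the `stats` output above; Comp_4 is not flagged because only X4 exceeds the proportion threshold there.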