condition_index
- hana_ml.algorithms.pal.stats.condition_index(data, key=None, col=None, scaling=True, include_intercept=True, thread_ratio=None)
Detects collinearity problems among independent variables that are later used as predictors in a multiple linear regression model.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the first column.
- col : str or a list of str, optional
Name(s) of the feature column(s) to be processed.
If not given, it defaults to all non-ID columns.
- scaling : bool, optional
Specifies whether the input data are scaled to have unit variance before the analysis.
False: No
True: Yes
Defaults to True.
- include_intercept : bool, optional
Specifies whether the algorithm considers the intercept during the calculation.
False: No
True: Yes
Defaults to True.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- Returns:
- DataFrames
Condition index results, structured as follows:
COMPONENT_ID, principal component ID.
EIGENVALUE, eigenvalue.
CONDITION_INDEX, Condition index.
FEATURES, variance decomposition proportion for each variable.
INTERCEPT, variance decomposition proportion for the intercept term.
The second DataFrame is empty if no collinearity problem has been detected. Otherwise it holds the collinearity statistics, structured as follows:
STAT_NAME, name of the statistic: the condition number, and the names of the variables involved in the collinearity problem.
STAT_VALUE, value of the corresponding statistic.
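To make the returned columns concrete, here is a minimal NumPy sketch of one common way to compute eigenvalues, condition indices and variance decomposition proportions (Belsley-style diagnostics based on a singular value decomposition). It is an illustration only, not PAL's implementation: the function name `condition_index_sketch` is hypothetical, and scaling each column to unit length without centering is an assumption, so the numbers need not match the PAL output exactly.

```python
import numpy as np

def condition_index_sketch(X, scaling=True, include_intercept=True):
    """Belsley-style collinearity diagnostics for a 2-D array (rows = observations)."""
    X = np.asarray(X, dtype=float)
    if include_intercept:
        # Append a column of ones to represent the intercept term.
        X = np.column_stack([X, np.ones(X.shape[0])])
    if scaling:
        # Assumption: scale every column to unit length, without centering.
        X = X / np.linalg.norm(X, axis=0)
    # The singular values of X are the square roots of the eigenvalues of X'X.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    eigenvalues = s ** 2
    # Condition index of component k: largest singular value divided by the k-th one.
    cond_index = s.max() / s
    # phi[k, j]: contribution of component k to the variance of coefficient j.
    phi = (vt / s[:, None]) ** 2
    # Normalise so that the proportions for each variable sum to 1 over all components.
    proportions = phi / phi.sum(axis=0)
    return eigenvalues, cond_index, proportions

# Feature values X1..X4 taken from the example data in this section.
X = np.array([
    [12.0, 52.0, 20.0, 44.0],
    [12.0, 57.0, 25.0, 45.0],
    [12.0, 54.0, 21.0, 45.0],
    [13.0, 52.0, 21.0, 46.0],
    [14.0, 54.0, 24.0, 46.0],
])
eigenvalues, cond_index, proportions = condition_index_sketch(X)
print(cond_index)
print(proportions.round(6))
```

In this layout each row of `proportions` corresponds to one component and each column to one variable, with the intercept as the last column. This mirrors the FEATURES and INTERCEPT columns described above.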
Examples
Original data:
>>> df.collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
4   5  14.0  54.0  24.0  46.0
Apply the condition index function:
>>> res, stats = condition_index(data=df, key='ID', scaling=True, include_intercept=True, thread_ratio=0.1)
>>> res.collect()
  COMPONENT_ID  EIGENVALUE  CONDITION_INDEX        X1        X2        X3        X4  INTERCEPT
0       Comp_1   19.966688         1.000000  0.000012  0.000002  0.000010  0.000003   0.000002
1       Comp_2    0.020736        31.030738  0.008776  0.000210  0.031063  0.001251   0.000907
2       Comp_3    0.012260        40.355748  0.053472  0.002571  0.005315  0.000639   0.002710
3       Comp_4    0.000230       294.940696  0.205666  0.015224  0.006579  0.931121   0.246862
4       Comp_5    0.000086       480.735654  0.732074  0.981993  0.957034  0.066986   0.749518
>>> stats.collect()
          STAT_NAME  STAT_VALUE
0  CONDITION_NUMBER  480.735654
1                X1    0.732074
2                X2    0.981993
3                X3    0.957034
4         INTERCEPT    0.749518
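The link between the two results can be illustrated with a short pandas sketch. It rebuilds the `res.collect()` output above as a local pandas DataFrame, takes the component with the largest condition index, and keeps the variables whose variance decomposition proportion exceeds 0.5. The 0.5 cutoff is an assumed rule of thumb that happens to reproduce the `stats` rows shown here. It is not a documented PAL setting, and the names `res_pd` and `PROPORTION_THRESHOLD` are hypothetical.

```python
import pandas as pd

# Values copied from the res.collect() output in this example.
res_pd = pd.DataFrame({
    "COMPONENT_ID": ["Comp_1", "Comp_2", "Comp_3", "Comp_4", "Comp_5"],
    "EIGENVALUE": [19.966688, 0.020736, 0.012260, 0.000230, 0.000086],
    "CONDITION_INDEX": [1.000000, 31.030738, 40.355748, 294.940696, 480.735654],
    "X1": [0.000012, 0.008776, 0.053472, 0.205666, 0.732074],
    "X2": [0.000002, 0.000210, 0.002571, 0.015224, 0.981993],
    "X3": [0.000010, 0.031063, 0.005315, 0.006579, 0.957034],
    "X4": [0.000003, 0.001251, 0.000639, 0.931121, 0.066986],
    "INTERCEPT": [0.000002, 0.000907, 0.002710, 0.246862, 0.749518],
})

# Assumed rule-of-thumb cutoff, not a documented PAL parameter.
PROPORTION_THRESHOLD = 0.5

# The condition number is the largest condition index.
worst = res_pd.loc[res_pd["CONDITION_INDEX"].idxmax()]
condition_number = float(worst["CONDITION_INDEX"])

# Keep only the variance decomposition proportions of that component.
proportions = worst.drop(["COMPONENT_ID", "EIGENVALUE", "CONDITION_INDEX"]).astype(float)
involved = proportions[proportions > PROPORTION_THRESHOLD]

print(condition_number)
print(involved)
```

With the values above, `involved` contains X1, X2, X3 and INTERCEPT, the same variables listed in `stats`. X4 is left out because its proportion on that component is only 0.066986.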