factor_analysis
- hana_ml.algorithms.pal.stats.factor_analysis(data, key, factor_num, col=None, method=None, rotation=None, score=None, matrix=None, kappa=None)
Factor analysis is a statistical method that tries to extract a low number of unobserved variables, i.e. factors, that can best describe the covariance pattern of a larger set of observed variables.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- factor_num : int
Number of factors.
- col : str or list of str, optional
Names of the feature columns.
- method : {'principle_component'}, optional
Specifies the method used for factor analysis.
Currently SAP HANA PAL only supports the principal component method.
- rotation : {'non', 'varimax', 'promax'}, optional
Specifies the rotation to be performed on the loadings.
Defaults to 'varimax'.
- score : {'non', 'regression'}, optional
Specifies the method used to compute factor scores.
Defaults to 'regression'.
- matrix : {'covariance', 'correlation'}, optional
Specifies whether the covariance matrix or the correlation matrix is used to perform factor analysis.
Defaults to 'correlation'.
- kappa : float, optional
Power of the promax rotation. Only valid when rotation is 'promax'.
Defaults to 4.
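The option values and defaults listed above can be summarised in a short sketch. The helper `resolve_params` below is hypothetical and not part of hana_ml; it only mirrors the documented choices. Two points are assumptions: no default is documented for method, so it is left as None, and kappa is simply dropped when the rotation is not 'promax' (how PAL itself handles that case is not stated here).

```python
# Hypothetical helper, NOT part of hana_ml: it mirrors the documented options only.
ALLOWED = {
    "method": {"principle_component"},  # spelling as documented for the API
    "rotation": {"non", "varimax", "promax"},
    "score": {"non", "regression"},
    "matrix": {"covariance", "correlation"},
}
# Defaults as documented; 'method' has no documented default.
DEFAULTS = {"rotation": "varimax", "score": "regression", "matrix": "correlation"}
DEFAULT_KAPPA = 4


def resolve_params(method=None, rotation=None, score=None, matrix=None, kappa=None):
    """Return the effective option values after applying documented defaults."""
    chosen = {"method": method, "rotation": rotation, "score": score, "matrix": matrix}
    resolved = {}
    for name, value in chosen.items():
        if value is None:
            value = DEFAULTS.get(name)  # stays None when no default is documented
        elif value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name])}")
        resolved[name] = value
    if resolved["rotation"] == "promax":
        resolved["kappa"] = DEFAULT_KAPPA if kappa is None else kappa
    else:
        resolved["kappa"] = None  # assumption: kappa is ignored without promax
    return resolved
```

For example, `resolve_params(rotation='promax')` yields the documented defaults for score and matrix together with a kappa of 4.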
- Returns
- DataFrame
DataFrame 1: Eigenvalues, structured as follows:
FACTOR_ID: factor id.
EIGENVALUE: Eigenvalue (i.e. variance explained).
VAR_PROP: Variance proportion to the total variance explained.
CUM_VAR_PROP: Cumulative variance proportion to the total variance explained.
DataFrame 2: Variance explanation, structured as follows:
FACTOR_ID: factor id.
VAR: Variance explained without rotation.
VAR_PROP: Variance proportion to the total variance explained without rotation.
CUM_VAR_PROP: Cumulative variance proportion to the total variance explained without rotation.
ROT_VAR: Variance explained with rotation.
ROT_VAR_PROP: Variance proportion to the total variance explained with rotation. Note that there is no rotated variance proportion when performing oblique rotation since the rotated factors are correlated.
ROT_CUM_VAR_PROP: Cumulative variance proportion to the total variance explained with rotation.
DataFrame 3: Communalities, structured as follows:
NAME: variable name.
OBSERVED_VARS: Communalities of the observed variables.
DataFrame 4: Loadings, structured as follows:
FACTOR_ID: Factor id.
LOADINGS_ + OBSERVED_VARS: Loadings.
DataFrame 5: Rotated loadings, structured as follows:
FACTOR_ID: Factor id.
ROT_LOADINGS_ + OBSERVED_VARS: Rotated loadings.
DataFrame 6: Structure, structured as follows:
FACTOR_ID: Factor id.
STRUCTURE+OBSERVED_VARS: Structure matrix. It is empty when rotation is not oblique.
DataFrame 7: Rotation, structured as follows:
ROTATION: rotation
ROTATION_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Rotation matrix.
DataFrame 8: Factor correlation, structured as follows:
FACTOR_ID: Factor id.
FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Factor correlation matrix. It is empty when rotation is not oblique.
DataFrame 9: Score model, structured as follows:
NAME: Factor id, MEAN, SD.
OBSERVED_VARS (in input table) column name: Score coefficients, means and standard deviations of observed variables.
DataFrame 10: Scores, structured as follows:
FACTOR_ID: Factor id.
FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Scores.
DataFrame 11: Statistics, placeholder for future features, structured as follows:
STAT_NAME: statistic name.
STAT_VALUE: statistic value.
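Because the function returns eleven DataFrames in a fixed order, it can be convenient to address them by name rather than by index. The sketch below assumes the result behaves like a sequence of eleven items in the order listed above; `label_results` and the key names are hypothetical and are not part of hana_ml.

```python
# Hypothetical helper, NOT part of hana_ml.
# Key names are illustrative; their order follows the Returns list above.
RESULT_NAMES = (
    "eigenvalues",           # DataFrame 1
    "variance_explanation",  # DataFrame 2
    "communalities",         # DataFrame 3
    "loadings",              # DataFrame 4
    "rotated_loadings",      # DataFrame 5
    "structure",             # DataFrame 6
    "rotation",              # DataFrame 7
    "factor_correlation",    # DataFrame 8
    "score_model",           # DataFrame 9
    "scores",                # DataFrame 10
    "statistics",            # DataFrame 11
)


def label_results(results):
    """Map the returned sequence of DataFrames to a dict keyed by name."""
    results = tuple(results)
    if len(results) != len(RESULT_NAMES):
        raise ValueError(
            f"expected {len(RESULT_NAMES)} results, got {len(results)}"
        )
    return dict(zip(RESULT_NAMES, results))
```

With this in place, `label_results(res)['loadings']` refers to the same object as `res[3]`.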
Examples
Original data:
>>> df.collect()
    ID   X1   X2   X3   X4   X5   X6
0    1  1.0  1.0  3.0  3.0  1.0  1.0
1    2  1.0  2.0  3.0  3.0  1.0  1.0
2    3  1.0  1.0  3.0  4.0  1.0  1.0
3    4  1.0  1.0  3.0  3.0  1.0  2.0
4    5  1.0  1.0  3.0  3.0  1.0  1.0
5    6  1.0  1.0  1.0  1.0  3.0  3.0
6    7  1.0  2.0  1.0  1.0  3.0  3.0
7    8  1.0  1.0  1.0  2.0  3.0  3.0
8    9  1.0  2.0  1.0  1.0  3.0  4.0
9   10  1.0  1.0  1.0  1.0  3.0  3.0
10  11  3.0  3.0  1.0  1.0  1.0  1.0
11  12  3.0  4.0  1.0  1.0  1.0  1.0
12  13  3.0  3.0  1.0  2.0  1.0  1.0
13  14  3.0  3.0  1.0  1.0  1.0  2.0
14  15  3.0  3.0  1.0  1.0  1.0  1.0
15  16  4.0  4.0  5.0  5.0  6.0  6.0
16  17  5.0  6.0  4.0  6.0  4.0  5.0
17  18  6.0  5.0  6.0  4.0  5.0  4.0
Apply the factor_analysis function:
>>> from hana_ml.algorithms.pal.stats import factor_analysis
>>> res = factor_analysis(data=df, key='ID', factor_num=2, rotation='promax', matrix='correlation')
>>> res[0].collect()
  FACTOR_ID  EIGENVALUE  VAR_PROP  CUM_VAR_PROP
0  FACTOR_1    3.696031  0.616005      0.616005
1  FACTOR_2    1.073114  0.178852      0.794858
2  FACTOR_3    1.000774  0.166796      0.961653
3  FACTOR_4    0.161003  0.026834      0.988487
4  FACTOR_5    0.040961  0.006827      0.995314
5  FACTOR_6    0.028116  0.004686      1.000000
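To see how the columns of the eigenvalue table relate to one another, the sketch below recomputes the proportions from the EIGENVALUE column alone. It uses only the rounded values printed in the example; it assumes VAR_PROP is each eigenvalue divided by the sum of all eigenvalues and CUM_VAR_PROP is the running total of VAR_PROP, which agrees with the printed figures to within rounding. The variable names are illustrative.

```python
from itertools import accumulate

# EIGENVALUE column copied from the example output (rounded to 6 decimals).
eigenvalues = [3.696031, 1.073114, 1.000774, 0.161003, 0.040961, 0.028116]

# With matrix='correlation' and six observed variables the eigenvalues
# sum to (approximately) 6, the trace of the correlation matrix.
total = sum(eigenvalues)

# Assumed definitions: share of the total, and its running sum.
var_prop = [value / total for value in eigenvalues]
cum_var_prop = list(accumulate(var_prop))

for i, (ev, vp, cvp) in enumerate(zip(eigenvalues, var_prop, cum_var_prop), start=1):
    print(f"FACTOR_{i}  {ev:.6f}  {vp:.6f}  {cvp:.6f}")
```

In this example the first two factors, which is the number requested through factor_num=2, account for roughly 79% of the total variance.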