factor_analysis

hana_ml.algorithms.pal.stats.factor_analysis(data, key, factor_num, col=None, method=None, rotation=None, score=None, matrix=None, kappa=None)

Factor analysis is a statistical method that extracts a small number of unobserved variables, called factors, that best describe the covariance pattern of a larger set of observed variables.
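To make the idea concrete, here is a minimal NumPy sketch of the factor model itself, not of PAL's implementation. It assumes a hypothetical loading matrix (`loadings`) and simulated data; none of these names come from hana_ml. It shows that when observed variables are generated from a few latent factors plus variable-specific noise, their covariance is approximately `loadings @ loadings.T` plus a diagonal noise term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loading matrix: 6 observed variables driven by 2 latent factors.
loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.8],
    [0.0, 0.9],
    [0.2, 0.7],
])

# Variable-specific noise variances, chosen so every variable has unit variance.
uniqueness = 1.0 - np.sum(loadings**2, axis=1)

# Simulate uncorrelated standard-normal factors and independent noise.
n = 200_000
factors = rng.standard_normal((n, 2))
noise = rng.standard_normal((n, 6)) * np.sqrt(uniqueness)
observed = factors @ loadings.T + noise

# Covariance implied by the model versus covariance seen in the sample.
model_cov = loadings @ loadings.T + np.diag(uniqueness)
sample_cov = np.cov(observed, rowvar=False)
max_abs_diff = np.max(np.abs(sample_cov - model_cov))
print(max_abs_diff)
```

Factor analysis works in the opposite direction: it starts from the observed covariance (or correlation) matrix and estimates the loadings.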

Parameters:

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

factor_num : int

Number of factors.

col : str or a list of str, optional

Name of the feature columns.

method : {'principle_component'}, optional

Specifies the method used for factor analysis.

Currently SAP HANA PAL only supports the principal component method.

rotation : {'non', 'varimax', 'promax'}, optional

Specifies the rotation to be performed on the loadings.

Defaults to 'varimax'.

score : {'non', 'regression'}, optional

Specifies the method used to compute factor scores.

Defaults to 'regression'.

matrix : {'covariance', 'correlation'}, optional

Specifies whether the covariance matrix or the correlation matrix is used to perform factor analysis.

Defaults to 'correlation'.

kappa : float, optional

Power of the promax rotation. Only valid when rotation is 'promax'.

Defaults to 4.
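As an illustration of what the `rotation` parameter refers to, the following is a minimal NumPy sketch of the textbook varimax algorithm, an orthogonal rotation. It is not PAL's implementation; the function name `varimax`, its arguments, and the sample `unrotated` loadings are all hypothetical, and details such as Kaiser normalization are omitted.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Textbook varimax rotation without Kaiser normalization (illustrative only)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation matrix.
        column_ss = np.diag(np.sum(rotated**2, axis=0))
        grad = loadings.T @ (rotated**3 - rotated @ column_ss / p)
        # Project the gradient back onto the set of orthogonal matrices.
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings: 6 variables (rows) on 2 factors (columns).
unrotated = np.array([
    [0.7, 0.5],
    [0.8, 0.4],
    [0.6, 0.5],
    [0.6, -0.5],
    [0.7, -0.6],
    [0.5, -0.6],
])
rotated, rotation = varimax(unrotated)
print(np.round(rotated, 3))
```

Because the rotation matrix is orthogonal, each variable's sum of squared loadings is unchanged; only the way variance is distributed across factors changes. An oblique rotation such as promax does not preserve this property, since the rotated factors may be correlated.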

Returns:

DataFrames, returned in the following order:

DataFrame 1: Eigenvalues, structured as follows:

FACTOR_ID: Factor id.

EIGENVALUE: Eigenvalue (i.e. variance explained).

VAR_PROP: Proportion of the total variance explained by the factor.

CUM_VAR_PROP: Cumulative proportion of the total variance explained.

DataFrame 2: Variance explanation, structured as follows:

FACTOR_ID: Factor id.

VAR: Variance explained without rotation.

VAR_PROP: Proportion of the total variance explained without rotation.

CUM_VAR_PROP: Cumulative proportion of the total variance explained without rotation.

ROT_VAR: Variance explained with rotation.

ROT_VAR_PROP: Proportion of the total variance explained with rotation. Note that there is no rotated variance proportion when an oblique rotation is performed, since the rotated factors are correlated.

ROT_CUM_VAR_PROP: Cumulative proportion of the total variance explained with rotation.

DataFrame 3: Communalities, structured as follows:

NAME: Variable name.

OBSERVED_VARS: Communalities of the observed variables.

DataFrame 4: Loadings, structured as follows:

FACTOR_ID: Factor id.

LOADINGS_ + OBSERVED_VARS: Loadings.

DataFrame 5: Rotated loadings, structured as follows:

FACTOR_ID: Factor id.

ROT_LOADINGS_ + OBSERVED_VARS: Rotated loadings.

DataFrame 6: Structure, structured as follows:

FACTOR_ID: Factor id.

STRUCTURE + OBSERVED_VARS: Structure matrix. It is empty when the rotation is not oblique.

DataFrame 7: Rotation, structured as follows:

ROTATION: Rotation.

ROTATION_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS (in input table)): Rotation matrix.

DataFrame 8: Factor correlation, structured as follows:

FACTOR_ID: Factor id.

FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS (in input table)): Factor correlation matrix. It is empty when the rotation is not oblique.

DataFrame 9: Score model, structured as follows:

NAME: Factor id, MEAN, SD.

OBSERVED_VARS (in input table) column name: Score coefficients, means and standard deviations of observed variables.

DataFrame 10: Scores, structured as follows:

FACTOR_ID: Factor id.

FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS (in input table)): Scores.

DataFrame 11: Statistics, placeholder for future features, structured as follows:

STAT_NAME: statistic name.

STAT_VALUE: statistic value.
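The relationships between some of these columns can be sketched with NumPy. The eigenvalues below are copied from the output shown in the Examples section; the `loadings` array is hypothetical and only illustrates the usual definition of a communality as a variable's sum of squared loadings over the retained factors. This is an illustration of the standard definitions, not a statement about how PAL computes the values internally.

```python
import numpy as np

# Eigenvalues as printed in the Examples section of this page.
eigenvalues = np.array([3.696031, 1.073114, 1.000774, 0.161003, 0.040961, 0.028116])

# VAR_PROP: each eigenvalue as a share of the total variance.
var_prop = eigenvalues / eigenvalues.sum()

# CUM_VAR_PROP: running total of the proportions.
cum_var_prop = np.cumsum(var_prop)

print(np.round(var_prop, 6))
print(np.round(cum_var_prop, 6))

# Hypothetical loadings: 3 variables (rows) on 2 retained factors (columns).
loadings = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
])

# Communality of each variable: sum of its squared loadings.
communalities = np.sum(loadings**2, axis=1)
print(communalities)
```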

Examples

Original data:

>>> df.collect()
  ID   X1   X2   X3   X4   X5   X6
   1  1.0  1.0  3.0  3.0  1.0  1.0
   2  1.0  2.0  3.0  3.0  1.0  1.0
   3  1.0  1.0  3.0  4.0  1.0  1.0
   4  1.0  1.0  3.0  3.0  1.0  2.0
   5  1.0  1.0  3.0  3.0  1.0  1.0
   6  1.0  1.0  1.0  1.0  3.0  3.0
   7  1.0  2.0  1.0  1.0  3.0  3.0
   8  1.0  1.0  1.0  2.0  3.0  3.0
   9  1.0  2.0  1.0  1.0  3.0  4.0
  10  1.0  1.0  1.0  1.0  3.0  3.0
  11  3.0  3.0  1.0  1.0  1.0  1.0
  12  3.0  4.0  1.0  1.0  1.0  1.0
  13  3.0  3.0  1.0  2.0  1.0  1.0
  14  3.0  3.0  1.0  1.0  1.0  2.0
  15  3.0  3.0  1.0  1.0  1.0  1.0
  16  4.0  4.0  5.0  5.0  6.0  6.0
  17  5.0  6.0  4.0  6.0  4.0  5.0
  18  6.0  5.0  6.0  4.0  5.0  4.0

Apply the factor_analysis function:

>>> from hana_ml.algorithms.pal.stats import factor_analysis
>>> res = factor_analysis(data=df, key='ID', factor_num=2,
...                       rotation='promax',
...                       matrix='correlation')
>>> res[0].collect()
  FACTOR_ID  EIGENVALUE  VAR_PROP  CUM_VAR_PROP
0  FACTOR_1    3.696031  0.616005      0.616005
1  FACTOR_2    1.073114  0.178852      0.794858
2  FACTOR_3    1.000774  0.166796      0.961653
3  FACTOR_4    0.161003  0.026834      0.988487
4  FACTOR_5    0.040961  0.006827      0.995314
5  FACTOR_6    0.028116  0.004686      1.000000
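For a local cross-check that does not need a SAP HANA connection, the eigenvalues of the sample correlation matrix of the example data can be computed with NumPy. This is a sketch under the assumption that the EIGENVALUE column corresponds to an eigen-decomposition of the sample correlation matrix when matrix='correlation'; the variable names are illustrative and small numerical differences from the PAL output are possible.

```python
import numpy as np

# Columns X1..X6 of the example data above (the ID column is omitted).
X = np.array([
    [1, 1, 3, 3, 1, 1],
    [1, 2, 3, 3, 1, 1],
    [1, 1, 3, 4, 1, 1],
    [1, 1, 3, 3, 1, 2],
    [1, 1, 3, 3, 1, 1],
    [1, 1, 1, 1, 3, 3],
    [1, 2, 1, 1, 3, 3],
    [1, 1, 1, 2, 3, 3],
    [1, 2, 1, 1, 3, 4],
    [1, 1, 1, 1, 3, 3],
    [3, 3, 1, 1, 1, 1],
    [3, 4, 1, 1, 1, 1],
    [3, 3, 1, 2, 1, 1],
    [3, 3, 1, 1, 1, 2],
    [3, 3, 1, 1, 1, 1],
    [4, 4, 5, 5, 6, 6],
    [5, 6, 4, 6, 4, 5],
    [6, 5, 6, 4, 5, 4],
], dtype=float)

# Sample correlation matrix between the six observed variables.
corr = np.corrcoef(X, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to get largest first.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
print(np.round(eigenvalues, 6))
```

The eigenvalues of a correlation matrix sum to the number of variables (6 here), which is why the VAR_PROP values in the table add up to 1.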