factor_analysis

hana_ml.algorithms.pal.stats.factor_analysis(data, key, factor_num, col=None, method=None, rotation=None, score=None, matrix=None, kappa=None)

Factor analysis is a statistical method that tries to extract a small number of unobserved variables, i.e. factors, that best describe the covariance pattern of a larger set of observed variables.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

factor_num : int

Number of factors.

col : str or a list of str, optional

Names of the feature columns.

method : {'principle_component'}, optional

Specifies the method used for factor analysis.

Currently SAP HANA PAL only supports the principal component method.

rotation : {'non', 'varimax', 'promax'}, optional

Specifies the rotation to be performed on loadings.

Defaults to 'varimax'.

score : {'non', 'regression'}, optional

Specifies the method to compute factor scores.

Defaults to 'regression'.

matrix : {'covariance', 'correlation'}, optional

Specifies whether the covariance matrix or the correlation matrix is used to perform factor analysis.

Defaults to 'correlation'.

kappa : float, optional

Power of promax rotation. Only valid when rotation is 'promax'.

Defaults to 4.
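
To make the method and matrix parameters more concrete, the following is a minimal NumPy sketch of principal component extraction on a correlation matrix. It is an illustration only and does not use hana_ml or reproduce PAL's implementation; the array X repeats the feature columns X1 to X6 of the example data in the Examples section, and the variable names are illustrative.

```python
import numpy as np

# Feature columns X1..X6 of the example data set (ID column omitted).
X = np.array([
    [1, 1, 3, 3, 1, 1], [1, 2, 3, 3, 1, 1], [1, 1, 3, 4, 1, 1],
    [1, 1, 3, 3, 1, 2], [1, 1, 3, 3, 1, 1], [1, 1, 1, 1, 3, 3],
    [1, 2, 1, 1, 3, 3], [1, 1, 1, 2, 3, 3], [1, 2, 1, 1, 3, 4],
    [1, 1, 1, 1, 3, 3], [3, 3, 1, 1, 1, 1], [3, 4, 1, 1, 1, 1],
    [3, 3, 1, 2, 1, 1], [3, 3, 1, 1, 1, 2], [3, 3, 1, 1, 1, 1],
    [4, 4, 5, 5, 6, 6], [5, 6, 4, 6, 4, 5], [6, 5, 6, 4, 5, 4],
], dtype=float)
factor_num = 2

# Correlation matrix of the observed variables (columns are variables).
R = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of the symmetric matrix; eigh returns ascending
# eigenvalues, so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated loadings: leading eigenvectors scaled by sqrt(eigenvalue).
# The sign of each column is arbitrary.
loadings = eigvecs[:, :factor_num] * np.sqrt(eigvals[:factor_num])

# Communality of each variable: sum of its squared loadings.
communalities = (loadings ** 2).sum(axis=1)

print(eigvals)
print(communalities)
```

Because the trace of a correlation matrix equals the number of variables, the eigenvalues here sum to 6, and each communality is at most 1.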

Returns:
DataFrame

DataFrame 1: Eigenvalues, structured as follows:

  • FACTOR_ID: factor id.

  • EIGENVALUE: Eigenvalue (i.e. variance explained).

  • VAR_PROP: Variance proportion to the total variance explained.

  • CUM_VAR_PROP: Cumulative variance proportion to the total variance explained.

DataFrame 2: Variance explanation, structured as follows:

  • FACTOR_ID: factor id.

  • VAR: Variance explained without rotation.

  • VAR_PROP: Variance proportion to the total variance explained without rotation.

  • CUM_VAR_PROP: Cumulative variance proportion to the total variance explained without rotation.

  • ROT_VAR: Variance explained with rotation.

  • ROT_VAR_PROP: Variance proportion to the total variance explained with rotation. Note that there is no rotated variance proportion when performing oblique rotation since the rotated factors are correlated.

  • ROT_CUM_VAR_PROP: Cumulative variance proportion to the total variance explained with rotation.

DataFrame 3: Communalities, structured as follows:

  • NAME: Variable name.

  • OBSERVED_VARS: Communalities of the observed variables.

DataFrame 4: Loadings, structured as follows:

  • FACTOR_ID: Factor id.

  • LOADINGS_ + OBSERVED_VARS: Loadings.

DataFrame 5: Rotated loadings, structured as follows:

  • FACTOR_ID: Factor id.

  • ROT_LOADINGS_ + OBSERVED_VARS: Rotated loadings.

DataFrame 6: Structure, structured as follows:

  • FACTOR_ID: Factor id.

  • STRUCTURE+OBSERVED_VARS: Structure matrix. It is empty when rotation is not oblique.

DataFrame 7: Rotation, structured as follows:

  • ROTATION: rotation

  • ROTATION_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Rotation matrix.

DataFrame 8: Factor correlation, structured as follows:

  • FACTOR_ID: Factor id.

  • FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Factor correlation matrix. It is empty when rotation is not oblique.

DataFrame 9: Score model, structured as follows:

  • NAME: Factor id, MEAN, SD.

  • OBSERVED_VARS (in input table) column name: Score coefficients, means and standard deviations of observed variables.

DataFrame 10: Scores, structured as follows:

  • FACTOR_ID: Factor id.

  • FACTOR_ + i (i ranges from 1 to the number of columns in OBSERVED_VARS in the input table): Scores.

DataFrame 11: Statistics, placeholder for future features, structured as follows:

  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.
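
Since the results are accessed by position, it can help to attach names to them. The sketch below assumes the function returns a sequence of 11 DataFrames in the order listed above; label_outputs and OUTPUT_NAMES are hypothetical helpers for illustration and are not part of hana_ml.

```python
# Names chosen to mirror the 11 outputs documented above (illustrative only).
OUTPUT_NAMES = [
    "eigenvalues", "variance_explanation", "communalities", "loadings",
    "rotated_loadings", "structure", "rotation", "factor_correlation",
    "score_model", "scores", "statistics",
]


def label_outputs(res):
    """Map the positional results to names, checking the expected count."""
    if len(res) != len(OUTPUT_NAMES):
        raise ValueError(
            "expected %d outputs, got %d" % (len(OUTPUT_NAMES), len(res))
        )
    return dict(zip(OUTPUT_NAMES, res))


# Stand-in values instead of real DataFrames, so the sketch runs offline.
named = label_outputs(list(range(11)))
print(named["loadings"])  # the fourth output, i.e. index 3
```

With a real result, `label_outputs(res)["loadings"].collect()` would then fetch the loadings table, under the same assumption about the output order.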

Examples

Original data:

>>> df.collect()
    ID   X1   X2   X3   X4   X5   X6
0    1  1.0  1.0  3.0  3.0  1.0  1.0
1    2  1.0  2.0  3.0  3.0  1.0  1.0
2    3  1.0  1.0  3.0  4.0  1.0  1.0
3    4  1.0  1.0  3.0  3.0  1.0  2.0
4    5  1.0  1.0  3.0  3.0  1.0  1.0
5    6  1.0  1.0  1.0  1.0  3.0  3.0
6    7  1.0  2.0  1.0  1.0  3.0  3.0
7    8  1.0  1.0  1.0  2.0  3.0  3.0
8    9  1.0  2.0  1.0  1.0  3.0  4.0
9   10  1.0  1.0  1.0  1.0  3.0  3.0
10  11  3.0  3.0  1.0  1.0  1.0  1.0
11  12  3.0  4.0  1.0  1.0  1.0  1.0
12  13  3.0  3.0  1.0  2.0  1.0  1.0
13  14  3.0  3.0  1.0  1.0  1.0  2.0
14  15  3.0  3.0  1.0  1.0  1.0  1.0
15  16  4.0  4.0  5.0  5.0  6.0  6.0
16  17  5.0  6.0  4.0  6.0  4.0  5.0
17  18  6.0  5.0  6.0  4.0  5.0  4.0

Apply the factor_analysis function:

>>> from hana_ml.algorithms.pal.stats import factor_analysis
>>> res = factor_analysis(data=df, key='ID', factor_num=2,
...                       rotation='promax',
...                       matrix='correlation')
>>> res[0].collect()
  FACTOR_ID  EIGENVALUE  VAR_PROP  CUM_VAR_PROP
0  FACTOR_1    3.696031  0.616005      0.616005
1  FACTOR_2    1.073114  0.178852      0.794858
2  FACTOR_3    1.000774  0.166796      0.961653
3  FACTOR_4    0.161003  0.026834      0.988487
4  FACTOR_5    0.040961  0.006827      0.995314
5  FACTOR_6    0.028116  0.004686      1.000000
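
The VAR_PROP and CUM_VAR_PROP columns follow directly from the EIGENVALUE column: each proportion is the eigenvalue divided by the sum of all eigenvalues, and the cumulative proportion is the running total. A minimal plain-Python check, using the eigenvalues copied from the output above (the variable names are illustrative):

```python
from itertools import accumulate

# EIGENVALUE column copied from the example output.
eigenvalues = [3.696031, 1.073114, 1.000774, 0.161003, 0.040961, 0.028116]

total = sum(eigenvalues)

# Share of the total variance carried by each factor.
var_prop = [ev / total for ev in eigenvalues]

# Running total of the eigenvalues, expressed as a share of the total.
cum_var_prop = [s / total for s in accumulate(eigenvalues)]

for i, (vp, cvp) in enumerate(zip(var_prop, cum_var_prop), start=1):
    print("FACTOR_%d  %.6f  %.6f" % (i, vp, cvp))
```

The total is approximately 6, the number of observed variables, which is what one expects for matrix='correlation'.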