f_oneway_repeated

hana_ml.algorithms.pal.stats.f_oneway_repeated(data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)

Performs one-way repeated measures analysis of variance, along with Mauchly's Test of Sphericity and post hoc multiple comparison tests.

Parameters:
data : DataFrame

DataFrame containing the data.

subject_id : str

Name of the subject ID column. The algorithm treats each row of the data as a different subject, so this column must not contain duplicate IDs.

measures : list of str, optional

Names of the columns that hold the groups (measures).

Defaults to all columns other than subject_id.

multcomp_method : {'tukey-kramer', 'bonferroni', 'dunn-sidak', 'scheffe', 'fisher-lsd'}, optional

Method used to perform multiple comparison tests.

Defaults to 'bonferroni'.

significance_level : float, optional

Significance level used to compute the confidence intervals in the multiple comparison tests.

Values must be greater than 0 and less than 1.

Defaults to 0.05.

se_type : {'all-data', 'two-group'}, optional

Type of standard error used in multiple comparison tests.

  • 'all-data': computes the standard error from all data. This has more power when the sphericity assumption holds, especially for small data sets.

  • 'two-group': computes the standard error from only the two groups being compared. It doesn't assume sphericity.

Defaults to 'two-group'.

Returns:
DataFrames (four, returned in the order listed below)

Statistics for each group, structured as follows:

  • GROUP, type NVARCHAR(256), group name.

  • VALID_SAMPLES, type INTEGER, number of valid samples.

  • MEAN, type DOUBLE, group mean.

  • SD, type DOUBLE, group standard deviation.

Mauchly test results, structured as follows:

  • STAT_NAME, type NVARCHAR(100), names of test result quantities.

  • STAT_VALUE, type DOUBLE, values of test result quantities.
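
As a rough cross-check of the epsilon rows in this table, the sketch below recomputes the three sphericity epsilons in plain Python from the data shown under Examples. It assumes the usual textbook definitions (Greenhouse-Geisser from the double-centered sample covariance matrix, Huynh-Feldt derived from the Greenhouse-Geisser value, lower bound 1/(k-1)). It is not the PAL implementation, and the helper name `sphericity_epsilons` is hypothetical.

```python
# Example data from the Examples section: one row per subject,
# one column per measure (MEASURE1..MEASURE4).
rows = [
    [8.0, 7.0, 1.0, 6.0],
    [9.0, 5.0, 2.0, 5.0],
    [6.0, 2.0, 3.0, 8.0],
    [5.0, 3.0, 1.0, 9.0],
    [8.0, 4.0, 5.0, 8.0],
    [7.0, 5.0, 6.0, 7.0],
    [10.0, 2.0, 7.0, 2.0],
    [12.0, 6.0, 8.0, 1.0],
]


def sphericity_epsilons(rows):
    """Hypothetical helper: textbook sphericity epsilons, not PAL code."""
    n = len(rows)      # number of subjects
    k = len(rows[0])   # number of measures
    means = [sum(r[j] for r in rows) / n for j in range(k)]

    # Sample covariance matrix of the measures.
    cov = [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]

    # Double-center the covariance matrix.
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand = sum(row_means) / k
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand for j in range(k)]
        for i in range(k)
    ]

    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(v * v for row in centered for v in row)

    gg = trace ** 2 / ((k - 1) * sum_sq)
    hf = (n * (k - 1) * gg - 2) / ((k - 1) * (n - 1 - (k - 1) * gg))
    hf = min(hf, 1.0)  # epsilon is conventionally capped at 1
    lb = 1.0 / (k - 1)
    return gg, hf, lb


gg, hf, lb = sphericity_epsilons(rows)
print(gg, hf, lb)
```

For the example data this gives approximately 0.5328, 0.6658 and 0.3333, in line with the epsilon rows of the mtest output under Examples.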

ANOVA results, structured as follows:

  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.

  • SUM_OF_SQUARES, type DOUBLE, sum of squares.

  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.

  • MEAN_SQUARES, type DOUBLE, mean squares.

  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.

  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.

  • P_VALUE_GG, type DOUBLE, p-value with the Greenhouse-Geisser correction.

  • P_VALUE_HF, type DOUBLE, p-value with the Huynh-Feldt correction.

  • P_VALUE_LB, type DOUBLE, p-value with the lower bound correction.
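
The sums of squares, degrees of freedom and F ratio in this table follow the usual partition for a one-way repeated measures design. The plain-Python sketch below reproduces them for the data shown under Examples. It assumes the standard textbook formulas rather than PAL's internal code, and the helper name `rm_anova` is hypothetical. P-values are left out because they need the F distribution, which the standard library does not provide.

```python
# Example data from the Examples section: one row per subject,
# one column per measure (MEASURE1..MEASURE4).
rows = [
    [8.0, 7.0, 1.0, 6.0],
    [9.0, 5.0, 2.0, 5.0],
    [6.0, 2.0, 3.0, 8.0],
    [5.0, 3.0, 1.0, 9.0],
    [8.0, 4.0, 5.0, 8.0],
    [7.0, 5.0, 6.0, 7.0],
    [10.0, 2.0, 7.0, 2.0],
    [12.0, 6.0, 8.0, 1.0],
]


def rm_anova(rows):
    """Hypothetical helper: textbook repeated measures partition."""
    n = len(rows)      # number of subjects
    k = len(rows[0])   # number of groups (measures)
    grand = sum(sum(r) for r in rows) / (n * k)

    group_means = [sum(r[j] for r in rows) / n for j in range(k)]
    subject_means = [sum(r) / k for r in rows]

    ss_group = n * sum((m - grand) ** 2 for m in group_means)
    ss_subject = k * sum((m - grand) ** 2 for m in subject_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    # Whatever is not explained by groups or subjects is error.
    ss_error = ss_total - ss_group - ss_subject

    df_group = k - 1
    df_subject = n - 1
    df_error = (k - 1) * (n - 1)

    ms_group = ss_group / df_group
    ms_error = ss_error / df_error
    return {
        "ss_group": ss_group,
        "ss_subject": ss_subject,
        "ss_error": ss_error,
        "df": (df_group, df_subject, df_error),
        "f_ratio": ms_group / ms_error,
    }


result = rm_anova(rows)
print(result)
```

For the example data the sketch yields sums of squares of 83.125, 17.375 and 153.375 with 3, 7 and 21 degrees of freedom, and an F ratio of about 3.7938, matching the anova output under Examples.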

Multiple comparison results, structured as follows:

  • FIRST_GROUP, type NVARCHAR(256), name of the first group in the pairwise test.

  • SECOND_GROUP, type NVARCHAR(256), name of the second group in the pairwise test.

  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.

  • SE, type DOUBLE, standard error computed either from all data or from only the two groups being compared, depending on se_type.

  • P_VALUE, type DOUBLE, p-value.

  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.

  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
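
To illustrate the MEAN_DIFFERENCE and SE columns for se_type='two-group', the sketch below treats each pair of groups as a paired comparison: it takes the per-subject differences and uses their sample standard deviation divided by the square root of the number of subjects. This is an assumption about how the two-group standard error is formed, although it is consistent with the numbers in the Examples section. The helper name `pairwise_two_group` is hypothetical, and p-values and confidence limits are omitted because they need Student's t distribution, which the standard library does not provide.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

names = ["MEASURE1", "MEASURE2", "MEASURE3", "MEASURE4"]

# Example data from the Examples section: one row per subject.
rows = [
    [8.0, 7.0, 1.0, 6.0],
    [9.0, 5.0, 2.0, 5.0],
    [6.0, 2.0, 3.0, 8.0],
    [5.0, 3.0, 1.0, 9.0],
    [8.0, 4.0, 5.0, 8.0],
    [7.0, 5.0, 6.0, 7.0],
    [10.0, 2.0, 7.0, 2.0],
    [12.0, 6.0, 8.0, 1.0],
]


def pairwise_two_group(rows, names):
    """Hypothetical helper: paired mean difference and standard error."""
    n = len(rows)
    results = []
    for i, j in combinations(range(len(names)), 2):
        # Per-subject difference between the two groups being compared.
        diffs = [r[i] - r[j] for r in rows]
        se = stdev(diffs) / sqrt(n)  # uses only these two groups
        results.append((names[i], names[j], mean(diffs), se))
    return results


pairs = pairwise_two_group(rows, names)
for first, second, diff, se in pairs:
    print(first, second, round(diff, 6), round(se, 6))
```

With four groups there are six pairwise comparisons. Under the Bonferroni method each raw p-value is multiplied by that count and capped at 1, which is why several P_VALUE entries in the example output equal 1.000000.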

Examples

Data df:

>>> df.collect()
  ID  MEASURE1  MEASURE2  MEASURE3  MEASURE4
0  1       8.0       7.0       1.0       6.0
1  2       9.0       5.0       2.0       5.0
2  3       6.0       2.0       3.0       8.0
3  4       5.0       3.0       1.0       9.0
4  5       8.0       4.0       5.0       8.0
5  6       7.0       5.0       6.0       7.0
6  7      10.0       2.0       7.0       2.0
7  8      12.0       6.0       8.0       1.0

Call the function:

>>> from hana_ml.algorithms.pal.stats import f_oneway_repeated
>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     data=df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')

Outputs:

>>> stats.collect()
      GROUP  VALID_SAMPLES   MEAN        SD
0  MEASURE1              8  8.125  2.232071
1  MEASURE2              8  4.250  1.832251
2  MEASURE3              8  4.125  2.748376
3  MEASURE4              8  5.750  2.915476
>>> mtest.collect()
                    STAT_NAME  STAT_VALUE
0                 Mauchly's W    0.136248
1                  Chi-Square   11.405981
2                          df    5.000000
3                      pValue    0.046773
4  Greenhouse-Geisser Epsilon    0.532846
5         Huynh-Feldt Epsilon    0.665764
6         Lower bound Epsilon    0.333333
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES   F_RATIO  P_VALUE  P_VALUE_GG  P_VALUE_HF  P_VALUE_LB
0              Group          83.125                 3.0     27.708333  3.793806  0.02557    0.062584    0.048331    0.092471
1            Subject          17.375                 7.0      2.482143       NaN      NaN         NaN         NaN         NaN
2              Error         153.375                21.0      7.303571       NaN      NaN         NaN         NaN         NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  CI_UPPER
0    MEASURE1     MEASURE2            3.875  0.811469  0.012140  0.924655  6.825345
1    MEASURE1     MEASURE3            4.000  0.731925  0.005645  1.338861  6.661139
2    MEASURE1     MEASURE4            2.375  1.792220  1.000000 -4.141168  8.891168
3    MEASURE2     MEASURE3            0.125  1.201747  1.000000 -4.244322  4.494322
4    MEASURE2     MEASURE4           -1.500  1.336306  1.000000 -6.358552  3.358552
5    MEASURE3     MEASURE4           -1.625  1.821866  1.000000 -8.248955  4.998955