f_oneway_repeated

hana_ml.algorithms.pal.stats.f_oneway_repeated(data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)

Performs one-way repeated measures analysis of variance, along with Mauchly's Test of Sphericity and post hoc multiple comparison tests.

Parameters:

data : DataFrame

DataFrame containing the data.

subject_id : str

Name of the subject ID column. The algorithm treats each row of the data table as a different subject. Hence there should be no duplicate subject IDs in this column.

measures : list of str, optional

Names of the groups (measures).

If not provided, defaults to all non-subject_id columns.

multcomp_method : {'tukey-kramer', 'bonferroni', 'dunn-sidak', 'scheffe', 'fisher-lsd'}, optional

Method used to perform multiple comparison tests.

Defaults to 'bonferroni'.

significance_level : float, optional

Significance level used to compute the confidence intervals in multiple comparison tests.

Values must be greater than 0 and less than 1.

Defaults to 0.05.

se_type : {'all-data', 'two-group'}, optional

Type of standard error used in multiple comparison tests.

'all-data': computes the standard error from all data. It has more power if the assumption of sphericity is true, especially with small data sets.
'two-group': computes the standard error from only the two groups being compared. It doesn't assume sphericity.

Defaults to 'two-group'.
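The constraints listed above can be checked on the client before calling the function. The following is a minimal sketch, not part of hana_ml: the helper name `check_args` is hypothetical, it works on a plain Python list of subject IDs, and it assumes the option strings must match the documented lowercase spellings exactly.

```python
# Hypothetical pre-flight checks mirroring the documented parameter constraints.
# This helper is an illustration only; hana_ml does not provide it.
MULTCOMP_METHODS = {'tukey-kramer', 'bonferroni', 'dunn-sidak', 'scheffe', 'fisher-lsd'}
SE_TYPES = {'all-data', 'two-group'}


def check_args(subject_ids, multcomp_method=None, significance_level=None, se_type=None):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []

    # Each row is treated as a different subject, so IDs must be unique.
    seen, duplicates = set(), set()
    for sid in subject_ids:
        if sid in seen:
            duplicates.add(sid)
        seen.add(sid)
    if duplicates:
        problems.append(f"duplicate subject IDs: {sorted(duplicates)}")

    # None means "use the documented default", so only explicit values are checked.
    if multcomp_method is not None and multcomp_method not in MULTCOMP_METHODS:
        problems.append(f"unknown multcomp_method: {multcomp_method!r}")
    if significance_level is not None and not 0 < significance_level < 1:
        problems.append("significance_level must be greater than 0 and less than 1")
    if se_type is not None and se_type not in SE_TYPES:
        problems.append(f"unknown se_type: {se_type!r}")
    return problems
```

For example, `check_args([1, 2, 3], 'bonferroni', 0.05, 'two-group')` returns an empty list, while a repeated ID or a significance level of 1.0 is reported as a problem.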

Returns:

DataFrames (four, returned in the order listed below)

Statistics for each group, structured as follows:

GROUP, type NVARCHAR(256), group name.

VALID_SAMPLES, type INTEGER, number of valid samples.

MEAN, type DOUBLE, group mean.

SD, type DOUBLE, group standard deviation.

Mauchly test results, structured as follows:

STAT_NAME, type NVARCHAR(100), names of test result quantities.

STAT_VALUE, type DOUBLE, values of test result quantities.

Computed results, structured as follows:

VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.

SUM_OF_SQUARES, type DOUBLE, sum of squares.

DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.

MEAN_SQUARES, type DOUBLE, mean squares.

F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.

P_VALUE, type DOUBLE, associated p-value from the F-distribution.

P_VALUE_GG, type DOUBLE, p-value of Greenhouse-Geisser correction.

P_VALUE_HF, type DOUBLE, p-value of Huynh-Feldt correction.

P_VALUE_LB, type DOUBLE, p-value of lower bound correction.

Multiple comparison results, structured as follows:

FIRST_GROUP, type NVARCHAR(256), name of the first group in the pairwise test.

SECOND_GROUP, type NVARCHAR(256), name of the second group in the pairwise test.

MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.

SE, type DOUBLE, standard error computed from all data or from only the two groups being compared, depending on se_type.

P_VALUE, type DOUBLE, p-value.

CI_LOWER, type DOUBLE, the lower limit of the confidence interval.

CI_UPPER, type DOUBLE, the upper limit of the confidence interval.

Examples

Data df:

>>> df.collect()
ID  MEASURE1  MEASURE2  MEASURE3  MEASURE4
 1       8.0       7.0       1.0       6.0
 2       9.0       5.0       2.0       5.0
 3       6.0       2.0       3.0       8.0
 4       5.0       3.0       1.0       9.0
 5       8.0       4.0       5.0       8.0
 6       7.0       5.0       6.0       7.0
 7      10.0       2.0       7.0       2.0
 8      12.0       6.0       8.0       1.0

Call the function:

>>> from hana_ml.algorithms.pal.stats import f_oneway_repeated
>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     data=df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')

Outputs:

>>> stats.collect()
   GROUP  VALID_SAMPLES   MEAN        SD
MEASURE1              8  8.125  2.232071
MEASURE2              8  4.250  1.832251
MEASURE3              8  4.125  2.748376
MEASURE4              8  5.750  2.915476
>>> mtest.collect()
                 STAT_NAME  STAT_VALUE
               Mauchly's W    0.136248
                Chi-Square   11.405981
                        df    5.000000
                    pValue    0.046773
Greenhouse-Geisser Epsilon    0.532846
       Huynh-Feldt Epsilon    0.665764
       Lower bound Epsilon    0.333333
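Several quantities in the Mauchly table above are related by standard textbook formulas. The sketch below recomputes them from the rounded values of Mauchly's W and the Greenhouse-Geisser epsilon. That PAL uses exactly these formulas is an assumption, although they agree with the printed values to the displayed precision. All variable names are illustrative.

```python
import math

n_subjects = 8      # rows in the example data
k = 4               # number of measures (groups)
w = 0.136248        # Mauchly's W, copied from the table (rounded)
eps_gg = 0.532846   # Greenhouse-Geisser epsilon, copied from the table (rounded)

p = k - 1

# Degrees of freedom of the chi-square approximation: k(k-1)/2 - 1.
df = k * (k - 1) // 2 - 1

# Chi-square approximation of Mauchly's test statistic.
scale = (n_subjects - 1) * (1 - (2 * p * p + p + 2) / (6 * p * (n_subjects - 1)))
chi_square = -scale * math.log(w)

# Lower bound of epsilon: the most conservative correction.
eps_lb = 1 / (k - 1)

# Huynh-Feldt epsilon derived from the Greenhouse-Geisser epsilon, capped at 1.
eps_hf = (n_subjects * p * eps_gg - 2) / (p * (n_subjects - 1 - p * eps_gg))
eps_hf = min(eps_hf, 1.0)

print(df, round(chi_square, 3), round(eps_lb, 6), round(eps_hf, 4))
```

The corrected p-values in the ANOVA table are conventionally obtained by multiplying both degrees of freedom of the F-test by the respective epsilon before looking up the F-distribution.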
>>> anova.collect()
VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES   F_RATIO  P_VALUE  P_VALUE_GG  P_VALUE_HF  P_VALUE_LB
             Group          83.125                 3.0     27.708333  3.793806  0.02557    0.062584    0.048331    0.092471
           Subject          17.375                 7.0      2.482143       NaN      NaN         NaN         NaN         NaN
             Error         153.375                21.0      7.303571       NaN      NaN         NaN         NaN         NaN
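The ANOVA table above can be reproduced by hand from the example data. The following is a minimal pure-Python sketch of the usual sum-of-squares decomposition for a one-way repeated measures design. It is meant to illustrate how the Group, Subject and Error rows and F_RATIO relate to each other, not to describe PAL's internal implementation.

```python
# Example data: one row per subject, one column per measure (MEASURE1..MEASURE4).
data = [
    [8.0, 7.0, 1.0, 6.0],
    [9.0, 5.0, 2.0, 5.0],
    [6.0, 2.0, 3.0, 8.0],
    [5.0, 3.0, 1.0, 9.0],
    [8.0, 4.0, 5.0, 8.0],
    [7.0, 5.0, 6.0, 7.0],
    [10.0, 2.0, 7.0, 2.0],
    [12.0, 6.0, 8.0, 1.0],
]
n = len(data)        # number of subjects
k = len(data[0])     # number of groups (measures)

grand_mean = sum(sum(row) for row in data) / (n * k)
group_means = [sum(row[j] for row in data) / n for j in range(k)]
subject_means = [sum(row) / k for row in data]

# Partition the total variability into group, subject and error portions.
ss_group = n * sum((m - grand_mean) ** 2 for m in group_means)
ss_subject = k * sum((m - grand_mean) ** 2 for m in subject_means)
ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
ss_error = ss_total - ss_group - ss_subject

df_group, df_subject, df_error = k - 1, n - 1, (k - 1) * (n - 1)

ms_group = ss_group / df_group
ms_error = ss_error / df_error

# F ratio: mean square between groups divided by mean square of error.
f_ratio = ms_group / ms_error

print(ss_group, ss_subject, ss_error, round(f_ratio, 6))
```

The p-values additionally require the F-distribution, which is not in the Python standard library, so they are left out of this sketch.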
>>> mult_comp.collect()
FIRST_GROUP  SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE   CI_LOWER  CI_UPPER
   MEASURE1      MEASURE2            3.875  0.811469  0.012140   0.924655  6.825345
   MEASURE1      MEASURE3            4.000  0.731925  0.005645   1.338861  6.661139
   MEASURE1      MEASURE4            2.375  1.792220  1.000000  -4.141168  8.891168
   MEASURE2      MEASURE3            0.125  1.201747  1.000000  -4.244322  4.494322
   MEASURE2      MEASURE4           -1.500  1.336306  1.000000  -6.358552  3.358552
   MEASURE3      MEASURE4           -1.625  1.821866  1.000000  -8.248955  4.998955
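With se_type='two-group', the MEAN_DIFFERENCE and SE columns above match what a paired comparison of the two groups gives: the mean of the per-subject differences and the standard deviation of those differences divided by the square root of the number of subjects. The sketch below recomputes both columns in pure Python. Treating the function's computation as a paired comparison is an assumption based on the printed numbers, and the `bonferroni` helper is a hypothetical illustration of the usual adjustment (raw p-value times the number of comparisons, capped at 1).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

names = ['MEASURE1', 'MEASURE2', 'MEASURE3', 'MEASURE4']
data = [
    [8.0, 7.0, 1.0, 6.0],
    [9.0, 5.0, 2.0, 5.0],
    [6.0, 2.0, 3.0, 8.0],
    [5.0, 3.0, 1.0, 9.0],
    [8.0, 4.0, 5.0, 8.0],
    [7.0, 5.0, 6.0, 7.0],
    [10.0, 2.0, 7.0, 2.0],
    [12.0, 6.0, 8.0, 1.0],
]

# For every pair of groups, use only the per-subject differences of that pair.
pairwise = {}
for i, j in combinations(range(len(names)), 2):
    diffs = [row[i] - row[j] for row in data]
    mean_difference = mean(diffs)
    standard_error = stdev(diffs) / sqrt(len(diffs))
    pairwise[(names[i], names[j])] = (mean_difference, standard_error)


def bonferroni(p_raw, n_comparisons):
    """Hypothetical helper: Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_raw * n_comparisons)


for (first, second), (md, se) in pairwise.items():
    print(first, second, round(md, 3), round(se, 6))
```

The raw p-values and the confidence limits need the t-distribution, which the standard library does not provide, so they are not computed here. The cap in the adjustment is consistent with the entries of 1.000000 in the P_VALUE column.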