f_oneway

hana_ml.algorithms.pal.stats.f_oneway(data, group=None, sample=None, multcomp_method=None, significance_level=None)

Performs a one-way analysis of variance (ANOVA).

The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.

Parameters
data : DataFrame

DataFrame containing the data.

group : str, optional

Name of the group column.

If not provided, it defaults to the first column.

sample : str, optional

Name of the sample measurement column.

If not provided, it defaults to the first non-group column.

multcomp_method : {'tukey-kramer', 'bonferroni', 'dunn-sidak', 'scheffe', 'fisher-lsd'}, optional

Method used to perform multiple comparison tests.

Defaults to 'tukey-kramer'.

significance_level : float, optional

Significance level used to compute the confidence intervals in the multiple comparison tests.

Values must be greater than 0 and less than 1.

Defaults to 0.05.
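The constraints above can be checked on the client before calling the function. The following is a minimal sketch only: `check_f_oneway_args` is a hypothetical helper, not part of hana_ml, and it assumes the method names are matched exactly as listed above.

```python
# Hypothetical pre-check for f_oneway arguments; NOT part of hana_ml.
# It mirrors only the constraints and defaults stated in this section.
ALLOWED_METHODS = {"tukey-kramer", "bonferroni", "dunn-sidak",
                   "scheffe", "fisher-lsd"}


def check_f_oneway_args(multcomp_method=None, significance_level=None):
    """Return the effective settings, or raise ValueError if one is invalid."""
    if multcomp_method is not None and multcomp_method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported multcomp_method: {multcomp_method!r}")
    if significance_level is not None and not 0 < significance_level < 1:
        raise ValueError("significance_level must be in the open interval (0, 1)")
    return {
        # Documented defaults apply when an argument is left as None.
        "multcomp_method": multcomp_method or "tukey-kramer",
        "significance_level": 0.05 if significance_level is None
        else significance_level,
    }


print(check_f_oneway_args(multcomp_method="scheffe", significance_level=0.01))
```

The returned dictionary could then be unpacked into the real call, for example `f_oneway(data=df, **settings)`.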

Returns
DataFrames

Three DataFrames, returned in the order described below.

Statistics for each group, structured as follows:

  • GROUP, type NVARCHAR(256), group name.

  • VALID_SAMPLES, type INTEGER, number of valid samples.

  • MEAN, type DOUBLE, group mean.

  • SD, type DOUBLE, group standard deviation.
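As a rough local cross-check of this table, the per-group statistics can be recomputed with the Python standard library. This sketch uses the sample data from the Examples section and assumes SD is the sample standard deviation (divisor n - 1), which is what the example output is consistent with.

```python
from statistics import mean, stdev

# Sample data from the Examples section, keyed by group name.
samples = {
    "A": [4.0, 5.0, 4.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "B": [6.0, 8.0, 4.0, 5.0, 4.0, 6.0, 5.0, 8.0],
    "C": [6.0, 7.0, 6.0, 6.0, 7.0, 5.0],
}

# One row per group: (GROUP, VALID_SAMPLES, MEAN, SD).
rows = [(g, len(v), mean(v), stdev(v)) for g, v in samples.items()]

# The "Total" row pools all observations.
all_values = [x for v in samples.values() for x in v]
rows.append(("Total", len(all_values), mean(all_values), stdev(all_values)))

for group, n, m, sd in rows:
    print(f"{group:>5} {n:3d} {m:.6f} {sd:.6f}")
```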

Computed results for ANOVA, structured as follows:

  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.

  • SUM_OF_SQUARES, type DOUBLE, sum of squares.

  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.

  • MEAN_SQUARES, type DOUBLE, mean squares.

  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.

  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.
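The quantities in this table follow the standard one-way ANOVA decomposition. The sketch below recomputes them in plain Python for the sample data from the Examples section. The p-value line relies on a closed form that holds only when the between-groups degrees of freedom equal 2 (three groups); in general the upper tail of the F-distribution is needed, for example `scipy.stats.f.sf`.

```python
# Sample data from the Examples section, keyed by group name.
samples = {
    "A": [4.0, 5.0, 4.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "B": [6.0, 8.0, 4.0, 5.0, 4.0, 6.0, 5.0, 8.0],
    "C": [6.0, 7.0, 6.0, 6.0, 7.0, 5.0],
}

all_values = [x for v in samples.values() for x in v]
grand_mean = sum(all_values) / len(all_values)
group_means = {g: sum(v) / len(v) for g, v in samples.items()}

# Sums of squares: between groups, within groups (error), and total.
ss_between = sum(len(v) * (group_means[g] - grand_mean) ** 2
                 for g, v in samples.items())
ss_within = sum((x - group_means[g]) ** 2
                for g, v in samples.items() for x in v)
ss_total = ss_between + ss_within

# Degrees of freedom and mean squares.
df_between = len(samples) - 1
df_within = len(all_values) - len(samples)
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F ratio: mean square between groups over mean square of error.
f_ratio = ms_between / ms_within

# Upper-tail probability of F(2, df_within); valid ONLY for df_between == 2.
assert df_between == 2
p_value = (1 + 2 * f_ratio / df_within) ** (-df_within / 2)

print(f"F = {f_ratio:.6f}, p = {p_value:.6f}")
```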

Multiple comparison results, structured as follows:

  • FIRST_GROUP, type NVARCHAR(256), name of the first group in the pairwise test.

  • SECOND_GROUP, type NVARCHAR(256), name of the second group in the pairwise test.

  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.

  • SE, type DOUBLE, standard error computed from all data.

  • P_VALUE, type DOUBLE, p-value.

  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.

  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
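The MEAN_DIFFERENCE and SE columns can be approximated locally. The sketch below assumes that SE is the pooled standard error sqrt(MSE * (1/n_i + 1/n_j)), where MSE is the within-groups mean square. That formula is an inference from the numbers in the Examples section, not a statement from the PAL documentation. The p-values and confidence limits depend on the chosen comparison method and are not reproduced here.

```python
from itertools import combinations
from math import sqrt

# Sample data from the Examples section, keyed by group name.
samples = {
    "A": [4.0, 5.0, 4.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "B": [6.0, 8.0, 4.0, 5.0, 4.0, 6.0, 5.0, 8.0],
    "C": [6.0, 7.0, 6.0, 6.0, 7.0, 5.0],
}

means = {g: sum(v) / len(v) for g, v in samples.items()}
n_total = sum(len(v) for v in samples.values())

# Within-groups mean square (MSE), pooled over all groups.
ss_within = sum((x - means[g]) ** 2 for g, v in samples.items() for x in v)
mse = ss_within / (n_total - len(samples))

# Mean difference and assumed pooled standard error for each pair.
pairs = {}
for g1, g2 in combinations(samples, 2):
    diff = means[g1] - means[g2]
    se = sqrt(mse * (1 / len(samples[g1]) + 1 / len(samples[g2])))
    pairs[(g1, g2)] = (diff, se)

for (g1, g2), (diff, se) in pairs.items():
    print(f"{g1} vs {g2}: diff = {diff:.6f}, SE = {se:.6f}")
```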

Examples

Input DataFrame df:

>>> df.collect()
   GROUP  DATA
0      A   4.0
1      A   5.0
2      A   4.0
3      A   3.0
4      A   2.0
5      A   4.0
6      A   3.0
7      A   4.0
8      B   6.0
9      B   8.0
10     B   4.0
11     B   5.0
12     B   4.0
13     B   6.0
14     B   5.0
15     B   8.0
16     C   6.0
17     C   7.0
18     C   6.0
19     C   6.0
20     C   7.0
21     C   5.0

Call the function:

>>> from hana_ml.algorithms.pal.stats import f_oneway
>>> stats, anova, mult_comp = f_oneway(data=df,
...                                    multcomp_method='tukey-kramer',
...                                    significance_level=0.05)

Outputs:

>>> stats.collect()
   GROUP  VALID_SAMPLES      MEAN        SD
0      A              8  3.625000  0.916125
1      B              8  5.750000  1.581139
2      C              6  6.166667  0.752773
3  Total             22  5.090909  1.600866
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES    F_RATIO   P_VALUE
0              Group       27.609848                 2.0     13.804924  10.008021  0.001075
1              Error       26.208333                19.0      1.379386        NaN       NaN
2              Total       53.818182                21.0           NaN        NaN       NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  CI_UPPER
0           A            B        -2.125000  0.587236  0.004960 -3.616845 -0.633155
1           A            C        -2.541667  0.634288  0.002077 -4.153043 -0.930290
2           B            C        -0.416667  0.634288  0.790765 -2.028043  1.194710
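One way to read the multiple comparison output is to flag the pairs whose p-value falls below the significance level, or equivalently whose confidence interval excludes zero. The sketch below hard-codes the rows printed above as plain tuples so that it runs without a HANA connection; with a live result, the same logic would apply to the collected DataFrame.

```python
alpha = 0.05  # significance level used in the example call

# Rows copied from the mult_comp output above:
# (FIRST_GROUP, SECOND_GROUP, P_VALUE, CI_LOWER, CI_UPPER)
rows = [
    ("A", "B", 0.004960, -3.616845, -0.633155),
    ("A", "C", 0.002077, -4.153043, -0.930290),
    ("B", "C", 0.790765, -2.028043, 1.194710),
]

# Pairs whose p-value is below the significance level.
significant = [(a, b) for a, b, p, lo, hi in rows if p < alpha]

# Pairs whose confidence interval does not contain zero.
ci_excludes_zero = [(a, b) for a, b, p, lo, hi in rows if lo > 0 or hi < 0]

print(significant)
print(ci_excludes_zero)
```

For this data both criteria select the same pairs: group A differs from B and from C, while B and C are not distinguishable at the 0.05 level.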