- hana_ml.algorithms.pal.stats.f_oneway(data, group=None, sample=None, multcomp_method=None, significance_level=None)
Performs a one-way analysis of variance (ANOVA).
The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- group : str, optional
Name of the group column.
If not provided, it defaults to the first column.
- sample : str, optional
Name of the sample measurement column.
If not provided, it defaults to the first non-group column.
- multcomp_method : {'tukey-kramer', 'bonferroni', 'dunn-sidak', 'scheffe', 'fisher-lsd'}, str, optional
Method used to perform multiple comparison tests.
Defaults to 'tukey-kramer'.
- significance_level : float, optional
Significance level used to compute the confidence intervals in the multiple comparison tests.
The value must be greater than 0 and less than 1.
Defaults to 0.05.
- Returns:
- DataFrames (three, returned in the order below)
Statistics for each group, structured as follows:
GROUP, type NVARCHAR(256), group name.
VALID_SAMPLES, type INTEGER, number of valid samples.
MEAN, type DOUBLE, group mean.
SD, type DOUBLE, group standard deviation.
Computed results for ANOVA, structured as follows:
VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.
SUM_OF_SQUARES, type DOUBLE, sum of squares.
DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
MEAN_SQUARES, type DOUBLE, mean squares.
F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
P_VALUE, type DOUBLE, associated p-value from the F-distribution.
Multiple comparison results, structured as follows:
FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
SE, type DOUBLE, standard error computed from all data.
P_VALUE, type DOUBLE, p-value.
CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
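The ANOVA columns described above follow the usual one-way decomposition of the total sum of squares into a between-groups part and a within-groups (error) part. The following is a minimal plain-Python sketch of that arithmetic, not the PAL implementation. The helper name `anova_table` is hypothetical, and the values are the 22 measurements from the example data df in this entry.

```python
from statistics import mean

# Sample values copied from the example data `df` in this entry.
groups = {
    "A": [4.0, 5.0, 4.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "B": [6.0, 8.0, 4.0, 5.0, 4.0, 6.0, 5.0, 8.0],
    "C": [6.0, 7.0, 6.0, 6.0, 7.0, 5.0],
}


def anova_table(groups):
    """Hypothetical helper: textbook one-way ANOVA sums of squares and F ratio."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)

    # Between-groups: squared distance of each group mean from the grand mean,
    # weighted by the group size.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-groups (error): squared distance of each value from its group mean.
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }


table = anova_table(groups)
for name, value in table.items():
    print(f"{name}: {value:.6f}")
```

The printed sums of squares, degrees of freedom, mean squares and F ratio can be compared with the SUM_OF_SQUARES, DEGREES_OF_FREEDOM, MEAN_SQUARES and F_RATIO columns of the example output. The p-value is left out because it requires the F-distribution, which the Python standard library does not provide.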
Data df:
>>> df.collect()
   GROUP  DATA
0      A   4.0
1      A   5.0
2      A   4.0
3      A   3.0
4      A   2.0
5      A   4.0
6      A   3.0
7      A   4.0
8      B   6.0
9      B   8.0
10     B   4.0
11     B   5.0
12     B   4.0
13     B   6.0
14     B   5.0
15     B   8.0
16     C   6.0
17     C   7.0
18     C   6.0
19     C   6.0
20     C   7.0
21     C   5.0
Perform the function:
>>> stats, anova, mult_comp = f_oneway(data=df,
...                                    multcomp_method='Tukey-Kramer',
...                                    significance_level=0.05)
>>> stats.collect()
   GROUP  VALID_SAMPLES      MEAN        SD
0      A              8  3.625000  0.916125
1      B              8  5.750000  1.581139
2      C              6  6.166667  0.752773
3  Total             22  5.090909  1.600866
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES    F_RATIO   P_VALUE
0              Group       27.609848                 2.0     13.804924  10.008021  0.001075
1              Error       26.208333                19.0      1.379386        NaN       NaN
2              Total       53.818182                21.0           NaN        NaN       NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  CI_UPPER
0           A            B        -2.125000  0.587236  0.004960 -3.616845 -0.633155
1           A            C        -2.541667  0.634288  0.002077 -4.153043 -0.930290
2           B            C        -0.416667  0.634288  0.790765 -2.028043  1.194710
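The MEAN_DIFFERENCE and SE columns of the multiple comparison output can be cross-checked without a HANA connection. The sketch below assumes the pooled-error formula SE = sqrt(MSE * (1/n_i + 1/n_j)), where MSE is the within-groups mean square from the ANOVA table. That formula is an assumption chosen because it is the standard one and reproduces the values printed above; it is not taken from the PAL source. The helper name `pairwise_differences` is hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Sample values copied from the example data `df` in this entry.
groups = {
    "A": [4.0, 5.0, 4.0, 3.0, 2.0, 4.0, 3.0, 4.0],
    "B": [6.0, 8.0, 4.0, 5.0, 4.0, 6.0, 5.0, 8.0],
    "C": [6.0, 7.0, 6.0, 6.0, 7.0, 5.0],
}


def pairwise_differences(groups):
    """Hypothetical helper: mean difference and pooled SE for each group pair."""
    n_total = sum(len(values) for values in groups.values())
    # Within-groups mean square (MSE), pooled over all groups.
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )
    mse = ss_within / (n_total - len(groups))

    rows = []
    for (first, v1), (second, v2) in combinations(groups.items(), 2):
        diff = mean(v1) - mean(v2)
        # Assumed formula: standard error based on the pooled MSE.
        se = sqrt(mse * (1 / len(v1) + 1 / len(v2)))
        rows.append((first, second, diff, se))
    return rows


rows = pairwise_differences(groups)
for first, second, diff, se in rows:
    print(f"{first} {second} {diff:.6f} {se:.6f}")
```

P_VALUE, CI_LOWER and CI_UPPER are not reproduced here: for 'tukey-kramer' they depend on the studentized range distribution, which is outside the Python standard library.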