- hana_ml.algorithms.pal.stats.chi_squared_independence(data, key, observed_data=None, correction=False)
Perform the chi-squared test of independence to determine whether observations of two variables are independent of each other.
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- observed_data : list of str, optional
Names of the observed data columns.
If not given, it defaults to all non-ID columns.
- correction : bool, optional
If True and there is exactly one degree of freedom, apply Yates's correction for continuity.
The correction adjusts each observed value by 0.5 towards the corresponding expected value.
Defaults to False.
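The effect of the correction parameter can be illustrated with a minimal pure-Python sketch. It is not the PAL implementation: the helper name `chi_squared_statistic` and the 2x2 table are hypothetical, and capping the adjustment so that it never overshoots the expected value is an assumption about how "towards the expected value" is applied.

```python
def chi_squared_statistic(table, correction=False):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if correction and dof == 1:
                # Assumed form of Yates's correction: move the observed value
                # 0.5 towards the expected value, without passing it
                diff -= min(0.5, diff)
            statistic += diff ** 2 / expected
    return statistic, dof


# Hypothetical 2x2 table, so there is exactly one degree of freedom
table = [[10, 20], [30, 40]]
plain, dof = chi_squared_statistic(table)
corrected, _ = chi_squared_statistic(table, correction=True)
print(dof, plain, corrected)
```

With one degree of freedom the corrected statistic is smaller than the uncorrected one, which makes the test more conservative. For tables with more than one degree of freedom the flag has no effect in this sketch.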
The expected count table, structured as follows:
- ID column, with the same name and type as data's ID column.
- Expected count columns, type DOUBLE, named by prepending EXPECTED_ to each observed_data column name. There is one such column for each observed_data column.
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
- STAT_NAME, type NVARCHAR(100), name of the statistic.
- STAT_VALUE, type DOUBLE, value of the statistic.
Data to test:
>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0
Call the function:
>>> res, stats = chi_squared_independence(data=df, key='ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
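The expected counts and the chi-squared value in this example can be checked by hand. The following sketch recomputes them locally in plain Python from the sample data; it does not call SAP HANA, and the variable names are illustrative only. Each expected count is the row total times the column total divided by the grand total, and the degrees of freedom are (rows - 1) * (columns - 1).

```python
# Observed counts from the example, keyed by the ID column value
observed = {
    "male": [25, 23.0, 11, 14.0],
    "female": [41, 20.0, 18, 6.0],
}

row_totals = {name: sum(values) for name, values in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total
expected_counts = {
    name: [row_totals[name] * col_total / grand_total for col_total in col_totals]
    for name in observed
}

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi_squared = sum(
    (obs - exp) ** 2 / exp
    for name in observed
    for obs, exp in zip(observed[name], expected_counts[name])
)
degrees_of_freedom = (len(observed) - 1) * (len(col_totals) - 1)

for name, values in expected_counts.items():
    print(name, [round(v, 6) for v in values])
print(round(chi_squared, 6), degrees_of_freedom)
```

The p-value is omitted here because it requires the chi-squared survival function, which the Python standard library does not provide.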