chi_squared_independence
- hana_ml.algorithms.pal.stats.chi_squared_independence(data, key, observed_data=None, correction=False)
Performs the chi-squared test of independence to determine whether observations of two variables are independent of each other.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- observed_data : list of str, optional
Names of the observed data columns.
If not given, it defaults to all non-ID columns.
- correction : bool, optional
If True and the number of degrees of freedom is 1, apply Yates's correction for continuity.
The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.
Defaults to False.
- Returns:
- DataFrames
The expected count table, structured as follows:
ID column, with the same name and type as data's ID column.
Expected count columns, named by prepending Expected_ to each observed_data column name, type DOUBLE. There are as many of these columns as there are observed_data columns.
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
STAT_NAME, type NVARCHAR(100), name of the statistic.
STAT_VALUE, type DOUBLE, value of the statistic.
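To make the effect of the correction parameter concrete, here is a minimal pure-Python sketch of Yates's correction on a 2x2 table, where the number of degrees of freedom is 1. It is not the PAL implementation: the helper name chi_squared_2x2 and the sample counts are hypothetical, and clamping the adjusted difference at zero is a choice made in this sketch so that an observed value never moves past its expected value.

```python
def chi_squared_2x2(table, correction=False):
    """Chi-squared statistic for a contingency table given as a list of rows.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed - expected)
            if correction:
                # Yates: move each observed value 0.5 towards its expected value.
                # Clamping at 0 is an assumption of this sketch.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic


# Hypothetical 2x2 counts, chosen only to show the effect.
sample = [[10, 20], [30, 40]]
plain = chi_squared_2x2(sample)
corrected = chi_squared_2x2(sample, correction=True)
print(plain, corrected)
```

The corrected statistic is never larger than the uncorrected one, which makes the test more conservative for small tables.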
Examples
>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0
Call the function:
>>> res, stats = chi_squared_independence(data=df, key='ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
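The figures in this example can be checked by hand. The following sketch recomputes the expected counts, the chi-squared value and the degrees of freedom in plain Python from the observed counts shown above, without a HANA connection. The helper names are hypothetical and the code uses the textbook formulas (expected count = row total x column total / grand total), which is an assumption about what PAL computes rather than a description of its internals. The p-value is not recomputed here.

```python
# Observed counts from the example (rows: male, female; columns: X1..X4).
observed = [
    [25, 23, 11, 14],
    [41, 20, 18, 6],
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


expected = expected_counts(observed)
chi2 = chi_squared_statistic(observed)
dof = degrees_of_freedom(observed)

for row in expected:
    print([round(value, 6) for value in row])
print(round(chi2, 6), dof)
```

The results agree with the EXPECTED_* columns and with the Chi-squared Value and degree of freedom rows returned by the function.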