chi_squared_independence

hana_ml.algorithms.pal.stats.chi_squared_independence(data, key, observed_data=None, correction=False)

Perform the chi-squared test of independence to determine whether observations of two variables are independent of each other.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

observed_data : list of str, optional

Names of the observed data columns.

If not given, it defaults to all non-ID columns.

correction : bool, optional

If True and there is exactly 1 degree of freedom, apply Yates's correction for continuity.

The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.

Defaults to False.
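The effect of the correction can be illustrated outside of HANA with a small pure-Python sketch. The function name chi_squared_2x2 and the counts are hypothetical, and capping the adjustment so that an observed value never moves past its expected value is an assumption of this sketch, not a documented detail of the PAL implementation.

```python
def chi_squared_2x2(table, correction=False):
    """Pearson chi-squared statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if correction:
                # Move the observed value 0.5 towards the expected value,
                # but never past it (assumption of this sketch).
                diff -= min(0.5, diff)
            stat += diff ** 2 / expected
    return stat


table = [[10, 20], [30, 40]]  # hypothetical counts, 1 degree of freedom
uncorrected = chi_squared_2x2(table)
corrected = chi_squared_2x2(table, correction=True)
print(uncorrected, corrected)
```

For these counts the statistic drops from about 0.794 to about 0.446, so the corrected test is the more conservative one.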

Returns:
DataFrames

Two DataFrames, returned in the order below.

The expected count table, structured as follows:

  • ID column, with same name and type as data's ID column.

  • Expected count columns, named by prepending Expected_ to each observed_data column name, type DOUBLE. There will be as many columns here as there are observed_data columns.

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of the statistic.

  • STAT_VALUE, type DOUBLE, value of the statistic.

Examples

Data to test:

>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0

Run the test:

>>> res, stats = chi_squared_independence(data=df, key='ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
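
The figures in this example can be cross-checked by hand. The following is a minimal pure-Python sketch of the textbook computation applied to the same counts. It is not the PAL implementation, all variable names are illustrative, and the closed-form p-value used here is valid only for 3 degrees of freedom.

```python
import math

# Observed counts from the example (columns X1..X4).
observed = {"male": [25, 23, 11, 14], "female": [41, 20, 18, 6]}
rows = list(observed.values())

row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic, without continuity correction.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
dof = (len(rows) - 1) * (len(col_totals) - 1)

# Upper-tail probability of the chi-squared distribution.
# This closed form holds only for dof == 3.
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

print(expected[0])
print(chi2, dof, p_value)
```

The printed values should agree with the EXPECTED_ columns and the STAT_VALUE column above, up to rounding.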