chi_squared_goodness_of_fit

hana_ml.algorithms.pal.stats.chi_squared_goodness_of_fit(data, key, observed_data=None, expected_freq=None)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

observed_data : str, optional

Name of the column holding the counts of actual observations in each category.

If not given, it defaults to the first non-ID column.

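expected_freq : str, optional

Name of the expected frequency column. In the example below, this column (P) holds the expected proportion of each category, and the expected counts are these proportions multiplied by the total observed count.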

If not given, it defaults to the second non-ID column.

Returns:
DataFrames

Two DataFrames are returned. The first is a comparison between the actual counts and the expected counts, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • Observed data column, with the same name as data's observed_data column, but always with type DOUBLE.

  • EXPECTED, type DOUBLE, expected count in each category.

  • RESIDUAL, type DOUBLE, the observed count minus the expected count in each category.

Statistical outputs, returned as a second DataFrame, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of the statistic.

  • STAT_VALUE, type DOUBLE, value of the statistic.

Examples

Data to test:

>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1

Call the function:

>>> res, stat = chi_squared_goodness_of_fit(data=df, key='ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
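
To make the relationship between the two result tables concrete, the following is a minimal local sketch of the same arithmetic in plain Python, without a HANA connection. It assumes that the P column holds proportions, that the degrees of freedom are the number of categories minus one, and that the statistic is the sum of squared residuals divided by the expected counts. The names goodness_of_fit and chi2_sf are hypothetical helpers written for this illustration; they are not part of hana_ml and do not describe how PAL is implemented.

```python
import math

# Data from the example above: observed counts and expected proportions (column P).
observed = [519.0, 364.0, 363.0, 200.0, 212.0, 193.0]
proportions = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]


def chi2_sf(stat, dof):
    """Upper-tail probability of a chi-squared variable with integer dof.

    Hypothetical helper (not part of hana_ml): builds Q(dof/2, stat/2) from
    the recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1).
    """
    x = stat / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-x)             # base case Q(1, x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))  # base case Q(1/2, x)
    while a < dof / 2.0 - 1e-9:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def goodness_of_fit(observed, proportions):
    """Return expected counts, residuals, statistic, dof and p-value."""
    total = sum(observed)
    # Expected count per category: proportion times the total observed count.
    expected = [p * total for p in proportions]
    # Residual: observed count minus expected count.
    residual = [o - e for o, e in zip(observed, expected)]
    # Chi-squared statistic: sum of squared residuals over expected counts.
    stat = sum(r * r / e for r, e in zip(residual, expected))
    # Assumption: degrees of freedom = number of categories - 1.
    dof = len(observed) - 1
    return expected, residual, stat, dof, chi2_sf(stat, dof)


expected, residual, stat, dof, p_value = goodness_of_fit(observed, proportions)
print([round(e, 1) for e in expected])
print([round(r, 1) for r in residual])
print(round(stat, 6), dof, round(p_value, 6))
```

Under these assumptions the sketch should agree with the EXPECTED, RESIDUAL and STAT_VALUE columns shown above, up to rounding.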