chi_squared_goodness_of_fit

hana_ml.algorithms.pal.stats.chi_squared_goodness_of_fit(data, key, observed_data=None, expected_freq=None)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

observed_data : str, optional

Name of column for counts of actual observations belonging to each category.

If not given, it defaults to the first non-ID column.

expected_freq : str, optional

Name of the expected frequency column, i.e. the expected proportion of observations in each category (the P column in the example below).

If not given, it defaults to the second non-ID column.

Returns:

DataFrame

Comparison between the actual counts and the expected counts, structured as follows:

ID column, with same name and type as data's ID column.

Observed data column, with same name as data's observed_data column, but always with type DOUBLE.

EXPECTED, type DOUBLE, expected count in each category.

RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

STAT_NAME, type NVARCHAR(100), name of the statistic.

STAT_VALUE, type DOUBLE, value of the statistic.

Examples

Data to test:

>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1

Run the test:

>>> res, stat = chi_squared_goodness_of_fit(data=df, key='ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
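
The figures above can be cross-checked without a HANA connection. The following is a minimal sketch in plain Python, not the PAL implementation: it assumes the P column holds proportions that are scaled by the total observed count, and that the degrees of freedom equal the number of categories minus one, both of which agree with the example output. The helper `chi2_sf` is a hypothetical name introduced here for illustration.

```python
import math

# Values copied from the OBSERVED and P columns of the example above.
observed = [519.0, 364.0, 363.0, 200.0, 212.0, 193.0]
proportions = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-squared distribution (hypothetical helper).

    Evaluates the regularized lower incomplete gamma function P(a, t)
    with a = dof / 2 and t = x / 2 by its power series, then returns 1 - P.
    """
    a, t = dof / 2.0, x / 2.0
    if t <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= t / (a + n)
        total += term
    lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return 1.0 - lower


# Assumption: expected counts are the proportions scaled by the total count.
total_count = sum(observed)
expected = [p * total_count for p in proportions]

# RESIDUAL is the observed count minus the expected count.
residual = [o - e for o, e in zip(observed, expected)]

# Pearson chi-squared statistic: sum of squared residuals over expected counts.
chi2 = sum(r * r / e for r, e in zip(residual, expected))

# Assumption: degrees of freedom = number of categories - 1.
dof = len(observed) - 1
p_value = chi2_sf(chi2, dof)

print(round(chi2, 6), dof, round(p_value, 6))
```

With these inputs the sketch reproduces the EXPECTED and RESIDUAL columns and the three statistics shown above to the printed precision.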