chi_squared_goodness_of_fit
- hana_ml.algorithms.pal.stats.chi_squared_goodness_of_fit(data, key, observed_data=None, expected_freq=None)
Performs the chi-squared goodness-of-fit test to determine whether an observed distribution differs from an expected distribution.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- observed_data : str, optional
Name of column for counts of actual observations belonging to each category.
If not given, it defaults to the first non-ID column.
- expected_freq : str, optional
Name of the expected frequency column.
If not given, it defaults to the second non-ID column.
- Returns:
- DataFrames
Comparison between the actual counts and the expected counts, structured as follows:
- ID column, with the same name and type as data's ID column.
- Observed data column, with the same name as data's observed_data column, but always with type DOUBLE.
- EXPECTED, type DOUBLE, expected count in each category.
- RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
- STAT_NAME, type NVARCHAR(100), name of the statistic.
- STAT_VALUE, type DOUBLE, value of the statistic.
Examples
>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1
Perform chi_squared_goodness_of_fit():
>>> res, stat = chi_squared_goodness_of_fit(data=df, key='ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
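The figures in this example can be cross-checked without a HANA connection. The sketch below is not the PAL implementation. It assumes that each expected count is the P value multiplied by the total observed count (which matches 0.3 * 1851 = 555.3 in the output), that the degrees of freedom equal the number of categories minus one, and that the p-value is the upper-tail probability of the chi-squared distribution. The helper names chi2_sf and goodness_of_fit are hypothetical and are not part of hana_ml.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-squared variable.

    Hypothetical helper: uses the closed-form series that exist for
    integer degrees of freedom (dof >= 1) and x >= 0.
    """
    y = x / 2.0
    m = dof // 2
    if dof % 2 == 0:
        # Even dof = 2m: exp(-y) * sum_{j=0}^{m-1} y^j / j!
        return math.exp(-y) * sum(y ** j / math.factorial(j) for j in range(m))
    # Odd dof = 2m + 1: erfc(sqrt(y)) + exp(-y) * sum_{j=1}^{m} y^(j-1/2) / Gamma(j+1/2)
    tail = sum(y ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail


def goodness_of_fit(observed, proportions):
    """Recompute EXPECTED, RESIDUAL and the test statistics in plain Python.

    Assumption: expected counts are the given proportions scaled by the
    total number of observations.
    """
    total = sum(observed)
    expected = [p * total for p in proportions]
    residual = [o - e for o, e in zip(observed, expected)]
    chi2 = sum(r * r / e for r, e in zip(residual, expected))
    dof = len(observed) - 1
    return expected, residual, chi2, dof, chi2_sf(chi2, dof)


# Same values as the OBSERVED and P columns in the example above.
observed = [519.0, 364.0, 363.0, 200.0, 212.0, 193.0]
proportions = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]

expected, residual, chi2, dof, p_value = goodness_of_fit(observed, proportions)
print([round(e, 1) for e in expected])
print([round(r, 1) for r in residual])
print(round(chi2, 4), dof, round(p_value, 4))
```

Under these assumptions the recomputed values should agree with the EXPECTED, RESIDUAL, and STAT_VALUE columns shown above, up to the displayed precision.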