grubbs_test

hana_ml.algorithms.pal.stats.grubbs_test(data, key, col=None, method=None, alpha=None)

Performs Grubbs' test to detect outliers in a given univariate dataset Y. The algorithm assumes that Y comes from a Gaussian distribution.
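As a rough illustration of what the test measures, the textbook two-sided Grubbs statistic is the largest absolute deviation from the sample mean, divided by the sample standard deviation. The sketch below uses only the Python standard library; the helper name `grubbs_statistic` and the sample values are hypothetical and are not part of hana_ml, and PAL's internal computation may differ in detail.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index) for the textbook two-sided Grubbs statistic.

    G = max_i |Y_i - mean(Y)| / s, where s is the sample standard
    deviation (n - 1 denominator). Hypothetical helper, not hana_ml API.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    deviations = [abs(v - mean) for v in values]
    # Index of the observation farthest from the mean.
    idx = max(range(len(values)), key=deviations.__getitem__)
    return deviations[idx] / std, idx

# Made-up sample with one suspicious value at the end.
sample = [1.0, 2.0, 3.0, 4.0, 10.0]
g, idx = grubbs_statistic(sample)
print(f"G = {g:.4f} for value {sample[idx]}")
```

The observation is flagged as an outlier only if G exceeds a critical value that depends on the sample size and the significance level.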

Parameters:

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

col : str, optional

Name of the data column that needs to be tested.

If not given, defaults to the first non-ID column.

method : {'two_sides', 'one_side_min', 'one_side_max', 'repeat_two_sides'}, optional

Specifies the type of the alternative hypothesis.

Defaults to 'one_side_min'.

alpha : float, optional

Significance level of the test.

Defaults to 0.05.
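To make the roles of `method` and `alpha` more concrete, the sketch below shows the textbook one-sided statistics and the usual critical-value formula. It is an illustration under stated assumptions, not PAL's implementation: the helper names are hypothetical, and the t quantile `t_crit` is assumed to be looked up elsewhere (for example from a t-table) with n - 2 degrees of freedom, at level alpha/n for a one-sided test or alpha/(2n) for a two-sided test.

```python
import math
import statistics

def one_sided_statistic(values, side):
    """Textbook Grubbs statistic for the smallest ('min') or largest ('max') value."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if side == "min":
        return (mean - min(values)) / std
    if side == "max":
        return (max(values) - mean) / std
    raise ValueError("side must be 'min' or 'max'")

def grubbs_critical_value(n, t_crit):
    """Critical G for sample size n.

    t_crit is assumed to be the upper quantile of the t distribution
    with n - 2 degrees of freedom, taken at alpha/n (one-sided) or
    alpha/(2n) (two-sided). It is supplied by the caller.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Made-up sample; the t quantile below is a placeholder, not a table value.
values = [1.0, 2.0, 3.0, 4.0, 10.0]
g_max = one_sided_statistic(values, "max")
g_min = one_sided_statistic(values, "min")
threshold = grubbs_critical_value(len(values), t_crit=2.0)
print(g_max, g_min, threshold)
```

A smaller `alpha` pushes the t quantile, and therefore the threshold, upward, so fewer points are reported as outliers.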

Returns:

DataFrames

Test results, structured as follows:

A column named after the source ID column (`key`), holding the ID of each outlier.

A column named after the tested data column, holding the original value of each outlier.

Statistics, structured as follows:

A column named after the source ID column (`key`), holding the ID of each outlier.

STAT_NAME column, name of statistics.

STAT_VALUE column, value of statistics.

Examples

Original data:

>>> df.collect()
     ID        VAL
0   100   4.254843
1   200   0.135000
...
9   101   8.149382
10  201   9.160144

Perform Grubbs' test:

>>> from hana_ml.algorithms.pal.stats import grubbs_test
>>> res, stats = grubbs_test(data=df, key='ID',
...                          method='one_side_max', alpha=0.2)

Results:

>>> res.collect()
    ID    VAL
0  200  0.135
>>> stats.collect()
    ID                 STAT_NAME  STAT_VALUE
0  200                      MEAN    9.422085
1  200  STANDARD_SAMPLE_VARIANCE    4.675935
2  200                         T    1.910219
3  200                         G    1.986145
4  200                         U    0.566075
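The statistics come back in long format, one row per outlier and statistic. If a per-ID lookup is more convenient, the rows can be regrouped as in the sketch below. To stay runnable without a HANA connection, it works on plain tuples copied from the output above rather than on the collected DataFrame; the variable names are illustrative only.

```python
from collections import defaultdict

# Rows copied from the stats.collect() output above:
# (ID, STAT_NAME, STAT_VALUE).
rows = [
    (200, "MEAN", 9.422085),
    (200, "STANDARD_SAMPLE_VARIANCE", 4.675935),
    (200, "T", 1.910219),
    (200, "G", 1.986145),
    (200, "U", 0.566075),
]

# Regroup into {ID: {STAT_NAME: STAT_VALUE}}.
by_id = defaultdict(dict)
for source_id, name, value in rows:
    by_id[source_id][name] = value

print(by_id[200]["G"])
```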