grubbs_test

hana_ml.algorithms.pal.stats.grubbs_test(data, key, col=None, method=None, alpha=None)

Perform Grubbs' test to detect outliers in a univariate data set. The test assumes that the data come from a Gaussian (normal) distribution.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

col : str, optional

Name of the data column to be tested.

If not given, defaults to the first non-ID column.

method : {'two_sides', 'one_side_min', 'one_side_max', 'repeat_two_sides'}, optional

Specifies the type of alternative hypothesis.

Defaults to 'one_side_min'.

alpha : float, optional

Significance level.

Defaults to 0.05.
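The one-sided and two-sided alternatives differ in which observation is treated as the suspect. The following is a minimal sketch of the textbook Grubbs statistic G for three of the `method` values, written in plain Python. The helper `grubbs_statistic` and the `sample` list are hypothetical and only illustrate the idea; this is not the PAL implementation, and 'repeat_two_sides' (which repeats the two-sided test) is not covered.

```python
from statistics import mean, stdev


def grubbs_statistic(values, method="one_side_min"):
    """Textbook Grubbs statistic G for one alternative (illustrative only)."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    if method == "one_side_min":
        # Suspect is the smallest observation.
        return (centre - min(values)) / spread
    if method == "one_side_max":
        # Suspect is the largest observation.
        return (max(values) - centre) / spread
    if method == "two_sides":
        # Suspect is whichever observation lies farthest from the mean.
        return max(abs(v - centre) for v in values) / spread
    raise ValueError(f"unsupported method: {method!r}")


# Hypothetical toy data with one large value.
sample = [2.0, 3.0, 4.0, 5.0, 16.0]
for name in ("one_side_min", "one_side_max", "two_sides"):
    print(name, round(grubbs_statistic(sample, name), 4))
```

In the textbook test, the suspect is declared an outlier when G exceeds a critical value derived from the t distribution at the chosen significance level.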

Returns:

DataFrame

Test results (the first returned DataFrame), structured as follows:

SOURCE_ID column (named after the ID column of the input data), ID of the outlier.

RAW_DATA column (named after the tested data column), original value of the outlier.

Statistics (the second returned DataFrame), structured as follows:

SOURCE_ID column (named after the ID column of the input data), ID of the outlier.

STAT_NAME column, name of statistics.

STAT_VALUE column, value of statistics.

Examples

Original data:

>>> df.collect()
     ID        VAL
0   100   4.254843
1   200   0.135000
2   300  11.072257
3   400  14.797838
4   500  12.125133
5   600  14.265839
6   700   7.731352
7   800   6.856739
8   900  15.094403
9   101   8.149382
10  201   9.160144

Perform Grubbs' test:

>>> res, stats = grubbs_test(df, key='ID', method='one_side_max', alpha=0.2)

Results:

>>> res.collect()
   ID    VAL
0  200  0.135
>>> stats.collect()
    ID                 STAT_NAME  STAT_VALUE
0  200                      MEAN    9.422085
1  200  STANDARD_SAMPLE_VARIANCE    4.675935
2  200                         T    1.910219
3  200                         G    1.986145
4  200                         U    0.566075
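As a cross-check, several of the reported statistics can be recomputed from the VAL column with the Python standard library. This is a sketch under assumptions inferred from the numbers, not from a documented definition: G is taken as the distance of the flagged value (0.135, ID 200) from the mean divided by the sample standard deviation, and U as the ratio of the sum of squared deviations without the flagged value to the sum with all values. The variable names are illustrative.

```python
from statistics import mean, stdev

# VAL column from the example above, in row order.
values = [4.254843, 0.135000, 11.072257, 14.797838, 12.125133, 14.265839,
          7.731352, 6.856739, 15.094403, 8.149382, 9.160144]

avg = mean(values)
s = stdev(values)          # sample standard deviation (n - 1 denominator)
suspect = min(values)      # 0.135, the row flagged in `res`

# Assumed definition: distance of the suspect from the mean, in units of s.
g = (avg - suspect) / s

# Assumed definition: sum of squares without the suspect over sum of squares with it.
rest = [v for v in values if v != suspect]
rest_avg = mean(rest)
ss_all = sum((v - avg) ** 2 for v in values)
ss_rest = sum((v - rest_avg) ** 2 for v in rest)
u = ss_rest / ss_all

print(f"mean={avg:.6f} s={s:.6f} G={g:.6f} U={u:.6f}")
```

Under these assumptions the recomputed values agree with the MEAN, G, and U rows to about three decimal places, and the value reported as STANDARD_SAMPLE_VARIANCE agrees with the sample standard deviation rather than its square.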