grubbs_test

hana_ml.algorithms.pal.stats.grubbs_test(data, key, col=None, method=None, alpha=None)

Performs Grubbs' test to detect outliers in a given univariate data set. The algorithm assumes that the data come from a Gaussian distribution.

Parameters
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

col : str, optional

Name of the data column that needs to be tested.

If not given, defaults to the first non-ID column.

method : {'two_sides', 'one_side_min', 'one_side_max', 'repeat_two_sides'}, optional

Specifies the type of alternative hypothesis, that is, which side of the data is tested for an outlier.

Defaults to 'one_side_min'.

alpha : float, optional

Significance level.

Defaults to 0.05.
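The alternatives selected by method can be illustrated with the textbook Grubbs statistic. The sketch below is plain Python, not the PAL implementation: grubbs_statistic is a hypothetical helper, and the assumption that PAL maps each method value to these textbook definitions is not confirmed by this page. 'repeat_two_sides' is not covered.

```python
import statistics


def grubbs_statistic(values, method="one_side_min"):
    """Return (G, index of the tested point) for a textbook Grubbs' test.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    indices = range(len(values))
    if method == "one_side_min":
        # Test whether the smallest value is an outlier.
        idx = min(indices, key=lambda i: values[i])
        return (mean - values[idx]) / sd, idx
    if method == "one_side_max":
        # Test whether the largest value is an outlier.
        idx = max(indices, key=lambda i: values[i])
        return (values[idx] - mean) / sd, idx
    if method == "two_sides":
        # Test the value farthest from the mean, on either side.
        idx = max(indices, key=lambda i: abs(values[i] - mean))
        return abs(values[idx] - mean) / sd, idx
    raise ValueError("unsupported method: %r" % (method,))


sample = [1.0, 2.0, 3.0, 4.0, 100.0]
g_max, idx_max = grubbs_statistic(sample, method="one_side_max")
g_min, idx_min = grubbs_statistic(sample, method="one_side_min")
print(idx_max, g_max)
print(idx_min, g_min)
```

The tested point is declared an outlier when G exceeds a critical value that depends on the sample size and on alpha; that comparison is left out of the sketch.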

Returns
DataFrame

Test results (first DataFrame), structured as follows:

  • SOURCE_ID column name, ID of the outlier data.

  • RAW_DATA column name, original value of the outlier data.

Statistics (second DataFrame), structured as follows:

  • SOURCE_ID column name, ID of the outlier data.

  • STAT_NAME column, name of the statistic.

  • STAT_VALUE column, value of the statistic.

Examples

Original data:

>>> df.collect()
     ID        VAL
0   100   4.254843
1   200   0.135000
2   300  11.072257
3   400  14.797838
4   500  12.125133
5   600  14.265839
6   700   7.731352
7   800   6.856739
8   900  15.094403
9   101   8.149382
10  201   9.160144

Perform Grubbs' test:

>>> res, stats = grubbs_test(df, key='ID', method='one_side_max', alpha=0.2)

Results:

>>> res.collect()
   ID    VAL
0  200  0.135
>>> stats.collect()
    ID                 STAT_NAME  STAT_VALUE
0  200                      MEAN    9.422085
1  200  STANDARD_SAMPLE_VARIANCE    4.675935
2  200                         T    1.910219
3  200                         G    1.986145
4  200                         U    0.566075
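As a rough cross-check, the MEAN and G rows can be recomputed locally from the values shown under "Original data". This is a minimal sketch in plain Python that does not call PAL; the variable names are illustrative. It assumes G is the distance of the reported point (ID 200, the minimum) from the mean divided by the sample standard deviation, and that the row labelled STANDARD_SAMPLE_VARIANCE holds the sample standard deviation, since that is what the numbers appear to match.

```python
import statistics

# Values copied from the "Original data" listing.
ids = [100, 200, 300, 400, 500, 600, 700, 800, 900, 101, 201]
vals = [4.254843, 0.135000, 11.072257, 14.797838, 12.125133, 14.265839,
        7.731352, 6.856739, 15.094403, 8.149382, 9.160144]

mean = statistics.fmean(vals)
sd = statistics.stdev(vals)       # sample standard deviation (n - 1 denominator)

i_min = vals.index(min(vals))     # position of the smallest value
g = (mean - vals[i_min]) / sd     # Grubbs statistic for the minimum

print("ID:", ids[i_min])
print("mean:", round(mean, 6))
print("sample std:", round(sd, 6))
print("G:", round(g, 6))
```

The printed values should land close to the MEAN, STANDARD_SAMPLE_VARIANCE and G rows of the statistics table; small differences in the last digit can come from rounding in the displayed data.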