grubbs_test
- hana_ml.algorithms.pal.stats.grubbs_test(data, key, col=None, method=None, alpha=None)
Performs Grubbs' test to detect outliers in a given univariate dataset. The algorithm assumes that the data (Y) come from a Gaussian distribution.
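As a rough illustration of what the test measures, the sketch below computes the textbook Grubbs statistic G (the distance of the most extreme point from the sample mean, in units of the sample standard deviation) for three of the alternatives. It uses only the Python standard library and is not the PAL implementation. The helper name `grubbs_statistic` is hypothetical, and the mapping of the `method` names onto these formulas is an assumption. `'repeat_two_sides'` is not covered.

```python
import statistics


def grubbs_statistic(values, method="one_side_min"):
    """Return (G, index) for the most extreme observation.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    if len(values) < 3:
        raise ValueError("Grubbs' test needs at least three observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all values are identical; G is undefined")

    indices = range(len(values))
    if method == "one_side_min":
        # Is the smallest value an outlier?
        idx = min(indices, key=lambda i: values[i])
        g = (mean - values[idx]) / sd
    elif method == "one_side_max":
        # Is the largest value an outlier?
        idx = max(indices, key=lambda i: values[i])
        g = (values[idx] - mean) / sd
    elif method == "two_sides":
        # Is the value farthest from the mean (in either direction) an outlier?
        idx = max(indices, key=lambda i: abs(values[i] - mean))
        g = abs(values[idx] - mean) / sd
    else:
        raise ValueError(f"method not covered by this sketch: {method!r}")
    return g, idx


sample = [1.0, 2.0, 3.0, 4.0, 10.0]
g, idx = grubbs_statistic(sample, method="one_side_max")
print(round(g, 3), idx)  # 1.697 4
```

A large G alone does not flag an outlier. It has to be compared against a critical value that depends on the sample size and the significance level.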
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- col : str, optional
Name of the data column that needs to be tested.
If not given, defaults to the first non-ID column.
- method : {'two_sides', 'one_side_min', 'one_side_max', 'repeat_two_sides'}, optional
Specifies the type of alternative hypothesis.
Defaults to 'one_side_min'.
- alpha : float, optional
Significance level.
Defaults to 0.05.
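For orientation, the textbook form of Grubbs' test shows how the significance level enters the decision. Whether PAL applies exactly this rule is an assumption. Here N is the number of observations, \bar{Y} and s are the sample mean and sample standard deviation, and t_{p,\,N-2} is the upper critical value of the t distribution with N-2 degrees of freedom at tail probability p.

```latex
% Two-sided statistic and rejection rule at significance level \alpha
G = \frac{\max_{i} \lvert Y_i - \bar{Y} \rvert}{s},
\qquad
\text{reject } H_0 \text{ if }\;
G > \frac{N-1}{\sqrt{N}}
    \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}

% One-sided variants: use G = (\bar{Y} - Y_{\min})/s or G = (Y_{\max} - \bar{Y})/s
% and replace \alpha/(2N) with \alpha/N in the critical value.
```

A larger alpha lowers the critical value, so more extreme points are flagged as outliers.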
- Returns:
- DataFrames
Test results, structured as follows:
  - SOURCE_ID column name, ID of outlier data.
  - RAW_DATA column name, value of original data.
Statistics, structured as follows:
  - SOURCE_ID column name, ID of outlier data.
  - STAT_NAME column, name of statistics.
  - STAT_VALUE column, value of statistics.
Examples
Original data:
>>> df.collect()
     ID       VAL
0   100  4.254843
1   200  0.135000
...
9   101  8.149382
10  201  9.160144
Perform Grubbs' test:
>>> res, stats = grubbs_test(data=df, key='ID', method='one_side_max', alpha=0.2)
Results:
>>> res.collect()
    ID    VAL
0  200  0.135
>>> stats.collect()
    ID                 STAT_NAME  STAT_VALUE
0  200                      MEAN    9.422085
1  200  STANDARD_SAMPLE_VARIANCE    4.675935
2  200                         T    1.910219
3  200                         G    1.986145
4  200                         U    0.566075