grubbs_test
- hana_ml.algorithms.pal.stats.grubbs_test(data, key, col=None, method=None, alpha=None)
Performs Grubbs' test to detect outliers in a univariate data set. The test assumes that the data come from a Gaussian (normal) distribution.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- col : str, optional
Name of the data column to be tested.
If not given, defaults to the first non-ID column.
- method : {'two_sides', 'one_side_min', 'one_side_max', 'repeat_two_sides'}, optional
Specifies the type of alternative hypothesis.
Defaults to 'one_side_min'.
- alpha : float, optional
Significance level.
Defaults to 0.05.
- Returns:
- DataFrame
Test results, structured as follows:
- SOURCE_ID column name: ID of the outlier data.
- RAW_DATA column name: value of the original data.
- DataFrame
Statistics, structured as follows:
- SOURCE_ID column name: ID of the outlier data.
- STAT_NAME column: name of the statistic.
- STAT_VALUE column: value of the statistic.
Examples
Original data:
>>> df.collect()
     ID        VAL
0   100   4.254843
1   200   0.135000
2   300  11.072257
3   400  14.797838
4   500  12.125133
5   600  14.265839
6   700   7.731352
7   800   6.856739
8   900  15.094403
9   101   8.149382
10  201   9.160144
Perform Grubbs' test:
>>> res, stats = grubbs_test(df, key='ID', method='one_side_max', alpha=0.2)
Results:
>>> res.collect()
    ID    VAL
0  200  0.135
>>> stats.collect()
    ID                 STAT_NAME  STAT_VALUE
0  200                      MEAN    9.422085
1  200  STANDARD_SAMPLE_VARIANCE    4.675935
2  200                         T    1.910219
3  200                         G    1.986145
4  200                         U    0.566075
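As a cross-check on the statistics above, the following is a minimal local sketch of the textbook Grubbs statistic, G = |x - mean| / s for the observation farthest from the mean, computed in plain Python on the example's VAL column. It is not the PAL implementation and needs no HANA connection; the helper name `grubbs_statistic` is hypothetical. Under these assumptions it reproduces the MEAN and G rows, and the value reported as STANDARD_SAMPLE_VARIANCE matches the sample standard deviation (n - 1 denominator) of the data.

```python
import statistics

# VAL column from the example data, keyed by ID.
values = {
    100: 4.254843, 200: 0.135000, 300: 11.072257, 400: 14.797838,
    500: 12.125133, 600: 14.265839, 700: 7.731352, 800: 6.856739,
    900: 15.094403, 101: 8.149382, 201: 9.160144,
}


def grubbs_statistic(data):
    """Return (suspect_id, mean, sample_std, G) for the textbook two-sided statistic.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    mean = statistics.mean(data.values())
    std = statistics.stdev(data.values())  # sample standard deviation (n - 1)
    # The suspect is the observation with the largest absolute deviation from the mean.
    suspect_id = max(data, key=lambda k: abs(data[k] - mean))
    g = abs(data[suspect_id] - mean) / std
    return suspect_id, mean, std, g


suspect_id, mean, std, g = grubbs_statistic(values)
print(f"suspect ID: {suspect_id}")
print(f"mean: {mean:.6f}, sample std: {std:.6f}, G: {g:.6f}")
```

The sketch stops at the statistic itself. Deciding whether the suspect is an outlier requires comparing G against a critical value derived from the t-distribution at the chosen significance level, which is left to `grubbs_test`.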