Grubbs' Test for Outliers — hanaml.GrubbsTest • hana.ml.r

hanaml.GrubbsTest is an R wrapper for the SAP HANA PAL Grubbs' test.

hanaml.GrubbsTest(data, key, col = NULL, method = NULL, alpha = NULL)

Arguments

data

DataFrame
DataFrame containing the data points, structured as follows:

SOURCE_ID : INTEGER
RAW_DATA : INTEGER or DOUBLE

key

character
Name of the ID column of data.

col

character, optional
Name of the data column to be tested. If not provided, it defaults to the non-key column of data.

method

{"two.sided", "one.sided.min", "one.sided.max", "iter.two.sided"}, optional
Specifies which variant of the test to run. The available methods are:

"two.sided" uses the two-sided test.
"one.sided.min" uses the one-sided test for the minimum value.
"one.sided.max" uses the one-sided test for the maximum value.
"iter.two.sided" performs the two-sided test iteratively to detect multiple outliers.

Defaults to "two.sided".

alpha

double, optional
Specifies the significance level at which the algorithm rejects the hypothesis that there are no outliers in the given data set. Defaults to 0.05.

Value

Returns a list of DataFrames.

DataFrame 1
Detected outliers, structured as follows:
- SOURCE_ID : ID of the outlier data point.
- RAW_DATA : the corresponding value.
DataFrame 2
Statistics of the test, structured as follows:
- SOURCE_ID : ID of the outlier data point.
- STAT_NAME : Name of the statistic.
- STAT_VALUE : Value of the statistic.

Details

Grubbs' test detects a single outlier in a data set that is assumed to follow a Gaussian (normal) distribution.
It can be applied iteratively to detect multiple outliers.
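
As an illustration of the iterative idea, the following base-R sketch repeats a two-sided Grubbs' test and removes the most extreme point each time the test rejects. It does not call SAP HANA PAL. The helper names grubbs_critical and grubbs_iterative are hypothetical, the sample readings are made up, and the critical-value formula is the standard textbook one, which is assumed (not confirmed here) to match what PAL computes.

```r
# Critical value of the two-sided Grubbs' test for n points at level alpha
# (standard textbook formula based on the t distribution with n - 2 df).
grubbs_critical <- function(n, alpha) {
  t_crit <- qt(1 - alpha / (2 * n), df = n - 2)
  (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
}

# Repeat the two-sided test, dropping the most extreme point after each
# rejection. Returns the positions (in the original vector) of the outliers.
grubbs_iterative <- function(x, alpha = 0.05) {
  remaining <- seq_along(x)
  outliers <- integer(0)
  while (length(remaining) >= 3) {          # the test needs at least 3 points
    values <- x[remaining]
    spread <- sd(values)
    if (spread == 0) break                  # all values equal: nothing to test
    deviation <- abs(values - mean(values))
    worst <- which.max(deviation)           # most extreme remaining point
    g <- deviation[worst] / spread          # Grubbs' statistic G
    if (g <= grubbs_critical(length(values), alpha)) break
    outliers <- c(outliers, remaining[worst])
    remaining <- remaining[-worst]
  }
  outliers
}

# Hypothetical data: six readings near 10 and one gross error at position 7.
readings <- c(10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0)
flagged <- grubbs_iterative(readings, alpha = 0.05)
print(flagged)
```

With these readings only position 7 is flagged; once it is removed, the remaining six values pass the test and the loop stops. Stopping at the first non-rejection is a design choice of this sketch, not a documented property of the "iter.two.sided" method.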

Examples

Input DataFrame data:


> data$Collect()
    ID       VAL
1  100  4.254843
2  200  0.135000
3  300 11.072257
4  400 14.797838
5  500 12.125133
6  600 14.265839
7  700  7.731352
8  800  6.856739
9  900 15.094403
10 101  8.149382
11 201  9.160144

Call the function:


> result <- hanaml.GrubbsTest(data = data,
                              key = "ID",
                              method = "one.sided.min",
                              alpha = 0.2)

Results:


> result[[1]]$Collect()
   ID   VAL
1 200 0.135
> result[[2]]$Collect()
   ID                STAT_NAME STAT_VALUE
1 200                     MEAN  9.4220845
2 200 STANDARD_SAMPLE_VARIANCE  4.6759352
3 200                        T  1.9102192
4 200                        G  1.9861448
5 200                        U  0.5660752
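
The statistics above can be cross-checked without a HANA connection. The base-R sketch below recomputes them from the eleven example values using the standard one-sided Grubbs' formulas. Reading T as the critical value and U as 1 - n G^2 / (n - 1)^2 is an inference from the numbers rather than something this page states, and the variable names are illustrative.

```r
# Values of the VAL column from the example above.
val <- c(4.254843, 0.135000, 11.072257, 14.797838, 12.125133, 14.265839,
         7.731352, 6.856739, 15.094403, 8.149382, 9.160144)
n <- length(val)
alpha <- 0.2

m <- mean(val)            # compare with MEAN
s <- sd(val)              # compare with STANDARD_SAMPLE_VARIANCE
g <- (m - min(val)) / s   # one-sided statistic for the minimum, compare with G

# One-sided critical value from the t distribution (assumed to correspond to T).
t_crit <- qt(1 - alpha / n, df = n - 2)
g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

# Assumed relation between U and G.
u <- 1 - n * g^2 / (n - 1)^2

print(round(c(MEAN = m, SD = s, T = g_crit, G = g, U = u), 4))
print(g > g_crit)         # TRUE means the minimum is flagged as an outlier
```

Note that the value reported as STANDARD_SAMPLE_VARIANCE matches the sample standard deviation of these values (about 4.676), not its square. Because G exceeds the critical value at alpha = 0.2, the minimum (ID 200) is reported as an outlier.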