variance_test

The variance test identifies outliers among n numeric values {x_i}, 1 ≤ i ≤ n, using the mean and the standard deviation of those values. A value is flagged as an outlier when it lies more than `sigma_num` standard deviations away from the mean.
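The rule above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `variance_test_sketch`, the dictionary keys, and the toy data are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since this page does not say which estimator is used.

```python
import statistics


def variance_test_sketch(values, sigma_num):
    """Flag values lying outside mean +/- sigma_num * standard deviation.

    Hypothetical helper for illustration only.
    """
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    std = statistics.stdev(values)
    lower = mean - sigma_num * std
    upper = mean + sigma_num * std
    # 0 -> in bounds, 1 -> out of bounds, as in IS_OUT_OF_RANGE.
    flags = [0 if lower <= x <= upper else 1 for x in values]
    stats = {"mean": mean, "std": std, "lower": lower, "upper": upper}
    return flags, stats


# Toy data (not taken from the example on this page).
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
flags, stats = variance_test_sketch(values, sigma_num=2.0)
print(flags)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

With very few points a single extreme value inflates the standard deviation enough to hide itself: on this ten-value toy set, `sigma_num=3.0` flags nothing, so the multiplier should be chosen with the sample size in mind.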

Parameters

data : DataFrame

DataFrame containing the data.

sigma_num : float

Multiplier for sigma (the standard deviation); it sets the width of the accepted range around the mean.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

key : str, optional

Name of the ID column in `data`.

If `key` is not specified, then:

• if `data` is indexed by a single column, then `key` defaults to that index column;

• otherwise, it defaults to the first column of `data`.

data_col : str, optional

Name of the raw data column in `data`.

If not specified, defaults to the last column of `data`.
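The defaulting rules for `key` and `data_col` can be sketched as follows. The helper `resolve_columns` and its arguments are hypothetical and only mirror the rules stated above; they are not part of the library.

```python
def resolve_columns(columns, index_column=None, key=None, data_col=None):
    """Return the (key, data_col) pair implied by the documented defaults.

    columns      -- ordered list of column names in the input data
    index_column -- name of the single index column, or None if there is none
    """
    if key is None:
        # Prefer the single index column; otherwise fall back to the first column.
        key = index_column if index_column is not None else columns[0]
    if data_col is None:
        # The raw data column defaults to the last column.
        data_col = columns[-1]
    return key, data_col


print(resolve_columns(["ID", "X"]))                           # ('ID', 'X')
print(resolve_columns(["TS", "ID", "X"], index_column="ID"))  # ('ID', 'X')
print(resolve_columns(["ID", "A", "B"], data_col="A"))        # ('ID', 'A')
```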

Returns
DataFrame

Outlier test results, structured as follows:

• DATA_ID: the ID column, named as in the input DataFrame.

• IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.

Statistic results, structured as follows:

• STAT_NAME: statistic name.

• STAT_VALUE: statistic value.

Examples

Original data:

```
>>> df.collect().tail(10)
   ID      X
0  10   26.0
1  11   28.0
2  12   29.0
3  13   27.0
4  14   26.0
5  15   23.0
6  16   22.0
7  17   23.0
8  18   25.0
9  19  103.0
```

Apply the variance test:

```
>>> res, stats = variance_test(data=df, sigma_num=3.0)
>>> res.collect().tail(10)
    ID  IS_OUT_OF_RANGE
0   10                0
1   11                0
2   12                0
3   13                0
4   14                0
5   15                0
6   16                0
7   17                0
8   18                0
9   19                1
>>> stats.collect()
    STAT_NAME  STAT_VALUE
0        mean   28.400000
```