variance_test

hana_ml.algorithms.pal.preprocessing.variance_test(data, sigma_num, thread_ratio=None, key=None, data_col=None)

Variance Test identifies outliers among n numeric values {x_i}, i = 1, ..., n, using the mean and the standard deviation of those values. A value is treated as an outlier when it lies more than sigma_num standard deviations away from the mean.
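The rule can be illustrated locally without a HANA connection. The following is a minimal sketch, not the PAL implementation: the helper name variance_test_sketch and the returned dictionary keys are hypothetical, and it assumes the sample standard deviation (n - 1 denominator). The input is only the ten rows displayed in the Examples section, so its statistics differ from those of the full dataset.

```python
from statistics import mean, stdev


def variance_test_sketch(values, sigma_num):
    """Flag each value lying outside mean +/- sigma_num * standard deviation.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    lower = mu - sigma_num * sigma
    upper = mu + sigma_num * sigma
    # 0 -> in bounds, 1 -> out of bounds (mirrors IS_OUT_OF_RANGE)
    flags = [0 if lower <= x <= upper else 1 for x in values]
    return flags, {"mean": mu, "std": sigma, "lower": lower, "upper": upper}


# The ten X values displayed in the Examples section (IDs 10 to 19).
values = [26.0, 28.0, 29.0, 27.0, 26.0, 23.0, 22.0, 23.0, 25.0, 103.0]
flags, stats = variance_test_sketch(values, sigma_num=2.0)
print(flags)
```

With only ten values, a multiplier of 3 cannot flag anything under the sample standard deviation, because the largest possible deviation is (n - 1) / sqrt(n), about 2.85 standard deviations. The sketch therefore uses sigma_num=2.0, which flags only the value 103.0.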

Parameters:
data : DataFrame

DataFrame containing the data.

sigma_num : float

Multiplier for sigma (the standard deviation).

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 0.

key : str, optional

Name of the ID column in data.

If key is not specified, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it defaults to the first column of data.

data_col : str, optional

Name of the raw data column in the dataframe.

If not specified, defaults to the last column of data.
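The defaulting rules for key and data_col can be sketched in plain Python. This is an illustration of the documented behaviour only: resolve_columns is a hypothetical helper, not a hana_ml function, and it assumes the index is given either as a single column name or as a list of names.

```python
def resolve_columns(columns, index=None, key=None, data_col=None):
    """Return the (key, data_col) pair that the documented defaults would select.

    Hypothetical helper for illustration only; not part of hana_ml.
    """
    if key is None:
        if isinstance(index, str):
            # Indexed by a single column: use that column as the key.
            key = index
        elif index is not None and len(index) == 1:
            key = index[0]
        else:
            # No index, or a multi-column index: fall back to the first column.
            key = columns[0]
    if data_col is None:
        # The raw data column defaults to the last column.
        data_col = columns[-1]
    return key, data_col


print(resolve_columns(["ID", "X"]))
print(resolve_columns(["A", "ID", "X"], index="ID"))
```

Passing key and data_col explicitly avoids depending on column order when the input has more than two columns.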

Returns:
DataFrame

Outlier detection results, structured as follows:

  • DATA_ID: the ID column, with the same name as in the input DataFrame.

  • IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.

DataFrame

Statistic results, structured as follows:

  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect().tail(10)
   ID      X
0  10   26.0
1  11   28.0
2  12   29.0
3  13   27.0
4  14   26.0
5  15   23.0
6  16   22.0
7  17   23.0
8  18   25.0
9  19  103.0

Apply the variance test:

>>> res, stats = variance_test(data=df, sigma_num=3.0)
>>> res.collect().tail(10)
    ID  IS_OUT_OF_RANGE
0   10                0
1   11                0
2   12                0
3   13                0
4   14                0
5   15                0
6   16                0
7   17                0
8   18                0
9   19                1
>>> stats.collect()
    STAT_NAME  STAT_VALUE
0        mean   28.400000