variance_test

hana_ml.algorithms.pal.preprocessing.variance_test(data, sigma_num, thread_ratio=None, key=None, data_col=None)

Variance Test identifies outliers among n numeric values x_i (i = 1, ..., n) using their mean and standard deviation. A value is flagged as out of range when it lies more than sigma_num standard deviations away from the mean.
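The rule can be illustrated locally with NumPy. This is a minimal sketch and not the PAL implementation: the helper name variance_test_sketch, the use of the sample standard deviation (ddof=1), and the strict comparison at the bounds are all assumptions made for illustration.

```python
import numpy as np


def variance_test_sketch(values, sigma_num, ddof=1):
    """Flag values outside mean +/- sigma_num * standard deviation.

    Hypothetical helper for illustration only. ddof=1 (sample standard
    deviation) and strict inequalities at the bounds are assumptions.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std(ddof=ddof)
    lower = mean - sigma_num * std
    upper = mean + sigma_num * std
    # 1 -> out of bounds, 0 -> in bounds, mirroring IS_OUT_OF_RANGE
    is_out_of_range = ((x < lower) | (x > upper)).astype(int)
    stats = {"mean": mean, "std": std, "lower": lower, "upper": upper}
    return is_out_of_range, stats


sample = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 50.0]
flags, stats = variance_test_sketch(sample, sigma_num=2.0)
print(flags.tolist())  # [0, 0, 0, 0, 0, 0, 1]
```

Note that the outlier itself inflates the mean and the standard deviation, so on small samples a large multiplier can hide it: with sigma_num=3.0 the same sample yields no flagged values.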

Parameters:
data : DataFrame

DataFrame containing the data.

sigma_num : float

Multiplier for sigma (the standard deviation), which sets the width of the accepted range around the mean.

thread_ratio : float, optional

Fraction of the available threads to use, from 0 to 1. A value of 0 means a single thread; a value of 1 means all currently available threads. Values outside this range are ignored, and the function then determines the number of threads heuristically.

Defaults to 0.

key : str, optional

Name of the ID column in data.

If key is not specified, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it defaults to the first column of data.

data_col : str, optional

Name of the column in data that holds the raw values to test.

If not specified, defaults to the last column of data.
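The default rules for key and data_col can be summarized in plain Python. This is only a sketch of the behavior described above: resolve_columns is a hypothetical helper, not part of hana_ml, and it takes the column names and the index as ordinary Python values instead of a DataFrame.

```python
def resolve_columns(columns, index=None, key=None, data_col=None):
    """Apply the documented defaults for key and data_col.

    Hypothetical helper for illustration only.
    columns: list of column names of the input data, in order.
    index: index column name, a list of index column names, or None.
    """
    if key is None:
        if isinstance(index, str):
            # Data indexed by a single column: use it as the key.
            key = index
        elif isinstance(index, (list, tuple)) and len(index) == 1:
            key = index[0]
        else:
            # No index, or a multi-column index: fall back to the first column.
            key = columns[0]
    if data_col is None:
        # The raw data column defaults to the last column.
        data_col = columns[-1]
    return key, data_col


print(resolve_columns(["TS", "ID", "X"], index="ID"))  # ('ID', 'X')
```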

Returns:
DataFrames

Two DataFrames are returned, in the order listed below.

Test results, structured as follows:

  • DATA_ID: ID column, named as in the input DataFrame.

  • IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.

Statistic results, structured as follows:

  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Examples

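>>> from hana_ml.algorithms.pal.preprocessing import variance_test
>>> # df is assumed to be an existing hana_ml DataFrame with an ID column and a numeric data column
>>> res, stats = variance_test(data=df, sigma_num=3.0)
>>> res.collect()    # per-row out-of-range flags
>>> stats.collect()  # statistic names and values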