variance_test
- hana_ml.algorithms.pal.preprocessing.variance_test(data, sigma_num, thread_ratio=None, key=None, data_col=None)
Variance Test identifies outliers among n numeric values {xi}, where 1 <= i <= n, using the mean and the standard deviation of those n values.
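The idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PAL implementation: the helper name `variance_test_sketch`, the sample values, the use of the sample standard deviation (n - 1 divisor), and the inclusive bounds are all assumptions made for this example.

```python
from statistics import mean, stdev


def variance_test_sketch(values, sigma_num):
    """Flag values lying outside mean +/- sigma_num * standard deviation.

    Hypothetical helper for illustration only; it does not call SAP HANA.
    """
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    lower = mu - sigma_num * sigma
    upper = mu + sigma_num * sigma
    # 0 -> in bounds, 1 -> out of bounds (bounds treated as inclusive here)
    flags = [0 if lower <= x <= upper else 1 for x in values]
    stats = {"mean": mu, "standard deviation": sigma,
             "lower bound": lower, "upper bound": upper}
    return flags, stats


# Hypothetical sample, not the data from the example in this page.
values = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 9.0, 11.0, 10.0, 50.0]
flags, stats = variance_test_sketch(values, sigma_num=2.0)
print(flags)
print(stats["mean"])
```

In this sketch only the last value (50.0) is flagged. A single extreme value also inflates the standard deviation, so in very small samples a large sigma_num may leave an obvious outlier unflagged.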
- Parameters
- data : DataFrame
DataFrame containing the data.
- sigma_num : float
Multiplier for sigma.
- thread_ratio : float, optional
Specifies the ratio of the total number of threads that can be used by this function.
The value ranges from 0 to 1, where 0 means using only 1 thread, and 1 means using at most all the currently available threads.
Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- key : str, optional
Name of the ID column in data.
If key is not specified, then: if data is indexed by a single column, key defaults to that index column; otherwise, it defaults to the first column of data.
- data_col : str, optional
Name of the raw data column in the dataframe.
If not specified, defaults to the last column of data.
- Returns
- DataFrame
Sampling results, structured as follows:
DATA_ID: name as shown in input dataframe.
IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.
Statistic results, structured as follows:
STAT_NAME: statistic name.
STAT_VALUE: statistic value.
Examples
Original data:
>>> df.collect().tail(10)
   ID      X
0  10   26.0
1  11   28.0
2  12   29.0
3  13   27.0
4  14   26.0
5  15   23.0
6  16   22.0
7  17   23.0
8  18   25.0
9  19  103.0
Apply the variance test:
>>> res, stats = variance_test(df, sigma_num=3.0)
>>> res.collect().tail(10)
   ID  IS_OUT_OF_RANGE
0  10                0
1  11                0
2  12                0
3  13                0
4  14                0
5  15                0
6  16                0
7  17                0
8  18                0
9  19                1
>>> stats.collect()
  STAT_NAME  STAT_VALUE
0      mean   28.400000