iqr

hana_ml.algorithms.pal.stats.iqr(data, key, col=None, multiplier=None)

Perform the inter-quartile range (IQR) test to identify outliers in the data. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of the data. A data point is marked as an outlier if it falls outside the range from Q1 - multiplier * IQR to Q3 + multiplier * IQR.
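The rule described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the PAL implementation: the helper name `iqr_outlier_flags` is hypothetical, and the choice of the "inclusive" quartile method from the standard library is an assumption, since quartile definitions differ between tools and PAL may compute Q1 and Q3 differently.

```python
from statistics import quantiles


def iqr_outlier_flags(values, multiplier=1.5):
    """Return (flags, lower, upper); a flag is 1 when the value lies outside the bounds."""
    # Assumption: 'inclusive' quartiles; PAL may use a different quartile definition.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # the inter-quartile range
    lower = q1 - multiplier * spread
    upper = q3 + multiplier * spread
    # Values exactly on a bound are treated as in range.
    flags = [0 if lower <= v <= upper else 1 for v in values]
    return flags, lower, upper


flags, lower, upper = iqr_outlier_flags([1, 2, 3, 4, 100])
print(flags, lower, upper)  # [0, 0, 0, 0, 1] -1.0 7.0
```

The flags use the same 0/1 convention as the IS_OUT_OF_RANGE column described under Returns.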

Parameters
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

col : str, optional

Name of the data column that needs to be tested.

If not given, it defaults to the first non-ID column.

multiplier : float, optional

The multiplier used to calculate the value range during the IQR test.

  • Upper-bound = Q3 + multiplier * IQR

  • Lower-bound = Q1 - multiplier * IQR

where Q1 is the 25th percentile and Q3 is the 75th percentile.

Defaults to 1.5.
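To see how the multiplier widens or narrows the accepted range, the two bound formulas can be evaluated directly. The sketch below is illustrative only: `iqr_bounds` is a hypothetical helper, not part of hana_ml, and the quartile values 10.0 and 12.0 are simply sample inputs.

```python
def iqr_bounds(q1, q3, multiplier=1.5):
    """Return (lower_bound, upper_bound) for the given quartiles and multiplier."""
    spread = q3 - q1  # IQR
    return q1 - multiplier * spread, q3 + multiplier * spread


# A larger multiplier gives a wider range, so fewer points are flagged.
for m in (1.5, 3.0):
    print(m, iqr_bounds(10.0, 12.0, m))
# 1.5 (7.0, 15.0)
# 3.0 (4.0, 18.0)
```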

Returns
DataFrame

Test results (the first DataFrame returned), structured as follows:

  • ID column, with same name and type as data's ID column.

  • IS_OUT_OF_RANGE, type INTEGER, indicating whether each data sample falls within the range computed by the IQR test:

    • 0: the value is in the range.

    • 1: the value is out of range.

Statistical outputs (the second DataFrame returned), including Upper-bound and Lower-bound from the IQR test, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistics name.

  • STAT_VALUE, type DOUBLE, statistics value.

Examples

Original data:

>>> df.collect()
     ID   VAL
0    P1  10.0
1    P2  11.0
2    P3  10.0
3    P4   9.0
4    P5  10.0
5    P6  24.0
6    P7  11.0
7    P8  12.0
8    P9  10.0
9   P10   9.0
10  P11   1.0
11  P12  11.0
12  P13  12.0
13  P14  13.0
14  P15  12.0

Perform the IQR test:

>>> res, stat = iqr(data=df, key='ID', col='VAL', multiplier=1.5)
>>> res.collect()
     ID  IS_OUT_OF_RANGE
0    P1                0
1    P2                0
2    P3                0
3    P4                0
4    P5                0
5    P6                1
6    P7                0
7    P8                0
8    P9                0
9   P10                0
10  P11                1
11  P12                0
12  P13                0
13  P14                0
14  P15                0
>>> stat.collect()
        STAT_NAME  STAT_VALUE
0  lower quartile        10.0
1  upper quartile        12.0
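As a sanity check, the flags above can be reproduced by hand from the reported quartiles. The sketch below copies the example rows into a plain Python list (no HANA connection involved) and applies the bound formulas with multiplier 1.5; the variable names are illustrative only.

```python
# Rows copied from the example data above.
rows = [
    ("P1", 10.0), ("P2", 11.0), ("P3", 10.0), ("P4", 9.0), ("P5", 10.0),
    ("P6", 24.0), ("P7", 11.0), ("P8", 12.0), ("P9", 10.0), ("P10", 9.0),
    ("P11", 1.0), ("P12", 11.0), ("P13", 12.0), ("P14", 13.0), ("P15", 12.0),
]

# Quartiles as reported by stat.collect(); multiplier as passed to iqr().
q1, q3, multiplier = 10.0, 12.0, 1.5

lower = q1 - multiplier * (q3 - q1)  # 10 - 1.5 * 2 = 7.0
upper = q3 + multiplier * (q3 - q1)  # 12 + 1.5 * 2 = 15.0

# IDs whose value falls outside [lower, upper].
flagged = [row_id for row_id, val in rows if val < lower or val > upper]
print(lower, upper, flagged)  # 7.0 15.0 ['P6', 'P11']
```

P6 (24.0) lies above the upper bound and P11 (1.0) below the lower bound, which matches the two rows with IS_OUT_OF_RANGE equal to 1.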