CPD

class hana_ml.algorithms.pal.tsa.changepoint.CPD(cost=None, penalty=None, solver=None, lamb=None, min_size=None, min_sep=None, max_k=None, dispersion=None, lamb_range=None, max_iter=None, range_penalty=None, value_penalty=None)

Change-point detection (CPDetection) methods aim to detect multiple abrupt changes, such as a change in mean, variance or distribution, in observed time-series data.

Parameters:
cost : {'normal_mse', 'normal_rbf', 'normal_mhlb', 'normal_mv', 'linear', 'gamma', 'poisson', 'exponential', 'normal_m', 'negbinomial'}, optional

The cost function for change-point detection.

Defaults to 'normal_mse'.

penalty : {'aic', 'bic', 'mbic', 'oracle', 'custom'}, optional

The penalty function for change-point detection.

Defaults to:

(1) 'aic' if solver is 'pruneddp', 'pelt' or 'opt',

(2) 'custom' if solver is 'adppelt'.

solver : {'pelt', 'opt', 'adppelt', 'pruneddp'}, optional

Method for finding change-points of given data, cost and penalty.

Each solver supports different cost and penalty functions.

    1. For cost functions, 'pelt', 'opt' and 'adppelt' support the following eight: 'normal_mse', 'normal_rbf', 'normal_mhlb', 'normal_mv', 'linear', 'gamma', 'poisson', 'exponential'; while 'pruneddp' supports the following four: 'poisson', 'exponential', 'normal_m', 'negbinomial'.

    2. For penalty functions, 'pruneddp' supports all penalties; 'pelt' and 'opt' support the following three: 'aic', 'bic', 'custom'; while 'adppelt' only supports the 'custom' penalty.

Defaults to 'pelt'.
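The compatibility rules above can be sketched as a small pre-flight check. This is only an illustration in plain Python, not part of hana_ml: the helper name check_cpd_config and the lookup tables are hypothetical, and the sketch assumes that 'adppelt' accepts only the 'custom' penalty.

```python
# Lookup tables transcribed from the solver description; not a hana_ml API.
_PELT_FAMILY_COSTS = {
    "normal_mse", "normal_rbf", "normal_mhlb", "normal_mv",
    "linear", "gamma", "poisson", "exponential",
}
SOLVER_COSTS = {
    "pelt": _PELT_FAMILY_COSTS,
    "opt": _PELT_FAMILY_COSTS,
    "adppelt": _PELT_FAMILY_COSTS,
    "pruneddp": {"poisson", "exponential", "normal_m", "negbinomial"},
}
SOLVER_PENALTIES = {
    "pelt": {"aic", "bic", "custom"},
    "opt": {"aic", "bic", "custom"},
    "adppelt": {"custom"},  # assumption: 'custom' penalty only
    "pruneddp": {"aic", "bic", "mbic", "oracle", "custom"},
}
DEFAULT_PENALTY = {"pelt": "aic", "opt": "aic", "pruneddp": "aic", "adppelt": "custom"}


def check_cpd_config(solver="pelt", cost="normal_mse", penalty=None):
    """Return a list of human-readable problems (empty if the combination is documented)."""
    if solver not in SOLVER_COSTS:
        raise ValueError(f"unknown solver: {solver!r}")
    if penalty is None:
        # Mirror the documented solver-dependent default.
        penalty = DEFAULT_PENALTY[solver]
    problems = []
    if cost not in SOLVER_COSTS[solver]:
        problems.append(f"cost {cost!r} is not listed for solver {solver!r}")
    if penalty not in SOLVER_PENALTIES[solver]:
        problems.append(f"penalty {penalty!r} is not listed for solver {solver!r}")
    return problems


print(check_cpd_config("pruneddp", "normal_mse", "oracle"))
```

Running such a check before constructing CPD makes a mismatch visible early, for example the default cost 'normal_mse' combined with solver 'pruneddp'.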

lamb : float, optional

Assigned weight of the penalty w.r.t. the cost function, i.e. penalization factor.

It can be seen as trade-off between speed and accuracy of running the detection algorithm.

A small value (usually less than 0.1) dramatically improves efficiency.

Defaults to 0.02, and valid only when solver is 'pelt' or 'adppelt'.

min_size : int, optional

The minimum number of observations at the very beginning of the series within which no change can occur.

Valid only when solver is 'opt', 'pelt' or 'adppelt'.

Defaults to 2.

min_sep : int, optional

The minimal length of separation between consecutive change-points.

Defaults to 1, valid only when solver is 'opt', 'pelt' or 'adppelt'.

max_k : int, optional

The maximum number of change-points to be detected.

If the given value is less than 1, this number would be determined automatically from the input data.

Defaults to 0, valid only when solver is 'pruneddp'.

dispersion : float, optional

Dispersion coefficient for Gamma and negative binomial distribution.

Valid only when cost is 'gamma' or 'negbinomial'.

Defaults to 1.0.

lamb_range : list of two numerical (float or int) values, optional (deprecated)

User-defined range of penalty.

Only valid when solver is 'adppelt'.

Deprecated, please use range_penalty instead.

max_iter : int, optional

Maximum number of iterations for searching the best penalty.

Valid only when solver is 'adppelt'.

Defaults to 40.

range_penalty : list of two numerical values, optional

User-defined range of penalty.

Valid only when solver is 'adppelt' and value_penalty is not provided.

Defaults to [0.01, 100].

value_penalty : float, optional

Value of user-defined penalty.

Valid when penalty is 'custom' or solver is 'adppelt'.

No default value.
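Several parameters above are documented as valid only for particular solvers, costs or penalties. The following sketch collects those conditions in one place. It is plain Python, not a hana_ml function: the name not_applicable_params is hypothetical, and the text does not say whether PAL ignores or rejects a parameter outside its valid setting, so the helper only reports the names.

```python
def not_applicable_params(solver="pelt", cost="normal_mse", penalty=None, **params):
    """Return the supplied parameter names that the documentation marks as
    not valid for the given solver/cost/penalty combination."""
    # Only parameters that were actually given (not None) are considered.
    given = {name for name, value in params.items() if value is not None}
    applicable = {
        "lamb": solver in ("pelt", "adppelt"),
        "min_size": solver in ("opt", "pelt", "adppelt"),
        "min_sep": solver in ("opt", "pelt", "adppelt"),
        "max_k": solver == "pruneddp",
        "dispersion": cost in ("gamma", "negbinomial"),
        "max_iter": solver == "adppelt",
        # range_penalty applies only when no explicit value_penalty is given.
        "range_penalty": solver == "adppelt" and "value_penalty" not in given,
        "value_penalty": penalty == "custom" or solver == "adppelt",
    }
    unknown = given - applicable.keys()
    if unknown:
        raise TypeError(f"unexpected parameter(s): {sorted(unknown)}")
    return sorted(name for name in given if not applicable[name])


print(not_applicable_params(solver="pelt", lamb=0.02, max_k=3))  # prints ['max_k']
```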

Examples

First check the input time-series DataFrame df:

>>> df.collect()
  TIME_STAMP      SERIES
0        1-1       -5.36
1        1-2       -5.14
2        1-3       -4.94
3        2-1       -5.15
4        2-2       -4.95
5        2-3        0.55
6        2-4        0.88
7        3-1        0.95
8        3-2        0.68
9        3-3        0.86

Now create a CPD instance with 'pelt' solver and 'aic' penalty:

>>> cpd = CPD(solver='pelt',
...           cost='normal_mse',
...           penalty='aic',
...           lamb=0.02)

Apply the above CPD instance to the input data, check the detection result and related statistics:

>>> cp = cpd.fit_predict(data=df)
>>> cp.collect()
      TIME_STAMP
0            2-2
>>> cpd.stats_.collect()
             STAT_NAME    STAT_VAL
0               solver        Pelt
1        cost function  Normal_MSE
2         penalty type         AIC
3           total loss     4.13618
4  penalisation factor        0.02
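To build intuition for why '2-2' is reported, the sketch below searches for a single split that minimizes the summed squared error around each segment mean. This is a brute-force illustration in plain Python and is not the PELT algorithm or PAL's implementation; the function names and the min_len argument are hypothetical. It reports the last timestamp of the left segment, which matches the TIME_STAMP shown above for this data.

```python
def sse(values):
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_split(values, min_len=2):
    """Try every split position k (left = values[:k], right = values[k:])
    and return the one with the smallest total squared error."""
    best_k, best_cost = None, float("inf")
    for k in range(min_len, len(values) - min_len + 1):
        cost = sse(values[:k]) + sse(values[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Data copied from the example DataFrame above.
timestamps = ["1-1", "1-2", "1-3", "2-1", "2-2", "2-3", "2-4", "3-1", "3-2", "3-3"]
series = [-5.36, -5.14, -4.94, -5.15, -4.95, 0.55, 0.88, 0.95, 0.68, 0.86]

k, cost = best_single_split(series)
print(timestamps[k - 1])  # prints 2-2
```

A real solver additionally weighs the cost against a penalty per change-point, which is what allows it to decide how many change-points to report.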

Create another CPD instance with 'adppelt' solver and 'normal_mv' cost:

>>> cpd = CPD(solver='adppelt',
...           cost='normal_mv',
...           range_penalty=[0.01, 100],
...           lamb=0.02)

Again, apply the above CPD instance to the input data, check the detection result and related statistics:

>>> cp = cpd.fit_predict(data=df)
>>> cp.collect()
      TIME_STAMP
0            2-2
>>> cpd.stats_.collect()
             STAT_NAME   STAT_VAL
0               solver    AdpPelt
1        cost function  Normal_MV
2         penalty type     Custom
3           total loss   -28.1656
4  penalisation factor       0.02
5            iteration          2
6      optimal penalty    2.50974

Create a third CPD instance with 'pruneddp' solver and 'oracle' penalty:

>>> cpd = CPD(solver='pruneddp', cost='normal_m', penalty='oracle', max_k=3)

As before, apply this CPD instance to the input data and check the detection result (the related statistics are available from cpd.stats_):

>>> cp = cpd.fit_predict(data=df)
>>> cp.collect()
      TIME_STAMP
0            2-2

Attributes:
stats_ : DataFrame

Statistics for running change-point detection on the input data, structured as follows:

  • 1st column: statistics name,

  • 2nd column: statistics value.

Methods

fit_predict(data[, key, features])

Detecting change-points of the input data.

fit_predict(data, key=None, features=None)

Detecting change-points of the input data.

Parameters:
data : DataFrame

Input time-series data for change-point detection.

key : str, optional

Column name for the time-stamp of the input time-series data.

If key is not provided and the index of data is not set or is not a single column, key defaults to the first column of data.

If the index of data is set to a single column, key defaults to that index column.

features : str or list of str, optional

Column name(s) for the value(s) of the input time-series data.

Returns:
DataFrame

The detected change-points of the input time-series data.
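The default rule for key described above can be written out as a small function. This is a plain-Python sketch of the documented behavior, not hana_ml code: resolve_key is a hypothetical helper, columns stands for the column names of data, and a single-column index is assumed to be represented as a string.

```python
def resolve_key(columns, index=None, key=None):
    """Pick the time-stamp column following the documented defaults."""
    if key is not None:
        # An explicitly passed key always wins.
        return key
    if isinstance(index, str):
        # The index is set to a single column: use it.
        return index
    # No index, or an index spanning several columns: fall back to the first column.
    return columns[0]


print(resolve_key(["TIME_STAMP", "SERIES"]))  # prints TIME_STAMP
```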

Inherited Methods from PALBase

Besides the methods mentioned above, the CPD class also inherits methods from the PALBase class; please refer to PAL Base for more details.