correlation

hana_ml.algorithms.pal.tsa.correlation_function.correlation(data, key=None, x=None, y=None, thread_ratio=None, method=None, max_lag=None, calculate_pacf=None, calculate_confint=False, alpha=None, bartlett=None)

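Computes the autocorrelation function of a single time series, or the cross-correlation function between two time series. For a single series it can also return the partial autocorrelation function (PACF) and confidence bounds.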

Parameters
data : DataFrame

Input data.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set.

x : str, optional

Name of the first series data column.

y : str, optional

Name of the second series data column.

thread_ratio : float, optional

The proportion of available threads to use.

  • 0: single thread

  • Between 0 and 1: that fraction of the available threads

  • Other values: the number of threads is determined heuristically

Valid only when method is set as 'brute_force'.

Defaults to -1.

method : {'auto', 'brute_force', 'fft'}, optional

Indicates the method to be used to calculate the correlation function.

Defaults to 'auto'.

max_lag : int, optional

Maximum lag for the correlation function.

Defaults to sqrt(n), where n is the number of data points.

calculate_pacf : bool, optional

Controls whether to calculate the Partial Autocorrelation Coefficient (PACF).

Valid only when only one series is provided.

Defaults to True.
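
As a rough illustration of what the PACF represents, the sketch below derives partial autocorrelations from autocorrelation values with the textbook Durbin-Levinson recursion. This page does not state how PAL computes the PACF internally, so the function name pacf_from_acf and the choice of algorithm are assumptions made for illustration only. Feeding it the first three CF values from the Examples output (1.0, 0.960180, 0.901135) gives a lag-2 value of about -0.2666, in line with the PACF column shown there.

```python
def pacf_from_acf(acf):
    """Partial autocorrelations for lags 0..K from autocorrelations acf[0..K].

    Hypothetical helper (not part of hana_ml); uses the Durbin-Levinson
    recursion and assumes acf[0] == 1.
    """
    max_lag = len(acf) - 1
    pacf = [1.0]      # lag 0 is 1 by convention
    phi_prev = []     # AR coefficients of order k-1: phi_{k-1,1..k-1}
    for k in range(1, max_lag + 1):
        # phi_kk = (rho_k - sum_j phi_{k-1,j} rho_{k-j}) / (1 - sum_j phi_{k-1,j} rho_j)
        num = acf[k] - sum(phi_prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}
        phi_cur = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_cur.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_cur
    return pacf


# CF values for lags 0..2 taken from the Examples output on this page.
example_pacf = pacf_from_acf([1.0, 0.960180, 0.901135])
```

The recursion only needs the autocorrelations, which is consistent with the PACF being available only when a single series is provided.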

calculate_confint : bool, optional

Controls whether to calculate confidence intervals or not.

If it is True, two additional columns of confidence intervals are shown in the result.

Defaults to False.

alpha : float, optional

Significance level for the confidence bound. For instance, if alpha=0.05, the 95% confidence bound is returned.

Valid only when calculate_confint is True.

Defaults to 0.05.

bartlett : bool, optional

Controls how the confidence bound is calculated.

  • False: use the standard error to calculate the confidence bound.

  • True: use Bartlett's formula to calculate the confidence bound.

Valid only when calculate_confint is True.

Defaults to True.
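
To make the difference between the two bartlett settings concrete, here is a minimal sketch of the textbook formulas for ACF confidence bounds. It assumes the usual normal approximation: the plain standard error is 1/sqrt(n), and Bartlett's formula widens it at lag k by the squared autocorrelations at lags below k. Whether PAL applies exactly these expressions is not stated on this page, and the helper name acf_confidence_bounds is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def acf_confidence_bounds(acf, n, alpha=0.05, bartlett=True):
    """Half-widths of the (1 - alpha) confidence band for lags 1..K.

    Hypothetical helper (not part of hana_ml). `acf` holds the
    autocorrelations for lags 0..K and `n` is the series length.
    """
    # Two-sided normal quantile, about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    bounds = []
    for k in range(1, len(acf)):
        if bartlett:
            # Bartlett: Var(r_k) ~ (1 + 2 * sum_{i<k} r_i^2) / n
            var = (1.0 + 2.0 * sum(r * r for r in acf[1:k])) / n
        else:
            # Plain standard error: Var(r_k) ~ 1 / n
            var = 1.0 / n
        bounds.append(z * sqrt(var))
    return bounds


flat = acf_confidence_bounds([1.0, 0.5, 0.25], n=100, bartlett=False)
wide = acf_confidence_bounds([1.0, 0.5, 0.25], n=100, bartlett=True)
```

In this sketch the two settings agree at lag 1, and the Bartlett band grows with the lag whenever earlier autocorrelations are non-zero.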

Returns
DataFrame

Result of the correlation function, structured as follows:

  • LAG: ID column.

  • CV: Autocovariance (ACV) or cross-covariance (CCV).

  • CF: Autocorrelation function (ACF) or cross-correlation function (CCF).

  • PACF: Partial autocorrelation function. Null if cross-correlation is calculated.

  • ACF_CONFIDENCE_BOUND: Confidence bound of the ACF. This column appears only when calculate_confint is True.

  • PACF_CONFIDENCE_BOUND: Confidence bound of the PACF. This column appears only when calculate_confint is True.
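
The relationship between the CV and CF columns can be sketched in plain Python. The version below uses the common biased estimator (dividing by n at every lag) and pairs x[t] with y[t + k] for the cross case. Both the normalization and the lag direction are assumptions, since this page does not spell out PAL's exact definitions, and the helper name covariance_and_correlation is hypothetical.

```python
from math import sqrt


def covariance_and_correlation(x, max_lag, y=None):
    """Return (cv, cf) lists for lags 0..max_lag.

    Hypothetical helper (not part of hana_ml). With y=None the result is
    the autocovariance and autocorrelation of x; otherwise the
    cross-covariance and cross-correlation of x and y (equal lengths).
    """
    if y is None:
        y = x
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Biased estimator: always divide by n, not by n - k (assumption).
    cv = [
        sum((x[t] - mean_x) * (y[t + k] - mean_y) for t in range(n - k)) / n
        for k in range(max_lag + 1)
    ]
    # Normalize by the lag-0 variances so that the ACF is 1 at lag 0.
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    scale = sqrt(var_x * var_y)
    cf = [c / scale for c in cv]
    return cv, cf


cv, cf = covariance_and_correlation([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
```

For a single series the scale is simply the lag-0 covariance, which is why CF equals 1 in the first row of an autocorrelation result.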

Examples

Data for correlation:

>>> df.collect().head(10)
    ID      X
0    1   88.0
1    2   84.0
2    3   85.0
3    4   85.0
4    5   84.0
5    6   85.0
6    7   83.0
7    8   85.0
8    9   88.0
9   10   89.0

Apply the correlation function to the input DataFrame:

>>> from hana_ml.algorithms.pal.tsa.correlation_function import correlation
>>> res = correlation(data=df,
                      key='ID',
                      x='X',
                      thread_ratio=0.4,
                      method='auto',
                      calculate_pacf=True)
>>> res.collect()
    LAG           CV        CF      PACF
0     0  1583.953600  1.000000  1.000000
1     1  1520.880736  0.960180  0.960180
2     2  1427.356272  0.901135 -0.266618
3     3  1312.695808  0.828746 -0.154417
4     4  1181.606944  0.745986 -0.120176
5     5  1041.042480  0.657243 -0.071546
6     6   894.493216  0.564722 -0.065065
7     7   742.178352  0.468561 -0.083686
8     8   587.453488  0.370878 -0.065213
9     9   434.287824  0.274180 -0.045501
10   10   286.464160  0.180854 -0.029586