pearsonr_matrix

hana_ml.algorithms.pal.stats.pearsonr_matrix(data, cols=None)

Computes a correlation matrix using Pearson's correlation coefficient.
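
For reference, the textbook definition of the sample Pearson correlation coefficient between two columns x and y with n rows is given below. The source does not spell out the exact formula used internally, so treat this as the standard definition rather than a statement about the implementation; the symbols x_i, y_i, and n are introduced here for illustration only.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```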

Parameters
data : DataFrame

DataFrame containing the data.

cols : list of str, optional

List of column names to analyze.

If not provided, it defaults to all columns.

Returns
DataFrame

Pearson's correlation coefficient for every pair of analyzed columns, structured as follows:

  • ID, type NVARCHAR. The values of this column are the column names from cols.

  • Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the Pearson correlation coefficient matrix:

>>> result = pearsonr_matrix(data=df)
>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
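
The result above can be cross-checked locally without a database connection. The following is a minimal pure-Python sketch, assuming the textbook Pearson formula; the helper names `pearson_r` and `correlation_matrix` are hypothetical and are not part of hana_ml. It rebuilds the same layout as the returned DataFrame: one row per column name (the ID), with one coefficient entry per column.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Textbook sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Hypothetical helper: one dict per row, keyed by 'ID' plus column names."""
    names = list(columns)
    return [
        {"ID": row, **{col: pearson_r(columns[col], columns[row]) for col in names}}
        for row in names
    ]


# Same values as the example dataset in this section.
columns = {
    "X": [1, 5, 3, 10, -4, 11],
    "Y": [2.4, 3.5, 8.9, -1.4, -3.5, 32.8],
}

matrix = correlation_matrix(columns)
for row in matrix:
    print(row)
```

The off-diagonal entries should agree with the value reported by `pearsonr_matrix` up to floating-point rounding, and the diagonal entries should be 1 up to the same rounding.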