pearsonr_matrix

hana_ml.algorithms.pal.stats.pearsonr_matrix(data, cols=None)

Computes a correlation matrix using Pearson's correlation coefficient.

Parameters:
data : DataFrame

DataFrame containing the data.

cols : list of str, optional

List of column names to analyze.

Defaults to all columns.

Returns:
DataFrame

Pearson's correlation coefficients between each pair of data samples (columns), structured as follows:

  • ID, type NVARCHAR. The values of this column are the column names from cols.

  • Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the Pearson correlation coefficient matrix:

>>> from hana_ml.algorithms.pal.stats import pearsonr_matrix
>>> result = pearsonr_matrix(data=df)
>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
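
The off-diagonal value above can be cross-checked locally without a HANA connection. The following is a minimal sketch in plain Python, not the PAL implementation: the helper names pearson_r and local_pearson_matrix are hypothetical and exist only for this illustration. It applies the textbook definition of Pearson's coefficient (covariance divided by the product of the standard deviations) to the example data and stores the result as matrix[column][row ID], mirroring the layout described under Returns.

```python
import math


def pearson_r(xs, ys):
    """Textbook Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def local_pearson_matrix(columns):
    """Hypothetical local stand-in: returns matrix[column][row_id]."""
    names = list(columns)
    return {
        col: {row_id: pearson_r(columns[col], columns[row_id]) for row_id in names}
        for col in names
    }


# Same values as the example dataset above.
columns = {
    "X": [1, 5, 3, 10, -4, 11],
    "Y": [2.4, 3.5, 8.9, -1.4, -3.5, 32.8],
}

matrix = local_pearson_matrix(columns)

# Coefficient between X and Y: column "X", row with ID "Y".
print(matrix["X"]["Y"])
```

The printed value should agree with the 0.592707653621 entry in the result table to the displayed precision, and matrix["Y"]["X"] is the same number because the coefficient is symmetric.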