VectorPCA

class hana_ml.algorithms.pal.decomposition.VectorPCA(n_components, scaling=None, thread_ratio=None, scores=None, component_tol=None, svd_alg=None, lanczos_tol=None, lanczos_iter=None)

Principal component analysis for real vector data in SAP HANA Cloud.

Parameters:
n_components : int

Specifies the number of components to keep.

Should be greater than or equal to 1.

scaling : bool, optional

If True, scale variables to have unit variance before analysis.

Defaults to False.

thread_ratio : float, optional

Specifies the ratio of total available threads to use. The value range is [0, 1], where 0 means a single thread and 1 means all available threads.

scores : bool, optional

If True, output the scores on each principal component after fitting.

Defaults to False.

component_tol : float, optional

Specifies the threshold for dropping principal components. More precisely, if the ratio between a singular value of some component and the largest singular value is less than the specified threshold, then the corresponding component will be dropped.

Defaults to 0 (indicating that no component is dropped).
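
The dropping rule described for component_tol can be illustrated with a few lines of plain Python. This is only a sketch of the ratio test as worded above, not the PAL implementation; the function name `kept_components` and the sample singular values are hypothetical.

```python
def kept_components(singular_values, component_tol=0.0):
    """Return the indices of components that survive the ratio test.

    A component is dropped when its singular value divided by the
    largest singular value is less than component_tol.
    """
    largest = max(singular_values)
    if largest == 0:
        raise ValueError("all singular values are zero")
    return [i for i, s in enumerate(singular_values)
            if s / largest >= component_tol]


# Hypothetical singular values: the third one is tiny relative to the first.
singular_values = [3.4641, 2.8284, 0.0004]

kept_default = kept_components(singular_values)                     # tolerance 0
kept_strict = kept_components(singular_values, component_tol=1e-3)  # drops index 2
```

With the default tolerance of 0 every ratio passes the test, which is why no component is dropped in that case.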

svd_alg : {'lanczos', 'jacobi'}, optional

Specifies the choice of SVD algorithm.

  • 'lanczos' : The Lanczos algorithm.

  • 'jacobi' : The divide-and-conquer algorithm with Jacobi.

Defaults to 'jacobi'.

lanczos_tol : float, optional

Specifies the precision of the Lanczos algorithm when computing eigenvalues. Valid only when svd_alg is 'lanczos'.

Valid range is (0, 1).

lanczos_iter : int, optional

Specifies the maximum number of iterations allowed when computing the SVD with the Lanczos algorithm. Valid only when svd_alg is 'lanczos'.

Defaults to 1000.
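
The documented value ranges can be checked on the client side before a model is constructed. The helper below is a hypothetical sketch and not part of hana_ml; it only mirrors the constraints listed above. It is stricter than the documentation in one respect: it rejects Lanczos-only options when svd_alg is not 'lanczos', whereas the text above only says they are not valid in that case.

```python
def check_vectorpca_args(n_components, thread_ratio=None, svd_alg=None,
                         lanczos_tol=None, lanczos_iter=None):
    """Hypothetical pre-flight check mirroring the documented ranges.

    Returns a dict of the arguments that were actually supplied.
    """
    if n_components < 1:
        raise ValueError("n_components must be >= 1")
    if thread_ratio is not None and not 0 <= thread_ratio <= 1:
        raise ValueError("thread_ratio must be in [0, 1]")
    if svd_alg is not None and svd_alg not in ('lanczos', 'jacobi'):
        raise ValueError("svd_alg must be 'lanczos' or 'jacobi'")
    if lanczos_tol is not None and not 0 < lanczos_tol < 1:
        raise ValueError("lanczos_tol must be in (0, 1)")
    # Assumption: treat Lanczos-only options as an error for other algorithms.
    if svd_alg != 'lanczos' and (lanczos_tol is not None
                                 or lanczos_iter is not None):
        raise ValueError("lanczos_tol and lanczos_iter require svd_alg='lanczos'")
    supplied = dict(n_components=n_components, thread_ratio=thread_ratio,
                    svd_alg=svd_alg, lanczos_tol=lanczos_tol,
                    lanczos_iter=lanczos_iter)
    return {name: value for name, value in supplied.items() if value is not None}


kwargs = check_vectorpca_args(2, svd_alg='lanczos', lanczos_tol=1e-6)
```

The returned dict could then be passed on as `VectorPCA(**kwargs)`.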

Examples

Input data with real vectors:

>>> df.dtypes()
[('ID', 'INT', 10, 10, 10, 0),
 ('V1', 'REAL_VECTOR', 16, 16, 3, 0),
 ('V2', 'REAL_VECTOR', 12, 12, 2, 0)]
>>> df.collect()
   ID                   V1            V2
0   0      [1.0, 1.0, 1.0]    [1.0, 1.0]
1   1     [1.0, 1.0, -1.0]   [1.0, -1.0]
2   2    [-1.0, -1.0, 1.0]   [-1.0, 1.0]
3   3   [-1.0, -1.0, -1.0]  [-1.0, -1.0]

Train a VectorPCA model and return the transformed data:

>>> from hana_ml.algorithms.pal.decomposition import VectorPCA
>>> vecpca = VectorPCA(n_components=2)
>>> pca_res = vecpca.fit_transform(data=df, key='ID')
>>> pca_res.collect()
    ID                                SCORE_VECTOR
0    0   [1.7320507764816284, -1.4142135381698608]
1    1    [1.7320507764816284, 1.4142135381698608]
2    2  [-1.7320507764816284, -1.4142135381698608]
3    3   [-1.7320507764816284, 1.4142135381698608]
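
The scores above can be cross-checked locally with NumPy. The sketch below assumes that the vector columns V1 and V2 are concatenated into a single 5-dimensional feature vector per row and that no unit-variance scaling is applied. Both assumptions are inferred from the output above rather than stated by this page. The sign of each component is arbitrary in an SVD, so only magnitudes are compared.

```python
import numpy as np

# Rows of the example table with V1 and V2 concatenated (an assumption).
X = np.array([
    [ 1.0,  1.0,  1.0,  1.0,  1.0],
    [ 1.0,  1.0, -1.0,  1.0, -1.0],
    [-1.0, -1.0,  1.0, -1.0,  1.0],
    [-1.0, -1.0, -1.0, -1.0, -1.0],
])

# Center each variable, then take the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the two leading principal directions.
n_components = 2
scores = Xc @ Vt[:n_components].T

# Magnitudes only, since component signs are not unique.
print(np.abs(scores).round(4))
```

Every row has magnitudes of about 1.7321 and 1.4142, which agrees with SCORE_VECTOR above up to sign and single-precision rounding.
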
Attributes:
loadings_ : DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_ : DataFrame

Loadings statistics on each component.

scores_ : DataFrame

The transformed variable values corresponding to each data point.

Set to None if scores is False.

scaling_stat_ : DataFrame

Mean and scale values of each variable.
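
To make the idea of per-variable mean and scale values concrete, here is a small sketch of standardizing one variable to unit variance. It uses the sample standard deviation (divisor n - 1) as the scale, which is an assumption; this page does not say which divisor PAL uses. The function name `standardize` is hypothetical.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a variable and divide by its scale so it has unit variance."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (n - 1); an assumption
    if sigma == 0:
        raise ValueError("a constant variable cannot be scaled to unit variance")
    return [(x - mu) / sigma for x in column]


# Hypothetical values of a single variable.
scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After this step the variable has mean 0 and sample standard deviation 1.
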

Methods

fit(data[, key])

Fit a VectorPCA model to data with real vectors.

fit_transform(data[, key])

Fit a VectorPCA model and return the transformed data.

transform(data[, key, thread_ratio, ...])

Transform data with real vectors using a trained VectorPCA model.

fit(data, key=None)

Fit a VectorPCA model to data with real vectors.

Parameters:
data : DataFrame

Data to be fitted for obtaining a VectorPCA model. All columns of data except the ID column must be of type REAL_VECTOR; otherwise an error is raised.

key : str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

Returns:
A fitted object of class VectorPCA.
fit_transform(data, key=None)

Fit a VectorPCA model and return the transformed data.

Parameters:
data : DataFrame

Data to be fitted for obtaining a VectorPCA model. All columns of data except the ID column must be of type REAL_VECTOR; otherwise an error is raised.

key : str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

Returns:
DataFrame

The transformed data, structured as follows:

  • 1st column, with column name/type same as the ID (key) column in data.

  • 2nd column, with column name SCORE_VECTOR and column data type REAL_VECTOR, which stores the vector of component score for data after PCA transformation.

transform(data, key=None, thread_ratio=None, max_components=None)

Transform data with real vectors using a trained VectorPCA model.

Parameters:
data : DataFrame

Data to be transformed by the fitted VectorPCA model. It must have the same structure as the data used for training the VectorPCA model; otherwise an error is raised.

key : str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

thread_ratio : float, optional

Specifies the ratio of available threads used for performing the transformation. The value range is [0, 1], where 0 means a single thread and 1 means all available threads.

max_components : int, optional

Specifies the component dimension to keep in the output. Defaults to the value of n_components specified when training the VectorPCA model.

The valid range is from 1 to the default value.
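
The effect of max_components on the output can be pictured as shortening each score vector. The sketch below assumes that the leading components are the ones kept, which this page does not state explicitly. The helper `truncate_scores` is hypothetical and not part of hana_ml.

```python
def truncate_scores(score_vectors, n_components, max_components=None):
    """Keep only the first max_components entries of each score vector."""
    if max_components is None:
        max_components = n_components  # documented default
    if not 1 <= max_components <= n_components:
        raise ValueError("max_components must be between 1 and n_components")
    return [vec[:max_components] for vec in score_vectors]


# Hypothetical score vectors from a model trained with n_components=2.
full_scores = [[1.732, -1.414], [1.732, 1.414]]
first_only = truncate_scores(full_scores, n_components=2, max_components=1)
unchanged = truncate_scores(full_scores, n_components=2)
```
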

Returns:
DataFrame

The transformed data, structured as follows:

  • 1st column, with column name/type same as the ID (key) column in data.

  • 2nd column, with column name SCORE_VECTOR and column data type REAL_VECTOR, which stores the vector of component score for data after PCA transformation.

Inherited Methods from PALBase

Besides the methods described above, the VectorPCA class also inherits methods from the PALBase class. Please refer to PAL Base for more details.