PCA

class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None)

Principal component analysis is to reduce the dimensionality of multivariate data using Singular Value Decomposition.

Parameters:

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scalingbool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scoresbool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

Examples

Input DataFrame df1 for training:

>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

Creating a PCA instance:

>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)

Performing fit on given dataframe:

>>> pca.fit(data=df1, key='ID')

Output:

>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489

>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000

>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

Input dataframe df2 for transforming:

>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0

Performing transform() on given dataframe:

>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035

Attributes:

loadings_DataFrame: The weights by which each standardized original variable should be multiplied when computing component scores.
loadings_stat_DataFrame: Loadings statistics on each component.
scores_DataFrame: The transformed variable values corresponding to each data point. Set to None if scores is False.
scaling_stat_DataFrame: Mean and scale values of each variable.

Note

Variables cannot be scaled if there exists one variable which has constant value across data items.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features, label])	Principal component analysis fit function.
`fit_transform`(data[, key, features, ...])	Fit with the dataset and return the scores.
`set_model_state`(state)	Set the model state by state information.
`transform`(data[, key, features, ...])	Principal component analysis projection function using a trained model.

fit(data, key=None, features=None, label=None)

Principal component analysis fit function.

Parameters:

dataDataFrame

Data to be fitted.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

Returns:

A fitted 'PCA' object.

fit_transform(data, key=None, features=None, n_components=None, label=None)

Fit with the dataset and return the scores.

Parameters:

dataDataFrame

Data to be analyzed.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns, non-label columns.

n_componentsint, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns:

DataFrame

Transformed variable values corresponding to each data point, structured as follows:

ID column, with same name and type as data 's ID column.

SCORE columns, type DOUBLE, representing the component score values of each data point.

LABEL column, same as the label column in data, valid only when parameter label is set.

transform(data, key=None, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters:

dataDataFrame

Data to be analyzed.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_componentsint, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns:

DataFrame

Transformed variable values corresponding to each data point, structured as follows:

ID column, with same name and type as data 's ID column.

SCORE columns, type DOUBLE, representing the component score values of each data point.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters:

modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for (CAT)PCA.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:

stateDataFrame, optional

Specified the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it mush have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.

Inherited Methods from PALBase

Besides those methods mentioned above, the PCA class also inherits methods from PALBase class, please refer to PAL Base for more details.