PCA

class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None, n_components=None)

Principal component analysis (PCA) aims at reducing the dimensionality of multivariate data while accounting for as much of the variation in the original dataset as possible. This technique is especially useful when the variables within the dataset are highly correlated. PCA transforms the original variables into a new set of variables, the principal components, that are:

  • linear combinations of the variables in the dataset;

  • uncorrelated with each other;

  • ordered according to the amount of variation in the original variables that they explain.
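
As a minimal illustration of these three properties, the following sketch runs PCA on two variables in plain Python, using the closed-form eigendecomposition of the 2x2 sample covariance matrix. The function name pca_2d and the sample data are hypothetical and are not part of hana_ml; the PAL implementation may differ in detail (for example, in sign conventions).

```python
import math

def pca_2d(xs, ys):
    """Illustrative PCA for two variables (not the PAL implementation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u in cx) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigvals = [half_trace + radius, half_trace - radius]
    # Angle of the first principal axis; the second axis is orthogonal to it.
    theta = 0.5 * math.atan2(2 * b, a - c)
    loadings = [(math.cos(theta), math.sin(theta)),
                (-math.sin(theta), math.cos(theta))]
    # Each score is a linear combination of the centered variables.
    scores = [[w0 * u + w1 * v for (w0, w1) in loadings]
              for u, v in zip(cx, cy)]
    return eigvals, loadings, scores

# Hypothetical, strongly correlated sample data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
eigvals, loadings, scores = pca_2d(xs, ys)
```

Here eigvals holds the variance explained by each component in decreasing order, and the two score columns have zero sample covariance.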

Parameters:
thread_ratio: float, optional

Controls the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, and a value of 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

No default value.

scaling: bool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scores: bool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

n_components: int, optional

Specifies the number of components to keep after transforming input data.

Defaults to None.
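
The thread_ratio rule described above can be pictured with the following sketch. The helper threads_for_ratio is hypothetical and not part of hana_ml, and the proportional rounding for values strictly between 0 and 1 is an assumption; the actual thread count is decided inside PAL.

```python
def threads_for_ratio(thread_ratio, available_threads):
    """Illustrative thread count, or None when the choice is left to the heuristic."""
    if thread_ratio is None or not 0 <= thread_ratio <= 1:
        # Unset or out-of-range values are ignored; PAL decides heuristically.
        return None
    # Assumption: proportional share of the threads, never fewer than one.
    return max(1, int(thread_ratio * available_threads))
```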

Examples

>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)

Perform fit():

>>> pca.fit(data=df, key='ID')

Output:

>>> pca.loadings_.collect()
>>> pca.loadings_stat_.collect()
>>> pca.scaling_stat_.collect()

Perform transform():

>>> result = pca.transform(data=df_transform, key='ID', n_components=4)
>>> result.collect()

Attributes:
loadings_: DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_: DataFrame

Loadings statistics on each component.

scores_: DataFrame

The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_: DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if any variable has a constant value across all data items, because its variance is zero.
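
To see why a constant variable blocks scaling, consider the following sketch of unit-variance scaling in plain Python. The helper scale_to_unit_variance is hypothetical, and the use of centering and of the sample standard deviation is an assumption made for illustration; PAL may compute the scale differently.

```python
import statistics

def scale_to_unit_variance(columns):
    """Center each column and divide by its sample standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        if sd == 0:
            # A constant column has zero variance, so dividing by sd is undefined.
            raise ValueError(f"column {name!r} is constant and cannot be scaled")
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled
```

One way to avoid the problem is to drop constant columns from features before fitting with scaling=True.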

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label])

Fit the model to the given dataset.

fit_transform(data[, key, features, ...])

Fit with the data and return the scores.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

set_model_state(state)

Set the model state by state information.

transform(data[, key, features, ...])

Principal component analysis projection function using a trained model.

fit(data, key=None, features=None, label=None)

Fit the model to the given dataset.

Parameters:
data: DataFrame

Data to be fitted.

key: str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: a list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

label: str, optional

Label of data.

Returns:
A fitted PCA object.

fit_transform(data, key=None, features=None, n_components=None, label=None)

Fit with the data and return the scores.

Parameters:
data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

n_components: int, optional

Number of components to be retained. The value range is from 1 to the number of features.

Defaults to self.n_components if it is set; otherwise defaults to the number of features.

label: str, optional

Label of data.

Returns:
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with the same name and type as the ID column of data.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.

  • LABEL column, same as the label column in data, valid only when parameter label is set.
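
The default rule for n_components in fit_transform() can be summarized by the following sketch. The helper resolve_n_components is hypothetical and only mirrors the rule stated in the parameter description; the range check is an assumption about how out-of-range values might be rejected.

```python
def resolve_n_components(call_value, init_value, n_features):
    """Pick the effective number of components for fit_transform()."""
    if call_value is None:
        # Fall back to the constructor value, then to the number of features.
        call_value = n_features if init_value is None else init_value
    if not 1 <= call_value <= n_features:
        raise ValueError("n_components must be between 1 and the number of features")
    return call_value
```

For example, with six feature columns and PCA(n_components=4), calling fit_transform() without n_components would keep four components under this rule, while an explicit n_components=2 would take precedence.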

transform(data, key=None, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters:
data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_components: int, optional

Number of components to be retained. The value range is from 1 to the number of features.

Defaults to the number of features.

label: str, optional

Label of data.

Returns:
DataFrame

Transformed variable values corresponding to each data point.
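
Conceptually, the projection multiplies each standardized variable by the loading weights of a component, as the description of loadings_ suggests. The following plain-Python sketch illustrates this under that assumption. The function project and its arguments are hypothetical stand-ins for the values held in loadings_ and scaling_stat_; the actual computation runs inside PAL.

```python
def project(rows, means, scales, loadings, n_components=None):
    """Project rows onto the first n_components principal components.

    rows:     list of feature vectors
    means:    per-variable mean values
    scales:   per-variable scale values (1.0 where no scaling applies)
    loadings: list of components, each a list of per-variable weights
    """
    if n_components is None:
        n_components = len(loadings)
    scores = []
    for row in rows:
        # Standardize the variables with the stored mean and scale values.
        z = [(x - m) / s for x, m, s in zip(row, means, scales)]
        # One score per retained component: a weighted sum of the variables.
        scores.append([sum(w * v for w, v in zip(component, z))
                       for component in loadings[:n_components]])
    return scores
```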

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters:
model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for (CAT)PCA.

pal_funcname: int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
state: DataFrame, optional

Specifies the state.

Defaults to self.state.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must include STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
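
A quick check of a state dict before passing it to set_model_state() might look like the following sketch. The helper missing_state_keys and the placeholder values are hypothetical; only the four required key names come from the description above.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}

def missing_state_keys(state):
    """Return the required keys that are absent from a state dict, sorted by name."""
    return sorted(REQUIRED_STATE_KEYS - set(state))

# Placeholder values for illustration only.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
missing = missing_state_keys(state)  # empty list when all keys are present
```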

Inherited Methods from PALBase

In addition to the methods described above, the PCA class inherits methods from the PALBase class. Please refer to PAL Base for more details.