CATPCA

class hana_ml.algorithms.pal.decomposition.CATPCA(scaling=None, thread_ratio=None, scores=None, n_components=None, component_tol=None, random_state=None, max_iter=None, tol=None, svd_alg=None, lanczos_iter=None)

Principal component analysis algorithm that supports categorical features. The current implementation uses the alternating least squares (ALS) algorithm to find the optimal scaling quantification for categorical data.

Parameters:
scaling: bool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

thread_ratio: float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

No default value.
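
The interpretation of thread_ratio described above can be sketched in plain Python. This is only an illustration: the helper name threads_for_ratio and the proportional rounding rule are assumptions, not PAL's actual scheduling logic, and a return value of None stands for "left to the heuristic".

```python
import math
from typing import Optional


def threads_for_ratio(thread_ratio: Optional[float], available: int) -> Optional[int]:
    """Hypothetical helper: map thread_ratio to a thread count.

    Returns None when the decision is left to the heuristic,
    i.e. when thread_ratio is missing or outside [0, 1].
    """
    if thread_ratio is None or not 0.0 <= thread_ratio <= 1.0:
        return None  # out-of-range values are ignored
    if thread_ratio == 0.0:
        return 1  # 0 indicates a single thread
    # Assumption: a proportional share of the available threads, rounded up.
    return max(1, min(available, math.ceil(thread_ratio * available)))


print(threads_for_ratio(0.0, 8))  # 1
print(threads_for_ratio(1.0, 8))  # 8
print(threads_for_ratio(1.5, 8))  # None
```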

scores: bool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

n_components: int

Specifies the number of components to keep.

Should be greater than or equal to 1.

component_tol: float, optional

Specifies the threshold for dropping principal components. More precisely, if the ratio between a singular value of some component and the largest singular value is less than the specified threshold, then the corresponding component will be dropped.

Defaults to 0 (indicating that no component is dropped).
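
The dropping rule for component_tol can be illustrated with a small stand-alone sketch. The function keep_components and the sample singular values are made up for illustration; the actual filtering happens inside PAL.

```python
def keep_components(singular_values, component_tol=0.0):
    """Return the indices of the components that survive the threshold.

    A component is dropped when the ratio of its singular value to the
    largest singular value is less than component_tol.
    """
    if not singular_values:
        return []
    largest = max(singular_values)
    if largest <= 0:
        raise ValueError("expected at least one positive singular value")
    return [i for i, s in enumerate(singular_values) if s / largest >= component_tol]


# Illustrative values: the third ratio is 5e-6, which is below 1e-5.
print(keep_components([10.0, 2.0, 0.00005], component_tol=1e-5))  # [0, 1]
# With the default of 0, no component is dropped.
print(keep_components([10.0, 2.0, 0.00005]))  # [0, 1, 2]
```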

random_state: int, optional

Specifies the random seed used to generate initial quantification for categorical variables. Should be nonnegative.

  • 0 : Use current system time as seed (always changing).

  • Others : The deterministic seed value.

Defaults to 0.
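
The seeding convention above (0 means a time-based seed, any other value is used as given) can be mimicked with the Python standard library. The helper make_rng is hypothetical and only mirrors the documented convention; it says nothing about the generator PAL uses internally.

```python
import random
import time


def make_rng(random_state: int = 0) -> random.Random:
    """Hypothetical helper following the random_state convention."""
    if random_state < 0:
        raise ValueError("random_state should be nonnegative")
    # 0: seed from the current system time, so results change between runs.
    # Any other value: deterministic seed, so results are reproducible.
    seed = time.time_ns() if random_state == 0 else random_state
    return random.Random(seed)


# Two generators built from the same nonzero seed produce the same sequence.
a, b = make_rng(42), make_rng(42)
print(a.random() == b.random())  # True
```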

max_iter: int, optional

Specifies the maximum number of iterations allowed in computing the quantification for categorical variables.

Defaults to 100.

tol: float, optional

Specifies the threshold that determines when the iterative quantification process should stop. More precisely, if the improvement in the loss value between consecutive iterations is less than this threshold, the quantification process terminates and is regarded as converged.

Valid range is (0, 1).

Defaults to 1e-5.
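
How tol and max_iter interact can be sketched with a generic iteration loop. This is a minimal sketch of the stopping rule only: the function iterate_until_converged and the toy loss sequence are assumptions for illustration, not the ALS update that PAL performs.

```python
def iterate_until_converged(step, initial_loss, tol=1e-5, max_iter=100):
    """Run step(i) until the loss improvement drops below tol.

    step(i) returns the loss after iteration i (1-based).
    Returns (final_loss, iterations_run, converged).
    """
    prev = initial_loss
    for i in range(1, max_iter + 1):
        loss = step(i)
        if prev - loss < tol:
            return loss, i, True  # improvement below threshold: converged
        prev = loss
    return prev, max_iter, False  # iteration budget exhausted


# Toy loss that halves on every iteration; the improvement at iteration i is 0.5 ** i.
loss, n_iter, converged = iterate_until_converged(lambda i: 0.5 ** i, initial_loss=1.0)
print(n_iter, converged)  # 17 True
```

With a smaller max_iter the loop ends before the threshold is reached, and the last flag reports that convergence was not achieved.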

svd_alg: {'lanczos', 'jacobi'}, optional

Specifies the choice of SVD algorithm.

  • 'lanczos' : The Lanczos algorithm.

  • 'jacobi' : The divide-and-conquer with Jacobi algorithm.

Defaults to 'jacobi'.

lanczos_iter: int, optional

Specifies the maximum number of iterations allowed when computing the SVD with the Lanczos algorithm. Valid only when svd_alg is 'lanczos'.

Defaults to 1000.

Examples

>>> from hana_ml.algorithms.pal.decomposition import CATPCA
>>> cpc = CATPCA(scaling=True,
...              thread_ratio=0.0,
...              scores=True,
...              n_components=2,
...              component_tol=1e-5)

Perform fit():

>>> cpc.fit(data=df, key='ID', categorical_variable='X4')
>>> cpc.loadings_.collect()

Perform transform():

>>> result = cpc.transform(data=df_transform, key="ID", n_components=2,
...                        thread_ratio=0.5, ignore_unknown_category=False)
>>> result.collect()
Attributes:
loadings_: DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_: DataFrame

Loadings statistics on each component.

scores_: DataFrame

The transformed variable values corresponding to each data point.

Set to None if scores is False.

scaling_stat_: DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if any variable has a constant value across all data items.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, categorical_variable])

Fit the model to the given dataset.

fit_transform(data[, key, features, ...])

Fit with the dataset and return the scores.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

set_model_state(state)

Set the model state by state information.

transform(data[, key, features, ...])

Principal component analysis projection function using a trained model.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model to the given dataset.

Parameters:
data: DataFrame

Data to be fitted.

The number of rows in data is expected to be no less than self.n_components.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.
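
The defaulting rule for key can be written out as a short sketch. The helper resolve_key is hypothetical, and representing the index as None, a column name, or a list of column names is an assumption made for illustration.

```python
def resolve_key(key, index):
    """Hypothetical helper: pick the ID column following the rule above.

    index is assumed to be None (not indexed), a single column name,
    or a list of column names.
    """
    if key is not None:
        return key  # an explicit key always wins
    if isinstance(index, str):
        return index
    if isinstance(index, (list, tuple)) and len(index) == 1:
        return index[0]
    raise ValueError(
        "key is mandatory when data is not indexed "
        "or its index contains multiple columns"
    )


print(resolve_key(None, "ID"))        # ID
print(resolve_key("ROW_ID", ["A", "B"]))  # ROW_ID
```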

features: a list of str, optional

Names of the feature columns.

The number of features should be no less than self.n_components.

If features is not provided, it defaults to all non-ID columns.

categorical_variable: str or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:
A fitted object of class "CATPCA".
fit_transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None)

Fit with the dataset and return the scores.

Parameters:
data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_components: int, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

ignore_unknown_category: bool, optional

Specifies whether or not to ignore unknown categories in data.

If set to True, any unknown category is ignored and assigned a quantification of 0; otherwise, an error is raised when an unknown category is encountered.

Defaults to False.
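
The effect of ignore_unknown_category can be sketched as a lookup into a learned quantification table. The function quantify and the table values below are invented for illustration; the real quantifications are computed by PAL during fitting.

```python
def quantify(value, quantification, ignore_unknown_category=False):
    """Hypothetical lookup of a category's quantification.

    Unknown categories map to 0 when ignored; otherwise an error is raised.
    """
    if value in quantification:
        return quantification[value]
    if ignore_unknown_category:
        return 0.0
    raise ValueError(f"unknown category: {value!r}")


# Made-up quantifications for a categorical column seen during fitting.
learned = {"A": -1.2, "B": 0.3, "C": 0.9}

print(quantify("B", learned))                                 # 0.3
print(quantify("Z", learned, ignore_unknown_category=True))   # 0.0
```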

Returns:
DataFrame

Transformed variable values for data, structured as follows:

  • 1st column, with the same name and type as the ID column of data.

  • 2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.

  • 3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score of each data point on each component.

transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None, thread_ratio=None)

Principal component analysis projection function using a trained model.

Parameters:
data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_components: int, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

Returns:
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • 1st column, with the same name and type as the ID column of data.

  • 2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.

  • 3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score of each data point on each component.
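
Because the result is in long format (one row per data point and component), it is often convenient to regroup it into one list of scores per ID. The following is a minimal sketch in plain Python; the helper pivot_scores and the sample rows are made up for illustration. In practice the rows would come from the collected result, for example by iterating over result.collect(), which is assumed here to yield the three columns in the order listed above.

```python
from collections import defaultdict


def pivot_scores(rows):
    """Group (id, component_id, component_score) rows by ID.

    Returns {id: [score of component 1, score of component 2, ...]},
    with the scores ordered by component ID.
    """
    by_id = defaultdict(dict)
    for row_id, component_id, score in rows:
        by_id[row_id][component_id] = score
    return {
        row_id: [scores[c] for c in sorted(scores)]
        for row_id, scores in by_id.items()
    }


# Made-up rows in the documented (ID, COMPONENT_ID, COMPONENT_SCORE) layout.
rows = [(1, 1, 0.5), (1, 2, -0.2), (2, 2, 0.3), (2, 1, 1.5)]
print(pivot_scores(rows))  # {1: [0.5, -0.2], 2: [1.5, 0.3]}
```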

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters:
model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for (CAT)PCA.

pal_funcname: int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
state: DataFrame, optional

Specifies the state.

Defaults to self.state.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is a DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
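
When passing a dict, a quick check of the required keys before calling set_model_state can catch mistakes early. The sketch below is an assumption-laden illustration: check_state is a hypothetical helper, and the values in the example dict are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def check_state(state: dict) -> dict:
    """Hypothetical helper: verify that a state dict has the required keys."""
    missing = [k for k in REQUIRED_STATE_KEYS if k not in state]
    if missing:
        raise KeyError("state is missing required keys: " + ", ".join(missing))
    return state


# Placeholder values only; real values come from an existing model state.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
check_state(state)  # passes silently when all four keys are present
```

The validated dict could then be passed on as the state argument of set_model_state.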

Inherited Methods from PALBase

Besides the methods mentioned above, the CATPCA class also inherits methods from the PALBase class. Refer to PAL Base for more details.