CATPCA

class hana_ml.algorithms.pal.decomposition.CATPCA(scaling=None, thread_ratio=None, scores=None, n_components=None, component_tol=None, random_state=None, max_iter=None, tol=None, svd_alg=None, lanczos_iter=None)

Principal components analysis algorithm that supports categorical features.

Parameters:

scaling: bool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

thread_ratio: float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scores: bool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

n_components: int

Specifies the number of components to keep.

Should be greater than or equal to 1.

component_tol: float, optional

Specifies the threshold for dropping principal components. More precisely, if the ratio between the singular value of a component and the largest singular value is less than the specified threshold, the corresponding component is dropped.

Defaults to 0 (indicating that no component is dropped).
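
The dropping rule described above can be sketched in plain Python. This is only an illustration of the ratio test, not PAL's implementation; the helper name `kept_components` and the sample singular values are hypothetical.

```python
def kept_components(singular_values, component_tol=0.0):
    """Return 1-based IDs of components whose ratio to the largest
    singular value is not less than component_tol."""
    largest = max(singular_values)
    return [i + 1 for i, s in enumerate(singular_values)
            if s / largest >= component_tol]

# Hypothetical singular values, for illustration only.
sv = [4.0, 2.0, 0.5, 0.00002]
kept_default = kept_components(sv)        # component_tol=0 keeps every component
kept_strict = kept_components(sv, 1e-5)   # last ratio is 5e-6, below the threshold
```

With the default threshold of 0 no ratio can fall below it, which matches the statement that no component is dropped.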

random_state: int, optional

Specifies the random seed used to generate the initial quantification for categorical variables. Should be nonnegative.

0 : Use the current system time as the seed (changes on every run).

Others : The deterministic seed value.

Defaults to 0.
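
The seed convention above can be mimicked with Python's standard `random` module. This is a sketch of the convention only; PAL's own random number generator is not described here, and the helper name `make_rng` is hypothetical.

```python
import random
import time

def make_rng(random_state=0):
    """0 seeds from the current system time; any other value is a fixed seed."""
    if random_state < 0:
        raise ValueError("random_state should be nonnegative")
    # Assumption for illustration: use the nanosecond clock as the "system time" seed.
    seed = time.time_ns() if random_state == 0 else random_state
    return random.Random(seed)

a = make_rng(2021)
b = make_rng(2021)
# Two generators built from the same nonzero seed produce the same sequence.
same_sequence = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```

A nonzero `random_state` is therefore the choice to make when the initial quantification needs to be reproducible across runs.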

max_iter: int, optional

Specifies the maximum number of iterations allowed in computing the quantification for categorical variables.

Defaults to 100.

tol: float, optional

Specifies the threshold that determines when the iterative quantification process should stop. More precisely, if the improvement in the loss value between consecutive iterations is less than this threshold, the quantification process terminates and is regarded as converged.

Valid range is (0, 1).

Defaults to 1e-5.
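
How `max_iter` and `tol` interact can be sketched as a simple stopping rule. This is a minimal illustration under the description above, not PAL's actual loop; the function name and the sequence of loss values are hypothetical.

```python
def iterations_until_stop(losses, max_iter=100, tol=1e-5):
    """Return (iterations run, converged flag) for a sequence of loss values.

    `losses` holds one hypothetical loss value per iteration.
    """
    previous = None
    for iteration, loss in enumerate(losses[:max_iter], start=1):
        if previous is not None and previous - loss < tol:
            return iteration, True    # improvement fell below tol: converged
        previous = loss
    return min(len(losses), max_iter), False   # iteration budget exhausted

# Hypothetical loss values, for illustration only.
losses = [1.0, 0.5, 0.3, 0.299999, 0.2999985]
default_run = iterations_until_stop(losses)
capped_run = iterations_until_stop(losses, max_iter=3)
```

In the capped run the loop ends because of `max_iter` before the improvement ever drops below `tol`, so the result is not flagged as converged.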

svd_alg: {'lanczos', 'jacobi'}, optional

Specifies the choice of SVD algorithm.

'lanczos' : The Lanczos algorithm.

'jacobi' : The divide-and-conquer algorithm with Jacobi.

Defaults to 'jacobi'.

lanczos_iter: int, optional

Specifies the maximum number of iterations allowed when computing the SVD with the Lanczos algorithm.

Valid only when svd_alg is 'lanczos'.

Defaults to 1000.

Examples

Input DataFrame data:

>>> data.collect()
ID X1 X2 X3 X4 X5 X6
 1 12  A 20 44 48 16
 2 12  B 25 45 50 16
 3 12  C 21 45 50 16
 4 13  A 21 46 51 17
 5 14  C 24 46 51 17
 6 22  A 25 54 58 26
 7 22  D 26 55 58 27
 8 17  A 21 45 52 17
 9 15  D 24 45 53 18
10 23  C 23 53 57 24
11 25  B 23 55 58 25

Call the function:

>>> cpc = CATPCA(scaling=True,
                 thread_ratio=0.0,
                 scores=True,
                 n_components=2,
                 component_tol=1e-5,
                 random_state=2021,
                 max_iter=550,
                 tol=1e-5,
                 svd_alg='lanczos',
                 lanczos_iter=100)
>>> cpc.fit(data=data, key='ID', categorical_variable='X4')
>>> cpc.loadings_.collect()
   VARIABLE_NAME  COMPONENT_ID  COMPONENT_LOADING
0             X1             1          -0.444462
1             X1             2          -0.266543
2             X3             1          -0.331331
3             X3             2           0.532125
4             X5             1          -0.467411
5             X5             2          -0.132006
6             X6             1          -0.463490
7             X6             2          -0.158253
8             X2             1           0.213112
9             X2             2          -0.755338
10            X4             1          -0.462559
11            X4             2          -0.181087
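
As a reading aid for the loadings table above, the following sketch finds the variable with the largest absolute loading on each component. It works on rows copied from the output shown; the helper name `dominant_variable` is hypothetical, and with a live connection the rows could presumably be taken from the collected DataFrame instead (for example with pandas' `itertuples(index=False)`).

```python
# (VARIABLE_NAME, COMPONENT_ID, COMPONENT_LOADING) rows copied from the output above.
loadings = [
    ("X1", 1, -0.444462), ("X1", 2, -0.266543),
    ("X3", 1, -0.331331), ("X3", 2, 0.532125),
    ("X5", 1, -0.467411), ("X5", 2, -0.132006),
    ("X6", 1, -0.463490), ("X6", 2, -0.158253),
    ("X2", 1, 0.213112), ("X2", 2, -0.755338),
    ("X4", 1, -0.462559), ("X4", 2, -0.181087),
]

def dominant_variable(rows):
    """Map each component ID to the (variable, loading) pair with the
    largest absolute loading on that component."""
    best = {}
    for name, component, loading in rows:
        if component not in best or abs(loading) > abs(best[component][1]):
            best[component] = (name, loading)
    return best

dominant = dominant_variable(loadings)
```

For this example, X5 carries the largest absolute loading on component 1 and X2 on component 2.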

Input data for CATPCA transformation:

>>> data2.collect()
ID X1 X2 X3 X4 X5 X6
 1 12  A 20 44 48 16
 2 12  B 25 45 50 16
 3 12  C 21 45 50 16
 4 13  A 21 46 51 17
 5 14  C 24 46 51 17
 6 22  A 25 54 58 26

Perform transformation of the DataFrame above using the "CATPCA" object cpc:

>>> result = cpc.transform(data=data2,
                           key="ID", n_components=2,
                           thread_ratio=0.5,
                           ignore_unknown_category=False)
Output:

>>> result.collect()
   ID  COMPONENT_ID  COMPONENT_SCORE
    1             1         2.734518
    2             1         1.055637
    3             1         1.918711
    4             1         1.768841
    5             1         1.019988
    6             1        -2.386126
    1             2        -0.747631
    2             2         1.709687
    3             2        -0.064889
    4             2        -0.767442
    5             2         0.557958
    6             2        -1.094885
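
The result is in long format, with one row per (ID, component) pair. The sketch below reshapes such rows into one list of scores per ID using only the standard library. The rows are copied from the output above; the helper name `to_wide` is hypothetical and not part of hana_ml.

```python
from collections import defaultdict

# (ID, COMPONENT_ID, COMPONENT_SCORE) rows copied from the output above.
scores_long = [
    (1, 1, 2.734518), (2, 1, 1.055637), (3, 1, 1.918711),
    (4, 1, 1.768841), (5, 1, 1.019988), (6, 1, -2.386126),
    (1, 2, -0.747631), (2, 2, 1.709687), (3, 2, -0.064889),
    (4, 2, -0.767442), (5, 2, 0.557958), (6, 2, -1.094885),
]

def to_wide(rows):
    """Return {ID: [score on component 1, score on component 2, ...]}."""
    by_id = defaultdict(dict)
    for row_id, component, score in rows:
        by_id[row_id][component] = score
    # Order each ID's scores by ascending component ID.
    return {row_id: [comps[c] for c in sorted(comps)]
            for row_id, comps in by_id.items()}

scores_wide = to_wide(scores_long)
```

Each entry of `scores_wide` then holds the coordinates of one data point in the space of the retained components.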

Attributes:

loadings_: DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_: DataFrame

Loadings statistics on each component.

scores_: DataFrame

The transformed variable values corresponding to each data point.

Set to None if scores is False.

scaling_stat_: DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if any variable has a constant value across all data items.
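
The reason is that a constant variable has zero variance, so it cannot be rescaled to unit variance. The sketch below illustrates this in plain Python. It assumes the population standard deviation for simplicity; which estimator PAL uses is not stated here, and the helper name is hypothetical.

```python
import statistics

def scale_to_unit_variance(values):
    """Center the values and divide by their standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # A constant column has no spread, so there is nothing to divide by.
        raise ValueError("constant variable cannot be scaled to unit variance")
    return [(v - mean) / std for v in values]

# First five X1 values from the example data.
scaled = scale_to_unit_variance([12, 12, 12, 13, 14])

try:
    scale_to_unit_variance([5, 5, 5])
    constant_rejected = False
except ValueError:
    constant_rejected = True
```

Checking for constant columns before fitting with `scaling=True` avoids this failure.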

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features, categorical_variable])	Principal component analysis fit function.
`fit_transform`(data[, key, features, ...])	Fit with the dataset and return the scores.
`set_model_state`(state)	Set the model state by state information.
`transform`(data[, key, features, ...])	Principal component analysis projection function using a trained model.

fit(data, key=None, features=None, categorical_variable=None)

Principal component analysis fit function.

Parameters:

data: DataFrame

Data to be fitted.

The number of rows in data is expected to be no less than self.n_components.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

The number of features should be no less than self.n_components.

If features is not provided, it defaults to all non-ID columns.

categorical_variable: str or ListOfStrings, optional

Specifies INTEGER variables that should be treated as categorical.

By default, variables of INTEGER type are treated as continuous.

Returns:

A fitted 'CATPCA' object.

fit_transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None)

Fit with the dataset and return the scores.

Parameters:

data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

n_components: int, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

ignore_unknown_category: bool, optional

Specifies whether or not to ignore unknown categories in data.

If set to True, any unknown category is ignored and given a quantification of 0; otherwise, an error is raised when an unknown category is encountered.

Defaults to False.
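
The behaviour of `ignore_unknown_category` can be pictured as a lookup into the quantification learned during fitting. The following is a minimal sketch of that idea, not PAL's implementation; the helper name `quantify` and the quantification values are hypothetical.

```python
def quantify(category, quantification, ignore_unknown_category=False):
    """Look up the numeric quantification of a category seen during fitting."""
    if category in quantification:
        return quantification[category]
    if ignore_unknown_category:
        return 0.0  # unknown category is ignored and quantified as 0
    raise ValueError(f"unknown category: {category!r}")

# Hypothetical quantification for a categorical variable, for illustration only.
learned = {"A": -0.8, "B": 0.1, "C": 0.4, "D": 1.3}

known_value = quantify("B", learned)
ignored_value = quantify("E", learned, ignore_unknown_category=True)
```

With the default `ignore_unknown_category=False`, the same lookup for the unseen category "E" raises an error instead of returning 0.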

Returns:

DataFrame

Transformed variable values for data, structured as follows:

1st column, with the same name and type as the ID column of data.

2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.

3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score of each data point on each component.

transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None, thread_ratio=None)

Principal component analysis projection function using a trained model.

Parameters:

data: DataFrame

Data to be analyzed.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_components: int, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

Returns:

DataFrame

Transformed variable values corresponding to each data point, structured as follows:

1st column, with the same name and type as the ID column of data.

2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.

3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score of each data point on each component.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters:

model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for (CAT)PCA.

pal_funcname: int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state: DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values corresponding to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
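
A dict-style state can be checked for the required keys before it is passed on. The sketch below is a hypothetical pre-check, not part of hana_ml; the helper name `missing_state_keys` and all values shown are placeholders.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}

def missing_state_keys(state):
    """Return the required keys that are absent from a state dict, sorted."""
    return sorted(REQUIRED_STATE_KEYS - set(state))

# Placeholder values, for illustration only.
state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

complete = missing_state_keys(state)
incomplete = missing_state_keys({"STATE_ID": "<state-id>"})
```

A dict for which the check returns an empty list has the shape described above and could then be passed to `set_model_state(state)`.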

Inherited Methods from PALBase

Besides the methods mentioned above, the CATPCA class also inherits methods from the PALBase class. Please refer to PAL Base for more details.