CATPCA
- class hana_ml.algorithms.pal.decomposition.CATPCA(scaling=None, thread_ratio=None, scores=None, n_components=None, component_tol=None, random_state=None, max_iter=None, tol=None, svd_alg=None, lanczos_iter=None)
Principal component analysis algorithm that supports categorical features. The current implementation uses the Alternating Least Squares algorithm to find the optimal scaling quantification for categorical data.
- Parameters:
- scaling: bool, optional
If true, scale variables to have unit variance before the analysis takes place.
Defaults to False.
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.
No default value.
- scores: bool, optional
If true, output the scores on each principal component when fitting.
Defaults to False.
- n_components: int
Specifies the number of components to keep.
Should be greater than or equal to 1.
- component_tol: float, optional
Specifies the threshold for dropping principal components. More precisely, if the ratio between the singular value of a component and the largest singular value is less than the specified threshold, then the corresponding component will be dropped.
Defaults to 0 (indicating no component is dropped).
- random_state: int, optional
Specifies the random seed used to generate the initial quantification for categorical variables. Should be nonnegative.
0 : Use current system time as seed (always changing).
Others : The deterministic seed value.
Defaults to 0.
- max_iter: int, optional
Specifies the maximum number of iterations allowed in computing the quantification for categorical variables.
Defaults to 100.
- tol: float, optional
Specifies the threshold that determines when the iterative quantification process should stop. More precisely, if the improvement of the loss value between consecutive iterations is less than this threshold, the quantification process terminates and is regarded as converged.
Valid range is (0, 1).
Defaults to 1e-5.
- svd_alg: {'lanczos', 'jacobi'}, optional
Specifies the choice of SVD algorithm.
'lanczos' : The Lanczos algorithm.
'jacobi' : The divide-and-conquer with Jacobi algorithm.
Defaults to 'jacobi'.
- lanczos_iter: int, optional
Specifies the maximum number of iterations allowed when computing the SVD using the Lanczos algorithm. Valid only when svd_alg is 'lanczos'.
Defaults to 1000.
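The ratio test described for component_tol can be sketched in plain Python. This is only an illustration of the stated rule, not PAL's implementation; the helper name keep_components and the sample singular values are hypothetical, and the sketch assumes the largest singular value is positive.

```python
def keep_components(singular_values, component_tol=0.0):
    """Return indices of components that survive the component_tol test.

    A component is dropped when its singular value divided by the largest
    singular value is strictly less than component_tol.
    """
    largest = max(singular_values)  # assumed to be positive
    return [
        i for i, s in enumerate(singular_values)
        if s / largest >= component_tol
    ]


# Hypothetical singular values, sorted in descending order.
sigma = [4.0, 2.0, 0.5, 0.00001]
kept_default = keep_components(sigma)      # tolerance 0 keeps every component
kept_strict = keep_components(sigma, 0.2)  # drops components with ratio < 0.2
```

With the default tolerance of 0 no ratio can fall below the threshold, which matches the documented "no component is dropped" behaviour.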
Examples
>>> cpc = CATPCA(scaling=True, thread_ratio=0.0, scores=True, n_components=2, component_tol=1e-5)
Perform fit():
>>> cpc.fit(data=df, key='ID', categorical_variable='X4')
>>> cpc.loadings_.collect()
Perform transform():
>>> result = cpc.transform(data=df_transform, key="ID", n_components=2, thread_ratio=0.5, ignore_unknown_category=False)
>>> result.collect()
- Attributes:
- loadings_: DataFrame
The weights by which each standardized original variable should be multiplied when computing component scores.
- loadings_stat_: DataFrame
Loadings statistics on each component.
- scores_: DataFrame
The transformed variable values corresponding to each data point.
Set to None if scores is False.
- scaling_stat_: DataFrame
Mean and scale values of each variable.
Note
Variables cannot be scaled if any variable has a constant value across all data items.
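The reason a constant variable blocks scaling is that its standard deviation is zero, so dividing by it is undefined. A minimal sketch, assuming scaling means centring on the mean and dividing by the population standard deviation (the exact estimator PAL uses is not stated here); the function name scale_to_unit_variance and the sample columns are hypothetical.

```python
import statistics


def scale_to_unit_variance(columns):
    """Centre each column and divide by its standard deviation.

    `columns` maps a variable name to a list of numeric values.
    Raises ValueError for a constant column, whose deviation is zero.
    """
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std: an assumption
        if std == 0:
            raise ValueError(f"variable {name!r} is constant and cannot be scaled")
        scaled[name] = [(v - mean) / std for v in values]
    return scaled


scaled = scale_to_unit_variance({"X1": [1.0, 2.0, 3.0]})

# A constant column (X2) makes the whole scaling step fail.
try:
    scale_to_unit_variance({"X1": [1.0, 2.0, 3.0], "X2": [5.0, 5.0, 5.0]})
    constant_rejected = False
except ValueError:
    constant_rejected = True
```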
Methods
create_model_state([model, function, ...]) : Create PAL model state.
delete_model_state([state]) : Delete PAL model state.
fit(data[, key, features, categorical_variable]) : Fit the model to the given dataset.
fit_transform(data[, key, features, ...]) : Fit with the dataset and return the scores.
get_model_metrics() : Get the model metrics.
get_score_metrics() : Get the score metrics.
set_model_state(state) : Set the model state by state information.
transform(data[, key, features, ...]) : Principal component analysis projection function using a trained model.
- fit(data, key=None, features=None, categorical_variable=None)
Fit the model to the given dataset.
- Parameters:
- data: DataFrame
Data to be fitted.
The number of rows in data is expected to be no less than self.n_components.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
The number of features should be no less than self.n_components.
If features is not provided, it defaults to all non-ID columns.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "CATPCA".
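The two size constraints stated for fit() (at least n_components rows and at least n_components features) can be checked locally before calling the server. A minimal sketch; check_fit_shape is a hypothetical helper, not part of hana_ml, and the row count and feature names are assumed to be already known to the caller.

```python
def check_fit_shape(n_rows, feature_names, n_components):
    """Raise ValueError if the documented fit() size constraints are violated."""
    if n_components < 1:
        raise ValueError("n_components must be greater than or equal to 1")
    if n_rows < n_components:
        raise ValueError(
            f"data has {n_rows} rows but n_components is {n_components}"
        )
    if len(feature_names) < n_components:
        raise ValueError(
            f"{len(feature_names)} features given but n_components is {n_components}"
        )
    return True


# Hypothetical dataset: 10 rows and 4 feature columns, keeping 2 components.
ok = check_fit_shape(10, ["X1", "X2", "X3", "X4"], n_components=2)
```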
- fit_transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None)
Fit with the dataset and return the scores.
- Parameters:
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- n_components: int, optional
Number of components to be retained.
The value range is from 1 to number of features.
Defaults to number of features.
- ignore_unknown_category: bool, optional
Specifies whether or not to ignore unknown categories in data.
If set to True, any unknown category is ignored and given a quantification of 0; otherwise, an error is raised when an unknown category is encountered.
Defaults to False.
- Returns:
- DataFrame
Transformed variable values for data, structured as follows:
1st column, with same name and type as the ID column of data.
2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.
3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score values of each data point in different components.
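The effect of ignore_unknown_category can be pictured with a small stand-alone sketch. It only mirrors the documented behaviour (unknown categories get 0, or an error is raised); the quantify helper and the learned quantification values are hypothetical and do not reflect how PAL stores its model.

```python
def quantify(values, quantification, ignore_unknown_category=False):
    """Map category labels to numeric quantifications.

    `quantification` is a hypothetical dict of values learned during fitting.
    Unknown labels become 0.0 when ignore_unknown_category is True,
    otherwise a ValueError is raised.
    """
    result = []
    for v in values:
        if v in quantification:
            result.append(quantification[v])
        elif ignore_unknown_category:
            result.append(0.0)
        else:
            raise ValueError(f"unknown category: {v!r}")
    return result


learned = {"A": -1.2, "B": 0.3, "C": 0.9}  # made-up quantifications
lenient = quantify(["A", "D", "C"], learned, ignore_unknown_category=True)
```

With the default ignore_unknown_category=False, the same input would raise an error because "D" was never seen during fitting.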
- transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None, thread_ratio=None)
Principal component analysis projection function using a trained model.
- Parameters:
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- n_components: int, optional
Number of components to be retained.
The value range is from 1 to number of features.
Defaults to number of features.
- Returns:
- DataFrame
Transformed variable values corresponding to each data point, structured as follows:
1st column, with same name and type as the ID column of data.
2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.
3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score values of each data point in different components.
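Because the result is in long format (one row per data point and component), it is often convenient to pivot it into one row per data point. A minimal sketch in plain Python, assuming the collected rows are available as (ID, COMPONENT_ID, COMPONENT_SCORE) tuples; pivot_scores and the sample rows are hypothetical.

```python
from collections import defaultdict


def pivot_scores(rows):
    """Turn (id, component_id, score) rows into {id: [score_1, score_2, ...]}.

    Scores are ordered by ascending component ID.
    """
    by_id = defaultdict(dict)
    for row_id, component_id, score in rows:
        by_id[row_id][component_id] = score
    return {
        row_id: [scores[c] for c in sorted(scores)]
        for row_id, scores in by_id.items()
    }


# Hypothetical rows in (ID, COMPONENT_ID, COMPONENT_SCORE) order.
rows = [
    (1, 1, 0.52), (1, 2, -1.10),
    (2, 2, 0.75), (2, 1, -0.31),
]
wide = pivot_scores(rows)
```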
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for (CAT)PCA.
- pal_funcname: int or str, optional
PAL function name.
Defaults to self.pal_funcname.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, it will delete the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
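When passing a dict, it can help to verify the four required keys before calling set_model_state. A minimal sketch; missing_state_keys is a hypothetical helper and the values shown are placeholders, since real values come from an existing model state.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values only; they do not describe a real state.
state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
missing = missing_state_keys(state)
```

An empty list means the dict carries every key the method description asks for.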
Inherited Methods from PALBase
Besides the methods mentioned above, the CATPCA class also inherits methods from the PALBase class. Please refer to PAL Base for more details.