CATPCA
- class hana_ml.algorithms.pal.decomposition.CATPCA(scaling=None, thread_ratio=None, scores=None, n_components=None, component_tol=None, random_state=None, max_iter=None, tol=None, svd_alg=None, lanczos_iter=None)
Principal components analysis algorithm that supports categorical features.
- Parameters:
- scaling : bool, optional
If true, scale variables to have unit variance before the analysis takes place.
Defaults to False.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
No default value.
- scores : bool, optional
If true, output the scores on each principal component when fitting.
Defaults to False.
- n_components : int
Specifies the number of components to keep.
Should be greater than or equal to 1.
- component_tol : float, optional
Specifies the threshold for dropping principal components. More precisely, if the ratio between the singular value of a component and the largest singular value is less than the specified threshold, the corresponding component is dropped.
Defaults to 0 (indicating that no component is dropped).
- random_state : int, optional
Specifies the random seed used to generate the initial quantification for categorical variables. Should be nonnegative.
0 : Use the current system time as the seed (always changing).
Others : The deterministic seed value.
Defaults to 0.
- max_iter : int, optional
Specifies the maximum number of iterations allowed in computing the quantification for categorical variables.
Defaults to 100.
- tol : float, optional
Specifies the threshold that determines when the iterative quantification process stops. More precisely, if the improvement in the loss value between consecutive iterations is less than this threshold, the quantification process terminates and is regarded as converged.
Valid range is (0, 1).
Defaults to 1e-5.
- svd_alg : {'lanczos', 'jacobi'}, optional
Specifies the choice of SVD algorithm.
'lanczos' : The Lanczos algorithm.
'jacobi' : The divide-and-conquer with Jacobi algorithm.
Defaults to 'jacobi'.
- lanczos_iter : int, optional
Specifies the maximum number of iterations allowed for computing SVD using the Lanczos algorithm.
Valid only when svd_alg is 'lanczos'.
Defaults to 1000.
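The dropping rule described for component_tol can be illustrated with a small, self-contained sketch. This is not PAL's implementation; the helper name keep_components and the singular values are made up for illustration, and only the documented ratio test is reproduced.

```python
def keep_components(singular_values, component_tol=0.0):
    """Return the 1-based IDs of components that survive the ratio test.

    A component is dropped when its singular value divided by the largest
    singular value is less than component_tol (hypothetical helper).
    """
    largest = max(singular_values)
    return [
        component_id
        for component_id, value in enumerate(singular_values, start=1)
        if value / largest >= component_tol
    ]


# Made-up singular values, for illustration only.
singular_values = [4.0, 2.0, 0.5, 0.00002]

kept = keep_components(singular_values, component_tol=1e-5)
kept_default = keep_components(singular_values)  # component_tol=0 keeps everything
print(kept, kept_default)
```

With a threshold of 0 every ratio passes the test, which matches the documented default of dropping no component.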
Examples
Input DataFrame data:
>>> data.collect()
    ID  X1 X2  X3  X4  X5  X6
0    1  12  A  20  44  48  16
1    2  12  B  25  45  50  16
2    3  12  C  21  45  50  16
3    4  13  A  21  46  51  17
4    5  14  C  24  46  51  17
5    6  22  A  25  54  58  26
6    7  22  D  26  55  58  27
7    8  17  A  21  45  52  17
8    9  15  D  24  45  53  18
9   10  23  C  23  53  57  24
10  11  25  B  23  55  58  25
Call the function:
>>> cpc = CATPCA(scaling=True, thread_ratio=0.0, scores=True,
...              n_components=2, component_tol=1e-5, random_state=2021,
...              max_iter=550, tol=1e-5, svd_alg='lanczos', lanczos_iter=100)
>>> cpc.fit(data=data, key='ID', categorical_variable='X4')
>>> cpc.loadings_.collect()
   VARIABLE_NAME  COMPONENT_ID  COMPONENT_LOADING
0             X1             1          -0.444462
1             X1             2          -0.266543
2             X3             1          -0.331331
3             X3             2           0.532125
4             X5             1          -0.467411
5             X5             2          -0.132006
6             X6             1          -0.463490
7             X6             2          -0.158253
8             X2             1           0.213112
9             X2             2          -0.755338
10            X4             1          -0.462559
11            X4             2          -0.181087
Input data for CATPCA transformation:
>>> data2.collect()
   ID  X1 X2  X3  X4  X5  X6
0   1  12  A  20  44  48  16
1   2  12  B  25  45  50  16
2   3  12  C  21  45  50  16
3   4  13  A  21  46  51  17
4   5  14  C  24  46  51  17
5   6  22  A  25  54  58  26
Perform transformation of the DataFrame above using the "CATPCA" object cpc:
>>> result = cpc.transform(data=data2, key="ID", n_components=2,
...                        thread_ratio=0.5, ignore_unknown_category=False)
Output:
>>> result.collect()
    ID  COMPONENT_ID  COMPONENT_SCORE
0    1             1         2.734518
1    2             1         1.055637
2    3             1         1.918711
3    4             1         1.768841
4    5             1         1.019988
5    6             1        -2.386126
6    1             2        -0.747631
7    2             2         1.709687
8    3             2        -0.064889
9    4             2        -0.767442
10   5             2         0.557958
11   6             2        -1.094885
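The score output is in long format, with one row per (ID, COMPONENT_ID) pair. For plotting or joining, a wide layout with one entry per ID is often easier to work with. The sketch below regroups the rows in plain Python; the tuples are copied from the output above, and pivot_scores is a hypothetical helper, not part of hana_ml.

```python
# (ID, COMPONENT_ID, COMPONENT_SCORE) rows copied from the example output.
rows = [
    (1, 1, 2.734518), (2, 1, 1.055637), (3, 1, 1.918711),
    (4, 1, 1.768841), (5, 1, 1.019988), (6, 1, -2.386126),
    (1, 2, -0.747631), (2, 2, 1.709687), (3, 2, -0.064889),
    (4, 2, -0.767442), (5, 2, 0.557958), (6, 2, -1.094885),
]


def pivot_scores(rows):
    """Group long-format score rows into {ID: {COMPONENT_ID: score}}."""
    wide = {}
    for row_id, component_id, score in rows:
        wide.setdefault(row_id, {})[component_id] = score
    return wide


wide = pivot_scores(rows)
for row_id in sorted(wide):
    print(row_id, wide[row_id][1], wide[row_id][2])
```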
- Attributes:
- loadings_ : DataFrame
The weights by which each standardized original variable should be multiplied when computing component scores.
- loadings_stat_ : DataFrame
Loadings statistics on each component.
- scores_ : DataFrame
The transformed variable values corresponding to each data point.
Set to None if scores is False.
- scaling_stat_ : DataFrame
Mean and scale values of each variable.
Note
Variables cannot be scaled if any variable has a constant value across all data items.
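One way to guard against the situation in the note is to look for constant columns before requesting scaling. The sketch below runs on plain Python records rather than a HANA DataFrame; constant_columns is a hypothetical helper and the FLAG column is invented for illustration.

```python
def constant_columns(records, columns):
    """Return the names of columns holding a single distinct value."""
    return [
        name
        for name in columns
        if len({record[name] for record in records}) <= 1
    ]


# Small made-up sample; FLAG never varies, so it would block scaling.
records = [
    {"X1": 12, "X3": 20, "FLAG": 1},
    {"X1": 13, "X3": 21, "FLAG": 1},
    {"X1": 22, "X3": 25, "FLAG": 1},
]

blocked = constant_columns(records, ["X1", "X3", "FLAG"])
print(blocked)
```

Any column reported here could be excluded from the features, or scaling could be left at its default of False.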
Methods
create_model_state([model, function, ...])          Create PAL model state.
delete_model_state([state])                         Delete PAL model state.
fit(data[, key, features, categorical_variable])    Principal component analysis fit function.
fit_transform(data[, key, features, ...])           Fit with the dataset and return the scores.
set_model_state(state)                              Set the model state by state information.
transform(data[, key, features, ...])               Principal component analysis projection function using a trained model.
- fit(data, key=None, features=None, categorical_variable=None)
Principal component analysis fit function.
- Parameters:
- data : DataFrame
Data to be fitted.
The number of rows in data is expected to be no less than self.n_components.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
The number of features should be no less than self.n_components.
If features is not provided, it defaults to all non-ID columns.
- categorical_variable : str or ListOfStrings, optional
Specifies INTEGER variables that should be treated as categorical.
By default, variables of INTEGER type are treated as continuous.
- Returns:
- A fitted 'CATPCA' object.
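The size constraints stated for fit (at least n_components rows and at least n_components features, with n_components no smaller than 1) can be checked up front. The following is a minimal sketch under those documented constraints only; check_fit_shape is a hypothetical helper and does not reflect how PAL validates its input.

```python
def check_fit_shape(n_rows, n_features, n_components):
    """Return a list of violated constraints (empty when all hold)."""
    problems = []
    if n_components < 1:
        problems.append("n_components must be >= 1")
    if n_rows < n_components:
        problems.append("fewer rows than n_components")
    if n_features < n_components:
        problems.append("fewer features than n_components")
    return problems


# The example data has 11 rows and 6 feature columns (X1..X6).
ok = check_fit_shape(n_rows=11, n_features=6, n_components=2)
too_few_rows = check_fit_shape(n_rows=1, n_features=6, n_components=2)
print(ok, too_few_rows)
```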
- fit_transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None)
Fit with the dataset and return the scores.
- Parameters:
- data : DataFrame
Data to be analyzed.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- n_components : int, optional
Number of components to be retained.
The value range is from 1 to the number of features.
Defaults to the number of features.
- ignore_unknown_category : bool, optional
Specifies whether or not to ignore unknown categories in data.
If set to True, any unknown category is ignored and given a quantification of 0; otherwise, an error is raised when an unknown category is encountered.
Defaults to False.
- Returns:
- DataFrame
Transformed variable values for data, structured as follows:
1st column, with the same name and type as the ID column of data.
2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.
3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score value of each data point on each component.
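The effect of ignore_unknown_category can be pictured as a lookup into the quantifications learned during fitting. The sketch below only mirrors the documented behavior (0 for an ignored unknown category, an error otherwise); the quantify helper and the quantification values are invented and are not PAL internals.

```python
def quantify(category, quantification, ignore_unknown_category=False):
    """Look up the numeric quantification of a category value."""
    if category in quantification:
        return quantification[category]
    if ignore_unknown_category:
        # Documented behavior: unknown categories are quantified as 0.
        return 0.0
    raise ValueError(f"unknown category: {category!r}")


# Made-up quantifications for the categories A-D seen during fitting.
quantification = {"A": -0.8, "B": 0.1, "C": 0.4, "D": 1.2}

known = quantify("A", quantification)
ignored = quantify("E", quantification, ignore_unknown_category=True)
print(known, ignored)
```

Leaving the flag at False makes unseen categories fail loudly, which is usually the safer choice when the scoring data is expected to match the training data.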
- transform(data, key=None, features=None, n_components=None, ignore_unknown_category=None, thread_ratio=None)
Principal component analysis projection function using a trained model.
- Parameters:
- data : DataFrame
Data to be analyzed.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- n_components : int, optional
Number of components to be retained.
The value range is from 1 to the number of features.
Defaults to the number of features.
- Returns:
- DataFrame
Transformed variable values corresponding to each data point, structured as follows:
1st column, with the same name and type as the ID column of data.
2nd column, type INTEGER, named 'COMPONENT_ID', representing the IDs of the principal components.
3rd column, type DOUBLE, named 'COMPONENT_SCORE', representing the score value of each data point on each component.
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters:
- model : DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function : str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for (CAT)PCA.
- pal_funcname : int or str, optional
PAL function name.
Defaults to self.pal_funcname.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100); it must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000); the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
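When state is passed as a dict, it can help to confirm that the four required keys are present before calling set_model_state. The sketch below checks only the key names listed above; missing_state_keys is a hypothetical helper and the values are placeholders, not real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values only; real values come from an existing model state.
state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

print(missing_state_keys(state))
print(missing_state_keys({"STATE_ID": "<state-id>"}))
```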
Inherited Methods from PALBase
Besides the methods mentioned above, the CATPCA class also inherits methods from the PALBase class. Please refer to PAL Base for more details.