hanaml.CATPCA is an R wrapper for SAP HANA PAL Categorical PCA.

hanaml.CATPCA(
  data = NULL,
  key = NULL,
  features = NULL,
  formula = NULL,
  n.components = NULL,
  scaling = NULL,
  thread.ratio = NULL,
  scores.output = NULL,
  categorical.variable = NULL,
  component.tol = NULL,
  random.state = NULL,
  max.iter = NULL,
  tol = NULL,
  svd.alg = NULL,
  lanczos.iter = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
You can either give the formula, or a feature and label combination, but do not provide both.
Defaults to NULL.

n.components

integer
Specifies the number of components to keep.

scaling

logical, optional
If TRUE, scale variables to have unit variance before the analysis takes place.
Defaults to FALSE if data contains no categorical features, otherwise defaults to TRUE.
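
As a rough local illustration of what scaling to unit variance means, the sketch below uses base R only and does not call PAL. The columns are copied from the example data on this page. Using the sample standard deviation (n - 1 denominator) is an assumption; this page does not state which estimator PAL applies.

```r
# Columns X1 and X3 copied from the example data on this page.
x <- data.frame(
  X1 = c(12, 12, 12, 13, 14, 22, 22, 17, 15, 23, 25),
  X3 = c(20, 25, 21, 21, 24, 25, 26, 21, 24, 23, 23)
)

# scale() centers each column and divides it by its sample standard
# deviation (assumption: PAL may use a different denominator).
scaled <- scale(x, center = TRUE, scale = TRUE)

# After scaling, every column has variance 1 up to floating-point error.
unit.var <- apply(scaled, 2, var)
print(round(unit.var, 10))
```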

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

scores.output

logical, optional
If TRUE, output the scores on each principal component when fitting.
Defaults to FALSE.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

component.tol

double, optional
Specifies the threshold for dropping principal components.
More precisely, if the ratio between a singular value of some component and the largest singular value is less than the specified threshold, then the corresponding component will be dropped.
Valid range is [0, 1).
Defaults to 0 (indicating that no component is dropped).
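
The dropping rule can be sketched locally with base R's svd(). This is only an illustration of the ratio test described above, not the PAL implementation; keep.components is a hypothetical helper name and the toy matrix is made up for the demonstration.

```r
# Hypothetical helper, not part of hanaml: returns the indices of the
# components that survive the component.tol rule (ratio >= threshold).
keep.components <- function(singular.values, component.tol = 0) {
  stopifnot(component.tol >= 0, component.tol < 1)
  ratio <- singular.values / max(singular.values)
  which(ratio >= component.tol)
}

# Toy matrix whose third column is almost a copy of the first,
# so one singular value is close to zero.
set.seed(1)
m <- matrix(rnorm(40), nrow = 20, ncol = 2)
m <- cbind(m, m[, 1] + 1e-9 * rnorm(20))
d <- svd(scale(m, scale = FALSE))$d   # singular values of the centered data

kept.default <- keep.components(d)         # threshold 0 keeps every component
kept.strict  <- keep.components(d, 1e-5)   # drops the near-zero component
print(kept.default)
print(kept.strict)
```

With the default threshold of 0 every ratio passes the test, which matches the documented behavior that no component is dropped.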

random.state

integer, optional
Specifies the random seed used to generate initial quantification for categorical variables. Should be nonnegative.

  • 0 : Use current system time as seed (always changing);

  • Others : The deterministic seed value.

Defaults to 0.

max.iter

integer, optional
Specifies the maximum number of iterations allowed in computing the quantification for categorical variables.
Defaults to 100.

tol

double, optional
Specifies the threshold to determine when the iterative quantification process should be stopped.
More precisely, if the improvement of the loss value between consecutive iterations is less than this threshold, the quantification process terminates and is regarded as converged.
Valid range is (0, 1).
Defaults to 1e-5.
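
The interplay between tol and max.iter can be sketched with a generic loop in base R. This is a minimal sketch, not PAL code: iterate.until.converged and the toy step function halve are hypothetical names, and measuring the improvement as an absolute difference of loss values is an assumption.

```r
# Hypothetical driver, not PAL code. `step` is any function that maps the
# current state to list(state = ..., loss = ...).
iterate.until.converged <- function(step, state, tol = 1e-5, max.iter = 100) {
  stopifnot(tol > 0, tol < 1, max.iter >= 1)
  prev.loss <- Inf
  for (iter in seq_len(max.iter)) {
    res <- step(state)
    state <- res$state
    # Stop once the loss improves by less than tol (absolute improvement,
    # which is an assumption of this sketch).
    if (prev.loss - res$loss < tol) {
      return(list(state = state, loss = res$loss,
                  iterations = iter, converged = TRUE))
    }
    prev.loss <- res$loss
  }
  # Iteration budget exhausted before the improvement fell below tol.
  list(state = state, loss = prev.loss,
       iterations = max.iter, converged = FALSE)
}

# Toy step: halve the distance to zero; the loss is the squared distance.
halve <- function(x) list(state = x / 2, loss = (x / 2)^2)

fit    <- iterate.until.converged(halve, state = 1, tol = 1e-5, max.iter = 100)
capped <- iterate.until.converged(halve, state = 1, tol = 1e-5, max.iter = 3)
print(fit$converged)
print(capped$converged)
```

A smaller tol generally means more iterations before the process is regarded as converged, while max.iter caps the work regardless of the improvement.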

lanczos.iter

integer, optional
Specifies the maximum number of iterations allowed when computing SVD using the LANCZOS algorithm.
Valid only when svd.alg is "lanczos".
Defaults to 100.

svd.alg

c("lanczos", "jacobi"), optional
Specifies the choice of SVD algorithm.

  • "lanczos" : The LANCZOS algorithm.

  • "jacobi" : The divide-and-conquer with Jacobi algorithm.

Defaults to "jacobi".

Value

Returns an R6 object of class "CATPCA" with the following attributes and methods:
Attributes

  • loadings : DataFrame
    The weights by which each standardized original variable should be
    multiplied when computing component scores.

  • loadings.stat : DataFrame
    Loading statistics on each component

  • scores : DataFrame
    The transformed variable values corresponding to each data point.
    Set to NULL if scores.output is FALSE.

  • scaling.stat : DataFrame
    Mean and scale values of each variable

  • quantification : DataFrame
    Quantification information for categorical variables.

  • stat : DataFrame
    Key statistics for the category quantification process.

  • model : list of DataFrames
    The fitted model.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > cpc <- hanaml.CATPCA(data=df, key="ID")
   > cpc$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing.
    Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with model.
    Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression.
    Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
    Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether or not to replace the existing state for model.
    Defaults to FALSE.

After this method is called, an attribute named state, containing the parsed info for the model, is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > cpc <- hanaml.CATPCA(data=df, key="ID")
   > cpc$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling:


   > cpc$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info.
    Defaults to self$state.

After this method is called, the specified model state is cleaned up and the associated memory is released.

Details

Principal component analysis (PCA) is a procedure that reduces the dimensionality of multivariate data using Singular Value Decomposition. Unlike traditional PCA, the algorithm provided here allows categorical features in the input dataset.

Examples

Input DataFrame data:


> data$Collect()
   ID X1 X2 X3 X4 X5 X6
1   1 12  A 20 44 48 16
2   2 12  B 25 45 50 16
3   3 12  C 21 45 50 16
4   4 13  A 21 46 51 17
5   5 14  C 24 46 51 17
6   6 22  A 25 54 58 26
7   7 22  D 26 55 58 27
8   8 17  A 21 45 52 17
9   9 15  D 24 45 53 18
10 10 23  C 23 53 57 24
11 11 25  B 23 55 58 25

Call the function:


> cpc <- hanaml.CATPCA(data = data,
                       key="ID",
                       scaling=TRUE,
                       thread.ratio=0.0,
                       scores.output=TRUE,
                       n.components=2,
                       component.tol=1e-5,
                       categorical.variable="X4",
                       random.state=2021,
                       max.iter=550,
                       tol=1e-5,
                       svd.alg="lanczos",
                       lanczos.iter=100)

Output:


> cpc$loadings$Collect()
   VARIABLE_NAME COMPONENT_ID COMPONENT_LOADING
1             X1            1        -0.4444619
2             X1            2        -0.2665427
3             X3            1        -0.3313307
4             X3            2         0.5321249
5             X5            1        -0.4674109
6             X5            2        -0.1320058
7             X6            1        -0.4634899
8             X6            2        -0.1582530
9             X2            1         0.2131124
10            X2            2        -0.7553383
11            X4            1        -0.4625589
12            X4            2        -0.1810874
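
The long-format loadings can be easier to read as a variables-by-components matrix. The sketch below rebuilds the table locally from the values printed above, so it runs in base R without a HANA connection. The PC1/PC2 column labels are made up for display.

```r
# Values copied from the cpc$loadings$Collect() output above.
loadings <- data.frame(
  VARIABLE_NAME = rep(c("X1", "X3", "X5", "X6", "X2", "X4"), each = 2),
  COMPONENT_ID = rep(1:2, times = 6),
  COMPONENT_LOADING = c(-0.4444619, -0.2665427, -0.3313307,  0.5321249,
                        -0.4674109, -0.1320058, -0.4634899, -0.1582530,
                         0.2131124, -0.7553383, -0.4625589, -0.1810874)
)

# Pivot to a matrix with one row per variable and one column per component.
# Each cell holds exactly one value, so sum() just passes it through.
wide <- tapply(loadings$COMPONENT_LOADING,
               list(loadings$VARIABLE_NAME, loadings$COMPONENT_ID),
               sum)
colnames(wide) <- paste0("PC", colnames(wide))
print(round(wide, 4))

# Sum of squared loadings per component.
ss <- colSums(wide^2)
print(round(ss, 4))

# Variable with the largest absolute loading on the second component.
top.pc2 <- names(which.max(abs(wide[, "PC2"])))
print(top.pc2)
```

In this example the squared loadings of each component sum to approximately 1, and X2, the only character-valued column, has the largest absolute loading on the second component.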