hanaml.PCA is an R wrapper for SAP HANA PAL PCA.

hanaml.PCA(
  data = NULL,
  key = NULL,
  features = NULL,
  formula = NULL,
  scaling = NULL,
  thread.ratio = NULL,
  scores.output = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either the formula or a feature and label combination, but not both.
Defaults to NULL.
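
As a rough illustration of the label~<feature_list> form, the sketch below uses only base R to split a formula into its label and feature names. How hanaml.PCA itself parses the formula is not stated on this page, so this shows the general idea under that assumption, not the wrapper's implementation.

```r
# Base R only; no SAP HANA connection is needed.
# The formula is the one quoted in the argument description.
f <- CATEGORY ~ V1 + V2 + V3

# Left-hand side: the label column.
label <- all.vars(f[[2]])

# Right-hand side: the feature columns.
features <- attr(terms(f), "term.labels")

print(label)
print(features)
```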

scaling

logical, optional
If TRUE, scale variables to have unit variance before the analysis takes place.
Defaults to FALSE.
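
A minimal base R sketch of what scaling to unit variance means, using the four rows shown by Head(4) in the Examples section (a subset of the data, used here for illustration only). It relies on base R's scale(), which divides by the sample standard deviation; whether PAL uses the same divisor is an assumption.

```r
# First four rows of the example input (illustrative subset).
x <- data.frame(
  X1 = c(12, 12, 12, 13),
  X2 = c(52, 57, 54, 52),
  X3 = c(20, 25, 21, 21),
  X4 = c(44, 45, 45, 46)
)

# Centre each column and divide it by its sample standard deviation.
x_scaled <- scale(x, center = TRUE, scale = TRUE)

# After scaling, every column has variance 1, so no variable
# dominates the analysis merely because of its units.
col_vars <- apply(x_scaled, 2, var)
print(col_vars)
```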

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

scores.output

logical, optional
If TRUE, output the scores on each principal component when fitting.
Defaults to FALSE.

Value

Returns a "PCA" object with the following values:

  • loadings : DataFrame
    The weights by which each standardized original variable should be
    multiplied when computing component scores.

  • loadings.stat : DataFrame
    Loading statistics on each component.

  • scores : DataFrame
    The transformed variable values corresponding to each data point.
    Set to NULL if scores.output is FALSE.

  • scaling.stat : DataFrame
    Mean and scale values of each variable.

  • model : list of DataFrame
    The fitted model.
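
The relationship between loadings, scaling.stat and scores can be sketched in base R with the numbers printed in the Examples section. This assumes that VARIABLE_ID 1 to 4 corresponds to X1 to X4 and that scores are formed as standardized values times loadings, as the description of loadings suggests. The scores table is not shown on this page, so the result is not checked against PAL output.

```r
# Loadings copied from the Examples section (one row per component).
loadings <- matrix(
  c( 0.541547,  0.321424,  0.511941,  0.584235,
    -0.454280,  0.728287,  0.395819, -0.326429,
    -0.171426, -0.600095,  0.760875, -0.177673,
    -0.686273, -0.078552, -0.048095,  0.721489),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Comp", 1:4), paste0("X", 1:4))
)

# MEAN and SCALE from scaling.stat (mapping IDs 1..4 to X1..X4 is assumed).
center   <- c(X1 = 17.000000, X2 = 53.636364, X3 = 23.000000, X4 = 48.454545)
scale_by <- c(X1 = 5.039841,  X2 = 1.689540,  X3 = 2.000000,  X4 = 4.655398)

# First data row of the example input.
row1 <- c(X1 = 12, X2 = 52, X3 = 20, X4 = 44)

# Standardize the row, then weight it by the loadings of each component.
z <- (row1 - center) / scale_by
scores_row1 <- drop(loadings %*% z)
print(scores_row1)

# The loading vectors are orthonormal up to rounding of the printed values.
gram <- loadings %*% t(loadings)
print(round(gram, 4))
```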

Details

This function performs principal component analysis (PCA), which reduces the dimensionality of multivariate data using Singular Value Decomposition.
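
For intuition, the following base R sketch carries out PCA through a singular value decomposition on the four rows shown in the Examples section. It illustrates the general technique only; the PAL implementation behind hanaml.PCA may differ in details such as sign conventions and variance divisors.

```r
# Illustrative subset of the example input (first four rows).
x <- cbind(
  X1 = c(12, 12, 12, 13),
  X2 = c(52, 57, 54, 52),
  X3 = c(20, 25, 21, 21),
  X4 = c(44, 45, 45, 46)
)

# Centre the columns, then decompose: x_centered = U D V'.
x_centered <- scale(x, center = TRUE, scale = FALSE)
s <- svd(x_centered)

# Columns of V are the principal directions (loadings).
loadings <- s$v

# Singular values give the standard deviation of each component.
sdev <- s$d / sqrt(nrow(x) - 1)

# Scores are the centred data projected onto the principal directions.
scores <- x_centered %*% loadings

# Cross-check against base R's prcomp(), which also uses SVD.
ref <- prcomp(x)
print(cbind(svd = sdev, prcomp = ref$sdev))
```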

Examples

Input DataFrame data:

> data$Head(4)$Collect()
   ID    X1    X2    X3    X4
1   1  12.0  52.0  20.0  44.0
2   2  12.0  57.0  25.0  45.0
3   3  12.0  54.0  21.0  45.0
4   4  13.0  52.0  21.0  46.0

Call the function:

> pca <- hanaml.PCA(data = data,
                    key = "ID",
                    scaling = TRUE,
                    thread.ratio = 0.5,
                    scores.output = TRUE)

Output:

> pca$loadings$Collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
1        Comp1     0.541547     0.321424     0.511941     0.584235
2        Comp2    -0.454280     0.728287     0.395819    -0.326429
3        Comp3    -0.171426    -0.600095     0.760875    -0.177673
4        Comp4    -0.686273    -0.078552    -0.048095     0.721489
> pca$loadings.stat$Collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
1        Comp1  1.566624  0.613577      0.613577
2        Comp2  1.100453  0.302749      0.916327
3        Comp3  0.536973  0.072085      0.988412
4        Comp4  0.215297  0.011588      1.000000
> pca$scaling.stat$Collect()
   VARIABLE_ID       MEAN     SCALE
1            1  17.000000  5.039841
2            2  53.636364  1.689540
3            3  23.000000  2.000000
4            4  48.454545  4.655398
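
The columns of loadings.stat are related to one another, which the short base R sketch below reproduces from the printed SD values: VAR_PROP is each component's variance divided by the total variance, and CUM_VAR_PROP is its running sum. The numbers are copied from the output above; small differences in the last digit come from rounding.

```r
# SD column copied from pca$loadings.stat above.
sdev <- c(Comp1 = 1.566624, Comp2 = 1.100453,
          Comp3 = 0.536973, Comp4 = 0.215297)

# Variance of each component and its share of the total.
variance     <- sdev^2
var_prop     <- variance / sum(variance)
cum_var_prop <- cumsum(var_prop)

print(round(cbind(SD = sdev,
                  VAR_PROP = var_prop,
                  CUM_VAR_PROP = cum_var_prop), 6))
```

Because the call used scaling = TRUE, the component variances sum to about 4, the number of input variables.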

See also