Principal Component Analysis (PCA)

hanaml.PCA is an R wrapper for SAP HANA PAL PCA.

hanaml.PCA(
  data = NULL,
  key = NULL,
  features = NULL,
  formula = NULL,
  scaling = NULL,
  thread.ratio = NULL,
  scores.output = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
formula	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a feature and label combination, but not both. Defaults to NULL.
scaling	`logical, optional` If TRUE, scale variables to have unit variance before the analysis takes place. Defaults to FALSE.
thread.ratio	`double, optional` Controls the proportion of available threads that this function can use. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads. Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then determined heuristically. Defaults to -1.
scores.output	`logical, optional` If TRUE, output the scores on each principal component when fitting. Defaults to FALSE.

Value

Returns a "PCA" object with the following values:

loadings : DataFrame
The weights by which each standardized original variable should be
multiplied when computing component scores.
loadings.stat : DataFrame
Loading statistics on each component.
scores : DataFrame
The transformed variable values corresponding to each data point.
Set to NULL if scores.output is FALSE.
scaling.stat : DataFrame
Mean and scale values of each variable.
model : list of DataFrame
The fitted model.

Details

Principal component analysis (PCA) reduces the dimensionality of multivariate data using Singular Value Decomposition (SVD).
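
As a rough illustration of SVD-based PCA, the following base R sketch standardizes a small matrix, decomposes it with svd(), and derives loadings, per-component standard deviations, variance proportions, and scores. It is a generic textbook formulation, not the PAL implementation. The variable names (x, z, s, loadings, sdev, var.prop, scores) are chosen here for illustration, and the data are only the four rows shown in the example input, so the numbers will not match the output of hanaml.PCA on the full data set.

```r
# Toy data: only the four rows printed in the example input (an assumption
# made for illustration; the real data set has more rows).
x <- matrix(c(12, 52, 20, 44,
              12, 57, 25, 45,
              12, 54, 21, 45,
              13, 52, 21, 46),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("X1", "X2", "X3", "X4")))

# Center each column and scale it to unit variance
# (conceptually the scaling = TRUE case).
z <- scale(x, center = TRUE, scale = TRUE)

# Singular value decomposition: z = U D V^T.
s <- svd(z)

loadings <- s$v                        # columns are the principal directions
sdev     <- s$d / sqrt(nrow(z) - 1)    # standard deviation of each component
var.prop <- sdev^2 / sum(sdev^2)       # proportion of variance explained
cum.prop <- cumsum(var.prop)           # cumulative proportion
scores   <- z %*% loadings             # data projected onto the components

print(round(cbind(SD = sdev, VAR_PROP = var.prop, CUM_VAR_PROP = cum.prop), 6))
```

The same standard deviations can be cross-checked against base R's prcomp(x, scale. = TRUE)$sdev, which also relies on svd().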

Examples

Input DataFrame data:

> data$Head(4)$Collect()
   ID    X1    X2    X3    X4
1   1  12.0  52.0  20.0  44.0
2   2  12.0  57.0  25.0  45.0
3   3  12.0  54.0  21.0  45.0
4   4  13.0  52.0  21.0  46.0

Call the function:

> pca <- hanaml.PCA(data = data,
                    key = "ID",
                    scaling=TRUE,
                    thread.ratio=0.5,
                    scores.output=TRUE)

Output:

> pca$loadings$Collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
1        Comp1     0.541547     0.321424     0.511941     0.584235
2        Comp2    -0.454280     0.728287     0.395819    -0.326429
3        Comp3    -0.171426    -0.600095     0.760875    -0.177673
4        Comp4    -0.686273    -0.078552    -0.048095     0.721489
> pca$loadings.stat$Collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
1        Comp1  1.566624  0.613577      0.613577
2        Comp2  1.100453  0.302749      0.916327
3        Comp3  0.536973  0.072085      0.988412
4        Comp4  0.215297  0.011588      1.000000
> pca$scaling.stat$Collect()
   VARIABLE_ID       MEAN     SCALE
1            1  17.000000  5.039841
2            2  53.636364  1.689540
3            3  23.000000  2.000000
4            4  48.454545  4.655398
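
To show how the loadings and scaling statistics relate to a component score, the base R sketch below standardizes the first input row with the MEAN and SCALE values printed above and forms the weighted sum with the Comp1 loadings, following the description of loadings in the Value section. The variable names (center, scl, comp1, x, z, score1) are illustrative, and the result is a hand computation under that description; it is not guaranteed to match pca$scores exactly, for example in sign convention or rounding.

```r
# Values copied from the example output (scaling.stat and loadings).
center <- c(17.000000, 53.636364, 23.000000, 48.454545)  # MEAN column
scl    <- c(5.039841, 1.689540, 2.000000, 4.655398)      # SCALE column
comp1  <- c(0.541547, 0.321424, 0.511941, 0.584235)      # Comp1 loadings

# First row of the input data (ID = 1).
x <- c(X1 = 12, X2 = 52, X3 = 20, X4 = 44)

# Standardize each variable, then take the loading-weighted sum.
z      <- (x - center) / scl
score1 <- sum(z * comp1)

print(score1)
```

Repeating the weighted sum with the Comp2 to Comp4 rows gives the remaining component scores for the same data point.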
