R: principal component analysis (PCA)

hanaml.PCA {hana.ml.r}

R Documentation

principal component analysis (PCA)

Description

hanaml.PCA is an R wrapper for the PCA algorithm of the SAP HANA Predictive Analysis Library (PAL).

Usage

hanaml.PCA(conn.context, data, key, features = NULL,
           formula = NULL, scaling = NULL, thread.ratio = NULL,
           scores = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of the ID column of data.
`features`	`list of character, optional` Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
`formula`	`formula type, optional` Formula to be used for model generation. Format: label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a feature and label combination, not both. Defaults to NULL.
`scaling`	`logical, optional` If TRUE, scale variables to have unit variance before the analysis takes place. Defaults to FALSE.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. No default value.
`scores`	`logical, optional` If TRUE, output the scores on each principal component when fitting. Defaults to FALSE.
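
As a rough illustration of what the `scaling` argument controls, the sketch below uses base R's `prcomp` on a small made-up data frame rather than hanaml.PCA itself. The data frame `toy` and its column names are hypothetical, and it is only an assumption that PAL standardizes in the same way as `prcomp(scale. = TRUE)`.

```r
# Toy data (hypothetical): two variables measured on very different scales.
toy <- data.frame(
  x_small = c(1.2, 1.5, 1.1, 1.8, 1.6),
  x_large = c(5200, 4100, 6100, 3900, 5000)
)

# Without scaling, the large-variance column dominates the first component.
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)

# With scaling, each column is divided by its standard deviation first,
# so both variables contribute equally to the total variance.
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

print(round(pca_raw$rotation, 4))
print(round(pca_scaled$rotation, 4))

# After scaling, the total variance equals the number of variables.
print(sum(pca_scaled$sdev^2))
```

When the feature columns use different units, leaving `scaling` at FALSE lets the column with the largest variance drive the leading component, which is usually the reason to set it to TRUE.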

Format

R6Class object.

Details

Principal component analysis (PCA) reduces the dimensionality of multivariate data using singular value decomposition (SVD).
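
The SVD route can be sketched in base R as follows. This is not the PAL implementation; it is a minimal local example on the built-in `USArrests` dataset, and the variable names (`Xs`, `loadings`, `comp_sd`, `scores`) are chosen here for illustration. The result is checked against `prcomp`, up to the sign of each component, which is arbitrary.

```r
# Standardize the data: center each column and scale to unit variance.
X  <- as.matrix(USArrests)
Xs <- scale(X, center = TRUE, scale = TRUE)

# Singular value decomposition: Xs = U D V^T.
s <- svd(Xs)

# Columns of V are the principal directions (loadings).
loadings <- s$v

# Component standard deviations follow from the singular values.
comp_sd <- s$d / sqrt(nrow(Xs) - 1)

# Scores are the standardized data projected onto the loadings.
scores <- Xs %*% loadings

# Reference result from base R; signs of components may differ.
ref <- prcomp(X, center = TRUE, scale. = TRUE)
print(all.equal(comp_sd, ref$sdev))
print(all.equal(abs(loadings), abs(unname(ref$rotation))))
print(all.equal(abs(unname(scores)), abs(unname(ref$x))))
```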

Value

Returns a "PCA" object with the following values:

loadings : DataFrame
The weights by which each standardized original variable should be
multiplied when computing component scores.
loadings.stat : DataFrame
Loading statistics on each component.
scores : DataFrame
The transformed variable values corresponding to each data point.
Set to NULL if scores is FALSE.
scaling.stat : DataFrame
Mean and scale values of each variable.
model : list of DataFrame
The fitted model.
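
The columns of loadings.stat appear to be related in the usual way: the proportion of variance is each squared standard deviation divided by their sum, and the cumulative proportion is its running total. The sketch below checks this locally using the SD values printed in the Examples section of this page; that PAL computes the columns exactly this way is inferred from those numbers, which it reproduces to within rounding. The 0.9 threshold is an arbitrary choice for illustration.

```r
# SD column copied from the loadings.stat output in the Examples section.
comp_sd <- c(Comp1 = 1.566624, Comp2 = 1.100453,
             Comp3 = 0.536973, Comp4 = 0.215297)

# Variance of each component and its share of the total variance.
comp_var     <- comp_sd^2
var_prop     <- comp_var / sum(comp_var)
cum_var_prop <- cumsum(var_prop)

print(round(data.frame(SD = comp_sd,
                       VAR_PROP = var_prop,
                       CUM_VAR_PROP = cum_var_prop), 6))

# Smallest number of components explaining at least 90% of the variance.
n_keep <- which(cum_var_prop >= 0.9)[1]
print(n_keep)
```

A cumulative-proportion cutoff like this is one common way to decide how many components to keep.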

Examples

## Not run: 
# Input DataFrame df for training:
> df$Head(4)$Collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

> pca <- hanaml.PCA(conn.context = conn, data = df, key = "ID",
                    scaling = TRUE, thread.ratio = 0.5, scores = TRUE)

# Output:
> pca$loadings$Collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
> pca$loadings.stat$Collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
> pca$scaling.stat$Collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]