hanaml.UnivariateAnalysis {hana.ml.r}    R Documentation

Univariate Analysis

Description

hanaml.UnivariateAnalysis is an R wrapper for PAL Univariate Analysis.

Usage

hanaml.UnivariateAnalysis (conn.context, data,
                          key = NULL, cols = NULL,
                          categorical.variable = NULL,
                          significance.level = NULL,
                          trimmed.percentage = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column of data.

cols

list of characters, optional
List of column names to analyze.
If not provided, it defaults to all non-ID columns.

categorical.variable

list of characters, optional
INTEGER columns specified in this list will be treated as categorical data.
By default, INTEGER columns are treated as continuous.

significance.level

double, optional
The significance level used to calculate the confidence interval of the sample mean.
Values must be greater than 0 and less than 1.
Defaults to 0.05.

trimmed.percentage

double, optional
The fraction of data dropped from both the head and the tail when calculating the trimmed mean.
Value range is from 0 to 0.5.
Defaults to 0.05.

Details

Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.
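As a rough local illustration of the statistics listed above, the following base-R sketch computes several of them for the X1 and X3 columns used in the Examples section. It does not call SAP HANA or PAL. The t-based confidence interval and the use of mean(trim = ...) are assumptions about how the quantities are defined, not a statement of PAL's implementation.

```r
# Local base-R sketch; it does not call SAP HANA or PAL.
x1 <- c(1.2, 2.5, 5.2, -10.2, 8.5, 100.0)  # column X1 from the Examples
x3 <- c(1, 2, 3, 2, 2, 3)                   # column X3, treated as categorical

significance.level <- 0.05
trimmed.percentage <- 0.2

# Continuous column: mean, variance, and standard deviation.
n <- length(x1)
x1.mean <- mean(x1)
x1.var <- var(x1)
x1.sd <- sd(x1)

# Two-sided t-based confidence interval for the mean (assumed definition).
half.width <- qt(1 - significance.level / 2, df = n - 1) * x1.sd / sqrt(n)
x1.ci <- c(lower = x1.mean - half.width, upper = x1.mean + half.width)

# mean(trim = p) drops floor(n * p) observations from each end before averaging.
x1.trimmed <- mean(x1, trim = trimmed.percentage)

# Categorical column: occurrences and percentage of the total per category.
x3.counts <- table(x3)
x3.percentages <- 100 * prop.table(x3.counts)

print(c(mean = x1.mean, variance = x1.var, sd = x1.sd, trimmed = x1.trimmed))
print(x1.ci)
print(x3.counts)
print(x3.percentages)
```

For this small sample, the mean, confidence bounds, trimmed mean, variance, and category percentages computed this way agree with the values shown in the Examples output. Quartiles, skewness, and kurtosis are left out because their exact definitions are not stated here.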

Value

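Returns a result object containing two DataFrames. In the example below, the first holds the counts and percentages for categorical columns, and the second holds the statistics for continuous columns.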
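The output in the Examples section uses a long layout, with one row per variable and statistic. The sketch below shows one way to look up values in such a table once it is available as a local data.frame. The data.frame here is a hand-built stand-in with rows copied from the X1 output in the Examples, and the names stats, x1.median, and x1.stats are hypothetical.

```r
# Local stand-in for a collected continuous-statistics result; the rows are
# copied from the X1 output in the Examples section.
stats <- data.frame(
  VARIABLE_NAME = "X1",
  STAT_NAME = c("min", "median", "max", "mean", "trimmed mean"),
  STAT_VALUE = c(-10.2, 3.85, 100, 17.866667, 4.35),
  stringsAsFactors = FALSE
)

# Pick one statistic for one variable.
x1.median <- stats$STAT_VALUE[stats$VARIABLE_NAME == "X1" &
                              stats$STAT_NAME == "median"]

# Turn the long layout into a named vector for easier access.
x1.rows <- stats[stats$VARIABLE_NAME == "X1", ]
x1.stats <- setNames(x1.rows$STAT_VALUE, x1.rows$STAT_NAME)

print(x1.median)
print(x1.stats[["trimmed mean"]])
```

A named vector is convenient when only one variable is of interest. For several variables, filtering on VARIABLE_NAME first, as above, keeps the lookup unambiguous.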

Examples

## Not run: 

DataFrame df to be analyzed:

> df$Collect()
     X1 X2 X3 X4
1   1.2 NA  1  A
2   2.5 NA  2  C
3   5.2 NA  3  A
4 -10.2 NA  2  A
5   8.5 NA  2  C
6 100.0 NA  3  B

Perform univariate analysis:
> output <- hanaml.UnivariateAnalysis(conn, df, categorical.variable='X3',
                                      significance.level=0.05,
                                      trimmed.percentage=0.2)
> output[[1]]
   VARIABLE_NAME     CATEGORY     STAT_NAME STAT_VALUE
1             X3 __PAL_NULL__         count    0.00000
2             X3 __PAL_NULL__ percentage(%)    0.00000
3             X3            1         count    1.00000
4             X3            1 percentage(%)   16.66667
5             X3            2         count    3.00000
6             X3            2 percentage(%)   50.00000
7             X3            3         count    2.00000
8             X3            3 percentage(%)   33.33333
9             X4 __PAL_NULL__         count    0.00000
10            X4 __PAL_NULL__ percentage(%)    0.00000
11            X4            A         count    3.00000
12            X4            A percentage(%)   50.00000
13            X4            B         count    1.00000
14            X4            B percentage(%)   16.66667
15            X4            C         count    2.00000
16            X4            C percentage(%)   33.33333

> output[[2]]
   VARIABLE_NAME                STAT_NAME  STAT_VALUE
1             X1       valid observations    6.000000
2             X1                      min  -10.200000
3             X1           lower quartile    1.200000
4             X1                   median    3.850000
5             X1           upper quartile    8.500000
6             X1                      max  100.000000
7             X1                     mean   17.866667
8             X1 CI for mean, lower bound  -24.879549
9             X1 CI for mean, upper bound   60.612883
10            X1             trimmed mean    4.350000
11            X1                 variance 1659.142667
12            X1       standard deviation   40.732575
13            X1                 skewness    1.688495
14            X1                 kurtosis    1.036148
15            X2       valid observations    0.000000

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]