Univariate Analysis — hanaml.UnivariateAnalysis • hana.ml.r

hanaml.UnivariateAnalysis is an R wrapper for SAP HANA PAL Univariate Analysis.

hanaml.UnivariateAnalysis(
  data,
  key = NULL,
  cols = NULL,
  categorical.variable = NULL,
  significance.level = NULL,
  trimmed.percentage = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

cols

list of characters, optional
List of column names to analyze.
If not provided, it defaults to all non-ID columns.

categorical.variable

character or list/vector of characters, optional
Names of the columns that should be treated as categorical variables.
By default, the treatment depends on the column type:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for columns of "INTEGER" type; ignored for columns of any other type.
No default value.

significance.level

double, optional
The significance level used when calculating the confidence interval of the sample mean.
Values must be greater than 0 and less than 1.
Defaults to 0.05.

trimmed.percentage

double, optional
The fraction of the data dropped from each end (head and tail) when calculating the trimmed mean.
Value range is from 0 to 0.5.
Defaults to 0.05.
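
To illustrate what significance.level and trimmed.percentage control, the following base R sketch recomputes the confidence interval and the trimmed mean for column X1 of the example data in the Examples section. It runs locally and needs no HANA connection. The variable names are illustrative, and the formulas (a two-sided Student t interval, and a trimmed mean that drops floor(n * trimmed.percentage) values from each end) are assumptions that agree with the documented output; they are not a statement of how PAL computes these values internally.

```r
# Values of column X1 from the Examples section of this page.
x1 <- c(1.2, 2.5, 5.2, -10.2, 8.5, 100.0)

significance.level <- 0.05   # alpha for the confidence interval
trimmed.percentage <- 0.2    # fraction dropped from each end

n <- length(x1)

# Two-sided t interval for the sample mean (assumed formula).
margin <- qt(1 - significance.level / 2, df = n - 1) * sd(x1) / sqrt(n)
ci <- c(lower = mean(x1) - margin, upper = mean(x1) + margin)

# Trimmed mean: base R drops floor(n * trim) observations from each end.
trimmed <- mean(x1, trim = trimmed.percentage)

print(round(ci, 6))
print(trimmed)
```

With six observations and trimmed.percentage = 0.2, one value is removed from each end (-10.2 and 100.0), so the trimmed mean is the average of the four remaining values.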

Value

Returns a list of two DataFrames:

DataFrame 1
Continuous result: statistics for continuous variables.
DataFrame 2
Categorical result: statistics for categorical variables.

Details

Provides an overview of the dataset. For continuous columns, it reports the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For categorical (discrete) columns, it reports the number of occurrences of each category and its percentage of the total data.
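
The statistics listed above can be sketched locally in base R, which may help when checking results by hand. The snippet below uses columns X1 and X3 from the Examples section. The quantile type (type = 2) and the moment-based definitions of skewness and excess kurtosis are assumptions chosen because they agree with the documented output; PAL's own definitions may differ in other cases. All variable names are illustrative.

```r
x1 <- c(1.2, 2.5, 5.2, -10.2, 8.5, 100.0)   # continuous column X1
x3 <- c(1, 2, 3, 2, 2, 3)                   # X3, treated as categorical

# Quartiles and median; type = 2 is an assumption that matches the example.
quartiles <- quantile(x1, probs = c(0.25, 0.5, 0.75), type = 2, names = FALSE)

# Skewness and excess kurtosis from population moments (assumed definitions).
d <- x1 - mean(x1)
m2 <- mean(d^2)
skewness <- mean(d^3) / m2^1.5
kurtosis <- mean(d^4) / m2^2 - 3

continuous <- c(min = min(x1), lower = quartiles[1], median = quartiles[2],
                upper = quartiles[3], max = max(x1), mean = mean(x1),
                variance = var(x1), sd = sd(x1),
                skewness = skewness, kurtosis = kurtosis)

# Categorical summary: occurrences and share of all rows per category.
counts <- table(x3)
categorical <- data.frame(CATEGORY = names(counts),
                          count = as.vector(counts),
                          percentage = 100 * as.vector(counts) / length(x3))

print(round(continuous, 6))
print(categorical)
```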

Examples

Input DataFrame data:


> data$Collect()
     X1 X2 X3 X4
1   1.2 NA  1  A
2   2.5 NA  2  C
3   5.2 NA  3  A
4 -10.2 NA  2  A
5   8.5 NA  2  C
6 100.0 NA  3  B

Call the function:


> result <- hanaml.UnivariateAnalysis(data,
                                      categorical.variable="X3",
                                      significance.level=0.05,
                                      trimmed.percentage=0.2)

Output:


> result[[1]]
   VARIABLE_NAME     CATEGORY      STAT_NAME  STAT_VALUE
1             X3 __PAL_NULL__          count     0.00000
2             X3 __PAL_NULL__  percentage(%)     0.00000
3             X3            1          count     1.00000
4             X3            1  percentage(%)    16.66667
5             X3            2          count     3.00000
6             X3            2  percentage(%)    50.00000
7             X3            3          count     2.00000
8             X3            3  percentage(%)    33.33333
9             X4 __PAL_NULL__          count     0.00000
10            X4 __PAL_NULL__  percentage(%)     0.00000
11            X4            A          count     3.00000
12            X4            A  percentage(%)    50.00000
13            X4            B          count     1.00000
14            X4            B  percentage(%)    16.66667
15            X4            C          count     2.00000
16            X4            C  percentage(%)    33.33333

> result[[2]]
   VARIABLE_NAME                STAT_NAME  STAT_VALUE
1             X1       valid observations    6.000000
2             X1                      min  -10.200000
3             X1           lower quartile    1.200000
4             X1                   median    3.850000
5             X1           upper quartile    8.500000
6             X1                      max  100.000000
7             X1                     mean   17.866667
8             X1 CI for mean, lower bound  -24.879549
9             X1 CI for mean, upper bound   60.612883
10            X1             trimmed mean    4.350000
11            X1                 variance 1659.142667
12            X1       standard deviation   40.732575
13            X1                 skewness    1.688495
14            X1                 kurtosis    1.036148
15            X2       valid observations    0.000000
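
Both results are in long format, with one row per variable and statistic. As a minimal sketch of how such a table might be used after it has been brought into local R memory, the snippet below builds a small local data.frame from a few rows hand-copied from the continuous output above and turns them into a named vector. The object names (continuous, x1.stats, ci.width) are illustrative only, and the sketch selects rows by column content rather than by position in the returned list.

```r
# A few rows hand-copied from the continuous result printed above;
# in a real session these would come from the returned DataFrame.
continuous <- data.frame(
  VARIABLE_NAME = c("X1", "X1", "X1", "X1"),
  STAT_NAME = c("mean", "CI for mean, lower bound",
                "CI for mean, upper bound", "trimmed mean"),
  STAT_VALUE = c(17.866667, -24.879549, 60.612883, 4.350000),
  stringsAsFactors = FALSE
)

# Turn the long STAT_NAME / STAT_VALUE pairs into a named vector for X1.
x1 <- continuous[continuous$VARIABLE_NAME == "X1", ]
x1.stats <- setNames(x1$STAT_VALUE, x1$STAT_NAME)

# Look up individual statistics by name.
print(x1.stats[["mean"]])
ci.width <- x1.stats[["CI for mean, upper bound"]] -
  x1.stats[["CI for mean, lower bound"]]
print(ci.width)
```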