hanaml.UnivariateAnalysis.Rd
hanaml.UnivariateAnalysis is an R wrapper for SAP HANA PAL Univariate Analysis.
hanaml.UnivariateAnalysis(
data,
key = NULL,
cols = NULL,
categorical.variable = NULL,
significance.level = NULL,
trimmed.percentage = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
list of characters, optional
List of column names to analyze.
If not provided, it defaults to all non-ID columns.
character or list/vector of characters, optional
Names of the columns that should be treated as categorical variables.
By default, the treatment depends on the data type of the column:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.
double, optional
The significance level used when calculating the confidence
interval of the sample mean.
Values must be greater than 0 and less than 1.
Defaults to 0.05.
double, optional
The fraction of the data dropped from each end (head and tail) when
calculating the trimmed mean.
Value range is from 0 to 0.5.
Defaults to 0.05.
Returns a list of two DataFrames:
DataFrame 1
Continuous result: statistics for continuous variables.
DataFrame 2
Categorical result: statistics for categorical variables.
Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.
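As a rough illustration of the per-category statistics described above, the following base R sketch computes occurrence counts and percentages for the X3 values used in the example on this page. It runs locally on a plain vector and does not call SAP HANA; the names x3, counts, percentages and summary_df are illustrative only, and PAL's internal implementation is not shown here.

```r
# X3 values copied from the example input on this page (assumed complete, no NAs).
x3 <- c(1, 2, 3, 2, 2, 3)

# Occurrences of each category.
counts <- table(x3)

# Share of all rows falling into each category, in percent.
percentages <- 100 * prop.table(counts)

# Hypothetical summary layout for illustration; not the PAL result schema.
summary_df <- data.frame(
  CATEGORY   = names(counts),
  count      = as.vector(counts),
  percentage = as.vector(percentages)
)
print(summary_df)
```

The three categories come out with counts 1, 3 and 2, which correspond to the X3 count rows in the example output.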
Input DataFrame data:
> data$Collect()
X1 X2 X3 X4
1 1.2 NA 1 A
2 2.5 NA 2 C
3 5.2 NA 3 A
4 -10.2 NA 2 A
5 8.5 NA 2 C
6 100.0 NA 3 B
Call the function:
> result <- hanaml.UnivariateAnalysis(data,
                                      categorical.variable="X3",
                                      significance.level=0.05,
                                      trimmed.percentage=0.2)
Output:
> result[[1]]
VARIABLE_NAME CATEGORY STAT_NAME STAT_VALUE
1 X3 __PAL_NULL__ count 0.00000
2 X3 __PAL_NULL__ percentage(%) 0.00000
3 X3 1 count 1.00000
4 X3 1 percentage(%) 16.66667
5 X3 2 count 3.00000
6 X3 2 percentage(%) 50.00000
7 X3 3 count 2.00000
8 X3 3 percentage(%) 33.33333
9 X4 __PAL_NULL__ count 0.00000
10 X4 __PAL_NULL__ percentage(%) 0.00000
11 X4 A count 3.00000
12 X4 A percentage(%) 50.00000
13 X4 B count 1.00000
14 X4 B percentage(%) 16.66667
15 X4 C count 2.00000
16 X4 C percentage(%) 33.33333
> result[[2]]
VARIABLE_NAME STAT_NAME STAT_VALUE
1 X1 valid observations 6.000000
2 X1 min -10.200000
3 X1 lower quartile 1.200000
4 X1 median 3.850000
5 X1 upper quartile 8.500000
6 X1 max 100.000000
7 X1 mean 17.866667
8 X1 CI for mean, lower bound -24.879549
9 X1 CI for mean, upper bound 60.612883
10 X1 trimmed mean 4.350000
11 X1 variance 1659.142667
12 X1 standard deviation 40.732575
13 X1 skewness 1.688495
14 X1 kurtosis 1.036148
15 X2 valid observations 0.000000
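The X1 statistics above can be cross-checked locally with base R. The sketch below is only a plausibility check under stated assumptions: it assumes the confidence interval is a two-sided Student t interval with n - 1 degrees of freedom, and that trimming drops the same fraction from each end, as R's mean(trim = ) does. The exact formulas used by PAL are not stated in this page, and the variable names are illustrative.

```r
# X1 values copied from the example input above.
x1 <- c(1.2, 2.5, 5.2, -10.2, 8.5, 100.0)

alpha <- 0.05  # plays the role of significance.level
trim  <- 0.2   # plays the role of trimmed.percentage

n <- length(x1)
m <- mean(x1)   # sample mean
v <- var(x1)    # sample variance (n - 1 denominator)
s <- sd(x1)     # sample standard deviation

# Assumed: two-sided t interval for the mean with n - 1 degrees of freedom.
half_width <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
ci <- c(lower = m - half_width, upper = m + half_width)

# Assumed: symmetric trimming, dropping floor(n * trim) values from each end.
trimmed <- mean(x1, trim = trim)

print(c(mean = m, ci, trimmed_mean = trimmed, variance = v, sd = s))
```

Under these assumptions the mean is about 17.87, the interval about (-24.88, 60.61) and the trimmed mean 4.35, in line with the X1 rows listed above.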