univariate_analysis
- hana_ml.algorithms.pal.stats.univariate_analysis(data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)
Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
If not provided, it is assumed that the input data has no ID column.
- cols : list of str, optional
List of column names to analyze.
If not provided, it defaults to all non-ID columns.
- categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data.
By default, INTEGER columns are treated as continuous.
No default value.
- significance_level : float, optional
Significance level used to calculate the confidence interval of the sample mean.
Values must be greater than 0 and less than 1.
Defaults to 0.05.
- trimmed_percentage : float, optional
Fraction of the data dropped from each end (head and tail) when calculating the trimmed mean.
Value range is from 0 to 0.5.
Defaults to 0.05.
- Returns:
- DataFrames
Statistics for continuous variables, structured as follows:
VARIABLE_NAME, type NVARCHAR(256), variable names.
STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).
STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
Statistics for categorical variables, structured as follows:
VARIABLE_NAME, type NVARCHAR(256), variable names.
CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.
STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).
STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
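Both result tables are in long format, with one row per (variable, statistic) pair. The sketch below shows one way to regroup such rows into a nested dictionary for lookup by variable and statistic name. The helper `pivot_stats` is hypothetical and not part of hana_ml; the sample rows are copied from the example output on this page.

```python
from collections import defaultdict


def pivot_stats(rows):
    """Group long-format (VARIABLE_NAME, STAT_NAME, STAT_VALUE) rows
    into a nested dict: {variable: {stat_name: value}}.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    wide = defaultdict(dict)
    for variable, stat_name, value in rows:
        wide[variable][stat_name] = value
    return dict(wide)


# A few rows shaped like the continuous result table.
rows = [
    ("X1", "min", -10.2),
    ("X1", "max", 100.0),
    ("X2", "valid observations", 0.0),
]

wide = pivot_stats(rows)
print(wide["X1"]["max"])
print(wide["X2"]["valid observations"])
```

Assuming `collect()` returns a pandas DataFrame, as the examples on this page suggest, the rows could be supplied with `continuous.collect().itertuples(index=False)`.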
Examples
Dataset to be analyzed:
>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B
Perform univariate analysis:
>>> continuous, categorical = univariate_analysis(
...     data=df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)
Outputs:
>>> continuous.collect()
   VARIABLE_NAME                 STAT_NAME   STAT_VALUE
0             X1        valid observations     6.000000
1             X1                       min   -10.200000
2             X1            lower quartile     1.200000
3             X1                    median     3.850000
4             X1            upper quartile     8.500000
5             X1                       max   100.000000
6             X1                      mean    17.866667
7             X1  CI for mean, lower bound   -24.879549
8             X1  CI for mean, upper bound    60.612883
9             X1              trimmed mean     4.350000
10            X1                  variance  1659.142667
11            X1        standard deviation    40.732575
12            X1                  skewness     1.688495
13            X1                  kurtosis     1.036148
14            X2        valid observations     0.000000
>>> categorical.collect()
   VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
0             X3  __PAL_NULL__          count    0.000000
1             X3  __PAL_NULL__  percentage(%)    0.000000
2             X3             1          count    1.000000
3             X3             1  percentage(%)   16.666667
4             X3             2          count    3.000000
5             X3             2  percentage(%)   50.000000
6             X3             3          count    2.000000
7             X3             3  percentage(%)   33.333333
8             X4  __PAL_NULL__          count    0.000000
9             X4  __PAL_NULL__  percentage(%)    0.000000
10            X4             A          count    3.000000
11            X4             A  percentage(%)   50.000000
12            X4             B          count    1.000000
13            X4             B  percentage(%)   16.666667
14            X4             C          count    2.000000
15            X4             C  percentage(%)   33.333333
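To make the roles of `significance_level` and `trimmed_percentage` concrete, the following minimal sketch recomputes several of the X1 and X3 figures locally with the Python standard library. It is not the PAL implementation. It assumes that the trimmed mean drops floor(n * trimmed_percentage) values from each end, that the variance is the sample variance (n - 1 denominator), and that the confidence interval uses a two-sided Student's t critical value; these assumptions reproduce the figures shown for this dataset. The critical value is hard-coded for significance_level=0.05 and 5 degrees of freedom. Quartiles, skewness, and kurtosis are left out because their exact definitions are not stated on this page.

```python
import math
import statistics
from collections import Counter

# Columns X1 and X3 from the example dataset on this page.
x1 = [1.2, 2.5, 5.2, -10.2, 8.5, 100.0]
x3 = [1, 2, 3, 2, 2, 3]

# Two-sided Student's t critical value for significance_level=0.05
# and n - 1 = 5 degrees of freedom (hard-coded to avoid extra dependencies).
T_CRIT_95_DF5 = 2.570582


def trimmed_mean(values, trimmed_percentage):
    """Mean after dropping a fraction of the sorted data from each end.

    Assumption: floor(n * trimmed_percentage) values are dropped per end.
    Requires at least one value to remain after trimming.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trimmed_percentage)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


def mean_confidence_interval(values, t_critical):
    """Confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_critical * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


def category_percentages(values):
    """Count and percentage of the total for each distinct category."""
    counts = Counter(values)
    total = len(values)
    return {cat: (cnt, 100.0 * cnt / total) for cat, cnt in counts.items()}


mean = statistics.mean(x1)
median = statistics.median(x1)
variance = statistics.variance(x1)  # sample variance (n - 1 denominator)
tmean = trimmed_mean(x1, 0.2)
ci_lower, ci_upper = mean_confidence_interval(x1, T_CRIT_95_DF5)
x3_stats = category_percentages(x3)

print(f"mean={mean:.6f} median={median:.6f} variance={variance:.6f}")
print(f"trimmed mean={tmean:.6f} CI=({ci_lower:.6f}, {ci_upper:.6f})")
print(x3_stats)
```

A smaller `significance_level` widens the interval, and a larger `trimmed_percentage` discards more extreme values, which is why the trimmed mean for X1 sits far below the mean that includes the outlier 100.0.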