univariate_analysis

hana_ml.algorithms.pal.stats.univariate_analysis(data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)

Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

If not provided, it is assumed that the input data has no ID column.

cols : list of str, optional

List of column names to analyze.

If not provided, it defaults to all non-ID columns.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data.

By default, INTEGER columns are treated as continuous.

No default value.

significance_level : float, optional

Significance level used to compute the confidence interval for the sample mean.

Values must be greater than 0 and less than 1.

Defaults to 0.05.

trimmed_percentage : float, optional

Fraction of the data dropped from each end (both the lowest and the highest values) when calculating the trimmed mean.

Value range is from 0 to 0.5.

Defaults to 0.05.
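
As a rough illustration of how significance_level and trimmed_percentage enter the calculations, the following pure-Python sketch computes a trimmed mean and a t-based confidence interval for the mean. It is not the PAL implementation: the helper names `trimmed_mean` and `mean_confidence_interval` are hypothetical, the truncation rule (dropping `floor(n * trimmed_percentage)` values from each end) is an assumption, and the Student's t critical value is supplied by hand because the Python standard library has no t quantile function.

```python
import math
import statistics


def trimmed_mean(values, trimmed_percentage=0.05):
    """Mean after dropping a fraction of the sorted values from each end.

    Assumption: floor(n * trimmed_percentage) values are dropped per end.
    """
    ordered = sorted(values)
    k = math.floor(len(ordered) * trimmed_percentage)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimmed_percentage removes every observation")
    return sum(kept) / len(kept)


def mean_confidence_interval(values, t_critical):
    """Two-sided confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # sample std. deviation
    return mean - t_critical * std_err, mean + t_critical * std_err


# Values of column X1 from the Examples section of this page.
x1 = [1.2, 2.5, 5.2, -10.2, 8.5, 100.0]

# Two-sided Student's t critical value for significance_level=0.05 and
# n - 1 = 5 degrees of freedom, taken from a t table (hardcoded assumption).
T_CRITICAL = 2.570582

tm = trimmed_mean(x1, trimmed_percentage=0.2)
lower, upper = mean_confidence_interval(x1, T_CRITICAL)
print(round(tm, 6), round(lower, 3), round(upper, 3))
```

Under these assumptions the results agree, to within rounding, with the trimmed mean and confidence bounds reported for X1 in the Examples section.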

Returns:
DataFrames

Statistics for continuous variables, structured as follows:

  • VARIABLE_NAME, type NVARCHAR(256), variable names.

  • STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).

  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.

Statistics for categorical variables, structured as follows:

  • VARIABLE_NAME, type NVARCHAR(256), variable names.

  • CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.

  • STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).

  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
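
Both result tables are in long format, with one row per variable and statistic. A minimal sketch of regrouping such rows by variable is shown here; `pivot_stats` is a hypothetical helper, and plain tuples stand in for the rows of a collected result.

```python
from collections import defaultdict


def pivot_stats(rows):
    """Group (VARIABLE_NAME, STAT_NAME, STAT_VALUE) rows by variable."""
    wide = defaultdict(dict)
    for variable, stat_name, stat_value in rows:
        wide[variable][stat_name] = stat_value
    return dict(wide)


# Illustrative rows in the (VARIABLE_NAME, STAT_NAME, STAT_VALUE) layout.
rows = [
    ("X1", "min", -10.2),
    ("X1", "median", 3.85),
    ("X1", "max", 100.0),
    ("X2", "valid observations", 0.0),
]

stats_by_variable = pivot_stats(rows)
print(stats_by_variable["X1"]["median"])
```

If the collected result is a pandas DataFrame, `DataFrame.pivot(index='VARIABLE_NAME', columns='STAT_NAME', values='STAT_VALUE')` is likely a more direct way to get the same wide layout.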

Examples

Dataset to be analyzed:

>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B

Perform univariate analysis:

>>> from hana_ml.algorithms.pal.stats import univariate_analysis
>>> continuous, categorical = univariate_analysis(
...     data=df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)

Outputs:

>>> continuous.collect()
   VARIABLE_NAME                 STAT_NAME   STAT_VALUE
0             X1        valid observations     6.000000
1             X1                       min   -10.200000
2             X1            lower quartile     1.200000
3             X1                    median     3.850000
4             X1            upper quartile     8.500000
5             X1                       max   100.000000
6             X1                      mean    17.866667
7             X1  CI for mean, lower bound   -24.879549
8             X1  CI for mean, upper bound    60.612883
9             X1              trimmed mean     4.350000
10            X1                  variance  1659.142667
11            X1        standard deviation    40.732575
12            X1                  skewness     1.688495
13            X1                  kurtosis     1.036148
14            X2        valid observations     0.000000
>>> categorical.collect()
   VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
0             X3  __PAL_NULL__          count    0.000000
1             X3  __PAL_NULL__  percentage(%)    0.000000
2             X3             1          count    1.000000
3             X3             1  percentage(%)   16.666667
4             X3             2          count    3.000000
5             X3             2  percentage(%)   50.000000
6             X3             3          count    2.000000
7             X3             3  percentage(%)   33.333333
8             X4  __PAL_NULL__          count    0.000000
9             X4  __PAL_NULL__  percentage(%)    0.000000
10            X4             A          count    3.000000
11            X4             A  percentage(%)   50.000000
12            X4             B          count    1.000000
13            X4             B  percentage(%)   16.666667
14            X4             C          count    2.000000
15            X4             C  percentage(%)   33.333333
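
The categorical figures above can be cross-checked locally. The following is only a sketch: `category_stats` is a hypothetical helper, and Python's `None` stands in for the `__PAL_NULL__` category, which is always reported even when its count is zero.

```python
from collections import Counter


def category_stats(values):
    """Return {category: (count, percentage)}; None is always a category."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    counts.setdefault(None, 0)  # report the null category even if absent
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Columns X3 and X4 from the example dataset.
x3 = [1, 2, 3, 2, 2, 3]
x4 = ["A", "C", "A", "A", "C", "B"]

x3_stats = category_stats(x3)
x4_stats = category_stats(x4)
for category, (count, pct) in x4_stats.items():
    print(category, count, round(pct, 6))
```

The percentage is taken over all rows of the column, nulls included, which is consistent with the description of the categorical result table.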