univariate_analysis

hana_ml.algorithms.pal.stats.univariate_analysis(data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)

Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

If not provided, it is assumed that the input data has no ID column.

cols : list of str, optional

List of column names to analyze.

If not provided, it defaults to all non-ID columns.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data.

By default, INTEGER columns are treated as continuous.

No default value.

significance_level : float, optional

Significance level used to compute the confidence interval for the sample mean. The interval has confidence level 1 - significance_level.

Values must be greater than 0 and less than 1.

Defaults to 0.05.

trimmed_percentage : float, optional

Fraction of observations dropped from each end (lowest and highest values) of the sorted data when calculating the trimmed mean.

Value range is from 0 to 0.5.

Defaults to 0.05.
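As an illustration of what significance_level and trimmed_percentage control, the plain-Python sketch below recomputes the confidence interval and the trimmed mean for the X1 column used in the Examples section. It is not the PAL implementation: the helper names, the round-down trimming rule, the exclusion of 0.5, and the hard-coded Student t critical value (two-sided, significance level 0.05, 5 degrees of freedom) are assumptions made for the sketch.

```python
import math


def trimmed_mean(values, trimmed_percentage=0.05):
    """Mean after dropping a fraction of points from each end of the sorted data."""
    # Assumption: 0.5 is excluded here so that at least one point always remains.
    if not 0 <= trimmed_percentage < 0.5:
        raise ValueError("trimmed_percentage must be in [0, 0.5)")
    ordered = sorted(values)
    # Assumption: the number of points dropped per side is rounded down.
    k = int(len(ordered) * trimmed_percentage)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def mean_confidence_interval(values, t_critical):
    """Two-sided confidence interval for the mean, using the sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = t_critical * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


# X1 column from the Examples section.
x1 = [1.2, 2.5, 5.2, -10.2, 8.5, 100.0]

# Assumption: critical value taken from a t table
# (significance level 0.05, two-sided, n - 1 = 5 degrees of freedom).
T_CRIT_95_DF5 = 2.570582

tm = trimmed_mean(x1, trimmed_percentage=0.2)
ci_low, ci_high = mean_confidence_interval(x1, T_CRIT_95_DF5)
print(tm, ci_low, ci_high)
```

Under these assumptions the printed values should agree, up to rounding, with the trimmed mean and CI for mean rows of the example output.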

Returns:

tuple of two DataFrames

First DataFrame: statistics for continuous variables, structured as follows:

VARIABLE_NAME, type NVARCHAR(256), variable names.

STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).

STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.

Second DataFrame: statistics for categorical variables, structured as follows:

VARIABLE_NAME, type NVARCHAR(256), variable names.

CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.

STAT_NAME, type NVARCHAR(100), names of statistical quantities: the number of observations in the category, and the percentage of all data points of the variable (nulls included) that fall in that category.

STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
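Both result tables are in long format, with one row per variable and statistic. A minimal sketch of regrouping such rows into a nested dictionary is shown below; the function name `stats_by_variable` and the hand-written sample rows are hypothetical and only mirror the column layout described above.

```python
from collections import defaultdict


def stats_by_variable(rows):
    """Group (VARIABLE_NAME, STAT_NAME, STAT_VALUE) rows into {variable: {stat: value}}."""
    grouped = defaultdict(dict)
    for variable, stat_name, stat_value in rows:
        grouped[variable][stat_name] = stat_value
    return dict(grouped)


# Hypothetical sample rows in the layout of the continuous-statistics table.
rows = [
    ("X1", "min", -10.2),
    ("X1", "median", 3.85),
    ("X1", "max", 100.0),
    ("X2", "valid observations", 0.0),
]

summary = stats_by_variable(rows)
print(summary["X1"]["median"])
```

For a collected result, the rows could plausibly be taken from the pandas DataFrame with `itertuples(index=False, name=None)`.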

Examples

Dataset to be analyzed:

>>> df.collect()
   X1    X2  X3 X4
  1.2  None   1  A
  2.5  None   2  C
  5.2  None   3  A
-10.2  None   2  A
  8.5  None   2  C
100.0  None   3  B

Perform univariate analysis:

>>> continuous, categorical = univariate_analysis(
...     data=df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)

Outputs:

>>> continuous.collect()
VARIABLE_NAME                 STAT_NAME   STAT_VALUE
           X1        valid observations     6.000000
           X1                       min   -10.200000
           X1            lower quartile     1.200000
           X1                    median     3.850000
           X1            upper quartile     8.500000
           X1                       max   100.000000
           X1                      mean    17.866667
           X1  CI for mean, lower bound   -24.879549
           X1  CI for mean, upper bound    60.612883
           X1              trimmed mean     4.350000
           X1                  variance  1659.142667
           X1        standard deviation    40.732575
           X1                  skewness     1.688495
           X1                  kurtosis     1.036148
           X2        valid observations     0.000000
>>> categorical.collect()
VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
           X3  __PAL_NULL__          count    0.000000
           X3  __PAL_NULL__  percentage(%)    0.000000
           X3             1          count    1.000000
           X3             1  percentage(%)   16.666667
           X3             2          count    3.000000
           X3             2  percentage(%)   50.000000
           X3             3          count    2.000000
           X3             3  percentage(%)   33.333333
           X4  __PAL_NULL__          count    0.000000
           X4  __PAL_NULL__  percentage(%)    0.000000
           X4             A          count    3.000000
           X4             A  percentage(%)   50.000000
           X4             B          count    1.000000
           X4             B  percentage(%)   16.666667
           X4             C          count    2.000000
           X4             C  percentage(%)   33.333333
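
The categorical output can be cross-checked by hand. The sketch below is a plain-Python approximation, not the PAL routine: the helper name `category_stats` is hypothetical, and Python's `None` stands in for the `__PAL_NULL__` category, which is always reported even when no nulls are present.

```python
from collections import Counter


def category_stats(values):
    """Return {category: (count, percentage of all points)}, with None as the null category."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    # Always report the null category first, then the observed categories in sorted order.
    categories = [None] + sorted(c for c in counts if c is not None)
    return {c: (counts[c], 100.0 * counts[c] / total) for c in categories}


# X3 column from the example dataset.
x3 = [1, 2, 3, 2, 2, 3]

stats = category_stats(x3)
for category, (count, percentage) in stats.items():
    print(category, count, round(percentage, 6))
```

The counts and percentages for categories 1, 2, and 3 should match the X3 rows above, with the null category reported as zero.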