hanaml.UnifiedClassification is an R wrapper for SAP HANA PAL Unified Classification.

hanaml.UnifiedClassification(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  stratified.column = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  training.size = NULL,
  ntiles = NULL,
  output.partition.result = NULL,
  background.size = NULL,
  background.random.state = NULL,
  ...
)

Arguments

data

DataFrame
DataFrame containing the data.

func

character
The functionality for unified classification.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Name of feature columns.
If not provided, defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last non-purpose column.

purpose

character, optional
Name of the column that specifies the user-defined data partition.
Mandatory if partition.method is "user.defined".

formula

formula type, optional
Formula to be used for model generation, in the format label ~ <feature_list>, e.g. formula = CATEGORY ~ V1 + V2 + V3.
Provide either the formula or a features and label combination, but not both.
Defaults to NULL.
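
As an illustration of how a formula corresponds to a features and label combination, the base R sketch below splits the example formula into its left-hand and right-hand variables. It only shows the correspondence; it is not a description of how hanaml processes the formula internally.

```r
# Illustration only: decompose a formula into a label and a feature list
# using base R. CATEGORY and V1..V3 are the names from the example formula.
f <- CATEGORY ~ V1 + V2 + V3

vars <- all.vars(f)     # variable names, left-hand side first
label <- vars[1]        # the dependent variable
features <- vars[-1]    # the independent variables

print(label)
print(features)
```

Under this reading, passing formula = CATEGORY ~ V1 + V2 + V3 would be the counterpart of passing label = "CATEGORY" together with features = list("V1", "V2", "V3").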

partition.method

character, optional
Specifies the method for partitioning the training data.
Valid options include: "no", "user.defined", "stratified".
Defaults to "no" if not specified (i.e. no data partition).

stratified.column

character, optional
Specifies the name of the column used for stratified partition.
Mandatory when partition.method is set to "stratified".

partition.random.state

integer, optional
Specifies the random seed for stratified partition. Defaults to 0 (system time).

training.percent

numeric, optional
Specifies the percentage of data used for training.
Defaults to 0.8.

training.size

integer, optional
Specifies the number of samples in data used for training.
If training.percent is set, then this parameter has no effect.
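
To make the idea of a stratified partition concrete, the base R sketch below splits a toy data.frame so that each class keeps the same training fraction. The data, the column names ID and CLASS, and the rounding rule are assumptions made for illustration; the sketch does not reproduce how PAL performs the partition.

```r
# Illustration only: a stratified split in base R on a toy data.frame.
set.seed(1)  # fixed seed so the split is reproducible
df <- data.frame(ID = 1:15,
                 CLASS = rep(c("A", "B"), times = c(10, 5)))
training.percent <- 0.8

# Sample the same fraction of row indices within each stratum (class).
train.idx <- unlist(lapply(split(seq_len(nrow(df)), df$CLASS), function(idx) {
  n <- round(training.percent * length(idx))
  idx[sample.int(length(idx), n)]
}))

train <- df[train.idx, ]
valid <- df[-train.idx, ]

print(table(train$CLASS))  # 8 rows of "A", 4 rows of "B"
print(table(valid$CLASS))  # 2 rows of "A", 1 row of "B"
```

The class proportions (2:1 here) are the same in both parts, which is the property a stratified partition is meant to preserve.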

ntiles

integer, optional
Specifies the population tiles in metrics output.
The value should be no less than 1 and no larger than the row size of the input data.

output.partition.result

logical, optional
Controls whether to output the partition result of data or not.
Defaults to FALSE.

background.size

integer, optional
Specifies the row size of background data.
It should not be larger than the row size of data. Valid only for the following cases:

  • func is "NaiveBayes", "SVM", or "MLP";

  • func is "LogisticRegression" and multi.class is TRUE.

Defaults to 0.

background.random.state

integer, optional
Specifies the seed for random number generator in the background sampling.

  • 0: Uses the current time (in seconds) as seed

  • Others: Uses the specified value as seed

Defaults to 0.

...


Specifies other parameters for training a classification model with the functionality specified in func.
See the documentation of the corresponding functions for details:
hanaml.DecisionTreeClassifier, hanaml.RDTClassifier, hanaml.MLPClassifier, hanaml.HGBTClassifier, hanaml.NaiveBayes, hanaml.LogisticRegression, hanaml.SVC.
However, some parameters are disabled. The disabled parameters are listed as follows:

  • DecisionTree: output.rules, output.confusion.matrix

  • RDT: calculate.oob

  • HGBT: calculate.importance, calculate.cm

  • LogisticRegression: pmml.export

Value

Returns a "UnifiedClassification" object with the following attributes and methods:

model DataFrame

  • ROW_INDEX - model row index

  • PART_INDEX - data partition index

  • MODEL_CONTENT - model content

importance DataFrame

  • VARIABLE_NAME - Independent variable name

  • IMPORTANCE - Variable importance

optimal.param DataFrame

  • PARM_NAME - parameter name

  • INT_VALUE - integer value

  • DOUBLE_VALUE - double value

  • STRING_VALUE - character value

statistics DataFrame

  • STAT_NAME - Statistics name

  • STAT_VALUE - Statistics value

confusion.matrix DataFrame

  • ACTUAL_CLASS - The actual class name

  • PREDICTED_CLASS - The predicted class name

  • COUNT - Number of records

metrics DataFrame

  • NAME - Metric name

  • X - X value

  • Y - Y value

score() Function
Parameters:

  • data DataFrame
    Input data for calculating score metrics.

  • key character
    Specifies name of the ID column for input data.

  • features list/vector of characters, optional
    Specifies names of the feature columns, i.e. independent columns.
    Defaults to all non-key, non-label columns if not provided.

  • label character, optional
    Specifies name of dependent column in the input data.
    Defaults to the last non-key column if not provided.

  • max.result.num integer, optional
    Specifies the output number of prediction results.

  • ntiles integer, optional
    Specifies the population tiles in metrics output.
    The value should be no less than 1 and no larger than the row size of the input data.
    Defaults to 1.

  • thread.ratio numeric, optional
    Specifies the ratio of total number of threads that can be used by the score function.
    Defaults to 1.0.

  • func character, optional
    The functionality for unified classification model.
    Mandatory only when the func attribute of model is NULL.
    Valid values are as follows:
    "DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".

  • multi.class logical, optional
    If the functionality of the unified classification model is LogisticRegression,
    then this parameter indicates whether the classification model is
    binary-class or multi-class.
    Valid only when func is set to be "LogisticRegression".

  • alpha double, optional
    Specifies the value of Laplace smoothing.
    A positive value will enable Laplace smoothing for categorical variables with that value being the smoothing parameter.
    Set the value to 0 to disable Laplace smoothing.
    Defaults to the alpha value in the JSON model if there is one, and 0 otherwise.

  • block.size integer, optional
    Specifies the number of data records loaded at a time during scoring.

    • 0: loads all data at once

    • Other positive values: loads the specified number at a time

    Valid only when the trained classification model is for Random Decision Trees.
    Defaults to 0.

  • missing.replacement character, optional
    Specifies the strategy for replacement of missing values in prediction data.

    • "feature.marginalized": marginalizes each missing feature out independently

    • "instance.marginalized": marginalizes all missing features in an instance as a whole corresponding to each category

    Valid only when the trained classification model is for Random Decision Trees or Hybrid Gradient Boosting Tree (HGBT).
    Defaults to "feature.marginalized".

  • class.map0 character, optional
    Specifies the label value which will be mapped to 0 in logistic regression.
    Mandatory and valid only for binary logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
    Defaults to the value of class.map0 in the model training phase.

  • class.map1 character, optional
    Specifies the label value which will be mapped to 1 in logistic regression.
    Mandatory and valid only for binary logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
    Defaults to the value of class.map1 in the model training phase.

  • categorical.variable character or list of characters, optional
    Indicates features that should be treated as categorical variable.
    The default behavior is dependent on what input is given:

    • "VARCHAR" and "NVARCHAR": categorical.

    • "INTEGER" and "DOUBLE": continuous.

    Valid only for variables of type "INTEGER"; ignored otherwise.
    Defaults to the value of categorical.variable in the model training phase.

  • attribution.method character, optional
    Specifies which method to use in model reasoning:

    • "no": no reasoning

    • "saabas": SAABAS reasoning

    • "shap": SHAP reasoning

    Valid only for tree-based classification models.
    Defaults to "shap".

  • top.k.attributions integer, optional
    Output the attributions of top k features which contribute the most.
    Defaults to 10.

  • sample.size integer, optional
    Specifies the number of sampled combinations of features.
    If set to 0, the value is determined by algorithm heuristically.
    Valid only when the trained classification model is for Naive Bayes, Support Vector Machine (SVM), Multilayer Perceptron (MLP) or Multi-class Logistic Regression.
    Defaults to 0.

  • random.state integer, optional
    Specifies the seed for random number generator.

    • 0: Uses the current time (in seconds) as seed;

    • Others: Uses the specified value as seed.

    Valid only when the trained classification model is for Naive Bayes, Support Vector Machine (SVM), Multilayer Perceptron (MLP) or Multi-class Logistic Regression.
    Defaults to 0.

Examples

> df.fit.dt
    OUTLOOK TEMP HUMIDITY WINDY       CLASS PURPOSE
1     Sunny   75       70   Yes        Play       1
2     Sunny   80       90   Yes Do not Play       1
3     Sunny   85       91    No Do not Play       1
4     Sunny   72       95    No Do not Play       2
5     Sunny   73       70    No        Play       1
6  Overcast   72       90   Yes        Play       1
7  Overcast   83       78    No        Play       1
8  Overcast   64       65   Yes        Play       1
9  Overcast   81       75    No        Play       2
10     Rain   71       80   Yes Do not Play       1
11     Rain   65       70   Yes Do not Play       1
12     Rain   75       80    No        Play       1
13     Rain   68       80    No        Play       1
14     Rain   70       96    No        Play       2

> uc.dt <- hanaml.UnifiedClassification(func = "DecisionTree",
                                        data = df.fit.dt,
                                        partition.method = "user.defined",
                                        purpose = "PURPOSE",
                                        algorithm = "c45",
                                        model.format = "json",
                                        min.records.of.parent = 2,
                                        min.records.of.leaf = 1,
                                        priors = list("Play" = 0.5,
                                                      "Do not Play" = 0.5),
                                        thread.ratio = 0.4,
                                        resampling.method = "cv",
                                        evaluation.metric = "auc",
                                        fold.num = 5,
                                        progress.indicator.id = "CV",
                                        param.search.strategy = "grid",
                                        parameter.values = list(split.threshold = c(1e-3, 1e-4, 1e-5)))

> uc.dt$statistics
   STAT_NAME         STAT_VALUE  CLASS_NAME
1        AUC 0.6666666666666666        <NA>
2     RECALL                  0 Do not Play
3  PRECISION                  0 Do not Play
4   F1_SCORE                  0 Do not Play
5    SUPPORT                  1 Do not Play
6     RECALL                  1        Play
7  PRECISION 0.6666666666666666        Play
8   F1_SCORE                0.8        Play
9    SUPPORT                  2        Play
10  ACCURACY 0.6666666666666666        <NA>
11     KAPPA                  0        <NA>
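
The per-class and overall statistics above can be checked by hand. The base R sketch below recomputes them under two assumptions that are inferred from the output rather than stated in it: the three rows with PURPOSE = 2 (rows 4, 9 and 14) form the evaluated set, and the model predicted "Play" for all three. Reporting a 0/0 ratio as 0 is likewise an assumed convention. AUC is not recomputed because it needs predicted probabilities, which are not shown.

```r
# Illustration only: recompute recall, precision, F1, accuracy and kappa.
# Assumption: rows 4, 9, 14 (PURPOSE = 2) are evaluated, all predicted "Play".
classes   <- c("Do not Play", "Play")
actual    <- factor(c("Do not Play", "Play", "Play"), levels = classes)
predicted <- factor(c("Play", "Play", "Play"), levels = classes)

cm <- table(actual, predicted)  # confusion matrix: rows actual, columns predicted

# Assumed convention: a ratio with a zero denominator is reported as 0.
ratio <- function(num, den) ifelse(den == 0, 0, num / den)

recall    <- ratio(diag(cm), rowSums(cm))
precision <- ratio(diag(cm), colSums(cm))
f1        <- ratio(2 * precision * recall, precision + recall)
accuracy  <- sum(diag(cm)) / sum(cm)

# Cohen's kappa: observed agreement corrected for chance agreement.
expected    <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa.value <- (accuracy - expected) / (1 - expected)

print(round(rbind(recall, precision, f1), 4))
print(c(accuracy = accuracy, kappa = kappa.value))
```

Kappa is 0 because a model that always predicts "Play" agrees with the actual labels exactly as often as chance would, given these class frequencies.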

See also