Unified Classification

hanaml.UnifiedClassification is an R wrapper for SAP HANA PAL Unified Classification.

hanaml.UnifiedClassification(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  stratified.column = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  training.size = NULL,
  ntiles = NULL,
  output.partition.result = NULL,
  background.size = NULL,
  background.random.state = NULL,
  ...
)

Arguments

data	`DataFrame` DataFrame containing the data.
func	`character` The functionality for unified classification. Valid values are as follows: "DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
features	`character or list of characters, optional` Names of the feature columns. If not provided, defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. If not specified, defaults to the last non-purpose column.
purpose	`character, optional` Name of the column that specifies the user-defined data partition. Mandatory if partition.method is "user.defined".
formula	`formula type, optional` Formula to be used for model generation, in the format label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a features and label combination, but not both. Defaults to NULL.
partition.method	`character, optional` Specifies the method for partitioning the training data. Valid options are "no", "user.defined", and "stratified". Defaults to "no" (i.e. no data partition).
stratified.column	`character, optional` Specifies the name of the column used for stratified partition. Mandatory when partition.method is set to "stratified".
partition.random.state	`integer, optional` Specifies the random seed for stratified partition. Defaults to 0 (system time).
training.percent	`numeric, optional` Specifies the percentage of data used for training. Defaults to 0.8.
training.size	`integer, optional` Specifies the number of samples in data used for training. If training.percent is set, then this parameter has no effect.
ntiles	`integer, optional` Specifies the population tiles in metrics output. The value should be no less than 1 and no larger than the row size of the input data.
output.partition.result	`logical, optional` Controls whether to output the partition result of `data` or not. Defaults to FALSE.
background.size	`integer, optional` Specifies the row size of background data. It should not be larger than the row size of data. Valid only when func is "NaiveBayes", "SVM", or "MLP", or when func is "LogisticRegression" and multi.class is TRUE. Defaults to 0.
background.random.state	`integer, optional` Specifies the seed for the random number generator used in background sampling. 0: uses the current time (in seconds) as seed; other values: uses the specified value as seed. Defaults to 0.
...	Specifies other parameters for training a classification model with the functionality specified in func. Please see the documentation of the corresponding functionalities for more detail: `hanaml.DecisionTreeClassifier, hanaml.RDTClassifier, hanaml.MLPClassifier, hanaml.HGBTClassifier, hanaml.NaiveBayes, hanaml.LogisticRegression, hanaml.SVC`. However, some parameters are disabled, as follows: DecisionTree: output.rules, output.confusion.matrix; RDT: calculate.oob; HGBT: calculate.importance, calculate.cm; LogisticRegression: pmml.export.
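
As a rough illustration of how a formula corresponds to a label plus a set of features, the base R snippet below splits the formula from the `formula` argument description into its two sides. It is only a sketch of the idea and does not claim to show how hanaml processes the formula internally; the variable names `f`, `vars`, `label` and `features` are introduced here for illustration.

```r
# Illustrative only: base R, no SAP HANA connection required.
f <- CATEGORY ~ V1 + V2 + V3

# all.vars() lists the variable names in order of appearance,
# so the first entry is the left-hand side (the label).
vars <- all.vars(f)
label <- vars[1]
features <- vars[-1]

print(label)
print(features)
```

Passing `formula = f` is therefore an alternative to passing the equivalent `label` and `features` values; as noted above, only one of the two forms should be supplied.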

Value

Returns a "UnifiedClassification" object with the following attributes and methods:

model DataFrame

ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content

importance DataFrame

VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance

optimal.param DataFrame

PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value

statistics DataFrame

STAT_NAME - Statistics name
STAT_VALUE - Statistics value
CLASS_NAME - Class the statistic applies to (NA for overall statistics)

confusion.matrix DataFrame

ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records

metrics DataFrame

NAME - Metric name
X - X value
Y - Y value

score() Function
Parameters:

data DataFrame
Input data for calculating score metrics.
key character
Specifies name of the ID column for input data.
features list/vector of characters, optional
Specifies names of the feature columns, i.e. independent columns.
Defaults to all non-key, non-label columns if not provided.
label character, optional
Specifies name of dependent column in the input data.
Defaults to the last non-key column if not provided.
max.result.num integer, optional
Specifies the output number of prediction results.
ntiles integer, optional
Specifies the population tiles in metrics output.
The value should be no less than 1 and no larger than the row size of the input data.
Defaults to 1.
thread.ratio numeric, optional
Specifies the ratio of total number of threads that can be used by the score function.
Defaults to 1.0.
func character, optional
The functionality for unified classification model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".
multi.class logical, optional
If the functionality of the unified classification model is LogisticRegression,
this parameter indicates whether the classification model is
binary-class or multi-class.
Valid only when func is set to be "LogisticRegression".
alpha double, optional
Specifies the value of Laplace smoothing.
A positive value will enable Laplace smoothing for categorical variables with that value being the smoothing parameter.
Set the value to 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model if there is one, and 0 otherwise.
block.size integer, optional
Specifies the number of data records loaded at a time during scoring.
- 0: load all data at once
- Other positive values: the specified number
Valid only when the trained classification model is for Random Decision Trees.
Defaults to 0.
missing.replacement character, optional
Specifies the strategy for replacement of missing values in prediction data.
- "feature.marginalized": marginalizes each missing feature out independently
- "instance.marginalized": marginalizes all missing features in an instance as a whole corresponding to each category
Valid only when the trained classification model is for Random Decision Trees or Hybrid Gradient Boosting Tree (HGBT).
Defaults to 'feature.marginalized'.
class.map0 character, optional
Specifies the label value which will be mapped to 0 in logistic regression.
Mandatory and valid only for binary logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map0 in the model training phase.
class.map1 character, optional
Specifies the label value which will be mapped to 1 in logistic regression.
Mandatory and valid only for binary logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map1 in the model training phase.
categorical.variable character or list of characters, optional
Indicates features that should be treated as categorical variables.
By default, the treatment depends on the column type:
- "VARCHAR" and "NVARCHAR": categorical.
- "INTEGER" and "DOUBLE": continuous.
Valid only for variables of type "INTEGER"; ignored otherwise.
Defaults to the value of categorical.variable in the model training phase.
attribution.method character, optional
Specifies which method to use in model reasoning:
- "no": no reasoning
- "saabas": SAABAS reasoning
- "shap": SHAP reasoning
Valid only for tree-based classification models.
Defaults to "shap".
top.k.attributions integer, optional
Output the attributions of top k features which contribute the most.
Defaults to 10.
sample.size integer, optional
Specifies the number of sampled combinations of features.
If set to 0, the value is determined by algorithm heuristically.
Valid only when the trained classification model is for Naive Bayes, Support Vector Machine (SVM), Multilayer Perceptron (MLP) or Multi-class Logistic Regression.
Defaults to 0.
random.state integer, optional
Specifies the seed for random number generator.
- 0: Uses the current time (in seconds) as seed;
- Others: Uses the specified value as seed.
Valid only when the trained classification model is for Naive Bayes, Support Vector Machine (SVM), Multilayer Perceptron (MLP) or Multi-class Logistic Regression.
Defaults to 0.

Examples

> df.fit.dt
    OUTLOOK TEMP HUMIDITY WINDY       CLASS PURPOSE
1     Sunny   75       70   Yes        Play       1
2     Sunny   80       90   Yes Do not Play       1
3     Sunny   85       91    No Do not Play       1
4     Sunny   72       95    No Do not Play       2
5     Sunny   73       70    No        Play       1
6  Overcast   72       90   Yes        Play       1
7  Overcast   83       78    No        Play       1
8  Overcast   64       65   Yes        Play       1
9  Overcast   81       75    No        Play       2
10     Rain   71       80   Yes Do not Play       1
11     Rain   65       70   Yes Do not Play       1
12     Rain   75       80    No        Play       1
13     Rain   68       80    No        Play       1
14     Rain   70       96    No        Play       2

> uc.dt <- hanaml.UnifiedClassification(func = "DecisionTree",
                                        data = df.fit.dt,
                                        partition.method = "user.defined",
                                        purpose = "PURPOSE",
                                        algorithm = "c45",
                                        model.format = "json",
                                        min.records.of.parent = 2,
                                        min.records.of.leaf = 1,
                                        priors = list("Play" = 0.5,
                                                      "Do not Play" = 0.5),
                                        thread.ratio = 0.4,
                                        resampling.method = "cv",
                                        evaluation.metric = "auc",
                                        fold.num = 5,
                                        progress.indicator.id = "CV",
                                        param.search.strategy = "grid",
                                        parameter.values = list(split.threshold = c(1e-3 , 1e-4, 1e-5)))

> uc.dt$statistics
   STAT_NAME         STAT_VALUE  CLASS_NAME
1        AUC 0.6666666666666666        <NA>
2     RECALL                  0 Do not Play
3  PRECISION                  0 Do not Play
4   F1_SCORE                  0 Do not Play
5    SUPPORT                  1 Do not Play
6     RECALL                  1        Play
7  PRECISION 0.6666666666666666        Play
8   F1_SCORE                0.8        Play
9    SUPPORT                  2        Play
10  ACCURACY 0.6666666666666666        <NA>
11     KAPPA                  0        <NA>
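
The per-class figures above can be cross-checked by hand. The base R sketch below assumes that the three rows with PURPOSE = 2 (rows 4, 9 and 14) form the held-out set, which matches the SUPPORT counts, and that the model predicted "Play" for all three. The predictions are an assumption chosen to be consistent with the reported values, not output taken from the model. AUC is left out because it needs predicted probabilities, which are not shown here.

```r
# Held-out rows of df.fit.dt (PURPOSE == 2): rows 4, 9 and 14.
actual <- c("Do not Play", "Play", "Play")
# Assumption: the model predicted "Play" for every held-out row.
predicted <- c("Play", "Play", "Play")

classes <- c("Do not Play", "Play")
# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(factor(actual, levels = classes),
            factor(predicted, levels = classes))

tp <- diag(cm)
recall <- tp / rowSums(cm)
# A class that is never predicted gets precision 0 instead of NaN.
precision <- ifelse(colSums(cm) > 0, tp / colSums(cm), 0)
f1 <- ifelse(precision + recall > 0,
             2 * precision * recall / (precision + recall), 0)

n <- sum(cm)
accuracy <- sum(tp) / n
# Chance agreement for Cohen's kappa, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

print(rbind(recall, precision, f1))
print(c(accuracy = accuracy, kappa = kappa))
```

Under these assumptions the recall, precision, F1, accuracy and kappa values agree with the uc.dt$statistics output, including a kappa of 0: always predicting the majority class gives no agreement beyond chance.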
