hanaml.UnifiedClassification.Rd
hanaml.UnifiedClassification is an R wrapper for SAP HANA PAL Unified Classification.
hanaml.UnifiedClassification(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  stratified.column = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  training.size = NULL,
  ntiles = NULL,
  output.partition.result = NULL,
  background.size = NULL,
  background.random.state = NULL,
  ...
)
| Argument | Description |
|---|---|
| data | |
| func | |
| key | |
| features | |
| label | |
| purpose | |
| formula | |
| partition.method | |
| stratified.column | |
| partition.random.state | |
| training.percent | |
| training.size | |
| ntiles | |
| output.partition.result | |
| background.size | Defaults to 0. |
| background.random.state | Defaults to 0. |
| ... | |
Returns a "UnifiedClassification" object with the following attributes and methods:
model DataFrame
ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content
importance DataFrame
VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance
optimal.param DataFrame
PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value
statistics DataFrame
STAT_NAME - Statistics name
STAT_VALUE - Statistics value
CLASS_NAME - Class the statistic applies to (NA for overall statistics)
confusion.matrix DataFrame
ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records
metrics DataFrame
NAME - Metric name
X - X value
Y - Y value
score() Function
Parameters:
data DataFrame
Input data for calculating score metrics.
key character
Specifies name of the ID column for input data.
features list/vector of characters, optional
Specifies the names of the feature columns, i.e. the independent columns.
Defaults to all non-key, non-label columns if not provided.
label character, optional
Specifies name of dependent column in the input data.
Defaults to the last non-key column if not provided.
max.result.num integer, optional
Specifies the maximum number of prediction results to output.
ntiles integer, optional
Specifies the population tiles in metrics output.
The value should be no less than 1 and no larger than the number of rows in the input data.
Defaults to 1.
thread.ratio numeric, optional
Specifies the ratio of the total number of threads that can be used by the score function.
Defaults to 1.0.
func character, optional
Specifies the functionality of the unified classification model.
Mandatory only when the func attribute of the model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".
multi.class logical, optional
If the functionality of the unified classification model is LogisticRegression,
this parameter indicates whether the classification model is
binary-class or multi-class.
Valid only when func is set to "LogisticRegression".
alpha double, optional
Specifies the value of Laplace smoothing.
A positive value will enable Laplace smoothing for categorical variables
with that value being the smoothing parameter.
Set the value to 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model if there is one, and 0 otherwise.
block.size integer, optional
Specifies the number of data records loaded at a time during scoring.
0: load all data at once
Other positive values: the specified number
missing.replacement character, optional
Specifies the strategy for replacement of missing values in prediction data.
"feature.marginalized": marginalizes each missing feature out independently
"instance.marginalized": marginalizes all missing features in an instance as a whole corresponding to each category
class.map0 character, optional
Specifies the label value which will be mapped to 0 in logistic regression.
Mandatory and valid only for binary logistic regression models
when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map0 in the model training phase.
class.map1 character, optional
Specifies the label value which will be mapped to 1 in logistic regression.
Mandatory and valid only for binary logistic regression models
when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map1 in the model training phase.
categorical.variable character or list of characters, optional
Indicates features that should be treated as categorical variables.
The default behavior is dependent on what input is given:
"VARCHAR" and "NVARCHAR": categorical.
"INTEGER" and "DOUBLE": continuous.
Defaults to the value of categorical.variable in the model training phase.
attribution.method character, optional
Specifies which method to use in model reasoning:
"no": no reasoning
"saabas": SAABAS reasoning
"shap": SHAP reasoning
top.k.attributions integer, optional
Outputs the attributions of the top k features that contribute the most.
Defaults to 10.
sample.size integer, optional
Specifies the number of sampled combinations of features.
If set to 0, the value is determined by algorithm heuristically.
Valid only when the trained classification model is for Naive Bayes,
Support Vector Machine (SVM), Multilayer Perceptron or Multi-class Logistic Regression.
Defaults to 0.
random.state integer, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
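To make the alpha parameter above more concrete, here is a minimal sketch of textbook additive (Laplace) smoothing for one categorical feature, as commonly used in Naive Bayes. The helper name laplace.smooth and the counts are illustrative assumptions; the formula is the general technique and is not confirmed to match PAL's internal computation.

```r
# Textbook additive (Laplace) smoothing for category counts within one class.
# Assumption: this is the general technique, not PAL's exact implementation.
# laplace.smooth is a hypothetical helper, not part of hana.ml.r.
laplace.smooth <- function(counts, alpha = 0) {
  # Add alpha to every category count, then renormalise.
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

# Illustrative counts of a categorical feature within a single class.
counts <- c(Sunny = 3, Overcast = 0, Rain = 2)

p.raw <- laplace.smooth(counts, alpha = 0)     # smoothing disabled
p.smooth <- laplace.smooth(counts, alpha = 1)  # smoothing enabled

print(p.raw)
print(p.smooth)
```

With alpha = 0 the unseen category keeps a probability of zero, which would force the whole class likelihood to zero; any positive alpha gives it a small non-zero share.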
> df.fit.dt
    OUTLOOK TEMP HUMIDITY WINDY       CLASS PURPOSE
1     Sunny   75       70   Yes        Play       1
2     Sunny   80       90   Yes Do not Play       1
3     Sunny   85       91    No Do not Play       1
4     Sunny   72       95    No Do not Play       2
5     Sunny   73       70    No        Play       1
6  Overcast   72       90   Yes        Play       1
7  Overcast   83       78    No        Play       1
8  Overcast   64       65   Yes        Play       1
9  Overcast   81       75    No        Play       2
10     Rain   71       80   Yes Do not Play       1
11     Rain   65       70   Yes Do not Play       1
12     Rain   75       80    No        Play       1
13     Rain   68       80    No        Play       1
14     Rain   70       96    No        Play       2
> uc.dt <- hanaml.UnifiedClassification(func = "DecisionTree",
                                        data = df.fit.dt,
                                        partition.method = "user.defined",
                                        purpose = "PURPOSE",
                                        algorithm = "c45",
                                        model.format = "json",
                                        min.records.of.parent = 2,
                                        min.records.of.leaf = 1,
                                        priors = list("Play" = 0.5,
                                                      "Do not Play" = 0.5),
                                        thread.ratio = 0.4,
                                        resampling.method = "cv",
                                        evaluation.metric = "auc",
                                        fold.num = 5,
                                        progress.indicator.id = "CV",
                                        param.search.strategy = "grid",
                                        parameter.values = list(split.threshold = c(1e-3, 1e-4, 1e-5)))
> uc.dt$statistics
   STAT_NAME         STAT_VALUE  CLASS_NAME
1        AUC 0.6666666666666666        <NA>
2     RECALL                  0 Do not Play
3  PRECISION                  0 Do not Play
4   F1_SCORE                  0 Do not Play
5    SUPPORT                  1 Do not Play
6     RECALL                  1        Play
7  PRECISION 0.6666666666666666        Play
8   F1_SCORE                0.8        Play
9    SUPPORT                  2        Play
10  ACCURACY 0.6666666666666666        <NA>
11     KAPPA                  0        <NA>
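The per-class statistics above can be reproduced by hand from a confusion matrix. The following base R sketch assumes a confusion matrix inferred from the SUPPORT and RECALL rows (one "Do not Play" record and two "Play" records, all predicted as "Play"), which matches the three rows with PURPOSE = 2 in df.fit.dt. The counts are reconstructed from the output rather than read from uc.dt$confusion.matrix, and class.stats is a hypothetical helper, not part of hana.ml.r.

```r
# Confusion matrix inferred from the statistics output (an assumption):
# rows = actual class, columns = predicted class.
classes <- c("Do not Play", "Play")
cm <- matrix(c(0, 1,
               0, 2),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Hypothetical helper: per-class recall, precision, F1 score and support.
# Ratios with a zero denominator are reported as 0.
class.stats <- function(cm) {
  tp <- diag(cm)               # correctly classified records per class
  support <- rowSums(cm)       # actual records per class
  predicted <- colSums(cm)     # predicted records per class
  recall <- ifelse(support > 0, tp / support, 0)
  precision <- ifelse(predicted > 0, tp / predicted, 0)
  f1 <- ifelse(precision + recall > 0,
               2 * precision * recall / (precision + recall), 0)
  data.frame(CLASS_NAME = rownames(cm), RECALL = recall,
             PRECISION = precision, F1_SCORE = f1, SUPPORT = support,
             row.names = NULL)
}

stats <- class.stats(cm)
n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the class marginals give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen.kappa <- (accuracy - expected) / (1 - expected)

print(stats)
print(c(ACCURACY = accuracy, KAPPA = cohen.kappa))
```

Because every validation record is predicted as "Play", the observed accuracy of 2/3 equals the chance agreement, which is why KAPPA is 0 even though ACCURACY is not.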