hanaml.UnifiedClassification.Rd
hanaml.UnifiedClassification is an R wrapper for SAP HANA PAL Unified Classification.
hanaml.UnifiedClassification(
data = NULL,
func = NULL,
key = NULL,
features = NULL,
label = NULL,
purpose = NULL,
formula = NULL,
partition.method = NULL,
stratified.column = NULL,
partition.random.state = NULL,
training.percent = NULL,
training.size = NULL,
ntiles = NULL,
categorical.variable = NULL,
output.partition.result = NULL,
background.size = NULL,
background.random.state = NULL,
impute = FALSE,
strategy = NULL,
strategy.by.col = NULL,
als.factors = NULL,
als.lambda = NULL,
als.maxit = NULL,
als.randomstate = NULL,
als.exit.threshold = NULL,
als.exit.interval = NULL,
als.linsolver = NULL,
als.cg.maxit = NULL,
als.centering = NULL,
als.scaling = NULL,
c = NULL,
massive = FALSE,
group.key = NULL,
group.params = NULL,
...
)
DataFrame
DataFrame containing the data.
character
The functionality for unified classification.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT",
"LogisticRegression", "NaiveBayes", "SVM", "MLP".
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last non-purpose column.
character, optional
Name of the column which specifies user-defined data partition.
Mandatory if partition.method is "user.defined".
formula type, optional
Formula to be used for model generation.
format = label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
You can either give the formula,
or a feature and label combination, but do not provide both.
Defaults to NULL.
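To illustrate how a formula corresponds to the label and features arguments, the following base R sketch splits a formula into its two sides. It does not call hanaml, and the variable names f, label.name and feature.names are illustrative only.

```r
# Base R only: split a two-sided formula into label and feature names.
f <- CATEGORY ~ V1 + V2 + V3

label.name <- all.vars(f[[2]])     # left-hand side of the formula
feature.names <- all.vars(f[[3]])  # right-hand side of the formula

print(label.name)
print(feature.names)
```

Passing formula=CATEGORY~V1+V2+V3 therefore carries the same information as label="CATEGORY" together with features=c("V1", "V2", "V3"), which is why only one of the two forms should be given.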
character, optional
Specifies the method for partitioning the training data.
Valid options include: "no", "user.defined", "stratified".
Defaults to "no" if not specified (i.e. no data partition).
character, optional
Specifies the name of the column used for stratified partition.
Mandatory when partition.method is set to "stratified".
character, optional
Specifies the random seed for stratified partition.
Defaults to 0 (system time).
numeric, optional
Specifies the percentage of data used for training.
Defaults to 0.8.
integer, optional
Specifies the number of samples in data used for training.
If training.percent is set, then this parameter has no effect.
integer, optional
Used to control the population tiles in metrics output.
The value should be at least 1 and no larger than the row size of the validation data.
For AUC, this parameter means the maximum tiles.
The value should be at least 1 and no larger than the row size of the input data.
If the row size of data for metrics evaluation is less than 20,
the default value is 1; otherwise it is 20.
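The default and the valid range described above can be written down as a small base R sketch. Both helper names are hypothetical and are not part of hanaml; they only restate the documented rule.

```r
# Hypothetical helper mirroring the documented default for ntiles:
# 1 when fewer than 20 rows are evaluated, otherwise 20.
default.ntiles <- function(n.rows) {
  if (n.rows < 20) 1L else 20L
}

# Hypothetical check of the documented valid range [1, n.rows].
is.valid.ntiles <- function(ntiles, n.rows) {
  ntiles >= 1 && ntiles <= n.rows
}

print(default.ntiles(3))
print(default.ntiles(500))
print(is.valid.ntiles(4, 3))
```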
logical, optional
Controls whether or not to output the partition result of data.
Defaults to FALSE.
integer, optional
Specifies the row size of background data.
It should not be larger than the row size of data.
Valid only for the following cases:
func is "NaiveBayes", "SVM", or "MLP";
func is "LogisticRegression" and multi.class is TRUE.
Defaults to 0.
integer, optional
Specifies the seed for random number generator in the background sampling.
0: Uses the current time (in seconds) as seed.
Others: Uses the specified value as seed.
Defaults to 0.
logical, optional
Specifies whether or not to handle missing values in the data for scoring.
Defaults to FALSE.
character, optional
Specifies the overall imputation strategy for the input scoring data.
"non": No imputation for all columns.
"most_frequent.mean": Replaces missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
"most_frequent.median": Replaces missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
"most_frequent.zero": Replaces missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
"most_frequent.als": Replaces missing values in any categorical column by its most frequently observed value, and fills the missing values in all numerical columns via a matrix completion technique called alternating least squares.
"delete": Deletes all rows with missing values.
Valid only when impute is TRUE.
Defaults to "most_frequent.mean".
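As a local illustration of what the default "most_frequent.mean" strategy means, the base R sketch below applies the same rule to an in-memory data.frame. It is not how PAL performs imputation inside SAP HANA; the function name impute.most.frequent.mean and the toy data are assumptions made for this example.

```r
# Local illustration only: mimics "most_frequent.mean" on a data.frame.
impute.most.frequent.mean <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    missing <- is.na(x)
    if (!any(missing)) next
    if (is.numeric(x)) {
      # numerical column: fill with the mean of the observed values
      x[missing] <- mean(x, na.rm = TRUE)
    } else {
      # categorical column: fill with the most frequent observed value
      counts <- table(x)
      x[missing] <- names(counts)[which.max(counts)]
    }
    df[[col]] <- x
  }
  df
}

toy <- data.frame(
  OUTLOOK = c("Sunny", "Rain", NA, "Sunny"),
  TEMP = c(70, NA, 80, 90),
  stringsAsFactors = FALSE
)
imputed <- impute.most.frequent.mean(toy)
print(imputed)
```

Here the missing OUTLOOK value becomes "Sunny" (the most frequent category) and the missing TEMP value becomes 80 (the mean of 70, 80 and 90).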
list, optional
Specifies the imputation strategy for a set of columns, which
overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names,
while each value should either be the imputation strategy applied to that column,
or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three
strategies are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"),
which means that for column V1, all missing values shall be replaced by the constant 0;
while for column V5, all missing values shall be replaced by the median of all
available values in that column.
No default value.
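The structure of strategy.by.col can be checked locally before it is passed on. The following base R sketch classifies each element as either a named strategy or a constant replacement; check.strategy.by.col is a hypothetical helper written for this illustration and is not part of hanaml.

```r
# Documented per-column strategies, split by the column type they apply to.
numeric.strategies <- c("mean", "median", "als", "non", "delete")
categorical.strategies <- c("non", "delete", "most_frequent")

# Hypothetical checker: every element must be named by column, and each
# value is either a known strategy name or a constant replacement value.
check.strategy.by.col <- function(strategy.by.col) {
  nms <- names(strategy.by.col)
  if (is.null(nms) || any(nms == "")) {
    stop("all elements of strategy.by.col must be named by column")
  }
  vapply(strategy.by.col, function(v) {
    if (is.character(v) &&
        v %in% union(numeric.strategies, categorical.strategies)) {
      "strategy"
    } else {
      "constant"
    }
  }, character(1))
}

kinds <- check.strategy.by.col(list(V1 = 0, V5 = "median"))
print(kinds)
```

For the example list, V1 is reported as a constant replacement and V5 as a strategy, which matches the interpretation given above.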
integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns,
so that the imputation results would be meaningful.
Defaults to 3.
double, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
integer, optional
Specifies the maximum number of iterations for cg algorithm.
Invoked only when the 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
integer, optional
Specifies the seed of the random number generator used in the training of ALS model.
0 means the current time is used as the seed; any other number is itself used as the seed.
Defaults to 0.
double, optional
Specifies a value for stopping the training of the ALS model.
If the improvement of the cost function of the ALS model
is less than this value between consecutive checks, then
the training process will exit.
0 means there is no checking of the objective value when
running the algorithm, and training stops only when the maximum number of
iterations has been reached.
Defaults to 0.
integer, optional
Specifies the number of iterations between consecutive checks of
the cost function for the ALS model, so that one can see if the
pre-specified als.exit.threshold is reached.
Defaults to 5.
c('cholesky', 'cg'), optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als.factors is large.
Defaults to 'cholesky'.
logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.
logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
double, optional
Trade-off between training error and margin for SVM Classification.
Valid only when func is "SVM".
Must be positive.
Defaults to 100.
logical, optional
Specifies whether or not to use massive mode.
For parameter setting in massive mode, you can use either
group.params (please see the example below) or the original parameters.
Original parameters apply to all groups. However, if you define some parameters for a group in group.params,
none of the original parameter settings apply to that group.
An example is as follows:
> muc <- hanaml.UnifiedClassification(func='randomdecisiontrees',
massive=TRUE,
data=df.fit,
key="ID",
group.key="GROUP_ID",
label='CLASS',
impute=TRUE,
group.params=list("Group_1"=list(background.size=4)))
In this example, as 'background.size=4' is set in group.params for Group_1,
parameter setting of 'impute=TRUE' is not applicable to Group_1.
Defaults to FALSE.
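The override rule for massive mode can be sketched in base R as follows. This is only a reading of the rule stated above, restricted to algorithm settings such as impute; effective.params is a hypothetical helper, and "Group_2" is an assumed second group that has no entry in group.params.

```r
# Hypothetical helper: pick the parameter set that applies to one group,
# following the rule described above. This is not part of hanaml.
effective.params <- function(group, original.params, group.params) {
  if (!is.null(group.params[[group]])) {
    group.params[[group]]   # group-specific settings replace the originals
  } else {
    original.params         # otherwise the original settings apply
  }
}

original.params <- list(impute = TRUE)
group.params <- list("Group_1" = list(background.size = 4))

g1 <- effective.params("Group_1", original.params, group.params)
g2 <- effective.params("Group_2", original.params, group.params)
str(g1)
str(g2)
```

Under this reading, Group_1 ends up with background.size only, while any group without its own entry keeps impute=TRUE.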
character, optional
The column of group key. The data type can be INT or NVARCHAR/VARCHAR.
If data type is INT, only parameters set in the group.params are valid.
This parameter is only valid when massive is TRUE.
Defaults to the first column of data if group.key is not provided.
list, optional
If the massive mode is activated (massive=TRUE),
input data shall be divided into different groups with different parameters applied.
An example is as follows:
> muc <- hanaml.UnifiedClassification(func='randomdecisiontrees',
massive=TRUE,
group.params= list("Group_1"=list(background.size=4)),
data=df.fit,
key="ID",
group.key="GROUP_ID",
label='CLASS')
> res <- predict(muc,
data=df.predict,
group.key="GROUP_ID",
key="ID",
group.params= list("Group_1"=list(impute=TRUE)))
Valid only when massive is TRUE and defaults to NULL.
Specifies other parameters for training a classification model with the functionality
specified in func.
Please see the documentation of the corresponding functions for more detail:
hanaml.DecisionTreeClassifier, hanaml.RDTClassifier,
hanaml.MLPClassifier, hanaml.HGBTClassifier,
hanaml.NaiveBayes, hanaml.LogisticRegression,
hanaml.SVC
However, some parameters are disabled. The disabled parameters are listed as follows:
DecisionTree: output.rules, output.confusion.matrix
RDT: calculate.oob
MLP: functionality
HGBT: calculate.importance, calculate.cm
LogisticRegression: pmml.export
Note that for Multi-class Logistic Regression, the meaning of parameter json.export has changed: FALSE means to export the multiple linear regression model in PMML, while TRUE still means to export the model in JSON.
Returns an R6 object of class "UnifiedClassification" with the following attributes and methods:
Attributes
model: DataFrame
ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content
importance: DataFrame
VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance
optimal.param: DataFrame
PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value
statistics: DataFrame
STAT_NAME - Statistics name
STAT_VALUE - Statistics value
confusion.matrix: DataFrame
ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records
metrics: DataFrame
NAME - Metric name
X - X value
Y - Y value
error.msg: DataFrame
Error messages. Valid only if massive is TRUE when the instance is created.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> udtc <- hanaml.UnifiedClassification(data=df, func="DecisionTree")
> udtc$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> udtc <- hanaml.UnifiedClassification(data=df, func="DecisionTree")
> udtc$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> udtc$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state shall be cleaned up and associated memory be released.
Input data:
> df.fit.dt$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS PURPOSE
1     Sunny   75       70   Yes        Play       1
2     Sunny   80       90   Yes Do not Play       1
3     Sunny   85       91    No Do not Play       1
4     Sunny   72       95    No Do not Play       2
5     Sunny   73       70    No        Play       1
6  Overcast   72       90   Yes        Play       1
7  Overcast   83       78    No        Play       1
8  Overcast   64       65   Yes        Play       1
9  Overcast   81       75    No        Play       2
10     Rain   71       80   Yes Do not Play       1
11     Rain   65       70   Yes Do not Play       1
12     Rain   75       80    No        Play       1
13     Rain   68       80    No        Play       1
14     Rain   70       96    No        Play       2
> uc.dt <- hanaml.UnifiedClassification(func="DecisionTree",
data=df.fit.dt,
partition.method="user.defined",
purpose="PURPOSE",
algorithm="c45",
model.format="json",
min.records.of.parent=2,
min.records.of.leaf=1,
priors=list("Play"=0.5,
"Do not Play"=0.5),
thread.ratio=0.4,
resampling.method="cv",
evaluation.metric="auc",
fold.num=5,
progress.indicator.id="CLASSIFICATION_TEST",
param.search.strategy="grid",
parameter.values=list(split.threshold=c(1e-3 , 1e-4, 1e-5)))
Output:
> uc.dt$statistics
   STAT_NAME         STAT_VALUE  CLASS_NAME
1        AUC 0.6666666666666666        <NA>
2     RECALL                  0 Do not Play
3  PRECISION                  0 Do not Play
4   F1_SCORE                  0 Do not Play
5    SUPPORT                  1 Do not Play
6     RECALL                  1        Play
7  PRECISION 0.6666666666666666        Play
8   F1_SCORE                0.8        Play
9    SUPPORT                  2        Play
10  ACCURACY 0.6666666666666666        <NA>
11     KAPPA                  0        <NA>
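The per-class figures in this output can be cross-checked by hand. The base R sketch below recomputes recall, precision, F1 score, accuracy and kappa for the three validation rows (PURPOSE = 2). The predicted labels are an assumption inferred from the reported metrics, since the predictions themselves are not part of the output; AUC is not recomputed because it needs the class probabilities.

```r
# Validation rows (PURPOSE = 2) from the input data: rows 4, 9 and 14.
actual <- c("Do not Play", "Play", "Play")
# Assumed predictions, inferred from the reported metrics (not shown above).
predicted <- c("Play", "Play", "Play")

classes <- c("Do not Play", "Play")
cm <- table(factor(actual, levels = classes),
            factor(predicted, levels = classes))

accuracy <- sum(diag(cm)) / sum(cm)
recall <- diag(cm) / rowSums(cm)
precision <- diag(cm) / colSums(cm)   # NaN when a class is never predicted
precision[is.nan(precision)] <- 0
f1 <- ifelse(precision + recall == 0, 0,
             2 * precision * recall / (precision + recall))

# Cohen's kappa: observed agreement versus agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa <- (accuracy - expected) / (1 - expected)

print(rbind(recall, precision, f1))
print(c(accuracy = accuracy, kappa = kappa))
```

Under that assumption the model predicts "Play" for every validation row, which explains why accuracy is 2/3 while kappa is 0: the agreement is no better than what the class frequencies alone would produce.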