Similar to other predict methods, this function makes predictions from a fitted "UnifiedClassification" object.

# S3 method for UnifiedClassification
predict(
  model,
  data,
  key,
  features = NULL,
  thread.ratio = NULL,
  verbose = NULL,
  func = NULL,
  multi.class = NULL,
  alpha = NULL,
  block.size = NULL,
  missing.replacement = NULL,
  categorical.variable = NULL,
  class.map0 = NULL,
  class.map1 = NULL,
  attribution.method = NULL,
  top.k.attributions = NULL,
  sample.size = NULL,
  random.state = NULL,
  impute = FALSE,
  strategy = NULL,
  strategy.by.col = NULL,
  als.factors = NULL,
  als.lambda = NULL,
  als.maxit = NULL,
  als.randomstate = NULL,
  als.exit.threshold = NULL,
  als.exit.interval = NULL,
  als.linsolver = NULL,
  als.cg.maxit = NULL,
  als.centering = NULL,
  als.scaling = NULL,
  group.key = NULL,
  group.params = NULL
)

Format

S3 methods

Arguments

model

R6Class
An "UnifiedClassification" object for prediction.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Name of feature columns for prediction.
If not provided, it defaults to all non-key columns of data.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

verbose

logical, optional
If TRUE, output all classes and the corresponding confidences for each data point.
Defaults to FALSE.

func

character, optional
The functionality of the unified classification model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LogisticRegression", "NaiveBayes", "SVM", "MLP".
Defaults to model$func.

multi.class

logical, optional
If the functionality of the unified classification model is LogisticRegression,
then this parameter indicates whether the classification model is a binary-class or multi-class model.
Valid only when func is set to be "LogisticRegression".

alpha

numeric, optional
Specifies the value of Laplace smoothing.
A positive value will enable Laplace smoothing for categorical variables with that value being the smoothing parameter.
Set the value to 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model if there is one, and 0 otherwise.

block.size

integer, optional
Specifies the number of rows of data loaded per batch during scoring.

  • 0: load all data once.

  • Other positive values: load the specified number of rows per batch.

Valid only when func is "RandomDecisionTrees" (case insensitive).
Defaults to 0.

missing.replacement

character, optional
Specifies the strategy for replacement of missing values in prediction data.

  • 'feature.marginalized': marginalizes each missing feature out independently.

  • 'instance.marginalized': marginalizes all missing features in an instance as a whole corresponding to each category.

Valid only when func is "RandomDecisionTrees" or "HGBT".
Defaults to 'feature.marginalized'.

categorical.variable

character or list of characters, optional
Indicates features that should be treated as categorical variables.
By default, the treatment of a column depends on its data type:

  • "VARCHAR" and "NVARCHAR": categorical.

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of type "INTEGER"; ignored otherwise.
Defaults to the value of categorical.variable in the model training phase.

class.map0

character, optional
Specifies the label value which will be mapped to 0 in logistic regression.
Mandatory and valid only for logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map0 in the model training phase.

class.map1

character, optional
Specifies the label value which will be mapped to 1 in logistic regression.
Mandatory and valid only for logistic regression models when the label variable is of type VARCHAR or NVARCHAR.
Defaults to the value of class.map1 in the model training phase.

attribution.method

character, optional
Specifies which method to use in model reasoning:

  • "no": no reasoning.

  • "saabas": SAABAS reasoning.

  • "tree-shap": Tree-SHAP reasoning.

Valid only for tree-based models.
Defaults to "tree-shap".

top.k.attributions

integer, optional
Output the attributions of top k features which contribute the most.
Defaults to 10.

sample.size

integer, optional
Specifies the number of sampled combinations of features.
If set to 0, the value is heuristically determined by the algorithm.
Valid only when the trained classification model is for Naive Bayes, Support Vector Machine(SVM), Multilayer Perceptron or Multi-class Logistic Regression.
Defaults to 0.

random.state

integer, optional
Specifies the seed for random number generator.

  • 0: Uses the current time (in seconds) as the seed;

  • Others: Uses the specified value as seed.

Valid only when the trained classification model is for Naive Bayes, Support Vector Machine(SVM), Multilayer Perceptron(MLP) or Multi-class Logistic Regression.
Defaults to 0.

impute

logical, optional
Specifies whether or not to handle missing values in the data for scoring.
Defaults to FALSE.

strategy

character, optional
Specifies the overall imputation strategy for the input scoring data.

  • "non" : No imputation for all columns.

  • "most_frequent.mean" : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.

  • "most_frequent.median" : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.

  • "most_frequent.zero" : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.

  • "most_frequent.als" : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.

  • "delete" : Delete all rows with missing values.

Valid only when impute is TRUE.
Defaults to 'most_frequent.mean'.

strategy.by.col

list, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names, while each value should either be the imputation strategy applied to that column, or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies ("mean", "median", "als", "non", "delete") are applicable to numerical columns, while the last three ("non", "delete", "most_frequent") are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"), which means that for column V1, all missing values shall be replaced by the constant 0, while for column V5, all missing values shall be replaced by the median of all available values in that column.
No default value.
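
The imputation parameters above can be combined in a single scoring call. A minimal sketch, assuming a fitted model uc and scoring data df.predict whose columns V1 and V5 are numerical and V3 is categorical (all names are illustrative):


> res <- predict(uc,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.mean",       # overall strategy
                 strategy.by.col = list(V1 = 0,         # constant replacement for V1
                                        V5 = "median",  # column-wise strategy for V5
                                        V3 = "most_frequent"))

The column-wise settings in strategy.by.col override the overall strategy only for the named columns; all other columns fall back to strategy.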

als.factors

integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.

als.lambda

double, optional
L2 regularization applied to the factors in the ALS model. Should be non-negative.
Defaults to 0.01.

als.maxit

integer, optional
Specifies the maximum number of iterations for training the ALS model.
Defaults to 20.

als.cg.maxit

integer, optional
Specifies the maximum number of iterations for the cg algorithm. Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.

als.randomstate

integer, optional
Specifies the seed of the random number generator used in the training of ALS model.
0 means the current time is used as the seed; any other number is used as the seed directly.
Defaults to 0.

als.exit.threshold

double, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, then the training process will exit.
0 means the objective value is not checked while the algorithm runs, and training only stops once the maximum number of iterations has been reached.
Defaults to 0.

als.exit.interval

integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als.exit.threshold has been reached.
Defaults to 5.

als.linsolver

c('cholesky', 'cg'), optional
Linear system solver for the ALS model.

  • 'cholesky' is usually much faster

  • 'cg' is recommended when als.factors is large.

Defaults to 'cholesky'.

als.centering

logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.

als.scaling

logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
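
The als.* parameters only take effect when ALS-based imputation is selected via strategy (or a column-wise "als" entry in strategy.by.col). A minimal sketch, assuming a fitted model uc and scoring data df.predict (parameter values are illustrative only, not recommendations):


> res <- predict(uc,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.als",  # ALS imputation for numerical columns
                 als.factors = 2,                 # must stay below the number of numerical columns
                 als.lambda = 0.01,
                 als.linsolver = "cg",            # 'cg' invokes als.cg.maxit below
                 als.cg.maxit = 5)

Note that als.cg.maxit is consulted only because als.linsolver is set to 'cg'.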

group.key

character, optional
Name of the group key column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only the parameters set in group.params are valid. This parameter is valid only when massive is TRUE.
Defaults to the first column of data if group.key is not provided.

group.params

list, optional
If the massive mode is activated (massive = TRUE), input data shall be divided into different groups with different parameters applied.
An example is as follows:


 > muc <- hanaml.UnifiedClassification(func='randomdecisiontrees',
                                       massive = TRUE,
                                       group.params= list("Group_1"=list(background.size = 4)),
                                       data = df.fit,
                                       key="ID",
                                       group.key="GROUP_ID",
                                       label = 'CLASS')
 > res <- predict(muc,
                  data=df.predict,
                  group.key="GROUP_ID",
                  key="ID",
                  group.params= list("Group_1"=list(impute=TRUE)))

Defaults to NULL.

Value

Predicted values are returned as a DataFrame, structured as follows.

  • ID column name

  • SCORE

  • CONFIDENCE

  • REASON CODE

Interpretation of Prediction Result

In the process of predictive modeling, sometimes we want to know why certain predictive results are made. To achieve this objective, Shapley Additive Explanations (SHAP) is proposed, which is based on Shapley values in game theory.
To balance both accuracy and efficiency, the implementations of SHAP can vary among different machine learning algorithms.
The following table gives an overview of the supported SHAP versions for different classification functions supported in UnifiedClassification:

Function                         SHAP Method
Decision Tree                    treeSHAP, Saabas
Random Decision Trees            treeSHAP, Saabas
HGBT                             treeSHAP, Saabas
Naive Bayes                      kernelSHAP
Support Vector Machine           kernelSHAP
Multi-layer Perceptron           kernelSHAP
Multi-class Logistic Regression  kernelSHAP


Relevant Parameters

  • background.size: This parameter specifies the row size of the background data (which is sampled from the training data) used for implementing kernelSHAP in hanaml.UnifiedClassification. Therefore, (1) the value of this parameter should not exceed the row size of the training data, and (2) it is valid only for the following set of classification functions: Naive Bayes, SVM, MLP, Multi-class Logistic Regression.

  • background.random.state: This parameter specifies the random seed for sampling the background data from the training data in hanaml.UnifiedClassification.

  • top.k.attributions: This parameter specifies the number of attributions (i.e. features) that contribute the most (in absolute magnitude) to the prediction result to output.

  • sample.size: This parameter specifies the number of sampled combinations of features (attributions) for implementing kernelSHAP (set to 0 if you want the value of this parameter to be heuristically determined). Therefore, (1) it is better to use a number that is larger than the number of columns in the training data, and (2) it is valid only for the following set of classification functions: Naive Bayes, SVM, MLP, Multi-class Logistic Regression.

  • random.state: This parameter specifies the random seed used when implementing kernelSHAP (e.g. for the random sampling of combinations of features). Therefore, it is valid only for the following set of classification functions: Naive Bayes, SVM, MLP, Multi-class Logistic Regression.

  • feature.attribution.method: This parameter specifies the method used to compute feature contributions to prediction results for tree-based models (i.e. Decision Tree, Random Decision Trees and HGBT).

Examples

1). A Simple Example:
Input data for prediction:


> df.predict
  ID  OUTLOOK   TEMP HUMIDITY WINDY
1  0 Overcast     75   -10000   Yes
2  1     Rain     78       70   Yes
3  2    Sunny -10000       NA   Yes
4  3    Sunny     69       70   Yes
5  4     Rain     NA       70   Yes
6  5     <NA>     70       70   Yes
7  6      ***     70       70   Yes

Call the predict() function:


> res <- predict(model = uc.dt,
                 data = df.predict,
                 key = "ID",
                 func = "DecisionTree",
                 algorithm = "cart")

Check the result:


> res$Collect()[1:3]
  ID       SCORE CONFIDENCE
1  0        Play  1.0000000
2  1 Do not Play  1.0000000
3  2        Play  0.5000000
4  3        Play  0.5000000
5  4        Play  0.6363636
6  5        Play  0.5000000
7  6        Play  0.5000000

2). Interpretation of the prediction result using kernelSHAP:
Here we use the renowned iris data for simplicity.


> usvc <- hanaml.UnifiedClassification(data=iris_train,
                                       key="ID",
                                       label="SPECIES",
                                       func="SVM",
                                       background.size=50,
                                       background.random.state=2022)
> res <- predict(usvc, data=iris_predict,
                 key="ID",
                 top.k.attributions=4, # output all attributions
                 sample.size=6,
                 random.state=2022)

3). Interpretation of the prediction result for tree-based models:
Here we use the renowned iris data for simplicity.


> udtc <- hanaml.UnifiedClassification(data=iris_train,
                                       key="ID",
                                       label="SPECIES",
                                       func="DecisionTree",
                                       algorithm="c45") # no need to sample background data
> res <- predict(udtc,
                 data=iris_predict,
                 key="ID",
                 top.k.attributions=4, # output all attributions
                 attribution.method="tree-shap") # use tree-SHAP to compute feature contributions