Similar to other predict methods, this function predicts fitted values from a fitted "UnifiedRegression" object.

# S3 method for UnifiedRegression
predict(
  model,
  data,
  key,
  features = NULL,
  thread.ratio = NULL,
  func = NULL,
  prediction.type = NULL,
  significance.level = NULL,
  handle.missing = NULL,
  block.size = NULL,
  attribution.method = NULL,
  top.k.attributions = NULL,
  sample.size = NULL,
  random.state = NULL,
  ignore.correlation = NULL,
  categorical.variable = NULL,
  impute = FALSE,
  strategy = NULL,
  strategy.by.col = NULL,
  als.factors = NULL,
  als.lambda = NULL,
  als.maxit = NULL,
  als.randomstate = NULL,
  als.exit.threshold = NULL,
  als.exit.interval = NULL,
  als.linsolver = NULL,
  als.cg.maxit = NULL,
  als.centering = NULL,
  als.scaling = NULL,
  group.key = NULL,
  group.params = NULL,
  interval.type = NULL
)

Format

S3 methods

Arguments

model

R6Class
A "UnifiedRegression" object for prediction.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Name of feature columns for prediction.
If not provided, it defaults to all non-key columns of data.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

func

character, optional
The functionality for unified regression model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".

prediction.type

character, optional
Specifies the prediction type in the result table.

  • "response": direct response (with link applied)

  • "link": linear response (without link)

Valid only for GLM models.
Defaults to "response".

significance.level

numeric, optional
Specifies the significance level for the confidence interval and prediction interval.
Valid only for GLM models when the IRLS solver is applied.
Defaults to 0.05.

handle.missing

character, optional
Specifies the way to handle missing values in the data.

  • "skip": Skip rows with missing values

  • "fill_zero": Replace missing values with 0 before prediction

Valid only for GLM models.
Defaults to "fill_zero".
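As an illustrative sketch (not part of the original documentation), the two GLM-specific parameters above could be combined as follows; uglm and df.predict are assumed names for a fitted "UnifiedRegression" object with func = "GLM" and a DataFrame for scoring:


```r
# Hypothetical sketch: uglm is assumed to be a fitted "UnifiedRegression"
# object trained with func = "GLM", and df.predict a DataFrame for scoring.
# Request the linear predictor (no link applied) and skip rows with
# missing values instead of zero-filling them.
> res <- predict(uglm,
                 data = df.predict,
                 key = "ID",
                 prediction.type = "link",
                 handle.missing = "skip")
```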

block.size

integer, optional
Specifies the number of rows of data loaded at a time during scoring.

  • 0: load all data at once.

  • Others: the specified number of rows.

This parameter helps reduce memory consumption, especially when the data for prediction is huge or contains a large number of missing independent variables. However, it may cost some efficiency.
Valid only for Random Decision Trees models.
Defaults to 0.

attribution.method

character, optional
Specifies which method to use in model reasoning:

  • "no": no reasoning

  • "saabas": SAABAS reasoning

  • "tree-shap": treeSHAP reasoning

Valid only for tree-based models.
Defaults to "tree-shap".

top.k.attributions

integer, optional
Outputs the attributions of the top k features that contribute the most.
Defaults to 10.

sample.size

integer, optional
Specifies the number of sampled combinations of features.
If set to 0, the value is determined heuristically by the algorithm.
Valid only when the trained regression model is for Exponential Regression, GLM, Linear Regression, Multi-layer Perceptron, or SVM.
Defaults to 0.

random.state

integer, optional
Specifies the seed for random number generator.

  • 0: Uses the current time (in seconds) as the seed;

  • Others: Uses the specified value as the seed.

Valid only when the trained regression model is for Exponential Regression, GLM, Linear Regression, Multi-layer Perceptron, or SVM.
Defaults to 0.

ignore.correlation

logical, optional
Specifies whether or not to ignore the correlation between the features of the input data.
Valid only for Exponential Regression, GLM and Linear Regression models.
Defaults to TRUE.

impute

logical, optional
Specifies whether or not to handle missing values in the data for scoring.
Defaults to FALSE.

strategy

character, optional
Specifies the overall imputation strategy for the input scoring data.

  • "non": No imputation for all columns.

  • "most_frequent.mean": Replace missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.

  • "most_frequent.median": Replace missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.

  • "most_frequent.zero": Replace missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.

  • "most_frequent.als": Replace missing values in any categorical column by its most frequently observed value, and fill missing values in numerical columns via a matrix completion model trained using the alternating least squares (ALS) method.

  • "delete": Delete all rows with missing values.

Valid only when impute is TRUE.
Defaults to "most_frequent.mean".

strategy.by.col

list, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names, while each value should either be the imputation strategy applied to that column, or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three strategies are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"), which means that for column V1, all missing values shall be replaced by the constant 0, while for column V5, all missing values shall be replaced by the median of all available values in that column.
No default value.
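The interplay of impute, strategy and strategy.by.col might be sketched as follows (an assumed illustration, not from the original text; umlr and df.predict are placeholder names for a fitted "UnifiedRegression" object and a scoring DataFrame):


```r
# Hypothetical sketch: umlr is assumed to be a fitted "UnifiedRegression"
# object and df.predict a DataFrame containing missing values.
# Columns default to most-frequent/mean imputation, while column V1 is
# filled with the constant 0 and column V5 with its median.
> res <- predict(umlr,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.mean",
                 strategy.by.col = list(V1 = 0, V5 = "median"))
```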

als.factors

integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.

als.lambda

double, optional
L2 regularization applied to the factors in the ALS model. Should be non-negative.
Defaults to 0.01.

als.maxit

integer, optional
Specifies the maximum number of iterations for training the ALS model.
Defaults to 20.

als.cg.maxit

integer, optional
Specifies the maximum number of iterations for the CG algorithm. Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.

als.randomstate

integer, optional
Specifies the seed of the random number generator used in the training of ALS model.
0 means using the current time as the seed, while any other number means using the specified value as the seed.
Defaults to 0.

als.exit.threshold

double, optional
Specifies a threshold for stopping the training of the ALS model. If the improvement of the cost function of the ALS model between consecutive checks is less than this value, the training process exits.
0 means the objective value is not checked while running the algorithm, and training stops only when the maximum number of iterations has been reached.
Defaults to 0.

als.exit.interval

integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als.exit.threshold has been reached.
Defaults to 5.

als.linsolver

c('cholesky', 'cg'), optional
Linear system solver for the ALS model.

  • 'cholesky' is usually much faster

  • 'cg' is recommended when als.factors is large.

Defaults to 'cholesky'.

als.centering

logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.

als.scaling

logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
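The ALS-related parameters above only take effect when ALS imputation is selected. A hedged sketch (assumed names umlr and df.predict, using only parameters documented on this page):


```r
# Hypothetical sketch: umlr is assumed to be a fitted "UnifiedRegression"
# object and df.predict a DataFrame with missing numerical values.
# Fill numerical gaps with an ALS matrix-completion model, selecting the
# CG linear system solver and checking the cost function every 5 iterations.
> res <- predict(umlr,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.als",
                 als.factors = 3,
                 als.lambda = 0.01,
                 als.linsolver = "cg",
                 als.cg.maxit = 5,
                 als.exit.interval = 5)
```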

group.key

character, optional
The column of group key. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only the parameters set in group.params are valid. This parameter is valid only when massive is TRUE.
Defaults to the first column of data if not provided.

group.params

list, optional
If the massive mode is activated (massive = TRUE), the input data shall be divided into different groups, with different parameters applied to each group.
An example is as follows:


> mur <- hanaml.UnifiedRegression(func = 'hgbt',
                                  massive = TRUE,
                                  thread.ratio = 0.5,
                                  data = df.fit,
                                  group.key = "GROUP_ID",
                                  key = "ID",
                                  features = list("X1", "X2", "Y"),
                                  label = 'X3',
                                  group.params = list("Group_1" = list(partition.method = 'random')))
> res <- predict(mur,
                 data = df.predict,
                 group.key = "GROUP_ID",
                 key = "ID",
                 group.params = list("Group_1" = list(impute = TRUE)))

Valid only when massive is TRUE and defaults to NULL.

interval.type

c("no", "confidence", "prediction"), optional
Specifies the type of interval to output:

  • "no": do not calculate and output any interval

  • "confidence": calculate and output the confidence interval

  • "prediction": calculate and output the prediction interval

Valid only for one of the following three cases:

  • GLM model with the IRLS solver applied (i.e. func is specified as "GLM" and solver as "irls" during class instance initialization).

  • Linear Regression model with JSON model export enabled and coefficient covariance information computed (i.e. func is specified as "LinearRegression", json.export specified as TRUE during class instance initialization, and output.coefcov specified as TRUE during the training phase).

  • Random Decision Trees model with all leaf values retained (i.e. func is "RandomDecisionTrees" and output.leaf.values is TRUE). In this case, interval.type can be specified as either "no" or "prediction".

Defaults to "no".

Value

Predicted values are returned as a DataFrame, structured as follows.

  • ID column name

  • SCORE

  • UPPER_BOUND

  • LOWER_BOUND

  • REASON

An additional DataFrame containing error messages is produced for a massive model (i.e. massive is set to TRUE during the model training phase). In this case, the output is a list of two DataFrames.

Interpretation of Prediction Result

In the process of predictive modeling, we sometimes want to know why a certain prediction is made. To achieve this objective, SHapley Additive exPlanations (SHAP) was proposed, based on the Shapley values of game theory.
To balance accuracy and efficiency, the implementation of SHAP varies among different machine learning algorithms.

The following table gives an overview of the supported SHAP versions for different regression functions supported in UnifiedRegression:

Function                   SHAP Method
Decision Tree              treeSHAP, Saabas
Random Decision Trees      treeSHAP, Saabas
HGBT                       treeSHAP, Saabas
Exponential Regression     linearSHAP
Linear Regression          linearSHAP
GLM                        linearSHAP
MLP                        kernelSHAP
SVM                        kernelSHAP


Note that (1) for regression functions not listed in the table above, SHAP explanations are not supported, and (2) for Exponential Regression and GLM, feature contributions are computed w.r.t. the linear response (using linearSHAP), not the original target variable.

Relevant Parameters

  • background.size: This parameter is provided in hanaml.UnifiedRegression. It specifies the row size of the background data (which is sampled from the training data) for implementing kernelSHAP and linearSHAP. Therefore, (1) the value of this parameter should not exceed the row size of the training data, and (2) it is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.

  • background.random.state: This parameter is provided in hanaml.UnifiedRegression. It specifies the random seed for sampling the background data from the training data.

  • top.k.attributions: This parameter specifies the number of attributions (i.e. features) that contribute the most (in absolute magnitude) to the prediction result to be output.

  • sample.size: This parameter specifies the number of sampled combinations of features (attributions) for implementing kernelSHAP and linearSHAP (set it to 0 if you want the value to be heuristically determined). Therefore, (1) it is better to use a number larger than the number of columns in the training data, and (2) this parameter is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.

  • random.state: This parameter specifies the random seed for implementing kernelSHAP and linearSHAP (e.g. the random sampling of combinations of features). Therefore, it is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.

  • ignore.correlation: This parameter specifies whether or not to ignore correlation between features when computing feature contributions using linearSHAP. Therefore, it is only valid for the following set of regression functions: Exponential Regression, Linear Regression and GLM.

  • attribution.method: This parameter specifies the method used to compute feature contributions of prediction results for tree-based models (i.e. Decision Tree, Random Decision Trees and HGBT).

Examples

1). A simple example
Input data for prediction:


> df.predict
  ID      X1 X2 X3
1  0   1.690  B  1
2  1   0.054  B  2
3  2 980.123  A  2
4  3   1.000  A  1
5  4   0.563  A  1

Call the predict() function to get target values as well as prediction intervals (assuming that umlr is a "UnifiedRegression" object trained with output.coefcov = TRUE):


> res <- predict(model = umlr,
                 data = df.predict,
                 key = "ID",
                 significance.level = 0.05,  # specify the significance level
                 interval.type = 'prediction')  # specify the interval type

Check the result:


> res$Collect()
  ID       SCORE UPPER_BOUND LOWER_BOUND REASON
1  0    8.719607    6.759643   10.679571   <NA>
2  1    1.416343   -0.543621    3.376307   <NA>
3  2 3318.371440 3316.411476 3320.331404   <NA>
4  3   -2.050390   -4.010354   -0.090426   <NA>
5  4   -3.533135   -5.493099   -1.573171   <NA>

2). Interpretation of prediction result using treeSHAP
We use the renowned Boston housing data for illustration.
First, we train a Decision Tree model:


> udtr <- hanaml.UnifiedRegression(func="DecisionTree",
                                   algorithms="cart",
                                   data = boston_housing_train,
                                   key="ID",
                                   label="MEDV")

Note that when interpreting prediction results of tree-based models using either treeSHAP or Saabas, there is no need to sample background data from the training data to make local interpretations.


> res <- predict(udtr,
                 data = boston_housing_predict,
                 key="ID",
                 top.k.attributions=5,
                 attribution.method="tree-shap")

3). Interpretation of prediction result using linearSHAP
Again we use the renowned Boston housing data for illustration:


> umlr <- hanaml.UnifiedRegression(func="LinearRegression",
                                   data=boston_housing_train,
                                   key="ID",
                                   label="MEDV",
                                   background.size=25,
                                   background.random.state=2023)
> res <- predict(umlr,
                 data=boston_housing_predict,
                 key="ID",
                 top.k.attributions=5,
                 sample.size=20,
                 random.state=2023,
                 ignore.correlation=TRUE)  # ignore correlations among features

4). Interpretation of prediction result using kernelSHAP
Yet again we use the renowned Boston housing data for illustration:


> usvr <- hanaml.UnifiedRegression(func="SVM",
                                   data=boston_housing_train,
                                   key="ID",
                                   label="MEDV",
                                   background.size=25,
                                   background.random.state=2023)
> res <- predict(usvr,
                 data=boston_housing_predict,
                 key="ID",
                 top.k.attributions=5,
                 sample.size=30,
                 random.state=2023)