predict.UnifiedRegression.Rd
Description
Similar to other predict methods, this function predicts fitted values from a fitted "UnifiedRegression" object.
Usage
# S3 method for UnifiedRegression
predict(
  model,
  data,
  key,
  features = NULL,
  thread.ratio = NULL,
  func = NULL,
  prediction.type = NULL,
  significance.level = NULL,
  handle.missing = NULL,
  block.size = NULL,
  attribution.method = NULL,
  top.k.attributions = NULL,
  sample.size = NULL,
  random.state = NULL,
  ignore.correlation = NULL,
  categorical.variable = NULL,
  impute = FALSE,
  strategy = NULL,
  strategy.by.col = NULL,
  als.factors = NULL,
  als.lambda = NULL,
  als.maxit = NULL,
  als.randomstate = NULL,
  als.exit.threshold = NULL,
  als.exit.interval = NULL,
  als.linsolver = NULL,
  als.cg.maxit = NULL,
  als.centering = NULL,
  als.scaling = NULL,
  group.key = NULL,
  group.params = NULL,
  interval.type = NULL
)
Arguments
model
R6Class
A "UnifiedRegression" object for prediction.
data
DataFrame
DataFrame containing the data.
key
character
Name of the ID column.
features
character or list of characters, optional
Names of the feature columns for prediction.
If not provided, it defaults to all non-key columns of data.
thread.ratio
double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then determined heuristically.
Defaults to -1.
func
character, optional
The functionality of the unified regression model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression",
"SVM", "MLP", "PolynomialRegression", "LogarithmicRegression",
"ExponentialRegression", "GeometricRegression", "GLM".
prediction.type
character, optional
Specifies the prediction type in the result table:
"response": direct response (with link applied);
"link": linear response (without link applied).
Valid only for GLM models.
Defaults to "response".
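For illustration, a minimal sketch of requesting the linear response (assuming uglm is a "UnifiedRegression" object fitted with func = "GLM" and df.predict is a DataFrame with ID column "ID"; both names are hypothetical):
> res.link <- predict(uglm,
                      data = df.predict,
                      key = "ID",
                      prediction.type = "link")  # linear response, link not applied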
significance.level
numeric, optional
Specifies the significance level for the confidence interval and prediction interval.
Valid only for GLM models when the IRLS solver is applied.
Defaults to 0.05.
handle.missing
character, optional
Specifies how to handle missing values in data:
"skip": skip rows with missing values;
"fill_zero": replace missing values with 0 before prediction.
Valid only for GLM models.
Defaults to "fill_zero".
block.size
integer, optional
Specifies the number of rows of data loaded at a time during scoring:
0: load all data at once;
others: load the specified number of rows.
This parameter is for reducing memory consumption, especially when the prediction data is huge or contains a large number of missing independent variables. However, you might lose some efficiency.
Valid only for Random Decision Trees models.
Defaults to 0.
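A hedged sketch of batched scoring for a Random Decision Trees model (urdt and df.predict are hypothetical names):
> res <- predict(urdt,
                 data = df.predict,
                 key = "ID",
                 block.size = 10000)  # score 10000 rows at a time to limit memory consumption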
attribution.method
character, optional
Specifies the method used in model reasoning:
"no": no reasoning;
"saabas": Saabas reasoning;
"tree-shap": treeSHAP reasoning.
Valid only for tree-based models.
Defaults to "tree-shap".
top.k.attributions
integer, optional
Outputs the attributions of the top k features that contribute the most.
Defaults to 10.
sample.size
integer, optional
Specifies the number of sampled combinations of features.
If set to 0, the value is determined heuristically by the algorithm.
Valid only when the trained regression model is for Exponential Regression, GLM, Linear Regression, Multi-layer Perceptron, or SVM.
Defaults to 0.
random.state
integer, optional
Specifies the seed for the random number generator:
0: uses the current time (in seconds) as the seed;
others: uses the specified value as the seed.
Valid only when the trained regression model is for Exponential Regression, GLM, Linear Regression, Multi-layer Perceptron, or SVM.
Defaults to 0.
ignore.correlation
logical, optional
Specifies whether or not to ignore the correlation between the features of the input data.
Valid only for Exponential Regression, GLM, and Linear Regression models.
Defaults to TRUE.
impute
logical, optional
Specifies whether or not to handle missing values in the data for scoring.
Defaults to FALSE.
strategy
character, optional
Specifies the overall imputation strategy for the input scoring data.
"non": no imputation for all columns.
"most_frequent.mean": replace missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
"most_frequent.median": replace missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
"most_frequent.zero": replace missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
"most_frequent.als": replace missing values in any categorical column by its most frequently observed value, and fill missing values in all numerical columns via a matrix completion technique called alternating least squares (ALS).
"delete": delete all rows with missing values.
Valid only when impute is TRUE.
Defaults to "most_frequent.mean".
strategy.by.col
list, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names, while each value should either be the imputation strategy applied to that column, or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the last three ("non", "delete", "most_frequent") are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"),
which means that for column V1, all missing values shall be replaced by the constant 0, while for column V5, all missing values shall be replaced by the median of all available values in that column.
No default value.
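A minimal sketch combining the overall strategy with per-column overrides (umlr and df.predict are hypothetical names; columns V1 and V5 are assumed to contain missing values):
> res <- predict(umlr,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.mean",  # overall imputation strategy
                 strategy.by.col = list(V1 = 0,  # replace missing values in V1 by 0
                                        V5 = "median"))  # impute V5 by its median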
als.factors
integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results are meaningful.
Defaults to 3.
als.lambda
double, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
als.cg.maxit
integer, optional
Specifies the maximum number of iterations for the cg algorithm.
Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
als.randomstate
integer, optional
Specifies the seed of the random number generator used in training the ALS model:
0: uses the current time as the seed;
others: uses the specified value as the seed.
Defaults to 0.
als.exit.threshold
double, optional
Specifies a value for stopping the training of the ALS model.
If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.
0 means the objective value is not checked while running the algorithm, which then stops only when the maximum number of iterations has been reached.
Defaults to 0.
als.exit.interval
integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see if the pre-specified als.exit.threshold is reached.
Defaults to 5.
als.linsolver
c("cholesky", "cg"), optional
Linear system solver for the ALS model:
"cholesky" is usually much faster;
"cg" is recommended when als.factors is large.
Defaults to "cholesky".
als.centering
logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.
als.scaling
logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
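A hedged sketch of ALS-based imputation during scoring, wiring the als.* parameters above together (umlr and df.predict are hypothetical names):
> res <- predict(umlr,
                 data = df.predict,
                 key = "ID",
                 impute = TRUE,
                 strategy = "most_frequent.als",  # numerical columns imputed via ALS
                 als.factors = 2,  # keep factor length below the number of numerical columns
                 als.lambda = 0.01,
                 als.linsolver = "cg",  # cg solver, so als.cg.maxit applies
                 als.cg.maxit = 5,
                 als.randomstate = 1)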
group.key
character, optional
The column of group key. The data type can be INT or NVARCHAR/VARCHAR.
If the data type is INT, only parameters set in group.params are valid.
This parameter is valid only when massive is TRUE.
Defaults to the first column of data if group.key is not provided.
group.params
list, optional
If the massive mode is activated (massive = TRUE), the input data shall be divided into different groups with different parameters applied.
An example is as follows:
> mur <- hanaml.UnifiedRegression(func = "hgbt",
                                  massive = TRUE,
                                  thread.ratio = 0.5,
                                  data = df.fit,
                                  group.key = "GROUP_ID",
                                  key = "ID",
                                  features = list("X1", "X2", "Y"),
                                  label = "X3",
                                  group.params = list("Group_1" = list(partition.method = "random")))
> res <- predict(mur,
                 data = df.predict,
                 group.key = "GROUP_ID",
                 key = "ID",
                 group.params = list("Group_1" = list(impute = TRUE)))
Valid only when massive is TRUE; defaults to NULL.
c("no", "confidence", "prediction"), optional
Specifies the type of interval to output:
"no": do not calculate and output any interval
"confidence": calculate and output the confidence interval
"prediction": calculate and output the prediction interval
Valid only for one of the following three cases:
GLM model with IRLS solver applied(i.e. func
is specified as 'GLM' and solver
as "irls" during class instance initialization).
Linear Regression model with json model imported and coefficient covariance information computed
(i.e. func
is specified as "LinearRegression", json.export
specified as True
during class instance initialization, and output.coefcov
specified as TRUE during
the traning phase).
Random Decision Trees model with all leaf values retained(i.e. func
is "RandomDecisionTrees" and
output.leaf.values
is TRUE). In this case, interval.type
could be specified as either "no" or "prediction".
Defaults to "no".
Value
Predicted values are returned as a DataFrame, structured as follows:
ID column name
SCORE
UPPER_BOUND
LOWER_BOUND
REASON
An additional DataFrame for error messages is produced for a massive model (i.e. massive is set to TRUE during the model training phase). In this case, the output is a list of two DataFrames.
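A hedged sketch of accessing both outputs in massive mode (assuming mur is a massive "UnifiedRegression" model and the returned list keeps the prediction DataFrame first; the names and the list ordering are assumptions):
> res <- predict(mur,
                 data = df.predict,
                 group.key = "GROUP_ID",
                 key = "ID")
> res[[1]]$Collect()  # predicted values
> res[[2]]$Collect()  # error messages for the groups that failed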
Details
In the process of predictive modeling, we sometimes want to know why a certain prediction is made. To achieve this objective, Shapley Additive Explanations (SHAP) is proposed, which is based on the Shapley values in game theory.
To balance both accuracy and efficiency, the implementations of SHAP vary among different machine learning algorithms.
The following table gives an overview of the SHAP versions supported for the different regression functions in UnifiedRegression:
Function               | SHAP Method
Decision Tree          | treeSHAP, Saabas
Random Decision Trees  | treeSHAP, Saabas
HGBT                   | treeSHAP, Saabas
Exponential Regression | linearSHAP
Linear Regression      | linearSHAP
GLM                    | linearSHAP
MLP                    | kernelSHAP
SVM                    | kernelSHAP
Note that (1) for regression functions not listed in the table above, SHAP explanations are not supported, and (2) for Exponential Regression and GLM, feature contributions are computed w.r.t. the linear response (using linearSHAP), not the original target variable.
Relevant Parameters
background.size: This parameter is provided in hanaml.UnifiedRegression. It specifies the row size of the background data (which is sampled from the training data) for implementing kernelSHAP and linearSHAP. Therefore (1) the value of this parameter should not exceed the row size of the training data, and (2) it is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.
background.random.state: This parameter is provided in hanaml.UnifiedRegression. It specifies the random seed for sampling the background data from the training data.
top.k.attributions: This parameter specifies the number of attributions (i.e. features) that contribute the most (in absolute magnitude) to the prediction result to output.
sample.size: This parameter specifies the number of combinations of features (attributions) for implementing kernelSHAP and linearSHAP (set to 0 if you want the value of this parameter to be heuristically determined). Therefore (1) it is better to use a number that is larger than the number of columns in the training data, and (2) this parameter is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.
random.state: This parameter specifies the random seed for implementing kernelSHAP and linearSHAP (e.g. random sampling of combinations of features). Therefore, it is valid only for the following set of regression functions: Exponential Regression, Linear Regression, GLM, MLP and SVM.
ignore.correlation: This parameter specifies whether or not to ignore the correlation between features when computing feature contributions using linearSHAP. Therefore, it is valid only for the following set of regression functions: Exponential Regression, Linear Regression and GLM.
attribution.method: This parameter specifies the method used to compute feature contributions of prediction results for tree-based models (i.e. Decision Tree, Random Decision Trees and HGBT).
Examples
1). A simple example
Input data for prediction:
> df.predict
ID X1 X2 X3
1 0 1.690 B 1
2 1 0.054 B 2
3 2 980.123 A 2
4 3 1.000 A 1
5 4 0.563 A 1
Call the predict() function to get target values as well as prediction intervals (assuming that umlr is a UnifiedRegression object trained with output.coefcov = TRUE specified in the training phase):
> res <- predict(model = umlr,
                 data = df.predict,
                 key = "ID",
                 significance.level = 0.05,  # specify the significance level
                 interval.type = "prediction")  # specify the interval type
Check the result:
> res$Collect()
ID SCORE UPPER_BOUND LOWER_BOUND REASON
1 0 8.719607 6.759643 10.679571 <NA>
2 1 1.416343 -0.543621 3.376307 <NA>
3 2 3318.371440 3316.411476 3320.331404 <NA>
4 3 -2.050390 -4.010354 -0.090426 <NA>
5 4 -3.533135 -5.493099 -1.573171 <NA>
2). Interpretation of prediction result using treeSHAP
We use the renowned Boston housing data for illustration.
First, we train a Decision Tree model:
> udtr <- hanaml.UnifiedRegression(func = "DecisionTree",
                                   algorithms = "cart",
                                   data = boston_housing_train,
                                   key = "ID",
                                   label = "MEDV")
Note that for interpreting the prediction result of tree-based models using either treeSHAP or Saabas, there is no need to sample background data from the training data to make local interpretations.
> res <- predict(udtr,
                 data = boston_housing_predict,
                 key = "ID",
                 top.k.attributions = 5,
                 attribution.method = "tree-shap")
3). Interpretation of prediction result using linearSHAP
Again we use the Boston housing data for illustration:
> umlr <- hanaml.UnifiedRegression(func = "LinearRegression",
                                   data = boston_housing_train,
                                   key = "ID",
                                   label = "MEDV",
                                   background.size = 25,
                                   background.random.state = 2023)
> res <- predict(umlr,
                 data = boston_housing_predict,
                 key = "ID",
                 top.k.attributions = 5,
                 sample.size = 20,
                 random.state = 2023,
                 ignore.correlation = TRUE)  # ignore correlations among features
4). Interpretation of prediction result using kernelSHAP
Yet again we use the Boston housing data for illustration:
> usvr <- hanaml.UnifiedRegression(func = "SVM",
                                   data = boston_housing_train,
                                   key = "ID",
                                   label = "MEDV",
                                   background.size = 25,
                                   background.random.state = 2023)
> res <- predict(usvr,
                 data = boston_housing_predict,
                 key = "ID",
                 top.k.attributions = 5,
                 sample.size = 30,
                 random.state = 2023)