Unified Regression

hanaml.UnifiedRegression is an R wrapper for SAP HANA PAL Unified Regression.

hanaml.UnifiedRegression(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  output.partition.result = NULL,
  background.size = NULL,
  background.random.state = NULL,
  categorical.variable = NULL,
  impute = FALSE,
  strategy = NULL,
  strategy.by.col = NULL,
  als.factors = NULL,
  als.lambda = NULL,
  als.maxit = NULL,
  als.randomstate = NULL,
  als.exit.threshold = NULL,
  als.exit.interval = NULL,
  als.linsolver = NULL,
  als.cg.maxit = NULL,
  als.centering = NULL,
  als.scaling = NULL,
  c = NULL,
  massive = FALSE,
  group.key = NULL,
  group.params = NULL,
  output.coefcov = NULL,
  output.leaf.values = NULL,
  ...
)

Arguments

data

DataFrame
DataFrame containing the data.

func

character
The functionality for unified regression.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last non-purpose column.

purpose

character, optional
Name of the column which specifies the user-defined data partition.
Mandatory if partition.method is "predefined".

formula

formula type, optional
Formula to be used for model generation, in the format label ~ <feature_list>, e.g. formula = CATEGORY ~ V1 + V2 + V3.
Provide either the formula or a combination of features and label, but not both.
Defaults to NULL.
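As a local illustration of the label ~ <feature_list> convention, the following base R sketch splits a formula into the names that would otherwise be passed as label and features. It does not call hanaml and is not a description of how the package parses the formula internally; the variable names are chosen only for this example.

```r
# Base R only: decompose a formula into a label and a feature list.
f <- Y ~ X1 + X2 + X3

# The response variable is the first variable mentioned in the formula.
label <- all.vars(f)[1]

# The right-hand-side terms correspond to the feature columns.
features <- attr(terms(f), "term.labels")

print(label)
print(features)
```

Passing formula = Y ~ X1 + X2 + X3 is therefore meant as an alternative to passing label = "Y" together with features = list("X1", "X2", "X3").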

partition.method

character, optional
Specifies the method for partitioning the training data.
Valid options include: "no", "predefined", "random".
Defaults to "no" if not specified (i.e. no data partition).

partition.random.state

integer, optional
Specifies the seed for the random partition.
Valid only when partition.method is set to "random".
Defaults to 0 (system time).

training.percent

numeric, optional
Specifies the percentage of data used for training.
Valid only when partition.method is set to "random".
Defaults to 0.8.
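To make the interplay of training.percent and a fixed seed concrete, here is a minimal base R sketch of a seeded random split of row indices. It is only an analogy: the actual partition is performed inside SAP HANA PAL, and the helper name split_rows is hypothetical. Note also that the "0 means system time" convention belongs to PAL, not to set.seed.

```r
# Local illustration only: a reproducible random split of row indices.
# split_rows is a hypothetical helper, not part of hanaml.
split_rows <- function(n, training.percent = 0.8, seed = 2) {
  set.seed(seed)                              # fixed seed gives a repeatable split
  n.train <- round(n * training.percent)      # number of training rows
  train <- sort(sample.int(n, n.train))       # sampled without replacement
  list(train = train, test = setdiff(seq_len(n), train))
}

# 10 rows, 70% for training, as in the Examples section of this page.
parts <- split_rows(10, training.percent = 0.7, seed = 2)
print(parts)
```

Calling split_rows again with the same seed returns the same indices, which is the reason for setting partition.random.state to a non-zero value when results need to be reproducible.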

output.partition.result

logical, optional
Specifies whether or not to output the partition result of the training data.
Defaults to FALSE.

background.size

integer, optional
Specifies the row size of the background data (for SHAP value computation).
It should not be larger than the row size of the training data.
Valid only when func takes one of the following values: "ExponentialRegression", "GLM", "LinearRegression", "MLP" and "SVM".
Defaults to 0 (i.e. no background data, thus no local model interpretability).

background.random.state

integer, optional
Specifies the seed for the random number generator in the background data sampling:

0: Use the current system time as seed.
others: Use the specified value as seed.

Valid only when func takes one of the following values: "ExponentialRegression", "GLM", "LinearRegression", "MLP" and "SVM".
Defaults to 0.

impute

logical, optional
Specifies whether or not to handle missing values in the data for scoring.
Defaults to FALSE.

strategy

character, optional
Specifies the overall imputation strategy for the input scoring data.

"non": No imputation for all columns.
"most_frequent.mean": Replaces missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
"most_frequent.median": Replaces missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
"most_frequent.zero": Replaces missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
"most_frequent.als": Replaces missing values in any categorical column by its most frequently observed value, and fills the missing values in all numerical columns via a matrix completion technique called alternating least squares (ALS).
"delete": Deletes all rows with missing values.

Valid only when impute is TRUE.
Defaults to 'most_frequent.mean'.
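The effect of the default strategy can be sketched locally in base R. The helper below is a hypothetical stand-in written for this illustration; the real imputation runs inside SAP HANA PAL and may differ in details such as tie-breaking among equally frequent values.

```r
# Local illustration of the "most_frequent.mean" strategy.
# impute_most_frequent_mean is a hypothetical helper, not part of hanaml.
impute_most_frequent_mean <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      # Numerical column: fill with the mean of the observed values.
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      # Categorical column: fill with the most frequently observed value.
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy <- data.frame(X1 = c(1, NA, 3),
                  X2 = c("A", "A", NA),
                  stringsAsFactors = FALSE)
filled <- impute_most_frequent_mean(toy)
print(filled)
```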

strategy.by.col

list, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names, while each value should either be the imputation strategy applied to that column, or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three strategies are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"), which means that for column V1, all missing values shall be replaced by the constant 0, while for column V5, all missing values shall be replaced by the median of all available values in that column.
No default value.
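A sketch of how the overall strategy and per-column overrides might be combined in one call is given below. It assumes that the hana.ml.r package is installed and that data.fit is a HANA DataFrame with the columns shown in the Examples section of this page; the call is guarded so that the snippet does nothing when either assumption does not hold. The column choices are illustrative only.

```r
# Per-column overrides: every element must be named after a column.
# X1 uses the column median; missing values in X3 become the constant 0.
by.col <- list(X1 = "median", X3 = 0)
stopifnot(!is.null(names(by.col)), all(names(by.col) != ""))

# Assumption: hana.ml.r is installed and data.fit is an existing HANA DataFrame.
if (requireNamespace("hana.ml.r", quietly = TRUE) && exists("data.fit")) {
  umlr.imp <- hana.ml.r::hanaml.UnifiedRegression(
    data = data.fit,
    func = "LinearRegression",
    key = "ID",
    label = "Y",
    impute = TRUE,                       # strategy settings need impute = TRUE
    strategy = "most_frequent.mean",     # overall strategy for other columns
    strategy.by.col = by.col)            # overrides for X1 and X3
}
```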

als.factors

integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.

als.lambda

double, optional
L2 regularization applied to the factors in the ALS model. Should be non-negative.
Defaults to 0.01.

als.maxit

integer, optional
Specifies the maximum number of iterations for cg algorithm. Invoked only when the 'cg' is the chosen linear system solver for ALS.
Defaults to 3.

als.randomstate

integer, optional
Specifies the seed of the random number generator used in the training of ALS model.
0 means using the current time as the seed; any other number is used as the seed itself.
Defaults to 0.

als.exit.threshold

double, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, then the training process will exit.
0 means that the objective value is not checked while the algorithm runs, so training stops only when the maximum number of iterations has been reached.
Defaults to 0.

als.exit.interval

integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als.exit.threshold is reached.
Defaults to 5.

als.linsolver

c('cholesky', 'cg'), optional
Linear system solver for the ALS model.

'cholesky' is usually much faster.
'cg' is recommended when als.factors is large.

Defaults to 'cholesky'.

als.centering

logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.

als.scaling

logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.

c

double, optional
Trade-off between training error and margin for SVM Regression.
Valid only when func is "SVM".
Must be positive.
Defaults to 100.

massive

logical, optional
Specifies whether or not to use massive mode.
For parameter setting in massive mode, you can use either group.params (please see the example below) or the original parameters. Original parameters apply to all groups. However, if you define any parameters for a group in group.params, none of the original parameter settings apply to that group.
An example is as follows:


> mur <- hanaml.UnifiedRegression(func='hgbt',
                                  massive=TRUE,
                                  thread.ratio=0.5,
                                  data=df.fit,
                                  group.key="GROUP_ID",
                                  key="ID",
                                  features=list("X1", "X2", "Y"),
                                  label='X3',
                                  group.params=list("Group_1"=list(partition.method = 'random')))

In this example, because 'partition.method' is set in group.params for Group_1, the setting of 'thread.ratio' does not apply to Group_1.
Defaults to FALSE.
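The precedence rule described above can be restated as a small base R sketch. The function resolve_group_params is hypothetical and exists only to illustrate the rule as documented here (a group with its own entry in group.params ignores the original parameters); it is not part of hanaml and says nothing about how PAL implements massive mode.

```r
# Hypothetical helper restating the documented precedence rule.
resolve_group_params <- function(group, original.params, group.params) {
  if (!is.null(group.params[[group]])) {
    # The group defines its own parameters: original settings do not apply.
    group.params[[group]]
  } else {
    # No group-specific entry: the original parameters apply.
    original.params
  }
}

original.params <- list(thread.ratio = 0.5)
group.params <- list("Group_1" = list(partition.method = "random"))

g1 <- resolve_group_params("Group_1", original.params, group.params)
g2 <- resolve_group_params("Group_2", original.params, group.params)
print(g1)
print(g2)
```

Under this reading, Group_1 is trained with partition.method only, while every other group keeps thread.ratio = 0.5.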

group.key

character, optional
The column of group key. The data type can be INT or NVARCHAR/VARCHAR. If data type is INT, only parameters set in the group.params are valid. This parameter is only valid when massive is TRUE.
Defaults to the first column of data if group.key is not provided.

group.params

list, optional
If the massive mode is activated (massive=TRUE), input data shall be divided into different groups with different parameters applied.
An example is as follows:


> mur <- hanaml.UnifiedRegression(func='hgbt',
                                  massive=TRUE,
                                  thread.ratio=0.5,
                                  data=df.fit,
                                  group.key="GROUP_ID",
                                  key="ID",
                                  features=list("X1", "X2", "Y"),
                                  label='X3',
                                  group.params=list("Group_1"=list(partition.method = 'random')))
> res <- predict(mur,
                 data=df.predict,
                 group.key="GROUP_ID",
                 key="ID",
                 group.params=list("Group_1"=list(impute=TRUE)))

Valid only when massive is TRUE and defaults to NULL.

output.coefcov

logical, optional
Specifies whether or not to output coefficient covariance information for Linear Regression.
Valid only if func is specified as "LinearRegression" and json.export is specified as TRUE.

Defaults to FALSE.
Note: To enable the output of confidence/prediction intervals for a Linear Regression model in UnifiedRegression during the predicting/scoring phase, output.coefcov must be set to TRUE.

output.leaf.values

logical, optional
Specifies whether or not to save the target values in each leaf node in the training phase for the Random Decision Trees model (otherwise only the mean of the target values is saved in the model). Set this parameter to TRUE to enable the output of prediction intervals for the Random Decision Trees model in UnifiedRegression during the predicting/scoring phase.
Valid only for fitting a Random Decision Trees model (i.e. setting func as 'RandomDecisionTrees') when model.format is "json" or compression is TRUE.
Defaults to FALSE.

...

Specifies other parameters for training a regression model with the functionality specified in func.
Please see the documentation of corresponding functionalities for more detail.
hanaml.DecisionTreeRegressor, hanaml.RDTRegressor, hanaml.MLPRegressor, hanaml.HGBTRegressor, hanaml.SVR, hanaml.ExponentialRegression, hanaml.LogarithmicRegression, hanaml.PolynomialRegression, hanaml.GeometricRegression, hanaml.LinearRegression, hanaml.GLM
However, some parameters will be disabled. The disabled parameters are listed as follows:

DecisionTreeRegressor: output.rules.
RDTRegressor: calculate.oob.
HGBTRegressor: calculate.importance.
GLM: output.fitted.
ExponentialRegression: pmml.export.
GeometricRegression: pmml.export.
PolynomialRegression: pmml.export.
LogarithmicRegression: pmml.export.
LinearRegression: pmml.export. Note that for LinearRegression, the meaning of the parameter json.export has changed: FALSE means exporting the multiple linear regression model in PMML, while TRUE still means exporting the model in JSON.

Value

Returns an R6 object of class "UnifiedRegression" with the following attributes and methods:

Attributes

model: DataFrame

ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content

importance: DataFrame

VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance

optimal.param: DataFrame

PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value

statistics: DataFrame

STAT_NAME - Statistics name
STAT_VALUE - Statistics value

confusion.matrix: DataFrame

ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records

metrics: DataFrame

NAME - Metric name
X - X value
Y - Y value

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > umlr <- hanaml.UnifiedRegression(data=df, func="LinearRegression")
   > umlr$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > umlr <- hanaml.UnifiedRegression(data=df, func="LinearRegression")
   > umlr$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > umlr$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state shall be cleaned up and associated memory be released.

Examples

The training data:


> data.fit
   ID   X1 X2 X3      Y
1   0 0.00  A  1 -6.879
2   1 0.50  A  1 -3.449
3   2 0.54  B  1  6.635
4   3 1.04  B  1 11.844
5   4 1.50  A  1  2.786
6   5 0.04  B  2  2.389
7   6 2.00  A  2 -0.011
8   7 2.04  B  2  8.839
9   8 1.54  B  1  4.689
10  9 1.00  A  2 -5.507

Create a UnifiedRegression model for linear regression:


> umlr <- hanaml.UnifiedRegression(data = data.fit,
                                   key = "ID",
                                   label = "Y",
                                   solver = "qr",
                                   adjusted.r2 = FALSE,
                                   func="LinearRegression",
                                   thread.ratio=0.5,
                                   partition.method="random",
                                   training.percent=0.7,
                                   partition.random.state=2,
                                   output.partition.result=TRUE,
                                   output.coefcov=TRUE)

Check the resulting statistics:


> umlr$statistics
       STAT_NAME         STAT_VALUE
1      TEST_EVAR  0.871459247598903
2       TEST_MAE 2.0088082000000003
3      TEST_MAPE 12.260003987804756
4 TEST_MAX_ERROR  5.329849599999999
5       TEST_MSE  9.551661310681718
6        TEST_R2 0.7774293644548433
7      TEST_RMSE   3.09057621013974
8     TEST_WMAPE 0.7188006440839695
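For readers who want to relate these statistics to each other, the following base R sketch computes a few of them from vectors of actual and predicted values using the conventional textbook definitions. Whether PAL uses exactly these formulas for every statistic is an assumption; regression_metrics is a hypothetical helper, and the toy vectors are made up for the illustration.

```r
# Conventional definitions of common regression metrics (assumed, not
# taken from the PAL implementation). regression_metrics is hypothetical.
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MAE  = mean(abs(err)),
    MSE  = mse,
    RMSE = sqrt(mse),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

m <- regression_metrics(actual = c(1, 2, 3), predicted = c(1, 2, 4))
print(m)

# Consistency check on the table above: TEST_RMSE should be sqrt(TEST_MSE).
print(all.equal(sqrt(9.551661310681718), 3.09057621013974, tolerance = 1e-6))
```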

Data for model scoring:


> df.score
  ID      X1 X2 X3   Y
1  0   1.690  B  1 1.2
2  1   0.054  B  2 2.1
3  2 980.123  A  2 2.4
4  3   1.000  A  1 1.8
5  4   0.563  A  1 1.0

Apply the obtained linear regression model to the scoring data:


> score.res <- score(umlr, data = df.score, key = "ID", label = "Y")

Check the statistics on scoring data:


> score.res[[2]]$Collect()
  STAT_NAME         STAT_VALUE
1      EVAR -6284768.906191169
2       MAE  666.5116459919999
3      MAPE  278.9837795885635
4 MAX_ERROR 3315.9714402299996
5       MSE  2199151.795823181
6        R2  -7854112.55651136
7      RMSE 1482.9537402842952
8     WMAPE  392.0656741129411

Prediction of target values as well as intervals:


> df.pred
  ID      X1 X2 X3
1  0   1.690  B  1
2  1   0.054  B  2
3  2 980.123  A  2
4  3   1.000  A  1
5  4   0.563  A  1


> pred.res <- predict(umlr, data = df.pred, key = "ID",
                      features = list("X1", "X2", "X3"),
                      significance.level = 0.05,     # specify the significance level
                      interval.type = "prediction")  # specify the interval type
