hanaml.UnifiedRegression is an R wrapper for SAP HANA PAL Unified Regression.

hanaml.UnifiedRegression(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  output.partition.result = NULL,
  ...
)

Arguments

data

DataFrame
DataFrame containing the data.

func

character
The functionality for unified regression.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last non-purpose column.

purpose

character, optional
Name of the column which specifies the user-defined data partition.
Mandatory if partition.method is "predefined".

formula

formula type, optional
Formula to be used for model generation, in the form label ~ <feature_list>, e.g. formula = CATEGORY ~ V1 + V2 + V3.
Provide either a formula or a features and label combination, but not both.
Defaults to NULL.

partition.method

character, optional
Specifies the method for partitioning the training data.
Valid options include: "no", "predefined", "random".
Defaults to "no" if not specified (i.e. no data partition).

partition.random.state

integer, optional
Specifies the random seed for the random partition.
Valid only when partition.method is set to "random".
Defaults to 0, in which case the system time is used as the seed.

training.percent

numeric, optional
Specifies the percentage of data used for training.
Valid only when partition.method is set to "random".
Defaults to 0.8.

output.partition.result

logical, optional
Specifies whether or not to output the partition result of the training data.
Defaults to FALSE.

...


Specifies other parameters for training a regression model with the functionality specified in func.
See the documentation of the corresponding functions for more detail:
hanaml.DecisionTreeRegressor, hanaml.RDTRegressor, hanaml.MLPRegressor, hanaml.HGBTRegressor, hanaml.SVR, hanaml.ExponentialRegression, hanaml.LogarithmicRegression, hanaml.PolynomialRegression, hanaml.GeometricRegression, hanaml.LinearRegression, hanaml.GLM
However, some parameters are disabled. The disabled parameters are listed as follows:

  • DecisionTreeRegressor: output.rules

  • RDTRegressor: calculate.oob

  • HGBTRegressor: calculate.importance
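
The formula argument described above is an alternative way of naming the label and the feature columns. The base R sketch below only illustrates that correspondence: it splits a formula into its left-hand side (the label) and its right-hand-side terms (the features). The column names Y, X1, X2 and X3 are taken from the example data further down, and this is not a description of how hanaml.UnifiedRegression parses the formula internally.

```r
# Formula using the column names of the example data on this page.
f <- Y ~ X1 + X2 + X3

# Left-hand side of the formula: the label (dependent variable).
label <- all.vars(f[[2]])

# Right-hand side terms: the feature columns.
features <- attr(terms(f), "term.labels")

print(label)
print(features)
```

Under this reading, passing formula = Y ~ X1 + X2 + X3 would stand in for label = "Y" together with features = list("X1", "X2", "X3"); as stated above, only one of the two forms should be given.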

Value

Returns a "UnifiedRegression" object with the following attributes and methods:

model DataFrame

  • ROW_INDEX - model row index

  • PART_INDEX - data partition index

  • MODEL_CONTENT - model content

importance DataFrame

  • VARIABLE_NAME - Independent variable name

  • IMPORTANCE - Variable importance

optimal.param DataFrame

  • PARM_NAME - parameter name

  • INT_VALUE - integer value

  • DOUBLE_VALUE - double value

  • STRING_VALUE - character value

statistics DataFrame

  • STAT_NAME - Statistics name

  • STAT_VALUE - Statistics value

confusion.matrix DataFrame

  • ACTUAL_CLASS - The actual class name

  • PREDICTED_CLASS - The predicted class name

  • COUNT - Number of records

metrics DataFrame

  • NAME - Metric name

  • X - X value

  • Y - Y value

score() Function
Parameters:

  • data DataFrame
    Input data for calculating score metrics.

  • key character
    Specifies name of the ID column for input data.

  • features list/vector of characters, optional
    Specifies names of the feature columns, i.e. independent columns.
    Defaults to all non-key, non-label columns if not provided.

  • label character, optional
    Specifies name of dependent column in the input data.
    Defaults to the last non-key column if not provided.

  • thread.ratio numeric, optional
    Specifies the ratio of total number of threads that can be used by the score function.
    Defaults to 1.0.

  • func character, optional
    The functionality for unified regression model.
    Mandatory only when the func attribute of model is NULL.
    Valid values are as follows:
    "DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".

  • prediction.type character, optional
    Specifies the type of prediction in the result table.
    Available options include:

    • response : direct response (with link function applied)

    • link : linear response (without link function applied)

    Valid only for GLM models.

  • significance.level numeric, optional
    Specifies significance level for the confidence interval and prediction interval.
    Valid only for GLM models where IRLS method is applied. Defaults to 0.05.

  • handle.missing character, optional
    Specifies how to handle missing values in data.

    • "skip": Skip rows with missing values

    • "fill_zero": Replace missing values with 0 before prediction

    Valid only for GLM models.
    Defaults to "fill_zero".

  • block.size integer, optional
    Specifies the number of rows loaded at a time during scoring.

    • 0: load all data at once

    • Others: the specified number

    This parameter helps reduce memory consumption, especially when the prediction data is very large
    or contains many missing values in the independent variables,
    at some cost in efficiency. Valid only for Random Decision Trees models.
    Defaults to 0.
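
The two prediction.type options listed above can be illustrated with base R's glm, which uses the same "link" and "response" names. This is only an analogy on invented toy data, not the PAL implementation: it fits a Poisson model with a log link and shows that the response-scale prediction is the linear predictor mapped back through the inverse of the link.

```r
# Toy count data, invented for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(1, 3, 2, 6, 8, 13))

# Poisson GLM with a log link, fitted with base R (not with PAL).
fit <- glm(y ~ x, family = poisson(link = "log"), data = toy)

new.data <- data.frame(x = c(2.5, 7))

# "link": the linear predictor, on the scale of the link function.
pred.link <- predict(fit, newdata = new.data, type = "link")

# "response": the prediction on the scale of the dependent variable.
pred.response <- predict(fit, newdata = new.data, type = "response")

# For a log link, exp() of the linear predictor gives the response.
print(all.equal(exp(pred.link), pred.response))
```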

Examples

The training data:

> data.fit
   ID   X1 X2 X3      Y
1   0 0.00  A  1 -6.879
2   1 0.50  A  1 -3.449
3   2 0.54  B  1  6.635
4   3 1.04  B  1 11.844
5   4 1.50  A  1  2.786
6   5 0.04  B  2  2.389
7   6 2.00  A  2 -0.011
8   7 2.04  B  2  8.839
9   8 1.54  B  1  4.689
10  9 1.00  A  2 -5.507

Create a UnifiedRegression model for linear regression:

umlr <- hanaml.UnifiedRegression(data = data.fit,
                                 key = "ID",
                                 label = "Y",
                                 solver = "qr",
                                 adjusted.r2 = FALSE,
                                 func = "LinearRegression",
                                 thread.ratio = 0.5,
                                 partition.method = "random",
                                 training.percent = 0.7,
                                 partition.random.state = 2,
                                 output.partition.result = TRUE)
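
The call above requests a random partition with 70% of the rows used for training and a fixed seed. As a rough mental model only, the base R sketch below splits ten toy IDs the same way. The helper random.partition is hypothetical, and PAL's own random number generator will generally pick different rows for the same seed.

```r
# Hypothetical helper: split IDs into training and test sets at random.
random.partition <- function(ids, training.percent, seed) {
  set.seed(seed)  # fixed seed makes the split reproducible
  n.train <- round(training.percent * length(ids))
  train <- sort(ids[sample.int(length(ids), n.train)])
  list(train = train, test = setdiff(ids, train))
}

# Ten IDs, as in data.fit (0 to 9), with a 70/30 split and seed 2.
ids <- 0:9
part <- random.partition(ids, training.percent = 0.7, seed = 2)

print(part$train)
print(part$test)
```

With ten rows and training.percent = 0.7, seven rows go to training and three are held out. The held-out rows are the ones behind statistics whose names start with TEST_.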

Check the resulting statistics:

> umlr$statistics
       STAT_NAME         STAT_VALUE
1      TEST_EVAR  0.871459247598903
2       TEST_MAE 2.0088082000000003
3      TEST_MAPE 12.260003987804756
4 TEST_MAX_ERROR  5.329849599999999
5       TEST_MSE  9.551661310681718
6        TEST_R2 0.7774293644548433
7      TEST_RMSE   3.09057621013974
8     TEST_WMAPE 0.7188006440839695

Data for model scoring:

> df.score
  ID      X1 X2 X3   Y
1  0   1.690  B  1 1.2
2  1   0.054  B  2 2.1
3  2 980.123  A  2 2.4
4  3   1.000  A  1 1.8
5  4   0.563  A  1 1.0

Apply the obtained linear regression model to the scoring data:

score.res <- umlr$score(data = df.score, key = "ID", label = "Y")

Check the statistics on scoring data:

> score.res[[2]]$Collect()
  STAT_NAME         STAT_VALUE
1      EVAR -6284768.906191169
2       MAE  666.5116459919999
3      MAPE  278.9837795885635
4 MAX_ERROR 3315.9714402299996
5       MSE  2199151.795823181
6        R2  -7854112.55651136
7      RMSE 1482.9537402842952
8     WMAPE  392.0656741129411
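
To help read the statistic names above, the base R sketch below computes the same set of metrics from vectors of actual and predicted values, using common textbook definitions. The helper regression.stats and the toy numbers are hypothetical, and PAL's exact conventions (for example how MAPE is scaled) are an assumption here rather than something this page states.

```r
# Hypothetical helper: common regression metrics in base R.
regression.stats <- function(actual, predicted) {
  err <- actual - predicted
  c(EVAR      = 1 - var(err) / var(actual),          # explained variance
    MAE       = mean(abs(err)),                      # mean absolute error
    MAPE      = mean(abs(err / actual)),             # mean absolute percentage error
    MAX_ERROR = max(abs(err)),                       # largest absolute error
    MSE       = mean(err^2),                         # mean squared error
    R2        = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    RMSE      = sqrt(mean(err^2)),                   # root mean squared error
    WMAPE     = sum(abs(err)) / sum(abs(actual)))    # weighted MAPE
}

# Toy values, invented for illustration only.
stats <- regression.stats(actual = c(1, 2, 3, 4),
                          predicted = c(1.5, 2, 2.5, 4))
print(round(stats, 4))
```

Under these definitions, R2 becomes negative whenever the squared prediction error exceeds the spread of the actual values around their mean. That is the situation in the scoring output above.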

See also