Unified Regression

hanaml.UnifiedRegression is an R wrapper for SAP HANA PAL Unified Regression.

hanaml.UnifiedRegression(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  output.partition.result = NULL,
  ...
)

Arguments

data	`DataFrame` DataFrame containing the data.
func	`character` The functionality for unified regression. Valid values are as follows: "DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. If not specified, defaults to the last non-purpose column.
purpose	`character, optional` Name of the column which specifies the user-defined data partition. Mandatory if partition.method is "predefined".
formula	`formula type, optional` Formula to be used for model generation, in the format label ~ <feature_list>, e.g. formula = CATEGORY ~ V1 + V2 + V3. Provide either the formula or a features and label combination, but not both. Defaults to NULL.
partition.method	`character, optional` Specifies the method for partitioning the training data. Valid options include: "no", "predefined", "random". Defaults to "no" if not specified (i.e. no data partition).
partition.random.state	`integer, optional` Specifies the random seed for the random partition. Valid only when `partition.method` is set to "random". Defaults to 0 (system time).
training.percent	`numeric, optional` Specifies the percentage of data used for training. Valid only when `partition.method` is set to "random". Defaults to 0.8.
output.partition.result	`logical, optional` Specifies whether or not to output the partition result of the training data. Defaults to FALSE.
...	Specifies other parameters for training a regression model with the functionality specified in func. See the documentation of the corresponding function for details: `hanaml.DecisionTreeRegressor`, `hanaml.RDTRegressor`, `hanaml.MLPRegressor`, `hanaml.HGBTRegressor`, `hanaml.SVR`, `hanaml.ExponentialRegression`, `hanaml.LogarithmicRegression`, `hanaml.PolynomialRegression`, `hanaml.GeometricRegression`, `hanaml.LinearRegression`, `hanaml.GLM`. However, some parameters are disabled. The disabled parameters are: DecisionTreeRegressor: `output.rules`; RDTRegressor: `calculate.oob`; HGBTRegressor: `calculate.importance`.
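
The formula argument names the label on the left-hand side and the features on the right-hand side. The following base R sketch shows how such a formula splits into the two parts. It only illustrates the equivalence between a formula and a features and label combination; how hanaml parses the formula internally is not documented here, so the decomposition below is an assumption.

```r
# Illustration only: split a formula of the form label ~ features
# into a label name and feature names using base R.
f <- CATEGORY ~ V1 + V2 + V3

label <- all.vars(f)[1]                     # variable on the left-hand side
features <- attr(terms(f), "term.labels")   # terms on the right-hand side

print(label)
print(features)
```

Under this reading, formula = CATEGORY ~ V1 + V2 + V3 corresponds to label = "CATEGORY" together with features V1, V2 and V3.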

Value

Returns a "UnifiedRegression" object with the following attributes and methods:

model DataFrame

ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content

importance DataFrame

VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance

optimal.param DataFrame

PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value

statistics DataFrame

STAT_NAME - Statistics name
STAT_VALUE - Statistics value

confusion.matrix DataFrame

ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records

metrics DataFrame

NAME - Metric name
X - X value
Y - Y value

score() Function
Parameters:

data DataFrame
Input data for calculating score metrics.
key character
Specifies name of the ID column for input data.
features list/vector of characters, optional
Specifies names of the feature columns, i.e. independent columns.
Defaults to all non-key, non-label columns if not provided.
label character, optional
Specifies name of dependent column in the input data.
Defaults to the last non-key column if not provided.
thread.ratio numeric, optional
Specifies the ratio of total number of threads that can be used by the score function.
Defaults to 1.0.
func character, optional
The functionality for unified regression model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression", "SVM", "MLP", "PolynomialRegression", "LogarithmicRegression", "ExponentialRegression", "GeometricRegression", "GLM".
prediction.type character, optional
Specifies the type of prediction in the result table.
Available options include:
- response : direct response (with link function applied)
- link : linear response (without link function applied)
Valid only for GLM models.
significance.level numeric, optional
Specifies significance level for the confidence interval and prediction interval.
Valid only for GLM models where IRLS method is applied. Defaults to 0.05.
handle.missing character, optional
Specifies how to handle missing values in data.
- "skip": Skip rows with missing values
- "fill_zero": Replace missing values with 0 before prediction
Valid only for GLM models.
Defaults to "fill_zero".
block.size integer, optional
Specifies the number of data records loaded at a time during scoring.
- 0: load all data at once
- Others: the specified number
This parameter reduces memory consumption, especially when the prediction data is huge
or contains a large number of missing independent variables.
However, you might lose some efficiency. Valid only for Random Decision Trees models.
Defaults to 0.
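
The two handle.missing options can be pictured with a small local data frame. This is a base R sketch of the described semantics only, not of how PAL processes the data; the data frame df and its values are hypothetical.

```r
# Hypothetical local data with one missing feature value.
df <- data.frame(ID = 1:3, X1 = c(0.5, NA, 2.0))

# "skip": rows containing a missing value are dropped.
skipped <- df[complete.cases(df), ]

# "fill_zero": missing values are replaced with 0.
filled <- df
filled$X1[is.na(filled$X1)] <- 0

print(skipped)
print(filled)
```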

Examples

The training data:

> data.fit
   ID   X1 X2 X3      Y
1   0 0.00  A  1 -6.879
2   1 0.50  A  1 -3.449
3   2 0.54  B  1  6.635
4   3 1.04  B  1 11.844
5   4 1.50  A  1  2.786
6   5 0.04  B  2  2.389
7   6 2.00  A  2 -0.011
8   7 2.04  B  2  8.839
9   8 1.54  B  1  4.689
10  9 1.00  A  2 -5.507

Create a UnifiedRegression model for linear regression:

umlr <- hanaml.UnifiedRegression(data = data.fit,
                                 key = "ID",
                                 label = "Y",
                                 solver = "qr",
                                 adjusted.r2 = FALSE,
                                 func = "LinearRegression",
                                 thread.ratio = 0.5,
                                 partition.method = "random",
                                 training.percent = 0.7,
                                 partition.random.state = 2,
                                 output.partition.result = TRUE)
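
To make the partition settings in this call concrete, the following base R sketch performs a seeded 70/30 random split of ten row indices. It is a local illustration only: the sampling procedure PAL uses, and the rows it would select for seed 2, are not documented here and will generally differ.

```r
# Local illustration of a seeded random split; not PAL's algorithm.
n <- 10                     # number of rows, as in data.fit
training.percent <- 0.7
set.seed(2)                 # plays the role of partition.random.state

train.idx <- sort(sample(seq_len(n), size = round(n * training.percent)))
test.idx <- setdiff(seq_len(n), train.idx)

length(train.idx)           # number of rows used for training
length(test.idx)            # number of rows held out
```

With ten rows and training.percent = 0.7, seven rows go to training and three are held out.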

Check the resulting statistics:

> umlr$statistics
       STAT_NAME         STAT_VALUE
1      TEST_EVAR  0.871459247598903
2       TEST_MAE 2.0088082000000003
3      TEST_MAPE 12.260003987804756
4 TEST_MAX_ERROR  5.329849599999999
5       TEST_MSE  9.551661310681718
6        TEST_R2 0.7774293644548433
7      TEST_RMSE   3.09057621013974
8     TEST_WMAPE 0.7188006440839695
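
Several of these statistics have standard textbook definitions. The sketch below computes a subset of them in base R from hypothetical actual and predicted values. The function regression.metrics is not part of hanaml, and PAL's exact definitions (in particular for EVAR, MAPE and WMAPE, which are omitted here) may differ.

```r
# Textbook definitions of common regression metrics (assumed, not PAL's code).
regression.metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MAE = mean(abs(err)),
    MAX_ERROR = max(abs(err)),
    MSE = mse,
    RMSE = sqrt(mse),
    R2 = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Hypothetical values, not taken from the model above.
m <- regression.metrics(actual = c(1, 2, 3, 4),
                        predicted = c(1.5, 2, 2.5, 4))
print(m)

# Consistency check on the reported values: RMSE is the square root of MSE.
rmse.check <- sqrt(9.551661310681718)
print(rmse.check)
```

The last value agrees with the reported TEST_RMSE of about 3.0906.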

Data for model scoring:

> df.score
  ID      X1 X2 X3   Y
1  0   1.690  B  1 1.2
2  1   0.054  B  2 2.1
3  2 980.123  A  2 2.4
4  3   1.000  A  1 1.8
5  4   0.563  A  1 1.0

Apply the obtained linear regression model to the scoring data:

score.res <- umlr$score(data = df.score, key = "ID", label = "Y")

Check the statistics on scoring data:

> score.res[[2]]$Collect()
  STAT_NAME         STAT_VALUE
1      EVAR -6284768.906191169
2       MAE  666.5116459919999
3      MAPE  278.9837795885635
4 MAX_ERROR 3315.9714402299996
5       MSE  2199151.795823181
6        R2  -7854112.55651136
7      RMSE 1482.9537402842952
8     WMAPE  392.0656741129411
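
The very large error values are dominated by a single row. Assuming MAE is the mean absolute error over the five scoring rows, the arithmetic below shows that the largest error accounts for almost all of the total absolute error. The output does not identify the row, but the row with X1 = 980.123, which lies far outside the range of X1 in data.fit (0 to 2.04), is the likely cause.

```r
# Values copied from the scoring statistics above.
n <- 5                            # rows in df.score
mae <- 666.5116459919999
max.error <- 3315.9714402299996

total.abs.error <- n * mae        # sum of absolute errors over all rows
share <- max.error / total.abs.error
remaining <- total.abs.error - max.error

print(share)                      # fraction of the total due to the largest error
print(remaining)                  # absolute error left for the other four rows
```

More than 99% of the total absolute error comes from the single largest error, so the statistics say little about how the model performs on the other four rows.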
