hanaml.UnifiedRegression is an R wrapper for SAP HANA PAL Unified Regression.
hanaml.UnifiedRegression(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  purpose = NULL,
  formula = NULL,
  partition.method = NULL,
  partition.random.state = NULL,
  training.percent = NULL,
  output.partition.result = NULL,
  ...
)
| data | |
|---|---|
| func | |
| key | |
| features | |
| label | |
| purpose | |
| formula | |
| partition.method | |
| partition.random.state | |
| training.percent | |
| output.partition.result | |
| ... | |
Returns a "UnifiedRegression" object with the following attributes and methods:
model DataFrame
ROW_INDEX - model row index
PART_INDEX - data partition index
MODEL_CONTENT - model content
importance DataFrame
VARIABLE_NAME - Independent variable name
IMPORTANCE - Variable importance
optimal.param DataFrame
PARM_NAME - parameter name
INT_VALUE - integer value
DOUBLE_VALUE - double value
STRING_VALUE - character value
statistics DataFrame
STAT_NAME - Statistics name
STAT_VALUE - Statistics value
confusion.matrix DataFrame
ACTUAL_CLASS - The actual class name
PREDICTED_CLASS - The predicted class name
COUNT - Number of records
metrics DataFrame
NAME - Metric name
X - X value
Y - Y value
score() Function
Parameters:
data DataFrame
Input data for calculating score metrics.
key character
Specifies name of the ID column for input data.
features list/vector of characters, optional
Specifies the names of the feature columns, i.e. the independent columns.
Defaults to all non-key, non-label columns if not provided.
label character, optional
Specifies name of dependent column in the input data.
Defaults to the last non-key column if not provided.
thread.ratio numeric, optional
Specifies the ratio of total number of threads that
can be used by the score function.
Defaults to 1.0.
func character, optional
The functionality for unified regression model.
Mandatory only when the func attribute of model is NULL.
Valid values are as follows:
"DecisionTree", "RandomDecisionTrees", "HGBT", "LinearRegression",
"SVM", "MLP", "PolynomialRegression", "LogarithmicRegression",
"ExponentialRegression", "GeometricRegression", "GLM".
prediction.type character, optional
Specifies the type of prediction in the result table.
Available options include:
response : direct response (with link function applied)
link : linear response (without link function applied)
significance.level numeric, optional
Specifies significance level for the confidence interval and prediction interval.
Valid only for GLM models where IRLS method is applied.
Defaults to 0.05.
handle.missing character, optional
Specifies how to handle missing values in the data.
"skip": Skip rows with missing values
"fill_zero": Replace missing values with 0 before prediction
block.size integer, optional
Specifies the number of data records loaded at a time during scoring.
0: load all data at once
Others: the specified number
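The difference between the "response" and "link" options of prediction.type can be illustrated locally with base R's glm(). This is only a sketch of the general GLM concept, not of hanaml or PAL: the toy data frame and its values are made up, and in GLM terms the response scale is obtained by applying the inverse link function to the linear predictor.

```r
# Local illustration only: base R glm(), not hanaml or PAL.
# Toy count data (hypothetical values) for a Poisson GLM with a log link.
toy <- data.frame(x = c(0, 1, 2, 3, 4, 5),
                  y = c(1, 2, 4, 7, 12, 20))
fit <- glm(y ~ x, family = poisson(link = "log"), data = toy)

# "link": the linear predictor eta = b0 + b1 * x.
eta <- predict(fit, newdata = toy, type = "link")

# "response": the inverse link applied to eta, i.e. exp(eta) for a log link.
mu <- predict(fit, newdata = toy, type = "response")

# The two scales are related by the inverse link function.
same.scale <- isTRUE(all.equal(unname(mu), unname(exp(eta))))
print(same.scale)
```

For models with an identity link, such as ordinary linear regression, the two scales coincide.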
The training data:
> data.fit
   ID   X1 X2 X3      Y
1   0 0.00  A  1 -6.879
2   1 0.50  A  1 -3.449
3   2 0.54  B  1  6.635
4   3 1.04  B  1 11.844
5   4 1.50  A  1  2.786
6   5 0.04  B  2  2.389
7   6 2.00  A  2 -0.011
8   7 2.04  B  2  8.839
9   8 1.54  B  1  4.689
10  9 1.00  A  2 -5.507
Create a UnifiedRegression model for linear regression:
umlr <- hanaml.UnifiedRegression(data = data.fit,
                                 key = "ID",
                                 label = "Y",
                                 func = "LinearRegression",
                                 solver = "qr",
                                 adjusted.r2 = FALSE,
                                 thread.ratio = 0.5,
                                 partition.method = "random",
                                 training.percent = 0.7,
                                 partition.random.state = 2,
                                 output.partition.result = TRUE)
Check the resulting statistics:
> umlr$statistics
STAT_NAME STAT_VALUE
1 TEST_EVAR 0.871459247598903
2 TEST_MAE 2.0088082000000003
3 TEST_MAPE 12.260003987804756
4 TEST_MAX_ERROR 5.329849599999999
5 TEST_MSE 9.551661310681718
6 TEST_R2 0.7774293644548433
7 TEST_RMSE 3.09057621013974
8 TEST_WMAPE 0.7188006440839695
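As a rough guide to reading these statistics, the sketch below computes several of them from a vector of actual and predicted values using the usual textbook definitions. The helper regression.metrics and the sample values are hypothetical and not part of hanaml; PAL's exact definitions may differ in detail, and MAPE and WMAPE are left out because their conventions vary.

```r
# Hypothetical helper, not part of hanaml: textbook regression metrics.
regression.metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MAE       = mean(abs(err)),              # mean absolute error
    MAX_ERROR = max(abs(err)),               # largest absolute error
    MSE       = mse,                         # mean squared error
    RMSE      = sqrt(mse),                   # root mean squared error
    R2        = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    EVAR      = 1 - var(err) / var(actual))  # explained variance
}

# Made-up values, for illustration only.
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4)
m <- regression.metrics(actual, predicted)
print(m)
```

Under these definitions R2 has no lower bound: it becomes negative whenever the squared prediction error exceeds the spread of the actual values around their mean.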
Data for model scoring:
> df.score
  ID      X1 X2 X3   Y
1  0   1.690  B  1 1.2
2  1   0.054  B  2 2.1
3  2 980.123  A  2 2.4
4  3   1.000  A  1 1.8
5  4   0.563  A  1 1.0
Apply the obtained linear regression model to the scoring data:
score.res <- umlr$score(data = df.score, key = "ID", label = "Y")
Check the statistics on scoring data:
> score.res[[2]]$Collect()
STAT_NAME STAT_VALUE
1 EVAR -6284768.906191169
2 MAE 666.5116459919999
3 MAPE 278.9837795885635
4 MAX_ERROR 3315.9714402299996
5 MSE 2199151.795823181
6 R2 -7854112.55651136
7 RMSE 1482.9537402842952
8 WMAPE 392.0656741129411
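The very large errors and the strongly negative R2 on the scoring data are plausibly driven by extrapolation: one scoring row has an X1 value far outside anything in the training data. The sketch below checks this locally. It assumes the X1 columns shown above have been copied into plain R vectors (train.x1 and score.x1 are hypothetical names, not HANA DataFrames).

```r
# Local copies of the X1 columns shown above (plain vectors, not HANA DataFrames).
train.x1 <- c(0.00, 0.50, 0.54, 1.04, 1.50, 0.04, 2.00, 2.04, 1.54, 1.00)
score.x1 <- c(1.690, 0.054, 980.123, 1.000, 0.563)

# Flag scoring rows whose X1 lies outside the range seen in training.
rng <- range(train.x1)
outside <- score.x1 < rng[1] | score.x1 > rng[2]

# Row positions in the scoring data that require extrapolation.
extrapolated <- which(outside)
print(extrapolated)
```

Only the third scoring row (X1 = 980.123) falls outside the training range of X1, so the scoring statistics above say more about that single row than about the model's accuracy on typical inputs.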