hanaml.AutomaticRegression.Rd
AutomaticRegression offers an intelligent search amongst
machine learning pipelines for supervised regression tasks.
Each machine learning pipeline contains several operators
such as preprocessors, supervised regression models, and
transformers.
hanaml.AutomaticRegression(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
pipeline = NULL,
categorical.variable = NULL,
scorings = NULL,
generations = NULL,
population.size = NULL,
offspring.size = NULL,
elite.number = NULL,
min.layer = NULL,
max.layer = NULL,
mutation.rate = NULL,
crossover.rate = NULL,
random.seed = NULL,
config.dict = NULL,
progress.indicator.id = NULL,
fold.num = NULL,
resampling.method = NULL,
max.eval.time.mins = NULL,
early.stop = NULL,
model.table.name = NULL,
successive.halving = NULL,
min.budget = NULL,
max.budget = NULL,
min.individuals = NULL,
background.size = NULL,
background.sampling.seed = NULL
)
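As a minimal sketch (df.fit, "ID" and "Y" are placeholder names for a hanaml DataFrame, its key column and its numeric label column; all other parameters fall back to their defaults), a call could look like:
> auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                         key = "ID",
                                         label = "Y")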
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last column.
json character or list, optional
If pipeline is provided, the input pipeline is used directly for fitting.
A pipeline example in JSON format is below:
'{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.001,"MAX_DEPTH":8,"MIN_CHILD_HESSIAN":8.0},"inputs":{"data":{"CATPCA":{"args":{"COMPONENTS_PERCENTAGE":0.7,"SCALING":1,"COMPONENT_TOL":0.0,"MAX_ITERATION":1000,"CONVERGE_TOL":0.00001,"LANCZOS_ITERATION":1000,"SVD_CALCULATOR":0},"inputs":{"data":"ROWDATA"}}}}}}'
Defaults to NULL.
character or list of characters, optional
Indicates which variables are treated as categorical. The default behavior is:
"VARCHAR" and "NVARCHAR": categorical.
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER", "VARCHAR" and "NVARCHAR" type; omitted otherwise.
Defaults to NULL.
list, optional
AutomaticRegression supports multi-objective optimization with specified weights for each target.
The goal is to maximize each target; therefore, to minimize a target, its weight must be negative.
The target options are below:
R2 : R-squared. The bigger, the better. Should use a positive weight.
RMSE : Root Mean Squared Error. The smaller, the better. Should use a negative weight.
MAE : Mean Absolute Error. The smaller, the better. Should use a negative weight.
WMAPE : Weighted Mean Absolute Percentage Error. The smaller, the better. Should use a negative weight.
MSLE : Mean Squared Logarithmic Error. The smaller, the better. Should use a negative weight.
MAX_ERROR : The max absolute difference between the observed value and the expected value. The smaller, the better. Should use a negative weight.
EVAR : Explained Variance measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. The bigger, the better. Should use a positive weight.
LAYERS : The number of operators. The smaller, the better. Should use a negative weight.
Defaults to list(MAE = -1.0, EVAR = 1.0).
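For example, to favor pipelines with low RMSE and few operators, both targets can be given negative weights (a sketch; df.fit and the weight values are illustrative only):
> auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                         scorings = list(RMSE = -1.0,
                                                         LAYERS = -0.3))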
integer, optional
The number of iterations of the pipeline optimization.
Defaults to 5.
integer, optional
The number of individuals in each generation of the
genetic programming algorithm.
Defaults to 20.
integer, optional
The number of offspring to produce in each generation.
Defaults to the value of population.size.
integer, optional
The number of elite pipelines to output into the result table.
Defaults to 1/4 of population.size.
integer, optional
The minimum number of operators in a pipeline.
Defaults to 1.
integer, optional
The maximum number of operators in a pipeline.
Defaults to 5.
numeric, optional
The mutation rate for the genetic programming algorithm.
Defaults to 0.9.
integer, optional
Specifies the seed for the random number generator. Uses system time if not provided.
No default value.
character or list, optional
Specifies the Config Dict, i.e. the configuration of the search space:
'default'
- default config dict.
'light'
- light config dict.
Alternatively, you can input a customized config.dict as a list,
or use the method update.config.dict() to generate one.
If not specified, the default config dict is used.
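For instance, to restrict the search to the smaller predefined search space (a sketch; df.fit is a placeholder training DataFrame):
> auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                         config.dict = "light")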
character, optional
Set the ID used to output monitoring information of
the optimization progress.
No default value.
integer, optional
The number of folds in the cross-validation process.
Defaults to 5.
"cv", optional
Specifies the resampling method for pipeline evaluation.
Defaults to "cv".
numeric, optional
Time limit for evaluating a single pipeline, in minutes.
Defaults to 0.0 (there is no time limit).
integer, optional
Stops the optimization when the best pipeline has not been
updated for the given number of consecutive generations.
0 means there is no early stop.
Defaults to 5.
character, optional
Specifies the HANA model table name instead of the generated temporary table.
Defaults to NULL.
logical, optional
Specifies whether to use successive halving in the evaluation phase.
Defaults to NULL.
integer, optional
Specifies the minimum budget (the minimum evaluation dataset size) when successive halving is applied.
Defaults to NULL.
integer, optional
Specifies the maximum budget (the maximum evaluation dataset size) when successive halving is applied.
Defaults to NULL.
integer, optional
Specifies the minimum number of individuals in the evaluation phase when successive halving is applied.
Defaults to NULL.
integer, optional
If set, the reason code procedure will be enabled. Only valid when pipeline is provided.
Defaults to NULL.
integer, optional
Specifies the seed for the random number generator in background sampling. Only valid when pipeline is provided.
0 - Uses the current time (in second) as seed.
others - Uses the specified value as seed.
Defaults to NULL.
numeric, optional
The crossover rate for the genetic programming algorithm.
Defaults to 0.1.
Returns an object with the following attributes and methods:
If input parameter pipeline is NULL:
best.pipeline DataFrame
Best pipelines selected, structured as follows:
ID
- pipeline ID.
PIPELINE
- pipeline contents.
SCORES
- scoring metrics for pipeline.
model DataFrame
model[[1]]: A fitted pipeline model, structured as follows:
ROW_INDEX
- pipeline model row index
MODEL_CONTENT
- model content
model[[2]]: best.pipeline. Please refer to the content of best.pipeline above.
info DataFrame
Related info/statistics of pipeline fitting, structured as follows:
STAT_NAME
- statistic name.
STAT_VALUE
- statistic value.
If pipeline is not NULL:
model DataFrame
A fitted pipeline model, structured as follows:
ROW_INDEX
- model row index
MODEL_CONTENT
- model content
info DataFrame
Related info/statistics of pipeline fitting, structured as follows:
STAT_NAME
- statistic name.
STAT_VALUE
- statistic value.
evaluate()
Parameters:
data DataFrame
Input data for calculating score metrics.
pipeline json character or list
Pipeline to be evaluated.
key character, optional
Specifies name of the ID column for input data.
Defaults to the first column.
features list/vector of characters, optional
Specifies names of the feature columns, i.e.
independent columns.
Defaults to all non-key, non-label columns if not provided.
label character, optional
Specifies name of dependent column in the input data.
Defaults to the last non-key column if not provided.
categorical.variable character or list of characters, optional
Indicates features that should be treated as categorical variable.
The default behavior is dependent on what input is given:
"VARCHAR" and "NVARCHAR": categorical.
"INTEGER" and "DOUBLE": continuous.
Defaults to the value of categorical.variable
in the model training phase.
resampling.method character, optional
The resampling method for pipeline model evaluation.
Defaults to (and can only be) 'cv' if the estimator in
pipeline is a regressor.
fold.num integer, optional
The fold number for cross validation.
Defaults to 5.
random.state integer, optional
Specifies the seed for the random number generator.
Defaults to system time if not provided.
display.config.dict()
Parameters:
connection.context ConnectionContext, optional
A connection to a SAP HANA system.
Defaults to NULL.
update.config.dict()
The execution order of these parameters in the Config Dict customizing process is:
1.config.dict.
2.config.remove.
3.config.add.
4.config.replace.
5.config.modify.
Parameters:
config.dict character or JSON character, optional
Specifies config dict. The options are "default" (use default config dict), "light", "empty", and customized Config Dict JSON character.
Defaults to NULL.
config.remove list or character, optional
Remove the specified estimator configs in Config Dict.
Defaults to NULL.
config.add JSON character, optional
Adds estimator configs to the Config Dict.
Defaults to NULL.
config.replace JSON character, optional
Replace the specified estimator configs in Config Dict.
Defaults to NULL.
config.modify JSON character, optional
Modify the specified estimator configs in Config Dict.
Defaults to NULL.
connection.context ConnectionContext, optional
A connection to a SAP HANA system.
Defaults to NULL.
Assume we have df.fit for training, df.predict for prediction and df.score for evaluation.
> auto.reg <- hanaml.AutomaticRegression(data = df.fit,
pipeline = NULL,
categorical.variable =
list("OUTLOOK", "WINDY"),
scorings = NULL,
generations = 2,
population.size = 5,
offspring.size = 10,
elite.number = 3,
min.layer = 2,
max.layer = 5,
mutation.rate = 0.1,
crossover.rate = 0.9,
random.seed = 1,
config.dict = NULL,
progress.indicator.id = "AUTOML_REG_TEST",
fold.num = 5,
resampling.method = "cv",
max.eval.time.mins = NULL,
early.stop = 3)
The output can be retrieved via the following lines:
> print(auto.reg$best.pipeline$Collect())
> print(auto.reg$model[[1]]$Collect())
> print(auto.reg$info$Collect())
If we have a pipeline and want to train a model:
pl <- '{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.001,"MAX_DEPTH":8,"MIN_CHILD_HESSIAN":8.0},"inputs":{"data":{"CATPCA":{"args":{"COMPONENTS_PERCENTAGE":0.7,"SCALING":1,"COMPONENT_TOL":0.0,"MAX_ITERATION":1000,"CONVERGE_TOL":0.00001,"LANCZOS_ITERATION":1000,"SVD_CALCULATOR":0},"inputs":{"data":"ROWDATA"}}}}}}'
auto.reg2 <- hanaml.AutomaticRegression(data = df.fit,
pipeline = pl,
categorical.variable = list("OUTLOOK", "WINDY"))
The output can be retrieved via the following lines:
> print(auto.reg2$model$Collect())
> print(auto.reg2$info$Collect())
If we want to predict:
> pre.res <- predict(model = auto.reg,
data = df.predict,
key = "ID")
The output can be retrieved via the following line:
> print(pre.res$Collect())
If we want to evaluate a pipeline:
> res <- auto.reg$evaluate(data=df.score,
pipeline=pl,
key="ID",
label="CLASS",
fold.num = 5,
resampling.method = "CV",
random.seed = 2)
The output can be retrieved via the following line:
> print(res$Collect())
If we want to see the config.dict:
> res <- auto.reg$display.config.dict()
The output can be retrieved via the following line:
> print(res$Collect())
If we want to update the config.dict:
> add.op <- '{"RDT_Regressor":{"TREES_NUM":[10],"NODE_SIZE":{"range":[1,1,21]},"SPLIT_THRESHOLD":[1e-9],"CALCULATE_OOB":[0],"SAMPLE_FRACTION":[0.75,1.0]}}'
> replace.op <- '{"RDT_Regressor":{"TREES_NUM":[1000, 200],"NODE_SIZE":{"range":[1,1,20]},"SPLIT_THRESHOLD":[1e-9],"CALCULATE_OOB":[0],"SAMPLE_FRACTION":[0.75,1.0]}}'
> modify.op <- '{"RDT_Regressor":{"TREES_NUM":[17]}}'
> res.update <- auto.reg$update.config.dict(config.remove=list('EXP_Regressor', 'RDT_Regressor'),
config.add=add.op,
config.replace=replace.op,
config.modify=modify.op)
The output can be retrieved via the following lines:
> print(res.update[[1]])
> print(res.update[[2]]$Collect())