AutomaticRegression offers an intelligent search over machine learning pipelines for supervised regression tasks. Each machine learning pipeline consists of several operators, such as preprocessors, supervised regression models and transformers.

hanaml.AutomaticRegression(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  pipeline = NULL,
  categorical.variable = NULL,
  scorings = NULL,
  generations = NULL,
  population.size = NULL,
  offspring.size = NULL,
  elite.number = NULL,
  min.layer = NULL,
  max.layer = NULL,
  mutation.rate = NULL,
  crossover.rate = NULL,
  random.seed = NULL,
  config.dict = NULL,
  progress.indicator.id = NULL,
  fold.num = NULL,
  resampling.method = NULL,
  max.eval.time.mins = NULL,
  early.stop = NULL,
  model.table.name = NULL,
  successive.halving = NULL,
  min.budget = NULL,
  max.budget = NULL,
  min.individuals = NULL,
  background.size = NULL,
  background.sampling.seed = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
If not specified, defaults to the last column.

pipeline

json character or list, optional
If pipeline is provided, the input pipeline is used directly for fitting.
A pipeline example in JSON format is below:

'{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.001,"MAX_DEPTH":8,"MIN_CHILD_HESSIAN":8.0},"inputs":{"data":{"CATPCA":{"args":{"COMPONENTS_PERCENTAGE":0.7,"SCALING":1,"COMPONENT_TOL":0.0,"MAX_ITERATION":1000,"CONVERGE_TOL":0.00001,"LANCZOS_ITERATION":1000,"SVD_CALCULATOR":0},"inputs":{"data":"ROWDATA"}}}}}}'

Defaults to NULL.
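
Since pipeline also accepts a list, the JSON example above can equivalently be written as a nested R list. The following is a sketch, assuming the list form mirrors the JSON structure operator by operator:

```r
# Nested-list form of the pipeline above: an HGBT regressor whose
# input data is first transformed by CATPCA.
pl <- list(HGBT_Regressor = list(
  args = list(ITER_NUM = 100, OBJ_FUNC = 2, ETA = 0.001,
              MAX_DEPTH = 8, MIN_CHILD_HESSIAN = 8.0),
  inputs = list(data = list(CATPCA = list(
    args = list(COMPONENTS_PERCENTAGE = 0.7, SCALING = 1,
                COMPONENT_TOL = 0.0, MAX_ITERATION = 1000,
                CONVERGE_TOL = 0.00001, LANCZOS_ITERATION = 1000,
                SVD_CALCULATOR = 0),
    inputs = list(data = "ROWDATA"))))))
```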

categorical.variable

character or list of characters, optional
Indicate which variables are treated as categorical. The default behavior is:

  • "VARCHAR" and "NVARCHAR": categorical.

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER", "VARCHAR" and "NVARCHAR" type; omitted otherwise.
Defaults to NULL.

scorings

list, optional
AutomaticRegression supports multi-objective optimization with a specified weight for each target. The goal is to maximize the weighted sum of targets; therefore, to minimize a target, give it a negative weight. The target options are below:

  • R2 : R-squared. The bigger, the better. Should use a positive weight.

  • RMSE : Root Mean Squared Error. The smaller, the better. Should use a negative weight.

  • MAE : Mean Absolute Error. The smaller, the better. Should use a negative weight.

  • WMAPE : Weighted Mean Absolute Percentage Error. The smaller, the better. Should use a negative weight.

  • MSLE : Mean Squared Logarithmic Error. The smaller, the better. Should use a negative weight.

  • MAX_ERROR : The maximum absolute difference between the observed value and the expected value. The smaller, the better. Should use a negative weight.

  • EVAR : Explained Variance measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. The bigger, the better. Should use a positive weight.

  • LAYERS : The number of operators. The smaller, the better. Should use a negative weight.

Defaults to list(MAE = -1.0, EVAR = 1.0).
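
For example, to search for pipelines with low RMSE while also penalizing deep pipelines, a scorings list could combine two negatively weighted targets (the weights below are illustrative, not recommended values):

```r
# Minimize RMSE and, with a smaller weight, the number of operators (LAYERS).
scorings <- list(RMSE = -1.0, LAYERS = -0.2)
```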

generations

integer, optional
The number of iterations of the pipeline optimization.
Defaults to 5.

population.size

integer, optional
The number of individuals in each generation of the genetic programming algorithm.
Defaults to 20.

offspring.size

integer, optional
The number of offspring to produce in each generation.
Defaults to the value of population.size.

elite.number

integer, optional
The number of elite pipelines to output to the result table.
Defaults to 1/4 of population.size.

min.layer

integer, optional
The minimum number of operators in a pipeline.
Defaults to 1.

max.layer

integer, optional
The maximum number of operators in a pipeline.
Defaults to 5.

mutation.rate

numeric, optional
The mutation rate for the genetic programming algorithm.
Defaults to 0.9.

random.seed

integer, optional
Specifies the seed for the random number generator. If not provided, the system time is used.
No default value.

config.dict

character or list, optional
Specifies Config Dict which is the configuration for the searching space:

  • 'default' - default config dict.

  • 'light' - light config dict.

Alternatively, you could input a customized config.dict as a list, or use the method update.config.dict() to generate a config.dict. If not specified, the default config dict is used.
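
As a sketch, a customized config dict restricting the search space to a single estimator could be passed as a JSON character; the operator name follows the pipeline example above, while the parameter values here are purely illustrative (the "range" form follows the update.config.dict() examples in this page):

```r
# Hypothetical search space containing only HGBT_Regressor, with a
# discrete choice for ITER_NUM and a "range" specification for ETA.
my.config <- '{"HGBT_Regressor":{"ITER_NUM":[100,200],"ETA":{"range":[0.001,0.01,0.1]}}}'
auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                       config.dict = my.config)
```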

progress.indicator.id

character, optional
Set the ID used to output monitoring information of the optimization progress.
No default value.

fold.num

integer, optional
The number of folds in the cross-validation process.
Defaults to 5.

resampling.method

character, optional
Specifies the resampling method for pipeline evaluation. Currently only "cv" is supported.
Defaults to "cv".

max.eval.time.mins

numeric, optional
Time limit for evaluating a single pipeline, in minutes.
Defaults to 0.0 (no time limit).

early.stop

integer, optional
Stops the optimization when the best pipeline has not been updated for the given number of consecutive generations. 0 means no early stop.
Defaults to 5.

model.table.name

character, optional
Specifies the name of a HANA model table to use instead of the generated temporary table.
Defaults to NULL.

successive.halving

logical, optional
Specifies whether to use successive halving in the evaluation phase.
Defaults to NULL.

min.budget

integer, optional
Specifies the minimum budget (the minimum evaluation dataset size) when successive halving is applied.
Defaults to NULL.

max.budget

integer, optional
Specifies the maximum budget (the maximum evaluation dataset size) when successive halving is applied.
Defaults to NULL.

min.individuals

integer, optional
Specifies the minimum number of individuals in the evaluation phase when successive halving is applied.
Defaults to NULL.
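
Taken together, the successive-halving parameters might be combined as in the following sketch (the budget values are illustrative and depend on the size of df.fit):

```r
# Evaluate candidate pipelines on growing data budgets, discarding
# weak candidates early until at least 3 individuals remain.
auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                       successive.halving = TRUE,
                                       min.budget = 100,
                                       max.budget = 1000,
                                       min.individuals = 3)
```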

background.size

integer, optional
If set, the reason-code procedure is enabled. Only valid when pipeline is provided.
Defaults to NULL.

background.sampling.seed

integer, optional
Specifies the seed for the random number generator in background sampling. Only valid when pipeline is provided.

  • 0 - Uses the current time (in second) as seed.

  • others - Uses the specified value as seed.

Defaults to NULL.

crossover.rate

numeric, optional
The crossover rate for the genetic programming algorithm.
Defaults to 0.1.

Value

Returns an object with the following attributes and methods:

  • If input parameter pipeline is NULL:

    best.pipeline DataFrame
    Best pipelines selected, structured as follows:

    • ID - pipeline ID.

    • PIPELINE - pipeline contents.

    • SCORES - scoring metrics for pipeline.


    model DataFrame
    model[[1]]: A fitted pipeline model, structured as follows:

    • ROW_INDEX - pipeline model row index

    • MODEL_CONTENT - model content

    model[[2]]: best.pipeline. Please refer to the content of best.pipeline above.

    info DataFrame
Related info/statistics of the pipeline fitting, structured as follows:

    • STAT_NAME - statistic name.

    • STAT_VALUE - statistic value.


  • If pipeline is not NULL:

    model DataFrame
    A fitted pipeline model, structured as follows:

    • ROW_INDEX - model row index

    • MODEL_CONTENT - model content


    info DataFrame
Related info/statistics of the pipeline fitting, structured as follows:

    • STAT_NAME - statistic name.

    • STAT_VALUE - statistic value.

Methods

evaluate()

Parameters:

  • data DataFrame
    Input data for calculating score metrics.

  • pipeline json character or list
    Pipeline to be evaluated.

  • key character, optional
    Specifies name of the ID column for input data.
    Defaults to the first column.

  • features list/vector of characters, optional
    Specifies names of the feature columns, i.e. independent columns.
    Defaults to all non-key, non-label columns if not provided.

  • label character, optional
    Specifies name of dependent column in the input data.
    Defaults to the last non-key column if not provided.

  • categorical.variable character or list of characters, optional
    Indicates features that should be treated as categorical variable.
    The default behavior is dependent on what input is given:

    • "VARCHAR" and "NVARCHAR": categorical.

    • "INTEGER" and "DOUBLE": continuous.

    Valid only for variables of "INTEGER", "VARCHAR" and "NVARCHAR" type; omitted otherwise.
    Defaults to the value of categorical.variable in the model training phase.

  • resampling.method character, optional
    The resampling method for pipeline model evaluation.
    Defaults to (and can only be) 'cv' if the estimator in pipeline is a regressor.

  • fold.num integer, optional
    The fold number for cross validation.
    Defaults to 5.

  • random.state integer, optional
    Specifies the seed for the random number generator.
    Defaults to system time.

display.config.dict()

Parameters:

  • connection.context ConnectionContext, optional
    A connection to a SAP HANA system.
    Defaults to NULL.

update.config.dict()

The execution order of these parameters in the Config Dict customizing process is:

  • 1. config.dict.

  • 2. config.remove.

  • 3. config.add.

  • 4. config.replace.

  • 5. config.modify.

Parameters:

  • config.dict character or JSON character, optional
    Specifies config dict. The options are "default" (use default config dict), "light", "empty", and customized Config Dict JSON character.
    Defaults to NULL.

  • config.remove list or character, optional
    Removes the specified estimator configs from the Config Dict.
    Defaults to NULL.

  • config.add JSON character, optional
    Adds estimator configs to the Config Dict.
    Defaults to NULL.

  • config.replace JSON character, optional
    Replaces the specified estimator configs in the Config Dict.
    Defaults to NULL.

  • config.modify JSON character, optional
    Modifies the specified estimator configs in the Config Dict.
    Defaults to NULL.

  • connection.context ConnectionContext, optional
    A connection to a SAP HANA system.
    Defaults to NULL.

Examples

Assume we have df.fit for training, df.predict for prediction and df.score for evaluation.


> auto.reg <- hanaml.AutomaticRegression(data = df.fit,
                                         pipeline = NULL,
                                         categorical.variable =
                                           list("OUTLOOK", "WINDY"),
                                         scorings = NULL,
                                         generations = 2,
                                         population.size = 5,
                                         offspring.size = 10,
                                         elite.number = 3,
                                         min.layer = 2,
                                         max.layer = 5,
                                         mutation.rate = 0.1,
                                         crossover.rate = 0.9,
                                         random.seed = 1,
                                         config.dict = NULL,
                                         progress.indicator.id = "AUTOML_REG_TEST",
                                         fold.num = 5,
                                         resampling.method = "cv",
                                         max.eval.time.mins = NULL,
                                         early.stop = 3)

The output could be achieved by the following lines:


> print(auto.reg$best.pipeline$Collect())
> print(auto.reg$model[[1]]$Collect())
> print(auto.reg$info$Collect())

If we have a pipeline and want to train a model:

pl <- '{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.001,"MAX_DEPTH":8,"MIN_CHILD_HESSIAN":8.0},"inputs":{"data":{"CATPCA":{"args":{"COMPONENTS_PERCENTAGE":0.7,"SCALING":1,"COMPONENT_TOL":0.0,"MAX_ITERATION":1000,"CONVERGE_TOL":0.00001,"LANCZOS_ITERATION":1000,"SVD_CALCULATOR":0},"inputs":{"data":"ROWDATA"}}}}}}'
auto.reg2 <- hanaml.AutomaticRegression(data = df.fit,
                                         pipeline = pl,
                                         categorical.variable = list("OUTLOOK", "WINDY"))

The output could be achieved by the following lines:


> print(auto.reg2$model$Collect())
> print(auto.reg2$info$Collect())

If we want to predict:


> pre.res <- predict(model = auto.reg,
                     data = df.predict,
                     key = "ID")

The output could be achieved by the following lines:


> print(pre.res$Collect())

If we want to evaluate a pipeline:


> res <- auto.reg$evaluate(data=df.score,
                           pipeline=pl,
                           key="ID",
                           label="CLASS",
                           fold.num = 5,
                           resampling.method = "cv",
                           random.seed = 2)

The output could be achieved by the following lines:


> print(res$Collect())

If we want to see the config.dict:


> res <- auto.reg$display.config.dict()

The output could be achieved by the following line:


> print(res$Collect())

If we want to update the config.dict:


> add.op <- '{"RDT_Regressor":{"TREES_NUM":[10],"NODE_SIZE":{"range":[1,1,21]},"SPLIT_THRESHOLD":[1e-9],"CALCULATE_OOB":[0],"SAMPLE_FRACTION":[0.75,1.0]}}'
> replace.op <- '{"RDT_Regressor":{"TREES_NUM":[1000, 200],"NODE_SIZE":{"range":[1,1,20]},"SPLIT_THRESHOLD":[1e-9],"CALCULATE_OOB":[0],"SAMPLE_FRACTION":[0.75,1.0]}}'
> modify.op <- '{"RDT_Regressor":{"TREES_NUM":[17]}}'
> res.update <- auto.reg$update.config.dict(config.remove=list('EXP_Regressor', 'RDT_Regressor'),
                                            config.add=add.op,
                                            config.replace=replace.op,
                                            config.modify=modify.op)

The output could be achieved by the following lines:


> print(res.update[[1]])
> print(res.update[[2]]$Collect())