hanaml.DecisionTreeRegressor {hana.ml.r}    R Documentation

Decision Tree Model for Regression

Description

hanaml.DecisionTreeRegressor is an R wrapper for the PAL decision tree algorithm.

Usage

hanaml.DecisionTreeRegressor (conn.context, data = NULL,
                             key = NULL, features = NULL,
                             label = NULL, formula = NULL,
                             thread.ratio = NULL,
                             allow.missing.dependent = NULL,
                             percentage = NULL,
                             min.records.of.parent = NULL,
                             min.records.of.leaf = NULL, max.depth = NULL,
                             categorical.variable = NULL,
                             split.threshold = NULL,
                             use.surrogate = NULL, model.format = NULL,
                             discretization.type = NULL,
                             bins = NULL, max.branch = NULL,
                             merge.threshold = NULL,
                             output.rules = NULL
                             )

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column of data. If not provided, then data is assumed to have no ID column.

features

list of character, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not provided.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either a formula or a features and label combination, but not both.
Defaults to NULL.
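
To make the formula notation concrete, the following base R sketch shows how a formula of this shape breaks down into a label name and a list of feature names. It only illustrates the notation; it is not a description of how hanaml.DecisionTreeRegressor parses the formula internally, and the column names are taken from the example in the description above.

```r
# Base R only; no SAP HANA connection is needed for this illustration.
f <- CATEGORY ~ V1 + V2 + V3

# Left-hand side of the formula: the label column.
label <- all.vars(f[[2]])

# Right-hand side terms: the feature columns, as a list of character.
features <- as.list(attr(terms(f), "term.labels"))

print(label)
print(unlist(features))
```

Under this reading, the formula above corresponds to passing label = "CATEGORY" together with features = list("V1", "V2", "V3").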

thread.ratio

double, optional
Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates up to all available threads. Values between 0 and 1 use up to that proportion of the available threads. For values outside this range, the number of threads is determined heuristically.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE does not allow missing target values; an error occurs if one is present. TRUE allows missing target values; records with a missing target are removed.
Defaults to TRUE.

percentage

double, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.

Defaults to 1.0.

min.records.of.parent

integer, optional
Specifies the stop condition. If the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min.records.of.leaf

integer, optional
Specifies the minimum number of records required in a leaf.
Defaults to 1.

max.depth

integer, optional
The maximum depth of a tree.
By default it is unlimited.

categorical.variable

list of character, optional
Indicates which features should be treated as categorical. Valid only for integer columns; ignored otherwise.
By default, the type is detected from the input data: 'string' columns are treated as categorical, and 'integer' and 'double' columns are treated as continuous.

split.threshold

double, optional
Specifies the stop condition for a node. For CART, a node is not split further if the reduction of the Gini index or relative MSE of the best split is less than this value. The smaller the split.threshold value, the larger a CART tree grows.

Defaults to 1e-5 for CART.

use.surrogate

logical, optional
Indicates whether to use a surrogate split when NULL values are encountered. FALSE does not use surrogate split. TRUE uses a surrogate split. Only valid for CART.

Defaults to TRUE.

model.format

character, optional
Specifies the format in which the tree model is stored. Valid options are 'json' and 'pmml'.

Defaults to 'json'.

discretization.type

character, optional
Specifies the strategy for discretizing continuous attributes. Valid options are 'mdlpc' and 'equal_freq'. Valid only for C45 and CHAID.

Defaults to 'mdlpc'.

bins

list, optional
Specifies the number of bins for discretization as a named list. Each element must be named after a column, and its value is the number of bins used to discretize that column.
Only valid when discretization.type is 'equal_freq'.
Defaults to 10 for each column.
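
As a small illustration of the expected shape, the sketch below builds a named list for bins and checks that every element is named. The column names TEMP and HUMIDITY are hypothetical placeholders, and the check that bin counts are positive whole numbers is an assumption, not a documented rule.

```r
# Hypothetical column names; replace them with continuous columns from your data.
bins <- list(TEMP = 5, HUMIDITY = 8)

# Every element must be named after a column.
# Assumption: each bin count is a positive whole number.
stopifnot(
  !is.null(names(bins)),
  all(nzchar(names(bins))),
  all(vapply(bins,
             function(b) is.numeric(b) && b >= 1 && b == round(b),
             logical(1)))
)
```

A list built this way would then be passed as bins = bins, together with discretization.type = 'equal_freq'.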

max.branch

integer, optional
Specifies the maximum number of branches.
Defaults to 10.

merge.threshold

double, optional
Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.

output.rules

logical, optional
Specifies whether to output decision rules. FALSE does not output decision rules. TRUE outputs decision rules.
Defaults to TRUE.

Format

R6Class object.

Value

A "DecisionTreeRegressor" object with the following attributes:

model : DataFrame Trained model content.

decision.rules : DataFrame Rules for decision tree to make decisions.

confusion.matrix : DataFrame Confusion matrix used to evaluate the performance of classification algorithms.

Note

Using Summary and Print

Summary provides a general summary of the output of the model. Usage: summary(dtr) where dtr is the model generated.

Print provides information on the coefficients and the optional parameter values given by the user. Usage: print(dtr) where dtr is the model generated.

Examples

## Not run: 
Input DataFrame for training:

> head(data$Collect(),5)
OUTLOOK TEMP HUMIDITY WINDY CLASS
1   Sunny   75       70   Yes     1
2   Sunny   80       90   Yes     0
3   Sunny   85       85    No     0
4   Sunny   72       95    No     0
5   Sunny   69       70    No     1

Creating DecisionTreeRegressor model:

> dtr = hanaml.DecisionTreeRegressor(conn,
                             features = list("A", "B", "C"), label = "LABEL", key = 'ID',
                             min.records.of.parent = 2, min.records.of.leaf = 1,
                             thread.ratio = 0.4, split.threshold = 1e-5,
                             model.format = 'pmml', output.rules = TRUE)

Giving input to fit as a formula:

> dtr = hanaml.DecisionTreeRegressor(conn,
                             formula = LABEL ~ A + B + C, key = NULL,
                             min.records.of.parent = 2, min.records.of.leaf = 1,
                             thread.ratio = 0.4, split.threshold = 1e-5,
                             model.format = 'pmml', output.rules = TRUE)


## End(Not run)


[Package hana.ml.r version 1.0.8 Index]