Random Decision Trees for Regression

hanaml.RDTRegressor is an R wrapper for SAP HANA PAL Random Decision Trees for regression.

hanaml.RDTRegressor(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  n.estimators = NULL,
  max.features = NULL,
  max.depth = NULL,
  min.samples.leaf = NULL,
  split.threshold = NULL,
  calculate.oob = NULL,
  random.state = NULL,
  thread.ratio = NULL,
  allow.missing.dependent = NULL,
  categorical.variable = NULL,
  sample.fraction = NULL,
  compression = NULL,
  max.bits = NULL,
  quantize.rate = NULL,
  fittings.quantization = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. Defaults to the last column of data if not provided.
formula	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a features and label combination, but not both. Defaults to NULL.
n.estimators	`integer, optional` Specifies the number of decision trees in the model. Defaults to 100.
max.features	`integer, optional` Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to \(p/3\), where p is the number of input features.
max.depth	`integer, optional` The maximum depth of a tree. No default value, but the maximum value SAP HANA PAL supports is 56.
min.samples.leaf	`integer, optional` Specifies the minimum number of records in a leaf. Defaults to 5 for regression.
split.threshold	`double, optional` Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.
calculate.oob	`logical, optional` If TRUE, calculate the out-of-bag error. Defaults to TRUE.
random.state	`integer, optional` Specifies the seed for the random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined. Defaults to -1.
allow.missing.dependent	`logical, optional` Specifies if a missing target value is allowed. FALSE: Not allowed. An error occurs if a missing target is present. TRUE: Allowed. The datum with a missing target is removed. Defaults to TRUE.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. The default behavior depends on the column type: "VARCHAR" and "NVARCHAR" columns are categorical; "INTEGER" and "DOUBLE" columns are continuous. Valid only for variables of "INTEGER" type; ignored otherwise. No default value.
sample.fraction	`double, optional` The fraction of data used for training. If there are n records and the sample fraction is r, then n*r records are selected for training. Defaults to 1.0.
compression	`logical, optional` Specifies whether the model is stored in compressed format or not. The default value depends on the SAP HANA version. Please refer to the corresponding documentation of SAP HANA PAL.
max.bits	`integer, optional` Specifies the maximum number of bits used to quantize continuous features, which is equivalent to using 2^max.bits bins. Must be less than 31. Defaults to 12.
quantize.rate	`numeric, optional` Specifies the threshold value: if the frequency of the largest class of a categorical feature is below the threshold, then this categorical feature is quantized. Valid only when `compression` is TRUE. Defaults to 0.05.
fittings.quantization	`logical, optional` Specifies whether to quantize fitting values or not. Valid only when `compression` is TRUE. Defaults to FALSE.
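
To illustrate how the arguments above combine, here is a minimal sketch that gathers explicit settings into a named list. The helper name `rdt_args` is hypothetical and not part of hana.ml.r; the column names come from the example data on this page, and the sketch assumes that `features` and `label` are supplied instead of `formula`.

```r
# Hypothetical helper (not part of hana.ml.r): collects explicit settings
# for hanaml.RDTRegressor into a named list.
rdt_args <- function(data) {
  list(
    data = data,
    key = "ID",                           # ID column of the example data
    label = "CLASS",                      # dependent variable
    features = list("A", "B", "C", "D"),  # used instead of `formula`
    n.estimators = 100,                   # documented default
    min.samples.leaf = 5,                 # documented default for regression
    calculate.oob = TRUE,                 # keep the out-of-bag error
    random.state = 3,                     # non-zero value is used as the seed
    sample.fraction = 1.0                 # train on all n records
  )
}

# With a live SAP HANA connection and `data` as a hanaml DataFrame,
# the call would look like this (assumption: not executed here):
# rfr <- do.call(hanaml.RDTRegressor, rdt_args(data))
```

Keeping the settings in one list makes it easy to check that `formula` is not passed together with `features` and `label`.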

Value

Returns an "RDTRegressor" object with the following values:

model : DataFrame
Trained model content.
feature.importances : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate or mean squared error for random decision trees up to the indexed tree. Set to NULL if calculate.oob is FALSE.
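
The returned values can be pulled into local R data frames with `$Collect()`. The sketch below is illustrative only: `collect_rdt_results` and `mock_fit` are hypothetical names, and the mock object merely stands in for a fitted model so the helper can be tried without a database connection.

```r
# Hypothetical helper: collects the documented components of a fitted
# object. `fit` is assumed to be what hanaml.RDTRegressor returns.
collect_rdt_results <- function(fit) {
  list(
    feature.importances = fit$feature.importances$Collect(),
    # oob.error is NULL when calculate.oob is FALSE, so guard before collecting
    oob.error = if (is.null(fit$oob.error)) NULL else fit$oob.error$Collect()
  )
}

# Stand-in for a fitted model (assumption: mimics only the fields used above).
mock_fit <- list(
  feature.importances = list(
    Collect = function() data.frame(VARIABLE_NAME = "A", IMPORTANCE = 1)
  ),
  oob.error = NULL
)

res <- collect_rdt_results(mock_fit)
```

The NULL check matters because calling `$Collect()` on a missing `oob.error` would fail.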

Examples

Input DataFrame data:

>data$Collect()
   ID         A         B         C         D       CLASS
1   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
2   1  2.249528  1.459918  0.153440 -0.526423  212.532559
3   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
4   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
5   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
......

Call the function:

> rfr <- hanaml.RDTRegressor(data = data, random.state = 3)

Output:

> rfr$feature.importances$Collect()
   VARIABLE_NAME  IMPORTANCE
1             A    0.249593
2             B    0.381879
3             C    0.291403
4             D    0.077125
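
As a small follow-up, the collected importances can be ranked locally. This sketch uses plain base R on values copied from the output above; in a real session the data frame would come from `rfr$feature.importances$Collect()`.

```r
# Values copied from the feature.importances output shown above.
imp <- data.frame(
  VARIABLE_NAME = c("A", "B", "C", "D"),
  IMPORTANCE = c(0.249593, 0.381879, 0.291403, 0.077125),
  stringsAsFactors = FALSE
)

# Sort from most to least important and add a running total.
ranked <- imp[order(imp$IMPORTANCE, decreasing = TRUE), ]
rownames(ranked) <- NULL
ranked$CUMULATIVE <- cumsum(ranked$IMPORTANCE)
print(ranked)
```

For this data the ranking is B, C, A, D, and the four importances sum to 1.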
