hanaml.RDTRegressor is an R wrapper for SAP HANA PAL
Random Decision Trees for regression.
hanaml.RDTRegressor(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
formula = NULL,
n.estimators = NULL,
max.features = NULL,
max.depth = NULL,
min.samples.leaf = NULL,
split.threshold = NULL,
calculate.oob = NULL,
random.state = NULL,
thread.ratio = NULL,
allow.missing.dependent = NULL,
categorical.variable = NULL,
sample.fraction = NULL,
compression = NULL,
max.bits = NULL,
quantize.rate = NULL,
fittings.quantization = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| key |
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
|
| features |
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
|
| label |
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
|
| formula |
formula type, optional
Formula to be used for model generation.
format = label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
You can either give the formula,
or a feature and label combination, but do not provide both.
Defaults to NULL.
|
| n.estimators |
integer, optional
Specifies the number of decision trees in the model.
Defaults to 100.
|
| max.features |
integer, optional
Specifies the number of randomly selected splitting variables.
Should not be larger than the number of input features.
Defaults to \(p/3\), where p is the number of input features.
|
| max.depth |
integer, optional
The maximum depth of a tree.
No default value, but the maximum value SAP HANA PAL supports is 56.
|
| min.samples.leaf |
integer, optional
Specifies the minimum number of records in a leaf.
Defaults to 5 for regression.
|
| split.threshold |
double, optional
Specifies the stop condition: if the improvement value of the best
split is less than this value, the tree stops growing.
Defaults to 1e-5.
|
| calculate.oob |
logical, optional
If TRUE, calculate the out-of-bag error.
Defaults to TRUE.
|
| random.state |
integer, optional
Specifies the seed for the random number generator.
Defaults to 0.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads. Values between 0 and 1 will use up to
that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads
used is then heuristically determined.
Defaults to -1.
|
| allow.missing.dependent |
logical, optional
Specifies if a missing target value is allowed.
Defaults to TRUE.
|
| categorical.variable |
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type of the input.
This argument is valid only for variables of "INTEGER" type and is ignored otherwise.
No default value.
|
| sample.fraction |
double, optional
The fraction of data used for training.
If there are n records and the sample fraction is r, then n*r
records are selected for training.
Defaults to 1.0.
|
| compression |
logical, optional
Specifies whether the model is stored in compressed format or not.
The default value depends on the SAP HANA version. Please refer to the corresponding documentation of SAP HANA PAL.
|
| max.bits |
integer, optional
Specifies the maximum number of bits used to quantize continuous features, which
is equivalent to using \(2^{max.bits}\) bins.
Must be less than 31.
Defaults to 12.
|
| quantize.rate |
numeric, optional
Specifies the threshold value: if the frequency of the largest class of a categorical feature
is below the threshold, then this categorical feature is quantized.
Valid only when compression is TRUE.
Defaults to 0.05.
|
| fittings.quantization |
logical, optional
Specifies whether to quantize fitting values or not.
Valid only when compression is TRUE.
Defaults to FALSE.
|
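Several of the arguments above describe simple arithmetic relationships. The following base R sketch restates three of them so the numbers are easy to check by hand. The helper names (quantization.bins, training.rows, thread.ratio.in.range) are hypothetical and are not part of hanaml; they only mirror the descriptions of max.bits, sample.fraction and thread.ratio given in the argument table.

```r
# Hypothetical helpers for illustration only; none of these exist in hanaml.

# max.bits: quantizing with max.bits bits corresponds to 2^max.bits bins.
# The argument table states that max.bits must be less than 31.
quantization.bins <- function(max.bits = 12L) {
  stopifnot(max.bits < 31)
  2^max.bits
}

# sample.fraction: with n records and fraction r, n * r records are used.
training.rows <- function(n, sample.fraction = 1.0) {
  n * sample.fraction
}

# thread.ratio: only values in [0, 1] are honored; anything else
# (including the default of -1) leaves the thread count to a heuristic.
thread.ratio.in.range <- function(thread.ratio) {
  thread.ratio >= 0 && thread.ratio <= 1
}

quantization.bins()        # 4096 bins for the default max.bits = 12
training.rows(1000, 0.8)   # 800 records
thread.ratio.in.range(-1)  # FALSE
```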
Value
Returns an "RDTRegressor" object with the following values:
model : DataFrame
Trained model content.
feature.importances : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate or mean squared error for random decision trees up
to the indexed tree.
Set to NULL if calculate.oob is FALSE.
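Because a higher importance value means a more important feature, it is often convenient to sort the collected feature.importances table. The sketch below works on a plain local data.frame rather than a live SAP HANA connection; the column names and values are copied from the output shown in the Examples section of this page, and the variable names importances and ranked are chosen here for illustration.

```r
# Local stand-in for the result of rfr$feature.importances$Collect(),
# using the values printed in the Examples section.
importances <- data.frame(
  VARIABLE_NAME = c("A", "B", "C", "D"),
  IMPORTANCE = c(0.249593, 0.381879, 0.291403, 0.077125),
  stringsAsFactors = FALSE
)

# Sort so that the most important feature comes first.
ranked <- importances[order(importances$IMPORTANCE, decreasing = TRUE), ]
ranked$VARIABLE_NAME  # "B" "C" "A" "D"
```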
Examples
Input DataFrame data:
> data$Collect()
ID A B C D CLASS
1 0 -0.965679 1.142985 -0.019274 -1.598807 -23.633813
2 1 2.249528 1.459918 0.153440 -0.526423 212.532559
3 2 -0.631494 1.484386 -0.335236 0.354313 26.342585
4 3 -0.967266 1.131867 -0.684957 -1.397419 -62.563666
5 4 -1.175179 -0.253179 -0.775074 0.996815 -115.534935
......
Call the function:
> rfr <- hanaml.RDTRegressor(data = data, random.state = 3)
Output:
> rfr$feature.importances$Collect()
VARIABLE_NAME IMPORTANCE
1 A 0.249593
2 B 0.381879
3 C 0.291403
4 D 0.077125
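The call above relies on the defaults for everything except random.state. As a minimal sketch of a more explicit call, the wrapper below uses the formula interface and sets a few of the documented arguments. The wrapper name fit.rdt and the chosen values are illustrative assumptions, not recommendations, and the function needs a hanaml DataFrame backed by a live SAP HANA connection to actually run. It supplies a formula and therefore does not pass features or label, as required by the description of formula.

```r
# Sketch only: fit.rdt is a hypothetical helper, not part of hanaml.
# `data` is assumed to be a hanaml DataFrame with the columns shown above
# (ID, A, B, C, D, CLASS).
fit.rdt <- function(data) {
  hanaml.RDTRegressor(
    data = data,
    key = "ID",                       # ID column of the example data
    formula = CLASS ~ A + B + C + D,  # label ~ features; no features/label args
    n.estimators = 200L,              # illustrative; the default is 100
    max.depth = 10L,                  # illustrative; PAL supports at most 56
    min.samples.leaf = 5L,
    calculate.oob = TRUE,             # so that oob.error is populated
    random.state = 3L,
    thread.ratio = 0.5                # use up to half of the available threads
  )
}
```

With a connected DataFrame, the result of fit.rdt(data) could then be inspected in the same way as rfr above, for example through its feature.importances and oob.error components.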
See also