hanaml.RDTClassifier.Rd
hanaml.RDTClassifier is an R wrapper for SAP HANA PAL Random Decision Trees for classification.
hanaml.RDTClassifier(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
formula = NULL,
n.estimators = NULL,
max.features = NULL,
max.depth = NULL,
min.samples.leaf = NULL,
split.threshold = NULL,
calculate.oob = NULL,
random.state = NULL,
thread.ratio = NULL,
allow.missing.dependent = NULL,
categorical.variable = NULL,
sample.fraction = NULL,
strata = NULL,
priors = NULL,
compression = NULL,
max.bits = NULL,
quantize.rate = NULL,
model.format = NULL
)
data: DataFrame
DataFrame containing the data.
key: character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
features: character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
label: character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
formula: formula type, optional
Formula to be used for model generation.
Format: label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
Provide either the formula or a features and label combination, but not both.
Defaults to NULL.
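As a rough illustration of how a formula corresponds to a label and features combination, the sketch below splits a formula into its two sides with base R's all.vars. The column names are taken from the example data on this page. How hanaml.RDTClassifier parses the formula internally is not described here, so this is only a reading aid, not the package's implementation.

```r
# Illustrative only: base R, no SAP HANA connection needed.
f <- CLASS ~ TEMP + HUMIDITY + WINDY

label    <- all.vars(f[[2]])  # variables on the left-hand side of the formula
features <- all.vars(f[[3]])  # variables on the right-hand side of the formula

print(label)
print(features)
```

Under this reading, passing the formula above would describe the same columns as label="CLASS" together with features=list("TEMP", "HUMIDITY", "WINDY").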
n.estimators: integer, optional
Specifies the number of decision trees in the model.
Defaults to 100.
max.features: integer, optional
Specifies the number of randomly selected splitting variables.
Should not be larger than the number of input features.
Defaults to \(\sqrt{p}\),
where p is the number of input features.
max.depth: integer, optional
Specifies the maximum depth of a tree.
If -1 is specified, there is no restriction on the depth
of a tree; otherwise, only positive values no greater than 56
are allowed.
Defaults to 56.
min.samples.leaf: integer, optional
Specifies the minimum number of records in a leaf.
Defaults to 1 for classification.
split.threshold: double, optional
Specifies the stop condition: if the improvement value of the best
split is less than this value, the tree stops growing.
Defaults to 1e-5.
calculate.oob: logical, optional
If TRUE, calculate the out-of-bag error.
Defaults to TRUE.
random.state: integer, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
thread.ratio: double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads. Values between 0 and 1 use up to
that proportion of the available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads
used is then determined heuristically.
Defaults to -1.
allow.missing.dependent: logical, optional
Specifies if a missing target value is allowed.
FALSE: Not allowed. An error occurs if a missing target is present.
TRUE: Allowed. The datum with the missing target is removed.
Defaults to TRUE.
categorical.variable: character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type of the input:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
sample.fraction: double, optional
The fraction of data used for training.
If there are n rows of data and the sample fraction is r, then n*r
rows are selected for training.
Defaults to 1.0.
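The n*r rule can be made concrete with a small base R sketch. The value r = 0.5 is hypothetical, and drawing rows with replacement is an assumption about how bagging is performed. Neither the rounding of n*r nor the sampling scheme used by PAL is stated on this page.

```r
# Illustrative only: base R, not the PAL implementation.
set.seed(2)                     # fixed seed so the draw is repeatable

n <- 14                         # rows in the example data on this page
r <- 0.5                        # a hypothetical sample.fraction value
sample_size <- n * r            # n*r rows are selected for training

# Assumption: bagging draws row indices with replacement.
rows <- sample(seq_len(n), size = sample_size, replace = TRUE)

print(sample_size)
print(rows)
```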
strata: List of tuples: (class, fraction), optional
Strata proportions for stratified sampling. A (class, fraction) tuple
specifies that rows with that class should make up the specified
fraction of each sample. If the given fractions do not add up to 1,
the remaining portion is divided equally between classes with
no entry in strata, or between all classes if all classes have
an entry in strata.
If strata is not provided, bagging is used instead of stratified
sampling.
priors: List of tuples: (class, prior_prob), optional
Prior probabilities for classes. A (class, prior_prob) tuple
specifies the prior probability of this class. If the given
priors do not add up to 1, the remaining portion is divided equally
between classes with no entry in priors, or between all classes
if all classes have an entry in 'priors'.
If priors is not provided, it is determined by the proportion of
every class in the training data.
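The remainder rule shared by strata and priors can be sketched in base R as follows. The helper fill_remainder is hypothetical, and the named numeric vector is only an illustrative stand-in for the (class, value) tuples, not the argument format expected by hanaml.RDTClassifier. The sketch follows the wording above and does not cover inputs whose values add up to more than 1.

```r
# Illustrative only: follows the remainder rule as worded on this page.
fill_remainder <- function(classes, given) {
  stopifnot(all(names(given) %in% classes))
  remainder <- 1 - sum(given)               # portion not covered by the entries
  out <- setNames(numeric(length(classes)), classes)
  out[names(given)] <- given
  unlisted <- setdiff(classes, names(given))
  if (length(unlisted) > 0) {
    # Divide the remainder equally between classes with no entry.
    out[unlisted] <- remainder / length(unlisted)
  } else {
    # Every class has an entry: divide the remainder equally between all classes.
    out <- out + remainder / length(classes)
  }
  out
}

classes <- c("Play", "Do not Play")
partial <- fill_remainder(classes, c("Play" = 0.7))
full    <- fill_remainder(classes, c("Play" = 0.5, "Do not Play" = 0.3))

print(partial)
print(full)
```

In the first call only "Play" has an entry, so "Do not Play" receives the remaining 0.3. In the second call both classes have entries that sum to 0.8, so the remaining 0.2 is split equally and each class gains 0.1.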
compression: logical, optional
Specifies whether the model is stored in compressed format.
Defaults to FALSE.
max.bits: integer, optional
Specifies the maximum number of bits to quantize continuous features, which
is equivalent to using \(2^{max.bits}\) bins.
Must be less than 31.
Defaults to 12.
quantize.rate: numeric, optional
Specifies a threshold: if the frequency of the largest class of a categorical feature
is below the threshold, this categorical feature is quantized.
Valid only when compression is TRUE.
Defaults to 0.05.
model.format: c("json", "pmml"), optional
Specifies the tree model format for storing.
Valid only when compression is FALSE (or not specified).
Defaults to "pmml".
Returns an R6 object of class "RDTClassifier", with the following attributes and methods:
Attributes
model : DataFrame
Trained model content.
feature.importances : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate or mean squared error for random decision trees up
to the indexed tree.
Set to NULL if calculate.oob is FALSE.
confusion.matrix : DataFrame
Confusion matrix used to evaluate the performance of
classification algorithms.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> rdc <- hanaml.RDTClassifier(data=df)
> rdc$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model
is assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> rdc <- hanaml.RDTClassifier(data=df)
> rdc$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> rdc$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state is cleaned up and the associated memory is released.
Input DataFrame data:
> data$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play
Call the function:
> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)
Or, giving features and label as input for model generation:
> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              features=list("TEMP", "HUMIDITY", "WINDY"),
                              label="CLASS",
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)
Or, giving the input for model generation as a formula (column names taken from the input data above):
> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              formula=CLASS~TEMP+HUMIDITY+WINDY,
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)
Output:
> rfc$feature.importances$Collect()
  VARIABLE_NAME IMPORTANCE
1       OUTLOOK  0.3475185
2          TEMP  0.2770724
3      HUMIDITY  0.2476346
4         WINDY  0.1277744
For model compression:
> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              formula=CLASS~TEMP+HUMIDITY+WINDY,
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0,
                              compression=TRUE,
                              max.bits=12,
                              quantize.rate=0.05)
To achieve better predictive performance, the random decision trees method usually requires a large number of big trees,
and this number tends to grow with the size of the problem. Larger datasets also result in deeper and
more complex models of large size. Therefore, a model compression technique is introduced here to reduce
the size of the learned model with minimal loss of accuracy.
Relevant Parameters
compression: This parameter serves as a trigger for model compression.
quantize.rate: The value of this parameter determines whether or not to do quantization for split values of certain continuous features.
If the largest frequency of these continuous split values is less than the value specified by quantize.rate,
the split values of the corresponding continuous feature shall be quantized.
max.bits: This parameter sets the maximum number of bits to quantize continuous features, which is equivalent to using
\(2^{max.bits}\) bins to quantize the values of these continuous features. Reducing the number of bins may affect the precision of
split values and the accuracy in prediction. Must be less than 31.
fittings.quantization: This parameter determines whether or not to quantize fitting values (the values of leaves) in regression problems.
It is recommended to use this technique for large datasets in regression problems. This parameter is not available for
classification problems.
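The trade-off between the number of bins and the precision of the quantized values can be illustrated with a small base R sketch. Equal-width binning and the helper names quantize and max_error are assumptions made for illustration. The binning scheme actually used by PAL is not described on this page.

```r
# Illustrative only: equal-width binning is an assumption, not PAL's scheme.
quantize <- function(x, bits) {
  n_bins <- 2^bits                       # 2^max.bits bins
  lo <- min(x)
  width <- (max(x) - lo) / n_bins        # width of one bin
  stopifnot(width > 0)                   # needs at least two distinct values
  idx <- pmin(floor((x - lo) / width), n_bins - 1)  # 0-based bin index
  lo + (idx + 0.5) * width               # represent each value by its bin centre
}

# TEMP column from the example data on this page.
temp <- c(75, 80, 85, 72, 69, 72, 83, 64, 81, 71, 65, 75, 68, 70)

# Largest absolute difference between original and quantized values.
max_error <- function(bits) max(abs(quantize(temp, bits) - temp))

err_2  <- max_error(2)    # 4 bins
err_12 <- max_error(12)   # 4096 bins

print(c(bits_2 = err_2, bits_12 = err_12))
```

With equal-width bins the error is bounded by half a bin width, so each additional bit halves the worst-case deviation of a quantized value. This is one way to read the note that fewer bins may reduce the precision of split values.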