hanaml.RDTClassifier is an R wrapper for SAP HANA PAL Random Decision Trees for classification.

hanaml.RDTClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  n.estimators = NULL,
  max.features = NULL,
  max.depth = NULL,
  min.samples.leaf = NULL,
  split.threshold = NULL,
  calculate.oob = NULL,
  random.state = NULL,
  thread.ratio = NULL,
  allow.missing.dependent = NULL,
  categorical.variable = NULL,
  sample.fraction = NULL,
  strata = NULL,
  priors = NULL,
  compression = NULL,
  max.bits = NULL,
  quantize.rate = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either a formula or a features and label combination, but not both.
Defaults to NULL.
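
As a rough illustration of how a formula corresponds to a label and a feature list, the base R sketch below splits a formula into its variable names with all.vars(). It only shows the correspondence between the two input styles; how hanaml.RDTClassifier parses the formula internally is not stated here. The column names are borrowed from the example data in the Examples section.

```r
# Split a formula into its label (left-hand side) and features (right-hand side).
f <- CLASS ~ TEMP + HUMIDITY + WINDY

vars <- all.vars(f)      # variable names, left-hand side first
label <- vars[1]         # the dependent variable
features <- vars[-1]     # the remaining names are the features

print(label)
print(features)
```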

n.estimators

integer, optional
Specifies the number of decision trees in the model.
Defaults to 100.

max.features

integer, optional
Specifies the number of randomly selected splitting variables.
Should not be larger than the number of input features.
Defaults to \(\sqrt{p}\), where p is the number of input features.
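
A small base R sketch of the default, assuming the four feature columns of the example data in the Examples section. How PAL rounds the square root when it is not an integer is not stated here, so the sketch uses a p for which it is exact.

```r
# Default number of randomly selected splitting variables: sqrt(p).
p <- 4                            # OUTLOOK, TEMP, HUMIDITY, WINDY
default.max.features <- sqrt(p)
print(default.max.features)       # 2

# A user-supplied value should not be larger than p.
max.features <- 3
stopifnot(max.features <= p)
```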

max.depth

integer, optional
The maximum depth of a tree.
No default value, but the maximum value SAP HANA PAL supports is 56.

min.samples.leaf

integer, optional
Specifies the minimum number of records in a leaf.
Defaults to 1 for classification.

split.threshold

double, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.

calculate.oob

logical, optional
If TRUE, calculate the out-of-bag error.
Defaults to TRUE.

random.state

integer, optional
Specifies the seed for the random number generator.

  • 0: Uses the current time (in seconds) as the seed.

  • Others: Uses the specified value as the seed.

Defaults to 0.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies if a missing target value is allowed.

  • FALSE: Not allowed. An error occurs if a missing target is present.

  • TRUE: Allowed. The datum with the missing target is removed.

Defaults to TRUE.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

sample.fraction

double, optional
The fraction of data used for training.
If the data contains n rows and the sample fraction is r, then n*r rows are selected for training.
Defaults to 1.0.
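
The arithmetic can be sketched in base R as follows, using the 14 rows of the example data in the Examples section. The value 0.5 is only an illustrative choice, and the rounding applied when n*r is not a whole number is not stated here.

```r
n <- 14                 # number of rows in the training data
r <- 0.5                # sample.fraction (illustrative value)
n.sampled <- n * r      # rows selected for training
print(n.sampled)        # 7
```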

strata

List of tuples: (class, fraction), optional
Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata.
If strata is not provided, bagging is used instead of stratified sampling.
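
The rule for fractions that do not add up to 1 can be sketched in base R as below. The helper complete_fractions is hypothetical and not part of hana.ml.r; it only mirrors the description above and says nothing about how the argument is passed to the function or processed inside PAL.

```r
# Complete a partial set of strata fractions following the rule described above.
# classes: character vector of all class labels.
# given:   named numeric vector of user-specified fractions (may be partial).
complete_fractions <- function(classes, given) {
  stopifnot(all(names(given) %in% classes), sum(given) <= 1)
  result <- setNames(numeric(length(classes)), classes)
  result[names(given)] <- given
  remainder <- 1 - sum(given)
  unlisted <- setdiff(classes, names(given))
  if (length(unlisted) > 0) {
    # Split the remainder equally among classes without an entry.
    result[unlisted] <- remainder / length(unlisted)
  } else {
    # Every class has an entry: spread the remainder over all classes.
    result <- result + remainder / length(classes)
  }
  result
}

classes <- c("Play", "Do not Play")
partial <- complete_fractions(classes, c("Play" = 0.6))
full <- complete_fractions(classes, c("Play" = 0.5, "Do not Play" = 0.3))
print(partial)
print(full)
```

In both calls the completed fractions sum to 1: in the first, "Do not Play" receives the whole remainder; in the second, the remainder of 0.2 is shared equally between the two classes.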

priors

List of tuples: (class, prior_prob), optional
Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in priors.
If priors is not provided, it is determined by the proportion of every class in the training data.
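
As an illustration of the default, the base R sketch below computes the class proportions of the CLASS column from the example data in the Examples section. It is only a local calculation on a plain character vector; the actual computation happens inside SAP HANA PAL.

```r
# CLASS column of the example data (14 rows).
class.col <- c("Play", "Do not Play", "Do not Play", "Do not Play", "Play",
               "Play", "Play", "Play", "Play", "Do not Play",
               "Do not Play", "Play", "Play", "Play")

# Empirical class proportions: 9/14 for "Play", 5/14 for "Do not Play".
default.priors <- prop.table(table(class.col))
print(default.priors)
```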

compression

logical, optional
Specifies whether the model is stored in compressed format.
The default value depends on the SAP HANA version. Please refer to the corresponding documentation of SAP HANA PAL.

max.bits

integer, optional
Specifies the maximum number of bits used to quantize continuous features, which is equivalent to using 2^max.bits bins. Must be less than 31.
Defaults to 12.
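
For a sense of scale, the number of bins implied by the default can be computed in base R:

```r
max.bits <- 12              # default value
stopifnot(max.bits < 31)    # documented upper limit
n.bins <- 2^max.bits        # number of quantization bins
print(n.bins)               # 4096
```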

quantize.rate

numeric, optional
Specifies a threshold: if the frequency of the largest class of a categorical feature is below this threshold, that categorical feature is quantized.
Valid only when compression is TRUE.
Defaults to 0.05.
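
The threshold check can be sketched in base R using the OUTLOOK column of the example data in the Examples section. The sketch assumes that "frequency" means the relative frequency of the most common category; that reading is an assumption and should be checked against the SAP HANA PAL documentation.

```r
# OUTLOOK column of the example data: 5 Sunny, 4 Overcast, 5 Rain.
outlook <- c(rep("Sunny", 5), rep("Overcast", 4), rep("Rain", 5))

quantize.rate <- 0.05                             # default threshold
largest.freq <- max(prop.table(table(outlook)))   # 5/14
is.quantized <- largest.freq < quantize.rate
print(is.quantized)                               # FALSE
```

Under this reading, OUTLOOK would not be quantized, because its most common category covers far more than 5% of the rows.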

Value

Returns an "RDTClassifier" object with the following attributes:

  • model : DataFrame
    Trained model content.

  • feature.importances : DataFrame
    The feature importance (the higher, the more important the feature).

  • oob.error : DataFrame
    Out-of-bag error rate or mean squared error for random decision trees up to the indexed tree. Set to NULL if calculate.oob is FALSE.

  • confusion.matrix : DataFrame
    Confusion matrix used to evaluate the performance of classification algorithms.

Examples

Input DataFrame data:

 > data$Collect()
      OUTLOOK TEMP HUMIDITY WINDY       CLASS
  1     Sunny   75       70   Yes        Play
  2     Sunny   80       90   Yes Do not Play
  3     Sunny   85       85    No Do not Play
  4     Sunny   72       95    No Do not Play
  5     Sunny   69       70    No        Play
  6  Overcast   72       90   Yes        Play
  7  Overcast   83       78    No        Play
  8  Overcast   64       65   Yes        Play
  9  Overcast   81       75    No        Play
 10      Rain   71       80   Yes Do not Play
 11      Rain   65       70   Yes Do not Play
 12      Rain   75       80    No        Play
 13      Rain   68       80    No        Play
 14      Rain   70       96    No        Play

Call the function:

> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)

OR, giving features and label as input to generate a model:

> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              features=list("TEMP", "HUMIDITY", "WINDY"),
                              label="CLASS",
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)

OR, giving a formula as input for model generation:

> rfc <- hanaml.RDTClassifier(data = data,
                              n.estimators=300,
                              max.features=3,
                              formula=CLASS~TEMP+HUMIDITY+WINDY,
                              random.state=2,
                              split.threshold=0.00001,
                              calculate.oob=TRUE,
                              min.samples.leaf=1,
                              thread.ratio=1.0)

Output:

> rfc$feature.importances$Collect()
  VARIABLE_NAME IMPORTANCE
1       OUTLOOK  0.3475185
2          TEMP  0.2770724
3      HUMIDITY  0.2476346
4         WINDY  0.1277744

See also