RDTClassifier
- class hana_ml.algorithms.pal.trees.RDTClassifier(n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, compression=None, max_bits=None, quantize_rate=None, strata=None, priors=None, model_format=None)
The random decision trees algorithm is an ensemble learning method for classification and regression. It grows many classification and regression trees and outputs the class chosen by the majority vote of the individual trees (classification) or the mean prediction of the individual trees (regression).
The algorithm uses both bagging and random feature selection. Each new training set is drawn with replacement from the original training set, and a tree is then grown on the new training set using random feature selection. Assuming the original training data has n rows, two sampling methods are available for classification:
Bagging: The sample size is n, and each row is drawn from the original dataset with replacement.
Stratified sampling: For class j, nj rows are drawn from that class with replacement. The sum n1+n2+… need not equal n exactly, but in PAL it must not exceed n, so that the out-of-bag error can still be estimated. This method is typically used when the data is imbalanced.
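As a plain-Python illustration (not part of the PAL API) of how the per-class sample sizes follow from the strata fractions, assuming made-up class labels and fractions:
>>> # hypothetical per-class fractions; their sum must not exceed 1
>>> strata = [('A', 0.5), ('B', 0.3), ('C', 0.2)]
>>> n = 1000  # number of rows in the original training data
>>> {label: round(n * fraction) for label, fraction in strata}
{'A': 500, 'B': 300, 'C': 200}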
The random decision trees algorithm generates an internal unbiased estimate of the generalization error (the out-of-bag error) as the trees are built, which makes cross-validation unnecessary. It also estimates variable importance from the splitting of the nodes. In addition, it has an effective method for estimating missing data:
Training data: If the mth variable is numerical, the method computes the median of the non-missing values of that variable in class j; otherwise it takes the most frequent non-missing value in class j. This value then replaces all missing values of the mth variable in class j.
Test data: Since the class label is unknown, each missing value is replicated once per class, and each copy is filled with the corresponding class's most frequent value or median.
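A conceptual sketch of the training-data rule above (plain pandas, not the PAL implementation; the column and label names are made up): within each class, missing values of a numerical column are replaced by that class's median.
>>> import pandas as pd
>>> df = pd.DataFrame({'LABEL': ['a', 'a', 'b', 'b'],
...                    'X': [1.0, None, 3.0, 5.0]})
>>> df['X'] = df.groupby('LABEL')['X'].transform(lambda s: s.fillna(s.median()))
>>> df['X'].tolist()
[1.0, 1.0, 3.0, 5.0]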
- Parameters:
- n_estimators : int, optional
Specifies the number of decision trees in the model.
Defaults to 100.
- max_features : int, optional
Specifies the number of randomly selected splitting variables.
Should not be larger than the number of input features.
Defaults to sqrt(p), where p is the number of input features.
- max_depth : int, optional
The maximum depth of a tree, where -1 means unlimited.
Defaults to 56.
- min_samples_leaf : int, optional
Specifies the minimum number of records in a leaf.
Defaults to 1 for classification.
- split_threshold : float, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.
- calculate_oob : bool, optional
If True, calculates the out-of-bag error.
Defaults to True.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- thread_ratio : float, optional
Specifies the ratio of available threads to use, from 0 to 1. A value of 0 indicates using a single thread, while 1 means using all available threads. Values outside this range are ignored, and the number of threads is then determined heuristically.
Defaults to -1.
- allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
The default value is detected from the input data.
- sample_fraction : float, optional
The fraction of data used for training.
If there are n rows of data and the sample fraction is r, then n*r rows are selected for training.
Defaults to 1.0.
- compression : bool, optional
Specifies if the model is stored in compressed format.
Default value depends on the SAP HANA Version. Please refer to the corresponding documentation of SAP HANA PAL.
- max_bits : int, optional
The maximum number of bits used to quantize continuous features, equivalent to using \(2^{max\_bits}\) bins.
Must be less than 31.
Valid only when compression is True.
Defaults to 12.
- quantize_rate : float, optional
Quantizes a categorical feature if the largest class frequency of the feature is less than quantize_rate.
Valid only when compression is True.
Defaults to 0.005.
- strata : list of tuples (class, fraction), optional
Strata proportions for stratified sampling.
A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample.
If the given fractions do not add up to 1, the remaining portion is divided equally between the classes with no entry in strata, or between all classes if every class has an entry in strata.
If strata is not provided, bagging is used instead of stratified sampling.
- priors : list of tuples (class, prior_prob), optional
Prior probabilities for the classes.
A (class, prior_prob) tuple specifies the prior probability of that class.
If the given priors do not add up to 1, the remaining portion is divided equally between the classes with no entry in priors, or between all classes if every class has an entry in priors.
If priors is not provided, it is determined by the proportion of each class in the training data.
- model_format : {'json', 'pmml'}, optional
Specifies the model format to store, case-insensitive.
'json': export the model in JSON format.
'pmml': export the model in PMML format.
Not effective if compression is True, in which case the model is stored neither in JSON nor in PMML, but in compressed format.
Defaults to 'pmml'.
References
The parameters compression, max_bits and quantize_rate are used for compressing the Random Decision Trees classification model; please see Model Compression for more details on this topic.
Examples
>>> rfc = RDTClassifier(n_estimators=3, max_features=3, split_threshold=0.00001, calculate_oob=True, min_samples_leaf=1, thread_ratio=1.0)
Perform fit():
>>> rfc.fit(data=df_train, key='ID', features=['F1', 'F2'], label='LABEL')
>>> rfc.feature_importances_.collect()
Perform predict():
>>> res = rfc.predict(data=df_predict, key='ID', verbose=False)
>>> res.collect()
Perform score():
>>> rfc.score(data=df_score, key='ID')
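A further, hedged sketch combining stratified sampling and model compression; the class labels and parameter values are illustrative only and must match the labels actually present in df_train:
>>> rfc2 = RDTClassifier(n_estimators=10,
...                      strata=[('A', 0.5), ('B', 0.5)],
...                      compression=True, max_bits=10)
>>> rfc2.fit(data=df_train, key='ID', features=['F1', 'F2'], label='LABEL')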
- Attributes:
- model_ : DataFrame
Model content.
- feature_importances_ : DataFrame
The feature importance (the higher, the more important the feature).
- oob_error_ : DataFrame
Out-of-bag error rate or mean squared error for the random decision trees, up to the indexed tree. Set to None if calculate_oob is False.
- confusion_matrix_ : DataFrame
Confusion matrix used to evaluate the performance of classification algorithms.
Methods
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
fit(data[, key, features, label, ...])
Fit the model to the training dataset.
get_model_metrics()
Get the model metrics.
get_score_metrics()
Get the score metrics.
predict(data[, key, features, verbose, ...])
Predict dependent variable values based on a fitted model.
score(data[, key, features, label, ...])
Returns the mean accuracy on the given test data and labels.
set_model_state(state)
Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
Training data.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the last non-ID column.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "RDTClassifier".
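For instance, an INTEGER feature column can be declared categorical at fit time; in this sketch the column name 'F3' is hypothetical:
>>> rfc.fit(data=df_train, key='ID',
...         features=['F1', 'F2', 'F3'],
...         label='LABEL', categorical_variable='F3')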
- predict(data, key=None, features=None, verbose=None, block_size=None, missing_replacement=None, verbose_top_n=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- data : DataFrame
Independent variable values to predict for.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- block_size : int, optional
The number of rows loaded at a time during prediction. 0 indicates loading all data at once.
Defaults to 0.
- missing_replacement : str, optional
The missing replacement strategy:
'feature_marginalized': marginalizes each missing feature out independently.
'instance_marginalized': marginalizes all missing features in an instance as a whole, corresponding to each category.
Defaults to 'feature_marginalized'.
- verbose : bool, optional
If True, outputs all classes and the corresponding confidences for each data point.
- verbose_top_n : int, optional
Specifies the number of top n classes to present after sorting by confidence. It cannot exceed the number of classes in the label of the training data, and it can be 0, which means outputting the confidences of all classes.
Effective only when verbose is set to True.
Defaults to 0.
- Returns:
- DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with the same name and type as the ID column of data.
SCORE, type DOUBLE, representing the predicted classes.
CONFIDENCE, type DOUBLE, representing the confidence of a class.
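A sketch of the verbose output controls, assuming the fitted rfc and the df_predict DataFrame from the examples above:
>>> # return the two most confident classes per row
>>> res = rfc.predict(data=df_predict, key='ID',
...                   verbose=True, verbose_top_n=2)
>>> res.collect()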
- score(data, key=None, features=None, label=None, block_size=None, missing_replacement=None)
Returns the mean accuracy on the given test data and labels.
- Parameters:
- data : DataFrame
Data on which to assess model performance.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the last non-ID column.
- block_size : int, optional
The number of rows loaded at a time during prediction. 0 indicates loading all data at once.
Defaults to 0.
- missing_replacement : str, optional
The missing replacement strategy:
'feature_marginalized': marginalizes each missing feature out independently.
'instance_marginalized': marginalizes all missing features in an instance as a whole, corresponding to each category.
Defaults to 'feature_marginalized'.
- Returns:
- float
Mean accuracy on the given test data and labels.
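A sketch of scoring with an explicit missing-value strategy, assuming the fitted rfc and df_score from the examples above:
>>> rfc.score(data=df_score, key='ID',
...           missing_replacement='instance_marginalized')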
- create_model_state(model=None, function=None, pal_funcname='PAL_RANDOM_DECISION_TREES', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model : DataFrame, optional
Specifies the model used for the AFL state.
Defaults to self.model_.
- function : str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for RDT.
- pal_funcname : int or str, optional
PAL function name.
Defaults to 'PAL_RANDOM_DECISION_TREES'.
- state_description : str, optional
Description of the state as a model container.
Defaults to None.
- force : bool, optional
If True, deletes any existing state first.
Defaults to False.
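A minimal sketch of creating and later removing a model state, assuming a fitted rfc; how the state is reused afterwards is handled by the class itself:
>>> rfc.create_model_state()   # create an AFL state from self.model_
>>> rfc.delete_model_state()   # remove the state once it is no longer needed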
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
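A sketch of restoring a state from a dict; the values below are placeholders, not real connection details:
>>> rfc.set_model_state({'STATE_ID': '<state id>',
...                      'HINT': '<hint>',
...                      'HOST': '<host>',
...                      'PORT': '<port>'})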
Inherited Methods from PALBase
Besides those methods mentioned above, the RDTClassifier class also inherits methods from PALBase class, please refer to PAL Base for more details.