RDTClassifier

class hana_ml.algorithms.pal.trees.RDTClassifier(n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, compression=None, max_bits=None, quantize_rate=None, strata=None, priors=None, model_format=None)

Random Decision Tree model for classification.

Parameters:
n_estimators : int, optional

Specifies the number of decision trees in the model.

Defaults to 100.

max_features : int, optional

Specifies the number of randomly selected splitting variables.

Should not be larger than the number of input features. Defaults to sqrt(p), where p is the number of input features.
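As a rough illustration, the default could be computed as in the following sketch; the floor rounding is an assumption, since PAL's exact rounding is not documented here, and default_max_features is a hypothetical helper, not part of hana_ml:

```python
import math

def default_max_features(n_features: int) -> int:
    # Documented default: sqrt(p) randomly selected splitting variables,
    # where p is the number of input features. Flooring is an assumption.
    return max(1, math.floor(math.sqrt(n_features)))

print(default_max_features(4))    # 2
print(default_max_features(100))  # 10
```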

max_depth : int, optional

The maximum depth of a tree, where -1 means unlimited.

Defaults to 56.

min_samples_leaf : int, optional

Specifies the minimum number of records in a leaf.

Defaults to 1 for classification.

split_threshold : float, optional

Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.

Defaults to 1e-5.

calculate_oob : bool, optional

If true, calculate the out-of-bag error.

Defaults to True.

random_state : int, optional

Specifies the seed for the random number generator.

  • 0: Uses the current time (in seconds) as the seed.

  • Others: Uses the specified value as the seed.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.
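The mapping described above can be sketched as follows; resolve_thread_count is a hypothetical helper shown for illustration only, since PAL resolves the ratio internally:

```python
def resolve_thread_count(thread_ratio: float, available_threads: int) -> int:
    # Values outside [0, 1] trigger PAL's internal heuristic; the full
    # thread pool is used here merely as a stand-in for that heuristic.
    if thread_ratio < 0 or thread_ratio > 1:
        return available_threads
    if thread_ratio == 0:
        return 1  # 0 indicates a single thread
    # A fraction in (0, 1] uses that percentage of the available threads.
    return max(1, round(thread_ratio * available_threads))

print(resolve_thread_count(0.5, 8))  # 4
print(resolve_thread_count(0.0, 8))  # 1
```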

allow_missing_dependent : bool, optional

Specifies if a missing target value is allowed.

  • False: Not allowed. An error occurs if a missing target is present.

  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies which INTEGER column(s) should be treated as categorical. The default behavior is:

  • string: categorical

  • integer or float: continuous.

Valid only for INTEGER variables; omitted otherwise.

Default value detected from input data.

sample_fraction : float, optional

The fraction of data used for training.

If there are n rows of data and the sample fraction is r, then n*r rows are selected for training.

Defaults to 1.0.
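The arithmetic is simply n*r; in this sketch the product is truncated to an integer, which is an assumption about rounding, and training_sample_size is a hypothetical helper:

```python
def training_sample_size(n_rows: int, sample_fraction: float) -> int:
    # n rows with sample fraction r yield n * r rows for training.
    return int(n_rows * sample_fraction)

print(training_sample_size(1000, 0.8))  # 800
```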

compression : bool, optional

Specifies if the model is stored in compressed format.

Default value depends on the SAP HANA Version. Please refer to the corresponding documentation of SAP HANA PAL.

max_bits : int, optional

The maximum number of bits to quantize continuous features.

Equivalent to using \(2^{max\_bits}\) bins.

Must be less than 31.

Valid only when the value of compression is True.

Defaults to 12.
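The relationship between max_bits and the number of quantization bins can be sketched as follows; quantization_bins is a hypothetical helper, not part of hana_ml:

```python
def quantization_bins(max_bits: int) -> int:
    # A continuous feature quantized with max_bits bits uses 2 ** max_bits bins.
    if not 0 < max_bits < 31:
        raise ValueError("max_bits must be a positive integer less than 31")
    return 2 ** max_bits

print(quantization_bins(12))  # 4096 bins with the default setting
```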

quantize_rate : float, optional

Quantizes a categorical feature if the largest class frequency of the feature is less than quantize_rate.

Valid only when compression is True.

Defaults to 0.005.

strata : List of tuples: (class, fraction), optional

Strata proportions for stratified sampling.

A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample.

If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata.

If strata is not provided, bagging is used instead of stratified sampling.
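The remainder rule described above can be sketched as follows; effective_strata is a hypothetical helper shown only to illustrate the documented behavior, which PAL applies internally:

```python
def effective_strata(strata: dict, classes: list) -> dict:
    # Distribute any unassigned fraction equally among classes with no
    # entry in strata, or among all classes if every class has an entry.
    remainder = 1.0 - sum(strata.values())
    missing = [c for c in classes if c not in strata]
    recipients = missing if missing else classes
    resolved = dict(strata)
    for c in recipients:
        resolved[c] = resolved.get(c, 0.0) + remainder / len(recipients)
    return resolved

# Two of three classes specified; 'C' absorbs the remaining fraction (about 0.3).
print(effective_strata({'A': 0.5, 'B': 0.2}, ['A', 'B', 'C']))
```

The same remainder logic applies to the priors parameter described below.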

priors : List of tuples: (class, prior_prob), optional

Prior probabilities for classes.

A (class, prior_prob) tuple specifies the prior probability of this class.

If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in priors.

If priors is not provided, it is determined by the proportion of every class in the training data.

model_format : {'json', 'pmml'}, optional

Specifies the model format to store, case-insensitive.

  • 'json': export model in json format.

  • 'pmml': export model in pmml format.

Not effective if compression is True, in which case the model is stored in compressed format rather than in json or pmml.

Defaults to 'pmml'.

References

Parameters compression, max_bits and quantize_rate are for compressing the Random Decision Trees classification model; please see Model Compression for more details on this topic.

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY        LABEL
0    Sunny  75.0      70.0   Yes         Play
1    Sunny   NaN      90.0   Yes  Do not Play
2    Sunny  85.0       NaN    No  Do not Play
3    Sunny  72.0      95.0    No  Do not Play

Creating RDTClassifier instance:

>>> rfc = RDTClassifier(n_estimators=3,
...                     max_features=3, random_state=2,
...                     split_threshold=0.00001,
...                     calculate_oob=True,
...                     min_samples_leaf=1, thread_ratio=1.0)

Performing fit() on given dataframe:

>>> rfc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> rfc.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0       OUTLOOK    0.449550
1          TEMP    0.216216
2      HUMIDITY    0.208108
3         WINDY    0.126126

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK     TEMP  HUMIDITY WINDY
0   0  Overcast     75.0  -10000.0   Yes
1   1      Rain     78.0      70.0   Yes

Performing predict() on given dataframe:

>>> result = rfc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   0  Play    0.666667
1   1  Play    0.666667

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
0   0     Sunny    70      90.0   Yes  Play
1   1  Overcast    81      90.0   Yes  Play
2   2      Rain    65      80.0    No  Play

Performing score() on given dataframe:

>>> rfc.score(df3, key='ID')
0.6666666666666666

Attributes:
model_ : DataFrame

Trained model content.

feature_importances_ : DataFrame

The feature importance (the higher, the more important the feature).

oob_error_ : DataFrame

Out-of-bag error rate or mean squared error for random decision trees up to indexed tree. Set to None if calculate_oob is False.

confusion_matrix_ : DataFrame

Confusion matrix used to evaluate the performance of classification algorithms.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label, ...])

Train the model on the input data.

predict(data[, key, features, verbose, ...])

Predict dependent variable values based on fitted model.

score(data[, key, features, label, ...])

Returns the mean accuracy on the given test data and labels.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on the input data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last non-ID column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

No default value.

Returns:
Fitted object.
predict(data, key=None, features=None, verbose=None, block_size=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:

  • 'feature_marginalized': marginalise each missing feature out independently.

  • 'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to 'feature_marginalized'.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point.

Returns:
DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • SCORE, representing the predicted class labels.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class.

score(data, key=None, features=None, label=None, block_size=None, missing_replacement=None)

Returns the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last non-ID column.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:

  • 'feature_marginalized': marginalise each missing feature out independently.

  • 'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to 'feature_marginalized'.

Returns:
float

Mean accuracy on the given test data and labels.

create_model_state(model=None, function=None, pal_funcname='PAL_RANDOM_DECISION_TREES', state_description=None, force=False)

Create PAL model state.

Parameters:
model : DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function : str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for RDT.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_RANDOM_DECISION_TREES'.

state_description : str, optional

Description of the state as model container.

Defaults to None.

force : bool, optional

If True, delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:
state : DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
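A minimal sketch of the dict form; all values below are placeholders for illustration, not real connection details:

```python
# Hypothetical placeholder values; in practice these come from the state
# created by create_model_state() and the SAP HANA connection in use.
state = {
    'STATE_ID': '<state-id>',
    'HINT': '<hint>',
    'HOST': '<hana-host>',
    'PORT': '<hana-port>',
}
assert {'STATE_ID', 'HINT', 'HOST', 'PORT'} <= state.keys()
```

A DataFrame with NAME/VALUE columns carrying these same four entries is the equivalent DataFrame form.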

Inherited Methods from PALBase

Besides the methods mentioned above, the RDTClassifier class also inherits methods from the PALBase class; please refer to PAL Base for more details.