DecisionTreeClassifier

class hana_ml.algorithms.pal.trees.DecisionTreeClassifier(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)

Decision Tree model for classification.

Parameters:

algorithm: {'c45', 'chaid', 'cart'}, optional

Algorithm used to grow a decision tree. Case-insensitive.

'c45': C4.5 algorithm.

'chaid': Chi-square automatic interaction detection.

'cart': Classification and regression tree.

Defaults to 'cart'.

thread_ratio: float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.
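
The mapping described above can be sketched in plain Python. This is an illustration only: `thread_cap` is a hypothetical helper that is not part of hana_ml, and rounding fractional results up is an assumption. PAL's actual thread scheduling is internal.

```python
import math

def thread_cap(thread_ratio, available_threads):
    """Upper bound on threads, or None when PAL would decide heuristically.

    Hypothetical helper for illustration; rounding up is an assumption.
    """
    if thread_ratio is None or not 0 <= thread_ratio <= 1:
        return None  # outside [0, 1]: PAL picks the thread count heuristically
    if thread_ratio == 0:
        return 1  # 0 indicates a single thread
    return max(1, math.ceil(thread_ratio * available_threads))

for ratio in (0, 0.5, 1, -1):
    print(ratio, thread_cap(ratio, 16))
```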

allow_missing_dependent: bool, optional

Specifies if a missing target value is allowed.

False: Not allowed. An error occurs if a missing target is present.

True: Allowed. The datum with the missing target is removed.

Defaults to True.

percentage: float, optional

Specifies the percentage of the input data that will be used to build the tree model.

The rest of the data will be used for pruning.

Defaults to 1.0.

min_records_of_parent: int, optional

Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min_records_of_leaf: int, optional

Specifies the minimum number of records that every leaf must contain.

Defaults to 1.

max_depth: int, optional

The maximum depth of a tree.

By default the value is unlimited.

categorical_variable: str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

The default behavior is:

string: categorical

integer and float: continuous.

Valid only for integer variables; ignored otherwise.

Default value detected from input data.

split_threshold: float, optional

Specifies the stop condition for a node:

C45: The information gain ratio of the best split is less than this value.

CHAID: The p-value of the best split is greater than or equal to this value.

CART: The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the split_threshold value is, the larger a C45 or CART tree grows.

Conversely, CHAID grows a larger tree with a larger split_threshold value.

Defaults to 1e-5 for C45 and CART, 0.05 for CHAID.
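
To make the CART stop condition concrete, here is a minimal sketch of a Gini-reduction test. It uses the textbook definition (parent Gini minus the size-weighted Gini of the children). Whether PAL computes the reduction in exactly this way is an assumption, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, children):
    """Parent Gini minus the size-weighted Gini of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

def should_stop(parent, children, split_threshold=1e-5):
    """Stop splitting when the best split gains less than split_threshold."""
    return gini_reduction(parent, children) < split_threshold

parent = ["Play", "Play", "Do not Play", "Do not Play"]
perfect = [["Play", "Play"], ["Do not Play", "Do not Play"]]
useless = [["Play", "Do not Play"], ["Play", "Do not Play"]]

print(gini_reduction(parent, perfect), should_stop(parent, perfect))
print(gini_reduction(parent, useless), should_stop(parent, useless))
```

A split that separates the classes completely yields a large reduction and splitting continues; a split that leaves the class mix unchanged yields no reduction and the node becomes a leaf.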

discretization_type: {'mdlpc', 'equal_freq'}, optional

Strategy for discretizing continuous attributes. Case-insensitive.

'mdlpc': Minimum description length principle criterion.

'equal_freq': Equal frequency discretization.

Valid only when algorithm is 'c45' or 'chaid'.

Defaults to 'mdlpc'.

bins: list of tuples (column name, number of bins), optional

Specifies the number of bins for discretization.

Only valid when discretization_type is 'equal_freq'.

Defaults to 10 for each column.
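
As an illustration of what equal-frequency discretization means, the sketch below picks cut points so that each bin receives roughly the same number of values. The helper names and sample values are hypothetical, and PAL's exact boundary handling may differ.

```python
def equal_freq_edges(values, n_bins):
    """Inner cut points that split the sorted values into n_bins groups of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins] for i in range(1, n_bins)]

def assign_bin(value, edges):
    """Index of the bin a value falls into (a value equal to an edge goes to the upper bin)."""
    return sum(value >= edge for edge in edges)

# The constructor argument would name the column and its bin count, for example:
bins = [("HUMIDITY", 4)]

humidity = [70.0, 90.0, 85.0, 95.0, 75.0, 78.0, 66.0, 69.0]  # illustrative values
edges = equal_freq_edges(humidity, bins[0][1])
counts = [0] * bins[0][1]
for value in humidity:
    counts[assign_bin(value, edges)] += 1

print(edges)
print(counts)
```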

max_branch: int, optional

Specifies the maximum number of branches.

Valid only when algorithm is 'chaid'.

Defaults to 10.

merge_threshold: float, optional

Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.

Only valid when algorithm is 'chaid'.

Defaults to 0.05.

use_surrogate: bool, optional

If true, use surrogate split when NULL values are encountered.

Only valid when algorithm is 'cart'.

Defaults to True.

model_format: {'json', 'pmml'}, optional

Specifies the format in which the tree model is stored. Case-insensitive.

'json': Export the model in JSON format.
'pmml': Export the model in PMML format.

Defaults to 'json'.

output_rules: bool, optional

If true, output decision rules.

Defaults to True.

priors: list of tuples (class, prior_prob), optional

Specifies the prior probability of every class label.

Default value detected from data.

output_confusion_matrix: bool, optional

If true, output the confusion matrix.

Defaults to True.

resampling_method: {'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

The resampling method for model evaluation or parameter search.

Once set, model evaluation or parameter search is enabled.

No default value.

evaluation_metric: {'error_rate', 'nll', 'auc'}, optional

The evaluation metric. Once resampling_method is set, this parameter must be set.

No default value.

fold_num: int, optional

The fold number for cross validation. Valid only, and mandatory, when resampling_method is 'cv' or 'stratified_cv'.

No default value.

repeat_times: int, optional

The number of times model evaluation or parameter selection is repeated.

Defaults to 1.

timeout: int, optional

The time allocated (in seconds) for the run, where 0 indicates no limit.

Defaults to 0.

random_search_times: int, optional

Specifies the number of search times for random search.

Only valid and mandatory when search_strategy is set as 'random'.

No default value.
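
The dependencies between the resampling and search parameters described above can be checked before constructing the classifier. The following is a minimal sketch; `check_search_settings` is a hypothetical helper that only encodes the rules stated in this section and does not reproduce hana_ml's own validation.

```python
def check_search_settings(params):
    """Return a list of problems with the resampling/search settings (hypothetical helper)."""
    problems = []
    method = params.get("resampling_method")
    if method is None:
        return problems  # no resampling: evaluation and search are disabled
    if params.get("evaluation_metric") is None:
        problems.append("evaluation_metric must be set when resampling_method is set")
    if method in ("cv", "stratified_cv") and params.get("fold_num") is None:
        problems.append("fold_num is mandatory for 'cv' and 'stratified_cv'")
    if (params.get("search_strategy") == "random"
            and params.get("random_search_times") is None):
        problems.append("random_search_times is mandatory when search_strategy is 'random'")
    return problems

settings = {
    "resampling_method": "cv",
    "evaluation_metric": "error_rate",
    "fold_num": 5,
    "search_strategy": "random",
}
print(check_search_settings(settings))
```

A dict that passes the check can then be unpacked into the constructor as keyword arguments.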

progress_indicator_id: str, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_values: dict or ListOfTuples, optional

Specifies values of parameters to be selected.

Input should be a dict or a list of size-two tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.

Only valid when resampling_method and search_strategy are both specified.

Valid Parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.

No default value.

param_range: dict or ListOfTuples, optional

Specifies ranges of parameters to be selected.

Input should be a dict or a list of size-two tuples, with the key/1st element being the name of the target parameter (as a string) and the value/2nd element specifying the range of that parameter as [start, step, end] or [start, end].

Valid only when resampling_method and search_strategy are both specified.

Valid Parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.

No default value.
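
To show how param_values and param_range describe a set of candidate settings, the sketch below expands both into an explicit list of combinations. It is an illustration only: the helper names are hypothetical, treating the end of a range as inclusive is an assumption, only the three-element [start, step, end] form is handled, and PAL's own enumeration is internal.

```python
from itertools import product

SEARCHABLE = {
    "discretization_type", "min_records_of_leaf", "min_records_of_parent",
    "max_depth", "split_threshold", "max_branch", "merge_threshold",
}

def expand_range(start, step, end):
    """Values from start to end in increments of step (end assumed inclusive)."""
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values

def candidate_grid(param_values=None, param_range=None):
    """All combinations described by param_values and param_range.

    Both arguments accept a dict or a list of size-two tuples.
    """
    axes = dict(param_values or {})
    for name, (start, step, end) in dict(param_range or {}).items():
        axes[name] = expand_range(start, step, end)
    unknown = set(axes) - SEARCHABLE
    if unknown:
        raise ValueError(f"not searchable: {sorted(unknown)}")
    names = sorted(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]

grid = candidate_grid(
    param_values={"discretization_type": ["mdlpc", "equal_freq"]},
    param_range=[("max_depth", [2, 2, 6])],
)
for candidate in grid:
    print(candidate)
```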

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
0    Sunny    75      70.0   Yes         Play
1    Sunny    80      90.0   Yes  Do not Play
2    Sunny    85      85.0    No  Do not Play
3    Sunny    72      95.0    No  Do not Play

Creating DecisionTreeClassifier instance:

>>> dtc = DecisionTreeClassifier(algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)

Performing fit() on given dataframe:

>>> dtc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='CLASS')
>>> dtc.decision_rules_.collect()
   ROW_INDEX                                                  RULES_CONTENT
0         0                                       (TEMP>=84) => Do not Play
1         1                         (TEMP<84) && (OUTLOOK=Overcast) => Play
2         2         (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3         3 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4         4       (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5         5               (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
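
The rule strings can also be evaluated outside the database, for example to sanity-check a single record. The sketch below assumes the rule syntax seen in the output above (`&&` between conditions, `=>` before the class label); the parser and helper names are hypothetical and this is not how PAL itself performs prediction.

```python
import operator
import re

# Rule text copied from the decision_rules_ output above.
RULES = [
    "(TEMP>=84) => Do not Play",
    "(TEMP<84) && (OUTLOOK=Overcast) => Play",
    "(TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play",
    "(TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play",
    "(TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play",
    "(TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play",
]

OPS = {">=": operator.ge, "<=": operator.le, "<": operator.lt,
       ">": operator.gt, "=": operator.eq}
CONDITION = re.compile(r"\((\w+)(>=|<=|<|>|=)(.+)\)")

def parse_rule(rule):
    """Split '(A<1) && (B=x) => label' into ([(column, op, value), ...], label)."""
    body, label = rule.split(" => ")
    conditions = []
    for part in body.split(" && "):
        column, op, raw = CONDITION.fullmatch(part.strip()).groups()
        try:
            value = float(raw)  # numeric threshold such as 84 or 82.5
        except ValueError:
            value = raw         # categorical value such as 'Sunny'
        conditions.append((column, op, value))
    return conditions, label

def classify(record, rules=RULES):
    """Label of the first rule whose conditions all hold, or None if none match."""
    for rule in rules:
        conditions, label = parse_rule(rule)
        if all(OPS[op](record[column], value) for column, op, value in conditions):
            return label
    return None

record = {"OUTLOOK": "Rain", "HUMIDITY": 78.0, "TEMP": 70, "WINDY": "Yes"}
print(classify(record))
```

A record whose category appears in no rule returns None in this sketch, whereas predict() still assigns a class to such rows.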

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY
 0  Overcast      75.0    70   Yes
 1      Rain      78.0    70   Yes
 2     Sunny      66.0    70   Yes
 3     Sunny      69.0    70   Yes
 4      Rain       NaN    70   Yes
 5      None      70.0    70   Yes
 6       ***      70.0    70   Yes

Performing predict() on given dataframe:

>>> result = dtc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID        SCORE  CONFIDENCE
 0         Play    1.000000
 1  Do not Play    1.000000
 2         Play    1.000000
 3         Play    1.000000
 4  Do not Play    1.000000
 5         Play    0.692308
 6         Play    0.692308

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY        LABEL
0   0  Overcast      75.0    70   Yes         Play
1   1      Rain      78.0    70    No  Do not Play
2   2     Sunny      66.0    70   Yes         Play
3   3     Sunny      69.0    70   Yes         Play

Performing score() on given dataframe:

>>> dtc.score(df3, key='ID')
0.75
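
The returned value is the fraction of correctly classified records. A minimal sketch of that calculation follows; the predicted labels are inferred by applying the decision rules printed earlier to df3 and are not output captured from predict().

```python
def mean_accuracy(predicted, actual):
    """Fraction of records whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

actual = ["Play", "Do not Play", "Play", "Play"]  # LABEL column of df3
predicted = ["Play", "Play", "Play", "Play"]      # inferred from the decision rules
print(mean_accuracy(predicted, actual))           # 3 of 4 correct: 0.75
```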

Attributes:

model_: DataFrame. Trained model content.
decision_rules_: DataFrame. Rules for decision tree to make decisions. Set to None if output_rules is False.
confusion_matrix_: DataFrame. Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.
stats_: DataFrame. Statistics information.
cv_: DataFrame. Cross validation information. Only has output when parameter selection is enabled.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features, label, ...])	Function for building a decision tree classifier.
`predict`(data[, key, features, verbose])	Prediction function for a fitted DecisionTreeClassifier.
`score`(data[, key, features, label])	Returns the mean accuracy on the given test data and labels.
`set_model_state`(state)	Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Function for building a decision tree classifier.

Parameters:

data: DataFrame

Training data.

key: str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label: str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variable: str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns:

Fitted object.
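
The defaults for features and label can be summarized with a short sketch. `resolve_columns` is a hypothetical helper that mirrors only the rules stated above; it does not model the case where key is taken from the DataFrame index.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Apply the documented defaults for label and features (hypothetical helper)."""
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]  # last non-ID column
    if features is None:
        features = [c for c in non_id if c != label]  # all non-ID, non-label columns
    return features, label

columns = ["ID", "OUTLOOK", "HUMIDITY", "TEMP", "WINDY", "LABEL"]
print(resolve_columns(columns, key="ID"))
```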

predict(data, key=None, features=None, verbose=False)

Prediction function for a fitted DecisionTreeClassifier.

Parameters:

data: DataFrame

DataFrame containing the data.

key: str, optional

Name of the ID column in data.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

verbose: bool, optional

Specifies whether to output all classes and the corresponding confidences for each data record in data.

Defaults to False.

Returns:

DataFrame

Predict result, structured as follows:

ID column, with the same name and type as the ID column in data.

SCORE, type NVARCHAR(100), prediction class labels.

CONFIDENCE, type DOUBLE, confidence values w.r.t. the corresponding assigned class labels.

score(data, key=None, features=None, label=None)

Returns the mean accuracy on the given test data and labels.

Parameters:

data: DataFrame

Data on which to assess model performance.

key: str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label: str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

Returns:

float: Mean accuracy on the given test data and labels.

create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)

Create PAL model state.

Parameters:

model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for Decision Tree.

pal_funcname: int or str, optional

PAL function name. Should be a valid PAL procedure name that supports model state.

Defaults to 'PAL_DECISION_TREE'.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state: DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values corresponding to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
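
A dict passed as state can be checked for the required keys beforehand. This is a minimal sketch; the helper name is hypothetical and the values shown are placeholders, not real state information.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}

def missing_state_keys(state):
    """Required keys that are absent from a state dict (hypothetical helper)."""
    return sorted(REQUIRED_STATE_KEYS - set(state))

# Placeholder values for illustration only.
state = {"STATE_ID": "<state id>", "HINT": "<hint>", "HOST": "<host>"}
print(missing_state_keys(state))
```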

Inherited Methods from PALBase

Besides the methods mentioned above, the DecisionTreeClassifier class also inherits methods from the PALBase class. Please refer to PAL Base for more details.