DecisionTreeClassifier
- class hana_ml.algorithms.pal.trees.DecisionTreeClassifier(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)
A decision tree is used as a classifier for determining an appropriate action or decision among a predetermined set of actions for a given case.
A decision tree helps effectively identify the factors to consider and how each factor has historically been associated with different outcomes of the decision. A decision tree uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree can be a leaf node or a decision node.
Leaf node: indicates the value of the dependent (target) variable.
Decision node: contains one condition that specifies some test on an attribute value. The outcome of the condition is further divided into branches with sub-trees or leaf nodes.
The PAL_DECISION_TREE function in PAL integrates three of the most popular decision tree algorithms: C45, CHAID, and CART. Some distinctions between the implementations and usage of these three algorithms are listed below.
C45 and CHAID can generate non-binary trees as well as binary trees, while CART is restricted to binary trees.
Unlike C45 and CHAID, CART supports not only classification but also regression.
C45 and CHAID treat a missing independent variable as a special value, whereas CART applies surrogate splits to handle it.
For an ordered independent variable, C45 and CHAID first discretize it, whereas CART uses the predicate {is xm ≤ c?} instead of {is xm ∈ s?}. Consequently, for a large dataset with many ordered independent variables, CART is more efficient than the other two.
For splitting, C45 uses the information gain ratio, CHAID uses chi-square statistics, and CART uses the Gini index (classification) or least squares (regression).
In this function, the dependent variable, known as the class label or response, can have missing values, but such records are discarded before the tree is grown. Likewise, independent variables consisting of a single identical value or only missing values are removed beforehand.
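As a minimal pure-Python sketch of CART's Gini-based split criterion mentioned above (independent of PAL's actual implementation, for illustration only):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label collection: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, splits):
    """Weighted reduction in Gini impurity achieved by a split (CART's criterion)."""
    n = len(parent)
    return gini(parent) - sum(len(s) / n * gini(s) for s in splits)

# A pure split of a toy label set yields the maximum possible reduction.
parent = ['yes', 'yes', 'no', 'no']
left, right = ['yes', 'yes'], ['no', 'no']
print(gini(parent))                           # 0.5
print(gini_reduction(parent, [left, right]))  # 0.5
```

CART would stop splitting a node once the best achievable reduction falls below split_threshold (see the parameter below).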
- Parameters:
- algorithm{'c45', 'chaid', 'cart'}, optional
Algorithm used to grow a decision tree. Case-insensitive.
'c45': C4.5 algorithm.
'chaid': Chi-square automatic interaction detection.
'cart': Classification and regression tree.
Defaults to 'cart'.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates using a single thread, while 1 means using all available threads. Values outside the range are ignored, and the function heuristically determines the number of threads to use.
Defaults to -1.
- allow_missing_dependentbool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentagefloat, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
- min_records_of_parentint, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
- min_records_of_leafint, optional
Guarantees the minimum number of records in a leaf node.
Defaults to 1.
- max_depthint, optional
The maximum depth of a tree.
By default the value is unlimited.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
Default value detected from input data.
- split_thresholdfloat, optional
Specifies the stop condition for a node:
C45: The information gain ratio of the best split is less than this value.
CHAID: The p-value of the best split is greater than or equal to this value.
CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the split_threshold value is, the larger a C45 or CART tree grows. On the contrary, CHAID grows a larger tree with a larger split_threshold value.
Defaults to 1e-5 for C45 and CART, 0.05 for CHAID.
- discretization_type{'mdlpc', 'equal_freq'}, optional
Strategy for discretizing continuous attributes. Case-insensitive.
'mdlpc': Minimum description length principle criterion.
'equal_freq': Equal frequency discretization.
Valid only when algorithm is 'c45' or 'chaid'.
Defaults to 'mdlpc'.
- binsList of tuples: (column name, number of bins), optional
Specifies the number of bins for discretization.
Only valid when discretization_type is 'equal_freq'.
Defaults to 10 for each column.
- max_branchint, optional
Specifies the maximum number of branches.
Valid only when algorithm is 'chaid'.
Defaults to 10.
- merge_thresholdfloat, optional
Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.
Only valid when algorithm is 'chaid'.
Defaults to 0.05.
- use_surrogatebool, optional
If true, use surrogate split when NULL values are encountered.
Only valid when algorithm is 'cart'.
Defaults to True.
- model_format{'json', 'pmml'}, optional
Specifies the tree model format for store. Case-insensitive.
'json': export model in json format.
'pmml': export model in pmml format.
Defaults to 'json'.
- output_rulesbool, optional
If true, output decision rules.
Defaults to True.
- priorsList of tuples: (class, prior_prob), optional
Specifies the prior probability of every class label.
Default value detected from data.
- output_confusion_matrixbool, optional
If true, output the confusion matrix.
Defaults to True.
- resampling_method{'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional
The resampling method for model evaluation or parameter search.
Once set, model evaluation or parameter search is enabled.
No default value.
- evaluation_metric{'error_rate', 'nll', 'auc'}, optional
The evaluation metric. Once resampling_method is set, this parameter must be set.
No default value.
- fold_numint, optional
The fold number for cross validation. Valid only and mandatory when resampling_method is set to 'cv' or 'stratified_cv'.
No default value.
- repeat_timesint, optional
The number of repeated times for model evaluation or parameter selection.
Defaults to 1.
- timeoutint, optional
The time allocated (in seconds) for program running, where 0 indicates unlimited.
Defaults to 0.
- random_search_timesint, optional
Specifies the number of search times for random search.
Only valid and mandatory when search_strategy is set to 'random'.
No default value.
- progress_indicator_idstr, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_valuesdict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with key/1st element being the target parameter name, while value/2nd element being a list of values for selection.
Only valid when resampling_method and search_strategy are both specified.
Valid parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.
No default value.
- param_rangedict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with key/1st element being the name of the target parameter (in string format), while value/2nd element specifies the range of that parameter as [start, step, end] or [start, end].
Valid only when resampling_method and search_strategy are both specified.
Valid parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.
No default value.
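As a hedged sketch of how param_values and param_range might be assembled (the parameter names come from the lists above; the candidate values and the search_strategy='grid' setting in the comment are illustrative assumptions, not prescribed defaults):

```python
# Candidate values enumerated explicitly, keyed by the parameter names
# listed above for param_values.
param_values = {
    'max_depth': [3, 5, 7],
    'discretization_type': ['mdlpc', 'equal_freq'],
}

# Ranges expressed as [start, step, end] or [start, end], per the
# param_range description above.
param_range = {
    'split_threshold': [1e-5, 1e-4, 1e-3],  # [start, step, end]
    'min_records_of_leaf': [1, 5],          # [start, end]
}

# These would then be passed to the constructor together with
# resampling_method and search_strategy (hypothetical call, not executed here):
# dtc = DecisionTreeClassifier(algorithm='cart', resampling_method='cv',
#                              fold_num=5, evaluation_metric='error_rate',
#                              search_strategy='grid',
#                              param_values=param_values,
#                              param_range=param_range)
print(sorted(param_values))
```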
Examples
>>> dtc = DecisionTreeClassifier(algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)
Perform fit():
>>> dtc.fit(data=df_train, features=['F1', 'F2'],
...         label='LABEL')
>>> dtc.decision_rules_.collect()
Perform predict():
>>> res = dtc.predict(data=df_predict, key='ID', verbose=False)
>>> res.collect()
Perform score():
>>> dtc.score(data=df_score, key='ID')
- Attributes:
- model_DataFrame
Model content.
- decision_rules_DataFrame
Rules for the decision tree to make decisions. Set to None if output_rules is False.
- confusion_matrix_DataFrame
Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.
- stats_DataFrame
Statistics.
- cv_DataFrame
Cross validation information. Only has output when parameter selection is enabled.
Methods
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
fit(data[, key, features, label, ...])
Fit the model to the training dataset.
predict(data[, key, features, verbose, ...])
Predict dependent variable values based on a fitted model.
score(data[, key, features, label])
Returns the mean accuracy on the given test data and labels.
set_model_state(state)
Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "DecisionTreeClassifier".
- predict(data, key=None, features=None, verbose=False, verbose_top_n=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- dataDataFrame
DataFrame containing the data.
- keystr, optional
Name of the ID column in data.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- verbosebool, optional
Specifies whether to output all classes and the corresponding confidences for each data record in data.
Defaults to False.
- verbose_top_nint, optional
Specifies the number of top n classes to present after sorting by confidence. It cannot exceed the number of classes in the label of the training data, and it can be 0, which means outputting the confidences of all classes.
Effective only when verbose is set to True.
Defaults to 0.
- Returns:
- DataFrame
Predict result, structured as follows:
ID column, with the same name and type as the ID column in data.
SCORE, type NVARCHAR(100), prediction class labels.
CONFIDENCE, type DOUBLE, confidence values w.r.t. the corresponding assigned class labels.
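The verbose_top_n behavior described above can be sketched in pure Python (an illustration of the selection logic under stated assumptions, not PAL's internal code; the class names and confidences are made up):

```python
def top_n_classes(confidences, n):
    """Rank per-class confidences for one record and keep the top n.

    confidences: dict mapping class label -> confidence.
    n = 0 means keep all classes, mirroring the verbose_top_n default.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if n == 0 else ranked[:n]

conf = {'A': 0.1, 'B': 0.7, 'C': 0.2}
print(top_n_classes(conf, 2))  # [('B', 0.7), ('C', 0.2)]
print(top_n_classes(conf, 0))  # all three classes, highest confidence first
```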
- score(data, key=None, features=None, label=None)
Returns the mean accuracy on the given test data and labels.
- Parameters:
- dataDataFrame
Data on which to assess model performance.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- float
Mean accuracy on the given test data and labels.
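The mean accuracy returned by score() is simply the fraction of predictions that match the true labels; a minimal pure-Python sketch (toy labels, not PAL's implementation):

```python
def mean_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    assert len(predicted) == len(actual)
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Three of four toy predictions are correct.
print(mean_accuracy(['A', 'B', 'B', 'A'], ['A', 'B', 'A', 'A']))  # 0.75
```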
- create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Decision Tree.
- pal_funcnameint or str, optional
PAL function name. Should be a valid PAL procedure name that supports model state.
Defaults to 'PAL_DECISION_TREE'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, it will delete the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
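A hedged sketch of the dict form of state described above (all values here are hypothetical placeholders; only the four key names come from the description):

```python
# Hypothetical state dict for set_model_state(); the four keys below are
# the ones required per the structure described above.
state = {
    'STATE_ID': 'example-state-id',        # placeholder ID
    'HINT': '',                            # placeholder hint
    'HOST': 'hana-host.example.com',       # placeholder host name
    'PORT': '30015',                       # placeholder port
}

required = {'STATE_ID', 'HINT', 'HOST', 'PORT'}
assert required.issubset(state)
print(sorted(state))
```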
Inherited Methods from PALBase
Besides the methods mentioned above, the DecisionTreeClassifier class also inherits methods from the PALBase class; please refer to PAL Base for more details.