DecisionTreeClassifier
- class hana_ml.algorithms.pal.trees.DecisionTreeClassifier(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)
Decision Tree model for classification.
- Parameters:
- algorithm: {'c45', 'chaid', 'cart'}, optional
Algorithm used to grow a decision tree. Case-insensitive.
'c45': C4.5 algorithm.
'chaid': Chi-square automatic interaction detection.
'cart': Classification and regression tree.
Defaults to 'cart'.
- thread_ratio: float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- allow_missing_dependent: bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentage: float, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
- min_records_of_parent: int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
- min_records_of_leaf: int, optional
Specifies the minimum number of records in a leaf.
Defaults to 1.
- max_depth: int, optional
The maximum depth of a tree.
By default the value is unlimited.
- categorical_variable: str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
The default behavior is:
string: categorical
integer and float: continuous.
Valid only for integer variables; ignored otherwise.
Default value detected from input data.
- split_threshold: float, optional
Specifies the stop condition for a node:
C45: The information gain ratio of the best split is less than this value.
CHAID: The p-value of the best split is greater than or equal to this value.
CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the split_threshold value is, the larger a C45 or CART tree grows. Conversely, CHAID grows a larger tree with a larger split_threshold value.
Defaults to 1e-5 for C45 and CART, 0.05 for CHAID.
- discretization_type: {'mdlpc', 'equal_freq'}, optional
Strategy for discretizing continuous attributes. Case-insensitive.
'mdlpc': Minimum description length principle criterion.
'equal_freq': Equal frequency discretization.
Valid only when algorithm is 'c45' or 'chaid'.
Defaults to 'mdlpc'.
- bins: List of tuples: (column name, number of bins), optional
Specifies the number of bins for discretization.
Only valid when discretization_type is 'equal_freq'.
Defaults to 10 for each column.
- max_branch: int, optional
Specifies the maximum number of branches.
Valid only when algorithm is 'chaid'.
Defaults to 10.
- merge_threshold: float, optional
Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.
Only valid when algorithm is 'chaid'.
Defaults to 0.05.
- use_surrogate: bool, optional
If true, use surrogate split when NULL values are encountered.
Only valid when algorithm is 'cart'.
Defaults to True.
- model_format: {'json', 'pmml'}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
'json': export model in json format.
'pmml': export model in pmml format.
Defaults to 'json'.
- output_rules: bool, optional
If true, output decision rules.
Defaults to True.
- priors: List of tuples: (class, prior_prob), optional
Specifies the prior probability of every class label.
Default value detected from data.
- output_confusion_matrix: bool, optional
If true, output the confusion matrix.
Defaults to True.
- resampling_method: {'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional
The resampling method for model evaluation or parameter search.
Once set, model evaluation or parameter search is enabled.
No default value.
- evaluation_metric: {'error_rate', 'nll', 'auc'}, optional
The evaluation metric. Once resampling_method is set, this parameter must be set.
No default value.
- fold_num: int, optional
The fold number for cross validation. Valid only and mandatory when resampling_method is set as 'cv' or 'stratified_cv'.
No default value.
- repeat_times: int, optional
The number of repeated times for model evaluation or parameter selection.
Defaults to 1.
- timeout: int, optional
The time allocated (in seconds) for program running, where 0 indicates unlimited.
Defaults to 0.
- random_search_times: int, optional
Specifies the number of search times for random search.
Only valid and mandatory when search_strategy is set as 'random'.
No default value.
- progress_indicator_id: str, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_values: dict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.
Only valid when resampling_method and search_strategy are both specified.
Valid parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.
No default value.
- param_range: dict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the name of the target parameter (in string format) and the value/2nd element specifying the range of that parameter as [start, step, end] or [start, end].
Valid only when resampling_method and search_strategy are both specified.
Valid parameter names include: 'discretization_type', 'min_records_of_leaf', 'min_records_of_parent', 'max_depth', 'split_threshold', 'max_branch', 'merge_threshold'.
No default value.
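The dependencies among the model-evaluation and parameter-search arguments described above can be checked before a classifier is constructed. The following is a minimal sketch in plain Python; check_search_config is a hypothetical helper introduced for illustration and is not part of hana_ml, and it only encodes the rules stated in the parameter descriptions.

```python
# Hypothetical helper (not part of hana_ml): encodes the documented
# dependencies between the evaluation and parameter-search arguments.
def check_search_config(resampling_method=None, evaluation_metric=None,
                        fold_num=None, search_strategy=None,
                        random_search_times=None, param_values=None,
                        param_range=None):
    problems = []
    if resampling_method is not None:
        # evaluation_metric must be set once resampling_method is set.
        if evaluation_metric is None:
            problems.append("evaluation_metric is required")
        # fold_num is mandatory for the two cross-validation methods.
        if resampling_method in ('cv', 'stratified_cv') and fold_num is None:
            problems.append("fold_num is required")
    # random_search_times is mandatory for random search.
    if search_strategy == 'random' and random_search_times is None:
        problems.append("random_search_times is required")
    # param_values / param_range need both resampling and a search strategy.
    if (param_values or param_range) and not (resampling_method and search_strategy):
        problems.append("resampling_method and search_strategy are required")
    return problems


search_kwargs = dict(
    resampling_method='cv',
    evaluation_metric='error_rate',
    fold_num=5,
    search_strategy='random',
    random_search_times=4,
    param_values={'max_depth': [3, 5, 7], 'min_records_of_leaf': [1, 5]},
)
print(check_search_config(**search_kwargs))  # prints []
```

With a HANA connection available, the same keyword arguments would presumably be passed on as DecisionTreeClassifier(algorithm='cart', **search_kwargs).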
Examples
Input dataframe for training:
>>> df1.head(4).collect()
  OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
0   Sunny    75      70.0   Yes         Play
1   Sunny    80      90.0   Yes  Do not Play
2   Sunny    85      85.0    No  Do not Play
3   Sunny    72      95.0    No  Do not Play
Creating DecisionTreeClassifier instance:
>>> dtc = DecisionTreeClassifier(algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)
Performing fit() on given dataframe:
>>> dtc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> dtc.decision_rules_.collect()
   ROW_INDEX  RULES_CONTENT
0          0  (TEMP>=84) => Do not Play
1          1  (TEMP<84) && (OUTLOOK=Overcast) => Play
2          2  (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3          3  (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4          4  (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5          5  (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
Input dataframe for predicting:
>>> df2.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes
Performing predict() on given dataframe:
>>> result = dtc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308
Input dataframe for scoring:
>>> df3.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY        LABEL
0   0  Overcast      75.0    70   Yes         Play
1   1      Rain      78.0    70    No  Do not Play
2   2     Sunny      66.0    70   Yes         Play
3   3     Sunny      69.0    70   Yes         Play
Performing score() on given dataframe:
>>> dtc.score(df3, key='ID')
0.75
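To see where the score of 0.75 comes from, the printed decision rules can be applied by hand to the four scoring rows. The sketch below is plain Python and does not call hana_ml; predict_rule is a hypothetical function that transcribes the six rules from decision_rules_, and it does not reproduce how PAL handles missing or unseen values.

```python
def predict_rule(row):
    """Apply the six decision rules shown in decision_rules_ to one record."""
    if row['TEMP'] >= 84:
        return 'Do not Play'
    if row['OUTLOOK'] == 'Overcast':
        return 'Play'
    if row['OUTLOOK'] == 'Sunny':
        return 'Play' if row['HUMIDITY'] < 82.5 else 'Do not Play'
    if row['OUTLOOK'] == 'Rain':
        return 'Do not Play' if row['WINDY'] == 'Yes' else 'Play'
    return None  # value not covered by any rule; PAL's handling is not modeled


# The scoring data from df3, copied from the example.
df3_rows = [
    {'ID': 0, 'OUTLOOK': 'Overcast', 'HUMIDITY': 75.0, 'TEMP': 70, 'WINDY': 'Yes', 'LABEL': 'Play'},
    {'ID': 1, 'OUTLOOK': 'Rain', 'HUMIDITY': 78.0, 'TEMP': 70, 'WINDY': 'No', 'LABEL': 'Do not Play'},
    {'ID': 2, 'OUTLOOK': 'Sunny', 'HUMIDITY': 66.0, 'TEMP': 70, 'WINDY': 'Yes', 'LABEL': 'Play'},
    {'ID': 3, 'OUTLOOK': 'Sunny', 'HUMIDITY': 69.0, 'TEMP': 70, 'WINDY': 'Yes', 'LABEL': 'Play'},
]

predictions = [predict_rule(row) for row in df3_rows]
# Mean accuracy: fraction of rows whose predicted class equals the label.
accuracy = sum(p == row['LABEL'] for p, row in zip(predictions, df3_rows)) / len(df3_rows)
print(predictions)
print(accuracy)  # 0.75
```

Only the record with ID 1 is misclassified: the rules send a rainy, non-windy day to 'Play', while its label is 'Do not Play'.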
- Attributes:
- model_: DataFrame
Trained model content.
- decision_rules_: DataFrame
Rules for decision tree to make decisions. Set to None if output_rules is False.
- confusion_matrix_: DataFrame
Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.
- stats_: DataFrame
Statistics information.
- cv_: DataFrame
Cross validation information. Only has output when parameter selection is enabled.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Function for building a decision tree classifier.
predict(data[, key, features, verbose]): Prediction function for a fitted DecisionTreeClassifier.
score(data[, key, features, label]): Returns the mean accuracy on the given test data and labels.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Function for building a decision tree classifier.
- Parameters:
- data: DataFrame
Training data.
- key: str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variable: str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- Returns:
- Fitted object.
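The defaulting rules for features and label can be illustrated with a short sketch. This is plain Python and not hana_ml code; resolve_columns is a hypothetical helper, and it does not model the case where key is taken from the DataFrame index.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Mimic the documented defaults for `features` and `label`."""
    non_id = [c for c in columns if c != key]
    if label is None:
        # label defaults to the last non-ID column
        label = non_id[-1]
    if features is None:
        # features default to all non-ID, non-label columns
        features = [c for c in non_id if c != label]
    return features, label


cols = ['ID', 'OUTLOOK', 'HUMIDITY', 'TEMP', 'WINDY', 'LABEL']
print(resolve_columns(cols, key='ID'))
# (['OUTLOOK', 'HUMIDITY', 'TEMP', 'WINDY'], 'LABEL')
```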
- predict(data, key=None, features=None, verbose=False)
Prediction function for a fitted DecisionTreeClassifier.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column in data.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- verbose: bool, optional
Specifies whether to output all classes and the corresponding confidences for each data record in data.
Defaults to False.
- Returns:
- DataFrame
Predict result, structured as follows:
ID column, with the same name and type as the ID column in data.
SCORE, type NVARCHAR(100), prediction class labels.
CONFIDENCE, type DOUBLE, confidence values w.r.t. the corresponding assigned class labels.
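Once the result has been collected, the CONFIDENCE column can be used to flag uncertain predictions. The sketch below works on plain tuples copied from the prediction example earlier on this page; low_confidence_ids and the 0.9 threshold are illustrative assumptions, not part of hana_ml.

```python
# (ID, SCORE, CONFIDENCE) rows copied from the predict() example output.
result_rows = [
    (0, 'Play', 1.000000),
    (1, 'Do not Play', 1.000000),
    (2, 'Play', 1.000000),
    (3, 'Play', 1.000000),
    (4, 'Do not Play', 1.000000),
    (5, 'Play', 0.692308),
    (6, 'Play', 0.692308),
]


def low_confidence_ids(rows, threshold=0.9):
    """Return the IDs whose predicted class has confidence below threshold."""
    return [row_id for row_id, _score, confidence in rows if confidence < threshold]


print(low_confidence_ids(result_rows))  # [5, 6]
```

In the example data, the two flagged records are the ones whose OUTLOOK value is missing or not seen in training.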
- score(data, key=None, features=None, label=None)
Returns the mean accuracy on the given test data and labels.
- Parameters:
- data: DataFrame
Data on which to assess model performance.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- float
Mean accuracy on the given test data and labels.
- create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Decision Tree.
- pal_funcname: int or str, optional
PAL function name. Should be a valid PAL procedure name that supports model state.
Defaults to 'PAL_DECISION_TREE'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True it will delete the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
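The two accepted forms of state carry the same four entries. The sketch below, in plain Python, shows how NAME/VALUE rows map to the dict form and how the required keys could be checked; state_from_rows and missing_state_keys are hypothetical helpers, and all values are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = {'STATE_ID', 'HINT', 'HOST', 'PORT'}


def state_from_rows(rows):
    """Convert (NAME, VALUE) rows, as in the DataFrame form, into a dict."""
    return {name: value for name, value in rows}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values for illustration only.
rows = [
    ('STATE_ID', 'example-state-id'),
    ('HINT', 'example-hint'),
    ('HOST', 'example-host'),
    ('PORT', '30003'),
]
state = state_from_rows(rows)
print(missing_state_keys(state))                # []
print(missing_state_keys({'STATE_ID': 'abc'}))  # ['HINT', 'HOST', 'PORT']
```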
Inherited Methods from PALBase
Besides the methods mentioned above, the DecisionTreeClassifier class also inherits methods from the PALBase class. Please refer to PAL Base for more details.