DecisionTreeRegressor
- class hana_ml.algorithms.pal.trees.DecisionTreeRegressor(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)
DecisionTreeRegressor is a decision tree-based machine learning model used for regression tasks, which predicts continuous output values by learning simple decision rules inferred from the data features.
- Parameters:
- algorithm: {'cart'}, optional
Algorithm used to grow a decision tree.
'cart': Classification and Regression tree.
If not specified, defaults to 'cart'.
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to -1.
- allow_missing_dependent: bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentage: float, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
- min_records_of_parent: int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
- min_records_of_leaf: int, optional
Specifies the minimum number of records in a leaf.
Defaults to 1.
- max_depth: int, optional
The maximum depth of a tree.
By default it is unlimited.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
Default value detected from input data.
- split_threshold: float, optional
Specifies the stop condition for a node:
CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the split_threshold value is, the larger a CART tree grows.
Defaults to 1e-5 for CART.
- use_surrogate: bool, optional
If true, use surrogate split when NULL values are encountered.
Defaults to True.
- model_format: {'json', 'pmml'}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
'json': export model in JSON format.
'pmml': export model in PMML format.
Defaults to 'json'.
- output_rules: bool, optional
If true, output decision rules.
Defaults to True.
- resampling_method: {'cv', 'bootstrap'}, optional
The resampling method for model evaluation or parameter search. Once set, model evaluation or parameter search is enabled.
No default value.
- evaluation_metric: {'mae', 'rmse'}, optional
The evaluation metric. Once resampling_method is set, this parameter must be set.
No default value.
- fold_num: int, optional
The fold number for cross validation.
Valid only, and mandatory, when resampling_method is set to 'cv'.
No default value.
- repeat_times: int, optional
The number of times model evaluation or parameter search is repeated.
Defaults to 1.
- timeout: int, optional
The time allocated (in seconds) for the program to run.
0 indicates unlimited.
Defaults to 0.
- search_strategy: {'random', 'grid'}, optional
The search strategy for parameters.
If not specified, parameter selection cannot be carried out.
No default value.
- random_search_times: int, optional
Specifies the number of search times for random search.
Valid only, and mandatory, when search_strategy is set to 'random'.
No default value.
- progress_indicator_id: str, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_values: dict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.
Only valid when resampling_method and search_strategy are both specified.
Valid parameters for value specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
- param_range: dict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the name of the target parameter (as a string) and the value/2nd element specifying the range of that parameter as [start, step, end] or [start, end].
Only valid when resampling_method and search_strategy are both specified.
Valid parameters for range specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
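The constraints among the evaluation and search parameters above can be checked before constructing the estimator. The following is a minimal sketch only: build_search_kwargs and VALID_SEARCH_PARAMS are hypothetical names, not part of hana_ml, and the helper assumes parameter search is wanted, so it requires search_strategy even though the constructor treats it as optional.

```python
# Hypothetical helper (not part of hana_ml): validates the documented
# constraints and returns keyword arguments for DecisionTreeRegressor.
VALID_SEARCH_PARAMS = {
    "split_threshold", "max_depth",
    "min_records_of_leaf", "min_records_of_parent",
}


def build_search_kwargs(resampling_method, evaluation_metric, search_strategy,
                        fold_num=None, random_search_times=None,
                        param_values=None, param_range=None):
    if resampling_method not in ("cv", "bootstrap"):
        raise ValueError("resampling_method must be 'cv' or 'bootstrap'")
    if evaluation_metric not in ("mae", "rmse"):
        raise ValueError("evaluation_metric must be 'mae' or 'rmse'")
    if search_strategy not in ("random", "grid"):
        raise ValueError("search_strategy must be 'random' or 'grid'")
    if resampling_method == "cv" and fold_num is None:
        raise ValueError("fold_num is mandatory when resampling_method is 'cv'")
    if search_strategy == "random" and random_search_times is None:
        raise ValueError("random_search_times is mandatory for random search")

    kwargs = {
        "resampling_method": resampling_method,
        "evaluation_metric": evaluation_metric,
        "search_strategy": search_strategy,
    }
    optional = {
        "fold_num": fold_num,
        "random_search_times": random_search_times,
        "param_values": param_values,
        "param_range": param_range,
    }
    for name, value in optional.items():
        if value is None:
            continue
        if name in ("param_values", "param_range"):
            # Accept both documented input forms: dict or list of 2-tuples.
            pairs = value.items() if isinstance(value, dict) else value
            unknown = {param for param, _ in pairs} - VALID_SEARCH_PARAMS
            if unknown:
                raise ValueError(f"unsupported search parameters: {sorted(unknown)}")
        kwargs[name] = value
    return kwargs


search_kwargs = build_search_kwargs(
    resampling_method="cv",
    evaluation_metric="rmse",
    search_strategy="grid",
    fold_num=5,
    param_values={"max_depth": [3, 5, 7]},               # dict form
    param_range=[("min_records_of_leaf", [1, 1, 5])],    # [start, step, end]
)
```

With hana_ml installed and a HANA connection available, the resulting dict could be passed as DecisionTreeRegressor(**search_kwargs); the specific values shown here are illustrative, not recommendations.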
- Attributes:
- model_: DataFrame
Model content.
- decision_rules_: DataFrame
Rules for decision tree to make decisions. Set to None if output_rules is False.
- stats_: DataFrame
Statistics.
- cv_: DataFrame
Cross validation information. Only has content when parameter selection is enabled.
Methods
- create_model_state([model, function, ...]): Create PAL model state.
- delete_model_state([state]): Delete PAL model state.
- fit(data[, key, features, label, ...]): Fit the model to the training dataset.
- predict(data[, key, features, verbose]): Predict dependent variable values based on a fitted model.
- score(data[, key, features, label]): Returns the coefficient of determination R2 of the prediction.
- set_model_state(state): Set the model state by state information.
Examples
>>> from hana_ml.algorithms.pal.trees import DecisionTreeRegressor
>>> dtr = DecisionTreeRegressor(algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)
Perform fit():
>>> dtr.fit(data=df_train, key='ID')
>>> dtr.decision_rules_.collect()
Perform predict():
>>> res = dtr.predict(data=df_predict, key='ID')
>>> res.collect()
Perform score():
>>> dtr.score(data=df_score, key='ID')
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- data: DataFrame
Training data.
- key: str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features: a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable. Defaults to the name of the last non-ID column.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "DecisionTreeRegressor".
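The defaulting rules for key, features and label described above can be sketched in plain Python. This is an illustration of the documented behavior only, not the library's implementation; resolve_fit_columns and its columns/index arguments are hypothetical stand-ins for the column names and index of a DataFrame.

```python
def resolve_fit_columns(columns, index=None, key=None, features=None, label=None):
    """Hypothetical sketch of the documented defaults for fit().

    columns: ordered list of column names in the training data.
    index:   index column name, list of index column names, or None.
    """
    if isinstance(index, str):
        index = [index]
    # key defaults to the index column only when the index is a single column.
    if key is None and index is not None and len(index) == 1:
        key = index[0]
    # With no key, the data is assumed to contain no ID column.
    non_id = [c for c in columns if c != key]
    # label defaults to the last non-ID column.
    if label is None:
        label = non_id[-1]
    # features default to all non-ID, non-label columns.
    if features is None:
        features = [c for c in non_id if c != label]
    return key, features, label


indexed = resolve_fit_columns(["ID", "X1", "X2", "Y"], index="ID")
unindexed = resolve_fit_columns(["ID", "X1", "X2", "Y"])
```

In the second call nothing identifies ID as a key, so under these rules it is treated as a feature; passing key explicitly avoids that.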
- predict(data, key=None, features=None, verbose=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column in data.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- verbose: bool, optional (deprecated)
Specifies whether to output all classes and the corresponding confidences for each data record in data.
Non-effective, reserved only for forward compatibility.
- Returns:
- DataFrame
Predict result, structured as follows:
ID column, with the same name and type as the ID column in data.
SCORE, type NVARCHAR(100), predicted values.
CONFIDENCE, type DOUBLE, all 0s.
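Because SCORE is returned as NVARCHAR(100), the predicted values arrive as strings and usually need converting before numeric use. A minimal sketch, assuming the rows are available as (ID, SCORE, CONFIDENCE) tuples; the sample rows below are made up, and scores_to_float is a hypothetical helper rather than a hana_ml function.

```python
def scores_to_float(rows):
    """Map each ID to its predicted value, converting SCORE from str to float."""
    return {row_id: float(score) for row_id, score, _confidence in rows}


# Made-up rows in the documented (ID, SCORE, CONFIDENCE) layout.
rows = [(1, "12.5", 0.0), (2, "7.25", 0.0)]
predicted = scores_to_float(rows)
```

If res.collect() yields a pandas DataFrame, such tuples could be obtained with res.collect().itertuples(index=False, name=None).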
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R2 of the prediction.
- Parameters:
- data: DataFrame
Data on which to assess model performance.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- float
The coefficient of determination R2 of the prediction on the given data.
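For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The sketch below implements that textbook definition in plain Python; it is not taken from the PAL implementation, and the sample values are made up.

```python
def r2_score(y_true, y_pred):
    """Textbook R2: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        # R2 is undefined when the observed values are all identical.
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]   # made-up observed values
y_pred = [2.5, 5.0, 7.5, 9.0]   # made-up predictions
r2 = r2_score(y_true, y_pred)   # 1 - 0.5 / 20 = 0.975
```

A value of 1 means the predictions match the observations exactly; values can be negative when the model fits worse than predicting the mean.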
- create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Decision Tree.
- pal_funcname: int or str, optional
PAL function name. Should be a valid PAL procedure name that supports model state.
Defaults to 'PAL_DECISION_TREE'.
- state_description: str, optional
Description of the state as a model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
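The dict form of state can be checked for the four required keys before it is handed to set_model_state. A minimal sketch: check_state_dict and REQUIRED_STATE_KEYS are hypothetical names, and the values are placeholders to be replaced with real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def check_state_dict(state):
    """Raise KeyError if any of the documented keys is missing."""
    missing = [k for k in REQUIRED_STATE_KEYS if k not in state]
    if missing:
        raise KeyError(f"state is missing keys: {missing}")
    return state


# Placeholder values only; real ones come from an existing model state.
state = check_state_dict({
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
})

# The same information as (NAME, VALUE) rows, mirroring the DataFrame layout.
name_value_rows = list(state.items())
```

With a DecisionTreeRegressor instance at hand, the validated dict would then be passed as dtr.set_model_state(state).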
Inherited Methods from PALBase
Besides the methods mentioned above, the DecisionTreeRegressor class also inherits methods from the PALBase class. Please refer to PAL Base for more details.