DecisionTreeRegressor
- class hana_ml.algorithms.pal.trees.DecisionTreeRegressor(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)
Decision Tree model for regression.
- Parameters:
- algorithm{'cart'}, optional
Algorithm used to grow a decision tree.
'cart': Classification and Regression tree.
If not specified, defaults to 'cart'.
- thread_ratiofloat, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- allow_missing_dependentbool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentagefloat, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
- min_records_of_parentint, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
- min_records_of_leafint, optional
Guarantees the minimum number of records in a leaf.
Defaults to 1.
- max_depthint, optional
The maximum depth of a tree.
By default it is unlimited.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
The default behavior is:
string: categorical,
integer and float: continuous.
VALID only for integer variables, ignored otherwise.
Default value detected from input data.
- split_thresholdfloat, optional
Specifies the stop condition for a node:
CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the split_threshold value is, the larger a CART tree grows.
Defaults to 1e-5 for CART.
- use_surrogatebool, optional
If true, use surrogate split when NULL values are encountered.
Defaults to True.
- model_format{'json', 'pmml'}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
'json': export model in json format.
'pmml': export model in pmml format.
Defaults to json.
- output_rulesbool, optional
If true, output decision rules.
Defaults to True.
- resampling_method{'cv', 'bootstrap'}, optional
The resampling method for model evaluation or parameter search. Once set, model evaluation or parameter search is enabled.
No default value.
- evaluation_metric{'mae', 'rmse'}, optional
The evaluation metric. Once resampling_method is set, this parameter must be set.
No default value.
- fold_numint, optional
The fold number for cross validation.
Valid only and mandatory when resampling_method is set as 'cv'.
No default value.
- repeat_timesint, optional
The number of repeated times for model evaluation or parameter search.
Defaults to 1.
- timeoutint, optional
The time allocated (in seconds) for program running.
0 indicates unlimited.
Defaults to 0.
- search_strategy{'random', 'grid'}, optional
The search strategy for parameters.
If not specified, parameter selection cannot be carried out.
No default value.
- random_search_timesint, optional
Specifies the number of search times for random search.
Only valid and mandatory when search_strategy is set as 'random'.
No default value.
- progress_indicator_idstr, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_valuesdict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.
Only valid when resampling_method and search_strategy are both specified.
Valid parameters for values specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
- param_rangedict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the name of the target parameter (in string format) and the value/2nd element specifying the range of that parameter as [start, step, end] or [start, end].
Valid only when resampling_method and search_strategy are both specified.
Valid parameters for range specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
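The two accepted shapes for param_values and param_range can be sketched in plain Python. This is only an illustration of the structures described above: the helper names normalize_param_spec and check_ranges are hypothetical and not part of hana_ml, and the candidate values are made up for the example.

```python
# Parameter names that the documentation lists as searchable.
VALID_NAMES = {"split_threshold", "max_depth",
               "min_records_of_leaf", "min_records_of_parent"}


def normalize_param_spec(spec):
    """Hypothetical helper: turn a dict or a list of size-two tuples
    into a list of (name, values) tuples, rejecting unknown names."""
    pairs = list(spec.items()) if isinstance(spec, dict) else list(spec)
    for name, _ in pairs:
        if name not in VALID_NAMES:
            raise ValueError("not a searchable parameter: " + name)
    return pairs


def check_ranges(pairs):
    """Hypothetical helper: a range is [start, step, end] or [start, end]."""
    for name, bounds in pairs:
        if len(bounds) not in (2, 3):
            raise ValueError("bad range for " + name)
    return pairs


# The same candidate values written both ways.
param_values_dict = {"max_depth": [3, 5, 7], "min_records_of_leaf": [1, 5]}
param_values_tuples = [("max_depth", [3, 5, 7]), ("min_records_of_leaf", [1, 5])]

# A range given as [start, step, end].
param_range = check_ranges(
    normalize_param_spec([("split_threshold", [1e-5, 1e-4, 1e-3])]))

# Keyword arguments one might pass to the constructor for a grid search
# with 5-fold cross validation (values chosen only for illustration).
search_kwargs = dict(resampling_method="cv", fold_num=5,
                     evaluation_metric="rmse", search_strategy="grid",
                     param_values=param_values_dict)
```

Both spellings of param_values carry the same information, so the choice between a dict and a list of tuples is a matter of convenience.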
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Creating DecisionTreeRegressor instance:
>>> dtr = DecisionTreeRegressor(algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)
Performing fit() on given dataframe:
>>> dtr.fit(df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                  RULES_CONTENT
0          0   (A<-0.495502) && (B<-0.663588) => -85.8762
1          1  (A<-0.495502) && (B>=-0.663588) => -29.9827
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096
Performing predict() on given dataframe:
>>> result = dtr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0
Input dataframe for scoring:
>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Performing score() on given dataframe:
>>> dtr.score(df3, key='ID')
0.9999999999900131
- Attributes:
- model_DataFrame
Trained model content.
- decision_rules_DataFrame
Rules for decision tree to make decisions. Set to None if output_rules is False.
- stats_DataFrame
Statistics information.
- cv_DataFrame
Cross validation information. Only has content when parameter selection is enabled.
Methods
create_model_state([model, function, ...])    Create PAL model state.
delete_model_state([state])                   Delete PAL model state.
fit(data[, key, features, label, ...])        Function for building a decision tree regressor.
predict(data[, key, features, verbose])       Prediction function for a fitted DecisionTreeRegressor.
score(data[, key, features, label])           Returns the coefficient of determination R^2 of the prediction.
set_model_state(state)                        Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Function for building a decision tree regressor.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featureslist of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable. Defaults to the name of the last non-ID column.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- Returns:
- Fitted object.
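The defaulting rules for features and label can be sketched in plain Python. This is a minimal illustration of the rules as stated above, not the library's implementation: resolve_columns is a hypothetical helper, and it assumes key is passed explicitly rather than inferred from an index.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Hypothetical helper mirroring the documented defaults of fit()."""
    # Every column except the ID column is a candidate.
    non_id = [c for c in columns if c != key]
    # label defaults to the last non-ID column.
    if label is None:
        label = non_id[-1]
    # features defaults to all non-ID, non-label columns.
    if features is None:
        features = [c for c in non_id if c != label]
    return features, label


# Columns of the training dataframe used in the Examples section.
features, label = resolve_columns(["ID", "A", "B", "C", "D", "CLASS"], key="ID")
```

With the example table, the sketch selects A, B, C and D as features and CLASS as the label, which matches calling fit(df1, key='ID') without further arguments.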
- predict(data, key=None, features=None, verbose=None)
Prediction function for a fitted DecisionTreeRegressor.
- Parameters:
- dataDataFrame
DataFrame containing the data.
- keystr, optional
Name of the ID column in data.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- verbosebool, optional (deprecated)
Specifies whether to output all classes and the corresponding confidences for each data record in data.
Non-effective, reserved only for forward compatibility.
- Returns:
- DataFrame
Predict result, structured as follows:
ID column, with the same name and type as the ID column in data.
SCORE, type NVARCHAR(100), predicted values.
CONFIDENCE, type DOUBLE, all 0s.
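Because SCORE is declared as NVARCHAR(100), the predicted values may need converting before any arithmetic. The sketch below assumes the collected values arrive as strings; the rows list is a hypothetical stand-in for the collected result, and scores_as_float is not a hana_ml function.

```python
# Hypothetical rows standing in for the collected prediction result.
# SCORE is kept as text here because the column type is NVARCHAR(100).
rows = [
    {"ID": 0, "SCORE": "49.8229", "CONFIDENCE": 0.0},
    {"ID": 1, "SCORE": "4.87728", "CONFIDENCE": 0.0},
]


def scores_as_float(rows):
    """Map each ID to its predicted value converted to a float."""
    return {row["ID"]: float(row["SCORE"]) for row in rows}


predictions = scores_as_float(rows)
```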
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R^2 of the prediction.
- Parameters:
- dataDataFrame
Data on which to assess model performance.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- float
The coefficient of determination R^2 of the prediction on the given data.
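For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that standard formula, not the code PAL runs; r_squared is a hypothetical name, and the sample values are the first five rows from the Examples section.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# CLASS values and predicted SCORE values from the Examples section.
actual = [49.822907, 4.877286, 11.914875, 19.753078, 23.607000]
predicted = [49.8229, 4.87728, 11.9148, 19.753, 23.607]
r2 = r_squared(actual, predicted)
```

A value of 1.0 means the predictions match the targets exactly, while 0.0 means the model does no better than predicting the mean of the targets.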
- create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Decision Tree.
- pal_funcnameint or str, optional
PAL function name. Should be a valid PAL procedure name that supports model state.
Defaults to 'PAL_DECISION_TREE'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, deletes the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
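When state is passed as a dict, a quick check for the four required keys can catch mistakes before the call. This is a sketch under stated assumptions: missing_state_keys is a hypothetical helper, and every value in the sample dict is a placeholder rather than a real state.

```python
# Keys the documentation requires in the state information.
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values only; a real state comes from create_model_state.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
missing = missing_state_keys(state)
```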
Inherited Methods from PALBase
Besides the methods mentioned above, the DecisionTreeRegressor class also inherits methods from the PALBase class. Please refer to PAL Base for more details.