DecisionTreeRegressor
- class hana_ml.algorithms.pal.trees.DecisionTreeRegressor(algorithm='cart', thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True, output_confusion_matrix=True, resampling_method=None, fold_num=None, repeat_times=None, evaluation_metric=None, timeout=None, search_strategy=None, random_search_times=None, progress_indicator_id=None, param_values=None, param_range=None)
Decision Tree model for regression.
- Parameters:
- algorithm: {'cart'}, optional
Algorithm used to grow a decision tree.
'cart': Classification and Regression tree.
If not specified, defaults to 'cart'.
- thread_ratio: float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- allow_missing_dependent: bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentage: float, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
- min_records_of_parent: int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
- min_records_of_leaf: int, optional
Specifies the minimum number of records in a leaf.
Defaults to 1.
- max_depth: int, optional
The maximum depth of a tree.
By default it is unlimited.
- categorical_variable: str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
The default behavior is:
string: categorical,
integer and float: continuous.
Valid only for integer variables; ignored otherwise.
Default value detected from input data.
- split_threshold: float, optional
Specifies the stop condition for a node:
CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the split_threshold value is, the larger a CART tree grows.
Defaults to 1e-5 for CART.
- use_surrogate: bool, optional
If true, use surrogate split when NULL values are encountered.
Defaults to True.
- model_format: {'json', 'pmml'}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
'json': export the model in JSON format.
'pmml': export the model in PMML format.
Defaults to 'json'.
- output_rules: bool, optional
If true, output decision rules.
Defaults to True.
- resampling_method: {'cv', 'bootstrap'}, optional
The resampling method for model evaluation or parameter search. Once set, model evaluation or parameter search is enabled.
No default value.
- evaluation_metric: {'mae', 'rmse'}, optional
The evaluation metric. Once resampling_method is set, this parameter must be set.
No default value.
- fold_num: int, optional
The fold number for cross validation.
Valid only and mandatory when resampling_method is set as 'cv'.
No default value.
- repeat_times: int, optional
The number of repeated times for model evaluation or parameter search.
Defaults to 1.
- timeout: int, optional
The time allotted (in seconds) for the program to run.
0 indicates unlimited.
Defaults to 0.
- search_strategy: {'random', 'grid'}, optional
The search strategy for parameters.
If not specified, parameter selection cannot be carried out.
No default value.
- random_search_times: int, optional
Specifies the number of search times for random search.
Only valid and mandatory when search_strategy is set as 'random'.
No default value.
- progress_indicator_id: str, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_values: dict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.
Only valid when resampling_method and search_strategy are both specified.
Valid parameters for value specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
- param_range: dict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element being the name of the target parameter (as a string) and the value/2nd element specifying the range of that parameter as [start, step, end] or [start, end].
Valid only when resampling_method and search_strategy are both specified.
Valid parameters for range specification include: split_threshold, max_depth, min_records_of_leaf, min_records_of_parent.
No default value.
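The dependencies between the evaluation and search parameters are easy to get wrong, so the following is a minimal sketch that checks a dict of keyword arguments against the rules stated above before it is handed to the constructor. The helper check_search_config and the sample values are hypothetical and not part of hana_ml; the checks only mirror the wording of the parameter descriptions, not PAL's actual validation.

```python
def check_search_config(params):
    """Return a list of problems found in a dict of constructor keyword arguments.

    Hypothetical helper: it mirrors the rules in the parameter descriptions
    and is not part of hana_ml.
    """
    problems = []
    resampling = params.get("resampling_method")
    strategy = params.get("search_strategy")

    # evaluation_metric must accompany resampling_method.
    if resampling is not None and params.get("evaluation_metric") is None:
        problems.append("evaluation_metric must be set when resampling_method is set")
    # fold_num is mandatory for cross validation.
    if resampling == "cv" and params.get("fold_num") is None:
        problems.append("fold_num is mandatory when resampling_method is 'cv'")
    # random_search_times is mandatory for random search.
    if strategy == "random" and params.get("random_search_times") is None:
        problems.append("random_search_times is mandatory when search_strategy is 'random'")
    # param_values / param_range need both resampling_method and search_strategy.
    if resampling is None or strategy is None:
        for name in ("param_values", "param_range"):
            if params.get(name) is not None:
                problems.append(name + " needs both resampling_method and search_strategy")
    return problems


# Illustrative values only.
grid_search = {
    "resampling_method": "cv",
    "fold_num": 5,
    "evaluation_metric": "rmse",
    "search_strategy": "grid",
    "param_values": {"max_depth": [3, 5, 7]},
    "param_range": {"min_records_of_leaf": [1, 1, 5]},  # [start, step, end]
}
print(check_search_config(grid_search))

# With hana_ml installed, the same dict could then be passed on:
# dtr = DecisionTreeRegressor(**grid_search)
```

Keeping the settings in a plain dict makes it possible to validate and log them before a database connection is involved.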
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Creating DecisionTreeRegressor instance:
>>> dtr = DecisionTreeRegressor(algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)
Performing fit() on given dataframe:
>>> dtr.fit(df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                RULES_CONTENT
0          0   (A<-0.495502) && (B<-0.663588) => -85.8762
1          1  (A<-0.495502) && (B>=-0.663588) => -29.9827
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096
Performing predict() on given dataframe:
>>> result = dtr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0
Input dataframe for scoring:
>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Performing score() on given dataframe:
>>> dtr.score(df3, key='ID')
0.9999999999900131
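The instance in this example sets min_records_of_parent, min_records_of_leaf and split_threshold, which together with max_depth act as stop conditions while the tree grows. As a rough illustration of how such conditions can combine, here is a sketch in plain Python. The functions should_stop and split_allowed are hypothetical, the depth convention (root at depth 0) is an assumption, and PAL's internal logic may differ.

```python
def should_stop(n_records, depth, best_gain, *, min_records_of_parent=2,
                max_depth=None, split_threshold=1e-5):
    """Decide whether a node becomes a leaf (illustrative, not PAL's code).

    best_gain stands for the improvement of the best candidate split,
    for example the relative MSE reduction in a regression tree.
    """
    if n_records < min_records_of_parent:
        return True   # too few records to consider splitting
    if max_depth is not None and depth >= max_depth:
        return True   # depth limit reached (assumes the root is depth 0)
    if best_gain < split_threshold:
        return True   # best split does not improve the node enough
    return False


def split_allowed(n_left, n_right, *, min_records_of_leaf=1):
    """A candidate split is kept only if both children are large enough."""
    return n_left >= min_records_of_leaf and n_right >= min_records_of_leaf


print(should_stop(n_records=10, depth=0, best_gain=0.5))
print(split_allowed(n_left=0, n_right=10))
```

Lowering split_threshold or min_records_of_parent makes should_stop return False more often, which corresponds to a larger tree.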
- Attributes:
- model_: DataFrame
Trained model content.
- decision_rules_: DataFrame
Rules for decision tree to make decisions. Set to None if output_rules is False.
- stats_: DataFrame
Statistics information.
- cv_: DataFrame
Cross validation information. Only has content when parameter selection is enabled.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Function for building a decision tree regressor.
predict(data[, key, features, verbose]): Prediction function for a fitted DecisionTreeRegressor.
score(data[, key, features, label]): Returns the coefficient of determination R^2 of the prediction.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Function for building a decision tree regressor.
- Parameters:
- data: DataFrame
Training data.
- key: str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features: list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable. Defaults to the name of the last non-ID column.
- categorical_variable: str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- Returns:
- Fitted object.
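The defaults for features and label can be mimicked on a plain list of column names. This is a minimal sketch under the rules stated above; resolve_columns is a hypothetical helper, not a hana_ml function, and it does not model the index-based default for key.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Apply the documented defaults for label and features.

    columns is the ordered list of column names of the training data.
    """
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]                 # last non-ID column
    if features is None:
        features = [c for c in non_id if c != label]  # all non-ID, non-label columns
    return features, label


print(resolve_columns(["ID", "A", "B", "C", "D", "CLASS"], key="ID"))
# -> (['A', 'B', 'C', 'D'], 'CLASS')
```

For the training data in the Examples section, this matches calling fit(df1, key='ID') without features or label: CLASS becomes the dependent variable and A to D the features.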
- predict(data, key=None, features=None, verbose=None)
Prediction function for a fitted DecisionTreeRegressor.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column in data.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- verbose: bool, optional (deprecated)
Specifies whether to output all classes and the corresponding confidences for each data record in data.
Has no effect; reserved only for forward compatibility.
- Returns:
- DataFrame
Prediction result, structured as follows:
ID column, with the same name and type as the ID column in data.
SCORE, type NVARCHAR(100), predicted values.
CONFIDENCE, type DOUBLE, all 0s.
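Because SCORE is declared as NVARCHAR(100), the predicted values may arrive as strings on the client side and need converting before any arithmetic. The sketch below assumes rows shaped like the predict() output in the Examples section, as (ID, SCORE, CONFIDENCE) tuples; the helper scores_as_floats is hypothetical.

```python
# Rows shaped like the predict() output: (ID, SCORE, CONFIDENCE).
# SCORE is kept as text here, on the assumption that NVARCHAR values are returned as strings.
rows = [(0, "49.8229", 0.0), (1, "4.87728", 0.0), (2, "11.9148", 0.0)]


def scores_as_floats(rows):
    """Map each ID to its SCORE parsed as a float."""
    return {row_id: float(score) for row_id, score, _confidence in rows}


predicted = scores_as_floats(rows)
print(predicted)
```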
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R^2 of the prediction.
- Parameters:
- data: DataFrame
Data on which to assess model performance.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- float
The coefficient of determination R^2 of the prediction on the given data.
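For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The following sketch computes it in plain Python from true and predicted values; r_squared is an illustrative function, and whether score() evaluates the metric in exactly this way inside the database is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit -> 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean -> 0.0
```

A value close to 1, as in the Examples section, means the predictions reproduce almost all of the variance in the label column.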
- create_model_state(model=None, function=None, pal_funcname='PAL_DECISION_TREE', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function: str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for Decision Tree.
- pal_funcname: int or str, optional
PAL function name. Should be a valid PAL procedure name that supports model state.
Defaults to 'PAL_DECISION_TREE'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is dict, the keys must include STATE_ID, HINT, HOST and PORT.
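When the state is passed as a dict, a quick check for the four required keys can catch mistakes before the call. This is a minimal sketch; missing_state_keys is a hypothetical helper and the values are placeholders, not real connection details.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
print(missing_state_keys(state))

# With a fitted or constructed estimator, the dict could then be passed on:
# dtr.set_model_state(state)
```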
Inherited Methods from PALBase
Besides the methods mentioned above, the DecisionTreeRegressor class also inherits methods from the PALBase class. Please refer to PAL Base for more details.