RDTRegressor
- class hana_ml.algorithms.pal.trees.RDTRegressor(n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=None, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, compression=None, max_bits=None, quantize_rate=None, fittings_quantization=None, model_format=None)
Random Decision Tree model for regression.
- Parameters
- n_estimatorsint, optional
Specifies the number of decision trees in the model.
Defaults to 100.
- max_featuresint, optional
Specifies the number of randomly selected splitting variables.
Should not be larger than the number of input features.
Defaults to p/3, where p is the number of input features.
- max_depthint, optional
The maximum depth of a tree, where -1 means unlimited.
Defaults to 56.
- min_samples_leafint, optional
Specifies the minimum number of records in a leaf.
Defaults to 5 for regression.
- split_thresholdfloat, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.
- calculate_oobbool, optional
If True, calculate the out-of-bag error.
Defaults to True.
- random_stateint, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- thread_ratiofloat, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- allow_missing_dependentbool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with a missing target is removed.
Defaults to True.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string columns are treated as categorical, while integer and float columns are treated as continuous. Valid only for INTEGER variables; omitted otherwise.
Default value detected from input data.
- sample_fractionfloat, optional
The fraction of data used for training.
Assuming there are n rows of data and the sample fraction is r, then n*r rows are selected for training.
Defaults to 1.0.
- compressionbool, optional
Specifies if the model is stored in compressed format.
Default value depends on the SAP HANA Version. Please refer to the corresponding documentation of SAP HANA PAL.
- max_bitsint, optional
The maximum number of bits used to quantize continuous features, equivalent to using \(2^{max\_bits}\) bins.
Must be less than 31.
Valid only when compression is True.
Defaults to 12.
- quantize_ratefloat, optional
Quantizes a categorical feature if the largest class frequency of the feature is less than quantize_rate.
Valid only when compression is True.
Defaults to 0.005.
- fittings_quantizationint, optional
Indicates whether to quantize fitting values.
Valid only for regression when compression is True.
Defaults to False.
- model_format{'json', 'pmml'}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
'json': export model in json format.
'pmml': export model in pmml format.
Not effective if compression is True, in which case the model is stored in compressed format rather than json or pmml.
Defaults to 'pmml'.
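As a rough illustration of the \(2^{max\_bits}\) binning rule stated for max_bits, here is a hypothetical pure-Python sketch (a simplified illustration only, not PAL's actual binning code):

```python
def quantize_feature(values, max_bits=12):
    """Quantize a continuous feature into 2**max_bits equal-width bins.

    A simplified illustration of the max_bits rule above; PAL's
    internal binning may differ.
    """
    n_bins = 2 ** max_bits
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins if hi > lo else 1.0
    # Clamp to the last bin so the maximum value stays in range.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# With max_bits=2 there are 2**2 = 4 bins over [0.0, 1.0]:
quantize_feature([0.0, 0.5, 1.0], max_bits=2)  # -> [0, 2, 3]
```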
References
Parameters compression, max_bits, quantize_rate and fittings_quantization are for compressing the Random Decision Trees regression model; please see Model Compression for more details on this topic.
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D       CLASS
0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
Creating RDTRegressor instance:
>>> rfr = RDTRegressor(random_state=3)
Performing fit() on given dataframe:
>>> rfr.fit(df1, key='ID')
>>> rfr.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0             A    0.249593
1             B    0.381879
2             C    0.291403
3             D    0.077125
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.081277  0.204114  1.220580 -0.750665
1   1  0.524813 -0.012192 -0.418597  2.946886
Performing predict() on given dataframe:
>>> result = rfr.predict(df2, key='ID')
>>> result.collect()
   ID     SCORE  CONFIDENCE
0   0    48.126   62.952884
1   1  -10.9017   73.461039
Input dataframe for scoring:
>>> df3.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.081277  0.204114  1.220580 -0.750665  139.10170
1   1  0.524813 -0.012192 -0.418597  2.946886   52.17203
2   2 -0.280871  0.100554 -0.343715 -0.118843  -34.69829
3   3 -0.113992 -0.045573  0.957154  0.090350   51.93602
4   4  0.287476  1.266895  0.466325 -0.432323  106.63425
Performing score() on given dataframe:
>>> rfr.score(df3, key='ID')
0.6530768858159514
- Attributes
- model_DataFrame
Trained model content.
- feature_importances_DataFrame
The feature importance (the higher, the more important the feature).
- oob_error_DataFrame
Out-of-bag error rate or mean squared error for random decision trees up to the indexed tree. Set to None if calculate_oob is False.
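The oob_error_ attribute rests on the standard out-of-bag idea: each tree is fit on a bootstrap sample, and the rows it never saw serve as a held-out set for error estimation. A minimal sketch of that idea (an illustration only, not PAL's implementation):

```python
import random

def oob_indices(n_rows, seed):
    """Return the out-of-bag row indices for one bootstrap sample:
    the rows never drawn when sampling n_rows times with replacement."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return set(range(n_rows)) - in_bag

# Roughly a (1 - 1/e) ~ 63% fraction of rows lands in the bag,
# leaving ~37% out-of-bag for error estimation.
oob = oob_indices(1000, seed=42)
```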
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Train the model on the input data.
predict(data[, key, features, verbose, ...]): Predict dependent variable values based on fitted model.
score(data[, key, features, label, ...]): Returns the coefficient of determination R^2 of the prediction.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Train the model on the input data.
- Parameters
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the last non-ID column.
- categorical_variablestr or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
No default value.
- Returns
- Fitted object.
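The resolution rules above for key, features and label can be sketched as a small helper (a hypothetical pure-Python illustration that only mirrors the documented defaults; the function name is not part of the API):

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Mirror fit()'s documented defaults: label falls back to the
    last non-ID column, features to all non-ID, non-label columns."""
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]          # last non-ID column
    if features is None:
        features = [c for c in non_id if c != label]
    return features, label

# Matches the training dataframe in the Examples section:
resolve_columns(['ID', 'A', 'B', 'C', 'D', 'CLASS'], key='ID')
# -> (['A', 'B', 'C', 'D'], 'CLASS')
```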
- predict(data, key=None, features=None, verbose=None, block_size=None, missing_replacement=None)
Predict dependent variable values based on fitted model.
- Parameters
- dataDataFrame
Independent variable values to predict for.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- verbosebool, optional
If True, output all classes and the corresponding confidences for each data point. Valid only for classification.
Defaults to False if not provided.
- block_sizeint, optional
The number of rows loaded at a time during prediction.
0 indicates loading all data at once.
Defaults to 0.
- missing_replacementstr, optional
The missing replacement strategy:
'feature_marginalized': marginalise each missing feature out independently.
'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to 'feature_marginalized'.
- Returns
- DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted values.
CONFIDENCE, all 0s. It is included because PAL uses the same result table structure for classification.
- score(data, key=None, features=None, label=None, block_size=None, missing_replacement=None)
Returns the coefficient of determination R^2 of the prediction.
- Parameters
- dataDataFrame
Data on which to assess model performance.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the last non-ID column.
- block_sizeint, optional
The number of rows loaded at a time during prediction.
0 indicates loading all data at once.
Defaults to 0.
- missing_replacementstr, optional
The missing replacement strategy:
'feature_marginalized': marginalise each missing feature out independently.
'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to 'feature_marginalized'.
- Returns
- float
The coefficient of determination R^2 of the prediction on the given data.
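The returned value is the ordinary coefficient of determination, \(R^2 = 1 - SS_{res}/SS_{tot}\). A self-contained sketch of that computation (an illustration, not the server-side implementation):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity score() reports."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)       # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residuals
    return 1.0 - ss_res / ss_tot

r2_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])  # ~0.97
```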
- create_model_state(model=None, function=None, pal_funcname='PAL_RANDOM_DECISION_TREES', state_description=None, force=False)
Create PAL model state.
- Parameters
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for RDT.
- pal_funcnameint or str, optional
PAL function name.
Defaults to 'PAL_RANDOM_DECISION_TREES'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, the existing state is deleted first.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
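For the dict form, a minimal sketch (all values are hypothetical; only the required key names come from the structure described above):

```python
# Hypothetical values; only the required key names (STATE_ID, HINT,
# HOST, PORT) come from the documented state structure.
state = {
    "STATE_ID": "my-state-id",   # hypothetical state identifier
    "HINT": "RDT",               # hypothetical hint value
    "HOST": "my-hana-host",      # hypothetical host name
    "PORT": "30015",             # hypothetical SQL port
}
required = {"STATE_ID", "HINT", "HOST", "PORT"}
assert required <= state.keys()
# rfr.set_model_state(state)  # would attach this state to a fitted model
```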