SVRanking

class hana_ml.algorithms.pal.svm.SVRanking(c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None, compression=None, max_bits=None, max_quantization_iter=None, resampling_method=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, reduction_rate=None, aggressive_elimination=None)

Support Vector Ranking

Parameters:
c : float, optional

Trade-off between training error and margin. Value range > 0.

Defaults to 100.

kernel : {'linear', 'poly', 'rbf', 'sigmoid'}, optional

Specifies the kernel type to be used in the algorithm.

Defaults to 'rbf'.

degree : int, optional

Coefficient for the 'poly' kernel type. Value range >= 1.

Defaults to 3.

gamma : float, optional

Coefficient for the 'rbf' kernel type.

Defaults to 1.0/number of features in the dataset.

Only valid when kernel is 'rbf'.

coef_lin : float, optional

Coefficient for the 'poly'/'sigmoid' kernel type.

Defaults to 0.

coef_const : float, optional

Coefficient for the 'poly'/'sigmoid' kernel type.

Defaults to 0.

probability : bool, optional

If True, output probability during prediction.

Defaults to False.

shrink : bool, optional

If True, use the shrinking strategy.

Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range > 0.

Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range >= 0.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is [0, 1]: 0 indicates a single thread, 1 indicates up to all available threads, and values in between use that fraction of the available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.
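The thread_ratio semantics above can be sketched as follows. This is an illustrative pure-Python mapping only; `resolve_thread_count` is a hypothetical helper, and PAL's actual heuristic for out-of-range values is not specified here:

```python
def resolve_thread_count(thread_ratio, available_threads):
    """Illustrative mapping of thread_ratio to a thread count.

    Mirrors the documented semantics only; not PAL's implementation.
    """
    if 0 < thread_ratio <= 1:
        # Use that fraction of the available threads (at least one).
        return max(1, int(thread_ratio * available_threads))
    if thread_ratio == 0:
        return 1          # single thread
    return None           # out of range: PAL decides heuristically


print(resolve_thread_count(0.0, 8))   # 1
print(resolve_thread_count(0.5, 8))   # 4
print(resolve_thread_count(1.0, 8))   # 8
```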

scale_info : {'no', 'standardization', 'rescale'}, optional

Options:

  • 'no' : No scale.

  • 'standardization' : Transforms the data to have zero mean and unit variance.

  • 'rescale' : Rescales each feature linearly to the range [-1, 1].

Defaults to 'standardization'.
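The two scaling options can be illustrated with a small pure-Python sketch (PAL performs this server-side; this is only the mathematical idea):

```python
def standardize(xs):
    # 'standardization': transform to zero mean and unit variance.
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

def rescale(xs):
    # 'rescale': map the feature range linearly onto [-1, 1].
    lo, hi = min(xs), max(xs)
    return [2 * (x - lo) / (hi - lo) - 1 for x in xs]

data = [0.0, 1.0, 2.0, 3.0, 4.0]
print(rescale(data))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```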

handle_missing : bool, optional

Specifies whether to handle missing values in the data.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

No default value.

category_weight : float, optional

Represents the weight of category attributes. Value range > 0.

Defaults to 0.707.

compression : bool, optional

Specifies if the model is stored in compressed format.

The default value depends on the SAP HANA version; please refer to the corresponding SAP HANA PAL documentation.

max_bits : int, optional

The maximum number of bits used to quantize continuous features, equivalent to using 2^max_bits bins.

Must be less than 31.

Valid only when the value of compression is True.

Defaults to 12.

max_quantization_iter : int, optional

The maximum number of iteration steps for quantization.

Valid only when the value of compression is True.

Defaults to 1000.
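The relation between max_bits and the number of quantization bins can be sketched as follows. This is an illustrative uniform binning only; PAL's actual quantization is iterative (see max_quantization_iter), and `quantize` is a hypothetical helper:

```python
def quantize(value, lo, hi, max_bits=12):
    """Map a continuous value in [lo, hi] to one of 2**max_bits bins.

    Illustrative uniform binning; not PAL's actual algorithm.
    """
    bins = 2 ** max_bits
    if value >= hi:
        return bins - 1  # clamp the upper edge into the last bin
    return int((value - lo) / (hi - lo) * bins)


print(quantize(0.5, 0.0, 1.0, max_bits=2))  # bin 2 of 4
```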

resampling_method : str, optional

Specifies the resampling method for model evaluation or parameter selection.

  • 'cv'

  • 'cv_sha'

  • 'cv_hyperband'

  • 'bootstrap'

  • 'bootstrap_sha'

  • 'bootstrap_hyperband'

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

No default value.

Note

Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.

fold_num : int, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to 'cv', 'cv_sha' or 'cv_hyperband'.

No default value.

repeat_times : int, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy : str, optional

Specifies the parameter search method:

  • 'grid'

  • 'random'

Mandatory when resampling_method is set to 'cv_sha' or 'bootstrap_sha'.

Defaults to 'random' (and cannot be changed) when resampling_method is set to 'cv_hyperband' or 'bootstrap_hyperband'; otherwise there is no default value, and parameter selection is not activated if this parameter is not specified.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory when search_strategy is set to 'random', or when resampling_method is set to 'cv_hyperband' or 'bootstrap_hyperband'.

No default value.

random_state : int, optional

Specifies the seed for random generation. When 0 is specified, the system time is used.

Defaults to 0.

timeout : int, optional

Specifies the maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.

No default value.

param_values : dict or list of tuples, optional

Sets the values of the following parameters for model parameter selection:

coef_lin, coef_const, c.

If the input is a list of tuples, then each tuple should contain exactly two elements:

  • the 1st element is the parameter name (str type),

  • the 2nd element is a list of valid values for that parameter.

Otherwise, if the input is a dict, then each key must specify a parameter name, while the corresponding value specifies a list of valid values for that parameter.

A simple example for illustration:

[('c', [0.1, 0.2, 0.5]), ('coef_const', [0.2, 0.6])],

or

{'c' : [0.1, 0.2, 0.5], 'coef_const' : [0.2, 0.6]}

Valid only when resampling_method and search_strategy are both specified.

No default value.

param_range : dict or list of tuples, optional

Sets the range of the following parameters for model parameter selection:

coef_lin, coef_const, c.

If the input is a list of tuples, then each tuple should contain exactly two elements:

  • the 1st element is the parameter name (str type),

  • the 2nd element is a list that specifies the range of that parameter as [start, step, end].

Otherwise, if the input is a dict, then each key must specify a parameter name, while the corresponding value specifies the range of that parameter.

Valid only when resampling_method and search_strategy are both specified.

No default value.
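The two accepted input forms described above carry the same information; a small pure-Python sketch that normalizes either form to a dict (the parameter names 'c' and 'coef_const' follow the example under param_values; `normalize_param_spec` is a hypothetical helper, not part of hana_ml):

```python
def normalize_param_spec(spec):
    # Accepts either a dict or a list of (name, values) tuples,
    # as described for param_values / param_range, and returns a dict.
    if isinstance(spec, dict):
        return dict(spec)
    return {name: values for name, values in spec}


as_tuples = [('c', [0.1, 0.2, 0.5]), ('coef_const', [0.2, 0.6])]
as_dict = {'c': [0.1, 0.2, 0.5], 'coef_const': [0.2, 0.6]}
print(normalize_param_spec(as_tuples) == normalize_param_spec(as_dict))  # True
```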

reduction_rate : float, optional

Specifies the reduction rate in the SHA or Hyperband method.

For each round, the number of available parameter candidates is divided by the value of this parameter; thus a valid value must be greater than 1.0.

Valid only when resampling_method is set to one of the following values: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.

Defaults to 3.0.
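The elimination schedule this implies can be sketched as follows: each round keeps roughly 1/reduction_rate of the surviving candidates until one remains. An illustrative sketch only (`sha_candidate_counts` is a hypothetical helper, not PAL's scheduler):

```python
def sha_candidate_counts(n_candidates, reduction_rate=3.0):
    """Candidate counts per SHA round: each round keeps
    n / reduction_rate candidates until a single one survives."""
    counts = [n_candidates]
    while counts[-1] > 1:
        counts.append(max(1, int(counts[-1] / reduction_rate)))
    return counts


print(sha_candidate_counts(27))  # [27, 9, 3, 1]
```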

aggressive_elimination : bool, optional

Specifies whether to apply aggressive elimination while using SHA method.

Aggressive elimination happens when the data size and the number of parameters to be searched do not match, i.e. many parameter candidates remain while the data size has already reached its upper limit. If aggressive elimination is applied, the lower bound of the data-size limit is used multiple times first to reduce the number of parameter candidates.

Valid only when resampling_method is set to 'cv_sha' or 'bootstrap_sha'.

Defaults to False.


Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID  LABEL
0   0         1.0         1.0         0.0         0.2         0.0  qid:1      3
1   1         0.0         0.0         1.0         0.1         1.0  qid:1      2
2   2         0.0         0.0         1.0         0.3         0.0  qid:1      1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1      4
4   4         3.0         1.0         1.0         0.4         1.0  qid:1      5
5   5         4.0         1.0         1.0         0.7         0.0  qid:1      6
6   6         0.0         0.0         1.0         0.2         0.0  qid:2      1
7   7         1.0         0.0         1.0         0.4         0.0  qid:2      2
8   8         0.0         0.0         1.0         0.2         0.0  qid:2      1
9   9         1.0         1.0         1.0         0.2         0.0  qid:2      3

Create a SVRanking instance and call the fit function:

>>> svranking = svm.SVRanking(gamma=0.005)
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4',
...             'ATTRIBUTE5']
>>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')

Call the predict function:

>>> df_predict = conn.table("DATA_TBL_SVRANKING_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID
0   0         1.0         1.0         0.0         0.2         0.0  qid:1
1   1         0.0         0.0         1.0         0.1         1.0  qid:1
2   2         0.0         0.0         1.0         0.3         0.0  qid:1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1
4   4         3.0         1.0         1.0         0.4         1.0  qid:1
5   5         4.0         1.0         1.0         0.7         0.0  qid:1
6   6         0.0         0.0         1.0         0.2         0.0  qid:4
7   7         1.0         0.0         1.0         0.4         0.0  qid:4
8   8         0.0         0.0         1.0         0.2         0.0  qid:4
9   9         1.0         1.0         1.0         0.2         0.0  qid:4
>>> svranking.predict(df_predict, key='ID',
...                   features=features, qid='QID').head(10).collect()
    ID     SCORE PROBABILITY
0    0  -9.85138        None
1    1  -10.8657        None
2    2  -11.6741        None
3    3  -9.33985        None
4    4  -7.88839        None
5    5   -6.8842        None
6    6  -11.7081        None
7    7  -10.8003        None
8    8  -11.7081        None
9    9  -10.2583        None

Attributes:
model_ : DataFrame

Model content.

stat_ : DataFrame

Statistics content.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, qid, features, label, ...])

Fit the model when given training dataset and other attributes.

predict(data[, key, qid, features, verbose])

Predict the dataset using the trained model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, qid=None, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset and other attributes.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID, non-label, non-qid columns.

label : str, optional

Name of the label column.

If label is not provided, it defaults to the last non-ID, non-qid column.

categorical_variable : str or list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

No default value.

Returns:
Fitted object.

predict(data, key=None, qid=None, features=None, verbose=False)

Predict the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID, non-qid columns.

verbose : bool, optional

If True, output scoring probabilities for each class.

Defaults to False.

Returns:
DataFrame

Predict result, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • SCORE, type NVARCHAR(100), prediction value.

  • PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.

Note

PAL will throw an error if probability=True is set in the constructor but verbose=True is not passed to predict(). This is a known bug.

create_model_state(model=None, function=None, pal_funcname='PAL_SVM', state_description=None, force=False)

Create PAL model state.

Parameters:
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

A placeholder parameter, not effective for SVM.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_SVM'.

state_description : str, optional

Description of the state as a model container.

Defaults to None.

force : bool, optional

If True, deletes the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters:
state : DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values corresponding to each NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
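A minimal sketch of the dict form accepted by set_model_state; all values below are placeholders, not real connection details:

```python
# Hypothetical values for illustration only. STATE_ID, HINT, HOST and
# PORT are the keys required by set_model_state's dict form.
state = {
    'STATE_ID': 'my_state_id',       # placeholder state identifier
    'HINT': 'hint_value',            # placeholder hint
    'HOST': 'hana-host.example.com', # placeholder host name
    'PORT': '30015',                 # placeholder SQL port
}

# All four required keys are present:
print(set(state) == {'STATE_ID', 'HINT', 'HOST', 'PORT'})  # True
```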

Inherited Methods from PALBase

Besides the methods mentioned above, the SVRanking class also inherits methods from the PALBase class; please refer to PAL Base for more details.