SVC
- class hana_ml.algorithms.pal.svm.SVC(c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None, compression=None, max_bits=None, max_quantization_iter=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, reduction_rate=None, aggressive_elimination=None, onehot_min_frequency=None, onehot_max_categories=None)
Support Vector Machines (SVMs) are a family of supervised learning models built on the concept of support vectors.
Compared with many other supervised learning models, SVMs have the advantage that the models they produce can be either linear or non-linear, where the latter is realized by a technique called the kernel trick.
Like most supervised models, SVMs have a training phase and a testing phase. In the training phase, a function f(x) -> y is learnt, where f(∙) is a (possibly non-linear) function mapping a sample onto a TARGET. The training set consists of pairs {xi, yi}, where xi denotes a sample represented by several attributes and yi denotes its TARGET (supervised information). In the testing phase, the learnt f(∙) is used to map a sample with unknown TARGET onto its predicted TARGET.
Classification is one of the most frequent tasks in many fields, including machine learning, data mining, computer vision, and business data analysis. Compared with linear classifiers like logistic regression, SVC can produce a non-linear decision boundary, which leads to better accuracy on some real-world datasets. In the classification scenario, f(∙) refers to the decision function, and a TARGET refers to a "label" represented by a real number.
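To make the kernel trick mentioned above more concrete, the following is a minimal sketch of the four kernel types in their common textbook forms. The function names are hypothetical and not part of hana_ml, and the way degree, gamma, coef_lin and coef_const enter each formula is an assumption based on the parameter descriptions in this page, not a confirmed statement of how PAL computes them.

```python
import math


def dot(x, y):
    """Inner product of two equally sized numeric sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # Plain inner product: yields a linear decision boundary.
    return dot(x, y)


def poly_kernel(x, y, degree, coef_lin, coef_const):
    # Assumed textbook form: (coef_lin * <x, y> + coef_const) ** degree
    return (coef_lin * dot(x, y) + coef_const) ** degree


def rbf_kernel(x, y, gamma):
    # Assumed textbook form: exp(-gamma * ||x - y||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, coef_lin, coef_const):
    # Assumed textbook form: tanh(coef_lin * <x, y> + coef_const)
    return math.tanh(coef_lin * dot(x, y) + coef_const)
```

A non-linear kernel replaces the inner product used by a linear model, which is how the decision boundary can become non-linear without explicitly transforming the features.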
- Parameters:
- c: float, optional
Trade-off between training error and margin. Value range > 0.
Defaults to 100.0.
- kernel: {'linear', 'poly', 'rbf', 'sigmoid'}, optional
Specifies the kernel type to be used in the algorithm.
Defaults to 'rbf'.
- degree: int, optional
Coefficient for the 'poly' kernel type. Value range >= 1.
Defaults to 3.
- gamma: float, optional
Coefficient for the 'rbf' kernel type.
Defaults to 1.0/number of features in the dataset. Only valid when kernel is 'rbf'.
- coef_lin: float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
- coef_const: float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
- probability: bool, optional
If True, output probability during prediction.
Defaults to False.
- shrink: bool, optional
If True, use shrink strategy.
Defaults to True.
- tol: float, optional
Specifies the error tolerance in the training process. Value range > 0.
Defaults to 0.001.
- evaluation_seed: int, optional
The random seed in parameter selection. Value range >= 0.
Defaults to 0.
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.0.
- scale_info: {'no', 'standardization', 'rescale'}, optional
Options:
'no' : No scaling.
'standardization' : Transforms the data to have zero mean and unit variance.
'rescale' : Rescales each feature to the range [-1, 1].
Defaults to 'standardization'.
- handle_missing: bool, optional
Whether to handle missing values:
False: No,
True: Yes.
Defaults to True.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- category_weight: float, optional
Represents the weight of category attributes. Value range > 0.
Defaults to 0.707.
- compression: bool, optional
Specifies if the model is stored in compressed format.
Default value depends on the SAP HANA Version. Please refer to the corresponding documentation of SAP HANA PAL.
- max_bits: int, optional
The maximum number of bits to quantize continuous features, equivalent to using \(2^{max\_bits}\) bins.
Must be less than 31.
Valid only when the value of compression is True.
Defaults to 12.
- max_quantization_iter: int, optional
The maximum iteration steps for quantization.
Valid only when the value of compression is True.
Defaults to 1000.
- resampling_method: str, optional
Specifies the resampling method for model evaluation or parameter selection.
'cv'
'cv_sha'
'cv_hyperband'
'stratified_cv'
'stratified_cv_sha'
'stratified_cv_hyperband'
'bootstrap'
'bootstrap_sha'
'bootstrap_hyperband'
'stratified_bootstrap'
'stratified_bootstrap_sha'
'stratified_bootstrap_hyperband'
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
No default value.
Note
Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.
- evaluation_metric: {'ACCURACY', 'F1_SCORE', 'AUC'}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
No default value.
- fold_num: int, optional
Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is one of the following: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband'.
No default value.
- repeat_times: int, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
- search_strategy: str, optional
Specifies the parameter search method:
'grid'
'random'
Mandatory when resampling_method is one of the following: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha'.
Defaults to 'random' and cannot be changed if resampling_method is 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband' or 'stratified_bootstrap_hyperband'; otherwise there is no default value, and parameter selection cannot be activated if it is not specified.
- random_search_times: int, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory when search_strategy is set to 'random', or when resampling_method is set to one of the following: 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.
No default value.
- random_state: int, optional
Specifies the seed for random generation. Use system time when 0 is specified.
Defaults to 0.
- timeout: int, optional
Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_id: str, optional
Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.
No default value.
- param_values: dict or list of tuple, optional
Sets the values of the following parameters for model parameter selection: c, degree, coef_lin, coef_const.
If the input is a list of tuples, then each tuple should contain exactly two elements:
1st element is the parameter name (str type),
2nd element is a list of valid values for that parameter.
Otherwise, if the input is a dict, then the key of each element must specify a parameter name, while the value specifies a list of valid values for that parameter.
A simple example for illustration:
[('c', [0.1, 0.2, 0.5]), ('degree', [0.2, 0.6])],
or
dict(c=[0.1, 0.2, 0.5], degree = [0.2, 0.6])
Valid only when resampling_method and search_strategy are both specified.
No default value.
- param_range: dict or list of tuple, optional
Sets the range of the following parameters for model parameter selection: c, degree, coef_lin, coef_const.
If the input is a list of tuples, then each tuple should contain exactly two elements:
1st element is the parameter name (str type),
2nd element is a list that specifies the range of that parameter as [start, step, end], while step is ignored if search_strategy is 'random'.
Valid only when resampling_method and search_strategy are both specified.
No default value.
- reduction_rate: float, optional
Specifies the reduction rate in the SHA or Hyperband method.
In each round, the number of available parameter candidates is divided by the value of this parameter. Thus a valid value for this parameter must be greater than 1.0.
Valid only when resampling_method is set to one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.
Defaults to 3.0.
- aggressive_elimination: bool, optional
Specifies whether to apply aggressive elimination while using the SHA method.
Aggressive elimination applies when the data size and the number of parameter candidates to be searched do not match, so that many candidates remain to be searched when the data size reaches its upper limit. If aggressive elimination is applied, the lower bound of the data size limit is used multiple times first to reduce the number of candidates.
Valid only when resampling_method is set to one of the following: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha'.
Defaults to False.
- onehot_min_frequency: int, optional
Specifies the minimum frequency below which a category will be considered infrequent.
Defaults to 1.
- onehot_max_categories: int, optional
Specifies an upper limit to the number of output features for each input feature. It includes the feature that combines infrequent categories.
Defaults to 0.
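The parameter-selection settings above depend on each other in several ways. The following is a minimal sketch that assembles a set of constructor keyword arguments for a grid search with 5-fold cross validation and checks them against the dependencies stated in this page. The helper check_selection_kwargs is hypothetical and not part of hana_ml; the candidate values are illustrative only.

```python
# The two documented input forms for param_values carry the same content.
param_values_list = [('c', [0.1, 1.0, 10.0]), ('degree', [2, 3])]
param_values_dict = dict(c=[0.1, 1.0, 10.0], degree=[2, 3])

CV_METHODS = {'cv', 'cv_sha', 'cv_hyperband',
              'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband'}
HYPERBAND_METHODS = {'cv_hyperband', 'stratified_cv_hyperband',
                     'bootstrap_hyperband', 'stratified_bootstrap_hyperband'}


def check_selection_kwargs(kwargs):
    """Hypothetical helper: list violations of the documented dependencies."""
    problems = []
    method = kwargs.get('resampling_method')
    strategy = kwargs.get('search_strategy')
    if method in HYPERBAND_METHODS:
        # For Hyperband methods the strategy is documented to be 'random'.
        strategy = 'random'
    if method in CV_METHODS and kwargs.get('fold_num') is None:
        problems.append('fold_num is mandatory for cross validation methods')
    if strategy == 'random' and kwargs.get('random_search_times') is None:
        problems.append('random_search_times is mandatory for random search')
    has_candidates = 'param_values' in kwargs or 'param_range' in kwargs
    if has_candidates and (method is None or strategy is None):
        problems.append('param_values/param_range need resampling_method '
                        'and search_strategy')
    return problems


svc_kwargs = dict(kernel='poly',
                  resampling_method='cv',
                  fold_num=5,
                  evaluation_metric='ACCURACY',
                  search_strategy='grid',
                  param_values=param_values_list)

# With hana_ml installed, these arguments would be passed on as:
#   from hana_ml.algorithms.pal.svm import SVC
#   svc = SVC(**svc_kwargs)
```

Keeping the arguments in a plain dict makes it easy to validate or log a configuration before the estimator is created.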
References
Three key functionalities are enabled in support vector classification(SVC), listed as follows:
Please refer to the links above for detailed description about each functionality together with relevant parameters.
Examples
>>> from hana_ml.algorithms.pal import svm
>>> svc = svm.SVC(gamma=0.005, handle_missing=False)
>>> svc.fit(data=df_fit, key='ID', features=['F1', 'F2'])
>>> res = svc.predict(data=df_predict, key='ID', features=['F1', 'F2'])
>>> res.collect()
- Attributes:
- model_: DataFrame
Model content.
- stat_: DataFrame
Statistics content.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Fit the model to the training dataset.
predict(data[, key, features, verbose]): Predict dependent variable values based on a fitted model.
score(data[, key, features, label]): Returns the accuracy on the given test data and labels.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID, non-label columns.
- label: str, optional
Name of the label column.
If label is not provided, it defaults to the last non-ID column.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "SVC".
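The defaulting rules for features and label described above can be sketched in plain Python as follows. The helper resolve_columns is hypothetical and not part of hana_ml; it only mirrors the documented rules for an explicit key and does not model the case where key is taken from the index of data.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Hypothetical helper mirroring the documented defaults of fit()."""
    # Every column except the ID column is a candidate for label/features.
    non_id = [col for col in columns if col != key]
    if label is None:
        # label defaults to the last non-ID column.
        label = non_id[-1]
    if features is None:
        # features default to all non-ID, non-label columns.
        features = [col for col in non_id if col != label]
    return key, features, label


print(resolve_columns(['ID', 'F1', 'F2', 'LABEL'], key='ID'))
```

Note that without a key, the first column in this example would be treated as a feature, which is usually not intended for an ID column.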
- predict(data, key=None, features=None, verbose=False)
Predict dependent variable values based on a fitted model.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
- verbose: bool, optional
If True, output scoring probabilities for each class. It is only applicable when probability is True during instance creation.
Defaults to False.
- Returns:
- DataFrame
Predict result, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type NVARCHAR(100), prediction value.
PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
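Since the PROBABILITY column and the verbose option both depend on probability being True at instance creation, the following is a minimal sketch of how the two settings fit together. The function name and its svc_cls argument are hypothetical; svc_cls is assumed to be hana_ml.algorithms.pal.svm.SVC, and df_fit and df_predict are assumed to be hana_ml DataFrames with an 'ID' column.

```python
def predict_with_probabilities(svc_cls, df_fit, df_predict, key='ID'):
    """Hypothetical wrapper: fit with probability output, then predict."""
    # probability must be enabled here, otherwise PROBABILITY is NULL
    # and verbose has no effect at prediction time.
    svc = svc_cls(probability=True)
    svc.fit(data=df_fit, key=key)
    # verbose=True requests scoring probabilities for each class.
    return svc.predict(data=df_predict, key=key, verbose=True)


# With a SAP HANA connection, usage would look like:
#   from hana_ml.algorithms.pal.svm import SVC
#   res = predict_with_probabilities(SVC, df_fit, df_predict)
#   res.collect()
```

Passing the class as an argument keeps the sketch importable without a database connection.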
- score(data, key=None, features=None, label=None)
Returns the accuracy on the given test data and labels.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID, non-label columns.
- label: str, optional
Name of the label column.
If label is not provided, it defaults to the last non-ID column.
- Returns:
- float
Scalar accuracy value comparing the predicted result and original label.
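As an illustration of the returned value, accuracy is the fraction of samples whose predicted label equals the original label. The following is a minimal sketch of that definition in plain Python; the function is hypothetical and not the hana_ml implementation, and the string comparison is an assumption motivated by the NVARCHAR type of the SCORE column in the prediction result.

```python
def accuracy(predicted, actual):
    """Fraction of positions where predicted and actual labels agree."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    if not actual:
        raise ValueError('at least one sample is required')
    # Compare as strings so that '1' (predicted) matches 1 (original label).
    matches = sum(1 for p, a in zip(predicted, actual) if str(p) == str(a))
    return matches / len(actual)


print(accuracy(['1', '0', '1', '1'], [1, 0, 0, 1]))  # 3 of 4 match -> 0.75
```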
- create_model_state(model=None, function=None, pal_funcname='PAL_SVM', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for SVM.
- pal_funcname: int or str, optional
PAL function name.
Defaults to 'PAL_SVM'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
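For the dict form, a quick check of the required keys before calling set_model_state can catch incomplete state information early. The following is a minimal sketch; the helper missing_state_keys is hypothetical and not part of hana_ml, and the values shown are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def missing_state_keys(state):
    """Hypothetical helper: return the required keys absent from state."""
    return [name for name in REQUIRED_STATE_KEYS if name not in state]


# Placeholder values only; real values come from an existing model state.
state = {'STATE_ID': '<state-id>',
         'HINT': '<hint>',
         'HOST': '<host>',
         'PORT': '<port>'}

if not missing_state_keys(state):
    # With a fitted or configured estimator: svc.set_model_state(state)
    pass
```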
Inherited Methods from PALBase
Besides the methods mentioned above, the SVC class also inherits methods from the PALBase class. Please refer to PAL Base for more details.