KNNClassifier

class hana_ml.algorithms.pal.neighbors.KNNClassifier(n_neighbors=None, thread_ratio=None, stat_info=None, voting_type=None, metric=None, minkowski_power=None, category_weights=None, algorithm=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, min_resource_rate=None, reduction_rate=None, aggressive_elimination=None)

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. It assumes similar instances should have similar labels or values.

In the prediction phase, given a query sample x, its top K nearest samples are found in the training set first, then the label or value of x is assigned with some metric using the K nearest neighbors. In order to speed up the search, the KD-tree searching method is provided.
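The prediction step described above can be sketched in plain Python. This is only an illustration of a brute-force search with majority voting, not PAL's implementation; the function name `knn_predict` and the sample data are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Illustrative brute-force KNN: Euclidean distance, majority vote."""
    # Distance from the query to every training sample.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among the k neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not taken from the documentation example.
train_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["low", "low", "high", "high"]
print(knn_predict(train_points, train_labels, (2.0, 1.5), k=3))
```

A KD-tree replaces the linear scan over all training samples with a tree search, which pays off mainly for larger training sets.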

Parameters:

n_neighborsint, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.0.

voting_type{'majority', 'distance-weighted'}, optional

Voting type.

Default to 'distance-weighted'.
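The difference between the two voting types can be sketched as follows. The exact weighting formula used by PAL is not stated here, so this example assumes inverse-distance weights; the helper `vote` and the neighbor list are hypothetical.

```python
from collections import defaultdict


def vote(neighbors, voting_type="distance-weighted", eps=1e-12):
    """neighbors: (distance, label) pairs for the k nearest samples."""
    if voting_type not in ("majority", "distance-weighted"):
        raise ValueError(f"unknown voting_type: {voting_type!r}")
    scores = defaultdict(float)
    for distance, label in neighbors:
        if voting_type == "majority":
            # Every neighbor counts equally.
            scores[label] += 1.0
        else:
            # Assumption: closer neighbors weigh more (inverse distance).
            scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)


neighbors = [(0.5, "A"), (2.0, "B"), (2.5, "B")]
print(vote(neighbors, "majority"))
print(vote(neighbors, "distance-weighted"))
```

With these neighbors, majority voting picks the label that occurs twice, while the weighted vote favors the single much closer neighbor.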

stat_infobool, optional

Indicates if statistic information will be stored into the STATISTIC table.

Only valid when model evaluation/parameter selection is not activated.

Defaults to True.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between data points.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When metric is set to 'minkowski', this parameter controls the value of power.

Only valid when metric is set as 'minkowski'.

Defaults to 3.0.
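As a reminder of what the power controls, here is a minimal sketch of the Minkowski distance for numeric vectors. It is the textbook formula, not PAL code; a power of 1 gives the Manhattan distance and a power of 2 the Euclidean distance.

```python
def minkowski(a, b, power=3.0):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, power=1.0))  # Manhattan
print(minkowski(a, b, power=2.0))  # Euclidean
print(minkowski(a, b, power=3.0))  # the documented default power
```

As the power grows, the result approaches the Chebyshev distance, that is, the largest coordinate difference.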

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Algorithm used to compute the nearest neighbors.

Defaults to 'brute-force'.

factor_numint, optional

The factorisation dimensionality.

Defaults to 4.

random_stateint, optional

Specifies the seed for random number generator.

0: Uses the current time as the seed.

Others: Uses the specified value as the seed.

Defaults to 0.

resampling_methodstr, optional

Specifies the resampling method for model evaluation or parameter selection:

'cv'

'cv_sha'

'cv_hyperband'

'stratified_cv'

'stratified_cv_sha'

'stratified_cv_hyperband'

'bootstrap'

'bootstrap_sha'

'bootstrap_hyperband'

'stratified_bootstrap'

'stratified_bootstrap_sha'

'stratified_bootstrap_hyperband'

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

No default value.

Note

Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.

evaluation_metric{'accuracy', 'f1_score'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection is activated.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to one of the following values: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'random', 'grid'}, optional

The search strategy for parameters.

Mandatory if resampling_method is specified and ends with 'sha'.

Defaults to 'random' and cannot be changed if resampling_method is specified and ends with 'hyperband'; otherwise no default value, and parameter selection cannot be carried out if not specified.
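The rules for search_strategy can be written down as a small decision function. This is a hypothetical helper that only restates the documented rules; `resolve_search_strategy` is not part of hana_ml, and the library's own validation may differ.

```python
def resolve_search_strategy(resampling_method, search_strategy=None):
    """Return the effective search strategy, or None if no selection runs."""
    if resampling_method is None:
        # Neither model evaluation nor parameter selection is activated.
        return None
    if resampling_method.endswith("hyperband"):
        # Hyperband always uses random search and cannot be overridden.
        if search_strategy not in (None, "random"):
            raise ValueError("hyperband methods only support 'random'")
        return "random"
    if resampling_method.endswith("sha") and search_strategy is None:
        # SHA methods require an explicit strategy.
        raise ValueError("search_strategy is mandatory for SHA methods")
    # Plain resampling: None means model evaluation only.
    return search_strategy


print(resolve_search_strategy("cv_hyperband"))
print(resolve_search_strategy("stratified_cv_sha", "grid"))
print(resolve_search_strategy("bootstrap"))
```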

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesdict or ListOfTuples, optional

Specifies values of parameters to be selected.

Input should be a dict, or a list of two-element tuples, with the key/1st element being the target parameter name and the value/2nd element being a list of values for selection.

Only valid when parameter selection is activated.

Valid Parameter names include: metric, minkowski_power, category_weights, n_neighbors, voting_type.

No default value.

param_rangedict or ListOfTuples, optional

Specifies ranges of parameters to be selected.

Input should be a dict, or a list of two-element tuples, with the key/1st element being the name of the target parameter and the value/2nd element being a list that specifies the parameter range in the format [start, step, end] or [start, end].

Only valid when parameter selection is activated.

Valid parameter names include: minkowski_power, category_weights, n_neighbors.

No default value.
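The two accepted input shapes for param_values and param_range can be illustrated with plain Python objects. The checker below is a hypothetical helper, not a hana_ml function; it only encodes the parameter names and list formats listed above, and the candidate values are made up for the example.

```python
VALUE_NAMES = {"metric", "minkowski_power", "category_weights",
               "n_neighbors", "voting_type"}
RANGE_NAMES = {"minkowski_power", "category_weights", "n_neighbors"}


def check_param_spec(param_values=None, param_range=None):
    """Normalize dict or list-of-tuples input and check names and shapes."""
    values = dict(param_values or {})
    ranges = dict(param_range or {})
    unknown = (set(values) - VALUE_NAMES) | (set(ranges) - RANGE_NAMES)
    if unknown:
        raise ValueError(f"unsupported parameter names: {sorted(unknown)}")
    for name, bounds in ranges.items():
        # A range is either [start, step, end] or [start, end].
        if len(bounds) not in (2, 3):
            raise ValueError(f"bad range for {name!r}: {bounds!r}")
    return values, ranges


# The same candidate values written as a dict and as a list of tuples.
as_dict = {"metric": ["manhattan", "euclidean"], "n_neighbors": [1, 3, 5]}
as_tuples = [("metric", ["manhattan", "euclidean"]), ("n_neighbors", [1, 3, 5])]
ranges = {"minkowski_power": [1.0, 0.5, 4.0], "category_weights": [0.1, 1.0]}

values_a, _ = check_param_spec(as_dict)
values_b, checked_ranges = check_param_spec(as_tuples, ranges)
print(values_a == values_b)
```

Either form would then be passed to the constructor as `param_values=...` or `param_range=...`, together with a resampling method, evaluation metric, and search strategy.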

min_resource_ratefloat, optional

Specifies the minimum resource rate that should be used in SHA or Hyperband iteration.

Valid only when resampling_method takes one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

Defaults to 0.0.

reduction_ratefloat, optional

Specifies reduction rate in SHA or Hyperband method.

In each round, the number of remaining parameter candidates is divided by the value of this parameter, so valid values must be greater than 1.0.

Valid only when resampling_method is set to one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

Defaults to 3.0.
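The effect of the reduction rate on the candidate pool can be sketched with simple arithmetic. The rounding rule (floor, with at least one survivor) is an assumption made for this illustration, and the growth of the data resource per round (see min_resource_rate) is not modeled.

```python
import math


def sha_schedule(n_candidates, reduction_rate=3.0):
    """Candidate counts per round under an assumed floor-division rule."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    schedule = [n_candidates]
    while schedule[-1] > 1:
        # Assumption: round down, but always keep at least one candidate.
        survivors = max(1, math.floor(schedule[-1] / reduction_rate))
        schedule.append(survivors)
    return schedule


print(sha_schedule(27, reduction_rate=3.0))
print(sha_schedule(10, reduction_rate=3.0))
```

A rate of 1.0 or less would never shrink the pool, which is why such values are rejected.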

aggressive_eliminationbool, optional

Specifies whether to apply aggressive elimination while using SHA method.

Aggressive elimination applies when the data size and the number of candidate parameters do not match: many parameters remain to be searched while the data size has already reached its upper limit. If aggressive elimination is applied, the lower bound of the data size limit is used multiple times first to reduce the number of parameters.

Valid only when resampling_method is set to one of the following: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha'.

Defaults to False.

Examples

Input DataFrame for training:

>>> df_class_train.collect()
   ID  X1      X2 X3  TYPE
0   0   2     1.0  A     1
1   1   3    10.0  A    10
...
8   8   1   999.0  B    10
9   9   1  1000.0  C    10

Create a KNNClassifier instance:

>>> knn = KNNClassifier(algorithm='kd-tree',
                        n_neighbors=3, voting_type='majority')

Perform fit():

>>> knn.fit(data=df_class_train, key='ID', label='TYPE')

Input DataFrame for prediction:

>>> df_class_predict.collect()
   ID  X1       X2 X3
0   0   2      1.0  A
1   1   1     10.0  C
2   2   1     11.0  B
3   3   3  15000.0  C
4   4   2   1000.0  C
5   5   1   1001.0  A
6   6   1    999.0  A
7   7   3    999.0  B

Perform predict():

>>> res, stats = knn.predict(data=df_class_predict, key='ID')

Output:

>>> res.collect()
   ID TARGET
0   0     10
1   1     10
2   2     10
3   3      1
4   4      1
5   5      1
6   6     10
7   7     99

>>> stats.collect().head(10)
    TEST_ID  K  TRAIN_ID      DISTANCE
       0  1         0      0.000000
       0  2         1      9.999849
       0  3         2     10.414000
       1  1         3      0.999849
       1  2         1      1.414000
       1  3         2      1.414000
       2  1         2      1.999849
       2  2         1      2.414000
       2  3         3      2.414000
       3  1         4  14000.999849

Attributes:

_training_setDataFrame: Input training data with structured column arrangement. If model evaluation or parameter selection is not enabled, the first column must be the ID column, followed by feature columns.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features, label, ...])	Fit the model to the training dataset.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`predict`(data[, key, features, interpret, ...])	Prediction for the input data with the training dataset.
`set_model_state`(state)	Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the model to the training dataset.

Parameters:

dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

Required if parameter selection/model evaluation is not activated, unless data is indexed by a single column (in which case that column name is the default value of key).

If key is not provided when activating parameter-selection/model-evaluation, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

featuresstr or a list of str, optional

Name of the feature columns. Defaults to non-key, non-label columns.

labelstr, optional

Specifies the dependent variable.

Defaults to the name of the last column.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

string_variablestr or a list of str, optional

Indicates a string column that stores non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.

Defaults to None.
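For reference, the Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch follows; it is the standard algorithm, not PAL's code, and how PAL scales the result into a similarity is not covered here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```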

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.

Defaults to None.
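One way to picture variable weights is a weighted Euclidean distance in which each squared difference is multiplied by the variable's weight. How PAL applies the weights internally is not stated here, so the formula below is an assumption; the helper name and sample rows are hypothetical.

```python
import math


def weighted_euclidean(a, b, variable_weight=None):
    """a, b: dicts mapping variable names to numeric values."""
    weights = variable_weight or {}
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be greater than or equal to 0")
    total = 0.0
    for name in a:
        # Variables without an explicit weight default to 1.
        weight = weights.get(name, 1.0)
        total += weight * (a[name] - b[name]) ** 2
    return math.sqrt(total)


row_a = {"X1": 2, "X2": 1.0}
row_b = {"X1": 3, "X2": 10.0}
print(weighted_euclidean(row_a, row_b))
print(weighted_euclidean(row_a, row_b, {"X2": 0.0}))
```

Setting a weight to 0 removes that variable from the distance, as the second call shows.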

Returns:

A fitted object of class "KNNClassifier".

predict(data, key=None, features=None, interpret=False, sample_size=None, top_k_attributions=None, random_state=None)

Prediction for the input data with the training dataset. Training dataset must be constructed through the fit function first.

Parameters:

dataDataFrame

Prediction data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresstr or a list of str, optional

Name of the feature columns.

interpretbool, optional

Specifies whether or not to interpret the prediction results.

Defaults to False (i.e. the prediction results are not interpreted).

sample_sizeint, optional

Specifies the number of sampled combinations of features.

0 : Heuristically determined by algorithm

Others : The specified sample size

Defaults to 0.

top_k_attributionsint, optional

Specifies the number of features with highest attributions to output.

Defaults to 10.

random_stateint, optional

Specifies the seed for random number generator when sampling the combination of features.

0 : Use the current time as the seed

Others : The actual seed

Defaults to 0.

Returns:

DataFrame 1

KNN prediction results, structured as follows:

ID: Prediction data ID.

TARGET: Predicted label.

REASON_CODE: Interpretation of the result. This column is available only if interpret is True.

DataFrame 2

KNN prediction statistics, structured as follows:

TEST_ + ID column name of prediction data: Prediction data ID.

K: K number.

TRAIN_ + ID column name of training data: Train data ID.

DISTANCE: Distance.
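Each row of the statistics output pairs one prediction sample with one of its K nearest training samples. The sketch below groups such rows per prediction sample in plain Python. The rows are copied from the first lines of the example output in this page; the helper name is hypothetical, and in practice the rows would come from the collected statistics DataFrame.

```python
from collections import defaultdict

# (TEST_ID, K, TRAIN_ID, DISTANCE) rows from the example output above.
stats_rows = [
    (0, 1, 0, 0.000000),
    (0, 2, 1, 9.999849),
    (0, 3, 2, 10.414000),
    (1, 1, 3, 0.999849),
    (1, 2, 1, 1.414000),
    (1, 3, 2, 1.414000),
]


def neighbors_by_test_id(rows):
    """Map each prediction ID to its (TRAIN_ID, DISTANCE) pairs, ordered by K."""
    grouped = defaultdict(list)
    for test_id, k, train_id, distance in rows:
        grouped[test_id].append((k, train_id, distance))
    return {
        test_id: [(train_id, distance) for _, train_id, distance in sorted(items)]
        for test_id, items in grouped.items()
    }


neighbors = neighbors_by_test_id(stats_rows)
print(neighbors[0])
```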

create_model_state(model=None, function=None, pal_funcname='PAL_KNN', state_description=None, force=False)

Create PAL model state.

Parameters:

modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for KNN.

pal_funcnameint or str, optional

PAL function name.

Defaults to 'PAL_KNN'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:

stateDataFrame, optional

Specifies the state.

Defaults to self.state.

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
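A dict-shaped state can be checked for the required keys before it is passed on. The helper and all values below are hypothetical placeholders for illustration; real values come from an existing model state.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values only; they do not describe a real system.
state = {
    "STATE_ID": "example-state-id",
    "HINT": "example-hint",
    "HOST": "example-host",
    "PORT": "30003",
}
print(missing_state_keys(state))
print(missing_state_keys({"STATE_ID": "example-state-id"}))
```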

Inherited Methods from PALBase

In addition to the methods described above, the KNNClassifier class inherits methods from the PALBase class. Refer to PAL Base for details.