FeatureSelection

class hana_ml.algorithms.pal.preprocessing.FeatureSelection(fs_method, top_k_best=None, thread_ratio=None, seed=None, fs_threshold=None, fs_n_neighbours=None, fs_category_weight=None, fs_sigma=None, fs_regularization_power=None, fs_rowsampling_ratio=None, fs_max_iter=None, fs_admm_tol=None, fs_admm_rho=None, fs_admm_mu=None, fs_admm_gamma=None, cso_repeat_num=None, cso_maxgeneration_num=None, cso_earlystop_num=None, cso_population_size=None, cso_phi=None, cso_featurenum_penalty=None, cso_test_ratio=None)

Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction, reducing memory storage and improving computational efficiency while avoiding significant loss of information.

Parameters
fs_method : {'anova', 'chi-squared', 'gini-index', 'fisher-score', 'information-gain', 'MRMR', 'JMI', 'IWFS', 'FCBF', 'laplacian-score', 'SPEC', 'ReliefF', 'ADMM', 'CSO'}

Statistics-based FS methods

  • 'anova': Analysis of Variance (ANOVA).

  • 'chi-squared': Chi-squared.

  • 'gini-index': Gini Index.

  • 'fisher-score': Fisher Score.

Information-theoretical FS methods

  • 'information-gain': Information Gain.

  • 'MRMR': Minimum Redundancy Maximum Relevance.

  • 'JMI': Joint Mutual Information.

  • 'IWFS': Interaction Weight Based Feature Selection.

  • 'FCBF': Fast Correlation Based Filter.

Similarity-based FS methods

  • 'laplacian-score': Laplacian Score.

  • 'SPEC': Spectral Feature Selection.

  • 'ReliefF': ReliefF.

Sparse-learning-based FS method

  • 'ADMM': Alternating Direction Method of Multipliers (ADMM).

Wrapper method

  • 'CSO': Competitive Swarm Optimizer.
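
For orientation, a minimal sketch (assuming a connected hana_ml DataFrame df with target column 'Y', as in the Examples section below) showing that switching between method families only changes fs_method:

>>> from hana_ml.algorithms.pal.preprocessing import FeatureSelection
>>> # Statistics-based method: chi-squared test of each feature against the target
>>> fs_chi2 = FeatureSelection(fs_method='chi-squared', top_k_best=5)
>>> # Information-theoretical method: Minimum Redundancy Maximum Relevance
>>> fs_mrmr = FeatureSelection(fs_method='MRMR', top_k_best=5)
>>> fs_df = fs_mrmr.fit_transform(df, label='Y')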

top_k_best : int, optional

Top k features to be selected. Must be assigned a value for all methods except FCBF and CSO, where it has no effect.

thread_ratio : float, optional

The ratio of available threads to use for computation.

  • 0: single thread

  • 0~1: the specified percentage of available threads

  • others: heuristically determined

Defaults to -1.

seed : int, optional

Random seed. 0 means using system time as seed.

Defaults to 0.

fs_threshold : float, optional

Predefined threshold for symmetrical uncertainty (SU) values between features and the target. Used in FCBF.

Defaults to 0.01.
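
For example, FCBF is driven by this threshold rather than by top_k_best (a minimal sketch, reusing df and the label 'Y' from the Examples section; the threshold value is illustrative):

>>> fs = FeatureSelection(fs_method='FCBF', fs_threshold=0.05)
>>> fs_df = fs.fit_transform(df, label='Y')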

fs_n_neighbours : int, optional

Number of neighbours considered in the computation of the affinity matrix. Used in similarity-based FS methods.

Defaults to 5.

fs_category_weight : float, optional

The weight of categorical features when calculating distance. Used in similarity-based FS methods.

Defaults to 0.5*avg(standard deviations of all numerical columns).

fs_sigma : float, optional

Sigma in the affinity matrix. Used in similarity-based FS methods.

Defaults to 1.0.

fs_regularization_power : int, optional

The order of the power function that penalizes high-frequency components. Used in SPEC.

Defaults to 0.
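
The four parameters above act together when the affinity matrix is built for similarity-based methods. A hedged sketch with illustrative values, using SPEC in its unsupervised mode (no label passed; see fit_transform below):

>>> fs = FeatureSelection(fs_method='SPEC',
                          top_k_best=5,
                          fs_n_neighbours=10,         # neighbours in the affinity matrix
                          fs_sigma=0.5,               # sigma of the affinity matrix
                          fs_regularization_power=2)  # penalize high-frequency components
>>> fs_df = fs.fit_transform(df)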

fs_rowsampling_ratio : float, optional

The ratio of random sampling without replacement. Used in ReliefF, ADMM and CSO.

Defaults to 0.6 in ReliefF, 1.0 in ADMM and CSO.

fs_max_iter : int, optional

Maximum number of iterations allowed for the optimization. Used in ADMM.

Defaults to 100.

fs_admm_tol : float, optional

Convergence threshold. Used in ADMM.

Defaults to 0.0001.

fs_admm_rho : float, optional

Lagrangian multiplier. Used in ADMM.

Defaults to 1.0.

fs_admm_mu : float, optional

Gain of fs_admm_rho at each iteration. Used in ADMM.

Defaults to 1.05.

fs_admm_gamma : float, optional

Regularization coefficient. Used in ADMM.

Defaults to 1.0.
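
Taken together, fs_max_iter and fs_admm_tol bound the ADMM optimization loop, while fs_admm_rho, fs_admm_mu, and fs_admm_gamma shape its penalty terms. A minimal sketch with illustrative values:

>>> fs = FeatureSelection(fs_method='ADMM',
                          top_k_best=5,
                          fs_max_iter=200,    # allow more optimization iterations
                          fs_admm_tol=1e-5,   # tighter convergence threshold
                          fs_admm_gamma=0.5)  # weaker regularization
>>> fs_df = fs.fit_transform(df, label='Y')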

cso_repeat_num : int, optional

Number of times to repeat the CSO search; each run starts from a different random initialization. Used in CSO.

Defaults to 2.

cso_maxgeneration_num : int, optional

Maximal number of generations. Used in CSO.

Defaults to 100.

cso_earlystop_num : int, optional

Stop early if the best result has not changed for the specified number of consecutive generations. Used in CSO.

Defaults to 30.

cso_population_size : int, optional

Population size of the swarm particles. Used in CSO.

Defaults to 30.

cso_phi : float, optional

Social factor. Used in CSO.

Defaults to 0.1.

cso_featurenum_penalty : float, optional

The penalty imposed on the number of selected features. Used in CSO.

Defaults to 0.1.

cso_test_ratio : float, optional

The ratio of testing data in the split of training and testing data. Used in CSO.

Defaults to 0.2.
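
Because CSO is a wrapper method that trains and evaluates internal models over many generations, the cso_* parameters trade search quality against runtime. A hedged configuration sketch (all values illustrative):

>>> fs = FeatureSelection(fs_method='CSO',
                          seed=1,
                          cso_repeat_num=3,          # three runs with fresh initializations
                          cso_maxgeneration_num=50,  # cap the number of generations
                          cso_earlystop_num=10,      # stop after 10 unchanged generations
                          cso_population_size=20,
                          cso_test_ratio=0.3)        # 70/30 train-test split
>>> fs_df = fs.fit_transform(df, label='Y')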

Examples

Original data:

>>> df.collect()
   X1    X2    X3  X4  X5  X6     X7  X8 X9  X10 X11 X12   X13       Y
0  1  22.08 11.46   2   4   4  1.585   0  0    0   1   2   100   1,213
1  0  22.67     7   2   8   4  0.165   0  0    0   0   2   160       1
2  0  29.58  1.75   1   4   4   1.25   0  0    0   1   2   280       1
3  0  21.67  11.5   1   5   3      0   1  1   11   1   2     0       1
4  1  20.17  8.17   2   6   4   1.96   1  1   14   0   2    60     159
5  0  15.83 0.585   2   8   8    1.5   1  1    2   0   2   100       1
6  1  17.42   6.5   2   3   4  0.125   0  0    0   0   2    60     101
7  0  58.67  4.46   2  11   8   3.04   1  1    6   0   2    43     561
8  1  27.83     1   1   2   8      3   0  0    0   0   2   176     538
9  0  55.75  7.08   2   4   8   6.75   1  1    3   1   2   100      51
10 1   33.5  1.75   2  14   8    4.5   1  1    4   1   2   253     858
11 1  41.42     5   2  11   8      5   1  1    6   1   2   470       1
12 1  20.67  1.25   1   8   8  1.375   1  1    3   1   2   140     211
13 1  34.92     5   2  14   8    7.5   1  1    6   1   2     0   1,001
14 1  58.58  2.71   2   8   4  2.415   0  0    0   1   2   320       1
15 1  48.08  6.04   2   4   4   0.04   0  0    0   0   2     0   2,691
16 1  29.58   4.5   2   9   4    7.5   1  1    2   1   2   330       1
17 0  18.92     9   2   6   4   0.75   1  1    2   0   2    88     592
18 1     20  1.25   1   4   4  0.125   0  0    0   0   2   140       5
19 0  22.42 5.665   2  11   4  2.585   1  1    7   0   2   129   3,258
20 0  28.17 0.585   2   6   4   0.04   0  0    0   0   2   260   1,005
21 0  19.17 0.585   1   6   4  0.585   1  0    0   1   2   160       1
22 1  41.17 1.335   2   2   4  0.165   0  0    0   0   2   168       1
23 1  41.58  1.75   2   4   4   0.21   1  0    0   0   2   160       1
24 1   19.5 9.585   2   6   4   0.79   0  0    0   0   2    80     351
25 1  32.75   1.5   2  13   8    5.5   1  1    3   1   2     0       1
26 1   22.5 0.125   1   4   4  0.125   0  0    0   0   2   200      71
27 1  33.17  3.04   1   8   8   2.04   1  1    1   1   2   180  18,028
28 0  30.67    12   2   8   4      2   1  1    1   0   2   220      20
29 1  23.08   2.5   2   8   4  1.085   1  1   11   1   2    60   2,185

Construct a FeatureSelection instance:

>>> fs = FeatureSelection(fs_method='fisher-score',
                          top_k_best=8)
>>> fs_df = fs.fit_transform(df,
                             categorical_variable=['X1'],
                             label='Y')
>>> fs.result_.collect()
  ROWID                                                                                              OUTPUT
0     0     {"__method__":"fisher-score","__SelectedFeatures__":["X3","X7","X2","X8","X9","X13","X6","X5"]}
>>> fs_df.collect()
      X3     X7     X2  X8 X9  X13 X6  X5
0  11.46  1.585  22.08   0  0  100  4   4
1      7  0.165  22.67   0  0  160  4   8
2   1.75   1.25  29.58   0  0  280  4   4
3   11.5      0  21.67   1  1    0  3   5
4   8.17   1.96  20.17   1  1   60  4   6
5  0.585    1.5  15.83   1  1  100  8   8
6    6.5  0.125  17.42   0  0   60  4   3
7   4.46   3.04  58.67   1  1   43  8  11
8      1      3  27.83   0  0  176  8   2
9   7.08   6.75  55.75   1  1  100  8   4
10  1.75    4.5  33.5    1  1  253  8  14
11     5      5  41.42   1  1  470  8  11
12  1.25  1.375  20.67   1  1  140  8   8
13     5    7.5  34.92   1  1    0  8  14
14  2.71  2.415  58.58   0  0  320  4   8
15  6.04  0.04   48.08   0  0    0  4   4
16   4.5   7.5   29.58   1  1  330  4   9
17     9  0.75   18.92   1  1   88  4   6
18  1.25 0.125      20   0  0  140  4   4
19 5.665 2.585   22.42   1  1  129  4  11
20 0.585  0.04   28.17   0  0  260  4   6
21 0.585 0.585   19.17   1  0  160  4   6
22 1.335 0.165   41.17   0  0  168  4   2
23  1.75  0.21   41.58   1  0  160  4   4
24 9.585  0.79    19.5   0  0   80  4   6
25   1.5   5.5   32.75   1  1    0  8  13
26 0.125 0.125    22.5   0  0  200  4   4
27  3.04  2.04   33.17   1  1  180  8   8
28    12     2   30.67   1  1  220  4   8
29   2.5 1.085   23.08   1  1   60  4   8

Attributes
result_ : DataFrame

PAL returned result, structured as follows:

  • ROWID: Indicates the ID of the current row.

  • OUTPUT: Best set of features.
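
Since OUTPUT is a JSON string (see the Examples above), the selected feature names can be recovered client-side; a minimal sketch using the standard json module, keyed on the field names shown in the example output:

>>> import json
>>> output = json.loads(fs.result_.collect()['OUTPUT'][0])
>>> output['__SelectedFeatures__']
['X3', 'X7', 'X2', 'X8', 'X9', 'X13', 'X6', 'X5']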

Methods

fit_transform(data[, key, label, ...])

Perform feature selection for given data with specified configuration.

fit_transform(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)

Perform feature selection for given data with specified configuration.

Parameters
data : DataFrame

DataFrame that contains the training data.

key : str, optional

Name of the ID column. If data has an index column, key defaults to it.

There is no ID column by default.

label : str, optional

Specifies the dependent variable by name.

Mandatory for supervised feature selection methods.

The 'SPEC' method can be either supervised or unsupervised; if label is not set, the unsupervised version is performed.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

fixed_feature : str or list of str, optional

Feature column(s) that will always be included in the selected subset.

excluded_feature : str or list of str, optional

Excludes the indicated columns as feature candidates.

verbose : bool, optional

Indicates whether to output more detailed results.

Defaults to False.

Returns
DataFrame

Feature selection result from the input data.
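
A hedged sketch combining several of the optional arguments above (column names follow the Examples section; values are illustrative):

>>> fs = FeatureSelection(fs_method='MRMR', top_k_best=6)
>>> fs_df = fs.fit_transform(df,
                             label='Y',
                             categorical_variable=['X1'],  # treat INTEGER column X1 as categorical
                             fixed_feature='X2',           # always keep X2 in the subset
                             excluded_feature=['X12'],     # never consider X12
                             verbose=True)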

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the FeatureSelection class also inherits methods from the PALBase class; please refer to PAL Base for more details.