FeatureSelection
- class hana_ml.algorithms.pal.preprocessing.FeatureSelection(fs_method, top_k_best=None, thread_ratio=None, seed=None, fs_threshold=None, fs_n_neighbours=None, fs_category_weight=None, fs_sigma=None, fs_regularization_power=None, fs_rowsampling_ratio=None, fs_max_iter=None, fs_admm_tol=None, fs_admm_rho=None, fs_admm_mu=None, fs_admm_gamma=None, cso_repeat_num=None, cso_maxgeneration_num=None, cso_earlystop_num=None, cso_population_size=None, cso_phi=None, cso_featurenum_penalty=None, cso_test_ratio=None)
Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction, reducing memory usage and improving computational efficiency while avoiding significant loss of information.
- Parameters
- fs_method : {'anova', 'chi-squared', 'gini-index', 'fisher-score', 'information-gain', 'MRMR', 'JMI', 'IWFS', 'FCBF', 'laplacian-score', 'SPEC', 'ReliefF', 'ADMM', 'CSO'}
The feature selection method to apply; a construction sketch for each family is shown at the end of this parameter list.
Statistical based FS methods:
'anova': ANOVA.
'chi-squared': Chi-squared.
'gini-index': Gini Index.
'fisher-score': Fisher Score.
Information theoretical based FS methods:
'information-gain': Information Gain.
'MRMR': Minimum Redundancy Maximum Relevance.
'JMI': Joint Mutual Information.
'IWFS': Interaction Weight Based Feature Selection.
'FCBF': Fast Correlation Based Filter.
Similarity based FS methods:
'laplacian-score': Laplacian Score.
'SPEC': Spectral Feature Selection.
'ReliefF': ReliefF.
Sparse learning based FS method:
'ADMM': Alternating Direction Method of Multipliers (ADMM).
Wrapper method:
'CSO': Competitive Swarm Optimizer.
- top_k_best : int, optional
Top k features to be selected. Must be assigned a value for all methods except 'FCBF' and 'CSO', which it does not affect.
- thread_ratio : float, optional
The ratio of available threads to use:
0: single thread.
0~1: uses the given percentage of available threads.
Others: heuristically determined.
Defaults to -1.
- seed : int, optional
Random seed. 0 means using system time as seed.
Defaults to 0.
- fs_threshold : float, optional
Predefined threshold for symmetrical uncertainty (SU) values between features and the target. Used in FCBF.
Defaults to 0.01.
- fs_n_neighbours : int, optional
Number of neighbours considered in the computation of the affinity matrix. Used in similarity-based FS methods.
Defaults to 5.
- fs_category_weight : float, optional
The weight of categorical features when calculating distances. Used in similarity-based FS methods.
Defaults to 0.5 * avg(std of all numerical columns).
- fs_sigma : float, optional
Sigma of the affinity matrix. Used in similarity-based FS methods.
Defaults to 1.0.
- fs_regularization_power : int, optional
The order of the power function that penalizes high-frequency components. Used in SPEC.
Defaults to 0.
- fs_rowsampling_ratio : float, optional
The ratio for random row sampling without replacement. Used in ReliefF, ADMM and CSO.
Defaults to 0.6 in ReliefF, and to 1.0 in ADMM and CSO.
- fs_max_iter : int, optional
Maximal number of iterations allowed for the optimization. Used in ADMM.
Defaults to 100.
- fs_admm_tol : float, optional
Convergence threshold. Used in ADMM.
Defaults to 0.0001.
- fs_admm_rho : float, optional
Lagrangian multiplier. Used in ADMM.
Defaults to 1.0.
- fs_admm_mu : float, optional
Gain of fs_admm_rho at each iteration. Used in ADMM.
Defaults to 1.05.
- fs_admm_gamma : float, optional
Regularization coefficient. Used in ADMM.
Defaults to 1.0.
- cso_repeat_num : int, optional
Number of times CSO is run; each run starts from a different initialization. Used in CSO.
Defaults to 2.
- cso_maxgeneration_num : int, optional
Maximal number of generations. Used in CSO.
Defaults to 100.
- cso_earlystop_num : int, optional
Stop the search early if the result has not changed for the given number of generations. Used in CSO.
Defaults to 30.
- cso_population_size : int, optional
Population size of the swarm particles. Used in CSO.
Defaults to 30.
- cso_phi : float, optional
Social factor. Used in CSO.
Defaults to 0.1.
- cso_featurenum_penalty : float, optional
The penalty imposed on the number of selected features. Used in CSO.
Defaults to 0.1.
- cso_test_ratio : float, optional
The ratio for splitting the data into a training set and a testing set. Used in CSO.
Defaults to 0.2.
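To make the method families above concrete, here is a minimal construction sketch, one instance per family; all parameter values shown are purely illustrative (mostly the documented defaults):

>>> from hana_ml.algorithms.pal.preprocessing import FeatureSelection
>>> # Statistical method: top_k_best is mandatory
>>> fs_stat = FeatureSelection(fs_method='fisher-score', top_k_best=8)
>>> # Information theoretical method: FCBF picks its own subset size via fs_threshold
>>> fs_fcbf = FeatureSelection(fs_method='FCBF', fs_threshold=0.01)
>>> # Similarity based method: the affinity-matrix options apply
>>> fs_sim = FeatureSelection(fs_method='laplacian-score', top_k_best=8,
...                           fs_n_neighbours=5, fs_category_weight=0.5,
...                           fs_sigma=1.0)
>>> # Sparse learning (ADMM): optimizer controls
>>> fs_admm = FeatureSelection(fs_method='ADMM', top_k_best=8,
...                            fs_rowsampling_ratio=1.0, fs_max_iter=100,
...                            fs_admm_tol=0.0001, fs_admm_rho=1.0,
...                            fs_admm_mu=1.05, fs_admm_gamma=1.0)
>>> # Wrapper (CSO): the swarm decides the subset size, so top_k_best is not needed
>>> fs_cso = FeatureSelection(fs_method='CSO', seed=1,
...                           cso_repeat_num=2, cso_maxgeneration_num=100,
...                           cso_earlystop_num=30, cso_population_size=30,
...                           cso_phi=0.1, cso_featurenum_penalty=0.1,
...                           cso_test_ratio=0.2)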
Examples
Original data:
>>> df.collect()
    X1     X2      X3  X4  X5  X6     X7  X8  X9  X10  X11  X12  X13       Y
 0   1  22.08   11.46   2   4   4  1.585   0   0    0    1    2  100   1,213
 1   0  22.67       7   2   8   4  0.165   0   0    0    0    2  160       1
 2   0  29.58    1.75   1   4   4   1.25   0   0    0    1    2  280       1
 3   0  21.67    11.5   1   5   3      0   1   1   11    1    2    0       1
 4   1  20.17    8.17   2   6   4   1.96   1   1   14    0    2   60     159
 5   0  15.83   0.585   2   8   8    1.5   1   1    2    0    2  100       1
 6   1  17.42     6.5   2   3   4  0.125   0   0    0    0    2   60     101
 7   0  58.67    4.46   2  11   8   3.04   1   1    6    0    2   43     561
 8   1  27.83       1   1   2   8      3   0   0    0    0    2  176     538
 9   0  55.75    7.08   2   4   8   6.75   1   1    3    1    2  100      51
10   1   33.5    1.75   2  14   8    4.5   1   1    4    1    2  253     858
11   1  41.42       5   2  11   8      5   1   1    6    1    2  470       1
12   1  20.67    1.25   1   8   8  1.375   1   1    3    1    2  140     211
13   1  34.92       5   2  14   8    7.5   1   1    6    1    2    0   1,001
14   1  58.58    2.71   2   8   4  2.415   0   0    0    1    2  320       1
15   1  48.08    6.04   2   4   4   0.04   0   0    0    0    2    0   2,691
16   1  29.58     4.5   2   9   4    7.5   1   1    2    1    2  330       1
17   0  18.92       9   2   6   4   0.75   1   1    2    0    2   88     592
18   1     20    1.25   1   4   4  0.125   0   0    0    0    2  140       5
19   0  22.42   5.665   2  11   4  2.585   1   1    7    0    2  129   3,258
20   0  28.17   0.585   2   6   4   0.04   0   0    0    0    2  260   1,005
21   0  19.17   0.585   1   6   4  0.585   1   0    0    1    2  160       1
22   1  41.17   1.335   2   2   4  0.165   0   0    0    0    2  168       1
23   1  41.58    1.75   2   4   4   0.21   1   0    0    0    2  160       1
24   1   19.5   9.585   2   6   4   0.79   0   0    0    0    2   80     351
25   1  32.75     1.5   2  13   8    5.5   1   1    3    1    2    0       1
26   1   22.5   0.125   1   4   4  0.125   0   0    0    0    2  200      71
27   1  33.17    3.04   1   8   8   2.04   1   1    1    1    2  180  18,028
28   0  30.67      12   2   8   4      2   1   1    1    0    2  220      20
29   1  23.08     2.5   2   8   4  1.085   1   1   11    1    2   60   2,185
Construct a FeatureSelection instance:
>>> fs = FeatureSelection(fs_method='fisher-score', top_k_best=8)
>>> fs_df = fs.fit_transform(df, categorical_variable=['X1'], label='Y')
>>> fs.result_.collect()
   ROWID                                                                                            OUTPUT
0      0  {"__method__":"fisher-score","__SelectedFeatures__":["X3","X7","X2","X8","X9","X13","X6","X5"]}
>>> fs_df.collect()
       X3     X7     X2  X8  X9  X13  X6  X5
 0  11.46  1.585  22.08   0   0  100   4   4
 1      7  0.165  22.67   0   0  160   4   8
 2   1.75   1.25  29.58   0   0  280   4   4
 3   11.5      0  21.67   1   1    0   3   5
 4   8.17   1.96  20.17   1   1   60   4   6
 5  0.585    1.5  15.83   1   1  100   8   8
 6    6.5  0.125  17.42   0   0   60   4   3
 7   4.46   3.04  58.67   1   1   43   8  11
 8      1      3  27.83   0   0  176   8   2
 9   7.08   6.75  55.75   1   1  100   8   4
10   1.75    4.5   33.5   1   1  253   8  14
11      5      5  41.42   1   1  470   8  11
12   1.25  1.375  20.67   1   1  140   8   8
13      5    7.5  34.92   1   1    0   8  14
14   2.71  2.415  58.58   0   0  320   4   8
15   6.04   0.04  48.08   0   0    0   4   4
16    4.5    7.5  29.58   1   1  330   4   9
17      9   0.75  18.92   1   1   88   4   6
18   1.25  0.125     20   0   0  140   4   4
19  5.665  2.585  22.42   1   1  129   4  11
20  0.585   0.04  28.17   0   0  260   4   6
21  0.585  0.585  19.17   1   0  160   4   6
22  1.335  0.165  41.17   0   0  168   4   2
23   1.75   0.21  41.58   1   0  160   4   4
24  9.585   0.79   19.5   0   0   80   4   6
25    1.5    5.5  32.75   1   1    0   8  13
26  0.125  0.125   22.5   0   0  200   4   4
27   3.04   2.04  33.17   1   1  180   8   8
28     12      2  30.67   1   1  220   4   8
29    2.5  1.085  23.08   1   1   60   4   8
- Attributes
- result_ : DataFrame
PAL returned result, structured as follows:
ROWID: the ID of the current row.
OUTPUT: the best set of features.
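Since OUTPUT holds a JSON string (see the example above), the selected feature names can be recovered on the client side; a minimal sketch using the Python standard library:

>>> import json
>>> meta = json.loads(fs.result_.collect()['OUTPUT'].iloc[0])
>>> meta['__SelectedFeatures__']
['X3', 'X7', 'X2', 'X8', 'X9', 'X13', 'X6', 'X5']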
Methods
fit_transform(data[, key, label, ...])
Perform feature selection for the given data with the specified configuration.
- fit_transform(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)
Perform feature selection for the given data with the specified configuration.
- Parameters
- data : DataFrame
DataFrame containing the training data.
- key : str, optional
Name of the ID column. If data has an index column, it is used as the key.
Defaults to no ID column.
- label : str, optional
Specifies the dependent variable by name.
Mandatory for supervised feature selection methods.
For the 'SPEC' method, which can be either supervised or unsupervised, the unsupervised version is performed if label is not set.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- fixed_feature : str or list of str, optional
Column(s) that will always be selected as part of the best subset.
- excluded_feature : str or list of str, optional
Excludes the indicated columns as feature candidates.
- verbose : bool, optional
Indicates whether to output more detailed results.
Defaults to False.
- Returns
- DataFrame
Feature selection result from the input data.
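As an illustration of the optional arguments (the column names refer to the example data shown earlier, and the chosen values are arbitrary):

>>> fs = FeatureSelection(fs_method='MRMR', top_k_best=6)
>>> fs_df = fs.fit_transform(df, label='Y',
...                          categorical_variable=['X1'],  # treat INTEGER column X1 as categorical
...                          fixed_feature='X2',           # X2 is always kept in the subset
...                          excluded_feature='X12',       # X12 is never considered
...                          verbose=True)                 # request more detailed output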
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
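Both properties expose the generated procedure source, so it can be inspected or persisted after fitting; for instance (a sketch, assuming a fitted instance fs):

>>> print(fs.fit_hdbprocedure)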
Inherited Methods from PALBase
Besides the methods mentioned above, the FeatureSelection class also inherits methods from the PALBase class; please refer to PAL Base for more details.