FeatureSelection
- class hana_ml.algorithms.pal.preprocessing.FeatureSelection(fs_method, top_k_best=None, thread_ratio=None, seed=None, fs_threshold=None, fs_n_neighbours=None, fs_category_weight=None, fs_sigma=None, fs_regularization_power=None, fs_rowsampling_ratio=None, fs_max_iter=None, fs_admm_tol=None, fs_admm_rho=None, fs_admm_mu=None, fs_admm_gamma=None, cso_repeat_num=None, cso_maxgeneration_num=None, cso_earlystop_num=None, cso_population_size=None, cso_phi=None, cso_featurenum_penalty=None, cso_test_ratio=None)
Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction, reducing memory usage and improving computational efficiency while avoiding significant loss of information.
- Parameters:
- fs_method : {'anova', 'chi-squared', 'gini-index', 'fisher-score', 'information-gain', 'MRMR', 'JMI', 'IWFS', 'FCBF', 'laplacian-score', 'SPEC', 'ReliefF', 'ADMM', 'CSO'}
Statistical based FS methods:
'anova': Anova.
'chi-squared': Chi-squared.
'gini-index': Gini Index.
'fisher-score': Fisher Score.
Information theoretical based FS methods:
'information-gain': Information Gain.
'MRMR': Minimum Redundancy Maximum Relevance.
'JMI': Joint Mutual Information.
'IWFS': Interaction Weight Based Feature Selection.
'FCBF': Fast Correlation Based Filter.
Similarity based FS methods:
'laplacian-score': Laplacian Score.
'SPEC': Spectral Feature Selection.
'ReliefF': ReliefF.
Sparse learning based FS method:
'ADMM': ADMM.
Wrapper method:
'CSO': Competitive Swarm Optimizer.
- top_k_best : int, optional
Top k best features to be selected. Must be assigned a value except when fs_method is 'FCBF' or 'CSO', for which it has no effect.
- thread_ratio : float, optional
The ratio of available threads to use:
0: single thread.
0~1: uses the specified percentage of available threads.
Others: heuristically determined.
Defaults to -1.
- seed : int, optional
Random seed. 0 means using system time as seed.
Defaults to 0.
- fs_threshold : float, optional
Predefined threshold for symmetrical uncertainty (SU) values between features and the target. Used in FCBF.
Defaults to 0.01.
- fs_n_neighbours : int, optional
Number of neighbours considered in the computation of the affinity matrix. Used in similarity based FS methods.
Defaults to 5.
- fs_category_weight : float, optional
The weight of categorical features while calculating distance. Used in similarity based FS methods.
Defaults to 0.5*avg(std of all numerical columns).
- fs_sigma : float, optional
Sigma in the affinity matrix. Used in similarity based FS methods.
Defaults to 1.0.
- fs_regularization_power : int, optional
The order of the power function that penalizes high frequency components. Used in SPEC.
Defaults to 0.
- fs_rowsampling_ratio : float, optional
The ratio of random sampling without replacement. Used in ReliefF, ADMM and CSO.
Defaults to 0.6 in ReliefF, 1.0 in ADMM and CSO.
- fs_max_iter : int, optional
Maximum number of iterations allowed for the optimization. Used in ADMM.
Defaults to 100.
- fs_admm_tol : float, optional
Convergence threshold. Used in ADMM.
Defaults to 0.0001.
- fs_admm_rho : float, optional
Lagrangian multiplier. Used in ADMM.
Defaults to 1.0.
- fs_admm_mu : float, optional
Gain of fs_admm_rho at each iteration. Used in ADMM.
Defaults to 1.05.
- fs_admm_gamma : float, optional
Regularization coefficient. Used in ADMM.
Defaults to 1.0.
- cso_repeat_num : int, optional
Number of times CSO is repeated, each time starting with a different initialization. Used in CSO.
Defaults to 2.
- cso_maxgeneration_num : int, optional
Maximal number of generations. Used in CSO.
Defaults to 100.
- cso_earlystop_num : int, optional
Stop CSO early if the result does not change over the specified number of consecutive generations. Used in CSO.
Defaults to 30.
- cso_population_size : int, optional
Population size of the swarm particles. Used in CSO.
Defaults to 30.
- cso_phi : float, optional
Social factor. Used in CSO.
Defaults to 0.1.
- cso_featurenum_penalty : float, optional
The penalty applied to the number of selected features. Used in CSO.
Defaults to 0.1.
- cso_test_ratio : float, optional
The ratio for splitting the data into training and testing sets. Used in CSO.
Defaults to 0.2.
Examples
>>> fs = FeatureSelection(fs_method='fisher-score', top_k_best=8)
>>> fs_df = fs.fit_transform(data=df, categorical_variable=['X1'], label='Y')
>>> fs.result_.collect()
>>> fs_df.collect()
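A further sketch, not taken from the original documentation: it assumes a reachable HANA instance, a table named PAL_TRAIN_TBL and a label column Y, all of which are placeholders. It illustrates the FCBF filter, which needs no top_k_best and prunes features whose symmetrical uncertainty with the label falls below fs_threshold.
>>> from hana_ml import dataframe
>>> from hana_ml.algorithms.pal.preprocessing import FeatureSelection
>>> conn = dataframe.ConnectionContext(address='<host>', port=30015,
...                                    user='<user>', password='<password>')
>>> df = conn.table('PAL_TRAIN_TBL')   # assumed training table
>>> fs = FeatureSelection(fs_method='FCBF', fs_threshold=0.01, seed=1)
>>> fs_df = fs.fit_transform(data=df, label='Y')
>>> fs.result_.collect()   # ROWID/OUTPUT table with the selected feature set
>>> fs_df.collect()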
- Attributes:
- result_ : DataFrame
PAL returned result, structured as follows:
ROWID: Indicates the ID of the current row.
OUTPUT: Best set of features.
Methods
fit(data[, key, label, ...]): Perform feature selection for given data with specified configuration.
fit_transform(data[, key, label, ...]): Perform feature selection for given data with specified configuration.
- fit(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)
Perform feature selection for given data with specified configuration.
- Parameters:
- data : DataFrame
Input HANA DataFrame.
- key : str, optional
Name of the ID column. If data has an index, key defaults to that index.
Otherwise, there is no ID column by default.
- label : str, optional
Specifies the dependent variable by name.
Mandatory for supervised feature selection methods.
For the 'SPEC' method, which can be either supervised or unsupervised, the unsupervised version is performed if label is not set.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- fixed_feature : str or a list of str, optional
Specifies features that will always be included in the selected best subset.
- excluded_feature : str or a list of str, optional
Excludes the indicated columns as feature candidates.
- verbose : bool, optional
Indicates whether to output more detailed results.
Defaults to False.
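A minimal sketch of a fit call, assuming an existing HANA DataFrame df with an ID column ID, a label column Y and feature columns X1, X2, X9, X10 (all names illustrative):
>>> fs = FeatureSelection(fs_method='MRMR', top_k_best=5)
>>> fs.fit(data=df, key='ID', label='Y',
...        categorical_variable=['X1'],     # treat INTEGER column X1 as categorical
...        fixed_feature=['X2'],            # always keep X2 in the selected subset
...        excluded_feature=['X9', 'X10'])  # drop X9 and X10 from the candidates
>>> fs.result_.collect()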
- fit_transform(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)
Perform feature selection for given data with specified configuration.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Name of the ID column. If data has an index, key defaults to that index.
Otherwise, there is no ID column by default.
- label : str, optional
Specifies the dependent variable by name.
Mandatory for supervised feature selection methods.
For the 'SPEC' method, which can be either supervised or unsupervised, the unsupervised version is performed if label is not set.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- fixed_feature : str or a list of str, optional
Specifies features that will always be included in the selected best subset.
- excluded_feature : str or a list of str, optional
Excludes the indicated columns as feature candidates.
- verbose : bool, optional
Indicates whether to output more detailed results.
Defaults to False.
- Returns:
- DataFrame
Feature selection result from the input data.
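As a further sketch, the 'SPEC' method runs in its unsupervised form when label is omitted; df, the key column ID and the chosen parameter values below are illustrative assumptions, not part of the original documentation:
>>> fs = FeatureSelection(fs_method='SPEC', top_k_best=3,
...                       fs_n_neighbours=5,          # neighbours for the affinity matrix
...                       fs_regularization_power=0)  # penalty on high-frequency components
>>> reduced_df = fs.fit_transform(data=df, key='ID')
>>> reduced_df.collect()   # input data restricted to the selected features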
Inherited Methods from PALBase
Besides those methods mentioned above, the FeatureSelection class also inherits methods from the PALBase class; please refer to PAL Base for more details.