FeatureSelection

class hana_ml.algorithms.pal.preprocessing.FeatureSelection(fs_method, top_k_best=None, thread_ratio=None, seed=None, fs_threshold=None, fs_n_neighbours=None, fs_category_weight=None, fs_sigma=None, fs_regularization_power=None, fs_rowsampling_ratio=None, fs_max_iter=None, fs_admm_tol=None, fs_admm_rho=None, fs_admm_mu=None, fs_admm_gamma=None, cso_repeat_num=None, cso_maxgeneration_num=None, cso_earlystop_num=None, cso_population_size=None, cso_phi=None, cso_featurenum_penalty=None, cso_test_ratio=None)

Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction, reducing memory usage and improving computational efficiency while avoiding significant loss of information.

Parameters:

fs_method{'anova', 'chi-squared', 'gini-index', 'fisher-score', 'information-gain', 'MRMR', 'JMI', 'IWFS', 'FCBF', 'laplacian-score', 'SPEC', 'ReliefF', 'ADMM', 'CSO'}

Statistical based FS methods

'anova': ANOVA.

'chi-squared': Chi-squared.

'gini-index': Gini Index.

'fisher-score': Fisher Score.

Information theoretical based FS methods

'information-gain': Information Gain.

'MRMR': Minimum Redundancy Maximum Relevance.

'JMI': Joint Mutual Information.

'IWFS': Interaction Weight Based Feature Selection.

'FCBF': Fast Correlation Based Filter.

Similarity based FS methods

'laplacian-score': Laplacian Score.

'SPEC': Spectral Feature Selection.

'ReliefF': ReliefF.

Sparse Learning Based FS method

'ADMM': ADMM.

Wrapper method

'CSO': Competitive Swarm Optimizer.
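The grouping above can be written down as a small lookup table, for example to check a method name before constructing the estimator. The dictionary and helper names below are illustrative and not part of hana_ml; the helper matches names exactly as spelled in the list.

```python
# Illustrative lookup built from the list above; not part of hana_ml.
FS_METHOD_FAMILIES = {
    'statistical': ['anova', 'chi-squared', 'gini-index', 'fisher-score'],
    'information theoretical': ['information-gain', 'MRMR', 'JMI', 'IWFS', 'FCBF'],
    'similarity': ['laplacian-score', 'SPEC', 'ReliefF'],
    'sparse learning': ['ADMM'],
    'wrapper': ['CSO'],
}


def family_of(fs_method):
    """Return the family a method name belongs to, or raise ValueError."""
    for family, methods in FS_METHOD_FAMILIES.items():
        if fs_method in methods:
            return family
    raise ValueError(f"unknown fs_method: {fs_method!r}")
```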

top_k_bestint, optional

Number of top-ranked features to select. Required for every method except FCBF and CSO, which ignore it.
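A client-side check of this rule might look like the sketch below. The helper name check_top_k is hypothetical; it only assembles keyword arguments that match the constructor signature shown at the top of this page.

```python
# Methods that ignore top_k_best, per the description above.
METHODS_WITHOUT_TOP_K = {'FCBF', 'CSO'}


def check_top_k(fs_method, top_k_best=None):
    """Return constructor keyword arguments, insisting on top_k_best where needed."""
    if fs_method not in METHODS_WITHOUT_TOP_K and top_k_best is None:
        raise ValueError(f"top_k_best must be set for fs_method={fs_method!r}")
    kwargs = {'fs_method': fs_method}
    if top_k_best is not None:
        kwargs['top_k_best'] = top_k_best
    return kwargs
```

The returned dictionary could then be unpacked into the constructor, e.g. FeatureSelection(**check_top_k('anova', 5)).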

thread_ratio, float, optional

The ratio of available threads.

0: single thread

0~1: percentage

others: heuristically determined

Defaults to -1.
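The three cases above can be summarised in a few lines. This is only a reading aid: the function name is made up, and treating exactly 1 as a percentage (all available threads) is an assumption about the boundary.

```python
def thread_ratio_mode(thread_ratio):
    """Describe how a thread_ratio value is interpreted, per the list above."""
    if thread_ratio == 0:
        return 'single thread'
    if 0 < thread_ratio <= 1:
        # Assumption: the upper bound 1 is included in the percentage range.
        return 'percentage of available threads'
    return 'heuristically determined'
```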

seedint, optional

Random seed. 0 means using system time as seed.

Defaults to 0.

fs_thresholdfloat, optional

Predefined threshold for symmetrical uncertainty (SU) values between features and target. Used in FCBF.

Defaults to 0.01.
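For intuition about what the threshold is compared against, the sketch below computes SU for two discrete sequences using the textbook definition SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). It is not PAL's implementation; how PAL discretizes continuous columns or estimates entropies is not described here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Textbook SU between two equally long discrete sequences, in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both sequences are constant, so there is nothing to share
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)
```

Under this definition, a feature whose SU with the target falls below fs_threshold would be considered irrelevant.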

fs_n_neighboursint, optional

Number of neighbours considered in the computation of affinity matrix. Used in similarity based FS method.

Defaults to 5.

fs_category_weightfloat, optional

The weight of categorical features whilst calculating distance. Used in similarity based FS method.

Defaults to 0.5*avg(all numerical columns' std).

fs_sigmafloat, optional

Sigma in affinity matrix. Used in similarity based FS method.

Defaults to 1.0.

fs_regularization_powerint, optional

The order of the power function that penalizes high frequency components. Used in SPEC.

Defaults to 0.

fs_rowsampling_ratiofloat, optional

The ratio of random sampling without replacement. Used in ReliefF, ADMM and CSO.

Defaults to 0.6 in ReliefF, 1.0 in ADMM and CSO.
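To make "random sampling without replacement" concrete, here is a minimal pure-Python sketch. The helper is hypothetical and uses Python's own random seeding, which differs from the seed convention documented above (where 0 means system time); rounding the row count is also an assumption.

```python
import random


def sample_row_indices(n_rows, ratio, seed=None):
    """Pick round(n_rows * ratio) distinct row indices, returned in sorted order."""
    rng = random.Random(seed)
    n_keep = max(1, round(n_rows * ratio))
    # random.sample never returns the same index twice (no replacement).
    return sorted(rng.sample(range(n_rows), n_keep))
```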

fs_max_iterint, optional

Maximal iterations allowed to run optimization. Used in ADMM.

Defaults to 100.

fs_admm_tolfloat, optional

Convergence threshold. Used in ADMM.

Defaults to 0.0001.

fs_admm_rhofloat, optional

Lagrangian Multiplier. Used in ADMM.

Defaults to 1.0.

fs_admm_mufloat, optional

Gain of fs_admm_rho at each iteration. Used in ADMM.

Defaults to 1.05.
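One plausible reading of "gain" is a multiplicative factor applied to fs_admm_rho after every iteration. The sketch below illustrates that assumption only; PAL's actual update rule is not stated on this page.

```python
def rho_schedule(rho=1.0, mu=1.05, n_iter=3):
    """Return rho for each iteration, assuming rho is multiplied by mu each time."""
    values = []
    for _ in range(n_iter):
        values.append(rho)
        rho *= mu  # assumed update: rho_next = mu * rho
    return values
```

With the documented defaults, this assumed schedule grows rho by 5% per iteration, and mu=1.0 keeps it constant.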

fs_admm_gammafloat, optional

Regularization coefficient. Used in ADMM.

Defaults to 1.0.

cso_repeat_numint, optional

Number of repetitions to run CSO. CSO starts with a different initialization at each time. Used in CSO.

Defaults to 2.

cso_maxgeneration_numint, optional

Maximal number of generations. Used in CSO.

Defaults to 100.

cso_earlystop_numint, optional

Number of consecutive generations without change after which the run stops early. Used in CSO.

Defaults to 30.
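The early-stopping rule can be sketched as a counter over the best score of each generation. This assumes that "no change" means the best score is identical to the previous generation's; the function and its input list are hypothetical, not a PAL interface.

```python
def generations_run(best_scores, earlystop_num=30):
    """Return how many generations run before early stopping triggers."""
    stale = 0  # consecutive generations without a change in the best score
    for generation, score in enumerate(best_scores):
        if generation > 0 and score == best_scores[generation - 1]:
            stale += 1
        else:
            stale = 0
        if stale >= earlystop_num:
            return generation + 1
    return len(best_scores)
```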

cso_population_sizeint, optional

Population size of the swarm particles. Used in CSO.

Defaults to 30.

cso_phifloat, optional

Social factor. Used in CSO.

Defaults to 0.1.

cso_featurenum_penaltyfloat, optional

Penalty coefficient on the number of selected features. Used in CSO.

Defaults to 0.1.

cso_test_ratiofloat, optional

The ratio for splitting the data into training and testing sets. Used in CSO.

Defaults to 0.2.
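The CSO defaults listed above can be collected in one place so that only the values being changed need to be spelled out. The names CSO_DEFAULTS and cso_kwargs are illustrative; the keys are the constructor parameters documented above.

```python
# Documented defaults for the CSO-specific parameters.
CSO_DEFAULTS = {
    'cso_repeat_num': 2,
    'cso_maxgeneration_num': 100,
    'cso_earlystop_num': 30,
    'cso_population_size': 30,
    'cso_phi': 0.1,
    'cso_featurenum_penalty': 0.1,
    'cso_test_ratio': 0.2,
}


def cso_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(CSO_DEFAULTS)
    if unknown:
        raise ValueError(f"not a CSO parameter: {sorted(unknown)}")
    return {'fs_method': 'CSO', **CSO_DEFAULTS, **overrides}
```

The result could be unpacked into the constructor, e.g. FeatureSelection(**cso_kwargs(cso_phi=0.2)).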

Attributes:

result_DataFrame

PAL returned result, structured as follows:

ROWID: Indicates the id of current row.

OUTPUT: Best set of features.

Methods

`fit`(data[, key, label, ...])	Perform feature selection for given data with specified configuration.
`fit_transform`(data[, key, label, ...])	Perform feature selection for given data with specified configuration.

Examples

>>> from hana_ml.algorithms.pal.preprocessing import FeatureSelection
>>> fs = FeatureSelection(fs_method='fisher-score',
...                       top_k_best=8)
>>> fs_df = fs.fit_transform(data=df,
...                          categorical_variable=['X1'],
...                          label='Y')
>>> fs.result_.collect()
>>> fs_df.collect()
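The same calls can be wrapped in a small function. In the sketch below, select_features and its fs_cls parameter are hypothetical conveniences: fs_cls lets a stand-in class be supplied when no HANA connection is available, and the real FeatureSelection is imported only when it is omitted.

```python
def select_features(df, label, top_k_best, fs_cls=None, **fit_kwargs):
    """Run Fisher-score selection and return (result table, reduced data)."""
    if fs_cls is None:
        # Deferred import so the sketch can be read without a HANA client installed.
        from hana_ml.algorithms.pal.preprocessing import FeatureSelection as fs_cls
    fs = fs_cls(fs_method='fisher-score', top_k_best=top_k_best)
    reduced = fs.fit_transform(data=df, label=label, **fit_kwargs)
    # result_ is populated by the fit, so it is read after fit_transform.
    return fs.result_, reduced
```

With a HANA DataFrame df, select_features(df, 'Y', 8, categorical_variable=['X1']) mirrors the example above.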

fit(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)

Perform feature selection for given data with specified configuration.

Parameters:

dataDataFrame

Input HANA DataFrame.

keystr, optional

Name of the ID column. If data has an index, key is set to that index column.

Otherwise, no ID column is assumed by default.

labelstr, optional

Specifies the dependent variable by name.

Mandatory for supervised feature selection methods.

The 'SPEC' method can be supervised or unsupervised: if label is not set, the unsupervised version is performed.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

fixed_featurestr or a list of str, optional

Features that are always included in the selected subset.

excluded_featurestr or a list of str, optional

Excludes the indicated columns as feature candidates.

verbosebool, optional

Indicates whether to output more detailed results.

Defaults to False.
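Because fixed_feature and excluded_feature each accept either a string or a list of strings, a small pre-check can normalise them and catch a column named in both. Both helpers are hypothetical; how PAL itself reacts to such an overlap is not documented here.

```python
def as_list(value):
    """Normalise None, a single name, or an iterable of names to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def check_feature_args(fixed_feature=None, excluded_feature=None):
    """Return normalised lists, refusing columns that are both fixed and excluded."""
    fixed, excluded = as_list(fixed_feature), as_list(excluded_feature)
    overlap = sorted(set(fixed) & set(excluded))
    if overlap:
        raise ValueError(f"columns both fixed and excluded: {overlap}")
    return {'fixed_feature': fixed, 'excluded_feature': excluded}
```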

fit_transform(data, key=None, label=None, categorical_variable=None, fixed_feature=None, excluded_feature=None, verbose=None)

Perform feature selection for given data with specified configuration and return the selection result as a DataFrame.

Parameters:

dataDataFrame

DataFrame that contains the training data.

keystr, optional

Name of the ID column. If data has an index, key is set to that index column.

Otherwise, no ID column is assumed by default.

labelstr, optional

Specifies the dependent variable by name.

Mandatory for supervised feature selection methods.

The 'SPEC' method can be supervised or unsupervised: if label is not set, the unsupervised version is performed.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

fixed_featurestr or a list of str, optional

Features that are always included in the selected subset.

excluded_featurestr or a list of str, optional

Excludes the indicated columns as feature candidates.

verbosebool, optional

Indicates whether to output more detailed results.

Defaults to False.

Returns:

DataFrame: Feature selection result from the input data.

Inherited Methods from PALBase

Besides the methods described above, the FeatureSelection class also inherits methods from the PALBase class. Please refer to PAL Base for more details.