Feature Selection — hanaml.FeatureSelection • hana.ml.r

hanaml.FeatureSelection is an R wrapper for SAP HANA PAL Feature Selection.

hanaml.FeatureSelection(
  data,
  key = NULL,
  features = NULL,
  label = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL,
  fixed.feature = NULL,
  excluded.feature = NULL,
  fs.method = NULL,
  top.k.best = NULL,
  seed = NULL,
  fs.threshold = NULL,
  fs.n.neighbours = NULL,
  fs.category.weight = NULL,
  fs.sigma = NULL,
  fs.regularization.power = NULL,
  fs.rowsampling.ratio = NULL,
  fs.max.iter = NULL,
  fs.admm.tol = NULL,
  fs.admm.rho = NULL,
  fs.admm.mu = NULL,
  fs.admm.gamma = NULL,
  cso.repeat.num = NULL,
  cso.maxgeneration.num = NULL,
  cso.earlystop.num = NULL,
  cso.population.size = NULL,
  cso.phi = NULL,
  cso.featurenum.penalty = NULL,
  cso.test.ratio = NULL,
  verbose = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

label

character, optional
Specifies the dependent variable by name.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

categorical.variable

character or list of characters, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fixed.feature

character or list of characters, optional
Specifies feature(s) that are always included in the selected subset.

excluded.feature

character or list of characters, optional
Excludes the indicated columns as feature candidates.

fs.method

character, optional
Specifies the feature selection (FS) method to use.
Statistical based FS methods:

"anova": Anova.
"chi-squared": Chi-squared.
"gini-index": Gini Index.
"fisher-score": Fisher Score.

Information theoretical based FS methods:

"information-gain": Information Gain.
"MRMR": Minimum Redundancy Maximum Relevance.
"JMI": Joint Mutual Information.
"IWFS": Interaction Weight Based Feature Selection.
"FCBF": Fast Correlation Based Filter.

Similarity based FS methods:

"laplacian-score": Laplacian Score.
"SPEC": Spectral Feature Selection.
"ReliefF": ReliefF.

Sparse learning based FS method:

"ADMM": ADMM.

Wrapper method:

"CSO": Competitive Swarm Optimizer.

top.k.best

integer, optional
Number of top-ranked features to select. Required for every method except FCBF and CSO, which ignore it.
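The method names and the top.k.best rule above can be checked locally before the function is called. The following is a minimal sketch in plain R. `fs_methods` and `check_fs_args` are hypothetical helpers, not part of hana.ml.r, and the exact, case-sensitive matching of method names is an assumption.

```r
# Method names as listed in this page, grouped by family.
fs_methods <- list(
  statistical = c("anova", "chi-squared", "gini-index", "fisher-score"),
  information = c("information-gain", "MRMR", "JMI", "IWFS", "FCBF"),
  similarity  = c("laplacian-score", "SPEC", "ReliefF"),
  sparse      = "ADMM",
  wrapper     = "CSO"
)

# Hypothetical helper (not part of hana.ml.r): validates fs.method and
# top.k.best, and returns the family name of the method.
check_fs_args <- function(fs.method, top.k.best = NULL) {
  in_family <- vapply(fs_methods, function(m) fs.method %in% m, logical(1))
  if (!any(in_family)) {
    stop("unknown fs.method: ", fs.method)
  }
  # top.k.best is required for every method except FCBF and CSO.
  if (!fs.method %in% c("FCBF", "CSO") && is.null(top.k.best)) {
    stop("top.k.best must be set for fs.method = '", fs.method, "'")
  }
  names(fs_methods)[in_family]
}

print(check_fs_args("fisher-score", top.k.best = 8))
print(check_fs_args("FCBF"))
```

Calling `check_fs_args("anova")` without top.k.best stops with an error, which mirrors the requirement stated for this argument.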

seed

integer, optional
Random seed. 0 means using system time as seed. Defaults to 0.

fs.threshold

double, optional
Predefined threshold for symmetrical uncertainty (SU) values between features and target. Used in FCBF. Defaults to 0.01.

fs.n.neighbours

integer, optional
Number of neighbors considered in the computation of the affinity matrix. Used in similarity based FS methods. Defaults to 5.

fs.category.weight

double, optional
The weight of categorical features when calculating distance. Used in similarity based FS methods. Defaults to 0.5 * avg(standard deviations of all numerical columns).
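To make the default concrete, here is a small sketch that computes half the average standard deviation of the numerical columns on a toy data frame. The data and variable names are made up for illustration, and using the sample standard deviation (as R's sd() does) is an assumption; the page does not say which variant PAL uses.

```r
# Toy data: two numerical columns and one categorical column.
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  g = c("x", "y", "x", "y")
)

# Keep only the numerical columns.
numeric_cols <- df[vapply(df, is.numeric, logical(1))]

# 0.5 * average of the per-column standard deviations.
# Assumption: sd() is the sample standard deviation (n - 1 denominator).
default_weight <- 0.5 * mean(vapply(numeric_cols, sd, numeric(1)))
print(default_weight)
```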

fs.sigma

double, optional
Sigma in affinity matrix. Used in similarity based FS method. Defaults to 1.0.

fs.regularization.power

integer, optional
The order of the power function that penalizes high frequency components. Used in SPEC. Defaults to 0.

fs.rowsampling.ratio

double, optional
The ratio of random sampling without replacement. Used in ReliefF, ADMM and CSO. Defaults to 0.6 in ReliefF, 1.0 in ADMM and CSO.

fs.max.iter

integer, optional
Maximal iterations allowed to run optimization. Used in ADMM. Defaults to 100.

fs.admm.tol

double, optional
Convergence threshold. Used in ADMM. Defaults to 0.0001.

fs.admm.rho

double, optional
Lagrangian Multiplier. Used in ADMM. Defaults to 1.0.

fs.admm.mu

double, optional
Gain of fs.admm.rho at each iteration. Used in ADMM. Defaults to 1.05.

fs.admm.gamma

double, optional
Regularization coefficient. Used in ADMM. Defaults to 1.0.

cso.repeat.num

integer, optional
Number of repetitions to run CSO. CSO starts with a different initialization at each time. Used in CSO. Defaults to 2.

cso.maxgeneration.num

integer, optional
Maximal number of generations. Used in CSO. Defaults to 100.

cso.earlystop.num

integer, optional
Stops early if there is no change over this number of generations. Used in CSO. Defaults to 30.

cso.population.size

integer, optional
Population size of the swarm particles. Used in CSO. Defaults to 30.

cso.phi

double, optional
Social factor. Used in CSO. Defaults to 0.1.

cso.featurenum.penalty

double, optional
The penalty ratio applied to the number of selected features. Used in CSO. Defaults to 0.1.

cso.test.ratio

double, optional
The ratio for the splitting of training data and testing data. Used in CSO. Defaults to 0.2.

verbose

logical, optional
Indicates whether to output more detailed results. Defaults to FALSE.

Value

DataFrame
PAL returned result, structured as follows:
- "ROWID": ID of the current row.
- "OUTPUT": Best set of features, returned as a JSON string that holds the method name and the list of selected features.

Details

Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction. This reduces memory usage and improves computational efficiency while avoiding significant loss of information.
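As an illustration of what a filter-style method such as "fisher-score" does, the sketch below ranks features with the textbook Fisher score (between-class scatter divided by within-class scatter) on R's built-in iris data. It is plain R, runs without SAP HANA, and is not PAL's implementation; the helper name `fisher_score` is hypothetical and PAL's exact formula may differ.

```r
# Textbook Fisher score for one numeric feature x and class labels y:
#   sum_k n_k * (mean_k - mean)^2 / sum_k n_k * var_k
# Assumption: var() is the sample variance; PAL may use another variant.
fisher_score <- function(x, y) {
  y <- as.factor(y)
  n_k   <- tapply(x, y, length)  # class sizes
  mu_k  <- tapply(x, y, mean)    # class means
  var_k <- tapply(x, y, var)     # class variances
  sum(n_k * (mu_k - mean(x))^2) / sum(n_k * var_k)
}

# Score the four numeric columns of iris against the Species label.
scores <- sapply(iris[, 1:4], fisher_score, y = iris$Species)

# Keep the two highest-scoring features, analogous to top.k.best = 2.
top_k <- names(sort(scores, decreasing = TRUE))[1:2]
print(scores)
print(top_k)
```

Features whose class means are far apart relative to their within-class spread score highest, so they are kept first when only the top k are selected.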

Examples


> data.df$Collect()
   X1     X2    X3  X4  X5 X6     X7  X8 X9 X10 X11 X12  X13      Y
1   1  22.08 11.46   2   4  4  1.585   0  0   0   1   2  100  1,213
2   0  22.67     7   2   8  4  0.165   0  0   0   0   2  160      1
3   0  29.58  1.75   1   4  4   1.25   0  0   0   1   2  280      1
4   0  21.67  11.5   1   5  3      0   1  1  11   1   2    0      1
5   1  20.17  8.17   2   6  4   1.96   1  1  14   0   2   60    159
6   0  15.83 0.585   2   8  8    1.5   1  1   2   0   2  100      1
7   1  17.42   6.5   2   3  4  0.125   0  0   0   0   2   60    101
8   0  58.67  4.46   2  11  8   3.04   1  1   6   0   2   43    561
9   1  27.83     1   1   2  8      3   0  0   0   0   2  176    538
10  0  55.75  7.08   2   4  8   6.75   1  1   3   1   2  100     51
11  1  33.5   1.75   2  14  8    4.5   1  1   4   1   2  253    858
12  1  41.42     5   2  11  8      5   1  1   6   1   2  470      1
13  1  20.67  1.25   1   8  8  1.375   1  1   3   1   2  140    211
14  1  34.92     5   2  14  8    7.5   1  1   6   1   2    0  1,001
15  1  58.58  2.71   2   8  4  2.415   0  0   0   1   2  320      1
16  1  48.08  6.04   2   4  4   0.04   0  0   0   0   2    0  2,691
17  1  29.58   4.5   2   9  4    7.5   1  1   2   1   2  330      1
18  0  18.92     9   2   6  4   0.75   1  1   2   0   2   88    592
19  1  20     1.25   1   4  4  0.125   0  0   0   0   2  140      5
20  0  22.42 5.665   2  11  4  2.585   1  1   7   0   2  129  3,258
21  0  28.17 0.585   2   6  4   0.04   0  0   0   0   2  260  1,005
22  0  19.17 0.585   1   6  4  0.585   1  0   0   1   2  160      1
23  1  41.17 1.335   2   2  4  0.165   0  0   0   0   2  168      1
24  1  41.58  1.75   2   4  4   0.21   1  0   0   0   2  160      1
25  1  19.5  9.585   2   6  4   0.79   0  0   0   0   2   80    351
26  1  32.75   1.5   2  13  8    5.5   1  1   3   1   2    0      1
27  1  22.5  0.125   1   4  4  0.125   0  0   0   0   2  200     71
28  1  33.17  3.04   1   8  8   2.04   1  1   1   1   2  180 18,028
29  0  30.67    12   2   8  4      2   1  1   1   0   2  220     20
30  1  23.08   2.5   2   8  4  1.085   1  1  11   1   2   60  2,185

Call the function:


> result <- hanaml.FeatureSelection(data = data.df, categorical.variable = c("X1"),
                                    label = "Y", fs.method = "fisher-score",
                                    top.k.best = 8)

Results:


> result$Collect()
  ROWID                                                                                              OUTPUT
1     0     {"__method__":"fisher-score","__SelectedFeatures__":["X3","X7","X2","X8","X9","X13","X6","X5"]}
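The OUTPUT value shown above is a JSON string, so the selected feature names have to be extracted before they can be used as a column list. The following is a minimal base R sketch that works on a copy of that string; `parse_selected_features` is a hypothetical helper, not part of hana.ml.r. A JSON parser such as jsonlite::fromJSON would be the more robust choice where that package is available.

```r
# The OUTPUT string copied from the example result above.
output <- '{"__method__":"fisher-score","__SelectedFeatures__":["X3","X7","X2","X8","X9","X13","X6","X5"]}'

# Hypothetical helper: returns the selected feature names as a character vector.
parse_selected_features <- function(json) {
  # Capture the text inside the [...] that follows "__SelectedFeatures__".
  pattern <- '"__SelectedFeatures__"[[:space:]]*:[[:space:]]*\\[([^]]*)\\]'
  hit <- regmatches(json, regexec(pattern, json))[[1]]
  if (length(hit) < 2) {
    stop("no __SelectedFeatures__ array found")
  }
  # Extract each double-quoted entry, then drop the quotes.
  quoted <- regmatches(hit[2], gregexpr('"[^"]*"', hit[2]))[[1]]
  gsub('"', "", quoted, fixed = TRUE)
}

selected <- parse_selected_features(output)
print(selected)
```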