hanaml.FeatureSelection.Rd
hanaml.FeatureSelection is an R wrapper
for SAP HANA PAL Feature Selection.
hanaml.FeatureSelection(
  data,
  key = NULL,
  features = NULL,
  label = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL,
  fixed.feature = NULL,
  excluded.feature = NULL,
  fs.method = NULL,
  top.k.best = NULL,
  seed = NULL,
  fs.threshold = NULL,
  fs.n.neighbours = NULL,
  fs.category.weight = NULL,
  fs.sigma = NULL,
  fs.regularization.power = NULL,
  fs.rowsampling.ratio = NULL,
  fs.max.iter = NULL,
  fs.admm.tol = NULL,
  fs.admm.rho = NULL,
  fs.admm.mu = NULL,
  fs.admm.gamma = NULL,
  cso.repeat.num = NULL,
  cso.maxgeneration.num = NULL,
  cso.earlystop.num = NULL,
  cso.population.size = NULL,
  cso.phi = NULL,
  cso.featurenum.penalty = NULL,
  cso.test.ratio = NULL,
  verbose = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character, optional
Specifies the dependent variable by name.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
character or list of characters, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
character or list of characters, optional
Column(s) that are always included in the selected best subset.
character or list of characters, optional
Excludes the indicated columns as feature candidates.
character, optional
Statistical based FS methods:
"anova": Anova.
"chi-squared": Chi-squared.
"gini-index": Gini Index.
"fisher-score": Fisher Score.
Information theoretical based FS methods:
"information-gain": Information Gain.
"MRMR": Minimum Redundancy Maximum Relevance.
"JMI": Joint Mutual Information.
"IWFS": Interaction Weight Based Feature Selection.
"FCBF": Fast Correlation Based Filter.
Similarity based FS methods:
"laplacian-score": Laplacian Score.
"SPEC": Spectral Feature Selection.
"ReliefF": ReliefF.
Sparse Learning Based FS method:
"ADMM": ADMM.
Wrapper method:
"CSO": Competitive Swarm Optimizer.
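As an illustration of what a statistics-based method such as "fisher-score" ranks by, here is a minimal local sketch in base R. It uses the textbook Fisher score (between-class scatter over within-class scatter, per feature); whether PAL uses exactly this normalization is an assumption, and the helper `fisher_score` and the toy data are hypothetical.

```r
# Textbook Fisher score of one numeric feature x given class labels y:
#   sum_c n_c * (mean_c - mean)^2  /  sum_c n_c * var_c
# This is a local illustration, not the PAL implementation.
fisher_score <- function(x, y) {
  y <- as.factor(y)
  n_c    <- tapply(x, y, length)
  mean_c <- tapply(x, y, mean)
  # Population variance within each class (divides by n_c, not n_c - 1).
  var_c  <- tapply(x, y, function(v) mean((v - mean(v))^2))
  between <- sum(n_c * (mean_c - mean(x))^2)
  within  <- sum(n_c * var_c)
  between / within
}

# Toy data, made up for illustration: A separates the classes, B does not.
toy <- data.frame(
  A = c(1.0, 1.2, 0.9, 3.0, 3.1, 2.9),
  B = c(2.0, 1.0, 3.0, 2.0, 1.0, 3.0),
  Y = c(0, 0, 0, 1, 1, 1)
)

scores <- sapply(toy[c("A", "B")], fisher_score, y = toy$Y)
# Feature names ordered from highest to lowest score.
ranked <- names(sort(scores, decreasing = TRUE))
print(ranked)
```

A feature whose class means coincide (B here) gets a score of zero, so it ranks last.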
integer, optional
Number of top-ranked features to select.
Required for all methods except FCBF and CSO, which ignore it.
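The rule for top.k.best can be checked on the client side before calling the wrapper. The helper below is hypothetical (it is not part of the wrapper) and only encodes the rule stated above: a value is required for every method except FCBF and CSO.

```r
# Hypothetical pre-flight check, not part of the wrapper itself.
check_top_k <- function(fs.method, top.k.best = NULL) {
  exempt <- c("FCBF", "CSO")
  if (fs.method %in% exempt) {
    return(TRUE)  # top.k.best has no effect for these methods
  }
  if (is.null(top.k.best)) {
    stop("top.k.best must be set when fs.method = \"", fs.method, "\"")
  }
  TRUE
}

ok_fcbf   <- check_top_k("FCBF")                          # exempt method
ok_fisher <- check_top_k("fisher-score", top.k.best = 8)  # value supplied
# A ranking method without top.k.best raises an error, caught here.
missing_k <- tryCatch(check_top_k("MRMR"), error = function(e) "error")
```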
integer, optional
Random seed. A value of 0 uses the system time as the seed.
Defaults to 0.
double, optional
Predefined threshold for symmetrical uncertainty (SU) values between features and target.
Used in FCBF.
Defaults to 0.01.
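For reference, symmetrical uncertainty between two discrete variables is commonly defined as SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)). The base R sketch below computes it for toy vectors; the helper names are hypothetical, and how PAL discretizes continuous columns or applies the threshold internally is not covered here.

```r
# Shannon entropy (in bits) of a discrete vector.
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Symmetrical uncertainty: 2 * IG(X; Y) / (H(X) + H(Y)), in [0, 1].
symmetrical_uncertainty <- function(x, y) {
  hx  <- entropy(x)
  hy  <- entropy(y)
  hxy <- entropy(interaction(x, y, drop = TRUE))  # joint entropy H(X, Y)
  ig  <- hx + hy - hxy                            # information gain
  if (hx + hy == 0) return(0)
  2 * ig / (hx + hy)
}

target    <- c("a", "a", "b", "b")
copy      <- c("a", "a", "b", "b")  # identical to the target
unrelated <- c("u", "v", "u", "v")  # independent of the target

su_copy      <- symmetrical_uncertainty(copy, target)
su_unrelated <- symmetrical_uncertainty(unrelated, target)
```

With the default fs.threshold of 0.01, a feature like `unrelated` (SU of 0) would fall below the threshold, while `copy` (SU of 1) would not.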
integer, optional
Number of neighbors considered in the computation of the affinity matrix.
Used in similarity-based FS methods.
Defaults to 5.
double, optional
The weight of categorical features when calculating distance.
Used in similarity-based FS methods.
Defaults to 0.5 * avg(standard deviations of all numerical columns).
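To make the default for fs.category.weight concrete, the sketch below evaluates 0.5 * avg(stds) locally on a toy data frame. It assumes the sample standard deviation (R's `sd`) and treats every numeric column as numerical; PAL may use a different estimator and excludes INTEGER columns declared categorical. The helper name is hypothetical.

```r
# Hypothetical helper: half the mean standard deviation of numeric columns.
default_category_weight <- function(df) {
  num_cols <- df[vapply(df, is.numeric, logical(1))]
  0.5 * mean(vapply(num_cols, sd, numeric(1)))
}

# Toy data: sd(a) = 1, sd(b) = 2, g is categorical and is skipped.
toy <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6), g = c("x", "y", "x"))
w <- default_category_weight(toy)
```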
double, optional
Sigma in affinity matrix. Used in similarity based FS method.
Defaults to 1.0.
integer, optional
The order of the power function that penalizes high frequency components.
Used in SPEC.
Defaults to 0.
double, optional
The ratio of random sampling without replacement.
Used in ReliefF, ADMM and CSO.
Defaults to 0.6 in ReliefF, 1.0 in ADMM and CSO.
integer, optional
Maximal iterations allowed to run optimization.
Used in ADMM.
Defaults to 100.
double, optional
Convergence threshold. Used in ADMM.
Defaults to 0.0001.
double, optional
Lagrangian Multiplier. Used in ADMM.
Defaults to 1.0.
double, optional
Gain of fs.admm.rho at each iteration. Used in ADMM.
Defaults to 1.05.
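Assuming the gain fs.admm.mu is applied multiplicatively to fs.admm.rho once per iteration (an interpretation of "gain"; the source does not spell out the update rule), the penalty parameter would grow as follows with the default values:

```r
# Assumed schedule: rho_k = rho_0 * mu^k (multiplicative gain per iteration).
rho0  <- 1.0    # default fs.admm.rho
mu    <- 1.05   # default fs.admm.mu
iters <- 0:4
rho   <- rho0 * mu^iters
print(data.frame(iteration = iters, rho = rho))
```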
double, optional
Regularization coefficient.
Defaults to 1.0.
integer, optional
Number of repetitions to run CSO.
CSO starts with a different initialization at each time. Used in CSO.
Defaults to 2.
integer, optional
Maximal number of generations. Used in CSO.
Defaults to 100.
integer, optional
Stops early if there is no change for this many generations. Used in CSO.
Defaults to 30.
integer, optional
Population size of the swarm particles. Used in CSO.
Defaults to 30.
double, optional
Social factor. Used in CSO.
Defaults to 0.1.
double, optional
The ratio for the splitting of training data and testing data.
Defaults to 0.1.
double, optional
The ratio for the splitting of training data and testing data.
Defaults to 0.2.
logical, optional
Indicates whether to output more detailed results.
Defaults to FALSE.
DataFrame
PAL returned result, structured as follows:
"ROWID": ID of the current row.
"OUTPUT": Best set of features.
Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction. It reduces memory usage and improves computational efficiency while avoiding significant loss of information.
> data.df$Collect()
   X1    X2    X3 X4 X5 X6    X7 X8 X9 X10 X11 X12 X13      Y
 1  1 22.08 11.46  2  4  4 1.585  0  0   0   1   2 100  1,213
 2  0 22.67     7  2  8  4 0.165  0  0   0   0   2 160      1
 3  0 29.58  1.75  1  4  4  1.25  0  0   0   1   2 280      1
 4  0 21.67  11.5  1  5  3     0  1  1  11   1   2   0      1
 5  1 20.17  8.17  2  6  4  1.96  1  1  14   0   2  60    159
 6  0 15.83 0.585  2  8  8   1.5  1  1   2   0   2 100      1
 7  1 17.42   6.5  2  3  4 0.125  0  0   0   0   2  60    101
 8  0 58.67  4.46  2 11  8  3.04  1  1   6   0   2  43    561
 9  1 27.83     1  1  2  8     3  0  0   0   0   2 176    538
10  0 55.75  7.08  2  4  8  6.75  1  1   3   1   2 100     51
11  1  33.5  1.75  2 14  8   4.5  1  1   4   1   2 253    858
12  1 41.42     5  2 11  8     5  1  1   6   1   2 470      1
13  1 20.67  1.25  1  8  8 1.375  1  1   3   1   2 140    211
14  1 34.92     5  2 14  8   7.5  1  1   6   1   2   0  1,001
15  1 58.58  2.71  2  8  4 2.415  0  0   0   1   2 320      1
16  1 48.08  6.04  2  4  4  0.04  0  0   0   0   2   0  2,691
17  1 29.58   4.5  2  9  4   7.5  1  1   2   1   2 330      1
18  0 18.92     9  2  6  4  0.75  1  1   2   0   2  88    592
19  1    20  1.25  1  4  4 0.125  0  0   0   0   2 140      5
20  0 22.42 5.665  2 11  4 2.585  1  1   7   0   2 129  3,258
21  0 28.17 0.585  2  6  4  0.04  0  0   0   0   2 260  1,005
22  0 19.17 0.585  1  6  4 0.585  1  0   0   1   2 160      1
23  1 41.17 1.335  2  2  4 0.165  0  0   0   0   2 168      1
24  1 41.58  1.75  2  4  4  0.21  1  0   0   0   2 160      1
25  1  19.5 9.585  2  6  4  0.79  0  0   0   0   2  80    351
26  1 32.75   1.5  2 13  8   5.5  1  1   3   1   2   0      1
27  1  22.5 0.125  1  4  4 0.125  0  0   0   0   2 200     71
28  1 33.17  3.04  1  8  8  2.04  1  1   1   1   2 180 18,028
29  0 30.67    12  2  8  4     2  1  1   1   0   2 220     20
30  1 23.08   2.5  2  8  4 1.085  1  1  11   1   2  60  2,185
Call the function:
> result <- hanaml.FeatureSelection(data = data.df,
                                    categorical.variable = c("X1"),
                                    label = "Y",
                                    fs.method = "fisher-score",
                                    top.k.best = 8)
Results:
> result$Collect()
ROWID OUTPUT
1 0 {"__method__":"fisher-score","__SelectedFeatures__":["X3","X7","X2","X8","X9","X13","X6","X5"]}
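The OUTPUT column holds a JSON string, so the selected feature names usually need to be extracted before further use. The following is a minimal base R sketch that works on the string shown above; in a live session the string would presumably come from `result$Collect()$OUTPUT[1]`, and a JSON parser such as `jsonlite::fromJSON` would be a more robust choice if that package is available.

```r
# OUTPUT value copied from the result shown above.
output <- paste0(
  '{"__method__":"fisher-score","__SelectedFeatures__":',
  '["X3","X7","X2","X8","X9","X13","X6","X5"]}'
)

# 1. Isolate the "__SelectedFeatures__" key together with its bracketed list.
inner <- regmatches(
  output,
  regexpr('"__SelectedFeatures__":\\[[^]]*\\]', output)
)
# 2. Drop the key so only the bracketed list remains.
bracket <- sub('^"__SelectedFeatures__":', "", inner)
# 3. Pull out each quoted name and strip the quotes.
selected <- regmatches(bracket, gregexpr('"[^"]+"', bracket))[[1]]
selected <- gsub('"', "", selected)

print(selected)
```

The resulting character vector keeps the order in which the features appear in the JSON string.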