LinearDiscriminantAnalysis
- class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(regularization_type=None, regularization_amount=None, projection=None)
Linear discriminant analysis for classification and data reduction.
- Parameters
- regularization_type{'mixing', 'diag', 'pseudo'}, optional
The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.
Defaults to 'mixing'.
- regularization_amountfloat, optional
The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix.
The valid range for this parameter is [0, 1].
Valid only when regularization_type is 'mixing'.
Defaults to the smallest number in [0, 1] that makes the regularized empirical covariance matrix invertible.
- projectionbool, optional
Whether or not to compute the projection model.
Defaults to True.
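For illustration, a model that regularizes the empirical covariance matrix by mixing in 10% of its diagonal could be set up as follows (a minimal sketch; the parameter values are illustrative, not recommendations):
>>> lda_reg = LinearDiscriminantAnalysis(regularization_type='mixing',
...                                      regularization_amount=0.1,
...                                      projection=False)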
Examples
The training data for linear discriminant analysis:
>>> df.collect()
     X1   X2   X3   X4            CLASS
0   5.1  3.5  1.4  0.2      Iris-setosa
1   4.9  3.0  1.4  0.2      Iris-setosa
2   4.7  3.2  1.3  0.2      Iris-setosa
3   4.6  3.1  1.5  0.2      Iris-setosa
4   5.0  3.6  1.4  0.2      Iris-setosa
5   5.4  3.9  1.7  0.4      Iris-setosa
6   4.6  3.4  1.4  0.3      Iris-setosa
7   5.0  3.4  1.5  0.2      Iris-setosa
8   4.4  2.9  1.4  0.2      Iris-setosa
9   4.9  3.1  1.5  0.1      Iris-setosa
10  7.0  3.2  4.7  1.4  Iris-versicolor
11  6.4  3.2  4.5  1.5  Iris-versicolor
12  6.9  3.1  4.9  1.5  Iris-versicolor
13  5.5  2.3  4.0  1.3  Iris-versicolor
14  6.5  2.8  4.6  1.5  Iris-versicolor
15  5.7  2.8  4.5  1.3  Iris-versicolor
16  6.3  3.3  4.7  1.6  Iris-versicolor
17  4.9  2.4  3.3  1.0  Iris-versicolor
18  6.6  2.9  4.6  1.3  Iris-versicolor
19  5.2  2.7  3.9  1.4  Iris-versicolor
20  6.3  3.3  6.0  2.5   Iris-virginica
21  5.8  2.7  5.1  1.9   Iris-virginica
22  7.1  3.0  5.9  2.1   Iris-virginica
23  6.3  2.9  5.6  1.8   Iris-virginica
24  6.5  3.0  5.8  2.2   Iris-virginica
25  7.6  3.0  6.6  2.1   Iris-virginica
26  4.9  2.5  4.5  1.7   Iris-virginica
27  7.3  2.9  6.3  1.8   Iris-virginica
28  6.7  2.5  5.8  1.8   Iris-virginica
29  7.2  3.6  6.1  2.5   Iris-virginica
Set up an instance of LinearDiscriminantAnalysis and train it:
>>> lda = LinearDiscriminantAnalysis(regularization_type='mixing', projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')
Check the coefficients of the obtained linear discriminators and the projection model:
>>> lda.coef_.collect()
             CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
             NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333
Data to predict the class labels:
>>> df_pred.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5
Perform predict() and check the result:
>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
    ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065
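The reported SCORE appears to combine each class' linear score function from coef_ with the log of its empirical class prior (1/3 in this balanced sample). A rough client-side sanity check for the first row with numpy (an observation about this particular output, not a documented formula):
>>> import numpy as np
>>> x = np.array([5.1, 3.5, 1.4, 0.2])                            # features of ID 1
>>> w = np.array([23.907391, 51.754001, -34.641902, -49.063407])  # Iris-setosa coefficients
>>> w.dot(x) - 113.235478 + np.log(1.0/3.0)                       # about 130.4213, matching SCORE for ID 1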
Data to project:
>>> df_proj.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5
Perform project() and check the result:
>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1','X2','X3','X4'],
...                        proj_dim=2)
>>> res_proj.collect()
    ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None
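The projected values can likewise be reproduced from proj_model_ by applying the discriminant weights to the mean-centered features. A minimal client-side check for the first row, assuming this interpretation of the projection model (the exact formula used by PAL is not spelled out here):
>>> import numpy as np
>>> x = np.array([5.1, 3.5, 1.4, 0.2])                         # features of ID 1
>>> mean = np.array([5.843333, 3.040000, 3.863333, 1.213333])  # OVERALL_MEAN row
>>> d1 = np.array([1.907978, 2.399516, -3.846154, -3.112216])  # DISCRIMINANT_1 row
>>> d1.dot(x - mean)                                           # about 12.3136, matching DISCRIMINANT_1 for ID 1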
- Attributes
- basic_info_DataFrame
Basic information of the training data for linear discriminant analysis.
- priors_DataFrame
The empirical priors for each class in the training data.
- coef_DataFrame
Coefficients (inclusive of intercepts) of each class' linear score function for the training data.
- proj_info_DataFrame
Projection-related information, such as the standard deviation of each discriminant, the proportion of the total variance explained by each discriminant, etc.
- proj_model_DataFrame
The projection matrix and the overall means of the features.
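All of these attributes are hana_ml DataFrames populated by fit() and can be inspected on the client, for example (a sketch; the exact column layout of each table is determined by the underlying PAL procedure):
>>> lda.basic_info_.collect()   # basic information of the training data
>>> lda.priors_.collect()       # empirical class priors
>>> lda.proj_info_.collect()    # variance explained by each discriminant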
Methods
fit(data[, key, features, label])
Calculate linear discriminators from training data.
predict(data[, key, features, verbose])
Predict class labels using fitted linear discriminators.
project(data[, key, features, proj_dim])
Project data into lower-dimensional spaces using the fitted LDA projection model.
- fit(data, key=None, features=None, label=None)
Calculate linear discriminators from training data.
- Parameters
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column.
If not provided, then:
- if data is indexed by a single column, then key defaults to that index column;
- otherwise, it is assumed that data contains no ID column.
- featureslist of str, optional
Names of the feature columns.
If not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the class label.
If not provided, it defaults to the last non-ID column.
- Returns
- LinearDiscriminantAnalysis
A fitted object.
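Given the defaults above, the feature and label columns need not be listed explicitly when the label is the last non-ID column. A minimal sketch using the training data from the Examples section:
>>> lda = LinearDiscriminantAnalysis()
>>> lda.fit(data=df)  # CLASS, the last column, is taken as the label; X1-X4 as features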
- predict(data, key=None, features=None, verbose=None)
Predict class labels using fitted linear discriminators.
- Parameters
- dataDataFrame
Data for predicting the class labels.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns. If not provided, defaults to all non-ID columns.
- verbosebool, optional
Whether or not to output the scores of all classes.
If False, only the score of the predicted class is output.
Defaults to False.
- Returns
- DataFrame
Predicted class labels and the corresponding scores, structured as follows:
- ID: with the same name and data type as the ID column of data.
- CLASS: with the same name and data type as the label column of the training data.
- SCORE: data type DOUBLE, the score of the predicted class.
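The result is a hana_ml DataFrame and can be pulled to the client with collect() for further evaluation. For instance, since df_pred in the Examples section holds the same 30 observations as the training data, a rough client-side accuracy check could look like this (a pandas sketch, not part of the hana_ml API):
>>> pred = res_pred.collect()    # pandas DataFrame with columns ID, CLASS, SCORE
>>> truth = df.collect()         # training data with the true CLASS labels
>>> (pred.sort_values('ID')['CLASS'].values == truth['CLASS'].values).mean()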
- project(data, key=None, features=None, proj_dim=None)
Project data into lower-dimensional spaces using the fitted LDA projection model.
- Parameters
- dataDataFrame
Data for linear discriminant projection.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
If not provided, defaults to all non-ID columns.
- proj_dimint, optional
Dimension of the projected space, equivalent to the number of discriminants used for projection.
Defaults to the number of obtained discriminants.
- Returns
- DataFrame
Projected data, structured as follows:
- 1st column: ID, with the same name and data type as the ID column of the data for projection.
- Other columns: named DISCRIMINANT_i, where i runs from 1 to the number of elements in features, of data type DOUBLE.
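For example, to keep only the leading discriminant one could call project() with proj_dim=1 (a sketch; by analogy with the proj_dim=2 output in the Examples section, the remaining DISCRIMINANT_i columns would presumably be returned as NULL):
>>> res_1d = lda.project(data=df_proj, key='ID', proj_dim=1)  # features default to all non-ID columns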
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
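Both properties expose the SQLScript generated by the client and can be printed for inspection, assuming the procedure source is returned as a string (a minimal sketch):
>>> print(lda.fit_hdbprocedure)      # SQLScript procedure generated for fit
>>> print(lda.predict_hdbprocedure)  # SQLScript procedure generated for predict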