LinearDiscriminantAnalysis

class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(regularization_type=None, regularization_amount=None, projection=None)

Linear Discriminant Analysis is a supervised learning technique used for classification problems. It is particularly useful when the classes are well-separated and the dataset features follow a Gaussian distribution. LDA works by projecting high-dimensional data onto a lower-dimensional space where class separation is maximized. The goal is to find a linear combination of features that best separates the classes. This makes LDA a dimensionality reduction technique as well, similar to Principal Component Analysis (PCA), but with the distinction that LDA takes class labels into account.
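The projection idea described above can be illustrated outside of PAL with a few lines of NumPy. The following is a minimal sketch of textbook Fisher discriminant directions, not the algorithm PAL runs internally; the function name `fisher_directions` and the synthetic two-class data are assumptions made for illustration only.

```python
import numpy as np

def fisher_directions(X, y):
    """Return (directions, overall_mean) for textbook Fisher LDA.

    Directions are the leading eigenvectors of inv(S_w) @ S_b, stored as columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)
    # Eigenvectors of inv(S_w) @ S_b maximize between-class over within-class spread.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]
    # At most (number of classes - 1) discriminants carry information.
    n_disc = min(n_features, len(classes) - 1)
    return eigvecs.real[:, order[:n_disc]], overall_mean

# Synthetic, well-separated Gaussian classes (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

W, mean = fisher_directions(X, y)
Z = (X - mean) @ W  # projected data, one column per discriminant
```

With two classes there is a single discriminant, so `Z` has one column in which the two classes sit far apart relative to their spread.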

Parameters:

regularization_type : {'mixing', 'diag', 'pseudo'}, optional

The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.

Defaults to 'mixing'.

regularization_amount : float, optional

The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix. The valid range for this parameter is [0, 1]. Valid only when regularization_type is 'mixing'.

Defaults to the smallest number in [0,1] that makes the regularized empirical covariance matrix invertible.
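To make the 'mixing' strategy concrete, the sketch below forms a convex combination of a covariance matrix and its diagonal. The formula `(1 - amount) * cov + amount * diag(cov)` is an assumption based on the parameter description above and may not match PAL's exact computation; `mix_with_diagonal` is a hypothetical helper.

```python
import numpy as np

def mix_with_diagonal(cov, amount):
    """Convex mix of cov and its diagonal: (1 - amount) * cov + amount * diag(cov)."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must lie in [0, 1]")
    return (1.0 - amount) * cov + amount * np.diag(np.diag(cov))

# Two perfectly collinear features give a singular empirical covariance matrix.
x = np.array([1.0, 2.0, 3.0, 4.0])
data = np.column_stack([x, 2.0 * x])
cov = np.cov(data, rowvar=False)

rank_before = np.linalg.matrix_rank(cov)
rank_after = np.linalg.matrix_rank(mix_with_diagonal(cov, 0.1))
```

In this toy case any positive weight shrinks the off-diagonal entries enough to restore full rank, which is consistent with the default being the smallest weight that makes the matrix invertible.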

projection : bool, optional

Whether or not to compute the projection model.

Defaults to True.

Attributes:

basic_info_ (DataFrame): Basic information of the training data for linear discriminant analysis.
priors_ (DataFrame): The empirical priors for each class in the training data.
coef_ (DataFrame): Coefficients (inclusive of intercepts) of each class's linear score function for the training data.
proj_info_ (DataFrame): Projection-related information, such as the standard deviations of the discriminants and the proportion of total variance explained by each discriminant.
proj_model_ (DataFrame): The projection matrix and overall means for features.
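As a rough illustration of what a per-class linear score function with an intercept looks like, the sketch below computes the textbook LDA score, intercept plus weights times features, from a pooled covariance matrix and class priors. The helpers `linear_score_coefficients` and `predict_class` are hypothetical, and PAL's exact parameterization, scaling, and regularization may differ from this sketch.

```python
import numpy as np

def linear_score_coefficients(X, y):
    """Return {class: (intercept, weights)} for the textbook LDA score function."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    pooled = np.zeros((p, p))
    means, priors = {}, {}
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        priors[c] = Xc.shape[0] / n  # empirical prior of class c
        centered = Xc - means[c]
        pooled += centered.T @ centered
    pooled /= (n - len(classes))  # pooled within-class covariance
    coefs = {}
    for c in classes:
        weights = np.linalg.solve(pooled, means[c])
        intercept = -0.5 * means[c] @ weights + np.log(priors[c])
        coefs[c] = (intercept, weights)
    return coefs

def predict_class(coefs, x):
    """Pick the class whose linear score is largest for the point x."""
    scores = {c: b + w @ x for c, (b, w) in coefs.items()}
    return max(scores, key=scores.get)

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(40, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

coefs = linear_score_coefficients(X, y)
label_near_origin = predict_class(coefs, np.array([0.2, -0.1]))
label_near_three = predict_class(coefs, np.array([2.9, 3.2]))
```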

Methods

`fit`(data[, key, features, label])	Fit the model to the given dataset.
`predict`(data[, key, features, verbose, ...])	Predict class labels using fitted linear discriminators.
`project`(data[, key, features, proj_dim])	Project data into lower dimensional spaces using the fitted LDA projection model.

Examples

>>> from hana_ml.algorithms.pal.discriminant_analysis import LinearDiscriminantAnalysis
>>> lda = LinearDiscriminantAnalysis(regularization_type='mixing', projection=True)

Perform fit():

>>> lda.fit(data=df, features=['X1', 'X2'], label='CLASS')
>>> lda.coef_.collect()
>>> lda.proj_model_.collect()

Perform predict():

>>> res = lda.predict(data=df_pred, key='ID',
...                   features=['X1', 'X2'], verbose=False)
>>> res.collect()

Perform project():

>>> res_proj = lda.project(data=df_proj, key='ID',
...                        features=['X1', 'X2'], proj_dim=2)
>>> res_proj.collect()

fit(data, key=None, features=None, label=None)

Fit the model to the given dataset.

Parameters:

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If not provided:

If data is indexed by a single column, key defaults to that index column.

Otherwise, data is assumed to contain no ID column.

features : list of str, optional

Names of the feature columns.

If not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the class label column.

If not provided, it defaults to the last non-ID column.

Returns:

A fitted object of class "LinearDiscriminantAnalysis".
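The defaulting rules for key, features, and label described above can be summarized in a small pure-Python sketch. `resolve_fit_columns` is a hypothetical helper written only to mirror the documented behavior; it is not part of hana_ml, and the way the index is represented here (a string for a single-column index, a list for a multi-column index, None for no index) is an assumption.

```python
def resolve_fit_columns(columns, index=None, key=None, features=None, label=None):
    """Mirror the documented defaults of fit() for key, features and label."""
    # key: fall back to a single-column index; otherwise assume no ID column.
    if key is None and isinstance(index, str):
        key = index
    non_id = [c for c in columns if c != key]
    # label: defaults to the last non-ID column.
    if label is None:
        label = non_id[-1]
    # features: defaults to all non-ID, non-label columns.
    if features is None:
        features = [c for c in non_id if c != label]
    return key, features, label

# Data indexed by a single column: that column becomes the key.
indexed = resolve_fit_columns(['ID', 'X1', 'X2', 'CLASS'], index='ID')
# Data without an index: no ID column is assumed.
unindexed = resolve_fit_columns(['X1', 'X2', 'CLASS'])
```

Note that with unindexed data every column except the last is treated as a feature, so an ID column that is present but not declared would be used as a feature.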

predict(data, key=None, features=None, verbose=None, verbose_top_n=None)

Predict class labels using fitted linear discriminators.

Parameters:

data : DataFrame

Data for predicting the class labels.

key : str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If not provided, defaults to all non-ID columns.

verbose : bool, optional

Whether to output the scores of all classes. If False, only the score of the predicted class is output.

Defaults to False.

verbose_top_n : int, optional

Specifies the number of top-ranked classes to output after sorting by confidence. It cannot exceed the number of classes in the label of the training data. A value of 0 outputs the confidences of all classes. Effective only when verbose is set to True.

Defaults to 0.

Returns:

DataFrame: Predicted class labels and the corresponding scores.
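The interaction between verbose and verbose_top_n described above can be sketched in plain Python. `select_scores` is a hypothetical helper that only mimics the documented selection rules on a dictionary of class scores; the actual method returns a DataFrame whose layout is not reproduced here.

```python
def select_scores(scores, verbose=False, verbose_top_n=0):
    """Return (class, score) pairs following the documented verbose rules."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not verbose:
        # Only the predicted (highest-scoring) class is reported.
        return ranked[:1]
    if verbose_top_n < 0 or verbose_top_n > len(ranked):
        raise ValueError("verbose_top_n cannot exceed the number of classes")
    # 0 means "all classes"; otherwise keep the top n.
    return ranked if verbose_top_n == 0 else ranked[:verbose_top_n]

# Made-up scores for three classes (illustrative values only).
scores = {'A': 0.2, 'B': 0.7, 'C': 0.1}
only_best = select_scores(scores)
all_classes = select_scores(scores, verbose=True)
top_two = select_scores(scores, verbose=True, verbose_top_n=2)
```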

project(data, key=None, features=None, proj_dim=None)

Project data into lower dimensional spaces using the fitted LDA projection model.

Parameters:

data : DataFrame

Data for linear discriminant projection.

key : str, optional

Name of the ID column. Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If not provided, defaults to all non-ID columns.

proj_dim : int, optional

Dimension of the projected space, equivalent to the number of discriminants used for projection.

Defaults to the number of obtained discriminants.

Returns:

DataFrame

Projected data, structured as follows:

1st column: ID, with the same name and data type as the ID column of the data for projection.
Other columns: named DISCRIMINANT_i, where i iterates from 1 to the number of elements in features, with data type DOUBLE.

Inherited Methods from PALBase

Besides the methods described above, the LinearDiscriminantAnalysis class also inherits methods from the PALBase class. Please refer to PAL Base for more details.