PCA
- class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None, n_components=None)
Principal component analysis (PCA) aims at reducing the dimensionality of multivariate data while accounting for as much of the variation in the original dataset as possible. This technique is especially useful when the variables within the dataset are highly correlated. Principal components seek to transform the original variables to a new set of variables that are:
linear combinations of the variables in the dataset;
uncorrelated with each other;
ordered according to the amount of variation in the original variables that they explain.
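The three properties listed above can be checked numerically. The following is a minimal NumPy sketch, not the PAL implementation: it assumes PCA is computed by eigendecomposition of the sample covariance matrix, and the synthetic data and all variable names are illustrative.

```python
import numpy as np

# Synthetic data: 200 rows, 3 variables, the first two strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Centre the data, then eigendecompose the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse them so the first
# component is the one that explains the most variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
loadings = eigvecs[:, order]

# Each score column is a linear combination of the centred variables.
scores = Xc @ loadings

# Share of total variance explained by each component, and the covariance
# of the scores (diagonal if the components are uncorrelated).
explained_ratio = eigvals / eigvals.sum()
score_cov = np.cov(scores, rowvar=False)
```

In this sketch `score_cov` is diagonal up to rounding error, and its diagonal holds the eigenvalues in decreasing order, which corresponds to the "uncorrelated" and "ordered" properties.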
- Parameters:
- thread_ratio: float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
No default value.
- scaling: bool, optional
If True, scale variables to have unit variance before the analysis takes place.
Defaults to False.
- scores: bool, optional
If True, output the scores on each principal component when fitting.
Defaults to False.
- n_components: int, optional
Specifies the number of components to keep after transforming input data.
Defaults to None.
Examples
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)
Perform fit():
>>> pca.fit(data=df, key='ID')
Output:
>>> pca.loadings_.collect()
>>> pca.loadings_stat_.collect()
>>> pca.scaling_stat_.collect()
Perform transform():
>>> result = pca.transform(data=df_transform, key='ID', n_components=4)
>>> result.collect()
- Attributes:
- loadings_: DataFrame
The weights by which each standardized original variable should be multiplied when computing component scores.
- loadings_stat_: DataFrame
Loadings statistics on each component.
- scores_: DataFrame
The transformed variable values corresponding to each data point. Set to None if scores is False.
- scaling_stat_: DataFrame
Mean and scale values of each variable.
Note
Variables cannot be scaled if any variable has a constant value across all data items.
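The reason is arithmetic: scaling to unit variance divides each variable by its standard deviation, which is zero for a constant column. The sketch below uses NumPy to show the problem and one way to detect it before fitting. The helper `scale_to_unit_variance` is hypothetical and is not part of hana_ml; how PAL itself reports this condition is not shown here.

```python
import numpy as np

# Illustrative data: the middle column is constant across all rows.
X = np.array([
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 20.0],
    [3.0, 5.0, 30.0],
])

# A constant column has a sample standard deviation of exactly zero.
std = X.std(axis=0, ddof=1)
constant_columns = np.flatnonzero(std == 0.0)


def scale_to_unit_variance(data):
    """Hypothetical helper: centre and scale columns, refusing constant ones."""
    sd = data.std(axis=0, ddof=1)
    if np.any(sd == 0.0):
        bad = np.flatnonzero(sd == 0.0).tolist()
        raise ValueError(f"cannot scale constant columns at positions {bad}")
    return (data - data.mean(axis=0)) / sd
```

A check like this on the client side can identify the offending columns so that they are dropped, or `scaling` is left at False, before calling fit().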
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label]): Fit the model to the given dataset.
fit_transform(data[, key, features, ...]): Fit with the data and return the scores.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
set_model_state(state): Set the model state by state information.
transform(data[, key, features, ...]): Principal component analysis projection function using a trained model.
- fit(data, key=None, features=None, label=None)
Fit the model to the given dataset.
- Parameters:
- data: DataFrame
Data to be fitted.
- key: str, optional
Name of the ID column. Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- label: str, optional
Label of data.
- Returns:
- A fitted 'PCA' object.
- fit_transform(data, key=None, features=None, n_components=None, label=None)
Fit with the data and return the scores.
- Parameters:
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- n_components: int, optional
Number of components to be retained. The value ranges from 1 to the number of features.
Defaults to self.n_components if it is set; otherwise defaults to the number of features.
- label: str, optional
Label of data.
- Returns:
- DataFrame
Transformed variable values corresponding to each data point, structured as follows:
ID column, with the same name and type as the ID column of data.
SCORE columns, type DOUBLE, representing the component score values of each data point.
LABEL column, same as the label column in data; valid only when parameter label is set.
- transform(data, key=None, features=None, n_components=None, label=None)
Principal component analysis projection function using a trained model.
- Parameters:
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column. Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- n_components: int, optional
Number of components to be retained. The value ranges from 1 to the number of features.
Defaults to the number of features.
- label: str, optional
Label of data.
- Returns:
- DataFrame
Transformed variable values corresponding to each data point.
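What "projection using a trained model" means can be pictured with a small NumPy sketch. This is not the PAL procedure: it assumes new rows are centred with the training means and multiplied by the first n_components loading vectors, and the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative training data (100 rows, 4 features) and new data (5 rows).
train = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
new = rng.normal(size=(5, 4))

# "Fit": store the training mean and the loadings, ordered so that the
# component with the largest variance comes first.
mean = train.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(train - mean, rowvar=False))
loadings = eigvecs[:, np.argsort(eigvals)[::-1]]

# "Transform": centre the new rows with the training mean and keep only
# the first n_components columns of the projection.
n_components = 2
projected = (new - mean) @ loadings[:, :n_components]
```

Because the components are ordered, reducing n_components simply drops the trailing score columns; the leading columns are unchanged.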
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function: str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for (CAT)PCA.
- pal_funcname: int or str, optional
PAL function name.
Defaults to self.pal_funcname.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100); must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000); the value corresponding to each NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
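The dict form can be checked before it is passed to set_model_state. The sketch below is plain Python: the helper `missing_state_keys` is hypothetical and not part of hana_ml, and the values are placeholders rather than real state information.

```python
# The four keys the documentation requires for a state dict.
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Hypothetical helper: list the required keys absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values only; real values come from an existing model state.
state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

missing = missing_state_keys(state)
```

An empty `missing` list means the dict has the documented shape; whether the values themselves are valid can only be determined by the server.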
Inherited Methods from PALBase
In addition to the methods described above, the PCA class inherits methods from the PALBase class; refer to PAL Base for details.