PCA
- class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None)
Principal component analysis (PCA) reduces the dimensionality of multivariate data using singular value decomposition (SVD).
- Parameters
- thread_ratio: float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
No default value.
- scaling: bool, optional
If true, scale variables to have unit variance before the analysis takes place.
Defaults to False.
- scores: bool, optional
If true, output the scores on each principal component when fitting.
Defaults to False.
Examples
Input DataFrame df1 for training:
>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
Creating a PCA instance:
>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)
Performing fit on given dataframe:
>>> pca.fit(data=df1, key='ID')
Output:
>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398
Input dataframe df2 for transforming:
>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0
Performing transform() on given dataframe:
>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
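The transform output in this example can be checked by hand from the fitted attributes. The following is a minimal pure-Python sketch, not hana_ml code. It assumes that each score is the dot product of one component's loadings with the standardized row, (x - MEAN) / SCALE, using the values shown for loadings_ and scaling_stat_. The helper name project is hypothetical.

```python
# Values copied from pca.scaling_stat_ and pca.loadings_ in the example.
means = [17.000000, 53.636364, 23.000000, 48.454545]
scales = [5.039841, 1.689540, 2.000000, 4.655398]
loadings = [
    [0.541547, 0.321424, 0.511941, 0.584235],     # Comp1
    [-0.454280, 0.728287, 0.395819, -0.326429],   # Comp2
    [-0.171426, -0.600095, 0.760875, -0.177673],  # Comp3
    [-0.686273, -0.078552, -0.048095, 0.721489],  # Comp4
]


def project(row, n_components=None):
    """Standardize one row, then dot it with the first n_components loading vectors.

    Hypothetical helper for illustration; assumes score = loadings . (x - mean) / scale.
    """
    if n_components is None:
        n_components = len(loadings)
    z = [(x - m) / s for x, m, s in zip(row, means, scales)]
    return [sum(w * v for w, v in zip(comp, z)) for comp in loadings[:n_components]]


# First row of df2 (ID 1), columns X1..X4.
scores = project([2.0, 32.0, 10.0, 54.0])
print([round(s, 3) for s in scores])
```

Under that assumption, the computed values agree with the first row of result (COMPONENT_1 to COMPONENT_4) up to the rounding of the displayed inputs. Passing a smaller n_components simply keeps the leading components.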
- Attributes
- loadings_: DataFrame
The weights by which each standardized original variable should be multiplied when computing component scores.
- loadings_stat_: DataFrame
Loadings statistics on each component.
- scores_: DataFrame
The transformed variable values corresponding to each data point. Set to None if scores is False.
- scaling_stat_: DataFrame
Mean and scale values of each variable.
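The columns of loadings_stat_ in the example are related to each other in a simple way. The sketch below assumes that VAR_PROP is each component's variance (SD squared) divided by the total variance, and that CUM_VAR_PROP is its running sum. This reading is consistent with the example numbers, but it is an interpretation rather than a statement from the library.

```python
# SD column copied from pca.loadings_stat_ in the example.
sd = [1.566624, 1.100453, 0.536973, 0.215297]

# Variance of each component, and each component's share of the total variance.
variances = [s ** 2 for s in sd]
total = sum(variances)
var_prop = [v / total for v in variances]

# Running sum of the shares gives the cumulative proportion.
cum_var_prop = []
running = 0.0
for p in var_prop:
    running += p
    cum_var_prop.append(running)

for name, p, c in zip(["Comp1", "Comp2", "Comp3", "Comp4"], var_prop, cum_var_prop):
    print(name, round(p, 4), round(c, 4))
```

With scaling=True and four variables, the total variance comes out at approximately 4, which is what one would expect when every variable has unit variance.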
Note
Variables cannot be scaled if any variable has a constant value across all data items.
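The reason is that scaling divides by each variable's spread, which is zero for a constant variable. The following pure-Python sketch illustrates this. It assumes the scale is the sample standard deviation, and the ValueError is illustrative only, since how PAL reports this case is not described here. The helper name scale_columns is hypothetical.

```python
import statistics


def scale_columns(columns):
    """Center each column and divide by its standard deviation.

    Hypothetical helper; raises ValueError for a constant column,
    because its standard deviation is zero.
    """
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumption)
        if sd == 0:
            raise ValueError(f"column {name!r} is constant and cannot be scaled")
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled


# X1 values from the first four rows of df1 scale without problems.
print(scale_columns({"X1": [12.0, 12.0, 12.0, 13.0]}))

# A constant column has zero spread, so scaling is rejected.
try:
    scale_columns({"C": [3.0, 3.0, 3.0]})
except ValueError as err:
    print(err)
```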
Methods
create_model_state([model, function, ...])  Create PAL model state.
delete_model_state([state])  Delete PAL model state.
fit(data[, key, features, label])  Principal component analysis fit function.
fit_transform(data[, key, features, ...])  Fit with the dataset and return the scores.
set_model_state(state)  Set the model state by state information.
transform(data[, key, features, ...])  Principal component analysis projection function using a trained model.
- fit(data, key=None, features=None, label=None)
Principal component analysis fit function.
- Parameters
- data: DataFrame
Data to be fitted.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- label: str, optional
Label of data.
- Returns
- A fitted 'PCA' object.
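The rule for key can be restated as a small decision function. This is a sketch of the documented behavior only, not the library's implementation. The function resolve_key and its index argument are hypothetical, with index standing in for the DataFrame's index (None, one column name, or a list of column names).

```python
def resolve_key(key, index):
    """Return the ID column name following the documented rule for `key`.

    Hypothetical helper: an explicit key wins, a single-column index is
    used as the default, and anything else requires an explicit key.
    """
    if key is not None:
        return key
    if isinstance(index, str):
        return index
    if isinstance(index, (list, tuple)) and len(index) == 1:
        return index[0]
    raise ValueError("key is mandatory when data has no index or a multi-column index")


print(resolve_key(None, "ID"))          # single index column is used
print(resolve_key("ID", ["A", "B"]))    # explicit key takes precedence
```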
- fit_transform(data, key=None, features=None, n_components=None, label=None)
Fit with the dataset and return the scores.
- Parameters
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- n_components: int, optional
Number of components to be retained.
The value range is from 1 to the number of features.
Defaults to the number of features.
- label: str, optional
Label of data.
- Returns
- DataFrame
Transformed variable values corresponding to each data point, structured as follows:
ID column, with same name and type as data's ID column.
SCORE columns, type DOUBLE, representing the component score values of each data point.
LABEL column, same as the label column in data, valid only when parameter label is set.
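The defaults for features and n_components described for fit_transform can be sketched as two small helpers. Both functions are hypothetical and only restate the documented rules. They are not part of hana_ml, and the ValueError is an illustrative choice.

```python
def resolve_features(columns, key, features=None, label=None):
    """Default features to all columns that are neither the ID nor the label."""
    if features is not None:
        return list(features)
    excluded = {key, label}
    return [c for c in columns if c not in excluded]


def resolve_n_components(n_components, n_features):
    """Default to the number of features; otherwise require 1 <= n <= n_features."""
    if n_components is None:
        return n_features
    if not 1 <= n_components <= n_features:
        raise ValueError(f"n_components must be between 1 and {n_features}")
    return n_components


# Column names taken from the example DataFrame df1.
features = resolve_features(["ID", "X1", "X2", "X3", "X4"], key="ID")
print(features, resolve_n_components(None, len(features)))
```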
- transform(data, key=None, features=None, n_components=None, label=None)
Principal component analysis projection function using a trained model.
- Parameters
- data: DataFrame
Data to be analyzed.
- key: str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- n_components: int, optional
Number of components to be retained.
The value range is from 1 to the number of features.
Defaults to the number of features.
- label: str, optional
Label of data.
- Returns
- DataFrame
Transformed variable values corresponding to each data point, structured as follows:
ID column, with same name and type as data's ID column.
SCORE columns, type DOUBLE, representing the component score values of each data point.
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for (CAT)PCA.
- pal_funcname: int or str, optional
PAL function name.
Defaults to self.pal_funcname.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100); must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000); the value corresponding to each NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
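The two accepted shapes of state carry the same information. The sketch below shows a dict with the four required keys and converts it into NAME/VALUE rows. The helper state_to_rows is hypothetical, and the values are placeholders rather than real state information, which would come from an existing model state.

```python
REQUIRED_NAMES = ("STATE_ID", "HINT", "HOST", "PORT")


def state_to_rows(state):
    """Convert a state dict into (NAME, VALUE) rows after checking the required keys.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    missing = [name for name in REQUIRED_NAMES if name not in state]
    if missing:
        raise KeyError(f"state is missing required names: {missing}")
    return [(name, str(state[name])) for name in REQUIRED_NAMES]


# Placeholder values for illustration only.
example_state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
for name, value in state_to_rows(example_state):
    print(name, value)
```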