SOM
- class hana_ml.algorithms.pal.som.SOM(covergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)
Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.
- Parameters
- convergence_criterion : float, optional
If the largest difference between successive maps is less than this value, the calculation is regarded as converged and SOM training stops.
Note that the class signature and the example below spell this keyword covergence_criterion.
Defaults to 1.0e-6.
- normalization : {'no', 'min-max', 'z-score'}, optional
Specifies the normalization type:
'no' : No normalization.
'min-max' : Min-max normalization, transforming to range [0.0, 1.0].
'z-score' : Z-score standardization.
Defaults to 'no'.
- random_seed : int, optional
Specifies the seed used for weight initialization:
-1 : Random.
0 : Sets every weight to zero.
Other value : Uses this value as the seed.
Defaults to -1.
- height_of_map : int, optional
Indicates the height of the map.
Defaults to 10.
- width_of_map : int, optional
Indicates the width of the map.
Defaults to 10.
- kernel_function : {'gaussian', 'flat'}, optional
Represents the neighborhood kernel function.
Defaults to 'gaussian'.
- alpha : float, optional
Specifies the learning rate value. The decay function applied to it is set separately by learning_rate.
Defaults to 0.5.
- learning_rate : {'exponential', 'linear'}, optional
Indicates the decay function for the learning rate.
Defaults to 'exponential'.
- shape_of_grid : {'rectangle', 'hexagon'}, optional
Indicates the shape of the grid.
Defaults to 'hexagon'.
- radius : float, optional
Specifies the scan radius.
Defaults to the larger of height_of_map and width_of_map.
- batch_som : {'classical', 'batch'}, optional
Indicates whether batch SOM is carried out. For batch SOM, kernel_function is always Gaussian, and the learning_rate factors take no effect.
Defaults to 'classical'.
- max_iter : int, optional
Maximum number of iterations.
Note that the training might not converge if this value is too small, for example, less than 1000.
Defaults to 1000 plus 500 times the number of neurons in the lattice.
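The derived defaults and the normalization options described above can be illustrated with a short pure-Python sketch. This is not the PAL implementation: the helper names resolve_defaults and normalize are hypothetical, the number of neurons is assumed to be height_of_map * width_of_map, and the z-score variant assumes the population standard deviation.

```python
import statistics


def resolve_defaults(height_of_map=None, width_of_map=None, radius=None, max_iter=None):
    """Hypothetical helper: fill in the documented defaults for unset values."""
    height = 10 if height_of_map is None else height_of_map
    width = 10 if width_of_map is None else width_of_map
    if radius is None:
        # Documented default: the larger of height_of_map and width_of_map.
        radius = float(max(height, width))
    if max_iter is None:
        # Documented default: 1000 plus 500 times the number of neurons.
        # Assumption: the lattice holds height * width neurons.
        max_iter = 1000 + 500 * height * width
    return {
        "height_of_map": height,
        "width_of_map": width,
        "radius": radius,
        "max_iter": max_iter,
    }


def normalize(values, method="no"):
    """Hypothetical helper: apply one of the documented normalization types
    to a single feature column given as a list of numbers."""
    values = [float(v) for v in values]
    if method == "no":
        return values
    if method == "min-max":
        low, high = min(values), max(values)
        span = high - low
        # A constant column is mapped to 0.0 to avoid dividing by zero.
        return [(v - low) / span if span else 0.0 for v in values]
    if method == "z-score":
        mean = statistics.mean(values)
        std = statistics.pstdev(values)  # assumption: population std
        return [(v - mean) / std if std else 0.0 for v in values]
    raise ValueError("unknown normalization type: %r" % method)
```

Under these assumptions, a 4 x 4 map with radius and max_iter left unset resolves to a radius of 4 and a max_iter of 9000.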
Examples
Input dataframe df for clustering:
>>> df.collect()
    TRANS_ID   V000   V001
0          0   0.10   0.20
1          1   0.22   0.25
2          2   0.30   0.40
...
18        18  55.30  50.40
19        19  50.40  56.50
Create SOM instance:
>>> som = SOM(covergence_criterion=1.0e-6, normalization='no', random_seed=1, height_of_map=4, width_of_map=4, kernel_function='gaussian', alpha=None, learning_rate='exponential', shape_of_grid='hexagon', radius=None, batch_som='classical', max_iter=4000)
Perform fit on the given data:
>>> som.fit(data=df, key='TRANS_ID')
Expected output:
>>> som.map_.collect().head(3)
   CLUSTER_ID  WEIGHT_V000  WEIGHT_V001  COUNT
0           0    52.837688    53.465327      2
1           1    50.150251    49.245226      2
2           2    18.597607    27.174590      0
>>> som.labels_.collect().head(3)
   TRANS_ID  BMU  DISTANCE  SECOND_BMU  IS_ADJACENT
0         0   15  0.342564          14            1
1         1   15  0.239676          14            1
2         2   15  0.073968          14            1
>>> som.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...
After we get the model, we can use it for prediction.
Input dataframe df2 for prediction:
>>> df2.collect()
   TRANS_ID  V000  V001
0        33   0.2  0.10
1        34   1.2   4.1
Perform predict on the given data:
>>> label = som.predict(data=df2, key='TRANS_ID')
Expected output:
>>> label.collect()
   TRANS_ID  CLUSTER_ID  DISTANCE
0        33          15  0.388460
1        34          11  0.156418
- Attributes
- map_ : DataFrame
The map after training. The structure is as follows:
1st column: CLUSTER_ID, int. Unit cell ID.
Other columns except the last one: the feature columns of the input data, each named with the prefix "WEIGHT_", float. Weight vectors used to simulate the original tuples.
Last column: COUNT, int. Number of original tuples that each unit cell contains.
- labels_ : DataFrame
The label of the input data. The structure is as follows:
1st column: ID column name of data, with the same data type.
2nd column: BMU, int. Best match unit (BMU).
3rd column: DISTANCE, float. The distance between the tuple and its BMU.
4th column: SECOND_BMU, int. Second BMU.
5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent.
\[\begin{split}\begin{cases} 0: &\text{Not adjacent}\\ 1: &\text{Adjacent} \end{cases}\end{split}\]
- model_ : DataFrame
The SOM model.
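The columns of the label table can be made concrete with a small pure-Python sketch that labels one data point against a trained map. It is only an illustration and rests on assumptions that the description above does not state: Euclidean distance, a rectangular grid, cell IDs numbered row by row, and adjacency meaning that two cells share an edge. The function name label_point is hypothetical.

```python
import math


def label_point(point, weights, width_of_map):
    """Return (BMU, DISTANCE, SECOND_BMU, IS_ADJACENT) for one data point.

    weights[i] is the weight vector of unit cell i. Assumptions: Euclidean
    distance, rectangular grid, row-by-row cell numbering, and edge-sharing
    adjacency. PAL may define any of these differently.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort cells by distance to the point; ties fall back to the lower cell ID.
    ranked = sorted((distance(point, w), cell) for cell, w in enumerate(weights))
    (bmu_distance, bmu), (_, second_bmu) = ranked[0], ranked[1]

    # Convert cell IDs to (row, column) positions on the grid.
    row_a, col_a = divmod(bmu, width_of_map)
    row_b, col_b = divmod(second_bmu, width_of_map)
    is_adjacent = int(abs(row_a - row_b) + abs(col_a - col_b) == 1)
    return bmu, bmu_distance, second_bmu, is_adjacent
```

For a 2 x 2 map with weights [[0, 0], [1, 0], [0, 1], [1, 1]], the point [0.1, 0.0] is closest to cell 0 and second closest to cell 1, and the two cells count as adjacent under the edge-sharing assumption.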
Methods
create_model_state([model, function, ...])        Create PAL model state.
delete_model_state([state])                       Delete PAL model state.
fit(data[, key, features, sql_trace_function])    Fit the SOM model when given the training dataset.
fit_predict(data[, key, features])                Fit the dataset and return the labels.
predict(data[, key, features])                    Assign clusters to data based on a fitted model.
set_model_state(state)                            Set the model state by state information.
- fit(data, key=None, features=None, sql_trace_function=None)
Fit the SOM model when given the training dataset.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- sql_trace_function : str, optional
Function name used as the reference name for SQL tracing.
- Returns
- A fitted object of class "SOM".
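The defaulting rules for key and features described above can be sketched in plain Python. The helper resolve_columns is hypothetical and works on a list of column names rather than a hana_ml DataFrame; it only mirrors the documented behavior and is not the library's own code.

```python
def resolve_columns(columns, key=None, index=None, features=None):
    """Hypothetical helper mirroring the documented key/features defaults.

    columns  : list of all column names in the data
    key      : explicit ID column, if given
    index    : index column of the data, if one is set
    features : explicit feature columns, if given
    """
    if key is None:
        if index is None:
            # No explicit key and no index column: the caller must supply key.
            raise ValueError("key must be provided when data has no index column")
        key = index
    if features is None:
        # Default: every column except the key.
        features = [c for c in columns if c != key]
    return key, list(features)
```

With the columns from the example, resolve_columns(['TRANS_ID', 'V000', 'V001'], key='TRANS_ID') selects V000 and V001 as features.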
- fit_predict(data, key=None, features=None)
Fit the dataset and return the labels.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- Returns
- DataFrame
The label of the given data. The structure is as follows:
1st column: ID column name of data, with the same data type.
2nd column: BMU, type INT. Best match unit (BMU).
3rd column: DISTANCE, type DOUBLE. The distance between the tuple and its BMU.
4th column: SECOND_BMU, type INT. Second BMU.
5th column: IS_ADJACENT, type INT. Indicates whether the BMU and the second BMU are adjacent.
\[\begin{split}\begin{cases} 0: &\text{ Not adjacent}\\ 1: &\text{ Adjacent} \end{cases}\end{split}\]
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters
- model : DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function : str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcname : int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, it will delete the existing state.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters
- data : DataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- Returns
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, type int, representing the cluster the data point is assigned to.
DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
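Because the result layouts of predict() and fit_predict() differ, code that consumes both may want to bring them to a common shape. The sketch below works on plain Python dictionaries rather than hana_ml DataFrames. The helper name to_predict_layout is hypothetical, and it assumes that CLUSTER_ID in the predict() result plays the same role as BMU in the fit_predict() result, which this page does not state explicitly.

```python
def to_predict_layout(rows, key):
    """Hypothetical helper: reduce fit_predict()-style rows to the
    three-column layout described for predict().

    rows : list of dicts with the key column, BMU, DISTANCE,
           SECOND_BMU and IS_ADJACENT
    key  : name of the ID column
    """
    return [
        {
            key: row[key],
            # Assumption: the BMU is what predict() reports as CLUSTER_ID.
            "CLUSTER_ID": row["BMU"],
            "DISTANCE": row["DISTANCE"],
        }
        for row in rows
    ]
```

SECOND_BMU and IS_ADJACENT have no counterpart in the predict() layout, so this sketch simply drops them.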
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
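When the state is passed as a dict, it can be worth checking the required keys before calling set_model_state(). The following is a minimal sketch of such a check; the helper missing_state_keys is hypothetical and not part of hana_ml.

```python
# Keys that the description above requires in the state information.
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Hypothetical helper: list the required keys absent from a state dict."""
    return [name for name in REQUIRED_STATE_KEYS if name not in state]
```

An empty result means that all four required keys are present. The check says nothing about whether the values themselves are valid.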