SOM
- class hana_ml.algorithms.pal.som.SOM(convergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)
Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.
- Parameters:
- convergence_criterion: float, optional
The maximum acceptable difference between successive maps. If the largest difference between two consecutive maps is smaller than this value, training is considered to have converged and stops.
Defaults to 1.0e-6.
- normalization: {'no', 'min-max', 'z-score'}, optional
Specifies the normalization type:
'no' : No normalization.
'min-max' : Min-max normalization, transforming to range [0.0, 1.0].
'z-score' : Z-score standardization.
Defaults to 'no'.
- random_seed: int, optional
The random seed parameter controls the initial randomness.
-1: Random
0: Sets every weight to zero
Other value: Uses this value as seed
Defaults to -1.
- height_of_map: int, optional
Indicates the height of the map.
Defaults to 10.
- width_of_map: int, optional
Indicates the width of the map.
Defaults to 10.
- kernel_function: {'gaussian', 'flat'}, optional
Represents the neighborhood kernel function.
Defaults to 'gaussian'.
- alpha: float, optional
Specifies the learning rate.
Defaults to 0.5.
- learning_rate: {'exponential', 'linear'}, optional
Indicates the decay function for learning rate.
Defaults to 'exponential'.
- shape_of_grid: {'rectangle', 'hexagon'}, optional
Indicates the shape of the grid.
Defaults to 'hexagon'.
- radius: float, optional
Specifies the scan radius.
Defaults to the larger of height_of_map and width_of_map.
- batch_som: {'classical', 'batch'}, optional
Indicates whether batch SOM is carried out. For batch SOM, kernel_function is always Gaussian, and the learning_rate factors have no effect.
Defaults to 'classical'.
- max_iter: int, optional
Sets the maximum number of iterations.
Note that the training might not converge if this value is too small, for example, less than 1000.
Defaults to 1000 plus 500 times the number of neurons in the lattice.
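The documented defaults and the convergence rule can be illustrated with a minimal pure-Python sketch. The helper names below are hypothetical and are not part of hana_ml. The sketch assumes that the number of neurons in the lattice equals height_of_map * width_of_map, and that the difference between successive maps is measured as the largest absolute change of any single weight; neither detail is confirmed by this page.

```python
def default_radius(height_of_map=10, width_of_map=10):
    """Documented default: the larger of the two map dimensions."""
    return max(height_of_map, width_of_map)


def default_max_iter(height_of_map=10, width_of_map=10):
    """Documented default: 1000 plus 500 times the number of neurons.

    Assumption: the lattice holds height_of_map * width_of_map neurons.
    """
    return 1000 + 500 * height_of_map * width_of_map


def min_max_normalize(values):
    """Min-max normalization of one feature column to the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def has_converged(previous_map, current_map, convergence_criterion=1.0e-6):
    """Compare two successive maps, each given as a list of weight vectors.

    Assumption: the gap is the largest absolute change of any single weight.
    """
    largest_gap = max(
        abs(new - old)
        for old_vec, new_vec in zip(previous_map, current_map)
        for old, new in zip(old_vec, new_vec)
    )
    return largest_gap < convergence_criterion
```

Under the stated assumption, a 4 x 4 map would get a default radius of 4 and a default max_iter of 9000.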
Examples
Input DataFrame df:
>>> df.collect()
    TRANS_ID   V000   V001
0          0   0.10   0.20
1          1   0.22   0.25
...
18        18  55.30  50.40
19        19  50.40  56.50
Create a SOM instance:
>>> from hana_ml.algorithms.pal.som import SOM
>>> som = SOM(convergence_criterion=1.0e-6, normalization='no',
...           random_seed=1, height_of_map=4, width_of_map=4,
...           kernel_function='gaussian', alpha=None,
...           learning_rate='exponential', shape_of_grid='hexagon',
...           batch_som='classical', max_iter=4000)
Perform fit():
>>> som.fit(data=df, key='TRANS_ID')
Output:
>>> som.map_.collect().head(3)
   CLUSTER_ID  WEIGHT_V000  WEIGHT_V001  COUNT
0           0    52.837688    53.465327      2
1           1    50.150251    49.245226      2
2           2    18.597607    27.174590      0
>>> som.labels_.collect().head(3)
   TRANS_ID  BMU  DISTANCE  SECOND_BMU  IS_ADJACENT
0         0   15  0.342564          14            1
1         1   15  0.239676          14            1
2         2   15  0.073968          14            1
>>> som.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...
After fitting, the model can be used to assign clusters to a new input DataFrame df2:
>>> df2.collect()
   TRANS_ID  V000  V001
0        33   0.2  0.10
1        34   1.2   4.1
Perform predict():
>>> label = som.predict(data=df2, key='TRANS_ID')
Output:
>>> label.collect()
   TRANS_ID  CLUSTER_ID  DISTANCE
0        33          15  0.388460
1        34          11  0.156418
- Attributes:
- map_: DataFrame
The map after training. The structure is as follows:
1st column: CLUSTER_ID, int. Unit cell ID.
Other columns except the last one: FEATURE (in input data) column with prefix "WEIGHT_", float. Weight vectors used to simulate the original tuples.
Last column: COUNT, int. Number of original tuples that every unit cell contains.
- labels_: DataFrame
The label of the input data. The structure is as follows:
1st column: ID column name of data, with the same data type.
2nd column: BMU, int. Best match unit (BMU).
3rd column: DISTANCE, float. The distance between the tuple and its BMU.
4th column: SECOND_BMU, int. Second BMU.
5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent.
\[\begin{split}\begin{cases} 0: &\text{Not adjacent}\\ 1: &\text{Adjacent} \end{cases}\end{split}\]
- model_: DataFrame
Model content.
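To make the label columns concrete, here is a minimal sketch of how BMU, DISTANCE, SECOND_BMU and IS_ADJACENT could be derived for a single tuple from a trained map. The function label_tuple is hypothetical and is not PAL's implementation. It assumes Euclidean distance, a rectangular grid whose cell IDs run in row-major order, and adjacency defined as sharing an edge; a hexagonal grid would need a different neighbor rule.

```python
import math


def label_tuple(point, weights, width_of_map):
    """Return (bmu, distance, second_bmu, is_adjacent) for one data point.

    Assumptions (not taken from the PAL documentation):
    - distances are Euclidean;
    - cell IDs index `weights` in row-major order on a rectangular grid;
    - two cells are adjacent when they share an edge.
    """
    # Rank all unit cells by their distance to the point.
    ranked = sorted(range(len(weights)), key=lambda c: math.dist(point, weights[c]))
    bmu, second_bmu = ranked[0], ranked[1]
    distance = math.dist(point, weights[bmu])

    # Convert cell IDs to (row, column) grid coordinates.
    row_a, col_a = divmod(bmu, width_of_map)
    row_b, col_b = divmod(second_bmu, width_of_map)
    is_adjacent = int(abs(row_a - row_b) + abs(col_a - col_b) == 1)
    return bmu, distance, second_bmu, is_adjacent


# Toy 2 x 2 map; the weight vectors are illustrative, not PAL output.
weights = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
bmu, distance, second_bmu, is_adjacent = label_tuple((0.1, 0.2), weights, width_of_map=2)
```

In this toy case the point lies closest to cell 0, its second-best cell is cell 2 directly below it, and the two cells count as adjacent.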
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, sql_trace_function]): Fit the model to the training dataset.
fit_predict(data[, key, features]): Fit the given data and return the labels.
predict(data[, key, features]): Assign clusters to data based on a fitted model.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, sql_trace_function=None)
Fit the model to the training dataset.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- sql_trace_function: str, optional
Function name used as the reference name in SQL tracing.
- Returns:
- A fitted object of class "SOM".
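The defaulting rules for key and features described above can be sketched in plain Python. The helper resolve_key_and_features is hypothetical and only mirrors the documented behavior; it is not how hana_ml implements it, and the error messages are invented for illustration.

```python
def resolve_key_and_features(columns, index=None, key=None, features=None):
    """Hypothetical helper mirroring the documented defaults of fit().

    columns -- column names of the input data, in order
    index   -- name of the index column of the data, or None if not set
    """
    if key is None:
        if index is None:
            raise ValueError("key must be given when the data has no index column")
        key = index
    if key not in columns:
        raise ValueError(f"unknown key column: {key!r}")
    if features is None:
        # Default: every column except the key.
        features = [c for c in columns if c != key]
    return key, features
```

For the example DataFrame with columns TRANS_ID, V000 and V001, passing key='TRANS_ID' would leave V000 and V001 as the features.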
- fit_predict(data, key=None, features=None)
Fit the given data and return the labels.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- Returns:
- DataFrame
The label of the given data. The structure is as follows:
1st column: ID column name of data, with the same data type.
2nd column: BMU, type INT. Best match unit (BMU).
3rd column: DISTANCE, type DOUBLE. The distance between the tuple and its BMU.
4th column: SECOND_BMU, type INT. Second BMU.
5th column: IS_ADJACENT, type INT. Indicates whether the BMU and the second BMU are adjacent.
\[\begin{split}\begin{cases} 0: &\text{Not adjacent}\\ 1: &\text{Adjacent} \end{cases}\end{split}\]
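Once the label rows are available locally, the five columns lend themselves to simple quality checks. The sketch below is not part of hana_ml: summarize_labels is a hypothetical helper that works on plain tuples in the column order listed above. It computes the number of tuples per BMU, the mean distance to the BMU (often called quantization error), and the share of tuples whose BMU and second BMU are not adjacent (often called topographic error).

```python
from collections import Counter


def summarize_labels(rows):
    """Summarize rows shaped like (ID, BMU, DISTANCE, SECOND_BMU, IS_ADJACENT)."""
    hits_per_cell = Counter(bmu for _, bmu, _, _, _ in rows)
    mean_distance = sum(dist for _, _, dist, _, _ in rows) / len(rows)
    # Topographic error: share of tuples whose two best units are not adjacent.
    topographic_error = sum(1 for row in rows if row[4] == 0) / len(rows)
    return hits_per_cell, mean_distance, topographic_error


# Illustrative rows, not actual PAL output.
rows = [
    (0, 15, 0.25, 14, 1),
    (1, 15, 0.75, 14, 1),
    (2, 3, 0.5, 9, 0),
    (3, 3, 0.5, 2, 1),
]
hits, mean_distance, topographic_error = summarize_labels(rows)
```

A low topographic error suggests that the map preserves the neighborhood structure of the data; what counts as low depends on the data set.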
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcname: int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters:
- data: DataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- key: str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID columns.
- Returns:
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, type int, representing the cluster the data point is assigned to.
DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
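Because predict() returns three columns while fit_predict() returns five, code that consumes both may want a common shape. The following sketch reduces fit_predict()-style rows to the predict() layout. The helper to_predict_shape is hypothetical, and it assumes that the BMU of a tuple is the cluster that tuple would be assigned to, which this page does not state explicitly.

```python
def to_predict_shape(fit_predict_rows):
    """Reduce (ID, BMU, DISTANCE, SECOND_BMU, IS_ADJACENT) rows to
    (ID, CLUSTER_ID, DISTANCE) rows.

    Assumption: the BMU of a tuple is the cluster it would be assigned to.
    """
    return [(row_id, bmu, distance) for row_id, bmu, distance, _, _ in fit_predict_rows]


# Illustrative rows, not actual PAL output.
reduced = to_predict_shape([(0, 15, 0.25, 14, 1), (1, 11, 0.5, 10, 0)])
```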
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
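Before passing a dict to set_model_state(), it can be useful to check that the four required keys are present. The sketch below is a hypothetical pre-check, not something hana_ml provides, and the placeholder values are for illustration only.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {"STATE_ID": "<state id>", "HINT": "<hint>", "HOST": "<host>", "PORT": "<port>"}
missing = missing_state_keys(state)
```

An empty result means the dict carries every key this section lists as required.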
Inherited Methods from PALBase
Besides the methods described above, the SOM class also inherits methods from the PALBase class. Please refer to PAL Base for more details.