SOM

class hana_ml.algorithms.pal.som.SOM(convergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.

Parameters:
convergence_criterionfloat, optional

The convergence criterion is the maximum acceptable difference between successive maps. If the largest difference between two consecutive maps is smaller than this value, training is considered to have converged and the SOM is complete.

Defaults to 1.0e-6.
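The check described above can be sketched in plain Python. This is only an illustration: the helper name has_converged is hypothetical, and measuring the gap as the largest absolute element-wise weight change is an assumption, since the exact measure used by PAL is not stated here.

```python
def has_converged(previous_map, current_map, convergence_criterion=1.0e-6):
    """Return True when the largest weight change between two maps is below the threshold.

    Each map is a list of weight vectors, one per unit cell.
    Assumption: the gap is the largest absolute element-wise difference.
    """
    largest_gap = max(
        abs(new - old)
        for prev_weights, curr_weights in zip(previous_map, current_map)
        for old, new in zip(prev_weights, curr_weights)
    )
    return largest_gap < convergence_criterion


# Two successive maps with two unit cells and two features each.
previous_map = [[0.10, 0.20], [0.30, 0.40]]
current_map = [[0.10, 0.20], [0.30, 0.4000005]]
converged = has_converged(previous_map, current_map)
```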

normalization{'no', 'min-max', 'z-score'}, optional

Specifies the normalization type:

  • 'no' : No normalization.

  • 'min-max' : Min-max normalization, transforming to range [0.0, 1.0].

  • 'z-score' : Z-score standardization.

Defaults to 'no'.
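A minimal sketch of the two normalization types applied to a single feature column. The helper names min_max and z_score are hypothetical, and the use of the population standard deviation for the z-score is an assumption; the variant used by PAL is not stated here.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on the mean and divide by the standard deviation.

    Assumption: population standard deviation (pstdev).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


column = [1.0, 2.0, 3.0]
scaled = min_max(column)
standardized = z_score(column)
```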

random_seedint, optional

The random seed parameter controls the initial randomness.

  • -1: Random

  • 0: Sets every weight to zero

  • Other value: Uses this value as seed

Defaults to -1.

height_of_mapint, optional

Indicates the height of the map.

Defaults to 10.

width_of_mapint, optional

Indicates the width of the map.

Defaults to 10.

kernel_function{'gaussian', 'flat'}, optional

Represents the neighborhood kernel function.

Defaults to 'gaussian'.
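For orientation, the two kernel types can be sketched with their common textbook forms. These formulas are assumptions for illustration, not the PAL implementation; in particular, using the scan radius as the width of the Gaussian is an assumption, and both function names are hypothetical.

```python
import math


def gaussian_kernel(grid_distance, radius):
    """Textbook Gaussian neighborhood: influence decays smoothly with grid distance."""
    return math.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))


def flat_kernel(grid_distance, radius):
    """Textbook flat (bubble) neighborhood: full influence inside the radius, none outside."""
    return 1.0 if grid_distance <= radius else 0.0


# Influence of the winning cell on itself and on cells one and three steps away.
gaussian_weights = [gaussian_kernel(d, radius=2.0) for d in (0.0, 1.0, 3.0)]
flat_weights = [flat_kernel(d, radius=2.0) for d in (0.0, 1.0, 3.0)]
```

With the Gaussian form every cell receives some update that shrinks with distance, while the flat form updates only cells within the radius, all by the same amount.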

alphafloat, optional

Specifies the learning rate.

Defaults to 0.5.

learning_rate{'exponential', 'linear'}, optional

Indicates the decay function for learning rate.

Defaults to 'exponential'.
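The following sketch shows how the initial learning rate alpha might shrink over the iterations under each decay type. The formulas are common textbook choices and are assumptions; the exact decay schedule used by PAL is not stated here, and decayed_alpha is a hypothetical helper.

```python
import math


def decayed_alpha(alpha, iteration, max_iter, learning_rate="exponential"):
    """Return the learning rate at a given iteration.

    Assumed textbook schedules, not the PAL formulas:
      exponential: alpha * exp(-iteration / max_iter)
      linear:      alpha * (1 - iteration / max_iter)
    """
    progress = iteration / max_iter
    if learning_rate == "exponential":
        return alpha * math.exp(-progress)
    if learning_rate == "linear":
        return alpha * (1.0 - progress)
    raise ValueError("learning_rate must be 'exponential' or 'linear'")


halfway_linear = decayed_alpha(0.5, 500, 1000, learning_rate="linear")
halfway_exponential = decayed_alpha(0.5, 500, 1000, learning_rate="exponential")
```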

shape_of_grid{'rectangle', 'hexagon'}, optional

Indicates the shape of the grid.

Defaults to 'hexagon'.

radiusfloat, optional

Specifies the scan radius.

Defaults to the larger of height_of_map and width_of_map.

batch_som{'classical', 'batch'}, optional

Indicates whether batch SOM is carried out. For batch SOM, kernel_function is always Gaussian, and the learning_rate factors have no effect.

Defaults to 'classical'.

max_iterint, optional

Sets the maximum number of iterations.

Note that the training might not converge if this value is too small, for example, less than 1000.

Defaults to 1000 plus 500 times the number of neurons in the lattice.
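The size-dependent defaults documented above (radius and max_iter) can be worked out as follows. The helper resolve_defaults is hypothetical and exists only to show the arithmetic; it assumes the number of neurons equals height_of_map times width_of_map.

```python
def resolve_defaults(height_of_map=None, width_of_map=None, radius=None, max_iter=None):
    """Fill in the documented defaults that depend on the map size.

    Assumption: the lattice has height_of_map * width_of_map neurons.
    """
    height = 10 if height_of_map is None else height_of_map
    width = 10 if width_of_map is None else width_of_map
    neurons = height * width
    return {
        "height_of_map": height,
        "width_of_map": width,
        # Larger of the two map dimensions.
        "radius": float(max(height, width)) if radius is None else radius,
        # 1000 plus 500 times the number of neurons.
        "max_iter": 1000 + 500 * neurons if max_iter is None else max_iter,
    }


default_map = resolve_defaults()                                   # 10 x 10 map
small_map = resolve_defaults(height_of_map=4, width_of_map=4)      # 4 x 4 map
```

Under this reading, the default 10 x 10 map gives max_iter = 1000 + 500 * 100 = 51000, and a 4 x 4 map gives 9000.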

Examples

Input dataframe df:

>>> df.collect()
    TRANS_ID    V000    V001
0          0    0.10    0.20
1          1    0.22    0.25
2          2    0.30    0.40
...
18        18   55.30   50.40
19        19   50.40   56.50

Create a SOM instance:

>>> som = SOM(convergence_criterion=1.0e-6, normalization='no',
              random_seed=1, height_of_map=4, width_of_map=4,
              kernel_function='gaussian', alpha=None,
              learning_rate='exponential', shape_of_grid='hexagon',
              radius=None, batch_som='classical', max_iter=4000)

Perform fit():

>>> som.fit(data=df, key='TRANS_ID')

Output:

>>> som.map_.collect().head(3)
   CLUSTER_ID   WEIGHT_V000    WEIGHT_V001    COUNT
0           0     52.837688      53.465327        2
1           1     50.150251      49.245226        2
2           2     18.597607      27.174590        0
>>> som.labels_.collect().head(3)
   TRANS_ID    BMU       DISTANCE    SECOND_BMU  IS_ADJACENT
0         0     15       0.342564            14            1
1         1     15       0.239676            14            1
2         2     15       0.073968            14            1
>>> som.model_.collect()
    ROW_INDEX                                                 MODEL_CONTENT
0           0             {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...

After obtaining the model, we can use it for prediction. Input dataframe df2 for prediction:

>>> df2.collect()
    TRANS_ID    V000    V001
0         33     0.2    0.10
1         34     1.2     4.1

Perform predict() on the given data:

>>> label = som.predict(data=df2, key='TRANS_ID')

Output:

>>> label.collect()
    TRANS_ID    CLUSTER_ID     DISTANCE
0         33            15     0.388460
1         34            11     0.156418
Attributes:
map_DataFrame

The map after training. The structure is as follows:

  • 1st column: CLUSTER_ID, int. Unit cell ID.

  • Other columns except the last one: FEATURE (in input data) column with prefix "WEIGHT_", float. Weight vectors used to simulate the original tuples.

  • Last column: COUNT, int. Number of original tuples that every unit cell contains.

labels_DataFrame

The labels of the input data. The structure is as follows:

  • 1st column: ID column of data, with the same name and data type.

  • 2nd column: BMU, int. Best match unit (BMU).

  • 3rd column: DISTANCE, float. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, int. Second BMU.

  • 5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent.

    \[\begin{split}\begin{cases} 0: &\text{Not adjacent}\\ 1: &\text{Adjacent} \end{cases}\end{split}\]
model_DataFrame

The model.
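To make the BMU, SECOND_BMU, DISTANCE and IS_ADJACENT columns described above concrete, here is a small sketch that labels one data point against a set of cell weights. It rests on several assumptions that are not taken from PAL: Euclidean distance, a rectangular grid with cells numbered row by row, and adjacency meaning a shared edge. The default hexagonal grid has a different neighbor structure, and label_point is a hypothetical helper.

```python
import math


def label_point(point, weights, width):
    """Find the best and second-best matching units for one data point.

    weights[i] is the weight vector of cell i. Assumptions: Euclidean
    distance, rectangular grid, cell IDs assigned row by row.
    """
    distances = [math.dist(point, w) for w in weights]
    order = sorted(range(len(weights)), key=distances.__getitem__)
    bmu, second_bmu = order[0], order[1]
    # Convert cell IDs to (row, column) grid coordinates.
    row_a, col_a = divmod(bmu, width)
    row_b, col_b = divmod(second_bmu, width)
    # Cells sharing an edge are exactly one grid step apart.
    is_adjacent = int(abs(row_a - row_b) + abs(col_a - col_b) == 1)
    return {
        "BMU": bmu,
        "DISTANCE": distances[bmu],
        "SECOND_BMU": second_bmu,
        "IS_ADJACENT": is_adjacent,
    }


# A 2 x 2 map: cells 0 and 1 form the first row, cells 2 and 3 the second.
weights = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
label = label_point((0.1, 0.0), weights, width=2)
```

An IS_ADJACENT value of 0 means the two closest cells are not neighbors on the grid, which can hint at a locally distorted map.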

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, sql_trace_function])

Fit a SOM model on the given data.

fit_predict(data[, key, features])

Fit the given data and return the labels.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, sql_trace_function=None)

Fit a SOM model on the given data.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.

featureslist of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

sql_trace_function: str, optional

Function name used as the reference for SQL tracing.

Returns:
A fitted object of class "SOM".
fit_predict(data, key=None, features=None)

Fit the given data and return the labels.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.

featureslist of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame

The labels of the given data. The structure is as follows:

  • 1st column: ID column name of data, with the same data type.

  • 2nd column: BMU, type INT. Best match unit (BMU).

  • 3rd column: DISTANCE, type DOUBLE. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, type INT. Second BMU.

  • 5th column: IS_ADJACENT, type INT. Indicates whether the BMU and the second BMU are adjacent.

    \[\begin{split}\begin{cases} 0: &\text{Not adjacent}\\ 1: &\text{Adjacent} \end{cases}\end{split}\]
create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state is deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.

featureslist of str, optional

Names of feature columns.

If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INT, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
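Before passing a dict to set_model_state(), the required entries can be checked with a small helper like the one below. The helper missing_state_keys is hypothetical and not part of hana_ml, and the values in the example dict are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {
    "STATE_ID": "example-state-id",
    "HINT": "example-hint",
    "HOST": "example-host",
    "PORT": "example-port",
}
missing = missing_state_keys(state)
```

An empty result means all four required entries are present; any returned names should be supplied before calling set_model_state(state).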

Inherited Methods from PALBase

Besides the methods mentioned above, the SOM class also inherits methods from the PALBase class. Please refer to PAL Base for more details.