T-distributed Stochastic Neighbour Embedding

hanaml.TSNE is an R wrapper for SAP HANA PAL T-distributed Stochastic Neighbour Embedding.

hanaml.TSNE(
  data = NULL,
  key = NULL,
  features = NULL,
  max.iter = NULL,
  obj.freq = NULL,
  dim = NULL,
  learning.rate = NULL,
  theta = NULL,
  perplexity = NULL,
  exaggeration = NULL,
  random.state = NULL,
  thread.ratio = NULL
)

Arguments

data: DataFrame
DataFrame containing the data.
key: character, optional
Name of the ID column.
Defaults to the first column if not provided.
features: character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.
max.iter: integer, optional
Specifies the maximum number of iterations for the optimization process.
Defaults to 250.
obj.freq: integer, optional
Specifies the frequency of calculating the objective function and writing the result to the OBJECTIVES table.
Defaults to 50.
dim: integer, optional
Dimension of the embedded space. Only 2 and 3 are valid values.
Defaults to 2.
learning.rate: double, optional
Specifies the learning rate.
Defaults to 200.0.
theta: double, optional
Must be between 0.0 and 1.0. Setting it to 0.0 selects the “exact” method, which runs in O(N^2) time; otherwise TSNE employs the Barnes-Hut approximation, which runs in O(N log N) time. For the Barnes-Hut approximation, this value is a trade-off between accuracy and training speed: higher values train faster.
Defaults to 0.5.
perplexity: double, optional
The perplexity is related to the number of nearest neighbors. Larger values are suitable for larger datasets. Make sure that perplexity * 3 < [number of samples].
Defaults to 30.0.
exaggeration: double, optional
Specifies the value by which pij is multiplied during the first 250 iterations. With larger values, the natural clusters are more widely separated, leaving more empty space on the map.
Defaults to 12.0.
random.state: double, optional
Specifies the seed for random number generation, where 0 means the current system time is used as the seed, and any other value is used directly as the seed.
Defaults to 0.
thread.ratio: double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.
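
The constraints stated above for dim, theta, perplexity and thread.ratio can be checked locally before calling hanaml.TSNE. The following is a minimal sketch in base R. The helper name check_tsne_args and its n.samples argument are hypothetical and not part of the package, and PAL may apply its own validation on the server side.

```r
# Hypothetical helper (not part of hana.ml.r): checks the constraints stated
# in the argument descriptions and returns a character vector of problems.
check_tsne_args <- function(n.samples, dim = 2, theta = 0.5,
                            perplexity = 30, thread.ratio = 0) {
  problems <- character(0)
  if (!(dim %in% c(2, 3))) {
    problems <- c(problems, "dim must be 2 or 3")
  }
  if (theta < 0 || theta > 1) {
    problems <- c(problems, "theta must be between 0.0 and 1.0")
  }
  if (perplexity * 3 >= n.samples) {
    problems <- c(problems,
                  "perplexity * 3 must be less than the number of samples")
  }
  if (thread.ratio < 0 || thread.ratio > 1) {
    problems <- c(problems, "thread.ratio outside [0, 1] is ignored")
  }
  problems
}

# Six samples with perplexity = 1: no problems reported.
check_tsne_args(n.samples = 6, dim = 3, theta = 0, perplexity = 1)

# Six samples with the default perplexity of 30 violate perplexity * 3 < N.
check_tsne_args(n.samples = 6)
```

Returning a vector of messages rather than stopping at the first failure makes it easy to report every violated constraint at once.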

Value

Returns a list of DataFrames:

DataFrame 1
Coordinates of the points in the low-dimensional embedded space, structured as follows:
- ID: ID (correspond to input table).
- x: Coordinate value of the 1st dimension.
- y: Coordinate value of the 2nd dimension.
- z: Coordinate value of the 3rd dimension (NULL if the embedded space is two-dimensional).
DataFrame 2
Statistical information, structured as follows:
- STAT_NAME: Statistics name.
- STAT_VALUE: Statistics value.
DataFrame 3
Recorded values of the objective function for TSNE, structured as follows:
- ITER: Iteration step.
- OBJ_VALUE: Objective value of the iteration.

Details

This algorithm prepares data for visualization with the TSNE method. It embeds the high-dimensional rows of the input data in a two- or three-dimensional space, as selected by dim, and returns the embedded coordinates.

Examples

Input DataFrame data:


> data$Collect()
  ID ATT1 ATT2 ATT3 ATT4 ATT5
1  1    1    2  -10  -20    3
2  2    4    5  -30  -10    6
3  3    7    8  -40  -50    9
4  4   10   11  -25  -15   12
5  5   13   14  -12  -24   15
6  6   16   17   -9  -13   18

Call the function:


> results <- hanaml.TSNE(data = data, perplexity = 1, max.iter = 500,
                         dim = 3, theta = 0, obj.freq = 50, random.state = 30)

Results:


> results[[1]]$Collect()
  ID          x         y         z
1  1   4.875853 -189.0905 -229.5364
2  2 -67.675459  213.6617  178.3976
3  3 -68.852910  162.7109  284.9663
4  4 -68.056108  193.1181  220.2754
5  5  76.524624 -189.8509 -227.6257
6  6 123.184000 -190.5492 -226.4772
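
Once collected, the embedding can be inspected with ordinary base R tools. As a rough sketch, the coordinates printed above are copied into a local data frame and grouped by distance. The use of single-linkage clustering and the choice of two groups are assumptions made here for illustration; they are not part of hanaml.TSNE.

```r
# Embedding copied from the results[[1]]$Collect() output above.
emb <- data.frame(
  ID = 1:6,
  x = c(4.875853, -67.675459, -68.852910, -68.056108, 76.524624, 123.184000),
  y = c(-189.0905, 213.6617, 162.7109, 193.1181, -189.8509, -190.5492),
  z = c(-229.5364, 178.3976, 284.9663, 220.2754, -227.6257, -226.4772)
)

# Pairwise Euclidean distances between the embedded points.
d <- dist(emb[, c("x", "y", "z")])

# Single-linkage hierarchical clustering, cut into two groups (an assumption
# made for illustration; t-SNE itself does not assign clusters).
groups <- cutree(hclust(d, method = "single"), k = 2)

# List the IDs that fall into each group.
split(emb$ID, groups)
```

For this embedding, IDs 1, 5 and 6 fall into one group and IDs 2, 3 and 4 into the other, which matches the visible separation in the y and z coordinates.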