hanaml.TSNE.Rd
hanaml.TSNE is an R wrapper for SAP HANA PAL T-distributed Stochastic Neighbour Embedding.
hanaml.TSNE(
data = NULL,
key = NULL,
features = NULL,
max.iter = NULL,
obj.freq = NULL,
dim = NULL,
learning.rate = NULL,
theta = NULL,
perplexity = NULL,
exaggeration = NULL,
random.state = NULL,
thread.ratio = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
Defaults to the first column if not provided.
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
integer, optional
Specifies the maximum number of iterations for the optimization
process.
Defaults to 250.
integer, optional
Specifies the frequency of calculating the objective
function and writing the result to the OBJECTIVES table.
Defaults to 50.
integer, optional
Dimension of the embedded space. Only 2 and 3 are
valid values.
Defaults to 2.
double, optional
Specifies the learning rate.
Defaults to 200.0.
double, optional
The legal value should be between 0.0 and 1.0. Setting it to 0.0
means using the “exact” method,
which runs in O(N^2) time; otherwise TSNE employs the
Barnes-Hut approximation, which runs in O(N log N) time.
This value is a tradeoff between accuracy and training speed
for the Barnes-Hut approximation. Training is faster
with a higher value.
Defaults to 0.5.
double, optional
The perplexity is related to the number of nearest neighbors.
A larger value is suitable for a larger dataset.
Make sure perplexity * 3 < [number of samples].
Defaults to 30.0.
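As a quick sanity check of the constraint above, the following base R sketch tests whether a chosen perplexity fits a given number of rows. The helper name check_perplexity is hypothetical and not part of hanaml, and the row count is passed in as a plain number rather than read from a HANA DataFrame.

```r
# Hypothetical helper (not part of hanaml): returns TRUE when the
# documented constraint perplexity * 3 < number of samples holds.
check_perplexity <- function(perplexity, n_samples) {
  perplexity * 3 < n_samples
}

# With 6 rows, perplexity = 1 satisfies the constraint (3 < 6),
# while the default of 30 does not (90 < 6 is false).
ok_small   <- check_perplexity(1, 6)
ok_default <- check_perplexity(30, 6)
```

For a small dataset, this suggests choosing a perplexity well below the default before calling the function.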
double, optional
Natural clusters are more separated with a larger value,
which means there is more empty space on the map.
It specifies the value by which pij is multiplied during the first 250
iterations.
Defaults to 12.0.
double, optional
Specifies the seed for random number generation, where 0 means
the current system time is used as the seed, and other values are used
directly as the seed.
Defaults to 0.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
Returns a list of DataFrames
DataFrame 1
Points in the low-dimensional embedded space, structured as follows:
ID: ID (corresponds to the input table).
x: Coordinate value of the 1st dimension.
y: Coordinate value of the 2nd dimension.
z: Coordinate value of the 3rd dimension (NULL if the embedded space is two-dimensional).
DataFrame 2
Statistical info, structured as follows:
STAT_NAME: Statistics name.
STAT_VALUE: Statistics value.
DataFrame 3
Recorded values of the objective function for TSNE, structured as follows:
ITER: Iteration step.
OBJ_VALUE: Objective value of the iteration.
This algorithm prepares data for visualization with the TSNE method. It returns a list of DataFrames, the first of which holds the two- or three-dimensional embeddings of the high-dimensional rows of the input data.
> data$Collect()
ID ATT1 ATT2 ATT3 ATT4 ATT5
1 1 1 2 -10 -20 3
2 2 4 5 -30 -10 6
3 3 7 8 -40 -50 9
4 4 10 11 -25 -15 12
5 5 13 14 -12 -24 15
6 6 16 17 -9 -13 18
Call the function:
> results <- hanaml.TSNE(data = data, perplexity = 1, max.iter = 500,
dim = 3, theta = 0, obj.freq = 50, random.state = 30)
Results:
> results[[1]]$Collect()
ID x y z
1 1 4.875853 -189.0905 -229.5364
2 2 -67.675459 213.6617 178.3976
3 3 -68.852910 162.7109 284.9663
4 4 -68.056108 193.1181 220.2754
5 5 76.524624 -189.8509 -227.6257
6 6 123.184000 -190.5492 -226.4772
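To inspect which points ended up close together, the collected embedding can be examined locally with base R. The sketch below re-enters the values printed above as a plain data.frame (the variable names emb and d are illustrative) and computes pairwise Euclidean distances; with a live connection, the same data.frame would come from results[[1]]$Collect().

```r
# Embedding values copied from the printed result above.
emb <- data.frame(
  ID = 1:6,
  x  = c(4.875853, -67.675459, -68.852910, -68.056108, 76.524624, 123.184000),
  y  = c(-189.0905, 213.6617, 162.7109, 193.1181, -189.8509, -190.5492),
  z  = c(-229.5364, 178.3976, 284.9663, 220.2754, -227.6257, -226.4772)
)

# Pairwise Euclidean distances between the embedded points.
# Rows and columns of d follow the row order of emb (IDs 1 to 6).
d <- as.matrix(dist(emb[, c("x", "y", "z")]))
round(d, 1)
```

Small entries in d mark points that the embedding places near each other, which is what a scatter plot of the x, y, and z columns would show visually.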