hanaml.LatentDirichletAllocation.Rd
hanaml.LatentDirichletAllocation is an R wrapper for SAP HANA PAL Latent Dirichlet Allocation.
hanaml.LatentDirichletAllocation(
data = NULL,
key = NULL,
document = NULL,
n.components = NULL,
doc.topic.prior = NULL,
topic.word.prior = NULL,
burn.in = NULL,
iteration = NULL,
thin = NULL,
seed = NULL,
max.top.words = NULL,
threshold.top.words = NULL,
gibbs.init = NULL,
delimiters = NULL,
output.word.assignment = NULL
)
DataFrame
DataFrame containing the data.
character
Name of the ID column.
character, optional
Name of the document column.
Defaults to the first non-ID column.
integer
Expected number of topics in the corpus.
double, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n.components.
double, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.
integer, optional
Number of omitted Gibbs iterations at the beginning.
Defaults to 0.
integer, optional
Number of Gibbs iterations.
Defaults to 2000.
integer, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
integer, optional
Indicates the seed used to initialize the random number generator.
0
: uses the system time
Not 0
: uses the specified seed
Defaults to 0.
integer, optional
Specifies the maximum number of words to be output for each topic.
Only valid when the topic top words output table is provided.
It cannot be used together with parameter threshold.top.words.
Defaults to 0.
double, optional
The algorithm outputs top words for each topic if the probability is
larger than this threshold.
Only valid when the topic top words output table is provided.
It cannot be used together with parameter max.top.words.
Defaults to 0.
character, optional
Specifies the initialization method for Gibbs sampling:
'uniform': Assigns each word in each document a topic drawn from a uniform
distribution, i.e. each topic has the same probability of being assigned to
each word.
'gibbs': Initialization by Gibbs sampling. Assigns each word in each document a
topic by one round of Gibbs sampling, using the prior document-topic and
topic-word distributions given by doc.topic.prior and topic.word.prior.
Defaults to 'uniform'.
list of characters, optional
Specifies the delimiters used to separate words in a document
(see the sketch after this parameter list).
For example, if the words are separated by ',' or ':', then the delimiters
should be "," or ":".
Defaults to ' ' (single space).
logical, optional
Controls whether to output the word-topic assignment or not.
Note that if this parameter is set to TRUE, the procedure takes more time
to return because it also writes the word-topic assignment table.
Defaults to FALSE.
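As a quick, non-authoritative illustration of the delimiters parameter, the sketch below calls the function on a hypothetical DataFrame df whose "ID" and "TEXT" column names (and all parameter values) are assumptions, treating both commas and semicolons as word separators:
LDA <- hanaml.LatentDirichletAllocation(data = df, key = "ID",
                                        document = "TEXT", n.components = 4,
                                        delimiters = list(",", ";"),
                                        seed = 1)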
An R6 object of class "LatentDirichletAllocation" with the following attributes and methods:
Attributes
doc.topic.dist: DataFrame
Document-topic distribution table, structured as follows:
Document ID column
: with same name and type as data's
document ID column.
TOPIC_ID
: type INTEGER, topic ID.
PROBABILITY
: type DOUBLE, probability of topic given document.
word.topic.assignment: DataFrame
Word-topic assignment table, structured as follows:
Document ID column
: with same name and type as data's
document ID column.
WORD_ID
: type INTEGER, word ID.
TOPIC_ID
: type INTEGER, topic ID.
topic.top.words: DataFrame
Topic top words table, structured as follows:
TOPIC_ID
: type INTEGER, topic ID.
WORDS
: type NVARCHAR(5000), topic top words separated by
spaces.
topic.word.dist: DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID
: type INTEGER, topic ID.
WORD_ID
: type INTEGER, word ID.
PROBABILITY
: type DOUBLE, probability of word given topic.
dictionary: DataFrame
Dictionary table, structured as follows:
WORD_ID
: type INTEGER, word ID.
WORD
: type NVARCHAR(5000), word text.
statistics: DataFrame
Statistics table, structured as follows:
STAT_NAME
: type NVARCHAR(256), statistic name.
STAT_VALUE
: type NVARCHAR(1000), statistic value.
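Assuming a fitted object LDA returned by the function (the variable name itself is an assumption), each of these attributes is a DataFrame and can be fetched as a local data frame with Collect(), for example:
> LDA$doc.topic.dist$Collect()
> LDA$dictionary$Collect()
> LDA$statistics$Collect()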
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> lda <- hanaml.LatentDirichletAllocation(data=df, key="ID")
> lda$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info
for model shall be assigned to the corresponding R6 object.
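As a hedged follow-up (assuming the call in the usage above succeeded), the assigned state attribute can be inspected like any other DataFrame:
> lda$state$Collect()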
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml
model and created its model state, like the following:
> lda <- hanaml.LatentDirichletAllocation(data=df, key="ID")
> lda$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> lda$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state shall be cleaned up and associated memory be released.
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
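The PAL procedure estimates this model with Gibbs sampling (see burn.in, iteration, and thin above). Purely as a non-authoritative illustration of the generative view, the base-R sketch below simulates one document; all sizes, the seed, and the variable names are assumptions, and Dirichlet draws are emulated with normalized gamma samples:
set.seed(1)
K <- 3          # number of topics (assumption)
V <- 10         # vocabulary size (assumption)
alpha <- 0.1    # document-topic prior weight, cf. doc.topic.prior
beta  <- 0.1    # topic-word prior weight, cf. topic.word.prior

# Dirichlet draw via normalized gamma samples.
rdirichlet1 <- function(n, a) { g <- rgamma(n, shape = a); g / sum(g) }

# One topic-word distribution per topic (rows), over the vocabulary (columns).
phi <- t(sapply(seq_len(K), function(k) rdirichlet1(V, beta)))

# Generate one document: topic proportions, then a topic and a word per position.
theta  <- rdirichlet1(K, alpha)
topics <- sample.int(K, size = 8, replace = TRUE, prob = theta)
words  <- sapply(topics, function(k) sample.int(V, size = 1, prob = phi[k, ]))
words  # word IDs of the simulated document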
Input DataFrame data:
> data$Collect()
DOCUMENT_ID  TEXT
         10  cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
         20  tires mountainbike wheels valve helmet mountainbike rearfender tires mountainbike mountainbike
         30  carseat toy strollers toy toy spoon toy strollers toy carseat
         40  sweaters sweaters sweaters boots sweaters rings vest vest shoe sweaters
Call the function:
LDA <- hanaml.LatentDirichletAllocation(data, key = "DOCUMENT_ID",
                                        document = "TEXT", n.components = 6,
                                        doc.topic.prior = 0.1, burn.in = 50,
                                        iteration = 100, thin = 10, seed = 1,
                                        max.top.words = 5,
                                        output.word.assignment = TRUE)
Output:
> LDA$doc.topic.dist$Collect()
DOCUMENT_ID TOPIC_ID PROBABILITY
1 10 0 0.010416667
2 10 1 0.010416667
3 10 2 0.010416667
4 10 3 0.010416667
......
20 40 1 0.009433962
21 40 2 0.952830189
22 40 3 0.009433962
23 40 4 0.009433962
24 40 5 0.009433962
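The remaining result tables can be inspected the same way; for example (output omitted here), the top words per topic and, since output.word.assignment = TRUE above, the word-topic assignments:
> LDA$topic.top.words$Collect()
> LDA$word.topic.assignment$Collect()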