Latent Dirichlet Allocation

hanaml.LatentDirichletAllocation is an R wrapper for SAP HANA PAL Latent Dirichlet Allocation.

hanaml.LatentDirichletAllocation(
  data = NULL,
  key = NULL,
  document = NULL,
  n.components = NULL,
  doc.topic.prior = NULL,
  topic.word.prior = NULL,
  burn.in = NULL,
  iteration = NULL,
  thin = NULL,
  seed = NULL,
  max.top.words = NULL,
  threshold.top.words = NULL,
  gibbs.init = NULL,
  delimiters = NULL,
  output.word.assignment = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
document	`character, optional` Name of the document column. Defaults to the first non-ID column.
n.components	`integer` Expected number of topics in the corpus.
doc.topic.prior	`double, optional` Specifies the prior weight related to document-topic distribution. Defaults to 50/n.components.
topic.word.prior	`double, optional` Specifies the prior weight related to topic-word distribution. Defaults to 0.1.
burn.in	`integer, optional` Number of omitted Gibbs iterations at the beginning. Defaults to 0.
iteration	`integer, optional` Number of Gibbs iterations. Defaults to 2000.
thin	`integer, optional` Number of omitted in-between Gibbs iterations. Defaults to 1.
seed	`integer, optional` Seed used to initialize the random number generator. `0`: uses the system time; any other value: uses the specified seed. Defaults to 0.
max.top.words	`integer, optional` Specifies the maximum number of words to be output for each topic. Only valid when the topic top words output table is provided. It cannot be used together with parameter threshold.top.words. Defaults to 0.
threshold.top.words	`double, optional` The algorithm outputs top words for each topic if the probability is larger than this threshold. Only valid when the topic top words output table is provided. It cannot be used together with parameter max.top.words. Defaults to 0.
gibbs.init	`character, optional` Specifies the initialization method for Gibbs sampling. 'uniform': assigns each word in each document a topic drawn from a uniform distribution, so every topic is equally likely for each word. 'gibbs': assigns each word in each document a topic by one round of Gibbs sampling, using the document-topic and topic-word priors given by doc.topic.prior and topic.word.prior (PAL parameters ALPHA and BETA). Defaults to 'uniform'.
delimiters	`list of character, optional` Specifies the delimiters that separate words in a document. For example, if the words are separated by , or :, then the delimiters should be "," and ":". Defaults to ' ' (single space).
output.word.assignment	`logical, optional` Controls whether to output the word-topic assignment. Note that setting this parameter to TRUE makes the procedure take longer to return, because it must also write the word-topic assignment table. Defaults to FALSE.
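
The two top-word options, max.top.words and threshold.top.words, describe two different selection rules. The following base R sketch illustrates the difference on a hypothetical probability vector for a single topic. The values and the helper names top.by.count and top.by.threshold are invented for illustration; this is not how PAL implements the selection.

```r
# Hypothetical word probabilities for one topic (illustrative values only).
word.prob <- c(cpu = 0.40, memory = 0.25, monitor = 0.15,
               keyboard = 0.12, harddisk = 0.08)

# In the spirit of max.top.words: keep the n most probable words.
top.by.count <- function(prob, n) {
  names(sort(prob, decreasing = TRUE))[seq_len(min(n, length(prob)))]
}

# In the spirit of threshold.top.words: keep the words whose probability
# is strictly larger than the threshold, most probable first.
top.by.threshold <- function(prob, threshold) {
  kept <- prob[prob > threshold]
  names(sort(kept, decreasing = TRUE))
}

by.count <- top.by.count(word.prob, 3)
by.threshold <- top.by.threshold(word.prob, 0.1)

print(by.count)      # "cpu" "memory" "monitor"
print(by.threshold)  # "cpu" "memory" "monitor" "keyboard"
```

A count-based limit always returns the same number of words per topic, while a threshold returns more words for flat topics and fewer for peaked ones.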

Value

A "LatentDirichletAllocation" object with the following attributes:

doc.topic.dist: DataFrame
Document-topic distribution table, structured as follows:
- Document ID column: with same name and type as data's document ID column.
- TOPIC_ID: type INTEGER, topic ID.
- PROBABILITY: type DOUBLE, probability of topic given document.
word.topic.assignment: DataFrame
Word-topic assignment table, structured as follows:
- Document ID column: with same name and type as data's document ID column.
- WORD_ID: type INTEGER, word ID.
- TOPIC_ID: type INTEGER, topic ID.
topic.top.words: DataFrame
Topic top words table, structured as follows:
- TOPIC_ID: type INTEGER, topic ID.
- WORDS: type NVARCHAR(5000), topic top words separated by spaces.
topic.word.dist: DataFrame
Topic-word distribution table, structured as follows:
- TOPIC_ID: type INTEGER, topic ID.
- WORD_ID: type INTEGER, word ID.
- PROBABILITY: type DOUBLE, probability of word given topic.
dictionary: DataFrame
Dictionary table, structured as follows:
- WORD_ID: type INTEGER, word ID.
- WORD: type NVARCHAR(5000), word text.
statistics: DataFrame
Statistics table, structured as follows:
- STAT_NAME: type NVARCHAR(256), statistic name.
- STAT_VALUE: type NVARCHAR(1000), statistic value.
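
Because the topic-word distribution table identifies words only by WORD_ID, it is usually read together with the dictionary table. The sketch below joins the two in base R. It assumes that Collect() returns an ordinary R data.frame, as the printed examples suggest, and it uses small hypothetical stand-in data frames with invented values rather than real model output.

```r
# Hypothetical stand-ins for LDA$topic.word.dist$Collect() and
# LDA$dictionary$Collect(); all values are invented for illustration.
topic.word.dist <- data.frame(
  TOPIC_ID = c(0L, 0L, 1L, 1L),
  WORD_ID = c(0L, 1L, 0L, 1L),
  PROBABILITY = c(0.7, 0.3, 0.2, 0.8)
)
dictionary <- data.frame(
  WORD_ID = c(0L, 1L),
  WORD = c("cpu", "tires"),
  stringsAsFactors = FALSE
)

# Attach the word text to each (topic, word) probability.
readable <- merge(topic.word.dist, dictionary, by = "WORD_ID")

# Sort by topic, then by descending probability, and keep readable columns.
readable <- readable[order(readable$TOPIC_ID, -readable$PROBABILITY),
                     c("TOPIC_ID", "WORD", "PROBABILITY")]
print(readable)
```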

Details

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
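
The generative view can be made concrete with a minimal base R sketch. It draws a topic mixture for a document, then a topic for each word, then the word itself. The vocabulary, the number of topics, and the helper names rdirichlet1 and generate.document are assumptions chosen for illustration; the sketch shows the textbook generative process, not the PAL procedure, which goes the other way and infers topics from observed documents.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Draw one sample from a Dirichlet distribution by normalizing
# independent gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab <- c("cpu", "memory", "tires", "helmet", "toy", "sweaters")
n.topics <- 3
alpha <- 0.1  # document-topic prior weight
beta <- 0.1   # topic-word prior weight

# One word distribution per topic (rows: topics, columns: vocabulary).
topic.word <- t(replicate(n.topics, rdirichlet1(rep(beta, length(vocab)))))

generate.document <- function(n.words) {
  # Topic mixture of this document.
  theta <- rdirichlet1(rep(alpha, n.topics))
  # Latent topic for each word position.
  z <- sample(n.topics, n.words, replace = TRUE, prob = theta)
  # Each word is drawn from the word distribution of its topic.
  words <- vapply(z, function(k) sample(vocab, 1, prob = topic.word[k, ]),
                  character(1))
  paste(words, collapse = " ")
}

doc <- generate.document(8)
print(doc)
```

With small prior weights such as 0.1, most of the probability mass falls on a few topics per document and a few words per topic, which is why generated documents tend to repeat the same words.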

Examples

Input DataFrame data:

> data$Collect()
 DOCUMENT_ID      TEXT
 10               cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
 20               tires mountainbike wheels valve helmet mountainbike rearfender
                  tires mountainbike mountainbike
 30               carseat toy strollers toy toy spoon toy strollers toy carseat
 40               sweaters sweaters sweaters boots sweaters rings vest vest shoe
                  sweaters

Call the function:

LDA <- hanaml.LatentDirichletAllocation(data, key = "DOCUMENT_ID",
                                        document = "TEXT", n.components = 6,
                                        doc.topic.prior = 0.1, burn.in = 50,
                                        iteration = 100, thin = 10, seed = 1,
                                        max.top.words = 5, output.word.assignment = TRUE)

Output:

> LDA$doc.topic.dist$Collect()

       DOCUMENT_ID TOPIC_ID      PROBABILITY
1               10        0      0.010416667
2               10        1      0.010416667
3               10        2      0.010416667
4               10        3      0.010416667
......
20              40        1      0.009433962
21              40        2      0.952830189
22              40        3      0.009433962
23              40        4      0.009433962
24              40        5      0.009433962
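
The printed probabilities appear consistent with the standard smoothed estimate (n_dk + alpha) / (N_d + K * alpha), where n_dk is the number of words in document d assigned to topic k, N_d is the document length, K is n.components and alpha is doc.topic.prior. The base R check below reproduces three of the printed values from the word counts of documents 10 and 40. This is an observation about the numbers shown, not a statement about PAL's internal implementation; the helper name smoothed is hypothetical.

```r
alpha <- 0.1  # doc.topic.prior used in the call
K <- 6        # n.components used in the call

# Smoothed document-topic probability:
# (words in the document assigned to the topic + alpha)
# divided by (words in the document + K * alpha).
smoothed <- function(n.dk, n.d) (n.dk + alpha) / (n.d + K * alpha)

p.doc10.empty <- smoothed(0, 9)    # document 10 has 9 words
p.doc40.empty <- smoothed(0, 10)   # document 40 has 10 words
p.doc40.full <- smoothed(10, 10)   # all 10 words assigned to one topic

print(sprintf("%.9f", c(p.doc10.empty, p.doc40.empty, p.doc40.full)))
# "0.010416667" "0.009433962" "0.952830189"
```

Under this reading, document 40 has all ten of its words assigned to topic 2, and the remaining topics receive only the prior weight.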
