R: Latent Dirichlet Allocation

hanaml.LatentDirichletAllocation {hana.ml.r}

R Documentation

Latent Dirichlet Allocation

Description

hanaml.LatentDirichletAllocation is an R wrapper for the SAP HANA PAL Latent Dirichlet Allocation algorithm.

Usage

hanaml.LatentDirichletAllocation(conn.context, data = NULL,
                                 key = NULL, document = NULL,
                                 n.components = NULL, doc.topic.prior = NULL,
                                 topic.word.prior = NULL, burn.in = NULL, iteration = NULL,
                                 thin = NULL, seed = NULL, max.top.words = NULL,
                                 threshold.top.words = NULL, gibbs.init = NULL,
                                 delimiters = NULL, output.word.assignment = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` Dataset used for training the LatentDirichletAllocation model.
`key`	`character, optional` Name of the ID column.
`document`	`character, optional` Names of the document columns.
`n.components`	`integer` Expected number of topics in the corpus.
`doc.topic.prior`	`double, optional` Specifies the prior weight related to document-topic distribution. Defaults to 50/n.components.
`topic.word.prior`	`double, optional` Specifies the prior weight related to topic-word distribution. Defaults to 0.1.
`burn.in`	`integer, optional` Number of omitted Gibbs iterations at the beginning. Defaults to 0.
`iteration`	`integer, optional` Number of Gibbs iterations. Defaults to 2000.
`thin`	`integer, optional` Number of omitted in-between Gibbs iterations. Defaults to 1.
`seed`	`integer, optional` Indicates the seed used to initialize the random number generator. `0`: uses the system time. Any other value: uses the specified seed. Defaults to 0.
`max.top.words`	`integer, optional` Specifies the maximum number of words to be output for each topic. Defaults to 0.
`threshold.top.words`	`double, optional` The algorithm outputs the top words for each topic whose probability is larger than this threshold. It cannot be used together with the parameter max.top.words.
`gibbs.init`	`character, optional` Specifies the initialization method for Gibbs sampling: 'uniform': Assigns each word in each document a topic drawn from a uniform distribution, so every topic is equally likely for every word. 'gibbs': Assigns each word in each document a topic by one round of Gibbs sampling, using the document-topic and topic-word prior distributions given by doc.topic.prior and topic.word.prior (the PAL parameters ALPHA and BETA). Defaults to 'uniform'.
`delimiters`	`list of character, optional` Specifies the delimiters that separate words in a document. For example, if the words are separated by , or :, then the delimiters should be "," and ":". Defaults to " " (single space).
`output.word.assignment`	`logical, optional` Controls whether to output the word-topic assignment. If this parameter is set to TRUE, the procedure takes longer to return because it must write the WORD_TOPIC_ASSIGNMENT table. Defaults to FALSE.
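
To make the effect of `delimiters` concrete, the following base R sketch splits a document on several literal delimiters. It only illustrates the idea and is not how PAL tokenizes text internally; the helper name `tokenize.text` is hypothetical and not part of hana.ml.r.

```r
# Illustration only: split a text on a set of literal delimiters.
# tokenize.text is a hypothetical helper, not a hana.ml.r function.
tokenize.text <- function(text, delimiters = " ") {
  tokens <- text
  for (d in delimiters) {
    # fixed = TRUE treats each delimiter as a literal string, not a regex.
    tokens <- unlist(strsplit(tokens, d, fixed = TRUE))
  }
  # Drop empty strings produced by adjacent delimiters.
  tokens[nzchar(tokens)]
}

words <- tokenize.text("cpu,memory:monitor,cpu", delimiters = list(",", ":"))
print(words)
```

With `delimiters = list(",", ":")`, both characters act as word boundaries, which matches the example given for the `delimiters` argument.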

Format

R6Class object.

Details

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
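
The generative view described above can be sketched in base R. This is a minimal simulation under assumed sizes and a made-up vocabulary, not the PAL implementation; `rdirichlet1` and `generate.document` are hypothetical helpers, and the priors simply reuse the documented defaults (50/n.components and 0.1).

```r
# Minimal simulation of the LDA generative process in base R.
# All names and sizes are illustrative and not part of hana.ml.r.
set.seed(1)

# Draw one sample from a Dirichlet distribution via normalized Gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab    <- c("cpu", "memory", "tires", "helmet", "toy", "carseat")
n.topics <- 2
alpha    <- rep(50 / n.topics, n.topics)  # document-topic prior
beta     <- rep(0.1, length(vocab))       # topic-word prior

# One word distribution per topic (rows: topics, columns: words).
topic.word <- t(sapply(seq_len(n.topics), function(k) rdirichlet1(beta)))

generate.document <- function(n.words) {
  theta <- rdirichlet1(alpha)  # topic mixture for this document
  # Pick a topic for every word position, then a word from that topic.
  z <- sample(n.topics, n.words, replace = TRUE, prob = theta)
  words <- vapply(z, function(k) sample(vocab, 1, prob = topic.word[k, ]),
                  character(1))
  paste(words, collapse = " ")
}

doc <- generate.document(8)
print(doc)
```

Fitting an LDA model runs this process in reverse: given only the documents, it estimates the per-document topic mixtures and the per-topic word distributions.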

Value

A "LatentDirichletAllocation" object with the following attributes:

doc.topic.dist: DataFrame.
Document-topic distribution table, structured as follows:

Document ID column: with same name and type as data's document ID column.
TOPIC_ID: type INTEGER, topic ID.
PROBABILITY: type DOUBLE, probability of topic given document.

word.topic.assignment: DataFrame.
Word-topic assignment table, structured as follows:

Document ID column: with same name and type as data's document ID column.
WORD_ID: type INTEGER, word ID.
TOPIC_ID: type INTEGER, topic ID.

topic.top.words: DataFrame.
Topic top words table, structured as follows:

TOPIC_ID: type INTEGER, topic ID.
WORDS: type NVARCHAR(5000), topic top words separated by spaces.

topic.word.dist: DataFrame.
Topic-word distribution table, structured as follows:

TOPIC_ID: type INTEGER, topic ID.
WORD_ID: type INTEGER, word ID.
PROBABILITY: type DOUBLE, probability of word given topic.

dictionary: DataFrame.
Dictionary table, structured as follows:

WORD_ID: type INTEGER, word ID.
WORD: type NVARCHAR(5000), word text.

statistics: DataFrame.
Statistics table, structured as follows:

STAT_NAME: type NVARCHAR(256), statistic name.
STAT_VALUE: type NVARCHAR(1000), statistic value.
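
As a sketch of how the document-topic table might be used once it has been collected into a local data.frame, the following base R code picks the most probable topic for each document. The small data.frame is a local stand-in with the documented columns (DOCUMENT_ID is assumed as the ID column name); no SAP HANA connection is involved.

```r
# Local stand-in for a collected doc.topic.dist table (documented columns).
doc.topic <- data.frame(
  DOCUMENT_ID = c(10, 10, 20, 20),
  TOPIC_ID    = c(3, 4, 3, 4),
  PROBABILITY = c(0.010416667, 0.947916667, 0.952830189, 0.009433962)
)

# For each document, keep the row with the highest topic probability.
by.doc <- split(doc.topic, doc.topic$DOCUMENT_ID)
dominant <- do.call(rbind, lapply(by.doc, function(d) {
  d[which.max(d$PROBABILITY), ]
}))
rownames(dominant) <- NULL
print(dominant)
```

The same pattern applies to topic.word.dist when grouping by TOPIC_ID to find the most probable words of each topic.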

Examples

## Not run: 
   Input DataFrame for topic modeling:
> data$Collect()
   DOCUMENT_ID      TEXT
 1   10       cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
 2   20       tires mountainbike wheels valve helmet mountainbike rearfender
              tires mountainbike mountainbike
 3   30       carseat toy strollers toy toy spoon toy strollers toy carseat
 4   40       sweaters sweaters sweaters boots sweaters rings vest vest shoe
              sweaters

 Create a LatentDirichletAllocation instance:
 LDA <- hanaml.LatentDirichletAllocation(conn, data, key = "DOCUMENT_ID",
                   document = "TEXT", n.components = 6,
                   doc.topic.prior = 0.1, burn.in = 50,
                   iteration = 100, thin = 10, seed = 1,
                   max.top.words = 5, output.word.assignment = TRUE)

 Output example:
 > LDA$doc.topic.dist$Collect()

         DOCUMENT_ID  TOPIC_ID   PROBABILITY
   1           10        0      0.010416667
   2           10        1      0.010416667
   3           10        2      0.010416667
   4           10        3      0.010416667
   5           10        4      0.947916667
   6           10        5      0.010416667
   7           20        0      0.009433962
   8           20        1      0.009433962
   9           20        2      0.009433962
   10          20        3      0.952830189
   11          20        4      0.009433962
   12          20        5      0.009433962
   13          30        0      0.103773585
   14          30        1      0.858490566
   15          30        2      0.009433962
   16          30        3      0.009433962
   17          30        4      0.009433962
   18          30        5      0.009433962
   19          40        0      0.009433962
   20          40        1      0.009433962
   21          40        2      0.952830189
   22          40        3      0.009433962
   23          40        4      0.009433962
   24          40        5      0.009433962

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]