hanaml.LatentDirichletAllocation {hana.ml.r}    R Documentation

Latent Dirichlet Allocation

Description

hanaml.LatentDirichletAllocation is an R wrapper for the PAL Latent Dirichlet Allocation algorithm.

Usage

hanaml.LatentDirichletAllocation(conn.context, data = NULL,
                                 key = NULL, document = NULL,
                                 n.components = NULL, doc.topic.prior = NULL,
                                 topic.word.prior = NULL, burn.in = NULL, iteration = NULL,
                                 thin = NULL, seed = NULL, max.top.words = NULL,
                                 threshold.top.words = NULL, gibbs.init = NULL,
                                 delimiters = NULL, output.word.assignment = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
Dataset used for training the LatentDirichletAllocation model.

key

character, optional
Name of the ID column.

document

character, optional
Name of the column that holds the document text.

n.components

integer
Expected number of topics in the corpus.

doc.topic.prior

double, optional
Specifies the prior weight of the document-topic distribution.
Defaults to 50 / n.components.

topic.word.prior

double, optional
Specifies the prior weight of the topic-word distribution.
Defaults to 0.1.

burn.in

integer, optional
Number of omitted Gibbs iterations at the beginning.
Defaults to 0.

iteration

integer, optional
Number of Gibbs iterations.
Defaults to 2000.

thin

integer, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.

seed

integer, optional
Indicates the seed used to initialize the random number generator.

  • 0: uses the system time

  • Not 0: uses the specified seed

Defaults to 0.

max.top.words

integer, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.

threshold.top.words

double, optional
The algorithm outputs a word as a top word of a topic only if its probability is larger than this threshold.
It cannot be used together with the parameter max.top.words.

gibbs.init

character, optional
Specifies the initialization method for Gibbs sampling:

  • 'uniform': Assigns a topic to each word in each document by drawing from a uniform distribution, so every topic is equally likely for every word.

  • 'gibbs': Assigns a topic to each word in each document by one round of Gibbs sampling, using the document-topic and topic-word priors given by doc.topic.prior and topic.word.prior (ALPHA and BETA in PAL).

Defaults to 'uniform'.

delimiters

list of character, optional
Specifies the delimiters that separate words in a document.
For example, if the words are separated by , or :, then the delimiters are "," and ":".
Defaults to " " (a single space).
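The effect of the delimiters can be pictured with a minimal base R sketch. The helper split_words is hypothetical and not part of hana.ml.r; it only illustrates the idea of splitting on several delimiters and makes no claim about how PAL tokenizes internally.

```r
# Hypothetical helper (not part of hana.ml.r): split one document into words
# on any of the given delimiters.
split_words <- function(text, delimiters = list(" ")) {
  # Replace every delimiter with a single space, treating it as a literal string.
  for (d in delimiters) {
    text <- gsub(d, " ", text, fixed = TRUE)
  }
  # Split on spaces and drop the empty strings left by adjacent delimiters.
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words[nzchar(words)]
}

split_words("cpu,harddisk:monitor,cpu", delimiters = list(",", ":"))
```

With the default delimiter the same helper splits on single spaces only.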

output.word.assignment

logical, optional
Controls whether the word-topic assignment is output. Note that setting this parameter to TRUE increases the run time, because the procedure must also write the WORD_TOPIC_ASSIGNMENT table.
Defaults to FALSE.

Format

R6Class object.

Details

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
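The generative view can be sketched in base R. This is an illustration of the standard LDA generative process only, not of the PAL implementation; the names rdirichlet1 and generate_document, the toy vocabulary, and the topic-word matrix are all assumptions made for this example.

```r
set.seed(1)

# Draw one sample from a Dirichlet distribution by normalizing gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# Generate one document: draw a topic mixture, then for each word position
# draw a topic from the mixture and a word from that topic's distribution.
generate_document <- function(n.words, topic.word, doc.topic.prior) {
  n.topics <- nrow(topic.word)
  theta <- rdirichlet1(rep(doc.topic.prior, n.topics))
  topics <- sample.int(n.topics, n.words, replace = TRUE, prob = theta)
  vapply(topics,
         function(k) sample(colnames(topic.word), 1, prob = topic.word[k, ]),
         character(1))
}

# Toy topic-word distributions (rows are topics, columns are words).
vocab <- c("cpu", "memory", "tires", "helmet")
topic.word <- rbind(c(0.5, 0.5, 0, 0),
                    c(0, 0, 0.5, 0.5))
colnames(topic.word) <- vocab

doc <- generate_document(8, topic.word, doc.topic.prior = 0.1)
doc
```

Training runs this process in reverse: given only the documents, Gibbs sampling estimates the topic mixtures and the topic-word distributions.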

Value

A "LatentDirichletAllocation" object with the following attributes:

doc.topic.dist: DataFrame
Document-topic distribution table.

word.topic.assignment: DataFrame
Word-topic assignment table.

topic.top.words: DataFrame
Topic top words table.

topic.word.dist: DataFrame
Topic-word distribution table.

dictionary: DataFrame
Dictionary table.

statistics: DataFrame
Statistics table.

Examples

## Not run: 
   Input DataFrame for training:
> data$Collect()
   DOCUMENT_ID      TEXT
 1   10       cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
 2   20       tires mountainbike wheels valve helmet mountainbike rearfender
              tires mountainbike mountainbike
 3   30       carseat toy strollers toy toy spoon toy strollers toy carseat
 4   40       sweaters sweaters sweaters boots sweaters rings vest vest shoe
              sweaters

 Create a LatentDirichletAllocation instance:
 LDA <- hanaml.LatentDirichletAllocation(conn, data, key = "DOCUMENT_ID",
                   document = "TEXT", n.components = 6,
                   doc.topic.prior = 0.1, burn.in = 50,
                   iteration = 100, thin = 10, seed = 1,
                   max.top.words = 5, output.word.assignment = TRUE)

 Output example:
 > LDA$doc.topic.dist$Collect()

         DOCUMENT_ID  TOPIC_ID   PROBABILITY
   1           10        0      0.010416667
   2           10        1      0.010416667
   3           10        2      0.010416667
   4           10        3      0.010416667
   5           10        4      0.947916667
   6           10        5      0.010416667
   7           20        0      0.009433962
   8           20        1      0.009433962
   9           20        2      0.009433962
   10          20        3      0.952830189
   11          20        4      0.009433962
   12          20        5      0.009433962
   13          30        0      0.103773585
   14          30        1      0.858490566
   15          30        2      0.009433962
   16          30        3      0.009433962
   17          30        4      0.009433962
   18          30        5      0.009433962
   19          40        0      0.009433962
   20          40        1      0.009433962
   21          40        2      0.952830189
   22          40        3      0.009433962
   23          40        4      0.009433962
   24          40        5      0.009433962

## End(Not run)
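The PROBABILITY values in the example output are consistent with the usual smoothed document-topic estimate, (n_dk + alpha) / (N_d + K * alpha), where n_dk is the number of words of document d assigned to topic k, N_d is the document length, K is n.components and alpha is doc.topic.prior. The sketch below reproduces the numbers in plain R. The helper doc_topic_prob is hypothetical, and the per-topic word counts are inferred from the output rather than reported by PAL.

```r
# Hypothetical helper (not part of hana.ml.r): smoothed document-topic estimate.
doc_topic_prob <- function(topic.counts, doc.topic.prior) {
  (topic.counts + doc.topic.prior) /
    (sum(topic.counts) + length(topic.counts) * doc.topic.prior)
}

# Document 10 has 9 words; assume all of them were assigned to topic 4
# (topic IDs run from 0 to 5).
p10 <- doc_topic_prob(c(0, 0, 0, 0, 9, 0), doc.topic.prior = 0.1)
round(p10, 9)
# 0.010416667 0.010416667 0.010416667 0.010416667 0.947916667 0.010416667

# Document 30 has 10 words; assume 1 word in topic 0 and 9 words in topic 1.
p30 <- doc_topic_prob(c(1, 9, 0, 0, 0, 0), doc.topic.prior = 0.1)
round(p30, 9)
# 0.103773585 0.858490566 0.009433962 0.009433962 0.009433962 0.009433962
```

Because of the prior, no topic receives a probability of exactly zero, even when none of the document's words is assigned to it.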

[Package hana.ml.r version 1.0.8 Index]