hanaml.LatentDirichletAllocation is an R wrapper for SAP HANA PAL Latent Dirichlet Allocation.

hanaml.LatentDirichletAllocation(
  data = NULL,
  key = NULL,
  document = NULL,
  n.components = NULL,
  doc.topic.prior = NULL,
  topic.word.prior = NULL,
  burn.in = NULL,
  iteration = NULL,
  thin = NULL,
  seed = NULL,
  max.top.words = NULL,
  threshold.top.words = NULL,
  gibbs.init = NULL,
  delimiters = NULL,
  output.word.assignment = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

document

character, optional
Name of the document column.
Defaults to the first non-ID column.

n.components

integer
Expected number of topics in the corpus.

doc.topic.prior

double, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n.components.

topic.word.prior

double, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.

burn.in

integer, optional
Number of omitted Gibbs iterations at the beginning.
Defaults to 0.

iteration

integer, optional
Number of Gibbs iterations.
Defaults to 2000.

thin

integer, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
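
The interplay of burn.in, iteration and thin can be sketched in base R. This assumes the usual MCMC convention in which the first burn.in iterations are discarded and every thin-th iteration after that is kept; the exact rule PAL applies is not stated here, and retained.iterations is a hypothetical helper, not part of hanaml.

```r
# Hypothetical helper (not part of hanaml): lists the Gibbs iterations that
# would be retained under the common burn-in/thinning convention (an assumption).
retained.iterations <- function(iteration, burn.in = 0, thin = 1) {
  first <- burn.in + thin
  if (first > iteration) {
    return(numeric(0))  # nothing left after burn-in
  }
  seq(from = first, to = iteration, by = thin)
}

kept <- retained.iterations(iteration = 100, burn.in = 50, thin = 10)
print(kept)  # 60 70 80 90 100
```

With the default thin = 1, every iteration after burn-in would be kept under this reading.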

seed

integer, optional
Indicates the seed used to initialize the random number generator.

  • 0: uses the system time

  • Not 0: uses the specified seed

Defaults to 0.

max.top.words

integer, optional
Specifies the maximum number of words to be output for each topic.
Only valid when the topic top words output table is provided. It cannot be used together with parameter threshold.top.words.
Defaults to 0.

threshold.top.words

double, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold.
Only valid when the topic top words output table is provided. It cannot be used together with parameter max.top.words.
Defaults to 0.
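
The difference between the two selection modes, max.top.words and threshold.top.words, can be illustrated on a made-up probability vector for one topic. The helpers top.k.words and words.above are hypothetical and only mimic the documented behaviour in base R; they are not the PAL implementation.

```r
# Made-up word probabilities for a single topic (illustration only).
topic.probs <- c(cpu = 0.40, memory = 0.25, monitor = 0.15,
                 keyboard = 0.12, harddisk = 0.08)

# max.top.words-style selection: keep the k most probable words.
top.k.words <- function(probs, k) {
  names(sort(probs, decreasing = TRUE))[seq_len(min(k, length(probs)))]
}

# threshold.top.words-style selection: keep words whose probability
# is larger than the threshold.
words.above <- function(probs, threshold) {
  names(probs)[probs > threshold]
}

print(top.k.words(topic.probs, 3))     # "cpu" "memory" "monitor"
print(words.above(topic.probs, 0.13))  # "cpu" "memory" "monitor"
```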

gibbs.init

character, optional
Specifies the initialization method for Gibbs sampling:

  • 'uniform': Assigns each word in each document a topic drawn from a uniform distribution, so every topic is equally likely for every word.

  • 'gibbs': Assigns each word in each document a topic by one round of Gibbs sampling, using the document-topic and topic-word prior distributions given by doc.topic.prior and topic.word.prior (PAL parameters ALPHA and BETA).

Defaults to 'uniform'.
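
The idea behind the 'uniform' option can be sketched in base R as follows. This is a toy illustration, not the PAL implementation: the token list and the 0-based topic numbering are assumptions chosen to resemble the example data on this page.

```r
# Toy sketch of uniform initialization: every word token receives a topic
# drawn with equal probability. Not the PAL implementation.
set.seed(1)  # local R seed, unrelated to the `seed` argument of hanaml
n.topics <- 6
tokens <- c("cpu", "harddisk", "graphiccard", "cpu", "monitor",
            "keyboard", "cpu", "memory", "memory")

# Topic IDs are numbered from 0 here (an assumption for illustration).
init.topics <- sample(0:(n.topics - 1), size = length(tokens), replace = TRUE)
print(data.frame(word = tokens, topic = init.topics))
```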

delimiters

list of characters, optional
Specifies the delimiters that separate words in a document.
For example, if the words are separated by , or :, then the delimiters should be "," and ":".
Defaults to ' ' (single space).
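
The effect of a delimiter set on a document can be previewed locally with base R's strsplit(). This is only an approximation of the idea; PAL's own tokenizer may differ in details such as the handling of empty tokens.

```r
# Illustration with base R; PAL's tokenizer may behave differently in edge cases.
doc <- "cpu,harddisk:graphiccard,cpu"
delimiters <- c(",", ":")

# Build a regular-expression character class from the delimiters. This simple
# construction works for "," and ":" but would need escaping for characters
# such as "]" or "^".
pattern <- paste0("[", paste(delimiters, collapse = ""), "]")
words <- strsplit(doc, pattern)[[1]]
print(words)  # "cpu" "harddisk" "graphiccard" "cpu"
```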

output.word.assignment

logical, optional
Controls whether to output the word-topic assignment. Note that if this parameter is set to TRUE, the procedure takes longer to return because it must also write the word-topic assignment table.
Defaults to FALSE.

Value

An R6 object of class "LatentDirichletAllocation" with the following attributes and methods:
Attributes

  • doc.topic.dist: DataFrame
    Document-topic distribution table, structured as follows:

    • Document ID column: with same name and type as data's document ID column.

    • TOPIC_ID: type INTEGER, topic ID.

    • PROBABILITY: type DOUBLE, probability of topic given document.

  • word.topic.assignment: DataFrame
    Word-topic assignment table, structured as follows:

    • Document ID column: with same name and type as data's document ID column.

    • WORD_ID: type INTEGER, word ID.

    • TOPIC_ID: type INTEGER, topic ID.

  • topic.top.words: DataFrame
    Topic top words table, structured as follows:

    • TOPIC_ID: type INTEGER, topic ID.

    • WORDS: type NVARCHAR(5000), topic top words separated by spaces.

  • topic.word.dist: DataFrame
    Topic-word distribution table, structured as follows:

    • TOPIC_ID: type INTEGER, topic ID.

    • WORD_ID: type INTEGER, word ID.

    • PROBABILITY: type DOUBLE, probability of word given topic.

  • dictionary: DataFrame
    Dictionary table, structured as follows:

    • WORD_ID: type INTEGER, word ID.

    • WORD: type NVARCHAR(5000), word text.

  • statistics: DataFrame
    Statistics table, structured as follows:

    • STAT_NAME: type NVARCHAR(256), statistic name.

    • STAT_VALUE: type NVARCHAR(1000), statistic value.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > lda <- hanaml.LatentDirichletAllocation(data=df, key="ID")
   > lda$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing.
    Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with model.
    Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression.
    Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
    Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether to replace an existing state for the model.
    Defaults to FALSE.

After this method is called, a state attribute containing the parsed model information is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > lda <- hanaml.LatentDirichletAllocation(data=df, key="ID")
   > lda$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling:


   > lda$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info.
    Defaults to self$state.

After this method is called, the specified model state is cleaned up and its associated memory is released.

Details

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
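
The generative story described above can be sketched in a few lines of base R. This is a toy simulation with a made-up vocabulary and prior values, intended only to show how topics and words are drawn; it is not how PAL fits the model. The helper rdirichlet1 is hypothetical and draws a Dirichlet sample by normalising gamma variates.

```r
# Toy sketch of the LDA generative process in base R (not PAL code).
set.seed(42)

# Draw one Dirichlet sample by normalising independent gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab <- c("cpu", "memory", "tires", "helmet")
n.topics <- 2
alpha <- rep(0.5, n.topics)       # document-topic prior (cf. doc.topic.prior)
beta <- rep(0.5, length(vocab))   # topic-word prior (cf. topic.word.prior)

# One word distribution per topic (rows: topics, columns: vocabulary).
phi <- t(sapply(seq_len(n.topics), function(k) rdirichlet1(beta)))
# Topic mixture for a single document.
theta <- rdirichlet1(alpha)

# Generate a 10-word document: pick a topic per word, then a word from that topic.
z <- sample(n.topics, size = 10, replace = TRUE, prob = theta)
doc <- vapply(z, function(k) sample(vocab, 1, prob = phi[k, ]), character(1))
print(paste(doc, collapse = " "))
```

Fitting LDA reverses this process: given only the documents, Gibbs sampling estimates the topic mixtures and word distributions that plausibly generated them.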

Examples

Input DataFrame data:


> data$Collect()
 DOCUMENT_ID      TEXT
 10               cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
 20               tires mountainbike wheels valve helmet mountainbike rearfender
                  tires mountainbike mountainbike
 30               carseat toy strollers toy toy spoon toy strollers toy carseat
 40               sweaters sweaters sweaters boots sweaters rings vest vest shoe
                  sweaters

Call the function:

LDA <- hanaml.LatentDirichletAllocation(data, key = "DOCUMENT_ID",
                                        document = "TEXT", n.components = 6,
                                        doc.topic.prior = 0.1, burn.in = 50,
                                        iteration = 100, thin = 10, seed = 1,
                                        max.top.words = 5, output.word.assignment = TRUE)

Output:


> LDA$doc.topic.dist$Collect()

       DOCUMENT_ID TOPIC_ID      PROBABILITY
1               10        0      0.010416667
2               10        1      0.010416667
3               10        2      0.010416667
4               10        3      0.010416667
......
20              40        1      0.009433962
21              40        2      0.952830189
22              40        3      0.009433962
23              40        4      0.009433962
24              40        5      0.009433962
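
A common follow-up step is to pick the most probable topic for each document from a table with the DOCUMENT_ID, TOPIC_ID and PROBABILITY columns shown above. The sketch below works on a small made-up data frame as a stand-in for the collected result, so the numbers are illustrative and do not reproduce the output above.

```r
# Made-up stand-in for a collected document-topic table.
doc.topic <- data.frame(
  DOCUMENT_ID = c(10, 10, 10, 40, 40, 40),
  TOPIC_ID = c(0, 1, 2, 0, 1, 2),
  PROBABILITY = c(0.90, 0.05, 0.05, 0.02, 0.03, 0.95)
)

# For each document, keep the row with the highest topic probability.
by.doc <- split(doc.topic, doc.topic$DOCUMENT_ID)
dominant <- do.call(rbind, lapply(by.doc, function(d) d[which.max(d$PROBABILITY), ]))
rownames(dominant) <- NULL
print(dominant)
```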