Similar to other predict methods, this function returns predictions for new documents from a fitted "LatentDirichletAllocation" object.

# S3 method for LatentDirichletAllocation
transform(
  model,
  data,
  key,
  document = NULL,
  burn.in = NULL,
  iteration = NULL,
  thin = NULL,
  seed = NULL,
  gibbs.init = NULL,
  delimiters = NULL,
  output.word.assignment = NULL
)

Arguments

model

R6Class object
A "LatentDirichletAllocation" object for prediction.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

document

character, optional
Name of the document column.
Defaults to the first non-ID column.

burn.in

integer, optional
Number of omitted Gibbs iterations at the beginning.
Defaults to 0.

iteration

integer, optional
Number of Gibbs iterations.
Defaults to 2000.

thin

integer, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
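
The interaction of burn.in, iteration and thin can be illustrated with a small base R sketch. It is not part of the package, and the counting convention is an assumption: the helper retained_sweeps is hypothetical and supposes that iteration sweeps run after the burn.in sweeps are discarded, with one sample kept every thin sweeps. The actual procedure may count differently.

```r
# Hypothetical helper, not part of hana.ml.r.
# Assumption: `iteration` sweeps run after `burn.in` discarded sweeps,
# and one sample is kept every `thin` sweeps.
retained_sweeps <- function(burn.in = 0L, iteration = 2000L, thin = 1L) {
  stopifnot(burn.in >= 0, iteration >= 1, thin >= 1, thin <= iteration)
  # Index of each kept sweep, counted from the very first sweep.
  burn.in + seq(from = thin, to = iteration, by = thin)
}

kept <- retained_sweeps(burn.in = 2000L, iteration = 1000L, thin = 100L)
length(kept)  # number of retained samples
range(kept)   # first and last retained sweep
```

Under this assumption, a larger thin reduces the number of retained samples, while a larger burn.in only shifts where sampling starts.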

seed

integer, optional
Indicates the seed used to initialize the random number generator.

  • 0: uses the system time.

  • Not 0: uses the specified seed.

Defaults to 0.

gibbs.init

character, optional
Specifies the initialization method for Gibbs sampling. This value takes precedence over the corresponding one in the general information table.

  • 'uniform': Assigns each word in each document a topic by a uniform distribution. Each topic has the same probability to be assigned for each word.

  • 'gibbs': Initialization by Gibbs sampling.

Defaults to 'uniform'.

delimiters

list of characters, optional
Specifies the delimiters that separate words in a document.
For example, if the words are separated by , and :, then the delimiters should be ',' and ':'.
Defaults to ''.
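
As an illustration of what a set of single-character delimiters means for tokenization, the following base R sketch splits one document locally. The helper split_words is hypothetical and only mimics the idea; the actual splitting is performed by the procedure on the server and may differ in detail.

```r
# Hypothetical helper, not part of hana.ml.r.
# Assumption: every delimiter is a single character.
split_words <- function(text, delimiters) {
  delims <- paste(unlist(delimiters), collapse = "")
  # Map every delimiter character to a space, then split on spaces.
  normalized <- chartr(delims, strrep(" ", nchar(delims)), text)
  words <- strsplit(normalized, " ", fixed = TRUE)[[1]]
  # Drop empty strings produced by consecutive delimiters.
  words[nzchar(words)]
}

words <- split_words("toy,toy:spoon,cpu", list(",", ":"))
print(words)
```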

output.word.assignment

logical, optional
Controls whether to output the word-topic assignment. Note that if this parameter is set to TRUE, the procedure takes longer to return because it must write the WORD_TOPIC_ASSIGNMENT table.
Defaults to FALSE.

Value

Predicted values are returned as a list of DataFrames, structured as follows:

DataFrame 1: document-topic distribution.

  • Document ID column: with same name and type as data's document ID column.

  • TOPIC_ID: type INTEGER, topic ID.

  • PROBABILITY: type DOUBLE, probability of topic given document.

DataFrame 2: word-topic assignment.

  • Document ID column: with same name and type as data's document ID column.

  • WORD_ID: type INTEGER, word ID.

  • TOPIC_ID: type INTEGER, topic ID.

DataFrame 3: statistics.

  • STAT_NAME: type NVARCHAR(256), statistic name.

  • STAT_VALUE: type NVARCHAR(1000), statistic value.

Examples

Run the prediction on DataFrame data1 using the "LatentDirichletAllocation" object LDA:


> data1$Collect()
   DOCUMENT_ID                   TEXT
 1          10      toy toy spoon cpu

> result <- transform(LDA, data1, key = "DOCUMENT_ID",
                      document = "TEXT", burn.in = 2000,
                      iteration = 1000, thin = 100,
                      seed = 1, output.word.assignment = TRUE)

Output:


> result[[1]]$Collect()
    DOCUMENT_ID  TOPIC_ID               PROBABILITY
1            10         0       0.23913043478260873
2            10         1       0.4565217391304348
3            10         2       0.02173913043478261
4            10         3       0.02173913043478261
5            10         4       0.23913043478260873
6            10         5       0.02173913043478261
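
Once collected, the document-topic table is an ordinary data frame and can be post-processed locally. The sketch below is a minimal base R example that rebuilds the output shown above by hand (rather than querying a server) and picks the most probable topic per document; the variable names doc_topics and best are illustrative only.

```r
# Local copy of the collected output shown above (values typed in by hand).
doc_topics <- data.frame(
  DOCUMENT_ID = rep(10L, 6),
  TOPIC_ID = 0:5,
  PROBABILITY = c(0.23913043478260873, 0.4565217391304348,
                  0.02173913043478261, 0.02173913043478261,
                  0.23913043478260873, 0.02173913043478261)
)

# The topic probabilities of a document should sum to 1.
stopifnot(isTRUE(all.equal(sum(doc_topics$PROBABILITY), 1)))

# Keep the row with the highest probability for each document.
best <- do.call(rbind, lapply(
  split(doc_topics, doc_topics$DOCUMENT_ID),
  function(d) d[which.max(d$PROBABILITY), ]
))
print(best)
```

For document 10 this selects topic 1, whose probability is the largest value in the table.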