hanaml.LatentDirichletAllocation is an R wrapper for SAP HANA PAL Latent Dirichlet Allocation.
hanaml.LatentDirichletAllocation(
  data = NULL,
  key = NULL,
  document = NULL,
  n.components = NULL,
  doc.topic.prior = NULL,
  topic.word.prior = NULL,
  burn.in = NULL,
  iteration = NULL,
  thin = NULL,
  seed = NULL,
  max.top.words = NULL,
  threshold.top.words = NULL,
  gibbs.init = NULL,
  delimiters = NULL,
  output.word.assignment = NULL
)
| data | |
|---|---|
| key | |
| document | |
| n.components | |
| doc.topic.prior | |
| topic.word.prior | |
| burn.in | |
| iteration | |
| thin | |
| seed | Defaults to 0. |
| max.top.words | |
| threshold.top.words | |
| gibbs.init | |
| delimiters | |
| output.word.assignment | |
A "LatentDirichletAllocation" object with the following attributes:
doc.topic.dist: DataFrame
Document-topic distribution table, structured as follows:
Document ID column: with same name and type as data's
document ID column.
TOPIC_ID: type INTEGER, topic ID.
PROBABILITY: type DOUBLE, probability of topic given document.
word.topic.assignment: DataFrame
Word-topic assignment table, structured as follows:
Document ID column: with same name and type as data's
document ID column.
WORD_ID: type INTEGER, word ID.
TOPIC_ID: type INTEGER, topic ID.
topic.top.words: DataFrame
Topic top words table, structured as follows:
TOPIC_ID: type INTEGER, topic ID.
WORDS: type NVARCHAR(5000), topic top words separated by
spaces.
topic.word.dist: DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID: type INTEGER, topic ID.
WORD_ID: type INTEGER, word ID.
PROBABILITY: type DOUBLE, probability of word given topic.
dictionary: DataFrame
Dictionary table, structured as follows:
WORD_ID: type INTEGER, word ID.
WORD: type NVARCHAR(5000), word text.
statistics: DataFrame
Statistics table, structured as follows:
STAT_NAME: type NVARCHAR(256), statistic name.
STAT_VALUE: type NVARCHAR(1000), statistic value.
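The tables above share the WORD_ID and TOPIC_ID keys, so they can be combined locally after collection. The following is a minimal sketch, assuming that Collect() returns an ordinary R data.frame; the two data frames are hypothetical stand-ins with made-up values, and top_words is an illustrative helper, not part of the package.

```r
# Hypothetical stand-ins for LDA$topic.word.dist$Collect() and
# LDA$dictionary$Collect(); the values are made up for illustration.
topic.word.dist <- data.frame(
  TOPIC_ID    = c(0L, 0L, 0L, 1L, 1L, 1L),
  WORD_ID     = c(0L, 1L, 2L, 0L, 1L, 2L),
  PROBABILITY = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6)
)
dictionary <- data.frame(
  WORD_ID = c(0L, 1L, 2L),
  WORD    = c("cpu", "memory", "tires"),
  stringsAsFactors = FALSE
)

# Attach the word text to each (topic, word) probability.
labelled <- merge(topic.word.dist, dictionary, by = "WORD_ID")

# Order by topic, then by descending probability within each topic.
labelled <- labelled[order(labelled$TOPIC_ID, -labelled$PROBABILITY), ]

# Keep the n most probable words per topic, joined by spaces
# (the same shape as the WORDS column of topic.top.words).
top_words <- function(df, n = 2) {
  vapply(split(df$WORD, df$TOPIC_ID),
         function(w) paste(head(w, n), collapse = " "),
         character(1))
}

tw <- top_words(labelled)
print(tw)
```

Doing the join in R is convenient for small vocabularies; for large ones it may be preferable to join inside the database before collecting.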
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
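The generative view described above can be sketched in base R. This is an illustration of the model only, not of what PAL runs internally; the vocabulary, sizes, prior values, and the helper names rdirichlet1 and generate_document are all assumptions made for the example.

```r
# Sketch of the LDA generative process in base R (illustrative only).
set.seed(1)

# Draw one sample from a Dirichlet distribution via normalized Gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab    <- c("cpu", "memory", "tires", "helmet", "toy", "sweaters")  # toy vocabulary
n.topics <- 2    # number of latent topics (hypothetical)
alpha    <- 0.1  # symmetric document-topic prior (hypothetical)
beta     <- 0.1  # symmetric topic-word prior (hypothetical)
doc.len  <- 8    # words per generated document (hypothetical)

# One word distribution per topic (rows: topics, columns: vocabulary).
topic.word <- t(sapply(seq_len(n.topics),
                       function(k) rdirichlet1(rep(beta, length(vocab)))))

generate_document <- function() {
  # Topic mixture of this document.
  theta <- rdirichlet1(rep(alpha, n.topics))
  # Pick a topic for every word position.
  z <- sample(n.topics, doc.len, replace = TRUE, prob = theta)
  # Pick each word from the word distribution of its topic.
  w <- vapply(z, function(k) sample(vocab, 1, prob = topic.word[k, ]),
              character(1))
  paste(w, collapse = " ")
}

doc <- generate_document()
print(doc)
```

Fitting LDA runs this process in reverse: given only the documents, it estimates the per-document topic mixtures and the per-topic word distributions.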
Input DataFrame data:
> data$Collect()
DOCUMENT_ID TEXT
10 cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
20 tires mountainbike wheels valve helmet mountainbike rearfender tires mountainbike mountainbike
30 carseat toy strollers toy toy spoon toy strollers toy carseat
40 sweaters sweaters sweaters boots sweaters rings vest vest shoe sweaters
Call the function:
LDA <- hanaml.LatentDirichletAllocation(data, key = "DOCUMENT_ID", document = "TEXT", n.components = 6, doc.topic.prior = 0.1, burn.in = 50, iteration = 100, thin = 10, seed = 1, max.top.words = 5, output.word.assignment = 1)
Output:
> LDA$doc.topic.dist$Collect()
DOCUMENT_ID TOPIC_ID PROBABILITY
1 10 0 0.010416667
2 10 1 0.010416667
3 10 2 0.010416667
4 10 3 0.010416667
......
20 40 1 0.009433962
21 40 2 0.952830189
22 40 3 0.009433962
23 40 4 0.009433962
24 40 5 0.009433962
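A common next step is to pick the most probable topic for each document. The sketch below retypes rows 20 to 24 of the printed output as a local data.frame so that it runs without a HANA connection; on a live system the data would come from Collect(), assuming it returns an ordinary data.frame. The helper dominant_topic is illustrative and not part of the package.

```r
# Rows 20-24 of the printed output above, retyped as a local data.frame.
doc.topic <- data.frame(
  DOCUMENT_ID = c(40L, 40L, 40L, 40L, 40L),
  TOPIC_ID    = c(1L, 2L, 3L, 4L, 5L),
  PROBABILITY = c(0.009433962, 0.952830189, 0.009433962,
                  0.009433962, 0.009433962)
)

# For each document, keep the row with the highest topic probability.
dominant_topic <- function(df) {
  best <- lapply(split(df, df$DOCUMENT_ID),
                 function(d) d[which.max(d$PROBABILITY), ])
  out <- do.call(rbind, best)
  rownames(out) <- NULL
  out
}

dom <- dominant_topic(doc.topic)
print(dom)
```

For these rows, document 40 is assigned to topic 2, which carries about 95% of its probability mass. Note that which.max returns the first maximum, so ties are resolved in favor of the lowest row position.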