| hanaml.LatentDirichletAllocation {hana.ml.r} | R Documentation |
hanaml.LatentDirichletAllocation is an R wrapper for PAL Latent Dirichlet Allocation.
hanaml.LatentDirichletAllocation(conn.context, data = NULL,
key = NULL, document = NULL,
n.components = NULL, doc.topic.prior = NULL,
topic.word.prior = NULL, burn.in = NULL, iteration = NULL,
thin = NULL, seed = NULL, max.top.words = NULL,
threshold.top.words = NULL, gibbs.init = NULL,
delimiters = NULL, output.word.assignment = NULL)
conn.context |
|
data |
|
key |
|
document |
|
n.components |
|
doc.topic.prior |
|
topic.word.prior |
|
burn.in |
|
iteration |
|
thin |
|
seed |
Defaults to 0. |
max.top.words |
|
threshold.top.words |
|
gibbs.init |
|
delimiters |
|
output.word.assignment |
|
R6Class object.
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
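The generative process described above can be sketched in base R. This is only an illustration of the model, not the PAL implementation: the helper `rdirichlet1`, the function `generate.document`, the vocabulary, and the prior values `alpha` and `beta` are all hypothetical and chosen for the example.

```r
# Minimal sketch of the LDA generative process in base R.
# Not the PAL implementation; all names and values here are illustrative.
set.seed(1)

# Draw one sample from a Dirichlet distribution via normalized gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab    <- c("cpu", "memory", "tires", "helmet", "toy", "boots")
n.topics <- 3
alpha    <- rep(0.1, n.topics)       # document-topic prior (assumed value)
beta     <- rep(0.1, length(vocab))  # topic-word prior (assumed value)

# One word distribution per topic (rows: topics, columns: words).
topic.word <- t(sapply(seq_len(n.topics), function(k) rdirichlet1(beta)))

# Generate one document of n.words words.
generate.document <- function(n.words) {
  theta  <- rdirichlet1(alpha)  # topic mixture for this document
  topics <- sample(n.topics, n.words, replace = TRUE, prob = theta)
  words  <- vapply(topics,
                   function(k) sample(vocab, 1, prob = topic.word[k, ]),
                   character(1))
  paste(words, collapse = " ")
}

doc <- generate.document(8)
```

Fitting LDA runs this process in reverse: given only documents like `doc`, it estimates the per-document topic mixtures and the per-topic word distributions.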
A "LatentDirichletAllocation" object with the following attributes:
doc.topic.dist: DataFrame.
Document-topic distribution table, structured as follows:
Document ID column: with same name and type as data's
document ID column.
TOPIC_ID: type INTEGER, topic ID.
PROBABILITY: type DOUBLE, probability of topic given document.
word.topic.assignment: DataFrame.
Word-topic assignment table, structured as follows:
Document ID column: with same name and type as data's
document ID column.
WORD_ID: type INTEGER, word ID.
TOPIC_ID: type INTEGER, topic ID.
topic.top.words: DataFrame.
Topic top words table, structured as follows:
TOPIC_ID: type INTEGER, topic ID.
WORDS: type NVARCHAR(5000), topic top words separated by
spaces.
topic.word.dist: DataFrame.
Topic-word distribution table, structured as follows:
TOPIC_ID: type INTEGER, topic ID.
WORD_ID: type INTEGER, word ID.
PROBABILITY: type DOUBLE, probability of word given topic.
dictionary: DataFrame.
Dictionary table, structured as follows:
WORD_ID: type INTEGER, word ID.
WORD: type NVARCHAR(5000), word text.
statistics: DataFrame.
Statistics table, structured as follows:
STAT_NAME: type NVARCHAR(256), statistic name.
STAT_VALUE: type NVARCHAR(1000), statistic value.
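Once collected into local data frames, the result tables listed above can be combined with base R. The sketch below uses hypothetical stand-ins for `topic.word.dist`, `dictionary`, and `topic.top.words`; the values are invented to match the documented column layout and are not output of the function.

```r
# Hypothetical stand-ins for collected result tables (values are made up).
topic.word.dist <- data.frame(
  TOPIC_ID    = c(0, 0, 1, 1),
  WORD_ID     = c(0, 1, 0, 1),
  PROBABILITY = c(0.9, 0.1, 0.2, 0.8)
)
dictionary <- data.frame(
  WORD_ID = c(0, 1),
  WORD    = c("cpu", "tires"),
  stringsAsFactors = FALSE
)
topic.top.words <- data.frame(
  TOPIC_ID = c(0, 1),
  WORDS    = c("cpu tires", "tires cpu"),
  stringsAsFactors = FALSE
)

# Attach the word text to each (topic, word) probability via WORD_ID.
labelled <- merge(topic.word.dist, dictionary, by = "WORD_ID")
labelled <- labelled[order(labelled$TOPIC_ID, -labelled$PROBABILITY), ]

# WORDS holds space-separated top words; split it into one vector per topic.
top.words <- strsplit(topic.top.words$WORDS, " ", fixed = TRUE)
names(top.words) <- topic.top.words$TOPIC_ID
```

Joining on `WORD_ID` is what makes `topic.word.dist` readable, since that table stores only integer word IDs.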
## Not run:
Input DataFrame for topic modeling:
> data$Collect()
  DOCUMENT_ID TEXT
1          10 cpu harddisk graphiccard cpu monitor keyboard cpu memory memory
2          20 tires mountainbike wheels valve helmet mountainbike rearfender tires mountainbike mountainbike
3          30 carseat toy strollers toy toy spoon toy strollers toy carseat
4          40 sweaters sweaters sweaters boots sweaters rings vest vest shoe sweaters
Create a LatentDirichletAllocation instance:
LDA <- hanaml.LatentDirichletAllocation(conn, data, key = "DOCUMENT_ID",
document = "TEXT", n.components = 6,
doc.topic.prior = 0.1, burn.in = 50,
iteration = 100, thin = 10, seed = 1,
max.top.words = 5, output.word.assignment = 1)
Output example:
> LDA$doc.topic.dist$Collect()
DOCUMENT_ID TOPIC_ID PROBABILITY
1 10 0 0.010416667
2 10 1 0.010416667
3 10 2 0.010416667
4 10 3 0.010416667
5 10 4 0.947916667
6 10 5 0.010416667
7 20 0 0.009433962
8 20 1 0.009433962
9 20 2 0.009433962
10 20 3 0.952830189
11 20 4 0.009433962
12 20 5 0.009433962
13 30 0 0.103773585
14 30 1 0.858490566
15 30 2 0.009433962
16 30 3 0.009433962
17 30 4 0.009433962
18 30 5 0.009433962
19 40 0 0.009433962
20 40 1 0.009433962
21 40 2 0.952830189
22 40 3 0.009433962
23 40 4 0.009433962
24 40 5 0.009433962
## End(Not run)
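A collected document-topic table like the one in the example output can be post-processed locally. The sketch below copies the rows for documents 10 and 20 from that output into a plain data frame; the variable names `doc.topic`, `totals`, `by.doc`, and `dominant` are hypothetical, and no HANA connection is involved.

```r
# Rows copied from the example output above for documents 10 and 20.
doc.topic <- data.frame(
  DOCUMENT_ID = rep(c(10, 20), each = 6),
  TOPIC_ID    = rep(0:5, times = 2),
  PROBABILITY = c(0.010416667, 0.010416667, 0.010416667,
                  0.010416667, 0.947916667, 0.010416667,
                  0.009433962, 0.009433962, 0.009433962,
                  0.952830189, 0.009433962, 0.009433962)
)

# Each document's topic probabilities should sum to approximately one.
totals <- tapply(doc.topic$PROBABILITY, doc.topic$DOCUMENT_ID, sum)

# Keep the highest-probability row per document (its dominant topic).
by.doc   <- split(doc.topic, doc.topic$DOCUMENT_ID)
dominant <- do.call(rbind,
                    lapply(by.doc, function(d) d[which.max(d$PROBABILITY), ]))
```

With these rows, the dominant topic is 4 for document 10 and 3 for document 20.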