LatentDirichletAllocation
- class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
- Parameters
- n_components : int
Expected number of topics in the corpus.
- doc_topic_prior : float, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n_components.
- topic_word_prior : float, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
- thin : int, optional
Number of omitted in-between Gibbs iterations.
Value must be greater than 0.
Defaults to 1.
- seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- max_top_words : int, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.
- threshold_top_words : float, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold.
It cannot be used together with parameter max_top_words.
- gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document.
Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignment : bool, optional
Controls whether to output the word_topic_assignment_ DataFrame. If True, the DataFrame is output.
Defaults to False.
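As a rough illustration of the delimiters rule and the doc_topic_prior default described above, the following plain-Python sketch splits a document on a set of single-character delimiters and computes 50/n_components. It is not PAL code: the helper names split_words and default_doc_topic_prior are hypothetical, the sample text is made up, and dropping empty tokens between consecutive delimiters is an assumption.

```python
import re


def split_words(text, delimiters=(' ',)):
    """Split text on any of the given single-character delimiters."""
    for d in delimiters:
        if len(d) != 1:
            raise ValueError("each delimiter must be one character long")
    # Build a character class such as [\ \r\n] from the delimiter set.
    pattern = "[" + "".join(re.escape(d) for d in delimiters) + "]"
    # Assumption: empty tokens produced by consecutive delimiters are dropped.
    return [tok for tok in re.split(pattern, text) if tok]


def default_doc_topic_prior(n_components):
    """Documented default for doc_topic_prior: 50 / n_components."""
    return 50 / n_components


# Hypothetical document using the delimiters from the example below.
words = split_words("cpu harddisk\r\ngraphiccard cpu", delimiters=[' ', '\r', '\n'])
print(words)
print(default_doc_topic_prior(6))
```

With n_components=6 the default prior would be roughly 8.33; the example below overrides it with doc_topic_prior=0.1.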
Examples
Input dataframe df1 for training:
>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Creating an LDA instance:
>>> from hana_ml.algorithms.pal.decomposition import LatentDirichletAllocation
>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10, iteration=100, seed=1, max_top_words=5, doc_topic_prior=0.1, output_word_assignment=True, delimiters=[' ', '\r', '\n'])
Performing fit() on given dataframe:
>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')
Output:
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
 0           10         0     0.010417
 1           10         1     0.010417
 2           10         2     0.010417
 3           10         3     0.010417
 4           10         4     0.947917
 5           10         5     0.010417
 6           20         0     0.009434
 7           20         1     0.009434
 8           20         2     0.009434
 9           20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
 0           10        0         4
 1           10        1         4
 2           10        2         4
 3           10        0         4
 4           10        3         4
 5           10        4         4
 6           10        0         4
 7           10        5         4
 8           10        5         4
 9           20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
 0         0        0     0.050000
 1         0        1     0.050000
 2         0        2     0.050000
 3         0        3     0.050000
 4         0        4     0.050000
 5         0        5     0.050000
 6         0        6     0.050000
 7         0        7     0.050000
 8         0        8     0.550000
 9         0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
 0       17         boots
 1       12       carseat
 2        0           cpu
 3        2   graphiccard
 4        1      harddisk
 5       10        helmet
 6        4      keyboard
 7        5        memory
 8        3       monitor
 9        7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Dataframe df2 to transform:
>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Performing transform on the given dataframe:
>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100, iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
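The statistics returned for df2 appear consistent with the usual definition of perplexity, exp(-log-likelihood / number of words), given that df2 contains four words. The check below is only a sketch under that assumption; it does not claim to show how PAL computes the value internally.

```python
import math

# Values copied from the stat_df output above.
log_likelihood = -7.925092991875363
reported_perplexity = 7.251970666272191

# Assumption: the word count is the 4 tokens of "toy toy spoon cpu".
n_words = len("toy toy spoon cpu".split(" "))

# Standard definition: perplexity = exp(-log_likelihood / number_of_words).
perplexity = math.exp(-log_likelihood / n_words)
print(perplexity, reported_perplexity)
```

Lower perplexity generally indicates that the fitted topics explain the new documents better.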
- Attributes
- doc_topic_dist_ : DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column from fit().
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_assignment_ : DataFrame
Word-topic assignment table, structured as follows:
Document ID column, with same name and type as data's document ID column from fit().
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is set to False.
- topic_top_words_ : DataFrame
Topic top words table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORDS, type NVARCHAR(5000), topic top words separated by spaces.
Set to None if neither max_top_words nor threshold_top_words is provided.
- topic_word_dist_ : DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORD_ID, type INTEGER, word ID.
PROBABILITY, type DOUBLE, probability of word given topic.
- dictionary_ : DataFrame
Dictionary table, structured as follows:
WORD_ID, type INTEGER, word ID.
WORD, type NVARCHAR(5000), word text.
- statistic_ : DataFrame
Statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
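Because the distribution tables are in long format (one row per document/topic pair), some post-processing is usually needed after collect(). The sketch below picks the most probable topic per document. It uses plain tuples copied from the example doc_topic_dist_ output as stand-ins for collected rows, and the helper name dominant_topics is hypothetical.

```python
# Rows of (DOCUMENT_ID, TOPIC_ID, PROBABILITY), copied from the example
# doc_topic_dist_ output for documents 10 and 30.
doc_topic_rows = [
    (10, 0, 0.010417), (10, 1, 0.010417), (10, 2, 0.010417),
    (10, 3, 0.010417), (10, 4, 0.947917), (10, 5, 0.010417),
    (30, 0, 0.103774), (30, 1, 0.858491), (30, 2, 0.009434),
    (30, 3, 0.009434), (30, 4, 0.009434), (30, 5, 0.009434),
]


def dominant_topics(rows):
    """Return {document_id: (topic_id, probability)} for the most likely topic."""
    best = {}
    for doc_id, topic_id, prob in rows:
        # Keep the row with the highest probability seen so far per document.
        if doc_id not in best or prob > best[doc_id][1]:
            best[doc_id] = (topic_id, prob)
    return best


top = dominant_topics(doc_topic_rows)
print(top)
```

With a collected pandas DataFrame, the same rows could be obtained with itertuples(index=False).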
Note
Parameters max_top_words and threshold_top_words cannot be used together.
Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
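The two rules in this note can be pictured with a small plain-Python sketch. It is not hana_ml code: check_top_words and effective_params are hypothetical helpers, and treating None as "not set, fall back to the __init__() value" is an assumption based on the None defaults in the signatures.

```python
def check_top_words(max_top_words=None, threshold_top_words=None):
    """Reject the documented invalid combination of top-word parameters."""
    if max_top_words is not None and threshold_top_words is not None:
        raise ValueError(
            "max_top_words and threshold_top_words cannot be used together"
        )


def effective_params(init_params, transform_params):
    """Merge parameters so that values given to transform() win."""
    merged = dict(init_params)
    for name, value in transform_params.items():
        # Assumption: None means the parameter was not set in transform().
        if value is not None:
            merged[name] = value
    return merged


# __init__() values from the example above; the transform() call is hypothetical.
init_params = {"burn_in": 50, "thin": 10, "iteration": 100, "seed": 1}
transform_params = {"burn_in": 2000, "thin": None, "iteration": 1000, "seed": None}

check_top_words(max_top_words=5)
effective = effective_params(init_params, transform_params)
print(effective)
```

Here burn_in and iteration come from the transform() call, while thin and seed fall back to the constructor values.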
Methods
create_model_state([model, function, ...])    Create PAL model state.
delete_model_state([state])    Delete PAL model state.
fit(data[, key, document])    Fit LDA model based on training data.
fit_transform(data[, key, document])    Fit LDA model based on training data and return the topic assignment for the training documents.
set_model_state(state)    Set the model state by state information.
transform(data[, key, document, burn_in, ...])    Transform the topic assignment for new documents based on the previous LDA estimation results.
- fit(data, key=None, document=None)
Fit LDA model based on training data.
- Parameters
- data : DataFrame
Training data.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
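The defaulting rules for key and document can be summarized in a short plain-Python sketch. This is only an illustration of the rules as documented, not the library's implementation; resolve_columns is a hypothetical helper, and the index is modelled as None, a column name, or a list of column names.

```python
def resolve_columns(columns, index=None, key=None, document=None):
    """Apply the documented defaults for the key and document columns."""
    if key is None:
        if isinstance(index, str):
            key = index
        elif index is not None and len(index) == 1:
            key = index[0]
        else:
            # No index, or an index spanning multiple columns.
            raise ValueError("key is mandatory for this data")
    if document is None:
        remaining = [c for c in columns if c != key]
        if len(remaining) != 1:
            raise ValueError("data must have exactly 1 non-key column")
        document = remaining[0]
    return key, document


print(resolve_columns(["DOCUMENT_ID", "TEXT"], index="DOCUMENT_ID"))
print(resolve_columns(["DOCUMENT_ID", "TEXT"], key="DOCUMENT_ID"))
```

The same rules are documented for fit_transform() and transform().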
- fit_transform(data, key=None, document=None)
Fit LDA model based on training data and return the topic assignment for the training documents.
- Parameters
- data : DataFrame
Training data.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- Returns
- DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Transform the topic assignment for new documents based on the previous LDA estimation results.
- Parameters
- data : DataFrame
Independent variable values used for transform.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
- thin : int, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
- seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignment : bool, optional
Controls whether to output the word_topic_df or not.
If True, output the word_topic_df.
Defaults to False.
- Returns
- DataFrame
DataFrame 1, document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
DataFrame 2, word-topic assignment table, structured as follows:
Document ID column, with same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
DataFrame 3, statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
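Since STAT_VALUE is documented as NVARCHAR(1000), the statistics presumably arrive as text and need converting before any arithmetic. The following sketch shows one way to do that in plain Python; parse_stats is a hypothetical helper, and the rows are string versions of the values shown in the stat_df example.

```python
def parse_stats(rows):
    """Convert (STAT_NAME, STAT_VALUE) string pairs to numbers where possible."""
    parsed = {}
    for name, value in rows:
        try:
            parsed[name] = int(value)
        except ValueError:
            try:
                parsed[name] = float(value)
            except ValueError:
                parsed[name] = value  # keep non-numeric values as text
    return parsed


# Assumption: values are strings, as implied by the NVARCHAR column type.
stat_rows = [
    ("DOCUMENTS", "1"),
    ("VOCABULARY_SIZE", "21"),
    ("LOG_LIKELIHOOD", "-7.925092991875363"),
    ("PERPLEXITY", "7.251970666272191"),
]

stats = parse_stats(stat_rows)
print(stats)
```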
- create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)
Create PAL model state.
- Parameters
- model : DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function : str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Latent Dirichlet Allocation.
- pal_funcname : int or str, optional
PAL function name.
Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, the existing state is deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
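A dict-form state could be checked for the four required keys before it is passed on. The sketch below is plain Python with placeholder values; missing_state_keys is a hypothetical helper, not part of hana_ml.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict, sorted."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values only; real values come from a previously created state.
state = {
    "STATE_ID": "<state-id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

print(missing_state_keys(state))
print(missing_state_keys({"STATE_ID": "<state-id>"}))
```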
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.