LatentDirichletAllocation
- class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
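To make the generative model concrete, here is a toy NumPy sketch of that process (an illustration only, not hana_ml code; the Dirichlet concentration values mirror this class's doc_topic_prior and topic_word_prior defaults):

```python
import numpy as np

rng = np.random.default_rng(1)

n_topics, vocab_size, doc_len = 6, 21, 10
doc_topic_prior = 50 / n_topics   # default alpha in this API
topic_word_prior = 0.1            # default beta in this API

# Topic-word distributions: one categorical over the vocabulary per topic.
phi = rng.dirichlet([topic_word_prior] * vocab_size, size=n_topics)

# Document-topic distribution for a single document.
theta = rng.dirichlet([doc_topic_prior] * n_topics)

# Generate each word: draw a latent topic, then draw a word from that topic.
topics = rng.choice(n_topics, size=doc_len, p=theta)
words = [rng.choice(vocab_size, p=phi[z]) for z in topics]
```

Fitting inverts this process: given only the words, Gibbs sampling recovers the latent document-topic and topic-word distributions.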
- Parameters
- n_componentsint
Expected number of topics in the corpus.
- doc_topic_priorfloat, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n_components.
- topic_word_priorfloat, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.
- burn_inint, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iterationint, optional
Number of Gibbs iterations.
Defaults to 2000.
- thinint, optional
Number of omitted in-between Gibbs iterations.
Value must be greater than 0.
Defaults to 1.
- seedint, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- max_top_wordsint, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.
- threshold_top_wordsfloat, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold.
It cannot be used together with parameter max_top_words.
- gibbs_initstr, optional
Specifies the initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiterslist of str, optional
Specifies the set of delimiters to separate words in a document.
Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignmentbool, optional
Controls whether to output the word_topic_assignment_ DataFrame.
Defaults to False.
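For intuition about the delimiters parameter above: each one-character delimiter acts as a word separator. A client-side sketch of equivalent tokenization (hana_ml does this in the PAL engine; whether consecutive delimiters yield empty tokens there is an assumption, so they are filtered here):

```python
import re

delimiters = [' ', '\r', '\n']
# Build a character class from the one-character delimiters
# and split on runs of them.
pattern = '[' + re.escape(''.join(delimiters)) + ']+'
text = 'cpu harddisk\r\ngraphiccard cpu'
words = [w for w in re.split(pattern, text) if w]
# words == ['cpu', 'harddisk', 'graphiccard', 'cpu']
```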
Examples
Input dataframe df1 for training:
>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Creating a LDA instance:
>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1, max_top_words=5,
...                                 doc_topic_prior=0.1,
...                                 output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])
Performing fit() on given dataframe:
>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')
Output:
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                        WORDS
0         0      spoon strollers tires graphiccard valve
1         1        toy strollers carseat graphiccard cpu
2         2               sweaters vest shoe rings boots
3         3   mountainbike tires rearfender helmet valve
4         4     cpu memory graphiccard keyboard harddisk
5         5        strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Dataframe df2 to transform:
>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Performing transform on the given dataframe:
>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT',
...                     burn_in=2000, thin=100, iteration=1000, seed=1,
...                     output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
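The transform statistics are consistent with the standard LDA definition perplexity = exp(-log_likelihood / token_count); df2's document "toy toy spoon cpu" has 4 tokens. A quick sanity check (the formula is the conventional one, and treating it as PAL's internal computation is an assumption):

```python
import math

log_likelihood = -7.925092991875363   # LOG_LIKELIHOOD from stat_df above
n_tokens = 4                          # 'toy toy spoon cpu' has 4 words
perplexity = math.exp(-log_likelihood / n_tokens)
# perplexity is close to the reported value 7.251970666272191
```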
- Attributes
- doc_topic_dist_DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column from fit().
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_assignment_DataFrame
Word-topic assignment table, structured as follows:
Document ID column, with the same name and type as data's document ID column from fit().
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is set to False.
- topic_top_words_DataFrame
Topic top words table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORDS, type NVARCHAR(5000), topic top words separated by spaces.
Set to None if neither max_top_words nor threshold_top_words is provided.
- topic_word_dist_DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORD_ID, type INTEGER, word ID.
PROBABILITY, type DOUBLE, probability of word given topic.
- dictionary_DataFrame
Dictionary table, structured as follows:
WORD_ID, type INTEGER, word ID.
WORD, type NVARCHAR(5000), word text.
- statistic_DataFrame
Statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
Note
Parameters max_top_words and threshold_top_words cannot be used together.
Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
Methods
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
fit(data[, key, document])
Fit LDA model based on training data.
fit_transform(data[, key, document])
Fit LDA model based on training data and return the topic assignment for the training documents.
set_model_state(state)
Set the model state by state information.
transform(data[, key, document, burn_in, ...])
Transform the topic assignment for new documents based on the previous LDA estimation results.
- fit(data, key=None, document=None)
Fit LDA model based on training data.
- Parameters
- dataDataFrame
Training data.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
- fit_transform(data, key=None, document=None)
Fit LDA model based on training data and return the topic assignment for the training documents.
- Parameters
- dataDataFrame
Training data.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- Returns
- DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Transform the topic assignment for new documents based on the previous LDA estimation results.
- Parameters
- dataDataFrame
Independent variable values used for transform.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- burn_inint, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iterationint, optional
Number of Gibbs iterations.
Defaults to 2000.
- thinint, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
- seedint, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- gibbs_initstr, optional
Specifies initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiterslist of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignmentbool, optional
Controls whether to output the word-topic assignment DataFrame (DataFrame 2 of the returned tuple).
Defaults to False.
- Returns
- DataFrame
DataFrame 1, document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
DataFrame 2, word-topic assignment table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
DataFrame 3, statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
- create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)
Create PAL model state.
- Parameters
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Latent Dirichlet Allocation.
- pal_funcnameint or str, optional
PAL function name.
Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, it will delete the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
Besides those methods mentioned above, the LatentDirichletAllocation class also inherits methods from PALBase class, please refer to PAL Base for more details.