LatentDirichletAllocation
- class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
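The generative story behind LDA can be sketched in plain Python. This is a toy illustration, not the PAL implementation: `alpha` and `beta` play the roles of `doc_topic_prior` and `topic_word_prior`, and all function names are made up for this sketch.

```python
import random

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=1):
    """Toy sketch of LDA's generative process: each document draws a topic
    mixture, then each word position draws a topic and then a word."""
    rng = random.Random(seed)

    def dirichlet(dim, conc):
        # Sample a symmetric Dirichlet(conc) vector via normalized Gamma draws.
        g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
        s = sum(g)
        return [x / s for x in g]

    def draw(probs):
        # Draw an index from a discrete distribution.
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    # One word distribution per topic (the topic-word side, prior beta).
    topic_word = [dirichlet(vocab_size, beta) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic mixture per document (the doc-topic side, prior alpha).
        doc_topic = dirichlet(n_topics, alpha)
        corpus.append([draw(topic_word[draw(doc_topic)]) for _ in range(doc_len)])
    return corpus

corpus = generate_corpus(n_docs=4, doc_len=10, n_topics=6, vocab_size=21,
                         alpha=0.1, beta=0.1)
```

Fitting LDA inverts this process: given only the word IDs in `corpus`, it estimates the document-topic and topic-word distributions.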
- Parameters:
- n_componentsint
Expected number of topics in the corpus.
- doc_topic_priorfloat, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n_components.
- topic_word_priorfloat, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.
- burn_inint, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iterationint, optional
Number of Gibbs iterations.
Defaults to 2000.
- thinint, optional
Number of omitted in-between Gibbs iterations.
Value must be greater than 0.
Defaults to 1.
- seedint, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- max_top_wordsint, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.
- threshold_top_wordsfloat, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold.
It cannot be used together with parameter max_top_words.
- gibbs_initstr, optional
Specifies initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimitersa list of str, optional
Specifies the set of delimiters to separate words in a document.
Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignmentbool, optional
Controls whether to output the word_topic_assignment_ DataFrame or not. If True, output the word_topic_assignment_ DataFrame.
Defaults to False.
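The interplay of burn_in, thin and iteration can be pictured as selecting which Gibbs iterations contribute samples. The helper below is a hypothetical sketch under one common convention (discard the first burn_in iterations, then keep every thin-th one); the exact PAL schedule is not documented here.

```python
def kept_iterations(burn_in, thin, iteration):
    """Return the 1-based Gibbs iterations whose samples are kept, assuming
    the first `burn_in` iterations are discarded and, of the rest, only
    every `thin`-th iteration is retained."""
    return [t for t in range(burn_in + 1, iteration + 1)
            if (t - burn_in) % thin == 0]

# With burn_in=50, thin=10, iteration=100 (the values used in the example
# below), samples would be retained at iterations 60, 70, 80, 90, 100.
```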
Examples
Input DataFrame df:
>>> df.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Create a LDA instance:
>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10, iteration=100, seed=1, max_top_words=5, doc_topic_prior=0.1, output_word_assignment=True, delimiters=[' ', '\r', '\n'])
Perform fit():
>>> lda.fit(data=df, key='DOCUMENT_ID', document='TEXT')
Output:
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
...
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
...
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                    WORDS
0         0  spoon strollers tires graphiccard valve
1         1    toy strollers carseat graphiccard cpu
...
5         5    strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
...
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID     WORD
0        17    boots
1        12  carseat
...
19       19     vest
20        8   wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Input DataFrame df_transform to transform:
>>> df_transform.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Perform transform():
>>> res = lda.transform(data=df_transform, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100, iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
...
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
- Attributes:
- doc_topic_dist_DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column from fit().
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_assignment_DataFrame
Word-topic assignment table, structured as follows:
Document ID column, with same name and type as data's document ID column from fit().
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is set to False.
- topic_top_words_DataFrame
Topic top words table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORDS, type NVARCHAR(5000), topic top words separated by spaces.
Set to None if neither max_top_words nor threshold_top_words is provided.
- topic_word_dist_DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORD_ID, type INTEGER, word ID.
PROBABILITY, type DOUBLE, probability of word given topic.
- dictionary_DataFrame
Dictionary table, structured as follows:
WORD_ID, type INTEGER, word ID.
WORD, type NVARCHAR(5000), word text.
- statistic_DataFrame
Statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
Note
Parameters max_top_words and threshold_top_words cannot be used together.
Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
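The documented precedence rule amounts to a simple fallback chain, sketched below with a hypothetical helper (not part of the hana_ml API):

```python
def effective_param(transform_value, init_value, default):
    """Sketch of the documented precedence: a value passed to transform()
    overrides the one set in __init__(), which overrides the default."""
    if transform_value is not None:
        return transform_value
    if init_value is not None:
        return init_value
    return default

# e.g. thin set to 10 in __init__ and overridden to 100 in transform()
chosen = effective_param(100, 10, 1)
```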
Methods
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
fit(data[, key, document])
Fit the model to the given dataset.
fit_transform(data[, key, document])
Fit LDA model based on training data and return the topic assignment for the training documents.
set_model_state(state)
Set the model state by state information.
transform(data[, key, document, burn_in, ...])
Transform the topic assignment for new documents based on the previous LDA estimation results.
- fit(data, key=None, document=None)
Fit the model to the given dataset.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly one non-key (non-index) column, and document defaults to that column.
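The key/document defaulting rules above can be expressed as a small resolution function. This is a hypothetical sketch of the documented behavior, operating on plain column lists rather than a real hana_ml DataFrame:

```python
def resolve_columns(columns, index_columns, key=None, document=None):
    """Sketch of the documented defaulting rules for `key` and `document`:
    `columns` is the DataFrame's column list, `index_columns` its index."""
    if key is None:
        if len(index_columns) != 1:
            raise ValueError("'key' is mandatory when data is not indexed "
                             "or its index contains multiple columns")
        key = index_columns[0]
    if document is None:
        non_key = [c for c in columns if c != key]
        if len(non_key) != 1:
            raise ValueError("'document' must be given unless exactly one "
                             "non-key column remains")
        document = non_key[0]
    return key, document
```

For the example data above, `resolve_columns(['DOCUMENT_ID', 'TEXT'], ['DOCUMENT_ID'])` would resolve to `('DOCUMENT_ID', 'TEXT')`.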
- fit_transform(data, key=None, document=None)
Fit LDA model based on training data and return the topic assignment for the training documents.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly one non-key column, and document defaults to that column.
- Returns:
- DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Transform the topic assignment for new documents based on the previous LDA estimation results.
- Parameters:
- dataDataFrame
Independent variable values used for transform.
- keystr, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- documentstr, optional
Name of the document text column.
If document is not provided, data must have exactly one non-key column, and document defaults to that column.
- burn_inint, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iterationint, optional
Number of Gibbs iterations.
Defaults to 2000.
- thinint, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
- seedint, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
- gibbs_initstr, optional
Specifies initialization method for Gibbs sampling:
'uniform': Assign each word in each document a topic by uniform distribution.
'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimitersa list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignmentbool, optional
Controls whether to output the word_topic_df or not. If True, output the word_topic_df.
Defaults to False.
- Returns:
- DataFrame
DataFrame 1, document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
DataFrame 2, word-topic assignment table, structured as follows:
Document ID column, with same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
DataFrame 3, statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
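The delimiters parameter (each delimiter a single character, any of which separates words) can be mimicked with the standard library. This is an illustrative sketch of that tokenization rule, not the code PAL runs server-side:

```python
import re

def tokenize(text, delimiters=(' ',)):
    """Split a document into words on any of the given single-character
    delimiters, dropping empty tokens from delimiter runs."""
    pattern = '[' + re.escape(''.join(delimiters)) + ']+'
    return [w for w in re.split(pattern, text) if w]

words = tokenize("cpu harddisk\r\ngraphiccard cpu", delimiters=(' ', '\r', '\n'))
```

With `delimiters=[' ', '\r', '\n']`, as in the fit() example above, the text splits on spaces and line breaks alike.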
- create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Latent Dirichlet Allocation.
- pal_funcnameint or str, optional
PAL function name.
Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, it will delete the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
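The dict form of the state might look as follows. The four keys come from the documentation above; every value here is made up purely for illustration:

```python
# Hypothetical illustration of the dict accepted by set_model_state();
# all values are placeholders, not real connection details.
state = {
    "STATE_ID": "A1B2C3",   # identifier of the stored model state (illustrative)
    "HINT": "LDA",          # hint string (illustrative)
    "HOST": "hana-host",    # host of the HANA instance (illustrative)
    "PORT": "30015",        # SQL port (illustrative)
}

# The documented contract: all four keys must be present.
required = {"STATE_ID", "HINT", "HOST", "PORT"}
assert required <= state.keys()
```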
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
Inherited Methods from PALBase
Besides the methods mentioned above, the LatentDirichletAllocation class also inherits methods from the PALBase class; please refer to PAL Base for more details.