LatentDirichletAllocation
- class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
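As an illustration of this generative story, here is a toy Python sketch (not hana_ml code; the topic mixture and per-topic vocabularies below are made up): each word of a document is produced by first drawing a topic from the document's topic mixture, then drawing a word from that topic's word distribution.

```python
import random

random.seed(1)

# Toy per-topic word distributions P(word | topic) for two made-up topics.
topic_word = {
    0: (['cpu', 'monitor'], [0.7, 0.3]),
    1: (['tires', 'helmet'], [0.4, 0.6]),
}
# Made-up topic mixture P(topic | document) for a single document.
doc_topic = [0.8, 0.2]

def generate_word():
    # Draw a topic from the document's mixture, then a word from that topic.
    topic = random.choices([0, 1], weights=doc_topic)[0]
    words, probs = topic_word[topic]
    return random.choices(words, weights=probs)[0]

doc = [generate_word() for _ in range(6)]
print(doc)
```

LDA inverts this process: given only the documents, it infers the latent topic mixtures and topic-word distributions.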
- Parameters:
- n_components : int
Expected number of topics in the corpus.
- doc_topic_prior : float, optional
Specifies the prior weight related to the document-topic distribution.
Defaults to 50/n_components.
- topic_word_prior : float, optional
Specifies the prior weight related to the topic-word distribution.
Defaults to 0.1.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
- thin : int, optional
Number of omitted in-between Gibbs iterations.
Value must be greater than 0.
Defaults to 1.
- seed : int, optional
Specifies the seed used to initialize the random number generator:
0: uses the system time.
Not 0: uses the provided value.
Defaults to 0.
- max_top_words : int, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.
- threshold_top_words : float, optional
The algorithm outputs the top words of each topic whose probability is larger than this threshold.
Cannot be used together with parameter max_top_words.
- gibbs_init : str, optional
Specifies the initialization method for Gibbs sampling:
'uniform': assigns each word in each document a topic by uniform distribution.
'gibbs': assigns each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiters : list of str, optional
Specifies the set of delimiters used to separate words in a document.
Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignment : bool, optional
Controls whether to output the word_topic_assignment_ DataFrame. If True, the word_topic_assignment_ DataFrame is generated.
Defaults to False.
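To illustrate how a set of one-character delimiters is interpreted, here is a small pure-Python sketch (not hana_ml code; split_document is a hypothetical helper) of splitting a document into words on delimiters such as [' ', '\r', '\n']:

```python
import re

def split_document(text, delimiters=(' ',)):
    # Build a character class from the one-character delimiters and
    # split on runs of them, dropping empty tokens.
    pattern = '[' + re.escape(''.join(delimiters)) + ']+'
    return [w for w in re.split(pattern, text) if w]

words = split_document('cpu harddisk\r\ngraphiccard cpu',
                       delimiters=(' ', '\r', '\n'))
print(words)  # ['cpu', 'harddisk', 'graphiccard', 'cpu']
```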
Examples
Input DataFrame df:
>>> df.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Create a LDA instance:
>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1, max_top_words=5,
...                                 doc_topic_prior=0.1, output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])
Perform fit():
>>> lda.fit(data=df, key='DOCUMENT_ID', document='TEXT')
Output:
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
...
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
...
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                    WORDS
0         0  spoon strollers tires graphiccard valve
1         1    toy strollers carseat graphiccard cpu
...
5         5    strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
...
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID     WORD
0        17    boots
1        12  carseat
...
19       19     vest
20        8   wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Input DataFrame df_transform to transform:
>>> df_transform.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Perform transform():
>>> res = lda.transform(data=df_transform, key='DOCUMENT_ID', document='TEXT',
...                     burn_in=2000, thin=100, iteration=1000, seed=1,
...                     output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
...
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
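The reported PERPLEXITY is consistent with the standard definition perplexity = exp(-log_likelihood / N), where N is the number of words in the transformed documents (4 here, for 'toy toy spoon cpu'). A quick sanity check, assuming that definition:

```python
import math

# Recompute perplexity from the LOG_LIKELIHOOD reported above,
# using the standard definition exp(-log_likelihood / token_count).
log_likelihood = -7.925092991875363
n_tokens = 4  # 'toy toy spoon cpu' contains 4 words

perplexity = math.exp(-log_likelihood / n_tokens)
print(perplexity)  # ≈ 7.2519707, matching PERPLEXITY above
```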
- Attributes:
- doc_topic_dist_ : DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column from fit().
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_assignment_ : DataFrame
Word-topic assignment table, structured as follows:
Document ID column, with the same name and type as data's document ID column from fit().
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is set to False.
- topic_top_words_ : DataFrame
Topic top words table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORDS, type NVARCHAR(5000), topic top words separated by spaces.
Set to None if neither max_top_words nor threshold_top_words is provided.
- topic_word_dist_ : DataFrame
Topic-word distribution table, structured as follows:
TOPIC_ID, type INTEGER, topic ID.
WORD_ID, type INTEGER, word ID.
PROBABILITY, type DOUBLE, probability of word given topic.
- dictionary_ : DataFrame
Dictionary table, structured as follows:
WORD_ID, type INTEGER, word ID.
WORD, type NVARCHAR(5000), word text.
- statistic_ : DataFrame
Statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
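Each attribute's collect() yields a pandas DataFrame in long format, with one row per (document, topic) or (topic, word) pair. For downstream analysis, a wide document-by-topic matrix is often handier; a sketch using synthetic stand-in values (the probabilities below are made up, not real model output):

```python
import pandas as pd

# Synthetic stand-in for lda.doc_topic_dist_.collect(): long format,
# one row per (document, topic) pair.
long_df = pd.DataFrame({
    'DOCUMENT_ID': [10, 10, 20, 20],
    'TOPIC_ID':    [0, 1, 0, 1],
    'PROBABILITY': [0.9, 0.1, 0.2, 0.8],
})

# Pivot to a wide document-by-topic probability matrix.
wide = long_df.pivot(index='DOCUMENT_ID', columns='TOPIC_ID',
                     values='PROBABILITY')
print(wide.loc[10, 0])  # 0.9
```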
Note
Parameters max_top_words and threshold_top_words cannot be used together.
Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() take precedence over the corresponding ones set in __init__().
Methods
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
fit(data[, key, document])
Fit the model to the given dataset.
fit_transform(data[, key, document])
Fit the LDA model to the training data and return the topic assignment for the training documents.
get_model_metrics()
Get the model metrics.
get_score_metrics()
Get the score metrics.
set_model_state(state)
Set the model state by state information.
transform(data[, key, document, burn_in, ...])
Transform the topic assignment for new documents based on the previous LDA estimation results.
- fit(data, key=None, document=None)
Fit the model to the given dataset.
- Parameters:
- data : DataFrame
Training data.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
- fit_transform(data, key=None, document=None)
Fit LDA model based on training data and return the topic assignment for the training documents.
- Parameters:
- data : DataFrame
Training data.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- Returns:
- DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
- transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Transform the topic assignment for new documents based on the previous LDA estimation results.
- Parameters:
- data : DataFrame
Independent variable values used for transform.
- key : str, optional
Name of the document ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning.
Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
- iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
- thin : int, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
- seed : int, optional
Specifies the seed used to initialize the random number generator:
0: uses the system time.
Not 0: uses the provided value.
Defaults to 0.
- gibbs_init : str, optional
Specifies the initialization method for Gibbs sampling:
'uniform': assigns each word in each document a topic by uniform distribution.
'gibbs': assigns each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to 'uniform'.
- delimiters : list of str, optional
Specifies the set of delimiters used to separate words in a document. Each delimiter must be one character long.
Defaults to [' '].
- output_word_assignment : bool, optional
Controls whether to output the word-topic assignment table (DataFrame 2 of the returned tuple). If True, the word-topic assignment DataFrame is generated.
Defaults to False.
- Returns:
- DataFrame
DataFrame 1: document-topic distribution table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
DataFrame 2: word-topic assignment table, structured as follows:
Document ID column, with the same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
DataFrame 3: statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
- create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model : DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function : str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for LatentDirichletAllocation.
- pal_funcname : int or str, optional
PAL function name.
Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.
- state_description : str, optional
Description of the state as a model container.
Defaults to None.
- force : bool, optional
If True, deletes the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
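For illustration, a dict satisfying the key requirement described above could look like the following (all values are hypothetical placeholders, not real state information):

```python
# Hypothetical model-state dict; the required keys come from the
# description above, while the values are placeholders.
state = {
    'STATE_ID': 'A1B2C3D4',
    'HINT': 'LDA model state',
    'HOST': 'hana-host.example',
    'PORT': '30015',
}

required = {'STATE_ID', 'HINT', 'HOST', 'PORT'}
print(required.issubset(state))  # True
```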
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the LatentDirichletAllocation class also inherits methods from the PALBase class; please refer to PAL Base for more details.