LatentDirichletAllocation

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters:

n_components: int

Expected number of topics in the corpus.

doc_topic_prior: float, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_prior: float, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_in: int, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration: int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin: int, optional

Number of omitted in-between Gibbs iterations.

Value must be greater than 0.

Defaults to 1.

seed: int, optional

Indicates the seed used to initialize the random number generator:

0: Uses the system time.

Not 0: Uses the provided value.

Defaults to 0.

max_top_words: int, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_words: float, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold.

It cannot be used together with parameter max_top_words.
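
The two selection modes can be illustrated with a minimal sketch in plain Python. The helper `top_words` and the toy probabilities are hypothetical; they only mimic the behaviour described for max_top_words and threshold_top_words and are not PAL's implementation.

```python
def top_words(word_probs, max_top_words=None, threshold_top_words=None):
    """Pick the top words of one topic from a {word: probability} mapping."""
    if max_top_words is not None and threshold_top_words is not None:
        raise ValueError("max_top_words and threshold_top_words cannot be used together")
    # Rank words from most to least probable.
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    if max_top_words is not None:
        return [word for word, _ in ranked[:max_top_words]]
    if threshold_top_words is not None:
        return [word for word, prob in ranked if prob > threshold_top_words]
    return None  # assumption: no top-word output when neither option is given


# Hypothetical topic-word probabilities for a single topic.
topic = {"cpu": 0.40, "memory": 0.25, "keyboard": 0.20, "toy": 0.15}
by_count = top_words(topic, max_top_words=2)
by_threshold = top_words(topic, threshold_top_words=0.18)
print(by_count)
print(by_threshold)
```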

gibbs_init: str, optional

Specifies initialization method for Gibbs sampling:

'uniform': Assign each word in each document a topic by uniform distribution.

'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.
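
As a rough illustration of the 'uniform' option, the sketch below draws a topic for every word occurrence uniformly at random. The function `uniform_init` is hypothetical and uses Python's own random number generator, so it will not reproduce PAL's assignments for a given seed.

```python
import random


def uniform_init(docs, n_topics, seed):
    """Assign each word occurrence a topic drawn uniformly from 0..n_topics-1."""
    rng = random.Random(seed)
    return [[rng.randrange(n_topics) for _ in doc] for doc in docs]


# Two tiny tokenized documents (hypothetical data).
docs = [["cpu", "harddisk", "cpu"], ["toy", "spoon"]]
assignments = uniform_init(docs, n_topics=6, seed=1)
print(assignments)
```

The 'gibbs' option differs in that the initial topics are drawn from the conditional distribution of one Gibbs sweep rather than from a uniform distribution.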

delimiters: list of str, optional

Specifies the set of delimiters to separate words in a document.

Each delimiter must be one character long.

Defaults to [' '].
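
The effect of a delimiter set can be previewed locally with a small sketch. The helper `split_words` is hypothetical; how PAL treats consecutive delimiters is not stated here, so dropping empty tokens is an assumption.

```python
import re


def split_words(text, delimiters=(" ",)):
    """Split text on any of the given single-character delimiters."""
    for delimiter in delimiters:
        if len(delimiter) != 1:
            raise ValueError("each delimiter must be one character long")
    # Build a character class such as [ \r\n] from the delimiters.
    pattern = "[" + "".join(re.escape(d) for d in delimiters) + "]"
    # Assumption: empty tokens produced by adjacent delimiters are discarded.
    return [token for token in re.split(pattern, text) if token]


words = split_words("cpu harddisk\r\ngraphiccard cpu", [" ", "\r", "\n"])
print(words)
```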

output_word_assignment: bool, optional

Controls whether the word_topic_assignment_ DataFrame is output. If True, the DataFrame is output; otherwise the attribute is set to None.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating an LDA instance:

>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
                                    iteration=100, seed=1,
                                    max_top_words=5, doc_topic_prior=0.1,
                                    output_word_assignment=True,
                                    delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
          10         0     0.010417
          10         1     0.010417
          10         2     0.010417
          10         3     0.010417
          10         4     0.947917
          10         5     0.010417
          20         0     0.009434
          20         1     0.009434
          20         2     0.009434
          20         3     0.952830
          20         4     0.009434
          20         5     0.009434
          30         0     0.103774
          30         1     0.858491
          30         2     0.009434
          30         3     0.009434
          30         4     0.009434
          30         5     0.009434
          40         0     0.009434
          40         1     0.009434
          40         2     0.952830
          40         3     0.009434
          40         4     0.009434
          40         5     0.009434
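
A common follow-up is to pick the most probable topic per document. The sketch below does this in plain Python on rows copied from the output above (documents 10 and 30 only); in practice the rows would come from the collected DataFrame.

```python
# (DOCUMENT_ID, TOPIC_ID, PROBABILITY) rows copied from the output above.
rows = [
    (10, 0, 0.010417), (10, 1, 0.010417), (10, 2, 0.010417),
    (10, 3, 0.010417), (10, 4, 0.947917), (10, 5, 0.010417),
    (30, 0, 0.103774), (30, 1, 0.858491), (30, 2, 0.009434),
    (30, 3, 0.009434), (30, 4, 0.009434), (30, 5, 0.009434),
]

# Keep, for every document, the topic with the highest probability.
dominant = {}
for doc_id, topic_id, prob in rows:
    if doc_id not in dominant or prob > dominant[doc_id][1]:
        dominant[doc_id] = (topic_id, prob)

print(dominant)
```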

>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
          10        0         4
          10        1         4
          10        2         4
          10        0         4
          10        3         4
          10        4         4
          10        0         4
          10        5         4
          10        5         4
          20        6         3
          20        7         3
          20        8         3
          20        9         3
          20       10         3
          20        7         3
          20       11         3
          20        6         3
          20        7         3
          20        7         3
          30       12         1
          30       13         1
          30       14         1
          30       13         1
          30       13         1
          30       15         0
          30       13         1
          30       14         1
          30       13         1
          30       12         1
          40       16         2
          40       16         2
          40       16         2
          40       17         2
          40       16         2
          40       18         2
          40       19         2
          40       19         2
          40       20         2
          40       16         2
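
The probabilities shown in doc_topic_dist_ are consistent with the usual smoothed estimate (n_dk + alpha) / (n_d + K * alpha), where n_dk counts the words of document d assigned to topic k, n_d is the document length, K is n_components and alpha is doc_topic_prior. This is an observation about the example numbers, not a statement about PAL internals; the helper name below is hypothetical.

```python
from collections import Counter


def doc_topic_distribution(topic_assignments, n_topics, doc_topic_prior):
    """Smoothed share of each topic among the words of one document."""
    counts = Counter(topic_assignments)
    total = len(topic_assignments) + n_topics * doc_topic_prior
    return [(counts[k] + doc_topic_prior) / total for k in range(n_topics)]


# Document 30 above: nine words assigned to topic 1 and one word to topic 0.
doc_30 = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
dist = doc_topic_distribution(doc_30, n_topics=6, doc_topic_prior=0.1)
print([round(p, 6) for p in dist])
```

With these inputs the values for topics 0 and 1 round to 0.103774 and 0.858491, matching the rows for document 30.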

>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
       0     spoon strollers tires graphiccard valve
       1       toy strollers carseat graphiccard cpu
       2              sweaters vest shoe rings boots
       3  mountainbike tires rearfender helmet valve
       4    cpu memory graphiccard keyboard harddisk
       5       strollers tires graphiccard cpu valve

>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
        0        0     0.050000
        0        1     0.050000
        0        2     0.050000
        0        3     0.050000
        0        4     0.050000
        0        5     0.050000
        0        6     0.050000
        0        7     0.050000
        0        8     0.550000
        0        9     0.050000
        1        0     0.050000
        1        1     0.050000
        1        2     0.050000
        1        3     0.050000
        1        4     0.050000
        1        5     0.050000
        1        6     0.050000
        1        7     0.050000
        1        8     0.050000
        1        9     0.550000
        2        0     0.025000
        2        1     0.025000
        2        2     0.525000
        2        3     0.025000
        2        4     0.025000
        2        5     0.025000
        2        6     0.025000
        2        7     0.275000
        2        8     0.025000
        2        9     0.025000
        3        0     0.014286
        3        1     0.014286
        3        2     0.014286
        3        3     0.585714
        3        4     0.157143
        3        5     0.014286
        3        6     0.157143
        3        7     0.014286
        3        8     0.014286
        3        9     0.014286

>>> lda.dictionary_.collect()
    WORD_ID          WORD
      17         boots
      12       carseat
       0           cpu
       2   graphiccard
       1      harddisk
      10        helmet
       4      keyboard
       5        memory
       3       monitor
       7  mountainbike
      11    rearfender
      18         rings
      20          shoe
      15         spoon
      14     strollers
      16      sweaters
       6         tires
      13           toy
       9         valve
      19          vest
       8        wheels

>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
                        iteration=1000, seed=1, output_word_assignment=True)

>>> doc_top_df, word_top_df, stat_df = res

>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
         10         0     0.239130
         10         1     0.456522
         10         2     0.021739
         10         3     0.021739
         10         4     0.239130
         10         5     0.021739

>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
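
The WORD_ID values can be mapped back to words through the dictionary from fit(). The sketch below uses a hand-copied subset of lda.dictionary_ and the four rows above; the variable names are illustrative.

```python
# Subset of lda.dictionary_ shown earlier: WORD_ID -> WORD.
dictionary = {0: "cpu", 13: "toy", 15: "spoon"}

# Rows of word_top_df: (DOCUMENT_ID, WORD_ID, TOPIC_ID).
assignments = [(10, 13, 1), (10, 13, 1), (10, 15, 0), (10, 0, 4)]

# Replace each word ID by its text and keep the assigned topic.
decoded = [(dictionary[word_id], topic_id) for _, word_id, topic_id in assignments]
print(decoded)
```

The decoded words reproduce the input text "toy toy spoon cpu" in order.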

>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
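
Because STAT_VALUE is a string column, the values need converting before use. The sketch below also checks that the reported PERPLEXITY is consistent with the common definition exp(-log-likelihood / number of words); that this is the formula PAL applies is an assumption based only on the numbers above.

```python
import math

# STAT_NAME -> STAT_VALUE, copied from stat_df above (values are strings).
stats = {
    "DOCUMENTS": "1",
    "VOCABULARY_SIZE": "21",
    "LOG_LIKELIHOOD": "-7.925092991875363",
    "PERPLEXITY": "7.251970666272191",
}

n_words = 4  # the transformed document is "toy toy spoon cpu"
log_likelihood = float(stats["LOG_LIKELIHOOD"])

# Perplexity under the assumed definition: exp of the negative mean log-likelihood.
perplexity = math.exp(-log_likelihood / n_words)
print(perplexity, float(stats["PERPLEXITY"]))
```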

Attributes:

doc_topic_dist_: DataFrame

Document-topic distribution table, structured as follows:

Document ID column, with same name and type as data's document ID column from fit().

TOPIC_ID, type INTEGER, topic ID.

PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_: DataFrame

Word-topic assignment table, structured as follows:

Document ID column, with same name and type as data's document ID column from fit().

WORD_ID, type INTEGER, word ID.

TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_: DataFrame

Topic top words table, structured as follows:

TOPIC_ID, type INTEGER, topic ID.

WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_: DataFrame

Topic-word distribution table, structured as follows:

TOPIC_ID, type INTEGER, topic ID.

WORD_ID, type INTEGER, word ID.

PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_: DataFrame

Dictionary table, structured as follows:

WORD_ID, type INTEGER, word ID.

WORD, type NVARCHAR(5000), word text.

statistic_: DataFrame

Statistics table, structured as follows:

STAT_NAME, type NVARCHAR(256), statistic name.

STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

Parameters max_top_words and threshold_top_words cannot be used together.
Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
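
The precedence rule can be pictured as a simple fallback chain. The sketch below is hypothetical: the `resolve` helper is not part of hana_ml, and falling back to the constructor value when transform() receives None is an assumption drawn from the note above.

```python
def resolve(call_value, init_value, default):
    """Prefer the transform() value, then the __init__() value, then the default."""
    if call_value is not None:
        return call_value
    if init_value is not None:
        return init_value
    return default


# Values from the example: constructor settings and transform() overrides.
init_params = {"burn_in": 50, "thin": 10, "iteration": 100, "gibbs_init": None}
call_params = {"burn_in": 2000, "thin": 100, "iteration": 1000, "gibbs_init": None}
defaults = {"burn_in": 0, "thin": 1, "iteration": 2000, "gibbs_init": "uniform"}

effective = {
    name: resolve(call_params[name], init_params[name], defaults[name])
    for name in defaults
}
print(effective)
```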

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, document])	Fit LDA model based on training data.
`fit_transform`(data[, key, document])	Fit LDA model based on training data and return the topic assignment for the training documents.
`set_model_state`(state)	Set the model state by state information.
`transform`(data[, key, document, burn_in, ...])	Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(data, key=None, document=None)

Fit LDA model based on training data.

Parameters:

data: DataFrame

Training data.

key: str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document: str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
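
The defaulting rule for document can be sketched as follows. The helper `resolve_document_column` is hypothetical and works on a plain list of column names; it only restates the rule above and is not the library's code.

```python
def resolve_document_column(columns, key, document=None):
    """Return the text column: the given one, or the only non-key column."""
    if document is not None:
        return document
    candidates = [column for column in columns if column != key]
    if len(candidates) != 1:
        raise ValueError("data must have exactly 1 non-key column when document is omitted")
    return candidates[0]


chosen = resolve_document_column(["DOCUMENT_ID", "TEXT"], key="DOCUMENT_ID")
print(chosen)
```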

fit_transform(data, key=None, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters:

data: DataFrame

Training data.

key: str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document: str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns:

DataFrame

Document-topic distribution table, structured as follows:

Document ID column, with same name and type as data's document ID column.

TOPIC_ID, type INTEGER, topic ID.

PROBABILITY, type DOUBLE, probability of topic given document.

transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters:

data: DataFrame

Data containing the new documents to transform.

key: str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document: str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_in: int, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration: int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin: int, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seed: int, optional

Indicates the seed used to initialize the random number generator:

0: Uses the system time.

Not 0: Uses the provided value.

Defaults to 0.

gibbs_init: str, optional

Specifies initialization method for Gibbs sampling:

'uniform': Assign each word in each document a topic by uniform distribution.

'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiters: list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [' '].

output_word_assignment: bool, optional

Controls whether the word-topic assignment table (DataFrame 2 under Returns) is output.

If True, the table is output; otherwise None is returned in its place.

Defaults to False.

Returns:

DataFrames

Three DataFrames are returned, in the following order.

DataFrame 1, document-topic distribution table, structured as follows:

Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.

DataFrame 2, word-topic assignment table, structured as follows:

Document ID column, with same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

DataFrame 3, statistics table, structured as follows:

STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.

create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)

Create PAL model state.

Parameters:

model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for Latent Dirichlet Allocation.

pal_funcname: int or str, optional

PAL function name.

Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state is deleted.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must include STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values corresponding to NAME.

If state is dict, the keys must include STATE_ID, HINT, HOST and PORT.
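
Before calling set_model_state() with a dict, the required keys can be checked locally. The sketch below is illustrative only: `missing_state_keys` is a hypothetical helper and the values are placeholders, not real connection details.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values; real values come from an existing model state.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>", "HOST": "<host>", "PORT": "<port>"}
incomplete = {"STATE_ID": "<state-id>", "HOST": "<host>"}

print(missing_state_keys(state))
print(missing_state_keys(incomplete))
```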

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state: DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods listed above, the LatentDirichletAllocation class also inherits methods from the PALBase class. Refer to PAL Base for more details.