LatentDirichletAllocation

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
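
In probabilistic terms, each document d draws a topic mixture θ_d from Dirichlet(doc_topic_prior), and each topic k draws a word distribution φ_k from Dirichlet(topic_word_prior); the probability of a word w occurring in document d is then the mixture

$$p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{d,k}\,\varphi_{k,w}, \qquad K = \text{n\_components},$$

which PAL estimates by Gibbs sampling (see the burn_in, iteration, thin, seed and gibbs_init parameters below).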

Parameters
n_components : int

Expected number of topics in the corpus.

doc_topic_prior : float, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.
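
For example, with n_components=6 as in the Examples section below, the default would be 50/6 ≈ 8.33; the example instead sets doc_topic_prior=0.1.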

topic_word_prior : float, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration : int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin : int, optional

Number of omitted in-between Gibbs iterations.

Value must be greater than 0.

Defaults to 1.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

max_top_words : int, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_words : float, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold.

It cannot be used together with parameter max_top_words.

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • 'uniform': Assign each word in each document a topic by uniform distribution.

  • 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document.

Each delimiter must be one character long.

Defaults to [' '].

output_word_assignment : bool, optional

Controls whether to output the word_topic_assignment_ DataFrame. If False, word_topic_assignment_ is set to None after fitting.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating a LDA instance:

>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
                                    iteration=100, seed=1,
                                    max_top_words=5, doc_topic_prior=0.1,
                                    output_word_assignment=True,
                                    delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
                        iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
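
As a consistency check, the reported PERPLEXITY agrees with the standard definition exp(-LOG_LIKELIHOOD / N), where N is the number of scored tokens: df2 contains 4 words, and exp(7.925093 / 4) ≈ 7.2520.
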
Attributes
doc_topic_dist_ : DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_ : DataFrame

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_ : DataFrame

Topic top words table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_ : DataFrame

Topic-word distribution table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORD_ID, type INTEGER, word ID.

  • PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_ : DataFrame

Dictionary table, structured as follows:

  • WORD_ID, type INTEGER, word ID.

  • WORD, type NVARCHAR(5000), word text.

statistic_ : DataFrame

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

  • Parameters max_top_words and threshold_top_words cannot be used together.

  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, document])

Fit LDA model based on training data.

fit_transform(data[, key, document])

Fit LDA model based on training data and return the topic assignment for the training documents.

set_model_state(state)

Set the model state by state information.

transform(data[, key, document, burn_in, ...])

Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(data, key=None, document=None)

Fit LDA model based on training data.

Parameters
data : DataFrame

Training data.

key : str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document : str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
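
For instance (a minimal sketch; the indexed variant assumes your hana_ml version provides DataFrame.set_index):

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')   # explicit columns
>>> df1_idx = df1.set_index('DOCUMENT_ID')                  # assumption: single-column index
>>> lda.fit(data=df1_idx)   # key defaults to the index, document to the remaining column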

fit_transform(data, key=None, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters
data : DataFrame

Training data.

key : str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document : str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.
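
A minimal usage sketch, reusing df1 from the Examples section above:

>>> lda = LatentDirichletAllocation(n_components=6, seed=1)
>>> doc_top = lda.fit_transform(data=df1, key='DOCUMENT_ID', document='TEXT')
>>> doc_top.collect()   # one row per (document, topic) pair with its probability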

transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters
data : DataFrame

Independent variable values used for transform.

key : str, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

document : str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration : int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin : int, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • 'uniform': Assign each word in each document a topic by uniform distribution.

  • 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [' '].

output_word_assignment : bool, optional

Controls whether to output the word-topic assignment DataFrame (the second DataFrame in the returned tuple).

If False, that DataFrame is returned as None.

Defaults to False.

Returns
DataFrames

DataFrame 1, document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

DataFrame 2, word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data's document ID column.

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

DataFrame 3, statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.
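
With output_word_assignment left at its default of False, the second element of the returned tuple is None:

>>> doc_top, word_top, stat = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT')
>>> word_top is None
True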

create_model_state(model=None, function=None, pal_funcname='PAL_LATENT_DIRICHLET_ALLOCATION', state_description=None, force=False)

Create PAL model state.

Parameters
model : DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function : str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for Latent Dirichlet Allocation.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_LATENT_DIRICHLET_ALLOCATION'.

state_description : str, optional

Description of the state as model container.

Defaults to None.

force : bool, optional

If True, it deletes the existing state.

Defaults to False.
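
A minimal sketch of the state lifecycle after fit() (the created state handle is tracked in self.state, which delete_model_state() uses by default):

>>> lda.create_model_state(state_description='LDA topic model', force=True)
>>> lda.state.collect()        # inspect the created state descriptor
>>> lda.delete_model_state()   # drop the state referenced by self.state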

set_model_state(state)

Set the model state by state information.

Parameters
state : DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the value corresponding to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
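
For example, in dict form (placeholder values shown; the real STATE_ID, HOST and PORT come from a previously created state):

>>> lda.set_model_state({'STATE_ID': '<state-id>', 'HINT': '<hint>',
...                      'HOST': '<host>', 'PORT': '<port>'})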

delete_model_state(state=None)

Delete PAL model state.

Parameters
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the LatentDirichletAllocation class also inherits methods from the PALBase class; please refer to PAL Base for more details.