TFIDF

class hana_ml.text.tm.TFIDF(language=None, enable_stopwords=True, keep_numeric=None, allowed_list=None, notallowed_list=None)

Class for term frequency–inverse document frequency.

Parameters:
language : str, {'en', 'de', 'es', 'fr', 'ru', 'pt'}, optional

Language of the input text. HANA Cloud instances currently support 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, the language is detected automatically.

Defaults to None (auto detection).

enable_stopwords : bool, optional

Whether to enable stopword filtering.

Defaults to True.

keep_numeric : bool, optional

Whether to keep numbers.

Valid only when enable_stopwords is True.

Defaults to False.

allowed_list : list of str, optional

A list of words that the stopwords logic retains.

Valid only when enable_stopwords is True.

notallowed_list : list of str, optional

A list of words that the stopwords logic recognizes and deletes.

Valid only when enable_stopwords is True.
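The interaction of the stopword-related parameters can be pictured with a small pure-Python sketch. This is not the PAL implementation: the function `filter_tokens`, the sample stopword set, case-insensitive matching, and the choice to let `notallowed_list` win when a word appears in both lists are all assumptions made for illustration.

```python
def filter_tokens(tokens, stopwords, keep_numeric=False,
                  allowed_list=(), notallowed_list=()):
    """Hypothetical stopword filter mirroring the parameter descriptions above."""
    allowed = {w.lower() for w in allowed_list}
    blocked = {w.lower() for w in notallowed_list}
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in blocked:
            # Words in notallowed_list are always deleted (assumed precedence).
            continue
        if low in allowed:
            # Words in allowed_list are retained even if they are stopwords.
            kept.append(tok)
            continue
        if tok.isdigit() and not keep_numeric:
            # Numeric tokens are dropped unless keep_numeric is set.
            continue
        if low in stopwords:
            continue
        kept.append(tok)
    return kept


# Hypothetical stopword set and sentence, for illustration only.
sample_stopwords = {"the", "has", "in"}
tokens = "the cat has 9 lives in the box".split()

default_result = filter_tokens(tokens, sample_stopwords)
numeric_result = filter_tokens(tokens, sample_stopwords, keep_numeric=True)
allowed_result = filter_tokens(tokens, sample_stopwords, allowed_list=["the"])
blocked_result = filter_tokens(tokens, sample_stopwords, notallowed_list=["box"])
```

In this sketch the default call keeps only the content words, while each optional argument adjusts that baseline in one direction.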

Examples

Input DataFrame:

>>> df_train.collect()
    ID      CONTENT
0   doc1    term1 term2 term2 term3 term3 term3
1   doc2    term2 term3 term3 term4 term4 term4
...
4   doc4    term4 term6
5   doc6    term4 term6 term6 term6

Creating a TFIDF instance:

>>> tfidf = TFIDF()

Performing text_collector():

>>> idf, _ = tfidf.text_collector(data=df_train)
>>> idf.collect()
        TM_TERMS    TM_TERM_IDF_VALUE
    0   term1       1.791759
    1   term2       1.098612
    2   term3       0.405465
    3   term4       0.182322
    4   term5       1.098612
    5   term6       1.098612

Performing text_tfidf():

>>> result = tfidf.text_tfidf(data=df_train)
>>> result.collect()
        ID      TERMS   TF_VALUE    TFIDF_VALUE
    0   doc1    term1   1.0         1.791759
    1   doc1    term2   2.0         2.197225
    2   doc1    term3   3.0         1.216395
    ...
    13  doc4    term6   1.0         1.098612
    14  doc6    term4   1.0         0.182322
    15  doc6    term6   3.0         3.295837
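The sample values above appear consistent with IDF = ln(N / df), where N is the number of documents and df is the number of documents containing the term, and with TFIDF_VALUE = TF_VALUE × IDF using raw counts as TF (for example, ln(6/1) ≈ 1.791759 and 3 × ln(6/2) ≈ 3.295837). The following sketch works through that arithmetic in plain Python under that assumption. The helper names `idf_table` and `tfidf_rows` and the three-document corpus are hypothetical and are not part of hana_ml.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to ln(N / df), with df = number of documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        # Count each term once per document.
        doc_freq.update(set(text.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_rows(docs, idf):
    """Return (doc_id, term, tf, tf * idf) rows, using raw term counts as TF."""
    rows = []
    for doc_id, text in docs.items():
        for term, tf in sorted(Counter(text.split()).items()):
            rows.append((doc_id, term, float(tf), tf * idf[term]))
    return rows


# Hypothetical corpus for illustration; this is not the df_train table above.
docs = {
    "a": "apple banana banana",
    "b": "banana cherry",
    "c": "cherry cherry date",
}
idf = idf_table(docs)
rows = tfidf_rows(docs, idf)

for row in rows:
    print(row)
```

Terms that occur in every document would receive an IDF of ln(1) = 0 under this formula, so they contribute nothing to the TF-IDF score.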

Methods

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

text_collector(data)

Computes the inverse document frequency of the documents provided by the user.

text_tfidf(data[, idf])

Computes the term frequency - inverse document frequency for each document.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

text_collector(data)

Computes the inverse document frequency of the documents provided by the user.

Parameters:
data : DataFrame

Data to be analyzed. The first column of the input data table is assumed to be an ID column.

Returns:
DataFrame
  • Inverse document frequency of documents.

  • Extended table.

text_tfidf(data, idf=None)

Computes the term frequency - inverse document frequency for each document.

Parameters:
data : DataFrame

Data to be analyzed.

The first column of the input data table is assumed to be an ID column.

idf : DataFrame, optional

Inverse document frequency of documents.

Returns:
DataFrame
  • Term frequency - inverse document frequency by document.
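One plausible use of the optional idf argument is to score documents against IDF values computed earlier from another set of documents. The plain-Python sketch below illustrates that idea only; it does not call hana_ml. The helpers `fit_idf` and `apply_idf`, the sample texts, the ln(N / df) formula, and the decision to skip terms missing from the IDF table are assumptions, since this page does not state how such terms are handled.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute ln(N / df) for every term seen in the training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for text in train_docs:
        doc_freq.update(set(text.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def apply_idf(text, idf):
    """Score one document with a precomputed IDF table.

    Terms absent from the table are skipped (an assumption of this sketch).
    """
    counts = Counter(text.split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


# Hypothetical training documents and a new document to score.
train = ["red green", "green blue", "blue blue red", "green"]
idf = fit_idf(train)
scores = apply_idf("red red green purple", idf)
print(scores)
```

Keeping the IDF table separate from the scoring step means the same term weights can be reused across calls instead of being recomputed from each new input.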

Inherited Methods from PALBase

In addition to the methods described above, the TFIDF class inherits methods from the PALBase class. Refer to PAL Base for more details.