tf_analysis

hana_ml.text.tm.tf_analysis(data, lang=None, enable_stopwords=None, keep_numeric=None)

Performs term frequency (TF) analysis on the given documents. The TF of a term is the number of times the term occurs in a document.

Parameters:

data : DataFrame

Input data, structured as follows:

1st column, ID.
2nd column, Document content.
3rd column, Document category.

lang : str, optional

Specifies the language of the documents. HANA Cloud instances currently support 'EN', 'DE', 'ES', 'FR', 'RU', and 'PT'. If None, the language is detected automatically.

Defaults to None (auto detection).

enable_stopwords : bool, optional

Determines whether stopwords are turned on.

Defaults to True.

keep_numeric : bool, optional

Determines whether numeric terms are kept.

Defaults to False.
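As a rough illustration of what the two switches above usually mean in text preprocessing, the following pure-Python sketch filters a token list. It is not the HANA text engine: the `tokenize` helper and the `STOPWORDS` set are hypothetical, and the sketch assumes that enabling stopwords removes them from the result and that numeric terms are tokens made up entirely of digits.

```python
# Hypothetical stopword list, for illustration only; the real list is
# language-dependent and maintained by the server.
STOPWORDS = {"the", "a", "of", "in"}


def tokenize(text, enable_stopwords=True, keep_numeric=False):
    """Split text on whitespace and apply the two assumed filters."""
    tokens = text.lower().split()
    if enable_stopwords:
        # Assumption: turning stopwords on drops them from the term list.
        tokens = [t for t in tokens if t not in STOPWORDS]
    if not keep_numeric:
        # Assumption: a numeric term is a token consisting only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


default_tokens = tokenize("The price of 3 apples")
numeric_tokens = tokenize("The price of 3 apples", keep_numeric=True)
```

With the defaults documented above, `default_tokens` contains only `price` and `apples`; with `keep_numeric=True` the token `3` is retained as well.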

Returns:

A tuple of DataFrames

TF-IDF result, structured as follows:

TM_TERM.
TM_TERM_FREQUENCY.
TM_IDF_FREQUENCY.
TF_VALUE.
IDF_VALUE.
TF_IDF_VALUE.

Document term frequency table, structured as follows:

ID.
TM_TERM.
TM_TERM_FREQUENCY.

Document category table, structured as follows:

ID.
Document category.

Examples

Input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
 doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
 doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
 doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
 doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
 doc5                                              term4 term6     CATEGORY_3
 doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke tf_analysis function:

>>> from hana_ml.text.tm import tf_analysis
>>> tfidf = tf_analysis(data=df)

Output:

>>> tfidf[0].head(3).collect()
  TM_TERMS TM_TERM_TF_F  TM_TERM_IDF_F  TM_TERM_TF_V  TM_TERM_IDF_V
0    term1            1              1      0.030303       1.791759
1    term2            3              2      0.090909       1.098612
2    term3            7              4      0.212121       0.405465
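The sample values in the first result table can be reproduced offline with a short pure-Python sketch. The formulas are inferred from the example output and are an assumption, not a statement of how the library computes them: the TF value is taken as the term's occurrences in the whole corpus divided by the total number of tokens, and the IDF value as the natural logarithm of the number of documents divided by the number of documents containing the term. The `corpus_tf_idf` helper is hypothetical.

```python
import math
from collections import Counter

# Example corpus copied from the input DataFrame shown earlier.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}


def corpus_tf_idf(docs):
    """Return {term: (tf_count, doc_count, tf_value, idf_value)}.

    Assumed formulas (inferred from the sample output):
      tf_value  = occurrences of the term in the corpus / total tokens
      idf_value = ln(number of documents / documents containing the term)
    """
    term_counts = Counter()  # occurrences of each term across the corpus
    doc_counts = Counter()   # number of documents containing each term
    for text in docs.values():
        tokens = text.split()
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    total_tokens = sum(term_counts.values())
    n_docs = len(docs)
    return {
        term: (
            term_counts[term],
            doc_counts[term],
            term_counts[term] / total_tokens,
            math.log(n_docs / doc_counts[term]),
        )
        for term in sorted(term_counts)
    }


stats = corpus_tf_idf(docs)
for term in ("term1", "term2", "term3"):
    tf_f, idf_f, tf_v, idf_v = stats[term]
    print(f"{term} {tf_f} {idf_f} {tf_v:.6f} {idf_v:.6f}")
```

For the six example documents (33 tokens in total), the printed rows match the counts and values shown for term1, term2, and term3.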

>>> tfidf[1].head(3).collect()
     ID TM_TERMS  TM_TERM_FREQUENCY
0  doc1    term1                  1
1  doc1    term2                  2
2  doc1    term3                  3
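The per-document table is a long-format listing with one row per (document, term) pair. A minimal sketch of the same counting in plain Python, assuming whitespace tokenization and using a hypothetical `doc_term_rows` helper, looks like this:

```python
from collections import Counter

# First two documents from the example input.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
}


def doc_term_rows(docs):
    """Return (doc_id, term, frequency) rows, sorted by term within each document."""
    rows = []
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        for term in sorted(counts):
            rows.append((doc_id, term, counts[term]))
    return rows


rows = doc_term_rows(docs)
for row in rows[:3]:
    print(row)
```

The first three rows correspond to the doc1 entries in the output above.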

>>> tfidf[2].head(3).collect()
      ID    CATEGORY
0   doc1  CATEGORY_1
1   doc2  CATEGORY_1
2   doc3  CATEGORY_2