tf_analysis
- hana_ml.text.tm.tf_analysis(data, lang=None, enable_stopwords=None, keep_numeric=None)
Perform term frequency (TF) analysis on the given documents. TF is the number of times a term occurs in a document.
This function is available in HANA Cloud.
- Parameters:
- data : DataFrame
Input data, structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
- lang : str, optional
Language of the documents. HANA Cloud currently supports 'EN', 'DE', 'ES', 'FR', 'RU' and 'PT'. If None, the language is detected automatically.
Defaults to None (auto detection).
- enable_stopwords : bool, optional
Whether to enable stopword handling.
Defaults to True.
- keep_numeric : bool, optional
Whether to keep numeric tokens.
Defaults to False.
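As a rough illustration of what the two preprocessing flags control, the following pure-Python sketch filters a whitespace-tokenized string. It is not the HANA implementation: the helper `preprocess`, the tiny `STOPWORDS` set and the whitespace tokenization are all hypothetical stand-ins, and it assumes that enabling stopwords means removing them from the term list.

```python
# Hypothetical stand-in for a stopword list; HANA uses its own language-specific lists.
STOPWORDS = {"the", "a", "of", "in", "is"}


def preprocess(text, enable_stopwords=True, keep_numeric=False):
    """Split text on whitespace and apply the two optional filters."""
    tokens = text.split()
    if enable_stopwords:
        # Drop tokens found in the (assumed) stopword list.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if not keep_numeric:
        # Drop tokens that consist only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


sample = "the price of 3 apples is 42"
default_tokens = preprocess(sample)
with_numbers = preprocess(sample, keep_numeric=True)
with_stopwords = preprocess(sample, enable_stopwords=False)
```

With the defaults, only "price" and "apples" survive in this sketch; setting keep_numeric=True also keeps "3" and "42".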
- Returns:
- A tuple of DataFrames
TF-IDF result, structured as follows:
TM_TERM.
TM_TERM_FREQUENCY.
TM_IDF_FREQUENCY.
TF_VALUE.
IDF_VALUE.
TF_IDF_VALUE.
Document term frequency table, structured as follows:
ID.
TM_TERM.
TM_TERM_FREQUENCY.
Document category table, structured as follows:
ID.
Document category.
Examples
Input DataFrame df:
>>> df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke tf_analysis function:
>>> tfidf = tf_analysis(data=df)
Output:
>>> tfidf[0].head(3).collect()
   TM_TERMS  TM_TERM_TF_F  TM_TERM_IDF_F  TM_TERM_TF_V  TM_TERM_IDF_V
0     term1             1              1      0.030303       1.791759
1     term2             3              2      0.090909       1.098612
2     term3             7              4      0.212121       0.405465
>>> tfidf[1].head(3).collect()
     ID  TM_TERMS  TM_TERM_FREQUENCY
0  doc1     term1                  1
1  doc1     term2                  2
2  doc1     term3                  3
>>> tfidf[2].head(3).collect()
     ID    CATEGORY
0  doc1  CATEGORY_1
1  doc2  CATEGORY_1
2  doc3  CATEGORY_2
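The numbers in the first output table can be checked by hand. The pure-Python sketch below recomputes them from the six example documents, without a HANA connection. It assumes, based only on the values shown, that TM_TERM_TF_F is the term's total count across all documents, TM_TERM_IDF_F is the number of documents containing the term, TM_TERM_TF_V is the total count divided by the number of tokens in the corpus, and TM_TERM_IDF_V is ln(N / document count). The helper name `corpus_tf_idf` is hypothetical, and the formulas are inferred from the example rather than taken from the HANA implementation.

```python
import math
from collections import Counter

# The example corpus from the input DataFrame above.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}


def corpus_tf_idf(docs):
    """Return per-document term counts and per-term corpus statistics."""
    # Per-document term counts (compare with the second output table).
    per_doc = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    total_tokens = sum(sum(counts.values()) for counts in per_doc.values())
    n_docs = len(per_doc)

    stats = {}
    for term in sorted({t for counts in per_doc.values() for t in counts}):
        tf_count = sum(counts[term] for counts in per_doc.values())
        doc_count = sum(1 for counts in per_doc.values() if term in counts)
        stats[term] = {
            "tf_count": tf_count,                        # assumed TM_TERM_TF_F
            "doc_count": doc_count,                      # assumed TM_TERM_IDF_F
            "tf_value": tf_count / total_tokens,         # assumed TM_TERM_TF_V
            "idf_value": math.log(n_docs / doc_count),   # assumed TM_TERM_IDF_V
        }
    return per_doc, stats


per_doc, stats = corpus_tf_idf(docs)
for term in ("term1", "term2", "term3"):
    s = stats[term]
    print(term, s["tf_count"], s["doc_count"],
          round(s["tf_value"], 6), round(s["idf_value"], 6))
```

For term3, for example, the corpus contains 33 tokens and 7 of them are term3, giving 7 / 33 ≈ 0.212121; the term appears in 4 of 6 documents, giving ln(6 / 4) ≈ 0.405465.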