TFIDF
- class hana_ml.text.tm.TFIDF(language=None, enable_stopwords=True, keep_numeric=None, allowed_list=None, notallowed_list=None)
Class for term frequency–inverse document frequency.
- Parameters
- language : str, {'en', 'de', 'es', 'fr', 'ru', 'pt'}, optional
Specifies the language of the text. A HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU' and 'PT'. If None, the language is detected automatically.
Defaults to None (auto detection).
- enable_stopwords : bool, optional
Determines whether stopword handling is turned on.
Defaults to True.
- keep_numeric : bool, optional
Determines whether numbers are kept.
Valid only when enable_stopwords is True.
Defaults to False.
- allowed_list : list of str, optional
A list of words that are retained by the stopwords logic.
Valid only when enable_stopwords is True.
- notallowed_list : list of str, optional
A list of words that are recognized and deleted by the stopwords logic.
Valid only when enable_stopwords is True.
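The interaction of the four stopword-related parameters can be pictured with a small pure-Python filter. This is only a sketch of one reading of the descriptions above, not the HANA implementation: the `DEFAULT_STOPWORDS` set, the `filter_tokens` helper, the whitespace tokenization and the rule that `allowed_list` takes precedence over `notallowed_list` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a built-in stopword list (the real list is internal to HANA).
DEFAULT_STOPWORDS = {"the", "a", "of", "and", "not"}


def filter_tokens(tokens, enable_stopwords=True, keep_numeric=False,
                  allowed_list=None, notallowed_list=None):
    """Illustrative token filter; an assumption, not the library's logic."""
    if not enable_stopwords:
        # The other options are documented as valid only when enable_stopwords is True.
        return list(tokens)
    allowed = {w.lower() for w in (allowed_list or [])}
    notallowed = {w.lower() for w in (notallowed_list or [])}
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in allowed:
            # Assumed precedence: explicitly allowed words are always retained.
            kept.append(tok)
        elif low in notallowed or low in DEFAULT_STOPWORDS:
            continue  # treated as a stopword and dropped
        elif not keep_numeric and re.fullmatch(r"\d+(\.\d+)?", tok):
            continue  # numeric token dropped unless keep_numeric is True
        else:
            kept.append(tok)
    return kept


tokens = "the price of 3 apples is not 42".split()
print(filter_tokens(tokens))
print(filter_tokens(tokens, keep_numeric=True,
                    allowed_list=["not"], notallowed_list=["is"]))
```

With the defaults, the stand-in stopwords and the numbers are dropped; in the second call the numbers and "not" survive while "is" is removed.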
Methods
text_collector(data): Computes the inverse document frequency of the documents provided by the user.
text_tfidf(data[, idf]): Computes the term frequency-inverse document frequency for each document.
Examples
Input DataFrame:
>>> df_train.collect()
   ID    CONTENT
0  doc1  term1 term2 term2 term3 term3 term3
1  doc2  term2 term3 term3 term4 term4 term4
...
4  doc4  term4 term6
5  doc6  term4 term6 term6 term6
Creating a TFIDF instance:
>>> tfidf = TFIDF()
Performing text_collector():
>>> idf, _ = tfidf.text_collector(data=df_train)
>>> idf.collect()
   TM_TERMS  TM_TERM_IDF_VALUE
0  term1     1.791759
1  term2     1.098612
2  term3     0.405465
3  term4     0.182322
4  term5     1.098612
5  term6     1.098612
Performing text_tfidf():
>>> result = tfidf.text_tfidf(data=df_train)
>>> result.collect()
    ID    TERMS  TF_VALUE  TFIDF_VALUE
0   doc1  term1  1.0       1.791759
1   doc1  term2  2.0       2.197225
2   doc1  term3  3.0       1.216395
...
13  doc4  term6  1.0       1.098612
14  doc6  term4  1.0       0.182322
15  doc6  term6  3.0       3.295837
- text_collector(data)
Computes the inverse document frequency of the documents provided by the user.
- Parameters
- data : DataFrame
Data to be analyzed. The first column of the input data table is assumed to be an ID column.
- Returns
- DataFrame
Inverse document frequency of documents.
Extended table.
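The IDF values in the example output appear consistent with the natural logarithm of N / df, where N is the number of documents (six in the example) and df is the number of documents containing the term: 1.791759 is ln(6/1) and 0.405465 is ln(6/4). The following is a minimal pure-Python sketch of that formula, assuming whitespace tokenization; the `inverse_document_frequency` helper and the three-document corpus are hypothetical and are not part of hana_ml.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Return {term: ln(N / df)} for a mapping of document ID to text."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(text.split()))
    return {term: math.log(n_docs / df) for term, df in sorted(doc_freq.items())}


# Hypothetical corpus, used only to illustrate the formula.
docs = {
    "d1": "apple banana banana",
    "d2": "banana cherry",
    "d3": "banana cherry cherry",
}
idf = inverse_document_frequency(docs)
for term, value in idf.items():
    print(term, round(value, 6))
```

A term that occurs in every document ("banana" here) gets an IDF of zero under this formula, while rarer terms get larger values.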
- text_tfidf(data, idf=None)
Computes the term frequency-inverse document frequency for each document.
- Parameters
- data : DataFrame
Data to be analyzed.
The first column of the input data table is assumed to be an ID column.
- idf : DataFrame, optional
Inverse document frequency of documents.
- Returns
- DataFrame
Term frequency - inverse document frequency by document.
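In the example output, TF_VALUE looks like the raw count of a term within a document and TFIDF_VALUE like that count multiplied by the term's IDF. The sketch below reproduces the three doc1 rows in plain Python from a supplied IDF table. The `tfidf_by_document` helper is hypothetical, the IDF values are taken as ln(6 / df) because that matches the example's IDF table to six decimals, and skipping terms that are missing from the IDF table is an assumption about behavior, not documented library logic.

```python
import math
from collections import Counter


def tfidf_by_document(doc_id, text, idf):
    """Return (doc_id, term, tf, tf * idf) rows for one document."""
    counts = Counter(text.split())
    return [
        (doc_id, term, float(tf), tf * idf[term])
        for term, tf in sorted(counts.items())
        if term in idf  # assumption: terms without an IDF value are skipped
    ]


# IDF values for the three terms of doc1, computed as ln(6 / df).
idf = {
    "term1": math.log(6 / 1),
    "term2": math.log(6 / 2),
    "term3": math.log(6 / 4),
}
rows = tfidf_by_document("doc1", "term1 term2 term2 term3 term3 term3", idf)
for doc_id, term, tf, tfidf in rows:
    print(doc_id, term, tf, round(tfidf, 6))
```

The printed values match the doc1 rows of the example result (1.791759, 2.197225 and 1.216395).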