TFIDF
- class hana_ml.text.tm.TFIDF(language=None, enable_stopwords=True, keep_numeric=None, allowed_list=None, notallowed_list=None)
Class for term frequency–inverse document frequency.
- Parameters:
- language : str, {'en', 'de', 'es', 'fr', 'ru', 'pt'}, optional
Specifies the language type. HANA Cloud instances currently support 'EN', 'DE', 'ES', 'FR', 'RU', and 'PT'. If None, auto detection is applied.
Defaults to None (auto detection).
- enable_stopwords : bool, optional
Determines whether to enable stopwords.
Defaults to True.
- keep_numeric : bool, optional
Determines whether to keep numbers.
Valid only when enable_stopwords is True.
Defaults to False.
- allowed_list : list of str, optional
A list of words that are retained by the stopwords logic.
Valid only when enable_stopwords is True.
- notallowed_list : list of str, optional
A list of words that are recognized and deleted by the stopwords logic.
Valid only when enable_stopwords is True.
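As a rough mental model of how the stopword-related parameters interact, the following pure-Python sketch filters a token list. It is an illustration only and not the PAL algorithm: the function name `filter_tokens`, the explicit `stopwords` argument (PAL uses its own built-in lists), the exact-match comparison, and the rule that `allowed_list` takes precedence over `notallowed_list` are all assumptions made for this example.

```python
def filter_tokens(tokens, stopwords, enable_stopwords=True, keep_numeric=False,
                  allowed_list=None, notallowed_list=None):
    """Illustrative token filter (hypothetical helper, not part of hana_ml)."""
    if not enable_stopwords:
        # The other three options are documented as valid only when
        # enable_stopwords is True, so nothing is filtered here.
        return list(tokens)

    allowed = set(allowed_list or [])
    # Words to delete: the stopword list plus notallowed_list, minus allowed_list.
    blocked = (set(stopwords) | set(notallowed_list or [])) - allowed

    kept = []
    for tok in tokens:
        if tok in allowed:
            kept.append(tok)      # assumption: allowed_list always wins
        elif tok in blocked:
            continue              # stopword or explicitly not allowed
        elif tok.isdigit() and not keep_numeric:
            continue              # purely numeric token, dropped by default
        else:
            kept.append(tok)
    return kept


# Hypothetical stopword list and sentence, for illustration only.
tokens = ["the", "cat", "has", "9", "lives"]
stopwords = {"the", "has"}
default_result = filter_tokens(tokens, stopwords)
with_numbers = filter_tokens(tokens, stopwords, keep_numeric=True)
```

With the default settings this sketch drops both the stopwords and the number; passing `keep_numeric=True` retains the number while still removing stopwords.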
Examples
Input DataFrame:
>>> df_train.collect()
     ID                              CONTENT
0  doc1  term1 term2 term2 term3 term3 term3
1  doc2  term2 term3 term3 term4 term4 term4
...
4  doc4                          term4 term6
5  doc6              term4 term6 term6 term6
Creating a TFIDF instance:
>>> tfidf = TFIDF()
Performing text_collector():
>>> idf, _ = tfidf.text_collector(data=df_train)
>>> idf.collect()
   TM_TERMS  TM_TERM_IDF_VALUE
0     term1           1.791759
1     term2           1.098612
2     term3           0.405465
3     term4           0.182322
4     term5           1.098612
5     term6           1.098612
Performing text_tfidf():
>>> result = tfidf.text_tfidf(data=df_train)
>>> result.collect()
      ID  TERMS  TF_VALUE  TFIDF_VALUE
0   doc1  term1       1.0     1.791759
1   doc1  term2       2.0     2.197225
2   doc1  term3       3.0     1.216395
...
13  doc4  term6       1.0     1.098612
14  doc6  term4       1.0     0.182322
15  doc6  term6       3.0     3.295837
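The printed values are consistent with IDF = ln(N / df) over the N = 6 input rows and TFIDF = TF × IDF: for example, ln(6) ≈ 1.791759 for term1, and 2 × ln(3) ≈ 2.197225 for term2 in doc1. The pure-Python sketch below reproduces that arithmetic. It is inferred from the numbers above rather than taken from the PAL implementation; the helper names `idf_values` and `tfidf_rows`, the three-document corpus, and the whitespace tokenization are all assumptions for illustration.

```python
import math
from collections import Counter


def idf_values(docs):
    """Return {term: ln(N / df)} for a {doc_id: text} mapping."""
    n_docs = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_rows(docs, idf):
    """Return (doc_id, term, tf, tf * idf) tuples, one per distinct term per document."""
    rows = []
    for doc_id, text in docs.items():
        for term, tf in sorted(Counter(text.split()).items()):
            rows.append((doc_id, term, float(tf), tf * idf[term]))
    return rows


# Hypothetical toy corpus (not the df_train shown above).
docs = {
    "d1": "apple banana banana",
    "d2": "banana cherry",
    "d3": "cherry cherry cherry",
}
idf = idf_values(docs)
rows = tfidf_rows(docs, idf)
```

A term that occurs in every document gets an IDF of ln(1) = 0 under this formula, so its TF-IDF is 0 regardless of how often it appears.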
Methods
get_model_metrics()
Get the model metrics.
get_score_metrics()
Get the score metrics.
text_collector(data)
Computes the inverse document frequency of the documents provided by the user.
text_tfidf(data[, idf])
Computes the term frequency-inverse document frequency for each document.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- text_collector(data)
Computes the inverse document frequency of the documents provided by the user.
- Parameters:
- data : DataFrame
Data to be analyzed. The first column of the input data table is assumed to be an ID column.
- Returns:
- DataFrame
Inverse document frequency of documents.
- DataFrame
Extended table.
- text_tfidf(data, idf=None)
Computes the term frequency-inverse document frequency (TF-IDF) for each document.
- Parameters:
- data : DataFrame
Data to be analyzed.
The first column of the input data table is assumed to be an ID column.
- idf : DataFrame, optional
Inverse document frequency of documents.
- Returns:
- DataFrame
Term frequency - inverse document frequency by document.
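Because `idf` is optional, one plausible workflow is to compute the IDF table once with text_collector() and pass it to text_tfidf() when scoring other documents. The helper below is a hedged sketch of that pattern: `tfidf_with_shared_idf` is a hypothetical name, `tfidf` is assumed to be a TFIDF instance, and `df_train` and `df_new` are assumed to be hana_ml DataFrames whose first column is the ID. How PAL treats a supplied IDF table should be confirmed against the PAL documentation.

```python
def tfidf_with_shared_idf(tfidf, df_train, df_new):
    """Compute IDF on df_train, then reuse it when scoring df_new.

    Hypothetical convenience wrapper; it only calls the two methods
    documented above.
    """
    # text_collector() returns the IDF table plus an extended table.
    idf, _extended = tfidf.text_collector(data=df_train)
    # Pass the precomputed IDF through the optional idf argument.
    return tfidf.text_tfidf(data=df_new, idf=idf)
```

A call such as `tfidf_with_shared_idf(TFIDF(), df_train, df_new)` would return the DataFrame produced by text_tfidf().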
Inherited Methods from PALBase
Besides the methods mentioned above, the TFIDF class also inherits methods from the PALBase class; please refer to PAL Base for more details.