hana_ml.text package

This package contains classes and functions for text processing, such as text analysis, text mining, and text chunking.

Note

If you plan to use the Text Mining functions, be aware that the functionality available in HANA On-Premise differs significantly from that in HANA Cloud. The functions supported by each system are described in the references listed below.

In order to support both HANA On-Premise and HANA Cloud, hana_ml uses the same function name for the same functionality. For instance, 'text_classification' could map to the 'TM_CATEGORIZE_KNN' SQL procedure in HANA On-Premise, and 'PAL_TEXTCLASSIFICATION' in HANA Cloud. Moreover, certain parameters might be marked as only supported by HANA On-Premise.

If you call a function that your HANA system does not support, an error is raised.

  1. HANA On-Premise Text mining

  2. HANA Cloud Text mining
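The name mapping described in the note can be pictured as a lookup from a function name and a system type to a SQL procedure. The following is only an illustrative sketch and not how hana_ml is implemented: the PROCEDURES table, the resolve_procedure helper, the "on_premise" / "cloud" keys and the choice of NotImplementedError are all assumptions made for this example. Only the two procedure names come from the note above.

```python
# Hypothetical lookup table; only this one mapping is taken from the note above.
PROCEDURES = {
    "text_classification": {
        "on_premise": "TM_CATEGORIZE_KNN",
        "cloud": "PAL_TEXTCLASSIFICATION",
    },
}


def resolve_procedure(function_name, system):
    """Return the procedure name for a function on a given system type.

    Raises NotImplementedError when the combination is not in the table,
    mirroring the idea that unsupported functions fail with an error.
    """
    try:
        return PROCEDURES[function_name][system]
    except KeyError:
        raise NotImplementedError(
            f"{function_name!r} is not supported on {system!r}"
        ) from None


print(resolve_procedure("text_classification", "cloud"))
```

The exception type that hana_ml actually raises for unsupported functions is not stated in this page, so treat the one used here as a placeholder.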

The algorithms are distributed into the following sub-packages.

hana_ml.text.tm

tm.tf_analysis(data[, lang, ...])

Perform Term Frequency (TF) analysis on the given document.
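To make the idea of term frequency concrete, here is a minimal pure-Python sketch that counts how often each token occurs in a document. It does not call tm.tf_analysis and does not need a HANA connection; the term_frequency helper and its simple regular-expression tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequency(doc):
    """Count how often each lowercase alphanumeric token occurs in one document."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return dict(Counter(tokens))


tf = term_frequency("The cat sat. The cat ran.")
print(tf)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

A real text analysis step would typically also handle language-specific tokenization and stop words, which this sketch leaves out.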

tm.text_classification(pred_data[, ...])

This function classifies (categorizes) an input document with respect to sets of categories (taxonomies) using a TF-IDF text vectorizer and a KNN classifier.
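The combination of a TF-IDF vectorizer and a KNN classifier can be sketched in plain Python as follows. This is a conceptual illustration only, not the procedure that tm.text_classification runs in HANA: the helper names, the smoothed IDF formula, the cosine similarity measure and the majority vote are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen during fitting are ignored."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, train_docs, train_labels, k=3):
    """Label the query by majority vote among its k most similar training docs."""
    idf = fit_idf(train_docs)
    query_vec = tfidf_vector(query, idf)
    scored = sorted(
        ((cosine(query_vec, tfidf_vector(doc, idf)), label)
         for doc, label in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


train_docs = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the match ended in a draw",
    "the bank raised interest rates",
    "stocks fell as interest rates rose",
    "the market rallied on earnings",
]
train_labels = ["sports"] * 3 + ["finance"] * 3

print(knn_classify("interest rates and the stock market", train_docs, train_labels))
```

With this toy corpus, the three nearest neighbours of the query are the finance documents, so the vote yields "finance".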

tm.get_related_doc(pred_data[, ref_data, ...])

This function returns the top-ranked documents related to one or more query documents, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

tm.get_related_term(pred_data[, ref_data, ...])

This function returns the top-ranked terms related to one or more query terms, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

tm.get_relevant_doc(pred_data[, ref_data, ...])

This function returns the top-ranked documents that are relevant to one or more terms, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

tm.get_relevant_term(pred_data[, ref_data, ...])

This function returns the top-ranked relevant terms that describe one or more documents, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.
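As a rough picture of what "relevant terms that describe a document" means, the sketch below ranks the terms of one document by their TF-IDF weight against a small corpus. It is not the implementation behind tm.get_relevant_term; the relevant_terms helper, the unsmoothed IDF formula log(n / df) and the alphabetical tie-break are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def relevant_terms(doc, corpus, top=3):
    """Rank the terms of `doc` by TF-IDF weight computed against `corpus`."""
    n = len(corpus)
    df = Counter(term for d in corpus for term in set(tokenize(d)))
    tf = Counter(tokenize(doc))
    # Terms that never occur in the corpus have no document frequency and are skipped.
    weights = {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t]}
    # Sort by descending weight, then alphabetically for a deterministic order.
    return sorted(weights, key=lambda t: (-weights[t], t))[:top]


corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the bird sang",
]
print(relevant_terms(corpus[0], corpus, top=2))  # ['mouse', 'cat']
```

Note how "the", which occurs in every document, gets a weight of zero and drops to the bottom of the ranking, while "mouse", which is unique to the first document, ranks first.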

tm.get_suggested_term(pred_data[, ref_data, ...])

This function returns the top-ranked terms that match one or more initial substrings, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.
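Matching terms against an initial substring can be sketched as a prefix filter over the corpus vocabulary. The example below is an assumption-laden illustration rather than the behaviour of tm.get_suggested_term: the suggest_terms helper is hypothetical, and it ranks matches by raw corpus frequency instead of a TF-IDF result.

```python
import re
from collections import Counter


def suggest_terms(prefix, docs, top=3):
    """Return corpus terms that start with `prefix`, most frequent first."""
    counts = Counter(
        term for doc in docs for term in re.findall(r"[a-z0-9]+", doc.lower())
    )
    matches = [term for term in counts if term.startswith(prefix.lower())]
    # Descending frequency, then alphabetical order to break ties.
    return sorted(matches, key=lambda t: (-counts[t], t))[:top]


docs = [
    "data science and database design",
    "database tuning",
    "dataset cleaning for data science",
]
print(suggest_terms("dat", docs))  # ['data', 'database', 'dataset']
```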

tm.search_docs_by_keywords(pred_data[, ...])

This function searches for the best matching documents based on the given keywords.

tm.TFIDF([language, enable_stopwords, ...])

Class for term frequency–inverse document frequency.

tm.TextClassificationWithModel([language, ...])

Text classification class.

hana_ml.text.anns_model

anns_model.ANNSModel([state_id, by_doc])

ANNS model created with IVF indexing.
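For readers unfamiliar with IVF (inverted file) indexing, the sketch below shows the general idea: vectors are grouped into cells around centroids, and a query scans only the cells nearest to it instead of every vector. This is a generic toy illustration, not the ANNSModel implementation. The function names and the nprobe parameter are assumptions, and the centroids are supplied by hand here, whereas a real index would normally learn them, for example with k-means.

```python
import math


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        lists[cell].append(vid)
    return lists


def search_ivf(query, vectors, centroids, lists, nprobe=1, k=2):
    """Scan only the `nprobe` cells closest to the query and return k nearest ids."""
    cells = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )[:nprobe]
    candidates = [vid for cell in cells for vid in lists[cell]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]


vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.5, 0.5), (10.5, 10.5)]  # hand-picked for the example

lists = build_ivf(vectors, centroids)
print(lists)                                                   # {0: [0, 1, 2], 1: [3, 4, 5]}
print(search_ivf((9.5, 10.2), vectors, centroids, lists))      # [3, 4]
```

Because only the probed cells are scanned, the result is approximate: a true nearest neighbour that sits in a cell that was not probed would be missed. Raising nprobe trades speed for recall.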

anns_model.list_models(connection_context)

List the ANNS models.

hana_ml.text.pal_embeddings

pal_embeddings.PALEmbeddings([...])

Embeds input documents into vectors.

hana_ml.text.text_splitter

text_splitter.TextSplitter([chunk_size, ...])

For a long text, it may be necessary to transform it, for example by splitting it into chunks.
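A minimal way to picture text chunking is a fixed-size character window that slides over the text. The sketch below does not use text_splitter.TextSplitter; the split_text helper and its overlap parameter are assumptions for illustration, and only the chunk_size name is borrowed from the signature above.

```python
def split_text(text, chunk_size=10, overlap=2):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2))
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Character windows are the simplest choice. Splitting on sentence or paragraph boundaries usually gives more coherent chunks, at the cost of variable chunk lengths.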