hana_ml.text package

This package contains a collection of classes and functions for text processing, such as text analysis, text mining, and text chunking.

Note

If you use text mining functions, note that the text mining functionality of HANA On-Premise differs significantly from that of HANA Cloud. The functions supported by each system are listed below.

To support both HANA On-Premise and HANA Cloud, hana_ml uses the same function names for equivalent functionalities. For example, text_classification maps to the TM_CATEGORIZE_KNN SQL procedure in HANA On-Premise and PAL_TEXTCLASSIFICATION in HANA Cloud. Additionally, certain parameters may be marked as supported only by HANA On-Premise.

If the HANA system you are using does not support certain functions and you attempt to use them, an error will be raised.
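The name mapping and the error-on-unsupported behavior described in this note can be pictured with a small, self-contained sketch. It is not hana_ml's internal code: `PROCEDURES`, `resolve_procedure`, and `UnsupportedFunctionError` are hypothetical names chosen for illustration, and only the text_classification mapping is taken from this page.

```python
# Illustrative only: hypothetical names, not hana_ml internals.


class UnsupportedFunctionError(Exception):
    """Raised when a function has no SQL procedure on the target system."""


# Function name -> {system: SQL procedure}.
# The text_classification entries are the ones quoted in the note.
PROCEDURES = {
    "text_classification": {
        "on_premise": "TM_CATEGORIZE_KNN",
        "cloud": "PAL_TEXTCLASSIFICATION",
    },
}


def resolve_procedure(function_name, system):
    """Return the procedure behind function_name on the given system."""
    try:
        return PROCEDURES[function_name][system]
    except KeyError:
        raise UnsupportedFunctionError(
            f"{function_name} is not supported on {system}"
        ) from None


print(resolve_procedure("text_classification", "cloud"))
```

The caller uses one function name throughout; only the lookup result changes with the system, and a missing entry surfaces as an error instead of a silent fallback.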

HANA On-Premise Text Mining

text_classification() (TM_CATEGORIZE_KNN)
TextClassificationWithModel (PAL_TEXTCLASSIFICATION_TRAIN / PAL_TEXTCLASSIFICATION_PREDICT), supported since HANA 2.0 SPS08
TFIDF (PAL_TEXT_COLLECT / PAL_TEXT_TFIDF), supported since HANA 2.0 SPS07
get_related_doc() (TM_GET_RELATED_DOCUMENTS)
get_related_term() (TM_GET_RELATED_TERMS)
get_relevant_doc() (TM_GET_RELEVANT_DOCUMENTS)
get_relevant_term() (TM_GET_RELEVANT_TERMS)
get_suggested_term() (TM_GET_SUGGESTED_TERMS)

HANA Cloud Text Mining

The algorithms are distributed across the following sub-packages.

hana_ml.text.tm

`tm.tf_analysis`(data[, lang, ...])	Performs Term Frequency (TF) analysis on the given document.
`tm.text_tokenize`(data[, lang, ...])	Splits the given document into tokens.
`tm.text_classification`(pred_data[, ...])	This function classifies (categorizes) an input document with respect to sets of categories (taxonomies) using TF-IDF text vectorizer and KNN classifier.
`tm.get_related_doc`(pred_data[, ref_data, ...])	This function returns the top-ranked related documents for one or more query documents, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
`tm.get_related_term`(pred_data[, ref_data, ...])	This function returns the top-ranked related terms for one or more query terms, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
`tm.get_relevant_doc`(pred_data[, ref_data, ...])	This function returns the top-ranked documents that are relevant to one or more terms, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
`tm.get_relevant_term`(pred_data[, ref_data, ...])	This function returns the top-ranked relevant terms that describe one or more documents, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
`tm.get_suggested_term`(pred_data[, ref_data, ...])	This function returns the top-ranked terms that match one or more initial substrings, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
`tm.search_docs_by_keywords`(pred_data[, ...])	This function searches for the best matching documents based on the given keywords.

`tm.TFIDF`([language, enable_stopwords, ...])	Class for term frequency–inverse document frequency.
`tm.TextClassificationWithModel`([language, ...])	Text classification class.
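
Several functions in hana_ml.text.tm rank results from a TF-IDF result. As a rough picture of what such a weighting looks like, the sketch below computes TF-IDF vectors in plain Python and compares documents by cosine similarity. It assumes one common textbook variant (raw term counts multiplied by log(N / df)) and whitespace tokenization; the PAL procedures may tokenize and weight differently, and `tf_idf` and `cosine` are hypothetical helper names, not part of hana_ml.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: TF is the raw count and IDF is log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["red apple pie", "green apple tart", "blue car engine"]
vectors = tf_idf(docs)
# The two apple documents share a weighted term; the car document shares none.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Ranking the remaining documents by this similarity gives a "related documents" list in miniature; a term that occurs in every document gets weight zero and therefore does not influence the ranking.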

hana_ml.text.anns_model

anns_model.ANNSModel([state_id, by_doc])

ANNS model created with IVF indexing.

anns_model.list_models(connection_context)

List the ANNS models.
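
For readers unfamiliar with the idea, IVF (inverted file) indexing for approximate nearest neighbor search can be sketched in a few lines. The example below is a generic illustration and makes no claim about how ANNSModel is implemented: the centroids are supplied by hand instead of being trained (typically done with k-means), and `build_ivf`, `search_ivf`, and `nprobe` are hypothetical names.

```python
import math


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(vec, centroids[i]),
        )
        lists[nearest].append(vid)
    return lists


def search_ivf(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)),
        key=lambda i: math.dist(query, centroids[i]),
    )[:nprobe]
    candidates = [vid for cell in probe for vid in lists[cell]]
    # Exact distances are computed for the candidates only.
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # assumed given, not trained here
index = build_ivf(vectors, centroids)
print(search_ivf((4.9, 5.0), vectors, centroids, index, k=2))
```

The search is approximate because vectors in unprobed lists are never examined; raising `nprobe` trades speed for recall.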

hana_ml.text.pal_embeddings

pal_embeddings.PALEmbeddings([...])

Embeds input documents into vectors.

hana_ml.text.text_splitter

text_splitter.TextSplitter([chunk_size, ...])

Text splitter. A long text may need to be split into smaller chunks before further processing.
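
Chunking a long text can be pictured with a minimal fixed-size splitter. This is only a sketch of the general idea, not the algorithm used by TextSplitter: `split_text` and its `overlap` parameter are hypothetical, and only the notion of a chunk size comes from the signature above.

```python
def split_text(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters (hypothetical parameter).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop before a start offset whose chunk would lie entirely in the overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(split_text("abcdefghij", 4))
print(split_text("abcdefghij", 4, overlap=1))
```

A small overlap keeps some context from the end of one chunk at the start of the next, which can help when chunks are processed independently.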

hana_ml.text.ta

`ta.text_analysis`(data[, thread_ratio, timeout])	Text analysis function that performs POS (Part-of-Speech) tagging, NER (Named-Entity-Recognition), and sentiment-phrase-score tasks.
`ta.pos_tag`(data[, lang, thread_ratio, timeout])	Part of Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence.
`ta.named_entity_recognition`(data[, lang, ...])	This is a wrapper of named entity recognition (NER) functionality for text analysis, which aims at facilitating users' use of text analysis targeted specially for named entity recognition.
`ta.sentiment_analysis`(data[, lang, ...])	Computes a sentiment score (also called a sentiment analysis score), a numerical representation of the sentiment or emotion conveyed in a piece of text such as a tweet, a product review, or an article.