hana_ml.text package

This package contains text-related classes and functions for tasks such as text analysis, text mining, and text chunking.

Note

If you plan to use the text mining functions, note that the text mining functionality in HANA On-Premise differs significantly from that in HANA Cloud. The functions supported by each system, along with their reference links, are listed below.

To support both HANA On-Premise and HANA Cloud, hana_ml uses the same function names for equivalent functionalities. For example, text_classification maps to the TM_CATEGORIZE_KNN SQL procedure in HANA On-Premise and PAL_TEXTCLASSIFICATION in HANA Cloud. Additionally, certain parameters may be marked as supported only by HANA On-Premise.

If you call a function that your HANA system does not support, an error is raised.

  1. HANA On-Premise Text Mining

  2. HANA Cloud Text Mining
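
The name mapping described in this note can be pictured with a small sketch. This is not hana_ml's actual dispatch code: the PROCEDURES table, the resolve_procedure helper, and the "on_premise" / "cloud" keys are hypothetical names chosen for illustration. Only the two procedure names come from the note above.

```python
# Hypothetical sketch of "one function name, two backing SQL procedures".
# PROCEDURES and resolve_procedure are illustrative, not part of hana_ml.
PROCEDURES = {
    "text_classification": {
        "on_premise": "TM_CATEGORIZE_KNN",
        "cloud": "PAL_TEXTCLASSIFICATION",
    },
}


def resolve_procedure(function_name, system):
    """Return the SQL procedure name backing `function_name` on `system`."""
    try:
        return PROCEDURES[function_name][system]
    except KeyError:
        # Mirrors the documented behavior: unsupported calls raise an error.
        raise NotImplementedError(
            f"{function_name!r} is not supported on {system!r}"
        ) from None


print(resolve_procedure("text_classification", "cloud"))
```

The lookup fails loudly for any function or system that is missing from the table, which matches the note's statement that unsupported functions raise an error.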

The algorithms are distributed across the following sub-packages.

hana_ml.text.tm

tm.tf_analysis(data[, lang, ...])

Performs Term Frequency (TF) analysis on the given document.

tm.text_tokenize(data[, lang, ...])

This function splits the given document into tokens.

tm.text_classification(pred_data[, ...])

This function classifies (categorizes) an input document with respect to sets of categories (taxonomies) using a TF-IDF text vectorizer and a KNN classifier.

tm.get_related_doc(pred_data[, ref_data, ...])

This function returns the top-ranked related documents for one or more query documents, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.

tm.get_related_term(pred_data[, ref_data, ...])

This function returns the top-ranked related terms for one or more query terms, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.

tm.get_relevant_doc(pred_data[, ref_data, ...])

This function returns the top-ranked documents that are relevant to one or more terms, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.

tm.get_relevant_term(pred_data[, ref_data, ...])

This function returns the top-ranked relevant terms that describe one or more documents, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.

tm.get_suggested_term(pred_data[, ref_data, ...])

This function returns the top-ranked terms that match one or more initial substrings, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.

tm.search_docs_by_keywords(pred_data[, ...])

This function searches for the best matching documents based on the given keywords.

tm.TFIDF([language, enable_stopwords, ...])

Class for term frequency–inverse document frequency.

tm.TextClassificationWithModel([language, ...])

Text classification class.
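
Several of the functions above rank documents or terms by their TF-IDF result. As a rough illustration of that idea, the following plain-Python sketch weights each term by tf * ln(N / df) and returns the highest-weighted terms of a document. The weighting formula, the whitespace tokenizer, and the helper names tokenize, tf_idf and relevant_terms are assumptions made for this example; they are not the hana_ml API, and the weighting used inside HANA may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (assumption; HANA tokenization differs)."""
    return text.lower().split()


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * ln(N / df)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        for doc_id, counts in term_counts.items()
    }


def relevant_terms(weights, doc_id, top=2):
    """Top-weighted terms of one document; ties are broken alphabetically."""
    ranked = sorted(weights[doc_id].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top]]


docs = {
    "d1": "hana text mining text",
    "d2": "hana cloud database",
    "d3": "hana text analysis",
}
weights = tf_idf(docs)
print(relevant_terms(weights, "d1"))  # ['mining', 'text']
```

The term "hana" occurs in every document, so its weight is zero and it never ranks as relevant, which is the effect the inverse document frequency factor is meant to have.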

hana_ml.text.anns_model

anns_model.ANNSModel([state_id, by_doc])

ANNS model created with IVF indexing.

anns_model.list_models(connection_context)

List the ANNS models.
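
In approximate nearest neighbor search, IVF commonly refers to an inverted file index: vectors are grouped by their nearest centroid, and a query scans only the groups closest to it. The sketch below illustrates that general technique in plain Python. It does not show how ANNSModel is implemented; build_ivf, search, n_probe and top_k are hypothetical names, and the centroids are fixed by hand rather than learned.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to `point` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def build_ivf(vectors, centroids):
    """Group vector ids into one inverted list per centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        lists[nearest(vec, centroids)].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, n_probe=1, top_k=1):
    """Scan only the n_probe closest lists, then rank their members."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vec_id for i in order[:n_probe] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:top_k]


vectors = {"a": (0.0, 0.1), "b": (0.2, 0.0), "c": (5.0, 5.1), "d": (5.2, 4.9)}
# Centroids are hard-coded here; in practice they are usually learned (e.g. k-means).
centroids = [(0.0, 0.0), (5.0, 5.0)]
lists = build_ivf(vectors, centroids)
print(search((4.9, 5.0), vectors, centroids, lists))  # ['c']
```

Raising n_probe scans more lists, trading speed for a lower chance of missing a true neighbor that landed in an adjacent list.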

hana_ml.text.pal_embeddings

pal_embeddings.PALEmbeddings([...])

Embeds input documents into vectors.

hana_ml.text.text_splitter

text_splitter.TextSplitter([chunk_size, ...])

For a long text, it may be necessary to split it into smaller chunks that better suit later processing.
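
A minimal sketch of fixed-size chunking, assuming character-based chunks with an optional overlap between neighbors. Only the chunk_size name appears in the TextSplitter signature above; the split_text function and its overlap parameter are hypothetical, and the rules TextSplitter actually applies are not shown here.

```python
def split_text(text, chunk_size=20, overlap=5):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters (an assumed strategy).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # so no redundant tail chunk is produced.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap keeps some context from the end of one chunk at the start of the next, which can help when chunks are later processed independently.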

hana_ml.text.ta

ta.text_analysis(data[, thread_ratio, timeout])

Text analysis function that can perform POS (part-of-speech) tagging, NER (named entity recognition), and sentiment-phrase-score tasks.

ta.pos_tag(data[, lang, thread_ratio, timeout])

Part of Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence.

ta.named_entity_recognition(data[, lang, ...])

This function is a wrapper around the named entity recognition (NER) functionality of text analysis, intended to make NER-specific text analysis easier to use.

ta.sentiment_analysis(data[, lang, ...])

A sentiment score, often referred to as a sentiment analysis score, is a numerical representation of the sentiment or emotion conveyed in a piece of text, be it a tweet, a product review, or an article.
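
To make the idea of a numerical sentiment score concrete, here is a toy lexicon-based scorer. It is a deliberately simple stand-in and not the method used by ta.sentiment_analysis: the word lists, the sentiment_score function, and the [-1, 1] scale are all assumptions made for this illustration.

```python
# Toy word lists (assumption); real systems use far richer models.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def sentiment_score(text):
    """Return a score in [-1, 1]: (positives - negatives) / matched words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    # Text without any lexicon word is treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


print(sentiment_score("Great product, I love it!"))         # 1.0
print(sentiment_score("Good screen but terrible battery."))  # 0.0
```

A score near 1 indicates mostly positive wording, a score near -1 mostly negative wording, and 0 either balanced or neutral text.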