text_tokenize

hana_ml.text.tm.text_tokenize(data, lang=None, enable_stopwords=None, keep_numeric=None, allowed_list=None, notallowed_list=None, enable_stemming=None)

The text_tokenize function splits each input document into tokens.

This function is available in HANA Cloud.

Parameters:
data : DataFrame

Input data, structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

lang : str, optional

The lang parameter supports three options:

  • a specific language code; supported values include "en", "de", "es", "fr", "ru", and "pt".

  • "auto_all", which detects the language of the first row of data and applies it to all rows.

  • "auto_everyrow", which detects the language of each row of input data individually.

Defaults to 'auto_all'.

enable_stopwords : bool, optional

Controls whether stopword removal is enabled.

The parameters keep_numeric, allowed_list, and notallowed_list take effect only when this parameter is set to True.

Defaults to True.

keep_numeric : bool, optional

Determines whether numeric tokens are kept.

Valid only when enable_stopwords is set to True.

Defaults to False.

allowed_list : str, optional

A comma-separated list of words that are always retained by the stopwords logic.

Valid only when enable_stopwords is set to True.

notallowed_list : str, optional

A comma-separated list of words that are always removed by the stopwords logic.

Valid only when enable_stopwords is set to True.

enable_stemming : bool, optional

Whether to perform stemming on tokens.

Defaults to True.
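The interplay of enable_stopwords, keep_numeric, allowed_list, and notallowed_list can be sketched in plain Python. This is only an illustration of the semantics documented above, not the library's actual implementation; the STOPWORDS set and the filter function here are stand-ins:

```python
# Illustrative stand-in for the documented stopword semantics.
# Not hana_ml code: STOPWORDS is a hypothetical stopword set.
STOPWORDS = {"a", "and", "are", "the"}

def filter_tokens(tokens, enable_stopwords=True, keep_numeric=False,
                  allowed_list=(), notallowed_list=()):
    if not enable_stopwords:
        # keep_numeric and the two lists are ignored in this case
        return list(tokens)
    kept = []
    for tok in tokens:
        if tok in notallowed_list:        # always removed
            continue
        if tok in allowed_list:           # always retained
            kept.append(tok)
            continue
        if tok.isnumeric() and not keep_numeric:
            continue                      # drop numbers unless kept
        if tok in STOPWORDS:
            continue                      # ordinary stopword removal
        kept.append(tok)
    return kept

print(filter_tokens(["a", "test", "001", "mangos"],
                    allowed_list={"a"}, notallowed_list={"test"}))
# → ['a', 'mangos']
```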

Returns:
A tuple of DataFrames

Token result, structured as follows:

  • ID.

  • Token list.

Extra result, structured as follows:

  • Key.

  • Value.

Examples

>>> from hana_ml.text.tm import text_tokenize
>>> import numpy as np
>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> text_data_structure = {'ID': 'NVARCHAR(100)', 'CON': 'CLOB'}
>>> text_data = np.array([
        ['d1', 'one two three four five six'],
        ['d2', 'two two three '],
        ['d3', 'A test contents '],
        ['d4', 'Mangos and pineapple are yellow'],
        ['d5', 'I love apple 001 '],
        ['d6', 'Wie geht es Ihnen?']
    ])
>>> text_df = create_dataframe_from_pandas(conn,
                                           pd.DataFrame(text_data, columns=list(text_data_structure.keys())),
                                           'TEXT_TBL',
                                           force=True,
                                           table_structure=text_data_structure)
>>> res = text_tokenize(text_df,
                        enable_stopwords=False,
                        allowed_list='one, two , three ,four ',
                        notallowed_list=' test,mangos ',
                        enable_stemming=False,
                        lang='auto_everyrow')

Output:

>>> res[0].collect()
    ID                                      TOKEN_LIST
0   d1       ["one","two","three","four","five","six"]
1   d2                           ["two","two","three"]
2   d3                         ["a","test","contents"]
3   d4     ["mangos","and","pineapple","are","yellow"]
4   d5                      ["i","love","apple","001"]
5   d6                 ["wie","geht","es","ihnen","?"]
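As the output above suggests, each TOKEN_LIST value in the collected result is a JSON array serialized as a string. A minimal sketch of turning one back into a Python list, assuming the strings are valid JSON as shown:

```python
import json

# TOKEN_LIST values come back as JSON array strings (see output above).
row_value = '["one","two","three","four","five","six"]'
tokens = json.loads(row_value)
print(tokens[0], len(tokens))
# → one 6
```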