text_tokenize
- hana_ml.text.tm.text_tokenize(data, lang=None, enable_stopwords=None, keep_numeric=None, allowed_list=None, notallowed_list=None, enable_stemming=None)
The text_tokenize function splits the given documents into tokens.
This function is available in SAP HANA Cloud.
- Parameters:
- data : DataFrame
Input data, structured as follows:
1st column, ID.
2nd column, Document content.
- lang : str, optional
Specifies the language handling. Three kinds of values are supported:
A specific language code: "en", "de", "es", "fr", "ru", or "pt".
'auto_all', which applies the language detected in the first row of data to all rows.
'auto_everyrow', which detects the language of each row of input data individually.
Defaults to 'auto_all'.
- enable_stopwords : bool, optional
Controls whether stopword handling is enabled.
The parameters keep_numeric, allowed_list, and notallowed_list take effect only when this parameter is set to True.
Defaults to True.
- keep_numeric : bool, optional
Determines whether numbers are kept as tokens.
Valid only when enable_stopwords is set to True.
Defaults to False.
- allowed_list : str, optional
A comma-separated list of words that the stopword logic always retains.
Valid only when enable_stopwords is set to True.
- notallowed_list : str, optional
A comma-separated list of words that the stopword logic always removes.
Valid only when enable_stopwords is set to True.
- enable_stemming : bool, optional
Specifies whether to perform stemming on the tokens.
Defaults to True.
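The interplay of enable_stopwords, keep_numeric, allowed_list, and notallowed_list described above can be sketched in plain Python. This is an illustration only, not the server-side implementation: the stopwords tuple below is a hypothetical stand-in for the stopword list that lives inside SAP HANA.

```python
def sketch_stopword_filter(tokens, enable_stopwords=True, keep_numeric=False,
                           allowed_list='', notallowed_list='',
                           stopwords=('a', 'and', 'are', 'the')):
    """Illustrative-only mimic of how the stopword options interact.

    `stopwords` is a hypothetical stand-in for the server-side
    stopword list, which is not shown here.
    """
    if not enable_stopwords:
        return list(tokens)  # all other stopword options are ignored

    # Both lists are comma-separated strings; surrounding whitespace
    # around each word is stripped, mirroring the parameter format above.
    allowed = {w.strip() for w in allowed_list.split(',') if w.strip()}
    blocked = {w.strip() for w in notallowed_list.split(',') if w.strip()}

    result = []
    for tok in tokens:
        if tok in blocked:
            continue                  # notallowed_list: always removed
        if tok in allowed:
            result.append(tok)        # allowed_list: always retained
            continue
        if tok.isnumeric() and not keep_numeric:
            continue                  # drop numbers unless keep_numeric=True
        if tok in stopwords:
            continue                  # ordinary stopword removal
        result.append(tok)
    return result


# Example: 'a' is a stopword, 'test' is in notallowed_list, and '001'
# is numeric with keep_numeric=False, so only 'contents' survives.
print(sketch_stopword_filter(['a', 'test', 'contents', '001'],
                             notallowed_list=' test,mangos '))
```

With enable_stopwords=False, the function returns the tokens unchanged, which matches the parameter description above.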
- Returns:
- A tuple of DataFrames
Token result, structured as follows:
1st column, ID.
2nd column, Token list.
Extra result, structured as follows:
1st column, Key.
2nd column, Value.
Examples
>>> from hana_ml.text.tm import text_tokenize
>>> import numpy as np
>>> import pandas as pd
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> text_data_structure = {'ID': 'NVARCHAR(100)', 'CON': 'CLOB'}
>>> text_data = np.array([
...     ['d1', 'one two three four five six'],
...     ['d2', 'two two three '],
...     ['d3', 'A test contents '],
...     ['d4', 'Mangos and pineapple are yellow'],
...     ['d5', 'I love apple 001 '],
...     ['d6', 'Wie geht es Ihnen?']
... ])
>>> text_df = create_dataframe_from_pandas(conn,
...                                        pd.DataFrame(text_data, columns=list(text_data_structure.keys())),
...                                        'TEXT_TBL',
...                                        force=True,
...                                        table_structure=text_data_structure)
>>> res = text_tokenize(text_df,
...                     enable_stopwords=False,
...                     allowed_list='one, two , three ,four ',
...                     notallowed_list=' test,mangos ',
...                     enable_stemming=False,
...                     lang='auto_everyrow')
Output:
>>> res[0].collect()
   ID                                   TOKEN_LIST
0  d1    ["one","two","three","four","five","six"]
1  d2                        ["two","two","three"]
2  d3                      ["a","test","contents"]
3  d4  ["mangos","and","pineapple","are","yellow"]
4  d5                   ["i","love","apple","001"]
5  d6              ["wie","geht","es","ihnen","?"]
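Assuming the TOKEN_LIST column of the collected result holds each token list as a JSON-formatted string, as in the output above, it can be turned back into a Python list with the standard-library json module for further processing:

```python
import json

def parse_token_list(raw):
    """Parse one TOKEN_LIST cell (a JSON array string) into a Python list.

    Assumes the cell format shown in the collected output above,
    e.g. '["one","two","three"]'.
    """
    return json.loads(raw)

# Example using a value from the output above:
tokens = parse_token_list('["one","two","three","four","five","six"]')
print(len(tokens))  # 6
```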