get_suggested_term
- hana_ml.text.tm.get_suggested_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)
This function returns the top-ranked terms that match an initial substring, or multiple substrings, based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
- Parameters:
- pred_dataDataFrame
Accepts input data in two different data structures:
Single-row mode:
1st column, Document content.
Note
This mode can process only one document at a time, so the input table must have exactly one row and one column.
Massive mode supports multiple rows:
1st column, ID.
2nd column, Document content.
Note
This mode is only valid on a SAP HANA Cloud instance.
- keystr, optional
Specifies the ID column. Only valid when pred_data contains multiple rows. Defaults to the first column of pred_data.
- ref_dataDataFrame or a tuple of DataFrames
Specifies the reference data.
If ref_data is a DataFrame, then it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, then it should correspond to reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- topint, optional
Shows top N results. If 0, it shows all.
Defaults to 0.
- thresholdfloat, optional
Only results with a score greater than this value are included in the result table.
Defaults to 0.0.
- langstr, optional
Specifies the language type. The HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.
Defaults to 'EN' in HANA Cloud and None in HANA On-Premise (please provide the value in this case).
- index_namestr, optional
Specifies the index name. Applies only to a HANA On-Premise instance.
If None, it will be generated.
- thread_ratiofloat, optional
Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA Cloud instance.
Defaults to 0.0.
- created_index{"index": xxx, "schema": xxx, "table": xxx}, optional
Uses an existing index on the given table. Applies only to a HANA On-Premise instance.
Defaults to None.
- Returns:
- DataFrame
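Several of the parameters above are documented as platform-specific: thread_ratio is only valid on HANA Cloud, while index_name and created_index apply only to HANA On-Premise. A minimal sketch of how a caller might drop the options that do not apply before invoking the function is shown here. The helper name platform_kwargs and the is_cloud flag are hypothetical and not part of hana_ml.

```python
# Parameter groups taken from the descriptions above.
CLOUD_ONLY = {"thread_ratio"}
ON_PREMISE_ONLY = {"index_name", "created_index"}


def platform_kwargs(is_cloud, **options):
    """Hypothetical helper: keep only the options that are set (not None)
    and that the parameter list describes as applicable to the platform."""
    skip = ON_PREMISE_ONLY if is_cloud else CLOUD_ONLY
    return {
        name: value
        for name, value in options.items()
        if name not in skip and value is not None
    }


cloud_args = platform_kwargs(True, top=5, thread_ratio=0.5, index_name="IDX")
on_prem_args = platform_kwargs(False, top=5, thread_ratio=0.5, index_name="IDX")
print(cloud_args)
print(on_prem_args)
```

The resulting dictionary could then be unpacked into the call, for example get_suggested_term(pred_data=pred_df, ref_data=ref_df, **cloud_args).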
Examples
Input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke the function on a SAP HANA Cloud instance:
pred_df in single-row mode:
>>> pred_df.collect()
  CONTENT
0   term3
Invoke the function in single-row mode:
>>> get_suggested_term(pred_data=pred_df, ref_data=ref_df).collect()
      ID  SCORE
0  term3    1.0
pred_df_massive in massive mode which supports multiple rows:
>>> pred_df_massive.collect()
   ID CONTENT
0   2     ter
1   3     abc
>>> get_suggested_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term5  0.830048
1           2  1  term6  0.368910
2           2  2  term2  0.276683
3           2  3  term3  0.238269
4           2  4  term1  0.150417
5           2  5  term4  0.137752
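To illustrate how the top and threshold parameters described under Parameters narrow a result like the one above, here is a small client-side sketch in plain Python. It only mimics the documented semantics (a score must be greater than threshold; top=0 keeps everything) and is not how the server applies them. The function name apply_top_and_threshold is hypothetical.

```python
def apply_top_and_threshold(rows, top=0, threshold=0.0):
    """Keep (term, score) rows scoring strictly above `threshold`,
    ordered best first; top=0 keeps all of them."""
    kept = sorted(
        (row for row in rows if row[1] > threshold),
        key=lambda row: row[1],
        reverse=True,
    )
    return kept[:top] if top > 0 else kept


# (TERM, SCORE) pairs copied from the massive-mode output above.
scores = [
    ("term5", 0.830048),
    ("term6", 0.368910),
    ("term2", 0.276683),
    ("term3", 0.238269),
    ("term1", 0.150417),
    ("term4", 0.137752),
]

print(apply_top_and_threshold(scores, top=3))
print(apply_top_and_threshold(scores, threshold=0.25))
```

Because the rows are ordered by descending score, filtering first and truncating second gives the same result as the reverse order.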
Invoke the function on a SAP HANA On-Premise instance (only supports single-row mode):
>>> res = get_suggested_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
   RANK   TERM NORMALIZED_TERM TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY                                SCORE
0     1  term3           term3      noun               7                   4  0.999999999999999888977697537484346
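The TERM_FREQUENCY and DOCUMENT_FREQUENCY values in the on-premise output can be checked by hand against ref_df. The sketch below does that in plain Python and ranks terms that start with a given substring. The function suggest_terms and its weighting formula are assumptions made for illustration; they are not the scoring used by SAP HANA, so the scores will not match the SCORE column.

```python
import math
from collections import Counter

# Reference documents copied from ref_df in the example above.
REF_DOCS = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}


def suggest_terms(prefix, docs):
    """Return (term, term_frequency, document_frequency, score) tuples
    for every term that starts with `prefix`, best score first."""
    term_freq = Counter()  # occurrences across all documents
    doc_freq = Counter()   # number of documents containing the term
    for text in docs.values():
        tokens = text.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    matches = []
    for term, tf in term_freq.items():
        if term.startswith(prefix):
            # Assumed weighting: tf * (1 + ln(N / df)); illustrative only.
            score = tf * (1.0 + math.log(n_docs / doc_freq[term]))
            matches.append((term, tf, doc_freq[term], score))
    matches.sort(key=lambda m: (-m[3], m[0]))
    return matches


for term, tf, df, _ in suggest_terms("term3", REF_DOCS):
    print(term, tf, df)
```

For the substring term3 this counts 7 occurrences across 4 documents, which agrees with the TERM_FREQUENCY and DOCUMENT_FREQUENCY columns shown above.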