get_suggested_term

hana_ml.text.tm.get_suggested_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked terms that match an initial substring, or multiple substrings, based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or reference data.
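
To make the idea concrete, the following is a minimal pure-Python sketch of prefix-based term suggestion ranked by a TF-IDF-style score. It is not the SAP HANA implementation: the function name `suggest_terms`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration, so the scores will not match those returned by `get_suggested_term`.

```python
import math
from collections import Counter


def suggest_terms(prefix, docs, top=0):
    """Rank terms starting with `prefix` by a summed TF-IDF-style score.

    Hypothetical helper for illustration only; the weighting used by
    SAP HANA may differ.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            if term.startswith(prefix):
                # Assumed IDF variant: log(N / df) + 1.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                scores[term] += count * idf

    # Highest score first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top] if top else ranked


docs = [
    "term1 term2 term2 term3 term3 term3",
    "term2 term3 term3 term4 term4 term4",
    "term3 term4 term4 term5 term5 term5",
    "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "term4 term6",
    "term4 term6 term6 term6",
]
best = suggest_terms("ter", docs, top=1)
```

A prefix that matches no term, such as `"abc"`, yields an empty list in this sketch.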

Parameters:
pred_dataDataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can process only one document at a time, so the input table must have exactly one row and one column.

Massive mode supports multiple rows:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on a SAP HANA Cloud instance.

keystr, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_dataDataFrame or a tuple of DataFrames

Specifies the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.
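
As a rough illustration of the tuple layout, the pure-Python sketch below derives rows shaped like the 2nd DataFrame (ID, TM_TERM, TM_TERM_FREQUENCY) and the 3rd DataFrame (ID, Document category) from raw reference rows. The helper name `build_frequency_tables` and the whitespace tokenizer are assumptions. The 1st DataFrame is deliberately not reproduced, because the exact definitions of TF_VALUE and IDF_VALUE are not given here. In practice such a tuple would normally come from a prior TF-IDF analysis step rather than being built by hand.

```python
from collections import Counter

# Two rows in the single-DataFrame layout: ID, content, category.
ref_rows = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc5", "term4 term6", "CATEGORY_3"),
]


def build_frequency_tables(rows):
    """Return (term_freq_rows, category_rows) for the given reference rows.

    Hypothetical helper: it mirrors the column layout of the 2nd and 3rd
    DataFrames only, using naive whitespace tokenization.
    """
    term_freq_rows = []   # (ID, TM_TERM, TM_TERM_FREQUENCY)
    category_rows = []    # (ID, category)
    for doc_id, content, category in rows:
        counts = Counter(content.split())
        for term in sorted(counts):
            term_freq_rows.append((doc_id, term, counts[term]))
        category_rows.append((doc_id, category))
    return term_freq_rows, category_rows


term_freq_rows, category_rows = build_frequency_tables(ref_rows)
```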

topint, optional

Returns only the top N results. If 0, all results are returned.

Defaults to 0.

thresholdfloat, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.
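
The interaction between `top` and `threshold` can be sketched in plain Python as follows. The helper `filter_results` is hypothetical, and the order of operations (threshold first, then top N) is an assumption, not documented behaviour. The scores are taken from the massive-mode example on this page.

```python
def filter_results(scored_terms, top=0, threshold=0.0):
    """Keep terms scoring above `threshold`, then truncate to `top` entries.

    Hypothetical helper; top=0 means "keep everything".
    """
    kept = [(term, score) for term, score in scored_terms if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top == 0 else kept[:top]


scores = [
    ("term5", 0.830048),
    ("term6", 0.368910),
    ("term2", 0.276683),
    ("term3", 0.238269),
    ("term1", 0.150417),
    ("term4", 0.137752),
]

top_two = filter_results(scores, top=2)
above_quarter = filter_results(scores, threshold=0.25)
```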

langstr, optional

Specifies the language type. The HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN' in HANA Cloud and None in HANA On-Premise (please provide the value in this case).

index_namestr, optional

Specifies the index name. Applies only to a HANA On-Premise instance.

If None, it will be generated.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA Cloud instance.

Defaults to 0.0.
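
One way to read the `thread_ratio` rule is sketched below. The helper `threads_for_ratio` and its `available` argument are hypothetical. The actual thread count is decided on the server, and the heuristic used for out-of-range values is not specified, so the sketch only signals that case with `None`. Rounding down for intermediate ratios is also an assumption.

```python
import math


def threads_for_ratio(thread_ratio, available):
    """Map a thread_ratio to a thread count, given `available` threads.

    Hypothetical helper. Returns None when the ratio is out of range,
    meaning the choice is left to the (unspecified) server heuristic.
    """
    if not 0.0 <= thread_ratio <= 1.0:
        return None
    # A ratio of 0 still uses one thread; a ratio of 1 uses all of them.
    return max(1, math.floor(thread_ratio * available))


single = threads_for_ratio(0.0, available=8)
half = threads_for_ratio(0.5, available=8)
everything = threads_for_ratio(1.0, available=8)
heuristic = threads_for_ratio(1.5, available=8)
```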

created_index{"index": xxx, "schema": xxx, "table": xxx}, optional

Uses an existing index on the given table. Applies only to a HANA On-Premise instance.

Defaults to None.

Returns:
DataFrame

Examples

Input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on a SAP HANA Cloud instance:

  1. pred_df in single-row mode:

>>> pred_df.collect()
  CONTENT
0   term3

Call the function with pred_df:

>>> get_suggested_term(pred_data=pred_df, ref_data=ref_df).collect()
        ID     SCORE
0    term3       1.0

  2. pred_df_massive in massive mode, which supports multiple rows:

>>> pred_df_massive.collect()
   ID CONTENT
0   2     ter
1   3     abc
>>> get_suggested_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term5  0.830048
1           2  1  term6  0.368910
2           2  2  term2  0.276683
3           2  3  term3  0.238269
4           2  4  term1  0.150417
5           2  5  term4  0.137752
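
In the output above, the input row with ID 3 ("abc") produces no result rows, since no reference term matches it. When post-processing a massive-mode result on the client, it can help to group suggestions by PREDICT_ID while keeping such empty inputs visible. The sketch below assumes the collected rows are available as plain tuples; `group_suggestions` is a hypothetical helper, not part of hana_ml.

```python
# Rows as (PREDICT_ID, K, TERM, SCORE), copied from the output above.
rows = [
    (2, 0, "term5", 0.830048),
    (2, 1, "term6", 0.368910),
    (2, 2, "term2", 0.276683),
    (2, 3, "term3", 0.238269),
    (2, 4, "term1", 0.150417),
    (2, 5, "term4", 0.137752),
]


def group_suggestions(rows, predict_ids):
    """Map each input ID to its suggested terms, ordered by rank K."""
    grouped = {predict_id: [] for predict_id in predict_ids}
    for predict_id, _rank, term, _score in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped.setdefault(predict_id, []).append(term)
    return grouped


suggestions = group_suggestions(rows, predict_ids=[2, 3])
```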

Invoke the function on a SAP HANA On-Premise instance (only supports single-row mode):

>>> res = get_suggested_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
  RANK   TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY                                SCORE
0    1  term3            term3       noun               7                   4  0.999999999999999888977697537484346