search_docs_by_keywords

hana_ml.text.tm.search_docs_by_keywords(pred_data, ref_data=None, num_best_matches=None, thread_number=None, thread_ratio=None, lang=None, bm25_k1=None, bm25_b=None, **kwargs)

This function searches for the best-matching documents based on the given keywords. The algorithm used for matching is BM25.

This function supports English, German, Spanish, French, Russian and Portuguese and is available in HANA Cloud.

Parameters:
pred_data : DataFrame

The prediction data for search, structured as follows:

  • 1st column, ID.

  • 2nd column, KEYWORDS.

ref_data : DataFrame or a tuple of DataFrames

Specify the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

Otherwise, if ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with the DataFrames structured as follows:

  • 1st DataFrame

    • TM_TERM.

    • TM_TERM_FREQUENCY.

    • TM_IDF_FREQUENCY.

    • TF_VALUE.

    • IDF_VALUE.

    • TF_IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.
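As a minimal sketch of the pred_data layout and the single-DataFrame form of ref_data described above, the two inputs can be prepared locally as pandas frames. The query IDs, keyword strings, table names, and the connection object `cc` are hypothetical; the commented upload step assumes a live HANA connection and uses `hana_ml.dataframe.create_dataframe_from_pandas`.

```python
import pandas as pd

# Hypothetical keyword queries: 1st column is the ID, 2nd column the keywords.
pred_pdf = pd.DataFrame({
    "ID": ["query1", "query2"],
    "KEYWORDS": ["term2 term3", "term6"],
})

# Reference documents: ID, document content, document category.
ref_pdf = pd.DataFrame({
    "ID": ["doc1", "doc5"],
    "CONTENT": ["term1 term2 term2 term3 term3 term3", "term4 term6"],
    "CATEGORY": ["CATEGORY_1", "CATEGORY_3"],
})

# With a live connection (assumed here to be a ConnectionContext named cc),
# the frames could be uploaded and passed on, for example:
# from hana_ml.dataframe import create_dataframe_from_pandas
# from hana_ml.text.tm import search_docs_by_keywords
# pred_df = create_dataframe_from_pandas(cc, pred_pdf, "PRED_TBL", force=True)
# ref_df = create_dataframe_from_pandas(cc, ref_pdf, "REF_TBL", force=True)
# result, _ = search_docs_by_keywords(pred_data=pred_df, ref_data=ref_df)
```

Column order matters more than column names here, since the layouts above are defined by position.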

num_best_matches : int, optional

Controls how many results to output.

Defaults to 1.

thread_number : int, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

thread_ratio : float, optional

Specifies the ratio of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using one thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Defaults to 0.0.

lang : str, optional

Specifies the language of the documents. HANA Cloud currently supports 'EN', 'DE', 'ES', 'FR', 'RU', and 'PT'. If None, the language is detected automatically.

Defaults to None (auto detection).

bm25_k1 : float, optional

The bm25_k1 parameter in the BM25 algorithm is a tuning parameter that controls the term frequency saturation effect. It determines how much the term frequency (TF) component contributes to the overall BM25 score.

Defaults to 1.2.

bm25_b : float, optional

The bm25_b parameter in the BM25 algorithm is a tuning parameter that controls the degree of document length normalization. It adjusts the impact of document length on the BM25 score.

Defaults to 0.75.
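To make the roles of bm25_k1 and bm25_b concrete, the following is a minimal pure-Python sketch of textbook Okapi BM25 scoring. It is an illustration only, not the implementation used inside HANA: whitespace tokenization, the non-negative IDF variant, and the helper name `bm25_scores` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    # Assumption: simple whitespace tokenization, no stemming or stopwords.
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        # b controls how strongly document length is normalized.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Assumption: non-negative IDF variant; other variants exist.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated occurrences saturate.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores


docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}
scores = bm25_scores(["term6"], docs)
best = max(scores, key=scores.get)
```

In this sketch, setting k1 to 0 removes the influence of term frequency entirely, so every document containing a query term receives the same contribution for it, while setting b to 0 disables length normalization.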

Returns:
Two DataFrames
  • Result.

  • Placeholder.

Examples

Input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke search_docs_by_keywords function:

>>> result, _ = search_docs_by_keywords(pred_data=df)
>>> result.collect()