search_docs_by_keywords
- hana_ml.text.tm.search_docs_by_keywords(pred_data, ref_data=None, num_best_matches=None, thread_number=None, thread_ratio=None, lang=None, bm25_k1=None, bm25_b=None, **kwargs)
This function searches for the best matching documents based on the given keywords. The algorithm used for matching is BM25.
This function supports English, German, Spanish, French, Russian and Portuguese and is available in HANA Cloud.
- Parameters:
- pred_data : DataFrame
The prediction data for search, structured as follows:
1st column, ID.
2nd column, KEYWORDS.
- ref_data : DataFrame or a tuple of DataFrames
Specifies the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
Otherwise, if ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with DataFrames structured as follows:
1st DataFrame
TM_TERM.
TM_TERM_FREQUENCY.
TM_IDF_FREQUENCY.
TF_VALUE.
IDF_VALUE.
TF_IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- num_best_matches : int, optional
Controls how many results to output.
Defaults to 1.
- thread_number : int, optional
Specifies the number of threads that can be used by this function.
Defaults to 1.
- thread_ratio : float, optional
Specifies the ratio of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using one thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Defaults to 0.0.
- lang : str, optional
Specifies the language type. A HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU' and 'PT'.
Defaults to None (automatic language detection).
- bm25_k1 : float, optional
The bm25_k1 parameter of the BM25 algorithm is a tuning parameter that controls the term frequency saturation effect. It determines how much the term frequency (TF) component contributes to the overall BM25 score.
Defaults to 1.2.
- bm25_b : float, optional
The bm25_b parameter of the BM25 algorithm is a tuning parameter that controls the degree of document length normalization. It adjusts the impact of document length on the BM25 score.
Defaults to 0.75.
- Returns:
- Two DataFrames
Result.
Placeholder.
Examples
Input DataFrame df:
>>> df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke search_docs_by_keywords function:
>>> result, _ = search_docs_by_keywords(pred_data=df)
>>> result.collect()
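To make the roles of bm25_k1, bm25_b and num_best_matches more concrete, here is a minimal pure-Python sketch of BM25 ranking over the six example documents. It is an illustration only, not the implementation used by HANA: the helper name bm25_rank, the whitespace tokenization and the Lucene-style IDF smoothing are all assumptions made for this sketch, and the scores it produces need not match those returned by search_docs_by_keywords.

```python
import math
from collections import Counter

# Documents from the example above (ID -> content).
DOCS = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}


def bm25_rank(keywords, docs, num_best_matches=1, k1=1.2, b=0.75):
    """Hypothetical helper: rank docs against space-separated keywords.

    k1 mirrors bm25_k1 (term frequency saturation) and b mirrors
    bm25_b (document length normalization). The IDF variant below is
    an assumption; the source does not state which one HANA uses.
    """
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        # Length normalization factor: b=0 disables it, b=1 applies it fully.
        norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in keywords.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Larger k1 lets repeated occurrences keep adding to the score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:num_best_matches]


top_two = bm25_rank("term6", DOCS, num_best_matches=2)
```

With the default k1 and b, the query "term6" ranks doc6 ahead of doc5, because doc6 contains the term three times. Setting k1=0 removes the term frequency effect entirely, so both documents receive the same score in this sketch.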