get_related_doc
- hana_ml.text.tm.get_related_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)
Returns the top-ranked related documents for one or more query documents, based on the Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.
- Parameters:
- pred_data : DataFrame
Accepts input data in two different data structures:
Single-row mode:
1st column, Document content.
Note
This mode can process only one document at a time, so the input table must have exactly one row and one column.
Massive mode supports multiple rows:
1st column, ID.
2nd column, Document content.
Note
This mode is only valid on an SAP HANA Cloud instance.
- key : str, optional
Specifies the ID column. Only valid when pred_data contains multiple rows.
Defaults to the first column of pred_data.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, then it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame, TF-IDF Result.
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame, Doc Term Freq Table
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame, Doc Category Table
1st column, ID.
2nd column, Document category.
- top : int, optional
Returns only the top N results. If 0, all results are returned.
Defaults to 0.
- threshold : float, optional
Only results whose score is greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specifies the language of the documents. The HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, the language is detected automatically.
Defaults to 'EN'.
- index_name : str, optional
Specifies the index name. Applies only to a HANA On-Premise instance.
If None, a name is generated automatically.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid on a HANA Cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Uses an existing index on the given table. Applies only to a HANA On-Premise instance.
Defaults to None.
- Returns:
- DataFrame
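The interplay of top and threshold can be pictured with a small plain-Python sketch. The helper filter_scores is hypothetical and not part of hana_ml. It assumes the threshold is applied before the top-N cut, which this page does not state, and it treats None the same as the documented defaults. The input scores are taken from the Examples section.

```python
def filter_scores(scored, top=None, threshold=None):
    """Keep (doc_id, score) pairs with score > threshold, best first.

    Illustration only: top=0 (or None) keeps every remaining result.
    """
    threshold = 0.0 if threshold is None else threshold
    kept = [(doc_id, score) for doc_id, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if not top else kept[:top]


# Scores from the single-row example, deliberately unsorted.
scores = [("doc3", 0.042024), ("doc2", 0.891550), ("doc4", 0.021225), ("doc1", 0.804670)]

print(filter_scores(scores))                  # all four documents, best first
print(filter_scores(scores, top=2))           # only the two best matches
print(filter_scores(scores, threshold=0.03))  # drops doc4, whose score is below 0.03
```

In an actual call, the same effect would be requested through the top and threshold arguments of get_related_doc rather than by post-processing.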
Examples
Assuming 'ref_df' is an existing DataFrame that contains document IDs, content, and categories. Below are examples of invoking the 'get_related_doc' function.
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
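To make the tuple form of ref_data more concrete, the sketch below derives rows shaped like the Doc Term Freq Table (ID, TM_TERM, TM_TERM_FREQUENCY) and the Doc Category Table (ID, category) from the sample documents above, using plain Python. It assumes simple whitespace tokenization, which may differ from the text analysis SAP HANA performs; the helper names are hypothetical, and in practice these tables come from tf_analysis.

```python
from collections import Counter

# Sample reference documents copied from the ref_df output above.
ref_docs = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc2", "term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    ("doc3", "term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    ("doc5", "term4 term6", "CATEGORY_3"),
    ("doc6", "term4 term6 term6 term6", "CATEGORY_3"),
]


def doc_term_freq(docs):
    """Return (ID, TM_TERM, TM_TERM_FREQUENCY) rows (whitespace tokenization assumed)."""
    rows = []
    for doc_id, content, _category in docs:
        counts = Counter(content.split())
        for term in sorted(counts):
            rows.append((doc_id, term, counts[term]))
    return rows


def doc_category(docs):
    """Return (ID, category) rows."""
    return [(doc_id, category) for doc_id, _content, category in docs]


term_freq_rows = doc_term_freq(ref_docs)
category_rows = doc_category(ref_docs)

for row in term_freq_rows[:3]:
    print(row)
```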
For SAP HANA Cloud:
Invoking the function on an SAP HANA Cloud instance using a single-row input DataFrame 'pred_df':
>>> pred_df.collect()
                   CONTENT
0  term2 term2 term3 term3
>>> get_related_doc(pred_data=pred_df, ref_data=tfidf).collect()
     ID     SCORE
0  doc2  0.891550
1  doc1  0.804670
2  doc3  0.042024
3  doc4  0.021225
tfidf is the result returned by the tf_analysis function; refer to the Examples section of tf_analysis for its content.
Invoking the function on an SAP HANA Cloud instance using a massive-mode input DataFrame 'pred_df_massive', which contains multiple rows of data:
>>> pred_df_massive.collect()
   ID                        CONTENT
0   1        term2 term2 term3 term3
1   5  term3 term5 term5 term5 term6
>>> get_related_doc(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K  DOC_ID     SCORE
0           1  0    doc2  0.891550
1           1  1    doc1  0.804670
2           1  2    doc3  0.042024
3           1  3    doc4  0.021225
4           5  0    doc4  0.946186
5           5  1    doc3  0.943719
6           5  2    doc6  0.313616
7           5  3    doc5  0.309858
8           5  4    doc2  0.063908
9           5  5    doc1  0.045706
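Because the massive-mode result stacks the matches for every query document into one table, a common follow-up step is to pick the best match per PREDICT_ID. The sketch below does this in plain Python on rows copied from the output above. The helper best_match_per_query is hypothetical and not part of hana_ml. With a collected pandas DataFrame, the rows could presumably be obtained with itertuples(index=False).

```python
# Rows copied from the massive-mode output above: (PREDICT_ID, K, DOC_ID, SCORE).
rows = [
    (1, 0, "doc2", 0.891550),
    (1, 1, "doc1", 0.804670),
    (1, 2, "doc3", 0.042024),
    (1, 3, "doc4", 0.021225),
    (5, 0, "doc4", 0.946186),
    (5, 1, "doc3", 0.943719),
    (5, 2, "doc6", 0.313616),
    (5, 3, "doc5", 0.309858),
    (5, 4, "doc2", 0.063908),
    (5, 5, "doc1", 0.045706),
]


def best_match_per_query(result_rows):
    """Map each PREDICT_ID to the (DOC_ID, SCORE) pair with the highest score."""
    best = {}
    for predict_id, _k, doc_id, score in result_rows:
        if predict_id not in best or score > best[predict_id][1]:
            best[predict_id] = (doc_id, score)
    return best


best = best_match_per_query(rows)
for predict_id, (doc_id, score) in sorted(best.items()):
    print(predict_id, doc_id, score)
```

The helper compares scores directly instead of relying on K, since the meaning of the K column is only inferred here from the sample output, where K = 0 coincides with the highest score.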
For SAP HANA On-Premise:
Invoking the function on a SAP HANA On-Premise instance (only supports single-row mode):
>>> res = get_related_doc(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
     ID  RANK  TOTAL_TERM_COUNT  TERM_COUNT  CORRELATIONS  FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT
0  doc2     1                 6           3          None     None             None           None          None
1  doc1     2                 6           3          None     None             None           None          None
2  doc3     3                 6           3          None     None             None           None          None
3  doc4     4                 9           3          None     None             None           None          None
...  CLUSTER_RIGHT  HIGHLIGHTED_DOCUMENT  HIGHLIGHTED_TERMTYPES                                  SCORE
...           None                  None                   None   0.8915504731053067732915451415465213
...           None                  None                   None   0.8046698732333942283290184604993556
...           None                  None                   None  0.04202449735779462125506711345224176
...           None                  None                   None  0.02122540837399113089478674964993843
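The On-Premise result carries many columns that are None in this example and reports SCORE with far more digits than the Cloud output. The sketch below, a plain-Python illustration and not part of hana_ml, keeps only ID and SCORE and rounds the score to six decimal places so the two outputs can be compared. The helper compact_scores and the choice to hold SCORE as text are assumptions made for this illustration.

```python
from decimal import Decimal

# (ID, RANK, SCORE) values copied from the On-Premise output above; SCORE kept as text.
on_premise_rows = [
    ("doc2", 1, "0.8915504731053067732915451415465213"),
    ("doc1", 2, "0.8046698732333942283290184604993556"),
    ("doc3", 3, "0.04202449735779462125506711345224176"),
    ("doc4", 4, "0.02122540837399113089478674964993843"),
]


def compact_scores(rows, places=6):
    """Return (ID, SCORE) pairs with SCORE rounded to `places` decimal places."""
    quantum = Decimal(10) ** -places  # Decimal('0.000001') when places == 6
    return [(doc_id, Decimal(score).quantize(quantum)) for doc_id, _rank, score in rows]


for doc_id, score in compact_scores(on_premise_rows):
    print(doc_id, score)
```

Rounded this way, the values agree with the scores shown in the SAP HANA Cloud single-row example.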