get_relevant_term
- hana_ml.text.tm.get_relevant_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)
This function returns the top-ranked relevant terms that describe one or more documents, based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
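As a rough illustration of what a TF-IDF score expresses, the following is a minimal sketch in plain Python. It uses the textbook formulation (raw term count multiplied by log(N / document frequency)); the exact weighting and normalization applied inside SAP HANA may differ. The helper name `tf_idf` is hypothetical and is not part of hana_ml.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: score}} using raw counts and log(N / df).

    Textbook TF-IDF only; this is an assumption, not the SAP HANA formula.
    """
    n_docs = len(docs)
    # Term frequency: how often each term occurs in each document.
    term_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


# Three of the reference documents used in the Examples section.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}
scores = tf_idf(docs)
# Rank the terms of doc6 from most to least relevant.
top_terms = sorted(scores["doc6"].items(), key=lambda kv: kv[1], reverse=True)
print(top_terms)
```

Terms that are frequent in one document but rare across the collection receive the highest scores, which is why such terms are treated as the most descriptive.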
- Parameters:
- pred_data : DataFrame
Accepts input data in two different data structures:
Single-row mode:
1st column, Document content.
Note
This mode can process only one document at a time, so the input table must have exactly one row and one column.
Massive mode supports multiple rows:
1st column, ID.
2nd column, Document content.
Note
This mode is only valid on an SAP HANA Cloud instance.
- key : str, optional
Specifies the ID column. Only valid when pred_data contains multiple rows.
Defaults to the first column of pred_data.
- ref_data : DataFrame or a tuple of DataFrames
Specifies the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- top : int, optional
Shows top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are included in the result table.
Defaults to 0.0.
- lang : str, optional
Specifies the language type. The HANA Cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specifies the index name. Applies only to a HANA On-Premise instance.
If None, it will be generated.
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA Cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Uses an existing index on the given table. Applies only to a HANA On-Premise instance.
Defaults to None.
- Returns:
- DataFrame
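The effect of the top and threshold parameters can be pictured as a filter over ranked (term, score) pairs. The sketch below is an assumption about the semantics as documented above (keep scores strictly greater than threshold, then keep the first N, with 0 meaning all); the real filtering is done inside the database, and `filter_terms` is a hypothetical helper, not a hana_ml function.

```python
def filter_terms(scored_terms, top=0, threshold=0.0):
    """Keep terms scoring above `threshold`, then the best `top` (0 = all).

    Illustrative only; mirrors the documented parameter semantics.
    """
    # Rank from highest to lowest score.
    ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)
    # Strictly greater than the threshold, as the parameter description states.
    kept = [(term, score) for term, score in ranked if score > threshold]
    return kept if top == 0 else kept[:top]


# Scores for term2 and term3 are taken from the massive-mode example;
# "term_x" with score 0.0 is a made-up entry to show the threshold at work.
scored = [("term3", 0.346242), ("term2", 0.938145), ("term_x", 0.0)]

print(filter_terms(scored))                 # defaults: top=0, threshold=0.0
print(filter_terms(scored, top=1))          # best term only
print(filter_terms(scored, threshold=0.5))  # only scores above 0.5
```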
Examples
Input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke the function on an SAP HANA Cloud instance:
pred_df in single-row mode:
>>> pred_df.collect()
  CONTENT
0   term3
>>> get_relevant_term(pred_data=pred_df, ref_data=ref_df).collect()
      ID  SCORE
0  term3    1.0
pred_df_massive in massive mode which supports multiple rows:
>>> pred_df_massive.collect()
   ID                   CONTENT
0   2   term2 term2 term3 term3
1   5  term6 term6 term33 term3
>>> get_relevant_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term2  0.938145
1           2  1  term3  0.346242
2           5  0  term6  0.983396
3           5  1  term3  0.181471
Invoke the function on an SAP HANA On-Premise instance (only single-row mode is supported):
>>> res = get_relevant_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
   RANK   TERM NORMALIZED_TERM TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY CORRELATIONS
0     1  term3           term3      noun               7                   4         None
... FACTORS ROTATED_FACTORS CLUSTER_LEVEL CLUSTER_LEFT CLUSTER_RIGHT                                SCORE
...    None            None          None         None          None  1.000002901113076436701021521002986
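The input-shape rules for pred_data (one row and one column for single-row mode; an ID column plus a content column for massive mode, which is valid only on SAP HANA Cloud) can be summarized as a small pre-flight check. This is a sketch under those stated assumptions: `input_mode` is a hypothetical helper, not part of hana_ml, and it treats any two-column input as massive mode.

```python
def input_mode(n_rows, n_cols, is_cloud):
    """Classify a pred_data shape as 'single-row' or 'massive'.

    Hypothetical helper that encodes the rules from the parameter description.
    """
    if n_rows == 1 and n_cols == 1:
        # Exactly one document in one column: valid on both platforms.
        return "single-row"
    if n_cols == 2 and n_rows >= 1:
        # ID + content columns: massive mode, documented as Cloud-only.
        if not is_cloud:
            raise ValueError("massive mode is only valid on SAP HANA Cloud")
        return "massive"
    raise ValueError(
        f"unsupported pred_data shape: {n_rows} rows x {n_cols} columns"
    )


print(input_mode(1, 1, is_cloud=False))  # shape of pred_df in the examples
print(input_mode(2, 2, is_cloud=True))   # shape of pred_df_massive
```

On a real connection, the row count and column list would come from the hana_ml DataFrame itself before calling get_relevant_term.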