text_classification
- hana_ml.text.tm.text_classification(pred_data, ref_data=None, k_nearest_neighbours=None, thread_ratio=None, lang=None, index_name=None, created_index=None)
This function classifies (categorizes) an input document with respect to sets of categories (taxonomies), using a TF-IDF text vectorizer and a KNN classifier.
- Parameters:
- pred_data : DataFrame
The prediction data for classification, structured as follows:
1st column, ID.
2nd column, Document content.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, then it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
Otherwise, if ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with the DataFrames structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TM_TERM_FREQUENCY.
3rd column, TM_IDF_FREQUENCY.
4th column, TF_VALUE.
5th column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- thread_ratio : float, optional
Specify the ratio of the total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means using only 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- lang : str, optional
Specify the language type. A HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'.
Defaults to None. On a HANA cloud instance, None means that auto detection is applied; on a HANA On-Premise instance, please provide the value explicitly.
- index_name : str, optional
Specify the index name. Applies only to a HANA On-Premise instance.
If None, it will be generated.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. Applies only to a HANA On-Premise instance.
Defaults to None.
- Returns:
- DataFrames (cloud version)
Text classification result, structured as follows:
Predict data ID.
TARGET.
Statistics table, structured as follows:
Predict data ID.
Training data ID.
Distance.
- DataFrame (on-premise version)
Text classification result, structured as follows:
Predict data ID.
RANK.
CATEGORY_SCHEMA.
CATEGORY_TABLE.
CATEGORY_COLUMN.
CATEGORY_VALUE.
NEIGHBOR_COUNT.
SCORE.
Examples
Input DataFrame df:
>>> df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke text_classification:
>>> res = text_classification(pred_data=df.select(df.columns[0], df.columns[1]), ref_data=df)
Result on a SAP HANA cloud instance:
>>> res[0].head(1).collect()
     ID      TARGET
0  doc1  CATEGORY_1
Result on a SAP HANA On-Premise instance:
>>> res[0].head(1).collect()
     ID  RANK CATEGORY_SCHEMA                    CATEGORY_TABLE CATEGORY_COLUMN CATEGORY_VALUE  NEIGHBOR_COUNT
0  doc1     1      "PAL_USER"  "TM_CATEGORIZE_KNN_DT_6_REF_TBL"      "CATEGORY"     CATEGORY_1               1
...                                 SCORE
...0.5807794005266924131092309835366905
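
To give a feel for what "TF-IDF text vectorizer and KNN classifier" means, the following is a minimal pure-Python sketch of the idea applied to the example documents. It is not the SAP HANA implementation. The IDF formula (log(N / document frequency)), the use of cosine similarity, whitespace tokenization, and the helper names idf_table, tfidf, cosine and classify are all assumptions made for illustration; the weighting and distance measure actually used by the database are not stated here, so scores will not match the SCORE column above.

```python
import math
from collections import Counter

# Reference documents from the example above: (ID, CONTENT, CATEGORY).
REF = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc2", "term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    ("doc3", "term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    ("doc5", "term4 term6", "CATEGORY_3"),
    ("doc6", "term4 term6 term6 term6", "CATEGORY_3"),
]


def idf_table(docs):
    """Map each term to log(N / document frequency) (assumed IDF variant)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(text, idf):
    """Weight raw term counts by IDF; terms unseen in the reference set are dropped."""
    counts = Counter(text.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, ref=REF, k=1):
    """Return the majority category among the k most similar reference documents."""
    idf = idf_table([content for _, content, _ in ref])
    query = tfidf(text, idf)
    scored = sorted(
        ((cosine(query, tfidf(content, idf)), category) for _, content, category in ref),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(category for _, category in scored[:k])
    return votes.most_common(1)[0][0]


predicted = classify("term1 term2 term2 term3 term3 term3")
print(predicted)  # CATEGORY_1: doc1 is its own nearest neighbour
```

With k=1 a document that also appears in the reference data is always matched to itself, which is why doc1 comes back as CATEGORY_1 in the example; a larger k trades that exact match for a vote across several neighbours.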