text_classification

hana_ml.text.tm.text_classification(pred_data, ref_data=None, k_nearest_neighbours=None, thread_ratio=None, lang=None, index_name=None, created_index=None)

This function classifies (categorizes) an input document with respect to sets of categories (taxonomies) using a TF-IDF text vectorizer and a KNN classifier.
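To make the TF-IDF plus KNN idea concrete, the following is a minimal pure-Python sketch that runs without a SAP HANA connection. It is not the library's implementation: the weighting `tf * log(N / df)`, cosine similarity as the closeness measure, and majority voting are assumptions chosen for illustration, and the names `fit_idf`, `vectorize`, `cosine` and `classify` are hypothetical. The reference documents are the ones used in the Examples section.

```python
import math
from collections import Counter

# Reference documents from the Examples section: (ID, content, category).
REF_DOCS = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc2", "term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    ("doc3", "term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    ("doc5", "term4 term6", "CATEGORY_3"),
    ("doc6", "term4 term6 term6 term6", "CATEGORY_3"),
]


def fit_idf(docs):
    """Return {term: log(N / document_frequency)} for the reference corpus."""
    doc_freq = Counter()
    for _, content, _ in docs:
        doc_freq.update(set(content.split()))
    n_docs = len(docs)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def vectorize(content, idf):
    """Map a document to a sparse {term: tf * idf} vector; unseen terms are dropped."""
    counts = Counter(content.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(content, docs=REF_DOCS, k=1):
    """Return the majority category among the k most similar reference documents."""
    idf = fit_idf(docs)
    query = vectorize(content, idf)
    scored = sorted(
        ((cosine(query, vectorize(text, idf)), category) for _, text, category in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(category for _, category in scored[:k])
    return votes.most_common(1)[0][0]


print(classify("term1 term2 term2 term3 term3 term3", k=1))
```

With k=1, a document identical to doc1 is assigned doc1's category, CATEGORY_1. Scores or distances produced by SAP HANA itself may differ from this sketch, since its exact weighting and distance measure are not stated here.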

Parameters:
pred_data : DataFrame

The prediction data for classification, structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

ref_data : DataFrame or a tuple of DataFrames

Specify the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

Otherwise, if ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with DataFrames structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TM_TERM_FREQUENCY.

    • 3rd column, TM_IDF_FREQUENCY.

    • 4th column, TF_VALUE.

    • 5th column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

k_nearest_neighbours : int, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratio : float, optional

Specify the ratio of the total number of threads that can be used by this function. The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all the currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

lang : str, optional

Specify the language type. A HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to None. On a HANA cloud instance, None means auto detection. On a HANA On-Premise instance the default is also None, but please provide the value in this case.

index_name : str, optional

Specify the index name. Applies only to a HANA On-Premise instance.

If None, it will be generated.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to a HANA On-Premise instance.

Defaults to None.
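Since several parameters behave differently on cloud and On-Premise instances, a small client-side check before calling the function can catch mismatches early. The sketch below only encodes the rules stated in the parameter descriptions above. The helper `check_arguments` and its `on_premise` flag are hypothetical and not part of hana_ml, and treating the language codes as case-sensitive is an assumption.

```python
# Languages listed in the lang parameter description for a HANA cloud instance.
CLOUD_LANGUAGES = {"EN", "DE", "ES", "FR", "RU", "PT"}
# Keys listed in the created_index parameter description.
CREATED_INDEX_KEYS = {"index", "schema", "table"}


def check_arguments(on_premise, lang=None, index_name=None, created_index=None):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper, not part of hana_ml.
    """
    problems = []
    if on_premise:
        # On-Premise: the documentation asks for an explicit language.
        if lang is None:
            problems.append("lang should be provided on HANA On-Premise")
        if created_index is not None:
            missing = CREATED_INDEX_KEYS - set(created_index)
            if missing:
                problems.append(
                    "created_index is missing keys: " + ", ".join(sorted(missing))
                )
    else:
        # Cloud: None means auto detection, otherwise use a listed language.
        if lang is not None and lang not in CLOUD_LANGUAGES:
            problems.append("lang %r is not in the documented cloud language list" % lang)
        if index_name is not None or created_index is not None:
            problems.append("index_name and created_index apply only to HANA On-Premise")
    return problems


# The index, schema and table names here are placeholders.
print(check_arguments(
    on_premise=True,
    lang="EN",
    created_index={"index": "MY_INDEX", "schema": "MY_SCHEMA", "table": "MY_TABLE"},
))
```

Returning a list of problems instead of raising lets the caller decide whether a mismatch is fatal or merely worth logging.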

Returns:
DataFrames (cloud version)

Text classification result, structured as follows:

  • Predict data ID.

  • TARGET.

Statistics table, structured as follows:

  • Predict data ID.

  • Training data ID.

  • Distance.

DataFrame (on-premise version)

Text classification result, structured as follows:

  • Predict data ID.

  • RANK.

  • CATEGORY_SCHEMA.

  • CATEGORY_TABLE.

  • CATEGORY_COLUMN.

  • CATEGORY_VALUE.

  • NEIGHBOR_COUNT.

  • SCORE.
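Because the return value differs between the two versions, calling code that must run against both may want to normalize it first. The sketch below assumes the cloud version returns a tuple or list holding the classification result followed by the statistics table, and that the on-premise version returns either a single object or a one-element sequence. The helper `split_result` is hypothetical and not part of hana_ml.

```python
def split_result(res):
    """Return (classification_result, statistics_or_None) for either version.

    Hypothetical helper: assumes a tuple/list holds the classification result
    first and, if present, the statistics table second.
    """
    if isinstance(res, (tuple, list)):
        result = res[0]
        stats = res[1] if len(res) > 1 else None
        return result, stats
    # A single object is treated as the classification result itself.
    return res, None


# Plain strings stand in for DataFrames so the sketch runs without a connection.
result, stats = split_result(("classification", "statistics"))
print(result, stats)
```

After normalization, `result` can be handled uniformly and `stats` is checked for None before use.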

Examples

Input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke text_classification:

>>> from hana_ml.text.tm import text_classification
>>> res = text_classification(pred_data=df.select(df.columns[0], df.columns[1]), ref_data=df)

Result on a SAP HANA cloud instance:

>>> res[0].head(1).collect()
       ID     TARGET
0    doc1 CATEGORY_1

Result on a SAP HANA On-Premise instance:

>>> res[0].head(1).collect()
     ID RANK  CATEGORY_SCHEMA                   CATEGORY_TABLE    CATEGORY_COLUMN  CATEGORY_VALUE  NEIGHBOR_COUNT
0  doc1    1       "PAL_USER" "TM_CATEGORIZE_KNN_DT_6_REF_TBL"         "CATEGORY"      CATEGORY_1               1
...                               SCORE
...0.5807794005266924131092309835366905