hana_ml.text.tm package

This module provides various text mining functions. The following functions are available:
hana_ml.text.tm.tf_analysis(data, lang=None, enable_stopwords=None, keep_numeric=None)

Perform Term Frequency (TF) analysis on the given documents. TF is the number of occurrences of a term in a document.

Parameters:
data : DataFrame

Input data, structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

lang : str, optional

Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to None (auto detection).

enable_stopwords : bool, optional

Determine whether to turn on stopwords.

Defaults to True.

keep_numeric : bool, optional

Determine whether to keep numbers.

Defaults to False.

Returns:
A tuple of DataFrames

TF-IDF result, structured as follows:

  • TM_TERM.

  • TM_TERM_FREQUENCY.

  • TM_IDF_FREQUENCY.

  • TF_VALUE.

  • IDF_VALUE.

  • TF_IDF_VALUE.

Document term frequency table, structured as follows:

  • ID.

  • TM_TERM.

  • TM_TERM_FREQUENCY.

Document category table, structured as follows:

  • ID.

  • Document category.

Examples

The input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke tf_analysis function:

>>> tfidf = tf_analysis(data=df)

Output:

>>> tfidf[0].head(3).collect()
  TM_TERMS TM_TERM_TF_F  TM_TERM_IDF_F  TM_TERM_TF_V  TM_TERM_IDF_V
0    term1            1              1      0.030303       1.791759
1    term2            3              2      0.090909       1.098612
2    term3            7              4      0.212121       0.405465
>>> tfidf[1].head(3).collect()
     ID TM_TERMS  TM_TERM_FREQUENCY
0  doc1    term1                  1
1  doc1    term2                  2
2  doc1    term3                  3
>>> tfidf[2].head(3).collect()
      ID    CATEGORY
0   doc1  CATEGORY_1
1   doc2  CATEGORY_1
2   doc3  CATEGORY_2
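The TF and IDF values above follow the classic definitions. As a cross-check (a plain-Python sketch, assuming TF = corpus-wide term count / total term count and IDF = ln(number of documents / document frequency), which matches the values shown):

```python
import math
from collections import Counter

# The reference corpus from the example above.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}

tokens = [t for content in docs.values() for t in content.split()]
counts = Counter(tokens)                 # corpus-wide term frequency (TM_TERM_TF_F)
doc_freq = Counter(t for content in docs.values() for t in set(content.split()))
n_docs = len(docs)
total = len(tokens)                      # 33 terms in total

tf = {t: c / total for t, c in counts.items()}              # TM_TERM_TF_V
idf = {t: math.log(n_docs / doc_freq[t]) for t in counts}   # TM_TERM_IDF_V

print(round(tf["term3"], 6), round(idf["term3"], 6))  # 0.212121 0.405465
```

For example, term3 occurs 7 times across the 33 tokens (7/33 ≈ 0.212121) and appears in 4 of the 6 documents (ln(6/4) ≈ 0.405465), matching tfidf[0] above.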
hana_ml.text.tm.text_classification(pred_data, ref_data=None, k_nearest_neighbours=None, thread_ratio=None, lang=None, algorithm=None, seed=None, rdt_top_n=None, index_name=None, created_index=None)

This function classifies (categorizes) an input document with respect to sets of categories (taxonomies).

Parameters:
pred_data : DataFrame

The prediction data for classification, structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

ref_data : DataFrame or a tuple of DataFrames

Specify the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

Otherwise, if ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with DataFrames structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

k_nearest_neighbours : int, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratio : float, optional

Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

lang : str, optional

Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to None (auto detection) on SAP HANA Cloud, and to None on SAP HANA on-premise (please provide a value explicitly in the latter case).

algorithm : str, optional

Specify the algorithm to be used for text classification.

seed : int, optional

Specify the seed for random number generation.

rdt_top_n : int, optional

Specify the number of top terms to be used for the Random Decision Tree algorithm.

index_name : str, optional

Specify the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
A tuple of DataFrames (cloud version)

Text classification result, structured as follows:

  • Predict data ID.

  • TARGET.

Statistics table, structured as follows:

  • Predict data ID.

  • Training data ID.

  • Distance.

DataFrame (on-premise version)

Text classification result, structured as follows:

  • Predict data ID.

  • RANK.

  • CATEGORY_SCHEMA.

  • CATEGORY_TABLE.

  • CATEGORY_COLUMN.

  • CATEGORY_VALUE.

  • NEIGHBOR_COUNT.

  • SCORE.

Examples

The input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke text_classification:

>>> res = text_classification(pred_data=df.select(df.columns[0], df.columns[1]), ref_data=df)

Result on an SAP HANA Cloud instance:

>>> res[0].head(1).collect()
       ID     TARGET
0    doc1 CATEGORY_1

Result on an SAP HANA on-premise instance:

>>> res[0].head(1).collect()
     ID RANK  CATEGORY_SCHEMA                   CATEGORY_TABLE    CATEGORY_COLUMN  CATEGORY_VALUE  NEIGHBOR_COUNT
0  doc1    1       "PAL_USER" "TM_CATEGORIZE_KNN_DT_6_REF_TBL"         "CATEGORY"      CATEGORY_1               1
...                               SCORE
...0.5807794005266924131092309835366905
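At heart, this style of classification is a k-nearest-neighbour search over TF-IDF document vectors. A minimal plain-Python sketch of the idea (an approximation for illustration, not the PAL implementation — it uses count × ln(N/df) weights, cosine similarity, and a majority vote over the k nearest reference documents):

```python
import math
from collections import Counter

# Reference documents and their categories, from the example above.
ref = {
    "doc1": ("term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    "doc2": ("term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    "doc3": ("term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    "doc4": ("term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    "doc5": ("term4 term6", "CATEGORY_3"),
    "doc6": ("term4 term6 term6 term6", "CATEGORY_3"),
}

n_docs = len(ref)
doc_freq = Counter(t for content, _ in ref.values() for t in set(content.split()))
idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}

def vectorize(content):
    # count x IDF weight for every term known to the reference corpus
    return {t: c * idf[t] for t, c in Counter(content.split()).items() if t in idf}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(content, k=1):
    q = vectorize(content)
    nearest = sorted(ref, key=lambda d: cosine(q, vectorize(ref[d][0])), reverse=True)[:k]
    # majority vote among the k nearest reference documents
    return Counter(ref[d][1] for d in nearest).most_common(1)[0][0]

print(classify("term1 term2 term2 term3 term3 term3"))  # CATEGORY_1
```

With k=1 the prediction for doc1's own content is CATEGORY_1, since the identical reference document is its own nearest neighbour — consistent with the cloud result shown above.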

hana_ml.text.tm.get_related_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked related documents for a query document (or multiple documents) based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.

Parameters:
pred_data : DataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can only process one document at a time; therefore, the input table must consist of exactly one row and one column.

Multiple-row mode:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on an SAP HANA Cloud instance.

key : str, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_data : DataFrame or a tuple of DataFrames

Specify the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame, TF-IDF Result.

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame, Doc Term Freq Table

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame, Doc Category Table

    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Only show top N results. If 0, it shows all.

Defaults to 0.

threshold : float, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN'.

index_name : str, optional

Specify the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

thread_ratio : float, optional

Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on an SAP HANA Cloud instance:

  1. The input DataFrame pred_df in single-row mode:

>>> pred_df.collect()
                   CONTENT
0  term2 term2 term3 term3
>>> get_related_doc(pred_data=pred_df, ref_data=tfidf).collect()
       ID       SCORE
0    doc2    0.891550
1    doc1    0.804670
2    doc3    0.042024
3    doc4    0.021225

Here tfidf is the tuple of DataFrames returned by the tf_analysis function; please refer to the Examples section of tf_analysis for its content.

  2. The input DataFrame pred_df_massive in multi-row mode:

>>> pred_df_massive.collect()
   ID                          CONTENT
0   1          term2 term2 term3 term3
1   5    term3 term5 term5 term5 term6
>>> get_related_doc(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K DOC_ID     SCORE
0           1  0   doc2  0.891550
1           1  1   doc1  0.804670
2           1  2   doc3  0.042024
3           1  3   doc4  0.021225
4           5  0   doc4  0.946186
5           5  1   doc3  0.943719
6           5  2   doc6  0.313616
7           5  3   doc5  0.309858
8           5  4   doc2  0.063908
9           5  5   doc1  0.045706

Invoke the function on an SAP HANA on-premise instance (only supports single-row mode):

>>> res = get_related_doc(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
   ID    RANK   TOTAL_TERM_COUNT  TERM_COUNT  CORRELATIONS  FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT
0  doc2     1                  6           3          None     None             None           None          None
1  doc1     2                  6           3          None     None             None           None          None
2  doc3     3                  6           3          None     None             None           None          None
3  doc4     4                  9           3          None     None             None           None          None
... CLUSTER_RIGHT  HIGHLIGHTED_DOCUMENT  HIGHLIGHTED_TERMTYPES                                   SCORE
...          None                  None                   None    0.8915504731053067732915451415465213
...          None                  None                   None    0.8046698732333942283290184604993556
...          None                  None                   None   0.04202449735779462125506711345224176
...          None                  None                   None   0.02122540837399113089478674964993843
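The cloud-instance scores in the example above can be reproduced to good approximation with ordinary TF-IDF cosine similarity (a sketch under the assumption that each document is weighted by count × ln(N/df) per term — an observation about this example, not the documented formula):

```python
import math
from collections import Counter

# Reference corpus from the example above.
docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}
n = len(docs)
df = Counter(t for c in docs.values() for t in set(c.split()))
idf = {t: math.log(n / f) for t, f in df.items()}

def vec(content):
    # TF-IDF vector as a sparse dict: count x IDF per term
    return {t: c * idf.get(t, 0.0) for t, c in Counter(content.split()).items()}

def cos(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = vec("term2 term2 term3 term3")   # the single-row query above
scores = sorted(((d, cos(q, vec(c))) for d, c in docs.items()), key=lambda x: -x[1])
for d, s in scores[:2]:
    print(d, round(s, 4))   # doc2 then doc1, scores around 0.8916 and 0.8047
```

The ranking (doc2, doc1, doc3, doc4) and the scores agree with the cloud output above to about four decimal places.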

hana_ml.text.tm.get_related_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked related terms for a query term (or multiple terms) based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.

Parameters:
pred_data : DataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can only process one document at a time; therefore, the input table must consist of exactly one row and one column.

Multiple-row mode:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on an SAP HANA Cloud instance.

key : str, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_data : DataFrame or a tuple of DataFrames

Specifies the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Shows top N results. If 0, it shows all.

Defaults to 0.

threshold : float, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Specifies the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN'.

index_name : str, optional

Specifies the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on an SAP HANA Cloud instance:

  1. The input DataFrame pred_df in single-row mode:

>>> pred_df.collect()
  CONTENT
0   term3
>>> get_related_term(pred_data=pred_df, ref_data=ref_df).collect()
        ID       SCORE
0    term3    1.000000
1    term2    0.923760
2    term1    0.774597
3    term4    0.550179
4    term5    0.346410
  2. The input DataFrame pred_df_massive in multi-row mode:

>>> pred_df_massive.collect()
   ID    CONTENT
0   1      term3
1   2     term33
2   3      term6
>>> get_related_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term2  0.938145
1           2  1  term3  0.346242
2           5  0  term6  0.983396
3           5  1  term3  0.181471

Invoke the function on an SAP HANA on-premise instance (only supports single-row mode):

>>> res = get_related_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
  RANK  TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY  CORRELATIONS
0    1 term3            term3       noun               7                   4          None
1    2 term2            term2       noun               3                   2          None
2    3 term1            term1       noun               1                   1          None
3    4 term4            term4       noun               9                   5          None
4    5 term5            term5       noun               9                   2          None
... FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT  CLUSTER_RIGHT                                 SCORE
...    None             None           None          None           None  1.0000003613794823387195265240734440
...    None             None           None          None           None  0.9237607645314674931213971831311937
...    None             None           None          None           None  0.7745969491648266869177064108953346
...    None             None           None          None           None  0.5501794128048571597133786781341769
...    None             None           None          None           None  0.3464102866993003515538873671175679
hana_ml.text.tm.get_relevant_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked documents that are relevant to a term (or multiple terms) based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.

Parameters:
pred_data : DataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can only process one document at a time; therefore, the input table must consist of exactly one row and one column.

Multiple-row mode:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on an SAP HANA Cloud instance.

key : str, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_data : DataFrame or a tuple of DataFrames

Specifies the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Shows top N results. If 0, it shows all.

Defaults to 0.

threshold : float, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Specifies the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN'.

index_name : str, optional

Specifies the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on an SAP HANA Cloud instance:

  1. The input DataFrame pred_df in single-row mode:

>>> pred_df.collect()
  CONTENT
0   term3
>>> get_relevant_doc(pred_data=pred_df, ref_data=ref_df).collect()
       ID       SCORE
0    doc1    0.774597
1    doc2    0.516398
2    doc3    0.258199
3    doc4    0.258199
  2. The input DataFrame pred_df_massive in multi-row mode:

>>> pred_df_massive.collect()
   ID   CONTENT
0   2     term2
1   3    term33
2   5     term5
>>> get_relevant_doc(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K DOC_ID     SCORE
0           2  0   doc1  0.894427
1           2  1   doc2  0.447214
2           5  0   doc4  0.894427
3           5  1   doc3  0.447214

Invoke the function on an SAP HANA on-premise instance (only supports single-row mode):

>>> res = get_relevant_doc(pred_data=pred_df, ref_data=ref_df, top=4)
>>> res.collect()
     ID    RANK   TOTAL_TERM_COUNT  TERM_COUNT  CORRELATIONS  FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT
0  doc1       1                  6           3          None     None             None           None          None
1  doc2       2                  6           3          None     None             None           None          None
2  doc3       3                  6           3          None     None             None           None          None
3  doc4       4                  9           3          None     None             None           None          None
... CLUSTER_RIGHT  HIGHLIGHTED_DOCUMENT  HIGHLIGHTED_TERMTYPES                                   SCORE
...          None                  None                   None    0.7745969491648266869177064108953346
...          None                  None                   None    0.5163979661098845319600059156073257
...          None                  None                   None    0.2581989830549422659800029578036629
...          None                  None                   None    0.2581989830549422659800029578036629
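For this single-term query, the scores have a simple interpretation: each equals the term's count in the document divided by the Euclidean norm of the term's per-document count vector (an observation about this example, not a documented formula). A quick arithmetic check:

```python
import math

# term3 occurs 3, 2, 1, 1 times in doc1..doc4 (and 0 times elsewhere)
counts = {"doc1": 3, "doc2": 2, "doc3": 1, "doc4": 1}
norm = math.sqrt(sum(c * c for c in counts.values()))   # sqrt(15)

scores = {d: c / norm for d, c in counts.items()}
print(round(scores["doc1"], 6))  # 0.774597
print(round(scores["doc2"], 6))  # 0.516398
```

These match the cloud single-row output above (3/√15 ≈ 0.774597, 2/√15 ≈ 0.516398, 1/√15 ≈ 0.258199).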
hana_ml.text.tm.get_relevant_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked relevant terms that describe a document (or multiple documents) based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.

Parameters:
pred_data : DataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can only process one document at a time; therefore, the input table must consist of exactly one row and one column.

Multiple-row mode:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on an SAP HANA Cloud instance.

key : str, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_data : DataFrame or a tuple of DataFrames

Specifies the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Shows top N results. If 0, it shows all.

Defaults to 0.

threshold : float, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Specifies the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN'.

index_name : str, optional

Specifies the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on an SAP HANA Cloud instance:

  1. The input DataFrame pred_df in single-row mode:

>>> pred_df.collect()
  CONTENT
0   term3
>>> get_relevant_term(pred_data=pred_df, ref_data=ref_df).collect()
        ID   SCORE
0    term3     1.0
  2. The input DataFrame pred_df_massive in multi-row mode:

>>> pred_df_massive.collect()
   ID                      CONTENT
0   2      term2 term2 term3 term3
1   5     term6 term6 term33 term3
>>> get_relevant_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term2  0.938145
1           2  1  term3  0.346242
2           5  0  term6  0.983396
3           5  1  term3  0.181471

Invoke the function on an SAP HANA on-premise instance (only supports single-row mode):

>>> res = get_relevant_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
  RANK  TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY  CORRELATIONS
0    1 term3            term3       noun               7                   4          None
... FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT  CLUSTER_RIGHT                                 SCORE
...    None             None           None          None           None   1.000002901113076436701021521002986
hana_ml.text.tm.get_suggested_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None, key=None)

This function returns the top-ranked terms that match an initial substring (or multiple substrings) based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.

Parameters:
pred_data : DataFrame

Accepts input data in two different data structures:

Single-row mode:

  • 1st column, Document content.

Note

This mode can only process one document at a time; therefore, the input table must consist of exactly one row and one column.

Multiple-row mode:

  • 1st column, ID.

  • 2nd column, Document content.

Note

This mode is only valid on an SAP HANA Cloud instance.

key : str, optional

Specifies the ID column. Only valid when pred_data contains multiple rows.

Defaults to the first column of pred_data.

ref_data : DataFrame or a tuple of DataFrames

Specifies the reference data.

If ref_data is a DataFrame, then it should be structured as follows:

  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

If ref_data is a tuple of DataFrames, then it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:

  • 1st DataFrame

    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame

    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame

    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Shows top N results. If 0, it shows all.

Defaults to 0.

threshold : float, optional

Only results whose score is greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Specifies the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR', 'RU', 'PT'. If None, auto detection will be applied.

Defaults to 'EN' on SAP HANA Cloud, and to None on SAP HANA on-premise (please provide a value explicitly in the latter case).

index_name : str, optional

Specifies the index name. Applies only to an on-premise HANA instance.

If None, it will be generated.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.

Only valid for a HANA cloud instance.

Defaults to 0.0.

created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional

Use the created index on the given table. Applies only to an on-premise HANA instance.

Returns:
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke the function on an SAP HANA Cloud instance:

  1. The input DataFrame pred_df in single-row mode:

>>> pred_df.collect()
  CONTENT
0   term3

>>> get_suggested_term(pred_data=pred_df, ref_data=ref_df).collect()
        ID     SCORE
0    term3       1.0
  2. The input DataFrame pred_df_massive in multi-row mode:

>>> pred_df_massive.collect()
   ID CONTENT
0   2     ter
1   3     abc
>>> get_suggested_term(pred_data=pred_df_massive, ref_data=ref_df).collect()
   PREDICT_ID  K   TERM     SCORE
0           2  0  term5  0.830048
1           2  1  term6  0.368910
2           2  2  term2  0.276683
3           2  3  term3  0.238269
4           2  4  term1  0.150417
5           2  5  term4  0.137752

Invoke the function on an SAP HANA on-premise instance (only supports single-row mode):

>>> res = get_suggested_term(pred_data=pred_df, ref_data=ref_df)
>>> res.collect()
  RANK   TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY                                SCORE
0    1  term3            term3       noun               7                   4  0.999999999999999888977697537484346
class hana_ml.text.tm.TFIDF

Bases: PALBase

Class for term frequency–inverse document frequency.

Parameters:
None

Examples

Input dataframe for analysis:

>>> df_train.collect()
        ID      CONTENT
    0   doc1    term1 term2 term2 term3 term3 term3
    1   doc2    term2 term3 term3 term4 term4 term4
    2   doc3    term3 term4 term4 term5 term5 term5
    3   doc5    term3 term4 term4 term5 term5 term5 term5 term5 term5
    4   doc4    term4 term6
    5   doc6    term4 term6 term6 term6

Creating a TFIDF instance:

>>> tfidf = TFIDF()

Performing text_collector() on the given dataframe:

>>> idf, _ = tfidf.text_collector(data=df_train)
>>> idf.collect()
        TM_TERMS    TM_TERM_IDF_VALUE
    0   term1       1.791759
    1   term2       1.098612
    2   term3       0.405465
    3   term4       0.182322
    4   term5       1.098612
    5   term6       1.098612
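These IDF values follow the standard formula idf(t) = ln(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. A quick pure-Python check against the output above (an illustration, not the hana_ml implementation):

```python
import math

# The training documents from df_train above.
docs = [
    "term1 term2 term2 term3 term3 term3",
    "term2 term3 term3 term4 term4 term4",
    "term3 term4 term4 term5 term5 term5",
    "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "term4 term6",
    "term4 term6 term6 term6",
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df_counts = {}
for content in docs:
    for term in set(content.split()):
        df_counts[term] = df_counts.get(term, 0) + 1

idf = {t: math.log(n_docs / df) for t, df in sorted(df_counts.items())}
for term, value in idf.items():
    print(f"{term}  {value:.6f}")  # e.g. term1  1.791759
```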

Performing text_tfidf() on the given dataframe:

>>> result = tfidf.text_tfidf(data=df_train)
>>> result.collect()
        ID      TERMS   TF_VALUE    TFIDF_VALUE
    0   doc1    term1   1.0         1.791759
    1   doc1    term2   2.0         2.197225
    2   doc1    term3   3.0         1.216395
    3   doc2    term2   1.0         1.098612
    4   doc2    term3   2.0         0.810930
    5   doc2    term4   3.0         0.546965
    6   doc3    term3   1.0         0.405465
    7   doc3    term4   2.0         0.364643
    8   doc3    term5   3.0         3.295837
    9   doc5    term3   1.0         0.405465
    10  doc5    term4   2.0         0.364643
    11  doc5    term5   6.0         6.591674
    12  doc4    term4   1.0         0.182322
    13  doc4    term6   1.0         1.098612
    14  doc6    term4   1.0         0.182322
    15  doc6    term6   3.0         3.295837
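Each TFIDF_VALUE above is the raw term count (TF_VALUE) multiplied by the term's IDF, e.g. doc1/term2 = 2 × 1.098612 ≈ 2.197225. A pure-Python recomputation (illustrative only, not the hana_ml implementation):

```python
import math

docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc5": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc4": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}

# Inverse document frequency over the whole corpus.
df_counts = {}
for content in docs.values():
    for term in set(content.split()):
        df_counts[term] = df_counts.get(term, 0) + 1
idf = {t: math.log(len(docs) / df) for t, df in df_counts.items()}

# TF is the raw occurrence count; TF-IDF is their product.
tfidf = {}
for doc_id, content in docs.items():
    counts = {}
    for term in content.split():
        counts[term] = counts.get(term, 0) + 1
    for term, tf in counts.items():
        tfidf[(doc_id, term)] = tf * idf[term]

print(round(tfidf[("doc1", "term2")], 6))  # 2.197225
print(round(tfidf[("doc5", "term5")], 6))  # 6.591674
```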

Attributes:
fit_hdbprocedure

Returns the generated hdbprocedure for fit.

predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_apply_func(func_name, data[, key, ...])

Create HANA TUDF SQL code.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

text_collector(data)

Computes the inverse document frequency (IDF) of the documents provided by the user.

text_tfidf(data[, idf])

Computes the term frequency-inverse document frequency (TF-IDF) value per document.

text_collector(data)

Computes the inverse document frequency (IDF) of the documents provided by the user.

Parameters:
dataDataFrame

Data to be analyzed. The first column of the input data table is assumed to be an ID column.

Returns:
DataFrame
  • Inverse document frequency of documents.

  • Extended table.

text_tfidf(data, idf=None)

Computes the term frequency-inverse document frequency (TF-IDF) value per document.

Parameters:
dataDataFrame

Data to be analyzed.

The first column of the input data table is assumed to be an ID column.

idfDataFrame, optional

Inverse document frequency of documents.

Returns:
DataFrame
  • Term frequency - inverse document frequency by document.
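
The idf argument lets you score new documents against IDF values learned from a reference corpus instead of recomputing them. In pure-Python terms (an illustration of the semantics, not the hana_ml implementation):

```python
# IDF values as returned by text_collector on the training corpus above.
idf = {"term1": 1.791759, "term2": 1.098612, "term3": 0.405465,
       "term4": 0.182322, "term5": 1.098612, "term6": 1.098612}

# Score a new document using the precomputed IDF.
new_doc = "term3 term3 term6"
counts = {}
for term in new_doc.split():
    counts[term] = counts.get(term, 0) + 1

# (term, TF, TF * IDF) per distinct term in the new document.
rows = [(term, float(tf), tf * idf[term]) for term, tf in counts.items()]
print(rows)
```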

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters:
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters:
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters:
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters:
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_apply_func(func_name, data, key=None, model=None, output_table_structure=None, execute=True, force=True)

Create HANA TUDF SQL code.

Parameters:
func_namestr

The function name of TUDF.

dataDataFrame

The data to be predicted.

keystr, optional

The key column name in the predict dataframe.

modelDataFrame, optional

The model dataframe for prediction. If not specified, it will use model_.

output_table_structuredict, optional

The return table structure.

executebool, optional

Execute the creation SQL.

Defaults to True.

forcebool, optional

If True, it will drop the existing TUDF.

Returns:
str

The generated TUDF SQL code.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters:
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters:
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters:
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns:
list

List of table names.

get_fit_parameters()

Get PAL fit parameters.

Returns:
list

List of tuples, where each tuple describes a parameter like (name, value, type)

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects synonyms, which have to be resolved afterwards.

Returns:
dict

The procedure name synonym: CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"

get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, exactly one of the other three contains the value matching the parameter's type, and the remaining two are NULL. This format is converted into a simple key-value based storage.

Returns:
dict

Dict of list of tuples, where each tuple describes a parameter like (name, value, type)
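
The four-array layout described above can be collapsed into (name, value, type) tuples roughly as follows. This is an illustrative sketch with made-up parameter rows, not the hana_ml parser:

```python
# Each PAL parameter row carries the name plus exactly one non-NULL value
# slot: (name, int_value, double_value, string_value).
raw_params = [
    ("THREAD_RATIO", None, 0.5, None),      # hypothetical example rows
    ("ENABLE_STOPWORDS", 1, None, None),
    ("LANGUAGE", None, None, "EN"),
]

def to_key_value(rows):
    """Collapse the four-slot layout into (name, value, type) tuples."""
    out = []
    for name, i_val, d_val, s_val in rows:
        if i_val is not None:
            out.append((name, i_val, "integer"))
        elif d_val is not None:
            out.append((name, d_val, "double"))
        else:
            out.append((name, s_val, "string"))
    return out

print(to_key_value(raw_params))
```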

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns:
list

List of table names.

get_predict_parameters()

Get PAL predict parameters.

Returns:
list

List of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns:
list

List of table names.

get_score_parameters()

Get SAP HANA PAL score parameters.

Returns:
list

List of tuples, where each tuple describes a parameter like (name, value, type)

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters:
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters:
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.