hana_ml.text.tm package

The text.tm package consists of the following section:

hana_ml.text.tm

This module provides various text mining functions. The following functions are available:
hana_ml.text.tm.tf_analysis(data)

Performs Term Frequency (TF) analysis on the given documents. TF is the number of occurrences of a term in a document.

Parameters
data : DataFrame
  • 1st column, ID.

  • 2nd column, Document content.

  • 3rd column, Document category.

Returns
A tuple of DataFrames
TF-IDF result, structured as follows:
  • TM_TERM.

  • TM_TERM_FREQUENCY.

  • TM_IDF_FREQUENCY.

  • TF_VALUE.

  • IDF_VALUE.

  • TF_IDF_VALUE.

Document term frequency table, structured as follows:
  • ID.

  • TM_TERM.

  • TM_TERM_FREQUENCY.

Document category table, structured as follows:
  • ID.

  • Document category.

Examples

The input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke tf_analysis function:

>>> tfidf = tf_analysis(df)

Output:

>>> tfidf[0].head(3).collect()
  TM_TERMS TM_TERM_TF_F  TM_TERM_IDF_F  TM_TERM_TF_V  TM_TERM_IDF_V
0    term1            1              1      0.030303       1.791759
1    term2            3              2      0.090909       1.098612
2    term3            7              4      0.212121       0.405465
>>> tfidf[1].head(3).collect()
     ID TM_TERMS  TM_TERM_FREQUENCY
0  doc1    term1                  1
1  doc1    term2                  2
2  doc1    term3                  3
>>> tfidf[2].head(3).collect()
      ID    CATEGORY
0   doc1  CATEGORY_1
1   doc2  CATEGORY_1
2   doc3  CATEGORY_2
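
The figures in the first output table can be checked by hand. The sketch below is plain Python, not hana_ml code. It assumes that the TF value is a term's corpus-wide count divided by the total number of terms, and that the IDF value is ln(N / document frequency). Both formulas are inferred from the sample output rather than taken from the library, and all variable names are illustrative.

```python
import math
from collections import Counter

# Sample corpus from the example above: (ID, CONTENT, CATEGORY).
docs = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc2", "term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    ("doc3", "term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    ("doc5", "term4 term6", "CATEGORY_3"),
    ("doc6", "term4 term6 term6 term6", "CATEGORY_3"),
]

# Per-document term counts (the shape of the second output table).
doc_term_freq = {doc_id: Counter(content.split()) for doc_id, content, _ in docs}

# Corpus-wide counts: total occurrences of each term and the number of
# documents that contain it.
term_freq = Counter()
doc_freq = Counter()
for counts in doc_term_freq.values():
    term_freq.update(counts)
    doc_freq.update(counts.keys())

total_terms = sum(term_freq.values())
n_docs = len(docs)

# Assumed formulas, inferred from the sample output only.
tf_value = {t: term_freq[t] / total_terms for t in term_freq}
idf_value = {t: math.log(n_docs / doc_freq[t]) for t in term_freq}

for term in sorted(term_freq)[:3]:
    print(term, term_freq[term], doc_freq[term],
          round(tf_value[term], 6), round(idf_value[term], 6))
```

Under these assumptions the printed rows for term1, term2 and term3 agree with the first three rows of tfidf[0] shown above.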
hana_ml.text.tm.text_classification(pred_data, ref_data=None, k_nearest_neighbours=None, thread_ratio=None, lang='EN', index_name=None)

This function classifies (categorizes) an input document with respect to sets of categories (taxonomies).

Parameters
pred_data : DataFrame

The prediction data for classification.

  • 1st column, ID.

  • 2nd column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID.

    • 2nd column, Document content.

    • 3rd column, Document Category.

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame
    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame
    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame
    • 1st column, ID.

    • 2nd column, Document category.

k_nearest_neighbours : int, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that this function can use. The value ranges from 0 to 1, where 0 means using only one thread and 1 means using at most all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances. If None, a name is generated.

Returns
DataFrame (Cloud)
Text classification result, structured as follows:
  • Predict data ID.

  • TARGET.

Statistics table, structured as follows:
  • Predict data ID.

  • Training data ID.

  • Distance.

DataFrame (On-Premise)
Text classification result, structured as follows:
  • Predict data ID.

  • RANK.

  • CATEGORY_SCHEMA.

  • CATEGORY_TABLE.

  • CATEGORY_COLUMN.

  • CATEGORY_VALUE.

  • NEIGHBOR_COUNT.

  • SCORE.

Examples

The input DataFrame df:

>>> df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

Invoke text_classification:

>>> res = text_classification(df.select(df.columns[0], df.columns[1]), df)

Result on a SAP HANA Cloud instance:

>>> res[0].head(1).collect()
       ID     TARGET
0    doc1 CATEGORY_1

Result on a SAP HANA On-Premise instance:

>>> res[0].head(1).collect()
     ID RANK  CATEGORY_SCHEMA                   CATEGORY_TABLE    CATEGORY_COLUMN  CATEGORY_VALUE  NEIGHBOR_COUNT
0  doc1    1       "PAL_USER" "TM_CATEGORIZE_KNN_DT_6_REF_TBL"         "CATEGORY"      CATEGORY_1               1
...                               SCORE
...0.5807794005266924131092309835366905
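
For intuition, k-nearest-neighbour classification over TF-IDF vectors can be sketched in plain Python. This is a generic illustration, not the procedure that hana_ml runs in the database: the cosine similarity measure, the raw-count TF-IDF weighting, and the helper names (tfidf_vector, cosine, classify) are all assumptions.

```python
import math
from collections import Counter

# Reference data from the example above: (ID, CONTENT, CATEGORY).
ref_docs = [
    ("doc1", "term1 term2 term2 term3 term3 term3", "CATEGORY_1"),
    ("doc2", "term2 term3 term3 term4 term4 term4", "CATEGORY_1"),
    ("doc3", "term3 term4 term4 term5 term5 term5", "CATEGORY_2"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5", "CATEGORY_2"),
    ("doc5", "term4 term6", "CATEGORY_3"),
    ("doc6", "term4 term6 term6 term6", "CATEGORY_3"),
]

# Inverse document frequency per term: ln(N / number of docs containing it).
doc_freq = Counter()
for _, content, _ in ref_docs:
    doc_freq.update(set(content.split()))
idf = {t: math.log(len(ref_docs) / df) for t, df in doc_freq.items()}


def tfidf_vector(content):
    """Sparse TF-IDF vector; terms unknown to the reference data are dropped."""
    counts = Counter(content.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(content, k=1):
    """Majority vote among the k most similar reference documents."""
    query = tfidf_vector(content)
    scored = sorted(
        ((cosine(query, tfidf_vector(c)), cat) for _, c, cat in ref_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(cat for _, cat in scored[:k])
    return votes.most_common(1)[0][0]


print(classify("term1 term2 term2 term3 term3 term3"))
```

With k=1, a document that also appears in the reference data is assigned its own category, which agrees with the CATEGORY_1 result for doc1 in the example.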

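hana_ml.text.tm.get_related_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None)

This function returns the top-ranked related documents for a query document based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.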

Parameters
pred_data : DataFrame
  • 1st column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID

    • 2nd column, Document content

    • 3rd column, Document Category

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame, TF-IDF Result
    • 1st column, TM_TERM

    • 2nd column, TF_VALUE

    • 3rd column, IDF_VALUE

  • 2nd DataFrame, Doc Term Freq Table
    • 1st column, ID

    • 2nd column, TM_TERM

    • 3rd column, TM_TERM_FREQUENCY

  • 3rd DataFrame, Doc Category Table
    • 1st column, ID

    • 2nd column, Document category

top : int, optional

Show only the top N results. If 0, all results are shown.

Defaults to 0.

threshold : float, optional

Only results with a score greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances.

Returns
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

The input DataFrame pred_df:

>>> pred_df.collect()
                   CONTENT
0  term2 term2 term3 term3

Invoke the function on a SAP HANA Cloud instance. Here tfidf is the tuple of DataFrames returned by the tf_analysis function; refer to the Examples section of tf_analysis for its content.

>>> get_related_doc(pred_df, tfidf).collect()
       ID       SCORE
0    doc2    0.891550
1    doc1    0.804670
2    doc3    0.042024
3    doc4    0.021225

Invoke the function on a SAP HANA On-Premise instance:

>>> res = get_related_doc(df_test1_onpremise, df_onpremise)
>>> res.collect()
   ID    RANK   TOTAL_TERM_COUNT  TERM_COUNT  CORRELATIONS  FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT
0  doc2     1                  6           3          None     None             None           None          None
1  doc1     2                  6           3          None     None             None           None          None
2  doc3     3                  6           3          None     None             None           None          None
3  doc4     4                  9           3          None     None             None           None          None
... CLUSTER_RIGHT  HIGHLIGHTED_DOCUMENT  HIGHLIGHTED_TERMTYPES                                   SCORE
...          None                  None                   None    0.8915504731053067732915451415465213
...          None                  None                   None    0.8046698732333942283290184604993556
...          None                  None                   None   0.04202449735779462125506711345224176
...          None                  None                   None   0.02122540837399113089478674964993843
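
The scores above are consistent with cosine similarity between TF-IDF vectors in which each weight is the raw term count multiplied by ln(N / document frequency). The plain-Python sketch below reproduces them under that assumption. The weighting is inferred from the sample output, not from the library, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Reference documents from the example above: (ID, CONTENT).
ref_docs = [
    ("doc1", "term1 term2 term2 term3 term3 term3"),
    ("doc2", "term2 term3 term3 term4 term4 term4"),
    ("doc3", "term3 term4 term4 term5 term5 term5"),
    ("doc4", "term3 term4 term4 term5 term5 term5 term5 term5 term5"),
    ("doc5", "term4 term6"),
    ("doc6", "term4 term6 term6 term6"),
]

doc_freq = Counter()
for _, content in ref_docs:
    doc_freq.update(set(content.split()))
idf = {t: math.log(len(ref_docs) / df) for t, df in doc_freq.items()}


def tfidf_vector(content):
    """Sparse vector of (raw count * IDF); unknown terms are dropped."""
    counts = Counter(content.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_docs(query, top=0, threshold=0.0):
    """Rank reference documents by similarity to the query document."""
    q = tfidf_vector(query)
    scored = [(doc_id, cosine(q, tfidf_vector(c))) for doc_id, c in ref_docs]
    ranked = sorted(
        (pair for pair in scored if pair[1] > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top] if top else ranked


for doc_id, score in related_docs("term2 term2 term3 term3"):
    print(doc_id, round(score, 6))
```

doc5 and doc6 share no term with the query, so their similarity is 0 and the default threshold of 0.0 filters them out, which matches the four-row result.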

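hana_ml.text.tm.get_related_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None)

This function returns the top-ranked related terms for a query term based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.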

Parameters
pred_data : DataFrame
  • 1st column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID.

    • 2nd column, Document content.

    • 3rd column, Document Category.

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame
    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame
    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame
    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Show only the top N results. If 0, all results are shown.

Defaults to 0.

threshold : float, optional

Only results with a score greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances.

Returns
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

The input DataFrame pred_df:

>>> pred_df.collect()
  CONTENT
0   term3

Invoke the function on a SAP HANA Cloud instance:

>>> get_related_term(pred_df, ref_df).collect()
        ID       SCORE
0    term2    0.923760
1    term1    0.774597
2    term4    0.550179
3    term5    0.346410

Invoke the function on a SAP HANA On-Premise instance:

>>> res = get_related_term(pred_df, ref_df)
>>> res.collect()
  RANK  TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY  CORRELATIONS
0    1 term3            term3       noun               7                   4          None
1    2 term2            term2       noun               3                   2          None
2    3 term1            term1       noun               1                   1          None
3    4 term4            term4       noun               9                   5          None
4    5 term5            term5       noun               9                   2          None
... FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT  CLUSTER_RIGHT                                 SCORE
...    None             None           None          None           None  1.0000003613794823387195265240734440
...    None             None           None          None           None  0.9237607645314674931213971831311937
...    None             None           None          None           None  0.7745969491648266869177064108953346
...    None             None           None          None           None  0.5501794128048571597133786781341769
...    None             None           None          None           None  0.3464102866993003515538873671175679
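
One way to think about related terms is as cosine similarity between term vectors, where each term is represented by its counts across the reference documents. The plain-Python sketch below follows that idea. On the sample data it gives the same ranking and reproduces the listed Cloud scores for term2, term1 and term5; for term4 it yields roughly 0.592 rather than 0.550, so the function evidently applies a weighting that this sketch does not capture. Treat it as an illustration only; the helper names are hypothetical.

```python
import math
from collections import Counter

ref_docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}

# term_vectors[term][doc_id] = number of occurrences of term in that document.
term_vectors = {}
for doc_id, content in ref_docs.items():
    for term, count in Counter(content.split()).items():
        term_vectors.setdefault(term, {})[doc_id] = count


def cosine(u, v):
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_terms(term, threshold=0.0):
    """Rank all other terms by similarity of their document-count vectors."""
    query = term_vectors[term]
    scored = [
        (other, cosine(query, vec))
        for other, vec in term_vectors.items()
        if other != term
    ]
    return sorted(
        (pair for pair in scored if pair[1] > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )


for term, score in related_terms("term3"):
    print(term, round(score, 6))
```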
hana_ml.text.tm.get_relevant_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None)

This function returns the top-ranked documents that are relevant to a term based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

Parameters
pred_data : DataFrame
  • 1st column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID.

    • 2nd column, Document content.

    • 3rd column, Document Category.

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame
    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame
    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame
    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Show only the top N results. If 0, all results are shown.

Defaults to 0.

threshold : float, optional

Only results with a score greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances.

Returns
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

The input DataFrame pred_df:

>>> pred_df.collect()
                   CONTENT
0  term2 term2 term3 term3

Invoke the function on a SAP HANA Cloud instance:

>>> get_relevant_doc(pred_df, ref_df).collect()
       ID       SCORE
0    doc1    0.774597
1    doc2    0.516398
2    doc3    0.258199
3    doc4    0.258199

Invoke the function on a SAP HANA On-Premise instance:

>>> res = get_relevant_doc(pred_data, ref_data, top=4)
>>> res.collect()
     ID    RANK   TOTAL_TERM_COUNT  TERM_COUNT  CORRELATIONS  FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT
0  doc1       1                  6           3          None     None             None           None          None
1  doc2       2                  6           3          None     None             None           None          None
2  doc3       3                  6           3          None     None             None           None          None
3  doc4       4                  9           3          None     None             None           None          None
... CLUSTER_RIGHT  HIGHLIGHTED_DOCUMENT  HIGHLIGHTED_TERMTYPES                                   SCORE
...          None                  None                   None    0.7745969491648266869177064108953346
...          None                  None                   None    0.5163979661098845319600059156073257
...          None                  None                   None    0.2581989830549422659800029578036629
...          None                  None                   None    0.2581989830549422659800029578036629
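
The listed scores equal the per-document counts of term3 divided by the Euclidean norm of those counts (3, 2, 1 and 1 over the square root of 15). The plain-Python sketch below shows that calculation. Both the scoring rule and the choice of term3 as the query term are assumptions made because they reproduce the sample scores; the function name relevant_docs is hypothetical.

```python
import math
from collections import Counter

ref_docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}


def relevant_docs(term, top=0, threshold=0.0):
    """Score each document by the term's count, L2-normalized across documents."""
    counts = {
        doc_id: Counter(content.split())[term]
        for doc_id, content in ref_docs.items()
    }
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0.0:
        return []  # the term does not occur in the reference data
    scored = [(d, c / norm) for d, c in counts.items() if c / norm > threshold]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:top] if top else ranked


for doc_id, score in relevant_docs("term3", top=4):
    print(doc_id, round(score, 6))
```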
hana_ml.text.tm.get_relevant_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None)

This function returns the top-ranked relevant terms that describe a document based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

Parameters
pred_data : DataFrame
  • 1st column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID.

    • 2nd column, Document content.

    • 3rd column, Document Category.

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame
    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame
    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame
    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Show only the top N results. If 0, all results are shown.

Defaults to 0.

threshold : float, optional

Only results with a score greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances.

Returns
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

The input DataFrame pred_df:

>>> pred_df.collect()
  CONTENT
0   term3

Invoke the function on a SAP HANA Cloud instance:

>>> get_relevant_term(pred_df, ref_df).collect()
        ID   SCORE
0    term3     1.0

Invoke the function on a SAP HANA On-Premise instance:

>>> res = get_relevant_term(pred_df, ref_df)
>>> res.collect()
  RANK  TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY  CORRELATIONS
0    1 term3            term3       noun               7                   4          None
... FACTORS  ROTATED_FACTORS  CLUSTER_LEVEL  CLUSTER_LEFT  CLUSTER_RIGHT                                SCORE
...    None             None           None          None           None  1.000002901113076436701021521002986
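
A simple way to picture the terms that describe a document is to weight each of its terms by count times IDF and scale the weights to unit length. The plain-Python sketch below does this. It is an assumption-based illustration rather than the function's actual computation, and the name relevant_terms is hypothetical. For the one-term document in the example it yields a single score of 1.0, in line with the output above.

```python
import math
from collections import Counter

ref_docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}

doc_freq = Counter()
for content in ref_docs.values():
    doc_freq.update(set(content.split()))
idf = {t: math.log(len(ref_docs) / df) for t, df in doc_freq.items()}


def relevant_terms(document, top=0, threshold=0.0):
    """Rank the document's terms by unit-normalized (count * IDF) weight."""
    counts = Counter(document.split())
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return []  # no term of the document is known to the reference data
    scored = [(t, w / norm) for t, w in weights.items() if w / norm > threshold]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:top] if top else ranked


print(relevant_terms("term3"))
print(relevant_terms("term1 term3"))
```

In the second call, term1 ranks above term3 because it occurs in fewer reference documents and therefore carries a larger IDF weight.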

hana_ml.text.tm.get_suggested_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None)

This function returns the top-ranked terms that match an initial substring based on a Term Frequency - Inverse Document Frequency (TF-IDF) result or on reference data.

Parameters
pred_data : DataFrame
  • 1st column, Document content.

ref_data : DataFrame or a tuple of DataFrame
  • DataFrame, reference data
    • 1st column, ID.

    • 2nd column, Document content.

    • 3rd column, Document Category.

The ref_data could also be a tuple of DataFrame, reference TF-IDF data:
  • 1st DataFrame
    • 1st column, TM_TERM.

    • 2nd column, TF_VALUE.

    • 3rd column, IDF_VALUE.

  • 2nd DataFrame
    • 1st column, ID.

    • 2nd column, TM_TERM.

    • 3rd column, TM_TERM_FREQUENCY.

  • 3rd DataFrame
    • 1st column, ID.

    • 2nd column, Document category.

top : int, optional

Show only the top N results. If 0, all results are shown.

Defaults to 0.

threshold : float, optional

Only results with a score greater than this value are included in the result table.

Defaults to 0.0.

lang : str, optional

Language type. Applies only to on-premise SAP HANA instances.

Defaults to 'EN'.

index_name : str, optional

Index name. Applies only to on-premise SAP HANA instances.

Returns
DataFrame

Examples

The input DataFrame ref_df:

>>> ref_df.collect()
      ID                                                  CONTENT       CATEGORY
0   doc1                      term1 term2 term2 term3 term3 term3     CATEGORY_1
1   doc2                      term2 term3 term3 term4 term4 term4     CATEGORY_1
2   doc3                      term3 term4 term4 term5 term5 term5     CATEGORY_2
3   doc4    term3 term4 term4 term5 term5 term5 term5 term5 term5     CATEGORY_2
4   doc5                                              term4 term6     CATEGORY_3
5   doc6                                  term4 term6 term6 term6     CATEGORY_3

The input DataFrame pred_df:

>>> pred_df.collect()
  CONTENT
0   term3

Invoke the function on a SAP HANA Cloud instance:

>>> get_suggested_term(pred_df, ref_df).collect()
        ID     SCORE
0    term5  0.830048
1    term6  0.368910
2    term2  0.276683
3    term3  0.238269
4    term1  0.150417
5    term4  0.137752

Invoke the function on a SAP HANA On-Premise instance:

>>> res = get_suggested_term(pred_df, ref_df)
>>> res.collect()
  RANK   TERM  NORMALIZED_TERM  TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY                                SCORE
0    1  term5           term5       noun               9                   2  0.8300477525938901868229891078954097
1    2  term6           term6       noun               4                   2  0.3689101122639512064793620993441436
2    3  term2           term2       noun               3                   2  0.2766825841979633771039459588791942
3    4  term3           term3       noun               7                   4  0.2382690555756660777397826223023003
4    5  term1           term1       noun               1                   1  0.1504166196211661477022403232695069
5    6  term4           term4       noun               9                   5   0.137751598109021100579951735198847
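
The scores above agree with each term's corpus-wide frequency multiplied by ln(N / document frequency), scaled so that the resulting vector has unit length. The plain-Python sketch below reproduces them under that assumption. The scoring rule is inferred from the sample output only, the prefix "term" is a hypothetical query chosen so that all six terms match, and normalizing over the matched terms (rather than over all terms) is a further assumption that this data cannot confirm.

```python
import math
from collections import Counter

ref_docs = {
    "doc1": "term1 term2 term2 term3 term3 term3",
    "doc2": "term2 term3 term3 term4 term4 term4",
    "doc3": "term3 term4 term4 term5 term5 term5",
    "doc4": "term3 term4 term4 term5 term5 term5 term5 term5 term5",
    "doc5": "term4 term6",
    "doc6": "term4 term6 term6 term6",
}

# Corpus-wide term frequency and document frequency.
term_freq = Counter()
doc_freq = Counter()
for content in ref_docs.values():
    tokens = content.split()
    term_freq.update(tokens)
    doc_freq.update(set(tokens))
idf = {t: math.log(len(ref_docs) / df) for t, df in doc_freq.items()}


def suggested_terms(prefix, top=0, threshold=0.0):
    """Rank terms starting with prefix by unit-normalized (frequency * IDF)."""
    weights = {
        t: term_freq[t] * idf[t] for t in term_freq if t.startswith(prefix)
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return []  # nothing matches the prefix
    scored = [(t, w / norm) for t, w in weights.items() if w / norm > threshold]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:top] if top else ranked


for term, score in suggested_terms("term"):
    print(term, round(score, 6))
```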