hana_ml.text.tm package
hana_ml.text.tm
- This module provides various text mining functions. The following functions are available:
- hana_ml.text.tm.tf_analysis(data, lang=None, enable_stopwords=None, keep_numeric=None)
Perform Term Frequency (TF) analysis on the given documents. TF is the number of occurrences of a term in a document.
- Parameters
- data : DataFrame
Input data, structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to None.
- enable_stopwords : bool, optional
Determine whether to turn on stopwords.
Defaults to True.
- keep_numeric : bool, optional
Determine whether to keep numbers.
Defaults to False.
- Returns
- A tuple of DataFrames
TF-IDF result, structured as follows:
TM_TERM.
TM_TERM_FREQUENCY.
TM_IDF_FREQUENCY.
TF_VALUE.
IDF_VALUE.
TF_IDF_VALUE.
Document term frequency table, structured as follows:
ID.
TM_TERM.
TM_TERM_FREQUENCY.
Document category table, structured as follows:
ID.
Document category.
Examples
The input DataFrame df:
>>> df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke tf_analysis function:
>>> tfidf = tf_analysis(df)
Output:
>>> tfidf[0].head(3).collect()
  TM_TERMS  TM_TERM_TF_F  TM_TERM_IDF_F  TM_TERM_TF_V  TM_TERM_IDF_V
0    term1             1              1      0.030303       1.791759
1    term2             3              2      0.090909       1.098612
2    term3             7              4      0.212121       0.405465
>>> tfidf[1].head(3).collect()
     ID TM_TERMS  TM_TERM_FREQUENCY
0  doc1    term1                  1
1  doc1    term2                  2
2  doc1    term3                  3
>>> tfidf[2].head(3).collect()
     ID    CATEGORY
0  doc1  CATEGORY_1
1  doc2  CATEGORY_1
2  doc3  CATEGORY_2
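The examples above assume that df already exists as a SAP HANA DataFrame. A minimal sketch of building such an input from local data and running tf_analysis (the connection details and the table name TM_DOC_TBL are hypothetical placeholders):

import pandas as pd
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.text.tm import tf_analysis

# Hypothetical connection details; replace with your own SAP HANA credentials.
conn = ConnectionContext(address='<host>', port=443, user='<user>', password='<password>')

# Local documents in the ID / CONTENT / CATEGORY layout expected by tf_analysis.
pd_df = pd.DataFrame({
    'ID': ['doc1', 'doc2'],
    'CONTENT': ['term1 term2 term2 term3 term3 term3',
                'term2 term3 term3 term4 term4 term4'],
    'CATEGORY': ['CATEGORY_1', 'CATEGORY_1']})

# Upload the local data as a HANA table and obtain a hana_ml DataFrame.
df = create_dataframe_from_pandas(conn, pd_df, table_name='TM_DOC_TBL', force=True)

# tf_analysis returns a tuple of three DataFrames: the TF-IDF result,
# the document term frequency table, and the document category table.
tfidf = tf_analysis(df, lang='EN', enable_stopwords=True)
print(tfidf[0].collect())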
- hana_ml.text.tm.text_classification(pred_data, ref_data=None, k_nearest_neighbours=None, thread_ratio=None, lang=None, index_name=None, created_index=None)
This function classifies (categorizes) an input document with respect to sets of categories (taxonomies).
- Parameters
- pred_data : DataFrame
The prediction data for classification, structured as follows:
1st column, ID.
2nd column, Document content.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
Otherwise, if ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with the DataFrames structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to None.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame (cloud version)
Text classification result, structured as follows:
Predict data ID.
TARGET.
Statistics table, structured as follows:
Predict data ID.
Training data ID.
Distance.
- DataFrame (on-premise version)
Text classification result, structured as follows:
Predict data ID.
RANK.
CATEGORY_SCHEMA.
CATEGORY_TABLE.
CATEGORY_COLUMN.
CATEGORY_VALUE.
NEIGHBOR_COUNT.
SCORE.
Examples
The input DataFrame df:
>>> df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
Invoke text_classification:
>>> res = text_classification(df.select(df.columns[0], df.columns[1]), df)
Result on a SAP HANA cloud instance:
>>> res[0].head(1).collect()
     ID      TARGET
0  doc1  CATEGORY_1
Result on a SAP HANA on-premise instance:
>>> res[0].head(1).collect()
     ID  RANK CATEGORY_SCHEMA                    CATEGORY_TABLE CATEGORY_COLUMN CATEGORY_VALUE  NEIGHBOR_COUNT
0  doc1     1      "PAL_USER"  "TM_CATEGORIZE_KNN_DT_6_REF_TBL"      "CATEGORY"     CATEGORY_1               1
    ...                                 SCORE
    ...  0.5807794005266924131092309835366905
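The parameter documentation above also allows ref_data to be the tuple returned by tf_analysis instead of the raw reference documents. A minimal sketch of that variant on a SAP HANA cloud instance, assuming df is the example DataFrame shown above:

from hana_ml.text.tm import tf_analysis, text_classification

# Precompute the reference TF-IDF tuple once ...
tfidf = tf_analysis(df)

# ... and reuse it as ref_data; classify with a 3-nearest-neighbour vote.
res = text_classification(df.select(df.columns[0], df.columns[1]),
                          ref_data=tfidf,
                          k_nearest_neighbours=3,
                          thread_ratio=0.5)
print(res[0].collect())   # classification result
print(res[1].collect())   # statistics table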
- hana_ml.text.tm.get_related_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None)
This function returns the top-ranked related documents for a query document based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
Note
The input table can contain only one row, as only one document can be processed at a time.
- Parameters
- pred_data : DataFrame
Input data, structured as follows:
1st column, Document content.
Note
The input table can contain only one row, as only one document can be processed at a time.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame, TF-IDF result.
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame, document term frequency table.
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame, document category table.
1st column, ID.
2nd column, Document category.
- top : int, optional
Only show top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame
Examples
The input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
The input DataFrame pred_df:
>>> pred_df.collect()
                   CONTENT
0  term2 term2 term3 term3
Invoke the function on a SAP HANA cloud instance. Here tfidf is the tuple of DataFrames returned by the tf_analysis function; please refer to the Examples section of tf_analysis for its content.
>>> get_related_doc(pred_df, tfidf).collect()
     ID     SCORE
0  doc2  0.891550
1  doc1  0.804670
2  doc3  0.042024
3  doc4  0.021225
Invoke the function on a SAP HANA on-premise instance:
>>> res = get_related_doc(df_test1_onpremise, df_onpremise)
>>> res.collect()
     ID  RANK  TOTAL_TERM_COUNT  TERM_COUNT CORRELATIONS FACTORS ROTATED_FACTORS CLUSTER_LEVEL CLUSTER_LEFT
0  doc2     1                 6           3         None    None            None          None         None
1  doc1     2                 6           3         None    None            None          None         None
2  doc3     3                 6           3         None    None            None          None         None
3  doc4     4                 9           3         None    None            None          None         None
    ... CLUSTER_RIGHT HIGHLIGHTED_DOCUMENT HIGHLIGHTED_TERMTYPES                                  SCORE
    ...          None                 None                  None   0.8915504731053067732915451415465213
    ...          None                 None                  None   0.8046698732333942283290184604993556
    ...          None                 None                  None  0.04202449735779462125506711345224176
    ...          None                 None                  None  0.02122540837399113089478674964993843
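To restrict the output, the top and threshold parameters described above can be combined. A minimal sketch, reusing pred_df and the tfidf tuple from the examples above (cloud instance):

from hana_ml.text.tm import get_related_doc

# Keep only the two best matches whose score exceeds 0.5.
res = get_related_doc(pred_df, tfidf, top=2, threshold=0.5)
print(res.collect())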
- hana_ml.text.tm.get_related_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None)
This function returns the top-ranked related terms for a query term based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
Note
The input table can contain only one row, as only one document can be processed at a time.
- Parameters
- pred_data : DataFrame
Input data, structured as follows:
1st column, Document content.
Note
The input table can contain only one row, as only one document can be processed at a time.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- top : int, optional
Show top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specify the language type. SAP HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame
Examples
The input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
The input DataFrame pred_df:
>>> pred_df.collect()
  CONTENT
0   term3
Invoke the function on a SAP HANA cloud instance:
>>> get_related_term(pred_df, ref_df).collect()
      ID     SCORE
0  term3  1.000000
1  term2  0.923760
2  term1  0.774597
3  term4  0.550179
4  term5  0.346410
Invoke the function on a SAP HANA on-premise instance:
>>> res = get_related_term(pred_df, ref_df)
>>> res.collect()
   RANK   TERM NORMALIZED_TERM TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY CORRELATIONS
0     1  term3           term3      noun               7                   4         None
1     2  term2           term2      noun               3                   2         None
2     3  term1           term1      noun               1                   1         None
3     4  term4           term4      noun               9                   5         None
4     5  term5           term5      noun               9                   2         None
    ... FACTORS ROTATED_FACTORS CLUSTER_LEVEL CLUSTER_LEFT CLUSTER_RIGHT                                 SCORE
    ...    None            None          None         None          None  1.0000003613794823387195265240734440
    ...    None            None          None         None          None  0.9237607645314674931213971831311937
    ...    None            None          None         None          None  0.7745969491648266869177064108953346
    ...    None            None          None         None          None  0.5501794128048571597133786781341769
    ...    None            None          None         None          None  0.3464102866993003515538873671175679
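A minimal sketch combining the top, threshold and lang parameters described above, reusing pred_df and ref_df from the examples (cloud instance):

from hana_ml.text.tm import get_related_term

# Return at most three related terms with a score above 0.5,
# forcing English processing instead of language auto detection.
res = get_related_term(pred_df, ref_df, top=3, threshold=0.5, lang='EN')
print(res.collect())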
- hana_ml.text.tm.get_relevant_doc(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None)
This function returns the top-ranked documents that are relevant to a term based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
Note
The input table can contain only one row, as only one document can be processed at a time.
- Parameters
- pred_data : DataFrame
Input data, structured as follows:
1st column, Document content.
Note
The input table can contain only one row, as only one document can be processed at a time.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- top : int, optional
Show top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame
Examples
The input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
The input DataFrame pred_df:
>>> pred_df.collect()
  CONTENT
0   term3
Invoke the function on a SAP HANA cloud instance:
>>> get_relevant_doc(pred_df, ref_df).collect()
     ID     SCORE
0  doc1  0.774597
1  doc2  0.516398
2  doc3  0.258199
3  doc4  0.258199
Invoke the function on a SAP HANA on-premise instance:
>>> res = get_relevant_doc(pred_data, ref_data, top=4)
>>> res.collect()
     ID  RANK  TOTAL_TERM_COUNT  TERM_COUNT CORRELATIONS FACTORS ROTATED_FACTORS CLUSTER_LEVEL CLUSTER_LEFT
0  doc1     1                 6           3         None    None            None          None         None
1  doc2     2                 6           3         None    None            None          None         None
2  doc3     3                 6           3         None    None            None          None         None
3  doc4     4                 9           3         None    None            None          None         None
    ... CLUSTER_RIGHT HIGHLIGHTED_DOCUMENT HIGHLIGHTED_TERMTYPES                                 SCORE
    ...          None                 None                  None  0.7745969491648266869177064108953346
    ...          None                 None                  None  0.5163979661098845319600059156073257
    ...          None                 None                  None  0.2581989830549422659800029578036629
    ...          None                 None                  None  0.2581989830549422659800029578036629
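When many queries are run against the same reference corpus, the parameter documentation above allows passing a precomputed TF-IDF tuple as ref_data instead of the raw documents. A minimal sketch, reusing ref_df and pred_df from the examples (cloud instance):

from hana_ml.text.tm import tf_analysis, get_relevant_doc

# Compute the reference TF-IDF tuple once ...
tfidf = tf_analysis(ref_df)

# ... and reuse it for the relevance query, keeping only the top two documents.
res = get_relevant_doc(pred_df, tfidf, top=2)
print(res.collect())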
- hana_ml.text.tm.get_relevant_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None)
This function returns the top-ranked relevant terms that describe a document based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
Note
The input table can contain only one row, as only one document can be processed at a time.
- Parameters
- pred_data : DataFrame
Input data, structured as follows:
1st column, Document content.
Note
The input table can contain only one row, as only one document can be processed at a time.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- top : int, optional
Show top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame
Examples
The input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
The input DataFrame pred_df:
>>> pred_df.collect()
  CONTENT
0   term3
Invoke the function on a SAP HANA cloud instance:
>>> get_relevant_term(pred_df, ref_df).collect()
      ID  SCORE
0  term3    1.0
Invoke the function on a SAP HANA on-premise instance:
>>> res = get_relevant_term(pred_df, ref_df)
>>> res.collect()
   RANK   TERM NORMALIZED_TERM TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY CORRELATIONS
0     1  term3           term3      noun               7                   4         None
    ... FACTORS ROTATED_FACTORS CLUSTER_LEVEL CLUSTER_LEFT CLUSTER_RIGHT                                SCORE
    ...    None            None          None         None          None  1.000002901113076436701021521002986
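A minimal sketch of the same call with an explicit thread_ratio, reusing pred_df and ref_df from the examples (the setting is only honoured on a SAP HANA cloud instance):

from hana_ml.text.tm import get_relevant_term

# Use roughly half of the available threads for the TF-IDF scoring.
res = get_relevant_term(pred_df, ref_df, thread_ratio=0.5)
print(res.collect())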
- hana_ml.text.tm.get_suggested_term(pred_data, ref_data=None, top=None, threshold=None, lang='EN', index_name=None, thread_ratio=None, created_index=None)
This function returns the top-ranked terms that match an initial substring based on Term Frequency - Inverse Document Frequency (TF-IDF) results or reference data.
Note
The input table can contain only one row, as only one document can be processed at a time.
- Parameters
- pred_data : DataFrame
Input data, structured as follows:
1st column, Document content.
Note
The input table can contain only one row, as only one document can be processed at a time.
- ref_data : DataFrame or a tuple of DataFrames
Specify the reference data.
If ref_data is a DataFrame, it should be structured as follows:
1st column, ID.
2nd column, Document content.
3rd column, Document category.
If ref_data is a tuple of DataFrames, it should correspond to the reference TF-IDF data, with each DataFrame structured as follows:
1st DataFrame
1st column, TM_TERM.
2nd column, TF_VALUE.
3rd column, IDF_VALUE.
2nd DataFrame
1st column, ID.
2nd column, TM_TERM.
3rd column, TM_TERM_FREQUENCY.
3rd DataFrame
1st column, ID.
2nd column, Document category.
- top : int, optional
Show top N results. If 0, it shows all.
Defaults to 0.
- threshold : float, optional
Only results with a score greater than this value are put into the result table.
Defaults to 0.0.
- lang : str, optional
Specify the language type. HANA cloud instance currently supports 'EN', 'DE', 'ES', 'FR' and 'RU'. If None, auto detection will be applied.
Defaults to 'EN'.
- index_name : str, optional
Specify the index name. This parameter applies only to an on-premise SAP HANA instance.
If None, it will be generated.
- thread_ratio : float, optional
Specify the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use.
Only valid for a HANA cloud instance.
Defaults to 0.0.
- created_index : {"index": xxx, "schema": xxx, "table": xxx}, optional
Use the created index on the given table. This parameter applies only to an on-premise SAP HANA instance.
- Returns
- DataFrame
Examples
The input DataFrame ref_df:
>>> ref_df.collect()
     ID                                                CONTENT    CATEGORY
0  doc1                    term1 term2 term2 term3 term3 term3  CATEGORY_1
1  doc2                    term2 term3 term3 term4 term4 term4  CATEGORY_1
2  doc3                    term3 term4 term4 term5 term5 term5  CATEGORY_2
3  doc4  term3 term4 term4 term5 term5 term5 term5 term5 term5  CATEGORY_2
4  doc5                                            term4 term6  CATEGORY_3
5  doc6                                term4 term6 term6 term6  CATEGORY_3
The input DataFrame pred_df:
>>> pred_df.collect()
  CONTENT
0   term3
Invoke the function on a SAP HANA cloud instance:
>>> get_suggested_term(pred_df, ref_df).collect()
      ID  SCORE
0  term3    1.0
Invoke the function on a SAP HANA on-premise instance:
>>> res = get_suggested_term(pred_df, ref_df)
>>> res.collect()
   RANK   TERM NORMALIZED_TERM TERM_TYPE  TERM_FREQUENCY  DOCUMENT_FREQUENCY                                SCORE
0     1  term3           term3      noun               7                   4  0.999999999999999888977697537484346
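Since get_suggested_term matches an initial substring, the query document typically holds a prefix rather than a full term. A minimal sketch, assuming an existing ConnectionContext conn and reusing ref_df from the example above; the table name TM_PREFIX_TBL and the prefix 'ter' are hypothetical:

import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas
from hana_ml.text.tm import get_suggested_term

# One-row query table whose CONTENT column holds the prefix to complete.
prefix_df = create_dataframe_from_pandas(conn,
                                         pd.DataFrame({'CONTENT': ['ter']}),
                                         table_name='TM_PREFIX_TBL',
                                         force=True)

# Suggest at most three terms starting with the given prefix.
res = get_suggested_term(prefix_df, ref_df, top=3)
print(res.collect())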
- class hana_ml.text.tm.TFIDF
Bases:
PALBase
Class for term frequency–inverse document frequency.
- Parameters
- None
Examples
Input dataframe for analysis:
>>> df_train.collect()
     ID                                                CONTENT
0  doc1                    term1 term2 term2 term3 term3 term3
1  doc2                    term2 term3 term3 term4 term4 term4
2  doc3                    term3 term4 term4 term5 term5 term5
3  doc5  term3 term4 term4 term5 term5 term5 term5 term5 term5
4  doc4                                            term4 term6
5  doc6                                term4 term6 term6 term6
Creating TFIDF instance:
>>> tfidf = TFIDF()
Performing text_collector() on given dataframe:
>>> idf, _ = tfidf.text_collector(data=df_train)
>>> idf.collect()
  TM_TERMS  TM_TERM_IDF_VALUE
0    term1           1.791759
1    term2           1.098612
2    term3           0.405465
3    term4           0.182322
4    term5           1.098612
5    term6           1.098612
Performing text_tfidf() on given dataframe:
>>> result = tfidf.text_tfidf(data=df_train)
>>> result.collect()
      ID  TERMS  TF_VALUE  TFIDF_VALUE
0   doc1  term1       1.0     1.791759
1   doc1  term2       2.0     2.197225
2   doc1  term3       3.0     1.216395
3   doc2  term2       1.0     1.098612
4   doc2  term3       2.0     0.810930
5   doc2  term4       3.0     0.546965
6   doc3  term3       1.0     0.405465
7   doc3  term4       2.0     0.364643
8   doc3  term5       3.0     3.295837
9   doc5  term3       1.0     0.405465
10  doc5  term4       2.0     0.364643
11  doc5  term5       6.0     6.591674
12  doc4  term4       1.0     0.182322
13  doc4  term6       1.0     1.098612
14  doc6  term4       1.0     0.182322
15  doc6  term6       3.0     3.295837
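The idf table returned by text_collector() can be passed back into text_tfidf() so that additional documents are weighted with the previously computed inverse document frequencies. A minimal sketch, assuming df_train from the example above and a second DataFrame df_new with the same ID/CONTENT layout (df_new is hypothetical):

from hana_ml.text.tm import TFIDF

tfidf = TFIDF()

# Compute the inverse document frequencies once on the training corpus ...
idf, extended = tfidf.text_collector(data=df_train)

# ... and reuse them to weight a new document set.
result_new = tfidf.text_tfidf(data=df_new, idf=idf)
print(result_new.collect())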
- Attributes
fit_hdbprocedure
Returns the generated hdbprocedure for fit.
predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Methods
- add_attribute(attr_key, attr_val)
Function to add attribute.
- apply_with_hint(with_hint[, ...])
Apply with hint.
- consume_fit_hdbprocedure(proc_name[, ...])
Return the generated consume hdbprocedure for fit.
- consume_predict_hdbprocedure(proc_name[, ...])
Return the generated consume hdbprocedure for predict.
- create_apply_func(func_name, data[, key, ...])
Create HANA TUDF SQL code.
- disable_arg_check()
Disable argument check.
- disable_convert_bigint()
Disable the bigint conversion.
- disable_hana_execution()
HANA execution will be disabled and only the SQL script will be generated.
- disable_with_hint()
Disable with hint.
- enable_arg_check()
Enable argument check.
- enable_convert_bigint()
Allow the conversion from bigint to double.
- enable_hana_execution()
HANA execution will be enabled.
- enable_no_inline([apply_to_anonymous_block])
Enable no inline.
- enable_parallel_by_parameter_partitions([apply_to_anonymous_block])
Enable parallel by parameter partitions.
- enable_workload_class(workload_class_name)
HANA WORKLOAD CLASS is applied for the statement execution.
- get_fit_execute_statement()
Returns the execute_statement for training.
- get_fit_output_table_names()
Get the generated result table names in the fit function.
- get_fit_parameters()
Get PAL fit parameters.
- get_pal_function()
Extract the specific function call of the SAP HANA PAL function from the SQL code.
- get_parameters()
Parse SQL lines containing the parameter definitions.
- get_predict_execute_statement()
Returns the execute_statement for predicting.
- get_predict_output_table_names()
Get the generated result table names in the predict function.
- get_predict_parameters()
Get PAL predict parameters.
- get_score_execute_statement()
Returns the execute_statement for scoring.
- get_score_output_table_names()
Get the generated result table names in the score function.
- get_score_parameters()
Get SAP HANA PAL score parameters.
- is_fitted()
Checks if the model can be saved.
- load_model(model)
Function to load the fitted model.
- set_scale_out([route_to, no_route_to, ...])
SAP HANA statement routing.
- text_collector(data)
Compute the inverse document frequency of documents provided by the user.
- text_tfidf(data[, idf])
Compute term frequency - inverse document frequency (TF-IDF) by document.
- text_collector(data)
Its primary use is to compute the inverse document frequency of documents provided by the user.
- Parameters
- data : DataFrame
Data to be analyzed. The first column of the input data table is assumed to be an ID column.
- Returns
- A tuple of DataFrames
Inverse document frequency of documents.
Extended table.
- text_tfidf(data, idf=None)
Its primary use is to compute term frequency - inverse document frequency (TF-IDF) by document.
- Parameters
- data : DataFrame
Data to be analyzed.
The first column of the input data table is assumed to be an ID column.
- idf : DataFrame, optional
Inverse document frequency of documents.
- Returns
- DataFrame
Term frequency - inverse document frequency by document.
- add_attribute(attr_key, attr_val)
Function to add attribute.
- Parameters
- attr_key : str
The key.
- attr_val : str
The value.
- apply_with_hint(with_hint, apply_to_anonymous_block=True)
Apply with hint.
- Parameters
- with_hint : str
The hint clauses.
- apply_to_anonymous_block : bool, optional
If True, it will be applied to the anonymous block.
Defaults to True.
- consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)
Return the generated consume hdbprocedure for fit.
- Parameters
- proc_name : str
The procedure name.
- in_tables : list, optional
The list of input table names.
- out_tables : list, optional
The list of output table names.
- consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)
Return the generated consume hdbprocedure for predict.
- Parameters
- proc_name : str
The procedure name.
- in_tables : list, optional
The list of input table names.
- out_tables : list, optional
The list of output table names.
- create_apply_func(func_name, data, key=None, model=None, output_table_structure=None, execute=True, force=True)
Create HANA TUDF SQL code.
- Parameters
- func_name : str
The function name of TUDF.
- data : DataFrame
The data to be predicted.
- key : str, optional
The key column name in the predict dataframe.
- model : DataFrame, optional
The model dataframe for prediction. If not specified, it will use model_.
- output_table_structure : dict, optional
The return table structure.
- execute : bool, optional
Execute the creation SQL.
Defaults to True.
- force : bool, optional
If True, it will drop the existing TUDF.
- Returns
- str
The generated TUDF SQL code.
- disable_arg_check()
Disable argument check.
- disable_convert_bigint()
Disable the bigint conversion.
Defaults to False.
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- disable_with_hint()
Disable with hint.
- enable_arg_check()
Enable argument check.
- enable_convert_bigint()
Allows the conversion from bigint to double.
Defaults to True.
- enable_hana_execution()
HANA execution will be enabled.
- enable_no_inline(apply_to_anonymous_block=True)
Enable no inline.
- Parameters
- apply_to_anonymous_block : bool, optional
If True, it will be applied to the anonymous block.
Defaults to True.
- enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)
Enable parallel by parameter partitions.
- Parameters
- apply_to_anonymous_block : bool, optional
If True, it will be applied to the anonymous block.
Defaults to False.
- enable_workload_class(workload_class_name)
HANA WORKLOAD CLASS is applied for the statement execution.
- Parameters
- workload_class_name : str
The name of HANA WORKLOAD CLASS.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- get_fit_execute_statement()
Returns the execute_statement for training.
- get_fit_output_table_names()
Get the generated result table names in fit function.
- Returns
- list
List of table names.
- get_fit_parameters()
Get PAL fit parameters.
- Returns
- list
List of tuples, where each tuple describes a parameter like (name, value, type)
- get_pal_function()
Extract the specific function call of the SAP HANA PAL function from the sql code. Nevertheless it only detects the synonyms that have to be resolved afterwards.
- Returns
- dict
The procedure name synonym, e.g. CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST".
- get_parameters()
Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.
- Returns
- dict
Dict of list of tuples, where each tuple describes a parameter like (name, value, type)
- get_predict_execute_statement()
Returns the execute_statement for predicting.
- get_predict_output_table_names()
Get the generated result table names in predict function.
- Returns
- list
List of table names.
- get_predict_parameters()
Get PAL predict parameters.
- Returns
- list
List of tuples, where each tuple describes a parameter like (name, value, type)
- get_score_execute_statement()
Returns the execute_statement for scoring.
- get_score_output_table_names()
Get the generated result table names in score function.
- Returns
- list
List of table names.
- get_score_parameters()
Get SAP HANA PAL score parameters.
- Returns
- list
List of tuples, where each tuple describes a parameter like (name, value, type)
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(model)
Function to load the fitted model.
- Parameters
- model : DataFrame
SAP HANA DataFrame for fitted model.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)
SAP HANA statement routing.
- Parameters
- route_to : str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to : str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by : str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality : str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost : int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level : {'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
- workload_class : str, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
- apply_to_anonymous_block : bool, optional
If True, it will be applied to the anonymous block.
Defaults to True.
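As an illustration of the routing options above, a minimal sketch that routes the generated statements through a workload class before running text_tfidf(), reusing df_train from the class example; the workload class name MY_WORKLOAD_CLASS is hypothetical and must already exist on the instance:

from hana_ml.text.tm import TFIDF

tfidf = TFIDF()

# Route the generated statements via a workload class; the hint is
# applied to the surrounding anonymous block (the default behaviour).
tfidf.set_scale_out(workload_class='MY_WORKLOAD_CLASS',
                    apply_to_anonymous_block=True)

result = tfidf.text_tfidf(data=df_train)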