hana_ml.algorithms.pal algorithms

The PAL package consists of the following sections:

Auto ML:

Model and pipeline:

Unified Interface:

Clustering:

Classification:

Regression:

Association:

Time Series:

Preprocessing:

Statistics:

Social Network Analysis:

Recommender System:

Ranking:

Miscellaneous:

Metrics:

hana_ml.algorithms.pal.abc_analysis

This module contains the PAL wrapper for the abc_analysis algorithm.

The following function is available:

hana_ml.algorithms.pal.abc_analysis.abc_analysis(data, key=None, percent_A=None, percent_B=None, percent_C=None, revenue=None, thread_ratio=None)

Performs ABC analysis, which classifies objects based on a particular measure and groups the inventory items into three categories (A, B, and C).

Parameters
data : DataFrame

Input data.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set.

revenue : str, optional

Name of the column for revenue (or profits).

If not given, the input DataFrame must have exactly two columns.

Defaults to the first non-key column.

percent_A : float

Interval for class A.

percent_B : float

Interval for class B.

percent_C : float

Interval for class C.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored; in that case the function heuristically determines the number of threads to use.

Defaults to 0.

Returns
DataFrame

Returns a DataFrame containing the ABC class result of partitioning the data into three categories.

Examples

Data to analyze:

>>> df_train = cc.table('AA_DATA_TBL')
>>> df_train.collect()
     ITEM     VALUE
0    item1    15.4
1    item2    200.4
2    item3    280.4
3    item4    100.9
4    item5    40.4
5    item6    25.6
6    item7    18.4
7    item8    10.5
8    item9    96.15
9    item10   9.4

Perform abc_analysis:

>>> res = abc_analysis(data=df_train, key='ITEM', thread_ratio=0.3,
                       percent_A=0.7, percent_B=0.2, percent_C=0.1)
>>> res.collect()
       ABC_CLASS   ITEM
0      A        item3
1      A        item2
2      A        item4
3      B        item9
4      B        item5
5      B        item6
6      C        item7
7      C        item1
8      C        item8
9      C        item10
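The collected result is a regular pandas DataFrame, so the class sizes can be summarized client-side. A minimal sketch on top of the result above:

>>> res.collect().groupby('ABC_CLASS').size()
ABC_CLASS
A    3
B    3
C    4
dtype: int64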

hana_ml.algorithms.pal.association

This module contains Python wrappers for PAL association algorithms.

The following classes are available:

class hana_ml.algorithms.pal.association.Apriori(min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

Apriori is a classic algorithm for finding the association rules used in association analysis.

Parameters
min_support : float

User-specified minimum support (actual value).

min_confidence : float

User-specified minimum confidence (actual value).

relational : bool, optional

Whether or not to apply relational logic in the Apriori algorithm. If False, a single result table is produced; otherwise, the result is split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_lift : float, optional

User-specified minimum lift.

Defaults to 0.

max_conseq : int, optional

Maximum length of consequent items.

Defaults to 100.

max_len : int, optional

Total length of antecedent items and consequent items in the output.

Defaults to 5.

ubiquitous : float, optional

Item sets whose support values are greater than this number are ignored during frequent item mining.

Defaults to 1.0.

use_prefix_tree : bool, optional

Indicates whether or not to use a prefix tree to save memory.

Defaults to False.

lhs_restrict : list of str, optional (deprecated)

Specifies items that are only allowed on the left-hand side of association rules.

rhs_restrict : list of str, optional (deprecated)

Specifies items that are only allowed on the right-hand side of association rules.

lhs_complement_rhs : bool, optional (deprecated)

If you use rhs_restrict to restrict some items to the right-hand side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand side.

For example, if you have 100 items (i1, i2, ..., i100) and want to restrict i1 and i2 to the right-hand side and i3, i4, ..., i100 to the left-hand side, you can set the parameters as follows:

...

rhs_restrict = ['i1', 'i2'],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhs : bool, optional (deprecated)

If you use lhs_restrict to restrict some items to the left-hand side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored; in that case the function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds. The algorithm stops running when the specified timeout is reached.

Defaults to 3600.

pmml_export : {'no', 'single-row', 'multi-row'}, optional

Specifies the way to export the Apriori model:

  • 'no' : do not export the model,

  • 'single-row' : export the Apriori model in PMML in a single row,

  • 'multi-row' : export the Apriori model in PMML in multiple rows, where the minimum length of each row is 5000 characters.

Defaults to 'no'.
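For illustration, a hedged sketch of how the exported PMML document could be read back from the model_ attribute after fitting (assuming the transaction DataFrame df from the examples below; the second column of model_ holds the PMML string):

>>> ap = Apriori(min_support=0.1, min_confidence=0.3,
                 pmml_export='single-row')
>>> ap.fit(data=df)
>>> pmml_str = ap.model_.collect().iloc[0, 1]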

Examples

Input data for association rule mining:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the Apriori algorithm:

>>> ap = Apriori(min_support=0.1,
                 min_confidence=0.3,
                 relational=False,
                 min_lift=1.1,
                 max_conseq=1,
                 max_len=5,
                 ubiquitous=1.0,
                 use_prefix_tree=False,
                 thread_ratio=0,
                 timeout=3600,
                 pmml_export='single-row')

Perform association rule mining on the input data using the Apriori algorithm, and check the results:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000

Apriori algorithm set up using relational logic:

>>> apr = Apriori(min_support=0.1,
                  min_confidence=0.3,
                  relational=True,
                  min_lift=1.1,
                  max_conseq=1,
                  max_len=5,
                  ubiquitous=1.0,
                  use_prefix_tree=False,
                  thread_ratio=0,
                  timeout=3600,
                  pmml_export='single-row')

Mine association rules from the input data again, this time with relational logic, and check the resulting tables:

>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
Attributes
result_ : DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

model_ : DataFrame

Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. the Apriori model in PMML format.

antec_ : DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_ : DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_ : DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, transaction, item, lhs_restrict, ...])

Association rule mining from the input data using the Apriori algorithm.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data using the Apriori algorithm.

Parameters
data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item ID column.

The data type of the item column can be INTEGER, VARCHAR or NVARCHAR.

Defaults to the last non-transaction column if not provided.

lhs_restrict : list of int/str, optional

Specifies items that are only allowed on the left-hand side of association rules.

Elements in the list should be of the same type as the item column.

rhs_restrict : list of int/str, optional

Specifies items that are only allowed on the right-hand side of association rules.

Elements in the list should be of the same type as the item column.

lhs_complement_rhs : bool, optional

If you use rhs_restrict to restrict some items to the right-hand side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand side. For example, if you have 100 items (i1, i2, ..., i100) and want to restrict i1 and i2 to the right-hand side and i3, i4, ..., i100 to the left-hand side, you can set the parameters as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhs : bool, optional

If you use lhs_restrict to restrict some items to the left-hand side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
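As a brief usage sketch (column and item names taken from the class-level example above), restricting rules so that only 'item1' may appear on the right-hand side:

>>> ap.fit(data=df, transaction='CUSTOMER', item='ITEM',
           rhs_restrict=['item1'])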

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_key : str

The key.

attr_val : str

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hint : str

The hint clauses.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.
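A minimal sketch, where the hint string is illustrative:

>>> ap.apply_with_hint('NO_INLINE')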

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

Defaults to self.real_func.

pal_funcname : int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description : str, optional

Description of the state as a model container.

Defaults to None.

force : bool, optional

If True, the existing state is deleted first.

Defaults to False.
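A minimal lifecycle sketch, assuming the model has already been fitted (the description text is illustrative):

>>> ap.create_model_state(state_description='Apriori rules state')
>>> # ... reuse the state in subsequent calls ...
>>> ap.delete_model_state()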

delete_model_state(state=None)

Delete PAL model state.

Parameters
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.
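A hedged dry-run sketch combining this switch with get_fit_execute_statement() to inspect the generated SQL without executing it:

>>> ap.disable_hana_execution()
>>> ap.fit(data=df)
>>> print(ap.get_fit_execute_statement())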

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_name : str

The name of HANA WORKLOAD CLASS.
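A one-line sketch, where the workload class name is illustrative and must already exist on the SAP HANA system:

>>> ap.enable_workload_class('PAL_WORKLOAD')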

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter names, and for each parameter exactly one of the other three contains its value, while the remaining two are NULL. This format is converted into a simple key-value representation.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
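Illustrative only (the actual parameter names and grouping depend on the fitted model), the returned structure looks roughly like:

>>> ap.get_parameters()
{'fit': [('MIN_SUPPORT', 0.1, 'float'), ('MIN_CONFIDENCE', 0.3, 'float'), ...]}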
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
model : DataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_to : str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to : str or list of str, optional

Avoids query routing to the specified volume ID or service type.

Defaults to None.

route_by : str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality : str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost : int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level : {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal', or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, it is compiled once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class : str, optional

Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.
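A hedged sketch routing the statement to a hypothetical service type together with an illustrative workload class:

>>> ap.set_scale_out(route_to='computeserver',
                     workload_class='MY_WORKLOAD_CLASS')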

class hana_ml.algorithms.pal.association.AprioriLite(min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.

Parameters
min_support : float

User-specified minimum support (actual value).

min_confidence : float

User-specified minimum confidence (actual value).

subsample : float, optional

Specifies the sampling percentage for the input data. Set to 1 to use the entire data set.

recalculate : bool, optional

If the input data is sampled, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.

Defaults to True.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored; in that case the function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds.

The algorithm stops running when the specified timeout is reached.

Defaults to 3600.

pmml_export : {'no', 'single-row', 'multi-row'}, optional

Specifies the way to export the Apriori model:

  • 'no' : do not export the model,

  • 'single-row' : export the Apriori model in PMML in a single row,

  • 'multi-row' : export the Apriori model in PMML in multiple rows, where the minimum length of each row is 5000 characters.

Defaults to 'no'.

Examples

Input data for association rule mining using Apriori algorithm:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the light Apriori algorithm, fit the input data, and check the result table:

>>> apl = AprioriLite(min_support=0.1,
                      min_confidence=0.3,
                      subsample=1.0,
                      recalculate=False,
                      timeout=3600,
                      pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
Attributes
result_ : DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

model_ : DataFrame

Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. the lite Apriori model in PMML format.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, transaction, item])

Association rule mining from the input data.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, transaction=None, item=None)

Association rule mining from the input data.

Parameters
data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_key : str

The key.

attr_val : str

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hint : str

The hint clauses.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

Defaults to self.real_func.

pal_funcname : int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description : str, optional

Description of the state as a model container.

Defaults to None.

force : bool, optional

If True, the existing state is deleted first.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_name : str

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter names, and for each parameter exactly one of the other three contains its value, while the remaining two are NULL. This format is converted into a simple key-value representation.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
model : DataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_to : str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to : str or list of str, optional

Avoids query routing to the specified volume ID or service type.

Defaults to None.

route_by : str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality : str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost : int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level : {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal', or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, it is compiled once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class : str, optional

Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.association.FPGrowth(min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

FP-Growth is an algorithm to find frequent patterns from transactions without generating candidate itemsets.

Parameters
min_support : float, optional

User-specified minimum support, with valid range [0, 1].

Defaults to 0.

min_confidence : float, optional

User-specified minimum confidence, with valid range [0, 1].

Defaults to 0.

relational : bool, optional

Whether or not to apply relational logic in the FPGrowth algorithm.

If False, a single result table is produced; otherwise, the result is split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_lift : float, optional

User-specified minimum lift.

Defaults to 0.

max_conseq : int, optional

Maximum length of consequent items.

Defaults to 10.

max_len : int, optional

Total length of antecedent items and consequent items in the output.

Defaults to 10.

ubiquitous : float, optional

Item sets whose support values are greater than this number are ignored during frequent item mining.

Defaults to 1.0.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.

Values outside this range are ignored; in that case the function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds.

The algorithm stops running when the specified timeout is reached.

Defaults to 3600.

Examples

Input data for association rule mining:

>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5

Set up parameters:

>>> fpg = FPGrowth(min_support=0.2,
                   min_confidence=0.5,
                   relational=False,
                   min_lift=1.0,
                   max_conseq=1,
                   max_len=5,
                   ubiquitous=1.0,
                   thread_ratio=0,
                   timeout=3600)

Perform association rule mining on the input data using the FPGrowth algorithm, and check the results:

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000

FPGrowth algorithm set up using relational logic:

>>> fpgr = FPGrowth(min_support=0.2,
                    min_confidence=0.5,
                    relational=True,
                    min_lift=1.0,
                    max_conseq=1,
                    max_len=5,
                    ubiquitous=1.0,
                    thread_ratio=0,
                    timeout=3600)

Mine association rules from the input data again, this time with relational logic, and check the resulting tables:

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
Attributes
result_ : DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

antec_ : DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_ : DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_ : DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, transaction, item, lhs_restrict, ...])

Association rule mining from the input data.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data.

Parameters
data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

lhs_restrict : list of int/str, optional

Specifies items that are only allowed on the left-hand side of association rules.

Elements in the list should be of the same type as the item column.

rhs_restrict : list of int/str, optional

Specifies items that are only allowed on the right-hand side of association rules.

Elements in the list should be of the same type as the item column.

lhs_complement_rhs : bool, optional

If you use rhs_restrict to restrict some items to the right-hand side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand side.

For example, if you have 100 items (i1, i2, ..., i100) and want to restrict i1 and i2 to the right-hand side and i3, i4, ..., i100 to the left-hand side, you can set the parameters as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhs : bool, optional

If you use lhs_restrict to restrict some items to the left-hand side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_key : str

The key.

attr_val : str

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hint : str

The hint clauses.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

Defaults to self.real_func.

pal_funcname : int or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_description : str, optional

Description of the state as a model container.

Defaults to None.

force : bool, optional

If True, the existing state is deleted first.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_name : str

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter names, and for each parameter exactly one of the other three contains its value, while the remaining two are NULL. This format is converted into a simple key-value representation.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
model : DataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_to : str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to : str or list of str, optional

Avoids query routing to the specified volume ID or service type.

Defaults to None.

route_by : str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality : str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost : int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level : {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal', or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, it is compiled once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class : str, optional

Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.association.KORD(k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None, max_conseq=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

Parameters
k : int, optional

The number of top rules to discover.

measure : str, optional

Specifies the measure used to define the priority of the association rules.

min_support : float, optional

User-specified minimum support value of association rules, with valid range [0, 1].

Defaults to 0 if not provided.

min_confidence : float, optional

User-specified minimum confidence value of association rules, with valid range [0, 1].

Defaults to 0 if not provided.

min_coverage : float, optional

User-specified minimum coverage value of association rules, with valid range [0, 1].

Defaults to the value of min_support if not provided.

min_measure : float, optional

User-specified minimum measure value (for leverage or lift, depending on the setting of measure).

Defaults to 0 if not provided.

max_antec : int, optional

Specifies the maximum number of antecedent items in generated association rules.

Defaults to 4.

epsilon : float, optional

User-specified epsilon value for punishing the length of rules.

Valid only when use_epsilon is True.

use_epsilon : bool, optional

Specifies whether or not to use epsilon to punish the length of rules.

Defaults to False.

max_conseq : int, optional

Specifies the maximum number of consequent items in generated association rules.

Should not be greater than 3.

Defaults to 1.

Examples

First let us have a look at the training data:

>>> df.head(10).collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1

Set up a KORD instance:

>>> krd =  KORD(k=5,
                measure='lift',
                min_support=0.1,
                min_confidence=0.2,
                epsilon=0.1,
                use_epsilon=False)

Start the k-optimal rule discovery process from the input transaction data, and check the results:

>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
   RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
   RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714
Attributes
antec_ : DataFrame

Info of antecedent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : antecedent items.

conseq_ : DataFrame

Info of consequent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : consequent items.

stats_ : DataFrame

Some basic statistics for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of rules,

  • 3rd column : confidence value of rules,

  • 4th column : lift value of rules,

  • 5th column : leverage value of rules,

  • 6th column : measure value of rules.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, transaction, item])

K-optimal rule discovery from input data, based on some user-specified measure.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, transaction=None, item=None)

K-optimal rule discovery from input data, based on some user-specified measure.

Parameters
data : DataFrame

Input data for k-optimal (association) rule discovery.

transaction : str, optional

Column name of transaction ID in the input data.

Defaults to the name of the 1st column if not provided.

item : str, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-transaction column if not provided.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_key : str

The key.

attr_val : str

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hint : str

The hint clauses.

apply_to_anonymous_block : bool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_name : str

The procedure name.

in_tables : list, optional

The list of input table names.

out_tables : list, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.
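
A hedged sketch of the model-state lifecycle (obj is a placeholder for a fitted instance; per the defaults above, the created state is assumed to be stored in obj.state):

>>> obj.create_model_state(state_description='demo state')   # register the model as an AFL state
>>> obj.set_model_state(obj.state)                           # reuse the state information later
>>> obj.delete_model_state()                                 # drop the state when no longer needed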

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter names, and exactly one of the other three contains the value matching the parameter, while the other two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
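
Purely for illustration (plain Python, not part of hana_ml), the four-array layout described above collapses into key-value tuples like this (the parameter names and values are made up):

>>> names   = ['MIN_SUPPORT', 'MAX_LEN', 'PMML_EXPORT']
>>> ints    = [None, 10, 0]
>>> doubles = [0.5, None, None]
>>> strings = [None, None, None]
>>> kv = []
>>> for n, i, d, s in zip(names, ints, doubles, strings):
...     # pick the single non-NULL value array and remember its type
...     val, typ = next((v, t) for v, t in zip((i, d, s), ('int', 'double', 'str')) if v is not None)
...     kv.append((n, val, typ))
>>> kv
[('MIN_SUPPORT', 0.5, 'double'), ('MAX_LEN', 10, 'int'), ('PMML_EXPORT', 0, 'int')]
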
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.
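
As a hedged example, routing the generated statement through a workload class could look as follows (obj is a placeholder instance and the workload class WC_PAL is assumed to exist on the target system):

>>> obj.set_scale_out(workload_class='WC_PAL', apply_to_anonymous_block=True)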

class hana_ml.algorithms.pal.association.SPM(min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.

Parameters
min_supportfloat

User-specified minimum support value.

relationalbool, optional

Whether or not to apply relational logic in sequential pattern mining.

If False, a single result table for frequent pattern mining is produced; otherwise, the result table is split into two tables: one for mined patterns, and the other for statistics.

Defaults to False.

ubiquitousfloat, optional

Items whose support values are above this specified value will be ignored during the frequent item mining phase.

Defaults to 1.0.

min_lenint, optional

Minimum number of items in a transaction.

Defaults to 1.

max_lenint, optional

Maximum number of items in a transaction.

Defaults to 10.

min_len_outint, optional

Specifies the minimum number of items of the mined association rules in the result table.

Defaults to 1.

max_len_outint, optional

Specifies the maximum number of items of the mined association rules in the result table.

Defaults to 10.

calc_liftbool, optional

Whether or not to calculate lift values for all applicable cases.

If False, lift values are only calculated for the cases where the last transaction contains a single item.

Defaults to False.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

First, take a look at the input data df:

>>> df.collect()
   CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert

Set up a SPM instance:

>>> sp = SPM(min_support=0.5,
             relational=False,
             ubiquitous=1.0,
             max_len=10,
             min_len=1,
             calc_lift=True)

Start sequential pattern mining process from the input data, and check the results:

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
                        PATTERN   SUPPORT  CONFIDENCE      LIFT
0                       {Apple}  1.000000    0.000000  0.000000
1           {Apple},{Blueberry}  0.666667    0.666667  0.666667
2             {Apple},{Dessert}  1.000000    1.000000  1.000000
3             {Apple,Blueberry}  0.666667    0.000000  0.000000
4   {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                {Apple,Cherry}  0.666667    0.000000  0.000000
6      {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                   {Blueberry}  1.000000    0.000000  0.000000
8         {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                      {Cherry}  0.666667    0.000000  0.000000
10           {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                    {Dessert}  1.000000    0.000000  0.000000
Attributes
result_DataFrame

The overall frequent pattern mining result, structured as follows:

  • 1st column : mined frequent patterns,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Available only when relational is False.

pattern_DataFrame

Result for mined frequent patterns, structured as follows:

  • 1st column : pattern ID,

  • 2nd column : transaction ID,

  • 3rd column : items.

Available only when relational is True.

stats_DataFrame

Statistics for frequent pattern mining, structured as follows:

  • 1st column : pattern ID,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Available only when relational is True.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, customer, transaction, item, ...])

Sequential pattern mining from input data.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter names, and exactly one of the other three contains the value matching the parameter, while the other two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Sequential pattern mining from input data.

Parameters
dataDataFrame

Input data for sequential pattern mining.

customerstr, optional

Column name of customer ID in the input data.

Defaults to name of the 1st column if not provided.

transactionstr, optional

Column name of transaction ID in the input data.

Note that for sequential pattern mining, the values of this column must also reflect the sequence of occurrence.

Defaults to name of the 1st non-customer column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-customer, non-transaction column if not provided.

item_restrictlist of int or str, optional

Specifies the list of items allowed in the mined association rule.

min_gapint, optional

Specifies the minimum time difference between consecutive transactions in a sequence.
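
For illustration, a fit call that uses these two parameters might look as follows (reusing df and the sp instance from the SPM example above; the restriction list and gap value are arbitrary):

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS',
...        item_restrict=['Apple', 'Dessert'], min_gap=1)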

hana_ml.algorithms.pal.clustering

This module contains Python wrappers for PAL clustering algorithms.

The following classes are available:

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette refers to a method used to validate the clustering of data. SAP HANA PAL provides a light version of silhouette called slight silhouette. SlightSilhouette is a wrapper for this light version of the silhouette method.

Parameters
dataDataFrame

DataFrame containing the data.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-label columns.

labelstr, optional

Name of the cluster label column.

If label is not provided, it defaults to the last column of data.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is minkowski.

Defaults to 3.0.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.

  • 'min_max': For each column C, take the minimum and maximum values of C, and then C[i] = (C[i] - min) / (max - min).

Defaults to 'no'.
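
Purely illustrative (plain Python/numpy, not PAL code), the two non-trivial normalization types amount to:

>>> import numpy as np
>>> X = np.array([[1.0, -2.0], [3.0, 4.0]])
>>> X / np.abs(X).sum(axis=1, keepdims=True)               # 'l1_norm': per point, divide by S
array([[ 0.33333333, -0.66666667],
       [ 0.42857143,  0.57142857]])
>>> (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # 'min_max': per column
array([[0., 0.],
       [1., 1.]])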

thread_numberint, optional

Number of threads.

Defaults to 1.

categorical_variablestr or a list of str, optional

Indicates whether a column of data actually corresponds to a category variable even though the data type of the column is INTEGER.

By default, VARCHAR or NVARCHAR is a category variable, and INTEGER or DOUBLE is a continuous variable.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

Returns
DataFrame

A DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
    V000 V001 V002 CLUSTER
0    0.5    A  0.5       0
1    1.5    A  0.5       0
2    1.5    A  1.5       0
3    0.5    A  1.5       0
4    1.1    B  1.2       0
5    0.5    B 15.5       1
6    1.5    B 15.5       1
7    1.5    B 16.5       1
8    0.5    B 16.5       1
9    1.2    C 16.1       1
10  15.5    C 15.5       2
11  16.5    C 15.5       2
12  16.5    C 16.5       2
13  15.5    C 16.5       2
14  15.6    D 16.2       2
15  15.5    D  0.5       3
16  16.5    D  0.5       3
17  16.5    D  1.5       3
18  15.5    D  1.5       3
19  15.7    A  1.6       3

Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")

Result:

>>> res.collect()
  VALIDATE_VALUE
0      0.9385944
class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

Parameters
affinity{'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}

Ways to compute the distance between two points.

n_clustersint

Number of clusters.

  • 0: does not adjust Affinity Propagation cluster result.

  • Non-zero int: If the number of clusters found by Affinity Propagation is larger than n_clusters, PAL will merge the result so that the number of clusters equals the value specified for n_clusters.

No default value as it is mandatory.

max_iterint, optional

Maximum number of iterations.

Defaults to 500.

convergence_iterint, optional

The algorithm stops when the cluster assignment stays unchanged for the specified number of consecutive iterations.

Defaults to 100.

dampingfloat, optional

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preferencefloat, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratiofloat, optional

Selects a portion (seed_ratio * data_number) of the input data as seeds, where data_number is the number of rows of the input data.

Value range: (0,1].

If seed_ratio is 1, all the input data will be the seed.

Defaults to 1.

timesint, optional

The sampling times. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_powerint, optional

The power of the Minkowski method. Only valid when affinity is 'minkowski'.

Defaults to 3.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
ID  ATTRIB1  ATTRIB2
 1     0.10     0.10
 2     0.11     0.10
 3     0.10     0.11
 4     0.11     0.11
 5     0.12     0.11
 6     0.11     0.12
 7     0.12     0.12
 8     0.12     0.13
 9     0.13     0.12
10     0.13     0.13
11     0.13     0.14
12     0.14     0.13
13    10.10    10.10
14    10.11    10.10
15    10.10    10.11
16    10.11    10.11
17    10.11    10.12
18    10.12    10.11
19    10.12    10.12
20    10.12    10.13
21    10.13    10.12
22    10.13    10.13
23    10.13    10.14
24    10.14    10.13

Create an AffinityPropagation instance:

>>> ap = AffinityPropagation(
            affinity='euclidean',
            n_clusters=0,
            max_iter=500,
            convergence_iter=100,
            damping=0.9,
            preference=0.5,
            seed_ratio=None,
            times=None,
            minkowski_power=None,
            thread_ratio=1)

Perform fit on the given data:

>>> ap.fit(data=df, key='ID')

Expected output:

>>> ap.labels_.collect()
ID  CLUSTER_ID
 1           0
 2           0
 3           0
 4           0
 5           0
 6           0
 7           0
 8           0
 9           0
10           0
11           0
12           0
13           1
14           1
15           1
16           1
17           1
18           1
19           1
20           1
21           1
22           1
23           1
24           1
Attributes
labels_DataFrame

Label assigned to each sample, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features])

Fit the model when given the training dataset.

fit_predict(data[, key, features])

Fit with the dataset and return labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

fit_predict(data, key=None, features=None)

Fit with the dataset and return labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Labels of each point.
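
A minimal sketch of fit_predict, reusing the df and ap instance from the example above:

>>> labels = ap.fit_predict(data=df, key='ID')
>>> labels.collect()   # same content as ap.labels_ shown above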

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter names, and exactly one of the other three contains the value matching the parameter, while the other two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerate approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered as a cluster of its own. The algorithm then iteratively merges two clusters based on the dissimilarity measure in a greedy manner and forms a larger cluster.

Parameters
n_clustersint, optional

Number of clusters after the agglomerate hierarchical clustering algorithm runs. Value range: between 1 and the number of records in the input data.

Defaults to 1.

affinity{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower', 'precomputed'}, optional

Ways to compute the distance between two points.

Note

  • (1) For jaccard distance, non-zero input values are treated as 1 and zero values as 0; then jaccard distance = (M01 + M10) / (M11 + M01 + M10), where Mxy counts the positions with value x in the first point and y in the second. A worked sketch follows this parameter description.

  • (2) Only gower distance supports category attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.

Defaults to 'squared euclidean'.
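
Purely illustrative (plain Python, not PAL code), a worked computation of the jaccard distance above for two binary points:

>>> a = [1, 0, 1, 1, 0]
>>> b = [1, 1, 0, 1, 0]
>>> m11 = sum(x == 1 and y == 1 for x, y in zip(a, b))   # both 1
>>> m01 = sum(x == 0 and y == 1 for x, y in zip(a, b))   # first 0, second 1
>>> m10 = sum(x == 1 and y == 0 for x, y in zip(a, b))   # first 1, second 0
>>> (m01 + m10) / (m11 + m01 + m10)
0.5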

linkage{ 'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional

Linkage type between two clusters.

  • 'nearest neighbor' : single linkage.

  • 'furthest neighbor' : complete linkage.

  • 'group average' : UPGMA.

  • 'weighted average' : WPGMA.

  • 'centroid clustering'.

  • 'median clustering'.

  • 'ward'.

Defaults to 'centroid clustering'.

Note

For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimensionfloat, optional

Distance dimension can be set if affinity is set to 'minkowski'. The value should be no less than 1.

Only valid when affinity is 'minkowski'.

Defaults to 3.

normalizationstr, optional

Specifies the type of normalization applied.

  • 'no': No normalization

  • 'z-score': Z-score standardization

  • 'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].

  • 'min-max': Standard min-max normalization, transforming to new range [0, 1].

Valid only when affinity is not 'precomputed'.

Defaults to 'no'.

category_weightsfloat, optional

Represents the weight of category columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
     POINT   X1    X2      X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1

Create an AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
             n_clusters=4,
             affinity='Gower',
             linkage='weighted average',
             thread_ratio=None,
             distance_dimension=3,
             normalization='no',
             category_weights= 0.1)

Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
     STAGE    LEFT_POINT   RIGHT_POINT    DISTANCE
0    1        18           19             0.0187
1    2        13           14             0.0250
2    3        7            9              0.0437
>>> hc.labels_.collect().head(3)
           POINT    CLUSTER_ID
     0     0        1
     1     1        1
     2     2        1
Attributes
combine_process_DataFrame

Structured as follows:

  • 1st column: int, STAGE, the combine stage.

  • 2nd column: same data type as the ID column in the input table, named LEFT_ + the ID column name. One of the two clusters combined at this stage, identified by its row number in the input data table. After combining, the new cluster takes the name of the left one.

  • 3rd column: same data type as the ID column in the input table, named RIGHT_ + the ID column name. The other cluster combined at the same stage, identified by its row number in the input data table.

  • 4th column: float, DISTANCE, the distance between the two combined clusters.

labels_DataFrame

Label assigned to each sample, structured as follows:

  • 1st column: Name of the ID column in the input data(or that of the first column of the input DataFrame when affinity is 'precomputed'), record ID.

  • 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, categorical_variable])

Fit the model when given the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between first point & second point.

keystr, optional

Name of ID column in data.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided and affinity is not set as 'precomputed' in initialization, please enter the value of key.

Valid only when affinity is not 'precomputed' in initialization.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.

Valid only when affinity is not 'precomputed' in initialization.

Defaults to None.
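
A hedged sketch of fitting with a precomputed distance matrix (the table name and its contents are illustrative; note that a fresh instance must be created with affinity='precomputed'):

>>> hc_pre = AgglomerateHierarchicalClustering(n_clusters=2, affinity='precomputed')
>>> dist_df = cc.table('PRECOMPUTED_DIST_TBL')   # columns: LEFT_ID, RIGHT_ID, DISTANCE
>>> hc_pre.fit(data=dist_df)                     # key and features are not used for 'precomputed'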

fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

If affinity is specified as 'precomputed' in initialization, then data must be a structured DataFrame that reflects the affinity information between points as follows:

  • 1st column: ID of the first point.

  • 2nd column: ID of the second point.

  • 3rd column: Precomputed distance between first point & second point.

keystr, optional

Name of ID column in data.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided and affinity is not set as 'precomputed' in initialization, please enter the value of key.

Valid only when affinity is not 'precomputed' in initialization.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

Valid only when affinity is not 'precomputed' in initialization.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.

Valid only when affinity is not 'precomputed' in initialization.

Defaults to None.

Returns
DataFrame

Label of each point.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, e.g.
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter names, and exactly one of the other three contains the value matching the parameter, while the other two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps must either both be provided by the user, or both be omitted, in which case they are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps must either both be provided by the user, or both be omitted, in which case they are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When 'minkowski' is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.

Defaults to 3.

categorical_variablestr or a list of str, optional

Specifies column(s) in the data that should be treated as categorical.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If True, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create a DBSCAN instance:

>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, ...])

Fit the DBSCAN model when given the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing data that is not categorical. Levenshtein distance is used to calculate the similarity between two strings. Ignored for non-string columns.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified. A usage sketch follows this method's Returns section.

Defaults to None.

Returns
A fitted object of class "DBSCAN".
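
For instance, the distance contribution of a single column can be up-weighted through variable_weight. A minimal sketch, reusing the example df above (the weight value is illustrative):

>>> dbscan.fit(data=df, key='ID', variable_weight={'V1': 2.0})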
fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresstr or a list of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing data that is not categorical. Levenshtein distance is used to calculate the similarity between two strings. Ignored for non-string columns.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.

Defaults to 1 for variables not specified.

Returns
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional.

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
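
A minimal usage sketch, assuming df_new is a SAP HANA DataFrame with the same column structure as the fit() data:

>>> assignments = dbscan.predict(data=df_new, key='ID')

The returned DataFrame then carries the ID, CLUSTER_ID and DISTANCE columns described above.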

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state will be deleted.

Defaults to False.
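
A minimal state lifecycle sketch, assuming dbscan has been fitted and the target system supports AFL model states:

>>> dbscan.create_model_state(state_description='DBSCAN model state')
>>> dbscan.delete_model_state()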

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.
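
For example (the workload class name is illustrative and must already exist in the SAP HANA system):

>>> dbscan.enable_workload_class('MY_WORKLOAD_CLASS')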

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value matching the parameter, and the remaining two are NULL. This format is converted into simple key-value storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
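
As a sketch of how the related parameter getters can be consumed after fit() (the names and values depend on the fitted instance):

>>> for name, value, typ in dbscan.get_fit_parameters():
...     print(name, value, typ)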
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile the statement with route optimization level 'minimal', or to default to route optimization level 'all'. If the 'minimal' compiled plan is cached, the statement is compiled once more with the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via a workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.
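
A minimal routing sketch (assuming a scale-out system where 'computeserver' is a valid service type):

>>> dbscan.set_scale_out(route_to='computeserver')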

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps must either both be provided by the user, or both be left unspecified, in which case they are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps must either both be provided by the user, or both be left unspecified, in which case they are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1, i.e. heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

Controls the power value of the Minkowski distance.

Only applicable when metric is 'minkowski'.

Defaults to 3.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
  "ID" INTEGER,
  "POINT" ST_GEOMETRY);

Then, input dataframe df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")

Create a GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> geo_dbscan.fit(data=df, key='ID')

Expected output:

>>> geo_dbscan.labels_.collect()
     ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28   29  -1
29   30  -1
>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')

Expected output:

>>> result.collect()
     ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28    29  -1
29    30  -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features])

Fit the Geometry DBSCAN model when given the training dataset.

fit_predict(data[, key, features])

Fit with the dataset and return the labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data for applying geometry DBSCAN.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresstr, optional

Name of the column for storing geometry points.

If not provided, it defaults to the first non-key column.

fit_predict(data, key=None, features=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresstr, optional

Name of the column for storing 2-D geometry points.

If not provided, it defaults to the first non-key column.

Returns
DataFrame

Label assigned to each sample.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state will be deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value matching the parameter, and the remaining two are NULL. This format is converted into simple key-value storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile the statement with route optimization level 'minimal', or to default to route optimization level 'all'. If the 'minimal' compiled plan is cached, the statement is compiled once more with the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via a workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.clustering.KMeans(n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False, use_fast_library=None, use_float=None)

Bases: hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters
n_clustersint, optional

Number of clusters. If this parameter is not specified, you must specify n_clusters_min and n_clusters_max instead.

n_clusters_minint, optional

Cluster range minimum.

n_clusters_maxint, optional

Cluster range maximum.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization is applied.

  • 'l1_norm': For each point X (x1, x2, ..., xn), the normalized value is X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': For each column C, get the min and max values of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

tolfloat, optional

Convergence threshold for exiting iterations.

Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode{'auto', 'optimize-speed', 'optimize-space'}, optional

Indicates the memory mode that the algorithm uses.

  • 'auto': Chosen by algorithm.

  • 'optimize-speed': Prioritizes speed.

  • 'optimize-space': Prioritizes memory.

Only valid when accelerated is True.

Defaults to 'auto'.

acceleratedbool, optional

Indicates whether to use technology like cache to accelerate the calculation process:

  • If True, the calculation process will be accelerated.

  • If False, the calculation process will not be accelerated.

Defaults to False.

use_fast_librarybool, optional

Use vectorized accelerated operation when it is set to True.

Defaults to False.

use_floatbool, optional

Specifies the floating-point precision used in the calculation:

  • False: double

  • True: float

Only valid when use_fast_library is True.

Defaults to True.
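
When the fast library is enabled, the floating-point precision can be chosen as sketched below (parameter values are illustrative):

>>> km_fast = clustering.KMeans(n_clusters=4, use_fast_library=True, use_float=False)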

Examples

Input dataframe df for K-Means:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create a KMeans instance:

>>> km = clustering.KMeans(n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='euclidean',
...                        category_weights=0.5)

Perform fit_predict:

>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679

Input dataframe df for Accelerated K-Means:

>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create an accelerated KMeans instance:

>>> akm = clustering.KMeans(init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
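
Cluster assignment for new points can then be sketched as follows, assuming df_new shares the column structure of the training data:

>>> assignments = akm.predict(data=df_new, key='ID')
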
Attributes
labels_DataFrame

Label assigned to each sample.

cluster_centers_DataFrame

Coordinates of cluster centers.

model_DataFrame

Model content.

statistics_DataFrame

Statistic value.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, categorical_variable])

Fit the model when given training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
A fitted object of class "KMeans".
fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional.

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state will be deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the SQL code. Note that it only detects synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value matching the parameter, and the remaining two are NULL. This format is converted into simple key-value storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile the statement with route optimization level 'minimal', or to default to route optimization level 'all'. If the 'minimal' compiled plan is cached, the statement is compiled once more with the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via a workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.clustering.KMedians(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization is applied.

  • 'l1_norm': For each point X (x1, x2, ..., xn), the normalized value is X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': For each column C, get the min and max values of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create a KMedians instance:

>>> kmedians = KMedians(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Perform fit() on the given dataframe:

>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1
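
These centers can be cross-checked against the definition: each coordinate is the per-feature median over the cluster's members. For cluster 0 (IDs 0 to 4 in the fit_predict() output below), computed locally with the standard library:

>>> import statistics
>>> statistics.median([0.5, 1.5, 1.5, 0.5, 1.1])  # V000 values of IDs 0-4
1.1
>>> statistics.median([0.5, 0.5, 1.5, 1.5, 1.2])  # V002 values of IDs 0-4
1.2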

Perform fit_predict() on the given dataframe:

>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, categorical_variable])

Perform clustering on input dataset.

fit_predict(data[, key, features, ...])

Perform clustering algorithm and return labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state will be deleted.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

fit(data, key=None, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, e.g.:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the sql lines containing the parameter definitions. In the sql code, all parameters are defined by four arrays: the first contains the parameter name, and exactly one of the other three contains the value matching the parameter's type, while the remaining two are NULL. This function flattens that format into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
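
A rough plain-Python sketch of that flattening (an illustration only, not the library's internal code; the rows and the integer/double/string slot order are assumptions):

>>> rows = [('N_CLUSTERS', 4, None, None),      # integer slot filled
...         ('THREAD_RATIO', None, 0.3, None)]  # double slot filled
>>> [(name, next(v for v in vals if v is not None),
...   ('integer', 'double', 'string')[[v is not None for v in vals].index(True)])
...  for name, *vals in rows]
[('N_CLUSTERS', 4, 'integer'), ('THREAD_RATIO', 0.3, 'double')]
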
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.
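
A hedged sketch of moving a fitted model between instances (obj and obj2 are assumptions; per is_fitted above, the fitted model is stored in the model_ attribute):

>>> model_df = obj.model_      # model DataFrame produced by a previous fit
>>> obj2.load_model(model_df)  # obj2, a fresh instance of the same class, can now predict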

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.
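
An illustrative call (hedged; 'computeserver' as a service type and the workload class name are assumptions about the target system):

>>> obj.set_scale_out(route_to='computeserver',
...                   workload_class='MY_PAL_WC',
...                   apply_to_anonymous_block=True)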

class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids (actual data points) as cluster centers, which makes K-Medoids more robust to noise and outliers than k-means.

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No, normalization will not be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.
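
As a quick numeric check of the normalization options above: under 'l1_norm' the point X = (1, 3) has S = |1| + |3| = 4 and becomes X' = (0.25, 0.75); under 'min_max' a column with min 0 and max 10 maps the value 4 to (4-0)/(10-0) = 0.4.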

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating a KMedoids instance:

>>> kmedoids = KMedoids(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, categorical_variable])

Perform clustering on input dataset.

fit_predict(data[, key, features, ...])

Perform clustering algorithm and return labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.
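
For instance (hedged; NO_INLINE is a standard SAP HANA hint, though whether it helps depends on the scenario):

>>> kmedoids.apply_with_hint('NO_INLINE', apply_to_anonymous_block=True)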

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.
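
A hedged sketch (the procedure and table names are invented, and the return value is assumed to be the generated procedure text):

>>> proc_body = kmedoids.consume_fit_hdbprocedure(
...     'MY_KMEDOIDS_FIT', in_tables=['PAL_KMEDOIDS_DATA_TBL'])
>>> print(proc_body)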

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the conversion from bigint to double (i.e. set the conversion flag to False).

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allow the conversion from bigint to double (i.e. set the conversion flag to True).

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

fit(data, key=None, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set; otherwise, key must be specified explicitly.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set; otherwise, key must be specified explicitly.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, e.g.:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the sql lines containing the parameter definitions. In the sql code, all parameters are defined by four arrays: the first contains the parameter name, and exactly one of the other three contains the value matching the parameter's type, while the remaining two are NULL. This function flattens that format into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

hana_ml.algorithms.pal.clustering.outlier_detection_kmeans(data, key=None, features=None, n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None, thread_number=None)

Outlier detection based on k-means clustering.

Parameters
dataDataFrame

Input data for outlier detection using k-means clustering.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set; otherwise, key must be specified explicitly.

featuresstr or ListOfStrings

Names of the feature columns in data used for calculating the distances between points for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

n_clustersint, optional

Number of clusters to be grouped.

If this number is not specified, the G-means method will be used to determine the number of clusters.

distance_level{'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the distance type between data points and cluster center.

  • 'manhattan' : Manhattan distance

  • 'euclidean' : Euclidean distance

  • 'minkowski' : Minkowski distance

Defaults to 'euclidean'.

contaminationfloat, optional

Specifies the proportion of outliers in data.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distancebool, optional

Specifies whether or not to use the sum of the distances from a point to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used in the distance value calculation.

Defaults to True.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

tolfloat, optional

Convergence threshold for exiting iterations in k-means clustering.

Defaults to 1.0e-9.

distance_thresholdfloat, optional

Specifies the threshold distance value for outlier detection.

A point with a distance value no greater than the threshold is not considered to be an outlier.

Defaults to -1.

thread_numberint, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

Returns
DataFrame
  • Detected outliers, structured as follows:

    • 1st column : ID of detected outliers in data.

    • other columns : feature values for detected outliers

  • Statistics of detected outliers, structured as follows:

    • 1st column : ID of detected outliers in data.

    • 2nd column : ID of the corresponding cluster centers.

    • 3rd column : Outlier score, which is the distance value.

  • Centers of clusters produced by k-means algorithm, structured as follows:

    • 1st column : ID of cluster center.

    • other columns : Coordinate(i.e. feature) values of cluster center.

Examples

Input data for outlier detection:

>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0
>>> outliers, stats, centers = outlier_detection_kmeans(df, key='ID',
...                                                     distance_level='euclidean',
...                                                     contamination=0.15,
...                                                     sum_distance=True,
...                                                     distance_threshold=3)
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274
class hana_ml.algorithms.pal.clustering.SpectralClustering(n_clusters, n_components=None, gamma=None, affinity=None, n_neighbors=None, cut=None, eigen_tol=None, krylov_dim=None, distance_level=None, minkowski_power=None, category_weights=None, max_iter=None, init=None, tol=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This is the Python wrapper for PAL Spectral Clustering.

Spectral clustering is an algorithm evolved from graph theory and has been widely used in clustering. Its main idea is to treat all data points as vertices of a graph connected by weighted edges: the edge weight between two points that are far apart is low, while the edge weight between two points that are close together is high. The graph is then cut so that the total edge weight between different subgraphs is as low as possible, while the total edge weight within each subgraph is as high as possible, which yields the clustering.

It performs a low-dimension embedding of the affinity matrix between samples, followed by k-means clustering of the components of the eigenvectors in the low dimensional space.

Parameters
n_clustersint

The number of clusters for spectral clustering.

The valid range for this parameter is from 2 to the number of records in the input data.

n_componentsint, optional

The number of eigenvectors used for spectral embedding.

Defaults to the value of n_clusters.

gammafloat, optional

RBF kernel coefficient \(\gamma\) used in constructing affinity matrix with distance metric d, illustrated as \(\exp(-\gamma * d^2)\).

Defaults to 1.0.

affinitystr, optional

Specifies the type of graph used to construct the affinity matrix. Valid options include:

  • 'knn' : binary affinity matrix constructed from the graph of k-nearest-neighbors(knn).

  • 'mutual-knn' : binary affinity matrix constructed from the graph of mutual k-nearest-neighbors(mutual-knn).

  • 'fully-connected' : affinity matrix constructed from fully-connected graph, with weights defined by RBF kernel coefficients.

Defaults to 'fully-connected'.

n_neighborsint, optional

The number of neighbors to use when constructing the affinity matrix using the nearest-neighbors method.

Valid only when graph is 'knn' or 'mutual-knn'.

Defaults to 10.

cutstr, optional

Specifies the method to cut the graph.

  • 'ratio-cut' : Ratio-Cut.

  • 'n-cut' : Normalized-Cut.

Defaults to 'ratio-cut'.

eigen_tolfloat, optional

Stopping criterion for eigendecomposition of the Laplacian matrix.

Defaults to 1e-10.

krylov_dimint, optional

Specifies the dimension of Krylov subspaces used in Eigenvalue decomposition. In general, this parameter controls the convergence speed of the algorithm. Typically a larger krylov_dim means faster convergence, but it may also result in greater memory use and more matrix operations in each iteration.

Defaults to 2 * n_components.

Note

This parameter must satisfy

n_components < krylov_dim \(\le\) the number of training records.

distance_levelstr, optional

Specifies the method for computing the distance between data records and cluster centers:

  • 'manhattan' : Manhattan distance.

  • 'euclidean' : Euclidean distance.

  • 'minkowski' : Minkowski distance.

  • 'chebyshev' : Chebyshev distance.

  • 'cosine' : Cosine distance.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

Specifies the power parameter in Minkowski distance.

Valid only when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

max_iterint, optional

Maximum number of iterations for K-Means algorithm.

Defaults to 100.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected in K-Means algorithm:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

tolfloat, optional

Specifies the exit threshold for K-Means iterations.

Defaults to 1e-6.
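
Examples

As a numeric illustration of the RBF affinity above: with gamma = 1.0 and d = 2, the edge weight is \(\exp(-1.0 * 2^2) = \exp(-4) \approx 0.0183\), so distant pairs contribute almost nothing to the affinity matrix.

A minimal usage sketch (df2, its columns, and the chosen parameter values are illustrative assumptions, not taken from the library's documentation):

>>> sc = SpectralClustering(n_clusters=2, affinity='knn', n_neighbors=5)
>>> labels = sc.fit_predict(data=df2, key='ID', thread_ratio=0.2)
>>> labels.collect()  # 1st column: ID; 2nd column: CLUSTER_ID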

Attributes
labels_DataFrame

DataFrame that holds the cluster labels.

Set to None if not fitted.

stats_DataFrame

DataFrame that holds the related statistics for spectral clustering.

Set to None if not fitted.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, features, thread_ratio])

Perform spectral clustering for the given dataset.

fit_predict(data[, key, features, thread_ratio])

Given data, perform spectral clustering and return the corresponding cluster labels.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, key=None, features=None, thread_ratio=None)

Perform spectral clustering for the given dataset.

Parameters
dataDataFrame

DataFrame containing the input data for spectral clustering.

keystr, optional

Name of ID column in data.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by spectral clustering.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

fit_predict(data, key=None, features=None, thread_ratio=None)

Given data, perform spectral clustering and return the corresponding cluster labels.

Parameters
dataDataFrame

DataFrame containing the input data for spectral clustering.

keystr, optional

Name of ID column in data.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by spectral clustering.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns
DataFrame

The cluster labels of all records in data, structured as follows:

  • 1st column : column name and type same as the key column of data, representing record IDs.

  • 2nd column : CLUSTER_ID, type INTEGER, representing the cluster IDs assigned to all records in data.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the conversion from bigint to double (i.e. set the conversion flag to False).

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allow the conversion from bigint to double (i.e. set the conversion flag to True).

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, e.g.:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the sql lines containing the parameter definitions. In the sql code, all parameters are defined by four arrays: the first contains the parameter name, and exactly one of the other three contains the value matching the parameter's type, while the remaining two are NULL. This function flattens that format into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

class hana_ml.algorithms.pal.clustering.KMeansOutlier(n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Outlier detection of datasets using k-means clustering.

Parameters
n_clustersint, optional

Number of clusters to be grouped.

If this number is not specified, the G-means method will be used to determine the number of clusters.

distance_level{'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the distance type between data points and cluster center.

  • 'manhattan' : Manhattan distance

  • 'euclidean' : Euclidean distance

  • 'minkowski' : Minkowski distance

Defaults to 'euclidean'.

contaminationfloat, optional

Specifies the proportion of outliers within the input data to be detected.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distancebool, optional

Specifies whether or not to use the sum of the distances from a point to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used in the distance value calculation.

Defaults to True.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

tolfloat, optional

Convergence threshold for exiting iterations in k-means clustering.

Defaults to 1.0e-9.

distance_thresholdfloat, optional

Specifies the threshold distance value for outlier detection.

A point with a distance value no greater than the threshold is not considered to be an outlier.

Defaults to -1.

Examples

Input data for outlier detection:

>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0

Initialize the class instance:

>>> kmsodt = KMeansOutlier(distance_level='euclidean',
...                        contamination=0.15,
...                        sum_distance=True,
...                        distance_threshold=3)
>>> outliers, stats, centers = kmsodt.fit_predict(df, key='ID')
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274
Attributes
fit_hdbprocedure

Returns the generated hdbprocedure for fit.

predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit_predict(data[, key, features, thread_number])

Perform k-means clustering on an input dataset and extract the corresponding outliers.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the conversion from bigint to double (i.e. set the conversion flag to False).

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allow the conversion from bigint to double (i.e. set the conversion flag to True).

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

fit_predict(data, key=None, features=None, thread_number=None)

Perform k-means clustering on an input dataset and extract the corresponding outliers.

Parameters
dataDataFrame

Input data for outlier detection using k-means clustering.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set; otherwise, key must be specified explicitly.

featuresstr or ListOfStrings

Names of the feature columns in data used for calculating the distances between points for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

thread_numberint, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

Returns
DataFrames
  • Detected outliers, structured as follows:
    • 1st column : ID of detected outliers in data.

    • other columns : feature values for detected outliers

  • Statistics of detected outliers, structured as follows:
    • 1st column : ID of detected outliers in data.

    • 2nd column : ID of the corresponding cluster centers.

    • 3rd column : Outlier score, which is the distance value.

  • Centers of clusters produced by k-means algorithm, structured as follows:
    • 1st column : ID of cluster center.

    • other columns : Coordinate(i.e. feature) values of cluster center.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, e.g.:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the sql lines containing the parameter definitions. In the sql code, all parameters are defined by four arrays: the first contains the parameter name, and exactly one of the other three contains the value matching the parameter's type, while the remaining two are NULL. This function flattens that format into a simple key-value based storage.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True it will be applied to the anonymous block.

Defaults to True.

hana_ml.algorithms.pal.crf

This module contains a Python wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.

The following class is available:

class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).

Parameters
epsilonfloat, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lambfloat, optional

Regularization weight, should be greater than 0.

Defaults to 1.0.

max_iterint, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_mint, optional

Number of memories to be stored in L_BFGS optimization algorithm.

Defaults to 25.

use_class_featurebool, optional

Whether to include a feature for the class/label. This is the same as having a bias vector in a model.

Defaults to True.

use_wordbool, optional

If True, adds a feature for the current word.

Defaults to True.

use_ngramsbool, optional

Whether to make features from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngramsbool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_lengthint, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.

use_prevbool, optional

Whether or not to include a feature for the previous word and the current word; together with other options, this also enables other previous-word features.

Defaults to True.

use_nextbool, optional

Whether or not to include a feature for the next word and the current word.

Defaults to True.

disjunction_widthint, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctivebool, optional

Whether or not to include features giving disjunctions of words appearing anywhere within disjunction_width words to the left or right.

Defaults to True.

use_seqsbool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqsbool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqsbool, optional

Whether or not to use basic zeroth order word shape features.

Defaults to True.

use_type_seqs2bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqsbool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shapeint, optional

Word shape, e.g. whether the word is capitalized or numeric. Currently only chris2UseLC is supported. Word shape features are not used if this is 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by the fit(i.e. training) function.

The range of this parameter is from 0 to 1.

0 means only using single thread, 1 means using at most all available threads currently.

Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.
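
To make the n-gram options above concrete (a worked illustration, not from the library's documentation): for the word "Blood" with max_ngram_length = 2, the 2-gram features are 'Bl', 'lo', 'oo', 'od'; with mid_ngrams=False, those containing neither the beginning nor the end of the word ('lo' and 'oo') are excluded.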

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O

Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382

Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52

Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)

Check the prediction result:

>>> res.head(10).collect()

The result lists, for each word in the input, its document ID, word position and predicted label.

Attributes
model_DataFrame

CRF model content.

stats_DataFrame

Statistic info for CRF model fitting, structured as follows:

  • 1st column: name of the statistics, type NVARCHAR(100).

  • 2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_DataFrame

Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (reserved for future use).

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, doc_id, word_pos, word, label])

Function for training the CRF model on English text.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

predict(data[, doc_id, word_pos, word, ...])

The function that predicts text labels based on the trained CRF model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

fit(data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters
dataDataFrame

Input data for training/fitting the CRF model.

It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

labelstr, optional

Name of the label column.

Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
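
As an illustrative sketch of the column defaults above (not an additional API): assuming a connected hana_ml DataFrame df whose columns are ordered as document ID, word position, word and label, the model can be fitted without naming any column explicitly:

>>> crf = CRF(lamb=0.1)
>>> # Columns are inferred positionally per the defaults above:
>>> # 1st column -> doc_id, next -> word_pos, next -> word, last -> label.
>>> crf.fit(data=df)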

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, it will delete the existing state.

Defaults to False.
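
A minimal sketch of the model-state lifecycle, assuming crf has already been fitted so that self.model_ exists; all arguments are left at their documented defaults:

>>> crf.create_model_state()   # builds the AFL state from crf.model_
>>> # ... run predictions against the cached state ...
>>> crf.delete_model_state()   # drops crf.state when no longer needed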

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.
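
For example (the workload class name below is a placeholder and must already be defined in the SAP HANA system):

>>> crf.enable_workload_class(workload_class_name='MY_WORKLOAD_CLASS')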

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code. Note that it only detects synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
get_parameters()

Parse the sql lines containing the parameter definitions. In the sql code, all parameters are defined by four arrays: the first contains the parameter names, and exactly one of the other three contains the value matching each parameter's type, while the remaining two are NULL. This format is converted into a simple key-value based representation.

Returns
dict of list of tuples, where each tuple describes a parameter like (name, value, type)
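
A hedged sketch of these introspection helpers, assuming crf has been fitted so that the SQL code has already been generated:

>>> print(crf.get_fit_execute_statement())   # SQL block generated for training
>>> crf.get_fit_parameters()   # list of (name, value, type) tuples
>>> crf.get_parameters()       # dict of lists of such tuples, parsed from the SQL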
get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

Returns
List of table names.
get_predict_parameters()

Get PAL predict parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

Returns
List of table names.
get_score_parameters()

Get SAP HANA PAL score parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load the fitted model.

Parameters
modelDataFrame

SAP HANA DataFrame for fitted model.
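
A sketch of reusing a fitted model in a fresh instance; crf.model_ is the model DataFrame produced by fit():

>>> new_crf = CRF()
>>> new_crf.load_model(crf.model_)   # new_crf can now be used for predict()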

predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

The function that predicts text labels based on the trained CRF model.

Parameters
dataDataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the 1st column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

thread_ratiofloat, optional

Specifies the ratio of the total number of threads that can be used by the predict function.

The range of this parameter is from 0 to 1.

0 means using only a single thread, and 1 means using at most all currently available threads.

Values outside this range are ignored, and the predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns
DataFrame

Prediction result for the input data, structured as follows:

  • 1st column: document ID,

  • 2nd column: word position,

  • 3rd column: label.
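
As with fit(), the column defaults allow a purely positional call; a sketch assuming df_pred's columns are ordered as document ID, word position and word:

>>> res = crf.predict(data=df_pred)   # doc_id, word_pos, word inferred positionally
>>> res.head(5).collect()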

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None, apply_to_anonymous_block=True)

SAP HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with the 'minimal' route optimization level, or to fall back to the default level. If the 'minimal' compiled plan is cached, it is compiled once more with the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.
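
An illustrative sketch of statement routing (the service type 'computeserver' is a placeholder, not a recommendation):

>>> crf.set_scale_out(route_to='computeserver',
...                   apply_to_anonymous_block=True)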

hana_ml.algorithms.pal.decomposition

This module contains Python wrappers for PAL decomposition algorithms.

The following classes are available:

hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters
n_componentsint

Expected number of topics in the corpus.

doc_topic_priorfloat, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_priorfloat, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Value must be greater than 0.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

max_top_wordsint, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_wordsfloat, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold.

It cannot be used together with parameter max_top_words.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

  • 'uniform': Assign each word in each document a topic by uniform distribution.

  • 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document.

Each delimiter must be one character long.

Defaults to [' '].

output_word_assignmentbool, optional

Controls whether or not to output the word_topic_assignment_ DataFrame.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating a LDA instance:

>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
                                    iteration=100, seed=1,
                                    max_top_words=5, doc_topic_prior=0.1,
                                    output_word_assignment=True,
                                    delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
                        iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes
doc_topic_dist_DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_DataFrame

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_DataFrame

Topic top words table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_DataFrame

Topic-word distribution table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORD_ID, type INTEGER, word ID.

  • PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_DataFrame

Dictionary table, structured as follows:

  • WORD_ID, type INTEGER, word ID.

  • WORD, type NVARCHAR(5000), word text.

statistic_DataFrame

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

  • Parameters max_top_words and threshold_top_words cannot be used together.

  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().

Methods

add_attribute(attr_key, attr_val)

Function to add attribute.

apply_with_hint(with_hint[, ...])

Apply with hint.

consume_fit_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for fit.

consume_predict_hdbprocedure(proc_name[, ...])

Return the generated consume hdbprocedure for predict.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline([apply_to_anonymous_block])

Enable no inline.

enable_parallel_by_parameter_partitions([...])

Enable parallel by parameter partitions.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

fit(data[, key, document])

Fit LDA model based on training data.

fit_transform(data[, key, document])

Fit LDA model based on training data and return the topic assignment for the training documents.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

get_fit_parameters()

Get PAL fit parameters.

get_pal_function()

Extract the specific function call of the SAP HANA PAL function from the sql code.

get_parameters()

Parse sql lines containing the parameter definitions.

get_predict_execute_statement()

Returns the execute_statement for predicting.

get_predict_output_table_names()

Get the generated result table names in predict function.

get_predict_parameters()

Get PAL predict parameters.

get_score_execute_statement()

Returns the execute_statement for scoring.

get_score_output_table_names()

Get the generated result table names in score function.

get_score_parameters()

Get SAP HANA PAL score parameters.

is_fitted()

Checks if the model can be saved.

load_model(model)

Function to load the fitted model.

set_model_state(state)

Set the model state by state information.

set_scale_out([route_to, no_route_to, ...])

SAP HANA statement routing.

transform(data[, key, document, burn_in, ...])

Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(data, key=None, document=None)

Fit LDA model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key(non-index) column, and document defaults to that column.
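
A sketch of the key default described above, assuming df1 is indexed on its DOCUMENT_ID column (e.g. via the hana_ml DataFrame set_index helper); document then falls back to the only remaining column:

>>> # key inferred from the single-column index, document from the TEXT column
>>> lda.fit(data=df1.set_index('DOCUMENT_ID'))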

fit_transform(data, key=None, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.
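
A minimal sketch combining both steps on the df1 training data from the examples above; the returned DataFrame has the three columns just listed:

>>> lda = LatentDirichletAllocation(n_components=6, seed=1)
>>> doc_topic = lda.fit_transform(data=df1, key='DOCUMENT_ID', document='TEXT')
>>> doc_topic.collect()   # DOCUMENT_ID, TOPIC_ID, PROBABILITY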

add_attribute(attr_key, attr_val)

Function to add attribute.

Parameters
attr_keystr

The key.

attr_valstr

The value.

apply_with_hint(with_hint, apply_to_anonymous_block=True)

Apply with hint.

Parameters
with_hintstr

The hint clauses.

apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

consume_fit_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for fit.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

consume_predict_hdbprocedure(proc_name, in_tables=None, out_tables=None)

Return the generated consume hdbprocedure for predict.

Parameters
proc_namestr

The procedure name.

in_tableslist, optional

The list of input table names.

out_tableslist, optional

The list of output table names.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

Defaults to self.real_func.

pal_funcnameint or str, optional

PAL function name.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

disable_arg_check()

Disable argument check.

disable_convert_bigint()

Disable the bigint conversion.

Defaults to False.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

disable_with_hint()

Disable with hint.

enable_arg_check()

Enable argument check.

enable_convert_bigint()

Allows the conversion from bigint to double.

Defaults to True.

enable_hana_execution()

HANA execution will be enabled.

enable_no_inline(apply_to_anonymous_block=True)

Enable no inline.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to True.

enable_parallel_by_parameter_partitions(apply_to_anonymous_block=False)

Enable parallel by parameter partitions.

Parameters
apply_to_anonymous_blockbool, optional

If True, it will be applied to the anonymous block.

Defaults to False.

enable_workload_class(workload_class_name)

HANA WORKLOAD CLASS is applied for the statement execution.

Parameters
workload_class_namestr

The name of HANA WORKLOAD CLASS.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_fit_execute_statement()

Returns the execute_statement for training.

get_fit_output_table_names()

Get the generated result table names in fit function.

Returns
List of table names.
get_fit_parameters()

Get PAL fit parameters.

Returns
List of tuples, where each tuple describes a parameter like (name, value, type)