# hana_ml.algorithms.pal package

The PAL package consists of the following sections:

## hana_ml.algorithms.pal.abc_analysis

This module contains PAL wrappers for the abc_analysis algorithm.

The following function is available:

hana_ml.algorithms.pal.abc_analysis.abc_analysis(data, key=None, percent_A=None, percent_B=None, percent_C=None, revenue=None, thread_ratio=None)

Perform the abc_analysis to classify objects based on a particular measure. Group the inventories into three categories.

Parameters

dataDataFrame

Input data.

keystr, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set.

revenuestr, optional

Name of column for revenue (or profits).

If not given, the input dataframe must only have two columns.

Defaults to the first non-key column.

percent_Afloat

Interval for A class.

percent_Bfloat

Interval for B class.

percent_Cfloat

Interval for C class.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns
DataFrame

Returns a DataFrame containing the ABC class result of partitioning the data into three categories.

Examples

Data to analyze:

>>> df_train = cc.table('AA_DATA_TBL')
>>> df_train.collect()
ITEM     VALUE
0    item1    15.4
1    item2    200.4
2    item3    280.4
3    item4    100.9
4    item5    40.4
5    item6    25.6
6    item7    18.4
7    item8    10.5
8    item9    96.15
9    item10   9.4


Perform abc_analysis:

>>> res = abc_analysis(data=df_train, key='ITEM', thread_ratio=0.3,
...                    percent_A=0.7, percent_B=0.2, percent_C=0.1)
>>> res.collect()
ABC_CLASS   ITEM
0      A        item3
1      A        item2
2      A        item4
3      B        item9
4      B        item5
5      B        item6
6      C        item7
7      C        item1
8      C        item8
9      C        item10
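The grouping shown above can be reproduced with a small pure-Python sketch. It assumes a greedy rule: sort by value in descending order, fill class A until its running sum reaches percent_A of the total, fill class B the same way, and assign the rest to C. This rule is an assumption that happens to match the output above, not a description of the PAL implementation, and `abc_classify` is a hypothetical helper name.

```python
def abc_classify(items, percent_a, percent_b):
    """Greedy ABC split (assumed rule, not the PAL implementation)."""
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    total = sum(value for _, value in ranked)
    # Each class is closed once its own running sum reaches its share of the total.
    targets = [("A", percent_a * total), ("B", percent_b * total)]
    result, running, current = [], 0.0, 0
    for name, value in ranked:
        if current >= len(targets):
            result.append(("C", name))  # everything left over is class C
            continue
        label, _ = targets[current]
        result.append((label, name))
        running += value
        if running >= targets[current][1]:
            current += 1  # move on to the next class
            running = 0.0
    return result


# Same values as the AA_DATA_TBL example above.
data = [
    ("item1", 15.4), ("item2", 200.4), ("item3", 280.4), ("item4", 100.9),
    ("item5", 40.4), ("item6", 25.6), ("item7", 18.4), ("item8", 10.5),
    ("item9", 96.15), ("item10", 9.4),
]
classes = abc_classify(data, percent_a=0.7, percent_b=0.2)
for label, name in classes:
    print(label, name)
```

In this sketch percent_C is implicit: class C simply receives whatever remains after A and B are filled.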


## hana_ml.algorithms.pal.association

This module contains Python wrappers for PAL association algorithms.

The following classes are available:

class hana_ml.algorithms.pal.association.Apriori(min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.

Parameters
min_supportfloat

User-specified minimum support (actual value).

min_confidencefloat

User-specified minimum confidence (actual value).

relationalbool, optional

Whether or not to apply relational logic in Apriori algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 100.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 5.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent item mining.

Defaults to 1.0.

use_prefix_treebool, optional

Indicates whether or not to use prefix tree for saving memory.

Defaults to False.

lhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the left-hand-side of association rules.

rhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the right-hand-side of association rules.

lhs_complement_rhsbool, optional(deprecated)

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3,i4,...,i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = ['i1','i2'],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional(deprecated)

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{'no', 'single-row', 'multi-row'}, optional

Specify the way to export the Apriori model:

• 'no' : do not export the model,

• 'single-row' : export Apriori model in PMML in single row,

• 'multi-row' : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to 'no'.

Examples

Input data for association rule mining:

>>> df.collect()
CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3


Set up parameters for the Apriori algorithm:

>>> ap = Apriori(min_support=0.1,
...              min_confidence=0.3,
...              relational=False,
...              min_lift=1.1,
...              max_conseq=1,
...              max_len=5,
...              ubiquitous=1.0,
...              use_prefix_tree=False,
...              timeout=3600,
...              pmml_export='single-row')


Association rule mining using Apriori algorithm for the input data, and check the results:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000


Apriori algorithm set up using relational logic:

>>> apr = Apriori(min_support=0.1,
...               min_confidence=0.3,
...               relational=True,
...               min_lift=1.1,
...               max_conseq=1,
...               max_len=5,
...               ubiquitous=1.0,
...               use_prefix_tree=False,
...               timeout=3600,
...               pmml_export='single-row')


Again mining association rules using Apriori algorithm for the input data, and check the resulting tables:

>>> apr.fit(data=df)
>>> apr.antec_.head(5).collect()
RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2

>>> apr.conseq_.head(5).collect()
RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5

>>> apr.stats_.head(5).collect()
RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
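The SUPPORT, CONFIDENCE and LIFT columns above follow the standard definitions of these measures. The following sketch recomputes them in plain Python from the CUSTOMER/ITEM rows of the example; `rule_stats` is a hypothetical helper written for illustration and is not part of hana_ml.

```python
from collections import defaultdict

# (CUSTOMER, ITEM) rows from the example input above.
rows = [
    (2, "item2"), (2, "item3"), (3, "item1"), (3, "item2"), (3, "item4"),
    (4, "item1"), (4, "item3"), (5, "item2"), (5, "item3"), (6, "item1"),
    (6, "item3"), (0, "item1"), (0, "item2"), (0, "item5"), (1, "item2"),
    (1, "item4"), (7, "item1"), (7, "item2"), (7, "item3"), (7, "item5"),
    (8, "item1"), (8, "item2"), (8, "item3"),
]


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    baskets = defaultdict(set)
    for transaction, item in rows:
        baskets[transaction].add(item)
    n = len(baskets)
    lhs, rhs = set(antecedent), set(consequent)
    n_lhs = sum(1 for b in baskets.values() if lhs <= b)
    n_rhs = sum(1 for b in baskets.values() if rhs <= b)
    n_both = sum(1 for b in baskets.values() if lhs | rhs <= b)
    support = n_both / n              # share of transactions containing both sides
    confidence = n_both / n_lhs       # share of antecedent transactions that also contain the consequent
    lift = confidence / (n_rhs / n)   # confidence relative to the consequent's base rate
    return support, confidence, lift


print([round(x, 6) for x in rule_stats(rows, ["item5"], ["item2"])])
print([round(x, 6) for x in rule_stats(rows, ["item2", "item1"], ["item5"])])
```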

Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

• 1st column : antecedent(leading) items.

• 2nd column : consequent(dependent) items.

• 3rd column : support value.

• 4th column : confidence value.

• 5th column : lift value.

Available only when relational is False.

model_DataFrame

Apriori model trained from the input data, structured as follows:

• 1st column : model ID,

• 2nd column : model content, i.e. Apriori model in PMML format.

antec_DataFrame

Antecedent items of mined association rules, structured as follows:

• 1st column : association rule ID,

• 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

• 1st column : association rule ID,

• 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame

Statistics of the mined association rules, structured as follows:

• 1st column : rule ID,

• 2nd column : support value of the rule,

• 3rd column : confidence value of the rule,

• 4th column : lift value of the rule.

Available only when relational is True.
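To make the relationship between the single result table and the three relational tables concrete, the sketch below splits rule rows into antecedent, consequent and statistics rows keyed by a rule ID. It assumes that multi-item sides are joined with `&`, as in the example output above; `split_relational` is a hypothetical helper, not a hana_ml function.

```python
def split_relational(rules):
    """Split (antecedent, consequent, support, confidence, lift) rows into three tables."""
    antec, conseq, stats = [], [], []
    for rule_id, (antecedent, consequent, support, confidence, lift) in enumerate(rules):
        # One row per item, so a rule with several antecedent items yields several rows.
        antec.extend((rule_id, item) for item in antecedent.split("&"))
        conseq.extend((rule_id, item) for item in consequent.split("&"))
        stats.append((rule_id, support, confidence, lift))
    return antec, conseq, stats


# Rule rows copied from the non-relational example output.
rules = [
    ("item5", "item2", 0.222222, 1.0, 1.285714),
    ("item1", "item5", 0.222222, 0.333333, 1.5),
    ("item5", "item1", 0.222222, 1.0, 1.5),
    ("item4", "item2", 0.222222, 1.0, 1.285714),
    ("item2&item1", "item5", 0.222222, 0.5, 2.25),
]
antec, conseq, stats = split_relational(rules)
print(antec[:5])
print(conseq)
```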

Methods

| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, transaction, item, lhs_restrict, ...]) | Association rule mining from the input data using the Apriori algorithm. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse SQL lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the SQL code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data using the Apriori algorithm.

Parameters

dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item ID column.

Data type of item column can be INTEGER, VARCHAR or NVARCHAR.

Defaults to the last non-transaction column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
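The interplay of the four restriction parameters of fit can be sketched as a rule filter. This is one reading of the parameter descriptions above, not the PAL implementation: items in lhs_restrict may not appear in a consequent, items in rhs_restrict may not appear in an antecedent, and each complement flag extends the opposite restriction to all remaining items. `rule_allowed` and `all_items` are hypothetical names.

```python
def rule_allowed(antecedent, consequent, all_items,
                 lhs_restrict=(), rhs_restrict=(),
                 lhs_complement_rhs=False, rhs_complement_lhs=False):
    """Return True if a rule respects the side restrictions (assumed semantics)."""
    lhs_only = set(lhs_restrict)   # items allowed only on the left-hand side
    rhs_only = set(rhs_restrict)   # items allowed only on the right-hand side
    if lhs_complement_rhs:
        # Everything that is not in rhs_restrict becomes left-hand-side only.
        lhs_only |= set(all_items) - set(rhs_restrict)
    if rhs_complement_lhs:
        # Everything that is not in lhs_restrict becomes right-hand-side only.
        rhs_only |= set(all_items) - set(lhs_restrict)
    return not (lhs_only & set(consequent)) and not (rhs_only & set(antecedent))


items = [f"i{n}" for n in range(1, 101)]
restrict = dict(rhs_restrict=["i1", "i2"], lhs_complement_rhs=True)
print(rule_allowed(["i3"], ["i1"], items, **restrict))  # i3 on the left, i1 on the right
print(rule_allowed(["i3"], ["i4"], items, **restrict))  # i4 is left-hand-side only here
```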

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_parameters()

Parse SQL lines containing the parameter definitions. In the SQL code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This method converts that format into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
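The conversion described for get_parameters can be illustrated as follows. The sketch assumes each parameter arrives as a (name, integer value, double value, string value) row in which exactly one value is set; the parameter names and the type labels are illustrative assumptions, and `to_key_value` is a hypothetical helper rather than the hana_ml implementation.

```python
def to_key_value(rows):
    """Collapse (name, int, double, string) rows into (name, value, type) tuples."""
    type_names = ("integer", "double", "string")  # assumed labels for the three value slots
    params = []
    for name, *candidates in rows:
        for type_name, value in zip(type_names, candidates):
            if value is not None:  # the two unused slots are NULL (None)
                params.append((name, value, type_name))
                break
    return params


# Illustrative rows only; real parameter names depend on the PAL procedure.
rows = [
    ("EXAMPLE_INT", 5, None, None),
    ("EXAMPLE_DOUBLE", None, 0.1, None),
    ("EXAMPLE_STRING", None, None, "single-row"),
]
print(to_key_value(rows))
```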
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects synonyms, which still have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.association.AprioriLite(min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.

Parameters
min_supportfloat

User-specified minimum support (actual value).

min_confidencefloat

User-specified minimum confidence (actual value).

subsamplefloat, optional

Specify the sampling percentage for the input data. Set to 1 if you want to use the entire data.

recalculatebool, optional

If you sample the input data, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.

Defaults to True.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{'no', 'single-row', 'multi-row'}, optional

Specify the way to export the Apriori model:

• 'no' : do not export the model,

• 'single-row' : export Apriori model in PMML in single row,

• 'multi-row' : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to 'no'.

Examples

Input data for association rule mining using Apriori algorithm:

>>> df.collect()
CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3


Set up parameters for light Apriori algorithm, ingest the input data, and check the result table:

>>> apl = AprioriLite(min_support=0.1,
...                   min_confidence=0.3,
...                   subsample=1.0,
...                   recalculate=False,
...                   timeout=3600,
...                   pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
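Every rule in the result above has exactly one antecedent item and one consequent item. A plain-Python sketch of such pairwise rule mining is given below; it applies the standard support, confidence and lift definitions to the example rows and is meant as an illustration of the idea, not of PAL's implementation. `pair_rules` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import permutations

# (CUSTOMER, ITEM) rows from the example input above.
rows = [
    (2, "item2"), (2, "item3"), (3, "item1"), (3, "item2"), (3, "item4"),
    (4, "item1"), (4, "item3"), (5, "item2"), (5, "item3"), (6, "item1"),
    (6, "item3"), (0, "item1"), (0, "item2"), (0, "item5"), (1, "item2"),
    (1, "item4"), (7, "item1"), (7, "item2"), (7, "item3"), (7, "item5"),
    (8, "item1"), (8, "item2"), (8, "item3"),
]


def pair_rules(rows, min_support, min_confidence):
    """Mine single-item -> single-item rules that pass both thresholds."""
    baskets = defaultdict(set)
    for transaction, item in rows:
        baskets[transaction].add(item)
    n = len(baskets)
    count = defaultdict(int)       # transactions per single item
    pair_count = defaultdict(int)  # transactions per ordered item pair
    for basket in baskets.values():
        for item in basket:
            count[item] += 1
        for lhs, rhs in permutations(basket, 2):
            pair_count[(lhs, rhs)] += 1
    rules = {}
    for (lhs, rhs), both in pair_count.items():
        support = both / n
        confidence = both / count[lhs]
        if support >= min_support and confidence >= min_confidence:
            rules[(lhs, rhs)] = (support, confidence, confidence / (count[rhs] / n))
    return rules


rules = pair_rules(rows, min_support=0.1, min_confidence=0.3)
print([round(x, 6) for x in rules[("item5", "item3")]])
print([round(x, 6) for x in rules[("item1", "item2")]])
```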

Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

• 1st column : antecedent(leading) items,

• 2nd column : consequent(dependent) items,

• 3rd column : support value,

• 4th column : confidence value,

• 5th column : lift value.

Non-empty only when relational is False.

model_DataFrame

Apriori model trained from the input data, structured as follows:

• 1st column : model ID.

• 2nd column : model content, i.e. liteApriori model in PMML format.

Methods

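| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, transaction, item]) | Association rule mining based on the input data. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse SQL lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the SQL code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

fit(data, transaction=None, item=None)

Association rule mining based on the input data.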

Parameters

dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_parameters()

Parse SQL lines containing the parameter definitions. In the SQL code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This method converts that format into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)

get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects synonyms, which still have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.association.FPGrowth(min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.

Parameters
min_supportfloat, optional

User-specified minimum support, with valid range [0, 1].

Defaults to 0.

min_confidencefloat, optional

User-specified minimum confidence, with valid range [0, 1].

Defaults to 0.

relationalbool, optional

Whether or not to apply relational logic in FPGrowth algorithm.

If False, a single result table is produced; otherwise, the result table shall be split into three tables -- antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 10.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 10.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent item mining.

Defaults to 1.0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

Input data for association rule mining:

>>> df.collect()
TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5


Set up parameters:

>>> fpg = FPGrowth(min_support=0.2,
...                min_confidence=0.5,
...                relational=False,
...                min_lift=1.0,
...                max_conseq=1,
...                max_len=5,
...                ubiquitous=1.0,
...                timeout=3600)


Association rule mining using FPGrowth algorithm for the input data, and check the results:

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000


FPGrowth algorithm set up using relational logic:

>>> fpgr = FPGrowth(min_support=0.2,
...                 min_confidence=0.5,
...                 relational=True,
...                 min_lift=1.0,
...                 max_conseq=1,
...                 max_len=5,
...                 ubiquitous=1.0,
...                 timeout=3600)


Again mining association rules using FPGrowth algorithm for the input data, and check the resulting tables:

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3

>>> fpgr.conseq_.collect()
RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4

>>> fpgr.stats_.collect()
RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
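The idea of growing frequent patterns without generating candidate itemsets can be sketched with recursive database projection. The code below is a simplified illustration: it uses projected (conditional) transaction lists instead of a compressed FP-tree, so it shows the pattern-growth principle rather than PAL's implementation. `pattern_growth` is a hypothetical helper, and the transactions are the TRANS/ITEM rows of the example grouped by transaction.

```python
from collections import Counter


def pattern_growth(transactions, min_count, suffix=()):
    """Yield (itemset, count) for every itemset found in at least min_count transactions."""
    counts = Counter(item for t in transactions for item in t)
    for item in sorted(counts):
        if counts[item] < min_count:
            continue
        yield tuple(sorted(suffix + (item,))), counts[item]
        # Conditional database: transactions containing `item`, reduced to smaller
        # items so that each itemset is reached along exactly one path.
        projected = [[i for i in t if i < item] for t in transactions if item in t]
        yield from pattern_growth(projected, min_count, suffix + (item,))


# Transactions 1..10 from the example input above (items are unique per transaction).
transactions = [
    [1, 2], [2, 3, 4], [1, 3, 4, 5], [1, 4, 5], [1, 2],
    [1, 2, 3, 4], [1], [1, 2, 3], [1, 2, 3], [2, 3, 5],
]
# min_support=0.2 on 10 transactions corresponds to a minimum count of 2.
frequent = dict(pattern_growth(transactions, min_count=2))
for itemset in [(2, 3), (1, 2, 3), (1, 3, 4)]:
    print(itemset, frequent[itemset] / len(transactions))
```

The printed supports for (2, 3), (1, 2, 3) and (1, 3, 4) correspond to the SUPPORT values of the rules built from those itemsets in the tables above.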

Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

• 1st column : antecedent(leading) items,

• 2nd column : consequent(dependent) items,

• 3rd column : support value,

• 4th column : confidence value,

• 5th column : lift value.

Available only when relational is False.

antec_DataFrame

Antecedent items of mined association rules, structured as follows:

• 1st column : association rule ID,

• 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

• 1st column : association rule ID,

• 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame

Statistics of the mined association rules, structured as follows:

• 1st column : rule ID,

• 2nd column : support value of the rule,

• 3rd column : confidence value of the rule,

• 4th column : lift value of the rule.

Available only when relational is True.

Methods

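| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, transaction, item, lhs_restrict, ...]) | Association rule mining from the input data. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse SQL lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the SQL code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)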

Association rule mining from the input data.

Parameters

dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1,i2,...,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,..., i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_parameters()

Parse SQL lines containing the parameter definitions. In the SQL code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This method converts that format into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)

get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects synonyms, which still have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. The ROUTE_TO() statement hint has higher precedence than the WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.association.KORD(k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None, max_conseq=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

Parameters
kint, optional

The number of top rules to discover.

measurestr, optional

Specifies the measure used to define the priority of the association rules.

min_supportfloat, optional

User-specified minimum support value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_confidencefloat, optional

User-specified minimum confidence value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_coveragefloat, optional

User-specified minimum coverage value of association rule, with valid range [0, 1].

Defaults to the value of min_support if not provided.

min_measurefloat, optional

User-specified minimum measure value (for leverage or lift, which type depends on the setting of measure).

Defaults to 0 if not provided.

max_antecint, optional

Specifies the maximum number of antecedent items in generated association rules.

Defaults to 4.

epsilonfloat, optional

User-specified epsilon value for punishing length of rules.

Valid only when use_epsilon is True.

use_epsilonbool, optional

Specifies whether or not to use epsilon to punish the length of rules.

Defaults to False.

max_conseqint, optional

Specifies the maximum number of consequent items in generated association rules.

Should not be greater than 3.

New parameter added in SAP HANA Cloud.

Defaults to 1.

Examples

First let us have a look at the training data:

>>> df.head(10).collect()
CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1


Set up a KORD instance:

>>> krd =  KORD(k=5,
measure='lift',
min_support=0.1,
min_confidence=0.2,
epsilon=0.1,
use_epsilon=False)


Start k-optimal rule discovery process from the input transaction data, and check the results:

>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714
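
Because antec_ and conseq_ list one item per row, a rule with several antecedent items spans several rows that share a RULE_ID. The sketch below, which assumes the rows have already been collected into plain Python tuples (copied here from the outputs above), groups them back into readable rules. The helper name `group_by_rule` is hypothetical.

```python
from collections import defaultdict

# (RULE_ID, item) pairs copied from the antec_ and conseq_ outputs above.
antec_rows = [(0, "item2"), (1, "item1"), (2, "item2"),
              (2, "item1"), (3, "item5"), (4, "item2")]
conseq_rows = [(0, "item5"), (1, "item5"), (2, "item5"),
               (3, "item1"), (4, "item4")]


def group_by_rule(rows):
    """Collect the items of each rule ID into a list, preserving row order."""
    grouped = defaultdict(list)
    for rule_id, item in rows:
        grouped[rule_id].append(item)
    return grouped


antecedents = group_by_rule(antec_rows)
consequents = group_by_rule(conseq_rows)
rules = {rid: (antecedents[rid], consequents[rid]) for rid in sorted(antecedents)}

for rid, (lhs, rhs) in rules.items():
    print(f"rule {rid}: {{{', '.join(lhs)}}} -> {{{', '.join(rhs)}}}")
```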

Attributes
antec_DataFrame

Info of antecedent items for the mined association rules, structured as follows:

• 1st column : rule ID,

• 2nd column : antecedent items.

conseq_DataFrame

Info of consequent items for the mined association rules, structured as follows:

• 1st column : rule ID,

• 2nd column : consequent items.

stats_DataFrame
Some basic statistics for the mined association rules, structured as follows:
• 1st column : rule ID,

• 2nd column : support value of rules,

• 3rd column : confidence value of rules,

• 4th column : lift value of rules,

• 5th column : leverage value of rules,

• 6th column : measure value of rules.
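
The statistics columns can be related to one another through the textbook definitions of support, confidence, lift and leverage. The sketch below uses those standard definitions, and whether PAL computes them in exactly this way is an assumption. The transaction counts are not given in the source. They were chosen here only because they reproduce the values shown for rule 2 in the example above.

```python
def rule_measures(n_total, n_antecedent, n_consequent, n_both):
    """Textbook measures for a rule A -> C, computed from transaction counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    consequent_support = n_consequent / n_total
    lift = confidence / consequent_support
    leverage = support - (n_antecedent / n_total) * consequent_support
    return {"support": support, "confidence": confidence,
            "lift": lift, "leverage": leverage}


# Assumed counts: 9 transactions, 4 containing the antecedent,
# 2 containing the consequent, 2 containing both.
measures = rule_measures(n_total=9, n_antecedent=4, n_consequent=2, n_both=2)
print({name: round(value, 6) for name, value in measures.items()})
```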

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, transaction, item]): K-optimal rule discovery from input data, based on some user-specified measure.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse SQL lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the SQL code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, transaction=None, item=None)

K-optimal rule discovery from input data, based on some user-specified measure.

Parameters

dataDataFrame

Input data for k-optimal (association) rule discovery.

transactionstr, optional

Column name of transaction ID in the input data.

Defaults to name of the 1st column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-transaction column if not provided.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value of the parameter, and the remaining two are NULL. This format should be changed into simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. The ROUTE_TO() statement hint has higher precedence than the WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.association.SPM(min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.

Parameters
min_supportfloat

User-specified minimum support value.

relationalbool, optional

Whether or not to apply relational logic in sequential pattern mining.

If False, a single results table for frequent pattern mining is produced; otherwise the results table is split into two tables: one for mined patterns, and the other for statistics.

Defaults to False.

ubiquitousfloat, optional

Items whose support values are above this specified value will be ignored during the frequent item mining phase.

Defaults to 1.0.

min_lenint, optional

Minimum number of items in a transaction.

Defaults to 1.

max_lenint, optional

Maximum number of items in a transaction.

Defaults to 10.

min_len_outint, optional

Specifies the minimum number of items of the mined association rules in the result table.

Defaults to 1.

max_len_outint, optional

Specifies the maximum number of items of the mined association rules in the result table.

Defaults to 10.

calc_liftbool, optional

Whether or not to calculate lift values for all applicable cases.

If False, lift values are only calculated for the cases where the last transaction contains a single item.

Defaults to False.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

Firstly take a look at the input data df:

>>> df.collect()
CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert


Set up a SPM instance:

>>> sp = SPM(min_support=0.5,
relational=False,
ubiquitous=1.0,
max_len=10,
min_len=1,
calc_lift=True)


Start sequential pattern mining process from the input data, and check the results:

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
PATTERN   SUPPORT  CONFIDENCE      LIFT
0                       {Apple}  1.000000    0.000000  0.000000
1           {Apple},{Blueberry}  0.666667    0.666667  0.666667
2             {Apple},{Dessert}  1.000000    1.000000  1.000000
3             {Apple,Blueberry}  0.666667    0.000000  0.000000
4   {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                {Apple,Cherry}  0.666667    0.000000  0.000000
6      {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                   {Blueberry}  1.000000    0.000000  0.000000
8         {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                      {Cherry}  0.666667    0.000000  0.000000
10           {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                    {Dessert}  1.000000    0.000000  0.000000
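
The SUPPORT column can be checked by hand: a pattern is supported by a customer if its itemsets occur, in order, in distinct transactions of that customer. The sketch below rebuilds each customer's sequence from the input rows shown above and counts the supporting customers. It is an illustration of the general idea, not PAL's implementation, and the helper names are hypothetical.

```python
from collections import defaultdict

# (CUSTID, TRANSID, ITEMS) rows copied from the input data above.
rows = [
    ("A", 1, "Apple"), ("A", 1, "Blueberry"), ("A", 2, "Apple"),
    ("A", 2, "Cherry"), ("A", 3, "Dessert"), ("B", 1, "Cherry"),
    ("B", 1, "Blueberry"), ("B", 1, "Apple"), ("B", 2, "Dessert"),
    ("B", 3, "Blueberry"), ("C", 1, "Apple"), ("C", 2, "Blueberry"),
    ("C", 3, "Dessert"),
]


def build_sequences(rows):
    """Map each customer to a list of item sets ordered by transaction ID."""
    baskets = defaultdict(lambda: defaultdict(set))
    for customer, transaction, item in rows:
        baskets[customer][transaction].add(item)
    return {c: [t[k] for k in sorted(t)] for c, t in baskets.items()}


def contains(sequence, pattern):
    """Greedy check that the pattern's itemsets appear in order in the sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a later transaction
    return True


def support(sequences, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    hits = sum(contains(seq, pattern) for seq in sequences.values())
    return hits / len(sequences)


sequences = build_sequences(rows)
print(round(support(sequences, [{"Apple"}, {"Blueberry"}]), 6))  # 0.666667
```

Customer A does not support `{Apple},{Blueberry}` because Blueberry never appears in a transaction after one containing Apple, which is why the support is 2/3 rather than 1.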

Attributes
result_DataFrame

The overall frequent pattern mining result, structured as follows:

• 1st column : mined frequent patterns,

• 2nd column : support values,

• 3rd column : confidence values,

• 4th column : lift values.

Available only when relational is False.

pattern_DataFrame
Result for mined frequent patterns, structured as follows:
• 1st column : pattern ID,

• 2nd column : transaction ID,

• 3rd column : items.

stats_DataFrame
Statistics for frequent pattern mining, structured as follows:
• 1st column : pattern ID,

• 2nd column : support values,

• 3rd column : confidence values,

• 4th column : lift values.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, customer, transaction, item, ...]): Sequential pattern mining from input data.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse SQL lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the SQL code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value of the parameter, and the remaining two are NULL. This format should be changed into simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. The ROUTE_TO() statement hint has higher precedence than the WORKLOAD_CLASS() statement hint.

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Sequential pattern mining from input data.

Parameters

dataDataFrame

Input data for sequential pattern mining.

customerstr, optional

Column name of customer ID in the input data.

Defaults to name of the 1st column if not provided.

transactionstr, optional

Column name of transaction ID in the input data.

For sequential pattern mining specifically, the values of this column must also reflect the order of occurrence.

Defaults to name of the 1st non-customer column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-customer, non-transaction column if not provided.

item_restrictlist of int or str, optional

Specifies the list of items allowed in the mined association rule.

min_gapint, optional

Specifies the minimum time difference between consecutive transactions in a sequence.

## hana_ml.algorithms.pal.clustering¶

This module contains Python wrappers for PAL clustering algorithms.

The following classes are available:

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette refers to a method used to validate the clustering of data. SAP HANA PAL provides a light version of silhouette called slight silhouette. SlightSilhouette is a wrapper for this light version of the silhouette method.

Parameters

dataDataFrame

DataFrame containing the data.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-label columns.

labelstr, optional

Name of the ID column.

If label is not provided, it defaults to last column of data.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center. 'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is minkowski.

Defaults to 3.0.
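
For orientation, the distance options other than 'cosine' can be written out with their textbook definitions as below. This is a sketch for illustration only. Whether PAL's internal computation matches these formulas exactly is an assumption, and the function names are not part of hana_ml.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, power=3.0):
    """Generalised distance; power=1 gives Manhattan, power=2 gives Euclidean."""
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


item, center = (0.0, 0.0), (3.0, 4.0)
print(manhattan(item, center), euclidean(item, center), chebyshev(item, center))
```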

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

• 'no': No normalization will be applied.

• 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.

• 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
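
The two normalization formulas above translate directly into code. The sketch below is a plain-Python illustration with hypothetical function names. It assumes S is non-zero for 'l1_norm' and that max differs from min for 'min_max', and how PAL treats those degenerate cases is not stated in the source.

```python
def l1_normalize(point):
    """'l1_norm': divide every coordinate of a point by S = |x1| + ... + |xn|."""
    s = sum(abs(x) for x in point)  # assumed to be non-zero
    return [x / s for x in point]


def min_max_normalize(column):
    """'min_max': rescale a column so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)  # assumed to differ
    return [(value - lo) / (hi - lo) for value in column]


print(l1_normalize([1.0, -3.0]))          # [0.25, -0.75]
print(min_max_normalize([0.5, 1.5, 1.0]))  # [0.0, 1.0, 0.5]
```

Note that 'l1_norm' works row-wise (per point) while 'min_max' works column-wise (per feature).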

thread_numberint, optional

Defaults to 1.

categorical_variablestr or a list of str, optional

Indicates which columns should be treated as categorical variables even though their data type is INTEGER.

By default, VARCHAR or NVARCHAR is a category variable, and INTEGER or DOUBLE is a continuous variable.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

Returns
DataFrame

A DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
V000 V001 V002 CLUSTER
0    0.5    A  0.5       0
1    1.5    A  0.5       0
2    1.5    A  1.5       0
3    0.5    A  1.5       0
4    1.1    B  1.2       0
5    0.5    B 15.5       1
6    1.5    B 15.5       1
7    1.5    B 16.5       1
8    0.5    B 16.5       1
9    1.2    C 16.1       1
10  15.5    C 15.5       2
11  16.5    C 15.5       2
12  16.5    C 16.5       2
13  15.5    C 16.5       2
14  15.6    D 16.2       2
15  15.5    D  0.5       3
16  16.5    D  0.5       3
17  16.5    D  1.5       3
18  15.5    D  1.5       3
19  15.7    A  1.6       3


Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")


Result:

>>> res.collect()
VALIDATE_VALUE
0      0.9385944

class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

Parameters
affinity{'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}

Ways to compute the distance between two points.

n_clustersint

Number of clusters.

• 0: does not adjust Affinity Propagation cluster result.

• Non-zero int: If the number of clusters found by Affinity Propagation is larger than n_clusters, PAL merges the result so that the number of clusters equals n_clusters.

No default value as it is mandatory.

max_iterint, optional

Maximum number of iterations.

Defaults to 500.

convergence_iterint, optional

The algorithm ends when the clusters remain unchanged for the specified number of iterations.

Defaults to 100.

dampingfloat, optional

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.
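
In the usual formulation of Affinity Propagation, damping blends each newly computed message with its previous value, so a larger damping factor means slower but more stable updates. The sketch below shows that standard rule. Whether PAL applies exactly this form is an assumption, and the function names are hypothetical.

```python
def damped_update(old, proposed, damping):
    """Blend the previous message value with the newly proposed one."""
    if not 0.0 < damping < 1.0:
        raise ValueError("damping must lie strictly between 0 and 1")
    return damping * old + (1.0 - damping) * proposed


def steps_toward(target, damping, n_steps):
    """Start at 0 and repeatedly apply the damped update towards a fixed target."""
    value = 0.0
    for _ in range(n_steps):
        value = damped_update(value, target, damping)
    return value


# With a higher damping factor the value approaches the target more slowly.
print(round(steps_toward(1.0, 0.9, 10), 4), round(steps_toward(1.0, 0.5, 10), 4))
```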

preferencefloat, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratiofloat, optional

Selects a portion (seed_ratio * data_number) of the input data as seeds, where data_number is the number of rows in the input data.

Value range: (0,1].

If seed_ratio is 1, all the input data will be the seed.

Defaults to 1.

timesint, optional

The sampling times. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_powerint, optional

The power of the Minkowski method. Only valid when affinity is 'minkowski'.

Defaults to 3.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
2    3     0.10     0.11
3    4     0.11     0.11
4    5     0.12     0.11
5    6     0.11     0.12
6    7     0.12     0.12
7    8     0.12     0.13
8    9     0.13     0.12
9   10     0.13     0.13
10  11     0.13     0.14
11  12     0.14     0.13
12  13    10.10    10.10
13  14    10.11    10.10
14  15    10.10    10.11
15  16    10.11    10.11
16  17    10.11    10.12
17  18    10.12    10.11
18  19    10.12    10.12
19  20    10.12    10.13
20  21    10.13    10.12
21  22    10.13    10.13
22  23    10.13    10.14
23  24    10.14    10.13


Create an AffinityPropagation instance:

>>> ap = AffinityPropagation(
affinity='euclidean',
n_clusters=0,
max_iter=500,
convergence_iter=100,
damping=0.9,
preference=0.5,
seed_ratio=None,
times=None,
minkowski_power=None)


Perform fit on the given data:

>>> ap.fit(data = df, key='ID')


Expected output:

>>> ap.labels_.collect()
ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1

Attributes
labels_DataFrame

Label assigned to each sample, structured as follows:

• ID, record ID.

• CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features]): Fit the model when given the training dataset.

• fit_predict(data[, key, features]): Fit with the dataset and return labels.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse SQL lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the SQL code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None)

Fit the model when given the training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

fit_predict(data, key=None, features=None)

Fit with the dataset and return labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Labels of each point.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse the SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value of the parameter, and the remaining two are NULL. This format should be changed into simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. The ROUTE_TO() statement hint has higher precedence than the WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerate approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered as its own cluster. The algorithm then iteratively and greedily merges the two clusters with the smallest dissimilarity to form a larger cluster.
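
The bottom-up strategy described above can be sketched in a few lines of plain Python. The example below is a naive illustration that uses squared Euclidean distance with nearest-neighbor (single) linkage. It is not PAL's implementation, and the function name `single_linkage` is hypothetical.

```python
from itertools import combinations


def single_linkage(points, n_clusters):
    """Greedy bottom-up merging until n_clusters clusters of point indices remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def linkage(c1, c2):
        # Nearest neighbor: the smallest distance between any two members.
        return min(sq_dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest dissimilarity.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


points = [(0.5, 0.5), (1.5, 0.5), (15.5, 15.5), (16.5, 15.5)]
print(single_linkage(points, 2))  # [[0, 1], [2, 3]]
```

Recomputing every pairwise linkage in each round keeps the sketch short at the cost of efficiency, which is acceptable only for small inputs.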

Parameters
n_clustersint, optional

Number of clusters after agglomerate hierarchical clustering algorithm. Value range: between 1 and the initial number of input data.

Defaults to 1.

affinity{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower'}, optional

Ways to compute the distance between two points.

Note

• (1) For jaccard distance, non-zero input data will be treated as 1, and zero input data will be treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10)

• (2) Only gower distance supports category attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.

Defaults to 'squared euclidean'.
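
The Jaccard formula in note (1) can be illustrated as follows, with M11 counting positions where both inputs are non-zero and M10 and M01 counting positions where only one of them is. This is a sketch with a hypothetical function name. Returning 0.0 when both inputs are all zeros is an assumption, since the source does not say how that case is handled.

```python
def jaccard_distance(a, b):
    """(M01 + M10) / (M11 + M01 + M10) after treating non-zero values as 1."""
    m11 = m10 = m01 = 0
    for x, y in zip(a, b):
        x_on, y_on = x != 0, y != 0
        if x_on and y_on:
            m11 += 1
        elif x_on:
            m10 += 1
        elif y_on:
            m01 += 1
    denominator = m11 + m01 + m10
    if denominator == 0:
        return 0.0  # assumption: two all-zero inputs are treated as identical
    return (m01 + m10) / denominator


print(jaccard_distance([1, 0, 2, 0], [0, 0, 3, 5]))
```

In this example M11 = 1, M10 = 1 and M01 = 1, so the distance is 2/3.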

linkage{ 'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional

Linkage type between two clusters.

• 'nearest neighbor' : single linkage.

• 'furthest neighbor' : complete linkage.

• 'group average' : UPGMA.

• 'weighted average' : WPGMA.

• 'centroid clustering'.

• 'median clustering'.

• 'ward'.

Defaults to 'centroid clustering'.

Note

For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimensionfloat, optional

Power (order) of the Minkowski distance. The value must be no less than 1.

Only valid when affinity is 'minkowski'.

Defaults to 3.
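As a reminder of what the distance dimension controls, here is a minimal sketch of the Minkowski distance. The function name `minkowski` and its `power` argument are hypothetical stand-ins for illustration.

```python
def minkowski(p, q, power=3.0):
    """Minkowski distance of the given power between two points."""
    if power < 1:
        raise ValueError("power must be no less than 1")
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)


d1 = minkowski((0, 0), (3, 4), power=1)  # same as Manhattan distance
d2 = minkowski((0, 0), (3, 4), power=2)  # same as Euclidean distance
d3 = minkowski((0, 0), (3, 4), power=3)  # the documented default power
print(d1, d2, d3)
```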

normalizationstr, optional

Specifies the type of normalization applied.

• 'no': No normalization

• 'z-score': Z-score standardization

• 'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].

• 'min-max': Standard min-max normalization, transforming to new range [0, 1].

Defaults to 'no'.
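The three normalization types can be sketched per column as follows. The function names are hypothetical, the z-score variant assumes the population standard deviation (the text above does not say which one PAL uses), and constant columns are not handled.

```python
from statistics import mean, pstdev


def z_score(col):
    """Z-score standardization (assumed: population standard deviation)."""
    mu, sigma = mean(col), pstdev(col)
    return [(x - mu) / sigma for x in col]


def min_max(col):
    """Standard min-max normalization to the range [0, 1]."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]


def zero_centred_min_max(col):
    """Zero-centred min-max normalization to the range [-1, 1]."""
    return [2 * v - 1 for v in min_max(col)]


col = [1.0, 2.0, 3.0, 4.0, 5.0]
z = z_score(col)
mm = min_max(col)
zc = zero_centred_min_max(col)
print(z, mm, zc)
```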

category_weightsfloat, optional

Represents the weight of category columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
POINT   X1    X2      X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1


Create an AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
...     n_clusters=4,
...     affinity='gower',
...     distance_dimension=3,
...     normalization='no',
...     category_weights=0.1)


Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])


Expected output:

>>> hc.combine_process_.collect().head(3)
STAGE    LEFT_POINT   RIGHT_POINT    DISTANCE
0    1        18           19             0.0187
1    2        13           14             0.0250
2    3        7            9              0.0437

>>> hc.labels_.collect().head(3)
POINT    CLUSTER_ID
0     0        1
1     1        1
2     2        1

Attributes
combine_process_DataFrame

Structured as follows:

• 1st column: int, STAGE, cluster stage.

• 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named by its row number in the input data table. After combining, the new cluster is named after the left one.

• 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name. The other cluster to be combined in the same combine stage, named by its row number in the input data table.

• 4th column: float, DISTANCE. Distance between the two combined clusters.
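The naming rule for merged clusters (the new cluster keeps the left name) means the merge stages can be replayed to see which cluster each point belongs to after a given stage. The sketch below does this for the three stages shown in the expected output above; `replay_merges` is a hypothetical helper working on plain tuples, not on a HANA DataFrame.

```python
def replay_merges(ids, stages, n_stages):
    """Apply the first n_stages merges; a merged cluster keeps the left name."""
    owner = {i: i for i in ids}  # point ID -> name of its current cluster
    for left, right in stages[:n_stages]:
        for point, cluster in owner.items():
            if cluster == right:
                owner[point] = left  # only values change, keys stay fixed
    return owner


# (LEFT_POINT, RIGHT_POINT) pairs of stages 1 to 3 from the example output
stages = [(18, 19), (13, 14), (7, 9)]
owner = replay_merges(range(20), stages, 3)
print(owner[19], owner[14], owner[9])
```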

labels_DataFrame

Label assigned to each sample, structured as follows:

• 1st column: ID, record ID.

• 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features, categorical_variable]): Fit the model when given the training dataset.

• fit_predict(data[, key, features, ...]): Fit with the dataset and return the labels.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse sql lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the sql code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, VARCHAR and NVARCHAR columns are treated as categorical, and INTEGER and DOUBLE columns are treated as continuous.

Defaults to None.

fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the features columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, VARCHAR and NVARCHAR columns are treated as categorical, and INTEGER and DOUBLE columns are treated as continuous.

Defaults to None.

Returns
DataFrame

Label assigned to each point.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
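The conversion from the four-array layout to key-value storage can be illustrated with a small sketch. Everything here is an assumption made for illustration: the row layout (name, integer value, double value, string value), the parameter names, the type labels, and the choice to key the result by parameter name. It is not the library's actual parsing code.

```python
def to_key_value(rows):
    """Turn (name, int, double, string) rows into {name: [(name, value, type)]}.

    Assumed layout: exactly one of the three value slots is not None.
    """
    params = {}
    for name, int_val, double_val, str_val in rows:
        slots = ((int_val, "int"), (double_val, "double"), (str_val, "string"))
        for value, type_name in slots:
            if value is not None:
                params.setdefault(name, []).append((name, value, type_name))
                break
    return params


# Hypothetical parameter rows; the names are placeholders, not PAL names.
rows = [
    ("PARAM_INT", 4, None, None),
    ("PARAM_DOUBLE", None, 0.2, None),
    ("PARAM_STR", None, None, "abc"),
]
params = to_key_value(rows)
print(params)
```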
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Nevertheless, it only detects the synonyms that have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.
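The density idea behind DBSCAN can be sketched with the two parameters from the class signature, minpts and eps: a point is treated as a core point when enough points lie within its eps-neighbourhood. The sketch below is a simplified illustration, not the PAL algorithm. The helper names are hypothetical, Manhattan distance is used only as an example metric, and counting the point itself as a neighbour is an assumed convention.

```python
def manhattan(p, q):
    """Manhattan distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def core_points(points, eps, minpts, dist=manhattan):
    """Indices of points with at least minpts neighbours within eps.

    Assumption: each point counts as its own neighbour.
    """
    cores = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        if neighbours >= minpts:
            cores.append(i)
    return cores


points = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.11), (0.11, 0.11), (4.10, 4.10)]
cores = core_points(points, eps=0.05, minpts=3)
print(cores)
```

The isolated point at (4.10, 4.10) has no close neighbours, which is the kind of point that ends up labelled as noise.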

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps must be provided together; if they are not, both parameters are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps must be provided together; if they are not, both parameters are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When 'minkowski' is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.

Defaults to 3.

categorical_variablestr or a list of str, optional

Specifies column(s) in the data that should be treated as categorical.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A


Create a DBSCAN instance:

>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')


Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')


Expected output:

>>> dbscan.labels_.collect()
ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1

Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features, ...]): Fit the DBSCAN model when given the training dataset.

• fit_predict(data[, key, features, ...]): Fit with the dataset and return the labels.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse sql lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the sql code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• predict(data[, key, features]): Assign clusters to data based on a fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the DBSCAN model when given the training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.
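For reference, the Levenshtein distance used for string variables counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the classic dynamic-programming computation follows; the function name is hypothetical, and how PAL converts this distance into a similarity is not shown here.

```python
def levenshtein(s, t):
    """Edit distance computed with the two-row dynamic programme."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```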

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.

Defaults to None.

Returns
A fitted object of class "DBSCAN".
fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit with the dataset and return the labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr or a list of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.

Defaults to 1 for variables not specified.

Returns
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().

Parameters

dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

• Data point ID, with name and type taken from the input ID column.

• CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

• DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Nevertheless, it only detects the synonyms that have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps must be provided together; if they are not, both parameters are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps must be provided together; if they are not, both parameters are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When 'minkowski' is chosen for metric, this parameter controls the value of power.

Only applicable when metric is 'minkowski'.

Defaults to 3.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
"ID" INTEGER,
"POINT" ST_GEOMETRY);


Then, input dataframe df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")


Create a GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')


Perform fit on the given data:

>>> geo_dbscan.fit(data = df, key='ID')


Expected output:

>>> geo_dbscan.labels_.collect()
ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28   29  -1
29   30  -1

>>> geo_dbscan.model_.collect()
ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...


Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')


Expected output:

>>> result.collect()
ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28    29  -1
29    30  -1

Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features]): Fit the Geometry DBSCAN model when given the training dataset.

• fit_predict(data[, key, features]): Fit with the dataset and return the labels.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse sql lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the sql code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters

dataDataFrame

DataFrame containing the data for applying geometry DBSCAN.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr, optional

Name of the column for storing geometry points.

If not provided, it defaults to the first non-key column.

fit_predict(data, key=None, features=None)

Fit with the dataset and return the labels.

Parameters

dataDataFrame

DataFrame containing the data, structured as follows.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr, optional

Name of the column for storing 2-D geometry points.

If not provided, it defaults to the first non-key column.

Returns
DataFrame

Label assigned to each sample.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Nevertheless, it only detects the synonyms that have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.clustering.KMeans(n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False, use_fast_library=None, use_float=None)

Bases: hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters
n_clustersint, optional

Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.

n_clusters_minint, optional

Cluster range minimum.

n_clusters_maxint, optional

Cluster range maximum.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

• 'first_k': First k observations.

• 'replace': Random with replacement.

• 'no_replace': Random without replacement.

• 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.
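The first three init options can be sketched with the standard library as shown below. The function `initial_centers` and its `seed` argument are hypothetical and only illustrate the difference between the selection strategies; the 'patent' method is not reproduced here.

```python
import random


def initial_centers(points, k, init="first_k", seed=None):
    """Pick k initial centers from a list of points."""
    rng = random.Random(seed)
    if init == "first_k":
        return list(points[:k])          # first k observations
    if init == "replace":
        return rng.choices(points, k=k)  # random, duplicates possible
    if init == "no_replace":
        return rng.sample(points, k)     # random, no observation reused
    raise ValueError("unsupported init in this sketch: " + init)


points = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (1.1, 1.2)]
first = initial_centers(points, 2, "first_k")
with_repl = initial_centers(points, 3, "replace", seed=0)
without_repl = initial_centers(points, 3, "no_replace", seed=0)
print(first, with_repl, without_repl)
```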

max_iterint, optional

Max iterations.

Defaults to 100.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

• 'no': No normalization will be applied.

• 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.

• 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
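The difference between the two normalization types is that 'l1_norm' works per point (row), while 'min_max' works per column. A small sketch of both formulas follows; the function names are hypothetical, and columns with identical minimum and maximum are not handled.

```python
def l1_norm(point):
    """Divide each coordinate by S = |x1| + |x2| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    return [x / s for x in point]


def min_max(rows):
    """Rescale each column C with C[i] = (C[i] - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


rows = [[1.0, -3.0, 4.0], [2.0, 1.0, 8.0], [5.0, 5.0, 6.0]]
by_point = [l1_norm(r) for r in rows]
by_column = min_max(rows)
print(by_point[0], by_column)
```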

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

tolfloat, optional

Convergence threshold for exiting iterations.

Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode{'auto', 'optimize-speed', 'optimize-space'}, optional

Indicates the memory mode that the algorithm uses.

• 'auto': Chosen by algorithm.

• 'optimize-speed': Prioritizes speed.

• 'optimize-space': Prioritizes memory.

Only valid when accelerated is True.

Defaults to 'auto'.

acceleratedbool, optional

Indicates whether to use technology like cache to accelerate the calculation process:

• If True, the calculation process will be accelerated.

• If False, the calculation process will not be accelerated.

Defaults to False.

use_fast_librarybool, optional

Uses vectorized accelerated operations when set to True. Not valid when accelerated is True.

Defaults to False.

use_floatbool, optional

Selects the floating-point type used in computation:

• False: double

• True: float

Only valid when use_fast_library is True. Not valid when accelerated is True.

Defaults to True.

Examples

Input dataframe df for K-Means:

>>> df.collect()
ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6


Create a KMeans instance:

>>> km = clustering.KMeans(n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='euclidean',
...                        category_weights=0.5)


Perform fit_predict:

>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679


Input dataframe df for Accelerated K-Means:

>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1


Create an accelerated KMeans instance:

>>> akm = clustering.KMeans(init='first_k',
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)


Perform fit_predict:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717

Attributes
labels_DataFrame

Label assigned to each sample.

cluster_centers_DataFrame

Coordinates of cluster centers.

model_DataFrame

Model content.

statistics_DataFrame

Statistic value.

Methods

| Method | Description |
|---|---|
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, key, features, categorical_variable]) | Fit the model when given training dataset. |
| fit_predict(data[, key, features, ...]) | Fit with the dataset and return the labels. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse sql lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the sql code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| predict(data[, key, features]) | Assign clusters to data based on a fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
A fitted object of class "KMeans".
fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters

dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

• Data point ID, with name and type taken from the input ID column.

• CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

• DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
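
Conceptually, the CLUSTER_ID and DISTANCE columns correspond to a nearest-center lookup. The sketch below reproduces that idea locally for numeric features with Euclidean distance. It is an illustration in plain Python, not the PAL implementation, and the function name and sample centers are made up for the example.

```python
import math


def assign_clusters(points, centers):
    """Map each (id, features) pair to (id, cluster_id, distance)."""
    results = []
    for point_id, features in points:
        # Distance from this point to every center.
        distances = [math.dist(features, center) for center in centers]
        # Index of the closest center plays the role of CLUSTER_ID.
        cluster_id = min(range(len(centers)), key=distances.__getitem__)
        results.append((point_id, cluster_id, distances[cluster_id]))
    return results


centers = [(1.0, 1.0), (16.0, 16.0)]            # hypothetical cluster centers
points = [(0, (0.5, 0.5)), (1, (15.5, 16.5))]   # (ID, feature values)
for row in assign_clusters(points, centers):
    print(row)
```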

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
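
The conversion described above, from four parallel arrays to a flat key-value form, can be sketched as follows. This is a minimal illustration rather than the library's parser: the function name, the parameter names, and the type labels are made up, and it returns a plain list of tuples instead of the dict mentioned under Returns.

```python
def collapse_parameters(names, int_values, double_values, string_values):
    """Turn four parallel arrays into a list of (name, value, type) tuples."""
    params = []
    for name, int_v, dbl_v, str_v in zip(names, int_values,
                                         double_values, string_values):
        # Exactly one of the three value slots is expected to be set;
        # the other two are None (NULL in the SQL code).
        if int_v is not None:
            params.append((name, int_v, 'integer'))
        elif dbl_v is not None:
            params.append((name, dbl_v, 'float'))
        elif str_v is not None:
            params.append((name, str_v, 'string'))
    return params


# Hypothetical parameter names, used only to show the shape of the data.
names = ['PARAM_INT', 'PARAM_DOUBLE', 'PARAM_STRING']
ints = [4, None, None]
doubles = [None, 1.0e-6, None]
strings = [None, None, 'euclidean']
print(collapse_parameters(names, ints, doubles, strings))
```
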
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
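
A minimal sketch of this kind of extraction, using a regular expression, is shown below. It only handles the simple form of CALL statement used in the example above and returns the bare name without quotes. The function name and the argument list in the sample statement are hypothetical, and the library's own parsing may differ.

```python
import re


def extract_procedure_name(sql):
    """Return the name that follows CALL in a SQL statement, or None."""
    match = re.search(r'CALL\s+"?([^"\s(]+)"?\s*\(', sql, flags=re.IGNORECASE)
    return match.group(1) if match else None


statement = 'CALL "SYS_AFL.PAL_RANDOM_FORREST" (:data, :params, out_model)'
print(extract_procedure_name(statement))
```
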
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.
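
To make the relationship between these options and HANA statement hints more concrete, the sketch below assembles a WITH HINT clause from the same keyword arguments. It is an assumption-laden illustration: the helper name is hypothetical, and the exact quoting and formatting that hana_ml generates internally may differ.

```python
def build_routing_hints(route_to=None, no_route_to=None, route_by=None,
                        route_by_cardinality=None, data_transfer_cost=None,
                        route_optimization_level=None, workload_class=None):
    """Assemble a WITH HINT clause from the routing options (sketch only)."""

    def joined(value):
        # Accept either a single value or a list of values.
        if isinstance(value, (list, tuple)):
            return ', '.join(str(item) for item in value)
        return str(value)

    hints = []
    if route_to is not None:
        hints.append('ROUTE_TO({})'.format(joined(route_to)))
    if no_route_to is not None:
        hints.append('NO_ROUTE_TO({})'.format(joined(no_route_to)))
    if route_by is not None:
        hints.append('ROUTE_BY({})'.format(joined(route_by)))
    if route_by_cardinality is not None:
        hints.append('ROUTE_BY_CARDINALITY({})'.format(
            joined(route_by_cardinality)))
    if data_transfer_cost is not None:
        # 0 is a meaningful value here, so test against None explicitly.
        hints.append('DATA_TRANSFER_COST({})'.format(data_transfer_cost))
    if route_optimization_level is not None:
        hints.append('ROUTE_OPTIMIZATION_LEVEL({})'.format(
            route_optimization_level.upper()))
    if workload_class is not None:
        hints.append('WORKLOAD_CLASS("{}")'.format(workload_class))
    return 'WITH HINT({})'.format(', '.join(hints)) if hints else ''


print(build_routing_hints(route_to='2', data_transfer_cost=0,
                          route_optimization_level='minimal'))
```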

class hana_ml.algorithms.pal.clustering.KMedians(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.
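
As a small illustration of the center update, the sketch below computes the per-feature median of one cluster's numeric points. It covers numeric features only and is not the PAL implementation; the function name is made up. The five points are the numeric columns of rows 0 to 4 of the example data in the Examples section, and the result agrees with the V000 and V002 values reported for cluster 0 there.

```python
from statistics import median


def median_center(points):
    """Per-feature median of a cluster's numeric points."""
    # zip(*points) regroups the values feature by feature.
    return tuple(median(column) for column in zip(*points))


cluster = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (1.1, 1.2)]
print(median_center(cluster))  # (1.1, 1.2)
```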

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

• 'first_k': First k observations.

• 'replace': Random with replacement.

• 'no_replace': Random without replacement.

• 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

• 'no': No, normalization will not be applied.

• 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

• 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
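
The two normalization formulas can be written out directly. The sketch below is a plain-Python illustration with made-up function names. It assumes the point is not all zeros and the column is not constant, since either case would cause a division by zero in these formulas.

```python
def l1_normalize(point):
    """Divide every coordinate by S = |x1| + |x2| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    return [x / s for x in point]


def min_max_normalize(column):
    """Rescale a column to [0, 1] via (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


print(l1_normalize([1.0, -3.0, 4.0]))       # [0.125, -0.375, 0.5]
print(min_max_normalize([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```

Note that 'l1_norm' works row by row (per point), while 'min_max' works column by column (per feature).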

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6


Creating a KMedians instance:

>>> kmedians = KMedians(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean')


Performing fit() on given dataframe:

>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1


Performing fit_predict() on given dataframe:

>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107

Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

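| Method | Description |
|---|---|
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, key, features, categorical_variable]) | Perform clustering on input dataset. |
| fit_predict(data[, key, features, ...]) | Perform clustering algorithm and return labels. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse sql lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the sql code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |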

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

fit(data, key=None, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters

dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters

dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

• ID column, with the same name and type as data's ID column.

• CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

• DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids to calculate cluster centers. K-Medoids is more robust to noise and outliers.
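
A medoid is an actual member of the cluster, namely the one with the smallest total distance to all other members. The sketch below illustrates that definition for numeric features with Euclidean distance. It is not the PAL implementation, and the function name and sample data are made up.

```python
import math


def medoid(points):
    """Return the member with the smallest total distance to the others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Four nearby points plus one far outlier at x = 50.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (50.0, 0.0)]
print(medoid(cluster))  # (2.0, 0.0)
```

In this sample the mean of the x values is 11.2, pulled towards the outlier, whereas the medoid stays at 2.0 among the bulk of the points. This is the sense in which medoids are less sensitive to outliers.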

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

• 'first_k': First k observations.

• 'replace': Random with replacement.

• 'no_replace': Random without replacement.

• 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.
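
The first three init options can be illustrated with the standard library. The sketch below is a hypothetical helper, not part of hana_ml. It does not attempt the 'patent' method, and PAL's own random selection will not match Python's random module.

```python
import random


def initial_centers(points, k, init='first_k', seed=None):
    """Pick k initial centers from points according to the init option."""
    rng = random.Random(seed)
    if init == 'first_k':
        # First k observations, in input order.
        return list(points[:k])
    if init == 'replace':
        # Random with replacement: the same point may be drawn twice.
        return rng.choices(points, k=k)
    if init == 'no_replace':
        # Random without replacement: k distinct positions.
        return rng.sample(points, k)
    raise ValueError('unsupported init for this sketch: {!r}'.format(init))


data = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (1.1, 1.2)]
print(initial_centers(data, 2))
print(initial_centers(data, 3, init='no_replace', seed=0))
```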

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

• 'no': No, normalization will not be applied.

• 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

• 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6


Creating a KMedoids instance:

>>> kmedoids = KMedoids(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean')


Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5


Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714

Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

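| Method | Description |
|---|---|
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, key, features, categorical_variable]) | Perform clustering on input dataset. |
| fit_predict(data[, key, features, ...]) | Perform clustering algorithm and return labels. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse sql lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the sql code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |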

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

fit(data, key=None, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters

dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters

dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

• ID column, with the same name and type as data's ID column.

• CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

• DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

hana_ml.algorithms.pal.clustering.outlier_detection_kmeans(data, key=None, features=None, n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None, thread_number=None)

Outlier detection based on k-means clustering.

Parameters

dataDataFrame

Input data for outlier detection using k-means clustering.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr or a list of str, optional

Names of the features columns in data that are used for calculating distances of points in data for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

distance_level{'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the distance type between data points and cluster center.

• 'manhattan' : Manhattan distance

• 'euclidean' : Euclidean distance

• 'minkowski' : Minkowski distance

Defaults to 'euclidean'.
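
For reference, the three distance types can be written as follows. This is a plain-Python sketch with made-up function names. The default power of 3.0 in minkowski is an assumption borrowed from the minkowski_power default documented for KMedians and KMedoids, since this function's signature does not expose that parameter.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def minkowski(p, q, power=3.0):
    """Generalization: power=1 gives Manhattan, power=2 gives Euclidean."""
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)


p, q = (0.0, 0.0), (3.0, 4.0)
print(manhattan(p, q))   # 7.0
print(euclidean(p, q))   # 5.0
print(minkowski(p, q))
```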

contaminationfloat, optional

Specifies the proportion of outliers in data.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distancebool, optional

Specifies whether to use the sum of the distances from a point to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used as its distance value.

Defaults to True.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

• 'first_k': First k observations.

• 'replace': Random with replacement.

• 'no_replace': Random without replacement.

• 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

• 'no': No normalization will be applied.

• 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

• 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

tolfloat, optional

Convergence threshold for exiting iterations in k-means clustering.

Defaults to 1.0e-6.

distance_thresholdfloat, optional

Specifies the threshold distance value for outlier detection.

A point with distance value no greater than the threshold is not considered to be an outlier.

Defaults to -1.

thread_numberint, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

Returns
DataFrame
• Detected outliers, structured as follows:

• 1st column : ID of detected outliers in data.

• other columns : feature values for detected outliers

• Statistics of detected outliers, structured as follows:

• 1st column : ID of detected outliers in data.

• 2nd column : ID of the corresponding cluster centers.

• 3rd column : Outlier score, which is the distance value.

• Centers of clusters produced by k-means algorithm, structured as follows:

• 1st column : ID of cluster center.

• other columns : Coordinate (i.e. feature) values of cluster center.

Examples

Input data for outlier detection:

>>> df.collect()
ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0

>>> outliers, stats, centers = outlier_detection_kmeans(df, key='ID',
...                                                     distance_level='euclidean',
...                                                     contamination=0.15,
...                                                     sum_distance=True,
...                                                     distance_threshold=3)
>>> outliers.collect()
ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274

class hana_ml.algorithms.pal.clustering.SpectralClustering(n_clusters, n_components=None, gamma=None, affinity=None, n_neighbors=None, cut=None, eigen_tol=None, krylov_dim=None, distance_level=None, minkowski_power=None, category_weights=None, max_iter=None, init=None, tol=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This is the Python wrapper for PAL Spectral Clustering.

Spectral clustering is an algorithm that evolved from graph theory and is widely used in clustering. Its main idea is to treat all data as points in space that can be connected by edges. The edge weight between two points that are far apart is low, while the edge weight between two points that are close together is high. The graph composed of all data points is then cut so that the sum of edge weights between different subgraphs is as low as possible, while the sum of edge weights within each subgraph is as high as possible, which achieves the purpose of clustering.

It performs a low-dimension embedding of the affinity matrix between samples, followed by k-means clustering of the components of the eigenvectors in the low dimensional space.

Parameters
n_clustersint

The number of clusters for spectral clustering.

The valid range for this parameter is from 2 to the number of records in the input data.

n_componentsint, optional

The number of eigenvectors used for spectral embedding.

Defaults to the value of n_clusters.

gammafloat, optional

RBF kernel coefficient $$\gamma$$ used in constructing affinity matrix with distance metric d, illustrated as $$\exp(-\gamma * d^2)$$.

Defaults to 1.0.
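The RBF weighting described for gamma can be sketched as follows, assuming Euclidean distance for d. The function name rbf_affinity is illustrative, and whether PAL keeps or zeroes the diagonal entries is not stated here, so the sketch simply leaves them at 1.

```python
import math

def rbf_affinity(points, gamma=1.0):
    """Fully-connected affinity matrix with weights exp(-gamma * d^2), d Euclidean."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            matrix[i][j] = math.exp(-gamma * sq_dist)
    return matrix

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
W = rbf_affinity(points, gamma=1.0)
print(W[0][1] > W[0][2])  # True: closer points get a larger weight
```

A larger gamma makes the weights decay faster with distance, so only very close points remain strongly connected.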

affinitystr, optional

Specifies the type of graph used to construct the affinity matrix. Valid options include:

• 'knn' : binary affinity matrix constructed from the graph of k-nearest-neighbors(knn).

• 'mutual-knn' : binary affinity matrix constructed from the graph of mutual k-nearest-neighbors(mutual-knn).

• 'fully-connected' : affinity matrix constructed from fully-connected graph, with weights defined by RBF kernel coefficients.

Defaults to 'fully-connected'.
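The difference between the 'knn' and 'mutual-knn' graphs can be illustrated with a small sketch. The function knn_affinity is hypothetical; it assumes that 'knn' links two points when either one lists the other among its nearest neighbors, and that 'mutual-knn' requires both, which is the common textbook convention rather than confirmed PAL behavior.

```python
def knn_affinity(points, n_neighbors, mutual=False):
    """Binary affinity matrix from a (mutual) k-nearest-neighbor graph."""
    n = len(points)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # For each point, the index set of its n_neighbors closest other points.
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        neighbors.append(set(others[:n_neighbors]))

    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            i_has_j, j_has_i = j in neighbors[i], i in neighbors[j]
            # 'knn': an edge if either point lists the other (assumed symmetrization).
            # 'mutual-knn': an edge only if both points list each other.
            linked = (i_has_j and j_has_i) if mutual else (i_has_j or j_has_i)
            matrix[i][j] = 1 if linked else 0
    return matrix

points = [(0.0,), (1.0,), (3.0,)]
print(knn_affinity(points, n_neighbors=1))               # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(knn_affinity(points, n_neighbors=1, mutual=True))  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```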

n_neighborsint, optional

The number of neighbors to use when constructing the affinity matrix using the nearest neighbors method.

Valid only when affinity is 'knn' or 'mutual-knn'.

Defaults to 10.

cutstr, optional

Specifies the method to cut the graph.

• 'ratio-cut' : Ratio-Cut.

• 'n-cut' : Normalized-Cut.

Defaults to 'ratio-cut'.
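In the usual textbook relaxation, Ratio-Cut leads to the unnormalized graph Laplacian D - W and Normalized-Cut to a normalized Laplacian. The sketch below builds both from a symmetric affinity matrix; the function laplacian is illustrative, and the mapping of cut values to Laplacians is an assumption about the standard method, not a statement of PAL's exact implementation.

```python
import math

def laplacian(W, cut="ratio-cut"):
    """Graph Laplacian for a symmetric affinity matrix W (list of lists)."""
    n = len(W)
    degree = [sum(row) for row in W]
    L = [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    if cut == "ratio-cut":
        return L                       # unnormalized Laplacian D - W
    if cut == "n-cut":
        # Symmetric normalized Laplacian D^(-1/2) (D - W) D^(-1/2);
        # assumes every vertex has a positive degree.
        return [[L[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
                for i in range(n)]
    raise ValueError("cut must be 'ratio-cut' or 'n-cut'")

W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
print(laplacian(W, "ratio-cut"))  # [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
```

The eigenvectors of the chosen Laplacian that belong to its smallest eigenvalues form the low-dimensional embedding on which k-means is then run.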

eigen_tolfloat, optional

Stopping criterion for eigendecomposition of the Laplacian matrix.

Defaults to 1e-10.

krylov_dimint, optional

Specifies the dimension of Krylov subspaces used in Eigenvalue decomposition. In general, this parameter controls the convergence speed of the algorithm. Typically a larger krylov_dim means faster convergence, but it may also result in greater memory use and more matrix operations in each iteration.

Defaults to 2*n_components.

Note

This parameter must satisfy

n_components < krylov_dim $$\le$$ the number of training records.
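A small pre-flight check for this constraint might look like the sketch below. The helper resolve_krylov_dim is hypothetical and not part of hana_ml; it applies the documented default of 2*n_components and raises an error when the constraint is violated.

```python
def resolve_krylov_dim(n_components, n_records, krylov_dim=None):
    """Return a krylov_dim that satisfies n_components < krylov_dim <= n_records."""
    if krylov_dim is None:
        krylov_dim = 2 * n_components   # documented default
    if not n_components < krylov_dim <= n_records:
        raise ValueError(
            f"krylov_dim={krylov_dim} must satisfy "
            f"{n_components} < krylov_dim <= {n_records}")
    return krylov_dim

print(resolve_krylov_dim(n_components=3, n_records=100))  # 6
```

For very small datasets the default can exceed the number of records, in which case a smaller value has to be supplied explicitly.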

distance_levelstr, optional

Specifies the method for computing the distance between data records and cluster centers:

• 'manhattan' : Manhattan distance.

• 'euclidean' : Euclidean distance.

• 'minkowski' : Minkowski distance.

• 'chebyshev' : Chebyshev distance.

• 'cosine' : Cosine distance.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

Specifies the power parameter in Minkowski distance.

Valid only when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

max_iterint, optional

Maximum number of iterations for K-Means algorithm.

Defaults to 100.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected in K-Means algorithm:

• 'first_k': First k observations.

• 'replace': Random with replacement.

• 'no_replace': Random without replacement.

• 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

tolfloat, optional

Specifies the exit threshold for K-Means iterations.

Defaults to 1e-6.

Attributes
labels_DataFrame

DataFrame that holds the cluster labels.

Set to None if not fitted.

stats_DataFrame

DataFrame that holds the related statistics for spectral clustering.

Set to None if not fitted.

Methods

• add_attribute(attr_key, attr_val) : Function to add attribute.

• disable_with_hint() : Disable with hint.

• enable_no_inline() : Enable no inline.

• enable_parallel_by_parameter_partitions() : Enable parallel by parameter partitions.

• fit(data[, key, features, thread_ratio]) : Perform spectral clustering for the given dataset.

• fit_predict(data[, key, features, thread_ratio]) : Given data, perform spectral clustering and return the corresponding cluster labels.

• get_fit_execute_statement() : Return the execute_statement for training.

• get_fit_parameters() : Get PAL fit parameters.

• get_parameters() : Parse sql lines containing the parameter definitions.

• get_predict_execute_statement() : Return the execute_statement for predicting.

• get_predict_parameters() : Get PAL predict parameters.

• get_score_execute_statement() : Return the execute_statement for scoring.

• get_score_parameters() : Get PAL score parameters.

• get_store_procedure() : Extract the specific function call of the PAL function from the sql code.

• is_fitted() : Checks if the model can be saved.

• load_model(model) : Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]) : HANA statement routing.

fit(data, key=None, features=None, thread_ratio=None)

Perform spectral clustering for the given dataset.

Parameters

dataDataFrame

DataFrame containing the input data for spectral clustering.

keystr, optional

Name of ID column in data.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by spectral clustering.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

fit_predict(data, key=None, features=None, thread_ratio=None)

Given data, perform spectral clustering and return the corresponding cluster labels.

Parameters

dataDataFrame

DataFrame containing the input data for spectral clustering.

keystr, optional

Name of ID column in data.

Mandatory if data is not indexed, or indexed by multiple columns.

Defaults to the index column of data if there is one.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns of data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by spectral clustering.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns
DataFrame

The cluster labels of all records in data, structured as follows:

• 1st column : column name and type same as the key column of data, representing record IDs.

• 2nd column : CLUSTER_ID, type INTEGER, representing the cluster IDs assigned to all records in data.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

## hana_ml.algorithms.pal.crf¶

This module contains the Python wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.

The following class is available:

class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).

Parameters
epsilonfloat, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lambfloat, optional

Regularization weight, should be greater than 0.

Defaults to 1.0.

max_iterint, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_mint, optional

Number of memories to be stored in the L-BFGS optimization algorithm.

Defaults to 25.

use_class_featurebool, optional

To include a feature for class/label. This is the same as having a bias vector in a model.

Defaults to True.

use_wordbool, optional

If True, gives you feature for current word.

Defaults to True.

use_ngramsbool, optional

Whether to make feature from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngramsbool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_lengthint, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.
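To make the three n-gram options concrete, the sketch below enumerates character n-grams of a word. The helper char_ngrams is hypothetical and only mirrors the parameter descriptions above; the features PAL actually generates, and how it treats a missing or non-positive limit, may differ.

```python
def char_ngrams(word, max_ngram_length=None, mid_ngrams=False):
    """Hypothetical helper: character n-grams of a word as a set of substrings."""
    # A non-positive or missing limit is treated as "no upper limit" (an assumption).
    limit = len(word)
    if max_ngram_length is not None and max_ngram_length > 0:
        limit = min(limit, max_ngram_length)
    grams = set()
    for n in range(1, limit + 1):
        for start in range(len(word) - n + 1):
            touches_edge = start == 0 or start + n == len(word)
            # Without mid_ngrams, keep only n-grams that include the first or last character.
            if mid_ngrams or touches_edge:
                grams.add(word[start:start + n])
    return grams

print(sorted(char_ngrams("Blood", max_ngram_length=2)))
# ['B', 'Bl', 'd', 'od']
print(sorted(char_ngrams("Blood", max_ngram_length=2, mid_ngrams=True)))
# ['B', 'Bl', 'd', 'l', 'lo', 'o', 'od', 'oo']
```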

use_prevbool, optional

Whether or not to include a feature for previous word and current word, and together with other options enables other previous features.

Defaults to True.

use_nextbool, optional

Whether or not to include a feature for next word and current word.

Defaults to True.

disjunction_widthint, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctivebool, optional

Whether or not to include features giving disjunctions of words anywhere within disjunction_width words to the left or right.

Defaults to True.

use_seqsbool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqsbool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqsbool, optional

Whether or not to use basic zeroth order word shape features.

Defaults to True.

use_type_seqs2bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqsbool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shapeint, optional

Word shape, e.g. whether capitalized or numeric. Only supports chris2UseLC currently. Do not use word shape if this is 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by the fit (i.e. training) function.

The range of this parameter is from 0 to 1.

0 means using only a single thread, and 1 means using at most all currently available threads.

Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.

Examples

Input data for training:

>>> df.head(10).collect()
DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O


Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")


Check the trained CRF model and related statistics:

>>> crf.model_.collect()
ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.collect()
STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382


Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52


Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD')


Check the prediction result:

>>> res.head(10).collect()

Attributes
model_DataFrame

CRF model content.

stats_DataFrame

Statistic info for CRF model fitting, structured as follows:

• 1st column: name of the statistics, type NVARCHAR(100).

• 2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_DataFrame

Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (in the future).

Methods

• add_attribute(attr_key, attr_val) : Function to add attribute.

• disable_with_hint() : Disable with hint.

• enable_no_inline() : Enable no inline.

• enable_parallel_by_parameter_partitions() : Enable parallel by parameter partitions.

• fit(data[, doc_id, word_pos, word, label]) : Function for training the CRF model on English text.

• get_fit_execute_statement() : Return the execute_statement for training.

• get_fit_parameters() : Get PAL fit parameters.

• get_parameters() : Parse sql lines containing the parameter definitions.

• get_predict_execute_statement() : Return the execute_statement for predicting.

• get_predict_parameters() : Get PAL predict parameters.

• get_score_execute_statement() : Return the execute_statement for scoring.

• get_score_parameters() : Get PAL score parameters.

• get_store_procedure() : Extract the specific function call of the PAL function from the sql code.

• is_fitted() : Checks if the model can be saved.

• load_model(model) : Function to load fitted model.

• predict(data[, doc_id, word_pos, word, ...]) : The function that predicts text labels based on the trained CRF model.

• set_scale_out([route_to, no_route_to, ...]) : HANA statement routing.

fit(data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters

dataDataFrame

Input data for training/fitting the CRF model.

It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to 1st non-doc_id, non-word_pos column of the input data.

labelstr, optional

Name of the label column.

Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
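The default column resolution described for doc_id, word_pos, word and label can be sketched as follows. The helper resolve_crf_columns is hypothetical and only restates the documented defaults on a plain list of column names; it is not the library's actual implementation.

```python
def resolve_crf_columns(columns, doc_id=None, word_pos=None, word=None, label=None):
    """Hypothetical helper mirroring the documented column defaults of fit()."""
    remaining = list(columns)
    doc_id = doc_id or remaining[0]          # first column
    remaining.remove(doc_id)
    word_pos = word_pos or remaining[0]      # first non-doc_id column
    remaining.remove(word_pos)
    word = word or remaining[0]              # first non-doc_id, non-word_pos column
    remaining.remove(word)
    label = label or remaining[-1]           # last remaining column
    return doc_id, word_pos, word, label

print(resolve_crf_columns(["DOC_ID", "WORD_POSITION", "WORD", "LABEL"]))
# ('DOC_ID', 'WORD_POSITION', 'WORD', 'LABEL')
print(resolve_crf_columns(["DOC_ID", "WORD_POSITION", "WORD", "EXTRA", "LABEL"]))
# ('DOC_ID', 'WORD_POSITION', 'WORD', 'LABEL')
```

When the input has more than four columns, as in the second call, naming the columns explicitly avoids relying on their order.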

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym, for example:
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

The function that predicts text labels based on the trained CRF model.

Parameters

dataDataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the 1st column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by the predict function.

The range of this parameter is from 0 to 1.

0 means using only a single thread, and 1 means using at most all currently available threads.

Values outside this range are ignored, and the predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns
DataFrame

Prediction result for the input data, structured as follows:

• 1st column: document ID,

• 2nd column: word position,

• 3rd column: label.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

## hana_ml.algorithms.pal.decomposition¶

This module contains Python wrappers for PAL decomposition algorithms.

The following classes are available:

class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None)

Bases: hana_ml.algorithms.pal.decomposition._PCABase

Principal component analysis (PCA) reduces the dimensionality of multivariate data using Singular Value Decomposition.

Parameters

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scalingbool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scoresbool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

Examples

Input DataFrame df1 for training:

>>> df1.head(4).collect()
ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0


Creating a PCA instance:

>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)


Performing fit on given dataframe:

>>> pca.fit(data=df1, key='ID')


Output:

>>> pca.loadings_.collect()
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489

>>> pca.loadings_stat_.collect()
COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
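The columns of this table appear to be related in the usual way: each component's share of variance is its squared standard deviation divided by the sum of all squared standard deviations, and CUM_VAR_PROP is the running total. The following sketch recomputes both from the SD column printed above; it is a consistency check on the displayed numbers, not a description of PAL internals.

```python
sd = [1.566624, 1.100453, 0.536973, 0.215297]   # SD column from the output above

variances = [s ** 2 for s in sd]
total = sum(variances)
var_prop = [v / total for v in variances]

# Running sum of the proportions gives the cumulative share.
cum_var_prop = []
running = 0.0
for p in var_prop:
    running += p
    cum_var_prop.append(running)

print([round(p, 4) for p in var_prop])      # [0.6136, 0.3027, 0.0721, 0.0116]
print([round(p, 4) for p in cum_var_prop])  # [0.6136, 0.9163, 0.9884, 1.0]
```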

>>> pca.scaling_stat_.collect()
VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398


Input dataframe df2 for transforming:

>>> df2.collect()
ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0


Performing transform() on given dataframe:

>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
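With scaling=True, the numbers in this example are consistent with a simple recipe: standardize each value with the MEAN and SCALE from scaling_stat_, then take the dot product with a component's loadings. The sketch below reproduces the first two scores of the first df2 record from the values printed above; the helper component_score is illustrative and not part of hana_ml.

```python
loadings = {   # rows of pca.loadings_ from the output above
    "Comp1": [0.541547, 0.321424, 0.511941, 0.584235],
    "Comp2": [-0.454280, 0.728287, 0.395819, -0.326429],
}
mean = [17.000000, 53.636364, 23.000000, 48.454545]   # pca.scaling_stat_ MEAN
scale = [5.039841, 1.689540, 2.000000, 4.655398]      # pca.scaling_stat_ SCALE

def component_score(row, weights):
    """Dot product of the standardized row with one component's loadings."""
    standardized = [(x - m) / s for x, m, s in zip(row, mean, scale)]
    return sum(w * z for w, z in zip(weights, standardized))

first_row = [2.0, 32.0, 10.0, 54.0]   # df2 record with ID 1
print(round(component_score(first_row, loadings["Comp1"]), 3))  # -8.36
print(round(component_score(first_row, loadings["Comp2"]), 3))  # -10.936
```

The small differences from the displayed COMPONENT_1 and COMPONENT_2 values come from the rounding of the printed loadings and scaling statistics.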

Attributes

loadings_DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

scores_DataFrame

The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if there exists one variable which has constant value across data items.

Methods

• add_attribute(attr_key, attr_val) : Function to add attribute.

• disable_with_hint() : Disable with hint.

• enable_no_inline() : Enable no inline.

• enable_parallel_by_parameter_partitions() : Enable parallel by parameter partitions.

• fit(data[, key, features, label]) : Principal component analysis fit function.

• fit_transform(data[, key, features, ...]) : Fit with the dataset and return the scores.

• get_fit_execute_statement() : Return the execute_statement for training.

• get_fit_parameters() : Get PAL fit parameters.

• get_parameters() : Parse sql lines containing the parameter definitions.

• get_predict_execute_statement() : Return the execute_statement for predicting.

• get_predict_parameters() : Get PAL predict parameters.

• get_score_execute_statement() : Return the execute_statement for scoring.

• get_score_parameters() : Get PAL score parameters.

• get_store_procedure() : Extract the specific function call of the PAL function from the sql code.

• is_fitted() : Checks if the model can be saved.

• load_model(model) : Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]) : HANA statement routing.

• transform(data[, key, features, ...]) : Principal component analysis projection function using a trained model.

fit(data, key=None, features=None, label=None)

Principal component analysis fit function.

Parameters

dataDataFrame

Data to be fitted.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

fit_transform(data, key=None, features=None, n_components=None, label=None)

Fit with the dataset and return the scores.

Parameters

dataDataFrame

Data to be analyzed.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

n_componentsint, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

• ID column, with same name and type as data's ID column.

• SCORE columns, type DOUBLE, representing the component score values of each data point.

transform(data, key=None, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters

dataDataFrame

Data to be analyzed.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_componentsint, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

• ID column, with same name and type as data's ID column.

• SCORE columns, type DOUBLE, representing the component score values of each data point.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
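
As a rough illustration of the extraction described above, the following sketch pulls the name that follows `CALL` out of a statement with a regular expression. The helper name `extract_procedure_name` is hypothetical, and stripping the surrounding double quotes is an assumption; the actual method may format its result differently.

```python
import re


def extract_procedure_name(sql):
    """Return the name following CALL, without surrounding double quotes."""
    # Optional quote, the name itself, optional quote, then the opening parenthesis.
    match = re.search(r'CALL\s+"?([^"\s(]+)"?\s*\(', sql, flags=re.IGNORECASE)
    return match.group(1) if match else None


name = extract_procedure_name('CALL "SYS_AFL.PAL_RANDOM_FORREST" (...)')
missing = extract_procedure_name("SELECT 1 FROM DUMMY")
```

The returned string is still only a synonym; resolving it to the underlying procedure is a separate step.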
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters
n_componentsint

Expected number of topics in the corpus.

doc_topic_priorfloat, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_priorfloat, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Value must be greater than 0.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

• 0: Uses the system time.

• Not 0: Uses the provided value.

Defaults to 0.

max_top_wordsint, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_wordsfloat, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold.

It cannot be used together with parameter max_top_words.
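
The interplay between max_top_words and threshold_top_words can be sketched as follows. This is an illustrative helper (`select_top_words` is a hypothetical name, and the word probabilities are made up), not the PAL implementation.

```python
def select_top_words(word_probs, max_top_words=None, threshold_top_words=None):
    """Pick the top words of one topic by count or by probability threshold."""
    if max_top_words is not None and threshold_top_words is not None:
        raise ValueError("max_top_words and threshold_top_words cannot be used together")
    if max_top_words is None and threshold_top_words is None:
        return None  # no top-words output requested
    # Rank words from most to least probable.
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    if max_top_words is not None:
        ranked = ranked[:max_top_words]
    else:
        ranked = [(word, p) for word, p in ranked if p > threshold_top_words]
    # Words are joined by spaces, like the WORDS column of topic_top_words_.
    return " ".join(word for word, _ in ranked)


probs = {"cpu": 0.4, "memory": 0.3, "keyboard": 0.2, "valve": 0.1}  # made-up values
by_count = select_top_words(probs, max_top_words=2)
by_threshold = select_top_words(probs, threshold_top_words=0.15)
```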

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

• 'uniform': Assign each word in each document a topic by uniform distribution.

• 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document.

Each delimiter must be one character long.

Defaults to [' '].
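
To make the role of single-character delimiters concrete, here is a minimal tokenization sketch. The helper names `tokenize` and `assign_word_ids` are hypothetical. Numbering words in order of first appearance is an assumption; it happens to agree with the IDs shown for these words in the dictionary_ example output, but it is not documented behaviour.

```python
def tokenize(text, delimiters=(" ",)):
    """Split text on any of the given single-character delimiters."""
    for delimiter in delimiters:
        if len(delimiter) != 1:
            raise ValueError("each delimiter must be one character long")
    delimiter_set = set(delimiters)
    tokens, current = [], []
    for char in text:
        if char in delimiter_set:
            if current:  # close the word collected so far
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        tokens.append("".join(current))
    return tokens


def assign_word_ids(tokens):
    """Number distinct words in order of first appearance (assumed scheme)."""
    word_ids = {}
    for token in tokens:
        word_ids.setdefault(token, len(word_ids))
    return word_ids


# Only the visible prefix of document 10 from the example is used here.
tokens = tokenize("cpu harddisk graphiccard cpu monitor keyboard", [" ", "\r", "\n"])
word_ids = assign_word_ids(tokens)
```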

output_word_assignmentbool, optional

Controls whether to output the word_topic_assignment_ DataFrame or not. If True, output the word_topic_assignment_ DataFrame.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...


Creating an LDA instance:

>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1,
...                                 max_top_words=5, doc_topic_prior=0.1,
...                                 output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])


Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')


Output:

>>> lda.doc_topic_dist_.collect()
DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434

>>> lda.word_topic_assignment_.collect()
DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2

>>> lda.topic_top_words_.collect()
TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve

>>> lda.topic_word_dist_.head(40).collect()
TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286

>>> lda.dictionary_.collect()
WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels

>>> lda.statistic_.collect()
STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762


Dataframe df2 to transform:

>>> df2.collect()
DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu


Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
...                     iteration=1000, seed=1, output_word_assignment=True)

>>> doc_top_df, word_top_df, stat_df = res

>>> doc_top_df.collect()
DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739

>>> word_top_df.collect()
DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4

>>> stat_df.collect()
STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191

Attributes
doc_topic_dist_DataFrame

Document-topic distribution table, structured as follows:

• Document ID column, with same name and type as data's document ID column from fit().

• TOPIC_ID, type INTEGER, topic ID.

• PROBABILITY, type DOUBLE, probability of topic given document.
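
The PROBABILITY values in the example output can be reproduced from the word-topic assignments with the usual smoothed estimate, (words of the document assigned to the topic + doc_topic_prior) / (words in the document + n_components * doc_topic_prior). That PAL computes the column exactly this way is an inference from the example numbers, not a documented guarantee; the helper name is hypothetical.

```python
from collections import Counter


def doc_topic_distribution(topic_assignments, n_components, doc_topic_prior):
    """Smoothed topic probabilities for one document."""
    counts = Counter(topic_assignments)
    total = len(topic_assignments) + n_components * doc_topic_prior
    return [(counts.get(topic, 0) + doc_topic_prior) / total
            for topic in range(n_components)]


# TOPIC_ID values of the ten words of document 30 in word_topic_assignment_.
doc_30 = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
probs = doc_topic_distribution(doc_30, n_components=6, doc_topic_prior=0.1)
```

With these inputs the first two entries come out close to 0.103774 and 0.858491, and the remaining four close to 0.009434, in line with rows 12 to 17 of the doc_topic_dist_ example.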

word_topic_assignment_DataFrame

Word-topic assignment table, structured as follows:

• Document ID column, with same name and type as data's document ID column from fit().

• WORD_ID, type INTEGER, word ID.

• TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_DataFrame

Topic top words table, structured as follows:

• TOPIC_ID, type INTEGER, topic ID.

• WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_DataFrame

Topic-word distribution table, structured as follows:

• TOPIC_ID, type INTEGER, topic ID.

• WORD_ID, type INTEGER, word ID.

• PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_DataFrame

Dictionary table, structured as follows:

• WORD_ID, type INTEGER, word ID.

• WORD, type NVARCHAR(5000), word text.

statistic_DataFrame

Statistics table, structured as follows:

• STAT_NAME, type NVARCHAR(256), statistic name.

• STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

• Parameters max_top_words and threshold_top_words cannot be used together.

• Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
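
One common way to read burn_in, thin and iteration together is sketched below: the first burn_in iterations are discarded and, after that, only every thin-th iteration is retained. This is the generic MCMC convention and is stated here as an assumption; the exact counting used by PAL may differ. The helper name is hypothetical.

```python
def retained_iterations(iteration, burn_in=0, thin=1):
    """Indices (0-based) of Gibbs iterations kept under the assumed convention."""
    if thin <= 0:
        raise ValueError("thin must be greater than 0")
    # Skip the burn-in phase, then keep one sample every `thin` iterations.
    return list(range(burn_in, iteration, thin))


# Values from the LatentDirichletAllocation constructor in the example above.
kept = retained_iterations(iteration=100, burn_in=50, thin=10)
```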

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, document]): Fit LDA model based on training data.

• fit_transform(data[, key, document]): Fit LDA model based on training data and return the topic assignment for the training documents.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse SQL lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the SQL code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

• transform(data[, key, document, burn_in, ...]): Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(data, key=None, document=None)

Fit LDA model based on training data.

Parameters

dataDataFrame

Training data.

keystr, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key (non-index) column, and document defaults to that column.
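
The defaulting rules for key and document can be summarised in a short sketch. It works on plain column-name lists rather than a HANA DataFrame, and the helper name `resolve_columns` is hypothetical; it only mirrors the rules stated above.

```python
def resolve_columns(columns, index=None, key=None, document=None):
    """Apply the documented defaults for key and document.

    columns: list of column names; index: None, a column name, or a list of names.
    """
    index_cols = [index] if isinstance(index, str) else list(index or [])
    if key is None:
        # key is mandatory unless data is indexed by exactly one column.
        if len(index_cols) != 1:
            raise ValueError("key is mandatory when data has no single index column")
        key = index_cols[0]
    if document is None:
        others = [col for col in columns if col != key]
        if len(others) != 1:
            raise ValueError("document is required unless exactly 1 non-key column exists")
        document = others[0]
    return key, document


resolved = resolve_columns(["DOCUMENT_ID", "TEXT"], index="DOCUMENT_ID")
```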

fit_transform(data, key=None, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters

dataDataFrame

Training data.

keystr, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns
DataFrame

Document-topic distribution table, structured as follows:

• Document ID column, with same name and type as data's document ID column.

• TOPIC_ID, type INTEGER, topic ID.

• PROBABILITY, type DOUBLE, probability of topic given document.

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse the SQL lines that contain the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, and exactly one of the other three contains the value for that parameter, while the remaining two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

transform(data, key=None, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters

dataDataFrame

Independent variable values used for transform.

keystr, optional

Name of the document ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

• 0: Uses the system time.

• Not 0: Uses the provided value.

Defaults to 0.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

• 'uniform': Assign each word in each document a topic by uniform distribution.

• 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [' '].

output_word_assignmentbool, optional

Controls whether to output the word_topic_df or not.

If True, output the word_topic_df.

Defaults to False.

Returns
DataFrame

Document-topic distribution table, structured as follows:

• Document ID column, with same name and type as data's document ID column.

• TOPIC_ID, type INTEGER, topic ID.

• PROBABILITY, type DOUBLE, probability of topic given document.

DataFrame

Word-topic assignment table, structured as follows:

• Document ID column, with same name and type as data's document ID column.

• WORD_ID, type INTEGER, word ID.

• TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

DataFrame

Statistics table, structured as follows:

• STAT_NAME, type NVARCHAR(256), statistic name.

• STAT_VALUE, type NVARCHAR(1000), statistic value.
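
In the transform() example, the PERPLEXITY statistic is numerically consistent with exp(-LOG_LIKELIHOOD / N), where N is the number of words in the transformed documents. This relationship is inferred from the example values and is offered as a sanity check rather than as a statement about PAL internals.

```python
import math

# STAT_VALUE entries are strings in the statistics table, so convert first.
log_likelihood = float("-7.925092991875363")
n_words = len("toy toy spoon cpu".split())  # 4 words in the example document

# Per-word perplexity under the assumed definition.
perplexity = math.exp(-log_likelihood / n_words)
```

The result agrees with the reported PERPLEXITY of 7.251970666272191 to well within rounding.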

## hana_ml.algorithms.pal.discriminant_analysis¶

This module contains the PAL wrapper for the discriminant analysis algorithm. The following class is available:

class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(regularization_type=None, regularization_amount=None, projection=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear discriminant analysis for classification and data reduction.

Parameters
regularization_type{'mixing', 'diag', 'pseudo'}, optional

The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.

Defaults to 'mixing'.

regularization_amountfloat, optional

The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix.

Valid range for this parameter is [0,1].

Valid only when regularization_type is 'mixing'.

Defaults to the smallest number in [0,1] that makes the regularized empirical covariance matrix invertible.
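
One plausible reading of the 'mixing' strategy is the convex combination (1 - a) * S + a * diag(S), where S is the empirical covariance matrix and a is regularization_amount. The sketch below illustrates that reading on a made-up, singular 2x2 matrix; it is an assumption about the formula, not the PAL source, and the helper names are hypothetical.

```python
def mix_with_diagonal(cov, amount):
    """Return (1 - amount) * cov + amount * diag(cov) for a square matrix."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must lie in [0, 1]")
    size = len(cov)
    return [
        [
            (1.0 - amount) * cov[i][j] + amount * (cov[i][j] if i == j else 0.0)
            for j in range(size)
        ]
        for i in range(size)
    ]


def det2(matrix):
    """Determinant of a 2x2 matrix."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


singular = [[4.0, 2.0], [2.0, 1.0]]  # made-up matrix with determinant 0
regularized = mix_with_diagonal(singular, amount=0.5)
```

The diagonal is left unchanged while off-diagonal entries shrink, so the regularized matrix here has a non-zero determinant and can be inverted.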

projectionbool, optional

Whether or not to compute the projection model.

Defaults to True.

Examples

The training data for linear discriminant analysis:

>>> df.collect()
X1   X2   X3   X4            CLASS
0   5.1  3.5  1.4  0.2      Iris-setosa
1   4.9  3.0  1.4  0.2      Iris-setosa
2   4.7  3.2  1.3  0.2      Iris-setosa
3   4.6  3.1  1.5  0.2      Iris-setosa
4   5.0  3.6  1.4  0.2      Iris-setosa
5   5.4  3.9  1.7  0.4      Iris-setosa
6   4.6  3.4  1.4  0.3      Iris-setosa
7   5.0  3.4  1.5  0.2      Iris-setosa
8   4.4  2.9  1.4  0.2      Iris-setosa
9   4.9  3.1  1.5  0.1      Iris-setosa
10  7.0  3.2  4.7  1.4  Iris-versicolor
11  6.4  3.2  4.5  1.5  Iris-versicolor
12  6.9  3.1  4.9  1.5  Iris-versicolor
13  5.5  2.3  4.0  1.3  Iris-versicolor
14  6.5  2.8  4.6  1.5  Iris-versicolor
15  5.7  2.8  4.5  1.3  Iris-versicolor
16  6.3  3.3  4.7  1.6  Iris-versicolor
17  4.9  2.4  3.3  1.0  Iris-versicolor
18  6.6  2.9  4.6  1.3  Iris-versicolor
19  5.2  2.7  3.9  1.4  Iris-versicolor
20  6.3  3.3  6.0  2.5   Iris-virginica
21  5.8  2.7  5.1  1.9   Iris-virginica
22  7.1  3.0  5.9  2.1   Iris-virginica
23  6.3  2.9  5.6  1.8   Iris-virginica
24  6.5  3.0  5.8  2.2   Iris-virginica
25  7.6  3.0  6.6  2.1   Iris-virginica
26  4.9  2.5  4.5  1.7   Iris-virginica
27  7.3  2.9  6.3  1.8   Iris-virginica
28  6.7  2.5  5.8  1.8   Iris-virginica
29  7.2  3.6  6.1  2.5   Iris-virginica


Set up an instance of LinearDiscriminantAnalysis model and train it:

>>> lda = LinearDiscriminantAnalysis(regularization_type='mixing', projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')


Check the coefficients of the obtained linear discriminators and the projection model:

>>> lda.coef_.collect()
CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333


Data to predict the class labels:

>>> df_pred.collect()
ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5


Perform predict() and check the result:

>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065


Data to project:

>>> df_proj.collect()
ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5


Do project and check the result:

>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1','X2','X3','X4'],
...                        proj_dim=2)
>>> res_proj.collect()
ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None

Attributes
basic_info_DataFrame

Basic information of the training data for linear discriminant analysis.

priors_DataFrame

The empirical priors for each class in the training data.

coef_DataFrame

Coefficients (inclusive of intercepts) of each class' linear score function for the training data.

proj_info_DataFrame

Projection-related information, such as the standard deviations of the discriminants and the proportion of the total variance explained by each discriminant.

proj_model_DataFrame

The projection matrix and overall means for features.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features, label]): Calculate linear discriminators from training data.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse SQL lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the SQL code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• predict(data[, key, features, verbose]): Predict class labels using fitted linear discriminators.

• project(data[, key, features, proj_dim]): Project data into lower dimensional spaces using fitted LDA projection model.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None, label=None)

Calculate linear discriminators from training data.

Parameters

dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If not provided, then:

• if data is indexed by a single column, then key defaults to that index column

• otherwise, it is assumed that data contains no ID column

featureslist of str, optional

Names of the feature columns.

If not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the class label.

If not provided, it defaults to the last non-ID column.

Returns
LinearDiscriminantAnalysis

A fitted object.

predict(data, key=None, features=None, verbose=None)

Predict class labels using fitted linear discriminators.

Parameters

dataDataFrame

Data for predicting the class labels.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

verbosebool, optional

Whether to output the scores of all classes.

If False, only the score of the predicted class is output.

Defaults to False.

Returns
DataFrame

Predicted class labels and the corresponding scores, structured as follows:

• ID: with the same name and data type as data's ID column.

• CLASS: with the same name and data type as training data's label column

• SCORE: type DOUBLE, score of the predicted class.

project(data, key=None, features=None, proj_dim=None)

Project data into lower dimensional spaces using fitted LDA projection model.

Parameters

dataDataFrame

Data for linear discriminant projection.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Name of the feature columns.

If not provided, defaults to all non-ID columns.

proj_dimint, optional

Dimension of the projected space, equivalent to the number of discriminants used for projection.

Defaults to the number of obtained discriminants.

Returns
DataFrame
Projected data, structured as follows:
• 1st column: ID, with the same name and data type as data for projection.

• other columns with name DISCRIMINANT_i, where i iterates from 1 to the number of elements in features, data type DOUBLE.
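
As a rough illustration of what such a projection involves, the sketch below centers each row by the overall feature means and multiplies it by the first proj_dim columns of a projection matrix. It is plain Python, not hana_ml code: the function name project_rows, the matrix layout, and the centering step are assumptions made for this example and may differ from what PAL does internally.

```python
def project_rows(rows, means, projection, proj_dim):
    """Project rows onto the first `proj_dim` discriminants.

    `projection[f][k]` is assumed to hold the weight of feature f in
    discriminant k; `means` holds the overall mean of each feature.
    """
    projected = []
    for row in rows:
        # Center the row by the overall feature means (an assumption).
        centered = [value - mean for value, mean in zip(row, means)]
        projected.append([
            sum(c * projection[f][k] for f, c in enumerate(centered))
            for k in range(proj_dim)
        ])
    return projected


# Hypothetical two-feature model with two discriminants.
means = [1.0, 2.0]
projection = [[1.0, 0.0], [1.0, -1.0]]
one_dim = project_rows([[2.0, 4.0]], means, projection, proj_dim=1)
two_dim = project_rows([[2.0, 4.0]], means, projection, proj_dim=2)
```

Lowering proj_dim simply keeps fewer discriminant columns, which mirrors the DISCRIMINANT_i columns described above.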

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
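
The key-value conversion described for get_parameters() can be sketched in plain Python. This is only an illustration: the helper name to_key_value, the column order (name, integer value, double value, string value), and the sample parameter names are hypothetical and not taken from hana_ml.

```python
def to_key_value(rows):
    """Collapse four-column parameter rows into (name, value, type) tuples.

    Each row is assumed to be (name, int_value, double_value, string_value),
    with exactly one of the three value slots set and the other two None.
    """
    type_names = ("integer", "double", "string")
    result = []
    for name, *values in rows:
        for type_name, value in zip(type_names, values):
            if value is not None:
                result.append((name, value, type_name))
                break
    return result


# Hypothetical parameter rows, one value slot filled per row.
rows = [
    ("THREAD_RATIO", None, 0.5, None),
    ("MAX_ITER", 100, None, None),
    ("SOLVER", None, None, "qr"),
]
params = to_key_value(rows)
```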
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
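
A minimal sketch of this kind of extraction, using a regular expression on the CALL statement. The helper extract_procedure_name is hypothetical and deliberately simplified (it does not handle names quoted part by part); it is not the hana_ml implementation.

```python
import re


def extract_procedure_name(sql):
    """Return the name in the first CALL statement, or None if there is none."""
    match = re.search(r'CALL\s+"?([\w.]+)"?', sql, flags=re.IGNORECASE)
    return match.group(1) if match else None


quoted = extract_procedure_name('CALL "SYS_AFL.PAL_RANDOM_FORREST" (...)')
unquoted = extract_procedure_name("call SYS_AFL.PAL_RANDOM_FORREST(...)")
missing = extract_procedure_name("SELECT 1 FROM DUMMY")
```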
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

## hana_ml.algorithms.pal.kernel_density¶

This module contains PAL wrappers for kernel density estimation.

The following class is available:

class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Perform kernel density estimation, which is analogous to a histogram but avoids its defects.

Parameters

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.0.

leaf_sizeint, optional

Number of samples in a KD tree or Ball tree leaf node.

Only valid when algorithm is 'kd-tree' or 'ball-tree'.

Defaults to 30.

kernel{'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional

Kernel function type.

Defaults to 'gaussian'.

method{'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)

Searching method.

Defaults to 'brute_force'.

algorithm{'brute-force', 'kd-tree', 'ball-tree'}, optional

Specifies the searching method.

Defaults to 'brute-force'.

bandwidthfloat, optional

Bandwidth used during density calculation.

0 means the bandwidth is determined by the internal optimizer; otherwise the value provided by the user is used.

Only valid when data is one dimensional.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Distance metric used to compute the distance between the training data and the test data point.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When you use the Minkowski distance, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.
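
For reference, the Minkowski distance that minkowski_power controls can be written in a few lines of plain Python. This is the textbook definition, shown as a sketch; the function name is chosen for this example and is not part of hana_ml.

```python
def minkowski_distance(a, b, power=3.0):
    """Minkowski distance: (sum_i |a_i - b_i| ** power) ** (1 / power)."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


# power=1 gives the Manhattan distance, power=2 the Euclidean distance.
manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), power=1.0)
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), power=2.0)
```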

rtolfloat, optional

The desired relative tolerance of the result.

A larger tolerance generally leads to faster execution.

Defaults to 1e-8.

atolfloat, optional

The desired absolute tolerance of the result.

A larger tolerance generally leads to faster execution.

Defaults to 0.

stat_infobool, optional

• False: the STATISTIC table is empty.

• True: statistic information is displayed in the STATISTIC table.

Only valid when parameter selection is not specified.

resampling_method{'loocv'}, optional

Specifies the resampling method for model evaluation or parameter selection, only 'loocv' is permitted.

Must be set together with evaluation_metric.

No default value.

evaluation_metric{'nll'}, optional

Specifies the evaluation metric for model evaluation or parameter selection, only 'nll' is supported.

No default value.

search_strategy{'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

bandwidth_valueslist, optional

Specifies values of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.

bandwidth_rangelist, optional

Specifies ranges of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.

Examples

Data used for fitting a kernel density function:

>>> df_train.collect()
ID        X1        X2
0   0 -0.425770 -1.396130
1   1  0.884100  1.381493
2   2  0.134126 -0.032224
3   3  0.845504  2.867921
4   4  0.288441  1.513337
5   5 -0.666785  1.244980
6   6 -2.102968 -1.428327
7   7  0.769902 -0.473007
8   8  0.210291  0.328431
9   9  0.482323 -0.437962


Data used for density value prediction:

>>> df_pred.collect()
ID        X1        X2
0   0 -2.102968 -1.428327
1   1 -2.102968  0.719797
2   2 -2.102968  2.867921
3   3 -0.609434 -1.428327
4   4 -0.609434  0.719797
5   5 -0.609434  2.867921
6   6  0.884100 -1.428327
7   7  0.884100  0.719797
8   8  0.884100  2.867921


Construct KDE instance:

>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)


Fit a kernel density function:

>>> kde.fit(data=df_train, key='ID')


Perform density prediction and check the results:

>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
ID  DENSITY_VALUE
0   0      -3.324821
1   1      -5.733966
2   2      -8.372878
3   3      -3.123223
4   4      -2.772520
5   5      -4.852817
6   6      -3.469782
7   7      -2.556680
8   8      -3.198531

>>> stats.collect()
TEST_ID                            FITTING_IDS
0        0  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
1        1  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
2        2  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
3        3  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
4        4  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
5        5  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
6        6  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
7        7  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
8        8  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}

Attributes
stats_DataFrame

Statistical info for model evaluation. Available only when model evaluation/parameter selection is triggered.

optim_param_DataFrame

Optimal parameters selected. Available only when model evaluation/parameter selection is triggered.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data, key[, features]): If parameter selection / model evaluation is enabled, perform it.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse sql lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the sql code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• predict(data, key[, features]): Apply kernel density analysis.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key, features=None)

If parameter selection / model evaluation is enabled, perform it. Otherwise, just set the training data set.

Parameters

dataDataFrame

DataFrame including the data of density distribution.

keystr

Name of the ID column.

featuresstr/list of str, optional

Name of the feature columns in the dataframe.

Defaults to all non-key columns.

Attributes

The training data for kernel density function fitting.

predict(data, key, features=None)

Apply kernel density analysis.

Parameters

dataDataFrame

DataFrame including the data of density prediction.

keystr

Column of IDs of the data points for density prediction.

featureslist of str, optional

Names of the feature columns.

Defaults to all non-key columns.

Returns
DataFrames
• Result data table, i.e. predicted log-density values on all points in data.

• Statistics information table which reflects the support of prediction points over all training points.
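
To make "log-density" concrete, here is a small plain-Python sketch of a Gaussian kernel density estimate evaluated at one point. It uses the textbook formula with a single shared bandwidth; the function name and the toy data are assumptions for this example, and PAL's own normalization, kernels, and tree-based search are not reproduced here.

```python
import math


def gaussian_log_density(train, point, bandwidth):
    """Log of the Gaussian kernel density estimate at `point`.

    Textbook form with n samples in d dimensions and bandwidth h:
        f(x) = sum_i exp(-||x - x_i||^2 / (2 h^2)) / (n * h^d * (2 pi)^(d/2))
    """
    n = len(train)
    d = len(point)
    total = 0.0
    for sample in train:
        sq_dist = sum((p - s) ** 2 for p, s in zip(point, sample))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    norm = n * (bandwidth ** d) * (2.0 * math.pi) ** (d / 2.0)
    return math.log(total / norm)


# Hypothetical 2-D training points and one query point.
train = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
log_density = gaussian_log_density(train, (0.2, 0.1), bandwidth=0.5)
```

Working in log space keeps very small densities, such as those far from all training points, representable as moderate negative numbers.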

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

## hana_ml.algorithms.pal.linear_model¶

This module contains Python wrappers for PAL linear model algorithms.

The following classes are available:

class hana_ml.algorithms.pal.linear_model.LinearRegression(solver=None, var_select=None, features_must_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear regression is an approach to model the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables (the predictor vector).

Parameters
solver{'QR', 'SVD', 'CD', 'Cholesky', 'ADMM'}, optional

Algorithm used to solve the least squares problem. Case-insensitive.

• 'QR': QR decomposition.

• 'SVD': singular value decomposition.

• 'CD': cyclical coordinate descent method.

• 'Cholesky': Cholesky decomposition.

• 'ADMM': alternating direction method of multipliers.

'CD' and 'ADMM' are supported only when var_select is 'all'.

Defaults to QR decomposition.

var_select{'all', 'forward', 'backward', 'stepwise'}, optional

Method to perform variable selection.

• 'all': all variables are included.

• 'forward': forward selection.

• 'backward': backward selection.

• 'stepwise': stepwise selection.

'forward' and 'backward' selection are supported only when solver is 'QR', 'SVD' or 'Cholesky'.

Defaults to 'all'.

features_must_select: str or list of str, optional

Specifies the column name that needs to be included in the final training model when executing the variable selection.

This parameter can be specified multiple times, each time with one column name as feature.

Only valid when var_select is not 'all'.

Note that this parameter is a hint. In exceptional cases, a specified mandatory feature may be excluded from the final model.

For instance, some mandatory features can be represented as a linear combination of other features, among which some are also mandatory features.

No default value.

interceptbool, optional

If true, include the intercept in the model.

Defaults to True.

alpha_to_enterfloat, optional

P-value for forward selection.

Valid only when var_select is 'forward' or 'stepwise'.

Defaults to 0.05 when var_select is 'forward', 0.15 when var_select is 'stepwise'.

alpha_to_removefloat, optional

P-value for backward selection.

Valid only when var_select is 'backward' or 'stepwise'.

Defaults to 0.1 when var_select is 'backward', and 0.15 when var_select is 'stepwise'.

enet_lambdafloat, optional

Penalized weight. Value should be greater than or equal to 0.

Valid only when solver is 'CD' or 'ADMM'.

enet_alphafloat, optional

Elastic net mixing parameter.

Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1.0.
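
The interplay of enet_lambda and enet_alpha can be illustrated with the commonly used elastic net penalty. The sketch below assumes the widespread parameterization lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared); the exact scaling used inside PAL is not stated here and may differ.

```python
def elastic_net_penalty(coefficients, enet_lambda, enet_alpha):
    """Elastic net penalty under the assumed parameterization above."""
    l1 = sum(abs(c) for c in coefficients)
    l2_squared = sum(c * c for c in coefficients)
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2_squared)


coefficients = [1.0, -2.0]
# enet_alpha=1 leaves only the LASSO (L1) term, enet_alpha=0 only the Ridge (L2) term.
lasso_penalty = elastic_net_penalty(coefficients, enet_lambda=0.1, enet_alpha=1.0)
ridge_penalty = elastic_net_penalty(coefficients, enet_lambda=0.1, enet_alpha=0.0)
```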

max_iterint, optional

Maximum number of passes over training data.

If convergence is not reached after the specified number of iterations, an error will be generated.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1e5.

tolfloat, optional

Convergence threshold for coordinate descent.

Valid only when solver is 'CD'.

Defaults to 1.0e-7.

phofloat, optional

Step size for ADMM. Generally, it should be greater than 1.

Valid only when solver is 'ADMM'.

Defaults to 1.8.

stat_infbool, optional

If true, output t-value and Pr(>|t|) of coefficients.

Defaults to False.

adjusted_r2bool, optional

If true, include the adjusted R2 value in statistics.

Defaults to False.

dw_testbool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.
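
For orientation, the Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below shows only the textbook statistic in plain Python; it does not reproduce PAL's output format or any associated p-value, and the function name is chosen for this example.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The statistic lies between 0 and 4; values near 2 are consistent with no first-order autocorrelation, while values toward 0 or 4 point to positive or negative autocorrelation respectively.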

reset_testint, optional

Specifies the order of Ramsey RESET test.

Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted.

Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to 1.

bp_testbool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

ks_testbool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Valid only when solver is 'QR', 'CD', 'Cholesky' or 'ADMM'.

Defaults to 0.0.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

pmml_export{'no', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

• 'no' or not provided: No PMML model.

• 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Prediction does not require a PMML model.

resampling_method{'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation/parameter selection.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

Must be set together with evaluation_metric.

No default value.

evaluation_metric{'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Must be set together with resampling_method.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to 'cv'.

No default value.
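
As a generic illustration of what fold_num controls, the sketch below splits row indices into contiguous folds and yields a train/test partition per fold. It is a plain-Python sketch of k-fold cross validation in general; the function name is hypothetical, no shuffling is done, and PAL's actual partitioning may differ.

```python
def k_fold_indices(n_samples, fold_num):
    """Yield (train_indices, test_indices) for each of `fold_num` folds."""
    if not 2 <= fold_num <= n_samples:
        raise ValueError("fold_num must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, fold_num)
    start = 0
    for fold in range(fold_num):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```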

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda and enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda, enet_alpha.

No default value.

Examples

Training data:

>>> df.collect()
ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2


Training the model:

>>> lr = LinearRegression(thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')


Prediction:

>>> df2.collect()
ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685

Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

optim_param_DataFrame

Optimal parameters selected. Available only when parameter selection is enabled.

Methods

• add_attribute(attr_key, attr_val): Function to add attribute.

• disable_with_hint(): Disable with hint.

• enable_no_inline(): Enable no inline.

• enable_parallel_by_parameter_partitions(): Enable parallel by parameter partitions.

• fit(data[, key, features, label, ...]): Fit regression model based on training data.

• get_fit_execute_statement(): Return the execute_statement for training.

• get_fit_parameters(): Get PAL fit parameters.

• get_parameters(): Parse sql lines containing the parameter definitions.

• get_predict_execute_statement(): Return the execute_statement for predicting.

• get_predict_parameters(): Get PAL predict parameters.

• get_score_execute_statement(): Return the execute_statement for scoring.

• get_score_parameters(): Get PAL score parameters.

• get_store_procedure(): Extract the specific function call of the PAL function from the sql code.

• is_fitted(): Checks if the model can be saved.

• load_model(model): Function to load fitted model.

• predict(data[, key, features]): Predict dependent variable values based on fitted model.

• score(data[, key, features, label]): Returns the coefficient of determination R2 of the prediction.

• set_scale_out([route_to, no_route_to, ...]): HANA statement routing.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit regression model based on training data.

Parameters

dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, then:

• if data is indexed by a single column, then key defaults to that index column;

• otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns
LinearRegression

A fitted object.

predict(data, key=None, features=None)

Predict dependent variable values based on fitted model.

Parameters

dataDataFrame

Independent variable values to predict for.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

• ID column: with the same name and type as data's ID column.

• VALUE: type DOUBLE, representing predicted values.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

dataDataFrame

Data on which to assess model performance.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R2 of the prediction.
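
The coefficient of determination follows the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. A plain-Python sketch of that definition (the function name is chosen for this example, not taken from hana_ml):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A perfect fit scores 1, a model that always predicts the mean scores 0, and a model worse than the mean scores below 0.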

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.linear_model.OnlineLinearRegression(enet_lambda=None, enet_alpha=None, max_iter=None, tol=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Online linear regression is an online version of linear regression and is used when the training data are obtained in multiple rounds. Additional data are obtained in each round of training. By making use of the currently computed linear model and combining it with the data obtained in each round, online linear regression adapts the linear model to make the prediction as precise as possible.
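
The idea of updating a model round by round can be sketched without HANA. The plain-Python example below is not PAL's algorithm: it swaps in ordinary (unpenalized) least squares and keeps the running sums X^T X and X^T y, so each round only touches the new rows. The class and function names are hypothetical; the numbers reuse the X1, X2 and Y values of df_1 and df_2 from the example below.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


class RunningLeastSquares:
    """Keeps X^T X and X^T y so each round only needs the new rows."""

    def __init__(self, n_features):
        size = n_features + 1  # one extra slot for the intercept
        self.xtx = [[0.0] * size for _ in range(size)]
        self.xty = [0.0] * size

    def partial_fit(self, rows, targets):
        for features, y in zip(rows, targets):
            x = [1.0] + list(features)
            for i, xi in enumerate(x):
                self.xty[i] += xi * y
                for j, xj in enumerate(x):
                    self.xtx[i][j] += xi * xj
        return self

    def coefficients(self):
        return solve_linear_system(self.xtx, self.xty)


model = RunningLeastSquares(n_features=2)
model.partial_fit([(7, 26), (1, 29), (11, 56), (11, 31)], [130, 124, 262, 162])
model.partial_fit([(7, 52), (11, 55), (3, 71), (1, 31)], [234, 258, 298, 132])
intercept, b1, b2 = model.coefficients()
```

These sample rows satisfy Y = 5 + 3 * X1 + 4 * X2 exactly, so the unpenalized fit recovers roughly (5, 3, 4). The coefficients reported by PAL in the example are close to but not equal to these values, presumably because of the elastic net penalty and the different update scheme.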

Parameters
enet_lambdafloat, optional

Penalized weight. Value should be greater than or equal to 0.

Defaults to 0.

enet_alphafloat, optional

Elastic net mixing parameter.

Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively.

Defaults to 0.

max_iterint, optional

Maximum iterative cycle.

Defaults to 1000.

tolfloat, optional

Convergence threshold.

Defaults to 1.0e-5.

Examples

First, initialize an online linear regression instance:

>>> onlinelr = OnlineLinearRegression(enet_lambda=0.1,
...                                   enet_alpha=0.5,
...                                   max_iter=1200,
...                                   tol=1E-6)


Three rounds of data:

>>> df_1.collect()
ID      Y    X1    X2     X3
0  1  130.0   7.0  26.0 -888.0
1  2  124.0   1.0  29.0 -888.0
2  3  262.0  11.0  56.0 -888.0
3  4  162.0  11.0  31.0 -888.0

>>> df_2.collect()
ID      Y    X1    X2     X3
0   5  234.0   7.0  52.0 -888.0
1   6  258.0  11.0  55.0 -888.0
2   7  298.0   3.0  71.0 -888.0
3   8  132.0   1.0  31.0 -888.0

>>> df_3.collect()
   ID      Y    X1    X2     X3
0   9  227.0   2.0  54.0 -888.0
1  10  256.0  21.0  47.0 -888.0
2  11  168.0   1.0  40.0 -888.0
3  12  302.0  11.0  66.0 -888.0
4  13  307.0  10.0  68.0 -888.0


Round 1, invoke partial_fit() for training the model with df_1:

>>> onlinelr.partial_fit(df_1, key='ID', label='Y', features=['X1', 'X2'])


Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.076245
1                 X1           2.987277
2                 X2           4.000540

>>> onlinelr.intermediate_result_.collect()
SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...


Round 2, invoke partial_fit() for training the model with df_2:

>>> onlinelr.partial_fit(df_2, key='ID', label='Y', features=['X1', 'X2'])


Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.094444
1                 X1           2.988419
2                 X2           3.999563

>>> onlinelr.intermediate_result_.collect()
SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...


Round 3, invoke partial_fit() for training the model with df_3:

>>> onlinelr.partial_fit(df_3, key='ID', label='Y', features=['X1', 'X2'])


Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.073338
1                 X1           2.994118
2                 X2           3.999389

>>> onlinelr.intermediate_result_.collect()
SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...


Input data for predict():

>>> df_predict.collect()
   ID    X1    X2
0  14     2    67
1  15     3    51


Invoke predict():

>>> fitted = onlinelr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
   ID       VALUE
0  14  279.020611
1  15  218.024511
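As a cross-check, the predicted values can be approximately reproduced by hand from the round-3 coefficients printed above. The sketch assumes the prediction is simply the intercept plus the weighted sum of the features; `predict_row` is a hypothetical helper, and small differences are expected because the printed coefficients are rounded.

```python
# Coefficients as printed after round 3 (rounded to six decimals).
coefficients = {"__PAL_INTERCEPT__": 5.073338, "X1": 2.994118, "X2": 3.999389}


def predict_row(row, coefs):
    """Intercept plus the weighted sum of the feature values."""
    return coefs["__PAL_INTERCEPT__"] + sum(
        coefs[name] * value for name, value in row.items()
    )


rows = {14: {"X1": 2, "X2": 67}, 15: {"X1": 3, "X2": 51}}
predictions = {key: predict_row(row, coefficients) for key, row in rows.items()}
for key, value in predictions.items():
    print(key, round(value, 3))
```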


Call score():

>>> onlinelr.score(df_2, key='ID', label='Y', features=['X1', 'X2'])
0.9999997918249237

Attributes
intermediate_result_DataFrame

Intermediate model.

coefficients_ : DataFrame

Fitted regression coefficients.

Methods

| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse SQL lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the SQL code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| partial_fit(data[, key, features, label, ...]) | Online training based on each round of training data. |
| predict(data[, key, features]) | Predict dependent variable values based on a fitted model. |
| score(data[, key, features, label]) | Returns the coefficient of determination R^2 of the prediction. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

partial_fit(data, key=None, features=None, label=None, thread_ratio=None)

Online training based on each round of training data.

Parameters

dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, then:

• if data is indexed by a single column, then key defaults to that index column;

• otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Returns
OnlineLinearRegression

A fitted object.

predict(data, key=None, features=None)

Predict dependent variable values based on a fitted model.

Parameters

dataDataFrame

Independent variable values to predict for.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

• ID column: with the same name and type as data's ID column.

• VALUE: type DOUBLE, representing predicted values.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters

dataDataFrame

Data on which to assess model performance.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R^2 of the prediction.
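For reference, the coefficient of determination can be computed as follows. This is a sketch assuming the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares; `r2_score` is a hypothetical helper, and the inputs are the df_2 labels with predictions from the rounded round-3 coefficients in the class example.

```python
def r2_score(y_true, y_pred):
    """Standard R^2: 1 - SS_res / SS_tot (undefined when y_true is constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# df_2 from the class example: (X1, X2) features and Y labels.
features = [(7.0, 52.0), (11.0, 55.0), (3.0, 71.0), (1.0, 31.0)]
labels = [234.0, 258.0, 298.0, 132.0]

# Round-3 coefficients as printed (rounded), so the result is approximate.
intercept, coef_x1, coef_x2 = 5.073338, 2.994118, 3.999389
predicted = [intercept + coef_x1 * x1 + coef_x2 * x2 for x1, x2 in features]

r2 = r2_score(labels, predicted)
print(r2)
```

The value lands very close to the score reported in the class example; any small gap comes from the rounded coefficients.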

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_parameters()

Parse SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value for that parameter, and the remaining two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)

get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, for example:

CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters
multi_classbool, optional

If true, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iterint, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

• multi-class: Defaults to 100.

• binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.
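The default rules for max_iter listed above can be written as a small lookup. This is only a restatement of the text; `default_max_iter` is a hypothetical helper, and how the 'auto' solver resolves internally is not covered.

```python
def default_max_iter(multi_class, solver="auto"):
    """Return the documented default for max_iter."""
    if multi_class:
        return 100  # multi-class: always 100
    # Binary-class: depends on the solver, otherwise 100.
    return {"cyclical": 100000, "proximal": 1000}.get(solver, 100)


defaults = {
    (mc, s): default_max_iter(mc, s)
    for mc in (True, False)
    for s in ("auto", "newton", "cyclical", "lbfgs", "stochastic", "proximal")
}
```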

pmml_export{'no', 'single-row', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

• multi-class:

• 'no' or not provided: No PMML model.

• 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

• binary-class:

• 'no' or not provided: No PMML model.

• 'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.

• 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Defaults to 'no'.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) in the data that should be treated as categorical variables.

standardizebool, optional

If true, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_infbool, optional

If true, proceed with statistical inference.

Defaults to False.

solver{'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional

Optimization algorithm.

• 'auto' : automatically determined by system based on input data and parameters.

• 'newton': Newton iteration method, can only solve ridge regression problems.

• 'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.

• 'lbfgs': LBFGS method (recommended when having many independent variables, can only solve ridge regression problems when multi_class is True).

• 'stochastic': Stochastic gradient descent method (recommended when dealing with very large dataset).

• 'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.

When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.

Defaults to 'auto'.

Note

If the enet regularization term contains a LASSO penalty while a solver that can only solve ridge regression problems is specified, then the specified solver will be ignored (hence the default value is used). Users can check the statistical table for the solver that was finally adopted.
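The solver restriction for multi-class problems can be checked up front. A minimal sketch, assuming only the rules stated above; `check_solver` and the two set names are hypothetical and not part of hana_ml.

```python
ALL_SOLVERS = {"auto", "newton", "cyclical", "lbfgs", "stochastic", "proximal"}
MULTI_CLASS_SOLVERS = {"auto", "lbfgs", "cyclical"}


def check_solver(solver, multi_class):
    """Raise ValueError if the solver is not documented as valid for the mode."""
    allowed = MULTI_CLASS_SOLVERS if multi_class else ALL_SOLVERS
    if solver not in allowed:
        raise ValueError(
            f"solver {solver!r} is not valid when multi_class={multi_class}; "
            f"choose one of {sorted(allowed)}"
        )
    return solver


binary_choice = check_solver("newton", multi_class=False)
multi_choice = check_solver("lbfgs", multi_class=True)
```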

enet_alphafloat, optional

Elastic net mixing parameter.

Only valid when multi_class is False and solver is 'auto', 'newton', 'cyclical', 'lbfgs' or 'proximal'.

Defaults to 1.0.

enet_lambdafloat, optional

Penalized weight. Only valid when multi_class is False and solver is 'auto', 'newton', 'cyclical', 'lbfgs' or 'proximal'.

Defaults to 0.0.

tolfloat, optional

Convergence threshold for exiting iterations.

Only valid when multi_class is False.

Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.

epsilonfloat, optional

Determines the accuracy with which the solution is to be found.

Only valid when multi_class is False and the solver is newton or lbfgs.

Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.

thread_ratiofloat, optional

Controls the proportion of available threads to use for fit() method.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 1.0.
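The thread_ratio rules can be pictured with a short sketch. The rounding of fractional results is an assumption, since the text only says that a value between 0 and 1 uses that percentage of the available threads; `threads_for_ratio` is a hypothetical helper.

```python
import math


def threads_for_ratio(thread_ratio, available):
    """Map thread_ratio to a thread count, or None when PAL decides heuristically."""
    if not 0.0 <= thread_ratio <= 1.0:
        return None  # outside [0, 1]: left to PAL's heuristic
    # Assumed rounding: floor, but never fewer than one thread.
    return max(1, math.floor(thread_ratio * available))


single = threads_for_ratio(0.0, available=8)
half = threads_for_ratio(0.5, available=8)
everything = threads_for_ratio(1.0, available=8)
heuristic = threads_for_ratio(1.5, available=8)
```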

max_pass_numberint, optional

The maximum number of passes over the data.

Only valid when multi_class is False and solver is 'stochastic'.

Defaults to 1.

sgd_batch_numberint, optional

The batch number of stochastic gradient descent.

Only valid when multi_class is False and solver is 'stochastic'.

Defaults to 1.

precomputebool, optional

Whether to pre-compute the Gram matrix.

Only valid when solver is 'cyclical'.

Defaults to True.

handle_missingbool, optional

Whether to handle missing values.

Defaults to True.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

By default, string is categorical, while int and double are numerical.

lbfgs_mint, optional

Number of previous updates to keep.

Only applicable when multi_class is False and solver is 'lbfgs'.

Defaults to 6.

resampling_method{'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

The resampling method for model evaluation and parameter selection.

If no value specified, neither model evaluation nor parameter selection is activated.

metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional (deprecated)

The evaluation metric used for model evaluation/parameter selection.

evaluation_metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional

The evaluation metric used for model evaluation/parameter selection.

fold_numint, optional

The number of folds for cross-validation.

Mandatory and valid only when resampling_method is 'cv' or 'stratified_cv'.

repeat_timesint, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

The search method for parameter selection.

random_search_timesint, optional

The number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is 'random'.

random_stateint, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection.

Progress indicator deactivated if no value provided.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.
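As an illustration of param_values with a grid search, the sketch below expands a dict of candidate values into every combination. The dict layout (parameter name mapped to a list of candidates) and the helper `expand_grid` are assumptions for illustration; the actual enumeration happens inside PAL.

```python
from itertools import product

# Hypothetical candidate values for the two selectable parameters.
param_values = {
    "enet_lambda": [0.01, 0.1, 1.0],
    "enet_alpha": [0.0, 0.5, 1.0],
}


def expand_grid(values):
    """Return one dict per combination of candidate values."""
    names = list(values)
    return [
        dict(zip(names, combo))
        for combo in product(*(values[name] for name in names))
    ]


candidates = expand_grid(param_values)
print(len(candidates), candidates[0])
```

A random search would instead sample from such a set of candidates, with random_search_times bounding the number of draws.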

class_map0str, optional (deprecated)

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.
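Conceptually, class_map0 and class_map1 turn a two-valued string label into 0 and 1. The sketch below shows that mapping in plain Python; `encode_labels` and the sample labels are hypothetical, and the real mapping is applied inside PAL.

```python
def encode_labels(labels, class_map0, class_map1):
    """Map the two categorical labels to 0 and 1; reject anything else."""
    mapping = {class_map0: 0, class_map1: 1}
    try:
        return [mapping[label] for label in labels]
    except KeyError as err:
        raise ValueError(f"unmapped label: {err.args[0]!r}") from None


encoded = encode_labels(["yes", "no", "yes"], class_map0="no", class_map1="yes")
```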

Examples

Training data:

>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1


Create LogisticRegression instance and call fit:

>>> lr = linear_model.LogisticRegression(solver='newton',
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
0                                  __PAL_INTERCEPT__    17.044785
1                                 V1__PAL_DELIMIT__A     0.000000
2                                 V1__PAL_DELIMIT__B    -1.464903
3                                                 V2    -4.819740
4                                 V3__PAL_DELIMIT__0     0.000000
5                                 V3__PAL_DELIMIT__1    -2.794139
6                                 V3__PAL_DELIMIT__2    -4.807858
7                                 V3__PAL_DELIMIT__3    -2.780918
8  {"CONTENT":"{\"impute_model\":{\"column_statis...          NaN

Input data for predict():

>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1


Call predict():

>>> result = lr.predict(data=pred_df,
...                     key='ID',
...                     categorical_variable=['V3'])
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503618e-01
1    1     1  8.485210e-01
2    2     1  9.555861e-01
3    3     0  3.701858e-02
4    4     0  2.229129e-02
5    5     0  2.503962e-01
6    6     0  4.945832e-02
7    7     0  9.922085e-03
8    8     0  2.852859e-01
9    9     0  2.689207e-01
10  10     0  2.200498e-02
11  11     0  4.713726e-03
12  12     0  2.349803e-02
13  13     0  5.830425e-04
14  14     0  4.886177e-07
15  15     0  6.938072e-06
16  16     0  1.637820e-04
17  17     1  8.986435e-01
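The PROBABILITY column can be approximately reproduced from the coefficient table shown earlier. The sketch assumes a standard logistic (sigmoid) model in which each categorical level contributes its own coefficient; `positive_probability` is a hypothetical helper, and the coefficients are the rounded values printed by lr.coef_.collect().

```python
import math

# Coefficients as printed above (rounded to six decimals).
coef = {
    "__PAL_INTERCEPT__": 17.044785,
    "V1__PAL_DELIMIT__A": 0.0,
    "V1__PAL_DELIMIT__B": -1.464903,
    "V2": -4.819740,
    "V3__PAL_DELIMIT__0": 0.0,
    "V3__PAL_DELIMIT__1": -2.794139,
    "V3__PAL_DELIMIT__2": -4.807858,
    "V3__PAL_DELIMIT__3": -2.780918,
}


def positive_probability(v1, v2, v3):
    """Sigmoid of the linear predictor: probability of the positive class."""
    z = (
        coef["__PAL_INTERCEPT__"]
        + coef[f"V1__PAL_DELIMIT__{v1}"]   # categorical level of V1
        + coef["V2"] * v2                  # numerical feature
        + coef[f"V3__PAL_DELIMIT__{v3}"]   # categorical level of V3
    )
    return 1.0 / (1.0 + math.exp(-z))


p_row0 = positive_probability("B", 2.620, 0)  # ID 0 in pred_df
p_row3 = positive_probability("A", 3.215, 2)  # ID 3 in pred_df
print(p_row0, p_row3)
```

Both values agree with the corresponding PROBABILITY entries to several decimal places, which is consistent with the binary-class column holding the probability of the positive class.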


Input data for score():

>>> df_score.collect()
    ID V1     V2  V3  CATEGORY
0    0  B  2.620   0         1
1    1  B  2.875   0         1
2    2  A  2.320   1         1
3    3  A  3.215   2         0
4    4  B  3.440   3         0
5    5  B  3.460   0         0
6    6  A  3.570   1         1
7    7  B  3.190   2         0
8    8  A  3.150   3         0
9    9  B  3.440   0         0
10  10  B  3.440   1         0
11  11  A  4.070   3         0
12  12  A  3.730   1         0
13  13  B  3.780   2         0
14  14  B  5.250   2         0
15  15  A  5.424   3         0
16  16  A  5.345   0         0
17  17  B  2.200   1         1


Call score():

>>> lr.score(data=df_score,
...          key='ID',
...          categorical_variable=['V3'])
0.944444

Attributes
coef_DataFrame

Values of the coefficients.

result_DataFrame

Model content.

optim_param_DataFrame

The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_DataFrame

Statistics info for the trained model, structured as follows:

• 1st column: 'STAT_NAME', NVARCHAR(256)

• 2nd column: 'STAT_VALUE', NVARCHAR(1000)

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

Methods

| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| fit(data[, key, features, label, ...]) | Fit the LR model when given training dataset. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse SQL lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the SQL code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| predict(data[, key, features, ...]) | Predict with the dataset using the trained model. |
| score(data[, key, features, label, ...]) | Return the mean accuracy on the given test data and labels. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

If key is not provided, then:

• if data is indexed by a single column, then key defaults to that index column;

• otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise, all INTEGER columns are treated as numerical.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns
LogisticRegression

A fitted object.

predict(data, key=None, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False)

Predict with the dataset using the trained model.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

verbosebool, optional

If true, output scoring probabilities for each class.

It is only applicable for multi-class case.

Defaults to False.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise, all INTEGER columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns
DataFrame

Predicted result, structured as follows:

• 1: ID column.

• 2: CLASS, predicted class name.

• 3: PROBABILITY, type DOUBLE

• multi-class: probability of being predicted as the predicted class.

• binary-class: probability of being predicted as the positive class.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.

score(data, key=None, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER columns that should be treated as categorical; otherwise, all INTEGER columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns
float

Scalar accuracy value after comparing the predicted label and original label.
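The mean accuracy is the fraction of rows whose predicted label equals the original label. The sketch below applies that definition to the class example, using the CLASS column from the predict() output and the CATEGORY column from df_score; `mean_accuracy` is a hypothetical helper.

```python
# CLASS column from the predict() example (IDs 0 to 17).
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# CATEGORY column from df_score (IDs 0 to 17); ID 6 differs from the prediction.
actual = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def mean_accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy = mean_accuracy(actual, predicted)
print(round(accuracy, 6))
```

Seventeen of the eighteen rows match, which corresponds to the 0.944444 shown in the score() example.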

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_parameters()

Parse SQL lines containing the parameter definitions. In the SQL code, all parameters are defined by four arrays: the first contains the parameter name, one of the other three contains the value for that parameter, and the remaining two are NULL. This format is converted into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)

get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)

get_store_procedure()

Extract the specific function call of the PAL function from the SQL code. Note that it only detects the synonym, which has to be resolved afterwards.

Returns
The procedure name synonym, for example:

CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> "SYS_AFL.PAL_RANDOM_FORREST"

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

class hana_ml.algorithms.pal.linear_model.OnlineMultiLogisticRegression(class_label, init_learning_rate=None, decay=None, drop_rate=None, step_boundaries=None, constant_values=None, enet_alpha=None, enet_lambda=None, shuffle=None, shuffle_seed=None, weight_avg=None, weight_avg_begin=None, learning_rate_type=None, general_learning_rate=None, stair_case=None, cycle=None, epsilon=None, window_size=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is the online version of Multi-Class Logistic Regression, whereas Multi-Class Logistic Regression itself is the offline/batch version. The difference lies in the training phase: the offline/batch algorithm requires all training data to be fed in as one batch and then tries to output the single model that best fits that data. This implies that the computer must have enough memory to store all the data and must be able to obtain all the data in one batch. The online version applies to scenarios in which either or both of these assumptions do not hold.

Parameters
class_labela list of str

Specifies the class labels. At least two class labels must be given.

init_learning_ratefloat

The initial learning rate for learning rate schedule. Value should be larger than 0.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', 'Polynomial_decay'.

decayfloat

Specify the learning rate decay speed for learning rate schedule. A larger value indicates faster decay. Value should be larger than 0. When learning_rate_type is 'Exponential_decay', value should be larger than 1.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', 'Polynomial_decay'.

drop_rateint

Specify the decay frequency. The effect is apparent when stair_case is True. Value should be larger than 0.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', 'Polynomial_decay'.

step_boundariesstr, optional

Specify the step boundaries for regions where the step size remains constant. The format of this parameter is a comma-separated list of unsigned integer values. Step values start from 0 and should be in increasing order. An empty value for this parameter is allowed.

Only valid when learning_rate_type is 'Piecewise_constant_decay'.

constant_valuesstr, optional

Specifies the constant step size for each region defined by step_boundaries. The format is a comma-separated list of double values. There must always be exactly one more value than in step_boundaries.

Only valid when learning_rate_type is 'Piecewise_constant_decay'.
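The relationship between step_boundaries and constant_values can be sketched in plain Python. This is only an illustration of the format described above, not PAL's implementation; the helper name `piecewise_constant_rate` is hypothetical, and which region a step exactly on a boundary belongs to is an assumption.

```python
from bisect import bisect_right


def piecewise_constant_rate(step, step_boundaries, constant_values):
    """Return the step size for `step` (hypothetical helper, not part of hana_ml)."""
    # Both parameters are comma-separated strings; step_boundaries may be empty.
    boundaries = [int(b) for b in step_boundaries.split(",") if b.strip()]
    values = [float(v) for v in constant_values.split(",")]
    if boundaries != sorted(boundaries):
        raise ValueError("step_boundaries must be in increasing order")
    if len(values) != len(boundaries) + 1:
        raise ValueError("constant_values needs one more entry than step_boundaries")
    # Assumption: a step equal to a boundary already belongs to the next region.
    return values[bisect_right(boundaries, step)]


# Steps 0-4 use 0.1, steps 5-9 use 0.05, steps from 10 onward use 0.01.
rates = [piecewise_constant_rate(s, "5,10", "0.1,0.05,0.01") for s in (0, 4, 5, 10, 100)]
```

With an empty step_boundaries string, a single constant value applies to every step.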

enet_alphafloat, optional

Elastic-Net mixing parameter. The valid range is [0, 1]. A value of 0 gives the Ridge penalty; a value of 1 gives the Lasso penalty.

Only valid when enet_lambda is not 0.0.

Defaults to 1.0.

enet_lambdafloat, optional

Penalty constant. The value should be larger than or equal to 0.0. The higher the value, the stronger the regularization. When it equals 0.0, there is no regularization.

Defaults to 0.0.
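How enet_alpha and enet_lambda interact can be illustrated with the commonly used elastic-net penalty. The sketch below assumes the widespread convention of halving the squared L2 term; the exact scaling PAL applies is not stated here, and the function name is hypothetical.

```python
def elastic_net_penalty(weights, enet_lambda=0.0, enet_alpha=1.0):
    """Elastic-net penalty under an assumed, commonly used convention."""
    l1 = sum(abs(w) for w in weights)   # Lasso part
    l2 = sum(w * w for w in weights)    # Ridge part (squared L2 norm)
    # Assumption: lambda * (alpha * L1 + (1 - alpha) / 2 * L2).
    return enet_lambda * (enet_alpha * l1 + (1.0 - enet_alpha) * 0.5 * l2)


lasso_only = elastic_net_penalty([1.0, -2.0], enet_lambda=0.5, enet_alpha=1.0)
ridge_only = elastic_net_penalty([1.0, -2.0], enet_lambda=0.5, enet_alpha=0.0)
no_penalty = elastic_net_penalty([1.0, -2.0], enet_lambda=0.0, enet_alpha=0.2)
```

With enet_lambda equal to 0.0 the penalty vanishes regardless of enet_alpha, which is why enet_alpha is only valid for a non-zero enet_lambda.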

shufflebool, optional

Specifies whether to shuffle the row order of the observation data. False keeps the original order; True performs the shuffle.

Defaults to False.

shuffle_seedint, optional

Seed used to initialize the random generator for the shuffle operation. The value should be larger than or equal to 0. To reproduce results when shuffling, set this to a non-zero value. Only valid when shuffle is True.

Defaults to 0.

weight_avgbool, optional

Specifies whether to apply an averaging operator to the output model. False outputs the model directly; True averages the output model. Currently only Polyak-Ruppert averaging is supported.

Defaults to False.

weight_avg_beginint, optional

Specifies the step counter at which averaging over the model begins. The value should be larger than or equal to 0. While the current step counter is less than this value, the model is output directly. Only valid when weight_avg is True.

Defaults to 0.
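The effect of weight_avg and weight_avg_begin can be sketched as a running Polyak-Ruppert average over the weight iterates. This is a minimal illustration under the description above, not PAL's code; the function name and the list-of-lists representation of the weights are assumptions.

```python
def polyak_ruppert_outputs(iterates, weight_avg_begin=0):
    """Model output per step: raw weights before the begin step, running average after."""
    outputs, average, count = [], None, 0
    for step, weights in enumerate(iterates):
        if step < weight_avg_begin:
            # Before the begin step the current weights are output directly.
            outputs.append(list(weights))
            continue
        count += 1
        if average is None:
            average = list(weights)
        else:
            # Incremental mean over all iterates seen since the begin step.
            average = [a + (w - a) / count for a, w in zip(average, weights)]
        outputs.append(list(average))
    return outputs


outputs = polyak_ruppert_outputs([[0.0], [2.0], [4.0]], weight_avg_begin=1)
```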

learning_rate_typestr, optional

Specifies the learning rate type for the SGD algorithm.

• 'Inverse_time_decay'

• 'Exponential_decay'

• 'Polynomial_decay'

• 'Piecewise_constant_decay'

• 'RMSProp'

Defaults to 'RMSProp'.

general_learning_ratefloat, optional

Specifies the general learning rate used in AdaGrad and RMSProp. The value should be larger than 0.

Only valid when learning_rate_type is 'AdaGrad', 'RMSProp'.

Defaults to 0.001.

stair_casebool, optional

Specifies how the step size drops. False means the step size drops smoothly.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay'.

Defaults to False.

cyclebool, optional

Specifies whether to cycle from the start when the specified end learning rate is reached. False means do not cycle from the start; True means cycle from the start.

Only valid when learning_rate_type is 'Polynomial_decay'.

Defaults to False.

epsilonfloat, optional

This parameter serves different purposes depending on the learning rate type. The value should be within (0, 1). For learning rate types 0 and 1, it represents the smallest allowable step size; once the step size reaches this value, it no longer changes. For learning_rate_type 'Polynomial_decay', it represents the end learning rate. For learning_rate_type 'AdaGrad', 'AdaDelta' or 'RMSProp', it is used to avoid division by 0.

Only valid when learning_rate_type is not 'Piecewise_constant_decay'.

Defaults to 1E-8.

window_sizefloat, optional

This parameter controls the moving window size of recent steps. The value should be in the range (0, 1). A larger value means more steps are tracked.

Only valid when learning_rate_type is 'AdaDelta', 'RMSProp'.

Defaults to 0.9.
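The roles of general_learning_rate, window_size and epsilon under 'RMSProp' can be illustrated with the textbook RMSProp update for a single weight. This is a sketch of the standard rule, assumed here for illustration; PAL's internal update (for example, where exactly epsilon is added) may differ, and the function name is hypothetical.

```python
import math


def rmsprop_step(weight, gradient, cache,
                 general_learning_rate=0.001, window_size=0.9, epsilon=1e-8):
    """One textbook RMSProp update for a scalar weight (illustrative only)."""
    # Moving average of squared gradients; window_size controls how much history is kept.
    cache = window_size * cache + (1.0 - window_size) * gradient ** 2
    # epsilon keeps the denominator away from zero (placement is an assumption).
    weight = weight - general_learning_rate * gradient / (math.sqrt(cache) + epsilon)
    return weight, cache


new_weight, new_cache = rmsprop_step(1.0, 2.0, 0.0, general_learning_rate=0.1)
```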

Examples

First, initialize an online multi logistic regression instance:

>>> omlr = OnlineMultiLogisticRegression(class_label=['0','1','2'], enet_lambda=0.01,
...                                      enet_alpha=0.2, weight_avg=True,
...                                      weight_avg_begin=8, learning_rate_type='rmsprop',
...                                      general_learning_rate=0.1,
...                                      window_size=0.9, epsilon=1e-6)


Four rounds of data:

>>> df_1.collect()
X1        X2    Y
0  1.160456 -0.079584  0.0
1  1.216722 -1.315348  2.0
2  1.018474 -0.600647  1.0
3  0.884580  1.546115  1.0
4  2.432160  0.425895  1.0
5  1.573506 -0.019852  0.0
6  1.285611 -2.004879  1.0
7  0.478364 -1.791279  2.0

>>> df_2.collect()
X1        X2    Y
0 -1.799803  1.225313  1.0
1  0.552956 -2.134007  2.0
2  0.750153 -1.332960  2.0
3  2.024223 -1.406925  2.0
4  1.204173 -1.395284  1.0
5  1.745183  0.647891  0.0
6  1.406053  0.180530  0.0
7  1.880983 -1.627834  2.0

>>> df_3.collect()
X1        X2    Y
0  1.860634 -2.474313  2.0
1  0.710662 -3.317885  2.0
2  1.153588  0.539949  0.0
3  1.297490 -1.811933  2.0
4  2.071784  0.351789  0.0
5  1.552456  0.550787  0.0
6  1.202615 -1.256570  2.0
7 -2.348316  1.384935  1.0

>>> df_4.collect()
X1        X2    Y
0 -2.132380  1.457749  1.0
1  0.549665  0.174078  1.0
2  1.422629  0.815358  0.0
3  1.318544  0.062472  0.0
4  0.501686 -1.286537  1.0
5  1.541711  0.737517  1.0
6  1.709486 -0.036971  0.0
7  1.708367  0.761572  0.0


Round 1, invoke partial_fit() for training the model with df_1:

>>> omlr.partial_fit(df_1, label='Y', features=['X1', 'X2'])


Output:

>>> omlr.coef_.collect()
VARIABLE_NAME CLASSLABEL  COEFFICIENT
0  __PAL_INTERCEPT__          0    -0.245137
1  __PAL_INTERCEPT__          1     0.112396
2  __PAL_INTERCEPT__          2    -0.236284
3                 X1          0    -0.189930
4                 X1          1     0.218920
5                 X1          2    -0.372500
6                 X2          0     0.279547
7                 X2          1     0.458214
8                 X2          2    -0.185378

>>> omlr.online_result_.collect()
SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.0...


Round 2, invoke partial_fit() for training the model with df_2:

>>> omlr.partial_fit(df_2, label='Y', features=['X1', 'X2'])


Output:

>>> omlr.coef_.collect()
VARIABLE_NAME CLASSLABEL  COEFFICIENT
0  __PAL_INTERCEPT__          0    -0.359296
1  __PAL_INTERCEPT__          1     0.163218
2  __PAL_INTERCEPT__          2    -0.182423
3                 X1          0    -0.045149
4                 X1          1    -0.046508
5                 X1          2    -0.122690
6                 X2          0     0.420425
7                 X2          1     0.594954
8                 X2          2    -0.451050

>>> omlr.online_result_.collect()
SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[-0....


Round 3, invoke partial_fit() for training the model with df_3:

>>> omlr.partial_fit(df_3, label='Y', features=['X1', 'X2'])


Output:

>>> omlr.coef_.collect()
VARIABLE_NAME CLASSLABEL  COEFFICIENT
0  __PAL_INTERCEPT__          0    -0.225687
1  __PAL_INTERCEPT__          1     0.031453
2  __PAL_INTERCEPT__          2    -0.173944
3                 X1          0     0.100580
4                 X1          1    -0.208257
5                 X1          2    -0.097395
6                 X2          0     0.628975
7                 X2          1     0.576544
8                 X2          2    -0.582955

>>> omlr.online_result_.collect()
SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.1...


Round 4, invoke partial_fit() for training the model with df_4:

>>> omlr.partial_fit(df_4, label='Y', features=['X1', 'X2'])


Output:

>>> omlr.coef_.collect()
VARIABLE_NAME CLASSLABEL  COEFFICIENT
0  __PAL_INTERCEPT__          0    -0.204118
1  __PAL_INTERCEPT__          1     0.071965
2  __PAL_INTERCEPT__          2    -0.263698
3                 X1          0     0.239740
4                 X1          1    -0.326290
5                 X1          2    -0.139859
6                 X2          0     0.696389
7                 X2          1     0.590014
8                 X2          2    -0.643752

>>> omlr.online_result_.collect()
SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.2...


Data for prediction, df_predict:

>>> df_predict.collect()
ID   X1   X2
0   1  1.2  0.7
1   2  1.0 -2.0


Invoke predict():

>>> fitted = omlr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
ID CLASS  PROBABILITY
0   1     0     0.539350
1   2     2     0.830026
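The four rounds above repeat the same partial_fit() call, so they can be wrapped in a small loop. The helper below is a hypothetical convenience, not part of hana_ml; it only relies on the partial_fit(data, label=..., features=...) call shown in the example.

```python
def fit_in_rounds(model, rounds, label, features):
    """Feed each round of data to model.partial_fit() in order (hypothetical helper)."""
    for data in rounds:
        # Each call updates the model with one more round of data.
        model.partial_fit(data, label=label, features=features)
    return model
```

Under these assumptions, the example would become `fit_in_rounds(omlr, [df_1, df_2, df_3, df_4], label='Y', features=['X1', 'X2'])`, after which coef_ and online_result_ reflect the state after the last round.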

Attributes
coef_DataFrame

Values of the coefficients.

online_result_ : DataFrame

Online Model content.

Methods

| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse sql lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the sql code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| partial_fit(data[, key, features, label, ...]) | Online training based on each round of data. |
| predict(data[, key, features]) | Predict dependent variable values based on a fitted model. |
| score(data[, key, features, label]) | Returns the coefficient of determination R2 of the prediction. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

partial_fit(data, key=None, features=None, label=None, thread_ratio=None, progress_indicator_id=None)

Online training based on each round of data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, then:

• if data is indexed by a single column, then key defaults to that index column;

• otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection.

Progress indicator deactivated if no value provided.

Returns
OnlineMultiLogisticRegression

A fitted object.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

predict(data, key=None, features=None)

Predict dependent variable values based on a fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

• ID column: with same name and type as data 's ID column.

• VALUE: type DOUBLE, representing predicted values.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R2 of the prediction.

This module contains the Python wrapper for the PAL link prediction function.

The following class is available:

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked, which is helpful for predicting missing links (the higher the proximity score, the more likely the two nodes are to be linked).

Parameters

Method for computing the proximity between 2 nodes that are not directly linked.

betafloat, optional

A parameter included in the calculation of the Katz similarity (proximity) score. Valid only when method is 'katz'.

Defaults to 0.005.

min_scorefloat, optional

The links whose scores are lower than min_score will be filtered out from the result table.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
NODE1  NODE2
0      1      2
1      1      4
2      2      3
3      3      4
4      5      1
5      6      2
6      7      4
7      7      5
8      6      7
9      5      4


>>> lp = LinkPrediction(method='common_neighbors',
...                     beta=0.005,
...                     min_score=0)


Calculate the proximity score of all nodes in the network with missing links, and check the result:

>>> res = lp.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res.collect()
NODE1  NODE2     SCORE
0       1      3  0.285714
1       1      6  0.142857
2       1      7  0.285714
3       2      4  0.285714
4       2      5  0.142857
5       2      7  0.142857
6       4      6  0.142857
7       3      5  0.142857
8       3      6  0.142857
9       3      7  0.142857
10      5      6  0.142857
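The scores in this table can be reproduced in plain Python under one assumption inferred from the output itself: for method 'common_neighbors', the score of an unlinked pair equals the number of neighbors the two nodes share divided by the total number of nodes (7 here). This is a sketch for intuition only, not PAL's implementation, and the function name is hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges, min_score=0.0):
    """Score every unlinked node pair by shared neighbors / node count (assumed formula)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n_nodes = len(neighbors)
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already linked, so not a missing link
        score = len(neighbors[a] & neighbors[b]) / n_nodes
        if score >= min_score:  # pairs scoring below min_score are filtered out
            scores[(a, b)] = score
    return scores


# The links from the input dataframe above.
edges = [(1, 2), (1, 4), (2, 3), (3, 4), (5, 1),
         (6, 2), (7, 4), (7, 5), (6, 7), (5, 4)]
scores = common_neighbor_scores(edges)
```

For example, nodes 1 and 3 share the neighbors 2 and 4, giving 2/7, while nodes 5 and 6 share only node 7, giving 1/7.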


Methods

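| Method | Description |
| --- | --- |
| add_attribute(attr_key, attr_val) | Function to add attribute. |
| disable_with_hint() | Disable with hint. |
| enable_no_inline() | Enable no inline. |
| enable_parallel_by_parameter_partitions() | Enable parallel by parameter partitions. |
| get_fit_execute_statement() | Return the execute_statement for training. |
| get_fit_parameters() | Get PAL fit parameters. |
| get_parameters() | Parse sql lines containing the parameter definitions. |
| get_predict_execute_statement() | Return the execute_statement for predicting. |
| get_predict_parameters() | Get PAL predict parameters. |
| get_score_execute_statement() | Return the execute_statement for scoring. |
| get_score_parameters() | Get PAL score parameters. |
| get_store_procedure() | Extract the specific function call of the PAL function from the sql code. |
| is_fitted() | Checks if the model can be saved. |
| load_model(model) | Function to load fitted model. |
| proximity_score(data[, node1, node2]) | For predicting proximity scores between nodes under current choice of method. |
| set_scale_out([route_to, no_route_to, ...]) | HANA statement routing. |
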
proximity_score(data, node1=None, node2=None)

For predicting proximity scores between nodes under current choice of method.

Parameters
dataDataFrame

Network data with nodes and links.

Nodes are in columns while links in rows, where each link is represented by a pair of adjacent nodes as (node1, node2).

node1str, optional

Column name of data that gives node1 of all available links (see data).

Defaults to the name of the first column of data if not provided.

node2str, optional

Column name of data that gives node2 of all available links (see data).

Defaults to the name of the last column of data if not provided.

Returns
DataFrame

The proximity scores of pairs of nodes with missing links between them that are above 'min_score', structured as follows:

• 1st column: node1 of a link

• 2nd column: node2 of a link

• 3rd column: proximity score of the two nodes

add_attribute(attr_key, attr_val)

Function to add attribute.

disable_with_hint()

Disable with hint.

enable_no_inline()

Enable no inline.

enable_parallel_by_parameter_partitions()

Enable parallel by parameter partitions.

get_fit_execute_statement()

Return the execute_statement for training.

get_fit_parameters()

Get PAL fit parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_parameters()

Parse sql lines containing the parameter definitions. In the sql code all the parameters are defined by four arrays, where the first one contains the parameter name, and one of the other three contains the value fitting to the parameter, while the other two are NULL. This format should be changed into a simple key-value based storage.

Returns
dict of array of tuples, where each tuple describes a parameter like (name, value, type)
get_predict_execute_statement()

Return the execute_statement for predicting.

get_predict_parameters()

Get PAL predict parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_score_execute_statement()

Return the execute_statement for scoring.

get_score_parameters()

Get PAL score parameters.

Returns
Array of tuples, where each tuple describes a parameter like (name, value, type)
get_store_procedure()

Extract the specific function call of the PAL function from the sql code. Note that it only detects the synonyms, which have to be resolved afterwards.

Returns
The procedure name synonym
CALL "SYS_AFL.PAL_RANDOM_FORREST" (...) -> SYS_AFL.PAL_RANDOM_FORREST"
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

HANA statement routing.

Parameters
route_tostr, optional

Routes the query to the specified volume ID or service type.

no_route_tostr or list of str, optional

Avoids query routing to a specified volume ID or service type.

route_bystr, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

route_by_cardinalitystr or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_costint, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

route_optimization_level{'minimal', 'all'}, optional

Guides the optimizer to compile with ROUTE_OPTIMIZATION_LEVEL (MINIMAL) or to default to ROUTE_OPTIMIZATION_LEVEL. If the MINIMAL compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_classstr, optional

Routes the query via workload class. ROUTE_TO() statement hint has higher precedence than WORKLOAD_CLASS() statement hint.

## hana_ml.algorithms.pal.metrics¶

This module contains Python wrappers for PAL metrics to assess the quality of model outputs.

The following functions are available:

hana_ml.algorithms.pal.metrics.confusion_matrix(data, key, label_true=None, label_pred=None, beta=None, native=False)

Computes confusion matrix to evaluate the accuracy of a classification.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

label_truestr, optional

Name of the original label column.

If not given, defaults to the second column.

label_predstr, optional

Name of the predicted label column.

If not given, defaults to the third column.

betafloat, optional

Parameter used to compute the F-Beta score.

Defaults to 1.

nativebool, optional

Indicates whether to use native sql statements for confusion matrix calculation.

Defaults to False.

Returns
DataFrame
Confusion matrix, structured as follows:
• Original label, with same name and data type as it is in data.

• Predicted label, with same name and data type as it is in data.

• Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.

The DataFrame is sorted by (original label, predicted label) in descending order.

Classification report table, structured as follows:
• Class, type NVARCHAR(100), class name

• Recall, type DOUBLE, the recall of each class

• Precision, type DOUBLE, the precision of each class

• F_MEASURE, type DOUBLE, the F_measure of each class

• SUPPORT, type INTEGER, the support - sample number in each class

Examples

DataFrame df containing the original and predicted labels:

>>> df.collect()
ID  ORIGINAL  PREDICT
0   1         1        1
1   2         1        1
2   3         1        1
3   4         1        2
4   5         1        1
5   6         2        2
6   7         2        1
7   8         2        2
8   9         2        2
9  10         2        2


Calculate the confusion matrix:

>>> cm, cr = confusion_matrix(data=df, key='ID', label_true='ORIGINAL', label_pred='PREDICT')


Output:

>>> cm.collect()
ORIGINAL  PREDICT  COUNT
0         1        1      4
1         1        2      1
2         2        1      1
3         2        2      4
>>> cr.collect()
CLASS  RECALL  PRECISION  F_MEASURE  SUPPORT
0     1     0.8        0.8        0.8        5
1     2     0.8        0.8        0.8        5
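The two outputs can be cross-checked with a plain-Python sketch that uses the standard definitions of recall, precision and F-beta. It is an illustration only and does not call PAL; the function name and the dictionary layout of the report are assumptions.

```python
from collections import Counter


def confusion_report(y_true, y_pred, beta=1.0):
    """Confusion counts and per-class recall, precision, F-beta and support."""
    counts = Counter(zip(y_true, y_pred))  # (original, predicted) -> count
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = counts[(cls, cls)]
        support = sum(n for (t, _), n in counts.items() if t == cls)
        predicted = sum(n for (_, p), n in counts.items() if p == cls)
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = beta ** 2 * precision + recall
        f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        report[cls] = {"recall": recall, "precision": precision,
                       "f_measure": f_measure, "support": support}
    return counts, report


# ORIGINAL and PREDICT columns from the example above.
original = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
predict = [1, 1, 1, 2, 1, 2, 1, 2, 2, 2]
cm_counts, cr_stats = confusion_report(original, predict)
```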

hana_ml.algorithms.pal.metrics.auc(data, positive_label=None, output_threshold=None)

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

Parameters
dataDataFrame

Input data, structured as follows:

• ID column.

• True class of the data point.

• Classifier-computed probability that the data point belongs to the positive class.

positive_labelstr, optional

If original label is not 0 or 1, specifies the label value which will be mapped to 1.

output_thresholdbool, optional

Specifies whether or not to output the corresponding threshold values in the ROC table.

Defaults to False.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

• ID column, type INTEGER.

• FPR, type DOUBLE, representing false positive rate.

• TPR, type DOUBLE, representing true positive rate.

• THRESHOLD, type DOUBLE, representing the corresponding threshold value, available only when output_threshold is set to True.

Examples

Input DataFrame df:

>>> df.collect()
ID  ORIGINAL  PREDICT
0   1         0     0.07
1   2         0     0.01
2   3         0     0.85
3   4         0     0.30
4   5         0     0.50
5   6         1     0.50
6   7         1     0.20
7   8         1     0.80
8   9         1     0.20
9  10         1     0.95


Compute Area Under Curve:

>>> auc_value, roc = auc(data=df)


Output:

>>> print(auc_value)
0.66

>>> roc.collect()
ID  FPR  TPR
0   0  1.0  1.0
1   1  0.8  1.0
2   2  0.6  1.0
3   3  0.6  0.6
4   4  0.4  0.6
5   5  0.2  0.4
6   6  0.2  0.2
7   7  0.0  0.2
8   8  0.0  0.0
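The reported AUC is consistent with integrating the ROC table by the trapezoidal rule, which can be checked in plain Python. This is a sketch for verification only; it is an assumption that PAL computes the area this way, and the function name is hypothetical.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve given as points, using the trapezoidal rule."""
    points = sorted(zip(fpr, tpr))  # order the points by increasing FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# FPR and TPR columns of the roc table above.
fpr = [1.0, 0.8, 0.6, 0.6, 0.4, 0.2, 0.2, 0.0, 0.0]
tpr = [1.0, 1.0, 1.0, 0.6, 0.6, 0.4, 0.2, 0.2, 0.0]
area = trapezoid_auc(fpr, tpr)
```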

hana_ml.algorithms.pal.metrics.multiclass_auc(data_original, data_predict)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

Parameters
data_originalDataFrame

True class data, structured as follows:

• Data point ID column.

• True class of the data point.

data_predictDataFrame

Predicted class data, structured as follows:

• Data point ID column.

• Possible class.

• Classifier-computed probability that the data point belongs to that particular class.

For each data point ID, there should be one row for each possible class.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

• ID column, type INTEGER.

• FPR, type DOUBLE, representing false positive rate.

• TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrames df_original and df_predict:

>>> df_original.collect()
ID  ORIGINAL
0   1         1
1   2         1
2   3         1
3   4         2
4   5         2
5   6         2
6   7         3
7   8         3
8   9         3
9  10         3

>>> df_predict.collect()
ID  PREDICT  PROB
0    1        1  0.90
1    1        2  0.05
2    1        3  0.05
3    2        1  0.80
4    2        2  0.05
5    2        3  0.15
6    3        1  0.80
7    3        2  0.10
8    3        3  0.10
9    4        1  0.10
10   4        2  0.80
11   4        3  0.10
12   5        1  0.20
13   5        2  0.70
14   5        3  0.10
15   6        1  0.05
16   6        2  0.90
17   6        3  0.05
18   7        1  0.10
19   7        2  0.10
20   7        3  0.80
21   8        1  0.00
22   8        2  0.00
23   8        3  1.00
24   9        1  0.20
25   9        2  0.10
26   9        3  0.70
27  10        1  0.20
28  10        2  0.20
29  10        3  0.60


Compute Area Under Curve:

>>> auc_value, roc = multiclass_auc(data_original=df_original, data_predict=df_predict)


Output:

>>> print(auc_value)
1.0

>>> roc.collect()
ID   FPR  TPR
0    0  1.00  1.0
1    1  0.90  1.0
2    2  0.65  1.0
3    3  0.25  1.0
4    4  0.20  1.0
5    5  0.00  1.0
6    6  0.00  0.9
7    7  0.00  0.7
8    8  0.00  0.3
9    9  0.00  0.1
10  10  0.00  0.0

hana_ml.algorithms.pal.metrics.accuracy_score(data, label_true, label_pred)

Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.

Parameters
dataDataFrame

DataFrame of true and predicted labels.

label_truestr

Name of the column containing ground truth labels.

label_predstr

Name of the column containing predicted labels, as returned by a classifier.

Returns
float

Accuracy classification score. A lower accuracy indicates that the classifier was able to predict less of the labels in the input correctly.

Examples

Actual and predicted labels df for a hypothetical classification:

>>> df.collect()
ACTUAL  PREDICTED
0    1        0
1    0        0
2    0        0
3    1        1
4    1        1


Accuracy score for these predictions:

>>> accuracy_score(data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.8


Compare that to null accuracy df_dummy (accuracy that could be achieved by always predicting the most frequent class):

>>> df_dummy.collect()
ACTUAL  PREDICTED
0    1       1
1    0       1
2    0       1
3    1       1
4    1       1
>>> accuracy_score(data=df_dummy, label_true='ACTUAL', label_pred='PREDICTED')
0.6


A perfect predictor df_perfect:

>>> df_perfect.collect()
ACTUAL  PREDICTED
0    1       1
1    0       0
2    0       0
3    1       1
4    1       1
>>> accuracy_score(data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0
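The three results above follow directly from the definition of accuracy as the share of matching labels. A plain-Python equivalent, shown only for illustration and independent of PAL:

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


actual = [1, 0, 0, 1, 1]
score_df = accuracy(actual, [0, 0, 0, 1, 1])       # df
score_dummy = accuracy(actual, [1, 1, 1, 1, 1])    # df_dummy
score_perfect = accuracy(actual, actual)           # df_perfect
```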

hana_ml.algorithms.pal.metrics.r2_score(data, label_true, label_pred)

Computes coefficient of determination for regression results.

Parameters
dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true values.

label_predstr

Name of the column containing values predicted by regression.

Returns
float

Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.
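These properties follow from the usual definition R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below assumes this standard definition and uses the values of the hypothetical regression from the Examples; it does not call PAL.

```python
def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (standard definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [0.10, 0.90, 2.10, 3.05, 4.00]
r2_model = r2(actual, [0.2, 1.0, 1.9, 3.0, 3.5])
r2_mean = r2(actual, [2.03] * 5)                      # always predict the mean
r2_awful = r2(actual, [12345.0, 91923.0, -4444.0, -8888.0, -9999.0])
```

Predicting the mean makes SS_res equal to SS_tot, giving a score of 0, and any predictor with a larger squared error than that yields a negative score.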

Examples

Actual and predicted values df for a hypothetical regression:

>>> df.collect()
ACTUAL  PREDICTED
0    0.10        0.2
1    0.90        1.0
2    2.10        1.9
3    3.05        3.0
4    4.00        3.5


R2 score for these predictions:

>>> r2_score(data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.9685233682514102


Compare that to the score for a perfect predictor:

>>> df_perfect.collect()
ACTUAL  PREDICTED
0    0.10       0.10
1    0.90       0.90
2    2.10       2.10
3    3.05       3.05
4    4.00       4.00
>>> r2_score(data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0


A naive mean predictor:

>>> df_mean.collect()
ACTUAL  PREDICTED
0    0.10       2.03
1    0.90       2.03
2    2.10       2.03
3    3.05       2.03
4    4.00       2.03
>>> r2_score(data=df_mean, label_true='ACTUAL', label_pred='PREDICTED')
0.0


And a really awful predictor df_awful:

>>> df_awful.collect()
ACTUAL  PREDICTED
0    0.10    12345.0
1    0.90    91923.0
2    2.10    -4444.0
3    3.05    -8888.0
4    4.00    -9999.0
>>> r2_score(data=df_awful, label_true='ACTUAL', label_pred='PREDICTED')
-886477397.139857

hana_ml.algorithms.pal.metrics.binary_classification_debriefing(data, label_true, label_pred, auc_data=None)

Computes debriefing coefficients for binary classification results.

Parameters
dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true values.

label_predstr

Name of the column containing predicted values.

auc_dataDataFrame, optional

Input data for calculating predictive power (KI), structured as follows:

• ID column.

• True class of the data point.

• Classifier-computed probability that the data point belongs to the positive class.

Returns
dict

Debriefing stats: ACCURACY, RECALL, SPECIFICITY, PRECISION, FPR, FNR, F1, MCC, KI, KAPPA.
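Most of these statistics can be derived from the four confusion counts using their standard definitions, as in the plain-Python sketch below. It is an illustration, not PAL's implementation: KI is omitted because it depends on auc_data and its formula is not given here, and returning 0.0 for an undefined ratio is an assumption made for this sketch.

```python
import math


def binary_debriefing(y_true, y_pred, positive=1):
    """Standard binary classification statistics from TP, FP, TN, FN (KI omitted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + fp + tn + fn

    def ratio(num, den):
        return num / den if den else 0.0  # assumption: undefined ratios become 0.0

    recall, precision = ratio(tp, tp + fn), ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, n)
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ratio((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn), n * n)
    return {
        "ACCURACY": accuracy,
        "RECALL": recall,
        "SPECIFICITY": ratio(tn, tn + fp),
        "PRECISION": precision,
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "F1": ratio(2 * precision * recall, precision + recall),
        "MCC": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
        "KAPPA": ratio(accuracy - expected, 1.0 - expected),
    }


stats = binary_debriefing([1, 0, 0, 1, 1], [0, 0, 0, 1, 1])
```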

## hana_ml.algorithms.pal.mixture¶

This module contains Python wrappers for the Gaussian mixture model algorithm.

The following class is available:

class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)