hana_ml.algorithms.pal package

The Algorithms PAL Package consists of the following sections:

hana_ml.algorithms.pal.association

This module contains Python wrappers for PAL association algorithms.

The following classes are available:

class hana_ml.algorithms.pal.association.Apriori(conn_context, min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

min_supportfloat

User-specified minimum support(actual value).

min_confidencefloat

User-specified minimum confidence(actual value).

relationalbool, optional

Whether or not to apply relational logic in Apriori algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 100.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 5.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

use_prefix_treebool, optional

Indicates whether or not to use prefix tree for saving memory.

Defaults to False.

lhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the left-hand-side of association rules.

rhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the right-hand-side of association rules.

lhs_complement_rhsbool, optional(deprecated)

If you use rhs_restrict to restrict some items to the left-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1, i2, …, i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, …, i100 to the left-hand-side, you can set the parameters as follows:

rhs_restrict = [‘i1’,’i2’],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhsbool, optional(deprecated)

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Specify the way to export the Apriori model:

  • ‘no’ : do not export the model,

  • ‘single-row’ : export Apriori model in PMML in single row,

  • ‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to ‘no’.

Examples

Input data for association rule mining:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the Apriori algorithm:

>>> ap = Apriori(conn_context=conn,
                 min_support=0.1,
                 min_confidence=0.3,
                 relational=False,
                 min_lift=1.1,
                 max_conseq=1,
                 max_len=5,
                 ubiquitous=1.0,
                 use_prefix_tree=False,
                 thread_ratio=0,
                 timeout=3600,
                 pmml_export='single-row')

Mine association rules from the input data using the Apriori algorithm, and check the results:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000

Apriori algorithm set up using relational logic:

>>> apr = Apriori(conn_context=conn,
                  min_support=0.1,
                  min_confidence=0.3,
                  relational=True,
                  min_lift=1.1,
                  max_conseq=1,
                  max_len=5,
                  ubiquitous=1.0,
                  use_prefix_tree=False,
                  thread_ratio=0,
                  timeout=3600,
                  pmml_export='single-row')

Mine association rules again using the Apriori algorithm with relational output, and check the resulting tables:

>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
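
Since pmml_export was set to 'single-row', the trained model is also exported in PMML format and exposed through the model_ attribute. A minimal sketch of inspecting it, reusing the ap instance fitted above (the PMML content depends on the data and on the installed PAL version, so the output is not shown here):

>>> ap.model_.head(1).collect()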
Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent(leading) items.

  • 2nd column : consequent(dependent) items.

  • 3rd column : support value.

  • 4th column : confidence value.

  • 5th column : lift value.

Available only when relational is False.

model_DataFrame

Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. Apriori model in PMML format.

antec_DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(self, data[, transaction, item, …])

Association rule mining from the input data using the Apriori algorithm.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data using the Apriori algorithm.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item ID column. Data type of item column can either be int or str.

Defaults to the last column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the left-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1,i2,…,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,…, i100 to the left-hand-side, you can set the parameters similarly as follows:

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
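
A minimal sketch of restricting rule sides at fit time, reusing the conn and df objects from the examples above; the item names are those appearing in df, and the transaction and item columns are named explicitly instead of relying on the positional defaults:

>>> ap = Apriori(conn_context=conn, min_support=0.1, min_confidence=0.3)
>>> ap.fit(data=df,
           transaction='CUSTOMER',
           item='ITEM',
           rhs_restrict=['item1', 'item2'],
           lhs_complement_rhs=True)

The mined rules are then available through the result_ attribute, as in the examples above.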

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.
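
A minimal sketch of reusing a fitted model: the model_ DataFrame produced by an Apriori instance that was fitted with pmml_export enabled (as in the examples above) can be loaded into a fresh instance instead of refitting:

>>> ap_new = Apriori(conn_context=conn, min_support=0.1, min_confidence=0.3)
>>> ap_new.load_model(ap.model_)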

class hana_ml.algorithms.pal.association.AprioriLite(conn_context, min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

min_supportfloat

User-specified minimum support(actual value).

min_confidencefloat

User-specified minimum confidence(actual value).

subsamplefloat, optional

Specify the sampling percentage for the input data. Set to 1 if you want to use the entire data.

recalculatebool, optional

If you sample the input data, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.

Defaults to True.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Specify the way to export the Apriori model:

  • ‘no’ : do not export the model,

  • ‘single-row’ : export Apriori model in PMML in single row,

  • ‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to ‘no’.

Examples

Input data for association rule mining using Apriori algorithm:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the lite Apriori algorithm, fit on the input data, and check the result table:

>>> apl = AprioriLite(conn_context=conn,
                      min_support=0.1,
                      min_confidence=0.3,
                      subsample=1.0,
                      recalculate=False,
                      timeout=3600,
                      pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
Attributes
result_DataFrame
Mined association rules and related statistics, structured as follows:
  • 1st column : antecedent(leading) items,

  • 2nd column : consequent(dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

model_DataFrame
Apriori model trained from the input data, structured as follows:
  • 1st column : model ID.

  • 2nd column : model content, i.e. lite Apriori model in PMML format.

Methods

fit(self, data[, transaction, item])

Association rule mining from the input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None)

Association rule mining from the input data.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last column if not provided.
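
A minimal sketch of naming the transaction and item columns explicitly instead of relying on the positional defaults, reusing the conn and df objects from the example above:

>>> apl = AprioriLite(conn_context=conn, min_support=0.1, min_confidence=0.3)
>>> apl.fit(data=df, transaction='CUSTOMER', item='ITEM')

The mined rules are then available through the result_ attribute, as shown above.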

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.FPGrowth(conn_context, min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

min_supportfloat, optional

User-specified minimum support, with valid range [0, 1].

Defaults to 0.

min_confidencefloat, optional

User-specified minimum confidence, with valid range [0, 1].

Defaults to 0.

relationalbool, optional

Whether or not to apply relational logic in FPGrowth algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables – antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 10.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 10.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

Input data for association rule mining:

>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5

Set up parameters:

>>> fpg = FPGrowth(conn_context=conn,
                   min_support=0.2,
                   min_confidence=0.5,
                   relational=False,
                   min_lift=1.0,
                   max_conseq=1,
                   max_len=5,
                   ubiquitous=1.0,
                   thread_ratio=0,
                   timeout=3600)

Mine association rules from the input data using the FPGrowth algorithm, and check the results:

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000

FPGrowth algorithm set up using relational logic:

>>> fpgr = FPGrowth(conn_context=conn,
                    min_support=0.2,
                    min_confidence=0.5,
                    relational=True,
                    min_lift=1.0,
                    max_conseq=1,
                    max_len=5,
                    ubiquitous=1.0,
                    thread_ratio=0,
                    timeout=3600)

Mine association rules again using the FPGrowth algorithm with relational output, and check the resulting tables:

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent(leading) items,

  • 2nd column : consequent(dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

antec_DataFrame
Antecedent items of mined association rules, structured as follows:
  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame
Statistics of the mined association rules, structured as follows:
  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(self, data[, transaction, item, …])

Association rule mining from the input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the left-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1,i2,…,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,…, i100 to the left-hand-side, you can set the parameters similarly as follows:

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
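
A minimal sketch that restricts items 1 and 2 to the right-hand-side and pushes the complement items to the left-hand-side, reusing the fpg instance and the df object from the examples above:

>>> fpg.fit(data=df,
            transaction='TRANS',
            item='ITEM',
            rhs_restrict=[1, 2],
            lhs_complement_rhs=True)

The mined rules are then available through the result_ attribute, as in the examples above.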

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.KORD(conn_context, k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

kint, optional

The number of top rules to discover.

measurestr, optional

Specifies the measure used to define the priority of the association rules.

min_supportfloat, optional

User-specified minimum support value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_confidencefloat, optional

User-specified minimum confidence value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_coveragefloat, optional

User-specified minimum coverage value of association rule, with valid range [0, 1].

Defaults to the value of min_support if not provided.

min_measurefloat, optional

User-specified minimum measure value (for leverage or lift, depending on the setting of measure).

Defaults to 0 if not provided.

epsilonfloat, optional

User-specified epsilon value for punishing length of rules.

Valid only when use_epsilon is True.

use_epsilonbool, optional

Specifies whether or not to use epsilon to punish the length of rules.

Defaults to False.

Examples

First let us have a look at the training data:

>>> df.head(10).collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1

Set up a KORD instance:

>>> krd =  KORD(conn_context=conn,
                k=5,
                measure='lift',
                min_support=0.1,
                min_confidence=0.2,
                epsilon=0.1,
                use_epsilon=False)

Start k-optimal rule discovery process from the input transaction data, and check the results:

>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
   RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
   RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714
Attributes
antec_DataFrame

Info of antecedent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : antecedent items.

conseq_DataFrame

Info of consequent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : consequent items.

stats_DataFrame
Some basic statistics for the mined association rules, structured as follows:
  • 1st column : rule ID,

  • 2nd column : support value of rules,

  • 3rd column : confidence value of rules,

  • 4th column : lift value of rules,

  • 5th column : leverage value of rules,

  • 6th column : measure value of rules.

Methods

fit(self, data[, transaction, item])

K-optimal rule discovery from input data, based on some user-specified measure.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None)

K-optimal rule discovery from input data, based on some user-specified measure.

Parameters
dataDataFrame

Input data for k-optimal(association) rule discovery.

transactionstr, optional

Column name of transaction ID in the input data.

Defaults to name of the 1st column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the final column if not provided.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.SPM(conn_context, min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

min_supportfloat

User-specified minimum support value.

relationalbool, optional

Whether or not to apply relational logic in sequential pattern mining. If False, a single results table for frequent pattern mining is produced; otherwise the results table is split into two tables: one for mined patterns, and the other for statistics.

Defaults to False.

ubiquitousfloat, optional

Items whose support values are above this specified value will be ignored during the frequent item mining phase.

Defaults to 1.0.

min_lenint, optional

Minimum number of items in a transaction.

Defaults to 1.

max_lenint, optional

Maximum number of items in a transaction.

Defaults to 10.

min_len_outint, optional

Specifies the minimum number of items of the mined association rules in the result table.

Defaults to 1.

max_len_outint, optional

Specifies the maximum number of items of the mined association rules in the result table.

Defaults to 10.

calc_liftbool, optional

Whether or not to calculate lift values for all applicable cases. If False, lift values are only calculated for the cases where the last transaction contains a single item.

Defaults to False.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

First, take a look at the input data df:

>>> df.collect()
   CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert

Set up a SPM instance:

>>> sp = SPM(conn_context=conn,
             min_support=0.5,
             relational=False,
             ubiquitous=1.0,
             max_len=10,
             min_len=1,
             calc_lift=True)

Start sequential pattern mining process from the input data, and check the results:

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
                        PATTERN   SUPPORT  CONFIDENCE      LIFT
0                       {Apple}  1.000000    0.000000  0.000000
1           {Apple},{Blueberry}  0.666667    0.666667  0.666667
2             {Apple},{Dessert}  1.000000    1.000000  1.000000
3             {Apple,Blueberry}  0.666667    0.000000  0.000000
4   {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                {Apple,Cherry}  0.666667    0.000000  0.000000
6      {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                   {Blueberry}  1.000000    0.000000  0.000000
8         {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                      {Cherry}  0.666667    0.000000  0.000000
10           {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                    {Dessert}  1.000000    0.000000  0.000000
Attributes
result_DataFrame

The overall frequent pattern mining result, structured as follows:

  • 1st column : mined frequent patterns,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Available only when relational is False.

pattern_DataFrame
Result for mined frequent patterns, structured as follows:
  • 1st column : pattern ID,

  • 2nd column : transaction ID,

  • 3rd column : items.

stats_DataFrame
Statistics for frequent pattern mining, structured as follows:
  • 1st column : pattern ID,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Methods

fit(self, data[, customer, transaction, …])

Sequential pattern mining from input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

fit(self, data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Sequential pattern mining from input data.

Parameters
dataDataFrame

Input data for sequential pattern mining.

customerstr, optional

Column name of customer ID in the input data.

Defaults to name of the 1st column if not provided.

transactionstr, optional

Column name of transaction ID in the input data. For sequential pattern mining in particular, the values of this column must also reflect the sequence of occurrence.

Defaults to name of the 2nd column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the final column if not provided.

item_restrictlist of int or str, optional

Specifies the list of items allowed in the mined association rule.

min_gapint, optional

Specifies the minimum time difference between consecutive transactions in a sequence.
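
A minimal sketch of the optional restrictions, reusing the sp instance and the df object from the example above; the restricted item list and the min_gap value of 1 are illustrative assumptions, not values taken from the example:

>>> sp.fit(data=df,
           customer='CUSTID',
           transaction='TRANSID',
           item='ITEMS',
           item_restrict=['Apple', 'Blueberry', 'Dessert'],
           min_gap=1)

The mined patterns are then available through the result_ attribute, as shown above.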

hana_ml.algorithms.pal.clustering

This module contains Python wrappers for PAL clustering algorithms.

The following classes are available:

class hana_ml.algorithms.pal.clustering.AffinityPropagation(conn_context, affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

affinity{‘manhattan’, ‘standardized_euclidean’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}

Ways to compute the distance between two points.

No default value as it is mandatory.

n_clustersint

Number of clusters.

  • 0: does not adjust Affinity Propagation cluster result.

  • Non-zero int: If Affinity Propagation cluster number is bigger than n_clusters, PAL will merge the result to make the cluster number be the value specified for n_clusters.

No default value as it is mandatory.

max_iterint, optional

Maximum number of iterations.

Defaults to 500.

convergence_iterint, optional

When the cluster result stays unchanged for the specified number of consecutive iterations, the algorithm terminates.

Defaults to 100.

dampingfloat

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preferencefloat, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratiofloat, optional

Selects a portion (seed_ratio * data_number) of the input data as seeds, where data_number is the row size of the input data. Value range: (0, 1]. If seed_ratio is 1, all the input data will be used as seeds.

Defaults to 1.

timesint, optional

The sampling times. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_powerint, optional

The power of the Minkowski method. Only valid when affinity is ‘minkowski’.

Defaults to 3.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1   0.10     0.10
1    2   0.11     0.10
2    3   0.10     0.11
3    4   0.11     0.11
4    5   0.12     0.11
5    6   0.11     0.12
6    7   0.12     0.12
7    8   0.12     0.13
8    9   0.13     0.12
9   10   0.13     0.13
10  11   0.13     0.14
11  12   0.14     0.13
12  13  10.10    10.10
13  14  10.11    10.10
14  15  10.10    10.11
15  16  10.11    10.11
16  17  10.11    10.12
17  18  10.12    10.11
18  19  10.12    10.12
19  20  10.12    10.13
20  21  10.13    10.12
21  22  10.13    10.13
22  23  10.13    10.14
23  24  10.14    10.13

Create AffinityPropagation instance:

>>> ap = AffinityPropagation(
            conn_context=conn,
            affinity='euclidean',
            n_clusters=0,
            max_iter=500,
            convergence_iter=100,
            damping=0.9,
            preference=0.5,
            seed_ratio=None,
            times=None,
            minkowski_power=None,
            thread_ratio=1)

Perform fit on the given data:

>>> ap.fit(data = df, key='ID')

Expected output:

>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
Attributes
labels_DataFrame

Label assigned to each sample, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods

fit(self, data, key[, features])

Fit the model when given the training dataset.

fit_predict(self, data, key[, features])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(self, data, key, features=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Fit result, label of each point, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.
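
A minimal sketch of fit_predict, reusing the ap instance and the df object from the example above; collecting the returned DataFrame should yield the same assignment as ap.labels_ shown earlier:

>>> labels = ap.fit_predict(data=df, key='ID')
>>> labels.collect()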

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(conn_context, n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of the subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerate approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered a cluster of its own. The algorithm iteratively merges two clusters based on the dissimilarity measure in a greedy manner and forms a larger cluster.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

n_clustersint, optional

Number of clusters after agglomerate hierarchical clustering algorithm. Value range: between 1 and the initial number of input data.

Defaults to 1.

affinity{‘manhattan’,’euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘pearson correlation’, ‘squared euclidean’, ‘jaccard’, ‘gower’}, optional

Ways to compute the distance between two points.

Note

(1) For jaccard distance, non-zero input data will be treated as 1, and zero input data will be treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10)

(2) Only gower distance supports category attributes. When linkage is ‘centroid clustering’, ‘median clustering’, or ‘ward’, this parameter must be set to ‘squared euclidean’.

Defaults to squared euclidean.

linkage{ ‘nearest neighbor’, ‘furthest neighbor’, ‘group average’, ‘weighted average’, ‘centroid clustering’, ‘median clustering’, ‘ward’}, optional

Linkage type between two clusters.

  • ‘nearest neighbor’ : single linkage.

  • ‘furthest neighbor’ : complete linkage.

  • ‘group average’ : UPGMA.

  • ‘weighted average’ : WPGMA.

  • ‘centroid clustering’.

  • ‘median clustering’.

  • ‘ward’.

Defaults to centroid clustering.

Note

For linkage ‘centroid clustering’, ‘median clustering’, or ‘ward’, the corresponding affinity must be set to ‘squared euclidean’.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimensionfloat, optional

Distance dimension can be set if affinity is set to ‘minkowski’. The value should be no less than 1. Only valid when affinity is ‘minkowski’.

Defaults to 3.

normalization{0, 1, 2, 3}, int, optional

Normalization type

  • 0: does nothing

  • 1: Z score standardize

  • 2: transforms to new range: -1 to 1

  • 3: transforms to new range: 0 to 1

Defaults to 0.

category_weightsfloat, optional

Represents the weight of category columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
    POINT    X1     X2     X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1

Create AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
             conn_context=conn,
             n_clusters=4,
             affinity='Gower',
             linkage='weighted average',
             thread_ratio=None,
             distance_dimension=3,
             normalization=0,
             category_weights=0.1)

Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
   STAGE  LEFT_POINT  RIGHT_POINT  DISTANCE
0      1          18           19    0.0187
1      2          13           14    0.0250
2      3           7            9    0.0437
>>> hc.labels_.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1
Attributes
combine_process_DataFrame

Structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named as its row number in the input data table. After the combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name. The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

labels_DataFrame

Label assigned to each sample. structured as follows:

  • 1st column: ID, record ID.

  • 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

fit(self, data, key[, features, …])

Fit the model when given the training dataset.

fit_predict(self, data, key[, features, …])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, VARCHAR or NVARCHAR columns are treated as categorical, and INTEGER or DOUBLE columns as continuous.

No default value.

fit_predict(self, data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, VARCHAR or NVARCHAR columns are treated as categorical, and INTEGER or DOUBLE columns as continuous.

No default value.

Returns
DataFrame

Combine process, structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named as its row number in the input data table. After the combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

Label of each points, structured as follows:

  • 1st column: ID (in input table) data type, ID, record ID.

  • 2nd column: int, CLUSTER_ID, the range is from 0 to n_clusters - 1.
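
A minimal sketch of fit_predict, reusing the hc instance and the df object from the example above (the returned DataFrame is collected without showing its content here):

>>> out = hc.fit_predict(data=df, key='POINT', categorical_variable=['X3'])
>>> out.collect()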

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.DBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

epsfloat, optional

The scan radius.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘standardized_euclidean’, ‘cosine’}, optional

Ways to compute the distance between two points.

Defaults to ‘euclidean’.

minkowski_powerint, optional

When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is ‘minkowski’.

Defaults to 3.

categorical_variablestr or list of str, optional

Specifies column(s) in the data that should be treated as categorical.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{‘brute-force’, ‘kd-tree’}, optional

Ways to search for neighbours.

Defaults to ‘kd-tree’.

save_modelbool, optional

If true, the generated model will be saved. save_model must be True to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create DBSCAN instance:

>>> dbscan = DBSCAN(conn_context=conn, thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

fit(self, data, key[, features, …])

Fit the DBSCAN model when given the training dataset.

fit_predict(self, data, key[, features, …])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Fit the DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(self, data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data ‘s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
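
A minimal sketch of assigning new points to the fitted clusters, reusing the dbscan instance from the examples above (fitted with the default save_model=True); new_df is a hypothetical DataFrame with the same column structure as df:

>>> assignments = dbscan.predict(data=new_df, key='ID')
>>> assignments.collect()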

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

epsfloat, optional

The scan radius.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘standardized_euclidean’, ‘cosine’}, optional

Ways to compute the distance between two points.

Defaults to euclidean.

minkowski_powerint, optional

When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is ‘minkowski’.

Defaults to 3.

algorithm{‘brute-force’, ‘kd-tree’}, optional

Ways to search for neighbours.

Defaults to ‘kd-tree’.

save_modelbool, optional

If true, the generated model will be saved. save_model must be True to call predict().

Defaults to True.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

>>> CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
             "ID" INTEGER,
             "POINT" ST_GEOMETRY
             );

Then, input dataframe df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")

Create GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(conn_context = conn, thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> geo_dbscan.fit(data = df, key='ID')

Expected output:

>>> geo_dbscan.labels_.collect()
    ID    CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28   29  -1
29   30  -1
>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')

Expected output:

>>> result.collect()
    ID    CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28    29  -1
29    30  -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

fit(self, data, key[, features])

Fit the Geometry DBSCAN model when given the training dataset.

fit_predict(self, data, key[, features])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data. The structure is as follows.

  • 1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID.

  • 2nd column: ST_GEOMETRY, 2-D geometry point.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(self, data, key, features=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data. The structure is as follows.

  • 1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID

  • 2nd column: ST_GEOMETRY, 2-D geometry point.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. Cluster IDs range from 0 to the number of clusters minus 1; a cluster ID of -1 means the point is labeled as noise.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.KMeans(conn_context, n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

n_clustersint, optional

Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.

n_clusters_minint, optional

Cluster range minimum.

n_clusters_maxint, optional

Cluster range maximum.

init{‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to ‘patent’.

max_iterint, optional

Max iterations.

Defaults to 100.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}, optional

Ways to compute the distance between the item and the cluster center. ‘cosine’ is only valid when accelerated is False.

Defaults to ‘euclidean’.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No normalization will be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

tolfloat, optional

Convergence threshold for exiting iterations. Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode{‘auto’, ‘optimize-speed’, ‘optimize-space’}, optional

Indicates the memory mode that the algorithm uses.

  • ‘auto’: Chosen by algorithm.

  • ‘optimize-speed’: Prioritizes speed.

  • ‘optimize-space’: Prioritizes memory.

Only valid when accelerated is True.

Defaults to ‘auto’.

acceleratedbool, optional

Indicates whether to use technology such as caching to accelerate the calculation process. If True, the calculation is accelerated; if False, it is not.

Defaults to False.

Examples

Input dataframe df for K Means:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create KMeans instance:

>>> km = clustering.KMeans(conn_context=conn, n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='euclidean',
...                        category_weights=0.5)

Perform fit_predict:

>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679

Input dataframe df for Accelerated K-Means :

>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create Accelerated Kmeans instance:

>>> akm = clustering.KMeans(conn_context=conn, init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes
labels_DataFrame

Label assigned to each sample.

cluster_centers_DataFrame

Coordinates of cluster centers.

model_DataFrame

Model content.

statistics_DataFrame

Statistic value.

Methods

fit(self, data, key[, features, …])

Fit the model when given training dataset.

fit_predict(self, data, key[, features, …])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
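For illustration, a minimal fit() sketch, reusing the km instance and the df DataFrame from the K-Means example above (the feature column names are simply the ones shown there):

>>> km.fit(data=df, key='ID', features=['V000', 'V001', 'V002'])
>>> km.cluster_centers_.collect()   # coordinates of the fitted cluster centers
>>> km.labels_.collect()            # cluster assignment for each training point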

fit_predict(self, data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

  • SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
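As a sketch only: assuming km has been fitted as in the example above, and df_new is a hypothetical DataFrame with the same column structure as the training data:

>>> assignments = km.predict(data=df_new, key='ID')
>>> assignments.collect()   # columns: ID, CLUSTER_ID, DISTANCE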

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.
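A minimal persistence sketch, assuming the fitted model content is written to and read back from a hypothetical table KMEANS_MODEL_TBL (DataFrame.save() is used here only for illustration):

>>> km.model_.save('KMEANS_MODEL_TBL')
>>> km_new = KMeans(conn_context=conn, n_clusters=4)
>>> km_new.load_model(conn.table('KMEANS_MODEL_TBL'))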

class hana_ml.algorithms.pal.clustering.KMedians(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

n_clustersint

Number of groups.

init{‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to ‘patent’.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to ‘euclidean’.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedians instance:

>>> kmedians = KMedians(conn_context = conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1

Performing fit_predict() on given dataframe:

>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(self, data, key[, features, …])

Perform clustering on input dataset.

fit_predict(self, data, key[, features, …])

Perform clustering algorithm and return labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.KMedoids(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids to calculate cluster centers. K-Medoids is more robust to noise and outliers.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

n_clustersint

Number of groups.

init{‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to ‘patent’.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to ‘euclidean’.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedoids instance:

>>> kmedoids = KMedoids(conn_context=conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(self, data, key[, features, …])

Perform clustering on input dataset.

fit_predict(self, data, key[, features, …])

Perform clustering algorithm and return labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.crf

This module contains the Python wrapper for the PAL conditional random field (CRF) algorithm.

The following class is available:

class hana_ml.algorithms.pal.crf.CRF(conn_context, lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Conditional random field (CRF) for labeling and segmenting sequence data (e.g., text).

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

epsilonfloat, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lambfloat, optional

Regularization weight, should be greater than 0.

Defaults to 1.0.

max_iterint, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_mint, optional

Number of memories to be stored in L_BFGS optimization algorithm.

Defaults to 25.

use_class_featurebool, optional

Whether to include a feature for the class/label. This is equivalent to having a bias vector in the model.

Defaults to True.

use_wordbool, optional

If True, includes a feature for the current word.

Defaults to True.

use_ngramsbool, optional

Whether to create features from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngramsbool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_lengthint, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.

use_prevbool, optional

Whether or not to include a feature for the previous word and the current word; together with other options, this enables other previous-word features.

Defaults to True.

use_nextbool, optional

Whether or not to include a feature for the next word and the current word.

Defaults to True.

disjunction_widthint, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctivebool, optional

Whether or not to include features giving disjunctions of words anywhere within disjunction_width words to the left or right.

Defaults to True.

use_seqsbool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqsbool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqsbool, optional

Whether or not to use basic zeroth-order word shape features.

Defaults to True.

use_type_seqs2bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqsbool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shapeint, optional

Word shape, e.g. whether a word is capitalized or numeric. Only the chris2UseLC word shape is currently supported. Word shape is not used if this parameter is 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by the fit(i.e. training) function. The range of this parameter is from 0 to 1. 0 means only using single thread, 1 means using at most all available threads currently. Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O

Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(conn_context=cc,
...           lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382

Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52

Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)

Check the prediction result:

>>> res.head(10).collect()
Attributes
model_DataFrame

CRF model content.

stats_DataFrame

Statistic info for CRF model fitting, structured as follows:

  • 1st column: name of the statistics, type NVARCHAR(100).

  • 2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_DataFrame

Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (in the future).

Methods

fit(self, data[, doc_id, word_pos, word, label])

Function for training the CRF model on English text.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data[, doc_id, word_pos, …])

Predicts text labels based on the trained CRF model.

fit(self, data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters
dataDataFrame

Input data for training/fitting the CRF model. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the second column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the third column of the input data.

labelstr, optional

Name of the label column.

Defaults to the final column of the input data.

predict(self, data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

Predicts text labels based on the trained CRF model.

Parameters
dataDataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the second column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the third column of the input data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by predict function. The range of this parameter is from 0 to 1. 0 means only using a single thread, and 1 means using at most all available threads currently. Values outside this range are ignored, and predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns
DataFrame

Prediction result for the input data, structured as follows:

  • 1st column: document ID,

  • 2nd column: word position,

  • 3rd column: label.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.decomposition

This module contains Python wrappers for PAL decomposition algorithms.

The following classes are available:

class hana_ml.algorithms.pal.decomposition.PCA(conn_context, scaling=None, thread_ratio=None, scores=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Principal component analysis (PCA) reduces the dimensionality of multivariate data using singular value decomposition (SVD).

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scalingbool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scoresbool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

Examples

Input DataFrame df1 for training:

>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

Creating a PCA instance:

>>> pca = PCA(conn_context=conn, scaling=True, thread_ratio=0.5, scores=True)

Performing fit on given dataframe:

>>> pca.fit(data=df1, key='ID')

Output:

>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

Input dataframe df2 for transforming:

>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0

Performing transform() on given dataframe:

>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
Attributes
loadings_DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_DataFrame

Loadings statistics on each component.

scores_DataFrame

The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if any one variable has a constant value across all data items.

Methods

fit(self, data, key[, features, label])

Principal component analysis function.

fit_transform(self, data, key[, features, label])

Fit with the dataset and return the scores.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features, …])

Principal component analysis projection function using a trained model.

fit(self, data, key, features=None, label=None)

Principal component analysis function.

Parameters
dataDataFrame

Data to be fitted.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

fit_transform(self, data, key, features=None, label=None)

Fit with the dataset and return the scores.

Parameters
dataDataFrame

Data to be analyzed.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.
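For illustration, a minimal fit_transform() sketch reusing the pca instance and df1 from the example above:

>>> scores = pca.fit_transform(data=df1, key='ID')
>>> scores.collect()   # ID plus the component score columns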

transform(self, data, key, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters
dataDataFrame

Data to be analyzed.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

n_componentsint, optional

Number of components to be retained. The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(conn_context, n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters
conn_contextConnectionContext

The connection to the SAP HANA system.

n_componentsint

Expected number of topics in the corpus.

doc_topic_priorfloat, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_priorfloat, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations. Value must be greater than 0.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

max_top_wordsint, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_wordsfloat, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold. It cannot be used together with parameter max_top_words.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.

  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [‘ ‘].

output_word_assignmentbool, optional

Controls whether or not to output the word_topic_assignment_ attribute. If True, word_topic_assignment_ is populated.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating a LDA instance:

>>> lda = LatentDirichletAllocation(cc, n_components=6, burn_in=50, thin=10,
...                                  iteration=100, seed=1,
...                                  max_top_words=5, doc_topic_prior=0.1,
...                                  output_word_assignment=True,
...                                  delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
...                     iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes
doc_topic_dist_DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column from fit().

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_DataFrame

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column from fit().

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_DataFrame

Topic top words table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_DataFrame

Topic-word distribution table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORD_ID, type INTEGER, word ID.

  • PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_DataFrame

Dictionary table, structured as follows:

  • WORD_ID, type INTEGER, word ID.

  • WORD, type NVARCHAR(5000), word text.

statistic_DataFrame

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

  • Parameters max_top_words and threshold_top_words cannot be used together.

  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().

Methods

fit(self, data, key[, document])

Fit LDA model based on training data.

fit_transform(self, data, key[, document])

Fit LDA model based on training data and return the topic assignment for the training documents.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, document, …])

Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(self, data, key, document=None)

Fit LDA model based on training data.

Parameters
dataDataFrame

Training data.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

fit_transform(self, data, key, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters
dataDataFrame

Training data.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with the same name and type as data’s document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.
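For illustration, a minimal fit_transform() sketch reusing the lda instance and df1 from the example above:

>>> doc_topic = lda.fit_transform(data=df1, key='DOCUMENT_ID', document='TEXT')
>>> doc_topic.collect()   # DOCUMENT_ID, TOPIC_ID, PROBABILITY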

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

transform(self, data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters
dataDataFrame

Independent variable values used for transform.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.

  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [‘ ‘].

output_word_assignmentbool, optional

Controls whether or not to output the word-topic assignment table; if True, it is returned.

Defaults to False.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with the same name and type as data’s document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

Word-topic assignment table, structured as follows:

  • Document ID column, with the same name and type as data’s document ID column.

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

hana_ml.algorithms.pal.discriminant_analysis

This module contains the PAL wrapper for the discriminant analysis algorithm. The following class is available:

class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(conn_context, regularization_type=None, regularization_amount=None, projection=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear discriminant analysis for classification and data reduction.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

regularization_type{‘mixing’, ‘diag’, ‘pseudo’}, optional

The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.

Defaults to ‘mixing’.

regularization_amountfloat, optional

The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix. The valid range for this parameter is [0, 1]. Valid only when regularization_type is ‘mixing’.

Defaults to the smallest number in [0, 1] that makes the regularized empirical covariance matrix invertible.

projectionbool, optional

Whether or not to compute the projection model.

Defaults to True.

Examples

The training data for linear discriminant analysis:

>>> df.collect()
     X1   X2   X3   X4            CLASS
0   5.1  3.5  1.4  0.2      Iris-setosa
1   4.9  3.0  1.4  0.2      Iris-setosa
2   4.7  3.2  1.3  0.2      Iris-setosa
3   4.6  3.1  1.5  0.2      Iris-setosa
4   5.0  3.6  1.4  0.2      Iris-setosa
5   5.4  3.9  1.7  0.4      Iris-setosa
6   4.6  3.4  1.4  0.3      Iris-setosa
7   5.0  3.4  1.5  0.2      Iris-setosa
8   4.4  2.9  1.4  0.2      Iris-setosa
9   4.9  3.1  1.5  0.1      Iris-setosa
10  7.0  3.2  4.7  1.4  Iris-versicolor
11  6.4  3.2  4.5  1.5  Iris-versicolor
12  6.9  3.1  4.9  1.5  Iris-versicolor
13  5.5  2.3  4.0  1.3  Iris-versicolor
14  6.5  2.8  4.6  1.5  Iris-versicolor
15  5.7  2.8  4.5  1.3  Iris-versicolor
16  6.3  3.3  4.7  1.6  Iris-versicolor
17  4.9  2.4  3.3  1.0  Iris-versicolor
18  6.6  2.9  4.6  1.3  Iris-versicolor
19  5.2  2.7  3.9  1.4  Iris-versicolor
20  6.3  3.3  6.0  2.5   Iris-virginica
21  5.8  2.7  5.1  1.9   Iris-virginica
22  7.1  3.0  5.9  2.1   Iris-virginica
23  6.3  2.9  5.6  1.8   Iris-virginica
24  6.5  3.0  5.8  2.2   Iris-virginica
25  7.6  3.0  6.6  2.1   Iris-virginica
26  4.9  2.5  4.5  1.7   Iris-virginica
27  7.3  2.9  6.3  1.8   Iris-virginica
28  6.7  2.5  5.8  1.8   Iris-virginica
29  7.2  3.6  6.1  2.5   Iris-virginica

Set up an instance of LinearDiscriminantAnalysis model and train it:

>>> lda = LinearDiscriminantAnalysis(conn_context=cc, regularization_type='mixing', projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')

Check the coefficients of the obtained linear discriminants and the projection model:

>>> lda.coef_.collect()
             CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
         NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333

Data to predict the class labels:

>>> df_pred.collect()
     ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Perform predict() and check the result:

>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
    ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065

Data to project:

>>> df_proj.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Do project and check the result:

>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1','X2','X3','X4'],
...                        proj_dim=2)
>>> res_proj.collect()
    ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None
Attributes
basic_info_DataFrame

Basic information of the training data for linear discriminant analysis.

priors_DataFrame

The empirical priors for each class in the training data.

coef_DataFrame

Coefficients (inclusive of intercepts) of each class’ linear score function for the training data.

proj_info_DataFrame

Projection-related info, such as standard deviations of the discriminants, the proportion of total variance explained by each discriminant, etc.

proj_model_DataFrame

The projection matrix and overall means for features.

Methods

fit(self, data[, key, features, label])

Calculate linear discriminators from training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, verbose])

Predict class labels using fitted linear discriminators.

project(self, data, key[, features, proj_dim])

Project data into lower dimensional spaces using fitted LDA projection model.

fit(self, data, key=None, features=None, label=None)

Calculate linear discriminators from training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If not provided, it is assumed that the input data has no ID column.

featureslist of str, optional

Names of the feature columns. If not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the class label column. If not provided, it defaults to the last column.

predict(self, data, key, features=None, verbose=None)

Predict class labels using fitted linear discriminators.

Parameters
dataDataFrame

Data for predicting the class labels.

keystr

Name of the ID column.

featureslist of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

verbosebool, optional

Whether or not to output scores for all classes. If False, only the score of the predicted class will be output.

Defaults to False.

Returns
DataFrame

Predicted class labels and the corresponding scores, structured as follows:

  • ID: with the same name and data type as data’s ID column.

  • CLASS: with the same name and data type as the training data’s label column.

  • SCORE: type DOUBLE, score of the predicted class.

project(self, data, key, features=None, proj_dim=None)

Project data into lower dimensional spaces using fitted LDA projection model.

Parameters
dataDataFrame

Data for linear discriminant projection.

keystr

Name of the ID column.

featureslist of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

proj_dimint, optional

Dimension of the projected space, equivalent to the number of discriminants used for projection. Defaults to the number of obtained discriminants.

Returns
DataFrame

Projected data, structured as follows:

  • 1st column: ID, with the same name and data type as data’s ID column.

  • Other columns, named DISCRIMINANT_i (where i runs from 1 to the number of feature columns), type DOUBLE.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.linear_model

This module contains Python wrappers for PAL linear model algorithms.

The following classes are available:

class hana_ml.algorithms.pal.linear_model.LinearRegression(conn_context, solver=None, var_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear regression is an approach to modeling the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables (the predictor vector).

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

solver{‘QR’, ‘SVD’, ‘CD’, ‘Cholesky’, ‘ADMM’}, optional

Algorithms to use to solve the least square problem. Case-insensitive.

  • ‘QR’: QR decomposition.

  • ‘SVD’: singular value decomposition.

  • ‘CD’: cyclical coordinate descent method.

  • ‘Cholesky’: Cholesky decomposition.

  • ‘ADMM’: alternating direction method of multipliers.

‘CD’ and ‘ADMM’ are supported only when var_select is ‘all’.

Defaults to QR decomposition.

var_select{‘all’, ‘forward’, ‘backward’}, optional

Method to perform variable selection.

  • ‘all’: all variables are included.

  • ‘forward’: forward selection.

  • ‘backward’: backward selection.

‘forward’ and ‘backward’ selection are supported only when solver is ‘QR’, ‘SVD’ or ‘Cholesky’.

Defaults to ‘all’.

interceptbool, optional

If true, include the intercept in the model.

Defaults to True.

alpha_to_enterfloat, optional

P-value for forward selection. Valid only when var_select is ‘forward’.

Defaults to 0.05.

alpha_to_removefloat, optional

P-value for backward selection. Valid only when var_select is ‘backward’.

Defaults to 0.1.

enet_lambdafloat, optional

Penalized weight. Value should be greater than or equal to 0. Valid only when solver is ‘CD’ or ‘ADMM’.

enet_alphafloat, optional

Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is ‘CD’ or ‘ADMM’.

Defaults to 1.0.

max_iterint, optional

Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is ‘CD’ or ‘ADMM’.

Defaults to 1e5.

tolfloat, optional

Convergence threshold for coordinate descent. Valid only when solver is ‘CD’.

Defaults to 1.0e-7.

phofloat, optional

Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is ‘ADMM’.

Defaults to 1.8.

stat_infbool, optional

If true, output t-value and Pr(>|t|) of coefficients.

Defaults to False.

adjusted_r2bool, optional

If true, include the adjusted R2 value in statistics.

Defaults to False.

dw_testbool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

reset_testint, optional

Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to 1.

bp_testbool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

ks_testbool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is ‘QR’, ‘CD’, ‘Cholesky’ or ‘ADMM’.

Defaults to 0.0.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

pmml_export{‘no’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

resampling_method{‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation/parameter selection. If no value is specified for this parameter, neither model evaluation nor parameter selection is activated. Must be set together with evaluation_metric.

No default value.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. Must be set together with resampling_method.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to ‘cv’.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is set to ‘random’.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.

No default value.

param_valueslist of tuples, optional

Specifies values of specific parameters to be selected. Only valid when search_strategy is specified. Specific parameters can be enet_lambda, enet_alpha.

No default value.

param_rangelist of tuples, optional

Specifies range of specific parameters to be selected. Only valid when search_strategy is specified. Specific parameters can be enet_lambda, enet_alpha.

No default value.

Examples

Training data:

>>> df.collect()
  ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Training the model:

>>> lr = LinearRegression(conn_context=cc,
...                       thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')

Prediction:

>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
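
Scoring against labeled data and parameter selection are not exercised above. The following is a minimal, illustrative sketch (output omitted); lr_cv is a new, hypothetical instance, and the solver choice and candidate enet_lambda/enet_alpha values are assumptions for demonstration only:

>>> lr.score(data=df, key='ID', label='Y')  # coefficient of determination R2, returned as a float
>>> lr_cv = LinearRegression(conn_context=cc,
...                          solver='CD',
...                          categorical_variable=["X3"],
...                          resampling_method='cv',
...                          evaluation_metric='rmse',
...                          fold_num=5,
...                          search_strategy='grid',
...                          param_values=[('enet_lambda', [0.01, 0.1, 1.0]),
...                                        ('enet_alpha', [0.1, 0.5, 1.0])])
>>> lr_cv.fit(data=df, key='ID', label='Y')

After fitting, the selected values can be inspected via lr_cv.optim_param_.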
Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

optim_param_DataFrame

The optimal parameters selected, if parameter selection is enabled.

Methods

fit(self, data[, key, features, label, …])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

  • ID column: with same name and type as data’s ID column.

  • VALUE: type DOUBLE, representing predicted values.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R2 of the prediction.

Note

score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.linear_model.LogisticRegression(conn_context, multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, alpha=None, lamb=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lamb_values=None, lamb_range=None, alpha_values=None, alpha_range=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

multi_classbool, optional

If true, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iterint, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

  • multi-class: Defaults to 100.

  • binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • multi-class:

    • ‘no’ or not provided: No PMML model.

    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

  • binary-class:

    • ‘no’ or not provided: No PMML model.

    • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Defaults to ‘no’.

categorical_variablestr or list of str, optional(deprecated)

Specifies INTEGER column(s) in the data that should be treated as categorical.

standardizebool, optional

If true, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_infbool, optional

If true, proceed with statistical inference.

Defaults to False.

solver{‘auto’, ‘newton’, ‘cyclical’, ‘lbfgs’, ‘stochastic’, ‘proximal’}, optional

Optimization algorithm.

  • ‘auto’ : automatically determined by system based on input data and parameters.

  • ‘newton’: Newton iteration method.

  • ‘cyclical’: Cyclical coordinate descent method to fit elastic net regularized logistic regression.

  • ‘lbfgs’: LBFGS method (recommended when having many independent variables).

  • ‘stochastic’: Stochastic gradient descent method (recommended when dealing with very large datasets).

  • ‘proximal’: Proximal gradient descent method to fit elastic net regularized logistic regression.

Only valid when multi_class is False.

Defaults to ‘auto’.

alphafloat, optional

Elastic net mixing parameter. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal.

Defaults to 1.0.

lambfloat, optional

Penalized weight. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal.

Defaults to 0.0.

tolfloat, optional

Convergence threshold for exiting iterations. Only valid when multi_class is False.

Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.

epsilonfloat, optional

Determines the accuracy with which the solution is to be found.

Only valid when multi_class is False and the solver is newton or lbfgs.

Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.

thread_ratiofloat, optional

Controls the proportion of available threads to use for fit() method. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 1.0.

max_pass_numberint, optional

The maximum number of passes over the data. Only valid when multi_class is False and solver is ‘stochastic’.

Defaults to 1.

sgd_batch_numberint, optional

The batch number of Stochastic gradient descent. Only valid when multi_class is False and solver is ‘stochastic’.

Defaults to 1.

precomputebool, optional

Whether to pre-compute the Gram matrix. Only valid when solver is ‘cyclical’.

Defaults to True.

handle_missingbool, optional

Whether to handle missing values.

Defaults to True.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical. By default, string is categorical, while int and double are numerical.

lbfgs_mint, optional

Number of previous updates to keep. Only applicable when multi_class is False and solver is ‘lbfgs’.

Defaults to 6.

resampling_method{‘cv’, ‘stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional

The resampling method for model evaluation and parameter selection. If no value specified, neither model evaluation nor parameter selection is activated.

metric{‘accuracy’, ‘f1_score’, ‘auc’, ‘nll’}, optional

The evaluation metric used for model evaluation/parameter selection.

fold_numint, optional

The number of folds for cross-validation. Mandatory and valid only when resampling_method is ‘cv’ or ‘stratified_cv’.

repeat_timesint, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy{‘grid’, ‘random’}, optional

The search method for parameter selection.

random_search_timesint, optional

The number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is ‘random’.

random_stateint, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection. Progress indicator deactivated if no value provided.

lamb_valueslist of float, optional

The values of lamb for parameter selection.

Only valid when search_strategy is specified.

lamb_rangelist of float, optional

The range of lamb for parameter selection, including a lower limit and an upper limit.

Only valid when search_strategy is specified.

alpha_valueslist of float, optional

The values of alpha for parameter selection.

Only valid when search_strategy is specified.

alpha_rangelist of float, optional

The range of alpha for parameter selection, including a lower limit and an upper limit.

Only valid when search_strategy is specified.

class_map0str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when label column type is VARCHAR or NVARCHAR

Only valid when multi_class is False during binary class fit and score.

class_map1str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

Examples

Training data:

>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1

Create LogisticRegression instance and call fit:

>>> lr = linear_model.LogisticRegression(conn_context=cc, solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
0                                  __PAL_INTERCEPT__    17.044785
1                                 V1__PAL_DELIMIT__A     0.000000
2                                 V1__PAL_DELIMIT__B    -1.464903
3                                                 V2    -4.819740
4                                 V3__PAL_DELIMIT__0     0.000000
5                                 V3__PAL_DELIMIT__1    -2.794139
6                                 V3__PAL_DELIMIT__2    -4.807858
7                                 V3__PAL_DELIMIT__3    -2.780918
8  {"CONTENT":"{\"impute_model\":{\"column_statis...          NaN
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1

Call predict():

>>> result = lr.predict(data=pred_df,
...                      key='ID',
...                      categorical_variable=['V3'],
...                      thread_ratio=0.1)
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503618e-01
1    1     1  8.485210e-01
2    2     1  9.555861e-01
3    3     0  3.701858e-02
4    4     0  2.229129e-02
5    5     0  2.503962e-01
6    6     0  4.945832e-02
7    7     0  9.922085e-03
8    8     0  2.852859e-01
9    9     0  2.689207e-01
10  10     0  2.200498e-02
11  11     0  4.713726e-03
12  12     0  2.349803e-02
13  13     0  5.830425e-04
14  14     0  4.886177e-07
15  15     0  6.938072e-06
16  16     0  1.637820e-04
17  17     1  8.986435e-01

Input data for score():

>>> df_score.collect()
    ID V1     V2  V3  CATEGORY
0    0  B  2.620   0         1
1    1  B  2.875   0         1
2    2  A  2.320   1         1
3    3  A  3.215   2         0
4    4  B  3.440   3         0
5    5  B  3.460   0         0
6    6  A  3.570   1         1
7    7  B  3.190   2         0
8    8  A  3.150   3         0
9    9  B  3.440   0         0
10  10  B  3.440   1         0
11  11  A  4.070   3         0
12  12  A  3.730   1         0
13  13  B  3.780   2         0
14  14  B  5.250   2         0
15  15  A  5.424   3         0
16  16  A  5.345   0         0
17  17  B  2.200   1         1

Call score():

>>> lr.score(data=df_score,
...           key='ID',
...           categorical_variable=['V3'],
...           thread_ratio=0.1)
0.944444
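
Model evaluation and parameter selection are not shown above. A minimal sketch follows (output omitted); lr_cv is a new, hypothetical instance and the candidate lamb values are purely illustrative:

>>> lr_cv = linear_model.LogisticRegression(conn_context=cc,
...                                         solver='cyclical',
...                                         resampling_method='cv',
...                                         metric='accuracy',
...                                         fold_num=5,
...                                         search_strategy='grid',
...                                         lamb_values=[0.01, 0.1, 1.0])
>>> lr_cv.fit(data=df, features=['V1', 'V2', 'V3'],
...           label='CATEGORY', categorical_variable=['V3'])

The chosen parameter values are then available in lr_cv.optim_param_.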
Attributes
coef_DataFrame

Values of the coefficients.

result_DataFrame

Model content.

optim_param_DataFrame

The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_DataFrame

Statistics info for the trained model, structured as follows:

  • 1st column: ‘STAT_NAME’, NVARCHAR(256)

  • 2nd column: ‘STAT_VALUE’, NVARCHAR(1000)

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

Methods

fit(self, data[, key, features, label, …])

Fit the LR model when given training dataset.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict with the dataset using the trained model.

score(self, data, key[, features, label, …])

Return the mean accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Otherwise, all INTEGER columns are treated as numerical.

class_map0str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(self, data, key, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False)

Predict with the dataset using the trained model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbosebool, optional

If true, output scoring probabilities for each class. Only applicable for the multi-class case.

Defaults to False.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) that should be treated as categorical. Otherwise, all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

Returns
DataFrame

Predicted result, structured as follows:

  • 1st column: ID, with the same name and type as data’s ID column.

  • 2nd column: CLASS, predicted class name.

  • 3rd column: PROBABILITY, type DOUBLE

    • multi-class: probability of being predicted as the predicted class.

    • binary-class: probability of being predicted as the positive class.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.

score(self, data, key, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER columns that should be treated as categorical; otherwise, all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

Returns
float

Scalar accuracy value after comparing the predicted label and original label.

hana_ml.algorithms.pal.linkpred

This module contains the Python wrapper for the PAL link prediction function.

The following class is available:

class hana_ml.algorithms.pal.linkpred.LinkPrediction(conn_context, method, beta=None, min_score=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked, which is helpful for predicting missing links (the higher the proximity score is, the more likely the two nodes are to be linked).

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

method{‘common_neighbors’, ‘jaccard’, ‘adamic_adar’, ‘katz’}

Method for computing the proximity between 2 nodes that are not directly linked.

betafloat, optional

A parameter included in the calculation of the Katz similarity (proximity) score. Valid only when method is ‘katz’.

Defaults to 0.005.

min_scorefloat, optional

The links whose scores are lower than min_score will be filtered out from the result table.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
   NODE1  NODE2
0      1      2
1      1      4
2      2      3
3      3      4
4      5      1
5      6      2
6      7      4
7      7      5
8      6      7
9      5      4

Create linkpred instance:

>>> lp = LinkPrediction(conn_context=conn,
...                     method='common_neighbors',
...                     beta=0.005,
...                     min_score=0,
...                     thread_ratio=0.2)

Calculate the proximity score of all nodes in the network with missing links, and check the result:

>>> res = lp.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res.collect()
    NODE1  NODE2     SCORE
0       1      3  0.285714
1       1      6  0.142857
2       1      7  0.285714
3       2      4  0.285714
4       2      5  0.142857
5       2      7  0.142857
6       4      6  0.142857
7       3      5  0.142857
8       3      6  0.142857
9       3      7  0.142857
10      5      6  0.142857
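
The ‘katz’ method additionally uses the beta parameter. A minimal sketch of scoring the same network with Katz proximity (output omitted; lp_katz and res_katz are new, illustrative names):

>>> lp_katz = LinkPrediction(conn_context=conn,
...                          method='katz',
...                          beta=0.005,
...                          min_score=0)
>>> res_katz = lp_katz.proximity_score(data=df, node1='NODE1', node2='NODE2')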

Methods

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

proximity_score(self, data[, node1, node2])

Predicts proximity scores between nodes under the current choice of method.

proximity_score(self, data, node1=None, node2=None)

Predicts proximity scores between nodes under the current choice of method.

Parameters
dataDataFrame

Network data with nodes and links. Nodes are in columns while links are in rows, where each link is represented by a pair of adjacent nodes (node1, node2).

node1str, optional

Column name of data that gives node1 of all available links (see data).

Defaults to the name of the first column of data if not provided.

node2str, optional

Column name of data that gives node2 of all available links (see data).

Defaults to the name of the last column of data if not provided.

Returns
DataFrame:

The proximity scores of pairs of nodes with missing links between them that are above ‘min_score’, structured as follows:

  • 1st column: node1 of a link

  • 2nd column: node2 of a link

  • 3rd column: proximity score of the two nodes

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.metrics

This module contains Python wrappers for PAL metrics to assess the quality of model outputs.

The following functions are available:

hana_ml.algorithms.pal.metrics.confusion_matrix(conn_context, data, key, label_true=None, label_pred=None, beta=None, native=False)

Computes confusion matrix to evaluate the accuracy of a classification.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

label_truestr, optional

Name of the original label column.

If not given, defaults to the second column.

label_predstr, optional

Name of the predicted label column. If not given, defaults to the third column.

betafloat, optional

Parameter used to compute the F-Beta score.

Defaults to 1.

nativebool, optional

Indicates whether to use native SQL statements for confusion matrix calculation.

Defaults to True.

Returns
DataFrame

Confusion matrix, structured as follows:
  • Original label, with same name and data type as it is in data.

  • Predicted label, with same name and data type as it is in data.

  • Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.

The DataFrame is sorted by (original label, predicted label) in descending order.

Classification report table, structured as follows:
  • Class, type NVARCHAR(100), class name

  • Recall, type DOUBLE, the recall of each class

  • Precision, type DOUBLE, the precision of each class

  • F_MEASURE, type DOUBLE, the F_measure of each class

  • SUPPORT, type INTEGER, the support - sample number in each class

Examples

Data contains the original label and predict label df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         1        1
1   2         1        1
2   3         1        1
3   4         1        2
4   5         1        1
5   6         2        2
6   7         2        1
7   8         2        2
8   9         2        2
9  10         2        2

Calculate the confusion matrix:

>>> cm, cr = confusion_matrix(conn_context=conn, data=df, key='ID', label_true='ORIGINAL', label_pred='PREDICT')

Output:

>>> cm.collect()
   ORIGINAL  PREDICT  COUNT
0         1        1      4
1         1        2      1
2         2        1      1
3         2        2      4
>>> cr.collect()
  CLASS  RECALL  PRECISION  F_MEASURE  SUPPORT
0     1     0.8        0.8        0.8        5
1     2     0.8        0.8        0.8        5
hana_ml.algorithms.pal.metrics.auc(conn_context, data, positive_label=None)

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

dataDataFrame

Input data, structured as follows:

  • ID column.

  • True class of the data point.

  • Classifier-computed probability that the data point belongs to the positive class.

positive_labelstr, optional

If original label is not 0 or 1, specifies the label value which will be mapped to 1.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrame df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         0     0.07
1   2         0     0.01
2   3         0     0.85
3   4         0     0.30
4   5         0     0.50
5   6         1     0.50
6   7         1     0.20
7   8         1     0.80
8   9         1     0.20
9  10         1     0.95

Compute Area Under Curve:

>>> auc, roc = auc(conn_context=conn, data=df)

Output:

>>> print(auc)
 0.66
>>> roc.collect()
   ID  FPR  TPR
0   0  1.0  1.0
1   1  0.8  1.0
2   2  0.6  1.0
3   3  0.6  0.6
4   4  0.4  0.6
5   5  0.2  0.4
6   6  0.2  0.2
7   7  0.0  0.2
8   8  0.0  0.0
hana_ml.algorithms.pal.metrics.multiclass_auc(conn_context, data_original, data_predict)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

data_originalDataFrame

True class data, structured as follows:

  • Data point ID column.

  • True class of the data point.

data_predictDataFrame

Predicted class data, structured as follows:

  • Data point ID column.

  • Possible class.

  • Classifier-computed probability that the data point belongs to that particular class.

For each data point ID, there should be one row for each possible class.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrame df:

>>> df_original.collect()
   ID  ORIGINAL
0   1         1
1   2         1
2   3         1
3   4         2
4   5         2
5   6         2
6   7         3
7   8         3
8   9         3
9  10         3
>>> df_predict.collect()
    ID  PREDICT  PROB
0    1        1  0.90
1    1        2  0.05
2    1        3  0.05
3    2        1  0.80
4    2        2  0.05
5    2        3  0.15
6    3        1  0.80
7    3        2  0.10
8    3        3  0.10
9    4        1  0.10
10   4        2  0.80
11   4        3  0.10
12   5        1  0.20
13   5        2  0.70
14   5        3  0.10
15   6        1  0.05
16   6        2  0.90
17   6        3  0.05
18   7        1  0.10
19   7        2  0.10
20   7        3  0.80
21   8        1  0.00
22   8        2  0.00
23   8        3  1.00
24   9        1  0.20
25   9        2  0.10
26   9        3  0.70
27  10        1  0.20
28  10        2  0.20
29  10        3  0.60

Compute Area Under Curve:

>>> auc, roc = multiclass_auc(conn_context=conn, data_original=df_original, data_predict=df_predict)

Output:

>>> print(auc)
1.0
>>> roc.collect()
    ID   FPR  TPR
0    0  1.00  1.0
1    1  0.90  1.0
2    2  0.65  1.0
3    3  0.25  1.0
4    4  0.20  1.0
5    5  0.00  1.0
6    6  0.00  0.9
7    7  0.00  0.7
8    8  0.00  0.3
9    9  0.00  0.1
10  10  0.00  0.0
hana_ml.algorithms.pal.metrics.accuracy_score(conn_context, data, label_true, label_pred)

Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

dataDataFrame

DataFrame of true and predicted labels.

label_truestr

Name of the column containing ground truth labels.

label_predstr

Name of the column containing predicted labels, as returned by a classifier.

Returns
float

Accuracy classification score. A lower accuracy indicates that the classifier correctly predicted fewer of the labels in the input.

Examples

Actual and predicted labels df for a hypothetical classification:

>>> df.collect()
   ACTUAL  PREDICTED
0    1        0
1    0        0
2    0        0
3    1        1
4    1        1

Accuracy score for these predictions:

>>> accuracy_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.8

Compare that to the null accuracy of df_dummy (the accuracy that could be achieved by always predicting the most frequent class):

>>> df_dummy.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       1
2    0       1
3    1       1
4    1       1
>>> accuracy_score(conn_context=conn, data=df_dummy, label_true='ACTUAL', label_pred='PREDICTED')
0.6

A perfect predictor df_perfect:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       0
2    0       0
3    1       1
4    1       1
>>> accuracy_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0
hana_ml.algorithms.pal.metrics.r2_score(conn_context, data, label_true, label_pred)

Computes coefficient of determination for regression results.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true values.

label_predstr

Name of the column containing values predicted by regression.

Returns
float

Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.

Examples

Actual and predicted values df for a hypothetical regression:

>>> df.collect()
   ACTUAL  PREDICTED
0    0.10        0.2
1    0.90        1.0
2    2.10        1.9
3    3.05        3.0
4    4.00        3.5

R2 score for these predictions:

>>> r2_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.9685233682514102

Compare that to the score for a perfect predictor:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    0.10       0.10
1    0.90       0.90
2    2.10       2.10
3    3.05       3.05
4    4.00       4.00
>>> r2_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0

A naive mean predictor:

>>> df_mean.collect()
   ACTUAL  PREDICTED
0    0.10       2.03
1    0.90       2.03
2    2.10       2.03
3    3.05       2.03
4    4.00       2.03
>>> r2_score(conn_context=conn, data=df_mean, label_true='ACTUAL', label_pred='PREDICTED')
0.0

And a really awful predictor df_awful:

>>> df_awful.collect()
   ACTUAL  PREDICTED
0    0.10    12345.0
1    0.90    91923.0
2    2.10    -4444.0
3    3.05    -8888.0
4    4.00    -9999.0
>>> r2_score(conn_context=conn, data=df_awful, label_true='ACTUAL', label_pred='PREDICTED')
-886477397.139857
hana_ml.algorithms.pal.metrics.binary_classification_debriefing(conn_context, data, label_true, label_pred, auc_data=None)

Computes debriefing coefficients for binary classification results.

Parameters
conn_contextConnectionContext

The connection to SAP HANA system.

dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true values.

label_predstr

Name of the column containing values predicted by regression.

auc_dataDataFrame, optional

Input data for calculating predictive power (KI), structured as follows:

  • ID column.

  • True class of the data point.

  • Classifier-computed probability that the data point belongs to the positive class.

Returns
dict

Debriefing stats: ACCURACY, RECALL, SPECIFICITY, PRECISION, FPR, FNR, F1, MCC, KI, KAPPA.
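
Examples

A minimal sketch of calling this function, assuming a DataFrame df of true and predicted labels as in accuracy_score() above (output omitted, since the values depend on the data):

>>> stats = binary_classification_debriefing(conn_context=conn, data=df,
...                                          label_true='ACTUAL', label_pred='PREDICTED')

The returned dict contains one entry per statistic listed above, e.g. stats['ACCURACY'].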

hana_ml.algorithms.pal.mixture

This module contains the Python wrapper for the PAL Gaussian mixture model algorithm.

The following class is available:

class hana_ml.algorithms.pal.mixture.GaussianMixture(conn_context, init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

Representation of a Gaussian mixture model probability distribution.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

init_param{‘farthest_first_traversal’,’manual’,’random_means’,’kmeans++’}

Specifies the initialization mode.

  • farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.

  • manual: The initial centers are the init_centers given by user.

  • random_means: The initial centers are the means of all the data that are randomly weighted.

  • kmeans++: The initial centers are given using the k-means++ approach.

n_componentsint

Specifies the number of Gaussian distributions. Mandatory when init_param is not ‘manual’.

init_centerslist of integers/strings

Specifies the rows of data to be used as initial centers by providing their IDs in data. Mandatory when init_param is ‘manual’.

covariance_type{‘full’, ‘diag’, ‘tied_diag’}, optional

Specifies the type of covariance matrices in the model.

  • full: use full covariance matrices.

  • diag: use diagonal covariance matrices.

  • tied_diag: use diagonal covariance matrices with all equal diagonal entries.

Defaults to ‘full’.

shared_covariancebool, optional

All clusters share the same covariance matrix if True.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

max_iterint, optional

Specifies the maximum number of iterations for the EM algorithm.

Defaults to 100.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

category_weightfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

error_tolfloat, optional

Specifies the error tolerance, which is the stop condition.

Defaults to 1e-5.

regularizationfloat, optional

Regularization to be added to the diagonal of covariance matrices to ensure positive-definiteness.

Defaults to 1e-6.

random_seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

Examples

Input dataframe df1 for training:

>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Creating the GMM instance:

>>> gmm = GaussianMixture(conn_context=conn,
...                       init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)

Performing fit() on the given dataframe:

>>> gmm.fit(data=df1, key='ID')

Expected output:

>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID     PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9    10          0          1.0
10   11          0          1.0
11   12          0          1.0
12   13          0          1.0
13   14          0          0.0
>>> gmm.stats_.collect()
        STAT_NAME  STAT_VALUE
1  log-likelihood     11.7199
2             aic   -504.5536
3             bic   -480.3900
>>> gmm.model_.collect()
       ROW_INDEX    CLUSTER_ID         MODEL_CONTENT
1        0            -1           {"Algorithm":"GMM","Metadata":{"DataP...
2        1             0           {"GuassModel":{"covariance":[22.18895...
3        2             1           {"GuassModel":{"covariance":[22.19450...
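
Manual initialization is not shown above; a minimal sketch reusing df1, where the initial centers are given by the IDs of existing rows (gmm_manual and the chosen IDs 0 and 12 are illustrative assumptions; output omitted):

>>> gmm_manual = GaussianMixture(conn_context=conn,
...                              init_param='manual',
...                              init_centers=[0, 12],
...                              covariance_type='full',
...                              max_iter=500,
...                              random_seed=1)
>>> gmm_manual.fit(data=df1, key='ID')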
Attributes
model_DataFrame

Trained model content.

labels_DataFrame

Cluster membership probabilities for each data point.

stats_DataFrame

Statistics.

Methods

fit(self, data, key[, features, …])

Perform GMM clustering on input dataset.

fit_predict(self, data, key[, features, …])

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset.

Parameters
dataDataFrame

Data to be clustered.

keystr

Name of the ID column.

featureslist of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset and return cluster membership probabilties for each data point.

Parameters
dataDataFrame

Data to be clustered.

keystr

Name of the ID column.

featureslist of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Cluster membership probabilities.
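
For illustration, a minimal sketch of calling fit_predict() with the gmm instance and training dataframe df1 from the example above; the returned DataFrame corresponds to the labels_ attribute:

>>> memberships = gmm.fit_predict(data=df1, key='ID', categorical_variable=['X3'])
>>> memberships.collect()   # one row per input ID with its cluster membership probability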

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
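
As a minimal sketch, assuming df2 is a hypothetical dataframe of new points with the same column structure as df1 from the example above, predict() can be called on the fitted gmm instance:

>>> assignments = gmm.predict(data=df2, key='ID')
>>> assignments.collect()   # columns: ID, CLUSTER_ID, DISTANCE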

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.
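
A minimal sketch of reusing trained model content with a fresh instance; gmm2 is a hypothetical instance configured like gmm in the example above, and gmm.model_ holds the trained model content:

>>> gmm2 = GaussianMixture(conn_context=conn, init_param='farthest_first_traversal',
...                        n_components=2, covariance_type='full')
>>> gmm2.load_model(gmm.model_)   # load the previously trained model content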

hana_ml.algorithms.pal.naive_bayes

This module contains Python wrappers for PAL Naive Bayes classification.

The following classes are available:

class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(conn_context, alpha=None, discretization=None, model_format=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, alpha_range=None, alpha_values=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A classification model based on Bayes’ theorem.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

alphafloat, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to 0.

discretization{‘no’, ‘supervised’}, optional
Discretize continuous attributes. Case-insensitive.
  • ‘no’ or not provided: disable discretization.

  • ‘supervised’: use supervised discretization on all the continuous attributes.

Defaults to ‘no’.

model_format{‘json’, ‘pmml’}, optional

Controls whether to output the model in JSON format or PMML format. Case-insensitive.

  • ‘json’ or not provided: JSON format.

  • ‘pmml’: PMML format.

Defaults to ‘json’.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{“cv”, “stratified_cv”, “bootstrap”, “stratified_bootstrap”}, optional

Specifies the resampling method for model evaluation or parameter selection. Mandatory if model evaluation or parameter selection is expected; if no value is specified, neither is activated.

No default value.

evaluation_metric{‘accuracy’, ‘f1_score’, ‘auc’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. Mandatory when resampling_method is set.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to cv or stratified_cv.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Default to 1.

search_strategy{‘grid’, ‘random’}

Specifies the parameter search method. No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is set to random.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Default to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.

No default value.

alpha_rangelist of numeric values, optional

Specifies the range for candidate alpha values for parameter selection. Only valid when search_strategy is specified.

No default value.

alpha_valueslist of numeric values, optional

Specifies candidate alpha values for parameter selection. Only valid when search_strategy is specified.

No default value.
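
For illustration, a minimal sketch of enabling parameter selection over candidate alpha values; conn is a hypothetical ConnectionContext and the resampling and metric settings below are arbitrary choices, not recommendations:

>>> nb_sel = NaiveBayes(conn, resampling_method='cv', evaluation_metric='accuracy',
...                     fold_num=5, search_strategy='grid',
...                     alpha_values=[0.01, 0.1, 1.0])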

Examples

Training data:

>>> df1.collect()
  HomeOwner MaritalStatus  AnnualIncome DefaultedBorrower
0       YES        Single         125.0                NO
1        NO       Married         100.0                NO
2        NO        Single          70.0                NO
3       YES       Married         120.0                NO
4        NO      Divorced          95.0               YES
5        NO       Married          60.0                NO
6       YES      Divorced         220.0                NO
7        NO        Single          85.0               YES
8        NO       Married          75.0                NO
9        NO        Single          90.0               YES

Training the model:

>>> nb = NaiveBayes(cc, alpha=1.0, model_format='pmml')
>>> nb.fit(df1)

Prediction:

>>> df2.collect()
   ID HomeOwner MaritalStatus  AnnualIncome
0   0        NO       Married         120.0
1   1       YES       Married         180.0
2   2        NO        Single          90.0
>>> nb.predict(df2, 'ID', alpha=1.0, verbose=True)
   ID CLASS  CONFIDENCE
0   0    NO   -6.572353
1   0   YES  -23.747252
2   1    NO   -7.602221
3   1   YES -169.133547
4   2    NO   -7.133599
5   2   YES   -4.648640
Attributes
model_DataFrame

Trained model content.

Note: The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

stats_DataFrame

Trained statistics content.

optim_param_DataFrame

Selected optimal parameters content.

Methods

fit(self, data[, key, features, label, …])

Fit classification model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, alpha, …])

Predict based on fitted model.

score(self, data, key[, features, label, alpha])

Returns the mean accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit classification model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variablestr or ListOfStrings, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None, alpha=None, verbose=None)

Predict based on fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

alphafloat, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

verbosebool, optional

If true, output all classes and the corresponding confidences for each data point.

Defaults to False.

Returns
DataFrame
Predicted result, structured as follows:
  • ID column, with the same name and type as data ‘s ID column.

  • CLASS, type NVARCHAR, predicted class name.

  • CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.

Note

A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.

score(self, data, key, features=None, label=None, alpha=None)

Returns the mean accuracy on the given test data and labels.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column.

alphafloat, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

Returns
float :

Mean accuracy on the given test data and labels.
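
As a minimal sketch, assuming df3 is a hypothetical dataframe with an ID column followed by the same feature and label columns as the training data df1:

>>> nb.score(df3, key='ID', label='DefaultedBorrower', alpha=1.0)   # returns the mean accuracy as a float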

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.neighbors

This module contains Python wrappers for PAL k-nearest neighbors algorithms.

The following classes are available:
class hana_ml.algorithms.pal.neighbors.KNNClassifier(conn_context, n_neighbors=None, thread_ratio=None, stat_info=None, voting_type=None, metric=None, minkowski_power=None, category_weights=None, algorithm=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.neighbors._KNNBase

K-Nearest Neighbor (KNN) model that handles classification problems.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

n_neighborsint, optional

Number of nearest neighbors (k). Default to 1.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use. Default to 0.0.

voting_type{‘majority’, ‘distance_weighted’}, optional

Voting type. Default to ‘distance_weighted’.

stat_infobool, optional

Indicate if statistic information will be stored into the STATISTIC table. Only valid when model evaluation/parameter selection is not enabled. Default to True.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional

Ways to compute the distance between data points. Defaults to ‘euclidean’.

minkowski_powerfloat, optional

When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is set to ‘minkowski’. Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes. Default to 0.707.

algorithm{‘brute-force’, ‘kd-tree’}, optional

Algorithm used to compute the nearest neighbors. Defaults to brute-force.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

resampling_method{‘cv’, ‘stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameters selection is activated. No default value.

evaluation_metric{‘accuracy’, ‘f1_score’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to cv or stratified_cv. No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling. Default to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection. No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when ‘search_strategy’ is set to ‘random’. No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified. Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided. No default value.

param_valuesListOfTuples, optional

Specifies values of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the target parameter name (in string format), and the 2nd element being a list of values for selection. Only valid when search_strategy is specified. Valid parameter names include: ‘metric’, ‘minkowski_power’, ‘category_weights’, ‘n_neighbors’, ‘voting_type’.

No default value.

param_rangeListOfTuples, optional

Specifies ranges of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the name of the target parameter (in string format), and the 2nd element being a list that specifies the range of that parameter in the format [start, step, end] or [start, end]. Only valid when search_strategy is specified. Valid parameter names include: ‘minkowski_power’, ‘category_weights’, ‘n_neighbors’. No default value.
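
For illustration, a minimal sketch of combining param_values and param_range with parameter selection; conn is a hypothetical ConnectionContext and the candidate values and ranges below are arbitrary:

>>> knn_sel = KNNClassifier(conn, resampling_method='cv', evaluation_metric='accuracy',
...                         fold_num=5, search_strategy='grid',
...                         param_values=[('voting_type', ['majority', 'distance_weighted'])],
...                         param_range=[('n_neighbors', [1, 1, 5])])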

Returns
If model evaluation or parameter selection is not enabled:
res_tblDataFrame
KNN prediction results, structured as follows:

  • ID: Prediction data ID.

  • TARGET: Predicted label or value.

stats_tblDataFrame
KNN prediction statistics information, structured as follows:

  • TEST_ + ID2 column name: Prediction data ID.

  • K: K number.

  • TRAIN_ + ID1 column name: Training data ID.

  • DISTANCE: Distance.

If model evaluation or parameter selection is enabled:
stats_tblDataFrame
Statistics information, structured as follows:

  • STAT_NAME: Statistic names.

  • STAT_VALUE: Statistic values.

optim_param_tblDataFrame
Selected optimal parameters, structured as follows:

  • PARAM_NAME: Selected optimal parameter names.

  • INT_VALUE

  • DOUBLE_VALUE

  • STRING_VALUE

Examples

Input dataframe for classification training:

>>> df_class_train.collect()
   ID  X1      X2 X3  TYPE
0   0   2     1.0  A     1
1   1   3    10.0  A    10
2   2   3    10.0  B    10
3   3   3    10.0  C     1
4   4   1  1000.0  C     1
5   5   1  1000.0  A    10
6   6   1  1000.0  B    99
7   7   1   999.0  A    99
8   8   1   999.0  B    10
9   9   1  1000.0  C    10

Creating KNNClassifier instance:

>>> knn = KNNClassifier(conn, thread_ratio=1, algorithm='kd_tree',
...                     n_neighbors=3, voting_type='majority')

Performing fit() on given dataframe:

>>> knn.fit(df_class_train, key='ID', label='TYPE')

Performing predict() on the given prediction dataframe:

>>> df_class_predict.collect()
   ID  X1       X2 X3
0   0   2      1.0  A
1   1   1     10.0  C
2   2   1     11.0  B
3   3   3  15000.0  C
4   4   2   1000.0  C
5   5   1   1001.0  A
6   6   1    999.0  A
7   7   3    999.0  B

>>> res, stats = knn._predict(df_class_predict, key='ID', categorical_variable='X1')
>>> res.collect()
   ID TARGET
0   0     10
1   1     10
2   2     10
3   3      1
4   4      1
5   5      1
6   6     10
7   7     99
>>> stats.collect().head(10)
    TEST_ID  K  TRAIN_ID      DISTANCE
0         0  1         0      0.000000
1         0  2         1      9.999849
2         0  3         2     10.414000
3         1  1         3      0.999849
4         1  2         1      1.414000
5         1  3         2      1.414000
6         2  1         2      1.999849
7         2  2         1      2.414000
8         2  3         3      2.414000
9         3  1         4  14000.999849
Attributes
_training_setDataFrame

Input training data with structured column arrangement. If model evaluation or parameter selection is not enabled, the first column must be the ID column, followed by feature columns.

Methods

fit(self, data[, key, features, label])

Build the KNNClassifier training dataset with the input dataframe.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key=None, features=None, label=None)

Build the KNNClassifier training dataset with the input dataframe. Assign key, features, and label column.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column. Required if parameter selection/model evaluation is not specified.

featuresstr/ListOfStrings, optional

Name of the feature columns.

labelstr, optional

Specifies the dependent variable. Defaults to the last column.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.neighbors.KNNRegressor(conn_context, n_neighbors=None, thread_ratio=None, stat_info=None, aggregate_type=None, metric=None, minkowski_power=None, category_weights=None, algorithm=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.neighbors._KNNBase

K-Nearest Neighbor (KNN) model that handles regression problems.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

n_neighborsint, optional

Number of nearest neighbors (k). Default to 1.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range are ignored and this function heuristically determines the number of threads to use. Default to 0.0.

aggregate_type{‘average’, ‘distance_weighted’}, optional

Aggregate type. Default to ‘distance_weighted’.

stat_infobool, optional

Indicate if statistic information will be stored into the STATISTIC table. Only valid when model evaluation/parameter selection is not enabled. Default to True.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional

Ways to compute the distance between data points. Defaults to ‘euclidean’.

minkowski_powerfloat, optional

When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is set to ‘minkowski’. Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes. Default to 0.707.

algorithm{‘brute-force’, ‘kd-tree’}, optional

Algorithm used to compute the nearest neighbors. Defaults to brute-force.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

resampling_method{‘bootstrap’, ‘stratified_bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameters selection is activated. No default value.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to cv or stratified_cv. No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling. Default to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection. No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when ‘search_strategy’ is set to ‘random’. No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified. Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided. No default value.

param_valuesListOfTuples, optional

Specifies values of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the target parameter name (in string format), and the 2nd element being a list of values for selection. Only valid when search_strategy is specified. Valid parameter names include: ‘metric’, ‘minkowski_power’, ‘category_weights’, ‘n_neighbors’, ‘aggregate_type’.

No default value.

param_rangeListOfTuples, optional

Specifies ranges of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the name of the target parameter (in string format), and the 2nd element being a list that specifies the range of that parameter in the format [start, step, end] or [start, end]. Only valid when search_strategy is specified. Valid parameter names include: ‘minkowski_power’, ‘category_weights’, ‘n_neighbors’. No default value.
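
For illustration, a minimal sketch analogous to the classifier case, selecting over aggregate_type and n_neighbors; conn is a hypothetical ConnectionContext and the candidate values below are arbitrary:

>>> knnr_sel = KNNRegressor(conn, resampling_method='bootstrap', evaluation_metric='rmse',
...                         search_strategy='grid',
...                         param_values=[('aggregate_type', ['average', 'distance_weighted'])],
...                         param_range=[('n_neighbors', [1, 1, 5])])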

Examples

Input dataframe for regression training:

>>> df_class_train.collect()
    ID  X1      X2 X3  TYPE
0   0   2     1.0  A     1
1   1   3    10.0  A    10
2   2   3    10.0  B    10
3   3   3    10.0  C     1
4   4   1  1000.0  C     1
5   5   1  1000.0  A    10
6   6   1  1000.0  B    99
7   7   1   999.0  A    99
8   8   1   999.0  B    10
9   9   1  1000.0  C    10

Creating KNNRegressor instance:

>>> knn = KNNRegressor(conn, thread_ratio=1, algorithm='kd_tree',
...                    n_neighbors=3, aggregate_type='average')

Performing fit() on given dataframe:

>>> knn.fit(df_class_train, key='ID', label='TYPE')

Performing predict() on the given prediction dataframe:

>>> df_class_predict.collect()
   ID  X1       X2 X3
0   0   2      1.0  A
1   1   1     10.0  C
2   2   1     11.0  B
3   3   3  15000.0  C
4   4   2   1000.0  C
5   5   1   1001.0  A
6   6   1    999.0  A
7   7   3    999.0  B

>>> res, stats = knn._predict(df_class_predict, key='ID', categorical_variable='X1')
>>> res.collect()
    ID              TARGET
0   0                   7
1   1                   7
2   2                   7
3   3  36.666666666666664
4   4  36.666666666666664
5   5  36.666666666666664
6   6  39.666666666666664
7   7   69.33333333333333
>>> stats.collect().head(10)
    TEST_ID  K  TRAIN_ID      DISTANCE
0         0  1         0      0.000000
1         0  2         1      9.999849
2         0  3         2     10.414000
3         1  1         3      0.999849
4         1  2         1      1.414000
5         1  3         2      1.414000
6         2  1         2      1.999849
7         2  2         1      2.414000
8         2  3         3      2.414000
9         3  1         4  14000.999849
Attributes
_training_setDataFrame

Input training data with structured column arrangement. If model evaluation or parameter selection is not enabled, the first column must be the ID column, followed by feature columns.

Methods

fit(self, data[, key, features, label])

Build the KNNRegressor training dataset with the input dataframe.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self[, data, key, features, …])

Prediction for the input data with the training dataset.

fit(self, data, key=None, features=None, label=None)

Build the KNNRegressor training dataset with the input dataframe. Assign key, features, and label column.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column. Required if parameter selection/model evaluation is not specified.

featuresstr/ListOfStrings, optional

Name of the feature columns.

labelstr, optional

Specifies the dependent variable. Defaults to the last column.

predict(self, data=None, key=None, features=None, categorical_variable=None)

Prediction for the input data with the training dataset. Training data set must be constructed through the fit function first.

Parameters
dataDataFrame

Prediction data.

keystr, optional

Name of the ID column. Required if parameter selection/model evaluation is not specified.

featuresstr/ListOfStrings, optional

Name of the feature columns.

categorical_variablestr/ListofStrings, optional

Specifies INTEGER column(s) that should be treated as categorical rather than continuous.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.neighbors.KNN(conn_context, n_neighbors=None, thread_ratio=None, voting_type=None, stat_info=True, metric=None, minkowski_power=None, algorithm=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-Nearest Neighbor(KNN) model that handles classification problems.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

n_neighborsint, optional

Number of nearest neighbors. Defaults to 1.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

voting_type{‘majority’, ‘distance-weighted’}, optional

Method used to vote for the most frequent label of the K nearest neighbors. Defaults to distance-weighted.

stat_infobool, optional

Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set. If true, the table will be returned. Defaults to True.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional

Ways to compute the distance between data points. Defaults to euclidean.

minkowski_powerfloat, optional

When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is Minkowski. Defaults to 3.0.

algorithm{‘brute-force’, ‘kd-tree’}, optional

Algorithm used to compute the nearest neighbors. Defaults to brute-force.

Examples

Training data:

>>> df.collect()
   ID      X1      X2  TYPE
0   0     1.0     1.0     2
1   1    10.0    10.0     3
2   2    10.0    11.0     3
3   3    10.0    10.0     3
4   4  1000.0  1000.0     1
5   5  1000.0  1001.0     1
6   6  1000.0   999.0     1
7   7   999.0   999.0     1
8   8   999.0  1000.0     1
9   9  1000.0  1000.0     1

Create KNN instance and call fit:

>>> knn = KNN(connection_context, n_neighbors=3, voting_type='majority',
...           thread_ratio=0.1, stat_info=False)
>>> knn.fit(df, 'ID', features=['X1', 'X2'], label='TYPE')
>>> pred_df = connection_context.table("PAL_KNN_CLASSDATA_TBL")

Call predict:

>>> res, stat = knn.predict(pred_df, "ID")
>>> res.collect()
   ID  TYPE
0   0     3
1   1     3
2   2     3
3   3     1
4   4     1
5   5     1
6   6     1
7   7     1

Methods

fit(self, data, key[, features, label])

Fit the model when given training set.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Predict the class labels for the provided data

score(self, data, key[, features, label])

Return a scalar accuracy value after comparing the predicted and original label.

fit(self, data, key, features=None, label=None)

Fit the model when given training set.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(self, data, key, features=None)

Predict the class labels for the provided data

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
result_dfDataFrame
Predicted result, structured as follows:
  • ID column, with same name and type as data ‘s ID column.

  • Label column, with same name and type as training data’s label column.

nearest_neighbors_dfDataFrame

The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:

  • TEST_ + data ‘s ID name, with same type as data ‘s ID column, query data ID.

  • K, type INTEGER, K number.

  • TRAIN_ + training data’s ID name, with same type as training data’s ID column, neighbor point’s ID.

  • DISTANCE, type DOUBLE, distance.

score(self, data, key, features=None, label=None)

Return a scalar accuracy value after comparing the predicted and original label.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns
accuracyfloat

Scalar accuracy value after comparing the predicted label and original label.
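
A minimal sketch, reusing the training dataframe df from the example above purely for illustration:

>>> knn.score(df, 'ID', features=['X1', 'X2'], label='TYPE')   # scalar accuracy value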

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.neural_network

This module contains Python wrappers for PAL Multi-layer Perceptron algorithm.

The following classes are available:

class hana_ml.algorithms.pal.neural_network.MLPClassifier(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Classifier.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

activation{‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory

Activation function for the hidden layer. Mandatory if activation_options is not provided.

activation_optionslist of str, conditionally mandatory

A list of activation functions for parameter selection.

See activation for the full set of valid activation functions.

output_activation{‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory

Activation function for the output layer. Mandatory if output_activation_options is not provided.

output_activation_optionslist of str, conditionally mandatory

A list of activation functions for the output layer for parameter selection.

See output_activation for the full set of activation functions for output layer.

hidden_layer_sizelist of int or tuple of int

Sizes of all hidden layers.

hidden_layer_size_optionslist of tuples, conditionally mandatory

A list of optional sizes of all hidden layers for parameter selection.

max_iterint, optional

Maximum number of iterations.

Defaults to 100.

training_style{‘batch’, ‘stochastic’}, optional

Specifies the training style.

Defaults to ‘stochastic’.

learning_ratefloat, optional

Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.

momentumfloat, optional

Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.

batch_sizeint, optional

Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.

Defaults to 1.

normalization{‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to ‘no’.

weight_init{‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional

Specifies the weight initial value.

Defaults to ‘all-zeros’.

categorical_variablestr or list of str, optional

Specifies column name(s) in the data table used as category variable.

Valid only when column is of INTEGER type.

thread_ratiofloat, optional

Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{‘cv’,’stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric{‘accuracy’,’f1_score’, ‘auc_onevsrest’, ‘auc_pairwise’}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_numint, optional

Specifies the fold number for the cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’ or ‘stratified_cv’.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method for parameter selection. If not provided, parameter selection will not be activated.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.

random_stateint, optional

Specifies the seed for random generation. When 0 is specified, system time is used.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.

param_valueslist of tuple, optional

Sets the values of following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple contains two elements

  • 1st element is the parameter name(str type),

  • 2nd element is a list of valid values for that parameter.

A simple example for illustration:

[(‘learning_rate’, [0.1, 0.2, 0.5]),

(‘momentum’, [0.2, 0.6])]

Valid only when search_strategy is specified and training_style is ‘stochastic’.

param_rangelist of tuple, optional

Sets the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple should contain two elements:

  • 1st element is the parameter name(str type),

  • 2nd element is a list that specifies the range of that parameter as follows:

first value is the start value, second value is the step, and third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.

Valid only when search_strategy is specified and training_style is ‘stochastic’.
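
For illustration, a minimal sketch of supplying param_range together with parameter selection in stochastic training style; the candidate range below is arbitrary:

>>> mlpc_sel = MLPClassifier(conn_context=conn, hidden_layer_size=(10,10),
...                          activation='tanh', output_activation='tanh',
...                          training_style='stochastic',
...                          learning_rate=0.001, momentum=0.0001,
...                          resampling_method='cv', evaluation_metric='accuracy',
...                          fold_num=5, search_strategy='grid',
...                          param_range=[('batch_size', [1, 1, 3])])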

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003 LABEL
0     1  1.71   AC     0    AA
1    10  1.78   CA     5    AB
2    17  2.36   AA     6    AA
3    12  3.15   AA     2     C
4     7  1.05   CA     3    AB
5     6  1.50   CA     2    AB
6     9  1.97   CA     6     C
7     5  1.26   AA     1    AA
8    12  2.13   AC     4     C
9    18  1.87   AC     6    AA

Training the model:

>>> mlpc = MLPClassifier(conn_context=conn, hidden_layer_size=(10,10),
...                      activation='tanh', output_activation='tanh',
...                      learning_rate=0.001, momentum=0.0001,
...                      training_style='stochastic',max_iter=100,
...                      normalization='z-transform', weight_init='normal',
...                      thread_ratio=0.3, categorical_variable='V003')
>>> mlpc.fit(data=df)

Training result may look different from the following results due to model randomness.

>>> mlpc.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  t":0.2700182926188939},{"from":13,"weight":0.0...
2          3  ht":0.2414416413305134},{"from":21,"weight":0....
>>> mlpc.train_log_.collect()
    ITERATION     ERROR
0           1  1.080261
1           2  1.008358
2           3  0.947069
3           4  0.894585
4           5  0.849411
5           6  0.810309
6           7  0.776256
7           8  0.746413
8           9  0.720093
9          10  0.696737
10         11  0.675886
11         12  0.657166
12         13  0.640270
13         14  0.624943
14         15  0.609432
15         16  0.595204
16         17  0.582101
17         18  0.569990
18         19  0.558757
19         20  0.548305
20         21  0.538553
21         22  0.529429
22         23  0.521457
23         24  0.513893
24         25  0.506704
25         26  0.499861
26         27  0.493338
27         28  0.487111
28         29  0.481159
29         30  0.475462
..        ...       ...
70         71  0.349684
71         72  0.347798
72         73  0.345954
73         74  0.344071
74         75  0.342232
75         76  0.340597
76         77  0.338837
77         78  0.337236
78         79  0.335749
79         80  0.334296
80         81  0.332759
81         82  0.331255
82         83  0.329810
83         84  0.328367
84         85  0.326952
85         86  0.325566
86         87  0.324232
87         88  0.322899
88         89  0.321593
89         90  0.320242
90         91  0.318985
91         92  0.317840
92         93  0.316630
93         94  0.315376
94         95  0.314210
95         96  0.313066
96         97  0.312021
97         98  0.310916
98         99  0.309770
99        100  0.308704

Prediction:

>>> pred_df.collect()
>>> res, stat = mlpc.predict(data=pred_df, key='ID')

Prediction result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET     VALUE
0   1      C  0.472751
1   2      C  0.417681
2   3      C  0.543967
>>> stat.collect()
   ID CLASS  SOFT_MAX
0   1    AA  0.371996
1   1    AB  0.155253
2   1     C  0.472751
3   2    AA  0.357822
4   2    AB  0.224496
5   2     C  0.417681
6   3    AA  0.349813
7   3    AB  0.106220
8   3     C  0.543967

Model Evaluation:

>>> mlpc = MLPClassifier(conn_context=conn,
...                      activation='tanh',
...                      output_activation='tanh',
...                      hidden_layer_size=(10,10),
...                      learning_rate=0.001,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='cv',
...                      evaluation_metric='f1_score',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Model evaluation result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_F1_SCORE                       1, 0, 1, 1, 0, 1, 0, 1, 1, 0
2     TEST_2_F1_SCORE                       0, 0, 1, 1, 0, 1, 0, 1, 1, 1
3  TEST_F1_SCORE.MEAN                                                0.6
4   TEST_F1_SCORE.VAR                                           0.252631
5      EVAL_RESULTS_1  {"candidates":[{"TEST_F1_SCORE":[[1.0,0.0,1.0,...
6     solution status  Convergence not reached after maximum number o...
7               ERROR                                 0.2951168443145714

Parameter selection:

>>> act_opts=['tanh', 'linear', 'sigmoid_asymmetric']
>>> out_act_opts = ['sigmoid_symmetric', 'gaussian_asymmetric', 'gaussian_symmetric']
>>> layer_size_opts = [(10, 10), (5, 5, 5)]
>>> mlpc = MLPClassifier(conn_context=conn,
...                      activation_options=act_opts,
...                      output_activation_options=out_act_opts,
...                      hidden_layer_size_options=layer_size_opts,
...                      learning_rate=0.001,
...                      batch_size=2,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='stratified_bootstrap',
...                      evaluation_metric='accuracy',
...                      search_strategy='grid',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Parameter selection result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_ACCURACY                                               0.25
2     TEST_2_ACCURACY                                           0.666666
3  TEST_ACCURACY.MEAN                                           0.458333
4   TEST_ACCURACY.VAR                                          0.0868055
5      EVAL_RESULTS_1  {"candidates":[{"TEST_ACCURACY":[[0.50],[0.0]]...
6      EVAL_RESULTS_2  PUT_LAYER_ACTIVE_FUNC=6;HIDDEN_LAYER_ACTIVE_FU...
7      EVAL_RESULTS_3  FUNC=2;"},{"TEST_ACCURACY":[[0.50],[0.33333333...
8      EVAL_RESULTS_4  rs":"HIDDEN_LAYER_SIZE=10, 10;OUTPUT_LAYER_ACT...
9               ERROR                                  0.684842661926971
>>> mlpc.optim_param_.collect()
                 PARAM_NAME  INT_VALUE DOUBLE_VALUE STRING_VALUE
0         HIDDEN_LAYER_SIZE        NaN         None      5, 5, 5
1  OUTPUT_LAYER_ACTIVE_FUNC        4.0         None         None
2  HIDDEN_LAYER_ACTIVE_FUNC        3.0         None         None
Attributes
model_DataFrame

Model content.

train_log_DataFrame

Provides mean squared error between predicted values and target values for each iteration.

stats_DataFrame

Names and values of statistics.

optim_param_DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

Methods

fit(self, data[, key, features, label, …])

Fit the model when the training dataset is given.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict using the multi-layer perceptron model.

score(self, data, key[, features, label, …])

Returns the accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when the training dataset is given.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(self, data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratiofloat, optional

Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Predicted classes, structured as follows:

  • ID column, with the same name and type as data ‘s ID column.

  • TARGET, type NVARCHAR, predicted class name.

  • VALUE, type DOUBLE, softmax value for the predicted class.

Softmax values for all classes, structured as follows:

  • ID column, with the same name and type as data ‘s ID column.

  • CLASS, type NVARCHAR, class name.

  • VALUE, type DOUBLE, softmax value for that class.

score(self, data, key, features=None, label=None, thread_ratio=None)

Returns the accuracy on the given test data and labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns
float

Scalar value of accuracy after comparing the predicted result and original label.
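
A minimal sketch, assuming df_test is a hypothetical labeled dataframe with an ID column and the same feature and label columns as df from the example above:

>>> mlpc.score(data=df_test, key='ID', label='LABEL')   # scalar accuracy value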

class hana_ml.algorithms.pal.neural_network.MLPRegressor(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Regressor.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

activation{‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}

Activation function for the hidden layer.

output_activation{‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}

Activation function for the output layer.

hidden_layer_sizetuple of int

Size of each hidden layer

max_iterint, optional

Maximum number of iterations.

Defaults to 100.

training_style{‘batch’, ‘stochastic’}, optional

Specifies the training style.

Defaults to ‘stochastic’.

learning_ratefloat, optional

Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.

momentumfloat, optional

Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.

batch_sizeint, optional

Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.

Defaults to 1.

normalization{‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to ‘no’.

weight_init{‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional

Specifies the weight initial value.

Defaults to ‘all-zeros’.

categorical_variablestr or list of str, optional

Specifies column name(s) in the data table used as category variable.

Valid only when column is of INTEGER type.

thread_ratiofloat, optional

Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_numint, optional

Specifies the fold number for the cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method for parameter selection. If not provided, parameter selection will not be activated.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.

random_stateint, optional

Specifies the seed for random generation. When 0 is specified, system time is used.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.

param_valueslist of tuple, optional

Sets the values of following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple contains two elements - 1st element is the parameter name(str type), 2nd element is a list of valid values for that parameter.

A simple example for illustration:

[(‘learning_rate’, [0.1, 0.2, 0.5]),

(‘momentum’, [0.2, 0.6])]

Valid only when search_strategy is specified and training_style is ‘stochastic’.

param_rangelist of tuple, optional

Sets the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple should contain two elements:

  • 1st element is the parameter name(str type),

  • 2nd element is a list that specifies the range of that parameter as follows:

first value is the start value, second value is the step, and third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.

Valid only when search_strategy is specified and training_style is ‘stochastic’.

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003  T001  T002  T003
0     1  1.71   AC     0  12.7   2.8  3.06
1    10  1.78   CA     5  12.1   8.0  2.65
2    17  2.36   AA     6  10.1   2.8  3.24
3    12  3.15   AA     2  28.1   5.6  2.24
4     7  1.05   CA     3  19.8   7.1  1.98
5     6  1.50   CA     2  23.2   4.9  2.12
6     9  1.97   CA     6  24.5   4.2  1.05
7     5  1.26   AA     1  13.6   5.1  2.78
8    12  2.13   AC     4  13.2   1.9  1.34
9    18  1.87   AC     6  25.5   3.6  2.14

Training the model:

>>> mlpr = MLPRegressor(conn_context=conn, hidden_layer_size=(10,5),
...                     activation='sin_asymmetric',
...                     output_activation='sin_asymmetric',
...                     learning_rate=0.001, momentum=0.00001,
...                     training_style='batch',
...                     max_iter=10000, normalization='z-transform',
...                     weight_init='normal', thread_ratio=0.3)
>>> mlpr.fit(data=df, label=['T001', 'T002', 'T003'])

Training result may look different from the following results due to model randomness.

>>> mlpr.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  3782583596893},{"from":10,"weight":-0.16532599...
>>> mlpr.train_log_.collect()
     ITERATION       ERROR
0            1   34.525655
1            2   82.656301
2            3   67.289241
3            4  162.768062
4            5   38.988242
5            6  142.239468
6            7   34.467742
7            8   31.050946
8            9   30.863581
9           10   30.078204
10          11   26.671436
11          12   28.078312
12          13   27.243226
13          14   26.916686
14          15   26.782915
15          16   26.724266
16          17   26.697108
17          18   26.684084
18          19   26.677713
19          20   26.674563
20          21   26.672997
21          22   26.672216
22          23   26.671826
23          24   26.671631
24          25   26.671533
25          26   26.671485
26          27   26.671460
27          28   26.671448
28          29   26.671442
29          30   26.671439
..         ...         ...
705        706   11.891081
706        707   11.891081
707        708   11.891081
708        709   11.891081
709        710   11.891081
710        711   11.891081
711        712   11.891081
712        713   11.891081
713        714   11.891081
714        715   11.891081
715        716   11.891081
716        717   11.891081
717        718   11.891081
718        719   11.891081
719        720   11.891081
720        721   11.891081
721        722   11.891081
722        723   11.891081
723        724   11.891081
724        725   11.891081
725        726   11.891081
726        727   11.891081
727        728   11.891081
728        729   11.891081
729        730   11.891081
730        731   11.891081
731        732   11.891081
732        733   11.891081
733        734   11.891081
734        735   11.891081

[735 rows x 2 columns]

Prediction data:

>>> pred_df.collect()
   ID  V000  V001 V002  V003
0   1     1  1.71   AC     0
1   2    10  1.78   CA     5
2   3    17  2.36   AA     6

Prediction:

>>> res  = mlpr.predict(data=pred_df, key='ID')

Prediction results may differ from those shown below due to model randomness.

>>> res.collect()
   ID TARGET      VALUE
0   1   T001  12.700012
1   1   T002   2.799133
2   1   T003   2.190000
3   2   T001  12.099740
4   2   T002   6.100000
5   2   T003   2.190000
6   3   T001  10.099961
7   3   T002   2.799659
8   3   T003   2.190000
Attributes
model_DataFrame

Model content.

train_log_DataFrame

Provides mean squared error between predicted values and target values for each iteration.

stats_DataFrame

Names and values of statistics.

optim_param_DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

Methods

fit(self, data[, key, features, label, …])

Fit the model when given training dataset.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict using the multi-layer perceptron model.

score(self, data, key[, features, label, …])

Returns the coefficient of determination R^2 of the prediction.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratiofloat, optional

Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Predicted results, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • TARGET, type NVARCHAR, target name.

  • VALUE, type DOUBLE, regression value.

score(self, data, key, features=None, label=None, thread_ratio=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R^2 of the prediction.
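
A minimal usage sketch of score, assuming a hypothetical evaluation DataFrame df_eval with the same feature and target columns as the training data plus an 'ID' key column:

>>> r2 = mlpr.score(data=df_eval, key='ID',
...                 features=['V000', 'V001', 'V002', 'V003'],
...                 label=['T001', 'T002', 'T003'])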

hana_ml.algorithms.pal.pagerank

This module contains python wrapper for PAL PageRank algorithm.

The following class is available:

class hana_ml.algorithms.pal.pagerank.PageRank(conn_context, damping=None, max_iter=None, tol=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A page rank model.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

dampingfloat, optional

The damping factor d.

Defaults to 0.85.

max_iterint, optional

The maximum number of iterations of power method. The value 0 means no maximum number of iterations is set and the calculation stops when the result converges.

Defaults to 0.

tolfloat, optional

Specifies the stop condition. When the mean improvement value of ranks is less than this value, the program stops calculation.

Defaults to 1e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
   FROM_NODE    TO_NODE
0   Node1       Node2
1   Node1       Node3
2   Node1       Node4
3   Node2       Node3
4   Node2       Node4
5   Node3       Node1
6   Node4       Node1
7   Node4       Node3

Create a PageRank instance:

>>> pr = PageRank(conn_context=conn)

Call run() on given data sequence:

>>> result = pr.run(data=df)
>>> result.collect()
   NODE     RANK
0   NODE1   0.368152
1   NODE2   0.141808
2   NODE3   0.287962
3   NODE4   0.202078
Attributes
None

Methods

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

run(self, data)

This method reads link information and calculates rank for each node.

run(self, data)

This method reads link information and calculates rank for each node.

Parameters
dataDataFrame

DataFrame containing the link information between nodes, as illustrated by the FROM_NODE and TO_NODE columns in the example above.

Returns
DataFrame

Calculated rank values and corresponding node names, structured as follows:

  • NODE: node names.

  • RANK: the PageRank of the corresponding node.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.partition

This module contains a Python wrapper for the PAL partition function.

The following function is available:

hana_ml.algorithms.pal.partition.train_test_val_split(conn_context, data, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)

The algorithm randomly partitions an input dataset into three disjoint subsets called training, testing and validation. Note that the union of these three subsets might not be the complete initial dataset.

Two different partitions can be obtained:

  1. Random Partition, which randomly divides all the data.

  2. Stratified Partition, which divides each subpopulation randomly.

In the second case, the dataset needs to have at least one categorical attribute (for example, of type VARCHAR). The initial dataset is first subdivided according to the different categorical values of this attribute. Each mutually exclusive subset is then randomly split to obtain the training, testing, and validation subsets. This ensures that all “categorical values” or “strata” are present in the sampled subsets.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

dataDataFrame

DataFrame to be partitioned.

random_seedint, optional

Indicates the seed used to initialize the random number generator.

0: Uses the system time

Not 0: Uses the specified seed

Defaults to 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

partition_method{‘random’, ‘stratified’}, optional
Partition method:
  • ‘random’: random partitions

  • ‘stratified’: stratified partition

Defaults to ‘random’.

stratified_columnstr, optional

Indicates which column is used for stratification.

Valid only when partition_method is set to ‘stratified’ (stratified partition).

No default value.

training_percentagefloat, optional

The percentage of training data. Value range: 0 <= value <= 1.

Defaults to 0.8.

testing_percentagefloat, optional

The percentage of testing data. Value range: 0 <= value <= 1.

Defaults to 0.1.

validation_percentagefloat, optional

The percentage of validation data. Value range: 0 <= value <= 1.

Defaults to 0.1.

training_sizeint, optional

Row size of training data. Value range: >=0

If both training_percentage and training_size are specified, training_percentage takes precedence.

No default value.

testing_sizeint, optional

Row size of testing data. Value range: >=0

If both testing_percentage and testing_size are specified, testing_percentage takes precedence.

No default value.

validation_sizeint, optional

Row size of validation data. Value range:>=0

If both validation_percentage and validation_size are specified, validation_percentage takes precedence.

No default value.

Returns
DataFrame

Training data. Table structure identical to input data table.

Testing data. Table structure identical to input data table.

Validation data. Table structure identical to input data table.

Examples

>>> train, test, valid = train_test_val_split(conn_context=conn, data=df)
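
A sketch of a stratified split; the stratification column name 'GENDER' and the exact percentages are assumptions for illustration:

>>> train, test, valid = train_test_val_split(
...     conn_context=conn, data=df,
...     partition_method='stratified', stratified_column='GENDER',
...     training_percentage=0.7, testing_percentage=0.2,
...     validation_percentage=0.1)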

hana_ml.algorithms.pal.pipeline

This module supports running PAL functions in a pipeline manner.

class hana_ml.algorithms.pal.pipeline.Pipeline(steps)

Bases: object

Pipeline construction to run transformers and estimators sequentially.

Parameters
stepslist

List of (name, transform) tuples that are chained. The last object should be an estimator.

Examples

>>> Pipeline([
    ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
    ('imputer', Imputer(conn_context=conn, strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
        n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
        max_depth=6, cross_validation_range=cv_range))
    ])

Methods

fit(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

paramdict

Parameters corresponding to the transform name.

Returns
DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
        ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
        ('imputer', Imputer(conn_context=conn, strategy='mean'))
        ])
>>> param = {'pca': [('key', 'ID'), ('label', 'CLASS')], 'imputer': []}
>>> my_pipeline.fit_transform(data=train_data, param=param)
fit(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

paramdict

Parameters corresponding to the transform name.

Returns
DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
    ('imputer', Imputer(conn_context=conn, strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
    n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
    max_depth=6, cross_validation_range=cv_range))
    ])
>>> param = {
                'pca': [('key', 'ID'), ('label', 'CLASS')],
                'imputer': [],
                'hgbt': [('key', 'ID'), ('label', 'CLASS'), ('categorical_variable', ['CLASS'])]
            }
>>> hgbt_model = my_pipeline.fit(data=train_data, param=param)

hana_ml.algorithms.pal.preprocessing

This module contains Python wrappers for PAL preprocessing algorithms.

The following classes and functions are available:

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(conn_context, method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Normalize a DataFrame.

Parameters
conn_contextConnectionContext

The connection to the SAP HANA system.

method{‘min-max’, ‘z-score’, ‘decimal’}

Scaling methods:

  • ‘min-max’: Min-max normalization.

  • ‘z-score’: Z-Score normalization.

  • ‘decimal’: Decimal scaling normalization.

z_score_method{‘mean-standard’, ‘mean-mean’, ‘median-median’}, optional

Only valid when method is ‘z-score’.

  • ‘mean-standard’: Mean-Standard deviation

  • ‘mean-mean’: Mean-Mean deviation

  • ‘median-median’: Median-Median absolute deviation

new_maxfloat, optional

The new maximum value for min-max normalization.

Only valid when method is ‘min-max’.

new_minfloat, optional

The new minimum value for min-max normalization.

Only valid when method is ‘min-max’.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

Input DataFrame df1:

>>> df1.head(4).collect()
    ID    X1    X2
0    0   6.0   9.0
1    1  12.1   8.3
2    2  13.5  15.3
3    3  15.4  18.7

Creating a FeatureNormalizer instance:

>>> fn = FeatureNormalizer(conn_context=conn, method="min-max", new_max=1.0, new_min=0.0)

Performing fit on given DataFrame:

>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
    ID        X1        X2
0    0  0.000000  0.033175
1    1  0.186544  0.000000
2    2  0.229358  0.331754
3    3  0.287462  0.492891

Input DataFrame for transforming:

>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0

Performing transform on given DataFrame:

>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
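
As a further sketch (not reference output), a z-score normalizer using the 'mean-standard' method could be fitted on the same df1 as follows:

>>> fn_z = FeatureNormalizer(conn_context=conn, method='z-score',
...                          z_score_method='mean-standard')
>>> fn_z.fit(df1, key='ID')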
Attributes
result_DataFrame

Scaled dataset from fit and fit_transform methods.

model_ :

Trained model content.

Methods

fit(self, data, key[, features])

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

fit_transform(self, data, key[, features])

Fit with the dataset and return the results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features])

Scales data based on the previous scaling model.

fit(self, data, key, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

fit_transform(self, data, key, features=None)

Fit with the dataset and return the results.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Normalized result, with the same structure as data.

transform(self, data, key, features=None)

Scales data based on the previous scaling model.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Normalized result, with the same structure as data.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(conn_context, strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bin continuous data into number of intervals and perform local smoothing.

Parameters
conn_contextConnectionContext

The connection to the SAP HANA system.

strategy{‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}
Binning methods:
  • ‘uniform_number’: Equal widths based on the number of bins.

  • ‘uniform_size’: Equal widths based on the bin size.

  • ‘quantile’: Equal number of records per bin.

  • ‘sd’: Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.

smoothing{‘means’, ‘medians’, ‘boundaries’}
Smoothing methods:
  • ‘means’: Each value within a bin is replaced by the average of all the values belonging to the same bin.

  • ‘medians’: Each value in a bin is replaced by the median of all the values belonging to the same bin.

  • ‘boundaries’: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When the distance is equal to both sides, it will be replaced by the front boundary value.

Values used for smoothing are not re-calculated during transform.

n_binsint, optional

The number of bins. Only valid when strategy is ‘uniform_number’ or ‘quantile’.

Defaults to 2.

bin_sizeint, optional

The interval width of each bin. Only valid when strategy is ‘uniform_size’.

Defaults to 10.

n_sdint, optional

The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean. Only valid when strategy is ‘sd’.

Defaults to 1.

Examples

Input DataFrame df1:

>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0

Creating a KBinsDiscretizer instance:

>>> binning = KBinsDiscretizer(conn_context=conn, strategy='uniform_size', smoothing='means', bin_size=10)

Performing fit on the given DataFrame:

>>> binning.fit(data=df1, key='ID')

Output:

>>> binning.result_.collect()
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000

Input DataFrame df2 for transforming:

>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0

Performing transform on the given DataFrame:

>>> result = binning.transform(data=df2, key='ID')

Output:

>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000
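
As an additional sketch (not reference output), an 'sd'-based discretizer could be constructed and fitted in the same way; the n_sd value here is an assumption for illustration:

>>> binning_sd = KBinsDiscretizer(conn_context=conn, strategy='sd',
...                               smoothing='medians', n_sd=2)
>>> binning_sd.fit(data=df1, key='ID')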
Attributes
result_DataFrame

Binned dataset from fit and fit_transform methods.

model_ :

Binning model content.

Methods

fit(self, data, key[, features])

Bin input data into number of intervals and smooth.

fit_transform(self, data, key[, features])

Fit with the dataset and return the results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features])

Bin data based on the previous binning model.

fit(self, data, key, features=None)

Bin input data into number of intervals and smooth.

Parameters
dataDataFrame

DataFrame to be discretized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

fit_transform(self, data, key, features=None)

Fit with the dataset and return the results.

Parameters
dataDataFrame

DataFrame to be binned.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns
DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data’s ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data’s feature column.

transform(self, data, key, features=None)

Bin data based on the previous binning model.

Parameters
dataDataFrame

DataFrame to be binned.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns
DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data's ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data's feature column.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Imputer(conn_context, strategy=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Missing value imputation for DataFrame.

Parameters
conn_contextConnectionContext

The connection to the SAP HANA system.

strategy{‘non’, ‘mean’, ‘median’, ‘zero’, ‘als’, ‘delete’}, optional

The overall imputation strategy for all Numerical columns.

Defaults to ‘mean’.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

The following parameters all have the prefix ‘als_’, and are relevant only when ‘als’ is the overall imputation strategy. They set up the alternating-least-squares (ALS) model for data imputation.

als_factorsint, optional

Length of factor vectors in the ALS model. It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambdafloat, optional

L2 regularization applied to the factors in the ALS model. Should be non-negative.

Defaults to 0.01.

als_maxitint, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstateint, optional

Specifies the seed of the random number generator used in the training of ALS model:

0: Uses the current time as the seed,

Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_thresholdfloat, optional

Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model between consecutive checks is less than this value, the training process exits. 0 means the objective value is not checked while running the algorithm, and training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_intervalint, optional

Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als_exit_threshold is reached.

Defaults to 5.

als_linsolver{‘cholsky’, ‘cg’}, optional

Linear system solver for the ALS model. ‘cholsky’ is usually much faster. ‘cg’ is recommended when als_factors is large.

Defaults to ‘cholsky’.

als_cg_maxitint, optional

Specifies the maximum number of iterations for cg algorithm. Invoked only when the ‘cg’ is the chosen linear system solver for ALS.

Defaults to 3.

als_centeringbool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scalingbool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

Examples

Input DataFrame df:

>>> df.head(5).collect()
   V0   V1 V2   V3   V4    V5
0  10  0.0  D  NaN  1.4  23.6
1  20  1.0  A  0.4  1.3  21.8
2  50  1.0  C  NaN  1.6  21.9
3  30  NaN  B  0.8  1.7  22.6
4  10  0.0  A  0.2  NaN   NaN

Create an Imputer instance using ‘mean’ strategy and call fit:

>>> impute = Imputer(conn_context, strategy='mean')
>>> result = impute.fit_transform(df, categorical_variable=['V1'],
...                      strategy_by_col=[('V1', 'categorical_const', '0')])
>>> result.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.507692  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.507692  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.469231  20.646154

The statistics/model content collected from the input DataFrame:

>>> impute.stats_model_.head(5).collect()
            STAT_NAME                   STAT_VALUE
0  V0.NUMBER_OF_NULLS                            3
1  V0.IMPUTATION_TYPE                         MEAN
2    V0.IMPUTED_VALUE                           24
3  V1.NUMBER_OF_NULLS                            2
4  V1.IMPUTATION_TYPE  SPECIFIED_CATEGORICAL_VALUE

The above statistics/model content can be used to impute another DataFrame with the same data structure, e.g. consider the following DataFrame with missing values:

>>> df1.collect()
   ID    V0   V1    V2   V3   V4    V5
0   0  20.0  1.0     B  NaN  1.5  21.7
1   1  40.0  1.0  None  0.6  1.2  24.3
2   2   NaN  0.0     D  NaN  1.8  22.6
3   3  50.0  NaN     C  0.7  1.1   NaN
4   4  20.0  1.0     A  0.3  NaN  20.6

With the fitted impute instance obtained above, one can impute the missing values of df1 via the following line of code, and then check the result:

>>> result1, _ = impute.transform(data=df1, key='ID')
>>> result1.collect()
   ID  V0  V1 V2        V3        V4         V5
0   0  20   1  B  0.507692  1.500000  21.700000
1   1  40   1  A  0.600000  1.200000  24.300000
2   2  24   0  D  0.507692  1.800000  22.600000
3   3  50   0  C  0.700000  1.100000  20.646154
4   4  20   1  A  0.300000  1.469231  20.600000

Create an Imputer instance using other strategies, e.g. ‘als’ strategy and then call fit:

>>> impute = Imputer(conn_context=conn, strategy='als', als_factors=2, als_randomstate=1)

Output:

>>> result2 = impute.fit_transform(data=df, categorical_variable=['V1'])
>>> result2.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.306957  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.930689  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.333668  21.371753
Attributes
stats_model_DataFrame

statistics/model content.

Methods

fit_transform(self, data[, key, …])

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data[, key, thread_ratio])

The function imputes missing values of a DataFrame using the statistics/model info collected from another DataFrame.

fit_transform(self, data, key=None, categorical_variable=None, strategy_by_col=None)

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

Parameters
dataDataFrame

Input data with missing values.

keystr, optional

Name of the ID column. If key is not provided, the input is assumed to have no ID column.

categorical_variablestr, optional

Names of columns with INTEGER data type that should actually be treated as categorical. By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.

strategy_by_colListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation. Each tuple in the list should contain at least two elements: the first element is the name of a column; the second element is the imputation strategy for that column. If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, then a third element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An illustrative example:

[(‘V1’, ‘categorical_const’, ‘0’),

(‘V5’,’median’)]

Returns
DataFrame

Imputed result using the specified strategy, with the same data structure (i.e. the same column names and data types) as data.

transform(self, data, key=None, thread_ratio=None)

The function imputes missing values of a DataFrame using the statistics/model info collected from another DataFrame.

Parameters
dataDataFrame

Input DataFrame.

keystr, optional

Name of the ID column. The input is assumed to have no ID column if not provided.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Returns
DataFrame

Imputation result, structured the same as data.

Statistics for the imputation result, structured as:

  • STAT_NAME: type NVARCHAR(256), statistics name.

  • STAT_VALUE: type NVARCHAR(5000), statistics value.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Discretize(conn_context, strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This is an enhanced version of the binning function that can be applied to a table with multiple columns. The function partitions table rows into multiple segments called bins, then applies smoothing methods to each bin of each column respectively.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

strategy{‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}
Binning methods:
  • ‘uniform_number’: equal widths based on the number of bins.

  • ‘uniform_size’: equal widths based on the bin width.

  • ‘quantile’: equal number of records per bin.

  • ‘sd’: mean/ standard deviation bin boundaries.

n_binsint, optional

Number of bins. Required and only valid when strategy is set to ‘uniform_number’ or ‘quantile’. Defaults to 2.

bin_sizefloat, optional

Specifies the bin width. Required and only valid when strategy is set to ‘uniform_size’. Defaults to 10.

n_sdint, optional

Specifies the number of standard deviations at each side of the mean. For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning. Required and only valid when strategy is set to ‘sd’.

smoothing{‘no’, ‘bin_means’, ‘bin_medians’, ‘bin_boundaries’}, optional

Default smoothing method for input data. Only applies to non-categorical attributes that are not assigned a specific smoothing method via the col_smoothing parameter. Defaults to ‘bin_means’.

save_modelbool, optional

Indicates whether the model is saved. Defaults to True.

Examples

Original data:

>>> df.collect()
        ID  ATT1   ATT2  ATT3 ATT4
    0    1  10.0  100.0   1.0    A
    1    2  10.1  101.0   1.0    A
    2    3  10.2  100.0   1.0    A
    3    4  10.4  103.0   1.0    A
    4    5  10.3  100.0   1.0    A
    5    6  40.0  400.0   4.0    C
    6    7  40.1  402.0   4.0    B
    7    8  40.2  400.0   4.0    B
    8    9  40.4  402.0   4.0    B
    9   10  40.3  400.0   4.0    A
    10  11  90.0  900.0   2.0    C
    11  12  90.1  903.0   1.0    B
    12  13  90.2  901.0   2.0    B
    13  14  90.4  900.0   1.0    B
    14  15  90.3  900.0   1.0    B

Construct a Discretize instance:

>>> bin = Discretize(cc, strategy='uniform_number',
          n_bins=3, smoothing='bin_medians')

Training the model with training data:

>>> bin.fit(train_data, binning_variable='ATT1', col_smoothing=[('ATT2', 'bin_means')],
            categorical_variable='ATT3', key=None, features=None)
>>> bin.assign_.collect()
        ID  BIN_INDEX
    0    1          1
    1    2          1
    2    3          1
    3    4          1
    4    5          1
    5    6          2
    6    7          2
    7    8          2
    8    9          2
    9   10          2
    10  11          3
    11  12          3
    12  13          3
    13  14          3
    14  15          3

Apply the model to new data:

>>> res = bin.predict(predict_data)
>>> res.collect()
       ID  BIN_INDEX
    0   1          1
    1   2          1
    2   3          1
    3   4          1
    4   5          3
    5   6          3
    6   7          2
Attributes
result_DataFrame
Discretize results, structured as follows:
  • ID: name as shown in input dataframe.

  • FEATURES: data smoothed in each bin, respectively.

assign_DataFrame
Assignment results, structured as follows:
  • ID: data ID, name as shown in input dataframe.

  • BIN_INDEX : bin index.

model_DataFrame
Model results, structured as follows:
  • ROW_INDEX: row index.

  • MODEL_CONTENT : model contents.

stats_DataFrame
Statistic results, structured as follows:
  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Methods

fit(self, data, binning_variable[, key, …])

Fit the discretization model on the training data.

fit_transform(self, data, binning_variable)

Fit the discretization model and return the discretized results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data)

Apply the fitted binning model to new data.

transform(self, data)

Apply the fitted binning model to new data.

fit(self, data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
data: DataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model. If not specified, all columns except the key column will be counted as feature columns.

binning_variablestr/ListofStrings

Name of the attribute(s) to which the binning operation is applied. The variable data type must be numeric.

col_smoothingListofTuples, optional

Specifies the column name and its smoothing method, which overwrites the default smoothing method. For example: col_smoothing = [(‘ATT1’, ‘bin_means’), (‘ATT2’, ‘bin_boundaries’)]. Only applies to non-categorical attributes. No default value.

categorical_variablestr/ListofStrings, optional

Indicates that a column should be treated as a categorical variable even if its data type is INTEGER. No default value.

predict(self, data)
dataDataFrame

DataFrame containing the data to be predicted.

transform(self, data)
dataDataFrame

DataFrame containing the data to be transformed.

fit_transform(self, data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)
data: DataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model. If not specified, all columns except the key column will be counted as feature columns.

binning_variablestr/ListofStrings

Name of the attribute(s) to which the binning operation is applied. The variable data type must be numeric.

col_smoothingListofTuples, optional

Specifies the column name and its smoothing method, which overwrites the default smoothing method. For example: col_smoothing = [(‘ATT1’, ‘bin_means’), (‘ATT2’, ‘bin_boundaries’)]. Only applies to non-categorical attributes. No default value.

categorical_variablestr/ListofStrings, optional

Indicates that a column should be treated as a categorical variable even if its data type is INTEGER. No default value.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.MDS(conn_context, matrix_type, thread_ratio=None, dim=None, metric=None, minkowski_power=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class serves as a tool for dimensional reduction or data visualization.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

matrix_type{‘dissimilarity’, ‘observation_feature’}

The type of the input table. Mandatory.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use. Default to 0.

dimint, optional

The number of dimensions that the input dataset is to be reduced to. Defaults to 2.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’}, optional

The type of distance during the calculation of dissimilarity matrix. Only valid when matrix_type is set as ‘observation_feature’. Default to ‘euclidean’.

minkowski_powerfloat, optional

When you use the Minkowski distance, this parameter controls the value of power. Only valid when matrix_type is set as ‘observation_feature’ and metric is set as ‘minkowski’. Default to 3.

Returns
res_tblDataFrame
Scaling (MDS) results, structured as follows:
  • DATA_ID: name as shown in input dataframe.

  • DIMENSION: dimension.

  • VALUE: value.

stats_tblDataFrame
Statistic results, structured as follows:
  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect()
     ID        X1        X2        X3        X4
    0   1  0.000000  0.904781  0.908596  0.910306
    1   2  0.904781  0.000000  0.251446  0.597502
    2   3  0.908596  0.251446  0.000000  0.440357
    3   4  0.910306  0.597502  0.440357  0.000000

Apply the multidimensional scaling:

>>> mds = MDS(conn_context=connection_context, matrix_type='dissimilarity', dim=2, thread_ratio=0.5)
>>> res, stats = mds.fit_transform(data=df)
>>> res.collect()
           ID  DIMENSION     VALUE
    0   1          1  0.651917
    1   1          2 -0.015859
    2   2          1 -0.217737
    3   2          2 -0.253195
    4   3          1 -0.249907
    5   3          2 -0.072950
    6   4          1 -0.184273
    7   4          2  0.342003
>>> stats.collect()
                              STAT_NAME  STAT_VALUE
    0                        acheived K    2.000000
    1  proportion of variation explaind    0.978901

Methods

fit_transform(self, data[, key, features])

Apply multidimensional scaling to the data and return the results and related statistics.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, key=None, features=None)
data: DataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model. If not specified, all columns except the key column will be counted as feature columns.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Sampling(conn_context, method, interval=None, sampling_size=None, random_state=None, percentage=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class is used to choose a small portion of the records as representatives.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

method{‘first_n’, ‘middle_n’, ‘last_n’, ‘every_nth’, ‘simple_random_with_replacement’, ‘simple_random_without_replacement’, ‘systematic’, ‘stratified_with_replacement’, ‘stratified_without_replacement’}

For the random methods, the system time is used for the seed.

intervalint, optional

The interval between two samples. Only required when method is ‘every_nth’. If this parameter is not specified, the sampling_size parameter will be used.

featuresstr/ListofStrings, optional

The column that is used to do the stratified sampling. Only required when method is ‘stratified_with_replacement’ or ‘stratified_without_replacement’.

sampling_sizeint, optional

Number of the samples. Default to 1.

random_stateint, optional

Indicates the seed used to initialize the random number generator. It can be set to 0 or a positive value.

0: Uses the system time

Not 0: Uses the specified seed

Default to 0.

percentagefloat, optional

Percentage of the samples. Use this parameter when sampling_size is not set. If both sampling_size and percentage are specified, percentage takes precedence. Default to 0.1.

Returns
res_tblDataFrame
Sampling results, structured as follows:
  • DATA_FEATURES: same structure as defined in the Input Table.

Examples

Original data:

>>> df.collect().head(10)
         EMPNO  GENDER  INCOME
    0       1    male  4000.5
    1       2    male  5000.7
    2       3  female  5100.8
    3       4    male  5400.9
    4       5  female  5500.2
    5       6    male  5540.4
    6       7    male  4500.9
    7       8  female  6000.8
    8       9    male  7120.8
    9      10  female  8120.9

Apply the sampling function:

>>> smp = Sampling(conn_context=connection_context, method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
         EMPNO  GENDER  INCOME
    0      5  female  5500.2
    1     10  female  8120.9
    2     15    male  9876.5
    3     20  female  8705.7
    4     25  female  8794.9
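
A sketch of stratified sampling on the GENDER column; the percentage value is an assumption for illustration:

>>> smp_strat = Sampling(conn_context=connection_context,
...                      method='stratified_without_replacement',
...                      percentage=0.4)
>>> res_strat = smp_strat.fit_transform(data=df, features='GENDER')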

Methods

fit_transform(self, data[, features])

Perform sampling on the data and return the sampled results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, features=None)
data: DataFrame

Dataframe that contains the training data.

featuresstr/ListofStrings, optional

The column used for stratified sampling. Only required when method is ‘stratified_with_replacement’ or ‘stratified_without_replacement’.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.SMOTE(conn_context, smote_amount=None, k_nearest_neighbours=None, minority_class=None, thread_ratio=None, random_seed=None, method=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class is used to handle imbalanced datasets by generating synthetic samples for the minority class (SMOTE).

Parameters
conn_contextConnectionContext

Connection to the HANA system.

smote_amountint, optional

Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.

k_nearest_neighboursint, optional

Number of nearest neighbors (k).

minority_classstr, optional

Specifies the minority class value in dependent variable column.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use. Default to 0.

random_seedint, optional

Specifies the seed for the random number generator.

0: Uses the current time (in seconds) as the seed.

Others: Uses the specified value as the seed.

methodint, optional

Searching method for finding the k nearest neighbours.

0: Brute force searching.

1: KD-tree searching.

Returns
res_tblDataFrame

SMOTE result, the same structure as defined in the input data.

Examples

>>> smote = SMOTE(conn_context=connection_context,
                smote_amount=200, k_nearest_neighbours=2,
                minority_class="2", method=1)
>>> res = smote.fit_transform(data=df, label="TYPE")

Methods

fit_transform(self, data, label)

Perform SMOTE on the data and return the oversampled results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, label)
data: DataFrame

Dataframe that contains the training data.

labelstr

Specifies the dependent variable by name.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.preprocessing.mds(conn_context, data, matrix_type, thread_ratio=None, dim=None, metric=None, minkowski_power=None, key=None, features=None)

This function serves as a tool for dimensional reduction or data visualization.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

dataDataFrame

DataFrame containing the data.

matrix_type{‘dissimilarity’, ‘observation_feature’}

The type of the input table. Mandatory.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use. Default to 0.

dimint, optional

The number of dimensions that the input dataset is to be reduced to. Defaults to 2.

metric{‘manhattan’, ‘euclidean’, ‘minkowski’}, optional

The type of distance during the calculation of dissimilarity matrix. Only valid when matrix_type is set as ‘observation_feature’. Default to ‘euclidean’.

minkowski_powerfloat, optional

When you use the Minkowski distance, this parameter controls the value of power. Only valid when matrix_type is set as ‘observation_feature’ and metric is set as ‘minkowski’. Default to 3.

keystr, optional

Name of the ID column in the dataframe. If not specified, the first column will be taken as the ID column.

featuresstr/ListOfStrings, optional

Names of the feature columns in the dataframe. If not specified, all columns except the ID column will be taken as feature columns.

Returns
res_tblDataFrame
Scaling (MDS) results, structured as follows:
  • DATA_ID: name as shown in input dataframe.

  • DIMENSION: dimension.

  • VALUE: value.

stats_tblDataFrame
Statistic results, structured as follows:
  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect()
     ID        X1        X2        X3        X4
    0   1  0.000000  0.904781  0.908596  0.910306
    1   2  0.904781  0.000000  0.251446  0.597502
    2   3  0.908596  0.251446  0.000000  0.440357
    3   4  0.910306  0.597502  0.440357  0.000000

Apply the multidimensional scaling:

>>> res,stats = mds(conn_context=connection_context, data=df,
                    matrix_type='dissimilarity', dim=2, thread_ratio=0.5)
>>> res.collect()
           ID  DIMENSION     VALUE
    0   1          1  0.651917
    1   1          2 -0.015859
    2   2          1 -0.217737
    3   2          2 -0.253195
    4   3          1 -0.249907
    5   3          2 -0.072950
    6   4          1 -0.184273
    7   4          2  0.342003
>>> stats.collect()
                              STAT_NAME  STAT_VALUE
    0                        acheived K    2.000000
    1  proportion of variation explaind    0.978901
hana_ml.algorithms.pal.preprocessing.sampling(conn_context, data, method, interval=None, features=None, sampling_size=None, random_state=None, percentage=None)

This function is used to choose a small portion of the records as representatives.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

dataDataFrame

DataFrame containing the data.

method{‘first_n’, ‘middle_n’, ‘last_n’, ‘every_nth’, ‘simple_random_with_replacement’, ‘simple_random_without_replacement’, ‘systematic’, ‘stratified_with_replacement’, ‘stratified_without_replacement’}

For the random methods, the system time is used for the seed.

intervalint, optional

The interval between two samples. Only required when method is ‘every_nth’. If this parameter is not specified, the sampling_size parameter will be used.

featuresstr/ListofStrings, optional

The column that is used to do the stratified sampling. Only required when method is ‘stratified_with_replacement’ or ‘stratified_without_replacement’.

sampling_sizeint, optional

Number of the samples. Default to 1.

random_stateint, optional

Indicates the seed used to initialize the random number generator. It can be set to 0 or a positive value.

0: Uses the system time

Not 0: Uses the specified seed

Default to 0.

percentagefloat, optional

Percentage of the samples. Use this parameter when sampling_size is not set. If both sampling_size and percentage are specified, percentage takes precedence. Default to 0.1.

Returns
res_tblDataFrame
Sampling results, structured as follows:
  • DATA_FEATURES: same structure as defined in the Input Table.

Examples

Original data:

>>> df.collect().head(10)
         EMPNO  GENDER  INCOME
    0       1    male  4000.5
    1       2    male  5000.7
    2       3  female  5100.8
    3       4    male  5400.9
    4       5  female  5500.2
    5       6    male  5540.4
    6       7    male  4500.9
    7       8  female  6000.8
    8       9    male  7120.8
    9      10  female  8120.9

Apply the sampling function:

>>> res = sampling(conn_context=connection_context, data=df, method='every_nth', interval=5, sampling_size=8)
>>> res.collect()
         EMPNO  GENDER  INCOME
    0      5  female  5500.2
    1     10  female  8120.9
    2     15    male  9876.5
    3     20  female  8705.7
    4     25  female  8794.9
hana_ml.algorithms.pal.preprocessing.variance_test(conn_context, data, sigma_num, thread_ratio=None, key=None, data_col=None)

Variance Test is a method to identify outliers in a set of n numeric values {x_i}, 1 <= i <= n, using the mean and the standard deviation of the data: a value is flagged as out of range when its distance from the mean exceeds sigma_num times the standard deviation.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

dataDataFrame

DataFrame containing the data.

sigma_numfloat

Multiplier for sigma. No default value.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use. Default to 0.

keystr, optional

Name of the ID column in the dataframe. If not specified, the first column will be taken as the ID column.

data_colstr, optional

Name of the raw data column in the dataframe. If not specified, the last column will be taken as the data column.

Returns
res_tblDataFrame
Test results, structured as follows:
  • DATA_ID: name as shown in input dataframe.

  • IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.

stats_tblDataFrame
Statistic results, structured as follows:
  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect().tail(10)
        ID      X
    10  10   26.0
    11  11   28.0
    12  12   29.0
    13  13   27.0
    14  14   26.0
    15  15   23.0
    16  16   22.0
    17  17   23.0
    18  18   25.0
    19  19  103.0

Apply the variance test:

>>> res, stats = variance_test(cc, data, sigma_num=3.0)
>>> res.collect().tail(10)
         ID  IS_OUT_OF_RANGE
    10  10                0
    11  11                0
    12  12                0
    13  13                0
    14  14                0
    15  15                0
    16  16                0
    17  17                0
    18  18                0
    19  19                1
>>> stats.collect()
        STAT_NAME  STAT_VALUE
    0   mean   28.400000

hana_ml.algorithms.pal.random

This module contains wrappers for PAL Random distribution sampling algorithms.

The following distribution functions are available:

hana_ml.algorithms.pal.random.multinomial(conn_context, n, pvals, num_random=100, seed=None, thread_ratio=None)

Draw samples from a multinomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint

Number of trials.

pvalstuple of float and int

Success fractions of each category.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • Generated random number columns, named by appending an index number (starting from 1 up to the length of pvals) to RANDOM_P, type DOUBLE. There will be as many of these columns as there are values in pvals.

Examples

Draw samples from a multinomial distribution.

>>> res = multinomial(conn_context=cc, n=10, pvals=(0.1, 0.2, 0.3, 0.4), num_random=10)
>>> res.collect()
   ID  RANDOM_P1  RANDOM_P2  RANDOM_P3  RANDOM_P4
0   0        1.0        2.0        2.0        5.0
1   1        1.0        2.0        3.0        4.0
2   2        0.0        0.0        8.0        2.0
3   3        0.0        2.0        1.0        7.0
4   4        1.0        1.0        4.0        4.0
5   5        1.0        1.0        4.0        4.0
6   6        1.0        2.0        3.0        4.0
7   7        1.0        4.0        2.0        3.0
8   8        1.0        2.0        3.0        4.0
9   9        4.0        1.0        1.0        4.0
hana_ml.algorithms.pal.random.bernoulli(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Bernoulli distribution.

Parameters
conn_contextConnectionContext

Database connection object.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a bernoulli distribution.

>>> res = bernoulli(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               0.0
2   2               1.0
3   3               1.0
4   4               0.0
5   5               1.0
6   6               1.0
7   7               0.0
8   8               1.0
9   9               0.0
hana_ml.algorithms.pal.random.beta(conn_context, a=0.5, b=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Beta distribution.

Parameters
conn_contextConnectionContext

Database connection object.

afloat, optional

Alpha value, positive.

Defaults to 0.5.

bfloat, optional

Beta value, positive.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a beta distribution.

>>> res = beta(conn_context=cc, a=0.5, b=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.976130
1   1          0.308346
2   2          0.853118
3   3          0.958553
4   4          0.677258
5   5          0.489628
6   6          0.027733
7   7          0.278073
8   8          0.850181
9   9          0.976244
hana_ml.algorithms.pal.random.binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a binomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint, optional

Number of trials.

Defaults to 1.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a binomial distribution.

>>> res = binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               0.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               1.0
8   8               0.0
9   9               1.0
hana_ml.algorithms.pal.random.cauchy(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a cauchy distribution.

Parameters
conn_contextConnectionContext

Database connection object.

locationfloat, optional

Defaults to 0.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a cauchy distribution.

>>> res = cauchy(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.827259
1   1         -1.877612
2   2        -18.241436
3   3         -1.216243
4   4          2.091336
5   5       -317.131147
6   6         -2.804251
7   7         -0.338566
8   8          0.143280
9   9          1.277245
hana_ml.algorithms.pal.random.chi_squared(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a chi_squared distribution.

Parameters
conn_contextConnectionContext

Database connection object.

dofint, optional

Degrees of freedom.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional
Indicates the seed used to initialize the random number generator:
  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a chi_squared distribution.

>>> res = chi_squared(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.040571
1   1          2.680756
2   2          1.119563
3   3          1.174072
4   4          0.872421
5   5          0.327169
6   6          1.113164
7   7          1.549585
8   8          0.013953
9   9          0.011735
hana_ml.algorithms.pal.random.exponential(conn_context, lamb=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an exponential distribution.

Parameters
conn_contextConnectionContext

Database connection object.

lambfloat, optional

The rate parameter, which is the inverse of the scale parameter.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an exponential distribution.

>>> res = exponential(conn_context=cc, lamb=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.035207
1   1          0.559248
2   2          0.122307
3   3          2.339937
4   4          1.130033
5   5          0.985565
6   6          0.030138
7   7          0.231040
8   8          1.233268
9   9          0.876022
hana_ml.algorithms.pal.random.gumbel(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.

Parameters
conn_contextConnectionContext

Database connection object.

locationfloat, optional

Defaults to 0.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gumbel distribution.

>>> res = gumbel(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.544054
1   1          0.339531
2   2          0.394224
3   3          3.161123
4   4          1.208050
5   5         -0.276447
6   6          1.694589
7   7          1.406419
8   8         -0.443717
9   9          0.156404
hana_ml.algorithms.pal.random.f(conn_context, dof1=1, dof2=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an f distribution.

Parameters
conn_contextConnectionContext

Database connection object.

dof1int, optional

DEGREES_OF_FREEDOM1.

Defaults to 1.

dof2int, optional

DEGREES_OF_FREEDOM2.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an f distribution.

>>> res = f(conn_context=cc, dof1=1, dof2=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          6.494985
1   1          0.054830
2   2          0.752216
3   3          4.946226
4   4          0.167151
5   5        351.789925
6   6          0.810973
7   7          0.362714
8   8          0.019763
9   9         10.553533
hana_ml.algorithms.pal.random.gamma(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a gamma distribution.

Parameters
conn_contextConnectionContext

Database connection object.

shapefloat, optional

Defaults to 1.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gamma distribution.

>>> res = gamma(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.082794
1   1          0.084031
2   2          0.159490
3   3          1.063100
4   4          0.530218
5   5          1.307313
6   6          0.565527
7   7          0.474969
8   8          0.440999
9   9          0.463645
hana_ml.algorithms.pal.random.geometric(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a geometric distribution.

Parameters
conn_contextConnectionContext

Database connection object.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional
Indicates the seed used to initialize the random number generator:
  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a geometric distribution.

>>> res = geometric(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               1.0
3   3               0.0
4   4               1.0
5   5               0.0
6   6               0.0
7   7               0.0
8   8               0.0
9   9               0.0
hana_ml.algorithms.pal.random.lognormal(conn_context, mean=0, sigma=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a lognormal distribution.

Parameters
conn_contextConnectionContext

Database connection object.

meanfloat, optional

Mean value of the underlying normal distribution.

Defaults to 0.

sigmafloat, optional

Standard deviation of the underlying normal distribution.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a lognormal distribution.

>>> res = lognormal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.461803
1   1          0.548432
2   2          0.625874
3   3          3.038529
4   4          3.582703
5   5          1.867543
6   6          1.853857
7   7          0.378827
8   8          1.104031
9   9          0.840102
hana_ml.algorithms.pal.random.negative_binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a negative_binomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint, optional

Number of successes.

Defaults to 1.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a negative_binomial distribution.

>>> res = negative_binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               2.0
2   2               3.0
3   3               1.0
4   4               1.0
5   5               0.0
6   6               2.0
7   7               1.0
8   8               2.0
9   9               3.0
hana_ml.algorithms.pal.random.normal(conn_context, mean=0, sigma=None, variance=None, num_random=100, seed=None, thread_ratio=None)

Draw samples from a normal distribution.

Parameters
conn_contextConnectionContext

Database connection object.

meanfloat, optional

Mean value.

Defaults to 0.

sigmafloat, optional

Standard deviation. It cannot be used together with variance.

Defaults to 1.

variancefloat, optional

Variance. It cannot be used together with sigma.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a normal distribution.

>>> res = normal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.321078
1   1         -1.327626
2   2          0.798867
3   3         -0.116128
4   4         -0.213519
5   5          0.008566
6   6          0.251733
7   7          0.404510
8   8         -0.534899
9   9         -0.420968
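
Since sigma and variance are mutually exclusive, the same draw can alternatively be parameterized by the variance; a hypothetical variant of the call above (only one of sigma or variance may be given):

>>> res = normal(conn_context=cc, mean=0, variance=1, num_random=10)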
hana_ml.algorithms.pal.random.pert(conn_context, minimum=-1, mode=0, maximum=1, scale=4, num_random=100, seed=None, thread_ratio=None)

Draw samples from a PERT distribution.

Parameters
conn_contextConnectionContext

Database connection object.

minimumint, optional

Minimum value.

Defaults to -1.

modefloat, optional

Most likely value.

Defaults to 0.

maximumfloat, optional

Maximum value.

Defaults to 1.

scalefloat, optional

Defaults to 4.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a pert distribution.

>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.360781
1   1         -0.023649
2   2          0.106465
3   3          0.307412
4   4         -0.136838
5   5         -0.086010
6   6         -0.504639
7   7          0.335352
8   8         -0.287202
9   9          0.468597
hana_ml.algorithms.pal.random.poisson(conn_context, theta=1.0, num_random=100, seed=None, thread_ratio=None)

Draw samples from a poisson distribution.

Parameters
conn_contextConnectionContext

Database connection object.

thetafloat, optional

The average number of events in an interval.

Defaults to 1.0.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a poisson distribution.

>>> res = poisson(conn_context=cc, theta=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               1.0
2   2               1.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               2.0
8   8               0.0
9   9               1.0
hana_ml.algorithms.pal.random.student_t(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Student’s t-distribution.

Parameters
conn_contextConnectionContext

Database connection object.

doffloat, optional

Degrees of freedom.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a Student’s t-distribution.

>>> res = student_t(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0         -0.433802
1   1          1.972038
2   2         -1.097313
3   3         -0.225812
4   4         -0.452342
5   5          2.242921
6   6          0.377288
7   7          0.322347
8   8          1.104877
9   9         -0.017830
hana_ml.algorithms.pal.random.uniform(conn_context, low=0, high=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a uniform distribution.

Parameters
conn_contextConnectionContext

Database connection object.

lowfloat, optional

The lower bound.

Defaults to 0.

highfloat, optional

The upper bound.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a uniform distribution.

>>> res = uniform(conn_context=cc, low=-1, high=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.032920
1   1          0.201923
2   2          0.823313
3   3         -0.495260
4   4         -0.138329
5   5          0.677732
6   6          0.685200
7   7          0.363627
8   8          0.024849
9   9         -0.441779
hana_ml.algorithms.pal.random.weibull(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a weibull distribution.

Parameters
conn_contextConnectionContext

Database connection object.

shapefloat, optional

Defaults to 1.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a weibull distribution.

>>> res = weibull(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          2.188750
1   1          0.247628
2   2          0.339884
3   3          0.902187
4   4          0.909629
5   5          0.514740
6   6          4.627877
7   7          0.143767
8   8          0.847514
9   9          2.368169

hana_ml.algorithms.pal.recommender

This module contains the Python API of PAL recommender system algorithms. The following classes are available:

class hana_ml.algorithms.pal.recommender.ALS(conn_context, random_state=None, max_iter=None, tol=None, exit_interval=None, implicit=None, linear_solver=None, cg_max_iter=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, factor_num=None, lamb=None, alpha=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Class for recommender system, alternating least squares algorithm.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

factor_num: int, optional

Length of factor vectors in the model. Default to 8.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

lambfloat, optional

Specifies the L2 regularization of the factors. Default to 1e-2

thread_ratiofloat, optional

Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

max_iterint, optional

Specifies the maximum number of iterations for the ALS algorithm. Default to 20.

tolfloat, optional

Specifies the tolerance for exiting the iterative algorithm. The algorithm exits if the value of cost function is not decreased more than this value since the last check. If tol is set to 0, there is no check, and the algorithm only exits on reaching the maximum number of iterations. Note that evaluations of cost function require additional calculations, and you can set this parameter to 0 to avoid it. Default to 0.

exit_intervalint, optional

Specifies the number of iterations between consecutive convergence checks. Essentially, the algorithm evaluates the cost function every exit_interval iterations and checks whether the tolerance has been reached. Note that evaluations of the cost function require additional calculations. Only valid when tol is not 0. Default to 5.

implicitbool, optional

Specifies implicit/explicit ALS. Default to false.

linear_solver{‘cholesky’, ‘cg’}, optional

Specifies the linear system solver. Default to ‘cholesky’.

cg_max_iterint, optional

Specifies the maximum number of iterations of the cg solver. Only valid when linear_solver is set to ‘cg’. Default to 3.

alphafloat, optional

Used when computing the confidence level in implicit ALS. Only valid when implicit is set to true. Default to 1.0.

resampling_method{‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to cv. Default to 1.

repeat_timesint, optional

Specifies the number of repeat times for resampling. Default to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection. No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when ‘search_strategy’ is set to ‘random’. No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified. Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided. No default value.

param_valuesListOfTuples, optional

Specifies values of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the target parameter name (in string format), and the 2nd element being a list of values for selection. Only valid when search_strategy is specified. Valid parameter names include: ‘alpha’, ‘factor_num’, ‘lamb’. No default value.

param_rangeListOfTuples, optional

Specifies ranges of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the name of the target parameter (in string format), and the 2nd element being a list that specifies the range of that parameter in the following format: [start, step, end] or [start, end]. Only valid when search_strategy is specified. Valid parameter names include: ‘alpha’, ‘factor_num’, ‘lamb’. No default value.
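
For illustration only, a sketch of how param_values and param_range might be passed when parameter selection is enabled; the candidate values and the connection name cc are hypothetical, not recommendations:

>>> als_cv = ALS(cc, random_state=1,
                 resampling_method='cv', evaluation_metric='rmse', fold_num=3,
                 search_strategy='grid',
                 param_values=[('factor_num', [2, 4, 8])],
                 param_range=[('lamb', [1e-3, 1e-2])])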

Examples

Input dataframe for training:

>>> df_train.collect()
    USER    MOVIE    FEEDBACK
     A      Movie1      4.8
     A      Movie2      4.0
     A      Movie4      4.0
     A      Movie5      4.0
     A      Movie6      4.8
     A      Movie8      3.8
     A      Bad_Movie   2.5
     B      Movie2      4.8
     B      Movie3      4.8
     B      Movie4      5.0
     B      Movie5      5.0
     B      Movie7      3.5
     B      Movie8      4.8
     B      Bad_Movie   2.8
     C      Movie1      4.1
     C      Movie2      4.2
     C      Movie4      4.2
     C      Movie5      4.0
     C      Movie6      4.2
     C      Movie7      3.2
     C      Movie8      3.0
     C      Bad_Movie   2.5
     D      Movie1      4.5
     D      Movie3      3.5
     D      Movie4      4.5
     D      Movie6      3.9
     D      Movie7      3.5
     D      Movie8      3.5
     D      Bad_Movie   2.5
     E      Movie1      4.5
     E      Movie2      4.0
     E      Movie3      3.5
     E      Movie4      4.5
     E      Movie5      4.5
     E      Movie6      4.2
     E      Movie7      3.5
     E      Movie8      3.5

Creating ALS instance:

>>> als = ALS(self.conn,factor_num=2,
              lamb=1e-2, max_iter=20, tol=1e-6,
              exit_interval=5, linear_solver='cholesky', thread_ratio=0, random_state=1)

Performing fit() on given dataframe:

>>> als.fit(self.df_train)
>>> als.factors_.collect().head(10)
         FACTOR_ID    FACTOR
    0           0  1.108775
    1           1 -0.582392
    2           2  1.355926
    3           3 -0.760969
    4           4  1.084126
    5           5  0.281749
    6           6  1.145244
    7           7  0.418631
    8           8  1.151257
    9           9  0.315342

Performing predict() on given predicting dataframe:

>>> res = als.predict(self.df_predict, thread_ratio=1, key='ID')
>>> res.collect()
           ID USER      MOVIE  PREDICTION
    0   1    A         Movie3    3.868747
    1   2    A         Movie7    2.870243
    2   3    B         Movie1    5.787559
    3   4    B         Movie6    5.837218
    4   5    C         Movie3    3.323575
    5   6    D         Movie2    4.156372
    6   7    D         Movie5    4.325851
    7   8    E      Bad_Movie    2.545807
Attributes
metadata_DataFrame

Model metadata content.

map_DataFrame

Map info.

factors_DataFrame

Decomposed factors.

optim_param_DataFrame

Optimal parameters selected.

stats_DataFrame

Statistic values.

iter_info_DataFrame

Cost function value and RMSE of corresponding iterations.

Methods

fit(self, data[, usr, item, feedback, key])

Fit the ALS model with input training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, usr, item, …])

Prediction for the input data with the trained ALS model.

fit(self, data, usr=None, item=None, feedback=None, key=None)

Fit the ALS model with input training data. Model parameters should be given by initializing the model first.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

usrstr, optional

Name of the usr column.

itemstr, optional

Name of the item column.

feedbackstr, optional

Name of the feedback column.

predict(self, data, key, usr=None, item=None, thread_ratio=None)

Prediction for the input data with the trained ALS model.

Parameters
dataDataFrame

Data to be predicted.

keystr, optional

Name of the ID column.

usrstr, optional

Name of the usr column.

itemstr, optional

Name of the item column.

thread_ratiofloat, optional

Specifies the upper limit of thread usage in proportion of current available threads. The valid range of the value is [0,1]. Default to 0.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.recommender.FRM(conn_context, solver=None, factor_num=None, init=None, random_state=None, learning_rate=None, linear_lamb=None, lamb=None, max_iter=None, sgd_tol=None, sgd_exit_interval=None, thread_ratio=None, momentum=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Class for FRM, Factorized Polynomial Regression Models for recommender systems.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

solver{‘sgd’, ‘momentum’, ‘nag’, ‘adagrad’}, optional

Specifies the method for solving the objective minimization problem. Default to ‘sgd’.

factor_numint, optional

Length of factor vectors in the model. Default to 8.

initfloat, optional

Variance of the normal distribution used to initialize the model parameters. Default to 1e-2.

random_stateint, optional
Specifies the seed for random number generator.
  • 0: Uses the current time as the seed.

  • Others: Uses the specified value as the seed.

Note that due to the inherent randomness of parallel SGD, models from different trainings might be different even with the same seed for the random number generator. Default to 0.

lambfloat, optional

L2 regularization of the factors. Default to 1e-8.

linear_lambfloat, optional

L2 regularization of the linear coefficients. Default to 1e-10.

thread_ratiofloat, optional

Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

max_iterint, optional

Specifies the maximum number of iterations for the SGD algorithm. Default value: 50.

sgd_tolfloat, optional

Exit threshold. The algorithm exits when the cost function has not decreased more than this threshold in sgd_exit_interval steps. Default to 1e-5

sgd_exit_intervalint, optional

The algorithm exits when the cost function has not decreased more than sgd_tol in sgd_exit_interval steps. Default to 5.

momentumfloat, optional

The momentum factor in method ‘momentum’ or ‘nag’. Valid only when method is ‘momentum’ or ‘nag’ . Default to 0.9.

resampling_method{‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated. No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to cv. Default to 1.

repeat_timesint, optional

Specifies the number of repeat times for resampling. Default to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection. No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is set to ‘random’. No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified. Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided. No default value.

param_valuesListOfTuples, optional

Specifies values of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the parameter name, and the 2nd element being a list of values for selection. Only valid when search_strategy is specified. Valid parameter names include: ‘factor_num’, ‘lamb’, ‘linear_lamb’, ‘momentum’. No default value.

param_rangeListOfTuples, optional

Specifies ranges of parameters to be selected. Input should be a list of tuples, with the 1st element of each tuple being the parameter name, and the 2nd element being a list of numerical values indicating the range for selection. Only valid when search_strategy is specified. Valid parameter names include: ‘factor_num’, ‘lamb’, ‘linear_lamb’, ‘momentum’. No default value.
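
Analogously to ALS, a purely illustrative sketch of enabling parameter selection for FRM; the candidate values below are hypothetical:

>>> frm_cv = FRM(cc, solver='momentum',
                 resampling_method='cv', evaluation_metric='rmse', fold_num=3,
                 search_strategy='grid',
                 param_values=[('linear_lamb', [1e-10, 1e-8]), ('momentum', [0.7, 0.9])])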

Examples

Input dataframe for training:

>>> df_train.collect()
    ID    USER    MOVIE     TIMESTAMP    FEEDBACK
     1      A     Movie1        3          4.8
     2      A     Movie2        3          4.0
     3      A     Movie4        1          4.0
     4      A     Movie5        2          4.0
     5      A     Movie6        3          4.8
     6      A     Movie8        2          3.8
     7      A     Bad_Movie     1          2.5
     8      B     Movie2        3          4.8
     9      B     Movie3        2          4.8
     1      B     Movie4        2          5.0
     1      B     Movie5        4          5.0
     1      B     Movie7        1          3.5
     1      B     Movie8        2          4.8
     1      B     Bad_Movie     3          2.8
     1      C     Movie1        2          4.1
     1      C     Movie2        4          4.2
     1      C     Movie4        3          4.2
     1      C     Movie5        1          4.0
     1      C     Movie6        4          4.2
     2      C     Movie7        3          3.2
     2      C     Movie8        1          3.0
     2      C     Bad_Movie     2          2.5
     2      D     Movie1        3          4.5
     2      D     Movie3        2          3.5
     2      D     Movie4        2          4.5
     2      D     Movie6        2          3.9
     2      D     Movie7        4          3.5
     2      D     Movie8        3          3.5
     2      D     Bad_Movie     3          2.5
     3      E     Movie1        2          4.5
     3      E     Movie2        2          4.0
     3      E     Movie3        2          3.5
     3      E     Movie4        4          4.5
     3      E     Movie5        3          4.5
     3      E     Movie6        2          4.2
     3      E     Movie7        4          3.5
     3      E     Movie8        3          3.5

Input user dataframe for training:

>>> usr_info.collect()
    USER            USER_SIDE_FEATURE
    -- There is no side information for user provided. --

Input item dataframe for training:

>>> item_info.collect()
    MOVIE       GENRES
    Movie1      Sci-Fi
    Movie2      Drama,Romance
    Movie3      Drama,Sci-Fi
    Movie4      Crime,Drama
    Movie5      Crime,Drama
    Movie6      Sci-Fi
    Movie7      Crime,Drama
    Movie8      Sci-Fi,Thriller
    Bad_Movie   Romance,Thriller

Creating FRM instance:

>>> frm = FRM(self.conn, factor_num=2, solver='adagrad',
              learning_rate=0, max_iter=100,
              thread_ratio=0.5, random_state=1)

Performing fit() on given dataframe:

>>> frm.fit(self.df_train, self.usr_info, self.item_info, categorical_variable='TIMESTAMP')
>>> frm.factors_.collect().head(10)
         FACTOR_ID    FACTOR
    0          0 -0.083550
    1          1 -0.083654
    2          2  0.582244
    3          3 -0.102799
    4          4 -0.441795
    5          5 -0.013341
    6          6 -0.099548
    7          7  0.245046
    8          8 -0.056534
    9          9 -0.342042

Performing predict() on given predicting dataframe:

>>> res = frm.predict(self.df_predict, self.usr_info, self.item_info, thread_ratio=0.5, key='ID')
>>> res.collect()
           ID USER  ITEM  PREDICTION
    0   1    A  None    3.486804
    1   2    A     4    3.490246
    2   3    B     2    5.436991
    3   4    B     3    5.287031
    4   5    C     2    3.015121
    5   6    D     1    3.602543
    6   7    D     3    4.097683
    7   8    E     2    2.317224
Attributes
metadata_DataFrame

Model metadata content.

model_DataFrame

Model (Map, Weight)

factors_DataFrame

Decomposed factors.

optim_param_DataFrame

Optimal parameters selected.

stats_DataFrame

Statistic values

iter_info_DataFrame

Cost function value and RMSE of corresponding iteration.

Methods

fit(self, data, usr_info, item_info[, key, …])

Fit the FRM model with input training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, usr_info, item_info, key)

Prediction for the input data with the trained FRM model.

fit(self, data, usr_info, item_info, key=None, usr=None, item=None, feedback=None, features=None, usr_features=None, item_features=None, usr_key=None, item_key=None, categorical_variable=None, usr_categorical_variable=None, item_categorical_variable=None)

Fit the FRM model with input training data. Model parameters should be given by initializing the model first.

Parameters
dataDataFrame

Data to be fit.

usr_infoDataFrame

User side features.

item_infoDataFrame

Item side features.

keystr, optional

Name of the ID column.

usrstr, optional

Name of the usr column.

itemstr, optional

Name of the item column.

feedbackstr, optional

Name of the feedback column.

featuresstr/listOfStrings, optional

Global side features column name in the training dataframe.

usr_featuresstr/listOfStrings, optional

User side features column name in the training dataframe.

item_featuresstr/listOfStrings, optional

Item side features column name in the training dataframe.

categorical_variablestr/ListofStrings, optional

Indicates whether a column should be treated as a categorical variable even if its data type is INTEGER. By default, columns of type ‘VARCHAR’ or ‘NVARCHAR’ are treated as categorical, while columns of type ‘INTEGER’ or ‘DOUBLE’ are treated as continuous.

usr_categorical_variablestr/ListofStrings, optional

Name of user side feature columns that should be treated as categorical.

item_categorical_variablestr/ListofStrings, optional

Name of item side feature columns that should be treated as categorical.

predict(self, data, usr_info, item_info, key, usr=None, item=None, features=None, thread_ratio=None)

Prediction for the input data with the trained FRM model.

Parameters
dataDataFrame

Data to be predicted.

keystr, optional

Name of the ID column.

usrstr, optional

Name of the usr column.

itemstr, optional

Name of the item column.

usr_infoDataFrame

User side features.

item_infoDataFrame

Item side features.

featuresstr/listOfStrings, optional

Global side features column name in the training dataframe.

thread_ratiofloat, optional

Specifies the upper limit of thread usage in proportion of current available threads. The valid range of the value is [0,1]. Default to 0.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.recommender.FFMClassifier(conn_context, ordering=None, normalise=None, include_linear=None, include_constant=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)

Bases: hana_ml.algorithms.pal.recommender._FFMBase

Class for FFMClassifier, Field-Aware Factorization Machine for the task of classification.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

factor_numint, optional

The factorisation dimensionality. Default to 4.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

train_sizefloat, optional

The proportion of the dataset used for training, with the remainder used for validation. For example, 0.8 indicates that 80% is used for training and the remaining 20% for validation. Default to 0.8 if the number of instances is not less than 40, 1.0 otherwise.

max_iterint, optional

Specifies the maximum number of iterations for the SGD optimisation. Default to 20.

orderingListOfStrings, optional

Specifies the order of categories for ranking. No default value.

normalisebool, optional

Specifies whether to normalise each instance so that its L1 norm is 1. Default to True.

include_constantbool, optional

Specifies whether to include the w0 constant part. Default to True.

include_linearbool, optional

Specifies whether to include the linear part of regression model. Default to True.

early_stopbool, optional

Specifies whether to stop the SGD optimisation early. Valid only if the value of train_size is less than 1 (that is, when part of the data is held out for validation). Default to True.

learning_ratefloat, optional

The learning rate for SGD iteration. Default to 0.2.

linear_lambfloat, optional

The L2 regularisation parameter for the linear coefficient vector. Default to 1e-5.

poly2_lambfloat, optional

The L2 regularisation parameter for factorized coefficient matrix of the quadratic term. Default to 1e-5.

tolfloat, optional

The criterion to determine the convergence of SGD. Default to 1e-5.

exit_intervalint, optional

The interval of two iterations for comparison to determine the convergence. Default to 5.

handle_missingstr, optional
Specifies how to handle missing features:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

Examples

Input dataframe for classification training:

>>> df_train_classification.collect()
    USER    MOVIE                  TIMESTAMP    CTR
    A      Movie1                   3          Click
    A      Movie2                   3          Click
    A      Movie4                   1          Not click
    A      Movie5                   2          Click
    A      Movie6                   3          Click
    A      Movie8                   2          Not click
    A      Movie0, Movie3           1          Click
    B      Movie2                   3          Click
    B      Movie3                   2          Click
    B      Movie4                   2          Not click
    B      null                     4          Not click
    B      Movie7                   1          Click
    B      Movie8                   2          Not click
    B      Movie0                   3          Not click
    C      Movie1                   2          Click
    C      Movie2, Movie5, Movie7   4          Not click
    C      Movie4                   3          Not click
    C      Movie5                   1          Not click
    C      Movie6                   null       Click
    C      Movie7                   3          Not click
    C      Movie8                   1          Click
    C      Movie0                   2          Click
    D      Movie1                   3          Click
    D      Movie3                   2          Click
    D      Movie4, Movie7           2          Click
    D      Movie6                   2          Click
    D      Movie7                   4          Not click
    D      Movie8                   3          Not click
    D      Movie0                   3          Not click
    E      Movie1                   2          Not click
    E      Movie2                   2          Click
    E      Movie3                   2          Click
    E      Movie4                   4          Click
    E      Movie5                   3          Click
    E      Movie6                   2          Not click
    E      Movie7                   4          Not click
    E      Movie8                   3          Not click
Creating FFMClassifier instance:

>>> ffm = FFMClassifier(self.conn, linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1,
              factor_num=4, early_stop=True, learning_rate=0.2, max_iter=20, train_size=0.8)

Performing fit() on given dataframe:

>>> ffm.fit(data=self.df_train_classification, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
        STAT_NAME          STAT_VALUE
    0         task      classification
    1  feature_num                  18
    2    field_num                   3
    3        k_num                   4
    4     category    Click, Not click
    5         iter                   3
    6      tr-loss  0.6409316561278655
    7      va-loss  0.7452354780967997

Performing predict() on given predicting dataframe:

>>> res = ffm.predict(data=self.df_predict, key='ID', thread_ratio=1)
>>> res.collect()
        ID      SCORE  CONFIDENCE
    0   1  Not click    0.543537
    1   2  Not click    0.545470
    2   3      Click    0.542737
    3   4      Click    0.519458
    4   5      Click    0.511001
    5   6  Not click    0.534610
    6   7      Click    0.537739
    7   8  Not click    0.536781
    8   9  Not click    0.635412
Attributes
metadata_DataFrame

Model metadata content.

coef_DataFrame
DataFrame that provides the following information:
  • Feature name,

  • Field name,

  • The factorisation number,

  • The parameter value.

stats_DataFrame

Statistic values.

cross_valid_DataFrame

Cross validation content.

Methods

fit(self, data[, key, features, label, …])

Fit the FFMClassifier model with the input training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Prediction for the input data with the trained FFMClassifier model.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)

Fit the FFMClassifier model with the input training data. Model parameters should be given by initializing the model first.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Name of the feature columns.

delimiterstr, optional

The delimiter to separate string features. For example, “China, USA” indicates two feature values “China” and “USA”. Default to ‘,’.

labelstr, optional

Specifies the dependent variable. For classification, the label column can be of any data type. Defaults to the name of the last column.

categorical_variablestr/ListofStrings, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, columns of type 'VARCHAR' or 'NVARCHAR' are treated as categorical, while columns of type 'INTEGER' or 'DOUBLE' are treated as continuous.
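
For illustration, a minimal sketch of calling fit() with explicit feature columns, label, and delimiter, reusing the connection and the classification training dataframe from the example above (self.conn and self.df_train_classification are assumed to exist as in that example):

>>> ffm = FFMClassifier(self.conn, factor_num=4, max_iter=20)
>>> ffm.fit(data=self.df_train_classification,
            features=['USER', 'MOVIE', 'TIMESTAMP'],
            label='CTR',
            categorical_variable='TIMESTAMP',
            delimiter=',')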

predict(self, data, key, features=None, thread_ratio=None, handle_missing=None)

Prediction for the input data with the trained FFMClassifier model.

Parameters
dataDataFrame

Data to be predicted.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Global side features column name in the training dataframe.

thread_ratiofloat, optional

The ratio of available threads to use:

  • 0: single thread.

  • 0~1: uses the specified percentage of available threads.

  • Other values: the number of threads is heuristically determined.

Defaults to -1.

handle_missingstr, optional
Specifies how to handle missing feature:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.recommender.FFMRegressor(conn_context, ordering=None, normalise=None, include_linear=None, include_constant=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)

Bases: hana_ml.algorithms.pal.recommender._FFMBase

FFMRegressor: Field-Aware Factorization Machine for regression tasks.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

factor_numint, optional

The factorisation dimensionality. Default to 4.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

train_sizefloat, optional

The proportion of the data used for training, with the remaining data used for validation. For example, 0.8 indicates that 80% of the data is used for training and the remaining 20% for validation. Defaults to 0.8 if the number of instances is not less than 40, and to 1.0 otherwise.

max_iterint, optional

Specifies the maximum number of iterations for the ALS algorithm. Defaults to 20.

orderingListOfStrings, optional

Specifies the order of categories for ranking. No default value.

normalisebool, optional

Specifies whether to normalise each instance so that its L1 norm is 1. Default to True.

include_constantbool, optional

Specifies whether to include the constant part. Default to True.

include_linearbool, optional

Specifies whether to include the linear part of the model. Default to True.

early_stopbool, optional

Specifies whether to stop the SGD optimisation early. Valid only if the value of train_size is less than 1. Defaults to True.

learning_ratefloat, optional

The learning rate for SGD iteration. Default to 0.2.

linear_lambfloat, optional

The L2 regularisation parameter for the linear coefficient vector. Default to 1e-5.

poly2_lambfloat, optional

The L2 regularisation parameter for factorized coefficient matrix of the quadratic term. Default to 1e-5.

tolfloat, optional

The criterion to determine the convergence of SGD. Default to 1e-5.

exit_intervalint, optional

The interval of two iterations for comparison to determine the convergence. Default to 5.

handle_missingstr, optional
Specifies how to handle missing feature:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

Examples

Input dataframe for regression training:

>>> df_train_regression.collect()
    USER    MOVIE                  TIMESTAMP    CTR
    A      Movie1                   3          0
    A      Movie2                   3          5
    A      Movie4                   1          0
    A      Movie5                   2          1
    A      Movie6                   3          2
    A      Movie8                   2          0
    A      Movie0, Movie3           1          5
    B      Movie2                   3          4
    B      Movie3                   2          4
    B      Movie4                   2          0
    B      null                     4          3
    B      Movie7                   1          4
    B      Movie8                   2          0
    B      Movie0                   3          4
    C      Movie1                   2          3
    C      Movie2, Movie5, Movie7   4          2
    C      Movie4                   3          1
    C      Movie5                   1          0
    C      Movie6                   null       5
    C      Movie7                   3          0
    C      Movie8                   1          5
    C      Movie0                   2          3
    D      Movie1                   3          0
    D      Movie3                   2          5
    D      Movie4, Movie7           2          5
    D      Movie6                   2          5
    D      Movie7                   4          0
    D      Movie8                   3          1
    D      Movie0                   3          1
    E      Movie1                   2          1
    E      Movie2                   2          5
    E      Movie3                   2          3
    E      Movie4                   4          2
    E      Movie5                   3          5
    E      Movie6                   2          0
    E      Movie7                   4          2
    E      Movie8                   3          0
Creating FFMRegressor instance:

>>> ffm = FFMRegressor(self.conn, factor_num=4, early_stop=True, learning_rate=0.2, max_iter=20, train_size=0.8,
                       linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1)

Performing fit() on given dataframe:

>>> ffm.fit(data=self.df_train_regression, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
        STAT_NAME          STAT_VALUE
    0         task          regression
    1  feature_num                  18
    2    field_num                   3
    3        k_num                   4
    4         iter                  15
    5      tr-loss  0.4503367758101421
    6      va-loss  1.6896813062750056

Performing predict() on given prediction dataset:

>>> res = ffm.predict(data=self.df_predict, key='ID', thread_ratio=1)
>>> res.collect()
       ID                SCORE CONFIDENCE
    0   1    2.978197866860172       None
    1   2  0.43883354766746385       None
    2   3    3.765106298778723       None
    3   4   1.8874204073998788       None
    4   5    3.588371752514674       None
    5   6   1.3448502862740495       None
    6   7    5.268571202934171       None
    7   8   0.8713338730015039       None
    8   9    2.347070689885986       None
Attributes
metadata_DataFrame

Model metadata content.

coef_DataFrame
DataFrame that provides the following information:
  • Feature name,

  • Field name,

  • The factorisation number,

  • The parameter value.

stats_DataFrame

Statistic values.

cross_valid_DataFrame

Cross validation content.

Methods

fit(self, data[, key, features, label, …])

Fit the FFMRegressor model with the input training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Prediction for the input data with the trained FFMRegressor model.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)

Fit the FFMRegressor model with the input training data. Model parameters should be given by initializing the model first.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Name of the feature columns.

delimiterstr, optional

The delimiter to separate string features. For example, “China, USA” indicates two feature values “China” and “USA”. Default to ‘,’.

labelstr, optional

Specifies the dependent variable. For regression, the label column must be of numerical data type. Defaults to the name of the last column.

categorical_variablestr/ListofStrings, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, columns of type 'VARCHAR' or 'NVARCHAR' are treated as categorical, while columns of type 'INTEGER' or 'DOUBLE' are treated as continuous.

predict(self, data, key, features=None, thread_ratio=None, handle_missing=None)

Prediction for the input data with the trained FFMRegressor model.

Parameters
dataDataFrame

Data to be predicted.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Global side features column name in the training dataframe.

thread_ratiofloat, optional

The ratio of available threads to use:

  • 0: single thread.

  • 0~1: uses the specified percentage of available threads.

  • Other values: the number of threads is heuristically determined.

Defaults to -1.

handle_missing{‘remove’, ‘replace’}, optional
Specifies how to handle missing feature:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.recommender.FFMRanker(conn_context, ordering=None, normalise=None, include_linear=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)

Bases: hana_ml.algorithms.pal.recommender._FFMBase

FFMRanker: Field-Aware Factorization Machine for ranking tasks.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

factor_numint, optional

The factorisation dimensionality. Default to 4.

random_stateint, optional

Specifies the seed for random number generator. 0: Uses the current time as the seed. Others: Uses the specified value as the seed. Default to 0.

train_sizefloat, optional

The proportion of the data used for training, with the remaining data used for validation. For example, 0.8 indicates that 80% of the data is used for training and the remaining 20% for validation. Defaults to 0.8 if the number of instances is not less than 40, and to 1.0 otherwise.

max_iterint, optional

Specifies the maximum number of iterations for the ALS algorithm. Defaults to 20.

orderingListOfStrings, optional

Specifies the order of categories for ranking. No default value.

normalisebool, optional

Specifies whether to normalise each instance so that its L1 norm is 1. Default to True.

include_linearbool, optional

Specifies whether to include the linear part of the model. Defaults to True.

early_stopbool, optional

Specifies whether to stop the SGD optimisation early. Valid only if the value of train_size is less than 1. Defaults to True.

learning_ratefloat, optional

The learning rate for SGD iteration. Default to 0.2.

linear_lambfloat, optional

The L2 regularisation parameter for the linear coefficient vector. Default to 1e-5.

poly2_lambfloat, optional

The L2 regularisation parameter for factorized coefficient matrix of the quadratic term. Default to 1e-5.

tolfloat, optional

The criterion to determine the convergence of SGD. Default to 1e-5.

exit_intervalint, optional

The interval of two iterations for comparison to determine the convergence. Default to 5.

handle_missing{‘remove’, ‘replace’}, optional
Specifies how to handle missing feature:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

Examples

Input dataframe for ranking training:

>>> df_train_ranker.collect()
    USER    MOVIE                  TIMESTAMP    CTR
      A     Movie1                   3          medium
      A     Movie2                   3          too high
      A     Movie4                   1          medium
      A     Movie5                   2          too low
      A     Movie6                   3          low
      A     Movie8                   2          low
      A     Movie0, Movie3           1          too high
      B     Movie2                   3          high
      B     Movie3                   2          high
      B     Movie4                   2          medium
      B     null                     4          medium
      B     Movie7                   1          high
      B     Movie8                   2          high
      B     Movie0                   3          high
      C     Movie1                   2          medium
      C     Movie2, Movie5, Movie7   4          low
      C     Movie4                   3          too low
      C     Movie5                   1          high
      C     Movie6                   null       too high
      C     Movie7                   3          high
      C     Movie8                   1          too high
      C     Movie0                   2          medium
      D     Movie1                   3          too high
      D     Movie3                   2          too high
      D     Movie4, Movie7           2          too high
      D     Movie6                   2          too high
      D     Movie7                   4          too high
      D     Movie8                   3          too low
      D     Movie0                   3          too low
      E     Movie1                   2          too low
      E     Movie2                   2          too high
      E     Movie3                   2          medium
      E     Movie4                   4          low
      E     Movie5                   3          too high
      E     Movie6                   2          low
      E     Movie7                   4          low
      E     Movie8                   3          too low

Creating FFMRanker instance:

>>> ffm = FFMRanker(self.conn, ordering=['too low', 'low', 'medium', 'high', 'too high'],
                    factor_num=4, early_stop=True, learning_rate=0.2, max_iter=20, train_size=0.8,
                    linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1)

Performing fit() on given dataframe:

>>> ffm.fit(data=self.df_train_ranker, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
        STAT_NAME                            STAT_VALUE
    0         task                               ranking
    1  feature_num                                    18
    2    field_num                                     3
    3        k_num                                     4
    4     category  too low, low, medium, high, too high
    5         iter                                    14
    6      tr-loss                    1.3432013591533276
    7      va-loss                    1.5509792122994928

Performing predict() on given predicting dataframe:

>>> res = ffm.predict(data=self.df_predict, key='ID', thread_ratio=1)
>>> res.collect()
       ID     SCORE  CONFIDENCE
    0   1      high    0.294206
    1   2    medium    0.209893
    2   3   too low    0.316609
    3   4      high    0.219671
    4   5  too high    0.222545
    5   6      high    0.385621
    6   7   too low    0.407695
    7   8   too low    0.295200
    8   9      high    0.282633
Attributes
metadata_DataFrame

Model metadata content.

coef_DataFrame
DataFrame that provides the following information:
  • Feature name,

  • Field name,

  • The factorisation number,

  • The parameter value.

stats_DataFrame

Statistic values.

cross_valid_DataFrame

Cross validation content.

Methods

fit(self, data[, key, features, label, …])

Fit the FFMRanker model with the input training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Prediction for the input data with the trained FFMRanker model.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)

Fit the FFMRanker model with the input training data. Model parameters should be given by initializing the model first.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Name of the feature columns.

delimiterstr, optional

The delimiter to separate string features. For example, “China, USA” indicates two feature values “China” and “USA”. Default to ‘,’.

labelstr, optional

Specifies the dependent variable. For ranking, the label column must be of categorical data type. Defaults to the name of the last column.

categorical_variablestr/ListofStrings, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, columns of type 'VARCHAR' or 'NVARCHAR' are treated as categorical, while columns of type 'INTEGER' or 'DOUBLE' are treated as continuous.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(self, data, key, features=None, thread_ratio=None, handle_missing=None)

Prediction for the input data with the trained FFMRanker model.

Parameters
dataDataFrame

Data to be predicted.

keystr, optional

Name of the ID column.

featuresstr/ListOfStrings, optional

Global side features column name in the training dataframe.

thread_ratiofloat, optional

The ratio of available threads to use:

  • 0: single thread.

  • 0~1: uses the specified percentage of available threads.

  • Other values: the number of threads is heuristically determined.

Defaults to -1.

handle_missingstr, optional
Specifies how to handle missing feature:
  • ‘remove’: remove rows with missing values.

  • ‘replace’: replace missing values with 0.

Default to ‘replace’.

hana_ml.algorithms.pal.regression

This module contains wrappers for PAL regression algorithms.

The following classes are available:

class hana_ml.algorithms.pal.regression.PolynomialRegression(conn_context, degree=None, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, degree_values=None, degree_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X. In polynomial regression, data is modeled using polynomial functions, and unknown model parameters are estimated from the data. Such models are called polynomial models.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

degreeint

Degree of the polynomial model.

decomposition{‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratiofloat, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation/parameter selection. If no value is specified for this parameter, neither model evaluation nor parameter selection is activated. Must be set together with evaluation_metric.

No default value.

evaluation_metric{‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection. Must be set together with resampling_method.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to ‘cv’.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{‘grid’, ‘random’}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is set to ‘random’.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.

No default value.

degree_valueslist of int, optional

Specifies values of degree to be selected. Only valid when search_strategy is specified.

No default value.

degree_rangelist of int, optional

Specifies range of degree to be selected. Only valid when search_strategy is specified.

No default value.
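
As an illustrative sketch (not part of the original examples), parameter selection over candidate polynomial degrees could be activated as follows, assuming a connection conn and a training DataFrame df with an ID column; whether degree itself must also be supplied depends on the PAL version:

>>> pr = PolynomialRegression(conn_context=conn,
                              resampling_method='cv',
                              evaluation_metric='rmse',
                              fold_num=5,
                              search_strategy='grid',
                              degree_values=[2, 3, 4])
>>> pr.fit(data=df, key='ID')
>>> pr.optim_param_.collect()   # selected optimal degree (output omitted)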

Examples

Training data (based on y = x^3 - 2x^2 + 3x + 5, with noise):

>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041

Training the model:

>>> pr = PolynomialRegression(conn_context=conn, degree=3)
>>> pr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501

Ideal output:

>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*x+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875

Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

optim_param_DataFrame

If cross validation is enabled, the optimal parameters will be selected.

Methods

fit(self, data[, key, features, label])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(self, data, key, features=None, model_format=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

model_formatint, optional
  • 0: coefficient

  • 1: pmml

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION_PREDICT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns
float

The coefficient of determination R2 of the prediction on the given data.
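
For instance, reusing df from the example above, a hedged sketch of evaluating the fit on the training data (the returned R2 value is not reproduced here):

>>> r2 = pr.score(data=df, key='ID', label='Y')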

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.regression.GLM(conn_context, family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

family{‘gaussian’, ‘normal’, ‘poisson’, ‘binomial’, ‘gamma’, ‘inversegaussian’, ‘negativebinomial’, ‘ordinal’}, optional

The kind of distribution the dependent variable outcomes are assumed to be drawn from. Defaults to ‘gaussian’.

linkstr, optional

GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. ‘inverse’ is accepted as a synonym of ‘reciprocal’.

family              default link      allowed values of link
gaussian            identity          identity, log, reciprocal
poisson             log               identity, log
binomial            logit             logit, probit, comploglog, log
gamma               reciprocal        identity, reciprocal, log
inversegaussian     inversesquare     inversesquare, identity, reciprocal, log
negativebinomial    log               identity, log, sqrt
ordinal             logit             logit, probit, comploglog
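
For illustration, a non-default link from the table above could be requested as follows (a sketch only; conn is an assumed ConnectionContext):

>>> glm = GLM(conn_context=conn, family='binomial', link='probit')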

solver{‘irls’, ‘nr’, ‘cd’}, optional

Optimization algorithm to use.

  • ‘irls’: Iteratively re-weighted least squares.

  • ‘nr’: Newton-Raphson.

  • ‘cd’: Coordinate descent. (Picking coordinate descent activates elastic net regularization.)

Defaults to ‘irls’, except when family is ‘ordinal’. Ordinal regression requires (and defaults to) ‘nr’, and Newton-Raphson is not supported for other values of family.

handle_missing_fit{‘skip’, ‘abort’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values during fitting.

  • ‘skip’: Don’t use those rows for fitting.

  • ‘abort’: Throw an error if missing independent variable values are found.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

quasilikelihoodbool, optional

If True, enables the use of quasi-likelihood to estimate overdispersion.

Defaults to False.

max_iterint, optional

Maximum number of optimization iterations.

Defaults to 100 for IRLS and Newton-Raphson.

Defaults to 100000 for coordinate descent.

tolfloat, optional

Stopping condition for optimization.

Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.

significance_levelfloat, optional

Significance level for confidence intervals and prediction intervals.

Defaults to 0.05.

output_fittedbool, optional

If True, create the fitted_ DataFrame of fitted response values for training data in fit.

alphafloat, optional

Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.

Defaults to 1.0.

num_lambdaint, optional

The number of lambda values. Only accepted when using coordinate descent.

Defaults to 100.

lambda_min_ratiofloat, optional

The smallest value of lambda, as a fraction of the maximum lambda, where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent.

Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.

categorical_variablelist of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

orderinglist of str or list of int, optional

Specifies the order of categories for ordinal regression. The default is numeric order for ints and alphabetical order for strings.
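
As a hedged sketch of the coordinate-descent options described above (which activate elastic net regularization), assuming a connection conn:

>>> glm_en = GLM(conn_context=conn, family='gaussian', solver='cd',
                 alpha=0.5, num_lambda=50, lambda_min_ratio=0.01)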

Examples

Training data:

>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1

Fitting a GLM on that data:

>>> glm = GLM(conn_context=conn, solver='irls', family='poisson', link='log')
>>> glm.fit(data=df, key='ID', label='Y')

Performing prediction:

>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(data=df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737
Attributes
statistics_DataFrame

Training statistics and model information other than the coefficients and covariance matrix.

coef_DataFrame

Model coefficients.

covmat_DataFrame

Covariance matrix. Set to None for coordinate descent.

fitted_DataFrame

Predicted values for the training data. Set to None if output_fitted is False.

Methods

fit(self, data[, key, features, label, …])

Fit a generalized linear model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label, …])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, dependent_variable=None, excluded_feature=None)

Fit a generalized linear model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column. Required when output_fitted is True.

featureslist of str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

labelstr or list of str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) When family is ‘binomial’, label may be either a single column name or a list of two column names.

categorical_variablelist of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

dependent_variablestr, optional

Only used when you need to indicate the dependence.

excluded_featurelist of str, optional

Excludes the indicated feature column.

Defaults to None.
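
As noted for label, a binomial model may be fit against two label columns. A minimal sketch, where df_bin, ID, SUCCESSES, and FAILURES are hypothetical names used only for illustration:

>>> glm_bin = GLM(conn_context=conn, family='binomial')
>>> glm_bin.fit(data=df_bin, key='ID', label=['SUCCESSES', 'FAILURES'])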

predict(self, data, key, features=None, prediction_type=None, significance_level=None, handle_missing=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Defaults to all non-ID columns.

prediction_type{‘response’, ‘link’}, optional

Specifies whether to output predicted values of the response or the link function.

Defaults to ‘response’.

significance_levelfloat, optional

Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.

handle_missing{‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns
DataFrame

Predicted values, structured as follows. The following two columns are always populated:

  • ID column, with same name and type as data’s ID column.

  • PREDICTION, type NVARCHAR(100), representing predicted values.

The following five columns are only populated for IRLS:

  • SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.

  • CI_LOWER, type DOUBLE. Lower bound of the confidence interval.

  • CI_UPPER, type DOUBLE. Upper bound of the confidence interval.

  • PI_LOWER, type DOUBLE. Lower bound of the prediction interval.

  • PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
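
For illustration, a sketch of predicting in the link space with a custom significance level and zero-filled missing values, reusing glm and df2 from the example above:

>>> res = glm.predict(data=df2, key='ID', prediction_type='link',
                      significance_level=0.1, handle_missing='fill_zero')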

score(self, data, key, features=None, label=None, prediction_type=None, handle_missing=None)

Returns the coefficient of determination R2 of the prediction.

Not applicable for ordinal regression.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.) Cannot be two columns, even for family=’binomial’.

prediction_type{‘response’, ‘link’}, optional

Specifies whether to predict the value of the response or the link function. The contents of the label column should match this choice.

Defaults to ‘response’.

handle_missing{‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows. Those rows will be left out of the R2 computation.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns
float

The coefficient of determination R2 of the prediction on the given data.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.regression.ExponentialRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In exponential regression, data is modeled using exponential functions, and unknown model parameters are estimated from the data. Such models are called exponential models.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

decomposition{‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratiofloat, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

>>> df.collect()
   ID    Y       X1      X2
   0    0.5     0.13    0.33
   1    0.15    0.14    0.34
   2    0.25    0.15    0.36
   3    0.35    0.16    0.35
   4    0.45    0.17    0.37

Training the model:

>>> er = ExponentialRegression(conn_context=conn, pmml_export = 'multi-row')
>>> er.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X1       X2
   0    0.5      0.3
   1    4        0.4
   2    0        1.6
   3    0.3      0.45
   4    0.4      1.7
>>> er.predict(data=df2, key='ID').collect()
   ID      VALUE
   0      0.6900598931338715
   1      1.2341502316656843
   2      0.006630664136180741
   3      0.3887970208571841
   4      0.0052106543571450266
Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

Methods

fit(self, data[, key, features, label])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(self, data, key, features=None, model_format=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

model_formatint, optional
  • 0: coefficient

  • 1: pmml

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns
float

The coefficient of determination R2 of the prediction on the given data.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.regression.BiVariateGeometricRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Geometric regression is an approach used to model the relationship between a scalar variable y and a variable denoted X. In geometric regression, data is modeled using geometric functions, and unknown model parameters are estimated from the data. Such models are called geometric models.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

decomposition{‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratiofloat, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

>>> df.collect()
ID    Y       X1
0    1.1      1
1    4.2      2
2    8.9      3
3    16.3     4
4    24       5

Training the model:

>>> gr = BiVariateGeometricRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
ID    X1
0     1
1     2
2     3
3     4
4     5
>>> gr.predict(data=df2, key='ID').collect()
ID      VALUE
0        1
1       3.9723699817481437
2       8.901666037549536
3       15.779723271893747
4       24.60086108408644
Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

Methods

fit(self, data[, key, features, label])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(self, data, key, features=None, model_format=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

model_formatint, optional
  • 0: coefficient

  • 1: pmml

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns
float

The coefficient of determination R2 of the prediction on the given data.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.regression.BiVariateNaturalLogarithmicRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted X. In natural logarithmic regression, data is modeled using natural logarithmic functions, and unknown model parameters are estimated from the data. Such models are called natural logarithmic models.

Parameters
conn_contextConnectionContext

Connection to the HANA system.

decomposition{‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export{‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratiofloat, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect fitting.

Defaults to 0.

Examples

>>> df.collect()
   ID    Y       X1
   0    10       1
   1    80       2
   2    130      3
   3    180      5
   4    190      6

Training the model:

>>> gr = BiVariateNaturalLogarithmicRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X1
   0     1
   1     2
   2     3
   3     4
   4     5
>>> gr.predict(data=df2, key='ID').collect()
   ID      VALUE
   0     14.86160299
   1     82.9935329364932
   2     122.8481570569525
   3     151.1254628829864
   4     173.05904529166017
Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

Methods

fit(self, data[, key, features, label])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, …])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(self, data, key, features=None, model_format=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

model_formatint, optional
  • 0: coefficient

  • 1: pmml

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns
float

The coefficient of determination R2 of the prediction on the given data.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.regression.CoxProportionalHazardModel(conn_context, tie_method=None, status_col=None, max_iter=None, convergence_criterion=None, significance_level=None, calculate_hazard=None, output_fitted=None, type_kind=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Cox proportional hazard model (CoxPHM) is a special generalized linear model. It is a well-known survival analysis model that describes failure or death at a certain time.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

tie_method{‘breslow’, ‘efron’}, optional

The method to deal with tied events.

Defaults to ‘efron’.

status_colbool, optional

If a status column is defined for right-censored data:

  • False : No status column. All response times are failure/death.

  • True: The 3rd column of the data input table is a status column, of which 0 indicates right-censored data and 1 indicates failure/death.

Defaults to True.

max_iterint, optional

Maximum number of iterations for numeric optimization.

convergence_criterionfloat, optional

Convergence criterion of coefficients for numeric optimization.

Defaults to 0.

significance_levelfloat, optional

Significance level for the confidence interval of estimated coefficients.

Defaults to 0.05.

calculate_hazardbool, optional

Controls whether to calculate hazard function as well as survival function.

  • False : Does not calculate hazard function.

  • True: Calculates hazard function.

Defaults to True.

output_fittedbool, optional

Controls whether to output the fitted response:

  • False : Does not output the fitted response.

  • True: Outputs the fitted response.

Defaults to False.

type_kindstr, optional

The prediction type:

  • ‘risk’: Predicts in risk space

  • ‘lp’: Predicts in linear predictor space

Defaults to ‘risk’.

Examples

>>> df1.collect()
    ID  TIME  STATUS  X1  X2
     1     4       1   0   0
     2     3       1   2   0
     3     1       1   1   0
     4     1       0   1   0
     5     2       1   1   1
     6     2       1   0   1
     7     3       0   0   1

Training the model:

>>> cox = CoxProportionalHazardModel(conn_context=conn, significance_level=0.05,
                                     calculate_hazard=True, type_kind='risk')
>>> cox.fit(data=df1, key='ID', features=['STATUS', 'X1', 'X2'], label='TIME')

Prediction:

>>> df2.collect()
    ID      X1      X2
    1       0       0
    2       2       0
    3       1       0
    4       1       0
    5       1       1
    6       0       1
    7       0       1
>>> cox.predict(data=full_tbl, key='ID', features=['STATUS', 'X1', 'X2']).collect()
    ID       PREDICTION        SE         CI_LOWER     CI_UPPER
    1       0.383590423     0.412526262     0.046607574     3.157032199
    2       1.829758442     1.385833778     0.414672719     8.073875617
    3       0.837781484     0.400894077     0.32795551      2.140161678
    4       0.837781484     0.400894077     0.32795551      2.140161678
Attributes
statistics_DataFrame

Regression-related statistics, such as r-square, log-likelihood, aic.

coefficient_DataFrame

Fitted regression coefficients.

covariance_varianceDataFrame

Co-Variance related data.

hazard_DataFrame

Statistics related to Time, Hazard, Survival.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

Methods

fit(self, data[, key, features, label])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

fit(self, data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(self, data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • VALUE, type DOUBLE, representing predicted values.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns
float

The coefficient of determination R2 of the prediction on the given data.
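
A minimal usage sketch, reusing the training DataFrame df1 from the Examples above; the feature list mirrors the predict() call rather than fit(), and should be adjusted to your data:

>>> r2 = cox.score(data=df1, key='ID', features=['X1', 'X2'], label='TIME')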

hana_ml.algorithms.pal.som

This module contains the Python wrapper for the PAL SOM algorithm. The following class is available:

class hana_ml.algorithms.pal.som.SOM(conn_context, covergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.

Parameters
conn_contextConnectionContext

Connection to the SAP HANA system.

convergence_criterionfloat, optional

If the largest difference between successive maps is less than this value, the calculation is regarded as converged and SOM finishes.

Defaults to 1.0e-6.

normalization{‘0’, ‘1’, ‘2’}, int, optional

Normalization type:

  • 0: No normalization

  • 1: Transform to new range (0.0, 1.0)

  • 2: Z-score normalization

Defaults to 0.

random_seed{‘-1’, ‘0’, ‘Other value’}, int, optional

  • -1: Random

  • 0: Sets every weight to zero

  • Other value: Uses this value as the seed

Defaults to -1.

height_of_mapint, optional

Indicates the height of the map.

Defaults to 10.

width_of_mapint, optional

Indicates the width of the map.

Defaults to 10.

kernel_functionint, optional

Represents the neighborhood kernel function.

  • 1: Gaussian

  • 2: Bubble/Flat

Defaults to 1.

alphafloat, optional

Specifies the learning rate.

Defaults to 0.5.

learning_rateint, optional

Indicates the decay function for learning rate.

  • 1: Exponential

  • 2: Linear

Defaults to 1.

shape_of_gridint, optional

Indicates the shape of the grid.

  • 1: Rectangle

  • 2: Hexagon

Defaults to 2.

radiusfloat, optional

Specifies the scan radius.

Defaults to the larger of height_of_map and width_of_map.

batch_som{‘0’, ‘1’}, int, optional

Indicates whether batch SOM is carried out.

  • 0: Classical SOM

  • 1: Batch SOM

For batch SOM, kernel_function is always Gaussian and learning_rate has no effect.

Defaults to 0.

max_iterint, optional

Maximum number of iterations. Note that the training might not converge if this value is too small, for example, less than 1000.

Defaults to 1000 plus 500 times the number of neurons in the lattice.
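
For example, with the default 10 x 10 map the lattice has 100 neurons, so the default maximum number of iterations is 1000 + 500 * 100 = 51000. A one-line sketch of this rule:

>>> height_of_map, width_of_map = 10, 10
>>> 1000 + 500 * (height_of_map * width_of_map)
51000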

Examples

Input dataframe df for clustering:

>>> df.collect()
    TRANS_ID    V000    V001
0      0        0.10    0.20
1      1        0.22    0.25
2      2        0.30    0.40
...
18     18       55.30   50.40
19     19       50.40   56.50

Create SOM instance:

>>> som = SOM(conn_context=conn, covergence_criterion=1.0e-6, normalization=0,
             random_seed=1, height_of_map=4, width_of_map=4,
             kernel_function='gaussian', alpha=None,
             learning_rate='exponential', shape_of_grid='hexagon',
             radius=None, batch_som='classical', max_iter=4000)

Perform fit on the given data:

>>> som.fit(data=df, key='TRANS_ID')

Expected output:

>>> som.map_.collect().head(3)
        CLUSTER_ID  WEIGHT_V000    WEIGHT_V001    COUNT
    0    0          52.837688      53.465327      2
    1    1          50.150251      49.245226      2
    2    2          18.597607      27.174590      0
>>> som.labels_.collect().head(3)
           TRANS_ID    BMU       DISTANCE    SECOND_BMU  IS_ADJACENT
    0           0      15          0.342564        14      1
    1           1      15          0.239676        14      1
    2           2      15          0.073968        14      1
>>> som.model_.collect()
        ROW_INDEX      MODEL_CONTENT
  0      0             {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...

After the model is trained, we can use it for prediction. Input dataframe df2 for prediction:

>>> df2.collect()
    TRANS_ID    V000    V001
0      33       0.2     0.10
1      34       1.2     4.1

Perform predict on the given data:

>>> label = som.predict(data=df2, key='TRANS_ID')

Expected output:

>>> label.collect()
    TRANS_ID    CLUSTER_ID     DISTANCE
0    33          15            0.388460
1    34          11            0.156418
Attributes
map_DataFrame

The map after training. The structure is as follows:

  • 1st column: CLUSTER_ID, int. Unit cell ID.

  • Other columns except the last one: FEATURE (in input data) column with prefix “WEIGHT_”, float. Weight vectors used to simulate the original tuples.

  • Last column: COUNT, int. Number of original tuples that every unit cell contains.

labels_DataFrame

The labels of the input data. The structure is as follows:

  • 1st column: ID, with the same name and data type as the input table's ID column; the ID of the original tuple.

  • 2nd column: BMU, int. Best matching unit (BMU).

  • 3rd column: DISTANCE, float. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, int. Second BMU.

  • 5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent:
    • 0: Not adjacent

    • 1: Adjacent

model_DataFrame

The SOM model.

Methods

fit(self, data, key[, features, …])

Fit the SOM model when given the training dataset.

fit_predict(self, data, key[, features])

Fit the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, sql_trace_function=None)

Fit the SOM model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

sql_trace_functionstr, optional

Function name to be used as a reference in SQL tracing.

fit_predict(self, data, key, features=None)

Fit the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

The labels of the given data. The structure is as follows:

  • 1st column: ID, with the same name and data type as the input table's ID column; the ID of the original tuple.

  • 2nd column: BMU, int. Best matching unit (BMU).

  • 3rd column: DISTANCE, float. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, int. Second BMU.

  • 5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent:
    • 0: Not adjacent

    • 1: Adjacent
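
A minimal usage sketch, reusing df and its key column from the Examples above (output omitted):

>>> labels = som.fit_predict(data=df, key='TRANS_ID')
>>> labels.collect().head(3)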

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

keystr

Name of ID column.

featureslist of str, optional.

Names of feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type int, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and its assigned cluster.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.
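
A minimal persist-and-reload sketch. The table name SOM_MODEL_TBL is an illustrative choice, and DataFrame.save() / ConnectionContext.table() are assumed to be available in your hana_ml version:

>>> som.model_.save('SOM_MODEL_TBL')               # persist the fitted model DataFrame to a HANA table
>>> som2 = SOM(conn_context=conn)                  # new, unfitted instance
>>> som2.load_model(conn.table('SOM_MODEL_TBL'))   # reload the fitted model
>>> som2.is_fitted()
True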

hana_ml.algorithms.pal.stats

This module contains Python wrappers for statistics algorithms.

The following functions are available: