hana_ml.algorithms.pal package

The Algorithms PAL Package consists of the following sections:

hana_ml.algorithms.pal.association

This module contains Python wrappers for PAL association algorithms.

The following classes are available:

class hana_ml.algorithms.pal.association.Apriori(conn_context, min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

min_support : float

User-specified minimum support (actual value).

min_confidence : float

User-specified minimum confidence (actual value).

relational : bool, optional

Whether or not to apply relational logic in the Apriori algorithm. If False, a single result table is produced; otherwise, the result is split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_lift : float, optional

User-specified minimum lift.

Defaults to 0.

max_conseq : int, optional

Maximum length of consequent items.

Defaults to 100.

max_len : int, optional

Total length of antecedent items and consequent items in the output.

Defaults to 5.

ubiquitous : float, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

use_prefix_tree : bool, optional

Indicates whether or not to use prefix tree for saving memory.

Defaults to False.

lhs_restrict : list of str, optional (deprecated)

Specify items that are only allowed on the left-hand-side of association rules.

rhs_restrict : list of str, optional (deprecated)

Specify items that are only allowed on the right-hand-side of association rules.

lhs_complement_rhs : bool, optional (deprecated)

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1, i2, …, i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, …, i100 to the left-hand-side, you can set the parameters as follows:

rhs_restrict = ['i1', 'i2'],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhs : bool, optional (deprecated)

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Specify the way to export the Apriori model:

  • ‘no’ : do not export the model,

  • ‘single-row’ : export Apriori model in PMML in single row,

  • ‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to ‘no’.

Examples

Input data for association rule mining:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the Apriori algorithm:

>>> ap = Apriori(conn_context=conn,
                 min_support=0.1,
                 min_confidence=0.3,
                 relational=False,
                 min_lift=1.1,
                 max_conseq=1,
                 max_len=5,
                 ubiquitous=1.0,
                 use_prefix_tree=False,
                 thread_ratio=0,
                 timeout=3600,
                 pmml_export='single-row')

Mine association rules from the input data using the Apriori algorithm, and check the results:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000
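
Since pmml_export='single-row' is specified above, the trained model should also be exported in PMML and exposed through the model_ attribute described below; a minimal sketch of retrieving it (output not shown):

>>> ap.model_.collect()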

Apriori algorithm set up using relational logic:

>>> apr = Apriori(conn_context=conn,
                  min_support=0.1,
                  min_confidence=0.3,
                  relational=True,
                  min_lift=1.1,
                  max_conseq=1,
                  max_len=5,
                  ubiquitous=1.0,
                  use_prefix_tree=False,
                  thread_ratio=0,
                  timeout=3600,
                  pmml_export='single-row')

Mine association rules again with the relational set-up, and check the resulting tables:

>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000

Attributes

result_

(DataFrame) Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

model_

(DataFrame) Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. Apriori model in PMML format.

antec_

(DataFrame) Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_

(DataFrame) Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_

(DataFrame) Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(data[, transaction, item, lhs_restrict, …])

Association rule mining from the input data using the Apriori algorithm.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data using the Apriori algorithm.

Parameters

data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item ID column. Data type of item column can either be int or str.

Defaults to the last column if not provided.

lhs_restrict : list of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.

rhs_restrict : list of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.

lhs_complement_rhs : bool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, …, i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, …, i100 to the left-hand-side, you can set the parameters as follows:

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhs : bool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
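
For illustration, a minimal sketch of the restriction parameters, reusing the ap instance and input df from the example above (the restricted items are taken from that data purely for illustration); restricting item5 to the right-hand-side and the complement items to the left-hand-side could look like this (output not shown):

>>> ap.fit(data=df,
           transaction='CUSTOMER',
           item='ITEM',
           rhs_restrict=['item5'],
           lhs_complement_rhs=True)
>>> ap.result_.collect()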

class hana_ml.algorithms.pal.association.AprioriLite(conn_context, min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

min_support : float

User-specified minimum support (actual value).

min_confidence : float

User-specified minimum confidence (actual value).

subsample : float, optional

Specify the sampling percentage for the input data. Set to 1 if you want to use the entire data.

recalculate : bool, optional

If you sample the input data, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.

Defaults to True.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Specify the way to export the Apriori model:

  • ‘no’ : do not export the model,

  • ‘single-row’ : export Apriori model in PMML in single row,

  • ‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to ‘no’.

Examples

Input data for association rule mining using Apriori algorithm:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for light Apriori algorithm, ingest the input data, and check the result table:

>>> apl = AprioriLite(conn_context=conn,
                      min_support=0.1,
                      min_confidence=0.3,
                      subsample=1.0,
                      recalculate=False,
                      timeout=3600,
                      pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143

Attributes

result_

(DataFrame) Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Non-empty only when relational is False.

model_

(DataFrame) Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. lite Apriori model in PMML format.

Methods

fit(data[, transaction, item])

Association rule mining from the input data.

fit(data, transaction=None, item=None)

Association rule mining from the input data.

Parameters

data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item column.

Defaults to the last column if not provided.
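
A minimal usage sketch with the column names passed explicitly, reusing the apl instance and df from the example above; this is equivalent to relying on the first-column/last-column defaults (output not shown):

>>> apl.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> apl.result_.head(5).collect()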

class hana_ml.algorithms.pal.association.FPGrowth(conn_context, min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

min_support : float, optional

User-specified minimum support, with valid range [0, 1].

Defaults to 0.

min_confidence : float, optional

User-specified minimum confidence, with valid range [0, 1].

Defaults to 0.

relational : bool, optional

Whether or not to apply relational logic in the FPGrowth algorithm. If False, a single result table is produced; otherwise, the result is split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_lift : float, optional

User-specified minimum lift.

Defaults to 0.

max_conseq : int, optional

Maximum length of consequent items.

Defaults to 10.

max_len : int, optional

Total length of antecedent items and consequent items in the output.

Defaults to 10.

ubiquitous : float, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

Input data for association rule mining:

>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5

Set up parameters:

>>> fpg = FPGrowth(conn_context=conn,
                   min_support=0.2,
                   min_confidence=0.5,
                   relational=False,
                   min_lift=1.0,
                   max_conseq=1,
                   max_len=5,
                   ubiquitous=1.0,
                   thread_ratio=0,
                   timeout=3600)

Mine association rules from the input data using the FPGrowth algorithm, and check the results:

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000

FPGrowth algorithm set up using relational logic:

>>> fpgr = FPGrowth(conn_context=conn,
                    min_support=0.2,
                    min_confidence=0.5,
                    relational=True,
                    min_lift=1.0,
                    max_conseq=1,
                    max_len=5,
                    ubiquitous=1.0,
                    thread_ratio=0,
                    timeout=3600)

Mine association rules again with the relational set-up, and check the resulting tables:

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000

Attributes

result_

(DataFrame) Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items,

  • 2nd column : consequent (dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

antec_

(DataFrame) Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_

(DataFrame) Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_

(DataFrame) Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(data[, transaction, item, lhs_restrict, …])

Association rule mining from the input data.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data.

Parameters

data : DataFrame

Input data for association rule mining.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item column.

Defaults to the last column if not provided.

lhs_restrict : list of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.

rhs_restrict : list of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.

lhs_complement_rhs : bool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, …, i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, …, i100 to the left-hand-side, you can set the parameters as follows:

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

Defaults to False.

rhs_complement_lhs : bool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
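
For illustration, a minimal sketch reusing the fpg instance and input df from the example above (the restricted items are taken from that data purely for illustration); restricting items 1 and 2 to the right-hand-side and the complement items to the left-hand-side could look like this (output not shown):

>>> fpg.fit(data=df,
            transaction='TRANS',
            item='ITEM',
            rhs_restrict=[1, 2],
            lhs_complement_rhs=True)
>>> fpg.result_.collect()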

class hana_ml.algorithms.pal.association.KORD(conn_context, k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

k : int, optional

The number of top rules to discover.

measure : str, optional

Specifies the measure used to define the priority of the association rules.

min_support : float, optional

User-specified minimum support value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_confidence : float, optional

User-specified minimum confidence value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_coverage : float, optional

User-specified minimum coverage value of association rule, with valid range [0, 1].

Defaults to the value of min_support if not provided.

min_measure : float, optional

User-specified minimum measure value (of leverage or lift, depending on the setting of measure).

Defaults to 0 if not provided.

epsilon : float, optional

User-specified epsilon value for punishing length of rules.

Valid only when use_epsilon is True.

use_epsilon : bool, optional

Specifies whether or not to use epsilon to punish the length of rules.

Defaults to False.

Examples

First let us have a look at the training data:

>>> df.head(10).collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1

Set up a KORD instance:

>>> krd =  KORD(conn_context=conn,
                k=5,
                measure='lift',
                min_support=0.1,
                min_confidence=0.2,
                epsilon=0.1,
                use_epsilon=False)

Start k-optimal rule discovery process from the input transaction data, and check the results:

>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
   RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
   RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714

Attributes

antec_

(DataFrame) Info of antecedent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : antecedent items.

conseq_

(DataFrame) Info of consequent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : consequent items.

stats_

(DataFrame) Some basic statistics for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of rules,

  • 3rd column : confidence value of rules,

  • 4th column : lift value of rules,

  • 5th column : leverage value of rules,

  • 6th column : measure value of rules.

Methods

fit(data[, transaction, item])

K-optimal rule discovery from input data, based on some user-specified measure.

fit(data, transaction=None, item=None)

K-optimal rule discovery from input data, based on some user-specified measure.

Parameters

data : DataFrame

Input data for k-optimal (association) rule discovery.

transaction : str, optional

Column name of transaction ID in the input data.

Defaults to name of the 1st column if not provided.

item : str, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the final column if not provided.

class hana_ml.algorithms.pal.association.SPM(conn_context, min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

min_support : float

User-specified minimum support value.

relational : bool, optional

Whether or not to apply relational logic in sequential pattern mining. If False, a single result table for frequent pattern mining is produced; otherwise, the result is split into two tables: one for mined patterns, and the other for statistics.

Defaults to False.

ubiquitous : float, optional

Items whose support values are above this specified value will be ignored during the frequent item mining phase.

Defaults to 1.0.

min_len : int, optional

Minimum number of items in a transaction.

Defaults to 1.

max_len : int, optional

Maximum number of items in a transaction.

Defaults to 10.

min_len_out : int, optional

Specifies the minimum number of items of the mined association rules in the result table.

Defaults to 1.

max_len_out : int, optional

Specifies the maximum number of items of the mined association rules in the result table.

Defaults to 10.

calc_lift : bool, optional

Whether or not to calculate lift values for all applicable cases. If False, lift values are only calculated for the cases where the last transaction contains a single item.

Defaults to False.

timeout : int, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

First, take a look at the input data df:

>>> df.collect()
   CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert

Set up a SPM instance:

>>> sp = SPM(conn_context=conn,
             min_support=0.5,
             relational=False,
             ubiquitous=1.0,
             max_len=10,
             min_len=1,
             calc_lift=True)

Start sequential pattern mining process from the input data, and check the results:

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
                        PATTERN   SUPPORT  CONFIDENCE      LIFT
0                       {Apple}  1.000000    0.000000  0.000000
1           {Apple},{Blueberry}  0.666667    0.666667  0.666667
2             {Apple},{Dessert}  1.000000    1.000000  1.000000
3             {Apple,Blueberry}  0.666667    0.000000  0.000000
4   {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                {Apple,Cherry}  0.666667    0.000000  0.000000
6      {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                   {Blueberry}  1.000000    0.000000  0.000000
8         {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                      {Cherry}  0.666667    0.000000  0.000000
10           {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                    {Dessert}  1.000000    0.000000  0.000000
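
For comparison, a minimal sketch of the relational set-up, assuming the same input df; here the mined patterns and their statistics are returned in the separate pattern_ and stats_ tables instead of result_ (output not shown):

>>> sp_rel = SPM(conn_context=conn,
                 min_support=0.5,
                 relational=True,
                 ubiquitous=1.0,
                 max_len=10,
                 min_len=1,
                 calc_lift=True)
>>> sp_rel.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp_rel.pattern_.head(5).collect()
>>> sp_rel.stats_.head(5).collect()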

Attributes

result_

(DataFrame) The overall frequent pattern mining result, structured as follows:

  • 1st column : mined frequent patterns,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Available only when relational is False.

pattern_

(DataFrame) Result for mined frequent patterns, structured as follows:

  • 1st column : pattern ID,

  • 2nd column : transaction ID,

  • 3rd column : items.

stats_

(DataFrame) Statistics for frequent pattern mining, structured as follows:

  • 1st column : pattern ID,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Methods

fit(data[, customer, transaction, item, …])

Sequential pattern mining from the input data.

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Sequential pattern mining from the input data.

Parameters

data : DataFrame

Input data for sequential pattern mining.

customer : str, optional

Column name of customer ID in the input data.

Defaults to name of the 1st column if not provided.

transaction : str, optional

Column name of transaction ID in the input data. Note that for sequential pattern mining, the values of this column must also reflect the sequence of occurrence.

Defaults to name of the 2nd column if not provided.

item : str, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the final column if not provided.

item_restrict : list of int or str, optional

Specifies the list of items allowed in the mined association rule.

min_gap : int, optional

Specifies the minimum time difference between consecutive transactions in a sequence.
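
A minimal usage sketch reusing the sp instance and df from the example above; the restricted item list and gap value are assumptions chosen purely for illustration (output not shown):

>>> sp.fit(data=df,
           customer='CUSTID',
           transaction='TRANSID',
           item='ITEMS',
           item_restrict=['Apple', 'Blueberry', 'Dessert'],
           min_gap=1)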

hana_ml.algorithms.pal.clustering

This module contains Python wrappers for PAL clustering algorithms.

The following classes are available:

class hana_ml.algorithms.pal.clustering.AffinityPropagation(conn_context, affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

affinity : {‘manhattan’, ‘standardized_euclidean’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}

Ways to compute the distance between two points.

No default value as it is mandatory.

n_clusters : int

Number of clusters.

  • 0: does not adjust Affinity Propagation cluster result.

  • Non-zero int: If the Affinity Propagation cluster number is bigger than n_clusters, PAL will merge the result so that the number of clusters equals the value specified for n_clusters.

No default value as it is mandatory.

max_iter : int, optional

Maximum number of iterations.

Defaults to 500.

convergence_iter : int, optional

The algorithm ends when the cluster result remains unchanged for the specified number of iterations.

Defaults to 100.

damping : float

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preference : float, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratio : float, optional

Selects a portion of the input data, of size (seed_ratio * data_number), as seeds, where data_number is the number of rows of the input data. Value range: (0, 1]. If seed_ratio is 1, all the input data will be used as seeds.

Defaults to 1.

times : int, optional

The sampling times. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_power : int, optional

The power of the Minkowski distance. Only valid when affinity is ‘minkowski’.

Defaults to 3.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1   0.10     0.10
1    2   0.11     0.10
2    3   0.10     0.11
3    4   0.11     0.11
4    5   0.12     0.11
5    6   0.11     0.12
6    7   0.12     0.12
7    8   0.12     0.13
8    9   0.13     0.12
9   10   0.13     0.13
10  11   0.13     0.14
11  12   0.14     0.13
12  13  10.10    10.10
13  14  10.11    10.10
14  15  10.10    10.11
15  16  10.11    10.11
16  17  10.11    10.12
17  18  10.12    10.11
18  19  10.12    10.12
19  20  10.12    10.13
20  21  10.13    10.12
21  22  10.13    10.13
22  23  10.13    10.14
23  24  10.14    10.13

Create AffinityPropagation instance:

>>> ap = AffinityPropagation(
            conn_context=conn,
            affinity='euclidean',
            n_clusters=0,
            max_iter=500,
            convergence_iter=100,
            damping=0.9,
            preference=0.5,
            seed_ratio=None,
            times=None,
            minkowski_power=None,
            thread_ratio=1)

Perform fit on the given data:

>>> ap.fit(data = df, key='ID')

Expected output:

>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1

Attributes

labels_

(DataFrame) Label assigned to each sample, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods

fit(data, key[, features])

Fit the model when given the training dataset.

fit_predict(data, key[, features])

Fit with the dataset and return the labels.

fit(data, key, features=None)

Fit the model when given the training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Fit with the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Fit result, label of each point, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.
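
A minimal sketch of the one-call form, reusing the ap instance and df from the example above; it fits the model and returns the label assignments directly:

>>> labels = ap.fit_predict(data=df, key='ID')
>>> labels.collect()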

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(conn_context, n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerate approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered a cluster of its own. The algorithm then iteratively merges the two least dissimilar clusters in a greedy manner, forming a larger cluster.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

n_clusters : int, optional

Number of clusters after the agglomerate hierarchical clustering algorithm. Value range: between 1 and the number of input data points.

Defaults to 1.

affinity : {‘manhattan’,’euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘pearson correlation’, ‘squared euclidean’, ‘jaccard’, ‘gower’}, optional

Ways to compute the distance between two points.

Note

(1) For jaccard distance, non-zero input data will be treated as 1, and zero input data will be treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10)

(2) Only gower distance supports category attributes. When linkage is ‘centroid clustering’, ‘median clustering’, or ‘ward’, this parameter must be set to ‘squared euclidean’.

Defaults to squared euclidean.

linkage : { ‘nearest neighbor’, ‘furthest neighbor’, ‘group average’, ‘weighted average’, ‘centroid clustering’, ‘median clustering’, ‘ward’}, optional

Linkage type between two clusters.

  • ‘nearest neighbor’ : single linkage.

  • ‘furthest neighbor’ : complete linkage.

  • ‘group average’ : UPGMA.

  • ‘weighted average’ : WPGMA.

  • ‘centroid clustering’.

  • ‘median clustering’.

  • ‘ward’.

Defaults to centroid clustering.

Note

For linkage ‘centroid clustering’, ‘median clustering’, or ‘ward’, the corresponding affinity must be set to ‘squared euclidean’.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimension : float, optional

Specifies the distance dimension (power) of the Minkowski distance. The value should be no less than 1. Only valid when affinity is ‘minkowski’.

Defaults to 3.

normalize_type : {0, 1, 2, 3}, int, optional

Normalization type

  • 0: does nothing

  • 1: Z score standardize

  • 2: transforms to new range: -1 to 1

  • 3: transforms to new range: 0 to 1

Defaults to 0.

category_weights : float, optional

Represents the weight of category columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
    POINT    X1     X2     X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1

Create AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
             conn_context=conn,
             n_clusters=4,
             affinity='Gower',
             linkage='weighted average',
             thread_ratio=None,
             distance_dimension=3,
             normalize_type= 0,
             category_weights= 0.1)

Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
    STAGE    LEFT_POINT    RIGHT_POINT    DISTANCE
0    1        18           19             0.0187
1    2        13           14             0.0250
2    3        7            9              0.0437
>>> hc.labels_.collect().head(3)
        POINT    CLUSTER_ID
     0     0        1
     1     1        1
     2     2        1

Attributes

combine_process_

(DataFrame) Structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters that is to be combined in one combine stage, named as its row number in the input data table. After the combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name. The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

labels_

(DataFrame) Label assigned to each sample, structured as follows:

  • 1st column: ID, record ID.

  • 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

fit(data, key[, features, categorical_variable])

Fit the model when given the training dataset.

fit_predict(data, key[, features, …])

Fit with the dataset and return the labels.

fit(data, key, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, columns of type ‘VARCHAR’ or ‘NVARCHAR’ are treated as categorical, and columns of type ‘INTEGER’ or ‘DOUBLE’ as continuous.

No default value.

fit_predict(data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. By default, columns of type ‘VARCHAR’ or ‘NVARCHAR’ are treated as categorical, and columns of type ‘INTEGER’ or ‘DOUBLE’ as continuous.

No default value.

Returns

DataFrame

Combine process, structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name, One of the clusters that is to be combined in one combine stage, name as its row number in the input data table. After the combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

Label of each point, structured as follows:

  • 1st column: ID (in input table) data type, ID, record ID.

  • 2nd column: int, CLUSTER_ID, the range is from 0 to n_clusters - 1.
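
A minimal usage sketch, reusing the hc instance and df from the example above; the returned content follows the structure described above, and after the call the combine_process_ and labels_ attributes should be populated as with fit():

>>> res = hc.fit_predict(data=df, key='POINT', categorical_variable=['X3'])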

class hana_ml.algorithms.pal.clustering.DBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

minpts : int, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by the user; otherwise, both parameters are determined automatically.

eps : float, optional

The scan radius.

Note

minpts and eps need to be provided together by the user; otherwise, both parameters are determined automatically.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘standardized_euclidean’, ‘cosine’}, optional

Ways to compute the distance between two points.

Defaults to ‘euclidean’.

minkowski_power : int, optional

When Minkowski distance is chosen for metric, this parameter controls the value of power. Only applicable when metric is ‘minkowski’.

Defaults to 3.

categorical_variable : str or list of str, optional

Specifies column(s) in the data that should be treated as categorical.

category_weights : float, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm : {‘brute-force’, ‘kd-tree’}, optional

Ways to search for neighbours.

Defaults to ‘kd-tree’.

save_model : bool, optional

If true, the generated model will be saved. save_model must be True to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create DBSCAN instance:

>>> dbscan = DBSCAN(conn_context=conn, thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1

Attributes

labels_

(DataFrame) Label assigned to each sample.

model_

(DataFrame) Model content. Set to None if save_model is False.

Methods

fit(data, key[, features, categorical_variable])

Fit the DBSCAN model when given the training dataset.

fit_predict(data, key[, features, …])

Fit with the dataset and return the labels.

predict(data, key[, features])

Assign clusters to data based on a fitted model.

fit(data, key, features=None, categorical_variable=None)

Fit the DBSCAN model when given the training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns

DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as the ID column of data.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)

predict(data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters

data : DataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

key : str

Name of the ID column.

features : list of str, optional.

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
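
Since the dbscan instance above was fitted with save_model left at its default of True, new points can be assigned to the computed clusters; a minimal sketch, where df_new is a hypothetical DataFrame with the same column structure as df (output not shown):

>>> assigned = dbscan.predict(data=df_new, key='ID')
>>> assigned.collect()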

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

minpts : int, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by the user; otherwise, both parameters are determined automatically.

eps : float, optional

The scan radius.

Note

minpts and eps need to be provided together by the user; otherwise, both parameters are determined automatically.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘standardized_euclidean’, ‘cosine’}, optional

Ways to compute the distance between two points.

Defaults to euclidean.

minkowski_power : int, optional

When Minkowski distance is chosen for metric, this parameter controls the value of power. Only applicable when metric is ‘minkowski’.

Defaults to 3.

algorithm : {‘brute-force’, ‘kd-tree’}, optional

Ways to search for neighbours.

Defaults to ‘kd-tree’.

save_model : bool, optional

If true, the generated model will be saved. save_model must be True to call predict().

Defaults to True.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

>>> CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
             "ID" INTEGER,
             "POINT" ST_GEOMETRY
             );

Then, input dataframe df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")

Create GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(conn_context = conn, thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> geo_dbscan.fit(data = df, key='ID')

Expected output:

>>> geo_dbscan.labels_.collect()
    ID    CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28   29  -1
29   30  -1
>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')

Expected output:

>>> result.collect()
    ID    CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28    29  -1
29    30  -1

Attributes

labels_

(DataFrame) Label assigned to each sample.

model_

(DataFrame) Model content. Set to None if save_model is False.

Methods

fit(data, key[, features])

Fit the Geometry DBSCAN model when given the training dataset.

fit_predict(data, key[, features])

Fit with the dataset and return the labels.

fit(data, key, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters

data : DataFrame

DataFrame containing the data. The structure is as follows.

  • 1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID.

  • 2nd column: ST_GEOMETRY, 2-D geometry point.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Fit with the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data. The structure is as follows.

  • 1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID

  • 2nd column: ST_GEOMETRY, 2-D geometry point.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as the ID column of data.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)

class hana_ml.algorithms.pal.clustering.KMeans(conn_context, n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

n_clusters : int, optional

Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.

n_clusters_min : int, optional

Cluster range minimum.

n_clusters_max : int, optional

Cluster range maximum.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to ‘patent’.

max_iter : int, optional

Max iterations.

Defaults to 100.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’} str, optional

Ways to compute the distance between the item and the cluster center. ‘cosine’ is only valid when accelerated is False.

Defaults to ‘euclidean’.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No normalization will be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X' = (x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

tol : float, optional

Convergence threshold for exiting iterations. Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode : {‘auto’, ‘optimize-speed’, ‘optimize-space’}, optional

Indicates the memory mode that the algorithm uses.

  • ‘auto’: Chosen by algorithm.

  • ‘optimize-speed’: Prioritizes speed.

  • ‘optimize-space’: Prioritizes memory.

Only valid when accelerated is True.

Defaults to ‘auto’.

accelerated : bool, optional

Indicates whether to use technology such as caching to accelerate the calculation process. If True, the accelerated K-Means variant is used (see the second example below).

Defaults to False.
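
The distance and normalization parameters above can be illustrated with plain Python (a purely illustrative sketch using made-up values, not PAL code):

>>> import numpy as np
>>> x, c, p = np.array([1.0, 2.0, 3.0]), np.zeros(3), 3.0
>>> minkowski = np.sum(np.abs(x - c) ** p) ** (1.0 / p)   # distance_level='minkowski' with minkowski_power=p
>>> x_l1 = x / np.sum(np.abs(x))                          # normalization='l1_norm': divide by S = |x1|+|x2|+...+|xn|
>>> col = np.array([0.5, 1.5, 15.5, 16.5])
>>> col_mm = (col - col.min()) / (col.max() - col.min())  # normalization='min_max', applied per column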

Examples

Input dataframe df for K-Means:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create KMeans instance:

>>> km = clustering.KMeans(conn_context=conn, n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='euclidean',
...                        category_weights=0.5)

Perform fit_predict:

>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679

Input dataframe df for Accelerated K-Means:

>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create Accelerated K-Means instance:

>>> akm = clustering.KMeans(conn_context=conn, init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717

Attributes

labels_

(DataFrame) Label assigned to each sample.

cluster_centers_

(DataFrame) Coordinates of cluster centers.

model_

(DataFrame) Model content.

statistics_

(DataFrame) Statistic value.

Methods

fit(data, key[, features, categorical_variable])

Fit the model when given training dataset.

fit_predict(data, key[, features, …])

Fit with the dataset and return the labels.

predict(data, key[, features])

Assign clusters to data based on a fitted model.

fit(data, key, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns

DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

  • SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).

predict(data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters

data : DataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

key : str

Name of the ID column.

features : list of str, optional.

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
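
A minimal usage sketch of predict() (not part of the original examples), assuming km has been fitted as in the example above and df_new is a hypothetical DataFrame with the same column structure as df:

>>> assignments = km.predict(data=df_new, key='ID')
>>> assignments.collect()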

class hana_ml.algorithms.pal.clustering.KMedians(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

n_clusters : int

Number of groups.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to ‘patent’.

max_iter : int, optional

Max iterations.

Defaults to 100.

tol : float, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’} str, optional

Ways to compute the distance between the item and the cluster center.

Defaults to ‘euclidean’.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+…|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedians instance:

>>> kmedians = KMedians(conn_context=conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1

Performing fit_predict() on given dataframe:

>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107

Attributes

cluster_centers_

(DataFrame) Coordinates of cluster centers.

labels_

(DataFrame) Cluster assignment and distance to cluster center for each point.

Methods

fit(data, key[, features, categorical_variable])

Perform clustering on input dataset.

fit_predict(data, key[, features, …])

Perform clustering algorithm and return labels.

fit(data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters

data : DataFrame

DataFrame containing the input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters

data : DataFrame

DataFrame containing input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns

DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

class hana_ml.algorithms.pal.clustering.KMedoids(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids (actual data points) as cluster centers, which makes K-Medoids more robust to noise and outliers than K-Means.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

n_clusters : int

Number of groups.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.

  • ‘replace’: Random with replacement.

  • ‘no_replace’: Random without replacement.

  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to ‘patent’.

max_iter : int, optional

Max iterations.

Defaults to 100.

tol : float, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’} str, optional

Ways to compute the distance between the item and the cluster center.

Defaults to ‘euclidean’.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski.

Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.

  • ‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+…|xn|.

  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to ‘no’.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedoids instance:

>>> kmedoids = KMedoids(conn_context=conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714

Attributes

cluster_centers_

(DataFrame) Coordinates of cluster centers.

labels_

(DataFrame) Cluster assignment and distance to cluster center for each point.

Methods

fit(data, key[, features, categorical_variable])

Perform clustering on input dataset.

fit_predict(data, key[, features, …])

Perform clustering algorithm and return labels.

fit(data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters

data : DataFrame

DataFrame containing the input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters

data : DataFrame

DataFrame containing input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns

DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

hana_ml.algorithms.pal.crf

This module contains a Python wrapper for the PAL conditional random field (CRF) algorithm.

The following class is available:

class hana_ml.algorithms.pal.crf.CRF(conn_context, lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

epsilon : float, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lamb : float, optional

Regularization weight, should be greater than 0.

Defaults to 1.0.

max_iter : int, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_m : int, optional

Number of memories to be stored in L_BFGS optimization algorithm.

Defaults to 25.

use_class_feature : bool, optional

Whether to include a feature for the class/label. This is equivalent to having a bias vector in a model.

Defaults to True.

use_word : bool, optional

If True, includes a feature for the current word.

Defaults to True.

use_ngrams : bool, optional

Whether to make features from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngrams : bool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_length : int, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.

use_prev : bool, optional

Whether or not to include a feature for the previous word and the current word; together with other options, this enables other previous-word features.

Defaults to True.

use_next : bool, optional

Whether or not to include a feature for the next word and the current word.

Defaults to True.

disjunction_width : int, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctive : bool, optional

Whether or not to include features giving disjunctions of words anywhere within disjunction_width words to the left or right.

Defaults to True.

use_seqs : bool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqs : bool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqs : bool, optional

Whether or not to use basic zeroth-order word shape features.

Defaults to True.

use_type_seqs2 : bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqs : bool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shape : int, optional

Word shape, e.g. whether the word is capitalized or numeric. Currently only the chris2UseLC word shape is supported. Word shape is not used if this is 0.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by the fit (i.e. training) function. The range of this parameter is from 0 to 1, where 0 means only using a single thread, and 1 means using at most all the currently available threads. Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O

Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(conn_context=cc,
...           lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382

Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52

Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)

Check the prediction result:

>>> res.head(10).collect()

The returned DataFrame contains the document ID, word position, and predicted label for each word (see the Returns section of predict() below).

Attributes

model_

(DataFrame) CRF model content.

stats_

(DataFrame) Statistic info for CRF model fitting, structured as follows:

  • 1st column: name of the statistics, type NVARCHAR(100).

  • 2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_

(DataFrame) Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (to be supported in the future).

Methods

fit(data[, doc_id, word_pos, word, label])

Function for training the CRF model on English text.

predict(data[, doc_id, word_pos, word, …])

The function that predicts text labels based on the trained CRF model.

fit(data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters

data : DataFrame

Input data for training/fitting the CRF model. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_id : str, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_pos : str, optional

Name of the column for word position.

Defaults to the second column of the input data.

word : str, optional

Name of the column for word.

Defaults to the third column of the input data.

label : str, optional

Name of the label column.

Defaults to the final column of the input data.

predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

The function that predicts text labels based on the trained CRF model.

Parameters

data : DataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_id : str, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_pos : str, optional

Name of the column for word position.

Defaults to the second column of the input data.

word : str, optional

Name of the column for word.

Defaults to the third column of the input data.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by the predict function. The range of this parameter is from 0 to 1, where 0 means only using a single thread, and 1 means using at most all the currently available threads. Values outside this range are ignored, and the predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns

DataFrame

Prediction result for the input data, structured as follows:

  • 1st column: document ID,

  • 2nd column: word position,

  • 3rd column: label.

hana_ml.algorithms.pal.decomposition

This module contains Python wrappers for PAL decomposition algorithms.

The following classes are available:

class hana_ml.algorithms.pal.decomposition.PCA(conn_context, scaling=None, thread_ratio=None, scores=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Principal component analysis (PCA) reduces the dimensionality of multivariate data using Singular Value Decomposition.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scaling : bool, optional

If True, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scores : bool, optional

If True, output the scores on each principal component when fitting.

Defaults to False.

Examples

Input DataFrame df1 for training:

>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

Creating a PCA instance:

>>> pca = PCA(conn_context=conn, scaling=True, thread_ratio=0.5, scores=True)

Performing fit on given dataframe:

>>> pca.fit(data=df1, key='ID')

Output:

>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

Input dataframe df2 for transforming:

>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0

Performing transform() on given dataframe:

>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035

Attributes

loadings_

(DataFrame) The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_

(DataFrame) Loadings statistics on each component.

scores_

(DataFrame) The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_

(DataFrame) Mean and scale values of each variable.

Note: Variables cannot be scaled if there exists one variable which has a constant value across data items.

Methods

fit(data, key[, features, label])

Principal component analysis function.

fit_transform(data, key[, features, label])

Fit with the dataset and return the scores.

transform(data, key[, features, …])

Principal component analysis projection function using a trained model.

fit(data, key, features=None, label=None)

Principal component analysis function.

Parameters

data : DataFrame

Data to be fitted.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

label : str, optional

Label of data.

fit_transform(data, key, features=None, label=None)

Fit with the dataset and return the scores.

Parameters

data : DataFrame

Data to be analyzed.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

label : str, optional

Label of data.

Returns

DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.
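
A minimal usage sketch of fit_transform() (not part of the original examples), reusing df1 from the Examples section above; the returned DataFrame contains the component scores described in Returns:

>>> scores = pca.fit_transform(data=df1, key='ID')
>>> scores.collect()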

transform(data, key, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters

data : DataFrame

Data to be analyzed.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

n_components : int, optional

Number of components to be retained. The value range is from 1 to number of features.

Defaults to number of features.

label : str, optional

Label of data.

Returns

DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(conn_context, n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters

conn_context : ConnectionContext

The connection to the SAP HANA system.

n_components : int

Expected number of topics in the corpus.

doc_topic_prior : float, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_prior : float, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration : int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin : int, optional

Number of omitted in-between Gibbs iterations. Value must be greater than 0.

Defaults to 1.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

max_top_words : int, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_words : float, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold. It cannot be used together with parameter max_top_words.

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.

  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’.

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [‘ ‘].

output_word_assignment : bool, optional

Controls whether to output the word_topic_assignment_ or not. If True, output the word_topic_assignment_.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating a LDA instance:

>>> lda = LatentDirichletAllocation(cc, n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1,
...                                 max_top_words=5, doc_topic_prior=0.1,
...                                 output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
...                     iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191

Attributes

doc_topic_dist_

(DataFrame) Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column from fit().

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_

(DataFrame) Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column from fit().

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_

(DataFrame) Topic top words table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_

(DataFrame) Topic-word distribution table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORD_ID, type INTEGER, word ID.

  • PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_

(DataFrame) Dictionary table, structured as follows:

  • WORD_ID, type INTEGER, word ID.

  • WORD, type NVARCHAR(5000), word text.

statistic_

(DataFrame) Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Note:

  • Parameters max_top_words and threshold_top_words cannot be used together.

  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() take precedence over the corresponding ones in __init__().

Methods

fit(data, key[, document])

Fit LDA model based on training data.

fit_transform(data, key[, document])

Fit LDA model based on training data and return the topic assignment for the training documents.

transform(data, key[, document, burn_in, …])

Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(data, key, document=None)

Fit LDA model based on training data.

Parameters

data : DataFrame

Training data.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

fit_transform(data, key, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters

data : DataFrame

Training data.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns

DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.
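
A minimal usage sketch of fit_transform() (not part of the original examples), reusing df1 from the Examples section above; the returned table has the same structure as the doc_topic_dist_ attribute:

>>> doc_topic = lda.fit_transform(data=df1, key='DOCUMENT_ID', document='TEXT')
>>> doc_topic.collect()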

transform(data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters

data : DataFrame

Independent variable values used for transform.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iteration : int, optional

Number of Gibbs iterations.

Defaults to 2000.

thin : int, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.

  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’.

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [‘ ‘].

output_word_assignment : bool, optional

Controls whether or not to output the word-topic assignment table. If True, the word-topic assignment table is output (see Returns below).

Defaults to False.

Returns

DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data’s document ID column.

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

hana_ml.algorithms.pal.discriminant_analysis

This module contains the PAL wrapper for the discriminant analysis algorithm. The following class is available:

class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(conn_context, regularization_type=None, regularization_amount=None, projection=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear discriminant analysis for classification and data reduction.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

regularization_type : {‘mixing’, ‘diag’, ‘pseudo’}, optional

The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.

Defaults to ‘mixing’.

regularization_amount : float, optional

The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix. The valid range for this parameter is [0, 1]. Valid only when regularization_type is ‘mixing’.

Defaults to the smallest number in [0, 1] that makes the regularized empirical covariance matrix invertible.

projection : bool, optional

Whether or not to compute the projection model.

Defaults to True.

Examples

The training data for linear discriminant analysis:

>>> df.collect()
     X1   X2   X3   X4            CLASS
0   5.1  3.5  1.4  0.2      Iris-setosa
1   4.9  3.0  1.4  0.2      Iris-setosa
2   4.7  3.2  1.3  0.2      Iris-setosa
3   4.6  3.1  1.5  0.2      Iris-setosa
4   5.0  3.6  1.4  0.2      Iris-setosa
5   5.4  3.9  1.7  0.4      Iris-setosa
6   4.6  3.4  1.4  0.3      Iris-setosa
7   5.0  3.4  1.5  0.2      Iris-setosa
8   4.4  2.9  1.4  0.2      Iris-setosa
9   4.9  3.1  1.5  0.1      Iris-setosa
10  7.0  3.2  4.7  1.4  Iris-versicolor
11  6.4  3.2  4.5  1.5  Iris-versicolor
12  6.9  3.1  4.9  1.5  Iris-versicolor
13  5.5  2.3  4.0  1.3  Iris-versicolor
14  6.5  2.8  4.6  1.5  Iris-versicolor
15  5.7  2.8  4.5  1.3  Iris-versicolor
16  6.3  3.3  4.7  1.6  Iris-versicolor
17  4.9  2.4  3.3  1.0  Iris-versicolor
18  6.6  2.9  4.6  1.3  Iris-versicolor
19  5.2  2.7  3.9  1.4  Iris-versicolor
20  6.3  3.3  6.0  2.5   Iris-virginica
21  5.8  2.7  5.1  1.9   Iris-virginica
22  7.1  3.0  5.9  2.1   Iris-virginica
23  6.3  2.9  5.6  1.8   Iris-virginica
24  6.5  3.0  5.8  2.2   Iris-virginica
25  7.6  3.0  6.6  2.1   Iris-virginica
26  4.9  2.5  4.5  1.7   Iris-virginica
27  7.3  2.9  6.3  1.8   Iris-virginica
28  6.7  2.5  5.8  1.8   Iris-virginica
29  7.2  3.6  6.1  2.5   Iris-virginica

Set up an instance of LinearDiscriminantAnalysis model and train it:

>>> lda = LinearDiscriminantAnalysis(conn_context=cc, regularization_type='mixing', projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')

Check the coefficients of the obtained linear discriminators and the projection model:

>>> lda.coef_.collect()
             CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
         NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333

Data to predict the class labels:

>>> df_pred.collect()
     ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Perform predict() and check the result:

>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
    ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065

Data to project:

>>> df_proj.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Perform project() and check the result:

>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1','X2','X3','X4'],
...                        proj_dim=2)
>>> res_proj.collect()
    ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None

Attributes

basic_info_

(DataFrame) Basic information of the training data for linear discriminant analysis.

priors_

(DataFrame) The empirical priors for each class in the training data.

coef_

(DataFrame) Coefficients (inclusive of intercepts) of each class’ linear score function for the training data.

proj_info

(DataFrame) Projection-related information, such as the standard deviations of the discriminants and the proportion of total variance explained by each discriminant.

proj_model

(DataFrame) The projection matrix and overall means for features.

Methods

fit(data[, key, features, label])

Calculate linear discriminators from training data.

predict(data, key[, features, verbose])

Predict class labels using fitted linear discriminators.

project(data, key[, features, proj_dim])

Project data into lower dimensional spaces using fitted LDA projection model.

fit(data, key=None, features=None, label=None)

Calculate linear discriminators from training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If not provided, it is assumed that the input data has no ID column.

features : list of str, optional

Names of the feature columns. If not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the class label column. If not provided, it defaults to the last column.

predict(data, key, features=None, verbose=None)

Predict class labels using fitted linear discriminators.

Parameters

data : DataFrame

Data for predicting the class labels.

key : str

Name of the ID column.

features : list of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

verbose : bool, optional

Whether or not to output the scores of all classes. If False, only the score of the predicted class will be output. Defaults to False.

Returns

DataFrame

Predicted class labels and the corresponding scores, structured as follows:

  • ID: with the same name and data type as data’s ID column.

  • CLASS: with the same name and data type as training data’s label column

  • SCORE: type DOUBLE, score of the predicted class.

project(data, key, features=None, proj_dim=None)

Project data into lower dimensional spaces using fitted LDA projection model.

Parameters

data : DataFrame

Data for linear discriminant projection.

key : str

Name of the ID column.

features : list of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

proj_dim : int, optional

Dimension of the projected space, equivalent to the number of discriminants used for projection. Defaults to the number of obtained discriminants.

Returns

DataFrame

Projected data, structured as follows:
  • 1st column: ID, with the same name and data type as data for projection.

  • Other columns, named DISCRIMINANT_i where i runs from 1 to the number of feature columns, data type DOUBLE.

hana_ml.algorithms.pal.linear_model

This module contains Python wrappers for PAL linear model algorithms.

The following classes are available:

class hana_ml.algorithms.pal.linear_model.LinearRegression(conn_context, solver=None, var_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear regression is an approach to modeling the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables and collectively denoted as the predictor vector.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

solver : {‘QR’, ‘SVD’, ‘CD’, ‘Cholesky’, ‘ADMM’}, optional

Algorithms to use to solve the least square problem. Case-insensitive.

  • ‘QR’: QR decomposition.

  • ‘SVD’: singular value decomposition.

  • ‘CD’: cyclical coordinate descent method.

  • ‘Cholesky’: Cholesky decomposition.

  • ‘ADMM’: alternating direction method of multipliers.

‘CD’ and ‘ADMM’ are supported only when var_select is ‘all’.

Defaults to QR decomposition.

var_select : {‘all’, ‘forward’, ‘backward’}, optional

Method to perform variable selection.

  • ‘all’: all variables are included.

  • ‘forward’: forward selection.

  • ‘backward’: backward selection.

‘forward’ and ‘backward’ selection are supported only when solver is ‘QR’, ‘SVD’ or ‘Cholesky’.

Defaults to ‘all’.

intercept : bool, optional

If true, include the intercept in the model.

Defaults to True.

alpha_to_enter : float, optional

P-value for forward selection. Valid only when var_select is ‘forward’.

Defaults to 0.05.

alpha_to_remove : float, optional

P-value for backward selection. Valid only when var_select is ‘backward’.

Defaults to 0.1.

enet_lambda : float, optional

Penalized weight. Value should be greater than or equal to 0. Valid only when solver is ‘CD’ or ‘ADMM’.

enet_alpha : float, optional

Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is ‘CD’ or ‘ADMM’.

Defaults to 1.0.

max_iter : int, optional

Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is ‘CD’ or ‘ADMM’.

Defaults to 1e5.

tol : float, optional

Convergence threshold for coordinate descent. Valid only when solver is ‘CD’.

Defaults to 1.0e-7.

pho : float, optional

Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is ‘ADMM’.

Defaults to 1.8.

stat_inf : bool, optional

If true, output t-value and Pr(>|t|) of coefficients.

Defaults to False.

adjusted_r2 : bool, optional

If true, include the adjusted R2 value in statistics.

Defaults to False.

dw_test : bool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

reset_test : int, optional

Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to 1.

bp_test : bool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

ks_test : bool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is ‘QR’, ‘CD’, ‘Cholesky’ or ‘ADMM’.

Defaults to 0.0.

categorical_variable : str or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

pmml_export : {‘no’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

Examples

Training data:

>>> df.collect()
  ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Training the model:

>>> lr = LinearRegression(conn_context=cc,
...                       thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')

Prediction:

>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
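
The elastic net related parameters can be combined with the ‘CD’ (or ‘ADMM’) solver. The following is an illustrative sketch only (the parameter values are assumptions and no output is shown), reusing the training DataFrame df from above:

>>> enet_lr = LinearRegression(conn_context=cc,
...                            solver='CD',
...                            enet_lambda=0.1,
...                            enet_alpha=0.5,
...                            max_iter=100000,
...                            thread_ratio=0.5,
...                            categorical_variable=["X3"])
>>> enet_lr.fit(data=df, key='ID', label='Y')
>>> enet_lr.coefficients_.collect()  # inspect the penalized coefficients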

Attributes

coefficients_

(DataFrame) Fitted regression coefficients.

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_

(DataFrame) Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label, …])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column: with the same name and type as data’s ID column.

  • VALUE: type DOUBLE, representing predicted values.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

Returns

float

Returns the coefficient of determination R2 of the prediction.

Note

score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
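
As a usage sketch for score() (illustrative only), assume a hypothetical labeled DataFrame df_test with the same ID, feature and label columns as the training data:

>>> lr.score(data=df_test, key='ID', label='Y')  # returns the R2 value as a float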

class hana_ml.algorithms.pal.linear_model.LogisticRegression(conn_context, multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, alpha=None, lamb=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lamb_values=None, lamb_range=None, alpha_values=None, alpha_range=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

multi_class : bool, optional

If true, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iter : int, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

  • multi-class: Defaults to 100.

  • binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • multi-class:

    • ‘no’ or not provided: No PMML model.

    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

  • binary-class:

    • ‘no’ or not provided: No PMML model.

    • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Defaults to ‘no’.

categorical_variable : str or list of str, optional(deprecated)

Specifies INTEGER column(s) in the data that should be treated as categorical.

standardize : bool, optional

If true, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_inf : bool, optional

If true, proceed with statistical inference.

Defaults to False.

solver : {‘auto’, ‘newton’, ‘cyclical’, ‘lbfgs’, ‘stochastic’, ‘proximal’}, optional

Optimization algorithm.

  • ‘auto’ : automatically determined by system based on input data and parameters.

  • ‘newton’: Newton iteration method.

  • ‘cyclical’: Cyclical coordinate descent method to fit elastic net regularized logistic regression.

  • ‘lbfgs’: LBFGS method (recommended when having many independent variables).

  • ‘stochastic’: Stochastic gradient descent method (recommended when dealing with very large dataset).

  • ‘proximal’: Proximal gradient descent method to fit elastic net regularized logistic regression.

Only valid when multi_class is False.

Defaults to ‘auto’.

alpha : float, optional

Elastic net mixing parameter. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal.

Defaults to 1.0.

lamb : float, optional

Penalized weight. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal.

Defaults to 0.0.

tol : float, optional

Convergence threshold for exiting iterations. Only valid when multi_class is False.

Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.

epsilon : float, optional

Determines the accuracy with which the solution is to be found.

Only valid when multi_class is False and the solver is newton or lbfgs.

Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.

thread_ratio : float, optional

Controls the proportion of available threads to use for fit() method. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 1.0.

max_pass_number : int, optional

The maximum number of passes over the data. Only valid when multi_class is False and solver is ‘stochastic’.

Defaults to 1.

sgd_batch_number : int, optional

The batch number of Stochastic gradient descent. Only valid when multi_class is False and solver is ‘stochastic’.

Defaults to 1.

precompute : bool, optional

Whether to pre-compute the Gram matrix. Only valid when solver is ‘cyclical’.

Defaults to True.

handle_missing : bool, optional

Whether to handle missing values.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical. By default, string is categorical, while int and double are numerical.

lbfgs_m : int, optional

Number of previous updates to keep. Only applicable when multi_class is False and solver is ‘lbfgs’.

Defaults to 6.

resampling_method : {‘cv’, ‘stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional

The resampling method for model evaluation and parameter selection. If no value is specified, neither model evaluation nor parameter selection is activated.

metric : {‘accuracy’, ‘f1_score’, ‘auc’, ‘nll’}, optional

The evaluation metric used for model evaluation/parameter selection.

fold_num : int, optional

The number of folds for cross-validation. Mandatory and valid only when resampling_method is ‘cv’ or ‘stratified_cv’.

repeat_times : int, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy : {‘grid’, ‘random’}, optional

The search method for parameter selection.

random_search_times : int, optional

The number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is ‘random’.

random_state : int, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_id : str, optional

The ID of the progress indicator for model evaluation/parameter selection. The progress indicator is deactivated if no value is provided.

lamb_values : list of float, optional

The values of lamb for parameter selection.

Only valid when search_strategy is specified.

lamb_range : list of float, optional

The range of lamb for parameter selection, including a lower limit and an upper limit.

Only valid when search_strategy is specified.

alpha_values : list of float, optional

The values of alpha for parameter selection.

Only valid when search_strategy is specified.

alpha_range : list of float, optional

The range of alpha for parameter selection, including a lower limit and an upper limit.

Only valid when search_strategy is specified.

class_map0 : str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR.

Only valid when multi_class is False during binary class fit and score.

class_map1 : str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

Examples

Training data:

>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1

Create LogisticRegression instance and call fit:

>>> lr = linear_model.LogisticRegression(conn_context=cc, solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
0                                  __PAL_INTERCEPT__    17.044785
1                                 V1__PAL_DELIMIT__A     0.000000
2                                 V1__PAL_DELIMIT__B    -1.464903
3                                                 V2    -4.819740
4                                 V3__PAL_DELIMIT__0     0.000000
5                                 V3__PAL_DELIMIT__1    -2.794139
6                                 V3__PAL_DELIMIT__2    -4.807858
7                                 V3__PAL_DELIMIT__3    -2.780918
8  {"CONTENT":"{\"impute_model\":{\"column_statis...          NaN
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1

Call predict():

>>> result = lr.predict(data=pred_df,
...                      key='ID',
...                      categorical_variable=['V3'],
...                      thread_ratio=0.1)
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503618e-01
1    1     1  8.485210e-01
2    2     1  9.555861e-01
3    3     0  3.701858e-02
4    4     0  2.229129e-02
5    5     0  2.503962e-01
6    6     0  4.945832e-02
7    7     0  9.922085e-03
8    8     0  2.852859e-01
9    9     0  2.689207e-01
10  10     0  2.200498e-02
11  11     0  4.713726e-03
12  12     0  2.349803e-02
13  13     0  5.830425e-04
14  14     0  4.886177e-07
15  15     0  6.938072e-06
16  16     0  1.637820e-04
17  17     1  8.986435e-01

Input data for score():

>>> df_score.collect()
    ID V1     V2  V3  CATEGORY
0    0  B  2.620   0         1
1    1  B  2.875   0         1
2    2  A  2.320   1         1
3    3  A  3.215   2         0
4    4  B  3.440   3         0
5    5  B  3.460   0         0
6    6  A  3.570   1         1
7    7  B  3.190   2         0
8    8  A  3.150   3         0
9    9  B  3.440   0         0
10  10  B  3.440   1         0
11  11  A  4.070   3         0
12  12  A  3.730   1         0
13  13  B  3.780   2         0
14  14  B  5.250   2         0
15  15  A  5.424   3         0
16  16  A  5.345   0         0
17  17  B  2.200   1         1

Call score():

>>> lr.score(data=df_score,
...           key='ID',
...           categorical_variable=['V3'],
...           thread_ratio=0.1)
0.944444
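
For reference, a minimal sketch of resampling-based parameter selection using the parameters documented above; the candidate values are illustrative assumptions, not recommendations, and no output is shown:

>>> lr_cv = linear_model.LogisticRegression(conn_context=cc,
...                                         solver='cyclical',
...                                         resampling_method='cv',
...                                         metric='auc',
...                                         fold_num=5,
...                                         search_strategy='grid',
...                                         lamb_values=[0.01, 0.1, 1.0],
...                                         alpha_values=[0.0, 0.5, 1.0])
>>> lr_cv.fit(data=df, features=['V1', 'V2', 'V3'],
...           label='CATEGORY', categorical_variable=['V3'])
>>> lr_cv.optim_param_.collect()  # lamb/alpha combination selected by cross-validation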

Attributes

coef_

(DataFrame) Values of the coefficients.

result_

(DataFrame) Model content.

optim_param_

(DataFrame) The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_

(DataFrame) Statistics info for the trained model, structured as follows: - 1st column: ‘STAT_NAME’, NVARCHAR(256) - 2nd column: ‘STAT_VALUE’, NVARCHAR(1000)

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

Methods

fit(data[, key, features, label, …])

Fit the LR model when given training dataset.

predict(data, key[, features, …])

Predict with the dataset using the trained model.

score(data, key[, features, label, …])

Return the mean accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Otherwise, all INTEGER columns are treated as numerical.

class_map0 : str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

class_map1 : str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

predict(data, key, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False)

Predict with the dataset using the trained model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If true, output scoring probabilities for each class. It is only applicable for multi-class case.

Defaults to False.

categorical_variable : str or list of str, optional (deprecated)

Specifies INTEGER column(s) that should be treated as categorical; otherwise all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such columns.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0 : str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

class_map1 : str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

Returns

DataFrame

Predicted result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLASS, predicted class name.

  • PROBABILITY, type DOUBLE

    • multi-class: probability of being predicted as the predicted class.

    • binary-class: probability of being predicted as the positive class.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.

score(data, key, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional (deprecated)

Specifies INTEGER columns that should be treated as categorical; otherwise all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such columns.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0 : str, optional (deprecated)

Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

class_map1 : str, optional (deprecated)

Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary-class fit and score. Only valid when multi_class is False.

Returns

float

Scalar accuracy value after comparing the predicted label and original label.

hana_ml.algorithms.pal.linkpred

This module contains the Python wrapper for the PAL link prediction function.

The following class is available:

class hana_ml.algorithms.pal.linkpred.LinkPrediction(conn_context, method, beta=None, min_score=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked, which is helpful for predicting missing links (the higher the proximity score, the more likely the two nodes are to be linked).

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

method : {‘common_neighbors’, ‘jaccard’, ‘adamic_adar’, ‘katz’}

Method for computing the proximity between 2 nodes that are not directly linked.

beta : float, optional

A parameter included in the calculation of the Katz similarity (proximity) score. Valid only when method is ‘katz’.

Defaults to 0.005.

min_score : float, optional

The links whose scores are lower than min_score will be filtered out from the result table.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
   NODE1  NODE2
0      1      2
1      1      4
2      2      3
3      3      4
4      5      1
5      6      2
6      7      4
7      7      5
8      6      7
9      5      4

Create linkpred instance:

>>> lp = LinkPrediction(conn_context=conn,
...                     method='common_neighbors',
...                     beta=0.005,
...                     min_score=0,
...                     thread_ratio=0.2)

Calculate the proximity score of all nodes in the network with missing links, and check the result:

>>> res = lp.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res.collect()
    NODE1  NODE2     SCORE
0       1      3  0.285714
1       1      6  0.142857
2       1      7  0.285714
3       2      4  0.285714
4       2      5  0.142857
5       2      7  0.142857
6       4      6  0.142857
7       3      5  0.142857
8       3      6  0.142857
9       3      7  0.142857
10      5      6  0.142857
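
The other proximity methods differ only in the method argument. For example, a sketch of Katz scoring on the same network (beta is used only by ‘katz’; output not shown):

>>> lp_katz = LinkPrediction(conn_context=conn,
...                          method='katz',
...                          beta=0.005,
...                          min_score=0,
...                          thread_ratio=0.2)
>>> res_katz = lp_katz.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res_katz.collect()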

Methods

proximity_score(data[, node1, node2])

Predict proximity scores between nodes under the current choice of method.

proximity_score(data, node1=None, node2=None)

Predict proximity scores between nodes under the current choice of method.

Parameters

data : DataFrame

Network data with nodes and links. Nodes are stored in columns, while links are stored in rows; each link is represented by a pair of adjacent nodes (node1, node2).

node1 : str, optional

Column name of data that gives node1 of all available links (see data).

Defaults to the name of the first column of data if not provided.

node2 : str, optional

Column name of data that gives node2 of all available links (see data).

Defaults to the name of the last column of data if not provided.

Returns

DataFrame:

The proximity scores of pairs of nodes with missing links between them that are above ‘min_score’, structured as follows:

  • 1st column: node1 of a link

  • 2nd column: node2 of a link

  • 3rd column: proximity score of the two nodes

hana_ml.algorithms.pal.metrics

This module contains Python wrappers for PAL metrics to assess the quality of model outputs.

The following functions are available:

hana_ml.algorithms.pal.metrics.confusion_matrix(conn_context, data, key, label_true=None, label_pred=None, beta=None, native=True)

Computes confusion matrix to evaluate the accuracy of a classification.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

label_true : str, optional

Name of the original label column.

If not given, defaults to the second column.

label_pred : str, optional

Name of the predicted label column. If not given, defaults to the third column.

beta : float, optional

Parameter used to compute the F-Beta score.

Defaults to 1.

native : bool, optional

Indicates whether to use native sql statements for confusion matrix calculation.

Defaults to True.

Returns

DataFrame

Confusion matrix, structured as follows:
  • Original label, with same name and data type as it is in data.

  • Predicted label, with same name and data type as it is in data.

  • Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.

The DataFrame is sorted by (original label, predicted label) in descending order.

Classification report table, structured as follows:
  • Class, type NVARCHAR(100), class name

  • Recall, type DOUBLE, the recall of each class

  • Precision, type DOUBLE, the precision of each class

  • F_MEASURE, type DOUBLE, the F_measure of each class

  • SUPPORT, type INTEGER, the support - sample number in each class

Examples

Data contains the original label and predict label df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         1        1
1   2         1        1
2   3         1        1
3   4         1        2
4   5         1        1
5   6         2        2
6   7         2        1
7   8         2        2
8   9         2        2
9  10         2        2

Calculate the confusion matrix:

>>> cm, cr = confusion_matrix(conn_context=conn, data=df, key='ID',
...                           label_true='ORIGINAL', label_pred='PREDICT')

Output:

>>> cm.collect()
   ORIGINAL  PREDICT  COUNT
0         1        1      4
1         1        2      1
2         2        1      1
3         2        2      4
>>> cr.collect()
  CLASS  RECALL  PRECISION  F_MEASURE  SUPPORT
0     1     0.8        0.8        0.8        5
1     2     0.8        0.8        0.8        5
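
The beta parameter only affects the F_MEASURE column of the classification report. A sketch with beta=2 (weighting recall more heavily than precision) on the same data; output not shown:

>>> cm2, cr2 = confusion_matrix(conn_context=conn, data=df, key='ID',
...                             label_true='ORIGINAL', label_pred='PREDICT',
...                             beta=2)
>>> cr2.collect()  # F_MEASURE now holds the F2 score of each class
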
hana_ml.algorithms.pal.metrics.auc(conn_context, data, positive_label=None)

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

data : DataFrame

Input data, structured as follows:

  • ID column.

  • True class of the data point.

  • Classifier-computed probability that the data point belongs to the positive class.

positive_label : str, optional

If original label is not 0 or 1, specifies the label value which will be mapped to 1.

Returns

float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrame df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         0     0.07
1   2         0     0.01
2   3         0     0.85
3   4         0     0.30
4   5         0     0.50
5   6         1     0.50
6   7         1     0.20
7   8         1     0.80
8   9         1     0.20
9  10         1     0.95

Compute Area Under Curve:

>>> auc, roc = auc(conn_context=conn, data=df)

Output:

>>> print(auc)
 0.66
>>> roc.collect()
   ID  FPR  TPR
0   0  1.0  1.0
1   1  0.8  1.0
2   2  0.6  1.0
3   3  0.6  0.6
4   4  0.4  0.6
5   5  0.2  0.4
6   6  0.2  0.2
7   7  0.0  0.2
8   8  0.0  0.0
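
When the true labels are not coded as 0/1, positive_label maps the chosen label value to 1. A minimal sketch, assuming a hypothetical DataFrame df_str whose second column holds 'no'/'yes' labels:

>>> auc_val, roc_df = auc(conn_context=conn, data=df_str, positive_label='yes')
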
hana_ml.algorithms.pal.metrics.multiclass_auc(conn_context, data_original, data_predict)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

data_original : DataFrame

True class data, structured as follows:

  • Data point ID column.

  • True class of the data point.

data_predict : DataFrame

Predicted class data, structured as follows:

  • Data point ID column.

  • Possible class.

  • Classifier-computed probability that the data point belongs to that particular class.

For each data point ID, there should be one row for each possible class.

Returns

float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrame df:

>>> df_original.collect()
   ID  ORIGINAL
0   1         1
1   2         1
2   3         1
3   4         2
4   5         2
5   6         2
6   7         3
7   8         3
8   9         3
9  10         3
>>> df_predict.collect()
    ID  PREDICT  PROB
0    1        1  0.90
1    1        2  0.05
2    1        3  0.05
3    2        1  0.80
4    2        2  0.05
5    2        3  0.15
6    3        1  0.80
7    3        2  0.10
8    3        3  0.10
9    4        1  0.10
10   4        2  0.80
11   4        3  0.10
12   5        1  0.20
13   5        2  0.70
14   5        3  0.10
15   6        1  0.05
16   6        2  0.90
17   6        3  0.05
18   7        1  0.10
19   7        2  0.10
20   7        3  0.80
21   8        1  0.00
22   8        2  0.00
23   8        3  1.00
24   9        1  0.20
25   9        2  0.10
26   9        3  0.70
27  10        1  0.20
28  10        2  0.20
29  10        3  0.60

Compute Area Under Curve:

>>> auc, roc = multiclass_auc(conn_context=conn, data_original=df_original, data_predict=df_predict)

Output:

>>> print(auc)
1.0
>>> roc.collect()
    ID   FPR  TPR
0    0  1.00  1.0
1    1  0.90  1.0
2    2  0.65  1.0
3    3  0.25  1.0
4    4  0.20  1.0
5    5  0.00  1.0
6    6  0.00  0.9
7    7  0.00  0.7
8    8  0.00  0.3
9    9  0.00  0.1
10  10  0.00  0.0

hana_ml.algorithms.pal.metrics.accuracy_score(conn_context, data, label_true, label_pred)

Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

data : DataFrame

DataFrame of true and predicted labels.

label_true : str

Name of the column containing ground truth labels.

label_pred : str

Name of the column containing predicted labels, as returned by a classifier.

Returns

float

Accuracy classification score. A lower accuracy indicates that the classifier predicted fewer of the input labels correctly.

Examples

Actual and predicted labels df for a hypothetical classification:

>>> df.collect()
   ACTUAL  PREDICTED
0    1        0
1    0        0
2    0        0
3    1        1
4    1        1

Accuracy score for these predictions:

>>> accuracy_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.8

Compare that to null accuracy df_dummy (accuracy that could be achieved by always predicting the most frequent class):

>>> df_dummy.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       1
2    0       1
3    1       1
4    1       1
>>> accuracy_score(conn_context=conn, data=df_dummy, label_true='ACTUAL', label_pred='PREDICTED')
0.6

A perfect predictor df_perfect:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       0
2    0       0
3    1       1
4    1       1
>>> accuracy_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0

hana_ml.algorithms.pal.metrics.r2_score(conn_context, data, label_true, label_pred)

Computes coefficient of determination for regression results.

Parameters

conn_context : ConnectionContext

The connection to SAP HANA system.

data : DataFrame

DataFrame of true and predicted values.

label_true : str

Name of the column containing true values.

label_pred : str

Name of the column containing values predicted by regression.

Returns

float

Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.

Examples

Actual and predicted values df for a hypothetical regression:

>>> df.collect()
   ACTUAL  PREDICTED
0    0.10        0.2
1    0.90        1.0
2    2.10        1.9
3    3.05        3.0
4    4.00        3.5

R2 score for these predictions:

>>> r2_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.9685233682514102

Compare that to the score for a perfect predictor:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    0.10       0.10
1    0.90       0.90
2    2.10       2.10
3    3.05       3.05
4    4.00       4.00
>>> r2_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0

A naive mean predictor:

>>> df_mean.collect()
   ACTUAL  PREDICTED
0    0.10       2.03
1    0.90       2.03
2    2.10       2.03
3    3.05       2.03
4    4.00       2.03
>>> r2_score(conn_context=conn, data=df_mean, label_true='ACTUAL', label_pred='PREDICTED')
0.0

And a really awful predictor df_awful:

>>> df_awful.collect()
   ACTUAL  PREDICTED
0    0.10    12345.0
1    0.90    91923.0
2    2.10    -4444.0
3    3.05    -8888.0
4    4.00    -9999.0
>>> r2_score(conn_context=conn, data=df_awful, label_true='ACTUAL', label_pred='PREDICTED')
-886477397.139857

hana_ml.algorithms.pal.mixture

This module contains the Python wrapper for the Gaussian mixture model algorithm.

The following class is available:

class hana_ml.algorithms.pal.mixture.GaussianMixture(conn_context, init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Representation of a Gaussian mixture model probability distribution.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

init_param : {‘farthest_first_traversal’,’manual’,’random_means’,’kmeans++’}

Specifies the initialization mode.

  • farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.

  • manual: The initial centers are the init_centers given by user.

  • random_means: The initial centers are the means of all the data that are randomly weighted.

  • kmeans++: The initial centers are given using the k-means++ approach.

n_components : int

Specifies the number of Gaussian distributions. Mandatory when init_param is not ‘manual’.

init_centers : list of int

Specifies the data points to be used as initial centers, given by their sequence numbers in the data table (starting from 0). Mandatory when init_param is ‘manual’.

covariance_type : {‘full’, ‘diag’, ‘tied_diag’}, optional

Specifies the type of covariance matrices in the model.

  • full: use full covariance matrices.

  • diag: use diagonal covariance matrices.

  • tied_diag: use diagonal covariance matrices with all equal diagonal entries.

Defaults to ‘full’.

shared_covariance : bool, optional

All clusters share the same covariance matrix if True.

Defaults to False.

thread_ratio : float, optional

Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

max_iter : int, optional

Specifies the maximum number of iterations for the EM algorithm.

Defaults to 100.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

category_weight : float, optional

Represents the weight of category attributes.

Defaults to 0.707.

error_tol : float, optional

Specifies the error tolerance, which is the stop condition.

Defaults to 1e-5.

regularization : float, optional

Regularization added to the diagonal of the covariance matrices to ensure that they remain positive definite.

Defaults to 1e-6.

random_seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

Examples

Input dataframe df1 for training:

>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Creating the GMM instance:

>>> gmm = GaussianMixture(conn_context=conn,
...                       init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)

Performing fit() on the given dataframe:

>>> gmm.fit(data=df1, key='ID')

Expected output:

>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID     PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9    10          0          1.0
10   11          0          1.0
11   12          0          1.0
12   13          0          1.0
13   14          0          0.0
>>> gmm.stats_.collect()
       STAT_NAME       STAT_VALUE
1     log-likelihood     11.7199
2         aic          -504.5536
3         bic          -480.3900
>>> gmm.model_.collect()
       ROW_INDEX    CLUSTER_ID         MODEL_CONTENT
1        0            -1           {"Algorithm":"GMM","Metadata":{"DataP...
2        1             0           {"GuassModel":{"covariance":[22.18895...
3        2             1           {"GuassModel":{"covariance":[22.19450...
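
Manual initialization works analogously: init_centers lists the sequence numbers of the data points to use as initial centers and is mandatory when init_param is 'manual' (n_components is then not required). An illustrative sketch on the same DataFrame df1, output not shown:

>>> gmm_manual = GaussianMixture(conn_context=conn,
...                              init_param='manual',
...                              init_centers=[0, 12],
...                              covariance_type='diag',
...                              max_iter=500,
...                              categorical_variable=['X3'])
>>> gmm_manual.fit(data=df1, key='ID')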

Attributes

model_

(DataFrame) Trained model content.

labels_

(DataFrame) Cluster membership probabilities for each data point.

stats_

(DataFrame) Statistics.

Methods

fit(data, key[, features, categorical_variable])

Perform GMM clustering on input dataset.

fit_predict(data, key[, features, …])

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

fit(data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset.

Parameters

data : DataFrame

Data to be clustered.

key : str

Name of the ID column.

features : list of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

fit_predict(data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

Parameters

data : DataFrame

Data to be clustered.

key : str

Name of the ID column.

features : list of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Returns

DataFrame

Cluster membership probabilities.
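
As a usage sketch, fit_predict() combines fitting and label retrieval in one call, reusing the DataFrame df1 from the examples above (output not shown):

>>> labels = gmm.fit_predict(data=df1, key='ID')
>>> labels.collect()  # cluster membership probabilities, same layout as gmm.labels_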

hana_ml.algorithms.pal.naive_bayes

This module contains the Python wrapper for the PAL Naive Bayes algorithm.

The following class is available:

class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(conn_context, alpha=None, discretization=None, model_format=None, categorical_variable=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A classification model based on Bayes’ theorem.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to 0.

discretization : {‘no’, ‘supervised’}, optional

Discretize continuous attributes. Case-insensitive.

  • ‘no’ or not provided: disable discretization.

  • ‘supervised’: use supervised discretization on all the continuous attributes.

Defaults to ‘no’.

model_format : {‘json’, ‘pmml’}, optional

Controls whether to output the model in JSON format or PMML format. Case-insensitive.

  • ‘json’ or not provided: JSON format.

  • ‘pmml’: PMML format.

Defaults to ‘json’.

categorical_variable : str or list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

Training data:

>>> df1.collect()
  HomeOwner MaritalStatus  AnnualIncome DefaultedBorrower
0       YES        Single         125.0                NO
1        NO       Married         100.0                NO
2        NO        Single          70.0                NO
3       YES       Married         120.0                NO
4        NO      Divorced          95.0               YES
5        NO       Married          60.0                NO
6       YES      Divorced         220.0                NO
7        NO        Single          85.0               YES
8        NO       Married          75.0                NO
9        NO        Single          90.0               YES

Training the model:

>>> nb = NaiveBayes(conn_context=cc, alpha=1.0, model_format='pmml')
>>> nb.fit(df1)

Prediction:

>>> df2.collect()
   ID HomeOwner MaritalStatus  AnnualIncome
0   0        NO       Married         120.0
1   1       YES       Married         180.0
2   2        NO        Single          90.0
>>> nb.predict(data=df2, key='ID', alpha=1.0, verbose=True).collect()
   ID CLASS  CONFIDENCE
0   0    NO   -6.572353
1   0   YES  -23.747252
2   1    NO   -7.602221
3   1   YES -169.133547
4   2    NO   -7.133599
5   2   YES   -4.648640
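
As a usage sketch for score() (illustrative only), assume a hypothetical labeled DataFrame df3 with the same columns as df1 plus an ID column:

>>> nb.score(data=df3, key='ID', label='DefaultedBorrower', alpha=1.0)  # mean accuracy as a float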

Attributes

model_

(DataFrame) Trained model content. Note: the Laplace value (alpha) is only stored in JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

Methods

fit(data[, key, features, label, …])

Fit classification model based on training data.

predict(data, key[, features, alpha, verbose])

Predict based on fitted model.

score(data, key[, features, label, alpha])

Returns the mean accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit classification model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

predict(data, key, features=None, alpha=None, verbose=None)

Predict based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point.

Defaults to False.

Returns

DataFrame

Predicted result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLASS, type NVARCHAR, predicted class name.

  • CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.

Note

A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.

score(data, key, features=None, label=None, alpha=None)

Returns the mean accuracy on the given test data and labels.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

Returns

float

Mean accuracy on the given test data and labels.

hana_ml.algorithms.pal.neighbors

This module contains the Python wrapper for the PAL k-nearest neighbors algorithm.

The following class is available:

class hana_ml.algorithms.pal.neighbors.KNN(conn_context, n_neighbors=None, thread_ratio=None, voting_type=None, stat_info=True, metric=None, minkowski_power=None, algorithm=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-Nearest Neighbor (KNN) model that handles classification problems.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

n_neighbors : int, optional

Number of nearest neighbors.

Defaults to 1.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

voting_type : {‘majority’, ‘distance-weighted’}, optional

Method used to vote for the most frequent label of the K nearest neighbors.

Defaults to ‘distance-weighted’.

stat_info : bool, optional

Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set. If true, the table will be returned.

Defaults to True.

metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional

Ways to compute the distance between data points.

Defaults to ‘euclidean’.

minkowski_power : float, optional

When the Minkowski metric is used, this parameter controls the value of the power. Only valid when metric is ‘minkowski’.

Defaults to 3.0.

algorithm : {‘brute-force’, ‘kd-tree’}, optional

Algorithm used to compute the nearest neighbors.

Defaults to ‘brute-force’.

Examples

Training data:

>>> df.collect()
   ID      X1      X2  TYPE
0   0     1.0     1.0     2
1   1    10.0    10.0     3
2   2    10.0    11.0     3
3   3    10.0    10.0     3
4   4  1000.0  1000.0     1
5   5  1000.0  1001.0     1
6   6  1000.0   999.0     1
7   7   999.0   999.0     1
8   8   999.0  1000.0     1
9   9  1000.0  1000.0     1

Create KNN instance and call fit:

>>> knn = KNN(conn_context=conn, n_neighbors=3, voting_type='majority',
...           thread_ratio=0.1, stat_info=False)
>>> knn.fit(data=df, key='ID', features=['X1', 'X2'], label='TYPE')
>>> pred_df = conn.table("PAL_KNN_CLASSDATA_TBL")

Call predict:

>>> res, stat = knn.predict(data=pred_df, key="ID")
>>> res.collect()
   ID  TYPE
0   0     3
1   1     3
2   2     3
3   3     1
4   4     1
5   5     1
6   6     1
7   7     1

Methods

fit(data, key[, features, label])

Fit the model when the training set is given.

predict(data, key[, features])

Predict the class labels for the provided data.

score(data, key[, features, label])

Return a scalar accuracy value after comparing the predicted and original label.

fit(data, key, features=None, label=None)

Fit the model when the training set is given.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict the class labels for the provided data.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Predicted result, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • Label column, with same name and type as training data’s label column.

DataFrame

The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:

  • TEST_ + data’s ID name, with same type as data’s ID column, query data ID.

  • K, type INTEGER, K number.

  • TRAIN_ + training data’s ID name, with same type as training data’s ID column, neighbor point’s ID.

  • DISTANCE, type DOUBLE, distance.

score(data, key, features=None, label=None)

Return a scalar accuracy value after comparing the predicted and original label.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns

float

Scalar accuracy value after comparing the predicted label and original label.
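
As a brief sketch, reusing the fitted knn instance from the example above and assuming a labeled DataFrame df_labeled with columns ID, X1, X2 and TYPE (an illustrative name):

>>> acc = knn.score(data=df_labeled, key='ID', features=['X1', 'X2'], label='TYPE')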

hana_ml.algorithms.pal.neural_network

This module contains Python wrappers for PAL Multi-layer Perceptron algorithm.

The following classes are available:

class hana_ml.algorithms.pal.neural_network.MLPClassifier(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Classifier.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory

Activation function for the hidden layer. Mandatory if activation_options is not provided.

activation_options : list of str, conditionally mandatory

A list of activation functions for parameter selection.

See activation for the full set of valid activation functions.

output_activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory

Activation function for the output layer. Mandatory if output_activation_options is not provided.

output_activation_options : list of str, conditionally mandatory

A list of activation functions for the output layer for parameter selection.

See output_activation for the full set of activation functions for output layer.

hidden_layer_size : list of int or tuple of int

Sizes of all hidden layers.

hidden_layer_size_options : list of tuples, conditionally mandatory

A list of optional sizes of all hidden layers for parameter selection.

max_iter : int, optional

Maximum number of iterations.

Defaults to 100.

training_style : {‘batch’, ‘stochastic’}, optional

Specifies the training style.

Defaults to ‘stochastic’.

learning_rate : float, optional

Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.

momentum : float, optional

Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.

batch_size : int, optional

Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.

Defaults to 1.

normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to ‘no’.

weight_init : {‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional

Specifies the weight initial value.

Defaults to ‘all-zeros’.

categorical_variable : str or list of str, optional

Specifies the column name(s) in the data table to be treated as categorical variables.

Valid only for columns of INTEGER type.

thread_ratio : float, optional

Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method : {‘cv’,’stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric : {‘accuracy’,’f1_score’, ‘auc_onevsrest’, ‘auc_pairwise’}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_num : int, optional

Specifies the fold number for the cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’ or ‘stratified_cv’.

repeat_times : int, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy : {‘grid’, ‘random’}, optional

Specifies the method for parameter selection. If not provided, parameter selection will not be activated.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.

random_state : int, optional

Specifies the seed for random generation. When 0 is specified, system time is used.

Defaults to 0.

timeout : int, optional

Specifies the maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.

param_values : list of tuple, optional

Sets the values of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple contains two elements

  • 1st element is the parameter name (str type),

  • 2nd element is a list of valid values for that parameter.

A simple example for illustration:

[(‘learning_rate’, [0.1, 0.2, 0.5]),

(‘momentum’, [0.2, 0.6])]

Valid only when search_strategy is specified and training_style is ‘stochastic’.

param_range : list of tuple, optional

Sets the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple should contain two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list that specifies the range of that parameter as follows:

the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.

Valid only when search_strategy is specified and training_style is ‘stochastic’.
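
For illustration, a range specification covering all three tunable parameters could be written as the following list of tuples; the start, step and end values are arbitrary examples, and the list would be passed as param_range together with search_strategy:

>>> param_range = [('learning_rate', [0.01, 0.01, 0.1]),
...                ('momentum', [0.1, 0.1, 0.9]),
...                ('batch_size', [1, 1, 5])]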

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003 LABEL
0     1  1.71   AC     0    AA
1    10  1.78   CA     5    AB
2    17  2.36   AA     6    AA
3    12  3.15   AA     2     C
4     7  1.05   CA     3    AB
5     6  1.50   CA     2    AB
6     9  1.97   CA     6     C
7     5  1.26   AA     1    AA
8    12  2.13   AC     4     C
9    18  1.87   AC     6    AA

Training the model:

>>> mlpc = MLPClassifier(conn_context=conn, hidden_layer_size=(10,10),
...                      activation='tanh', output_activation='tanh',
...                      learning_rate=0.001, momentum=0.0001,
...                      training_style='stochastic',max_iter=100,
...                      normalization='z-transform', weight_init='normal',
...                      thread_ratio=0.3, categorical_variable='V003')
>>> mlpc.fit(data=df)

Training result may look different from the following results due to model randomness.

>>> mlpc.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  t":0.2700182926188939},{"from":13,"weight":0.0...
2          3  ht":0.2414416413305134},{"from":21,"weight":0....
>>> mlpc.train_log_.collect()
    ITERATION     ERROR
0           1  1.080261
1           2  1.008358
2           3  0.947069
3           4  0.894585
4           5  0.849411
5           6  0.810309
6           7  0.776256
7           8  0.746413
8           9  0.720093
9          10  0.696737
10         11  0.675886
11         12  0.657166
12         13  0.640270
13         14  0.624943
14         15  0.609432
15         16  0.595204
16         17  0.582101
17         18  0.569990
18         19  0.558757
19         20  0.548305
20         21  0.538553
21         22  0.529429
22         23  0.521457
23         24  0.513893
24         25  0.506704
25         26  0.499861
26         27  0.493338
27         28  0.487111
28         29  0.481159
29         30  0.475462
..        ...       ...
70         71  0.349684
71         72  0.347798
72         73  0.345954
73         74  0.344071
74         75  0.342232
75         76  0.340597
76         77  0.338837
77         78  0.337236
78         79  0.335749
79         80  0.334296
80         81  0.332759
81         82  0.331255
82         83  0.329810
83         84  0.328367
84         85  0.326952
85         86  0.325566
86         87  0.324232
87         88  0.322899
88         89  0.321593
89         90  0.320242
90         91  0.318985
91         92  0.317840
92         93  0.316630
93         94  0.315376
94         95  0.314210
95         96  0.313066
96         97  0.312021
97         98  0.310916
98         99  0.309770
99        100  0.308704

Prediction:

>>> pred_df.collect()
>>> res, stat = mlpc.predict(data=pred_df, key='ID')

Prediction result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET     VALUE
0   1      C  0.472751
1   2      C  0.417681
2   3      C  0.543967
>>> stat.collect()
   ID CLASS  SOFT_MAX
0   1    AA  0.371996
1   1    AB  0.155253
2   1     C  0.472751
3   2    AA  0.357822
4   2    AB  0.224496
5   2     C  0.417681
6   3    AA  0.349813
7   3    AB  0.106220
8   3     C  0.543967

Model Evaluation:

>>> mlpc = MLPClassifier(conn_context=conn,
...                      activation='tanh',
...                      output_activation='tanh',
...                      hidden_layer_size=(10,10),
...                      learning_rate=0.001,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='cv',
...                      evaluation_metric='f1_score',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Model evaluation result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_F1_SCORE                       1, 0, 1, 1, 0, 1, 0, 1, 1, 0
2     TEST_2_F1_SCORE                       0, 0, 1, 1, 0, 1, 0, 1, 1, 1
3  TEST_F1_SCORE.MEAN                                                0.6
4   TEST_F1_SCORE.VAR                                           0.252631
5      EVAL_RESULTS_1  {"candidates":[{"TEST_F1_SCORE":[[1.0,0.0,1.0,...
6     solution status  Convergence not reached after maximum number o...
7               ERROR                                 0.2951168443145714

Parameter selection:

>>> act_opts=['tanh', 'linear', 'sigmoid_asymmetric']
>>> out_act_opts = ['sigmoid_symmetric', 'gaussian_asymmetric', 'gaussian_symmetric']
>>> layer_size_opts = [(10, 10), (5, 5, 5)]
>>> mlpc = MLPClassifier(conn_context=conn,
...                      activation_options=act_opts,
...                      output_activation_options=out_act_opts,
...                      hidden_layer_size_options=layer_size_opts,
...                      learning_rate=0.001,
...                      batch_size=2,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='stratified_bootstrap',
...                      evaluation_metric='accuracy',
...                      search_strategy='grid',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Parameter selection result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_ACCURACY                                               0.25
2     TEST_2_ACCURACY                                           0.666666
3  TEST_ACCURACY.MEAN                                           0.458333
4   TEST_ACCURACY.VAR                                          0.0868055
5      EVAL_RESULTS_1  {"candidates":[{"TEST_ACCURACY":[[0.50],[0.0]]...
6      EVAL_RESULTS_2  PUT_LAYER_ACTIVE_FUNC=6;HIDDEN_LAYER_ACTIVE_FU...
7      EVAL_RESULTS_3  FUNC=2;"},{"TEST_ACCURACY":[[0.50],[0.33333333...
8      EVAL_RESULTS_4  rs":"HIDDEN_LAYER_SIZE=10, 10;OUTPUT_LAYER_ACT...
9               ERROR                                  0.684842661926971
>>> mlpc.optim_param_.collect()
                 PARAM_NAME  INT_VALUE DOUBLE_VALUE STRING_VALUE
0         HIDDEN_LAYER_SIZE        NaN         None      5, 5, 5
1  OUTPUT_LAYER_ACTIVE_FUNC        4.0         None         None
2  HIDDEN_LAYER_ACTIVE_FUNC        3.0         None         None

Attributes

model_

(DataFrame) Model content.

train_log_

(DataFrame) Provides mean squared error between predicted values and target values for each iteration.

stats_

(DataFrame) Names and values of statistics.

optim_param_

(DataFrame) Provides optimal parameters selected. Available only when parameter selection is triggered.

Methods

fit(data[, key, features, label, …])

Fit the model when the training dataset is given.

predict(data, key[, features, thread_ratio])

Predict using the multi-layer perceptron model.

score(data, key[, features, label, thread_ratio])

Returns the accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when the training dataset is given.

Parameters

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratio : float, optional

Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Predicted classes, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • TARGET, type NVARCHAR, predicted class name.

  • VALUE, type DOUBLE, softmax value for the predicted class.

DataFrame

Softmax values for all classes, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • CLASS, type NVARCHAR, class name.

  • VALUE, type DOUBLE, softmax value for that class.

score(data, key, features=None, label=None, thread_ratio=None)

Returns the accuracy on the given test data and labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns

float

Scalar value of accuracy after comparing the predicted result and original label.
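
A minimal sketch of scoring, assuming the fitted mlpc instance from the examples above and a labeled test DataFrame df_test that additionally contains an ID column (assumed names):

>>> accuracy = mlpc.score(data=df_test, key='ID', label='LABEL')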

class hana_ml.algorithms.pal.neural_network.MLPRegressor(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Regressor.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}

Activation function for the hidden layer.

output_activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}

Activation function for the output layer.

hidden_layer_size : tuple of int

Size of each hidden layer.

max_iter : int, optional

Maximum number of iterations.

Defaults to 100.

training_style : {‘batch’, ‘stochastic’}, optional

Specifies the training style.

Defaults to ‘stochastic’.

learning_rate : float, optional

Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.

momentum : float, optional

Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.

batch_size : int, optional

Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.

Defaults to 1.

normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to ‘no’.

weight_init : {‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional

Specifies the weight initial value.

Defaults to ‘all-zeros’.

categorical_variable : str or list of str, optional

Specifies the column name(s) in the data table to be treated as categorical variables.

Valid only for columns of INTEGER type.

thread_ratio : float, optional

Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method : {‘cv’, ‘bootstrap’}, optional

Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric : {‘rmse’}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_num : int, optional

Specifies the fold number for the cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’.

repeat_times : int, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy : {‘grid’, ‘random’}, optional

Specifies the method for parameter selection. If not provided, parameter selection will not be activated.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.

random_state : int, optional

Specifies the seed for random generation. When 0 is specified, system time is used.

Defaults to 0.

timeout : int, optional

Specifies the maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.

param_values : list of tuple, optional

Sets the values of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple contains two elements: the 1st element is the parameter name (str type), and the 2nd element is a list of valid values for that parameter.

A simple example for illustration:

[(‘learning_rate’, [0.1, 0.2, 0.5]),

(‘momentum’, [0.2, 0.6])]

Valid only when search_strategy is specified and training_style is ‘stochastic’.

param_range : list of tuple, optional

Sets the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

Each tuple should contain two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list that specifies the range of that parameter as follows:

the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.

Valid only when search_strategy is specified and training_style is ‘stochastic’.

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003  T001  T002  T003
0     1  1.71   AC     0  12.7   2.8  3.06
1    10  1.78   CA     5  12.1   8.0  2.65
2    17  2.36   AA     6  10.1   2.8  3.24
3    12  3.15   AA     2  28.1   5.6  2.24
4     7  1.05   CA     3  19.8   7.1  1.98
5     6  1.50   CA     2  23.2   4.9  2.12
6     9  1.97   CA     6  24.5   4.2  1.05
7     5  1.26   AA     1  13.6   5.1  2.78
8    12  2.13   AC     4  13.2   1.9  1.34
9    18  1.87   AC     6  25.5   3.6  2.14

Training the model:

>>> mlpr = MLPRegressor(conn_context=conn, hidden_layer_size=(10,5),
...                     activation='sin_asymmetric',
...                     output_activation='sin_asymmetric',
...                     learning_rate=0.001, momentum=0.00001,
...                     training_style='batch',
...                     max_iter=10000, normalization='z-transform',
...                     weight_init='normal', thread_ratio=0.3)
>>> mlpr.fit(data=df, label=['T001', 'T002', 'T003'])

Training result may look different from the following results due to model randomness.

>>> mlpr.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  3782583596893},{"from":10,"weight":-0.16532599...
>>> mlpr.train_log_.collect()
     ITERATION       ERROR
0            1   34.525655
1            2   82.656301
2            3   67.289241
3            4  162.768062
4            5   38.988242
5            6  142.239468
6            7   34.467742
7            8   31.050946
8            9   30.863581
9           10   30.078204
10          11   26.671436
11          12   28.078312
12          13   27.243226
13          14   26.916686
14          15   26.782915
15          16   26.724266
16          17   26.697108
17          18   26.684084
18          19   26.677713
19          20   26.674563
20          21   26.672997
21          22   26.672216
22          23   26.671826
23          24   26.671631
24          25   26.671533
25          26   26.671485
26          27   26.671460
27          28   26.671448
28          29   26.671442
29          30   26.671439
..         ...         ...
705        706   11.891081
706        707   11.891081
707        708   11.891081
708        709   11.891081
709        710   11.891081
710        711   11.891081
711        712   11.891081
712        713   11.891081
713        714   11.891081
714        715   11.891081
715        716   11.891081
716        717   11.891081
717        718   11.891081
718        719   11.891081
719        720   11.891081
720        721   11.891081
721        722   11.891081
722        723   11.891081
723        724   11.891081
724        725   11.891081
725        726   11.891081
726        727   11.891081
727        728   11.891081
728        729   11.891081
729        730   11.891081
730        731   11.891081
731        732   11.891081
732        733   11.891081
733        734   11.891081
734        735   11.891081

[735 rows x 2 columns]

>>> pred_df.collect()
   ID  V000  V001 V002  V003
0   1     1  1.71   AC     0
1   2    10  1.78   CA     5
2   3    17  2.36   AA     6

Prediction:

>>> res  = mlpr.predict(data=pred_df, key='ID')

Result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET      VALUE
0   1   T001  12.700012
1   1   T002   2.799133
2   1   T003   2.190000
3   2   T001  12.099740
4   2   T002   6.100000
5   2   T003   2.190000
6   3   T001  10.099961
7   3   T002   2.799659
8   3   T003   2.190000

Attributes

model_

(DataFrame) Model content.

train_log_

(DataFrame) Provides mean squared error between predicted values and target values for each iteration.

stats_

(DataFrame) Names and values of statistics.

optim_param_

(DataFrame) Provides optimal parameters selected. Available only when parameter selection is triggered.

Methods

fit(data[, key, features, label, …])

Fit the model when the training dataset is given.

predict(data, key[, features, thread_ratio])

Predict using the multi-layer perceptron model.

score(data, key[, features, label, thread_ratio])

Returns the coefficient of determination R^2 of the prediction.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when the training dataset is given.

Parameters

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

label : str or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratio : float, optional

Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Predicted results, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • TARGET, type NVARCHAR, target name.

  • VALUE, type DOUBLE, regression value.

score(data, key, features=None, label=None, thread_ratio=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

Returns

float

Returns the coefficient of determination R^2 of the prediction.
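
A minimal sketch, assuming the fitted mlpr instance from the examples above and a test DataFrame df_test containing an ID column and the three target columns (assumed names):

>>> r2 = mlpr.score(data=df_test, key='ID', label=['T001', 'T002', 'T003'])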

hana_ml.algorithms.pal.pagerank

This module contains a Python wrapper for the PAL PageRank algorithm.

The following class is available:

class hana_ml.algorithms.pal.pagerank.PageRank(conn_context, damping=None, max_iter=None, tol=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A page rank model.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

damping : float, optional

The damping factor d.

Defaults to 0.85.

max_iter : int, optional

The maximum number of iterations of the power method. The value 0 means no maximum number of iterations is set and the calculation stops when the result converges.

Defaults to 0.

tol : float, optional

Specifies the stop condition. When the mean improvement value of ranks is less than this value, the program stops calculation.

Defaults to 1e-6.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
   FROM_NODE    TO_NODE
0   Node1       Node2
1   Node1       Node3
2   Node1       Node4
3   Node2       Node3
4   Node2       Node4
5   Node3       Node1
6   Node4       Node1
7   Node4       Node3

Create a PageRank instance:

>>> pr = PageRank(conn_context=conn)

Call run() on given data sequence:

>>> result = pr.run(data=df)
>>> result.collect()
   NODE     RANK
0   NODE1   0.368152
1   NODE2   0.141808
2   NODE3   0.287962
3   NODE4   0.202078

Attributes

None

Methods

run(data)

This method reads link information and calculates rank for each node.

run(data)

This method reads link information and calculates rank for each node.

Parameters

data : DataFrame

Data containing the link information between nodes, with one directed link (FROM_NODE, TO_NODE) per row.

Returns

DataFrame

Calculated rank values and corresponding node names, structured as follows:

  • NODE: node names.

  • RANK: the PageRank of the corresponding node.

hana_ml.algorithms.pal.partition

This module contains a Python wrapper for the PAL partition function.

The following function is available:

hana_ml.algorithms.pal.partition.train_test_val_split(conn_context, data, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)

The algorithm randomly partitions an input dataset into three disjoint subsets called training, testing and validation. Note that the union of these three subsets might not be the complete initial dataset.

Two different partitions can be obtained:

  1. Random Partition, which randomly divides all the data.

  2. Stratified Partition, which divides each subpopulation randomly.

In the second case, the dataset needs to have at least one categorical attribute (for example, of type VARCHAR). The initial dataset will first be subdivided according to the different categorical values of this attribute. Each mutually exclusive subset will then be randomly split to obtain the training, testing, and validation subsets. This ensures that all “categorical values” or “strata” will be present in the sampled subset.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

DataFrame to be partitioned.

random_seed : int, optional

Indicates the seed used to initialize the random number generator.

0: Uses the system time

Not 0: Uses the specified seed

Defaults to 0.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

partition_method : {‘random’, ‘stratified’}, optional

Partition method:
  • ‘random’: random partitions

  • ‘stratified’: stratified partition

Defaults to ‘random’.

stratified_column : str, optional

Indicates which column is used for stratification.

Valid only when partition_method is set to ‘stratified’ (stratified partition).

No default value.

training_percentage : float, optional

The percentage of training data. Value range: 0 <= value <= 1.

Defaults to 0.8.

testing_percentage : float, optional

The percentage of testing data. Value range: 0 <= value <= 1.

Defaults to 0.1.

validation_percentage : float, optional

The percentage of validation data. Value range: 0 <= value <= 1.

Defaults to 0.1.

training_size : int, optional

Row size of training data. Value range: >=0

If both training_percentage and training_size are specified, training_percentage takes precedence.

No default value.

testing_size : int, optional

Row size of testing data. Value range: >=0

If both testing_percentage and testing_size are specified, testing_percentage takes precedence.

No default value.

validation_size : int, optional

Row size of validation data. Value range:>=0

If both validation_percentage and validation_size are specified, validation_percentage takes precedence.

No default value.

Returns

DataFrame

Training data. Table structure identical to input data table.

Testing data. Table structure identical to input data table.

Validation data. Table structure identical to input data table.

Examples

>>> train, test, valid = train_test_val_split(conn_context=conn, data=df)
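
A stratified partition can be requested in the same way; the stratification column name 'CLASS' and the percentages below are assumptions chosen only for illustration:

>>> train, test, valid = train_test_val_split(conn_context=conn, data=df,
...                                           partition_method='stratified',
...                                           stratified_column='CLASS',
...                                           training_percentage=0.7,
...                                           testing_percentage=0.2,
...                                           validation_percentage=0.1)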

hana_ml.algorithms.pal.pipeline

This module supports running PAL functions in a pipeline manner.

class hana_ml.algorithms.pal.pipeline.Pipeline(steps)

Bases: object

Pipeline construction to run transformers and estimators sequentially.

Parameters

steps : list

List of (name, transform) tuples that are chained. The last object should be an estimator.

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
    ('imputer', Imputer(conn_context=conn, strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
     n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
     max_depth=6, cross_validation_range=cv_range))
    ])

Methods

fit(df, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(df, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(df, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters

df : DataFrame

SAP HANA DataFrame to be transformed in the pipeline.

param : dict

Parameters corresponding to the transform name.

Returns

DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
        ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
        ('imputer', Imputer(conn_context=conn, strategy='mean'))
        ])
>>> param = {'pca': [('key', 'ID'), ('label', 'CLASS')], 'imputer': []}
>>> my_pipeline.fit_transform(df=train_df, param=param)

fit(df, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters

df : DataFrame

SAP HANA DataFrame to be transformed in the pipeline.

param : dict

Parameters corresponding to the transform name.

Returns

DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
    ('imputer', Imputer(conn_context=conn, strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
    n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
    max_depth=6, cross_validation_range=cv_range))
    ])
>>> param = {
                'pca': [('key', 'ID'), ('label', 'CLASS')],
                'imputer': [],
                'hgbt': [('key', 'ID'), ('label', 'CLASS'), ('categorical_variable', ['CLASS'])]
            }
>>> hgbt_model = my_pipeline.fit(df=train_df, param=param)

hana_ml.algorithms.pal.preprocessing

This module contains Python wrappers for PAL preprocessing algorithms.

The following classes are available:

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(conn_context, method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Normalize a DataFrame.

Parameters

conn_context : ConnectionContext

The connection to the SAP HANA system.

method : {‘min-max’, ‘z-score’, ‘decimal’}

Scaling methods:

  • ‘min-max’: Min-max normalization.

  • ‘z-score’: Z-Score normalization.

  • ‘decimal’: Decimal scaling normalization.

z_score_method : {‘mean-standard’, ‘mean-mean’, ‘median-median’}, optional

Only valid when method is ‘z-score’.

  • ‘mean-standard’: Mean-Standard deviation

  • ‘mean-mean’: Mean-Mean deviation

  • ‘median-median’: Median-Median absolute deviation

new_max : float, optional

The new maximum value for min-max normalization.

Only valid when method is ‘min-max’.

new_min : float, optional

The new minimum value for min-max normalization.

Only valid when method is ‘min-max’.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

Input DataFrame df1:

>>> df1.head(4).collect()
    ID    X1    X2
0    0   6.0   9.0
1    1  12.1   8.3
2    2  13.5  15.3
3    3  15.4  18.7

Creating a FeatureNormalizer instance:

>>> fn = FeatureNormalizer(conn_context=conn, method="min-max", new_max=1.0, new_min=0.0)

Performing fit on given DataFrame:

>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
    ID        X1        X2
0    0  0.000000  0.033175
1    1  0.186544  0.000000
2    2  0.229358  0.331754
3    3  0.287462  0.492891

Input DataFrame for transforming:

>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0

Performing transform on given DataFrame:

>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398

Attributes

result_

(DataFrame) Scaled dataset from fit and fit_transform methods.

model_

Trained model content.

Methods

fit(data, key[, features])

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

fit_transform(data, key[, features])

Fit with the dataset and return the results.

transform(data, key[, features])

Scales data based on the previous scaling model.

fit(data, key, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters

data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

fit_transform(data, key, features=None)

Fit with the dataset and return the results.

Parameters

data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns

DataFrame

Normalized result, with the same structure as data.
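
For instance, fitting and transforming can be combined in a single call (shown here on the df1 DataFrame from the example above):

>>> result = fn.fit_transform(data=df1, key='ID')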

transform(data, key, features=None)

Scales data based on the previous scaling model.

Parameters

data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns

DataFrame

Normalized result, with the same structure as data.

class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(conn_context, strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bin continuous data into number of intervals and perform local smoothing.

Parameters

conn_context : ConnectionContext

The connection to the SAP HANA system.

strategy : {‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}

Binning methods:
  • ‘uniform_number’: Equal widths based on the number of bins.

  • ‘uniform_size’: Equal widths based on the bin size.

  • ‘quantile’: Equal number of records per bin.

  • ‘sd’: Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.

smoothing : {‘means’, ‘medians’, ‘boundaries’}

Smoothing methods:
  • ‘means’: Each value within a bin is replaced by the average of all the values belonging to the same bin.

  • ‘medians’: Each value in a bin is replaced by the median of all the values belonging to the same bin.

  • ‘boundaries’: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When a value is equally distant from both boundaries, it is replaced by the front boundary value.

Values used for smoothing are not re-calculated during transform.

n_bins : int, optional

The number of bins. Only valid when strategy is ‘uniform_number’ or ‘quantile’.

Defaults to 2.

bin_size : int, optional

The interval width of each bin. Only valid when strategy is ‘uniform_size’.

Defaults to 10.

n_sd : int, optional

The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean. Only valid when strategy is ‘sd’.

Defaults to 1.

Examples

Input DataFrame df1:

>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0

Creating a KBinsDiscretizer instance:

>>> binning = KBinsDiscretizer(conn_context=conn, strategy='uniform_size', smoothing='means', bin_size=10)

Performing fit on the given DataFrame:

>>> binning.fit(data=df1, key='ID')

Output:

>>> binning.result_.collect()
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000

Input DataFrame df2 for transforming:

>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0

Performing transform on the given DataFrame:

>>> result = binning.transform(data=df2, key='ID')

Output:

>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000

Attributes

result_

(DataFrame) Binned dataset from fit and fit_transform methods.

model_

Binning model content.

Methods

fit(data, key[, features])

Bin input data into number of intervals and smooth.

fit_transform(data, key[, features])

Fit with the dataset and return the results.

transform(data, key[, features])

Bin data based on the previous binning model.

fit(data, key, features=None)

Bin input data into number of intervals and smooth.

Parameters

data : DataFrame

DataFrame to be discretized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

fit_transform(data, key, features=None)

Fit with the dataset and return the results.

Parameters

data : DataFrame

DataFrame to be binned.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns

DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data’s ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data’s feature column.
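
Equivalently, binning and retrieval of the smoothed values can be combined in a single call (shown here on the df1 DataFrame from the example above):

>>> result = binning.fit_transform(data=df1, key='ID')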

transform(data, key, features=None)

Bin data based on the previous binning model.

Parameters

data : DataFrame

DataFrame to be binned.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns

DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data’s ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data’s feature column.

class hana_ml.algorithms.pal.preprocessing.Imputer(conn_context, strategy=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Missing value imputation for DataFrame.

Parameters

conn_context : ConnectionContext

The connection to the SAP HANA system.

strategy : {‘non’, ‘mean’, ‘median’, ‘zero’, ‘als’, ‘delete’}, optional

The overall imputation strategy for all numerical columns.

Defaults to ‘mean’.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

The following parameters all have the prefix ‘als_’, and are used only when ‘als’ is the overall imputation strategy. These parameters set up the alternating-least-squares (ALS) model for data imputation.

als_factors : int, optional

Length of factor vectors in the ALS model. It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambda : float, optional

L2 regularization applied to the factors in the ALS model. Should be non-negative.

Defaults to 0.01.

als_maxit : int, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstate : int, optional

Specifies the seed of the random number generator used in the training of ALS model:

0: Uses the current time as the seed,

Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_threshold : float, optional

Specifies a threshold for stopping the training of the ALS model. If the improvement of the cost function of the ALS model between consecutive checks is less than this value, the training process exits. A value of 0 means the objective value is not checked while the algorithm runs, and training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_interval : int, optional

Specify the number of iterations between consecutive checking of cost functions for the ALS model, so that one can see if the pre-specified exit_threshold is reached.

Defaults to 5.

als_linsolver : {‘cholsky’, ‘cg’}, optional

Linear system solver for the ALS model. ‘cholsky’ is usually much faster. ‘cg’ is recommended when als_factors is large.

Defaults to ‘cholsky’.

als_cg_maxit : int, optional

Specifies the maximum number of iterations for the cg algorithm. Used only when ‘cg’ is the chosen linear system solver for ALS.

Defaults to 3.

als_centering : bool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scaling : bool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

Examples

Input DataFrame df:

>>> df.head(5).collect()
   V0   V1 V2   V3   V4    V5
0  10  0.0  D  NaN  1.4  23.6
1  20  1.0  A  0.4  1.3  21.8
2  50  1.0  C  NaN  1.6  21.9
3  30  NaN  B  0.8  1.7  22.6
4  10  0.0  A  0.2  NaN   NaN

Create an Imputer instance using ‘mean’ strategy and call fit:

>>> impute = Imputer(conn_context=conn, strategy='mean')
>>> result = impute.fit_transform(df, categorical_variable=['V1'], strategy_by_col=[('V1', 'categorical_const', '0')])
>>> result.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.507692  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.507692  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.469231  20.646154

The stats_model_ content of the input DataFrame:

>>> impute.stats_model_.head(5).collect()
            STAT_NAME                   STAT_VALUE
0  V0.NUMBER_OF_NULLS                            3
1  V0.IMPUTATION_TYPE                         MEAN
2    V0.IMPUTED_VALUE                           24
3  V1.NUMBER_OF_NULLS                            2
4  V1.IMPUTATION_TYPE  SPECIFIED_CATEGORICAL_VALUE

The above stats_model_ content can be applied to impute another DataFrame with the same data structure, e.g. consider the following DataFrame with missing values:

>>> df1.collect()
   ID    V0   V1    V2   V3   V4    V5
0   0  20.0  1.0     B  NaN  1.5  21.7
1   1  40.0  1.0  None  0.6  1.2  24.3
2   2   NaN  0.0     D  NaN  1.8  22.6
3   3  50.0  NaN     C  0.7  1.1   NaN
4   4  20.0  1.0     A  0.3  NaN  20.6

Once impute.stats_model_ has been obtained, the missing values of df1 can be imputed with the following line of code, and the result checked:

>>> result1, _ = impute.transform(data=df1, key='ID')
>>> result1.collect()
   ID  V0  V1 V2        V3        V4         V5
0   0  20   1  B  0.507692  1.500000  21.700000
1   1  40   1  A  0.600000  1.200000  24.300000
2   2  24   0  D  0.507692  1.800000  22.600000
3   3  50   0  C  0.700000  1.100000  20.646154
4   4  20   1  A  0.300000  1.469231  20.600000

Create an Imputer instance using other strategies, e.g. ‘als’ strategy and then call fit:

>>> impute = Imputer(conn_context=conn, strategy='als', als_factors=2, als_randomstate=1)

Output:

>>> result2 = impute.fit_transform(data=df, categorical_variable=['V1'])
>>> result2.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.306957  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.930689  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.333668  21.371753

Attributes

stats_model_

(DataFrame) Statistics model content.

Methods

fit_transform(data[, key, …])

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

transform(data[, key, thread_ratio])

The function imputes missing values of a DataFrame using statistics/model info collected from another DataFrame.

fit_transform(data, key=None, categorical_variable=None, strategy_by_col=None)

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

Parameters

data : DataFrame

Input data with missing values.

key : str, optional

Name of the ID column. The input is assumed to have no ID column if key is not provided.

categorical_variable : str or list of str, optional

Names of columns with INTEGER data type that should actually be treated as categorical. By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.

strategy_by_col : ListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation. Each tuple in the list should contain at least two elements, such that: The first element is the name of a column; the second element is the imputation strategy of that column. If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, then a third element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An illustrative example:

[(‘V1’, ‘categorical_const’, ‘0’), (‘V5’,’median’)]

Returns

DataFrame

Imputed result using the specified strategy, with the same data structure, i.e. the same column names and data types as data.

transform(data, key=None, thread_ratio=None)

The function imputes missing values of a DataFrame using statistics/model information collected from another DataFrame.

Parameters

data : DataFrame

Input DataFrame.

key : str, optional

Name of the ID column. The input is assumed to have no ID column if key is not provided.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Returns

DataFrame

Imputation result, structured the same as data.

Statistics for the imputation result, structured as:

  • STAT_NAME: type NVARCHAR(256), statistics name.

  • STAT_VALUE: type NVARCHAR(5000), statistics value.
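As an additional hedged sketch of the full call signature (assumes the impute instance and DataFrame df1 from the earlier examples; output omitted), a transform call that caps thread usage at half of the available threads would be:

>>> result4, stats4 = impute.transform(data=df1, key='ID', thread_ratio=0.5)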

hana_ml.algorithms.pal.random

This module contains wrappers for PAL Random distribution sampling algorithms.

The following distribution functions are available:

hana_ml.algorithms.pal.random.multinomial(conn_context, n, pvals, num_random=100, seed=None, thread_ratio=None)

Draw samples from a multinomial distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

n : int

Number of trials.

pvals : tuple of float and int

Success fractions of each category.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • Generated random number columns, named RANDOM_P1 through RANDOM_Pn (where n is the length of pvals), type DOUBLE. There is one such column for each value in pvals.

Examples

Draw samples from a multinomial distribution.

>>> res = multinomial(conn_context=cc, n=10, pvals=(0.1, 0.2, 0.3, 0.4), num_random=10)
>>> res.collect()
   ID  RANDOM_P1  RANDOM_P2  RANDOM_P3  RANDOM_P4
0   0        1.0        2.0        2.0        5.0
1   1        1.0        2.0        3.0        4.0
2   2        0.0        0.0        8.0        2.0
3   3        0.0        2.0        1.0        7.0
4   4        1.0        1.0        4.0        4.0
5   5        1.0        1.0        4.0        4.0
6   6        1.0        2.0        3.0        4.0
7   7        1.0        4.0        2.0        3.0
8   8        1.0        2.0        3.0        4.0
9   9        4.0        1.0        1.0        4.0
hana_ml.algorithms.pal.random.bernoulli(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Bernoulli distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

p : float, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a bernoulli distribution.

>>> res = bernoulli(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               0.0
2   2               1.0
3   3               1.0
4   4               0.0
5   5               1.0
6   6               1.0
7   7               0.0
8   8               1.0
9   9               0.0
hana_ml.algorithms.pal.random.beta(conn_context, a=0.5, b=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Beta distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

a : float, optional

Alpha value, positive.

Defaults to 0.5.

b : float, optional

Beta value, positive.

Defaults to 0.5.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a beta distribution.

>>> res = beta(conn_context=cc, a=0.5, b=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.976130
1   1          0.308346
2   2          0.853118
3   3          0.958553
4   4          0.677258
5   5          0.489628
6   6          0.027733
7   7          0.278073
8   8          0.850181
9   9          0.976244
hana_ml.algorithms.pal.random.binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a binomial distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

n : int, optional

Number of trials.

Defaults to 1.

p : float, optional

Successful fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a binomial distribution.

>>> res = binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               0.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               1.0
8   8               0.0
9   9               1.0
hana_ml.algorithms.pal.random.cauchy(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a cauchy distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

location : float, optional

Defaults to 0.

scale : float, optional

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a cauchy distribution.

>>> res = cauchy(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.827259
1   1         -1.877612
2   2        -18.241436
3   3         -1.216243
4   4          2.091336
5   5       -317.131147
6   6         -2.804251
7   7         -0.338566
8   8          0.143280
9   9          1.277245
hana_ml.algorithms.pal.random.chi_squared(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a chi_squared distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

dof : int, optional

Degrees of freedom.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a chi_squared distribution.

>>> res = chi_squared(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.040571
1   1          2.680756
2   2          1.119563
3   3          1.174072
4   4          0.872421
5   5          0.327169
6   6          1.113164
7   7          1.549585
8   8          0.013953
9   9          0.011735
hana_ml.algorithms.pal.random.exponential(conn_context, lamb=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an exponential distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

lamb : float, optional

The rate parameter, which is the inverse of the scale parameter.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an exponential distribution.

>>> res = exponential(conn_context=cc, lamb=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.035207
1   1          0.559248
2   2          0.122307
3   3          2.339937
4   4          1.130033
5   5          0.985565
6   6          0.030138
7   7          0.231040
8   8          1.233268
9   9          0.876022
hana_ml.algorithms.pal.random.gumbel(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.

Parameters

conn_context : ConnectionContext

Database connection object.

location : float, optional

Defaults to 0.

scale : float, optional

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gumbel distribution.

>>> res = gumbel(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.544054
1   1          0.339531
2   2          0.394224
3   3          3.161123
4   4          1.208050
5   5         -0.276447
6   6          1.694589
7   7          1.406419
8   8         -0.443717
9   9          0.156404
hana_ml.algorithms.pal.random.f(conn_context, dof1=1, dof2=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an f distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

dof1 : int, optional

DEGREES_OF_FREEDOM1.

Defaults to 1.

dof2 : int, optional

DEGREES_OF_FREEDOM2.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an f distribution.

>>> res = f(conn_context=cc, dof1=1, dof2=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          6.494985
1   1          0.054830
2   2          0.752216
3   3          4.946226
4   4          0.167151
5   5        351.789925
6   6          0.810973
7   7          0.362714
8   8          0.019763
9   9         10.553533
hana_ml.algorithms.pal.random.gamma(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a gamma distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

shape : float, optional

Defaults to 1.

scale : float, optional

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gamma distribution.

>>> res = gamma(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.082794
1   1          0.084031
2   2          0.159490
3   3          1.063100
4   4          0.530218
5   5          1.307313
6   6          0.565527
7   7          0.474969
8   8          0.440999
9   9          0.463645
hana_ml.algorithms.pal.random.geometric(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a geometric distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

p : float, optional

Successful fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a geometric distribution.

>>> res = geometric(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               1.0
3   3               0.0
4   4               1.0
5   5               0.0
6   6               0.0
7   7               0.0
8   8               0.0
9   9               0.0
hana_ml.algorithms.pal.random.lognormal(conn_context, mean=0, sigma=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a lognormal distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

mean : float, optional

Mean value of the underlying normal distribution.

Defaults to 0.

sigma : float, optional

Standard deviation of the underlying normal distribution.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a lognormal distribution.

>>> res = lognormal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.461803
1   1          0.548432
2   2          0.625874
3   3          3.038529
4   4          3.582703
5   5          1.867543
6   6          1.853857
7   7          0.378827
8   8          1.104031
9   9          0.840102
hana_ml.algorithms.pal.random.negative_binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a negative_binomial distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

n : int, optional

Number of successes.

Defaults to 1.

p : float, optional

Successful fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a negative_binomial distribution.

>>> res = negative_binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               2.0
2   2               3.0
3   3               1.0
4   4               1.0
5   5               0.0
6   6               2.0
7   7               1.0
8   8               2.0
9   9               3.0
hana_ml.algorithms.pal.random.normal(conn_context, mean=0, sigma=None, variance=None, num_random=100, seed=None, thread_ratio=None)

Draw samples from a normal distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

mean : float, optional

Mean value.

Defaults to 0.

sigma : float, optional

Standard deviation. It cannot be used together with variance.

Defaults to 1.

variance : float, optional

Variance. It cannot be used together with sigma.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a normal distribution.

>>> res = normal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.321078
1   1         -1.327626
2   2          0.798867
3   3         -0.116128
4   4         -0.213519
5   5          0.008566
6   6          0.251733
7   7          0.404510
8   8         -0.534899
9   9         -0.420968
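Since sigma and variance cannot be used together, an equivalent call parameterized by variance instead of sigma would look like the following sketch (output omitted):

>>> res = normal(conn_context=cc, mean=0, variance=1, num_random=10)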
hana_ml.algorithms.pal.random.pert(conn_context, minimum=-1, mode=0, maximum=1, scale=4, num_random=100, seed=None, thread_ratio=None)

Draw samples from a PERT distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

minimum : int, optional

Minimum value.

Defaults to -1.

mode : float, optional

Most likely value.

Defaults to 0.

maximum : float, optional

Maximum value.

Defaults to 1.

scale : float, optional

Defaults to 4.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a pert distribution.

>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.360781
1   1         -0.023649
2   2          0.106465
3   3          0.307412
4   4         -0.136838
5   5         -0.086010
6   6         -0.504639
7   7          0.335352
8   8         -0.287202
9   9          0.468597
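The scale parameter (default 4, the classical PERT value) controls how strongly the samples concentrate around the mode; as an illustrative sketch, a flatter PERT sample could be requested with a smaller scale (output omitted):

>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, scale=2, num_random=10)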
hana_ml.algorithms.pal.random.poisson(conn_context, theta=1.0, num_random=100, seed=None, thread_ratio=None)

Draw samples from a poisson distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

theta : float, optional

The average number of events in an interval.

Defaults to 1.0.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a poisson distribution.

>>> res = poisson(conn_context=cc, theta=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               1.0
2   2               1.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               2.0
8   8               0.0
9   9               1.0
hana_ml.algorithms.pal.random.student_t(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Student’s t-distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

dof : float, optional

Degrees of freedom.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a Student’s t-distribution.

>>> res = student_t(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0         -0.433802
1   1          1.972038
2   2         -1.097313
3   3         -0.225812
4   4         -0.452342
5   5          2.242921
6   6          0.377288
7   7          0.322347
8   8          1.104877
9   9         -0.017830
hana_ml.algorithms.pal.random.uniform(conn_context, low=0, high=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a uniform distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

low : float, optional

The lower bound.

Defaults to 0.

high : float, optional

The upper bound.

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a uniform distribution.

>>> res = uniform(conn_context=cc, low=-1, high=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.032920
1   1          0.201923
2   2          0.823313
3   3         -0.495260
4   4         -0.138329
5   5          0.677732
6   6          0.685200
7   7          0.363627
8   8          0.024849
9   9         -0.441779
hana_ml.algorithms.pal.random.weibull(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a weibull distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

shape : float, optional

Defaults to 1.

scale : float, optional

Defaults to 1.

num_random : int, optional

Specifies the number of random data to be generated.

Defaults to 100.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns

DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a weibull distribution.

>>> res = weibull(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          2.188750
1   1          0.247628
2   2          0.339884
3   3          0.902187
4   4          0.909629
5   5          0.514740
6   6          4.627877
7   7          0.143767
8   8          0.847514
9   9          2.368169

hana_ml.algorithms.pal.regression

This module contains wrappers for PAL regression algorithms.

The following classes are available:

class hana_ml.algorithms.pal.regression.PolynomialRegression(conn_context, degree, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X. In polynomial regression, data is modeled using polynomial functions, and unknown model parameters are estimated from the data. Such models are called polynomial models.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

degree : int

Degree of the polynomial model.

decomposition : {‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2 : bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

Training data (based on y = x^3 - 2x^2 + 3x + 5, with noise):

>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041

Training the model:

>>> pr = PolynomialRegression(conn_context=conn, degree=3)
>>> pr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501

Ideal output:

>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*X+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875

Attributes

coefficients_

(DataFrame) Fitted regression coefficients.

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_

(DataFrame) Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION_PREDICT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns

float

The coefficient of determination R2 of the prediction on the given data.
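As a hedged usage sketch (reusing the pr instance and the training DataFrame df from the example above; the returned float is not shown):

>>> r2 = pr.score(data=df, key='ID')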

class hana_ml.algorithms.pal.regression.GLM(conn_context, family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

family : {‘gaussian’, ‘normal’, ‘poisson’, ‘binomial’, ‘gamma’, ‘inversegaussian’, ‘negativebinomial’, ‘ordinal’}, optional

The kind of distribution the dependent variable outcomes are assumed to be drawn from. Defaults to ‘gaussian’.

link : str, optional

GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. ‘inverse’ is accepted as a synonym of ‘reciprocal’.

family            default link    allowed values of link
gaussian          identity        identity, log, reciprocal
poisson           log             identity, log
binomial          logit           logit, probit, comploglog, log
gamma             reciprocal      identity, reciprocal, log
inversegaussian   inversesquare   inversesquare, identity, reciprocal, log
negativebinomial  log             identity, log, sqrt
ordinal           logit           logit, probit, comploglog
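For illustration, a binomial model with the non-default probit link listed in the table above could be constructed as in the following sketch:

>>> glm_probit = GLM(conn_context=conn, family='binomial', link='probit')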

solver : {‘irls’, ‘nr’, ‘cd’}, optional

Optimization algorithm to use.

  • ‘irls’: Iteratively re-weighted least squares.

  • ‘nr’: Newton-Raphson.

  • ‘cd’: Coordinate descent. (Picking coordinate descent activates elastic net regularization.)

Defaults to ‘irls’, except when family is ‘ordinal’. Ordinal regression requires (and defaults to) ‘nr’, and Newton-Raphson is not supported for other values of family.

handle_missing_fit : {‘skip’, ‘abort’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values during fitting.

  • ‘skip’: Don’t use those rows for fitting.

  • ‘abort’: Throw an error if missing independent variable values are found.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

quasilikelihood : bool, optional

If True, enables the use of quasi-likelihood to estimate overdispersion.

Defaults to False.

max_iter : int, optional

Maximum number of optimization iterations.

Defaults to 100 for IRLS and Newton-Raphson.

Defaults to 100000 for coordinate descent.

tol : float, optional

Stopping condition for optimization.

Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals.

Defaults to 0.05.

output_fitted : bool, optional

If True, the fit method creates the fitted_ DataFrame of fitted response values for the training data.

alpha : float, optional

Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.

Defaults to 1.0.

num_lambda : int, optional

The number of lambda values. Only accepted when using coordinate descent.

Defaults to 100.

lambda_min_ratio : float, optional

The smallest value of lambda, as a fraction of the maximum lambda, where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent.

Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

ordering : list of str or list of int, optional

Specifies the order of categories for ordinal regression. The default is numeric order for ints and alphabetical order for strings.
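For illustration, the elastic-net and ordinal options described above could be selected in the constructor as in the following sketches (the parameter values and category names are arbitrary):

>>> glm_enet = GLM(conn_context=conn, solver='cd', family='gaussian', alpha=0.5, num_lambda=50)
>>> glm_ord = GLM(conn_context=conn, family='ordinal', ordering=['low', 'medium', 'high'])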

Examples

Training data:

>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1

Fitting a GLM on that data:

>>> glm = GLM(conn_context=conn, solver='irls', family='poisson', link='log')
>>> glm.fit(data=df, key='ID', label='Y')

Performing prediction:

>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(data=df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737

Attributes

statistics_

(DataFrame) Training statistics and model information other than the coefficients and covariance matrix.

coef_

(DataFrame) Model coefficients.

covmat_

(DataFrame) Covariance matrix. Set to None for coordinate descent.

fitted_

(DataFrame) Predicted values for the training data. Set to None if output_fitted is False.

Methods

fit(data[, key, features, label, …])

Fit a generalized linear model based on training data.

predict(data, key[, features, …])

Predict dependent variable values based on fitted model.

score(data, key[, features, label, …])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None, categorical_variable=None, dependent_variable=None, excluded_feature=None)

Fit a generalized linear model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column. Required when output_fitted is True.

features : list of str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

label : str or list of str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) When family is ‘binomial’, label may be either a single column name or a list of two column names.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

dependent_variable : str, optional

Only used when you need to indicate the dependent variable explicitly.

excluded_feature : list of str, optional

Excludes the indicated feature columns from the model.

Defaults to None.

predict(data, key, features=None, prediction_type=None, significance_level=None, handle_missing=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Defaults to all non-ID columns.

prediction_type : {‘response’, ‘link’}, optional

Specifies whether to output predicted values of the response or the link function.

Defaults to ‘response’.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.

handle_missing : {‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns

DataFrame

Predicted values, structured as follows. The following two columns are always populated:

  • ID column, with same name and type as data’s ID column.

  • PREDICTION, type NVARCHAR(100), representing predicted values.

The following five columns are only populated for IRLS:

  • SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.

  • CI_LOWER, type DOUBLE. Lower bound of the confidence interval.

  • CI_UPPER, type DOUBLE. Upper bound of the confidence interval.

  • PI_LOWER, type DOUBLE. Lower bound of the prediction interval.

  • PI_UPPER, type DOUBLE. Upper bound of the prediction interval.

score(data, key, features=None, label=None, prediction_type=None, handle_missing=None)

Returns the coefficient of determination R2 of the prediction.

Not applicable for ordinal regression.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.) Cannot be two columns, even for family=’binomial’.

prediction_type : {‘response’, ‘link’}, optional

Specifies whether to predict the value of the response or the link function. The contents of the label column should match this choice.

Defaults to ‘response’.

handle_missing : {‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows. Those rows will be left out of the R2 computation.

  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns

float

The coefficient of determination R2 of the prediction on the given data.

class hana_ml.algorithms.pal.regression.ExponentialRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In exponential regression, data is modeled using exponential functions, and unknown model parameters are estimated from the data. Such models are called exponential models.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

decomposition : {‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2 : bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

>>> df.collect()
   ID    Y       X1      X2
   0    0.5     0.13    0.33
   1    0.15    0.14    0.34
   2    0.25    0.15    0.36
   3    0.35    0.16    0.35
   4    0.45    0.17    0.37

Training the model:

>>> er = ExponentialRegression(conn_context=conn, pmml_export = 'multi-row')
>>> er.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X1       X2
   0    0.5      0.3
   1    4        0.4
   2    0        1.6
   3    0.3      0.45
   4    0.4      1.7
>>> er.predict(data=df2, key='ID').collect()
   ID      VALUE
   0      0.6900598931338715
   1      1.2341502316656843
   2      0.006630664136180741
   3      0.3887970208571841
   4      0.0052106543571450266

Attributes

coefficients_

(DataFrame) Fitted regression coefficients.

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_

(DataFrame) Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns

float

The coefficient of determination R2 of the prediction on the given data.

class hana_ml.algorithms.pal.regression.BiVariateGeometricRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Geometric regression is an approach used to model the relationship between a scalar variable y and a variable denoted X. In geometric regression, data is modeled using geometric functions, and unknown model parameters are estimated from the data. Such models are called geometric models.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

decomposition : {‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2 : bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

>>> df.collect()
ID    Y       X1
0    1.1      1
1    4.2      2
2    8.9      3
3    16.3     4
4    24       5

Training the model:

>>> gr = BiVariateGeometricRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
ID    X1
0     1
1     2
2     3
3     4
4     5
>>> gr.predict(data=df2, key='ID').collect()
ID      VALUE
0        1
1       3.9723699817481437
2       8.901666037549536
3       15.779723271893747
4       24.60086108408644

Attributes

coefficients_

(DataFrame) Fitted regression coefficients.

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_

(DataFrame) Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns

float

The coefficient of determination R2 of the prediction on the given data.
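
Once fitted, score() accepts the same defaults as fit(), so scoring on the training dataframe df from the class example above is a one-line call (a minimal sketch, not taken from the PAL documentation; the return value is a plain Python float):

>>> r2 = gr.score(data=df, key='ID')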

class hana_ml.algorithms.pal.regression.BiVariateNaturalLogarithmicRegression(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted X. In natural logarithmic regression, data is modeled using natural logarithmic functions, and unknown model parameters are estimated from the data. Such models are called natural logarithmic models.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

decomposition : {‘LU’, ‘SVD’}, optional

Matrix factorization type to use. Case-insensitive.

  • ‘LU’: LU decomposition.

  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2 : bool, optional

If true, include the adjusted R2 value in the statistics table.

Defaults to False.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.

  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.

  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect fitting.

Defaults to 0.

Examples

>>> df.collect()
   ID    Y       X1
   0    10       1
   1    80       2
   2    130      3
   3    180      5
   4    190      6

Training the model:

>>> gr = BiVariateNaturalLogarithmicRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X1
   0     1
   1     2
   2     3
   3     4
   4     5
>>> gr.predict(data=df2, key='ID').collect()
   ID      VALUE
   0     14.86160299
   1     82.9935329364932
   2     122.8481570569525
   3     151.1254628829864
   4     173.05904529166017

Attributes

coefficients_

(DataFrame) Fitted regression coefficients.

pmml_

(DataFrame) PMML model. Set to None if no PMML model was requested.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_

(DataFrame) Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns

float

The coefficient of determination R2 of the prediction on the given data.

class hana_ml.algorithms.pal.regression.CoxProportionalHazardModel(conn_context, tie_method=None, status_col=None, max_iter=None, convergence_criterion=None, significance_level=None, calculate_hazard=None, output_fitted=None, type_kind=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Cox proportional hazard model (CoxPHM) is a special generalized linear model. It is a well-known realization of survival analysis that models failure or death at a certain time.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

tie_method : {‘breslow’, ‘efron’}, optional

The method to deal with tied events.

Defaults to ‘efron’.

status_col : bool, optional

If a status column is defined for right-censored data:

  • False : No status column. All response times are failure/death.

  • True : The 3rd column of the data input table is a status column, of which 0 indicates right-censored data and 1 indicates failure/death.

Defaults to True.

max_iter : int, optional

Maximum number of iterations for numeric optimization.

convergence_criterion : float, optional

Convergence criterion of coefficients for numeric optimization.

Defaults to 0.

significance_level : float, optional

Significance level for the confidence interval of estimated coefficients.

Defaults to 0.05.

calculate_hazard : bool, optional

Controls whether to calculate hazard function as well as survival function.

  • False : Does not calculate hazard function.

  • True: Calculates hazard function.

Defaults to True.

output_fitted : bool, optional

Controls whether to output the fitted response:

  • False : Does not output the fitted response.

  • True: Outputs the fitted response.

Defaults to False.

type_kind : str, optional

The prediction type:

  • ‘risk’: Predicts in risk space

  • ‘lp’: Predicts in linear predictor space

Defaults to ‘risk’.

Examples

>>> df1.collect()
    ID  TIME  STATUS  X1  X2
     1     4       1   0   0
     2     3       1   2   0
     3     1       1   1   0
     4     1       0   1   0
     5     2       1   1   1
     6     2       1   0   1
     7     3       0   0   1

Training the model:

>>> cox = CoxProportionalHazardModel(conn_context=conn,
...                                  significance_level=0.05,
...                                  calculate_hazard=True,
...                                  type_kind='risk')
>>> cox.fit(data=df1, key='ID', features=['STATUS', 'X1', 'X2'], label='TIME')

Prediction:

>>> df2.collect()
    ID      X1      X2
    1       0       0
    2       2       0
    3       1       0
    4       1       0
    5       1       1
    6       0       1
    7       0       1
>>> cox.predict(data=df2, key='ID', features=['X1', 'X2']).collect()
    ID       PREDICTION        SE         CI_LOWER     CI_UPPER
    1       0.383590423     0.412526262     0.046607574     3.157032199
    2       1.829758442     1.385833778     0.414672719     8.073875617
    3       0.837781484     0.400894077     0.32795551      2.140161678
    4       0.837781484     0.400894077     0.32795551      2.140161678

Attributes

statistics_

(DataFrame) Regression-related statistics, such as r-square, log-likelihood and AIC.

coefficient_

(DataFrame) Fitted regression coefficients.

covariance_variance

(DataFrame) Covariance-related data.

hazard_

(DataFrame) Statistics related to Time, Hazard, Survival.

fitted_

(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

Methods

fit(data[, key, features, label])

Fit regression model based on training data.

predict(data, key[, features])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

Returns

DataFrame

Predicted values, structured as follows:

  • ID column, with same name and type as data ‘s ID column.

  • VALUE, type DOUBLE, representing predicted values.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column. (This is not the PAL default.)

Returns

float

The coefficient of determination R2 of the prediction on the given data.
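
As a hedged sketch (mirroring the fit() call from the class example above, not taken from the PAL documentation), score() can be evaluated on the labeled training dataframe df1 and returns the R2 value as a plain float:

>>> r2 = cox.score(data=df1, key='ID', features=['STATUS', 'X1', 'X2'], label='TIME')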

hana_ml.algorithms.pal.som

This module contains PAL wrapper for SOM algorithm. The following class is available:

class hana_ml.algorithms.pal.som.SOM(conn_context, convergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

convergence_criterion : float, optional

If the largest difference between successive maps is less than this value, the calculation is regarded as converged and SOM stops.

Defaults to 1.0e-6.

normalization : {‘0’, ‘1’, ‘2’}, int, optional

Normalization type:

  • 0: No

  • 1: Transform to new range (0.0, 1.0)

  • 2: Z-score normalization

Defaults to 0.

random_seed : {‘1’, ‘0’, ‘Other value’}, int, optional

  • 1: Random

  • 0: Sets every weight to zero

  • Other value: Uses this value as seed

Defaults to -1.

height_of_map : int, optional

Indicates the height of the map.

Defaults to 10.

width_of_map : int, optional

Indicates the width of the map.

Defaults to 10.

kernel_function : int, optional

Represents the neighborhood kernel function.

  • 1: Gaussian

  • 2: Bubble/Flat

Defaults to 1.

alpha : float, optional

Specifies the learning rate.

Defaults to 0.5

learning_rate : int, optional

Indicates the decay function for learning rate.

  • 1: Exponential

  • 2: Linear

Defaults to 1.

shape_of_grid : int, optional

Indicates the shape of the grid.

  • 1: Rectangle

  • 2: Hexagon

Defaults to 2.

radius : float, optional

Specifies the scan radius.

Defaults to the larger of height_of_map and width_of_map.

batch_som : {‘0’, ‘1’}, int, optional

Indicates whether batch SOM is carried out.

  • 0: Classical SOM

  • 1: Batch SOM

For batch SOM, kernel_function is always Gaussian, and the learning_rate factors take no effect.

Defaults to 0.

max_iter : int, optional

Maximum number of iterations. Note that the training might not converge if this value is too small, for example, less than 1000.

Defaults to 1000 plus 500 times the number of neurons in the lattice.

Examples

Input dataframe df for clustering:

>>> df.collect()
    TRANS_ID    V000    V001
0      0        0.10    0.20
1      1        0.22    0.25
2      2        0.30    0.40
...
18     18       55.30   50.40
19     19       50.40   56.50

Create SOM instance:

>>> som = SOM(conn_context=conn, convergence_criterion=1.0e-6, normalization=0,
...           random_seed=1, height_of_map=4, width_of_map=4,
...           kernel_function=1, alpha=None,
...           learning_rate=1, shape_of_grid=2,
...           radius=None, batch_som=0, max_iter=4000)

Perform fit on the given data:

>>> som.fit(data=df, key='TRANS_ID')

Expected output:

>>> som.map_.collect().head(3)
        CLUSTER_ID  WEIGHT_V000    WEIGHT_V001    COUNT
    0    0          52.837688      53.465327      2
    1    1          50.150251      49.245226      2
    2    2          18.597607      27.174590      0
>>> som.labels_.collect().head(3)
           TRANS_ID    BMU       DISTANCE    SECOND_BMU  IS_ADJACENT
    0           0      15          0.342564        14      1
    1           1      15          0.239676        14      1
    2           2      15          0.073968        14      1
>>> som.model_.collect()
        ROW_INDEX      MODEL_CONTENT
  0      0             {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...

After the model is trained, use it to predict cluster assignments for the input dataframe df2:

>>> df2.collect()
    TRANS_ID    V000    V001
0      33       0.2     0.10
1      34       1.2     4.1

Perform predict on the given data:

>>> label = som.predict(data=df2, key='TRANS_ID')

Expected output:

>>> label.collect()
    TRANS_ID    CLUSTER_ID     DISTANCE
0    33          15            0.388460
1    34          11            0.156418

Attributes

map_

(DataFrame) The map after training, structured as follows:

  • 1st column: CLUSTER_ID, int. Unit cell ID.

  • Other columns except the last one: the FEATURE columns of the input data, prefixed with “WEIGHT_”, float. Weight vectors used to simulate the original tuples.

  • Last column: COUNT, int. Number of original tuples that each unit cell contains.

labels_

(DataFrame) The labels of the input data, structured as follows:

  • 1st column: ID, with the same name and data type as the input table’s ID column. ID of the original tuples.

  • 2nd column: BMU, int. Best match unit (BMU).

  • 3rd column: DISTANCE, float. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, int. Second BMU.

  • 5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent: 0 for not adjacent, 1 for adjacent.

model_

(DataFrame) The SOM model.

Methods

fit(data, key[, features])

Fit the SOM model when given the training dataset.

fit_predict(data, key[, features])

Fit the dataset and return the labels.

predict(data, key[, features])

Assign clusters to data based on a fitted model.

fit(data, key, features=None)

Fit the SOM model when given the training dataset.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Fit the dataset and return the labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

The labels of the given data, structured as follows:

  • 1st column: ID, with the same name and data type as the input table’s ID column. ID of the original tuples.

  • 2nd column: BMU, int. Best match unit (BMU).

  • 3rd column: DISTANCE, float. The distance between the tuple and its BMU.

  • 4th column: SECOND_BMU, int. Second BMU.

  • 5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent.
    • 0: Not adjacent

    • 1: Adjacent
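
A minimal usage sketch, assuming the same df and som instance as in the class example above; fit_predict() fits the model and returns the label table in one call:

>>> labels = som.fit_predict(data=df, key='TRANS_ID')
>>> labels.head(3).collect()   # one row per TRANS_ID with BMU, DISTANCE, SECOND_BMU, IS_ADJACENT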

predict(data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters

data : DataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

key : str

Name of ID column.

features : list of str, optional.

Names of feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type int, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and its assigned cluster unit (BMU).

hana_ml.algorithms.pal.stats

This module contains Python wrappers for statistics algorithms.

The following functions are available:

hana_ml.algorithms.pal.stats.chi_squared_goodness_of_fit(conn_context, data, key, observed_data=None, expected_freq=None)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

Parameters

conn_context : ConnectionContext

Database connection object.

data : DataFrame

Input data.

key : str

Name of the ID column.

observed_data : str, optional

Name of column for counts of actual observations belonging to each category. If not given, the input dataframe must only have three columns. The first of the non-ID columns will be observed_data.

expected_freq : str, optional

Name of the expected frequency column. If not given, the input dataframe must only have three columns. The second of the non-ID columns will be expected_freq.

Returns

DataFrame

Comparison between the actual counts and the expected counts, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • Observed data column, with same name as data’s observed_data column, but always with type DOUBLE.

  • EXPECTED, type DOUBLE, expected count in each category.

  • RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of statistics.

  • STAT_VALUE, type DOUBLE, value of statistics.

Examples

Data to test:

>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1

Perform chi_squared_goodness_of_fit:

>>> res, stat = chi_squared_goodness_of_fit(conn_context=conn, data=df, key='ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
hana_ml.algorithms.pal.stats.chi_squared_independence(conn_context, data, key, observed_data=None, correction=False)

Perform the chi-squared test of independence to tell whether observations of two variables are independent from each other.

Parameters

conn_context : ConnectionContext

Database connection object.

data : DataFrame

Input data.

key : str

Name of the ID column.

observed_data : list of str, optional

Names of the observed data columns. If not given, it defaults to all the non-ID columns.

correction : bool, optional

If True, and the degrees of freedom is 1, apply Yates’s correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.

Defaults to False.
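
For reference, the continuity-corrected statistic takes the standard textbook form (shown here for orientation, not quoted from the PAL documentation):

\chi^2_{\text{Yates}} = \sum_{i} \frac{(\,|O_i - E_i| - 0.5\,)^2}{E_i}

where O_i and E_i denote the observed and expected counts of cell i.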

Returns

DataFrame

The expected count table, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • Expected count columns, named by prepending Expected_ to each observed_data column name, type DOUBLE. There will be as many columns here as there are observed_data columns.

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of statistics.

  • STAT_VALUE, type DOUBLE, value of statistics.

Examples

Data to test:

>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0

Perform chi-squared test of independence:

>>> res, stats = chi_squared_independence(conn_context=conn, data=df, key='ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
hana_ml.algorithms.pal.stats.ttest_1samp(conn_context, data, col=None, mu=0, test_type='two_sides', conf_level=0.95)

Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.

Parameters

conn_context : ConnectionContext

Database connection object.

data : DataFrame

DataFrame containing the data.

col : str, optional

Name of the column for sample. If not given, the input dataframe must only have one column.

mu : float, optional

Hypothesized mean of the population underlying the sample.

Defaults to 0.

test_type : {‘two_sides’, ‘less’, ‘greater’}, optional

The alternative hypothesis type.

Defaults to ‘two_sides’.

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval.

Defaults to 0.95.

Returns

DataFrame

DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1
0  1.0
1  2.0
2  4.0
3  7.0
4  3.0

Perform One Sample T-Test

>>> ttest_1samp(conn_context=conn, data=df).collect()
           STAT_NAME  STAT_VALUE
0            t-value    3.302372
1  degree of freedom    4.000000
2            p-value    0.029867
3      _PAL_MEAN_X1_    3.400000
4   confidence level    0.950000
5         lowerLimit    0.541475
6         upperLimit    6.258525
hana_ml.algorithms.pal.stats.ttest_ind(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', var_equal=False, conf_level=0.95)

Perform the T-test for the mean difference of two independent samples.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

DataFrame containing the data.

col1 : str, optional

Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of the columns will be col1.

col2 : str, optional

Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the columns will be col2.

mu : float, optional

Hypothesized difference between the two underlying population means.

Defaults to 0.

test_type : {‘two_sides’, ‘less’, ‘greater’}, optional

The alternative hypothesis type.

Defaults to ‘two_sides’.

var_equal : bool, optional

Controls whether to assume that the two samples have equal variance.

Defaults to False.

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval.

Defaults to 0.95.

Returns

DataFrame

DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  NaN  10.0

Perform Independent Sample T-Test

>>> ttest_ind(conn_context=conn, data=df).collect()
           STAT_NAME  STAT_VALUE
0            t-value   -5.013774
1  degree of freedom    5.649757
2            p-value    0.002875
3      _PAL_MEAN_X1_    3.500000
4      _PAL_MEAN_X2_   11.600000
5   confidence level    0.950000
6         lowerLimit  -12.113278
7         upperLimit   -4.086722
hana_ml.algorithms.pal.stats.ttest_paired(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', conf_level=0.95)

Perform the t-test for the mean difference of two sets of paired samples.

Parameters

conn_context : ConnectionContext

Database connection object.

data : DataFrame

DataFrame containing the data.

col1 : str, optional

Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of two columns will be col1.

col2 : str, optional

Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the two columns will be col2.

mu : float, optional

Hypothesized difference between two underlying population means.

Defaults to 0.

test_type : {‘two_sides’, ‘less’, ‘greater’}, optional

The alternative hypothesis type.

Defaults to ‘two_sides’.

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval.

Defaults to 0.95.

Returns

DataFrame

DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  3.0  10.0

Perform Paired Sample T-Test

>>> ttest_paired(conn_context=conn, data=df).collect()
                STAT_NAME  STAT_VALUE
0                 t-value  -14.062884
1       degree of freedom    4.000000
2                 p-value    0.000148
3  _PAL_MEAN_DIFFERENCES_   -8.200000
4        confidence level    0.950000
5              lowerLimit   -9.818932
6              upperLimit   -6.581068
hana_ml.algorithms.pal.stats.f_oneway(conn_context, data, group=None, sample=None, multcomp_method=None, significance_level=None)

Performs a 1-way ANOVA.

The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

Input data.

group : str, optional

Name of the group column. If group is not provided, defaults to the first column.

sample : str, optional

Name of the sample measurement column. If sample is not provided, data must have exactly 1 non-group column and sample defaults to that column.

multcomp_method : {‘tukey-kramer’, ‘bonferroni’, ‘dunn-sidak’, ‘scheffe’, ‘fisher-lsd’}, str, optional

Method used to perform multiple comparison tests.

Defaults to ‘tukey-kramer’.

significance_level : float, optional

The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1.

Defaults to 0.05.

Returns

DataFrame

Statistics for each group, structured as follows:

  • GROUP, type NVARCHAR(256), group name.

  • VALID_SAMPLES, type INTEGER, number of valid samples.

  • MEAN, type DOUBLE, group mean.

  • SD, type DOUBLE, group standard deviation.

Computed results for ANOVA, structured as follows:

  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.

  • SUM_OF_SQUARES, type DOUBLE, sum of squares.

  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.

  • MEAN_SQUARES, type DOUBLE, mean squares.

  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.

  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.

Multiple comparison results, structured as follows:

  • FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.

  • SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.

  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.

  • SE, type DOUBLE, standard error computed from all data.

  • P_VALUE, type DOUBLE, p-value.

  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.

  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.

Examples

Samples for One Way ANOVA test:

>>> df.collect()
   GROUP  DATA
0      A   4.0
1      A   5.0
2      A   4.0
3      A   3.0
4      A   2.0
5      A   4.0
6      A   3.0
7      A   4.0
8      B   6.0
9      B   8.0
10     B   4.0
11     B   5.0
12     B   4.0
13     B   6.0
14     B   5.0
15     B   8.0
16     C   6.0
17     C   7.0
18     C   6.0
19     C   6.0
20     C   7.0
21     C   5.0

Perform one-way ANOVA test:

>>> stats, anova, mult_comp= f_oneway(conn_context=conn, data=df,
...                                   multcomp_method='Tukey-Kramer',
...                                   significance_level=0.05)

Outputs:

>>> stats.collect()
   GROUP  VALID_SAMPLES      MEAN        SD
0      A              8  3.625000  0.916125
1      B              8  5.750000  1.581139
2      C              6  6.166667  0.752773
3  Total             22  5.090909  1.600866
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group       27.609848                 2.0     13.804924
1              Error       26.208333                19.0      1.379386
2              Total       53.818182                21.0           NaN
     F_RATIO   P_VALUE
0  10.008021  0.001075
1        NaN       NaN
2        NaN       NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0           A            B        -2.125000  0.587236  0.004960 -3.616845
1           A            C        -2.541667  0.634288  0.002077 -4.153043
2           B            C        -0.416667  0.634288  0.790765 -2.028043
   CI_UPPER
0 -0.633155
1 -0.930290
2  1.194710
hana_ml.algorithms.pal.stats.f_oneway_repeated(conn_context, data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)

Performs one-way repeated measures analysis of variance, along with Mauchly’s Test of Sphericity and post hoc multiple comparison tests.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

Input data.

subject_id : str

Name of the subject ID column. The algorithm treats each row of the data table as a different subject. Hence there should be no duplicate subject IDs in this column.

measures : list of str, optional

Names of the groups (measures). If measures is not provided, defaults to all non-subject_id columns.

multcomp_method : {‘tukey-kramer’, ‘bonferroni’, ‘dunn-sidak’, ‘scheffe’, ‘fisher-lsd’}, optional

Method used to perform multiple comparison tests.

Defaults to ‘bonferroni’.

significance_level : float, optional

The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1.

Defaults to 0.05.

se_type : {‘all-data’, ‘two-group’}, optional

Type of standard error used in multiple comparison tests.

  • ‘all-data’: computes the standard error from all data. It has more power if the assumption of sphericity is true, especially with small data sets.

  • ‘two-group’: computes the standard error from only the two groups being compared. It doesn’t assume sphericity.

Defaults to ‘two-group’.

Returns

DataFrame

Statistics for each group, structured as follows:

  • GROUP, type NVARCHAR(256), group name.

  • VALID_SAMPLES, type INTEGER, number of valid samples.

  • MEAN, type DOUBLE, group mean.

  • SD, type DOUBLE, group standard deviation.

Mauchly test results, structured as follows:

  • STAT_NAME, type NVARCHAR(100), names of test result quantities.

  • STAT_VALUE, type DOUBLE, values of test result quantities.

Computed results for ANOVA, structured as follows:

  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.

  • SUM_OF_SQUARES, type DOUBLE, sum of squares.

  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.

  • MEAN_SQUARES, type DOUBLE, mean squares.

  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.

  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.

  • P_VALUE_GG, type DOUBLE, p-value of Greenhouse-Geisser correction.

  • P_VALUE_HF, type DOUBLE, p-value of Huynh-Feldt correction.

  • P_VALUE_LB, type DOUBLE, p-value of lower bound correction.

Multiple comparison results, structured as follows:

  • FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.

  • SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.

  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.

  • SE, type DOUBLE, standard error computed from all data or compared two groups, depending on se_type.

  • P_VALUE, type DOUBLE, p-value.

  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.

  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.

Examples

Samples for One Way Repeated ANOVA test:

>>> df.collect()
  ID  MEASURE1  MEASURE2  MEASURE3  MEASURE4
0  1       8.0       7.0       1.0       6.0
1  2       9.0       5.0       2.0       5.0
2  3       6.0       2.0       3.0       8.0
3  4       5.0       3.0       1.0       9.0
4  5       8.0       4.0       5.0       8.0
5  6       7.0       5.0       6.0       7.0
6  7      10.0       2.0       7.0       2.0
7  8      12.0       6.0       8.0       1.0

Perform one-way repeated measures ANOVA test:

>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     conn_context=conn,
...     data=df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')

Outputs:

>>> stats.collect()
      GROUP  VALID_SAMPLES   MEAN        SD
0  MEASURE1              8  8.125  2.232071
1  MEASURE2              8  4.250  1.832251
2  MEASURE3              8  4.125  2.748376
3  MEASURE4              8  5.750  2.915476
>>> mtest.collect()
                    STAT_NAME  STAT_VALUE
0                 Mauchly's W    0.136248
1                  Chi-Square   11.405981
2                          df    5.000000
3                      pValue    0.046773
4  Greenhouse-Geisser Epsilon    0.532846
5         Huynh-Feldt Epsilon    0.665764
6         Lower bound Epsilon    0.333333
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group          83.125                 3.0     27.708333
1            Subject          17.375                 7.0      2.482143
2              Error         153.375                21.0      7.303571
    F_RATIO  P_VALUE  P_VALUE_GG  P_VALUE_HF  P_VALUE_LB
0  3.793806  0.02557    0.062584    0.048331    0.092471
1       NaN      NaN         NaN         NaN         NaN
2       NaN      NaN         NaN         NaN         NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0    MEASURE1     MEASURE2            3.875  0.811469  0.012140  0.924655
1    MEASURE1     MEASURE3            4.000  0.731925  0.005645  1.338861
2    MEASURE1     MEASURE4            2.375  1.792220  1.000000 -4.141168
3    MEASURE2     MEASURE3            0.125  1.201747  1.000000 -4.244322
4    MEASURE2     MEASURE4           -1.500  1.336306  1.000000 -6.358552
5    MEASURE3     MEASURE4           -1.625  1.821866  1.000000 -8.248955
   CI_UPPER
0  6.825345
1  6.661139
2  8.891168
3  4.494322
4  3.358552
5  4.998955
hana_ml.algorithms.pal.stats.univariate_analysis(conn_context, data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)

Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

Input data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

cols : list of str, optional

List of column names to analyze. If cols is not provided, it defaults to all non-ID columns.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. By default, INTEGER columns are treated as continuous.

significance_level : float, optional

The significance level when the function calculates the confidence interval of the sample mean. Values must be greater than 0 and less than 1.

Defaults to 0.05.

trimmed_percentage : float, optional

The ratio of data at both head and tail that will be dropped in the process of calculating the trimmed mean. Value range is from 0 to 0.5.

Defaults to 0.05.

Returns

DataFrame

Statistics for continuous variables, structured as follows:

  • VARIABLE_NAME, type NVARCHAR(256), variable names.

  • STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).

  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.

Statistics for categorical variables, structured as follows:

  • VARIABLE_NAME, type NVARCHAR(256), variable names.

  • CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.

  • STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).

  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.

Examples

Dataset to be analyzed:

>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B

Perform univariate analysis:

>>> continuous, categorical = univariate_analysis(
...     conn_context=conn,
...     data=df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)

Outputs:

>>> continuous.collect()
   VARIABLE_NAME                 STAT_NAME   STAT_VALUE
0             X1        valid observations     6.000000
1             X1                       min   -10.200000
2             X1            lower quartile     1.200000
3             X1                    median     3.850000
4             X1            upper quartile     8.500000
5             X1                       max   100.000000
6             X1                      mean    17.866667
7             X1  CI for mean, lower bound   -24.879549
8             X1  CI for mean, upper bound    60.612883
9             X1              trimmed mean     4.350000
10            X1                  variance  1659.142667
11            X1        standard deviation    40.732575
12            X1                  skewness     1.688495
13            X1                  kurtosis     1.036148
14            X2        valid observations     0.000000
>>> categorical.collect()
   VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
0             X3  __PAL_NULL__          count    0.000000
1             X3  __PAL_NULL__  percentage(%)    0.000000
2             X3             1          count    1.000000
3             X3             1  percentage(%)   16.666667
4             X3             2          count    3.000000
5             X3             2  percentage(%)   50.000000
6             X3             3          count    2.000000
7             X3             3  percentage(%)   33.333333
8             X4  __PAL_NULL__          count    0.000000
9             X4  __PAL_NULL__  percentage(%)    0.000000
10            X4             A          count    3.000000
11            X4             A  percentage(%)   50.000000
12            X4             B          count    1.000000
13            X4             B  percentage(%)   16.666667
14            X4             C          count    2.000000
15            X4             C  percentage(%)   33.333333
hana_ml.algorithms.pal.stats.covariance_matrix(conn_context, data, cols=None)

Computes the covariance matrix.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

Input data.

cols : list of str, optional

List of column names to analyze. If cols is not provided, it defaults to all columns.

Returns

DataFrame

Covariance between any two data samples (columns).

  • ID, type NVARCHAR. The values of this column are the column names from cols.

  • Covariance columns, type DOUBLE, named after the columns in cols. The covariance between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the covariance matrix:

>>> result = covariance_matrix(conn_context=conn, data=df)

Outputs:

>>> result.collect()
  ID          X           Y
0  X  31.866667   44.473333
1  Y  44.473333  176.677667
hana_ml.algorithms.pal.stats.pearsonr_matrix(conn_context, data, cols=None)

Computes a correlation matrix using Pearson’s correlation coefficient.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

Input data.

cols : list of str, optional

List of column names to analyze. If cols is not provided, it defaults to all columns.

Returns

DataFrame

Pearson’s correlation coefficient between any two data samples (columns).

  • ID, type NVARCHAR. The values of this column are the column names from cols.

  • Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the Pearson’s correlation coefficient matrix:

>>> result = pearsonr_matrix(conn_context=conn, data=df)

Outputs:

>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
hana_ml.algorithms.pal.stats.iqr(conn_context, data, key, col=None, multiplier=None)

Perform the inter-quartile range (IQR) test to find the outliers of the data. The inter-quartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1) of the data. Data points will be marked as outliers if they fall outside the range from Q1 - multiplier * IQR to Q3 + multiplier * IQR.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

col : str, optional

Name of the data column that needs to be tested. If not given, the input dataframe must only have two columns including the ID column. The non-ID column will be col.

multiplier : float, optional

The multiplier used to calculate the value range during the IQR test. Upper-bound = Q3 + multiplier * IQR.

Lower-bound = Q1 - multiplier * IQR.

Q1 is equal to 25th percentile and Q3 is equal to 75th percentile.

Defaults to 1.5.

Returns

DataFrame

Test results, structured as follows:

  • ID, with same name and type as data’s ID column.

  • IS_OUT_OF_RANGE, type INTEGER, containing the test results from the IQR test that determine whether each data sample is in the range or not:

    • 0: a value is in the range.

    • 1: a value is out of range.

Statistical outputs, including Upper-bound and Lower-bound from the IQR test, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistics name.

  • STAT_VALUE, type DOUBLE, statistics value.

Examples

Original data:

>>> df.collect()
     ID   VAL
0    P1  10.0
1    P2  11.0
2    P3  10.0
3    P4   9.0
4    P5  10.0
5    P6  24.0
6    P7  11.0
7    P8  12.0
8    P9  10.0
9   P10   9.0
10  P11   1.0
11  P12  11.0
12  P13  12.0
13  P14  13.0
14  P15  12.0

Perform the IQR test:

>>> res, stat = iqr(conn_context=conn, data=df, key='ID', col='VAL', multiplier=1.5)
>>> res.collect()
         ID  IS_OUT_OF_RANGE
0    P1                0
1    P2                0
2    P3                0
3    P4                0
4    P5                0
5    P6                1
6    P7                0
7    P8                0
8    P9                0
9   P10                0
10  P11                1
11  P12                0
12  P13                0
13  P14                0
14  P15                0
>>> stat.collect()
        STAT_NAME  STAT_VALUE
0  lower quartile        10.0
1  upper quartile        12.0
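
As a follow-up sketch (assuming the hana_ml DataFrame filter() method and the res result from the call above), the flagged outliers can be selected directly; in this example those are rows P6 and P11:

>>> res.filter('IS_OUT_OF_RANGE = 1').collect()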

hana_ml.algorithms.pal.svm

This module contains PAL wrapper for Support Vector Machine algorithms.

The following classes are available:

class hana_ml.algorithms.pal.svm.SVC(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Classification.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

c : float, optional

Trade-off between training error and margin. Value range > 0.

Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to ‘rbf’.

degree : int, optional

Coefficient for the ‘poly’ kernel type. Value range >= 1.

Defaults to 3.

gamma : float, optional

Coefficient for the ‘rbf’ kernel type.

Defaults to 1.0/number of features in the dataset.

Only valid when kernel is ‘rbf’.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type.

Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type.

Defaults to 0.

probability : bool, optional

If True, output probability during prediction.

Defaults to False.

shrink : bool, optional

If True, use shrink strategy.

Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range > 0.

Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range >= 0.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.

  • ‘standardization’ : Transforms the data to have zero mean and unit variance.

  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to ‘standardization’.

handle_missing : bool, optional

Whether to handle missing values:

False: No,

True: Yes.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

category_weight : float, optional

Represents the weight of category attributes. Value range > 0.

Defaults to 0.707.

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4  LABEL
0   0         1.0        10.0       100.0          A      1
1   1         1.1        10.1       100.0          A      1
2   2         1.2        10.2       100.0          A      1
3   3         1.3        10.4       100.0          A      1
4   4         1.2        10.3       100.0         AB      1
5   5         4.0        40.0       400.0         AB      2
6   6         4.1        40.1       400.0         AB      2
7   7         4.2        40.2       400.0         AB      2
8   8         4.3        40.4       400.0         AB      2
9   9         4.2        40.3       400.0         AB      2

Create SVC instance and call fit:

>>> svc = svm.SVC(connection_context, gamma=0.005, handle_missing=False)
>>> svc.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                        'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> df_predict = connection_context.table("SVC_PREDICT_DATA_TBL")
>>> df_predict.collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.2        10.2       100.0          A
2   2         4.1        40.1       400.0         AB
3   3         4.2        40.3       400.0         AB
4   4         9.1        90.1       900.0          A
5   5         9.2        90.2       900.0          A
6   6         4.0        40.0       400.0          A

Call predict:

>>> res = svc.predict(df_predict, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                                      'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> res.collect()
   ID SCORE PROBABILITY
0   0     1        None
1   1     1        None
2   2     2        None
3   3     2        None
4   4     3        None
5   5     3        None
6   6     2        None

Attributes

model_

(DataFrame) Model content.

stat_

(DataFrame) Statistics content.

Methods

fit(data[, key, features, label, …])

Fit the model when given training dataset and other attributes.

predict(data, key[, features, verbose])

Predict the dataset using the trained model.

score(data, key[, features, label])

Returns the accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset and other attributes.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, verbose=False)

Predict the dataset using the trained model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

verbose : bool, optional

If True, output scoring probabilities for each class. It is only applicable when probability is True during instance creation.

Defaults to False.

Returns

DataFrame

Predict result, structured as follows:
  • ID column, with the same name and type as data ‘s ID column.

  • SCORE, type NVARCHAR(100), prediction value.

  • PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.

score(data, key, features=None, label=None)

Returns the accuracy on the given test data and labels.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns

float

Scalar accuracy value comparing the predicted result and original label.
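
A minimal sketch of score(), reusing the labeled training dataframe df_fit from the class example above (training-set accuracy, for illustration only; the hedged call simply returns a float):

>>> acc = svc.score(data=df_fit, key='ID',
...                 features=['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4'],
...                 label='LABEL')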

class hana_ml.algorithms.pal.svm.SVR(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, scale_label=None, handle_missing=True, categorical_variable=None, category_weight=None, regression_eps=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Regression.

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

c : float, optional

Trade-off between training error and margin. Value range > 0.

Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to ‘rbf’.

degree : int, optional

Coefficient for the ‘poly’ kernel type. Value range >= 1.

Defaults to 3.

gamma : float, optional

Coefficient for the ‘rbf’ kernel type.

Defaults to 1.0/number of features in the dataset

Only valid when kernel is ‘rbf’.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type.

Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type.

Defaults to 0.

shrink : bool, optional

If True, use shrink strategy.

Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range > 0.

Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range >= 0.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.

  • ‘standardization’ : Transforms the data to have zero mean and unit variance.

  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to ‘standardization’.

scale_label : bool, optional

If True, standardize the label for SVR. It is only applicable when the scale_info is standardization.

Defaults to True.

handle_missing : bool, optional

Whether to handle missing values:

False: No,

True: Yes.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

category_weight : float, optional

Represents the weight of category attributes. Value range > 0.

Defaults to 0.707.

regression_eps : float, optional

Epsilon width of tube for regression.

Defaults to 0.1.

Examples

Training data:

>>> df_fit.collect()
    ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5       VALUE
0    0    0.788606    0.787308   -1.301485    1.226053   -0.533385   95.626483
1    1    0.414869   -0.381038   -0.719309    1.603499    1.557837  162.582000
2    2    0.236282   -1.118764    0.233341   -0.698410    0.387380  -56.564303
3    3   -0.087779   -0.462372   -0.038412   -0.552897    1.231209  -32.241614
4    4   -0.476389    1.836772   -0.292337   -1.364599    1.326768 -143.240878
5    5    0.523326    0.065154   -1.513822    0.498921   -0.590686   -5.237827
6    6   -1.425838   -0.900437   -0.672299    0.646424    0.508856  -43.005837
7    7   -1.601836    0.455530    0.438217   -0.860707   -0.338282 -126.389824
8    8    0.266698   -0.725057    0.462189    0.868752   -1.542683   46.633594
9    9   -0.772496   -2.192955    0.822904   -1.125882   -0.946846 -175.356260
10  10    0.492364   -0.654237   -0.226986   -0.387156   -0.585063  -49.213910
11  11    0.378409   -1.544976    0.622448   -0.098902    1.437910   34.788276
12  12    0.317183    0.473067   -1.027916    0.549077    0.013483   32.845141
13  13    1.340660   -1.082651    0.730509   -0.944931    0.351025   -6.500411
14  14    0.736456    1.649251    1.334451   -0.530776    0.280830   87.451863

Create SVR instance and call fit:

>>> svr = svm.SVR(conn, kernel='linear', scale_info='standardization',
...               scale_label=True, handle_missing=False)
>>> svr.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                        'ATTRIBUTE4', 'ATTRIBUTE5'])
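Prediction and scoring follow the same pattern as the other SVM classes. The sketch below is illustrative only: the table name DATA_TBL_SVR_PREDICT is hypothetical, and it assumes a prediction table with the same attribute columns plus an ID column.

>>> df_predict = conn.table("DATA_TBL_SVR_PREDICT")  # hypothetical table name
>>> svr.predict(df_predict, key='ID',
...             features=['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                       'ATTRIBUTE4', 'ATTRIBUTE5']).head(5).collect()
>>> svr.score(df_fit, key='ID')  # coefficient of determination R^2 on the training data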

Attributes

model_

(DataFrame) Model content.

stat_

(DataFrame) Statistics content.

Methods

fit(data, key[, features, label, …])

Fit the model when given training dataset and other attributes.

predict(data, key[, features])

Predict the dataset using the trained model.

score(data, key[, features, label])

Returns the coefficient of determination R^2 of the prediction.

fit(data, key, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset and other attributes.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None)

Predict the dataset using the trained model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Predict result, structured as follows:

  • ID column, with the same name and type as data’s ID column.

  • SCORE, type NVARCHAR(100), prediction value.

  • PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.

score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns

float

Returns the coefficient of determination R2 of the prediction.

class hana_ml.algorithms.pal.svm.SVRanking(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Ranking

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

c : float, optional

Trade-off between training error and margin. Value range > 0.

Defaults to 100.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to ‘rbf’.

degree : int, optional

Coefficient for the ‘poly’ kernel type. Value range >= 1.

Defaults to 3.

gamma : float, optional

Coefficient for the ‘rbf’ kernel type.

Defaults to 1.0/number of features in the dataset.

Only valid when kernel is ‘rbf’.

coef_lin : float, optional

Coefficient for the ‘poly’/’sigmoid’ kernel type.

Defaults to 0.

coef_const : float, optional

Coefficient for the ‘poly’/’sigmoid’ kernel type.

Defaults to 0.

probability : bool, optional

If True, output probability during prediction.

Defaults to False.

shrink : bool, optional

If True, use shrink strategy.

Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range > 0.

Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range >= 0.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.

  • ‘standardization’ : Transforms the data to have zero mean and unit variance.

  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to ‘standardization’.

handle_missing : bool, optional

Whether to handle missing values:
  • False: No,

  • True: Yes.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

category_weight : float, optional

Represents the weight of category attributes. Value range > 0.

Defaults to 0.707.

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID  LABEL
0   0         1.0         1.0         0.0         0.2         0.0  qid:1      3
1   1         0.0         0.0         1.0         0.1         1.0  qid:1      2
2   2         0.0         0.0         1.0         0.3         0.0  qid:1      1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1      4
4   4         3.0         1.0         1.0         0.4         1.0  qid:1      5
5   5         4.0         1.0         1.0         0.7         0.0  qid:1      6
6   6         0.0         0.0         1.0         0.2         0.0  qid:2      1
7   7         1.0         0.0         1.0         0.4         0.0  qid:2      2
8   8         0.0         0.0         1.0         0.2         0.0  qid:2      1
9   9         1.0         1.0         1.0         0.2         0.0  qid:2      3

Create SVRanking instance and call fit:

>>> svranking = svm.SVRanking(conn, gamma=0.005)
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4',
...             'ATTRIBUTE5']
>>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')

Call predict:

>>> df_predict = conn.table("DATA_TBL_SVRANKING_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID
0   0         1.0         1.0         0.0         0.2         0.0  qid:1
1   1         0.0         0.0         1.0         0.1         1.0  qid:1
2   2         0.0         0.0         1.0         0.3         0.0  qid:1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1
4   4         3.0         1.0         1.0         0.4         1.0  qid:1
5   5         4.0         1.0         1.0         0.7         0.0  qid:1
6   6         0.0         0.0         1.0         0.2         0.0  qid:4
7   7         1.0         0.0         1.0         0.4         0.0  qid:4
8   8         0.0         0.0         1.0         0.2         0.0  qid:4
9   9         1.0         1.0         1.0         0.2         0.0  qid:4
>>> svranking.predict(df_predict, key='ID',
...                   features=features, qid='QID').head(10).collect()
    ID     SCORE PROBABILITY
0    0  -9.85138        None
1    1  -10.8657        None
2    2  -11.6741        None
3    3  -9.33985        None
4    4  -7.88839        None
5    5   -6.8842        None
6    6  -11.7081        None
7    7  -10.8003        None
8    8  -11.7081        None
9    9  -10.2583        None

Attributes

model_

(DataFrame) Model content.

stat_

(DataFrame) Statistics content.

.. note::

PAL will throw an error if ``probability``=True is provided to the SVRanking constructor and ``verbose``=True is not provided to predict(). This is a known bug.
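A minimal sketch of the combination described in the note, reusing df_fit, df_predict and features from the example above (the gamma value is carried over from that example and is otherwise illustrative):

>>> svranking = svm.SVRanking(conn, gamma=0.005, probability=True)
>>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')
>>> svranking.predict(df_predict, key='ID', features=features,
...                   qid='QID', verbose=True).head(5).collect()

With probability=True and verbose=True, the PROBABILITY column of the prediction result should be populated rather than NULL.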

Methods

fit(data, key, qid[, features, label, …])

Fit the model when given training dataset and other attributes.

predict(data, key, qid[, features, verbose])

Predict the dataset using the trained model.

fit(data, key, qid, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset and other attributes.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label, non-qid columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variable : str or list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

predict(data, key, qid, features=None, verbose=False)

Predict the dataset using the trained model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-qid columns.

verbose : bool, optional

If True, output scoring probabilities for each class.

Defaults to False.

Returns

DataFrame

Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.

  • SCORE, type NVARCHAR(100), prediction value.

  • PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.

class hana_ml.algorithms.pal.svm.OneClassSVM(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, nu=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

One Class SVM

Parameters

conn_context : ConnectionContext

Connection to the SAP HANA system.

c : float, optional

Trade-off between training error and margin. Value range > 0.

Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to ‘rbf’.

degree : int, optional

Coefficient for the ‘poly’ kernel type. Value range >= 1.

Defaults to 3.

gamma : float, optional

Coefficient for the ‘rbf’ kernel type.

Defaults to 1.0/number of features in the dataset.

Only valid when kernel is ‘rbf’.

coef_lin : float, optional

Coefficient for the ‘poly’/’sigmoid’ kernel type.

Defaults to 0.

coef_const : float, optional

Coefficient for the ‘poly’/’sigmoid’ kernel type.

Defaults to 0.

shrink : bool, optional

If True, use shrink strategy.

Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process.

Value range > 0.

Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection.

Value range >= 0.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

nu : float, optional

The value for both the upper bound of the fraction of training errors and the lower bound of the fraction of support vectors.

Defaults to 0.5.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.

  • ‘standardization’ : Transforms the data to have zero mean and unit variance.

  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to ‘standardization’.

handle_missing : bool, optional

Whether to handle missing values:

False: No,

True: Yes.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

category_weight : float, optional

Represents the weight of category attributes. Value range > 0.

Defaults to 0.707.

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB

Create OneClassSVM instance and call fit:

>>> svc_one = svm.OneClassSVM(conn, scale_info='no', category_weight=1)
>>> svc_one.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                            'ATTRIBUTE4'])
>>> df_predict = conn.table("DATA_TBL_SVC_ONE_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...             'ATTRIBUTE4']

Call predict:

>>> svc_one.predict(df_predict, 'ID', features).head(10).collect()
   ID SCORE PROBABILITY
0   0    -1        None
1   1     1        None
2   2     1        None
3   3    -1        None
4   4    -1        None
5   5    -1        None
6   6    -1        None
7   7     1        None
8   8    -1        None
9   9    -1        None
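Since SCORE is returned as NVARCHAR (see the predict() description below), the collected result holds the scores as strings. A purely illustrative sketch of client-side filtering with pandas after collect():

>>> res = svc_one.predict(df_predict, 'ID', features).collect()
>>> res[res['SCORE'] == '-1']  # string comparison, since SCORE is NVARCHAR in the result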

Attributes

model_

(DataFrame) Model content.

stat_

(DataFrame) Statistics content.

Methods

fit(data[, key, features, categorical_variable])

Fit the model when given training dataset and other attributes.

predict(data, key[, features])

Predict the dataset using the trained model.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given training dataset and other attributes.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None)

Predict the dataset using the trained model.

Parameters

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns

DataFrame

Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.

  • SCORE, type NVARCHAR(100), prediction value.

  • PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.

hana_ml.algorithms.pal.trees

This module contains Python wrappers for PAL decision tree-based algorithms.

The following classes are available:

class hana_ml.algorithms.pal.trees.RandomForestClassifier(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, strata=None, priors=None)

Bases: hana_ml.algorithms.pal.trees._RandomForestBase

Random forest model for classification.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in the random forest.

Defaults to 100.

max_features : int, optional

Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features.

Defaults to sqrt(p) (for classification) or p/3 (for regression), where p is the number of input features.

max_depth : int, optional

The maximum depth of a tree.

By default it is unlimited.

min_samples_leaf : int, optional

Specifies the minimum number of records in a leaf.

Defaults to 1 for classification.

split_threshold : float, optional

Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.

Defaults to 1e-5.

calculate_oob : bool, optional

If True, calculate the out-of-bag error.

Defaults to True.

random_state : int, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as the seed.

Others: Uses the specified value as the seed.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

allow_missing_dependent : bool, optional

Specifies if a missing target value is allowed.

  • False: Not allowed. An error occurs if a missing target is present.

  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string: categorical, or integer and float: continuous. Valid only for integer variables; omitted otherwise.

Default value detected from input data.

sample_fraction : float, optional

The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training.

Defaults to 1.0.

strata : List of tuples: (class, fraction), optional

Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata. If strata is not provided, bagging is used instead of stratified sampling.

priors : List of tuples: (class, prior_prob), optional

Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in ‘priors’. If priors is not provided, it is determined by the proportion of every class in the training data.
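Both strata and priors are passed as lists of (class, value) tuples. A minimal sketch, using the class labels from the example below; the fractions and probabilities are purely illustrative:

>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=10,
...                              strata=[('Play', 0.6), ('Do not Play', 0.4)],
...                              priors=[('Play', 0.7), ('Do not Play', 0.3)])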

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY       LABEL
0    Sunny  75.0      70.0   Yes        Play
1    Sunny   NaN      90.0   Yes Do not Play
2    Sunny  85.0       NaN    No Do not Play
3    Sunny  72.0      95.0    No Do not Play

Creating RandomForestClassifier instance:

>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=3,
...                              max_features=3, random_state=2,
...                              split_threshold=0.00001,
...                              calculate_oob=True,
...                              min_samples_leaf=1, thread_ratio=1.0)

Performing fit() on given dataframe:

>>> rfc.fit(data=df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> rfc.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0       OUTLOOK    0.449550
1          TEMP    0.216216
2      HUMIDITY    0.208108
3         WINDY    0.126126

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK     TEMP  HUMIDITY WINDY
0   0  Overcast     75.0  -10000.0   Yes
1   1      Rain     78.0      70.0   Yes

Performing predict() on given dataframe:

>>> result = rfc.predict(data=df2, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   0  Play    0.666667
1   1  Play    0.666667

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
0   0     Sunny    70      90.0   Yes  Play
1   1  Overcast    81      90.0   Yes  Play
2   2      Rain    65      80.0    No  Play

Performing score() on given dataframe:

>>> rfc.score(df3, key='ID')
0.6666666666666666

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

oob_error_

(DataFrame) Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.

confusion_matrix_

(DataFrame) Confusion matrix used to evaluate the performance of classification algorithms.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, verbose, …])

Predict dependent variable values based on fitted model.

score(data, key[, features, label, …])

Returns the mean accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, verbose=None, block_size=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates that all data is loaded at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:

  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to ‘feature_marginalized’.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted classes.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class.

score(data, key, features=None, label=None, block_size=None, missing_replacement=None)

Returns the mean accuracy on the given test data and labels.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates that all data is loaded at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:
  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to ‘feature_marginalized’.

Returns

float

Mean accuracy on the given test data and labels.

class hana_ml.algorithms.pal.trees.RandomForestRegressor(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=None, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None)

Bases: hana_ml.algorithms.pal.trees._RandomForestBase

Random forest model for regression.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in the random forest.

Defaults to 100.

max_features : int, optional

Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features.

Defaults to sqrt(p) (for classification) or p/3 (for regression), where p is the number of input features.

max_depth : int, optional

The maximum depth of a tree.

By default it is unlimited.

min_samples_leaf : int, optional

Specifies the minimum number of records in a leaf.

Defaults to 5 for regression.

split_threshold : float, optional

Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.

Defaults to 1e-5.

calculate_oob : bool, optional

If True, calculate the out-of-bag error.

Defaults to True.

random_state : int, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as the seed.

Others: Uses the specified value as the seed.

Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

allow_missing_dependent : bool, optional

Specifies if a missing target value is allowed.

  • False: Not allowed. An error occurs if a missing target is present.

  • True: Allowed. The datum with a missing target is removed.

Defaults to True.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string: categorical, or integer and float: continuous. VALID only for integer variables; omitted otherwise.

Default value detected from input data.

sample_fraction : float, optional

The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training.

Defaults to 1.0.

Examples

Input dataframe for training:

>>> df1.head(5).collect()
   ID         A         B         C         D       CLASS
0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935

Creating RandomForestRegressor instance:

>>> rfr = RandomForestRegressor(conn_context=cc, random_state=3)

Performing fit() on given dataframe:

>>> rfr.fit(data=df1, key='ID')
>>> rfr.feature_importances_.collect()
   VARIABLE_NAME  IMPORTANCE
0             A    0.249593
1             B    0.381879
2             C    0.291403
3             D    0.077125

Input dataframe for predicting:

>>> df2.collect()
   ID         A         B         C         D
0   0  1.081277  0.204114  1.220580 -0.750665
1   1  0.524813 -0.012192 -0.418597  2.946886

Performing predict() on given dataframe:

>>> result = rfr.predict(data=df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0    48.126   62.952884
1   1  -10.9017   73.461039

Input dataframe for scoring:

>>> df3.head(5).collect()
    ID         A         B         C         D       CLASS
0    0  1.081277  0.204114  1.220580 -0.750665   139.10170
1    1  0.524813 -0.012192 -0.418597  2.946886    52.17203
2    2 -0.280871  0.100554 -0.343715 -0.118843   -34.69829
3    3 -0.113992 -0.045573  0.957154  0.090350    51.93602
4    4  0.287476  1.266895  0.466325 -0.432323   106.63425

Performing score() on given dataframe:

>>> rfr.score(df3, key='ID')
0.6530768858159514

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

oob_error_

(DataFrame) Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, block_size, …])

Predict dependent variable values based on fitted model.

score(data, key[, features, label, …])

Returns the coefficient of determination R2 of the prediction.

predict(data, key, features=None, block_size=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates that all data is loaded at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:

  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to ‘feature_marginalized’.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted values.

  • CONFIDENCE, all 0s.

The CONFIDENCE column is included because PAL uses the same output table format for classification.

score(data, key, features=None, label=None, block_size=None, missing_replacement=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates that all data is loaded at once.

Defaults to 0.

missing_replacement : str, optional

The missing replacement strategy:

  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to ‘feature_marginalized’.

Returns

float

The coefficient of determination R2 of the prediction on the given data.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

class hana_ml.algorithms.pal.trees.DecisionTreeClassifier(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True)

Bases: hana_ml.algorithms.pal.trees._DecisionTreeBase

Decision Tree model for classification.

Parameters

conn_context : ConnectionContext

Database connection object.

algorithm : {‘c45’, ‘chaid’, ‘cart’}

Algorithm used to grow a decision tree. Case-insensitive.

  • ‘c45’: C4.5 algorithm.

  • ‘chaid’: Chi-square automatic interaction detection.

  • ‘cart’: Classification and regression tree.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

allow_missing_dependent : bool, optional

Specifies if a missing target value is allowed.

  • False: Not allowed. An error occurs if a missing target is present.

  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

percentage : float, optional

Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.

Defaults to 1.0.

min_records_of_parent : int, optional

Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min_records_of_leaf : int, optional

Specifies the minimum number of records in a leaf.

Defaults to 1.

max_depth : int, optional

The maximum depth of a tree.

By default it is unlimited.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string: categorical, or integer and float: continuous. VALID only for integer variables; omitted otherwise.

Default value detected from input data.

split_threshold : float, optional

Specifies the stop condition for a node:

  • ‘c45’: The information gain ratio of the best split is less than this value.

  • ‘chaid’: The p-value of the best split is greater than or equal to this value.

  • ‘cart’: The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the SPLIT_THRESHOLD value is, the larger a ‘c45’ or ‘cart’ tree grows. On the contrary, ‘chaid’ will grow a larger tree with larger SPLIT_THRESHOLD value.

Defaults to 1e-5 for ‘c45’ and ‘cart’, 0.05 for ‘chaid’.

discretization_type : {‘mdlpc’, ‘equal_freq’}, optional

Strategy for discretizing continuous attributes. Case-insensitive.

  • ‘mdlpc’: Minimum description length principle criterion.

  • ‘equal_freq’: Equal frequency discretization.

Valid only for ‘c45’ and ‘chaid’.

Defaults to ‘mdlpc’.

bins : List of tuples: (column name, number of bins), optional

Specifies the number of bins for discretization. Only valid when discretization_type is ‘equal_freq’ (see the sketch after this parameter list).

Defaults to 10 for each column.

max_branch : int, optional

Specifies the maximum number of branches.

Valid only for ‘chaid’.

Defaults to 10.

merge_threshold : float, optional

Specifies the merge condition for ‘chaid’: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.

Only valid for ‘chaid’.

Defaults to 0.05.

use_surrogate : bool, optional

If True, use surrogate split when NULL values are encountered. Only valid for ‘cart’.

Defaults to True.

model_format : {‘json’, ‘pmml’}, optional

Specifies the tree model format for store. Case-insensitive.

  • ‘json’: export model in json format.

  • ‘pmml’: export model in pmml format.

Defaults to ‘json’.

output_rules : bool, optional

If True, output decision rules.

Defaults to True.

priors : List of tuples: (class, prior_prob), optional

Specifies the prior probability of every class label.

Default value detected from data.

output_confusion_matrix : bool, optional

If True, output the confusion matrix.

Defaults to True.
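As referenced in the bins description above, equal-frequency discretization takes per-column bin counts as (column name, number of bins) tuples. A minimal sketch, reusing column names from the example below; the algorithm choice and bin counts are purely illustrative:

>>> dtc_ef = DecisionTreeClassifier(conn_context=cc, algorithm='chaid',
...                                 discretization_type='equal_freq',
...                                 bins=[('TEMP', 5), ('HUMIDITY', 4)])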

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
0    Sunny    75      70.0   Yes         Play
1    Sunny    80      90.0   Yes  Do not Play
2    Sunny    85      85.0    No  Do not Play
3    Sunny    72      95.0    No  Do not Play

Creating DecisionTreeClassifier instance:

>>> dtc = DecisionTreeClassifier(conn_context=cc, algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)

Performing fit() on given dataframe:

>>> dtc.fit(data=df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='CLASS')
>>> dtc.decision_rules_.collect()
   ROW_INDEX                                                  RULES_CONTENT
0         0                                       (TEMP>=84) => Do not Play
1         1                         (TEMP<84) && (OUTLOOK=Overcast) => Play
2         2         (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3         3 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4         4       (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5         5               (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes

Performing predict() on given dataframe:

>>> result = dtc.predict(data=df2, key='ID', verbose=False)
>>> result.collect()
   ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY        LABEL
0   0  Overcast      75.0    70   Yes         Play
1   1      Rain      78.0    70    No  Do not Play
2   2     Sunny      66.0    70   Yes         Play
3   3     Sunny      69.0    70   Yes         Play

Performing score() on given dataframe:

>>> dtc.score(df3, key='ID')
0.75

Attributes

model_

(DataFrame) Trained model content.

decision_rules_

(DataFrame) Rules for decision tree to make decisions. Set to None if output_rules is False.

confusion_matrix_

(DataFrame) Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.

Methods

fit(data[, key, features, label, …])

Function for building a decision tree classifier.

predict(data, key[, features, verbose])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the mean accuracy on the given test data and labels.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Function for building a decision tree classifier.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

score(data, key, features=None, label=None)

Returns the mean accuracy on the given test data and labels.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

Returns

float

Mean accuracy on the given test data and labels.

predict(data, key, features=None, verbose=False)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.

Defaults to False.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted classes/values.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.

class hana_ml.algorithms.pal.trees.DecisionTreeRegressor(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True)

Bases: hana_ml.algorithms.pal.trees._DecisionTreeBase

Decision Tree model for regression.

Parameters

conn_context : ConnectionContext

Database connection object.

algorithm : {‘cart’}

Algorithm used to grow a decision tree.

  • ‘cart’: Classification and Regression tree.

Currently, only ‘cart’ is supported.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

allow_missing_dependent : bool, optional

Specifies if a missing target value is allowed.

  • False: Not allowed. An error occurs if a missing target is present.

  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

percentage : float, optional

Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.

Defaults to 1.0.

min_records_of_parent : int, optional

Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min_records_of_leaf : int, optional

Specifies the minimum number of records in a leaf.

Defaults to 1.

max_depth : int, optional

The maximum depth of a tree.

By default it is unlimited.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string: categorical, or integer and float: continuous.

VALID only for integer variables; omitted otherwise.

Default value detected from input data.

split_threshold : float, optional

Specifies the stop condition for a node:

  • ‘cart’: The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the SPLIT_THRESHOLD value is, the larger a ‘cart’ tree grows.

Defaults to 1e-5 for ‘cart’.

use_surrogate : bool, optional

If True, use surrogate split when NULL values are encountered. Only valid for ‘cart’.

Defaults to True.

model_format : {‘json’, ‘pmml’}, optional

Specifies the tree model format for store. Case-insensitive.

  • ‘json’: export model in json format.

  • ‘pmml’: export model in pmml format.

Defaults to ‘json’.

output_rules : bool, optional

If True, output decision rules.

Defaults to True.

Examples

Input dataframe for training:

>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000

Creating DecisionTreeRegressor instance:

>>> dtr = DecisionTreeRegressor(conn_context=cc, algorithm='cart',
...                              min_records_of_parent=2, min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='pmml', output_rules=True)

Performing fit() on given dataframe:

>>> dtr.fit(data=df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                      RULES_CONTENT
0          0         (A<-0.495502) && (B<-0.663588) => -85.8762
1          1        (A<-0.495502) && (B>=-0.663588) => -29.9827

Input dataframe for predicting:

>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096

Performing predict() on given dataframe:

>>> result = dtr.predict(data=df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0

Input dataframe for scoring:

>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000

Performing score() on given dataframe:

>>> dtr.score(df3, key='ID')
0.9999999999900131

Attributes

model_

(DataFrame) Trained model content.

decision_rules_

(DataFrame) Rules for decision tree to make decisions. Set to None if output_rules is False.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, verbose])

Predict dependent variable values based on fitted model.

score(data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

score(data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

Returns

float

The coefficient of determination R2 of the prediction on the given data.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, verbose=False)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.

Defaults to False.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted classes/values.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.

class hana_ml.algorithms.pal.trees.GradientBoostingClassifier(conn_context, n_estimators=10, subsample=None, max_depth=None, loss=None, split_threshold=None, learning_rate=None, fold_num=None, default_split_dir=None, min_sample_weight_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, scale_pos_w=None, base_score=None, cv_metric=None, ref_metric=None, categorical_variable=None, allow_missing_label=None, thread_ratio=None, cross_validation_range=None)

Bases: hana_ml.algorithms.pal.trees._GradientBoostingBase

Gradient Boosting model for classification.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in Gradient Boosting.

Defaults to 10.

loss : str, optional

Type of loss function to be optimized. Supported values are ‘linear’ and ‘logistic’.

Defaults to ‘linear’.

max_depth : int, optional

The maximum depth of a tree.

Defaults to 6.

split_threshold : float, optional

Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.

learning_rate : float, optional.

Learning rate of each iteration, must be within the range (0, 1].

Defaults to 0.3.

subsample : float, optional

The fraction of samples to be used for fitting each base learner.

Defaults to 1.0.

fold_num : int, optional

The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.

default_split_dir : int, optional.

Default split direction for missing values. Valid input values are 0, 1 and 2, where:

0 - Automatically determined, 1 - Left, 2 - Right.

Defaults to 0.

min_sample_weight_leaf : float, optional

The minimum sample weights in leaf node.

Defaults to 1.0.

max_w_in_split : float, optional

The maximum weight constraint assigned to each tree node.

Defaults to 0 (i.e. no constraint).

col_subsample_split : float, optional

The fraction of features used for each split, should be within range (0, 1].

Defaults to 1.0.

col_subsample_tree : float, optional

The fraction of features used for each tree growth, should be within range (0, 1]

Defaults to 1.0.

lamb : float, optional

L2 regularization weight for the target loss function. Should be within range (0, 1].

Defaults to 1.0.

alpha : float, optional

Weight of L1 regularization for the target loss function.

Defaults to 1.0.

scale_pos_w : float, optional

The weight scaled to positive samples in regression.

Defaults to 1.0.

base_score : float, optional

Initial prediction score for all instances. Acts as a global bias; given a sufficient number of iterations, changing this value has little effect.

cv_metric : {‘log_likelihood’, ‘multi_log_likelihood’, ‘error_rate’, ‘multi_error_rate’, ‘auc’}, optional

The metric used for cross-validation.

If multiple metrics are provided, only the first one is used. If not set, it takes the first value (in alphabetical order) of the parameter ‘ref_metric’ when the latter is set; otherwise it falls back to the default values listed below.

Defaults to

1)’error_rate’ for binary classification,

2)’multi_error_rate’ for multi-class classification.

ref_metric : str or list of str, optional

Specifies a reference metric or a list of reference metrics. Supported metrics same as cv_metric. If not provided, defaults to

1)[‘error_rate’] for binary classification,

2)[‘multi_error_rate’] for multi-class classification.

categorical_variable : str or list of str, optional

Specifies which variable(s) should be treated as categorical. Otherwise default behavior is followed:

  1. VARCHAR - categorical,

  2. INTEGER and DOUBLE - continuous.

Only valid for INTEGER variables, omitted otherwise.

allow_missing_label : bool, optional

Specifies whether missing label value is allowed.

False: not allowed. If missing values are present in the input data, an error is thrown.

True: allowed. The datum with a missing label is removed automatically.

thread_ratio : float, optional

The ratio of available threads used for training:

0: single thread;

(0,1]: percentage of available threads;

others : heuristically determined.

Defaults to -1.

cross_validation_range : list of tuples, optional

Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair: the first element is a parameter name of str type, and the second is a list of numbers of the form [<begin-value>, <test-numbers>, <end-value>].

Supported parameters for cross-validation: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.

A simple example for illustration:

[(‘n_estimators’, [4, 3, 10]),

(‘learning_rate’, [0.1, 3, 1.0]),

(‘split_threshold’, [0.1, 3, 1.0])]

Examples

Input dataframe for training:

>>> df.head(4).collect()
   ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A

Creating Gradient Boosting Classifier:

>>> cv_range = [('learning_rate', [0.1, 1.0, 3]),
...             ('n_estimators', [4, 10, 3]),
...             ('split_threshold', [0.1, 1.0, 3])]
>>> gbc = GradientBoostingClassifier(conn_context=conn,
...                                  n_estimators=4,
...                                  split_threshold=0,
...                                  learning_rate=0.5,
...                                  fold_num=5,
...                                  max_depth=6,
...                                  cv_metric = 'error_rate',
...                                  ref_metric=['auc'],
...                                  cross_validation_range=cv_range)

Performing fit() on given dataframe:

>>> gbc.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'], label='LABEL')
>>> gbc.stats_.collect()
         STAT_NAME STAT_VALUE
0  ERROR_RATE_MEAN          0
1   ERROR_RATE_VAR          0
2         AUC_MEAN          1

Input dataframe for predicting:

>>> df1.head(4).collect()
   ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0

Performing predict() on given dataframe:

>>> result = gbc.predict(data=df1, key='ID', verbose=False)
>>> result.head(4).collect()
   ID SCORE  CONFIDENCE
0   1     A    0.825556
1   2     A    0.825556
2   3     A    0.825556
3   4     A    0.825556
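When cross-validation is active (fold_num greater than 1 and a non-empty cross_validation_range), the selected parameter values can be inspected through the cv_ attribute listed below:

>>> gbc.cv_.collect()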

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

confusion_matrix_

(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.

stats_

(DataFrame) Statistics info for cross-validation.

cv_

(DataFrame) Best choice of parameter produced by cross-validation.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, verbose])

Predict dependent variable values based on fitted model.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
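
For example, if the training table contained an INTEGER column that actually encodes categories, it could be declared categorical at fit time. A minimal sketch; df_train and the column name 'REGION_ID' are hypothetical and stand in for your own data:

>>> gbc.fit(data=df_train, label='LABEL',
...         categorical_variable=['REGION_ID'])  # 'REGION_ID' is a hypothetical INTEGER column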

predict(data, key, features=None, verbose=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE, type NVARCHAR, representing the predicted classes.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class.
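
When verbose=True, the output is expected to contain one row per class for each data point instead of only the top class. A minimal sketch, reusing gbc and df1 from the example above and assuming the hana_ml DataFrame.filter method is available for client-side filtering; exact values depend on the trained model:

>>> result_all = gbc.predict(data=df1, key='ID', verbose=True)
>>> result_all.filter('CONFIDENCE > 0.5').collect()  # client-side filtering; illustration only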

class hana_ml.algorithms.pal.trees.GradientBoostingRegressor(conn_context, n_estimators=10, subsample=None, max_depth=None, loss=None, split_threshold=None, learning_rate=None, fold_num=None, default_split_dir=None, min_sample_weight_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, scale_pos_w=None, base_score=None, cv_metric=None, ref_metric=None, categorical_variable=None, allow_missing_label=None, thread_ratio=None, cross_validation_range=None)

Bases: hana_ml.algorithms.pal.trees._GradientBoostingBase

Gradient Boosting Tree model for regression.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in Gradient Boosting.

Defaults to 10.

loss : str, optional

Type of loss function to be optimized. Supported values are ‘linear’ and ‘logistic’.

Defaults to ‘linear’.

max_depth : int, optional

The maximum depth of a tree.

Defaults to 6.

split_threshold : float, optional

Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.

learning_rate : float, optional.

Learning rate of each iteration, must be within the range (0, 1].

Defaults to 0.3.

subsample : float, optional

The fraction of samples to be used for fitting each base learner.

Defaults to 1.0.

fold_num : int, optional

The k-value for k-fold cross-validation.

default_split_dir : int, optional.

Default split direction for missing values. Valid input values are 0, 1 and 2, where:

0 - Automatically determined,

1 - Left,

2 - Right.

Defaults to 0.

min_sample_weight_leaf : float, optional

The minimum sample weight in a leaf node.

Defaults to 1.0.

max_w_in_split : float, optional

The maximum weight constraint assigned to each tree node.

Defaults to 0 (i.e. no constraint).

col_subsample_split : float, optional

The fraction of features used for each split, should be within range (0, 1].

Defaults to 1.0.

col_subsample_tree : float, optional

The fraction of features used for each tree growth, should be within range (0, 1]

Defaults to 1.0.

lamb : float, optional

L2 regularization weight for the target loss function. Should be within range (0, 1].

Defaults to 1.0.

alpha : float, optional

Weight of L1 regularization for the target loss function.

Defaults to 1.0.

scale_pos_w : float, optional

The weight scaled to positive samples in regression.

Defaults to 1.0.

base_score : float, optional

Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.

cv_metric : str, optional

The metric used for cross-validation. Supported metrics are ‘rmse’ and ‘mae’. If multiple metrics are provided, only the first one is used. If not set, it takes the first value (in alphabetical order) of the parameter ‘ref_metric’ when the latter is set; otherwise it falls back to the default.

Defaults to ‘mae’.

ref_metric : str or list of str, optional

Specifies a reference metric or a list of reference metrics. Supported metrics same as cv_metric.

categorical_variable : str, optional

Indicates which variable(s) should be treated as categorical. If not specified, the default behavior is followed:

  1. VARCHAR - categorical,

  2. INTEGER and DOUBLE - continuous.

Only valid for INTEGER variables, omitted otherwise.

allow_missing_label : bool, optional

Specifies whether missing label value is allowed.

False: not allowed. If missing label values are present in the input data, an error is thrown.

True: allowed. Data points with missing labels are removed automatically.

thread_ratio : float, optional

The ratio of available threads used for training.

0: single thread;

(0,1]: percentage of available threads;

others : heuristically determined.

Defaults to -1.

cross_validation_range : list of tuples, optional

Indicates the set of parameters involved in cross-validation. Each tuple is a pair, with the first element being the parameter name (a str) and the second a list of numbers of the following form: [<begin-value>, <test-numbers>, <end-value>]. Supported parameters for cross-validation: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.

A simple example for illustration:

[(‘n_estimators’, [4, 3, 10]),

(‘learning_rate’, [0.1, 3, 1.0]),

(‘split_threshold’, [0.1, 3, 1.0])]

Examples

Input dataframe for training:

>>> df.head(4).collect()
    ATT1     ATT2    ATT3    ATT4  TARGET
0  19.76   6235.0  100.00  100.00   25.10
1  17.85  46230.0   43.67   84.53   19.23
2  19.96   7360.0   65.51   81.57   21.42
3  16.80  28715.0   45.16   93.33   18.11

Creating GradientBoostingRegressor instance:

>>> cv_range = [('learning_rate', [0.0,5,1.0]),
...             ('n_estimators', [10, 11, 20]),
...             ('split_threshold', [0.0, 5, 1.0])]
>>> gbr = GradientBoostingRegressor(conn_context=conn,
...                                 n_estimators=20,
...                                 split_threshold=0.75,
...                                 learning_rate=0.75,
...                                 fold_num=5,
...                                 max_depth=6,
...                                 cv_metric = 'rmse',
...                                 ref_metric=['mae'],
...                                 cross_validation_range=cv_range)

Performing fit() on given dataframe:

>>> gbr.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'],
...         label='TARGET')
>>> gbr.stats_.collect()
   STAT_NAME STAT_VALUE
0  RMSE_MEAN    1.83732
1   RMSE_VAR   0.525622
2   MAE_MEAN    1.44388

Input dataframe for predicting:

>>> df1.head(4).collect()
   ID   ATT1     ATT2    ATT3    ATT4
0   1  19.76   6235.0  100.00  100.00
1   2  17.85  46230.0   43.67   84.53
2   3  19.96   7360.0   65.51   81.57
3   4  16.80  28715.0   45.16   93.33

Performing predict() on given dataframe:

>>> result = gbr.predict(data=df1, key='ID')
>>> result.head(4).collect()
   ID    SCORE CONFIDENCE
0   1  24.1499       None
1   2  19.2351       None
2   3  21.8944       None
3   4  18.5256       None

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

stats_

(DataFrame) Statistics info for cross-validation.

cv_

(DataFrame) Best choice of parameter produced by cross-validation.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, verbose])

Predict dependent variable values based on fitted model.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

predict(data, key, features=None, verbose=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted value.

  • CONFIDENCE, all None’s for regression.
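
The features argument can also be given explicitly, e.g. when the prediction DataFrame carries extra columns besides the ID and the model features. A short sketch reusing gbr and df1 from the example above (the explicit column list is only an illustration), with the result collected into pandas for further client-side processing:

>>> result = gbr.predict(data=df1, key='ID',
...                      features=['ATT1', 'ATT2', 'ATT3', 'ATT4'])
>>> scores = result.collect()  # pandas.DataFrame with ID, SCORE, CONFIDENCE columns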

class hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier(conn_context, n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, base_score=None, cv_metric=None, ref_metric=None, calculate_importance=None, calculate_cm=None, thread_ratio=None, cross_validation_range=None)

Bases: hana_ml.algorithms.pal.trees._HybridGradientBoostingBase

Hybrid Gradient Boosting model for classification.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in Gradient Boosting.

Defaults to 10.

split_method : {‘exact’, ‘sketch’, ‘sampling’}, optional

The method for finding split points for numerical features.

Defaults to ‘exact’.

random_state : int, optional

The seed for random number generating.

0 - current time as seed,

Others - the seed.

Defaults to 0.

max_depth : int, optional

The maximum depth of a tree.

Defaults to 6.

split_threshold : float, optional

Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.

learning_rate : float, optional.

Learning rate of each iteration, must be within the range (0, 1].

Defaults to 0.3.

subsample : float, optional

The fraction of samples to be used for fitting each base learner.

Defaults to 1.0.

fold_num : int, optional

The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.

sketch_eps : float, optional

The epsilon value of the sketch method, which sets an upper limit on the sum of sample weights between two split points. The smaller this value is, the more split points are tried.

min_sample_weight_leaf : float, optional

The minimum summation of sample weights in a leaf node.

Defaults to 1.0.

min_samples_leaf : int, optional

The minimum number of data points in a leaf node.

Defaults to 1.

max_w_in_split : float, optional

The maximum weight constraint assigned to each tree node.

Defaults to 0 (i.e. no constraint).

col_subsample_split : float, optional

The fraction of features used for each split, should be within range (0, 1].

Defaults to 1.0.

col_subsample_tree : float, optional

The fraction of features used for each tree growth, should be within range (0, 1]

Defaults to 1.0.

lamb : float, optional

L2 regularization weight for the target loss function. Should be within range (0, 1].

Defaults to 1.0.

alpha : float, optional

Weight of L1 regularization for the target loss function.

Defaults to 1.0.

base_score : float, optional

Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.

Defaults to 0.5.

cv_metric : {‘nll’, ‘error_rate’, ‘auc’}, optional

The metric used for cross-validation.

Defaults to ‘error_rate’.

ref_metric : str or list of str, optional

Specifies a reference metric or a list of reference metrics. Any reference metric must be a valid option of cv_metric.

Defaults to [‘error_rate’].

categorical_variable : str or list of str, optional

Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.

Note

By default INTEGER variables are treated as numerical.

thread_ratio : float, optional

The ratio of available threads used for training.

0: single thread;

(0,1]: percentage of available threads;

others : heuristically determined.

Defaults to -1.

calculate_importance : bool, optional

Determines whether to calculate variable importance.

Defaults to True.

calculate_cm : bool, optional

Determines whether to calculate the confusion matrix.

Defaults to True.

cross_validation_range : list of tuples, optional

Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair, with the first element being the parameter name (a str) and the second a list of numbers with the following structure: [<begin-value>, <end-value>, <test-numbers>].

Supported parameters for cross-validation: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.

A simple example for illustration:

[(‘n_estimators’, [4, 10, 3]),

(‘learning_rate’, [0.1, 1.0, 3])]

Examples

Input dataframe for training:

>>> df.head(7).collect()
   ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
4   1.2  10.3  100.0   1.0     A
5   4.0  40.0  400.0   4.0     B
6   4.1  40.1  400.0   4.0     B

Creating an instance of Hybrid Gradient Boosting classifier:

>>> cv_range = [('learning_rate', [0.1, 1.0, 3]),
...             ('n_estimators', [4, 10, 3]),
...             ('split_threshold', [0.1, 1.0, 3])]
>>> ghc = HybridGradientBoostingClassifier(conn_context=conn,
...                                        n_estimators=4,
...                                        split_threshold=0,
...                                        learning_rate=0.5,
...                                        fold_num=5,
...                                        max_depth=6,
...                                        cv_metric='error_rate',
...                                        ref_metric=['auc'],
...                                        cross_validation_range=cv_range)

Performing fit() on given dataframe:

>>> ghc.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'],
...         label='LABEL')
>>> ghc.stats_.collect()
         STAT_NAME STAT_VALUE
0  ERROR_RATE_MEAN   0.133333
1   ERROR_RATE_VAR  0.0266666
2         AUC_MEAN        0.9

Input dataframe for predict:

>>> df_predict.collect()
   ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
4   5   1.2  10.3  100.0   3.0
5   6   4.0  40.0  400.0   3.0
6   7   4.1  40.1  400.0   3.0
7   8   4.2  40.2  400.0   3.0
8   9   4.3  40.4  400.0   3.0
9  10   4.2  40.3  400.0   3.0

Performing predict() on given dataframe:

>>> result = ghc.predict(data=df_predict, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   1     A    0.852674
1   2     A    0.852674
2   3     A    0.852674
3   4     A    0.852674
4   5     A    0.751394
5   6     B    0.703119
6   7     B    0.703119
7   8     B    0.703119
8   9     B    0.830549
9  10     B    0.703119
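
Since fold_num and cross_validation_range were set, the auxiliary result tables can be inspected after fit() as well; their exact contents depend on the data and the parameter grid:

>>> ghc.cv_.collect()                # best parameter choice found by cross-validation
>>> ghc.confusion_matrix_.collect()  # confusion matrix from training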

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

confusion_matrix_

(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.

stats_

(DataFrame) Statistics info for cross-validation.

cv_

(DataFrame) Best choice of parameter produced by cross-validation.

Methods

fit(data[, key, features, label, …])

Train the model on input data.

predict(data, key[, features, verbose, …])

Predict labels based on the trained HGBT classifier.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train the model on input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Indicates INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.

Note

By default INTEGER variables are treated as numerical.

predict(data, key, features=None, verbose=None, thread_ratio=None, missing_replacement=None)

Predict labels based on the trained HGBT classifier.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

missing_replacement : str, optional

The missing replacement strategy:

  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.

Defaults to False.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted classes/values.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class label assignment.
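
The optional predict arguments can be combined as needed; a small sketch reusing ghc and df_predict from the example above (the thread_ratio and missing_replacement values are chosen only for illustration):

>>> result = ghc.predict(data=df_predict, key='ID',
...                      thread_ratio=0.5,
...                      missing_replacement='instance_marginalized',
...                      verbose=False)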

class hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor(conn_context, n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, cv_metric=None, ref_metric=None, calculate_importance=None, thread_ratio=None, cross_validation_range=None)

Bases: hana_ml.algorithms.pal.trees._HybridGradientBoostingBase

Hybrid Gradient Boosting model for regression.

Parameters

conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in Gradient Boosting.

Defaults to 10.

split_method : {‘exact’, ‘sketch’, ‘sampling’}, optional

The method for finding split points for numerical features.

Defaults to ‘exact’.

random_state : int, optional

The seed for random number generating.

0 - current time as seed,

Others - the seed.

max_depth : int, optional

The maximum depth of a tree.

Defaults to 6.

split_threshold : float, optional

Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.

learning_rate : float, optional.

Learning rate of each iteration, must be within the range (0, 1].

Defaults to 0.3.

subsample : float, optional

The fraction of samples to be used for fitting each base learner.

Defaults to 1.0.

fold_num : int, optional

The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.

sketch_eps : float, optional

The epsilon value of the sketch method, which sets an upper limit on the sum of sample weights between two split points. The smaller this value is, the more split points are tried.

min_sample_weight_leaf : float, optional

The minimum summation of sample weights in a leaf node.

Defaults to 1.0.

min_samples_leaf : int, optional

The minimum number of data points in a leaf node.

Defaults to 1.

max_w_in_split : float, optional

The maximum weight constraint assigned to each tree node.

Defaults to 0 (i.e. no constraint).

col_subsample_split : float, optional

The fraction of features used for each split, should be within range (0, 1].

Defaults to 1.0.

col_subsample_tree : float, optional

The fraction of features used for each tree growth, should be within range (0, 1].

Defaults to 1.0.

lamb : float, optional

Weight of L2 regularization for the target loss function. Should be within range (0, 1].

Defaults to 1.0.

alpha : float, optional

Weight of L1 regularization for the target loss function.

Defaults to 1.0.

cv_metric : {‘rmse’, ‘mae’}, optional

The metric used for cross-validation.

Defaults to ‘mae’.

ref_metric : str or list of str, optional

Specifies a reference metric or a list of reference metrics. Any reference metric must be a valid option of cv_metric.

Defaults to [‘rmse’].

categorical_variable : str or list of str, optional

Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.

Note

By default INTEGER variables are treated as numerical.

thread_ratio : float, optional

The ratio of available threads used for training.

0: single thread;

(0,1]: percentage of available threads;

others : heuristically determined.

Defaults to -1.

calculate_importance : bool, optional

Determines whether to calculate variable importance.

Defaults to True.

calculate_cm : bool, optional

Determines whether to calculate the confusion matrix.

Defaults to True.

cross_validation_range : list of tuples, optional

Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair, with the first element being the parameter name (a str) and the second a list of numbers with the following structure: [<begin-value>, <end-value>, <test-numbers>].

Supported parameters for cross-validation: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.

A simple example for illustration, given as a list of two tuples:

[(‘n_estimators’, [4, 10, 3]),

(‘learning_rate’, [0.1, 1.0, 3])]

Examples

Input dataframe for training:

>>> df.head(7).collect()
    ATT1     ATT2    ATT3    ATT4  TARGET
0  19.76   6235.0  100.00  100.00   25.10
1  17.85  46230.0   43.67   84.53   19.23
2  19.96   7360.0   65.51   81.57   21.42
3  16.80  28715.0   45.16   93.33   18.11
4  18.20  21934.0   49.20   83.07   19.24
5  16.71   1337.0   74.84   94.99   19.31
6  18.81  17881.0   70.66   92.34   20.07

Creating an instance of HGBT regressor and training the model:

>>> cv_range = [('learning_rate',[0.0, 1.0, 5]),
...             ('n_estimators', [10, 20, 11]),
...             ('split_threshold', [0.0, 1.0, 5])]
>>> hgr = HybridGradientBoostingRegressor(conn_context=conn,
...                                       n_estimators=20,
...                                       split_threshold=0.75,
...                                       split_method = 'exact',
...                                       learning_rate=0.75,
...                                       fold_num=5,
...                                       max_depth=6,
...                                       cv_metric = 'rmse',
...                                       ref_metric=['mae'],
...                                       cross_validation_range=cv_range)
>>> hgr.fit(data=df, features=['ATT1','ATT2','ATT3', 'ATT4'],
...         label='TARGET')

Check the model content and feature importances:

>>> hgr.model_.head(4).collect()
   TREE_INDEX   MODEL_CONTENT
0    -1           {"nclass":1,"param":{"bs":0.0,"obj":"reg:linea...
1    0            {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":9...
2    1            {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":5...
3    2            {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":3...
>>> hgr.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0          ATT1    0.744019
1          ATT2    0.164429
2          ATT3    0.078935
3          ATT4    0.012617

The trained model can be used for prediction. Input dataframe for prediction, i.e., data with missing target values:

>>> df_predict.collect()
   ID   ATT1     ATT2    ATT3    ATT4
0   1  19.76   6235.0  100.00  100.00
1   2  17.85  46230.0   43.67   84.53
2   3  19.96   7360.0   65.51   81.57
3   4  16.80  28715.0   45.16   93.33
4   5  18.20  21934.0   49.20   83.07
5   6  16.71   1337.0   74.84   94.99
6   7  18.81  17881.0   70.66   92.34
7   8  20.74   2319.0   63.93   95.08
8   9  16.56  18040.0   14.45   61.24
9  10  18.55   1147.0   68.58   97.90

Predict the target values and view the results:

>>> result = hgr.predict(data=df_predict, key='ID', verbose=False)
>>> result.collect()
   ID               SCORE CONFIDENCE
0   1   23.79109147050638       None
1   2   19.09572889593064       None
2   3   21.56501359501561       None
3   4  18.622664075787082       None
4   5   19.05159916592106       None
5   6  18.815530665858763       None
6   7  19.761714911364443       None
7   8   23.79109147050638       None
8   9   17.84416828725911       None
9  10  19.915574945518465       None

Attributes

model_

(DataFrame) Trained model content.

feature_importances_

(DataFrame) The feature importance (the higher, the more important the feature).

confusion_matrix_

(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.

stats_

(DataFrame) Statistics info for cross-validation.

cv_

(DataFrame) Best choice of parameter produced by cross-validation.

Methods

fit(data[, key, features, label, …])

Train an HGBT regressor on the input data.

predict(data, key[, features, verbose, …])

Predict dependent variable values based on fitted model.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Train an HGBT regressor on the input data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or list of str, optional

Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.

Note

By default INTEGER variables are treated as numerical.

predict(data, key, features=None, verbose=None, thread_ratio=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If not provided, it defaults to all non-ID columns.

missing_replacement : str, optional

The missing replacement strategy:

  • ‘feature_marginalized’: marginalise each missing feature out independently.

  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to ‘feature_marginalized’.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point.

Returns

DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data’s ID column.

  • SCORE, type DOUBLE, representing the predicted values.

  • CONFIDENCE, type DOUBLE, all None for regression prediction.
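
To attach the predicted values back to the input features, one option is a client-side join with pandas after collecting both DataFrames; a minimal sketch reusing hgr and df_predict from the example above (plain pandas, nothing beyond collect() is assumed on the hana_ml side):

>>> scores = hgr.predict(data=df_predict, key='ID').collect()
>>> joined = df_predict.collect().merge(scores[['ID', 'SCORE']], on='ID')  # plain pandas join; illustration only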

hana_ml.algorithms.pal.tsa.arima

This module contains a Python wrapper for the PAL ARIMA algorithm.

The following class is available:

class hana_ml.algorithms.pal.tsa.arima.ARIMA(conn_context, order=None, seasonal_order=None, method='css-mle', include_mean=None, forecast_method=None, output_fitted=True, thread_ratio=None)

Bases: hana_ml.algorithms.pal.tsa.arima._ARIMABase

Autoregressive Integrated Moving Average ARIMA(p, d, q) model.

Parameters

conn_context : ConnectionContext

The connection to the SAP HANA system.

order : (p, d, q), tuple of int, optional

  • p: value of the auto regression order.

  • d: value of the differentiation order.

  • q: value of the mov