hana_ml.algorithms.pal package

PAL Package consists of the following sections:

hana_ml.algorithms.pal.abc_analysis

This module contains PAL wrappers for the abc_analysis algorithm.

The following function is available:

hana_ml.algorithms.pal.abc_analysis.abc_analysis(data, key, percent_A, percent_B, percent_C, revenue=None, thread_ratio=None)

Perform ABC analysis to classify objects based on a particular measure, grouping the inventories into three categories (A, B, and C).

Parameters
dataDataFrame

Input data.

keystr

Name of the ID column.

revenuestr, optional

Name of column for revenue (or profits).

If not given, the input dataframe must only have two columns.

Defaults to the first non-key column.

percent_Afloat

Interval for A class.

percent_Bfloat

Interval for B class.

percent_Cfloat

Interval for C class.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns
DataFrame

Returns a DataFrame containing the ABC class result of partitioning the data into three categories.

Examples

Data to analyze:

>>> df_train = cc.table('AA_DATA_TBL')
>>> df_train.collect()
     ITEM     VALUE
0    item1    15.4
1    item2    200.4
2    item3    280.4
3    item4    100.9
4    item5    40.4
5    item6    25.6
6    item7    18.4
7    item8    10.5
8    item9    96.15
9    item10   9.4

Perform abc_analysis:

>>> res = abc_analysis(data=df_train, key='ITEM', thread_ratio=0.3,
                       percent_A=0.7, percent_B=0.2, percent_C=0.1)
>>> res.collect()
       ABC_CLASS   ITEM
0      A        item3
1      A        item2
2      A        item4
3      B        item9
4      B        item5
5      B        item6
6      C        item7
7      C        item1
8      C        item8
9      C        item10
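
To check how much of the total value each class covers, the result can be joined back to the input on the client side. The following is a minimal sketch using pandas on the collected results; it assumes the df_train and res objects from the example above:

>>> import pandas as pd
>>> merged = pd.merge(res.collect(), df_train.collect(), on='ITEM')  # attach VALUE to each classified item
>>> merged.groupby('ABC_CLASS')['VALUE'].sum()                       # total VALUE per ABC class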

hana_ml.algorithms.pal.association

This module contains Python wrappers for PAL association algorithms.

The following classes are available:

class hana_ml.algorithms.pal.association.Apriori(min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.

Parameters
min_supportfloat

User-specified minimum support(actual value).

min_confidencefloat

User-specified minimum confidence(actual value).

relationalbool, optional

Whether or not to apply relational logic in Apriori algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables: antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 100.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 5.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

use_prefix_treebool, optional

Indicates whether or not to use prefix tree for saving memory.

Defaults to False.

lhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the left-hand-side of association rules.

rhs_restrictlist of str, optional(deprecated)

Specify items that are only allowed on the right-hand-side of association rules.

lhs_complement_rhsbool, optional(deprecated)

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3,i4,...,i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = ['i1','i2'],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional(deprecated)

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{'no', 'single-row', 'multi-row'}, optional

Specify the way to export the Apriori model:

  • 'no' : do not export the model,

  • 'single-row' : export Apriori model in PMML in single row,

  • 'multi-row' : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to 'no'.

Examples

Input data for association rule mining:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for the Apriori algorithm:

>>> ap = Apriori(min_support=0.1,
                 min_confidence=0.3,
                 relational=False,
                 min_lift=1.1,
                 max_conseq=1,
                 max_len=5,
                 ubiquitous=1.0,
                 use_prefix_tree=False,
                 thread_ratio=0,
                 timeout=3600,
                 pmml_export='single-row')

Association rule mining using Apriori algorithm for the input data, and check the results:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000

Apriori algorithm set up using relational logic:

>>> apr = Apriori(min_support=0.1,
                  min_confidence=0.3,
                  relational=True,
                  min_lift=1.1,
                  max_conseq=1,
                  max_len=5,
                  ubiquitous=1.0,
                  use_prefix_tree=False,
                  thread_ratio=0,
                  timeout=3600,
                  pmml_export='single-row')

Again mining association rules using Apriori algorithm for the input data, and check the resulting tables:

>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
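
Because antec_, conseq_ and stats_ share the RULE_ID column, the three relational tables can be recombined into a single rule overview on the client side. This is a minimal sketch using pandas on the collected results of the apr object above; rules with several antecedent items appear on multiple rows:

>>> import pandas as pd
>>> rules = pd.merge(apr.antec_.collect(), apr.conseq_.collect(), on='RULE_ID')
>>> rules = pd.merge(rules, apr.stats_.collect(), on='RULE_ID')
>>> rules.head(5)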
Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent(leading) items.

  • 2nd column : consequent(dependent) items.

  • 3rd column : support value.

  • 4th column : confidence value.

  • 5th column : lift value.

Available only when relational is False.

model_DataFrame

Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. Apriori model in PMML format.

antec_DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, transaction, item, ...])

Association rule mining from the input data using the Apriori algorithm.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data using the Apriori algorithm.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item ID column.

Data type of item column can be INTEGER, VARCHAR or NVARCHAR.

Defaults to the last non-transaction column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1,i2,...,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,..., i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
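
For instance, to restrict item1 and item2 to the right-hand-side and all remaining items to the left-hand-side, the fit call could look as follows (a sketch reusing the ap instance and df from the examples above):

>>> ap.fit(data=df,
           transaction='CUSTOMER',
           item='ITEM',
           rhs_restrict=['item1', 'item2'],
           lhs_complement_rhs=True)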

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.
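
As a minimal sketch of reusing a stored model (assuming pmml_export was enabled during fitting, as in the example above, so that model_ is populated), the model can be loaded into a fresh instance:

>>> new_ap = Apriori(min_support=0.1, min_confidence=0.3)
>>> new_ap.load_model(ap.model_)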

class hana_ml.algorithms.pal.association.AprioriLite(min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.

Parameters
min_supportfloat

User-specified minimum support(actual value).

min_confidencefloat

User-specified minimum confidence(actual value).

subsamplefloat, optional

Specify the sampling percentage for the input data. Set to 1 if you want to use the entire data.

recalculatebool, optional

If you sample the input data, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.

Defaults to True.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

pmml_export{'no', 'single-row', 'multi-row'}, optional

Specify the way to export the Apriori model:

  • 'no' : do not export the model,

  • 'single-row' : export Apriori model in PMML in single row,

  • 'multi-row' : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.

Defaults to 'no'.

Examples

Input data for association rule mining using Apriori algorithm:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3

Set up parameters for light Apriori algorithm, ingest the input data, and check the result table:

>>> apl = AprioriLite(min_support=0.1,
                      min_confidence=0.3,
                      subsample=1.0,
                      recalculate=False,
                      timeout=3600,
                      pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
Attributes
result_DataFrame
Mined association rules and related statistics, structured as follows:
  • 1st column : antecedent(leading) items,

  • 2nd column : consequent(dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Non-empty only when relational is False.

model_DataFrame
Apriori model trained from the input data, structured as follows:
  • 1st column : model ID.

  • 2nd column : model content, i.e. liteApriori model in PMML format.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, transaction, item])

Association rule mining based on the input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None)

Association rule mining based on the input data.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.
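
If the transaction and item columns are not the first and last columns of the input, they can be named explicitly, as in this sketch reusing the apl instance and df from the example above:

>>> apl.fit(data=df, transaction='CUSTOMER', item='ITEM')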

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.FPGrowth(min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase

FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.

Parameters
min_supportfloat, optional

User-specified minimum support, with valid range [0, 1].

Defaults to 0.

min_confidencefloat, optional

User-specified minimum confidence, with valid range [0, 1].

Defaults to 0.

relationalbool, optional

Whether or not to apply relational logic in FPGrowth algorithm.

If False, a single result table is produced; otherwise, the result table shall be split into three tables -- antecedent, consequent and statistics.

Defaults to False.

min_liftfloat, optional

User-specified minimum lift.

Defaults to 0.

max_conseqint, optional

Maximum length of consequent items.

Defaults to 10.

max_lenint, optional

Total length of antecedent items and consequent items in the output.

Defaults to 10.

ubiquitousfloat, optional

Item sets whose support values are greater than this number will be ignored during frequent itemset mining.

Defaults to 1.0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

Input data for association rule mining:

>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5

Set up parameters:

>>> fpg = FPGrowth(min_support=0.2,
                   min_confidence=0.5,
                   relational=False,
                   min_lift=1.0,
                   max_conseq=1,
                   max_len=5,
                   ubiquitous=1.0,
                   thread_ratio=0,
                   timeout=3600)

Association rule mining using FPGrowth algorithm for the input data, and check the results:

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000

FPGrowth algorithm set up using relational logic:

>>> fpgr = FPGrowth(min_support=0.2,
                    min_confidence=0.5,
                    relational=True,
                    min_lift=1.0,
                    max_conseq=1,
                    max_len=5,
                    ubiquitous=1.0,
                    thread_ratio=0,
                    timeout=3600)

Again mining association rules using FPGrowth algorithm for the input data, and check the resulting tables:

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
Attributes
result_DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent(leading) items,

  • 2nd column : consequent(dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

antec_DataFrame
Antecedent items of mined association rules, structured as follows:
  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame
Statistics of the mined association rules, structured as follows:
  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, transaction, item, ...])

Association rule mining from the input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining from the input data.

Parameters
dataDataFrame

Input data for association rule mining.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

lhs_restrictlist of int/str, optional

Specify items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specify items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1,i2,...,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,..., i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
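
For example, to keep items 1 and 2 on the right-hand-side and all other items on the left-hand-side, the call could look as follows (a sketch reusing the fpg instance and df from the examples above; the ITEM column is of type INTEGER here, so the restriction list holds integers):

>>> fpg.fit(data=df,
            transaction='TRANS',
            item='ITEM',
            rhs_restrict=[1, 2],
            lhs_complement_rhs=True)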

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.KORD(k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None, max_conseq=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

Parameters
kint, optional

The number of top rules to discover.

measurestr, optional

Specifies the measure used to define the priority of the association rules.

min_supportfloat, optional

User-specified minimum support value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_confidencefloat, optional

User-specified minimum confidence value of association rule, with valid range [0, 1].

Defaults to 0 if not provided.

min_coveragefloat, optional

User-specified minimum coverage value of association rule, with valid range [0, 1].

Defaults to the value of min_support if not provided.

min_measurefloat, optional

User-specified minimum measure value (for leverage or lift, depending on the setting of measure).

Defaults to 0 if not provided.

max_antecint, optional

Specifies the maximum number of antecedent items in generated association rules.

Defaults to 4.

epsilonfloat, optional

User-specified epsilon value for punishing length of rules.

Valid only when use_epsilon is True.

use_epsilonbool, optional

Specifies whether or not to use epsilon to punish the length of rules.

Defaults to False.

max_conseqint, optional

Specifies the maximum number of consequent items in generated association rules.

Should not be greater than 3.

New parameter added in SAP HANA Cloud.

Defaults to 1.

Examples

First let us have a look at the training data:

>>> df.head(10).collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1

Set up a KORD instance:

>>> krd =  KORD(k=5,
                measure='lift',
                min_support=0.1,
                min_confidence=0.2,
                epsilon=0.1,
                use_epsilon=False)

Start k-optimal rule discovery process from the input transaction data, and check the results:

>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
   RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
   RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714
Attributes
antec_DataFrame

Info of antecedent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : antecedent items.

conseq_DataFrame

Info of consequent items for the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : consequent items.

stats_DataFrame
Some basic statistics for the mined association rules, structured as follows:
  • 1st column : rule ID,

  • 2nd column : support value of rules,

  • 3rd column : confidence value of rules,

  • 4th column : lift value of rules,

  • 5th column : leverage value of rules,

  • 6th column : measure value of rules.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, transaction, item])

K-optimal rule discovery from input data, based on some user-specified measure.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, transaction=None, item=None)

K-optimal rule discovery from input data, based on some user-specified measure.

Parameters
dataDataFrame

Input data for k-optimal(association) rule discovery.

transactionstr, optional

Column name of transaction ID in the input data.

Defaults to name of the 1st column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-transaction column if not provided.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.association.SPM(min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.

Parameters
min_supportfloat

User-specified minimum support value.

relationalbool, optional

Whether or not to apply relational logic in sequential pattern mining.

If False, a single result table for frequent pattern mining is produced; otherwise, the result table is split into two tables: one for mined patterns and the other for statistics.

Defaults to False.

ubiquitousfloat, optional

Items whose support values are above this specified value will be ignored during the frequent item mining phase.

Defaults to 1.0.

min_lenint, optional

Minimum number of items in a transaction.

Defaults to 1.

max_lenint, optional

Maximum number of items in a transaction.

Defaults to 10.

min_len_outint, optional

Specifies the minimum number of items of the mined association rules in the result table.

Defaults to 1.

max_len_outint, optional

Specifies the maximum number of items of the mined association rules in the result table.

Defaults to 10.

calc_liftbool, optional

Whether or not to calculate lift values for all applicable cases.

If False, lift values are only calculated for the cases where the last transaction contains a single item.

Defaults to False.

timeoutint, optional

Specifies the maximum run time in seconds.

The algorithm will stop running when the specified timeout is reached.

Defaults to 3600.

Examples

First, take a look at the input data df:

>>> df.collect()
   CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert

Set up a SPM instance:

>>> sp = SPM(min_support=0.5,
             relational=False,
             ubiquitous=1.0,
             max_len=10,
             min_len=1,
             calc_lift=True)

Start sequential pattern mining process from the input data, and check the results:

>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
                        PATTERN   SUPPORT  CONFIDENCE      LIFT
0                       {Apple}  1.000000    0.000000  0.000000
1           {Apple},{Blueberry}  0.666667    0.666667  0.666667
2             {Apple},{Dessert}  1.000000    1.000000  1.000000
3             {Apple,Blueberry}  0.666667    0.000000  0.000000
4   {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                {Apple,Cherry}  0.666667    0.000000  0.000000
6      {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                   {Blueberry}  1.000000    0.000000  0.000000
8         {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                      {Cherry}  0.666667    0.000000  0.000000
10           {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                    {Dessert}  1.000000    0.000000  0.000000
Attributes
result_DataFrame

The overall frequent pattern mining result, structured as follows:

  • 1st column : mined frequent patterns,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Available only when relational is False.

pattern_DataFrame
Result for mined frequent patterns, structured as follows:
  • 1st column : pattern ID,

  • 2nd column : transaction ID,

  • 3rd column : items.

stats_DataFrame
Statistics for frequent pattern mining, structured as follows:
  • 1st column : pattern ID,

  • 2nd column : support values,

  • 3rd column : confidence values,

  • 4th column : lift values.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, customer, transaction, ...])

Sequential pattern mining from input data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

fit(self, data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Sequential pattern mining from input data.

Parameters
dataDataFrame

Input data for sequential pattern mining.

customerstr, optional

Column name of customer ID in the input data.

Defaults to name of the 1st column if not provided.

transactionstr, optional

Column name of transaction ID in the input data.

Specifically for sequential pattern mining, values of this column must reflect the sequence of occurrence as well.

Defaults to name of the 1st non-customer column if not provided.

itemstr, optional

Column name of item ID (or items) in the input data.

Defaults to the name of the last non-customer, non-transaction column if not provided.

item_restrictlist of int or str, optional

Specifies the list of items allowed in the mined association rule.

min_gapint, optional

Specifies the minimum time difference between consecutive transactions in a sequence.
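
A sketch of a fit call using these two parameters, reusing the sp instance and df from the example above (the restriction list and gap value are purely illustrative):

>>> sp.fit(data=df,
           customer='CUSTID',
           transaction='TRANSID',
           item='ITEMS',
           item_restrict=['Apple', 'Blueberry', 'Dessert'],
           min_gap=1)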

hana_ml.algorithms.pal.clustering

This module contains Python wrappers for PAL clustering algorithms.

The following classes are available:

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette is a method used to validate the clustering of data. SAP HANA PAL provides a light version of silhouette called slight silhouette. SlightSilhouette is a wrapper for this light silhouette method.

Note that this function is a new function in SAP HANA SPS05 and Cloud.

Parameters
dataDataFrame

DataFrame containing the data.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-label columns.

label: str, optional

Name of the label column, i.e. the column holding the assigned cluster IDs.

If label is not provided, it defaults to the last column.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'} str, optional

Ways to compute the distance between the item and the cluster center. 'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is minkowski.

Defaults to 3.0.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S, x2/S, ..., xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

thread_numberint, optional

Number of threads.

Defaults to 1.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical variables.

By default, VARCHAR or NVARCHAR is category variable, and INTEGER or DOUBLE is continuous variable.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

Returns
DataFrame
Returns a DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
    V000 V001 V002 CLUSTER
0    0.5    A  0.5       0
1    1.5    A  0.5       0
2    1.5    A  1.5       0
3    0.5    A  1.5       0
4    1.1    B  1.2       0
5    0.5    B 15.5       1
6    1.5    B 15.5       1
7    1.5    B 16.5       1
8    0.5    B 16.5       1
9    1.2    C 16.1       1
10  15.5    C 15.5       2
11  16.5    C 15.5       2
12  16.5    C 16.5       2
13  15.5    C 16.5       2
14  15.6    D 16.2       2
15  15.5    D  0.5       3
16  16.5    D  0.5       3
17  16.5    D  1.5       3
18  15.5    D  1.5       3
19  15.7    A  1.6       3

Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")

Result:

>>> res.collect()
  VALIDATE_VALUE
0      0.9385944
class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.

Parameters
affinity{'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}

Ways to compute the distance between two points.

No default value as it is mandatory.

n_clustersint

Number of clusters.

  • 0: does not adjust Affinity Propagation cluster result.

  • Non-zero int: If Affinity Propagation cluster number is bigger than n_clusters, PAL will merge the result to make the cluster number be the value specified for n_clusters.

No default value as it is mandatory.

max_iterint, optional

Maximum number of iterations.

Defaults to 500.

convergence_iterint, optional

When the clusters remain steady for the specified number of iterations, the algorithm ends.

Defaults to 100.

dampingfloat

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preferencefloat, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratiofloat, optional

Select a portion (seed_ratio * data_number) of the input data as seed, where data_number is the row size of the input data.

Value range: (0,1].

If seed_ratio is 1, all the input data will be the seed.

Defaults to 1.

timesint, optional

The sampling times. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_powerint, optional

The power of the Minkowski distance. Only valid when affinity is 'minkowski'.

Defaults to 3.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1   0.10     0.10
1    2   0.11     0.10
2    3   0.10     0.11
3    4   0.11     0.11
4    5   0.12     0.11
5    6   0.11     0.12
6    7   0.12     0.12
7    8   0.12     0.13
8    9   0.13     0.12
9   10   0.13     0.13
10  11   0.13     0.14
11  12   0.14     0.13
12  13  10.10    10.10
13  14  10.11    10.10
14  15  10.10    10.11
15  16  10.11    10.11
16  17  10.11    10.12
17  18  10.12    10.11
18  19  10.12    10.12
19  20  10.12    10.13
20  21  10.13    10.12
21  22  10.13    10.13
22  23  10.13    10.14
23  24  10.14    10.13

Create AffinityPropagation instance:

>>> ap = AffinityPropagation(
            affinity='euclidean',
            n_clusters=0,
            max_iter=500,
            convergence_iter=100,
            damping=0.9,
            preference=0.5,
            seed_ratio=None,
            times=None,
            minkowski_power=None,
            thread_ratio=1)

Perform fit on the given data:

>>> ap.fit(data = df, key='ID')

Expected output:

>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
Attributes
labels_DataFrame

Label assigned to each sample, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features])

Fit the model when given the training dataset.

fit_predict(self, data, key[, features])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(self, data, key, features=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Fit result, label of each points, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.
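
A minimal sketch of calling fit_predict, reusing the ap instance and df from the example above:

>>> labels = ap.fit_predict(data=df, key='ID')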

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of subgroups. A hierarchical clustering can be either agglomerative or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerative approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered as its own cluster. The algorithm iteratively merges two clusters based on the dissimilarity measure in a greedy manner and forms a larger cluster.

Parameters
n_clustersint, optional

Number of clusters after the agglomerate hierarchical clustering algorithm. Value range: between 1 and the number of input data points.

Defaults to 1.

affinity{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine', 'pearson correlation', 'squared euclidean', 'jaccard', 'gower'}, optional

Ways to compute the distance between two points.

Note

  • (1) For jaccard distance, non-zero input data will be treated as 1, and zero input data will be treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10)

  • (2) Only gower distance supports category attributes. When linkage is 'centroid clustering', 'median clustering', or 'ward', this parameter must be set to 'squared euclidean'.

Defaults to 'squared euclidean'.

linkage{ 'nearest neighbor', 'furthest neighbor', 'group average', 'weighted average', 'centroid clustering', 'median clustering', 'ward'}, optional

Linkage type between two clusters.

  • 'nearest neighbor' : single linkage.

  • 'furthest neighbor' : complete linkage.

  • 'group average' : UPGMA.

  • 'weighted average' : WPGMA.

  • 'centroid clustering'.

  • 'median clustering'.

  • 'ward'.

Defaults to centroid clustering.

Note

For linkage 'centroid clustering', 'median clustering', or 'ward', the corresponding affinity must be set to 'squared euclidean'.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_dimensionfloat, optional

Distance dimension can be set if affinity is set to 'minkowski'. The value should be no less than 1.

Only valid when affinity is 'minkowski'.

Defaults to 3.

normalizationstr, optional

Specifies the type of normalization applied.

  • 'no': No normalization

  • 'z-score': Z-score standardization

  • 'zero-centred-min-max': Zero-centred min-max normalization, transforming to new range [-1, 1].

  • 'min-max': Standard min-max normalization, transforming to new range [0, 1].

Defaults to 'no'.

category_weightsfloat, optional

Represents the weight of category columns.

Defaults to 1.

Examples

Input dataframe df for clustering:

>>> df.collect()
     POINT   X1    X2      X3
0    0       0.5   0.5     1
1    1       1.5   0.5     2
2    2       1.5   1.5     2
3    3       0.5   1.5     2
4    4       1.1   1.2     2
5    5       0.5   15.5    2
6    6       1.5   15.5    3
7    7       1.5   16.5    3
8    8       0.5   16.5    3
9    9       1.2   16.1    3
10   10      15.5  15.5    3
11   11      16.5  15.5    4
12   12      16.5  16.5    4
13   13      15.5  16.5    4
14   14      15.6  16.2    4
15   15      15.5  0.5     4
16   16      16.5  0.5     1
17   17      16.5  1.5     1
18   18      15.5  1.5     1
19   19      15.7  1.6     1

Create an AgglomerateHierarchicalClustering instance:

>>> hc = AgglomerateHierarchicalClustering(
             n_clusters=4,
             affinity='Gower',
             linkage='weighted average',
             thread_ratio=None,
             distance_dimension=3,
             normalization='no',
             category_weights= 0.1)

Perform fit on the given data:

>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])

Expected output:

>>> hc.combine_process_.collect().head(3)
     STAGE    LEFT_POINT   RIGHT_POINT    DISTANCE
0    1        18           19             0.0187
1    2        13           14             0.0250
2    3        7            9              0.0437
>>> hc.labels_.collect().head(3)
           POINT    CLUSTER_ID
     0     0        1
     1     1        1
     2     2        1
Attributes
combine_process_DataFrame

Structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named as its row number in the input data table. After combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

labels_DataFrame

Label assigned to each sample, structured as follows:

  • 1st column: ID, record ID.

  • 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Fit the model when given the training dataset.

fit_predict(self, data, key[, features, ...])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

By default, 'VARCHAR' or 'NVARCHAR' is category variable, and 'INTEGER' or 'DOUBLE' is continuous variable.

Defaults to None.

fit_predict(self, data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

By default, 'VARCHAR' or 'NVARCHAR' is category variable, and 'INTEGER' or 'DOUBLE' is continuous variable.

Defaults to None.

Returns
DataFrame

Combine process, structured as follows:

  • 1st column: int, STAGE, cluster stage.

  • 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named as its row number in the input data table. After combining, the new cluster is named after the left one.

  • 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name, The other cluster to be combined in the same combine stage, named as its row number in the input data table.

  • 4th column: float, DISTANCE. Distance between the two combined clusters.

Label of each points, structured as follows:

  • 1st column: ID (in input table) data type, ID, record ID.

  • 2nd column: int, CLUSTER_ID, the range is from 0 to n_clusters - 1.
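
A minimal sketch of calling fit_predict, reusing the hc instance and df from the example above:

>>> res = hc.fit_predict(data=df, key='POINT', categorical_variable=['X3'])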

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by the user, or both parameters are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps need to be provided together by the user, or both parameters are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When 'minkowski' is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.

Defaults to 3.

categorical_variablestr or list of str, optional

Specifies column(s) in the data that should be treated as categorical.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create a DBSCAN instance:

>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Fit the DBSCAN model when given the training dataset.

fit_predict(self, data, key[, features, ...])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or list of str, optional

Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.

Note that this is a new parameter in SAP HANA SPS05 and Cloud.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater or equal to 0. Defaults to 1 for variables not specified.

Note that this is a new parameter in SAP HANA SPS05 and Cloud.

Defaults to None.
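
For illustration, reusing the dataframe df from the Examples above, V2 could be given half the weight of V1 in the distance calculation. This is a minimal sketch; the weight values are arbitrary, and variable_weight requires SAP HANA SPS05 or Cloud as noted above:

>>> from hana_ml.algorithms.pal.clustering import DBSCAN
>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')
>>> dbscan.fit(data=df, key='ID',
...            variable_weight={'V1': 1.0, 'V2': 0.5})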

fit_predict(self, data, key, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featuresstr or list of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or list of str, optional

Indicates a string column storing non-categorical data.

Levenshtein distance is used to calculate similarity between two strings. Ignored if the column is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation.

The value must be greater or equal to 0.

Defaults to 1 for variables not specified.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)
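
For example, reusing df from the Examples above, fitting and labeling can be combined in a single call; the returned DataFrame has the structure described above (a minimal sketch, output omitted):

>>> labels = dbscan.fit_predict(data=df, key='ID')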

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
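
A minimal sketch of predict(), assuming the model was fitted with save_model=True (the default) and that df_new is a hypothetical DataFrame with the same column layout as the data used for fit():

>>> assignments = dbscan.predict(data=df_new, key='ID')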

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps must either both be provided by the user, or both be omitted, in which case they are determined automatically.

epsfloat, optional

The scan radius.

Note

minpts and eps must either both be provided by the user, or both be omitted, in which case they are determined automatically.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When Minkowski distance is used for metric, this parameter controls the value of power.

Only applicable when metric is 'minkowski'.

Defaults to 3.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
  "ID" INTEGER,
  "POINT" ST_GEOMETRY);

Then, input dataframe df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")

Create GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> geo_dbscan.fit(data = df, key='ID')

Expected output:

>>> geo_dbscan.labels_.collect()
     ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28   29  -1
29   30  -1
>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')

Expected output:

>>> result.collect()
     ID   CLUSTER_ID
0    1    0
1    2    0
2    3    0
......
28    29  -1
29    30  -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features])

Fit the Geometry DBSCAN model when given the training dataset.

fit_predict(self, data, key[, features])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit(self, data, key, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data for applying geometry DBSCAN.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr

Name of the ID column.

featuresstr, optional

Name of the column for storing geometry points.

If not provided, it defaults to the first non-ID column.
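
For instance, with the table created in the Examples above (whose geometry column is named "POINT"), the column can also be passed explicitly. This is a minimal sketch reusing conn and df from those Examples:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')
>>> geo_dbscan.fit(data=df, key='ID', features='POINT')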

fit_predict(self, data, key, features=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data. The structure is as follows.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

keystr

Name of the ID column.

featuresstr, optional

Name of the column for storing 2-D geometry points.

If not provided, it defaults to the first non-ID column.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.KMeans(n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False, use_fast_library=None, use_float=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters
n_clustersint, optional

Number of clusters. If this parameter is not specified, you must specify n_clusters_min and n_clusters_max instead.

n_clusters_minint, optional

Cluster range minimum.

n_clusters_maxint, optional

Cluster range maximum.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1 /S,x2 /S,...,xn /S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
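
The arithmetic behind 'l1_norm' and 'min_max' can be illustrated with plain Python (this sketch is not part of the hana_ml API; the sample values are arbitrary):

>>> x = [0.5, 1.5, 16.5]
>>> s = sum(abs(v) for v in x)            # S = |x1| + |x2| + |x3| = 18.5
>>> [round(v / s, 4) for v in x]          # 'l1_norm'
[0.027, 0.0811, 0.8919]
>>> lo, hi = min(x), max(x)
>>> [(v - lo) / (hi - lo) for v in x]     # 'min_max' per column
[0.0, 0.0625, 1.0]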

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

tolfloat, optional

Convergence threshold for exiting iterations.

Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode{'auto', 'optimize-speed', 'optimize-space'}, optional

Indicates the memory mode that the algorithm uses.

  • 'auto': Chosen by algorithm.

  • 'optimize-speed': Prioritizes speed.

  • 'optimize-space': Prioritizes memory.

Only valid when accelerated is True.

Defaults to 'auto'.

acceleratedbool, optional

Indicates whether to use technology like cache to accelerate the calculation process:

  • If True, the calculation process will be accelerated.

  • If False, the calculation process will not be accelerated.

Defaults to False.

use_fast_librarybool, optional

Uses vectorized accelerated operations when set to True. Not valid when accelerated is True.

Defaults to False.

use_floatbool, optional

Specifies the floating-point precision used in the calculation:

  • False: double

  • True: float

Only valid when use_fast_library is True. Not valid when accelerated is True.

Defaults to True.

Examples

Input dataframe df for K Means:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create KMeans instance:

>>> km = clustering.KMeans(n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='Euclidean',
...                        category_weights=0.5)

Perform fit_predict:

>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679

Input dataframe df for Accelerated K-Means :

>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create Accelerated Kmeans instance:

>>> akm = clustering.KMeans(init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes
labels_DataFrame

Label assigned to each sample.

cluster_centers_DataFrame

Coordinates of cluster centers.

model_DataFrame

Model content.

statistics_DataFrame

Statistic value.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Fit the model when given training dataset.

fit_predict(self, data, key[, features, ...])

Fit with the dataset and return the labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

fit_predict(self, data, key, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

  • SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
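
A minimal sketch of predict(), reusing the km instance from the Examples above and assuming df_new is a hypothetical DataFrame with the same column layout as the data used for fit():

>>> assignments = km.predict(data=df_new, key='ID')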

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.KMedians(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No, normalization will not be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedians instance:

>>> kmedians = KMedians(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1

Performing fit_predict() on given dataframe:

>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Perform clustering on input dataset.

fit_predict(self, data, key[, features, ...])

Perform clustering algorithm and return labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER columns that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids to calculate cluster centers. K-Medoids is more robust to noise and outliers.

Parameters
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No, normalization will not be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedoids instance:

>>> kmedoids = KMedoids(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Perform clustering on input dataset.

fit_predict(self, data, key[, features, ...])

Perform clustering algorithm and return labels.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters
dataDataFrame

DataFrame containing the input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER columns that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters
dataDataFrame

DataFrame containing input data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.crf

This module contains a Python wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.

The following class is available:

class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).

Parameters
epsilonfloat, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lambfloat, optional

Regularization weight, should be greater than 0.

Defaults to 1.0.

max_iterint, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_mint, optional

Number of memories to be stored in L_BFGS optimization algorithm.

Defaults to 25.

use_class_featurebool, optional

Whether to include a feature for the class/label. This is the same as having a bias vector in the model.

Defaults to True.

use_wordbool, optional

If True, includes a feature for the current word.

Defaults to True.

use_ngramsbool, optional

Whether to make features from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngramsbool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_lengthint, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.

use_prevbool, optional

Whether or not to include a feature for the previous word and current word; together with other options, this enables other previous-word features.

Defaults to True.

use_nextbool, optional

Whether or not to include a feature for next word and current word.

Defaults to True.

disjunction_widthint, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctivebool, optional

Whether or not to include features giving disjunctions of words anywhere within disjunction_width words to the left or right.

Defaults to True.

use_seqsbool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqsbool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqsbool, optional

Whether or not to use basic zeroth order word shape features.

Defaults to True.

use_type_seqs2bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqsbool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shapeint, optional

Word shape, e.g. whether the word is capitalized or numeric. Currently only chris2UseLC is supported. Word shape is not used if this is 0.

thread_ratiofloat, optional

Specifies the ratio of the total number of threads that can be used by the fit (i.e. training) function.

The range of this parameter is from 0 to 1.

0 means using only a single thread, and 1 means using at most all currently available threads.

Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O

Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382

Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52

Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)

Check the prediction result:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52
Attributes
model_DataFrame

CRF model content.

stats_DataFrame

Statistic info for CRF model fitting, structured as follows:

  • 1st column: name of the statistics, type NVARCHAR(100).

  • 2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_DataFrame

Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (in the future).

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, doc_id, word_pos, word, label])

Function for training the CRF model on English text.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data[, doc_id, word_pos, ...])

The function that predicts text labels based on the trained CRF model.

fit(self, data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters
dataDataFrame

Input data for training/fitting the CRF model.

It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

labelstr, optional

Name of the label column.

Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(self, data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

The function that predicts text labels based on the trained CRF model.

Parameters
dataDataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the 1st column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

thread_ratiofloat, optional

Specifies the ratio of the total number of threads that can be used by the predict function.

The range of this parameter is from 0 to 1.

0 means using only a single thread, and 1 means using at most all currently available threads.

Values outside this range are ignored, and the predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns
DataFrame

Prediction result for the input data, structured as follows:

  • 1st column: document ID,

  • 2nd column: word position,

  • 3rd column: label.
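
As a sketch of model persistence: the fitted model content could be stored in a table and later reloaded into a fresh instance via load_model() before calling predict(). The table name "PAL_CRF_MODEL_TBL" is an assumption, conn is the connection context used elsewhere in this documentation, and DataFrame.save() is shown only as one possible way to store the model content:

>>> crf.model_.save("PAL_CRF_MODEL_TBL")
>>> new_crf = CRF()
>>> new_crf.load_model(conn.table("PAL_CRF_MODEL_TBL"))
>>> res2 = new_crf.predict(data=df_pred, doc_id='DOC_ID',
...                        word_pos='WORD_POSITION', word='WORD')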

hana_ml.algorithms.pal.decomposition

This module contains Python wrappers for PAL decomposition algorithms.

The following classes are available:

class hana_ml.algorithms.pal.decomposition.PCA(scaling=None, thread_ratio=None, scores=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Principal component analysis (PCA) reduces the dimensionality of multivariate data using Singular Value Decomposition.

Parameters
thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

No default value.

scalingbool, optional

If true, scale variables to have unit variance before the analysis takes place.

Defaults to False.

scoresbool, optional

If true, output the scores on each principal component when fitting.

Defaults to False.

Examples

Input DataFrame df1 for training:

>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

Creating a PCA instance:

>>> pca = PCA(scaling=True, thread_ratio=0.5, scores=True)

Performing fit on given dataframe:

>>> pca.fit(data=df1, key='ID')

Output:

>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

Input dataframe df2 for transforming:

>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0

Performing transform() on given dataframe:

>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
Attributes
loadings_DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_DataFrame

Loadings statistics on each component.

scores_DataFrame

The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_DataFrame

Mean and scale values of each variable.

Note

Variables cannot be scaled if any variable has a constant value across all data items.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, label])

Principal component analysis function.

fit_transform(self, data, key[, features, label])

Fit with the dataset and return the scores.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features, ...])

Principal component analysis projection function using a trained model.

fit(self, data, key, features=None, label=None)

Principal component analysis function.

Parameters
dataDataFrame

Data to be fitted.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

fit_transform(self, data, key, features=None, label=None)

Fit with the dataset and return the scores.

Parameters
dataDataFrame

Data to be analyzed.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data 's ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.
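
For example, reusing df1 and the pca instance from the Examples above, fitting and scoring can be combined in a single call (a minimal sketch, output omitted):

>>> scores = pca.fit_transform(data=df1, key='ID')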

transform(self, data, key, features=None, n_components=None, label=None)

Principal component analysis projection function using a trained model.

Parameters
dataDataFrame

Data to be analyzed.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

n_componentsint, optional

Number of components to be retained.

The value range is from 1 to number of features.

Defaults to number of features.

labelstr, optional

Label of data.

Returns
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data 's ID column.

  • SCORE columns, type DOUBLE, representing the component score values of each data point.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters
n_componentsint

Expected number of topics in the corpus.

doc_topic_priorfloat, optional

Specifies the prior weight related to document-topic distribution.

Defaults to 50/n_components.

topic_word_priorfloat, optional

Specifies the prior weight related to topic-word distribution.

Defaults to 0.1.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Number of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Value must be greater than 0.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

max_top_wordsint, optional

Specifies the maximum number of words to be output for each topic.

Defaults to 0.

threshold_top_wordsfloat, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold.

It cannot be used together with parameter max_top_words.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

  • 'uniform': Assign each word in each document a topic by uniform distribution.

  • 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document.

Each delimiter must be one character long.

Defaults to [' '].

output_word_assignmentbool, optional

Controls whether to output the word_topic_assignment_ DataFrame. If True, the word_topic_assignment_ DataFrame is produced.

Defaults to False.

Examples

Input dataframe df1 for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating a LDA instance:

>>> lda = LatentDirichletAllocation(n_components=6, burn_in=50, thin=10,
                                    iteration=100, seed=1,
                                    max_top_words=5, doc_topic_prior=0.1,
                                    output_word_assignment=True,
                                    delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')

Output:

>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe df2 to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT', burn_in=2000, thin=100,
                        iteration=1000, seed=1, output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes
doc_topic_dist_DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

word_topic_assignment_DataFrame

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data's document ID column from fit().

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_DataFrame

Topic top words table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_DataFrame

Topic-word distribution table, structured as follows:

  • TOPIC_ID, type INTEGER, topic ID.

  • WORD_ID, type INTEGER, word ID.

  • PROBABILITY, type DOUBLE, probability of word given topic.

dictionary_DataFrame

Dictionary table, structured as follows:

  • WORD_ID, type INTEGER, word ID.

  • WORD, type NVARCHAR(5000), word text.

statistic_DataFrame

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Note

  • Parameters max_top_words and threshold_top_words cannot be used together.

  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, document])

Fit LDA model based on training data.

fit_transform(self, data, key[, document])

Fit LDA model based on training data and return the topic assignment for the training documents.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, document, ...])

Transform the topic assignment for new documents based on the previous LDA estimation results.

fit(self, data, key, document=None)

Fit LDA model based on training data.

Parameters
dataDataFrame

Training data.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
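
For instance, when the training DataFrame consists of only the document ID column and one text column, the text column can be left implicit (a minimal sketch; the DataFrame name df_docs is an assumption):

>>> lda.fit(data=df_docs, key='DOCUMENT_ID')   # document defaults to the only non-key column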

fit_transform(self, data, key, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters
dataDataFrame

Training data.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data 's document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

transform(self, data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters
dataDataFrame

Independent variable values used for transform.

keystr

Name of the document ID column.

documentstr, optional

Name of the document text column.

If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_inint, optional

Number of omitted Gibbs iterations at the beginning.

Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.

Defaults to 0.

iterationint, optional

Numbers of Gibbs iterations.

Defaults to 2000.

thinint, optional

Number of omitted in-between Gibbs iterations.

Defaults to 1.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

gibbs_initstr, optional

Specifies initialization method for Gibbs sampling:

  • 'uniform': Assign each word in each document a topic by uniform distribution.

  • 'gibbs': Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to 'uniform'.

delimiterslist of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.

Defaults to [' '].

output_word_assignmentbool, optional

Controls whether to output the word_topic_df or not.

If True, output the word_topic_df.

Defaults to False.

Returns
DataFrame

Document-topic distribution table, structured as follows:

  • Document ID column, with same name and type as data 's document ID column.

  • TOPIC_ID, type INTEGER, topic ID.

  • PROBABILITY, type DOUBLE, probability of topic given document.

Word-topic assignment table, structured as follows:

  • Document ID column, with same name and type as data 's document ID column.

  • WORD_ID, type INTEGER, word ID.

  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

Statistics table, structured as follows:

  • STAT_NAME, type NVARCHAR(256), statistic name.

  • STAT_VALUE, type NVARCHAR(1000), statistic value.

hana_ml.algorithms.pal.discriminant_analysis

This module contains PAL wrapper for discriminant analysis algorithm. The following class is available:

class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(regularization_type=None, regularization_amount=None, projection=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear discriminant analysis for classification and data reduction.

Parameters
regularization_type{'mixing', 'diag', 'pseudo'}, optional

The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.

Defaults to 'mixing'.

regularization_amountfloat, optional

The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix.

Valid range for this parameter is [0,1].

Valid only when regularization_type is 'mixing'.

Defaults to the smallest number in [0,1] that makes the regularized empirical covariance matrix invertible.

projectionbool, optional

Whether or not to compute the projection model.

Defaults to True.
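
As a minimal sketch of the regularization options above (the mixing weight 0.1 is an illustrative assumption, not a recommendation):

>>> lda_reg = LinearDiscriminantAnalysis(regularization_type='mixing',
...                                      regularization_amount=0.1,
...                                      projection=False)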

Examples

The training data for linear discriminant analysis:

>>> df.collect()
     X1   X2   X3   X4            CLASS
0   5.1  3.5  1.4  0.2      Iris-setosa
1   4.9  3.0  1.4  0.2      Iris-setosa
2   4.7  3.2  1.3  0.2      Iris-setosa
3   4.6  3.1  1.5  0.2      Iris-setosa
4   5.0  3.6  1.4  0.2      Iris-setosa
5   5.4  3.9  1.7  0.4      Iris-setosa
6   4.6  3.4  1.4  0.3      Iris-setosa
7   5.0  3.4  1.5  0.2      Iris-setosa
8   4.4  2.9  1.4  0.2      Iris-setosa
9   4.9  3.1  1.5  0.1      Iris-setosa
10  7.0  3.2  4.7  1.4  Iris-versicolor
11  6.4  3.2  4.5  1.5  Iris-versicolor
12  6.9  3.1  4.9  1.5  Iris-versicolor
13  5.5  2.3  4.0  1.3  Iris-versicolor
14  6.5  2.8  4.6  1.5  Iris-versicolor
15  5.7  2.8  4.5  1.3  Iris-versicolor
16  6.3  3.3  4.7  1.6  Iris-versicolor
17  4.9  2.4  3.3  1.0  Iris-versicolor
18  6.6  2.9  4.6  1.3  Iris-versicolor
19  5.2  2.7  3.9  1.4  Iris-versicolor
20  6.3  3.3  6.0  2.5   Iris-virginica
21  5.8  2.7  5.1  1.9   Iris-virginica
22  7.1  3.0  5.9  2.1   Iris-virginica
23  6.3  2.9  5.6  1.8   Iris-virginica
24  6.5  3.0  5.8  2.2   Iris-virginica
25  7.6  3.0  6.6  2.1   Iris-virginica
26  4.9  2.5  4.5  1.7   Iris-virginica
27  7.3  2.9  6.3  1.8   Iris-virginica
28  6.7  2.5  5.8  1.8   Iris-virginica
29  7.2  3.6  6.1  2.5   Iris-virginica

Set up an instance of LinearDiscriminantAnalysis model and train it:

>>> lda = LinearDiscriminantAnalysis(regularization_type='mixing', projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')

Check the coefficients of the obtained linear discriminators and the projection model:

>>> lda.coef_.collect()
             CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
             NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333

Data to predict the class labels:

>>> df_pred.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Perform predict() and check the result:

>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
    ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065

Data to project:

>>> df_proj.collect()
    ID   X1   X2   X3   X4
0    1  5.1  3.5  1.4  0.2
1    2  4.9  3.0  1.4  0.2
2    3  4.7  3.2  1.3  0.2
3    4  4.6  3.1  1.5  0.2
4    5  5.0  3.6  1.4  0.2
5    6  5.4  3.9  1.7  0.4
6    7  4.6  3.4  1.4  0.3
7    8  5.0  3.4  1.5  0.2
8    9  4.4  2.9  1.4  0.2
9   10  4.9  3.1  1.5  0.1
10  11  7.0  3.2  4.7  1.4
11  12  6.4  3.2  4.5  1.5
12  13  6.9  3.1  4.9  1.5
13  14  5.5  2.3  4.0  1.3
14  15  6.5  2.8  4.6  1.5
15  16  5.7  2.8  4.5  1.3
16  17  6.3  3.3  4.7  1.6
17  18  4.9  2.4  3.3  1.0
18  19  6.6  2.9  4.6  1.3
19  20  5.2  2.7  3.9  1.4
20  21  6.3  3.3  6.0  2.5
21  22  5.8  2.7  5.1  1.9
22  23  7.1  3.0  5.9  2.1
23  24  6.3  2.9  5.6  1.8
24  25  6.5  3.0  5.8  2.2
25  26  7.6  3.0  6.6  2.1
26  27  4.9  2.5  4.5  1.7
27  28  7.3  2.9  6.3  1.8
28  29  6.7  2.5  5.8  1.8
29  30  7.2  3.6  6.1  2.5

Perform project() and check the result:

>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1','X2','X3','X4'],
...                        proj_dim=2)
>>> res_proj.collect()
    ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None
Attributes
basic_info_DataFrame

Basic information of the training data for linear discriminant analysis.

priors_DataFrame

The empirical priors for each class in the training data.

coef_DataFrame

Coefficients (inclusive of intercepts) of each class' linear score function for the training data.

proj_info_DataFrame

Projection-related information, such as the standard deviations of the discriminants, the proportion of total variance explained by each discriminant, etc.

proj_model_DataFrame

The projection matrix and overall means for features.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label])

Calculate linear discriminators from training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, verbose])

Predict class labels using fitted linear discriminators.

project(self, data, key[, features, proj_dim])

Project data into lower dimensional spaces using fitted LDA projection model.

fit(self, data, key=None, features=None, label=None)

Calculate linear discriminators from training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If not provided, it is assumed that input data has no ID column.

featureslist of str, optional

Names of the feature columns.

If not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the class label.

If not provided, it defaults to the last non-ID column.

predict(self, data, key, features=None, verbose=None)

Predict class labels using fitted linear discriminators.

Parameters
dataDataFrame

Data for predicting the class labels.

keystr

Name of the ID column.

featureslist of str, optional

Name of the feature columns. If not provided, defaults to all non-ID columns.

verbosebool, optional

Whether or not to output the scores of all classes.

If False, only the score of the predicted class will be output.

Defaults to False.

Returns
DataFrame

Predicted class labels and the corresponding scores, structured as follows:

  • ID: with the same name and data type as data's ID column.

  • CLASS: with the same name and data type as training data's label column.

  • SCORE: type DOUBLE, score of the predicted class.

project(self, data, key, features=None, proj_dim=None)

Project data into lower dimensional spaces using fitted LDA projection model.

Parameters
dataDataFrame

Data for linear discriminant projection.

keystr

Name of the ID column.

featureslist of str, optional

Name of the feature columns.

If not provided, defaults to all non-ID columns.

proj_dimint, optional

Dimension of the projected space, equivalent to the number of discriminants used for projection.

Defaults to the number of obtained discriminants.

Returns
DataFrame
Projected data, structured as follows:
  • 1st column: ID, with the same name and data type as data for projection.

  • other columns with name DISCRIMINANT_i, where i iterates from 1 to the number of elements in features, data type DOUBLE.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.kernel_density

This module contains PAL wrappers for kernel density estimation.

The following class is available:

class hana_ml.algorithms.pal.kernel_density.KDE(thread_ratio=None, leaf_size=None, kernel=None, method=None, distance_level=None, minkowski_power=None, atol=None, rtol=None, bandwidth=None, resampling_method=None, evaluation_metric=None, bandwidth_values=None, bandwidth_range=None, stat_info=None, random_state=None, search_strategy=None, repeat_times=None, algorithm=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Performs kernel density estimation, which serves a purpose analogous to histograms while avoiding their defects.

Parameters
thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Default to 0.0.

leaf_sizeint, optional

Number of samples in a KD tree or Ball tree leaf node.

Only valid when algorithm is 'kd-tree' or 'ball-tree'.

Default to 30.

kernel{'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional

Kernel function type.

Default to 'gaussian'.

method{'brute_force', 'kd_tree', 'ball_tree'}, optional (deprecated)

Searching method.

Default to 'brute_force'

algorithm{'brute-force', 'kd-tree', 'ball-tree'}, optional

Specifies the searching method.

Default to 'brute-force'.

bandwidthfloat, optional

Bandwidth used during density calculation.

0 means the bandwidth is determined by an internal optimizer; otherwise, the provided value is used as the bandwidth.

Only valid when data is one dimensional.

Default to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Specifies how the distance between the training data and a test data point is computed.

Default to 'euclidean'.

minkowski_powerfloat, optional

When you use the Minkowski distance, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Default to 3.0.

rtolfloat, optional

The desired relative tolerance of the result.

A larger tolerance generally leads to faster execution.

Default to 1e-8.

atolfloat, optional

The desired absolute tolerance of the result.

A larger tolerance generally leads to faster execution.

Default to 0.

stat_infobool, optional
  • False: STATISTIC table is empty

  • True: Statistic information is displayed in the STATISTIC table.

Only valid when parameter selection is not specified.

resampling_method{'loocv'}, optional

Specifies the resampling method for model evaluation or parameter selection, only 'loocv' is permitted.

Must be set together with evaluation_metric.

No default value.

evaluation_metric{'nll'}, optional

Specifies the evaluation metric for model evaluation or parameter selection, only 'nll' is supported.

No default value.

search_strategy{'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Default to 1.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Default to 0.

bandwidth_valueslist, optional

Specifies values of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.

bandwidth_rangelist, optional

Specifies ranges of parameter bandwidth to be selected.

Only valid when parameter selection is enabled.
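
As a sketch of how these parameter-selection options combine (the candidate bandwidth values are illustrative assumptions), leave-one-out cross validation over a grid of bandwidths could be set up as follows:

>>> kde_cv = KDE(resampling_method='loocv', evaluation_metric='nll',
...              search_strategy='grid', bandwidth_values=[0.5, 0.68, 1.0])
>>> kde_cv.fit(data=df_train, key='ID')
>>> kde_cv.optim_param_.collect()

Here df_train refers to the fitting data shown in the Examples below; the selected bandwidth is reported in optim_param_.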

Examples

Data used for fitting a kernel density function:

>>> df_train.collect()
   ID        X1        X2
0   0 -0.425770 -1.396130
1   1  0.884100  1.381493
2   2  0.134126 -0.032224
3   3  0.845504  2.867921
4   4  0.288441  1.513337
5   5 -0.666785  1.244980
6   6 -2.102968 -1.428327
7   7  0.769902 -0.473007
8   8  0.210291  0.328431
9   9  0.482323 -0.437962

Data used for density value prediction:

>>> df_pred.collect()
   ID        X1        X2
0   0 -2.102968 -1.428327
1   1 -2.102968  0.719797
2   2 -2.102968  2.867921
3   3 -0.609434 -1.428327
4   4 -0.609434  0.719797
5   5 -0.609434  2.867921
6   6  0.884100 -1.428327
7   7  0.884100  0.719797
8   8  0.884100  2.867921

Construct KDE instance:

>>> kde = KDE(leaf_size=10, method='kd_tree', bandwidth=0.68129, stat_info=True)

Fit a kernel density function:

>>> kde.fit(data=df_train, key='ID')

Perform density prediction and check the results:

>>> res, stats = kde.predict(data=df_pred, key='ID')
>>> res.collect()
   ID  DENSITY_VALUE
0   0      -3.324821
1   1      -5.733966
2   2      -8.372878
3   3      -3.123223
4   4      -2.772520
5   5      -4.852817
6   6      -3.469782
7   7      -2.556680
8   8      -3.198531
>>> stats.collect()
   TEST_ID                            FITTING_IDS
0        0  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
1        1  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
2        2  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
3        3  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
4        4  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
5        5  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
6        6  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
7        7  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
8        8  {"fitting ids":[0,1,2,3,4,5,6,7,8,9]}
Attributes
stats_DataFrame

Statistical info for model evaluation. Available only when model evaluation/parameter selection is triggered.

optim_param_DataFrame

Optimal parameters selected. Available only when model evaluation/parameter selection is triggered.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features])

If parameter selection / model evaluation is enabled, perform it.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Apply kernel density analysis.

fit(self, data, key, features=None)

If parameter selection / model evaluation is enabled, perform it. Otherwise, simply set the training data.

Parameters
dataDataFrame

Dataframe including the data of density distribution.

keystr

Name of the ID column.

featuresstr/list of str, optional

Name of the feature columns in the dataframe.

Defaults to all non-key columns.

Attributes
_training_dataDataFrame

The training data for kernel density function fitting.

predict(self, data, key, features=None)

Apply kernel density analysis.

Parameters
dataDataFrame

Dataframe including the data of density prediction.

keystr

Column of IDs of the data points for density prediction.

featureslist of str, optional

Names of the feature columns.

Defaults to all non-key columns.

Returns
DataFrame
  • Result data table, i.e. predicted log-density values on all points in data.

  • Statistics information table which reflects the support of prediction points over all training points.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.linear_model

This module contains Python wrappers for PAL linear model algorithms.

The following classes are available:

class hana_ml.algorithms.pal.linear_model.LinearRegression(solver=None, var_select=None, features_must_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Linear regression is an approach to model the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables and denoted as the predictor vector.

Parameters
solver{'QR', 'SVD', 'CD', 'Cholesky', 'ADMM'}, optional

Algorithms to use to solve the least square problem. Case-insensitive.

  • 'QR': QR decomposition.

  • 'SVD': singular value decomposition.

  • 'CD': cyclical coordinate descent method.

  • 'Cholesky': Cholesky decomposition.

  • 'ADMM': alternating direction method of multipliers.

'CD' and 'ADMM' are supported only when var_select is 'all'.

Defaults to QR decomposition.

var_select{'all', 'forward', 'backward', 'stepwise'}, optional

Method to perform variable selection.

  • 'all': all variables are included.

  • 'forward': forward selection.

  • 'backward': backward selection.

  • 'stepwise': stepwise selection.

'forward' and 'backward' selection are supported only when solver is 'QR', 'SVD' or 'Cholesky'.

Note that 'stepwise' is a new option in SAP HANA SPS05 and Cloud.

Defaults to 'all'.

features_must_select: str or list of str, optional

Specifies the column name that needs to be included in the final training model when executing the variable selection.

This parameter can be specified multiple times, each time with one column name as feature.

Only valid when var_select is not 'all'.

Note that this parameter is a hint. There are exceptional cases in which a specified mandatory feature is excluded from the final model.

For instance, some mandatory features can be represented as a linear combination of other features, among which some are also mandatory features.

New parameter added in SAP HANA Cloud and SPS05.

No default value.

interceptbool, optional

If true, include the intercept in the model.

Defaults to True.

alpha_to_enterfloat, optional

P-value for forward selection.

Valid only when var_select is 'forward' or 'stepwise'.

Note that 'stepwise' is a new option in SAP HANA SPS05 and Cloud.

Defaults to 0.05 when var_select is 'forward', 0.15 when var_select is 'stepwise'.

alpha_to_removefloat, optional

P-value for backward selection.

Valid only when var_select is 'backward' or 'stepwise'.

Note that 'stepwise' is a new option in SAP HANA SPS05 and Cloud.

Defaults to 0.1 when var_select is 'backward', and 0.15 when var_select is 'stepwise'.

enet_lambdafloat, optional

Penalized weight. Value should be greater than or equal to 0.

Valid only when solver is 'CD' or 'ADMM'.

enet_alphafloat, optional

Elastic net mixing parameter.

Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1.0.

max_iterint, optional

Maximum number of passes over training data.

If convergence is not reached after the specified number of iterations, an error will be generated.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1e5.

tolfloat, optional

Convergence threshold for coordinate descent.

Valid only when solver is 'CD'.

Defaults to 1.0e-7.

phofloat, optional

Step size for ADMM. Generally, it should be greater than 1.

Valid only when solver is 'ADMM'.

Defaults to 1.8.

stat_infbool, optional

If true, output t-value and Pr(>|t|) of coefficients.

Defaults to False.

adjusted_r2bool, optional

If true, include the adjusted R2 value in statistics.

Defaults to False.

dw_testbool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

reset_testint, optional

Specifies the order of Ramsey RESET test.

Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted.

Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to 1.

bp_testbool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

ks_testbool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution.

Not available if elastic net regularization is enabled or intercept is ignored.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Valid only when solver is 'QR', 'CD', 'Cholesky' or 'ADMM'.

Defaults to 0.0.

categorical_variablestr or list of str, optional

Specifies INTEGER columns that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

pmml_export{'no', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • 'no' or not provided: No PMML model.

  • 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Prediction does not require a PMML model.

resampling_method{'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation/parameter selection.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

Must be set together with evaluation_metric.

No default value.

evaluation_metric{'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Must be set together with resampling_method.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to 'cv'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

timeoutint, optional

Specifies the maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda and enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda, enet_alpha.

No default value.
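
As a sketch of how the elastic-net and parameter-selection options above fit together (the candidate values are illustrative assumptions, not recommendations):

>>> lr_cv = LinearRegression(solver='CD',
...                          resampling_method='cv',
...                          evaluation_metric='rmse',
...                          fold_num=5,
...                          search_strategy='grid',
...                          param_values={'enet_lambda': [0.01, 0.1, 1.0],
...                                        'enet_alpha': [0.0, 0.5, 1.0]})
>>> lr_cv.fit(data=df, key='ID', label='Y')
>>> lr_cv.optim_param_.collect()

The selected enet_lambda / enet_alpha combination is reported in optim_param_.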

Examples

Training data:

>>> df.collect()
  ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Training the model:

>>> lr = LinearRegression(thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')

Prediction:

>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
Attributes
coefficients_DataFrame

Fitted regression coefficients.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

optim_param_DataFrame

If parameter selection is enabled, the optimal parameters will be selected.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Fit regression model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Predict dependent variable values based on fitted model.

score(self, data, key[, features, label])

Returns the coefficient of determination R2 of the prediction.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

  • ID column: with same name and type as data 's ID column.

  • VALUE: type DOUBLE, representing predicted values.

score(self, data, key, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R2 of the prediction.

Note

score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters
multi_classbool, optional

If true, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iterint, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

  • multi-class: Defaults to 100.

  • binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.

pmml_export{'no', 'single-row', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • multi-class:

    • 'no' or not provided: No PMML model.

    • 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

  • binary-class:

    • 'no' or not provided: No PMML model.

    • 'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.

    • 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Defaults to 'no'.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) in the data that should be treated as categorical.

standardizebool, optional

If true, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_infbool, optional

If true, proceed with statistical inference.

Defaults to False.

solver{'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional

Optimization algorithm.

  • 'auto' : automatically determined by system based on input data and parameters.

  • 'newton': Newton iteration method, can only solve ridge regression problems.

  • 'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.

  • 'lbfgs': LBFGS method (recommended when having many independent variables, can only solve ridge regression problems when multi_class is True).

  • 'stochastic': Stochastic gradient descent method (recommended when dealing with very large dataset).

  • 'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.

When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.

Defaults to 'auto'.

Note

If the elastic net regularization term contains a LASSO penalty while a solver that can only solve ridge regression problems is specified, the specified solver will be ignored (hence the default value is used). Users can check the statistics table for the solver that was finally adopted.

enet_alphafloat, optional

Elastic net mixing parameter.

Only valid when multi_class is False and solver is 'auto', 'newton', 'cyclical', 'lbfgs' or 'proximal'.

Defaults to 1.0.

enet_lambdafloat, optional

Penalized weight. Only valid when multi_class is False and solver is 'auto', 'newton', 'cyclical', 'lbfgs' or 'proximal'.

Defaults to 0.0.

tolfloat, optional

Convergence threshold for exiting iterations.

Only valid when multi_class is False.

Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.

epsilonfloat, optional

Determines the accuracy with which the solution is to be found.

Only valid when multi_class is False and the solver is newton or lbfgs.

Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.

thread_ratiofloat, optional

Controls the proportion of available threads to use for fit() method.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 1.0.

max_pass_numberint, optional

The maximum number of passes over the data.

Only valid when multi_class is False and solver is 'stochastic'.

Defaults to 1.

sgd_batch_numberint, optional

The batch number of Stochastic gradient descent.

Only valid when multi_class is False and solver is 'stochastic'.

Defaults to 1.

precomputebool, optional

Whether to pre-compute the Gram matrix.

Only valid when solver is 'cyclical'.

Defaults to True.

handle_missingbool, optional

Whether to handle missing values.

Defaults to True.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

By default, string is categorical, while int and double are numerical.

lbfgs_mint, optional

Number of previous updates to keep.

Only applicable when multi_class is False and solver is 'lbfgs'.

Defaults to 6.

resampling_method{'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

The resampling method for model evaluation and parameter selection.

If no value specified, neither model evaluation nor parameter selection is activated.

metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional (deprecated)

The evaluation metric used for model evaluation/parameter selection.

evaluation_metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional

The evaluation metric used for model evaluation/parameter selection.

fold_numint, optional

The number of folds for cross-validation.

Mandatory and valid only when resampling_method is 'cv' or 'stratified_cv'.

repeat_timesint, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

The search method for parameter selection.

random_search_timesint, optional

The number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is 'random'.

random_stateint, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection.

Progress indicator deactivated if no value provided.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.

class_map0str, optional (deprecated)

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR

Only valid when multi_class is False during binary class fit and score.

class_map1str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.
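
As a sketch of the model-evaluation and parameter-selection options above (the enet_lambda candidates are illustrative assumptions):

>>> lr_cv = LogisticRegression(solver='cyclical',
...                            resampling_method='cv',
...                            evaluation_metric='accuracy',
...                            fold_num=5,
...                            search_strategy='grid',
...                            param_values={'enet_lambda': [0.01, 0.1, 1.0]})
>>> lr_cv.fit(data=df, features=['V1', 'V2', 'V3'], label='CATEGORY',
...           categorical_variable=['V3'])
>>> lr_cv.optim_param_.collect()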

Examples

Training data:

>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1

Create LogisticRegression instance and call fit:

>>> lr = linear_model.LogisticRegression(solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
0                                  __PAL_INTERCEPT__    17.044785
1                                 V1__PAL_DELIMIT__A     0.000000
2                                 V1__PAL_DELIMIT__B    -1.464903
3                                                 V2    -4.819740
4                                 V3__PAL_DELIMIT__0     0.000000
5                                 V3__PAL_DELIMIT__1    -2.794139
6                                 V3__PAL_DELIMIT__2    -4.807858
7                                 V3__PAL_DELIMIT__3    -2.780918
8  {"CONTENT":"{\"impute_model\":{\"column_statis...          NaN
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1

Call predict():

>>> result = lr.predict(data=pred_df,
...                     key='ID',
...                     categorical_variable=['V3'],
...                     thread_ratio=0.1)
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503618e-01
1    1     1  8.485210e-01
2    2     1  9.555861e-01
3    3     0  3.701858e-02
4    4     0  2.229129e-02
5    5     0  2.503962e-01
6    6     0  4.945832e-02
7    7     0  9.922085e-03
8    8     0  2.852859e-01
9    9     0  2.689207e-01
10  10     0  2.200498e-02
11  11     0  4.713726e-03
12  12     0  2.349803e-02
13  13     0  5.830425e-04
14  14     0  4.886177e-07
15  15     0  6.938072e-06
16  16     0  1.637820e-04
17  17     1  8.986435e-01

Input data for score():

>>> df_score.collect()
    ID V1     V2  V3  CATEGORY
0    0  B  2.620   0         1
1    1  B  2.875   0         1
2    2  A  2.320   1         1
3    3  A  3.215   2         0
4    4  B  3.440   3         0
5    5  B  3.460   0         0
6    6  A  3.570   1         1
7    7  B  3.190   2         0
8    8  A  3.150   3         0
9    9  B  3.440   0         0
10  10  B  3.440   1         0
11  11  A  4.070   3         0
12  12  A  3.730   1         0
13  13  B  3.780   2         0
14  14  B  5.250   2         0
15  15  A  5.424   3         0
16  16  A  5.345   0         0
17  17  B  2.200   1         1

Call score():

>>> lr.score(data=df_score,
...          key='ID',
...          categorical_variable=['V3'],
...          thread_ratio=0.1)
0.944444
Attributes
coef_DataFrame

Values of the coefficients.

result_DataFrame

Model content.

optim_param_DataFrame

The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_DataFrame

Statistics info for the trained model, structured as follows:

  • 1st column: 'STAT_NAME', NVARCHAR(256)

  • 2nd column: 'STAT_VALUE', NVARCHAR(1000)

pmml_DataFrame

PMML model. Set to None if no PMML model was requested.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Fit the LR model when given training dataset.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, ...])

Predict with the dataset using the trained model.

score(self, data, key[, features, label, ...])

Return the mean accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise, all INTEGER columns are treated as numerical.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.
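
For example, with a string-valued label column the class mapping must be supplied explicitly (a minimal sketch; df_str and its label values 'no'/'yes' are hypothetical):

>>> lr.fit(data=df_str, features=['V1', 'V2', 'V3'], label='LABEL',
...        class_map0='no', class_map1='yes')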

predict(self, data, key, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False)

Predict with the dataset using the trained model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

verbosebool, optional

If true, output scoring probabilities for each class.

It is only applicable in the multi-class case.

Defaults to False.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise, all INTEGER columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is varchar or nvarchar during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is varchar or nvarchar during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns
DataFrame

Predicted result, structured as follows:

  • 1: ID column, with predicted class name.

  • 2: PROBABILITY, type DOUBLE

    • multi-class: probability of being predicted as the predicted class.

    • binary-class: probability of being predicted as the positive class.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.

score(self, data, key, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER columns that should be treated as categorical; otherwise all INTEGER columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is varchar or nvarchar during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is varchar or nvarchar during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns
float

Scalar accuracy value after comparing the predicted label and original label.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.linkpred

This module contains python wrapper for PAL link prediction function.

The following class is available:

class hana_ml.algorithms.pal.linkpred.LinkPrediction(method, beta=None, min_score=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Link predictor for calculating proximity scores between nodes of a network that are not directly linked, which is helpful for predicting missing links (the higher the proximity score, the more likely the two nodes are to be linked).

Parameters
method{'common_neighbors', 'jaccard', 'adamic_adar', 'katz'}

Method for computing the proximity between 2 nodes that are not directly linked.

betafloat, optional

A parameter included in the calculation of the Katz similarity (proximity) score. Valid only when method is 'katz'.

Defaults to 0.005.

min_scorefloat, optional

The links whose scores are lower than min_score will be filtered out from the result table.

Defaults to 0.
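
For instance, the Katz-based score can be selected instead (a minimal sketch reusing the documented defaults):

>>> lp_katz = LinkPrediction(method='katz', beta=0.005, min_score=0)
>>> res_katz = lp_katz.proximity_score(data=df, node1='NODE1', node2='NODE2')

Here df refers to the network data shown in the Examples below.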

Examples

Input dataframe df for training:

>>> df.collect()
   NODE1  NODE2
0      1      2
1      1      4
2      2      3
3      3      4
4      5      1
5      6      2
6      7      4
7      7      5
8      6      7
9      5      4

Create linkpred instance:

>>> lp = LinkPrediction(method='common_neighbors',
...                     beta=0.005,
...                     min_score=0,
...                     thread_ratio=0.2)

Calculate the proximity score of all nodes in the network with missing links, and check the result:

>>> res = lp.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res.collect()
    NODE1  NODE2     SCORE
0       1      3  0.285714
1       1      6  0.142857
2       1      7  0.285714
3       2      4  0.285714
4       2      5  0.142857
5       2      7  0.142857
6       4      6  0.142857
7       3      5  0.142857
8       3      6  0.142857
9       3      7  0.142857
10      5      6  0.142857

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

proximity_score(self, data[, node1, node2])

For predicting proximity scores between nodes under current choice of method.

proximity_score(self, data, node1=None, node2=None)

For predicting proximity scores between nodes under current choice of method.

Parameters
dataDataFrame

Network data with nodes and links.

Nodes are in columns while links in rows, where each link is represented by a pair of adjacent nodes as (node1, node2).

node1str, optional

Column name of data that gives node1 of all available links (see data).

Defaults to the name of the first column of data if not provided.

node2str, optional

Column name of data that gives node2 of all available links (see data).

Defaults to the name of the last column of data if not provided.

Returns
DataFrame

The proximity scores of pairs of nodes with missing links between them that are above 'min_score', structured as follows:

  • 1st column: node1 of a link

  • 2nd column: node2 of a link

  • 3rd column: proximity score of the two nodes

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.metrics

This module contains Python wrappers for PAL metrics to assess the quality of model outputs.

The following functions are available:

hana_ml.algorithms.pal.metrics.confusion_matrix(data, key, label_true=None, label_pred=None, beta=None, native=False)

Computes confusion matrix to evaluate the accuracy of a classification.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

label_truestr, optional

Name of the original label column.

If not given, defaults to the second column.

label_predstr, optional

Name of the predicted label column.

If not given, defaults to the third column.

betafloat, optional

Parameter used to compute the F-Beta score.

Defaults to 1.

nativebool, optional

Indicates whether to use native SQL statements for confusion matrix calculation.

Defaults to True.

Returns
DataFrame
Confusion matrix, structured as follows:
  • Original label, with same name and data type as it is in data.

  • Predicted label, with same name and data type as it is in data.

  • Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.

The DataFrame is sorted by (original label, predicted label) in descending order.

DataFrame

Classification report table, structured as follows:
  • Class, type NVARCHAR(100), class name

  • Recall, type DOUBLE, the recall of each class

  • Precision, type DOUBLE, the precision of each class

  • F_MEASURE, type DOUBLE, the F_measure of each class

  • SUPPORT, type INTEGER, the support - sample number in each class

Examples

Data contains the original label and predict label df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         1        1
1   2         1        1
2   3         1        1
3   4         1        2
4   5         1        1
5   6         2        2
6   7         2        1
7   8         2        2
8   9         2        2
9  10         2        2

Calculate the confusion matrix:

>>> cm, cr = confusion_matrix(data=df, key='ID', label_true='ORIGINAL', label_pred='PREDICT')

Output:

>>> cm.collect()
   ORIGINAL  PREDICT  COUNT
0         1        1      4
1         1        2      1
2         2        1      1
3         2        2      4
>>> cr.collect()
  CLASS  RECALL  PRECISION  F_MEASURE  SUPPORT
0     1     0.8        0.8        0.8        5
1     2     0.8        0.8        0.8        5
hana_ml.algorithms.pal.metrics.auc(data, positive_label=None, output_threshold=None)

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

Parameters
dataDataFrame

Input data, structured as follows:

  • ID column.

  • True class of the data point.

  • Classifier-computed probability that the data point belongs to the positive class.

positive_labelstr, optional

If original label is not 0 or 1, specifies the label value which will be mapped to 1.

output_thresholdbool, optional

Specifies whether or not to output the corresponding threshold values in the ROC table.

Defaults to False.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

  • THRESHOLD, type DOUBLE, representing the corresponding threshold value, available only when output_threshold is set to True.

Examples

Input DataFrame df:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         0     0.07
1   2         0     0.01
2   3         0     0.85
3   4         0     0.30
4   5         0     0.50
5   6         1     0.50
6   7         1     0.20
7   8         1     0.80
8   9         1     0.20
9  10         1     0.95

Compute Area Under Curve:

>>> auc, roc = auc(data=df)

Output:

>>> print(auc)
 0.66
>>> roc.collect()
   ID  FPR  TPR
0   0  1.0  1.0
1   1  0.8  1.0
2   2  0.6  1.0
3   3  0.6  0.6
4   4  0.4  0.6
5   5  0.2  0.4
6   6  0.2  0.2
7   7  0.0  0.2
8   8  0.0  0.0
hana_ml.algorithms.pal.metrics.multiclass_auc(data_original, data_predict)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

Parameters
data_originalDataFrame

True class data, structured as follows:

  • Data point ID column.

  • True class of the data point.

data_predictDataFrame

Predicted class data, structured as follows:

  • Data point ID column.

  • Possible class.

  • Classifier-computed probability that the data point belongs to that particular class.

For each data point ID, there should be one row for each possible class.

Returns
float

The area under the receiver operating characteristic curve.

DataFrame

False positive rate and true positive rate (ROC), structured as follows:

  • ID column, type INTEGER.

  • FPR, type DOUBLE, representing false positive rate.

  • TPR, type DOUBLE, representing true positive rate.

Examples

Input DataFrame df:

>>> df_original.collect()
   ID  ORIGINAL
0   1         1
1   2         1
2   3         1
3   4         2
4   5         2
5   6         2
6   7         3
7   8         3
8   9         3
9  10         3
>>> df_predict.collect()
    ID  PREDICT  PROB
0    1        1  0.90
1    1        2  0.05
2    1        3  0.05
3    2        1  0.80
4    2        2  0.05
5    2        3  0.15
6    3        1  0.80
7    3        2  0.10
8    3        3  0.10
9    4        1  0.10
10   4        2  0.80
11   4        3  0.10
12   5        1  0.20
13   5        2  0.70
14   5        3  0.10
15   6        1  0.05
16   6        2  0.90
17   6        3  0.05
18   7        1  0.10
19   7        2  0.10
20   7        3  0.80
21   8        1  0.00
22   8        2  0.00
23   8        3  1.00
24   9        1  0.20
25   9        2  0.10
26   9        3  0.70
27  10        1  0.20
28  10        2  0.20
29  10        3  0.60

Compute Area Under Curve:

>>> auc, roc = multiclass_auc(data_original=df_original, data_predict=df_predict)

Output:

>>> print(auc)
1.0
>>> roc.collect()
    ID   FPR  TPR
0    0  1.00  1.0
1    1  0.90  1.0
2    2  0.65  1.0
3    3  0.25  1.0
4    4  0.20  1.0
5    5  0.00  1.0
6    6  0.00  0.9
7    7  0.00  0.7
8    8  0.00  0.3
9    9  0.00  0.1
10  10  0.00  0.0
hana_ml.algorithms.pal.metrics.accuracy_score(data, label_true, label_pred)

Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.

Parameters
dataDataFrame

DataFrame of true and predicted labels.

label_truestr

Name of the column containing ground truth labels.

label_predstr

Name of the column containing predicted labels, as returned by a classifier.

Returns
float

Accuracy classification score. A lower accuracy indicates that the classifier correctly predicted a smaller proportion of the input labels.

Examples

Actual and predicted labels df for a hypothetical classification:

>>> df.collect()
   ACTUAL  PREDICTED
0    1        0
1    0        0
2    0        0
3    1        1
4    1        1

Accuracy score for these predictions:

>>> accuracy_score(data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.8

Compare that to null accuracy df_dummy (accuracy that could be achieved by always predicting the most frequent class):

>>> df_dummy.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       1
2    0       1
3    1       1
4    1       1
>>> accuracy_score(data=df_dummy, label_true='ACTUAL', label_pred='PREDICTED')
0.6

A perfect predictor df_perfect:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       0
2    0       0
3    1       1
4    1       1
>>> accuracy_score(data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0
hana_ml.algorithms.pal.metrics.r2_score(data, label_true, label_pred)

Computes coefficient of determination for regression results.

Parameters
dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true values.

label_predstr

Name of the column containing values predicted by regression.

Returns
float

Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression explained less of the variance in the input. A negative value indicates that the regression performed worse than simply taking the mean of the true values and using that for every prediction.

Examples

Actual and predicted values df for a hypothetical regression:

>>> df.collect()
   ACTUAL  PREDICTED
0    0.10        0.2
1    0.90        1.0
2    2.10        1.9
3    3.05        3.0
4    4.00        3.5

R2 score for these predictions:

>>> r2_score(data=df, label_true='ACTUAL', label_pred='PREDICTED')
0.9685233682514102

Compare that to the score for a perfect predictor:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    0.10       0.10
1    0.90       0.90
2    2.10       2.10
3    3.05       3.05
4    4.00       4.00
>>> r2_score(data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0

A naive mean predictor:

>>> df_mean.collect()
   ACTUAL  PREDICTED
0    0.10       2.03
1    0.90       2.03
2    2.10       2.03
3    3.05       2.03
4    4.00       2.03
>>> r2_score(data=df_mean, label_true='ACTUAL', label_pred='PREDICTED')
0.0

And a really awful predictor df_awful:

>>> df_awful.collect()
   ACTUAL  PREDICTED
0    0.10    12345.0
1    0.90    91923.0
2    2.10    -4444.0
3    3.05    -8888.0
4    4.00    -9999.0
>>> r2_score(data=df_awful, label_true='ACTUAL', label_pred='PREDICTED')
-886477397.139857
hana_ml.algorithms.pal.metrics.binary_classification_debriefing(data, label_true, label_pred, auc_data=None)

Computes debriefing coefficients for binary classification results.

Parameters
dataDataFrame

DataFrame of true and predicted values.

label_truestr

Name of the column containing true labels.

label_predstr

Name of the column containing predicted labels.

auc_dataDataFrame, optional

Input data for calculating predictive power (KI), structured as follows:

  • ID column.

  • True class of the data point.

  • Classifier-computed probability that the data point belongs to the positive class.

Returns
dict

Debriefing stats: ACCURACY, RECALL, SPECIFICITY, PRECISION, FPR, FNR, F1, MCC, KI, KAPPA.
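
Examples

A minimal usage sketch (the DataFrames df and df_scores, together with their column names, are illustrative assumptions; the returned values depend entirely on the data):

>>> stats = binary_classification_debriefing(data=df,
...                                          label_true='ACTUAL',
...                                          label_pred='PREDICTED',
...                                          auc_data=df_scores)
>>> stats['ACCURACY'], stats['MCC']   # individual entries of the returned dict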

hana_ml.algorithms.pal.mixture

This module contains Python wrappers for Gaussian mixture model algorithm.

The following class is available:

class hana_ml.algorithms.pal.mixture.GaussianMixture(init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

Representation of a Gaussian mixture model probability distribution.

Parameters
init_param{'farthest_first_traversal','manual','random_means','kmeans++'}

Specifies the initialization mode.

  • farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.

  • manual: The initial centers are the init_centers given by user.

  • random_means: The initial centers are the means of all the data that are randomly weighted.

  • kmeans++: The initial centers are given using the k-means++ approach.

n_componentsint

Specifies the number of Gaussian distributions.

Mandatory when init_param is not 'manual'.

init_centerslist of integers/strings

Specifies the rows of data to be used as initial centers by providing their IDs in data.

Mandatory when init_param is 'manual'.

covariance_type{'full', 'diag', 'tied_diag'}, optional

Specifies the type of covariance matrices in the model.

  • full: use full covariance matrices.

  • diag: use diagonal covariance matrices.

  • tied_diag: use diagonal covariance matrices with all equal diagonal entries.

Defaults to 'full'.

shared_covariancebool, optional

All clusters share the same covariance matrix if True.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads that can be used.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

max_iterint, optional

Specifies the maximum number of iterations for the EM algorithm.

Defaults to 100.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

category_weightfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

error_tolfloat, optional

Specifies the error tolerance, which is the stop condition.

Defaults to 1e-5.

regularizationfloat, optional

Regularization added to the diagonal of covariance matrices to ensure that they are positive-definite.

Defaults to 1e-6.

random_seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the provided value.

Defaults to 0.

Examples

Input dataframe df1 for training:

>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Creating the GMM instance:

>>> gmm = GaussianMixture(init_param='farthest_first_traversal',
...                       n_components=2, covariance_type='full',
...                       shared_covariance=False, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'], random_seed=1)

Performing fit() on the given dataframe:

>>> gmm.fit(data=df1, key='ID')

Expected output:

>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID     PROBABILITY
0    0           0          0.0
1    1           0          0.0
2    2           0          0.0
3    4           0          0.0
4    5           0          0.0
5    6           0          0.0
6    7           0          0.0
7    8           0          0.0
8    9           0          0.0
9    10          0          1.0
10   11          0          1.0
11   12          0          1.0
12   13          0          1.0
13   14          0          0.0
>>> gmm.stats_.collect()
       STAT_NAME       STAT_VALUE
1     log-likelihood     11.7199
2         aic          -504.5536
3         bic          -480.3900
>>> gmm.model_.collect()
       ROW_INDEX    CLUSTER_ID         MODEL_CONTENT
1        0            -1           {"Algorithm":"GMM","Metadata":{"DataP...
2        1             0           {"GuassModel":{"covariance":[22.18895...
3        2             1           {"GuassModel":{"covariance":[22.19450...
Attributes
model_DataFrame

Trained model content.

labels_DataFrame

Cluster membership probabilities for each data point.

stats_DataFrame

Statistics.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, ...])

Perform GMM clustering on input dataset.

fit_predict(self, data, key[, features, ...])

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Assign clusters to data based on a fitted model.

fit(self, data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset.

Parameters
dataDataFrame

Data to be clustered.

keystr

Name of the ID column.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

fit_predict(self, data, key, features=None, categorical_variable=None)

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

Parameters
dataDataFrame

Data to be clustered.

keystr

Name of the ID column.

featureslist of str, optional

List of strings specifying feature columns.

If a list of features is not given, all the columns except the ID column are taken as features.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns
DataFrame

Cluster membership probabilities.

predict(self, data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr

Name of the ID column.

featureslist of str, optional.

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
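
Continuing the example above, a hedged sketch of cluster assignment (df2 is an assumed DataFrame with the same feature structure as df1; no output is shown because it depends on the data):

>>> assigned = gmm.predict(data=df2, key='ID')
>>> assigned.collect()   # columns: ID, CLUSTER_ID, DISTANCE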

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.model_selection

This module contains classes of model selection.

The following classes are available:

class hana_ml.algorithms.pal.model_selection.ParamSearchCV(estimator, param_grid, train_control, scoring, search_strategy)

Bases: object

Exhaustive or random search over specified parameter values for an estimator.

Parameters
estimatorestimator object

This is assumed to implement the PAL estimator interface.

param_griddict

Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values. The grid spanned by the dictionary is explored, which enables searching over any sequence of parameter settings.

train_controldict

Controlling parameters for model evaluation and parameter selection.

scoringstr

A string specifying the scoring method used to evaluate the predictions.

search_strategystr

The search strategy. The options are 'grid' and 'random'.
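
Examples

A minimal sketch of a random parameter search (the estimator object svc, the parameter names in param_grid, the train_control keys, and the fit arguments are assumptions for illustration and must match the estimator actually used):

>>> ps = ParamSearchCV(estimator=svc,
...                    param_grid={'c': [0.1, 1, 10],
...                                'gamma': [0.1, 1, 10]},
...                    train_control={'fold_num': 3,
...                                   'resampling_method': 'cv',
...                                   'random_search_times': 5,
...                                   'random_state': 1},
...                    scoring='error_rate',
...                    search_strategy='random')
>>> ps.set_timeout(60)
>>> ps.fit(data=df_train, key='ID', label='LABEL')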

Attributes
estimator

Methods

fit(self, data, **kwargs)

Fit function.

predict(self, data, **kwargs)

Predict function.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

set_scoring_metric(self, metric)

Specifies the scoring metric.

set_seed(self, seed[, seed_name])

Specifies the seed for random generation.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection. The unit is seconds. No timeout when 0 is specified.

Parameters
timeoutint

The maximum running time, in seconds.

set_seed(self, seed, seed_name=None)

Specifies the seed for random generation. Use system time when 0 is specified.

Parameters
seedint

The random seed number.

seed_namestr, optional

The name of the random seed.

Defaults to None.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

Parameters
methodstr

Specifies the resampling method for model evaluation or parameter selection.

  • cv

  • stratified_cv

  • bootstrap

  • stratified_bootstrap

stratified_cv and stratified_bootstrap can only be applied to classification algorithms.

set_scoring_metric(self, metric)

Specifies the scoring metric.

Parameters
metricstr

Specifies the evaluation metric for model evaluation or parameter selection.

  • accuracy

  • error_rate

  • f1_score

  • rmse

  • mae

  • auc

  • nll (negative log likelihood)

fit(self, data, **kwargs)

Fit function.

Parameters
dataDataFrame

The input DataFrame for fitting.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

predict(self, data, **kwargs)

Predict function.

Parameters
dataDataFrame

The input DataFrame for prediction.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

class hana_ml.algorithms.pal.model_selection.GridSearchCV(estimator, param_grid, train_control, scoring)

Bases: hana_ml.algorithms.pal.model_selection.ParamSearchCV

Exhaustive search over specified parameter values for an estimator.

Parameters
estimatorestimator object

This is assumed to implement the PAL estimator interface.

param_griddict

Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values. The grid spanned by the dictionary is explored, which enables searching over any sequence of parameter settings.

train_controldict

Controlling parameters for model evaluation and parameter selection.

scoringstr

A string specifying the scoring method used to evaluate the predictions.
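
Examples

A minimal sketch of an exhaustive grid search (the estimator object hgbt, the parameter names in param_grid, the train_control keys, and the fit/predict arguments are assumptions for illustration and must match the estimator actually used):

>>> gscv = GridSearchCV(estimator=hgbt,
...                     param_grid={'learning_rate': [0.1, 0.4, 0.7, 1.0],
...                                 'n_estimators': [4, 6, 8, 10]},
...                     train_control={'fold_num': 5,
...                                    'resampling_method': 'cv',
...                                    'random_state': 1},
...                     scoring='error_rate')
>>> gscv.fit(data=df_train, key='ID', label='CLASS')
>>> res = gscv.predict(data=df_test, key='ID')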

Attributes
estimator

Methods

fit(self, data, **kwargs)

Fit function.

predict(self, data, **kwargs)

Predict function.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

set_scoring_metric(self, metric)

Specifies the scoring metric.

set_seed(self, seed[, seed_name])

Specifies the seed for random generation.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection.

fit(self, data, **kwargs)

Fit function.

Parameters
dataDataFrame

The input DataFrame for fitting.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

predict(self, data, **kwargs)

Predict function.

Parameters
dataDataFrame

The input DataFrame for prediction.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

Parameters
methodstr

Specifies the resampling method for model evaluation or parameter selection.

  • cv

  • stratified_cv

  • bootstrap

  • stratified_bootstrap

stratified_cv and stratified_bootstrap can only be applied to classification algorithms.

set_scoring_metric(self, metric)

Specifies the scoring metric.

Parameters
metricstr

Specifies the evaluation metric for model evaluation or parameter selection.

  • accuracy

  • error_rate

  • f1_score

  • rmse

  • mae

  • auc

  • nll (negative log likelihood)

set_seed(self, seed, seed_name=None)

Specifies the seed for random generation. Use system time when 0 is specified.

Parameters
seedint

The random seed number.

seed_namestr, optional

The name of the random seed.

Defaults to None.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection. The unit is seconds. No timeout when 0 is specified.

Parameters
timeoutint

The maximum running time, in seconds.

class hana_ml.algorithms.pal.model_selection.RandomSearchCV(estimator, param_grid, train_control, scoring)

Bases: hana_ml.algorithms.pal.model_selection.ParamSearchCV

Random search over specified parameter values for an estimator.

Parameters
estimatorestimator object

This is assumed to implement the PAL estimator interface.

param_griddict

Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values. The grid spanned by the dictionary is explored.

This enables searching over any sequence of parameter settings.

train_controldict

Controlling parameters for model evaluation and parameter selection.

scoringstr

A string specifying the scoring method used to evaluate the predictions.
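
Examples

A minimal sketch of a random search (as with GridSearchCV above, the estimator object, parameter names, train_control keys, and fit arguments are assumptions for illustration; here the number of random draws is assumed to be passed through train_control):

>>> rscv = RandomSearchCV(estimator=hgbt,
...                       param_grid={'learning_rate': [0.1, 0.4, 0.7, 1.0],
...                                   'n_estimators': [4, 6, 8, 10]},
...                       train_control={'fold_num': 5,
...                                      'resampling_method': 'cv',
...                                      'random_search_times': 8,
...                                      'random_state': 1},
...                       scoring='error_rate')
>>> rscv.fit(data=df_train, key='ID', label='CLASS')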

Attributes
estimator

Methods

fit(self, data, **kwargs)

Fit function.

predict(self, data, **kwargs)

Predict function.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

set_scoring_metric(self, metric)

Specifies the scoring metric.

set_seed(self, seed[, seed_name])

Specifies the seed for random generation.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection.

fit(self, data, **kwargs)

Fit function.

Parameters
dataDataFrame

The input DataFrame for fitting.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

predict(self, data, **kwargs)

Predict function.

Parameters
dataDataFrame

The input DataFrame for prediction.

**kwargs: dict

A dict of the keyword args passed to the function. Please refer to the documentation of the specific function for parameter information.

set_resampling_method(self, method)

Specifies the resampling method for model evaluation or parameter selection.

Parameters
methodstr

Specifies the resampling method for model evaluation or parameter selection.

  • cv

  • stratified_cv

  • bootstrap

  • stratified_bootstrap

stratified_cv and stratified_bootstrap can only be applied to classification algorithms.

set_scoring_metric(self, metric)

Specifies the scoring metric.

Parameters
metricstr

Specifies the evaluation metric for model evaluation or parameter selection.

  • accuracy

  • error_rate

  • f1_score

  • rmse

  • mae

  • auc

  • nll (negative log likelihood)

set_seed(self, seed, seed_name=None)

Specifies the seed for random generation. Use system time when 0 is specified.

Parameters
seedint

The random seed number.

seed_namestr, optional

The name of the random seed.

Defaults to None.

set_timeout(self, timeout)

Specifies the maximum running time for model evaluation or parameter selection. The unit is seconds. No timeout when 0 is specified.

Parameters
timeoutint

The maximum running time, in seconds.

hana_ml.algorithms.pal.naive_bayes

This module contains wrappers for PAL naive bayes classification.

The following classes are available:

class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(alpha=None, discretization=None, model_format=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, alpha_range=None, alpha_values=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A classification model based on Bayes' theorem.

Parameters
alphafloat, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to 0.

discretization{'no', 'supervised'}, optional

Discretize continuous attributes. Case-insensitive.

  • 'no' or not provided: disable discretization.

  • 'supervised': use supervised discretization on all the continuous attributes.

Defaults to 'no'.

model_format{'json', 'pmml'}, optional

Controls whether to output the model in JSON format or PMML format. Case-insensitive.

  • 'json' or not provided: JSON format.

  • 'pmml': PMML format.

Defaults to 'json'.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

Mandatory if model evaluation or parameter selection is expected.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

No default value.

evaluation_metric{'accuracy', 'f1_score', 'auc'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Mandatory if model evaluation or parameter selection is expected.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to 'cv' or 'stratified_cv'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Default to 1.

search_strategy{'grid', 'random'}

Specifies the parameter search method.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

random_stateint, optional

Specifies the seed for random generation.

Use system time when 0 is specified.

Default to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

alpha_rangelist of numeric values, optional

Specifies the range for candidate alpha values for parameter selection.

Only valid when search_strategy is specified.

No default value.

alpha_valueslist of numeric values, optional

Specifies candidate alpha values for parameter selection.

Only valid when search_strategy is specified.

No default value.

Examples

Training data:

>>> df1.collect()
  HomeOwner MaritalStatus  AnnualIncome DefaultedBorrower
0       YES        Single         125.0                NO
1        NO       Married         100.0                NO
2        NO        Single          70.0                NO
3       YES       Married         120.0                NO
4        NO      Divorced          95.0               YES
5        NO       Married          60.0                NO
6       YES      Divorced         220.0                NO
7        NO        Single          85.0               YES
8        NO       Married          75.0                NO
9        NO        Single          90.0               YES

Training the model:

>>> nb = NaiveBayes(alpha=1.0, model_format='pmml')
>>> nb.fit(df1)

Prediction:

>>> df2.collect()
   ID HomeOwner MaritalStatus  AnnualIncome
0   0        NO       Married         120.0
1   1       YES       Married         180.0
2   2        NO        Single          90.0
>>> nb.predict(df2, 'ID', alpha=1.0, verbose=True)
   ID CLASS  CONFIDENCE
0   0    NO   -6.572353
1   0   YES  -23.747252
2   1    NO   -7.602221
3   1   YES -169.133547
4   2    NO   -7.133599
5   2   YES   -4.648640
Attributes
model_DataFrame

Trained model content.

Note

The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

stats_DataFrame

Trained statistics content.

optim_param_DataFrame

Selected optimal parameters content.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Fit classification model based on training data.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, alpha, ...])

Predict based on fitted model.

score(self, data, key[, features, label, alpha])

Returns the mean accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit classification model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variablestr or ListOfStrings, optional

Specifies INTEGER columns that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None, alpha=None, verbose=None)

Predict based on fitted model.

Parameters
dataDataFrame

Independent variable values to predict for.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

alphafloat, optional

Laplace smoothing value.

Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

verbosebool, optional

If true, output all classes and the corresponding confidences for each data point.

Defaults to False.

Returns
DataFrame

Predicted result, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • CLASS, type NVARCHAR, predicted class name.

  • CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.

Note

A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.

score(self, data, key, features=None, label=None, alpha=None)

Returns the mean accuracy on the given test data and labels.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last column.

alphafloat, optional

Laplace smoothing value.

Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

Returns
float

Mean accuracy on the given test data and labels.
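
Continuing the example above, a scoring sketch (df3 is an assumed DataFrame with an ID column, the same feature columns as the training data, and a DefaultedBorrower label column; the returned accuracy depends on the data):

>>> nb.score(data=df3, key='ID', label='DefaultedBorrower', alpha=1.0)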

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.neighbors

This module contains Python wrappers for PAL k-nearest neighbors algorithms.

The following classes are available:

class hana_ml.algorithms.pal.neighbors.KNNClassifier(n_neighbors=None, thread_ratio=None, stat_info=None, voting_type=None, metric=None, minkowski_power=None, category_weights=None, algorithm=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.neighbors._KNNBase

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. It assumes similar instances should have similar labels or values.

Parameters
n_neighborsint, optional

Number of nearest neighbors (k).

Default to 1.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range are ignored and this function heuristically determines the number of threads to use.

Default to 0.0.

voting_type{'majority', 'distance-weighted'}, optional

Voting type.

Default to 'distance-weighted'.

stat_infobool, optional

Indicates whether statistics information will be stored in the STATISTIC table.

Only valid when model evaluation/parameter selection is not activated.

Default to True.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between data points.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When metric is set to 'minkowski', this parameter controls the value of power.

Only valid when metric is set as 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Default to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Algorithm used to compute the nearest neighbors.

Defaults to 'brute-force'.

factor_numint, optional

The factorisation dimensionality.

Default to 4.

random_stateint, optional

Specifies the seed for random number generator.

  • 0: Uses the current time as the seed.

  • Others: Uses the specified value as the seed.

Default to 0.

resampling_method{'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameters selection is activated.

No default value.

evaluation_metric{'accuracy', 'f1_score'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection is activated.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to 'cv' or 'stratified_cv'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Default to 1.

search_strategy{'grid', 'random'}, optional

Specifies the parameter search method.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesdict or ListOfTuples, optional

Specifies values of parameters to be selected.

Input should be a dict, or a list of tuples of two elements, with key/1st element being the target parameter name, and value/2nd element being a list of values for selection.

Only valid when parameter selection is activated.

Valid parameter names include: metric, minkowski_power, category_weights, n_neighbors, voting_type.

No default value.

param_rangedict or ListOfTuples, optional

Specifies ranges of parameters to be selected.

Input should be a dict, or a list of tuples of two elements, with key/1st element being the name of the target parameter, and value/2nd element being a list that specifies the range of that parameter in the format [start, step, end] or [start, end].

Only valid when parameter selection is activated.

Valid parameter names include: minkowski_power, category_weights, n_neighbors.

No default value.

Examples

Input dataframe for classification training:

>>> df_class_train.collect()
   ID  X1      X2 X3  TYPE
0   0   2     1.0  A     1
1   1   3    10.0  A    10
2   2   3    10.0  B    10
3   3   3    10.0  C     1
4   4   1  1000.0  C     1
5   5   1  1000.0  A    10
6   6   1  1000.0  B    99
7   7   1   999.0  A    99
8   8   1   999.0  B    10
9   9   1  1000.0  C    10

Creating KNNClassifier instance:

>>> knn  = KNNClassifier(thread_ratio=1, algorithm='kd_tree',
                         n_neighbors=3, voting_type='majority')

Performing fit() on given dataframe:

>>> knn.fit(df_class_train, key='ID', label='TYPE')

Performing predict() on given predicting dataframe:

Input prediction dataframe:

>>> df_class_predict.collect()
   ID  X1       X2 X3
0   0   2      1.0  A
1   1   1     10.0  C
2   2   1     11.0  B
3   3   3  15000.0  C
4   4   2   1000.0  C
5   5   1   1001.0  A
6   6   1    999.0  A
7   7   3    999.0  B
>>> res, stats = knn.predict(df_class_predict, key='ID', categorical_variable='X1')
>>> res.collect()
   ID TARGET
0   0     10
1   1     10
2   2     10
3   3      1
4   4      1
5   5      1
6   6     10
7   7     99
>>> stats.collect().head(10)
    TEST_ID  K  TRAIN_ID      DISTANCE
0         0  1         0      0.000000
1         0  2         1      9.999849
2         0  3         2     10.414000
3         1  1         3      0.999849
4         1  2         1      1.414000
5         1  3         2      1.414000
6         2  1         2      1.999849
7         2  2         1      2.414000
8         2  3         3      2.414000
9         3  1         4  14000.999849
Attributes
_training_setDataFrame

Input training data with structured column arrangement. If model evaluation or parameter selection is not enabled, the first column must be the ID column, followed by feature columns.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Build the KNNClassifier training dataset with the input dataframe.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self[, data, key, features])

Prediction for the input data with the training dataset.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, string_variable=None, variable_weight=None)

Build the KNNClassifier training dataset with the input dataframe. Assign key, features, and label column.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

Required if parameter selection/model evaluation is not activated.

If not provided when activating parameter-selection/model-evaluation, then data is assumed to have no ID column.

featuresstr/ListOfStrings, optional

Name of the feature columns.

labelstr, optional

Specifies the dependent variable.

Default to last column name.

categorical_variablestr or list of str, optional

Indicates whether a column should be treated as a categorical variable even though its data type is INTEGER.

By default, VARCHAR and NVARCHAR columns are treated as categorical, while INTEGER and DOUBLE columns are treated as continuous.

Defaults to None.

string_variablestr or list of str, optional

Indicates a string column that stores non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.

Note that this is a new parameter in SAP HANA SPS05 and Cloud.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.

Note that this is a new parameter in SAP HANA SPS05 and Cloud.

Defaults to None.

Returns
Returned only if model evaluation or parameter selection is triggered.
DataFrame
Statistics information. Structured as follows:
  • STAT_NAME: Statistic names.

  • STAT_VALUE: Statistic values.

Selected optimal parameters. Structured as follows:
  • PARAM_NAME: Selected optimal parameter names.

  • INT_VALUE

  • DOUBLE_VALUE

  • STRING_VALUE
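
For reference, a sketch of enabling parameter selection at construction time (the candidate values below are illustrative only):

>>> knn_cv = KNNClassifier(resampling_method='stratified_cv',
...                        evaluation_metric='accuracy',
...                        fold_num=5,
...                        search_strategy='grid',
...                        param_values={'n_neighbors': [3, 5, 7],
...                                      'voting_type': ['majority',
...                                                      'distance-weighted']})
>>> knn_cv.fit(data=df_class_train, key='ID', label='TYPE')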

predict(self, data=None, key=None, features=None)

Prediction for the input data with the training dataset. Training data set must be constructed through the fit function first.

Parameters
dataDataFrame

Prediction data.

keystr, optional

Name of the ID column.

Required if parameter selection/model evaluation is not activated.

featuresstr/ListOfStrings, optional

Name of the feature columns.

Returns
DataFrame

KNN predict results. Structured as follows:

  • ID: Prediction data ID.

  • TARGET: Predicted label or value.

KNN prediction statistics information. Structured as follows:

  • TEST_ + ID column name of prediction data: Prediction data ID.

  • K: K number.

  • TRAIN_ + ID column name of training data: Train data ID.

  • DISTANCE: Distance.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.neighbors.KNNRegressor(n_neighbors=None, thread_ratio=None, stat_info=None, aggregate_type=None, metric=None, minkowski_power=None, category_weights=None, algorithm=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None)

Bases: hana_ml.algorithms.pal.neighbors._KNNBase

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. It assumes similar instances should have similar labels or values.

Parameters
n_neighborsint, optional

Number of nearest neighbors (k).

Default to 1.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The range of this parameter is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range are ignored and this function heuristically determines the number of threads to use.

Default to 0.0.

aggregate_type{'average', 'distance-weighted'}, optional

Aggregate type.

Default to 'distance-weighted'.

stat_infobool, optional

Indicates whether statistics information will be stored in the STATISTIC table.

Only valid when model evaluation/parameter selection is not activated.

Default to True.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Ways to compute the distance between data points.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski is used for metric, this parameter controls the value of power.

Only valid when metric is set as 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Default to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Algorithm used to compute the nearest neighbors.

Defaults to 'brute-force'.

factor_numint, optional

The factorisation dimensionality.

Default to 4.

random_stateint, optional

Specifies the seed for random number generator.

  • 0: Uses the current time as the seed.

  • Others: Uses the specified value as the seed.

Default to 0.

resampling_method{'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameters selection is activated.

No default value.

evaluation_metric{'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection is activated.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to 'cv'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Default to 1.

search_strategy{'grid', 'random'}, optional

Specifies the parameter search method.

No default value.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Default to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesListOfTuples, optional

Specifies values of parameters to be selected.

Input should be a dict, or a list of size-two tuples, with key/1st element being the target parameter name, and value/2nd element being a list of values for selection.

Valid only when parameter selection is activated.

Valid parameter names include: 'metric', 'minkowski_power', 'category_weights', 'n_neighbors', 'aggregate_type'.

No default value.

param_rangeListOfTuples, optional

Specifies ranges of parameters to be selected.

Input should be a dict, or a list of size-two tuples, with key/1st element being the name of the target parameter, and value/2nd element being a list that specifies the range of parameters with [start, step, end] or [start, end].

Valid only when parameter selection is activated.

Valid parameter names include: 'minkowski_power', 'category_weights', 'n_neighbors'.

No default value.

Examples

Input dataframe for regression training:

>>> df_regr_train.collect()
    ID  X1      X2 X3 VALUE
0   0   2     1.0  A      1
1   1   3    10.0  A     10
2   2   3    10.0  B     10
3   3   3    10.0  C      1
4   4   1  1000.0  C      1
5   5   1  1000.0  A     10
6   6   1  1000.0  B     99
7   7   1   999.0  A     99
8   8   1   999.0  B     10
9   9   1  1000.0  C     10

Creating KNNRegressor instance:

>>> knn  = KNNRegressor(thread_ratio=1, algorithm='kd_tree',
                        n_neighbors=3, aggregate_type='average')

Performing fit() on given dataframe:

>>> knn.fit(df_regr_train, key='ID', categorical_variable='X1', label='VALUE')

Performing predict() on given predicting dataframe:

Input prediction dataframe:

>>> df_class_predict.collect()
   ID  X1       X2 X3
0   0   2      1.0  A
1   1   1     10.0  C
2   2   1     11.0  B
3   3   3  15000.0  C
4   4   2   1000.0  C
5   5   1   1001.0  A
6   6   1    999.0  A
7   7   3    999.0  B
>>> res, stats = knn.predict(df_class_predict, key='ID', categorical_variable='X1')
>>> res.collect()
    ID              TARGET
0   0                   7
1   1                   7
2   2                   7
3   3  36.666666666666664
4   4  36.666666666666664
5   5  36.666666666666664
6   6  39.666666666666664
7   7   69.33333333333333
>>> stats.collect().head(10)
    TEST_ID  K  TRAIN_ID      DISTANCE
0         0  1         0      0.000000
1         0  2         1      9.999849
2         0  3         2     10.414000
3         1  1         3      0.999849
4         1  2         1      1.414000
5         1  3         2      1.414000
6         2  1         2      1.999849
7         2  2         1      2.414000
8         2  3         3      2.414000
9         3  1         4  14000.999849
Attributes
_training_setDataFrame

Input training data with structured column arrangement. If model evaluation or parameter selection is not enabled, the first column must be the ID column, followed by feature columns.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Build the KNNRegressor training dataset with the input dataframe.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self[, data, key, features])

Prediction for the input data with the training dataset.

fit(self, data, key=None, features=None, label=None, categorical_variable=None, string_variable=None, variable_weight=None)

Build the KNNRegressor training dataset with the input dataframe. Assign key, features, and label column.

Parameters
dataDataFrame

Data to be fit.

keystr, optional

Name of the ID column.

Required if parameter selection/model evaluation is not activated.

featuresstr/ListOfStrings, optional

Name of the feature columns.

labelstr, optional

Specifies the dependent variable.

Default to last column name.

categorical_variablestr or list of str, optional

Indicates whether a column should be treated as a categorical variable even though its data type is INTEGER.

By default, VARCHAR and NVARCHAR columns are treated as categorical, while INTEGER and DOUBLE columns are treated as continuous.

Defaults to None.

string_variablestr or list of str, optional

Indicates a string column that stores non-categorical data.

Levenshtein distance is used to calculate similarity between two strings.

Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation.

The value must be greater than or equal to 0. Defaults to 1 for variables not specified.

Defaults to None.

Returns
Returned only if model evaluation or parameter selection is triggered.
DataFrame

Statistics information. Structured as follows:

  • STAT_NAME: Statistic names.

  • STAT_VALUE: Statistic values.

Selected optimal parameters. Structured as follows:

  • PARAM_NAME: Selected optimal parameter names.

  • INT_VALUE

  • DOUBLE_VALUE

  • STRING_VALUE
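
Similarly for regression, a sketch of parameter selection using a range specification (the candidate range below is illustrative only):

>>> knn_cv = KNNRegressor(resampling_method='cv',
...                       evaluation_metric='rmse',
...                       fold_num=5,
...                       search_strategy='grid',
...                       param_range={'n_neighbors': [1, 2, 9]})
>>> knn_cv.fit(data=df_regr_train, key='ID', label='VALUE')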

predict(self, data=None, key=None, features=None)

Prediction for the input data with the training dataset. Training data set must be constructed through the fit function first.

Parameters
dataDataFrame

Prediction data.

keystr, optional

Name of the ID column.

Required if parameter selection/model evaluation is activated.

featuresstr/ListOfStrings, optional

Name of the feature columns.

Returns
DataFrame

KNN predict results. Structured as follows:

  • ID: Prediction data ID.

  • TARGET: Predicted label or value.

KNN prediction statistics information. Structured as follows:

  • TEST_ + ID column name of prediction data: Prediction data ID.

  • K: K number.

  • TRAIN_ + ID column name of training data: Train data ID.

  • DISTANCE: Distance.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.neighbors.KNN(*args, **kwargs)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-Nearest Neighbor(KNN) model that handles classification problems.

Parameters
n_neighborsint, optional

Number of nearest neighbors.

Defaults to 1.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

voting_type{'majority', 'distance-weighted'}, optional

Method used to vote for the most frequent label of the K nearest neighbors.

Defaults to 'distance-weighted'.

stat_infobool, optional

Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set.

If true, the table will be returned.

Defaults to True.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional

Ways to compute the distance between data points.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski is used for metric, this parameter controls the value of power.

Only valid when metric is 'minkowski'.

Defaults to 3.0.

algorithm{'brute-force', 'kd-tree'}, optional

Algorithm used to compute the nearest neighbors.

Defaults to 'brute-force'.

Examples

Training data:

>>> df.collect()
   ID      X1      X2  TYPE
0   0     1.0     1.0     2
1   1    10.0    10.0     3
2   2    10.0    11.0     3
3   3    10.0    10.0     3
4   4  1000.0  1000.0     1
5   5  1000.0  1001.0     1
6   6  1000.0   999.0     1
7   7   999.0   999.0     1
8   8   999.0  1000.0     1
9   9  1000.0  1000.0     1

Create KNN instance and call fit:

>>> knn = KNN(n_neighbors=3, voting_type='majority',
...           thread_ratio=0.1, stat_info=False)
>>> knn.fit(df, 'ID', features=['X1', 'X2'], label='TYPE')
>>> pred_df = connection_context.table("PAL_KNN_CLASSDATA_TBL")

Call predict:

>>> res, stat = knn.predict(pred_df, "ID")
>>> res.collect()
   ID  TYPE
0   0     3
1   1     3
2   2     3
3   3     1
4   4     1
5   5     1
6   6     1
7   7     1

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features, label])

Fit the model when given training set.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features])

Predict the class labels for the provided data.

score(self, data, key[, features, label])

Return a scalar accuracy value after comparing the predicted and original label.

fit(self, data, key, features=None, label=None)

Fit the model when given training set.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If not provided, it defaults to all the non-ID and non-label columns in data.

labelstr, optional

Name of the label column.

If not provided, it defaults to the last column in data.

predict(self, data, key, features=None)

Predict the class labels for the provided data.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID columns.

Returns
DataFrame

Predicted result, structured as follows:

  • ID column, with same name and type as data's ID column.

  • Label column, with same name and type as training data's label column.

DataFrame

The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:

  • TEST_ + data's ID name, with same type as data's ID column, query data ID.

  • K, type INTEGER, K number.

  • TRAIN_ + training data's ID name, with same type as training data's ID column, neighbor point's ID.

  • DISTANCE, type DOUBLE, distance.

score(self, data, key, features=None, label=None)

Return a scalar accuracy value after comparing the predicted and original label.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

Returns
accuracyfloat

Scalar accuracy value after comparing the predicted label and original label.
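A hypothetical call, reusing the fitted knn instance and the training DataFrame df from the example above (the returned accuracy depends on the data):

>>> knn.score(data=df, key='ID', features=['X1', 'X2'], label='TYPE')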

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.neural_network

This module contains Python wrappers for PAL Multi-layer Perceptron algorithm.

The following classes are available:

class hana_ml.algorithms.pal.neural_network.MLPClassifier(activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style=None, learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Classifier.

Parameters
activationstr

Specifies the activation function for the hidden layer.

Valid activation functions include:
  • 'tanh',

  • 'linear',

  • 'sigmoid_asymmetric',

  • 'sigmoid_symmetric',

  • 'gaussian_asymmetric',

  • 'gaussian_symmetric',

  • 'elliot_asymmetric',

  • 'elliot_symmetric',

  • 'sin_asymmetric',

  • 'sin_symmetric',

  • 'cos_asymmetric',

  • 'cos_symmetric',

  • 'relu'

May be omitted only if activation_options is provided.

activation_optionslist of str, optional

A list of activation functions for parameter selection.

See activation for the full set of valid activation functions.

output_activationstr

Specifies the activation function for the output layer.

Valid activation functions are the same as those in activation.

May be omitted only if output_activation_options is provided.

output_activation_optionslist of str, optional

A list of activation functions for the output layer for parameter selection.

See activation for the full set of activation functions.

hidden_layer_sizelist of int or tuple of int

Sizes of all hidden layers.

May be omitted only if hidden_layer_size_options is provided.

hidden_layer_size_optionslist of tuples, optional

A list of optional sizes of all hidden layers for parameter selection.

max_iterint, optional

Maximum number of iterations.

Defaults to 100.

training_style{'batch', 'stochastic'}, optional

Specifies the training style.

Defaults to 'stochastic'.

learning_ratefloat, optional

Specifies the learning rate. Mandatory and valid only when training_style is 'stochastic'.

momentumfloat, optional

Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is 'stochastic'.

batch_sizeint, optional

Specifies the size of mini batch. Valid only when training_style is 'stochastic'.

Defaults to 1.

normalization{'no', 'z-transform', 'scalar'}, optional

Defaults to 'no'.

weight_init{'all-zeros', 'normal', 'uniform', 'variance-scale-normal', 'variance-scale-uniform'}, optional

Specifies the weight initial value.

Defaults to 'all-zeros'.

categorical_variablestr or list of str, optional

Specifies column name(s) in the data table used as category variable.

Valid only when column is of INTEGER type.

thread_ratiofloat, optional

Controls the proportion of available threads to use for training.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{'cv','stratified_cv', 'bootstrap', 'stratified_bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric{'accuracy','f1_score', 'auc_onevsrest', 'auc_pairwise'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_numint, optional

Specifies the fold number for the cross-validation.

Mandatory and valid only when resampling_method is set to 'cv' or 'stratified_cv'.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

Specifies the method for parameter selection. If not provided, parameter selection will not be activated.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters.

Mandatory and valid only when search_strategy is set to 'random'.

random_stateint, optional

Specifies the seed for random generation.

When 0 is specified, system time is used.

Defaults to 0.

timeoutint, optional

Specifies the maximum running time for model evaluation/parameter selection, in seconds.

No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation/parameter selection.

If not provided, no progress indicator is activated.

param_valuesdict or list of tuples, optional

Specifies the values of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

If the input is a list of tuples, then each tuple must contain exactly two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list of valid values for that parameter.

If the input is a dict, then for each element the key must be the parameter name, while the value must be a list of valid values for the corresponding parameter.

A simple example for illustration:

[('learning_rate', [0.1, 0.2, 0.5]), ('momentum', [0.2, 0.6])],

or

dict(learning_rate=[0.1, 0.2, 0.5], momentum=[0.2, 0.6]).

Valid only when resampling_method and search_strategy are both specified, and training_style is 'stochastic'.

param_rangedict or list of tuple, optional

Specifies the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

If the input is a list of tuples, then each tuple should contain exactly two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list that specifies the range of that parameter as follows: the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to 'random'.

Otherwise, if the input is a dict, then for each element the key should be the parameter name, while the value specifies the range of that parameter.

Valid only when resampling_method and search_strategy are both specified, and training_style is 'stochastic'.
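A simple example for illustration (the numeric ranges are arbitrary placeholders):

[('learning_rate', [0.01, 0.01, 0.1]), ('momentum', [0.1, 0.1, 0.5])]

Each inner list is read as [start, step, end]; with search_strategy set to 'random', the step entries would be ignored.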

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003 LABEL
0     1  1.71   AC     0    AA
1    10  1.78   CA     5    AB
2    17  2.36   AA     6    AA
3    12  3.15   AA     2     C
4     7  1.05   CA     3    AB
5     6  1.50   CA     2    AB
6     9  1.97   CA     6     C
7     5  1.26   AA     1    AA
8    12  2.13   AC     4     C
9    18  1.87   AC     6    AA

Training the model:

>>> mlpc = MLPClassifier(hidden_layer_size=(10,10),
...                      activation='tanh', output_activation='tanh',
...                      learning_rate=0.001, momentum=0.0001,
...                      training_style='stochastic',max_iter=100,
...                      normalization='z-transform', weight_init='normal',
...                      thread_ratio=0.3, categorical_variable='V003')
>>> mlpc.fit(data=df)

Training result may look different from the following results due to model randomness.

>>> mlpc.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  t":0.2700182926188939},{"from":13,"weight":0.0...
2          3  ht":0.2414416413305134},{"from":21,"weight":0....
>>> mlpc.train_log_.collect()
    ITERATION     ERROR
0           1  1.080261
1           2  1.008358
2           3  0.947069
3           4  0.894585
4           5  0.849411
5           6  0.810309
6           7  0.776256
7           8  0.746413
8           9  0.720093
9          10  0.696737
10         11  0.675886
11         12  0.657166
12         13  0.640270
13         14  0.624943
14         15  0.609432
15         16  0.595204
16         17  0.582101
17         18  0.569990
18         19  0.558757
19         20  0.548305
20         21  0.538553
21         22  0.529429
22         23  0.521457
23         24  0.513893
24         25  0.506704
25         26  0.499861
26         27  0.493338
27         28  0.487111
28         29  0.481159
29         30  0.475462
..        ...       ...
70         71  0.349684
71         72  0.347798
72         73  0.345954
73         74  0.344071
74         75  0.342232
75         76  0.340597
76         77  0.338837
77         78  0.337236
78         79  0.335749
79         80  0.334296
80         81  0.332759
81         82  0.331255
82         83  0.329810
83         84  0.328367
84         85  0.326952
85         86  0.325566
86         87  0.324232
87         88  0.322899
88         89  0.321593
89         90  0.320242
90         91  0.318985
91         92  0.317840
92         93  0.316630
93         94  0.315376
94         95  0.314210
95         96  0.313066
96         97  0.312021
97         98  0.310916
98         99  0.309770
99        100  0.308704

Prediction:

>>> pred_df.collect()
>>> res, stat = mlpc.predict(data=pred_df, key='ID')

Prediction result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET     VALUE
0   1      C  0.472751
1   2      C  0.417681
2   3      C  0.543967
>>> stat.collect()
   ID CLASS  SOFT_MAX
0   1    AA  0.371996
1   1    AB  0.155253
2   1     C  0.472751
3   2    AA  0.357822
4   2    AB  0.224496
5   2     C  0.417681
6   3    AA  0.349813
7   3    AB  0.106220
8   3     C  0.543967

Model Evaluation:

>>> mlpc = MLPClassifier(activation='tanh',
...                      output_activation='tanh',
...                      hidden_layer_size=(10,10),
...                      learning_rate=0.001,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='cv',
...                      evaluation_metric='f1_score',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Model evaluation result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_F1_SCORE                       1, 0, 1, 1, 0, 1, 0, 1, 1, 0
2     TEST_2_F1_SCORE                       0, 0, 1, 1, 0, 1, 0, 1, 1, 1
3  TEST_F1_SCORE.MEAN                                                0.6
4   TEST_F1_SCORE.VAR                                           0.252631
5      EVAL_RESULTS_1  {"candidates":[{"TEST_F1_SCORE":[[1.0,0.0,1.0,...
6     solution status  Convergence not reached after maximum number o...
7               ERROR                                 0.2951168443145714

Parameter selection:

>>> act_opts=['tanh', 'linear', 'sigmoid_asymmetric']
>>> out_act_opts = ['sigmoid_symmetric', 'gaussian_asymmetric', 'gaussian_symmetric']
>>> layer_size_opts = [(10, 10), (5, 5, 5)]
>>> mlpc = MLPClassifier(activation_options=act_opts,
...                      output_activation_options=out_act_opts,
...                      hidden_layer_size_options=layer_size_opts,
...                      learning_rate=0.001,
...                      batch_size=2,
...                      momentum=0.0001,
...                      training_style='stochastic',
...                      max_iter=100,
...                      normalization='z-transform',
...                      weight_init='normal',
...                      resampling_method='stratified_bootstrap',
...                      evaluation_metric='accuracy',
...                      search_strategy='grid',
...                      fold_num=10,
...                      repeat_times=2,
...                      random_state=1,
...                      progress_indicator_id='TEST',
...                      thread_ratio=0.3)
>>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')

Parameter selection result may look different from the following result due to randomness.

>>> mlpc.stats_.collect()
            STAT_NAME                                         STAT_VALUE
0             timeout                                              FALSE
1     TEST_1_ACCURACY                                               0.25
2     TEST_2_ACCURACY                                           0.666666
3  TEST_ACCURACY.MEAN                                           0.458333
4   TEST_ACCURACY.VAR                                          0.0868055
5      EVAL_RESULTS_1  {"candidates":[{"TEST_ACCURACY":[[0.50],[0.0]]...
6      EVAL_RESULTS_2  PUT_LAYER_ACTIVE_FUNC=6;HIDDEN_LAYER_ACTIVE_FU...
7      EVAL_RESULTS_3  FUNC=2;"},{"TEST_ACCURACY":[[0.50],[0.33333333...
8      EVAL_RESULTS_4  rs":"HIDDEN_LAYER_SIZE=10, 10;OUTPUT_LAYER_ACT...
9               ERROR                                  0.684842661926971
>>> mlpc.optim_param_.collect()
                 PARAM_NAME  INT_VALUE DOUBLE_VALUE STRING_VALUE
0         HIDDEN_LAYER_SIZE        NaN         None      5, 5, 5
1  OUTPUT_LAYER_ACTIVE_FUNC        4.0         None         None
2  HIDDEN_LAYER_ACTIVE_FUNC        3.0         None         None
Attributes
model_DataFrame

Model content.

train_log_DataFrame

Provides mean squared error between predicted values and target values for each iteration.

stats_DataFrame

Names and values of statistics.

optim_param_DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Fit the model when the training dataset is given.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, ...])

Predict using the multi-layer perceptron model.

score(self, data, key[, features, label, ...])

Returns the accuracy on the given test data and labels.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when the training dataset is given.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

predict(self, data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratiofloat, optional

Controls the proportion of available threads to be used for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Predicted classes, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • TARGET, type NVARCHAR, predicted class name.

  • VALUE, type DOUBLE, softmax value for the predicted class.

Softmax values for all classes, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • CLASS, type NVARCHAR, class name.

  • VALUE, type DOUBLE, softmax value for that class.

score(self, data, key, features=None, label=None, thread_ratio=None)

Returns the accuracy on the given test data and labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

Returns
float

Scalar value of accuracy after comparing the predicted result and original label.
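A hypothetical call, assuming a labelled test DataFrame df_test that, unlike the training data above, contains an ID column:

>>> mlpc.score(data=df_test, key='ID', label='LABEL')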

class hana_ml.algorithms.pal.neural_network.MLPRegressor(activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Regressor.

Parameters
activationstr

Specifies the activation function for the hidden layer.

Valid activation functions include:
  • 'tanh',

  • 'linear',

  • 'sigmoid_asymmetric',

  • 'sigmoid_symmetric',

  • 'gaussian_asymmetric',

  • 'gaussian_symmetric',

  • 'elliot_asymmetric',

  • 'elliot_symmetric',

  • 'sin_asymmetric',

  • 'sin_symmetric',

  • 'cos_asymmetric',

  • 'cos_symmetric',

  • 'relu'

May be omitted only if activation_options is provided.

activation_optionslist of str, optional

A list of activation functions for parameter selection.

See activation for the full set of valid activation functions.

output_activationstr

Specifies the activation function for the output layer.

Valid choices of activation functions are the same as those in activation.

May be omitted only if output_activation_options is provided.

output_activation_optionslist of str, conditionally mandatory

A list of activation functions for the output layer for parameter selection.

See activation for the full set of activation functions for output layer.

hidden_layer_sizelist of int or tuple of int

Sizes of all hidden layers.

May be omitted only if hidden_layer_size_options is provided.

hidden_layer_size_optionslist of tuples, optional

A list of optional sizes of all hidden layers for parameter selection.

max_iterint, optional

Maximum number of iterations.

Defaults to 100.

training_style{'batch', 'stochastic'}, optional

Specifies the training style.

Defaults to 'stochastic'.

learning_ratefloat, optional

Specifies the learning rate.

Mandatory and valid only when training_style is 'stochastic'.

momentumfloat, optional

Specifies the momentum for gradient descent update.

Mandatory and valid only when training_style is 'stochastic'.

batch_sizeint, optional

Specifies the size of mini batch.

Valid only when training_style is 'stochastic'.

Defaults to 1.

normalization{'no', 'z-transform', 'scalar'}, optional

Defaults to 'no'.

weight_init{'all-zeros', 'normal', 'uniform', 'variance-scale-normal', 'variance-scale-uniform'}, optional

Specifies the weight initial value.

Defaults to 'all-zeros'.

categorical_variablestr or list of str, optional

Specifies column name(s) in the data table used as category variable.

Valid only when column is of INTEGER type.

thread_ratiofloat, optional

Controls the proportion of available threads to use for training.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method{'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection will be triggered.

evaluation_metric{'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

fold_numint, optional

Specifies the fold number for the cross-validation.

Mandatory and valid only when resampling_method is set to 'cv'.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

Specifies the method for parameter selection.

If not provided, parameter selection will not be activated.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters.

Mandatory and valid only when search_strategy is set to 'random'.

random_stateint, optional

Specifies the seed for random generation.

When 0 is specified, system time is used.

Defaults to 0.

timeoutint, optional

Specifies the maximum running time for model evaluation/parameter selection, in seconds.

No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation/parameter selection.

If not provided, no progress indicator is activated.

param_valuesdict or list of tuples, optional

Specifies the values of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

If the input is a list of tuples, then each tuple must contain exactly two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list of valid values for that parameter.

Otherwise, if the input is a dict, then for each element the key must be the parameter name, while the value must be a list of valid values for that parameter.

A simple example for illustration:

[('learning_rate', [0.1, 0.2, 0.5]), ('momentum', [0.2, 0.6])],

or

dict(learning_rate=[0.1, 0.2, 0.5], momentum=[0.2, 0.6]).

Valid only when resampling_method and search_strategy are both specified, and training_style is 'stochastic'.

param_rangedict or list of tuple, optional

Sets the range of the following parameters for model parameter selection:

learning_rate, momentum, batch_size.

If the input is a list of tuples, then each tuple should contain exactly two elements:

  • 1st element is the parameter name (str type),

  • 2nd element is a list that specifies the range of that parameter as follows: the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to 'random'.

Otherwise, if the input is a dict, then for each element the key should be the parameter name, while the value specifies the range of that parameter.

Valid only when resampling_method and search_strategy are both specified, and training_style is 'stochastic'.
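For illustration only (arbitrary placeholder ranges), the dict form could look like dict(learning_rate=[0.01, 0.01, 0.1], batch_size=[1, 1, 5]), where each list is read as [start, step, end].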

Examples

Training data:

>>> df.collect()
   V000  V001 V002  V003  T001  T002  T003
0     1  1.71   AC     0  12.7   2.8  3.06
1    10  1.78   CA     5  12.1   8.0  2.65
2    17  2.36   AA     6  10.1   2.8  3.24
3    12  3.15   AA     2  28.1   5.6  2.24
4     7  1.05   CA     3  19.8   7.1  1.98
5     6  1.50   CA     2  23.2   4.9  2.12
6     9  1.97   CA     6  24.5   4.2  1.05
7     5  1.26   AA     1  13.6   5.1  2.78
8    12  2.13   AC     4  13.2   1.9  1.34
9    18  1.87   AC     6  25.5   3.6  2.14

Training the model:

>>> mlpr = MLPRegressor(hidden_layer_size=(10,5),
...                     activation='sin_asymmetric',
...                     output_activation='sin_asymmetric',
...                     learning_rate=0.001, momentum=0.00001,
...                     training_style='batch',
...                     max_iter=10000, normalization='z-transform',
...                     weight_init='normal', thread_ratio=0.3)
>>> mlpr.fit(data=df, label=['T001', 'T002', 'T003'])

Training result may look different from the following results due to model randomness.

>>> mlpr.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  3782583596893},{"from":10,"weight":-0.16532599...
>>> mlpr.train_log_.collect()
     ITERATION       ERROR
0            1   34.525655
1            2   82.656301
2            3   67.289241
3            4  162.768062
4            5   38.988242
5            6  142.239468
6            7   34.467742
7            8   31.050946
8            9   30.863581
9           10   30.078204
10          11   26.671436
11          12   28.078312
12          13   27.243226
13          14   26.916686
14          15   26.782915
15          16   26.724266
16          17   26.697108
17          18   26.684084
18          19   26.677713
19          20   26.674563
20          21   26.672997
21          22   26.672216
22          23   26.671826
23          24   26.671631
24          25   26.671533
25          26   26.671485
26          27   26.671460
27          28   26.671448
28          29   26.671442
29          30   26.671439
..         ...         ...
705        706   11.891081
706        707   11.891081
707        708   11.891081
708        709   11.891081
709        710   11.891081
710        711   11.891081
711        712   11.891081
712        713   11.891081
713        714   11.891081
714        715   11.891081
715        716   11.891081
716        717   11.891081
717        718   11.891081
718        719   11.891081
719        720   11.891081
720        721   11.891081
721        722   11.891081
722        723   11.891081
723        724   11.891081
724        725   11.891081
725        726   11.891081
726        727   11.891081
727        728   11.891081
728        729   11.891081
729        730   11.891081
730        731   11.891081
731        732   11.891081
732        733   11.891081
733        734   11.891081
734        735   11.891081

[735 rows x 2 columns]

>>> pred_df.collect()
   ID  V000  V001 V002  V003
0   1     1  1.71   AC     0
1   2    10  1.78   CA     5
2   3    17  2.36   AA     6

Prediction:

>>> res  = mlpr.predict(data=pred_df, key='ID')

Result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET      VALUE
0   1   T001  12.700012
1   1   T002   2.799133
2   1   T003   2.190000
3   2   T001  12.099740
4   2   T002   6.100000
5   2   T003   2.190000
6   3   T001  10.099961
7   3   T002   2.799659
8   3   T003   2.190000
Attributes
model_DataFrame

Model content.

train_log_DataFrame

Provides mean squared error between predicted values and target values for each iteration.

stats_DataFrame

Names and values of statistics.

optim_param_DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data[, key, features, label, ...])

Fit the model when given training dataset.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data, key[, features, ...])

Predict using the multi-layer perceptron model.

score(self, data, key[, features, label, ...])

Returns the coefficient of determination R^2 of the prediction.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

fit(self, data, key=None, features=None, label=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr or list of str, optional

Name of the label column, or list of names of multiple label columns.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

predict(self, data, key, features=None, thread_ratio=None)

Predict using the multi-layer perceptron model.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

thread_ratiofloat, optional

Controls the proportion of available threads to be used for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Predicted results, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • TARGET, type NVARCHAR, target name.

  • VALUE, type DOUBLE, regression value.

score(self, data, key, features=None, label=None, thread_ratio=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters
dataDataFrame

DataFrame containing the data.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all the non-ID and non-label columns.

labelstr or list of str, optional

Name of the label column, or list of names of multiple label columns.

If label is not provided, it defaults to the last column.

Returns
float

Returns the coefficient of determination R^2 of the prediction.
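A hypothetical call, assuming a test DataFrame df_test containing an ID column together with the feature and label columns used in the examples above:

>>> mlpr.score(data=df_test, key='ID', label=['T001', 'T002', 'T003'])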

hana_ml.algorithms.pal.pagerank

This module contains python wrapper for PAL PageRank algorithm.

The following class is available:

class hana_ml.algorithms.pal.pagerank.PageRank(damping=None, max_iter=None, tol=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A page rank model.

Parameters
dampingfloat, optional

The damping factor d.

Defaults to 0.85.

max_iterint, optional

The maximum number of iterations of power method.

The value 0 means no maximum number of iterations is set and the calculation stops when the result converges.

Defaults to 0.

tolfloat, optional

Specifies the stop condition.

When the mean improvement value of ranks is less than this value, the program stops calculation.

Defaults to 1e-6.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for training:

>>> df.collect()
   FROM_NODE    TO_NODE
0   Node1       Node2
1   Node1       Node3
2   Node1       Node4
3   Node2       Node3
4   Node2       Node4
5   Node3       Node1
6   Node4       Node1
7   Node4       Node3

Create a PageRank instance:

>>> pr = PageRank()

Call run() on given data sequence:

>>> result = pr.run(data=df)
>>> result.collect()
   NODE     RANK
0   NODE1   0.368152
1   NODE2   0.141808
2   NODE3   0.287962
3   NODE4   0.202078
Attributes
None

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

run(self, data)

This method reads link information and calculates rank for each node.

run(self, data)

This method reads link information and calculates rank for each node.

Parameters
dataDataFrame

Link data for the PageRank calculation, with each row specifying one directed link between two nodes.

Returns
DataFrame

Calculated rank values and corresponding node names, structured as follows:

  • NODE: node names.

  • RANK: the PageRank of the corresponding node.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.partition

This module contains the Python wrapper for the PAL partition function.

The following function is available:

hana_ml.algorithms.pal.partition.train_test_val_split(data, id_column=None, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)

The algorithm partitions an input dataset randomly into three disjoint subsets called training, testing and validation. Let us remark that the union of these three subsets might not be the complete initial dataset.

Please also note that the dataset must have an ID column. The ID column can be specified explicitly, otherwise it's assumed that the first column of the dataframe holds the ID.

Two different partitions can be obtained:

  1. Random Partition, which randomly divides all the data.

  2. Stratified Partition, which divides each subpopulation randomly.

In the second case, the dataset needs to have at least one categorical attribute (for example, of type VARCHAR). The initial dataset will first be subdivided according to the different categorical values of this attribute. Each mutually exclusive subset will then be randomly split to obtain the training, testing, and validation subsets. This ensures that all "categorical values" or "strata" will be present in the sampled subset.

Parameters
dataDataFrame

DataFrame to be partitioned.

id_columnstr, optional

Indicates which column to use as the ID column. Defaults to the first column.

random_seedint, optional
Indicates the seed used to initialize the random number generator.
  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Defaults to 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

partition_method{'random', 'stratified'}, optional
Partition method:
  • 'random': random partitions.

  • 'stratified': stratified partition.

Defaults to 'random'.

stratified_columnstr, optional

Indicates which column is used for stratification.

Valid only when partition_method is set to 'stratified' (stratified partition).

No default value.

training_percentagefloat, optional

The percentage of training data.

Value range: 0 <= value <= 1.

Defaults to 0.8.

testing_percentagefloat, optional

The percentage of testing data.

Value range: 0 <= value <= 1.

Defaults to 0.1.

validation_percentagefloat, optional

The percentage of validation data.

Value range: 0 <= value <= 1.

Defaults to 0.1.

training_sizeint, optional

Row size of training data. Value range: >=0.

If both training_percentage and training_size are specified, training_percentage takes precedence.

No default value.

testing_sizeint, optional

Row size of testing data. Value range: >=0.

If both testing_percentage and testing_size are specified, testing_percentage takes precedence.

No default value.

validation_sizeint, optional

Row size of validation data. Value range: >=0.

If both validation_percentage and validation_size are specified, validation_percentage takes precedence.

No default value.

Returns
Returns three DataFrames: training data, testing data, and validation data after partitioning.

Examples

To partition the input DataFrame df:

>>> train, test, valid = train_test_val_split(data=df)
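A stratified partition might look as follows; the column name 'CATEGORY' is a placeholder for a categorical column of df:

>>> train, test, valid = train_test_val_split(data=df,
...                                           partition_method='stratified',
...                                           stratified_column='CATEGORY',
...                                           training_percentage=0.7,
...                                           testing_percentage=0.15,
...                                           validation_percentage=0.15)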

hana_ml.algorithms.pal.pipeline

This module supports to run PAL functions in a pipeline manner.

class hana_ml.algorithms.pal.pipeline.Pipeline(steps)

Bases: object

Pipeline construction to run transformers and estimators sequentially.

Parameters
stepslist

List of (name, transform) tuples that are chained. The last object should be an estimator.

Examples

>>> Pipeline([
    ('pca', PCA(scaling=True, scores=True)),
    ('imputer', Imputer(conn_context=conn, strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(
        n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
        max_depth=6, cross_validation_range=cv_range))
    ])

Methods

fit(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_transform(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

paramdict

Parameters corresponding to the transform name.

Returns
DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
        ('pca', PCA(scaling=True, scores=True)),
        ('imputer', Imputer(strategy='mean'))
        ])
>>> param = {'pca': [('key', 'ID'), ('label', 'CLASS')], 'imputer': []}
>>> my_pipeline.fit_transform(data=train_data, param=param)

fit(self, data, param)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

paramdict

Parameters corresponding to the transform name.

Returns
DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(scaling=True, scores=True)),
    ('imputer', Imputer(strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(
    n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
    max_depth=6, cross_validation_range=cv_range))
    ])
>>> param = {
                'pca': [('key', 'ID'), ('label', 'CLASS')],
                'imputer': [],
                'hgbt': [('key', 'ID'), ('label', 'CLASS'), ('categorical_variable', ['CLASS'])]
            }
>>> hgbt_model = my_pipeline.fit(data=train_data, param=param)

hana_ml.algorithms.pal.preprocessing

This module contains Python wrappers for PAL preprocessing algorithms.

The following classes and functions are available:

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Normalize a DataFrame. In real world scenarios the collected continuous attributes are usually distributed within different ranges. It is a common practice to have the data well scaled so that data mining algorithms like neural networks, nearest neighbor classification and clustering can give more reliable results.

Note

Note that the data type of the output value is the same as that of the input value. Therefore, if the data type of the original data is INTEGER, the output value will be converted to an integer instead of the result you expect.

For example, if we want to use the min-max method to normalize a list [1, 2, 3, 4] with new_min = 0 and new_max = 1.0, we would expect the result to be [0, 0.33, 0.66, 1], but the actual output is [0, 0, 0, 1] because the output data type is kept consistent with the input data type.

Therefore, please cast the feature column(s) from INTEGER to DOUBLE before invoking the function.
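A sketch of such a cast using the hana_ml DataFrame cast method (assuming it is available in your hana_ml version; 'X1' and 'X2' are illustrative column names):

>>> df1 = df1.cast(['X1', 'X2'], 'DOUBLE')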

Parameters
method{'min-max', 'z-score', 'decimal'}

Scaling methods:

  • 'min-max': Min-max normalization.

  • 'z-score': Z-Score normalization.

  • 'decimal': Decimal scaling normalization.

z_score_method{'mean-standard', 'mean-mean', 'median-median'}, optional

Only valid when method is 'z-score'.

  • 'mean-standard': Mean-Standard deviation

  • 'mean-mean': Mean-Mean deviation

  • 'median-median': Median-Median absolute deviation

new_maxfloat, optional

The new maximum value for min-max normalization.

Only valid when method is 'min-max'.

new_minfloat, optional

The new minimum value for min-max normalization.

Only valid when method is 'min-max'.

thread_ratiofloat, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Examples

Input DataFrame df1:

>>> df1.head(4).collect()
    ID    X1    X2
0    0   6.0   9.0
1    1  12.1   8.3
2    2  13.5  15.3
3    3  15.4  18.7

Creating a FeatureNormalizer instance:

>>> fn = FeatureNormalizer(method="min-max", new_max=1.0, new_min=0.0)

Performing fit on given DataFrame:

>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
    ID        X1        X2
0    0  0.000000  0.033175
1    1  0.186544  0.000000
2    2  0.229358  0.331754
3    3  0.287462  0.492891

Input DataFrame for transforming:

>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0

Performing transform on given DataFrame:

>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
Attributes
result_DataFrame

Scaled dataset from fit and fit_transform methods.

model_ :

Trained model content.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features])

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

fit_transform(self, data, key[, features])

Fit with the dataset and return the normalized results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features])

Scales data based on the previous scaling model.

fit(self, data, key, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

fit_transform(self, data, key, features=None)

Fit with the dataset and return the normalized results.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Normalized result, with the same structure as data.

transform(self, data, key, features=None)

Scales data based on the previous scaling model.

Parameters
dataDataFrame

DataFrame to be normalized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Normalized result, with the same structure as data.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bin continuous data into number of intervals and perform local smoothing.

Note

Note that the data type of the output value is the same as that of the input value. Therefore, if the data type of the original data is INTEGER, the output value will be converted to an integer instead of the result you expect.

Therefore, please cast the feature column(s) from INTEGER to be DOUBLE before invoking the function.

Parameters
strategy{'uniform_number', 'uniform_size', 'quantile', 'sd'}
Binning methods:
  • 'uniform_number': Equal widths based on the number of bins.

  • 'uniform_size': Equal widths based on the bin size.

  • 'quantile': Equal number of records per bin.

  • 'sd': Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.

smoothing{'means', 'medians', 'boundaries'}
Smoothing methods:
  • 'means': Each value within a bin is replaced by the average of all the values belonging to the same bin.

  • 'medians': Each value in a bin is replaced by the median of all the values belonging to the same bin.

  • 'boundaries': The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When the distance is equal to both sides, it will be replaced by the front boundary value.

Values used for smoothing are not re-calculated during transform.

n_binsint, optional

The number of bins.

Only valid when strategy is 'uniform_number' or 'quantile'.

Defaults to 2.

bin_sizeint, optional

The interval width of each bin.

Only valid when strategy is 'uniform_size'.

Defaults to 10.

n_sdint, optional

The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean.

Only valid when strategy is 'sd'.

Defaults to 1.

Examples

Input DataFrame df1:

>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0

Creating a KBinsDiscretizer instance:

>>> binning = KBinsDiscretizer(strategy='uniform_size', smoothing='means', bin_size=10)

Performing fit on the given DataFrame:

>>> binning.fit(data=df1, key='ID')

Output:

>>> binning.result_.collect()
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000

Input DataFrame df2 for transforming:

>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0

Performing transform on the given DataFrame:

>>> result = binning.transform(data=df2, key='ID')

Output:

>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000
Attributes
result_DataFrame

Binned dataset from fit and fit_transform methods.

model_ :

Binning model content.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, key[, features])

Bin input data into number of intervals and smooth.

fit_transform(self, data, key[, features])

Fit with the dataset and return the results.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data, key[, features])

Bin data based on the previous binning model.

fit(self, data, key, features=None)

Bin input data into number of intervals and smooth.

Parameters
dataDataFrame

DataFrame to be discretized.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.

fit_transform(self, data, key, features=None)

Fit with the dataset and return the results.

Parameters
dataDataFrame

DataFrame to be binned.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Since the underlying PAL binning algorithm only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns
DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data's ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data's feature column.

transform(self, data, key, features=None)

Bin data based on the previous binning model.

Parameters
dataDataFrame

DataFrame to be binned.

keystr

Name of the ID column.

featureslist of str, optional

Names of the feature columns.

Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element.

If not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns
DataFrame

Binned result, structured as follows:

  • DATA_ID column: with same name and type as data's ID column.

  • BIN_INDEX: type INTEGER, assigned bin index.

  • BINNING_DATA column: smoothed value, with same name and type as data's feature column.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Imputer(strategy=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Missing value imputation for DataFrame.

Parameters
strategy{'non', 'mean', 'median', 'zero', 'als', 'delete'}, optional

The overall imputation strategy for all Numerical columns.

Defaults to 'mean'.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Note

The following parameters all have the prefix 'als_', and are invoked only when 'als' is the overall imputation strategy. These parameters are used to set up the alternating-least-squares (ALS) model for data imputation.

als_factorsint, optional

Length of factor vectors in the ALS model.

It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambdafloat, optional

L2 regularization applied to the factors in the ALS model.

Should be non-negative.

Defaults to 0.01.

als_maxitint, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstateint, optional

Specifies the seed of the random number generator used in the training of ALS model:

  • 0: Uses the current time as the seed,

  • Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_thresholdfloat, optional

Specifies a threshold for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.

0 means the objective value is not checked during training, and training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_intervalint, optional

Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified exit_threshold has been reached.

Defaults to 5.

als_linsolver{'cholsky', 'cg'}, optional

Linear system solver for the ALS model.

  • 'cholsky' is usually much faster.

  • 'cg' is recommended when als_factors is large.

Defaults to 'cholsky'.

als_cg_maxitint, optional

Specifies the maximum number of iterations for cg algorithm.

Invoked only when the 'cg' is the chosen linear system solver for ALS.

Defaults to 3.

als_centeringbool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scalingbool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

Examples

Input DataFrame df:

>>> df.head(5).collect()
   V0   V1 V2   V3   V4    V5
0  10  0.0  D  NaN  1.4  23.6
1  20  1.0  A  0.4  1.3  21.8
2  50  1.0  C  NaN  1.6  21.9
3  30  NaN  B  0.8  1.7  22.6
4  10  0.0  A  0.2  NaN   NaN

Create an Imputer instance using 'mean' strategy and call fit:

>>> impute = Imputer(strategy='mean')
>>> result = impute.fit_transform(df, categorical_variable=['V1'],
...                      strategy_by_col=[('V1', 'categorical_const', '0')])
>>> result.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.507692  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.507692  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.469231  20.646154

The stats/model content of input DataFrame:

>>> impute.model_.head(5).collect()
            STAT_NAME                   STAT_VALUE
0  V0.NUMBER_OF_NULLS                            3
1  V0.IMPUTATION_TYPE                         MEAN
2    V0.IMPUTED_VALUE                           24
3  V1.NUMBER_OF_NULLS                            2
4  V1.IMPUTATION_TYPE  SPECIFIED_CATEGORICAL_VALUE

The above stats/model content of the input DataFrame can be applied to imputing another DataFrame with the same data structure, e.g. consider the following DataFrame with missing values:

>>> df1.collect()
   ID    V0   V1    V2   V3   V4    V5
0   0  20.0  1.0     B  NaN  1.5  21.7
1   1  40.0  1.0  None  0.6  1.2  24.3
2   2   NaN  0.0     D  NaN  1.8  22.6
3   3  50.0  NaN     C  0.7  1.1   NaN
4   4  20.0  1.0     A  0.3  NaN  20.6

With attribute impute being obtained, one can impute the missing values of df1 via the following line of code, and then check the result:

>>> result1, _ = impute.transform(data=df1, key='ID')
>>> result1.collect()
   ID  V0  V1 V2        V3        V4         V5
0   0  20   1  B  0.507692  1.500000  21.700000
1   1  40   1  A  0.600000  1.200000  24.300000
2   2  24   0  D  0.507692  1.800000  22.600000
3   3  50   0  C  0.700000  1.100000  20.646154
4   4  20   1  A  0.300000  1.469231  20.600000

Create an Imputer instance using other strategies, e.g. 'als' strategy and then call fit:

>>> impute = Imputer(strategy='als', als_factors=2, als_randomstate=1)

Output:

>>> result2 = impute.fit_transform(data=df, categorical_variable=['V1'])
>>> result2.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.306957  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.930689  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.333668  21.371753
Attributes
model_DataFrame

statistics/model content.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data[, key, ...])

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

transform(self, data[, key, thread_ratio])

The function imputes missing values of a DataFrame using statistics/model info collected from another DataFrame.

fit_transform(self, data, key=None, categorical_variable=None, strategy_by_col=None)

Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.

Parameters
dataDataFrame

Input data with missing values.

keystr, optional

Name of the ID column.

Assume no ID column if key is not provided.

categorical_variablestr, optional

Names of columns with INTEGER data type that should actually be treated as categorical.

By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.

strategy_by_colListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.

Each tuple in the list should contain at least two elements, such that:

  • the 1st element is the name of a column;

  • the 2nd element is the imputation strategy of that column.

  • If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An example for illustration:

[('V1', 'categorical_const', '0'),

('V5','median')]

Returns
DataFrame

Imputed result using specified strategy, with the same data structure, i.e. column names and data types same as data.
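
As an illustrative sketch of the strategy_by_col parameter described above (reusing the DataFrame df and column names from the class example; this mirrors the documented call rather than adding new behavior):

>>> impute = Imputer(strategy='mean')
>>> result = impute.fit_transform(data=df,
...                               strategy_by_col=[('V1', 'categorical_const', '0'),
...                                                ('V5', 'median')])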

transform(self, data, key=None, thread_ratio=None)

The function imputes missing values of a DataFrame using statistics/model info collected from another DataFrame.

Parameters
dataDataFrame

Input DataFrame.

keystr, optional

Name of ID column. Assumed no ID column if not provided.

Defaults to None.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Returns
DataFrame

Imputation result, structured the same as data.

Statistics for the imputation result, structured as:

  • STAT_NAME: type NVARCHAR(256), statistics name.

  • STAT_VALUE: type NVARCHAR(5000), statistics value.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Discretize(strategy, n_bins=None, bin_size=None, n_sd=None, smoothing=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This is an enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments called bins, then applies smoothing methods to each bin of each column respectively.

Parameters
strategy{'uniform_number', 'uniform_size', 'quantile', 'sd'}
Binning methods:
  • 'uniform_number': equal widths based on the number of bins.

  • 'uniform_size': equal widths based on the bin width.

  • 'quantile': equal number of records per bin.

  • 'sd': bin boundaries determined by the mean and standard deviation.

n_binsint, optional

Number of needed bins.

Required and only valid when strategy is set as 'uniform_number' or 'quantile'.

Default to 2.

bin_sizefloat, optional

Specifies the distance for binning.

Required and only valid when strategy is set as 'uniform_size'.

Default to 10.

n_sdint, optional

Specifies the number of standard deviations on each side of the mean.

For example, if n_sd equals 2, this function takes mean +/- 2 * standard deviation as the upper/lower bound for binning.

Required and only valid when strategy is set as 'sd'.

smoothing{'no', 'bin_means', 'bin_medians', 'bin_boundaries'}, optional

Default smoothing methods for input data.

Only applies to non-categorical attributes that do not have a column-specific smoothing method assigned via the col_smoothing parameter of the fit method.

Default to 'bin_means'.

save_modelbool, optional

Indicates whether the model is saved.

Default to True.

Examples

Original data:

>>> df.collect()
        ID  ATT1   ATT2  ATT3 ATT4
    0    1  10.0  100.0   1.0    A
    1    2  10.1  101.0   1.0    A
    2    3  10.2  100.0   1.0    A
    3    4  10.4  103.0   1.0    A
    4    5  10.3  100.0   1.0    A
    5    6  40.0  400.0   4.0    C
    6    7  40.1  402.0   4.0    B
    7    8  40.2  400.0   4.0    B
    8    9  40.4  402.0   4.0    B
    9   10  40.3  400.0   4.0    A
    10  11  90.0  900.0   2.0    C
    11  12  90.1  903.0   1.0    B
    12  13  90.2  901.0   2.0    B
    13  14  90.4  900.0   1.0    B
    14  15  90.3  900.0   1.0    B

Construct a Discretize instance:

>>> bin = Discretize(strategy='uniform_number',
          n_bins=3, smoothing='bin_medians')

Train the model with the training data:

>>> bin.fit(train_data, binning_variable='ATT1', col_smoothing=[('ATT2', 'bin_means')],
            categorical_variable='ATT3', key=None, features=None)
>>> bin.assign_.collect()
        ID  BIN_INDEX
    0    1          1
    1    2          1
    2    3          1
    3    4          1
    4    5          1
    5    6          2
    6    7          2
    7    8          2
    8    9          2
    9   10          2
    10  11          3
    11  12          3
    12  13          3
    13  14          3
    14  15          3

Apply the model to new data:

>>> res = bin.predict(predict_data)
>>> res.collect()
       ID  BIN_INDEX
    0   1          1
    1   2          1
    2   3          1
    3   4          1
    4   5          3
    5   6          3
    6   7          2
Attributes
result_DataFrame
Discretize results, structured as follows:
  • ID: name as shown in input dataframe.

  • FEATURES: data smoothed in each bin, respectively.

assign_DataFrame
Assignment results, structured as follows:
  • ID: data ID, name as shown in input dataframe.

  • BIN_INDEX : bin index.

model_DataFrame
Model results, structured as follows:
  • ROW_INDEX: row index.

  • MODEL_CONTENT : model contents.

stats_DataFrame
Statistic results, structured as follows:
  • STAT_NAME: statistic name.

  • STAT_VALUE: statistic value.

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit(self, data, binning_variable[, key, ...])

Fitting a Discretize model.

fit_transform(self, data, binning_variable)

Learn a discretization configuration(model) from input data and then discretize it under that configuration.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

predict(self, data)

Discretizing new data using a generated Discretize model.

transform(self, data)

Data discretization using generated Discretize models.

fit(self, data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Fitting a Discretize model.

Parameters
dataDataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column will be counted as feature columns.

binning_variablestr/ListofStrings

Name(s) of the attribute(s) to which the binning operation is applied.

Variable data type must be numeric.

col_smoothingListofTuples, optional

Specifies column names and their smoothing methods, which override the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variablestr/ListofStrings, optional

Indicates whether a column should be treated as a categorical variable even though its data type is INTEGER.

No default value.

predict(self, data)

Discretizing new data using a generated Discretize model.

Parameters
dataDataFrame

Dataframe including the predict data.

Returns
DataFrame
  • Discretization result

  • Bin assignment

  • Statistics

transform(self, data)

Data discretization using generated Discretize models.

Parameters
dataDataFrame

Dataframe including the predict data.

Returns
DataFrame
  • Discretization result

  • Bin assignment

  • Statistics

fit_transform(self, data, binning_variable, key=None, features=None, col_smoothing=None, categorical_variable=None)

Learn a discretization configuration(model) from input data and then discretize it under that configuration.

Parameters
dataDataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column will be counted as feature columns.

binning_variablestr/ListofStrings

Name(s) of the attribute(s) to which the binning operation is applied.

Variable data type must be numeric.

col_smoothingListofTuples, optional

Specifies column names and their smoothing methods, which override the default smoothing method.

For example: col_smoothing = [('ATT1', 'bin_means'), ('ATT2', 'bin_boundaries')]

Only applies to non-categorical attributes.

No default value.

categorical_variablestr/ListofStrings, optional

Indicates whether a column should be treated as a categorical variable even though its data type is INTEGER.

No default value.

Returns
DataFrame
  • Discretization result

  • Bin assignment

  • Statistics
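
A minimal usage sketch of fit_transform, reusing df and the parameter values from the class example; the assumption that the three returned DataFrames unpack in the order listed above (result, bin assignment, statistics) is for illustration only:

>>> bin2 = Discretize(strategy='uniform_number', n_bins=3, smoothing='bin_medians')
>>> res, assign, stats = bin2.fit_transform(data=df, binning_variable='ATT1',
...                                         col_smoothing=[('ATT2', 'bin_means')],
...                                         categorical_variable='ATT3')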

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.MDS(matrix_type, thread_ratio=None, dim=None, metric=None, minkowski_power=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class serves as a tool for dimensionality reduction or data visualization.

Parameters
matrix_type{'dissimilarity', 'observation_feature'}

The type of the input table.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

dimint, optional

The number of dimensions that the input dataset is to be reduced to.

Default to 2.

metric{'manhattan', 'euclidean', 'minkowski'}, optional

The type of distance during the calculation of dissimilarity matrix.

Only valid when matrix_type is set as 'observation_feature'.

Default to 'euclidean'.

minkowski_powerfloat, optional

When metric is set as 'minkowski', this parameter controls the value of power.

Only valid when matrix_type is set as 'observation_feature' and metric is set as 'minkowski'.

Default to 3.

Examples

Original data:

>>> df.collect()
     ID        X1        X2        X3        X4
    0   1  0.000000  0.904781  0.908596  0.910306
    1   2  0.904781  0.000000  0.251446  0.597502
    2   3  0.908596  0.251446  0.000000  0.440357
    3   4  0.910306  0.597502  0.440357  0.000000

Apply the multidimensional scaling:

>>> mds = MDS(matrix_type='dissimilarity', dim=2, thread_ratio=0.5)
>>> res, stats = mds.fit_transform(data=df)
>>> res.collect()
           ID  DIMENSION     VALUE
    0   1          1  0.651917
    1   1          2 -0.015859
    2   2          1 -0.217737
    3   2          2 -0.253195
    4   3          1 -0.249907
    5   3          2 -0.072950
    6   4          1 -0.184273
    7   4          2  0.342003
>>> stats.collect()
                              STAT_NAME  STAT_VALUE
    0                        acheived K    2.000000
    1  proportion of variation explaind    0.978901

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data[, key, features])

Scaling of given datasets in multiple dimensions.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, key=None, features=None)

Scaling of given datasets in multiple dimensions.

Parameters
dataDataFrame

Dataframe that contains the training data.

keystr, optional

Name of the ID column in the input dataframe.

featuresstr/ListofStrings, optional

Names of the feature columns to be considered in the model.

If not specified, all columns except the key column will be counted as feature columns.

Returns
DataFrame
  • Scaling result of data, structured as follows:
    • Data ID : IDs from data

    • DIMENSION : The dimension number in data

    • VALUE : Scaled value

  • Statistics
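
For matrix_type='observation_feature', the metric and minkowski_power parameters become relevant; a minimal sketch, where df_obs is a hypothetical DataFrame of observations with an 'ID' column and numeric feature columns:

>>> mds_of = MDS(matrix_type='observation_feature', dim=2,
...              metric='minkowski', minkowski_power=3)
>>> res, stats = mds_of.fit_transform(data=df_obs, key='ID')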

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class is used to choose a small portion of the records as representatives.

Parameters
methodstr

Specifies the sampling method. Valid options include:

'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement',

'simple_random_without_replacement', 'systematic', 'stratified_with_replacement',

'stratified_without_replacement'.

For the random methods, the system time is used for the seed.

intervalint, optional

The interval between two samples.

Only required when method is 'every_nth'.

If this parameter is not specified, the sampling_size parameter will be used.

sampling_sizeint, optional

Number of the samples.

Default to 1.

random_stateint, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value, where:
  • 0: Uses the system time

  • Others: Uses the specified seed

Default to 0.

percentagefloat, optional

Percentage of the samples.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Default to 0.1.

Examples

Original data:

>>> df.collect().head(10)
    EMPNO  GENDER  INCOME
0       1    male  4000.5
1       2    male  5000.7
2       3  female  5100.8
3       4    male  5400.9
4       5  female  5500.2
5       6    male  5540.4
6       7    male  4500.9
7       8  female  6000.8
8       9    male  7120.8
9      10  female  8120.9

Apply the sampling function:

>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
   EMPNO  GENDER  INCOME
0      5  female  5500.2
1     10  female  8120.9
2     15    male  9876.5
3     20  female  8705.7
4     25  female  8794.9

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data[, features])

Sampling the input dataset under the specified configuration.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, features=None)

Sampling the input dataset under the specified configuration.

Parameters
dataDataFrame

Input Dataframe.

featuresstr/ListofStrings, optional

The column that is used to do the stratified sampling.

Only required when method is 'stratified_with_replacement', or 'stratified_without_replacement'.

Defaults to None.

Returns
DataFrame

Sampling results, same structure as defined in the Input DataFrame.
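
For the stratified methods, the features argument selects the stratification column; a minimal sketch reusing the employee DataFrame df from the class example (the choice of 'GENDER' and the percentage value are illustrative assumptions):

>>> strat = Sampling(method='stratified_without_replacement', percentage=0.4)
>>> res = strat.fit_transform(data=df, features=['GENDER'])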

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.SMOTE(smote_amount=None, k_nearest_neighbours=None, minority_class=None, thread_ratio=None, random_seed=None, method=None, search_method=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class is to handle imbalanced dataset. Synthetic minority over-sampling technique (SMOTE) proposes an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples in "feature space".

Note that this function is a new function in SAP HANA SPS05 and Cloud.

Parameters
smote_amountint, optional

Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.

The synthetic samples are generated until the minority class sample amount matches the majority class sample amount.

k_nearest_neighboursint, optional

Number of nearest neighbors (k).

Defaults to 1.

minority_classstr, optional(deprecated)

Specifies the minority class value in dependent variable column.

All classes except majority class are re-sampled to match the majority class sample amount.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range [0, 1] will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

random_seedint, optional

Specifies the seed for random number generator.

  • 0: Uses the current time (in seconds) as seed

  • Others: Uses the specified value as seed

Defaults to 0.

methodint, optional(deprecated)

Searching method when finding K nearest neighbour.

  • 0: Brute force searching

  • 1: KD-tree searching

Defaults to 0.

search_methodstr, optional

Specifies the searching method for finding the k nearest-neighbors.

  • 'brute-force'

  • 'kd-tree'

Defaults to 'brute-force'.

Examples

>>> smote = SMOTE(smote_amount=200, k_nearest_neighbours=2,
                  search_method='kd-tree')
>>> res, stats = smote.fit_transform(data=df, label = 'TYPE', minority_class=2)

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data, label[, ...])

Upsampling the given dataset using SMOTE with the specified configuration.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, label, minority_class=None)

Upsampling the given dataset using SMOTE with the specified configuration.

Parameters
dataDataFrame

Dataframe that contains the training data.

labelstr

Specifies the dependent variable by name.

minority_classstr/int, optional

Specifies the minority class value in dependent variable column.

If not specified, all but the majority classes are resampled to match the majority class sample amount.

Returns
DataFrame
  • SMOTE result, the same structure as defined in the input data.
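
When minority_class is omitted, all classes except the majority class are resampled; a minimal sketch reusing the smote instance and DataFrame df from the class example, assuming the single result DataFrame documented above is returned:

>>> res = smote.fit_transform(data=df, label='TYPE')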

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.SMOTETomek(smote_amount=None, k_nearest_neighbours=None, thread_ratio=None, random_seed=None, search_method=None, sampling_strategy=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class combines over-sampling using SMOTE and cleaning (under-sampling) using Tomek's links.

Note that this function is a new function in SAP HANA SPS05 and Cloud.

Parameters
smote_amountint, optional

Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.

The synthetic samples are generated until the minority class sample amount matches the majority class sample amount.

k_nearest_neighboursint, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range [0, 1] will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

random_seedint, optional

Specifies the seed for random number generator.

  • 0: Uses the current time (in seconds) as seed

  • Others: Uses the specified value as seed

Defaults to 0.

search_methodstr, optional

Specifies the searching method when finding K nearest neighbour.

  • 'brute-force'

  • 'kd-tree'

Defaults to 'brute-force'.

sampling_strategystr, optional

Specifies the classes targeted by resampling:

  • 'majority' : resamples only the majority class

  • 'non-minority' : resamples all classes except the minority class

  • 'non-majority' : resamples all classes except the majority class

  • 'all' : resamples all classes

Defaults to 'majority'.

Examples

>>> smotetomek = SMOTETomek(smote_amount=200,
                            k_nearest_neighbours=2,
                            random_seed=2,
                            search_method='kd-tree',
                            sampling_strategy='all')
>>> res = smotetomek.fit_transform(data=df, label='TYPE', minority_class=2)

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data, label[, ...])

Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, label, minority_class=None)

Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.

Parameters
dataDataFrame

Dataframe that contains the training data.

labelstr

Specifies the dependent variable by name.

minority_classstr/int, optional

Specifies the minority class value in dependent variable column.

If not specified, all but the majority classes are resampled to match the majority class sample amount.

Returns
DataFrame
  • SMOTETomek result, the same structure as defined in the input data.

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This class is for performing under-sampling by removing Tomek's links.

Note that this function is a new function in SAP HANA SPS05 and Cloud.

Parameters
distance_levelstr, optional

Specifies the distance method between train data and test data point.

  • 'manhattan'

  • 'euclidean'

  • 'minkowski'

  • 'chebyshev'

  • 'cosine'

Defaults to 'euclidean'.

minkowski_powerfloat, optional

Specifies the value of power for Minkowski distance calculation.

Defaults to 3.

Valid only when distance_level is 'minkowski'.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range [0, 1] will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

search_methodstr, optional

Specifies the searching method when finding K nearest neighbour.

  • 'brute-force'

  • 'kd-tree'

Defaults to 'brute-force'.

sampling_strategystr, optional

Specifies the classes targeted by resampling:

  • 'majority' : resamples only the majority class

  • 'non-minority' : resamples all classes except the minority class

  • 'non-majority' : resamples all classes except the majority class

  • 'all' : resamples all classes

Defaults to 'majority'.

category_weightsfloat, optional

Specifies the weight for categorical attributes.

Defaults to 0.707 if not provided.

Examples

>>> tomeklinks = TomekLinks(search_method='kd-tree',
                            sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')

Methods

add_attribute(self, attr_key, attr_val)

Function to add attribute.

fit_transform(self, data[, key, label, ...])

Perform under-sampling on given datasets by removing Tomek's links.

is_fitted(self)

Checks if the model can be saved.

load_model(self, model)

Function to load fitted model.

fit_transform(self, data, key=None, label=None, categorical_variable=None, variable_weight=None)

Perform under-sampling on given datasets by removing Tomek's links.

Parameters
dataDataFrame

Dataframe that contains the training data.

keystr, optional

Specifies the name of the ID column.

If not provided, then it is assumed that there is no ID column in the input data.

labelstr

Specifies the dependent variable by name.

categorical_variablestr/ListOfStrings, optional

Specifies the list of INTEGER columns that should be treated as categorical.

By default, only VARCHAR and NVARCHAR columns are treated as categorical, while numerical (i.e. INTEGER or DOUBLE) columns are treated as continuous.

No default value.

variable_weightdict, optional

Specifies the weights of variables participating in distance calculation in dictionary format.

  • key : variable(column) name

  • value : weight for distance calculation

No default value.

Returns
DataFrame

Undersampled result, the same structure as defined in the input data.
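
A minimal sketch of passing per-column distance weights and categorical markers, reusing the tomeklinks instance from the example above; the column names 'X1', 'X2', 'X3' and the weight values are purely illustrative assumptions:

>>> res = tomeklinks.fit_transform(data=df, key='ID', label='TYPE',
...                                categorical_variable=['X3'],
...                                variable_weight={'X1': 2.0, 'X2': 1.0})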

add_attribute(self, attr_key, attr_val)

Function to add attribute.

is_fitted(self)

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(self, model)

Function to load fitted model.

Parameters
modelDataFrame

HANA DataFrame for fitted model.

hana_ml.algorithms.pal.preprocessing.mds(data, matrix_type, thread_ratio=None, dim=None, metric=None, minkowski_power=None, key=None, features=None)

This function serves as a tool for dimensionality reduction or data visualization.

Parameters
dataDataFrame

DataFrame containing the data.

matrix_type{'dissimilarity', 'observation_feature'}

The type of the input table. Mandatory.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

dimint, optional

The number of dimensions that the input dataset is to be reduced to.

Default to 2.

metric{'manhattan', 'euclidean', 'minkowski'}, optional

The type of distance during the calculation of dissimilarity matrix.

Only valid when matrix_type is set as 'observation_feature'.

Default to 'euclidean'.

minkowski_powerfloat, optional

When metric is 'minkowski', this parameter controls the value of power.

Only valid when matrix_type is set as 'observation_feature' and metric is set as 'minkowski'.

Default to 3.

keystr, optional

Name of the ID column in the dataframe.

If not specified, the first column will be taken as the ID column.

featuresstr/ListOfStrings, optional

Names of the feature columns in the dataframe.

If not specified, columns except the ID column will be taken as feature columns.

Returns
DataFrame
  • Scaling results, structured as follows:
    • DATA_ID: name as shown in input dataframe.

    • DIMENSION: dimension.

    • VALUE: value.

  • Statistic results, structured as follows:
    • STAT_NAME: statistic name.

    • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect()
     ID        X1        X2        X3        X4
    0   1  0.000000  0.904781  0.908596  0.910306
    1   2  0.904781  0.000000  0.251446  0.597502
    2   3  0.908596  0.251446  0.000000  0.440357
    3   4  0.910306  0.597502  0.440357  0.000000

Apply the multidimensional scaling:

>>> res,stats = mds(data=df,
                    matrix_type='dissimilarity', dim=2, thread_ratio=0.5)
>>> res.collect()
           ID  DIMENSION     VALUE
    0   1          1  0.651917
    1   1          2 -0.015859
    2   2          1 -0.217737
    3   2          2 -0.253195
    4   3          1 -0.249907
    5   3          2 -0.072950
    6   4          1 -0.184273
    7   4          2  0.342003
>>> stats.collect()
                              STAT_NAME  STAT_VALUE
    0                        acheived K    2.000000
    1  proportion of variation explaind    0.978901
hana_ml.algorithms.pal.preprocessing.sampling(data, method, interval=None, features=None, sampling_size=None, random_state=None, percentage=None)

This function is used to choose a small portion of the records as representatives.

Parameters
dataDataFrame

DataFrame containing the data.

methodstr
Specifies the sampling method.

Valid options include:

'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.

For the random methods, the system time is used for the seed.

intervalint, optional

The interval between two samples.

Only required when method is 'every_nth'.

If this parameter is not specified, the sampling_size parameter will be used.

featuresstr/ListofStrings, optional

The column that is used to do the stratified sampling.

Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.

sampling_sizeint, optional

Number of the samples.

Default to 1.

random_stateint, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value.

  • 0: Uses the system time

  • Not 0: Uses the specified seed

Default to 0.

percentagefloat, optional

Percentage of the samples.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Default to 0.1.

Returns
DataFrame
  • Sampling results, structured as follows:
    • DATA_FEATURES: same structure as defined in the Input Table.

Examples

Original data:

>>> df.collect().head(10)
         EMPNO  GENDER  INCOME
    0       1    male  4000.5
    1       2    male  5000.7
    2       3  female  5100.8
    3       4    male  5400.9
    4       5  female  5500.2
    5       6    male  5540.4
    6       7    male  4500.9
    7       8  female  6000.8
    8       9    male  7120.8
    9      10  female  8120.9

Apply the sampling function:

>>> res = sampling(data=df, method='every_nth', interval=5, sampling_size=8)
>>> res.collect()
         EMPNO  GENDER  INCOME
    0      5  female  5500.2
    1     10  female  8120.9
    2     15    male  9876.5
    3     20  female  8705.7
    4     25  female  8794.9
hana_ml.algorithms.pal.preprocessing.variance_test(data, sigma_num, thread_ratio=None, key=None, data_col=None)

Variance Test is a method to identify outliers in a set of n numeric values {xi}, 1 <= i <= n, using the mean and the standard deviation of the data.

Parameters
dataDataFrame

DataFrame containing the data.

sigma_numfloat

Multiplier for sigma.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Default to 0.

keystr, optional

Name of the ID column.

If not specified, defaults to the first column of data.

data_colstr, optional

Name of the raw data column in the dataframe.

If not specified, defaults to the last column of data.

Returns
DataFrame
  • Test results, structured as follows:
    • DATA_ID: name as shown in input dataframe.

    • IS_OUT_OF_RANGE: 0 -> in bounds, 1 -> out of bounds.

  • Statistic results, structured as follows:
    • STAT_NAME: statistic name.

    • STAT_VALUE: statistic value.

Examples

Original data:

>>> df.collect().tail(10)
        ID      X
    10  10   26.0
    11  11   28.0
    12  12   29.0
    13  13   27.0
    14  14   26.0
    15  15   23.0
    16  16   22.0
    17  17   23.0
    18  18   25.0
    19  19  103.0

Apply the variance test:

>>> res, stats = variance_test(data, sigma_num=3.0)
>>> res.collect().tail(10)
        ID  IS_OUT_OF_RANGE
    10  10                0
    11  11                0
    12  12                0
    13  13                0
    14  14                0
    15  15                0
    16  16                0
    17  17                0
    18  18                0
    19  19                1
>>> stats.collect()
        STAT_NAME  STAT_VALUE
    0   mean        28.400000

hana_ml.algorithms.pal.random

This module contains wrappers for PAL Random distribution sampling algorithms.
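
The examples in this module assume an existing database connection object cc. A minimal sketch of creating one via hana_ml.dataframe.ConnectionContext (host, port, user, and password are placeholders):

>>> from hana_ml.dataframe import ConnectionContext
>>> cc = ConnectionContext(address='<hana-host>', port=30015,
...                        user='<user>', password='<password>')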

The following distribution functions are available:

hana_ml.algorithms.pal.random.multinomial(conn_context, n, pvals, num_random=100, seed=None, thread_ratio=None)

Draw samples from a multinomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint

Number of trials.

pvalstuple of float and int

Success fractions of each category.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • Generated random number columns, type DOUBLE, named by appending an index number (from 1 to the length of pvals) to 'RANDOM_P'. There will be as many of these columns as there are values in pvals.

Examples

Draw samples from a multinomial distribution.

>>> res = multinomial(conn_context=cc, n=10, pvals=(0.1, 0.2, 0.3, 0.4), num_random=10)
>>> res.collect()
   ID  RANDOM_P1  RANDOM_P2  RANDOM_P3  RANDOM_P4
0   0        1.0        2.0        2.0        5.0
1   1        1.0        2.0        3.0        4.0
2   2        0.0        0.0        8.0        2.0
3   3        0.0        2.0        1.0        7.0
4   4        1.0        1.0        4.0        4.0
5   5        1.0        1.0        4.0        4.0
6   6        1.0        2.0        3.0        4.0
7   7        1.0        4.0        2.0        3.0
8   8        1.0        2.0        3.0        4.0
9   9        4.0        1.0        1.0        4.0
hana_ml.algorithms.pal.random.bernoulli(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Bernoulli distribution.

Parameters
conn_contextConnectionContext

Database connection object.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a bernoulli distribution.

>>> res = bernoulli(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               0.0
2   2               1.0
3   3               1.0
4   4               0.0
5   5               1.0
6   6               1.0
7   7               0.0
8   8               1.0
9   9               0.0
hana_ml.algorithms.pal.random.beta(conn_context, a=0.5, b=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Beta distribution.

Parameters
conn_contextConnectionContext

Database connection object.

afloat, optional

Alpha value, positive.

Defaults to 0.5.

bfloat, optional

Beta value, positive.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a beta distribution.

>>> res = beta(conn_context=cc, a=0.5, b=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.976130
1   1          0.308346
2   2          0.853118
3   3          0.958553
4   4          0.677258
5   5          0.489628
6   6          0.027733
7   7          0.278073
8   8          0.850181
9   9          0.976244
hana_ml.algorithms.pal.random.binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a binomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint, optional

Number of trials.

Defaults to 1.

pfloat, optional

Success fraction. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a binomial distribution.

>>> res = binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               0.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               1.0
8   8               0.0
9   9               1.0
hana_ml.algorithms.pal.random.cauchy(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a cauchy distribution.

Parameters
conn_contextConnectionContext

Database connection object.

locationfloat, optional

Defaults to 0.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a cauchy distribution.

>>> res = cauchy(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.827259
1   1         -1.877612
2   2        -18.241436
3   3         -1.216243
4   4          2.091336
5   5       -317.131147
6   6         -2.804251
7   7         -0.338566
8   8          0.143280
9   9          1.277245
hana_ml.algorithms.pal.random.chi_squared(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a chi_squared distribution.

Parameters
conn_contextConnectionContext

Database connection object.

dofint, optional

Degrees of freedom.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional
Indicates the seed used to initialize the random number generator:
  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a chi_squared distribution.

>>> res = chi_squared(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.040571
1   1          2.680756
2   2          1.119563
3   3          1.174072
4   4          0.872421
5   5          0.327169
6   6          1.113164
7   7          1.549585
8   8          0.013953
9   9          0.011735
hana_ml.algorithms.pal.random.exponential(conn_context, lamb=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an exponential distribution.

Parameters
conn_contextConnectionContext

Database connection object.

lambfloat, optional

The rate parameter, which is the inverse of the scale parameter.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an exponential distribution.

>>> res = exponential(conn_context=cc, lamb=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.035207
1   1          0.559248
2   2          0.122307
3   3          2.339937
4   4          1.130033
5   5          0.985565
6   6          0.030138
7   7          0.231040
8   8          1.233268
9   9          0.876022
hana_ml.algorithms.pal.random.gumbel(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.

Parameters
conn_contextConnectionContext

Database connection object.

locationfloat, optional

Defaults to 0.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gumbel distribution.

>>> res = gumbel(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.544054
1   1          0.339531
2   2          0.394224
3   3          3.161123
4   4          1.208050
5   5         -0.276447
6   6          1.694589
7   7          1.406419
8   8         -0.443717
9   9          0.156404
hana_ml.algorithms.pal.random.f(conn_context, dof1=1, dof2=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from an f distribution.

Parameters
conn_contextConnectionContext

Database connection object.

dof1int, optional

DEGREES_OF_FREEDOM1.

Defaults to 1.

dof2int, optional

DEGREES_OF_FREEDOM2.

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from an f distribution.

>>> res = f(conn_context=cc, dof1=1, dof2=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          6.494985
1   1          0.054830
2   2          0.752216
3   3          4.946226
4   4          0.167151
5   5        351.789925
6   6          0.810973
7   7          0.362714
8   8          0.019763
9   9         10.553533
hana_ml.algorithms.pal.random.gamma(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a gamma distribution.

Parameters
conn_contextConnectionContext

Database connection object.

shapefloat, optional

Defaults to 1.

scalefloat, optional

Defaults to 1.

num_randomint, optional

Specifies the number of random data to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

Dataframe containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a gamma distribution.

>>> res = gamma(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.082794
1   1          0.084031
2   2          0.159490
3   3          1.063100
4   4          0.530218
5   5          1.307313
6   6          0.565527
7   7          0.474969
8   8          0.440999
9   9          0.463645
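The mean of a gamma distribution equals shape * scale, which gives a quick way to relate the generated values back to the two parameters. The sketch below is illustrative only and assumes the same cc connection; the empirical mean will vary slightly from run to run:

>>> res = gamma(conn_context=cc, shape=2, scale=3, num_random=10000)
>>> samples = res.collect()['GENERATED_NUMBER']
>>> # Theoretical mean for shape=2, scale=3 is 2 * 3 = 6; for a sample of
>>> # this size the empirical mean should land close to that value.
>>> samples.mean()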
hana_ml.algorithms.pal.random.geometric(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a geometric distribution.

Parameters
conn_contextConnectionContext

Database connection object.

pfloat, optional

Probability of success in a single trial. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a geometric distribution.

>>> res = geometric(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               1.0
3   3               0.0
4   4               1.0
5   5               0.0
6   6               0.0
7   7               0.0
8   8               0.0
9   9               0.0
hana_ml.algorithms.pal.random.lognormal(conn_context, mean=0, sigma=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a lognormal distribution.

Parameters
conn_contextConnectionContext

Database connection object.

meanfloat, optional

Mean value of the underlying normal distribution.

Defaults to 0.

sigmafloat, optional

Standard deviation of the underlying normal distribution.

Defaults to 1.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a lognormal distribution.

>>> res = lognormal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.461803
1   1          0.548432
2   2          0.625874
3   3          3.038529
4   4          3.582703
5   5          1.867543
6   6          1.853857
7   7          0.378827
8   8          1.104031
9   9          0.840102
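Because mean and sigma describe the underlying normal distribution rather than the lognormal samples themselves, applying the natural logarithm to the generated values should yield data that is approximately normal with that mean and standard deviation. A hedged sketch, again assuming the cc connection from the example above:

>>> import numpy as np
>>> res = lognormal(conn_context=cc, mean=0, sigma=1, num_random=10000)
>>> log_samples = np.log(res.collect()['GENERATED_NUMBER'])
>>> # The log-transformed values should be roughly N(0, 1), so their mean
>>> # should be near 0 and their standard deviation near 1.
>>> log_samples.mean(), log_samples.std()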
hana_ml.algorithms.pal.random.negative_binomial(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)

Draw samples from a negative binomial distribution.

Parameters
conn_contextConnectionContext

Database connection object.

nint, optional

Number of successes.

Defaults to 1.

pfloat, optional

Probability of success in a single trial. The value range is from 0 to 1.

Defaults to 0.5.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a negative binomial distribution.

>>> res = negative_binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               2.0
2   2               3.0
3   3               1.0
4   4               1.0
5   5               0.0
6   6               2.0
7   7               1.0
8   8               2.0
9   9               3.0
hana_ml.algorithms.pal.random.normal(conn_context, mean=0, sigma=None, variance=None, num_random=100, seed=None, thread_ratio=None)

Draw samples from a normal distribution.

Parameters
conn_contextConnectionContext

Database connection object.

meanfloat, optional

Mean value.

Defaults to 0.

sigmafloat, optional

Standard deviation. It cannot be used together with variance.

Defaults to 1.

variancefloat, optional

Variance. It cannot be used together with sigma.

Defaults to 1.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a normal distribution.

>>> res = normal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.321078
1   1         -1.327626
2   2          0.798867
3   3         -0.116128
4   4         -0.213519
5   5          0.008566
6   6          0.251733
7   7          0.404510
8   8         -0.534899
9   9         -0.420968
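Since sigma and variance are mutually exclusive, only one of them should be supplied per call. The sketch below (assuming the same cc connection) passes variance instead of sigma; a variance of 4 corresponds to a standard deviation of 2:

>>> res = normal(conn_context=cc, mean=0, variance=4, num_random=10)
>>> # sigma is omitted here because it cannot be combined with variance.
>>> res.collect()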
hana_ml.algorithms.pal.random.pert(conn_context, minimum=-1, mode=0, maximum=1, scale=4, num_random=100, seed=None, thread_ratio=None)

Draw samples from a PERT distribution.

Parameters
conn_contextConnectionContext

Database connection object.

minimumint, optional

Minimum value.

Defaults to -1.

modefloat, optional

Most likely value.

Defaults to 0.

maximumfloat, optional

Maximum value.

Defaults to 1.

scalefloat, optional

Defaults to 4.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a PERT distribution.

>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.360781
1   1         -0.023649
2   2          0.106465
3   3          0.307412
4   4         -0.136838
5   5         -0.086010
6   6         -0.504639
7   7          0.335352
8   8         -0.287202
9   9          0.468597
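For the classical PERT distribution, the mean is (minimum + scale * mode + maximum) / (scale + 2); with minimum=-1, mode=0, maximum=1 and the default scale=4 this works out to 0. Assuming the PAL implementation follows this standard parameterization (an assumption not stated explicitly here), a larger sample can be checked against it:

>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, scale=4,
...            num_random=10000)
>>> samples = res.collect()['GENERATED_NUMBER']
>>> # Classical PERT mean: (-1 + 4*0 + 1) / (4 + 2) = 0, so the empirical
>>> # mean should be close to 0 for a sample of this size.
>>> samples.mean()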
hana_ml.algorithms.pal.random.poisson(conn_context, theta=1.0, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Poisson distribution.

Parameters
conn_contextConnectionContext

Database connection object.

thetafloat, optional

The average number of events in an interval.

Defaults to 1.0.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a Poisson distribution.

>>> res = poisson(conn_context=cc, theta=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               1.0
2   2               1.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               2.0
8   8               0.0
9   9               1.0
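A Poisson distribution has both mean and variance equal to theta, so the empirical statistics of a larger sample should settle near that value. A sketch assuming the cc connection used above:

>>> res = poisson(conn_context=cc, theta=3.5, num_random=10000)
>>> samples = res.collect()['GENERATED_NUMBER']
>>> # Both the mean and the variance of a Poisson distribution equal theta,
>>> # so both statistics should be close to 3.5 here.
>>> samples.mean(), samples.var()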
hana_ml.algorithms.pal.random.student_t(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)

Draw samples from a Student's t-distribution.

Parameters
conn_contextConnectionContext

Database connection object.

doffloat, optional

Degrees of freedom.

Defaults to 1.

num_randomint, optional

Specifies the number of random samples to be generated.

Defaults to 100.

seedint, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.

  • Not 0: Uses the specified seed.

Note

When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.

Defaults to 0.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside the range [0, 1] tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns
DataFrame

DataFrame containing the generated random samples, structured as follows:

  • ID, type INTEGER, ID column.

  • GENERATED_NUMBER, type DOUBLE, sample value.

Examples

Draw samples from a Student's t-distribution.

>>> res = student_t(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0         -0.433802
1   1          1.972038
2   2         -1.097313
3   3         -0.225812
4   4         -0.452342
5   5          2.242921
6   6          0.377288
7   7          0.322347
8   8          1.104877
9   9         -0.017830
hana_ml.algorithms.pal.random.uniform(conn_context, low=0, high=1, num_random=100, seed=<