hana_ml.algorithms.pal package¶
The Algorithms PAL Package consists of the following sections:
hana_ml.algorithms.pal.association¶
This module contains Python wrappers for PAL association algorithms.
The following classes are available:
class hana_ml.algorithms.pal.association.Apriori(conn_context, min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase
Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
min_support : float
User-specified minimum support (actual value).
min_confidence : float
User-specified minimum confidence (actual value).
relational : bool, optional
Whether or not to apply relational logic in Apriori algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables: antecedent, consequent and statistics.
Defaults to False.
min_lift : float, optional
User-specified minimum lift.
Defaults to 0.
max_conseq : int, optional
Maximum length of consequent items.
Defaults to 100.
max_len : int, optional
Maximum total length of antecedent items and consequent items in the output.
Defaults to 5.
ubiquitous : float, optional
Item sets whose support values are greater than this number will be ignored during frequent itemset mining.
Defaults to 1.0.
use_prefix_tree : bool, optional
Indicates whether or not to use prefix tree for saving memory.
Defaults to False.
lhs_restrict : list of str, optional (deprecated)
Specify items that are only allowed on the left-hand-side of association rules.
rhs_restrict : list of str, optional (deprecated)
Specify items that are only allowed on the right-hand-side of association rules.
lhs_complement_rhs : bool, optional (deprecated)
If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...
rhs_restrict = ['i1', 'i2'],
lhs_complement_rhs = True,
...

Defaults to False.
rhs_complement_lhs : bool, optional (deprecated)
If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand-side.
Defaults to False.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
timeout : int, optional
Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.
Defaults to 3600.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Specify the way to export the Apriori model:
‘no’ : do not export the model,
‘single-row’ : export Apriori model in PMML in single row,
‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.
Defaults to ‘no’.
Examples
Input data for associate rule mining:
>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3
Set up parameters for the Apriori algorithm:
>>> ap = Apriori(conn_context=conn,
...              min_support=0.1,
...              min_confidence=0.3,
...              relational=False,
...              min_lift=1.1,
...              max_conseq=1,
...              max_len=5,
...              ubiquitous=1.0,
...              use_prefix_tree=False,
...              thread_ratio=0,
...              timeout=3600,
...              pmml_export='single-row')
Association rule mining using the Apriori algorithm for the input data, and check the results:
>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000
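Since pmml_export='single-row' was specified, the fitted model is also exported in PMML; a minimal sketch of retrieving it (the exact PMML content depends on the PAL version, so no output is shown):

>>> pmml = ap.model_.collect()   # with 'single-row' export, one row: model ID plus PMML content, as described under Attributes below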
Apriori algorithm set up using relational logic:
>>> apr = Apriori(conn_context=conn,
...               min_support=0.1,
...               min_confidence=0.3,
...               relational=True,
...               min_lift=1.1,
...               max_conseq=1,
...               max_len=5,
...               ubiquitous=1.0,
...               use_prefix_tree=False,
...               thread_ratio=0,
...               timeout=3600,
...               pmml_export='single-row')
Again mining association rules using Apriori algorithm for the input data, and check the resulting tables:
>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
Attributes
result_ : DataFrame
Mined association rules and related statistics, structured as follows:
- 1st column : antecedent (leading) items.
- 2nd column : consequent (dependent) items.
- 3rd column : support value.
- 4th column : confidence value.
- 5th column : lift value.
Available only when relational is False.

model_ : DataFrame
Apriori model trained from the input data, structured as follows:
- 1st column : model ID.
- 2nd column : model content, i.e. Apriori model in PMML format.

antec_ : DataFrame
Antecedent items of mined association rules, structured as follows:
- 1st column : association rule ID.
- 2nd column : antecedent items of the corresponding association rule.
Available only when relational is True.

conseq_ : DataFrame
Consequent items of mined association rules, structured as follows:
- 1st column : association rule ID.
- 2nd column : consequent items of the corresponding association rule.
Available only when relational is True.

stats_ : DataFrame
Statistics of the mined association rules, structured as follows:
- 1st column : rule ID.
- 2nd column : support value of the rule.
- 3rd column : confidence value of the rule.
- 4th column : lift value of the rule.
Available only when relational is True.

Methods
fit(data[, transaction, item, lhs_restrict, ...])
Association rule mining from the input data using the Apriori algorithm.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)
Association rule mining from the input data using the Apriori algorithm.
- Parameters
data : DataFrame
Input data for association rule mining.
transaction : str, optional
Name of the transaction column.
Defaults to the first column if not provided.
item : str, optional
Name of the item ID column. Data type of item column can either be int or str.
Defaults to the last column if not provided.
lhs_restrict : list of int/str, optional
Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.
rhs_restrict : list of int/str, optional
Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.
lhs_complement_rhs : bool, optional
If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...
rhs_restrict = [i1, i2],
lhs_complement_rhs = True,
...

Defaults to False.
rhs_complement_lhs : bool, optional
If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand-side.
Defaults to False.
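A minimal sketch of a restricted fit, assuming the ap instance and df from the examples above (the restriction values are illustrative):

>>> ap.fit(data=df,
...        transaction='CUSTOMER',
...        item='ITEM',
...        rhs_restrict=['item1', 'item2'],   # only these items may appear as consequents
...        lhs_complement_rhs=True)           # all remaining items are confined to the antecedent side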
class hana_ml.algorithms.pal.association.AprioriLite(conn_context, min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
min_support : float
User-specified minimum support (actual value).
min_confidence : float
User-specified minimum confidence (actual value).
subsample : float, optional
Specify the sampling percentage for the input data. Set to 1 if you want to use the entire data.
recalculate : bool, optional
If you sample the input data, this parameter indicates whether or not to use the remaining data to update the related statistics, i.e. support, confidence and lift.
Defaults to True.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
timeout : int, optional
Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.
Defaults to 3600.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Specify the way to export the Apriori model:
‘no’ : do not export the model,
‘single-row’ : export Apriori model in PMML in single row,
‘multi-row’ : export Apriori model in PMML in multiple rows, while the minimum length of each row is 5000 characters.
Defaults to ‘no’.
Examples
Input data for association rule mining using Apriori algorithm:
>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
2          3  item1
3          3  item2
4          3  item4
5          4  item1
6          4  item3
7          5  item2
8          5  item3
9          6  item1
10         6  item3
11         0  item1
12         0  item2
13         0  item5
14         1  item2
15         1  item4
16         7  item1
17         7  item2
18         7  item3
19         7  item5
20         8  item1
21         8  item2
22         8  item3
Set up parameters for light Apriori algorithm, ingest the input data, and check the result table:
>>> apl = AprioriLite(conn_context=conn,
...                   min_support=0.1,
...                   min_confidence=0.3,
...                   subsample=1.0,
...                   recalculate=False,
...                   timeout=3600,
...                   pmml_export='single-row')
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
Attributes
result_ : DataFrame
Mined association rules and related statistics, structured as follows:
- 1st column : antecedent (leading) items.
- 2nd column : consequent (dependent) items.
- 3rd column : support value.
- 4th column : confidence value.
- 5th column : lift value.

model_ : DataFrame
Apriori model trained from the input data, structured as follows:
- 1st column : model ID.
- 2nd column : model content, i.e. the lite Apriori model in PMML format.
Methods
fit(data[, transaction, item])
Association rule mining based on the input data.

fit(data, transaction=None, item=None)
Association rule mining based on the input data.
- Parameters
- Parameters
data : DataFrame
Input data for association rule mining.
transaction : str, optional
Name of the transaction column.
Defaults to the first column if not provided.
item : str, optional
Name of the item column.
Defaults to the last column if not provided.
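A minimal sketch of a fit with explicit column names, assuming the apl instance and df from the example above (the column names are those of that df; passing them is optional, as documented):

>>> apl.fit(data=df, transaction='CUSTOMER', item='ITEM')   # column names passed explicitly
>>> rules = apl.result_.head(3).collect()                   # mined rules land in result_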
class hana_ml.algorithms.pal.association.FPGrowth(conn_context, min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

Bases: hana_ml.algorithms.pal.association._AssociationBase
FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
min_support : float, optional
User-specified minimum support, with valid range [0, 1].
Defaults to 0.
min_confidence : float, optional
User-specified minimum confidence, with valid range [0, 1].
Defaults to 0.
relational : bool, optional
Whether or not to apply relational logic in FPGrowth algorithm. If False, a single result table is produced; otherwise, the result table shall be split into three tables – antecedent, consequent and statistics.
Defaults to False.
min_lift : float, optional
User-specified minimum lift.
Defaults to 0.
max_conseq : int, optional
Maximum length of consequent items.
Defaults to 10.
max_len : int, optional
Maximum total length of antecedent items and consequent items in the output.
Defaults to 10.
ubiquitous : float, optional
Item sets whose support values are greater than this number will be ignored during frequent itemset mining.
Defaults to 1.0.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
timeout : int, optional
Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.
Defaults to 3600.
Examples
Input data for association rule mining:
>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
2       2     2
3       2     3
4       2     4
5       3     1
6       3     3
7       3     4
8       3     5
9       4     1
10      4     4
11      4     5
12      5     1
13      5     2
14      6     1
15      6     2
16      6     3
17      6     4
18      7     1
19      8     1
20      8     2
21      8     3
22      9     1
23      9     2
24      9     3
25     10     2
26     10     3
27     10     5
Set up parameters:
>>> fpg = FPGrowth(conn_context=conn,
...                min_support=0.2,
...                min_confidence=0.5,
...                relational=False,
...                min_lift=1.0,
...                max_conseq=1,
...                max_len=5,
...                ubiquitous=1.0,
...                thread_ratio=0,
...                timeout=3600)
Association rule mining using the FPGrowth algorithm for the input data, and check the results:
>>> fpg.fit(data=df, lhs_restrict=[1, 2, 3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000
FPGrowth algorithm set up using relational logic:
>>> fpgr = FPGrowth(conn_context=conn,
...                 min_support=0.2,
...                 min_confidence=0.5,
...                 relational=True,
...                 min_lift=1.0,
...                 max_conseq=1,
...                 max_len=5,
...                 ubiquitous=1.0,
...                 thread_ratio=0,
...                 timeout=3600)
Again mining association rules using FPGrowth algorithm for the input data, and check the resulting tables:
>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
3        3               1
4        3               2
5        4               1
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
Attributes
result_ : DataFrame
Mined association rules and related statistics, structured as follows:
- 1st column : antecedent (leading) items.
- 2nd column : consequent (dependent) items.
- 3rd column : support value.
- 4th column : confidence value.
- 5th column : lift value.
Available only when relational is False.

antec_ : DataFrame
Antecedent items of mined association rules, structured as follows:
- 1st column : association rule ID.
- 2nd column : antecedent items of the corresponding association rule.
Available only when relational is True.

conseq_ : DataFrame
Consequent items of mined association rules, structured as follows:
- 1st column : association rule ID.
- 2nd column : consequent items of the corresponding association rule.
Available only when relational is True.

stats_ : DataFrame
Statistics of the mined association rules, structured as follows:
- 1st column : rule ID.
- 2nd column : support value of the rule.
- 3rd column : confidence value of the rule.
- 4th column : lift value of the rule.
Available only when relational is True.

Methods
fit(data[, transaction, item, lhs_restrict, ...])
Association rule mining from the input data.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)
Association rule mining from the input data.
- Parameters
- Parameters
data : DataFrame
Input data for association rule mining.
transaction : str, optional
Name of the transaction column.
Defaults to the first column if not provided.
item : str, optional
Name of the item column.
Defaults to the last column if not provided.
lhs_restrict : list of int/str, optional
Specify items that are only allowed on the left-hand-side of association rules. Elements in the list should be the same type as the item column.
rhs_restrict : list of int/str, optional
Specify items that are only allowed on the right-hand-side of association rules. Elements in the list should be the same type as the item column.
lhs_complement_rhs : bool, optional
If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side. For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...
rhs_restrict = [i1, i2],
lhs_complement_rhs = True,
...

Defaults to False.
rhs_complement_lhs : bool, optional
If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand-side.
Defaults to False.
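A minimal sketch of the complement flags, assuming the fpg instance and df from the examples above (the restriction values are illustrative):

>>> fpg.fit(data=df,
...         rhs_restrict=[1, 2],       # items 1 and 2 may appear only as consequents
...         lhs_complement_rhs=True)   # every other item is then confined to the antecedent side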
class hana_ml.algorithms.pal.association.KORD(conn_context, k=None, measure=None, min_support=None, min_confidence=None, min_coverage=None, min_measure=None, max_antec=None, epsilon=None, use_epsilon=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase
K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
k : int, optional
The number of top rules to discover.
measure : str, optional
Specifies the measure used to define the priority of the association rules.
min_support : float, optional
User-specified minimum support value of association rule, with valid range [0, 1].
Defaults to 0 if not provided.
min_confidence : float, optional
User-specified minimum confidence value of association rule, with valid range [0, 1].
Defaults to 0 if not provided.
min_coverage : float, optional
User-specified minimum coverage value of association rule, with valid range [0, 1].
Defaults to the value of min_support if not provided.
min_measure : float, optional
User-specified minimum measure value (for leverage or lift, depending on the setting of measure).
Defaults to 0 if not provided.
epsilon : float, optional
User-specified epsilon value for punishing the length of rules.
Valid only when use_epsilon is True.
use_epsilon : bool, optional
Specifies whether or not to use epsilon to punish the length of rules.
Defaults to False.
Examples
First let us have a look at the training data:
>>> df.head(10).collect()
   CUSTOMER   ITEM
0         2  item2
1         2  item3
2         3  item1
3         3  item2
4         3  item4
5         4  item1
6         4  item3
7         5  item2
8         5  item3
9         6  item1
Set up a KORD instance:
>>> krd = KORD(conn_context=conn,
...            k=5,
...            measure='lift',
...            min_support=0.1,
...            min_confidence=0.2,
...            epsilon=0.1,
...            use_epsilon=False)
Start k-optimal rule discovery process from the input transaction data, and check the results:
>>> krd.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> krd.antec_.collect()
   RULE_ID ANTECEDENT_RULE
0        0           item2
1        1           item1
2        2           item2
3        2           item1
4        3           item5
5        4           item2
>>> krd.conseq_.collect()
   RULE_ID CONSEQUENT_RULE
0        0           item5
1        1           item5
2        2           item5
3        3           item1
4        4           item4
>>> krd.stats_.collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT  LEVERAGE   MEASURE
0        0  0.222222    0.285714  1.285714  0.049383  1.285714
1        1  0.222222    0.333333  1.500000  0.074074  1.500000
2        2  0.222222    0.500000  2.250000  0.123457  2.250000
3        3  0.222222    1.000000  1.500000  0.074074  1.500000
4        4  0.222222    0.285714  1.285714  0.049383  1.285714
Attributes
antec_ : DataFrame
Info of antecedent items for the mined association rules, structured as follows:
- 1st column : rule ID.
- 2nd column : antecedent items.

conseq_ : DataFrame
Info of consequent items for the mined association rules, structured as follows:
- 1st column : rule ID.
- 2nd column : consequent items.

stats_ : DataFrame
Some basic statistics for the mined association rules, structured as follows:
- 1st column : rule ID.
- 2nd column : support value of rules.
- 3rd column : confidence value of rules.
- 4th column : lift value of rules.
- 5th column : leverage value of rules.
- 6th column : measure value of rules.
Methods
fit(data[, transaction, item])
K-optimal rule discovery from input data, based on some user-specified measure.

fit(data, transaction=None, item=None)
K-optimal rule discovery from input data, based on some user-specified measure.
- Parameters
data : DataFrame
Input data for k-optimal (association) rule discovery.
transaction : str, optional
Column name of transaction ID in the input data.
Defaults to name of the 1st column if not provided.
item : str, optional
Column name of item ID (or items) in the input data.
Defaults to the name of the final column if not provided.
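Since min_measure is described for leverage or lift, measure='leverage' appears to be an accepted value; a minimal sketch under that assumption, reusing conn and df from the examples above (values are illustrative):

>>> krd_lev = KORD(conn_context=conn,
...                k=5,
...                measure='leverage',   # rank rules by leverage rather than lift (assumed valid value)
...                min_support=0.1,
...                min_confidence=0.2)
>>> krd_lev.fit(data=df, transaction='CUSTOMER', item='ITEM')
>>> stats = krd_lev.stats_.collect()    # MEASURE column then holds leverage values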
class hana_ml.algorithms.pal.association.SPM(conn_context, min_support, relational=None, max_len=None, min_len=None, max_len_out=None, min_len_out=None, ubiquitous=None, calc_lift=None, timeout=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase
The sequential pattern mining algorithm searches for frequent patterns in sequence databases.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
min_support : float
User-specified minimum support value.
relational : bool, optional
Whether or not to apply relational logic in sequential pattern mining. If False, a single result table for frequent pattern mining is produced; otherwise, the result table is split into two tables: one for mined patterns, and the other for statistics.
Defaults to False.
ubiquitous : float, optional
Items whose support values are above this specified value will be ignored during the frequent item mining phase.
Defaults to 1.0.
min_len : int, optional
Minimum number of items in a transaction.
Defaults to 1.
max_len : int, optional
Maximum number of items in a transaction.
Defaults to 10.
min_len_out : int, optional
Specifies the minimum number of items of the mined association rules in the result table.
Defaults to 1.
max_len_out : int, optional
Specifies the maximum number of items of the mined association rules in the result table.
Defaults to 10.
calc_lift : bool, optional
Whether or not to calculate lift values for all applicable cases. If False, lift values are only calculated for the cases where the last transaction contains a single item.
Defaults to False.
timeout : int, optional
Specifies the maximum run time in seconds. The algorithm will stop running when the specified timeout is reached.
Defaults to 3600.
Examples
First, take a look at the input data df:
>>> df.collect()
   CUSTID  TRANSID      ITEMS
0       A        1      Apple
1       A        1  Blueberry
2       A        2      Apple
3       A        2     Cherry
4       A        3    Dessert
5       B        1     Cherry
6       B        1  Blueberry
7       B        1      Apple
8       B        2    Dessert
9       B        3  Blueberry
10      C        1      Apple
11      C        2  Blueberry
12      C        3    Dessert
Set up a SPM instance:
>>> sp = SPM(conn_context=conn,
...          min_support=0.5,
...          relational=False,
...          ubiquitous=1.0,
...          max_len=10,
...          min_len=1,
...          calc_lift=True)
Start sequential pattern mining process from the input data, and check the results:
>>> sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')
>>> sp.result_.collect()
                         PATTERN   SUPPORT  CONFIDENCE      LIFT
0                        {Apple}  1.000000    0.000000  0.000000
1            {Apple},{Blueberry}  0.666667    0.666667  0.666667
2              {Apple},{Dessert}  1.000000    1.000000  1.000000
3              {Apple,Blueberry}  0.666667    0.000000  0.000000
4    {Apple,Blueberry},{Dessert}  0.666667    1.000000  1.000000
5                 {Apple,Cherry}  0.666667    0.000000  0.000000
6       {Apple,Cherry},{Dessert}  0.666667    1.000000  1.000000
7                    {Blueberry}  1.000000    0.000000  0.000000
8          {Blueberry},{Dessert}  1.000000    1.000000  1.000000
9                       {Cherry}  0.666667    0.000000  0.000000
10            {Cherry},{Dessert}  0.666667    1.000000  1.000000
11                     {Dessert}  1.000000    0.000000  0.000000
Attributes
result_ : DataFrame
The overall frequent pattern mining result, structured as follows:
- 1st column : mined frequent patterns.
- 2nd column : support values.
- 3rd column : confidence values.
- 4th column : lift values.
Available only when relational is False.

pattern_ : DataFrame
Result for mined frequent patterns, structured as follows:
- 1st column : pattern ID.
- 2nd column : transaction ID.
- 3rd column : items.

stats_ : DataFrame
Statistics for frequent pattern mining, structured as follows:
- 1st column : pattern ID.
- 2nd column : support values.
- 3rd column : confidence values.
- 4th column : lift values.
Methods
fit(data[, customer, transaction, item, ...])
Sequential pattern mining from input data.

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)
Sequential pattern mining from input data.
- Parameters
data : DataFrame
Input data for sequential pattern mining.
customer : str, optional
Column name of customer ID in the input data.
Defaults to name of the 1st column if not provided.
transaction : str, optional
Column name of transaction ID in the input data. Specially for sequential pattern mining, values of this column must reflect the sequence of occurrence as well.
Defaults to name of the 2nd column if not provided.
item : str, optional
Column name of item ID (or items) in the input data.
Defaults to the name of the final column if not provided.
item_restrict : list of int or str, optional
Specifies the list of items allowed in the mined association rule.
min_gap : int, optional
Specifies the minimum time difference between consecutive transactions in a sequence.
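A minimal sketch of restricting the mined items and enforcing a minimum gap, assuming the sp instance and df from the examples above (the values are illustrative):

>>> sp.fit(data=df,
...        customer='CUSTID',
...        transaction='TRANSID',
...        item='ITEMS',
...        item_restrict=['Apple', 'Dessert'],   # only patterns over these items are mined
...        min_gap=1)                            # consecutive transactions must differ by at least 1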
hana_ml.algorithms.pal.clustering¶
This module contains Python wrappers for PAL clustering algorithms.
The following classes are available:
class hana_ml.algorithms.pal.clustering.AffinityPropagation(conn_context, affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
affinity : {‘manhattan’, ‘standardized_euclidean’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}
Ways to compute the distance between two points.
No default value as it is mandatory.
n_clusters : int
Number of clusters.
0: does not adjust the Affinity Propagation cluster result.
Non-zero int: if the Affinity Propagation cluster number is bigger than n_clusters, PAL will merge the result so that the cluster number equals the value specified for n_clusters.
No default value as it is mandatory.
max_iter : int, optional
Maximum number of iterations.
Defaults to 500.
convergence_iter : int, optional
The algorithm terminates when the cluster result stays steady for the specified number of iterations.
Defaults to 100.
damping : float, optional
Controls the updating velocity. Value range: (0, 1).
Defaults to 0.9.
preference : float, optional
Determines the preference. Value range: [0,1].
Defaults to 0.5.
seed_ratio : float, optional
Selects a portion (seed_ratio * data_number) of the input data as seeds, where data_number is the row size of the input data. Value range: (0, 1]. If seed_ratio is 1, all the input data will be used as seeds.
Defaults to 1.
times : int, optional
The sampling times. Only valid when seed_ratio is less than 1.
Defaults to 1.
minkowski_power : int, optional
The power of the Minkowski method. Only valid when affinity is 'minkowski'.
Defaults to 3.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
Examples
Input dataframe df for clustering:
>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
2    3     0.10     0.11
3    4     0.11     0.11
4    5     0.12     0.11
5    6     0.11     0.12
6    7     0.12     0.12
7    8     0.12     0.13
8    9     0.13     0.12
9   10     0.13     0.13
10  11     0.13     0.14
11  12     0.14     0.13
12  13    10.10    10.10
13  14    10.11    10.10
14  15    10.10    10.11
15  16    10.11    10.11
16  17    10.11    10.12
17  18    10.12    10.11
18  19    10.12    10.12
19  20    10.12    10.13
20  21    10.13    10.12
21  22    10.13    10.13
22  23    10.13    10.14
23  24    10.14    10.13
Create AffinityPropagation instance:
>>> ap = AffinityPropagation(conn_context=conn,
...                          affinity='euclidean',
...                          n_clusters=0,
...                          max_iter=500,
...                          convergence_iter=100,
...                          damping=0.9,
...                          preference=0.5,
...                          seed_ratio=None,
...                          times=None,
...                          minkowski_power=None,
...                          thread_ratio=1)
Perform fit on the given data:
>>> ap.fit(data=df, key='ID')
Expected output:
>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
Attributes
labels_ : DataFrame
Label assigned to each sample, structured as follows:
- ID, record ID.
- CLUSTER_ID, the range is from 0 to n_clusters - 1.

Methods
fit(data, key[, features])
Fit the model when given the training dataset.

fit_predict(data, key[, features])
Fit with the dataset and return the labels.

fit(data, key, features=None)
Fit the model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
fit_predict(data, key, features=None)
Fit with the dataset and return the labels.
- Parameters
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Fit result, the label of each point, structured as follows:
- ID, record ID.
- CLUSTER_ID, the range is from 0 to n_clusters - 1.
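A minimal sketch of fit_predict, assuming the ap instance and df from the examples above:

>>> labels = ap.fit_predict(data=df, key='ID')   # fit and return the labels in one call
>>> head = labels.collect().head(3)              # ID plus assigned CLUSTER_ID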
class hana_ml.algorithms.pal.clustering.AgglomerateHierarchicalClustering(conn_context, n_clusters=None, affinity=None, linkage=None, thread_ratio=None, distance_dimension=None, normalization=None, category_weights=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

This algorithm is a widely used clustering method which can find natural groups within a set of data. The idea is to group the data into a hierarchy or a binary tree of subgroups. A hierarchical clustering can be either agglomerate or divisive, depending on the method of hierarchical decomposition. The implementation in PAL follows the agglomerate approach, which merges the clusters with a bottom-up strategy. Initially, each data point is considered its own cluster. The algorithm iteratively merges two clusters based on the dissimilarity measure in a greedy manner and forms a larger cluster.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
n_clusters : int, optional
Number of clusters after agglomerate hierarchical clustering algorithm. Value range: between 1 and the initial number of input data.
Defaults to 1.
affinity : {‘manhattan’,’euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘pearson correlation’, ‘squared euclidean’, ‘jaccard’, ‘gower’}, optional
Ways to compute the distance between two points.
Note
(1) For jaccard distance, non-zero input data will be treated as 1, and zero input data will be treated as 0. jaccard distance = (M01 + M10) / (M11 + M01 + M10)
(2) Only gower distance supports category attributes. When linkage is ‘centroid clustering’, ‘median clustering’, or ‘ward’, this parameter must be set to ‘squared euclidean’.
Defaults to squared euclidean.
linkage : { ‘nearest neighbor’, ‘furthest neighbor’, ‘group average’, ‘weighted average’, ‘centroid clustering’, ‘median clustering’, ‘ward’}, optional
Linkage type between two clusters.
‘nearest neighbor’ : single linkage.
‘furthest neighbor’ : complete linkage.
‘group average’ : UPGMA.
‘weighted average’ : WPGMA.
‘centroid clustering’.
‘median clustering’.
‘ward’.
Defaults to centroid clustering.
Note
For linkage ‘centroid clustering’, ‘median clustering’, or ‘ward’, the corresponding affinity must be set to ‘squared euclidean’.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
distance_dimension : float, optional
Distance dimension can be set if affinity is set to ‘minkowski’. The value should be no less than 1. Only valid when affinity is ‘minkowski’.
Defaults to 3.
normalization : {0, 1, 2, 3}, int, optional
Normalization type:
0: does nothing
1: Z score standardization
2: transforms to new range: -1 to 1
3: transforms to new range: 0 to 1
Defaults to 0.
category_weights : float, optional
Represents the weight of category columns.
Defaults to 1.
Examples
Input dataframe df for clustering:
>>> df.collect()
    POINT    X1    X2  X3
0       0   0.5   0.5   1
1       1   1.5   0.5   2
2       2   1.5   1.5   2
3       3   0.5   1.5   2
4       4   1.1   1.2   2
5       5   0.5  15.5   2
6       6   1.5  15.5   3
7       7   1.5  16.5   3
8       8   0.5  16.5   3
9       9   1.2  16.1   3
10     10  15.5  15.5   3
11     11  16.5  15.5   4
12     12  16.5  16.5   4
13     13  15.5  16.5   4
14     14  15.6  16.2   4
15     15  15.5   0.5   4
16     16  16.5   0.5   1
17     17  16.5   1.5   1
18     18  15.5   1.5   1
19     19  15.7   1.6   1
Create AgglomerateHierarchicalClustering instance:
>>> hc = AgglomerateHierarchicalClustering(conn_context=conn,
...                                        n_clusters=4,
...                                        affinity='Gower',
...                                        linkage='weighted average',
...                                        thread_ratio=None,
...                                        distance_dimension=3,
...                                        normalization=0,
...                                        category_weights=0.1)
Perform fit on the given data:
>>> hc.fit(data=df, key='POINT', categorical_variable=['X3'])
Expected output:
>>> hc.combine_process_.collect().head(3)
   STAGE  LEFT_POINT  RIGHT_POINT  DISTANCE
0      1          18           19    0.0187
1      2          13           14    0.0250
2      3           7            9    0.0437
>>> hc.labels_.collect().head(3)
   POINT  CLUSTER_ID
0      0           1
1      1           1
2      2           1
Attributes
combine_process_ : DataFrame
Structured as follows:
- 1st column: int, STAGE, cluster stage.
- 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named by its row number in the input data table. After combining, the new cluster is named after the left one.
- 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name. The other cluster to be combined in the same combine stage, named by its row number in the input data table.
- 4th column: float, DISTANCE. Distance between the two combined clusters.

labels_ : DataFrame
Label assigned to each sample, structured as follows:
- 1st column: ID, record ID.
- 2nd column: CLUSTER_ID, cluster number after applying the hierarchical agglomerate algorithm.
Methods
fit(data, key[, features, categorical_variable])
Fit the model when given the training dataset.

fit_predict(data, key[, features, ...])
Fit with the dataset and return the labels.

fit(data, key, features=None, categorical_variable=None)
Fit the model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.
No default value.
fit_predict(data, key, features=None, categorical_variable=None)
Fit with the dataset and return the labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. By default, 'VARCHAR' or 'NVARCHAR' is a category variable, and 'INTEGER' or 'DOUBLE' is a continuous variable.
No default value.
- Returns
DataFrame
Combine process, structured as follows:
- 1st column: int, STAGE, cluster stage.
- 2nd column: ID (in input table) data type, LEFT_ + ID (in input table) column name. One of the clusters to be combined in one combine stage, named by its row number in the input data table. After combining, the new cluster is named after the left one.
- 3rd column: ID (in input table) data type, RIGHT_ + ID (in input table) column name. The other cluster to be combined in the same combine stage, named by its row number in the input data table.
- 4th column: float, DISTANCE. Distance between the two combined clusters.
Label of each point, structured as follows:
- 1st column: ID (in input table) data type, ID, record ID.
- 2nd column: int, CLUSTER_ID, the range is from 0 to n_clusters - 1.
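A minimal sketch of fit_predict, assuming the hc instance and df from the examples above:

>>> res = hc.fit_predict(data=df, key='POINT', categorical_variable=['X3'])
>>> head = res.collect().head(3)   # output structure as described in the Returns section above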
class hana_ml.algorithms.pal.clustering.DBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
minpts : int, optional
The minimum number of points required to form a cluster.
Note
minpts and eps need to be provided together by the user, or both parameters are automatically determined.
eps : float, optional
The scan radius.
Note
minpts and eps need to be provided together by the user, or both parameters are automatically determined.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to heuristically determined.
metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘standardized_euclidean’, ‘cosine’}, optional
Ways to compute the distance between two points.
Defaults to ‘euclidean’.
minkowski_power : int, optional
When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.
Defaults to 3.
categorical_variable : str or list of str, optional
Specifies column(s) in the data that should be treated as categorical.
category_weights : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
algorithm : {‘brute-force’, ‘kd-tree’}, optional
Ways to search for neighbours.
Defaults to ‘kd-tree’.
save_model : bool, optional
If true, the generated model will be saved. save_model must be True to call predict().
Defaults to True.
Examples
Input dataframe df for clustering:
>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A
Create DBSCAN instance:
>>> dbscan = DBSCAN(conn_context=conn, thread_ratio=0.2, metric='manhattan')
Perform fit on the given data:
>>> dbscan.fit(data=df, key='ID')
Expected output:
>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes
labels_ : DataFrame
Label assigned to each sample.

model_ : DataFrame
Model content. Set to None if save_model is False.

Methods
fit(data, key[, features, categorical_variable])
Fit the DBSCAN model when given the training dataset.

fit_predict(data, key[, features, ...])
Fit with the dataset and return the labels.

predict(data, key[, features])
Assign clusters to data based on a fitted model.

fit(data, key, features=None, categorical_variable=None)
Fit the DBSCAN model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
fit_predict(data, key, features=None, categorical_variable=None)
Fit with the dataset and return the labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
- Returns
DataFrame
Fit result, structured as follows:
- ID column, with the same name and type as data's ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)
predict(data, key, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters
data : DataFrame
Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.
DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
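A minimal sketch of assigning new points to the fitted clusters (requires save_model=True at construction; new_df is a hypothetical DataFrame with the same column structure as df):

>>> matched = dbscan.predict(data=new_df, key='ID')   # new_df: hypothetical new points
>>> result = matched.collect()   # ID, CLUSTER_ID and DISTANCE, per the Returns description above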
class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase
This function is a geometry version of DBSCAN, which only accepts geometry points as input data. Currently it only accepts 2-D points.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
minpts : int, optional
The minimum number of points required to form a cluster.
Note
minpts and eps need to be provided together by the user, or both parameters are automatically determined.
eps : float, optional
The scan radius.
Note
minpts and eps need to be provided together by the user, or both parameters are automatically determined.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to heuristically determined.
metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Ways to compute the distance between two points.
Defaults to 'euclidean'.
minkowski_power : int, optional
When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.
Defaults to 3.
algorithm : {‘brute-force’, ‘kd-tree’}, optional
Ways to search for neighbours.
Defaults to ‘kd-tree’.
save_model : bool, optional
If true, the generated model will be saved. save_model must be True to call predict().
Defaults to True.
Examples
In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:
>>> CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL ( "ID" INTEGER, "POINT" ST_GEOMETRY );
Then, input dataframe df for clustering:
>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")
Create GeometryDBSCAN instance:
>>> geo_dbscan = GeometryDBSCAN(conn_context=conn, thread_ratio=0.2, metric='manhattan')
Perform fit on the given data:
>>> geo_dbscan.fit(data=df, key='ID')
Expected output:
>>> geo_dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
>>> geo_dbscan.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...
Perform fit_predict on the given data:
>>> result = geo_dbscan.fit_predict(df, key='ID')
Expected output:
>>> result.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
Attributes
labels_ : DataFrame
Label assigned to each sample.

model_ : DataFrame
Model content. Set to None if save_model is False.

Methods
fit(data, key[, features])
Fit the Geometry DBSCAN model when given the training dataset.

fit_predict(data, key[, features])
Fit with the dataset and return the labels.

fit(data, key, features=None)
Fit the Geometry DBSCAN model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data. The structure is as follows.
1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID.
2nd column: ST_GEOMETRY, 2-D geometry point.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
fit_predict(data, key, features=None)
Fit with the dataset and return the labels.
- Parameters
data : DataFrame
DataFrame containing the data. The structure is as follows.
1st column: ID, INTEGER, BIGINT, VARCHAR, or NVARCHAR. Data ID
2nd column: ST_GEOMETRY, 2-D geometry point.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Fit result, structured as follows:
- ID column, with the same name and type as data's ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)
class hana_ml.algorithms.pal.clustering.KMeans(conn_context, n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin
K-Means model that handles clustering problems.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
n_clusters : int, optional
Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.
n_clusters_min : int, optional
Cluster range minimum.
n_clusters_max : int, optional
Cluster range maximum.
init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
‘first_k’: First k observations.
‘replace’: Random with replacement.
‘no_replace’: Random without replacement.
‘patent’: Patent of selecting the init center (US 6,882,998 B1).
Defaults to ‘patent’.
max_iter : int, optional
Max iterations.
Defaults to 100.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’} str, optional
Ways to compute the distance between the item and the cluster center. 'cosine' is only valid when accelerated is False.
Defaults to 'euclidean'.
minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
category_weights : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type.
‘no’: No normalization will be applied.
‘l1_norm’: Yes, for each point X (x1, x2, …, xn), the normalized value will be X’(x1 /S,x2 /S,…,xn /S), where S = |x1|+|x2|+…|xn|.
‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to ‘no’.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
tol : float, optional
Convergence threshold for exiting iterations. Only valid when accelerated is False.
Defaults to 1.0e-6.
memory_mode : {‘auto’, ‘optimize-speed’, ‘optimize-space’}, optional
Indicates the memory mode that the algorithm uses.
‘auto’: Chosen by algorithm.
‘optimize-speed’: Prioritizes speed.
‘optimize-space’: Prioritizes memory.
Only valid when accelerated is True.
Defaults to 'auto'.
accelerated : bool, optional
Indicates whether to use technology like cache to accelerate the calculation process. If True, the calculation process will be accelerated. If False, the calculation process will not be accelerated.
Defaults to False.
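When n_clusters is not fixed in advance, the range parameters can be supplied instead; a minimal sketch (assuming a ConnectionContext conn; the bounds are illustrative):

>>> km_auto = clustering.KMeans(conn_context=conn,
...                             n_clusters_min=2,   # lower bound of the cluster-number search
...                             n_clusters_max=6,   # upper bound of the cluster-number search
...                             init='first_k',
...                             max_iter=100)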
Examples
Input dataframe df for K Means:
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Create KMeans instance:
>>> km = clustering.KMeans(conn_context=conn, n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='Euclidean',
...                        category_weights=0.5)
Perform fit_predict:
>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679
Input dataframe df for Accelerated K-Means :
>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1
Create accelerated KMeans instance:
>>> akm = clustering.KMeans(conn_context=conn, init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)
Perform fit_predict:
>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes
labels_ : DataFrame
Label assigned to each sample.

cluster_centers_ : DataFrame
Coordinates of cluster centers.

model_ : DataFrame
Model content.

statistics_ : DataFrame
Statistic values.
Methods
fit(data, key[, features, categorical_variable])
Fit the model when given the training dataset.

fit_predict(data, key[, features, ...])
Fit with the dataset and return the labels.

predict(data, key[, features])
Assign clusters to data based on a fitted model.

fit(data, key, features=None, categorical_variable=None)
Fit the model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
fit_predict(data, key, features=None, categorical_variable=None)
Fit with the dataset and return the labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
- Returns
DataFrame
Fit result, structured as follows:
- ID column, with the same name and type as data's ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
- DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
- SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).
predict(data, key, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters
data : DataFrame
Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
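A minimal sketch of scoring unseen points with the fitted km instance from the examples above (new_df is a hypothetical DataFrame whose column structure matches the training data):

>>> assigned = km.predict(data=new_df, key='ID')   # new_df: hypothetical new points
>>> result = assigned.collect()                    # ID, CLUSTER_ID, DISTANCE as described above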
class hana_ml.algorithms.pal.clustering.KMedians(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase
K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
n_clusters : int
Number of groups.
init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
‘first_k’: First k observations.
‘replace’: Random with replacement.
‘no_replace’: Random without replacement.
‘patent’: Patented method for selecting the initial centers (US 6,882,998 B1).
Defaults to ‘patent’.
max_iter : int, optional
Max iterations.
Defaults to 100.
tol : float, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}, optional
Ways to compute the distance between the item and the cluster center.
Defaults to ‘euclidean’.
minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is ‘minkowski’.
Defaults to 3.0.
category_weights : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type (see the sketch after this parameter list):
‘no’: normalization is not applied.
‘l1_norm’: for each point X (x1, x2, …, xn), the normalized value is X’(x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.
‘min_max’: for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to ‘no’.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
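To make the two normalization modes concrete, a small pure-Python sketch of the formulas given above (an illustration of the documented arithmetic only, not PAL's internal implementation):
>>> x = [3.0, -1.0, 2.0]
>>> s = sum(abs(v) for v in x)          # l1_norm: S = |x1|+|x2|+...+|xn|
>>> [v / s for v in x]                  # X' = (x1/S, x2/S, ..., xn/S)
[0.5, -0.16666666666666666, 0.3333333333333333]
>>> c = [2.0, 4.0, 10.0]
>>> lo, hi = min(c), max(c)
>>> [(v - lo) / (hi - lo) for v in c]   # min_max: C[i] = (C[i]-min)/(max-min)
[0.0, 0.25, 1.0]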
Examples
Input dataframe df1 for clustering:
>>> df1.collect() ID V000 V001 V002 0 0 0.5 A 0.5 1 1 1.5 A 0.5 2 2 1.5 A 1.5 3 3 0.5 A 1.5 4 4 1.1 B 1.2 5 5 0.5 B 15.5 6 6 1.5 B 15.5 7 7 1.5 B 16.5 8 8 0.5 B 16.5 9 9 1.2 C 16.1 10 10 15.5 C 15.5 11 11 16.5 C 15.5 12 12 16.5 C 16.5 13 13 15.5 C 16.5 14 14 15.6 D 16.2 15 15 15.5 D 0.5 16 16 16.5 D 0.5 17 17 16.5 D 1.5 18 18 15.5 D 1.5 19 19 15.7 A 1.6
Creating KMedians instance:
>>> kmedians = KMedians(conn_context=conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() on given dataframe:
>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1
Performing fit_predict() on given dataframe:
>>> kmedians.fit_predict(data=df1, key='ID').collect() ID CLUSTER_ID DISTANCE 0 0 0 0.921954 1 1 0 0.806226 2 2 0 0.500000 3 3 0 0.670820 4 4 0 0.707107 5 5 3 0.921954 6 6 3 0.670820 7 7 3 0.500000 8 8 3 0.806226 9 9 3 0.707107 10 10 2 0.707107 11 11 2 1.140175 12 12 2 0.948683 13 13 2 0.316228 14 14 2 0.707107 15 15 1 1.019804 16 16 1 1.280625 17 17 1 0.800000 18 18 1 0.200000 19 19 1 0.807107
Attributes
cluster_centers_
(DataFrame) Coordinates of cluster centers.
labels_
(DataFrame) Cluster assignment and distance to cluster center for each point.
Methods
fit(data, key[, features, categorical_variable])
Perform clustering on the input dataset.
fit_predict(data, key[, features, …])
Perform the clustering algorithm and return the labels.
-
fit(data, key, features=None, categorical_variable=None)¶
Perform clustering on the input dataset.
- Parameters
data : DataFrame
DataFrame containing the input data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
fit_predict(data, key, features=None, categorical_variable=None)¶
Perform the clustering algorithm and return the labels.
- Parameters
data : DataFrame
DataFrame containing input data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
- Returns
DataFrame
Fit result, structured as follows:
ID column, with the same name and type as data's ID column.
CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
-
class hana_ml.algorithms.pal.clustering.KMedoids(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)¶
Bases: hana_ml.algorithms.pal.clustering._KClusteringBase
K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids (actual observations) as cluster centers, which makes it more robust to noise and outliers.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
n_clusters : int
Number of groups.
init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
‘first_k’: First k observations.
‘replace’: Random with replacement.
‘no_replace’: Random without replacement.
‘patent’: Patented method for selecting the initial centers (US 6,882,998 B1).
Defaults to ‘patent’.
max_iter : int, optional
Max iterations.
Defaults to 100.
tol : float, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
distance_level : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’}, optional
Ways to compute the distance between the item and the cluster center.
Defaults to ‘euclidean’.
minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is ‘minkowski’.
Defaults to 3.0.
category_weights : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type.
‘no’: normalization is not applied.
‘l1_norm’: for each point X (x1, x2, …, xn), the normalized value is X’(x1/S, x2/S, …, xn/S), where S = |x1|+|x2|+…+|xn|.
‘min_max’: for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to ‘no’.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
Examples
Input dataframe df1 for clustering:
>>> df1.collect() ID V000 V001 V002 0 0 0.5 A 0.5 1 1 1.5 A 0.5 2 2 1.5 A 1.5 3 3 0.5 A 1.5 4 4 1.1 B 1.2 5 5 0.5 B 15.5 6 6 1.5 B 15.5 7 7 1.5 B 16.5 8 8 0.5 B 16.5 9 9 1.2 C 16.1 10 10 15.5 C 15.5 11 11 16.5 C 15.5 12 12 16.5 C 16.5 13 13 15.5 C 16.5 14 14 15.6 D 16.2 15 15 15.5 D 0.5 16 16 16.5 D 0.5 17 17 16.5 D 1.5 18 18 15.5 D 1.5 19 19 15.7 A 1.6
Creating KMedoids instance:
>>> kmedoids = KMedoids(conn_context=conn, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() on given dataframe:
>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5
Performing fit_predict() on given dataframe:
>>> kmedoids.fit_predict(data=df1, key='ID').collect() ID CLUSTER_ID DISTANCE 0 0 0 1.414214 1 1 0 1.000000 2 2 0 0.000000 3 3 0 1.000000 4 4 0 1.207107 5 5 3 1.414214 6 6 3 1.000000 7 7 3 0.000000 8 8 3 1.000000 9 9 3 1.207107 10 10 2 1.000000 11 11 2 1.414214 12 12 2 1.000000 13 13 2 0.000000 14 14 2 1.023335 15 15 1 1.000000 16 16 1 1.414214 17 17 1 1.000000 18 18 1 0.000000 19 19 1 0.930714
Attributes
cluster_centers_
(DataFrame) Coordinates of cluster centers.
labels_
(DataFrame) Cluster assignment and distance to cluster center for each point.
Methods
fit(data, key[, features, categorical_variable])
Perform clustering on the input dataset.
fit_predict(data, key[, features, …])
Perform the clustering algorithm and return the labels.
-
fit(data, key, features=None, categorical_variable=None)¶
Perform clustering on the input dataset.
- Parameters
data : DataFrame
DataFrame containing the input data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
fit_predict(data, key, features=None, categorical_variable=None)¶
Perform the clustering algorithm and return the labels.
- Parameters
data : DataFrame
DataFrame containing input data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
- Returns
DataFrame
Fit result, structured as follows:
ID column, with the same name and type as data's ID column.
CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
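A small client-side sketch, assuming the kmedoids instance fitted in the example above and that labels_ carries the same CLUSTER_ID column as the fit_predict() output; collect() returns a pandas DataFrame, so cluster sizes fall out of an ordinary groupby:
>>> kmedoids.labels_.collect().groupby('CLUSTER_ID').size()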
hana_ml.algorithms.pal.crf¶
This module contains the Python wrapper for the PAL conditional random field (CRF) algorithm.
The following class is available:
-
class hana_ml.algorithms.pal.crf.CRF(conn_context, lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
epsilon : float, optional
Convergence tolerance of the optimization algorithm.
Defaults to 1e-4.
lamb : float, optional
Regularization weight, should be greater than 0.
Defaults to 1.0.
max_iter : int, optional
Maximum number of iterations in optimization.
Defaults to 1000.
lbfgs_m : int, optional
Number of memories to be stored in the L-BFGS optimization algorithm.
Defaults to 25.
use_class_feature : bool, optional
Whether to include a feature for the class/label. This is the same as having a bias vector in a model.
Defaults to True.
use_word : bool, optional
If True, includes a feature for the current word.
Defaults to True.
use_ngrams : bool, optional
Whether to make features from letter n-grams, i.e. substrings of the word.
Defaults to True.
mid_ngrams : bool, optional
Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
Defaults to False.
max_ngram_length : int, optional
Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.
use_prev : bool, optional
Whether or not to include a feature for the previous word together with the current word; combined with other options, this also enables other previous-word features.
Defaults to True.
use_next : bool, optional
Whether or not to include a feature for the next word together with the current word.
Defaults to True.
disjunction_width : int, optional
Defines the width for disjunctions of words; see use_disjunctive.
Defaults to 4.
use_disjunctive : bool, optional
Whether or not to include features giving disjunctions of words anywhere in the left or right disjunction_width words.
Defaults to True.
use_seqs : bool, optional
Whether or not to use any class combination features.
Defaults to True.
use_prev_seqs : bool, optional
Whether or not to use any class combination features using the previous class.
Defaults to True.
use_type_seqs : bool, optional
Whether or not to use basic zeroth-order word shape features.
Defaults to True.
use_type_seqs2 : bool, optional
Whether or not to add additional first and second order word shape features.
Defaults to True.
use_type_yseqs : bool, optional
Whether or not to use some first order word shape patterns.
Defaults to True.
word_shape : int, optional
Word shape feature, e.g. whether a word is capitalized or numeric. Currently only chris2UseLC is supported; a value of 0 disables word shape features.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by the fit (i.e. training) function. The range of this parameter is from 0 to 1, where 0 means only using a single thread, and 1 means using at most all currently available threads. Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.
Defaults to 1.0.
Examples
Input data for training:
>>> df.head(10).collect() DOC_ID WORD_POSITION WORD LABEL 0 1 1 RECORD O 1 1 2 #497321 O 2 1 3 78554939 O 3 1 4 | O 4 1 5 LRH O 5 1 6 | O 6 1 7 62413233 O 7 1 8 | O 8 1 9 | O 9 1 10 7368393 O
Set up an instance of CRF model, and fit it on the training data:
>>> crf = CRF(conn_context=cc,
...           lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")
Check the trained CRF model and related statistics:
>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
          STAT_NAME           STAT_VALUE
0               obj  0.44251900977373015
1              iter                   22
2   solution status            Converged
3       numSentence                    2
4           numWord                   92
5       numFeatures                  963
6            iter 1          obj=26.6557
7            iter 2          obj=14.8484
8            iter 3          obj=5.36967
9            iter 4           obj=2.4382
Input data for predicting labels using the trained CRF model:
>>> df_pred.head(10).collect() DOC_ID WORD_POSITION WORD 0 2 1 GENERAL 1 2 2 PHYSICAL 2 2 3 EXAMINATION 3 2 4 : 4 2 5 VITAL 5 2 6 SIGNS 6 2 7 : 7 2 8 Blood 8 2 9 pressure 9 2 10 86g52
Do the prediction:
>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)
Check the prediction result:
>>> res.head(10).collect()
Attributes
model_
(DataFrame) CRF model content.
stats_
(DataFrame) Statistic info for CRF model fitting, structured as follows: - 1st column: name of the statistics, type NVARCHAR(100). - 2nd column: the corresponding statistics value, type NVARCHAR(1000).
optimal_param_
(DataFrame) Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (reserved for future use).
Methods
fit(data[, doc_id, word_pos, word, label])
Train the CRF model on English text.
predict(data[, doc_id, word_pos, word, …])
Predict text labels based on the trained CRF model.
-
fit(data, doc_id=None, word_pos=None, word=None, label=None)¶
Train the CRF model on English text.
- Parameters
data : DataFrame
Input data for training/fitting the CRF model. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.
doc_id : str, optional
Name of the column for document ID.
Defaults to the first column of the input data.
word_pos : str, optional
Name of the column for word position.
Defaults to the second column of the input data.
word : str, optional
Name of the column for word.
Defaults to the third column of the input data.
label : str, optional
Name of the label column.
Defaults to the final column of the input data.
-
predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)¶
Predict text labels based on the trained CRF model.
- Parameters
data : DataFrame
Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.
doc_id : str, optional
Name of the column for document ID.
Defaults to the first column of the input data.
word_pos : str, optional
Name of the column for word position.
Defaults to the second column of the input data.
word : str, optional
Name of the column for word.
Defaults to the third column of the input data.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by predict function. The range of this parameter is from 0 to 1. 0 means only using a single thread, and 1 means using at most all available threads currently. Values outside this range are ignored, and predict function heuristically determines the number of threads to use.
Defaults to 1.0.
- Returns
DataFrame
Prediction result for the input data, structured as follows:
1st column: document ID,
2nd column: word position,
3rd column: label.
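Because the documentation fixes only the column order of the prediction result (document ID, word position, label) and not the column names, a hedged pandas-side sketch for re-attaching predicted labels to the input words resolves the result's first two columns positionally:
>>> pred = res.collect()
>>> words = df_pred.collect()
>>> words.merge(pred, left_on=['DOC_ID', 'WORD_POSITION'],
...             right_on=list(pred.columns[:2])).head(10)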
hana_ml.algorithms.pal.decomposition¶
This module contains Python wrappers for PAL decomposition algorithms.
The following classes are available:
-
class hana_ml.algorithms.pal.decomposition.PCA(conn_context, scaling=None, thread_ratio=None, scores=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Principal component analysis (PCA) reduces the dimensionality of multivariate data by means of singular value decomposition.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
No default value.
scaling : bool, optional
If true, scale variables to have unit variance before the analysis takes place.
Defaults to False.
scores : bool, optional
If true, output the scores on each principal component when fitting.
Defaults to False.
Examples
Input DataFrame df1 for training:
>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
Creating a PCA instance:
>>> pca = PCA(conn_context=conn, scaling=True, thread_ratio=0.5, scores=True)
Performing fit on given dataframe:
>>> pca.fit(data=df1, key='ID')
Output:
>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398
Input dataframe df2 for transforming:
>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0
Performing transform() on given dataframe:
>>> result = pca.transform(data=df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
Attributes
loadings_
(DataFrame) The weights by which each standardized original variable should be multiplied when computing component scores.
loadings_stat_
(DataFrame) Loadings statistics on each component.
scores_
(DataFrame) The transformed variable values corresponding to each data point. Set to None if scores is False.
scaling_stat_
(DataFrame) Mean and scale values of each variable. Note: variables cannot be scaled if any variable has a constant value across all data items.
Methods
fit(data, key[, features, label])
Principal component analysis function.
fit_transform(data, key[, features, label])
Fit with the dataset and return the scores.
transform(data, key[, features, …])
Principal component analysis projection function using a trained model.
-
fit(data, key, features=None, label=None)¶
Principal component analysis function.
- Parameters
data : DataFrame
Data to be fitted.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
label : str, optional
Label of data.
-
fit_transform(data, key, features=None, label=None)¶
Fit with the dataset and return the scores.
- Parameters
data : DataFrame
Data to be analyzed.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
label : str, optional
Label of data.
- Returns
DataFrame
Transformed variable values corresponding to each data point, structured as follows:
ID column, with same name and type as data's ID column.
SCORE columns, type DOUBLE, representing the component score values of each data point.
-
transform(data, key, features=None, n_components=None, label=None)¶
Principal component analysis projection function using a trained model.
- Parameters
data : DataFrame
Data to be analyzed.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
n_components : int, optional
Number of components to be retained. The value range is from 1 to number of features.
Defaults to number of features.
label : str, optional
Label of data.
- Returns
DataFrame
Transformed variable values corresponding to each data point, structured as follows:
ID column, with same name and type as data's ID column.
SCORE columns, type DOUBLE, representing the component score values of each data point.
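A hedged sketch for picking n_components from the fitted loadings_stat_ attribute: since CUM_VAR_PROP is cumulative, the smallest number of components reaching a variance threshold (0.95 here, purely illustrative) can be counted client-side and passed to transform():
>>> stats = pca.loadings_stat_.collect()
>>> k = int((stats['CUM_VAR_PROP'] < 0.95).sum()) + 1  # smallest k with cumulative variance >= 0.95
>>> pca.transform(data=df2, key='ID', n_components=k).collect()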
-
class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(conn_context, n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
- Parameters
conn_context : ConnectionContext
The connection to the SAP HANA system.
n_components : int
Expected number of topics in the corpus.
doc_topic_prior : float, optional
Specifies the prior weight related to document-topic distribution.
Defaults to 50/n_components.
topic_word_prior : float, optional
Specifies the prior weight related to topic-word distribution.
Defaults to 0.1.
burn_in : int, optional
Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
thin : int, optional
Number of omitted in-between Gibbs iterations. Value must be greater than 0.
Defaults to 1.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
max_top_words : int, optional
Specifies the maximum number of words to be output for each topic.
Defaults to 0.
threshold_top_words : float, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold. It cannot be used together with the parameter max_top_words.
gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
‘uniform’: Assign each word in each document a topic by uniform distribution.
‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to ‘uniform’.
delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.
Defaults to [‘ ‘].
output_word_assignment : bool, optional
Controls whether or not to output word_topic_assignment_. If True, word_topic_assignment_ is output.
Defaults to False.
Examples
Input dataframe df1 for training:
>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Creating a LDA instance:
>>> lda = LatentDirichletAllocation(cc, n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1, max_top_words=5,
...                                 doc_topic_prior=0.1, output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])
Performing fit() on given dataframe:
>>> lda.fit(data=df1, key='DOCUMENT_ID', document='TEXT')
Output:
>>> lda.doc_topic_dist_.collect() DOCUMENT_ID TOPIC_ID PROBABILITY 0 10 0 0.010417 1 10 1 0.010417 2 10 2 0.010417 3 10 3 0.010417 4 10 4 0.947917 5 10 5 0.010417 6 20 0 0.009434 7 20 1 0.009434 8 20 2 0.009434 9 20 3 0.952830 10 20 4 0.009434 11 20 5 0.009434 12 30 0 0.103774 13 30 1 0.858491 14 30 2 0.009434 15 30 3 0.009434 16 30 4 0.009434 17 30 5 0.009434 18 40 0 0.009434 19 40 1 0.009434 20 40 2 0.952830 21 40 3 0.009434 22 40 4 0.009434 23 40 5 0.009434
>>> lda.word_topic_assignment_.collect() DOCUMENT_ID WORD_ID TOPIC_ID 0 10 0 4 1 10 1 4 2 10 2 4 3 10 0 4 4 10 3 4 5 10 4 4 6 10 0 4 7 10 5 4 8 10 5 4 9 20 6 3 10 20 7 3 11 20 8 3 12 20 9 3 13 20 10 3 14 20 7 3 15 20 11 3 16 20 6 3 17 20 7 3 18 20 7 3 19 30 12 1 20 30 13 1 21 30 14 1 22 30 13 1 23 30 13 1 24 30 15 0 25 30 13 1 26 30 14 1 27 30 13 1 28 30 12 1 29 40 16 2 30 40 16 2 31 40 16 2 32 40 17 2 33 40 16 2 34 40 18 2 35 40 19 2 36 40 19 2 37 40 20 2 38 40 16 2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                        WORDS
0         0     spoon strollers tires graphiccard valve
1         1        toy strollers carseat graphiccard cpu
2         2               sweaters vest shoe rings boots
3         3   mountainbike tires rearfender helmet valve
4         4     cpu memory graphiccard keyboard harddisk
5         5        strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect() TOPIC_ID WORD_ID PROBABILITY 0 0 0 0.050000 1 0 1 0.050000 2 0 2 0.050000 3 0 3 0.050000 4 0 4 0.050000 5 0 5 0.050000 6 0 6 0.050000 7 0 7 0.050000 8 0 8 0.550000 9 0 9 0.050000 10 1 0 0.050000 11 1 1 0.050000 12 1 2 0.050000 13 1 3 0.050000 14 1 4 0.050000 15 1 5 0.050000 16 1 6 0.050000 17 1 7 0.050000 18 1 8 0.050000 19 1 9 0.550000 20 2 0 0.025000 21 2 1 0.025000 22 2 2 0.525000 23 2 3 0.025000 24 2 4 0.025000 25 2 5 0.025000 26 2 6 0.025000 27 2 7 0.275000 28 2 8 0.025000 29 2 9 0.025000 30 3 0 0.014286 31 3 1 0.014286 32 3 2 0.014286 33 3 3 0.585714 34 3 4 0.157143 35 3 5 0.014286 36 3 6 0.157143 37 3 7 0.014286 38 3 8 0.014286 39 3 9 0.014286
>>> lda.dictionary_.collect() WORD_ID WORD 0 17 boots 1 12 carseat 2 0 cpu 3 2 graphiccard 4 1 harddisk 5 10 helmet 6 4 keyboard 7 5 memory 8 3 monitor 9 7 mountainbike 10 11 rearfender 11 18 rings 12 20 shoe 13 15 spoon 14 14 strollers 15 16 sweaters 16 6 tires 17 13 toy 18 9 valve 19 19 vest 20 8 wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Dataframe df2 to transform:
>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Performing transform on the given dataframe:
>>> res = lda.transform(data=df2, key='DOCUMENT_ID', document='TEXT',
...                     burn_in=2000, thin=100, iteration=1000, seed=1,
...                     output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes
doc_topic_dist_
(DataFrame) Document-topic distribution table, structured as follows: - Document ID column, with same name and type as data's document ID column from fit(). - TOPIC_ID, type INTEGER, topic ID. - PROBABILITY, type DOUBLE, probability of topic given document.
word_topic_assignment_
(DataFrame) Word-topic assignment table, structured as follows: - Document ID column, with same name and type as data's document ID column from fit(). - WORD_ID, type INTEGER, word ID. - TOPIC_ID, type INTEGER, topic ID. Set to None if output_word_assignment is set to False.
topic_top_words_
(DataFrame) Topic top words table, structured as follows: - TOPIC_ID, type INTEGER, topic ID. - WORDS, type NVARCHAR(5000), topic top words separated by spaces. Set to None if neither max_top_words nor threshold_top_words is provided.
topic_word_dist_
(DataFrame) Topic-word distribution table, structured as follows: - TOPIC_ID, type INTEGER, topic ID. - WORD_ID, type INTEGER, word ID. - PROBABILITY, type DOUBLE, probability of word given topic.
dictionary_
(DataFrame) Dictionary table, structured as follows: - WORD_ID, type INTEGER, word ID. - WORD, type NVARCHAR(5000), word text.
statistic_
(DataFrame) Statistics table, structured as follows: - STAT_NAME, type NVARCHAR(256), statistic name. - STAT_VALUE, type NVARCHAR(1000), statistic value.
Note: parameters max_top_words and threshold_top_words cannot be used together. Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() take precedence over the corresponding ones in __init__().
Methods
fit(data, key[, document])
Fit LDA model based on training data.
fit_transform(data, key[, document])
Fit LDA model based on training data and return the topic assignment for the training documents.
transform(data, key[, document, burn_in, …])
Transform the topic assignment for new documents based on the previous LDA estimation results.
-
fit(data, key, document=None)¶
Fit LDA model based on training data.
- Parameters
data : DataFrame
Training data.
key : str
Name of the document ID column.
document : str, optional
Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
-
fit_transform(data, key, document=None)¶
Fit LDA model based on training data and return the topic assignment for the training documents.
- Parameters
data : DataFrame
Training data.
key : str
Name of the document ID column.
document : str, optional
Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- Returns
DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
-
transform(data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)¶
Transform the topic assignment for new documents based on the previous LDA estimation results.
- Parameters
data : DataFrame
Independent variable values used for transform.
key : str
Name of the document ID column.
document : str, optional
Name of the document text column.
If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
burn_in : int, optional
Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded.
Defaults to 0.
iteration : int, optional
Number of Gibbs iterations.
Defaults to 2000.
thin : int, optional
Number of omitted in-between Gibbs iterations.
Defaults to 1.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
‘uniform’: Assign each word in each document a topic by uniform distribution.
‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to ‘uniform’.
delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long.
Defaults to [‘ ‘].
output_word_assignment : bool, optional
Controls whether or not to output the word-topic assignment table. If True, it is output.
Defaults to False.
- Returns
DataFrame
Document-topic distribution table, structured as follows:
Document ID column, with same name and type as data's document ID column.
TOPIC_ID, type INTEGER, topic ID.
PROBABILITY, type DOUBLE, probability of topic given document.
Word-topic assignment table, structured as follows:
Document ID column, with same name and type as data's document ID column.
WORD_ID, type INTEGER, word ID.
TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
Statistics table, structured as follows:
STAT_NAME, type NVARCHAR(256), statistic name.
STAT_VALUE, type NVARCHAR(1000), statistic value.
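A hedged client-side sketch combining two documented attributes: topic_word_dist_ stores probabilities by WORD_ID, and dictionary_ maps WORD_ID back to words, so the most probable words per topic can be recovered with a pandas merge after collecting both:
>>> dist = lda.topic_word_dist_.collect().merge(lda.dictionary_.collect(), on='WORD_ID')
>>> (dist.sort_values(['TOPIC_ID', 'PROBABILITY'], ascending=[True, False])
...      .groupby('TOPIC_ID').head(3))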
hana_ml.algorithms.pal.discriminant_analysis¶
This module contains the PAL wrapper for the discriminant analysis algorithm. The following class is available:
-
class hana_ml.algorithms.pal.discriminant_analysis.LinearDiscriminantAnalysis(conn_context, regularization_type=None, regularization_amount=None, projection=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Linear discriminant analysis for classification and data reduction.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
regularization_type : {‘mixing’, ‘diag’, ‘pseudo’}, optional
The strategy for handling ill-conditioning or rank-deficiency of the empirical covariance matrix.
Defaults to ‘mixing’.
regularization_amount : float, optional
The convex mixing weight assigned to the diagonal matrix obtained from the diagonal of the empirical covariance matrix. The valid range for this parameter is [0, 1]. Valid only when regularization_type is ‘mixing’.
Defaults to the smallest number in [0, 1] that makes the regularized empirical covariance matrix invertible.
projection : bool, optional
Whether or not to compute the projection model.
Defaults to True.
Examples
The training data for linear discriminant analysis:
>>> df.collect() X1 X2 X3 X4 CLASS 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa 5 5.4 3.9 1.7 0.4 Iris-setosa 6 4.6 3.4 1.4 0.3 Iris-setosa 7 5.0 3.4 1.5 0.2 Iris-setosa 8 4.4 2.9 1.4 0.2 Iris-setosa 9 4.9 3.1 1.5 0.1 Iris-setosa 10 7.0 3.2 4.7 1.4 Iris-versicolor 11 6.4 3.2 4.5 1.5 Iris-versicolor 12 6.9 3.1 4.9 1.5 Iris-versicolor 13 5.5 2.3 4.0 1.3 Iris-versicolor 14 6.5 2.8 4.6 1.5 Iris-versicolor 15 5.7 2.8 4.5 1.3 Iris-versicolor 16 6.3 3.3 4.7 1.6 Iris-versicolor 17 4.9 2.4 3.3 1.0 Iris-versicolor 18 6.6 2.9 4.6 1.3 Iris-versicolor 19 5.2 2.7 3.9 1.4 Iris-versicolor 20 6.3 3.3 6.0 2.5 Iris-virginica 21 5.8 2.7 5.1 1.9 Iris-virginica 22 7.1 3.0 5.9 2.1 Iris-virginica 23 6.3 2.9 5.6 1.8 Iris-virginica 24 6.5 3.0 5.8 2.2 Iris-virginica 25 7.6 3.0 6.6 2.1 Iris-virginica 26 4.9 2.5 4.5 1.7 Iris-virginica 27 7.3 2.9 6.3 1.8 Iris-virginica 28 6.7 2.5 5.8 1.8 Iris-virginica 29 7.2 3.6 6.1 2.5 Iris-virginica
Set up an instance of LinearDiscriminantAnalysis model and train it:
>>> lda = LinearDiscriminantAnalysis(conn_context=cc, regularization_type='mixing',
...                                  projection=True)
>>> lda.fit(data=df, features=['X1', 'X2', 'X3', 'X4'], label='CLASS')
Check the coefficients of the obtained linear discriminants and the projection model:
>>> lda.coef_.collect()
             CLASS   COEFF_X1   COEFF_X2   COEFF_X3   COEFF_X4   INTERCEPT
0      Iris-setosa  23.907391  51.754001 -34.641902 -49.063407 -113.235478
1  Iris-versicolor   0.511034  15.652078  15.209568  -4.861018  -53.898190
2   Iris-virginica -14.729636   4.981955  42.511486  12.315007  -94.143564
>>> lda.proj_model_.collect()
             NAME        X1        X2        X3        X4
0  DISCRIMINANT_1  1.907978  2.399516 -3.846154 -3.112216
1  DISCRIMINANT_2  3.046794 -4.575496 -2.757271  2.633037
2    OVERALL_MEAN  5.843333  3.040000  3.863333  1.213333
Data to predict the class labels:
>>> df_pred.collect() ID X1 X2 X3 X4 0 1 5.1 3.5 1.4 0.2 1 2 4.9 3.0 1.4 0.2 2 3 4.7 3.2 1.3 0.2 3 4 4.6 3.1 1.5 0.2 4 5 5.0 3.6 1.4 0.2 5 6 5.4 3.9 1.7 0.4 6 7 4.6 3.4 1.4 0.3 7 8 5.0 3.4 1.5 0.2 8 9 4.4 2.9 1.4 0.2 9 10 4.9 3.1 1.5 0.1 10 11 7.0 3.2 4.7 1.4 11 12 6.4 3.2 4.5 1.5 12 13 6.9 3.1 4.9 1.5 13 14 5.5 2.3 4.0 1.3 14 15 6.5 2.8 4.6 1.5 15 16 5.7 2.8 4.5 1.3 16 17 6.3 3.3 4.7 1.6 17 18 4.9 2.4 3.3 1.0 18 19 6.6 2.9 4.6 1.3 19 20 5.2 2.7 3.9 1.4 20 21 6.3 3.3 6.0 2.5 21 22 5.8 2.7 5.1 1.9 22 23 7.1 3.0 5.9 2.1 23 24 6.3 2.9 5.6 1.8 24 25 6.5 3.0 5.8 2.2 25 26 7.6 3.0 6.6 2.1 26 27 4.9 2.5 4.5 1.7 27 28 7.3 2.9 6.3 1.8 28 29 6.7 2.5 5.8 1.8 29 30 7.2 3.6 6.1 2.5
Perform predict() and check the result:
>>> res_pred = lda.predict(data=df_pred,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        verbose=False)
>>> res_pred.collect()
    ID            CLASS       SCORE
0    1      Iris-setosa  130.421263
1    2      Iris-setosa   99.762784
2    3      Iris-setosa  108.796296
3    4      Iris-setosa   94.301777
4    5      Iris-setosa  133.205924
5    6      Iris-setosa  138.089829
6    7      Iris-setosa  108.385827
7    8      Iris-setosa  119.390933
8    9      Iris-setosa   82.633689
9   10      Iris-setosa  106.380335
10  11  Iris-versicolor   63.346631
11  12  Iris-versicolor   59.511996
12  13  Iris-versicolor   64.286132
13  14  Iris-versicolor   38.332614
14  15  Iris-versicolor   54.823224
15  16  Iris-versicolor   53.865644
16  17  Iris-versicolor   63.581912
17  18  Iris-versicolor   30.402809
18  19  Iris-versicolor   57.411739
19  20  Iris-versicolor   42.433076
20  21   Iris-virginica  114.258002
21  22   Iris-virginica   72.984306
22  23   Iris-virginica   91.802556
23  24   Iris-virginica   86.640121
24  25   Iris-virginica   97.620689
25  26   Iris-virginica  114.195778
26  27   Iris-virginica   57.274694
27  28   Iris-virginica  101.668525
28  29   Iris-virginica   87.257782
29  30   Iris-virginica  106.747065
Data to project:
>>> df_proj.collect() ID X1 X2 X3 X4 0 1 5.1 3.5 1.4 0.2 1 2 4.9 3.0 1.4 0.2 2 3 4.7 3.2 1.3 0.2 3 4 4.6 3.1 1.5 0.2 4 5 5.0 3.6 1.4 0.2 5 6 5.4 3.9 1.7 0.4 6 7 4.6 3.4 1.4 0.3 7 8 5.0 3.4 1.5 0.2 8 9 4.4 2.9 1.4 0.2 9 10 4.9 3.1 1.5 0.1 10 11 7.0 3.2 4.7 1.4 11 12 6.4 3.2 4.5 1.5 12 13 6.9 3.1 4.9 1.5 13 14 5.5 2.3 4.0 1.3 14 15 6.5 2.8 4.6 1.5 15 16 5.7 2.8 4.5 1.3 16 17 6.3 3.3 4.7 1.6 17 18 4.9 2.4 3.3 1.0 18 19 6.6 2.9 4.6 1.3 19 20 5.2 2.7 3.9 1.4 20 21 6.3 3.3 6.0 2.5 21 22 5.8 2.7 5.1 1.9 22 23 7.1 3.0 5.9 2.1 23 24 6.3 2.9 5.6 1.8 24 25 6.5 3.0 5.8 2.2 25 26 7.6 3.0 6.6 2.1 26 27 4.9 2.5 4.5 1.7 27 28 7.3 2.9 6.3 1.8 28 29 6.7 2.5 5.8 1.8 29 30 7.2 3.6 6.1 2.5
Call project() and check the result:
>>> res_proj = lda.project(data=df_proj,
...                        key='ID',
...                        features=['X1', 'X2', 'X3', 'X4'],
...                        proj_dim=2)
>>> res_proj.collect()
    ID  DISCRIMINANT_1  DISCRIMINANT_2 DISCRIMINANT_3 DISCRIMINANT_4
0    1       12.313584       -0.245578           None           None
1    2       10.732231        1.432811           None           None
2    3       11.215154        0.184080           None           None
3    4       10.015174       -0.214504           None           None
4    5       12.362738       -1.007807           None           None
5    6       12.069495       -1.462312           None           None
6    7       10.808422       -1.048122           None           None
7    8       11.498220       -0.368435           None           None
8    9        9.538291        0.366963           None           None
9   10       10.898789        0.436231           None           None
10  11       -1.208079        0.976629           None           None
11  12       -1.894856       -0.036689           None           None
12  13       -2.719280        0.841349           None           None
13  14       -3.226081        2.191170           None           None
14  15       -3.048480        1.822461           None           None
15  16       -3.567804       -0.865854           None           None
16  17       -2.926155       -1.087069           None           None
17  18       -0.504943        1.045723           None           None
18  19       -1.995288        1.142984           None           None
19  20       -2.765274       -0.014035           None           None
20  21      -10.727149       -2.301788           None           None
21  22       -7.791979       -0.178166           None           None
22  23       -8.291120        0.730808           None           None
23  24       -7.969943       -1.211807           None           None
24  25       -9.362513       -0.558237           None           None
25  26      -10.029438        0.324116           None           None
26  27       -7.058927       -0.877426           None           None
27  28       -8.754272       -0.095103           None           None
28  29       -8.935789        1.285655           None           None
29  30       -8.674729       -1.208049           None           None
Attributes
basic_info_
(DataFrame) Basic information of the training data for linear discriminant analysis.
priors_
(DataFrame) The empirical priors for each class in the training data.
coef_
(DataFrame) Coefficients (inclusive of intercepts) of each class’ linear score function for the training data.
proj_info_
(DataFrame) Projection-related info, such as the standard deviations of the discriminants, the proportion of total variance explained by each discriminant, etc.
proj_model_
(DataFrame) The projection matrix and overall means for features.
Methods
fit(data[, key, features, label])
Calculate linear discriminators from training data.
predict(data, key[, features, verbose])
Predict class labels using fitted linear discriminators.
project(data, key[, features, proj_dim])
Project data into lower dimensional spaces using the fitted LDA projection model.
-
fit(data, key=None, features=None, label=None)¶
Calculate linear discriminators from training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If not provided, it is assumed that the input data has no ID column.
features : list of str, optional
Names of the feature columns. If not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the class label column. If not provided, it defaults to the last column.
-
predict(data, key, features=None, verbose=None)¶
Predict class labels using fitted linear discriminators.
- Parameters
data : DataFrame
Data for predicting the class labels.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If not provided, defaults to all non-ID columns.
verbose : bool, optional
Whether or not to output the scores of all classes. If False, only the score of the predicted class will be output. Defaults to False.
- Returns
DataFrame
Predicted class labels and the corresponding scores, structured as follows:
ID: with the same name and data type as data's ID column.
CLASS: with the same name and data type as the training data's label column.
SCORE: type DOUBLE, score of the predicted class.
-
project(data, key, features=None, proj_dim=None)¶
Project data into lower dimensional spaces using the fitted LDA projection model.
- Parameters
data : DataFrame
Data for linear discriminant projection.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If not provided, defaults to all non-ID columns.
proj_dim : int, optional
Dimension of the projected space, equivalent to the number of discriminants used for projection. Defaults to the number of obtained discriminants.
- Returns
DataFrame
Projected data, structured as follows:
1st column: ID, with the same name and data type as the data for projection.
Other columns, named DISCRIMINANT_i (where i runs from 1 to the number of elements in features), data type DOUBLE.
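Since coef_ stores each class's linear score function (coefficients plus intercept, with the column names shown in the example above), the predicted class is the one with the highest linear score. A hedged numpy sketch reproducing predict() client-side under those assumptions:
>>> import numpy as np
>>> coef = lda.coef_.collect()
>>> X = df_pred.collect()[['X1', 'X2', 'X3', 'X4']].to_numpy()
>>> W = coef[['COEFF_X1', 'COEFF_X2', 'COEFF_X3', 'COEFF_X4']].to_numpy()
>>> scores = X @ W.T + coef['INTERCEPT'].to_numpy()   # one score per (point, class) pair
>>> coef['CLASS'].to_numpy()[scores.argmax(axis=1)]   # class with the highest score wins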
hana_ml.algorithms.pal.linear_model¶
This module contains Python wrappers for PAL linear model algorithms.
The following classes are available:
-
class hana_ml.algorithms.pal.linear_model.LinearRegression(conn_context, solver=None, var_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Linear regression is an approach to model the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as the independent variables (together forming the predictor vector).
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
solver : {‘QR’, ‘SVD’, ‘CD’, ‘Cholesky’, ‘ADMM’}, optional
Algorithms to use to solve the least squares problem. Case-insensitive.
‘QR’: QR decomposition.
‘SVD’: singular value decomposition.
‘CD’: cyclical coordinate descent method.
‘Cholesky’: Cholesky decomposition.
‘ADMM’: alternating direction method of multipliers.
‘CD’ and ‘ADMM’ are supported only when var_select is ‘all’.
Defaults to QR decomposition.
var_select : {‘all’, ‘forward’, ‘backward’}, optional
Method to perform variable selection.
‘all’: all variables are included.
‘forward’: forward selection.
‘backward’: backward selection.
‘forward’ and ‘backward’ selection are supported only when solver is ‘QR’, ‘SVD’ or ‘Cholesky’.
Defaults to ‘all’.
intercept : bool, optional
If true, include the intercept in the model.
Defaults to True.
alpha_to_enter : float, optional
P-value for forward selection. Valid only when var_select is ‘forward’.
Defaults to 0.05.
alpha_to_remove : float, optional
P-value for backward selection. Valid only when var_select is ‘backward’.
Defaults to 0.1.
enet_lambda : float, optional
Penalized weight. Value should be greater than or equal to 0. Valid only when solver is ‘CD’ or ‘ADMM’.
enet_alpha : float, optional
Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is ‘CD’ or ‘ADMM’.
Defaults to 1.0.
max_iter : int, optional
Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is ‘CD’ or ‘ADMM’.
Defaults to 1e5.
tol : float, optional
Convergence threshold for coordinate descent. Valid only when solver is ‘CD’.
Defaults to 1.0e-7.
pho : float, optional
Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is ‘ADMM’.
Defaults to 1.8.
stat_inf : bool, optional
If true, output t-value and Pr(>|t|) of coefficients.
Defaults to False.
adjusted_r2 : bool, optional
If true, include the adjusted R2 value in statistics.
Defaults to False.
dw_test : bool, optional
If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to False.
reset_test : int, optional
Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to 1.
bp_test : bool, optional
If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to False.
ks_test : bool, optional
If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to False.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is ‘QR’, ‘CD’, ‘Cholesky’ or ‘ADMM’.
Defaults to 0.0.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
pmml_export : {‘no’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
‘no’ or not provided: No PMML model.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
Examples
Training data:
>>> df.collect() ID Y X1 X2 X3 0 0 -6.879 0.00 A 1 1 1 -3.449 0.50 A 1 2 2 6.635 0.54 B 1 3 3 11.844 1.04 B 1 4 4 2.786 1.50 A 1 5 5 2.389 0.04 B 2 6 6 -0.011 2.00 A 2 7 7 8.839 2.04 B 2 8 8 4.689 1.54 B 1 9 9 -5.507 1.00 A 2
Training the model:
>>> lr = LinearRegression(conn_context=cc,
...                       thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')
Prediction:
>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
Attributes
coefficients_
(DataFrame) Fitted regression coefficients.
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_
(DataFrame) Regression-related statistics, such as mean squared error.
Methods
fit(data[, key, features, label, …])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on a fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit(data, key=None, features=None, label=None, categorical_variable=None)¶
Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict(data, key, features=None)¶
Predict dependent variable values based on a fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns
DataFrame
Predicted values, structured as follows:
ID column: with same name and type as data's ID column.
VALUE: type DOUBLE, representing predicted values.
-
score(data, key, features=None, label=None)¶
Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable. If label is not provided, it defaults to the last column.
- Returns
float
Returns the coefficient of determination R2 of the prediction.
Note
score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
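A minimal usage sketch for score(), evaluating the fitted model on the training frame from the example above (a held-out frame with the same column layout would be used the same way):
>>> lr.score(data=df, key='ID', label='Y')  # coefficient of determination R2, as a float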
-
class hana_ml.algorithms.pal.linear_model.LogisticRegression(conn_context, multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, alpha=None, lamb=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lamb_values=None, lamb_range=None, alpha_values=None, alpha_range=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None)¶
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Logistic regression model that handles binary-class and multi-class classification problems.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
multi_class : bool, optional
If true, perform multi-class classification. Otherwise, there must be only two classes.
Defaults to False.
max_iter : int, optional
Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.
multi-class: Defaults to 100.
binary-class: Defaults to 100000 when solver is ‘cyclical’, 1000 when solver is ‘proximal’, otherwise 100.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
multi-class:
‘no’ or not provided: No PMML model.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
binary-class:
‘no’ or not provided: No PMML model.
‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Defaults to ‘no’.
categorical_variable : str or list of str, optional(deprecated)
Specifies INTEGER column(s) in the data that should be treated as categorical.
standardize : bool, optional
If true, standardize the data to have zero mean and unit variance.
Defaults to True.
stat_inf : bool, optional
If true, proceed with statistical inference.
Defaults to False.
solver : {‘auto’, ‘newton’, ‘cyclical’, ‘lbfgs’, ‘stochastic’, ‘proximal’}, optional
Optimization algorithm.
‘auto’ : automatically determined by system based on input data and parameters.
‘newton’: Newton iteration method.
‘cyclical’: Cyclical coordinate descent method to fit elastic net regularized logistic regression.
‘lbfgs’: LBFGS method (recommended when having many independent variables).
‘stochastic’: Stochastic gradient descent method (recommended when dealing with very large dataset).
‘proximal’: Proximal gradient descent method to fit elastic net regularized logistic regression.
Only valid when multi_class is False.
Defaults to ‘auto’.
alpha : float, optional
Elastic net mixing parameter. Only valid when multi_class is False and solver is ‘newton’, ‘cyclical’, ‘lbfgs’ or ‘proximal’.
Defaults to 1.0.
lamb : float, optional
Penalized weight. Only valid when multi_class is False and solver is ‘newton’, ‘cyclical’, ‘lbfgs’ or ‘proximal’.
Defaults to 0.0.
tol : float, optional
Convergence threshold for exiting iterations. Only valid when multi_class is False.
Defaults to 1.0e-7 when solver is ‘cyclical’, 1.0e-6 otherwise.
epsilon : float, optional
Determines the accuracy with which the solution is to be found.
Only valid when multi_class is False and the solver is ‘newton’ or ‘lbfgs’.
Defaults to 1.0e-6 when solver is ‘newton’, 1.0e-5 when solver is ‘lbfgs’.
thread_ratio : float, optional
Controls the proportion of available threads to use for the fit() method. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 1.0.
max_pass_number : int, optional
The maximum number of passes over the data. Only valid when multi_class is False and solver is ‘stochastic’.
Defaults to 1.
sgd_batch_number : int, optional
The batch number of stochastic gradient descent. Only valid when multi_class is False and solver is ‘stochastic’.
Defaults to 1.
precompute : bool, optional
Whether to pre-compute the Gram matrix. Only valid when solver is ‘cyclical’.
Defaults to True.
handle_missing : bool, optional
Whether to handle missing values.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical. By default, string is categorical, while int and double are numerical.
lbfgs_m : int, optional
Number of previous updates to keep. Only applicable when multi_class is False and solver is ‘lbfgs’.
Defaults to 6.
resampling_method : {‘cv’, ‘stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional
The resampling method for model evaluation and parameter selection. If no value specified, neither model evaluation nor parameter selection is activated.
metric : {‘accuracy’, ‘f1_score’, ‘auc’, ‘nll’}, optional
The evaluation metric used for model evaluation/parameter selection.
fold_num : int, optional
The number of folds for cross-validation. Mandatory and valid only when resampling_method is ‘cv’ or ‘stratified_cv’.
repeat_times : int, optional
The number of repeat times for resampling.
Defaults to 1.
search_strategy : {‘grid’, ‘random’}, optional
The search method for parameter selection.
random_search_times : int, optional
The number of times to randomly select candidate parameters for selection. Mandatory and valid when search_strategy is ‘random’.
random_state : int, optional
The seed for random generation. 0 indicates using system time as seed.
Defaults to 0.
progress_indicator_id : str, optional
The ID of the progress indicator for model evaluation/parameter selection. The progress indicator is deactivated if no value is provided.
lamb_values : list of float, optional
The values of lamb for parameter selection. Only valid when search_strategy is specified.
lamb_range : list of float, optional
The range of lamb for parameter selection, including a lower limit and an upper limit. Only valid when search_strategy is specified.
alpha_values : list of float, optional
The values of alpha for parameter selection. Only valid when search_strategy is specified.
alpha_range : list of float, optional
The range of alpha for parameter selection, including a lower limit and an upper limit. Only valid when search_strategy is specified.
class_map0 : str, optional (deprecated)
Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR. Only valid when multi_class is False during binary class fit and score.
class_map1 : str, optional (deprecated)
Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
Examples
Training data:
>>> df.collect() V1 V2 V3 CATEGORY 0 B 2.620 0 1 1 B 2.875 0 1 2 A 2.320 1 1 3 A 3.215 2 0 4 B 3.440 3 0 5 B 3.460 0 0 6 A 3.570 1 0 7 B 3.190 2 0 8 A 3.150 3 0 9 B 3.440 0 0 10 B 3.440 1 0 11 A 4.070 3 0 12 A 3.730 1 0 13 B 3.780 2 0 14 B 5.250 2 0 15 A 5.424 3 0 16 A 5.345 0 0 17 B 2.200 1 1 18 B 1.615 2 1 19 A 1.835 0 1 20 B 2.465 3 0 21 A 3.520 1 0 22 A 3.435 0 0 23 B 3.840 2 0 24 B 3.845 3 0 25 A 1.935 1 1 26 B 2.140 0 1 27 B 1.513 1 1 28 A 3.170 3 1 29 B 2.770 0 1 30 B 3.570 0 1 31 A 2.780 3 1
Create LogisticRegression instance and call fit:
>>> lgr = linear_model.LogisticRegression(conn_context=cc, solver='newton', ... thread_ratio=0.1, max_iter=1000, ... pmml_export='single-row', ... stat_inf=True, tol=0.000001) >>> lgr.fit(data=df, features=['V1', 'V2', 'V3'], ... label='CATEGORY', categorical_variable=['V3']) >>> lgr.coef_.collect() VARIABLE_NAME COEFFICIENT 0 __PAL_INTERCEPT__ 17.044785 1 V1__PAL_DELIMIT__A 0.000000 2 V1__PAL_DELIMIT__B -1.464903 3 V2 -4.819740 4 V3__PAL_DELIMIT__0 0.000000 5 V3__PAL_DELIMIT__1 -2.794139 6 V3__PAL_DELIMIT__2 -4.807858 7 V3__PAL_DELIMIT__3 -2.780918 8 {"CONTENT":"{\"impute_model\":{\"column_statis... NaN >>> pred_df.collect() ID V1 V2 V3 0 0 B 2.620 0 1 1 B 2.875 0 2 2 A 2.320 1 3 3 A 3.215 2 4 4 B 3.440 3 5 5 B 3.460 0 6 6 A 3.570 1 7 7 B 3.190 2 8 8 A 3.150 3 9 9 B 3.440 0 10 10 B 3.440 1 11 11 A 4.070 3 12 12 A 3.730 1 13 13 B 3.780 2 14 14 B 5.250 2 15 15 A 5.424 3 16 16 A 5.345 0 17 17 B 2.200 1
Call predict():
>>> result = lgr.predict(data=pred_df, ... key='ID', ... categorical_variable=['V3'], ... thread_ratio=0.1) >>> result.collect() ID CLASS PROBABILITY 0 0 1 9.503618e-01 1 1 1 8.485210e-01 2 2 1 9.555861e-01 3 3 0 3.701858e-02 4 4 0 2.229129e-02 5 5 0 2.503962e-01 6 6 0 4.945832e-02 7 7 0 9.922085e-03 8 8 0 2.852859e-01 9 9 0 2.689207e-01 10 10 0 2.200498e-02 11 11 0 4.713726e-03 12 12 0 2.349803e-02 13 13 0 5.830425e-04 14 14 0 4.886177e-07 15 15 0 6.938072e-06 16 16 0 1.637820e-04 17 17 1 8.986435e-01
Input data for score():
>>> df_score.collect() ID V1 V2 V3 CATEGORY 0 0 B 2.620 0 1 1 1 B 2.875 0 1 2 2 A 2.320 1 1 3 3 A 3.215 2 0 4 4 B 3.440 3 0 5 5 B 3.460 0 0 6 6 A 3.570 1 1 7 7 B 3.190 2 0 8 8 A 3.150 3 0 9 9 B 3.440 0 0 10 10 B 3.440 1 0 11 11 A 4.070 3 0 12 12 A 3.730 1 0 13 13 B 3.780 2 0 14 14 B 5.250 2 0 15 15 A 5.424 3 0 16 16 A 5.345 0 0 17 17 B 2.200 1 1
Call score():
>>> lgr.score(data=df_score, ... key='ID', ... categorical_variable=['V3'], ... thread_ratio=0.1) 0.944444
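To illustrate how the resampling and search parameters combine, the following is a minimal sketch of selecting lamb and alpha via grid search with cross-validation; the connection object conn, the candidate values and the fold number are placeholders, and df is the training data from above:
>>> lr_cv = linear_model.LogisticRegression(conn_context=conn,
...                                         solver='cyclical',
...                                         resampling_method='cv',
...                                         metric='accuracy',
...                                         fold_num=5,
...                                         search_strategy='grid',
...                                         lamb_values=[0.01, 0.1, 1.0],
...                                         alpha_values=[0.1, 0.5, 1.0])
>>> lr_cv.fit(data=df, features=['V1', 'V2', 'V3'], label='CATEGORY')
>>> lr_cv.optim_param_.collect()   # the selected lamb/alpha combination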
Attributes
coef_
(DataFrame) Values of the coefficients.
result_
(DataFrame) Model content.
optim_param_
(DataFrame) The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.
stat_
(DataFrame) Statistics info for the trained model, structured as follows: - 1st column: ‘STAT_NAME’, NVARCHAR(256) - 2nd column: ‘STAT_VALUE’, NVARCHAR(1000)
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
Methods
fit
(data[, key, features, label, …])Fit the LR model when given training dataset.
predict
(data, key[, features, …])Predict with the dataset using the trained model.
score
(data, key[, features, label, …])Return the mean accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)¶ Fit the LR model when given training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Otherwise, all INTEGER columns are treated as numerical.
class_map0 : str, optional (deprecated)
Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
class_map1 : str, optional (deprecated)
Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
-
predict
(data, key, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False)¶ Predict with the dataset using the trained model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
verbose : bool, optional
If true, output scoring probabilities for each class. It is only applicable for multi-class case.
Defaults to False.
categorical_variable : str or list of str, optional (deprecated)
Specifies INTEGER column(s) that should be treated as categorical. Otherwise, all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such data columns.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
class_map0 : str, optional (deprecated)
Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
class_map1 : str, optional (deprecated)
Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
- Returns
DataFrame
Predicted result, structured as follows:
1: ID column, with predicted class name.
2: PROBABILITY, type DOUBLE
multi-class: probability of being predicted as the predicted class.
binary-class: probability of being predicted as the positive class.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.
-
score
(data, key, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)¶ Return the mean accuracy on the given test data and labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional (deprecated)
Specifies INTEGER columns that should be treated as categorical; otherwise, all INTEGER columns are treated as numerical. Mandatory if the training data of the prediction model contains such data columns.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
class_map0 : str, optional (deprecated)
Categorical label to map to 0. class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
class_map1 : str, optional (deprecated)
Categorical label to map to 1. class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score. Only valid when multi_class is False.
- Returns
float
Scalar accuracy value after comparing the predicted label and original label.
hana_ml.algorithms.pal.linkpred¶
This module contains the Python wrapper for the PAL link prediction function.
The following class is available:
-
class
hana_ml.algorithms.pal.linkpred.
LinkPrediction
(conn_context, method, beta=None, min_score=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked, which is helpful for predicting missing links (the higher the proximity score, the more likely the two nodes are to be linked).
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
method : {‘common_neighbors’, ‘jaccard’, ‘adamic_adar’, ‘katz’}
Method for computing the proximity between 2 nodes that are not directly linked.
beta : float, optional
A parameter included in the calculation of the Katz similarity (proximity) score. Valid only when method is ‘katz’.
Defaults to 0.005.
min_score : float, optional
Links whose scores are lower than min_score will be filtered out from the result table.
Defaults to 0.
Examples
Input dataframe df for training:
>>> df.collect() NODE1 NODE2 0 1 2 1 1 4 2 2 3 3 3 4 4 5 1 5 6 2 6 7 4 7 7 5 8 6 7 9 5 4
Create linkpred instance:
>>> lp = LinkPrediction(conn_context=conn, ... method='common_neighbors', ... beta=0.005, ... min_score=0, ... thread_ratio=0.2)
Calculate the proximity score of all nodes in the network with missing links, and check the result:
>>> res = lp.proximity_score(data=df, node1='NODE1', node2='NODE2') >>> res.collect() NODE1 NODE2 SCORE 0 1 3 0.285714 1 1 6 0.142857 2 1 7 0.285714 3 2 4 0.285714 4 2 5 0.142857 5 2 7 0.142857 6 4 6 0.142857 7 3 5 0.142857 8 3 6 0.142857 9 3 7 0.142857 10 5 6 0.142857
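Since beta only affects the Katz measure, below is a minimal sketch of the same workflow with method='katz' (scores not shown):
>>> lp_katz = LinkPrediction(conn_context=conn,
...                          method='katz',
...                          beta=0.005,
...                          min_score=0)
>>> res_katz = lp_katz.proximity_score(data=df, node1='NODE1', node2='NODE2')
>>> res_katz.collect()   # Katz scores decay with path length, weighted by beta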
Methods
proximity_score
(data[, node1, node2])Predict proximity scores between nodes under the current choice of method.
-
proximity_score
(data, node1=None, node2=None)¶ Predict proximity scores between nodes under the current choice of method.
- Parameters
data : DataFrame
Network data with nodes and links. Nodes are in columns while links are in rows; each link is represented by a pair of adjacent nodes (node1, node2).
node1 : str, optional
Column name of data that gives node1 of all available links (see data).
Defaults to the name of the first column of data if not provided.
node2 : str, optional
Column name of data that gives node2 of all available links (see data).
Defaults to the name of the last column of data if not provided.
- Returns
DataFrame
The proximity scores of pairs of nodes with missing links between them that are above min_score, structured as follows:
1st column: node1 of a link.
2nd column: node2 of a link.
3rd column: proximity score of the two nodes.
hana_ml.algorithms.pal.metrics¶
This module contains Python wrappers for PAL metrics to assess the quality of model outputs.
The following functions are available:
-
hana_ml.algorithms.pal.metrics.
confusion_matrix
(conn_context, data, key, label_true=None, label_pred=None, beta=None, native=True)¶ Computes confusion matrix to evaluate the accuracy of a classification.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
label_true : str, optional
Name of the original label column.
If not given, defaults to the second column.
label_pred : str, optional
Name of the predicted label column. If not given, defaults to the third column.
beta : float, optional
Parameter used to compute the F-Beta score.
Defaults to 1.
native : bool, optional
Indicates whether to use native sql statements for confusion matrix calculation.
Defaults to True.
- Returns
DataFrame
- Confusion matrix, structured as follows:
Original label, with same name and data type as it is in data.
Predicted label, with same name and data type as it is in data.
Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.
The DataFrame is sorted by (original label, predicted label) in descending order.
- Classification report table, structured as follows:
Class, type NVARCHAR(100), class name
Recall, type DOUBLE, the recall of each class
Precision, type DOUBLE, the precision of each class
F_MEASURE, type DOUBLE, the F_measure of each class
SUPPORT, type INTEGER, the support - sample number in each class
Examples
Data contains the original label and predict label df:
>>> df.collect() ID ORIGINAL PREDICT 0 1 1 1 1 2 1 1 2 3 1 1 3 4 1 2 4 5 1 1 5 6 2 2 6 7 2 1 7 8 2 2 8 9 2 2 9 10 2 2
Calculate the confusion matrix:
>>> cm, cr = confusion_matrix(conn_context=conn, data=df, key='ID', label_true='ORIGINAL', label_pred='PREDICT')
Output:
>>> cm.collect() ORIGINAL PREDICT COUNT 0 1 1 4 1 1 2 1 2 2 1 1 3 2 2 4 >>> cr.collect() CLASS RECALL PRECISION F_MEASURE SUPPORT 0 1 0.8 0.8 0.8 5 1 2 0.8 0.8 0.8 5
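Since beta controls the F-measure in the classification report, below is a minimal sketch that recomputes the report as an F2 score on the same data (output not shown):
>>> cm2, cr2 = confusion_matrix(conn_context=conn, data=df, key='ID',
...                             label_true='ORIGINAL', label_pred='PREDICT',
...                             beta=2)
>>> cr2.collect()   # F_MEASURE column now holds the F2 score per class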
-
hana_ml.algorithms.pal.metrics.
auc
(conn_context, data, positive_label=None)¶ Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
data : DataFrame
Input data, structured as follows:
ID column.
True class of the data point.
Classifier-computed probability that the data point belongs to the positive class.
positive_label : str, optional
If original label is not 0 or 1, specifies the label value which will be mapped to 1.
- Returns
float
The area under the receiver operating characteristic curve.
DataFrame
False positive rate and true positive rate (ROC), structured as follows:
ID column, type INTEGER.
FPR, type DOUBLE, representing false positive rate.
TPR, type DOUBLE, representing true positive rate.
Examples
Input DataFrame df:
>>> df.collect() ID ORIGINAL PREDICT 0 1 0 0.07 1 2 0 0.01 2 3 0 0.85 3 4 0 0.30 4 5 0 0.50 5 6 1 0.50 6 7 1 0.20 7 8 1 0.80 8 9 1 0.20 9 10 1 0.95
Compute Area Under Curve:
>>> auc, roc = auc(conn_context=conn, data=df)
Output:
>>> print(auc) 0.66
>>> roc.collect() ID FPR TPR 0 0 1.0 1.0 1 1 0.8 1.0 2 2 0.6 1.0 3 3 0.6 0.6 4 4 0.4 0.6 5 5 0.2 0.4 6 6 0.2 0.2 7 7 0.0 0.2 8 8 0.0 0.0
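When the true labels are not 0 or 1, positive_label maps the chosen label value to 1; a minimal sketch, assuming a hypothetical DataFrame df_str whose second column holds 'neg'/'pos' labels:
>>> auc_value, roc = auc(conn_context=conn, data=df_str, positive_label='pos')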
-
hana_ml.algorithms.pal.metrics.
multiclass_auc
(conn_context, data_original, data_predict)¶ Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
data_original : DataFrame
True class data, structured as follows:
Data point ID column.
True class of the data point.
data_predict : DataFrame
Predicted class data, structured as follows:
Data point ID column.
Possible class.
Classifier-computed probability that the data point belongs to that particular class.
For each data point ID, there should be one row for each possible class.
- Returns
float
The area under the receiver operating characteristic curve.
DataFrame
False positive rate and true positive rate (ROC), structured as follows:
ID column, type INTEGER.
FPR, type DOUBLE, representing false positive rate.
TPR, type DOUBLE, representing true positive rate.
Examples
Input DataFrame df:
>>> df_original.collect() ID ORIGINAL 0 1 1 1 2 1 2 3 1 3 4 2 4 5 2 5 6 2 6 7 3 7 8 3 8 9 3 9 10 3
>>> df_predict.collect() ID PREDICT PROB 0 1 1 0.90 1 1 2 0.05 2 1 3 0.05 3 2 1 0.80 4 2 2 0.05 5 2 3 0.15 6 3 1 0.80 7 3 2 0.10 8 3 3 0.10 9 4 1 0.10 10 4 2 0.80 11 4 3 0.10 12 5 1 0.20 13 5 2 0.70 14 5 3 0.10 15 6 1 0.05 16 6 2 0.90 17 6 3 0.05 18 7 1 0.10 19 7 2 0.10 20 7 3 0.80 21 8 1 0.00 22 8 2 0.00 23 8 3 1.00 24 9 1 0.20 25 9 2 0.10 26 9 3 0.70 27 10 1 0.20 28 10 2 0.20 29 10 3 0.60
Compute Area Under Curve:
>>> auc, roc = multiclass_auc(conn_context=conn, data_original=df_original, data_predict=df_predict)
Output:
>>> print(auc) 1.0
>>> roc.collect() ID FPR TPR 0 0 1.00 1.0 1 1 0.90 1.0 2 2 0.65 1.0 3 3 0.25 1.0 4 4 0.20 1.0 5 5 0.00 1.0 6 6 0.00 0.9 7 7 0.00 0.7 8 8 0.00 0.3 9 9 0.00 0.1 10 10 0.00 0.0
-
hana_ml.algorithms.pal.metrics.
accuracy_score
(conn_context, data, label_true, label_pred)¶ Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
data : DataFrame
DataFrame of true and predicted labels.
label_true : str
Name of the column containing ground truth labels.
label_pred : str
Name of the column containing predicted labels, as returned by a classifier.
- Returns
float
Accuracy classification score. A lower accuracy indicates that the classifier predicted fewer of the input labels correctly.
Examples
Actual and predicted labels df for a hypothetical classification:
>>> df.collect() ACTUAL PREDICTED 0 1 0 1 0 0 2 0 0 3 1 1 4 1 1
Accuracy score for these predictions:
>>> accuracy_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED') 0.8
Compare that to null accuracy df_dummy (accuracy that could be achieved by always predicting the most frequent class):
>>> df_dummy.collect() ACTUAL PREDICTED 0 1 1 1 0 1 2 0 1 3 1 1 4 1 1 >>> accuracy_score(conn_context=conn, data=df_dummy, label_true='ACTUAL', label_pred='PREDICTED') 0.6
A perfect predictor df_perfect:
>>> df_perfect.collect() ACTUAL PREDICTED 0 1 1 1 0 0 2 0 0 3 1 1 4 1 1 >>> accuracy_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED') 1.0
-
hana_ml.algorithms.pal.metrics.
r2_score
(conn_context, data, label_true, label_pred)¶ Computes coefficient of determination for regression results.
- Parameters
conn_context : ConnectionContext
The connection to SAP HANA system.
data : DataFrame
DataFrame of true and predicted values.
label_true : str
Name of the column containing true values.
label_pred : str
Name of the column containing values predicted by regression.
- Returns
float
Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.
Examples
Actual and predicted values df for a hypothetical regression:
>>> df.collect() ACTUAL PREDICTED 0 0.10 0.2 1 0.90 1.0 2 2.10 1.9 3 3.05 3.0 4 4.00 3.5
R2 score for these predictions:
>>> r2_score(conn_context=conn, data=df, label_true='ACTUAL', label_pred='PREDICTED') 0.9685233682514102
Compare that to the score for a perfect predictor:
>>> df_perfect.collect() ACTUAL PREDICTED 0 0.10 0.10 1 0.90 0.90 2 2.10 2.10 3 3.05 3.05 4 4.00 4.00 >>> r2_score(conn_context=conn, data=df_perfect, label_true='ACTUAL', label_pred='PREDICTED') 1.0
A naive mean predictor:
>>> df_mean.collect() ACTUAL PREDICTED 0 0.10 2.03 1 0.90 2.03 2 2.10 2.03 3 3.05 2.03 4 4.00 2.03 >>> r2_score(conn_context=conn, data=df_mean, label_true='ACTUAL', label_pred='PREDICTED') 0.0
And a really awful predictor df_awful:
>>> df_awful.collect() ACTUAL PREDICTED 0 0.10 12345.0 1 0.90 91923.0 2 2.10 -4444.0 3 3.05 -8888.0 4 4.00 -9999.0 >>> r2_score(conn_context=conn, data=df_awful, label_true='ACTUAL', label_pred='PREDICTED') -886477397.139857
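For reference, the value in the first example matches the textbook definition R^2 = 1 - SS_res/SS_tot; a plain-Python check (no HANA connection needed):
>>> y_true = [0.10, 0.90, 2.10, 3.05, 4.00]
>>> y_pred = [0.20, 1.00, 1.90, 3.00, 3.50]
>>> mean = sum(y_true) / len(y_true)
>>> ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
>>> ss_tot = sum((t - mean) ** 2 for t in y_true)
>>> round(1 - ss_res / ss_tot, 6)
0.968523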
hana_ml.algorithms.pal.mixture¶
This module contains the Python wrapper for the Gaussian mixture model algorithm.
The following class is available:
-
class
hana_ml.algorithms.pal.mixture.
GaussianMixture
(conn_context, init_param, n_components=None, init_centers=None, covariance_type=None, shared_covariance=False, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None, regularization=None, random_seed=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Representation of a Gaussian mixture model probability distribution.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
init_param : {‘farthest_first_traversal’,’manual’,’random_means’,’kmeans++’}
Specifies the initialization mode.
farthest_first_traversal: The initial centers are given by the farthest-first traversal algorithm.
manual: The initial centers are the init_centers given by user.
random_means: The initial centers are the means of all the data that are randomly weighted.
kmeans++: The initial centers are given using the k-means++ approach.
n_components : int
Specifies the number of Gaussian distributions. Mandatory when init_param is not ‘manual’.
init_centers : list of int
Specifies the rows of data (by sequence number in the data table, starting from 0) to be used as initial centers. Mandatory when init_param is ‘manual’.
covariance_type : {‘full’, ‘diag’, ‘tied_diag’}, optional
Specifies the type of covariance matrices in the model.
full: use full covariance matrices.
diag: use diagonal covariance matrices.
tied_diag: use diagonal covariance matrices with all equal diagonal entries.
Defaults to ‘full’.
shared_covariance : bool, optional
All clusters share the same covariance matrix if True.
Defaults to False.
thread_ratio : float, optional
Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
max_iter : int, optional
Specifies the maximum number of iterations for the EM algorithm.
Defaults to 100.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
category_weight : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
error_tol : float, optional
Specifies the error tolerance, which is the stop condition.
Defaults to 1e-5.
regularization : float, optional
Regularization added to the diagonal of covariance matrices to ensure they remain positive definite.
Defaults to 1e-6.
random_seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the provided value.
Defaults to 0.
Examples
Input dataframe df1 for training:
>>> df1.collect() ID X1 X2 X3 0 0 0.10 0.10 1 1 1 0.11 0.10 1 2 2 0.10 0.11 1 3 3 0.11 0.11 1 4 4 0.12 0.11 1 5 5 0.11 0.12 1 6 6 0.12 0.12 1 7 7 0.12 0.13 1 8 8 0.13 0.12 2 9 9 0.13 0.13 2 10 10 0.13 0.14 2 11 11 0.14 0.13 2 12 12 10.10 10.10 1 13 13 10.11 10.10 1 14 14 10.10 10.11 1 15 15 10.11 10.11 1 16 16 10.11 10.12 2 17 17 10.12 10.11 2 18 18 10.12 10.12 2 19 19 10.12 10.13 2 20 20 10.13 10.12 2 21 21 10.13 10.13 2 22 22 10.13 10.14 2 23 23 10.14 10.13 2
Creating the GMM instance:
>>> gmm = GaussianMixture(conn_context=conn, ... init_param='farthest_first_traversal', ... n_components=2, covariance_type='full', ... shared_covariance=False, max_iter=500, ... error_tol=0.001, thread_ratio=0.5, ... categorical_variable=['X3'], random_seed=1)
Performing fit() on the given dataframe:
>>> gmm.fit(data=df1, key='ID')
Expected output:
>>> gmm.labels_.head(14).collect() ID CLUSTER_ID PROBABILITY 0 0 0 0.0 1 1 0 0.0 2 2 0 0.0 3 4 0 0.0 4 5 0 0.0 5 6 0 0.0 6 7 0 0.0 7 8 0 0.0 8 9 0 0.0 9 10 0 1.0 10 11 0 1.0 11 12 0 1.0 12 13 0 1.0 13 14 0 0.0
>>> gmm.stats_.collect() STAT_NAME STAT_VALUE 1 log-likelihood 11.7199 2 aic -504.5536 3 bic -480.3900
>>> gmm.model_.collect() ROW_INDEX CLUSTER_ID MODEL_CONTENT 1 0 -1 {"Algorithm":"GMM","Metadata":{"DataP... 2 1 0 {"GuassModel":{"covariance":[22.18895... 3 2 1 {"GuassModel":{"covariance":[22.19450...
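To make the init_param/init_centers interplay concrete, below is a minimal sketch of manual initialization; the chosen row numbers are placeholders:
>>> gmm_manual = GaussianMixture(conn_context=conn,
...                              init_param='manual',
...                              init_centers=[0, 12],   # row numbers in the data, starting from 0
...                              covariance_type='diag')
>>> gmm_manual.fit(data=df1, key='ID')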
Attributes
model_
(DataFrame) Trained model content.
labels_
(DataFrame) Cluster membership probabilities for each data point.
stats_
(DataFrame) Statistics.
Methods
fit
(data, key[, features, categorical_variable])Perform GMM clustering on input dataset.
fit_predict
(data, key[, features, …])Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
-
fit
(data, key, features=None, categorical_variable=None)¶ Perform GMM clustering on input dataset.
- Parameters
data : DataFrame
Data to be clustered.
key : str
Name of the ID column.
features : list of str, optional
List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
fit_predict
(data, key, features=None, categorical_variable=None)¶ Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
- Parameters
data : DataFrame
Data to be clustered.
key : str
Name of the ID column.
features : list of str, optional
List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
- Returns
DataFrame
Cluster membership probabilities.
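fit_predict() is a shorthand for fit() followed by reading labels_; a minimal sketch reusing df1 and the gmm settings above:
>>> labels = gmm.fit_predict(data=df1, key='ID')
>>> labels.head(5).collect()   # same structure as gmm.labels_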
hana_ml.algorithms.pal.naive_bayes¶
This module contains wrappers for the PAL Naive Bayes algorithm.
The following class is available:
-
class
hana_ml.algorithms.pal.naive_bayes.
NaiveBayes
(conn_context, alpha=None, discretization=None, model_format=None, categorical_variable=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
A classification model based on Bayes’ theorem.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to 0.
discretization : {‘no’, ‘supervised’}, optional
Discretize continuous attributes. Case-insensitive.
‘no’ or not provided: disable discretization.
‘supervised’: use supervised discretization on all the continuous attributes.
Defaults to ‘no’.
model_format : {‘json’, ‘pmml’}, optional
Controls whether to output the model in JSON format or PMML format. Case-insensitive.
‘json’ or not provided: JSON format.
‘pmml’: PMML format.
Defaults to ‘json’.
categorical_variable : str or list of str, optional
Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
Examples
Training data:
>>> df1.collect() HomeOwner MaritalStatus AnnualIncome DefaultedBorrower 0 YES Single 125.0 NO 1 NO Married 100.0 NO 2 NO Single 70.0 NO 3 YES Married 120.0 NO 4 NO Divorced 95.0 YES 5 NO Married 60.0 NO 6 YES Divorced 220.0 NO 7 NO Single 85.0 YES 8 NO Married 75.0 NO 9 NO Single 90.0 YES
Training the model:
>>> nb = NaiveBayes(conn_context=cc, alpha=1.0, model_format='pmml') >>> nb.fit(df1)
Prediction:
>>> df2.collect() ID HomeOwner MaritalStatus AnnualIncome 0 0 NO Married 120.0 1 1 YES Married 180.0 2 2 NO Single 90.0
>>> nb.predict(data=df2, key='ID', alpha=1.0, verbose=True) ID CLASS CONFIDENCE 0 0 NO -6.572353 1 0 YES -23.747252 2 1 NO -7.602221 3 1 YES -169.133547 4 2 NO -7.133599 5 2 YES -4.648640
Attributes
model_
(DataFrame) Trained model content. Note: the Laplace value (alpha) is only stored in JSON-format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().
Methods
fit
(data[, key, features, label, …])Fit classification model based on training data.
predict
(data, key[, features, alpha, verbose])Predict based on fitted model.
score
(data, key[, features, label, alpha])Returns the mean accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Fit classification model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
-
predict
(data, key, features=None, alpha=None, verbose=None)¶ Predict based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
verbose : bool, optional
If true, output all classes and the corresponding confidences for each data point.
Defaults to False.
- Returns
DataFrame
Predicted result, structured as follows:
ID column, with the same name and type as data’s ID column.
CLASS, type NVARCHAR, predicted class name.
CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.
Note
A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.
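For example, since the PMML-format model trained above does not carry alpha, the smoothing value can be supplied again at prediction time; a minimal sketch:
>>> res = nb.predict(data=df2, key='ID', alpha=1.0)   # alpha here overrides any JSON-stored value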
-
score
(data, key, features=None, label=None, alpha=None)¶ Returns the mean accuracy on the given test data and labels.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
- Returns
float
Mean accuracy on the given test data and labels.
hana_ml.algorithms.pal.neighbors¶
This module contains Python wrapper for PAL k-nearest neighbors algorithm.
The following class is available:
-
class
hana_ml.algorithms.pal.neighbors.
KNN
(conn_context, n_neighbors=None, thread_ratio=None, voting_type=None, stat_info=True, metric=None, minkowski_power=None, algorithm=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
K-Nearest Neighbor (KNN) model that handles classification problems.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
n_neighbors : int, optional
Number of nearest neighbors.
Defaults to 1.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
voting_type : {‘majority’, ‘distance-weighted’}, optional
Method used to vote for the most frequent label of the K nearest neighbors.
Defaults to ‘distance-weighted’.
stat_info : bool, optional
Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set. If true, the table will be returned.
Defaults to True.
metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional
Ways to compute the distance between data points.
Defaults to ‘euclidean’.
minkowski_power : float, optional
When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is ‘minkowski’.
Defaults to 3.0.
algorithm : {‘brute-force’, ‘kd-tree’}, optional
Algorithm used to compute the nearest neighbors.
Defaults to ‘brute-force’.
Examples
Training data:
>>> df.collect() ID X1 X2 TYPE 0 0 1.0 1.0 2 1 1 10.0 10.0 3 2 2 10.0 11.0 3 3 3 10.0 10.0 3 4 4 1000.0 1000.0 1 5 5 1000.0 1001.0 1 6 6 1000.0 999.0 1 7 7 999.0 999.0 1 8 8 999.0 1000.0 1 9 9 1000.0 1000.0 1
Create KNN instance and call fit:
>>> knn = KNN(conn_context=conn, n_neighbors=3, voting_type='majority', ... thread_ratio=0.1, stat_info=False) >>> knn.fit(data=df, key='ID', features=['X1', 'X2'], label='TYPE') >>> pred_df = conn.table("PAL_KNN_CLASSDATA_TBL")
Call predict:
>>> res, stat = knn.predict(data=pred_df, key="ID") >>> res.collect() ID TYPE 0 0 3 1 1 3 2 2 3 3 3 1 4 4 1 5 5 1 6 6 1 7 7 1
Methods
fit
(data, key[, features, label])Fit the model when given training set.
predict
(data, key[, features])Predict the class labels for the provided data.
score
(data, key[, features, label])Return a scalar accuracy value after comparing the predicted and original label.
-
fit
(data, key, features=None, label=None)¶ Fit the model when given training set.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None)¶ Predict the class labels for the provided data.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Predicted result, structured as follows:
ID column, with same name and type as data’s ID column.
Label column, with same name and type as training data’s label column.
The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:
TEST_ + data’s ID name, with same type as data’s ID column, query data ID.
K, type INTEGER, K number.
TRAIN_ + training data’s ID name, with same type as training data’s ID column, neighbor point’s ID.
DISTANCE, type DOUBLE, distance.
-
score
(data, key, features=None, label=None)¶ Return a scalar accuracy value after comparing the predicted and original label.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
- Returns
float
Scalar accuracy value after comparing the predicted label and original label.
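As a quick check, score() can be run against the training set itself (an optimistic estimate); a minimal sketch with the df above:
>>> acc = knn.score(data=df, key='ID', features=['X1', 'X2'], label='TYPE')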
hana_ml.algorithms.pal.neural_network¶
This module contains Python wrappers for PAL Multi-layer Perceptron algorithm.
The following classes are available:
-
class
hana_ml.algorithms.pal.neural_network.
MLPClassifier
(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.neural_network._MLPBase
Multi-layer perceptron (MLP) Classifier.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory
Activation function for the hidden layer. Mandatory if activation_options is not provided.
activation_options : list of str, conditionally mandatory
A list of activation functions for parameter selection. See activation for the full set of valid activation functions.
output_activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}, conditionally mandatory
Activation function for the output layer.
output_activation_options : list of str, conditionally mandatory
A list of activation functions for the output layer for parameter selection.
See output_activation for the full set of activation functions for the output layer.
hidden_layer_size : list of int or tuple of int
Sizes of all hidden layers.
hidden_layer_size_options : list of tuples, conditionally mandatory
A list of optional sizes of all hidden layers for parameter selection.
max_iter : int, optional
Maximum number of iterations.
Defaults to 100.
training_style : {‘batch’, ‘stochastic’}, optional
Specifies the training style.
Defaults to ‘stochastic’.
learning_rate : float, optional
Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.
momentum : float, optional
Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.
batch_size : int, optional
Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.
Defaults to 1.
normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional
Defaults to ‘no’.
weight_init : {‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional
Specifies the weight initial value.
Defaults to ‘all-zeros’.
categorical_variable : str or list of str, optional
Specifies column name(s) in the data table used as category variables. Valid only when the column is of INTEGER type.
thread_ratio : float, optional
Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
resampling_method : {‘cv’,’stratified_cv’, ‘bootstrap’, ‘stratified_bootstrap’}, optional
Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.
evaluation_metric : {‘accuracy’,’f1_score’, ‘auc_onevsrest’, ‘auc_pairwise’}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
fold_num : int, optional
Specifies the fold number for cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’ or ‘stratified_cv’.
repeat_times : int, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
search_strategy : {‘grid’, ‘random’}, optional
Specifies the method for parameter selection. If not provided, parameter selection will not be activated.
random_search_times : int, optional
Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.
random_state : int, optional
Specifies the seed for random generation. When 0 is specified, system time is used.
Defaults to 0.
timeout : int, optional
Specifies maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.
progress_indicator_id : str, optional
Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.
param_values : list of tuple, optional
Sets the values of the following parameters for model parameter selection: learning_rate, momentum, batch_size.
Each tuple contains two elements:
1st element is the parameter name (str type),
2nd element is a list of valid values for that parameter.
A simple example for illustration:
[(‘learning_rate’, [0.1, 0.2, 0.5]),
(‘momentum’, [0.2, 0.6])]
Valid only when search_strategy is specified and training_style is ‘stochastic’.
param_range : list of tuple, optional
is ‘stochastic’.param_range : list of tuple, optional
Sets the range of the following parameters for model parameter selection: learning_rate, momentum, batch_size.
Each tuple should contain two elements:
1st element is the parameter name (str type),
2nd element is a list that specifies the range of that parameter: the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.
Valid only when search_strategy is specified and training_style is ‘stochastic’.
Examples
Training data:
>>> df.collect() V000 V001 V002 V003 LABEL 0 1 1.71 AC 0 AA 1 10 1.78 CA 5 AB 2 17 2.36 AA 6 AA 3 12 3.15 AA 2 C 4 7 1.05 CA 3 AB 5 6 1.50 CA 2 AB 6 9 1.97 CA 6 C 7 5 1.26 AA 1 AA 8 12 2.13 AC 4 C 9 18 1.87 AC 6 AA
Training the model:
>>> mlpc = MLPClassifier(conn_context=conn, hidden_layer_size=(10,10), ... activation='tanh', output_activation='tanh', ... learning_rate=0.001, momentum=0.0001, ... training_style='stochastic',max_iter=100, ... normalization='z-transform', weight_init='normal', ... thread_ratio=0.3, categorical_variable='V003') >>> mlpc.fit(data=df)
Training result may look different from the following results due to model randomness.
>>> mlpc.model_.collect() ROW_INDEX MODEL_CONTENT 0 1 {"CurrentVersion":"1.0","DataDictionary":[{"da... 1 2 t":0.2700182926188939},{"from":13,"weight":0.0... 2 3 ht":0.2414416413305134},{"from":21,"weight":0.... >>> mlpc.train_log_.collect() ITERATION ERROR 0 1 1.080261 1 2 1.008358 2 3 0.947069 3 4 0.894585 4 5 0.849411 5 6 0.810309 6 7 0.776256 7 8 0.746413 8 9 0.720093 9 10 0.696737 10 11 0.675886 11 12 0.657166 12 13 0.640270 13 14 0.624943 14 15 0.609432 15 16 0.595204 16 17 0.582101 17 18 0.569990 18 19 0.558757 19 20 0.548305 20 21 0.538553 21 22 0.529429 22 23 0.521457 23 24 0.513893 24 25 0.506704 25 26 0.499861 26 27 0.493338 27 28 0.487111 28 29 0.481159 29 30 0.475462 .. ... ... 70 71 0.349684 71 72 0.347798 72 73 0.345954 73 74 0.344071 74 75 0.342232 75 76 0.340597 76 77 0.338837 77 78 0.337236 78 79 0.335749 79 80 0.334296 80 81 0.332759 81 82 0.331255 82 83 0.329810 83 84 0.328367 84 85 0.326952 85 86 0.325566 86 87 0.324232 87 88 0.322899 88 89 0.321593 89 90 0.320242 90 91 0.318985 91 92 0.317840 92 93 0.316630 93 94 0.315376 94 95 0.314210 95 96 0.313066 96 97 0.312021 97 98 0.310916 98 99 0.309770 99 100 0.308704
Prediction:
>>> pred_df.collect() >>> res, stat = mlpc.predict(data=pred_df, key='ID')
Prediction result may look different from the following results due to model randomness.
>>> res.collect() ID TARGET VALUE 0 1 C 0.472751 1 2 C 0.417681 2 3 C 0.543967 >>> stat.collect() ID CLASS SOFT_MAX 0 1 AA 0.371996 1 1 AB 0.155253 2 1 C 0.472751 3 2 AA 0.357822 4 2 AB 0.224496 5 2 C 0.417681 6 3 AA 0.349813 7 3 AB 0.106220 8 3 C 0.543967
Model Evaluation:
>>> mlpc = MLPClassifier(conn_context=conn, ... activation='tanh', ... output_activation='tanh', ... hidden_layer_size=(10,10), ... learning_rate=0.001, ... momentum=0.0001, ... training_style='stochastic', ... max_iter=100, ... normalization='z-transform', ... weight_init='normal', ... resampling_method='cv', ... evaluation_metric='f1_score', ... fold_num=10, ... repeat_times=2, ... random_state=1, ... progress_indicator_id='TEST', ... thread_ratio=0.3) >>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')
Model evaluation result may look different from the following result due to randomness.
>>> mlpc.stats_.collect() STAT_NAME STAT_VALUE 0 timeout FALSE 1 TEST_1_F1_SCORE 1, 0, 1, 1, 0, 1, 0, 1, 1, 0 2 TEST_2_F1_SCORE 0, 0, 1, 1, 0, 1, 0, 1, 1, 1 3 TEST_F1_SCORE.MEAN 0.6 4 TEST_F1_SCORE.VAR 0.252631 5 EVAL_RESULTS_1 {"candidates":[{"TEST_F1_SCORE":[[1.0,0.0,1.0,... 6 solution status Convergence not reached after maximum number o... 7 ERROR 0.2951168443145714
Parameter selection:
>>> act_opts=['tanh', 'linear', 'sigmoid_asymmetric'] >>> out_act_opts = ['sigmoid_symmetric', 'gaussian_asymmetric', 'gaussian_symmetric'] >>> layer_size_opts = [(10, 10), (5, 5, 5)] >>> mlpc = MLPClassifier(conn_context=conn, ... activation_options=act_opts, ... output_activation_options=out_act_opts, ... hidden_layer_size_options=layer_size_opts, ... learning_rate=0.001, ... batch_size=2, ... momentum=0.0001, ... training_style='stochastic', ... max_iter=100, ... normalization='z-transform', ... weight_init='normal', ... resampling_method='stratified_bootstrap', ... evaluation_metric='accuracy', ... search_strategy='grid', ... fold_num=10, ... repeat_times=2, ... random_state=1, ... progress_indicator_id='TEST', ... thread_ratio=0.3) >>> mlpc.fit(data=df, label='LABEL', categorical_variable='V003')
Parameter selection result may look different from the following result due to randomness.
>>> mlpc.stats_.collect() STAT_NAME STAT_VALUE 0 timeout FALSE 1 TEST_1_ACCURACY 0.25 2 TEST_2_ACCURACY 0.666666 3 TEST_ACCURACY.MEAN 0.458333 4 TEST_ACCURACY.VAR 0.0868055 5 EVAL_RESULTS_1 {"candidates":[{"TEST_ACCURACY":[[0.50],[0.0]]... 6 EVAL_RESULTS_2 PUT_LAYER_ACTIVE_FUNC=6;HIDDEN_LAYER_ACTIVE_FU... 7 EVAL_RESULTS_3 FUNC=2;"},{"TEST_ACCURACY":[[0.50],[0.33333333... 8 EVAL_RESULTS_4 rs":"HIDDEN_LAYER_SIZE=10, 10;OUTPUT_LAYER_ACT... 9 ERROR 0.684842661926971 >>> mlpc.optim_param_.collect() PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE 0 HIDDEN_LAYER_SIZE NaN None 5, 5, 5 1 OUTPUT_LAYER_ACTIVE_FUNC 4.0 None None 2 HIDDEN_LAYER_ACTIVE_FUNC 3.0 None None
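param_range works analogously to param_values but with ranges instead of explicit candidates; below is a minimal sketch of random search over learning_rate and momentum (the step value is omitted, since it is ignored for random search; all values are placeholders):
>>> mlpc_rnd = MLPClassifier(conn_context=conn,
...                          activation='tanh',
...                          output_activation='tanh',
...                          hidden_layer_size=(10, 10),
...                          training_style='stochastic',
...                          resampling_method='cv',
...                          evaluation_metric='accuracy',
...                          fold_num=5,
...                          search_strategy='random',
...                          random_search_times=10,
...                          param_range=[('learning_rate', [0.001, 0.01]),
...                                       ('momentum', [0.0001, 0.001])])
>>> mlpc_rnd.fit(data=df, label='LABEL', categorical_variable='V003')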
Attributes
model_
(DataFrame) Model content.
train_log_
(DataFrame) Provides mean squared error between predicted values and target values for each iteration.
stats_
(DataFrame) Names and values of statistics.
optim_param_
(DataFrame) Provides optimal parameters selected. Available only when parameter selection is triggered.
Methods
fit
(data[, key, features, label, …])Fit the model when the training dataset is given.
predict
(data, key[, features, thread_ratio])Predict using the multi-layer perceptron model.
score
(data, key[, features, label, thread_ratio])Returns the accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Fit the model when the training dataset is given.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, thread_ratio=None)¶ Predict using the multi-layer perceptron model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Predicted classes, structured as follows:
ID column, with the same name and type as data’s ID column.
TARGET, type NVARCHAR, predicted class name.
VALUE, type DOUBLE, softmax value for the predicted class.
Softmax values for all classes, structured as follows:
ID column, with the same name and type as data’s ID column.
CLASS, type NVARCHAR, class name.
VALUE, type DOUBLE, softmax value for that class.
-
score
(data, key, features=None, label=None, thread_ratio=None)¶ Returns the accuracy on the given test data and labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
- Returns
float
Scalar value of accuracy after comparing the predicted result and original label.
-
class
hana_ml.algorithms.pal.neural_network.
MLPRegressor
(conn_context, activation=None, activation_options=None, output_activation=None, output_activation_options=None, hidden_layer_size=None, hidden_layer_size_options=None, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.neural_network._MLPBase
Multi-layer perceptron (MLP) Regressor.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}
Activation function for the hidden layer.
output_activation : {‘tanh’, ‘linear’, ‘sigmoid_asymmetric’, ‘sigmoid_symmetric’, ‘gaussian_asymmetric’, ‘gaussian_symmetric’, ‘elliot_asymmetric’, ‘elliot_symmetric’, ‘sin_asymmetric’, ‘sin_symmetric’, ‘cos_asymmetric’, ‘cos_symmetric’, ‘relu’}
Activation function for the output layer.
hidden_layer_size : tuple of int
Sizes of all hidden layers.
max_iter : int, optional
Maximum number of iterations.
Defaults to 100.
training_style : {‘batch’, ‘stochastic’}, optional
Specifies the training style.
Defaults to ‘stochastic’.
learning_rate : float, optional
Specifies the learning rate. Mandatory and valid only when training_style is ‘stochastic’.
momentum : float, optional
Specifies the momentum for gradient descent update. Mandatory and valid only when training_style is ‘stochastic’.
batch_size : int, optional
Specifies the size of mini batch. Valid only when training_style is ‘stochastic’.
Defaults to 1.
normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional
Defaults to ‘no’.
weight_init : {‘all-zeros’, ‘normal’, ‘uniform’, ‘variance-scale-normal’, ‘variance-scale-uniform’}, optional
Specifies the weight initial value.
Defaults to ‘all-zeros’.
categorical_variable : str or list of str, optional
Specifies column name(s) in the data table used as category variables. Valid only when the column is of INTEGER type.
thread_ratio : float, optional
Controls the proportion of available threads to use for training. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
resampling_method : {‘cv’, ‘bootstrap’}, optional
Specifies the resampling method for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection will be triggered.
evaluation_metric : {‘rmse’}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
fold_num : int, optional
Specifies the fold number for cross-validation. Mandatory and valid only when resampling_method is set to ‘cv’.
repeat_times : int, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
search_strategy : {‘grid’, ‘random’}, optional
Specifies the method for parameter selection. If not provided, parameter selection will not be activated.
random_search_times : int, optional
Specifies the number of times to randomly select candidate parameters. Mandatory and valid only when search_strategy is set to ‘random’.
random_state : int, optional
Specifies the seed for random generation. When 0 is specified, system time is used.
Defaults to 0.
timeout : int, optional
Specifies maximum running time for model evaluation/parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.
progress_indicator_id : str, optional
Sets an ID of progress indicator for model evaluation/parameter selection. If not provided, no progress indicator is activated.
param_values : list of tuple, optional
Sets the values of the following parameters for model parameter selection: learning_rate, momentum, batch_size.
Each tuple contains two elements: 1st element is the parameter name (str type), 2nd element is a list of valid values for that parameter.
A simple example for illustration:
[(‘learning_rate’, [0.1, 0.2, 0.5]),
(‘momentum’, [0.2, 0.6])]
Valid only when search_strategy is specified and training_style is ‘stochastic’.
param_range : list of tuple, optional
Sets the range of the following parameters for model parameter selection: learning_rate, momentum, batch_size.
Each tuple should contain two elements:
1st element is the parameter name (str type),
2nd element is a list that specifies the range of that parameter: the first value is the start value, the second value is the step, and the third value is the end value. The step value can be omitted, and will be ignored, if search_strategy is set to ‘random’.
Valid only when search_strategy is specified and training_style is ‘stochastic’.
Examples
Training data:
>>> df.collect() V000 V001 V002 V003 T001 T002 T003 0 1 1.71 AC 0 12.7 2.8 3.06 1 10 1.78 CA 5 12.1 8.0 2.65 2 17 2.36 AA 6 10.1 2.8 3.24 3 12 3.15 AA 2 28.1 5.6 2.24 4 7 1.05 CA 3 19.8 7.1 1.98 5 6 1.50 CA 2 23.2 4.9 2.12 6 9 1.97 CA 6 24.5 4.2 1.05 7 5 1.26 AA 1 13.6 5.1 2.78 8 12 2.13 AC 4 13.2 1.9 1.34 9 18 1.87 AC 6 25.5 3.6 2.14
Training the model:
>>> mlpr = MLPRegressor(conn_context=conn, hidden_layer_size=(10,5), ... activation='sin_asymmetric', ... output_activation='sin_asymmetric', ... learning_rate=0.001, momentum=0.00001, ... training_style='batch', ... max_iter=10000, normalization='z-transform', ... weight_init='normal', thread_ratio=0.3) >>> mlpr.fit(data=df, label=['T001', 'T002', 'T003'])
Due to model randomness, the training result may differ from what is shown below.
>>> mlpr.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  3782583596893},{"from":10,"weight":-0.16532599...
>>> mlpr.train_log_.collect()
     ITERATION       ERROR
0            1   34.525655
1            2   82.656301
2            3   67.289241
3            4  162.768062
4            5   38.988242
5            6  142.239468
6            7   34.467742
7            8   31.050946
8            9   30.863581
9           10   30.078204
10          11   26.671436
11          12   28.078312
12          13   27.243226
13          14   26.916686
14          15   26.782915
15          16   26.724266
16          17   26.697108
17          18   26.684084
18          19   26.677713
19          20   26.674563
20          21   26.672997
21          22   26.672216
22          23   26.671826
23          24   26.671631
24          25   26.671533
25          26   26.671485
26          27   26.671460
27          28   26.671448
28          29   26.671442
29          30   26.671439
..         ...         ...
705        706   11.891081
706        707   11.891081
707        708   11.891081
708        709   11.891081
709        710   11.891081
710        711   11.891081
711        712   11.891081
712        713   11.891081
713        714   11.891081
714        715   11.891081
715        716   11.891081
716        717   11.891081
717        718   11.891081
718        719   11.891081
719        720   11.891081
720        721   11.891081
721        722   11.891081
722        723   11.891081
723        724   11.891081
724        725   11.891081
725        726   11.891081
726        727   11.891081
727        728   11.891081
728        729   11.891081
729        730   11.891081
730        731   11.891081
731        732   11.891081
732        733   11.891081
733        734   11.891081
734        735   11.891081
[735 rows x 2 columns]
>>> pred_df.collect()
   ID  V000  V001 V002  V003
0   1     1  1.71   AC     0
1   2    10  1.78   CA     5
2   3    17  2.36   AA     6
Prediction:
>>> res = mlpr.predict(data=pred_df, key='ID')
Due to model randomness, the result may differ from what is shown below.
>>> res.collect()
   ID TARGET      VALUE
0   1   T001  12.700012
1   1   T002   2.799133
2   1   T003   2.190000
3   2   T001  12.099740
4   2   T002   6.100000
5   2   T003   2.190000
6   3   T001  10.099961
7   3   T002   2.799659
8   3   T003   2.190000
Attributes
model_
(DataFrame) Model content.
train_log_
(DataFrame) Provides mean squared error between predicted values and target values for each iteration.
stats_
(DataFrame) Names and values of statistics.
optim_param_
(DataFrame) Provides optimal parameters selected. Available only when parameter selection is triggered.
Methods
fit(data[, key, features, label, …])
Fit the model when given training dataset.
predict(data, key[, features, thread_ratio])
Predict using the multi-layer perceptron model.
score(data, key[, features, label, thread_ratio])
Returns the coefficient of determination R^2 of the prediction.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Fit the model when given training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all the non-ID and non-label columns.
label : str or list of str, optional
Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, thread_ratio=None)¶ Predict using the multi-layer perceptron model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Predicted results, structured as follows:
ID column, with the same name and type as data’s ID column.
TARGET, type NVARCHAR, target name.
VALUE, type DOUBLE, regression value.
-
score
(data, key, features=None, label=None, thread_ratio=None)¶ Returns the coefficient of determination R^2 of the prediction.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str or list of str, optional
Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.
- Returns
float
Returns the coefficient of determination R^2 of the prediction.
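A minimal usage sketch for score, assuming the fitted mlpr from the examples above and a hypothetical labeled DataFrame test_df with an ‘ID’ column:
>>> # test_df is a hypothetical hold-out set with the same structure as df plus an ID column
>>> r2 = mlpr.score(data=test_df, key='ID', label=['T001', 'T002', 'T003'])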
hana_ml.algorithms.pal.pagerank¶
This module contains the Python wrapper for the PAL PageRank algorithm.
The following class is available:
-
class
hana_ml.algorithms.pal.pagerank.
PageRank
(conn_context, damping=None, max_iter=None, tol=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
A PageRank model.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
damping : float, optional
The damping factor d.
Defaults to 0.85.
max_iter : int, optional
The maximum number of iterations of the power method. The value 0 means no maximum number of iterations is set, and the calculation stops when the result converges.
Defaults to 0.
tol : float, optional
Specifies the stop condition. When the mean improvement value of ranks is less than this value, the program stops calculation.
Defaults to 1e-6.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
Examples
Input DataFrame df:
>>> df.collect()
  FROM_NODE TO_NODE
0     Node1   Node2
1     Node1   Node3
2     Node1   Node4
3     Node2   Node3
4     Node2   Node4
5     Node3   Node1
6     Node4   Node1
7     Node4   Node3
Create a PageRank instance:
>>> pr = PageRank(conn_context=conn)
Call run() on given data sequence:
>>> result = pr.run(data=df)
>>> result.collect()
    NODE      RANK
0  NODE1  0.368152
1  NODE2  0.141808
2  NODE3  0.287962
3  NODE4  0.202078
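The call above uses the default settings; a variant with explicit convergence control (the parameter values here are arbitrary, for illustration only) would look like:
>>> # damping/max_iter/tol values chosen for illustration
>>> pr = PageRank(conn_context=conn, damping=0.9, max_iter=100, tol=1e-8)
>>> result = pr.run(data=df)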
Attributes
None
Methods
run(data)
This method reads link information and calculates the rank for each node.
-
run
(data)¶ This method reads link information and calculates rank for each node.
- Parameters
data : DataFrame
DataFrame containing the link information, where each row specifies one directed edge from FROM_NODE to TO_NODE.
- Returns
DataFrame
Calculated rank values and corresponding node names, structured as follows:
NODE: node names.
RANK: the PageRank of the corresponding node.
hana_ml.algorithms.pal.partition¶
This module contains the Python wrapper for the PAL partition function.
The following function is available:
-
hana_ml.algorithms.pal.partition.
train_test_val_split
(conn_context, data, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)¶ The algorithm partitions an input dataset randomly into three disjoint subsets called training, testing and validation. Note that the union of these three subsets might not be the complete initial dataset.
Two different partitions can be obtained:
Random Partition, which randomly divides all the data.
Stratified Partition, which divides each subpopulation randomly.
In the second case, the dataset needs to have at least one categorical attribute (for example, of type VARCHAR). The initial dataset will first be subdivided according to the different categorical values of this attribute. Each mutually exclusive subset will then be randomly split to obtain the training, testing, and validation subsets. This ensures that all “categorical values” or “strata” will be present in the sampled subset.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
data : DataFrame
DataFrame to be partitioned.
random_seed : int, optional
Indicates the seed used to initialize the random number generator.
0: Uses the system time
Not 0: Uses the specified seed
Defaults to 0.
thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
partition_method : {‘random’, ‘stratified’}, optional
- Partition method:
‘random’: random partitions
‘stratified’: stratified partition
Defaults to ‘random’.
stratified_column : str, optional
Indicates which column is used for stratification.
Valid only when partition_method is set to ‘stratified’ (stratified partition).
No default value.
training_percentage : float, optional
The percentage of training data. Value range: 0 <= value <= 1.
Defaults to 0.8.
testing_percentage : float, optional
The percentage of testing data. Value range: 0 <= value <= 1.
Defaults to 0.1.
validation_percentage : float, optional
The percentage of validation data. Value range: 0 <= value <= 1.
Defaults to 0.1.
training_size : int, optional
Row size of training data. Value range: >=0
If both training_percentage and training_size are specified, training_percentage takes precedence.
No default value.
testing_size : int, optional
Row size of testing data. Value range: >=0
If both testing_percentage and testing_size are specified, testing_percentage takes precedence.
No default value.
validation_size : int, optional
Row size of validation data. Value range:>=0
If both validation_percentage and validation_size are specified, validation_percentage takes precedence.
No default value.
- Returns
A tuple of three DataFrames:
Training data. Table structure identical to input data table.
Testing data. Table structure identical to input data table.
Validation data. Table structure identical to input data table.
Examples
>>> train, test, valid = train_test_val_split(conn_context=conn, data=df)
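A stratified variant, assuming the input DataFrame contains a categorical column named ‘GENDER’ (a hypothetical column name), might look like:
>>> # 'GENDER' is a hypothetical stratification column
>>> train, test, valid = train_test_val_split(conn_context=conn, data=df,
...                                           partition_method='stratified',
...                                           stratified_column='GENDER',
...                                           training_percentage=0.7,
...                                           testing_percentage=0.15,
...                                           validation_percentage=0.15)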
hana_ml.algorithms.pal.pipeline¶
This module supports running PAL functions in a pipeline manner.
-
class
hana_ml.algorithms.pal.pipeline.
Pipeline
(steps)¶ Bases:
object
Pipeline construction to run transformers and estimators sequentially.
- Parameters
steps : list
List of (name, transform) tuples that are chained. The last object should be an estimator.
Examples
>>> Pipeline([
...     ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
...     ('imputer', Imputer(conn_context=conn, strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
...             n_estimators=4, split_threshold=0, learning_rate=0.5,
...             fold_num=5, max_depth=6, cross_validation_range=cv_range))
... ])
Methods
fit(df, param)
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_transform(df, param)
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
-
fit_transform
(df, param)¶ Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
- Parameters
df : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
param : dict
Parameters corresponding to the transform name.
- Returns
DataFrame
Transformed SAP HANA DataFrame.
Examples
>>> my_pipeline = Pipeline([
...     ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
...     ('imputer', Imputer(conn_context=conn, strategy='mean'))
... ])
>>> param = {'pca': [('key', 'ID'), ('label', 'CLASS')], 'imputer': []}
>>> my_pipeline.fit_transform(df=train_df, param=param)
-
fit
(df, param)¶ Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
- Parameters
df : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
param : dict
Parameters corresponding to the transform name.
- Returns
DataFrame
Transformed SAP HANA DataFrame.
Examples
>>> my_pipeline = Pipeline([
...     ('pca', PCA(conn_context=conn, scaling=True, scores=True)),
...     ('imputer', Imputer(conn_context=conn, strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(conn_context=conn,
...             n_estimators=4, split_threshold=0, learning_rate=0.5,
...             fold_num=5, max_depth=6, cross_validation_range=cv_range))
... ])
>>> param = {
...     'pca': [('key', 'ID'), ('label', 'CLASS')],
...     'imputer': [],
...     'hgbt': [('key', 'ID'), ('label', 'CLASS'),
...              ('categorical_variable', ['CLASS'])]
... }
>>> hgbt_model = my_pipeline.fit(df=train_df, param=param)
hana_ml.algorithms.pal.preprocessing¶
This module contains Python wrappers for PAL preprocessing algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.preprocessing.
FeatureNormalizer
(conn_context, method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Normalize a DataFrame.
- Parameters
conn_context : ConnectionContext
The connection to the SAP HANA system.
method : {‘min-max’, ‘z-score’, ‘decimal’}
Scaling methods:
‘min-max’: Min-max normalization.
‘z-score’: Z-Score normalization.
‘decimal’: Decimal scaling normalization.
z_score_method : {‘mean-standard’, ‘mean-mean’, ‘median-median’}, optional
Only valid when method is ‘z-score’.
‘mean-standard’: Mean-Standard deviation
‘mean-mean’: Mean-Mean deviation
‘median-median’: Median-Median absolute deviation
new_max : float, optional
The new maximum value for min-max normalization.
Only valid when method is ‘min-max’.
new_min : float, optional
The new minimum value for min-max normalization.
Only valid when method is ‘min-max’.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
Examples
Input DataFrame df1:
>>> df1.head(4).collect()
   ID    X1    X2
0   0   6.0   9.0
1   1  12.1   8.3
2   2  13.5  15.3
3   3  15.4  18.7
Creating a FeatureNormalizer instance:
>>> fn = FeatureNormalizer(conn_context=conn, method="min-max", new_max=1.0, new_min=0.0)
Performing fit on given DataFrame:
>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
   ID        X1        X2
0   0  0.000000  0.033175
1   1  0.186544  0.000000
2   2  0.229358  0.331754
3   3  0.287462  0.492891
Input DataFrame for transforming:
>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0
Performing transform on given DataFrame:
>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
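For comparison, a z-score normalizer for the same data could be configured as below (an illustrative sketch; see the z_score_method values listed above):
>>> # z-score variant; settings chosen for illustration
>>> fn_z = FeatureNormalizer(conn_context=conn, method='z-score',
...                          z_score_method='mean-standard')
>>> result_z = fn_z.fit_transform(df1, key='ID')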
Attributes
result_
(DataFrame) Scaled dataset from fit and fit_transform methods.
model_
(DataFrame) Trained model content.
Methods
fit(data, key[, features])
Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
fit_transform(data, key[, features])
Fit with the dataset and return the results.
transform(data, key[, features])
Scales data based on the previous scaling model.
-
fit
(data, key, features=None)¶ Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
- Parameters
data : DataFrame
DataFrame to be normalized.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
-
fit_transform
(data, key, features=None)¶ Fit with the dataset and return the results.
- Parameters
data : DataFrame
DataFrame to be normalized.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns
DataFrame
Normalized result, with the same structure as data.
-
transform
(data, key, features=None)¶ Scales data based on the previous scaling model.
- Parameters
data : DataFrame
DataFrame to be normalized.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- Returns
DataFrame
Normalized result, with the same structure as data.
-
class
hana_ml.algorithms.pal.preprocessing.
KBinsDiscretizer
(conn_context, strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Bin continuous data into number of intervals and perform local smoothing.
- Parameters
conn_context : ConnectionContext
The connection to the SAP HANA system.
strategy : {‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}
- Binning methods:
‘uniform_number’: Equal widths based on the number of bins.
‘uniform_size’: Equal widths based on the bin size.
‘quantile’: Equal number of records per bin.
‘sd’: Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
smoothing : {‘means’, ‘medians’, ‘boundaries’}
- Smoothing methods:
‘means’: Each value within a bin is replaced by the average of all the values belonging to the same bin.
‘medians’: Each value in a bin is replaced by the median of all the values belonging to the same bin.
‘boundaries’: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When the distance is equal to both sides, it will be replaced by the front boundary value.
Values used for smoothing are not re-calculated during transform.
n_bins : int, optional
The number of bins. Only valid when strategy is ‘uniform_number’ or ‘quantile’.
Defaults to 2.
bin_size : int, optional
The interval width of each bin. Only valid when strategy is ‘uniform_size’.
Defaults to 10.
n_sd : int, optional
The leftmost bin contains all values located further than n_sd standard deviations below the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean. Only valid when strategy is ‘sd’.
Defaults to 1.
Examples
Input DataFrame df1:
>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0
Creating a KBinsDiscretizer instance:
>>> binning = KBinsDiscretizer(conn_context=conn, strategy='uniform_size', smoothing='means', bin_size=10)
Performing fit on the given DataFrame:
>>> binning.fit(data=df1, key='ID')
Output:
>>> binning.result_.collect()
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000
Input DataFrame df2 for transforming:
>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0
Performing transform on the given DataFrame:
>>> result = binning.transform(data=df2, key='ID')
Output:
>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000
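A quantile-based variant of the discretizer above (illustrative settings, not part of the original example) could be set up as:
>>> # 'quantile' strategy puts an (approximately) equal number of records per bin
>>> binning_q = KBinsDiscretizer(conn_context=conn, strategy='quantile',
...                              smoothing='medians', n_bins=4)
>>> binning_q.fit(data=df1, key='ID')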
Attributes
result_
(DataFrame) Binned dataset from fit and fit_transform methods.
model_
(DataFrame) Binning model content.
Methods
fit(data, key[, features])
Bin input data into number of intervals and smooth.
fit_transform(data, key[, features])
Fit with the dataset and return the results.
transform(data, key[, features])
Bin data based on the previous binning model.
-
fit
(data, key, features=None)¶ Bin input data into number of intervals and smooth.
- Parameters
data : DataFrame
DataFrame to be discretized.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
-
fit_transform
(data, key, features=None)¶ Fit with the dataset and return the results.
- Parameters
data : DataFrame
DataFrame to be binned.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data’s ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data’s feature column.
-
transform
(data, key, features=None)¶ Bin data based on the previous binning model.
- Parameters
data : DataFrame
DataFrame to be binned.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
DataFrame
Binned result, structured as follows:
DATA_ID column: with same name and type as data’s ID column.
BIN_INDEX: type INTEGER, assigned bin index.
BINNING_DATA column: smoothed value, with same name and type as data’s feature column.
-
class
hana_ml.algorithms.pal.preprocessing.
Imputer
(conn_context, strategy=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Missing value imputation for DataFrame.
- Parameters
conn_context : ConnectionContext
The connection to the SAP HANA system.
strategy : {‘non’, ‘mean’, ‘median’, ‘zero’, ‘als’, ‘delete’}, optional
The overall imputation strategy for all numerical columns.
Defaults to ‘mean’.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
The following parameters all have the prefix ‘als_’, and are invoked only when ‘als’ is the overall imputation strategy. Those parameters are for setting up the alternating-least-squares (ALS) model for data imputation.
als_factors : int, optional
Length of factor vectors in the ALS model. It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.
als_lambda : float, optional
L2 regularization applied to the factors in the ALS model. Should be non-negative.
Defaults to 0.01.
als_maxit : int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
als_randomstate : int, optional
Specifies the seed of the random number generator used in the training of ALS model:
0: Uses the current time as the seed,
Others: Uses the specified value as the seed.
Defaults to 0.
als_exit_threshold : float, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, then the training process exits. 0 means the objective value is not checked while running the algorithm, and training stops only when the maximum number of iterations has been reached.
Defaults to 0.
als_exit_interval : int, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified exit_threshold has been reached.
Defaults to 5.
als_linsolver : {‘cholsky’, ‘cg’}, optional
Linear system solver for the ALS model. ‘cholsky’ is usually much faster. ‘cg’ is recommended when als_factors is large.
Defaults to ‘cholsky’.
als_cg_maxit : int, optional
Specifies the maximum number of iterations for the cg algorithm. Invoked only when ‘cg’ is the chosen linear system solver for ALS.
Defaults to 3.
als_centering : bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
als_scaling : bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
Examples
Input DataFrame df:
>>> df.head(5).collect()
   V0   V1 V2   V3   V4    V5
0  10  0.0  D  NaN  1.4  23.6
1  20  1.0  A  0.4  1.3  21.8
2  50  1.0  C  NaN  1.6  21.9
3  30  NaN  B  0.8  1.7  22.6
4  10  0.0  A  0.2  NaN   NaN
Create an Imputer instance using the ‘mean’ strategy and call fit_transform:
>>> impute = Imputer(conn_context, strategy='mean')
>>> result = impute.fit_transform(df, categorical_variable=['V1'],
...                               strategy_by_col=[('V1', 'categorical_const', '0')])
>>> result.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.507692  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.507692  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.469231  20.646154
The stats_model_ content for the input DataFrame:
>>> impute.stats_model_.head(5).collect()
            STAT_NAME                   STAT_VALUE
0  V0.NUMBER_OF_NULLS                            3
1  V0.IMPUTATION_TYPE                         MEAN
2    V0.IMPUTED_VALUE                           24
3  V1.NUMBER_OF_NULLS                            2
4  V1.IMPUTATION_TYPE  SPECIFIED_CATEGORICAL_VALUE
The stats_model_ content above can be applied when imputing another DataFrame with the same data structure, e.g. consider the following DataFrame with missing values:
>>> df1.collect()
   ID    V0   V1    V2   V3   V4    V5
0   0  20.0  1.0     B  NaN  1.5  21.7
1   1  40.0  1.0  None  0.6  1.2  24.3
2   2   NaN  0.0     D  NaN  1.8  22.6
3   3  50.0  NaN     C  0.7  1.1   NaN
4   4  20.0  1.0     A  0.3  NaN  20.6
With the attribute impute.stats_model_ in place, the missing values of df1 can be imputed via the following line of code, and the result then checked:
>>> result1, _ = impute.transform(data=df1, key='ID')
>>> result1.collect()
   ID  V0  V1 V2        V3        V4         V5
0   0  20   1  B  0.507692  1.500000  21.700000
1   1  40   1  A  0.600000  1.200000  24.300000
2   2  24   0  D  0.507692  1.800000  22.600000
3   3  50   0  C  0.700000  1.100000  20.646154
4   4  20   1  A  0.300000  1.469231  20.600000
Create an Imputer instance using another strategy, e.g. the ‘als’ strategy, and call fit_transform:
>>> impute = Imputer(conn_context=conn, strategy='als', als_factors=2, als_randomstate=1)
Output:
>>> result2 = impute.fit_transform(data=df, categorical_variable=['V1'])
>>> result2.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.306957  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.930689  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.333668  21.371753
Attributes
stats_model_
(DataFrame) Statistics model content.
Methods
fit_transform(data[, key, …])
Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.
transform(data[, key, thread_ratio])
Impute missing values of a DataFrame using statistics/model info collected from another DataFrame.
-
fit_transform
(data, key=None, categorical_variable=None, strategy_by_col=None)¶ Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.
- Parameters
data : DataFrame
Input data with missing values.
key : str, optional
Name of the ID column. The data is assumed to have no ID column if key is not provided.
categorical_variable : str or list of str, optional
Names of columns with INTEGER data type that should actually be treated as categorical. By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.
strategy_by_col : ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation. Each tuple in the list should contain at least two elements, such that: The first element is the name of a column; the second element is the imputation strategy of that column. If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, then a third element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
- An illustrative example:
[(‘V1’, ‘categorical_const’, ‘0’), (‘V5’, ‘median’)]
- Returns
DataFrame
Imputed result using the specified strategy, with the same data structure, i.e. the same column names and data types as data.
-
transform
(data, key=None, thread_ratio=None)¶ The function imputes missing values of a DataFrame using statistics/model info collected from another DataFrame.
- Parameters
data : DataFrame
Input DataFrame.
key : str, optional
Name of the ID column. The data is assumed to have no ID column if not provided.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
- Returns
DataFrame
Imputation result, structured the same as data.
DataFrame
Statistics for the imputation result, structured as:
STAT_NAME: type NVARCHAR(256), statistics name.
STAT_VALUE: type NVARCHAR(5000), statistics value.
hana_ml.algorithms.pal.random¶
This module contains wrappers for PAL Random distribution sampling algorithms.
The following distribution functions are available:
-
hana_ml.algorithms.pal.random.
multinomial
(conn_context, n, pvals, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a multinomial distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
n : int
Number of trials.
pvals : tuple of float and int
Success fractions of each category.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
Generated random number columns, named by appending an index number (starting from 1 to the length of pvals) to Random_P, type DOUBLE. There will be as many columns here as there are values in pvals.
Examples
Draw samples from a multinomial distribution.
>>> res = multinomial(conn_context=cc, n=10, pvals=(0.1, 0.2, 0.3, 0.4),
...                   num_random=10)
>>> res.collect()
   ID  RANDOM_P1  RANDOM_P2  RANDOM_P3  RANDOM_P4
0   0        1.0        2.0        2.0        5.0
1   1        1.0        2.0        3.0        4.0
2   2        0.0        0.0        8.0        2.0
3   3        0.0        2.0        1.0        7.0
4   4        1.0        1.0        4.0        4.0
5   5        1.0        1.0        4.0        4.0
6   6        1.0        2.0        3.0        4.0
7   7        1.0        4.0        2.0        3.0
8   8        1.0        2.0        3.0        4.0
9   9        4.0        1.0        1.0        4.0
-
hana_ml.algorithms.pal.random.
bernoulli
(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a Bernoulli distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
p : float, optional
Success fraction. The value range is from 0 to 1.
Defaults to 0.5.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a bernoulli distribution.
>>> res = bernoulli(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               0.0
2   2               1.0
3   3               1.0
4   4               0.0
5   5               1.0
6   6               1.0
7   7               0.0
8   8               1.0
9   9               0.0
-
hana_ml.algorithms.pal.random.
beta
(conn_context, a=0.5, b=0.5, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a Beta distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
a : float, optional
Alpha value, positive.
Defaults to 0.5.
b : float, optional
Beta value, positive.
Defaults to 0.5.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a beta distribution.
>>> res = beta(conn_context=cc, a=0.5, b=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.976130
1   1          0.308346
2   2          0.853118
3   3          0.958553
4   4          0.677258
5   5          0.489628
6   6          0.027733
7   7          0.278073
8   8          0.850181
9   9          0.976244
-
hana_ml.algorithms.pal.random.
binomial
(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a binomial distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
n : int, optional
Number of trials.
Defaults to 1.
p : float, optional
Successful fraction. The value range is from 0 to 1.
Defaults to 0.5.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a binomial distribution.
>>> res = binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               0.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               1.0
8   8               0.0
9   9               1.0
-
hana_ml.algorithms.pal.random.
cauchy
(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a cauchy distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
location : float, optional
Defaults to 0.
scale : float, optional
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a cauchy distribution.
>>> res = cauchy(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.827259
1   1         -1.877612
2   2        -18.241436
3   3         -1.216243
4   4          2.091336
5   5       -317.131147
6   6         -2.804251
7   7         -0.338566
8   8          0.143280
9   9          1.277245
-
hana_ml.algorithms.pal.random.
chi_squared
(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a chi_squared distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
dof : int, optional
Degrees of freedom.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a chi_squared distribution.
>>> res = chi_squared(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.040571
1   1          2.680756
2   2          1.119563
3   3          1.174072
4   4          0.872421
5   5          0.327169
6   6          1.113164
7   7          1.549585
8   8          0.013953
9   9          0.011735
-
hana_ml.algorithms.pal.random.
exponential
(conn_context, lamb=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from an exponential distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
lamb : float, optional
The rate parameter, which is the inverse of the scale parameter.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from an exponential distribution.
>>> res = exponential(conn_context=cc, lamb=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.035207
1   1          0.559248
2   2          0.122307
3   3          2.339937
4   4          1.130033
5   5          0.985565
6   6          0.030138
7   7          0.231040
8   8          1.233268
9   9          0.876022
-
hana_ml.algorithms.pal.random.
gumbel
(conn_context, location=0, scale=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.
- Parameters
conn_context : ConnectionContext
Database connection object.
location : float, optional
Defaults to 0.
scale : float, optional
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a gumbel distribution.
>>> res = gumbel(conn_context=cc, location=0, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          1.544054
1   1          0.339531
2   2          0.394224
3   3          3.161123
4   4          1.208050
5   5         -0.276447
6   6          1.694589
7   7          1.406419
8   8         -0.443717
9   9          0.156404
-
hana_ml.algorithms.pal.random.
f
(conn_context, dof1=1, dof2=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from an f distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
dof1 : int, optional
DEGREES_OF_FREEDOM1.
Defaults to 1.
dof2 : int, optional
DEGREES_OF_FREEDOM2.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from an f distribution.
>>> res = f(conn_context=cc, dof1=1, dof2=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          6.494985
1   1          0.054830
2   2          0.752216
3   3          4.946226
4   4          0.167151
5   5        351.789925
6   6          0.810973
7   7          0.362714
8   8          0.019763
9   9         10.553533
-
hana_ml.algorithms.pal.random.
gamma
(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a gamma distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
shape : float, optional
Defaults to 1.
scale : float, optional
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a gamma distribution.
>>> res = gamma(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.082794
1   1          0.084031
2   2          0.159490
3   3          1.063100
4   4          0.530218
5   5          1.307313
6   6          0.565527
7   7          0.474969
8   8          0.440999
9   9          0.463645
-
hana_ml.algorithms.pal.random.
geometric
(conn_context, p=0.5, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a geometric distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
p : float, optional
Successful fraction. The value range is from 0 to 1.
Defaults to 0.5.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a geometric distribution.
>>> res = geometric(conn_context=cc, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               1.0
1   1               1.0
2   2               1.0
3   3               0.0
4   4               1.0
5   5               0.0
6   6               0.0
7   7               0.0
8   8               0.0
9   9               0.0
-
hana_ml.algorithms.pal.random.
lognormal
(conn_context, mean=0, sigma=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a lognormal distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
mean : float, optional
Mean value of the underlying normal distribution.
Defaults to 0.
sigma : float, optional
Standard deviation of the underlying normal distribution.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a lognormal distribution.
>>> res = lognormal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.461803
1   1          0.548432
2   2          0.625874
3   3          3.038529
4   4          3.582703
5   5          1.867543
6   6          1.853857
7   7          0.378827
8   8          1.104031
9   9          0.840102
-
hana_ml.algorithms.pal.random.
negative_binomial
(conn_context, n=1, p=0.5, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a negative_binomial distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
n : int, optional
Number of successes.
Defaults to 1.
p : float, optional
Successful fraction. The value range is from 0 to 1.
Defaults to 0.5.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a negative_binomial distribution.
>>> res = negative_binomial(conn_context=cc, n=1, p=0.5, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               2.0
2   2               3.0
3   3               1.0
4   4               1.0
5   5               0.0
6   6               2.0
7   7               1.0
8   8               2.0
9   9               3.0
-
hana_ml.algorithms.pal.random.
normal
(conn_context, mean=0, sigma=None, variance=None, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a normal distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
mean : float, optional
Mean value.
Defaults to 0.
sigma : float, optional
Standard deviation. It cannot be used together with variance.
Defaults to 1.
variance : float, optional
Variance. It cannot be used together with sigma.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a normal distribution.
>>> res = normal(conn_context=cc, mean=0, sigma=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.321078
1   1         -1.327626
2   2          0.798867
3   3         -0.116128
4   4         -0.213519
5   5          0.008566
6   6          0.251733
7   7          0.404510
8   8         -0.534899
9   9         -0.420968
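Since sigma and variance are mutually exclusive, an equivalent call parameterized by variance instead of sigma would be:
>>> # variance=1 corresponds to sigma=1
>>> res = normal(conn_context=cc, mean=0, variance=1, num_random=10)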
-
hana_ml.algorithms.pal.random.
pert
(conn_context, minimum=-1, mode=0, maximum=1, scale=4, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a PERT distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
minimum : int, optional
Minimum value.
Defaults to -1.
mode : float, optional
Most likely value.
Defaults to 0.
maximum : float, optional
Maximum value.
Defaults to 1.
scale : float, optional
Defaults to 4.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a pert distribution.
>>> res = pert(conn_context=cc, minimum=-1, mode=0, maximum=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.360781
1   1         -0.023649
2   2          0.106465
3   3          0.307412
4   4         -0.136838
5   5         -0.086010
6   6         -0.504639
7   7          0.335352
8   8         -0.287202
9   9          0.468597
-
hana_ml.algorithms.pal.random.
poisson
(conn_context, theta=1.0, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a poisson distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
theta : float, optional
The average number of events in an interval.
Defaults to 1.0.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a poisson distribution.
>>> res = poisson(conn_context=cc, theta=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0               0.0
1   1               1.0
2   2               1.0
3   3               1.0
4   4               1.0
5   5               1.0
6   6               0.0
7   7               2.0
8   8               0.0
9   9               1.0
-
hana_ml.algorithms.pal.random.
student_t
(conn_context, dof=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a Student’s t-distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
dof : float, optional
Degrees of freedom.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a Student’s t-distribution.
>>> res = student_t(conn_context=cc, dof=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0         -0.433802
1   1          1.972038
2   2         -1.097313
3   3         -0.225812
4   4         -0.452342
5   5          2.242921
6   6          0.377288
7   7          0.322347
8   8          1.104877
9   9         -0.017830
-
hana_ml.algorithms.pal.random.
uniform
(conn_context, low=0, high=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a uniform distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
low : float, optional
The lower bound.
Defaults to 0.
high : float, optional
The upper bound.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a uniform distribution.
>>> res = uniform(conn_context=cc, low=-1, high=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          0.032920
1   1          0.201923
2   2          0.823313
3   3         -0.495260
4   4         -0.138329
5   5          0.677732
6   6          0.685200
7   7          0.363627
8   8          0.024849
9   9         -0.441779
-
hana_ml.algorithms.pal.random.
weibull
(conn_context, shape=1, scale=1, num_random=100, seed=None, thread_ratio=None)¶ Draw samples from a Weibull distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
shape : float, optional
Shape parameter of the Weibull distribution.
Defaults to 1.
scale : float, optional
Scale parameter of the Weibull distribution.
Defaults to 1.
num_random : int, optional
Specifies the number of random data to be generated.
Defaults to 100.
seed : int, optional
Indicates the seed used to initialize the random number generator:
0: Uses the system time.
Not 0: Uses the specified seed.
Note
When multithreading is enabled, the random number sequences of different runs might be different even if the SEED value remains the same.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- Returns
DataFrame
Dataframe containing the generated random samples, structured as follows:
ID, type INTEGER, ID column.
GENERATED_NUMBER, type DOUBLE, sample value.
Examples
Draw samples from a Weibull distribution.
>>> res = weibull(conn_context=cc, shape=1, scale=1, num_random=10)
>>> res.collect()
   ID  GENERATED_NUMBER
0   0          2.188750
1   1          0.247628
2   2          0.339884
3   3          0.902187
4   4          0.909629
5   5          0.514740
6   6          4.627877
7   7          0.143767
8   8          0.847514
9   9          2.368169
hana_ml.algorithms.pal.regression¶
This module contains wrappers for PAL regression algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.regression.
PolynomialRegression
(conn_context, degree, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X. In polynomial regression, data is modeled using polynomial functions, and unknown model parameters are estimated from the data. Such models are called polynomial models.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
degree : int
Degree of the polynomial model.
decomposition : {‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
‘LU’: LU decomposition.
‘SVD’: singular value decomposition.
Defaults to LU decomposition.
adjusted_r2 : bool, optional
If true, include the adjusted R2 value in the statistics table.
Defaults to False.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
‘no’ or not provided: No PMML model.
‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
thread_ratio : float, optional
Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
Examples
Training data (based on y = x^3 - 2x^2 + 3x + 5, with noise):
>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041
Training the model:
>>> pr = PolynomialRegression(conn_context=conn, degree=3)
>>> pr.fit(data=df, key='ID')
Prediction:
>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501
Ideal output:
>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*x+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875
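The score method (documented below) gives a quick R2 check of the fit; a minimal sketch reusing the training data from above:
>>> pr.score(data=df, key='ID', label='Y')  # coefficient of determination R2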
Attributes
coefficients_
(DataFrame) Fitted regression coefficients.
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_
(DataFrame) Regression-related statistics, such as mean squared error.
Methods
fit(data[, key, features, label])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values used for prediction.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- Returns
DataFrame
Predicted values, structured as follows:
ID column, with same name and type as data’s ID column.
VALUE, type DOUBLE, representing predicted values.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION_PREDICT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
class
hana_ml.algorithms.pal.regression.
GLM
(conn_context, family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
family : {‘gaussian’, ‘normal’, ‘poisson’, ‘binomial’, ‘gamma’, ‘inversegaussian’, ‘negativebinomial’, ‘ordinal’}, optional
The kind of distribution the dependent variable outcomes are assumed to be drawn from. Defaults to ‘gaussian’.
link : str, optional
GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. ‘inverse’ is accepted as a synonym of ‘reciprocal’. (A construction sketch follows this parameter list.)
family            default link     allowed values of link
gaussian          identity         identity, log, reciprocal
poisson           log              identity, log
binomial          logit            logit, probit, comploglog, log
gamma             reciprocal       identity, reciprocal, log
inversegaussian   inversesquare    inversesquare, identity, reciprocal, log
negativebinomial  log              identity, log, sqrt
ordinal           logit            logit, probit, comploglog
solver : {‘irls’, ‘nr’, ‘cd’}, optional
Optimization algorithm to use.
‘irls’: Iteratively re-weighted least squares.
‘nr’: Newton-Raphson.
‘cd’: Coordinate descent. (Picking coordinate descent activates elastic net regularization.)
Defaults to ‘irls’, except when family is ‘ordinal’. Ordinal regression requires (and defaults to) ‘nr’, and Newton-Raphson is not supported for other values of family.
handle_missing_fit : {‘skip’, ‘abort’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values during fitting.
‘skip’: Don’t use those rows for fitting.
‘abort’: Throw an error if missing independent variable values are found.
‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
quasilikelihood : bool, optional
If True, enables the use of quasi-likelihood to estimate overdispersion.
Defaults to False.
max_iter : int, optional
Maximum number of optimization iterations.
Defaults to 100 for IRLS and Newton-Raphson.
Defaults to 100000 for coordinate descent.
tol : float, optional
Stopping condition for optimization.
Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.
significance_level : float, optional
Significance level for confidence intervals and prediction intervals.
Defaults to 0.05.
output_fitted : bool, optional
If True, create the fitted_ DataFrame of fitted response values for training data in fit.
alpha : float, optional
Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.
Defaults to 1.0.
num_lambda : int, optional
The number of lambda values. Only accepted when using coordinate descent.
Defaults to 100.
lambda_min_ratio : float, optional
The smallest value of lambda, as a fraction of the maximum lambda, where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent.
Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.
categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
ordering : list of str or list of int, optional
Specifies the order of categories for ordinal regression. The default is numeric order for ints and alphabetical order for strings.
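As a construction sketch tying family, link, and ordering together (the category labels below are illustrative assumptions, not values prescribed by PAL):
>>> gamma_glm = GLM(conn_context=conn, family='gamma', link='log')
>>> ordinal_glm = GLM(conn_context=conn, family='ordinal',
...                   ordering=['low', 'medium', 'high'])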
Examples
Training data:
>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1
Fitting a GLM on that data:
>>> glm = GLM(conn_context=conn, solver='irls', family='poisson', link='log')
>>> glm.fit(data=df, key='ID', label='Y')
Performing prediction:
>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(data=df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737
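A quick goodness-of-fit check via score, reusing the training data from above (a sketch; not applicable for ordinal regression, as noted under score below):
>>> glm.score(data=df, key='ID', label='Y')  # coefficient of determination R2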
Attributes
statistics_
(DataFrame) Training statistics and model information other than the coefficients and covariance matrix.
coef_
(DataFrame) Model coefficients.
covmat_
(DataFrame) Covariance matrix. Set to None for coordinate descent.
fitted_
(DataFrame) Predicted values for the training data. Set to None if output_fitted is False.
Methods
fit(data[, key, features, label, …])
Fit a generalized linear model based on training data.
predict(data, key[, features, …])
Predict dependent variable values based on fitted model.
score(data, key[, features, label, …])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None, dependent_variable=None, excluded_feature=None)¶ Fit a generalized linear model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column. Required when output_fitted is True.
features : list of str, optional
Names of the feature columns.
Defaults to all non-ID, non-label columns.
label : str or list of str, optional
Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) When family is ‘binomial’, label may be either a single column name or a list of two column names.
categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
dependent_variable : str, optional
Only used when you need to explicitly indicate the dependent variable.
excluded_feature : list of str, optional
Excludes the indicated feature column.
Defaults to None.
-
predict
(data, key, features=None, prediction_type=None, significance_level=None, handle_missing=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
Defaults to all non-ID columns.
prediction_type : {‘response’, ‘link’}, optional
Specifies whether to output predicted values of the response or the link function.
Defaults to ‘response’.
significance_level : float, optional
Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.
handle_missing : {‘skip’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values.
‘skip’: Don’t perform prediction for those rows.
‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
- Returns
DataFrame
Predicted values, structured as follows. The following two columns are always populated:
ID column, with same name and type as data’s ID column.
PREDICTION, type NVARCHAR(100), representing predicted values.
The following five columns are only populated for IRLS:
SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.
CI_LOWER, type DOUBLE. Lower bound of the confidence interval.
CI_UPPER, type DOUBLE. Upper bound of the confidence interval.
PI_LOWER, type DOUBLE. Lower bound of the prediction interval.
PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
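Since these interval columns are only populated for IRLS, a sketch that pulls them out directly (reusing glm and df2 from the class example above):
>>> pred = glm.predict(data=df2, key='ID', significance_level=0.1)
>>> pred.select('ID', 'PREDICTION', 'CI_LOWER', 'CI_UPPER').collect()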
-
score
(data, key, features=None, label=None, prediction_type=None, handle_missing=None)¶ Returns the coefficient of determination R2 of the prediction.
Not applicable for ordinal regression.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
Defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.) Cannot be two columns, even for family=’binomial’.
prediction_type : {‘response’, ‘link’}, optional
Specifies whether to predict the value of the response or the link function. The contents of the label column should match this choice.
Defaults to ‘response’.
handle_missing : {‘skip’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values.
‘skip’: Don’t perform prediction for those rows. Those rows will be left out of the R2 computation.
‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
class
hana_ml.algorithms.pal.regression.
ExponentialRegression
(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In exponential regression, data is modeled using exponential functions, and unknown model parameters are estimated from the data. Such models are called exponential models.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
decomposition : {‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
‘LU’: LU decomposition.
‘SVD’: singular value decomposition.
Defaults to LU decomposition.
adjusted_r2 : bool, optional
If true, include the adjusted R2 value in the statistics table.
Defaults to False.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
‘no’ or not provided: No PMML model.
‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
thread_ratio : float, optional
Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
Examples
>>> df.collect()
  ID     Y    X1    X2
   0   0.5  0.13  0.33
   1  0.15  0.14  0.34
   2  0.25  0.15  0.36
   3  0.35  0.16  0.35
   4  0.45  0.17  0.37
Training the model:
>>> er = ExponentialRegression(conn_context=conn, pmml_export='multi-row')
>>> er.fit(data=df, key='ID')
Prediction:
>>> df2.collect()
  ID   X1    X2
   0  0.5   0.3
   1    4   0.4
   2    0   1.6
   3  0.3  0.45
   4  0.4   1.7
>>> er.predict(data=df2, key='ID').collect()
  ID                  VALUE
   0     0.6900598931338715
   1     1.2341502316656843
   2   0.006630664136180741
   3     0.3887970208571841
   4  0.0052106543571450266
Attributes
coefficients_
(DataFrame) Fitted regression coefficients.
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_
(DataFrame) Regression-related statistics, such as mean squared error.
Methods
fit(data[, key, features, label])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values used for prediction.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
- Returns
DataFrame
Predicted values, structured as follows:
ID column, with same name and type as data’s ID column.
VALUE, type DOUBLE, representing predicted values.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
class
hana_ml.algorithms.pal.regression.
BiVariateGeometricRegression
(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Geometric regression is an approach used to model the relationship between a scalar variable y and a variable denoted X. In geometric regression, data is modeled using geometric functions, and unknown model parameters are estimated from the data. Such models are called geometric models.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
decomposition : {‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
‘LU’: LU decomposition.
‘SVD’: singular value decomposition.
Defaults to LU decomposition.
adjusted_r2 : bool, optional
If true, include the adjusted R2 value in the statistics table.
Defaults to False.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
‘no’ or not provided: No PMML model.
‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
thread_ratio : float, optional
Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
Examples
>>> df.collect()
  ID     Y  X1
   0   1.1   1
   1   4.2   2
   2   8.9   3
   3  16.3   4
   4    24   5
Training the model:
>>> gr = BiVariateGeometricRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')
Prediction:
>>> df2.collect()
  ID  X1
   0   1
   1   2
   2   3
   3   4
   4   5
>>> gr.predict(data=df2, key='ID').collect()
  ID               VALUE
   0                   1
   1  3.9723699817481437
   2   8.901666037549536
   3  15.779723271893747
   4   24.60086108408644
Attributes
coefficients_
(DataFrame) Fitted regression coefficients.
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_
(DataFrame) Regression-related statistics, such as mean squared error.
Methods
fit(data[, key, features, label])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values used for prediction.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
- Returns
DataFrame
Predicted values, structured as follows:
ID column, with same name and type as data’s ID column.
VALUE, type DOUBLE, representing predicted values.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
class
hana_ml.algorithms.pal.regression.
BiVariateNaturalLogarithmicRegression
(conn_context, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted X. In natural logarithmic regression, data is modeled using natural logarithmic functions, and unknown model parameters are estimated from the data. Such models are called natural logarithmic models.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
decomposition : {‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
‘LU’: LU decomposition.
‘SVD’: singular value decomposition.
Defaults to LU decomposition.
adjusted_r2 : bool, optional
If true, include the adjusted R2 value in the statistics table.
Defaults to False.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
‘no’ or not provided: No PMML model.
‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
thread_ratio : float, optional
Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect fitting.
Defaults to 0.
Examples
>>> df.collect()
  ID    Y  X1
   0   10   1
   1   80   2
   2  130   3
   3  180   5
   4  190   6
Training the model:
>>> gr = BiVariateNaturalLogarithmicRegression(conn_context=conn, pmml_export='multi-row')
>>> gr.fit(data=df, key='ID')
Prediction:
>>> df2.collect()
  ID  X1
   0   1
   1   2
   2   3
   3   4
   4   5
>>> gr.predict(data=df2, key='ID').collect()
  ID               VALUE
   0         14.86160299
   1    82.9935329364932
   2   122.8481570569525
   3   151.1254628829864
   4  173.05904529166017
Attributes
coefficients_
(DataFrame) Fitted regression coefficients.
pmml_
(DataFrame) PMML model. Set to None if no PMML model was requested.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_
(DataFrame) Regression-related statistics, such as mean squared error.
Methods
fit(data[, key, features, label])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values used for prediction.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
- Returns
DataFrame
Predicted values, structured as follows:
ID column, with same name and type as data’s ID column.
VALUE, type DOUBLE, representing predicted values.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
class
hana_ml.algorithms.pal.regression.
CoxProportionalHazardModel
(conn_context, tie_method=None, status_col=None, max_iter=None, convergence_criterion=None, significance_level=None, calculate_hazard=None, output_fitted=None, type_kind=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Cox proportional hazard model (CoxPHM) is a special generalized linear model. It is a well-known survival analysis model for the time until failure or death.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
tie_method : {‘breslow’, ‘efron’}, optional
The method to deal with tied events.
Defaults to ‘efron’.
status_col : bool, optional
Specifies whether a status column is defined for right-censored data:
False : No status column. All response times are failure/death.
True : The 3rd column of the input data is a status column, in which 0 indicates right-censored data and 1 indicates failure/death.
Defaults to True.
max_iter : int, optional
Maximum number of iterations for numeric optimization.
convergence_criterion : float, optional
Convergence criterion of coefficients for numeric optimization.
Defaults to 0.
significance_level : float, optional
Significance level for the confidence interval of estimated coefficients.
Defaults to 0.05.
calculate_hazard : bool, optional
Controls whether to calculate hazard function as well as survival function.
False : Does not calculate hazard function.
True: Calculates hazard function.
Defaults to True.
output_fitted : bool, optional
Controls whether to output the fitted response:
False : Does not output the fitted response.
True: Outputs the fitted response.
Defaults to False.
type_kind : str, optional
The prediction type:
‘risk’: Predicts in risk space
‘lp’: Predicts in linear predictor space
Defaults to ‘risk’.
Examples
>>> df1.collect()
  ID  TIME  STATUS  X1  X2
   1     4       1   0   0
   2     3       1   2   0
   3     1       1   1   0
   4     1       0   1   0
   5     2       1   1   1
   6     2       1   0   1
   7     3       0   0   1
Training the model:
>>> cox = CoxProportionalHazardModel(conn_context=conn, significance_level=0.05,
...                                  calculate_hazard=True, type_kind='risk')
>>> cox.fit(data=df1, key='ID', features=['STATUS', 'X1', 'X2'], label='TIME')
Prediction:
>>> df2.collect()
  ID  X1  X2
   1   0   0
   2   2   0
   3   1   0
   4   1   0
   5   1   1
   6   0   1
   7   0   1
>>> cox.predict(data=df2, key='ID', features=['STATUS', 'X1', 'X2']).collect()
  ID   PREDICTION           SE     CI_LOWER     CI_UPPER
   1  0.383590423  0.412526262  0.046607574  3.157032199
   2  1.829758442  1.385833778  0.414672719  8.073875617
   3  0.837781484  0.400894077   0.32795551  2.140161678
   4  0.837781484  0.400894077   0.32795551  2.140161678
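When calculate_hazard is enabled, the fitted survival and hazard estimates are exposed on the hazard_ attribute (see Attributes below); a minimal inspection sketch:
>>> cox.hazard_.collect()  # Time, Hazard and Survival statistics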
Attributes
statistics_
(DataFrame) Regression-related statistics, such as R-square, log-likelihood, and AIC.
coefficient_
(DataFrame) Fitted regression coefficients.
covariance_variance
(DataFrame) Covariance-related data.
hazard_
(DataFrame) Statistics related to Time, Hazard, Survival.
fitted_
(DataFrame) Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
Methods
fit(data[, key, features, label])
Fit regression model based on training data.
predict(data, key[, features])
Predict dependent variable values based on fitted model.
score(data, key[, features, label])
Returns the coefficient of determination R2 of the prediction.
-
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values used for prediction.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
- Returns
DataFrame
Predicted values, structured as follows:
ID column, with same name and type as data’s ID column.
VALUE, type DOUBLE, representing predicted values.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column. (This is not the PAL default.)
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
hana_ml.algorithms.pal.som¶
This module contains PAL wrapper for SOM algorithm. The following class is available:
-
class
hana_ml.algorithms.pal.som.
SOM
(conn_context, covergence_criterion=None, normalization=None, random_seed=None, height_of_map=None, width_of_map=None, kernel_function=None, alpha=None, learning_rate=None, shape_of_grid=None, radius=None, batch_som=None, max_iter=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
,hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin
Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
covergence_criterion : float, optional
If the largest difference between successive maps is less than this value, the calculation is regarded as converged, and SOM completes.
Defaults to 1.0e-6.
normalization : {‘0’, ‘1’, ‘2’}, int, optional
Normalization type:
0: No normalization
1: Transform to new range (0.0, 1.0)
2: Z-score normalization
Defaults to 0.
random_seed : {‘-1’, ‘0’, ‘Other value’}, int, optional
-1: Random
0: Sets every weight to zero
Other value: Uses this value as seed
Defaults to -1.
height_of_map : int, optional
Indicates the height of the map.
Defaults to 10.
width_of_map : int, optional
Indicates the width of the map.
Defaults to 10.
kernel_function : int, optional
Represents the neighborhood kernel function.
1: Gaussian
2: Bubble/Flat
Defaults to 1.
alpha : float, optional
Specifies the learning rate.
Defaults to 0.5.
learning_rate : int, optional
Indicates the decay function for learning rate.
1: Exponential
2: Linear
Defaults to 1.
shape_of_grid : int, optional
Indicates the shape of the grid.
1: Rectangle
2: Hexagon
Defaults to 2.
radius : float, optional
Specifies the scan radius.
Defaults to the larger of height_of_map and width_of_map.
batch_som : {‘0’, ‘1’}, int, optional
Indicates whether batch SOM is carried out.
0: Classical SOM
1: Batch SOM
For batch SOM, kernel_function is always Gaussian, and the learning_rate factors have no effect.
Defaults to 0.
max_iter : int, optional
Maximum number of iterations. Note that the training might not converge if this value is too small, for example, less than 1000.
Defaults to 1000 plus 500 times the number of neurons in the lattice.
Examples
Input dataframe df for clustering:
>>> df.collect()
    TRANS_ID   V000   V001
0          0   0.10   0.20
1          1   0.22   0.25
2          2   0.30   0.40
...
18        18  55.30  50.40
19        19  50.40  56.50
Create SOM instance:
>>> som = SOM(conn_context=conn, covergence_criterion=1.0e-6, normalization=0,
...           random_seed=1, height_of_map=4, width_of_map=4,
...           kernel_function='gaussian', alpha=None, learning_rate='exponential',
...           shape_of_grid='hexagon', radius=None, batch_som='classical',
...           max_iter=4000)
Perform fit on the given data:
>>> som.fit(data=df, key='TRANS_ID')
Expected output:
>>> som.map_.collect().head(3)
   CLUSTER_ID  WEIGHT_V000  WEIGHT_V001  COUNT
0           0    52.837688    53.465327      2
1           1    50.150251    49.245226      2
2           2    18.597607    27.174590      0
>>> som.labels_.collect().head(3)
   TRANS_ID  BMU  DISTANCE  SECOND_BMU  IS_ADJACENT
0         0   15  0.342564          14            1
1         1   15  0.239676          14            1
2         2   15  0.073968          14            1
>>> som.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"SOM","Cluster":[{"CellID":0,"Cel...
After training, the model can be used for prediction. Input dataframe df2 for prediction:
>>> df2.collect()
   TRANS_ID  V000  V001
0        33   0.2  0.10
1        34   1.2   4.1
Perform predict on the given data:
>>> label = som.predict(data=df2, key='TRANS_ID')
Expected output:
>>> label.collect()
   TRANS_ID  CLUSTER_ID  DISTANCE
0        33          15  0.388460
1        34          11  0.156418
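Alternatively, fit_predict (documented below) performs training and labeling in a single call; a minimal sketch on the training data:
>>> labels = som.fit_predict(data=df, key='TRANS_ID')
>>> labels.collect().head(3)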
Attributes
map_
(DataFrame) The map after training. The structure is as follows: - 1st column: CLUSTER_ID, int. Unit cell ID. - Other columns except the last one: FEATURE (in input data) column with prefix “WEIGHT_”, float. Weight vectors used to simulate the original tuples. - Last column: COUNT, int. Number of original tuples that every unit cell contains.
labels_
(DataFrame) The label of input data, the structure is as follows: - 1st column: ID (in input table) data type, ID (in input table) column name ID of original tuples. - 2nd column: BMU, int. Best match unit (BMU). - 3rd column: DISTANCE, float, The distance between the tuple and its BMU. - 4th column: SECOND_BMU, int, Second BMU. - 5th column: IS_ADJACENT. int. Indicates whether the BMU and the second BMU are adjacent. - 0: Not adjacent - 1: Adjacent
model_
(DataFrame) The SOM model.
Methods
fit
(data, key[, features])Fit the SOM model when given the training dataset.
fit_predict
(data, key[, features])Fit the dataset and return the labels.
predict
(data, key[, features])Assign clusters to data based on a fitted model.
-
fit
(data, key, features=None)¶ Fit the SOM model when given the training dataset.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
-
fit_predict
(data, key, features=None)¶ Fit the dataset and return the labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
The labels of the given data, structured as follows:
1st column: ID, with the same name and data type as the input table’s ID column; the IDs of the original tuples.
2nd column: BMU, int. Best match unit (BMU).
3rd column: DISTANCE, float. The distance between the tuple and its BMU.
4th column: SECOND_BMU, int. Second-best match unit.
5th column: IS_ADJACENT, int. Indicates whether the BMU and the second BMU are adjacent.
0: Not adjacent
1: Adjacent
-
predict
(data, key, features=None)¶ Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
- Parameters
data : DataFrame
Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().
key : str
Name of ID column.
features : list of str, optional.
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, type int, representing the cluster the data point is assigned to.
DISTANCE, type DOUBLE, representing the distance between the data point and its best match unit.
hana_ml.algorithms.pal.stats¶
This module contains Python wrappers for statistics algorithms.
The following functions are available:
-
hana_ml.algorithms.pal.stats.
chi_squared_goodness_of_fit
(conn_context, data, key, observed_data=None, expected_freq=None)¶ Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.
- Parameters
conn_context : ConnectionContext
Database connection object.
data : DataFrame
Input data.
key : str
Name of the ID column.
observed_data : str, optional
Name of column for counts of actual observations belonging to each category. If not given, the input dataframe must only have three columns. The first of the non-ID columns will be observed_data.
expected_freq : str, optional
Name of the expected frequency column. If not given, the input dataframe must only have three columns. The second of the non-ID columns will be expected_freq.
- Returns
DataFrame
Comparison between the actual counts and the expected counts, structured as follows:
ID column, with same name and type as data’s ID column.
Observed data column, with same name as data’s observed_data column, but always with type DOUBLE.
EXPECTED, type DOUBLE, expected count in each category.
RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
STAT_NAME, type NVARCHAR(100), name of statistics.
STAT_VALUE, type DOUBLE, value of statistics.
Examples
Data to test:
>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1
Perform chi_squared_goodness_of_fit:
>>> res, stat = chi_squared_goodness_of_fit(conn_context=conn, data=df, key='ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
-
hana_ml.algorithms.pal.stats.
chi_squared_independence
(conn_context, data, key, observed_data=None, correction=False)¶ Perform the chi-squared test of independence to tell whether observations of two variables are independent of each other.
- Parameters
conn_context : ConnectionContext
Database connection object.
data : DataFrame
Input data.
key : str
Name of the ID column.
observed_data : list of str, optional
Names of the observed data columns. If not given, it defaults to all the non-ID columns.
correction : bool, optional
If True, and the degrees of freedom is 1, apply Yates’s correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.
Defaults to False.
- Returns
DataFrame
The expected count table, structured as follows:
ID column, with same name and type as data’s ID column.
Expected count columns, named by prepending Expected_ to each observed_data column name, type DOUBLE. There will be as many columns here as there are observed_data columns.
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
STAT_NAME, type NVARCHAR(100), name of statistics.
STAT_VALUE, type DOUBLE, value of statistics.
Examples
Data to test:
>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0
Perform chi-squared test of independence:
>>> res, stats = chi_squared_independence(conn_context=conn, data=df, key='ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
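When the contingency table has one degree of freedom, Yates’s correction can be requested via correction=True; a sketch (df_2x2 is an assumed two-column observed table, not shown here):
>>> res_c, stats_c = chi_squared_independence(conn_context=conn, data=df_2x2,
...                                           key='ID', correction=True)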
-
hana_ml.algorithms.pal.stats.
ttest_1samp
(conn_context, data, col=None, mu=0, test_type='two_sides', conf_level=0.95)¶ Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.
- Parameters
conn_context : ConnectionContext
Database connection object.
data : DataFrame
DataFrame containing the data.
col : str, optional
Name of the column for sample. If not given, the input dataframe must only have one column.
mu : float, optional
Hypothesized mean of the population underlying the sample.
Defaults to 0.
test_type : {‘two_sides’, ‘less’, ‘greater’}, optional
The alternative hypothesis type.
Defaults to ‘two_sides’.
conf_level : float, optional
Confidence level for alternative hypothesis confidence interval.
Defaults to 0.95.
- Returns
DataFrame
DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1
0  1.0
1  2.0
2  4.0
3  7.0
4  3.0
Perform the one-sample t-test:
>>> ttest_1samp(conn_context=conn, data=df).collect()
           STAT_NAME  STAT_VALUE
0            t-value    3.302372
1  degree of freedom    4.000000
2            p-value    0.029867
3      _PAL_MEAN_X1_    3.400000
4   confidence level    0.950000
5         lowerLimit    0.541475
6         upperLimit    6.258525
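One-sided alternatives only require changing test_type; a sketch testing whether the population mean exceeds 2 at a 99% confidence level:
>>> ttest_1samp(conn_context=conn, data=df, col='X1', mu=2,
...             test_type='greater', conf_level=0.99).collect()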
-
hana_ml.algorithms.pal.stats.
ttest_ind
(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', var_equal=False, conf_level=0.95)¶ Perform the t-test for the mean difference of two independent samples.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
DataFrame containing the data.
col1 : str, optional
Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of the columns will be col1.
col2 : str, optional
Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the columns will be col2.
mu : float, optional
Hypothesized difference between the two underlying population means.
Defaults to 0.
test_type : {‘two_sides’, ‘less’, ‘greater’}, optional
The alternative hypothesis type.
Defaults to ‘two_sides’.
var_equal : bool, optional
Controls whether to assume that the two samples have equal variance.
Defaults to False.
conf_level : float, optional
Confidence level for alternative hypothesis confidence interval.
Defaults to 0.95.
- Returns
DataFrame
DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  NaN  10.0
Perform the independent-sample t-test:
>>> ttest_ind(conn_context=conn, data=df).collect()
           STAT_NAME  STAT_VALUE
0            t-value   -5.013774
1  degree of freedom    5.649757
2            p-value    0.002875
3      _PAL_MEAN_X1_    3.500000
4      _PAL_MEAN_X2_   11.600000
5   confidence level    0.950000
6         lowerLimit  -12.113278
7         upperLimit   -4.086722
-
hana_ml.algorithms.pal.stats.
ttest_paired
(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', conf_level=0.95)¶ Perform the t-test for the mean difference of two sets of paired samples.
- Parameters
conn_context : ConnectionContext
Database connection object.
data : DataFrame
DataFrame containing the data.
col1 : str, optional
Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of two columns will be col1.
col2 : str, optional
Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the two columns will be col2.
mu : float, optional
Hypothesized difference between two underlying population means.
Defaults to 0.
test_type : {‘two_sides’, ‘less’, ‘greater’}, optional
The alternative hypothesis type.
Defaults to ‘two_sides’.
conf_level : float, optional
Confidence level for alternative hypothesis confidence interval.
Defaults to 0.95.
- Returns
DataFrame
DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  3.0  10.0
Perform the paired-sample t-test:
>>> ttest_paired(conn_context=conn, data=df).collect()
                STAT_NAME  STAT_VALUE
0                 t-value  -14.062884
1       degree of freedom    4.000000
2                 p-value    0.000148
3  _PAL_MEAN_DIFFERENCES_   -8.200000
4        confidence level    0.950000
5              lowerLimit   -9.818932
6              upperLimit   -6.581068
-
hana_ml.algorithms.pal.stats.
f_oneway
(conn_context, data, group=None, sample=None, multcomp_method=None, significance_level=None)¶ Performs a 1-way ANOVA.
The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
Input data.
group : str, optional
Name of the group column. If group is not provided, defaults to the first column.
sample : str, optional
Name of the sample measurement column. If sample is not provided, data must have exactly 1 non-group column, and sample defaults to that column.
multcomp_method : {‘tukey-kramer’, ‘bonferroni’, ‘dunn-sidak’, ‘scheffe’, ‘fisher-lsd’}, str, optional
Method used to perform multiple comparison tests.
Defaults to ‘tukey-kramer’.
significance_level : float, optional
The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1.
Defaults to 0.05.
- Returns
DataFrame
Statistics for each group, structured as follows:
GROUP, type NVARCHAR(256), group name.
VALID_SAMPLES, type INTEGER, number of valid samples.
MEAN, type DOUBLE, group mean.
SD, type DOUBLE, group standard deviation.
Computed results for ANOVA, structured as follows:
VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.
SUM_OF_SQUARES, type DOUBLE, sum of squares.
DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
MEAN_SQUARES, type DOUBLE, mean squares.
F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
P_VALUE, type DOUBLE, associated p-value from the F-distribution.
Multiple comparison results, structured as follows:
FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
SE, type DOUBLE, standard error computed from all data.
P_VALUE, type DOUBLE, p-value.
CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
Examples
Samples for One Way ANOVA test:
>>> df.collect()
   GROUP  DATA
0      A   4.0
1      A   5.0
2      A   4.0
3      A   3.0
4      A   2.0
5      A   4.0
6      A   3.0
7      A   4.0
8      B   6.0
9      B   8.0
10     B   4.0
11     B   5.0
12     B   4.0
13     B   6.0
14     B   5.0
15     B   8.0
16     C   6.0
17     C   7.0
18     C   6.0
19     C   6.0
20     C   7.0
21     C   5.0
Perform one-way ANOVA test:
>>> stats, anova, mult_comp = f_oneway(conn_context=conn, data=df,
...                                    multcomp_method='Tukey-Kramer',
...                                    significance_level=0.05)
Outputs:
>>> stats.collect()
   GROUP  VALID_SAMPLES      MEAN        SD
0      A              8  3.625000  0.916125
1      B              8  5.750000  1.581139
2      C              6  6.166667  0.752773
3  Total             22  5.090909  1.600866
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group       27.609848                 2.0     13.804924
1              Error       26.208333                19.0      1.379386
2              Total       53.818182                21.0           NaN
     F_RATIO   P_VALUE
0  10.008021  0.001075
1        NaN       NaN
2        NaN       NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0           A            B        -2.125000  0.587236  0.004960 -3.616845
1           A            C        -2.541667  0.634288  0.002077 -4.153043
2           B            C        -0.416667  0.634288  0.790765 -2.028043
   CI_UPPER
0 -0.633155
1 -0.930290
2  1.194710
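Swapping the multiple-comparison method changes only the pairwise-comparison output; a sketch with Bonferroni-adjusted tests on the same data:
>>> stats, anova, mult_comp = f_oneway(conn_context=conn, data=df,
...                                    multcomp_method='bonferroni',
...                                    significance_level=0.05)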
-
hana_ml.algorithms.pal.stats.
f_oneway_repeated
(conn_context, data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)¶ Performs one-way repeated measures analysis of variance, along with Mauchly’s Test of Sphericity and post hoc multiple comparison tests.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
Input data.
subject_id : str
Name of the subject ID column. The algorithm treats each row of the data table as a different subject. Hence there should be no duplicate subject IDs in this column.
measures : list of str, optional
Names of the groups (measures). If measures is not provided, defaults to all non-subject_id columns.
multcomp_method : {‘tukey-kramer’, ‘bonferroni’, ‘dunn-sidak’, ‘scheffe’, ‘fisher-lsd’}, optional
Method used to perform multiple comparison tests.
Defaults to ‘bonferroni’.
significance_level : float, optional
The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1.
Defaults to 0.05.
se_type : {‘all-data’, ‘two-group’}, optional
Type of standard error used in multiple comparison tests.
‘all-data’: computes the standard error from all data. It has more power if the assumption of sphericity is true, especially with small data sets.
‘two-group’: computes the standard error from only the two groups being compared. It doesn’t assume sphericity.
Defaults to ‘two-group’.
- Returns
DataFrame
Statistics for each group, structured as follows:
GROUP, type NVARCHAR(256), group name.
VALID_SAMPLES, type INTEGER, number of valid samples.
MEAN, type DOUBLE, group mean.
SD, type DOUBLE, group standard deviation.
Mauchly test results, structured as follows:
STAT_NAME, type NVARCHAR(100), names of test result quantities.
STAT_VALUE, type DOUBLE, values of test result quantities.
Computed results for ANOVA, structured as follows:
VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.
SUM_OF_SQUARES, type DOUBLE, sum of squares.
DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
MEAN_SQUARES, type DOUBLE, mean squares.
F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
P_VALUE, type DOUBLE, associated p-value from the F-distribution.
P_VALUE_GG, type DOUBLE, p-value of Greenhouse-Geisser correction.
P_VALUE_HF, type DOUBLE, p-value of Huynh-Feldt correction.
P_VALUE_LB, type DOUBLE, p-value of lower bound correction.
Multiple comparison results, structured as follows:
FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
SE, type DOUBLE, standard error computed from all data or from the two compared groups, depending on se_type.
P_VALUE, type DOUBLE, p-value.
CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
Examples
Samples for One Way Repeated ANOVA test:
>>> df.collect()
   ID  MEASURE1  MEASURE2  MEASURE3  MEASURE4
0   1       8.0       7.0       1.0       6.0
1   2       9.0       5.0       2.0       5.0
2   3       6.0       2.0       3.0       8.0
3   4       5.0       3.0       1.0       9.0
4   5       8.0       4.0       5.0       8.0
5   6       7.0       5.0       6.0       7.0
6   7      10.0       2.0       7.0       2.0
7   8      12.0       6.0       8.0       1.0
Perform one-way repeated measures ANOVA test:
>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     conn_context=conn,
...     data=df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')
Outputs:
>>> stats.collect()
      GROUP  VALID_SAMPLES   MEAN        SD
0  MEASURE1              8  8.125  2.232071
1  MEASURE2              8  4.250  1.832251
2  MEASURE3              8  4.125  2.748376
3  MEASURE4              8  5.750  2.915476
>>> mtest.collect()
                    STAT_NAME  STAT_VALUE
0                 Mauchly's W    0.136248
1                  Chi-Square   11.405981
2                          df    5.000000
3                      pValue    0.046773
4  Greenhouse-Geisser Epsilon    0.532846
5         Huynh-Feldt Epsilon    0.665764
6         Lower bound Epsilon    0.333333
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group          83.125                 3.0     27.708333
1            Subject          17.375                 7.0      2.482143
2              Error         153.375                21.0      7.303571
    F_RATIO  P_VALUE  P_VALUE_GG  P_VALUE_HF  P_VALUE_LB
0  3.793806  0.02557    0.062584    0.048331    0.092471
1       NaN      NaN         NaN         NaN         NaN
2       NaN      NaN         NaN         NaN         NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0    MEASURE1     MEASURE2            3.875  0.811469  0.012140  0.924655
1    MEASURE1     MEASURE3            4.000  0.731925  0.005645  1.338861
2    MEASURE1     MEASURE4            2.375  1.792220  1.000000 -4.141168
3    MEASURE2     MEASURE3            0.125  1.201747  1.000000 -4.244322
4    MEASURE2     MEASURE4           -1.500  1.336306  1.000000 -6.358552
5    MEASURE3     MEASURE4           -1.625  1.821866  1.000000 -8.248955
   CI_UPPER
0  6.825345
1  6.661139
2  8.891168
3  4.494322
4  3.358552
5  4.998955
-
hana_ml.algorithms.pal.stats.
univariate_analysis
(conn_context, data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)¶ Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
Input data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
cols : list of str, optional
List of column names to analyze. If cols is not provided, it defaults to all non-ID columns.
categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. By default, INTEGER columns are treated as continuous.
significance_level : float, optional
The significance level when the function calculates the confidence interval of the sample mean. Values must be greater than 0 and less than 1.
Defaults to 0.05.
trimmed_percentage : float, optional
The ratio of data at both head and tail that will be dropped in the process of calculating the trimmed mean. Value range is from 0 to 0.5.
Defaults to 0.05.
- Returns
DataFrame
Statistics for continuous variables, structured as follows:
VARIABLE_NAME, type NVARCHAR(256), variable names.
STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).
STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
Statistics for categorical variables, structured as follows:
VARIABLE_NAME, type NVARCHAR(256), variable names.
CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.
STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).
STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
Examples
Dataset to be analyzed:
>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B
Perform univariate analysis:
>>> continuous, categorical = univariate_analysis(
...     conn_context=conn,
...     data=df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)
Outputs:
>>> continuous.collect()
   VARIABLE_NAME                 STAT_NAME   STAT_VALUE
0             X1        valid observations     6.000000
1             X1                       min   -10.200000
2             X1            lower quartile     1.200000
3             X1                    median     3.850000
4             X1            upper quartile     8.500000
5             X1                       max   100.000000
6             X1                      mean    17.866667
7             X1  CI for mean, lower bound   -24.879549
8             X1  CI for mean, upper bound    60.612883
9             X1              trimmed mean     4.350000
10            X1                  variance  1659.142667
11            X1        standard deviation    40.732575
12            X1                  skewness     1.688495
13            X1                  kurtosis     1.036148
14            X2        valid observations     0.000000
>>> categorical.collect()
   VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
0             X3  __PAL_NULL__          count    0.000000
1             X3  __PAL_NULL__  percentage(%)    0.000000
2             X3             1          count    1.000000
3             X3             1  percentage(%)   16.666667
4             X3             2          count    3.000000
5             X3             2  percentage(%)   50.000000
6             X3             3          count    2.000000
7             X3             3  percentage(%)   33.333333
8             X4  __PAL_NULL__          count    0.000000
9             X4  __PAL_NULL__  percentage(%)    0.000000
10            X4             A          count    3.000000
11            X4             A  percentage(%)   50.000000
12            X4             B          count    1.000000
13            X4             B  percentage(%)   16.666667
14            X4             C          count    2.000000
15            X4             C  percentage(%)   33.333333
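The trimmed mean reported above can be cross-checked locally. The following sketch (for illustration only; it assumes scipy is installed and is not part of hana_ml) uses scipy.stats.trim_mean, which drops floor(n * proportion) observations from each tail before averaging:
>>> from scipy.stats import trim_mean
>>> trim_mean([1.2, 2.5, 5.2, -10.2, 8.5, 100.0], 0.2)  # drops -10.2 and 100.0
4.35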
-
hana_ml.algorithms.pal.stats.
covariance_matrix
(conn_context, data, cols=None)¶ Computes the covariance matrix.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
Input data.
cols : list of str, optional
List of column names to analyze. If cols is not provided, it defaults to all columns.
- Returns
DataFrame
Covariance between any two data samples (columns).
ID, type NVARCHAR. The values of this column are the column names from cols.
Covariance columns, type DOUBLE, named after the columns in cols. The covariance between variables X and Y is in column X, in the row with ID value Y.
Examples
Dataset to be analyzed:
>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8
Compute the covariance matrix:
>>> result = covariance_matrix(conn_context=conn, data=df)
Outputs:
>>> result.collect()
  ID          X           Y
0  X  31.866667   44.473333
1  Y  44.473333  176.677667
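As a local sanity check (illustration only, not part of hana_ml), pandas reproduces the same matrix, since its cov() also computes the sample covariance with an n-1 denominator:
>>> import pandas as pd
>>> local = pd.DataFrame({'X': [1, 5, 3, 10, -4, 11],
...                       'Y': [2.4, 3.5, 8.9, -1.4, -3.5, 32.8]})
>>> local.cov()
           X           Y
X  31.866667   44.473333
Y  44.473333  176.677667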
-
hana_ml.algorithms.pal.stats.
pearsonr_matrix
(conn_context, data, cols=None)¶ Computes a correlation matrix using Pearson’s correlation coefficient.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
Input data.
cols : list of str, optional
List of column names to analyze. If cols is not provided, it defaults to all columns.
- Returns
DataFrame
Pearson’s correlation coefficient between any two data samples (columns).
ID, type NVARCHAR. The values of this column are the column names from cols.
Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.
Examples
Dataset to be analyzed:
>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8
Compute the Pearson’s correlation coefficient matrix:
>>> result = pearsonr_matrix(conn_context=conn, data=df)
Outputs:
>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
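The same coefficient can be reproduced locally (illustration only, not part of hana_ml) with pandas, whose corr() defaults to the Pearson method:
>>> import pandas as pd
>>> local = pd.DataFrame({'X': [1, 5, 3, 10, -4, 11],
...                       'Y': [2.4, 3.5, 8.9, -1.4, -3.5, 32.8]})
>>> local.corr()
          X         Y
X  1.000000  0.592708
Y  0.592708  1.000000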
-
hana_ml.algorithms.pal.stats.
iqr
(conn_context, data, key, col=None, multiplier=None)¶ Perform the inter-quartile range (IQR) test to find the outliers of the data. The inter-quartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1) of the data. Data points are marked as outliers if they fall outside the range from Q1 - multiplier * IQR to Q3 + multiplier * IQR.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
col : str, optional
Name of the data column that needs to be tested. If not given, the input dataframe must have exactly two columns, including the ID column, and the non-ID column is used as col.
multiplier : float, optional
The multiplier used to calculate the value range during the IQR test:
Upper-bound = Q3 + multiplier * IQR,
Lower-bound = Q1 - multiplier * IQR,
where Q1 is the 25th percentile and Q3 is the 75th percentile.
Defaults to 1.5.
- Returns
DataFrame
Test results, structured as follows:
ID, with the same name and type as data's ID column.
IS_OUT_OF_RANGE, type INTEGER, containing the test results from the IQR test that determine whether each data sample is in the range or not:
0: a value is in the range.
1: a value is out of range.
Statistical outputs, including Upper-bound and Lower-bound from the IQR test, structured as follows:
STAT_NAME, type NVARCHAR(256), statistics name.
STAT_VALUE, type DOUBLE, statistics value.
Examples
Original data:
>>> df.collect()
     ID   VAL
0    P1  10.0
1    P2  11.0
2    P3  10.0
3    P4   9.0
4    P5  10.0
5    P6  24.0
6    P7  11.0
7    P8  12.0
8    P9  10.0
9   P10   9.0
10  P11   1.0
11  P12  11.0
12  P13  12.0
13  P14  13.0
14  P15  12.0
Perform the IQR test:
>>> res, stat = iqr(conn_context=conn, data=df, key='ID', col='VAL',
...                 multiplier=1.5)
>>> res.collect()
     ID  IS_OUT_OF_RANGE
0    P1                0
1    P2                0
2    P3                0
3    P4                0
4    P5                0
5    P6                1
6    P7                0
7    P8                0
8    P9                0
9   P10                0
10  P11                1
11  P12                0
12  P13                0
13  P14                0
14  P15                0
>>> stat.collect()
        STAT_NAME  STAT_VALUE
0  lower quartile        10.0
1  upper quartile        12.0
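These results can be verified by hand: with lower quartile Q1 = 10.0 and upper quartile Q3 = 12.0, IQR = 2.0, so the accepted range is [10.0 - 1.5 * 2.0, 12.0 + 1.5 * 2.0] = [7.0, 15.0]. Only P6 (24.0) and P11 (1.0) fall outside this range, which is why they are the two rows flagged with IS_OUT_OF_RANGE = 1.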
hana_ml.algorithms.pal.svm¶
This module contains PAL wrapper for Support Vector Machine algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.svm.
SVC
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Classification.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
c : float, optional
Trade-off between training error and margin. Value range > 0.
Defaults to 100.0.
kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to ‘rbf’.
degree : int, optional
Coefficient for the ‘poly’ kernel type. Value range >= 1.
Defaults to 3.
gamma : float, optional
Coefficient for the ‘rbf’ kernel type.
Defaults to 1.0/number of features in the dataset.
Only valid when kernel is ‘rbf’.
coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
coef_const : float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
probability : bool, optional
If True, output probability during prediction.
Defaults to False.
shrink : bool, optional
If True, use shrink strategy.
Defaults to True.
tol : float, optional
Specifies the error tolerance in the training process. Value range > 0.
Defaults to 0.001.
evaluation_seed : int, optional
The random seed in parameter selection. Value range >= 0.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
‘no’ : No scale.
‘standardization’ : Transforms the data to have zero mean and unit variance.
‘rescale’ : Rescales the range of the features to scale the range in [-1,1].
Defaults to ‘standardization’.
handle_missing : bool, optional
Whether to handle missing values:
False: No,
True: Yes.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated categorical.
category_weight : float, optional
Represents the weight of category attributes. Value range > 0.
Defaults to 0.707.
Examples
Training data:
>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4  LABEL
0   0         1.0        10.0       100.0          A      1
1   1         1.1        10.1       100.0          A      1
2   2         1.2        10.2       100.0          A      1
3   3         1.3        10.4       100.0          A      1
4   4         1.2        10.3       100.0         AB      1
5   5         4.0        40.0       400.0         AB      2
6   6         4.1        40.1       400.0         AB      2
7   7         4.2        40.2       400.0         AB      2
8   8         4.3        40.4       400.0         AB      2
9   9         4.2        40.3       400.0         AB      2
Create SVC instance and call fit:
>>> svc = svm.SVC(connection_context, gamma=0.005, handle_missing=False)
>>> svc.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                        'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> df_predict = connection_context.table("SVC_PREDICT_DATA_TBL")
>>> df_predict.collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.2        10.2       100.0          A
2   2         4.1        40.1       400.0         AB
3   3         4.2        40.3       400.0         AB
4   4         9.1        90.1       900.0          A
5   5         9.2        90.2       900.0          A
6   6         4.0        40.0       400.0          A
Call predict:
>>> res = svc.predict(df_predict, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                                      'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> res.collect()
   ID SCORE PROBABILITY
0   0     1        None
1   1     1        None
2   2     2        None
3   3     2        None
4   4     3        None
5   5     3        None
6   6     2        None
Attributes
model_
(DataFrame) Model content.
stat_
(DataFrame) Statistics content.
Methods
fit
(data[, key, features, label, …])Fit the model when given training dataset and other attributes.
predict
(data, key[, features, verbose])Predict the dataset using the trained model.
score
(data, key[, features, label])Returns the accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Fit the model when given training dataset and other attributes.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, verbose=False)¶ Predict the dataset using the trained model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
verbose : bool, optional
If True, output scoring probabilities for each class. Only applicable when probability is set to True during instance creation.
Defaults to False.
- Returns
DataFrame
Predict result, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type NVARCHAR(100), prediction value.
PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
-
score
(data, key, features=None, label=None)¶ Returns the accuracy on the given test data and labels.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
- Returns
float
Scalar accuracy value comparing the predicted result and original label.
-
class
hana_ml.algorithms.pal.svm.
SVR
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, scale_label=None, handle_missing=True, categorical_variable=None, category_weight=None, regression_eps=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Regression.
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
c : float, optional
Trade-off between training error and margin. Value range > 0.
Defaults to 100.0.
kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to ‘rbf’.
degree : int, optional
Coefficient for the ‘poly’ kernel type. Value range >= 1.
Defaults to 3.
gamma : float, optional
Coefficient for the ‘rbf’ kernel type.
Defaults to 1.0/number of features in the dataset.
Only valid when kernel is ‘rbf’.
coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
coef_const : float, optional
Coefficient for the poly/sigmoid kernel type.
Defaults to 0.
shrink : bool, optional
If True, use shrink strategy.
Defaults to True.
tol : float, optional
Specifies the error tolerance in the training process. Value range > 0.
Defaults to 0.001.
evaluation_seed : int, optional
The random seed in parameter selection. Value range >= 0.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
‘no’ : No scale.
‘standardization’ : Transforms the data to have zero mean and unit variance.
‘rescale’ : Rescales the range of the features to scale the range in [-1,1].
Defaults to ‘standardization’.
scale_label : bool, optional
If True, standardize the label for SVR. Only applicable when scale_info is ‘standardization’.
Defaults to True.
handle_missing : bool, optional
Whether to handle missing values:
False: No,
True: Yes.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
category_weight : float, optional
Represents the weight of category attributes. Value range > 0.
Defaults to 0.707.
regression_eps : float, optional
Epsilon width of tube for regression.
Defaults to 0.1.
Examples
Training data:
>>> df_fit.collect()
    ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5       VALUE
0    0    0.788606    0.787308   -1.301485    1.226053   -0.533385   95.626483
1    1    0.414869   -0.381038   -0.719309    1.603499    1.557837  162.582000
2    2    0.236282   -1.118764    0.233341   -0.698410    0.387380  -56.564303
3    3   -0.087779   -0.462372   -0.038412   -0.552897    1.231209  -32.241614
4    4   -0.476389    1.836772   -0.292337   -1.364599    1.326768 -143.240878
5    5    0.523326    0.065154   -1.513822    0.498921   -0.590686   -5.237827
6    6   -1.425838   -0.900437   -0.672299    0.646424    0.508856  -43.005837
7    7   -1.601836    0.455530    0.438217   -0.860707   -0.338282 -126.389824
8    8    0.266698   -0.725057    0.462189    0.868752   -1.542683   46.633594
9    9   -0.772496   -2.192955    0.822904   -1.125882   -0.946846 -175.356260
10  10    0.492364   -0.654237   -0.226986   -0.387156   -0.585063  -49.213910
11  11    0.378409   -1.544976    0.622448   -0.098902    1.437910   34.788276
12  12    0.317183    0.473067   -1.027916    0.549077    0.013483   32.845141
13  13    1.340660   -1.082651    0.730509   -0.944931    0.351025   -6.500411
14  14    0.736456    1.649251    1.334451   -0.530776    0.280830   87.451863
Create SVR instance and call fit:
>>> svr = svm.SVR(conn, kernel='linear', scale_info='standardization',
...               scale_label=True, handle_missing=False)
>>> svr.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                        'ATTRIBUTE4', 'ATTRIBUTE5'])
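A minimal continuation of this example (a sketch, not verified output; it assumes a df_predict DataFrame with the same five attribute columns and reuses df_fit for scoring):
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...             'ATTRIBUTE4', 'ATTRIBUTE5']
>>> res = svr.predict(df_predict, 'ID', features)  # returns ID, SCORE, PROBABILITY
>>> r2 = svr.score(df_fit, 'ID', features, 'VALUE')  # R^2 against the VALUE label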
Attributes
model_
(DataFrame) Model content.
stat_
(DataFrame) Statistics content.
Methods
fit
(data, key[, features, label, …])Fit the model when given training dataset and other attributes.
predict
(data, key[, features])Predict the dataset using the trained model.
score
(data, key[, features, label])Returns the coefficient of determination R^2 of the prediction.
-
fit
(data, key, features=None, label=None, categorical_variable=None)¶ Fit the model when given training dataset and other attributes.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None)¶ Predict the dataset using the trained model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Predict result, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type NVARCHAR(100), prediction value.
PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R^2 of the prediction.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
- Returns
float
Returns the coefficient of determination R2 of the prediction.
-
class
hana_ml.algorithms.pal.svm.
SVRanking
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Ranking
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
c : float, optional
Trade-off between training error and margin. Value range > 0.
Defaults to 100.
kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to ‘rbf’.
degree : int, optional
Coefficient for the ‘poly’ kernel type. Value range >= 1.
Defaults to 3.
gamma : float, optional
Coefficient for the ‘rbf’ kernel type.
Defaults to 1.0/number of features in the dataset.
Only valid when kernel is ‘rbf’.
coef_lin : float, optional
Coefficient for the ‘poly’/’sigmoid’ kernel type.
Defaults to 0.
coef_const : float, optional
Coefficient for the ‘poly’/’sigmoid’ kernel type.
Defaults to 0.
probability : bool, optional
If True, output probability during prediction.
Defaults to False.
shrink : bool, optional
If True, use shrink strategy.
Defaults to True.
tol : float, optional
Specifies the error tolerance in the training process. Value range > 0.
Defaults to 0.001.
evaluation_seed : int, optional
The random seed in parameter selection. Value range >= 0.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
‘no’ : No scale.
‘standardization’ : Transforms the data to have zero mean and unit variance.
‘rescale’ : Rescales the range of the features to scale the range in [-1,1].
Defaults to ‘standardization’.
handle_missing : bool, optional
Whether to handle missing values:
False: No,
True: Yes.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
category_weight : float, optional
Represents the weight of category attributes. Value range > 0.
Defaults to 0.707.
Examples
Training data:
>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID  LABEL
0   0         1.0         1.0         0.0         0.2         0.0  qid:1      3
1   1         0.0         0.0         1.0         0.1         1.0  qid:1      2
2   2         0.0         0.0         1.0         0.3         0.0  qid:1      1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1      4
4   4         3.0         1.0         1.0         0.4         1.0  qid:1      5
5   5         4.0         1.0         1.0         0.7         0.0  qid:1      6
6   6         0.0         0.0         1.0         0.2         0.0  qid:2      1
7   7         1.0         0.0         1.0         0.4         0.0  qid:2      2
8   8         0.0         0.0         1.0         0.2         0.0  qid:2      1
9   9         1.0         1.0         1.0         0.2         0.0  qid:2      3
Create SVRanking instance and call fit:
>>> svranking = svm.SVRanking(conn, gamma=0.005)
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4',
...             'ATTRIBUTE5']
>>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')
Call predict:
>>> df_predict = conn.table("DATA_TBL_SVRANKING_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID
0   0         1.0         1.0         0.0         0.2         0.0  qid:1
1   1         0.0         0.0         1.0         0.1         1.0  qid:1
2   2         0.0         0.0         1.0         0.3         0.0  qid:1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1
4   4         3.0         1.0         1.0         0.4         1.0  qid:1
5   5         4.0         1.0         1.0         0.7         0.0  qid:1
6   6         0.0         0.0         1.0         0.2         0.0  qid:4
7   7         1.0         0.0         1.0         0.4         0.0  qid:4
8   8         0.0         0.0         1.0         0.2         0.0  qid:4
9   9         1.0         1.0         1.0         0.2         0.0  qid:4
>>> svranking.predict(df_predict, key='ID',
...                   features=features, qid='QID').head(10).collect()
   ID     SCORE PROBABILITY
0   0  -9.85138        None
1   1  -10.8657        None
2   2  -11.6741        None
3   3  -9.33985        None
4   4  -7.88839        None
5   5   -6.8842        None
6   6  -11.7081        None
7   7  -10.8003        None
8   8  -11.7081        None
9   9  -10.2583        None
Attributes
model_
(DataFrame) Model content.
stat_
(DataFrame) Statistics content.
Note:
PAL will throw an error if probability=True is provided to the SVRanking constructor and verbose=True is not provided to predict(). This is a known bug.
Methods
fit
(data, key, qid[, features, label, …])Fit the model when given training dataset and other attributes.
predict
(data, key, qid[, features, verbose])Predict the dataset using the trained model.
-
fit
(data, key, qid, features=None, label=None, categorical_variable=None)¶ Fit the model when given training dataset and other attributes.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
qid : str
Name of the qid column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label, non-qid columns.
label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
categorical_variable : str or list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, qid, features=None, verbose=False)¶ Predict the dataset using the trained model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
qid : str
Name of the qid column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-qid columns.
verbose : bool, optional
If True, output scoring probabilities for each class.
Defaults to False.
- Returns
DataFrame
Predict result, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type NVARCHAR(100), prediction value.
PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
-
class
hana_ml.algorithms.pal.svm.
OneClassSVM
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, nu=None, scale_info=None, handle_missing=True, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
One Class SVM
- Parameters
conn_context : ConnectionContext
Connection to the SAP HANA system.
c : float, optional
Trade-off between training error and margin. Value range > 0.
Defaults to 100.0.
kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to ‘rbf’.
degree : int, optional
Coefficient for the poly kernel type. Value range >= 1.
Defaults to 3.
gamma : float, optional
Coefficient for the ‘rbf’ kernel type.
Defaults to 1.0/number of features in the dataset.
Only valid when kernel is ‘rbf’.
coef_lin : float, optional
Coefficient for the ‘poly’/’sigmoid’ kernel type.
Defaults to 0.
coef_const : float, optional
Coefficient for the ‘poly’/’sigmoid’ kernel type.
Defaults to 0.
shrink : bool, optional
If True, use shrink strategy.
Defaults to True.
tol : float, optional
Specifies the error tolerance in the training process.
Value range > 0.
Defaults to 0.001.
evaluation_seed : int, optional
The random seed in parameter selection.
Value range >= 0.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
nu : float, optional
The value for both the upper bound of the fraction of training errors and the lower bound of the fraction of support vectors.
Defaults to 0.5.
scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
‘no’ : No scale.
‘standardization’ : Transforms the data to have zero mean and unit variance.
‘rescale’ : Rescales the range of the features to scale the range in [-1,1].
Defaults to ‘standardization’.
handle_missing : bool, optional
Whether to handle missing values:
False: No,
True: Yes.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
category_weight : float, optional
Represents the weight of category attributes. Value range > 0.
Defaults to 0.707.
Examples
Training data:
>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB
Create OneClassSVM instance and call fit:
>>> svc_one = svm.OneClassSVM(conn, scale_info='no', category_weight=1)
>>> svc_one.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                            'ATTRIBUTE4'])
>>> df_predict = conn.table("DATA_TBL_SVC_ONE_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...             'ATTRIBUTE4']
Call predict:
>>> svc_one.predict(df_predict, 'ID', features).head(10).collect()
   ID SCORE PROBABILITY
0   0    -1        None
1   1     1        None
2   2     1        None
3   3    -1        None
4   4    -1        None
5   5    -1        None
6   6    -1        None
7   7     1        None
8   8    -1        None
9   9    -1        None
Attributes
model_
(DataFrame) Model content.
stat_
(DataFrame) Statistics content.
Methods
fit
(data[, key, features, categorical_variable])Fit the model when given training dataset and other attributes.
predict
(data, key[, features])Predict the dataset using the trained model.
-
fit
(data, key=None, features=None, categorical_variable=None)¶ Fit the model when given training dataset and other attributes.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None)¶ Predict the dataset using the trained model.
- Parameters
data : DataFrame
DataFrame containing the data.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
- Returns
DataFrame
Predict result, structured as follows:
ID column, with the same name and type as data's ID column.
SCORE, type NVARCHAR(100), prediction value.
PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
hana_ml.algorithms.pal.trees¶
This module contains Python wrappers for PAL decision tree-based algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.trees.
RandomForestClassifier
(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, strata=None, priors=None)¶ Bases:
hana_ml.algorithms.pal.trees._RandomForestBase
Random forest model for classification.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in the random forest.
Defaults to 100.
max_features : int, optional
Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features.
Defaults to sqrt(p) (for classification) or p/3(for regression), where p is the number of input features.
max_depth : int, optional
The maximum depth of a tree.
By default it is unlimited.
min_samples_leaf : int, optional
Specifies the minimum number of records in a leaf.
Defaults to 1 for classification.
split_threshold : float, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.
calculate_oob : bool, optional
If True, calculate the out-of-bag error.
Defaults to True.
random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
By default, the number of threads is determined heuristically.
allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string columns are categorical; integer and float columns are continuous. Valid only for integer variables; ignored otherwise.
Default value detected from input data.
sample_fraction : float, optional
The fraction of data used for training. Assume there are n pieces of data, sample fraction is r, then n*r data is selected for training.
Defaults to 1.0.
strata : List of tuples: (class, fraction), optional
Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata. If strata is not provided, bagging is used instead of stratified sampling.
priors : List of tuples: (class, prior_prob), optional
Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in priors. If priors is not provided, it is determined by the proportion of every class in the training data. A usage sketch for strata and priors follows this parameter list.
Examples
Input dataframe for training:
>>> df1.head(4).collect()
  OUTLOOK  TEMP  HUMIDITY WINDY        LABEL
0   Sunny  75.0      70.0   Yes         Play
1   Sunny   NaN      90.0   Yes  Do not Play
2   Sunny  85.0       NaN    No  Do not Play
3   Sunny  72.0      95.0    No  Do not Play
Creating RandomForestClassifier instance:
>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=3,
...                              max_features=3, random_state=2,
...                              split_threshold=0.00001,
...                              calculate_oob=True,
...                              min_samples_leaf=1, thread_ratio=1.0)
Performing fit() on given dataframe:
>>> rfc.fit(data=df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> rfc.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0       OUTLOOK    0.449550
1          TEMP    0.216216
2      HUMIDITY    0.208108
3         WINDY    0.126126
Input dataframe for predicting:
>>> df2.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY
0   0  Overcast  75.0  -10000.0   Yes
1   1      Rain  78.0      70.0   Yes
Performing predict() on given dataframe:
>>> result = rfc.predict(data=df2, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   0  Play    0.666667
1   1  Play    0.666667
Input dataframe for scoring:
>>> df3.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
0   0     Sunny    70      90.0   Yes  Play
1   1  Overcast    81      90.0   Yes  Play
2   2      Rain    65      80.0    No  Play
Performing score() on given dataframe:
>>> rfc.score(df3, key='ID')
0.6666666666666666
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
oob_error_
(DataFrame) Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.
confusion_matrix_
(DataFrame) Confusion matrix used to evaluate the performance of classification algorithms.
Methods
fit
(data[, key, features, label, …])Train the model on input data.
predict
(data, key[, features, verbose, …])Predict dependent variable values based on fitted model.
score
(data, key[, features, label, …])Returns the mean accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, verbose=None, block_size=None, missing_replacement=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
block_size : int, optional
The number of rows loaded per time during prediction. 0 indicates load all data at once.
Defaults to 0.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to ‘feature_marginalized’.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted classes.
CONFIDENCE, type DOUBLE, representing the confidence of a class.
-
score
(data, key, features=None, label=None, block_size=None, missing_replacement=None)¶ Returns the mean accuracy on the given test data and labels.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
block_size : int, optional
The number of rows loaded per time during prediction. 0 indicates load all data at once.
Defaults to 0.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to ‘feature_marginalized’.
- Returns
float
Mean accuracy on the given test data and labels.
-
class
hana_ml.algorithms.pal.trees.
RandomForestRegressor
(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=None, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None)¶ Bases:
hana_ml.algorithms.pal.trees._RandomForestBase
Random forest model for regression.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in the random forest.
Defaults to 100.
max_features : int, optional
Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features.
Defaults to sqrt(p) (for classification) or p/3(for regression), where p is the number of input features.
max_depth : int, optional
The maximum depth of a tree.
By default it is unlimited.
min_samples_leaf : int, optional
Specifies the minimum number of records in a leaf.
Defaults to 5 for regression.
split_threshold : float, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.
calculate_oob : bool, optional
If True, calculate the out-of-bag error.
Defaults to True.
random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
By default, the number of threads is determined heuristically.
allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with a missing target is removed.
Defaults to True.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string: categorical, or integer and float: continuous. VALID only for integer variables; omitted otherwise.
Default value detected from input data.
sample_fraction : float, optional
The fraction of data used for training. Assume there are n pieces of data, sample fraction is r, then n*r data is selected for training.
Defaults to 1.0.
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D       CLASS
0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
Creating RandomForestRegressor instance:
>>> rfr = RandomForestRegressor(conn_context=cc, random_state=3)
Performing fit() on given dataframe:
>>> rfr.fit(data=df1, key='ID')
>>> rfr.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0             A    0.249593
1             B    0.381879
2             C    0.291403
3             D    0.077125
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.081277  0.204114  1.220580 -0.750665
1   1  0.524813 -0.012192 -0.418597  2.946886
Performing predict() on given dataframe:
>>> result = rfr.predict(data=df2, key='ID')
>>> result.collect()
   ID     SCORE  CONFIDENCE
0   0    48.126   62.952884
1   1  -10.9017   73.461039
Input dataframe for scoring:
>>> df3.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.081277  0.204114  1.220580 -0.750665  139.10170
1   1  0.524813 -0.012192 -0.418597  2.946886   52.17203
2   2 -0.280871  0.100554 -0.343715 -0.118843  -34.69829
3   3 -0.113992 -0.045573  0.957154  0.090350   51.93602
4   4  0.287476  1.266895  0.466325 -0.432323  106.63425
Performing score() on given dataframe:
>>> rfr.score(df3, key='ID')
0.6530768858159514
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
oob_error_
(DataFrame) Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.
Methods
fit
(data[, key, features, label, …])Train the model on input data.
predict
(data, key[, features, block_size, …])Predict dependent variable values based on fitted model.
score
(data, key[, features, label, …])Returns the coefficient of determination R2 of the prediction.
-
predict
(data, key, features=None, block_size=None, missing_replacement=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
block_size : int, optional
The number of rows loaded per time during prediction. 0 indicates load all data at once.
Defaults to 0.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to ‘feature_marginalized’.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted values.
CONFIDENCE, type DOUBLE, all 0s. It is included because PAL uses the same output table for classification.
-
score
(data, key, features=None, label=None, block_size=None, missing_replacement=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
block_size : int, optional
The number of rows loaded per time during prediction. 0 indicates load all data at once.
Defaults to 0.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to ‘feature_marginalized’.
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.
-
class
hana_ml.algorithms.pal.trees.
DecisionTreeClassifier
(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True)¶ Bases:
hana_ml.algorithms.pal.trees._DecisionTreeBase
Decision Tree model for classification.
- Parameters
conn_context : ConnectionContext
Database connection object.
algorithm : {‘c45’, ‘chaid’, ‘cart’}
Algorithm used to grow a decision tree. Case-insensitive.
‘c45’: C4.5 algorithm.
‘chaid’: Chi-square automatic interaction detection.
‘cart’: Classification and regression tree.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
percentage : float, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.
Defaults to 1.0.
min_records_of_parent : int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
min_records_of_leaf : int, optional
Promises the minimum number of records in a leaf.
Defaults to 1.
max_depth : int, optional
The maximum depth of a tree.
By default it is unlimited.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string columns are categorical; integer and float columns are continuous. Valid only for integer variables; ignored otherwise.
Default value detected from input data.
split_threshold : float, optional
Specifies the stop condition for a node:
‘c45’: The information gain ratio of the best split is less than this value.
‘chaid’: The p-value of the best split is greater than or equal to this value.
‘cart’: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the SPLIT_THRESHOLD value, the larger a ‘c45’ or ‘cart’ tree grows; ‘chaid’, by contrast, grows a larger tree with a larger SPLIT_THRESHOLD value.
Defaults to 1e-5 for ‘c45’ and ‘cart’, 0.05 for ‘chaid’.
discretization_type : {‘mdlpc’, ‘equal_freq’}, optional
Strategy for discretizing continuous attributes. Case-insensitive.
‘mdlpc’: Minimum description length principle criterion.
‘equal_freq’: Equal frequency discretization.
Valid only for ‘c45’ and ‘chaid’.
Defaults to ‘mdlpc’.
bins : List of tuples: (column name, number of bins), optional
Specifies the number of bins for discretization. Only valid when discretization_type is ‘equal_freq’. A usage sketch combining the ‘chaid’-specific parameters follows this parameter list.
Defaults to 10 for each column.
max_branch : int, optional
Specifies the maximum number of branches.
Valid only for ‘chaid’.
Defaults to 10.
merge_threshold : float, optional
Specifies the merge condition for ‘chaid’: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches.
Only valid for ‘chaid’.
Defaults to 0.05.
use_surrogate : bool, optional
If True, use surrogate split when NULL values are encountered. Only valid for ‘cart’.
Defaults to True.
model_format : {‘json’, ‘pmml’}, optional
Specifies the tree model format for store. Case-insensitive.
‘json’: export model in json format.
‘pmml’: export model in pmml format.
Defaults to ‘json’.
output_rules : bool, optional
If True, output decision rules.
Defaults to True.
priors : List of tuples: (class, prior_prob), optional
Specifies the prior probability of every class label.
Default value detected from data.
output_confusion_matrix : bool, optional
If True, output the confusion matrix.
Defaults to True.
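A hedged sketch combining the ‘chaid’-specific parameters (the column name 'TEMP' and the chosen values are illustrative assumptions, matching the example below):
>>> dtc_chaid = DecisionTreeClassifier(conn_context=cc, algorithm='chaid',
...                                    discretization_type='equal_freq',
...                                    bins=[('TEMP', 5)],
...                                    max_branch=4, merge_threshold=0.05)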
Examples
Input dataframe for training:
>>> df1.head(4).collect()
  OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
0   Sunny    75      70.0   Yes         Play
1   Sunny    80      90.0   Yes  Do not Play
2   Sunny    85      85.0    No  Do not Play
3   Sunny    72      95.0    No  Do not Play
Creating DecisionTreeClassifier instance:
>>> dtc = DecisionTreeClassifier(conn_context=cc, algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)
Performing fit() on given dataframe:
>>> dtc.fit(data=df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='CLASS')
>>> dtc.decision_rules_.collect()
   ROW_INDEX                                                     RULES_CONTENT
0          0                                         (TEMP>=84) => Do not Play
1          1                           (TEMP<84) && (OUTLOOK=Overcast) => Play
2          2           (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3          3   (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4          4         (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5          5                (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
Input dataframe for predicting:
>>> df2.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes
Performing predict() on given dataframe:
>>> result = dtc.predict(data=df2, key='ID', verbose=False)
>>> result.collect()
   ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308
Input dataframe for scoring:
>>> df3.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY        LABEL
0   0  Overcast      75.0    70   Yes         Play
1   1      Rain      78.0    70    No  Do not Play
2   2     Sunny      66.0    70   Yes         Play
3   3     Sunny      69.0    70   Yes         Play
Performing score() on given dataframe:
>>> dtc.score(df3, key='ID')
0.75
Attributes
model_
(DataFrame) Trained model content.
decision_rules_
(DataFrame) Rules for decision tree to make decisions. Set to None if output_rules is False.
confusion_matrix_
(DataFrame) Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.
Methods
fit
(data[, key, features, label, …])Function for building a decision tree classifier.
predict
(data, key[, features, verbose])Predict dependent variable values based on fitted model.
score
(data, key[, features, label])Returns the mean accuracy on the given test data and labels.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Function for building a decision tree classifier.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
score
(data, key, features=None, label=None)¶ Returns the mean accuracy on the given test data and labels.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
- Returns
float
Mean accuracy on the given test data and labels.
-
predict
(data, key, features=None, verbose=False)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.
Defaults to False.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted classes/values.
CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.
-
class
hana_ml.algorithms.pal.trees.
DecisionTreeRegressor
(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True)¶ Bases:
hana_ml.algorithms.pal.trees._DecisionTreeBase
Decision Tree model for regression.
- Parameters
conn_context : ConnectionContext
Database connection object.
algorithm : {‘cart’}
Algorithm used to grow a decision tree.
‘cart’: Classification and regression tree.
Currently only ‘cart’ is supported.
thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
False: Not allowed. An error occurs if a missing target is present.
True: Allowed. The datum with the missing target is removed.
Defaults to True.
percentage : float, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.
Defaults to 1.0.
min_records_of_parent : int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.
min_records_of_leaf : int, optional
Promises the minimum number of records in a leaf.
Defaults to 1.
max_depth : int, optional
The maximum depth of a tree.
By default it is unlimited.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. The default behavior is: string columns are categorical; integer and float columns are continuous.
Valid only for integer variables; omitted otherwise.
By default, the variable types are detected from the input data.
split_threshold : float, optional
Specifies the stop condition for a node:
‘cart’: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the SPLIT_THRESHOLD value is, the larger a ‘cart’ tree grows.
Defaults to 1e-5 for ‘cart’.
use_surrogate : bool, optional
If True, use surrogate split when NULL values are encountered. Only valid for ‘cart’.
Defaults to True.
model_format : {‘json’, ‘pmml’}, optional
Specifies the tree model format for store. Case-insensitive.
‘json’: export model in json format.
‘pmml’: export model in pmml format.
Defaults to ‘json’.
output_rules : bool, optional
If True, output decision rules.
Defaults to True.
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Creating DecisionTreeRegressor instance:
>>> dtr = DecisionTreeRegressor(conn_context=cc, algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)
Performing fit() on given dataframe:
>>> dtr.fit(data=df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                RULES_CONTENT
0          0   (A<-0.495502) && (B<-0.663588) => -85.8762
1          1  (A<-0.495502) && (B>=-0.663588) => -29.9827
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096
Performing predict() on given dataframe:
>>> result = dtr.predict(data=df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0
Input dataframe for scoring:
>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Performing score() on given dataframe:
>>> dtr.score(df3, key='ID')
0.9999999999900131
Attributes
model_
(DataFrame) Trained model content.
decision_rules_
(DataFrame) Rules for the decision tree to make decisions. Set to None if output_rules is False.
Methods
fit
(data[, key, features, label, …]) Train the model on input data.
predict
(data, key[, features, verbose]) Predict dependent variable values based on fitted model.
score
(data, key[, features, label]) Returns the coefficient of determination R2 of the prediction.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R2 of the prediction.
- Parameters
data : DataFrame
Data on which to assess model performance.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
- Returns
float
The coefficient of determination R2 of the prediction on the given data.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.
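For illustration, a hedged sketch of fit() with categorical_variable (the DataFrame df_train and its INTEGER column 'REGION' are hypothetical names, not part of the example above):
>>> dtr.fit(data=df_train, key='ID', label='CLASS',
...         categorical_variable='REGION')  # 'REGION' holds integer-coded categories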
-
predict
(data, key, features=None, verbose=False)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.
Defaults to False.
- Returns
DataFrame
- DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted classes/values.
CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.
-
class
hana_ml.algorithms.pal.trees.
GradientBoostingClassifier
(conn_context, n_estimators=10, subsample=None, max_depth=None, loss=None, split_threshold=None, learning_rate=None, fold_num=None, default_split_dir=None, min_sample_weight_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, scale_pos_w=None, base_score=None, cv_metric=None, ref_metric=None, categorical_variable=None, allow_missing_label=None, thread_ratio=None, cross_validation_range=None)¶ Bases:
hana_ml.algorithms.pal.trees._GradientBoostingBase
Gradient Boosting model for classification.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in Gradient Boosting.
Defaults to 10.
loss : str, optional
Type of loss function to be optimized. Supported values are ‘linear’ and ‘logistic’.
Defaults to ‘linear’.
max_depth : int, optional
The maximum depth of a tree.
Defaults to 6.
split_threshold : float, optional
Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.
learning_rate : float, optional
Learning rate of each iteration, must be within the range (0, 1].
Defaults to 0.3.
subsample : float, optional
The fraction of samples to be used for fitting each base learner.
Defaults to 1.0.
fold_num : int, optional
The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.
default_split_dir : int, optional
Default split direction for missing values. Valid input values are 0, 1 and 2, where:
0 - Automatically determined, 1 - Left, 2 - Right.
Defaults to 0.
min_sample_weight_leaf : float, optional
The minimum sample weights in leaf node.
Defaults to 1.0.
max_w_in_split : float, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
col_subsample_split : float, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.
col_subsample_tree : float, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.
lamb : float, optional
L2 regularization weight for the target loss function. Should be within range (0, 1].
Defaults to 1.0.
alpha : float, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
scale_pos_w : float, optional
The weight scaled to positive samples in regression.
Defaults to 1.0.
base_score : float, optional
Initial prediction score for all instances; this acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
cv_metric : {‘log_likelihood’, ‘multi_log_likelihood’, ‘error_rate’, ‘multi_error_rate’, ‘auc’}, optional
The metric used for cross-validation.
If multiple metrics are provided, only the first one is valid. If not set, it takes the first value (in alphabetical order) of the parameter ref_metric when the latter is set; otherwise it falls back to the defaults:
‘error_rate’ for binary classification,
‘multi_error_rate’ for multi-class classification.
ref_metric : str or list of str, optional
Specifies a reference metric or a list of reference metrics. The supported metrics are the same as for cv_metric. If not provided, defaults to:
[‘error_rate’] for binary classification,
[‘multi_error_rate’] for multi-class classification.
categorical_variable : str or list of str, optional
Specifies which variable(s) should be treated as categorical. Otherwise the default behavior is followed:
VARCHAR - categorical,
INTEGER and DOUBLE - continuous.
Only valid for INTEGER variables; omitted otherwise.
allow_missing_label : bool, optional
Specifies whether a missing label value is allowed.
False: not allowed. If missing labels are present in the input data, an error is thrown.
True: allowed. Data with a missing label are removed automatically.
thread_ratio : float, optional
The ratio of available threads used for training:
0: single thread;
(0,1]: percentage of available threads;
others : heuristically determined.
Defaults to -1.
cross_validation_range : list of tuples, optional
Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair: the parameter name (str) and a list of numbers of the form [<begin-value>, <test-numbers>, <end-value>].
Supported parameters for cross-validation:
n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.
A simple example for illustration:
[(‘n_estimators’, [4, 3, 10]),
(‘learning_rate’, [0.1, 3, 1.0]),
(‘split_threshold’, [0.1, 3, 1.0])]
Examples
Input dataframe for training:
>>> df.head(4).collect()
   ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
Creating Gradient Boosting Classifier:
>>> cv_range = [('learning_rate', [0.1, 1.0, 3]),
...             ('n_estimators', [4, 10, 3]),
...             ('split_threshold', [0.1, 1.0, 3])]
>>> gbc = GradientBoostingClassifier(conn_context=conn,
...                                  n_estimators=4,
...                                  split_threshold=0,
...                                  learning_rate=0.5,
...                                  fold_num=5,
...                                  max_depth=6,
...                                  cv_metric='error_rate',
...                                  ref_metric=['auc'],
...                                  cross_validation_range=cv_range)
Performing fit() on given dataframe:
>>> gbc.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'], label='LABEL')
>>> gbc.stats_.collect()
         STAT_NAME STAT_VALUE
0  ERROR_RATE_MEAN          0
1   ERROR_RATE_VAR          0
2         AUC_MEAN          1
Input dataframe for predicting:
>>> df1.head(4).collect()
   ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
Performing predict() on given dataframe:
>>> result = gbc.predict(data=df1, key='ID', verbose=False)
>>> result.head(4).collect()
   ID SCORE  CONFIDENCE
0   1     A    0.825556
1   2     A    0.825556
2   3     A    0.825556
3   4     A    0.825556
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
confusion_matrix_
(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.
stats_
(DataFrame) Statistics info for cross-validation.
cv_
(DataFrame) Best choice of parameter produced by cross-validation.
Methods
fit
(data[, key, features, label, …]) Train the model on input data.
predict
(data, key[, features, verbose]) Predict dependent variable values based on fitted model.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, verbose=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type NVARCHAR, representing the predicted classes.
CONFIDENCE, type DOUBLE, representing the confidence of a class.
-
class
hana_ml.algorithms.pal.trees.
GradientBoostingRegressor
(conn_context, n_estimators=10, subsample=None, max_depth=None, loss=None, split_threshold=None, learning_rate=None, fold_num=None, default_split_dir=None, min_sample_weight_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, scale_pos_w=None, base_score=None, cv_metric=None, ref_metric=None, categorical_variable=None, allow_missing_label=None, thread_ratio=None, cross_validation_range=None)¶ Bases:
hana_ml.algorithms.pal.trees._GradientBoostingBase
Gradient Boosting Tree model for regression.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in Gradient Boosting.
Defaults to 10.
loss : str, optional
Type of loss function to be optimized. Supported values are ‘linear’ and ‘logistic’.
Defaults to ‘linear’.
max_depth : int, optional
The maximum depth of a tree.
Defaults to 6.
split_threshold : float, optional
Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.
learning_rate : float, optional
Learning rate of each iteration, must be within the range (0, 1].
Defaults to 0.3.
subsample : float, optional
The fraction of samples to be used for fitting each base learner.
Defaults to 1.0.
fold_num : int, optional
The k-value for k-fold cross-validation.
default_split_dir : int, optional
Default split direction for missing values. Valid input values are 0, 1 and 2, where:
0 - Automatically determined,
1 - Left,
2 - Right.
Defaults to 0.
min_sample_weight_leaf : float, optional
The minimum sample weights in leaf node.
Defaults to 1.0.
max_w_in_split : float, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
col_subsample_split : float, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.
col_subsample_tree : float, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.
lamb : float, optional
L2 regularization weight for the target loss function. Should be within range (0, 1].
Defaults to 1.0.
alpha : float, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
scale_pos_w : float, optional
The weight scaled to positive samples in regression.
Defaults to 1.0.
base_score : float, optional
Initial prediction score for all instances; this acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
cv_metric : str, optional
The metric used for cross-validation. Supported metrics include: ‘rmse’ and ‘mae’. If multiple metrics are provided, only the first one is valid. If not set, it takes the first value (in alphabetical order) of the parameter ref_metric when the latter is set; otherwise it falls back to the default value.
Defaults to ‘mae’.
ref_metric : str or list of str, optional
Specifies a reference metric or a list of reference metrics. The supported metrics are the same as for cv_metric.
categorical_variable : str, optional
Indicates which variables should be treated as categorical. Otherwise the default behavior is followed:
VARCHAR - categorical,
INTEGER and DOUBLE - continuous.
Only valid for INTEGER variables; omitted otherwise.
allow_missing_label : bool, optional
Specifies whether a missing label value is allowed.
False: not allowed. If missing labels are present in the input data, an error is thrown.
True: allowed. Data with a missing label are removed automatically.
thread_ratio : float, optional
The ratio of available threads used for training.
0: single thread;
(0,1]: percentage of available threads;
others : heuristically determined.
Defaults to -1.
cross_validation_range : list of tuples, optional
Indicates the set of parameters involved in cross-validation. Each tuple is a pair: the parameter name (str) and a list of numbers of the form [<begin-value>, <test-numbers>, <end-value>]. Supported parameters for cross-validation:
n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.
A simple example for illustration:
[(‘n_estimators’, [4, 3, 10]),
(‘learning_rate’, [0.1, 3, 1.0]),
(‘split_threshold’, [0.1, 3, 1.0])]
Examples
Input dataframe for training:
>>> df.head(4).collect()
    ATT1     ATT2    ATT3    ATT4  TARGET
0  19.76   6235.0  100.00  100.00   25.10
1  17.85  46230.0   43.67   84.53   19.23
2  19.96   7360.0   65.51   81.57   21.42
3  16.80  28715.0   45.16   93.33   18.11
Creating GradientBoostingRegressor instance:
>>> cv_range = [('learning_rate', [0.0, 5, 1.0]),
...             ('n_estimators', [10, 11, 20]),
...             ('split_threshold', [0.0, 5, 1.0])]
>>> gbr = GradientBoostingRegressor(conn_context=conn,
...                                 n_estimators=20,
...                                 split_threshold=0.75,
...                                 learning_rate=0.75,
...                                 fold_num=5,
...                                 max_depth=6,
...                                 cv_metric='rmse',
...                                 ref_metric=['mae'],
...                                 cross_validation_range=cv_range)
Performing fit() on given dataframe:
>>> gbr.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'],
...         label='TARGET')
>>> gbr.stats_.collect()
   STAT_NAME STAT_VALUE
0  RMSE_MEAN    1.83732
1   RMSE_VAR   0.525622
2   MAE_MEAN    1.44388
Input dataframe for predicting:
>>> df1.head(4).collect()
   ID   ATT1     ATT2    ATT3    ATT4
0   1  19.76   6235.0  100.00  100.00
1   2  17.85  46230.0   43.67   84.53
2   3  19.96   7360.0   65.51   81.57
3   4  16.80  28715.0   45.16   93.33
Performing predict() on given dataframe:
>>> result = gbr.predict(data=df1, key='ID')
>>> result.head(4).collect()
   ID    SCORE CONFIDENCE
0   1  24.1499       None
1   2  19.2351       None
2   3  21.8944       None
3   4  18.5256       None
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
stats_
(DataFrame) Statistics info for cross-validation.
cv_
(DataFrame) Best choice of parameter produced by cross-validation.
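When cross_validation_range is set (as in the example above), the parameter values selected by cross-validation can be inspected afterwards; a minimal sketch:
>>> gbr.cv_.collect()  # best parameter choices found by cross-validation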
Methods
fit
(data[, key, features, label, …]) Train the model on input data.
predict
(data, key[, features, verbose]) Predict dependent variable values based on fitted model.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
-
predict
(data, key, features=None, verbose=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted value.
CONFIDENCE, all None values for regression.
-
class
hana_ml.algorithms.pal.trees.
HybridGradientBoostingClassifier
(conn_context, n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, base_score=None, cv_metric=None, ref_metric=None, calculate_importance=None, calculate_cm=None, thread_ratio=None, cross_validation_range=None)¶ Bases:
hana_ml.algorithms.pal.trees._HybridGradientBoostingBase
Hybrid Gradient Boosting model for classification.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in Gradient Boosting.
Defaults to 10.
split_method : {‘exact’, ‘sketch’, ‘sampling’}, optional
The method for finding the split point for numerical features.
Defaults to ‘exact’.
random_state : int, optional
The seed for random number generating.
0 - current time as seed,
Others - the seed.
Defaults to 0.
max_depth : int, optional
The maximum depth of a tree.
Defaults to 6.
split_threshold : float, optional
Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.
learning_rate : float, optional
Learning rate of each iteration, must be within the range (0, 1].
Defaults to 0.3.
subsample : float, optional
The fraction of samples to be used for fitting each base learner.
Defaults to 1.0.
fold_num : int, optional
The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.
sketch_eps : float, optional
The epsilon value of the sketch method, which sets an upper limit for the sum of sample weights between two split points. Roughly, the smaller this value is, the more split points are tried.
min_sample_weight_leaf : float, optional
The minimum summation of sample weights in a leaf node.
Defaults to 1.0.
min_samples_leaf : int, optional
The minimum number of data in a leaf node.
Defaults to 1.
max_w_in_split : float, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
col_subsample_split : float, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.
col_subsample_tree : float, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.
lamb : float, optional
L2 regularization weight for the target loss function. Should be within range (0, 1].
Defaults to 1.0.
alpha : float, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
base_score : float, optional
Initial prediction score for all instances; this acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
Defaults to 0.5.
cv_metric : {‘nll’, ‘error_rate’, ‘auc’}, optional
The metric used for cross-validation.
Defaults to ‘error_rate’.
ref_metric : str or list of str, optional
Specifies a reference metric or a list of reference metrics. Any reference metric must be a valid option of cv_metric.
Defaults to [‘error_rate’].
categorical_variable : str or list of str, optional
Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.
Note
By default INTEGER variables are treated as numerical.
thread_ratio : float, optional
The ratio of available threads used for training.
0: single thread;
(0,1]: percentage of available threads;
others : heuristically determined.
Defaults to -1.
calculate_importance : bool, optional
Determines whether to calculate variable importance.
Defaults to True.
calculate_cm : bool, optional
Determines whether to calculate the confusion matrix.
Defaults to True.
cross_validation_range : list of tuples, optional
Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair: the parameter name (str) and a list of numbers with the following structure: [<begin-value>, <end-value>, <test-numbers>].
Supported parameters for cross-validation:
n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.
A simple example for illustration:
[(‘n_estimators’, [4, 10, 3]),
(‘learning_rate’, [0.1, 1.0, 3])]
Examples
Input dataframe for training:
>>> df.head(7).collect()
   ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
4   1.2  10.3  100.0   1.0     A
5   4.0  40.0  400.0   4.0     B
6   4.1  40.1  400.0   4.0     B
Creating an instance of Hybrid Gradient Boosting classifier:
>>> cv_range = [('learning_rate', [0.1, 1.0, 3]),
...             ('n_estimators', [4, 10, 3]),
...             ('split_threshold', [0.1, 1.0, 3])]
>>> ghc = HybridGradientBoostingClassifier(conn_context=conn,
...                                        n_estimators=4,
...                                        split_threshold=0,
...                                        learning_rate=0.5,
...                                        fold_num=5,
...                                        max_depth=6,
...                                        cv_metric='error_rate',
...                                        ref_metric=['auc'],
...                                        cross_validation_range=cv_range)
Performing fit() on given dataframe:
>>> ghc.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'],
...         label='LABEL')
>>> ghc.stats_.collect()
         STAT_NAME STAT_VALUE
0  ERROR_RATE_MEAN   0.133333
1   ERROR_RATE_VAR  0.0266666
2         AUC_MEAN        0.9
Input dataframe for predicting:
>>> df_predict.collect()
   ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
4   5   1.2  10.3  100.0   3.0
5   6   4.0  40.0  400.0   3.0
6   7   4.1  40.1  400.0   3.0
7   8   4.2  40.2  400.0   3.0
8   9   4.3  40.4  400.0   3.0
9  10   4.2  40.3  400.0   3.0
Performing predict() on given dataframe:
>>> result = ghc.predict(data=df_predict, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   1     A    0.852674
1   2     A    0.852674
2   3     A    0.852674
3   4     A    0.852674
4   5     A    0.751394
5   6     B    0.703119
6   7     B    0.703119
7   8     B    0.703119
8   9     B    0.830549
9  10     B    0.703119
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
confusion_matrix_
(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.
stats_
(DataFrame) Statistics info for cross-validation.
cv_
(DataFrame) Best choice of parameter produced by cross-validation.
Methods
fit
(data[, key, features, label, …]) Train the model on input data.
predict
(data, key[, features, verbose, …]) Predict labels based on the trained HGBT classifier.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train the model on input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Indicates INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.
Note
By default INTEGER variables are treated as numerical.
-
predict
(data, key, features=None, verbose=None, thread_ratio=None, missing_replacement=None)¶ Predict labels based on the trained HGBT classifier.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole, corresponding to each category.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification.
Defaults to False.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted classes/values.
CONFIDENCE, type DOUBLE, representing the confidence of a class label assignment.
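A hedged sketch of predict() with an explicit missing-value strategy, reusing ghc and df_predict from the example above:
>>> result = ghc.predict(data=df_predict, key='ID',
...                      missing_replacement='instance_marginalized')
>>> result.head(3).collect()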
-
class
hana_ml.algorithms.pal.trees.
HybridGradientBoostingRegressor
(conn_context, n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, cv_metric=None, ref_metric=None, calculate_importance=None, thread_ratio=None, cross_validation_range=None)¶ Bases:
hana_ml.algorithms.pal.trees._HybridGradientBoostingBase
Hybrid Gradient Boosting model for regression.
- Parameters
conn_context : ConnectionContext
Connection to the HANA system.
n_estimators : int, optional
Specifies the number of trees in Gradient Boosting.
Defaults to 10.
split_method : {‘exact’, ‘sketch’, ‘sampling’}, optional
The method for finding the split point for numeric features.
Defaults to ‘exact’.
random_state : int, optional
The seed for random number generating.
0 - current time as seed,
Others - the seed.
max_depth : int, optional
The maximum depth of a tree.
Defaults to 6.
split_threshold : float, optional
Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.
learning_rate : float, optional
Learning rate of each iteration, must be within the range (0, 1].
Defaults to 0.3.
subsample : float, optional
The fraction of samples to be used for fitting each base learner.
Defaults to 1.0.
fold_num : int, optional
The k-value for k-fold cross-validation. Effective only when cross_validation_range is neither None nor empty.
sketch_eps : float, optional
The epsilon value of the sketch method, which sets an upper limit for the sum of sample weights between two split points. Roughly, the smaller this value is, the more split points are tried.
min_sample_weight_leaf : float, optional
The minimum summation of sample weights in a leaf node.
Defaults to 1.0.
min_samples_leaf : int, optional
The minimum number of data in a leaf node.
Defaults to 1.
max_w_in_split : float, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
col_subsample_split : float, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.
col_subsample_tree : float, optional
The fraction of features used for each tree growth, should be within range (0, 1].
Defaults to 1.0.
lamb : float, optional
Weight of L2 regularization for the target loss function. Should be within range (0, 1].
Defaults to 1.0.
alpha : float, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
cv_metric : {‘rmse’, ‘mae’}, optional
The metric used for cross-validation.
Defaults to ‘mae’.
ref_metric : str or list of str, optional
Specifies a reference metric or a list of reference metrics. Any reference metric must be a valid option of cv_metric.
Defaults to [‘rmse’].
categorical_variable : str or list of str, optional
Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.
Note
By default INTEGER variables are treated as numerical.
thread_ratio : float, optional
The ratio of available threads used for training.
0: single thread;
(0,1]: percentage of available threads;
others : heuristically determined.
Defaults to -1.
calculate_importance : bool, optional
Determines whether to calculate variable importance.
Defaults to True.
calculate_cm : bool, optional
Determines whether to calculate the confusion matrix.
Defaults to True.
cross_validation_range : list of tuples, optional
Indicates the set of parameters involved in cross-validation. Cross-validation is triggered only when this parameter is not None, the list is not empty, and fold_num is greater than 1. Each tuple is a pair: the parameter name (str) and a list of numbers with the following structure: [<begin-value>, <end-value>, <test-numbers>].
Supported parameters for cross-validation: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.
A simple example for illustration, as a list of two tuples:
[(‘n_estimators’, [4, 10, 3]),
(‘learning_rate’, [0.1, 1.0, 3])]
Examples
Input dataframe for training:
>>> df.head(7).collect()
    ATT1     ATT2    ATT3    ATT4  TARGET
0  19.76   6235.0  100.00  100.00   25.10
1  17.85  46230.0   43.67   84.53   19.23
2  19.96   7360.0   65.51   81.57   21.42
3  16.80  28715.0   45.16   93.33   18.11
4  18.20  21934.0   49.20   83.07   19.24
5  16.71   1337.0   74.84   94.99   19.31
6  18.81  17881.0   70.66   92.34   20.07
Creating an instance of the HGBT regressor and training the model:
>>> cv_range = [('learning_rate', [0.0, 1.0, 5]),
...             ('n_estimators', [10, 20, 11]),
...             ('split_threshold', [0.0, 1.0, 5])]
>>> hgr = HybridGradientBoostingRegressor(conn_context=conn,
...                                       n_estimators=20,
...                                       split_threshold=0.75,
...                                       split_method='exact',
...                                       learning_rate=0.75,
...                                       fold_num=5,
...                                       max_depth=6,
...                                       cv_metric='rmse',
...                                       ref_metric=['mae'],
...                                       cross_validation_range=cv_range)
>>> hgr.fit(data=df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'],
...         label='TARGET')
Checking the model content and feature importances:
>>> hgr.model_.head(4).collect()
   TREE_INDEX                                      MODEL_CONTENT
0          -1  {"nclass":1,"param":{"bs":0.0,"obj":"reg:linea...
1           0  {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":9...
2           1  {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":5...
3           2  {"height":0,"nnode":1,"nodes":[{"ch":[],"gn":3...
>>> hgr.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0          ATT1    0.744019
1          ATT2    0.164429
2          ATT3    0.078935
3          ATT4    0.012617
The trained model can be used for prediction. Input data for prediction, i.e. with the target values missing:
>>> df_predict.collect()
   ID   ATT1     ATT2    ATT3    ATT4
0   1  19.76   6235.0  100.00  100.00
1   2  17.85  46230.0   43.67   84.53
2   3  19.96   7360.0   65.51   81.57
3   4  16.80  28715.0   45.16   93.33
4   5  18.20  21934.0   49.20   83.07
5   6  16.71   1337.0   74.84   94.99
6   7  18.81  17881.0   70.66   92.34
7   8  20.74   2319.0   63.93   95.08
8   9  16.56  18040.0   14.45   61.24
9  10  18.55   1147.0   68.58   97.90
Predicting the target values and viewing the results:
>>> result = hgr.predict(data=df_predict, key='ID', verbose=False)
>>> result.collect()
   ID               SCORE CONFIDENCE
0   1   23.79109147050638       None
1   2   19.09572889593064       None
2   3   21.56501359501561       None
3   4  18.622664075787082       None
4   5   19.05159916592106       None
5   6  18.815530665858763       None
6   7  19.761714911364443       None
7   8   23.79109147050638       None
8   9   17.84416828725911       None
9  10  19.915574945518465       None
Attributes
model_
(DataFrame) Trained model content.
feature_importances_
(DataFrame) The feature importance (the higher, the more important the feature).
confusion_matrix_
(DataFrame) Confusion matrix used to evaluate the performance of classification algorithm.
stats_
(DataFrame) Statistics info for cross-validation.
cv_
(DataFrame) Best choice of parameter produced by cross-validation.
Methods
fit
(data[, key, features, label, …]) Train an HGBT regressor on the input data.
predict
(data, key[, features, verbose, …]) Predict dependent variable values based on fitted model.
-
fit
(data, key=None, features=None, label=None, categorical_variable=None)¶ Train an HGBT regressor on the input data.
- Parameters
data : DataFrame
Training data.
key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
label : str, optional
Name of the dependent variable.
Defaults to the last column.
categorical_variable : str or list of str, optional
Specifies INTEGER variable(s) that should be treated as categorical. Valid only for INTEGER variables, omitted otherwise.
Note
By default INTEGER variables are treated as numerical.
-
predict
(data, key, features=None, verbose=None, thread_ratio=None, missing_replacement=None)¶ Predict dependent variable values based on fitted model.
- Parameters
data : DataFrame
Independent variable values to predict for.
key : str
Name of the ID column.
features : list of str, optional
Names of the feature columns. If not provided, it defaults to all non-ID columns.
missing_replacement : str, optional
The missing replacement strategy:
‘feature_marginalized’: marginalise each missing feature out independently.
‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to ‘feature_marginalized’.
verbose : bool, optional
If True, output all classes and the corresponding confidences for each data point.
- Returns
DataFrame
DataFrame of score and confidence, structured as follows:
ID column, with same name and type as data's ID column.
SCORE, type DOUBLE, representing the predicted values.
CONFIDENCE, type DOUBLE, all None for regression prediction.
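A hedged sketch combining the optional predict() arguments, reusing hgr and df_predict from the example above:
>>> result = hgr.predict(data=df_predict, key='ID', thread_ratio=0.5,
...                      missing_replacement='feature_marginalized')
>>> result.head(3).collect()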
hana_ml.algorithms.pal.tsa.arima¶
This module contains a Python wrapper for the PAL ARIMA algorithm.
The following class is available:
-
class
hana_ml.algorithms.pal.tsa.arima.
ARIMA
(conn_context, order=None, seasonal_order=None, method='css-mle', include_mean=None, forecast_method=None, output_fitted=True, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.tsa.arima._ARIMABase
Autoregressive Integrated Moving Average ARIMA(p, d, q) model.
- Parameters
conn_context : ConnectionContext
The connection to the SAP HANA system.
order : (p, d, q), tuple of int, optional
p: value of the autoregressive order.
d: value of the differentiation order.
q: value of the moving average order.