Apriori

class hana_ml.algorithms.pal.association.Apriori(min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)

Apriori is a classic algorithm used in machine learning for mining frequent itemsets and relevant association rules. It operates on a list of transactions and is particularly effective in market basket analysis, where the goal is to find associations of products bought with other products.

Parameters:
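min_support : float

Minimum support threshold specified by the user. Itemsets whose support falls below this value are discarded.

min_confidence : float

Minimum confidence threshold specified by the user. Rules whose confidence falls below this value are discarded.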

relational : bool, optional

Determines whether relational logic should be applied within the Apriori algorithm. If set to False, a single combined results table will be produced. Conversely, if set to True, the result will be split across three tables: antecedent, consequent, and statistics.

Defaults to False.

min_lift : float, optional

Specifies the minimum lift a rule must reach. Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent.

Defaults to 0.

max_conseq : int, optional

Specifies the maximum number of items that can be contained in consequents.

Defaults to 100.

max_len : int, optional

Specifies the maximum number of combined items in both antecedent and consequent sets in the output.

Defaults to 5.

ubiquitous : float, optional

Item sets whose support exceeds this threshold are ignored during frequent itemset mining.

Defaults to 1.0.

use_prefix_tree : bool, optional

Indicates whether a prefix tree should be used to save memory. A prefix tree (also known as a trie) is a data structure that can increase the efficiency of certain types of lookups.

Defaults to False.

lhs_restrict : list of str, optional (deprecated)

Specifies items that are only allowed on the left-hand-side of association rules.

rhs_restrict : list of str, optional (deprecated)

Specifies items that are only allowed on the right-hand-side of association rules.

lhs_complement_rhs : bool, optional (deprecated)

If rhs_restrict is used to restrict some items to the right-hand-side of the association rules, this parameter can be set to True in order to restrict the complementary items to the left-hand-side.

For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...

rhs_restrict = ['i1','i2'],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhs : bool, optional (deprecated)

If lhs_restrict is used to restrict some items to the left-hand-side of the association rules, this parameter can be set to True to restrict the complementary items to the right-hand side.

Defaults to False.

thread_ratio : float, optional

Specifies the fraction of the total number of threads that this function may use.

The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all the currently available threads.

Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

timeout : int, optional

Specifies the maximum run time for the algorithm in seconds. The algorithm will cease computation if the specified timeout is exceeded.

Defaults to 3600.

pmml_export : {'no', 'single-row', 'multi-row'}, optional

Defines the method of exporting the Apriori model:

  • 'no' : the model will not be exported,

  • 'single-row' : the Apriori model will be exported as a single row PMML,

  • 'multi-row' : the Apriori model will be exported as a multi-row PMML where each row contains at most 5000 characters.

Defaults to 'no'.
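
The thresholds min_support, min_confidence and min_lift act on the standard association rule measures. The following is a minimal sketch in plain Python, not the PAL implementation: the toy transactions, the helper names support, rule_metrics and passes, and the inclusive >= comparisons are all assumptions made for illustration.

```python
# Toy transactions (hypothetical data, not the dataset used on this page).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    supp_rule = support(set(antecedent) | set(consequent), transactions)
    confidence = supp_rule / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return supp_rule, confidence, lift


def passes(metrics, min_support, min_confidence, min_lift=0.0):
    """Check a rule against the three thresholds (inclusive comparison assumed)."""
    supp, conf, lift = metrics
    return supp >= min_support and conf >= min_confidence and lift >= min_lift


# Rule {butter} -> {bread}: both items occur together in 2 of 5 transactions.
metrics = rule_metrics({"butter"}, {"bread"}, transactions)
print(metrics)
print(passes(metrics, min_support=0.1, min_confidence=0.3, min_lift=1.1))
```

A lift above 1 indicates that the consequent is more likely when the antecedent is present than it is overall, which is why a min_lift slightly above 1 (such as 1.1) is a common way to drop uninformative rules.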

Examples

Input data for association rule mining:

>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
...
21         8  item2
22         8  item3

Initialize an Apriori object and set its parameters:

>>> ap = Apriori(min_support=0.1,
                 min_confidence=0.3,
                 relational=False,
                 min_lift=1.1,
                 max_conseq=1,
                 max_len=5,
                 ubiquitous=1.0,
                 use_prefix_tree=False,
                 thread_ratio=0,
                 timeout=3600,
                 pmml_export='single-row')

Perform the fit() and obtain the result:

>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5      item2  0.222222    1.000000  1.285714
1        item1      item5  0.222222    0.333333  1.500000
2        item5      item1  0.222222    1.000000  1.500000
3        item4      item2  0.222222    1.000000  1.285714
4  item2&item1      item5  0.222222    0.500000  2.250000

Alternatively, initialize an Apriori object with relational logic enabled:

>>> apr = Apriori(min_support=0.1,
                  min_confidence=0.3,
                  relational=True,
                  min_lift=1.1,
                  max_conseq=1,
                  max_len=5,
                  ubiquitous=1.0,
                  use_prefix_tree=False,
                  thread_ratio=0,
                  timeout=3600,
                  pmml_export='single-row')

Perform the fit() and obtain the result:

>>> apr.fit(data=df)
>>> apr.antec_.head(5).collect()
   RULE_ID ANTECEDENTITEM
0        0          item5
1        1          item1
2        2          item5
3        3          item4
4        4          item2
>>> apr.conseq_.head(5).collect()
   RULE_ID CONSEQUENTITEM
0        0          item2
1        1          item5
2        2          item1
3        3          item2
4        4          item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
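
The two output layouts carry the same information. The following sketch, written in plain Python rather than hana_ml, rebuilds combined rows from the three relational tables using the rows printed above. It assumes that rule 4 has a second antecedent row for item1 (consistent with the item2&item1 entry in the combined output) and that multiple items are joined with '&'; group_items and combine are hypothetical helper names.

```python
from collections import defaultdict

# Rows copied from the relational example output above.
# Assumption: rule 4 also has an antecedent row for item1, which falls
# outside the five rows shown by head(5).
antec = [(0, "item5"), (1, "item1"), (2, "item5"), (3, "item4"),
         (4, "item2"), (4, "item1")]
conseq = [(0, "item2"), (1, "item5"), (2, "item1"), (3, "item2"), (4, "item5")]
stats = [
    (0, 0.222222, 1.000000, 1.285714),
    (1, 0.222222, 0.333333, 1.500000),
    (2, 0.222222, 1.000000, 1.500000),
    (3, 0.222222, 1.000000, 1.285714),
    (4, 0.222222, 0.500000, 2.250000),
]


def group_items(rows):
    """Collect the items of each rule ID, preserving row order."""
    grouped = defaultdict(list)
    for rule_id, item in rows:
        grouped[rule_id].append(item)
    return grouped


def combine(antec, conseq, stats, sep="&"):
    """Join the three relational tables on rule ID into combined rows."""
    lhs, rhs = group_items(antec), group_items(conseq)
    return [
        (sep.join(lhs[rule_id]), sep.join(rhs[rule_id]), supp, conf, lift)
        for rule_id, supp, conf, lift in stats
    ]


combined = combine(antec, conseq, stats)
for row in combined:
    print(row)
```

The relational layout is usually easier to query per item (for example, finding every rule whose antecedent contains a given item), while the combined layout is easier to read.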

Attributes:
result_ : DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent (leading) items.

  • 2nd column : consequent (dependent) items.

  • 3rd column : support value.

  • 4th column : confidence value.

  • 5th column : lift value.

Available only when relational is False.

model_ : DataFrame

Apriori model trained from the input data, structured as follows:

  • 1st column : model ID,

  • 2nd column : model content, i.e. Apriori model in PMML format.

antec_ : DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_ : DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_ : DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(data[, transaction, item, lhs_restrict, ...])

Association rule mining on the given data.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining on the given data.

Parameters:
data : DataFrame

The input data.

transaction : str, optional

Name of the transaction column.

Defaults to the first column if not provided.

item : str, optional

Name of the item ID column.

The item column can be of type INTEGER, VARCHAR or NVARCHAR.

Defaults to the last non-transaction column if not provided.

lhs_restrict : list of int/str, optional

Specifies items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrict : list of int/str, optional

Specifies items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhs : bool, optional

If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1, i2, ..., i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhs : bool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.
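
The combined effect of the four restriction parameters can be illustrated with a small rule filter. This is a sketch of the semantics as described above, in plain Python; it is not how PAL applies the restrictions internally, and the function name allowed and its all_items argument are hypothetical.

```python
def allowed(antecedent, consequent, all_items,
            lhs_restrict=(), rhs_restrict=(),
            lhs_complement_rhs=False, rhs_complement_lhs=False):
    """Return True if the rule antecedent -> consequent respects the restrictions."""
    lhs_only = set(lhs_restrict)  # items that may appear only on the left
    rhs_only = set(rhs_restrict)  # items that may appear only on the right
    if lhs_complement_rhs:
        # Everything not restricted to the right is restricted to the left.
        lhs_only |= set(all_items) - set(rhs_restrict)
    if rhs_complement_lhs:
        # Everything not restricted to the left is restricted to the right.
        rhs_only |= set(all_items) - set(lhs_restrict)
    # A rule is rejected if a right-only item sits on the left or vice versa.
    return not (set(antecedent) & rhs_only) and not (set(consequent) & lhs_only)


# The 100-item scenario from the parameter description.
items = [f"i{n}" for n in range(1, 101)]
options = dict(rhs_restrict=["i1", "i2"], lhs_complement_rhs=True)

print(allowed({"i3", "i4"}, {"i1"}, items, **options))  # i1 on the right
print(allowed({"i1"}, {"i3"}, items, **options))        # i1 on the left
print(allowed({"i3"}, {"i4"}, items, **options))        # i4 on the right
```

Without lhs_complement_rhs, only i1 and i2 are constrained, so a rule such as i3 -> i4 would still be permitted in this sketch.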

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

Inherited Methods from PALBase

In addition to the methods described above, the Apriori class inherits methods from the PALBase class. Refer to PAL Base for more details.