Apriori
- class hana_ml.algorithms.pal.association.Apriori(min_support, min_confidence, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, use_prefix_tree=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None, thread_ratio=None, timeout=None, pmml_export=None)
Apriori is a classic algorithm used in machine learning for mining frequent itemsets and relevant association rules. It operates on a list of transactions and is particularly effective in market basket analysis, where the goal is to find associations of products bought with other products.
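The level-wise search behind Apriori can be sketched in plain Python. This is an illustrative sketch only, not the PAL implementation: the function name `frequent_itemsets` and the sample baskets are hypothetical, and only the idea of growing frequent itemsets one item at a time while pruning infrequent candidates is taken from the algorithm itself.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_len=5):
    """Return {itemset: support} for all itemsets meeting min_support.

    Illustrative Apriori-style search; not the PAL implementation.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = sorted({i for b in baskets for i in b})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while current and k <= max_len:
        for s in current:
            result[s] = support(s)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result


# Hypothetical baskets, used only to exercise the sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(baskets, min_support=0.5)
```

With these baskets, the pair milk and butter occurs in only one of four transactions, so it is dropped, and the three-item candidate is pruned without ever being counted.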
- Parameters:
- min_support : float
Specifies the minimum support as determined by the user.
- min_confidence : float
Specifies the minimum confidence as determined by the user.
- relational : bool, optional
Determines whether relational logic should be applied within the Apriori algorithm. If set to False, a single combined results table will be produced. Conversely, if set to True, the result will be split across three tables: antecedent, consequent, and statistics.
Defaults to False.
- min_lift : float, optional
Specifies the minimum lift value as determined by the user. Lift measures the strength of a rule relative to the case where its antecedent and consequent are independent.
Defaults to 0.
- max_conseq : int, optional
Specifies the maximum number of items that can be contained in consequents.
Defaults to 100.
- max_len : int, optional
Specifies the maximum number of combined items in both antecedent and consequent sets in the output.
Defaults to 5.
- ubiquitous : float, optional
Itemsets whose support is greater than this threshold are ignored during frequent itemset mining.
Defaults to 1.0.
- use_prefix_tree : bool, optional
Indicates whether a prefix tree should be used to save memory. A prefix tree (also known as a trie) is a data structure that can increase the efficiency of certain types of lookups.
Defaults to False.
- lhs_restrict : list of str, optional (deprecated)
Allows specific items only on the left-hand-side of association rules.
- rhs_restrict : list of str, optional (deprecated)
Allows specific items only on the right-hand-side of the association rules.
- lhs_complement_rhs : bool, optional (deprecated)
If rhs_restrict is used to restrict some items to the right-hand-side of the association rules, this parameter can be set to True to restrict the complementary items to the left-hand-side.
For example, if you have 100 items (i1, i2, ..., i100) and want to restrict i1 and i2 to the right-hand-side and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:
...
rhs_restrict = ['i1','i2'],
lhs_complement_rhs = True,
...
Defaults to False.
- rhs_complement_lhs : bool, optional (deprecated)
If lhs_restrict is used to restrict some items to the left-hand-side of the association rules, this parameter can be set to True to restrict the complementary items to the right-hand-side.
Defaults to False.
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- timeout : int, optional
Specifies the maximum run time for the algorithm in seconds. The algorithm will cease computation if the specified timeout is exceeded.
Defaults to 3600.
- pmml_export : {'no', 'single-row', 'multi-row'}, optional
Defines the method of exporting the Apriori model:
'no' : the model will not be exported,
'single-row' : the Apriori model will be exported as a single row PMML,
'multi-row' : the Apriori model will be exported as a multi-row PMML where each row contains a minimum of 5000 characters.
Defaults to 'no'.
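The thresholds min_support, min_confidence, and min_lift act on the standard rule metrics. The sketch below computes those metrics for a single rule in plain Python, using the usual textbook definitions; the helper names `rule_metrics` and `passes_thresholds` and the nine sample baskets are hypothetical and are not part of hana_ml.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    lhs, rhs = set(antecedent), set(consequent)

    n_lhs = sum(lhs <= b for b in baskets)            # baskets containing the antecedent
    n_rhs = sum(rhs <= b for b in baskets)            # baskets containing the consequent
    n_both = sum((lhs | rhs) <= b for b in baskets)   # baskets containing both

    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift


def passes_thresholds(metrics, min_support, min_confidence, min_lift=0.0):
    """Check a (support, confidence, lift) tuple against the minimum thresholds."""
    support, confidence, lift = metrics
    return (support >= min_support
            and confidence >= min_confidence
            and lift >= min_lift)


# Hypothetical data: 9 baskets, item5 always bought together with item2.
baskets = (
    [{"item2", "item5"}] * 2
    + [{"item2"}] * 5
    + [{"item3"}] * 2
)
metrics = rule_metrics(baskets, ["item5"], ["item2"])
keep = passes_thresholds(metrics, min_support=0.1, min_confidence=0.3, min_lift=1.1)
```

For this made-up data the rule item5 -> item2 has support 2/9, confidence 1, and lift 9/7, so it passes the thresholds used in the example above and would be rejected by a min_lift of 1.3.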
Examples
Input data for association rule mining:
>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
...
21         8  item2
22         8  item3
Initialize an Apriori object and set its parameters:
>>> ap = Apriori(min_support=0.1, min_confidence=0.3, relational=False, min_lift=1.1, max_conseq=1, max_len=5, ubiquitous=1.0, use_prefix_tree=False, thread_ratio=0, timeout=3600, pmml_export='single-row')
Perform the fit() and obtain the result:
>>> ap.fit(data=df)
>>> ap.result_.head(5).collect()
    ANTECEDENT  CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0        item5       item2  0.222222    1.000000  1.285714
1        item1       item5  0.222222    0.333333  1.500000
2        item5       item1  0.222222    1.000000  1.500000
3        item4       item2  0.222222    1.000000  1.285714
4  item2&item1       item5  0.222222    0.500000  2.250000
Also, initialize an Apriori object and set its parameters with relational logic:
>>> apr = Apriori(min_support=0.1, min_confidence=0.3, relational=True, min_lift=1.1, max_conseq=1, max_len=5, ubiquitous=1.0, use_prefix_tree=False, thread_ratio=0, timeout=3600, pmml_export='single-row')
Perform the fit() and obtain the result:
>>> apr.fit(data=df)
>>> apr.antec_.head(5).collect()
   RULE_ID  ANTECEDENTITEM
0        0           item5
1        1           item1
2        2           item5
3        3           item4
4        4           item2
>>> apr.conseq_.head(5).collect()
   RULE_ID  CONSEQUENTITEM
0        0           item2
1        1           item5
2        2           item1
3        3           item2
4        4           item5
>>> apr.stats_.head(5).collect()
   RULE_ID   SUPPORT  CONFIDENCE      LIFT
0        0  0.222222    1.000000  1.285714
1        1  0.222222    0.333333  1.500000
2        2  0.222222    1.000000  1.500000
3        3  0.222222    1.000000  1.285714
4        4  0.222222    0.500000  2.250000
- Attributes:
- result_ : DataFrame
Mined association rules and related statistics, structured as follows:
1st column : antecedent (leading) items.
2nd column : consequent (dependent) items.
3rd column : support value.
4th column : confidence value.
5th column : lift value.
Available only when relational is False.
- model_ : DataFrame
Apriori model trained from the input data, structured as follows:
1st column : model ID,
2nd column : model content, i.e. Apriori model in PMML format.
- antec_ : DataFrame
Antecedent items of mined association rules, structured as follows:
1st column : association rule ID,
2nd column : antecedent items of the corresponding association rule.
Available only when relational is True.
- conseq_ : DataFrame
Consequent items of mined association rules, structured as follows:
1st column : association rule ID,
2nd column : consequent items of the corresponding association rule.
Available only when relational is True.
- stats_ : DataFrame
Statistics of the mined association rules, structured as follows:
1st column : rule ID,
2nd column : support value of the rule,
3rd column : confidence value of the rule,
4th column : lift value of the rule.
Available only when relational is True.
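The relational tables carry the same information as the combined result_ table, keyed by rule ID. As a rough sketch of that relationship, the plain-Python snippet below joins rows shaped like antec_, conseq_, and stats_ back into combined rows. The helper `combine_rules` is hypothetical, and the second antecedent row for rule 4 (item1) is an assumption inferred from the combined output `item2&item1` in the example above, since the relational example only shows the first five rows.

```python
from collections import defaultdict

# Rows shaped like antec_ (RULE_ID, ANTECEDENTITEM); one row per item.
antec = [(0, "item5"), (1, "item1"), (2, "item5"), (3, "item4"),
         (4, "item2"), (4, "item1")]  # last row is assumed, see lead-in

# Rows shaped like conseq_ (RULE_ID, CONSEQUENTITEM).
conseq = [(0, "item2"), (1, "item5"), (2, "item1"), (3, "item2"), (4, "item5")]

# Rows shaped like stats_ (RULE_ID, SUPPORT, CONFIDENCE, LIFT).
stats = [
    (0, 0.222222, 1.000000, 1.285714),
    (1, 0.222222, 0.333333, 1.500000),
    (2, 0.222222, 1.000000, 1.500000),
    (3, 0.222222, 1.000000, 1.285714),
    (4, 0.222222, 0.500000, 2.250000),
]


def combine_rules(antec, conseq, stats, sep="&"):
    """Join the three relational tables into combined rule rows."""
    lhs, rhs = defaultdict(list), defaultdict(list)
    for rule_id, item in antec:
        lhs[rule_id].append(item)
    for rule_id, item in conseq:
        rhs[rule_id].append(item)
    return [
        (sep.join(lhs[rule_id]), sep.join(rhs[rule_id]), support, confidence, lift)
        for rule_id, support, confidence, lift in stats
    ]


combined = combine_rules(antec, conseq, stats)
```

The relational form is convenient when rules are filtered or joined by individual item, because each item occupies its own row instead of being packed into a delimited string.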
Methods
fit(data[, transaction, item, lhs_restrict, ...]) : Association rule mining on the given data.
get_model_metrics() : Get the model metrics.
get_score_metrics() : Get the score metrics.
- fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)
Association rule mining on the given data.
- Parameters:
- data : DataFrame
The input data.
- transaction : str, optional
Name of the transaction column.
Defaults to the first column if not provided.
- item : str, optional
Name of the item ID column.
Data type of item column can be INTEGER, VARCHAR or NVARCHAR.
Defaults to the last non-transaction column if not provided.
- lhs_restrict : list of int/str, optional
Specifies items that are only allowed on the left-hand-side of association rules.
Elements in the list should be the same type as the item column.
- rhs_restrict : list of int/str, optional
Specifies items that are only allowed on the right-hand-side of association rules.
Elements in the list should be the same type as the item column.
- lhs_complement_rhs : bool, optional
If you use rhs_restrict to restrict some items to the right-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.
For example, if you have 100 items (i1, i2, ..., i100) and want to restrict i1 and i2 to the right-hand-side and i3, i4, ..., i100 to the left-hand-side, you can set the parameters as follows:
...
rhs_restrict = ['i1', 'i2'],
lhs_complement_rhs = True,
...
Defaults to False.
- rhs_complement_lhs : bool, optional
If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand-side.
Defaults to False.
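One way to read the four restriction parameters is as a filter on candidate rules. The sketch below expresses that reading in plain Python. It is an interpretation of the parameter descriptions above, not PAL's actual logic, and the function `rule_allowed` is a hypothetical helper that does not exist in hana_ml.

```python
def rule_allowed(antecedent, consequent,
                 lhs_restrict=(), rhs_restrict=(),
                 lhs_complement_rhs=False, rhs_complement_lhs=False):
    """Return True if the rule antecedent -> consequent respects the restrictions."""
    lhs, rhs = set(antecedent), set(consequent)
    lhs_only, rhs_only = set(lhs_restrict), set(rhs_restrict)

    # Items restricted to the left-hand-side must not appear in the consequent.
    if rhs & lhs_only:
        return False
    # Items restricted to the right-hand-side must not appear in the antecedent.
    if lhs & rhs_only:
        return False
    # lhs_complement_rhs: every item outside rhs_restrict is left-hand-side only,
    # so the consequent may contain nothing but rhs_restrict items.
    if lhs_complement_rhs and not rhs <= rhs_only:
        return False
    # rhs_complement_lhs: every item outside lhs_restrict is right-hand-side only,
    # so the antecedent may contain nothing but lhs_restrict items.
    if rhs_complement_lhs and not lhs <= lhs_only:
        return False
    return True


# Mirrors the i1/i2 example above: i1 and i2 only on the right-hand-side,
# all remaining items only on the left-hand-side.
ok = rule_allowed(["i3", "i4"], ["i1"],
                  rhs_restrict=["i1", "i2"], lhs_complement_rhs=True)
```

Under this reading, a rule such as i3 -> i4 is accepted when only rhs_restrict is given, and rejected once lhs_complement_rhs is set to True, because i4 is then confined to the left-hand-side.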
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
In addition to the methods above, the Apriori class inherits methods from the PALBase class; refer to PAL Base for details.