FPGrowth

class hana_ml.algorithms.pal.association.FPGrowth(min_support=None, min_confidence=None, relational=None, min_lift=None, max_conseq=None, max_len=None, ubiquitous=None, thread_ratio=None, timeout=None)

The Frequent Pattern Growth (FP-Growth) algorithm is a technique used for finding frequent patterns in a transaction dataset without generating a candidate itemset. This is achieved by building a prefix tree (FP Tree) to compress information and subsequently retrieve frequent itemsets efficiently.

Parameters:
min_supportfloat, optional

Specifies the minimum support value, which falls within the valid range of [0, 1].

Defaults to 0.

min_confidencefloat, optional

Specifies the minimum confidence value, with an acceptable range between [0, 1].

Defaults to 0.

relationalbool, optional

Determines whether relational logic should be applied within the Apriori algorithm. If set to False, a single combined results table will be produced. Conversely, if set to True, the result will be split across three tables: antecedent, consequent, and statistics.

Defaults to False.

min_liftfloat, optional

Specifies the minimum lift.

Defaults to 0.

max_conseqint, optional

Specifies the maximum length of consequent items.

Defaults to 10.

max_lenint, optional

Specifies the total length of both antecedent items and consequent items in the output.

Defaults to 10.

ubiquitousfloat, optional

This parameter is used to ignore item sets with support values greater than this threshold during frequent itemset mining.

Defaults to 1.0.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

timeoutint, optional

Specifies the maximum run time for the algorithm in seconds. The algorithm will cease computation if the specified timeout is exceeded.

Defaults to 3600.

Examples

Input DataFrame df:

>>> df.collect()
    TRANS  ITEM
0       1     1
1       1     2
......
26     10     3
27     10     5

Initialize a FPGrowth object:

>>> fpg = FPGrowth(min_support=0.2,
                   min_confidence=0.5,
                   relational=False,
                   min_lift=1.0,
                   max_conseq=1,
                   max_len=5,
                   ubiquitous=1.0,
                   thread_ratio=0,
                   timeout=3600)

Perform fit():

>>> fpg.fit(data=df, lhs_restrict=[1,2,3])
>>> fpg.result_.collect()
  ANTECEDENT  CONSEQUENT  SUPPORT  CONFIDENCE      LIFT
0          2           3      0.5    0.714286  1.190476
1          3           2      0.5    0.833333  1.190476
2          3           4      0.3    0.500000  1.250000
3        1&2           3      0.3    0.600000  1.000000
4        1&3           2      0.3    0.750000  1.071429
5        1&3           4      0.2    0.500000  1.250000

Also, initialize a FPGrowth object and set its parameters with relational logic:

>>> fpgr = FPGrowth(min_support=0.2,
                    min_confidence=0.5,
                    relational=True,
                    min_lift=1.0,
                    max_conseq=1,
                    max_len=5,
                    ubiquitous=1.0,
                    thread_ratio=0,
                    timeout=3600)

Perform fit():

>>> fpgr.fit(data=df, rhs_restrict=[1, 2, 3])
>>> fpgr.antec_.collect()
   RULE_ID  ANTECEDENTITEM
0        0               2
1        1               3
2        2               3
...
6        4               3
7        5               1
8        5               3
>>> fpgr.conseq_.collect()
   RULE_ID  CONSEQUENTITEM
0        0               3
1        1               2
2        2               4
3        3               3
4        4               2
5        5               4
>>> fpgr.stats_.collect()
   RULE_ID  SUPPORT  CONFIDENCE      LIFT
0        0      0.5    0.714286  1.190476
1        1      0.5    0.833333  1.190476
2        2      0.3    0.500000  1.250000
3        3      0.3    0.600000  1.000000
4        4      0.3    0.750000  1.071429
5        5      0.2    0.500000  1.250000
Attributes:
result_DataFrame

Mined association rules and related statistics, structured as follows:

  • 1st column : antecedent(leading) items,

  • 2nd column : consequent(dependent) items,

  • 3rd column : support value,

  • 4th column : confidence value,

  • 5th column : lift value.

Available only when relational is False.

antec_DataFrame

Antecedent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.

conseq_DataFrame

Consequent items of mined association rules, structured as follows:

  • 1st column : association rule ID,

  • 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.

stats_DataFrame

Statistics of the mined association rules, structured as follows:

  • 1st column : rule ID,

  • 2nd column : support value of the rule,

  • 3rd column : confidence value of the rule,

  • 4th column : lift value of the rule.

Available only when relational is True.

Methods

fit(data[, transaction, item, lhs_restrict, ...])

Association rule mining on the given data.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

fit(data, transaction=None, item=None, lhs_restrict=None, rhs_restrict=None, lhs_complement_rhs=None, rhs_complement_lhs=None)

Association rule mining on the given data.

Parameters:
dataDataFrame

The input data.

transactionstr, optional

Name of the transaction column.

Defaults to the first column if not provided.

itemstr, optional

Name of the item column.

Defaults to the last non-transaction column if not provided.

lhs_restrictlist of int/str, optional

Specifies items that are only allowed on the left-hand-side of association rules.

Elements in the list should be the same type as the item column.

rhs_restrictlist of int/str, optional

Specifies items that are only allowed on the right-hand-side of association rules.

Elements in the list should be the same type as the item column.

lhs_complement_rhsbool, optional

If you use rhs_restrict to restrict some items to the left-hand-side of the association rules, you can set this parameter to True to restrict the complement items to the left-hand-side.

For example, if you have 100 items (i1,i2,...,i100), and want to restrict i1 and i2 to the right-hand-side, and i3, i4,..., i100 to the left-hand-side, you can set the parameters similarly as follows:

...

rhs_restrict = [i1, i2],

lhs_complement_rhs = True,

...

Defaults to False.

rhs_complement_lhsbool, optional

If you use lhs_restrict to restrict some items to the left-hand-side of association rules, you can set this parameter to True to restrict the complement items to the right-hand side.

Defaults to False.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

Inherited Methods from PALBase

Besides those methods mentioned above, the FPGrowth class also inherits methods from PALBase class, please refer to PAL Base for more details.