AprioriLite
- class hana_ml.algorithms.pal.association.AprioriLite(min_support, min_confidence, subsample=None, recalculate=None, thread_ratio=None, timeout=None, pmml_export=None)
This class runs a lightweight version of the Apriori algorithm for association rule mining. It reduces computational overhead by generating and analyzing only itemsets of at most two items, so each rule relates a single antecedent item to a single consequent item. This makes it particularly useful for large datasets where the full Apriori algorithm would be computationally expensive.
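To make the restriction to two-item sets concrete, here is a minimal pure-Python sketch. It is not the PAL implementation: it simply counts single items and item pairs, then derives one-to-one rules using the usual definitions of support, confidence, and lift. The function name `mine_pair_rules` and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(baskets, min_support, min_confidence):
    """Return sorted rules (antecedent, consequent, support, confidence, lift).

    Hypothetical helper for illustration; only 1- and 2-item sets are counted.
    """
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate items within a basket
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields at most two rules: a -> b and b -> a.
        for ante, cons in ((a, b), (b, a)):
            confidence = count / item_count[ante]
            if confidence < min_confidence:
                continue
            lift = confidence / (item_count[cons] / n)
            rules.append((ante, cons, support, confidence, lift))
    return sorted(rules)


baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = mine_pair_rules(baskets, min_support=0.5, min_confidence=0.6)
for rule in rules:
    print(rule)
```

Because no itemset larger than a pair is ever enumerated, the work per basket grows with the number of item pairs rather than with all possible item subsets.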
- Parameters
- min_support : float
Specifies the minimum support as determined by the user.
- min_confidence : float
Specifies the minimum confidence as determined by the user.
- subsample : float, optional
Specifies the sampling proportion for the input data, expressed as a fraction. Set to 1 to use the entire dataset. Subsampling can speed up computation on large datasets.
Defaults to 1.
- recalculate : bool, optional
If true, the illustrative statistics (support, confidence, and lift) of the resulting rule set are recalculated (updated) after the rules are found using sampled data.
Defaults to True.
- thread_ratio : float, optional
Specifies the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, while a value of 1 uses all currently available threads. Values outside this range are ignored, in which case the number of threads is determined heuristically.
Defaults to 0.
- timeout : int, optional
Specifies the maximum run time for the algorithm in seconds. The algorithm will cease computation if the specified timeout is exceeded.
Defaults to 3600.
- pmml_export : {'no', 'single-row', 'multi-row'}, optional
Defines the method of exporting the Apriori model:
'no' : the model will not be exported,
'single-row' : the Apriori model will be exported as a single row PMML,
'multi-row' : the Apriori model will be exported as a multi-row PMML where each row contains a minimum of 5000 characters.
Defaults to 'no'.
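The interaction between `subsample` and `recalculate` can be pictured with the following pure-Python sketch. It rests on an assumption that is not confirmed by this page: that a rule is first found on a random sample of transactions and that recalculation then updates its statistics so they reflect all of the data. The actual PAL sampling procedure may differ, and every name in the block (`rule_stats`, `sample`, the toy baskets) is hypothetical.

```python
import random


def rule_stats(baskets, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    n_ante = sum(antecedent in b for b in baskets)
    n_cons = sum(consequent in b for b in baskets)
    n_both = sum(antecedent in b and consequent in b for b in baskets)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift


baskets = [
    {"a", "b"}, {"a", "b"}, {"a"}, {"b"},
    {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]

# Assumed analogue of subsample=0.5: work on half of the transactions.
rng = random.Random(0)
sample = rng.sample(baskets, k=int(len(baskets) * 0.5))

# Statistics as estimated from the sample only (assumed recalculate=False).
sampled = rule_stats(sample, "a", "b")

# Statistics updated against all transactions (assumed recalculate=True).
recalculated = rule_stats(baskets, "a", "b")

print("sample estimate:", sampled)
print("recalculated:   ", recalculated)
```

Under this assumption, the sampled statistics are only estimates, which is why recalculating them is useful when the reported support, confidence, and lift need to describe the whole dataset.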
- Attributes
- result_ : DataFrame
Mined association rules and related statistics, structured as follows:
1st column : antecedent (leading) items,
2nd column : consequent (dependent) items,
3rd column : support value,
4th column : confidence value,
5th column : lift value.
Non-empty only when relational is False.
- model_ : DataFrame
Apriori model trained from the input data, structured as follows:
1st column : model ID.
2nd column : model content, i.e. the lite Apriori model in PMML format.
Methods
fit(data[, transaction, item])
Association rule mining on the given data.
Examples
Input DataFrame df:
>>> df.collect()
    CUSTOMER   ITEM
0          2  item2
1          2  item3
......
21         8  item2
22         8  item3
Initialize an AprioriLite object:
>>> apl = AprioriLite(min_support=0.1, min_confidence=0.3, subsample=1.0, recalculate=False, timeout=3600, pmml_export='single-row')
Perform the fit() and obtain the result:
>>> apl.fit(data=df)
>>> apl.result_.head(5).collect()
  ANTECEDENT CONSEQUENT   SUPPORT  CONFIDENCE      LIFT
0      item5      item2  0.222222    1.000000  1.285714
1      item1      item5  0.222222    0.333333  1.500000
2      item5      item1  0.222222    1.000000  1.500000
3      item5      item3  0.111111    0.500000  0.750000
4      item1      item2  0.444444    0.666667  0.857143
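As a sanity check on the output above, the displayed statistics can be tested against the standard relationship lift = confidence / support(consequent). The sketch below is plain Python with the five displayed rows typed in by hand. It does not query HANA, and the variable names are illustrative only. For each row it recovers the consequent support implied by the confidence and lift columns.

```python
# The five rows shown in the result above:
# (antecedent, consequent, support, confidence, lift)
rows = [
    ("item5", "item2", 0.222222, 1.000000, 1.285714),
    ("item1", "item5", 0.222222, 0.333333, 1.500000),
    ("item5", "item1", 0.222222, 1.000000, 1.500000),
    ("item5", "item3", 0.111111, 0.500000, 0.750000),
    ("item1", "item2", 0.444444, 0.666667, 0.857143),
]

implied = {}
for ante, cons, support, confidence, lift in rows:
    # lift = confidence / support(consequent)
    # => support(consequent) = confidence / lift
    implied.setdefault(cons, []).append(round(confidence / lift, 4))

for item, values in sorted(implied.items()):
    print(item, values)
```

Rows that share a consequent (here the two rules ending in item2) should yield the same implied support, which is a quick way to confirm that the columns are being read correctly.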
- fit(data, transaction=None, item=None)
Association rule mining on the given data.
- Parameters
- data : DataFrame
The input data.
- transaction : str, optional
Name of the transaction column.
Defaults to the first column if not provided.
- item : str, optional
Name of the item column.
Defaults to the last non-transaction column if not provided.
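The column defaults described above can be summarized in a few lines of plain Python. The helper `resolve_columns` is hypothetical and is not part of hana_ml. It only mirrors the documented rule: the transaction column defaults to the first column, and the item column defaults to the last column that is not the transaction column.

```python
def resolve_columns(columns, transaction=None, item=None):
    """Hypothetical helper mirroring the documented defaults of fit()."""
    if transaction is None:
        transaction = columns[0]  # first column
    if item is None:
        # last column that is not the transaction column
        item = [c for c in columns if c != transaction][-1]
    return transaction, item


print(resolve_columns(["CUSTOMER", "ITEM"]))
print(resolve_columns(["ITEM", "CUSTOMER"], transaction="CUSTOMER"))
print(resolve_columns(["ID", "CUSTOMER", "ITEM"]))
```

When the input has more than two columns, or the transaction column is not first, passing `transaction` and `item` explicitly avoids relying on column order.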