benford_analysis

hana_ml.algorithms.pal.stats.benford_analysis(data, key=None, categorical_variable=None, sign=None, number_of_digits=None, discrete=True, rounding=None, thread_ratio=None)

Benford analysis is a data mining tool based on Benford's law (Frank Benford, 1938). The law is a probability distribution that describes the likelihood of each leading digit in a series of numbers. It states that the leading digit of a number is more likely to be small (such as 1, 2, or 3) than large (such as 7, 8, or 9). For example, numbers starting with 1 occur about 30% of the time, numbers starting with 2 occur about 18% of the time, and so on. Benford analysis counts the number of times each leading digit (1-9) occurs in a feature (field), and then compares the actual count to the expected count calculated using Benford's law.
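The comparison described above can be sketched in plain Python. This is an illustration only, not the PAL implementation: the helper names `benford_expected`, `leading_digit`, and `compare_to_benford` are hypothetical, and the expected share of leading digit d is taken from the standard statement of Benford's law, log10(1 + 1/d).

```python
import math
from collections import Counter

def benford_expected(digit: int) -> float:
    """Benford probability that the leading digit equals `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(value: float) -> int:
    """First significant digit of a non-zero number, ignoring its sign."""
    # Scientific notation puts the first significant digit at position 0,
    # e.g. 314 -> '3.140000000000000e+02'.
    return int(f"{abs(value):.15e}"[0])

def compare_to_benford(values):
    """Return {digit: (observed_share, expected_share)} for digits 1-9."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {d: (counts[d] / n, benford_expected(d)) for d in range(1, 10)}

# Powers of two are a classic sequence that follows Benford's law closely.
sample = [2 ** k for k in range(1, 101)]
for digit, (observed, expected) in compare_to_benford(sample).items():
    print(digit, round(observed, 3), round(expected, 3))
```

A large gap between the observed and expected shares for a feature is what the analysis flags for closer inspection.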

In practice, Benford analysis is used in forensic accounting to analyze transactions for irregularities that may indicate fraudulent behavior or bias.

Parameters:

data : DataFrame

Input data for performing Benford analysis. Only numerical features are analyzed; categorical features are ignored.

key : str, optional

Specifies the ID column of data.

Defaults to the index of data if data is indexed by a single column; otherwise it must be specified.

categorical_variable : str or a list of str, optional

Specifies the integer columns in data that should be treated as categorical. The columns specified in categorical_variable are skipped by Benford analysis.

Note that any non-integer column supplied here is ignored.

sign : {'positive', 'negative', 'both'}

Specifies the scope of data for Benford analysis.

'positive': analyzes only data values greater than zero
'negative': analyzes only data values less than zero
'both': analyzes both positive and negative values of the data

Note that this parameter is mandatory, so a valid choice must be specified.

number_of_digits : int, optional

Specifies the number of first digits to analyze.

Defaults to 1.
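As an illustration of analyzing more than one leading digit, the sketch below extracts the first k significant digits and computes the corresponding Benford expectation. It assumes that number_of_digits=k corresponds to the usual first-k-digits test, where each k-digit prefix d has expected share log10(1 + 1/d); the helper names `first_digits` and `benford_expected_k` are hypothetical and not part of hana_ml.

```python
import math

def first_digits(value: float, k: int = 1) -> int:
    """First k significant digits of a non-zero number, ignoring its sign."""
    # '3.141590000000000e+04' -> '3141590000000000' -> first k characters
    significand = f"{abs(value):.15e}".split("e")[0].replace(".", "")
    return int(significand[:k])

def benford_expected_k(k: int = 1) -> dict:
    """Expected Benford share of every k-digit prefix (e.g. 10..99 for k=2)."""
    low, high = 10 ** (k - 1), 10 ** k
    return {d: math.log10(1 + 1 / d) for d in range(low, high)}

print(first_digits(31415.9, 2))
print(round(benford_expected_k(2)[10], 4))
```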

discrete : bool, optional

Set this parameter to True to round the differences of the ordered data, which avoids floating-point errors in the second order distribution.

If your data is continuous (like a simulated lognormal), set this parameter to False.

Defaults to True.

rounding : int, optional

Specifies the number of digits used for rounding when discrete is set to True.

No default value.
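To show why rounding matters for the second order analysis, here is a minimal sketch of computing the differences of the ordered data. It is not the PAL implementation: the function name `second_order_differences` is hypothetical, and dropping zero differences (which have no leading digit) is an assumption.

```python
def second_order_differences(values, discrete=True, rounding=None):
    """Differences between consecutive values after sorting.

    When `discrete` is True and `rounding` is given, each difference is
    rounded to `rounding` decimal places to suppress floating-point noise.
    """
    ordered = sorted(values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    if discrete and rounding is not None:
        diffs = [round(d, rounding) for d in diffs]
    # Assumption: zero differences are dropped, since they have no leading digit.
    return [d for d in diffs if d != 0]

# Without rounding, 0.3 - 0.1 is slightly less than 0.2 in binary floating point,
# so its leading digit would be read as 1 instead of 2.
print(second_order_differences([0.6, 0.1, 0.3], discrete=False))
print(second_order_differences([0.6, 0.1, 0.3], discrete=True, rounding=3))
```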

Returns:

DataFrame 1: Benford analysis result for input data, structured as follows:
- 1st column: COLUMN_NAME, the column name of each feature column.
- 2nd column: MAD, mean absolute deviation.
- 3rd column: MAD_CONFORMITY, conformity to Benford's law using the MAD.
- 4th column: DISTORTION, distortion factor.
DataFrame 2: BFD information, structured as follows:
- 1st column: COLUMN_NAME, the column name of each feature in data.
- 2nd column: DIGITS, the first digits of the feature values.
- 3rd column to 12th column: other statistics, such as the distribution of the first digits of the data (DATA_DISTRIBUTION), the distribution of the first digits of the second order analysis (SECOND_ORDER_DISTRIBUTION), etc.
DataFrame 3: Second order information, structured as follows:
- 1st column: COLUMN_NAME, the column name of each feature in data.
- 2nd column: SECOND_ORDER, the differences of the ordered feature values.
- 3rd column: SECOND_ORDER_DIGITS, the first digits for the second order analysis.
DataFrame 4: Mantissa statistics.
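Two of the reported quantities can be illustrated with a short sketch. It assumes the common definitions: the MAD is the average absolute gap between observed and Benford-expected first-digit shares, and the mantissa of a value is the fractional part of log10 of its absolute value. The exact formulas and conformity thresholds used by PAL may differ, and the helper names `benford_mad` and `mantissa` are hypothetical.

```python
import math

def benford_mad(observed_shares: dict) -> float:
    """Mean absolute deviation between observed and expected first-digit shares.

    `observed_shares` maps each digit 1-9 to its observed proportion;
    missing digits are treated as a share of 0.
    """
    total = sum(
        abs(observed_shares.get(d, 0.0) - math.log10(1 + 1 / d))
        for d in range(1, 10)
    )
    return total / 9

def mantissa(value: float) -> float:
    """Fractional part of log10(|value|); lies in [0, 1) for non-zero input."""
    return math.log10(abs(value)) % 1.0

uniform = {d: 1 / 9 for d in range(1, 10)}  # every digit equally likely
print(round(benford_mad(uniform), 4))
print(round(mantissa(-200), 4))
```

For data that follows Benford's law, the mantissas are expected to be spread roughly uniformly over [0, 1), which is what mantissa statistics typically summarize.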

Examples

>>> from hana_ml.algorithms.pal.stats import benford_analysis
>>> result, bfd, sec_ord, mantissa = benford_analysis(data,
                                                      key='ID',
                                                      categorical_variable='Y',
                                                      sign='positive',
                                                      number_of_digits=1,
                                                      discrete=True,
                                                      rounding=3)