benford_analysis
- hana_ml.algorithms.pal.stats.benford_analysis(data, key=None, categorical_variable=None, sign=None, number_of_digits=None, discrete=True, rounding=None, thread_ratio=None)
Benford analysis is a data mining tool based on Benford's law (Frank Benford, 1938). The law is a probability distribution that describes the likelihood of each first digit in a series of integers. It states that the leading digit of a number is more likely to be small (like 1, 2, or 3) than large (like 7, 8, or 9). For example, numbers starting with 1 occur about 30% of the time, numbers starting with 2 occur about 18% of the time, and so on. Benford analysis counts the number of times each leading digit (1-9) occurs in a feature (field), and then compares the actual count to the expected count calculated using Benford's law.
In practice, Benford analysis is used in forensic accounting to analyze transactions for irregularities that may indicate fraudulent behavior or bias.
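The comparison described above can be illustrated in plain Python, independently of HANA. This is a minimal sketch, not the PAL implementation: the helper names `benford_expected`, `leading_digit`, and `first_digit_table` are hypothetical, and the expected proportion uses the standard Benford formula log10(1 + 1/d).

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected proportion of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(value):
    """First significant digit of a nonzero number (hypothetical helper)."""
    # Strip the sign, then any leading zeros and the decimal point,
    # so that 0.00123 -> '123' and -987 -> '987'.
    return int(str(abs(value)).lstrip("0.")[0])

def first_digit_table(values):
    """Map each digit 1-9 to (observed proportion, expected proportion)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {d: (counts[d] / n, benford_expected(d)) for d in range(1, 10)}

# Powers of two are a classic sequence that follows Benford's law closely.
sample = [2 ** k for k in range(1, 101)]
table = first_digit_table(sample)
for digit, (observed, expected) in table.items():
    print(f"{digit}: observed={observed:.3f} expected={expected:.3f}")
```

Large gaps between the observed and expected columns for a feature are what a Benford analysis flags for closer inspection.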
- Parameters:
- data : DataFrame
Input data for performing Benford analysis. Only numerical features are analyzed; categorical ones are ignored.
- key : str, optional
Specifies the ID column of data.
Defaults to the index of data if data is indexed by a single column; otherwise it must be specified.
- categorical_variable : str or a list of str, optional
Specifies the integer columns in data that should be treated as categorical. The columns specified in categorical_variable are skipped by Benford analysis.
Note that any non-integer column supplied here is ignored.
- sign : {'positive', 'negative', 'both'}
Specifies the scope of data for Benford analysis.
'positive': analyzes only data values greater than zero.
'negative': analyzes only data values less than zero.
'both': analyzes both positive and negative values of the data.
Note that this parameter is mandatory, so a valid choice must be specified.
- number_of_digits : int, optional
Specifies the number of first digits to analyze.
Defaults to 1.
- discrete : bool, optional
If True, the differences of the ordered data are rounded off to avoid floating-point errors in the second-order distribution.
If your data is continuous (for example, simulated lognormal data), set this parameter to False.
Defaults to True.
- rounding : int, optional
Specifies the number of digits used for rounding when discrete is set to True.
No default value.
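To illustrate why discrete and rounding matter for the second-order analysis, here is a small plain-Python sketch. It is an assumption about the general technique, not the PAL implementation: the function name `second_order_differences` is hypothetical, and dropping zero differences (which have no leading digit) is a choice made for this sketch.

```python
def second_order_differences(values, discrete=True, rounding=None):
    """Differences between consecutive values of the sorted data.

    Hypothetical helper: when `discrete` is True and `rounding` is given,
    each difference is rounded to `rounding` decimal places so that
    floating-point noise does not change its leading digit.
    """
    ordered = sorted(values)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    if discrete and rounding is not None:
        diffs = [round(d, rounding) for d in diffs]
    # Zero differences carry no leading digit, so this sketch drops them.
    return [d for d in diffs if d != 0]

data_values = [0.1, 0.3, 0.6, 1.0]
raw = second_order_differences(data_values, discrete=False)
rounded = second_order_differences(data_values, discrete=True, rounding=3)
print(raw)
print(rounded)
```

Without rounding, the first difference (0.3 - 0.1) is stored as a value slightly below 0.2, so its leading digit would be read as 1 instead of 2. Rounding removes that artifact.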
- Returns:
- DataFrame 1
Benford analysis result for input data, structured as follows:
- - 1st column : COLUMN_NAME, column name of each feature column.
- - 2nd column : MAD, mean absolute deviation.
- - 3rd column : MAD_CONFORMITY, conformity to Benford's law using the MAD.
- - 4th column : DISTORTION, distortion factor.
- DataFrame 2
BFD information, structured as follows:
- - 1st column : COLUMN_NAME, the column name of each feature in data.
- - 2nd column : DIGITS, the first digits of the feature values.
- - 3rd column to 12th column : other statistics, such as the distribution of the first digits of the data (DATA_DISTRIBUTION), the distribution of the first digits of the second-order analysis (SECOND_ORDER_DISTRIBUTION), etc.
- DataFrame 3
Second-order information, structured as follows:
- - 1st column : COLUMN_NAME, the column name of each feature in data.
- - 2nd column : SECOND_ORDER, the differences of the ordered feature values.
- - 3rd column : SECOND_ORDER_DIGITS, the first digits for the second-order analysis.
- DataFrame 4
Mantissa statistics.
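For intuition about the MAD column, the sketch below computes a mean absolute deviation between observed first-digit proportions and the Benford expectation. This is a common textbook definition and is assumed here for illustration; the exact formula and the conformity categories used by PAL are not stated in this page. The function name `benford_mad` is hypothetical.

```python
import math

def benford_mad(observed_proportions):
    """Mean absolute deviation from Benford's law for first digits 1-9.

    `observed_proportions` is a sequence of nine proportions, one per
    digit 1..9 (assumed layout for this sketch).
    """
    expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
    deviations = [abs(o - e) for o, e in zip(observed_proportions, expected)]
    return sum(deviations) / len(expected)

# A perfectly Benford-distributed feature has a MAD of zero,
# while uniformly distributed first digits deviate noticeably.
perfect = [math.log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9
print(benford_mad(perfect))
print(benford_mad(uniform))
```

A smaller MAD means the feature's first digits sit closer to the Benford distribution.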
Examples
>>> from hana_ml.algorithms.pal.stats import benford_analysis
>>> result, bfd, sec_ord, mantissa = benford_analysis(data, key='ID', categorical_variable='Y', sign='positive', number_of_digits=1, discrete=True, rounding=3)