entropy
- hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)
Calculates the information entropy of attributes.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- col : str or a list of str, optional
Name of the column or columns to process.
If not given, all columns are processed.
- distinct_value_count_detail : bool, optional
Indicates whether to output the details of distinct value counts:
False: Does not output the detailed distinct value counts.
True: Outputs the detailed distinct value counts.
Defaults to True.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.
Defaults to 0.
- Returns:
- DataFrames
Entropy results, structured as follows:
COLUMN_NAME, name of columns.
ENTROPY, entropy of columns.
COUNT_OF_DISTINCT_VALUES, count of distinct values.
Distinct values results, structured as follows:
COLUMN_NAME, name of columns.
DISTINCT_VALUE, distinct values of columns.
COUNT, number of occurrences of each distinct value.
Examples
>>> res1, res2 = entropy(data=df, col=['TEMP','WINDY'], distinct_value_count_detail=False)
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2
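
To make the two result tables more concrete, the sketch below recomputes the same kind of values locally in plain Python for one small, made-up column. It assumes Shannon entropy with the natural logarithm, which is consistent with the output above (0.690186 for the two-valued WINDY column is just under ln 2, about 0.693147), although this page does not state the base. The helper name `local_entropy` and the sample data are hypothetical and are not part of hana_ml.

```python
import math
from collections import Counter


def local_entropy(values):
    """Return (entropy, distinct_counts) for a sequence of hashable values.

    Assumption: Shannon entropy with the natural logarithm,
    H = -sum(p * ln(p)) over the relative frequency p of each distinct value.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # An empty column carries no information.
        return 0.0, {}
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy, dict(counts)


# Hypothetical sample column; this is not the data behind the example above.
windy = ["TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE",
         "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE"]

h, distinct = local_entropy(windy)

# Mirrors the first result table: one row per column.
print(f"ENTROPY={h:.6f}, COUNT_OF_DISTINCT_VALUES={len(distinct)}")

# Mirrors the second result table: one row per distinct value.
for value, count in sorted(distinct.items()):
    print(value, count)
```

Under this assumption, a column with k distinct values has an entropy of at most ln k, reached when all values are equally frequent, so the ENTROPY and COUNT_OF_DISTINCT_VALUES columns can be sanity-checked against each other.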