entropy

hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)

Calculates the information entropy of attributes.

Parameters:

data : DataFrame

DataFrame containing the data.

col : str or a list of str, optional

Name of the column, or list of column names, to process.

If not given, all columns are processed.

distinct_value_count_detail : bool, optional

Indicates whether to output the details of distinct value counts:

False: Does not output detailed distinct value count.

True: Outputs detailed distinct value count.

Defaults to True.

thread_ratio : float, optional

Fraction of available threads to use, from 0 to 1. A value of 0 uses a single thread, and 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

Returns:

DataFrames

Entropy results, structured as follows:

COLUMN_NAME, name of the column.

ENTROPY, entropy of the column.

COUNT_OF_DISTINCT_VALUES, number of distinct values in the column.

Distinct values results, structured as follows:

COLUMN_NAME, name of the column.

DISTINCT_VALUE, a distinct value of the column.

COUNT, number of occurrences of that distinct value.

Examples

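>>> from hana_ml.algorithms.pal.stats import entropy
>>> # df is an existing hana_ml DataFrame that contains the TEMP and WINDY columns
>>> res1, res2 = entropy(data=df, col=['TEMP','WINDY'],
...                      distinct_value_count_detail=False)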
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2
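
To make the ENTROPY and COUNT_OF_DISTINCT_VALUES columns easier to interpret, the following is a minimal local sketch of Shannon entropy for a single column, written in plain Python. It is not the PAL implementation. The helper name column_entropy and the sample values are hypothetical, and the natural logarithm is an assumption (it is consistent with the sample output above, where a two-valued column stays below ln 2, but the source does not state the base). Null handling is not covered.

```python
import math
from collections import Counter


def column_entropy(values):
    """Return (entropy, distinct_value_counts) for one column of values.

    Hypothetical helper for illustration only; assumes natural-log entropy.
    """
    # Count occurrences of each distinct value (analogous to DISTINCT_VALUE / COUNT).
    counts = Counter(values)
    total = sum(counts.values())
    # Shannon entropy: H = -sum(p * ln(p)) over the distinct values.
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h, dict(counts)


# Hypothetical two-valued column with a 7/6 split over 13 rows.
windy = [True] * 7 + [False] * 6
h, counts = column_entropy(windy)
print("entropy:", round(h, 6))
print("count of distinct values:", len(counts))
print("distinct value counts:", counts)
```

A column with a single distinct value has entropy 0, and a column with k equally frequent values reaches the maximum, ln k, under the natural-log assumption.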