entropy

hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)

Calculates the information entropy of attributes.
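
As a rough illustration of the quantity being computed, the sketch below evaluates Shannon entropy for one column of discrete values in plain Python. It is not the PAL implementation: the helper name `shannon_entropy`, the sample data, and the use of the natural logarithm are all assumptions. The natural logarithm is at least consistent with the sample output in the Examples section, since a two-valued column split 7 to 6 yields about 0.690186 in nats.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy of a sequence of discrete values, in nats.

    Hypothetical helper; the logarithm base is an assumption.
    """
    counts = Counter(values)          # occurrences of each distinct value
    total = sum(counts.values())      # number of rows
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical column: 7 rows of one value, 6 rows of another.
windy = ["TRUE"] * 7 + ["FALSE"] * 6
print(round(shannon_entropy(windy), 6))
```

A column with a single distinct value has entropy 0, and for k distinct values the entropy is largest when all k values are equally frequent.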

Parameters
data : DataFrame

DataFrame containing the data.

col : str or a list of str, optional

Name(s) of the column(s) to process.

If not given, all columns are processed.

distinct_value_count_detail : bool, optional

Indicates whether to output the details of distinct value counts:

  • False: Does not output detailed distinct value count.

  • True: Outputs detailed distinct value count.

Defaults to True.

thread_ratio : float, optional

Specifies the proportion of available threads to use, from 0 to 1. A value of 0 means a single thread is used, while 1 means all currently available threads are used. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

Returns
DataFrames

Entropy results, structured as follows:

  • COLUMN_NAME, name of columns.

  • ENTROPY, entropy of columns.

  • COUNT_OF_DISTINCT_VALUES, count of distinct values.

Distinct values results, structured as follows:

  • COLUMN_NAME, name of columns.

  • DISTINCT_VALUE, distinct values of columns.

  • COUNT, number of occurrences of each distinct value.
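
To make the relationship between the two results concrete, the following sketch builds rows shaped like the distinct values result from an in-memory dictionary of columns. It is only an illustration under assumptions: `distinct_value_rows` and the sample data are hypothetical, and the row ordering produced by PAL may differ.

```python
from collections import Counter


def distinct_value_rows(columns):
    """Return (COLUMN_NAME, DISTINCT_VALUE, COUNT) tuples for each column.

    Hypothetical helper mirroring the layout of the distinct values result.
    """
    rows = []
    for name, values in columns.items():
        # Counter keeps values in first-seen order.
        for value, count in Counter(values).items():
            rows.append((name, value, count))
    return rows


# Hypothetical input data.
data = {"WINDY": ["TRUE", "FALSE", "TRUE"], "TEMP": [75, 80, 75]}
for row in distinct_value_rows(data):
    print(row)
```

Under this reading, COUNT_OF_DISTINCT_VALUES for a column equals the number of rows that column contributes to the distinct values result.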

Examples

>>> from hana_ml.algorithms.pal.stats import entropy
>>> res1, res2 = entropy(data=df, col=['TEMP','WINDY'],
                         distinct_value_count_detail=False)
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2