entropy

hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)

Calculates the information entropy of each selected attribute (column).

Parameters:
data : DataFrame

DataFrame containing the data.

col : str or a list of str, optional

Name(s) of the column(s) to process.

If not given, defaults to all columns.

distinct_value_count_detail : bool, optional

Indicates whether to output the details of distinct value counts:

  • False: Does not output detailed distinct value count.

  • True: Outputs detailed distinct value count.

Defaults to True.

thread_ratio : float, optional

Specifies the fraction of the total available threads that this function may use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range are ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns:
Two DataFrames, in the following order:
Entropy results, structured as follows:
  • COLUMN_NAME, name of columns.

  • ENTROPY, entropy of columns.

  • COUNT_OF_DISTINCT_VALUES, count of distinct values.

Distinct values results (empty when distinct_value_count_detail is False), structured as follows:
  • COLUMN_NAME, name of columns.

  • DISTINCT_VALUE, distinct values of columns.

  • COUNT, number of occurrences of each distinct value.
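
The layout of the distinct values result can be pictured with a small pure-Python sketch. It does not call PAL: the helper name `distinct_value_rows` is hypothetical, the sample values are made up for illustration, and skipping missing values is an assumption rather than documented behavior.

```python
from collections import Counter


def distinct_value_rows(column_name, values):
    """Build (COLUMN_NAME, DISTINCT_VALUE, COUNT) rows for one column.

    Missing values (None) are skipped; this is an assumption, not
    something the reference text above states.
    """
    counts = Counter(v for v in values if v is not None)
    # Counter keeps first-seen order, so rows follow the order of appearance.
    return [(column_name, value, count) for value, count in counts.items()]


# Hypothetical sample column, not taken from a HANA table.
rows = distinct_value_rows("WINDY", ["Yes", "No", "Yes", None, "No", "No"])
for row in rows:
    print(row)
# ('WINDY', 'Yes', 2)
# ('WINDY', 'No', 3)
```

Each tuple corresponds to one row of the distinct values result: the column name, one distinct value, and how many times that value occurs.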

Examples

Original data:

>>> df.collect()
      OUTLOOK TEMP  HUMIDITY WINDY        CLASS
0      Sunny  75.0      70.0   Yes         Play
1      Sunny   NaN      90.0   Yes  Do not Play
2      Sunny  85.0       NaN    No  Do not Play
3      Sunny  72.0      95.0    No  Do not Play
4       None   NaN      70.0  None         Play
5   Overcast  72.0      90.0   Yes         Play
6   Overcast  83.0      78.0    No         Play
7   Overcast  64.0      65.0   Yes         Play
8   Overcast  81.0      75.0    No         Play
9       None  71.0      80.0   Yes  Do not Play
10      Rain  65.0      70.0   Yes  Do not Play
11      Rain  75.0      80.0    No         Play
12      Rain  68.0      80.0    No         Play
13      Rain  70.0      96.0    No         Play

Calculate the entropy:

>>> from hana_ml.algorithms.pal.stats import entropy
>>> res1, res2 = entropy(data=df, col=['TEMP','WINDY'],
...                      distinct_value_count_detail=False)
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2
>>> res2.collect()
Empty DataFrame
Columns: [COLUMN_NAME, DISTINCT_VALUE, COUNT]
Index: []
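
As a cross-check, the figures above can be reproduced locally with a minimal sketch that does not use PAL. It assumes Shannon entropy with the natural logarithm and missing values excluded before counting. Both assumptions are inferred from the sample output, not taken from the parameter descriptions, and the helper name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return (entropy in nats, number of distinct values).

    Missing values (None) are dropped first; this is an assumption
    inferred from the example output, not documented behavior.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    entropy_nats = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy_nats, len(counts)


# TEMP and WINDY columns copied from the example table (NaN/None as None).
temp = [75.0, None, 85.0, 72.0, None, 72.0, 83.0,
        64.0, 81.0, 71.0, 65.0, 75.0, 68.0, 70.0]
windy = ["Yes", "Yes", "No", "No", None, "Yes", "No",
         "Yes", "No", "Yes", "Yes", "No", "No", "No"]

temp_entropy, temp_distinct = shannon_entropy(temp)
windy_entropy, windy_distinct = shannon_entropy(windy)

print(round(temp_entropy, 6), temp_distinct)    # 2.253858 10
print(round(windy_entropy, 6), windy_distinct)  # 0.690186 2
```

Under these assumptions the values agree with the ENTROPY and COUNT_OF_DISTINCT_VALUES columns shown above; with a base-2 logarithm or with missing values counted as a category they would not.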