entropy

hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)

Calculates the information entropy of attributes (columns) in the given data.

Parameters:

data : DataFrame

DataFrame containing the data.

col : str or a list of str, optional

Name(s) of the data column(s) to process.

If not given, all columns are processed.

distinct_value_count_detail : bool, optional

Indicates whether to output the details of distinct value counts:

False: Does not output detailed distinct value count.

True: Outputs detailed distinct value count.

Defaults to True.

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range are ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

Returns:

DataFrames

Entropy results, structured as follows:

COLUMN_NAME, name of columns.
ENTROPY, entropy of columns.
COUNT_OF_DISTINCT_VALUES, count of distinct values.

Distinct values results, structured as follows:

COLUMN_NAME, name of columns.
DISTINCT_VALUE, distinct values of columns.
COUNT, number of occurrences of each distinct value.

Examples

Original data:

>>> df.collect()
 OUTLOOK  TEMP  HUMIDITY  WINDY        CLASS
   Sunny  75.0      70.0    Yes         Play
   Sunny   NaN      90.0    Yes  Do not Play
   Sunny  85.0       NaN     No  Do not Play
   Sunny  72.0      95.0     No  Do not Play
    None   NaN      70.0   None         Play
Overcast  72.0      90.0    Yes         Play
Overcast  83.0      78.0     No         Play
Overcast  64.0      65.0    Yes         Play
Overcast  81.0      75.0     No         Play
    None  71.0      80.0    Yes  Do not Play
    Rain  65.0      70.0    Yes  Do not Play
    Rain  75.0      80.0     No         Play
    Rain  68.0      80.0     No         Play
    Rain  70.0      96.0     No         Play

Calculate the entropy:

>>> from hana_ml.algorithms.pal.stats import entropy
>>> res1, res2 = entropy(df, col=['TEMP', 'WINDY'],
...                      distinct_value_count_detail=False)
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2
>>> res2.collect()
Empty DataFrame
Columns: [COLUMN_NAME, DISTINCT_VALUE, COUNT]
Index: []
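
To make the ENTROPY and COUNT_OF_DISTINCT_VALUES figures above easier to interpret, the following is a minimal local sketch in plain Python, not the PAL implementation. It assumes Shannon entropy with the natural logarithm and assumes missing values (None/NaN) are skipped; both assumptions are inferred from the example output above rather than stated by this page. The helper name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return (entropy in nats, distinct-value count), skipping None/NaN.

    Hypothetical helper for illustration only; the natural log base and
    the null handling are assumptions inferred from the example output.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(present)
    total = len(present)
    # H = -sum(p * ln p) over the relative frequency p of each distinct value
    ent = -sum((c / total) * math.log(c / total) for c in counts.values())
    return ent, len(counts)


# TEMP and WINDY columns copied from the original data shown above
temp = [75.0, float("nan"), 85.0, 72.0, float("nan"), 72.0, 83.0,
        64.0, 81.0, 71.0, 65.0, 75.0, 68.0, 70.0]
windy = ["Yes", "Yes", "No", "No", None, "Yes", "No",
         "Yes", "No", "Yes", "Yes", "No", "No", "No"]

for name, column in (("TEMP", temp), ("WINDY", windy)):
    ent, distinct = shannon_entropy(column)
    print(name, round(ent, 6), distinct)
```

Under these assumptions the sketch yields 10 distinct values for TEMP and 2 for WINDY, with entropies that agree with the ENTROPY column of res1 to the displayed precision.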