entropy
- hana_ml.algorithms.pal.stats.entropy(data, col=None, distinct_value_count_detail=True, thread_ratio=None)
Calculates the information entropy of attributes (columns).
- Parameters
- data : DataFrame
DataFrame containing the data.
- col : str or list of str, optional
Name(s) of the data column(s) to process.
If not given, all columns are processed.
- distinct_value_count_detail : bool, optional
Indicates whether to output the details of distinct value counts:
False: Does not output the detailed distinct value counts.
True: Outputs the detailed distinct value counts.
Defaults to True.
- thread_ratio : float, optional
Specifies the ratio of the total number of threads that can be used by this function.
The value range is from 0 to 1, where 0 means using only 1 thread and 1 means using at most all currently available threads.
Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- Returns
- DataFrames
- Entropy results, structured as follows:
COLUMN_NAME, name of columns.
ENTROPY, entropy of columns.
COUNT_OF_DISTINCT_VALUES, count of distinct values.
- Distinct values results, structured as follows:
COLUMN_NAME, name of columns.
DISTINCT_VALUE, distinct values of columns.
COUNT, number of occurrences of each distinct value.
Examples
Original data:
>>> df.collect()
     OUTLOOK  TEMP  HUMIDITY  WINDY        CLASS
0      Sunny  75.0      70.0    Yes         Play
1      Sunny   NaN      90.0    Yes  Do not Play
2      Sunny  85.0       NaN     No  Do not Play
3      Sunny  72.0      95.0     No  Do not Play
4       None   NaN      70.0   None         Play
5   Overcast  72.0      90.0    Yes         Play
6   Overcast  83.0      78.0     No         Play
7   Overcast  64.0      65.0    Yes         Play
8   Overcast  81.0      75.0     No         Play
9       None  71.0      80.0    Yes  Do not Play
10      Rain  65.0      70.0    Yes  Do not Play
11      Rain  75.0      80.0     No         Play
12      Rain  68.0      80.0     No         Play
13      Rain  70.0      96.0     No         Play
Calculate the entropy:
>>> from hana_ml.algorithms.pal.stats import entropy
>>> res1, res2 = entropy(data=df, col=['TEMP', 'WINDY'], distinct_value_count_detail=False)
>>> res1.collect()
  COLUMN_NAME   ENTROPY  COUNT_OF_DISTINCT_VALUES
0        TEMP  2.253858                        10
1       WINDY  0.690186                         2
>>> res2.collect()
Empty DataFrame
Columns: [COLUMN_NAME, DISTINCT_VALUE, COUNT]
Index: []
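The figures in this example appear consistent with Shannon entropy measured with the natural logarithm and computed over non-missing values only. The following plain-Python sketch cross-checks the TEMP and WINDY results under those two assumptions. It is not the PAL implementation, and `column_entropy` is a hypothetical helper introduced only for illustration.

```python
import math
from collections import Counter


def column_entropy(values):
    """Return (entropy, distinct_count) for one column of values."""
    # Assumption: missing values (None or NaN) are dropped before counting.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(present)
    total = len(present)
    # Assumption: natural logarithm, so the entropy is expressed in nats.
    ent = -sum((c / total) * math.log(c / total) for c in counts.values())
    return ent, len(counts)


nan = float("nan")
# TEMP and WINDY columns copied from the example data, in row order 0..13.
temp = [75.0, nan, 85.0, 72.0, nan, 72.0, 83.0,
        64.0, 81.0, 71.0, 65.0, 75.0, 68.0, 70.0]
windy = ["Yes", "Yes", "No", "No", None, "Yes", "No",
         "Yes", "No", "Yes", "Yes", "No", "No", "No"]

temp_entropy, temp_distinct = column_entropy(temp)
windy_entropy, windy_distinct = column_entropy(windy)

print(round(temp_entropy, 6), temp_distinct)    # 2.253858 10
print(round(windy_entropy, 6), windy_distinct)  # 0.690186 2
```

Both values agree with the ENTROPY and COUNT_OF_DISTINCT_VALUES columns shown above, which supports the assumptions but does not prove how the function treats other edge cases.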