hana_ml.visualizers package

The Visualizers Package consists of the following sections:

hana_ml.visualizers.eda

This module represents an EDA (exploratory data analysis) plotter. Matplotlib is used by default; plotly is used when enable_plotly=True.

hana_ml.visualizers.eda.kdeplot(data, key, features=None, kde=<hana_ml.algorithms.pal.kernel_density.KDE object>, points=1000, enable_plotly=False, **kwargs)

Display a kernel density estimate plot for SAP HANA DataFrame.

Parameters
dataDataFrame

Dataframe including the data of density distribution.

keystr

Name of the ID column in the dataframe.

featuresstr/list of str, optional

Name of the feature columns in the dataframe.

kdehana_ml.algorithms.pal.kernel_density.KDE, optional

KDE Calculation.

Defaults to KDE().

pointsint, optional

The number of points for plotting.

Defaults to 1000.

enable_plotlybool, optional

Use plotly instead of matplotlib.

Defaults to False.

Returns
axAxes

The axes for the plot.

surfPoly3DCollection

The surface plot object. Only valid for 2D plotting. Only for matplotlib plot.

Examples

>>> f = plt.figure(figsize=(19, 10))
>>> ax = kdeplot(data, key="PASSENGER_ID", features=["AGE"])
>>> ax.grid()
>>> plt.show()
_images/kde_plot.png
>>> f = plt.figure(figsize=(19, 10))
>>> ax, surf = kdeplot(data, key="PASSENGER_ID", features=["AGE", "FARE"])
>>> ax.grid()
>>> plt.show()
_images/kde_plot2.png
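To illustrate what the points parameter controls, here is a minimal pure-Python sketch of a one-dimensional Gaussian kernel density estimate evaluated on an evenly spaced grid. It is not the PAL KDE implementation: the function name gaussian_kde_curve, the Scott's-rule bandwidth, and the sample ages are assumptions made for this example.

```python
import math


def gaussian_kde_curve(values, points=1000, bandwidth=None):
    """Evaluate a 1-D Gaussian KDE on `points` evenly spaced grid positions."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if bandwidth is None:
        # Scott's rule of thumb for 1-D data (an assumption of this sketch).
        bandwidth = std * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Extend the grid three bandwidths beyond the data on both sides.
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (points - 1)
    grid = [lo + i * step for i in range(points)]
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


# Hypothetical sample of passenger ages.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0]
grid, density = gaussian_kde_curve(ages, points=200)
# A density curve should integrate to roughly 1.
area = sum(density) * (grid[1] - grid[0])
```

A larger points value gives a smoother curve at the cost of more kernel evaluations.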
hana_ml.visualizers.eda.hist(data, columns, bins=None, debrief=False, x_axis_fontsize=10, x_axis_rotation=0, title_fontproperties=None, default_bins=20, rounding_precision=3, replacena=0, enable_plotly=False, **kwargs)

Plot histograms for SAP HANA DataFrame.

Parameters
dataDataFrame

DataFrame used for the plot.

columnslist of str

Columns in the DataFrame being plotted.

binsint or dict, optional

The number of bins to create based on the value of column.

Defaults to 20.

debriefbool, optional

Whether to include the skewness debrief.

Defaults to False.

x_axis_fontsizeint, optional

The size of x axis labels.

Defaults to 10.

x_axis_rotationint, optional

The rotation of x axis labels.

Defaults to 0.

title_fontpropertiesFontProperties, optional

Change the font properties for title. Only for matplotlib plot.

Defaults to None.

default_binsint, optional

The number of bins to create for the column that has not been specified in bins when bins is dict.

Defaults to 20.

rounding_precisionint, optional

The rounding precision for bin size.

Defaults to 3.

replacenafloat, optional

Replace na with the specified value.

Defaults to 0.

enable_plotlybool, optional

Use plotly instead of matplotlib.

Defaults to False.

Examples

>>> hist(data=data, columns=['PCLASS', 'AGE', 'SIBSP', 'PARCH', 'FARE'], default_bins=20, bins={"AGE": 10})
_images/hist_plot.png
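The interaction between bins and default_bins can be sketched as follows. The helper resolve_bins is hypothetical and not part of hana_ml; in particular, treating bins=None as "use default_bins for every column" is an assumption of this sketch.

```python
def resolve_bins(columns, bins=None, default_bins=20):
    """Return the bin count per column (hypothetical helper for illustration)."""
    if bins is None:
        # Assumption: no explicit bins means default_bins everywhere.
        return {col: default_bins for col in columns}
    if isinstance(bins, int):
        # A single integer applies to every column.
        return {col: bins for col in columns}
    # A dict overrides selected columns; the rest fall back to default_bins.
    return {col: bins.get(col, default_bins) for col in columns}


per_column = resolve_bins(["PCLASS", "AGE", "FARE"], bins={"AGE": 10}, default_bins=20)
```

With the arguments from the example above, only AGE gets 10 bins and the other columns get 20.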
class hana_ml.visualizers.eda.EDAVisualizer(ax=None, size=None, cmap=None, enable_plotly=False, fig=None)

Bases: hana_ml.visualizers.visualizer_base.Visualizer

Class for all EDA visualizations, including:

  • Distribution plot

  • Pie plot

  • Correlation plot

  • Scatter plot

  • Bar plot

  • Box plot

Parameters
axmatplotlib.Axes, optional

The axes used to plot the figure. Only for matplotlib plot.

Default value is current axes.

sizetuple of integers, optional

(width, height) of the plot in dpi. Only for matplotlib plot.

Default value is the current size of the plot.

cmapmatplotlib.pyplot.colormap, optional

Color map used for the plot. Only for matplotlib plot.

Defaults to None.

enable_plotlybool, optional

Use plotly instead of matplotlib.

Defaults to False.

figFigure, optional

Plotly's figure. Only for plotly plot.

Examples

>>> f = plt.figure(figsize=(10,10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
Attributes
ax

Returns the matplotlib Axes where the Visualizer will draw.

cmap

Returns the color map being used for the plot.

size

Returns the size of the plot in pixels.

Methods

bar_plot(data, column, aggregation[, title, ...])

Displays a bar plot for the SAP HANA DataFrame column specified.

box_plot(data, column[, outliers, title, ...])

Displays a box plot for the SAP HANA DataFrame column specified.

correlation_plot(data[, key, corr_cols, ...])

Displays a correlation plot for the SAP HANA DataFrame columns specified.

distribution_plot(data, column, bins[, ...])

Displays a distribution plot for the SAP HANA DataFrame column specified.

pie_plot(data, column[, explode, title, ...])

Displays a pie plot for the SAP HANA DataFrame column specified.

reset()

Reset.

scatter_plot(data, x, y[, x_bins, y_bins, ...])

Displays a scatter plot for the SAP HANA DataFrame columns specified.

set_ax(ax)

Sets the Axes

set_cmap(cmap)

Sets the colormap

set_size(size)

Sets the size

distribution_plot(data, column, bins, title=None, x_axis_fontsize=10, x_axis_rotation=0, debrief=False, rounding_precision=3, title_fontproperties=None, replacena=0, x_axis_label='', y_axis_label='', subplot_pos=(1, 1), **kwargs)

Displays a distribution plot for the SAP HANA DataFrame column specified.

Parameters
dataDataFrame

DataFrame used for the plot.

columnstr

Column in the DataFrame being plotted.

binsint

Number of bins to create based on the value of column.

titlestr, optional

Title for the plot.

x_axis_fontsizeint, optional

Size of x axis labels.

Defaults to 10.

x_axis_rotationint, optional

Rotation of x axis labels.

Defaults to 0.

debriefbool, optional

Whether to include the skewness debrief.

Defaults to False.

rounding_precisionint, optional

The rounding precision for bin size.

Defaults to 3.

title_fontpropertiesFontProperties, optional

Change the font properties for title.

Defaults to None.

replacenafloat, optional

Replace na with the specified value.

Defaults to 0.

x_axis_labelstr, optional

x axis label. Only for plotly plot.

Defaults to "".

y_axis_labelstr, optional

y axis label. Only for plotly plot.

Defaults to "".

subplot_postuple, optional

(row, col) for plotly subplot.

Defaults to (1, 1).

Returns
matplotlib:
axAxes

The axes for the plot.

bin_datapandas.DataFrame

The data used in the plot.

plotly:
figFigure

The distribution plot.

trace: graph object trace

The trace of the plot, used in hist().

bin_datapandas.DataFrame

The data used in the plot.

Examples

>>> f = plt.figure(figsize=(35, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, dist_data = eda.distribution_plot(data=data, column="FARE", bins=100, title="Distribution of FARE")
>>> plt.show()
_images/distribution_plot.png
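As a rough illustration of how bins, rounding_precision, and replacena fit together, the sketch below computes equal-width bin counts in plain Python. The function bin_counts is hypothetical; the actual binning is performed in SAP HANA and may differ in edge handling.

```python
def bin_counts(values, bins, rounding_precision=3, replacena=0):
    """Count values per equal-width bin; None is replaced by `replacena` first."""
    cleaned = [replacena if v is None else v for v in values]
    lo, hi = min(cleaned), max(cleaned)
    width = (hi - lo) / bins if hi > lo else 1.0
    counts = [0] * bins
    for v in cleaned:
        # The maximum value falls into the last bin instead of a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # Left edge of each bin, rounded for display.
    edges = [round(lo + i * width, rounding_precision) for i in range(bins)]
    return list(zip(edges, counts))


bin_data = bin_counts([0, 1, 2, 3, 4, None, 10], bins=5)
```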
pie_plot(data, column, explode=0.03, title=None, legend=True, title_fontproperties=None, legend_fontproperties=None, subplot_pos=(1, 1), **kwargs)

Displays a pie plot for the SAP HANA DataFrame column specified.

Parameters
dataDataFrame

DataFrame used for the plot.

columnstr

Column in the DataFrame being plotted.

explodefloat, optional

Relative spacing between pie segments. Only for matplotlib plot.

titlestr, optional

Title for the plot.

Defaults to None.

legendbool, optional

Whether to show the legend for the plot. Only for matplotlib plot.

Defaults to True.

title_fontpropertiesFontProperties, optional

Change the font properties for title. Only for matplotlib plot.

Defaults to None.

legend_fontpropertiesFontProperties, optional

Change the font properties for legend. Only for matplotlib plot.

Defaults to None.

subplot_postuple, optional

(row, col) for plotly subplot.

Defaults to (1, 1).

Returns
matplotlib:
axAxes

The axes for the plot. This can be used to set specific properties for the plot.

pie_datapandas.DataFrame

The data used in the plot.

plotly:
figFigure

The pie plot.

pie_datapandas.DataFrame

The data used in the plot.

Examples

>>> f = plt.figure(figsize=(8, 8))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, pie_data = eda.pie_plot(data, column="PCLASS", title="% of passengers in each cabin")
>>> plt.show()
_images/pie_plot.png
correlation_plot(data, key=None, corr_cols=None, label=True, cmap=None, title="Pearson's correlation (r)", **kwargs)

Displays a correlation plot for the SAP HANA DataFrame columns specified.

Parameters
dataDataFrame

DataFrame used for the plot.

keystr, optional

Name of ID column.

Defaults to None.

corr_colslist of str, optional

Columns in the DataFrame being plotted. If None then all numeric columns will be plotted.

Defaults to None.

labelbool, optional

Plot a colorbar. Only for matplotlib plot.

Defaults to True.

cmapmatplotlib.pyplot.colormap or str, optional

Color map used for the plot.

Defaults to "RdYlBu" for matplotlib and "blues" for plotly.

titlestr, optional

Title of the plot.

Defaults to "Pearson's correlation (r)".

Returns
matplotlib:
axAxes

The axes for the plot. This can be used to set specific properties for the plot.

corrpandas.DataFrame

The data used in the plot.

plotly:
figFigure

The correlation plot.

corrpandas.DataFrame

The data used in the plot.

Examples

>>> f = plt.figure(figsize=(35, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, corr = eda.correlation_plot(data=data, corr_cols=['PCLASS', 'AGE', 'SIBSP', 'PARCH', 'FARE'], cmap="Blues")
>>> plt.show()
_images/correlation_plot.png
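The quantity shown in the plot is Pearson's correlation coefficient r. The following plain-Python sketch computes it for every pair of columns; the function names and the toy columns are assumptions, and hana_ml computes the correlations in the database rather than this way.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Nested dict of r values for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric columns.
toy = {"AGE": [1, 2, 3, 4], "FARE": [2, 4, 6, 8], "PARCH": [4, 3, 2, 1]}
corr_sketch = correlation_matrix(toy)
```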
scatter_plot(data, x, y, x_bins=None, y_bins=None, title=None, label=None, cmap=None, debrief=True, rounding_precision=3, label_fontsize=12, title_fontproperties=None, sample_frac=1.0, **kwargs)

Displays a scatter plot for the SAP HANA DataFrame columns specified.

Parameters
dataDataFrame

DataFrame used for the plot.

xstr

Column to be plotted on the x axis.

ystr

Column to be plotted on the y axis.

x_binsint, optional

Number of x axis bins to create based on the value of column.

Defaults to None.

y_binsint, optional

Number of y axis bins to create based on the value of column.

Defaults to None.

titlestr, optional

Title for the plot.

Defaults to None.

labelstr, optional

Label for the color bar.

Defaults to None.

cmapmatplotlib.pyplot.colormap or str, optional

Color map used for the plot.

Defaults to "Blues" for matplotlib and "blues" for plotly.

debriefbool, optional

Whether to include the correlation debrief.

Defaults to True.

rounding_precisionint, optional

The rounding precision for bin size. Only for matplotlib plot.

Defaults to 3.

label_fontsizeint, optional

Change the font size for label. Only for matplotlib plot.

Defaults to 12.

title_fontpropertiesFontProperties, optional

Change the font properties for title.

Defaults to None.

sample_fracfloat, optional

Sampling method is applied to data. Valid if x_bins and y_bins are not set.

Defaults to 1.0.

Returns
matplotlib:
axAxes

The axes for the plot.

bin_matrixpandas.DataFrame

The data used in the plot.

plotly:
figFigure

The scatter plot.

Examples

>>> f = plt.figure(figsize=(10, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, bin_matrix = eda.scatter_plot(data=data, x="AGE", y="SIBSP", x_bins=5, y_bins=5)
>>> plt.show()
_images/scatter_plot.png
>>> f = plt.figure(figsize=(10, 10))
>>> ax2 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax2)
>>> ax2 = eda.scatter_plot(data=data, x="AGE", y="SIBSP", sample_frac=0.8, s=10, marker='o')
>>> plt.show()
_images/scatter_plot2.png
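When x_bins and y_bins are set, the plot is based on a matrix of counts per (x bin, y bin) cell. A minimal sketch of such a matrix, assuming equal-width bins and using the hypothetical helper count_matrix, could look like this:

```python
def count_matrix(xs, ys, x_bins, y_bins):
    """Count points per cell of an equal-width x_bins by y_bins grid."""

    def bin_index(value, lo, hi, n):
        if hi == lo:
            return 0
        # The maximum value is clamped into the last bin.
        return min(int((value - lo) / (hi - lo) * n), n - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    matrix = [[0] * x_bins for _ in range(y_bins)]
    for x, y in zip(xs, ys):
        row = bin_index(y, y_lo, y_hi, y_bins)
        col = bin_index(x, x_lo, x_hi, x_bins)
        matrix[row][col] += 1
    return matrix


cells = count_matrix([0, 1, 2, 3], [0, 1, 2, 3], x_bins=2, y_bins=2)
```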
bar_plot(data, column, aggregation, title=None, label_fontsize=12, title_fontproperties=None, orientation=None, **kwargs)

Displays a bar plot for the SAP HANA DataFrame column specified.

Parameters
dataDataFrame

DataFrame used for the plot.

columnstr

Column to be aggregated.

aggregationdict

Aggregation conditions ('avg', 'count', 'max', 'min').

titlestr, optional

Title for the plot.

Defaults to None.

label_fontsizeint, optional

The size of label. Only for matplotlib plot.

Defaults to 12.

title_fontpropertiesFontProperties, optional

Change the font properties for title.

Defaults to None.

orientationstr, optional

One of 'h' for horizontal or 'v' for vertical. Only for plotly plot. Defaults to 'v' if x and y are provided and are both continuous or both categorical; otherwise 'v' ('h') if x (y) is categorical and y (x) is continuous; otherwise 'v' ('h') if only x (y) is provided.

Returns
axAxes

The axes for the plot.

bar_datapandas.DataFrame

The data used in the plot.

Examples

>>> f = plt.figure(figsize=(10, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax, bar_data = eda.bar_plot(data=data, column='COLUMN',
                                aggregation={'COLUMN':'count'})

Returns : bar plot (count) of 'COLUMN'.

>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax, bar_data = eda.bar_plot(data=data, column='COLUMN',
                                aggregation={'OTHER_COLUMN':'avg'})

Returns : bar plot (avg) of 'COLUMN' against 'OTHER_COLUMN'.
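The aggregation dict maps a column name to one of 'avg', 'count', 'max', or 'min'. The sketch below mimics that grouping on a list of plain Python dicts. The helper aggregate_by and the sample rows are hypothetical, and the real aggregation runs as SQL in SAP HANA.

```python
from collections import defaultdict


def aggregate_by(rows, column, aggregation):
    """Group rows by `column` and reduce the aggregated column per group."""
    # The sketch supports exactly one {target_column: function} pair.
    ((target, func),) = aggregation.items()
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    reducers = {
        "avg": lambda vals: sum(vals) / len(vals),
        "count": len,
        "max": max,
        "min": min,
    }
    reduce_group = reducers[func]
    return {key: reduce_group(vals) for key, vals in sorted(groups.items())}


rows = [
    {"PCLASS": 1, "FARE": 80.0},
    {"PCLASS": 1, "FARE": 60.0},
    {"PCLASS": 3, "FARE": 8.0},
]
avg_fare = aggregate_by(rows, "PCLASS", {"FARE": "avg"})
class_count = aggregate_by(rows, "PCLASS", {"PCLASS": "count"})
```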

box_plot(data, column, outliers=False, title=None, groupby=None, lower_outlier_fence_factor=0, upper_outlier_fence_factor=0, title_fontproperties=None, **kwargs)

Displays a box plot for the SAP HANA DataFrame column specified.

Parameters
dataDataFrame

DataFrame used for the plot.

columnstr

Column in the DataFrame being plotted.

outliersbool, optional

Whether to plot suspected outliers and outliers.

Defaults to False.

titlestr, optional

Title for the plot.

Defaults to None.

groupbystr, optional

Column to group by and compare.

Defaults to None.

lower_outlier_fence_factorfloat, optional

The lower bound of outlier fence factor.

Defaults to 0.

upper_outlier_fence_factorfloat, optional

The upper bound of outlier fence factor.

Defaults to 0.

title_fontpropertiesFontProperties, optional

Change the font properties for title.

Defaults to None.

Returns
axAxes

The axes for the plot.

sta_tablepandas.DataFrame or list of pandas.DataFrame

The data used in the plot.

Examples

>>> f = plt.figure(figsize=(10, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, sta_table = eda.box_plot(data=data, column="AGE")
>>> plt.show()
_images/box_plot.png
>>> f = plt.figure(figsize=(10, 10))
>>> ax1 = f.add_subplot(111)
>>> eda = EDAVisualizer(ax1)
>>> ax1, sta_table = eda.box_plot(data=data, column="AGE", groupby="SEX")
>>> plt.show()
_images/box_plot2.png
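For readers unfamiliar with outlier fences, the sketch below shows the conventional Tukey rule, where points beyond the quartiles by more than a factor times the interquartile range are treated as outliers. How hana_ml interprets lower_outlier_fence_factor and upper_outlier_fence_factor is not stated in this section, so the 1.5 factor and the function names here are assumptions.

```python
import statistics


def tukey_fences(values, factor=1.5):
    """Lower and upper fences: Q1 - factor * IQR and Q3 + factor * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def split_outliers(values, factor=1.5):
    """Separate values inside the fences from those outside."""
    lower, upper = tukey_fences(values, factor)
    inside = [v for v in values if lower <= v <= upper]
    outside = [v for v in values if v < lower or v > upper]
    return inside, outside


# Hypothetical ages with one implausible entry.
inside, outside = split_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```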
property ax

Returns the matplotlib Axes where the Visualizer will draw.

property cmap

Returns the color map being used for the plot.

reset()

Reset.

set_ax(ax)

Sets the Axes

set_cmap(cmap)

Sets the colormap

set_size(size)

Sets the size

property size

Returns the size of the plot in pixels.

class hana_ml.visualizers.eda.Profiler(*args, **kwargs)

Bases: object

A class to build a SAP HANA Profiler, including:

  • Variable descriptions

  • Missing values %

  • High cardinality %

  • Skewness

  • Numeric distributions

  • Categorical distributions

  • Correlations

  • High correlation warnings

Methods

description(data, key[, bins, ...])

Returns a SAP HANA profiler, including:

set_size(fig, figsize)

Set the size of the data description plot, in inches.

description(data, key, bins=20, missing_threshold=10, card_threshold=100, skew_threshold=0.5, figsize=None)

Returns a SAP HANA profiler, including:

  • Variable descriptions

  • Missing values %

  • High cardinality %

  • Skewness

  • Numeric distributions

  • Categorical distributions

  • Correlations

  • High correlation warnings

Parameters
dataDataFrame

DataFrame used for the plot.

keystr

Name of the key column in the DataFrame.

binsint, optional

Number of bins for numeric distributions. Default value = 20.

missing_thresholdfloat, optional

Percentage threshold to display missing values. Default value = 10.

card_thresholdint, optional

Threshold for column to be considered with high cardinality. Default value = 100.

skew_thresholdfloat, optional

Absolute value threshold for column to be considered as highly skewed. Default value = 0.5.

tight_layoutbool, optional

Use matplotlib tight layout or not.

figsizetuple, optional

Size of figure to be plotted. First element is width, second is height.

Note: categorical columns with cardinality warnings are not plotted.
Returns
figFigure

matplotlib axis of the profiler.
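The three thresholds can be read as simple per-column checks. The sketch below is only an illustration under stated assumptions: the helper profile_column is hypothetical, the comparisons use "greater than or equal", and skewness is computed as the moment coefficient m3 / m2^1.5, none of which is confirmed by this section.

```python
def profile_column(values, missing_threshold=10, card_threshold=100, skew_threshold=0.5):
    """Return (missing percentage, list of warnings) for one column."""
    if not values:
        raise ValueError("column is empty")
    present = [v for v in values if v is not None]
    missing_pct = 100.0 * (len(values) - len(present)) / len(values)
    warnings = []
    if missing_pct >= missing_threshold:
        warnings.append("missing")
    numeric = all(isinstance(v, (int, float)) for v in present)
    if numeric and len(present) > 2:
        n = len(present)
        mean = sum(present) / n
        m2 = sum((v - mean) ** 2 for v in present) / n
        m3 = sum((v - mean) ** 3 for v in present) / n
        skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
        if abs(skew) >= skew_threshold:
            warnings.append("skewed")
    elif not numeric and len(set(present)) >= card_threshold:
        # Only non-numeric columns are checked for cardinality in this sketch.
        warnings.append("high cardinality")
    return missing_pct, warnings


pct, warns = profile_column([1, 1, 1, 1, 1, 1, 1, 1, 50, None])
```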

set_size(fig, figsize)

Set the size of the data description plot, in inches.

Parameters
figax

The returned axes constructed by the description method.

figsizetuple

Tuple of width and height for the plot.

hana_ml.visualizers.metrics

This module represents a visualizer for metrics.

The following class is available:

class hana_ml.visualizers.metrics.MetricsVisualizer(ax=None, size=None, cmap=None, title=None, enable_plotly=False)

Bases: hana_ml.visualizers.visualizer_base.Visualizer, object

The MetricsVisualizer is used to visualize metrics.

Parameters
axmatplotlib.Axes, optional

The axes to use to plot the figure. Default value : Current axes

sizetuple of integers, optional

(width, height) of the plot in dpi. Default value: Current size of the plot.

titlestr, optional

Title for the plot.

enable_plotlybool, optional

Use plotly instead of matplotlib.

Defaults to False.

Attributes
ax

Returns the matplotlib Axes where the Visualizer will draw.

cmap

Returns the color map being used for the plot.

size

Returns the size of the plot in pixels.

Methods

plot_confusion_matrix(df[, normalize])

This function plots the confusion matrix and returns the Axes where this is drawn.

reset()

Reset.

set_ax(ax)

Sets the Axes

set_cmap(cmap)

Sets the colormap

set_size(size)

Sets the size

plot_confusion_matrix(df, normalize=False, **kwargs)

This function plots the confusion matrix and returns the Axes where this is drawn.

Parameters
dfDataFrame

Data points to the resulting confusion matrix. This dataframe's columns should match columns ('CLASS', '')
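What normalize=True means for the plotted values is not spelled out here. A common convention, assumed in the sketch below, is to divide each row (actual class) by its row total so that every row sums to 1. The helper normalize_rows and the sample counts are hypothetical.

```python
def normalize_rows(matrix):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Rows without any observations stay at zero.
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Rows are actual classes, columns are predicted classes (assumed layout).
shares = normalize_rows([[8, 2], [1, 9]])
```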

property ax

Returns the matplotlib Axes where the Visualizer will draw.

property cmap

Returns the color map being used for the plot.

reset()

Reset.

set_ax(ax)

Sets the Axes

set_cmap(cmap)

Sets the colormap

set_size(size)

Sets the size

property size

Returns the size of the plot in pixels.

hana_ml.visualizers.m4_sampling

This module contains M4 algorithm for sampling query.

The following functions are available:

hana_ml.visualizers.m4_sampling.get_min_index(data)

Gets the minimum timestamp of the time series data. For internal use only.

Parameters
dataDataFrame

Time series data whose 1st column is index and 2nd one is value.

Returns
datetime

Return the minimum timestamp.

hana_ml.visualizers.m4_sampling.get_max_index(data)

Gets the maximum timestamp of the time series data. For internal use only.

Parameters
dataDataFrame

Time series data whose 1st column is index and 2nd one is value.

Returns
datetime

Return the maximum timestamp.

hana_ml.visualizers.m4_sampling.m4_sampling(data, width)

M4 algorithm for big data visualization

Parameters
dataDataFrame

Data to be sampled. Time series data whose 1st column is index and 2nd one is value.

widthint

Sampling rate. It indicates how many pixels wide the picture is.

Returns
DataFrame

Return the sampled dataframe.
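The idea behind M4 is to split the index range into width groups (one per pixel column) and keep only the first, last, minimum, and maximum point of each group. The following in-memory sketch illustrates that idea under stated assumptions: it uses a list of (index, value) tuples with numeric indices, and the function m4_sample is hypothetical. The hana_ml function works on a SAP HANA DataFrame instead.

```python
def m4_sample(series, width):
    """Keep first, last, min, and max point per pixel column.

    `series` is a list of (index, value) pairs sorted by numeric index.
    """
    if not series or width < 1:
        return []
    t_min, t_max = series[0][0], series[-1][0]
    span = (t_max - t_min) or 1
    groups = {}
    for t, v in series:
        # Map the index to a pixel column; the last index is clamped.
        col = min(int((t - t_min) / span * width), width - 1)
        groups.setdefault(col, []).append((t, v))
    sampled = set()
    for points in groups.values():
        sampled.add(points[0])                         # first point in the column
        sampled.add(points[-1])                        # last point in the column
        sampled.add(min(points, key=lambda p: p[1]))   # minimum value
        sampled.add(max(points, key=lambda p: p[1]))   # maximum value
    return sorted(sampled)


series = [(t, (t * 7) % 10) for t in range(100)]
reduced = m4_sample(series, width=4)
```

At most four points survive per pixel column, so the result has no more than 4 × width rows regardless of the input size.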

hana_ml.visualizers.model_debriefing

This module represents a visualizer for tree model. The following class is available:

class hana_ml.visualizers.model_debriefing.TreeModelDebriefing

Bases: object

Visualize tree model.

Examples

Visualize Tree Model in JSON format:

>>> TreeModelDebriefing.tree_debrief(rdt.model_)
_images/json_model.png

Visualize Tree Model in DOT format:

>>> TreeModelDebriefing.tree_parse(rdt.model_)
>>> TreeModelDebriefing.tree_debrief_with_dot(rdt.model_)
_images/dot_model.png

Visualize Tree Model in XML format (the model is stored in the DataFrame rdt.model_):

>>> TreeModelDebriefing.tree_debrief(rdt.model_)
_images/xml_model.png

Methods

shapley_explainer(predict_result, predict_data)

Create Shapley explainer to explain the output of machine learning model.

tree_debrief(model)

Visualize tree model by data in JSON or XML format.

tree_debrief_with_dot(model[, iframe_height])

Visualize tree model by data in DOT format.

tree_export(model, filename)

Save the tree model as a html file.

tree_export_with_dot(model, filename)

Save the tree model as a html file.

tree_parse(model)

Transform tree model content using DOT language.

static tree_debrief(model)

Visualize tree model by data in JSON or XML format.

Parameters
modelDataFrame

Tree model.

Returns
HTML Page

This HTML page can be rendered by browser.

static tree_export(model, filename)

Save the tree model as a html file.

Parameters
modelDataFrame

Tree model.

filenamestr

Html file name.

static tree_parse(model)

Transform tree model content using DOT language.

Parameters
modelDataFrame

Tree model.
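To give a sense of what "using DOT language" means, the sketch below converts a small nested dict into a DOT digraph string. The dict layout and the function tree_to_dot are hypothetical; the structure of the PAL tree model and the exact DOT produced by tree_parse are not described in this section.

```python
def tree_to_dot(tree):
    """Render a nested {'label': ..., 'children': [...]} dict as DOT text."""
    lines = ["digraph Tree {"]
    counter = [0]

    def visit(node):
        node_id = counter[0]
        counter[0] += 1
        lines.append(f'  n{node_id} [label="{node["label"]}"];')
        for child in node.get("children", []):
            child_id = visit(child)
            lines.append(f"  n{node_id} -> n{child_id};")
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical one-split tree.
toy_tree = {"label": "AGE < 30", "children": [{"label": "yes"}, {"label": "no"}]}
dot_text = tree_to_dot(toy_tree)
```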

static tree_debrief_with_dot(model, iframe_height='800')

Visualize tree model by data in DOT format.

Parameters
modelDataFrame

Tree model.

iframe_heightint

Frame height

Returns
HTML Page

This HTML page can be rendered by browser.

static tree_export_with_dot(model, filename)

Save the tree model as a html file.

Parameters
modelDataFrame

Tree model.

filenamestr

Html file name.

static shapley_explainer(predict_result: hana_ml.dataframe.DataFrame, predict_data: hana_ml.dataframe.DataFrame, key=None, label=None, predict_reason_column='REASON_CODE')

Create Shapley explainer to explain the output of machine learning model.

It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

Parameters
predict_resultDataFrame

Predicted result.

predict_dataDataFrame

Predicted dataset.

keystr

Name of the ID column.

labelstr

Name of the dependent variable.

predict_reason_columnstr

The reason code column in predict_result. Valid only for tree-based functionalities.

Defaults to 'REASON_CODE'.

Returns
ShapleyExplainer

Shapley explainer.

hana_ml.visualizers.dataset_report

class hana_ml.visualizers.dataset_report.DatasetReportBuilder

Bases: object

The DatasetReportBuilder instance can analyze the dataset and generate a report in HTML format.

The instance calls the dropna method of DataFrame internally to handle missing values in the dataset.

The generated report can be embedded in a notebook, including:

  • Overview
    • Dataset Info

    • Variable Types

    • High Cardinality %

    • Highly Skewed Variables

  • Sample
    • Top ten rows of dataset

  • Variables
    • Numeric distributions

    • Categorical distributions

    • Variable statistics

  • Data Correlations

  • Data Scatter Matrix

Examples

Create a DatasetReportBuilder instance:

>>> from hana_ml.visualizers.dataset_report import DatasetReportBuilder
>>> datasetReportBuilder = DatasetReportBuilder()

Assume the dataset DataFrame is df and then analyze the dataset:

>>> datasetReportBuilder.build(df, key="ID")

Display the dataset report as a notebook iframe.

>>> datasetReportBuilder.generate_notebook_iframe_report()
_images/dataset_report_example.png

Methods

build(data, key[, scatter_matrix_sampling])

Build a report for dataset.

generate_html_report(filename)

Save the dataset report as a html file.

generate_notebook_iframe_report()

Render the dataset report as a notebook iframe.

build(data, key, scatter_matrix_sampling: Optional[hana_ml.algorithms.pal.preprocessing.Sampling] = None)

Build a report for dataset.

Note that the name of data is used as the dataset name in this function. If the name of data (which is a dataframe.DataFrame object) is not set explicitly in the object instantiation, a name like 'DT_XX' will be assigned to the data.

Parameters
dataDataFrame

DataFrame to use to build the dataset report.

keystr

Name of ID column.

scatter_matrix_samplingSampling, optional

Scatter matrix sampling.

generate_html_report(filename)

Save the dataset report as a html file.

Parameters
filenamestr

Html file name.

generate_notebook_iframe_report()

Render the dataset report as a notebook iframe.

hana_ml.visualizers.shap

This module represents an explainer for Shapley values.

The following class is available:

class hana_ml.visualizers.shap.ShapleyExplainer(predict_result: hana_ml.dataframe.DataFrame, predict_data: hana_ml.dataframe.DataFrame, key=None, label=None, predict_reason_column='REASON_CODE')

Bases: object

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of machine learning model.

It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

Parameters
predict_resultDataFrame

Predicted result.

predict_dataDataFrame

Predicted dataset.

keystr

Name of the ID column.

labelstr

Name of the dependent variable.

predict_reason_columnstr

The reason code column in predict_result.

Defaults to 'REASON_CODE'.

Examples

In the following example, training data is called diabetes_train and test data is diabetes_test.

First, we create a UnifiedClassification instance:

>>> uc_hgbdt = UnifiedClassification('HybridGradientBoostingTree')

Then, create a GridSearchCV instance:

>>> gscv = GridSearchCV(estimator=uc_hgbdt,
                        param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
                                    'n_estimators': [4, 6, 8, 10],
                                    'split_threshold': [0.1, 0.4, 0.7, 1]},
                        train_control=dict(fold_num=5,
                                           resampling_method='cv',
                                           random_state=1,
                                           ref_metric=['auc']),
                        scoring='error_rate')

Call the fit() function to train the model:

>>> gscv.fit(data=diabetes_train, key= 'ID',
             label='CLASS',
             partition_method='stratified',
             partition_random_state=1,
             stratified_column='CLASS',
             build_report=True)
>>> features = diabetes_train.columns
>>> features.remove('CLASS')
>>> features.remove('ID')

Use diabetes_test for prediction:

>>> pred_res = gscv.predict(diabetes_test, key='ID', features=features)

Create a TreeModelDebriefing.shapley_explainer and then invoke summary_plot() :

>>> shapley_explainer = TreeModelDebriefing.shapley_explainer(pred_res, diabetes_test, key='ID', label='CLASS')
>>> shapley_explainer.summary_plot()

Output:

_images/shap1.png _images/shap2.png

Methods

shap_values()

Get Shapley values.

summary_plot([print_plot_details])

Global Interpretation using Shapley values.

shap_values()

Get Shapley values.

Returns
numpy.ndarray

Shapley values.
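For background on the "classic Shapley values from game theory", the sketch below computes exact Shapley values for a tiny cooperative game by averaging each player's marginal contribution over all coalitions. It is not how ShapleyExplainer obtains its values (those come from the reason code column of the prediction result); the function, the feature names, and the toy payoff are assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley value per player for a coalition payoff function."""
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                members = frozenset(coalition)
                total += weight * (value(members | {player}) - value(members))
        result[player] = total
    return result


def toy_payoff(members):
    """Hypothetical model output explained by the features in `members`."""
    payoff = 0.0
    if "AGE" in members:
        payoff += 10.0
    if "BMI" in members:
        payoff += 4.0
    if "AGE" in members and "BMI" in members:
        payoff += 6.0  # interaction effect, shared between AGE and BMI
    return payoff


phi = exact_shapley(["AGE", "BMI", "SEX"], toy_payoff)
```

The values add up to the payoff of the full coalition, and a feature that never changes the payoff receives zero.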

summary_plot(print_plot_details=False)

Global Interpretation using Shapley values.

To get an overview of which features are most important for a model we can plot the Shapley values of every feature for every sample.

Parameters
print_plot_detailsbool, optional

Specifies whether to show plotting details.

Defaults to False.

Returns
Image Component

This object can be rendered by browser.

hana_ml.visualizers.unified_report

This module builds reports for PAL/APL models.

The following class is available:

class hana_ml.visualizers.unified_report.UnifiedReport(obj)

Bases: object

The report generator for PAL/APL models. Currently, it only supports UnifiedClassification and UnifiedRegression.

Examples

Data used is called diabetes_train.

Case 1: UnifiedReport for UnifiedClassification is shown as follows. Set build_report=True in the fit() function:

>>> from hana_ml.algorithms.pal.model_selection import GridSearchCV
>>> from hana_ml.algorithms.pal.model_selection import RandomSearchCV
>>> hgc = UnifiedClassification('HybridGradientBoostingTree')
>>> gscv = GridSearchCV(estimator=hgc,
                        param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
                                    'n_estimators': [4, 6, 8, 10],
                                    'split_threshold': [0.1, 0.4, 0.7, 1]},
                        train_control=dict(fold_num=5,
                                           resampling_method='cv',
                                           random_state=1,
                                           ref_metric=['auc']),
                        scoring='error_rate')
>>> gscv.fit(data=diabetes_train, key= 'ID',
             label='CLASS',
             partition_method='stratified',
             partition_random_state=1,
             stratified_column='CLASS',
             build_report=True)

To look at the dataset report:

>>> UnifiedReport(diabetes_train).build().display()
_images/unified_report_dataset_report.png

To see the model report:

>>> UnifiedReport(gscv.estimator).display()
_images/unified_report_model_report_classification.png

We could also see the Optimal Parameter page:

_images/unified_report_model_report_classification2.png

Case 2: UnifiedReport for UnifiedRegression is shown as follows. Set build_report=True in the fit() function:

>>> hgr = UnifiedRegression(func = 'HybridGradientBoostingTree')
>>> gscv = GridSearchCV(estimator=hgr,
                        param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
                                    'n_estimators': [4, 6, 8, 10],
                                    'split_threshold': [0.1, 0.4, 0.7, 1]},
                        train_control=dict(fold_num=5,
                                           resampling_method='cv',
                                           random_state=1),
                        scoring='rmse')
>>> gscv.fit(data=diabetes_train, key= 'ID',
             label='CLASS',
             partition_method='random',
             partition_random_state=1,
             build_report=True)

To see the model report:

>>> UnifiedReport(gscv.estimator).display()
_images/unified_report_model_report_regression.png

Methods

build([key, scatter_matrix_sampling])

Build the report.

display([save_html, metric_sampling])

Display the report.

build(key=None, scatter_matrix_sampling: Optional[hana_ml.algorithms.pal.preprocessing.Sampling] = None)

Build the report.

Parameters
keystr, valid only for DataFrame

Name of ID column.

Defaults to the first column.

scatter_matrix_samplingSampling, valid only for DataFrame

Scatter matrix sampling.

Defaults to 1000 random sample points.

display(save_html=None, metric_sampling=False)

Display the report.

Parameters
save_htmlstr, optional

If it is not None, the function generates an HTML report and stores it under the given file name.

Defaults to None.

metric_samplingbool, optional

Whether the metric table needs to be sampled. It is only valid for UnifiedClassification and used together with UnifiedClassification.set_metric_samplings.

Defaults to False.

hana_ml.visualizers.visualizer_base

The following function is available:

hana_ml.visualizers.visualizer_base.forecast_line_plot(pred_data, actual_data=None, confidence=None, ax=None, figsize=None, max_xticklabels=10, marker=None, enable_plotly=False)

Plot the prediction data for time series forecast or regression model.

Parameters
pred_dataDataFrame

The forecast data to be plotted.

actual_dataDataFrame, optional

The actual data to be plotted.

Default value is None.

confidencetuple of str, optional

The column names of confidence bound.

Default value is None.

axmatplotlib.Axes, optional

The axes to use to plot the figure. Default value : Current axes

figsizetuple, optional

(width, height) of the figure. For matplotlib, the unit is inches, and for plotly, the unit is pixels.

Defaults to (15, 12) when using matplotlib, auto when using plotly.

max_xticklabelsint, optional

The maximum number of xtick labels. Defaults to 10.

marker: character, optional

Type of marker on the plot.

Defaults to None, which indicates no marker.

enable_plotlybool, optional

Use plotly instead of matplotlib.

Defaults to False.

Examples

Create an 'AdditiveModelForecast' instance and invoke the fit and predict functions:

>>> amf = AdditiveModelForecast(growth='linear')
>>> amf.fit(data=train_df)
>>> pred_data = amf.predict(data=test_df)

Visualize the forecast values:

>>> ax = forecast_line_plot(pred_data=pred_data.set_index("INDEX"),
                    actual_data=df.set_index("INDEX"),
                    confidence=("YHAT_LOWER", "YHAT_UPPER"),
                    max_xticklabels=10)
_images/line_plot.png
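
The max_xticklabels parameter caps how many x-axis labels are drawn. How forecast_line_plot selects them is not documented here; the following is only a minimal sketch of one plausible thinning strategy, assuming evenly spaced labels are kept and the rest are blanked. The helper name thin_ticklabels is hypothetical and is not part of hana_ml.

```python
import math


def thin_ticklabels(labels, max_xticklabels=10):
    """Keep at most max_xticklabels evenly spaced labels and blank out the rest."""
    if max_xticklabels < 1:
        raise ValueError("max_xticklabels must be at least 1")
    # Keep every `step`-th label so that the number kept never exceeds the cap.
    step = max(1, math.ceil(len(labels) / max_xticklabels))
    return [label if i % step == 0 else "" for i, label in enumerate(labels)]


# Illustrative data: 31 daily labels reduced to at most 10 visible ones.
dates = [f"2024-01-{day:02d}" for day in range(1, 32)]
thinned = thin_ticklabels(dates, max_xticklabels=10)
shown = [label for label in thinned if label]
print(shown)
```

The list keeps its original length, so it can be applied to an existing axis without moving the tick positions.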

hana_ml.visualizers.digraph

This module implements the digraph framework, which consists of a Python API and page assets (HTML, CSS, JS, fonts, icons, etc.). The current application scenarios of the framework are AutoML Pipeline and Model Debriefing.

The following classes are available:
class hana_ml.visualizers.digraph.Node(node_id: int, node_name: str, node_icon_id: int, node_content: str, node_in_ports: list, node_out_ports: list)

Bases: object

The Node class of the digraph framework is an entity class.

Parameters
node_idint [Automatic generation]

Unique identification of node.

node_namestr

The node name.

node_icon_idint [Automatic generation]

Unique identification of node icon.

node_contentstr

The node content.

node_in_portslist

List of input port names.

node_out_portslist

List of output port names.

class hana_ml.visualizers.digraph.InPort(node: hana_ml.visualizers.digraph.Node, port_id: str, port_name: str, port_sequence: int)

Bases: object

The InPort class of the digraph framework is an entity class.

A port is a fixed connection point on a node.

Parameters
nodeNode

The node on which the input port is fixed.

port_idstr [Automatic generation]

Unique identification of input port.

port_namestr

The input port name.

port_sequenceint [Automatic generation]

The position of input port among all input ports.

class hana_ml.visualizers.digraph.OutPort(node: hana_ml.visualizers.digraph.Node, port_id: str, port_name: str, port_sequence: int)

Bases: object

The OutPort class of the digraph framework is an entity class.

A port is a fixed connection point on a node.

Parameters
nodeNode

The node on which the output port is fixed.

port_idstr [Automatic generation]

Unique identification of output port.

port_namestr

The output port name.

port_sequenceint [Automatic generation]

The position of output port among all output ports.

class hana_ml.visualizers.digraph.Edge(source_port: hana_ml.visualizers.digraph.OutPort, target_port: hana_ml.visualizers.digraph.InPort)

Bases: object

The Edge class of the digraph framework is an entity class.

The output port of a node is connected with the input port of another node to make an edge.

Parameters
source_portOutPort

Start connection point of edge.

target_portInPort

End connection point of edge.
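
To make the relationship between these entity classes concrete, here is a minimal standalone sketch of the same data model. It does not use hana_ml and is not the library's implementation: the names SketchPort, SketchNode and SketchEdge, the id format, and the 1-based port_sequence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

# Stand-in for automatic node_id generation (assumed scheme: 1, 2, 3, ...).
_node_ids = count(1)


@dataclass
class SketchPort:
    node_id: int        # the node this port is fixed on
    port_id: str        # generated, unique per port (assumed format)
    port_name: str
    port_sequence: int  # position among ports of the same direction (assumed 1-based)
    direction: str      # "in" or "out"


@dataclass
class SketchNode:
    node_name: str
    node_content: str
    in_port_names: list
    out_port_names: list
    node_id: int = field(default_factory=lambda: next(_node_ids))
    in_ports: list = field(init=False, default_factory=list)
    out_ports: list = field(init=False, default_factory=list)

    def __post_init__(self):
        # Ports are built from their names; ids and sequences are generated.
        self.in_ports = self._make_ports(self.in_port_names, "in")
        self.out_ports = self._make_ports(self.out_port_names, "out")

    def _make_ports(self, names, direction):
        return [
            SketchPort(self.node_id, f"{self.node_id}_{direction}_{seq}", name, seq, direction)
            for seq, name in enumerate(names, start=1)
        ]


@dataclass
class SketchEdge:
    source_port: SketchPort
    target_port: SketchPort

    def __post_init__(self):
        # An edge runs from an output port of one node to an input port of another.
        if self.source_port.direction != "out" or self.target_port.direction != "in":
            raise ValueError("edge must connect an output port to an input port")


node1 = SketchNode("name1", "content1", in_port_names=[], out_port_names=["1"])
node2 = SketchNode("name2", "content2", in_port_names=["1"], out_port_names=[])
edge = SketchEdge(node1.out_ports[0], node2.in_ports[0])
```

Only names are supplied by the caller; everything marked [Automatic generation] in the parameter lists above is derived inside the sketch, which mirrors how the port lists are passed as plain names to add_model_node and add_python_node.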

class hana_ml.visualizers.digraph.Digraph(digraph_name: str, make_text_center: bool = False)

Bases: hana_ml.visualizers.digraph.BaseDigraph

The Digraph class of the digraph framework lets you add nodes and edges dynamically and then generate an HTML page. The rendered HTML page displays the node information and the relationships between nodes, and provides the following auxiliary tools to help you view the digraph:

  • Provide basic functions such as pan and zoom.

  • Locate the specified node by keyword search.

  • Look at the layout outline of the whole digraph through the minimap.

  • Switch between different digraphs through the drop-down menu.

  • The whole page can be displayed in fullscreen.

Parameters
digraph_namestr

The digraph name.

make_text_centerbool, optional

Whether the text should be centered.

Defaults to False.

Examples

  1. Importing classes of the digraph framework:

>>> from hana_ml.visualizers.digraph import Digraph, Node, Edge
  2. Creating a Digraph instance:

>>> digraph: Digraph = Digraph('Test1')
  3. Adding two nodes to the digraph instance, where node1 has only one output port and node2 has only one input port:

>>> node1: Node = digraph.add_model_node('name1', 'content1', in_ports=[], out_ports=['1'])
>>> node2: Node = digraph.add_python_node('name2', 'content2', in_ports=['1'], out_ports=[])
  4. Adding an edge to the digraph instance, where the output port of node1 points to the input port of node2:

>>> edge1_2: Edge = digraph.add_edge(node1.out_ports[0], node2.in_ports[0])
  5. Generating a notebook iframe:

>>> digraph.build()
>>> digraph.generate_notebook_iframe(iframe_height=500)
_images/digraph.png
  6. Generating a local HTML file:

>>> digraph.generate_html('Test1')

Methods

add_edge(source_port, target_port)

Add edge to digraph instance.

add_model_node(name, content, in_ports, ...)

Add node with model icon to digraph instance.

add_python_node(name, content, in_ports, ...)

Add node with python icon to digraph instance.

build()

Build HTML string based on current data.

generate_html(filename)

Save the digraph as an HTML file.

generate_notebook_iframe([iframe_height])

Render the digraph as a notebook iframe.

to_json()

Return the nodes and edges data of digraph.

to_json() -> list

Return the nodes and edges data of digraph.

Returns
list

The nodes and edges data of digraph.

build()

Build HTML string based on current data.

generate_html(filename: str)

Save the digraph as an HTML file.

Parameters
filenamestr

HTML file name.

generate_notebook_iframe(iframe_height: int = 800)

Render the digraph as a notebook iframe.
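
How the iframe is produced internally is not described here. As a hedged illustration of the general technique, the sketch below wraps a complete HTML page in an iframe through the standard srcdoc attribute, escaping the page so it is safe inside an attribute value. The function name iframe_for and the sample page are hypothetical and are not part of hana_ml.

```python
import html


def iframe_for(page_html, iframe_height=800):
    """Wrap a complete HTML page in an <iframe> via the srcdoc attribute."""
    # Escape &, <, > and quotes so the page can live inside a quoted attribute.
    escaped = html.escape(page_html, quote=True)
    return (
        f'<iframe srcdoc="{escaped}" width="100%" '
        f'height="{int(iframe_height)}px" frameborder="0"></iframe>'
    )


# Illustrative page; a real page would carry the digraph assets.
page = '<!DOCTYPE html><html><body><p id="n1">name1</p></body></html>'
snippet = iframe_for(page, iframe_height=500)
print(snippet)
```

In a Jupyter notebook, a string like this could be shown with IPython.display.HTML. Embedding through srcdoc keeps the page's CSS and JS isolated from the notebook's own page.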

add_edge(source_port: hana_ml.visualizers.digraph.OutPort, target_port: hana_ml.visualizers.digraph.InPort) -> hana_ml.visualizers.digraph.Edge

Add edge to digraph instance.

Parameters
source_portOutPort

Start connection point of edge.

target_portInPort

End connection point of edge.

Returns
Edge

The added edge.

add_model_node(name: str, content: str, in_ports: list, out_ports: list) -> hana_ml.visualizers.digraph.Node

Add node with model icon to digraph instance.

Parameters
namestr

The model node name.

contentstr

The model node content.

in_portslist

List of input port names.

out_portslist

List of output port names.

Returns
Node

The added node with model icon.

add_python_node(name: str, content: str, in_ports: [], out_ports: []) -> hana_ml.visualizers.digraph.Node

Add node with python icon to digraph instance.

Parameters
namestr

The python node name.

contentstr

The python node content.

in_portslist

List of input port names.

out_portslist

List of output port names.

Returns
Node

The added node with python icon.

class hana_ml.visualizers.digraph.MultiDigraph(multi_digraph_name: str, make_text_center: bool = False)

Bases: object

The MultiDigraph class of the digraph framework lets you add multiple child digraphs dynamically and then generate an HTML page. The rendered HTML page displays the node information and the relationships between nodes, and provides the following auxiliary tools to help you view the digraph:

  • Provide basic functions such as pan and zoom.

  • Locate the specified node by keyword search.

  • Look at the layout outline of the whole digraph through the minimap.

  • Switch between different digraphs through the drop-down menu.

  • The whole page can be displayed in fullscreen.

Parameters
multi_digraph_namestr

The multi-digraph name.

make_text_centerbool, optional

Whether the text should be centered.

Defaults to False.

Examples

  1. Importing classes of the digraph framework:

>>> from hana_ml.visualizers.digraph import MultiDigraph, Node, Edge
  2. Creating a MultiDigraph instance:

>>> multi_digraph: MultiDigraph = MultiDigraph('Test2')
  3. Creating the first digraph:

>>> digraph1 = multi_digraph.add_child_digraph('digraph1')
  4. Adding two nodes to digraph1, where node1_1 has only one output port and node2_1 has only one input port:

>>> node1_1: Node = digraph1.add_model_node('name1', 'content1', in_ports=[], out_ports=['1'])
>>> node2_1: Node = digraph1.add_python_node('name2', 'content2', in_ports=['1'], out_ports=[])
  5. Adding an edge to digraph1, where the output port of node1_1 points to the input port of node2_1:

>>> digraph1.add_edge(node1_1.out_ports[0], node2_1.in_ports[0])
  6. Creating the second digraph:

>>> digraph2 = multi_digraph.add_child_digraph('digraph2')
  7. Adding two nodes to digraph2, where node1_2 has only one output port and node2_2 has only one input port:

>>> node1_2: Node = digraph2.add_model_node('name1', 'model text', in_ports=[], out_ports=['1'])
>>> node2_2: Node = digraph2.add_python_node('name2', 'function info', in_ports=['1'], out_ports=[])
  8. Adding an edge to digraph2, where the output port of node1_2 points to the input port of node2_2:

>>> digraph2.add_edge(node1_2.out_ports[0], node2_2.in_ports[0])
  9. Generating a notebook iframe:

>>> multi_digraph.build()
>>> multi_digraph.generate_notebook_iframe(iframe_height=500)
_images/multi_digraph.png
  10. Generating a local HTML file:

>>> multi_digraph.generate_html('Test2')

Methods

ChildDigraph(child_digraph_id, ...)

Multiple child digraphs are logically a whole.

add_child_digraph(child_digraph_name)

Add child digraph to multi_digraph instance.

build()

Build HTML string based on current data.

generate_html(filename)

Save the digraph as an HTML file.

generate_notebook_iframe([iframe_height])

Render the digraph as a notebook iframe.

to_json()

Return the nodes and edges data of whole digraph.

class ChildDigraph(child_digraph_id: int, child_digraph_name: str)

Bases: hana_ml.visualizers.digraph.BaseDigraph

Multiple child digraphs are logically a whole.

Methods

add_edge(source_port, target_port)

Add edge to digraph instance.

add_model_node(name, content, in_ports, ...)

Add node with model icon to digraph instance.

add_python_node(name, content, in_ports, ...)

Add node with python icon to digraph instance.

to_json()

Return the nodes and edges data of child digraph.

to_json() -> list

Return the nodes and edges data of child digraph.

Returns
list

The nodes and edges data of the child digraph.

add_edge(source_port: hana_ml.visualizers.digraph.OutPort, target_port: hana_ml.visualizers.digraph.InPort) -> hana_ml.visualizers.digraph.Edge

Add edge to digraph instance.

Parameters
source_portOutPort

Start connection point of edge.

target_portInPort

End connection point of edge.

Returns
Edge

The added edge.

add_model_node(name: str, content: str, in_ports: list, out_ports: list) -> hana_ml.visualizers.digraph.Node

Add node with model icon to digraph instance.

Parameters
namestr

The model node name.

contentstr

The model node content.

in_portslist

List of input port names.

out_portslist

List of output port names.

Returns
Node

The added node with model icon.

add_python_node(name: str, content: str, in_ports: [], out_ports: []) -> hana_ml.visualizers.digraph.Node

Add node with python icon to digraph instance.

Parameters
namestr

The python node name.

contentstr

The python node content.

in_portslist

List of input port names.

out_portslist

List of output port names.

Returns
Node

The added node with python icon.

add_child_digraph(child_digraph_name: str) -> hana_ml.visualizers.digraph.MultiDigraph.ChildDigraph

Add child digraph to multi_digraph instance.

Parameters
child_digraph_namestr

The child digraph name.

Returns
ChildDigraph

The added child digraph.

to_json() -> list

Return the nodes and edges data of whole digraph.

Returns
list

The nodes and edges data of whole digraph.
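
The structure of the returned list is not specified in this section. Purely as an illustration of how per-child data might be gathered into one list for the whole digraph, here is a standalone sketch with an assumed layout. SketchChildDigraph, SketchMultiDigraph, add_node and the dictionary keys are hypothetical and do not reflect the actual hana_ml output.

```python
import json


class SketchChildDigraph:
    """Holds the nodes and edges of one child digraph (assumed layout)."""

    def __init__(self, child_id, name):
        self.child_id = child_id
        self.name = name
        self.nodes = []
        self.edges = []

    def add_node(self, name, content):
        node_id = len(self.nodes) + 1  # ids are local to the child digraph
        self.nodes.append({"id": node_id, "name": name, "content": content})
        return node_id

    def add_edge(self, source_id, target_id):
        known = {node["id"] for node in self.nodes}
        if source_id not in known or target_id not in known:
            raise KeyError("both endpoints must be nodes of this child digraph")
        self.edges.append({"source": source_id, "target": target_id})

    def to_json(self):
        return {"id": self.child_id, "name": self.name,
                "nodes": self.nodes, "edges": self.edges}


class SketchMultiDigraph:
    """Collects child digraphs and serializes them as one list."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child_digraph(self, child_name):
        child = SketchChildDigraph(len(self.children) + 1, child_name)
        self.children.append(child)
        return child

    def to_json(self):
        # One entry per child digraph, in insertion order.
        return [child.to_json() for child in self.children]


multi = SketchMultiDigraph("Test2")
for child_name in ("digraph1", "digraph2"):
    child = multi.add_child_digraph(child_name)
    first = child.add_node("name1", "content1")
    second = child.add_node("name2", "content2")
    child.add_edge(first, second)

data = multi.to_json()
print(json.dumps(data, indent=2))
```

Keeping node ids local to each child digraph matches the idea that the children are drawn separately and switched through the drop-down menu, while the list as a whole describes one logical unit.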

build()

Build HTML string based on current data.

generate_html(filename: str)

Save the digraph as an HTML file.

Parameters
filenamestr

HTML file name.

generate_notebook_iframe(iframe_height: int = 800)

Render the digraph as a notebook iframe.