SAP HANA Python Client API for Machine Learning Algorithms

Introduction

Welcome to the SAP HANA Python Client API for machine learning algorithms. This API enables Python data scientists to access SAP HANA data and build machine learning models using that data directly in SAP HANA. This chapter provides an overview of the SAP HANA Python Client API for machine learning algorithms.

Overview

The SAP HANA Python Client API for machine learning algorithms (Python Client API for ML) provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.

The Python Client API for ML consists of two main parts:

  • A set of machine learning APIs for different algorithms.
  • The SAP HANA dataframe, which provides a set of methods for analyzing data in SAP HANA without bringing that data to the client.

This set of APIs is composed of two packages:

  • APL package
  • PAL package

The Automated Predictive Library (APL) package exposes the data mining capabilities of the Automated Analytics engine in SAP HANA through a set of functions. These functions automate the predictive modeling process, enabling analysts to answer simple questions on their customer datasets stored in SAP HANA.

The Predictive Analysis Library (PAL) package consists of a set of Python algorithms and functions which provide access to the machine learning capabilities in SAP HANA. The PAL Python functions cover a variety of different machine learning algorithms for training a model and using the trained model for scoring. For details on which specific algorithms are available in this release, please refer to the documentation.

This library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA.

Figure: Basic overview of the Python Client API for ML.

Prerequisites

  • The hdbcli driver, version 2.2.23 (shipped with SAP HANA SP03) or higher. See SAP HANA Client Interface Programming Reference for SAP HANA Service for more information.
  • The AFL__SYS_AFL_AFLPAL_EXECUTE and AFL__SYS_AFL_AFLPAL_EXECUTE_WITH_GRANT_OPTION security roles. See SAP HANA Predictive Analysis Library for more information.
  • SAP HANA APL 1811 or higher. See SAP HANA Automated Predictive Library Reference Guide for more information. Only required when using the APL package.
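
To check which hdbcli version is installed, you can query the package metadata from Python. A minimal sketch using only the standard library (requires Python 3.8 or higher for importlib.metadata):

from importlib.metadata import version

# Print the installed hdbcli driver version; it should be 2.2.23 or higher
print(version('hdbcli'))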

SAP HANA Dataframe

An SAP HANA dataframe provides a way to view the data stored in SAP HANA without containing any of the physical data. The Python Client ML APIs make use of an SAP HANA dataframe as input for training and scoring purposes.

An SAP HANA dataframe hides the underlying SQL statement, providing users with a Python interface to SAP HANA data.

To use an SAP HANA dataframe, first create a ConnectionContext object, then use the methods provided in the library to create an SAP HANA dataframe. The dataframe is only usable while the ConnectionContext is open and becomes inaccessible once the connection is closed.

The example below shows how to create a ConnectionContext and invoke the table() method to create a simple dataframe:

from hana_ml.dataframe import ConnectionContext

with ConnectionContext('address', port, 'user', 'password') as cc:
    df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
            .filter('COL3 > 5')
            .select('COL1', 'COL2'))

More complex dataframes can also be created by using the sql() method, for example:

df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')

Once the dataframe df is created, there are several methods that can be invoked on the dataframe object.

The dataframe functions can be categorized into three types (one method from each category is sketched after this list):

  • Dataframe Manipulation: casting columns into a new type, dropping columns, filling null values, joining dataframes, sorting dataframes, renaming columns, etc.
  • Descriptive Functions: statistics relating to the data, showing distinct values, creating dataframes with top n values, etc.
  • Dataframe Transformations: copying an SAP HANA dataframe to a Pandas dataframe and materializing a dataframe to a table.
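
A minimal sketch showing one method from each category, assuming the dataframe df created above; the exact method set may vary between hana_ml versions:

# Dataframe manipulation: drop a column, then fill nulls in a numeric column
df2 = df.drop(['COL2']).fillna(0, subset=['COL1'])

# Descriptive functions: per-column statistics and distinct values
stats = df.describe().collect()
unique_vals = df.distinct('COL1').collect()

# Dataframe transformation: copy the result set to a Pandas dataframe
pandas_df = df2.collect()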

Remember, an SAP HANA dataframe is a way to view the data stored within SAP HANA, and does not contain any data. If you want to use other Python libraries on the client along with an SAP HANA dataframe, you will need to convert that SAP HANA dataframe to a Pandas dataframe using the collect function.

pandas_df = df.collect()

For more details, refer to the documentation in the hana_ml.dataframe module section.

Machine Learning API

The ML API is the set of APIs available for working with the SAP HANA machine learning algorithms.

The APL package contains these algorithms:

  • classification (auto classifier)
  • regression (auto regressor)

The PAL package contains these algorithms:

  • clustering (DBSCAN, k-means, k-medians, k-medoids)
  • decomposition (latent Dirichlet allocation, principal component analysis)
  • linear models (linear regression, logistic regression)
  • metric functions (AUC, confusion matrix, multiclass AUC)
  • naive Bayes
  • neural network (multi-layer perceptron classifier, multi-layer perceptron regressor)
  • preprocessing (feature normalizer, k-bins discretizer)
  • regression (polynomial regression)
  • statistics functions (chi-squared test functions, t-test functions, analysis of variance functions, univariate/multivariate analysis functions, inter-quartile range function)
  • neighborhood-based algorithms (k-nearest neighbors)
  • decision trees (decision tree for classification, decision tree for regression, random forest for classification, random forest for regression)
  • Support Vector Machines (SVM) functions (support vector classification, support vector regression, support vector ranking, and one class SVM)
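
These algorithms are grouped into submodules under hana_ml.algorithms.pal. The import paths below are assumptions based on the package layout and may differ between releases:

from hana_ml.algorithms.pal.clustering import KMeans
from hana_ml.algorithms.pal.decomposition import PCA
from hana_ml.algorithms.pal.linear_model import LogisticRegression
from hana_ml.algorithms.pal.trees import DecisionTreeClassifier
from hana_ml.algorithms.pal.svm import SVC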

These client-side Python functions take an SAP HANA dataframe and algorithm parameters as inputs for training a model. Although the Python ML APIs are invoked from the client, the actual model training and scoring is executed in SAP HANA, so no data is brought to the client for training the model. Avoiding data movement between the server and the client results in faster performance.
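
Because dataframe operations only compose SQL on the client, you can inspect the statement that will run in SAP HANA before any data is fetched. A small sketch, assuming the connection cc from the earlier example and using the select_statement attribute of the dataframe:

# Operations compose SQL on the client; data is transferred only when
# explicitly fetched (for example, with collect())
df = cc.table('MY_TABLE', schema='MY_SCHEMA').filter('COL3 > 5')
print(df.select_statement)   # the generated SELECT statement

pandas_df = df.head(10).collect()   # only now are rows sent to the client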

Here are the steps for using these functions:

Note

Some of these parameters are optional and depend on the specific algorithm being used.

  1. Create an instance of the class (for the algorithm you want to use) and pass in the algorithm parameters. For example, in the case of a decision tree algorithm, you would need to pass the type of decision tree being created (e.g. C4.5), the minimum number of records in a parent node, the minimum number in a child node, etc.

    The example below shows how to create an instance of the DecisionTreeClassifier class, passing parameters that specify the type of decision tree (in this case, C4.5), a minimum of 2 records in a parent node, a minimum of 1 record in a leaf node, the model stored in JSON format, a thread ratio of 0.4, and a split threshold of 1e-5.

    from hana_ml.algorithms.pal.trees import DecisionTreeClassifier

    dtc = DecisionTreeClassifier(conn_context=cc, algorithm='c45',
                                 min_records_of_parent=2,
                                 min_records_of_leaf=1, thread_ratio=0.4,
                                 split_threshold=1e-5, model_format='json')
    
  2. Invoke the fit() method of the class, passing the dataframe containing the training data, the features to be used, and any other parameters that describe the data.

    dtc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
            label='LABEL')
    
  3. The fit() method trains the model; the trained model and its statistics are returned as properties of the algorithm object. You can then invoke the predict() method on the object, passing the dataframe containing the data to be predicted. In addition to the dataframe, other parameters can optionally be passed (e.g. verbose output).

    result = dtc.predict(df2, key='ID', verbose=False)
    

End-to-End Example: Using the Predictive Analysis Library (PAL) module

Below is an example where the Random Forest Classifier model is trained with data from an SAP HANA table:

#Step 1: Import the Python Client API Library and Dataframe Library
#(dataframe, trees)
from hana_ml import dataframe
from hana_ml.algorithms.pal import trees

#Step 2: Instantiate the Connection Object (conn)
conn = dataframe.ConnectionContext('<address>', <port>, '<user>',
                                   '<password>')

#Step 3: Create the HANA Dataframe (df_fit) and Point to the
#"DATA_TBL_RFT" Table.
df_fit = conn.table("DATA_TBL_RFT")

#Step 4: Inspect the Data
df_fit.head(4).collect()

OUTLOOK  TEMP  HUMIDITY WINDY       LABEL
  Sunny  75.0      70.0   Yes        Play
  Sunny   NaN      90.0   Yes Do not Play
  Sunny  85.0       NaN    No Do not Play
  Sunny  72.0      95.0    No Do not Play

#Step 5: Create the RandomForestClassifier instance and Specify the Parameters
rfc = trees.RandomForestClassifier(conn_context=conn, n_estimators=3,
                                   max_features=3, random_state=2,
                                   split_threshold=0.00001, calculate_oob=True,
                                   min_samples_leaf=1, thread_ratio=1.0)

#Step 6: Store the necessary features in a List and Invoke the fit Method
rfc.fit(df_fit, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
        label='LABEL')

#Step 7: Create the HANA dataframe (df2) and Point to the
#"DATA_TBL_RFTPredict" Table.
df2 = conn.table("DATA_TBL_RFTPredict")

#Step 8: Preview the dataframe before predicting
df2.collect()

ID   OUTLOOK     TEMP  HUMIDITY WINDY
 0  Overcast     75.0  -10000.0   Yes
 1      Rain     78.0      70.0   Yes

#Step 9: Invoke the Prediction Method ("predict()") and inspect the result
result = rfc.predict(df2, key='ID', verbose=False)
result.collect()

ID SCORE  CONFIDENCE
 0  Play    0.666667
 1  Play    0.666667

#Step 10: Create the HANA dataframe (df3) and Point to the
#"DATA_TBL_RFTScoring" Table.
df3 = conn.table("DATA_TBL_RFTScoring")

#Step 11: Preview the dataframe before scoring
df3.collect()

ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
 0     Sunny    70      90.0   Yes  Play
 1  Overcast    81      90.0   Yes  Play
 2      Rain    65      80.0    No  Play

#Step 12: Invoke the Score Method ("score()") on the dataframe
rfc.score(df3, key='ID')

0.6666666666666666
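
When the example is complete, the connection can be closed to release database resources. A minimal cleanup step (ConnectionContext can also be used as a context manager, as shown earlier):

# Close the connection when finished
conn.close()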

End-to-End Example: Using the Automated Predictive Library (APL) module

The following example shows how the APL module from the SAP HANA Python Client API for Machine Learning Algorithms can be used in a Jupyter notebook. This example shows how to build and apply a predictive model to detect fraud.

Figure: Basic overview of the fraud detection model.

Import the sample data included with the APL package into the SAP HANA database.

# Connect using the HANA secure user store
from hana_ml import dataframe as hd
conn = hd.ConnectionContext(userkey='MLDB07_KEY')
# Get Training Data
sql_cmd = 'SELECT * FROM "APL_SAMPLES"."AUTO_CLAIMS_FRAUD" ORDER BY CLAIM_ID'
training_data = hd.DataFrame(conn, sql_cmd)
training_data.head(10).collect()
Figure: A small snippet of the sample data.

Train a classification model on historical claims data with known fraud and display performance metrics of the trained model.

# Create Model
from hana_ml.algorithms.apl.classification import AutoClassifier
model = AutoClassifier(
    conn_context=conn,
    variable_auto_selection=True)

# Train the model
model.fit(training_data, label='IS_FRAUD', key='CLAIM_ID')

# Debrief the trained model
import pandas as pd
print('\x1b[1m'+ 'MODEL PERFORMANCE' + '\x1b[0m')
d = model.get_performance_metrics()
df = pd.DataFrame(list(d.items()), columns=["Metric", "Value"])
df.loc[df['Metric'].isin(['AUC','PredictivePower','PredictionConfidence'])]
Figure: The trained model.

Check how each variable contributes to the prediction.

print('\x1b[1m'+ 'VARIABLES IMPORTANCE' + '\x1b[0m')
d = model.get_feature_importances()
df = pd.DataFrame(list(d.items()), columns=["Variable", "Contribution"])
df['Contribution'] = df['Contribution'].astype(float)
df['Cumulative'] = df['Contribution'].cumsum()
df['Contribution'] = df['Contribution'].round(4)*100
df['Cumulative'] = df['Cumulative'].round(4)*100
non_zero = df['Contribution'] != 0
dfs = df[non_zero].sort_values(by=['Contribution'], ascending=False)
dfs
Figure: How each variable contributes to the model.

Alternatively, see the contributions on a bar chart.

# Graph
import matplotlib.pyplot as plt
c_title = "Contributions"
dfs = dfs.sort_values(by=['Contribution'], ascending=True)
dfs.plot(kind='barh', x='Variable', y='Contribution', title=c_title,
         legend=False, fontsize=12)
plt.show()
Figure: A bar chart showing variable contributions.

Predict if each new claim is fraudulent or not.

# Get New Claims
new_data = conn.table('AUTO_CLAIMS_NEW', schema='APL_SAMPLES')
# Apply Trained Model
df = model.predict(new_data).collect()
df.head(8)
Figure: Fraudulent predictions.

Summary

The SAP HANA Python Client API for Machine Learning Algorithms provides a set of Python APIs and functions for creating and manipulating SAP HANA dataframes, training and scoring machine learning models, and data preprocessing.

These functions ensure that model training and prediction execute directly in SAP HANA. This provides better performance by executing close to the data and avoids unnecessary data transfer between the client and the server. Users may still download the data (or a subset of it) and use it with other Python libraries by converting the SAP HANA dataframe to a Pandas dataframe.