Python Machine Learning Client for SAP HANA¶
Welcome to Python machine learning client for SAP HANA!
This API enables Python data scientists to access SAP HANA data and build machine learning models using that data directly in SAP HANA. This page provides an overview of Python machine learning client for SAP HANA.
Python machine learning client for SAP HANA provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.
Python machine learning client for SAP HANA consists of two main parts:
A set of machine learning APIs for algorithms.
SAP HANA DataFrame, which provides a set of methods for analyzing data in SAP HANA without bringing the data to the client.
The machine learning APIs are composed of two packages:
The PAL package consists of a set of Python algorithms and functions that provide access to the machine learning capabilities of the SAP HANA Predictive Analysis Library (PAL). The PAL functions cover a variety of machine learning algorithms for training a model; the trained model can then be used for scoring. For details on which specific algorithms are available in this release, please refer to the documentation.
The Automated Predictive Library (APL) package exposes the data mining capabilities of the Automated Analytics engine in SAP HANA through a set of functions. These functions implement a predictive modeling process that analysts can use to answer simple questions about their customer datasets stored in SAP HANA.
This Python library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA. The following prerequisites apply:
SAP HANA Python driver: hdbcli 2.2.23 (shipped with SAP HANA SP03) or higher. See SAP HANA Client Interface Programming Reference for SAP HANA Service for more information.
SAP HANA PAL: the security roles AFL__SYS_AFL_AFLPAL_EXECUTE and AFL__SYS_AFL_AFLPAL_EXECUTE_WITH_GRANT_OPTION. See SAP HANA Predictive Analysis Library for more information.
SAP HANA APL 1905 or higher. See SAP HANA Automated Predictive Library Reference Guide for more information. Only required when using the APL package.
Available Modules and Algorithms¶
Python machine learning client for SAP HANA includes the following modules and algorithm packages.
SAP HANA DataFrame¶
A SAP HANA DataFrame provides a way to view the data stored in the SAP HANA database without containing any of the physical data. Python machine learning client for SAP HANA makes use of a SAP HANA DataFrame as input for training and scoring purposes.
A SAP HANA DataFrame hides the underlying SQL statement, providing users with a Python interface to SAP HANA data.
To use a SAP HANA DataFrame, first create a ConnectionContext object (a connection to SAP HANA), and then use the methods provided in the library to create a SAP HANA DataFrame. The SAP HANA DataFrame is only usable while the connection is open, and becomes inaccessible once the connection is closed.
The example below shows how to create a ConnectionContext object cc and then invoke the table() method to create a simple SAP HANA DataFrame df:

with ConnectionContext('address', port, 'user', 'password') as cc:
    df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
            .filter('COL3 > 5')
            .select('COL1', 'COL2'))
More complex DataFrames can also be created by using the sql() method, for example:
df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
Once df (a SAP HANA DataFrame object) is created, several types of functions can be executed on it:
DataFrame manipulation functions: manipulate the rows and columns of data, such as casting columns to a new type (cast()), dropping columns (drop()), filling null values (fillna()), joining DataFrames (join()), sorting DataFrames (sort()), and renaming columns (rename_columns()).
Descriptive functions: statistics relating to the data, for example showing distinct values (distinct()) and creating DataFrames with the top n values (head()).
DataFrame transformation functions: such as copying a SAP HANA DataFrame to a Pandas DataFrame (collect()) and uploading data from a Pandas DataFrame to a SAP HANA database, returning a SAP HANA DataFrame (create_dataframe_from_pandas()).
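As a purely client-side illustration (a sketch using pandas, not the hana_ml API; the hana_ml methods perform the equivalent operations inside SAP HANA via generated SQL, without moving data), the following shows what fillna(), sort(), and rename_columns() do to a small table:

```python
import pandas as pd

# A small local table standing in for data that would live in SAP HANA.
df = pd.DataFrame({'COL1': [3, 1, None], 'COL2': ['a', 'b', 'c']})

# fillna(): replace null values in a column.
df = df.fillna({'COL1': 0})

# sort(): order the rows by a column.
df = df.sort_values(by='COL1')

# rename_columns(): give columns new names.
df = df.rename(columns={'COL1': 'VALUE', 'COL2': 'CATEGORY'})

print(df['VALUE'].tolist())  # null filled with 0, rows ordered ascending
```

With hana_ml, the same chain of operations would only rewrite the SQL statement behind the SAP HANA DataFrame; no rows would reach the client until collect() is called.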
Remember, a SAP HANA DataFrame is a way to view the data stored within SAP HANA, and does not contain any data. If you want to use other Python libraries on the client along with a SAP HANA DataFrame, you need to convert that SAP HANA DataFrame to a Pandas DataFrame using the collect() function. For example, we could convert a SAP HANA DataFrame df to a Pandas DataFrame as follows.
pandas_df = df.collect()
For more details, refer to the documentation hana_ml.dataframe.
Machine Learning API¶
The Machine Learning API is a set of APIs for using SAP HANA machine learning algorithms.
The APL package contains these algorithms:
classification (auto classifier).
clustering (auto supervised clustering, auto unsupervised clustering).
regression (auto regressor).
The PAL package contains these algorithms:
association (Apriori, FPGrowth, K-Optimal Rule Discovery (KORD), Sequential Pattern Mining (SPM)).
clustering (Affinity Propagation, Agglomerate Hierarchical Clustering, DBSCAN, Geometry DBSCAN, K-Means, K-Medians, K-Medoids, Self-Organizing Map (SOM), SlightSilhouette).
conditional random field (CRF).
decomposition (Latent Dirichlet Allocation, Principal Component Analysis (PCA)).
discriminant analysis functions (Linear Discriminant Analysis).
linear models (Linear Regression, Logistic Regression).
metric functions (AUC, confusion matrix, multiclass AUC, r2_score, accuracy_score).
naive bayes (Naive Bayes).
neighborhood-based algorithms (K-Nearest Neighbors Classifier/Regressor).
neural network (Multi-Layer Perceptron Classifier/Regressor).
partition (train_test_val_split function)
preprocessing (Feature Normalizer, K-bins Discretizer, Missing Value Handling (Imputer), Multidimensional Scaling (MDS), Synthetic Minority Over-Sampling Technique (SMOTE), Sampling, Variance Test).
random distribution sampling functions (bernoulli, beta, binomial, cauchy, chi_squared, exponential, extreme_value, f, gamma, geometric, gumbel, lognormal, negative_binomial, normal, pert, poisson, student_t, uniform, weibull, multinomial)
recommender system algorithms (Alternating Least Square (ALS), Factorized Polynomial Regression Models (FRM), Field-aware Factorization Machine).
regression (Bi-Variate Geometric Regression, Bi-Variate Natural Logarithmic Regression, Cox Proportional Hazard Model, Exponential Regression, Generalised Linear Models (GLM), Polynomial Regression).
social networks (Link Prediction, Pagerank).
statistics functions (analysis of variance functions (ANOVA), chi-squared test functions, condition index, Cumulative Distribution Function (cdf), distribution fitting, distribution quantile, entropy, Equal Variance Test, Factor Analysis, Grubbs’ Test, Kaplan-Meier Survival Analysis, Kernel Density, univariate/multivariate analysis functions, One-Sample Median Test, Wilcoxon Signed Rank Test, t-test functions, Inter-Quartile Range (IQR)).
svm (Support Vector Classification, Support Vector Regression, Support Vector Ranking, and one class SVM).
time series (ARIMA, Auto ARIMA, FFT, Seasonal Decompose, Trend Test, White Noise Test, Single/Double/Triple/Auto/Brown Exponential Smoothing, Change-Point Detection, Croston’s Method, Linear Regression with Damped Trend and Seasonal Adjust, Additive Model Forecast, fast DTW, Hierarchical Forecast, Correlation Function, online algorithms).
trees (Decision Tree Classifier/Regressor, Random Forest Classifier/Regressor, Gradient Boosting Classifier/Regressor, Hybrid Gradient Boosting Classifier/Regressor).
tsne (T-distributed Stochastic Neighbour Embedding)
pipeline (run SAP HANA PAL functions in a chain)
weighted score table
cross validation (Decision Tree Classifier/Regressor, Gradient Boosting Classifier/Regressor, Hybrid Gradient Boosting Classifier/Regressor, Generalised Linear Models (GLM), Naive Bayes, Linear Regression, Logistic Regression, Multi-Layer Perceptron Classifier/Regressor, Support Vector Machines functions, K-Nearest Neighbors Classifier/Regressor, Alternating Least Square (ALS), Factorized Polynomial Regression Models (FRM), Polynomial Regression).
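To make the metric functions in the list above concrete, here is a small, purely local Python sketch of what a confusion matrix and an accuracy score compute. The label vectors are made up, and the code is plain Python rather than the hana_ml API; in hana_ml these metrics are computed in-database on SAP HANA DataFrame columns.

```python
from collections import Counter

# Hypothetical label vectors; in hana_ml these would be columns of a
# SAP HANA DataFrame holding true labels and predicted labels.
y_true = ['Play', 'Play', 'No Play', 'No Play', 'Play']
y_pred = ['Play', 'No Play', 'No Play', 'No Play', 'Play']

# Confusion matrix expressed as (actual, predicted) pair counts.
confusion = Counter(zip(y_true, y_pred))

# Accuracy: the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(confusion[('Play', 'Play')], accuracy)  # prints: 2 0.8
```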
These client-side Python functions require a SAP HANA DataFrame and parameters as inputs in order to train the model. The Python machine learning APIs invoke SQL functions under the hood, so the actual model training and scoring is executed in SAP HANA and no data is brought to the client for training the model. Avoiding data movement between the server and the client results in faster performance.
Here are the steps for using these functions:
Create an instance of the class (for the algorithm you want to use) and pass in the algorithm parameters. Some of these parameters are optional and depend on the specific algorithm being used. For example, in the case of a decision tree algorithm, you would need to pass the type of decision tree being created (e.g. C4.5), the minimum number of records in a parent node, the minimum number of records in a child node, and so on.
The example below shows how to create an instance of the DecisionTreeClassifier class, passing parameters that specify the type of decision tree (in this case, C4.5), a minimum of 2 records in a parent node, a minimum of 1 record in a leaf node, the model to be kept in JSON format, a thread ratio of 0.4, and a split threshold of 1e-5.
dtc = DecisionTreeClassifier(algorithm='c45', min_records_of_parent=2, min_records_of_leaf=1, thread_ratio=0.4, split_threshold=1e-5, model_format='json')
Invoke the fit() method of the class on the DataFrame df1 (containing the training data), along with the features to be used and other parameters that relate to the data.
dtc.fit(data=df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'], label='LABEL')
The fit() method invocation returns a trained model as a property of the algorithm object. Statistics are also returned as a property of the algorithm object. You can then invoke the predict() method on the object, passing the DataFrame df2 to the method for prediction. In addition to df2, other parameters can optionally be passed (e.g. verbose output).
result = dtc.predict(data=df2, key='ID', verbose=False)
End-to-End Example: Using Predictive Analytics Library (PAL) module¶
Below is an example in which a Random Forest Classifier model is trained with data from a SAP HANA table:
#Step 1: Import the Python Client API Library and dataframe module
from hana_ml import dataframe
from hana_ml.algorithms.pal import trees

#Step 2: Instantiate the Connection Object (conn)
conn = dataframe.ConnectionContext('<address>', <port>, '<user>', '<password>')

#Step 3: Create the SAP HANA DataFrame df_fit, pointing to
#the table "DATA_TBL_RFT" in the SAP HANA system
df_fit = conn.table("DATA_TBL_RFT")

#Step 4: Inspect the Data
df_fit.head(4).collect()
  OUTLOOK  TEMP  HUMIDITY WINDY        LABEL
    Sunny  75.0      70.0   Yes         Play
    Sunny   NaN      90.0   Yes  Do not Play
    Sunny  85.0       NaN    No  Do not Play
    Sunny  72.0      95.0    No  Do not Play

#Step 5: Create the RandomForestClassifier instance and specify the parameters
rfc = trees.RandomForestClassifier(n_estimators=3,
                                   max_features=3,
                                   random_state=2,
                                   split_threshold=0.00001,
                                   calculate_oob=True,
                                   min_samples_leaf=1,
                                   thread_ratio=1.0)

#Step 6: Store the necessary features in a list and invoke the fit method
rfc.fit(data=df_fit,
        features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
        label='LABEL')

#Step 7: Create the SAP HANA DataFrame df_predict, pointing to
#the table "DATA_TBL_RFTPredict" in the SAP HANA system
df_predict = conn.table("DATA_TBL_RFTPredict")

#Step 8: Preview the DataFrame before predicting
df_predict.collect()
  ID   OUTLOOK  TEMP  HUMIDITY WINDY
   0  Overcast  75.0  -10000.0   Yes
   1      Rain  78.0      70.0   Yes

#Step 9: Invoke the prediction method ("predict()") and inspect the result
result = rfc.predict(data=df_predict, key='ID', verbose=False)
result.collect()
  ID SCORE  CONFIDENCE
   0  Play    0.666667
   1  Play    0.666667

#Step 10: Create the SAP HANA DataFrame df_score, pointing to
#the "DATA_TBL_RFTScoring" table
df_score = conn.table("DATA_TBL_RFTScoring")

#Step 11: Preview the DataFrame before scoring
df_score.collect()
  ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
   0     Sunny    70      90.0   Yes  Play
   1  Overcast    81      90.0   Yes  Play
   2      Rain    65      80.0    No  Play

#Step 12: Perform the score method ("score()") on the DataFrame
rfc.score(data=df_score, key='ID')
0.6666666666666666
End-to-End Example: Using Automated Predictive Library (APL) module¶
The following example shows how Python machine learning client for SAP HANA can be used in a Jupyter notebook. This example shows how to build and apply a predictive model to detect fraud.
Import the sample data included with the APL package to the SAP HANA database.
# Connect using the SAP HANA secure user store
from hana_ml import dataframe as hd
conn = hd.ConnectionContext(userkey='MLDB07_KEY')

# Get Training Data
sql_cmd = 'SELECT * FROM "APL_SAMPLES"."AUTO_CLAIMS_FRAUD" ORDER BY CLAIM_ID'
training_data = hd.DataFrame(conn, sql_cmd)
training_data.head(10).collect()
Train a classification model on historical claims data with known fraud and display performance metrics of the trained model.
# Create Model
from hana_ml.algorithms.apl.classification import AutoClassifier
model = AutoClassifier(conn_context=conn, variable_auto_selection=True)

# Train the model
model.fit(training_data, label='IS_FRAUD', key='CLAIM_ID')

# Debrief the trained model
import pandas as pd
print('\x1b[1m' + 'MODEL PERFORMANCE' + '\x1b[0m')
d = model.get_performance_metrics()
df = pd.DataFrame(list(d.items()), columns=["Metric", "Value"])
df.loc[df['Metric'].isin(['AUC', 'PredictivePower', 'PredictionConfidence'])]
Check how each variable contributes to the prediction.
print('\x1b[1m' + 'VARIABLES IMPORTANCE' + '\x1b[0m')
d = model.get_feature_importances()
df = pd.DataFrame(list(d.items()), columns=["Variable", "Contribution"])
df['Contribution'] = df['Contribution'].astype(float)
df['Cumulative'] = df['Contribution'].cumsum()
df['Contribution'] = df['Contribution'].round(4) * 100
df['Cumulative'] = df['Cumulative'].round(4) * 100
non_zero = df['Contribution'] != 0
dfs = df[non_zero].sort_values(by=['Contribution'], ascending=False)
dfs
Alternatively, see the contributions on a bar chart.
# Graph
import matplotlib.pyplot as plt
c_title = "Contributions"
dfs = dfs.sort_values(by=['Contribution'], ascending=True)
dfs.plot(kind='barh', x='Variable', y='Contribution',
         title=c_title, legend=False, fontsize=12)
plt.show()
Predict if each new claim is fraudulent or not.
# Get New Claims
new_data = conn.table('AUTO_CLAIMS_NEW', schema='APL_SAMPLES')

# Apply Trained Model
df = model.predict(new_data).collect()
df.head(8)
Spatial and Graph Features¶
Python machine learning client for SAP HANA supports additional SAP HANA engines that can be used for analytics on geospatial data and on graph (network) modeled data.
The geospatial features integrate these file types through the create_dataframe functions:
create_dataframe_from_shapefile (sourced from shapefiles, including dbf, shp, or zip formats).
create_dataframe_from_pandas (sourced from csv with a given geometry column and spatial reference id).
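A minimal client-side sketch of preparing spatial data for such an upload: the geometry is typically carried as a pandas column of WKT (well-known text) strings plus a spatial reference id. The column name SHAPE, the table name GEO_TBL, and the srid value 4326 are illustrative assumptions, and the upload call itself needs a live SAP HANA connection, so it is shown commented out.

```python
import pandas as pd

# Points as WKT strings in a geometry column; the column name 'SHAPE'
# and the srid 4326 (WGS 84) are illustrative assumptions.
pdf = pd.DataFrame({
    'ID': [1, 2],
    'SHAPE': ['POINT (8.64 49.29)', 'POINT (8.69 49.40)'],
})

# With a live connection, this frame could then be uploaded, e.g.:
# from hana_ml.dataframe import create_dataframe_from_pandas
# hdf = create_dataframe_from_pandas(conn, pdf, 'GEO_TBL',
#                                    geo_cols=['SHAPE'], srid=4326)
print(pdf.shape)  # prints: (2, 2)
```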
The Graph features contain integration of the same file types as well as making use of existing graph workspaces.
create_hana_graph_from_vertex_and_edge_frames (Sourced from csv pairs of vertices and edges).
create_hana_graph_from_existing_workspace (Sourced from existing SAP HANA Graph workspaces).
Once a Graph object is created, it functions in a manner similar to the PAL and APL libraries, returning Pandas DataFrames to the user. In the case of SAP HANA Graph, the following functions are currently offered.
add_vertex (Augment the underlying vertex table of the graph with a new vertex).
add_edge (Augment the underlying edge table of the graph with a new edge that connects existing vertices).
edges (Pandas DataFrame of the edge table, or links in other terms).
neighbors (Pandas DataFrame of the vertices that are connected to a given vertex key).
neighbors_sub_graph (Pandas DataFrame of the vertices that are connected to a given vertex key, plus a Pandas DataFrame of the edges included).
shortest_path (Pandas DataFrame of the vertices that are on the shortest path between the source and target vertex keys, plus a Pandas DataFrame of the edges that make up the path).
vertices (Pandas DataFrame of the vertex table, or nodes in other terms).
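As a purely local illustration of what shortest_path computes conceptually, the sketch below runs a breadth-first search over a small, made-up edge list (unweighted edges assumed). The actual Graph object performs this in the SAP HANA graph engine and returns the vertices and edges on the path as Pandas DataFrames.

```python
from collections import deque

# A small edge list standing in for the graph's edge table.
edges = [('A', 'B'), ('B', 'C'), ('A', 'D'), ('D', 'C'), ('C', 'E')]

def shortest_path_bfs(edges, source, target):
    """Vertices on a shortest path from source to target (unweighted)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev = {source: None}          # BFS predecessor of each visited vertex
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], target        # walk predecessors back to the source
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

p = shortest_path_bfs(edges, 'A', 'E')
print(p)  # prints: ['A', 'B', 'C', 'E']
```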
Python machine learning client for SAP HANA provides a set of Python APIs and functions for creating and manipulating SAP HANA DataFrames, for training and scoring machine learning models, and for data preprocessing.
These functions ensure that model training and prediction execute directly in SAP HANA. This provides better performance by executing close to the data, and avoids transferring the training data between the client and the server. Users may still download the data (or a subset of the data) for use with other Python libraries by converting a SAP HANA DataFrame to a Pandas DataFrame.