Algorithms

The hana_ml.algorithms.pal package consists of many algorithms. Grouped by category, these algorithms are listed as follows:

This module contains supported PAL algorithms.

PAL Base

pal_base.PALBase([conn_context])

Subclass for PAL-specific functionality.

Auto ML

auto_ml.AutomaticClassification([scorings, ...])

AutomaticClassification offers an intelligent search amongst machine learning pipelines for supervised classification tasks.

auto_ml.AutomaticRegression([scorings, ...])

AutomaticRegression offers an intelligent search amongst machine learning pipelines for supervised regression tasks.

auto_ml.AutomaticTimeSeries([scorings, ...])

AutomaticTimeSeries offers an intelligent search amongst machine learning pipelines for time series tasks.

auto_ml.Preprocessing(name, **kwargs)

Preprocessing class.
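
A minimal sketch of how the Auto ML classes are typically driven, not taken from the package documentation: cc stands for an existing hana_ml ConnectionContext, the table names TRAIN_TBL and TEST_TBL and the columns ID and LABEL are placeholders, and the constructor arguments are illustrative only.

    from hana_ml import dataframe
    from hana_ml.algorithms.pal.auto_ml import AutomaticClassification

    # cc: an existing connection to SAP HANA (credentials are placeholders)
    cc = dataframe.ConnectionContext(address='<host>', port=30015,
                                     user='<user>', password='<password>')
    df_train = cc.table('TRAIN_TBL')   # hypothetical training table: ID, features, LABEL

    # genetic search over candidate classification pipelines; parameters are illustrative
    auto_c = AutomaticClassification(generations=2,
                                     population_size=5,
                                     offspring_size=5)
    auto_c.fit(data=df_train, key='ID', label='LABEL')
    scores = auto_c.predict(data=cc.table('TEST_TBL'), key='ID')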

Unified Interface

unified_classification.UnifiedClassification(func)

The Python wrapper for SAP HANA PAL unified-classification function.

unified_regression.UnifiedRegression(func[, ...])

The Python wrapper for SAP HANA PAL unified-regression function.

unified_clustering.UnifiedClustering(func[, ...])

The Python wrapper for SAP HANA PAL Unified Clustering function.

unified_exponentialsmoothing.UnifiedExponentialSmoothing(func)

The Python wrapper for SAP HANA PAL Unified Exponential Smoothing function.
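
As a rough usage sketch (not from the official documentation), the unified wrappers take the name of the underlying algorithm through the func argument; cc, the table names, and the column names below are placeholders, and the parameter values are illustrative.

    from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

    df_train = cc.table('TRAIN_TBL')   # assumed columns: ID, features, LABEL
    df_test = cc.table('TEST_TBL')

    # one wrapper, many underlying classifiers selected via `func`
    uc = UnifiedClassification(func='HybridGradientBoostingTree',
                               n_estimators=10, max_depth=6)
    uc.fit(data=df_train, key='ID', label='LABEL')
    result = uc.predict(data=df_test, key='ID')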

Clustering

clustering.AffinityPropagation(affinity, ...)

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars.

clustering.AgglomerateHierarchicalClustering([...])

Agglomerate Hierarchical Clustering is a widely used clustering method which can find natural groups within a set of data.

clustering.DBSCAN([minpts, eps, ...])

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

clustering.GeometryDBSCAN([minpts, eps, ...])

GeometryDBSCAN is a geometry version of DBSCAN, which only accepts geometry points as input data.

clustering.KMeans([n_clusters, ...])

K-means is one of the simplest and most commonly used unsupervised machine learning algorithms for partitioning a data set into K distinct, non-overlapping clusters based on the distances between the center of the cluster (centroid) and the data points.

clustering.KMedians(n_clusters[, init, ...])

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center.

clustering.KMedoids(n_clusters[, init, ...])

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center.

clustering.SpectralClustering(n_clusters[, ...])

Spectral clustering is an algorithm evolved from graph theory, and has been widely used in clustering.

clustering.KMeansOutlier([n_clusters, ...])

Outlier detection based on k-means clustering.

mixture.GaussianMixture(init_param[, ...])

Gaussian Mixture Model (GMM) is a probabilistic model used for modeling data points that are assumed to be generated from a mixture of Gaussian distributions.

som.SOM([covergence_criterion, ...])

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis.

clustering.SlightSilhouette(data[, ...])

Silhouette refers to a method used to validate the clustering of data; it provides a succinct graphical representation of how well each object lies within its cluster.

clustering.outlier_detection_kmeans(data[, ...])

Outlier detection based on k-means clustering.
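
A hedged example of the common fit-then-inspect pattern for the clustering classes, here with KMeans; cc and the table POINTS_TBL with key column ID are assumptions, and the parameter values are illustrative.

    from hana_ml.algorithms.pal.clustering import KMeans

    df = cc.table('POINTS_TBL')        # assumed columns: ID plus numeric features

    km = KMeans(n_clusters=4, init='first_k', max_iter=100,
                distance_level='euclidean', normalization='no')
    km.fit(data=df, key='ID')
    print(km.labels_.collect())            # cluster assignment per point
    print(km.cluster_centers_.collect())   # centroid coordinates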

Classification

discriminant_analysis.LinearDiscriminantAnalysis([...])

Linear discriminant analysis for classification and data reduction.

linear_model.LogisticRegression([...])

Logistic regression models the relationship between a dichotomous dependent variable (also known as explained variable) and one or more continuous or categorical independent variables (also known as explanatory variables).

linear_model.OnlineMultiLogisticRegression(...)

This algorithm is the online version of multi-class logistic regression; the standard Multi-Class Logistic Regression is the offline/batch version.

naive_bayes.NaiveBayes([alpha, ...])

A classification model based on Bayes' theorem.

neighbors.KNNClassifier([n_neighbors, ...])

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase.

neural_network.MLPClassifier([activation, ...])

Multi-layer perceptron (MLP) Classifier.

svm.SVC([c, kernel, degree, gamma, ...])

Support Vector Machines (SVMs) refer to a family of supervised learning models using the concept of support vector.

svm.OneClassSVM([c, kernel, degree, gamma, ...])

Support Vector Machines (SVMs) refer to a family of supervised learning models using the concept of support vector.

trees.DecisionTreeClassifier([algorithm, ...])

A decision tree is used as a classifier for determining an appropriate action or decision among a predetermined set of actions for a given case.

trees.RDTClassifier([n_estimators, ...])

The random decision trees algorithm is an ensemble learning method for classification and regression.

trees.HybridGradientBoostingClassifier([...])

Hybrid Gradient Boosting trees model for classification.
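
The classification estimators share a fit/predict interface; a minimal sketch with LogisticRegression follows, assuming cc and placeholder table and column names (TRAIN_TBL, TEST_TBL, ID, CLASS).

    from hana_ml.algorithms.pal.linear_model import LogisticRegression

    df_train = cc.table('TRAIN_TBL')   # assumed columns: ID, features, CLASS

    lr = LogisticRegression(solver='newton', max_iter=100)
    lr.fit(data=df_train, key='ID', label='CLASS')
    pred = lr.predict(data=cc.table('TEST_TBL'), key='ID')
    print(pred.collect().head())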

Regression

linear_model.LinearRegression([solver, ...])

Linear regression is an approach to model the linear relationship between a variable, usually referred to as dependent variable, and one or more variables, usually referred to as independent variables, denoted as predictor vector.

linear_model.OnlineLinearRegression([...])

Online linear regression (Stateless) is an online version of linear regression and is used when the training data are obtained in multiple rounds.

neighbors.KNNRegressor([n_neighbors, ...])

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase.

neural_network.MLPRegressor([activation, ...])

Multi-layer perceptron (MLP) Regressor.

regression.PolynomialRegression([degree, ...])

Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X.

regression.GLM([family, link, solver, ...])

Generalised linear models (GLM) are used to regress responses satisfying exponential distributions, for example, Normal, Poisson, Binomial, Gamma, inverse Gaussian (IG), and negative binomial (NB).

regression.ExponentialRegression([...])

Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X.

regression.BiVariateGeometricRegression([...])

Geometric regression is an approach used to model the relationship between a scalar variable y and a variable denoted X.

regression.BiVariateNaturalLogarithmicRegression([...])

Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted X.

regression.CoxProportionalHazardModel([...])

Cox proportional hazard model (CoxPHM) is a special generalized linear model.

svm.SVR([c, kernel, degree, gamma, ...])

Support Vector Machines (SVMs) refer to a family of supervised learning models using the concept of support vector.

trees.DecisionTreeRegressor([algorithm, ...])

Decision Tree model for regression.

trees.RDTRegressor([n_estimators, ...])

The random decision trees algorithm is an ensemble learning method for classification and regression.

trees.HybridGradientBoostingRegressor([...])

Hybrid Gradient Boosting model for regression.
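
The regression estimators follow the same pattern; a minimal sketch with LinearRegression, assuming cc and placeholder names (HOUSES_TBL, NEW_HOUSES_TBL, ID, PRICE).

    from hana_ml.algorithms.pal.linear_model import LinearRegression

    df = cc.table('HOUSES_TBL')        # assumed columns: ID, numeric features, PRICE

    lr = LinearRegression(solver='QR')
    lr.fit(data=df, key='ID', label='PRICE')
    print(lr.coefficients_.collect())  # fitted coefficients
    pred = lr.predict(data=cc.table('NEW_HOUSES_TBL'), key='ID')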

Preprocessing

preprocessing.FeatureNormalizer([method, ...])

Normalize a DataFrame.

preprocessing.FeatureSelection(fs_method[, ...])

Feature selection (FS) is a dimensionality reduction technique, which selects a subset of relevant features for model construction, thus reducing the memory storage and improving computational efficiency while avoiding significant loss of information.

preprocessing.IsolationForest([...])

Isolation Forest generates an anomaly score for each sample.

preprocessing.KBinsDiscretizer(strategy, ...)

Bin continuous data into a number of intervals and perform local smoothing.

preprocessing.Imputer([strategy, ...])

Missing value imputation for DataFrame.

preprocessing.Discretize(strategy[, n_bins, ...])

An enhanced version of the binning function that can be applied to a table with multiple columns.

preprocessing.MDS(matrix_type[, ...])

This class serves as a tool for dimensional reduction or data visualization.

preprocessing.SMOTE([smote_amount, ...])

This class handles imbalanced datasets using the Synthetic Minority Over-sampling Technique (SMOTE).

preprocessing.SMOTETomek([smote_amount, ...])

This class combines over-sampling using SMOTE and cleaning (under-sampling) using Tomek links.

preprocessing.TomekLinks([distance_level, ...])

This class is for performing under-sampling by removing Tomek's links.

preprocessing.Sampling(method[, interval, ...])

This class is used to choose a small portion of the records as representatives.

preprocessing.ImputeTS([imputation_type, ...])

Imputation of multi-dimensional time-series data.

preprocessing.PowerTransform([method, ...])

This class implements a python interface for the power transform algorithm in PAL.

preprocessing.QuantileTransform([...])

Python wrapper for PAL Quantile Transformer.

decomposition.PCA([scaling, thread_ratio, ...])

Principal component analysis reduces the dimensionality of multivariate data using singular value decomposition.

decomposition.CATPCA([scaling, ...])

Principal components analysis algorithm that supports categorical features.

partition.train_test_val_split(data[, ...])

The algorithm partitions an input dataset randomly into three disjoint subsets called training, testing and validation.

preprocessing.variance_test(data, sigma_num)

Variance Test is a method to identify outliers in a set of n numeric values {xi}, 1 ≤ i ≤ n, using the mean and the standard deviation of the data.
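
Most preprocessing classes expose fit_transform on a HANA DataFrame; a hedged sketch with FeatureNormalizer, assuming cc and a placeholder table RAW_TBL keyed by ID.

    from hana_ml.algorithms.pal.preprocessing import FeatureNormalizer

    df = cc.table('RAW_TBL')

    # scale the numeric feature columns into [0, 1]
    fn = FeatureNormalizer(method='min-max', new_min=0.0, new_max=1.0)
    df_scaled = fn.fit_transform(data=df, key='ID')
    print(df_scaled.collect().head())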

Time Series

tsa.additive_model_forecast.AdditiveModelForecast([...])

Additive model time series analysis uses an additive model to forecast time series data.

tsa.arima.ARIMA([order, seasonal_order, ...])

The auto regressive integrated moving average (ARIMA) algorithm is famous in econometrics, statistics and time series analysis.

tsa.auto_arima.AutoARIMA([seasonal_period, ...])

Although the ARIMA model is useful and powerful in time series analysis, it is often difficult to choose appropriate orders; AutoARIMA determines these orders automatically.

tsa.changepoint.CPD([cost, penalty, solver, ...])

Change-point detection (CPDetection) methods aim at detecting multiple abrupt changes such as change in mean, variance or distribution in an observed time-series data.

tsa.changepoint.BCPD(max_tcp, max_scp[, ...])

Bayesian Change-point detection (BCPD) detects abrupt changes in the time series.

tsa.changepoint.OnlineBCPD([alpha, beta, ...])

Online Bayesian Change-point detection.

tsa.bsts.BSTS([burn, niter, ...])

Class for Bayesian structural time series (BSTS).

tsa.classification.TimeSeriesClassification([...])

Time series classification.

tsa.exponential_smoothing.SingleExponentialSmoothing([...])

Single exponential smoothing is suitable for modeling time series without trend or seasonality.

tsa.exponential_smoothing.DoubleExponentialSmoothing([...])

Double exponential smoothing is suitable for modeling time series with a trend but without seasonality.

tsa.exponential_smoothing.TripleExponentialSmoothing([...])

Triple exponential smoothing is used to handle time series data containing a seasonal component.

tsa.exponential_smoothing.AutoExponentialSmoothing([...])

Auto exponential smoothing (previously named forecast smoothing) is used to calculate optimal parameters of a set of smoothing functions in SAP HANA PAL, including Single Exponential Smoothing, Double Exponential Smoothing, and Triple Exponential Smoothing.

tsa.exponential_smoothing.BrownExponentialSmoothing([...])

Brown exponential smoothing is suitable for modeling time series with a trend but without seasonality.

tsa.exponential_smoothing.Croston([alpha, ...])

Croston method is a forecast strategy for products with intermittent demand.

tsa.exponential_smoothing.CrostonTSB([...])

Croston TSB method (for Teunter, Syntetos & Babai) is a forecast strategy for products with intermittent demand.

tsa.garch.GARCH([p, q, model_type])

Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) is a statistical model used to analyze the variance of the error (innovation or residual) term in time series.

tsa.hierarchical_forecast.Hierarchical_Forecast([...])

The hierarchical forecast algorithm forecasts across the hierarchy (that is, ensuring the forecasts sum appropriately across the levels).

tsa.lr_seasonal_adjust.LR_seasonal_adjust([...])

Linear regression with damped trend and seasonal adjustment is an approach for forecasting when a time series presents a trend.

tsa.lstm.LSTM([learning_rate, gru, ...])

Long short-term memory (LSTM) is one of the most famous modules of recurrent neural networks (RNN).

tsa.ltsf.LTSF([batch_size, num_epochs, ...])

Long-term time series forecasting (LTSF) is a specialized approach within the realm of predictive analysis, focusing on making predictions for extended periods into the future.

tsa.online_algorithms.OnlineARIMA([order, ...])

Online ARIMA implements an online learning method to estimate the parameters of ARIMA models by reformulating the problem as a full-information online optimization task (without random noise terms), which removes the limitations of depending on noise terms and of accessing the entire large-scale dataset in advance.

tsa.outlier_detection.OutlierDetectionTS([...])

Outlier detection for time-series.

tsa.rnn.GRUAttention([learning_rate, ...])

Gated Recurrent Units (GRU) based encoder-decoder model with an attention mechanism for time series prediction.

tsa.rocket.ROCKET([method, num_features, ...])

RandOm Convolutional KErnel Transform (ROCKET) is an exceptionally efficient algorithm for time series classification.

tsa.vector_arima.VectorARIMA([order, ...])

The vector autoregressive moving average model (VARMA) is a vector form of autoregressive integrated moving average (ARIMA) that can be used to examine the relationships among several variables in multivariate time series analysis, in contrast to ARIMA, which is used for univariate time series.

tsa.wavelet.DWT(wavelet[, boundary, level, ...])

A class designed for discrete wavelet transform and wavelet packet transform.

tsa.accuracy_measure.accuracy_measure(data)

Measures are used to check the accuracy of the forecast.

tsa.correlation_function.correlation(data[, ...])

This correlation function gives the statistical correlation between random variables.

tsa.fft.fft(data[, num_type, inverse, ...])

Fast Fourier Transform (FFT) decomposes a function of time (a signal) into the frequencies that make it up.

tsa.dtw.dtw(query_data, ref_data[, radius, ...])

DTW is an abbreviation for Dynamic Time Warping.

tsa.fast_dtw.fast_dtw(data, radius[, ...])

Dynamic time warping (DTW) calculates the distance or similarity between two time series.

tsa.intermittent_forecast.intermittent_forecast(data)

Intermittent Time Series Forecast (ITSF) is a forecast strategy for products with intermittent demand.

tsa.periodogram.periodogram(data[, key, ...])

Periodogram is an estimate of the spectral density of a signal or time series.

tsa.permutation_importance.permutation_importance(data)

Permutation importance for time series is an exogenous regressor evaluation method that measures the increase in the model score when randomly shuffling the exogenous regressor's values.

tsa.stationarity_test.stationarity_test(data)

Stationarity means that a time series has a constant mean and constant variance over time.

tsa.seasonal_decompose.seasonal_decompose(data)

The seasonal decompose function tests whether a time series has seasonality and, if so, decomposes it into seasonal, trend, and random components.

tsa.trend_test.trend_test(data[, key, ...])

Trend test is a statistical method used in time series analysis to determine whether there is a consistent upward or downward movement over time, and calculate the de-trended time series.

tsa.wavelet.wavedec(data, wavelet[, key, ...])

Python wrapper for PAL multi-level discrete wavelet transform.

tsa.wavelet.waverec(dwt[, wavelet, boundary])

Python wrapper for PAL multi-level inverse discrete wavelet transform.

tsa.wavelet.wpdec(data, wavelet[, key, col, ...])

Python wrapper for PAL multi-level (discrete) wavelet packet transformation.

tsa.wavelet.wprec(dwt[, wavelet, boundary])

Python wrapper for PAL multi-level inverse (discrete) wavelet packet transformation.

tsa.white_noise_test.white_noise_test(data)

This algorithm is used to identify whether a time series is a white noise series.
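
A hedged sketch of the time series workflow with ARIMA, assuming cc and a placeholder table SALES_TS with a TIMESTAMP key column and a SALES value column; the order and method values are illustrative.

    from hana_ml.algorithms.pal.tsa.arima import ARIMA

    df_ts = cc.table('SALES_TS')       # assumed columns: TIMESTAMP, SALES

    arima = ARIMA(order=(1, 1, 1), method='mle')
    arima.fit(data=df_ts, key='TIMESTAMP', endog='SALES')
    forecast = arima.predict(forecast_length=12)
    print(forecast.collect())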

Statistics

random.bernoulli(conn_context[, p, ...])

Draw samples from a Bernoulli distribution.

random.beta(conn_context[, a, b, ...])

Draw samples from a Beta distribution.

random.binomial(conn_context[, n, p, ...])

Draw samples from a binomial distribution.

random.cauchy(conn_context[, location, ...])

Draw samples from a Cauchy distribution.

random.chi_squared(conn_context[, dof, ...])

Draw samples from a chi-squared distribution.

random.exponential(conn_context[, lamb, ...])

Draw samples from an exponential distribution.

random.gumbel(conn_context[, location, ...])

Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.

random.f(conn_context[, dof1, dof2, ...])

Draw samples from an F distribution.

random.gamma(conn_context[, shape, scale, ...])

Draw samples from a gamma distribution.

random.geometric(conn_context[, p, ...])

Draw samples from a geometric distribution.

random.lognormal(conn_context[, mean, ...])

Draw samples from a lognormal distribution.

random.negative_binomial(conn_context[, n, ...])

Draw samples from a negative binomial distribution.

random.normal(conn_context[, mean, sigma, ...])

Draw samples from a normal distribution.

random.pert(conn_context[, minimum, mode, ...])

Draw samples from a PERT distribution.

random.poisson(conn_context[, theta, ...])

Draw samples from a Poisson distribution.

random.student_t(conn_context[, dof, ...])

Draw samples from a Student's t-distribution.

random.uniform(conn_context[, low, high, ...])

Draw samples from a uniform distribution.

random.weibull(conn_context[, shape, scale, ...])

Draw samples from a Weibull distribution.

random.multinomial(conn_context, n, pvals[, ...])

Draw samples from a multinomial distribution.

random.mcmc(conn_context, distribution[, ...])

Given a distribution, this function generates samples of the distribution using Markov chain Monte Carlo simulation.

stats.chi_squared_goodness_of_fit(data, key)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

stats.chi_squared_independence(data, key[, ...])

Perform the chi-squared test of independence to tell whether observations of two variables are independent from each other.

stats.ttest_1samp(data[, col, mu, ...])

Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.

stats.ttest_ind(data[, col1, col2, mu, ...])

Perform the t-test for the mean difference of two independent samples.

stats.ttest_paired(data[, col1, col2, mu, ...])

Perform the t-test for the mean difference of two sets of paired samples.

stats.f_oneway(data[, group, sample, ...])

Performs a 1-way ANOVA.

stats.f_oneway_repeated(data, subject_id[, ...])

Performs one-way repeated measures analysis of variance, along with Mauchly's Test of Sphericity and post hoc multiple comparison tests.

stats.univariate_analysis(data[, key, cols, ...])

Provides an overview of the dataset.

stats.covariance_matrix(data[, cols])

Computes the covariance matrix.

stats.pearsonr_matrix(data[, cols])

Computes a correlation matrix using Pearson's correlation coefficient.

stats.iqr(data, key[, col, multiplier])

Perform the inter-quartile range (IQR) test to find the outliers of the data.

stats.wilcoxon(data[, col, mu, test_type, ...])

Perform a one-sample or paired two-sample non-parametric test to check whether the median of the data is different from a specific value.

stats.median_test_1samp(data[, col, mu, ...])

Perform a one-sample non-parametric test to check whether the median of the data is different from a user-specified one.

stats.grubbs_test(data, key[, col, method, ...])

Perform Grubbs' test to detect outliers from a given univariate data set.

stats.entropy(data[, col, ...])

Calculate the information entropy of attributes.

stats.condition_index(data[, key, col, ...])

Detect collinearity problems among independent variables which are later used as predictors in a multiple linear regression model.

stats.cdf(data, distr_info[, col, complementary])

This algorithm evaluates the probability of a variable x from the cumulative distribution function (CDF) or complementary cumulative distribution function (CCDF) for a given probability distribution.

stats.ftest_equal_var(data_x, data_y[, ...])

Test the equality of two random variances using F-test.

stats.factor_analysis(data, key, factor_num)

Factor analysis is a statistical method that tries to extract a low number of unobserved variables, i.e. factors, that can best describe the covariance pattern of a larger set of observed variables.

stats.kaplan_meier_survival_analysis(data[, ...])

The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function from lifetime data.

stats.quantile(data, distr_info[, col, ...])

This algorithm evaluates the inverse of the cumulative distribution function (CDF) or the inverse of the complementary cumulative distribution function (CCDF) for a given probability p and probability distribution.

stats.distribution_fit(data, distr_type[, ...])

This algorithm aims to fit a probability distribution for a variable according to a series of measurements to the variable.

stats.ks_test(data[, distribution_name, ...])

This function performs one-sample or two-sample Kolmogorov-Smirnov test for goodness of fit.

stats.interval_quality(data, significance_level)

The function provides a method to evaluate the quality of interval forecasts.

kernel_density.KDE([thread_ratio, ...])

Perform kernel density estimation, an approach analogous to histograms that avoids their defects.
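
The random functions draw samples directly through a connection, and the stats functions operate on HANA DataFrame columns; a hedged sketch, assuming cc, a placeholder table MEASUREMENTS_TBL with numeric columns X1 and X2, and that the sample-count argument is named num_random.

    from hana_ml.algorithms.pal.random import normal
    from hana_ml.algorithms.pal.stats import ttest_ind

    # draw 100 samples from N(0, 1); num_random is assumed to be the count argument
    samples = normal(conn_context=cc, mean=0, sigma=1, num_random=100)

    # two-sample t-test on two numeric columns
    df = cc.table('MEASUREMENTS_TBL')
    stat_df = ttest_ind(data=df, col1='X1', col2='X2', var_equal=False)
    print(stat_df.collect())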

Association

association.Apriori(min_support, min_confidence)

Apriori is a classic algorithm used in machine learning for mining frequent itemsets and relevant association rules.

association.AprioriLite(min_support, ...[, ...])

This function runs a lightweight version of the Apriori algorithm for association rule mining.

association.FPGrowth([min_support, ...])

The Frequent Pattern Growth (FP-Growth) algorithm is a technique used for finding frequent patterns in a transaction dataset without generating a candidate itemset.

association.KORD([k, measure, min_support, ...])

The K-Optimal Rule Discovery (KORD) algorithm is a machine learning tool used for generating top-K association rules based on a user-defined measure.

association.SPM(min_support[, relational, ...])

The Sequential Pattern Mining (SPM) algorithm is a method in data mining developed to determine frequent patterns that occur in sequential data.
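
A hedged example of association rule mining with Apriori, assuming cc and a placeholder table TRANSACTIONS_TBL with one row per (transaction, item) pair.

    from hana_ml.algorithms.pal.association import Apriori

    df_trans = cc.table('TRANSACTIONS_TBL')   # assumed columns: CUSTOMER, ITEM

    ap = Apriori(min_support=0.1, min_confidence=0.3)
    ap.fit(data=df_trans, transaction='CUSTOMER', item='ITEM')
    print(ap.result_.collect())               # mined rules with support, confidence, lift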

Recommender System

recommender.ALS([random_state, max_iter, ...])

Alternating least squares (ALS) is a powerful matrix factorization algorithm for building both explicit and implicit feedback based recommender systems.

recommender.FRM([solver, factor_num, init, ...])

Factorized Polynomial Regression Models or Factorization Machines approach.

recommender.FFMClassifier([ordering, ...])

Field-Aware Factorization Machine with the task of classification.

recommender.FFMRegressor([ordering, ...])

Field-Aware Factorization Machine with the task of Regression.

recommender.FFMRanker([ordering, normalise, ...])

Field-Aware Factorization Machine with the task of ranking using ordinal regression.

recommender.MLPRecommender([batch_size, ...])

The Python interface for an MLP-based recommender system method in PAL.
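
A hedged sketch of training a collaborative-filtering model with ALS, assuming cc, a placeholder ratings table RATINGS_TBL (USER, MOVIE, RATING), and a placeholder PAIRS_TBL of user/item pairs to score; the scoring call illustrates the intent and its exact signature may differ by version.

    from hana_ml.algorithms.pal.recommender import ALS

    df_ratings = cc.table('RATINGS_TBL')      # assumed columns: USER, MOVIE, RATING

    als = ALS(factor_num=10, max_iter=20, random_state=1)
    als.fit(data=df_ratings, usr='USER', item='MOVIE', feedback='RATING')
    # score hypothetical user/item pairs
    pred = als.predict(data=cc.table('PAIRS_TBL'), key='ID',
                       usr='USER', item='MOVIE')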

Social Network Analysis

linkpred.LinkPrediction(method[, beta, ...])

Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked, which is helpful for predicting missing links (the higher the proximity score is, the more likely the two nodes are to be linked).

pagerank.PageRank([damping, max_iter, tol, ...])

A PageRank model.

Ranking

svm.SVRanking([c, kernel, degree, gamma, ...])

Support Vector Machines (SVMs) refer to a family of supervised learning models using the concept of support vector.

Miscellaneous

abc_analysis.abc_analysis(data[, key, ...])

ABC analysis is used to classify objects (such as customers, employees, or products) based on a particular measure (such as revenue or profit).

wst.weighted_score_table(data, maps, ...[, ...])

Performs the weighted score table calculation to weight the score by the importance of each criterion.

tsne.TSNE([n_iter, learning_rate, ...])

Class for T-distributed Stochastic Neighbour Embedding.

fair_ml.FairMLClassification([...])

FairMLClassification aims at mitigating unfairness of a prediction model caused by possible "bias" in the data set regarding features such as sex, race, and age.

fair_ml.FairMLRegression(fair_bound[, ...])

FairMLRegression aims at mitigating unfairness of a prediction model caused by possible "bias" in the data set regarding features such as sex, race, and age.

Metrics

metrics.accuracy_score(data, label_true, ...)

Compute mean accuracy score for classification results.

metrics.auc(data[, positive_label, ...])

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

metrics.confusion_matrix(data, key[, ...])

Computes confusion matrix to evaluate the accuracy of a classification.

metrics.multiclass_auc(data_original, ...)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

metrics.r2_score(data, label_true, label_pred)

Computes coefficient of determination for regression results.

metrics.binary_classification_debriefing(...)

Computes debriefing coefficients for binary classification results.
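
The metrics functions take a HANA DataFrame of true and predicted labels; a minimal sketch, assuming cc and a placeholder table PREDICTIONS_TBL with columns ID, ACTUAL, and PREDICTED (the return shape of confusion_matrix is assumed to be a pair of DataFrames).

    from hana_ml.algorithms.pal.metrics import accuracy_score, confusion_matrix

    df_pred = cc.table('PREDICTIONS_TBL')     # assumed columns: ID, ACTUAL, PREDICTED

    acc = accuracy_score(data=df_pred, label_true='ACTUAL', label_pred='PREDICTED')
    print('accuracy:', acc)

    # assumed to return the confusion matrix and a classification report
    cm, report = confusion_matrix(data=df_pred, key='ID',
                                  label_true='ACTUAL', label_pred='PREDICTED')
    print(cm.collect())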

Model and Pipeline

model_selection.ParamSearchCV(estimator, ...)

Exhaustive or random search over specified parameter values for an estimator with cross validation (CV).

model_selection.GridSearchCV(estimator, ...)

Exhaustive search over specified parameter values for an estimator with cross validation (CV).

model_selection.RandomSearchCV(estimator, ...)

Random search over specified parameter values for an estimator with cross validation (CV).

pipeline.Pipeline(steps)

Pipeline construction to run transformers and estimators sequentially.
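
A hedged sketch of cross-validated parameter search around a unified estimator, assuming cc and placeholder table and column names; the parameter grid and train_control settings are illustrative.

    from hana_ml.algorithms.pal.model_selection import GridSearchCV
    from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

    df_train = cc.table('TRAIN_TBL')          # assumed columns: ID, features, LABEL

    uc = UnifiedClassification(func='HybridGradientBoostingTree')
    gscv = GridSearchCV(estimator=uc,
                        param_grid={'learning_rate': [0.1, 0.4, 0.7],
                                    'n_estimators': [4, 6, 8],
                                    'split_threshold': [0.1, 0.4, 0.7]},
                        train_control={'fold_num': 5,
                                       'resampling_method': 'cv',
                                       'random_state': 1,
                                       'ref_metric': ['auc']},
                        scoring='error_rate')
    gscv.fit(data=df_train, key='ID', label='LABEL')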

Text Processing

crf.CRF([lamb, epsilon, max_iter, lbfgs_m, ...])

Conditional random field (CRF) for labeling and segmenting sequence data (e.g., text).

decomposition.LatentDirichletAllocation(...)

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
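
A hedged sketch of topic modeling with LatentDirichletAllocation, assuming cc and a placeholder table DOCS_TBL with a DOC_ID key and a TEXT column; the hyperparameter values are illustrative.

    from hana_ml.algorithms.pal.decomposition import LatentDirichletAllocation

    df_docs = cc.table('DOCS_TBL')            # assumed columns: DOC_ID, TEXT

    lda = LatentDirichletAllocation(n_components=6, doc_topic_prior=0.1,
                                    burn_in=50, iteration=100, seed=1)
    lda.fit(data=df_docs, key='DOC_ID', document='TEXT')
    # fitted results (e.g. the document-topic distribution) are exposed as
    # DataFrame attributes on the lda object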

For other text processing methods such as text mining, please see the text mining module for more details.